[Django]-Django/Apache freezing with mod_wsgi

4👍

Use:

http://code.google.com/p/modwsgi/wiki/DebuggingTechniques#Extracting_Python_Stack_Traces

to embed functionality that you can trigger at a time when you expect stuck requests, and find out what they are doing. Likely the requests are accumulating over time rather than all getting stuck at once, so you could do it periodically rather than wait for total failure.
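The wiki recipe boils down to walking every thread's current frame and formatting it as a traceback. A minimal sketch of that idea (the function name `dump_stacks` and the report format are mine, not mod_wsgi's; in practice you would wire this to a signal handler or a monitoring URL):

```python
import sys
import traceback

def dump_stacks():
    """Return a stack trace for every thread in this process.

    Useful for seeing exactly where stuck mod_wsgi request
    threads are blocked.
    """
    output = []
    for thread_id, frame in sys._current_frames().items():
        output.append("\n# Thread %d\n" % thread_id)
        # format_stack() accepts a frame and renders it like a traceback
        output.extend(traceback.format_stack(frame))
    return "".join(output)
```

Calling `dump_stacks()` from a signal handler while requests are wedged shows the exact line each thread is waiting on, which usually identifies the blocking backend call immediately.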

As a fail safe, you can add the option:

inactivity-timeout=600

to the WSGIDaemonProcess directive.

What this will do is restart the daemon-mode process if it is inactive for 10 minutes.
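Putting it together, the directive might look like this (the daemon group name and the processes/threads counts here are placeholders; only `inactivity-timeout` is the option being discussed):

```apache
WSGIDaemonProcess example processes=8 threads=15 inactivity-timeout=600
```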

Unfortunately, at the moment this restart is triggered in two scenarios.

The first is when there have been no requests at all for 10 minutes: the process will be restarted.

The second, and the one you want to kick in, is when all request threads are blocked: if within 10 minutes none of them has read any input from wsgi.input, nor yielded any response content, the process will again be restarted automatically.

This will at least mean your process should recover automatically and you will not be called out of bed. Because you are running so many processes, chances are that they will not all get stuck at the same time, so the restart shouldn’t be noticed: other processes will still handle new requests.

What you should work out is how low you can make that timeout. You don’t want it so low that processes restart merely because there were no requests at all, as that unloads the application; if lazy loading is being used, the next request will then incur the slowdown of loading it again.

What I should really do is add a new option, blocked-timeout, which specifically checks for all requests being blocked for the defined period, separating it from restarts due to there being no requests at all. That would be more flexible, as restarting due to no requests brings its own issues with having to load the application again.

Unfortunately one can’t easily implement a request-timeout which applies to a single request, because the hosting configuration could be multithreaded. Injecting a Python exception into a request will not necessarily unblock the thread, and ultimately you would have to kill the process anyway, interrupting the other concurrent requests. Thus blocked-timeout is probably better.

Another interesting thing to do might be for me to add stuff into mod_wsgi to report such forced restarts due to blocked processes into the New Relic agent. That would be really cool then as you would get visibility of them in the monitoring tool. 🙂

1👍

We had a similar problem at my work. Best we could ever figure out was race/deadlock issues with the app, causing mod_wsgi to get stuck. Usually killing one or more mod_wsgi processes would un-stick it for a while.

Best solution was to move to all-processes, no-threads. We confirmed with our dev teams that some of the Python libraries they were pulling in were likely not thread-safe.

Try:

WSGIDaemonProcess web1 user=web1 group=web1 processes=16 threads=1 maximum-requests=500 python-path=/home/web1/django_env/lib/python2.6/site-packages display-name=%{GROUP}

Downside is, processes suck up more memory than threads do. Consequently we usually end up with fewer overall workers (hence 16×1 instead of 8×15). And since mod_wsgi provides virtually nothing for reporting on how busy the workers are, you’re SOL apart from just blindly tuning how many you have.

Upside is, this problem never happens anymore and apps are completely reliable again.

Like with PHP, don’t use a threaded implementation unless you’re sure everything is thread-safe… that means the core (usually OK), the framework, your own code, and anything else you import. 🙂
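To make the thread-safety point concrete, here is a minimal Python sketch (mine, not from the answer) of the classic hazard in a threaded mod_wsgi process: module-level state mutated by concurrent request threads. Without the lock, the read-modify-write on `counter` can interleave between threads and lose updates; with it, the result is deterministic:

```python
import threading

# Module-level state shared by every request thread in a threaded
# mod_wsgi process -- the kind of thing that silently breaks when
# code assumes it is the only caller.
counter = 0
counter_lock = threading.Lock()

def handle_request():
    """Hypothetical request handler bumping shared state."""
    global counter
    with counter_lock:  # remove this lock and updates can be lost
        counter += 1

# Simulate 8 worker threads each serving 1000 requests.
threads = [
    threading.Thread(target=lambda: [handle_request() for _ in range(1000)])
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 8000 with the lock held around the update
```

A C extension or third-party library with this kind of unguarded internal state is exactly what forces the processes=N, threads=1 configuration above.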

👤jakem

0👍

If I’ve understood your problem properly, you may try the following options:

  • move URL fetching out of the request/response cycle (using e.g. celery);
  • increase the thread count (threads consume less memory than processes, so you can afford more workers to ride out such blocks);
  • decrease timeout for the urllib2.urlopen;
  • try gevent or eventlet (they may magically solve your problem, but can introduce other subtle issues)
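For the timeout suggestion, a sketch of capping how long a fetch can hold a request thread (the question's Python 2 uses `urllib2`; the same `timeout` parameter exists on `urllib.request.urlopen` in Python 3, so the import shim below covers both — the `fetch` helper and the specific timeout values are my own illustration):

```python
import socket

try:
    from urllib2 import urlopen          # Python 2, as in the question
except ImportError:
    from urllib.request import urlopen   # Python 3 equivalent

# Global fallback: applies to any socket the process opens, including
# ones a library creates without exposing a timeout option.
socket.setdefaulttimeout(10.0)

def fetch(url):
    """Fetch a URL, but give up after 5 seconds instead of blocking
    a mod_wsgi request thread indefinitely."""
    return urlopen(url, timeout=5).read()
```

Even with timeouts in place, moving the fetch out of the request/response cycle entirely (the celery suggestion) is the more robust fix, since a timed-out fetch still wastes several seconds of a worker's time.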

I don’t think this is a deployment issue; it is more of a code issue, and there is no Apache configuration that will solve it.