7👍
worker daemon managing the connection
Your picture currently looks like:
user -----------> webserver <--------[1]--> 3rd party DB
connection [1] is expensive.
You could solve this with:
user ----> webserver <---> task queue[1] <---> worker daemon <--[2]-> 3rd party DB
[1] the task queue can be Redis or RabbitMQ (e.g. driven by Celery).
[2] worker daemon keeps connection open.
A worker daemon would make the connection to the 3rd party database and keep the connection open. This means each request would not have to pay the connection cost. The task queue would handle the inter-process communication, dispatching work to the daemon, which runs the queries against the 3rd party DB. The webserver should be as light as possible in terms of processing and let the workers do the expensive tasks.
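To illustrate the shape of this (a minimal in-process sketch using the standard library rather than a real broker, with a hypothetical FakeConnection standing in for the expensive 3rd party connection):

```python
import queue
import threading

class FakeConnection:
    """Stand-in for the expensive 3rd party DB connection (hypothetical)."""
    def query(self, sql):
        return f"results for {sql!r}"

def worker(tasks, results):
    # The daemon connects ONCE, then serves every queued task over the
    # same open connection -- requests no longer pay the setup cost.
    conn = FakeConnection()  # expensive handshake would happen here, once
    while True:
        sql = tasks.get()
        if sql is None:  # sentinel: shut down
            break
        results.put(conn.query(sql))

tasks, results = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(tasks, results), daemon=True)
t.start()

# The webserver side just enqueues work and reads answers back.
tasks.put("SELECT 1")
tasks.put("SELECT 2")
print(results.get())  # results for 'SELECT 1'
print(results.get())  # results for 'SELECT 2'
tasks.put(None)
t.join()
```

In a real deployment the two queues would be replaced by the broker (Redis/RabbitMQ) and the worker would run as a separate daemon process.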
preloading with apache + modwsgi
You can actually preload your application and have the expensive connection established before the first request. This is done with the WSGIImportScript configuration directive. I don't remember off the top of my head whether a pre-load + forking configuration means each request will already have the connection opened and shared; but since you have most of the code, this could be an easy experiment.
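As a sketch (the script path and process-group name are placeholders), the directive looks like:

```apache
# httpd.conf -- run preload.py in the daemon process at startup,
# before the first request arrives
WSGIDaemonProcess myapp processes=1 threads=25
WSGIImportScript /path/to/preload.py process-group=myapp application-group=%{GLOBAL}
```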
preloading with uwsgi
uwsgi supports preloading too. This is done with the import directive.
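A hedged example (module names and paths are placeholders):

```ini
; uwsgi.ini
[uwsgi]
module = myapp.wsgi:application
processes = 4
; import this module in each worker at startup, so the expensive
; connection can be opened before any request is served
import = connect_strangedb
```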
3👍
As far as I can tell, you have ruled out most (all?) of the common solutions to this type of problem:
- Store connection in a dictionary … need N workers and can't guarantee which request goes to which worker
- Store data in cache … too much data
- Store connection info in cache … connection is not serialisable
As far as I can see there is really only one "meta" solution to this: use @Gahbu's suggestion of a dictionary and guarantee that the requests for a given user go to the same worker. I.e. figure out a way to map from the User object to a given worker the same way every time (maybe hash their name and MOD by the number of workers?).
This solution would not make the most of your N workers if the currently active Users all mapped to the same worker, but if all Users are equally likely to be active at the same time then the work should be equally spread. (If they are not all equally likely then the mapping may be able to account for that).
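The "hash and MOD" mapping might look like this (N_WORKERS and the use of MD5 are assumptions; any stable hash works, but note that Python's built-in hash() is randomized per process, so a deterministic digest is safer across workers):

```python
import hashlib

N_WORKERS = 4  # assumption: number of single-threaded workers

def worker_for(username):
    # Stable mapping: the same user always lands on the same worker,
    # so that worker's in-process connection dict always has their entry.
    digest = hashlib.md5(username.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % N_WORKERS

print(worker_for("bob"), worker_for("alice"))
```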
The two possible ways I can think of doing this would be:
1. Write a custom request allocator
I'm not really familiar with the Apache/WSGI interfacing land but … it might be possible to replace the component within your Apache server that dispatches the HTTP requests to the workers with some custom logic, such that it always dispatches to the same process.
2. Run a load-balancer/proxy in front of N single threaded workers
I'm not sure if you can use a ready-to-go package here or not, but the concept would be:
- Run a proxy that implements this "bind the User to an index" logic
- Have the proxy then forward the requests to one of N copies of your Apache/wsgi webserver which each has a single worker.
NB: This second idea I came across here: https://github.com/benoitc/gunicorn/issues/183
Summary
For both options the implementation in your existing application is pretty simple. Your application just changes to use a dictionary for storing the persistent connection (creating one if there isn't one already). Testing a single instance is the same in dev as in production. In production, the instances themselves are none the wiser that they are always asked about the same users.
I like Option 2 here for the following reasons:
- Maybe there is an existing server package that allows you to define this proxy trickiness
- If not, creating a custom proxy application to sit in front of your current application might not be too hard (especially considering the restrictions you are (already) under when the request reaches the strangedb service)
3👍
Rather than having multiple worker processes, you can use the WSGIDaemonProcess directive to have multiple worker threads which all run in a single process. That way, all the threads can share the same DB connection mapping.
With something like this in your Apache config…
# mydomain.com.conf
<VirtualHost *:80>
ServerName mydomain.com
ServerAdmin webmaster@mydomain.com
<Directory />
Require all granted
</Directory>
WSGIDaemonProcess myapp processes=1 threads=50 python-path=/path/to/django/root display-name=%{GROUP}
WSGIProcessGroup myapp
WSGIScriptAlias / /path/to/django/root/myapp/wsgi.py
</VirtualHost>
…you can then use something as simple as this in your Django app…
# views.py
import threading

from django.http import HttpResponse

# A global variable to hold the connection mappings
DB_CONNECTIONS = {}
# Guards the dict so two threads can't race to connect for the same user
DB_CONNECTIONS_LOCK = threading.Lock()

# Fake up this "strangedb" module
class strangedb(object):
    class connection(object):
        def query(self, *args):
            return 'Query results for %r' % args
    @classmethod
    def connect(cls, *args):
        return cls.connection()

# View for homepage
def home(request, username='bob'):
    # Remember thread ID
    thread_info = 'Thread ID = %r' % threading.get_ident()
    # Connect only if we're not already connected
    with DB_CONNECTIONS_LOCK:
        if username in DB_CONNECTIONS:
            strangedb_connection = DB_CONNECTIONS[username]
            db_info = 'We reused an existing connection for %r' % username
        else:
            strangedb_connection = strangedb.connect(username)
            DB_CONNECTIONS[username] = strangedb_connection
            db_info = 'We made a connection for %r' % username
    # Fake up some query
    results = strangedb_connection.query('SELECT * FROM my_table')
    # Fake up an HTTP response
    text = '%s\n%s\n%s\n' % (thread_info, db_info, results)
    return HttpResponse(text, content_type='text/plain')
…which, on the first hit, produces…
Thread ID = 140597557241600
We made a connection for 'bob'
Query results for 'SELECT * FROM my_table'
…and, on the second…
Thread ID = 140597145999104
We reused an existing connection for 'bob'
Query results for 'SELECT * FROM my_table'
Obviously, you'll need to add something to tear down the DB connections when they're no longer required, but it's tough to know the best way to do that without more info about how your app is supposed to work.
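One hedged possibility is an idle-timeout sweep; the timeout value and the close() call here are assumptions, since the "strangedb" API is unknown:

```python
import threading
import time

DB_CONNECTIONS = {}  # username -> (connection, last_used timestamp)
LOCK = threading.Lock()
IDLE_TIMEOUT = 300   # seconds; assumption, tune to your workload

def evict_idle_connections(now=None):
    """Drop (and close) connections unused for longer than IDLE_TIMEOUT."""
    now = now if now is not None else time.monotonic()
    with LOCK:
        for user in list(DB_CONNECTIONS):
            conn, last_used = DB_CONNECTIONS[user]
            if now - last_used > IDLE_TIMEOUT:
                # conn.close()  # hypothetical: whatever "strangedb" offers
                del DB_CONNECTIONS[user]

# Example: a connection last used 10 minutes ago gets evicted.
DB_CONNECTIONS["bob"] = (object(), time.monotonic() - 600)
evict_idle_connections()
print("bob" in DB_CONNECTIONS)  # False
```

The sweep could run from a background thread, or opportunistically at the start of each request.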
Update #1: Regarding I/O multiplexing vs multithreading
I worked with threads twice in my life and each time it was a nightmare. A lot of time was wasted on debugging non-reproducible problems. I think an event-driven and non-blocking I/O architecture might be more solid.
A solution using I/O multiplexing might be better, but would be more complex, and would also require your "strangedb" library to support it, i.e. it would have to be able to handle EAGAIN/EWOULDBLOCK and have the capacity to retry the system call when necessary.
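For illustration only (using a plain socket pair, not the "strangedb" library, whose API is unknown), this is what the EAGAIN/EWOULDBLOCK retry pattern looks like in Python, where those errnos surface as BlockingIOError:

```python
import select
import socket

def read_nonblocking(sock, nbytes, timeout=5.0):
    # Retry recv() until data arrives; on EAGAIN/EWOULDBLOCK
    # (BlockingIOError), wait for readability with select().
    sock.setblocking(False)
    while True:
        try:
            return sock.recv(nbytes)
        except (BlockingIOError, InterruptedError):
            ready, _, _ = select.select([sock], [], [], timeout)
            if not ready:
                raise TimeoutError("no data within timeout")

a, b = socket.socketpair()
b.sendall(b"hello")
data = read_nonblocking(a, 5)
print(data)  # b'hello'
```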
Multithreading in Python is far less dangerous than in most other languages, due to Python's GIL, which, in essence, makes all Python bytecode thread-safe.
In practice, threads only run concurrently when the underlying C code uses the Py_BEGIN_ALLOW_THREADS macro, which, with its counterpart, Py_END_ALLOW_THREADS, is typically wrapped around system calls and CPU-intensive operations.
The upside of this is that it's almost impossible to have a thread collision in Python code, although the downside is that it won't always make optimal use of multiple CPU cores on a single machine.
The reason I suggest the above solution is that it's relatively simple, and would require minimal code changes, but there may be a better option if you could elaborate more on your "strangedb" library. It seems rather odd to have a DB which requires a separate network connection per concurrent user.
Update #2: Regarding multiprocessing vs multithreading
…the GIL limitations around threading seem to be a bit of an issue. Isn't this one of the reasons why the trend is to use separate processes instead?
That's quite possibly the main reason why Python's multiprocessing module exists, i.e. to provide concurrent execution of Python bytecode across multiple CPU cores, although there is an undocumented ThreadPool class in that module, which uses threads rather than processes.
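For what it's worth, that class is importable today and mirrors the Pool interface; a tiny usage sketch:

```python
from multiprocessing.pool import ThreadPool

# Same interface as multiprocessing.Pool, but backed by threads, so it
# suits I/O-bound work (and, unlike Pool, tasks need not be picklable).
with ThreadPool(processes=4) as pool:
    squares = pool.map(lambda x: x * x, range(5))
print(squares)  # [0, 1, 4, 9, 16]
```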
The "GIL limitations" would certainly be problematic in cases where you really need to exploit every single CPU cycle on every CPU core, e.g. if you were writing a computer game which had to render 60 frames per second in high definition.
Most web-based services, however, are likely to spend most of their time waiting for something to happen, e.g. network I/O or disk I/O, which Python threads will allow to occur concurrently.
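A quick way to see this: ten threads each "waiting on I/O" (simulated here with sleep, which releases the GIL just like a blocking system call would) finish in roughly the time of one wait, not ten:

```python
import threading
import time

def io_task():
    time.sleep(0.2)  # simulates waiting on network/disk I/O; releases the GIL

start = time.monotonic()
threads = [threading.Thread(target=io_task) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start
# Ten 0.2 s waits overlap, so the total is roughly 0.2 s, not 2 s.
print(f"{elapsed:.2f}s")
```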
Ultimately, it's a trade-off between performance and maintainability, and given that hardware is usually much cheaper than a developer's time, favoring maintainability over performance is usually more cost-effective.
Frankly, the moment you decide to use a virtual machine language, such as Python, instead of a language which compiles into real machine code, such as C, you're already saying that you're prepared to sacrifice some performance in exchange for convenience.
See also The C10K problem for a comparison of techniques for scaling web-based services.
0👍
One simple way to do this would be to have another Python process manage the pool of persistent connections (one for each user, which can time out when needed). Then that process and Django can communicate with something fast like ZeroMQ (see: interprocess communication in Python).
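The idea can be sketched with the standard library alone (a Pipe here stands in for the ZeroMQ socket, and the manager runs as a thread purely so the demo is self-contained; the answer's suggestion is a separate process):

```python
import threading
from multiprocessing import Pipe  # Pipe also works between threads

def connection_manager(pipe):
    # Long-lived manager holding one persistent connection per user.
    connections = {}  # username -> fake connection object (hypothetical)
    while True:
        msg = pipe.recv()
        if msg is None:  # sentinel: shut down
            break
        user, sql = msg
        conn = connections.setdefault(user, f"conn-for-{user}")  # connect once per user
        pipe.send(f"{conn}: ran {sql!r}")

parent, child = Pipe()
t = threading.Thread(target=connection_manager, args=(child,), daemon=True)
t.start()
parent.send(("bob", "SELECT 1"))
reply = parent.recv()
print(reply)  # conn-for-bob: ran 'SELECT 1'
parent.send(None)
t.join()
```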