1 👍
Caveat: I have never used Amazon SQS, so there might be some SQS-specific details I'm missing, but the answer below holds true whatever the broker.
Actually there’s no way to guarantee a task will start executing “immediately”. First, there is some latency before the task reaches one of your workers, due to the network and to the broker’s own processing (whatever the broker). Then you may not have a worker immediately available. The fact that you got (almost) immediate execution with RabbitMQ is not conclusive by itself: if all your workers are busy, you’ll get the same behaviour whatever the broker. This is a fact; we have seen this behaviour with RabbitMQ, and sometimes still do when we get a huge traffic peak.
You can mitigate the issue with a combination of the following:
- if you have both long-running and short-running tasks, set up different queues and workers for each kind of task, so short-running tasks won’t wait for long-running ones to complete
- set your workers’ prefetch_count low or (if using the prefork pool) disable prefetching completely; this may seem counter-intuitive, but it ensures each task goes to the first available worker process, which gives slightly more predictable behaviour
- and of course the plain obvious solution: add as many celery workers as your celery server can handle, then add more celery servers.
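The first two points can be sketched as a celery configuration module. This is only a minimal sketch: the task names and queue names are made up for the example, and the settings use the old uppercase naming style (newer celery versions spell them `task_routes`, `worker_prefetch_multiplier` and `task_acks_late`).

```python
# celeryconfig.py (sketch; task and queue names are hypothetical)

# Send short and long tasks to separate queues, so that a worker
# dedicated to the "short" queue is never stuck behind a long task.
CELERY_ROUTES = {
    "tasks.send_notification": {"queue": "short"},
    "tasks.build_report": {"queue": "long"},
}

# Each worker process reserves at most one message at a time.
# 1 is the lowest useful value; 0 means "no limit".
CELERYD_PREFETCH_MULTIPLIER = 1

# Acknowledge a message only once its task has finished, so a busy
# process does not hold on to an extra reserved message.
# Assumption: the tasks are safe to run again if a worker dies mid-task.
CELERY_ACKS_LATE = True
```

You would then start one worker per queue, for example `celery -A proj worker -Q short` and `celery -A proj worker -Q long`, where `proj` stands for your own app module. With the prefork pool, adding `-O fair` should also keep a task from being handed to a process that is still busy.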
Now you might also have an issue with Amazon SQS (configuration, extra network latency or whatever), which might be fixable or not. As I said, I don’t have any experience here, so I’ll let someone more knowledgeable chime in on this point. But the fact is that even with the best possible setup you will still be restricted by the ratio of concurrent tasks to available workers, unless you can guarantee you always have more available workers than concurrent tasks.
Edit: If all your workers are idle when you submit a task, then the only delay is the caller -> broker -> worker chain (no “wait for a free worker” time). Some network latency overhead is to be expected when using a remote broker (compared to a local RabbitMQ instance), but “5 to 10s” seems a bit much for the network part, so SQS might indeed be the culprit. You may want to have a look here and here for similar issues.
1 👍
Came across this in the documentation: long polling is enabled by default and is set to poll every 10 seconds. You may need to adjust the value in BROKER_TRANSPORT_OPTIONS accordingly.
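For instance, something along these lines. This is a sketch, not a tested setup: the numbers are purely illustrative, and I'm assuming the `wait_time_seconds` and `polling_interval` transport options described in the celery SQS docs are the ones that matter here.

```python
# celeryconfig.py (sketch; values are illustrative, not recommendations)

# Credentials are assumed to come from the environment or an IAM role.
BROKER_URL = "sqs://"

BROKER_TRANSPORT_OPTIONS = {
    # How long a single long-poll request waits for a message (seconds).
    # The documented default is 10.
    "wait_time_seconds": 5,
    # How long the worker sleeps between unsuccessful polls (seconds).
    "polling_interval": 0.5,
}
```

Keep in mind that polling more often means more SQS requests, so there is a cost trade-off to measure against the latency you gain.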