I believe the problem you're experiencing is related to CUDA contexts. As of CUDA 4.0, a CUDA context is required per process and per device.
Behind the scenes, Celery spawns processes for the task workers. When a worker process starts, it has no CUDA context available. In pyCUDA, context creation happens in the pycuda.autoinit module. That's why your code works if you run it standalone (no extra process is created, so the context is valid) or if you put import pycuda.autoinit inside the CUDA task (now the process running the task has a context; I believe you tried that already).
If you want to avoid the import, you may be able to use make_default_context from pycuda.tools, although I'm not very familiar with pyCUDA and how it handles context management:
import pycuda.driver as cuda
from pycuda.tools import make_default_context

@task()
def photo_function(photo_id, ...):
    cuda.init()                   # initialize the driver in this process
    ctx = make_default_context()  # create a context for this process
    try:
        print('Got photo...')
        # ... do some stuff ...
        result = do_photo_manipulation(photo_id)
        return result
    finally:
        ctx.pop()                 # release the context when the task ends
Beware that context creation is expensive. CUDA deliberately front-loads a lot of work into context creation in order to avoid unexpected delays later on. That's why there is a stack of contexts that you can push/pop between host threads (but not between processes). If your kernel code is very fast, you may experience noticeable delays because of the context create/destroy procedure.
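One way around that cost is to pay for the expensive setup once per worker process and reuse it across tasks, rather than creating and destroying it in every task. Here is a minimal stdlib-only sketch of that per-process initialization pattern using multiprocessing; the names (_ctx, init_worker, photo_task) are illustrative stand-ins, not pyCUDA or Celery APIs, and in real code the initializer would create the CUDA context:

```python
import multiprocessing as mp
import os

# Hypothetical stand-in for an expensive per-process resource such as a
# CUDA context; in real code init_worker would create the context here.
_ctx = None

def init_worker():
    # Runs once in each worker process, so the expensive setup is paid
    # once per process and amortized over many tasks.
    global _ctx
    _ctx = "ctx-%d" % os.getpid()

def photo_task(photo_id):
    # Each task reuses the resource created in its own process.
    return (_ctx is not None, photo_id * 2)

def run_tasks(ids):
    # "fork" keeps worker setup simple when run directly on Linux.
    with mp.get_context("fork").Pool(2, initializer=init_worker) as pool:
        return pool.map(photo_task, ids)
```

The same idea applies to Celery workers: hook the worker's process startup to create the context once, instead of calling make_default_context inside every task.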