Running multiple web crawlers at the same time in Django

1👍

You should be able to do this with gevent, something like:

import gevent
from django.core.management.base import BaseCommand

# Assumes PexelCrawler and MagdeleineCrawler are defined somewhere in
# your project, e.g. from crawlers import PexelCrawler, MagdeleineCrawler

class Command(BaseCommand):

    def handle(self, *args, **options):
        pexel_crawler = PexelCrawler()
        magdeleine_crawler = MagdeleineCrawler()
        pexel_job = gevent.spawn(pexel_crawler.crawl)
        magdeleine_job = gevent.spawn(magdeleine_crawler.crawl)
        gevent.joinall([pexel_job, magdeleine_job])

I believe that will work, and it will keep the management command running in the foreground for as long as both crawlers are running. Be careful, though: if this works as expected, it is truly an infinite loop and will never stop.
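For illustration only, the same spawn/join pattern can be mimicked with the standard library's threading module. The DummyCrawler class below is a stand-in for the real PexelCrawler/MagdeleineCrawler (which are not shown in the question), with a finite crawl loop instead of an infinite one:

```python
import threading

class DummyCrawler:
    """Stand-in for a real crawler class; records the pages it 'fetched'."""
    def __init__(self, name):
        self.name = name
        self.pages = []

    def crawl(self):
        # A real crawler would loop indefinitely fetching pages;
        # here we simulate just three iterations.
        for i in range(3):
            self.pages.append("%s-page-%d" % (self.name, i))

pexel = DummyCrawler("pexel")
magdeleine = DummyCrawler("magdeleine")

# Same shape as gevent.spawn(...) followed by gevent.joinall([...]):
jobs = [threading.Thread(target=c.crawl) for c in (pexel, magdeleine)]
for job in jobs:
    job.start()
for job in jobs:
    job.join()

print(len(pexel.pages) + len(magdeleine.pages))  # 6
```

The difference is that gevent greenlets cooperate within one thread (yielding on I/O), whereas threads here are scheduled by the OS; for I/O-bound crawlers the gevent version is typically lighter-weight.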

1👍

I suggest you use Celery for this task.
A crawl operation can take a long time. Invoking it from the command line is fine while you are controlling the task yourself, but in production it will be triggered from cron, a view, etc., so it is better to have control over the task life cycle.

Install Celery and the Django integration package django-celery (imported as djcelery):

pip install celery
pip install django-celery

For the message broker I suggest installing RabbitMQ:

apt-get install rabbitmq-server

In the settings.py of your Django project add (and make sure 'djcelery' is also listed in INSTALLED_APPS):

import djcelery

djcelery.setup_loader()

CELERYBEAT_SCHEDULER = 'djcelery.schedulers.DatabaseScheduler'  # To run crawls on a schedule.
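As an alternative to the database scheduler, the beat schedule can be defined statically in settings.py. A minimal sketch, assuming the task lives at the import path tasks.run_task (adjust to your project layout) and should run every 30 minutes:

```python
# settings.py: a static celerybeat schedule instead of the
# DatabaseScheduler. The task path 'tasks.run_task' is an assumption
# based on the tasks.py shown below; change it to match your project.
from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    'run-crawlers-every-30-minutes': {
        'task': 'tasks.run_task',
        'schedule': timedelta(minutes=30),
    },
}
```

The database scheduler is more convenient when you want to edit the schedule from the Django admin at runtime; the static dict is simpler when the schedule rarely changes.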

Create a file tasks.py in your project and put this code in it:

from __future__ import absolute_import, print_function

from celery import shared_task
from django.core.management import call_command

@shared_task
def run_task():
    print(call_command('your_management_command', verbosity=3, interactive=False))

To monitor your tasks, install flower (it is a Python package, so install it with pip, not apt-get):

pip install flower

Now run everything.

First start the RabbitMQ server:

service rabbitmq-server start  

Then start a Celery worker (with django-celery you can run it through manage.py):

python manage.py celery worker --loglevel=info

And then flower to monitor the execution of your tasks:

celery flower

That’s it: you can now run your crawler tasks without any trouble.

👤Darius
