I think the main misconception is the package path vs. the settings module path. In order to use Django's models from an external script, you need to set the `DJANGO_SETTINGS_MODULE` environment variable. Then, this module has to be importable (i.e. if the settings path is `myproject.settings`, then the statement `from myproject import settings` should work in a Python shell).
As most Django projects are created in a path outside the default `PYTHONPATH`, you must add the project's path to the `PYTHONPATH` environment variable.
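Both conditions can be sanity-checked from a plain Python script before involving Scrapy at all. This is just a sketch; the path and module name follow the example layout used in this guide and must be replaced with your own:

```python
import os
import sys

# Condition 1: the Django project directory must be on the import path
# (this is what extending PYTHONPATH accomplishes).
sys.path.insert(0, '/home/rolando/projects/myweb')

# Condition 2: Django must know which settings module to load.
os.environ['DJANGO_SETTINGS_MODULE'] = 'myweb.settings'

# If the path above is correct, importing the settings module now works:
#     import importlib
#     settings = importlib.import_module(os.environ['DJANGO_SETTINGS_MODULE'])
# An ImportError here means the PYTHONPATH entry is wrong.
```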
Here is a step-by-step guide to create a fully working (and minimal) Django models integration into a Scrapy project:
Note: These instructions worked at the date of the last edit. If they don't work for you, please add a comment describing your issue and your Scrapy/Django versions.
- The projects will be created within the `/home/rolando/projects` directory.

- Start the Django project:

  ```shell
  $ cd ~/projects
  $ django-admin startproject myweb
  $ cd myweb
  $ ./manage.py startapp myapp
  ```
- Create a model in `myapp/models.py`:

  ```python
  from django.db import models


  class Person(models.Model):
      name = models.CharField(max_length=32)
  ```
- Add `myapp` to `INSTALLED_APPS` in `myweb/settings.py`:

  ```python
  # at the end of settings.py
  INSTALLED_APPS += ('myapp',)
  ```
- Set the database settings in `myweb/settings.py`:

  ```python
  # at the end of settings.py
  DATABASES['default']['ENGINE'] = 'django.db.backends.sqlite3'
  DATABASES['default']['NAME'] = '/tmp/myweb.db'
  ```
- Create the database:

  ```shell
  $ ./manage.py syncdb --noinput
  Creating tables ...
  Installing custom SQL ...
  Installing indexes ...
  Installed 0 object(s) from 0 fixture(s)
  ```
- Create the Scrapy project:

  ```shell
  $ cd ~/projects
  $ scrapy startproject mybot
  $ cd mybot
  ```
- Create an item in `mybot/items.py`:

  Note: In newer versions of Scrapy, you need to install `scrapy_djangoitem` and use `from scrapy_djangoitem import DjangoItem`.

  ```python
  from scrapy.contrib.djangoitem import DjangoItem
  from scrapy.item import Field

  from myapp.models import Person


  class PersonItem(DjangoItem):
      # fields for this item are automatically created from the django model
      django_model = Person
  ```
The final directory structure is this:

```
/home/rolando/projects
├── mybot
│   ├── mybot
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       └── __init__.py
│   └── scrapy.cfg
└── myweb
    ├── manage.py
    ├── myapp
    │   ├── __init__.py
    │   ├── models.py
    │   ├── tests.py
    │   └── views.py
    └── myweb
        ├── __init__.py
        ├── settings.py
        ├── urls.py
        └── wsgi.py
```
From here, we are basically done with the code required to use the Django models in a Scrapy project. We can test it right away using the `scrapy shell` command, but be aware of the required environment variables:
```shell
$ cd ~/projects/mybot
$ PYTHONPATH=~/projects/myweb DJANGO_SETTINGS_MODULE=myweb.settings scrapy shell
# ... scrapy banner, debug messages, python banner, etc.

In [1]: from mybot.items import PersonItem

In [2]: i = PersonItem(name='rolando')

In [3]: i.save()
Out[3]: <Person: Person object>

In [4]: PersonItem.django_model.objects.get(name='rolando')
Out[4]: <Person: Person object>
```
So, it is working as intended.
Finally, you might not want to have to set the environment variables each time you run your bot. There are many alternatives to address this issue, although the best one is to actually install the projects' packages in a path included in `PYTHONPATH`.
This is one of the simplest solutions: add these lines to your `mybot/settings.py` file to set up the environment variables.
```python
# Setting up django's project full path.
import sys
sys.path.insert(0, '/home/rolando/projects/myweb')

# Setting up django's settings module name.
# This module is located at /home/rolando/projects/myweb/myweb/settings.py.
import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'myweb.settings'

# Since Django 1.7, a setup() call is required to populate the apps registry.
import django
django.setup()
```
Note: A better approach than this path hacking is to have setuptools-based `setup.py` files in both projects and run `python setup.py develop`, which will link each project path into Python's path (I'm assuming you use virtualenv).
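For reference, such a `setup.py` can be as small as the sketch below; the package name and version are placeholders and this is just one way to write it, not part of the original answer:

```python
# file: myweb/setup.py -- minimal setuptools file so that
# `python setup.py develop` links the project into the (virtualenv) python path.
from setuptools import setup, find_packages

setup(
    name='myweb',       # placeholder project name
    version='0.1',      # placeholder version
    packages=find_packages(),
)
```

Run `python setup.py develop` once in each project directory, and the `PYTHONPATH` juggling goes away.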
That is enough. For completeness, here is a basic spider and pipeline for a fully working project:
- Create the spider:

  ```shell
  $ cd ~/projects/mybot
  $ scrapy genspider -t basic example example.com
  ```

  The spider code:

  ```python
  # file: mybot/spiders/example.py
  from scrapy.spider import BaseSpider

  from mybot.items import PersonItem


  class ExampleSpider(BaseSpider):
      name = "example"
      allowed_domains = ["example.com"]
      start_urls = ['http://www.example.com/']

      def parse(self, response):
          # do stuff
          return PersonItem(name='rolando')
  ```
- Create a pipeline in `mybot/pipelines.py` to save the item:

  ```python
  class MybotPipeline(object):
      def process_item(self, item, spider):
          item.save()
          return item
  ```

  Here you can either use `item.save()` if you are using the `DjangoItem` class, or import the Django model directly and create the object manually. Either way, the main issue is to define the environment variables so you can use the Django models.
Add the pipeline setting to your
mybot/settings.py
file.ITEM_PIPELINES = { 'mybot.pipelines.MybotPipeline': 1000, }
- Run the spider:

  ```shell
  $ scrapy crawl example
  ```
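As noted in the pipeline step, you can also bypass `DjangoItem` and talk to the Django model directly. Here is a sketch of that variant; `DirectModelPipeline` is a made-up name, and it assumes the same `Person` model and the environment bootstrap in `mybot/settings.py` shown earlier:

```python
# file: mybot/pipelines.py -- alternative pipeline that saves through
# the Django model itself instead of calling item.save() on a DjangoItem.
class DirectModelPipeline(object):
    def process_item(self, item, spider):
        # Imported lazily so Django settings are configured before models load.
        from myapp.models import Person
        person = Person(name=item['name'])
        person.save()
        return item
```

With this approach the item can be a plain `scrapy.item.Item`; only the pipeline knows about the ORM.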
Even though Rho's answer seems very good, I thought I'd share how I got Scrapy working with Django models (aka the Django ORM) without a full-blown Django project, since the question only mentions the use of a "Django database". Also, I do not use DjangoItem.
The following works with Scrapy 0.18.2 and Django 1.5.2. My Scrapy project is called `scrapping` in the following.
- Add the following to your Scrapy `settings.py` file:

  ```python
  from django.conf import settings as d_settings
  d_settings.configure(
      DATABASES={
          'default': {
              'ENGINE': 'django.db.backends.postgresql_psycopg2',
              'NAME': 'db_name',
              'USER': 'db_user',
              'PASSWORD': 'my_password',
              'HOST': 'localhost',
              'PORT': '',
          }
      },
      INSTALLED_APPS=(
          'scrapping',
      ),
  )
  ```
- Create a `manage.py` file in the same folder as your `scrapy.cfg`:

  This file is not needed when you run the spider itself, but it is super convenient for setting up the database. So here we go:

  ```python
  #!/usr/bin/env python
  import os
  import sys

  if __name__ == "__main__":
      os.environ.setdefault("DJANGO_SETTINGS_MODULE", "scrapping.settings")

      from django.core.management import execute_from_command_line

      execute_from_command_line(sys.argv)
  ```

  That's the entire content of `manage.py`, and it is pretty much exactly the stock `manage.py` file you get after running `django-admin startproject myweb`, except that the `DJANGO_SETTINGS_MODULE` line points to your Scrapy settings file. Admittedly, using `DJANGO_SETTINGS_MODULE` and `settings.configure` together seems a bit odd, but it works for the one `manage.py` command I need: `$ python ./manage.py syncdb`.
- Your `models.py`:

  Your `models.py` should be placed in your Scrapy project folder, so that it is importable as `scrapping.models`. After creating that file, you should be able to run `$ python ./manage.py syncdb`. It may look like this:

  ```python
  from django.db import models


  class MyModel(models.Model):
      title = models.CharField(max_length=255)
      description = models.TextField()
      url = models.URLField(max_length=255, unique=True)
  ```
- Your `items.py` and `pipelines.py`:

  I used to use DjangoItem as described in Rho's answer, but I ran into trouble with it when running many crawls in parallel with scrapyd and using PostgreSQL. The exception `max_locks_per_transaction` was thrown at some point, breaking all the running crawls. Furthermore, I did not figure out how to properly roll back a failed `item.save()` in the pipeline. Long story short, I ended up not using DjangoItem at all, which solved all my problems. Here is how:

  `items.py`:

  ```python
  from scrapy.item import Item, Field


  class MyItem(Item):
      title = Field()
      description = Field()
      url = Field()
  ```

  Note that the fields need to have the same name as in the model if you want to unpack them conveniently as in the next step!

  `pipelines.py`:

  ```python
  from django.db import transaction

  from models import MyModel


  class Django_pipeline(object):
      def process_item(self, item, spider):
          with transaction.commit_on_success():
              scraps = MyModel(**item)
              scraps.save()
          return item
  ```
As mentioned above, if you named all your item fields like you did in your `models.py` file, you can use `**item` to unpack all the fields when creating your `MyModel` object.
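One caveat with this pipeline: since `url` is declared `unique=True`, re-scraping a page makes `MyModel(**item).save()` raise an `IntegrityError`. A sketch of an idempotent variant using `get_or_create` (this is my addition, not part of the original answer; `DedupingPipeline` is a made-up name):

```python
# pipelines.py variant: skip rows whose unique `url` already exists.
class DedupingPipeline(object):
    def process_item(self, item, spider):
        from models import MyModel  # same module as in pipelines.py above
        fields = dict(item)
        obj, created = MyModel.objects.get_or_create(
            url=fields.pop('url'),
            defaults=fields,  # title/description only used on first insert
        )
        return item
```

`get_or_create` looks the row up by `url` and only applies `defaults` when inserting, so duplicate crawls become no-ops instead of errors.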
That's it!