1👍
You can process your urls in batches by only queueing up a few at a time whenever the spider goes idle. This avoids keeping a large number of requests queued up in memory. The example below reads the next batch of urls from your database/file and queues them as requests only after all the previous requests have finished processing.
More info on the spider_idle signal: http://doc.scrapy.org/en/latest/topics/signals.html#spider-idle
More info on debugging memory leaks: http://doc.scrapy.org/en/latest/topics/leaks.html
from scrapy import signals, Spider
from scrapy.xlib.pydispatch import dispatcher


class ExampleSpider(Spider):
    name = "example"
    start_urls = ['http://www.example.com/']

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        # connect the function to the spider_idle signal
        dispatcher.connect(self.queue_more_requests, signals.spider_idle)

    def queue_more_requests(self, spider):
        # this function runs every time the spider is done processing
        # all requests/items (i.e. idle)

        # get the next batch of urls from your database/file
        urls = self.get_urls_from_somewhere()

        # if there are no more urls to process, do nothing and the
        # spider will finally close
        if not urls:
            return

        # iterate through the urls, create a request for each, then send
        # them back to the crawler; this gets the spider out of its idle state
        for url in urls:
            req = self.make_requests_from_url(url)
            self.crawler.engine.crawl(req, spider)

    def parse(self, response):
        pass
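The code above leaves get_urls_from_somewhere() undefined. A minimal sketch of one possible implementation, meant to live on ExampleSpider and assuming the pending urls sit in a plain text file (the urls.txt filename and the batch size of 100 are made up for illustration):

    def get_urls_from_somewhere(self, batch_size=100):
        # load the remaining urls once and cache them on the spider instance
        # (assumes a hypothetical "urls.txt" with one url per line)
        if not hasattr(self, '_pending_urls'):
            with open('urls.txt') as f:
                self._pending_urls = [line.strip() for line in f if line.strip()]
        # hand back the next batch; an empty list lets the spider close
        batch = self._pending_urls[:batch_size]
        self._pending_urls = self._pending_urls[batch_size:]
        return batch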
1👍
A crawl that recurses through links across the whole Internet will never terminate, so you will need to limit the recursion in one way or another. Unfortunately the part of the code where you would do this is not shown. The easiest way would be to give the list of pending links to crawl a fixed maximum size and simply not add any more until the list drops below that cap. A more advanced solution would assign each pending link a priority based on its surrounding context in the parent page, and then do sorted inserts into a fixed-maximum-size priority list of pending links.
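A rough sketch of that fixed-maximum-size priority list using Python's heapq (the cap, the scoring, and the function name are all illustrative, not part of the original answer):

import heapq

MAX_PENDING = 1000  # arbitrary cap on how many pending links to keep in memory

def add_pending_link(pending, url, score):
    # `pending` is a min-heap of (score, url) pairs, so the lowest-priority
    # link always sits at pending[0] and is the first to be evicted
    if len(pending) < MAX_PENDING:
        heapq.heappush(pending, (score, url))
    elif score > pending[0][0]:
        # the new link outranks the current worst pending link: swap it in
        heapq.heapreplace(pending, (score, url))

pending = []
# the score would come from the link's surrounding context in the parent page
add_pending_link(pending, 'http://www.example.com/some-page', score=0.7)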
Instead of trying to edit or hack the existing code, however, you should check whether the built-in settings can accomplish what you want. See this doc page for reference: http://doc.scrapy.org/en/latest/topics/settings.html. It looks like the DEPTH_LIMIT setting, with a value of 1 or more, would limit the depth of recursion from the starting pages.
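For example, DEPTH_LIMIT can be set project-wide in settings.py, or per spider through the custom_settings class attribute (the spider below is just a placeholder):

# settings.py -- follow links at most one level deep from the start urls
DEPTH_LIMIT = 1

# or per spider, via custom_settings:
from scrapy import Spider

class LimitedSpider(Spider):
    name = "limited"
    start_urls = ['http://www.example.com/']
    custom_settings = {'DEPTH_LIMIT': 1}

    def parse(self, response):
        pass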