2👍
Here’s what you have to do:
- Create a django website that takes data you want to display from a
database (can be sqlite) and shows it in desired format - Create a crawling script
-
Add a view that renders a form (your desired url input functionality) and starts the script. There are actually two was you
can go about it:3.1 Have the script start in the main job – freezes your site for the
user until crawling is done, but is easier to do3.2 Have it schedule a crawling job via celery or cron – all-round
better, doesn’t freeze anything, allows more flexibility, allows to
see current progress and so on, but required to set up a job queue
and generally harder to do for the first time. -
Make your script put scraped urls and required info to the same databse django is taking data from.
Now for dynamic progress display, I am by no means specialist, but I see some ways:
- Have script keep a log of events (can do it via a django model, so that events are stored in db) (e.g. “parsed url http://foo.bar“), have a page that displays events for a certain job.
- Make the whole interactive crawling process a separate application that runs an async server that sends feedback. For example do it via websockets: django serves a js file. In the js file the application connects to a websocket application (preferrably running from the same host as django) that’s doing the crawling and reporting progress over websockets. Mind you this is tricky to set up, but possible.
- You could have django display stuff from log files, but I think it could get tricky easily.
For dynamic progress display you will still need some kind of async eiher way. You could do it with long polling: have a js script on django side poll django via AJAX GET for new info to display every second or so. This technique is falling out of fashion lately (because it spams the server with expensive requests), but it still works and is quite simple.
I think the best option is to have celery jobs put crawled data and logs in a database, have django show logs and data to user and accept user input.