1👍
✅
I can think of a few optimisations:
- parse as you are reading the file from the stream
- use a SAX parser (which combines well with the point above; see the sketch right after this list)
- use HEAD requests to get the size of the images
- put your images in a queue, then use a few threads to connect and fetch the file sizes (see the sketch after the HEAD example below)
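A minimal sketch of the first two points, assuming Python 2, that the listing is well-formed XML (real HTML usually needs an HTML parser instead), and that image URLs sit in img/src attributes; the URL is hypothetical:

import urllib2
import xml.sax

class ImageCollector(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.images = []

    def startElement(self, name, attrs):
        # grab the src attribute of every <img> element as soon as it is seen
        if name == "img":
            src = attrs.get("src")
            if src:
                self.images.append(src)

handler = ImageCollector()
# urlopen returns a file-like object, so the parser consumes the response
# incrementally instead of loading the whole document into memory first
xml.sax.parse(urllib2.urlopen("http://m.onet.pl/some-feed.xml"), handler)
print handler.images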
Example of a HEAD request:
$ telnet m.onet.pl 80
Trying 213.180.150.45...
Connected to m.onet.pl.
Escape character is '^]'.
HEAD /_m/33fb7563935e11c0cba62f504d91675f,59,29,134-68-525-303-0.jpg HTTP/1.1
host: m.onet.pl
HTTP/1.0 200 OK
Server: nginx/0.8.53
Date: Sat, 09 Apr 2011 18:32:44 GMT
Content-Type: image/jpeg
Content-Length: 37545
Last-Modified: Sat, 09 Apr 2011 18:29:22 GMT
Expires: Sat, 16 Apr 2011 18:32:44 GMT
Cache-Control: max-age=604800
Accept-Ranges: bytes
Age: 6575
X-Cache: HIT from emka1.m10r2.onet
Via: 1.1 emka1.m10r2.onet:80 (squid)
Connection: close
Connection closed by foreign host.
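And a sketch of the last two points, assuming Python 2 (httplib, Queue, threading), the image URL from the transcript above, and an arbitrary pool of 4 worker threads; error handling is omitted:

import httplib
import Queue
import threading
import urlparse

def head_size(url):
    # issue a HEAD request and read Content-Length from the response headers
    parts = urlparse.urlparse(url)
    conn = httplib.HTTPConnection(parts.netloc)
    conn.request("HEAD", parts.path)
    length = conn.getresponse().getheader("Content-Length")
    conn.close()
    return int(length) if length is not None else None

def worker(jobs, sizes):
    # drain the queue until it is empty, storing one size per URL
    while True:
        try:
            url = jobs.get_nowait()
        except Queue.Empty:
            return
        sizes[url] = head_size(url)

jobs = Queue.Queue()
jobs.put("http://m.onet.pl/_m/33fb7563935e11c0cba62f504d91675f,59,29,134-68-525-303-0.jpg")
sizes = {}
threads = [threading.Thread(target=worker, args=(jobs, sizes)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print sizes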
1👍
You can use the headers attribute of the file-like object returned by urllib2.urlopen (I don't know about urllib).
Here’s a test I wrote for it. As you can see, it is rather fast, though I imagine some websites would block too many repeated requests.
|milo|laurie|¥ cat test.py
import urllib2

uri = "http://download.thinkbroadband.com/1GB.zip"

def get_file_size(uri):
    file = urllib2.urlopen(uri)
    # file.headers.headers is the list of raw header lines from the response
    content_header, = [header for header in file.headers.headers
                       if header.startswith("Content-Length")]
    _, str_length = content_header.split(':')
    length = int(str_length.strip())
    return length

if __name__ == "__main__":
    get_file_size(uri)
|milo|laurie|¥ time python2 test.py
python2 test.py 0.06s user 0.01s system 35% cpu 0.196 total
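For what it's worth, the same thing can be written a little more directly, still assuming Python 2's urllib2 as in the test above: the headers object exposes getheader, so the list comprehension and manual split are not needed.

import urllib2

def get_file_size(uri):
    # .headers is an httplib.HTTPMessage; getheader returns the value of the
    # named header, or None if the server did not send Content-Length
    response = urllib2.urlopen(uri)
    length = response.headers.getheader("Content-Length")
    return int(length) if length is not None else None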
Source: stackexchange.com