128👍
Nate C was close, but not quite.
From the docs:
You can evaluate a QuerySet in the following ways:
Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:
for e in Entry.objects.all(): print e.headline
So your ten million rows are retrieved, all at once, when you first enter that loop and get the iterating form of the queryset. The wait you experience is Django loading the database rows and creating objects for each one, before returning something you can actually iterate over. Then you have everything in memory, and the results come spilling out.
From my reading of the docs, iterator()
does nothing more than bypass QuerySet’s internal caching mechanisms. I think it might make sense for it to a do a one-by-one thing, but that would conversely require ten-million individual hits on your database. Maybe not all that desirable.
Iterating over large datasets efficiently is something we still haven’t gotten quite right, but there are some snippets out there you might find useful for your purposes:
58👍
Might not be the faster or most efficient, but as a ready-made solution why not use django core’s Paginator and Page objects documented here:
https://docs.djangoproject.com/en/dev/topics/pagination/
Something like this:
from django.core.paginator import Paginator
from djangoapp.models import model
paginator = Paginator(model.objects.all(), 1000) # chunks of 1000, you can
# change this to desired chunk size
for page in range(1, paginator.num_pages + 1):
for row in paginator.page(page).object_list:
# here you can do whatever you want with the row
print "done processing page %s" % page
- [Django]-Auth.User.groups: (fields.E304) Reverse accessor for 'User.groups' clashes with reverse accessor for 'UserManage.groups'
- [Django]-Troubleshooting Site Slowness on a Nginx + Gunicorn + Django Stack
- [Django]-Django – No module named _sqlite3
42👍
Django’s default behavior is to cache the whole result of the QuerySet when it evaluates the query. You can use the QuerySet’s iterator method to avoid this caching:
for event in Event.objects.all().iterator():
print event
https://docs.djangoproject.com/en/stable/ref/models/querysets/#iterator
The iterator() method evaluates the queryset and then reads the results directly without doing caching at the QuerySet level. This method results in better performance and a significant reduction in memory when iterating over a large number of objects that you only need to access once. Note that caching is still done at the database level.
Using iterator() reduces memory usage for me, but it is still higher than I expected. Using the paginator approach suggested by mpaf uses much less memory, but is 2-3x slower for my test case.
from django.core.paginator import Paginator
def chunked_iterator(queryset, chunk_size=10000):
paginator = Paginator(queryset, chunk_size)
for page in range(1, paginator.num_pages + 1):
for obj in paginator.page(page).object_list:
yield obj
for event in chunked_iterator(Event.objects.all()):
print event
- [Django]-What's the best solution for OpenID with Django?
- [Django]-Django Cannot set values on a ManyToManyField which specifies an intermediary model. Use Manager instead
- [Django]-Django: Multiple forms possible when using FormView?
9👍
For large amounts of records, a database cursor performs even better. You do need raw SQL in Django, the Django-cursor is something different than a SQL cursur.
The LIMIT – OFFSET method suggested by Nate C might be good enough for your situation. For large amounts of data it is slower than a cursor because it has to run the same query over and over again and has to jump over more and more results.
- [Django]-How does Django's nested Meta class work?
- [Django]-How to upload a file in Django?
- [Django]-New url format in Django 1.9
9👍
Django doesn’t have good solution for fetching large items from database.
import gc
# Get the events in reverse order
eids = Event.objects.order_by("-id").values_list("id", flat=True)
for index, eid in enumerate(eids):
event = Event.object.get(id=eid)
# do necessary work with event
if index % 100 == 0:
gc.collect()
print("completed 100 items")
values_list can be used to fetch all the ids in the databases and then fetch each object separately. Over a time large objects will be created in memory and won’t be garbage collected til for loop is exited. Above code does manual garbage collection after every 100th item is consumed.
- [Django]-<Django object > is not JSON serializable
- [Django]-Using Cloudfront with Django S3Boto
- [Django]-A field with precision 10, scale 2 must round to an absolute value less than 10^8
8👍
This is from the docs:
http://docs.djangoproject.com/en/dev/ref/models/querysets/
No database activity actually occurs until you do something to evaluate the queryset.
So when the print event
is run the query fires (which is a full table scan according to your command.) and loads the results. Your asking for all the objects and there is no way to get the first object without getting all of them.
But if you do something like:
Event.objects.all()[300:900]
http://docs.djangoproject.com/en/dev/topics/db/queries/#limiting-querysets
Then it will add offsets and limits to the sql internally.
- [Django]-How to send email via Django?
- [Django]-Django migration strategy for renaming a model and relationship fields
- [Django]-What is "load url from future" in Django
8👍
There are a lot of outdated results here. Not sure when it was added, but Django’s QuerySet.iterator()
method uses a server-side cursor with a chunk size, to stream results from the database. So if you’re using postgres, this should now be handled out of the box for you.
- [Django]-How to set environment variables in PyCharm?
- [Django]-What's the difference between CharField and TextField in Django?
- [Django]-Rendering a template variable as HTML
7👍
Massive amount of memory gets consumed before the queryset can be iterated because all database rows for a whole query get processed into objects at once and it can be a lot of processing depending on a number of rows.
You can chunk up your queryset into smaller digestible bits. I call the pattern to do this "spoonfeeding". Here’s an implementation with a progress-bar I use in my management commands, first pip3 install tqdm
from tqdm import tqdm
def spoonfeed(qs, func, chunk=1000, start=0):
"""
Chunk up a large queryset and run func on each item.
Works with automatic primary key fields.
chunk -- how many objects to take on at once
start -- PK to start from
>>> spoonfeed(Spam.objects.all(), nom_nom)
"""
end = qs.order_by('pk').last()
progressbar = tqdm(total=qs.count())
if not end:
return
while start < end.pk:
for o in qs.filter(pk__gt=start, pk__lte=start+chunk):
func(o)
progressbar.update(1)
start += chunk
progressbar.close()
To use this you write a function that does operations on your object:
def set_population(town):
town.population = calculate_population(...)
town.save()
and than run that function on your queryset:
spoonfeed(Town.objects.all(), set_population)
- [Django]-How to use regex in django query
- [Django]-How to serve media files on Django production environment?
- [Django]-STATIC_ROOT vs STATIC_URL in Django
3👍
Here a solution including len and count:
class GeneratorWithLen(object):
"""
Generator that includes len and count for given queryset
"""
def __init__(self, generator, length):
self.generator = generator
self.length = length
def __len__(self):
return self.length
def __iter__(self):
return self.generator
def __getitem__(self, item):
return self.generator.__getitem__(item)
def next(self):
return next(self.generator)
def count(self):
return self.__len__()
def batch(queryset, batch_size=1024):
"""
returns a generator that does not cache results on the QuerySet
Aimed to use with expected HUGE/ENORMOUS data sets, no caching, no memory used more than batch_size
:param batch_size: Size for the maximum chunk of data in memory
:return: generator
"""
total = queryset.count()
def batch_qs(_qs, _batch_size=batch_size):
"""
Returns a (start, end, total, queryset) tuple for each batch in the given
queryset.
"""
for start in range(0, total, _batch_size):
end = min(start + _batch_size, total)
yield (start, end, total, _qs[start:end])
def generate_items():
queryset.order_by() # Clearing... ordering by id if PK autoincremental
for start, end, total, qs in batch_qs(queryset):
for item in qs:
yield item
return GeneratorWithLen(generate_items(), total)
Usage:
events = batch(Event.objects.all())
len(events) == events.count()
for event in events:
# Do something with the Event
- [Django]-How do I import the Django DoesNotExist exception?
- [Django]-What does on_delete do on Django models?
- [Django]-Is it secure to store passwords as environment variables (rather than as plain text) in config files?
0👍
I usually use raw MySQL raw query instead of Django ORM for this kind of task.
MySQL supports streaming mode so we can loop through all records safely and fast without out of memory error.
import MySQLdb
db_config = {} # config your db here
connection = MySQLdb.connect(
host=db_config['HOST'], user=db_config['USER'],
port=int(db_config['PORT']), passwd=db_config['PASSWORD'], db=db_config['NAME'])
cursor = MySQLdb.cursors.SSCursor(connection) # SSCursor for streaming mode
cursor.execute("SELECT * FROM event")
while True:
record = cursor.fetchone()
if record is None:
break
# Do something with record here
cursor.close()
connection.close()
Ref:
- [Django]-ManyRelatedManager object is not iterable
- [Django]-How to change empty_label for modelForm choice field?
- [Django]-How to check Django version