[Django]-Converting Django QuerySet to pandas DataFrame

152👍

import pandas as pd
import datetime
from myapp.models import BlogPost

df = pd.DataFrame(list(BlogPost.objects.all().values()))
df = pd.DataFrame(
    list(
        BlogPost.objects.filter(
            date__gte=datetime.datetime(2012, 5, 1)
        ).values()
    )
)

# limit which fields
df = pd.DataFrame(
    list(
        BlogPost.objects.all().values(
            "author", "date", "slug"
        )
    )
)

The above is how I do the same thing. The most useful addition is specifying which fields you are interested in: if you only need a subset of the available fields, restricting the query to them should give a performance boost.
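
For reference, .values() yields one dict per row, and pd.DataFrame accepts a list of those dicts directly. A minimal stand-in (hard-coded rows in place of a database query) shows the shape of the data:

```python
import pandas as pd

# stand-in for list(BlogPost.objects.all().values("author", "date", "slug")):
# .values() yields one dict per row, and pd.DataFrame accepts a list of dicts
rows = [
    {"author": "alice", "date": "2012-05-01", "slug": "first-post"},
    {"author": "bob", "date": "2012-06-02", "slug": "second-post"},
]
df = pd.DataFrame(rows)
print(df.columns.tolist())  # column names come from the dict keys
```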

40👍

Converting the queryset with values_list() is more memory efficient than with values(). values() returns a queryset of dicts (key/value pairs per row), while values_list() returns a queryset of plain tuples. In my tests this saves about 50% of the memory; you just need to supply the column names yourself when you call pd.DataFrame().

Method 1:

queryset = models.xxx.objects.values("A", "B", "C", "D")

## consumes a lot of memory
df = pd.DataFrame(list(queryset))

## also works, but not much change in memory usage
df = pd.DataFrame.from_records(queryset)

Method 2:

queryset = models.xxx.objects.values_list(
    "A", "B", "C", "D"
)

## this will save 50% memory
df = pd.DataFrame(
    list(queryset), columns=["A", "B", "C", "D"]
)

## this did not work for me: it crashed because from_records received a queryset, not a list
df = pd.DataFrame.from_records(
    queryset, columns=["A", "B", "C", "D"]
)

I tested this on a project with over 1 million rows of data; peak memory dropped from 2 GB to 1 GB.
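
The per-row saving comes from the container type: each values() row is a dict, each values_list() row is a tuple. A quick plain-Python check (no Django needed; exact byte counts vary by Python version) shows the overhead difference:

```python
import sys

# one row as values() would return it (dict) vs. values_list() (tuple)
dict_row = {"A": 1, "B": 2, "C": 3, "D": 4}
tuple_row = (1, 2, 3, 4)

# the dict carries its keys and a hash table in every row; the tuple does not
print(sys.getsizeof(dict_row), sys.getsizeof(tuple_row))
```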

32👍

Django Pandas solves this rather neatly: https://github.com/chrisdev/django-pandas/

From the README:

class MyModel(models.Model):
    full_name = models.CharField(max_length=25)
    age = models.IntegerField()
    department = models.CharField(max_length=3)
    wage = models.FloatField()

from django_pandas.io import read_frame
qs = MyModel.objects.all()
df = read_frame(qs)

2👍

From the Django perspective (I'm not familiar with pandas) this is fine. My only concern is that with a very large number of records you may run into memory problems. If that happens, something along the lines of this memory-efficient queryset iterator would be necessary. (The snippet as written might require some rewriting to allow for your smart use of .values().)
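
One way to sketch that idea is to combine a streaming row source (with Django, you would pass qs.values_list(*columns).iterator() so rows are fetched lazily) with building the DataFrame in bounded chunks. This is a sketch, not the snippet the answer links to: frame_from_rows is a hypothetical helper, and the generator below stands in for a queryset.

```python
import pandas as pd
from itertools import islice

def frame_from_rows(rows, columns, chunk_size=100_000):
    """Build a DataFrame from an iterable of row tuples without
    materialising the whole result set as one Python list.
    Only chunk_size rows are held as Python objects at a time."""
    it = iter(rows)
    chunks = []
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        chunks.append(pd.DataFrame(chunk, columns=columns))
    if not chunks:
        return pd.DataFrame(columns=columns)
    return pd.concat(chunks, ignore_index=True)

# generator of tuples standing in for qs.values_list("a", "b").iterator()
rows = ((i, i * 2) for i in range(250))
df = frame_from_rows(rows, columns=["a", "b"], chunk_size=100)
```

Note that the final concatenated DataFrame still has to fit in memory; the chunking only avoids the additional one-big-Python-list copy.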

2👍

You can use model_to_dict:

import pandas as pd
from django.forms import model_to_dict

pallobjs = [model_to_dict(pallobj) for pallobj in PalletsManag.objects.filter(estado='APTO_PARA_VENTA')]
df = pd.DataFrame(pallobjs)
df.head()
