19👍
This is not specific to the Django ORM, but I recently had to bulk insert more than 60 million rows of 8 columns from over 2000 files into a sqlite3 database. I learned that the following three things reduced the insert time from over 48 hours to about 1 hour:
- Increase the cache size setting of your DB so it uses more RAM (the defaults are always very small; I used 3GB). In sqlite, this is done with PRAGMA cache_size = n_of_pages;
- Do journalling in RAM instead of on disk. This does cause a slight problem if the system fails, but I consider that negligible given that you already have the source data on disk. In sqlite, this is done with PRAGMA journal_mode = MEMORY
- Last and perhaps most important: do not build indexes while inserting. This also means not declaring UNIQUE or any other constraint that might cause the DB to build an index. Build the index only after you are done inserting.
As someone mentioned previously, you should also use cursor.executemany() (or just the shortcut conn.executemany()). To use it, do:
cursor.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', iterable_data)
The iterable_data can be a list or any similar iterable of row tuples, or even an open file reader.
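Putting the three settings and executemany() together, a minimal sketch might look like the following. The function name bulk_load, the cache_pages default, and the index name are made up for illustration; only the PRAGMAs and the INSERT statement come from the text above.

```python
import sqlite3


def bulk_load(db_path, rows, cache_pages=100_000):
    """Bulk insert rows into mytable, then index it.

    bulk_load, cache_pages and idx_mytable_field1 are hypothetical names.
    Returns the number of rows in the table afterwards.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Larger page cache; a positive value is a number of pages.
        conn.execute(f"PRAGMA cache_size = {int(cache_pages)}")
        # Keep the rollback journal in RAM instead of on disk.
        conn.execute("PRAGMA journal_mode = MEMORY")
        # Plain table: no UNIQUE constraints or indexes yet.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS mytable (field1, field2, field3)"
        )
        with conn:  # commit the whole batch as one transaction
            conn.executemany(
                "INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)",
                rows,  # any iterable of 3-tuples, including a generator
            )
        # Build the index only after all rows are in.
        conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_mytable_field1 ON mytable (field1)"
        )
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM mytable").fetchone()[0]
    finally:
        conn.close()
```

Because rows can be a generator, the source files can be streamed through one at a time without holding all the data in memory.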
12👍
I ran some tests on Django 1.10 / Postgresql 9.4 / Pandas 0.19.0 and got the following timings:
- Insert 3000 rows individually and get ids from populated objects using Django ORM: 3200ms
- Insert 3000 rows with Pandas DataFrame.to_sql() and don’t get IDs: 774ms
- Insert 3000 rows with Django manager .bulk_create(Model(**df.to_records())) and don’t get IDs: 574ms
- Insert 3000 rows with to_csv to StringIO buffer and COPY (cur.copy_from()) and don’t get IDs: 118ms
- Insert 3000 rows with to_csv and COPY and get IDs via simple SELECT WHERE ID > [max ID before insert] (probably not threadsafe unless COPY holds a lock on the table preventing simultaneous inserts?): 201ms
from io import StringIO

# ExcelImportProcessor is the author's own class (not shown here).

def bulk_to_sql(df, columns, model_cls):
    """ Inserting 3000 takes 774ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    df[columns].to_sql(model_cls._meta.db_table, con=engine, if_exists='append', index=False)

def bulk_via_csv(df, columns, model_cls):
    """ Inserting 3000 takes 118ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    connection = engine.raw_connection()
    cur = connection.cursor()
    # Serialize the frame as tab-separated text in memory, then rewind for COPY.
    output = StringIO()
    df[columns].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    cur.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
    connection.commit()
    cur.close()
The performance stats were all obtained on a table already containing 3,000 rows running on OS X (i7 SSD 16GB), average of ten runs using timeit.
I get my inserted primary keys back by assigning an import batch id and sorting by primary key, although I’m not 100% certain primary keys will always be assigned in the order the rows are serialized for the COPY command; I would appreciate opinions either way.
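The batch id idea can be sketched roughly as below. This uses sqlite3 from the standard library as a stand-in for PostgreSQL and COPY, purely so the example is self-contained. The item table, its columns, and the function name are hypothetical. The sketch returns each name alongside its key, so the caller can match rows by value instead of relying on insertion order.

```python
import sqlite3
import uuid


def insert_batch_and_get_ids(conn, names):
    """Insert names under a fresh batch id and return [(id, name), ...].

    Table and column names are hypothetical; sqlite3 stands in for
    PostgreSQL here.
    """
    batch_id = uuid.uuid4().hex  # marker shared by every row in this import
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT INTO item (name, batch_id) VALUES (?, ?)",
            ((name, batch_id) for name in names),
        )
    # Fetch only this batch's rows, ordered by primary key.
    cur = conn.execute(
        "SELECT id, name FROM item WHERE batch_id = ? ORDER BY id",
        (batch_id,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, batch_id TEXT)"
)
```

Filtering on the batch id also sidesteps the thread-safety worry about SELECT WHERE ID > [max ID before insert], since rows inserted concurrently by another import carry a different batch id.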
Update 2020:
I tested the new to_sql(method="multi") functionality in Pandas >= 0.24, which puts all inserts into a single, multi-row insert statement. Surprisingly, performance was worse than the single-row version, whether for Pandas versions 0.23, 0.24 or 1.1. Pandas single-row inserts were also faster than a multi-row insert statement issued directly to the database. I am using more complex data in a bigger database this time, but to_csv and cursor.copy_from was still around 38% faster than the fastest alternative, which was a single-row df.to_sql, and bulk_import was occasionally comparable, but often slower still (up to double the time, Django 2.2).
5👍
There is also a bulk insert snippet at http://djangosnippets.org/snippets/446/.
This issues a single INSERT command with multiple value tuples (INSERT INTO x (val1, val2) VALUES (1,2), (3,4), etc.). This should greatly improve performance.
It also appears to be heavily documented, which is always a plus.
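For illustration, here is a rough sketch of the multi-row INSERT technique itself. This is not the code from the linked snippet. It uses sqlite3 so that it runs standalone, and the function name, the max_params default, and the 500-row cap are assumptions chosen to stay under typical SQLite limits.

```python
import sqlite3


def multi_row_insert(conn, table, columns, rows, max_params=900):
    """Insert rows using INSERT ... VALUES (..), (..), ... statements.

    Rows are chunked so each statement stays under max_params bound
    parameters. table and columns are interpolated into the SQL text,
    so they must come from trusted code, never from user input.
    """
    rows = list(rows)
    if not rows:
        return 0
    # Rows per statement, capped at 500 for older SQLite builds.
    per_stmt = max(1, min(500, max_params // len(columns)))
    col_sql = ", ".join(columns)
    one_row = "(" + ", ".join("?" * len(columns)) + ")"
    inserted = 0
    with conn:  # single transaction around all chunks
        for start in range(0, len(rows), per_stmt):
            chunk = rows[start:start + per_stmt]
            sql = (
                f"INSERT INTO {table} ({col_sql}) VALUES "
                + ", ".join([one_row] * len(chunk))
            )
            # Flatten the chunk into one parameter list.
            params = [value for row in chunk for value in row]
            conn.execute(sql, params)
            inserted += len(chunk)
    return inserted
```

Whether this beats executemany() depends on the database and driver, so it is worth timing both on your own data.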
4👍
Also, if you want something quick and simple, you could try this: http://djangosnippets.org/snippets/2362/. It’s a simple manager I used on a project.
The other snippet wasn’t as simple and was really focused on bulk inserts for relationships. This is just a plain bulk insert and just uses the same INSERT query.
3👍
The development version of Django now has bulk_create: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create