[Django]-Use fuzzy matching in django queryset filter

20👍

When you’re using the ORM, the thing to understand is that everything you do is converted to SQL, and it’s the performance of those queries on the underlying database that matters. Case in point:

SELECT COUNT (*) ...

Is that fast? It depends on whether your database keeps a running row count it can hand straight back – MySQL/MyISAM does, MySQL/InnoDB does not. In plain English – this is one lookup in MyISAM, and a scan of n rows in InnoDB.
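To make that concrete, here’s a rough sketch of how to see the SQL the ORM generates – the Customer model and its field are made up for this example, not taken from your code:

# Illustrative model, purely for the sake of example.
from django.db import connection, models

class Customer(models.Model):
    name = models.CharField(max_length=100)

qs = Customer.objects.filter(name__contains="smith")
print(str(qs.query))        # the SELECT ... LIKE ... statement the ORM will run
print(connection.queries)   # every query run so far (only populated when DEBUG=True)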

Next thing – in order to do exact-match lookups efficiently in SQL, you have to tell the database when you create the table – you can’t just expect it to work that out. For this purpose SQL has CREATE INDEX – in Django, use db_index=True in the field options of your model. Bear in mind that an index adds a performance hit on writes (the index has to be maintained) and obviously needs extra storage (for the data structure), so you cannot “INDEX all the things”. Also, I don’t think it will help for fuzzy matching – but it’s worth noting anyway.
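As a sketch, adding an index to the made-up Customer model from above would look like this:

from django.db import models

class Customer(models.Model):
    # db_index=True makes Django emit a CREATE INDEX for the column in the
    # migration – faster exact-match reads, slower writes, extra storage.
    name = models.CharField(max_length=100, db_index=True)
    email = models.EmailField(db_index=True)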

Next consideration – how do we do fuzzy matching in SQL? Well, LIKE and CONTAINS allow a certain amount of searching and wildcard matching to be done in SQL (those are T-SQL terms – translate for your database server 🙂). In Django you can get at this via Model.objects.get(fieldname__contains=value), which produces LIKE SQL, or similar. There are a number of field lookups available for different kinds of matching.
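For example, on the illustrative Customer model from above – note that filter() is usually what you want for fuzzy matching, since get() expects exactly one result:

Customer.objects.filter(name__contains="smith")    # WHERE name LIKE '%smith%'
Customer.objects.filter(name__icontains="smith")   # case-insensitive LIKE
Customer.objects.filter(name__startswith="sm")     # WHERE name LIKE 'sm%'
Customer.objects.filter(name__endswith="ith")      # WHERE name LIKE '%ith'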

This may or may not be powerful enough for you – I’m not sure.

Now, for the big question: performance. Chances are that if you’re doing a contains search, the SQL server will have to hit every row in the table – don’t take my word for it, but that would be my bet – even with indexing on. With 6,000 rows this might not take all that long; then again, if you’re doing it on every connection to your app, it’s probably going to create a slowdown.

Next thing to understand about the ORM: if you do this:

Model.objects.get(fieldname__contains=value)
Model.objects.get(fieldname__contains=value)

You will issue two queries to the database server. In other words, two separate querysets don’t share a cache – so you might just want to do an .all() and search in memory. Do read about caching and querysets.
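A sketch of the in-memory approach, again on the made-up Customer model:

# Evaluate the queryset once and reuse the results – one query instead of two.
results = list(Customer.objects.filter(name__contains="smith"))
matches_a = [c for c in results if c.name.startswith("Sm")]   # no extra query
matches_b = [c for c in results if c.name.endswith("th")]     # no extra query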

Further down on that same page, you’ll also see Q objects – useful for more complicated queries.
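For instance (illustrative fields again):

from django.db.models import Q

# OR two fuzzy conditions together in a single query.
Customer.objects.filter(
    Q(name__icontains="smith") | Q(email__icontains="smith")
)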

So in summary then:

  • SQL contains some basic fuzzy matching-like parameters.
  • Whether or not these are sufficient depends on your needs.
  • How they perform depends on your SQL server – definitely measure it.
  • Whether you can cache these results in memory depends on how likely you are to need to scale, whether the cache can be shared between instances, and how often it will be invalidated (if it will be invalidated frequently, don’t do it) – again, it might be worth measuring the memory commitment as a result.

Ultimately, I’d start by getting your fuzzy matching working, then measure, then tweak, then measure until you work out how to improve performance. 99% of this I learnt doing exactly that 🙂

👤user257111

10👍

With Postgres as your database, you can use TrigramSimilarity to do fuzzy search and rank your results with different weights as well. Here is the link to the documentation:

https://docs.djangoproject.com/en/2.0/ref/contrib/postgres/search/#trigram-similarity
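A minimal sketch, assuming PostgreSQL with the pg_trgm extension enabled (for example via the TrigramExtension migration operation in django.contrib.postgres.operations); the Customer model and field are illustrative, not from the question:

from django.contrib.postgres.search import TrigramSimilarity

# Annotate each row with its trigram similarity to the search term,
# keep reasonably close matches, and rank the best ones first.
Customer.objects.annotate(
    similarity=TrigramSimilarity("name", "smth"),
).filter(similarity__gt=0.3).order_by("-similarity")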

For full text search you can refer to https://czep.net/17/full-text-search.html
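And a minimal full-text search sketch along the same lines – the model and field names are made up, and the weights are just there to illustrate ranking:

from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector

# Weight the name field higher than the notes field when ranking matches.
vector = SearchVector("name", weight="A") + SearchVector("notes", weight="B")
query = SearchQuery("smith")
Customer.objects.annotate(
    rank=SearchRank(vector, query),
).filter(rank__gte=0.3).order_by("-rank")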

2👍

If you need something stronger than the contains lookup, have a look at regex lookups: https://docs.djangoproject.com/en/1.0/ref/models/querysets/#regex
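For example (illustrative model and pattern – the regex syntax is interpreted by your database backend):

Customer.objects.filter(name__regex=r"^sm[iy]the?$")    # case-sensitive match
Customer.objects.filter(name__iregex=r"^sm[iy]the?$")   # case-insensitive match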

👤Daniel Grezo
