Django Python Arabic Search


This is a known problem, but it is not specific to Django. Django does not define how case-insensitive search works; that is the job of the database. The set of rules for how characters are treated when ordering or checking equivalence is called a collation.

This Medium article by Ahmed Essam explains the problems with the plain utf8_unicode_ci collation. If I understand the article correctly, the Unicode collation has some shortcomings for Arabic text. Depending on the database, you can define a custom collation, which looks, for example, like the one in the article:

<collation name="utf8_arabic_ci" id="1029">
  <rules>
      <reset>\u0627</reset>   <!-- Alef 'ا' -->
      <i>\u0623</i>           <!-- Alef With Hamza Above 'أ' -->
      <i>\u0625</i>           <!-- Alef With Hamza Below 'إ' -->
      <i>\u0622</i>           <!-- Alef With Madda Above 'آ' -->
  </rules>
  <rules>
      <reset>\u0629</reset>   <!-- Teh Marbuta 'ة' -->
      <i>\u0647</i>           <!-- Heh 'ه' -->
  </rules>
  <rules>
      <reset>\u0000</reset>   <!-- Unicode value of NULL  -->
      <i>\u064E</i>           <!-- Fatha 'َ' -->
      <i>\u064F</i>           <!-- Damma 'ُ' -->
      <i>\u0650</i>           <!-- Kasra 'ِ' -->
      <i>\u0651</i>           <!-- Shadda 'ّ' -->
      <i>\u0652</i>           <!-- Sukun 'ْ' -->
      <i>\u064B</i>           <!-- Fathatan 'ً' -->
      <i>\u064C</i>           <!-- Dammatan 'ٌ' -->
      <i>\u064D</i>           <!-- Kasratan 'ٍ' -->
  </rules>
</collation>

In effect, these rules treat \u0623 (Alef With Hamza Above) as a plain Alef (\u0627) and then try to match it. In the same way, Heh is treated as Teh Marbuta, and the diacritics are ignored.
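If you cannot install a custom collation, roughly the same folding can be approximated in application code before comparing strings. The following is only a minimal sketch that mirrors the rules listed above; the names `normalize_arabic` and `arabic_equal` are made up for illustration and are not part of Django or any database.

```python
# Application-side approximation of the collation rules shown above.
# Assumption: all names here are illustrative, not a Django or MySQL API.

# Alef variants are folded to a plain Alef.
ALEF_VARIANTS = {
    "\u0623": "\u0627",  # Alef With Hamza Above -> Alef
    "\u0625": "\u0627",  # Alef With Hamza Below -> Alef
    "\u0622": "\u0627",  # Alef With Madda Above -> Alef
}

# Heh is folded to Teh Marbuta, as in the second <rules> group.
HEH_VARIANTS = {"\u0647": "\u0629"}

# Diacritics are dropped: Fathatan, Dammatan, Kasratan, Fatha, Damma,
# Kasra, Shadda, Sukun.
DIACRITICS = "\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652"

# One translation table: map variants, delete diacritics (value None).
_TABLE = str.maketrans(
    {**ALEF_VARIANTS, **HEH_VARIANTS, **{mark: None for mark in DIACRITICS}}
)


def normalize_arabic(text: str) -> str:
    """Return text with the folding rules applied."""
    return text.translate(_TABLE)


def arabic_equal(left: str, right: str) -> bool:
    """Compare two strings after folding both of them."""
    return normalize_arabic(left) == normalize_arabic(right)
```

To search with this approach you would have to store a normalized copy of the text next to the original and normalize the query the same way, which is why doing it in the collation is usually more convenient.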

You can then set this collation on the specific column(s) you search, and likely on any other columns that store Arabic text.
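As a rough sketch of that last step, assuming MySQL and assuming the collation is registered under the name `utf8_arabic_ci`, the helper below builds the `ALTER TABLE` statement for one column. The function name and the table and column names are hypothetical. If you are on Django 3.2 or later, `CharField` and `TextField` also accept a `db_collation` argument, which may be a simpler way to get the same result.

```python
import re

# Only allow plain identifiers so the generated SQL cannot be injected into.
_IDENTIFIER = re.compile(r"^[A-Za-z0-9_]+$")
# Column types such as VARCHAR(255) or TEXT.
_COLUMN_TYPE = re.compile(r"^[A-Za-z]+(\(\d+\))?$")


def build_collation_sql(table, column, column_type,
                        charset="utf8", collation="utf8_arabic_ci"):
    """Build a MySQL statement that sets the collation of one column.

    Assumption: MODIFY restates the whole column definition, so other
    attributes (NOT NULL, DEFAULT, ...) would need to be repeated in
    column_type handling for a real migration; this sketch omits them.
    """
    for name in (table, column, charset, collation):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if not _COLUMN_TYPE.match(column_type):
        raise ValueError(f"unsupported column type: {column_type!r}")
    return (
        f"ALTER TABLE `{table}` MODIFY `{column}` {column_type} "
        f"CHARACTER SET {charset} COLLATE {collation};"
    )


# Hypothetical table and column names, for illustration only.
print(build_collation_sql("blog_post", "title", "VARCHAR(255)"))
```

The generated statement could be run from a migration or directly in the database shell; check it against your actual column definition first.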
