[Fixed]-Can't find strings that aren't words in Django Haystack/Elasticsearch

0👍

✅

Answering my own question. I appreciate the input from solarissmoke, which helped me track down what was causing this.

My answer is based on Greg Baker’s answer to the question "ElasticSearch: EdgeNgrams and Numbers".

The issue appears to be related to numeric characters within the search text (in my case, the N133TC pattern). Note that I first used the snowball analyzer and then switched to the pattern analyzer; neither worked.

I adjusted my analyzer setting in settings.py:

"edgengram_analyzer": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": ["haystack_edgengram"]
}

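This changes the tokenizer from the original lowercase tokenizer to standard. The lowercase tokenizer splits text on every non-letter character, so the digits in N133TC are discarded and only the fragments n and tc are left to index.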
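To see why the tokenizer matters here, this is a rough pure-Python sketch of the two behaviours. The regular expressions only approximate what Elasticsearch does, the function names are made up for illustration, and the min 4 / max 15 gram sizes are the ones mentioned in the other answer.

```python
import re


def lowercase_tokenizer(text):
    # Approximation of the "lowercase" tokenizer: break on anything that
    # is not a letter, then lowercase each piece.
    return [piece.lower() for piece in re.findall(r"[^\W\d_]+", text)]


def standard_like_tokenizer(text):
    # Very rough stand-in for the "standard" tokenizer: letters, digits
    # and underscores stay together in one token.
    return re.findall(r"\w+", text)


def edge_ngrams(token, min_gram=4, max_gram=15):
    # Prefixes of the token, from min_gram up to max_gram characters.
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


callsign = "N133TC"

# Digits are dropped, and the leftover fragments are too short to index.
print(lowercase_tokenizer(callsign))
print([g for t in lowercase_tokenizer(callsign) for g in edge_ngrams(t)])

# The callsign survives as one token and produces usable prefixes.
print(standard_like_tokenizer(callsign))
print([g for t in standard_like_tokenizer(callsign) for g in edge_ngrams(t)])
```

With the lowercase-style tokenizer nothing reaches the index for the callsign, which would explain why the search came back empty.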

I then set the default analyzer used by my backend to edgengram_analyzer (also in settings.py):

ELASTICSEARCH_DEFAULT_ANALYZER = "edgengram_analyzer"
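For context, here is a sketch of how the two pieces might sit together in settings.py. ELASTICSEARCH_INDEX_SETTINGS is a hypothetical name, and as far as I know neither setting is read by the stock Haystack backend, so this assumes a custom backend subclass that picks them up. The filter definition (min 4, max 15) is taken from the numbers in the other answer, not from my actual file.

```python
# settings.py (sketch only)

# Hypothetical setting name; a custom backend would have to read it.
ELASTICSEARCH_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "edgengram_analyzer": {
                    "type": "custom",
                    # "standard" keeps digits, unlike "lowercase"
                    "tokenizer": "standard",
                    "filter": ["haystack_edgengram"],
                },
            },
            "filter": {
                "haystack_edgengram": {
                    # older Elasticsearch versions also accept "edgeNGram"
                    "type": "edge_ngram",
                    "min_gram": 4,
                    "max_gram": 15,
                },
            },
        },
    },
}

# Analyzer name the custom backend applies to fields by default.
ELASTICSEARCH_DEFAULT_ANALYZER = "edgengram_analyzer"
```

The index has to be rebuilt after changing analysis settings, otherwise the old tokens stay in place.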

This does the trick! It still works as an EdgeNgram field should, but allows for my numeric values to be returned properly too.

I’ve also followed the advice in the answer by solarissmoke and removed all the underscores from my index files.

👤Cameron

1👍

It doesn’t fully explain the behaviour you are seeing, but I think the problem is with how you are indexing your data – specifically the text field (which is what gets searched when you filter on content).

Take the example data you provided, callsign N133TC, flight name Shahrul Nizam. The text document for this data becomes:

flight___N133TC___Shahrul Nizam

You have set this field as an EdgeNgramField (min 4 chars, max 15). Here are the ngrams that are generated when this document is indexed (I’ve ignored the lowercase filter for simplicity):

flig
fligh
flight
flight_
flight__
flight___
flight___N
flight___N1
flight___N13
flight___N133
flight___N133T
flight___N133TC
Niza
Nizam

Note that the tokenizer does not split on underscores. Now, if you search for N133TC, none of the above tokens will match. (I can’t explain why Shahrul works… it shouldn’t, unless I’ve missed something, or there are spaces at the start of that field).
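If you want to reproduce that list, here is a minimal sketch in plain Python. It assumes tokens are separated by whitespace only and ignores lowercasing, as above; edge_ngrams is just an illustrative helper, not part of Haystack or Elasticsearch.

```python
def edge_ngrams(token, min_gram=4, max_gram=15):
    """Return the prefixes of token between min_gram and max_gram chars."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


document = "flight___N133TC___Shahrul Nizam"

# Underscores do not separate tokens, so this yields two tokens only.
tokens = document.split()

indexed = [gram for token in tokens for gram in edge_ngrams(token)]
for gram in indexed:
    print(gram)

# Neither the callsign nor the first name is present as a term of its own.
print("N133TC" in indexed)
print("Shahrul" in indexed)
```

The first token is 25 characters long, so the 15 character limit cuts it off at flight___N133TC and nothing after that point is ever indexed.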

If you changed your text document to:

flight N133TC Shahrul Nizam

Then the indexed tokens would be:

flig
fligh
flight
N133
N133T
N133TC
Shah
Shahr
Shahru
Shahrul
Niza
Nizam

Now, a search for N133TC should match.
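A quick way to compare the two document layouts is to simulate the lookup. This is a simplified sketch: it assumes whitespace tokenization, lowercasing on both sides, and that the query is matched as one whole term against the indexed prefixes, which is not exactly how Haystack analyzes queries. index_terms and matches are hypothetical helper names.

```python
def index_terms(document, min_gram=4, max_gram=15):
    # Build the set of lowercased edge n-grams for every token.
    terms = set()
    for token in document.lower().split():
        longest = min(len(token), max_gram)
        terms.update(token[:n] for n in range(min_gram, longest + 1))
    return terms


def matches(query, document):
    # Simplification: the query must equal one of the indexed terms.
    return query.lower() in index_terms(document)


with_underscores = "flight___N133TC___Shahrul Nizam"
with_spaces = "flight N133TC Shahrul Nizam"

for query in ("N133TC", "Shah", "Nizam"):
    print(query, matches(query, with_underscores), matches(query, with_spaces))
```

Only Nizam is found in both layouts; the callsign and the first name only match once the underscores are replaced with spaces.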

Note also that the flight___ string in your document generates a whole load of (most likely) useless tokens – unless this is deliberate you may be better off without it.

👤solarissmoke

Source:stackexchange.com
