[Django]-Django Haystack – How to force exact attribute match without stemming?

6👍

Python3, Django 1.10, Elasticsearch 2.4.4.

TL;DR: define a custom tokenizer (not a filter)


Verbose explanation

a) use EdgeNgramField:

# search_indexes.py
class PersonIndex(indexes.SearchIndex, indexes.Indexable):

    text = indexes.EdgeNgramField(document=True, use_template=True)
    ...

b) template:

# templates/search/indexes/people/person_text.txt
{{ object.name }}

c) create custom search backend:

# backends.py
from django.conf import settings

from haystack.backends.elasticsearch_backend import (
    ElasticsearchSearchBackend,
    ElasticsearchSearchEngine,
)


class CustomElasticsearchSearchBackend(ElasticsearchSearchBackend):

    def __init__(self, connection_alias, **connection_options):
        super(CustomElasticsearchSearchBackend, self).__init__(
            connection_alias, **connection_options)

        # Replace the backend's built-in index settings with our own.
        self.DEFAULT_SETTINGS = settings.ELASTICSEARCH_INDEX_SETTINGS


class CustomElasticsearchSearchEngine(ElasticsearchSearchEngine):

    backend = CustomElasticsearchSearchBackend

d) define custom tokenizer (not filter!):

# settings.py
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'apps.persons.backends.CustomElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}

ELASTICSEARCH_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "custom_ngram_tokenizer",
                    "filter": ["asciifolding", "lowercase"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "custom_edgengram_tokenizer",
                    "filter": ["asciifolding", "lowercase"]
                }
            },
            "tokenizer": {
                "custom_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 12,
                    "token_chars": ["letter", "digit"]
                },
                "custom_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 12,
                    "token_chars": ["letter", "digit"]
                }
            }
        }
    }
}

HAYSTACK_DEFAULT_OPERATOR = 'AND'
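
To see what the custom_edgengram_tokenizer above actually emits, here is a rough Python sketch of its output for a single term. This is an approximation only: the real tokenizer also splits input on token_chars, and lowercasing is done by the lowercase filter in the analyzer chain.

```python
def edge_ngrams(term, min_gram=2, max_gram=12):
    """Approximate the tokens an edgeNGram tokenizer emits for one
    term: every prefix between min_gram and max_gram characters long."""
    term = term.lower()  # stands in for the 'lowercase' filter
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

print(edge_ngrams("Simons"))  # ['si', 'sim', 'simo', 'simon', 'simons']
```

Because 'simons' itself is among the emitted tokens, a query for the exact name matches without stemming getting in the way.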

e) use AutoQuery (more versatile):

# views.py
from haystack.inputs import AutoQuery
from haystack.query import SearchQuerySet

search_value = 'Simons'
...
person_sqs = \
    SearchQuerySet().models(Person).filter(
        content=AutoQuery(search_value)
    )
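
AutoQuery is more versatile because it interprets the user's input: quoted chunks are kept together as exact phrases and a leading minus excludes a term. The sketch below only illustrates that splitting behaviour; it is not Haystack's actual implementation.

```python
import shlex

def parse_auto_query(text):
    """Illustrative only: split user input roughly the way AutoQuery
    does -- quoted phrases stay together, '-term' marks an exclusion."""
    parsed = []
    for tok in shlex.split(text):
        if tok.startswith('-'):
            parsed.append(('exclude', tok[1:]))
        else:
            parsed.append(('term', tok))
    return parsed

print(parse_auto_query('"John Simons" -doe'))
# [('term', 'John Simons'), ('exclude', 'doe')]
```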

f) reindex after changes:

$ ./manage.py rebuild_index
👤Ukr

1👍

I was facing a similar problem. If you change the settings of your Haystack Elasticsearch backend like this:

DEFAULT_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["haystack_ngram", "lowercase"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["haystack_edgengram", "lowercase"]
                }
            },
            "tokenizer": {
                "haystack_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 6,
                    "max_gram": 15,
                },
                "haystack_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 6,
                    "max_gram": 15,
                    "side": "front"
                }
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 6,
                    "max_gram": 15
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 6,
                    "max_gram": 15
                }
            }
        }
    }
}

With these settings, tokens are only generated for terms of at least 6 characters (the min_gram), so shorter queries will not match.

If you want matches like “xyzsimonsxyz”, you need the nGram analyzer instead of edgeNGram, or both, depending on your requirements: edgeNGram generates tokens only from the beginning of a term.

With nGram, ‘simons’ will be one of the tokens generated for the term xyzsimonsxyz (assuming max_gram >= 6), so you will get the expected results. Also, the search_analyzer needs to differ from the index analyzer, or you will get odd results.

Finally, the index size can get quite big with nGram if you index large chunks of text.
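
The point about nGram generating inner substrings can be checked with a small Python sketch (again an approximation of the tokenizer that ignores token_chars splitting):

```python
def ngrams(term, min_gram=6, max_gram=15):
    """Approximate an nGram tokenizer: every substring of the term
    whose length is between min_gram and max_gram."""
    term = term.lower()
    return [term[i:i + n]
            for i in range(len(term))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(term)]

print("simons" in ngrams("xyzsimonsxyz"))  # True
```

So a search for 'simons' can match a document containing 'xyzsimonsxyz', at the cost of many more tokens in the index.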

-1👍

Don't use CharField, use EdgeNgramField:

# search_indexes.py
class PersonIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    name = indexes.EdgeNgramField(model_attr="name")

    def get_model(self):
        return Person

    def index_queryset(self, using=None):
        return self.get_model().objects.all()

And don't use filter, use autocomplete (note that autocomplete returns a new queryset, so the result must be assigned):

person_sqs = SearchQuerySet().models(Person)
person_sqs = person_sqs.autocomplete(name="Simons")

source: http://django-haystack.readthedocs.org/en/v2.0.0/autocomplete.html
