Full Text Search with Django

May 24, 2024 | seedling, permanent

Full Text Search using Django #

youtube

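Django supports PostgreSQL full-text search out of the box through the django.contrib.postgres.search module. Add 'django.contrib.postgres' to INSTALLED_APPS to enable the search lookup.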

search lookup #

>>> Entry.objects.filter(body_text__search="Cheese")
# [<Entry: Cheese on Toast recipes>, <Entry: Pizza Recipes>]

PostgreSQL data types for FTS #

ref

tsvector #

To convert plain text to a tsvector, use the Postgres to_tsvector function. This function reduces the original text to a set of word skeletons known as lexemes.

Lexemes are important because they help match related words. For instance, the words satisfy, satisfying and satisfied all convert to satisfi. This means a search for satisfy will also return results containing any of the other terms. Stop words such as “a,” “on,” “of,” “you,” “who,” etc. are removed because they appear too frequently to be relevant in searches. The to_tsvector function returns the lexemes, along with a number that denotes each word’s position in the text.

Note that the output of the function is language-dependent. You should tell PostgreSQL to treat the text as English (or whatever language your results are stored in). To convert the sentence “A Fanciful Documentary of a Frisbee And a Lumberjack who must Chase a Monkey in A Shark Tank” to a tsvector, run the following:

SELECT to_tsvector('english', 'A Fanciful Documentary of a Frisbee And a Lumberjack who must Chase a Monkey in A Shark Tank') AS search;

You’ll see output like the following:

                    search
----------------------------------------------------------------------
'chase':12 'documentari':3 'fanci':2 'frisbe':6 'lumberjack':9 'monkey':14 'must':11 'shark':17 'tank':18

This shows each word’s root as well as its position in the text. For example, the word fanciful, the second word in the text, has been reduced to the lexeme “fanci”, so you see 'fanci':2.
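To make the position bookkeeping concrete, here is a minimal pure-Python sketch of the idea. It is not how PostgreSQL implements to_tsvector: the function name toy_tsvector and the short stop-word list are made up for illustration, and stemming is deliberately left out, so words keep their full form.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch);
# PostgreSQL's 'english' configuration ships a much longer one.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "who", "you"}


def toy_tsvector(text):
    """Map each non-stop word to the 1-based positions where it occurs.

    Unlike to_tsvector, this does NOT stem words, so 'fanciful' stays
    'fanciful' instead of becoming the lexeme 'fanci'.
    """
    positions = {}
    words = re.findall(r"[a-z0-9]+", text.lower())
    for position, word in enumerate(words, start=1):
        if word in STOP_WORDS:
            continue  # dropped, but the word still used up a position
        positions.setdefault(word, []).append(position)
    return positions


sentence = (
    "A Fanciful Documentary of a Frisbee And a Lumberjack "
    "who must Chase a Monkey in A Shark Tank"
)
vector = toy_tsvector(sentence)
print(vector["fanciful"], vector["chase"], vector["tank"])  # [2] [12] [18]
```

The positions line up with the Postgres output above because stop words are skipped without renumbering the words that follow them.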

tsquery #

Text search systems have two major components:

  1. the text being searched and

  2. the keyword being searched for.

In the case of FTS, both components must be vectorized. You saw how searchable data is converted to a tsvector in the previous section, so now you’ll see how search terms are vectorized into tsquery values.

Postgres offers functions that will convert text fields to tsquery values such as to_tsquery, plainto_tsquery and phraseto_tsquery.

  • to_tsquery converts the search terms to lexemes and discards stop words. Search terms can be combined with the & (AND), | (OR), and ! (NOT) operators, and parentheses can be used to group operators and determine their order.

The following query:

SELECT to_tsquery('english', 'a & beautifully & very & quickly') AS search;

Returns the lexemes “beauti” and “quick” because “a” and “very” are stop words:

                    search
----------------------------------------------------------------------
'beauti' & 'quick'
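When the search terms come from user input, the operator string has to be assembled somewhere. The following is a minimal sketch of one way to do that in Python: build_tsquery is a hypothetical helper (not part of Postgres or Django) that ORs the terms inside each group and ANDs the groups together. It assumes every term is a single alphanumeric word and rejects anything else rather than trying to escape it.

```python
def build_tsquery(groups):
    """Join terms with | inside each group and join the groups with &.

    Hypothetical helper for illustration; terms must be plain
    alphanumeric words so the quoting below cannot be broken.
    """
    clauses = []
    for group in groups:
        for term in group:
            if not term.isalnum():
                raise ValueError(f"unsupported search term: {term!r}")
        if not group:
            continue  # skip empty groups instead of emitting "()"
        clause = " | ".join(f"'{term}'" for term in group)
        # Parentheses are only needed when a group holds several terms.
        clauses.append(f"({clause})" if len(group) > 1 else clause)
    return " & ".join(clauses)


query = build_tsquery([["epic", "beautiful", "brilliant"], ["tale", "story"]])
print(query)  # ('epic' | 'beautiful' | 'brilliant') & ('tale' | 'story')
```

The resulting string is in the form to_tsquery expects; stemming and stop-word removal are still left to Postgres.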

Django #

from django.db import models


class Film(models.Model):
    film_id = models.AutoField(primary_key=True)
    title = models.CharField(max_length=255)
    description = models.TextField(blank=True, null=True)

    def __str__(self):
        # description is nullable, so format it instead of concatenating strings
        return f"film_id={self.film_id}, title={self.title}, description={self.description}"

SearchVector #

If you want to use the tsvector on its own, you can use the Django SearchVector class

>>> from django.contrib.postgres.search import SearchVector
>>> Film.objects.annotate(search=SearchVector('title', 'description', config='english')).filter(search='love')
<QuerySet [<Film: film_id=374, title=Graffiti Love, description=A Unbelieveable Epistle of a Sumo Wrestler And a Hunter who must Build a Composer in Berlin>,
<Film: film_id=448, title=Idaho Love, description=A Fast-Paced Drama of a Student And a Crocodile who must Meet a Database Administrator in The Outback>,
 <Film: film_id=458, title=Indian Love, description=A Insightful Saga of a Mad Scientist And a Mad Scientist who must Kill a Astronaut in An Abandoned Fun House>,
<Film: film_id=511, title=Lawrence Love, description=A Fanciful Yarn of a Database Administrator And a Mad Cow who must Pursue a Womanizer in Berlin>,
<Film: film_id=535, title=Love Suicides, description=A Brilliant Panorama of a Hunter And a Explorer who must Pursue a Dentist in An Abandoned Fun House>,
<Film: film_id=536, title=Lovely Jingle, description=A Fanciful Yarn of a Crocodile And a Forensic Psychologist who must Discover a Crocodile in The Outback>]>

SearchQuery #

SearchQuery is the abstraction of the to_tsquery, plainto_tsquery and phraseto_tsquery functions in Postgres. There are several ways to use the SearchQuery class including using two keywords in a search:

>>> SearchQuery("story beautiful")

Or searching for a specific phrase:

>>> SearchQuery("mad scientist", search_type="phrase")

Unlike SearchVector, SearchQuery supports boolean operators. With search_type="raw", the string is passed to to_tsquery as written, so the boolean operators combine search terms just like they did in Postgres:

>>> SearchQuery("('epic' | 'beautiful' | 'brilliant') & ('tale' | 'story')", search_type="raw")

Using SearchVector and SearchQuery together in a search allows you to create powerful custom searches in Django:

>>> vector = SearchVector('title', 'description', config='english') # search the title and description columns..
>>> query = SearchQuery("('epic' | 'beautiful' | 'brilliant') & ('tale' | 'story')", search_type="raw") # ..with the search term
>>> Film.objects.annotate(search=vector).filter(search=query)
<QuerySet [
    <Film: film_id=8, title=Airport Pollock, description=A Epic Tale of a Moose And a Girl who must Confront a Monkey in Ancient India>, <Film: film_id=30, title=Anything Savannah, description=A Epic Story of a Pastry Chef And a Woman who must Chase a Feminist in An Abandoned Fun House>,
    <Film: film_id=46, title=Autumn Crow, description=A Beautiful Tale of a Dentist And a Mad Cow who must Battle a Moose in The Sahara Desert>, <Film: film_id=97, title=Bride Intrigue, description=A Epic Tale of a Robot And a Monkey who must Vanquish a Man in New Orleans>,
   <Film: film_id=196, title=Cruelty Unforgiven, description=A Brilliant Tale of a Car And a Moose who must Battle a Dentist in Nigeria>,
   <Film: film_id=202, title=Daddy Pittsburgh, description=A Epic Story of a A Shark And a Student who must Confront a Explorer in The Gulf of Mexico>...]>

SearchRank #

>>> from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector
>>> # Order films by how well they match the query
>>> vector = SearchVector('title', 'description', config='english')
>>> query = SearchQuery("('epic' | 'beautiful' | 'brilliant') & ('tale' | 'story')", search_type="raw")
>>> Film.objects.annotate(rank=SearchRank(vector, query)).order_by('-rank')
>>> # Weight matches in the title (A) above matches in the description (B)
>>> vector = SearchVector('title', weight='A') + SearchVector('description', config='english', weight='B')
>>> query = SearchQuery("('epic' | 'beautiful' | 'brilliant') & ('tale' | 'story')", search_type="raw")
>>> Film.objects.annotate(rank=SearchRank(vector, query)).order_by('-rank')

Optimizing Search Performance in Django #

Building the tsvector at query time is slow on large tables. To avoid that, store the precomputed vector in its own column and index it. To add this field and index to a model, use the GinIndex and SearchVectorField classes like this:

from django.db import models

from django.contrib.postgres.indexes import GinIndex  # the Postgres-recommended GIN index
from django.contrib.postgres.search import SearchVectorField


class Film(models.Model):
    film_id = models.AutoField(primary_key=True)
    title = models.CharField(max_length=255)
    description = models.TextField(blank=True, null=True)
    vector_column = SearchVectorField(null=True)  # new field

    def __str__(self):
        # description is nullable, so format it instead of concatenating strings
        return f"film_id={self.film_id}, title={self.title}, description={self.description}"

    class Meta:
        indexes = [GinIndex(fields=["vector_column"])]  # add index

GIN #

Generalized Inverted Index

  • Handles 100,000+ unique words (lexemes) better than GiST, but is slower to update.

GIST #

Generalized Search Tree

GIN vs GIST #

ref

Experiments lead to the following observations:

  • creation time - GIN takes about 3 times longer to build than GiST

  • size of index - a GIN index is 2-3 times bigger than a GiST index

  • search time - GIN is about 3 times faster than GiST

  • update time - GIN is about 10 times slower to update than GiST

    • Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000, while GIN indexes handle 100,000+ lexemes better but are slower to update.

Update the index when fields are updated #

ref, triggers in postgres 12 +
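One way to keep the stored vector in sync is a database trigger that calls Postgres's built-in tsvector_update_trigger function. The sketch below only builds the SQL text: tsvector_trigger_sql is a hypothetical helper, and the table name films_film is an assumption (Django derives the real name from the app label and model name). The EXECUTE FUNCTION spelling needs Postgres 11 or newer, and the configuration name has to be schema-qualified.

```python
def tsvector_trigger_sql(table, vector_column, source_columns,
                         config="pg_catalog.english"):
    """Return CREATE TRIGGER SQL that refreshes vector_column on each write.

    Hypothetical helper for illustration. Identifiers are interpolated
    as-is, so pass only trusted, hard-coded names.
    """
    columns = ", ".join(source_columns)
    return (
        f"CREATE TRIGGER {table}_{vector_column}_update\n"
        f"BEFORE INSERT OR UPDATE ON {table}\n"
        f"FOR EACH ROW EXECUTE FUNCTION\n"
        f"tsvector_update_trigger({vector_column}, '{config}', {columns});"
    )


# 'films_film' is an assumed table name for the Film model.
sql = tsvector_trigger_sql("films_film", "vector_column", ["title", "description"])
print(sql)
```

In a Django project this string could be passed to migrations.RunSQL in a migration. The trigger only fires on later inserts and updates, so rows that already exist still need a one-time update of vector_column.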

