n-gram

May 7, 2024 | seedling, permanent


Summary #

An n-gram is a sequence of n consecutive items taken from a text; the items may be words, numbers, symbols, or punctuation. N-gram models are useful in text analytics applications where the order of words matters, such as sentiment analysis, text classification, and text generation. N-gram modeling is one of many techniques for converting text from an unstructured format to a structured one. ref

Letters #

n = 1 (Unigrams - letters) #

n = 2 (Bigrams - letters) #

n = 3 (Trigrams - letters) #
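
The letter-level cases can be sketched with a sliding window over the characters of a string. This is a minimal illustration, not taken from the referenced source: the function name `char_ngrams` and the sample word "quick" are assumptions chosen for the example.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in text."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n across the string, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


word = "quick"  # illustrative input, not from the original note
print(char_ngrams(word, 1))  # ['q', 'u', 'i', 'c', 'k']
print(char_ngrams(word, 2))  # ['qu', 'ui', 'ic', 'ck']
print(char_ngrams(word, 3))  # ['qui', 'uic', 'ick']
```

A string shorter than n yields an empty list, since no full window fits.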

Words #

n = 1 (Unigrams - words) #

The
quick
brown

n = 2 (Bigrams - words) #

The quick
quick brown
brown fox

n = 3 (Trigrams - words) #

The quick brown
quick brown fox
brown fox jumps
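
The word-level listings above can be reproduced with the same sliding-window idea applied to tokens instead of characters. A rough sketch, assuming the full sentence is "The quick brown fox jumps" and that splitting on whitespace is an acceptable tokenizer; `word_ngrams` is a hypothetical helper name.

```python
def word_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive whitespace-separated words."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = text.split()  # naive tokenizer: split on whitespace only
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The quick brown fox jumps"  # assumed source sentence
print(word_ngrams(sentence, 2))
# ['The quick', 'quick brown', 'brown fox', 'fox jumps']
print(word_ngrams(sentence, 3))
# ['The quick brown', 'quick brown fox', 'brown fox jumps']
```

Real pipelines usually need a proper tokenizer, since whitespace splitting leaves punctuation attached to words.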

edge n-grams #

Elasticsearch and OpenSearch

N-grams in Elasticsearch | n-grams, edge n-grams, youtube

The edge_ngram token filter forms an n-gram of a specified length from the beginning of a token.

For example, you can use the edge_ngram token filter to change quick to qu. ref, elasticsearch

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 2
    }
  ],
  "text": "the quick brown fox jumps"
}

# output
## [ t, th, q, qu, b, br, f, fo, j, ju ]
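
To see why the request yields that output, the filter's behavior can be approximated in plain Python. This is only a sketch of the idea, not Elasticsearch's implementation: `edge_ngrams` and `analyze` are hypothetical names, and whitespace splitting stands in for the standard tokenizer.

```python
def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 2) -> list[str]:
    """Return the prefixes of token with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))  # a token cannot yield a prefix longer than itself
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text: str, min_gram: int = 1, max_gram: int = 2) -> list[str]:
    """Split on whitespace (a stand-in for the standard tokenizer), then emit edge n-grams."""
    return [
        gram
        for token in text.split()
        for gram in edge_ngrams(token, min_gram, max_gram)
    ]


print(analyze("the quick brown fox jumps"))
# ['t', 'th', 'q', 'qu', 'b', 'br', 'f', 'fo', 'j', 'ju']
```

Only prefixes are produced, which is what makes edge n-grams a common fit for search-as-you-type matching.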