n-gram
Summary #
An n-gram is a contiguous sequence of n items taken from a text; the items may be words, numbers, symbols, punctuation marks, or individual characters. N-gram models are useful in text analytics applications where the order of words matters, such as sentiment analysis, text classification, and text generation. N-gram modeling is one of many techniques for converting text from an unstructured format to a structured format. ref
Letters #
n = 1 (Unigrams - letters) #
- h
- e
- l
n = 2 (Bigrams - letters) #
- he
- el
- lp
n = 3 (Trigrams - letters) #
- hel
- elp
- lpf
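The letter lists above can be reproduced with a sliding window over a string. This is a minimal sketch, not a reference implementation: the function name `char_ngrams` is hypothetical, and the source word `helpful` is an assumption, since the lists only show the first three n-grams of each size.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in text, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A string of length L has L - n + 1 windows of size n (none if n > L).
    return [text[i:i + n] for i in range(len(text) - n + 1)]


word = "helpful"  # assumed source word, not stated in the note
print(char_ngrams(word, 1)[:3])  # ['h', 'e', 'l']
print(char_ngrams(word, 2)[:3])  # ['he', 'el', 'lp']
print(char_ngrams(word, 3)[:3])  # ['hel', 'elp', 'lpf']
```

A word of length L yields L - n + 1 n-grams, so longer n-grams are fewer but carry more context.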
Words #
n = 1 (Unigrams - words) #
- The
- quick
- brown
n = 2 (Bigrams - words) #
- The quick
- quick brown
- brown fox
n = 3 (Trigrams - words) #
- The quick brown
- quick brown fox
- brown fox jumps
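Word n-grams follow the same idea, applied to a list of tokens instead of characters. The sketch below assumes plain whitespace tokenization and the sample sentence `The quick brown fox jumps` (inferred from the lists above); `word_ngrams` is a hypothetical helper name. It also counts bigrams with `collections.Counter`, which is one simple way to turn unstructured text into a structured frequency table.

```python
from collections import Counter


def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Split on whitespace and return every run of n consecutive words."""
    tokens = text.split()
    # Zipping n staggered views of the token list yields the sliding windows.
    return list(zip(*(tokens[i:] for i in range(n))))


sentence = "The quick brown fox jumps"  # assumed sample text
print([" ".join(gram) for gram in word_ngrams(sentence, 2)])
# ['The quick', 'quick brown', 'brown fox', 'fox jumps']

# Counting n-grams gives a simple structured representation of a text.
counts = Counter(word_ngrams("the cat sat on the cat", 2))
print(counts.most_common(1))  # [(('the', 'cat'), 2)]
```

Real pipelines usually normalize case and handle punctuation before building n-grams; that step is left out here to keep the window logic visible.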
edge n-grams #
N-grams in Elasticsearch | n-grams, edge n-grams, youtube
The edge_ngram token filter forms an n-gram of a specified length from the beginning of a token.
For example, you can use the edge_ngram token filter to change quick to qu. ref, elasticsearch
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 2
    }
  ],
  "text": "the quick brown fox jumps"
}
# output
## [ t, th, q, qu, b, br, f, fo, j, ju ]
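To see why the request produces that output, the prefix logic can be imitated in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch implements the filter: `edge_ngrams` and `analyze` are hypothetical names, and whitespace splitting stands in for the `standard` tokenizer, which is an assumption that holds for this simple input but not for text with punctuation.

```python
def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 2) -> list[str]:
    """Return the prefixes of token with lengths from min_gram to max_gram."""
    # A token shorter than max_gram can only supply prefixes up to its own length.
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text: str, min_gram: int = 1, max_gram: int = 2) -> list[str]:
    """Tokenize on whitespace (a stand-in for the standard tokenizer), then emit edge n-grams."""
    return [
        gram
        for token in text.split()
        for gram in edge_ngrams(token, min_gram, max_gram)
    ]


print(analyze("the quick brown fox jumps"))
# ['t', 'th', 'q', 'qu', 'b', 'br', 'f', 'fo', 'j', 'ju']
```

Because every n-gram is anchored at the start of the token, edge n-grams suit prefix matching such as search-as-you-type, whereas ordinary n-grams also match in the middle of a word.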