Elasticsearch Tokenizers – Partial Word Tokenizers
In this tutorial, we're gonna look at two tokenizers that break up text or words into small fragments for partial word matching: the N-Gram Tokenizer and the Edge N-Gram Tokenizer.
I. N-Gram Tokenizer
The ngram tokenizer does 2 things:
- breaks up text into words when it encounters specified characters (whitespace, punctuation...)
- emits N-grams of the specified length from each word (e.g. quick with length = 2 -> [qu, ui, ic, ck])
=> N-grams are like a sliding window of continuous letters.
For example:
POST _analyze
{
  "tokenizer": "ngram",
  "text": "Spring 5"
}
It will generate terms using a sliding window with a 1-character minimum width and a 2-character maximum width (the defaults):
[ "S", "Sp", "p", "pr", "r", "ri", "i", "in", "n", "ng", "g", "g ", " ", " 5", "5" ]
Configuration
- min_gram: minimum length of characters in a gram (min-width of the sliding window). Defaults to 1.
- max_gram: maximum length of characters in a gram (max-width of the sliding window). Defaults to 2.
- token_chars: character classes that will be included in a token. Elasticsearch will split on characters that don't belong to any of the specified classes:
  - letter (a, b, ...)
  - digit (1, 2, ...)
  - whitespace (" ", "\n", ...)
  - punctuation (!, ", ...)
  - symbol ($, %, ...)
  Defaults to [] (keep all characters).
For example, we will create a tokenizer with a fixed sliding window (width = 3) that keeps only the letter & digit character classes.
PUT jsa_index_n-gram
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "tokenizer": "jsa_tokenizer"
        }
      },
      "tokenizer": {
        "jsa_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}
POST jsa_index_n-gram/_analyze
{
  "analyzer": "jsa_analyzer",
  "text": "Tut101: Spring 5"
}
Terms:
[ "Tut", "ut1", "t10", "101", "Spr", "pri", "rin", "ing" ]
The ":" and the whitespace act as split points (they are not letters or digits), and "5" emits no gram because it is shorter than min_gram.
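To see how these grams enable partial word matching end-to-end, here is a minimal sketch (assuming Elasticsearch 7+ without mapping types; the jsa_index_demo index and the title field are our own for illustration, not from the tutorial) that applies the same analyzer to a field, indexes a document, and matches on a 3-character fragment:
// jsa_index_demo is a hypothetical demo index
PUT jsa_index_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "tokenizer": "jsa_tokenizer"
        }
      },
      "tokenizer": {
        "jsa_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "jsa_analyzer"
      }
    }
  }
}

PUT jsa_index_demo/_doc/1
{
  "title": "Tut101: Spring 5"
}

POST jsa_index_demo/_search
{
  "query": {
    "match": { "title": "pri" }
  }
}
The match query analyzes the query string with the same jsa_analyzer, so "pri" becomes the single gram "pri" and matches the grams indexed for "Spring". A fragment shorter than min_gram (e.g. "pr") would produce no gram and match nothing.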
More at:
https://grokonez.com/elasticsearch/elasticsearch-tokenizers-partial-word-tokenizers