https://grokonez.com/elasticsearch/elasticsearch-tokenizers-word-oriented-tokenizers
Elasticsearch Tokenizers – Word Oriented Tokenizers
A tokenizer breaks a stream of characters up into individual tokens (characters, words...) and outputs a stream of tokens. A tokenizer also records the order or position of each term (used for phrase and word proximity queries) and the start and end character offsets of the original word each term represents (used for highlighting search snippets).
In this tutorial, we're gonna look at how to use some Word Oriented Tokenizers, which tokenize full text into individual words.
1. Standard Tokenizer
The standard tokenizer provides grammar-based tokenization:
POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Tokens:
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "QUICK",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "Brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "Foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 10
    }
  ]
}
To keep things simple, we can write the terms from these tokens like this:
[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
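Beyond the _analyze API, the standard tokenizer is usually referenced from a custom analyzer on an index. The sketch below is just an illustration: the index name my_index, the analyzer name my_analyzer, and the title field are example names, and the typeless mapping syntax assumes a recent Elasticsearch version.
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Every value indexed into title is then split into the same kind of word tokens we saw above.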
Max Token Length
We can configure the maximum token length with max_token_length (defaults to 255). If a token exceeds this length, it is split at max_token_length intervals. For example, if we set max_token_length to 4, QUICK is split into QUIC and K, as the sketch below shows.
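A minimal sketch of such a configuration (the index and analyzer names are just examples): we define a custom tokenizer of type standard with max_token_length set to 4 and test it with _analyze.
PUT my_index_2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 4
        }
      }
    }
  }
}

POST my_index_2/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
With this setting, QUICK comes back as the two tokens QUIC and K, and any other token longer than 4 characters is split the same way.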