Introduction
My previous blog posts have chronicled my journey with ElasticSearch, starting from my early experiences using it in the Telescope open-source project. It's amazing to think that after almost a year, I still have the privilege of working with this powerful search and analytics engine as part of my job.
This month, I am fortunate enough to attend specialized ElasticSearch courses that are fully funded by my workplace. The courses are led by industry experts, giving me a unique opportunity to gain an in-depth understanding of ElasticSearch and its many features.
Although ElasticSearch is slated to be removed from Telescope, it seems fitting to share some of the key takeaways and new knowledge I have gained. I'm not sure how many posts there will be, but I am excited to keep exploring and expanding my understanding of this powerful tool, especially as I revisit my earlier explanation of how auto-complete indexing works.
Text Analysis
Text analysis happens in two scenarios: when indexing text and when searching with a text query. In both cases, the work is done by analyzers.
Anatomy of an Analyzer
In general, an analyzer can be considered a bundled configuration of character filters, a tokenizer, and token filters.
Sequential flow of text analysis:
            ___________         ___________         _________
           |           |       |           |       |         |   tokens
Text --->  | Character | ----> | Tokenizer | ----> |  Token  | ---------> Index
           |  Filters  |       |           |       | Filters |   query
           |___________|       |___________|       |_________| ---------> Search
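To make the flow more concrete, here's a small example I put together (the input text is invented, not from Telescope) that exercises all three stages at once through the _analyze API: an html_strip character filter, the standard tokenizer, and a lowercase token filter.

# rough sketch of the full pipeline (sample text made up for this post)
POST /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>Hello ElasticSearch!</p>"
}

The response lists the tokens hello and elasticsearch, along with their positions and character offsets.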
Character Filters
Character filters pre-process text before it is passed to the tokenizer.
Some built-in character filters include (a short example follows the list):
- HTML Strip Character Filter to strip out HTML elements
- Mapping Character Filter to replace occurrences of specified strings
- Pattern Replace Character Filter to replace text matching a regex pattern
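For instance, to try the Mapping Character Filter without creating an index, it can be defined inline in an _analyze request (the mapping rule and text here are just an example I made up):

# sketch of a mapping character filter (rule and text are illustrative)
POST /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["& => and"]
    }
  ],
  "tokenizer": "standard",
  "text": "Posts & Comments"
}

The "&" is replaced before tokenization, so the resulting tokens are Posts, and, and Comments.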
Tokenizer
A tokenizer is the component responsible for breaking a sequence of characters into distinct tokens (commonly words) and producing a stream of tokens as its output. It is worth noting that an analyzer must have exactly one tokenizer.
Besides the ability to create Custom Tokenizers, commonly used built-in tokenizers include (see the example after the list):
- Standard Tokenizer for grammar-based tokenization
- Keyword Tokenizer, which outputs the whole text as a single token
- Whitespace Tokenizer, which breaks text on whitespace
- Pattern Tokenizer, which splits text based on a Java regex
- UAX URL Email Tokenizer, which behaves like the standard tokenizer but also recognizes URLs and email addresses as single tokens
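As a quick illustration (the sample text is mine), here is the UAX URL Email Tokenizer handling a URL and an email address that the standard tokenizer would otherwise break apart:

# illustrative text, not from Telescope
POST /_analyze
{
  "tokenizer": "uax_url_email",
  "text": "Read more at https://dev.to or email someone@example.com"
}

The URL and the email address each come back as a single token, while the remaining words are tokenized normally.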
Token Filters
Token filters further process the token stream and may remove, modify, or add tokens, depending on the filter.
Some filters include (example below):
- lowercase - converts all characters to lower case
- ngram / edge_ngram - splits tokens into n-gram chunks
- stemmer / snowball - stems tokens to reduce words to their root form
- hunspell - dictionary-based text stemming
- stop - removes stop words such as "a", "and", "the" from the stream
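Putting two of these together (again, the text is just an invented example), we can chain the lowercase and stop filters after the standard tokenizer:

# sketch of chaining token filters (sample text invented)
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The Quick Brown Fox"
}

"The" is lowercased and then dropped by the stop filter, leaving the tokens quick, brown, and fox.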
What we have in Telescope
PUT /posts
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete",
          "filter": ["lowercase", "remove_duplicates"]
        },
        "autocomplete_search_analyzer": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    ...
  }
}
Breaking down the Posts index in Telescope, we have:
- Two custom analyzers:
  - One named autocomplete_search_analyzer, which doesn't have any character filters or extra token filters, as it uses the built-in lowercase tokenizer
  - One named autocomplete_analyzer, which uses our custom tokenizer, autocomplete, configured with the edge_ngram type. This analyzer also includes the built-in lowercase and remove_duplicates token filters (see the example below)
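To see what this actually produces, once the index exists we can run the custom analyzer through _analyze (the sample text below is just for illustration):

# checking the custom analyzer's output (sample text for illustration)
POST /posts/_analyze
{
  "analyzer": "autocomplete_analyzer",
  "text": "Telescope"
}

Thanks to the edge_ngram tokenizer and the lowercase filter, this returns the tokens t, te, tel, tele, and so on up to telescope, which are exactly the prefixes that make auto-complete matching possible.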
To be continued
For now, this is all I have time for. In later blogs, I'll continue to try and share my findings on text analysis for auto-completion, including insights on ngrams/edge-ngrams and index mappings.