Introduction
My previous blog posts have chronicled my journey with ElasticSearch, starting from my early experiences using it in the Telescope open-source project. It's amazing to think that after almost a year, I still have the privilege of working with this powerful search and analytics engine as part of my job.
This month, I am fortunate enough to attend specialized ElasticSearch courses that are fully funded by my workplace. The courses are led by industry experts, giving me a unique opportunity to gain an in-depth understanding of ElasticSearch and its many features.
Although ElasticSearch is slated to be removed from Telescope, it seems fitting to share some of the key takeaways and new knowledge I have gained. I'm not sure how many posts there will be, but I am excited to keep exploring and expanding my understanding of this powerful tool, especially as I revisit my earlier explanation of how auto-complete indexing works.
Text Analysis
Text analysis happens in two scenarios: when indexing text and when searching with a text query. In both cases, the work is done by analyzers.
Anatomy of an Analyzer
In general, an analyzer can be considered a bundled configuration of character filters, a tokenizer, and token filters.
Sequential flow of text analysis:
            ___________         ___________         _________
           |           |       |           |       |         |   tokens
Text --->  | Character | ----> | Tokenizer | ----> |  Token  | ---------> Index
           |  Filters  |       |           |       | Filters |   query
           |___________|       |___________|       |_________| ---------> Search
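To make the flow more concrete, here's a small example I put together (the input text is invented, not from Telescope) that exercises all three stages at once through the _analyze API: an html_strip character filter, the standard tokenizer, and a lowercase token filter.

# rough sketch of the full pipeline (sample text made up for this post)
POST /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>Hello ElasticSearch!</p>"
}

The response lists the tokens hello and elasticsearch, along with their positions and character offsets.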
Character Filters
Character filters pre-process text before it is passed to the tokenizer.
Some built-in character filters include (a short example follows the list):
- HTML Strip Character Filter to strip out HTML elements
- Mapping Character Filter to replace occurrences of specified strings
- Pattern Replace Character Filter to replace text matching a regex pattern
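For instance, to try the Mapping Character Filter without creating an index, it can be defined inline in an _analyze request (the mapping rule and text here are just an example I made up):

# sketch of a mapping character filter (rule and text are illustrative)
POST /_analyze
{
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["& => and"]
    }
  ],
  "tokenizer": "standard",
  "text": "Posts & Comments"
}

The "&" is replaced before tokenization, so the resulting tokens are Posts, and, and Comments.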
Tokenizer
A tokenizer is the component responsible for breaking a sequence of characters into distinct tokens (commonly words) and producing a stream of tokens as its output. It is worth noting that an analyzer must have exactly one tokenizer.
Besides the ability to create Custom Tokenizers, commonly used built-in tokenizers include (see the example after the list):
- Standard Tokenizer for grammar-based tokenization
- Keyword Tokenizer, which outputs the whole text as a single token
- Whitespace Tokenizer, which breaks text on whitespace
- Pattern Tokenizer, which splits text based on a Java regex
- UAX URL Email Tokenizer, which behaves like the standard tokenizer but also recognizes URLs and email addresses as single tokens
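As a quick illustration (the sample text is mine), here is the UAX URL Email Tokenizer handling a URL and an email address that the standard tokenizer would otherwise break apart:

# illustrative text, not from Telescope
POST /_analyze
{
  "tokenizer": "uax_url_email",
  "text": "Read more at https://dev.to or email someone@example.com"
}

The URL and the email address each come back as a single token, while the remaining words are tokenized normally.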
Token Filters
Token filters further process the token stream and may remove, modify, or add tokens, depending on the filter.
Some filters include (example below):
- lowercase - converts all characters to lower case
- ngram / edge_ngram - splits tokens into n-gram chunks
- stemmer / snowball - stems tokens to reduce words to their root form
- hunspell - dictionary-based text stemming
- stop - removes stop words such as "a", "and", "the" from the stream
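Putting two of these together (again, the text is just an invented example), we can chain the lowercase and stop filters after the standard tokenizer:

# sketch of chaining token filters (sample text invented)
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The Quick Brown Fox"
}

"The" is lowercased and then dropped by the stop filter, leaving the tokens quick, brown, and fox.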
What we have in Telescope
PUT /posts
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete",
          "filter": ["lowercase", "remove_duplicates"]
        },
        "autocomplete_search_analyzer": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    ...
  }
}
Breaking down the Posts index in Telescope, we have:
- Two custom analyzers:
  - One named autocomplete_search_analyzer, which doesn't have any character filters or extra token filters, as it uses the built-in lowercase tokenizer
  - One named autocomplete_analyzer, which uses our custom tokenizer, autocomplete, configured with the edge_ngram type. This analyzer also includes the built-in lowercase and remove_duplicates token filters (see the example below)
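To see what this actually produces, once the index exists we can run the custom analyzer through _analyze (the sample text below is just for illustration):

# checking the custom analyzer's output (sample text for illustration)
POST /posts/_analyze
{
  "analyzer": "autocomplete_analyzer",
  "text": "Telescope"
}

Thanks to the edge_ngram tokenizer and the lowercase filter, this returns the tokens t, te, tel, tele, and so on up to telescope, which are exactly the prefixes that make auto-complete matching possible.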
To be continued
For now, this is all I have time for. In later blogs, I'll continue to try and share my findings on text analysis for auto-completion, including insights on ngrams/edge-ngrams and index mappings.