Roxanne Lee

ElasticSearch Continued #1

Introduction

My previous blog posts have chronicled my journey with ElasticSearch, starting from my early experiences using it in the Telescope open-source project. It's amazing to think that after almost a year, I still have the privilege of working with this powerful search and analytics engine as part of my job.

This month, I am fortunate enough to attend specialized ElasticSearch courses that are fully funded by my workplace. The courses are led by industry experts, giving me a unique opportunity to gain an in-depth understanding of ElasticSearch and its many features.

Although ElasticSearch is planned to be removed from Telescope, it seems fitting to share some of the key takeaways and new knowledge that I have gained. While I'm not sure how many posts there will be, I am excited to continue exploring and expanding my understanding of this powerful tool, especially in revising my explanation of how auto-complete indexing works.


Text Analysis

Text analysis comes into play in two scenarios: when text is indexed, and when a search query contains text. In both cases, it is carried out by analyzers.

Anatomy of an Analyzer

In general, an analyzer can be thought of as a bundled configuration of character filters, a tokenizer, and token filters.

Sequential flow of text analysis:
          ___________        ___________        _________  tokens 
         |           |      |           |      |         | ------>  Index 
Text --> | Character | ---> | Tokenizer | ---> |  Token  |          
         |  Filters  |      |           |      | Filters |  query       
         |___________|      |___________|      |_________| ------>  Search 

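A quick way to see this pipeline in action is the _analyze API, which runs a piece of text through an analyzer and returns the resulting tokens. A minimal example (assuming a running cluster) using the built-in standard analyzer:

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes"
}

This returns the tokens the, 2, quick, brown, and foxes: the standard tokenizer splits on word boundaries and drops the punctuation, then the lowercase token filter normalizes the case.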

Character Filters

Character filters are used to pre-process text before it is passed on to the tokenizer. The built-in character filters are:

  • html_strip - strips HTML elements from the text and decodes HTML entities
  • mapping - replaces occurrences of specified strings with the given replacements
  • pattern_replace - replaces characters that match a regular expression

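As a small sketch of a character filter at work, the request below applies html_strip before tokenization (the standard tokenizer is used here just for illustration):

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": ["html_strip"],
  "text": "<p>Hello <b>world</b></p>"
}

The markup is removed before the text ever reaches the tokenizer, so the output is simply the tokens Hello and world.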

Tokenizer

A tokenizer is the component in charge of separating a sequence of characters into distinct tokens (commonly words) and producing a stream of tokens as its output. It is worth noting that an analyzer must have exactly one tokenizer.

Besides the option of creating custom tokenizers, commonly used built-in tokenizers include:

  • standard - splits text on word boundaries and removes most punctuation
  • whitespace - splits text on whitespace only
  • letter - splits text whenever it encounters a non-letter character
  • lowercase - like letter, but also lower-cases each token
  • keyword - outputs the entire input as a single token
  • ngram/edge_ngram - break the text up into n-gram chunks

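The _analyze API also accepts a bare tokenizer, which makes it easy to compare them. A small example with the whitespace tokenizer:

POST _analyze
{
  "tokenizer": "whitespace",
  "text": "It's a beautiful day"
}

This yields the tokens It's, a, beautiful, and day; swapping in the letter tokenizer instead would split on the apostrophe and produce It, s, a, beautiful, day.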

Token Filters

Token filters continue to process the token stream, and may remove, edit or add new tokens based on the filter.
Some filters include:

  • lowercase - switches all characters to lower case
  • ngram/edge_ngram - splits the token into n-gram chunks
  • stemmer/snowball - algorithmic stemming that reduces words to their root form
  • hunspell - dictionary-based stemming
  • stop - removes stop words such as "a", "and", "the" from the stream
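
As a sketch of how filters chain onto a tokenizer's output, the request below runs the standard tokenizer and then applies the lowercase and stop filters in order:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The Quick Brown Fox"
}

The result is quick, brown, and fox: every token is lower-cased first, and the stop filter then drops "the".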

What we have in Telescope

PUT /posts
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete",
          "filter": ["lowercase", "remove_duplicates"]
        },
        "autocomplete_search_analyzer": {
          "tokenizer": "lowercase"
        }
      },
     "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    ...
  }
}

Breaking down the Posts index in Telescope, we have:

  • Two custom analyzers
  • One named autocomplete_search_analyzer, which doesn't have any character filters or extra token filters, as it relies on the built-in lowercase tokenizer
  • One named autocomplete_analyzer, which uses our custom tokenizer, autocomplete, an edge_ngram tokenizer with customized settings, along with the built-in lowercase and remove_duplicates token filters
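
Once the index exists, a handy way to check what the index-time analyzer actually emits is to run the _analyze API against it (the sample text here is just an illustrative value):

POST /posts/_analyze
{
  "analyzer": "autocomplete_analyzer",
  "text": "Telescope"
}

With the edge_ngram settings above (min_gram of 1, max_gram of 20), this should return the lower-cased prefixes t, te, tel, and so on up to telescope, which is what makes prefix-style matching for auto-complete possible at query time.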

To be continued

For now, this is all I have time for. In later posts, I'll continue sharing my findings on text analysis for auto-completion, including insights on ngrams/edge-ngrams and index mappings.
