Analyzers in ElasticSearch

Kalyan Kumar — Fri, 20 May 2022 17:44:28 +0000

Understanding how analyzers work in ElasticSearch is important for choosing an appropriate indexing strategy for your use-case. If you choose an unsuitable strategy, your queries become harder to write and can end up being slow.

What is an analyzer?

We can think of it as a pipeline. When a document is indexed in ElasticSearch, the string data of the document is given as input to the pipeline, which outputs a list of terms. These terms are then stored in the reverse (inverted) index, as discussed in the previous blog.

What is this pipeline made of?

This pipeline contains three kinds of components, applied in this order:

Character filter
Tokenizer
Token filter

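Each component performs a specific operation on the string data and passes its output to the next component. An analyzer has exactly one tokenizer, while character filters and token filters are optional and there can be several of each. Let’s briefly discuss what each component does.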
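The chaining of the three components can be sketched in plain Python. This is only a toy model of the idea, not how ElasticSearch implements it: the function `analyze` and the stand-in stages `replace_ampersand` and `lowercase` are names made up for this illustration.

```python
from typing import Callable, List

# Type aliases for the three kinds of pipeline stages.
CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    """Run text through character filters, then the tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


# Hypothetical stand-in stages, chosen only to make the flow visible.
def replace_ampersand(text: str) -> str:
    return text.replace("&", "and")


def lowercase(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


terms = analyze("Tom & Jerry", [replace_ampersand], str.split, [lowercase])
print(terms)  # ['tom', 'and', 'jerry']
```

Each stage only sees the output of the stage before it, which is why the order of the components matters.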

Character filter

This component can add, remove or modify the characters present in the string.

Let’s understand this through an example below.

HTML strip character filter:

Think of a use-case where you store the contents of HTML files as string data in ElasticSearch. This filter removes HTML tags such as <b>, <h1> and <p> from the input string.
Example:
<p>Ironman is flying</p> would be modified to Ironman is flying

The modified string is then passed as input to the tokenizer.
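To get a feel for what such a character filter does, here is a rough Python approximation. It is a simplified sketch that assumes a regular expression is good enough for the example; the real filter handles more cases, and `strip_html` is a name invented here.

```python
import html
import re

# Matches anything that looks like an opening or closing tag, e.g. <p> or </b>.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html(text: str) -> str:
    """Remove tag-like substrings and decode HTML entities such as &amp;."""
    without_tags = TAG_PATTERN.sub("", text)
    return html.unescape(without_tags)


print(strip_html("<p>Ironman is <b>flying</b></p>"))  # Ironman is flying
```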

Tokenizer

A tokenizer takes a string and breaks it into multiple tokens. How the string is split depends on the type of tokenizer you use.

For example, the whitespace tokenizer splits the string wherever it encounters whitespace. Ironman is flying would result in the tokens: [Ironman, is, flying]

The output list of tokens is then passed to the token filter.
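Splitting on whitespace is easy to imitate in Python. The sketch below is only an approximation of the behaviour described above, and `whitespace_tokenize` is a made-up helper name.

```python
from typing import List


def whitespace_tokenize(text: str) -> List[str]:
    """Split on runs of whitespace; nothing else is changed or removed."""
    return text.split()


print(whitespace_tokenize("Ironman is flying"))   # ['Ironman', 'is', 'flying']
print(whitespace_tokenize("Ironman is flying!"))  # ['Ironman', 'is', 'flying!']
```

Note that punctuation stays attached to the token, because the only thing this kind of tokenizer looks at is whitespace.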

Token filter

This component adds, modifies or filters out the individual tokens generated in the previous step.

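Let’s consider the edge n-gram token filter. For each token, it produces prefixes of that token, with lengths bounded by the filter's min_gram and max_gram settings. When min_gram is 1 and max_gram is at least the length of the longest token, every possible prefix is produced.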

[Ironman, is, flying] would generate the following tokens:

[I, Ir, Iro, Iron, Ironm, Ironma, Ironman, i, is, f, fl, fly, flyi, flyin, flying]
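The prefix generation can be reproduced with a few lines of Python. This is a sketch of the idea rather than the actual filter: `edge_ngrams` is a hypothetical helper, its parameters borrow the names of the real `min_gram` and `max_gram` settings, and the default `max_gram` of 7 is picked here only so that the output matches the example (the real filter defaults to a `max_gram` of 2).

```python
from typing import List


def edge_ngrams(tokens: List[str], min_gram: int = 1, max_gram: int = 7) -> List[str]:
    """Emit the prefixes of each token whose length is between min_gram and max_gram."""
    result = []
    for token in tokens:
        longest = min(len(token), max_gram)
        for length in range(min_gram, longest + 1):
            result.append(token[:length])
    return result


print(edge_ngrams(["Ironman", "is", "flying"]))
# ['I', 'Ir', 'Iro', 'Iron', 'Ironm', 'Ironma', 'Ironman',
#  'i', 'is', 'f', 'fl', 'fly', 'flyi', 'flyin', 'flying']

print(edge_ngrams(["Ironman", "is", "flying"], max_gram=2))
# ['I', 'Ir', 'i', 'is', 'f', 'fl']
```

Indexing prefixes like this is what makes autocomplete-style matching possible: a partially typed word such as Iro already exists as a term.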

Note: The example presented here is very specific. There are many types of character filters, tokenizers and token filters to choose from, and you can combine them into a custom analyzer specific to your use-case.
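As a rough illustration, the three components used in this article could be combined into a custom analyzer with index settings along these lines. The settings are built as a Python dictionary and printed as JSON; sending them to a cluster is left out. `html_strip`, `whitespace` and `edge_ngram` are built-in component names, while `prefix_filter`, `html_prefix_analyzer` and the `max_gram` value of 10 are arbitrary choices made for this sketch.

```python
import json

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical name for a configured edge n-gram token filter.
                "prefix_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                }
            },
            "analyzer": {
                # Hypothetical analyzer name; the order mirrors the pipeline:
                # character filter -> tokenizer -> token filter.
                "html_prefix_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "whitespace",
                    "filter": ["prefix_filter"],
                }
            },
        }
    }
}

# This JSON would form the body of a create-index request.
print(json.dumps(index_settings, indent=2))
```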

How does ElasticSearch work?

Kalyan Kumar — Wed, 11 May 2022 04:13:18 +0000

ElasticSearch is a search engine technology used by many companies for use-cases such as full-text search, autocomplete suggestions and data analytics.

It is a document-oriented database. The high-level architecture of ElasticSearch is as follows (bottom-up):

Document

Documents are the basic units of information you can store in ElasticSearch. You store the data in the form of a JSON document. Every document has a unique ID, which the reverse (inverted) index uses to refer to the document (discussed at the end).
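For illustration, a document and its ID might look like this in Python. The field names and values are invented for this example and do not come from any real index.

```python
import json

# Hypothetical document ID and contents.
document_id = "1"
document = {
    "name": "Tony Stark",
    "alias": "Ironman",
    "skills": ["engineering", "flying"],
}

# The document is stored as JSON; the ID is what identifies it.
document_json = json.dumps(document)
print(document_id, document_json)
```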

Index

An index is a collection of documents of a similar type. It is analogous to a database in relational databases. When you query ElasticSearch, you run the query against a particular index (or a set of indices).

Node and Cluster

A node is a single running instance of ElasticSearch. An ElasticSearch cluster is a collection of nodes connected together.

Shard

An index in an ElasticSearch cluster does not need to live on a single node. It can be broken down into parts that are distributed among multiple nodes to achieve horizontal scalability. Each such part of an index is called a shard.
Example:

A users index has 10k documents which are split between node1 and node2. Each part of 5k documents is called a shard.
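One way to picture how documents end up in different shards is to hash each document ID and take the result modulo the number of shards. The sketch below does exactly that. It is a simplification: ElasticSearch uses its own hash function and routing rules, `zlib.crc32` is just a convenient stand-in, and `pick_shard` and `shard_to_node` are names made up here. The split will be roughly even, not exactly 5k each.

```python
import zlib
from collections import Counter


def pick_shard(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard number (stand-in hash, not the real one)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


# Hypothetical placement of the two shards of the users index.
shard_to_node = {0: "node1", 1: "node2"}

# Route 10k document IDs and count how many land on each shard.
counts = Counter(pick_shard(str(doc_id), 2) for doc_id in range(10_000))
for shard, count in sorted(counts.items()):
    print(f"shard {shard} on {shard_to_node[shard]}: {count} documents")
```

Because the shard is derived from the ID, the same document always lands on the same shard, so a lookup by ID knows where to go.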

Now that we understand the high-level architecture of ElasticSearch, let's look at how it works internally.

Reverse Indexing

ElasticSearch uses a special data structure called a reverse index (more commonly known as an inverted index). Every term in the data is mapped to the IDs of the documents that contain it.
Let's understand this through an example:

Consider two documents with IDs 1 and 2. When these documents are indexed (loaded into the index), every term in each document is extracted and mapped to that document's ID. Querying for a term is then like looking up a word in a dictionary: the term leads directly to the IDs of the matching documents. This is primarily what makes ElasticSearch very fast.
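The mapping from terms to document IDs can be sketched with a Python dictionary. The two documents below are hypothetical contents assumed for IDs 1 and 2, and splitting on whitespace plus lowercasing stands in for a real analyzer; `build_inverted_index` is a name invented for this sketch.

```python
from collections import defaultdict
from typing import Dict, Set

# Hypothetical contents for the documents with IDs 1 and 2.
documents = {
    1: "Ironman is flying",
    2: "Thor is flying fast",
}


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every term to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


inverted_index = build_inverted_index(documents)

# Querying a term is a single dictionary lookup.
print(sorted(inverted_index["flying"]))            # [1, 2]
print(sorted(inverted_index["ironman"]))           # [1]
print(sorted(inverted_index.get("hulk", set())))   # []
```

The cost of a lookup does not depend on scanning the documents themselves, which is the point of building the index up front.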

More articles will follow in this ElasticSearch series. Please leave a comment if you would like me to write about a specific topic.

DEV Community: Kalyan Kumar