Introduction: The Need for Speed in Search
In today's data-driven world, users expect instant access to information. Whether it's finding a product on an e-commerce site, searching through legal documents, or sifting through vast logs for an anomaly, every millisecond counts. Traditional relational databases, while excellent for structured queries and transactional integrity, often fall short when it comes to the complex, full-text search requirements of modern applications.
This is where Elasticsearch shines. At its core, Elasticsearch is a distributed, RESTful search and analytics engine capable of handling billions of documents and returning results in milliseconds. But what's the magic behind this speed? The answer lies in its foundational data structure: the Inverted Index.
In this deep dive, we'll peel back the layers of Elasticsearch to understand the inverted index: what it is, how it works, and why it's the cornerstone of Elasticsearch's remarkable performance. We'll also cover how it differs from traditional database indexing and why it's crucial for achieving lightning-fast, highly relevant search experiences.
The Problem with Traditional Database Search
Consider a typical relational database table, perhaps storing articles:
| id | title | content |
|---|---|---|
| 1 | The Quick Brown Fox | The quick brown fox jumps over the lazy dog. |
| 2 | Lazy Dog's Adventures | The lazy dog enjoys chasing squirrels in the park. |
| 3 | Squirrels and Foxes | Foxes and squirrels often cross paths in the wild. |
If you wanted to find articles containing the word "fox," you might write a SQL query like:
```sql
SELECT * FROM articles WHERE content LIKE '%fox%';
```
This query, while simple, is problematic for large datasets. The LIKE '%fox%' clause with a leading wildcard prevents the database from using a standard B-tree index. The database falls back to a full table scan, reading every row and checking the content column, which is painfully slow for millions of records. Even databases with built-in full-text search extensions often struggle to match the scale and feature set that dedicated search engines offer.
Enter the Inverted Index: A Paradigm Shift
Instead of mapping records to the words they contain (like a traditional index), an inverted index flips this relationship. It maps words to the documents (or articles, in our example) that contain them.
Let's take our example sentences and build a simple inverted index. First, we break down each document into individual terms, a process called tokenization, and normalize them (e.g., lowercase).
Document 1: "The quick brown fox jumps over the lazy dog."
Tokens: the, quick, brown, fox, jumps, over, lazy, dog
Document 2: "The lazy dog enjoys chasing squirrels in the park."
Tokens: the, lazy, dog, enjoys, chasing, squirrels, in, park
Document 3: "Foxes and squirrels often cross paths in the wild."
Tokens: foxes, and, squirrels, often, cross, paths, in, the, wild
Now, let's construct the inverted index:
| Term | Document List (Posting List) |
|---|---|
| and | 3 |
| brown | 1 |
| chasing | 2 |
| cross | 3 |
| dog | 1, 2 |
| enjoys | 2 |
| fox | 1 |
| foxes | 3 |
| in | 2, 3 |
| jumps | 1 |
| lazy | 1, 2 |
| often | 3 |
| over | 1 |
| park | 2 |
| paths | 3 |
| quick | 1 |
| squirrels | 2, 3 |
| the | 1, 2, 3 |
| wild | 3 |
Notice a few things:
- Each unique term is listed once.
- Next to each term is a "posting list" – a list of document IDs where that term appears.
- Terms like "fox" and "foxes" are treated separately here, but in a real-world Elasticsearch setup, an analyzer could stem them to a common root (fox) if desired.
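To make the mapping concrete, here's a minimal Python sketch that builds the same inverted index from the three example documents. The `tokenize` helper is a crude stand-in for a real analyzer (lowercase plus splitting on non-letters), not Elasticsearch's actual implementation:

```python
import re
from collections import defaultdict

def tokenize(text):
    # Toy analyzer: lowercase the text and keep runs of letters as terms.
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    # Map each unique term to the sorted list of document IDs containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "The quick brown fox jumps over the lazy dog.",
    2: "The lazy dog enjoys chasing squirrels in the park.",
    3: "Foxes and squirrels often cross paths in the wild.",
}
index = build_inverted_index(docs)
print(index["dog"])        # [1, 2]
print(index["squirrels"])  # [2, 3]
```

Running this reproduces the posting lists from the table above: a single pass over the documents at index time buys you direct term lookups at query time.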
How Elasticsearch Uses the Inverted Index for Speed
When you query Elasticsearch for "fox," it doesn't scan documents. Instead, it performs a fast lookup in the inverted index's sorted term dictionary, which immediately returns Document 1. If you searched for "lazy dog," it would:

- Look up "lazy" → [1, 2]
- Look up "dog" → [1, 2]
- Perform an intersection of the two posting lists: [1, 2] ∩ [1, 2] = [1, 2]
- Return Document 1 and Document 2.
This process is incredibly fast because it's operating on sorted lists of document IDs, making intersections highly efficient.
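The intersection can be sketched as a linear merge over two sorted posting lists. This is the classic textbook algorithm; Lucene's real implementation layers skip lists and compressed blocks on top of the same idea:

```python
def intersect_postings(a, b):
    # Merge-style intersection of two sorted posting lists in O(len(a) + len(b)).
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1   # advance the list with the smaller current doc ID
        else:
            j += 1
    return result

print(intersect_postings([1, 2], [1, 2]))        # [1, 2]
print(intersect_postings([1, 3, 5], [2, 3, 5, 8]))  # [3, 5]
```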
Beyond Simple Lookups: Position and Frequency
Real-world search is more complex than just knowing if a word exists in a document. We need to know:
- How often a word appears (term frequency)
- Where it appears (position)
To achieve this, the inverted index stores more than just document IDs in its posting lists. It also includes:
- Term Frequency (TF): How many times a term appears in a document. This is crucial for relevance scoring (e.g., a document mentioning "Elasticsearch" ten times is likely more relevant than one mentioning it once).
- Term Position: The positions within the document where the term occurs. This is vital for phrase queries ("quick brown fox") and proximity searches (words appearing near each other).
Let's enhance our inverted index fragment:
| Term | Document (ID, TF, Positions) |
|---|---|
| fox | 1: (TF=1, Positions: [3]) |
| lazy | 1: (TF=1, Positions: [7]), 2: (TF=1, Positions: [1]) |
| dog | 1: (TF=1, Positions: [8]), 2: (TF=1, Positions: [2]) |
| quick | 1: (TF=1, Positions: [1]) |
| brown | 1: (TF=1, Positions: [2]) |
Now, if we search for the phrase "quick brown fox":
- Elasticsearch finds documents containing "quick" at some position X, "brown" at X+1, and "fox" at X+2.
- It quickly identifies Document 1 as a match because "quick" is at position 1, "brown" at 2, and "fox" at 3.
This additional metadata, while increasing the storage footprint of the index itself, drastically improves the speed and accuracy of complex full-text searches.
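The position check above can be sketched as a small Python function. This is a toy verifier, not Lucene's actual phrase scorer, and the posting layout (term → document → positions) mirrors the fragment above:

```python
def phrase_match(postings, doc_id, terms):
    # postings: term -> {doc_id: [positions]}.
    # Returns True if the terms appear at consecutive positions in doc_id.
    try:
        position_lists = [postings[t][doc_id] for t in terms]
    except KeyError:
        return False  # some term is absent from this document
    first, rest = position_lists[0], position_lists[1:]
    # For some start position p of the first term, each following term
    # must occur at p+1, p+2, ...
    return any(
        all(p + offset + 1 in plist for offset, plist in enumerate(rest))
        for p in first
    )

# Positions for Document 1, taken from the fragment above.
postings = {
    "quick": {1: [1]},
    "brown": {1: [2]},
    "fox":   {1: [3]},
}
print(phrase_match(postings, 1, ["quick", "brown", "fox"]))  # True
```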
The Indexing Pipeline: From Raw Text to Inverted Index
How does raw text transform into this optimized inverted index? Elasticsearch uses a sophisticated indexing pipeline:
- Document Arrival: A new document (e.g., a JSON object) is sent to an Elasticsearch coordinating node.
- Shard Routing: The coordinating node determines which primary shard the document belongs to, typically using a hash of the document's ID: shard = hash(document_id) % num_primary_shards.
- Write Buffer: The document is temporarily stored in an in-memory buffer on the data node hosting the target primary shard.
- Analysis: The document's text fields undergo an "analysis" process. This is where text is transformed:
- Character Filters: Clean up text (e.g., remove HTML tags, convert special characters).
- Tokenizer: Breaks the text into individual terms (tokens). The "standard" tokenizer splits on whitespace and punctuation.
- Token Filters: Process tokens (e.g., lowercase them, remove stop words like "the", apply stemming to reduce words to their root form like "fishing" -> "fish", handle synonyms, etc.). This analyzed output is what forms the terms in the inverted index.
- Segment Creation: Periodically (every 1 second by default), the terms in the in-memory write buffer are flushed to disk, creating a new "segment." A segment is a mini inverted index, immutable once written. This is when the data becomes searchable; the process is called a "refresh." For example:

```json
POST /my-index/_doc
{ "title": "Elasticsearch is fast" }
```

This single document hits the write buffer, gets analyzed (e.g., into "elasticsearch", "fast"), and after a refresh those terms appear in a new segment.
- Replication: For high availability and read scalability, Elasticsearch replicates the document to its replica shards.
- Commit: To ensure durability, segments are periodically "flushed." This involves writing all in-memory segments to a new commit point on disk and creating a commit file that lists all known segments. The transaction log (translog) is also flushed, ensuring that even if the system crashes, no data is lost between flushes.
- Merging: Over time, many small segments accumulate, and searching across numerous small segments is less efficient. Elasticsearch merges these smaller segments into larger ones in the background. This process is resource-intensive but crucial for maintaining search performance.
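The analysis stage of this pipeline can be sketched as a chain of plain functions. The stop-word list and the suffix-stripping "stemmer" here are crude illustrative stand-ins, not Elasticsearch's actual filter implementations:

```python
import re

# Illustrative stop words; real analyzers use configurable, language-aware lists.
STOP_WORDS = {"the", "a", "an", "and", "in", "over", "is"}

def char_filter(text):
    # Character filter: strip HTML tags.
    return re.sub(r"<[^>]+>", " ", text)

def standard_tokenizer(text):
    # Tokenizer: split on whitespace and punctuation.
    return re.findall(r"\w+", text)

def token_filters(tokens):
    # Token filters: lowercase, drop stop words, naive plural stripping.
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        if tok.endswith("es"):
            tok = tok[:-2]   # "foxes" -> "fox" (crude, not a real stemmer)
        elif tok.endswith("s"):
            tok = tok[:-1]
        out.append(tok)
    return out

def analyze(text):
    return token_filters(standard_tokenizer(char_filter(text)))

print(analyze("<p>The quick brown Foxes</p>"))  # ['quick', 'brown', 'fox']
```

The output of this chain is exactly what gets written into the inverted index as terms, which is why a query analyzed the same way can find the document.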
Importance of Immutability
Each segment in Elasticsearch is immutable. Once written to disk, it cannot be changed. When a document is updated, Elasticsearch doesn't modify the existing segment. Instead, it marks the old document as deleted (logically, not physically) and indexes the new version of the document into a new segment. Deleted documents are eventually removed during segment merging.
This immutability offers significant advantages:
- Concurrency: No need for complex locking mechanisms when reading segments, as they never change.
- Caching: Segments can be aggressively cached by the operating system, improving read performance.
- Reliability: Simpler to manage and recover from failures.
Relevance Scoring: Beyond Just Existence
The inverted index provides the raw material for finding documents, but how does Elasticsearch rank them? This is where relevance scoring comes in, and the inverted index's detailed term information is invaluable.
Elasticsearch uses a scoring algorithm called BM25 (Best Match 25) by default. While a full explanation of BM25 is beyond this post, it primarily considers:
- Term Frequency (TF): How often a term appears in a document. More occurrences generally mean higher relevance.
- Inverse Document Frequency (IDF): How rare a term is across all documents. Rarer terms are more significant.
- Field Length: Documents where a term appears in a shorter field (e.g., title) are often considered more relevant than if it appears in a very long field.
The inverted index provides the Term Frequency and Term Position directly. The IDF can be calculated across all segments. By combining these factors, BM25 assigns a score to each matching document, allowing Elasticsearch to return the most relevant results first.
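As a rough illustration, here is the textbook BM25 formula for a single term as a Python function, using the k1=1.2 and b=0.75 defaults Lucene documents. Real Elasticsearch scoring also folds in boosts, encoded field norms, and per-segment statistics, so treat this as a sketch of the shape of the formula, not the exact numbers Elasticsearch produces:

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    # IDF: rarer terms (small doc_freq relative to n_docs) score higher.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # TF with saturation and length normalization: repeated occurrences help,
    # with diminishing returns, and long documents are penalized via b.
    norm_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# Toy numbers: "fox" appears once in a 9-token document of average length,
# in 1 of 3 documents (hypothetical corpus statistics).
score = bm25_term_score(tf=1, doc_len=9, avg_doc_len=9, n_docs=3, doc_freq=1)
print(round(score, 3))
```

Doubling the term frequency raises the score, but by less than 2x, which is the saturation behavior that keeps keyword-stuffed documents from dominating.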
Conclusion: The Unsung Hero of Fast Search
The inverted index is more than just an internal detail; it's fundamental to understanding why Elasticsearch excels at search. It's a clever reversal of traditional indexing, transforming the challenge of full-text search into an efficient lookup and set intersection problem.
From its role in fast lookups and phrase queries to providing the necessary statistics for sophisticated relevance scoring, the inverted index empowers Elasticsearch to deliver the sub-millisecond search experiences that users now demand. By understanding this core component, you gain valuable insight into how to design, optimize, and troubleshoot your Elasticsearch deployments for peak performance and accuracy.
So next time you perform a lightning-fast search, take a moment to appreciate the unsung hero working tirelessly beneath the surface: the mighty inverted index.
About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and open-source enthusiast. Follow my work on GitHub.