Introduction: Why Elasticsearch?
You've probably heard of Elasticsearch. Maybe you've used it for log analytics with the ELK stack, or perhaps you've seen it power lightning-fast search on e-commerce sites. It's the go-to solution for full-text search, real-time analytics, and geospatial queries at scale.
But here's the thing: Elasticsearch doesn't do the heavy lifting alone.
Strip away the distributed architecture, the REST APIs, and the cluster management — and you'll find Apache Lucene, a battle-tested Java library that's been quietly revolutionizing search since 1999.
Elasticsearch, OpenSearch, and Solr? They're all essentially distributed Lucene clusters with orchestration and APIs wrapped around them.
So if you really want to understand how Elasticsearch works — how it finds documents in milliseconds, how it ranks results by relevance, how it handles millions of writes without breaking a sweat — you need to understand Lucene.
This blog is my journey into Lucene's internals. Let's dive deep.
The Problem: Why Traditional Databases Fail at Search
Imagine you're building a hotel booking platform. You have millions of hotel listings, and users want to search:
- "Hotels with rooftop pools in Dhaka"
- "Luxury spa resorts near the beach"
- "Budget hostels with free WiFi"
You try PostgreSQL:
SELECT * FROM hotels
WHERE description LIKE '%rooftop%'
AND description LIKE '%pool%'
AND city = 'Dhaka';
What happens?
Your database scans every single row, checking whether the description contains those words. A regular B-tree index can't help with arbitrary LIKE '%term%' patterns. It's slow, it doesn't rank by relevance, and it doesn't handle typos or synonyms.
Now, you might argue, "What about PostgreSQL's advanced features?" And you'd be right. PostgreSQL offers ILIKE for case-insensitivity and a powerful Full-Text Search engine using tsvector and tsquery. This approach uses special GIN indexes for speed, supports stemming (finding 'pools' when searching for 'pool'), and even provides basic relevance ranking.
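Here's a minimal sketch of what that looks like, assuming the same hotels table (the index name and ranking query are illustrative, not from a real schema):

-- Expression index so the full-text query can use GIN instead of scanning
CREATE INDEX hotels_description_fts
  ON hotels USING GIN (to_tsvector('english', description));

-- Stemmed matching plus a basic relevance score
SELECT *, ts_rank(to_tsvector('english', description), query) AS rank
FROM hotels, to_tsquery('english', 'rooftop & pool') AS query
WHERE to_tsvector('english', description) @@ query
ORDER BY rank DESC;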
However, even PostgreSQL's native search has limits. At massive scale, its performance can lag. Its relevance ranking is basic compared to advanced algorithms like BM25. And it lacks built-in features for handling typos (fuzzy search), complex language analysis, and real-time indexing updates.
This is where Lucene enters the picture.
Lucene's mission: Enable fast, accurate, relevance-based search across massive text collections.
The Core Idea: Inverted Index
To understand Lucene, you first need to understand the inverted index — the data structure that makes search fast.
Traditional (Forward) Index vs Inverted Index
A traditional database stores data like this:
Doc1 → "Rooftop pool and bar"
Doc2 → "Luxury hotel with rooftop view"
Doc3 → "Pool near airport"
To search, you scan each document checking for your term. O(N) — slow.
An inverted index flips this:
bar → [Doc1]
hotel → [Doc2]
luxury → [Doc2]
pool → [Doc1, Doc3]
rooftop → [Doc1, Doc2]
view → [Doc2]
Now searching for "rooftop AND pool" means:
- Lookup rooftop → [Doc1, Doc2]
- Lookup pool → [Doc1, Doc3]
- Intersect them → [Doc1]
Two cheap term-dictionary lookups plus a merge of two short sorted lists, instead of scanning every document. This is Lucene's magic.
Posting Lists: More Than Just Document IDs
In reality, Lucene's posting lists store much more:
term: "rooftop"
└─ [
{docID: 1, frequency: 1, positions: [2]},
{docID: 2, frequency: 1, positions: [5]}
]
- docID: which document contains the term
- frequency: how many times it appears (for scoring)
- positions: where in the document (for phrase queries like "rooftop pool")
These posting lists are sorted by docID — crucial for efficient boolean operations (AND, OR, NOT).
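Positions are what power phrase matching. For example, a phrase query (shown here in Elasticsearch syntax against a hypothetical hotels index) only matches documents where the terms appear adjacent and in order:

GET /hotels/_search
{
  "query": {
    "match_phrase": { "description": "rooftop pool" }
  }
}

Under the hood, Lucene intersects the posting lists for both terms and then checks that their positions line up one after the other.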
Analysis Pipeline: From Text to Terms
Before building an inverted index, Lucene needs to convert raw text into searchable terms. This is where analyzers come in.
The Analysis Chain
Input: "Hotels in Dhaka City"
↓
Tokenizer: ["Hotels", "in", "Dhaka", "City"]
↓
LowercaseFilter: ["hotels", "in", "dhaka", "city"]
↓
StopwordFilter: ["hotels", "dhaka", "city"]
↓
Stemming: ["hotel", "dhaka", "city"]
Only the final tokens become terms in the inverted index.
Critical insight: At search time, Lucene runs the same analyzer on your query. This ensures "Hotels" in a query matches "hotel" in the index.
Why This Matters
Without proper analysis:
- "Hotel" wouldn't match "hotels"
- "running" wouldn't match "run"
- Case differences would break searches
Let's see this in action. In Kibana Dev Console:
POST _analyze
{
"analyzer": "standard",
"text": "Running through the Hotels in Paris"
}
Response:
{
"tokens": [
{"token": "running", "position": 0},
{"token": "through", "position": 1},
{"token": "the", "position": 2},
{"token": "hotels", "position": 3},
{"token": "in", "position": 4},
{"token": "paris", "position": 5}
]
}
Notice: everything's lowercased, but not stemmed (that requires a different analyzer). Stopwords like "the" remain because the standard analyzer doesn't remove them by default.
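To see stemming and stopword removal, run the same text through the english analyzer:

POST _analyze
{
  "analyzer": "english",
  "text": "Running through the Hotels in Paris"
}

This time you should get stemmed tokens like run and hotel, with stopwords such as "the" and "in" dropped.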
Segments: Lucene's Secret to Fast Writes
Here's something that surprised me when I first learned it: Lucene never updates data in place.
The Segment Architecture
When you add documents to Lucene, it doesn't append to a giant monolithic index. Instead, it creates segments — small, immutable mini-indexes.
/index/
├── segment_1/
│ ├── .tim (term dictionary)
│ ├── .doc (posting lists)
│ ├── .fdt (stored fields)
│ ├── .dvd (doc values)
│ └── .si (segment metadata)
├── segment_2/
│ └── ...
└── segment_3/
└── ...
Each segment is a complete, standalone inverted index with its own:
- Term dictionary
- Posting lists
- Stored fields (original data)
- Doc values (for sorting/aggregations)
Why Immutable Segments?
Problem: Updating data in place requires locks, complex coordination, and is crash-prone.
Lucene's solution: Append-only, immutable segments.
- Adding documents? Write to a new segment.
- Deleting documents? Mark them in a .del file; don't remove them.
- Updating documents? Delete + Add (atomically).
The Document Lifecycle
Let me walk you through what happens when you index a document:
1. Document arrives → Analyzer breaks it into tokens
2. Tokens buffered in RAM (DocumentsWriterPerThread)
3. When buffer fills (~16MB) → Flush to disk as new segment
4. Segment becomes searchable after "refresh" (default: 1 second)
5. Background merge process combines small segments
Here's the real beauty: searches never block writes. While you're indexing new documents, queries run on the existing segments. When a new segment is ready, it's atomically added to the searchable set.
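Because refresh controls when new segments become visible, it's also a common tuning knob. For a heavy bulk load you can relax it on an existing index (index name and value here are just examples):

PUT /hotels/_settings
{
  "index": { "refresh_interval": "30s" }
}

Setting it to -1 disables automatic refresh entirely until you trigger one manually.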
Segment Merging
Over time, you accumulate many small segments:
segment_1 (10 docs, 2 deleted)
segment_2 (8 docs)
segment_3 (5 docs, 1 deleted)
A background merge process combines them:
segment_4 (20 live docs) ← merged, deleted docs physically removed
This keeps search fast (fewer segments to scan) and reclaims disk space.
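Merging normally happens on its own, but you can force it, for example after a one-off bulk load into an index that will no longer receive writes (index name is hypothetical):

POST /hotels/_forcemerge?max_num_segments=1

Force merging is generally recommended only for indices that are no longer being written to.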
Practical Verification
Let's see segments in action. Create an index and add documents:
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"refresh_interval": "1s"
}
}
POST /test_index/_doc
{"text": "First document"}
POST /test_index/_doc
{"text": "Second document"}
Check segments:
GET /test_index/_segments
You'll see segment details including:
- Number of documents
- Deleted document count
- Size on disk
- Generation number
Scoring: How Lucene Ranks Results
Finding matching documents is easy — ranking them by relevance is where Lucene shines.
It uses the BM25 algorithm, an evolution of TF-IDF and the default similarity since Lucene 6, to score how well each document matches your query.
In simple terms, a document ranks higher when:
- The search term appears frequently within it (Term Frequency)
- The term is rare across all documents (Inverse Document Frequency)
- The document isn’t excessively long (Length Normalization)
TL;DR — Lucene rewards documents that mention your query terms often, use rarer words, and get to the point.
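For reference, the textbook BM25 formula looks roughly like this (Lucene's defaults are k1 = 1.2 and b = 0.75):

score(D, Q) = sum over query terms t of:
  IDF(t) * (tf(t, D) * (k1 + 1)) / (tf(t, D) + k1 * (1 - b + b * |D| / avgdl))

IDF(t) = ln(1 + (N - df(t) + 0.5) / (df(t) + 0.5))

Here tf(t, D) is how often t appears in document D, |D| is the document length, avgdl the average document length, N the total number of documents, and df(t) how many documents contain t.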
You can peek inside the scoring math directly:
GET /test_index/_search
{
"query": { "match": { "text": "lucene search" } },
"explain": true
}
Elasticsearch will show exactly how Lucene calculated each score — TF, IDF, and normalization factors included.
That’s how it knows which “search” result feels most relevant to you.
Doc Values: The Secret Behind Fast Aggregations and Sorting
Lucene’s inverted index (term → docIDs) is great for finding text matches — but it’s terrible for things like sorting or aggregations, which need docID → field_value.
That’s where Doc Values come in.
They store field values in a columnar format on disk:
docID | price | rating
1 | 120 | 4.5
2 | 85 | 4.8
3 | 200 | 4.2
This structure lets Elasticsearch:
- Sort results by numeric fields (like price or date)
- Run aggregations (avg, sum, percentiles) efficiently
- Keep memory low by using OS-level memory mapping
So when you run a query like:
GET /hotels/_search
{
"size": 0,
"aggs": { "avg_price": { "avg": { "field": "price" } } }
}
Lucene doesn’t load every document — it simply scans the Doc Values column for price, making aggregations blazing fast.
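Sorting leans on the same structure. A query like this one (field names are illustrative) reads the price column from doc values to order the hits:

GET /hotels/_search
{
  "query": { "match": { "description": "rooftop pool" } },
  "sort": [ { "price": "asc" } ]
}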
In short: Inverted index finds → Doc Values calculate.
Together, they make Elasticsearch both smart and scalable.
Elasticsearch: Distributed Lucene
Now that you understand Lucene, Elasticsearch makes perfect sense: it's a distributed system for managing many Lucene indexes.
The Architecture
Cluster
├── Node 1 (Master)
├── Node 2 (Data)
│ ├── Shard 0 (primary) ← Lucene index
│ └── Shard 1 (replica) ← Lucene index
└── Node 3 (Data)
├── Shard 1 (primary) ← Lucene index
└── Shard 0 (replica) ← Lucene index
Key concepts:
- Cluster: One or more Elasticsearch nodes
- Node: A running Elasticsearch instance
- Index: A logical collection of documents
- Shard: A subset of an index's data — each shard is a Lucene index
- Replica: A copy of a primary shard for redundancy
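You choose both counts when creating the index; only the primary shard count is fixed afterwards, replicas can be changed later. The numbers here are just an example:

PUT /hotels
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}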
Indexing Flow
When you index a document:
- Request hits any node → becomes coordinating node
- Hash of _id determines the target shard: hash(_id) % num_primary_shards
- Request routed to the node holding that primary shard
- Primary shard (Lucene) indexes the document
- Changes replicated to replica shards
- After refresh (1s default), document becomes searchable
Query Flow
When you search:
- Request hits any node → becomes coordinating node
- Query broadcasted to all relevant shards (primary or replica)
- Each shard (Lucene) executes the query independently
- Results merged by coordinating node
- Global top-K results returned
This is the fan-out/fan-in pattern — queries run in parallel across shards.
The Routing Hash is Forever
Here's a critical detail I learned the hard way: the number of primary shards is fixed at index creation.
Why? Because routing depends on: hash(_id) % num_primary_shards
If you change the number of shards, the hash function breaks — documents would route to the wrong shards.
To scale beyond your initial shard count, you must reindex into a new index with more shards.
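The reindex itself is one API call, assuming you have already created the new index with the larger shard count (index names here are hypothetical):

POST _reindex
{
  "source": { "index": "hotels" },
  "dest": { "index": "hotels_v2" }
}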
Cluster Check
curl -H "Authorization: ApiKey $ES_LOCAL_API_KEY" \
$ES_LOCAL_URL/_cat/nodes?v
Response:
ip heap.percent ram.percent cpu load_1m node.role master name
127.0.0.1 45 78 8 0.50 cdfhilmrstw * node-1
The node.role letters map to roles: c (cold), d (data), f (frozen), h (hot), i (ingest), l (ml), m (master-eligible), r (remote cluster client), s (content), t (transform), w (warm).
The * indicates this is the elected master node.
Refresh, Flush, and Merge: The Triangle of Durability
One of the trickiest aspects of Lucene/Elasticsearch is understanding when data becomes searchable and durable.
Refresh (Near Real-Time Search)
- Frequency: Every 1 second (default)
- Action: In-memory indexing buffer → new segment in the filesystem cache, opened for search
- Result: New documents visible in search results
But data isn't durable yet — it's in the filesystem cache, not fsync'd.
Flush (Durability)
- Frequency: Every 30 minutes or when translog gets large
- Action: Forces fsync to disk, clears translog
- Result: Data is now crash-safe
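Both refresh and flush can also be triggered manually, which is handy when experimenting:

POST /test_index/_refresh
POST /test_index/_flush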
Merge (Compaction)
- Frequency: Continuous background process
- Action: Combines small segments, removes deleted documents
- Result: Better query performance, reclaimed disk space
The Translog
Between flushes, Elasticsearch maintains a transaction log (translog):
- Every write is appended to the translog
- On crash, the translog replays writes since the last flush
- This gives durability without performing an expensive full Lucene commit on every write
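The translog's behaviour is configurable per index. As a sketch (the index name is hypothetical; by default Elasticsearch fsyncs the translog on every request):

PUT /logs_fast
{
  "settings": {
    "index.translog.durability": "async",
    "index.translog.sync_interval": "5s"
  }
}

With async durability you trade safety for speed: a crash can lose up to sync_interval's worth of acknowledged writes.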
Questions That Still Intrigue Me
The deeper I go, the more questions I find myself asking — the fun kind that keep you curious:
- How are skip pointers actually stored in Lucene’s posting lists, and when do they help or slow things down?
- How do BKD trees manage huge numeric or geo datasets, and why are they sometimes faster than inverted indexes?
- After a crash, how does Elasticsearch replay translog operations without redoing already-flushed data?
- What logic decides which node gets a new shard or when data should rebalance across the cluster?
- If Elasticsearch is “schemaless,” why do we still define mappings — and how flexible is it, really?
- What’s the best way to paginate through millions of results without performance falling off a cliff?
- How do aggregations stay fast when the data is massive and spread across many shards?
- How does the cardinality aggregation guess unique counts so accurately with so little memory?
- When should segments merge, and can tuning that ever make indexing noticeably faster?
There’s so much more beneath each of these.
Note: This post just scratches the surface — every one of these questions could be a full deep dive on its own.
I’ll keep learning and hope to write more as I explore further. If you’ve experimented with any of these — drop a comment, I’d love to compare notes.
References & Further Reading
- Lucene Core Documentation
- Elasticsearch from the Bottom Up
- Visualizing Lucene's Segment Merges
- BM25 Scoring in Lucene
- Lucene Tutorial
- Ishan Upamanyu — 7 Lucene Concepts
- Apache Lucene Blog
- Stanford CS276: Lucene Slides
- Mike McCandless: Segment Merges
This post is part of my ongoing learning journey about Elasticsearch internals.
If you spot anything I misunderstood — please comment! I’m learning, too. 💬