Prithvi S

Posted on Jun 9 • Originally published at dev.to

Apache Lucene: Everything You Need to Know About the World's Most Used Search Engine

#java #lucene #search #elasticsearch

A learning journey from zero to production-ready search.

What is Lucene?

Imagine you have a million documents - product descriptions, log files, legal contracts, tweets - and you need to find the ones about "machine learning" that also mention "production deployment" and were written in 2024. A database LIKE '%machine learning%' query would scan every row and take forever. You need something smarter.

Apache Lucene is that something smarter. It's a high-performance, full-featured text search engine library written in Java. It solves one core problem: given a collection of documents, find the ones that match a query, ranked by relevance, fast.

Why Lucene Matters

Lucene isn't just another library. It's the foundation of the search industry:

Elasticsearch - built directly on Lucene
OpenSearch (AWS fork) - built on Lucene
Apache Solr - built on Lucene
MongoDB Atlas Search - Lucene under the hood
Neo4j Full-Text Search - Lucene
Couchbase Full-Text Search - Lucene

Every time you search on LinkedIn, GitHub, Netflix, or Wikipedia, Lucene is likely involved somewhere in the stack.

What Problems Does Lucene Solve?

Full-text search - Find documents containing specific words, phrases, or patterns
Relevance ranking - Score results so the most relevant appear first
Boolean queries - Combine conditions (AND, OR, NOT) efficiently
Range queries - Find numbers/dates within ranges (fast, using BKD trees)
Faceting & aggregation - Count occurrences, group results
Geospatial search - Find points within distance, polygons, etc.
Vector search - Semantic search with embeddings (KNN/HNSW)

Who Should Read This Guide?

Backend engineers building search features
Data engineers working with Elasticsearch/OpenSearch
Software architects choosing search infrastructure
Students learning information retrieval
Contributors wanting to understand Lucene internals

You should know Java basics and understand that computers store data in files. Everything else, we'll build from the ground up.

What is an Inverted Index?

This is the fundamental concept. Everything in Lucene is built around this idea. Understand this, and you understand 50% of Lucene.

The "Forward" Index (What You'd Build Naturally)

If you were storing documents in a database, you'd probably do this:

Document 1: "The cat sat on the mat"
Document 2: "The dog sat on the log"  
Document 3: "The cat chased the dog"

This is a forward index - you know the document, and you can read its words. But to find which documents contain "cat", you'd have to scan every document. O(n) time. Slow.

The Inverted Index (Lucene's Core Insight)

An inverted index flips this around: word → list of documents.

"cat"    → [Doc 1, Doc 3]
"dog"    → [Doc 2, Doc 3]
"sat"    → [Doc 1, Doc 2]
"on"     → [Doc 1, Doc 2]
"mat"    → [Doc 1]
"log"    → [Doc 2]
"chased" → [Doc 3]
"the"    → [Doc 1, Doc 2, Doc 3]

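Now finding documents with "cat" takes a single dictionary lookup instead of a scan: look up the word, get the list. The cost depends on the term, not on how many documents you have. That's why it's called inverted - the relationship is inverted from document→words to word→documents.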
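To make the idea concrete, here is a minimal sketch of an inverted index in plain Java, built from the three example documents. The class and method names (InvertedIndexSketch, add, lookup) are made up for illustration and are not Lucene APIs; real Lucene stores this structure in compressed on-disk files rather than a TreeMap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: term -> sorted list of document IDs (illustrative only).
final class InvertedIndexSketch {
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Split on non-letters, lowercase, and record each term once per document.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            List<Integer> docs = postings.computeIfAbsent(token, t -> new ArrayList<>());
            // Documents arrive in increasing ID order, so checking the tail avoids duplicates.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    // One dictionary lookup; no document text is scanned.
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), List.of());
    }

    // The three example documents from this section.
    static InvertedIndexSketch example() {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "The cat sat on the mat");
        index.add(2, "The dog sat on the log");
        index.add(3, "The cat chased the dog");
        return index;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = example();
        System.out.println("cat -> " + index.lookup("cat")); // cat -> [1, 3]
        System.out.println("sat -> " + index.lookup("sat")); // sat -> [1, 2]
    }
}
```

The work of scanning every document happens once, at index time; each search afterwards is just a map lookup.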

Why It's Called "Inverted"

Think of a book index at the back. Instead of reading page by page (forward), you look up a word and get page numbers (inverted). Lucene's inverted index is that concept, but for millions of documents, with frequencies, positions, and scores.

The Full Posting (What Lucene Actually Stores)

For each term, Lucene doesn't just store document IDs. It stores a posting - a rich record:

"cat" → [
  {doc: 1, freq: 1, positions: [2]},    // "cat" appears once in Doc 1, at word position 2
  {doc: 3, freq: 1, positions: [2]}     // "cat" appears once in Doc 3, at word position 2
]

"sat" → [
  {doc: 1, freq: 1, positions: [3]},
  {doc: 2, freq: 1, positions: [3]}
]

"the" → [
  {doc: 1, freq: 2, positions: [1, 5]},  // "the" appears TWICE in Doc 1
  {doc: 2, freq: 2, positions: [1, 5]},
  {doc: 3, freq: 2, positions: [1, 4]}   // "The cat chased the dog": second "the" is word 4
]

Why this matters: With frequencies, Lucene can rank documents (more occurrences = more relevant). With positions, Lucene can do phrase queries ("cat sat" means "cat" at position 2, "sat" at position 3). With offsets, Lucene can highlight exact text regions.
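As a rough illustration of why positions enable phrase queries, the sketch below checks whether two terms are adjacent in a document. It is a simplified model under stated assumptions: the postings are hard-coded from the example documents, positions are 1-based as in the listing above, and PhraseMatchSketch is a hypothetical name, not how Lucene's PhraseQuery is implemented.

```java
import java.util.List;
import java.util.Map;

// Toy phrase matcher over positional postings: term -> (docId -> positions).
final class PhraseMatchSketch {
    // Hard-coded from the three example documents; positions are 1-based word offsets.
    static final Map<String, Map<Integer, List<Integer>>> POSTINGS = Map.of(
        "cat", Map.of(1, List.of(2), 3, List.of(2)),
        "sat", Map.of(1, List.of(3), 2, List.of(3)),
        "chased", Map.of(3, List.of(3)));

    // A document matches the phrase "first second" when some occurrence of
    // `second` sits exactly one position after an occurrence of `first`.
    static boolean matchesPhrase(int docId, String first, String second) {
        List<Integer> firstPositions =
            POSTINGS.getOrDefault(first, Map.of()).getOrDefault(docId, List.of());
        List<Integer> secondPositions =
            POSTINGS.getOrDefault(second, Map.of()).getOrDefault(docId, List.of());
        for (int position : firstPositions) {
            if (secondPositions.contains(position + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesPhrase(1, "cat", "sat")); // true:  "The cat sat on the mat"
        System.out.println(matchesPhrase(3, "cat", "sat")); // false: Doc 3 has no "sat"
    }
}
```

Without stored positions, the best the index could say is that both words occur somewhere in the document.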

The Core Components

Before we write code, let's understand each building block conceptually. Think of these as the cast of characters in Lucene's story.

Document - The Container

A Document is a collection of named fields. Think of it like a JSON object or a database row.

Document {
  "title": "Lucene in Action",
  "author": "Michael McCandless",
  "year": 2024,
  "body": "Lucene is a search library..."
}

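Analogy: A row in a spreadsheet. Each document gets an internal ID (docID) when indexed - 0, 1, 2, 3, etc. Treat it as internal: docIDs can change when segments merge, so store your own key field if you need a stable identifier.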

Field - The Typed Attribute

A Field is a named, typed value within a document. Fields are the most important decision in Lucene - how you define a field determines how it can be searched.

Field "title" = TextField (analyzed, searchable, optionally stored)
Field "author" = StringField (exact match, not analyzed)
Field "year" = IntPoint (numeric, for range queries)
Field "price" = NumericDocValuesField (for sorting/faceting)

Analogy: A column in a database, but each column can have different storage and indexing rules.

Why this matters: Lucene has no schema enforcement, but you must be consistent. If you index "year" as text in one document and as a number in another, you can't do range queries properly.

Analyzer - The Text Processor

The Analyzer transforms raw text into terms (tokens) that go into the inverted index. It's the bridge between human language and Lucene's data structures.

Input:  "The quick brown foxes jump!"
Output: ["quick", "brown", "fox", "jump"]

Notice what happened:

"The" was lowercased, then removed (stop word)
"quick" stayed (already lowercase)
"brown" stayed
"foxes" became "fox" (stemming)
"jump!" became "jump" (punctuation removed)

Analogy: A translator that converts human text into Lucene's vocabulary. Same words must map to same terms, or search won't work.

Why this matters: The analyzer runs at index time (when writing) and usually again at query time (when searching). If they don't match, you search for "fox" but indexed "foxes" - no results. This is the #1 cause of "why doesn't my search work?"
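The mismatch problem is easier to see with a toy pipeline. The sketch below is plain Java, not a Lucene Analyzer: the stop-word set and the plural-stripping rule are deliberately naive assumptions, and ToyAnalyzerSketch is a hypothetical name. It only shows that index-time and query-time text must pass through the same steps to land on the same term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: tokenize -> lowercase -> remove stop words -> stem.
final class ToyAnalyzerSketch {
    // Assumed stop-word set for this example only.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "is");

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {   // tokenizer: split on non-alphanumerics
            if (raw.isEmpty()) continue;
            String term = raw.toLowerCase(Locale.ROOT);       // lowercase filter
            if (STOP_WORDS.contains(term)) continue;          // stop filter
            terms.add(stem(term));                            // stemmer
        }
        return terms;
    }

    // Deliberately naive plural stripping; real stemmers such as Porter have many more rules.
    static String stem(String term) {
        if (term.endsWith("es") && term.length() > 4) {
            return term.substring(0, term.length() - 2);
        }
        if (term.endsWith("s") && term.length() > 3) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown foxes jump!")); // [quick, brown, fox, jump]
        // Index-time "foxes" and query-time "Fox" meet at the same term:
        System.out.println(analyze("foxes").equals(analyze("Fox"))); // true
    }
}
```

If the query side skipped the stemming step, the query term "foxes" would never equal the indexed term "fox", and the search would silently return nothing.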

IndexWriter - The Builder

The IndexWriter is the only component that modifies the index. It:

Receives documents
Runs them through the analyzer
Builds in-memory data structures
Flushes to disk as immutable segments
Merges segments in the background

Analogy: A construction crew building a library. They add books, organize shelves, and occasionally consolidate shelves to make room.

Why this matters: IndexWriter is thread-safe: multiple threads can call addDocument() on one shared instance. Internally, Lucene uses DocumentsWriterPerThread (DWPT) so those threads can index concurrently with very little lock contention.

IndexReader - The Reader

The IndexReader provides a point-in-time view of the index. It opens segments and exposes them for searching. It's read-only and thread-safe.

Analogy: A snapshot of the library catalog. Even as new books arrive (via IndexWriter), readers see the catalog as it was when they opened it. Multiple readers can share the same view without interfering.

Why this matters: IndexReader is the gateway to all search operations. Opening an IndexReader is expensive, so you typically open one and reuse it, reopening only when you need to see new documents (Near-Real-Time search).

IndexSearcher - The Search Coordinator

The IndexSearcher wraps an IndexReader and executes queries. It:

Accepts a Query
Rewrites/optimizes it
Creates per-segment scorers
Collects and ranks results
Returns TopDocs (top N results)

Analogy: A librarian who takes your request ("books about cats"), looks up the catalog (IndexReader), checks multiple sections (segments), and returns the best matches.

Why this matters: IndexSearcher is where the magic happens - query optimization, scoring, and result collection all happen here. It's the single most important class for understanding search performance.

Query - The Search Request

A Query represents what the user is looking for. Lucene has many query types:

TermQuery - exact term match
PhraseQuery - terms in sequence ("machine learning")
BooleanQuery - AND/OR/NOT combinations
PointRangeQuery - numeric/date ranges (built with IntPoint.newRangeQuery, LongPoint.newRangeQuery, etc.)
PrefixQuery - starts with ("aut*")
WildcardQuery - pattern matching ("c*t")
FuzzyQuery - approximate match ("lucne" → "lucene")

Analogy: A search request form. "Find documents where title contains 'lucene' AND body contains 'search' AND year is between 2020-2024."

Why this matters: Queries are composable. You can build complex boolean trees from simple queries. The query you build determines which data structures Lucene uses (FST for terms, BKD for ranges, etc.).
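One way to picture how boolean queries compose is as set operations over sorted postings lists. The sketch below is a simplified model in plain Java (PostingsIntersectionSketch is a hypothetical name); Lucene's real implementation uses iterators with skipping rather than materialized lists, but the AND/OR logic is the same in spirit.

```java
import java.util.ArrayList;
import java.util.List;

// AND / OR over two sorted lists of document IDs, each in a single pass.
final class PostingsIntersectionSketch {
    // AND: keep only doc IDs present in both lists.
    static List<Integer> and(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i), y = b.get(j);
            if (x == y) { out.add(x); i++; j++; }
            else if (x < y) i++;   // advance whichever list is behind
            else j++;
        }
        return out;
    }

    // OR: merge both lists, emitting each doc ID once.
    static List<Integer> or(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            if (j == b.size() || (i < a.size() && a.get(i) < b.get(j))) {
                out.add(a.get(i++));
            } else if (i == a.size() || b.get(j) < a.get(i)) {
                out.add(b.get(j++));
            } else {               // same doc ID in both lists
                out.add(a.get(i));
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> cat = List.of(1, 3);   // postings for "cat" from the earlier example
        List<Integer> dog = List.of(2, 3);   // postings for "dog"
        System.out.println(and(cat, dog));   // [3]
        System.out.println(or(cat, dog));    // [1, 2, 3]
    }
}
```

Because both inputs are sorted by doc ID, each operation touches every posting at most once, which is why keeping postings in doc ID order matters.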

Scorer - The Ranker

The Scorer traverses matching documents and assigns a relevance score (a float). Higher = more relevant.

For a TermQuery, the scorer:

Gets the term's postings list from the inverted index
For each document in the list, calculates a score
Considers term frequency, document length, and rarity

Analogy: A judge scoring contestants. Each document gets a score based on how well it matches the criteria.

Why this matters: Scoring is where Lucene's ranking quality comes from. The default BM25 model is the result of decades of information retrieval research. Understanding scoring helps you debug "why did this document rank higher?"

Collector - The Gatherer

The Collector gathers results from the scorer. Different collectors do different things:

TopScoreDocCollector - top N by score (most common)
TopFieldCollector - top N sorted by a field
TotalHitCountCollector - just count matches (no scoring)
FacetsCollector - collect hits for facet counting (facet module)
FirstPassGroupingCollector / SecondPassGroupingCollector - group results by field (grouping module)

Analogy: A basket that collects the best items. Some baskets keep only top 10. Others count everything. Others group by category.

Why this matters: Collectors are pluggable. You can write custom collectors for specialized behaviors (e.g., collect only documents with a minimum score, or deduplicate by field).

How Documents Are Stored: The Internal Journey

Let's trace exactly what happens when you call writer.addDocument(doc). This is the most important 10 seconds of Lucene internals.

Your Code

Document doc = new Document();
doc.add(new TextField("title", "Lucene in Action", Field.Store.YES));
doc.add(new TextField("body", "Lucene is a search library", Field.Store.YES));
writer.addDocument(doc);

Step 1: Document Enters the DocumentsWriterPerThread (DWPT)

┌─────────────────────────────────────┐
│  IndexWriter.addDocument(doc)       │
│        ↓                            │
│  ┌─────────────────────────────┐    │
│  │  DWPT (Thread 1)            │    │
│  │  - RAM Buffer (16MB default)│    │
│  │  - Postings Hash Map        │    │
│  │  - Stored Fields Buffer     │    │
│  │  - DocValues Buffer         │    │
│  └─────────────────────────────┘    │
│        ↓                            │
│  [Other DWPTs for other threads]   │
└─────────────────────────────────────┘

What happens: The indexing thread checks out a DWPT from a pool (a new one is created if none is free), so no two threads ever write to the same DWPT at once. The DWPT buffers the document in RAM.

Why this matters: This is how Lucene achieves high concurrency. 10 threads can use 10 DWPTs in parallel, so indexing throughput scales roughly with the number of threads until CPU or I/O becomes the bottleneck. Each DWPT has its own postings hash and field infos; the RAM buffer limit is a budget shared across all DWPTs.
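The "private buffer per thread" idea can be sketched with plain Java concurrency utilities. This is only an analogy under stated assumptions: PerThreadBufferSketch and indexConcurrently are hypothetical names, documents are plain strings, and each task's returned list stands in for a flushed segment. It is not how IndexWriter is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Each task fills its own private buffer, so the hot path needs no locks.
final class PerThreadBufferSketch {
    static List<List<String>> indexConcurrently(List<String> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int slice = t;
                Callable<List<String>> task = () -> {
                    List<String> buffer = new ArrayList<>();   // private "RAM buffer"
                    for (int i = slice; i < docs.size(); i += threads) {
                        buffer.add(docs.get(i));
                    }
                    return buffer;                             // "flushed" as one segment
                };
                futures.add(pool.submit(task));
            }
            List<List<String>> segments = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                segments.add(future.get());
            }
            return segments;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("indexing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    static List<String> exampleDocs() {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) docs.add("doc" + i);
        return docs;
    }

    public static void main(String[] args) {
        // 3 threads produce 3 independent "segments" covering all 10 docs.
        System.out.println(indexConcurrently(exampleDocs(), 3));
    }
}
```

The only coordination point is collecting the finished buffers, which mirrors why flushing produces one segment per DWPT.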

Step 2: Analysis - Text Becomes Terms

Input: "Lucene is a search library"

StandardAnalyzer (configured with an English stop-word set):
  StandardTokenizer → LowerCaseFilter → StopFilter

  StandardTokenizer: ["Lucene", "is", "a", "search", "library"]
  LowerCaseFilter:   ["lucene", "is", "a", "search", "library"]
  StopFilter:        ["lucene", "search", "library"]

Output terms: ["lucene", "search", "library"]

What happens: The analyzer tokenizes the text, lowercases it, removes stop words ("is", "a"), and produces the final list of terms. Note that since Lucene 8.0, new StandardAnalyzer() uses an empty stop-word set by default; to get the stop-word removal shown here, pass one explicitly, for example new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET).

For each term, the DWPT updates its postings hash:

Before adding doc:
  "lucene"  → []  
  "search"  → []
  "library" → []

After adding doc (docID=0):
  "lucene"  → [{doc: 0, freq: 1, pos: 1}]
  "search"  → [{doc: 0, freq: 1, pos: 4}]   // positions 2 and 3 belonged to the removed "is" and "a"
  "library" → [{doc: 0, freq: 1, pos: 5}]

Step 3: Stored Fields Are Buffered

Stored Fields Buffer (row-oriented):
┌───────┬─────────────────────────────┐
│ docID │ field data (compressed)     │
├───────┼─────────────────────────────┤
│ 0     │ title="Lucene in Action"    │
│       │ body="Lucene is a search..."│
└───────┴─────────────────────────────┘

What happens: The original field values are stored in a row-oriented buffer for later retrieval. This is your _source equivalent in Elasticsearch.

Step 4: DocValues Are Buffered

DocValues Buffer (column-oriented):
┌──────────────┬──────────────┐
│ docID (all)  │ year (none)  │
├──────────────┼──────────────┤
│ 0            │ (no year field)│
└──────────────┴──────────────┘

What happens: If you had numeric or sorted fields, they'd be stored in column-oriented buffers. These enable fast sorting and faceting at search time.

Step 5: Flush Trigger - RAM Buffer Full

When the RAM used by all DWPTs together hits the limit (default 16MB), Lucene flushes the largest DWPT:

BEFORE FLUSH:
┌─────────────────────────┐
│  DWPT In-Memory         │
│  ├─ Postings Hash       │
│  ├─ Stored Fields       │
│  ├─ DocValues           │
│  └─ Field Infos         │
└─────────────────────────┘
           ↓ serialize
AFTER FLUSH:
┌──────────────────────────────┐
│  Segment Files on Disk       │
│  ├─ _0.fdt (stored data)     │
│  ├─ _0.fdx (stored index)    │
│  ├─ _0.tim (term dictionary) │
│  ├─ _0.tip (term index)      │
│  ├─ _0.doc (postings)        │
│  ├─ _0.pos (positions)       │
│  ├─ _0.dvd (docvalues)       │
│  └─ _0.si (segment info)     │
└──────────────────────────────┘

What happens: All in-memory data structures are serialized to disk files. A new segment is born - an immutable, self-contained chunk of the index.

The segment files:

_0.si   → Segment metadata (version, diagnostics, number of docs)
_0.fnm  → Field infos (field names, index options, DocValues type, etc.)
_0.fdx  → Stored fields index (quick lookup into .fdt)
_0.fdt  → Stored fields data (compressed document content)
_0.tim  → Term dictionary (terms stored in blocks, with stats and pointers into the postings)
_0.tip  → Term index (FST over term prefixes, used to find the right block in .tim)
_0.doc  → Postings lists (doc IDs and frequencies)
_0.pos  → Positions (where each term occurs in each doc)
_0.pay  → Payloads and offsets (optional per-occurrence data)
_0.dvd  → DocValues data (columnar numeric/sorted fields)
_0.dvm  → DocValues metadata
_0.nvd  → Norms (1 byte per doc per field for length normalization)
_0.nvm  → Norms metadata

Step 6: Commit

When you call writer.commit():

Before commit:
  segments_1 → [Segment_0, Segment_1, Segment_2]

After commit (new docs added):
  segments_2 → [Segment_0, Segment_1, Segment_2, Segment_3, Segment_4]

  (segments_1 is kept as backup until segments_2 is fsync'd)

What happens:

All pending DWPTs are flushed to new segments
A new segments_N file is written with the complete list of segments
The file is fsync'd to disk - durable even if the JVM crashes
The commit point is now visible to new IndexReaders

Why this matters: Uncommitted documents are NOT durable. If the JVM crashes after addDocument() but before commit(), those documents are lost. But they ARE searchable via NRT (Near-Real-Time) readers before commit.

Step 7: Merge (Background)

Before merge:
  [2MB] [2MB] [2MB] [2MB] [2MB] [2MB] [2MB] [2MB] [2MB] [2MB] [2MB]  ← 11 segments

After merge (TieredMergePolicy, merging 10 segments at once):
  [20MB] [2MB]  ← 1 big + 1 small

Eventually:
  [200MB] [50MB] [20MB] [5MB] [2MB]  ← logarithmic tiering

What happens: A background thread merges small segments into larger ones. Old segments are deleted after the merge. This keeps the number of segments manageable (O(log N) segments for N documents).

Why this matters: Too many segments = slower search (more files to open, more term dictionaries to consult). Merging keeps search fast but costs I/O and CPU. The merge policy controls this trade-off.
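The tiering behaviour can be approximated with a few lines of plain Java. This is a loose simplification, not the TieredMergePolicy algorithm: it assumes segment sizes in whole megabytes, ignores deletes and maximum segment size, and simply merges the smallest segments whenever there are too many. TieredMergeSketch and mergeUntilStable are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified tiering: while more than `segmentsPerTier` segments exist,
// merge the `segmentsPerTier` smallest ones into a single larger segment.
final class TieredMergeSketch {
    static List<Integer> mergeUntilStable(List<Integer> segmentSizesMb, int segmentsPerTier) {
        if (segmentsPerTier < 2) {
            throw new IllegalArgumentException("segmentsPerTier must be at least 2");
        }
        List<Integer> segments = new ArrayList<>(segmentSizesMb);
        while (segments.size() > segmentsPerTier) {
            Collections.sort(segments);                 // smallest first
            int mergedSize = 0;
            for (int i = 0; i < segmentsPerTier; i++) {
                mergedSize += segments.remove(0);       // remove by index, not by value
            }
            segments.add(mergedSize);                   // the merged segment replaces its inputs
        }
        segments.sort(Collections.reverseOrder());      // largest first, for display
        return segments;
    }

    public static void main(String[] args) {
        List<Integer> elevenSmall = Collections.nCopies(11, 2);   // eleven 2MB segments
        System.out.println(mergeUntilStable(elevenSmall, 10));    // [20, 2]
    }
}
```

Each merge rewrites all the data in its input segments, which is the I/O cost the merge policy trades against search speed.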

How Documents Are Read: The Internal Journey

Now let's trace exactly what happens when you call searcher.search(query, 10).

Your Code

Query query = new TermQuery(new Term("body", "search"));
TopDocs results = searcher.search(query, 10);

Step 1: Query Parsing (If Using a Parser)

If you use QueryParser:

User input: "lucene AND (search OR library)"
         ↓
QueryParser:
  ┌─────────────────────────────────────────┐
  │ BooleanQuery                            │
  │  ├── MUST: TermQuery("lucene")          │
  │  └── MUST: BooleanQuery                 │
  │       ├── SHOULD: TermQuery("search")   │
  │       └── SHOULD: TermQuery("library")  │
  └─────────────────────────────────────────┘

What happens: The query parser converts user text into a tree of Query objects. If you build queries programmatically (as in our example), you skip this step. Use explicit parentheses when mixing AND and OR: the classic QueryParser has no operator precedence, so "lucene AND search OR library" parses as +lucene +search library.

Why this matters: QueryParser is convenient but dangerous. It can throw ParseException for malformed input. Production systems often use SimpleQueryParser or build queries programmatically for safety.

Step 2: Query Rewriting

Before rewrite:
  BooleanQuery
    ├── MUST: TermQuery("body", "search")
    └── FILTER: MatchAllDocsQuery

After rewrite:
  TermQuery("body", "search")  ← a FILTER that matches everything is redundant, removed

What happens: Lucene rewrites the query for optimization before execution. Common rewrites:

BooleanQuery with single MUST → unwrap to inner query
PhraseQuery with one term → TermQuery
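MultiTermQuery (prefix, wildcard, fuzzy) → expanded into the matching terms, e.g. a BooleanQuery of TermQueries when few terms match (FuzzyQuery caps the expansion with maxExpansions, 50 by default)
BooleanQuery with no clauses → MatchNoDocsQuery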

Why this matters: Rewriting simplifies the query tree, making execution faster. It's a compile-time optimization for search.

Step 3: Weight Creation

Weight weight = query.createWeight(searcher, ScoreMode.TOP_SCORES, 1.0f);

What happens: The Weight binds the query to index statistics:

docFreq - how many documents contain this term
totalTermFreq - total occurrences of this term across all docs
docCount - how many documents have at least one term in this field
sumDocFreq - docFreq summed over every term in the field
sumTotalTermFreq - total number of term occurrences in the field (used for the average field length)

These statistics are needed for scoring (BM25 IDF calculation).

Why this matters: Weight is where the query "learns" about the index. It's computed once per query, then used to create per-segment Scorers. This avoids recomputing statistics for every segment.

Step 4: Per-Segment Scorer Creation

IndexReader has 3 segments: [Seg_0, Seg_1, Seg_2]

Weight.scorer(Seg_0) → Scorer_0
Weight.scorer(Seg_1) → Scorer_1
Weight.scorer(Seg_2) → Scorer_2

What happens: For each segment, the Weight creates a Scorer. The scorer knows how to iterate over the matching documents in that segment using the segment's inverted index.

For a TermQuery, the Scorer:

Opens the segment's .tip file (FST in memory)
Looks up "search" in the FST → gets file pointer to postings
Seeks to that position in the .doc file
Reads the postings list: [doc=5, freq=2, ...]

Step 5: The Search Loop (Scorer → Collector)

Collector (min-heap of top 10)
    ↑
    │  collect(doc, score)
    │
Scorer for Segment 0:
  postings = [doc=5, freq=3, pos=[...]]
  postings = [doc=12, freq=1, pos=[...]]
  postings = [doc=23, freq=2, pos=[...]]

  for each posting:
    score = BM25Score(freq, docLength, avgLength, idf)
    if score > minHeap.min():
      add to heap
      update minCompetitiveScore

What happens: The Scorer iterates through matching documents. For each document, it calculates a score and passes it to the Collector. The Collector maintains a min-heap of the top N results.
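The heap logic in the loop above can be sketched in plain Java with a PriorityQueue. This is a simplified stand-in for what TopScoreDocCollector does, under stated assumptions: TopNCollectorSketch and Hit are hypothetical names, ties are not broken by doc ID, and there is no per-segment handling.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Keeps the N best-scoring hits using a min-heap.
final class TopNCollectorSketch {
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int n;
    // Min-heap: the weakest of the current top N sits at the head.
    private final PriorityQueue<Hit> heap =
        new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));

    TopNCollectorSketch(int n) { this.n = n; }

    void collect(int doc, float score) {
        if (heap.size() < n) {
            heap.add(new Hit(doc, score));
        } else if (score > heap.peek().score) {
            heap.poll();                       // evict the current weakest hit
            heap.add(new Hit(doc, score));
        }
    }

    // Lowest score a new document must beat once the heap is full.
    float minCompetitiveScore() {
        return heap.size() < n ? Float.NEGATIVE_INFINITY : heap.peek().score;
    }

    // Best hit first.
    List<Hit> topDocs() {
        List<Hit> out = new ArrayList<>(heap);
        out.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return out;
    }

    static TopNCollectorSketch example() {
        TopNCollectorSketch collector = new TopNCollectorSketch(3);
        collector.collect(12, 4.89f);
        collector.collect(40, 1.00f);
        collector.collect(5, 5.26f);
        collector.collect(23, 4.71f);          // evicts doc 40
        return collector;
    }

    public static void main(String[] args) {
        for (Hit hit : example().topDocs()) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

A min-heap is the natural fit here because the only element that ever needs to be inspected or evicted is the weakest one, so each collect call costs O(log N) at worst.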

Block-Max WAND Optimization (Lucene 8+):

For each block of 128 docs (the block size of the Lucene 8.x/9.x postings formats):
  blockMaxScore = precomputed maximum score for this block
  if blockMaxScore < minCompetitiveScore:
    SKIP ENTIRE BLOCK (128 docs)
    continue
  else:
    score each doc individually

Why this matters: This is how Lucene can search millions of documents and still return the top 10 in milliseconds. Whole blocks that cannot beat the current 10th-best score are skipped without being scored; how much gets skipped depends on the query and the data. This is the block-max WAND (Weak AND) optimization, a close relative of MAXSCORE.

Step 6: BM25 Scoring (With Real Numbers)

Let's say we have:

Total documents: N = 1000
Documents containing "search": n(q) = 50
Document 5: "search" appears 3 times, field length = 100 words
Average field length: avgDL = 200
Parameters: k1 = 1.2, b = 0.75

Step 6a: Calculate IDF

IDF("search") = log(1 + (N - n(q) + 0.5) / (n(q) + 0.5))
              = log(1 + (1000 - 50 + 0.5) / (50 + 0.5))
              = log(1 + 950.5 / 50.5)
              = log(1 + 18.82)
              = log(19.82)
              = 2.99

A common term (appears in 50/1000 docs) has lower IDF than a rare term. If "search" appeared in only 5 docs:

IDF = log(1 + (1000 - 5 + 0.5) / (5 + 0.5))
    = log(1 + 995.5 / 5.5)
    = log(182.0)
    = 5.20

Rare terms score higher. This makes sense - matching a rare term is more significant.

Step 6b: Calculate Term Frequency Component

tfComponent = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * |D|/avgDL))

For Document 5 (freq=3, |D|=100, avgDL=200):
  = (3 * (1.2 + 1)) / (3 + 1.2 * (1 - 0.75 + 0.75 * 100/200))
  = (3 * 2.2) / (3 + 1.2 * (0.25 + 0.375))
  = 6.6 / (3 + 1.2 * 0.625)
  = 6.6 / (3 + 0.75)
  = 6.6 / 3.75
  = 1.76

Step 6c: Calculate Final Score

Score = IDF * tfComponent
      = 2.99 * 1.76
      = 5.26

What if the document was shorter? Say |D| = 50:

tfComponent = (3 * 2.2) / (3 + 1.2 * (0.25 + 0.75 * 50/200))
            = 6.6 / (3 + 1.2 * (0.25 + 0.1875))
            = 6.6 / (3 + 1.2 * 0.4375)
            = 6.6 / (3 + 0.525)
            = 6.6 / 3.525
            = 1.87

Score = 2.99 * 1.87 = 5.59  ← Higher score! Shorter docs get boost.

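Why this matters: This is the heart of relevance ranking. Note that since Lucene 8.0, BM25Similarity leaves the constant (k1 + 1) factor out of the numerator, so the scores Lucene reports are lower by that factor (2.2 here) while the ranking stays the same. The math ensures:

More term occurrences = higher score (but saturates - 10th occurrence matters less than 1st)
Shorter documents = higher score (the same match counts for more in a short field such as a title than in a long body)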
Rare terms = higher score (matching "Lucene" is more specific than matching "the")
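The worked numbers above can be reproduced with a few lines of plain Java. This sketch follows the textbook formula exactly as written in Steps 6a-6c, including the (k1 + 1) factor; Bm25Sketch and its method names are illustrative, not Lucene's BM25Similarity API.

```java
import java.util.Locale;

// Textbook BM25 for a single term, matching the formulas in Steps 6a-6c.
final class Bm25Sketch {
    static final double K1 = 1.2;
    static final double B = 0.75;

    // IDF = ln(1 + (N - n + 0.5) / (n + 0.5))
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * |D| / avgDL))
    static double tf(double freq, double fieldLength, double avgFieldLength) {
        double lengthNorm = K1 * (1 - B + B * fieldLength / avgFieldLength);
        return freq * (K1 + 1) / (freq + lengthNorm);
    }

    static double score(double freq, double fieldLength, double avgFieldLength,
                        long docCount, long docFreq) {
        return idf(docCount, docFreq) * tf(freq, fieldLength, avgFieldLength);
    }

    public static void main(String[] args) {
        // N = 1000, n(q) = 50, freq = 3, avgDL = 200
        System.out.println(String.format(Locale.ROOT, "%.2f", score(3, 100, 200, 1000, 50))); // 5.26
        System.out.println(String.format(Locale.ROOT, "%.2f", score(3, 50, 200, 1000, 50)));  // 5.59
    }
}
```

Trying a few more inputs shows the saturation effect: raising freq from 3 to 30 increases the tf component, but it can never exceed k1 + 1 = 2.2.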

Step 7: Collector Returns TopDocs

Collector's min-heap (top 10 by score):
  1. Doc 5, Score: 5.26
  2. Doc 12, Score: 4.89
  3. Doc 23, Score: 4.71
  ...
  10. Doc 89, Score: 3.42

TopDocs {
  totalHits: 156 (how many docs matched total)
  scoreDocs: [ScoreDoc(5, 5.26), ScoreDoc(12, 4.89), ...]
}

What happens: The Collector returns a TopDocs object containing the top N results and the total hit count. The scores are floats - higher is better.

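Why this matters: The total hit count tells you "156 documents matched, here are the top 10." This is useful for pagination and UI ("Showing 1-10 of 156 results"). Since Lucene 8, totalHits is a TotalHits object: by default the count is exact only up to 1,000 hits, and beyond that it is a lower bound, so check its relation before displaying it as an exact number.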

Deep Dive: Analysis

Analysis is the art of turning text into indexable terms. It's the most common source of "why doesn't my search work?" bugs.

The Analysis Pipeline

Raw Text
   ↓
CharFilter (optional)  ← e.g., HTML strip, mapping chars
   ↓
Tokenizer              ← Split into tokens
   ↓
TokenFilter            ← Transform tokens (lowercase, stop, stem)
   ↓
TokenFilter            ← More transformations
   ↓
Terms → Index

StandardAnalyzer (The Default)

Analyzer analyzer = new StandardAnalyzer();

// What it does:
// 1. StandardTokenizer: Unicode-aware word boundaries (UAX #29)
//    "Hello, world! Check out https://example.com"
//    → ["Hello", "world", "Check", "out", "https", "example.com"]
//    (use UAX29URLEmailTokenizer to keep URLs and emails as single tokens)
//
// 2. LowerCaseFilter: "Hello" → "hello"
//
// 3. StopFilter: removes stop words. Since Lucene 8.0 the default stop set is empty;
//    pass EnglishAnalyzer.ENGLISH_STOP_WORDS_SET to remove "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"

Custom Analyzer for E-Commerce

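// Build the synonym map once, outside the analyzer: build() throws IOException.
SynonymMap.Builder builder = new SynonymMap.Builder(true);  // true = deduplicate rules
// Synonyms: "laptop" = "notebook" = "portable computer"
builder.add(new CharsRef("laptop"), new CharsRef("notebook"), true);
// Multi-word synonyms must be joined with SynonymMap.Builder.join, not a plain space
builder.add(new CharsRef("laptop"),
    SynonymMap.Builder.join(new String[] {"portable", "computer"}, new CharsRefBuilder()), true);
final SynonymMap synonymMap = builder.build();

Analyzer ecommerceAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Tokenize on whitespace (preserves product codes like "ABC-123")
        Tokenizer tokenizer = new WhitespaceTokenizer();

        TokenStream stream = new LowerCaseFilter(tokenizer);

        // Expand synonyms; at index time the resulting token graph must be flattened
        stream = new SynonymGraphFilter(stream, synonymMap, true);
        stream = new FlattenGraphFilter(stream);

        // Stemming: "running" → "run", "shoes" → "shoe"
        stream = new PorterStemFilter(stream);

        // Edge n-grams for autocomplete: "laptop" → "la", "lap", "lapt", ...
        // (min gram 2, max gram 10, do not keep the original token)
        stream = new EdgeNGramTokenFilter(stream, 2, 10, false);

        return new TokenStreamComponents(tokenizer, stream);
    }
};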

Why this matters: E-commerce search needs synonyms ("laptop" = "notebook"), stemming ("shoes" = "shoe"), and autocomplete. A custom analyzer is the difference between a search that works and one that frustrates users.

Multilingual Analysis

// English
Analyzer english = new EnglishAnalyzer();  // Standard + English stopwords + Porter stemmer

// French
Analyzer french = new FrenchAnalyzer();      // French stopwords + French stemming

// Chinese/Japanese/Korean
Analyzer cjk = new CJKAnalyzer();            // Bigram tokenization (no spaces in CJK)

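// ICU (International Components for Unicode) - handles all languages; needs the Lucene ICU analysis module
Analyzer icu = CustomAnalyzer.builder()
    .withTokenizer(ICUTokenizerFactory.class)              // script-aware tokenization
    .addCharFilter(ICUNormalizer2CharFilterFactory.class)  // NFKC normalization + case folding (default nfkc_cf)
    .build();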

Why this matters: Different languages need different tokenization. Chinese has no spaces - you need bigram or dictionary-based tokenization. German has compound words - you need decompounding. Arabic has prefix/suffix variations - you need normalization.

Analysis Debugging

// See EXACTLY what your analyzer produces
Analyzer analyzer = new EnglishAnalyzer();  // stop words + stemming, matching the output below
String text = "The quick brown foxes jump!";

try (TokenStream stream = analyzer.tokenStream("field", text)) {
    CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posAttr = stream.addAttribute(PositionIncrementAttribute.class);
    OffsetAttribute offsetAttr = stream.addAttribute(OffsetAttribute.class);
    TypeAttribute typeAttr = stream.addAttribute(TypeAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
        System.out.printf("term=%s, pos=%d, offset=%d-%d, type=%s%n",
            termAttr.toString(),
            posAttr.getPositionIncrement(),
            offsetAttr.startOffset(),
            offsetAttr.endOffset(),
            typeAttr.type());
    }
    stream.end();
}

// Output:
// term=quick, pos=2, offset=4-9, type=<ALPHANUM>
// term=brown, pos=1, offset=10-15, type=<ALPHANUM>
// term=fox, pos=1, offset=16-21, type=<ALPHANUM>
// term=jump, pos=1, offset=22-26, type=<ALPHANUM>

Why this matters: When search doesn't work, analyze the analyzer. The #1 debugging tool is printing tokens. If you index "foxes" but the analyzer outputs "fox", your query must also use the same analyzer to get "fox".

Deep Dive: IndexWriter

The IndexWriter is Lucene's most complex class. Understanding its configuration is crucial for production.

Configuration Options

IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());

// RAM buffer size - controls how much memory before flush
config.setRAMBufferSizeMB(256.0);  // Default 16MB. More = faster indexing, more memory.

// Max buffered docs - alternative trigger (whichever comes first)
config.setMaxBufferedDocs(10000);  // Flush after 10,000 docs

// Merge policy - controls segment merging
config.setMergePolicy(new TieredMergePolicy());

// Merge scheduler - controls how merges run
config.setMergeScheduler(new ConcurrentMergeScheduler());  // Default, merges in background threads
// config.setMergeScheduler(new SerialMergeScheduler());   // For testing, merges in foreground

// Open mode
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
// CREATE = overwrite existing
// APPEND = add to existing
// CREATE_OR_APPEND = create if none, append if exists

// Index deletion policy - controls commit history
config.setIndexDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
// KeepOnlyLastCommitDeletionPolicy = only keep last commit (default)
// SnapshotDeletionPolicy = allow snapshotting commits for backup

// Similarity - scoring model
config.setSimilarity(new BM25Similarity());
// config.setSimilarity(new ClassicSimilarity());  // TF-IDF (legacy)

// Codec - on-disk format
config.setCodec(new Lucene99Codec());
// Lucene99Codec = current default (Lucene 9.x)
// Lucene95Codec = older format
// You can write custom codecs!

// Info stream - debug logging
config.setInfoStream(System.out);  // See EVERYTHING IndexWriter does

// Commit on close - commit pending changes when close() is called (default: true)
config.setCommitOnClose(true);

IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/index")), config);

Flush vs Commit vs Merge

┌─────────────────────────────────────────────────────────────┐
│                    TIMELINE                                  │
├─────────────────────────────────────────────────────────────┤
│  addDocument() → addDocument() → addDocument() → ...       │
│       │              │              │                        │
│       └──────┬───────┴──────┬──────┘                        │
│              │              │                                │
│              ▼              ▼                                │
│        [RAM Buffer fills]  [Or commit() called]              │
│              │              │                                │
│              ▼              ▼                                │
│         FLUSH ──────────── FLUSH                             │
│              │              │                                │
│              ▼              ▼                                │
│      New Segment created  New Segment created                │
│              │              │                                │
│              └──────┬──────┘                                │
│                     │                                       │
│                     ▼                                       │
│              [Background Thread]                             │
│                     │                                       │
│                     ▼                                       │
│                   MERGE                                     │
│                     │                                       │
│                     ▼                                       │
│           Small segments → Large segments                   │
└─────────────────────────────────────────────────────────────┘

Flush: In-memory → disk segment. Fast. Not durable (no fsync).
Commit: Flush + write segments_N + fsync. Durable. Expensive (disk sync).
Merge: Background consolidation of segments. I/O intensive. Configurable throttling.

Why this matters: For high-throughput indexing (logs, events), you want large RAM buffers and infrequent commits. For document storage (search engine), you want smaller buffers and more frequent commits for durability.

NRT (Near-Real-Time) Search

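// Option 1: Commit + reopen (durable, ~1 second latency)
writer.commit();
DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
if (newReader != null) {   // openIfChanged returns null when nothing changed
    reader.close();
    reader = newReader;
}

// Option 2: NRT - no commit needed! (~100ms latency)
DirectoryReader nrtReader = DirectoryReader.open(writer);
IndexSearcher nrtSearcher = new IndexSearcher(nrtReader);
TopDocs results = nrtSearcher.search(query, 10);

// Reopen to see newer docs (still no commit!)
DirectoryReader newNrtReader = DirectoryReader.openIfChanged(nrtReader);
if (newNrtReader != null) {
    nrtReader.close();
    nrtReader = newNrtReader;
    nrtSearcher = new IndexSearcher(nrtReader);  // a searcher is bound to one reader
}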

How NRT works:

IndexWriter flushes a DWPT to a new segment (files on disk)
The segment is NOT in the segments_N file (not committed)
DirectoryReader.open(writer) opens the in-progress segment list directly from the writer's internal state
New segments are visible to searchers without a full commit

Why this matters: This is how Elasticsearch/OpenSearch achieve 1-second refresh intervals. NRT readers see documents as soon as they are flushed, without waiting for a commit. The trade-off: segments that were flushed but never committed are not part of any commit point, so they are lost if the JVM crashes.
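
The visibility rules above can be sketched with a toy model. This is plain Java, not the Lucene API: the class and method names (NrtVisibilitySketch, nrtDocCount, crash) are hypothetical and exist only to show which documents a committed reader and an NRT reader would each see.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of flush vs. commit visibility; this is NOT the Lucene API. */
final class NrtVisibilitySketch {
    private final List<String> buffer = new ArrayList<>();           // in-memory docs (like a DWPT)
    private final List<List<String>> flushed = new ArrayList<>();    // segments written, not yet in segments_N
    private final List<List<String>> committed = new ArrayList<>();  // segments listed in segments_N

    void addDocument(String doc) { buffer.add(doc); }

    /** Flush: turn the buffer into a new segment without committing. */
    void flush() {
        if (!buffer.isEmpty()) {
            flushed.add(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    /** Commit: flush, then make every flushed segment part of the commit point. */
    void commit() {
        flush();
        committed.addAll(flushed);
        flushed.clear();
    }

    /** What a reader opened from the directory sees: committed segments only. */
    int committedDocCount() { return count(committed); }

    /** What an NRT reader sees: the writer flushes, then exposes committed + flushed segments. */
    int nrtDocCount() {
        flush();
        return count(committed) + count(flushed);
    }

    /** Simulated crash: everything that was never committed is gone. */
    void crash() { buffer.clear(); flushed.clear(); }

    private static int count(List<List<String>> segments) {
        int n = 0;
        for (List<String> s : segments) n += s.size();
        return n;
    }

    public static void main(String[] args) {
        NrtVisibilitySketch w = new NrtVisibilitySketch();
        w.addDocument("a");
        w.commit();
        w.addDocument("b");
        System.out.println("committed=" + w.committedDocCount() + " nrt=" + w.nrtDocCount());
        w.crash();
        System.out.println("after crash nrt=" + w.nrtDocCount());
    }
}
```

The design point is that visibility and durability are separate: the NRT view is ahead of the committed view until the next commit.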

Deep Dive: IndexReader & IndexSearcher

IndexReader Types

// 1. DirectoryReader - reads from disk (FSDirectory)
DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/index")));

// 2. NRT Reader - reads uncommitted segments from IndexWriter
DirectoryReader nrtReader = DirectoryReader.open(writer);

// 3. MultiReader - reads multiple indices as one
IndexReader multiReader = new MultiReader(reader1, reader2, reader3);

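// 4. SlowCompositeReaderWrapper - flattens multi-segment to single (slower, for compatibility)
// Removed from Lucene core in 7.0 (it now lives in Solr); on older versions use the static factory:
LeafReader slowReader = SlowCompositeReaderWrapper.wrap(reader);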

// 5. FilterDirectoryReader - wraps another reader with filtering

IndexSearcher Patterns

// Basic search
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs results = searcher.search(query, 10);

// Search with sorting
Sort sort = new Sort(new SortField("price", SortField.Type.INT, false));  // ascending
TopFieldDocs sortedResults = searcher.search(query, 10, sort);

// Search with filter (cheaper than query for reusable filters)
Query filter = new TermQuery(new Term("status", "active"));
Query filteredQuery = new BooleanQuery.Builder()
    .add(query, BooleanClause.Occur.MUST)
    .add(filter, BooleanClause.Occur.FILTER)  // FILTER doesn't score
    .build();
TopDocs filteredResults = searcher.search(filteredQuery, 10);

// Search with collector manager (for concurrent search across segments)
// Lucene 8.x/9.x signature: createSharedManager(numHits, after, totalHitsThreshold)
CollectorManager<TopScoreDocCollector, TopDocs> manager =
    TopScoreDocCollector.createSharedManager(10, null, 1000);
TopDocs concurrentResults = searcher.search(query, manager);

Thread Safety

IndexReader   → Thread-safe for reading; each reader is a point-in-time snapshot
IndexSearcher → Thread-safe for searching
  (Pass an Executor to the constructor to search segments concurrently)
IndexWriter   → Thread-safe; share one instance across all indexing threads

Why this matters: Create one IndexSearcher, share it across all request threads. Reopen periodically to see new documents. Share a single IndexWriter across indexing threads as well: only one writer can hold a directory's write lock, and it gives each indexing thread its own DWPT automatically.

Deep Dive: Query Types

TermQuery - Exact Match

// Find documents where "field" contains exactly "value"
Query q = new TermQuery(new Term("title", "lucene"));

// Use case: Finding documents by ID, category, exact keyword

PhraseQuery - Proximity Search

// "machine learning" must appear as consecutive words
Query q = new PhraseQuery("body", "machine", "learning");

// With slop (allows words in between)
PhraseQuery.Builder builder = new PhraseQuery.Builder();
builder.add(new Term("body", "machine"), 0);
builder.add(new Term("body", "learning"), 1);
builder.setSlop(2);  // Allow up to 2 words between
Query q = builder.build();

// "machine [anything] [anything] learning" matches

BooleanQuery - Logical Combinations

Query q = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("title", "lucene")), BooleanClause.Occur.MUST)      // AND
    .add(new TermQuery(new Term("body", "search")), BooleanClause.Occur.SHOULD)     // OR (boosts score)
    .add(new TermQuery(new Term("status", "draft")), BooleanClause.Occur.MUST_NOT)   // NOT
    .add(new TermQuery(new Term("year", "2024")), BooleanClause.Occur.FILTER)       // AND (no score contribution)
    .build();

// MUST = required (AND)
// SHOULD = optional (OR), contributes to score
// MUST_NOT = excluded (NOT)
// FILTER = required, no scoring (faster than MUST for caching)

RangeQuery - Numeric/Date Ranges

// Integer range (uses BKD tree, NOT inverted index)
Query q = IntPoint.newRangeQuery("price", 10, 100);

// Long range (timestamps)
Query q = LongPoint.newRangeQuery("timestamp", 
    Instant.parse("2024-01-01T00:00:00Z").toEpochMilli(),
    Instant.parse("2024-12-31T23:59:59Z").toEpochMilli());

// Double range
Query q = DoublePoint.newRangeQuery("rating", 4.0, 5.0);

Why this matters: Range queries on numeric fields use BKD trees, which can be an order of magnitude or more faster than enumerating terms in the term dictionary, depending on cardinality and range width. This is why you should index numbers with IntPoint/LongPoint, not TextField.

PrefixQuery / WildcardQuery / RegexpQuery

// Prefix: "aut*" matches "auto", "automobile", "autumn"
Query q = new PrefixQuery(new Term("title", "aut"));

// Wildcard: "*cat*" matches "cat", "catch", "concatenate"
Query q = new WildcardQuery(new Term("title", "*cat*"));

// Regex: "[a-z]+cat[a-z]*" matches whole terms with at least one letter before "cat" ("concatenate", but not "cat" or "catch")
Query q = new RegexpQuery(new Term("title", "[a-z]+cat[a-z]*"));

Warning: Leading wildcards (*cat) and broad regexes are slow because they can't use the FST efficiently. They may scan the entire term dictionary. Use with caution or apply filters to limit the document set first.

FuzzyQuery - Approximate Matching

// maxEdits = 1: "lucne" matches "lucene" (one insertion)
Query q = new FuzzyQuery(new Term("title", "lucne"), 1);

// maxEdits = 2 (the maximum Lucene supports): also matches terms two edits away
Query q = new FuzzyQuery(new Term("title", "lucne"), 2);

// Fuzzy + prefix: the first 3 characters ("luc") must match exactly, the rest is fuzzy
Query q = new FuzzyQuery(new Term("title", "lucne"), 2, 3);  // prefix length 3

How it works: Lucene builds a Levenshtein automaton (finite state machine) for the query term, then intersects it with the FST term dictionary. Shared prefixes are traversed once, then automaton states track edit distance.
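
To make the matching rule concrete, here is a brute-force sketch of what the automaton decides for a single candidate term. It computes the edit distance directly (with adjacent transpositions counted as one edit) instead of building an automaton, so it shows the semantics rather than Lucene's implementation. The class and method names are hypothetical.

```java
/** Brute-force view of fuzzy matching: exact prefix plus bounded edit distance. */
final class EditDistanceSketch {
    /** Edit distance where insert, delete, substitute and adjacent transposition each cost 1. */
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
                boolean swapped = i > 1 && j > 1
                    && a.charAt(i - 1) == b.charAt(j - 2)
                    && a.charAt(i - 2) == b.charAt(j - 1);
                if (swapped) d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
            }
        }
        return d[a.length()][b.length()];
    }

    /** A candidate matches if it shares the required prefix and is within maxEdits. */
    static boolean fuzzyMatches(String query, String candidate, int maxEdits, int prefixLength) {
        int p = Math.min(prefixLength, query.length());
        if (!candidate.startsWith(query.substring(0, p))) return false;
        return distance(query, candidate) <= maxEdits;
    }

    public static void main(String[] args) {
        for (String term : new String[] {"lucene", "lucie", "lucky", "lance"}) {
            System.out.println(term + ": distance=" + distance("lucne", term)
                + ", matches(maxEdits=2, prefix=3)=" + fuzzyMatches("lucne", term, 2, 3));
        }
    }
}
```

Checking every term this way costs one dynamic-programming table per term, which is exactly the work the automaton/FST intersection avoids.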

MatchAllDocsQuery / MatchNoDocsQuery

// Match everything (useful for testing, or with filters)
Query q = new MatchAllDocsQuery();

// Match nothing (useful for degenerate cases)
Query q = new MatchNoDocsQuery();

Deep Dive: Scoring

BM25 Similarity (Default Since Lucene 6)

// Default parameters
BM25Similarity similarity = new BM25Similarity(1.2f, 0.75f);
// k1 = 1.2 (term frequency saturation)
// b = 0.75 (length normalization)

Understanding k1:

k1 = 0.0:  Term frequency ignored. One occurrence scores the same as ten (binary match).
k1 = 1.2:  Default. Diminishing returns after ~3-5 occurrences.
k1 = 10.0: Much slower saturation. Very sensitive to frequency.

Example: Document with 1 vs 10 occurrences of "lucene" (average-length document)
- k1=0.0:   Score ratio = 1:1 (no difference!)
- k1=1.2:   Score ratio ≈ 1:2 (10x frequency → 2x score)
- k1=10.0:  Score ratio ≈ 1:5.5 (10x frequency → 5.5x score)

Understanding b:

b = 0.0:  No length normalization. Short and long docs equal.
b = 0.75: Default. Moderate length penalty.
b = 1.0:  Full length normalization. Long docs heavily penalized.

Example: Short vs long document, one occurrence of "lucene" each
- 5-word doc vs 500-word doc in the same field (assuming avgFieldLength = 250, k1 = 1.2)
- b=0.0:   Short score = Long score (same TF)
- b=0.75:  Short score ≈ 2.4x Long score (shorter = better)
- b=1.0:   Short score ≈ 3.3x Long score (strongest preference for short)
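
Ratios like these depend on the assumed average field length, so it helps to compute them yourself. The sketch below implements only the term-frequency part of the textbook BM25 formula, with IDF left out because it cancels in a ratio. The class name and the avgFieldLength of 250 are assumptions for illustration. Lucene 8+ drops the constant (k1 + 1) factor, which does not change any ratio.

```java
/** Term-frequency part of textbook BM25, for comparing documents that match the same term. */
final class Bm25Sketch {
    /** tf * (k1 + 1) / (tf + k1 * (1 - b + b * len / avgLen)) */
    static double tfPart(double tf, double k1, double b, double len, double avgLen) {
        double lengthNorm = 1.0 - b + b * (len / avgLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        // Effect of k1: 10 occurrences vs 1 occurrence in an average-length document
        for (double k1 : new double[] {0.0, 1.2, 10.0}) {
            double ratio = tfPart(10, k1, 0.75, 250, 250) / tfPart(1, k1, 0.75, 250, 250);
            System.out.printf("k1=%.1f  tf=10 vs tf=1 ratio: %.2f%n", k1, ratio);
        }
        // Effect of b: 5-word document vs 500-word document, one occurrence each
        for (double b : new double[] {0.0, 0.75, 1.0}) {
            double ratio = tfPart(1, 1.2, b, 5, 250) / tfPart(1, 1.2, b, 500, 250);
            System.out.printf("b=%.2f  short vs long ratio: %.2f%n", b, ratio);
        }
    }
}
```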

Custom Scoring with FunctionScoreQuery

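// Combine BM25 with a custom function (e.g., boost by popularity)
Query baseQuery = new TermQuery(new Term("title", "lucene"));

// FunctionScoreQuery is final, so it cannot be subclassed. Express the formula as a
// DoubleValuesSource instead, here via the expressions module (lucene-expressions).
SimpleBindings bindings = new SimpleBindings();
bindings.add("_score", DoubleValuesSource.SCORES);
bindings.add("popularity", DoubleValuesSource.fromIntField("popularity"));

// BM25 * ln(1 + popularity); compile() throws ParseException on a malformed expression
Expression expr = JavascriptCompiler.compile("_score * ln(1 + popularity)");
Query boostedQuery = new FunctionScoreQuery(baseQuery, expr.getDoubleValuesSource(bindings));

// Plain multiplication needs no expression at all:
// FunctionScoreQuery.boostByValue(baseQuery, DoubleValuesSource.fromIntField("popularity"))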

ClassicSimilarity (TF-IDF - Legacy)

// Pre-Lucene 6 scoring
ClassicSimilarity similarity = new ClassicSimilarity();

score ≈ tf * idf² * norm   (per matching term, before boosts)
  where tf = sqrt(termFrequency)
  where idf = 1 + log((numDocs + 1) / (docFreq + 1))
  where norm = 1 / sqrt(docLength)

Why BM25 replaced TF-IDF: BM25 has better term frequency saturation and length normalization. It's the result of 20+ years of IR research. TF-IDF over-penalizes long documents and under-saturates high term frequencies.

Deep Dive: Data Structures

FST (Finite State Transducer) - Term Dictionary

Why FSTs? The term dictionary needs to support:

Exact lookup (term → postings pointer)
Prefix lookup (autocomplete)
Range queries ("a" to "c")
Fuzzy matching (Levenshtein intersection)
All in minimal memory

FST Structure (Shared Prefixes):

Terms: "cat", "cats", "dog", "dogs", "door", "dorm"

FST (simplified):
    c ──→ a ──→ t ──→ $  (cat)
                │
                └─→ s ──→ $  (cats)
    d ──→ o ──→ g ──→ $  (dog)
          │     │
          │     └─→ s ──→ $  (dogs)
          │
          ├─→ o ──→ r ──→ $  (door)
          │
          └─→ r ──→ m ──→ $  (dorm)

$ = final state (valid term end)
Arc output = postings pointer + doc frequency

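Memory: At ~10-50 bytes per term, 100M terms would need roughly 1-5 GB. In practice Lucene's terms index holds only term prefixes that point at blocks in the .tim file, so it is far smaller than one entry per term.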

File: .tip (FST in memory), .tim (term blocks accessed via FST).
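
The prefix-sharing idea can be sketched with a plain trie. This is a simplification: a real FST also shares suffixes and uses a compact byte encoding, and the outputs here are hypothetical postings pointers. TermTrieSketch and its methods are illustrative names, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Prefix-sharing trie: a simplified stand-in for an FST (no suffix sharing, no compact encoding). */
final class TermTrieSketch {
    private static final class Node {
        final Map<Character, Node> arcs = new TreeMap<>();  // sorted arcs keep terms in order
        Long output;                                         // hypothetical postings pointer; null = not a term
    }

    private final Node root = new Node();
    private int nodeCount = 1;

    void add(String term, long postingsPointer) {
        Node n = root;
        for (char c : term.toCharArray()) {
            Node next = n.arcs.get(c);
            if (next == null) {
                next = new Node();
                n.arcs.put(c, next);
                nodeCount++;
            }
            n = next;
        }
        n.output = postingsPointer;
    }

    /** Exact lookup: term to postings pointer, or null if the term is absent. */
    Long lookup(String term) {
        Node n = walk(term);
        return n == null ? null : n.output;
    }

    /** Prefix lookup: all terms starting with the prefix, in sorted order. */
    List<String> withPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        Node n = walk(prefix);
        if (n != null) collect(n, new StringBuilder(prefix), out);
        return out;
    }

    int nodeCount() { return nodeCount; }

    private Node walk(String s) {
        Node n = root;
        for (int i = 0; i < s.length() && n != null; i++) n = n.arcs.get(s.charAt(i));
        return n;
    }

    private static void collect(Node n, StringBuilder sb, List<String> out) {
        if (n.output != null) out.add(sb.toString());
        for (Map.Entry<Character, Node> e : n.arcs.entrySet()) {
            sb.append(e.getKey());
            collect(e.getValue(), sb, out);
            sb.setLength(sb.length() - 1);
        }
    }

    public static void main(String[] args) {
        TermTrieSketch trie = new TermTrieSketch();
        String[] terms = {"cat", "cats", "dog", "dogs", "door", "dorm"};
        for (int i = 0; i < terms.length; i++) trie.add(terms[i], i);
        System.out.println("prefix 'do': " + trie.withPrefix("do"));
        System.out.println("nodes: " + trie.nodeCount() + " for 22 characters of input");
    }
}
```

Even this naive version stores the six example terms in 13 nodes rather than 22 characters' worth, because "ca", "do" and their extensions are shared.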

BKD Tree - Numeric/Geospatial Index

Construction:

Points: [(1, 2), (3, 4), (5, 6), (7, 8), (2, 3), (4, 5), (6, 7), (8, 9)]

Step 1: Sort by dimension 0, find median
  Sorted: [(1,2), (2,3), (3,4), (4,5), (5,6), (6,7), (7,8), (8,9)]
  Median: (4,5) and (5,6) → split at 4.5

Step 2: Left half [(1,2), (2,3), (3,4), (4,5)]
  Sort by dim 1: [(1,2), (2,3), (3,4), (4,5)]
  Median: (2,3) and (3,4) → split at 3.5

Step 3: Recurse until each leaf holds at most a fixed number of points (512 by default in recent Lucene versions)

On disk:
  .dii = index tree (inner nodes)
  .dim = leaf blocks (packed points, doc IDs, min/max bounds)

Query execution:

Range query: [x: 3-7, y: 4-8]
  Traverse tree:
    Root bounds: [1-8, 2-9] → intersects → go deeper
    Left child: [1-4, 2-5] → intersects → go deeper
    Right child: [5-8, 6-9] → intersects → go deeper
    ...
    Leaf [3-4, 4-5] → check each point individually
    Prune branches that don't intersect!

Why BKD wins: For 1M unique prices, an inverted index would have 1M terms. BKD stores actual values in a tree. A range query becomes a tree traversal that prunes whole subtrees, roughly O(log n) plus the size of the matching leaves, instead of walking every term that falls in the range.
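
Build-then-prune can be sketched as a tiny 2-D k-d tree over the eight example points. This is a sketch of the idea only, not Lucene's on-disk BKD format: it alternates split dimensions as in the walkthrough, uses a leaf size of 2 to keep the demo small, and all names (KdRangeSketch, leavesVisited) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Tiny 2-D k-d tree with bounding-box pruning; illustrates the BKD idea, not its file format. */
final class KdRangeSketch {
    static final int LEAF_SIZE = 2;  // real leaves hold hundreds of points
    int leavesVisited = 0;

    private final class Node {
        final int[] min = {Integer.MAX_VALUE, Integer.MAX_VALUE};
        final int[] max = {Integer.MIN_VALUE, Integer.MIN_VALUE};
        int[][] points;              // set only on leaves
        Node left, right;
    }

    private final Node root;

    KdRangeSketch(int[][] points) { root = build(points.clone(), 0); }

    private Node build(int[][] pts, int depth) {
        Node n = new Node();
        for (int[] p : pts) {
            for (int d = 0; d < 2; d++) {
                n.min[d] = Math.min(n.min[d], p[d]);
                n.max[d] = Math.max(n.max[d], p[d]);
            }
        }
        if (pts.length <= LEAF_SIZE) {
            n.points = pts;
            return n;
        }
        int dim = depth % 2;                                    // alternate split dimension
        Arrays.sort(pts, Comparator.comparingInt(p -> p[dim]));
        int mid = pts.length / 2;                               // median split
        n.left = build(Arrays.copyOfRange(pts, 0, mid), depth + 1);
        n.right = build(Arrays.copyOfRange(pts, mid, pts.length), depth + 1);
        return n;
    }

    /** All points with lo[d] <= p[d] <= hi[d] in both dimensions. */
    List<int[]> range(int[] lo, int[] hi) {
        List<int[]> out = new ArrayList<>();
        search(root, lo, hi, out);
        return out;
    }

    private void search(Node n, int[] lo, int[] hi, List<int[]> out) {
        for (int d = 0; d < 2; d++) {
            if (n.max[d] < lo[d] || n.min[d] > hi[d]) return;   // prune: box misses the query
        }
        if (n.points != null) {
            leavesVisited++;
            for (int[] p : n.points) {
                if (p[0] >= lo[0] && p[0] <= hi[0] && p[1] >= lo[1] && p[1] <= hi[1]) out.add(p);
            }
            return;
        }
        search(n.left, lo, hi, out);
        search(n.right, lo, hi, out);
    }

    public static void main(String[] args) {
        int[][] pts = {{1, 2}, {3, 4}, {5, 6}, {7, 8}, {2, 3}, {4, 5}, {6, 7}, {8, 9}};
        KdRangeSketch tree = new KdRangeSketch(pts);
        List<int[]> hits = tree.range(new int[] {3, 4}, new int[] {7, 8});
        System.out.println(hits.size() + " hits, " + tree.leavesVisited + " of 4 leaves visited");
    }
}
```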

DocValues - Columnar Storage

The Problem: Lucene's default storage is row-oriented (stored fields). For "sort by price", you'd have to read every document's stored fields - expensive.

DocValues Solution: Store each field in a separate column file.

Documents: 1000
Field "price" values: [10, 25, 10, 50, 25, 10, 100, 50, ...]

Row-oriented (Stored Fields): 
  Doc 0: {title: "A", price: 10}  ← must read entire doc to get price
  Doc 1: {title: "B", price: 25}
  Doc 2: {title: "C", price: 10}
  ...

Column-oriented (DocValues):
  price.dvd: [10, 25, 10, 50, 25, 10, 100, 50, ...]
  ← sequential read, cache-friendly, no decompression of unrelated fields

DocValues Types:

Type	Storage	Use Case
NUMERIC	packed ints, GCD compression	Sorting, filtering, aggregations
BINARY	raw bytes with length index	Field retrieval without stored fields
SORTED	ordinals + unique values	Single-value string sorting
SORTED_SET	ordinals + bitset per doc	Multi-value faceting
SORTED_NUMERIC	multiple numeric values per doc	Multi-value numeric fields

Compression:

Monotonic: Values always increasing → store deltas only
GCD: All values multiples of 100 → store value/100 (fewer bits)
Table: Few distinct values → store a small lookup table plus a per-document index into it
Direct: Small doc count → raw values

Memory: Memory-mapped via MMapDirectory. Hot values stay in OS page cache. No JVM heap pressure.
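
The GCD trick is easy to quantify. The sketch below estimates bits per value when a column is stored as (value - min) / gcd instead of raw. It illustrates the arithmetic under the assumption of non-negative values, and the names are hypothetical rather than Lucene's codec internals.

```java
/** Estimates bits per value for a numeric column with and without min/GCD compression. */
final class GcdPackingSketch {
    static long gcd(long a, long b) { return b == 0 ? Math.abs(a) : gcd(b, a % b); }

    /** Bits needed per value when storing (v - min) / gcd instead of v. */
    static int packedBits(long[] values) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        long g = 0;
        for (long v : values) g = gcd(g, v - min);
        if (g == 0) return 0;  // all values identical: nothing to store per document
        return 64 - Long.numberOfLeadingZeros((max - min) / g);
    }

    /** Bits needed per value when storing the raw (non-negative) values. */
    static int rawBits(long[] values) {
        long max = 0;
        for (long v : values) max = Math.max(max, v);
        return 64 - Long.numberOfLeadingZeros(max);
    }

    public static void main(String[] args) {
        long[] pricesInCents = {1000, 2500, 1000, 5000, 2500, 1000, 10000, 5000};
        System.out.println("raw bits/value:    " + rawBits(pricesInCents));
        System.out.println("packed bits/value: " + packedBits(pricesInCents));
    }
}
```

For this sample the deltas from the minimum share a GCD of 500, so the largest encoded value is 18 and fits in 5 bits instead of 14.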

Norms - Field Length Normalization

// One byte per document per field
// Stores: the field's length (token count), lossy-encoded into a single byte
// BM25 derives its length factor at search time: 1 / (1 + b * (fieldLength / avgFieldLength - 1))

// Encoding: 256 discrete values (SmallFloat.intToByte4)
// Short lengths are stored exactly; longer ones keep about 4 significant bits

// For a document with fieldLength = 100, avgFieldLength = 200:
lengthFactor = 1 / (1 + 0.75 * (100/200 - 1))
     = 1 / (1 + 0.75 * (-0.5))
     = 1 / (1 - 0.375)
     = 1 / 0.625
     = 1.6

Why this matters: Shorter documents get higher scores. A title match (5 words) gets a bigger boost than a body match (500 words). Without norms, a 1000-word document with 10 occurrences of "lucene" would always beat a 10-word document with 1 occurrence.

Advanced Query Types

Span Queries - Positional Search

Span queries allow complex positional logic:

// "quick" within 5 positions of "fox"
SpanQuery quick = new SpanTermQuery(new Term("body", "quick"));
SpanQuery fox = new SpanTermQuery(new Term("body", "fox"));
SpanQuery near = new SpanNearQuery(new SpanQuery[]{quick, fox}, 5, true);
// "quick brown fox" matches (1 position between the terms)
// "quick ... fox" matches (up to 5 positions in between)
// "fox quick" doesn't match (ordered=true requires quick before fox)

// "quick" NOT within 10 positions of "lazy" (reuses quick from above)
SpanQuery lazy = new SpanTermQuery(new Term("body", "lazy"));
SpanQuery notNear = new SpanNotQuery(quick, lazy, 10);

// Complex: "a" near "b" near "c" within 20 positions
SpanQuery a = new SpanTermQuery(new Term("body", "a"));
SpanQuery b = new SpanTermQuery(new Term("body", "b"));
SpanQuery c = new SpanTermQuery(new Term("body", "c"));
SpanQuery ab = new SpanNearQuery(new SpanQuery[]{a, b}, 10, false);
SpanQuery abc = new SpanNearQuery(new SpanQuery[]{ab, c}, 20, false);

Why this matters: Span queries are for linguistic search - finding words near each other, in specific order, or NOT near each other. They use positions data (.pos file), so they're slower than TermQuery but much more expressive.
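
To show what "using positions data" means, here is a minimal sketch of the check an ordered near query performs for two terms: given each term's position list, is there a pair where the second term follows the first with at most slop positions in between? The whitespace tokenizer and all names are assumptions for illustration. Lucene reads positions from the index rather than re-tokenizing text.

```java
import java.util.stream.IntStream;

/** Ordered proximity check over term positions, in the spirit of an ordered two-term SpanNearQuery. */
final class SpanNearSketch {
    /** True if some occurrence of the first term is followed by the second with at most slop positions between. */
    static boolean orderedNear(int[] firstPositions, int[] secondPositions, int slop) {
        for (int p : firstPositions) {
            for (int q : secondPositions) {
                int gap = q - p - 1;  // positions strictly between the two terms
                if (gap >= 0 && gap <= slop) return true;
            }
        }
        return false;
    }

    /** Position list of a term in whitespace-tokenized text (a stand-in for indexed positions). */
    static int[] positions(String text, String term) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        return IntStream.range(0, tokens.length).filter(i -> tokens[i].equals(term)).toArray();
    }

    public static void main(String[] args) {
        String doc = "the quick brown fox";
        System.out.println(orderedNear(positions(doc, "quick"), positions(doc, "fox"), 5));
        System.out.println(orderedNear(positions("fox quick", "quick"), positions("fox quick", "fox"), 5));
    }
}
```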

Payload Queries - Custom Per-Position Data

// Index time: store custom payload with each term occurrence
TokenStream stream = analyzer.tokenStream("body", text);
PayloadAttribute payloadAttr = stream.addAttribute(PayloadAttribute.class);
// ... for each token, set payload = an encoded float weight, e.g. 0.9 ...

// Search time: boost by payload
PayloadScoreQuery query = new PayloadScoreQuery(
    new SpanTermQuery(new Term("body", "lucene")),
    new MaxPayloadFunction(),  // or AveragePayloadFunction, MinPayloadFunction
    PayloadDecoder.FLOAT_DECODER  // turns the payload bytes back into a float
);

Why this matters: Payloads let you attach custom metadata to each term occurrence. Use cases: part-of-speech tagging (boost nouns over verbs), entity weighting (boost "Apple" as company over "apple" as fruit), or custom signals.

Function Queries - Custom Scoring Functions

// Boost by numeric field values, using the expressions module (lucene-expressions)
Query base = new TermQuery(new Term("title", "lucene"));

SimpleBindings bindings = new SimpleBindings();
bindings.add("_score", DoubleValuesSource.SCORES);
bindings.add("timestamp", DoubleValuesSource.fromLongField("timestamp"));
bindings.add("views", DoubleValuesSource.fromIntField("views"));
bindings.add("now", DoubleValuesSource.constant(System.currentTimeMillis()));

// Recency boost: newer = higher score (decays with age in days)
// Combine: BM25 * recency * log(popularity)
Expression expr = JavascriptCompiler.compile(
    "_score * (1 / (1 + (now - timestamp) / 86400000)) * ln(1 + views)");

Query functionQuery = new FunctionScoreQuery(base, expr.getDoubleValuesSource(bindings));

CustomScoreQuery - Full Control

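// Note: CustomScoreQuery was deprecated in Lucene 7 and removed in 8.0.
// On current versions, use FunctionScoreQuery with a DoubleValuesSource instead.
Query base = new TermQuery(new Term("title", "lucene"));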

CustomScoreQuery customQuery = new CustomScoreQuery(base) {
    @Override
    protected CustomScoreProvider getCustomScoreProvider(LeafReaderContext context) {
        return new CustomScoreProvider(context) {
            @Override
            public float customScore(int doc, float subQueryScore, float[] valSrcScores) {
                // Custom scoring logic:
                // subQueryScore = BM25 score from base query
                // valSrcScores = scores from value sources (if any)

                float popularity = getPopularity(doc);  // your custom field
                return subQueryScore * (1 + popularity / 100.0f);
            }
        };
    }
};

MoreLikeThis - Find Similar Documents

// Find documents similar to document 42
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[]{"title", "body"});
mlt.setMinTermFreq(1);
mlt.setMinDocFreq(1);
mlt.setAnalyzer(analyzer);

Query likeQuery = mlt.like(42);  // Generate query from document 42
TopDocs similarDocs = searcher.search(likeQuery, 10);

// Or from text directly (like() takes the name of the field to analyze the text as):
Reader textReader = new StringReader("This is the reference text...");
Query textQuery = mlt.like("body", textReader);

How it works: MoreLikeThis extracts the most interesting terms from the input document (high TF, medium DF - not too rare, not too common), then builds a BooleanQuery with those terms.
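
The term-selection step can be sketched as "score each term by tf * idf, drop terms that are too rare or too common, keep the best few". The IDF formula, thresholds and names below are assumptions chosen for illustration. MoreLikeThis has its own defaults and scoring details.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Picks "interesting" terms from one document by tf * idf, within a document-frequency window. */
final class InterestingTermsSketch {
    static List<String> topTerms(Map<String, Integer> termFreqs, Map<String, Integer> docFreqs,
                                 int numDocs, int minDocFreq, int maxDocFreq, int maxTerms) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            if (df < minDocFreq || df > maxDocFreq) continue;    // too rare or too common
            double idf = Math.log(1.0 + (double) numDocs / df);  // illustrative idf variant
            scored.add(Map.entry(e.getKey(), Double.valueOf(e.getValue() * idf)));
        }
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(maxTerms, scored.size()); i++) out.add(scored.get(i).getKey());
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = Map.of("the", 20, "lucene", 5, "search", 3, "typo", 1);
        Map<String, Integer> df = Map.of("the", 1000, "lucene", 10, "search", 200, "typo", 1);
        // 1000 docs in the index; keep terms that appear in 5..500 docs, at most 2 terms
        System.out.println(topTerms(tf, df, 1000, 5, 500, 2));
    }
}
```

The selected terms would then become SHOULD clauses of a BooleanQuery, as described above.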

Fuzzy Matching Internals

// Levenshtein distance 2, prefix length 3
FuzzyQuery fuzzy = new FuzzyQuery(new Term("title", "lucne"), 2, 3);

// Internally:
// 1. Build Levenshtein automaton for "lucne" with max edits=2
// 2. Intersect automaton with FST term dictionary
// 3. Collect matching terms that start with "luc": "lucene" (1 edit), "lucie" (1 edit), "lucky" (2 edits)
// 4. Rewrite to BooleanQuery(TermQuery("lucene"), TermQuery("lucie"), TermQuery("lucky"))
// 5. Execute with maxExpansions limit (default 50)

Warning: Fuzzy queries with high edit distance or short prefix can expand to thousands of terms, causing performance issues. Always set a reasonable prefix length and limit.

Regex Query Optimization

// Fast regex: starts with literal prefix
RegexpQuery fast = new RegexpQuery(new Term("title", "lucene[0-9]+"));
// FST can find "lucene" prefix, then check regex on remaining characters

// Slow regex: no literal prefix
RegexpQuery slow = new RegexpQuery(new Term("title", "[a-z]+lucene"));
// Must scan entire term dictionary

Tip: Always structure regex queries to have a literal prefix. Lucene uses the FST to find the prefix, then applies the regex to a much smaller subset.
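
The effect of a literal prefix can be imitated with a sorted set of terms. With a prefix, only the contiguous sub-range of terms starting with it needs to be examined. Without one, every term does. This is an analogy, not Lucene's mechanism: Lucene compiles the pattern to an automaton and intersects it with the terms dictionary, and java.util.regex syntax differs from Lucene's regexp syntax. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Counts how many terms a regex scan examines with and without a literal prefix. */
final class RegexPrefixSketch {
    /** Returns {matches, termsExamined}; an empty literalPrefix forces a full scan. */
    static int[] scan(NavigableSet<String> terms, String literalPrefix, String regex) {
        Pattern pattern = Pattern.compile(regex);
        NavigableSet<String> candidates = literalPrefix.isEmpty()
            ? terms
            : terms.subSet(literalPrefix, true, literalPrefix + Character.MAX_VALUE, false);
        int matches = 0;
        int examined = 0;
        for (String term : candidates) {
            examined++;
            if (pattern.matcher(term).matches()) matches++;  // whole-term match
        }
        return new int[] {matches, examined};
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(Arrays.asList(
            "apache", "lucene", "lucene9", "lucene10", "lucid", "solr", "xlucene"));
        int[] fast = scan(terms, "lucene", "lucene[0-9]+");
        int[] slow = scan(terms, "", "[a-z]+lucene");
        System.out.println("with prefix:    " + fast[0] + " matches, " + fast[1] + " terms examined");
        System.out.println("without prefix: " + slow[0] + " matches, " + slow[1] + " terms examined");
    }
}
```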

Performance Tuning

JVM Tuning

# Heap size: Enough for FSTs, query caches, but leave room for OS page cache
-Xms16g -Xmx16g

# G1GC is generally best for Lucene workloads
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200

# Disable biased locking (only relevant up to JDK 14; deprecated and off by default since JDK 15, ignored by newer JDKs)
-XX:-UseBiasedLocking

# Large pages for heap (if supported by OS)
-XX:+UseLargePages

# Disable explicit GC calls (some libraries call System.gc())
-XX:+DisableExplicitGC

Why this matters: Lucene reads its index through memory-mapped files, so the OS page cache matters as much as the JVM heap. A 32GB machine with a 16GB heap leaves roughly 16GB for the page cache to hold hot index data. Giving the heap more than it needs takes memory away from that cache.

Directory Types

// MMapDirectory - memory-mapped files (default, fastest for most cases)
Directory dir = new MMapDirectory(Paths.get("/index"));
// Pros: OS page cache, no JVM heap, fast random access
// Cons: 64-bit only, may have issues with very large files on some OS

// NIOFSDirectory - NIO FileChannel
Directory dir = new NIOFSDirectory(Paths.get("/index"));
// Pros: Works on all platforms, predictable
// Cons: Slower than MMap, more system calls

// SimpleFSDirectory - plain RandomAccessFile (removed in Lucene 9)
Directory dir = new SimpleFSDirectory(Paths.get("/index"));
// Pros: Simple, no dependencies
// Cons: Slow, not recommended for production

// ByteBuffersDirectory - in-memory (replaces RAMDirectory, which was deprecated in 8.x and removed in 9.0)
Directory dir = new ByteBuffersDirectory();
// Use for: testing, small temporary indices

Recommendation: Use MMapDirectory on 64-bit systems. FSDirectory.open() already selects it there, and it is usually the fastest choice.

Merge Tuning

TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setMaxMergeAtOnce(10);          // Max segments to merge at once
mergePolicy.setSegmentsPerTier(10.0);       // Target segments per size tier
mergePolicy.setMaxMergedSegmentMB(5000);    // Max segment size (5GB)
mergePolicy.setFloorSegmentMB(2);           // Minimum segment size for tiering
mergePolicy.setForceMergeDeletesPctAllowed(10); // forceMergeDeletes() only rewrites segments with >10% deleted docs

config.setMergePolicy(mergePolicy);

// Throttle merges to avoid impacting search
ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
scheduler.setMaxMergesAndThreads(3, 2);  // 3 max merges, 2 threads
config.setMergeScheduler(scheduler);

Why this matters: Merges are I/O and CPU intensive. Too aggressive = search latency spikes. Too relaxed = too many segments = slow search. The defaults are good for most cases; tune if you have specific SLAs.

Indexing Throughput Optimization

// 1. Increase RAM buffer (more docs in memory before flush)
config.setRAMBufferSizeMB(256.0);  // Default 16MB. Try 256MB-512MB.

// 2. Use multiple indexing threads (DWPT handles this automatically)
// Each thread gets its own DWPT. Just index from multiple threads.

// 3. Disable unnecessary features
// If you don't need positions (note: phrase and span queries stop working on this field):
FieldType noPositions = new FieldType(TextField.TYPE_STORED);
noPositions.setIndexOptions(IndexOptions.DOCS_AND_FREQS);  // No positions
// Can noticeably shrink the index and speed up indexing (measure on your own data)

// 4. Bulk document addition
List<Document> docs = loadBatch();  // 100-1000 docs
writer.addDocuments(docs);  // Adds the batch atomically as one block of adjacent doc IDs

// 5. Disable norms if not needed
FieldType noNorms = new FieldType(TextField.TYPE_STORED);
noNorms.setOmitNorms(true);  // Saves 1 byte per doc per field

// 6. Use stored fields sparingly
// Only store fields you need to retrieve. Everything else = index only.

Search Latency Optimization

// 1. Warm the index
// On startup, run typical queries to warm OS page cache
for (Query warmQuery : typicalQueries) {
    searcher.search(warmQuery, 1);  // Don't care about results, just load data
}

// 2. Use filters for caching
Query filter = new TermQuery(new Term("status", "active"));
Query constantScore = new ConstantScoreQuery(filter);  // Cacheable

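// 3. Limit wildcard/prefix queries
// Cap the number of terms the query expands to
PrefixQuery query = new PrefixQuery(new Term("title", "a"));
query.setRewriteMethod(new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(100));  // Max 100 terms
// (Newer versions take the rewrite method as a constructor argument instead of a setter)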

// 4. Use query cache
LRUQueryCache queryCache = new LRUQueryCache(1000, 100_000_000);  // 1000 queries, 100MB
IndexSearcher searcher = new IndexSearcher(reader);
searcher.setQueryCache(queryCache);

// 5. Collector-level optimization
// If you only need count, don't score
int totalHits = searcher.count(query);

// 6. Parallel search across segments
ExecutorService executor = Executors.newFixedThreadPool(4);
IndexSearcher parallelSearcher = new IndexSearcher(reader, executor);

Cache Configuration

// Query cache (caches query results per segment)
LRUQueryCache queryCache = new LRUQueryCache(
    1000,    // Max cached queries
    100_000_000,  // Max cache size in bytes (100MB)
    context -> true,  // Which segments to cache on (here: all of them)
    1.0f     // skipCacheFactor: how much costlier than the rest of the query a clause may be and still get cached
);

// Field cache (for DocValues, automatically managed)
// No configuration needed - DocValues are memory-mapped

// Filter cache (caches filter bitsets)
// The Filter API (including CachingWrapperFilter) was removed in Lucene 6;
// FILTER clauses are ordinary queries and are cached by LRUQueryCache

Why this matters: Caching is crucial for repeated queries. A faceted e-commerce search often runs the same base query with different filter combinations. Caching the base query's per-segment matches can remove a large share of the work for each variation. How much depends on the hit rate, so measure it.
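
The eviction behaviour behind an LRU cache can be sketched with a LinkedHashMap in access order. This toy version is keyed by an arbitrary key and counts hits and misses. Lucene's LRUQueryCache is more involved (it caches per-segment doc ID sets and bounds memory as well as entry count). The class name and counters are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Minimal LRU cache with hit/miss counters, built on LinkedHashMap's access order. */
final class LruCacheSketch<K, V> {
    private final Map<K, V> map;
    int hits = 0;
    int misses = 0;

    LruCacheSketch(final int maxEntries) {
        // accessOrder = true: iteration order is least-recently-used first
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;  // evict once the bound is exceeded
            }
        };
    }

    /** Returns the cached value, or computes, caches and returns it on a miss. */
    V get(K key, Function<K, V> compute) {
        V value = map.get(key);
        if (value != null) {
            hits++;
            return value;
        }
        misses++;
        value = compute.apply(key);
        map.put(key, value);
        return value;
    }

    boolean contains(K key) { return map.containsKey(key); }

    public static void main(String[] args) {
        LruCacheSketch<String, Integer> cache = new LruCacheSketch<>(2);
        for (String q : new String[] {"status:active", "brand:acme", "status:active", "color:red"}) {
            cache.get(q, String::length);  // stand-in for running the query
        }
        System.out.println("hits=" + cache.hits + " misses=" + cache.misses
            + " brand:acme cached=" + cache.contains("brand:acme"));
    }
}
```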

Production Operations

Backup and Restore

// Backup using SnapshotDeletionPolicy
SnapshotDeletionPolicy snapshotPolicy = new SnapshotDeletionPolicy(
    new KeepOnlyLastCommitDeletionPolicy()
);
config.setIndexDeletionPolicy(snapshotPolicy);

// Take a snapshot
IndexCommit commit = snapshotPolicy.snapshot();
Collection<String> fileNames = commit.getFileNames();
// Copy all fileNames to backup location

// Release snapshot when done
snapshotPolicy.release(commit);

// Restore: just copy files back and open
Directory restoredDir = FSDirectory.open(Paths.get("/restored"));
// Verify with CheckIndex

Index Corruption Recovery

# CheckIndex tool - verify and optionally fix index (lucene-core must be on the classpath)
java -cp /path/to/lucene-core.jar org.apache.lucene.index.CheckIndex /path/to/index

# Output shows:
# - Segment integrity
# - File checksums
# - Doc count verification
# - Orphaned file detection

# Fix index: drops corrupt segments, so every document in them is lost. Back up first.
java -cp /path/to/lucene-core.jar org.apache.lucene.index.CheckIndex /path/to/index -exorcise

// Programmatic check
try (CheckIndex checkIndex = new CheckIndex(dir)) {
    CheckIndex.Status status = checkIndex.checkIndex();
    if (status.clean) {
        System.out.println("Index is clean!");
    } else {
        System.err.println("Index has problems: " + status.numBadSegments + " bad segment(s)");
    }
}

Index Migration Between Versions

// Lucene 8 to 9: run Lucene 9's IndexUpgrader (it rewrites every segment with the current default codec)
IndexUpgrader upgrader = new IndexUpgrader(dir);
upgrader.upgrade();

// Or simply open with the newer version and force merge
// Either way, Lucene only reads indexes created by the current or the previous major version
IndexWriterConfig config = new IndexWriterConfig(analyzer);  // uses the newest codec by default
IndexWriter writer = new IndexWriter(dir, config);
writer.forceMerge(1);  // Merge all segments to new format
writer.close();

Monitoring and Metrics

// IndexWriter metrics
IndexWriter writer = new IndexWriter(dir, config);

// Number of segments in the latest commit (listAll() would count files, not segments)
SegmentInfos segmentInfos = SegmentInfos.readLatestCommit(dir);
int numSegments = segmentInfos.size();

// Merge statistics
boolean mergesPending = writer.hasPendingMerges();
// For per-merge details, enable the info stream: config.setInfoStream(System.out)

// Segment info
for (SegmentCommitInfo seg : segmentInfos) {
    System.out.println("Segment: " + seg.info.name);
    System.out.println("  Docs: " + seg.info.maxDoc());
    System.out.println("  Size: " + seg.sizeInBytes());
    System.out.println("  Del: " + seg.getDelCount());
}

// Searcher metrics
IndexSearcher searcher = new IndexSearcher(reader);
// Track query latency, cache hit rates, segment counts externally

Hot/Warm/Cold Architecture

┌─────────────────────────────────────────────┐
│  HOT TIER (SSD, Recent Data)                │
│  ├── Last 7 days of logs                    │
│  ├── Active products                        │
│  └── Frequent queries → cached in RAM       │
├─────────────────────────────────────────────┤
│  WARM TIER (SSD, Older Data)                │
│  ├── Last 30 days of logs                   │
│  ├── Seasonal products                      │
│  └── Less frequent queries                  │
├─────────────────────────────────────────────┤
│  COLD TIER (HDD/S3, Archive)                │
│  ├── Historical data                        │
│  ├── Force-merged to 1 segment              │
│  └── On-demand loading                      │
└─────────────────────────────────────────────┘

Implementation: Use multiple indices and MultiReader or application-level routing. Elasticsearch's ILM (Index Lifecycle Management) does this automatically.

Lucene in the Wild

How Elasticsearch Uses Lucene

Elasticsearch Cluster
  ├── Node 1
  │     ├── Shard 0 (Primary) → Lucene Index
  │     │     ├── Segment 1
  │     │     ├── Segment 2
  │     │     └── Segment 3
  │     └── Shard 2 (Replica) → Lucene Index
  ├── Node 2
  │     ├── Shard 1 (Primary) → Lucene Index
  │     └── Shard 0 (Replica) → Lucene Index
  └── Node 3
        ├── Shard 2 (Primary) → Lucene Index
        └── Shard 1 (Replica) → Lucene Index

ES adds:
  - Distributed architecture (cluster coordination)
  - REST API
  - Document-level operations (CRUD)
  - Mapping/schema management
  - Aggregations framework (on top of DocValues)
  - Replication and failover
  - Index lifecycle management
  - Machine learning integration
  - Ingest pipelines

Elasticsearch's refresh: refresh_interval (default 1s) triggers NRT reader reopen. This is why ES has 1-second visibility latency.

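Elasticsearch's flush: the translog provides durability between commits. Each acknowledged write is appended to the translog. When the translog grows past flush_threshold_size, Elasticsearch performs a Lucene commit and trims the translog. The Lucene commit is what makes the segments themselves durable.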

How OpenSearch Differs

OpenSearch is a fork of Elasticsearch 7.10.2. Lucene usage is identical at the core level. Differences:

OpenSearch focuses on open-source governance (no proprietary license changes)
Some plugins differ (security, alerting, ML)
Version alignment: OpenSearch 2.x uses Lucene 9.x

How Solr Uses Lucene

Solr Core
  ├── Lucene Index (same as above)
  ├── Schema (managed, with field types)
  ├── Request Handlers (/select, /update, etc.)
  ├── Update Processors (custom indexing logic)
  ├── Search Components (faceting, highlighting, grouping)
  └── Replication (master-slave or SolrCloud)

Solr adds:
  - XML/JSON config-based schema
  - Rich search components (facet, stats, cluster, etc.)
  - SolrCloud (ZooKeeper-based distributed coordination)
  - Built-in faceting (more mature than early ES)

Case Study: Wikipedia Search

Index size: ~20TB of text
Documents: 6M+ articles, with revisions
Queries: 10,000+ QPS
Lucene usage: Elasticsearch-based deployment (via MediaWiki's CirrusSearch extension) with:
- Custom analyzers for 300+ languages
- BKD trees for geo search (coordinates in articles)
- Suggesters for autocomplete
- Custom scoring (boost by article quality, recency)
- Faceting for categories, namespaces

Case Study: Netflix Search

Index size: ~100K titles, but rich metadata per title
Queries: Complex boolean with personal preference vectors
Lucene usage: Elasticsearch with:
- Custom analyzers for multi-language content
- DocValues for runtime fields (personalization scores)
- KNN vector search for semantic recommendations
- Custom rescorer for ML-based ranking

Recent Development & Roadmap

Lucene 9.x Features (Current)

KNN Vector Search (HNSW):

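// Index vectors (KnnVectorField was renamed KnnFloatVectorField in Lucene 9.5)
float[] vector = embeddingModel.embed("document text");
doc.add(new KnnFloatVectorField("embedding", vector, VectorSimilarityFunction.COSINE));

// Search vectors: fetch the 100 nearest neighbors, return the top 10
float[] queryVector = embeddingModel.embed("query text");
Query knnQuery = new KnnFloatVectorQuery("embedding", queryVector, 100);
TopDocs results = searcher.search(knnQuery, 10);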

HNSW Internals:

Graph-based approximate nearest neighbor search
Layered graph: base layer (all nodes) + upper layers (sparse)
Search starts at top layer, greedily navigates to closest node, drops to next layer
New files: .vec (vector data), .vem (metadata), .vex (HNSW graph)
Merge complexity: HNSW graphs must be merged when segments merge (expensive)
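
For contrast, here is the exact search that HNSW approximates: score every vector against the query and keep the best k. It is a brute-force baseline, useful for seeing what "approximate" gives up and for checking recall on small data. The names are hypothetical and nothing here uses Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Exact (brute-force) cosine nearest-neighbor search: the O(n) baseline that HNSW approximates. */
final class ExactKnnSketch {
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Indices of the k vectors most similar to the query, best first. */
    static List<Integer> topK(float[][] vectors, float[] query, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < vectors.length; i++) ids.add(i);
        ids.sort(Comparator.comparingDouble((Integer i) -> -cosine(vectors[i], query)));
        return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
    }

    public static void main(String[] args) {
        float[][] vectors = {{1f, 0f}, {0f, 1f}, {0.9f, 0.1f}, {-1f, 0f}};
        System.out.println(topK(vectors, new float[] {1f, 0f}, 2));
    }
}
```

Every query here touches every vector. The layered graph exists to visit only a small neighborhood instead, at the cost of occasionally missing a true neighbor.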

MaxScore/WAND Optimization:

Block-level skipping for disjunction queries (OR)
A large share of non-competitive documents can be skipped for typical top-N queries
Major latency improvement for broad queries

Unified Highlighter:

Single highlighter implementation that works with postings, term vectors, or analysis
Replaces the confusing matrix of three different highlighters

Lucene99Codec:

Improved block compression for postings
Better DocValues compression (GCD, table-of-contents)

Lucene 10.x Plans

Java 21 Virtual Threads (Project Loom):

// Future: Concurrent indexing with virtual threads
// IndexWriter will use virtual threads for concurrent DWPT flushes
// IndexSearcher will use virtual threads for per-segment concurrent search
// No more thread pool management!

SIMD Scoring:

// Lucene 9.7+ can already use the Java Vector API for vector similarity (dot product, cosine)
// Future: apply SIMD to BM25 scoring as well
// Multiple document scores computed in parallel using SIMD instructions

Vector Search Maturity:

Incremental HNSW updates (currently bulk-only)
Deletion support in HNSW graphs
Multi-vector fields (one doc, multiple vectors)
Better integration with BKD for hybrid queries (vector + filter)

New Codec:

Lucene 10 codec with rethought postings format
Possibly Roaring Bitmaps for doc IDs
Better skip lists for faster conjunctions
Backward-incompatible: migration tools provided

Cloud-Native Index Format:

Index structures designed for object storage (S3)
Lazy loading of segments, terms, and postings
Reduced local disk requirements

Contributing to Lucene

How to Read the Code

Start with the entry points:

Indexing flow: IndexWriter.addDocument() → IndexingChain.processDocument() (DefaultIndexingChain before Lucene 9) → FreqProxTermsWriterPerField.addTerm() → flush() → writeSegment()
Search flow: IndexSearcher.search() → createWeight() → Weight.scorer() → BulkScorer.score() → TopScoreDocCollector.collect()
Codec flow: Lucene99Codec → Lucene99PostingsFormat → BlockTreeTermsReader → FST + PostingsReader

Key tracing technique:

// Enable debug logging to see everything
config.setInfoStream(System.out);

// Or use a file for analysis
config.setInfoStream(new PrintStream("/tmp/lucene-indexing.log"));

How to Run Tests

# Clone and build
git clone https://github.com/apache/lucene.git
cd lucene
./gradlew assemble

# Run all tests (takes hours!)
./gradlew test

# Run specific module tests
./gradlew :lucene:core:test

# Run specific test class
./gradlew :lucene:core:test --tests "TestIndexWriter"

# Run specific test method
./gradlew :lucene:core:test --tests "TestIndexWriter.testCommit"

# Run with random seed (for reproducibility)
./gradlew :lucene:core:test --tests "TestIndexWriter" -Ptests.seed=DEADBEEF

Lucene uses randomized testing: Tests run with different random seeds, document counts, and merge policies to catch edge cases. If a test fails, note the seed - you can reproduce it.
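
The reason a seed is enough to reproduce a failure is that every "random" choice is derived from it. A minimal sketch with java.util.Random follows. The parameter names are hypothetical, and Lucene's test framework derives far more than two values.

```java
import java.util.Arrays;
import java.util.Random;

/** Shows why reporting the seed is enough to replay a randomized test run. */
final class SeededTestSketch {
    /** Derives pseudo-random test parameters from a seed, as a randomized harness might. */
    static int[] params(long seed) {
        Random random = new Random(seed);
        int numDocs = 1 + random.nextInt(1000);         // how many documents to index
        int maxBufferedDocs = 2 + random.nextInt(100);  // when to flush
        return new int[] {numDocs, maxBufferedDocs};
    }

    public static void main(String[] args) {
        long seed = 0xDEADBEEFL;
        System.out.println("run 1: " + Arrays.toString(params(seed)));
        System.out.println("run 2: " + Arrays.toString(params(seed)));  // identical to run 1
    }
}
```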

How to Submit a PR

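Issue first: Open an issue at https://github.com/apache/lucene/issues. Lucene moved its issue tracking from JIRA to GitHub Issues in 2022; older LUCENE-NNNN tickets still live at https://issues.apache.org/jira/projects/LUCENE
Discuss: For significant changes, email dev@lucene.apache.org
Fork and branch: git checkout -b fix-description
Code: Follow the style guide (formatting is enforced by the build; ./gradlew tidy applies it)
Test: Add unit tests. Lucene requires tests for every bug fix and feature.
Commit: Use a brief, descriptive message that references the GitHub issue or PR number (older commits use the LUCENE-12345: prefix from the JIRA days)
PR: Submit via GitHub. The source moved from SVN to Git years ago, and contributions now arrive as GitHub PRs.
Review: Address feedback from committers. Typical review cycle: 1-3 rounds.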

Code Review Process

Minimum 1 committer approval required
Tests must pass (GitHub Actions CI)
Backwards compatibility: Lucene is strict about API compatibility within major versions
Documentation: Javadoc for public APIs, CHANGES.txt entry

Where Everything Lives in the Codebase

Repository Structure Overview

lucene/
├── lucene/                          # Main code modules
│   ├── core/                        # Core indexing and search
│   ├── analysis/                    # Analyzers and tokenizers
│   ├── codecs/                      # Codec implementations
│   ├── demo/                        # Demo applications
│   ├── facet/                       # Faceting module
│   ├── grouping/                    # Result grouping
│   ├── highlighter/                 # Highlighting implementations
│   ├── join/                        # Parent/child joins
│   ├── memory/                      # Memory-based indices
│   ├── misc/                        # Miscellaneous utilities
│   ├── queries/                     # Additional query types
│   ├── queryparser/                 # Query parsers
│   ├── suggest/                     # Autocomplete/suggest
│   ├── benchmark/                   # Performance benchmarks
│   └── test-framework/              # Testing utilities
├── gradle/                          # Gradle build files
├── dev-docs/                        # Developer documentation
└── versions.lock                    # Dependency versions

Concept-to-Code Mapping

Concept	Package	Key Files
Inverted Index	`lucene/core/src/java/org/apache/lucene/index/`	`IndexWriter.java`, `IndexingChain.java` (`DefaultIndexingChain.java` before 9.0), `FreqProxTermsWriter.java`
Postings Format	`lucene/core/src/java/org/apache/lucene/codecs/lucene99/`	`Lucene99PostingsFormat.java`, `Lucene99PostingsReader.java`, `Lucene99PostingsWriter.java`
FST Term Dictionary	`lucene/core/src/java/org/apache/lucene/util/fst/`	`FST.java`, `FSTEnum.java`, `Util.java`
Term Dictionary Reader	`lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/`	`Lucene90BlockTreeTermsReader.java`, `Lucene90BlockTreeTermsWriter.java`
BKD Tree	`lucene/core/src/java/org/apache/lucene/util/bkd/`	`BKDReader.java`, `BKDWriter.java`
DocValues Format	`lucene/core/src/java/org/apache/lucene/codecs/lucene90/`	`Lucene90DocValuesFormat.java`, `Lucene90DocValuesConsumer.java`, `Lucene90DocValuesProducer.java`
BM25 Scoring	`lucene/core/src/java/org/apache/lucene/search/similarities/`	`BM25Similarity.java`, `Similarity.java`, `SimilarityBase.java`
IndexWriter	`lucene/core/src/java/org/apache/lucene/index/`	`IndexWriter.java`, `DocumentsWriter.java`, `DocumentsWriterPerThread.java`
IndexReader	`lucene/core/src/java/org/apache/lucene/index/`	`IndexReader.java`, `DirectoryReader.java`, `SegmentReader.java`, `StandardDirectoryReader.java`
IndexSearcher	`lucene/core/src/java/org/apache/lucene/search/`	`IndexSearcher.java`, `TopDocs.java`, `ScoreDoc.java`
BooleanQuery	`lucene/core/src/java/org/apache/lucene/search/`	`BooleanQuery.java`, `BooleanWeight.java`, `BooleanScorer.java`
TermQuery	`lucene/core/src/java/org/apache/lucene/search/`	`TermQuery.java` (contains the inner `TermWeight` class), `TermScorer.java`
PhraseQuery	`lucene/core/src/java/org/apache/lucene/search/`	`PhraseQuery.java`, `PhraseWeight.java`, `PhraseScorer.java`
Query Parsing	`lucene/queryparser/src/java/org/apache/lucene/queryparser/classic/`	`QueryParser.java`, `QueryParserBase.java`, `ParseException.java`
Analysis Pipeline	`lucene/core/src/java/org/apache/lucene/analysis/`	`Analyzer.java`, `TokenStream.java`, `Tokenizer.java`, `TokenFilter.java`
StandardTokenizer	`lucene/core/src/java/org/apache/lucene/analysis/standard/`	`StandardTokenizer.java`, `StandardAnalyzer.java` (`StandardFilter` was removed in Lucene 8)
Merge Policy	`lucene/core/src/java/org/apache/lucene/index/`	`TieredMergePolicy.java`, `LogByteSizeMergePolicy.java`, `MergePolicy.java`, `MergeScheduler.java`
ConcurrentMergeScheduler	`lucene/core/src/java/org/apache/lucene/index/`	`ConcurrentMergeScheduler.java`, `MergeScheduler.java`
HNSW Vectors	`lucene/core/src/java/org/apache/lucene/util/hnsw/`	`HnswGraphBuilder.java`, `HnswGraphSearcher.java`, `HnswGraph.java`, `RandomAccessVectorValues.java`
KnnVectorQuery	`lucene/core/src/java/org/apache/lucene/search/`	`KnnFloatVectorQuery.java` (`KnnVectorQuery.java` before 9.5), `KnnCollector.java`
Stored Fields Format	`lucene/core/src/java/org/apache/lucene/codecs/lucene90/`	`Lucene90StoredFieldsFormat.java`, `compressing/Lucene90CompressingStoredFieldsFormat.java`
Norms Format	`lucene/core/src/java/org/apache/lucene/codecs/lucene90/`	`Lucene90NormsFormat.java`, `Lucene90NormsConsumer.java`, `Lucene90NormsProducer.java`
Codec Framework	`lucene/core/src/java/org/apache/lucene/codecs/`	`Codec.java`, `PostingsFormat.java`, `DocValuesFormat.java`, `StoredFieldsFormat.java`
Lucene99Codec	`lucene/core/src/java/org/apache/lucene/codecs/lucene99/`	`Lucene99Codec.java`
Directory Abstraction	`lucene/core/src/java/org/apache/lucene/store/`	`Directory.java`, `FSDirectory.java`, `MMapDirectory.java`, `NIOFSDirectory.java`, `ByteBuffersDirectory.java` (replaces the removed `RAMDirectory`)
Document/Field	`lucene/core/src/java/org/apache/lucene/document/`	`Document.java`, `Field.java`, `TextField.java`, `StringField.java`, `IntPoint.java`, `NumericDocValuesField.java`, `StoredField.java`
QueryVisitor	`lucene/core/src/java/org/apache/lucene/search/`	`QueryVisitor.java`, `Query.java`
Collector Framework	`lucene/core/src/java/org/apache/lucene/search/`	`Collector.java`, `TopDocsCollector.java`, `TopScoreDocCollector.java`, `TopFieldCollector.java`
Scorer	`lucene/core/src/java/org/apache/lucene/search/`	`Scorer.java`, `BulkScorer.java`, `Weight.java` (contains the inner `DefaultBulkScorer` class)
Weight	`lucene/core/src/java/org/apache/lucene/search/`	`Weight.java`, `TermQuery.java` (inner `TermWeight` class), `BooleanWeight.java`
CheckIndex	`lucene/core/src/java/org/apache/lucene/index/`	`CheckIndex.java`
IndexUpgrader	`lucene/core/src/java/org/apache/lucene/index/`	`IndexUpgrader.java`
Near-Real-Time Reader	`lucene/core/src/java/org/apache/lucene/index/`	`DirectoryReader.java` (open method), `StandardDirectoryReader.java`
Faceting	`lucene/facet/src/java/org/apache/lucene/facet/`	`Facets.java`, `FacetsCollector.java`, `FastTaxonomyFacetCounts.java`, `SortedSetDocValuesFacetCounts.java`
Highlighting	`lucene/highlighter/src/java/org/apache/lucene/search/`	`highlight/Highlighter.java`, `highlight/QueryScorer.java`, `highlight/Fragmenter.java`, `uhighlight/UnifiedHighlighter.java`
Suggest/Autocomplete	`lucene/suggest/src/java/org/apache/lucene/search/suggest/`	`Lookup.java`, `analyzing/AnalyzingInfixSuggester.java`, `analyzing/FuzzySuggester.java`, `tst/TSTLookup.java`
Parent/Child Joins	`lucene/join/src/java/org/apache/lucene/search/join/`	`ToParentBlockJoinQuery.java`, `ToChildBlockJoinQuery.java`, `BlockJoinSelector.java`
MoreLikeThis	`lucene/queries/src/java/org/apache/lucene/queries/mlt/`	`MoreLikeThis.java`
Function Queries	`lucene/queries/src/java/org/apache/lucene/queries/function/`	`FunctionScoreQuery.java`, `FunctionQuery.java`, `ValueSource.java`
Span Queries	`lucene/queries/src/java/org/apache/lucene/queries/spans/` (moved out of core in 9.0)	`SpanQuery.java`, `SpanTermQuery.java`, `SpanNearQuery.java`, `SpanNotQuery.java`
FuzzyQuery	`lucene/core/src/java/org/apache/lucene/search/`	`FuzzyQuery.java`, `FuzzyTermsEnum.java`
RegexpQuery	`lucene/core/src/java/org/apache/lucene/search/`	`RegexpQuery.java`, `AutomatonQuery.java`
WildcardQuery	`lucene/core/src/java/org/apache/lucene/search/`	`WildcardQuery.java`
PrefixQuery	`lucene/core/src/java/org/apache/lucene/search/`	`PrefixQuery.java`
RangeQuery (Point)	`lucene/core/src/java/org/apache/lucene/search/`	`PointRangeQuery.java`, `IntPoint.java`, `LongPoint.java`, `DoublePoint.java`
Cache	`lucene/core/src/java/org/apache/lucene/search/`	`LRUQueryCache.java`, `QueryCache.java`, `QueryCachingPolicy.java`
GeoPoint	`lucene/core/src/java/org/apache/lucene/geo/`	`Point.java`, `Rectangle.java`, `Polygon.java`
Geo Search	`lucene/core/src/java/org/apache/lucene/document/`	`LatLonPoint.java` (box, distance, and polygon query factories), `LatLonShape.java`
Test Framework	`lucene/test-framework/src/java/org/apache/lucene/tests/`	`LuceneTestCase.java`, `BaseTokenStreamTestCase.java`, `RandomIndexWriter.java`

How to Navigate the Code

Entry Points for Understanding:

Indexing Flow:

   IndexWriter.addDocument() [index/IndexWriter.java]
     → DocumentsWriter.updateDocument() [index/DocumentsWriter.java]
       → DocumentsWriterPerThread.updateDocument() [index/DocumentsWriterPerThread.java]
         → IndexingChain.processDocument() [index/IndexingChain.java; DefaultIndexingChain.java before 9.0]
           → FreqProxTermsWriterPerField.addTerm() [index/FreqProxTermsWriterPerField.java]

Search Flow:

   IndexSearcher.search() [search/IndexSearcher.java]
     → IndexSearcher.rewrite() [search/IndexSearcher.java]
       → Query.rewrite() [search/Query.java]
     → IndexSearcher.createWeight() [search/IndexSearcher.java]
       → Query.createWeight() [search/Query.java]
     → Weight.scorer() [search/Weight.java]
       → TermWeight.scorer() [search/TermQuery.java inner class]
         → TermScorer constructor [search/TermScorer.java]
     → BulkScorer.score() [search/BulkScorer.java]
       → TopScoreDocCollector.collect() [search/TopScoreDocCollector.java]

Codec Flow:

   Lucene99Codec [codecs/lucene99/Lucene99Codec.java]
     → Lucene99PostingsFormat [codecs/lucene99/Lucene99PostingsFormat.java]
       → Lucene90BlockTreeTermsWriter [codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java]
         → FST [util/fst/FST.java]
       → Lucene99PostingsWriter [codecs/lucene99/Lucene99PostingsWriter.java]
     → Lucene90DocValuesFormat [codecs/lucene90/Lucene90DocValuesFormat.java]
       → Lucene90DocValuesConsumer [codecs/lucene90/Lucene90DocValuesConsumer.java]
     → Lucene90StoredFieldsFormat [codecs/lucene90/Lucene90StoredFieldsFormat.java]

Tips for Reading:

Use an IDE (IntelliJ, Eclipse) with "Navigate to Implementation" (Ctrl+Alt+B)
Start with the test files: TestIndexWriter.java, TestIndexSearcher.java, TestTermQuery.java
Read Javadoc comments - they're comprehensive
Follow the // NOTE comments in the code - they often explain design decisions

Package-by-Package Breakdown

Package	What It Contains	Key Classes
`org.apache.lucene.index`	Everything about indexing, segments, merging, committing	`IndexWriter`, `IndexReader`, `DirectoryReader`, `SegmentReader`, `MergePolicy`, `TieredMergePolicy`, `DocumentsWriter`, `CheckIndex`
`org.apache.lucene.search`	Everything about querying, scoring, collecting results	`IndexSearcher`, `Query`, `Weight`, `Scorer`, `Collector`, `TopDocs`, `BooleanQuery`, `TermQuery`, `PhraseQuery`, `BM25Similarity`
`org.apache.lucene.analysis`	Text processing pipeline	`Analyzer`, `TokenStream`, `Tokenizer`, `TokenFilter`, `StandardTokenizer`, `LowerCaseFilter`, `StopFilter`
`org.apache.lucene.codecs`	On-disk format implementations, pluggable codecs	`Codec`, `PostingsFormat`, `DocValuesFormat`, `StoredFieldsFormat`, `Lucene99Codec`, `Lucene99PostingsFormat`
`org.apache.lucene.store`	I/O abstraction layer	`Directory`, `FSDirectory`, `MMapDirectory`, `NIOFSDirectory`, `ByteBuffersDirectory`, `IndexInput`, `IndexOutput`
`org.apache.lucene.util`	Data structures and utilities	`FST`, `BKDReader`, `BKDWriter`, `PackedInts`, `BytesRef`, `FixedBitSet`, `Bits`
`org.apache.lucene.document`	Field types and document model	`Document`, `Field`, `TextField`, `StringField`, `IntPoint`, `LongPoint`, `StoredField`, `NumericDocValuesField`
`org.apache.lucene.facet`	Faceting implementation	`Facets`, `FacetsCollector`, `FastTaxonomyFacetCounts`, `SortedSetDocValuesFacetCounts`, `DrillDownQuery`, `DrillSideways`
`org.apache.lucene.search.highlight`, `org.apache.lucene.search.uhighlight`	Highlighting implementations	`Highlighter`, `QueryScorer`, `Fragmenter`, `TokenSources`, `UnifiedHighlighter` (in `uhighlight`)
`org.apache.lucene.search.suggest`	Autocomplete and suggest	`Lookup`, `AnalyzingInfixSuggester`, `FuzzySuggester`, `AnalyzingSuggester`, `TSTLookup`
`org.apache.lucene.search.join`	Parent/child document joins	`ToParentBlockJoinQuery`, `ToChildBlockJoinQuery`, `BlockJoinSelector`, `QueryBitSetProducer`
`org.apache.lucene.queries`	Additional query implementations	`MoreLikeThis`, `MoreLikeThisQuery`, `FunctionScoreQuery`, `FunctionQuery`, `ValueSource`
`org.apache.lucene.queryparser`	Query parsers	`QueryParser` (classic), `StandardQueryParser` (flexible), `SimpleQueryParser`
`org.apache.lucene.geo`	Geospatial utilities	`Point`, `Rectangle`, `Polygon`, `Line`, `Tessellator`
`org.apache.lucene.benchmark`	Performance benchmarking	`Benchmark`, `PerfRunData`, `ContentSource`, `QueryMaker`
`org.apache.lucene.tests`	Test framework utilities	`LuceneTestCase`, `RandomIndexWriter`, `MockAnalyzer`, `BaseTokenStreamTestCase`

Common Pitfalls & Solutions

Too Many Segments

Symptom: Search latency increases over time, high file count.

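Cause: Merges that can't keep up with flushes, a small RAM buffer (or very frequent commits and NRT reopens) producing many tiny segments, or never calling forceMerge (named optimize in very old releases) on an index that has stopped changing.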

Solution:

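// Check segment count (each leaf of a reader corresponds to one segment)
int segmentCount;
try (DirectoryReader reader = DirectoryReader.open(writer)) {
    segmentCount = reader.leaves().size();
}
if (segmentCount > 50) {
    // Force merge to reduce segments (do during low traffic!)
    writer.forceMerge(10);  // Target 10 segments max
}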

// Or increase merge aggressiveness
TieredMergePolicy policy = new TieredMergePolicy();
policy.setSegmentsPerTier(5.0);  // Default 10, lower = more aggressive merging
config.setMergePolicy(policy);
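
To build intuition for why a lower segmentsPerTier leaves fewer segments behind, here is a deliberately simplified model in plain Java. It is a toy under loud assumptions, not TieredMergePolicy's actual selection algorithm: it pretends every flush produces an equal-sized segment and that any segmentsPerTier segments on the same level are merged into one immediately. TierModel and segmentsAfter are hypothetical names.

```java
// Toy model only: NOT how TieredMergePolicy really picks merges.
class TierModel {
    // Segments remaining after `flushes` equal-sized flushes, if every group of
    // `segmentsPerTier` same-level segments is merged into one next-level segment.
    static int segmentsAfter(int flushes, int segmentsPerTier) {
        if (segmentsPerTier < 2) {
            throw new IllegalArgumentException("segmentsPerTier must be >= 2");
        }
        int count = 0;
        for (int n = flushes; n > 0; n /= segmentsPerTier) {
            count += n % segmentsPerTier; // segments left unmerged on this level
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(segmentsAfter(999, 10)); // prints 27
        System.out.println(segmentsAfter(999, 5));  // prints 15
    }
}
```

In this model the lower setting ends with roughly half as many segments, at the cost of rewriting the same data through more merge levels, which is the trade-off behind "more aggressive merging".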

Merge Storms

Symptom: Sudden I/O spikes, search latency degradation.

Cause: Many small segments trigger cascading merges.

Solution:

// Throttle merges
ConcurrentMergeScheduler scheduler = new ConcurrentMergeScheduler();
scheduler.setMaxMergesAndThreads(2, 1);  // Max 2 merges, 1 thread
config.setMergeScheduler(scheduler);

// Or let the scheduler throttle merge I/O automatically
scheduler.enableAutoIOThrottle();
// Tune the two arguments of setMaxMergesAndThreads(maxMergeCount, maxThreadCount) to your I/O capacity

OOM During Indexing

Symptom: OutOfMemoryError during bulk indexing.

Cause: RAM buffer too large, or too many threads with DWPTs.

Solution:

// Reduce RAM buffer
config.setRAMBufferSizeMB(64.0);  // e.g. down from a 256 MB setting (the default is 16 MB)

// Or use max buffered docs instead
config.setMaxBufferedDocs(10000);
config.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);  // Disable RAM trigger

// Limit concurrent DWPTs (threads)
// Every thread that indexes concurrently gets its own DocumentsWriterPerThread (DWPT) buffer.
// IndexWriterConfig has no thread-count setting, so cap concurrency in your own code,
// e.g. a fixed thread pool of Math.min(8, Runtime.getRuntime().availableProcessors()) threads.

Slow Range Queries

Symptom: Range queries on numeric fields take seconds.

Cause: Using TextField for numeric data instead of IntPoint/LongPoint.

Solution:

// WRONG:
doc.add(new TextField("price", "49.99", Field.Store.YES));  // String comparison! Slow!

// RIGHT:
doc.add(new IntPoint("price", 4999));  // BKD tree! Fast!
doc.add(new StoredField("price", 4999));  // Store original for retrieval
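
The snippet above jumps from the string "49.99" to the integer 4999 without showing the conversion. One way to do it safely is sketched below in plain Java, assuming prices have at most two decimal places; PriceEncoding and toCents are hypothetical helper names, not Lucene API. BigDecimal avoids the rounding surprises of parsing through double.

```java
import java.math.BigDecimal;

// Hypothetical helper; not part of Lucene.
class PriceEncoding {
    // Convert a decimal price string to integer cents.
    // Throws ArithmeticException on sub-cent precision or int overflow
    // instead of silently truncating.
    static int toCents(String price) {
        return new BigDecimal(price).movePointRight(2).intValueExact();
    }

    public static void main(String[] args) {
        System.out.println(toCents("49.99")); // prints 4999
        System.out.println(toCents("50"));    // prints 5000
    }
}
```

The resulting int is what you would pass to IntPoint and StoredField; range queries then need the same cents encoding on the query side.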

Field Cache Explosion (Pre-4.0)

Symptom: OOM on first facet/sort query.

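Cause: The old FieldCache un-inverted a field's values into the heap the first time you sorted or faceted on it. Modern Lucene replaces it with DocValues, a columnar on-disk structure.

Solution: Index a DocValues field (for example NumericDocValuesField or SortedDocValuesField) for every field you sort or facet on. DocValues are not added automatically: a TextField or IntPoint on its own cannot be sorted on.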

Deleted Docs Overhead

Symptom: Index size grows despite deleting documents.

Cause: Deleted documents aren't removed until merge.

Solution:

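// Reclaim space from segments with many deleted docs
writer.forceMergeDeletes(true);  // true = block until done; targets segments above the policy threshold (10% by default)

// Or lower that threshold in the merge policy (set it before creating the IndexWriter)
TieredMergePolicy policy = new TieredMergePolicy();
policy.setForceMergeDeletesPctAllowed(5.0);  // forceMergeDeletes() now targets segments with > 5% deleted
config.setMergePolicy(policy);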

Analyzer Mismatch

Symptom: Query returns no results for words that exist in documents.

Cause: Index-time analyzer ≠ query-time analyzer.

Solution:

// Always use the same analyzer at index and query time
Analyzer analyzer = new EnglishAnalyzer();  // lowercases, removes stop words, stems with PorterStemFilter

IndexWriterConfig config = new IndexWriterConfig(analyzer);  // Index time
QueryParser parser = new QueryParser("body", analyzer);       // Query time

// Verify with token analysis
// "Foxes" at index time → ["fox"] (PorterStemFilter)
// "Foxes" at query time → ["fox"] (same analyzer) ✓
// If query time was KeywordAnalyzer → ["Foxes"] ✗ (no match!)
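
The failure mode is easy to reproduce without Lucene at all. The sketch below is a plain-Java stand-in, not real analysis: STEMMING is a crude hypothetical normalizer (lowercase plus plural stripping), KEYWORD passes text through unchanged, and the "index" is just a set of terms. It shows that an exact term lookup only succeeds when both sides normalize the same way.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical stand-ins for analyzers; not Lucene classes.
class AnalyzerMismatchDemo {
    // Crude normalizer: lowercase, then strip a trailing "es" or "s". Not a real stemmer.
    static final UnaryOperator<String> STEMMING = t -> {
        String s = t.toLowerCase(Locale.ROOT);
        if (s.endsWith("es")) return s.substring(0, s.length() - 2);
        if (s.endsWith("s")) return s.substring(0, s.length() - 1);
        return s;
    };

    // Pass-through, like an analyzer that emits the input as a single unchanged token.
    static final UnaryOperator<String> KEYWORD = t -> t;

    static boolean matches(String indexedText, UnaryOperator<String> indexAnalyzer,
                           String queryText, UnaryOperator<String> queryAnalyzer) {
        Set<String> indexedTerms = new HashSet<>();
        indexedTerms.add(indexAnalyzer.apply(indexedText));        // term stored at index time
        return indexedTerms.contains(queryAnalyzer.apply(queryText)); // exact term lookup
    }

    public static void main(String[] args) {
        System.out.println(matches("Foxes", STEMMING, "Foxes", STEMMING)); // prints true
        System.out.println(matches("Foxes", STEMMING, "Foxes", KEYWORD));  // prints false
    }
}
```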

Lock Issues

Symptom: LockObtainFailedException on IndexWriter open.

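Cause: Another IndexWriter, in this process or in another one, already holds the index's write.lock.

Solution:

// Use NativeFSLockFactory (default, most reliable)
Directory dir = new NIOFSDirectory(Paths.get("/index"), NativeFSLockFactory.INSTANCE);

// Or, if only one JVM ever opens the index, an in-memory lock:
Directory singleJvmDir = new NIOFSDirectory(Paths.get("/index"), new SingleInstanceLockFactory());

// Stale locks (if JVM crashed)
// NativeFSLockFactory relies on OS-level locks, which the OS releases when the process dies,
// so a leftover write.lock file is normally harmless. If the exception persists, look for
// a second IndexWriter that is still open before touching any files by hand.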

Complete Code Examples

Example 1: Basic Indexing and Search

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import java.nio.file.Paths;

public class BasicSearchExample {
    public static void main(String[] args) throws Exception {
        // 1. Setup
        Directory dir = new MMapDirectory(Paths.get("/tmp/lucene-index"));
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter writer = new IndexWriter(dir, config);

        // 2. Index documents
        Document doc1 = new Document();
        doc1.add(new TextField("title", "Lucene in Action", Field.Store.YES));
        doc1.add(new TextField("body", "Lucene is a powerful search library", Field.Store.YES));
        doc1.add(new IntPoint("year", 2024));
        doc1.add(new StoredField("year", 2024));
        writer.addDocument(doc1);

        Document doc2 = new Document();
        doc2.add(new TextField("title", "Search Engine Architecture", Field.Store.YES));
        doc2.add(new TextField("body", "Building search systems with Lucene and Elasticsearch", Field.Store.YES));
        doc2.add(new IntPoint("year", 2023));
        doc2.add(new StoredField("year", 2023));
        writer.addDocument(doc2);

        writer.commit();
        writer.close();

        // 3. Search
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser("body", analyzer);
        Query query = parser.parse("lucene search");

        TopDocs results = searcher.search(query, 10);
        System.out.println("Total hits: " + results.totalHits);

        for (ScoreDoc scoreDoc : results.scoreDocs) {
            Document doc = searcher.doc(scoreDoc.doc);
            System.out.printf("Score: %.2f, Title: %s%n", 
                scoreDoc.score, doc.get("title"));
        }

        reader.close();
        dir.close();
    }
}

Example 2: Custom Analyzer

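import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.CharsRef;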

public class CustomAnalyzerExample {
    public static Analyzer createEcommerceAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer tokenizer = new StandardTokenizer();
                TokenStream stream = new LowerCaseFilter(tokenizer);
                stream = new StopFilter(stream, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);

                // Synonyms
                SynonymMap.Builder builder = new SynonymMap.Builder(true);
                builder.add(new CharsRef("laptop"), new CharsRef("notebook"), true);
                builder.add(new CharsRef("phone"), new CharsRef("mobile"), true);
                try {
                    SynonymMap synonymMap = builder.build();
                    stream = new SynonymGraphFilter(stream, synonymMap, true);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }

                // Stemming
                stream = new PorterStemFilter(stream);

                // Edge n-grams for autocomplete (2-10 chars); false = don't also emit the original token
                stream = new EdgeNGramTokenFilter(stream, 2, 10, false);

                return new TokenStreamComponents(tokenizer, stream);
            }
        };
    }

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = createEcommerceAnalyzer();
        String text = "The quick brown foxes jump over the laptop!";

        try (TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println("Token: " + termAttr.toString());
            }
            stream.end();
        }
        // Output: qu, qui, quic, quick, br, bro, brow, brown, etc.
    }
}
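
If the output of the last filter looks surprising, it helps to see edge n-gram generation on its own. The following plain-Java sketch mimics what an edge n-gram filter emits for a single token when the original token is not preserved; EdgeNGrams is a hypothetical class, and it works on Java chars rather than Unicode code points, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of edge n-gram generation; not Lucene's implementation.
class EdgeNGrams {
    // All prefixes of `token` whose length is between minGram and maxGram (inclusive).
    // Tokens shorter than minGram produce no grams at all.
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, token.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("quick", 2, 10)); // prints [qu, qui, quic, quick]
        System.out.println(edgeNGrams("a", 2, 10));     // prints []
    }
}
```

Because every prefix becomes its own term, a user typing "qui" matches "quick" with a plain term lookup, at the cost of a noticeably larger index.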

Example 3: Custom Query

import java.io.IOException;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.util.*;

// A custom query that boosts documents containing a term within the first maxPosition positions
// (written against the Lucene 9.x Weight/Scorer API)
public class EarlyPositionBoostQuery extends Query {
    private final Term term;
    private final int maxPosition;
    private final float boost;

    public EarlyPositionBoostQuery(Term term, int maxPosition, float boost) {
        this.term = term;
        this.maxPosition = maxPosition;
        this.boost = boost;
    }

    @Override
    public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) throws IOException {
        Weight innerWeight = new TermQuery(term).createWeight(searcher, scoreMode, boost);

        return new Weight(this) {
            @Override
            public Scorer scorer(LeafReaderContext context) throws IOException {
                Scorer innerScorer = innerWeight.scorer(context);
                if (innerScorer == null) return null;

                return new Scorer(this) {
                    @Override
                    public DocIdSetIterator iterator() {
                        return innerScorer.iterator();
                    }

                    @Override
                    public float getMaxScore(int upTo) throws IOException {
                        return innerScorer.getMaxScore(upTo) * EarlyPositionBoostQuery.this.boost;
                    }

                    @Override
                    public float score() throws IOException {
                        int doc = innerScorer.docID();

                        // Check if the term appears before maxPosition
                        PostingsEnum postings = context.reader().postings(term, PostingsEnum.POSITIONS);
                        if (postings != null && postings.advance(doc) == doc) {
                            for (int i = 0; i < postings.freq(); i++) {
                                int pos = postings.nextPosition();
                                if (pos < maxPosition) {
                                    return innerScorer.score() * EarlyPositionBoostQuery.this.boost;
                                }
                            }
                        }
                        return innerScorer.score();
                    }

                    @Override
                    public int docID() {
                        return innerScorer.docID();
                    }
                };
            }

            @Override
            public boolean isCacheable(LeafReaderContext ctx) {
                return false;
            }

            @Override
            public Explanation explain(LeafReaderContext context, int doc) throws IOException {
                return innerWeight.explain(context, doc);
            }
        };
    }

    @Override
    public String toString(String field) {
        return "EarlyPositionBoost(" + term + ", pos<" + maxPosition + ", boost=" + boost + ")";
    }

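    @Override
    public boolean equals(Object other) {
        if (!sameClassAs(other)) return false;
        EarlyPositionBoostQuery that = (EarlyPositionBoostQuery) other;
        // Compare every field that affects scoring, so query caches never mix up two variants
        return term.equals(that.term)
            && maxPosition == that.maxPosition
            && Float.compare(boost, that.boost) == 0;
    }

    @Override
    public int hashCode() {
        int h = classHash() ^ term.hashCode();
        h = 31 * h + maxPosition;
        return 31 * h + Float.hashCode(boost);
    }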

    @Override
    public void visit(QueryVisitor visitor) {
        visitor.visitLeaf(this);
    }
}

Example 4: Custom Scorer

import java.io.IOException;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.*;

// A custom scorer that boosts recent documents
public class RecencyBoostScorer extends Scorer {
    private final Scorer innerScorer;
    private final long currentTime;
    private final float halfLifeDays;
    private final NumericDocValues timestampValues;

    public RecencyBoostScorer(Weight weight, Scorer innerScorer, 
                               NumericDocValues timestampValues, 
                               float halfLifeDays) {
        super(weight);
        this.innerScorer = innerScorer;
        this.timestampValues = timestampValues;
        this.currentTime = System.currentTimeMillis();
        this.halfLifeDays = halfLifeDays;
    }

    @Override
    public DocIdSetIterator iterator() {
        return innerScorer.iterator();
    }

    @Override
    public float getMaxScore(int upTo) throws IOException {
        return innerScorer.getMaxScore(upTo) * 2.0f;  // Max possible boost
    }

    @Override
    public float score() throws IOException {
        float baseScore = innerScorer.score();
        int doc = docID();

        if (timestampValues.advanceExact(doc)) {
            long docTime = timestampValues.longValue();
            long ageMs = Math.max(0L, currentTime - docTime);  // Clamp future timestamps so the boost never exceeds getMaxScore()'s 2x
            double ageDays = ageMs / (1000.0 * 60 * 60 * 24);
            double decay = Math.pow(0.5, ageDays / halfLifeDays);  // Exponential decay
            return baseScore * (float)(1.0 + decay);  // Recent docs get up to 2x boost
        }
        return baseScore;
    }

    @Override
    public int docID() {
        return innerScorer.docID();
    }
}
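
The decay formula inside score() is easier to sanity-check in isolation. This plain-Java sketch reproduces just the arithmetic, assuming the same half-life definition as above; RecencyBoost and multiplier are hypothetical names.

```java
// Hypothetical helper that isolates the decay arithmetic; not Lucene API.
class RecencyBoost {
    // Score multiplier in (1.0, 2.0]: 2.0 for a brand-new document,
    // 1.5 at exactly one half-life, approaching 1.0 for very old documents.
    static double multiplier(double ageDays, double halfLifeDays) {
        double age = Math.max(0.0, ageDays); // treat future timestamps as "now"
        return 1.0 + Math.pow(0.5, age / halfLifeDays);
    }

    public static void main(String[] args) {
        System.out.println(multiplier(0, 30));  // prints 2.0
        System.out.println(multiplier(30, 30)); // prints 1.5
        System.out.println(multiplier(60, 30)); // prints 1.25
    }
}
```

With a 30-day half-life, a two-month-old document keeps a 25% boost, so pick the half-life by asking how quickly "recent" should stop mattering for your content.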

Example 5: Facet Search

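import org.apache.lucene.facet.*;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;

// Assumes `analyzer` and `query` exist (see Example 1). Uses taxonomy-based facets,
// which keep facet labels in a separate taxonomy index next to the main index.
Directory dir = new MMapDirectory(Paths.get("/tmp/facet-index"));
Directory taxoDir = new MMapDirectory(Paths.get("/tmp/facet-taxonomy"));
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);

// Facet configuration (use the same settings at index and search time)
FacetsConfig facetConfig = new FacetsConfig();
facetConfig.setMultiValued("category", true);
facetConfig.setHierarchical("category", true);

Document doc = new Document();
doc.add(new TextField("title", "Product A", Field.Store.YES));

// Add facets as drill-down paths
doc.add(new FacetField("category", "Electronics", "Computers"));
doc.add(new FacetField("price_range", "100-200"));
doc.add(new FacetField("brand", "Apple"));

// build() turns the FacetFields into the low-level fields Lucene actually indexes
writer.addDocument(facetConfig.build(taxoWriter, doc));
writer.commit();
taxoWriter.commit();

// Search with faceting
DirectoryReader reader = DirectoryReader.open(dir);
TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir);
IndexSearcher searcher = new IndexSearcher(reader);

// Search and collect facets
FacetsCollector facetsCollector = new FacetsCollector();
TopDocs results = FacetsCollector.search(searcher, query, 10, facetsCollector);

// Get facet counts
Facets facets = new FastTaxonomyFacetCounts(taxoReader, facetConfig, facetsCollector);
FacetResult categoryResult = facets.getTopChildren(10, "category");                  // top-level categories
FacetResult computersResult = facets.getTopChildren(10, "category", "Electronics");  // children of Electronics
FacetResult priceResult = facets.getTopChildren(10, "price_range");

// Print facet counts
System.out.println("Categories:");
for (LabelAndValue lv : categoryResult.labelValues) {
    System.out.println("  " + lv.label + ": " + lv.value);
}
// Illustrative output for a larger catalog:
//   Electronics: 150
// and, from computersResult, the children of Electronics:
//   Computers: 80
//   Phones: 70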

Example 6: Highlight Search Results

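import org.apache.lucene.search.highlight.*;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter;

// Setup (assumes `analyzer`, `searcher`, and a stored "body" field as in Example 1)
QueryParser parser = new QueryParser("body", analyzer);
Query query = parser.parse("lucene search");
TopDocs results = searcher.search(query, 10);

// Using UnifiedHighlighter (recommended)
UnifiedHighlighter highlighter = new UnifiedHighlighter(searcher, analyzer);
// Returns one string per hit in `results`, each built from up to 3 passages
String[] snippets = highlighter.highlight("body", query, results, 3);

for (String snippet : snippets) {
    System.out.println(snippet);
}
// Example output: "<b>Lucene</b> is a powerful <b>search</b> library"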

// Using classic Highlighter (more control)
QueryScorer scorer = new QueryScorer(query);
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 100);
Highlighter classicHighlighter = new Highlighter(scorer);
classicHighlighter.setTextFragmenter(fragmenter);

TokenStream tokenStream = analyzer.tokenStream("body", doc.get("body"));
String snippet = classicHighlighter.getBestFragment(tokenStream, doc.get("body"));

Example 7: Spell Checking

import org.apache.lucene.search.spell.*;
import org.apache.lucene.index.*;

// Build spell index from existing index
Directory spellIndexDir = new MMapDirectory(Paths.get("/tmp/spell-index"));
SpellChecker spellChecker = new SpellChecker(spellIndexDir);

// Index the dictionary from the main index
Dictionary dictionary = new LuceneDictionary(reader, "title");
spellChecker.indexDictionary(dictionary, new IndexWriterConfig(analyzer), true);

// Suggest corrections
String word = "lucne";
int numSuggestions = 5;
String[] suggestions = spellChecker.suggestSimilar(word, numSuggestions);
// Possible output (depends entirely on the terms in the dictionary): ["lucene", "lucien", "lune", ...]

// Did-you-mean suggestion for a full query
String userQuery = "lucne serch";
String[] words = userQuery.split(" ");
StringBuilder didYouMean = new StringBuilder();
for (String w : words) {
    String[] similar = spellChecker.suggestSimilar(w, 1);
    didYouMean.append(similar.length > 0 ? similar[0] : w).append(" ");
}
System.out.println("Did you mean: " + didYouMean.toString().trim());
// Output: "Did you mean: lucene search"
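
Under the hood, ranking suggestions comes down to a string distance between the misspelled word and each candidate. As a rough illustration of the idea (a textbook Levenshtein distance in plain Java, not SpellChecker's own code, which also uses an n-gram index to find candidates and works with a normalized similarity), EditDistance below is a hypothetical class:

```java
// Textbook Levenshtein edit distance; a sketch, not Lucene's implementation.
class EditDistance {
    // Minimum number of single-character insertions, deletions,
    // or substitutions needed to turn `a` into `b`.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lucne", "lucene")); // prints 1
        System.out.println(levenshtein("serch", "search")); // prints 1
    }
}
```

Both typos from the example are a single edit away from the intended word, which is why they make good "did you mean" candidates.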

Example 8: MoreLikeThis

import org.apache.lucene.queries.mlt.MoreLikeThis;
import java.io.Reader;
import java.io.StringReader;

// Find documents similar to document 42
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setFieldNames(new String[]{"title", "body"});
mlt.setAnalyzer(analyzer);
mlt.setMinTermFreq(1);   // Ignore terms that appear less than this in source doc
mlt.setMinDocFreq(1);    // Ignore terms that appear in less than this many docs
mlt.setMaxQueryTerms(25); // Max terms to include in generated query

Query likeQuery = mlt.like(42);  // Generate query from doc 42
TopDocs similarDocs = searcher.search(likeQuery, 10);

System.out.println("Documents similar to #42:");
for (ScoreDoc sd : similarDocs.scoreDocs) {
    if (sd.doc != 42) {  // Exclude the source document
        Document doc = searcher.doc(sd.doc);
        System.out.printf("  Score: %.2f, Title: %s%n", sd.score, doc.get("title"));
    }
}

// Or from external text (analyzed as if it were the "body" field):
Reader textReader = new StringReader("Apache Lucene is a search library...");
Query fromTextQuery = mlt.like("body", textReader);

Example 9: Vector Search (KNN)

import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.index.VectorSimilarityFunction;

// Index documents with vector embeddings
float[] docVector = embeddingModel.embed("Lucene search library document");
Document doc = new Document();
doc.add(new TextField("title", "Lucene Guide", Field.Store.YES));
doc.add(new KnnVectorField("embedding", docVector, VectorSimilarityFunction.COSINE));
writer.addDocument(doc);

// Search by vector similarity
float[] queryVector = embeddingModel.embed("search engine library");
Query knnQuery = new KnnVectorQuery("embedding", queryVector, 100);

// Combine with text query for hybrid search
Query textQuery = new TermQuery(new Term("title", "lucene"));
Query hybridQuery = new BooleanQuery.Builder()
    .add(knnQuery, BooleanClause.Occur.SHOULD)
    .add(textQuery, BooleanClause.Occur.SHOULD)
    .build();

TopDocs results = searcher.search(hybridQuery, 10);

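// SHOULD clause scores are summed, so results are ordered by vector score + text score.
// The two scores are on different scales; boost one side if it dominates.
for (ScoreDoc sd : results.scoreDocs) {
    Document hit = searcher.doc(sd.doc);  // "doc" is already declared above
    System.out.printf("Score: %.4f, Title: %s%n", sd.score, hit.get("title"));
}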
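
To make the vector side of Example 9 less of a black box, here is a minimal plain-Java sketch of what a cosine KNN search computes. It is not Lucene code: CosineKnnSketch, cosine, and topK are names invented for this illustration, and it scores every vector exhaustively, whereas Lucene's KNN query walks an HNSW graph to find approximate neighbors without visiting them all.

```java
import java.util.Arrays;
import java.util.Comparator;

final class CosineKnnSketch {

    /** Cosine similarity: dot(a, b) / (|a| * |b|). Assumes non-zero vectors of equal length. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector dimensions differ");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Exact k-nearest neighbors: score every vector, sort descending, keep the first k ids. */
    static int[] topK(float[] query, float[][] docs, int k) {
        Integer[] ids = new Integer[docs.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        Arrays.sort(ids, Comparator.comparingDouble((Integer id) -> -cosine(query, docs[id])));
        int n = Math.min(k, ids.length);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = ids[i];
        }
        return out;
    }

    public static void main(String[] args) {
        float[][] docs = { {0f, 1f}, {1f, 0f}, {1f, 1f} };
        float[] query = {1f, 0f};
        System.out.println(Arrays.toString(topK(query, docs, 2)));  // prints [1, 2]
    }
}
```

The exhaustive loop is O(n) per query, which is why an index structure like HNSW matters once the collection grows.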

Example 10: Near-Real-Time (NRT) Search

import java.nio.file.Paths;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

// Setup
Directory dir = new MMapDirectory(Paths.get("/tmp/nrt-index"));
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(dir, config);

// Add initial document
writer.addDocument(doc1);
writer.commit();  // Durable commit

// Open initial NRT reader
DirectoryReader nrtReader = DirectoryReader.open(writer);
IndexSearcher searcher = new IndexSearcher(nrtReader);

// Search sees doc1
TopDocs results1 = searcher.search(new MatchAllDocsQuery(), 10);
System.out.println("Hits after initial: " + results1.totalHits.value);  // 1 (value() in Lucene 10+)

// Add new document WITHOUT commit!
writer.addDocument(doc2);
// No commit! But NRT reader can still see it after reopen

// Reopen NRT reader (sees uncommitted doc2!); openIfChanged returns null if nothing changed
DirectoryReader newReader = DirectoryReader.openIfChanged(nrtReader);
if (newReader != null) {
    nrtReader.close();
    nrtReader = newReader;
    searcher = new IndexSearcher(nrtReader);
}

// Search now sees doc2
TopDocs results2 = searcher.search(new MatchAllDocsQuery(), 10);
System.out.println("Hits after NRT reopen: " + results2.totalHits.value);  // 2 (value() in Lucene 10+)

// doc2 is NOT durable yet! If JVM crashes now, doc2 is lost.
writer.commit();  // Now doc2 is durable
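
The split between "searchable" and "durable" in Example 10 can be reduced to a toy model. The sketch below is plain Java, not Lucene: NrtToyModel and its methods are invented for this illustration, and two lists stand in for committed segments and not-yet-committed documents. It shows the same three behaviors: a reopened view sees uncommitted docs, an older view stays frozen at its point in time, and a crash before commit loses whatever was pending.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class NrtToyModel {
    // Stands in for committed (fsynced) segments.
    private final List<String> durable = new ArrayList<>();
    // Stands in for documents added since the last commit.
    private final List<String> pending = new ArrayList<>();

    void addDocument(String doc) {
        pending.add(doc);
    }

    /** Everything pending becomes durable. */
    void commit() {
        durable.addAll(pending);
        pending.clear();
    }

    /** A point-in-time view of durable + pending docs, like reopening an NRT reader. */
    List<String> openNrtSnapshot() {
        List<String> view = new ArrayList<>(durable);
        view.addAll(pending);
        return Collections.unmodifiableList(view);
    }

    /** A crash keeps only what was committed. */
    void simulateCrash() {
        pending.clear();
    }

    public static void main(String[] args) {
        NrtToyModel writer = new NrtToyModel();
        writer.addDocument("doc1");
        writer.commit();
        writer.addDocument("doc2");                         // no commit
        System.out.println(writer.openNrtSnapshot().size()); // prints 2
        writer.simulateCrash();
        System.out.println(writer.openNrtSnapshot().size()); // prints 1
    }
}
```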

Conclusion & Learning Path

What We Covered

We built an understanding of Lucene from the ground up:

The Inverted Index - The core data structure that makes search fast
Documents & Fields - How to structure data for indexing
Analysis - How text becomes searchable terms
IndexWriter - How documents flow from memory to immutable segments
IndexReader & IndexSearcher - How queries find and rank documents
BM25 Scoring - The math behind relevance ranking
Data Structures - FST, BKD, DocValues, Norms - each optimized for its job
Advanced Queries - Span, payload, function, fuzzy, regex
Performance Tuning - JVM, merge policy, directory types, caching
Production Operations - Backup, recovery, monitoring, hot/warm/cold
Codebase Navigation - Where every concept lives in the actual code

Key Takeaways

Concept	Remember This
Immutable Segments	Everything is append-only. Merge for cleanup. Enables concurrent readers.
Analyzer Consistency	Index-time and query-time analyzers must match. #1 bug source.
Field Types Matter	TextField for search, StringField for exact match, Point for ranges, DocValues for sort/facet.
FST + BKD + DocValues	Three specialized data structures. No one-size-fits-all.
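MAXSCORE/WAND	Modern Lucene (8.0+) can skip non-competitive docs without scoring them when exact hit counts are not needed. The 30-70% savings cited here depend on the query. This is the speed secret.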
NRT Search	Uncommitted docs are searchable. This is how ES gets 1-second refresh.
MMapDirectory	Let the OS cache hot data. Don't fill JVM heap with index data.
BM25	Default scoring since Lucene 6. Better than TF-IDF. k1=1.2, b=0.75.
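
To put numbers behind the k1 and b defaults in the last row, here is a small sketch of the textbook BM25 term score in plain Java. Bm25Sketch and its method names are made up for this illustration. Lucene's BM25Similarity differs in details (for example, it stores document length in a compressed norm), so treat the output as illustrative rather than as what searcher.explain() would report.

```java
final class Bm25Sketch {
    static final double K1 = 1.2;   // term-frequency saturation
    static final double B = 0.75;   // strength of document-length normalization

    /** Inverse document frequency: ln(1 + (N - n + 0.5) / (n + 0.5)). */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Textbook BM25 score of one term in one document. */
    static double termScore(double freq, double docLen, double avgDocLen,
                            long docCount, long docFreq) {
        // Longer-than-average documents get a larger denominator, hence a lower score.
        double lengthNorm = 1.0 - B + B * (docLen / avgDocLen);
        double tf = freq * (K1 + 1.0) / (freq + K1 * lengthNorm);
        return idf(docCount, docFreq) * tf;
    }

    public static void main(String[] args) {
        long docCount = 1000;
        // Same term frequency and length, rare term vs common term.
        System.out.println("rare term:   " + termScore(1, 100, 100, docCount, 10));
        System.out.println("common term: " + termScore(1, 100, 100, docCount, 500));
        // Same term, average-length vs long document.
        System.out.println("long doc:    " + termScore(1, 400, 100, docCount, 10));
        // Saturation: 100 occurrences score well under 100x one occurrence.
        System.out.println("freq = 100:  " + termScore(100, 100, 100, docCount, 10));
    }
}
```

Raising k1 lets repeated terms keep adding to the score for longer; setting b to 0 switches length normalization off.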

Recommended Learning Path

Week 1: Fundamentals

Read this guide's sections 1-6 (inverted index, components, storage/read journeys)
Build the basic indexing/search example (Example 1)
Experiment with different analyzers and observe token output
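
Before touching real analyzers, a toy version makes the "analyzers must match" rule concrete. The sketch below is plain Java with a HashMap as the inverted index; AnalyzerMismatchSketch and its two "analyzers" are hypothetical stand-ins, not Lucene classes. Indexing lowercases terms, so a query analyzed without lowercasing looks up a term that was never written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.Function;

final class AnalyzerMismatchSketch {

    /** Toy analyzer: lowercase, then split on non-word characters. */
    static List<String> lowercaseAnalyzer(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    /** Toy analyzer: split on whitespace only, keep the original case. */
    static List<String> whitespaceAnalyzer(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    /** Build term -> sorted doc ids, using the given analyzer at index time. */
    static Map<String, SortedSet<Integer>> index(List<String> docs,
                                                 Function<String, List<String>> analyzer) {
        Map<String, SortedSet<Integer>> postings = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : analyzer.apply(docs.get(id))) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return postings;
    }

    /** OR-search: union of the postings of every query term. */
    static SortedSet<Integer> search(Map<String, SortedSet<Integer>> postings, String query,
                                     Function<String, List<String>> analyzer) {
        SortedSet<Integer> hits = new TreeSet<>();
        for (String term : analyzer.apply(query)) {
            SortedSet<Integer> ids = postings.get(term);
            if (ids != null) hits.addAll(ids);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Apache Lucene Guide", "Solr and lucene");
        Map<String, SortedSet<Integer>> postings =
            index(docs, AnalyzerMismatchSketch::lowercaseAnalyzer);
        // Same analyzer on both sides: prints [0, 1]
        System.out.println(search(postings, "Lucene", AnalyzerMismatchSketch::lowercaseAnalyzer));
        // Mismatched query analyzer: prints []
        System.out.println(search(postings, "Lucene", AnalyzerMismatchSketch::whitespaceAnalyzer));
    }
}
```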

Week 2: Querying

Read sections 7-9 (analysis, queries, scoring)
Implement all query types from Examples 1-5
Build a small search application with boolean, phrase, and range queries
Debug scoring with searcher.explain(query, docId)

Week 3: Advanced Features

Read sections 10-12 (advanced queries, data structures, performance)
Implement faceting (Example 5) and highlighting (Example 6)
Add spell checking (Example 7) and MoreLikeThis (Example 8)
Profile search performance by timing IndexSearcher.search() calls, for example with System.nanoTime()

Week 4: Production & Internals

Read sections 13-18 (tuning, operations, codebase, contributing)
Download Lucene source code and trace through IndexWriter.addDocument()
Run CheckIndex on your test index and inspect the output
Read the actual BM25Similarity.java source code
Submit a small documentation fix PR to Lucene

Ongoing:

Follow the Lucene developer mailing list
Read GitHub issues labeled good first issue (Lucene moved issue tracking from JIRA to GitHub in 2022)
Benchmark your queries with lucene-benchmark

Resources for Further Learning

Resource	What It's For
Lucene Core Javadoc	API reference for every class
Lucene In Action	Deep dive book (covers older versions but concepts hold)
Tantivy	Rust search library inspired by Lucene's design - great for understanding the concepts in a different language
Elasticsearch Guide	Production search at scale (built on Lucene)
OpenSearch Documentation	Open-source alternative to Elasticsearch
Lucene GitHub Issues (formerly JIRA)	Track issues, understand the roadmap, find contributions
Lucene Wiki	Design documents, architecture decisions
Information Retrieval Book	Free textbook on IR theory behind BM25, scoring, etc.
Lucene/Solr Revolution Talks	Conference talks on real-world usage

Final Words

Lucene is 25+ years old and still the best search library in the world. That longevity comes from a simple, powerful design: immutable segments, pluggable components, and specialized data structures for each access pattern. Every search server you use builds on these foundations.

Understanding Lucene isn't just about using a library. It's about understanding how to organize data for fast retrieval, how to rank by relevance, and how to build systems that scale. These skills transfer to databases, caching systems, recommendation engines, and beyond.

The best way to learn is to build. Start with a simple index, add documents, run queries, and watch the magic happen. When something doesn't work, trace the data flow - from document to term to posting to score. The answers are all there, in the code.

Happy searching. 🔍

About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and open-source enthusiast. I contribute to Apache Lucene, OpenSearch, and related projects. Follow my work on GitHub.