The Hidden Cost of Append-Only: Why Lucene Constantly Rewrites Its Own Index
Apache Lucene's append-only segment architecture is elegant. Documents arrive, get buffered, flush to immutable segments, and those segments become searchable. But that elegance comes with a cost: over time, the index becomes fragmented. Hundreds or thousands of tiny segments accumulate, each with its own overhead - file handles, memory buffers, and the overhead of searching across many independent structures.
This is where segment merging comes in. Lucene doesn't just append data forever. Behind the scenes, it continuously rewrites the index, consolidating smaller segments into larger ones. This process is fundamental to Lucene's performance, yet it's often misunderstood by engineers who treat it as an implementation detail rather than a core operational concern.
Understanding segment merging is critical because misconfigured merge behavior can silently destroy search performance, spike I/O, or exhaust disk space. I've seen production Elasticsearch clusters brought down by uncontrolled merge storms. I've also seen clusters that never merged aggressively enough, leaving thousands of tiny segments and query latencies in the hundreds of milliseconds instead of single-digit milliseconds.
This post covers how Lucene's merge process works, when it triggers, how to tune it, and the production pitfalls that catch teams by surprise.
What Is Segment Merging and Why Does Lucene Need It?
Lucene's immutable segment design means documents are never updated in place. When you update a document, Lucene writes the new version into a new segment and marks the old version as deleted in the segment that holds it. When you delete a document, Lucene only marks it as deleted - the data remains in the segment until a merge physically removes it.
Over time, this leads to:
- Segment proliferation: Each refresh creates new segments. With a 1-second refresh interval, you're creating 86,400 segments per day per shard if indexing continuously.
- Deleted document accumulation: Deleted documents consume disk space and slow down searches until merged away.
- Query overhead: Searching 1,000 segments is slower than searching 10 segments of equivalent size, due to per-segment overhead (opening readers, merging results, priority queue management).
- Resource exhaustion: Each segment requires file handles (one per index file per segment). With compound file format disabled, a segment with 10 fields might have 50+ files. 1,000 segments = 50,000 open files. On systems with a default ulimit of 4,096 or 65,536, this becomes a real problem.
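The arithmetic behind the first and last bullets is easy to check. This is a back-of-the-envelope sketch: the function names are made up for illustration, and the 50-files-per-segment figure is the example used above, not a Lucene constant.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_segments_per_day(refresh_interval_s: float) -> int:
    """Upper bound on segments created per shard per day, assuming every
    refresh finds new documents and no merges run."""
    return int(SECONDS_PER_DAY / refresh_interval_s)


def open_files(segments: int, files_per_segment: int) -> int:
    """File handles needed if every file of every segment stays open."""
    return segments * files_per_segment


print(max_segments_per_day(1))      # 86400
print(max_segments_per_day(30))     # 2880
print(open_files(1000, 50))         # 50000
print(open_files(1000, 50) > 4096)  # True
```

Raising the refresh interval from 1s to 30s cuts the worst-case segment creation rate by a factor of 30 before any merge has to run.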
Merging solves all of these problems by rewriting multiple smaller segments into fewer larger segments. During a merge, Lucene reads documents from source segments, applies deletions, and writes a new consolidated segment. The old segments are deleted only after the new segment is fully written and committed.
The Merge Lifecycle: From Trigger to Completion
When Does a Merge Trigger?
Merges are triggered by several conditions, not just segment count:
1. Segment Count Thresholds
Each MergePolicy defines how many segments are allowed at each "level" (size tier). When a level exceeds its budget, merges are scheduled. For example, TieredMergePolicy allows roughly segments_per_tier segments (10 by default) in each tier.
2. Deleted Document Ratio
When deleted documents exceed a configured percentage of the index (deletes_pct_allowed, 33% by default in many Elasticsearch versions and lower in recent ones), Lucene prioritizes merges that reclaim them, which frees space and improves search performance.
3. Explicit API Calls
Users can force merges via IndexWriter.forceMerge(int maxNumSegments) or Elasticsearch's _forcemerge API. This is useful for optimizing read-heavy indices after bulk indexing completes.
4. Segment Size Boundaries
The TieredMergePolicy groups segments into tiers by size. Segments within the same tier are candidates for merging. Because segments are immutable, a segment never grows in place; a merged segment that is large enough to land in a higher tier won't be merged with smaller segments - it waits for peers in its new tier.
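Of the triggers above, the deleted-document ratio is the easiest to inspect from the outside. The sketch below computes the deleted share per segment and for the index as a whole. The Segment class and the sample numbers are hypothetical; real values would come from segment-level statistics such as those Elasticsearch exposes.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    max_doc: int       # total docs in the segment, including deleted ones
    deleted_docs: int


def deleted_pct(seg: Segment) -> float:
    """Percentage of this segment's documents that are marked deleted."""
    return 100.0 * seg.deleted_docs / seg.max_doc if seg.max_doc else 0.0


def index_deleted_pct(segments) -> float:
    """Percentage of deleted documents across all segments."""
    total = sum(s.max_doc for s in segments)
    deleted = sum(s.deleted_docs for s in segments)
    return 100.0 * deleted / total if total else 0.0


def over_threshold(segments, threshold_pct: float = 33.0):
    """Names of segments whose deleted share exceeds the threshold."""
    return [s.name for s in segments if deleted_pct(s) > threshold_pct]


segments = [
    Segment("_a", 1_000_000, 100_000),  # 10% deleted
    Segment("_b", 500_000, 200_000),    # 40% deleted
    Segment("_c", 10_000, 5_000),       # 50% deleted
]
print(over_threshold(segments))               # ['_b', '_c']
print(round(index_deleted_pct(segments), 1))  # 20.2
```

Note how two segments are far above 33% while the index as a whole sits near 20%: a per-segment view and an index-wide view can tell different stories.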
The Merge Process Internally
When IndexWriter decides to merge, here's what happens:
// Simplified view of merge execution in Lucene (conceptual, not actual source)
// Entry point: IndexWriter.merge(MergePolicy.OneMerge merge)
// 1. The MergeScheduler hands the merge to a merge thread
// 2. Open SegmentReaders for the source segments
// 3. Create a SegmentMerger for the target segment
// 4. Walk the source segments:
// - Skip deleted documents (using each segment's live-docs bits)
// - Copy stored fields and merge postings, doc values, etc. (text is not re-analyzed)
// - Remap document IDs into the new segment
// 5. Write the new segment's files to disk
// 6. Swap the new segment into the index in place of the source segments
// 7. Source files are removed once no open reader still references them
The merge process is I/O intensive because it reads all source segment data and writes all target segment data. A merge of 1 GB of segments requires reading 1 GB and writing 1 GB - temporarily using 2 GB of disk space.
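The read/write/disk arithmetic generalizes to any merge. A minimal sketch, assuming the merged segment shrinks in proportion to the deleted documents it drops and that source segments stay on disk until the merge completes (the function name is illustrative):

```python
def merge_cost_gb(source_sizes_gb, deleted_fraction=0.0):
    """Estimate read volume, write volume, and peak disk use for one merge.

    Returns a tuple (read_gb, written_gb, peak_disk_gb).
    """
    read = sum(source_sizes_gb)
    written = read * (1.0 - deleted_fraction)
    peak_disk = read + written  # sources and target coexist until the swap
    return read, written, peak_disk


print(merge_cost_gb([0.25, 0.25, 0.5]))       # (1.0, 1.0, 2.0)
print(merge_cost_gb([0.25, 0.25, 0.5], 0.5))  # (1.0, 0.5, 1.5)
```

The 2x figure is the worst case; a merge that drops many deleted documents writes less than it reads.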
Compound File Format (CFS)
Lucene can bundle all index files for a segment into a single .cfs file ("compound file format"). This reduces file handle pressure but adds a small I/O overhead. Newly flushed segments use CFS by default, while merged segments use it only when they are small relative to the index, so large segments are normally written as separate files. The exact cutoffs depend on the Lucene and Elasticsearch version.
// Lucene: IndexWriterConfig.setUseCompoundFile(...), MergePolicy.setNoCFSRatio(...), MergePolicy.setMaxCFSSegmentSizeMB(...)
// Elasticsearch: index.compound_format (expert setting; check the default for your version)
Merge Policies: Tiered vs Log vs Custom
Lucene provides several built-in merge policies, and understanding their differences is crucial for tuning.
TieredMergePolicy (Default in Elasticsearch)
TieredMergePolicy is the default and generally the best choice. It groups segments into tiers by size and merges segments within the same tier until the tier's segment count falls below a threshold.
Key parameters:
// Elasticsearch equivalents via index settings
index.merge.policy.max_merge_at_once: 10 // Max segments to merge in one operation
index.merge.policy.segments_per_tier: 10 // Target segments per tier
index.merge.policy.max_merge_at_once_explicit: 30 // For forceMerge only (deprecated, then removed, in recent versions)
index.merge.policy.floor_segment: 2mb // Minimum segment size for tier calculation
How it works:
- Segments of similar size are treated as belonging to the same tier
- When a tier has more than segments_per_tier segments, merges are scheduled
- The policy scores candidate merges and picks the best one, favoring similarly sized segments, smaller total merge size, and more reclaimed deletes, while respecting max_merge_at_once
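A toy version of that selection step helps build intuition. This sketch is not Lucene's algorithm: it uses a simplified skew-based score (largest segment divided by total merge size, with a mild penalty for bigger merges) as a stand-in, and the function name and defaults are assumptions for illustration.

```python
def pick_merge(sizes_mb, max_merge_at_once=10, floor_mb=2.0):
    """Return the sizes of the best-scoring candidate merge (lower score wins).

    Candidates are windows of adjacent segments after sorting by size, so each
    candidate contains segments of similar size.
    """
    ordered = sorted(sizes_mb, reverse=True)
    best_score, best = None, []
    for start in range(len(ordered)):
        window = ordered[start:start + max_merge_at_once]
        if len(window) < 2:
            continue  # nothing to merge with
        floored = [max(s, floor_mb) for s in window]
        skew = floored[0] / sum(floored)     # 1/n when sizes are balanced
        score = skew * sum(floored) ** 0.05  # mild preference for smaller merges
        if best_score is None or score < best_score:
            best_score, best = score, window
    return best


print(pick_merge([1000, 10, 10, 10, 10], max_merge_at_once=4))  # [10, 10, 10, 10]
```

The large segment is left alone because merging it with small neighbors would rewrite 1 GB to absorb 30 MB, which is exactly the kind of lopsided merge a skew-based score penalizes.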
Trade-offs:
- Pros: Balances merge I/O vs search performance. Good default for mixed workloads.
- Cons: Can be unpredictable under heavy indexing. Merge decisions are local (not globally optimal).
LogByteSizeMergePolicy (Legacy, Log-Structured)
This policy treats the index like a log-structured merge tree. Segments are grouped into levels by size, and when a level accumulates enough segments they are merged into one segment at the next level, creating larger and larger levels.
Key parameters:
merge_factor: 10 // How many segments per level
min_merge_size: 1.6mb // Segments below this size all count as the lowest level
max_merge_size: 2048mb // Don't merge segments above this
How it works:
- Segments are organized into levels
- When level N has merge_factor segments, they are merged into one segment in level N+1
- Creates a staircase of segment sizes: 1MB, 10MB, 100MB, 1GB, etc.
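The staircase falls out of a few lines of simulation. This is a simplified model of the level mechanics described above, assuming every flush produces a 1 MB segment and merges complete instantly; it is not the LogByteSizeMergePolicy source.

```python
def add_segment(levels, merge_factor=10, size_mb=1):
    """Add one flushed segment and cascade merges upward.

    levels[i] holds the sizes (in MB) of the segments currently at level i.
    """
    if not levels:
        levels.append([])
    levels[0].append(size_mb)
    level = 0
    while len(levels[level]) >= merge_factor:
        merged = sum(levels[level])  # merge the whole level into one segment
        levels[level] = []
        if level + 1 == len(levels):
            levels.append([])
        levels[level + 1].append(merged)
        level += 1
    return levels


levels = []
for _ in range(1234):
    add_segment(levels)

print([len(level) for level in levels])        # [4, 3, 2, 1]
print([level[0] for level in levels if level])  # [1, 10, 100, 1000]
```

After 1,234 flushes the index holds only 10 segments, and the segment counts per level read like the digits of 1234 in reverse, which is why this policy is easy to reason about.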
Trade-offs:
- Pros: Predictable, simple to reason about. Good for append-only time-series data.
- Cons: Can create very large segments that are expensive to merge. Less flexible than Tiered.
LogDocMergePolicy (Document-Count Based)
Similar to LogByteSize but uses document counts instead of bytes. Rarely used in production.
Custom Merge Policies
You can implement MergePolicy directly for specialized behavior. For example, a time-series index might want to merge only segments that cover adjacent time ranges. (Merges never cross shards, since each shard is its own Lucene index.)
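import java.io.IOException;
import org.apache.lucene.index.*;
// Sketch: extends FilterMergePolicy so that only findMerges has to be overridden
public class TimeRangeMergePolicy extends FilterMergePolicy {
  public TimeRangeMergePolicy(MergePolicy in) { super(in); }
  @Override
  public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos, MergeContext context) throws IOException {
    // Only merge segments with overlapping time ranges
    // to preserve time locality for range queries.
    // Placeholder: defer to the wrapped policy until that selection logic is written.
    return super.findMerges(trigger, infos, context);
  }
}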
Production Tuning: When to Intervene
The default merge settings work for many workloads, but production systems often need tuning. Here are the scenarios and the levers to pull.
Scenario 1: Heavy Indexing, Low Search Volume (Log Ingestion)
Problem: You're ingesting 50,000 docs/sec. Default merges can't keep up. Segment count explodes. Search slows down even though you barely search.
Tuning:
// Increase refresh interval to reduce segment creation
PUT /logs-*/_settings
{
"index": {
"refresh_interval": "30s"
}
}
// Increase merge throughput to allow more I/O for merges
PUT /logs-*/_settings
{
  "index": {
    "merge": {
      "scheduler": {
        "max_thread_count": 4, // Default is max(1, min(4, cores / 2))
        "max_merge_count": 8
      },
      "policy": {
        "floor_segment": "100mb", // Segments below this size are treated as 100 MB, so small ones are merged together early
        "max_merge_at_once": 20 // Merge more segments at once
      }
    }
  }
}
Why this works: Fewer refreshes = fewer segments created. Higher merge throughput = merges keep up with ingestion. Larger floor segment = small flushed segments are merged together early instead of lingering.
Trade-off: Higher max_thread_count consumes more CPU and I/O. On shared clusters, this can impact other indices.
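Before raising max_thread_count, it helps to know what the default already is on your hardware. Elasticsearch documents the default as max(1, min(4, processors / 2)) and the default max_merge_count as max_thread_count + 5; treat both formulas as assumptions to verify against the documentation for your version. The function names here are illustrative.

```python
def default_max_thread_count(processors: int) -> int:
    """Documented default for index.merge.scheduler.max_thread_count."""
    return max(1, min(4, processors // 2))


def default_max_merge_count(max_thread_count: int) -> int:
    """Documented default for index.merge.scheduler.max_merge_count."""
    return max_thread_count + 5


for cores in (2, 4, 8, 16):
    threads = default_max_thread_count(cores)
    print(cores, threads, default_max_merge_count(threads))
# 2 1 6
# 4 2 7
# 8 4 9
# 16 4 9
```

On a node with 8 or more cores, setting max_thread_count to 4 only restates the default, so the other settings in this scenario are the ones doing the work.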
Scenario 2: Heavy Search, Low Indexing (E-Commerce)
Problem: You're serving 10,000 searches/minute on a product catalog that updates once per day. Query latency is critical.
Tuning:
// Optimize the index after bulk indexing completes
POST /products/_forcemerge?max_num_segments=1
// After force merge, keep settings conservative
PUT /products/_settings
{
"index": {
"refresh_interval": "1s", // Keep default for near-real-time
"number_of_replicas": 2 // Spread search load
}
}
Why this works: Force merging to 1 segment gives the absolute best search performance - no per-segment query overhead, no merging during search time. Two replicas spread the read load across nodes.
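Trade-off: Force merge is expensive and competes with indexing for I/O, and the API call blocks until the merge finishes. Do it during maintenance windows. Also, a single very large segment is rarely selected for normal merges, so deletes from later updates can linger in it until another expensive merge runs.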
Scenario 3: Mixed Workload (SaaS Application)
Problem: Constant indexing and searching. You can't afford merge storms or search degradation.
Tuning:
// Throttle merges to prevent I/O saturation
PUT /app-data/_settings
{
  "index": {
    "merge": {
      "scheduler": {
        "max_thread_count": 1, // Single-threaded merges
        "max_merge_count": 2 // Indexing stalls when more than 2 merges are pending
      },
      "policy": {
        "segments_per_tier": 6, // Below the default of 10: fewer segments per tier, at the cost of more merge work
        "max_merge_at_once": 5
      }
    }
  }
}
Why this works: Conservative scheduler settings cap merge I/O and prevent storms. Because merges can lag behind indexing, you trade a slightly higher segment count for predictable performance. This is the safest approach for mixed workloads.
Scenario 4: Force Merge Gone Wrong
Problem: You ran forceMerge to 1 segment on a 100 GB index. Disk I/O is pegged at 100%. Search is timing out. Indexing is blocked.
What happened: forceMerge creates a single merge task for all segments. For a 100 GB index, that's a 100 GB read + 100 GB write operation. While this merge runs, the merge thread is occupied. New segments from indexing can't be merged, so segment count also grows. The index enters a state where it's both merging huge segments and accumulating new ones.
Mitigation:
- Don't force merge active indices. Force merge is for read-only or rarely-updated indices.
- If you must force merge, do it in stages: max_num_segments=10 first, then 5, then 1.
- For very large indices, use Elasticsearch's ILM to roll over to new indices and force merge the old (read-only) ones.
// ILM policy with force merge on warm phase
PUT /_ilm/policy/my-policy
{
"policy": {
"phases": {
"warm": {
"min_age": "7d",
"actions": {
"forcemerge": {
"max_num_segments": 1
}
}
}
}
}
}
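For the staged approach from the mitigation list, the sequence of requests can be generated rather than typed by hand. This sketch only builds the request paths; it does not call Elasticsearch. The index name and the helper name are hypothetical.

```python
from urllib.parse import quote, urlencode


def staged_forcemerge_paths(index: str, stages=(10, 5, 1)):
    """Build one _forcemerge request path per stage, largest target first."""
    ordered = sorted(set(stages), reverse=True)
    return [
        f"/{quote(index, safe='')}/_forcemerge?" + urlencode({"max_num_segments": n})
        for n in ordered
    ]


for path in staged_forcemerge_paths("logs-2024.01"):
    print("POST", path)
# POST /logs-2024.01/_forcemerge?max_num_segments=10
# POST /logs-2024.01/_forcemerge?max_num_segments=5
# POST /logs-2024.01/_forcemerge?max_num_segments=1
```

Issue each request only after the previous one has returned and disk I/O has settled, otherwise the stages pile up and the benefit is lost.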
Monitoring Merges in Production
You can't tune what you don't measure. Here are the key metrics to watch:
Elasticsearch Node Stats
# Current merge activity (?human adds readable sizes and times next to the raw byte and millisecond fields)
curl -s "localhost:9200/_nodes/stats/indices/merge?human" | jq '.nodes[] | {name, merges: .indices.merges}'
# Example output (abridged and annotated):
# {
#   "name": "node-1",
#   "merges": {
#     "current": 2,               // Active merges right now
#     "current_docs": 1500000,    // Documents being merged
#     "current_size": "750mb",    // Total size being merged
#     "total": 1250,              // Total merges since startup
#     "total_time": "45.2m",      // Total time spent merging
#     "total_docs": 500000000,    // Total docs merged
#     "total_size": "250gb"       // Total bytes merged
#   }
# }
Key metrics:
- current: If this is consistently > 0, merges are running continuously. If > 4, you may have a merge storm.
- current_size: If this is > 5 GB, you're doing large merges. Watch disk I/O.
- total_time: If merges consume more than 20% of node uptime, tuning is needed.
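These thresholds are simple enough to turn into an automated check. A minimal sketch, assuming you have already extracted the raw numeric values from node stats (the parameter names are illustrative, and the thresholds are the rules of thumb above, not Elasticsearch limits):

```python
def merge_warnings(current, current_size_bytes, total_time_ms, uptime_ms):
    """Apply the rule-of-thumb thresholds to one node's merge stats."""
    warnings = []
    if current > 4:
        warnings.append("possible merge storm")
    if current_size_bytes > 5 * 1024**3:
        warnings.append("large merges in flight")
    if uptime_ms > 0 and total_time_ms / uptime_ms > 0.20:
        warnings.append("merge time above 20% of uptime")
    return warnings


DAY_MS = 24 * 3600 * 1000

# The example node above: 2 merges, 750 MB in flight, 45.2 minutes of merging in a day
print(merge_warnings(2, 750 * 1024**2, 45.2 * 60 * 1000, DAY_MS))
# []

# A struggling node: 6 merges, 6 GB in flight, 6 hours of merging in a day
print(merge_warnings(6, 6 * 1024**3, 6 * 3600 * 1000, DAY_MS))
# ['possible merge storm', 'large merges in flight', 'merge time above 20% of uptime']
```

Keep in mind that total merge time is summed across merge threads, so with several threads it can exceed wall-clock uptime.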
Lucene Index-Level Metrics
# Segment count for the index, summed over all shard copies (critical!)
curl -s "localhost:9200/my-index/_stats/segments" | jq '.indices[].total.segments.count'
# For a per-shard breakdown: curl -s "localhost:9200/_cat/segments/my-index?v"
# Target: < 50 segments per shard for search-heavy indices
# Target: < 200 segments per shard for write-heavy indices
# Danger: > 1000 segments per shard
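The per-shard targets can be encoded as a small classifier for dashboards or alerts. The labels and the function name are illustrative; the numbers are the targets listed above.

```python
def classify_segment_count(count: int, workload: str = "search") -> str:
    """Classify a per-shard segment count against the targets above.

    workload is "search" (target < 50) or "write" (target < 200).
    """
    target = 50 if workload == "search" else 200
    if count > 1000:
        return "danger"
    if count < target:
        return "ok"
    return "above target"


print(classify_segment_count(45))            # ok
print(classify_segment_count(312))           # above target
print(classify_segment_count(150, "write"))  # ok
print(classify_segment_count(1200))          # danger
```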
Disk Space During Merges
Merges temporarily require 2x the space of the segments being merged. A 5 GB merge needs 10 GB total (5 GB source + 5 GB target). If disk usage is at 75% and a large merge starts, you can hit 100% and crash the node.
Rule of thumb: Keep at least 30% free disk space on data nodes. If you're above 70%, stop indexing and force merge before continuing.
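The headroom rule follows from worst-case arithmetic. A small sketch, assuming the merged segment is as large as its sources and that the sources stay on disk until the merge finishes (the function name is illustrative):

```python
def disk_pct_during_merge(disk_total_gb, disk_used_gb, merge_source_gb):
    """Worst-case disk usage (%) while a merge of the given size is running."""
    return 100.0 * (disk_used_gb + merge_source_gb) / disk_total_gb


# 1 TB disk at 75% usage
print(disk_pct_during_merge(1000, 750, 5))    # 75.5
print(disk_pct_during_merge(1000, 750, 250))  # 100.0
```

A routine 5 GB merge barely moves the needle, but a force merge of a 250 GB index on the same node would fill the disk completely.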
Common Pitfalls and Edge Cases
Pitfall 1: Merge Throttling in Elasticsearch
Older Elasticsearch releases (1.x) throttled merge I/O with a fixed store-level limit that defaulted to 20 MB/sec per node. If your storage can handle 500 MB/sec, that limit is unnecessarily conservative.
// Elasticsearch 1.x only: increase throttle (use with caution on shared clusters)
PUT /_cluster/settings
{
  "transient": {
    "indices.store.throttle.max_bytes_per_sec": "100mb"
  }
}
Warning: This setting was removed in Elasticsearch 2.0, which replaced it with adaptive merge throttling (index.merge.scheduler.auto_throttle, on by default). On current versions, tune index.merge.scheduler.max_thread_count instead. More threads = more concurrent I/O.
Pitfall 2: The "Merge Blackhole"
A merge blackhole occurs when:
- Indexing is faster than merging (segment count grows)
- Search performance degrades due to many segments
- The merge policy schedules more merges because segment counts are over budget
- Merges consume I/O, slowing down indexing
- Indexing slows, but segment count is still high
- The merge policy schedules even more merges
- The system enters a state where it's constantly merging but never catching up
Symptoms: current merges always > 0, current_size growing, indexing rate dropping, search latency spiking, CPU and I/O pegged.
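The core of the problem is a rate mismatch, which a toy model makes visible. This sketch assumes fixed creation and merge rates per tick and ignores the feedback where merging slows indexing, so it understates how messy the real thing gets. The function name and numbers are illustrative.

```python
def simulate_backlog(new_per_tick, merged_per_tick, ticks, start=0):
    """Track the unmerged segment count over time.

    Each tick adds new_per_tick segments, then merging removes up to
    merged_per_tick of them. Returns the backlog after every tick.
    """
    counts, backlog = [], start
    for _ in range(ticks):
        backlog = max(0, backlog + new_per_tick - merged_per_tick)
        counts.append(backlog)
    return counts


print(simulate_backlog(30, 25, 5))            # [5, 10, 15, 20, 25]
print(simulate_backlog(20, 25, 5, start=40))  # [35, 30, 25, 20, 15]
```

As long as creation outpaces merging the backlog grows without bound; recovery only starts once the creation rate drops below merge capacity, which is what the fixes below aim for.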
Fix:
- Pause indexing temporarily
- Increase max_thread_count if CPU is available
- Increase refresh_interval to reduce new segment creation
- If desperate, use /_forcemerge (but this competes heavily with indexing for I/O)
Pitfall 3: Deleted Documents Not Being Reclaimed
If deleted documents aren't being merged away, check:
- Is expunge_deletes_allowed set too high? It defaults to 10%, and an only_expunge_deletes force merge skips segments whose deleted share is below it.
- Are segments too large to merge? A segment at or above max_merged_segment (5 GB by default) is skipped by regular merges, although newer Lucene versions still pick it up once deletes exceed the allowed percentage.
- Is merge_enabled set to false? (Some custom plugins disable merging.)
Fix: Force merge with only_expunge_deletes=true:
POST /my-index/_forcemerge?only_expunge_deletes=true
This only runs merges that reclaim deleted documents, without the full rewrite.
Pitfall 4: Cross-Shard Merge Imbalance
In a multi-shard index, merges happen per shard. If one shard receives more writes than others (hotspotting), that shard will have more segments and do more merges. This creates node-level imbalance.
Fix: Use routing keys to distribute writes evenly, or increase shard count to reduce per-shard write volume.
Pitfall 5: Merge Activity on Frozen/Cold Tiers
Elasticsearch frozen tiers store data on S3/Azure Blob. Merging on frozen tiers is extremely expensive because each segment file may need to be fetched from object storage.
Fix: Never allow merges on frozen indices. Force merge indices before transitioning to frozen/cold. Use searchable snapshots with pre-built segments.
Real-World Benchmarks: The Cost of Bad Merge Configuration
I ran a benchmark on a 3-node Elasticsearch cluster (16 cores, 64 GB RAM, SSD) with a 10 GB index:
| Configuration | Segment Count | Query Latency (p99) | Indexing Rate | Merge Time % |
|---|---|---|---|---|
| Default (1s refresh) | 45 | 12ms | 5,000 docs/sec | 8% |
| Aggressive (100ms refresh) | 312 | 85ms | 5,000 docs/sec | 35% |
| Conservative (30s refresh) | 8 | 4ms | 12,000 docs/sec | 15% |
| Force merged (1 segment) | 1 | 2ms | 2,000 docs/sec | 0% |
| Merge disabled | 1,200+ | 450ms | 5,000 docs/sec | 0% |
Key takeaways:
- Aggressive refreshing (100ms) created 7x more segments and 7x worse query latency
- Force merging to 1 segment gave the best search performance but cut the indexing rate to 2,000 docs/sec
- Merge disabled completely destroyed query performance within hours
- Conservative refresh + default merge gave the best balance: 3x better query latency than default, 2.4x better indexing rate
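The ratios quoted in the takeaways can be reproduced directly from the table. The dictionary below just restates three of its rows; nothing here is new data.

```python
rows = {
    "default":      {"segments": 45,  "p99_ms": 12, "docs_per_sec": 5000},
    "aggressive":   {"segments": 312, "p99_ms": 85, "docs_per_sec": 5000},
    "conservative": {"segments": 8,   "p99_ms": 4,  "docs_per_sec": 12000},
}


def ratio(numerator: str, denominator: str, key: str) -> float:
    """Ratio of one configuration's metric to another's."""
    return rows[numerator][key] / rows[denominator][key]


print(round(ratio("aggressive", "default", "segments"), 1))  # 6.9
print(round(ratio("aggressive", "default", "p99_ms"), 1))    # 7.1
print(ratio("default", "conservative", "p99_ms"))            # 3.0
print(ratio("conservative", "default", "docs_per_sec"))      # 2.4
```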
Conclusion: Merges Are Not a Detail
Segment merging is one of the most important operational aspects of running Lucene or Elasticsearch at scale. It's not a background detail you can ignore - it's a fundamental trade-off between write performance, search performance, and resource consumption.
The default settings are wrong for most production workloads. The 1-second refresh interval is a compromise for demo environments, not a production default. The default merge policy is conservative to prevent disasters on undersized clusters, but on properly provisioned hardware, you can be much more aggressive.
My recommendations:
- For log ingestion: 30s refresh, 4 merge threads, 100 MB floor segment
- For search-heavy indices: Force merge to 1-5 segments after bulk indexing, then leave alone
- For mixed workloads: 1s refresh, 1-2 merge threads, monitor segment count religiously
- For all indices: Keep disk below 70%, monitor current merges, and have an ILM policy that force merges old indices
Merges are Lucene's garbage collector. Like GC, you can ignore it until it ruins your day. Or you can understand it, tune it, and make it work for you.
About the author: I'm Prithvi S, Staff Software Engineer at Cloudera and Opensource Enthusiast. Follow my work on GitHub.