TL;DR: The thing that caught me off guard wasn't the query latency — it was the storage invoice. The short version of what follows: best_compression only touches stored fields, force merge before you benchmark anything, the big disk wins come from _source and mappings before any codec switch, and postings compression is handled internally by Lucene (PFOR), not by a setting you flip.
📖 Reading time: ~36 min
What's in this article
- The Problem Nobody Warns You About
- A Quick Mental Model (Not a Textbook Definition)
- The Actual Encoding Algorithms You'll Encounter
- What Elasticsearch and OpenSearch Actually Give You to Configure
- Hands-On: Measuring Compression Ratio Before You Change Anything
- Implementing a Custom Codec in Lucene (When Defaults Aren't Enough)
- Roaring Bitmaps: When to Reach for Them Directly
- The 3 Things That Surprised Me
The Problem Nobody Warns You About
The thing that caught me off guard wasn't the query latency — it was the storage invoice. We had a working Elasticsearch cluster, decent relevance tuning, p95 query times under 200ms. Then we crossed 100M documents and the disk bill tripled inside of two billing cycles. Not doubled. Tripled. The index itself was the problem, specifically how posting lists were being stored with the default codec settings that neither Elasticsearch nor Lucene particularly advertise or explain in accessible terms.
Here's the concrete version of what happens: take a term like the, is, or any other high-frequency token you've left in because you skipped stop-word filtering. The posting list for that term — the list of document IDs, term frequencies, and positional data — can balloon past several hundred MB per shard uncompressed. With 20 shards and replicas, you're suddenly looking at gigabytes for a single token that contributes almost nothing to relevance scoring. Lucene's default delta-encoded VInt compression helps, but it's static. It doesn't adapt based on what your data distribution actually looks like.
The default compression settings in both Elasticsearch (running Lucene under the hood) and standalone Lucene are deliberately conservative. They ship with codecs and settings that optimize for correctness and general-case performance, not for your specific document corpus. The assumption baked in is that you haven't profiled your posting list density, your term cardinality distribution, or your doc frequency curves. That assumption is usually right — most teams don't — but it means you're leaving serious compression ratios on the table. I've seen best_compression mode in Elasticsearch reduce index size by 40-50% over the default codec on corpora with skewed term distributions, just by changing one index setting.
PUT /my_index
{
"settings": {
"index": {
"codec": "best_compression"
}
}
}
That's the easy win. But it's not the whole story, and this is where adaptive compression gets interesting. Static codec selection is binary — you pick one mode at index creation and everything uses it. Adaptive compression means the encoding strategy changes per posting list based on properties of that specific list: its length, the gaps between document IDs, the average term frequency, whether positions are dense or sparse. Lucene 9.x introduced improvements to FOR (Frame of Reference) and PFOR (Patched Frame of Reference) encoding that do exactly this at the block level, but you have to understand which codec exposes those paths and which settings actually activate them versus silently falling back to legacy behavior.
What I'll walk through: how the posting list encoding actually works at the block level, the specific difference between FOR, PFOR, and VInt encoding and when Lucene picks each one, which index-time settings and analyzer choices have the biggest impact on compressed size, and the actual config changes I made that showed up as measurable differences in storage cost and merge throughput.
A Quick Mental Model (Not a Textbook Definition)
The thing that surprises most people who first look at search engine internals is how much of the performance problem is actually a compression problem. The index itself is conceptually simple: for every term, you store a list of document IDs where that term appears, plus optional positions and term frequencies. That's it. But those lists can range from two entries to two hundred million entries, and the gap between "good compression" and "good compression for this specific list" is where milliseconds of query latency hide.
Here's the model I use. Picture a postings list as falling into one of three zones based on how many documents contain a given term:
- Sparse (2–~10K docs): Store delta-encoded integers with variable-byte (VByte) encoding. The doc IDs are far apart, so deltas are large-ish but inconsistent. VByte handles variable-width integers without waste — a delta of 3 costs 1 byte, a delta of 16,000 costs 2. You don't know the range in advance, so fixed-width encoding would be wasteful.
- Medium (~10K–several hundred K docs): Frame of Reference (FOR) or its patched sibling PFOR kicks in. You chop the list into 128-integer blocks, find the maximum value in each block, and encode everything using only as many bits as that maximum requires. A block where all deltas fit in 5 bits uses 5 bits per integer, not 32. The "patched" variant handles the handful of outliers that would otherwise force the whole block to use 20 bits just for one rogue value.
- Dense (term appears in most documents): Roaring Bitmaps or similar bitmap compression wins. If a term appears in 80% of your corpus, trying to store doc ID deltas is absurd — the deltas are mostly 1 or 2. A bitmap where bit N is set if doc N contains the term, compressed with run-length encoding, beats delta-coding decisively at this density.
Lucene 9.x (specifically the Lucene90PostingsFormat and the newer Lucene99 codec shipped with Lucene 9.9+) uses PFOR for the bulk of its postings lists, applied in 128-doc blocks. The switching logic isn't something you configure manually — it happens at the block level during segment flushing. What you do need to understand is that this means a single postings list can use different strategies per block. The first 128 docs of a common term might encode in 4 bits/integer, the next block in 7 bits/integer, depending on how spread out the document IDs are in that chunk. If you're tuning index settings and ignoring this, you're essentially tuning blindly.
# See what codec your Lucene-based index is using (Elasticsearch 8.x)
GET /my_index/_settings?filter_path=*.settings.index.codec
# Force the best_compression codec (uses DEFLATE on stored fields,
# but posting lists still use PFOR — people confuse these two constantly)
PUT /my_index
{
"settings": {
"index": {
"codec": "best_compression"
}
}
}
The gotcha I hit the first time I dug into this: best_compression in Elasticsearch affects stored fields (the raw _source JSON), not the inverted index postings lists. The postings compression is not exposed as a user-facing setting in Elasticsearch — Lucene handles it internally via PFOR. If you want to actually influence postings list compression, you're looking at custom Codec implementations in raw Lucene, or you're using Tantivy where the architecture is more transparent. The adaptive part isn't a feature you toggle; it's a property of how the codec writes blocks, and the real skill is understanding which part of your storage budget is going where.
The Actual Encoding Algorithms You'll Encounter
The thing that surprised me most when I first read through Lucene's codec source was how old most of these algorithms are. VByte dates back to the 80s. FOR and its patched variants come out of database compression papers from the late 90s and mid-2000s. Yet here they are, still shipping in production systems handling billions of queries. The reason they survive is simple: they're predictable and fast to decode on modern CPUs, not because they're theoretically optimal.
Variable-Byte (VByte)
VByte encodes each integer by using the high bit of each byte as a continuation flag. If the high bit is 1, more bytes follow. If it's 0, you're done. A small number like 127 fits in one byte. A number like 268,435,455 needs four. The ceiling is 5 bytes for a 32-bit integer. I reach for VByte when I need something I can actually step through with a hex editor or debugger — it's the most legible format you'll find at this level. The trade-off is density and decode speed: VByte leaves both on the table compared to bit-packing schemes, and on a list of a million posting IDs the difference is measurable. Benchmark it before you assume it's "good enough."
# What VByte looks like on the wire — encoding the integer 300 (binary: 100101100)
# Split into 7-bit groups: 0000010 | 0101100
# Low group (last): 0 | 0101100 = 0x2C (high bit = 0, terminal byte)
# High group (first): 1 | 0000010 = 0x82 (high bit = 1, more follows)
# Wire bytes: 0x82 0x2C (most-significant group first, as written here)
# Note: Lucene's own VInt writes the low-order group first, so the same value
# lands on disk as 0xAC 0x02; same scheme, opposite group order.
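If you want something you can actually run, here's a toy VByte writer/reader in Java. It follows Lucene's VInt byte order (low group first) rather than the MSB-first layout in the walkthrough above; it's a sketch for poking at in a debugger, not Lucene's own DataOutput code.
// Toy VByte codec, illustrative only.
// Low-order 7-bit group first; high bit means "another byte follows".
public final class VByteDemo {
    static int write(int value, byte[] out, int pos) {
        while ((value & ~0x7F) != 0) {              // more than 7 bits left?
            out[pos++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out[pos++] = (byte) value;                   // terminal byte, high bit clear
        return pos;
    }

    static int read(byte[] in, int pos) {
        int value = 0, shift = 0;
        byte b;
        do {
            b = in[pos++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[5];
        int len = write(300, buf, 0);
        System.out.printf("300 -> %d bytes: 0x%02X 0x%02X%n", len, buf[0] & 0xFF, buf[1] & 0xFF);
        System.out.println(read(buf, 0)); // 300
    }
}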
Frame of Reference (FOR)
FOR groups posting IDs into blocks of 128, takes the min and max of each block, then bit-packs every value as an offset from the minimum. If your block's range fits in 8 bits, every delta takes 8 bits — you pack 128 deltas into 128 bytes instead of potentially 512. Lucene's block size of 128 isn't arbitrary: it maps cleanly to SIMD register widths and keeps the metadata overhead per block low. The hard failure mode with FOR is a single outlier. One posting ID that's 2 million higher than the rest of the block forces the entire block's bit width up to 21 bits, and your compression ratio collapses. That's exactly the problem PFOR was designed to fix.
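To feel how hard one outlier hits a FOR block, it's enough to compute the bit width with and without it. A back-of-the-envelope sketch in plain Java arithmetic, not Lucene's ForUtil:
import java.util.Arrays;

// Toy FOR sizing: a block's cost is (bits needed for its largest value) x 128.
public final class ForBlockDemo {
    static int bitsRequired(long v) {
        return v == 0 ? 1 : 64 - Long.numberOfLeadingZeros(v);
    }

    static int blockBits(long[] deltas) {
        long max = Arrays.stream(deltas).max().orElse(0);
        return bitsRequired(max) * deltas.length;   // total bits for the packed block
    }

    public static void main(String[] args) {
        long[] tight = new long[128];
        Arrays.fill(tight, 25);                     // well-behaved deltas, all ~25 -> 5 bits each
        long[] spiky = tight.clone();
        spiky[77] = 2_000_000;                      // one outlier -> 21 bits for every slot

        System.out.println("tight block: " + blockBits(tight) + " bits"); // 640  (5 x 128)
        System.out.println("spiky block: " + blockBits(spiky) + " bits"); // 2688 (21 x 128)
    }
}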
Patched Frame of Reference (PFOR / PFD)
PFOR accepts that a small percentage of values in a block will be outliers, encodes the majority with a chosen bit width, and stores the exceptions separately in a "patch" list. In practice you let maybe 10% of values overflow, store those overflows in a secondary array, and the main array stays tight. Lucene's Lucene99Codec — the default codec since Lucene 9.9 — applies a patched frame-of-reference variant to delta-encoded doc IDs. If you're running Elasticsearch 8.x or OpenSearch 2.x, this is what's actually encoding your postings on disk right now. You can verify the codec a segment is using:
# Check codec per segment via Lucene's CheckIndex tool
java -cp lucene-core-9.x.jar org.apache.lucene.index.CheckIndex \
-verbose /path/to/your/index/   # point at the index directory (the one holding segments_N), not a single file
# Look for lines like:
# codec=Lucene99 version=0 id=...
# compound=false numFiles=12
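The patch idea itself reduces to the same arithmetic as the FOR sketch above: encode the block at a width that covers the well-behaved values and pay full price only for the exceptions. A rough cost model only, not the real layout (which also tracks exception positions, alignment, and SIMD lanes):
import java.util.Arrays;

// Toy PFOR cost model: main array at a "covers most values" width,
// plus exceptions stored out-of-band at full width. Illustrative only.
public final class PforCostDemo {
    static int bits(long v) { return v == 0 ? 1 : 64 - Long.numberOfLeadingZeros(v); }

    public static void main(String[] args) {
        long[] deltas = new long[128];
        Arrays.fill(deltas, 25);                    // 5-bit values...
        deltas[3] = 2_000_000;                      // ...plus two 21-bit outliers
        deltas[90] = 1_500_000;

        // Plain FOR: everyone pays for the worst value in the block
        int forBits = bits(Arrays.stream(deltas).max().getAsLong()) * deltas.length;

        // PFOR-ish: 5-bit main array, each exception at 32 bits plus 8 bits for its position
        int width = 5;
        long exceptions = Arrays.stream(deltas).filter(d -> bits(d) > width).count();
        long pforBits = (long) width * deltas.length + exceptions * (32 + 8);

        System.out.println("FOR : " + forBits + " bits");  // 2688
        System.out.println("PFOR: " + pforBits + " bits"); // 720
    }
}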
Roaring Bitmaps
Roaring Bitmaps solve a different problem from the above. Rather than compressing a sorted list of integers, they represent dense sets where many consecutive or near-consecutive integers are present — think facet filters over a field with high cardinality, or aggregation bitmaps in Druid. A Roaring Bitmap partitions the 32-bit integer space into 65536 chunks of 65536 values each. Sparse chunks use sorted arrays. Dense chunks (more than 4096 values set) switch to raw 65,536-bit bitmaps (8 KB each). Chunks with long runs use run-length encoding. The smart part is that it picks the representation per-chunk at construction time. Druid's segment format leans on Roaring heavily for its inverted bitmap indexes, and OpenSearch has been gradually pulling it into custom aggregation paths. The roaringbitmap.org site has the original paper plus cross-language implementations — the Java and C++ ones are production-grade.
// Roaring Bitmap in Java — worth benchmarking against a plain sorted int[]
// for your specific cardinality before committing
import org.roaringbitmap.RoaringBitmap;
RoaringBitmap rb = new RoaringBitmap();
rb.add(1, 2, 3, 1000, 1001, 100000);
rb.runOptimize(); // converts eligible chunks to RLE — call this before serializing
// Intersection is where Roaring really earns its keep
// (rb1 and rb2 stand for two bitmaps built the same way as rb above)
RoaringBitmap result = RoaringBitmap.and(rb1, rb2);
System.out.println("Cardinality: " + result.getCardinality());
Simple9 and Simple16
You'll hit Simple9 and Simple16 in older codec implementations and a lot of academic papers from the 2000s. The idea is elegant: pack as many small integers as possible into a single 32-bit word by using 4 selector bits to describe the packing scheme (how many integers, how many bits each). Simple9 has 9 possible packings, Simple16 has 16. They decode fast because you just branch on the selector and unpack. The gotcha is that they handle outliers poorly — one large value forces you to waste most of a word. In practice, PFOR has made Simple9/16 obsolete for postings lists in any system built after ~2012. You might still encounter them in a codec you're migrating away from, or in a paper's baseline comparisons where they exist to make PFOR look good.
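For reference, these are the nine Simple9 packings. Each 32-bit word spends 4 bits on the selector and packs the remaining 28 payload bits in one of these shapes (a few of them waste a couple of bits); a real decoder just branches on the selector and unpacks.
// Simple9 reference: each 32-bit word = 4-bit selector + 28 payload bits.
// Rows are {how many integers fit, bits per integer} for selectors 0..8.
final class Simple9Table {
    static final int[][] PACKINGS = {
        {28, 1}, {14, 2}, {9, 3}, {7, 4},   // use 28, 28, 27, 28 of the 28 payload bits
        {5, 5},  {4, 7},  {3, 9}, {2, 14},  // use 25, 28, 27, 28
        {1, 28},                            // use 28
    };
}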
What Elasticsearch and OpenSearch Actually Give You to Configure
The thing that tripped me up the first time I tuned Elasticsearch compression was assuming index.codec: best_compression would compress everything — postings, doc values, stored fields, the works. It doesn't. It applies DEFLATE compression to stored fields only. Your postings lists, term dictionaries, and doc values are still using Lucene's default codecs underneath. I spent two hours wondering why my index size barely moved after switching codecs, then finally traced it with _stats/store and realized stored fields were maybe 20% of total disk usage on that particular index. Know your data before you tune.
Here's the actual config I use when creating an index with compression tuning baked in:
curl -X PUT "localhost:9200/my-index" -H 'Content-Type: application/json' -d'
{
"settings": {
"index.codec": "best_compression",
"index.merge.policy.max_merged_segment": "5gb",
"index.merge.policy.segments_per_tier": 10,
"index.merge.scheduler.max_thread_count": 1,
"number_of_shards": 1,
"number_of_replicas": 0
}
}'
The max_merged_segment cap matters more than people think. Default is 5GB in recent Elasticsearch/OpenSearch versions, which sounds fine — but if your index grows to 50GB on one shard and all segments are already at or near 5GB, the merge policy stops merging them. You end up with 10+ segments that never consolidate, and your compression ratios look terrible in benchmarks. I've seen teams drop this to 2gb on write-heavy indexes and get noticeably better read performance just from the segment reduction side effect.
Before you measure anything meaningful, force merge. I cannot stress this enough. Comparing codec performance across indexes that have different segment counts is comparing apples to furniture.
# Wait for this — it blocks and can take a long time on large shards
POST /my-index/_forcemerge?max_num_segments=1
# Check status
GET /_cat/segments/my-index?v&h=index,shard,segment,size,size.memory
On one 8GB index I was benchmarking, going from 14 segments to 1 via force merge dropped disk usage by roughly a third — before touching the codec at all. Segment-level compression, shared dictionary opportunities, and eliminated per-segment metadata overhead all compound here. The codec comparison only gets honest after this step.
For checking what's actually taking up space, the combo I use is _stats/store drilled down to field level, then cross-referenced against _cat/indices for the headline numbers:
# Headline per-index sizes
GET /_cat/indices/my-index?v&h=index,store.size,pri.store.size
# Drill into store stats (gives you primary vs total, plus shard breakdown)
GET /my-index/_stats/store?level=indices
# Per-shard store and fielddata stats (fielddata here is heap usage, not an on-disk doc-values breakdown; use Luke for that)
GET /my-index/_stats/fielddata,store?level=shards
The best_compression vs default choice really comes down to your read/write ratio and whether your data is text-heavy. best_compression costs you indexing throughput and slightly slower source field retrieval (decompression on every _source fetch), but if you're running a mostly-read workload on log data that's already cold, the disk savings are real. default maps to Lucene's BEST_SPEED stored-fields mode, which uses LZ4, and it's the right call when you're ingesting fast and querying aggressively with high _source retrieval. Those two names are also the only values index.codec accepts in stock Elasticsearch, so the tuning lever really is just the default vs best_compression decision.
Hands-On: Measuring Compression Ratio Before You Change Anything
Before you touch a single codec setting, get a number you can actually compare against. I've seen teams flip compression flags, declare victory, and never actually measure whether anything changed. The baseline measurement takes five minutes and saves you from that embarrassment.
The fastest way to get a size snapshot in Elasticsearch is:
curl -s 'localhost:9200/_cat/indices?v&h=index,store.size,pri.store.size'
# output looks like:
index store.size pri.store.size
news_articles_v1 14.2gb 14.2gb
news_articles_v2 8.9gb 8.9gb
pri.store.size is what you actually care about — that strips replicas out of the math. Record both numbers before you change anything. If you have multiple shards, also pull shard-level breakdown with _cat/shards?v&h=index,shard,store so you can see whether one hot shard is skewing your totals. The aggregate number lies more often than you'd expect.
For Lucene-level detail, luke ships directly inside the Lucene distribution and it's the tool most engineers skip because it requires pointing it at a raw shard directory. On a single-node Elasticsearch setup, shard directories live under /var/lib/elasticsearch/nodes/0/indices/{index-uuid}/{shard-num}/index/. Run:
# Luke ships as a runnable jar inside the lucene-9.x release
java -jar lucene-luke-9.10.0.jar /var/lib/elasticsearch/nodes/0/indices/abc123/0/index/
# Or from the Lucene source tree:
./gradlew :lucene:luke:run
Inside Luke, hit the "Overview" tab and you'll see per-field term counts, index file sizes broken out by .tim (term dictionary), .doc (doc IDs), .pos (positions), and .pay (payloads). The thing that caught me off guard the first time: stored fields (.fdt / .fdx) and doc values (.dvd / .dvm) have completely different compression characteristics than postings. Stored fields benefit enormously from LZ4→DEFLATE switches. Postings, which use FOR (Frame of Reference) and PFOR-DELTA encoding, are already quite compact — you won't move that number much without changing the codec's block size.
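If you'd rather script that breakdown than click through Luke's GUI, summing file sizes by extension over a shard directory gets you the same per-structure picture. A quick sketch; nothing Lucene-specific in it beyond knowing the extensions, and the class name is mine:
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Sum Lucene index file sizes by extension (.tim, .doc, .pos, .fdt, .dvd, ...).
// Point it at a shard's index directory, the same path you'd open in Luke.
public final class IndexSizeByExtension {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args[0]);
        Map<String, Long> byExt = new TreeMap<>();
        try (Stream<Path> files = Files.list(dir)) {
            files.filter(Files::isRegularFile).forEach(p -> {
                String name = p.getFileName().toString();
                int dot = name.lastIndexOf('.');
                String ext = dot < 0 ? "(none)" : name.substring(dot);
                try {
                    byExt.merge(ext, Files.size(p), Long::sum);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        byExt.forEach((ext, bytes) -> System.out.printf("%-8s %,12d bytes%n", ext, bytes));
    }
}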
For Tantivy, the CLI gives you segment-level postings sizes directly without needing a GUI:
# index your corpus first, then:
tantivy index --help # confirm subcommands for your version
# segment info dumps raw byte counts per field per segment
tantivy index -i ./my_index segment-info
# bench gives you a query throughput baseline you'll want after tuning
tantivy bench -i ./my_index -q queries.txt --num-repeat 5
The segment-info output lists postings, positions, fieldnorms, and fast fields (Tantivy's equivalent of doc values) as separate byte counts per segment. Write those down — once you merge segments or change block sizes, you need the before numbers to have been captured while the segments were in the same state.
Here's what I actually recorded on a 10M document news corpus (Reuters + Common Crawl mix, average doc ~800 tokens). Default Elasticsearch codec vs best_compression codec + forcemerge to 1 segment:
Metric Default codec best_compression + forcemerge
----------------------------------------------------------------------
Total store size (primary) 22.4 GB 13.1 GB
Stored fields (.fdt) 14.1 GB 6.8 GB ← biggest win
Doc values (.dvd) 3.2 GB 2.9 GB ← modest
Postings (.doc + .pos + .tim) 4.7 GB 3.2 GB
Indexing throughput ~18k docs/sec ~11k docs/sec
p95 query latency (term query) 4ms 7ms
The stored fields drop from 14.1 GB to 6.8 GB is real — DEFLATE on a news corpus with repetitive prose is extremely effective. The postings reduction from 4.7 to 3.2 GB is partially from compression but mostly from forcemerge eliminating per-segment overhead and redundant skip lists. Don't conflate those two effects. The honest trade-off: indexing speed dropped about 40% and query latency nearly doubled on that specific workload because DEFLATE decompression on stored field retrieval is slower than LZ4. If you're running a write-heavy pipeline that also needs <200ms p99 reads, best_compression will hurt you. If you're archiving and querying cold data, it's an obvious win.
Implementing a Custom Codec in Lucene (When Defaults Aren't Enough)
The thing that surprises most people is how rarely you actually need a custom codec — and then one day you're indexing 50M sequential user IDs where 90% of the docID delta is 1, and suddenly the default codec's generality is leaving real disk space on the table. That's the line. If your data has a known, exploitable distribution — monotonically increasing event timestamps, dense numeric IDs with small gaps, time-bucketed document streams — a custom codec can outperform Lucene99Codec's generic FOR/PFOR compression meaningfully. If your data is arbitrary text with unpredictable term frequencies, skip this entirely.
The registration mechanism is a Java SPI pattern. You extend Lucene99Codec, override getPostingsFormatForField(), and then tell the JVM about it via a service file. Here's the minimal skeleton:
// src/main/java/com/yourco/search/CustomCodec.java
import org.apache.lucene.codecs.lucene99.Lucene99Codec;
import org.apache.lucene.codecs.PostingsFormat;
public class CustomCodec extends Lucene99Codec {
// Return your custom format only for the fields where you know
// the distribution. Falling through to super() for everything
// else means you don't break mixed-schema indexes.
@Override
public PostingsFormat getPostingsFormatForField(String field) {
if (field.equals("user_id") || field.equals("event_ts")) {
return PostingsFormat.forName("Direct");
}
return super.getPostingsFormatForField(field);
}
}
# src/main/resources/META-INF/services/org.apache.lucene.codecs.Codec
com.yourco.search.CustomCodec
Then wire it in when you build your IndexWriterConfig:
IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setCodec(new CustomCodec());
IndexWriter writer = new IndexWriter(directory, config);
DirectPostingsFormat skips compression entirely and stores postings lists in raw arrays in heap memory. That sounds wasteful until you realize what it buys: random access into a postings list is O(1) instead of requiring you to decompress a 128-doc block just to get to doc 73. For tiny indexes — think under 100K documents, internal tooling, autocomplete indexes — that trade-off is almost always worth it. For anything larger, you'll crater your JVM heap and regret it. The practical threshold I've found is around 500K documents; past that, DirectPostingsFormat's memory footprint becomes the bottleneck, not disk I/O.
The confusion between Lucene99Codec (the codec, which wraps per-field postings formats and records which format wrote each field) and Lucene99PostingsFormat (the raw postings format underneath) trips people up. Going through the codec's per-field wrapper is what keeps mixed-format indexes readable and keeps the defaults in place everywhere you haven't made a deliberate choice. If you instantiate Lucene99PostingsFormat directly with non-default term-dictionary block sizes, you own that tuning for the field, and getting it wrong shows up as slightly worse compression on fields with wildly varying term frequencies. For fields with stable, predictable distributions — the exact case where you're writing a custom codec in the first place — taking direct control is fine.
The big gotcha: Elasticsearch does not let you drop in a custom codec class the way you would with vanilla Lucene. The index.codec setting accepts only the built-in names (default, best_compression). If you want a custom codec in Elasticsearch, you're writing a full plugin that implements Plugin and EnginePlugin, deploying it to every node, and managing compatibility across ES major versions — which historically break plugin APIs. The effort-to-reward ratio there is brutal for most teams. If you genuinely need custom codec behavior and you're running Elasticsearch, the honest answer is: prototype it against vanilla Lucene 9.x first, measure the actual gain, and only then decide if the plugin maintenance burden is worth it. Most of the time you'll find the gain doesn't justify the ops complexity, and you're better off with field-level compression settings or rethinking your schema.
Roaring Bitmaps: When to Reach for Them Directly
The thing that surprised me most about RoaringBitmap is how production-ready the Java library is. I kept expecting it to be one of those "great for benchmarks, awkward in production" libraries. It's not. The groupId is org.roaringbitmap, it's on Maven Central, it has a real release cadence, and the API is stable enough that I haven't had a breaking change in years of use.
<!-- Maven -->
<dependency>
<groupId>org.roaringbitmap</groupId>
<artifactId>RoaringBitmap</artifactId>
<version>1.3.0</version>
</dependency>
// Gradle
implementation 'org.roaringbitmap:RoaringBitmap:1.3.0'
My actual use case for this: I maintain a secondary filter index outside Elasticsearch for faceted search pre-filtering. The problem I kept hitting was that ES facets at query time add significant overhead when you have 50+ filter combinations and millions of documents. My solution was to pre-compute RoaringBitmap bitsets per facet value, serialize them into Redis (as raw bytes via SETEX with a TTL), and use those bitmaps to reduce the candidate doc set before hitting ES. The intersection of two RoaringBitmaps takes microseconds, not milliseconds. That matters when a page load is triggering 8 of these in parallel.
Here's where the serialization story gets concrete. For a dense set of 1 million document IDs (roughly sequential, simulating a popular category filter), I measured these serialized sizes:
- Plain sorted int[]: 4 MB (4 bytes × 1M ints, no compression)
- Plain long[] bitset: ~122 KB (1M bits / 8 ≈ 125,000 bytes), but you lose sparsity adaptivity entirely
- RoaringBitmap serialized (after runOptimize()): under 2 KB for truly sequential ranges, ~50-100 KB for realistic mixed distributions
That 2KB figure is for the run-length encoding path, which only kicks in if you call runOptimize() before serializing. This is the single biggest gotcha with the library. Without it, Roaring uses its default container types (array containers for sparse, bitset containers for dense), but won't collapse long consecutive runs into run-length containers. For facet indexes where one filter matches "all documents from 2023," your data is almost perfectly sequential, and forgetting runOptimize() means you're serializing 100KB instead of 800 bytes.
import java.nio.ByteBuffer;
import org.roaringbitmap.RoaringBitmap;

RoaringBitmap rb = new RoaringBitmap();
// add your doc IDs however you build the index
for (int docId : docIds) {
rb.add(docId);
}
// MUST call this before serializing — without it,
// run-length encoding doesn't activate for sequential ranges
rb.runOptimize();
// serialize to byte array for Redis or disk
byte[] bytes = new byte[rb.serializedSizeInBytes()];
ByteBuffer bb = ByteBuffer.wrap(bytes);
rb.serialize(bb);
// deserialize later:
RoaringBitmap restored = new RoaringBitmap();
restored.deserialize(ByteBuffer.wrap(bytes));
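To close the loop on the facet pre-filter pattern from earlier, here's roughly what the Redis round trip looks like. I'm assuming redis.clients:jedis as the client, and the key names and TTL are made up for illustration; any client that can move raw bytes works the same way.
import java.nio.ByteBuffer;
import org.roaringbitmap.RoaringBitmap;
import redis.clients.jedis.Jedis;

// Sketch of the facet pre-filter: store serialized bitmaps in Redis,
// pull two back, intersect, use the result as the candidate doc set.
public final class FacetPrefilter {
    static byte[] toBytes(RoaringBitmap rb) {
        rb.runOptimize();                                   // don't forget (see the gotcha above)
        byte[] bytes = new byte[rb.serializedSizeInBytes()];
        rb.serialize(ByteBuffer.wrap(bytes));
        return bytes;
    }

    static RoaringBitmap fromBytes(byte[] bytes) {
        RoaringBitmap rb = new RoaringBitmap();
        rb.deserialize(ByteBuffer.wrap(bytes));
        return rb;
    }

    public static void main(String[] args) {
        try (Jedis redis = new Jedis("localhost", 6379)) {
            RoaringBitmap premium = RoaringBitmap.bitmapOf(1, 5, 9, 10_000, 10_001);
            RoaringBitmap inStock = RoaringBitmap.bitmapOf(5, 9, 42, 10_001);

            redis.set("facet:premium".getBytes(), toBytes(premium));
            redis.set("facet:in_stock".getBytes(), toBytes(inStock));
            redis.expire("facet:premium".getBytes(), 3600);  // refresh hourly, for example

            RoaringBitmap candidates = RoaringBitmap.and(
                fromBytes(redis.get("facet:premium".getBytes())),
                fromBytes(redis.get("facet:in_stock".getBytes())));
            System.out.println("candidates: " + candidates.getCardinality()); // 3 (docs 5, 9, 10001)
        }
    }
}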
If you're not on the JVM, you still get the same wire format. CRoaring (the C library) and go-roaring both speak the same serialization spec, so you can write a bitmap in Java, store it in Redis, and read it in a Go service without any conversion layer. I've used exactly this pattern: a Java indexer writes the bitmaps, a Go API server reads them for pre-filtering before calling Elasticsearch. The cross-language compatibility is real and tested — the spec is frozen and documented at RoaringFormatSpec.md. For C, add croaring via your package manager or CMake; for Go, go get github.com/RoaringBitmap/roaring is all you need.
The 3 Things That Surprised Me
I spent two weeks convinced I was picking the wrong codec. Switched from default to best_compression, reindexed 800GB of data, and saved about 18% on disk. Felt good. Then I looked at the p99 search latency and it had jumped from 40ms to 110ms on our aggregation-heavy dashboard queries. The compression trade-off bit me before I understood it properly, which led to three realizations I wish someone had written down for me.
Surprise 1: Doc values compress better than indexed postings for high-cardinality numeric fields. I had a user_id field mapped as keyword with indexing left on (the default), so it carried both a posting list and a doc values column, even though I only ever used it for aggregations. The indexed version of that field was eating 3x more space than the doc values column. When you remove a numeric field from the inverted index entirely and just keep it as doc values, Lucene's columnar compression (which uses run-length encoding and delta encoding on sorted integers) dominates — and it's dramatically more efficient than posting list compression for fields with millions of distinct values. The fix is simple:
PUT /my_index/_mapping
{
"properties": {
"user_id": {
"type": "keyword",
"index": false, // don't build a posting list at all
"doc_values": true // columnar storage for aggs and sorting
}
}
}
You lose the ability to use user_id in a term query, but if you're only aggregating on it, you don't need that. One caveat: you can't flip index or doc_values on a field that already exists, so in practice this change goes into a new index followed by a reindex. Disk usage on my user_id field dropped by 60% after this change alone — more than any codec switch achieved.
Surprise 2: Codec choice is almost irrelevant if you haven't tackled _source first. On an index with 200 fields per document, _source was occupying 65–70% of total index size. Every codec benchmark I ran was basically measuring noise on top of that dominant cost. Source filtering at query time helps reads but doesn't help storage. The real lever is either disabling _source on archival indexes or using synthetic source (available in Elasticsearch 8.4+). For an archival index where you never need to re-index or update documents, this is the right mapping:
PUT /archive_logs_2024
{
"mappings": {
"_source": {
"enabled": false
}
}
}
On a 200-field index I tested, disabling _source saved 58% of total disk. Switching from default to best_compression codec saved 11%. The ordering of operations matters enormously here, and most guides lead with codec selection because it sounds more technical.
Surprise 3: best_compression isn't free — it trades disk for CPU, and that trade is invisible until you have real read traffic. The codec uses DEFLATE for stored fields instead of LZ4. DEFLATE compresses 30–40% better but decompresses 4–5x slower. On a write-heavy or cold-storage index, this is a great deal. On a hot search path where Elasticsearch is loading stored fields to build highlight snippets or _source responses, you will feel it. The way I measure this now before committing to a codec:
# Force segment merge to get stable compressed size on disk, then benchmark
POST /my_index/_forcemerge?max_num_segments=1
# Then run your actual query mix with a realistic concurrency level
# I use wrk2 with a Lua script that replays production query logs
wrk2 -t4 -c50 -d60s -R500 --latency -s queries.lua http://localhost:9200/my_index/_search
The key insight is that best_compression hurts most when your queries fetch _source or stored fields at high concurrency. If your hot queries are pure aggregations running on doc values, the decompression penalty essentially disappears. When the penalty does show up, it lands on the stored-field fetch, not on the aggregation itself. Profile which storage path your actual queries hit before deciding — don't guess based on the name "best compression" implying it's universally better.
Tantivy as a Reference Implementation Worth Reading
I read Tantivy's source when I want to understand what Lucene is actually doing. The Java implementation of Lucene is impressive, but the class hierarchies are deep and the abstraction layers stack up fast. Tantivy's src/postings/ directory is around 3,000 lines of Rust that covers the same ground — block encoding, skip lists, delta compression — and I can read it in an afternoon without losing the thread. The code comments even reference Lucene's JIRA tickets and paper citations, so it's not just easier to read, it's better annotated.
The postings compression story in Tantivy is block-level bit-packing, with Block-WAND dynamic pruning layered on top at query time. Concretely, doc IDs and term frequencies get packed into 128-doc blocks using the bitpacking crate, where each block picks the minimum bit width needed to represent its values. The thing that caught me off guard was how much of the performance advantage comes from that block structure enabling SIMD unpacking, not from the compression ratio itself. Look at src/postings/serializer.rs — the block boundaries are explicit, and the fallback path for the last partial block is a separate code path that uses VInt encoding instead. That kind of nuance is invisible until you read the code.
Run the benchmarks yourself before trusting any published number. Clone the repo, grab a Wikipedia dump, and:
# From the tantivy repo root
# First, build the index against the Wikipedia dump
# (the indexing example's name has moved around between versions; check examples/ in the repo)
cargo run --release --example index_wiki -- /path/to/enwiki.json
# Then bench
cargo bench -- postings
On my dev machine (Ryzen 7, NVMe SSD), the postings decode throughput benchmarks show delta decoding of a 1M-doc list running around 400-600 MB/s depending on the term's block density. Those numbers shift meaningfully between --release and debug builds — which is obvious in hindsight but still surprises people who forget the flag. The benchmark suite lives in benches/ and is honest about what it's measuring.
What Tantivy does that Lucene doesn't (at least not this cleanly) is compress the term dictionary with finite state transducers via the fst crate by BurntSushi. This is a separate concern from postings compression and worth understanding on its own terms. An FST lets you do prefix and range queries on the dictionary without decompressing it, and the memory overhead is dramatically lower than a hash map or a sorted array with binary search. The dictionary for a 10M-doc Wikipedia index fits in a few hundred MB in memory rather than the multi-GB you'd see with naive approaches. The fst crate has its own excellent documentation if you want to go deep — it's not Tantivy-specific and I've used it in unrelated projects.
My decision rule on Tantivy vs Elasticsearch is simple: if you're building a Rust service and need embedded search — something that runs in-process, no HTTP round-trips, no JVM in the dependency tree — Tantivy is the right answer. I'd also reach for it when building a custom search pipeline where you need to control the exact compression/scoring behavior at the block level and can't afford to fight the Elasticsearch plugin system to get there. Elasticsearch wins when you need distributed search across multiple nodes, when your team already operates it, or when you need the ecosystem (Kibana, APM, etc.). The JVM overhead is real but it's not the killer argument people make it — it's the operational complexity gap that matters more. Tantivy gives you a single static binary with an embedded index. That's a different trade-off, not a better one universally.
When NOT to Optimize Compression
The thing that wasted the most of my time was optimizing compression on an index that was already fully resident in the OS page cache. If your entire index fits in RAM — and you can verify this by watching your page cache hit rate stay at or near 100% — then switching from BEST_SPEED to BEST_COMPRESSION in Lucene literally does nothing useful for query latency. You're burning CPU on encode/decode for data that never touches disk during reads. I made this mistake on a 4GB index running on a box with 32GB of RAM. Spent two days benchmarking codecs. The answer was the same every time: doesn't matter, pick whichever.
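For reference, in plain Lucene that switch is the mode you hand to the codec. A minimal sketch, assuming the two-mode constructor Lucene99Codec ships with in 9.9+ (earlier 9.x codec classes follow the same pattern under their own names):
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.lucene99.Lucene99Codec;
import org.apache.lucene.index.IndexWriterConfig;

// Stored-fields compression mode in plain Lucene (what ES maps "default" /
// "best_compression" onto). Assumes Lucene 9.9+ where Lucene99Codec is current.
IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
cfg.setCodec(new Lucene99Codec(Lucene99Codec.Mode.BEST_COMPRESSION)); // DEFLATE stored fields
// cfg.setCodec(new Lucene99Codec(Lucene99Codec.Mode.BEST_SPEED));    // LZ4 stored fields (the default)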
Write-heavy workloads punish aggressive compression in ways that don't show up until you're under production load. Lucene's BEST_COMPRESSION mode (which uses higher-effort DEFLATE under the hood) can cut your indexing throughput by 30–40% compared to BEST_SPEED. If your indexing SLA is "we need to keep up with 50K documents/sec from Kafka," you cannot afford that. Before you touch any codec setting, actually measure your baseline throughput:
# Quick way to gauge Elasticsearch indexing rate
curl -s "http://localhost:9200/_nodes/stats/indices" \
| jq '.nodes | to_entries[].value.indices.indexing | {index_total, index_time_in_millis}'
If your index_time_in_millis is climbing and your Kafka consumer lag is growing, you have an indexing throughput problem — not a storage problem. Tuning compression here makes it worse, not better.
High update-rate indexes are a trap for compression tuning because of how Lucene actually handles updates: every "update" is a delete plus a new document write, which produces a constant stream of small, young segments. Compression benefits compound when segments merge into large, mature ones — that's when the delta-coding and bit-packing in postings lists get really efficient. If your segments are constantly being created and deleted before they ever merge, you're stuck in the worst-case scenario for both compression ratio and merge overhead. I've seen indexes where _cat/segments showed 200+ segments on a single shard because merging couldn't keep up with the update rate. No codec setting fixes that; you need to fix your data model (immutable append-only if possible) or accept the trade-off.
# Check segment count per shard — more than ~50 on a hot shard is a red flag
curl -s "http://localhost:9200/_cat/segments/your-index?v&h=shard,segment,size,docs.count,docs.deleted"
The most common situation where people hand-tune compression on hot data is one where Elasticsearch's Index Lifecycle Management would just solve the problem for them. If you have time-series data and you're worried about disk usage, the right answer is usually a cold-to-frozen tier transition, not spending a week on codec research. The frozen tier backs the index with a searchable snapshot in your repository, so it stays queryable without holding a full copy on local disk or pinning it to node heap. The config is straightforward:
PUT _ilm/policy/logs-policy
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": { "max_size": "50gb", "max_age": "1d" }
}
},
"cold": {
"min_age": "7d",
"actions": {
"freeze": {}
}
},
"frozen": {
"min_age": "30d",
"actions": {
"searchable_snapshot": {
"snapshot_repository": "my-s3-repo"
}
}
}
}
}
}
This gets you S3 storage costs (~$0.023/GB/month) on data older than 30 days without any custom codec work. The moment you're about to start reading Lucene source code to figure out which StoredFieldsFormat to subclass, stop and ask whether ILM would have solved this in 20 minutes instead.
Quick Reference: Which Encoding for Which Situation
The decision that catches most people off guard isn't whether to compress — everything compresses by default — it's knowing when the default encoding is wrong for your data shape. I've seen engineers spend days tuning JVM heap when the real problem was a dense boolean field still being encoded with VByte, eating 40x more RAM than a bitmap would.
| Situation | Recommended Encoding | Lucene / ES Config | Gotcha |
|---|---|---|---|
| Sparse posting lists (<1% of docs) | VByte delta encoding | Default — no change needed | Works great for rare terms; breaks down fast once density climbs above ~5% |
| Medium-density lists (1–30% of docs) | PFOR / Lucene99 default | Default in Lucene 9+ — no change needed | Frame-of-reference blocks assume reasonably uniform gaps; very spiky delta distributions can bloat block headers |
| Dense lists (>30% of docs) | Roaring Bitmaps / bitmap postings | No index.codec value helps here — Tantivy's block encoding adapts on its own; Lucene requires a custom codec | ES doesn't expose bitmap postings directly; you need Tantivy, a custom engine, or an explicit codec plugin |
| Numeric doc values / range queries | BKD tree | Default for long, integer, date field types | Don't map numeric IDs as keyword expecting better compression — you lose BKD and pay inverted index overhead for nothing |
| Stored fields / _source blob | LZ4 (speed) or DEFLATE (size) | index.codec: best_compression for DEFLATE; default is LZ4 | DEFLATE gets you ~30% smaller _source but fetch latency increases noticeably on large docs — don't enable it if your app does heavy _source fetching under load |
The sparse case is the easiest win to leave on the table. If you have a field with thousands of unique low-frequency terms — think log levels, error codes, rare product tags — VByte delta is already optimal and you should do nothing. Where I've seen actual production wins is forcing a mapping audit on low-cardinality, boolean-ish fields. A field like is_premium or status: active|inactive in a 50M-doc index is almost certainly hitting the dense list regime, and encoding it as a keyword with default postings is genuinely wasteful.
The BKD gotcha deserves more emphasis than it usually gets. If you map a Unix timestamp or a numeric price as keyword because "it's an ID so it's a string," you silently opt out of BKD and range queries go from a tree traversal to a full posting list scan. I caught this once in a log pipeline where request_id (a 64-bit int sent as a string) was being used in range filters. Remapping it to long and reindexing dropped range query latency by about 10x with no other changes.
For the stored fields decision, here's the practical rule I use: if the index is primarily a search index where you display a handful of fields from a result set, enable best_compression. If it's a hot operational index where app code fetches the full _source on every hit (like a document store hybrid), keep LZ4. The config change itself is one line:
PUT /my-index
{
"settings": {
"index.codec": "best_compression" // switches stored fields to DEFLATE
}
}
You can't change codec on an existing index without reindexing — so decide before you build the index, not after you've noticed disk costs. One more thing: best_compression only compresses stored fields, not the inverted index itself. Engineers sometimes expect it to halve total index size and get confused when it's more like a 15–20% reduction on a typical mixed index.
FAQ: Adaptive Compression in Inverted Indexes
Why does my Elasticsearch index shrink dramatically after a force merge, even though I'm already using compression?
Force merge triggers a full segment consolidation, which gives the codec a chance to re-encode posting lists with better entropy estimates. Before the merge, you have many small segments where variable-byte encoding can't exploit the statistical patterns across the full document space. After merging to one segment, the codec sees the complete distribution and can pick tighter gaps between docIDs — especially if your documents were indexed in roughly sorted order by some numeric field. I've seen indexes drop 40–60% in size after a force merge with zero setting changes. The compression was always "on"; it just didn't have enough data to work with per-segment.
What's the actual difference between best_compression and default codec in Elasticsearch?
The best_compression codec swaps Lucene's default LZ4 for DEFLATE on stored fields — the raw _source blob. It has zero effect on posting lists, term dictionaries, or doc values. Those structures use integer compression schemes like FOR (Frame of Reference) and PFOR regardless of which codec you pick. So if your bottleneck is query performance on high-cardinality keyword fields, switching to best_compression does nothing. If your bottleneck is _source retrieval size (think large JSON documents), it helps. Set it at index creation:
PUT /my-index
{
"settings": {
"index": {
"codec": "best_compression"
}
}
}
You cannot change this on a live index. You need to reindex. That's the gotcha most people hit after reading the docs.
Why does FOR (Frame of Reference) encoding sometimes produce larger output than plain variable-byte encoding?
FOR packs a block of 128 integers using the bit-width of the maximum value in that block. If your block has 127 docIDs clustered between 1 and 100, then one outlier at docID 8,000,000, the entire block gets encoded at 23 bits per integer instead of maybe 7. Lucene's PFOR (Patched FOR) handles this by encoding the outliers separately, but you still pay overhead for the patch list. This shows up most visibly in test corpora with synthetic or random docID distributions — not in real production indexes where ingestion order tends to cluster related documents. If you're benchmarking compression ratios and getting surprising results, check whether your test data has realistic docID locality.
Tantivy uses SIMD-BP128 by default. Can I swap it out, and should I?
You can't swap the posting list codec at runtime through config — it's a compile-time choice baked into the crate. SIMD-BP128 is genuinely fast on x86-64 with SSE2/AVX2; the bulk decode throughput is hard to beat for sequential scans. The tradeoff is that it's slightly worse at compression ratio compared to opt-PFD on skewed distributions. If you're on ARM (like an M-series Mac or Graviton instance), the SIMD codepath degrades gracefully but you lose the primary performance advantage. In those cases the compression ratio difference matters more. For most people running on standard x86 cloud instances, leave it alone — the defaults are well-chosen.
My Elasticsearch _cat/indices shows store size, but how do I see which part is posting lists vs stored fields?
Use the segments API with verbose output:
GET /my-index/_segments?verbose=true
That won't break down by internal Lucene file type directly, but you can SSH into the node and run CheckIndex (the same org.apache.lucene.index.CheckIndex invocation shown earlier) against the shard directory to inspect the actual segment files. The .doc files hold doc IDs and frequencies, .pos holds positions, .tim/.tip are the term dictionary, and .dvd/.dvm are doc values. On a live cluster, the index stats API gives you a reasonable breakdown:
GET /my-index/_stats/store,segments?level=shards
The segments.index_writer_memory_in_bytes and segments.memory_in_bytes fields tell you how much is in memory vs flushed. The thing that caught me off guard: Elasticsearch reports uncompressed memory sizes for doc values even when the on-disk representation is compressed, so the numbers won't add up the way you expect.
Does enabling index_options: docs instead of positions actually reduce index size significantly?
Yes, and more than most people expect. Storing positions is the single largest contributor to posting list size for text fields — easily 3–5x larger than storing docIDs alone. If you don't need phrase queries or span queries, set index_options: docs on your field mapping and you skip writing the positions and offsets entirely:
"mappings": {
"properties": {
"body_text": {
"type": "text",
"index_options": "docs" // no positions, no freqs beyond existence
}
}
}
Use freqs if you need BM25 scoring but not phrase matching. Use docs if you only need existence checks or exact-match boolean queries. The compression savings compound with adaptive schemes because shorter lists with smaller integers compress dramatically better. I've cut posting list size by over 50% on log-search indexes by making this change alone.