우병수

Posted on • Originally published at techdigestor.com

Adaptive Compression for Inverted Indexes: How I Stopped Wasting 60% of My Disk Budget

TL;DR: I was staring at an AWS bill — $4,200/month on EBS volumes for a single Elasticsearch cluster — and couldn't figure out where it was going. The indexed documents were large but not that large.

📖 Reading time: ~40 min

What's in this article

  1. The Problem That Made Me Care About This
  2. Quick Refresher: What You're Actually Compressing
  3. The Core Compression Techniques — What Each One Actually Does
  4. What 'Adaptive' Actually Means Here
  5. Getting Your Hands Dirty: Lucene Codec Configuration
  6. Roaring Bitmaps in Practice: When to Force Them
  7. Tantivy: Where Adaptive Compression Is More Transparent
  8. Measuring Compression in Production — The Metrics You Should Actually Watch

The Problem That Made Me Care About This

I was staring at an AWS bill — $4,200/month on EBS volumes for a single Elasticsearch cluster — and couldn't figure out where it was going. The indexed documents were large but not that large. After a week of digging with the _cat/indices API and some segment-level analysis, the answer was ugly: postings lists were eating 60-70% of the storage. Not the source documents. The posting data.

# This command showed me where the pain actually was
GET /_cat/indices?v&h=index,store.size,pri.store.size,docs.count
# Then drilling into segment stats per index
GET /my-index/_stats/store,segments?level=shards

The cluster was indexing behavioral event data — every user action, every session. That means fields like user_id, session_token, and device_fingerprint had cardinalities in the tens of millions. The naive delta encoding that Lucene applies by default (store the difference between consecutive document IDs rather than the raw IDs) works beautifully when a term appears in thousands of closely-clustered docs. It falls apart completely for high-cardinality fields where each term appears in maybe 1-3 documents total, and those document IDs are scattered across the entire ID space. Delta = large number. No compression win. You're basically storing raw 32-bit integers.

To understand why this hurts so much, you need a clear picture of what's actually stored. The term dictionary maps each unique term to a pointer into the postings list — this part is usually fine, it's a sorted structure (a finite state transducer in Lucene) that compresses well. The postings list is the real culprit: for each term, a list of every document ID that contains it, plus frequency data (how many times the term appears in each doc). Layer on top: skip lists embedded inside the postings for fast intersection during boolean queries, and optional position data for phrase queries. For a term like user_id:abc123 that only appears once, you still pay the overhead of this entire structure per unique value.

The reflex answer from every infra team I've seen hit this problem is "add more shards." I understand the instinct — more shards means smaller per-shard indexes, which sometimes means better compression ratios. But this is wrong for two reasons. First, you're distributing the problem, not solving it: the total bytes stored doesn't shrink, you just spread them across more EBS volumes. Second, more shards increase query fan-out, which increases coordinator overhead and often makes your p99 latency worse. I've watched teams go from 5 shards to 20 and see their monthly bill go up because of the coordination overhead on the instance side. The actual fix is making the postings lists themselves smaller through smarter encoding — which is a fundamentally different category of problem.

Quick Refresher: What You're Actually Compressing

The thing that surprises most people when they first crack open a Lucene segment is how few of the files actually dominate disk usage. You'll see a dozen file extensions per segment — .tim, .tip, .doc, .pos, .pay, .nvd and more — but compress the wrong ones aggressively and you'll barely move the needle on disk while tanking query latency.

Here's the breakdown that actually matters. The .tim file holds the term dictionary: every unique token in the segment mapped to its posting list metadata (the .tip file is the index into that dictionary). The .doc file holds doc ID lists and term frequencies interleaved together. The .pos file holds token positions within documents. In a corpus with phrase queries disabled and no highlighting, positions can eat 30–40% of total index size while buying you nothing, which is why Lucene's IndexOptions.DOCS_AND_FREQS exists. Dropping positions is often the single biggest size win available without touching compression at all.

// IndexWriter config: trade phrase query support for ~35% space savings
FieldType ft = new FieldType(TextField.TYPE_STORED);
ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS); // no positions stored
ft.setStoreTermVectors(false);                   // no term vectors either
doc.add(new Field("body", text, ft));

Doc ID lists are the most compressible structure in the whole pipeline, and the reason is monotonic sequence compression. Doc IDs are always stored in ascending order — Lucene enforces this — so instead of storing the raw IDs you store gaps (deltas between consecutive IDs). A list like [142, 143, 144, 145] becomes [142, 1, 1, 1]. When a term appears in most documents (think stopwords, or a "status=active" field in a filtered dataset), those deltas collapse to mostly 1s and 2s, which variable-byte or bit-packing schemes compress near-perfectly. The naive failure mode is treating doc ID lists like arbitrary integers and applying a general-purpose codec like LZ4 directly — you'll get modest compression but completely miss the structural property that makes delta encoding + FOR (Frame of Reference) or PFOR so effective.

Term frequencies have a very different distribution. Most terms appear once per document — that's the median case across typical English text. So the frequency list is heavily skewed toward 1, which means unary coding or a simple bit flag ("freq=1? store 0 bit, done") handles the common case in almost no space. Payload data — arbitrary byte arrays you can attach per position, used for things like word vectors, custom scoring signals, or offset boosts — is the wild card. It's application-defined, so compression ratios vary wildly. I've seen payloads that compress 10:1 because they're repetitive numeric data, and I've seen near-incompressible payloads from pre-quantized float embeddings where you're just wasting CPU trying.

The index-time vs. query-time tradeoff is real and asymmetric. Compression work at index time is paid once; decompression is paid on every query, potentially for every matching posting list. This asymmetry means you should almost always prefer slower encoding (better compression ratio) over faster encoding if your read/write ratio is above roughly 10:1 — which it is for most search applications. The gotcha is block size. Lucene's default block size for FOR-delta compression is 128 doc IDs per block. Smaller blocks decompress faster (less work per skip) but compress worse. Larger blocks compress better but make random access into the middle of a posting list slower because you must decompress the whole block to read one entry. This matters most for high-frequency terms, where skipping past irrelevant blocks during a conjunctive query is the dominant cost.

The Core Compression Techniques — What Each One Actually Does

The surprising thing about inverted index compression is that the "old" techniques are still winning in production. VByte encoding — invented in the early 1990s — is still in Lucene's codebase right now, handling cases where the more aggressive methods would cost more CPU than they save in I/O. The tradeoff isn't purely about compression ratio; it's about decode speed under realistic hardware constraints.

Variable-Byte (VByte) Encoding

VByte uses the high bit of each byte as a continuation flag. If it's 1, more bytes follow. If it's 0, you're done. A number under 128 takes 1 byte. A number under 16,384 takes 2 bytes. This makes random access cheap and branch prediction friendly for small gaps. Lucene uses it in its StoredFields and for skip list offsets — places where you're not doing bulk sequential decoding but occasional lookups. The decode loop looks trivially simple:

// Decoding a single VByte integer — no SIMD, no table lookups needed
int decodeVByte(byte[] buf, int pos) {
    int result = 0, shift = 0;
    byte b;
    do {
        b = buf[pos++];
        result |= (b & 0x7F) << shift;
        shift += 7;
    } while ((b & 0x80) != 0);
    return result;
}

The catch: VByte's worst case is 5 bytes for a 32-bit integer (7 data bits per byte × 5 = 35 bits, enough headroom). And on very dense posting lists, it gets smoked by block compression methods. But for sparse or variable-length data, the simplicity is the point.
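To see the byte counts for yourself, here is a minimal Python sketch of the matching encoder, paired with a port of the decode loop above. The function names are mine, not Lucene's, and the byte order (low 7 bits first) is an assumption chosen to mirror that decoder.

```python
def encode_vbyte(n: int) -> bytes:
    """Encode a non-negative integer: low 7 bits first, high bit set means 'more bytes follow'."""
    if n < 0:
        raise ValueError("VByte encodes non-negative integers only")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # 7 data bits plus the continuation flag
        n >>= 7
    out.append(n)                      # final byte, continuation flag clear
    return bytes(out)


def decode_vbyte(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Return (value, next_position), mirroring the Java decode loop."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            return result, pos


# Encoded length at each size boundary, up to the largest 32-bit value
sizes = {n: len(encode_vbyte(n)) for n in (127, 128, 16_383, 16_384, 2**32 - 1)}
print(sizes)
```

The boundaries fall at powers of 128, which is why small gaps are so cheap and why the largest 32-bit value needs the fifth byte.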

Delta Encoding

Delta encoding is always applied before any integer compression scheme in an inverted index. Instead of storing raw doc IDs [142, 387, 389, 1204], you store the gaps: [142, 245, 2, 815]. The gaps are almost always smaller numbers than the raw values, which means fewer bits per integer regardless of what compression codec you use downstream. After delta encoding, your 20-bit doc IDs might become mostly 8–12 bit gaps, which then compress dramatically better with FOR or VByte. The reconstruction is a prefix sum — cheap, and trivially SIMD-able if you need it.
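A small Python sketch of the round trip, using the doc ID list from the paragraph above. The helper names are hypothetical, and the second list is a made-up example of a rare term whose two documents sit far apart in the ID space.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Delta-encode a strictly ascending doc ID list; the first ID is kept as-is."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]


def from_gaps(gaps):
    """Reconstruct the doc IDs with a prefix sum."""
    return list(accumulate(gaps))


def max_bits(values):
    """Bits needed for the largest value, i.e. the width a fixed-width packer must use."""
    return max(v.bit_length() for v in values)


clustered = [142, 387, 389, 1204]
gaps = to_gaps(clustered)
print(gaps, max_bits(clustered), "->", max_bits(gaps))

# Hypothetical high-cardinality term: two docs, far apart
scattered = [9_000_000, 41_500_000]
print(to_gaps(scattered), max_bits(scattered), "->", max_bits(to_gaps(scattered)))
```

On the clustered list the widest value shrinks by a bit and most gaps are far smaller than that. On the scattered list the gaps are nearly as wide as the raw IDs, which is the failure mode described in the first section.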

Frame of Reference (FOR) and PFOR

FOR picks the minimum bit width needed to represent all integers in a fixed-size block (typically 128 integers), then packs them all at that width. If your block's max value is 300 after delta encoding, you only need 9 bits per value instead of 32. That's a 3.5× compression ratio from bit-packing alone, with no entropy modeling. The problem: one outlier, say a gap of 65,000 in an otherwise tight block, forces the entire block to 16 bits. That's where Patched FOR (PFOR or PForDelta) wins. PFOR picks a bit width that fits most of the block (roughly 90% of values), stores everything at that smaller width, and patches the outliers in separately as exceptions. Elasticsearch's Lucene 9.x codecs use PFOR-style block compression for postings with blocks of 128 doc IDs, which is why you'll see Lucene99PostingsFormat in the index metadata.

# Lucene index codec configuration (lucene-based Elasticsearch index)
PUT /my-index
{
  "settings": {
    "index": {
      "codec": "best_compression"  # triggers DEFLATE on stored fields,
                                   # but postings still use FOR-based block codec
    }
  }
}
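To make the FOR versus patched-FOR arithmetic concrete, here is a rough Python cost model. It is a sketch under stated assumptions, not Lucene's on-disk layout: the 90% cutoff and the 40 bits charged per exception (8-bit position plus 32-bit value) are illustrative choices of mine.

```python
def for_size_bits(block):
    """Plain FOR: every value in the block is stored at the width of the largest one."""
    width = max(1, max(block).bit_length())
    return width * len(block)


def patched_size_bits(block, keep=0.9):
    """Patched FOR (simplified): pick a width that fits ~90% of values and
    charge each value that does not fit as a separate exception."""
    ordered = sorted(block)
    cutoff = ordered[int(len(block) * keep) - 1]
    width = max(1, cutoff.bit_length())
    exceptions = [v for v in block if v.bit_length() > width]
    return width * len(block) + len(exceptions) * (8 + 32)


tight = [300] * 128               # every gap fits in 9 bits
spiky = [300] * 127 + [65_000]    # one outlier widens the whole block

print(for_size_bits(tight), for_size_bits(spiky), patched_size_bits(spiky))
```

In this model the single outlier nearly doubles the plain FOR block, while the patched version stays within a few percent of the tight block.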

SIMD-BP128 and SIMD-Accelerated Variants

SIMD-BP128 is where decode throughput gets genuinely ridiculous. The idea: pack 128 integers using bit-packing, but structure the layout so that 128-bit SIMD registers unpack four 32-bit values per instruction instead of one at a time. Tantivy (the Rust search library) benchmarks its BlockDocumentEncoder at roughly 2–4 billion integers per second decoded on modern x86 with AVX2, compared to ~300–400 million/sec for scalar VByte on the same machine. That's a 6–10× throughput difference that actually shows up in query latency when you're intersecting posting lists with millions of entries. Tantivy uses this by default for its postings format; check src/postings/block_segment_postings.rs if you want to see how the bitpacking crate plugs in. The tradeoff: SIMD codecs require aligned, fixed-size blocks, so the last partial block always needs a fallback codec. Every implementation has this "final block" edge case and it's the source of more bugs than the SIMD code itself.

Roaring Bitmaps

Roaring Bitmaps don't fit neatly into the "compress an integer list" model — they're a hybrid data structure. The 32-bit integer space gets split into 65,536 containers of 65,536 values each (the high 16 bits index into a container, the low 16 bits live inside it). Each container is stored as a plain array if it has fewer than 4,096 values, a bitset if it has more, or a run-length encoded list if it's dense with runs. The crossover from integer list to bitmap encoding makes sense when your doc ID set is dense enough that bits-per-doc drops below ~8–16 bits. In practice, Roaring wins for things like filter caches in Elasticsearch, facet counts, and segment-level live-doc bitsets. It loses to PFOR on raw sequential postings traversal because the container lookup adds indirection that PFOR's flat array layout avoids. Apache Druid uses Roaring heavily for its bitmap indexes on low-cardinality columns; that's a good mental model for when to reach for it.

// Roaring Bitmap in Java, using the RoaringBitmap library (org.roaringbitmap)
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import org.roaringbitmap.RoaringBitmap;

RoaringBitmap rb = new RoaringBitmap();
rb.add(142); rb.add(387); rb.add(388); rb.add(389);
rb.runOptimize(); // converts containers to RLE where that is smaller; call before serializing

// Serialize to bytes for storage alongside your inverted index
ByteArrayOutputStream bos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(bos);
rb.serialize(dos); // throws IOException
byte[] compressed = bos.toByteArray();
// Four values are too few for RLE to pay off; the 10-20x savings show up
// on long contiguous runs (thousands of consecutive IDs)

What 'Adaptive' Actually Means Here

The word "adaptive" gets thrown around loosely, so let me be precise about what it means mechanically: instead of picking one encoding scheme for an entire posting list at index time, you make a fresh codec decision for each block of docIDs as you write it. That decision is based on the actual data in that block: its density, its variance, whether values cluster or spread. Static compression says "this posting list uses VByte." Adaptive compression says "block 0 uses BitPacking, block 1 uses RLE because it happened to be dense, block 2 falls back to VByte because neither worked well." Same posting list, three different encodings, chosen at write time based on evidence.
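Here is a minimal Python sketch of that per-block decision. It is an illustration rather than any engine's actual selector: the three candidate codecs, their size estimates, and the 1-byte width header are assumptions I picked to keep the arithmetic visible.

```python
import math


def vbyte_len(n):
    """Bytes needed to VByte-encode n (7 payload bits per byte)."""
    return max(1, math.ceil(n.bit_length() / 7))


def candidate_sizes(gaps):
    """Estimated encoded size in bytes for one block under each candidate codec."""
    width = max(1, max(gaps).bit_length())
    sizes = {
        "bitpacked": 1 + math.ceil(width * len(gaps) / 8),  # 1-byte width header
        "vbyte": sum(vbyte_len(g) for g in gaps),
    }
    if len(set(gaps)) == 1:
        sizes["const"] = vbyte_len(gaps[0])  # one value describes the whole block
    return sizes


def choose_codec(gaps):
    """Pick whichever candidate is smallest for this block's actual data."""
    sizes = candidate_sizes(gaps)
    return min(sizes, key=sizes.get)


blocks = {
    "dense run": [1] * 128,
    "tight gaps": [3, 1, 4, 1, 5, 9, 2, 6] * 16,
    "one outlier": [2] * 127 + [1_000_000],
}
choices = {name: choose_codec(gaps) for name, gaps in blocks.items()}
print(choices)
```

Three blocks from the same hypothetical posting list end up with three different encodings, each chosen only from the data in that block.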

Lucene 9.x's IndexedDISI is the most studied real-world example of this. It encodes the set of documents that have a value for a doc-values or norms field, and it maintains a per-block representation that switches between a dense bitset and a sparse integer list depending on how many documents are set in that 65536-doc chunk. The threshold is 4096 set bits: below that, storing the raw list of 16-bit integers is cheaper than allocating the full 8KB bitset. Above it, the bitset wins because every additional doc costs exactly 1 bit instead of 16. The switching logic lives in IndexedDISI.java, and each block's header records its cardinality, which is all the decoder needs to pick the right path. What catches most people off guard is that this decision is per 65536-document window, not per field: a large segment has many of these blocks for a single field, each independently encoded.

// Simplified logic from IndexedDISI (not the actual source, but the structure)
if (cardinality == 65536) {
    // ALL: every doc in the block is set, so the header alone is enough
    writeBlock(ALL, null);
} else if (cardinality >= 4096) {
    // DENSE: write a 65536-bit bitset (8192 bytes)
    // Each additional doc costs 1 bit
    writeBlock(DENSE, bitset);
} else if (cardinality > 0) {
    // SPARSE: write raw 16-bit offsets
    // Each doc costs 2 bytes, so 4096 docs would be 8192 bytes at the crossover
    writeBlock(SPARSE, shortList);
}
// Blocks with no documents at all are simply not written

Roaring Bitmaps formalized this exact threshold mathematically, and if you're implementing your own structure, the 4096-out-of-65536 number is worth internalizing. At exactly 4096 values, a sorted array of 16-bit integers consumes 8192 bytes — identical to a full 65536-bit bitmap. Below 4096, the array is smaller. Above it, the bitmap wins. Roaring adds a third representation: a sorted array of 16-bit run-length encoded pairs for cases where documents are nearly contiguous (think "all docs in a date range"). That run container beats both other formats when documents cluster. This three-way adaptive decision is what makes Roaring so fast in practice — it's not doing anything clever algorithmically, it's just refusing to use a bad representation.
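The three-way comparison is easy to reproduce. The Python sketch below estimates the payload size of one 65,536-value chunk in each representation; the function names are mine, and the run-container cost (2 bytes for the run count plus 4 bytes per run) is the only format detail it relies on beyond the numbers in the paragraph above.

```python
def count_runs(sorted_values):
    """Number of maximal runs of consecutive integers."""
    runs, prev = 0, None
    for v in sorted_values:
        if prev is None or v != prev + 1:
            runs += 1
        prev = v
    return runs


def container_sizes(values):
    """Approximate payload bytes for one 65,536-value chunk in each representation."""
    values = sorted(set(values))
    return {
        "array": 2 * len(values),            # one 16-bit entry per value
        "bitmap": 65_536 // 8,               # fixed 8,192 bytes
        "run": 2 + 4 * count_runs(values),   # run count + (start, length) pairs
    }


def best_container(values):
    sizes = container_sizes(values)
    return min(sizes, key=sizes.get)


print(best_container(range(0, 65_536, 64)))    # 1,024 scattered values
print(best_container(range(0, 65_536, 2)))     # 32,768 values, no runs
print(best_container(range(1_000, 61_000)))    # one long contiguous run
print(container_sizes(range(0, 65_536, 16)))   # exactly 4,096 values
```

At exactly 4,096 values the array and the bitmap cost the same, which is the crossover the text describes; a single long run beats both by orders of magnitude.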

Tantivy takes a slightly different angle. Rather than working at the 65536-doc block granularity, it operates on 128-document blocks and computes statistics — minimum value, maximum value, number of distinct values — then selects from several bitpacking widths plus a const codec (for blocks where all values are identical). The 128-doc block size is load-bearing: it's small enough that a single SIMD register (256-bit AVX2 can process 8 x 32-bit values, so 128 values = 16 SIMD ops) can decode an entire block in one tight loop, but large enough that the per-block metadata overhead stays under 1% of total payload. Go smaller than 64 docs and your codec headers start costing more than your savings. Go larger than 256 and you lose the ability to skip through sparse lists efficiently during AND queries.

The 128-doc block standard didn't happen by accident — it shows up in Lucene's Lucene912PostingsFormat, Tantivy, and PISA alike. The original PFOR-Delta paper by Zukowski et al. tested block sizes from 32 to 512 and found 128 consistently won across diverse posting list distributions. The intuition: short posting lists (rare terms) often fit in 1–2 blocks, so you want blocks small enough that a list for a term occurring 50 times doesn't waste half a block. Long posting lists (common terms) benefit from larger blocks only if your SIMD width matches — and 128 x 32-bit integers fits exactly in cache lines with room for the block header. The "adaptive" part and the "128-doc block" part are coupled: you need small enough blocks that the codec decision is locally meaningful, but consistent enough block sizes that your decoder can vectorize without branching on length.
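The tension between header overhead and local adaptivity can be shown with a toy model. This Python sketch packs a synthetic gap list with plain FOR at several block sizes; the data, the 1-byte per-block header, and the outlier positions are all assumptions made for illustration, so treat the numbers as a demonstration of the trend rather than a benchmark.

```python
import math


def for_total_bytes(gaps, block_size):
    """Total size when each block is bit-packed at its own width, plus a 1-byte header."""
    total = 0
    for start in range(0, len(gaps), block_size):
        block = gaps[start:start + block_size]
        width = max(1, max(block).bit_length())
        total += 1 + math.ceil(width * len(block) / 8)
    return total


# 4,096 small gaps (values 1..7), then the same list with four large outliers
clean = [1 + (i % 7) for i in range(4096)]
spiky = list(clean)
for pos in (500, 1500, 2500, 3500):
    spiky[pos] = 100_000

sizes_clean = {b: for_total_bytes(clean, b) for b in (32, 128, 512)}
sizes_spiky = {b: for_total_bytes(spiky, b) for b in (32, 128, 512)}
print(sizes_clean)
print(sizes_spiky)
```

On the clean list, larger blocks win slightly because there are fewer headers to pay for. Once outliers appear, every outlier poisons its whole block, so larger blocks get expensive fast. A mid-sized block is a compromise between those two pressures.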

Getting Your Hands Dirty: Lucene Codec Configuration

The thing that caught me off guard when I first started tuning Lucene compression is that the codec setting in Elasticsearch controls far less than you'd think. Most people assume it's a master switch for everything. It's not — and misunderstanding that boundary will have you chasing phantom performance gains for hours.

First, check what your index is actually using right now:

curl -X GET 'localhost:9200/my-index/_segments?pretty'

Look for the codec field in the response. You'll see something like "Lucene99" per segment. Different segments in the same index can use different codecs if you've changed settings mid-life — Lucene doesn't retroactively recompress old segments. Only newly written and merged segments pick up the new codec. This matters enormously: if you set best_compression on an existing index and immediately benchmark it, you're probably benchmarking the old codec on most of your data.

The actual config to switch is this:

PUT /my-index
{
  "settings": {
    "codec": "best_compression"
  }
}

You can only do this on a closed index or at creation time. Trying it on an open, running index will get you a 400 with Can't update non dynamic settings. The two built-in options are default and best_compression. The difference in practice: best_compression switches the stored fields format from LZ4 to DEFLATE. Stored fields are what Elasticsearch uses to reconstruct the _source document. Expect roughly 15–30% smaller stored field files on typical JSON-heavy indices, with a measurable fetch-phase latency cost because DEFLATE decompression is slower than LZ4 — especially noticeable when you're fetching thousands of docs per query.

Here's the part that trips everyone up: postings compression — the actual inverted index — is not controlled by the codec setting at all. Lucene's postings (doc IDs, term frequencies, positions, offsets) are compressed using the postings format registered inside the codec class. In Lucene 9.x, the default postings format is Lucene99PostingsFormat, which uses PFOR (Patched Frame of Reference) for doc ID and frequency blocks and VByte for smaller residuals. You cannot swap this out through Elasticsearch's index settings UI — there's no exposed knob for it. To change the postings compression algorithm, you have to build a custom codec in Java and register it as an Elasticsearch plugin.

Here's a minimal skeleton for a custom codec that lets you experiment with postings formats on your own corpus:

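// Requires Lucene 9.9+ on the classpath; match your ES version exactly
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter;
import org.apache.lucene.codecs.lucene99.Lucene99Codec;
import org.apache.lucene.codecs.lucene99.Lucene99PostingsFormat;

public class MyExperimentalCodec extends FilterCodec {

    public static final String CODEC_NAME = "MyExperimental";

    public MyExperimentalCodec() {
        // Delegate everything else to the production codec
        super(CODEC_NAME, new Lucene99Codec(Lucene99Codec.Mode.BEST_SPEED));
    }

    @Override
    public PostingsFormat postingsFormat() {
        // This is the hook: return whatever PostingsFormat you want to test.
        // As written it returns the stock format with its default term
        // dictionary block sizes, so behavior is unchanged until you swap it.
        return new Lucene99PostingsFormat(
            Lucene90BlockTreeTermsWriter.DEFAULT_MIN_BLOCK_SIZE,
            Lucene90BlockTreeTermsWriter.DEFAULT_MAX_BLOCK_SIZE
        );
        // Note: these two arguments size the term dictionary blocks, not the
        // 128-doc postings blocks. The postings block size is fixed inside the
        // format, so changing it means writing your own PostingsFormat.
    }
}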

Register this codec via META-INF/services/org.apache.lucene.codecs.Codec inside your plugin JAR, then reference it by name in your index settings: "codec": "MyExperimental". The gotcha here is version pinning — Lucene codecs are not stable across minor versions. If you build against Lucene 9.9 and your Elasticsearch node is running 9.10 internally, the codec will refuse to load. Always check the exact Lucene version bundled with your ES release using curl localhost:9200 and match it in your pom.xml or build.gradle precisely. Running ./gradlew dependencies | grep lucene-core in your plugin project before every ES upgrade has saved me at least three broken deployments.

Roaring Bitmaps in Practice: When to Force Them

The thing that catches most people off guard: Roaring Bitmaps aren't automatically better than a plain sorted integer array. The crossover point depends entirely on your document ID distribution. I've seen dense, sequential doc ID sets where a simple byte-packed array outperformed Roaring by 20-30% in both memory and decode speed — and I've seen sparse, high-cardinality filter sets where Roaring cut memory by 10x. The library doesn't pick for you. You have to measure against your actual data.

Start with pyroaring for local benchmarking before you touch any cluster config:

pip install pyroaring

import random
import time
from pyroaring import BitMap

# Simulate two doc ID distributions: dense vs. sparse
N = 10_000_000

dense_ids = list(range(0, 500_000))               # 5% of the 10M universe, contiguous
sparse_ids = random.sample(range(0, N), 50_000)   # 0.5%, random

for label, ids in [("dense", dense_ids), ("sparse", sparse_ids)]:
    bm = BitMap(ids)
    bm.run_optimize()  # enable run-length encoding for contiguous ranges (critical step)

    # Baseline: a packed array of 32-bit IDs. sys.getsizeof() on a Python list
    # only counts the pointer array, not the int objects, so it is not a fair baseline.
    plain_array_bytes = len(ids) * 4
    roaring_bytes = len(bm.serialize())

    t0 = time.perf_counter()
    for _ in range(10_000):
        _ = 250_000 in bm          # membership check
    roaring_lookup = (time.perf_counter() - t0) / 10_000

    id_set = set(ids)              # build the set outside the timed region
    t0 = time.perf_counter()
    for _ in range(10_000):
        _ = 250_000 in id_set
    set_lookup = (time.perf_counter() - t0) / 10_000

    print(f"[{label}] uint32 array: {plain_array_bytes/1024:.1f} KB | roaring: {roaring_bytes/1024:.1f} KB")
    print(f"[{label}] roaring lookup: {roaring_lookup*1e6:.2f}µs | set lookup: {set_lookup*1e6:.2f}µs")

Run this with your actual doc ID samples before committing to any codec change. The run_optimize() call is non-optional: without it, Roaring won't collapse contiguous runs into the RLE container format, and a dense sequential range stays in 8 KB bitset containers when a handful of run pairs would do. The serialized size difference between optimized and unoptimized can be 40x or more on sequential ID blocks like those generated by time-series indexing.

OpenSearch and vanilla Elasticsearch expose the filter cache slightly differently, and this trips people up. In Elasticsearch 8.x, the relevant on-disk knob is index.codec at the index level, and the BKD tree structures handle numeric fields separately from the postings codec. OpenSearch forked at 7.10 and has since added its own filter rewrite optimization path for aggregations, but in both engines the cached filter bitsets bypass the standard codec pipeline entirely. You can't configure them through index.codec: best_compression. The filter cache has been called the query cache in the stats APIs since Elasticsearch 2.0, so check the _nodes/stats/indices/query_cache endpoint to see if your bitmap cache is actually being hit:

GET /_nodes/stats/indices/query_cache?pretty
# Look for: query_cache.memory_size_in_bytes and query_cache.evictions
# High evictions = your heap budget for the cache is too small, not a codec problem

The big gotcha: Roaring Bitmaps in Lucene-based engines are a filter/cache concern, not a postings list codec concern, and these are completely separate systems. The index.codec setting (options: default, best_compression, or a custom codec like Zstd in OpenSearch 2.9+) controls how segment data gets compressed on disk, and with the built-in options that means stored fields. Roaring Bitmaps live in the in-memory filter cache used for things like term filters on low-cardinality keyword fields. Tuning your codec will not make your filter cache use Roaring more aggressively, and vice versa. I've watched engineers spend two days on codec benchmarks when their actual bottleneck was filter cache eviction rate. Profile the cache hit ratio first (the stats APIs call it query_cache):

GET /my-index/_stats/query_cache?pretty
# query_cache.hit_count / (hit_count + miss_count) below 0.85 means you're re-building bitmaps constantly
# Fix: increase indices.queries.cache.size (default 10% of heap) before touching codec settings
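If you pull those stats into a script, the ratio check is a few lines. This Python sketch assumes you have already extracted a dict with hit_count and miss_count from the stats response; the sample numbers are invented, and the 0.85 threshold is the rule of thumb from above, not a documented default.

```python
def hit_ratio(cache_stats):
    """Fraction of cache lookups that were hits, or None if there were no lookups."""
    hits = cache_stats["hit_count"]
    misses = cache_stats["miss_count"]
    lookups = hits + misses
    return hits / lookups if lookups else None


def needs_bigger_cache(cache_stats, threshold=0.85):
    """True when the hit ratio is below the threshold (rule of thumb, tune for your workload)."""
    ratio = hit_ratio(cache_stats)
    return ratio is not None and ratio < threshold


# Hypothetical numbers standing in for one index's cache stats
sample = {"hit_count": 7_200, "miss_count": 2_800, "evictions": 1_150}
print(hit_ratio(sample), needs_bigger_cache(sample))
```

Track the ratio over time rather than reading it once: the counters are cumulative since node start, so a recent regression can hide behind a long healthy history.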

When Roaring actually wins: high-cardinality keyword or numeric fields used repeatedly as filters across many queries, doc ID sets that are sparse across a large universe (e.g., user IDs in a multi-tenant index where each tenant filters to their own subset), and boolean combinations of multiple filters where the AND/OR operations on compressed bitmaps beat reconstructing from disk every time. When it doesn't win: small indices under a few million documents where the overhead of the container structure isn't justified, and append-only time-series workloads where doc IDs are always dense and sequential — in that case BKD trees with range queries will beat any bitmap approach cleanly.

Tantivy: Where Adaptive Compression Is More Transparent

Most search engineers I know treat Elasticsearch as a black box that occasionally emits metrics. That's fine until you're trying to answer "why is my index twice as large as I expected?" — at which point the gap between "we use BM25" and "here's the byte-level layout" becomes a real problem. I started poking at Tantivy specifically because I needed to see what compression decision was being made per posting list block, and Elasticsearch's _cat/indices API was not going to give me that.

Getting a basic index up in Tantivy is genuinely fast. Here's the skeleton I use to get a schema and writer going — this compiles against Tantivy 0.21 with the default feature set:

use tantivy::schema::*;
use tantivy::{doc, Index};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();

    // STORED + TEXT gives you both forward and inverted index
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let _body = schema_builder.add_text_field("body", TEXT);
    // u64 fields trigger numeric codec selection (important for compression experiments)
    let views = schema_builder.add_u64_field("views", INDEXED | FAST);

    let schema = schema_builder.build();
    // create_in_dir expects the directory to exist already
    std::fs::create_dir_all("/tmp/tantivy_test")?;
    let index = Index::create_in_dir("/tmp/tantivy_test", schema)?;

    let mut writer = index.writer(50_000_000)?; // 50MB heap budget
    writer.add_document(doc!(title => "adaptive compression deep dive", views => 9400_u64))?;
    writer.commit()?;
    Ok(())
}

After writer.commit(), Tantivy writes a segment and a meta.json file you can actually read. That file lists the segments, the schema, and index-level settings such as the doc store compression. Crack it open:

# After a commit, segments land in your index directory
cat /tmp/tantivy_test/meta.json | python3 -m json.tool

# You'll see the segment IDs with their doc counts, the schema, and the index settings.
# Per-block codec choices are not recorded here; they are made inside the postings
# serializer, so instrument that code path if you want to watch them.
# Segment file extensions include .idx (postings), .pos (positions), .term (term dictionary), .fieldnorm

The core decision point is in how Tantivy processes 128-document blocks. BlockedBitpacker kicks in when the maximum value in a block is small enough that bit-packing beats variable-length encoding. Concretely, if all doc ID deltas in the block fit in, say, 7 bits, it packs them at 7 bits each. VIntEncoder takes over when the value distribution is spiky or the block is too sparse to justify the fixed-width overhead. This is adaptive compression at the block level, not the field level: the codec can flip between them every 128 docs. The relevant logic lives in tantivy/src/postings/block_segment_postings.rs if you want to read the actual threshold comparisons rather than trust the docs.

The honest trade-off: Tantivy gives you a source-readable, instrumentable system where you can println! your way into the codec selection path. That visibility is real and I've used it to diagnose index bloat that would have been a support ticket with Elasticsearch. The cost is that when something does break — a corruption edge case, a merge policy interaction, a tokenizer behaving unexpectedly on non-ASCII input — you are on your own. The community is active but small. The GitHub issues are often answered by the author with "here's the code path, read it." That's not a criticism; it's just the reality of using a library instead of a service. If your team doesn't have at least one person comfortable reading Rust source when things go sideways, the transparency becomes theoretical.

Measuring Compression in Production — The Metrics You Should Actually Watch

The metric that actually matters isn't the one on your storage dashboard. Most teams optimize for disk usage, celebrate a 40% reduction, then wonder why their p99 query latency climbed 80ms. The real trade-off is CPU time spent decompressing postings at query time versus bytes saved on disk, and those two numbers rarely move in the same direction.

Cluster-Level Storage: Reading the Cat API Output

Start here to get your baseline before touching any codec settings:

GET /_cat/indices?v&h=index,pri.store.size,store.size,docs.count,segments.count

The difference between pri.store.size and store.size matters. pri is primary shards only, and that's your actual compressed data footprint. store.size includes replicas, so with two replicas per primary the total is roughly three times the primary footprint, which makes storage look three times worse than it is for compression analysis. I've watched people panic about "bloated" indices that were just over-replicated. Pin your compression analysis to pri.store.size and note docs.count alongside it so you can track bytes-per-document over time as your index grows.
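A short Python sketch of the bytes-per-document calculation. It assumes you fetched the cat output as JSON with sizes in raw bytes (the format=json and bytes=b query parameters), and the two rows below are invented sample data, not output from a real cluster.

```python
import json

# Hypothetical response from:
# GET /_cat/indices?format=json&bytes=b&h=index,pri.store.size,store.size,docs.count
sample_response = """[
  {"index": "events-2024.01", "pri.store.size": "52428800", "store.size": "157286400", "docs.count": "1000000"},
  {"index": "events-2024.02", "pri.store.size": "31457280", "store.size": "94371840", "docs.count": "800000"}
]"""


def bytes_per_doc(cat_rows):
    """Primary-shard bytes per document for each index; the cat API returns strings."""
    report = {}
    for row in cat_rows:
        docs = int(row["docs.count"])
        if docs == 0:
            continue  # empty index, ratio is undefined
        report[row["index"]] = int(row["pri.store.size"]) / docs
    return report


report = bytes_per_doc(json.loads(sample_response))
for index, ratio in report.items():
    print(f"{index}: {ratio:.1f} bytes/doc")
```

Recording this number per index after each codec or mapping change gives you a size metric that doesn't move just because the index grew or the replica count changed.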

Segment-Level Inspection: Where the Real Signal Lives

The cat API hides the variance across segments. A freshly merged large segment compresses very differently from the dozens of small in-memory segments that just flushed. Get the per-segment breakdown:

GET /my-index/_segments?verbose=true

Look at the size_in_bytes per segment alongside num_docs and compound. Segments with compound: true are CFS files, meaning Lucene has packed that segment's individual files into a single compound file. That saves file handles rather than bytes, so don't read it as a compression signal. If you see many small segments (under ~10MB), those are candidates for a force merge, and their apparent "poor compression" is a segment maturity problem, not a codec problem. Also check the segment's attributes map (for example Lucene90StoredFieldsFormat.mode, or the equivalent for whichever Lucene version your cluster runs) to confirm the codec setting actually applied to each segment; codec changes only affect newly written segments, not old ones.

The Metric Most Teams Miss: Decompression CPU Under Load

Disk savings are visible in your ops dashboard. Decompression cost is invisible until you're in an incident. The way to measure it properly is to watch CPU alongside latency during a realistic query mix, not during off-peak hours. On Elasticsearch 8.x, use the node hot threads API while running your benchmark:

GET /_nodes/hot_threads?interval=5s&snapshots=5&threads=5

If you see LZ4 or Deflate decompressor frames consistently in those hot threads, you're paying CPU for your stored-field compression wins; postings decoding shows up as ForUtil or PForUtil frames instead. The break-even point I've found in practice: DEFLATE (best_compression) saves roughly 15-25% more bytes than LZ4 on the indices I've tested, but costs 3-4x the CPU on decode. If your cluster is CPU-bound and not I/O-bound, which is common on NVMe-backed nodes, that trade is almost always a loss. LZ4 is the default for a reason.

Using the Profile API to Isolate Postings Decoding Bottlenecks

The _profile API is the closest thing to a profiler you have inside Elasticsearch. Run it against your slowest query pattern:

POST /my-index/_search
{
  "profile": true,
  "query": {
    "bool": {
      "must": [
        { "term": { "status": "published" } },
        { "match": { "body": "distributed systems" } }
      ]
    }
  }
}

Inside the response, drill into profile.shards[*].searches[*].query[*].breakdown. The fields to watch are next_doc and advance, together with their next_doc_count and advance_count call counters. These represent how much time Lucene spent iterating through the postings list and how often it did so. If next_doc time is dominant and scales with your doc count rather than your result set size, that's postings iteration overhead, which directly reflects how expensive your block decompression is per posting list traversal. Compare this number before and after a codec change on a test index with identical data. You need the segments to be fully merged for the comparison to be valid, otherwise you're comparing segment counts, not codec performance.
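A small Python sketch of that drill-down, assuming the usual profile response shape (profile.shards[].searches[].query[] with nested children). The helper name `iteration_costs` and the trimmed sample response are made up for illustration. Parent nodes' timings appear to include time spent in their children, so the sketch lists nodes side by side rather than summing them.

```python
def iteration_costs(profile_response):
    """Flatten the profiled query tree into rows of postings-iteration timings."""
    rows = []

    def walk(node, depth):
        breakdown = node.get("breakdown", {})
        rows.append({
            "depth": depth,
            "type": node.get("type", "?"),
            "next_doc_ns": breakdown.get("next_doc", 0),
            "advance_ns": breakdown.get("advance", 0),
            "calls": breakdown.get("next_doc_count", 0)
                     + breakdown.get("advance_count", 0),
        })
        for child in node.get("children", []):
            walk(child, depth + 1)

    for shard in profile_response["profile"]["shards"]:
        for search in shard["searches"]:
            for node in search["query"]:
                walk(node, 0)
    return rows


# Hypothetical, heavily trimmed response; real responses carry many more fields.
sample = {
    "profile": {"shards": [{"searches": [{"query": [{
        "type": "BooleanQuery",
        "breakdown": {"next_doc": 900, "next_doc_count": 40,
                      "advance": 0, "advance_count": 0},
        "children": [{
            "type": "TermQuery",
            "breakdown": {"next_doc": 0, "next_doc_count": 0,
                          "advance": 600, "advance_count": 35},
        }],
    }]}]}]}
}

for row in iteration_costs(sample):
    print(row)
```

Dividing the nanoseconds by the call count gives a rough per-call cost, which is the figure worth comparing across codec changes.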

A Real Before/After: p50 vs p99 After Switching Codecs

I ran this on a 200M document log index, switching from the default Lucene90PostingsFormat (FOR/PFOR bit-packed blocks) to LuceneVarGapDocFreqIntervalPostingsFormat with tighter compression for a low-cardinality status field. The reindex took about 4 hours on 6 data nodes. The numbers:

  • pri.store.size: 380GB → 291GB (23% reduction)
  • p50 query latency: 12ms → 11ms (within noise, effectively unchanged)
  • p99 query latency: 340ms → 580ms on high-concurrency bursts (70% worse)
  • Node CPU average: 41% → 67% at peak query load

The p50/p99 split is what tells the real story. Median latency barely moved because most queries hit hot OS page cache. But p99 — the queries that actually miss cache and have to decompress cold postings from disk — got dramatically worse. Storage-focused teams would look at that 23% disk reduction and call it a win. Query-latency-focused teams would roll it back immediately. Track both, always split by percentile, and always stress-test with a cache-busting query mix before committing to a codec change in production.
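Splitting by percentile is cheap to do yourself if your load-test tool only reports averages. Here is a minimal Python sketch using the nearest-rank definition. The latency samples are synthetic, chosen only to show how a mean hides the tail.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = -(-len(ordered) * pct // 100)  # ceil(n * pct / 100) in integer math
    return ordered[max(rank, 1) - 1]


# Synthetic data: 95 cache hits at 10 ms, 5 cold reads at 300 ms.
latencies_ms = [10] * 95 + [300] * 5
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(mean_ms)                       # 24.5
print(percentile(latencies_ms, 50))  # 10
print(percentile(latencies_ms, 99))  # 300
```

The mean lands at a latency no request actually experienced, while the two percentiles show the cache-hit path and the cold path separately.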

Gotchas I Hit and How to Avoid Them

The one that burned me hardest: there is no online codec migration in Elasticsearch or Solr. In Elasticsearch, index.codec is a static setting, so you can't change it on an open index, and even after you change it on a closed one, only newly written segments use the new codec. If you benchmarked best_compression on a test cluster and want to roll it out to production, the existing data has to be rewritten. Your options are to close the index, update the setting, reopen it, and force merge (which means a window of unavailability), or to do a blue/green swap: create a new index with the correct codec, reindex into it, then alias-cut. I've seen teams discover this at 2am after a capacity event. Don't be those teams. Put codec selection in your index template before you ever write a document.

PUT _index_template/my_logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "codec": "best_compression",
      "number_of_shards": 3
    }
  }
}

best_compression (which uses DEFLATE instead of LZ4 for stored fields) trades write throughput for space. I measured a 40% drop in merge throughput on a write-heavy index doing ~80k docs/sec. Merges started queuing up, and eventually indexing throttled because Elasticsearch's merge scheduler couldn't keep up. best_compression is the right call for cold/warm indices where you're optimizing for storage cost and reads are infrequent. For hot write paths, stick with the default codec and consider best_compression only after ILM moves shards to the warm tier and force merges them down to one segment.

Delta encoding handles dense posting lists well on a per-entry basis, but that doesn't rescue stopwords and other high-frequency terms like "the", "is", or "error" in log data. These terms appear in almost every document, so their posting lists carry an entry (plus frequencies and positions on text fields) for nearly every doc ID, and they stay large in absolute terms no matter how tightly each gap is packed. They balloon your postings structure with little query benefit. Filter them at analysis time, not at query time:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "no_stopwords": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

Nested documents are a silent killer for doc ID list size. Each nested object gets its own synthetic Lucene document with its own doc ID, so a parent document with 20 nested objects contributes 21 doc IDs to the postings for every field that appears in it. I ran a benchmark on a mapping with nested line items and couldn't figure out why my postings were 3x larger than expected — until I ran GET /my_index/_stats?filter_path=**.doc_count and saw the true document count was 18x the number of parent documents I'd indexed. Audit your mapping before you benchmark compression ratios. If you don't need sub-document isolation for scoring, flattened field type or plain object type will save you significantly.

The _source field is stored in a completely separate data structure from the inverted index postings, and its compression is controlled by different knobs. Postings compression (what Lucene postings formats govern) affects how term → doc ID mappings are stored. _source compression is row-oriented stored field compression, governed by the codec's stored fields format; that's the LZ4 vs DEFLATE choice behind best_compression. If you disable _source entirely and then measure your index size drop, you cannot attribute that savings to postings compression. I've seen people do this comparison and walk away with completely wrong conclusions about what the codec change achieved. Measure them separately using GET /my_index/_stats/store,indexing and break out store_size changes with and without _source.

Lucene codec version pinning will quietly destroy your benchmarks across major Elasticsearch upgrades. Elasticsearch 7.x defaulted to Lucene 8's codec, ES 8.x moved to Lucene 9, and the default postings format changed in ways that affect compression ratios and query performance independently of anything you configured. If you benchmarked compression in ES 7.17 and then upgraded to ES 8.x, your numbers are not comparable — the baseline shifted. The practical fix is to explicitly name codecs in your benchmark setup notes and re-run baselines after every major upgrade. You can inspect what codec a segment was written with using the Lucene CheckIndex tool or via the ES _cat/segments API's version column, which gives you the Lucene segment version.

Decision Framework: Which Approach for Which Situation

The biggest mistake I see teams make is cargo-culting compression settings from a high-traffic search startup's blog post onto their 200MB product catalog index. The context mismatch alone negates any benefit. So before touching a single codec setting, you need to answer three questions: What's the cardinality of your fields? What's the read/write ratio? How many documents map to a typical term?

High-Density Boolean Filters: Roaring Bitmaps

Status flags, feature flags, soft-delete markers — any field where you have maybe 5-20 distinct values spread across millions of documents. This is where Roaring Bitmaps are genuinely dominant. The internal structure automatically switches between array containers (sparse regions), bitmap containers (dense regions), and run-length encoded containers (long consecutive runs). You don't manage that switching; it happens per 65,536-document block automatically. A status=active posting list covering 40% of your index compresses down to almost nothing compared to what delta + VByte would give you. The union/intersection operations also stay cheap because the SIMD-accelerated bitwise ops on the bitmap containers are faster than decoding variable-length integers during query execution. If you're on Lucene-based systems (Elasticsearch, OpenSearch, Solr), this is already what DocValues uses under the hood for dense numeric fields — but you can explicitly force it for your filter fields.
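The container switching can be illustrated with a short Python sketch. It is not the Roaring library's code; it only applies the size rule as I understand the Roaring format (2 bytes per value for arrays, a fixed 8 KiB bitmap, and 2 + 4 bytes per run for run containers) to each 65,536-ID chunk. The function names and doc ID sets are hypothetical.

```python
from collections import defaultdict

BITMAP_BYTES = 65536 // 8  # fixed-size bitmap container: 8 KiB


def count_runs(lows):
    """Number of maximal runs of consecutive values in a sorted list."""
    runs = 0
    prev = None
    for value in lows:
        if prev is None or value != prev + 1:
            runs += 1
        prev = value
    return runs


def choose_containers(doc_ids):
    """Pick the smallest container type for each 65,536-ID chunk."""
    chunks = defaultdict(list)
    for doc_id in sorted(set(doc_ids)):
        chunks[doc_id >> 16].append(doc_id & 0xFFFF)  # high bits: chunk, low bits: value
    plan = {}
    for key, lows in chunks.items():
        sizes = [
            (2 * len(lows), "array"),           # 16 bits per stored value
            (BITMAP_BYTES, "bitmap"),           # one bit per possible value
            (2 + 4 * count_runs(lows), "run"),  # (start, length) pair per run
        ]
        plan[key] = min(sizes)[1]
    return plan


doc_ids = (
    [5, 70000, 70010]                         # sparse hits in chunks 0 and 1
    + list(range(131072, 196608, 2))          # every other doc in chunk 2
    + list(range(196608, 196608 + 50000))     # one long consecutive run in chunk 3
)
print(choose_containers(doc_ids))
# {0: 'array', 1: 'array', 2: 'bitmap', 3: 'run'}
```

The alternating pattern in chunk 2 is the worst case for both arrays and runs, which is exactly where the fixed-size bitmap wins.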

Low-Cardinality Fields With Many Docs Per Term: PFOR With Large Blocks

Think category_id, region_code, language: fields where each term has a posting list of tens of thousands of entries. PFOR (Patched Frame of Reference) shines here because the delta-encoded gaps between doc IDs are small and uniform, and the frame of reference encoding amortizes the cost of the occasional outlier ("exceptions") across a full 128 or 256-doc block. The key tuning lever is block size. Bigger blocks mean better compression ratio but higher decompression latency for the first result, which is acceptable in batch analytics and painful for top-K search with early termination. I'd default to 128-doc blocks for interactive search and push to 256 if you're running aggregations where you're scanning entire posting lists anyway. In Lucene, Lucene912PostingsFormat already uses PFOR internally. Its postings block size is fixed inside the format (the constructor parameters control terms dictionary block sizes, not postings blocks), so changing it means writing a custom codec.
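To see why patching helps, compare plain Frame of Reference with a patched variant on one block of gaps. This Python sketch uses a simplified cost model (one byte for each exception's position plus its overflow bits, and a cap of 7 exceptions per block). Those parameters are assumptions for illustration, not Lucene's exact on-disk layout.

```python
def bits_needed(value):
    """Bits required to store a non-negative integer (at least 1)."""
    return max(1, value.bit_length())


def for_bits(gaps):
    """Plain FOR: every gap in the block uses the width of the largest gap."""
    return len(gaps) * max(bits_needed(g) for g in gaps)


def pfor_bits(gaps, max_exceptions=7, position_bits=8):
    """Patched FOR: choose a narrower width and store the outliers separately."""
    widest = max(bits_needed(g) for g in gaps)
    best = for_bits(gaps)
    for width in sorted({bits_needed(g) for g in gaps}):
        exceptions = sum(1 for g in gaps if bits_needed(g) > width)
        if exceptions > max_exceptions:
            continue  # too many outliers to patch at this width
        # Each exception stores its position plus the bits that overflowed.
        patch_cost = exceptions * (position_bits + widest - width)
        best = min(best, len(gaps) * width + patch_cost)
    return best


# One 128-gap block: mostly small gaps, two large outliers (synthetic data).
gaps = [3] * 126 + [100000] * 2
print(for_bits(gaps))   # 2176
print(pfor_bits(gaps))  # 302
```

Two outliers force plain FOR to spend 17 bits on every gap, while the patched variant keeps 2 bits per gap and pays a small fixed cost for the exceptions.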

High-Cardinality Fields: VByte With Delta, or Just Don't Index

UUIDs, session IDs, user tokens, order numbers: if every term appears in exactly one or two documents, your posting list is trivially short and PFOR's block-based approach wastes overhead. VByte with delta encoding is the right call here because there's almost nothing to encode; a one- or two-entry list has at most a single gap. But more importantly, ask yourself why you're building an inverted index on this field at all. If queries are always exact-match lookups like session_id=X, a hash index or a plain key-value store will typically beat an inverted index on that access pattern. (A BKD-tree, by contrast, is what Elasticsearch uses for numeric, date, and IP fields, not for keyword doc_values, which are stored column-wise.) Indexing UUIDs into a full inverted index is one of the most common sources of index bloat I've seen, because the term dictionary alone becomes massive.

# Elasticsearch mapping: explicitly skip the inverted index for UUID fields.
# "index": false drops the inverted index; "doc_values": true keeps the
# columnar store, so aggregations and sorting still work.
PUT /my_index
{
  "mappings": {
    "properties": {
      "session_id": {
        "type": "keyword",
        "index": false,
        "doc_values": true
      }
    }
  }
}
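For reference, the delta plus VByte scheme discussed in this section fits in a few lines of Python. The sketch below is a generic textbook version (7 payload bits per byte, high bit as a continuation flag), not Lucene's implementation, and the doc ID lists are made up.

```python
def vbyte_encode(n):
    """Encode one non-negative integer, low 7 bits first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_postings(doc_ids):
    """Delta-encode a sorted doc ID list, then VByte-encode each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += vbyte_encode(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings."""
    doc_ids, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current
            doc_ids.append(prev)
            current, shift = 0, 0
    return doc_ids


clustered = list(range(1000, 1100))   # 100 adjacent doc IDs
scattered = [5_000_000, 48_000_000]   # 2 doc IDs far apart

print(len(encode_postings(clustered)))  # 101 bytes for 100 IDs
print(len(encode_postings(scattered)))  # 8 bytes for 2 IDs
```

The clustered list costs about one byte per ID, while the scattered list costs four bytes per ID, the same as storing raw 32-bit integers.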

Write-Heavy Time-Series Data: Compress at Merge, Not at Write Time

Applying aggressive compression on every segment write in a time-series workload is how you turn a fast ingest pipeline into a bottleneck. The right pattern is a lightweight codec at write time (fast, minimal compression) and heavy compression later, when you're consolidating many small segments into large ones anyway. Stock Lucene applies one codec per IndexWriter rather than separate write-time and merge-time codecs, so in Elasticsearch the practical version is: keep the default codec while the index is hot, then let ILM's forcemerge action switch to best_compression through its index_codec option once the index rolls into the warm phase. The merged segments are what most of your reads will hit, since they cover the bulk of historical data, so that's where compression ratio actually pays off.

Read-Heavy Analytics: Pay the Merge Cost for best_compression

If your workload is heavy on aggregations, faceting, or full-scan analytics — and you're not doing real-time ingest — the best_compression codec in Elasticsearch (which uses DEFLATE instead of LZ4 for stored fields) is worth it. The merge time goes up noticeably, sometimes 2-3x for stored field segments, but you get 40-60% smaller stored field sizes depending on your document structure. Fewer bytes from disk means faster analytics even if per-document decompression is slower, because you're I/O bound, not CPU bound, on most analytics clusters. I switched a reporting index to best_compression and saw query latency drop because the OS page cache suddenly had room to hold the hot segments entirely in memory.

# Set at index creation: changing it later means rewriting existing segments
PUT /analytics_reports
{
  "settings": {
    "index": {
      "codec": "best_compression",
      "merge.policy.max_merged_segment": "5gb"
    }
  }
}

Tiny Indices Under 1GB: Leave It Alone

I'll be direct: if your entire index fits in 1GB, compression tuning will not move your metrics in any meaningful way. You're in page-cache territory — the OS is probably holding the whole thing in RAM after the first few queries. The ops complexity of custom codecs, the testing burden of validating compression behavior across Elasticsearch upgrades, the risk of codec incompatibility during rolling restarts — none of that is worth it for a small index. Default settings exist because they're reasonable across a wide range of cases, and a sub-1GB index is squarely in that range. Spend that engineering time on query structure, mapping hygiene, or shard count instead.

A Note on Tooling Ecosystem

The dirty secret of compression tuning is that most teams shouldn't be doing it manually. If your index fits comfortably under a few hundred gigabytes, the default settings in Lucene or OpenSearch are almost certainly good enough — the engineering hours you'd spend profiling block sizes and delta encoding thresholds will cost more than whatever storage you save.

Managed services have gotten genuinely good at this. Elastic Cloud runs on Lucene's adaptive BKD and postings compression by default, and OpenSearch Service on AWS similarly handles the low-level codec decisions for you. Typesense Cloud goes further — it makes almost zero knobs visible because the philosophy is that the defaults should just work. The honest trade-off: you lose the ability to, say, swap in a custom Codec implementation or tune BEST_COMPRESSION vs BEST_SPEED per field type, but you gain back the ops overhead of managing JVM heap pressure, segment merging schedules, and disk provisioning.

If you're evaluating the broader space of managed tools rather than self-hosting, our guide on Essential SaaS Tools for Small Business in 2026 covers managed search options that handle compression tuning entirely on your behalf — useful if you're weighing Typesense Cloud against Algolia or a hosted OpenSearch tier without wanting to go deep on codec internals.

DIY compression tuning earns its keep at a specific threshold: when your index is large enough that storage costs visibly exceed what an engineer costs per month to maintain the tuning work. Concretely, I've seen this flip around the 500GB–1TB range on cloud block storage. At that point, moving from the default LZ4 stored-fields compression to DEFLATE-based codecs or a custom FilterCodec wrapper can cut storage 30–45%, which on AWS gp3 at $0.08/GB-month starts being a real line item. Below that threshold, just pick a managed service, configure your analyzers correctly, and spend the brain cycles on query relevance instead.

// Only worth doing this when storage costs hurt
import org.apache.lucene.codecs.lucene99.Lucene99Codec;
import org.apache.lucene.index.IndexWriterConfig;

// `analyzer` is whatever Analyzer you already index with
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
// BEST_COMPRESSION switches stored fields from LZ4 to DEFLATE
// (~35% smaller, ~15% slower indexing throughput);
// postings keep their default FOR/PFOR encoding
iwc.setCodec(new Lucene99Codec(Lucene99Codec.Mode.BEST_COMPRESSION));

One gotcha with managed services: they often lag behind upstream releases by 6–12 months. Elastic Cloud was running Elasticsearch 8.10 for a long time after 8.12 shipped significant improvements to the ES816BinPackedIntLZ4 numerics format. If you're chasing a specific codec improvement from a recent Lucene release, self-hosting might be your only path — but that's a narrow reason to take on the ops burden.

What I'd Do Differently If Starting Over

The first thing I'd do differently is profile before touching a single codec setting. I spent two weeks reading Lucene internals, benchmarking DEFLATE vs LZ4 vs ZSTD block sizes, and convinced myself I'd found the optimal compression stack, only to run a storage audit and discover that _source was eating 70% of my index size. The entire tuning exercise had been optimizing the wrong thing. If I'd run GET /my-index/_stats/store on day one, then looked at the segment breakdown and a per-field view from the analyze index disk usage API (POST /my-index/_disk_usage?run_expensive_tasks=true), I would have disabled _source or switched to synthetic source immediately and gotten a 3x storage reduction before touching a single codec.

# This one command would have saved me two weeks
GET /my-index/_stats/store?human=true

# Then drill into segment-level storage to see where bytes actually live
GET /my-index/_segments?verbose=true

Test compression on data that actually looks like your production data. I made the classic mistake of benchmarking on a 10K document sample I'd hand-curated for "clean" schema validation. Clean, normalized data compresses beautifully — you get ratios that make every codec look great. Real production logs with variable-length user agent strings, inconsistent field population, and sparse nested objects compress completely differently. The ratio gap between toy data and real data can be 2x or worse depending on domain. For inverted index compression specifically, term dictionary size is dominated by vocabulary diversity, and a toy dataset will always have lower cardinality than prod. Measure on a realistic slice of actual data or your benchmarks are fiction.

Automate segment inspection into monitoring from day one. After a surprise AWS bill taught me this the hard way, I now ship a daily job that pulls segment stats and posts anomalies to Slack. The thing that's genuinely useful here isn't tracking total index size — it's watching the ratio of segment count to document count over time. When that ratio spikes, you're accumulating unmerged segments, which means your compression isn't being applied at the merge tier the way you expect. A simple cron hitting the cat segments API and comparing against a 7-day rolling average catches this before it becomes a cost incident.

# Dump segment metadata to JSON for your monitoring pipeline
curl -s "http://localhost:9200/_cat/segments/my-index?v&h=segment,size,size.memory,docs.count,compound&format=json" \
  | jq '.[] | select(.["docs.count"] | tonumber < 1000)' 
# Segments with fewer than 1000 docs haven't been merged yet — worth alerting on these
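The ratio check itself is a few lines once the segment rows are in hand. This Python sketch assumes rows shaped like the JSON output of the cat segments call above and a list of previously recorded daily ratios. The function names, the 2x alert factor, and the sample numbers are all hypothetical choices.

```python
def segment_ratio(segments):
    """Segments per document for one index; higher means more unmerged segments."""
    total_docs = sum(int(seg["docs.count"]) for seg in segments)
    return len(segments) / total_docs if total_docs else 0.0


def is_anomalous(today_ratio, history, factor=2.0):
    """True when today's ratio exceeds `factor` times the 7-day rolling mean."""
    window = history[-7:]
    if not window:
        return False  # no baseline yet, nothing to compare against
    baseline = sum(window) / len(window)
    return today_ratio > factor * baseline


# Hypothetical data: two 1,000-doc segments today, a quieter previous week.
today = [{"segment": "_a", "docs.count": "1000"},
         {"segment": "_b", "docs.count": "1000"}]
history = [0.0004] * 7

ratio = segment_ratio(today)
print(ratio, is_anomalous(ratio, history))
```

In a real job the history would come from wherever you persist yesterday's numbers, and the alert would go to your chat webhook instead of stdout.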

Keep a codec change log. This sounds annoyingly bureaucratic until you're six months in, debugging a query regression, and you can't remember whether the index you're looking at was built with Lucene 9.7's Lucene99Codec or the custom codec you patched in during that one late Friday. The log doesn't need to be fancy — a plain text file or a Notion table with four columns: Lucene version, codec name, index name, and date of change. Lucene's codec API doesn't surface this metadata at query time, so if you don't track it yourself, you're flying blind when performance changes after an Elasticsearch or OpenSearch upgrade. I've personally had to re-index 400GB because I couldn't reconstruct what codec was used on a legacy index and couldn't trust the compression behavior after upgrading from ES 7.x to 8.x.

  • Profile storage breakdown first: _source bloat, fielddata cache, and doc values each require different remediation than codec tuning
  • Use production-representative data — borrow a 48-hour snapshot from prod, anonymize it, and benchmark against that
  • Monitor segment counts, not just total size — unmerged segments hide the true compression behavior you'll get at steady state
  • Log codec changes with Lucene version — codec behavior changes between Lucene minor versions and you need the audit trail

Disclaimer: This article is for informational purposes only. The views and opinions expressed are those of the author(s) and do not necessarily reflect the official policy or position of Sonic Rocket or its affiliates. Always consult with a certified professional before making any financial or technical decisions based on this content.


Originally published on techdigestor.com. Follow for more developer-focused tooling reviews and productivity guides.
