DEV Community

Yu Qian Yang

Posted on • Originally published at Medium

How I Built a Runtime Adaptive Compression System That Selects the Best Algorithm Without Scanning All Data

When I was working on a columnar time-series database built on Greenplum, we faced a
problem that every time-series storage engineer eventually hits:

No single compression algorithm works best for all data.

A sensor reading temperature every second looks nothing like a financial tick stream.
A column of integer IDs compresses completely differently from a column of floating-point
measurements. And in the real world, the same column can change character over time —
steady signal for hours, then suddenly a burst of noise.

The naive answer is: let the user pick an algorithm. But that puts the burden on the user,
and in practice they almost always pick wrong — or just leave the default.

So I designed automode: a runtime analyzer that samples incoming data, detects its
statistical character, and selects the best encoding chain automatically.

Here's how it works.


The Encoding Chain

Before talking about the analyzer, I need to briefly explain what it's selecting between.

We built a set of encoding algorithms, each optimized for a specific data pattern:

  • Run-Length Encoding — all values are identical
  • Delta-of-Delta — slowly changing values (timestamps, smooth sensors)
  • Delta + ZigZag — values oscillating around a baseline
  • Bit Packing — small integers that fit in a few bits each
  • Gorilla — floating-point values with small XOR between adjacent values

The insight that made automode possible: these algorithms operate on fundamentally
different statistical features of data. If I can identify the dominant feature of a
data stream, I can map it directly to the right algorithm.


The Design Problem

The obvious approach — scan all the data and compute statistics — doesn't work in a
storage engine. The whole point of compression is to be fast. Scanning every value
before encoding defeats the purpose.

I needed a way to characterize a data stream with minimal overhead.

The solution: sample windows.

Instead of scanning the full column, I divide the input stream into fixed-size windows
and analyze only a small number of items from each window. Within each window, the
sample start position is randomized to avoid being fooled by local patterns.

Why randomize? Consider this sequence:

[1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 3 2 3 2 3 2 3 2 3]

If you always sample from the start, you'd classify this as "all equal" and pick
Run-Length Encoding. But if you look at the full stream, it's clearly an alternating
pattern — RLE would be terrible. Random sampling within the window gives a much more
representative picture.
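A minimal sketch of this windowed sampling, in Python. The window size, sample size, and function names are illustrative choices, not the production values; the key property is that each window contributes a contiguous sample whose start offset is randomized:

```python
import random

def sample_windows(values, window_size=256, samples_per_window=8, seed=None):
    """Draw a small contiguous sample from each fixed-size window.

    The sample's start offset inside each window is randomized, so a
    locally uniform prefix (e.g. a long run of identical values) cannot
    masquerade as the character of the whole stream.
    """
    rng = random.Random(seed)
    samples = []
    for start in range(0, len(values), window_size):
        window = values[start:start + window_size]
        if len(window) <= samples_per_window:
            samples.append(window)  # tail window: take it whole
            continue
        offset = rng.randrange(len(window) - samples_per_window + 1)
        samples.append(window[offset:offset + samples_per_window])
    return samples
```

Each window is analyzed independently, so a stream that changes character mid-way (steady, then noisy) produces windows with different feature profiles rather than one misleading average.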


The Five Features

Each sampled window is analyzed for five statistical features:

  • All Equal — all sampled values are identical
  • Values bitwise small — the values themselves fit in few bits
  • 1st-order delta small — differences between adjacent values are small
  • 2nd-order delta small — differences between differences are small (delta-of-delta)
  • Float XOR small — XOR of adjacent float values is small (Gorilla pattern)

For each window, the analyzer computes 0th, 1st, and 2nd order differentials, plus
XOR pairs for float data. "Small" is defined bitwise: a value is small if its
significant bit length is below a configurable threshold. This makes the check
robust against occasional spike values — a single outlier doesn't ruin the
classification.
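The feature pass can be sketched as follows. The 8-bit threshold and the 90% majority rule are illustrative assumptions standing in for the configurable production thresholds; the XOR feature uses the raw IEEE 754 bit pattern, as Gorilla does:

```python
import struct

def bit_length(v):
    """Significant bit length of a (possibly negative) integer."""
    return abs(v).bit_length()

def float_bits(x):
    """Raw IEEE 754 bit pattern of a double, as an unsigned int."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def window_features(sample, small_bits=8):
    """Score one sampled window against the statistical features.

    A quantity is "small" when its significant bit length is below
    `small_bits`, and a feature holds when at least 90% of items
    qualify -- so a lone spike cannot flip the classification.
    """
    is_float = any(isinstance(v, float) for v in sample)

    def mostly_small(xs):
        if not xs:
            return False
        return sum(bit_length(x) < small_bits for x in xs) / len(xs) >= 0.9

    features = {"all_equal": len(set(sample)) == 1}
    if is_float:
        # Gorilla pattern: XOR of adjacent float bit patterns is small
        xors = [float_bits(a) ^ float_bits(b) for a, b in zip(sample, sample[1:])]
        features["xor_small"] = mostly_small(xors)
    else:
        d1 = [b - a for a, b in zip(sample, sample[1:])]  # 1st-order delta
        d2 = [b - a for a, b in zip(d1, d1[1:])]          # 2nd-order delta
        features["values_small"] = mostly_small(sample)
        features["delta1_small"] = mostly_small(d1)
        features["delta2_small"] = mostly_small(d2)
    return features
```

For example, a window of IDs around 1000 that increase by 1 fails "values bitwise small" (each value needs 10 bits) but passes both delta features, steering the stream toward a delta-based encoding.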


Strong vs. Weak Features

Not all features are equal. I introduced a Strong/Weak classification:

  • Strongest: All Equal → Run-Length Encoding
  • Strong: 2nd-order delta small → Delta-of-Delta
  • Strong: Float XOR small → Gorilla
  • Weak: Values bitwise small → Bit Packing
  • Weak: 1st-order delta small → Delta + ZigZag

Why does this matter? Consider a case where Delta-of-Delta and Bit Packing score
similarly. Delta-of-Delta is fundamentally better for smoothly varying time-series
data, even with a slightly lower score. The weight system ensures strong methods
are preferred when they're competitive.
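One way to realize that preference is a weighted vote, sketched below. The weight values and algorithm names here are illustrative assumptions; the point is that a strong feature wins whenever its raw score is merely competitive with a weak one:

```python
# Illustrative weights: strong features get a multiplier so a strong
# method beats a weak one whose raw score is only slightly higher.
FEATURE_WEIGHTS = {
    "all_equal":     ("rle",            3.0),  # strongest
    "delta2_small":  ("delta_of_delta", 2.0),  # strong
    "xor_small":     ("gorilla",        2.0),  # strong
    "values_small":  ("bit_packing",    1.0),  # weak
    "delta1_small":  ("delta_zigzag",   1.0),  # weak
}

def pick_algorithm(scores):
    """scores: feature -> fraction of sample windows exhibiting it.

    Returns the algorithm whose weighted score is highest, or the
    general-purpose fallback when no feature fires at all.
    """
    best, best_weighted = "general", 0.0
    for feature, (algo, weight) in FEATURE_WEIGHTS.items():
        weighted = scores.get(feature, 0.0) * weight
        if weighted > best_weighted:
            best, best_weighted = algo, weighted
    return best
```

With these weights, a Delta-of-Delta score of 0.7 (weighted 1.4) beats a Bit Packing score of 0.9: exactly the tiebreak described above.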


One Edge Case Worth Noting

The alternating pattern [A B A B A B] is a trap.

It looks like "values bitwise small", but Bit Packing performs poorly on alternating
sequences. A general-purpose compressor actually handles this pattern better.

I added an explicit check: if odd-indexed and even-indexed items within the sample
are each internally equal, it's an alternating pattern — skip the Bit Packing
classification and fall back to general compression.
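The check itself is tiny. A minimal version of the odd/even test described above:

```python
def is_alternating(sample):
    """Detect the [A B A B ...] trap: odd-indexed and even-indexed
    items are each internally equal, but the two groups differ."""
    evens, odds = set(sample[0::2]), set(sample[1::2])
    return len(evens) == 1 and len(odds) == 1 and evens != odds
```

Note the `evens != odds` guard: without it, a constant stream [A A A A] would also match, and that case belongs to the far stronger All Equal feature.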

This kind of edge case only appears when you test against real production data. In our
case, it showed up in sensor data from a deployment at a large financial institution.


Feature → Algorithm Mapping

Once features are aggregated across all sample windows, the dominant feature maps to
an algorithm:

  • All Equal → Run-Length Encoding
  • 2nd-order delta small → Delta-of-Delta
  • Values bitwise small → Bit Packing
  • 1st-order delta small → Delta + ZigZag
  • Float XOR small → Gorilla
  • No clear pattern → General Compression (fallback)

For floating-point columns, the encoding chain adds a float-to-integer conversion pass
first — converting floats that happen to be integers (like 1.0, 2.0) into actual
integers before passing to the integer encoding path. This is worth a separate article.
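The idea of that conversion pass can be sketched in a few lines. This Python version uses `float.is_integer()` for clarity; the production check described in Part 2 works on IEEE 754 bit patterns instead, and the function name here is hypothetical:

```python
def try_float_to_int(values):
    """If every float is exactly an integer (1.0, 2.0, ...), return the
    column as integers so it can take the integer encoding path.
    Otherwise return None and keep the float (Gorilla) path."""
    out = []
    for v in values:
        if not float(v).is_integer():  # exact IEEE 754 test, no epsilon
            return None
        out.append(int(v))
    return out
```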


Two Modes

The system supports two optimization priorities:

  • Compression-rate mode: maximize storage efficiency
  • Speed mode: maximize decompression speed

In practice, compression-rate mode is used for cold storage and analytics workloads,
and speed mode for hot data that's frequently queried. The same feature detection runs
in both cases — only the final algorithm selection weights differ.


Results

The system runs in production at scale, including an installation at a major
financial institution managing thousands of petabytes of time-series data. In
testing, automode consistently matched or outperformed manually tuned
single-algorithm compression across diverse real-world datasets.

The key insight that made it work: you don't need to scan all the data to understand
its character. A few dozen samples, taken randomly from fixed-size windows, are
enough to make a reliable classification — as long as you handle the edge cases.


What's Next

This article is part of a series on compression engineering in time-series databases:

  • Part 2: Float-to-integer encoding — how I used IEEE 754 bit manipulation to detect convertibility with near-zero overhead
  • Part 3: Chained encoding — a unified float compression pipeline
  • Part 4: An improved floating-point compression algorithm based on ELF

I'm currently available for freelance work on backend systems, storage engineering,
and systems integration. Feel free to reach out.
