When I was working on a columnar time-series database built on Greenplum, we faced a
problem that every time-series storage engineer eventually hits:
No single compression algorithm works best for all data.
A sensor reading temperature every second looks nothing like a financial tick stream.
A column of integer IDs compresses completely differently from a column of floating-point
measurements. And in the real world, the same column can change character over time —
steady signal for hours, then suddenly a burst of noise.
The naive answer is: let the user pick an algorithm. But that puts the burden on the user,
and in practice they almost always pick wrong — or just leave the default.
So I designed automode: a runtime analyzer that samples incoming data, detects its
statistical character, and selects the best encoding chain automatically.
Here's how it works.
The Encoding Chain
Before talking about the analyzer, I need to briefly explain what it's selecting between.
We built a set of encoding algorithms, each optimized for a specific data pattern:
- Run-Length Encoding — all values are identical
- Delta-of-Delta — slowly changing values (timestamps, smooth sensors)
- Delta + ZigZag — values oscillating around a baseline
- Bit Packing — integers small enough to fit in a few bits each
- Gorilla — floating-point values with small XOR between adjacent values
The insight that made automode possible: these algorithms operate on fundamentally
different statistical features of data. If I can identify the dominant feature of a
data stream, I can map it directly to the right algorithm.
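To make two of these patterns concrete, here is a minimal Python sketch (with illustrative helper names, not the engine's actual API) of the transforms behind Delta-of-Delta and Delta + ZigZag. Regularly spaced timestamps collapse to zeros under second-order differencing, and ZigZag keeps small negative deltas small after the transform.

```python
def deltas(values):
    # 1st-order differences: [v1 - v0, v2 - v1, ...]
    return [b - a for a, b in zip(values, values[1:])]

def zigzag(n):
    # Map signed to unsigned so small magnitudes stay small
    # (64-bit convention): 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    return (n << 1) ^ (n >> 63)

# Regularly spaced timestamps: 2nd-order deltas are all zero,
# the ideal input for Delta-of-Delta.
timestamps = [1000, 1010, 1020, 1030, 1040]
dd = deltas(deltas(timestamps))   # -> [0, 0, 0]

# Oscillation around a baseline: deltas are small signed values,
# and zigzag turns them into small unsigned values for bit packing.
readings = [100, 101, 99, 102, 98]
encoded = [zigzag(d) for d in deltas(readings)]   # -> [2, 3, 6, 7]
```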
The Design Problem
The obvious approach — scan all the data and compute statistics — doesn't work in a
storage engine. The whole point of compression is to be fast. Scanning every value
before encoding defeats the purpose.
I needed a way to characterize a data stream with minimal overhead.
The solution: sample windows.
Instead of scanning the full column, I divide the input stream into fixed-size windows
and analyze only a small number of items from each window. Within each window, the
sample start position is randomized to avoid being fooled by local patterns.
Why randomize? Consider this sequence:
[1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 3 2 3 2 3 2 3 2 3]
If you always sample from the start, you'd classify this as "all equal" and pick
Run-Length Encoding. But if you look at the full stream, it's clearly an alternating
pattern — RLE would be terrible. Random sampling within the window gives a much more
representative picture.
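A minimal sketch of the idea, assuming a contiguous sample taken at a random offset within each window (the production analyzer's exact sampling scheme may differ):

```python
import random

def sample_window(window, sample_size, rng=random):
    # Randomize the sample's start offset inside the window so a uniform
    # prefix can't masquerade as the character of the whole window.
    start = rng.randrange(0, len(window) - sample_size + 1)
    return window[start:start + sample_size]

stream = [1] * 12 + [2, 3] * 6
# A fixed start-of-window sample sees only the run of 1s; a randomized
# start has a good chance of landing in the alternating tail.
fixed = stream[:4]                      # -> [1, 1, 1, 1]
randomized = sample_window(stream, 4)   # may catch [2, 3, 2, 3]
```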
The Five Features
Each sampled window is analyzed for five statistical features:
- All Equal — all sampled values are identical
- Values bitwise small — the values themselves fit in few bits
- 1st-order delta small — differences between adjacent values are small
- 2nd-order delta small — differences between differences are small (delta-of-delta)
- Float XOR small — XOR of adjacent float values is small (Gorilla pattern)
For each window, the analyzer computes 0th, 1st, and 2nd order differentials, plus
XOR pairs for float data. "Small" is defined bitwise: a value is small if its
significant bit length is below a configurable threshold. This makes the check
robust against occasional spike values — a single outlier doesn't ruin the
classification.
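As a sketch of that analysis in Python: feature names, the 10% spike tolerance, and the strict `<` comparison are all assumptions for illustration, and the Gorilla-style "small XOR" check counts the meaningful bits between the leading and trailing zeros of the XOR word.

```python
import struct

def bits(n):
    # Significant bit length of an integer magnitude (0 -> 0 bits).
    return abs(n).bit_length()

def mostly(flags, tolerance=0.1):
    # Robust to occasional spikes: a small fraction of items may fail
    # the per-value check without ruining the feature.
    return sum(flags) >= len(flags) * (1 - tolerance)

def int_features(sample, threshold):
    # 0th-, 1st-, and 2nd-order differentials of one sampled window.
    d1 = [b - a for a, b in zip(sample, sample[1:])]
    d2 = [b - a for a, b in zip(d1, d1[1:])]
    return {
        "all_equal":   len(set(sample)) == 1,
        "bits_small":  mostly([bits(v) < threshold for v in sample]),
        "delta_small": mostly([bits(v) < threshold for v in d1]),
        "dd_small":    mostly([bits(v) < threshold for v in d2]),
    }

def meaningful_bits(x):
    # Bits between the leading and trailing zeros of the XOR word.
    if x == 0:
        return 0
    trailing = (x & -x).bit_length() - 1
    return x.bit_length() - trailing

def xor_small(sample, threshold):
    # Reinterpret each double as its raw IEEE 754 bits, then XOR pairs.
    raw = [struct.unpack("<Q", struct.pack("<d", v))[0] for v in sample]
    return mostly([meaningful_bits(a ^ b) < threshold
                   for a, b in zip(raw, raw[1:])])
```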
Strong vs. Weak Features
Not all features are equal. I introduced a Strong/Weak classification:
- Strongest: All Equal → Run-Length Encoding
- Strong: 2nd-order delta small → Delta-of-Delta
- Strong: Float XOR small → Gorilla
- Weak: Values bitwise small → Bit Packing
- Weak: 1st-order delta small → Delta + ZigZag
Why does this matter? Consider a case where Delta-of-Delta and Bit Packing score
similarly. Delta-of-Delta is fundamentally better for smoothly varying time-series
data, even with a slightly lower score. The weight system ensures strong methods
are preferred when they're competitive.
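The mechanism can be sketched like this; the weight values are invented for illustration, since the article only fixes the ordering (strongest > strong > weak), not the numbers.

```python
# Hypothetical weights: only the ordering matters here.
FEATURE_WEIGHT = {
    "all_equal":   3.0,  # strongest
    "dd_small":    2.0,  # strong
    "xor_small":   2.0,  # strong
    "bits_small":  1.0,  # weak
    "delta_small": 1.0,  # weak
}

def dominant_feature(hit_rates):
    # hit_rates: feature -> fraction of sample windows where it held.
    # Weighting lets a strong feature win despite a slightly lower score.
    return max(hit_rates, key=lambda f: hit_rates[f] * FEATURE_WEIGHT[f])
```

With Delta-of-Delta at 0.8 and Bit Packing at 0.9, the weighted scores are 1.6 vs. 0.9, so the strong feature wins.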
One Edge Case Worth Noting
The alternating pattern [A B A B A B] is a trap.
It looks like "values bitwise small", but Bit Packing performs poorly on alternating
sequences. A general-purpose compressor actually handles this pattern better.
I added an explicit check: if odd-indexed and even-indexed items within the sample
are each internally equal, it's an alternating pattern — skip the Bit Packing
classification and fall back to general compression.
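The check itself is small; a sketch (assuming a sample of at least two items):

```python
def is_alternating(sample):
    # [A, B, A, B, ...]: even-indexed items all equal, odd-indexed items
    # all equal, and the two groups differ.
    evens, odds = sample[0::2], sample[1::2]
    return (len(set(evens)) == 1 and len(set(odds)) == 1
            and evens[0] != odds[0])
```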
This kind of edge case only appears when you test against real production data. In our
case, it showed up in sensor data from a deployment at a large financial institution.
Feature → Algorithm Mapping
Once features are aggregated across all sample windows, the dominant feature maps to
an algorithm:
- All Equal → Run-Length Encoding
- 2nd-order delta small → Delta-of-Delta
- Values bitwise small → Bit Packing
- 1st-order delta small → Delta + ZigZag
- Float XOR small → Gorilla
- No clear pattern → General Compression (fallback)
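Aggregation plus mapping might look like the following sketch. The feature and encoder names are hypothetical, the 0.8 hit-rate threshold is an assumed parameter, and the strong/weak weighting is omitted here for brevity.

```python
FEATURE_TO_ALGO = {
    "all_equal":   "RunLength",
    "dd_small":    "DeltaOfDelta",
    "bits_small":  "BitPacking",
    "delta_small": "DeltaZigZag",
    "xor_small":   "Gorilla",
}

def choose_encoding(window_features, min_hit_rate=0.8):
    # window_features: one feature dict per sample window. A feature must
    # hold in enough windows to count as the stream's dominant character;
    # otherwise fall back to general compression.
    n = len(window_features)
    rates = {f: sum(w.get(f, False) for w in window_features) / n
             for f in FEATURE_TO_ALGO}
    best = max(rates, key=rates.get)
    return FEATURE_TO_ALGO[best] if rates[best] >= min_hit_rate else "General"
```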
For floating-point columns, the encoding chain adds a float-to-integer conversion pass
first — converting floats that happen to be integers (like 1.0, 2.0) into actual
integers before passing to the integer encoding path. This is worth a separate article.
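The essence of that pass, sketched with a plain arithmetic round-trip rather than the IEEE 754 bit manipulation the production version uses (and assuming finite inputs):

```python
def as_exact_int(x):
    # A float like 1.0 or 2.0 round-trips through int without loss;
    # 2.5 does not, so it stays on the float path.
    i = int(x)
    return i if float(i) == x else None

values = [1.0, 2.0, 3.0]
ints = [as_exact_int(v) for v in values]   # -> [1, 2, 3]
```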
Two Modes
The system supports two optimization priorities:
- Compression-rate mode: maximize storage efficiency
- Speed mode: maximize decompression speed
In practice, compression-rate mode is used for cold storage and analytics workloads,
and speed mode for hot data that's frequently queried. The same feature detection runs
in both cases — only the final algorithm selection weights differ.
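One way to realize "same features, different weights" is a per-mode weight table; the numbers below are invented purely to show the mechanism, not the engine's actual values.

```python
# Illustrative only: the actual weight tables are not published.
MODE_WEIGHTS = {
    "rate":  {"RunLength": 3.0, "DeltaOfDelta": 2.0, "Gorilla": 2.0,
              "BitPacking": 1.0, "DeltaZigZag": 1.0},
    "speed": {"RunLength": 3.0, "BitPacking": 2.0, "DeltaZigZag": 2.0,
              "DeltaOfDelta": 1.0, "Gorilla": 1.0},
}

def weighted_pick(scores, mode):
    # scores: algorithm -> raw suitability from feature detection.
    # The same scores can yield different winners per mode.
    w = MODE_WEIGHTS[mode]
    return max(scores, key=lambda algo: scores[algo] * w[algo])
```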
Results
The system is deployed in production at scale, including a large installation at a
major financial institution managing thousands of petabytes of time-series data. In
testing, automode consistently matched or outperformed manually tuned single-algorithm
compression across diverse real-world datasets.
The key insight that made it work: you don't need to scan all the data to understand
its character. A few dozen samples, taken randomly from fixed-size windows, are
enough to make a reliable classification — as long as you handle the edge cases.
What's Next
This article is part of a series on compression engineering in time-series databases:
- Part 2: Float-to-integer encoding — how I used IEEE 754 bit manipulation to detect convertibility with near-zero overhead
- Part 3: Chained encoding — a unified float compression pipeline
- Part 4: An improved floating-point compression algorithm based on ELF
I'm currently available for freelance work on backend systems, storage engineering,
and systems integration. Feel free to reach out.