In a columnar time-series database, one of the most effective compression tricks is
deceptively simple: if a float value is actually an integer, store it as one.
Why Integers Compress Better Than Floats
Integer compression algorithms like Delta-of-Delta, ZigZag, and Simple8b work by
exploiting predictable bit patterns — small deltas between adjacent values, values
that fit in fewer than 64 bits, and so on. They can pack multiple values into a
single 64-bit word.
Floats don't cooperate with these schemes. Even 1.0 and 2.0 have completely
different IEEE 754 bit representations (0x3FF0000000000000 and 0x4000000000000000).
Their XOR is large, their delta is meaningless as an integer, and bit-packing is useless.
So when a column is declared as FLOAT but actually contains values like 12.0,
18.0, 25.0 — which happens more often than you'd expect, either because the schema
was designed generically or because the upstream system always emits .0 values — you're
leaving significant compression headroom on the table.
The fix: detect these integer-valued floats at encode time, convert them losslessly
to integers, and route them through the integer compression path.
A temperature sensor that reports 21.0, 21.5, 22.0 is a good example. Multiply
by 10 and you get 210, 215, 220 — plain integers with small, predictable deltas.
Delta-of-Delta or Simple8b will compress these far more efficiently than any
float-specific scheme.
The challenge: before converting, you need to check whether the scaled value can be
losslessly represented as an integer. The naive check — std::isnan + range comparison —
works but it's slower than it needs to be on the hot encoding path.
Here's the faster approach I implemented, using nothing but bit manipulation.
The Setup: Scaling Floats to Integers
The encoding scheme works in two steps:
- Scale: multiply the float by 10^scale (configurable per column)
- Convert: cast the scaled value to an integer using std::lround
For example, with scale = 2:
- 1.23 → 1.23 * 100 = 123.0 → 123
- 45.678 → 45.678 * 100 = 4567.8 → overflow risk or precision loss
Step 2 only makes sense if the scaled value actually fits in the target integer type.
That's the overflow check.
The Overflow Check
The function takes a pointer to the raw float bytes and the target integer width in bytes.
It returns non-zero if the value would overflow.
The check runs before every conversion; if it fires, the encoder skips the integer path and falls back to float encoding.
The key insight: you can determine whether a float overflows a given integer type
purely from the float's exponent bits, without doing any arithmetic.
Here's why.
IEEE 754 in One Paragraph
A double-precision float is stored as 64 bits:
[ sign: 1 bit ][ exponent: 11 bits ][ fraction: 52 bits ]
The value is 1.fraction × 2^(exponent − 1023) for normal numbers. The 1023 is the bias; it lets the unsigned 11-bit exponent field represent negative exponents. The real exponent is stored_exponent − 1023. (Subnormals have an all-zero exponent field, which decodes to a hugely negative real exponent, so the overflow check below handles them correctly for free.)
For 32-bit floats: 8 exponent bits, bias 127, fraction 23 bits.
Extracting the Exponent
For a double:
uint64_t bits;
memcpy(&bits, src, 8); /* safe type-pun, no UB */
int16_t real_exp = (int16_t)((bits >> 52) & 0x07ff) - 1023;
Step by step:
- memcpy into a uint64_t reinterprets the 8 bytes as a 64-bit integer (no arithmetic, just bits)
- >> 52 shifts right past the 52 fraction bits, bringing the exponent to the low end
- & 0x07ff masks off the sign bit, keeping only the 11 exponent bits
- - 1023 subtracts the bias to give the real exponent
For a float:
uint32_t bits;
memcpy(&bits, src, 4);
int16_t real_exp = (int16_t)((bits >> 23) & 0xff) - 127;
Same logic: shift past 23 fraction bits, mask 8 exponent bits, subtract bias 127.
The Overflow Condition
Once you have the real exponent, the overflow check is one comparison:
is_overflow = real_exp > int_bytes * 8 - 2;
Where does - 2 come from?
- −1 for the sign bit: a signed integer of N bits can hold values only up to 2^(N−1) − 1
- −1 for the implicit leading 1: in IEEE 754, the fraction is 1.fraction, not 0.fraction
So a float with real exponent E has a magnitude in [2^E, 2^(E+1)); its integer part needs E + 1 bits. For it to fit in a signed N-bit integer, you need E + 1 ≤ N − 1, which simplifies to E ≤ N − 2.
Full implementation in C:
#include <stdint.h>
#include <string.h>
/* Returns 1 if the double at src overflows a signed integer of int_bytes bytes. */
static inline int double_overflow_check(const char *src, int int_bytes)
{
uint64_t bits;
memcpy(&bits, src, 8);
int16_t real_exp = (int16_t)((bits >> 52) & 0x07ff) - 1023;
return real_exp > int_bytes * 8 - 2;
}
/* Returns 1 if the float at src overflows a signed integer of int_bytes bytes. */
static inline int float_overflow_check(const char *src, int int_bytes)
{
uint32_t bits;
memcpy(&bits, src, 4);
int16_t real_exp = (int16_t)((bits >> 23) & 0xff) - 127;
return real_exp > int_bytes * 8 - 2;
}
Total cost: one memcpy, one shift, one AND, one subtract, one compare.
No floating-point arithmetic, no branches on the value itself.
How It Fits into the Encoder
The encoder scales the value first, then calls the overflow check on the scaled result:
double scaled = orig * scaler; /* scale: e.g. orig * 100.0 */
if (double_overflow_check((char *)&scaled, sizeof(int64_t)))
return ENCODE_OVERFLOW; /* fall back to float encoding */
int64_t result = llround(scaled); /* safe: overflow already ruled out */
The scale factor is stored in the column header so the decoder can reverse the
operation: decoded = (double)stored_integer / pow(10, scale).
Why Not Just Use std::isnan + Range Check?
The conventional approach:
if (std::isnan(value)) return false;
if (value > INT64_MAX || value < INT64_MIN) return false;
return true;
This involves floating-point comparisons, which on many architectures require the value to be loaded into a float register before comparison. It is also subtly wrong at the boundary: INT64_MAX is not exactly representable as a double, so the comparison quietly rounds it to 2^63. On a hot encoding path processing millions of values, the difference adds up.
The bit manipulation approach operates entirely on integer registers. The float's bytes are reinterpreted as an integer; no floating-point unit is involved until the final std::lround conversion, which only happens once overflow has been ruled out. As a bonus, NaN and infinity carry an all-ones exponent field, so the exponent comparison rejects them automatically, with no separate isnan test.
What This Enables
This check is the entry gate for the full encoding chain:
float column
↓
check_float_overflow ← this article
↓ (passes)
float → integer cast
↓
Delta+ZigZag encoding
↓
Simple8b bit-packing
Without a cheap overflow gate, the chain can't run on untrusted float data. With it,
each value costs one check before entering the integer compression path — which can
achieve far better compression ratios than float-specific schemes on "integer-like"
time-series data.
What's Next
This article is part of a series on compression engineering in time-series databases:
- Part 1: Runtime adaptive compression — how the system selects the best algorithm without scanning all data (published)
- Part 3: Chained encoding — the full float-to-integer → Delta+ZigZag → Simple8b pipeline
- Part 4: An improved floating-point compression algorithm based on ELF
I'm currently available for freelance work on backend systems, storage engineering,
and systems integration. Feel free to reach out.