David MARTIN

"Shannon Was Right, But We Can Be Smarter: How ALEC Achieves 22x Compression on IoT Data"

The Problem That Shouldn't Exist

I was working on an IoT project—temperature sensors reporting values every second. Simple stuff:

22.5, 22.5, 22.6, 22.5, 22.5, 22.6, 22.6, 22.5, ...

Each value takes 8 bytes as a float64. That's 691 KB per sensor per day.

"No problem," I thought. "Let's compress it with gzip."

Result: ~400 KB. Only a 42% reduction (about 1.7x).

Wait, what? The data is obviously redundant. A human can see the pattern instantly. Why can't gzip?

Shannon Was Right (Of Course)

Claude Shannon proved in 1948 that you can't compress data below its entropy. Period. No exceptions.

But here's the key insight Shannon himself noted: entropy depends on your model of the data.

If gzip treats 22.5, 22.5, 22.6 as arbitrary bytes, it sees one entropy. But if we model it as "temperature sensor with 0.1°C precision, typically stable", the entropy is much lower.

The trick isn't violating Shannon's theorem. It's building a better model.
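
To make that concrete, here's a back-of-the-envelope sketch (not ALEC code, just an illustration using a made-up slowly drifting sensor): estimate the empirical entropy of the same stream under two models, raw quantized readings versus deltas.

use std::collections::HashMap;

// Empirical Shannon entropy (bits per symbol) of a stream of quantized readings.
fn entropy(symbols: &[i64]) -> f64 {
    let mut counts: HashMap<i64, usize> = HashMap::new();
    for &s in symbols {
        *counts.entry(s).or_insert(0) += 1;
    }
    let n = symbols.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // Made-up sensor drifting slowly from 22.0°C to 22.9°C in 0.1°C steps
    // (quantized: 22.5°C -> 225).
    let readings: Vec<i64> = (0..100i64).map(|i| 220 + i / 10).collect();

    // Model A: each reading is an independent symbol.
    let h_raw = entropy(&readings);

    // Model B: each delta from the previous reading is a symbol.
    let deltas: Vec<i64> = readings.windows(2).map(|w| w[1] - w[0]).collect();
    let h_delta = entropy(&deltas);

    println!("raw entropy:   {h_raw:.2} bits/symbol");
    println!("delta entropy: {h_delta:.2} bits/symbol");
}

Same data, two models, two very different entropies. The delta model's entropy is a fraction of the raw model's, which is exactly the gap a domain-aware compressor gets to exploit.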

The IoT Data Model

Real IoT sensor data has properties that generic compressors ignore:

  1. Temporal stability - Values change slowly
  2. Predictability - Next value ≈ current value
  3. Bounded range - Temperature won't jump from 22°C to 500°C
  4. Quantization - Sensors have finite precision (0.1°C)

An intelligent compressor that exploits these properties can dramatically outperform generic alternatives.

Enter ALEC

I built ALEC (Adaptive Lazy Evolving Compression) specifically for IoT data. The core ideas:

1. Delta Encoding

Instead of transmitting 22.5, transmit +0.0 (delta from last value).

Raw:    22.5 → 22.5 → 22.6 → 22.5
Delta:  22.5 →  0.0 → +0.1 → -0.1

Zero delta? That's 2 bits instead of 64.
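
Here's a minimal sketch of the idea: quantize to 0.1°C (the precision assumed in the examples above) and store only differences. This is an illustration, not ALEC's actual wire format.

/// Quantize readings to 0.1°C steps and emit deltas (first value sent as-is).
fn delta_encode(readings: &[f64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(readings.len());
    let mut prev = 0i64;
    for (i, &r) in readings.iter().enumerate() {
        let q = (r * 10.0).round() as i64; // 22.5 -> 225
        out.push(if i == 0 { q } else { q - prev });
        prev = q;
    }
    out
}

/// Reverse the transform: running sum, then scale back to °C.
fn delta_decode(deltas: &[i64]) -> Vec<f64> {
    let mut out = Vec::with_capacity(deltas.len());
    let mut acc = 0i64;
    for &d in deltas {
        acc += d;
        out.push(acc as f64 / 10.0);
    }
    out
}

fn main() {
    let readings = [22.5, 22.5, 22.6, 22.5];
    let deltas = delta_encode(&readings);
    assert_eq!(deltas, vec![225, 0, 1, -1]); // 22.5 -> 0.0 -> +0.1 -> -0.1
    assert_eq!(delta_decode(&deltas), readings);
}

Most deltas are 0 or ±1, so a variable-length code can spend a couple of bits on them instead of 64.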

2. Pattern Dictionary

Frequent values get short codes. After observing 22.5 appear 1000 times, it gets a 4-bit code instead of 64 bits.
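
ALEC's internal code assignment isn't shown in this post, but the gist is a frequency-ranked dictionary: count how often each symbol appears and give the most frequent ones the lowest (shortest) code indices. A toy sketch of that ranking (the names and logic here are mine, not ALEC's):

use std::collections::HashMap;

/// Track symbol frequencies; the most common symbols get the shortest codes.
/// The ranking below is a toy, O(n log n) per lookup, purely for illustration.
struct PatternDict {
    counts: HashMap<i64, u64>,
}

impl PatternDict {
    fn new() -> Self {
        Self { counts: HashMap::new() }
    }

    fn observe(&mut self, symbol: i64) {
        *self.counts.entry(symbol).or_insert(0) += 1;
    }

    /// Rank of a symbol by frequency: rank 0 maps to the shortest code.
    fn code_index(&self, symbol: i64) -> Option<usize> {
        let mut ranked: Vec<(&i64, &u64)> = self.counts.iter().collect();
        ranked.sort_by(|a, b| b.1.cmp(a.1).then(a.0.cmp(b.0)));
        ranked.iter().position(|(s, _)| **s == symbol)
    }
}

fn main() {
    let mut dict = PatternDict::new();
    for _ in 0..1000 { dict.observe(225); } // 22.5°C seen 1000 times
    for _ in 0..10 { dict.observe(230); }   // 23.0°C seen rarely
    assert_eq!(dict.code_index(225), Some(0)); // most frequent -> shortest code
}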

3. Evolving Context

Encoder and decoder maintain synchronized context that improves over time:

Week 1:  "temperature=22.3°C" → 20 bytes
Week 4:  [code_7][+0.3]       → 3 bytes
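
The crucial part is that nothing about the dictionary itself crosses the wire: both ends apply the same deterministic update after every message, so their state stays identical. A toy illustration of that synchronization invariant (my own SharedContext, not ALEC's type):

use std::collections::HashMap;

/// Both sides keep their own copy of this state and apply the same update
/// after every message, so code assignments never have to be transmitted.
#[derive(Default, PartialEq, Debug)]
struct SharedContext {
    counts: HashMap<i64, u64>,
}

impl SharedContext {
    fn observe(&mut self, symbol: i64) {
        *self.counts.entry(symbol).or_insert(0) += 1;
    }
}

fn main() {
    let mut encoder_ctx = SharedContext::default();
    let mut decoder_ctx = SharedContext::default();

    for symbol in [225, 225, 226, 225] {
        // Encoder: pick a code from encoder_ctx, send it, then update.
        encoder_ctx.observe(symbol);
        // Decoder: decode with its own copy, then apply the identical update.
        decoder_ctx.observe(symbol);
    }

    // Deterministic, identical updates keep the two copies in lock-step.
    assert_eq!(encoder_ctx, decoder_ctx);
}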

The Benchmark Results

I ran ALEC against gzip on real IoT data patterns. Here's what happened:

On Variable Data (SmartGrid current sensor, only 8.7% unchanged readings)

Condition      | ALEC   | gzip  | ALEC Advantage
---------------|--------|-------|---------------
Cold start     | 10.9x  | 5.1x  | +113%
With preload   | 22.1x  | 8.0x  | +177%

[Chart: ALEC vs gzip comparison]

The Warmup Curve

Here's where it gets interesting. ALEC dominates at every sample count:

Samples    | gzip   | ALEC
-----------|--------|-------
10         | 0.9x   | 8.0x   ← ALEC wins immediately
100        | 2.4x   | 9.2x
1000       | 6.3x   | 11.4x
5000       | 7.4x   | 18.2x
8640       | 8.0x   | 22.0x

At 10 samples, gzip can't compress at all (0.9x = expansion!). ALEC achieves 8x because it understands the data model from the start.

The Preload Secret

The key insight: preload eliminates warmup cost.

In production, you:

  1. Generate a preload file from historical data
  2. Ship identical preload to encoder and decoder
  3. Achieve near-optimal compression from byte one

Without preload, ALEC still beats gzip. With preload, it's not even close.
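
Here's a rough sketch of that workflow, reusing the same kind of toy frequency context as above. ALEC's real preload file format isn't described in this post, so treat the details as placeholders:

use std::collections::HashMap;

// Toy stand-in for a learned context; only the workflow is the point here.
#[derive(Clone, Default)]
struct Context {
    counts: HashMap<i64, u64>,
}

impl Context {
    fn observe(&mut self, symbol: i64) {
        *self.counts.entry(symbol).or_insert(0) += 1;
    }
}

/// Step 1: build a preload context offline from historical readings.
fn build_preload(history: &[i64]) -> Context {
    let mut ctx = Context::default();
    for &s in history {
        ctx.observe(s);
    }
    ctx
}

fn main() {
    // Historical data from the same sensor fleet (values are made up).
    let history: Vec<i64> = (0..8640i64).map(|i| 220 + (i % 10)).collect();

    // Step 2: ship the *identical* snapshot to both ends.
    let preload = build_preload(&history);
    let encoder_ctx = preload.clone();
    let decoder_ctx = preload.clone();

    // Step 3: both sides start with the same warm statistics, so the very
    // first message already gets short codes.
    assert_eq!(encoder_ctx.counts, decoder_ctx.counts);
}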

When NOT to Use ALEC

ALEC isn't magic. It won't help with:

  • Random data - No patterns to learn
  • Very short transmissions - <100 samples, warmup dominates
  • Constant data - Trivially compressed by any codec

The sweet spot: long-running IoT streams with predictable patterns.

Try It Yourself

ALEC is open source (AGPL-3.0) and available on crates.io:

use alec::{Encoder, Decoder, Context};

let mut encoder = Encoder::new();
let mut decoder = Decoder::new();
let mut context = Context::new();

// Encode: `data` holds the sensor readings to compress.
let message = encoder.encode(&data, &context);
// Update the shared context so future readings get shorter codes.
context.observe(&data);

// Decode: the same context lets the decoder reconstruct the original values.
let decoded = decoder.decode(&message, &context)?;

📦 Crates.io: alec
🔗 GitHub: zeekmartin/alec-codec
🌐 Website: alec-codec.com

The Bottom Line

Shannon's theorem is inviolable. But Shannon also taught us that entropy depends on our model.

Generic compressors use generic models. Domain-specific compressors use domain-specific models.

For IoT data, that difference is 22x vs 8x compression.


"Every byte counts. Everywhere."


Discussion

Have you hit bandwidth limits with IoT data? What compression approaches have you tried? Let me know in the comments!


ALEC is dual-licensed: AGPL-3.0 for open source, commercial licenses available for proprietary use.

#iot #rust #compression #embedded #dataengineering
