David MARTIN

"Shannon Was Right, But We Can Be Smarter: How ALEC Achieves 22x Compression on IoT Data"

The Problem That Shouldn't Exist

I was working on an IoT project—temperature sensors reporting values every second. Simple stuff:

22.5, 22.5, 22.6, 22.5, 22.5, 22.6, 22.6, 22.5, ...

Each value takes 8 bytes as a float64. That's 691 KB per sensor per day.

"No problem," I thought. "Let's compress it with gzip."

Result: ~400 KB. Only a 42% reduction (about 1.7x).

Wait, what? The data is obviously redundant. A human can see the pattern instantly. Why can't gzip?

Shannon Was Right (Of Course)

Claude Shannon proved in 1948 that you can't compress data below its entropy. Period. No exceptions.

But here's the key insight Shannon himself noted: entropy depends on your model of the data.

If gzip treats 22.5, 22.5, 22.6 as arbitrary bytes, it sees one entropy. But if we model it as "temperature sensor with 0.1°C precision, typically stable", the entropy is much lower.

The trick isn't violating Shannon's theorem. It's building a better model.
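
To make that concrete, here's a back-of-the-envelope sketch (not ALEC code, just an illustration using a made-up slowly drifting sensor): estimate the empirical entropy of the same stream under two models, raw quantized readings versus deltas.

use std::collections::HashMap;

// Empirical Shannon entropy (bits per symbol) of a stream of quantized readings.
fn entropy(symbols: &[i64]) -> f64 {
    let mut counts: HashMap<i64, usize> = HashMap::new();
    for &s in symbols {
        *counts.entry(s).or_insert(0) += 1;
    }
    let n = symbols.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // Made-up sensor drifting slowly from 22.0°C to 22.9°C in 0.1°C steps
    // (quantized: 22.5°C -> 225).
    let readings: Vec<i64> = (0..100i64).map(|i| 220 + i / 10).collect();

    // Model A: each reading is an independent symbol.
    let h_raw = entropy(&readings);

    // Model B: each delta from the previous reading is a symbol.
    let deltas: Vec<i64> = readings.windows(2).map(|w| w[1] - w[0]).collect();
    let h_delta = entropy(&deltas);

    println!("raw entropy:   {h_raw:.2} bits/symbol");
    println!("delta entropy: {h_delta:.2} bits/symbol");
}

Same data, two models, two very different entropies. The delta model's entropy is a fraction of the raw model's, which is exactly the gap a domain-aware compressor gets to exploit.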

The IoT Data Model

Real IoT sensor data has properties that generic compressors ignore:

  1. Temporal stability - Values change slowly
  2. Predictability - Next value ≈ current value
  3. Bounded range - Temperature won't jump from 22°C to 500°C
  4. Quantization - Sensors have finite precision (0.1°C)

An intelligent compressor that exploits these properties can dramatically outperform generic alternatives.

Enter ALEC

I built ALEC (Adaptive Lazy Evolving Compression) specifically for IoT data. The core ideas:

1. Delta Encoding

Instead of transmitting 22.5, transmit +0.0 (delta from last value).

Raw:    22.5 → 22.5 → 22.6 → 22.5
Delta:  22.5 →  0.0 → +0.1 → -0.1

Zero delta? That's 2 bits instead of 64.
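
Here's a minimal sketch of the idea: quantize to 0.1°C (the precision assumed in the examples above) and store only differences. This is an illustration, not ALEC's actual wire format.

/// Quantize readings to 0.1°C steps and emit deltas (first value sent as-is).
fn delta_encode(readings: &[f64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(readings.len());
    let mut prev = 0i64;
    for (i, &r) in readings.iter().enumerate() {
        let q = (r * 10.0).round() as i64; // 22.5 -> 225
        out.push(if i == 0 { q } else { q - prev });
        prev = q;
    }
    out
}

/// Reverse the transform: running sum, then scale back to °C.
fn delta_decode(deltas: &[i64]) -> Vec<f64> {
    let mut out = Vec::with_capacity(deltas.len());
    let mut acc = 0i64;
    for &d in deltas {
        acc += d;
        out.push(acc as f64 / 10.0);
    }
    out
}

fn main() {
    let readings = [22.5, 22.5, 22.6, 22.5];
    let deltas = delta_encode(&readings);
    assert_eq!(deltas, vec![225, 0, 1, -1]); // 22.5 -> 0.0 -> +0.1 -> -0.1
    assert_eq!(delta_decode(&deltas), readings);
}

Most deltas are 0 or ±1, so a variable-length code can spend a couple of bits on them instead of 64.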

2. Pattern Dictionary

Frequent values get short codes. After observing 22.5 appear 1000 times, it gets a 4-bit code instead of 64 bits.
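
ALEC's internal code assignment isn't shown in this post, but the gist is a frequency-ranked dictionary: count how often each symbol appears and give the most frequent ones the lowest (shortest) code indices. A toy sketch of that ranking (the names and logic here are mine, not ALEC's):

use std::collections::HashMap;

/// Track symbol frequencies; the most common symbols get the shortest codes.
/// The ranking below is a toy, O(n log n) per lookup, purely for illustration.
struct PatternDict {
    counts: HashMap<i64, u64>,
}

impl PatternDict {
    fn new() -> Self {
        Self { counts: HashMap::new() }
    }

    fn observe(&mut self, symbol: i64) {
        *self.counts.entry(symbol).or_insert(0) += 1;
    }

    /// Rank of a symbol by frequency: rank 0 maps to the shortest code.
    fn code_index(&self, symbol: i64) -> Option<usize> {
        let mut ranked: Vec<(&i64, &u64)> = self.counts.iter().collect();
        ranked.sort_by(|a, b| b.1.cmp(a.1).then(a.0.cmp(b.0)));
        ranked.iter().position(|(s, _)| **s == symbol)
    }
}

fn main() {
    let mut dict = PatternDict::new();
    for _ in 0..1000 { dict.observe(225); } // 22.5°C seen 1000 times
    for _ in 0..10 { dict.observe(230); }   // 23.0°C seen rarely
    assert_eq!(dict.code_index(225), Some(0)); // most frequent -> shortest code
}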

3. Evolving Context

Encoder and decoder maintain synchronized context that improves over time:

Week 1:  "temperature=22.3°C" → 20 bytes
Week 4:  [code_7][+0.3]       → 3 bytes
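
The crucial part is that nothing about the dictionary itself crosses the wire: both ends apply the same deterministic update after every message, so their state stays identical. A toy illustration of that synchronization invariant (my own SharedContext, not ALEC's type):

use std::collections::HashMap;

/// Both sides keep their own copy of this state and apply the same update
/// after every message, so code assignments never have to be transmitted.
#[derive(Default, PartialEq, Debug)]
struct SharedContext {
    counts: HashMap<i64, u64>,
}

impl SharedContext {
    fn observe(&mut self, symbol: i64) {
        *self.counts.entry(symbol).or_insert(0) += 1;
    }
}

fn main() {
    let mut encoder_ctx = SharedContext::default();
    let mut decoder_ctx = SharedContext::default();

    for symbol in [225, 225, 226, 225] {
        // Encoder: pick a code from encoder_ctx, send it, then update.
        encoder_ctx.observe(symbol);
        // Decoder: decode with its own copy, then apply the identical update.
        decoder_ctx.observe(symbol);
    }

    // Deterministic, identical updates keep the two copies in lock-step.
    assert_eq!(encoder_ctx, decoder_ctx);
}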

The Benchmark Results

I ran ALEC against gzip on real IoT data patterns. Here's what happened:

On Variable Data (SmartGrid current sensor, only 8.7% unchanged readings)

Condition      | ALEC   | gzip  | ALEC Advantage
---------------|--------|-------|---------------
Cold start     | 10.9x  | 5.1x  | +113%
With preload   | 22.1x  | 8.0x  | +177%

[Chart: ALEC vs gzip comparison]

The Warmup Curve

Here's where it gets interesting. ALEC dominates at every sample count:

Samples    | gzip   | ALEC
-----------|--------|-------
10         | 0.9x   | 8.0x   ← ALEC wins immediately
100        | 2.4x   | 9.2x
1000       | 6.3x   | 11.4x
5000       | 7.4x   | 18.2x
8640       | 8.0x   | 22.0x

At 10 samples, gzip can't compress at all (0.9x = expansion!). ALEC achieves 8x because it understands the data model from the start.

The Preload Secret

The key insight: preload eliminates warmup cost.

In production, you:

  1. Generate a preload file from historical data
  2. Ship identical preload to encoder and decoder
  3. Achieve near-optimal compression from byte one

Without preload, ALEC still beats gzip. With preload, it's not even close.
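
Here's a rough sketch of that workflow, reusing the same kind of toy frequency context as above. ALEC's real preload file format isn't described in this post, so treat the details as placeholders:

use std::collections::HashMap;

// Toy stand-in for a learned context; only the workflow is the point here.
#[derive(Clone, Default)]
struct Context {
    counts: HashMap<i64, u64>,
}

impl Context {
    fn observe(&mut self, symbol: i64) {
        *self.counts.entry(symbol).or_insert(0) += 1;
    }
}

/// Step 1: build a preload context offline from historical readings.
fn build_preload(history: &[i64]) -> Context {
    let mut ctx = Context::default();
    for &s in history {
        ctx.observe(s);
    }
    ctx
}

fn main() {
    // Historical data from the same sensor fleet (values are made up).
    let history: Vec<i64> = (0..8640i64).map(|i| 220 + (i % 10)).collect();

    // Step 2: ship the *identical* snapshot to both ends.
    let preload = build_preload(&history);
    let encoder_ctx = preload.clone();
    let decoder_ctx = preload.clone();

    // Step 3: both sides start with the same warm statistics, so the very
    // first message already gets short codes.
    assert_eq!(encoder_ctx.counts, decoder_ctx.counts);
}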

When NOT to Use ALEC

ALEC isn't magic. It won't help with:

  • Random data - No patterns to learn
  • Very short transmissions - <100 samples, warmup dominates
  • Constant data - Trivially compressed by any codec

The sweet spot: long-running IoT streams with predictable patterns.

Try It Yourself

ALEC is open source (AGPL-3.0) and available on crates.io:

use alec::{Encoder, Decoder, Context};

let mut encoder = Encoder::new();
let mut decoder = Decoder::new();
let mut context = Context::new();

// Encode: `data` holds the sensor readings to compress.
let message = encoder.encode(&data, &context);
// Update the shared context so future readings get shorter codes.
context.observe(&data);

// Decode: the same context lets the decoder reconstruct the original values.
let decoded = decoder.decode(&message, &context)?;

📦 Crates.io: alec
🔗 GitHub: zeekmartin/alec-codec
🌐 Website: alec-codec.com

The Bottom Line

Shannon's theorem is inviolable. But Shannon also taught us that entropy depends on our model.

Generic compressors use generic models. Domain-specific compressors use domain-specific models.

For IoT data, that difference is 22x vs 8x compression.


"Every byte counts. Everywhere."


Discussion

Have you hit bandwidth limits with IoT data? What compression approaches have you tried? Let me know in the comments!


ALEC is dual-licensed: AGPL-3.0 for open source, commercial licenses available for proprietary use.

#iot #rust #compression #embedded #dataengineering
