I shipped gni-compression to npm two days ago. One of the first questions I got (from myself, running benchmarks at midnight): does it work on anything other than chat data?
Short answer: not yet. Long answer: I found out exactly why, and it led me somewhere more interesting than I expected.
The Benchmark That Told the Truth
After the npm launch I ran GN against the Silesia corpus — the standard general-purpose compression benchmark: Dickens, Webster, XML, binaries. Here's what came back:
GN loses. Not slightly — brotli-6 is 10–30% better on general text depending on the corpus. Gzip-6 beats it too.
The obvious question is why. GN beats brotli on chat data by ~2% consistently across 12 measurements. Same algorithm, different corpus, completely different result.
What's Actually Happening
GN's pipeline looks like this:
input → sliding window learner → tokenizer → token stream + literal stream → deflate each stream → frame
The sliding window learns repeated patterns from the data. On chat data it learns role markers, JSON field names, tool call schemas, prompt fragments. On Silesia it learns... less. The vocabulary is shallower because general text has less structural repetition.
But that's not the whole story. I ran a test that revealed something more uncomfortable:
deflate on raw data: 2.563x — 28ms
deflate on GN-tokenized: 2.525x — 15ms
Deflate on the raw bytes beats deflate on the GN-tokenized streams. The tokenization step is actually hurting ratio on general text: it's faster, because the tokenized input is smaller, but it compresses worse.
This means GN's wins on chat data come entirely from the vocabulary quality on that specific domain — and when the vocabulary is weaker, we're paying overhead with nothing to show for it.
Why Deflate Is the Wrong Coder Here
Deflate was designed for mixed byte streams. It uses LZ77 + Huffman coding. It's extremely well engineered for its purpose.
But GN's token stream is not a mixed byte stream. After tokenization it's a stream of small integers — token IDs, mostly concentrated in the top ~5000 entries of the learned vocabulary. The symbol distribution is highly skewed and known in advance.
Deflate doesn't know any of that. It treats the token stream like arbitrary bytes and builds a fresh Huffman tree from scratch for each chunk. It's doing redundant work and missing structure that's visible to GN's own data model.
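The gap can be made concrete with a Shannon entropy calculation. A skewed symbol distribution carries far fewer bits per symbol than a uniform one, and a coder that knows the distribution up front can target that bound directly (the distributions below are made up for illustration):

```javascript
// Shannon entropy in bits per symbol for a probability distribution.
function entropyBits(probs) {
  return probs.reduce((h, p) => (p > 0 ? h - p * Math.log2(p) : h), 0);
}

// Hypothetical skewed token distribution: a few IDs dominate.
const skewed = [0.5, 0.25, 0.125, 0.0625, 0.0625];
console.log(entropyBits(skewed).toFixed(3)); // 1.875 bits/symbol

// A uniform distribution over the same 5 symbols needs log2(5) bits.
console.log(entropyBits([0.2, 0.2, 0.2, 0.2, 0.2]).toFixed(3)); // 2.322
```

A coder with a static model matched to the skewed distribution can approach 1.875 bits/symbol; a coder that assumes nothing pays for that ignorance.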
ANS is different. ANS (asymmetric numeral systems) is a modern family of entropy coders; zstd's FSE stage is a tabled ANS variant. An ANS coder can be initialized with a pre-built frequency table tuned to GN's specific token distribution. On token streams with known, skewed distributions, ANS should code significantly closer to the theoretical entropy than deflate does.
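For readers who haven't met ANS before, here is a minimal static rANS coder over a fixed frequency table. This is a sketch of the idea only, not GN's gn_ans_* implementation; real coders replace the linear symbol search with a lookup table:

```javascript
const SCALE_BITS = 12;     // frequencies must sum to 1 << SCALE_BITS
const M = 1 << SCALE_BITS;
const L = 1 << 23;         // renormalization lower bound; state stays in [L, 256L)

function cumulative(freqs) {
  const cum = [0];
  for (const f of freqs) cum.push(cum[cum.length - 1] + f);
  return cum;
}

function ransEncode(symbols, freqs) {
  const cum = cumulative(freqs);
  const bytes = [];
  let x = L;
  // rANS is LIFO: encode in reverse so decoding runs forward.
  for (let i = symbols.length - 1; i >= 0; i--) {
    const s = symbols[i];
    const xMax = ((L >> SCALE_BITS) << 8) * freqs[s];
    while (x >= xMax) { bytes.push(x & 0xff); x = Math.floor(x / 256); }
    x = Math.floor(x / freqs[s]) * M + (x % freqs[s]) + cum[s];
  }
  return { x, bytes };
}

function ransDecode(state, freqs, count) {
  const cum = cumulative(freqs);
  const bytes = state.bytes.slice();
  let x = state.x;
  const out = [];
  for (let i = 0; i < count; i++) {
    const slot = x % M;
    let s = 0;
    while (cum[s + 1] <= slot) s++; // linear search; real coders use a table
    x = freqs[s] * Math.floor(x / M) + slot - cum[s];
    while (x < L && bytes.length) x = x * 256 + bytes.pop();
    out.push(s);
  }
  return out;
}

// Hypothetical skewed table for 4 token IDs, summing to 1 << 12.
const freqs = [2048, 1024, 512, 512];
const msg = [0, 0, 1, 0, 2, 0, 1, 3, 0, 0];
console.log(ransDecode(ransEncode(msg, freqs), freqs, msg.length));
// prints the original msg array
```

The key property: the frequency table is fixed ahead of time, so there's no per-chunk model rebuild the way deflate rebuilds its Huffman trees.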
We Already Built It
The ANS implementation is already in the codebase — gn_ans_compress, gn_ans_compress_bits, gn_ans_compress_o1 for the compress side, matching decompress variants. What's left is wiring it into the main compression path and benchmarking against deflate on the same split-stream output.
This matters for a reason beyond ratio numbers. Right now GN has one piece of its pipeline it didn't design: the entropy stage. Everything else — the rolling hash tokenizer, the codon table, the sliding window learner, the split-stream architecture, the frame format — was built for GN's specific problem. Replacing deflate with our own ANS implementation means the hot path is fully ours.
Why This Opens the Door to General Text
Here's the thing about GN's architecture: the domain-specificity lives in the vocabulary. The sliding window learns from whatever you feed it. On LLM chat data it learns chat patterns. On Silesia it could learn Silesia patterns — it's just shallower because general text has less structural repetition to exploit.
But with a coder that's tuned to GN's output distribution rather than arbitrary bytes, the floor goes up. The overhead we're currently paying on general text drops. The question becomes: how much does domain-adaptive preprocessing help when your entropy stage is no longer the bottleneck?
That's GNCompressorV2. Same architecture, own entropy coder, tested on both conversation data and general text with verified numbers.
Not there yet. But now I know exactly what the ceiling is and what's holding us below it.
Code: github.com/atomsrkuul/glasik-core | npm: gni-compression