Buffer Overflow
GN Beats Gzip and Brotli: How a Learning Sliding Window Outperforms Static Compressors

When we published our last article, GN was within 10% of gzip on LLM conversation data. We said the remaining gap was in the entropy backend. We were wrong about the solution — but right about the problem.
This week GN beats gzip on every corpus we tested. And on all three corpora, it beats brotli.
Here is what we learned.

The ANS Dead End
Our first instinct was to improve the entropy coder. Gzip uses Huffman coding. zstd uses ANS (Asymmetric Numeral Systems). We implemented byte-renorm ANS, bit-renorm ANS, and Order-1 ANS from scratch in Rust.

Results on ShareGPT:
| Codec | Ratio |
|---|---|
| gzip-6 | 2.082x |
| byte-ANS | 1.233x |
| bit-ANS | 1.212x |
| O1-ANS | 0.551x |

ANS without an LZ-style preprocessing pass is worse than gzip. Every time. The reason is fundamental: entropy coders compress symbol frequency distributions. But gzip's real advantage comes from LZ77 — the sliding window that eliminates repeated byte sequences before entropy coding runs. ANS cannot fix what LZ77 needs to do first.
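The point can be made concrete with a toy order-0 entropy calculation (a sketch, not GN's code): a stream of repeated trigrams has exactly the same byte histogram as a shuffled copy of itself, so an entropy coder that models only symbol frequencies cannot exploit the repetition at all.

```rust
// Sketch: order-0 Shannon entropy is blind to symbol *order*, so repeated
// substrings look no different from shuffled bytes. Illustrative only.
fn entropy_bits_per_byte(data: &[u8]) -> f64 {
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // "abc" repeated 1000 times: trivially compressible with LZ77,
    // but its byte histogram is uniform over {a, b, c} -- exactly the
    // histogram a shuffled copy would have.
    let repeated: Vec<u8> = b"abc".iter().cycle().take(3000).copied().collect();
    println!("{:.3} bits/byte", entropy_bits_per_byte(&repeated)); // ~1.585
}
```

An LZ77 pass removes the repeats first, so the entropy coder only has to model what is left.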
We kept ANS in the codebase as a primitive for future work and moved on.
The Real Problem: Per-Frame Dictionary Overhead
GN has a sliding window tokenizer — it learns domain vocabulary across batches and compresses using that vocabulary. But there was a critical architectural flaw: the dictionary was serialized into every compressed frame.

200 entries × ~10 bytes = ~2KB overhead per chunk. On 500-byte chunks, the dictionary cost more than the compression saved.

v1 on 1000 LLM chunks: 0.502x (expanding the data)
The fix: stop putting the dictionary in the frame. Keep it in shared state, reference it by version number. This is exactly how brotli's static dictionary and zstd's dictionary mode work.

Frame v1: magic + full_dictionary + payload (~2KB overhead)

Frame v2: magic + dict_version(4 bytes) + payload (8 bytes overhead)
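A minimal sketch of the two layouts, with field names, a length prefix, and a 4-byte magic assumed for illustration (the real GN frame format may differ):

```rust
// Sketch of the v1 vs v2 frame layouts described above. The 4-byte magic,
// the length prefix, and the field names are assumptions, not GN's format.
const MAGIC: [u8; 4] = *b"GNv2";

// v1: the full dictionary rides inside every frame.
fn frame_v1(dictionary: &[u8], payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&MAGIC);
    out.extend_from_slice(&(dictionary.len() as u32).to_le_bytes());
    out.extend_from_slice(dictionary);
    out.extend_from_slice(payload);
    out
}

// v2: only a 4-byte version id; the dictionary lives in shared state.
fn frame_v2(dict_version: u32, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&MAGIC);
    out.extend_from_slice(&dict_version.to_le_bytes());
    out.extend_from_slice(payload);
    out
}

fn main() {
    let dict = vec![0u8; 2000]; // ~200 entries x ~10 bytes
    let payload = vec![0u8; 500];
    // v1 overhead dwarfs a 500-byte chunk; v2 is 8 bytes flat.
    println!("v1: {} bytes", frame_v1(&dict, &payload).len()); // 2508
    println!("v2: {} bytes", frame_v2(7, &payload).len());     // 508
}
```

On a 500-byte chunk, the v1 frame is five times the size of the input before compression even helps; the v2 frame pays 8 bytes regardless of dictionary size.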

The Corpus Window (Level 2)
With the overhead fixed, we increased the window to 10,000 entries and made it global — one sliding window shared across all compression calls in the process. Every session, every shard, every conversation feeds the same accumulating vocabulary.
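A process-global window along these lines can be sketched with a `OnceLock`-guarded state; the names, the FIFO eviction policy, and the `observe` API here are assumptions for illustration, not GN's actual code:

```rust
use std::collections::VecDeque;
use std::sync::{Mutex, OnceLock};

// Sketch of a process-global sliding-window vocabulary: one 10,000-entry
// window shared by every compression call. Eviction here is FIFO; GN's
// real policy may differ.
const WINDOW_CAP: usize = 10_000;

struct CorpusWindow {
    entries: VecDeque<Vec<u8>>,
}

impl CorpusWindow {
    fn observe(&mut self, token: &[u8]) {
        if self.entries.len() == WINDOW_CAP {
            self.entries.pop_front(); // slide: drop the oldest entry
        }
        self.entries.push_back(token.to_vec());
    }
    fn len(&self) -> usize {
        self.entries.len()
    }
}

fn global_window() -> &'static Mutex<CorpusWindow> {
    static WINDOW: OnceLock<Mutex<CorpusWindow>> = OnceLock::new();
    WINDOW.get_or_init(|| Mutex::new(CorpusWindow { entries: VecDeque::new() }))
}

fn main() {
    // Every session, shard, and conversation feeds the same window.
    let w = global_window();
    w.lock().unwrap().observe(b"\"role\": \"assistant\"");
    w.lock().unwrap().observe(b"```json");
    println!("window size: {}", w.lock().unwrap().len());
}
```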

Results immediately improved:
| Corpus | L1 (per-call) | L2 (corpus window) | gzip | brotli |
|---|---|---|---|---|
| ShareGPT | 2.191x | 2.402x | 2.178x | 2.453x |
| WildChat | 2.035x | 2.145x | 2.025x | 2.234x |
| LMSYS | 2.094x | 2.231x | 2.079x | 2.322x |

L2 beats gzip on every corpus. The gap to brotli narrowed to 2-4%.
Retrieval-Warmed Compression (Level 3)
The insight: before compressing a new chunk, feed similar prior chunks through the sliding window first. This warms the dictionary with related vocabulary so the new chunk compresses better. The act of retrieval changes the compression state.
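The warming loop can be sketched with a toy window that tracks 8-byte substrings; `Window` and its coverage metric are illustrative stand-ins for GN's real dictionary and compression ratio:

```rust
use std::collections::HashSet;

// Toy model of retrieval-warmed compression: the "window" is a set of
// 8-byte substrings, and "coverage" is the fraction of a new chunk's
// 8-grams the window already knows. Purely illustrative, not GN's codec.
struct Window {
    grams: HashSet<[u8; 8]>,
}

impl Window {
    fn observe(&mut self, chunk: &[u8]) {
        for w in chunk.windows(8) {
            self.grams.insert(w.try_into().unwrap());
        }
    }

    fn coverage(&self, chunk: &[u8]) -> f64 {
        let total = chunk.windows(8).count();
        if total == 0 {
            return 0.0;
        }
        let hits = chunk
            .windows(8)
            .filter(|w| {
                let gram: [u8; 8] = (*w).try_into().unwrap();
                self.grams.contains(&gram)
            })
            .count();
        hits as f64 / total as f64
    }
}

// Warm the window with the last k prior chunks, then score the new chunk.
fn warmed_coverage(window: &mut Window, prior: &[&[u8]], k: usize, chunk: &[u8]) -> f64 {
    for p in prior.iter().rev().take(k) {
        window.observe(p);
    }
    window.coverage(chunk)
}

fn main() {
    let prior: [&[u8]; 2] = [
        b"{\"role\": \"user\", \"content\":",
        b"{\"role\": \"assistant\", \"content\":",
    ];
    let chunk: &[u8] = b"{\"role\": \"assistant\", \"content\": \"hi\"}";

    let cold = Window { grams: HashSet::new() }.coverage(chunk);
    let mut warm = Window { grams: HashSet::new() };
    let warmed = warmed_coverage(&mut warm, &prior, 2, chunk);

    // Warming with related chunks raises coverage, so the new chunk
    // compresses better against the shared dictionary.
    println!("cold: {:.2}, warmed: {:.2}", cold, warmed);
}
```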

We benchmarked warm_k (the number of prior chunks fed through the window before compressing) on WildChat, the hardest corpus due to its topic diversity:

| warm_k | L3 ratio | vs brotli |
|---|---|---|
| 0 (no warming) | 2.164x | 3.54% behind |
| 1 | 2.199x | 1.89% behind |
| 2 | 2.251x | 0.5% ahead |
| 3 | 2.207x | 1.51% behind |

warm_k=2 is optimal for WildChat; for ShareGPT and LMSYS, warm_k=3 is.

The optimal warming depth varies with corpus vocabulary diversity: more diverse corpora benefit from shallower warming, which avoids diluting the dictionary with unrelated vocabulary.

Final Results: L3 Beats Brotli on All Three Corpora
Verified across 3 independent corpora, 3 random seeds each:
| Corpus | GN L3 | gzip-6 | brotli-6 | margin vs brotli |
|---|---|---|---|---|
| ShareGPT | 2.526x | 2.145x | 2.429x | +4.0% |
| LMSYS | 2.401x | 2.031x | 2.291x | +4.8% |
| WildChat | 2.251x | 2.023x | 2.240x | +0.5% |

GN beats gzip by 11-18% on every run, across all seeds and corpora. It beats brotli on all three corpora once the window is sufficiently warmed.

Why This Works

Brotli ships with a 120KB static dictionary of common web phrases. It never changes. GN's sliding window learns the specific vocabulary of your data stream as it runs. LLM conversations have crystalline structure — repeated role markers, prompt scaffolding, tool call formats, JSON patterns, reasoning templates. After seeing a few thousand examples, GN knows these patterns better than any generic dictionary ever could.

The critical property: GN's compression ratio improves with stream length. Gzip and brotli are static — they cannot improve.

On ShareGPT:

| Chunks seen | GN | brotli | Margin |
|---|---|---|---|
| 500 | 2.304x | 2.363x | behind |
| 2000 | 2.440x | 2.436x | pulls ahead |
| 5000 | 2.517x | 2.429x | +3.6% |

The longer GN runs on a domain-specific stream, the wider the gap grows.

What Comes Next

The current warming uses sequential proximity — the last N chunks before the current one. The next level uses semantic similarity — retrieve the most topically related prior chunks via embedding search, regardless of when they appeared.

A conversation about JWT authentication should be warmed by other authentication conversations, not by whatever happened to come before it in the stream. This is Semantic Level 3, and it should further improve results on diverse corpora like WildChat where topic jumps are common.
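A minimal sketch of that retrieval step, using cosine similarity over toy embedding vectors (a real system would use a text-embedding model; every name here is hypothetical):

```rust
// Sketch of the planned semantic warming: choose warm chunks by embedding
// similarity instead of recency. The 3-d vectors below are toy stand-ins
// for real text embeddings.
fn cosine(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let nb: f64 = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    dot / (na * nb)
}

// Indices of the k prior chunks most similar to the query embedding.
fn top_k_similar(query: &[f64], prior: &[Vec<f64>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f64)> = prior
        .iter()
        .enumerate()
        .map(|(i, e)| (i, cosine(query, e)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}

fn main() {
    // Toy embeddings: indices 0 and 2 are "auth" topics, 1 is unrelated.
    let prior = vec![
        vec![0.9, 0.1, 0.0], // JWT authentication conversation
        vec![0.0, 0.1, 0.9], // cooking conversation
        vec![0.8, 0.2, 0.1], // OAuth conversation
    ];
    let query = [1.0, 0.0, 0.0]; // new JWT chunk
    // Warm with the auth conversations, regardless of stream order.
    println!("{:?}", top_k_similar(&query, &prior, 2)); // [0, 2]
}
```

The selected chunks would then be fed through the window exactly as in sequential warming; only the retrieval policy changes.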
Beyond that: dictionary compression (compress the dictionary itself, fractal self-similarity), cross-session persistence (window state survives restarts), and pre-trained domain dictionaries (ship a base window trained on 50k LLM conversations).

The goal is to make GN the brotli of LLMs — purpose-built, measurably better, and invisible infrastructure.

GN is MIT licensed. Code: github.com/atomsrkuul/glasik-core

npm: gni-compression@1.0.0

NLNet NGI Zero Commons Fund application #2026-06-023
