When we first started building, the goal was never to make another gzip clone. Generic compression already does that job incredibly well.
The real question was different:
What happens if the compressor understands the shape of the data before it ever starts packing bytes?
That question led us from the original JavaScript prototype into Glasik Core, a Rust implementation focused on semantic tokenization, rolling vocabulary windows, and domain-aware preprocessing for message and agent streams.
This week we hit a milestone that feels small on paper but huge architecturally:
GN is now within 10% of gzip on every benchmark corpus we tested.
Not better. Not faster. Not “production solved.”
Just consistently close, which is exactly why this stage is exciting.
The benchmark reality
Current corpus results:
| Corpus        | Glasik Core | gzip   | Relative |
| ------------- | ----------- | ------ | -------- |
| MEMORY.md     | 1.849x      | 2.075x | 89%      |
| ShareGPT-1k   | 3.752x      | 3.945x | 95%      |
| Ubuntu-IRC-1k | 2.122x      | 2.357x | 90%      |
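For clarity, the Relative column is simply Glasik Core's compression ratio divided by gzip's, rounded to a whole percent. A quick sketch (the helper name is ours, not from the codebase):

```rust
// Hypothetical helper: relative ratio vs gzip, as a whole percent.
fn relative_percent(ours: f64, gzip: f64) -> u32 {
    (ours / gzip * 100.0).round() as u32
}

fn main() {
    let corpora = [
        ("MEMORY.md", 1.849, 2.075),
        ("ShareGPT-1k", 3.752, 3.945),
        ("Ubuntu-IRC-1k", 2.122, 2.357),
    ];
    for (name, ours, gzip) in corpora {
        // Reproduces the 89% / 95% / 90% figures in the table.
        println!("{name}: {}%", relative_percent(ours, gzip));
    }
}
```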
The most important one is ShareGPT-1k hitting 95% of gzip. That corpus is extremely close to the data GN was designed for:
- Repeated assistant roles
- Prompt scaffolding
- Tool formatting
- Structured JSON-like patterns
- Recurring conversational templates
Even though we have not passed gzip yet, nearly matching it on LLM-native streams is a strong validation signal.
Why being close matters more than winning right now
The remaining gap is not where many would assume. The weak point is not semantic understanding anymore.
The weak point is the final entropy backend. gzip still has decades of advantage in:
- Huffman tuning
- Backreference heuristics
- Lazy match parsing
- Highly optimized bit packing
- Mature DEFLATE edge cases
That last 5–10% is the part generic compressors are legendary at.
But the semantic layer is already doing the harder thing: understanding the structure of the stream before compression begins. That’s where the long-term leverage is.
The real architectural lesson
The simplest way to explain the difference:
gzip remembers bytes. GN remembers meaning.
As the rolling vocabulary fills, repeated structures stop being treated like raw strings and start being treated as stable semantic units. That includes:
- Timestamps
- Speaker roles
- Repeated tool calls
- Theorem blocks
- JSON keys
- Repeated prompt shells
- Agent trace scaffolding
- Channel metadata
Performance improves the longer the stream runs. Instead of relying only on a fixed byte-history window, GN reinforces the vocabulary of the domain itself. That’s the core bet.
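A minimal sketch of the idea (type and field names are ours, not Glasik Core's API): count tokens over a bounded rolling window and promote any token that crosses a frequency threshold to a stable unit.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical sketch of a rolling vocabulary window: a bounded queue of
/// recent tokens plus a frequency map. Frequent tokens get "promoted" to
/// stable semantic units instead of being treated as raw strings.
struct RollingVocab {
    window: VecDeque<String>,
    counts: HashMap<String, usize>,
    capacity: usize,
    promote_at: usize,
}

impl RollingVocab {
    fn new(capacity: usize, promote_at: usize) -> Self {
        Self {
            window: VecDeque::new(),
            counts: HashMap::new(),
            capacity,
            promote_at,
        }
    }

    /// Push a token; returns true once it is frequent enough in the
    /// recent stream to be treated as a stable semantic unit.
    fn observe(&mut self, token: &str) -> bool {
        if self.window.len() == self.capacity {
            // Evict the oldest token so statistics track the recent stream,
            // not the whole history.
            if let Some(old) = self.window.pop_front() {
                if let Some(c) = self.counts.get_mut(&old) {
                    *c -= 1;
                    if *c == 0 {
                        self.counts.remove(&old);
                    }
                }
            }
        }
        self.window.push_back(token.to_string());
        let c = self.counts.entry(token.to_string()).or_insert(0);
        *c += 1;
        *c >= self.promote_at
    }
}
```

On a chat stream, scaffolding like `role:assistant` crosses the threshold within a few messages and stays promoted as long as it keeps recurring, which is why performance improves the longer the stream runs.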
Why Rust changed the debugging loop
The JavaScript prototype proved the idea. Rust made it possible to trust the measurements.
One concrete example: during corpus benchmarking we hit a rolling-frequency bug that silently inflated token counts over long windows. Compression ratios looked “better,” but only because the vocabulary statistics were wrong.
The fix only became obvious because Rust forced us to reason explicitly about integer width, overflow behavior, and ownership boundaries inside the rolling state machine.
Fixing it tightened the corpus results and gave us confidence that the “within 10%” milestone is real, not a measurement artifact. That debugging loop alone justified the rewrite.
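To illustrate the class of bug (this is not the actual GN code): decrementing an unsigned rolling counter that was never incremented wraps around to a huge value, which quietly inflates frequency statistics. Rust's checked arithmetic makes that failure mode explicit instead of silent.

```rust
/// Illustrative only: with wrapping arithmetic, an eviction-accounting
/// mistake turns a zero counter into u16::MAX, inflating frequencies.
fn buggy_decrement(count: u16) -> u16 {
    count.wrapping_sub(1)
}

/// What Rust nudges you toward: underflow becomes a visible `None`
/// (a signal of an eviction bug) rather than a silently wrong count.
fn safe_decrement(count: u16) -> Option<u16> {
    count.checked_sub(1)
}
```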
What makes this exciting now
The missing performance is now localized. We know exactly where the gap is:
- Residual encoding
- Entropy refinement
- Better state models
- Adaptive codon dictionaries
- Specialized chat residual codecs
That is a much better place to be than wondering whether the entire idea works. The semantic layer is clearly competitive. Now it’s about tightening the backend until the semantic advantage outweighs gzip’s entropy maturity.
What’s next
Tonight’s most interesting work was deeper in the backend: we now have a reference-safe ANS entropy coder implemented from scratch in Rust, using the same family of techniques that powers zstd.
The current version uses correctness-first binary renormalization so we can prove round-trip behavior before optimizing. Next step: bit-level state refinement and faster renormalization transforms.
This work directly targets the exact 5–10% gap the benchmarks are still showing.
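To make the ANS idea concrete, here is a self-contained rANS round-trip sketch over a toy three-symbol alphabet. Everything here is ours, not the Glasik Core coder: the fixed frequency table is invented, and it uses byte-wise renormalization for brevity where the post describes a correctness-first binary variant. The shape of the technique is the same one the zstd family uses.

```rust
const SCALE_BITS: u32 = 4;   // total frequency M = 1 << 4 = 16
const RANS_L: u32 = 1 << 16; // lower bound of the normalized state interval

// (freq, cumulative start) per symbol; frequencies sum to M.
// Toy alphabet: symbols 0, 1, 2 with frequencies 8, 6, 2.
const FREQ: [(u32, u32); 3] = [(8, 0), (6, 8), (2, 14)];

/// Encode symbols into a final state plus a byte stream.
fn encode(symbols: &[u8]) -> (u32, Vec<u8>) {
    let mut x: u32 = RANS_L;
    let mut out = Vec::new();
    for &s in symbols.iter().rev() { // rANS encodes in reverse order
        let (f, c) = FREQ[s as usize];
        // Byte-wise renormalization: shift bytes out before x overflows.
        let x_max = ((RANS_L >> SCALE_BITS) << 8) * f;
        while x >= x_max {
            out.push((x & 0xFF) as u8);
            x >>= 8;
        }
        x = ((x / f) << SCALE_BITS) | ((x % f) + c);
    }
    out.reverse(); // decoder consumes the bytes front-to-back
    (x, out)
}

/// Decode `n` symbols back out of the state and byte stream.
fn decode(mut x: u32, bytes: &[u8], n: usize) -> Vec<u8> {
    let mut it = bytes.iter();
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        let slot = x & ((1 << SCALE_BITS) - 1);
        // Find the symbol whose cumulative range covers this slot.
        let s = FREQ
            .iter()
            .position(|&(f, c)| c <= slot && slot < c + f)
            .unwrap();
        let (f, c) = FREQ[s];
        x = f * (x >> SCALE_BITS) + slot - c;
        // Renormalize: pull bytes back in until x is in range.
        while x < RANS_L {
            x = (x << 8) | *it.next().unwrap() as u32;
        }
        out.push(s as u8);
    }
    out
}
```

The property worth proving first, exactly as the post says, is the round trip: `decode(encode(msg)) == msg` for every input, before any speed work on the renormalization path.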
The path forward is finally clear:
- Semantic understanding is already competitive
- Entropy packing is the remaining frontier
- The architecture now tells us exactly where to push
At this point, GN (our semantic agent layer) and Glasik Core (the compression engine) feel less like an experiment and more like a real compression architecture.