How a networking student ended up writing Rust, beating industry-standard compression, and learning more about computers than any classroom taught me.
I was messing around with LLM APIs. Claude, GPT, the usual. And I kept hitting the same wall: context windows cost money. Every token you send costs. Every token the model reads costs. If you're building anything serious on top of these APIs (agents, memory systems, anything that needs conversation history), you're constantly fighting the token budget.
The obvious solution is compression. Compress your context before sending it, decompress on the way back. But here's the thing nobody talks about: standard compression algorithms weren't built for LLM data. Gzip was designed for web assets. Brotli for HTTP. Neither was trained on the specific patterns that dominate LLM conversations: repeated phrases, structured JSON, code snippets, the same function names appearing dozens of times across a session.
I thought: what if I built something specifically for this?
I have an A+ cert. I'm in school for networking. I had never written Rust. I had never implemented a string matching algorithm. I didn't know what Aho-Corasick was.
Two weeks later, GN (Glasik Notation) was compressing LLM data. Six months of refinement later, it beats gzip on every corpus tested, beats brotli by 47% on short messages, and beats brotli per-message by up to 62% on real Claude conversations. Here's how it happened.
What I Knew Coming In
Networking teaches you to think in bytes. Packets, headers, payloads. MTU limits. Bandwidth vs latency tradeoffs. When you spend time thinking about why a TCP packet is structured the way it is, you develop an intuition for data representation that a lot of software developers never get.
I also knew that compression, at its core, is just finding patterns and encoding them more efficiently. ZIP finds repeated byte sequences. Huffman coding gives frequent symbols shorter codes. That's it. The magic is in how cleverly you find patterns and how efficiently you encode them.
What I didn't know: how to implement any of this. So I started reading.
The First Attempt: Embarrassingly Bad
My first compression attempt was a lookup table. I manually collected common LLM phrases and replaced them with single bytes. Classic dictionary substitution.
The ratio? About 1.1x on good days. Gzip does 2.4x. I was losing badly.
But it taught me something important: the dictionary matters more than the algorithm. If you have the right patterns, even simple substitution works. If you have the wrong patterns, nothing saves you.
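That first attempt was essentially this (a toy sketch for illustration, not the original code; the patterns and byte codes are made up, and a real version has to reserve codes that can't collide with input bytes):

```python
# Toy dictionary substitution: replace common multi-byte patterns with
# single reserved bytes. Assumes codes 0x01-0x04 never occur in the input.
PATTERNS = {
    b"function ": b"\x01",
    b"return ":   b"\x02",
    b"const ":    b"\x03",
    b"    ":      b"\x04",  # 4-space indent
}

def compress(data: bytes) -> bytes:
    for pattern, code in PATTERNS.items():
        data = data.replace(pattern, code)
    return data

def decompress(data: bytes) -> bytes:
    for pattern, code in PATTERNS.items():
        data = data.replace(code, pattern)
    return data

msg = b"function add(a, b) {\n    return a + b;\n}"
packed = compress(msg)
assert decompress(packed) == msg
assert len(packed) < len(msg)
```

Even this naive scheme round-trips losslessly and shaves bytes on code-heavy text; the problem is that scanning every pattern against every position doesn't scale past a few hundred entries.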
Learning Aho-Corasick
The problem with simple dictionary lookup is speed. If you have 20,000 patterns and a 2,000-byte message, naively checking every pattern at every position is O(n × m × k) (input length × pattern count × pattern length), far too slow to be useful.
Aho-Corasick builds a finite automaton from all your patterns simultaneously. You scan the input once, left to right, and the automaton tells you which patterns match at each position. Linear time regardless of how many patterns you have.
This is where my networking background helped. I already understood finite state machines from studying how network protocols work. TCP state diagrams, regex engines in firewalls, etc. Aho-Corasick was the same concept applied to string matching.
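Here's the idea in miniature: a from-scratch Python sketch of the automaton (trie plus BFS-computed failure links), written for illustration rather than speed:

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick automaton: a trie over all patterns plus
    failure links, scanned in a single left-to-right pass."""

    def __init__(self, patterns):
        self.patterns = patterns
        self.goto = [{}]   # per-node transition tables
        self.fail = [0]    # failure links
        self.out = [[]]    # pattern indices that end at each node
        for idx, pat in enumerate(patterns):
            node = 0
            for ch in pat:
                if ch not in self.goto[node]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[node][ch] = len(self.goto) - 1
                node = self.goto[node][ch]
            self.out[node].append(idx)
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                self.out[child] += self.out[self.fail[child]]

    def scan(self, text):
        """Yield (end_index, pattern_index) for every match, one pass."""
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for idx in self.out[node]:
                yield (i, idx)

ac = AhoCorasick(["he", "she", "his", "hers"])
matches = {(i, ac.patterns[p]) for i, p in ac.scan("ushers")}
# "she" and "he" end at index 3, "hers" at index 5
```

The failure links are what make it linear: on a mismatch you fall back to the longest suffix that is still a prefix of some pattern, so no input byte is ever rescanned.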
I implemented it in Rust. Its type system forced me to think clearly about ownership and memory, things networking had taught me to care about but that most high level languages hide.
The Vocabulary Problem
An AC automaton is only as good as its vocabulary. I needed to learn what patterns actually appear in LLM conversations.
I downloaded four public datasets:
· ShareGPT v3: Real ChatGPT conversation turns, avg 846B per message
· WildChat: Multi-language LLM conversations, avg 952B per message
· LMSYS-Chat: Academic LLM benchmark data, avg 915B per message
· Ubuntu IRC: Technical support dialogues, avg 67B per message
I wrote a sliding window tokenizer that learns patterns from the data. After training on 500k chunks, the L0 vocabulary had 20,000 entries covering the most common LLM patterns. The top entries: 8-space indentation, " the", "and ", "ing ", paragraph breaks, common function names, JSON structure.
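A stripped-down version of that learning loop looks like this (a toy sketch: the bytes-saved scoring heuristic is my simplification, and the real tokenizer streams over chunks instead of holding every substring in memory):

```python
from collections import Counter

def learn_vocab(corpus: bytes, min_len=2, max_len=8, top_k=20000):
    """Slide windows of every length over the corpus, then rank each
    candidate by the bytes it would save as a 1-byte token:
    (len(pattern) - 1) * occurrence_count."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(corpus) - n + 1):
            counts[corpus[i:i + n]] += 1
    ranked = sorted(counts.items(),
                    key=lambda kv: (len(kv[0]) - 1) * kv[1],
                    reverse=True)
    return [pattern for pattern, _ in ranked[:top_k]]

vocab = learn_vocab(b"the cat and the dog and the bird", top_k=5)
# the highest-value entries come from the repeated " and the " region
```

The ranking matters as much as the counting: a long pattern seen twice can beat a short pattern seen five times, because each match collapses more bytes into one token.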
Split-Stream Architecture: The Key Insight
After AC matching, you have two streams:
· Token IDs: Single bytes representing matched patterns (1-254)
· Literals: The raw bytes that didn't match any pattern
These two streams have completely different statistical properties. Token IDs are low-entropy: a small alphabet of 254 symbols, highly repetitive. Literals are higher-entropy: the unique residue no pattern covered.
Deflating them separately and combining the compressed streams lets each one be modeled for its own statistics. The token stream deflates at 9.4x. The literal stream deflates at 2x.
The frame format:
[2B token_stream_length][token_stream_deflated][literal_stream_deflated]
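A minimal sketch of that frame, using Python's zlib as the deflate engine (the big-endian u16 length prefix and the toy streams are my assumptions for illustration, not GN's exact encoding):

```python
import struct
import zlib

def encode_frame(token_ids: bytes, literals: bytes) -> bytes:
    """[2B token_stream_length][token_stream_deflated][literal_stream_deflated]"""
    tok = zlib.compress(token_ids, 9)
    lit = zlib.compress(literals, 9)
    return struct.pack(">H", len(tok)) + tok + lit

def decode_frame(frame: bytes):
    (tok_len,) = struct.unpack(">H", frame[:2])
    tok = zlib.decompress(frame[2:2 + tok_len])
    lit = zlib.decompress(frame[2 + tok_len:])
    return tok, lit

# Toy streams: repetitive token IDs, higher-entropy literal residue.
token_ids = bytes([1, 2, 3, 1, 2, 3] * 40)
literals = b"add(a, b) { a + b }"
frame = encode_frame(token_ids, literals)
assert decode_frame(frame) == (token_ids, literals)
```

The 2-byte length prefix is the only side information needed: the decoder slices at that boundary and inflates each stream independently, so the frame is parseable with no external metadata.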
The GCdict Breakthrough
This is the part that surprised me most.
After getting the split-stream working, GN was still trailing brotli by 5-8% on longer messages. I spent weeks trying vocabulary expansion, LZ77 pre-processing, ANS entropy coding. Nothing moved the needle enough.
Then I asked a different question: what does brotli know that GN doesn't?
Brotli has a 120KB static dictionary trained on web text. I proved this was its primary advantage -- brotli at quality=1 (minimal static dictionary usage) actually loses to deflate-9. The dictionary is the advantage, not better LZ77.
So I needed GN's equivalent. But instead of a static web-text dictionary, I used the conversation history itself.
LLM conversations are self-referential. When someone asks about a bug in message 5, they're using the same variable names they used in message 2. A debugging session reuses error messages. A code review reuses function names. The literal-stream residue (the bytes GN's AC tokenizer didn't match) contains fragments of prior messages in the same conversation.
GCdict (GN Context Dictionary): use the last 32KB of conversation history as a preset dictionary for deflate compression of the literal stream.
AC tokenization -> split(tok_ids, literals)
tok_stream: deflate (unchanged)
lit_stream: deflate(literals, zdict=history[-32KB])
Same deflate engine. Better-initialized LZ77 window. No brotli internals. Fully lossless -- both encoder and decoder have the same history.
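zlib exposes exactly this hook as the zdict parameter on its compress and decompress objects. A minimal sketch of the idea (the error-message history and the 32 KB constant are illustrative):

```python
import zlib

WINDOW = 32768  # last 32 KB of history, matching deflate's LZ77 window

def compress_literals(literals: bytes, history: bytes) -> bytes:
    """Deflate the literal stream with recent conversation history
    as a preset dictionary, so matches are available from byte one."""
    c = zlib.compressobj(level=9, zdict=history[-WINDOW:])
    return c.compress(literals) + c.flush()

def decompress_literals(blob: bytes, history: bytes) -> bytes:
    """Lossless as long as the decoder holds the same history."""
    d = zlib.decompressobj(zdict=history[-WINDOW:])
    return d.decompress(blob) + d.flush()

history = b"TypeError: cannot read property 'userId' of undefined in getSession()" * 4
msg = b"the getSession() call still throws: cannot read property 'userId'"
with_dict = compress_literals(msg, history)
without = zlib.compress(msg, 9)
assert decompress_literals(with_dict, history) == msg
assert len(with_dict) < len(without)
```

The preset dictionary seeds the LZ77 window before any message bytes arrive, which is exactly why it pays off most on short messages: plain deflate has no back-references to offer until it has seen enough of the current input.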
The result: GN beats brotli per-message on every corpus tested.
The Numbers
24 independent measurements per corpus for split-stream. 32 random seeds for GCdict.
GN Split-Stream (b=8):
Corpus                 GN ratio   vs gzip   vs brotli   p50
ShareGPT (846B avg)    2.484x     +2.3%     -4.9%       0.43ms
WildChat (952B avg)    2.130x     +3.5%     -7.7%       0.54ms
LMSYS (915B avg)       2.362x     +1.0%     -5.2%       0.39ms
Ubuntu-IRC (67B avg)   2.534x     +61.9%    +47.1%      0.055ms
Per-chunk p50: 0.007ms.
GN with GCdict -- vs brotli per-message (production comparison):
Corpus                 GN ratio   brotli/msg   vs brotli   range
Claude convos (915B)   2.766x     2.115x       +30.8%      30.6-31.2%
ShareGPT               2.765x     2.441x       +13.3%      12.1-14.6%
LMSYS                  2.577x     2.287x       +12.7%      10.8-14.6%
WildChat               2.265x     2.212x       +2.4%       0.9-3.4%
IRC                    1.925x     1.184x       +62.7%      59.9-64.5%
All 32 seeds positive on every corpus. Zero exceptions.
The production comparison is per-message brotli because in real LLM streaming, messages arrive one at a time. GN accumulates session history. Per-message brotli has no context.
The Ubuntu-IRC result deserves special mention. On 67-byte average messages, gzip achieves 0.857x (actually expands) and brotli achieves 1.138x. GN achieves 2.534x. This isn't close -- it's a different compression tier entirely, because GN's vocabulary is trained on the specific domain.
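That expansion is easy to reproduce: gzip's fixed header and trailer alone cost 18 bytes, which a 67-byte message can never amortize. A quick check with Python's gzip module (the message is illustrative, not taken from the corpus):

```python
import gzip

# A short IRC-style message, roughly the Ubuntu-IRC average size.
msg = b"anyone know why apt-get update fails with 404 on the security repo"
packed = gzip.compress(msg)

# Deflate finds almost no repeats in one short line, and the gzip
# container adds 18 bytes of header/trailer on top, so it expands.
assert len(packed) > len(msg)
ratio = len(msg) / len(packed)  # below 1.0x
```

A domain-trained vocabulary sidesteps both problems at once: it carries no per-message container overhead, and its patterns were learned before the message arrived.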
What I Got Wrong
The window update bug. I added per-chunk window updates inside the batch compression loop. This forced the Aho-Corasick automaton to rebuild every 10th chunk -- a 35x latency regression that took days to find. Fix: update the vocabulary once per batch, not once per chunk.
Fake benchmarks. Early on I tested with synthetic text -- repeated sentences. The ratios looked incredible. When I switched to real corpus data, they collapsed. Real data is the only benchmark that matters.
The extraction bug. I spent weeks thinking WildChat compressed at 2.83x. It was actually 2.13x. The difference: my extraction script used str(turn) instead of turn.get('content',''), including JSON metadata in every message. 89 bytes of {'language': 'English', 'redacted': False} per turn, compressing beautifully. Real numbers only come from real data extracted correctly.
Dictionary quality. My first trained vocabulary included spam patterns -- ubuntu repeated 35 times, Spotify API strings. These filled vocabulary slots with noise. Proper filtering was essential.
What Networking Taught Me That CS Didn't
Think in bytes, not objects. Every compression decision comes down to: how many bytes does this cost vs save? Networking trains you to count bytes obsessively.
Latency is not throughput. A compression algorithm that achieves 3x ratio but takes 50ms per message is useless for real-time applications. GN's p50 is 0.007ms per chunk.
The protocol mindset. GN's frame format is just a protocol. I designed it the same way I'd design a packet format: minimize overhead, make it parseable without side information, handle edge cases at the boundary.
Real systems fail in weird ways. My Node.js native addon path was 12x slower than my Python path on identical data. Finding this required the same methodical elimination I use to debug network issues.
Where It Is Now
GN is deployed in my OpenClaw setup, compressing conversation context before storage and retrieval.
Architecture:
· L0: 20,000 static entries trained on LLM corpora
· Sliding window vocabulary: adapts continuously to session content
· GCdict: conversation history as deflate preset dictionary
· Split-stream: independent tok/lit compression
· Node.js and Python bindings (napi + PyO3)
A paper (GN: Domain-Adaptive Lossless Compression for LLM Conversation Streams) is available at github.com/atomsrkuul/glasik-core.
What I'd Tell Someone Starting Out
You don't need a CS degree to build real systems. You need:
A real problem. Not a tutorial project. Something that costs you actual money or time.
The willingness to read papers. You don't need to understand the proofs. You need to understand the idea.
Real data benchmarks from day one. Never test on synthetic data. It will lie to you.
Obsessive measurement. If you can't measure it, you can't improve it. Instrument everything.
The networking mindset. Think in bytes. Count everything. Every abstraction has a cost.
The A+ cert taught me that computers are physical things moving electrons around. That turned out to be exactly the right mental model for building a compression algorithm.
GN is open source at github.com/atomsrkuul/glasik-core (MIT).
Robert Rider - Independent Researcher