<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Buffer Overflow</title>
    <description>The latest articles on DEV Community by Buffer Overflow (@atomsrkuul).</description>
    <link>https://dev.to/atomsrkuul</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3859861%2F4ddd8a66-6ad8-44ed-b41e-4d793e193bdd.png</url>
      <title>DEV Community: Buffer Overflow</title>
      <link>https://dev.to/atomsrkuul</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/atomsrkuul"/>
    <language>en</language>
    <item>
      <title>GN: Domain-Adaptive Lossless Compression for LLM Conversation Streams</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Wed, 15 Apr 2026 19:59:59 +0000</pubDate>
      <link>https://dev.to/atomsrkuul/gn-domain-adaptive-lossless-compression-for-llm-conversation-streams-4dp3</link>
      <guid>https://dev.to/atomsrkuul/gn-domain-adaptive-lossless-compression-for-llm-conversation-streams-4dp3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;PDF version:&lt;/strong&gt; github.com/atomsrkuul/glasik-core/blob/master/GN_PAPER_V2.pdf  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; github.com/atomsrkuul/glasik-core (MIT)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Robert Rider | Independent Researcher &lt;br&gt;
github.com/atomsrkuul/glasik-core (MIT)&lt;/p&gt;


&lt;h2&gt;
  
  
  Abstract
&lt;/h2&gt;

&lt;p&gt;I present GN (Glasik Notation), a domain-adaptive lossless compression system for LLM conversation streams. GN maintains a persistent sliding window vocabulary updated continuously across compression calls, exploiting cross-chunk redundancy in real-world LLM workloads.&lt;/p&gt;

&lt;p&gt;I introduce GCdict (GN Context Dictionary), a novel technique that uses conversation history as a preset dictionary for deflate compression of the literal stream residue. GCdict exploits the self-referential nature of LLM conversations and beats brotli per-message on all five evaluation corpora, including +30.8% on real Claude conversations.&lt;/p&gt;

&lt;p&gt;Verified across four public datasets (ShareGPT, WildChat, LMSYS-Chat, Ubuntu IRC):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GN beats gzip-6 on all corpora&lt;/li&gt;
&lt;li&gt;GN beats brotli-6 by +47% on Ubuntu-IRC (67B avg messages, 72 measurements)&lt;/li&gt;
&lt;li&gt;GN beats brotli per-message on ALL corpora: Claude +30.8%, IRC +62.7%, ShareGPT +13.3%, LMSYS +12.7%, WildChat +2.4%&lt;/li&gt;
&lt;li&gt;p50 latency 0.007ms per chunk&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Large language model deployments generate vast quantities of structured text: conversation histories, retrieved context, agent memory. These workloads share a distinctive statistical structure: conversations from the same deployment reuse vocabulary, formatting conventions, and domain-specific terminology.&lt;/p&gt;

&lt;p&gt;General-purpose compressors compress each document in isolation. They cannot exploit cross-document redundancy because they maintain no state between compression calls. Each conversation turn is compressed independently, discarding vocabulary learned from prior turns.&lt;/p&gt;

&lt;p&gt;GN maintains a persistent sliding window vocabulary across compression calls. The window accumulates frequently occurring byte sequences, building a domain-specific dictionary that improves compression monotonically with stream length. Unlike Zstandard offline dictionary training, GN adapts continuously to live data without an offline training step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Primary contributions:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;GN split-stream encoding: Separates token ID stream from literal byte stream, compressing each independently. Beats gzip on all corpora, beats brotli on short messages by up to +62%.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GCdict (GN Context Dictionary): Uses conversation history as a deflate preset dictionary for the literal stream residue. Exploits LLM conversation self-reference to beat brotli per-message on all corpora (verified 32 random seeds).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  2. Architecture
&lt;/h2&gt;
&lt;h3&gt;
  
  
  2.1 Aho-Corasick Tokenizer
&lt;/h3&gt;

&lt;p&gt;The core matching engine is an Aho-Corasick automaton built from the current vocabulary. Matching is a single O(n) pass, independent of dictionary size. Token IDs are assigned in the range 1-254 (u8). The automaton is rebuilt every 50 chunks (cold) or 100 chunks (warm), with an atomic pointer swap ensuring the encode hot path never blocks.&lt;/p&gt;
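&lt;p&gt;As a rough illustration of the matching step, here is a minimal Python sketch: a greedy longest-match walk over a byte trie. This is not the real Aho-Corasick automaton (no failure links, so the inner loop can rescan and worst-case time is not O(n)), and the names are illustrative, not the GN API:&lt;/p&gt;

```python
def build_trie(vocab):
    """vocab maps a byte pattern to its token id (1-254)."""
    root = {}
    for pattern, tid in vocab.items():
        node = root
        for b in pattern:
            node = node.setdefault(b, {})
        node["id"] = tid  # terminal marker for a complete pattern
    return root

def tokenize(data, trie):
    """Split data into (token_ids, literal_bytes) by greedy longest match."""
    token_ids, literals = [], bytearray()
    i, n = 0, len(data)
    while i != n:
        node, best, j = trie, None, i
        while j != n and data[j] in node:
            node = node[data[j]]
            j += 1
            if "id" in node:
                best = (node["id"], j)  # longest match ending here so far
        if best is None:
            literals.append(data[i])  # unmatched byte joins the literal stream
            i += 1
        else:
            token_ids.append(best[0])
            i = best[1]
    return token_ids, bytes(literals)
```

&lt;p&gt;Failure links are what remove the rescans and make the real automaton a single linear pass regardless of vocabulary size.&lt;/p&gt;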
&lt;h3&gt;
  
  
  2.2 Sliding Window Vocabulary (SlidingTokenizerV2)
&lt;/h3&gt;

&lt;p&gt;Maintains up to 20,000 entries across compression calls, tracking byte sequence, cumulative frequency, last-seen batch, and compression saving. New patterns displace lowest-saving stale entries when the window is full.&lt;/p&gt;

&lt;p&gt;This enables the monotonic improvement property: compression ratio increases with stream length as the vocabulary adapts to the domain. A single instance shared across all compression calls enables cross-session vocabulary accumulation.&lt;/p&gt;
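&lt;p&gt;A hedged sketch of that bookkeeping in Python (the actual implementation is Rust; the field names and tie-breaking details here are illustrative):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class VocabEntry:
    pattern: bytes
    freq: int       # cumulative frequency across compression calls
    last_seen: int  # batch index of the most recent occurrence
    saving: int     # estimated bytes saved by encoding this pattern

class SlidingVocab:
    """Persistent vocabulary window shared across compression calls."""

    def __init__(self, capacity=20_000):
        self.capacity = capacity
        self.entries = {}  # pattern bytes mapped to VocabEntry

    def observe(self, pattern, batch, saving):
        e = self.entries.get(pattern)
        if e is not None:
            e.freq += 1
            e.last_seen = batch
            e.saving += saving
        else:
            if len(self.entries) == self.capacity:
                self._evict(batch)
            self.entries[pattern] = VocabEntry(pattern, 1, batch, saving)

    def _evict(self, batch):
        # Displace the lowest-saving stale entry: entries not seen in the
        # current batch sort first, ties broken by smallest saving.
        victim = min(self.entries.values(),
                     key=lambda e: (e.last_seen == batch, e.saving))
        del self.entries[victim.pattern]
```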
&lt;h3&gt;
  
  
  2.3 Split-Stream Encoding
&lt;/h3&gt;

&lt;p&gt;After AC tokenization, GN separates two streams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token ID stream: Pure symbol sequence (IDs 1-254), skewed distribution, compresses at ~9x with deflate&lt;/li&gt;
&lt;li&gt;Literal stream: Unmatched bytes, compresses at ~2x with deflate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Frame format: [2B tok_deflated_len][tok_deflated][lit_deflated]&lt;/p&gt;

&lt;p&gt;Separating streams improves ratio because each has distinct statistical properties. The mixed tokenized stream contains ESCAPE bytes that fragment deflate pattern matching.&lt;/p&gt;
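&lt;p&gt;A minimal Python sketch of this frame layout, with zlib standing in for the deflate path (function names are illustrative, not the GN API):&lt;/p&gt;

```python
import struct
import zlib

def pack_frame(token_ids, literals):
    """[2B tok_deflated_len][tok_deflated][lit_deflated]"""
    tok = zlib.compress(bytes(token_ids))
    lit = zlib.compress(literals)
    # A 2-byte big-endian length prefix is all the decoder needs
    # to split the frame back into its two streams.
    return struct.pack("!H", len(tok)) + tok + lit

def unpack_frame(frame):
    (tok_len,) = struct.unpack_from("!H", frame, 0)
    token_ids = list(zlib.decompress(frame[2:2 + tok_len]))
    literals = zlib.decompress(frame[2 + tok_len:])
    return token_ids, literals
```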
&lt;h3&gt;
  
  
  2.4 GCdict: GN Context Dictionary
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The core insight&lt;/strong&gt;: LLM conversations are self-referential. The literal stream residue contains patterns from prior messages in the same conversation. A debugging session reuses error messages and variable names. A code review reuses function names and patterns. A customer support session reuses product terminology.&lt;/p&gt;

&lt;p&gt;GCdict uses conversation history as a preset dictionary for deflate compression of the literal stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input batch
  -&amp;gt; AC tokenization (GN vocabulary)
  -&amp;gt; split(tok_ids, literals)
  -&amp;gt; tok_stream: deflate (unchanged)
  -&amp;gt; lit_stream: deflate(literals, zdict=history[-32KB])
  -&amp;gt; frame: [2B tok_len][tok_deflated][lit_deflated]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deflate's LZ77 engine, initialized with 32KB of conversation history, finds back-references to prior turns that standard deflate cannot see. This is GN-native -- the same deflate engine with a better-initialized LZ77 window. No brotli internals.&lt;/p&gt;

&lt;p&gt;Both encoder and decoder maintain the same conversation history, making GCdict fully lossless and deterministic.&lt;/p&gt;
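&lt;p&gt;The mechanism maps directly onto zlib's preset-dictionary support. A hedged Python sketch (variable names are illustrative; in GN the input is the literal stream residue, not raw text):&lt;/p&gt;

```python
import zlib

WINDOW = 32 * 1024  # deflate's maximum back-reference distance

def gcdict_compress(literals, history):
    # Seed the LZ77 window with the last 32KB of conversation history.
    co = zlib.compressobj(level=6, zdict=history[-WINDOW:])
    return co.compress(literals) + co.flush()

def gcdict_decompress(blob, history):
    # The decoder holds the same history, so the scheme is stateful
    # but fully lossless and deterministic.
    do = zlib.decompressobj(zdict=history[-WINDOW:])
    return do.decompress(blob) + do.flush()
```

&lt;p&gt;On self-referential input, the dictionary-seeded stream comes out smaller than plain deflate of the same message, which is the effect the per-corpus tables below measure.&lt;/p&gt;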

&lt;p&gt;&lt;strong&gt;Why brotli's static dictionary fails where GCdict succeeds&lt;/strong&gt;: Brotli's 120KB dictionary is trained on web text. On IRC messages (67B avg), brotli achieves 1.17x -- the web-text dictionary has minimal overlap with technical Linux support dialogue. GN's domain-specific vocabulary achieves 2.53x on the same data. Brotli quality=1 (minimal static dict usage) achieves 1.757x -- lower than deflate-9 (1.937x), confirming the static dictionary is brotli's primary advantage, not better LZ77. GCdict replaces the static web-text dictionary with a dynamic conversation-specific dictionary.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Experimental Evaluation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Corpora
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;ShareGPT V3: Real ChatGPT conversations, avg 846B per message&lt;/li&gt;
&lt;li&gt;WildChat: Multi-language LLM conversations, avg 952B per message&lt;/li&gt;
&lt;li&gt;LMSYS-Chat-1M: Chatbot Arena conversations, avg 915B per message&lt;/li&gt;
&lt;li&gt;Ubuntu IRC: Technical Linux support dialogues, avg 67B per message&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Content extracted as clean message text. Hardware: Intel i3-1215U.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 GN Split-Stream Results (b=8, 24 measurements)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Corpus&lt;/th&gt;
&lt;th&gt;GN ratio&lt;/th&gt;
&lt;th&gt;range&lt;/th&gt;
&lt;th&gt;vs gzip&lt;/th&gt;
&lt;th&gt;vs brotli&lt;/th&gt;
&lt;th&gt;p50/batch&lt;/th&gt;
&lt;th&gt;MB/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ShareGPT&lt;/td&gt;
&lt;td&gt;2.484x&lt;/td&gt;
&lt;td&gt;2.422-2.559&lt;/td&gt;
&lt;td&gt;+2.3%&lt;/td&gt;
&lt;td&gt;-4.9%&lt;/td&gt;
&lt;td&gt;0.43ms&lt;/td&gt;
&lt;td&gt;15.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WildChat&lt;/td&gt;
&lt;td&gt;2.130x&lt;/td&gt;
&lt;td&gt;2.088-2.169&lt;/td&gt;
&lt;td&gt;+3.5%&lt;/td&gt;
&lt;td&gt;-7.7%&lt;/td&gt;
&lt;td&gt;0.54ms&lt;/td&gt;
&lt;td&gt;17.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LMSYS&lt;/td&gt;
&lt;td&gt;2.362x&lt;/td&gt;
&lt;td&gt;2.335-2.396&lt;/td&gt;
&lt;td&gt;+1.0%&lt;/td&gt;
&lt;td&gt;-5.2%&lt;/td&gt;
&lt;td&gt;0.39ms&lt;/td&gt;
&lt;td&gt;19.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ubuntu-IRC&lt;/td&gt;
&lt;td&gt;2.534x&lt;/td&gt;
&lt;td&gt;2.384-2.715&lt;/td&gt;
&lt;td&gt;+61.9%&lt;/td&gt;
&lt;td&gt;+47.1%&lt;/td&gt;
&lt;td&gt;0.055ms&lt;/td&gt;
&lt;td&gt;9.3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Per-chunk p50 latency: 0.007ms (0.055ms / 8 chunks).&lt;br&gt;
GN beats gzip on all corpora across all 24 measurements.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Ubuntu-IRC: Verified Dominance (72 measurements)
&lt;/h3&gt;

&lt;p&gt;On 67B average messages, standard compressors essentially fail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gzip-6 per-message: 0.857x (actually expands)&lt;/li&gt;
&lt;li&gt;brotli-6 per-message: 1.138x (barely compresses)&lt;/li&gt;
&lt;li&gt;GN b=8: 2.534x (+47% vs brotli, +62% vs gzip)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Verified across 72 measurements spanning three corpus sizes and multiple seed sets.&lt;br&gt;
Every single measurement is positive vs brotli. Floor: +47%, ceiling: +61%.&lt;/p&gt;

&lt;p&gt;Domain-specific vocabulary explains this: IRC messages about Linux troubleshooting contain &lt;code&gt;sudo apt-get&lt;/code&gt;, &lt;code&gt;/dev/sda&lt;/code&gt;, &lt;code&gt;ubuntu&lt;/code&gt;, &lt;code&gt;terminal&lt;/code&gt; -- patterns GN knows and general-purpose compressors do not.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 Claude Conversations: The Target Corpus
&lt;/h3&gt;

&lt;p&gt;GN was designed for Claude LLM conversations. Tested on 41 real Claude conversations (4841 turns, avg 915B), 16 random seeds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Corpus&lt;/th&gt;
&lt;th&gt;GN cold&lt;/th&gt;
&lt;th&gt;GN warmed&lt;/th&gt;
&lt;th&gt;br/msg&lt;/th&gt;
&lt;th&gt;vs br/msg&lt;/th&gt;
&lt;th&gt;range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude convos&lt;/td&gt;
&lt;td&gt;2.305x&lt;/td&gt;
&lt;td&gt;2.766x&lt;/td&gt;
&lt;td&gt;2.115x&lt;/td&gt;
&lt;td&gt;+30.8%&lt;/td&gt;
&lt;td&gt;30.6-31.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GN beats brotli per-message by +30.8% on real Claude data. The variance across 16 random seeds is only 30.6-31.2% -- the advantage is structural, not noise. Versus brotli per-batch: +15.9%.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.5 GCdict Results: All Public Corpora (32 random seeds, all_positive=true)
&lt;/h3&gt;

&lt;p&gt;Production comparison: GCdict vs brotli per-message.&lt;br&gt;
In production LLM streaming, messages arrive one at a time.&lt;br&gt;
GN accumulates session history. Per-message brotli does not.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Corpus&lt;/th&gt;
&lt;th&gt;GN cold&lt;/th&gt;
&lt;th&gt;GN warmed&lt;/th&gt;
&lt;th&gt;br/msg&lt;/th&gt;
&lt;th&gt;vs br/msg&lt;/th&gt;
&lt;th&gt;range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ShareGPT&lt;/td&gt;
&lt;td&gt;2.513x&lt;/td&gt;
&lt;td&gt;2.765x&lt;/td&gt;
&lt;td&gt;2.441x&lt;/td&gt;
&lt;td&gt;+13.3%&lt;/td&gt;
&lt;td&gt;12.1-14.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WildChat&lt;/td&gt;
&lt;td&gt;2.115x&lt;/td&gt;
&lt;td&gt;2.265x&lt;/td&gt;
&lt;td&gt;2.212x&lt;/td&gt;
&lt;td&gt;+2.4%&lt;/td&gt;
&lt;td&gt;0.9-3.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LMSYS&lt;/td&gt;
&lt;td&gt;2.354x&lt;/td&gt;
&lt;td&gt;2.577x&lt;/td&gt;
&lt;td&gt;2.287x&lt;/td&gt;
&lt;td&gt;+12.7%&lt;/td&gt;
&lt;td&gt;10.8-14.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IRC&lt;/td&gt;
&lt;td&gt;1.708x&lt;/td&gt;
&lt;td&gt;1.925x&lt;/td&gt;
&lt;td&gt;1.184x&lt;/td&gt;
&lt;td&gt;+62.7%&lt;/td&gt;
&lt;td&gt;59.9-64.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GN beats brotli per-message on ALL 4 public corpora, ALL 32 seeds, zero exceptions.&lt;br&gt;
WildChat minimum: +0.9% -- never negative.&lt;/p&gt;

&lt;p&gt;vs brotli per-batch (same context, best case for brotli):&lt;br&gt;
ShareGPT +3.8% (all positive), WildChat -1.3% (near tie), LMSYS +3.6%, IRC +11.7%&lt;/p&gt;

&lt;h3&gt;
  
  
  3.6 Literal Stream Analysis
&lt;/h3&gt;

&lt;p&gt;The literal stream (unmatched bytes) is the primary compression challenge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Literal stream = 91-95% of input on longer messages&lt;/li&gt;
&lt;li&gt;Deflate compresses literals at 1.937x&lt;/li&gt;
&lt;li&gt;Brotli compresses literals at 2.089x (7.8% gap)&lt;/li&gt;
&lt;li&gt;Brotli quality=1 on literals: 1.757x -- lower than deflate-9 (1.937x)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This confirms brotli's static dictionary is its primary advantage. GCdict provides&lt;br&gt;
a conversation-specific replacement that outperforms the web-text static dict.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Related Work
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LZ77 and Deflate&lt;/strong&gt;: Ziv &amp;amp; Lempel 1977. Deflate (RFC 1951) combines LZ77 with Huffman coding. Fixed 32KB window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brotli&lt;/strong&gt; (RFC 7932): Adds 120KB static dictionary and context modeling. Dictionary fixed at specification time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zstandard&lt;/strong&gt;: Offline dictionary training. Dictionary static after training. GN achieves adaptation without offline training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Context Compression&lt;/strong&gt;: Token-level methods (LLMLingua, AutoCompressor) are lossy and require model inference. GN is complementary: byte-level, strictly lossless, CPU-only.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GCdict requires conversation history at decode time (stateful)&lt;/li&gt;
&lt;li&gt;Split-stream requires batching (4+ chunks) to amortize overhead&lt;/li&gt;
&lt;li&gt;GN cold start (no session history) trails brotli by 5-8% on longer messages&lt;/li&gt;
&lt;li&gt;WildChat -1.3% vs brotli per-batch (near tie); +2.4% vs brotli per-message (all positive)&lt;/li&gt;
&lt;li&gt;Higher constant overhead than gzip for very small inputs under 200B&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  6. Conclusion
&lt;/h2&gt;

&lt;p&gt;GN provides domain-adaptive compression that improves with conversation length.&lt;br&gt;
GN demonstrates that LLM conversation history is itself a compression resource:&lt;br&gt;
using prior turns as a preset dictionary exploits self-reference that&lt;br&gt;
general-purpose compressors cannot access.&lt;/p&gt;

&lt;p&gt;Key results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GN beats gzip on all corpora, in every measurement&lt;/li&gt;
&lt;li&gt;GN beats brotli per-message on all 5 corpora&lt;/li&gt;
&lt;li&gt;Ubuntu-IRC: +47% vs brotli, 72 measurements, every run positive&lt;/li&gt;
&lt;li&gt;p50 0.007ms per chunk -- negligible latency overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Source: github.com/atomsrkuul/glasik-core (MIT)&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ziv &amp;amp; Lempel (1977). IEEE Trans. Information Theory, 23(3), 337-343.&lt;/li&gt;
&lt;li&gt;Deutsch (1996). DEFLATE. RFC 1951.&lt;/li&gt;
&lt;li&gt;Alakuijala &amp;amp; Szabadka (2016). Brotli. RFC 7932.&lt;/li&gt;
&lt;li&gt;Collet (2016). Zstandard. RFC 8878.&lt;/li&gt;
&lt;li&gt;Zhao et al. (2024). WildChat. ICLR.&lt;/li&gt;
&lt;li&gt;Zheng et al. (2023). LMSYS-Chat. NeurIPS.&lt;/li&gt;
&lt;li&gt;Lowe et al. (2015). Ubuntu Dialogue Corpus. SIGDIAL.&lt;/li&gt;
&lt;li&gt;Deletang et al. (2023). Language Modeling Is Compression. arXiv:2309.10668.&lt;/li&gt;
&lt;li&gt;Jiang et al. (2023). LLMLingua. EMNLP.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>rust</category>
      <category>compression</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>I Built a Compression Algorithm That Beats Gzip in 2 Weeks. I Have an A+ Cert.</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Wed, 15 Apr 2026 19:20:57 +0000</pubDate>
      <link>https://dev.to/atomsrkuul/i-built-a-compression-algorithm-that-beats-gzip-i-have-an-a-cert-53e6</link>
      <guid>https://dev.to/atomsrkuul/i-built-a-compression-algorithm-that-beats-gzip-i-have-an-a-cert-53e6</guid>
      <description>&lt;p&gt;How a networking student ended up writing Rust, beating industry standard compression, and learning more about computers than any classroom taught me.&lt;/p&gt;

&lt;p&gt;I was messing around with LLM APIs. Claude, GPT, the usual. And I kept hitting the same wall: context windows cost money. Every token you send costs. Every token the model reads costs. If you're building anything serious on top of these APIs (agents, memory systems, anything that needs conversation history), you're constantly fighting the token budget.&lt;/p&gt;

&lt;p&gt;The obvious solution is compression. Compress your context before sending it, decompress on the way back. But here's the thing nobody talks about: standard compression algorithms weren't built for LLM data. Gzip was designed for web assets. Brotli for HTTP. Neither was tuned for the specific patterns that dominate LLM conversations: repeated phrases, structured JSON, code snippets, the same function names appearing dozens of times across a session.&lt;/p&gt;

&lt;p&gt;I thought: what if I built something specifically for this?&lt;br&gt;
I have an A+ cert. I'm in school for networking. I had never written Rust. I had never implemented a string matching algorithm. I didn't know what Aho-Corasick was.&lt;/p&gt;

&lt;p&gt;Two weeks later, GN (Glasik Notation) was compressing LLM data. Six months of refinement later, it beats gzip on every corpus tested, beats brotli by 47% on short messages, and beats brotli per-message on every corpus: +30.8% on real Claude conversations, up to +62.7% on short IRC messages. Here's how it happened.&lt;/p&gt;

&lt;h2&gt;What I Knew Coming In&lt;/h2&gt;

&lt;p&gt;Networking teaches you to think in bytes. Packets, headers, payloads. MTU limits. Bandwidth vs latency tradeoffs. When you spend time thinking about why a TCP packet is structured the way it is, you develop an intuition for data representation that a lot of software developers never get.&lt;/p&gt;

&lt;p&gt;I also knew that compression, at its core, is just finding patterns and encoding them more efficiently. ZIP finds repeated byte sequences. Huffman coding gives frequent symbols shorter codes. That's it. The magic is in how cleverly you find patterns and how efficiently you encode them.&lt;/p&gt;

&lt;p&gt;What I didn't know: how to implement any of this. So I started reading.&lt;/p&gt;

&lt;h2&gt;The First Attempt: Embarrassingly Bad&lt;/h2&gt;

&lt;p&gt;My first compression attempt was a lookup table. I manually collected common LLM phrases and replaced them with single bytes. Classic dictionary substitution.&lt;/p&gt;

&lt;p&gt;The ratio? About 1.1x on good days. Gzip does 2.4x. I was losing badly.&lt;/p&gt;

&lt;p&gt;But it taught me something important: the dictionary matters more than the algorithm. If you have the right patterns, even simple substitution works. If you have the wrong patterns, nothing saves you.&lt;/p&gt;

&lt;h2&gt;Learning Aho-Corasick&lt;/h2&gt;

&lt;p&gt;The problem with simple dictionary lookup is speed. If you have 20,000 patterns and a 2,000-byte message, naively checking every pattern at every position is O(n x m x k): far too slow to be useful.&lt;/p&gt;

&lt;p&gt;Aho-Corasick builds a finite automaton from all your patterns simultaneously. You scan the input once, left to right, and the automaton tells you which patterns match at each position. Linear time regardless of how many patterns you have.&lt;/p&gt;

&lt;p&gt;This is where my networking background helped. I already understood finite state machines from studying how network protocols work: TCP state diagrams, regex engines in firewalls, etc. Aho-Corasick was the same concept applied to string matching.&lt;/p&gt;

&lt;p&gt;I implemented it in Rust. Its type system forced me to think clearly about ownership and memory, things networking had taught me to care about but that most high-level languages hide.&lt;/p&gt;

&lt;h2&gt;The Vocabulary Problem&lt;/h2&gt;

&lt;p&gt;An AC automaton is only as good as its vocabulary. I needed to learn what patterns actually appear in LLM conversations.&lt;/p&gt;

&lt;p&gt;I downloaded four public datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ShareGPT v3: Real ChatGPT conversation turns, avg 846B per message&lt;/li&gt;
&lt;li&gt;WildChat: Multi-language LLM conversations, avg 952B per message&lt;/li&gt;
&lt;li&gt;LMSYS-Chat: Academic LLM benchmark data, avg 915B per message&lt;/li&gt;
&lt;li&gt;Ubuntu IRC: Technical support dialogues, avg 67B per message&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wrote a sliding window tokenizer that learns patterns from the data. After training on 500k chunks, the L0 vocabulary had 20,000 entries covering the most common LLM patterns. The top entries: 8-space indentation, " the", "and ", "ing ", paragraph breaks, common function names, JSON structure.&lt;/p&gt;

&lt;h2&gt;Split-Stream Architecture: The Key Insight&lt;/h2&gt;

&lt;p&gt;After AC matching, you have two streams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token IDs: Single bytes representing matched patterns (1-254)&lt;/li&gt;
&lt;li&gt;Literals: The raw bytes that didn't match any pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These two streams have completely different statistical properties. Token IDs are low-entropy: a small alphabet of 254 symbols, highly repetitive. Literals are higher-entropy: the unique residue. By deflating them separately and concatenating the compressed streams, each gets compressed optimally for its own statistics. The token stream deflates at 9.4x. The literal stream deflates at 2x.&lt;/p&gt;

&lt;p&gt;The frame format:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;[2B token_stream_length][token_stream_deflated][literal_stream_deflated]&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;The GCdict Breakthrough&lt;/h2&gt;

&lt;p&gt;This is the part that surprised me most.&lt;/p&gt;

&lt;p&gt;After getting the split-stream working, GN was still trailing brotli by 5-8% on longer messages. I spent weeks trying vocabulary expansion, LZ77 pre-processing, ANS entropy coding. Nothing moved the needle enough.&lt;/p&gt;

&lt;p&gt;Then I asked a different question: what does brotli know that GN doesn't?&lt;/p&gt;

&lt;p&gt;Brotli has a 120KB static dictionary trained on web text. I proved this was its primary advantage -- brotli at quality=1 (minimal static dictionary usage) actually loses to deflate-9. The dictionary is the advantage, not better LZ77.&lt;/p&gt;

&lt;p&gt;So I needed GN's equivalent. But instead of a static web-text dictionary, I used the conversation history itself.&lt;/p&gt;

&lt;p&gt;LLM conversations are self-referential. When someone asks about a bug in message 5, the same variable names appeared in message 2. A debugging session reuses error messages. A code review reuses function names. The literal stream residue (the bytes GN's AC tokenizer didn't match) contains fragments of prior messages in the same conversation.&lt;/p&gt;

&lt;p&gt;GCdict (GN Context Dictionary): use the last 32KB of conversation history as a preset dictionary for deflate compression of the literal stream.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AC tokenization -&amp;gt; split(tok_ids, literals)
  -&amp;gt; tok_stream: deflate (unchanged)
  -&amp;gt; lit_stream: deflate(literals, zdict=history[-32KB])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Same deflate engine. Better-initialized LZ77 window. No brotli internals. Fully lossless -- both encoder and decoder have the same history.&lt;/p&gt;

&lt;p&gt;The result: GN beats brotli per-message on every corpus tested.&lt;/p&gt;

&lt;h2&gt;The Numbers&lt;/h2&gt;

&lt;p&gt;24 independent measurements per corpus for split-stream. 32 random seeds for GCdict.&lt;/p&gt;

&lt;p&gt;GN Split-Stream (b=8):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Corpus&lt;/th&gt;&lt;th&gt;GN ratio&lt;/th&gt;&lt;th&gt;vs gzip&lt;/th&gt;&lt;th&gt;vs brotli&lt;/th&gt;&lt;th&gt;p50&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;ShareGPT (846B avg)&lt;/td&gt;&lt;td&gt;2.484x&lt;/td&gt;&lt;td&gt;+2.3%&lt;/td&gt;&lt;td&gt;-4.9%&lt;/td&gt;&lt;td&gt;0.43ms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;WildChat (952B avg)&lt;/td&gt;&lt;td&gt;2.130x&lt;/td&gt;&lt;td&gt;+3.5%&lt;/td&gt;&lt;td&gt;-7.7%&lt;/td&gt;&lt;td&gt;0.54ms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;LMSYS (915B avg)&lt;/td&gt;&lt;td&gt;2.362x&lt;/td&gt;&lt;td&gt;+1.0%&lt;/td&gt;&lt;td&gt;-5.2%&lt;/td&gt;&lt;td&gt;0.39ms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Ubuntu-IRC (67B avg)&lt;/td&gt;&lt;td&gt;2.534x&lt;/td&gt;&lt;td&gt;+61.9%&lt;/td&gt;&lt;td&gt;+47.1%&lt;/td&gt;&lt;td&gt;0.055ms&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Per-chunk p50: 0.007ms.&lt;/p&gt;

&lt;p&gt;GN with GCdict -- vs brotli per-message (production comparison):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Corpus&lt;/th&gt;&lt;th&gt;GN&lt;/th&gt;&lt;th&gt;br/msg&lt;/th&gt;&lt;th&gt;vs brotli&lt;/th&gt;&lt;th&gt;range&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Claude convos (915B)&lt;/td&gt;&lt;td&gt;2.766x&lt;/td&gt;&lt;td&gt;2.115x&lt;/td&gt;&lt;td&gt;+30.8%&lt;/td&gt;&lt;td&gt;30.6-31.2%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;ShareGPT&lt;/td&gt;&lt;td&gt;2.765x&lt;/td&gt;&lt;td&gt;2.441x&lt;/td&gt;&lt;td&gt;+13.3%&lt;/td&gt;&lt;td&gt;12.1-14.6%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;LMSYS&lt;/td&gt;&lt;td&gt;2.577x&lt;/td&gt;&lt;td&gt;2.287x&lt;/td&gt;&lt;td&gt;+12.7%&lt;/td&gt;&lt;td&gt;10.8-14.6%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;WildChat&lt;/td&gt;&lt;td&gt;2.265x&lt;/td&gt;&lt;td&gt;2.212x&lt;/td&gt;&lt;td&gt;+2.4%&lt;/td&gt;&lt;td&gt;0.9-3.4%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;IRC&lt;/td&gt;&lt;td&gt;1.925x&lt;/td&gt;&lt;td&gt;1.184x&lt;/td&gt;&lt;td&gt;+62.7%&lt;/td&gt;&lt;td&gt;59.9-64.5%&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All 32 seeds positive on every corpus. Zero exceptions.&lt;br&gt;
The production comparison is per-message brotli because in real LLM streaming, messages arrive one at a time. GN accumulates session history. Per-message brotli has no context.&lt;/p&gt;

&lt;p&gt;The Ubuntu-IRC result deserves special mention. On 67-byte average messages, gzip achieves 0.857x (actually expands) and brotli achieves 1.138x. GN achieves 2.534x. This isn't close -- it's a different compression tier entirely, because GN's vocabulary is trained on the specific domain.&lt;/p&gt;

&lt;h2&gt;What I Got Wrong&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The window update bug.&lt;/strong&gt; I added per-chunk window updates inside the batch compression loop. This triggered the Aho-Corasick automaton to rebuild at every 10th chunk -- a 35x latency regression that took days to find. Fix: update the vocabulary once per batch, not per chunk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fake benchmarks.&lt;/strong&gt; Early on I tested with synthetic text -- repeated sentences. The ratios looked incredible. When I switched to real corpus data, they collapsed. Real data is the only benchmark that matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The extraction bug.&lt;/strong&gt; I spent weeks thinking WildChat compressed at 2.83x. It was actually 2.13x. The difference: my extraction script used str(turn) instead of turn.get('content',''), including JSON metadata in every message. 89 bytes of {'language': 'English', 'redacted': False} per turn, compressing beautifully. Real numbers only come from real data extracted correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dictionary quality.&lt;/strong&gt; My first trained vocabulary included spam patterns -- ubuntu repeated 35 times, Spotify API strings. These filled vocabulary slots with noise. Proper filtering was essential.&lt;/p&gt;

&lt;h2&gt;What Networking Taught Me That CS Didn't&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Think in bytes, not objects.&lt;/strong&gt; Every compression decision comes down to: how many bytes does this cost vs save? Networking trains you to count bytes obsessively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency is not throughput.&lt;/strong&gt; A compression algorithm that achieves 3x ratio but takes 50ms per message is useless for real-time applications. GN's p50 is 0.007ms per chunk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The protocol mindset.&lt;/strong&gt; GN's frame format is just a protocol. I designed it the same way I'd design a packet format: minimize overhead, make it parseable without side information, handle edge cases at the boundary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real systems fail in weird ways.&lt;/strong&gt; My Node.js native addon path was 12x slower than my Python path on identical data. Finding this required the same methodical elimination I use to debug network issues.&lt;/p&gt;

&lt;h2&gt;Where It Is Now&lt;/h2&gt;

&lt;p&gt;GN is deployed in my OpenClaw setup, compressing conversation context before storage and retrieval.&lt;/p&gt;

&lt;p&gt;Architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;L0: 20,000 static entries trained on LLM corpora&lt;/li&gt;
&lt;li&gt;Sliding window vocabulary: adapts continuously to session content&lt;/li&gt;
&lt;li&gt;GCdict: conversation history as deflate preset dictionary&lt;/li&gt;
&lt;li&gt;Split-stream: independent tok/lit compression&lt;/li&gt;
&lt;li&gt;Node.js and Python bindings (napi + PyO3)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A paper (GN: Domain-Adaptive Lossless Compression for LLM Conversation Streams) is available at github.com/atomsrkuul/glasik-core.&lt;/p&gt;

&lt;h2&gt;What I'd Tell Someone Starting Out&lt;/h2&gt;

&lt;p&gt;You don't need a CS degree to build real systems. You need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A real problem.&lt;/strong&gt; Not a tutorial project. Something that costs you actual money or time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The willingness to read papers.&lt;/strong&gt; You don't need to understand the proofs. You need to understand the idea.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real data benchmarks from day one.&lt;/strong&gt; Never test on synthetic data. It will lie to you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Obsessive measurement.&lt;/strong&gt; If you can't measure it, you can't improve it. Instrument everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The networking mindset.&lt;/strong&gt; Think in bytes. Count everything. Every abstraction has a cost.&lt;/p&gt;

&lt;p&gt;The A+ cert taught me that computers are physical things moving electrons around. That turned out to be exactly the right mental model for building a compression algorithm.&lt;/p&gt;

&lt;p&gt;GN is open source at github.com/atomsrkuul/glasik-core (MIT).&lt;/p&gt;

&lt;p&gt;Robert Rider - Independent Researcher&lt;/p&gt;

</description>
      <category>rust</category>
      <category>compression</category>
      <category>algorithms</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The ESCAPE Byte Problem: How I Beat Brotli by Separating Token Streams</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Tue, 14 Apr 2026 00:47:39 +0000</pubDate>
      <link>https://dev.to/atomsrkuul/the-escape-byte-problem-how-we-beat-brotli-by-separating-token-streams-2i6j</link>
      <guid>https://dev.to/atomsrkuul/the-escape-byte-problem-how-we-beat-brotli-by-separating-token-streams-2i6j</guid>
      <description>&lt;p&gt;Part 4 of the Glasik Notation series.&lt;/p&gt;

&lt;p&gt;Previous articles covered the sliding window tokenizer, Aho-Corasick O(n) matching, and GN's first verified benchmarks against gzip.&lt;/p&gt;

&lt;h2&gt;The Waste Was Hidden in Plain Sight&lt;/h2&gt;

&lt;p&gt;After implementing Aho-Corasick O(n) matching, GN was fast. Sub-millisecond per chunk, competitive with brotli on latency. But the ratio numbers kept coming back flat:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gzip-6:   2.18x
GN AC:    2.20x  (+0.9% vs gzip)
brotli-6: 2.47x
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;We were barely beating gzip. Brotli was 12% ahead. The vocabulary was real — 31,248 tokens per 200 chunks, 190 tokens per chunk average. The matches were happening. So where were the bits going?&lt;/p&gt;

&lt;p&gt;We ran a token stream entropy analysis:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter
import math

token_ids = []
for c in sample:
    raw = slider.encode_ac_raw(c)
    i = 0
    while i &amp;lt; len(raw):
        if raw[i] == ESCAPE and i+1 &amp;lt; len(raw):
            token_ids.append(raw[i+1])
            i += 2
        else:
            i += 1

counter = Counter(token_ids)
total = sum(counter.values())
entropy = -sum(c/total * math.log2(c/total) for c in counter.values())
print(f"Token entropy: {entropy:.3f} bits/token")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Result: 7.758 bits/token.&lt;/p&gt;

&lt;p&gt;We were encoding each token as 2 bytes: ESCAPE + id. That's 16 bits per token. The theoretical minimum was 7.758 bits. We were wasting 51.5% of every token encoding. That's where the bits were going.&lt;/p&gt;
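&lt;p&gt;That waste figure is just arithmetic on the two numbers above:&lt;/p&gt;

```python
# Waste of the 2-byte ESCAPE+id encoding vs. the measured token entropy.
entropy_bits = 7.758   # measured bits/token (from the analysis above)
cost_bits = 16         # ESCAPE byte + id byte = 2 bytes per token
waste = 1 - entropy_bits / cost_bits
print(f"{waste:.1%}")  # -> 51.5%
```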

&lt;p&gt;Why the Mixed Stream Was Hurting Us&lt;/p&gt;

&lt;p&gt;Our tokenized output looked like this:&lt;/p&gt;

&lt;p&gt;[ESCAPE][id][ESCAPE][id][lit][lit][lit][ESCAPE][id][lit][ESCAPE][id]...&lt;/p&gt;

&lt;p&gt;Every token costs 2 bytes: an ESCAPE byte (0x01) followed by the ID. We fed this into deflate expecting it to compress well. But deflate uses LZ77 — it looks for repeated byte sequences in a sliding window. The ESCAPE bytes were fragmenting every pattern.&lt;/p&gt;

&lt;p&gt;Where deflate might have seen:&lt;/p&gt;

&lt;p&gt;" the " " the " " the "   ← repeating 5-byte sequence, compresses well&lt;/p&gt;

&lt;p&gt;It was instead seeing:&lt;/p&gt;

&lt;p&gt;[01][04] " t" "he" [01][04] " t" ...   ← ESCAPE bytes breaking the pattern&lt;/p&gt;

&lt;p&gt;The ESCAPE byte was acting like static on a radio signal: present in every token, making the mixed stream look noisier than it actually was.&lt;/p&gt;
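&lt;p&gt;Half of this effect is visible with zeroth-order entropy alone: coding a mixture of two differently-distributed streams can never take fewer bits than coding them separately. A toy illustration on constructed data, not GN's real streams:&lt;/p&gt;

```python
import math
from collections import Counter

def h(data: bytes) -> float:
    """Zeroth-order Shannon entropy in bits/byte."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Toy construction: token IDs interleaved with text via ESCAPE markers.
tids = bytes((i * 7) % 48 + 2 for i in range(300))
text = (b"the quick brown fox " * 60)[:1200]
mixed = b"".join(b"\x01" + tids[i:i + 1] + text[i * 4:(i + 1) * 4] for i in range(300))

mixed_bits = h(mixed) * len(mixed)
split_bits = h(tids) * len(tids) + h(text) * len(text)
# Separating the streams (and dropping ESCAPE bytes) never costs entropy.
assert split_bits <= mixed_bits
```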

&lt;p&gt;The Insight: Separate the Streams&lt;/p&gt;

&lt;p&gt;What if we just... didn't mix them? Instead of one interleaved stream, emit two:&lt;/p&gt;

&lt;p&gt;Token stream: just the IDs — [04][04][38][20][04][07]...&lt;br&gt;
Literal stream: just the literal bytes — "t" "h" "e" " " "a" ...&lt;/p&gt;

&lt;p&gt;Then compress each independently with raw deflate.&lt;br&gt;
The token stream is pure symbols. Token ID 4 (" the") fires 483 times in 200 chunks. That's a highly skewed distribution — deflate loves it. The literal stream is clean text with no ESCAPE pollution. It compresses the way text is supposed to compress.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;toks, lits = slider.encode_ac_split(chunk)
dt = zlib.compress(toks, 6)[2:-4]  # strip zlib header/trailer: raw deflate
dl = zlib.compress(lits, 6)[2:-4]
frame = struct.pack('&amp;gt;H', len(dt)) + dt + dl
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This is the same insight behind why PNG separates prediction from entropy coding, why video codecs separate motion vectors from residual — when you have structurally different data, compress the structures separately.&lt;/p&gt;

&lt;p&gt;The Numbers&lt;/p&gt;

&lt;p&gt;We ran this across 4 corpora, 3 seeds each — 12 independent measurements. Standard protocol: warm 500 chunks, test the next 300.&lt;/p&gt;

&lt;p&gt;Batch size matters. Each chunk has ~37 token IDs. Deflate header overhead (~18 bytes) dominates a tiny stream. Batching solves this — concatenate N chunks before compressing the token stream:&lt;/p&gt;

&lt;p&gt;GN split b=1:   2.226x   0.043ms   -6.6% vs brotli   ← header overhead dominates&lt;br&gt;
GN split b=4:   2.385x   0.036ms   +0.1% vs brotli   ← already matching brotli&lt;br&gt;
GN split b=8:   2.456x   0.036ms   +3.1% vs brotli   ← production sweet spot&lt;br&gt;
GN split b=16:  2.542x   0.037ms   +6.7% vs brotli   ← diminishing returns&lt;/p&gt;

&lt;p&gt;b=8 is the production choice. Beyond b=16 the marginal gain flattens and you're accumulating more latency budget than the ratio improvement justifies.&lt;/p&gt;
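&lt;p&gt;The b=1 penalty is pure per-stream overhead. A toy chunk makes it visible:&lt;/p&gt;

```python
import zlib

# Toy token-ID chunk, ~37 IDs with heavy repetition (not real GN output).
chunk = bytes([4, 4, 38, 32, 4, 7, 9, 4] * 5)[:37]

# b=1: eight frames, each paying deflate's per-stream overhead.
separate = sum(len(zlib.compress(chunk, 6)[2:-4]) for _ in range(8))
# b=8: one frame, overhead paid once and cross-chunk matches become available.
batched = len(zlib.compress(chunk * 8, 6)[2:-4])
assert batched < separate
```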

&lt;p&gt;Full 12-measurement verification at b=8:&lt;br&gt;
Corpus      GN split b=8   vs gzip   vs brotli   p50       p99&lt;br&gt;
ShareGPT    2.49–2.52x     +15%      +2%         0.043ms   0.061ms&lt;br&gt;
WildChat    2.48–2.51x     +15%      +2%         0.042ms   0.073ms&lt;br&gt;
LMSYS       2.50–2.56x     +14%      +2%         0.044ms   0.079ms&lt;br&gt;
Ubuntu-IRC  2.06–2.09x     +49%      +28%        0.008ms   0.013ms&lt;/p&gt;

&lt;p&gt;Every single measurement beats both gzip and brotli.&lt;br&gt;
And on tail latency: GN split b=8 p99 never exceeds 0.123ms. Brotli-6 p99 reaches 0.226ms. GN has 2–4x better tail latency than brotli while achieving better compression ratio.&lt;/p&gt;

&lt;p&gt;Why This Works (The Information Theory)&lt;/p&gt;

&lt;p&gt;The mixed tokenized stream had:&lt;/p&gt;

&lt;p&gt;Token entropy: 7.758 bits/token&lt;br&gt;
Encoding cost: 16 bits/token&lt;br&gt;
Waste: 51.5%&lt;/p&gt;

&lt;p&gt;The split stream:&lt;/p&gt;

&lt;p&gt;Token stream: pure symbols, deflate compresses ~2–3x on its own&lt;br&gt;
Literal stream: clean text, no structural noise, deflate compresses ~1.9x&lt;/p&gt;

&lt;p&gt;Combined result: 2.49–2.56x on the original input&lt;/p&gt;

&lt;p&gt;The separation lets each compressor do what it was designed to do. This isn't a trick — it's giving deflate the data structure it can actually exploit.&lt;/p&gt;

&lt;p&gt;The Frame Format&lt;/p&gt;

&lt;p&gt;Simple and self-contained:&lt;/p&gt;

&lt;p&gt;[2B tok_deflated_len][tok_deflated][lit_deflated]&lt;/p&gt;

&lt;p&gt;Two bytes of length prefix for the token stream, then the two compressed streams concatenated. Given the vocabulary, a frame can be decoded without any other external state.&lt;/p&gt;

&lt;p&gt;The Rust implementation in codon.rs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pub fn encode_ac_split(buf: &amp;amp;[u8], ac: &amp;amp;AhoCorasick) -&amp;gt; (Vec&amp;lt;u8&amp;gt;, Vec&amp;lt;u8&amp;gt;) {
    let mut tok_ids: Vec&amp;lt;u8&amp;gt; = Vec::new();
    let mut literals: Vec&amp;lt;u8&amp;gt; = Vec::new();
    let mut pos = 0usize;

    for m in ac.find_iter(buf) {
        // bytes between matches are literals
        literals.extend_from_slice(&amp;amp;buf[pos..m.start()]);
        let pat_idx = m.pattern().as_usize();
        if pat_idx &amp;lt; 254 {
            tok_ids.push((pat_idx + 1) as u8);
        } else {
            // id would not fit in one byte: emit the matched bytes as literals
            literals.extend_from_slice(&amp;amp;buf[m.start()..m.end()]);
        }
        pos = m.end();
    }
    literals.extend_from_slice(&amp;amp;buf[pos..]);
    (tok_ids, literals)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;O(n) scan, single pass, clean split.&lt;/p&gt;

&lt;p&gt;Lossless Round-Trip&lt;/p&gt;

&lt;p&gt;The split stream is lossless when encoder and decoder share the same vocabulary. Token IDs are indices — to decode, you need to know what pattern each ID maps to.&lt;/p&gt;

&lt;p&gt;GN uses a stateful model in production. Encoder and decoder share a synchronized sliding window; each frame carries a 2-byte dict_version. If they diverge, the decoder requests a resync. This keeps frames small while guaranteeing correctness.&lt;/p&gt;

&lt;p&gt;Round-trip verified: 5/5 test cases pass, including empty buffers, raw ESCAPE bytes in the input, and 10,000-byte repetitive inputs.&lt;/p&gt;

&lt;p&gt;What's Next: Fractal Dictionary Sharding&lt;/p&gt;

&lt;p&gt;The split-stream insight revealed something deeper: token and literal streams have fundamentally different statistical structure. Taking that further — different types of content have different vocabulary entirely.&lt;br&gt;
Code blocks repeat function, return, const. System messages repeat role definitions. User messages repeat question structures. Compressing them with a single shared vocabulary leaves ratio on the table.&lt;br&gt;
We're implementing fractal dictionary sharding: four vocabulary tiers (L0 universal, L1 domain, L2 session, L3 chunk) with per-shard-type routing and deterministic crystal identity per shard — same content always produces the same compressed shape. The FractalCompressor is implemented, wired into the napi production path, and passing roundtrip verification across all shard types.&lt;br&gt;
More on that in Article 5.&lt;/p&gt;

&lt;p&gt;Code and Paper&lt;/p&gt;

&lt;p&gt;GitHub: github.com/atomsrkuul/glasik-core (MIT)&lt;br&gt;
npm: gni-compression@1.0.0&lt;br&gt;
arXiv: pending cs.IR endorsement — if you're a qualified author (3+ cs papers): code 7HWUBA&lt;/p&gt;

&lt;p&gt;Robert Rider is an independent researcher building Glasik, an open-source compression and context management system for LLM deployments.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>compression</category>
      <category>algorithms</category>
      <category>llm</category>
    </item>
    <item>
      <title>GN Beats Gzip and Brotli: How a Learning Sliding Window Outperforms Static Compressors</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Tue, 07 Apr 2026 19:14:49 +0000</pubDate>
      <link>https://dev.to/atomsrkuul/gn-beats-gzip-and-brotli-how-a-learning-sliding-window-outperforms-static-compressors-2dg8</link>
      <guid>https://dev.to/atomsrkuul/gn-beats-gzip-and-brotli-how-a-learning-sliding-window-outperforms-static-compressors-2dg8</guid>
      <description>&lt;p&gt;When we published our last article, GN was within 10% of gzip on LLM conversation data. We said the remaining gap was in the entropy backend. We were wrong about the solution — but right about the problem.&lt;br&gt;
This week GN beats gzip on every corpus we tested. And on all three corpora, it beats brotli.&lt;br&gt;
Here is what we learned.&lt;/p&gt;

&lt;p&gt;The ANS Dead End&lt;/p&gt;

&lt;p&gt;Our first instinct was to improve the entropy coder. Gzip uses Huffman coding; zstd uses ANS (Asymmetric Numeral Systems). We implemented byte-renorm ANS, bit-renorm ANS, and Order-1 ANS from scratch in Rust.&lt;/p&gt;

&lt;p&gt;Results on ShareGPT:&lt;br&gt;
Codec   Ratio&lt;br&gt;
gzip-6  2.082x&lt;br&gt;
byte-ANS    1.233x&lt;br&gt;
bit-ANS 1.212x&lt;br&gt;
O1-ANS  0.551x&lt;/p&gt;

&lt;p&gt;ANS without an LZ-style preprocessing pass is worse than gzip, every time. The reason is fundamental: entropy coders compress symbol frequency distributions, but gzip's real advantage comes from LZ77 — the sliding window that eliminates repeated byte sequences before entropy coding runs. ANS cannot fix what LZ77 needs to do first. We kept ANS in the codebase as a primitive for future work and moved on.&lt;/p&gt;

&lt;p&gt;The Real Problem: Per-Frame Dictionary Overhead&lt;/p&gt;

&lt;p&gt;GN has a sliding window tokenizer — it learns domain vocabulary across batches and compresses using that vocabulary. But there was a critical architectural flaw: the dictionary was serialized into every compressed frame.&lt;/p&gt;

&lt;p&gt;200 entries × ~10 bytes = ~2KB overhead per chunk. On 500-byte chunks, the dictionary cost more than the compression saved.&lt;/p&gt;

&lt;p&gt;v1 on 1000 LLM chunks: 0.502x (expanding the data).&lt;/p&gt;

&lt;p&gt;The fix: stop putting the dictionary in the frame. Keep it in shared state and reference it by version number. This is exactly how brotli's static dictionary and zstd's dictionary mode work.&lt;/p&gt;

&lt;p&gt;Frame v1: magic + full_dictionary + payload  (~2KB overhead)&lt;/p&gt;

&lt;p&gt;Frame v2: magic + dict_version(4 bytes) + payload  (8 bytes overhead)&lt;/p&gt;
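&lt;p&gt;Back-of-the-envelope numbers show why v1 expanded the data. The 250-byte compressed payload below is an assumed figure for illustration:&lt;/p&gt;

```python
chunk = 500          # bytes per chunk (from the article)
payload = 250        # assumed: chunk body compresses ~2x before framing
v1_overhead = 2000   # ~200 dictionary entries x ~10 bytes, shipped in every frame
v2_overhead = 8      # magic + 4-byte dict_version

ratio_v1 = chunk / (payload + v1_overhead)
ratio_v2 = chunk / (payload + v2_overhead)
assert ratio_v1 < 1 < ratio_v2   # v1 expands the data; v2 actually compresses
```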

&lt;p&gt;The Corpus Window (Level 2)&lt;/p&gt;

&lt;p&gt;With the overhead fixed, we increased the window to 10,000 entries and made it global — one sliding window shared across all compression calls in the process. Every session, every shard, every conversation feeds the same accumulating vocabulary.&lt;/p&gt;

&lt;p&gt;Results immediately improved:&lt;br&gt;
Corpus  L1 (per-call)   L2 (corpus window)  gzip    brotli&lt;br&gt;
ShareGPT    2.191x  2.402x  2.178x  2.453x&lt;br&gt;
WildChat    2.035x  2.145x  2.025x  2.234x&lt;br&gt;
LMSYS   2.094x  2.231x  2.079x  2.322x&lt;/p&gt;

&lt;p&gt;L2 beats gzip on every corpus. The gap to brotli narrowed to 2-4%.&lt;/p&gt;

&lt;p&gt;Retrieval-Warmed Compression (Level 3)&lt;/p&gt;

&lt;p&gt;The insight: before compressing a new chunk, feed similar prior chunks through the sliding window first. This warms the dictionary with related vocabulary so the new chunk compresses better. The act of retrieval changes the compression state.&lt;/p&gt;

&lt;p&gt;We benchmarked pressurize_k (the number of prior chunks fed through the window for warming) on WildChat — the hardest corpus due to topic diversity:&lt;br&gt;
pressurize_k    L3 ratio    vs brotli&lt;br&gt;
0 (no pressurize)   2.164x  +3.54% gap&lt;br&gt;
1   2.199x  +1.89% gap&lt;br&gt;
2   2.251x  +0.5% ahead&lt;br&gt;
3   2.207x  +1.51% gap&lt;/p&gt;

&lt;p&gt;pressurize_k=2 is optimal for WildChat. For ShareGPT and LMSYS, pressurize_k=3 is optimal.&lt;/p&gt;

&lt;p&gt;The optimal pressurization depth varies by corpus vocabulary diversity — more diverse corpora benefit from shallower pressurization to avoid dictionary dilution.&lt;/p&gt;

&lt;p&gt;Final Results: L3 Beats Brotli on All Three Corpora&lt;/p&gt;

&lt;p&gt;Verified across 3 independent corpora, 3 random seeds each:&lt;br&gt;
Corpus  GN L3   gzip-6  brotli-6    margin&lt;br&gt;
ShareGPT    2.526x  2.145x  2.429x  +4.0% vs brotli&lt;br&gt;
LMSYS   2.401x  2.031x  2.291x  +4.8% vs brotli&lt;br&gt;
WildChat    2.251x  2.023x  2.240x  +0.5% vs brotli&lt;/p&gt;

&lt;p&gt;All three beat gzip by 11-18%. All three beat brotli.&lt;/p&gt;

&lt;p&gt;GN beats gzip on 100% of runs across all seeds and corpora. GN beats brotli on all three corpora when the window is sufficiently warmed.&lt;/p&gt;

&lt;p&gt;Why This Works&lt;/p&gt;

&lt;p&gt;Brotli ships with a 120KB static dictionary of common web phrases. It never changes. GN's sliding window learns the specific vocabulary of your data stream as it runs. LLM conversations have crystalline structure — repeated role markers, prompt scaffolding, tool call formats, JSON patterns, reasoning templates. After seeing a few thousand examples, GN knows these patterns better than any generic dictionary ever could.&lt;/p&gt;

&lt;p&gt;The critical property: GN's compression ratio improves with stream length. Gzip and brotli are static — they cannot improve.&lt;/p&gt;

&lt;p&gt;ShareGPT at 500 chunks:  GN 2.304x  brotli 2.363x  (behind)&lt;/p&gt;

&lt;p&gt;ShareGPT at 2000 chunks: GN 2.440x  brotli 2.436x  (pulls ahead)&lt;/p&gt;

&lt;p&gt;ShareGPT at 5000 chunks: GN 2.517x  brotli 2.429x  (+3.6%)&lt;/p&gt;

&lt;p&gt;The longer GN runs on a domain-specific stream, the wider the gap grows.&lt;/p&gt;

&lt;p&gt;What Comes Next&lt;/p&gt;

&lt;p&gt;The current warming uses sequential proximity — the last N chunks before the current one. The next level uses semantic similarity — retrieve the most topically related prior chunks via embedding search, regardless of when they appeared.&lt;/p&gt;

&lt;p&gt;A conversation about JWT authentication should be warmed by other authentication conversations, not by whatever happened to come before it in the stream. This is Semantic Level 3, and it should further improve results on diverse corpora like WildChat where topic jumps are common.&lt;br&gt;
Beyond that: dictionary compression (compress the dictionary itself, fractal self-similarity), cross-session persistence (window state survives restarts), and pre-trained domain dictionaries (ship a base window trained on 50k LLM conversations).&lt;/p&gt;

&lt;p&gt;The goal is to make GN the brotli of LLMs — purpose-built, measurably better, and invisible infrastructure.&lt;/p&gt;

&lt;p&gt;GN is MIT licensed. Code: github.com/atomsrkuul/glasik-core&lt;/p&gt;

&lt;p&gt;npm: gni-compression@1.0.0&lt;/p&gt;

&lt;p&gt;NLNet NGI Zero Commons Fund application #2026-06-023&lt;/p&gt;

</description>
      <category>compression</category>
      <category>rust</category>
      <category>algorithms</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Within 10% of gzip: What GN’s Semantic Compression Teaches Us</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Sun, 05 Apr 2026 04:20:54 +0000</pubDate>
      <link>https://dev.to/atomsrkuul/within-10-of-gzip-what-gns-semantic-compression-teaches-us-4cp1</link>
      <guid>https://dev.to/atomsrkuul/within-10-of-gzip-what-gns-semantic-compression-teaches-us-4cp1</guid>
      <description>&lt;p&gt;When we first started building, the goal was never to make another gzip clone. Generic compression already does that job incredibly well.&lt;/p&gt;

&lt;p&gt;The real question was different:&lt;/p&gt;

&lt;p&gt;What happens if the compressor understands the shape of the data before it ever starts packing bytes?&lt;/p&gt;

&lt;p&gt;That question led us from the original JavaScript prototype into Glasik Core, a Rust implementation focused on semantic tokenization, rolling vocabulary windows, and domain-aware preprocessing for message and agent streams.&lt;/p&gt;

&lt;p&gt;This week we hit a milestone that feels small on paper but huge architecturally:&lt;/p&gt;

&lt;p&gt;GN is now within 10% of gzip on every benchmark corpus we tested.&lt;/p&gt;

&lt;p&gt;Not better. Not faster. Not “production solved.”&lt;/p&gt;

&lt;p&gt;Just consistently close, which is exactly why this stage is exciting.&lt;/p&gt;

&lt;p&gt;The benchmark reality&lt;/p&gt;

&lt;p&gt;Current corpus results:&lt;/p&gt;

&lt;p&gt;Corpus         Glasik Core   gzip     Relative&lt;br&gt;
MEMORY.md      1.849x        2.075x   89%&lt;br&gt;
ShareGPT-1k    3.752x        3.945x   95%&lt;br&gt;
Ubuntu-IRC-1k  2.122x        2.357x   90%&lt;/p&gt;

&lt;p&gt;The most important one is ShareGPT-1k hitting 95% of gzip. That corpus is extremely close to the data GN was designed for:&lt;/p&gt;

&lt;p&gt;Repeated assistant roles&lt;br&gt;
Prompt scaffolding&lt;br&gt;
Tool formatting&lt;br&gt;
Structured JSON-like patterns&lt;br&gt;
Recurring conversational templates&lt;/p&gt;

&lt;p&gt;Even though we have not passed gzip yet, nearly matching it on LLM-native streams is a strong validation signal.&lt;/p&gt;

&lt;p&gt;Why being close matters more than winning right now&lt;/p&gt;

&lt;p&gt;The remaining gap is not where many would assume. The weak point is not semantic understanding anymore.&lt;/p&gt;

&lt;p&gt;The weak point is the final entropy backend. gzip still has decades of advantage in:&lt;/p&gt;

&lt;p&gt;Huffman tuning&lt;br&gt;
Backreference heuristics&lt;br&gt;
Lazy match parsing&lt;br&gt;
Highly optimized bit packing&lt;br&gt;
Mature DEFLATE edge cases&lt;/p&gt;

&lt;p&gt;That last 5–10% is the part generic compressors are legendary at.&lt;/p&gt;

&lt;p&gt;But the semantic layer is already doing the harder thing: understanding the structure of the stream before compression begins. That’s where the long-term leverage is.&lt;/p&gt;

&lt;p&gt;The real architectural lesson&lt;/p&gt;

&lt;p&gt;The simplest way to explain the difference:&lt;/p&gt;

&lt;p&gt;gzip remembers bytes. GN remembers meaning.&lt;/p&gt;

&lt;p&gt;As the rolling vocabulary fills, repeated structures stop being treated like raw strings and start being treated as stable semantic units. That includes:&lt;/p&gt;

&lt;p&gt;Timestamps&lt;br&gt;
Speaker roles&lt;br&gt;
Repeated tool calls&lt;br&gt;
Theorem blocks&lt;br&gt;
JSON keys&lt;br&gt;
Repeated prompt shells&lt;br&gt;
Agent trace scaffolding&lt;br&gt;
Channel metadata&lt;/p&gt;

&lt;p&gt;Performance improves the longer the stream runs. Instead of relying only on a fixed byte-history window, GN reinforces the vocabulary of the domain itself. That’s the core bet.&lt;/p&gt;
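&lt;p&gt;A toy version of such a rolling vocabulary (a hypothetical structure, far simpler than Glasik Core's):&lt;/p&gt;

```python
import re
from collections import Counter

class RollingVocab:
    """Toy rolling vocabulary window: count tokens, evict the rare ones."""

    def __init__(self, max_entries: int = 200):
        self.max_entries = max_entries
        self.counts = Counter()

    def observe(self, chunk: str) -> None:
        self.counts.update(re.findall(r"\S+", chunk))
        if len(self.counts) > self.max_entries:
            # keep only the most frequent entries when the window overflows
            self.counts = Counter(dict(self.counts.most_common(self.max_entries)))

    def vocabulary(self) -> list:
        return [tok for tok, _ in self.counts.most_common()]
```

&lt;p&gt;Repeated structures like role markers quickly dominate the window while one-off strings get evicted, which is the reinforcement behavior described above.&lt;/p&gt;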

&lt;p&gt;Why Rust changed the debugging loop&lt;/p&gt;

&lt;p&gt;The JavaScript prototype proved the idea. Rust made it possible to trust the measurements.&lt;/p&gt;

&lt;p&gt;One concrete example: during corpus benchmarking we hit a rolling-frequency bug silently inflating token counts over long windows. Compression ratios looked “better,” but only because the vocabulary statistics were wrong.&lt;/p&gt;

&lt;p&gt;The fix only became obvious because Rust forced us to reason explicitly about integer width, overflow behavior, and ownership boundaries inside the rolling state machine.&lt;/p&gt;

&lt;p&gt;Fixing it tightened the corpus results and gave us confidence that the “within 10%” milestone is real, not a measurement artifact. That debugging loop alone justified the rewrite.&lt;/p&gt;

&lt;p&gt;What makes this exciting now&lt;/p&gt;

&lt;p&gt;The missing performance is now localized. We know exactly where the gap is:&lt;/p&gt;

&lt;p&gt;Residual encoding&lt;br&gt;
Entropy refinement&lt;br&gt;
Better state models&lt;br&gt;
Adaptive codon dictionaries&lt;br&gt;
Specialized chat residual codecs&lt;/p&gt;

&lt;p&gt;That is a much better place to be than wondering whether the entire idea works. The semantic layer is clearly competitive. Now it’s about tightening the backend until the semantic advantage outweighs gzip’s entropy maturity.&lt;/p&gt;

&lt;p&gt;What’s next&lt;/p&gt;

&lt;p&gt;Tonight’s most interesting work was deeper in the backend: we now have a reference-safe ANS entropy coder implemented from scratch in Rust, using the same family of techniques that powers zstd.&lt;/p&gt;

&lt;p&gt;The current version uses correctness-first binary renormalization so we can prove round-trip behavior before optimizing. Next step: bit-level state refinement and faster renormalization transforms.&lt;/p&gt;

&lt;p&gt;This work directly targets the exact 5–10% gap the benchmarks are still showing.&lt;/p&gt;

&lt;p&gt;The path forward is finally clear:&lt;/p&gt;

&lt;p&gt;Semantic understanding is already competitive&lt;br&gt;
Entropy packing is the remaining frontier&lt;br&gt;
The architecture now tells us exactly where to push&lt;/p&gt;

&lt;p&gt;At this point, GN (our semantic agent layer) and Glasik Core (the compression engine) feel less like an experiment and more like a real compression architecture.&lt;/p&gt;

</description>
      <category>compression</category>
      <category>rust</category>
      <category>algorithms</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Built Domain Specific Compression for Messages. Here's What I Learned.</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Fri, 03 Apr 2026 17:35:19 +0000</pubDate>
      <link>https://dev.to/atomsrkuul/we-built-domain-specific-compression-for-messages-heres-what-we-learned-2o28</link>
      <guid>https://dev.to/atomsrkuul/we-built-domain-specific-compression-for-messages-heres-what-we-learned-2o28</guid>
      <description>&lt;p&gt;Why gzip loses to custom compression on chat data — and what we learned building a lossless message codec from scratch.&lt;br&gt;
The Problem&lt;br&gt;
Message data is expensive at scale. Discord servers, Slack workspaces, OpenClaw chat logs — each message is ~500 bytes. Generic compression gets 2-3x. That's good but messages have structure generic algorithms ignore.&lt;br&gt;
[2026-04-03T11:00:00Z] user: Hello, can you check the repo?&lt;br&gt;
[2026-04-03T11:00:15Z] bot: Checking repository...&lt;br&gt;
Timestamp format, role prefixes, platform names — these repeat thousands of times. A domain-specific dictionary front-loads that knowledge instead of discovering it slowly.&lt;br&gt;
The Architecture&lt;br&gt;
We built two layers that work together:&lt;br&gt;
GN (Glasik Notation) — semantic compression. Extracts structure before compression, maps repeated values to IDs, recognizes message templates, reduces entropy before any algorithm touches the data.&lt;br&gt;
GNI (Glasik Notation Interface) — transmission codec. Handles serialization, framing, integrity verification, wire protocol.&lt;br&gt;
Together they form a complete pipeline. Neither is useful alone — GN without GNI has no reliable wire protocol, GNI without GN has no semantic advantage over gzip.&lt;br&gt;
What We Shipped: GNI v1&lt;br&gt;
Phase 1 delivers the foundation:&lt;/p&gt;

&lt;p&gt;Canonical binary serialization (varint encoding)&lt;br&gt;
Versioned frame format (backward compatible forever)&lt;br&gt;
CRC32 integrity verification&lt;br&gt;
100% lossless round-trip recovery verified on 2,000+ real messages&lt;br&gt;
Zero external dependencies, 482 lines of JavaScript&lt;/p&gt;

&lt;p&gt;Compression ratios from semantic tokenization are a Phase 2 deliverable. The tokenizer is currently stubbed — Phase 2 implements the domain-specific dictionary that gives GN its advantage.&lt;/p&gt;

&lt;p&gt;The Bug We Caught&lt;/p&gt;

&lt;p&gt;During validation on 1,038,324 real dialogue messages we hit a CRC32 mismatch:&lt;/p&gt;

&lt;p&gt;Stored CRC:   1428394006&lt;br&gt;
Computed CRC: -1889366573&lt;br&gt;
Mismatch!&lt;/p&gt;

&lt;p&gt;The bug: the checksum was computed over (header + payload) instead of just (payload). Three-line fix, proper unsigned arithmetic (&amp;gt;&amp;gt;&amp;gt; 0). Full corpus re-validated in 25 seconds. 6/6 tests passed. Caught before production. Caught by us, not a user. This is why you validate before you ship.&lt;/p&gt;

&lt;p&gt;Try It&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const GNLz4V2 = require('./src/gn-lz4-v2-complete');
const codec = new GNLz4V2();

const messages = [
  { templateId: 0, ts: 1743744000, author: 1, channel: 1, payload: 'hello world' },
  { templateId: 0, ts: 1743744001, author: 2, channel: 1, payload: 'how are you?' }
];

const result = codec.compress(messages);
const recovered = codec.decompress(result.compressed);
console.log(recovered.length + ' messages recovered losslessly');
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;npm test  # 37/37 passing&lt;/p&gt;
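&lt;p&gt;The checksum-scope bug is easy to reproduce. The header layout below is hypothetical, with Python's zlib.crc32 (already unsigned, so no &amp;gt;&amp;gt;&amp;gt; 0 is needed) standing in for the JavaScript implementation:&lt;/p&gt;

```python
import struct
import zlib

header = struct.pack('>BH', 1, 11)   # hypothetical: version byte + payload length
payload = b'hello world'

correct = zlib.crc32(payload)         # checksum over the payload only
buggy = zlib.crc32(header + payload)  # the bug: header included in the scope
assert correct != buggy               # verification fails against the stored CRC
assert zlib.crc32(payload) == correct # with matching scope, round-trip verifies
```

&lt;p&gt;Encoder and decoder must agree on exactly which bytes the checksum covers; a scope mismatch looks like corruption even when every byte arrived intact.&lt;/p&gt;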

&lt;p&gt;What We Are Not Claiming&lt;/p&gt;

&lt;p&gt;Compression ratios are Phase 2. Phase 1 is serialization and framing.&lt;br&gt;
Not production-proven at scale. Validated on our own systems.&lt;br&gt;
No external users yet. Looking for third-party benchmarks and feedback.&lt;/p&gt;

&lt;p&gt;What We Are Claiming&lt;/p&gt;

&lt;p&gt;Solid foundation: lossless, versioned, integrity-verified&lt;br&gt;
Real validation: 1M+ messages, caught our own bugs before release&lt;br&gt;
Clear roadmap: Phase 2 connects GN semantic compression into GNI transmission layer&lt;br&gt;
Applied for NLNet NGI Zero funding (application 2026-06-023) to deliver Phase 2&lt;/p&gt;

&lt;p&gt;Links&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/atomsrkuul/glasik-notation" rel="noopener noreferrer"&gt;https://github.com/atomsrkuul/glasik-notation&lt;/a&gt;&lt;br&gt;
License: MIT&lt;/p&gt;

&lt;p&gt;If you compress messages and want to share results, open an issue. That kind of external validation is what makes this real.&lt;/p&gt;

</description>
      <category>compression</category>
      <category>algorithms</category>
      <category>opensource</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
