<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Buffer Overflow</title>
    <description>The latest articles on DEV Community by Buffer Overflow (@atomsrkuul).</description>
    <link>https://dev.to/atomsrkuul</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3859861%2F4ddd8a66-6ad8-44ed-b41e-4d793e193bdd.png</url>
      <title>DEV Community: Buffer Overflow</title>
      <link>https://dev.to/atomsrkuul</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/atomsrkuul"/>
    <language>en</language>
    <item>
      <title>GN: Domain-Adaptive Lossless Compression for LLM Conversation Streams</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Wed, 15 Apr 2026 19:59:59 +0000</pubDate>
      <link>https://dev.to/atomsrkuul/gn-domain-adaptive-lossless-compression-for-llm-conversation-streams-4dp3</link>
      <guid>https://dev.to/atomsrkuul/gn-domain-adaptive-lossless-compression-for-llm-conversation-streams-4dp3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;PDF version:&lt;/strong&gt; github.com/atomsrkuul/glasik-core/blob/master/GN_PAPER_V2.pdf  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; github.com/atomsrkuul/glasik-core (MIT)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Robert Rider | Independent Researcher &lt;br&gt;
github.com/atomsrkuul/glasik-core (MIT)&lt;/p&gt;


&lt;h2&gt;
  
  
  Abstract
&lt;/h2&gt;

&lt;p&gt;I present GN (Glasik Notation), a domain-adaptive lossless compression system for LLM conversation streams. GN maintains a persistent sliding window vocabulary updated continuously across compression calls, exploiting cross-chunk redundancy in real-world LLM workloads.&lt;/p&gt;

&lt;p&gt;I introduce GCdict (GN Context Dictionary), a novel technique that uses conversation history as a preset dictionary for deflate compression of the literal stream residue. GCdict exploits the self-referential nature of LLM conversations and beats brotli per-message on all five evaluation corpora, including +30.8% on real Claude conversations.&lt;/p&gt;

&lt;p&gt;Verified across four public datasets (ShareGPT, WildChat, LMSYS-Chat, Ubuntu IRC):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GN beats gzip-6 on all corpora&lt;/li&gt;
&lt;li&gt;GN beats brotli-6 by +47% on Ubuntu-IRC (67B avg messages, 72 measurements)&lt;/li&gt;
&lt;li&gt;GN beats brotli per-message on ALL corpora: Claude +30.8%, IRC +62.7%, ShareGPT +13.3%, LMSYS +12.7%, WildChat +2.4%&lt;/li&gt;
&lt;li&gt;p50 latency 0.007ms per chunk&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Large language model deployments generate vast quantities of structured text: conversation histories, retrieved context, agent memory. These workloads share a distinctive statistical structure: conversations from the same deployment reuse vocabulary, formatting conventions, and domain-specific terminology.&lt;/p&gt;

&lt;p&gt;General-purpose compressors compress each document in isolation. They cannot exploit cross-document redundancy because they maintain no state between compression calls. Each conversation turn is compressed independently, discarding vocabulary learned from prior turns.&lt;/p&gt;

&lt;p&gt;GN maintains a persistent sliding window vocabulary across compression calls. The window accumulates frequently occurring byte sequences, building a domain-specific dictionary that improves compression monotonically with stream length. Unlike Zstandard offline dictionary training, GN adapts continuously to live data without an offline training step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Primary contributions:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;GN split-stream encoding: Separates token ID stream from literal byte stream, compressing each independently. Beats gzip on all corpora, beats brotli on short messages by up to +62%.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GCdict (GN Context Dictionary): Uses conversation history as a deflate preset dictionary for the literal stream residue. Exploits LLM conversation self-reference to beat brotli per-message on all corpora (verified 32 random seeds).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  2. Architecture
&lt;/h2&gt;
&lt;h3&gt;
  
  
  2.1 Aho-Corasick Tokenizer
&lt;/h3&gt;

&lt;p&gt;The core matching engine is an Aho-Corasick automaton built from the current vocabulary. Matching is a single O(n) pass, independent of dictionary size. Token IDs are assigned in the range 1-254 (u8). The automaton is rebuilt every 50 chunks (cold) or 100 chunks (warm), with an atomic pointer swap ensuring the encode hot path never blocks.&lt;/p&gt;
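&lt;p&gt;As a rough illustration of the matching step, here is a minimal Python sketch: a greedy longest-match walk over a byte trie. This is not the real Aho-Corasick automaton (no failure links, so the inner loop can rescan and worst-case time is not O(n)), and the names are illustrative, not the GN API:&lt;/p&gt;

```python
def build_trie(vocab):
    """vocab maps a byte pattern to its token id (1-254)."""
    root = {}
    for pattern, tid in vocab.items():
        node = root
        for b in pattern:
            node = node.setdefault(b, {})
        node["id"] = tid  # terminal marker for a complete pattern
    return root

def tokenize(data, trie):
    """Split data into (token_ids, literal_bytes) by greedy longest match."""
    token_ids, literals = [], bytearray()
    i, n = 0, len(data)
    while i != n:
        node, best, j = trie, None, i
        while j != n and data[j] in node:
            node = node[data[j]]
            j += 1
            if "id" in node:
                best = (node["id"], j)  # longest match ending here so far
        if best is None:
            literals.append(data[i])  # unmatched byte joins the literal stream
            i += 1
        else:
            token_ids.append(best[0])
            i = best[1]
    return token_ids, bytes(literals)
```

&lt;p&gt;Failure links are what remove the rescans and make the real automaton a single linear pass regardless of vocabulary size.&lt;/p&gt;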
&lt;h3&gt;
  
  
  2.2 Sliding Window Vocabulary (SlidingTokenizerV2)
&lt;/h3&gt;

&lt;p&gt;Maintains up to 20,000 entries across compression calls, tracking byte sequence, cumulative frequency, last-seen batch, and compression saving. New patterns displace lowest-saving stale entries when the window is full.&lt;/p&gt;

&lt;p&gt;This enables the monotonic improvement property: compression ratio increases with stream length as the vocabulary adapts to the domain. A single instance shared across all compression calls enables cross-session vocabulary accumulation.&lt;/p&gt;
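&lt;p&gt;A hedged sketch of that bookkeeping in Python (the actual implementation is Rust; the field names and tie-breaking details here are illustrative):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class VocabEntry:
    pattern: bytes
    freq: int       # cumulative frequency across compression calls
    last_seen: int  # batch index of the most recent occurrence
    saving: int     # estimated bytes saved by encoding this pattern

class SlidingVocab:
    """Persistent vocabulary window shared across compression calls."""

    def __init__(self, capacity=20_000):
        self.capacity = capacity
        self.entries = {}  # pattern bytes mapped to VocabEntry

    def observe(self, pattern, batch, saving):
        e = self.entries.get(pattern)
        if e is not None:
            e.freq += 1
            e.last_seen = batch
            e.saving += saving
        else:
            if len(self.entries) == self.capacity:
                self._evict(batch)
            self.entries[pattern] = VocabEntry(pattern, 1, batch, saving)

    def _evict(self, batch):
        # Displace the lowest-saving stale entry: entries not seen in the
        # current batch sort first, ties broken by smallest saving.
        victim = min(self.entries.values(),
                     key=lambda e: (e.last_seen == batch, e.saving))
        del self.entries[victim.pattern]
```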
&lt;h3&gt;
  
  
  2.3 Split-Stream Encoding
&lt;/h3&gt;

&lt;p&gt;After AC tokenization, GN separates two streams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token ID stream: Pure symbol sequence (IDs 1-254), skewed distribution, compresses at ~9x with deflate&lt;/li&gt;
&lt;li&gt;Literal stream: Unmatched bytes, compresses at ~2x with deflate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Frame format: [2B tok_deflated_len][tok_deflated][lit_deflated]&lt;/p&gt;

&lt;p&gt;Separating streams improves ratio because each has distinct statistical properties. The mixed tokenized stream contains ESCAPE bytes that fragment deflate pattern matching.&lt;/p&gt;
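&lt;p&gt;A minimal Python sketch of this frame layout, with zlib standing in for the deflate path (function names are illustrative, not the GN API):&lt;/p&gt;

```python
import struct
import zlib

def pack_frame(token_ids, literals):
    """[2B tok_deflated_len][tok_deflated][lit_deflated]"""
    tok = zlib.compress(bytes(token_ids))
    lit = zlib.compress(literals)
    # A 2-byte big-endian length prefix is all the decoder needs
    # to split the frame back into its two streams.
    return struct.pack("!H", len(tok)) + tok + lit

def unpack_frame(frame):
    (tok_len,) = struct.unpack_from("!H", frame, 0)
    token_ids = list(zlib.decompress(frame[2:2 + tok_len]))
    literals = zlib.decompress(frame[2 + tok_len:])
    return token_ids, literals
```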
&lt;h3&gt;
  
  
  2.4 GCdict: GN Context Dictionary
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The core insight&lt;/strong&gt;: LLM conversations are self-referential. The literal stream residue contains patterns from prior messages in the same conversation. A debugging session reuses error messages and variable names. A code review reuses function names and patterns. A customer support session reuses product terminology.&lt;/p&gt;

&lt;p&gt;GCdict uses conversation history as a preset dictionary for deflate compression of the literal stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input batch
  -&amp;gt; AC tokenization (GN vocabulary)
  -&amp;gt; split(tok_ids, literals)
  -&amp;gt; tok_stream: deflate (unchanged)
  -&amp;gt; lit_stream: deflate(literals, zdict=history[-32KB])
  -&amp;gt; frame: [2B tok_len][tok_deflated][lit_deflated]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deflate's LZ77 engine, initialized with 32KB of conversation history, finds back-references to prior turns that standard deflate cannot see. This is GN-native -- the same deflate engine with a better-initialized LZ77 window. No brotli internals.&lt;/p&gt;

&lt;p&gt;Both encoder and decoder maintain the same conversation history, making GCdict fully lossless and deterministic.&lt;/p&gt;
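&lt;p&gt;The mechanism maps directly onto zlib's preset-dictionary support. A hedged Python sketch (variable names are illustrative; in GN the input is the literal stream residue, not raw text):&lt;/p&gt;

```python
import zlib

WINDOW = 32 * 1024  # deflate's maximum back-reference distance

def gcdict_compress(literals, history):
    # Seed the LZ77 window with the last 32KB of conversation history.
    co = zlib.compressobj(level=6, zdict=history[-WINDOW:])
    return co.compress(literals) + co.flush()

def gcdict_decompress(blob, history):
    # The decoder holds the same history, so the scheme is stateful
    # but fully lossless and deterministic.
    do = zlib.decompressobj(zdict=history[-WINDOW:])
    return do.decompress(blob) + do.flush()
```

&lt;p&gt;On self-referential input, the dictionary-seeded stream comes out smaller than plain deflate of the same message, which is the effect the per-corpus tables below measure.&lt;/p&gt;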

&lt;p&gt;&lt;strong&gt;Why brotli's static dictionary fails where GCdict succeeds&lt;/strong&gt;: Brotli's 120KB dictionary is trained on web text. On IRC messages (67B avg), brotli achieves 1.17x -- the web-text dictionary has minimal overlap with technical Linux support dialogue. GN's domain-specific vocabulary achieves 2.53x on the same data. Brotli quality=1 (minimal static dict usage) achieves 1.757x -- lower than deflate-9 (1.937x), confirming the static dictionary is brotli's primary advantage, not better LZ77. GCdict replaces the static web-text dictionary with a dynamic conversation-specific dictionary.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Experimental Evaluation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Corpora
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;ShareGPT V3: Real ChatGPT conversations, avg 846B per message&lt;/li&gt;
&lt;li&gt;WildChat: Multi-language LLM conversations, avg 952B per message&lt;/li&gt;
&lt;li&gt;LMSYS-Chat-1M: Chatbot Arena conversations, avg 915B per message&lt;/li&gt;
&lt;li&gt;Ubuntu IRC: Technical Linux support dialogues, avg 67B per message&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Content extracted as clean message text. Hardware: Intel i3-1215U.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 GN Split-Stream Results (b=8, 24 measurements)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Corpus&lt;/th&gt;
&lt;th&gt;GN ratio&lt;/th&gt;
&lt;th&gt;range&lt;/th&gt;
&lt;th&gt;vs gzip&lt;/th&gt;
&lt;th&gt;vs brotli&lt;/th&gt;
&lt;th&gt;p50/batch&lt;/th&gt;
&lt;th&gt;MB/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ShareGPT&lt;/td&gt;
&lt;td&gt;2.484x&lt;/td&gt;
&lt;td&gt;2.422-2.559&lt;/td&gt;
&lt;td&gt;+2.3%&lt;/td&gt;
&lt;td&gt;-4.9%&lt;/td&gt;
&lt;td&gt;0.43ms&lt;/td&gt;
&lt;td&gt;15.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WildChat&lt;/td&gt;
&lt;td&gt;2.130x&lt;/td&gt;
&lt;td&gt;2.088-2.169&lt;/td&gt;
&lt;td&gt;+3.5%&lt;/td&gt;
&lt;td&gt;-7.7%&lt;/td&gt;
&lt;td&gt;0.54ms&lt;/td&gt;
&lt;td&gt;17.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LMSYS&lt;/td&gt;
&lt;td&gt;2.362x&lt;/td&gt;
&lt;td&gt;2.335-2.396&lt;/td&gt;
&lt;td&gt;+1.0%&lt;/td&gt;
&lt;td&gt;-5.2%&lt;/td&gt;
&lt;td&gt;0.39ms&lt;/td&gt;
&lt;td&gt;19.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ubuntu-IRC&lt;/td&gt;
&lt;td&gt;2.534x&lt;/td&gt;
&lt;td&gt;2.384-2.715&lt;/td&gt;
&lt;td&gt;+61.9%&lt;/td&gt;
&lt;td&gt;+47.1%&lt;/td&gt;
&lt;td&gt;0.055ms&lt;/td&gt;
&lt;td&gt;9.3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Per-chunk p50 latency: 0.007ms (0.055ms / 8 chunks).&lt;br&gt;
GN beats gzip on all corpora across all 24 measurements.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Ubuntu-IRC: Verified Dominance (72 measurements)
&lt;/h3&gt;

&lt;p&gt;On 67B average messages, standard compressors essentially fail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gzip-6 per-message: 0.857x (actually expands)&lt;/li&gt;
&lt;li&gt;brotli-6 per-message: 1.138x (barely compresses)&lt;/li&gt;
&lt;li&gt;GN b=8: 2.534x (+47% vs brotli, +62% vs gzip)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Verified across 72 measurements spanning three corpus sizes and multiple seed sets.&lt;br&gt;
Every single measurement is positive vs brotli. Floor: +47%, ceiling: +61%.&lt;/p&gt;

&lt;p&gt;Domain-specific vocabulary explains this: IRC messages about Linux troubleshooting contain &lt;code&gt;sudo apt-get&lt;/code&gt;, &lt;code&gt;/dev/sda&lt;/code&gt;, &lt;code&gt;ubuntu&lt;/code&gt;, &lt;code&gt;terminal&lt;/code&gt; -- patterns GN knows and general-purpose compressors do not.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 Claude Conversations: The Target Corpus
&lt;/h3&gt;

&lt;p&gt;GN was designed for Claude LLM conversations. Tested on 41 real Claude conversations (4841 turns, avg 915B), 16 random seeds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Corpus&lt;/th&gt;
&lt;th&gt;GN cold&lt;/th&gt;
&lt;th&gt;GN warmed&lt;/th&gt;
&lt;th&gt;br/msg&lt;/th&gt;
&lt;th&gt;vs br/msg&lt;/th&gt;
&lt;th&gt;range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude convos&lt;/td&gt;
&lt;td&gt;2.305x&lt;/td&gt;
&lt;td&gt;2.766x&lt;/td&gt;
&lt;td&gt;2.115x&lt;/td&gt;
&lt;td&gt;+30.8%&lt;/td&gt;
&lt;td&gt;30.6-31.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GN beats brotli per-message by +30.8% on real Claude data. The variance across 16 random seeds is only 30.6-31.2% -- the advantage is structural, not noise. Versus brotli per-batch: +15.9%.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.5 GCdict Results: All Public Corpora (32 random seeds, all_positive=true)
&lt;/h3&gt;

&lt;p&gt;Production comparison: GCdict vs brotli per-message.&lt;br&gt;
In production LLM streaming, messages arrive one at a time.&lt;br&gt;
GN accumulates session history. Per-message brotli does not.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Corpus&lt;/th&gt;
&lt;th&gt;GN cold&lt;/th&gt;
&lt;th&gt;GN warmed&lt;/th&gt;
&lt;th&gt;br/msg&lt;/th&gt;
&lt;th&gt;vs br/msg&lt;/th&gt;
&lt;th&gt;range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ShareGPT&lt;/td&gt;
&lt;td&gt;2.513x&lt;/td&gt;
&lt;td&gt;2.765x&lt;/td&gt;
&lt;td&gt;2.441x&lt;/td&gt;
&lt;td&gt;+13.3%&lt;/td&gt;
&lt;td&gt;12.1-14.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WildChat&lt;/td&gt;
&lt;td&gt;2.115x&lt;/td&gt;
&lt;td&gt;2.265x&lt;/td&gt;
&lt;td&gt;2.212x&lt;/td&gt;
&lt;td&gt;+2.4%&lt;/td&gt;
&lt;td&gt;0.9-3.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LMSYS&lt;/td&gt;
&lt;td&gt;2.354x&lt;/td&gt;
&lt;td&gt;2.577x&lt;/td&gt;
&lt;td&gt;2.287x&lt;/td&gt;
&lt;td&gt;+12.7%&lt;/td&gt;
&lt;td&gt;10.8-14.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IRC&lt;/td&gt;
&lt;td&gt;1.708x&lt;/td&gt;
&lt;td&gt;1.925x&lt;/td&gt;
&lt;td&gt;1.184x&lt;/td&gt;
&lt;td&gt;+62.7%&lt;/td&gt;
&lt;td&gt;59.9-64.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GN beats brotli per-message on ALL 4 public corpora, ALL 32 seeds, zero exceptions.&lt;br&gt;
WildChat minimum: +0.9% -- never negative.&lt;/p&gt;

&lt;p&gt;vs brotli per-batch (same context, best case for brotli):&lt;br&gt;
ShareGPT +3.8% (all positive), WildChat -1.3% (near tie), LMSYS +3.6%, IRC +11.7%&lt;/p&gt;

&lt;h3&gt;
  
  
  3.6 Literal Stream Analysis
&lt;/h3&gt;

&lt;p&gt;The literal stream (unmatched bytes) is the primary compression challenge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Literal stream = 91-95% of input on longer messages&lt;/li&gt;
&lt;li&gt;Deflate compresses literals at 1.937x&lt;/li&gt;
&lt;li&gt;Brotli compresses literals at 2.089x (7.8% gap)&lt;/li&gt;
&lt;li&gt;Brotli quality=1 on literals: 1.757x -- lower than deflate-9 (1.937x)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This confirms brotli's static dictionary is its primary advantage. GCdict provides&lt;br&gt;
a conversation-specific replacement that outperforms the web-text static dict.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Related Work
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;LZ77 and Deflate&lt;/strong&gt;: Ziv &amp;amp; Lempel 1977. Deflate (RFC 1951) combines LZ77 with Huffman coding. Fixed 32KB window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brotli&lt;/strong&gt; (RFC 7932): Adds 120KB static dictionary and context modeling. Dictionary fixed at specification time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zstandard&lt;/strong&gt;: Offline dictionary training. Dictionary static after training. GN achieves adaptation without offline training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM Context Compression&lt;/strong&gt;: Token-level methods (LLMLingua, AutoCompressor) are lossy and require model inference. GN is complementary: byte-level, strictly lossless, CPU-only.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GCdict requires conversation history at decode time (stateful)&lt;/li&gt;
&lt;li&gt;Split-stream requires batching (4+ chunks) to amortize overhead&lt;/li&gt;
&lt;li&gt;GN cold start (no session history) trails brotli by 5-8% on longer messages&lt;/li&gt;
&lt;li&gt;WildChat -1.3% vs brotli per-batch (near tie); +2.4% vs brotli per-message (all positive)&lt;/li&gt;
&lt;li&gt;Higher constant overhead than gzip for very small inputs under 200B&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  6. Conclusion
&lt;/h2&gt;

&lt;p&gt;GN provides domain-adaptive compression that improves with conversation length.&lt;br&gt;
GN demonstrates that LLM conversation history is itself a compression resource:&lt;br&gt;
using prior turns as a preset dictionary exploits self-reference that&lt;br&gt;
general-purpose compressors cannot access.&lt;/p&gt;

&lt;p&gt;Key results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GN beats gzip on all corpora, in every measurement&lt;/li&gt;
&lt;li&gt;GN beats brotli per-message on all 5 corpora&lt;/li&gt;
&lt;li&gt;Ubuntu-IRC: +47% vs brotli, 72 measurements, every run positive&lt;/li&gt;
&lt;li&gt;p50 0.007ms per chunk -- negligible latency overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Source: github.com/atomsrkuul/glasik-core (MIT)&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ziv &amp;amp; Lempel (1977). IEEE Trans. Information Theory, 23(3), 337-343.&lt;/li&gt;
&lt;li&gt;Deutsch (1996). DEFLATE. RFC 1951.&lt;/li&gt;
&lt;li&gt;Alakuijala &amp;amp; Szabadka (2016). Brotli. RFC 7932.&lt;/li&gt;
&lt;li&gt;Collet (2016). Zstandard. RFC 8878.&lt;/li&gt;
&lt;li&gt;Zhao et al. (2024). WildChat. ICLR.&lt;/li&gt;
&lt;li&gt;Zheng et al. (2023). LMSYS-Chat. NeurIPS.&lt;/li&gt;
&lt;li&gt;Lowe et al. (2015). Ubuntu Dialogue Corpus. SIGDIAL.&lt;/li&gt;
&lt;li&gt;Deletang et al. (2023). Language Modeling Is Compression. arXiv:2309.10668.&lt;/li&gt;
&lt;li&gt;Jiang et al. (2023). LLMLingua. EMNLP.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>rust</category>
      <category>compression</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>I Built a Compression Algorithm That Beats Gzip in 2 Weeks. I Have an A+ Cert.</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Wed, 15 Apr 2026 19:20:57 +0000</pubDate>
      <link>https://dev.to/atomsrkuul/i-built-a-compression-algorithm-that-beats-gzip-i-have-an-a-cert-53e6</link>
      <guid>https://dev.to/atomsrkuul/i-built-a-compression-algorithm-that-beats-gzip-i-have-an-a-cert-53e6</guid>
      <description>&lt;p&gt;How a networking student ended up writing Rust, beating industry standard compression, and learning more about computers than any classroom taught me.&lt;/p&gt;

&lt;p&gt;I was messing around with LLM APIs. Claude, GPT, the usual. And I kept hitting the same wall: context windows cost money. Every token you send costs. Every token the model reads costs. If you're building anything serious on top of these APIs (agents, memory systems, anything that needs conversation history), you're constantly fighting the token budget.&lt;/p&gt;

&lt;p&gt;The obvious solution is compression. Compress your context before sending it, decompress on the way back. But here's the thing nobody talks about: standard compression algorithms weren't built for LLM data. Gzip was designed for web assets. Brotli for HTTP. Neither was tuned for the specific patterns that dominate LLM conversations: repeated phrases, structured JSON, code snippets, the same function names appearing dozens of times across a session.&lt;/p&gt;

&lt;p&gt;I thought: what if I built something specifically for this?&lt;br&gt;
I have an A+ cert. I'm in school for networking. I had never written Rust. I had never implemented a string matching algorithm. I didn't know what Aho-Corasick was.&lt;/p&gt;

&lt;p&gt;Two weeks later, GN (Glasik Notation) was compressing LLM data. Six months of refinement later, it beats gzip on every corpus tested, beats brotli by 47% on short messages, and beats brotli per-message on every corpus: +30.8% on real Claude conversations, up to +62.7% on short IRC messages. Here's how it happened.&lt;/p&gt;

&lt;h2&gt;What I Knew Coming In&lt;/h2&gt;

&lt;p&gt;Networking teaches you to think in bytes. Packets, headers, payloads. MTU limits. Bandwidth vs latency tradeoffs. When you spend time thinking about why a TCP packet is structured the way it is, you develop an intuition for data representation that a lot of software developers never get.&lt;/p&gt;

&lt;p&gt;I also knew that compression, at its core, is just finding patterns and encoding them more efficiently. ZIP finds repeated byte sequences. Huffman coding gives frequent symbols shorter codes. That's it. The magic is in how cleverly you find patterns and how efficiently you encode them.&lt;/p&gt;

&lt;p&gt;What I didn't know: how to implement any of this. So I started reading.&lt;/p&gt;

&lt;h2&gt;The First Attempt: Embarrassingly Bad&lt;/h2&gt;

&lt;p&gt;My first compression attempt was a lookup table. I manually collected common LLM phrases and replaced them with single bytes. Classic dictionary substitution.&lt;/p&gt;

&lt;p&gt;The ratio? About 1.1x on good days. Gzip does 2.4x. I was losing badly.&lt;/p&gt;

&lt;p&gt;But it taught me something important: the dictionary matters more than the algorithm. If you have the right patterns, even simple substitution works. If you have the wrong patterns, nothing saves you.&lt;/p&gt;

&lt;h2&gt;Learning Aho-Corasick&lt;/h2&gt;

&lt;p&gt;The problem with simple dictionary lookup is speed. If you have 20,000 patterns and a 2,000-byte message, naively checking every pattern at every position is O(n x m x k): far too slow to be useful.&lt;/p&gt;

&lt;p&gt;Aho-Corasick builds a finite automaton from all your patterns simultaneously. You scan the input once, left to right, and the automaton tells you which patterns match at each position. Linear time regardless of how many patterns you have.&lt;/p&gt;

&lt;p&gt;This is where my networking background helped. I already understood finite state machines from studying how network protocols work: TCP state diagrams, regex engines in firewalls, etc. Aho-Corasick was the same concept applied to string matching.&lt;/p&gt;

&lt;p&gt;I implemented it in Rust. Its type system forced me to think clearly about ownership and memory, things networking had taught me to care about but that most high-level languages hide.&lt;/p&gt;

&lt;h2&gt;The Vocabulary Problem&lt;/h2&gt;

&lt;p&gt;An AC automaton is only as good as its vocabulary. I needed to learn what patterns actually appear in LLM conversations.&lt;/p&gt;

&lt;p&gt;I downloaded four public datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ShareGPT v3: Real ChatGPT conversation turns, avg 846B per message&lt;/li&gt;
&lt;li&gt;WildChat: Multi-language LLM conversations, avg 952B per message&lt;/li&gt;
&lt;li&gt;LMSYS-Chat: Academic LLM benchmark data, avg 915B per message&lt;/li&gt;
&lt;li&gt;Ubuntu IRC: Technical support dialogues, avg 67B per message&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wrote a sliding window tokenizer that learns patterns from the data. After training on 500k chunks, the L0 vocabulary had 20,000 entries covering the most common LLM patterns. The top entries: 8-space indentation, " the", "and ", "ing ", paragraph breaks, common function names, JSON structure.&lt;/p&gt;

&lt;h2&gt;Split-Stream Architecture: The Key Insight&lt;/h2&gt;

&lt;p&gt;After AC matching, you have two streams:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token IDs: Single bytes representing matched patterns (1-254)&lt;/li&gt;
&lt;li&gt;Literals: The raw bytes that didn't match any pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These two streams have completely different statistical properties. Token IDs are low-entropy: a small alphabet of 254 symbols, highly repetitive. Literals are higher-entropy: the unique residue. By deflating them separately and concatenating the compressed streams, each gets compressed optimally for its own statistics. The token stream deflates at 9.4x. The literal stream deflates at 2x.&lt;/p&gt;

&lt;p&gt;The frame format:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;[2B token_stream_length][token_stream_deflated][literal_stream_deflated]&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;The GCdict Breakthrough&lt;/h2&gt;

&lt;p&gt;This is the part that surprised me most.&lt;/p&gt;

&lt;p&gt;After getting the split-stream working, GN was still trailing brotli by 5-8% on longer messages. I spent weeks trying vocabulary expansion, LZ77 pre-processing, ANS entropy coding. Nothing moved the needle enough.&lt;/p&gt;

&lt;p&gt;Then I asked a different question: what does brotli know that GN doesn't?&lt;/p&gt;

&lt;p&gt;Brotli has a 120KB static dictionary trained on web text. I proved this was its primary advantage -- brotli at quality=1 (minimal static dictionary usage) actually loses to deflate-9. The dictionary is the advantage, not better LZ77.&lt;/p&gt;

&lt;p&gt;So I needed GN's equivalent. But instead of a static web-text dictionary, I used the conversation history itself.&lt;/p&gt;

&lt;p&gt;LLM conversations are self-referential. When someone asks about a bug in message 5, the same variable names appeared in message 2. A debugging session reuses error messages. A code review reuses function names. The literal stream residue (the bytes GN's AC tokenizer didn't match) contains fragments of prior messages in the same conversation.&lt;/p&gt;

&lt;p&gt;GCdict (GN Context Dictionary): use the last 32KB of conversation history as a preset dictionary for deflate compression of the literal stream.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AC tokenization -&amp;gt; split(tok_ids, literals)
  -&amp;gt; tok_stream: deflate (unchanged)
  -&amp;gt; lit_stream: deflate(literals, zdict=history[-32KB])
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Same deflate engine. Better-initialized LZ77 window. No brotli internals. Fully lossless -- both encoder and decoder have the same history.&lt;/p&gt;

&lt;p&gt;The result: GN beats brotli per-message on every corpus tested.&lt;/p&gt;

&lt;h2&gt;The Numbers&lt;/h2&gt;

&lt;p&gt;24 independent measurements per corpus for split-stream. 32 random seeds for GCdict.&lt;/p&gt;

&lt;p&gt;GN Split-Stream (b=8):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Corpus&lt;/th&gt;&lt;th&gt;GN ratio&lt;/th&gt;&lt;th&gt;vs gzip&lt;/th&gt;&lt;th&gt;vs brotli&lt;/th&gt;&lt;th&gt;p50&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;ShareGPT (846B avg)&lt;/td&gt;&lt;td&gt;2.484x&lt;/td&gt;&lt;td&gt;+2.3%&lt;/td&gt;&lt;td&gt;-4.9%&lt;/td&gt;&lt;td&gt;0.43ms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;WildChat (952B avg)&lt;/td&gt;&lt;td&gt;2.130x&lt;/td&gt;&lt;td&gt;+3.5%&lt;/td&gt;&lt;td&gt;-7.7%&lt;/td&gt;&lt;td&gt;0.54ms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;LMSYS (915B avg)&lt;/td&gt;&lt;td&gt;2.362x&lt;/td&gt;&lt;td&gt;+1.0%&lt;/td&gt;&lt;td&gt;-5.2%&lt;/td&gt;&lt;td&gt;0.39ms&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Ubuntu-IRC (67B avg)&lt;/td&gt;&lt;td&gt;2.534x&lt;/td&gt;&lt;td&gt;+61.9%&lt;/td&gt;&lt;td&gt;+47.1%&lt;/td&gt;&lt;td&gt;0.055ms&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Per-chunk p50: 0.007ms.&lt;/p&gt;

&lt;p&gt;GN with GCdict -- vs brotli per-message (production comparison):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Corpus&lt;/th&gt;&lt;th&gt;GN&lt;/th&gt;&lt;th&gt;br/msg&lt;/th&gt;&lt;th&gt;vs brotli&lt;/th&gt;&lt;th&gt;range&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Claude convos (915B)&lt;/td&gt;&lt;td&gt;2.766x&lt;/td&gt;&lt;td&gt;2.115x&lt;/td&gt;&lt;td&gt;+30.8%&lt;/td&gt;&lt;td&gt;30.6-31.2%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;ShareGPT&lt;/td&gt;&lt;td&gt;2.765x&lt;/td&gt;&lt;td&gt;2.441x&lt;/td&gt;&lt;td&gt;+13.3%&lt;/td&gt;&lt;td&gt;12.1-14.6%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;LMSYS&lt;/td&gt;&lt;td&gt;2.577x&lt;/td&gt;&lt;td&gt;2.287x&lt;/td&gt;&lt;td&gt;+12.7%&lt;/td&gt;&lt;td&gt;10.8-14.6%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;WildChat&lt;/td&gt;&lt;td&gt;2.265x&lt;/td&gt;&lt;td&gt;2.212x&lt;/td&gt;&lt;td&gt;+2.4%&lt;/td&gt;&lt;td&gt;0.9-3.4%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;IRC&lt;/td&gt;&lt;td&gt;1.925x&lt;/td&gt;&lt;td&gt;1.184x&lt;/td&gt;&lt;td&gt;+62.7%&lt;/td&gt;&lt;td&gt;59.9-64.5%&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All 32 seeds positive on every corpus. Zero exceptions.&lt;br&gt;
The production comparison is per-message brotli because in real LLM streaming, messages arrive one at a time. GN accumulates session history. Per-message brotli has no context.&lt;/p&gt;

&lt;p&gt;The Ubuntu-IRC result deserves special mention. On 67-byte average messages, gzip achieves 0.857x (actually expands) and brotli achieves 1.138x. GN achieves 2.534x. This isn't close -- it's a different compression tier entirely, because GN's vocabulary is trained on the specific domain.&lt;/p&gt;

&lt;h2&gt;What I Got Wrong&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The window update bug.&lt;/strong&gt; I added per-chunk window updates inside the batch compression loop. This triggered the Aho-Corasick automaton to rebuild at every 10th chunk -- a 35x latency regression that took days to find. Fix: update the vocabulary once per batch, not per chunk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fake benchmarks.&lt;/strong&gt; Early on I tested with synthetic text -- repeated sentences. The ratios looked incredible. When I switched to real corpus data, they collapsed. Real data is the only benchmark that matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The extraction bug.&lt;/strong&gt; I spent weeks thinking WildChat compressed at 2.83x. It was actually 2.13x. The difference: my extraction script used str(turn) instead of turn.get('content',''), including JSON metadata in every message. 89 bytes of {'language': 'English', 'redacted': False} per turn, compressing beautifully. Real numbers only come from real data extracted correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dictionary quality.&lt;/strong&gt; My first trained vocabulary included spam patterns -- ubuntu repeated 35 times, Spotify API strings. These filled vocabulary slots with noise. Proper filtering was essential.&lt;/p&gt;

&lt;h2&gt;What Networking Taught Me That CS Didn't&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Think in bytes, not objects.&lt;/strong&gt; Every compression decision comes down to: how many bytes does this cost vs save? Networking trains you to count bytes obsessively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency is not throughput.&lt;/strong&gt; A compression algorithm that achieves 3x ratio but takes 50ms per message is useless for real-time applications. GN's p50 is 0.007ms per chunk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The protocol mindset.&lt;/strong&gt; GN's frame format is just a protocol. I designed it the same way I'd design a packet format: minimize overhead, make it parseable without side information, handle edge cases at the boundary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real systems fail in weird ways.&lt;/strong&gt; My Node.js native addon path was 12x slower than my Python path on identical data. Finding this required the same methodical elimination I use to debug network issues.&lt;/p&gt;

&lt;h2&gt;Where It Is Now&lt;/h2&gt;

&lt;p&gt;GN is deployed in my OpenClaw setup, compressing conversation context before storage and retrieval.&lt;/p&gt;

&lt;p&gt;Architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;L0: 20,000 static entries trained on LLM corpora&lt;/li&gt;
&lt;li&gt;Sliding window vocabulary: adapts continuously to session content&lt;/li&gt;
&lt;li&gt;GCdict: conversation history as deflate preset dictionary&lt;/li&gt;
&lt;li&gt;Split-stream: independent tok/lit compression&lt;/li&gt;
&lt;li&gt;Node.js and Python bindings (napi + PyO3)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A paper (GN: Domain-Adaptive Lossless Compression for LLM Conversation Streams) is available at github.com/atomsrkuul/glasik-core.&lt;/p&gt;

&lt;h2&gt;What I'd Tell Someone Starting Out&lt;/h2&gt;

&lt;p&gt;You don't need a CS degree to build real systems. You need:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A real problem.&lt;/strong&gt; Not a tutorial project. Something that costs you actual money or time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The willingness to read papers.&lt;/strong&gt; You don't need to understand the proofs. You need to understand the idea.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real data benchmarks from day one.&lt;/strong&gt; Never test on synthetic data. It will lie to you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Obsessive measurement.&lt;/strong&gt; If you can't measure it, you can't improve it. Instrument everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The networking mindset.&lt;/strong&gt; Think in bytes. Count everything. Every abstraction has a cost.&lt;/p&gt;

&lt;p&gt;The A+ cert taught me that computers are physical things moving electrons around. That turned out to be exactly the right mental model for building a compression algorithm.&lt;/p&gt;

&lt;p&gt;GN is open source at github.com/atomsrkuul/glasik-core (MIT).&lt;/p&gt;

&lt;p&gt;Robert Rider - Independent Researcher&lt;/p&gt;

</description>
      <category>rust</category>
      <category>compression</category>
      <category>algorithms</category>
      <category>opensource</category>
    </item>
    <item>
      <title>The ESCAPE Byte Problem: How I Beat Brotli by Separating Token Streams</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Tue, 14 Apr 2026 00:47:39 +0000</pubDate>
      <link>https://dev.to/atomsrkuul/the-escape-byte-problem-how-we-beat-brotli-by-separating-token-streams-2i6j</link>
      <guid>https://dev.to/atomsrkuul/the-escape-byte-problem-how-we-beat-brotli-by-separating-token-streams-2i6j</guid>
      <description>&lt;p&gt;Part 4 of the Glasik Notation series.&lt;/p&gt;

&lt;p&gt;Previous articles covered the sliding window tokenizer, Aho-Corasick O(n) matching, and GN's first verified benchmarks against gzip.&lt;/p&gt;

&lt;h2&gt;The Waste Was Hidden in Plain Sight&lt;/h2&gt;

&lt;p&gt;After implementing Aho-Corasick O(n) matching, GN was fast. Sub-millisecond per chunk, competitive with brotli on latency. But the ratio numbers kept coming back flat:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gzip-6:   2.18x
GN AC:    2.20x  (+0.9% vs gzip)
brotli-6: 2.47x
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;We were barely beating gzip. Brotli was 12% ahead. The vocabulary was real — 31,248 tokens per 200 chunks, 190 tokens per chunk average. The matches were happening. So where were the bits going?&lt;/p&gt;

&lt;p&gt;We ran a token stream entropy analysis:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter
import math

token_ids = []
for c in sample:
    raw = slider.encode_ac_raw(c)
    i = 0
    while i &amp;lt; len(raw):
        if raw[i] == ESCAPE and i+1 &amp;lt; len(raw):
            token_ids.append(raw[i+1])
            i += 2
        else:
            i += 1

counter = Counter(token_ids)
total = sum(counter.values())
entropy = -sum(c/total * math.log2(c/total) for c in counter.values())
print(f"Token entropy: {entropy:.3f} bits/token")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Result: 7.758 bits/token.&lt;/p&gt;

&lt;p&gt;We were encoding each token as 2 bytes: ESCAPE + id. That's 16 bits per token. The theoretical minimum was 7.758 bits. We were wasting 51.5% of every token encoding. That's where the bits were going.&lt;/p&gt;
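&lt;p&gt;That waste figure is just arithmetic on the two numbers above:&lt;/p&gt;

```python
# Waste of the 2-byte ESCAPE+id encoding vs. the measured token entropy.
entropy_bits = 7.758   # measured bits/token (from the analysis above)
cost_bits = 16         # ESCAPE byte + id byte = 2 bytes per token
waste = 1 - entropy_bits / cost_bits
print(f"{waste:.1%}")  # -> 51.5%
```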

&lt;p&gt;Why the Mixed Stream Was Hurting Us&lt;/p&gt;

&lt;p&gt;Our tokenized output looked like this:&lt;/p&gt;

&lt;p&gt;[ESCAPE][id][ESCAPE][id][lit][lit][lit][ESCAPE][id][lit][ESCAPE][id]...&lt;/p&gt;

&lt;p&gt;Every token costs 2 bytes: an ESCAPE byte (0x01) followed by the ID. We fed this into deflate expecting it to compress well. But deflate uses LZ77 — it looks for repeated byte sequences in a sliding window. The ESCAPE bytes were fragmenting every pattern.&lt;/p&gt;

&lt;p&gt;Where deflate might have seen:&lt;/p&gt;

&lt;p&gt;" the " " the " " the "   ← repeating 5-byte sequence, compresses well&lt;/p&gt;

&lt;p&gt;It was instead seeing:&lt;/p&gt;

&lt;p&gt;[01][04] " t" "he" [01][04] " t" ...   ← ESCAPE bytes breaking the pattern&lt;/p&gt;

&lt;p&gt;The ESCAPE byte was acting like static on a radio signal: present in every token, making the mixed stream look noisier than it actually was.&lt;/p&gt;
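&lt;p&gt;Half of this effect is visible with zeroth-order entropy alone: coding a mixture of two differently-distributed streams can never take fewer bits than coding them separately. A toy illustration on constructed data, not GN's real streams:&lt;/p&gt;

```python
import math
from collections import Counter

def h(data: bytes) -> float:
    """Zeroth-order Shannon entropy in bits/byte."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Toy construction: token IDs interleaved with text via ESCAPE markers.
tids = bytes((i * 7) % 48 + 2 for i in range(300))
text = (b"the quick brown fox " * 60)[:1200]
mixed = b"".join(b"\x01" + tids[i:i + 1] + text[i * 4:(i + 1) * 4] for i in range(300))

mixed_bits = h(mixed) * len(mixed)
split_bits = h(tids) * len(tids) + h(text) * len(text)
# Separating the streams (and dropping ESCAPE bytes) never costs entropy.
assert split_bits <= mixed_bits
```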

&lt;p&gt;The Insight: Separate the Streams&lt;/p&gt;

&lt;p&gt;What if we just... didn't mix them? Instead of one interleaved stream, emit two:&lt;/p&gt;

&lt;p&gt;Token stream: just the IDs — [04][04][38][20][04][07]...&lt;br&gt;
Literal stream: just the literal bytes — "t" "h" "e" " " "a" ...&lt;/p&gt;

&lt;p&gt;Then compress each independently with raw deflate.&lt;br&gt;
The token stream is pure symbols. Token ID 4 (" the") fires 483 times in 200 chunks. That's a highly skewed distribution — deflate loves it. The literal stream is clean text with no ESCAPE pollution. It compresses the way text is supposed to compress.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;toks, lits = slider.encode_ac_split(chunk)
dt = zlib.compress(toks, 6)[2:-4]  # strip zlib header/trailer: raw deflate
dl = zlib.compress(lits, 6)[2:-4]
frame = struct.pack('&amp;gt;H', len(dt)) + dt + dl
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;This is the same insight behind why PNG separates prediction from entropy coding, why video codecs separate motion vectors from residual — when you have structurally different data, compress the structures separately.&lt;/p&gt;

&lt;p&gt;The Numbers&lt;/p&gt;

&lt;p&gt;We ran this across 4 corpora, 3 seeds each — 12 independent measurements. Standard protocol: warm 500 chunks, test the next 300.&lt;/p&gt;

&lt;p&gt;Batch size matters. Each chunk has ~37 token IDs. Deflate header overhead (~18 bytes) dominates a tiny stream. Batching solves this — concatenate N chunks before compressing the token stream:&lt;/p&gt;

&lt;p&gt;GN split b=1:   2.226x   0.043ms   -6.6% vs brotli   ← header overhead dominates&lt;br&gt;
GN split b=4:   2.385x   0.036ms   +0.1% vs brotli   ← already matching brotli&lt;br&gt;
GN split b=8:   2.456x   0.036ms   +3.1% vs brotli   ← production sweet spot&lt;br&gt;
GN split b=16:  2.542x   0.037ms   +6.7% vs brotli   ← diminishing returns&lt;/p&gt;

&lt;p&gt;b=8 is the production choice. Beyond b=16 the marginal gain flattens and you're accumulating more latency budget than the ratio improvement justifies.&lt;/p&gt;
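&lt;p&gt;The b=1 penalty is pure per-stream overhead. A toy chunk makes it visible:&lt;/p&gt;

```python
import zlib

# Toy token-ID chunk, ~37 IDs with heavy repetition (not real GN output).
chunk = bytes([4, 4, 38, 32, 4, 7, 9, 4] * 5)[:37]

# b=1: eight frames, each paying deflate's per-stream overhead.
separate = sum(len(zlib.compress(chunk, 6)[2:-4]) for _ in range(8))
# b=8: one frame, overhead paid once and cross-chunk matches become available.
batched = len(zlib.compress(chunk * 8, 6)[2:-4])
assert batched < separate
```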

&lt;p&gt;Full 12-measurement verification at b=8:&lt;br&gt;
Corpus      GN split b=8   vs gzip   vs brotli   p50       p99&lt;br&gt;
ShareGPT    2.49–2.52x     +15%      +2%         0.043ms   0.061ms&lt;br&gt;
WildChat    2.48–2.51x     +15%      +2%         0.042ms   0.073ms&lt;br&gt;
LMSYS       2.50–2.56x     +14%      +2%         0.044ms   0.079ms&lt;br&gt;
Ubuntu-IRC  2.06–2.09x     +49%      +28%        0.008ms   0.013ms&lt;/p&gt;

&lt;p&gt;Every single measurement beats both gzip and brotli.&lt;br&gt;
And on tail latency: GN split b=8 p99 never exceeds 0.123ms. Brotli-6 p99 reaches 0.226ms. GN has 2–4x better tail latency than brotli while achieving better compression ratio.&lt;/p&gt;

&lt;p&gt;Why This Works (The Information Theory)&lt;/p&gt;

&lt;p&gt;The mixed tokenized stream had:&lt;/p&gt;

&lt;p&gt;Token entropy: 7.758 bits/token&lt;br&gt;
Encoding cost: 16 bits/token&lt;br&gt;
Waste: 51.5%&lt;/p&gt;

&lt;p&gt;The split stream:&lt;/p&gt;

&lt;p&gt;Token stream: pure symbols, deflate compresses ~2–3x on its own&lt;br&gt;
Literal stream: clean text, no structural noise, deflate compresses ~1.9x&lt;/p&gt;

&lt;p&gt;Combined result: 2.49–2.56x on the original input&lt;/p&gt;

&lt;p&gt;The separation lets each compressor do what it was designed to do. This isn't a trick — it's giving deflate the data structure it can actually exploit.&lt;/p&gt;

&lt;p&gt;The Frame Format&lt;/p&gt;

&lt;p&gt;Simple and self-contained:&lt;/p&gt;

&lt;p&gt;[2B tok_deflated_len][tok_deflated][lit_deflated]&lt;/p&gt;

&lt;p&gt;Two bytes of length prefix for the token stream, then the two compressed streams concatenated. Given the vocabulary, a frame can be decoded without any other external state.&lt;/p&gt;

&lt;p&gt;The Rust implementation in codon.rs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pub fn encode_ac_split(buf: &amp;amp;[u8], ac: &amp;amp;AhoCorasick) -&amp;gt; (Vec&amp;lt;u8&amp;gt;, Vec&amp;lt;u8&amp;gt;) {
    let mut tok_ids: Vec&amp;lt;u8&amp;gt; = Vec::new();
    let mut literals: Vec&amp;lt;u8&amp;gt; = Vec::new();
    let mut pos = 0usize;

    for m in ac.find_iter(buf) {
        // bytes between matches are literals
        literals.extend_from_slice(&amp;amp;buf[pos..m.start()]);
        let pat_idx = m.pattern().as_usize();
        if pat_idx &amp;lt; 254 {
            tok_ids.push((pat_idx + 1) as u8);
        } else {
            // id would not fit in one byte: emit the matched bytes as literals
            literals.extend_from_slice(&amp;amp;buf[m.start()..m.end()]);
        }
        pos = m.end();
    }
    literals.extend_from_slice(&amp;amp;buf[pos..]);
    (tok_ids, literals)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;O(n) scan, single pass, clean split.&lt;/p&gt;

&lt;p&gt;Lossless Round-Trip&lt;/p&gt;

&lt;p&gt;The split stream is lossless when encoder and decoder share the same vocabulary. Token IDs are indices — to decode, you need to know what pattern each ID maps to.&lt;/p&gt;

&lt;p&gt;GN uses a stateful model in production. Encoder and decoder share a synchronized sliding window; each frame carries a 2-byte dict_version. If they diverge, the decoder requests a resync. This keeps frames small while guaranteeing correctness.&lt;/p&gt;

&lt;p&gt;Round-trip verified: 5/5 test cases pass, including empty buffers, raw ESCAPE bytes in the input, and 10,000-byte repetitive inputs.&lt;/p&gt;

&lt;p&gt;What's Next: Fractal Dictionary Sharding&lt;/p&gt;

&lt;p&gt;The split-stream insight revealed something deeper: token and literal streams have fundamentally different statistical structure. Taking that further — different types of content have different vocabulary entirely.&lt;br&gt;
Code blocks repeat function, return, const. System messages repeat role definitions. User messages repeat question structures. Compressing them with a single shared vocabulary leaves ratio on the table.&lt;br&gt;
We're implementing fractal dictionary sharding: four vocabulary tiers (L0 universal, L1 domain, L2 session, L3 chunk) with per-shard-type routing and deterministic crystal identity per shard — same content always produces the same compressed shape. The FractalCompressor is implemented, wired into the napi production path, and passing roundtrip verification across all shard types.&lt;br&gt;
More on that in Article 5.&lt;/p&gt;

&lt;p&gt;Code and Paper&lt;/p&gt;

&lt;p&gt;GitHub: github.com/atomsrkuul/glasik-core (MIT)&lt;br&gt;
npm: gni-compression@1.0.0&lt;br&gt;
arXiv: pending cs.IR endorsement — if you're a qualified author (3+ cs papers): code 7HWUBA&lt;/p&gt;

&lt;p&gt;Robert Rider is an independent researcher building Glasik, an open-source compression and context management system for LLM deployments.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>compression</category>
      <category>algorithms</category>
      <category>llm</category>
    </item>
    <item>
      <title>GN Beats Gzip and Brotli: How a Learning Sliding Window Outperforms Static Compressors</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Tue, 07 Apr 2026 19:14:49 +0000</pubDate>
      <link>https://dev.to/atomsrkuul/gn-beats-gzip-and-brotli-how-a-learning-sliding-window-outperforms-static-compressors-2dg8</link>
      <guid>https://dev.to/atomsrkuul/gn-beats-gzip-and-brotli-how-a-learning-sliding-window-outperforms-static-compressors-2dg8</guid>
      <description>&lt;p&gt;When we published our last article, GN was within 10% of gzip on LLM conversation data. We said the remaining gap was in the entropy backend. We were wrong about the solution — but right about the problem.&lt;br&gt;
This week GN beats gzip on every corpus we tested. And on all three corpora, it beats brotli.&lt;br&gt;
Here is what we learned.&lt;/p&gt;

&lt;p&gt;The ANS Dead End&lt;/p&gt;

&lt;p&gt;Our first instinct was to improve the entropy coder. Gzip uses Huffman coding; zstd uses ANS (Asymmetric Numeral Systems). We implemented byte-renorm ANS, bit-renorm ANS, and Order-1 ANS from scratch in Rust.&lt;/p&gt;

&lt;p&gt;Results on ShareGPT:&lt;br&gt;
Codec   Ratio&lt;br&gt;
gzip-6  2.082x&lt;br&gt;
byte-ANS    1.233x&lt;br&gt;
bit-ANS 1.212x&lt;br&gt;
O1-ANS  0.551x&lt;/p&gt;

&lt;p&gt;ANS without an LZ-style preprocessing pass is worse than gzip, every time. The reason is fundamental: entropy coders compress symbol frequency distributions, but gzip's real advantage comes from LZ77 — the sliding window that eliminates repeated byte sequences before entropy coding runs. ANS cannot fix what LZ77 needs to do first. We kept ANS in the codebase as a primitive for future work and moved on.&lt;/p&gt;

&lt;p&gt;The Real Problem: Per-Frame Dictionary Overhead&lt;/p&gt;

&lt;p&gt;GN has a sliding window tokenizer — it learns domain vocabulary across batches and compresses using that vocabulary. But there was a critical architectural flaw: the dictionary was serialized into every compressed frame.&lt;/p&gt;

&lt;p&gt;200 entries × ~10 bytes = ~2KB overhead per chunk. On 500-byte chunks, the dictionary cost more than the compression saved.&lt;/p&gt;

&lt;p&gt;v1 on 1000 LLM chunks: 0.502x (expanding the data).&lt;/p&gt;

&lt;p&gt;The fix: stop putting the dictionary in the frame. Keep it in shared state and reference it by version number. This is exactly how brotli's static dictionary and zstd's dictionary mode work.&lt;/p&gt;

&lt;p&gt;Frame v1: magic + full_dictionary + payload  (~2KB overhead)&lt;/p&gt;

&lt;p&gt;Frame v2: magic + dict_version(4 bytes) + payload  (8 bytes overhead)&lt;/p&gt;
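&lt;p&gt;Back-of-the-envelope numbers show why v1 expanded the data. The 250-byte compressed payload below is an assumed figure for illustration:&lt;/p&gt;

```python
chunk = 500          # bytes per chunk (from the article)
payload = 250        # assumed: chunk body compresses ~2x before framing
v1_overhead = 2000   # ~200 dictionary entries x ~10 bytes, shipped in every frame
v2_overhead = 8      # magic + 4-byte dict_version

ratio_v1 = chunk / (payload + v1_overhead)
ratio_v2 = chunk / (payload + v2_overhead)
assert ratio_v1 < 1 < ratio_v2   # v1 expands the data; v2 actually compresses
```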

&lt;p&gt;The Corpus Window (Level 2)&lt;/p&gt;

&lt;p&gt;With the overhead fixed, we increased the window to 10,000 entries and made it global — one sliding window shared across all compression calls in the process. Every session, every shard, every conversation feeds the same accumulating vocabulary.&lt;/p&gt;

&lt;p&gt;Results immediately improved:&lt;br&gt;
Corpus  L1 (per-call)   L2 (corpus window)  gzip    brotli&lt;br&gt;
ShareGPT    2.191x  2.402x  2.178x  2.453x&lt;br&gt;
WildChat    2.035x  2.145x  2.025x  2.234x&lt;br&gt;
LMSYS   2.094x  2.231x  2.079x  2.322x&lt;/p&gt;

&lt;p&gt;L2 beats gzip on every corpus. The gap to brotli narrowed to 2-4%.&lt;/p&gt;

&lt;p&gt;Retrieval-Warmed Compression (Level 3)&lt;/p&gt;

&lt;p&gt;The insight: before compressing a new chunk, feed similar prior chunks through the sliding window first. This warms the dictionary with related vocabulary so the new chunk compresses better. The act of retrieval changes the compression state.&lt;/p&gt;

&lt;p&gt;We benchmarked pressurize_k (the number of prior chunks fed through the window for warming) on WildChat — the hardest corpus due to topic diversity:&lt;br&gt;
pressurize_k    L3 ratio    vs brotli&lt;br&gt;
0 (no pressurize)   2.164x  +3.54% gap&lt;br&gt;
1   2.199x  +1.89% gap&lt;br&gt;
2   2.251x  +0.5% ahead&lt;br&gt;
3   2.207x  +1.51% gap&lt;/p&gt;

&lt;p&gt;pressurize_k=2 is optimal for WildChat. For ShareGPT and LMSYS, pressurize_k=3 is optimal.&lt;/p&gt;

&lt;p&gt;The optimal pressurization depth varies by corpus vocabulary diversity — more diverse corpora benefit from shallower pressurization to avoid dictionary dilution.&lt;/p&gt;

&lt;p&gt;Final Results: L3 Beats Brotli on All Three Corpora&lt;/p&gt;

&lt;p&gt;Verified across 3 independent corpora, 3 random seeds each:&lt;br&gt;
Corpus  GN L3   gzip-6  brotli-6    margin&lt;br&gt;
ShareGPT    2.526x  2.145x  2.429x  +4.0% vs brotli&lt;br&gt;
LMSYS   2.401x  2.031x  2.291x  +4.8% vs brotli&lt;br&gt;
WildChat    2.251x  2.023x  2.240x  +0.5% vs brotli&lt;/p&gt;

&lt;p&gt;All three beat gzip by 11-18%. All three beat brotli.&lt;/p&gt;

&lt;p&gt;GN beats gzip on 100% of runs across all seeds and corpora. GN beats brotli on all three corpora when the window is sufficiently warmed.&lt;/p&gt;

&lt;p&gt;Why This Works&lt;/p&gt;

&lt;p&gt;Brotli ships with a 120KB static dictionary of common web phrases. It never changes. GN's sliding window learns the specific vocabulary of your data stream as it runs. LLM conversations have crystalline structure — repeated role markers, prompt scaffolding, tool call formats, JSON patterns, reasoning templates. After seeing a few thousand examples, GN knows these patterns better than any generic dictionary ever could.&lt;/p&gt;

&lt;p&gt;The critical property: GN's compression ratio improves with stream length. Gzip and brotli are static — they cannot improve.&lt;/p&gt;

&lt;p&gt;ShareGPT at 500 chunks:  GN 2.304x  brotli 2.363x  (behind)&lt;/p&gt;

&lt;p&gt;ShareGPT at 2000 chunks: GN 2.440x  brotli 2.436x  (pulls ahead)&lt;/p&gt;

&lt;p&gt;ShareGPT at 5000 chunks: GN 2.517x  brotli 2.429x  (+3.6%)&lt;/p&gt;

&lt;p&gt;The longer GN runs on a domain-specific stream, the wider the gap grows.&lt;/p&gt;

&lt;p&gt;What Comes Next&lt;/p&gt;

&lt;p&gt;The current warming uses sequential proximity — the last N chunks before the current one. The next level uses semantic similarity — retrieve the most topically related prior chunks via embedding search, regardless of when they appeared.&lt;/p&gt;

&lt;p&gt;A conversation about JWT authentication should be warmed by other authentication conversations, not by whatever happened to come before it in the stream. This is Semantic Level 3, and it should further improve results on diverse corpora like WildChat where topic jumps are common.&lt;br&gt;
Beyond that: dictionary compression (compress the dictionary itself, fractal self-similarity), cross-session persistence (window state survives restarts), and pre-trained domain dictionaries (ship a base window trained on 50k LLM conversations).&lt;/p&gt;

&lt;p&gt;The goal is to make GN the brotli of LLMs — purpose-built, measurably better, and invisible infrastructure.&lt;/p&gt;

&lt;p&gt;GN is MIT licensed. Code: github.com/atomsrkuul/glasik-core&lt;/p&gt;

&lt;p&gt;npm: gni-compression@1.0.0&lt;/p&gt;

&lt;p&gt;NLNet NGI Zero Commons Fund application #2026-06-023&lt;/p&gt;

</description>
      <category>compression</category>
      <category>rust</category>
      <category>algorithms</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Within 10% of gzip: What GN’s Semantic Compression Teaches Us</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Sun, 05 Apr 2026 04:20:54 +0000</pubDate>
      <link>https://dev.to/atomsrkuul/within-10-of-gzip-what-gns-semantic-compression-teaches-us-4cp1</link>
      <guid>https://dev.to/atomsrkuul/within-10-of-gzip-what-gns-semantic-compression-teaches-us-4cp1</guid>
      <description>&lt;p&gt;When we first started building, the goal was never to make another gzip clone. Generic compression already does that job incredibly well.&lt;/p&gt;

&lt;p&gt;The real question was different:&lt;/p&gt;

&lt;p&gt;What happens if the compressor understands the shape of the data before it ever starts packing bytes?&lt;/p&gt;

&lt;p&gt;That question led us from the original JavaScript prototype into Glasik Core, a Rust implementation focused on semantic tokenization, rolling vocabulary windows, and domain-aware preprocessing for message and agent streams.&lt;/p&gt;

&lt;p&gt;This week we hit a milestone that feels small on paper but huge architecturally:&lt;/p&gt;

&lt;p&gt;GN is now within 10% of gzip on every benchmark corpus we tested.&lt;/p&gt;

&lt;p&gt;Not better. Not faster. Not “production solved.”&lt;/p&gt;

&lt;p&gt;Just consistently close, which is exactly why this stage is exciting.&lt;/p&gt;

&lt;p&gt;The benchmark reality&lt;/p&gt;

&lt;p&gt;Current corpus results:&lt;/p&gt;

&lt;p&gt;Corpus         Glasik Core   gzip     Relative&lt;br&gt;
MEMORY.md      1.849x        2.075x   89%&lt;br&gt;
ShareGPT-1k    3.752x        3.945x   95%&lt;br&gt;
Ubuntu-IRC-1k  2.122x        2.357x   90%&lt;/p&gt;

&lt;p&gt;The most important one is ShareGPT-1k hitting 95% of gzip. That corpus is extremely close to the data GN was designed for:&lt;/p&gt;

&lt;p&gt;Repeated assistant roles&lt;br&gt;
Prompt scaffolding&lt;br&gt;
Tool formatting&lt;br&gt;
Structured JSON-like patterns&lt;br&gt;
Recurring conversational templates&lt;/p&gt;

&lt;p&gt;Even though we have not passed gzip yet, nearly matching it on LLM-native streams is a strong validation signal.&lt;/p&gt;

&lt;p&gt;Why being close matters more than winning right now&lt;/p&gt;

&lt;p&gt;The remaining gap is not where many would assume. The weak point is not semantic understanding anymore.&lt;/p&gt;

&lt;p&gt;The weak point is the final entropy backend. gzip still has decades of advantage in:&lt;/p&gt;

&lt;p&gt;Huffman tuning&lt;br&gt;
Backreference heuristics&lt;br&gt;
Lazy match parsing&lt;br&gt;
Highly optimized bit packing&lt;br&gt;
Mature DEFLATE edge cases&lt;/p&gt;

&lt;p&gt;That last 5–10% is the part generic compressors are legendary at.&lt;/p&gt;

&lt;p&gt;But the semantic layer is already doing the harder thing: understanding the structure of the stream before compression begins. That’s where the long-term leverage is.&lt;/p&gt;

&lt;p&gt;The real architectural lesson&lt;/p&gt;

&lt;p&gt;The simplest way to explain the difference:&lt;/p&gt;

&lt;p&gt;gzip remembers bytes. GN remembers meaning.&lt;/p&gt;

&lt;p&gt;As the rolling vocabulary fills, repeated structures stop being treated like raw strings and start being treated as stable semantic units. That includes:&lt;/p&gt;

&lt;p&gt;Timestamps&lt;br&gt;
Speaker roles&lt;br&gt;
Repeated tool calls&lt;br&gt;
Theorem blocks&lt;br&gt;
JSON keys&lt;br&gt;
Repeated prompt shells&lt;br&gt;
Agent trace scaffolding&lt;br&gt;
Channel metadata&lt;/p&gt;

&lt;p&gt;Performance improves the longer the stream runs. Instead of relying only on a fixed byte-history window, GN reinforces the vocabulary of the domain itself. That’s the core bet.&lt;/p&gt;
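&lt;p&gt;A toy version of such a rolling vocabulary (a hypothetical structure, far simpler than Glasik Core's):&lt;/p&gt;

```python
import re
from collections import Counter

class RollingVocab:
    """Toy rolling vocabulary window: count tokens, evict the rare ones."""

    def __init__(self, max_entries: int = 200):
        self.max_entries = max_entries
        self.counts = Counter()

    def observe(self, chunk: str) -> None:
        self.counts.update(re.findall(r"\S+", chunk))
        if len(self.counts) > self.max_entries:
            # keep only the most frequent entries when the window overflows
            self.counts = Counter(dict(self.counts.most_common(self.max_entries)))

    def vocabulary(self) -> list:
        return [tok for tok, _ in self.counts.most_common()]
```

&lt;p&gt;Repeated structures like role markers quickly dominate the window while one-off strings get evicted, which is the reinforcement behavior described above.&lt;/p&gt;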

&lt;p&gt;Why Rust changed the debugging loop&lt;/p&gt;

&lt;p&gt;The JavaScript prototype proved the idea. Rust made it possible to trust the measurements.&lt;/p&gt;

&lt;p&gt;One concrete example: during corpus benchmarking we hit a rolling-frequency bug silently inflating token counts over long windows. Compression ratios looked “better,” but only because the vocabulary statistics were wrong.&lt;/p&gt;

&lt;p&gt;The fix only became obvious because Rust forced us to reason explicitly about integer width, overflow behavior, and ownership boundaries inside the rolling state machine.&lt;/p&gt;

&lt;p&gt;Fixing it tightened the corpus results and gave us confidence that the “within 10%” milestone is real, not a measurement artifact. That debugging loop alone justified the rewrite.&lt;/p&gt;

&lt;p&gt;What makes this exciting now&lt;/p&gt;

&lt;p&gt;The missing performance is now localized. We know exactly where the gap is:&lt;/p&gt;

&lt;p&gt;Residual encoding&lt;br&gt;
Entropy refinement&lt;br&gt;
Better state models&lt;br&gt;
Adaptive codon dictionaries&lt;br&gt;
Specialized chat residual codecs&lt;/p&gt;

&lt;p&gt;That is a much better place to be than wondering whether the entire idea works. The semantic layer is clearly competitive. Now it’s about tightening the backend until the semantic advantage outweighs gzip’s entropy maturity.&lt;/p&gt;

&lt;p&gt;What’s next&lt;/p&gt;

&lt;p&gt;Tonight’s most interesting work was deeper in the backend: we now have a reference-safe ANS entropy coder implemented from scratch in Rust, using the same family of techniques that powers zstd.&lt;/p&gt;

&lt;p&gt;The current version uses correctness-first binary renormalization so we can prove round-trip behavior before optimizing. Next step: bit-level state refinement and faster renormalization transforms.&lt;/p&gt;

&lt;p&gt;This work directly targets the exact 5–10% gap the benchmarks are still showing.&lt;/p&gt;

&lt;p&gt;The path forward is finally clear:&lt;/p&gt;

&lt;p&gt;Semantic understanding is already competitive&lt;br&gt;
Entropy packing is the remaining frontier&lt;br&gt;
The architecture now tells us exactly where to push&lt;/p&gt;

&lt;p&gt;At this point, GN (our semantic agent layer) and Glasik Core (the compression engine) feel less like an experiment and more like a real compression architecture.&lt;/p&gt;

</description>
      <category>compression</category>
      <category>rust</category>
      <category>algorithms</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Built Domain Specific Compression for Messages. Here's What I Learned.</title>
      <dc:creator>Buffer Overflow</dc:creator>
      <pubDate>Fri, 03 Apr 2026 17:35:19 +0000</pubDate>
      <link>https://dev.to/atomsrkuul/we-built-domain-specific-compression-for-messages-heres-what-we-learned-2o28</link>
      <guid>https://dev.to/atomsrkuul/we-built-domain-specific-compression-for-messages-heres-what-we-learned-2o28</guid>
      <description>&lt;p&gt;Why gzip loses to custom compression on chat data — and what we learned building a lossless message codec from scratch.&lt;br&gt;
The Problem&lt;br&gt;
Message data is expensive at scale. Discord servers, Slack workspaces, OpenClaw chat logs — each message is ~500 bytes. Generic compression gets 2-3x. That's good but messages have structure generic algorithms ignore.&lt;br&gt;
[2026-04-03T11:00:00Z] user: Hello, can you check the repo?&lt;br&gt;
[2026-04-03T11:00:15Z] bot: Checking repository...&lt;br&gt;
Timestamp format, role prefixes, platform names — these repeat thousands of times. A domain-specific dictionary front-loads that knowledge instead of discovering it slowly.&lt;br&gt;
The Architecture&lt;br&gt;
We built two layers that work together:&lt;br&gt;
GN (Glasik Notation) — semantic compression. Extracts structure before compression, maps repeated values to IDs, recognizes message templates, reduces entropy before any algorithm touches the data.&lt;br&gt;
GNI (Glasik Notation Interface) — transmission codec. Handles serialization, framing, integrity verification, wire protocol.&lt;br&gt;
Together they form a complete pipeline. Neither is useful alone — GN without GNI has no reliable wire protocol, GNI without GN has no semantic advantage over gzip.&lt;br&gt;
What We Shipped: GNI v1&lt;br&gt;
Phase 1 delivers the foundation:&lt;/p&gt;

&lt;p&gt;Canonical binary serialization (varint encoding)&lt;br&gt;
Versioned frame format (backward compatible forever)&lt;br&gt;
CRC32 integrity verification&lt;br&gt;
100% lossless round-trip recovery verified on 2,000+ real messages&lt;br&gt;
Zero external dependencies, 482 lines of JavaScript&lt;/p&gt;

&lt;p&gt;Compression ratios from semantic tokenization are a Phase 2 deliverable. The tokenizer is currently stubbed — Phase 2 implements the domain-specific dictionary that gives GN its advantage.&lt;/p&gt;

&lt;p&gt;The Bug We Caught&lt;/p&gt;

&lt;p&gt;During validation on 1,038,324 real dialogue messages we hit a CRC32 mismatch:&lt;/p&gt;

&lt;p&gt;Stored CRC:   1428394006&lt;br&gt;
Computed CRC: -1889366573&lt;br&gt;
Mismatch!&lt;/p&gt;

&lt;p&gt;The bug: the checksum was computed over (header + payload) instead of just (payload). Three-line fix, proper unsigned arithmetic (&amp;gt;&amp;gt;&amp;gt; 0). Full corpus re-validated in 25 seconds. 6/6 tests passed. Caught before production. Caught by us, not a user. This is why you validate before you ship.&lt;/p&gt;

&lt;p&gt;Try It&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const GNLz4V2 = require('./src/gn-lz4-v2-complete');
const codec = new GNLz4V2();

const messages = [
  { templateId: 0, ts: 1743744000, author: 1, channel: 1, payload: 'hello world' },
  { templateId: 0, ts: 1743744001, author: 2, channel: 1, payload: 'how are you?' }
];

const result = codec.compress(messages);
const recovered = codec.decompress(result.compressed);
console.log(recovered.length + ' messages recovered losslessly');
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;npm test  # 37/37 passing&lt;/p&gt;
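&lt;p&gt;The checksum-scope bug is easy to reproduce. The header layout below is hypothetical, with Python's zlib.crc32 (already unsigned, so no &amp;gt;&amp;gt;&amp;gt; 0 is needed) standing in for the JavaScript implementation:&lt;/p&gt;

```python
import struct
import zlib

header = struct.pack('>BH', 1, 11)   # hypothetical: version byte + payload length
payload = b'hello world'

correct = zlib.crc32(payload)         # checksum over the payload only
buggy = zlib.crc32(header + payload)  # the bug: header included in the scope
assert correct != buggy               # verification fails against the stored CRC
assert zlib.crc32(payload) == correct # with matching scope, round-trip verifies
```

&lt;p&gt;Encoder and decoder must agree on exactly which bytes the checksum covers; a scope mismatch looks like corruption even when every byte arrived intact.&lt;/p&gt;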

&lt;p&gt;What We Are Not Claiming&lt;/p&gt;

&lt;p&gt;Compression ratios are Phase 2. Phase 1 is serialization and framing.&lt;br&gt;
Not production-proven at scale. Validated on our own systems.&lt;br&gt;
No external users yet. Looking for third-party benchmarks and feedback.&lt;/p&gt;

&lt;p&gt;What We Are Claiming&lt;/p&gt;

&lt;p&gt;Solid foundation: lossless, versioned, integrity-verified&lt;br&gt;
Real validation: 1M+ messages, caught our own bugs before release&lt;br&gt;
Clear roadmap: Phase 2 connects GN semantic compression into GNI transmission layer&lt;br&gt;
Applied for NLNet NGI Zero funding (application 2026-06-023) to deliver Phase 2&lt;/p&gt;

&lt;p&gt;Links&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/atomsrkuul/glasik-notation" rel="noopener noreferrer"&gt;https://github.com/atomsrkuul/glasik-notation&lt;/a&gt;&lt;br&gt;
License: MIT&lt;/p&gt;

&lt;p&gt;If you compress messages and want to share results, open an issue. That kind of external validation is what makes this real.&lt;/p&gt;

</description>
      <category>compression</category>
      <category>algorithms</category>
      <category>opensource</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
