I Built Domain Specific Compression for Messages. Here's What I Learned.

#compression #algorithms #opensource #performance

Why gzip loses to custom compression on chat data — and what we learned building a lossless message codec from scratch.
The Problem
Message data is expensive at scale. Discord servers, Slack workspaces, OpenClaw chat logs — each message is ~500 bytes. Generic compression gets 2-3x. That's good but messages have structure generic algorithms ignore.
[2026-04-03T11:00:00Z] user: Hello, can you check the repo?
[2026-04-03T11:00:15Z] bot: Checking repository...
Timestamp format, role prefixes, platform names — these repeat thousands of times. A domain-specific dictionary front-loads that knowledge instead of discovering it slowly.
The Architecture
We built two layers that work together:
GN (Glasik Notation) — semantic compression. Extracts structure before compression, maps repeated values to IDs, recognizes message templates, reduces entropy before any algorithm touches the data.
GNI (Glasik Notation Interface) — transmission codec. Handles serialization, framing, integrity verification, wire protocol.
Together they form a complete pipeline. Neither is useful alone — GN without GNI has no reliable wire protocol, GNI without GN has no semantic advantage over gzip.
What We Shipped: GNI v1
Phase 1 delivers the foundation:

Canonical binary serialization (varint encoding)
Versioned frame format (backward compatible forever)
CRC32 integrity verification
100% lossless round-trip recovery verified on 2,000+ real messages
Zero external dependencies, 482 lines of JavaScript

Compression ratios from semantic tokenization are a Phase 2 deliverable. The tokenizer is currently stubbed — Phase 2 implements the domain-specific dictionary that gives GN its advantage.
The Bug We Caught
During validation on 1,038,324 real dialogue messages we hit a CRC32 mismatch:
Stored CRC: 1428394006
Computed CRC: -1889366573
Mismatch!
The bug: checksum computed over (header + payload) instead of just (payload). Three-line fix, proper unsigned arithmetic (>>> 0). Full corpus re-validated in 25 seconds. 6/6 tests passed.
Caught before production. Caught by us, not a user. This is why you validate before you ship.
Try It
javascriptconst GNLz4V2 = require('./src/gn-lz4-v2-complete');
const codec = new GNLz4V2();

const messages = [
{ templateId: 0, ts: 1743744000, author: 1, channel: 1, payload: 'hello world' },
{ templateId: 0, ts: 1743744001, author: 2, channel: 1, payload: 'how are you?' }
];

const result = codec.compress(messages);
const recovered = codec.decompress(result.compressed);
console.log(recovered.length + ' messages recovered losslessly');
npm test

37/37 passing

What We Are Not Claiming

Compression ratios are Phase 2. Phase 1 is serialization and framing.
Not production-proven at scale. Validated on our own systems.
No external users yet. Looking for third-party benchmarks and feedback.

What We Are Claiming

Solid foundation: lossless, versioned, integrity-verified
Real validation: 1M+ messages, caught our own bugs before release
Clear roadmap: Phase 2 connects GN semantic compression into GNI transmission layer
Applied for NLNet NGI Zero funding (application 2026-06-023) to deliver Phase 2

Links
GitHub: https://github.com/atomsrkuul/glasik-notation
License: MIT
If you compress messages and want to share results, open an issue. That kind of external validation is what makes this real.

DEV Community

I Built Domain Specific Compression for Messages. Here's What I Learned.

37/37 passing

Top comments (0)