Most developers reach for gzip or zstd and move on. I wanted to understand why they work — and whether I could do better. Two years later, MaxCompression compresses English text better than PAQ8l, one of the legendary compressors in the field.
## The numbers
On alice29.txt (152 KB of English text from the Canterbury corpus):
| Compressor | Compressed size | Ratio |
|---|---|---|
| gzip -9 | 54,179 bytes | 2.81× |
| bzip2 -9 | 43,202 bytes | 3.52× |
| xz -9 | 48,492 bytes | 3.14× |
| MaxCompression L28 | 35,497 bytes | 4.28× |
| PAQ8l | ~35,500 bytes | ~4.28× |
On the full Silesia corpus (202 MB of mixed real-world data), MaxCompression's automatic mode achieves 4.35× overall — beating bzip2 on all 12 files and xz on 9 out of 12.
## How it works
MaxCompression isn't one algorithm. It's five compression engines under a unified API:
- LZ77 (Levels 1–9) — fast, for general use
- BWT + multi-table rANS (L10–L14) — Burrows-Wheeler with arithmetic coding
- Smart Mode (L20) — automatically picks the best strategy per block
- LZRC (L24–L26) — LZ with range coder for high-ratio binary compression
- Context Mixing (L28) — the crown jewel
## Context mixing: the deep end
Context mixing is the technique used by the world's best compressors (cmix, PAQ8px, ZPAQ). The idea is simple: instead of using one model to predict the next byte, you use dozens of models and combine their predictions.
MaxCompression's CM engine uses:
- 58 context models — order-0 through order-14, word contexts, sparse match, indirect contexts, bigram frequency, character class N-grams, sentence position, and more
- 8 neural mixers — logit-space weighted averages that learn which models to trust
- 3-stage APM cascade — Adaptive Probability Maps that refine the final prediction
- Cross-term feature — a novel nonlinear combination of mixer outputs that gave a surprising -57 byte improvement
Each bit of the file is predicted using all 58 models. Their predictions are combined by the mixers, refined by the APM cascade, and fed to an arithmetic coder. The whole thing adapts as it reads the file.
## What worked
- StateMap for match prediction — replacing a hardcoded log-confidence formula with a learned 64K StateMap gave -45 bytes instantly. Lesson: let the data learn what you think you can hardcode.
- Sequential split K-means for Huffman table initialization — this single change was the breakthrough that let BWT mode finally beat bzip2.
- Mixer weight distribution 7:1:1:2:1:4:2:8 — the first mixer (4096-entry, high resolution) gets the most weight, but the coarser mixers still help on transitions.
## What didn't work
- Neural network mixer (4 hidden neurons): +1,285 bytes worse. Not enough data in 152 KB for the network to converge.
- LSTM cell: +84 bytes. Without proper backpropagation through time, it's just adding noise.
- 9th mixer: +574 bytes. Too sparse, too many parameters for the data to fill.
- Bracket depth model, dialog model, trigram frequency: all marginal or negative. The existing 58 models already capture these patterns.
The architecture has reached a plateau. Closing the gap to ZPAQ (31,200 bytes) or cmix (27,370 bytes) would require LSTM/transformer-based mixing — a fundamentally different approach, not parameter tuning.
## Engineering for production
MaxCompression isn't a research toy. It's built for real use:
- Portable C99 — no dependencies, compiles everywhere
- 21 test suites — unit, roundtrip, fuzz, stress, regression, streaming, edge cases
- CI on every push — Linux (GCC + Clang), macOS, Windows, Valgrind, WASM
- Memory-safe — Valgrind memcheck with zero leaks in CI
- Prebuilt binaries — download and run on Linux, macOS, or Windows
- Python and Rust bindings — `mcx_compress()` from any language
- 30+ CLI commands — compress, decompress, bench, stat, diff, verify, hash, pipe...
```c
#include <stdlib.h>
#include <maxcomp/maxcomp.h>

// Compress
size_t bound = mcx_compress_bound(src_size);
uint8_t* dst = malloc(bound);
size_t compressed = mcx_compress(dst, bound, src, src_size, 20);

// Decompress
mcx_frame_info info;
mcx_get_frame_info(&info, dst, compressed);
uint8_t* out = malloc(info.original_size);
size_t decompressed = mcx_decompress(out, info.original_size, dst, compressed);

free(out);
free(dst);
```
## The ranking
On alice29.txt, MaxCompression ranks approximately #4 worldwide among single-file compressors:
1. cmix — ~27,370 bytes (5.56×)
2. PAQ8px — ~29,370 bytes (5.18×)
3. ZPAQ — ~31,200 bytes (4.87×)
4. MaxCompression — 35,497 bytes (4.28×)
5. PAQ8l — ~35,500 bytes (~4.28×)
The top 3 use LSTM/transformer-based approaches. MaxCompression achieves its result with classical context mixing only — no neural networks beyond the logit-space mixers.
## Try it
```bash
# Build from source
git clone https://github.com/SamDreamsMaker/Max-Compression.git
cd Max-Compression
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Compress a file
./build/bin/mcx compress -l 20 myfile.txt

# Benchmark against system compressors
./build/bin/mcx bench --compare myfile.txt
```
Or download prebuilt binaries from the latest release.
GitHub: https://github.com/SamDreamsMaker/Max-Compression
MaxCompression is GPL-3.0 licensed and developed by Dreams-Makers Studio. Feedback, issues, and contributions welcome.
I'm also building TaleForge, a free creative writing platform. Check it out if you write fiction, manga, or screenplays.