A decompressor is an interpreter for hostile input: here's what that costs to ship safely

#c #tooling #github

A compression library is a uniquely dangerous thing to ship. It's small, fast, and dependency-light, which is exactly why it ends up linked into web servers, firmware, package managers, and SSH daemons, sitting in everyone's hot path. When xz was backdoored in early 2024 (CVE-2024-3094, CVSS 10.0), the lesson most people took was "audit your maintainers." While building ZXC, I took a narrower and more uncomfortable lesson: most of the industry treats a decompressor like a utility function, when it's really an interpreter for attacker-controlled input.

Think about what a decoder actually does. It reads a byte stream that, in the field, came from somewhere you don't trust: a downloaded asset, a network payload, an OTA image relayed through who-knows-what. From those bytes it computes offsets, lengths, and copy operations, then writes into a buffer. Every one of those is a chance to read out of bounds, write past the end, or loop forever. The codec's job is to be fast and to never, ever do those things on any input, including one crafted specifically to trigger them.
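To make that concrete, here is a minimal sketch of the copy step in a generic LZ77-style decoder. It is not ZXC's code: the name copy_match and its parameters are hypothetical, chosen only to show which checks have to happen before a single byte moves.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ZXC: copy `len` bytes starting `offset`
 * bytes behind the write cursor, as an LZ77-style decoder would.
 * Returns 0 on success, -1 if the request would read before the start of
 * the output or write past its capacity. */
int copy_match(uint8_t *dst, size_t dst_cap, size_t *dst_pos,
               size_t offset, size_t len)
{
    size_t pos = *dst_pos;

    if (pos > dst_cap)                 /* cursor itself must be in range */
        return -1;
    if (offset == 0 || offset > pos)   /* would read before dst[0] */
        return -1;
    if (len > dst_cap - pos)           /* would write past the end */
        return -1;

    /* Byte-by-byte on purpose: source and destination may overlap
     * (offset < len), which is how LZ formats encode repeated runs. */
    for (size_t i = 0; i < len; i++)
        dst[pos + i] = dst[pos - offset + i];

    *dst_pos = pos + len;
    return 0;
}
```

Note that the length check is written as a subtraction from the capacity rather than as pos + len > dst_cap, so an attacker-chosen len cannot wrap the addition around.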

Here's how ZXC is hardened, and what each layer is actually for. The tools aren't the interesting part; why each one catches a class of bug the others miss is.

Fuzzing finds the inputs you didn't imagine

ZXC is fuzzed with ClusterFuzzLite — on every PR (short runs) and in scheduled batch jobs (two-hour sweeps), under both ASan and UBSan. The corpus is retained across runs, and the cumulative execution count is past five billion to date.

What matters is that the fuzzing is split across five separate harnesses, because each API surface has its own state machine and its own class of bug:

roundtrip: compress, then decompress, then assert you got the original back
decompress: feed arbitrary bytes straight to the decoder and require a clean error or a bounded, valid output, never a crash
pstream: the push-based streaming path
seekable: parses a seek table and jumps into the middle of a stream; a completely different attack surface from a linear decode, and the only way to exercise the jump logic against garbage offsets is to fuzz it on its own
dict: dictionary training, serialization, and use
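For readers who haven't written one, a decompress-style harness can be very small. The sketch below uses the standard libFuzzer entry point, but the decoder is a toy run-length format I made up for the example (toy_decode is hypothetical and has nothing to do with the ZXC format or API); the shape of the harness is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a real decoder (NOT the ZXC format): the input is a
 * sequence of (count, byte) pairs. Returns bytes written, or -1 on
 * malformed input or if the output would exceed dst_cap. */
int64_t toy_decode(const uint8_t *src, size_t src_size,
                   uint8_t *dst, size_t dst_cap)
{
    size_t out = 0;

    if (src_size % 2 != 0)               /* truncated pair */
        return -1;
    for (size_t i = 0; i < src_size; i += 2) {
        size_t count = src[i];
        if (count == 0)                  /* zero-length run is invalid here */
            return -1;
        if (count > dst_cap - out)       /* would overrun the output */
            return -1;
        for (size_t j = 0; j < count; j++)
            dst[out + j] = src[i + 1];
        out += count;
    }
    return (int64_t)out;
}

/* libFuzzer entry point: called once per generated input. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    uint8_t out[4096];
    int64_t n = toy_decode(data, size, out, sizeof out);

    /* The only acceptable outcomes: a clean error, or a length within
     * the capacity we stated. Memory errors are left to ASan/UBSan. */
    assert(n == -1 || (n >= 0 && (uint64_t)n <= sizeof out));
    return 0;
}
```

Built with clang and -fsanitize=fuzzer,address,undefined, libFuzzer supplies main and drives the entry point with mutated inputs; the assert catches logic errors in the return value, and the sanitizers catch the memory errors.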

A test suite checks the inputs you thought of. A fuzzer manufactures the malformed, truncated, and adversarial ones you didn't, and for a decoder those are the entire threat. The most valuable thing a fuzzer produces isn't crashes; it's the corpus of weird-but-handled inputs showing that the decoder rejects bad data cleanly instead of marching off the end of a buffer.

That is why ZXC also ships a conformance suite: 17 conformance/invalid/*.zxc files that a correct decoder must reject (bad magic, truncated headers, corrupt payloads, lying size fields…), plus a conformance/valid/ set with golden .expected outputs. All of it is checked in, public, and dependency-free, so anyone writing an independent decoder for the format can prove they reject the same garbage and accept the same good data. Refusing bad input is a feature with the same status as decoding good input.
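Here is roughly what checking a decoder against such vectors looks like. This is a sketch under assumptions: the vectors are in-memory arrays rather than files, and both the decode_fn signature and the three-byte "TY" toy format are invented for the example, not taken from ZXC.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Signature a decoder under test must expose in this sketch. */
typedef int64_t (*decode_fn)(const uint8_t *src, size_t src_size,
                             uint8_t *dst, size_t dst_cap);

/* One conformance vector: input bytes plus the expected verdict. */
struct vector {
    const char    *name;
    const uint8_t *data;
    size_t         size;
    const uint8_t *expected;       /* NULL means "must be rejected" */
    size_t         expected_size;
};

/* Returns the number of vectors the decoder got wrong. */
size_t run_vectors(decode_fn decode, const struct vector *v, size_t count)
{
    size_t failures = 0;
    uint8_t out[256];

    for (size_t i = 0; i < count; i++) {
        int64_t n = decode(v[i].data, v[i].size, out, sizeof out);
        int ok;

        if (v[i].expected == NULL)
            ok = (n < 0);                        /* invalid: must reject */
        else
            ok = (n >= 0 && (size_t)n == v[i].expected_size &&
                  memcmp(out, v[i].expected, v[i].expected_size) == 0);

        if (!ok) {
            fprintf(stderr, "FAIL %s\n", v[i].name);
            failures++;
        }
    }
    return failures;
}

/* Toy format for demonstration only: "TY", one length byte, payload. */
int64_t toy_decode(const uint8_t *src, size_t src_size,
                   uint8_t *dst, size_t dst_cap)
{
    if (src_size < 3 || src[0] != 'T' || src[1] != 'Y')
        return -1;                               /* bad magic or truncated header */
    size_t len = src[2];
    if (src_size - 3 != len || len > dst_cap)
        return -1;                               /* lying size field */
    memcpy(dst, src + 3, len);
    return (int64_t)len;
}
```

The invalid vectors carry no expected output at all: the only thing asserted about them is that the decoder says no.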

Sanitizers turn "it worked" into "it was actually correct"

A fuzzer against a plain build mostly finds hard crashes. A fuzzer against an instrumented build finds the silent corruption — the read that was one byte out of bounds but happened to land on mapped memory. ZXC's CI runs the fuzzers and tests under ASan and UBSan, the test suite under ThreadSanitizer, and a separate Valgrind pass on a debug build. That combination is what lets you say a decode was correct, not merely that it didn't segfault this time.
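A small illustration of the kind of bug this is about, using a hypothetical helper (read_u16le is my name for it, not a ZXC function). The code shown is the correct version; the comment describes the off-by-one variant that a plain build would typically let through.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read a 16-bit little-endian field at `pos`.
 * Returns 0 on success, -1 if the field does not fit in the buffer. */
int read_u16le(const uint8_t *buf, size_t len, size_t pos, uint16_t *out)
{
    /* A buggy variant would check only `pos < len`. With pos == len - 1
     * it would then read buf[len], one byte past the end. On a plain
     * build that read usually lands on mapped memory and just yields
     * garbage; under ASan, on a heap buffer, it is reported as a
     * heap-buffer-overflow. */
    if (len < 2 || pos > len - 2)
        return -1;

    *out = (uint16_t)(buf[pos] | (buf[pos + 1] << 8));
    return 0;
}
```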

Static analysis: four tools, four blind spots

The point of running multiple analyzers isn't redundancy — it's that they fail differently, so each one covers another's blind spot:

Cppcheck reads the code without running it and matches known-bad patterns: null derefs, uninitialized reads, size mismatches a fuzzer might take a billion iterations to trigger. It runs in exhaustive mode against both x86 and cross-compiled ARM64 build databases, because the SIMD paths genuinely differ between them.
The Clang Static Analyzer is path-sensitive — it symbolically walks execution paths and catches the bugs that only happen on a specific branch, like a leak or use-after-free on an error path that tests rarely take.
CodeQL does whole-program dataflow: it can trace an untrusted size field from the header all the way to an allocation or a copy, the kind of taint flow that's invisible to anything looking at one function at a time.
Snyk watches the dependencies and source — because, post-xz, the scariest bug isn't the one you wrote, it's the one in what you trust. (ZXC's only third-party component is rapidhash, MIT-licensed, used for non-cryptographic integrity checksums — and that's the entire trust surface.)
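The taint flow a dataflow analyzer looks for is easiest to see in a small example. The sketch below parses a made-up table header (a 32-bit count followed by fixed-size entries); the layout, ENTRY_SIZE, MAX_ENTRIES, and alloc_table are all assumptions for illustration, not ZXC's seek-table format. The interesting part is that the untrusted count is capped and cross-checked before it reaches malloc.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ENTRY_SIZE   16u      /* hypothetical on-disk entry size */
#define MAX_ENTRIES  65536u   /* caller-chosen ceiling, an assumption */

/* Parse a hypothetical table header: a 32-bit little-endian entry count
 * followed by that many fixed-size entries. Returns a malloc'd buffer
 * sized for the entries (caller frees), or NULL if the count is
 * implausible. `*count_out` is set only on success. */
void *alloc_table(const uint8_t *src, size_t src_size, uint32_t *count_out)
{
    if (src_size < 4)
        return NULL;

    /* Untrusted: this value comes straight from the input. */
    uint32_t count = (uint32_t)src[0]
                   | ((uint32_t)src[1] << 8)
                   | ((uint32_t)src[2] << 16)
                   | ((uint32_t)src[3] << 24);

    if (count == 0 || count > MAX_ENTRIES)
        return NULL;                         /* cap before any arithmetic */
    if (count > (src_size - 4) / ENTRY_SIZE)
        return NULL;                         /* input too short to hold it */

    /* count <= 65536, so count * 16 fits comfortably in a size_t of
     * 32 bits or more. */
    void *table = malloc((size_t)count * ENTRY_SIZE);
    if (table == NULL)
        return NULL;

    *count_out = count;
    return table;
}
```

Delete either check and the header value flows unchecked into the allocation size, which is the source-to-sink path a taint query is written to flag.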

Provenance: proving the binary is the source

The last layer is about the gap between "the code is sound" and "the thing you downloaded is that code." Coverage is tracked with Codecov, and the project carries an OpenSSF Scorecard badge, which grades the kind of supply-chain hygiene the xz incident showed to matter: pinned actions, branch protection, signed releases.

Tagged releases ship with SLSA build provenance via GitHub Artifact Attestations, plus an attested SPDX SBOM, so you can cryptographically verify a binary came from the repo's CI and wasn't swapped out somewhere downstream:

gh attestation verify zxc-0.12.0-linux-x86_64.tar.gz --repo hellobertrand/zxc

The API is a safety mechanism, not just an interface

The biggest single lever for decoder safety isn't a tool, it's the API contract: every operation requires an explicit output-buffer capacity. There is no "decode into a buffer I'll size for you" convenience function, because that convenience is precisely how decompression bombs and overflows happen. The caller allocates, the caller states the bound, and the decoder refuses to exceed it.

// The caller decides the ceiling, not the input stream.
// (E_TOO_BIG, E_NO_MEMORY, and E_BAD_INPUT are placeholder error codes.)
uint64_t claimed = zxc_get_decompressed_size(src, src_size); // reads the footer; decodes nothing
if (claimed > MY_MAX_OUTPUT)                                 // reject bombs *before* allocating
    return E_TOO_BIG;

void *dst = malloc(claimed ? (size_t)claimed : 1);           // malloc(0) may legally return NULL
if (dst == NULL)
    return E_NO_MEMORY;

int64_t n = zxc_decompress(src, src_size, dst, claimed, &opts);
if (n < 0) {
    // malformed or oversized input is an error return, not a crash:
    // never a write past `claimed`, never a corrupted heap
    free(dst);
    return E_BAD_INPUT;
}

A bad input becomes a negative return code you handle. That's the whole game: the failure mode for hostile bytes is if (n < 0), not a postmortem. The one-shot functions hold no shared mutable state either, so concurrent decodes on separate buffers can't cross-contaminate. (There's a reusable-context API for the allocation-sensitive case; a single context is single-threaded, the way you'd expect.)

The honest limit

None of this is a proof of correctness, and I'm not going to pretend otherwise. Five billion fuzz iterations buy a lot of confidence, not a guarantee. Sanitizers only catch bugs that actually execute under instrumentation; they say nothing about paths that no test or fuzz input ever reached. And formal verification of a real-world codec is still largely out of reach.

What the stack buys is depth. For a bug to ship, it has to slip past static analysis, survive the conformance vectors, evade billions of fuzz iterations across all five harnesses, and dodge ASan, UBSan, TSan, and Valgrind. Any one layer is beatable. Stacked, they're a high bar — and a high bar is the right ask for code that decodes untrusted bytes in your hot path.

The whole pipeline is public: workflows, fuzz harnesses, and conformance vectors are all in the repo, so you don't have to take my word for any of the above. ZXC is on GitHub under the BSD 3-Clause license if you want to read it, break it, or use it.

This piece touches on security-sensitive ground. If you're evaluating a codec for a system where a decoder bug would be high-impact, treat everything here as a starting point — review the CI config and conformance vectors yourself rather than taking any project's word for it, mine included.