speed engineer

Posted on May 30 • Originally published at Medium

Checksum Everything: A Practical Habit That Catches Data Corruption Before It Hurts

#programming #tutorial #softwareengineering #reliability

Why I Started Checksumming Everything

A few months back, I shipped a system that quietly truncated 0.4% of records during a Kafka rebalance. No errors, no alarms — just a few thousand objects with subtly wrong payloads downstream. By the time anyone noticed, the damage was already in the database.

That bug taught me a habit I now apply everywhere: checksum at every boundary.

The Problem: Silent Data Corruption

Network errors are obvious. They throw. They reconnect. They give you something to log.

Silent corruption is the dangerous cousin. A flipped bit in a memory-mapped file. A truncated payload from a misconfigured proxy. A serializer that drops the last field of a struct because producer and consumer disagree on the schema.

None of these raise an exception. Your code happily processes the wrong bytes and you only find out weeks later when a customer asks why their invoice is short.

The Fix Is Embarrassingly Simple

Add a checksum field to every message you pass across a process or network boundary.

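import hashlib
import json

def _digest(payload: dict) -> str:
    # Canonical form: sorted keys, so both sides hash identical bytes.
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.blake2b(body, digest_size=16).hexdigest()

def envelope(payload: dict) -> dict:
    # Wrap the payload with a 16-byte BLAKE2b digest of its canonical JSON.
    return {"body": payload, "checksum": _digest(payload)}

def verify(msg: dict) -> dict:
    # Recompute the digest on the receiving side and fail loudly on mismatch.
    actual = _digest(msg["body"])
    if actual != msg["checksum"]:
        raise ValueError(
            f"checksum mismatch: expected {msg['checksum']}, got {actual}"
        )
    return msg["body"]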

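That's it. BLAKE2b is fast (on the order of 1 GB/s per core on modern x86-64), the digest is 16 bytes, and you've turned a class of invisible bugs into loud, immediate failures. One caveat: verify() re-serializes the parsed body, so it only works when both sides produce byte-identical JSON for the same data (same key order, float formatting, and Unicode escaping).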
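Because the snippet above hashes a re-serialized copy of the body, one possible variant is to checksum the exact bytes that go over the wire and verify them before parsing. The sketch below is only an illustration of that idea: the names pack, unpack, and DIGEST_SIZE are mine, and the "digest prefix plus JSON body" framing is an assumed layout, not a standard format.

```python
import hashlib
import json

DIGEST_SIZE = 16  # bytes; assumed to match the 16-byte digest used above

def pack(payload: dict) -> bytes:
    """Serialize once, then prepend the digest of the exact bytes sent."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    digest = hashlib.blake2b(body, digest_size=DIGEST_SIZE).digest()
    return digest + body

def unpack(frame: bytes) -> dict:
    """Check the digest over the received bytes before parsing them."""
    if len(frame) < DIGEST_SIZE:
        raise ValueError("frame is shorter than the digest prefix")
    digest, body = frame[:DIGEST_SIZE], frame[DIGEST_SIZE:]
    actual = hashlib.blake2b(body, digest_size=DIGEST_SIZE).digest()
    if actual != digest:
        raise ValueError(
            f"checksum mismatch: expected {digest.hex()}, got {actual.hex()}"
        )
    return json.loads(body)
```

With this layout the receiver never has to reproduce the sender's JSON formatting, so a consumer written in another language only needs BLAKE2b and a byte slice to run the same check.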

Where to Apply It

Between services: every queue message, every HTTP body, every gRPC payload.
Across disk boundaries: every file you write and re-read, every cache entry.
Across language boundaries: every JSON you serialize and parse on the other side.
Across time: every record you persist for replay later.

The rule of thumb I follow: if data crosses a boundary I don't control, it gets a checksum.
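For the disk case, a rough sketch of the same habit might store the digest in a sidecar file next to the data. Everything here is an assumption for illustration: the ".b2" suffix, the helper names, and the choice to read the whole file into memory (for large files you would feed chunks to the hash with update() instead).

```python
import hashlib
from pathlib import Path

def _sidecar(path: Path) -> Path:
    # Hypothetical convention: "data.json" gets "data.json.b2" beside it.
    return path.with_name(path.name + ".b2")

def write_with_checksum(path: Path, data: bytes) -> None:
    """Write the data, then record its BLAKE2b digest in the sidecar."""
    path.write_bytes(data)
    digest = hashlib.blake2b(data, digest_size=16).hexdigest()
    _sidecar(path).write_text(digest)

def read_verified(path: Path) -> bytes:
    """Re-read the data and refuse to return it if the digest differs."""
    data = path.read_bytes()
    expected = _sidecar(path).read_text().strip()
    actual = hashlib.blake2b(data, digest_size=16).hexdigest()
    if actual != expected:
        raise ValueError(
            f"checksum mismatch for {path.name}: expected {expected}, got {actual}"
        )
    return data
```

A truncated or partially written file then fails at read time instead of being parsed as if it were complete.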

What Checksums Are NOT For

Checksums don't authenticate (use HMAC for that). They don't protect against malicious tampering, because anyone who can alter the payload can also recompute the checksum. They don't fix your schema problems.

What they do is catch the dumb, mechanical bugs — the ones that would otherwise take you a week to track down because they don't surface as errors.
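If you do need tamper resistance, the difference is a shared secret. Here is a minimal sketch using Python's standard hmac module; the function names and the choice of SHA-256 are my assumptions, and key distribution is out of scope.

```python
import hashlib
import hmac

def sign(body: bytes, key: bytes) -> str:
    """Compute a keyed tag; without the key, the tag cannot be recomputed."""
    return hmac.new(key, body, hashlib.sha256).hexdigest()

def is_authentic(body: bytes, key: bytes, tag: str) -> bool:
    """Compare tags in constant time to avoid leaking timing information."""
    return hmac.compare_digest(sign(body, key), tag)
```

A plain checksum answers "did these bytes change by accident?", while the keyed tag answers "did someone without the key change them?".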

The Operational Payoff

Once you start checksumming, two things change:

You catch corruption at the boundary that introduced it, not three systems downstream.
Your error messages stop being mysteries. "Checksum mismatch on inbound queue message" beats "user reports invoice off by $14" every time.
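To get error messages like that, the check has to know which boundary it is guarding. One way to sketch it, with a hypothetical ChecksumMismatch exception and check_boundary helper that are not part of any library:

```python
import hashlib

class ChecksumMismatch(ValueError):
    """Raised when data fails verification at a named boundary."""

    def __init__(self, boundary: str, expected: str, actual: str):
        super().__init__(
            f"checksum mismatch on {boundary}: expected {expected}, got {actual}"
        )
        self.boundary = boundary
        self.expected = expected
        self.actual = actual

def check_boundary(boundary: str, body: bytes, expected: str) -> bytes:
    """Verify body against its expected digest, naming the boundary on failure."""
    actual = hashlib.blake2b(body, digest_size=16).hexdigest()
    if actual != expected:
        raise ChecksumMismatch(boundary, expected, actual)
    return body
```

Keeping the boundary name on the exception also makes it easy to count mismatches per boundary in whatever metrics system you already use.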

A Note on Billing-Critical Systems

This habit matters most in systems where data quietly becomes money. Time-tracked hours, invoices, usage events — anything that hits a customer's bill. We use this pattern in FillTheTimesheet for exactly this reason: a corrupted timesheet entry isn't a UI bug, it's a wrong invoice. Checksums between the tracker and the billing pipeline have caught issues we never would have noticed otherwise.

Key Takeaways

Silent corruption is the dangerous class of bug — it doesn't throw.
A 16-byte checksum at every boundary catches it cheaply.
BLAKE2b and xxHash are both fast enough that the cost is negligible for typical message sizes.
Verify at every boundary so corruption is caught where it is introduced, not three systems downstream.

This is a companion to the deeper write-up on Medium: "Checksum Everything: Corruption Caught Before Catastrophe" by The Speed Engineer.
