Why Your Binary Protocol Should Care About CPU Cache Lines
If you've ever designed a custom binary protocol for a hot path — a game server, a market-data feed, an internal RPC — you've probably obsessed over byte layout, alignment, and zero-copy parsing.
There's one detail most tutorials skip that quietly costs you 2-5x throughput: cache line alignment.
The 64-byte secret
Modern CPUs don't read memory one byte at a time. They read in chunks called cache lines: typically 64 bytes on x86_64 and most ARM cores (some, like Apple's M-series, use 128). Every load that misses L1 pulls in a full cache line. Every store that has to be visible to other cores invalidates that cache line in those cores' caches.
If your protocol's "hot fields", the bits the receiver reads first and most often, straddle the boundary between two cache lines, you've just doubled your memory traffic and gotten nothing for it.
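You don't have to guess the number. C++17 exposes the CPU's interference size, which on mainstream parts is the cache line size. A minimal sketch, assuming a toolchain whose standard library actually defines the constant (older ones don't, hence the fallback):

```cpp
#include <cstddef>
#include <iostream>
#include <new>

int main() {
    // std::hardware_destructive_interference_size is the C++17 name for
    // "how far apart do two objects need to be to avoid false sharing",
    // i.e. the cache line size on typical x86_64 and ARM hardware.
#ifdef __cpp_lib_hardware_interference_size
    constexpr std::size_t line = std::hardware_destructive_interference_size;
#else
    constexpr std::size_t line = 64;  // safe assumption when the library doesn't define it
#endif
    std::cout << "cache line: " << line << " bytes\n";
}
```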
A worked example
Picture a naive market-data tick struct: a uint8_t type tag, a uint64_t timestamp, a uint32_t symbol id, an 8-byte price, an 8-byte sequence number, and an 8-bit flags field. That's 30 bytes of payload, but with natural alignment the compiler pads it out to 48 bytes on a typical x86_64 ABI. Looks fine.
Now imagine a queue of 1024 of these. At 48 bytes apiece, if the array starts on a cache-line boundary, half of the ticks straddle two cache lines, and every read of a straddling tick costs two cache-line fills instead of one. Multiply by 1024.
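Here's that layout spelled out as one plausible sketch. The field names and the use of double for the price are assumptions, and the exact offsets depend on your ABI:

```cpp
#include <cstdint>
#include <cstdio>

// The naive tick described above, fields in declaration order.
struct Tick {
    std::uint8_t  type;       // offset 0, then 7 bytes of padding
    std::uint64_t timestamp;  // offset 8
    std::uint32_t symbol;     // offset 16, then 4 bytes of padding
    double        price;      // offset 24
    std::uint64_t sequence;   // offset 32
    std::uint8_t  flags;      // offset 40, then tail padding to 48
};

int main() {
    std::printf("sizeof(Tick) = %zu\n", sizeof(Tick));  // 48 on a typical x86_64 ABI
    // Packed back to back, tick[1] occupies bytes 48..95 and therefore
    // straddles the cache-line boundary at byte 64: two fills per read.
    std::printf("tick[1] spans bytes %zu..%zu\n", sizeof(Tick), 2 * sizeof(Tick) - 1);
}
```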
The fix is unsexy but free:
- Pick a fixed message size that divides 64 cleanly — 32 or 64 bytes.
- Order fields so the receiver's first read (often `type` and `timestamp`) sits in the same cache line.
- Pad the struct explicitly so two consecutive messages never share a cache line. Use `alignas(64)` and a `static_assert(sizeof(T) == 64)` so the day someone "innocently" adds a field, the build screams.
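A minimal sketch of what that looks like, reusing the hypothetical tick from above:

```cpp
#include <cstdint>

// Hot fields first, explicit cache-line alignment, and a compile-time
// guard on the size. alignas(64) also rounds the struct up to 64 bytes,
// so two consecutive messages can never share a line.
struct alignas(64) Tick {
    std::uint8_t  type;       // the receiver's first read...
    std::uint64_t timestamp;  // ...sits in the first 16 bytes
    std::uint32_t symbol;
    double        price;
    std::uint64_t sequence;
    std::uint8_t  flags;
};

static_assert(sizeof(Tick) == 64, "Tick must stay exactly one cache line");
static_assert(alignof(Tick) == 64, "Tick must start on a cache-line boundary");
```

The day someone adds a field that pushes the size to 128, the static_assert fails the build instead of silently halving your throughput.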
Why I keep harping on this
The performance work that pays compounding returns isn't algorithmic — it's mechanical sympathy. Knowing what the CPU and memory subsystem are actually doing under your code.
Two more rules I keep nailed to my desk:
- Hot fields first. Whatever the receiver reads in the first nanosecond should be in the first 16 bytes.
- Never share a cache line between writers. If two different threads update fields in the same struct, they'll fight even when they're not touching the same field. False sharing is the silent throughput killer.
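A minimal sketch of that second rule, with illustrative names: two threads bump two different counters, and `alignas(64)` keeps each counter on its own line so the writers stop invalidating each other.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Without the alignas(64), both atomics would typically land in the same
// cache line, and every increment from one thread would invalidate that
// line in the other core's cache: classic false sharing.
struct Counters {
    alignas(64) std::atomic<std::uint64_t> produced{0};
    alignas(64) std::atomic<std::uint64_t> consumed{0};
};

int main() {
    Counters c;
    std::thread producer([&c] {
        for (int i = 0; i < 1'000'000; ++i)
            c.produced.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread consumer([&c] {
        for (int i = 0; i < 1'000'000; ++i)
            c.consumed.fetch_add(1, std::memory_order_relaxed);
    });
    producer.join();
    consumer.join();
}
```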
The takeaway
Your protocol's wire format isn't just about bytes on the network. It's about bytes in the receiver's L1 cache. Treat them with the same respect.
I wrote a longer deep-dive on Medium with benchmarks and the C++ template trick I use to generate cache-line-correct messages from a schema:
Binary Protocols: Designing Messages For Cache Lines
While I spend a lot of time in the low-level performance trenches, my main projects right now sit at the other end of the stack — PromptShip, a shared prompt library for non-technical teams using ChatGPT, Claude, and Gemini, and FillTheTimesheet, a time-tracking tool that's saved a few hundred freelancers from Sunday-night timesheet panic. The mechanical sympathy mindset shows up in user-facing software too — every "feels instant" UX is a thousand tiny cache-friendly decisions, just at a different layer of the stack.
Originally published on Medium by The Speed Engineer.