Imagine a typical scenario: you're optimizing a high-load backend or a network service. It doesn't matter what language you use — C, C++, Java, Go, or C#. You have several threads, and you decide to get rid of slow locks. After all, mutexes are a bottleneck, right?
You apply a classic pattern: instead of threads elbowing each other around a shared variable, you give each thread (or goroutine) its own independent counter. No shared data, no conflicts. You run your load tests expecting beautiful linear scaling, but the profiler shows something weird: the threads still seem to be waiting in line, and on a multi-core machine the code runs barely faster, sometimes even slower, than it does on a single core.
How is this possible if there are no more locks in the code?
Welcome to the real world, where your programming language's abstractions crash into harsh hardware realities. The problem is that the CPU doesn't care how you logically separated variables in your code. The culprit is False Sharing — an invisible hardware lock.
To understand why your perfect lock-free code is lagging, we need to look under the hood and see how the CPU actually reads memory.
Cache Lines: How the CPU Sees Memory
The CPU never reads memory one byte at a time; that would simply be too slow. Instead, it loads data in blocks called cache lines. On modern architectures, a single cache line is usually 64 bytes.
If the CPU needs a 4-byte variable, it grabs it along with the neighboring 60 bytes and puts the whole block into its cache.
Here is a very simplified memory hierarchy by access time:
- L1 Cache (per core): the fastest (~1 nanosecond), but small.
- L2 Cache (per core): slightly slower (~4 nanoseconds), but larger.
- L3 Cache (shared across all cores): slow (~15-20 nanoseconds), but massive.
- RAM: the "turtle" (~100 nanoseconds).
The Illusion of Independence
Let's go back to our network service and its packet counters. We create a C structure holding counters for two cores:
struct PacketCounters {
    int core1_count; // 4 bytes
    int core2_count; // 4 bytes
};

struct PacketCounters counters;
Question: Isn't the structure a single entity? How can one core update part of it while another updates the rest?
The answer lies in the fact that the CPU doesn't care about structures. Structures exist only in the programming language; to the processor, it's all just a continuous block of memory. core1_count and core2_count sit right next to each other, and together they take up only 8 bytes, so they will almost certainly land in the exact same 64-byte cache line.
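You can check this yourself by printing the field offsets and the cache-line index of each address. Below is a minimal sketch; the structure is repeated so the snippet compiles on its own, and the hardcoded 64 is an assumption matching the typical line size:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct PacketCounters {
    int core1_count; // 4 bytes
    int core2_count; // 4 bytes
};

int main(void) {
    struct PacketCounters counters;

    // Offsets inside the structure: 0 and 4.
    printf("offset of core1_count: %zu\n", offsetof(struct PacketCounters, core1_count));
    printf("offset of core2_count: %zu\n", offsetof(struct PacketCounters, core2_count));

    // Dividing an address by 64 gives the index of its cache line.
    // With a 64-byte line, both fields will almost always print the same index.
    printf("line of core1_count: %lu\n", (unsigned long)((uintptr_t)&counters.core1_count / 64));
    printf("line of core2_count: %lu\n", (unsigned long)((uintptr_t)&counters.core2_count / 64));
    return 0;
}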
The Ping-Pong Effect, or the "Hardware Mutex"
We thought we got rid of the mutex because Core 1 only writes to core1_count, and Core 2 only writes to core2_count. But here is what happens at the hardware level:
- Core 1 increments core1_count. It loads the entire 64-byte line into its L1 cache and changes the value there.
- The CPU's cache coherence protocol (MESI or one of its variants) sees that the data in the line has changed. To keep every core's view consistent, it invalidates (marks as stale) this exact cache line in the caches of all other cores.
- Core 2 wants to increment core2_count. It looks in its L1 cache, but its copy of the line has just been invalidated! It is forced to re-fetch the line through the slower L3 cache or from RAM.
- Core 2 changes the value and... invalidates the cache for Core 1.
Even if the cores are working strictly in parallel, this 64-byte memory line bounces between them like a ping-pong ball. An "invisible mutex" is formed at the hardware level.
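If you want to see the ping-pong with your own eyes, here is a rough sketch using two POSIX threads, each hammering only its own field of the shared structure. The iteration count, the use of pthreads, and the volatile qualifier (so the compiler doesn't optimize the loops away) are my own assumptions for illustration, not a rigorous benchmark:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000

struct PacketCounters {
    volatile int core1_count; // adjacent fields: same cache line
    volatile int core2_count;
};

static struct PacketCounters counters;

// Each worker touches only "its" field, yet the line still ping-pongs.
static void *worker1(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERATIONS; i++)
        counters.core1_count++;
    return NULL;
}

static void *worker2(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERATIONS; i++)
        counters.core2_count++;
    return NULL;
}

int main(void) {
    struct timespec start, end;
    pthread_t t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_create(&t1, NULL, worker1, NULL);
    pthread_create(&t2, NULL, worker2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("two threads, adjacent counters: %.2f s\n", elapsed);
    return 0;
}

Built with something like gcc -O2 -pthread, this two-thread version typically fails to scale; swapping in the padded structure from the next section is usually what restores the expected speedup.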
How to Fix It?
Since the problem is that independent pieces of data sit too close together, the solution is obvious: move them apart.
We need to make sure that each core's counter lives in its own cache line. To do this, we pad the variables (adding empty bytes) to blow up the size to 64 bytes:
#include <stdalign.h>

// Using compiler alignment
struct PaddedCounter {
    alignas(64) int count;
    // The compiler will automatically add 60 bytes of padding here
};
struct PaddedCounter counters[2]; // Now they are definitely in different cache lines
Because we artificially stretched PaddedCounter to 64 bytes, counters[0] (for Core 1) and counters[1] (for Core 2) are now placed in completely separate memory blocks. They are guaranteed to sit in different cache lines and will no longer invalidate each other's cached copies.
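Plugged into the earlier pthread sketch, the fix is just a change of data layout. The worker below is again an illustrative assumption (two threads, indices 0 and 1), not the only way to wire it up:

#include <pthread.h>
#include <stdalign.h>

#define ITERATIONS 100000000

// Each counter now occupies its own 64-byte cache line.
struct PaddedCounter {
    alignas(64) volatile int count; // volatile, as in the earlier sketch
};

static struct PaddedCounter counters[2];

// A single worker function is enough: the thread index selects the slot.
static void *worker(void *arg) {
    int id = (int)(long)arg;
    for (long i = 0; i < ITERATIONS; i++)
        counters[id].count++;
    return NULL;
}

// Launched from main() the same way as before:
//   pthread_create(&t1, NULL, worker, (void *)0);
//   pthread_create(&t2, NULL, worker, (void *)1);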
Summary: Lock-free programming is great. But if you forget that the CPU manages memory in 64-byte chunks, you might end up with code that runs slower on a multi-core machine than on a single thread.
💡 A quick footnote: is a cache line always 64 bytes?
In the examples above, we hardcoded 64 bytes, which is the standard for the vast majority of modern processors. But in production code, hardcoding magic numbers is bad practice (what if the code runs on an architecture with a 128-byte cache line?). Therefore, real projects take the size from the platform instead: C++17, for example, added the constant std::hardware_destructive_interference_size, and in Java developers don't have to think about bytes at all; adding the @Contended annotation tells the JVM to automatically spread the fields across different cache lines.
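In plain C there is no standard equivalent, but on Linux with glibc you can at least ask the OS at run time instead of guessing. Treat the following as a platform-specific sketch; the sysconf name used here is a glibc extension, not portable C:

#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Reports the L1 data cache line size in bytes on glibc/Linux;
    // may return 0 or -1 if the value is unknown.
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line <= 0)
        line = 64; // fall back to the common default
    printf("cache line size: %ld bytes\n", line);
    return 0;
}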