kamal namdeo

False Sharing: The Silent Performance Killer in Concurrent Go

Your goroutines never touch each other's data. Your atomics are correct. Your mutexes are in place. And yet your program slows down as you add more cores. Here is why — and how to fix it.


In this article, I will cover:

  1. The hardware reality: why memory access is never just one variable
  2. What a cache line actually is
  3. False sharing: correctness without performance
  4. Three patterns that silently kill your throughput
  5. The fix: padding and struct layout
  6. Cache line sizes across architectures
  7. How the Go standard library handles this
  8. Rules to wire into your design instincts

THE HARDWARE REALITY MOST DEVELOPERS IGNORE

When you write a concurrent Go program, you think in variables: goroutine A writes to hits, goroutine B writes to misses. They are different variables. There is no data race. Your atomics are correct. Everything should be fine.

But your program gets slower as you add more cores. You profile it. There is no lock contention. No goroutine is blocked. The CPU utilisation looks high. And yet, throughput does not scale.

The bug is not in your code. It is in your mental model of how memory works.

The CPU does not fetch individual variables. It fetches 64-byte chunks of memory. Always.

This single fact — which most developers never internalise — is the root cause of an entire class of concurrency bugs called false sharing. Understanding it changes how you design every concurrent data structure you will ever write.


WHAT A CACHE LINE ACTUALLY IS

Modern CPUs have a memory hierarchy. Reading from main RAM costs roughly 100–300 clock cycles. Reading from L1 cache costs 3–5 cycles. This 20–100x gap is why caches exist — and why programs that use them well run dramatically faster than programs that do not.

But the cache does not work at the byte level. It works at the cache line level. A cache line is a 64-byte contiguous chunk of memory. Whenever the CPU needs any byte, it loads the entire 64-byte chunk that contains it into cache. You never get just the byte you asked for; you always get the whole 64-byte neighborhood.

RAM (simplified view):

Address:  0    8    16   24   32   40   48   56   64   72 ...
          |    |    |    |    |    |    |    |    |    |
          [────────────── cache line 1 ──────────────][── cache line 2 ──...
                         (64 bytes)                        (64 bytes)

Each core gets its own independent copy of any cache line it needs. If Core 1 and Core 2 both need data from the same 64-byte block, they both hold a copy. Now the hardware has a problem to solve: what happens when those copies diverge?

The coherence protocol

CPUs solve this with a cache coherence protocol (MESI is the most common variant). The rule is simple and absolute: when any core modifies a cache line, every other core's copy of that cache line is immediately marked invalid. Any core that subsequently tries to read or write that line must discard its invalidated copy and fetch a fresh one — from the modifying core's cache or from main memory, at a cost on the order of a main-memory access (100–300 cycles).


FALSE SHARING: WHEN INDEPENDENCE IS AN ILLUSION

False sharing occurs when two goroutines running on different cores are writing to different variables — logically unrelated variables — but those variables happen to sit within the same 64-byte cache line.

The defining characteristic of false sharing: your code is correct, your synchronisation is right, and you are still paying main-memory prices because of how your data happens to be laid out in RAM.
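The effect is easy to reproduce. Below is a minimal timing sketch (not a rigorous benchmark) that runs eight goroutines incrementing packed counters, then padded ones. The names shared, isolated, and timeIt are mine, invented for this illustration:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

const (
	goroutines = 8
	iters      = 1_000_000
)

// shared packs eight counters into a single 64-byte cache line.
type shared struct{ c [goroutines]atomic.Int64 }

// isolated pads each counter out to its own 64-byte line.
type isolated struct {
	c [goroutines]struct {
		v atomic.Int64
		_ [56]byte
	}
}

// timeIt runs `goroutines` workers, each incrementing only its own counter.
func timeIt(name string, inc func(i int)) time.Duration {
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < goroutines; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			for n := 0; n < iters; n++ {
				inc(i)
			}
		}(i)
	}
	wg.Wait()
	d := time.Since(start)
	fmt.Printf("%-9s %v\n", name, d)
	return d
}

func main() {
	var s shared
	var p isolated
	// On a multi-core machine the padded version is typically
	// several times faster, despite doing identical work.
	timeIt("shared", func(i int) { s.c[i].Add(1) })
	timeIt("isolated", func(i int) { p.c[i].v.Add(1) })
}
```

Both variants do exactly the same number of atomic increments; only the memory layout differs.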


THREE PATTERNS THAT SILENTLY KILL YOUR THROUGHPUT

1. A slice of counters written by N goroutines

You have a slice of integers, one per goroutine, and each goroutine only ever writes to its own index.

counters := make([]int64, 8)

for i := 0; i < 8; i++ {
    go func(i int) {
        for {
            counters[i]++  // only touches index i
        }
    }(i)
}

In memory, eight int64 elements (8 bytes each) sit contiguously:

Address:  0        8        16       24       32       40       48       56
          [ c[0] ][ c[1] ][ c[2] ][ c[3] ][ c[4] ][ c[5] ][ c[6] ][ c[7] ]
          |────────────────────────────────────────────────────────────────|
                          one single cache line (64 bytes)

Every increment of c[0] by Core 0 invalidates the cache line for all seven other cores: each core holds its own copy of that one line, and a write by any core invalidates every other copy. Eight cores are now bouncing a single cache line through main memory like a hot potato.

2. A mutex and the data it protects on the same cache line

type SafeCounter struct {
    mu    sync.Mutex  // 8 bytes at offset 0
    value int64       // 8 bytes at offset 8
}

The act of contending for a lock invalidates the very cache line that holds the protected data. The goroutine that just won the lock pays a main-memory penalty (100+ cycles) just to read the data it fought to protect.

Here is the step-by-step breakdown of the performance collapse:

  • Core 1 wants the lock: It executes an instruction to acquire mu. To do this, it pulls the cache line containing mu into its own L1 cache and marks it as Modified.
  • Core 2 tries to get the lock: It looks for mu. It sees that Core 1 owns the cache line. Core 2’s hardware must now wait for Core 1 to "give up" that line.
  • Core 1 updates the value: Since value is on the same cache line that Core 1 already "owns" for the mutex, this part is fast—initially.
  • Core 1 releases the lock: It writes to mu to unlock it.
  • Core 2 finally gets the line: The hardware intercepts Core 2’s request. Because Core 1 modified that line, the hardware invalidates Core 1's copy and forces the data to be synchronised and fetched from RAM.

This is the "Silent Killer" part: When Core 2 finally successfully acquires the mutex, it thinks it's ready to work. But because the entire 64-byte line was marked invalid during the "handover" from Core 1, the value variable is also gone from Core 2's cache.

Even though Core 2 now "owns" the lock, the very first time it tries to read SafeCounter.value, it hits a cache miss. It has to reach all the way out to slow main RAM to get the current value, wasting roughly 100–300 clock cycles.

The result

You are paying a "RAM Tax" twice:

  • Once to fight for the Mutex.
  • Once to read the Data the mutex was supposed to protect.

3. Atomic counters declared side by side

var (
    hits   atomic.Int64   // offset 0
    misses atomic.Int64   // offset 8
)

atomic.Int64 guarantees correctness, but it says nothing about cache lines. The atomic instruction itself takes nanoseconds; the cache miss that precedes it takes 100+ cycles.

Let's walk through an example. As shown in the snippet above, hits and misses sit on the same cache line.

  • Core 1 is responsible for incrementing hits.
  • Core 2 is responsible for incrementing misses.

Even though these two variables are logically unrelated, they are physically roommates. When Core 1 performs an atomic increment on hits:

  • It must lock the entire cache line to ensure atomicity.
  • It increments the value and marks the cache line as Modified (M).

Because the line is modified, Core 2’s cache copy of that entire 64-byte block is instantly marked Invalid (I). Now, Core 2 wants to increment misses.

  • Core 2 looks in its cache. It sees the line containing misses is Invalid.
  • The Penalty: Core 2 must wait for the hardware to fetch the updated line from Core 1 (or main memory). This takes ~100–300 cycles.
  • Core 2 performs its atomic increment, which in turn invalidates Core 1’s copy.

They are now "ping-ponging" the cache line back and forth across the CPU bus.


THE FIX: PADDING AND STRUCT LAYOUT

The solution is to force hot fields onto separate cache lines using padding.

Fix 1: Padded slice elements

type PaddedInt struct {
    Val int64
    _   [56]byte  // 8 + 56 = 64 bytes total = one full cache line
}

counters := make([]PaddedInt, 8)

Fix 2: Padding between mutex and protected data

type SafeCounter struct {
    mu    sync.Mutex
    _     [56]byte  // pad mu to its own cache line
    value int64
    _     [56]byte  // pad value to its own cache line
}

Field order is load-bearing

Go lays out struct fields in declaration order. Padding must go between the fields you want to isolate.

// CORRECT: padding acts as a wall between hits and misses

type Good struct {
    hits     atomic.Int64
    _        [56]byte   // wall after hits
    misses   atomic.Int64
    _        [56]byte   // wall after misses
}

CACHE LINE SIZES ACROSS ARCHITECTURES

Architecture           Cache line size   Notes
x86-64 (Intel, AMD)    64 bytes          Standard server hardware
ARM64 (AWS Graviton)   64 bytes          Cloud ARM servers
Apple M1/M2/M3         64 / 128 bytes    L2 uses 128-byte chunks
IBM POWER9/10          128 bytes         128-byte lines throughout

Pro Tip: For portable code, pad to 128 bytes. It wastes a bit of RAM but ensures safety across all modern high-performance architectures.


HOW THE GO STANDARD LIBRARY HANDLES THIS

The Go team uses this sparingly but effectively.

  • runtime.p: The processor struct is explicitly padded to prevent adjacent p structs in the internal array from invalidating each other's run queues.
  • sync.Pool: The poolChain struct pads its head and tail pointers. Since producers write to the head and consumers read from the tail, padding prevents these two roles from fighting over the same cache line.

RULES TO WIRE INTO YOUR DESIGN INSTINCTS

  1. Slices written by N goroutines: Use a padded wrapper struct.
  2. Embedded Mutexes: Pad the mutex if the protected data is "hot."
  3. Atomic Counters: Pad them apart if they are logically independent.
  4. Scaling issues: If performance degrades as you add cores, suspect false sharing and profile cache miss rates.
  5. Field Order: Remember that padding is a wall, not a footer. Place it physically between the fields you need to separate.

In distributed systems, you think about network round trips. In concurrent code, the equivalent cost is: will this memory access hit L1 cache, or main memory?

Getting this right is what separates code that scales from code that merely passes the race detector.
