- Book: The Complete Guide to Go Programming
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
A team I talked to had a feature-flag cache. A small map of
flag-name to bool, refreshed from a config service every 30
seconds. Every HTTP handler in their service read from it on
the way in. The first version used sync.RWMutex because that
is what every Go intro reaches for when you say "read-heavy".
Under load testing the read path looked fine. Under real
traffic (tens of thousands of req/s across dozens of cores) the
p99 climbed. A flame graph put runtime.semacquire and
sync.(*RWMutex).RLock on the hot path. The mutex was not
contended in the way people usually mean. It was paying
cache-line traffic on every reader because every RLock writes
the reader-count field. Many cores fighting for the same cache
line is a cost you do not see at lower load.
They swapped the cache for atomic.Pointer[map[string]bool].
Readers do an atomic load. Writers build a fresh map and store
the pointer. The reader path went from a mutex acquire to a
single load. p99 dropped, the flame graph thinned out, and the
team stopped paying the RLock tax.
What atomic.Pointer[T] gives you
Go 1.19 added typed atomics. Before that you used
atomic.Value (interface-typed, no compile-time guarantees) or
atomic.LoadPointer with unsafe.Pointer casts. The typed
variant, atomic.Pointer[T], gives you a pointer-sized atomic
field with a generic T:
```go
package cache

import "sync/atomic"

type Flags struct {
	p atomic.Pointer[map[string]bool]
}

func (f *Flags) Get(name string) bool {
	m := f.p.Load()
	if m == nil {
		return false
	}
	return (*m)[name]
}

func (f *Flags) Set(m map[string]bool) {
	f.p.Store(&m)
}
```
Load and Store compile to single ordered memory accesses on every
architecture Go targets: a plain MOV for the load on amd64 (the
store uses XCHG to get the ordering), an LDAR/STLR pair on arm64.
The hardware does the
synchronization for you. Readers never block writers. Writers
never block readers.
The trick is what gets stored: a pointer to an immutable map.
Once Set publishes a map, that map is read-only forever.
Updates do not mutate it. They build a new one and Store the
new pointer. Readers that already grabbed the old pointer keep
reading the old map until they release their reference. The
garbage collector reclaims the old map after the last reader
drops it.
That immutability rule is what the whole pattern depends on. If
any reader or writer mutates the map after Store, you lose
the safety property and go test -race will tell you about it
loudly.
The swap-the-whole-map idiom
The pattern at production scale looks like this. A reader path
that is one atomic load. A writer path that copies the current
map, applies the change, and stores the new pointer.
```go
package featureflags

import "sync/atomic"

type Cache struct {
	p atomic.Pointer[map[string]bool]
}

func New(initial map[string]bool) *Cache {
	c := &Cache{}
	c.p.Store(&initial)
	return c
}

func (c *Cache) Lookup(name string) bool {
	m := c.p.Load()
	return (*m)[name]
}

func (c *Cache) Replace(next map[string]bool) {
	c.p.Store(&next)
}
```
The reader path is one atomic pointer load followed by a plain map
lookup. No mutex acquire, no read-counter increment, no semaphore.
The writer is what you have to think about. Replace takes
ownership of next: the caller must not mutate next after
calling Replace. That is the contract. If you want a
read-modify-write style update (toggle one flag, leave the rest
alone), you build a fresh map by copying and replacing:
```go
func (c *Cache) Toggle(name string) {
	old := c.p.Load()
	next := make(map[string]bool, len(*old)+1)
	for k, v := range *old {
		next[k] = v
	}
	next[name] = !next[name]
	c.p.Store(&next)
}
```
Readers can call Lookup while Toggle's replacement map is
half-built; they see the old map. After Store returns, new Lookup calls
see the new map. There is no torn state. The pointer write is
atomic, and both maps are complete and immutable when readers
see them.
If two writers race, one wins and one's work is lost. That is
the cost of going lock-free without a CAS retry loop. For a
config-refresher that runs every 30 seconds, single-writer is
the natural shape. For something with many concurrent writers
(a counter cache, a session store), atomic.Pointer alone is
the wrong tool (see the "when it's wrong" section).
Why this beats RWMutex on read-heavy work
sync.RWMutex is a real mutex with a reader counter. Every
RLock does an atomic add on the counter and an atomic check
of the writer-waiting flag. The counter lives in one cache line.
Many cores hammering one cache line means those cores
serializing on the cache-coherence protocol. The mutex itself
is uncontended (no writer waiting), but the cache line is
contended on every RLock.
atomic.Pointer[T].Load is a single load. The pointer value
itself is read-only between writes. Cores can cache the pointer
in their L1, and as long as no writer stores, no cache-line
invalidation fires. Reads scale linearly with cores until
memory bandwidth runs out, which happens at numbers most
services never reach.
The rough ordering on a hot read path, fastest to slowest,
holds across architectures: atomic load, uncontended mutex,
then RWMutex RLock under reader contention. atomic.Pointer.Load
sits at the top.
A benchmark shape that shows the gap. Your numbers will move
with hardware, Go version, and map size, so re-run on your
target:
```go
package cache_test

import (
	"sync"
	"sync/atomic"
	"testing"
)

func BenchmarkRWMutex(b *testing.B) {
	m := map[string]bool{"a": true, "b": false}
	var mu sync.RWMutex
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.RLock()
			_ = m["a"]
			mu.RUnlock()
		}
	})
}

func BenchmarkAtomicPointer(b *testing.B) {
	var p atomic.Pointer[map[string]bool]
	m := map[string]bool{"a": true, "b": false}
	p.Store(&m)
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mm := p.Load()
			_ = (*mm)["a"]
		}
	})
}
```
When you run this on your machine, the atomic-pointer read
should be meaningfully faster per op than the RWMutex read,
and the gap should widen as -cpu climbs. Run
go test -bench=. -benchmem -cpu=1,4,8,16 and watch the
RWMutex line scale poorly while the atomic-pointer line stays
flat.
What it costs
The win is real but not free.
Garbage retention. When you Store a new pointer, the old
map is unreachable from the cache field, but readers may still
hold a reference to the old map (the value they got from
Load). The GC cannot reclaim the old map until every reader
that grabbed the pointer has dropped it. For a 30-second
refresh, that means up to 30 seconds of two maps in memory
plus whatever requests were in flight at swap time. For a
1-second refresh on a 100MB map, you are doubling RAM during
the swap window.
The book-keeping rule: maps swap-replaced via atomic.Pointer
should be small enough that holding two of them in RAM during a
swap does not push you over budget. "Small" is a function of
your service, but a feature-flag map is small, a session cache
is not.
Whole-map copy on every update. Every writer-side change
copies the entire map. A 10k-entry map cloned every 30 seconds
is fine. A 10M-entry map cloned every 30 seconds is a CPU
disaster. The throughput ceiling on the writer is set by how
fast you can build a fresh copy, not by how many readers are
reading.
Single-writer assumption (or you build a CAS loop). Two
goroutines both calling Toggle race. One reads the old map,
the other reads the same old map, both build a new map, both
Store. Whichever stores second wins; the first writer's work
is silently dropped. To make multi-writer safe you wrap the
read-modify-write in a CAS loop:
```go
func (c *Cache) Toggle(name string) {
	for {
		old := c.p.Load()
		next := make(map[string]bool, len(*old)+1)
		for k, v := range *old {
			next[k] = v
		}
		next[name] = !next[name]
		if c.p.CompareAndSwap(old, &next) {
			return
		}
	}
}
```
CompareAndSwap is in atomic.Pointer since Go 1.19. The loop
retries on conflict. For low write rates, retries are rare and
the loop is cheap. For high write rates, the retries become the
work and a mutex would be cheaper.
When the swap-whole-map idiom is wrong
The shape only fits one quadrant of the cache design space:
small map, read-heavy, infrequent writes, single writer (or
low-rate multi-writer with CAS). Outside that quadrant, reach
for something else.
Large maps, frequent writes. Sharded sync.Map or
sync.RWMutex over a map. The whole-map copy kills you.
Item-level invalidation, high write rate. A request cache
where every request might update one entry. Sharded mutexes or
sync.Map (with its tombstone-and-promote design, see
sync/map.go in the Go source) do per-key locking instead of
per-snapshot replacement.
You need to delete an entry and reclaim its memory now.
With atomic.Pointer, the entry sticks around until the last
reader drops the snapshot. If "now" matters (security, GDPR
delete-right), explicit locking gives you the deterministic
release point.
Stronger isolation than eventual. atomic.Pointer gives
you a snapshot that was current when you Load. Two Load
calls in the same goroutine can see different snapshots if a
writer ran between them. If you need a stable view across
multiple reads, take one Load and pass the snapshot pointer
down the call stack. Do not call Load repeatedly.
What to do with this on Monday
Find the read-heavy caches in your service. Greps that work:
sync.RWMutex near a map[, anything called Cache with a
mutex field, route-table or feature-flag types refreshed on a
timer. For each one, ask three questions.
Is the map small enough to hold two copies in RAM during a
swap? Is the write rate low enough that a single writer (or a
short CAS retry loop) handles updates? Are reads actually the
hot path, i.e. does pprof show RLock or runtime.semacquire on
your traces?
If all three answer yes, an atomic.Pointer[map[K]V] swap is a
five-line change that takes the mutex off the read path. If any
one answers no, the mutex is doing real work, so leave it alone.
The failure modes are not symmetric. Use atomic.Pointer on a
write-heavy or large map and you allocate more and silently
drop writes. Stay on RWMutex under reader contention and you
pay cache-line traffic on every request. Read your pprof,
write the benchmark, and pick the primitive that matches the
workload.
If this was useful
The Complete Guide to Go Programming
covers the runtime mental model behind this idiom: the typed
atomics, the GC's view of pointer swaps, and the cache-coherence
costs that make RWMutex slow under reader contention. It also
covers the rest of sync and sync/atomic end-to-end. It is
part of Thinking in Go, the 2-book series, paired with
Hexagonal Architecture in Go
for the design-layer view of where caches like this fit inside
a service that survives on-call.