A team I talked to had one Prometheus counter incremented from every request handler. They reached for sync.Map because the metric registry was already keyed by name and the rest of the registry already used one. Under a load test, the counter showed up in CPU profiles. Not as the top frame, but high enough that one engineer started a Slack thread asking whether Go was the wrong language for hot-path counters.
The language was fine. The container was the problem.
You have a few reasonable ways to write a shared counter in Go. Most teams pick one out of habit and never benchmark the others. The habit is usually wrong for the workload, and the cost shows up only when traffic doubles. This post walks through the options, runs them under contention, and gives you a decision rule that fits in your head.
The three shapes
For a single counter, shared across goroutines, you have:
```go
package counter

import (
	"sync"
	"sync/atomic"
)

// 1. atomic.Int64 — the value-type API added in Go 1.19.
type AtomicCounter struct {
	v atomic.Int64
}

func (c *AtomicCounter) Inc()        { c.v.Add(1) }
func (c *AtomicCounter) Load() int64 { return c.v.Load() }

// 2. A plain mutex around an int64.
type MutexCounter struct {
	mu sync.Mutex
	v  int64
}

func (c *MutexCounter) Inc() {
	c.mu.Lock()
	c.v++
	c.mu.Unlock()
}

func (c *MutexCounter) Load() int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.v
}
```
For many counters keyed by name (the Prometheus shape), the third option is sync.Map holding *atomic.Int64 values:
```go
// 3. sync.Map holding *atomic.Int64 values, keyed by name.
type MapCounter struct {
	m sync.Map // map[string]*atomic.Int64
}

func (c *MapCounter) Inc(key string) {
	if v, ok := c.m.Load(key); ok {
		v.(*atomic.Int64).Add(1)
		return
	}
	fresh := new(atomic.Int64)
	actual, _ := c.m.LoadOrStore(key, fresh)
	actual.(*atomic.Int64).Add(1)
}

func (c *MapCounter) Load(key string) int64 {
	v, ok := c.m.Load(key)
	if !ok {
		return 0
	}
	return v.(*atomic.Int64).Load()
}
```
A note on the API. The old shape was atomic.AddInt64(&x, 1) with the address-and-function form. Go 1.19 added value-typed wrappers (atomic.Int64, atomic.Uint64, atomic.Pointer[T]) that you embed by value and call methods on. The machine instructions underneath are the same. The wrappers read more cleanly, and they prevent the foot-gun of accidentally passing the value instead of the pointer. Use them.
A bench you can run
The whole point of this post is that you should not trust anyone's numbers, including mine. The bench is small enough to copy into a project.
```go
package counter

import "testing"

func BenchmarkAtomicInc(b *testing.B) {
	var c AtomicCounter
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			c.Inc()
		}
	})
}

func BenchmarkMutexInc(b *testing.B) {
	var c MutexCounter
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			c.Inc()
		}
	})
}

func BenchmarkMapInc(b *testing.B) {
	var c MapCounter
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			c.Inc("requests")
		}
	})
}
```
b.RunParallel is the part that matters. It splits the iterations across GOMAXPROCS goroutines, which is what you actually want when you are measuring contention. A plain for i := 0; i < b.N; i++ benchmark will tell you single-thread cost and miss the entire question.
Run with:
```shell
go test -bench=. -benchmem -cpu=1,4,8 -count=10 \
  | tee bench.out
benchstat bench.out
```
The -cpu=1,4,8 switch reruns the benchmark at three contention levels, which is what makes the result honest. A counter that wins at -cpu=1 and loses at -cpu=8 is the wrong counter for production.
What you will see
The shape of the result is consistent across machines, even when the absolute numbers differ. On a quiet laptop, expect ns-per-op figures roughly in this order of magnitude:
- `AtomicInc` — single-digit nanoseconds at low contention, climbing to tens of nanoseconds at `-cpu=8`.
- `MutexInc` — tens of nanoseconds at `-cpu=1`, climbing to roughly an order of magnitude more at `-cpu=8`. Mutex contention is brutal in a tight loop because every losing goroutine parks.
- `MapInc` — slowest of the three by a comfortable margin, even though the inner increment is atomic. The cost is the `sync.Map` lookup, which does an interface-typed `Load` plus the type assertion on the way back.
Atomic and mutex live in different perf classes when the only work inside the lock is v++. The atomic add compiles down to a single CPU instruction with a memory ordering hint. The mutex path is a function call plus an atomic compare-and-swap, and on contention the losing goroutine parks. sync.Map does not get faster than the underlying atomic. It can only get slower, because it adds a lookup on top.
If your hot path is one counter, the choice is atomic.Int64. Anything else is paying for ergonomics you do not need.
When the mutex actually wins
The atomic-always rule has one important exception. If your update is more than one field, or if you increment several counters together as part of a single logical event, the mutex shape can be both faster and less error-prone.
```go
type Stats struct {
	mu       sync.Mutex
	requests int64
	errors   int64
	bytes    int64
}

func (s *Stats) Record(err bool, n int64) {
	s.mu.Lock()
	s.requests++
	if err {
		s.errors++
	}
	s.bytes += n
	s.mu.Unlock()
}
```
Three atomic adds in a row, with no lock between them, are not atomic as a group. A reader observing the state mid-update sees requests already bumped but bytes not yet, which makes ratios lie. The mutex gives you a consistent snapshot of all three fields, and you pay for only one lock acquire per event instead of three.
A useful benchmark here mirrors the one above, but the body of Inc does three updates instead of one. Compare:
- Three sequential `atomic.Add` calls.
- One mutex-guarded block.
At low contention the atomics win on raw nanoseconds. At realistic contention with three updates per event, the gap closes, and the mutex gives you a snapshot the atomic shape cannot give you without a separate lock. Pick consistency when the counters are read together. Pick atomics when the counters are independent.
When sync.Map earns its keep
sync.Map is documented for two cases: caches that are written once and read many times, and disjoint key-sets accessed from different goroutines. The implementation is tuned for read-mostly workloads with stable keys and few writes. A hot increment path is the opposite of that.
If your workload looks like this, sync.Map is fine and probably the right call:
- Tens of thousands of distinct counter names.
- Reads dominate by a wide margin.
- Writes are rare, bursty, and tolerate the slower path.
If your workload looks like a typical metric registry (a stable set of named counters, hot writes on every request, occasional reads from a scrape endpoint), sync.Map is the wrong tool. You want a regular map[string]*atomic.Int64 guarded by a sync.RWMutex, with the mutex held only for the rare insert and the increment running lock-free on the pointer:
```go
type Registry struct {
	mu sync.RWMutex
	m  map[string]*atomic.Int64
}

// NewRegistry initializes the map; a nil map would panic on insert.
func NewRegistry() *Registry {
	return &Registry{m: make(map[string]*atomic.Int64)}
}

func (r *Registry) Inc(key string) {
	r.mu.RLock()
	v, ok := r.m[key]
	r.mu.RUnlock()
	if ok {
		v.Add(1)
		return
	}
	// Slow path: take the write lock and double-check, since another
	// goroutine may have inserted the key between the two locks.
	r.mu.Lock()
	if v, ok = r.m[key]; !ok {
		v = new(atomic.Int64)
		r.m[key] = v
	}
	r.mu.Unlock()
	v.Add(1)
}
```
The hot path takes an RLock, reads the pointer, releases the RLock, and increments. The slow path runs once per new key. (For comparison, the standard library's expvar.Map keeps its keyed entries in a sync.Map; the standalone expvar.Int is just an atomic integer, no map involved.) The RWMutex registry scales better than sync.Map for this workload because the lookup avoids the interface hop and the type assertion.
The decision rule
Three lines you can paste into a code review checklist:
- One counter, hot path. Use `atomic.Int64`. Stop there.
- Several counters that move together per event. Use a `sync.Mutex` and update them inside one critical section.
- A registry of named counters with a stable, read-mostly key set. Use `map[string]*atomic.Int64` with `sync.RWMutex`. Reach for `sync.Map` only if the access pattern is the read-mostly, disjoint-keys shape the docs describe.
The team in the opening switched the registry from sync.Map to RWMutex plus pointer-to-atomic, and the counter dropped out of the CPU profile. Same metric. Same atomic increment underneath. A different container around it. The lesson is not that any one of these tools is bad. It is that the cost of "use whatever is already there" stays invisible until traffic finds it.
So run the bench, read benchstat, and pick the shape that fits the workload rather than the one that looks safest in a stack trace.
If this was useful
The concurrency primitives behind these counters are the kind of mechanical-sympathy work the Go books I wrote cover end-to-end: the memory model, why atomic.Int64 compiles to one instruction while sync.Mutex takes several, and what sync.Map is actually optimized for. The Complete Guide to Go Programming walks through the runtime, the memory model, and the trade-offs between the synchronization tools in the standard library. Hexagonal Architecture in Go covers how to keep these primitives behind a clean port so the choice does not bleed into your domain layer.
If you ship Go alongside an AI coding assistant, Hermes IDE is the editor I build for that workflow.
