A team I talked to last month was chasing a CPU spike on a
hot-path service. Twelve thousand requests per second, p99
crawling north of forty milliseconds for no obvious reason.
The code looked clean: a request counter behind a channel,
a goroutine that drained the channel and updated a metric.
"Share memory by communicating," the slogan goes. They
followed it.
The fix was three lines. They replaced the counter channel
with a sync.Mutex and a plain int64. CPU dropped by
roughly a third. p99 came back to a flat line. The channel
was the bottleneck.
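In sketch form (the names here are hypothetical; the real
service had more around it), the change looked like this:

var (
	mu           sync.Mutex
	requestTotal int64 // hypothetical metric backing the counter
)

// Before: every increment funnels through a channel and a
// drain goroutine.
requests := make(chan int, 1024)
go func() {
	for v := range requests {
		requestTotal += int64(v)
	}
}()
requests <- 1 // hot path

// After: the hot path takes the lock and increments in place.
mu.Lock()
requestTotal++
mu.Unlock()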
This is not a story about channels being bad. Channels are
the right tool for the job they were designed for.
"Counter behind a channel" is not that job, and the gap
between channel and mutex is large enough that picking the
wrong one is a measurable production cost.
The numbers, roughly
The Go runtime's channel implementation lives in
runtime/chan.go.
A send on a buffered channel takes the channel's lock,
copies the value into the ring buffer, possibly wakes a
parked receiver, and releases the lock. A receive mirrors
this. None of those steps are expensive in isolation. The
mutex around the channel state is the expensive part on a
contended path, and the goroutine park/wake when the
buffer is empty or full is more expensive still.
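To make that structure concrete, here is a toy model of a
buffered channel built from the pieces just described: a
mutex, a ring buffer, and park/wake (via condition
variables here). It is a model for intuition, not the
runtime's code; the real chansend also hands values
directly to a waiting receiver, handles select and closed
channels, and parks through the scheduler rather than a
condvar.

// toyChan models the per-op shape of a buffered channel:
// every send and receive serialises on one mutex, and a
// full or empty buffer parks the goroutine.
type toyChan struct {
	mu       sync.Mutex
	notFull  *sync.Cond
	notEmpty *sync.Cond
	buf      []int
	head, n  int
}

func newToyChan(capacity int) *toyChan {
	c := &toyChan{buf: make([]int, capacity)}
	c.notFull = sync.NewCond(&c.mu)
	c.notEmpty = sync.NewCond(&c.mu)
	return c
}

func (c *toyChan) send(v int) {
	c.mu.Lock() // the lock every sender contends on
	for c.n == len(c.buf) {
		c.notFull.Wait() // buffer full: park (the expensive case)
	}
	c.buf[(c.head+c.n)%len(c.buf)] = v // copy into the ring buffer
	c.n++
	c.notEmpty.Signal() // wake a parked receiver, if any
	c.mu.Unlock()
}

func (c *toyChan) recv() int {
	c.mu.Lock()
	for c.n == 0 {
		c.notEmpty.Wait() // buffer empty: park
	}
	v := c.buf[c.head]
	c.head = (c.head + 1) % len(c.buf)
	c.n--
	c.notFull.Signal() // wake a parked sender, if any
	c.mu.Unlock()
	return v
}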
Expect roughly these orders of magnitude on a modern x86
box (verify on your hardware and Go version — exact numbers
shift across CPUs, kernels, and runtime releases, and the
runtime ships its own benchmarks in
runtime/chan_test.go
that you can run yourself):
- sync.Mutex Lock+Unlock, uncontended: roughly tens of nanoseconds
- atomic.AddInt64: a handful of nanoseconds
- Buffered channel send+recv, single producer/consumer: several times the cost of a mutex op
- Unbuffered channel send+recv: higher still, because every op is a rendezvous that needs both sides present
- Contended channel under N producers/M consumers: easily microseconds, climbing fast with contention
These are commonly cited orders of magnitude across
published Go-runtime benchmarks. Your machine will produce
different absolute numbers. The ranking is what's stable: a
mutex op is meaningfully cheaper than a buffered channel
op, and an uncontended atomic is cheaper still. Under
contention the gap widens.
ch <- v and mu.Lock() look equally cheap on the page.
The runtime tells a different story: every channel op takes
the channel's internal lock, and under contention or a
full/empty buffer it falls into a parking call through the
scheduler; a mutex op is a single atomic operation in user
space until contention forces the goroutine to park.
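The asymmetry shows up in the fast paths. A sketch of the
mutex side (illustrative only; the real sync.Mutex does
this CAS first and adds spinning and starvation handling in
its slow path):

// fastMutex is a toy showing the shape of an uncontended
// lock: one atomic compare-and-swap, no scheduler involved.
// Assumes import "sync/atomic".
type fastMutex struct {
	state int32
}

func (m *fastMutex) tryLock() bool {
	return atomic.CompareAndSwapInt32(&m.state, 0, 1)
}

func (m *fastMutex) unlock() {
	atomic.StoreInt32(&m.state, 0)
}

A channel op never gets a path that cheap: even the
uncontended case takes the channel's internal lock.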
The benchmark: counter, two ways
Here is the shape that team had in production, reduced to
a benchmark you can run yourself.
package bench

import (
	"sync"
	"sync/atomic"
	"testing"
)

func BenchmarkCounterChannel(b *testing.B) {
	ch := make(chan int, 1024)
	done := make(chan struct{})
	var total int64
	// Single drain goroutine: the "counter behind a channel" shape.
	go func() {
		for v := range ch {
			total += int64(v)
		}
		close(done)
	}()
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			ch <- 1 // every increment is a channel send
		}
	})
	close(ch)
	<-done
	_ = total
}

func BenchmarkCounterMutex(b *testing.B) {
	var mu sync.Mutex
	var total int64 // plain int64 guarded by the mutex
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			total++
			mu.Unlock()
		}
	})
}

func BenchmarkCounterAtomic(b *testing.B) {
	var total atomic.Int64
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			total.Add(1)
		}
	})
}
Run it with go test -bench=Counter -cpu=1,4,8 -benchtime=3s.
The exact numbers depend on your CPU, kernel, and Go
version; expect roughly this ranking every time:
- BenchmarkCounterAtomic is the floor. A single LOCK XADD instruction. Expect single-digit nanoseconds per op uncontended.
- BenchmarkCounterMutex runs noticeably slower than atomic but stays flat. Cache-line ping-pong starts to bite at 8 cores; the overhead is bounded.
- BenchmarkCounterChannel falls off a cliff under contention. The runtime's channel mutex serialises every send. Worse, every full-buffer event parks the producer and every empty-buffer event parks the consumer.
That ranking is the stable part. A counter is shared state
with a single, small mutation. A channel forces every
mutation through a queue, and the queue is more machinery
than the mutation needs.
When the channel is the right answer
Now flip the workload. A worker pool dispatches incoming
jobs to N goroutines that each do real work. The channel
is no longer protecting a counter; it's transferring
ownership of a job from one goroutine to another.
package bench

import (
	"sync"
	"testing"
)

type Job struct {
	Payload [256]byte
}

func process(j Job) {
	// Stand-in for real work: touch every byte of the payload.
	var s byte
	for _, b := range j.Payload {
		s += b
	}
	_ = s
}

func BenchmarkWorkerPoolChannel(b *testing.B) {
	const workers = 8
	jobs := make(chan Job, 64)
	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for j := range jobs {
				process(j)
			}
		}()
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		jobs <- Job{} // ownership moves to whichever worker receives it
	}
	close(jobs)
	wg.Wait()
}
That's the channel version. Eight workers, one buffered
channel, jobs flow through. Now the same workload with a
mutex-protected slice and a condition variable, which is
what "do this with a mutex" actually looks like once you
add the parking and signalling the channel gives you for
free:
func BenchmarkWorkerPoolMutexQueue(b *testing.B) {
	var (
		mu     sync.Mutex
		cond   = sync.NewCond(&mu)
		queue  []Job
		closed bool
	)
	const workers = 8
	var wg sync.WaitGroup
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for {
				mu.Lock()
				for len(queue) == 0 && !closed {
					cond.Wait() // hand-rolled parking
				}
				if len(queue) == 0 && closed {
					mu.Unlock()
					return
				}
				j := queue[0]
				queue = queue[1:] // pop-front on a slice: the "bad ring buffer"
				mu.Unlock()
				process(j)
			}
		}()
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		mu.Lock()
		queue = append(queue, Job{})
		mu.Unlock()
		cond.Signal() // hand-rolled wake
	}
	mu.Lock()
	closed = true
	mu.Unlock()
	cond.Broadcast()
	wg.Wait()
}
The mutex+slice version is doing the same job as the
channel. It's not faster. Most of the time it's
slightly slower, because you've reimplemented a
ring buffer badly: slice grows, slice copies, no built-in
parking, condition-variable overhead on every wake. The
real-world version of "do this with a mutex" tends to be
buggier too: forgotten Broadcast, missed close
condition, append-and-slice costing more than the channel
op it replaced.
This is the case the channel was designed for. Ownership
transfer between goroutines. The job is allocated by the
producer, processed by the worker, and the worker is the
only goroutine that touches it after the channel hand-off.
There is no shared state. The channel is the
synchronisation mechanism, and the runtime's
park/wake/wake-the-next-waiter logic is exactly what you
want.
The decision rule
Three questions, in order:
1. Is the operation a single state mutation under
contention?
Counter, set membership, a small map of connection IDs to
last-seen timestamps. Use a mutex (or atomic if it fits in
one word). Channels add a queue you don't need.
2. Is the operation transferring ownership of a value
from one goroutine to another?
Job dispatch, request fan-out, a pipeline stage handing
work to the next stage. Use a channel. The runtime's
park/wake is what makes this efficient at all, and
reimplementing it with a mutex+condvar is how you
introduce subtle deadlocks.
3. Is the operation signalling, not data transfer?
"Stop the worker." "Reload the config." "I'm done."
Channels of struct{} are the canonical answer. So is
context.Done(), which is itself a channel. Cost per
operation is irrelevant; you're sending one signal.
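A minimal sketch of that case, reusing the Job and process
shapes from the benchmarks above:

// Signalling with a closed chan struct{}: no value moves,
// and a single close wakes every goroutine selecting on done.
func runWorker(jobs <-chan Job, done <-chan struct{}) {
	for {
		select {
		case <-done:
			return // "stop the worker"
		case j, ok := <-jobs:
			if !ok {
				return
			}
			process(j)
		}
	}
}

context.Done() is the same mechanism: cancel() closes the
channel that Done() returns, which is why one cancellation
can stop any number of goroutines.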
Effective Go's section on sharing
gets quoted as if it endorses channels for everything.
Read it again: "Do not communicate by sharing memory;
instead, share memory by communicating." The line is about
ownership transfer. Replacing every mutex with a channel
is a misreading of the proverb. It's a design principle,
not a microbenchmark ranking.
Where the channel cost actually bites
Three patterns where engineers reach for channels and the
cost shows up in p99:
- Counter behind a channel. The team I opened with. The channel serialises a write that has no contention problem to begin with — atomic or mutex is faster and simpler.
- Pub-sub with one channel per subscriber. Fan-out to N subscribers with one buffered channel each works until one subscriber is slow. Then every publish blocks behind the slowest receiver. A sync.RWMutex around a slice of callbacks is faster and gives you backpressure visibility; a sketch follows this list.
- "Pipelines" three stages deep where each stage does almost no work. Each channel hop carries non-trivial per-op overhead. A three-stage pipeline doing a handful of nanoseconds of real work per stage spends most of its time in the channel. Either collapse the stages or stop pretending it's a pipeline.
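What the callback-slice alternative from the pub-sub item
looks like, as a sketch (Broker and its methods are
illustrative names; unsubscription and panic handling are
omitted):

// Broker fans out synchronously: Publish reads the
// subscriber slice under RLock and invokes each callback
// in turn, so a slow subscriber shows up in the
// publisher's latency profile instead of silently filling
// a channel buffer.
type Broker struct {
	mu   sync.RWMutex
	subs []func(msg string)
}

func (b *Broker) Subscribe(fn func(msg string)) {
	b.mu.Lock()
	b.subs = append(b.subs, fn)
	b.mu.Unlock()
}

func (b *Broker) Publish(msg string) {
	b.mu.RLock()
	subs := b.subs // snapshot under the read lock
	b.mu.RUnlock()
	for _, fn := range subs {
		fn(msg) // runs outside the lock; slow subscribers are visible here
	}
}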
None of these mean "stop using channels." They mean
channels have a per-operation cost in the same league as
the work they're protecting, and at that point the
channel is the work.
Measure before you migrate
Microbenchmarks lie if you take them out of context. The
"channel op is fast" claim assumes a single
producer/consumer on a warm cache. Real services have
contention, NUMA effects, scheduler quirks under heavy
load, and goroutines that fight for the same channel
lock. Run go test -bench against the workload that
mirrors yours. Check go tool pprof for time spent in
runtime.chansend and runtime.chanrecv. If those two
are in your top ten flame-graph entries and the channel
isn't doing ownership transfer, you have the same bug
that team had.
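Concretely, assuming a benchmark that exercises the hot
path, that check is two commands:

go test -bench=Counter -cpuprofile=cpu.out
go tool pprof -top cpu.out

Look for runtime.chansend and runtime.chanrecv near the top
of the -top output; a live service exposes the same profile
through net/http/pprof.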
The fix is rarely "rewrite the whole concurrency model."
Usually it's one of: replace a counter-channel with an
atomic or mutex, replace a fan-out channel with a callback
slice, or collapse a pipeline that was never long enough
to justify the hops. None of those are large changes.
All of them benchmark.
The mental model that keeps you out of trouble: a channel
is a queue with park/wake semantics, and queues cost
something. Use one when the queue is the point. Reach for
a mutex when the state is the point.
If this was useful
Channels are one of the things Go got famously right, and
also one of the things tutorials oversimplify until
people use them in places where they cost more than they
help. The Complete Guide to Go Programming covers the
runtime side: how channels are implemented, how the
scheduler parks and wakes goroutines, when atomics beat
mutexes beat channels, and how to read a profile when the
runtime is your bottleneck.
- The Complete Guide to Go Programming — the runtime, the scheduler, and the cost model under every line of Go you write: xgabriel.com/go-book
- Hexagonal Architecture in Go — the other half of Thinking in Go: xgabriel.com/hexagonal-go
- Hermes IDE — an IDE for developers who ship with Claude Code and other AI coding tools: hermes-ide.com
- More posts and contact — xgabriel.com
