You're staring at a CPU profile from a service that's eating 12GB of RAM and getting OOM-killed every four hours. Top function on the flame graph is runtime.gcBgMarkWorker. Below it, runtime.scanobject. Below that, your handler code, taking 1.4% of samples. Nothing else looks suspicious. The hot path is genuinely hot, but it's the hot path you've already tuned.
You scroll up. You scroll down. You re-record the profile twice. The flame graph keeps telling you the service is fine.
The flame graph is right. The service is also leaking goroutines.
CPU profiles measure work. A parked goroutine does no work. It sits on a channel receive, a select with no ready clauses, a WaitGroup.Wait, or a sync.Mutex.Lock. It burns exactly zero CPU. If you have fifty thousand of them, all asleep, the runtime profiler will not put a single sample on any of them.
These bugs eat weeks because the first tool people open is the wrong one. The right profile is one curl away.
Why CPU Profiles Miss This
Go's CPU profiler is a sampling profiler. It interrupts every running thread roughly 100 times per second and asks "what is this OS thread doing right now." If a goroutine is parked in a wait state, the runtime is not running it on any thread, so the SIGPROF handler never sees it.
That's the whole story. A flame graph of a leaking service looks identical to a flame graph of a healthy service, because the leaked goroutines contribute nothing to the sample set. RAM grows. Goroutine count grows. CPU usage stays flat or drops, because the same amount of real work is competing with more scheduler bookkeeping.
If you only ever check the CPU profile when something seems off, you can run a leaky service for months and never see the leak in any tool you opened. The metric you need is the goroutine count. The profile you need is the goroutine profile.
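You can demonstrate the blindness in one file. A minimal repro, using only the standard library: park fifty thousand goroutines, serve pprof, and compare the two profiles yourself. The count and port are arbitrary.

package main

import (
    "fmt"
    "net/http"
    _ "net/http/pprof"
    "runtime"
)

func main() {
    block := make(chan struct{}) // never closed, never sent on
    for i := 0; i < 50000; i++ {
        go func() { <-block }() // parks forever in [chan receive]
    }
    fmt.Println("goroutines:", runtime.NumGoroutine())

    // The CPU profile of this process is essentially empty.
    // The goroutine profile shows 50,000 identical stacks.
    http.ListenAndServe("localhost:6060", nil)
}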
The Cheapest Detector: NumGoroutine
runtime.NumGoroutine() returns the number of goroutines that currently exist. It costs almost nothing to call. The Prometheus client library exposes it as go_goroutines automatically; if you scrape that metric, you already have a leak detector and you're not using it.
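If you don't have a metrics pipeline at all, a crude in-process watcher gets you the same signal in plain logs. A sketch (imports: log, runtime, time; the one-minute interval is arbitrary):

// Somewhere in main: log the goroutine count periodically so a leak
// shows up in stdout logs even without Prometheus.
go func() {
    for range time.Tick(time.Minute) {
        log.Printf("goroutines=%d", runtime.NumGoroutine())
    }
}()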
The shape of a healthy service is flat or sawtooth. Flat means goroutine count tracks active connections or in-flight requests. Sawtooth means it grows during traffic peaks and falls back to baseline. Both are fine.
The shape of a leaking service is a line. Up-and-to-the-right, week after week, with no plateau. If you put go_goroutines on a Grafana panel next to memory and request rate, the line is the first thing you'll notice the next time something looks wrong.
A simple alert rule:
- alert: GoroutineLeak
  expr: go_goroutines > 5000 and deriv(go_goroutines[1h]) > 0
  for: 30m
Two conditions. The count is above a threshold you've decided is unreasonable for your workload, and it's still climbing. The threshold matters less than the slope. A service that holds 200 goroutines under normal load and now sits at 50,000 has a leak whether your threshold was 5,000 or 5.
A Leak Small Enough to Read
Before reading the profile, look at the leaks themselves. Two functions, each leaky because of a single line; both patterns are common in production code.
The first leaks because nobody calls cancel:
func process(parent context.Context, jobs []Job) {
    ctx, _ := context.WithCancel(parent)
    for _, j := range jobs {
        go worker(ctx, j)
    }
}
The _ is the bug. The cancel function is dropped on the floor. The workers will only exit when parent is cancelled, which might be much later than process returns. Every call to process adds workers that stay alive past their useful life. go vet's lostcancel check catches this exact pattern at build time, so the cheapest fix is wiring go vet ./... into CI before you reach for any profile.
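One fix, sketched against the same worker and Job as above: keep the cancel and decide explicitly when the workers should die. Here process blocks until they finish; if you need fire-and-forget, store the cancel on the owning struct and call it on shutdown instead.

func process(parent context.Context, jobs []Job) {
    ctx, cancel := context.WithCancel(parent)
    defer cancel() // releases the context once the workers are done

    var wg sync.WaitGroup
    for _, j := range jobs {
        wg.Add(1)
        go func(j Job) {
            defer wg.Done()
            worker(ctx, j)
        }(j)
    }
    wg.Wait() // no goroutine outlives process
}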
The second leaks because the receiver disappears:
func fanout(items []Item) <-chan Result {
    out := make(chan Result)
    for _, it := range items {
        go func(i Item) {
            out <- compute(i)
        }(it)
    }
    return out
}
The unbuffered send is the bug. After the caller reads one result and returns, every other out <- compute(i) parks forever. Every call to fanout adds goroutines that stay alive past their useful life.
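One conventional fix, sketched with the same Item and Result types: give the channel enough buffer for every send, so each goroutine completes its send and exits even when the caller walks away early.

func fanout(items []Item) <-chan Result {
    out := make(chan Result, len(items)) // every send completes without a receiver
    for _, it := range items {
        go func(i Item) {
            out <- compute(i)
        }(it)
    }
    return out
}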
A flame graph won't catch either. A goroutine profile catches both.
Reading goroutine.pprof
The endpoint you want is /debug/pprof/goroutine. If you've imported net/http/pprof and have an admin port open, it's already there:
import _ "net/http/pprof"
func main() {
go http.ListenAndServe("localhost:6060", nil)
// ...
}
Two ways to fetch it. Aggregated counts:
$ curl -s http://localhost:6060/debug/pprof/goroutine | head -1
goroutine profile: total 50318
Full stack traces, one per goroutine:
$ curl -s http://localhost:6060/debug/pprof/goroutine?debug=2 \
> goroutines.txt
$ wc -l goroutines.txt
201272 goroutines.txt
The debug=2 form prints stacks in the same format Go uses on panic. One stack per goroutine, with a wait state and a duration:
goroutine 48273 [chan receive, 18 minutes]:
github.com/example/notif.(*Service).pumpEvents(...)
        /app/notif/service.go:142 +0xa5
created by github.com/example/notif.(*Service).Subscribe
        /app/notif/service.go:88 +0x1c4
The wait state is the diagnosis. [chan receive] is parked on a read. [chan send] is parked on a write. [select] is a select with no ready clause. [semacquire] is a WaitGroup.Wait or mutex acquire that hasn't completed. [IO wait] is usually a network read; that one's normal in a server.
The duration is the second half of the diagnosis. A goroutine parked on chan receive for 18 minutes is doing something it almost certainly should not be doing.
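Before grouping whole stacks, a quick histogram of the wait states tells you which class of park dominates. One way to get it from the debug=2 headers, sketched here:

$ grep '^goroutine ' goroutines.txt \
    | sed 's/.*\[\([^],]*\).*/\1/' | sort | uniq -c | sort -rn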
The trick on a real dump with thousands of stacks is grouping. Strip the goroutine headers, then sort and count the remaining frame lines; the dominant frames float to the top:
$ awk '/^goroutine/{flag=1; next} flag && /^$/{flag=0} flag' \
goroutines.txt | sort | uniq -c | sort -rn | head
On a leak, one stack dominates:
49872 github.com/example/notif.(*Service).pumpEvents
42 github.com/example/notif.(*Service).process
8 net/http.(*conn).serve
Forty-nine thousand goroutines parked at the same source line. That's the leak. The remaining hundred or so are doing real work.
If you have go tool pprof installed, the same data is browsable interactively:
$ go tool pprof http://localhost:6060/debug/pprof/goroutine
(pprof) top
(pprof) list pumpEvents
top shows the call sites that account for the most goroutines. list annotates the source with goroutine counts per line. The source view is what you paste into the bug ticket.
Catching Leaks in Tests with goleak
The lowest-friction tooling change for any Go codebase is one file:
package mypkg

import (
    "testing"

    "go.uber.org/goleak"
)

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}
go.uber.org/goleak snapshots the goroutine list at the start and end of the test run. If anything is left over, the suite fails with stack traces pointing at the leaked goroutine.
A more granular form runs per-test:
func TestProcess(t *testing.T) {
    defer goleak.VerifyNone(t)
    process(context.Background(), []Job{{ID: 1}, {ID: 2}})
}
This test fails today on the buggy process shown earlier. The error names the function that started the goroutine and the line it's parked on. You fix it by passing a context you cancel, or by storing the cancel and calling it. The bug never reaches production.
goleak only catches what your tests exercise. The unit test for Subscribe always calls Unsubscribe. The disconnect-mid-stream path that pages someone at 3 AM is the one no test covered. For the test-noise gotchas (IgnoreCurrent, IgnoreTopFunction, the standard library's signal watcher) and a production-incident walkthrough that puts all of this in one timeline, see 50,000 Goroutines Took Down Prod at 3 AM.
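For reference, the noise knobs look like this; the function name below is illustrative, not from a real codebase:

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m,
        // Skip a long-lived background goroutine the suite doesn't own
        // (illustrative frame name; substitute the one your dump shows).
        goleak.IgnoreTopFunction("github.com/example/notif.(*Service).pumpEvents"),
        // Or snapshot whatever is already running before the tests start.
        goleak.IgnoreCurrent(),
    )
}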
Go 1.26: goroutineleakprofile
Go 1.26 ships an experimental runtime-native leak detector (release notes). Build with GOEXPERIMENT=goroutineleakprofile, then curl /debug/pprof/goroutineleak to pull the new profile.
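Assuming the same admin port as above (the binary path here is illustrative):

$ GOEXPERIMENT=goroutineleakprofile go build ./cmd/server
$ curl -s http://localhost:6060/debug/pprof/goroutineleak > leaks.pprof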
It uses GC reachability: any goroutine blocked on a concurrency primitive that is unreachable from any runnable goroutine is, by definition, leaked. What it cannot see is the other class: a leaked subscription whose channel is still rooted in a service-level map looks live to the GC even when nothing will ever send on it again. That kind of leak is what goleak and the regular goroutine profile catch, so the two detection modes have to coexist. The team intends to turn goroutineleakprofile on by default in 1.27; wire it into your debug builds now.
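A sketch of that second class, with hypothetical Event and deliver standing in for real code:

// The channel stays rooted in s.subs, so a reachability-based detector
// considers the parked reader live, even though no code path will ever
// send on ch again after the client disconnects.
type Service struct {
    mu   sync.Mutex
    subs map[int]chan Event // entry never removed on disconnect
}

func (s *Service) Subscribe(id int) {
    ch := make(chan Event)
    s.mu.Lock()
    s.subs[id] = ch
    s.mu.Unlock()

    go func() {
        for ev := range ch { // parks forever once sends stop
            deliver(id, ev)
        }
    }()
}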
A Five-Minute Drill for Tomorrow
Run this against any Go service in production that you suspect.
- Open the dashboard. Find the go_goroutines panel. If it's not there, add it. If the line trends up, you have a leak.
- Curl /debug/pprof/goroutine?debug=2 from one instance. Save the file.
- Run the awk-sort-uniq pipeline. The top stack is your leak site.
- Open the file at that source line. Find the go func or the context.WithCancel that should have a matching cancel and doesn't. Fix it.
- Add goleak.VerifyTestMain(m) to the package you just fixed. Watch the suite turn red. Fix the rest.
The CPU profile was never going to help. The leak you find tomorrow was already in the dump you didn't pull today.
If this was useful
Goroutines, channels, and context cancellation are where most production Go bugs live. The Complete Guide to Go Programming covers the runtime end of the language at the level production demands: the scheduler, channel semantics, context propagation, profiling, and the patterns that prevent the leaks above from being written in the first place.
- The Complete Guide to Go Programming: the language at the level production demands. xgabriel.com/go-book
- Hexagonal Architecture in Go: the architectural half of Thinking in Go. xgabriel.com/hexagonal-go
- Hermes IDE: an IDE for developers who ship with Claude Code and other AI coding tools. hermes-ide.com
- More posts and contact: xgabriel.com
