Catching Goroutine Leaks in Go Tests With goleak

#go #concurrency #testing

Book: The Complete Guide to Go Programming
Also by me: Hexagonal Architecture in Go — the companion book in the Thinking in Go series
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You've seen the graph. A Go service that idles at 40 goroutines
climbs to 4,000 over three days, then the pod gets OOM-killed and
restarts, and the count starts climbing again. Nothing in the test
suite failed. go test ./... is green. go vet is quiet. The leak
is real, it's in your code, and none of your tooling looked for it.

Goroutine leaks are the bug your tests are structurally blind to. A
test spawns a goroutine, asserts on a return value, and exits. The
goroutine it left parked on a channel read is somebody else's
problem — the test process already moved on. Multiply that by every
handler and worker in the codebase and you get the graph above.

goleak, from Uber, closes that gap. It snapshots the set of running
goroutines at the end of a test run and fails if any unexpected ones
are still alive. Setup is one file per package. Here is how leaks
happen, how to wire goleak in, and how to trace the stack it hands
you back to the exact go func() that leaked.

What a goroutine leak actually is

A goroutine leaks when it blocks forever on an operation that will
never complete, and nothing holds a reference that could unblock it.
The scheduler parks it. The garbage collector can't reclaim it,
because a parked goroutine is still a live root. It sits there
holding its stack, its captured variables, and whatever those point
at, until the process dies.

Here's a leak that looks like reasonable code:

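package worker

// Fanout doubles each item and streams the results on an unbuffered
// channel. Nothing lets the sender stop early: if the caller stops
// reading before the channel is drained, the goroutine blocks on the
// send forever.
func Fanout(items []int) <-chan int {
    out := make(chan int)
    go func() {
        for _, n := range items {
            out <- n * 2
        }
        close(out)
    }()
    return out
}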

The goroutine sends on an unbuffered channel. If the caller reads
every value, this is fine. If the caller reads two values and
returns early (an error, a break, a context cancel), the
goroutine blocks on out <- n*2 with no reader. It never reaches
close(out). It never returns. That's the leak.

The test that "covers" this function reads the whole channel, so it
never triggers the early-return path. Green test, live bug.
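
One way to see the blind spot without any extra tooling is to count goroutines around an early return. The sketch below reuses Fanout from above. The helper leakedAfterEarlyReturn and the 50 ms pause are illustrative assumptions, and runtime.NumGoroutine is only a coarse stand-in for a real leak check.

```go
package main

import (
    "fmt"
    "runtime"
    "time"
)

// Fanout is the leaky version from above: nothing lets the sender stop early.
func Fanout(items []int) <-chan int {
    out := make(chan int)
    go func() {
        for _, n := range items {
            out <- n * 2
        }
        close(out)
    }()
    return out
}

// leakedAfterEarlyReturn (hypothetical helper) reads one value, walks away,
// and reports how many goroutines exist beyond the starting count.
func leakedAfterEarlyReturn() int {
    before := runtime.NumGoroutine()
    out := Fanout([]int{1, 2, 3, 4, 5})
    <-out                             // read one value, then stop reading
    time.Sleep(50 * time.Millisecond) // let the sender park on its next send
    return runtime.NumGoroutine() - before
}

func main() {
    fmt.Println("goroutines left behind:", leakedAfterEarlyReturn())
}
```

In an otherwise idle program the helper should report one extra goroutine: the sender, parked on its second send. A test that drains the channel would report zero, which is why it stays green.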

Wiring goleak into TestMain

One call in TestMain covers the whole package. Add this file once
per package:

package worker

import (
    "testing"

    "go.uber.org/goleak"
)

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}

VerifyTestMain runs all the tests in the package, then checks for
stray goroutines after they finish. If any test left one parked, the
package fails with a stack dump pointing at where the goroutine was
spawned. No per-test boilerplate, no assertions to write. One file
covers every test in the package.

Install it with:

go get go.uber.org/goleak

If you only want to guard one specific test instead of the whole
package, defer a call to VerifyNone at the top of that test:

func TestFanoutEarlyReturn(t *testing.T) {
    defer goleak.VerifyNone(t)

    out := Fanout([]int{1, 2, 3, 4, 5})
    <-out // read one value, then walk away
}

That defer runs after the test body, finds the sender still parked
on out <- ..., and fails the test. The leak that shipped to
production now fails in CI instead.

Reading the leaked-stack output

The value of goleak is in the report, not just the red X. When
TestFanoutEarlyReturn fails, you get something close to this:

found unexpected goroutines:
[Goroutine 34 in state chan send, with
 worker.Fanout.func1 on top of the stack:
worker.Fanout.func1()
    /app/worker/fanout.go:11 +0x5c
created by worker.Fanout in goroutine 6
    /app/worker/fanout.go:9 +0x7d
]

Read it from the bottom up, and it tells you the whole story in
three parts.

created by worker.Fanout ... fanout.go:9 — this is the spawn site.
Line 9 is the go func(). That's the goroutine's birth certificate:
which function started it and on what line.

worker.Fanout.func1() ... fanout.go:11 — this is where it's stuck
right now. Line 11 is out <- n * 2. The top-of-stack frame is the
exact statement the goroutine is blocked on.

in state chan send — this is why it's stuck. The scheduler state
tells you the class of leak at a glance:

chan send / chan receive: blocked on a channel with no peer.
select: blocked in a select with no ready case (often a missing ctx.Done() arm).
semacquire: waiting on a sync.WaitGroup or sync.Mutex that never releases (recent Go releases report a blocked mutex as sync.Mutex.Lock instead).
IO wait: parked in the runtime's network poller, usually on a read with no deadline.

Spawn site plus stuck line plus state is the full diagnosis. You go
straight to fanout.go:11, see the unbuffered send, and know the
fix is to make the sender respect the caller giving up.
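
The state in that header comes from the Go runtime's own goroutine dump, so you can poke at it with nothing but the standard library. The sketch below is not how goleak is implemented. snapshot, blockedOn, and leakSender are hypothetical helpers, and the 1 MiB buffer and 50 ms pause are arbitrary assumptions.

```go
package main

import (
    "fmt"
    "runtime"
    "strings"
    "time"
)

// snapshot returns one stack dump per live goroutine.
func snapshot() []string {
    buf := make([]byte, 1<<20)    // assumed large enough for a small program
    n := runtime.Stack(buf, true) // true = include every goroutine
    return strings.Split(strings.TrimSpace(string(buf[:n])), "\n\n")
}

// blockedOn keeps the dumps whose header line reports the given state,
// for example "chan send" or "select".
func blockedOn(stacks []string, state string) []string {
    var stuck []string
    for _, s := range stacks {
        header, _, _ := strings.Cut(s, "\n") // e.g. "goroutine 34 [chan send]:"
        if strings.Contains(header, "["+state) {
            stuck = append(stuck, s)
        }
    }
    return stuck
}

// leakSender parks a goroutine on a send that nobody will ever receive.
func leakSender() {
    ch := make(chan int)
    go func() { ch <- 1 }()
    time.Sleep(50 * time.Millisecond) // let it reach the send
}

func main() {
    leakSender()
    for _, s := range blockedOn(snapshot(), "chan send") {
        fmt.Println(s)
    }
}
```

Running it prints the raw dump for the parked sender, with the same three ingredients goleak reports: the state in the header, the blocked frame, and the "created by" line.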

Fixing the leak

The sender needs a way out when the reader stops. A context
plus a select gives it one:

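// Fanout doubles each item and streams the results until the items run
// out or ctx is done. Requires the "context" import.
func Fanout(ctx context.Context, items []int) <-chan int {
    out := make(chan int)
    go func() {
        defer close(out) // runs on both exit paths
        for _, n := range items {
            select {
            case out <- n * 2:
            case <-ctx.Done():
                return // the reader gave up, so stop sending
            }
        }
    }()
    return out
}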

Now the caller cancels the context when it walks away early, the
select takes the ctx.Done() arm, and the goroutine returns
instead of parking forever. Update the test to pass a context and
cancel it before the goleak check runs, re-run it, and it passes,
because there's nothing left alive to find.
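
To check the fix without any dependency, the same goroutine-counting trick works. This is a minimal sketch: leftBehindAfterCancel is a hypothetical helper, the 50 ms pause is an arbitrary assumption, and runtime.NumGoroutine is a rough proxy for what goleak inspects.

```go
package main

import (
    "context"
    "fmt"
    "runtime"
    "time"
)

// Fanout is the fixed version from above: the sender gives up when ctx is done.
func Fanout(ctx context.Context, items []int) <-chan int {
    out := make(chan int)
    go func() {
        defer close(out)
        for _, n := range items {
            select {
            case out <- n * 2:
            case <-ctx.Done():
                return
            }
        }
    }()
    return out
}

// leftBehindAfterCancel (hypothetical helper) reads one value, cancels, and
// reports how many goroutines exist beyond the starting count.
func leftBehindAfterCancel() int {
    before := runtime.NumGoroutine()
    ctx, cancel := context.WithCancel(context.Background())
    out := Fanout(ctx, []int{1, 2, 3, 4, 5})
    <-out                             // read one value, then walk away
    cancel()                          // tell the sender the reader is gone
    time.Sleep(50 * time.Millisecond) // give the sender time to return
    return runtime.NumGoroutine() - before
}

func main() {
    fmt.Println("goroutines left behind:", leftBehindAfterCancel())
}
```

In a goleak-guarded test the ordering matters in the same way. Deferred calls run last-in, first-out, so register defer goleak.VerifyNone(t) first and defer cancel() second; that way cancel runs before the leak check does.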

Handling goroutines that are supposed to outlive tests

Real programs have background goroutines that start once and run for
the life of the process: a connection pool's health checker, a
metrics flusher, the goroutine database/sql keeps for opening new
connections. goleak would flag those as leaks, because from its point
of view they are unexpected survivors.

That's what IgnoreTopFunction is for. You tell goleak which
known-good goroutines to skip:

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m,
        goleak.IgnoreTopFunction(
            "database/sql.(*DB).connectionOpener",
        ),
    )
}

The string is the exact top-of-stack function name from the report,
the same kind of name as the worker.Fanout.func1 you learned to read
above. Copy it from goleak's own output into the ignore list. Keep
the list short and specific. A blanket ignore defeats the point; the
goal is to allow the handful of goroutines you know about and still
fail on the one you didn't.

goleak also retries before failing. It doesn't snapshot once and
give up. It polls a few times with a short backoff, which gives
goroutines mid-shutdown a chance to finish. So a goroutine that's
genuinely on its way out rarely produces a flaky failure; one that
is still alive when the retries run out gets reported.
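
The retry idea can be sketched with the standard library alone. This is an illustration of polling with backoff, not goleak's actual code: settlesTo, trySettle, slowExit, and stuckForever are hypothetical names, and the attempt count and pauses are arbitrary assumptions.

```go
package main

import (
    "fmt"
    "runtime"
    "time"
)

// settlesTo polls until the goroutine count drops back to baseline or the
// attempts run out, doubling the pause between checks.
func settlesTo(baseline, attempts int) bool {
    pause := time.Millisecond
    for i := 0; i < attempts; i++ {
        if runtime.NumGoroutine() <= baseline {
            return true
        }
        time.Sleep(pause)
        pause *= 2
    }
    return runtime.NumGoroutine() <= baseline
}

// trySettle starts fn in a goroutine and reports whether the count settles.
func trySettle(fn func()) bool {
    baseline := runtime.NumGoroutine()
    go fn()
    return settlesTo(baseline, 8) // 1+2+...+128 ms, roughly a quarter second
}

// slowExit is slow to finish but on its way out.
func slowExit() { time.Sleep(20 * time.Millisecond) }

// stuckForever parks on a channel nobody will ever write to.
func stuckForever() { <-make(chan struct{}) }

func main() {
    fmt.Println("slow shutdown settled:", trySettle(slowExit))
    fmt.Println("stuck goroutine settled:", trySettle(stuckForever))
}
```

The slow goroutine finishes inside the retry window and is not reported, while the stuck one is still there when the attempts run out. The same trade-off applies to any retry budget: a shutdown slower than the window still looks like a leak.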

Where to put it

Add the three-line TestMain to the packages that spawn goroutines:
workers, pools, anything with a go func() in it. You don't need it
in every package. Start with the ones that own concurrency, and let
CI tell you where the leaks are.

The leak that used to surface as a 3 a.m. OOM page now surfaces as a
failed test with a stack trace pointing at the line. That's the
trade goleak makes for you: move the discovery from production to the
pull request, and hand you the guilty line for free.

If you want the layer underneath this — why a parked goroutine
stays a live root, how the scheduler tracks chan send versus
select, what the runtime is actually doing when goleak counts
survivors — The Complete Guide to Go Programming walks through the
goroutine and channel model in that kind of depth. And Hexagonal
Architecture in Go is about keeping concurrency at the right
boundary, so the goroutines you spawn have an owner that knows how to
shut them down. Between the two, leaks stop being something you find
in a graph and start being something your design prevents.