📘 This post pairs with two books I've written on Go. Book 1: The Complete Guide to Go Programming. Book 2: Hexagonal Architecture in Go. Or both together as the Thinking in Go collection: Kindle / Paperback. Short blurbs at the bottom.
Go 1.26 finally shipped a built-in goroutine leak detector. It's experimental, it's off by default, and it's the most important observability feature the runtime has added in years. Almost nobody's talking about it.
That matters because goroutine leaks are the bug class Go's runtime is most permissive about. There's no panic. No warning log. No -race flag for it. You can leak a million goroutines and the language will happily let you. The runtime assumes you'll notice. Most teams don't, until memory creeps into the red and on-call is debugging a service at 3am that "just needs a restart."
In this post we'll do two things. First, walk through the four structural patterns that cause 90% of goroutine leaks in real Go code, with the pprof stack signature each one produces. Second, look at what Go 1.26's new /debug/pprof/goroutineleak profile actually detects, what it misses, and why the patterns still matter even after you enable it.
What "leaked" actually means
A goroutine has leaked when it's in a state it can never exit. Usually that means it's parked on a channel receive, a channel send, a mutex lock, or a condition variable wait, and whatever was supposed to unblock it is never going to happen.
That's a design bug, not a crash. The runtime has no way to distinguish "parked because the work will take a minute" from "parked forever." The only thing it sees is goroutine 42 [chan receive, 6h12m], and even that you have to go looking for.
Now the four patterns.
Pattern 1: Send on a channel nobody's reading
This one is the first leak most people ship, and often the last one they catch. Classic setup:
```go
// returns the index of the first match, using goroutines because
// someone read "go is fast at concurrency" on HN
func findFirst(items []string, target string) int {
	results := make(chan int) // unbuffered
	for i, s := range items {
		go func(i int, s string) {
			if s == target {
				results <- i // blocks until someone reads
			}
		}(i, s)
	}
	return <-results
}
```
Looks fine. Actually leaks every goroutine except the first one to match.

Here's the mechanic: `findFirst` reads exactly one value from `results`. Every other goroutine that also matched is blocked forever on `results <- i`, because the channel is unbuffered and the reader has already left. Every goroutine that didn't match exits normally, so the leak only shows up when you have multiple matches. And if nothing matches at all, the caller itself blocks forever on `<-results`.
If you run this on a hot path for a few hours, `pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)` will show you a pile of stacks like:

```text
goroutine 8471 [chan send, 4 minutes]:
main.findFirst.func1(0x7, {0x1099c81, 0x5})
	/app/main.go:12 +0x45
```
The signature to look for is [chan send, Xm] in the state line. That's goroutine leak dialect for "I tried to send, nobody read."
The fix, depending on what you actually want:
```go
func findFirst(ctx context.Context, items []string, target string) int {
	results := make(chan int, len(items)) // buffered so sends don't block
	for i, s := range items {
		go func(i int, s string) {
			if s == target {
				select {
				case results <- i:
				case <-ctx.Done():
				}
			}
		}(i, s)
	}
	select {
	case r := <-results:
		return r
	case <-ctx.Done():
		return -1
	}
}
```
Two things changed. The channel is buffered so sends can't block on a missing reader. The sends themselves check ctx.Done() so they can give up. If you don't need cancellation, just buffering the channel is often enough. But the ctx.Done() pattern composes better with the rest of your service and you'll wish you'd used it later.
Pattern 2: select with no <-ctx.Done()
The second pattern is where worker pools go to die:
```go
func worker(jobs <-chan Job) {
	for {
		select {
		case j := <-jobs:
			process(j)
			// no case <-ctx.Done()
			// no exit condition at all
		}
	}
}
```
When the service shuts down, jobs might close or might not (depending on whether the producer is well-behaved), but even if it does, this worker is still looping and selecting forever. The for {} plus select has no exit, and the goroutine that owns it is going to outlive the HTTP server, the database pool, and your patience.
The fix is almost embarrassingly small:
```go
func worker(ctx context.Context, jobs <-chan Job) {
	for {
		select {
		case j, ok := <-jobs:
			if !ok {
				return // channel closed, clean exit
			}
			process(j)
		case <-ctx.Done():
			return
		}
	}
}
```
Now the worker exits on two different signals: the channel being closed, or the context being canceled. Either path returns, the goroutine drops off the runtime's list, and your pprof.Lookup("goroutine") count stops growing on every deploy.
The pprof signature for this one looks like:
```text
goroutine 1234 [select, 12 minutes]:
main.worker(0xc0000aa600)
	/app/worker.go:15 +0x7b
```
[select, Xm] tells you a goroutine is sitting in a select with no clause ready. If the select has no cancellation branch, that stack is the leak.
Pattern 3: for range ch on a channel nobody closes
This one feels safer because for range is so familiar. It's not safer.
```go
func consume(ch <-chan Event) {
	for e := range ch {
		handle(e)
	}
	// if nobody closes ch, this loop never ends
}
```
for range ch exits when the channel is closed, not when it's empty. If the producer dies or gets GC'd without closing the channel, the consumer hangs forever. You'll see:
```text
goroutine 777 [chan receive, 2h4m]:
main.consume(0xc0000ba660)
	/app/consumer.go:4 +0x62
```
The annoying thing about this pattern is that the fix is a code review rule, not a line change. Somebody has to own the close. The rule that works:
> The goroutine that writes to a channel is responsible for closing it. If multiple goroutines write, use `sync.WaitGroup` in a supervisor goroutine and have the supervisor close the channel after `wg.Wait()`.
Example:
```go
func produce(ctx context.Context, out chan<- Event, sources []Source) {
	var wg sync.WaitGroup
	for _, s := range sources {
		wg.Add(1)
		go func(s Source) {
			defer wg.Done()
			s.Emit(ctx, out)
		}(s)
	}
	wg.Wait()
	close(out) // single writer closes, consumers wake up cleanly
}
```
Single close site. Consumer can use for range without worrying. Done.
Pattern 4: WaitGroup.Wait() missing a Done()
Pattern 4 is the one that ships to production through code review because it looks correct. Here's the buggy version:
```go
func fanOut(ctx context.Context, tasks []Task) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(tasks))
	for _, t := range tasks {
		wg.Add(1)
		go func(t Task) {
			if err := t.Run(ctx); err != nil {
				errs <- err
				return // oops: wg.Done() never called
			}
			wg.Done()
		}(t)
	}
	wg.Wait() // hangs forever if any task errored
	close(errs)
	return firstError(errs)
}
```
The early return on the error path skips wg.Done(). The counter never reaches zero, wg.Wait() blocks forever, and the calling goroutine joins the leaked-goroutine museum.
defer wg.Done() at the top of the goroutine is the entire fix:
```go
go func(t Task) {
	defer wg.Done() // runs whether we return normally or on error
	if err := t.Run(ctx); err != nil {
		errs <- err
		return
	}
}(t)
```
This is why "always defer your wg.Done() on the first line of the goroutine body" is a real rule worth writing into your team's style guide. Every time someone tries to get clever with placement, this bug finds them.
pprof signature for this one:
```text
goroutine 1 [semacquire, 30 minutes]:
sync.(*WaitGroup).Wait(0xc0000180a0)
	/usr/local/go/src/sync/waitgroup.go:116 +0x73
main.fanOut(...)
```
[semacquire] inside sync.(*WaitGroup).Wait is the fingerprint.
Detection, the portable way: pprof.Lookup
The simplest leak detector you can ship today is a single HTTP handler:
```go
import (
	"net/http"
	"runtime/pprof"
)

func goroutineDumpHandler(w http.ResponseWriter, r *http.Request) {
	// debug=2 gives you human-readable stacks, one per goroutine
	_ = pprof.Lookup("goroutine").WriteTo(w, 2)
}
```
Mount it on your internal port. Hit it. Read the output. Count goroutines by stack prefix. If a stack that should belong to a finite pool shows up thousands of times, you've got a leak.
For automation, you can snapshot the count before and after a known operation:
```go
func goroutineCount() int {
	return runtime.NumGoroutine()
}

// in a test or benchmark
before := goroutineCount()
doTheThing()
time.Sleep(100 * time.Millisecond) // give goroutines a chance to exit
after := goroutineCount()
if after > before {
	t.Fatalf("leaked %d goroutines", after-before)
}
```
Crude but effective. The time.Sleep is the annoying part, because goroutines exit asynchronously and you don't want to flake. Which is why most teams reach for a real library.
Detection in tests: goleak
Uber's goleak package is the industry standard for catching leaks in CI. One line in TestMain and every test in the package fails if it leaves a goroutine behind:
```go
package mypkg

import (
	"testing"

	"go.uber.org/goleak"
)

func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}
```
That's it. Any test that spawns a goroutine and forgets to wait for it to exit will fail with a stack dump pointing at the leaking goroutine. It's the single highest ROI change you can make to a Go codebase that doesn't already have it.
A gotcha you'll hit within a week: some libraries legitimately leave background goroutines running for their own reasons (database connection pools, an `httptest.Server` you haven't called `Close` on, package-level pollers started in `init`). `goleak.IgnoreCurrent()` takes a snapshot at startup and ignores anything that was already there. Use it when you have a valid reason, and document why. Don't sprinkle `IgnoreTopFunction` calls until the suite goes green. That's how you hide real leaks under test noise.
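If you do need an exception, a sketch of what a documented one looks like (the background poller mentioned in the comment is hypothetical):

```go
package mypkg

import (
	"testing"

	"go.uber.org/goleak"
)

func TestMain(m *testing.M) {
	// IgnoreCurrent snapshots goroutines alive at this instant, e.g. the
	// package-level poller our init starts. Documented so nobody mistakes
	// it for a blanket allowlist: anything spawned during tests still fails.
	goleak.VerifyTestMain(m, goleak.IgnoreCurrent())
}
```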
New in Go 1.26: the built-in leak profile
The quietly huge change almost nobody on Go Twitter is talking about: Go 1.26 shipped an experimental goroutine leak profile that uses GC reachability analysis.
Enable it at build time:
```shell
GOEXPERIMENT=goroutineleakprofile go build -o myapp ./cmd/myapp
```
Then the new profile is exposed at /debug/pprof/goroutineleak:
```shell
go tool pprof http://localhost:6060/debug/pprof/goroutineleak
```
How it works, roughly: during GC, the runtime already walks the reachability graph of every live object. The leak profile piggybacks on that walk to find goroutines blocked on concurrency primitives (channels, sync.Mutex, sync.Cond) where nothing outside the blocked island can reach the primitive. If the channel you're parked on is only referenced by other goroutines that are also parked on it, the whole island is unreachable, and the runtime flags it as leaked.
That's exactly the right detection model for Pattern 1 (chan send with no reader) and for certain subclasses of Pattern 3. If the reader is gone and the channel is GC'd down to just the stuck senders, the 1.26 profile will list it directly, with stacks.
What the 1.26 profile won't catch
This is the part the release notes don't emphasize enough.
The leak profile only catches goroutines where the blocking primitive is unreachable. If the primitive is still reachable from live code, the goroutine is not "leaked" by the GC's definition, even if it's pragmatically stuck.
Counterexample: a goroutine parked forever on `<-ctx.Done()` from a `context.Background()` you misplaced (`Background` is never canceled, so that receive never fires). The context object is a package-level variable. It's reachable. The runtime thinks everything is fine. You think you have a leak. You're both right, for different definitions of leak.
Another counterexample: a cache that holds channels as values. If a stuck goroutine is parked on <-c.events, and c.events is reachable through the cache map, and the cache map is reachable through your service singleton, the goroutine is not "leaked" as far as the GC is concerned. It's just permanently stuck on a channel that will never receive a value.
So the 1.26 profile is a massive upgrade for detecting orphaned leaks, and useless for detecting logically stuck goroutines. Both categories show up in real production codebases. You still need the pattern-based mental model, you still need goleak in tests, and you still need to read stacks when memory grows.
The Go team expects the profile to become default in 1.27, and I'd bet money it lands exactly as spec'd. Wire it into staging now so you have a baseline when it goes default.
The mental model that prevents all four patterns
Every go func() you write should have a one-sentence answer to this question:
What is the specific event that makes this goroutine exit?
If you can't answer in one sentence, don't ship it. The answers that are always correct:
- "The channel I'm reading from gets closed."
- "The context I'm checking gets canceled."
- "The function I'm running returns."
The answers that are wrong:
- "Someone will stop sending."
- "The request will finish."
- "It won't run long enough to matter."
"It won't run long enough to matter" is how most goroutine leaks end up shipping. They ran for long enough to matter. They always do.
Next step
Three things to do this week:
1. Add `goleak.VerifyTestMain(m)` to one package in your codebase. Watch what breaks. Fix it. Then add it to the next package.
2. Expose `pprof.Lookup("goroutine").WriteTo(w, 2)` on an internal endpoint and hit it during peak traffic. Count stacks. Find anything that looks wrong.
3. Build one service with `GOEXPERIMENT=goroutineleakprofile` and deploy it to staging. Diff the profile on a quiet night vs a loaded one. See what shows up.
Then apply the one-sentence rule to every go func() in your next PR. That's the work.
Question for the comments: which of the four patterns has bitten you hardest, and how did you find it? Drop the story below.
The books
📘 The Complete Guide to Go Programming: Book 1 in the series. The language from the ground up: goroutines, channels, the memory model, the standard library, all the pieces tutorials hand-wave past.
📘 Hexagonal Architecture in Go: Book 2. How to design Go services where the four patterns above don't get the chance to ship. Bounded concurrency at the architecture level, not just the goroutine level. 22 chapters with a companion repo.
📘 Or the full collection: Thinking in Go on Amazon as Kindle or Paperback.
Next in **Go in Production**: the context package. Five ways it's misused in almost every Go codebase, and what correct usage actually looks like.
