The 15-Minute Goroutine Leak Triage: Two Dumps, One Diff, Zero Guessing

Voskan Voskanyan

Goroutine leaks rarely announce themselves with a dramatic outage. They show up as "slowly getting worse":

  • p95/p99 creeps up over an hour or two
  • memory trends upward even though traffic is flat
  • goroutine count keeps climbing and doesn’t return to baseline

If you’ve been on-call long enough, you’ve seen the trap: people debate why it’s happening before they’ve proven what is accumulating.

This post is a compact, production-first triage that I use to confirm a goroutine leak fast, identify the dominant stuck pattern, and ship a fix that holds.

If you want the full long-form runbook with a root-cause catalog, hardening defaults, and a production checklist, I published it here:

https://compile.guru/goroutine-leaks-production-pprof-tracing/


What a goroutine leak is (the only definition that matters in production)

In production I don’t define a leak as "goroutines are high."

A goroutine is leaked when it outlives the request/job that created it and it has no bounded lifetime (no reachable exit path tied to cancellation, timeout budget, or shutdown).

That framing matters because it turns debugging into lifecycle accounting:

  • What started this goroutine?
  • What is its exit condition?
  • Why is the exit unreachable?
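
To make that concrete, here's a minimal sketch. slowFetch, results, and ctx stand in for whatever the surrounding handler actually has; the point is the exit path, not the names.

// Leaky: if the caller gives up and nobody ever reads from results,
// this goroutine blocks on the send forever. It has no exit path.
go func() {
    results <- slowFetch()
}()

// Bounded: the goroutine always has a reachable exit. It either delivers
// the result or observes ctx cancellation and returns.
go func() {
    select {
    case results <- slowFetch():
    case <-ctx.Done():
    }
}()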

Minute 0–3: confirm the signature (don’t skip this)

Before you touch profiling, answer one question:

Is the system accumulating concurrency footprint without a matching increase in work?

What I look at together:

  • QPS / job intake (flat or stable-ish)
  • goroutines (upward slope)
  • inuse heap / RSS (upward slope)
  • tail latency (upward slope)

If goroutines spike during a burst and then gradually return: that’s not a leak, that’s load.

If goroutines rise linearly (or step-up repeatedly) while work is stable: treat it as a leak until proven otherwise.
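
If you don't already export a goroutine-count metric, the standard library makes it cheap. A minimal sketch using expvar (swap in whatever metrics system you actually run):

import (
    "expvar"
    "runtime"
)

// Publishes a live "goroutines" gauge at /debug/vars so its slope can be
// plotted next to QPS, heap, and tail latency.
func init() {
    expvar.Publish("goroutines", expvar.Func(func() any {
        return runtime.NumGoroutine()
    }))
}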


Minute 3–10: capture two goroutine profiles and diff them

The key move is comparison. A single goroutine dump is noisy. Two dumps tell you what’s growing.

Option A (best for diffing): capture the profile in pprof's binary format and use go tool pprof

Capture twice, separated by 10–15 minutes.

curl -sS "http://$HOST/debug/pprof/goroutine" > goroutine.1.pb.gz

sleep 900

curl -sS "http://$HOST/debug/pprof/goroutine" > goroutine.2.pb.gz


Now diff them.

go tool pprof -top -diff_base=goroutine.1.pb.gz ./service-binary goroutine.2.pb.gz


What you want:

  • one (or a few) stacks that grow a lot
  • a clear wait reason: channel send/recv, network poll, lock wait, select, etc.

Option B (fastest human scan): debug=2 text dumps

curl -sS "http://$HOST/debug/pprof/goroutine?debug=2" > goroutines.1.txt

sleep 900

curl -sS "http://$HOST/debug/pprof/goroutine?debug=2" > goroutines.2.txt


Then do a "poor man’s diff" (for example with the small counting script sketched after this list):

  • search for repeated top frames
  • count occurrences (even roughly)
  • focus on the stacks with the biggest growth
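
Here's the rough counting helper I reach for (a sketch: it keys each goroutine stanza in the debug=2 dump by its top frame, which is usually enough to see what grew):

// countstacks reads a debug=2 goroutine dump on stdin and prints
// how many goroutines share each top frame, biggest first.
package main

import (
    "bufio"
    "fmt"
    "os"
    "sort"
    "strings"
)

func main() {
    counts := map[string]int{}
    sc := bufio.NewScanner(os.Stdin)
    sc.Buffer(make([]byte, 0, 1<<20), 1<<20)
    expectFrame := false
    for sc.Scan() {
        line := sc.Text()
        switch {
        case strings.HasPrefix(line, "goroutine "):
            expectFrame = true // the next non-empty line is the top frame
        case expectFrame && strings.TrimSpace(line) != "":
            counts[strings.TrimSpace(line)]++
            expectFrame = false
        }
    }
    type entry struct {
        frame string
        n     int
    }
    var top []entry
    for f, n := range counts {
        top = append(top, entry{f, n})
    }
    sort.Slice(top, func(i, j int) bool { return top[i].n > top[j].n })
    for _, e := range top {
        fmt.Printf("%6d  %s\n", e.n, e.frame)
    }
}

Run it on both snapshots (go run countstacks.go < goroutines.1.txt, then the second file) and compare the biggest counts.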

Minute 10–15: map the dominant stack to the first fix you should try

Once you have "the stack that grows," the fix is usually not mysterious. Here’s the mapping I use to choose the first patch.

1) Many goroutines blocked on chan send / chan receive

Interpretation: backpressure/coordination bug. Producers outpace consumers, or receivers are missing, or close ownership is unclear.

First fix:

  • add a cancellation edge to send/receive paths (select { case <-ctx.Done(): ... })
  • bound the queue/channel (and decide policy: block with timeout vs reject)

Example helper:

// sendWithContext blocks until the value is sent or ctx is canceled,
// so a stuck consumer can no longer strand the sender forever.
func sendWithContext[T any](ctx context.Context, ch chan<- T, v T) error {
    select {
    case ch <- v:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

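The receive side deserves the same treatment. A symmetric sketch (the bool mirrors comma-ok so callers can tell "closed" from "canceled"):

// recvWithContext blocks until a value arrives, the channel is closed,
// or ctx is canceled, so the receiver can't leak either.
func recvWithContext[T any](ctx context.Context, ch <-chan T) (T, bool, error) {
    var zero T
    select {
    case v, ok := <-ch:
        return v, ok, nil
    case <-ctx.Done():
        return zero, false, ctx.Err()
    }
}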

2) Many goroutines stuck in net/http.(*Transport).RoundTrip / netpoll waits

Interpretation: outbound I/O without a real deadline or missing request context wiring. Slow downstream causes your service to "hold on" to goroutines.

First fix:

  • enforce timeouts at the client level (transport + overall cap)
  • always use http.NewRequestWithContext (or req = req.WithContext(ctx))
  • always close bodies and bound reads (see the client sketch after this list)
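
A minimal sketch of those defaults; the specific timeout numbers are placeholders, budget them per dependency:

var client = &http.Client{
    Timeout: 2 * time.Second, // hard cap on the entire exchange
    Transport: &http.Transport{
        DialContext:           (&net.Dialer{Timeout: 500 * time.Millisecond}).DialContext,
        TLSHandshakeTimeout:   500 * time.Millisecond,
        ResponseHeaderTimeout: 800 * time.Millisecond,
        IdleConnTimeout:       90 * time.Second,
    },
}

func callDownstream(ctx context.Context, url string) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return err
    }
    resp, err := client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    // Bound the read so a slow or huge body can't hold the goroutine hostage.
    _, err = io.Copy(io.Discard, io.LimitReader(resp.Body, 1<<20))
    return err
}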

3) Many goroutines waiting on WaitGroup.Wait, semaphores, or errgroup

Interpretation: join/cancellation bug or unbounded fan-out. Work starts faster than it completes; cancellation doesn’t propagate; someone forgot to wait.

First fix:

  • ensure there is exactly one "owner" that always calls Wait()
  • use errgroup.WithContext so cancellation is wired
  • bound concurrency explicitly (e.g., SetLimit)
g, ctx := errgroup.WithContext(ctx)
g.SetLimit(16)
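A slightly fuller sketch of the same idea (errgroup is golang.org/x/sync/errgroup; items and processItem are placeholders for your fan-out):

g, ctx := errgroup.WithContext(ctx)
g.SetLimit(16) // hard cap on in-flight workers

for _, item := range items {
    item := item // capture; unnecessary on Go 1.22+, harmless before
    g.Go(func() error {
        // Workers must honor ctx so one failure cancels the rest.
        return processItem(ctx, item)
    })
}

// Exactly one owner joins; Wait is the bounded lifetime for the whole fan-out.
if err := g.Wait(); err != nil {
    return err
}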

4) Many goroutines in timers/tickers / periodic loops

Interpretation: time-based resources not stopped, or loops not tied to cancellation/shutdown.

First fix:

  • stop tickers
  • stop + drain timers when appropriate
  • ensure the loop has a ctx.Done() exit (sketch below)
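
A minimal sketch of a periodic loop with a bounded lifetime (refresh is a placeholder for the periodic work):

func runRefresher(ctx context.Context, every time.Duration) {
    ticker := time.NewTicker(every)
    defer ticker.Stop() // the ticker's resources die with the loop

    for {
        select {
        case <-ticker.C:
            refresh(ctx)
        case <-ctx.Done():
            return // cancellation/shutdown is the loop's exit path
        }
    }
}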

Where tracing fits (and why it’s worth it even if pprof "already shows the stack")

pprof tells you what is stuck. Tracing tells you:

  • which request/job spawned it
  • what deadline/budget it had
  • which downstream call/queue wait never returned

If you already have OpenTelemetry (or any tracing), the fastest win is:

  • put spans around anything that can block: outbound HTTP/gRPC, DB calls, queue publish/consume, semaphore acquire, worker enqueue
  • tag spans with route/operation, downstream name, and timeout budget

That way, when profiling says "these goroutines are stuck in RoundTrip," tracing tells you "95% of them are from /enrich, tenant X, calling payments-api, timing out at 800ms."
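
A sketch of what "a span around anything that can block" looks like with the OpenTelemetry Go SDK (the span/attribute names and doCharge are my placeholders, use your own conventions):

var tracer = otel.Tracer("payments-client")

func charge(ctx context.Context, tenant string) error {
    ctx, span := tracer.Start(ctx, "payments-api.charge",
        trace.WithAttributes(
            attribute.String("downstream", "payments-api"),
            attribute.String("tenant", tenant),
            attribute.String("timeout.budget", "800ms"),
        ))
    defer span.End()

    return doCharge(ctx) // the outbound call that can actually block
}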


The patch that actually holds: ship "hardening defaults," not one-off fixes

If you only patch the one stack you saw today, the next incident will be a different stack.

The fixes that keep paying dividends are defaults:

  • timeout budgets at boundaries
  • bounded concurrency for any fan-out
  • bounded queues + explicit backpressure policy
  • explicit channel ownership rules
  • structured shutdown (stop admission → cancel context → wait with a shutdown budget); sketched below
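
A sketch of that shutdown sequence for an HTTP service (srv, appCancel, workersDone, and the 20s budget are placeholders for your own wiring):

// 1) Stop admission: Shutdown stops accepting new connections and waits
//    for in-flight requests, bounded by the shutdown budget.
shutdownCtx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
defer cancel()
if err := srv.Shutdown(shutdownCtx); err != nil {
    log.Printf("shutdown incomplete: %v", err)
}

// 2) Cancel the app-wide context so background workers see ctx.Done().
appCancel()

// 3) Wait for them with the same budget instead of forever.
select {
case <-workersDone:
case <-shutdownCtx.Done():
    log.Printf("workers did not exit within the shutdown budget")
}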

I keep the complete hardening patterns + production checklist in the full post:

https://compile.guru/goroutine-leaks-production-pprof-tracing/


Prove it’s fixed (don’t accept vibes)

A real fix has artifacts:

  • goroutine slope stabilizes under the same traffic/load pattern
  • the dominant growing stack is gone (or bounded) in comparable snapshots
  • tail latency and timeout rate improve (or at least stop worsening)

Also watch out for "false confidence":

  • restarts and autoscaling can hide leaks without removing the bug
  • short tests miss slow leaks (especially timer/ticker issues)

Wrap-up

The fastest way to win against goroutine leaks is to stop guessing:

1) confirm the signature (slope + correlation)

2) take two goroutine captures and diff them

3) fix the dominant stack with lifecycle bounds (timeout/cancel/join/backpressure)

4) prove the fix with before/after slope and comparable snapshots

If you want the deeper catalog of leak patterns and the production checklist I use in reviews and incident response, here’s the complete runbook:

https://compile.guru/goroutine-leaks-production-pprof-tracing/
