The 15-Minute Goroutine Leak Triage: Two Dumps, One Diff, Zero Guessing

Voskan Voskanyan

Goroutine leaks rarely announce themselves with a dramatic outage. They show up as "slowly getting worse":

  • p95/p99 creeps up over an hour or two
  • memory trends upward even though traffic is flat
  • goroutine count keeps climbing and doesn’t return to baseline

If you’ve been on-call long enough, you’ve seen the trap: people debate why it’s happening before they’ve proven what is accumulating.

This post is a compact, production-first triage that I use to confirm a goroutine leak fast, identify the dominant stuck pattern, and ship a fix that holds.

If you want the full long-form runbook with a root-cause catalog, hardening defaults, and a production checklist, I published it here:

https://compile.guru/goroutine-leaks-production-pprof-tracing/


What a goroutine leak is (the only definition that matters in production)

In production I don’t define a leak as "goroutines are high."

A goroutine is leaked when it outlives the request/job that created it and it has no bounded lifetime (no reachable exit path tied to cancellation, timeout budget, or shutdown).

That framing matters because it turns debugging into lifecycle accounting:

  • What started this goroutine?
  • What is its exit condition?
  • Why is the exit unreachable?
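
To make that concrete, here's a minimal sketch. slowFetch, results, and ctx stand in for whatever the surrounding handler actually has; the point is the exit path, not the names.

// Leaky: if the caller gives up and nobody ever reads from results,
// this goroutine blocks on the send forever. It has no exit path.
go func() {
    results <- slowFetch()
}()

// Bounded: the goroutine always has a reachable exit. It either delivers
// the result or observes ctx cancellation and returns.
go func() {
    select {
    case results <- slowFetch():
    case <-ctx.Done():
    }
}()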

Minute 0–3: confirm the signature (don’t skip this)

Before you touch profiling, answer one question:

Is the system accumulating concurrency footprint without a matching increase in work?

What I look at together:

  • QPS / job intake (flat or stable-ish)
  • goroutines (upward slope)
  • inuse heap / RSS (upward slope)
  • tail latency (upward slope)

If goroutines spike during a burst and then gradually return: that’s not a leak, that’s load.

If goroutines rise linearly (or step-up repeatedly) while work is stable: treat it as a leak until proven otherwise.
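
If you don't already export a goroutine-count metric, the standard library makes it cheap. A minimal sketch using expvar (swap in whatever metrics system you actually run):

import (
    "expvar"
    "runtime"
)

// Publishes a live "goroutines" gauge at /debug/vars so its slope can be
// plotted next to QPS, heap, and tail latency.
func init() {
    expvar.Publish("goroutines", expvar.Func(func() any {
        return runtime.NumGoroutine()
    }))
}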


Minute 3–10: capture two goroutine profiles and diff them

The key move is comparison. A single goroutine dump is noisy. Two dumps tell you what’s growing.

Option A (best for diffing): capture the profile in pprof's binary format and use go tool pprof

Capture twice, separated by 10–15 minutes.

curl -sS "http://$HOST/debug/pprof/goroutine" > goroutine.1.pb.gz

sleep 900

curl -sS "http://$HOST/debug/pprof/goroutine" > goroutine.2.pb.gz


Now diff them.

go tool pprof -top -diff_base=goroutine.1.pb.gz ./service-binary goroutine.2.pb.gz


What you want:

  • one (or a few) stacks that grow a lot
  • a clear wait reason: channel send/recv, network poll, lock wait, select, etc.

Option B (fastest human scan): debug=2 text dumps

curl -sS "http://$HOST/debug/pprof/goroutine?debug=2" > goroutines.1.txt

sleep 900

curl -sS "http://$HOST/debug/pprof/goroutine?debug=2" > goroutines.2.txt


Then do a "poor man’s diff" (for example with the small counting script sketched after this list):

  • search for repeated top frames
  • count occurrences (even roughly)
  • focus on the stacks with the biggest growth
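
Here's the rough counting helper I reach for (a sketch: it keys each goroutine stanza in the debug=2 dump by its top frame, which is usually enough to see what grew):

// countstacks reads a debug=2 goroutine dump on stdin and prints
// how many goroutines share each top frame, biggest first.
package main

import (
    "bufio"
    "fmt"
    "os"
    "sort"
    "strings"
)

func main() {
    counts := map[string]int{}
    sc := bufio.NewScanner(os.Stdin)
    sc.Buffer(make([]byte, 0, 1<<20), 1<<20)
    expectFrame := false
    for sc.Scan() {
        line := sc.Text()
        switch {
        case strings.HasPrefix(line, "goroutine "):
            expectFrame = true // the next non-empty line is the top frame
        case expectFrame && strings.TrimSpace(line) != "":
            counts[strings.TrimSpace(line)]++
            expectFrame = false
        }
    }
    type entry struct {
        frame string
        n     int
    }
    var top []entry
    for f, n := range counts {
        top = append(top, entry{f, n})
    }
    sort.Slice(top, func(i, j int) bool { return top[i].n > top[j].n })
    for _, e := range top {
        fmt.Printf("%6d  %s\n", e.n, e.frame)
    }
}

Run it on both snapshots (go run countstacks.go < goroutines.1.txt, then the second file) and compare the biggest counts.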

Minute 10–15: map the dominant stack to the first fix you should try

Once you have "the stack that grows," the fix is usually not mysterious. Here’s the mapping I use to choose the first patch.

1) Many goroutines blocked on chan send / chan receive

Interpretation: backpressure/coordination bug. Producers outpace consumers, or receivers are missing, or close ownership is unclear.

First fix:

  • add a cancellation edge to send/receive paths (select { case <-ctx.Done(): ... })
  • bound the queue/channel (and decide policy: block with timeout vs reject)

Example helper:

// sendWithContext blocks until the value is sent or ctx is canceled,
// so a stuck consumer can no longer strand the sender forever.
func sendWithContext[T any](ctx context.Context, ch chan<- T, v T) error {
    select {
    case ch <- v:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

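The receive side deserves the same treatment. A symmetric sketch (the bool mirrors comma-ok so callers can tell "closed" from "canceled"):

// recvWithContext blocks until a value arrives, the channel is closed,
// or ctx is canceled, so the receiver can't leak either.
func recvWithContext[T any](ctx context.Context, ch <-chan T) (T, bool, error) {
    var zero T
    select {
    case v, ok := <-ch:
        return v, ok, nil
    case <-ctx.Done():
        return zero, false, ctx.Err()
    }
}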

2) Many goroutines stuck in net/http.(*Transport).RoundTrip / netpoll waits

Interpretation: outbound I/O without a real deadline or missing request context wiring. Slow downstream causes your service to "hold on" to goroutines.

First fix:

  • enforce timeouts at the client level (transport + overall cap)
  • always use http.NewRequestWithContext (or req = req.WithContext(ctx))
  • always close bodies and bound reads (see the client sketch after this list)
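
A minimal sketch of those defaults; the specific timeout numbers are placeholders, budget them per dependency:

var client = &http.Client{
    Timeout: 2 * time.Second, // hard cap on the entire exchange
    Transport: &http.Transport{
        DialContext:           (&net.Dialer{Timeout: 500 * time.Millisecond}).DialContext,
        TLSHandshakeTimeout:   500 * time.Millisecond,
        ResponseHeaderTimeout: 800 * time.Millisecond,
        IdleConnTimeout:       90 * time.Second,
    },
}

func callDownstream(ctx context.Context, url string) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return err
    }
    resp, err := client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    // Bound the read so a slow or huge body can't hold the goroutine hostage.
    _, err = io.Copy(io.Discard, io.LimitReader(resp.Body, 1<<20))
    return err
}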

3) Many goroutines waiting on WaitGroup.Wait, semaphores, or errgroup

Interpretation: join/cancellation bug or unbounded fan-out. Work starts faster than it completes; cancellation doesn’t propagate; someone forgot to wait.

First fix:

  • ensure there is exactly one "owner" that always calls Wait()
  • use errgroup.WithContext so cancellation is wired
  • bound concurrency explicitly (e.g., SetLimit)
g, ctx := errgroup.WithContext(ctx)
g.SetLimit(16)
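A slightly fuller sketch of the same idea (errgroup is golang.org/x/sync/errgroup; items and processItem are placeholders for your fan-out):

g, ctx := errgroup.WithContext(ctx)
g.SetLimit(16) // hard cap on in-flight workers

for _, item := range items {
    item := item // capture; unnecessary on Go 1.22+, harmless before
    g.Go(func() error {
        // Workers must honor ctx so one failure cancels the rest.
        return processItem(ctx, item)
    })
}

// Exactly one owner joins; Wait is the bounded lifetime for the whole fan-out.
if err := g.Wait(); err != nil {
    return err
}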

4) Many goroutines in timers/tickers / periodic loops

Interpretation: time-based resources not stopped, or loops not tied to cancellation/shutdown.

First fix:

  • stop tickers
  • stop + drain timers when appropriate
  • ensure the loop has a ctx.Done() exit (sketch below)
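
A minimal sketch of a periodic loop with a bounded lifetime (refresh is a placeholder for the periodic work):

func runRefresher(ctx context.Context, every time.Duration) {
    ticker := time.NewTicker(every)
    defer ticker.Stop() // the ticker's resources die with the loop

    for {
        select {
        case <-ticker.C:
            refresh(ctx)
        case <-ctx.Done():
            return // cancellation/shutdown is the loop's exit path
        }
    }
}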

Where tracing fits (and why it’s worth it even if pprof "already shows the stack")

pprof tells you what is stuck. Tracing tells you:

  • which request/job spawned it
  • what deadline/budget it had
  • which downstream call/queue wait never returned

If you already have OpenTelemetry (or any tracing), the fastest win is:

  • put spans around anything that can block: outbound HTTP/gRPC, DB calls, queue publish/consume, semaphore acquire, worker enqueue
  • tag spans with route/operation, downstream name, and timeout budget

That way, when profiling says "these goroutines are stuck in RoundTrip," tracing tells you "95% of them are from /enrich, tenant X, calling payments-api, timing out at 800ms."
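
A sketch of what "a span around anything that can block" looks like with the OpenTelemetry Go SDK (the span/attribute names and doCharge are my placeholders, use your own conventions):

var tracer = otel.Tracer("payments-client")

func charge(ctx context.Context, tenant string) error {
    ctx, span := tracer.Start(ctx, "payments-api.charge",
        trace.WithAttributes(
            attribute.String("downstream", "payments-api"),
            attribute.String("tenant", tenant),
            attribute.String("timeout.budget", "800ms"),
        ))
    defer span.End()

    return doCharge(ctx) // the outbound call that can actually block
}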


The patch that actually holds: ship "hardening defaults," not one-off fixes

If you only patch the one stack you saw today, the next incident will be a different stack.

The fixes that keep paying dividends are defaults:

  • timeout budgets at boundaries
  • bounded concurrency for any fan-out
  • bounded queues + explicit backpressure policy
  • explicit channel ownership rules
  • structured shutdown (stop admission → cancel context → wait with a shutdown budget); sketched below
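
A sketch of that shutdown sequence for an HTTP service (srv, appCancel, workersDone, and the 20s budget are placeholders for your own wiring):

// 1) Stop admission: Shutdown stops accepting new connections and waits
//    for in-flight requests, bounded by the shutdown budget.
shutdownCtx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
defer cancel()
if err := srv.Shutdown(shutdownCtx); err != nil {
    log.Printf("shutdown incomplete: %v", err)
}

// 2) Cancel the app-wide context so background workers see ctx.Done().
appCancel()

// 3) Wait for them with the same budget instead of forever.
select {
case <-workersDone:
case <-shutdownCtx.Done():
    log.Printf("workers did not exit within the shutdown budget")
}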

I keep the complete hardening patterns + production checklist in the full post:

https://compile.guru/goroutine-leaks-production-pprof-tracing/


Prove it’s fixed (don’t accept vibes)

A real fix has artifacts:

  • goroutine slope stabilizes under the same traffic/load pattern
  • the dominant growing stack is gone (or bounded) in comparable snapshots
  • tail latency and timeout rate improve (or at least stop worsening)

Also watch out for "false confidence":

  • restarts and autoscaling can hide leaks without removing the bug
  • short tests miss slow leaks (especially timer/ticker issues)

Wrap-up

The fastest way to win against goroutine leaks is to stop guessing:

1) confirm the signature (slope + correlation)

2) take two goroutine captures and diff them

3) fix the dominant stack with lifecycle bounds (timeout/cancel/join/backpressure)

4) prove the fix with before/after slope and comparable snapshots

If you want the deeper catalog of leak patterns and the production checklist I use in reviews and incident response, here’s the complete runbook:

https://compile.guru/goroutine-leaks-production-pprof-tracing/
