Gabriel Anhaia

Why Your Team Should Stop Reaching for errgroup by Default


You have seen the pattern. A handler needs to enrich a response from
four backends. Someone reaches for golang.org/x/sync/errgroup,
writes four g.Go(func() error { ... }) blocks, and ships. Code
review nods. Tests pass. Everyone moves on.

A team I worked with did exactly this for two years. Then a partial
outage downstream caused one of the four enrichments to error early.
The other three were halfway through expensive work nobody wanted to
throw away. errgroup cancelled them anyway, the user got a 502 on
data that was already 75% computed, and the on-call got paged. The
tool was wrong for the job, and nobody had read the godoc carefully
enough to notice.

errgroup is a fine tool for the case it was built for. It is also
the most over-applied helper in Go's extended standard library.
Three specific failure modes show up in production, and most teams
hit them in the order below.

Failure 1: WithContext cancels peers on the first error

This is the one stated in the godoc in plain English: "the derived
Context is canceled the first time a function passed to Go returns a
non-nil error or the first time Wait returns, whichever occurs
first." If your job is fail-fast (one bad tile means the whole map
is useless), that is exactly what you want. If your job is "fetch
four enrichments and return whatever came back," it is wrong.

The anti-pattern most teams ship:

func enrich(ctx context.Context, id string) (Out, error) {
    g, gctx := errgroup.WithContext(ctx)

    var (
        profile Profile
        prefs   Prefs
        orders  []Order
        promo   *Promo
    )

    g.Go(func() error {
        return fetchProfile(gctx, id, &profile)
    })
    g.Go(func() error {
        return fetchPrefs(gctx, id, &prefs)
    })
    g.Go(func() error {
        return fetchOrders(gctx, id, &orders)
    })
    g.Go(func() error {
        return fetchPromo(gctx, id, &promo)
    })

    if err := g.Wait(); err != nil {
        return Out{}, err
    }
    return Out{profile, prefs, orders, promo}, nil
}

Read this carefully. If fetchPromo returns an error, gctx is
cancelled. The other three goroutines, if they are well-behaved and
listening on gctx.Done(), abort their in-flight HTTP calls. The
function returns one error, and three results that were almost ready
get thrown away. The user sees a hard failure for an optional promo
banner.

What you wanted was: collect everything, log the failures, return
the partial result. errgroup.WithContext cannot express that. The
fix is to drop errgroup entirely for this shape:

func enrich(ctx context.Context, id string) (Out, error) {
    var (
        wg      sync.WaitGroup
        profile Profile
        prefs   Prefs
        orders  []Order
        promo   *Promo
        errs    [4]error
    )

    wg.Add(4)
    go func() {
        defer wg.Done()
        errs[0] = fetchProfile(ctx, id, &profile)
    }()
    go func() {
        defer wg.Done()
        errs[1] = fetchPrefs(ctx, id, &prefs)
    }()
    go func() {
        defer wg.Done()
        errs[2] = fetchOrders(ctx, id, &orders)
    }()
    go func() {
        defer wg.Done()
        errs[3] = fetchPromo(ctx, id, &promo)
    }()
    wg.Wait()

    return Out{profile, prefs, orders, promo}, errors.Join(errs[:]...)
}

sync.WaitGroup waits for everyone. errors.Join (Go 1.20+) gives
the caller all four errors if they want them, or none if everything
worked. The caller decides what is fatal. The data layer no longer
makes that decision for you.
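
What "the caller decides" looks like in practice, as a sketch: assume
the profile is the only mandatory enrichment and that fetchProfile
wraps a sentinel ErrProfileUnavailable (both assumptions, not shown
above). errors.Is walks the tree that errors.Join builds, so the
sentinel survives the join:

out, err := enrich(ctx, id)
if err != nil {
    // ErrProfileUnavailable is a hypothetical sentinel wrapped by
    // fetchProfile; errors.Is traverses joined errors.
    if errors.Is(err, ErrProfileUnavailable) {
        return Out{}, err // the profile is mandatory: hard failure
    }
    // Everything else degrades: log the joined errors, keep the partials.
    log.Printf("enrich %s: degraded response: %v", id, err)
}
return out, nil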

Failure 2: Panics that don't cancel peers

This one is subtler and changes with the x/sync version. errgroup
recovers a panic in a goroutine started with g.Go and converts it
to a PanicError (added in x/sync v0.14.0, April 2025). Before
that version, a panic in one g.Go would crash the whole process,
peers or not, and g.Wait would never return.

After v0.14.0, the panic is recovered and re-raised from g.Wait as
a PanicError (a PanicValue if the panic value does not implement
error), and peers receive context cancellation through gctx. The
panic surfaces on the goroutine that called Wait, where net/http's
per-request recover can contain it, instead of killing the process
from a worker goroutine. That is better.
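
If you would rather hand your handler a plain error than let the
re-panic travel up, you can recover at the Wait call yourself. A
minimal sketch, assuming x/sync v0.14.0+; waitRecovering is an
illustrative name, not part of errgroup:

func waitRecovering(g *errgroup.Group) (err error) {
    defer func() {
        if v := recover(); v != nil {
            // v is errgroup's PanicError or PanicValue here.
            err = fmt.Errorf("errgroup task panicked: %v", v)
        }
    }()
    return g.Wait()
}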
The flip side is that a recover() in some intermediate function your
code calls through can swallow the panic before errgroup ever sees
it. The peers never get cancelled. And if one of them was waiting
on the goroutine that panicked, the group hangs on Wait until the
parent context times out, which on a request handler is usually
never, because nobody set a deadline.

The shape that bites:

g.Go(func() error {
    return runWithMiddleware(gctx, func() error {
        return riskyOp(gctx)
    })
})

If runWithMiddleware has its own defer func() { recover() }()
(common in audit and metrics middleware), the panic in riskyOp
disappears inside that recover. The function returns nil. g.Wait
returns nil. The result the handler consumed is a zero value that
the panic prevented from ever being populated, and downstream code
now operates on silently empty data.

errgroup is not at fault here. The lesson is that any concurrency
helper that depends on errors propagating to the group is fragile in
the presence of code that swallows them. Audit your middleware. If
something between g.Go and the work recovers a panic and returns
nil instead of an error, you have a peer-cancellation bug waiting
for the right input.
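
The errgroup-friendly version of that middleware, as a sketch
(runWithMiddleware is the hypothetical wrapper from the shape
above): convert the recovered value into an error instead of
discarding it, so the group still sees a failure and cancels the
peers.

func runWithMiddleware(ctx context.Context, fn func() error) (err error) {
    defer func() {
        if v := recover(); v != nil {
            // Report the panic as an error instead of swallowing it.
            err = fmt.Errorf("recovered panic: %v", v)
        }
    }()
    // audit / metrics bookkeeping would go here
    return fn()
}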

Failure 3: Goroutines parked on channels the cancel cannot unblock

The third failure mode is a goroutine leak that survives g.Wait
returning. It happens when a g.Go function blocks on a channel
operation and the cancellation of gctx does not unblock that
channel. The classic shape:

results := make(chan Item)

g.Go(func() error {
    defer close(results) // lets the consumer's range terminate on success
    for _, id := range ids {
        item, err := fetch(gctx, id)
        if err != nil {
            return err
        }
        results <- item
    }
    return nil
})

g.Go(func() error {
    for item := range results {
        if err := save(gctx, item); err != nil {
            return err
        }
    }
    return nil
})

The producer sends on results. The consumer reads. If save
returns an error, the consumer goroutine returns, and gctx is
cancelled. The producer was in the middle of results <- item
when the cancel fired. The receive-side is gone, the channel is
unbuffered, and the producer is parked on the send forever. gctx
being done does not unblock a channel send. The goroutine leaks.

g.Wait will never return either, because the leaked goroutine
never completes. No parent deadline rescues you here: cancelling a
context does not unblock a channel send, so the handler goroutine
hangs right alongside the producer.

The fix is the standard select-with-cancel pattern, applied at every
channel operation inside an errgroup-managed function:

g.Go(func() error {
    defer close(results)
    for _, id := range ids {
        item, err := fetch(gctx, id)
        if err != nil {
            return err
        }
        select {
        case results <- item:
        case <-gctx.Done():
            return gctx.Err()
        }
    }
    return nil
})

Every channel send and every channel receive inside a g.Go body
needs a <-gctx.Done() arm. If you forget one, errgroup silently
gives you a leak. goleak in CI catches it on the integration tests
that exercise the cancel path. If you do not have those tests, you
will find out in production.
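
A minimal shape for that CI check, assuming go.uber.org/goleak and
an illustrative processAll that wires the producer and consumer
above into one call:

func TestProcessAllCancelPath(t *testing.T) {
    defer goleak.VerifyNone(t) // fails the test if any goroutine leaked

    ctx, cancel := context.WithCancel(context.Background())
    cancel() // exercise the cancellation path immediately

    _ = processAll(ctx, []string{"a", "b", "c"})
}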

Three replacements, three shapes of work

The team in the opening story stopped reaching for errgroup first
and started picking from three patterns based on the semantics they
actually wanted.

Pattern 1: collect-all, no cancel

Use plain sync.WaitGroup plus errors.Join. The example in
Failure 1 above is the whole pattern. Caller decides what is fatal,
nothing gets cancelled by surprise, partial results are honest.

Pattern 2: a small Group type that names the semantics

When the same module mixes fail-fast and wait-all calls, naming the
choice at the call site stops the next reader from guessing:

type Group struct {
    wg     sync.WaitGroup
    errs   []error
    mu     sync.Mutex
    ctx    context.Context
    cancel context.CancelFunc
}

func NewFailFast(parent context.Context) *Group {
    ctx, cancel := context.WithCancel(parent)
    return &Group{ctx: ctx, cancel: cancel}
}

func NewWaitAll(parent context.Context) *Group {
    return &Group{ctx: parent}
}

func (g *Group) Go(fn func(context.Context) error) {
    g.wg.Add(1)
    go func() {
        defer g.wg.Done()
        if err := fn(g.ctx); err != nil {
            g.mu.Lock()
            g.errs = append(g.errs, err)
            g.mu.Unlock()
            if g.cancel != nil {
                g.cancel()
            }
        }
    }()
}

func (g *Group) Wait() error {
    g.wg.Wait()
    if g.cancel != nil {
        g.cancel()
    }
    return errors.Join(g.errs...)
}

Two constructors. The reviewer reading the call site sees
NewFailFast or NewWaitAll and knows what happens on the first
error. No more "I assumed errgroup waited for everyone." Forty
lines of code, zero dependencies beyond the standard library, the
two semantics explicit at the type level.
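
A call site, sketched with the enrichment helpers from earlier:

g := NewWaitAll(ctx) // nothing cancels; partial results survive
g.Go(func(ctx context.Context) error { return fetchProfile(ctx, id, &profile) })
g.Go(func(ctx context.Context) error { return fetchPromo(ctx, id, &promo) })
if err := g.Wait(); err != nil {
    // Joined errors; the caller decides what is fatal.
    log.Printf("enrich %s: degraded: %v", id, err)
}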

Pattern 3: errgroup with SetLimit for bounded fan-out

SetLimit(n) is the case where errgroup is genuinely the right
answer. You have a thousand IDs to process, you want at most twenty
in flight, fail-fast on first error is the correct semantic, and you
want the cancel-peers behavior:

g, gctx := errgroup.WithContext(ctx)
g.SetLimit(20)

for _, id := range ids {
    id := id // pre-Go-1.22 loop-variable capture
    g.Go(func() error {
        return process(gctx, id)
    })
}
return g.Wait()

Go 1.22 made loop variables per-iteration, so the explicit shadow is
redundant on 1.22+ but harmless. On 1.21 and earlier, dropping it
gives you a data race where every goroutine sees the last id.

This is the shape errgroup was designed for. Bounded concurrency,
homogeneous work, fail-fast. If a goroutine returns an error, the
others abort because they were doing the same kind of work and
finishing the rest is wasted CPU. The cancel-peers behavior is the
feature, not the bug.

The rule that replaces "use errgroup"

Before you reach for errgroup, answer three questions out loud.

What happens when one job fails — do you want the others cancelled,
or do you want their partial results? If partial, do not use
errgroup.

Are the channel operations inside each g.Go body all guarded by
a <-gctx.Done() arm? If you cannot say yes after a careful read,
you have a leak waiting for the right input.

Is any middleware in the call chain doing its own recover()? If
yes, your peer-cancel guarantee is gone, because the panic that
should have triggered the cancel got swallowed before errgroup
saw it.

errgroup answers a specific question well. The mistake is treating
it as the default concurrency primitive. sync.WaitGroup plus
errors.Join is shorter for the wait-all case. A 60-line custom
Group makes the semantics explicit when both shapes coexist.
errgroup.SetLimit is the right answer when you actually want
bounded fail-fast fan-out. Pick by the work, not by the habit.


If this was useful

Concurrency in Go is mostly about choosing the right primitive for
the work, not about writing clever synchronization. The Complete
Guide to Go Programming covers the runtime, the scheduler, channel
semantics, and the patterns above end to end, including the failure
modes that show up in production rather than in tutorials. Hexagonal
Architecture in Go is the architectural half: the service shape that
the producer/consumer and fan-out enrichments in this post slot
into.

Thinking in Go — the 2-book series on Go programming and hexagonal architecture
