A backend engineer I worked with had a translation pipeline.
A thousand documents go in, a thousand translated documents come
out. The first version used errgroup.WithContext because that is
what every Go talk on YouTube reaches for. One document failed to
translate because the upstream API rate-limited that single tenant.
The context cancelled. 999 documents that were already translated
got thrown away, and the retry job ran the whole batch again. The
cloud bill went up.
The bug was not in the code. The code did exactly what errgroup
documents. The bug was that someone picked errgroup for a job
where you want to collect everything, not fail-fast on the first
error. Go gives you three concurrency primitives for fan-out, and
each answers a different question. Picking the wrong one is a bug
you ship to production and only notice when the input shape stops
being friendly.
The three primitives, what they actually mean
sync.WaitGroup is a counter. Add(n), Done(), Wait(). It
tells you when N goroutines have finished. It does not propagate
errors, it does not cancel anyone, it does not transfer values.
You handle all of that yourself.
errgroup.Group is WaitGroup plus error capture plus
context-cancel-on-first-error (when you use WithContext). The
first goroutine that returns a non-nil error cancels the derived
context. The other goroutines see ctx.Done() and abort. Wait
returns the first error.
A channel is none of those things. A channel transfers ownership
of a value from one goroutine to another, with optional buffering
and select-based cancellation. It is not a fan-out helper. It is
a queue.
The mistake most teams make is treating these as interchangeable.
They are not. Each one has a shape of work it fits. Pick by the
shape of the work, not by what you reached for last time.
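Stripped to skeletons, with the work hidden behind a callback, the three shapes look like this (the package and function names are illustrative, not from the pipeline):

package shapes

import (
    "context"
    "sync"

    "golang.org/x/sync/errgroup"
)

// WaitGroup: a counter. Answers "have all N finished?" and nothing else.
func waitAll(jobs []int, work func(int)) {
    var wg sync.WaitGroup
    wg.Add(len(jobs))
    for _, j := range jobs {
        go func(j int) {
            defer wg.Done()
            work(j)
        }(j)
    }
    wg.Wait()
}

// errgroup: fan-out plus first-error capture plus cancel-on-first-error.
func failFast(ctx context.Context, jobs []int, work func(context.Context, int) error) error {
    g, gctx := errgroup.WithContext(ctx)
    for _, j := range jobs {
        j := j // capture per iteration (unneeded on Go 1.22+)
        g.Go(func() error { return work(gctx, j) })
    }
    return g.Wait() // first non-nil error, after all goroutines return
}

// channel: a queue. The unbuffered send blocks until a receiver is
// ready, which is where back-pressure comes from.
func drainQueue(jobs []int, work func(int)) {
    ch := make(chan int)
    go func() {
        defer close(ch)
        for _, j := range jobs {
            ch <- j
        }
    }()
    for j := range ch {
        work(j)
    }
}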
Case A: batch translate 1k docs, keep what worked
The translation pipeline. Independent jobs, partial failure is
fine, the caller wants every result that came back and a list of
which IDs failed. This is WaitGroup with errors.Join.
package translate

import (
    "context"
    "errors"
    "sync"
)

type Result struct {
    ID   string
    Text string
    Err  error
}

func TranslateAll(
    ctx context.Context,
    ids []string,
    fn func(context.Context, string) (string, error),
) ([]Result, error) {
    out := make([]Result, len(ids))
    var wg sync.WaitGroup
    wg.Add(len(ids))
    for i, id := range ids {
        go func(i int, id string) { // pass i, id explicitly (pre-Go 1.22 loop capture)
            defer wg.Done()
            text, err := fn(ctx, id)
            out[i] = Result{ID: id, Text: text, Err: err} // this goroutine owns slot i
        }(i, id)
    }
    wg.Wait() // join point: every slot is written before we read any
    var errs []error
    for _, r := range out {
        if r.Err != nil {
            errs = append(errs, r.Err)
        }
    }
    return out, errors.Join(errs...) // nil if every job succeeded
}
Each goroutine writes to its own slot in out. No mutex needed.
Distinct indexes are race-free. errors.Join (Go 1.20+) returns
nil if every job succeeded, or all errors wrapped together if some
did not. The caller decides what is fatal.
If you swap WaitGroup for errgroup.WithContext here, the first
rate-limit error cancels every other in-flight translation. That
is the bug from the opening story.
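To make the contrast concrete, here is a sketch of a caller that keeps every success and retries only the failures; store and requeue are hypothetical stand-ins for whatever persistence and queue you run:

// Caller sketch (same package as TranslateAll). store and requeue
// are hypothetical stand-ins.
func processBatch(
    ctx context.Context,
    ids []string,
    translateOne func(context.Context, string) (string, error),
    store func(id, text string),
    requeue func(failed []string),
) {
    results, err := TranslateAll(ctx, ids, translateOne)
    for _, r := range results {
        if r.Err == nil {
            store(r.ID, r.Text) // successes are kept, not thrown away
        }
    }
    if err != nil {
        var failed []string
        for _, r := range results {
            if r.Err != nil {
                failed = append(failed, r.ID)
            }
        }
        requeue(failed) // only the stragglers go around again
    }
}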
Case B: parallel API calls where one failure kills the rest
A search endpoint that hits four backends in parallel: documents,
people, projects, tags. If any one of them errors, the whole
response is wrong, and the others are wasting CPU finishing a
result you are about to throw away. Fail-fast is the correct
semantic. This is errgroup.WithContext doing its job.
package search

import (
    "context"

    "golang.org/x/sync/errgroup"
)

type Hits struct {
    Docs     []Doc
    People   []Person
    Projects []Project
    Tags     []Tag
}

func Search(ctx context.Context, q string) (Hits, error) {
    g, gctx := errgroup.WithContext(ctx)
    var h Hits
    // Each goroutine fills a distinct field of h; g.Wait is the join
    // point that makes those writes safe to read.
    g.Go(func() error {
        ds, err := searchDocs(gctx, q)
        h.Docs = ds
        return err
    })
    g.Go(func() error {
        ps, err := searchPeople(gctx, q)
        h.People = ps
        return err
    })
    g.Go(func() error {
        pr, err := searchProjects(gctx, q)
        h.Projects = pr
        return err
    })
    g.Go(func() error {
        ts, err := searchTags(gctx, q)
        h.Tags = ts
        return err
    })
    if err := g.Wait(); err != nil {
        return Hits{}, err
    }
    return h, nil
}
If searchPeople errors, gctx cancels. The other three see
gctx.Done() and abort their HTTP calls. g.Wait returns the
first error. The handler returns 500 fast instead of waiting on
work the caller is going to discard.
Each goroutine writes to a distinct field of h, and the caller
only reads h after g.Wait returns. g.Wait provides the
happens-before edge that makes those distinct-field writes safe to
read. Same teaching as the out[i] slot pattern above: disjoint
writes plus a join point.
Two failure modes bite here. A channel send inside a g.Go body needs
a select with a <-gctx.Done() arm, or the goroutine leaks when a
peer fails and the group cancels (a sketch follows below). And a
recover() anywhere deeper in that same goroutine's call chain,
middleware included, swallows a panic before errgroup sees it and
turns a crash into a quiet nil.
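Here is what the first one looks like when handled, assuming a hypothetical streaming variant of the docs search that pushes each hit over a channel:

// Hypothetical streaming variant, same package as Search. The
// gctx.Done arm is what prevents the leak: if a peer fails and the
// group context cancels, the send unblocks and the goroutine exits
// instead of parking forever on a channel nobody reads anymore.
func streamDocs(gctx context.Context, q string, results chan<- Doc) error {
    ds, err := searchDocs(gctx, q)
    if err != nil {
        return err
    }
    for _, d := range ds {
        select {
        case results <- d:
        case <-gctx.Done():
            return gctx.Err()
        }
    }
    return nil
}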
Case C: producer/consumer with rate limit
A worker pool calling an upstream API capped at 50 requests per
second. The producer reads jobs from a queue. The workers process
them. Back-pressure has to flow back to the producer when the
workers are saturated. This is what channels were designed for.
package worker

import (
    "context"
    "time"
)

type Job struct {
    ID string
}

func Run(
    ctx context.Context,
    in <-chan Job,
    workers int,
    rps int,
    do func(context.Context, Job) error,
) error {
    // Derive a cancellable context so an early error return also
    // releases the producer goroutine instead of leaking it.
    ctx, cancel := context.WithCancel(ctx)
    defer cancel()

    tick := time.NewTicker(time.Second / time.Duration(rps))
    defer tick.Stop()

    gated := make(chan Job) // unbuffered: the rate-limit gate
    go func() {
        defer close(gated)
        for j := range in {
            select {
            case <-tick.C: // wait for a rate-limit slot
            case <-ctx.Done():
                return
            }
            select {
            case gated <- j: // blocks until a worker is free
            case <-ctx.Done():
                return
            }
        }
    }()

    // Both channels are buffered to workers so every worker can
    // report and exit even after Run has already returned.
    errs := make(chan error, workers)
    done := make(chan struct{}, workers)
    for i := 0; i < workers; i++ {
        go func() {
            for j := range gated {
                if err := do(ctx, j); err != nil {
                    errs <- err
                    return
                }
            }
            done <- struct{}{}
        }()
    }

    finished := 0
    for finished < workers {
        select {
        case err := <-errs:
            return err // deferred cancel unwinds producer and workers
        case <-done:
            finished++
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return nil
}
Two channel hops do real work here. The rate-limit gate blocks the
producer until the ticker fires and a worker is ready to receive.
The unbuffered hand-off makes back-pressure automatic. If every
worker is busy, the producer parks on the send until one frees up.
You can tune the buffer (make(chan Job, N)) to allow a small
burst, but the principle is the same.
You cannot express this with WaitGroup or errgroup. They are
fan-out primitives. They wait for goroutines to finish. They do
not transfer values, do not buffer, do not back-pressure. The
channel is doing work neither helper can do.
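For completeness, a sketch of wiring Run up from a slice of jobs; callUpstream is a hypothetical stand-in for the real API client:

// Caller sketch (same package as Run). callUpstream is hypothetical.
func RunBatch(ctx context.Context, jobs []Job, callUpstream func(context.Context, Job) error) error {
    in := make(chan Job)
    go func() {
        defer close(in) // closing in lets the gate and workers drain
        for _, j := range jobs {
            select {
            case in <- j:
            case <-ctx.Done():
                return
            }
        }
    }()
    // 8 workers sharing a 50 requests-per-second budget upstream.
    return Run(ctx, in, 8, 50, callUpstream)
}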
Where the cost lives
Goroutine spawn is cheap. Uncontended channel ops are cheap. Once
many producers pile onto the same channel, the channel's internal
mutex (see runtime/chan.go) serialises them and the per-op cost
climbs fast. The ordering atomic < mutex < uncontended channel send
< contended channel send holds across CPUs and Go versions, even
though the absolute numbers shift with hardware.
What this means for picking a primitive. At a hundred goroutines
in a fan-out, spawn cost is rounding error next to any real work.
At a million channel sends per second on a hot path, the per-op
cost is the work. Pick the primitive that matches the work, then
run go test -bench on the workload that mirrors yours before you
optimise on a guess.
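The shape of that benchmark is small. A sketch, in a _test.go file next to Run, comparing an uncontended send against GOMAXPROCS senders piling onto one channel:

package worker

import "testing"

// One sender, one receiver: the uncontended baseline.
func BenchmarkChanUncontended(b *testing.B) {
    ch := make(chan int)
    go func() {
        for range ch {
        }
    }()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        ch <- i
    }
    close(ch)
}

// GOMAXPROCS senders hammer the same channel, so every send crosses
// the channel's internal lock under contention.
func BenchmarkChanContended(b *testing.B) {
    ch := make(chan int)
    go func() {
        for range ch {
        }
    }()
    b.ResetTimer()
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            ch <- 1
        }
    })
    close(ch)
}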
The decision matrix
Three questions, in order, before you write g.Go or
go func().
1. Do you want partial results when one job fails? Yes:
WaitGroup plus errors.Join. No: errgroup.WithContext.
2. Are the goroutines exchanging values, or just running side
by side? Exchanging values: a channel. Running side by side and
returning errors: errgroup. Running side by side and writing to
disjoint slots: WaitGroup.
3. Is back-pressure part of the job? Yes: a channel with a
chosen buffer size. The send blocks when receivers are slow,
which is the feature. Neither WaitGroup nor errgroup gives you
this — they wait for fixed-size fan-out, not for an open-ended
queue.
The translation pipeline answers question 1 with yes. The search
endpoint answers question 1 with no. The rate-limited worker pool
lands on question 3.
Next time errgroup.WithContext is the first thing your fingers
reach for, walk the three questions first. And if you have a
fan-out helper in your codebase wrapping errgroup as the default,
grep it for callers that throw away partial results; that is where
the next $x cloud bill is hiding.
If this was useful
Concurrency in Go is mostly about picking the right primitive for
the work, not about clever synchronisation. The Complete Guide to
Go Programming covers sync, errgroup, channel internals, the
runtime scheduler, and the failure modes that show up in production
once your input shape stops being friendly. Hexagonal Architecture
in Go shows how the producer/consumer and fan-out shapes here fit
inside a service that survives on-call.
