Bounded Parallelism in Go: The Semaphore Pattern That Caps Goroutines


Me: xgabriel.com | GitHub

You have a slice of 50,000 URLs and a function that fetches each one.
The Go way is obvious: spin a goroutine per URL, wait on a
sync.WaitGroup, done. It compiles. It passes the test with ten
URLs. Then you run it against the real list and your process tries to
open 50,000 sockets in a second, hits its file-descriptor limit with
"too many open files", gets 429 from the downstream API on request
after request, and watches its own memory graph go vertical.

Goroutines are cheap. What they touch is not. Each one here holds a
socket, a TLS session, a response buffer, and a slot in somebody
else's connection pool. Tens of thousands of cheap goroutines fighting
over a finite resource is how a Go service melts. The fix is not fewer
goroutines. It is a cap on how many run at once. That cap is a
semaphore, and Go gives you two clean ways to build one.

The bug: fan-out with no ceiling

Here is the shape that ships and then falls over.

func fetchAll(urls []string) []Result {
    results := make([]Result, len(urls))
    var wg sync.WaitGroup
    for i, u := range urls {
        wg.Add(1)
        go func() {
            defer wg.Done()
            results[i] = fetch(u)
        }()
    }
    wg.Wait()
    return results
}

Since Go 1.22 the loop variables i and u are per-iteration (provided
the module's go.mod declares go 1.22 or later), so the old capture bug
is gone. This code is correct. It is also unbounded. len(urls)
goroutines start immediately, and each one calls fetch as soon as it
is scheduled. Nothing throttles the concurrency. The WaitGroup counts
goroutines; it does not limit them.

You want the exact same fan-out, but with never more than N
fetch calls in flight. That number N is a knob you tune to the
weakest resource in the chain: the API's rate limit, your fd limit,
the connection pool size.

Pattern 1: a buffered channel as a counting semaphore

A buffered channel of capacity N is a counting semaphore. Sending
into it is "acquire a slot." Receiving from it is "release a slot."
When the buffer is full, the send blocks, which is exactly the
back-pressure you want.

func fetchAll(urls []string, limit int) []Result {
    results := make([]Result, len(urls))
    sem := make(chan struct{}, limit)
    var wg sync.WaitGroup

    for i, u := range urls {
        wg.Add(1)
        sem <- struct{}{} // acquire; blocks if full
        go func() {
            defer wg.Done()
            defer func() { <-sem }() // release
            results[i] = fetch(u)
        }()
    }
    wg.Wait()
    return results
}

The element type is struct{} because the slot carries no data. It is
a token. The sem <- struct{}{} sits before go, so the loop itself
blocks once limit goroutines are running. The defer func() { <-sem }()
frees the slot when the goroutine finishes, and the loop unblocks to
start the next one. At any instant at most limit worker goroutines
exist, so at most limit fetch calls are in flight. limit must be at
least 1: with a capacity of zero the channel is unbuffered and the
first send blocks forever.

One detail that trips people: put the acquire before the go
statement, not inside the goroutine. If you acquire inside the
goroutine, all len(urls) goroutines still spawn at once and then
queue on the channel. You have capped the concurrent fetch calls
but not the goroutine count itself, and each parked goroutine still
holds its stack and its captured variables. For 50,000 URLs that is
50,000 live goroutines waiting their turn. Acquiring before go
keeps the goroutine count bounded too.
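
To see the difference between the two placements, the sketch below counts how many worker goroutines exist at once under each. The helper peakLive, its acquireBeforeGo flag, and the 2 ms sleep that stands in for fetch are assumptions made for illustration; none of them come from the code above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peakLive runs n jobs behind a channel semaphore of size limit (limit >= 1)
// and reports the largest number of worker goroutines alive at one time.
// acquireBeforeGo selects where the acquire happens. Hypothetical helper.
func peakLive(n, limit int, acquireBeforeGo bool) int64 {
	sem := make(chan struct{}, limit)
	var live, peak atomic.Int64
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		if acquireBeforeGo {
			sem <- struct{}{} // the loop blocks here once limit slots are taken
		}
		go func() {
			defer wg.Done()
			cur := live.Add(1) // this goroutine now exists
			for {              // record a new maximum, if there is one
				old := peak.Load()
				if cur <= old || peak.CompareAndSwap(old, cur) {
					break
				}
			}
			if !acquireBeforeGo {
				sem <- struct{}{} // already-spawned goroutines queue up here
			}
			time.Sleep(2 * time.Millisecond) // stand-in for fetch
			live.Add(-1)
			<-sem // release only after this goroutine stops counting as live
		}()
	}
	wg.Wait()
	return peak.Load()
}

func main() {
	fmt.Println("acquire before go:", peakLive(500, 8, true)) // never above 8
	fmt.Println("acquire inside:   ", peakLive(500, 8, false))
}
```

With the acquire before go, the reported peak cannot exceed the limit, because a goroutine is only launched while it holds a slot. With the acquire inside, the peak depends on scheduling and can approach the full job count.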

Pattern 2: x/sync/semaphore for weighted limits

The channel version treats every unit of work as one slot. Sometimes
work is not uniform. A batch job might process small files and huge
files from the same queue, and you want the big ones to consume more
of the budget. golang.org/x/sync/semaphore
gives you a weighted semaphore for that.

import "golang.org/x/sync/semaphore"

func fetchAll(ctx context.Context, urls []string,
    limit int64) ([]Result, error) {

    results := make([]Result, len(urls))
    sem := semaphore.NewWeighted(limit)
    var wg sync.WaitGroup

    for i, u := range urls {
        if err := sem.Acquire(ctx, 1); err != nil {
            break // ctx cancelled; stop launching
        }
        wg.Add(1)
        go func() {
            defer wg.Done()
            defer sem.Release(1)
            results[i] = fetch(u)
        }()
    }
    wg.Wait()
    return results, ctx.Err()
}

Two things the channel version does not give you without extra code
come for free here. First, Acquire takes a context.Context, so a
cancelled context unblocks a waiting acquire instead of parking
forever. Second, Acquire(ctx, n) takes a weight. A job that reserves
4 units leaves room for fewer concurrent peers, so you can size work
by cost rather than count.
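
For comparison, getting the same cancellation behavior out of the channel semaphore takes a select around the send. A minimal sketch, assuming hypothetical helpers named acquire and tryAcquire; neither is part of any library.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// acquire takes one slot from sem, or gives up when ctx is done.
// Hypothetical helper for illustration.
func acquire(ctx context.Context, sem chan struct{}) error {
	select {
	case sem <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// tryAcquire builds a semaphore with `held` of its `limit` slots already
// taken (held must not exceed limit) and reports whether one more slot
// could be acquired within waitMs milliseconds.
func tryAcquire(limit, held, waitMs int) bool {
	sem := make(chan struct{}, limit)
	for i := 0; i < held; i++ {
		sem <- struct{}{}
	}
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(waitMs)*time.Millisecond)
	defer cancel()
	return acquire(ctx, sem) == nil
}

func main() {
	fmt.Println(tryAcquire(2, 1, 10)) // true: a slot is free
	fmt.Println(tryAcquire(2, 2, 10)) // false: full, so the timeout wins
}
```

This works, but every call site has to remember the select. x/sync/semaphore packages that into Acquire.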

NewWeighted(limit) is a plain int64 budget. Acquire(ctx, 1)
subtracts one and blocks when the budget is exhausted; Release(1)
adds it back. Reach for x/sync/semaphore when you need
context-aware acquisition or non-uniform weights. Reach for the
buffered channel when the work is uniform and you want zero
dependencies.
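
One way to turn "big files cost more" into a weight is to charge one unit per fixed chunk of input and clamp the result to the budget. The function name weightFor, the 1 MiB unit, and the budget of 8 are assumptions for illustration. The clamp matters because a request for more than the semaphore's total size can never be satisfied, so the acquire would only return once the context ends.

```go
package main

import "fmt"

// weightFor maps a job size in bytes to a semaphore weight: one unit per
// started chunk of unitBytes, at least 1 and at most budget.
// Hypothetical helper; the result is meant to be passed as n to both
// Acquire and Release.
func weightFor(sizeBytes, unitBytes, budget int64) int64 {
	if unitBytes <= 0 || budget <= 0 {
		return 1 // defensive default for nonsensical inputs
	}
	w := (sizeBytes + unitBytes - 1) / unitBytes // ceiling division
	if w < 1 {
		w = 1
	}
	if w > budget {
		w = budget // a weight above the budget could never be acquired
	}
	return w
}

func main() {
	const mib = 1 << 20
	const budget = 8 // total units, as would be passed to semaphore.NewWeighted
	for _, size := range []int64{10_000, 3 * mib, 64 * mib} {
		fmt.Printf("%d bytes -> weight %d\n", size, weightFor(size, mib, budget))
	}
}
```

Compute the weight once per job and release exactly what was acquired. A clamped job takes the whole budget, so it runs alone, which is usually the intent for the largest inputs.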

Propagating the first error with errgroup

Both versions above swallow errors: fetch returns a Result and
nobody checks whether it failed. Real fan-out needs to know when a
worker fails, and usually needs to stop the rest. That is where
golang.org/x/sync/errgroup
earns its place, and since 2022 it has a bounded mode built in.

errgroup.Group has a SetLimit(n) method that caps the group at n
concurrently running goroutines. The group also captures the first
non-nil error and, when created with errgroup.WithContext, cancels a
shared context so the other workers can bail out early.

import "golang.org/x/sync/errgroup"

func fetchAll(ctx context.Context, urls []string,
    limit int) ([]Result, error) {

    results := make([]Result, len(urls))
    g, ctx := errgroup.WithContext(ctx)
    g.SetLimit(limit)

    for i, u := range urls {
        g.Go(func() error {
            r, err := fetch(ctx, u)
            if err != nil {
                return fmt.Errorf("fetch %s: %w", u, err)
            }
            results[i] = r
            return nil
        })
    }
    if err := g.Wait(); err != nil {
        return nil, err
    }
    return results, nil
}

SetLimit(limit) makes g.Go block when limit goroutines are
already running, so it is the semaphore and the fan-out in one call.
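errgroup.WithContext returns a derived ctx that gets cancelled
the moment any worker returns a non-nil error. Pass that ctx into
fetch (which in this version takes a context and returns an error),
and an in-flight request cancels itself once a sibling has already
failed. The loop still calls g.Go for the remaining URLs after a
failure; those workers start, see the cancelled ctx, and return
quickly as long as fetch honors it. g.Wait returns the first error,
or nil if every worker succeeded.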

One caveat worth stating plainly: writing to results[i] from many
goroutines is safe here only because every goroutine writes a
distinct index. No two workers touch the same slot, so there is no
data race. The moment workers share a slice element or a map, you
need a sync.Mutex around the write. Run the race detector to be
sure:

go test -race ./...
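
As an illustration of the shared-state case, the sketch below has bounded workers tally into one map, which does need a lock. The function countByLength and the word-length tally are assumptions chosen to keep the example on the standard library.

```go
package main

import (
	"fmt"
	"sync"
)

// countByLength tallies how many words have each length, using at most
// limit concurrent workers (limit >= 1). Every worker writes to the same
// map, so each write is guarded by mu. Hypothetical example.
func countByLength(words []string, limit int) map[int]int {
	counts := make(map[int]int)
	var mu sync.Mutex
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup

	for _, w := range words {
		wg.Add(1)
		sem <- struct{}{} // acquire before launching
		// w is passed as an argument so the sketch also behaves on
		// toolchains older than Go 1.22.
		go func(w string) {
			defer wg.Done()
			defer func() { <-sem }() // release
			n := len(w)              // per-worker computation needs no lock
			mu.Lock()
			counts[n]++ // shared map: serialize the write
			mu.Unlock()
		}(w)
	}
	wg.Wait()
	return counts
}

func main() {
	fmt.Println(countByLength([]string{"go", "is", "fun", "to", "run"}, 2))
	// prints map[2:3 3:2]
}
```

Removing the Lock and Unlock calls turns this into a concurrent map write, which is the kind of bug the race detector is there to catch.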

If you want partial results even when some workers fail, don't
return the error from the worker. Record it in a per-index error
slice and return nil, so the group never cancels its context and
every worker runs to completion. After g.Wait, inspect the slice or
bundle the non-nil entries with errors.Join. The right call depends
on whether one failure invalidates the whole batch.
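
A minimal sketch of the collect-everything variant, written with the buffered-channel semaphore so that it runs on the standard library alone. Result, fetch, and the rule that URLs starting with "bad" fail are stand-ins invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// Result and fetch are hypothetical stand-ins for the article's types.
type Result struct{ Body string }

func fetch(u string) (Result, error) {
	if strings.HasPrefix(u, "bad") {
		return Result{}, errors.New("unreachable")
	}
	return Result{Body: "ok:" + u}, nil
}

// fetchAllPartial never stops early. It returns every result that
// succeeded plus one joined error describing every failure (nil if none).
func fetchAllPartial(urls []string, limit int) ([]Result, error) {
	results := make([]Result, len(urls))
	errs := make([]error, len(urls)) // one slot per worker, so no lock needed
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup

	for i, u := range urls {
		wg.Add(1)
		sem <- struct{}{} // acquire before launching
		go func(i int, u string) {
			defer wg.Done()
			defer func() { <-sem }() // release
			r, err := fetch(u)
			if err != nil {
				errs[i] = fmt.Errorf("fetch %s: %w", u, err)
				return
			}
			results[i] = r
		}(i, u)
	}
	wg.Wait()
	return results, errors.Join(errs...) // Join skips nil entries
}

func main() {
	res, err := fetchAllPartial([]string{"a", "bad1", "b", "bad2"}, 2)
	fmt.Println(res[0].Body, res[2].Body) // the successes survive
	fmt.Println(err)                      // one line per failed URL
}
```

The same per-index error slice works inside an errgroup if each closure records its error and returns nil. Callers can still use errors.Is and errors.As on the joined error.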

Which one to reach for

Three tools, one decision.

Buffered channel of struct{}. No imports, no dependency. Use it when the work is uniform, you don't need error propagation, and you want the pattern visible in the code. Acquire before go.
x/sync/semaphore. Use it for weighted limits or when you need a context-aware Acquire that unblocks on cancellation. It is a budget, not a slot count.
x/sync/errgroup with SetLimit. The default for anything that returns errors. It fans out, caps concurrency, cancels on first failure, and hands you the error. In practice this is what most production fan-out should use.

The number N is the whole point. Pick it from the tightest resource
in the chain, not reflexively from your CPU count. runtime.NumCPU()
is a sensible ceiling only for CPU-bound work. For I/O-bound work (HTTP calls,
database queries, disk) the ceiling is whatever the remote side
tolerates, which is usually far smaller than you'd guess. Start
conservative, watch the downstream, raise it if there is headroom.
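
One way to make "the tightest resource in the chain" concrete is to take the minimum of every ceiling you know about. The function pickLimit and the numbers in main are assumptions for illustration, not measured values.

```go
package main

import (
	"fmt"
	"runtime"
)

// pickLimit returns the smallest positive ceiling, or 1 if none is given.
// Hypothetical helper: non-positive values mean "unknown" and are ignored.
func pickLimit(ceilings ...int) int {
	limit := 0
	for _, c := range ceilings {
		if c > 0 && (limit == 0 || c < limit) {
			limit = c
		}
	}
	if limit == 0 {
		return 1 // fall back to serial rather than a zero-capacity semaphore
	}
	return limit
}

func main() {
	// Made-up ceilings: an API that tolerates 20 concurrent requests, a
	// connection pool of 100, and half of a 1024 file-descriptor limit,
	// leaving the rest for everything else the process opens.
	fmt.Println("I/O-bound limit:", pickLimit(20, 100, 1024/2))
	fmt.Println("CPU-bound limit:", pickLimit(runtime.NumCPU()))
}
```

Treat the result as a starting point. The downstream's behavior under load is what decides whether the number can go up.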

Unbounded fan-out is one of those patterns that looks idiomatic and
behaves like a load test against your own infrastructure. A cap
turns it back into a tool. Whichever of the three you pick, the
goroutine count stops being a function of your input size, and your
service stops melting on the day the input gets big.

Bounded parallelism is a small pattern that sits on top of Go's
whole concurrency model, and it reads a lot cleaner once channels,
select, and the sync primitives underneath it are second nature.
The Complete Guide to Go Programming works through those from the
runtime up, including how the scheduler parks and wakes goroutines
on a blocked channel send. Hexagonal Architecture in Go is where
you learn to keep this fan-out at the right boundary, so the worker
pool stays an implementation detail and doesn't leak into your domain
logic.