amir

Posted on May 24

The Silent Killers of Go Concurrency: Mutexes, Semaphores, and Goroutine Leaks

#go #backend #concurrency #performance

Go makes concurrency look simple.

You write:

go func() {
    // do something concurrently
}()

And suddenly your code is running in another goroutine.

That simplicity is one of the reasons I like Go so much. But after working on backend systems, notification pipelines, high-traffic APIs, and production services under real load, I learned something important:

Most concurrency problems in Go do not come from not using concurrency.

They come from using concurrency without understanding where the bottleneck actually is.

Sometimes the issue is a missing lock.

But very often, especially in production Go services, the issue is the opposite:

too much locking
locks held for too long
network I/O inside critical sections
goroutines that never exit
unbounded goroutine creation
WaitGroups copied by value
channels used without a cancellation strategy

In this article, I want to walk through the concurrency problems I have seen in real systems, how I reason about mutexes and semaphores, and how I usually debug these issues before they become production incidents.

The Real Problem: Concurrency That Accidentally Becomes Sequential

A service can look concurrent from the outside and still behave like a single-threaded application internally.

This usually happens when a large part of the request flow is hidden behind one shared lock.

A pattern like this is more common than many developers admit:

mu.Lock()
user.Name = "Test User"
sendEmail(user)
callDatabase(user)
mu.Unlock()

At first glance, it may look safe.

The developer wanted to protect shared state. That part is reasonable. But the lock is now protecting much more than shared memory. It is protecting the entire flow:

update a field
send an email
call the database
maybe wait on network I/O
maybe retry
maybe block other goroutines for a long time

That is not just a mutex anymore.

That is a traffic jam.

Every goroutine that needs the same lock must wait until the whole flow finishes. So even if your service has hundreds or thousands of goroutines, a big part of the system becomes sequential.

The dangerous part is that CPU usage may still look normal or even low. Memory may also look fine. But latency increases, throughput drops, and p95/p99 response times become unstable.

This is why lock contention is sometimes difficult to notice from basic infrastructure metrics alone.

A Production-Style Example: Email Inside a Mutex

Imagine we have a service that updates user state and sends notifications.

type Service struct {
    mu    sync.Mutex
    state map[int]string
}

func (s *Service) ProcessUsers(users []User) {
    s.mu.Lock()
    defer s.mu.Unlock()

    for _, user := range users {
        s.state[user.ID] = "processed"
        sendEmail(user) // slow network I/O inside the lock
    }
}

This code is safe from a data race perspective.

But it is dangerous from a performance perspective.

A mutex should protect the smallest possible shared memory operation. It should not protect slow external work like:

sending email
calling another microservice
database queries
HTTP requests
file uploads
logging to a slow external sink
waiting on a third-party API

The memory update may take nanoseconds or microseconds. The email call may take milliseconds or seconds.

That difference matters.

If the lock is held while sendEmail runs, every other goroutine that needs s.mu is blocked behind a network call.

A better version separates shared-state mutation from slow work:

func (s *Service) ProcessUsers(users []User) {
    emails := make([]User, 0, len(users))

    s.mu.Lock()
    for _, user := range users {
        s.state[user.ID] = "processed"
        emails = append(emails, user)
    }
    s.mu.Unlock()

    for _, user := range emails {
        sendEmail(user)
    }
}

This is already better because the lock only protects the shared map.

But in a real production system, I usually prefer pushing the slow work to a queue or bounded worker pool:

func (s *Service) ProcessUsers(users []User, jobs chan<- EmailJob) {
    s.mu.Lock()
    for _, user := range users {
        s.state[user.ID] = "processed"
    }
    s.mu.Unlock()

    for _, user := range users {
        jobs <- EmailJob{UserID: user.ID, Email: user.Email}
    }
}

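Now the request path no longer waits on the email provider directly. It only waits when the jobs channel is full, and that blocking send is exactly the backpressure signal you want.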

That is the real fix.

Not just “use goroutines.”

The fix is designing the boundary between shared memory, external I/O, and backpressure.

Mutexes Are Not Bad. Large Critical Sections Are Bad.

I sometimes see developers become afraid of mutexes.

That is the wrong lesson.

sync.Mutex is simple, fast, and perfectly fine when used correctly. The problem is not the mutex. The problem is the size of the critical section.

This is what I try to keep in mind:

mu.Lock()
// only touch shared memory here
mu.Unlock()

Not this:

mu.Lock()
// shared memory
// database call
// HTTP call
// email call
// JSON encoding
// logging
// metrics push
mu.Unlock()

A good critical section should be boring.

It should usually do one of these:

read shared state
update shared state
copy shared state into a local variable
swap a pointer
increment a counter
append to a protected slice or write to a protected map

Then unlock.

After that, do the expensive work outside the lock.
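As a sketch of the "copy shared state into a local variable" case, the lock below is held only long enough to copy the map. `Registry`, `Snapshot`, and `Set` are hypothetical names used for illustration.

```go
package main

import (
    "fmt"
    "sync"
)

// Registry is a hypothetical type used only for this sketch.
type Registry struct {
    mu    sync.Mutex
    state map[int]string
}

// Snapshot copies the map while holding the lock and returns the copy.
func (r *Registry) Snapshot() map[int]string {
    r.mu.Lock()
    defer r.mu.Unlock()

    out := make(map[int]string, len(r.state))
    for id, status := range r.state {
        out[id] = status
    }
    return out
}

// Set updates one entry under the lock.
func (r *Registry) Set(id int, status string) {
    r.mu.Lock()
    r.state[id] = status
    r.mu.Unlock()
}

func main() {
    r := &Registry{state: map[int]string{1: "processed"}}

    snap := r.Snapshot()
    r.Set(2, "pending") // later writes do not affect the snapshot

    // Slow work (encoding, logging, network calls) can now read snap
    // without holding r.mu.
    fmt.Println(len(snap), snap[1])
}
```

The trade-off is an extra copy per snapshot, which is usually far cheaper than holding the lock across slow work.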

Under the Hood: What a Mutex Gives You

At a high level, a mutex gives you mutual exclusion: only one goroutine can enter a protected section at a time.

But it also gives you memory ordering guarantees.

In Go's memory model, each unlock of a mutex is synchronized before any later lock of the same mutex returns. In practical terms, that means if one goroutine updates shared data and unlocks, another goroutine that later locks the same mutex can safely observe that update.

That is the part many developers forget.

A mutex is not just about “blocking other goroutines.” It is also about creating a safe visibility boundary between goroutines.

Without that boundary, different goroutines may read and write the same memory at the same time, and now you have a data race. Once you have a data race, your program is no longer something you can reason about confidently.

This is why I do not like “clever” lock-free code unless there is a very strong reason for it.

Most backend services do not need clever concurrency.

They need clear concurrency.
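A minimal sketch of what "clear" can look like: a counter where every read and write goes through the same mutex. `Counter` and `countTo` are hypothetical names. Because each increment happens under the lock, no update is lost and the final read sees all of them.

```go
package main

import (
    "fmt"
    "sync"
)

// Counter guards n with mu. All access goes through Inc and Value.
type Counter struct {
    mu sync.Mutex
    n  int
}

func (c *Counter) Inc() {
    c.mu.Lock()
    c.n++
    c.mu.Unlock()
}

func (c *Counter) Value() int {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.n
}

// countTo increments a shared counter from the given number of goroutines
// and returns the final value.
func countTo(goroutines int) int {
    var (
        c  Counter
        wg sync.WaitGroup
    )
    for i := 0; i < goroutines; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            c.Inc()
        }()
    }
    wg.Wait()
    return c.Value()
}

func main() {
    fmt.Println(countTo(1000)) // 1000
}
```

Removing the lock from `Inc` would turn this into a data race that `go test -race` is designed to report.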

Semaphore: Controlling Capacity, Not Ownership

A mutex is usually about ownership of shared memory.

A semaphore is about capacity.

For example, suppose you want to process 10,000 users, but you do not want to send 10,000 emails at the same time.

A naive version might do this:

for _, user := range users {
    go sendEmail(user)
}

This is dangerous because it creates unbounded concurrency.

If users has 10,000 items, you create 10,000 goroutines. If each goroutine performs network I/O, opens connections, allocates memory, and waits on an external provider, you can overload your own service before you overload the email provider.

A simple semaphore pattern fixes this:

sem := make(chan struct{}, 20) // allow only 20 concurrent email sends
var wg sync.WaitGroup

for _, user := range users {
    user := user

    sem <- struct{}{}
    wg.Add(1)

    go func() {
        defer wg.Done()
        defer func() { <-sem }()

        sendEmail(user)
    }()
}

wg.Wait()

Now the code still uses concurrency, but concurrency is bounded.

That one detail is huge in production.

Unbounded concurrency is not scalability.

It is delayed failure.

A Better Worker Pool for Production Code

The semaphore pattern is useful, but for services that run continuously, I often prefer a worker pool.

type EmailJob struct {
    UserID int
    Email  string
}

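// startEmailWorkers launches workerCount workers and returns a channel that is
// closed once all of them have exited, so callers can wait for a clean shutdown.
func startEmailWorkers(ctx context.Context, workerCount int, jobs <-chan EmailJob) <-chan struct{} {
    var wg sync.WaitGroup

    for i := 0; i < workerCount; i++ {
        wg.Add(1)

        go func(workerID int) {
            defer wg.Done()

            for {
                select {
                case <-ctx.Done():
                    return

                case job, ok := <-jobs:
                    if !ok {
                        return
                    }

                    if err := sendEmailJob(ctx, job); err != nil {
                        // In real systems: log, retry, dead-letter, or expose metrics.
                        fmt.Printf("worker=%d failed to send email user_id=%d err=%v\n", workerID, job.UserID, err)
                    }
                }
            }
        }(i)
    }

    done := make(chan struct{})
    go func() {
        wg.Wait()
        close(done) // signal that every worker has returned
    }()
    return done
}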

This gives you much better operational control:

fixed concurrency
easier metrics
easier shutdown
easier retry strategy
easier backpressure
easier rate limiting

This is the difference between “I used goroutines” and “I designed a concurrent system.”

Goroutine Leak: The Bug That Does Not Explode Immediately

Goroutine leaks are one of the most common production problems in Go.

They are dangerous because the service may not crash immediately. It may slowly become worse over hours or days.

Here is a classic example:

func process() error {
    ch := make(chan result)

    go func() {
        ch <- heavyComputation()
    }()

    select {
    case res := <-ch:
        return handle(res)

    case <-time.After(1 * time.Second):
        return errors.New("timeout")
    }
}

The problem is subtle.

ch is unbuffered.

If the timeout happens first, process returns. After that, there is no receiver waiting on ch.

When heavyComputation() finishes, the goroutine tries to send into ch and blocks forever.

That goroutine is now leaked.

One leaked goroutine may not matter.

Thousands of leaked goroutines matter.

A safer version uses a buffered channel:

func process() error {
    ch := make(chan result, 1)

    go func() {
        ch <- heavyComputation()
    }()

    select {
    case res := <-ch:
        return handle(res)

    case <-time.After(1 * time.Second):
        return errors.New("timeout")
    }
}

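This prevents the goroutine from blocking on send after the timeout. Note that heavyComputation still runs to completion; the buffer only guarantees that the goroutine can exit once it is done.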

But in real services, I prefer context-based cancellation:

func process(ctx context.Context) error {
    ctx, cancel := context.WithTimeout(ctx, 1*time.Second)
    defer cancel()

    ch := make(chan result, 1)

    go func() {
        res := heavyComputation(ctx)

        select {
        case ch <- res:
        case <-ctx.Done():
        }
    }()

    select {
    case res := <-ch:
        return handle(res)

    case <-ctx.Done():
        return ctx.Err()
    }
}

The important lesson:

Every goroutine needs an exit path.

If you cannot explain how a goroutine stops, you probably have a leak waiting to happen.
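A rough way to see the leak is to compare `runtime.NumGoroutine()` before and after a batch of timed-out calls. This is a sketch, not a leak detector: `timedCall` and `leakedAfter` are hypothetical helpers, the sleeps are arbitrary, and the count can be affected by other goroutines in a real process.

```go
package main

import (
    "fmt"
    "runtime"
    "time"
)

// timedCall waits at most timeout for work to finish. bufSize selects an
// unbuffered (0) or buffered (1) result channel.
func timedCall(bufSize int, timeout time.Duration, work func() int) (int, bool) {
    ch := make(chan int, bufSize)
    go func() { ch <- work() }()

    select {
    case v := <-ch:
        return v, true
    case <-time.After(timeout):
        return 0, false
    }
}

// leakedAfter runs a number of timed-out calls and reports how many extra
// goroutines are still alive once the work itself has finished.
func leakedAfter(bufSize, calls int) int {
    before := runtime.NumGoroutine()
    slow := func() int {
        time.Sleep(20 * time.Millisecond)
        return 1
    }

    for i := 0; i < calls; i++ {
        timedCall(bufSize, time.Millisecond, slow)
    }
    time.Sleep(100 * time.Millisecond) // let every slow() call return
    return runtime.NumGoroutine() - before
}

func main() {
    fmt.Println("unbuffered channel, extra goroutines:", leakedAfter(0, 20))
    fmt.Println("buffered channel, extra goroutines:  ", leakedAfter(1, 20))
}
```

With the unbuffered channel, each timed-out call leaves one goroutine parked on its send. With a buffer of one, the send completes and the goroutine exits.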

WaitGroup by Value: A Small Mistake With a Big Impact

This mistake is very easy to miss in code review:

func worker(wg sync.WaitGroup) { // wrong: copied by value
    defer wg.Done()

    // do work
}

sync.WaitGroup must not be copied after first use.

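When you pass it by value, you copy its internal state. The worker calls Done() on the copy, not on the original WaitGroup that the main goroutine is waiting on.

The original counter never reaches zero, so wg.Wait() blocks forever. go vet reports this mistake through its copylocks check.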

Correct version:

func worker(wg *sync.WaitGroup) {
    defer wg.Done()

    // do work
}

And usage:

var wg sync.WaitGroup

for i := 0; i < 10; i++ {
    wg.Add(1)
    go worker(&wg)
}

wg.Wait()

This rule also applies to other synchronization primitives like sync.Mutex.

Do not copy them after first use.

The Loop Variable Trap

This used to be one of the most famous Go concurrency bugs:

for _, user := range users {
    go func() {
        sendEmail(user)
    }()
}

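Before Go 1.22, a for loop reused a single variable across all iterations, so every goroutine here could end up reading whatever value user held last. Since Go 1.22, each iteration gets its own variable, provided the module's go.mod declares go 1.22 or later.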

The defensive pattern is still simple and clear:

for _, user := range users {
    user := user

    go func() {
        sendEmail(user)
    }()
}

Even with improvements in newer Go versions, I still like this style in production code because it makes the ownership of the variable obvious to the reader.

Readable concurrency is maintainable concurrency.
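Another way to make the ownership explicit is to pass the value as an argument to the goroutine's function. The sketch below uses a hypothetical `collect` helper that records what each goroutine actually saw.

```go
package main

import (
    "fmt"
    "sort"
    "sync"
)

// collect starts one goroutine per name and returns the values the
// goroutines observed, sorted for stable output.
func collect(names []string) []string {
    var (
        mu   sync.Mutex
        wg   sync.WaitGroup
        seen []string
    )
    for _, name := range names {
        wg.Add(1)
        // Passing the value as an argument gives each goroutine its own copy
        // on every Go version.
        go func(n string) {
            defer wg.Done()
            mu.Lock()
            seen = append(seen, n)
            mu.Unlock()
        }(name)
    }
    wg.Wait()
    sort.Strings(seen)
    return seen
}

func main() {
    fmt.Println(collect([]string{"carol", "alice", "bob"})) // [alice bob carol]
}
```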

How I Debug Lock Contention in Go

When I suspect a concurrency bottleneck, I do not start by guessing.

I start by measuring.

1. Enable pprof

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
)

func main() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // start application
}

Then collect profiles:

go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

For mutex contention, enable mutex profiling:

runtime.SetMutexProfileFraction(1) // 1 reports every contention event; a rate of n reports about 1/n of them

Then inspect:

go tool pprof http://localhost:6060/debug/pprof/mutex

2. Check goroutine count

A rising goroutine count is often a signal of blocked goroutines or leaks.

fmt.Println("goroutines:", runtime.NumGoroutine())

For production, expose it as a metric:

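// Register the gauge so it is actually exported. Note that client_golang's
// default Go collector already exposes a similar go_goroutines metric.
prometheus.MustRegister(prometheus.NewGaugeFunc(
    prometheus.GaugeOpts{
        Name: "go_goroutines_current",
        Help: "Current number of goroutines.",
    },
    func() float64 {
        return float64(runtime.NumGoroutine())
    },
))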

3. Dump goroutine stacks

When the service is stuck, goroutine dumps are gold.

curl "http://localhost:6060/debug/pprof/goroutine?debug=2"

Look for many goroutines blocked on the same line:

sync.(*Mutex).Lock
chan send
chan receive
net/http.(*Transport).RoundTrip

If 5,000 goroutines are blocked on the same lock or channel, you found your bottleneck.

4. Use the race detector in tests

go test -race ./...

The race detector is not free, and you usually do not run it in production, but it is extremely useful in CI and local debugging.

My Practical Rules for Production Go Concurrency

These are the rules I try to follow when writing or reviewing concurrent Go code:

1. Keep locks small

Lock only the data that needs protection.

Do not lock the whole request lifecycle.

2. Never put slow I/O inside a mutex

Avoid database calls, HTTP calls, email sending, file uploads, and third-party API calls inside critical sections.

3. Bound concurrency

Do not create unlimited goroutines.

Use worker pools, semaphores, queues, or rate limiters.

4. Every goroutine needs a shutdown path

Use context.Context, channel close, or explicit cancellation.

5. Do not copy synchronization primitives

Pass *sync.WaitGroup, *sync.Mutex, and similar primitives by pointer when sharing them.

6. Measure before optimizing

Use pprof, runtime metrics, traces, logs, and goroutine dumps.

Guessing is not debugging.

7. Prefer boring concurrency

The best concurrent code is usually not clever.

It is clear, measurable, and easy to shut down.

Final Thoughts

Go gives us powerful concurrency tools, but it does not automatically give us good concurrent design.

A goroutine is cheap, but it is not free.

A mutex is fast, but it can destroy throughput if you hold it around slow work.

A channel is elegant, but it can leak goroutines if nobody is receiving.

A WaitGroup is simple, but copying it can break your entire flow.

For me, senior Go engineering is not about using every concurrency primitive. It is about knowing when not to use them, where the real boundary is, and how the system behaves under load.

The next time you write this:

mu.Lock()

Ask one question before moving on:

What exactly am I protecting, and how fast can I release this lock?

That one question can save your service from a silent production bottleneck.

References

Go Memory Model: https://go.dev/ref/mem
Go sync package documentation: https://pkg.go.dev/sync
Go diagnostics and profiling tools: https://go.dev/doc/diagnostics
Go blog: Go scheduler and runtime notes: https://go.dev/blog/
