Serif COLAKEL

Deadlocks in Go: The Silent Production Killer

“The health check is passing, CPU is idling at 2%, memory is flat… but the API is timing out.”

If you’ve managed Go services in production, you know this sinking feeling. Crashes are loud; they give you a panic log and a reboot. Deadlocks are silent. They don't trigger standard error rate alerts until the upstream clients start timing out. They are the "Ghost Ships" of concurrency bugs.

In this deep dive, we’re moving past the textbook definitions. We are going to look at how deadlocks actually manifest in real backends, how to debug a frozen process without killing it, and how to design systems that refuse to freeze.


🧠 Why the Runtime Doesn't Save You

You might know that the Go runtime has a built-in deadlock detector. You’ve seen it in main.go scripts:

fatal error: all goroutines are asleep - deadlock!

Here is the harsh reality: You will almost never see this error in a production web server.

Why? Because the deadlock detector only triggers when every single goroutine is asleep. In a real application (like a Gin or Echo server, or a Kafka consumer), there is always some goroutine running—network pollers, signal handlers, or background metric exporters.

As long as one goroutine is awake, the runtime assumes everything is fine, even if your critical business logic is stuck in a permanent mutex embrace.
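
You can see the gap with a toy example (a sketch): delete the background goroutine and the runtime reports the deadlock; leave it in and the same bug just hangs silently.

package main

import "time"

func main() {
    // Stand-in for a metrics exporter, poller, or any background loop.
    go func() {
        for {
            time.Sleep(time.Second)
        }
    }()

    ch := make(chan int)
    <-ch // stuck forever, but no fatal error: another goroutine can still run
}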


🧨 Real-World Deadlock Scenarios

Let's look at how we actually break things in production.

1️⃣ The "Fintech" Lock (Mutex Ordering)

This is the classic "Transfer" problem. You need to lock two resources (User A and User B) to move funds.

type Wallet struct {
    ID      int // used later to enforce a deterministic lock order
    mu      sync.Mutex
    Balance int
}

func Transfer(from, to *Wallet, amount int) {
    from.mu.Lock()
    defer from.mu.Unlock()

    // 🕒 Simulation: Network latency or context switch happens here
    time.Sleep(1 * time.Millisecond)

    to.mu.Lock()
    defer to.mu.Unlock()

    from.Balance -= amount
    to.Balance += amount
}

The Trap:
If Goroutine 1 does Transfer(A, B) and Goroutine 2 does Transfer(B, A) at the same time:

  1. G1 holds Lock A, waits for Lock B.
  2. G2 holds Lock B, waits for Lock A.
  3. 💀 Deadlock.
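
A minimal reproduction, using the Wallet and Transfer definitions above (a sketch; the 1 ms sleep makes the bad interleaving nearly guaranteed):

func main() {
    a := &Wallet{ID: 1, Balance: 100}
    b := &Wallet{ID: 2, Balance: 100}

    var wg sync.WaitGroup
    wg.Add(2)
    go func() { defer wg.Done(); Transfer(a, b, 10) }() // holds A, wants B
    go func() { defer wg.Done(); Transfer(b, a, 10) }() // holds B, wants A
    wg.Wait() // 💀 never returns; in a program this small the runtime detector does fire
}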

The Production Fix:
You must enforce a deterministic locking order. In database terms, this is how you avoid cycles in the wait-for graph. In Go, we usually sort by a unique ID (a UUID or database ID) before locking.

func Transfer(from, to *Wallet, amount int) {
    // Always lock the "smaller" ID first
    first, second := from, to
    if from.ID > to.ID {
        first, second = to, from
    }

    first.mu.Lock()
    defer first.mu.Unlock()

    second.mu.Lock()
    defer second.mu.Unlock()

    // logic...
}

2️⃣ The RWMutex Trap (The Recursive Read)

This one catches even senior engineers. sync.RWMutex allows multiple readers OR one writer.

The Scenario: A function takes a read lock and, somewhere inside, calls another function that also takes a read lock. On its own this works, because any number of readers can hold the lock at once.

The Problem: A writer arrives between the two RLock calls. (The sync.RWMutex docs explicitly warn against recursive read locking for exactly this reason.)

var mu sync.RWMutex

func ReaderA() {
    mu.RLock()
    defer mu.RUnlock()

    // Heavy work...
    ReaderB() // Calls RLock again
}

func ReaderB() {
    mu.RLock() // queues behind any writer that is already waiting
    defer mu.RUnlock()
    // More work...
}

If a mu.Lock() (write) call arrives while ReaderA is doing its heavy work, before it reaches ReaderB:

  1. ReaderA holds a read lock.
  2. The writer calls Lock and blocks, and from that moment new readers queue behind it (Go does this to prevent writer starvation).
  3. ReaderB calls RLock and waits behind the pending writer.
  4. The writer waits for ReaderA to release its read lock.
  5. ReaderA cannot release it until ReaderB returns.
  6. 💀 Deadlock.
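
A driver that lines up this sequence (a sketch; it assumes the "heavy work" in ReaderA takes long enough for the writer to arrive, so treat the sleep as illustrative):

go ReaderA() // 1. takes the read lock, starts its heavy work

time.Sleep(10 * time.Millisecond) // give ReaderA time to acquire RLock
go func() {
    mu.Lock() // 2. writer queues; from now on new RLocks wait behind it
    defer mu.Unlock()
}()
// 3-5. ReaderA eventually calls ReaderB, whose RLock waits for the writer,
// while the writer waits for ReaderA's RUnlock. 💀 Freeze.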

3️⃣ The Unbuffered Channel & The Abandoned Receiver

Worker pools are great until error handling gets involved.

func processItems(items []string) {
    ch := make(chan string) // Unbuffered

    // Spawn workers
    for i := 0; i < 5; i++ {
        go func() {
            for item := range ch {
                if err := doWork(item); err != nil {
                    return // ❌ Worker exits on error!
                }
            }
        }()
    }

    // Send items
    for _, item := range items {
        ch <- item // 💀 Blocks forever if all workers exited early
    }
}

If your workers encounter errors and return (exit), there is nobody left to read from ch. The main goroutine blocks on ch <- item forever.
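
The fix is to make sure a worker failure can't strand the sender: either keep consuming after an error, or make the send itself cancellable. A sketch of both (doWork and the error-handling policy are illustrative):

func processItems(ctx context.Context, items []string) error {
    ch := make(chan string) // still unbuffered
    var wg sync.WaitGroup

    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for item := range ch {
                if err := doWork(item); err != nil {
                    log.Printf("work failed: %v", err) // record it, but keep receiving
                }
            }
        }()
    }

    for _, item := range items {
        select {
        case ch <- item:
        case <-ctx.Done():
            close(ch) // let the workers drain and exit
            wg.Wait()
            return ctx.Err() // a visible timeout, not a frozen goroutine
        }
    }

    close(ch)
    wg.Wait()
    return nil
}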


🔎 Detective Work: Debugging a Frozen Process

You cannot fix what you cannot see. When a pod is frozen, do not restart it immediately. You need the evidence.

1️⃣ The Full Stack Dump (pprof)

If you have net/http/pprof enabled (and you should, specifically on a private admin port), this is your smoking gun.
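
If it isn't wired up yet, exposing it on a separate, non-public listener takes a few lines (a sketch; the port and bind address are illustrative):

import _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux

go func() {
    // Bind to localhost or an internal interface, never your public listener.
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()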

Don't just look at the summary. Look at the full goroutine dump with debug=2:

curl "http://localhost:6060/debug/pprof/goroutine?debug=2" > dump.txt

What to search for:

  • semacquire: The goroutine is waiting on a lock (Mutex or RWMutex).
  • chan send: Waiting to put data into a channel.
  • chan receive: Waiting for data that isn't coming.

If you see 5,000 goroutines all stuck on semacquire at the exact same line of code, say a db.GetConnection() call, you’ve found your bottleneck.

2️⃣ SIGQUIT (The Nuclear Option)

If you don't have pprof enabled, or the HTTP server itself is deadlocked, send a SIGQUIT to the process.

kill -QUIT <pid>

Go catches this signal and dumps the stack trace of every goroutine to stderr (which ends up in your container logs) before exiting. It is messy, but it prints the truth.

3️⃣ go-deadlock (For Development)

In your staging environment, consider replacing sync.Mutex with github.com/sasha-s/go-deadlock. It is a drop-in replacement that tracks lock acquisition order. If it detects a potential deadlock or a lock held for >30 seconds, it prints a frantic stack trace. Do not use this in production (performance overhead), but it's a lifesaver in QA.
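
Swapping it in is mostly a type change (a sketch; the Opts fields below are an assumption based on the package README, so check them against the version you pull in):

import "github.com/sasha-s/go-deadlock"

// The Wallet from earlier, with the mutex type swapped.
type Wallet struct {
    mu      deadlock.Mutex // drop-in replacement for sync.Mutex
    ID      int
    Balance int
}

func init() {
    // Report any lock held longer than this, even without a cycle.
    deadlock.Opts.DeadlockTimeout = 30 * time.Second
}

A common trick is to hide the swap behind a build tag, so only staging builds pay the bookkeeping cost.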


🛠 Preventing Production Freezes

We want to design systems that fail fast rather than freeze forever.

1️⃣ Context is King (The Timeout Pattern)

Never wait forever. The context package is your best defense against deadlocks.

Bad:

ch <- result // Blocks forever if receiver is gone

Good:

select {
case ch <- result:
    // Success
case <-ctx.Done():
    // Request cancelled or timed out, abandon ship
    return ctx.Err()
}

If every blocking operation (channel send/receive, DB call, HTTP request) is bound to a Context with a Timeout, deadlocks become Timeout Errors. Errors are actionable. Freezes are not.
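
The same idea works on the receive side, and for bounding an entire request (a sketch; the two-second budget and resultCh are illustrative):

ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()

select {
case result := <-resultCh:
    return result, nil
case <-ctx.Done():
    return nil, ctx.Err() // a timeout you can alert on, not a silent freeze
}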

2️⃣ Structured Concurrency (errgroup)

Stop manually managing WaitGroup counters and channels if you can avoid it. Use golang.org/x/sync/errgroup.

It handles the lifecycle for you:

  • It propagates Context cancellation to all workers.
  • If one worker returns an error, the context is cancelled, and others are notified to stop.
  • It waits for everyone to finish cleanly.

g, ctx := errgroup.WithContext(context.Background())

g.Go(func() error {
    return doHeavyTask(ctx)
})

// If doHeavyTask fails, ctx is cancelled immediately.
// No orphaned goroutines waiting on dead channels.
if err := g.Wait(); err != nil {
    return err
}

3️⃣ Limit Scope of Locks

Keep the critical section (the lines of code between Lock and Unlock) as small as possible.

Anti-Pattern:

mu.Lock()
defer mu.Unlock()
// ❌ Don't do I/O or network calls inside a lock!
http.Get("https://slow-api.com")
state = "done" // state is the shared value mu protects

Pattern:

// Do the slow stuff outside the lock.
resp, err := http.Get("https://slow-api.com")
if err != nil {
    return err
}
defer resp.Body.Close()

mu.Lock() // Lock only for the in-memory update
state = "done"
mu.Unlock()

🔗 The Big Picture

Deadlocks are just one head of the Hydra. They are intimately connected to:

  • Goroutine Leaks: A deadlocked goroutine is a leaked goroutine.
  • Resource Exhaustion: Leaked goroutines hold stack memory (min 2KB, often more).
  • Graceful Shutdown: If your shutdown logic deadlocks, your deployments will hang and K8s will SIGKILL your pods, potentially corrupting data.
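
For the shutdown case in particular, bound it with a deadline so a stuck handler becomes a logged error instead of a hung rollout (a sketch; keep the budget below your pod's terminationGracePeriodSeconds):

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

if err := srv.Shutdown(ctx); err != nil {
    // The deadline hit: some handler or connection never finished.
    log.Printf("graceful shutdown aborted: %v", err)
}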

Understanding deadlocks forces you to think about ownership. Who owns this channel? Who owns this lock? When does this operation end?

In Go, blocking is a feature, but indefinite blocking is a bug. Always assume the other side might never answer.


Key Takeaways

  1. Observability: Ensure pprof is accessible (securely) in production. It is the only way to debug a live deadlock.
  2. Order Matters: Deterministic lock ordering prevents circular dependencies.
  3. Fail Fast: Use select with time.After or ctx.Done on channels. Convert deadlocks into errors.
  4. Avoid Recursion with Locks: Be terrified of RWMutex if you have complex call chains.
