Pavel Sanikovich

5 Go Bugs That Only Appear in Production

Go has a reputation for being boring — in a good way.
Strong typing, a simple concurrency model, a strict compiler. If something is wrong, it usually fails fast.

And yet, many Go bugs don’t fail fast at all.

They quietly pass tests, survive code review, behave perfectly on your laptop, and only show up in production — under real traffic, real data, and long-running processes.

This article isn’t about exotic edge cases. It’s about bugs that look innocent, feel “Go-ish”, and still manage to hurt you in production. Especially if you’re a junior or mid-level Go developer.


Goroutines That Never Die

One of the most common production issues in Go is not a crash, but slow degradation. Memory usage grows, CPU usage creeps up, and the number of goroutines keeps increasing.

Often the root cause is a goroutine that was supposed to finish — but never did.

Consider a worker reading from a channel:

func worker(ch <-chan Job) {
    for job := range ch {
        process(job)
    }
}

This code looks clean and idiomatic. In tests, the channel is closed properly. Locally, everything works.

In production, things are different. A producer might crash, a request might be canceled, or a code path that closes the channel might never execute. The goroutine stays alive forever, blocked on receive.

Over time, these goroutines accumulate. The service is still “up”, but it’s slowly dying.
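A cheap early-warning signal is the goroutine count itself. Here's a minimal sketch (in a real service you'd likely feed this into your metrics system rather than the log):

go func() {
    // A steady climb under stable traffic usually means a leak.
    for range time.Tick(30 * time.Second) {
        log.Printf("goroutines: %d", runtime.NumGoroutine())
    }
}()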

Production-grade goroutines need an explicit lifetime. Usually that means context cancellation:

func worker(ctx context.Context, ch <-chan Job) {
    for {
        select {
        case <-ctx.Done():
            // The parent canceled or timed out; stop the worker.
            return
        case job, ok := <-ch:
            if !ok {
                // The channel was closed; no more work is coming.
                return
            }
            process(job)
        }
    }
}

If a goroutine doesn’t know when it should stop, it probably won’t.
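For completeness, the caller side might look like this (a sketch; Job and process are the placeholders from above):

func run(jobs []Job) {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel() // releases the worker even if run returns early

    ch := make(chan Job)
    go worker(ctx, ch)

    for _, j := range jobs {
        ch <- j
    }
    close(ch) // normal shutdown path; cancel() covers the abnormal one
}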


Data Races That Only Exist Under Load

Go’s race detector is excellent, but it only reports races it actually observes at runtime. If your tests never execute the conflicting accesses concurrently, go test -race stays silent.

A classic example is shared configuration:

type Config struct {
    Enabled bool
}

var cfg = &Config{Enabled: true}

func handler() {
    if cfg.Enabled {
        doSomething()
    }
}

At some point, someone adds hot reload:

func reload() {
    cfg.Enabled = false
}

This might run fine for weeks. Tests pass. The race detector stays quiet.

Then traffic grows. CPU cores are actually busy. Suddenly behavior becomes inconsistent, but nothing obviously crashes.

The problem isn’t Go. The problem is mutating shared state without synchronization. In production, concurrency is not hypothetical — it’s constant.

A safer approach is to treat configuration as immutable and swap it atomically:

var cfg atomic.Value

func init() {
    cfg.Store(Config{Enabled: true})
}

func handler() {
    c := cfg.Load().(Config)
    if c.Enabled {
        doSomething()
    }
}
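The reload path then builds a complete new value and swaps it in a single step:

func reload() {
    // Never mutate the stored Config; replace it wholesale.
    cfg.Store(Config{Enabled: false})
}

Since Go 1.19, atomic.Pointer[Config] gives you the same pattern with compile-time type safety instead of a type assertion.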

Production reveals races not because it’s special, but because it’s honest.


The Interface That Is Nil (Except It Isn’t)

This is one of the most confusing bugs for people new to Go, and it often hides until a rare code path is executed in production.

type MyError struct{}

func (e *MyError) Error() string {
    return "something went wrong"
}

func do() error {
    var err *MyError = nil
    return err
}

From the caller’s point of view:

err := do()
if err != nil {
    log.Println("error:", err)
}

You expect nothing to happen. Instead, the error branch runs.

The reason is subtle but fundamental. An interface value in Go contains both a type and a value. Here, the value is nil, but the type is not. That makes the interface itself non-nil.

This kind of bug often appears only in production, when a rarely used error path finally executes.

The fix is simple but strict: never return a typed nil as an interface. Return a real nil or a real error — nothing in between.
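A fixed version might look like this (somethingFailed is a placeholder for whatever can actually go wrong):

func do() error {
    if somethingFailed() {
        return &MyError{}
    }
    // Untyped nil: both the type and the value in the interface are nil.
    return nil
}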


Timeouts That Work Locally and Fail in Production

Timeouts are another classic “it worked on my machine” trap.

client := &http.Client{
    Timeout: 2 * time.Second,
}

Locally, requests are fast. In staging, everything looks fine. In production, requests start timing out randomly.

The difference is the network. DNS latency, TLS handshakes, slow upstreams, saturated connection pools — none of that exists on localhost.

A single global timeout often hides where time is actually being spent. A more production-friendly approach is to put deadlines on requests themselves:

ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
defer cancel()

req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
if err != nil {
    return err
}
resp, err := client.Do(req)
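Continuing from the snippet above, it's also worth telling a deadline apart from other failures, so the logs say where the time went. A sketch:

if err != nil {
    if errors.Is(err, context.DeadlineExceeded) {
        // The 500ms budget ran out before a response arrived.
        log.Println("request timed out:", err)
    }
    return err
}
defer resp.Body.Close()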

Production is not slow because Go is slow. It’s slow because networks are unreliable.


Allocation Patterns That Break at Scale

Many performance problems don’t come from algorithms, but from memory behavior that changes with scale.

Code like this looks harmless:

buf := make([]byte, 10<<20) // 10 MiB, allocated fresh every time
process(buf)

Maybe it runs once per request. Maybe it’s short-lived. Locally, no problem.

In production, under sustained load, this creates constant pressure on the garbage collector. Large allocations must be zeroed, tracked, and scanned. Latency spikes appear, and p99 gets ugly.

This is why production Go code often relies on reuse:

var bufPool = sync.Pool{
    New: func() any {
        return make([]byte, 64<<10)
    },
}
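Getting a buffer out of the pool and returning it might look like this (a sketch; use is a placeholder for whatever consumes the bytes):

func handle(r io.Reader) error {
    buf := bufPool.Get().([]byte)
    defer bufPool.Put(buf) // hand the buffer back for reuse

    n, err := r.Read(buf)
    if err != nil && err != io.EOF {
        return err
    }
    use(buf[:n])
    return nil
}

One caveat: staticcheck (SA6002) warns that storing a slice directly in a sync.Pool allocates a small header on every Put, which is why many codebases store *[]byte instead.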

The GC in Go is very good, but it still obeys physics.


Why These Bugs Feel “Production-Only”

Because production is the first place where your code experiences long uptimes, real concurrency, unreliable networks, large data, and sustained load.

Go doesn’t hide these problems — it simply doesn’t simulate them for you.

If you write Go as if production is calm and predictable, production will eventually disagree.


Want to go further?

This series focuses on understanding Go, not just using it.

If you want to continue in the same mindset, Educative is a great next step.

It’s a single subscription that gives you access to hundreds of in-depth, text-based courses — from Go internals and concurrency to system design and distributed systems. No videos, no per-course purchases, just structured learning you can move through at your own pace.

👉 Explore the full Educative library here
