Kevin Abura
Your Go Background Jobs Are Silently Failing. Here’s How to Fix It

Why Go Worker Pools Fail in Production

Most Go developers write their first worker pool like this:

for job := range jobs {
    go handle(job) // one unbounded goroutine per job; errors and panics vanish
}

It works perfectly in development.

It survives QA.

It passes load testing.

And then production traffic arrives.

Three weeks later someone is SSH-ing into the server at 3AM.

This article is about why that happens.


The Illusion of Simplicity

Background jobs look simple because concurrency in Go is simple.

Goroutines are cheap. Channels are elegant.

So we build async systems quickly:

  • send emails
  • process events
  • call third-party APIs
  • update caches

Everything works — until failure starts happening repeatedly.

The issue isn't concurrency.

The issue is uncontrolled failure behavior.


Failure #1 — Retry Storms

A typical retry implementation:

for i := 0; i < 5; i++ {
    err := process(job)
    if err == nil {
        return
    }
    // no backoff, no jitter: every worker retries immediately,
    // and after five failures the error is silently dropped
}

Looks reasonable.

But in production:

  • downstream service slows
  • latency increases
  • every worker retries
  • retries amplify traffic
  • database collapses

Your retry mechanism becomes a traffic multiplier.

You DDoS yourself.

What actually happens

100 jobs × 5 retries = 500 operations
But latency increased → workers overlap → concurrency spikes → 2000+ queries

This is one of the most common real outages in async systems.


Failure #2 — Goroutine Leaks

Consider a timeout:

ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()

go func() {
    doWork(ctx)
}()

If doWork ignores context cancellation, the goroutine lives forever.

Under retries, leaks accumulate.

Symptoms:

  • memory slowly rises
  • CPU is idle, but the process is not healthy
  • restart "fixes" it

You didn't fix it.
You drained the leak.


Failure #3 — Duplicate Jobs

Service restarts while processing:

  1. job picked
  2. process halfway
  3. server restarts
  4. job retried
  5. executed again

Now:

  • double charge
  • duplicate email
  • inconsistent state

Async systems are not "exactly once".

They are "at least once".

If your handler is not idempotent, your system is incorrect.
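One way to get idempotency is duplicate suppression keyed by job ID. A minimal in-memory sketch (`Dedup` is an illustrative name; a real system would back the completed-set with Redis or a database unique constraint so it survives restarts):

```go
package main

import (
	"fmt"
	"sync"
)

// Dedup remembers which job IDs have already completed, so a job
// delivered twice (restart, redelivery) only takes effect once.
type Dedup struct {
	mu   sync.Mutex
	done map[string]bool
}

func NewDedup() *Dedup { return &Dedup{done: make(map[string]bool)} }

// Run executes fn only if jobID has not completed before.
func (d *Dedup) Run(jobID string, fn func() error) error {
	d.mu.Lock()
	if d.done[jobID] {
		d.mu.Unlock()
		return nil // already processed: at-least-once becomes effectively-once
	}
	d.mu.Unlock()

	if err := fn(); err != nil {
		return err // not marked done, so a retry will run it again
	}

	d.mu.Lock()
	d.done[jobID] = true
	d.mu.Unlock()
	return nil
}

func main() {
	d := NewDedup()
	charges := 0
	charge := func() error { charges++; return nil }

	d.Run("order-42", charge) // first delivery: charges the card
	d.Run("order-42", charge) // redelivery after restart: skipped
	fmt.Println("charges:", charges)
}
```

Note the ordering: the job is marked done only after it succeeds. A crash mid-job leaves it unmarked, so it runs again, which is exactly the at-least-once contract.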


Failure #4 — Shutdown Data Loss

Kubernetes sends SIGTERM.

Your service exits immediately.

In-flight jobs disappear.

No error.
No retry.
No log.

Silent corruption.

This one is especially dangerous because monitoring does not catch it.


Failure #5 — Cache Stampede

Traffic spike:

1 key expires
1000 workers request same resource
DB melts

Worker pools amplify stampedes because they parallelize cache misses.


The Real Problem

Worker pools optimize throughput.

Production systems require survival.

We must design for:

  • retries without amplification
  • bounded concurrency
  • idempotent execution
  • graceful shutdown
  • failure containment

This changes the architecture completely.
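Bounded concurrency is the foundation for the rest: a fixed worker count pulling from a channel, with a small buffer so a full queue pushes back on producers. A minimal sketch (`runBounded` is an illustrative helper):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runBounded processes nJobs with at most nWorkers concurrent
// handlers: workers pull from a channel, so load can never fan out
// without limit the way `go handle(job)` per job does.
func runBounded(nWorkers, nJobs int, handle func(int)) int {
	jobs := make(chan int, nWorkers*2) // small buffer; a full buffer is backpressure
	var wg sync.WaitGroup
	var done int64

	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs {
				handle(job)
				atomic.AddInt64(&done, 1)
			}
		}()
	}

	for i := 0; i < nJobs; i++ {
		jobs <- i // blocks when the buffer is full: producers slow down
	}
	close(jobs)
	wg.Wait()
	return int(atomic.LoadInt64(&done))
}

func main() {
	n := runBounded(4, 100, func(job int) {})
	fmt.Println("handled:", n) // 100
}
```

The blocking send is the point: when downstream slows, the queue fills and the producer stalls, instead of concurrency spiking until something collapses.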


What We Changed

After several production incidents, we stopped treating async jobs as "background tasks".

We started treating them as state machines under failure.

Key principles:

  1. A job may run multiple times — correctness must hold
  2. Failure is normal — retry must be controlled
  3. Shutdown is frequent — tasks must be recoverable
  4. Downstream is unreliable — backpressure is mandatory

Reference Implementation

We extracted these patterns into a small Go project:

CSJD — a production-safe job dispatcher

It demonstrates:

  • bounded workers
  • duplicate suppression
  • controlled retry
  • panic recovery
  • shutdown draining

The goal is not performance.

The goal is preventing 3AM debugging sessions.


Final Thought

Most backend outages are not caused by complex algorithms.

They are caused by simple background jobs behaving badly under real conditions.

Async systems fail slowly — and silently.

Design them assuming failure is the default state.

Top comments (3)

Kevin Abura

I wrote this after debugging several incidents where background jobs behaved fine in tests but caused cascading failures in production.

The common pattern wasn't business logic bugs — it was retry amplification, goroutine leaks, and shutdown loss.

The article focuses on failure modes rather than implementation details.
Curious if others have seen similar issues.

Clara Bennett

Really relevant topic — silent goroutine failures are one of Go's biggest footguns. I've been bitten by this in production where a background worker panicked inside a goroutine and the main process kept running like nothing happened. Two patterns that saved me: 1) always wrap goroutine bodies with a recover + structured logging, and 2) use errgroup.Group so you get proper error propagation and can cancel sibling goroutines on failure. Would be great to see how your approach compares with something like errgroup or a dedicated job runner like river/machinery.
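The recover-wrapper pattern mentioned here can be sketched like this (`safeGo` and `report` are illustrative names, not from a specific library):

```go
package main

import (
	"fmt"
	"log"
)

// safeGo runs fn in a goroutine and converts a panic into a logged,
// reported error instead of silently killing the goroutine.
func safeGo(fn func() error, report func(error)) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				report(fmt.Errorf("job panicked: %v", r))
			}
		}()
		if err := fn(); err != nil {
			report(err)
		}
	}()
}

func main() {
	errs := make(chan error, 1)
	safeGo(func() error {
		panic("nil pointer somewhere deep in the handler")
	}, func(err error) { errs <- err })

	log.Println(<-errs) // the panic surfaces as an error; the process keeps running
}
```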

Kevin Abura

Great insights! You're absolutely right — the silent goroutine panic is exactly the kind of "production footgun" we built CSJD to solve.

You hit on a key distinction: CSJD isn't competing with errgroup, it's built on top of the same principles. Under the hood, CSJD's worker pool uses something very similar to what you described:

  • Every job goroutine has a recover() handler that captures panics → metrics → graceful failure
  • Worker coordination uses patterns comparable to errgroup, but specialized for job dispatch

How CSJD differs from raw errgroup + recover:

  1. Retry orchestration — errgroup gives you error propagation, but you still need to implement exponential backoff, jitter, and permanent error detection yourself
  2. Queue backpressure — errgroup doesn't handle "what if we have 100k jobs" out of the box
  3. Redis distribution — errgroup is in-process only; CSJD gives you the same API but can scale horizontally

Vs. river/machinery:
You're right to ask! The main difference is reliability-first design philosophy. Where river focuses on Postgres-backed jobs and machinery on pluggable brokers, CSJD's core is the guardrails:

  • Atomic queue caps to prevent overload
  • WAL-backed file store with process locking
  • Detached handler budgets to prevent worker starvation