Kevin Abura
Your Go Background Jobs Are Silently Failing. Here’s How to Fix It

Why Go Worker Pools Fail in Production

Most Go developers write their first worker pool like this:

for job := range jobs {
    go handle(job) // one unbounded goroutine per job; errors and panics vanish
}

It works perfectly in development.

It survives QA.

It passes load testing.

And then production traffic arrives.

Three weeks later someone is SSH-ing into the server at 3AM.

This article is about why that happens.


The Illusion of Simplicity

Background jobs look simple because concurrency in Go is simple.

Goroutines are cheap. Channels are elegant.

So we build async systems quickly:

  • send emails
  • process events
  • call third-party APIs
  • update caches

Everything works — until failure starts happening repeatedly.

The issue isn't concurrency.

The issue is uncontrolled failure behavior.


Failure #1 — Retry Storms

A typical retry implementation:

for i := 0; i < 5; i++ {
    err := process(job)
    if err == nil {
        return
    }
    // no backoff, no jitter: every worker retries immediately,
    // and after five failures the error is silently dropped
}

Looks reasonable.

But in production:

  • downstream service slows
  • latency increases
  • every worker retries
  • retries amplify traffic
  • database collapses

Your retry mechanism becomes a traffic multiplier.

You DDoS yourself.

What actually happens

100 jobs × 5 retries = 500 operations
But latency increased → workers overlap → concurrency spikes → 2000+ queries

This is one of the most common real outages in async systems.


Failure #2 — Goroutine Leaks

Consider a timeout:

ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
defer cancel()

go func() {
    doWork(ctx)
}()

If doWork ignores context cancellation, the goroutine lives forever.

Under retries, leaks accumulate.

Symptoms:

  • memory slowly rises
  • CPU is idle, but the process is not healthy
  • restart "fixes" it

You didn't fix it.
You drained the leak.


Failure #3 — Duplicate Jobs

Service restarts while processing:

  1. job picked
  2. process halfway
  3. server restarts
  4. job retried
  5. executed again

Now:

  • double charge
  • duplicate email
  • inconsistent state

Async systems are not "exactly once".

They are "at least once".

If your handler is not idempotent, your system is incorrect.
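One way to get idempotency is duplicate suppression keyed by job ID. A minimal in-memory sketch (`Dedup` is an illustrative name; a real system would back the completed-set with Redis or a database unique constraint so it survives restarts):

```go
package main

import (
	"fmt"
	"sync"
)

// Dedup remembers which job IDs have already completed, so a job
// delivered twice (restart, redelivery) only takes effect once.
type Dedup struct {
	mu   sync.Mutex
	done map[string]bool
}

func NewDedup() *Dedup { return &Dedup{done: make(map[string]bool)} }

// Run executes fn only if jobID has not completed before.
func (d *Dedup) Run(jobID string, fn func() error) error {
	d.mu.Lock()
	if d.done[jobID] {
		d.mu.Unlock()
		return nil // already processed: at-least-once becomes effectively-once
	}
	d.mu.Unlock()

	if err := fn(); err != nil {
		return err // not marked done, so a retry will run it again
	}

	d.mu.Lock()
	d.done[jobID] = true
	d.mu.Unlock()
	return nil
}

func main() {
	d := NewDedup()
	charges := 0
	charge := func() error { charges++; return nil }

	d.Run("order-42", charge) // first delivery: charges the card
	d.Run("order-42", charge) // redelivery after restart: skipped
	fmt.Println("charges:", charges)
}
```

Note the ordering: the job is marked done only after it succeeds. A crash mid-job leaves it unmarked, so it runs again, which is exactly the at-least-once contract.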


Failure #4 — Shutdown Data Loss

Kubernetes sends SIGTERM.

Your service exits immediately.

In-flight jobs disappear.

No error.
No retry.
No log.

Silent corruption.

This one is especially dangerous because monitoring does not catch it.


Failure #5 — Cache Stampede

Traffic spike:

1 key expires
1000 workers request same resource
DB melts

Worker pools amplify stampedes because they parallelize cache misses.


The Real Problem

Worker pools optimize throughput.

Production systems require survival.

We must design for:

  • retries without amplification
  • bounded concurrency
  • idempotent execution
  • graceful shutdown
  • failure containment

This changes the architecture completely.
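Bounded concurrency is the foundation for the rest: a fixed worker count pulling from a channel, with a small buffer so a full queue pushes back on producers. A minimal sketch (`runBounded` is an illustrative helper):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runBounded processes nJobs with at most nWorkers concurrent
// handlers: workers pull from a channel, so load can never fan out
// without limit the way `go handle(job)` per job does.
func runBounded(nWorkers, nJobs int, handle func(int)) int {
	jobs := make(chan int, nWorkers*2) // small buffer; a full buffer is backpressure
	var wg sync.WaitGroup
	var done int64

	for w := 0; w < nWorkers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs {
				handle(job)
				atomic.AddInt64(&done, 1)
			}
		}()
	}

	for i := 0; i < nJobs; i++ {
		jobs <- i // blocks when the buffer is full: producers slow down
	}
	close(jobs)
	wg.Wait()
	return int(atomic.LoadInt64(&done))
}

func main() {
	n := runBounded(4, 100, func(job int) {})
	fmt.Println("handled:", n) // 100
}
```

The blocking send is the point: when downstream slows, the queue fills and the producer stalls, instead of concurrency spiking until something collapses.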


What We Changed

After several production incidents, we stopped treating async jobs as "background tasks".

We started treating them as state machines under failure.

Key principles:

  1. A job may run multiple times — correctness must hold
  2. Failure is normal — retry must be controlled
  3. Shutdown is frequent — tasks must be recoverable
  4. Downstream is unreliable — backpressure is mandatory

Reference Implementation

We extracted these patterns into a small Go project:

CSJD — a production-safe job dispatcher

It demonstrates:

  • bounded workers
  • duplicate suppression
  • controlled retry
  • panic recovery
  • shutdown draining

The goal is not performance.

The goal is preventing 3AM debugging sessions.


Final Thought

Most backend outages are not caused by complex algorithms.

They are caused by simple background jobs behaving badly under real conditions.

Async systems fail slowly — and silently.

Design them assuming failure is the default state.

Top comments (3)

Kevin Abura

I wrote this after debugging several incidents where background jobs behaved fine in tests but caused cascading failures in production.

The common pattern wasn't business logic bugs — it was retry amplification, goroutine leaks, and shutdown loss.

The article focuses on failure modes rather than implementation details.
Curious if others have seen similar issues.

Clara Bennett

Really relevant topic — silent goroutine failures are one of Go's biggest footguns. I've been bitten by this in production where a background worker panicked inside a goroutine and the main process kept running like nothing happened. Two patterns that saved me: 1) always wrap goroutine bodies with a recover + structured logging, and 2) use errgroup.Group so you get proper error propagation and can cancel sibling goroutines on failure. Would be great to see how your approach compares with something like errgroup or a dedicated job runner like river/machinery.
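The recover-wrapper pattern mentioned here can be sketched like this (`safeGo` and `report` are illustrative names, not from a specific library):

```go
package main

import (
	"fmt"
	"log"
)

// safeGo runs fn in a goroutine and converts a panic into a logged,
// reported error instead of silently killing the goroutine.
func safeGo(fn func() error, report func(error)) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				report(fmt.Errorf("job panicked: %v", r))
			}
		}()
		if err := fn(); err != nil {
			report(err)
		}
	}()
}

func main() {
	errs := make(chan error, 1)
	safeGo(func() error {
		panic("nil pointer somewhere deep in the handler")
	}, func(err error) { errs <- err })

	log.Println(<-errs) // the panic surfaces as an error; the process keeps running
}
```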

Kevin Abura

Great insights! You're absolutely right — the silent goroutine panic is exactly the kind of "production footgun" we built CSJD to solve.

You hit on a key distinction: CSJD isn't competing with errgroup, it's built on top of the same principles. Under the hood, CSJD's worker pool uses something very similar to what you described:

  • Every job goroutine has a recover() handler that captures panics → metrics → graceful failure
  • Worker coordination uses patterns comparable to errgroup, but specialized for job dispatch

How CSJD differs from raw errgroup + recover:

  1. Retry orchestration — errgroup gives you error propagation, but you still need to implement exponential backoff, jitter, and permanent error detection yourself
  2. Queue backpressure — errgroup doesn't handle "what if we have 100k jobs" out of the box
  3. Redis distribution — errgroup is in-process only; CSJD gives you the same API but can scale horizontally

Vs. river/machinery:
You're right to ask! The main difference is reliability-first design philosophy. Where river focuses on Postgres-backed jobs and machinery on pluggable brokers, CSJD's core is the guardrails:

  • Atomic queue caps to prevent overload
  • WAL-backed file store with process locking
  • Detached handler budgets to prevent worker starvation