
Serif COLAKEL


Go Retries: Backoff, Jitter, and Best Practices


Mitigating Retry Storms with Exponential Backoff, Jitter, and Retry Budgets

Modern distributed systems are built on an uncomfortable truth:
failure is not exceptional — it is normal.

Networks partition, dependencies slow down, pods restart, and databases throttle.
What separates a resilient Go service from a fragile one is not how often it fails —
but how it behaves while failing.

One of the most dangerous instincts in backend engineering is simple:

“The request failed. Let’s retry.”

Retries fix transient issues.
Retries also take down entire systems when misused.

This article dives into how retry storms emerge in Go systems, why they’re so destructive, and how to implement production-grade retry strategies using exponential backoff, jitter, retry budgets, and circuit breakers — without leaking goroutines or memory.


What a Retry Storm Actually Is

A retry storm is a positive feedback loop:

  1. A downstream service slows down or partially fails
  2. Clients retry failed requests
  3. Retries increase load on the already failing service
  4. Latency increases further
  5. More requests fail → more retries

Eventually, the system enters a metastable failure state:
even after the original issue is fixed, traffic amplification keeps the system down.

Retry Amplification in Microservices

Consider a simple chain:

API → Order Service → Inventory Service

If each caller in the chain makes up to 3 attempts, a single failure at the bottom becomes:

3 × 3 = 9 requests

Now multiply that by:

  • thousands of goroutines
  • multiple replicas
  • auto-scaled clients

Congratulations — you just DDoS’d yourself.
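
The general shape of the amplification is worth keeping in mind. A back-of-the-envelope sketch (names are illustrative, not part of any library):

// With `attempts` tries per caller and `layers` retrying callers stacked
// above the failing service, a single failure can fan out to attempts^layers
// requests against the slowest dependency.
func worstCaseCalls(attempts, layers int) int {
    calls := 1
    for i := 0; i < layers; i++ {
        calls *= attempts
    }
    return calls
}

// worstCaseCalls(3, 2) == 9   (API and Order Service both retrying)
// worstCaseCalls(3, 4) == 81  (add two more retrying layers)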


The Go Anti-Pattern: Blind Retries

This pattern exists in far too many codebases:

func callRemote() error {
    for i := 0; i < 3; i++ {
        if err := doRequest(); err == nil {
            return nil
        }
    }
    return errors.New("failed after retries")
}

Why This Fails in Production

  • retries happen immediately
  • no timeout awareness
  • no cancellation
  • all clients retry at the same time

Under load, this multiplies traffic exactly when capacity is lowest.


Rule #1: Every Retry Must Be Context-Aware

A retry loop that ignores context.Context is a goroutine leak generator.

Bad

for {
    err := doRequest()
    if err == nil {
        return nil
    }
    time.Sleep(time.Second)
}

Production-Safe Pattern

func retry(ctx context.Context, attempts int, fn func() error) error {
    var lastErr error
    for i := 0; i < attempts; i++ {
        if lastErr = fn(); lastErr == nil {
            return nil
        }
        if i == attempts-1 {
            break // last attempt failed; don't sleep before reporting it
        }

        select {
        case <-time.After(time.Second):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return fmt.Errorf("retries exhausted: %w", lastErr)
}

Now retries:

  • stop on request cancellation
  • stop on shutdown
  • don’t leak goroutines
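
A typical call site bounds the whole operation with a deadline, so the loop inherits both per-request cancellation and shutdown. A usage sketch, reusing the doRequest placeholder from the earlier example:

func handle(w http.ResponseWriter, r *http.Request) {
    // Bound the entire retry loop, not just a single attempt.
    ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
    defer cancel()

    if err := retry(ctx, 3, doRequest); err != nil {
        http.Error(w, "upstream unavailable", http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}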

Why Exponential Backoff Is Mandatory

Fixed delays don’t scale under congestion.

time.Sleep(100 * time.Millisecond) // ❌

If 10,000 clients sleep for the same duration, they wake up together.

Exponential Backoff Formula

sleep = min(cap, base × factor^attempt)

Idiomatic Go Implementation

func backoff(attempt int) time.Duration {
    base := 100 * time.Millisecond
    maxDelay := 5 * time.Second

    // 100ms, 200ms, 400ms, ... doubling on each attempt, capped at maxDelay.
    d := time.Duration(1<<attempt) * base
    if d <= 0 || d > maxDelay {
        // d <= 0 guards against shift overflow for very large attempt counts.
        return maxDelay
    }
    return d
}
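
Concretely, this produces the following schedule (a quick sanity check, not part of the production path):

for i := 0; i < 8; i++ {
    fmt.Printf("attempt %d → wait %v\n", i, backoff(i))
}
// attempt 0 → 100ms, 1 → 200ms, 2 → 400ms, ... 5 → 3.2s, 6+ → capped at 5s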

The Thundering Herd Problem (and Jitter)

Pure exponential backoff still synchronizes clients.

Add Jitter

// Full jitter: pick a random delay in [0, d) so clients desynchronize.
// Uses math/rand; rand.Int63n panics on zero, which backoff above never returns.
func jitter(d time.Duration) time.Duration {
    return time.Duration(rand.Int63n(int64(d)))
}
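
Full jitter can occasionally pick a near-zero delay. A common variant, often called equal jitter, keeps at least half of the backoff; a sketch:

// Equal jitter: wait at least d/2, plus a random share of the other half.
func equalJitter(d time.Duration) time.Duration {
    half := d / 2
    return half + time.Duration(rand.Int63n(int64(half)+1))
}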

Combined Retry Loop

func retryWithBackoff(ctx context.Context, attempts int, fn func() error) error {
    var lastErr error
    for i := 0; i < attempts; i++ {
        if lastErr = fn(); lastErr == nil {
            return nil
        }
        if i == attempts-1 {
            break // don't sleep after the final attempt
        }

        delay := jitter(backoff(i))

        select {
        case <-time.After(delay):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return fmt.Errorf("retries exhausted: %w", lastErr)
}

This spreads retries across time instead of creating traffic spikes.


Retry Only What Is Retryable

Retries are not free — retrying the wrong errors wastes capacity.

Practical Error Classification

func isRetryable(err error, resp *http.Response) bool {
    if errors.Is(err, context.DeadlineExceeded) {
        return true // per-attempt timeout; the parent context is checked by the retry loop
    }
    if resp == nil {
        return true // transport-level failure (connection reset, DNS, etc.)
    }
    // Server errors and 429 Too Many Requests are worth retrying; other 4xx are not.
    return resp.StatusCode >= 500 || resp.StatusCode == 429
}

Never retry:

  • validation errors
  • authentication failures
  • non-idempotent writes (unless protected)
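
Here is a sketch of how that classification plugs into a retry loop around an HTTP call. The helper name doWithRetry is hypothetical; it assumes context, fmt, io, net/http, and time are imported, and an idempotent request with no body:

func doWithRetry(ctx context.Context, client *http.Client, req *http.Request) error {
    const attempts = 5
    var lastErr error
    for i := 0; i < attempts; i++ {
        resp, err := client.Do(req.Clone(ctx))
        if resp != nil {
            io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
            resp.Body.Close()
        }
        if err == nil && resp.StatusCode < 400 {
            return nil
        }
        if err != nil {
            lastErr = err
        } else {
            lastErr = fmt.Errorf("unexpected status %d", resp.StatusCode)
        }
        if !isRetryable(err, resp) {
            return lastErr // permanent failure: don't burn capacity on retries
        }
        if i == attempts-1 {
            break
        }

        select {
        case <-time.After(jitter(backoff(i))):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return lastErr
}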

Retry Budgets: Capping Damage

A system-wide retry budget limits retry amplification.

Rule of thumb: retries should add no more than 10% traffic

Concept

  • 100 successful requests → earn 10 retry tokens
  • budget exhausted → fail fast

This ensures:

Total load ≤ 1.1 × normal traffic

Retry budgets prevent retries from becoming a denial-of-service vector.
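
A minimal sketch of the token idea above (the names are hypothetical; a real implementation would likely live in your client middleware and assumes math and sync are imported):

type RetryBudget struct {
    mu     sync.Mutex
    tokens float64
    ratio  float64 // tokens earned per success, e.g. 0.1 for the 10% rule
    max    float64 // cap so quiet periods can't bank unlimited retries
}

func NewRetryBudget(ratio, max float64) *RetryBudget {
    return &RetryBudget{ratio: ratio, max: max}
}

// RecordSuccess earns a fraction of a retry token.
func (b *RetryBudget) RecordSuccess() {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.tokens = math.Min(b.max, b.tokens+b.ratio)
}

// AllowRetry spends one token; when none are left, callers must fail fast.
func (b *RetryBudget) AllowRetry() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.tokens < 1 {
        return false
    }
    b.tokens--
    return true
}

Shared across all callers of a dependency, this keeps retry traffic at roughly ratio times the success rate, which is where the 1.1 × normal traffic bound comes from.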


Circuit Breakers: When to Stop Trying

Retries handle transient failure.
Circuit breakers handle systemic failure.

Proper Ordering

Retry → Circuit Breaker → Remote Call

If the breaker is open, retries must fail immediately.

State Machine

  • Closed: normal traffic
  • Open: fail fast
  • Half-Open: limited probe requests

Libraries like sony/gobreaker handle this efficiently and safely.
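
A sketch of that ordering with gobreaker, wrapping the breaker inside the retry loop so an open breaker fails fast. The settings are placeholders; check the library's current API before copying:

var breaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:    "inventory",
    Timeout: 10 * time.Second, // how long the breaker stays open before half-open probes
    ReadyToTrip: func(c gobreaker.Counts) bool {
        return c.ConsecutiveFailures >= 5
    },
})

func callInventory(ctx context.Context) error {
    return retryWithBackoff(ctx, 3, func() error {
        _, err := breaker.Execute(func() (interface{}, error) {
            return nil, doRequest() // the remote call from the earlier examples
        })
        // While the breaker is open, Execute returns gobreaker.ErrOpenState
        // immediately; classifying that error as non-retryable makes the
        // surrounding retries fail fast instead of queueing more attempts.
        return err
    })
}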


A Real Failure Story (Condensed)

During a flash sale:

  • Inventory service slowed down
  • Order service retried aggressively
  • traffic tripled
  • latency spiked from 50ms → 5s
  • autoscaling worsened the load
  • system stayed down after inventory recovered

After:

  • exponential backoff + jitter
  • retry budget (10%)
  • circuit breaker

Result:

  • traffic flattened
  • inventory recovered
  • degraded responses instead of outages

Go-Specific Pitfalls: time.After in Loops

Before Go 1.23, the timer allocated by time.After was not garbage collected until it fired, even if another case in the select had already won:

select {
case <-time.After(delay):
case <-ctx.Done():
    // the timer created by time.After lives on until delay elapses
}

In high-throughput retry loops, this caused memory pressure.

Safer Pattern

timer := time.NewTimer(delay)
defer timer.Stop()

select {
case <-timer.C:
case <-ctx.Done():
    return ctx.Err()
}

Go 1.23 improves timer GC, but explicit control is still best practice in hot paths.
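
Putting it together for a hot retry path: reuse one timer across attempts instead of allocating a new one per iteration. This is a sketch built on the backoff and jitter helpers above; the function name is illustrative:

func retryWithBackoffTimer(ctx context.Context, attempts int, fn func() error) error {
    var timer *time.Timer
    defer func() {
        if timer != nil {
            timer.Stop()
        }
    }()

    var lastErr error
    for i := 0; i < attempts; i++ {
        if lastErr = fn(); lastErr == nil {
            return nil
        }
        if i == attempts-1 {
            break
        }

        delay := jitter(backoff(i))
        if timer == nil {
            timer = time.NewTimer(delay)
        } else {
            // Safe to Reset: the previous iteration always drained timer.C
            // before reaching this point.
            timer.Reset(delay)
        }

        select {
        case <-timer.C:
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return fmt.Errorf("retries exhausted: %w", lastErr)
}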


Observability: Know When Retries Are Hurting You

Track:

  • retry count
  • retry delay histogram
  • retries / success ratio
  • circuit breaker state
  • goroutine count

A dangerous signal:

retries ≈ successful requests

That’s the edge of collapse.
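
A sketch of the first two signals using prometheus/client_golang (metric names are made up for illustration):

var (
    retriesTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "client_retries_total", Help: "Retry attempts."},
        []string{"target"},
    )
    retryDelay = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "client_retry_delay_seconds",
        Help:    "Backoff delay before each retry.",
        Buckets: prometheus.ExponentialBuckets(0.05, 2, 8), // 50ms .. 6.4s
    })
)

func init() {
    prometheus.MustRegister(retriesTotal, retryDelay)
}

// Inside the retry loop, just before sleeping:
//   retriesTotal.WithLabelValues("inventory").Inc()
//   retryDelay.Observe(delay.Seconds())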


Key Takeaways

  • Retries are load multipliers
  • Exponential backoff + jitter is mandatory
  • Context cancellation is non-negotiable
  • Retry budgets cap systemic damage
  • Circuit breakers prevent cascading failure
  • Unobserved retries silently kill systems

Resilience is not about retrying harder —
it’s about knowing when to stop.

Happy Coding! 🚀
