Philippe Gagneux

Posted on • Originally published at techaid.ch

Your Retry Config is Wrong (And So Was Mine)

On May 12, 2022, DoorDash went down for over three hours. Not because a database failed – because a database got slow. A routine latency spike in the order storage layer triggered retries. Those retries hit downstream services, which triggered their retries. Within minutes, what started as 50ms of added latency became a full retry storm: every service in the chain hammering every service below it, each one tripling the load on the next. The shared circuit breaker – designed to protect against exactly this – tripped and took out unrelated services that happened to share the same dependency. Three hours of downtime. All because every service had the same retry config: retries: 3.

DoorDash isn't alone. In December 2024, OpenAI went down for over four hours when a telemetry deploy caused every node in their largest clusters to execute resource-intensive Kubernetes API operations simultaneously – a thundering herd that overwhelmed the control plane and locked engineers out of recovery tools. Cloudflare had a similar feedback loop in 2025 involving Let's Encrypt rate limiting and their own retry logic. A 2022 OSDI paper studying metastable failures found that retry policy was the sustaining effect in half of the 22 incidents they analyzed.

The root cause in every case is the same: uniform retry configuration across a service chain.

I set out to find the optimal retry allocation. I found it, proved it mathematically, and then ran it on a real service chain. The math was right. The config was wrong. Here's what happened.

The multiplication problem

When I say "uniform retries are multiplicative, not additive," most engineers nod and move on. So let me be specific.

You have 8 services. Each retries 3 times on failure. Your mental model says: if the leaf service fails, you get 8 × 3 = 24 extra requests. That's wrong.

The actual number is 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 = 6,561.

Each retry at layer N triggers a full cascade of retries through layers N+1 to 8. The gateway retries 3 times. Each of those hits auth, which retries 3 times. Each of those hits orders, which retries 3 times. You're computing a product, not a sum.
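The two mental models differ by orders of magnitude. A two-line sanity check, using the toy chain depth and retry count from above:

```python
# Worst-case extra requests when the leaf service fails, for a chain
# of n services each configured with r retries.
def additive(n, r):
    return n * r       # the intuitive (wrong) mental model: retries add up

def multiplicative(n, r):
    return r ** n      # reality: each layer multiplies the layers below it

print(additive(8, 3))        # 24
print(multiplicative(8, 3))  # 6561
```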

Google's SRE team documented 64x amplification at just 3 layers deep. At Agoda, 8% of all production request volume during a slowdown was retry traffic. These aren't theoretical numbers – they're from production telemetry.

At 16 services, the theoretical ceiling is 3^16 ≈ 43 million. Nobody hits that number because circuit breakers trip first. But "your circuit breakers save you by killing your own services" is not the safety story you think it is.

Why nobody questions this

Istio's default retry config is attempts: 2, retryOn: "connect-failure,refused-stream,unavailable,cancelled,retriable-status-codes". Every VirtualService gets its own retry policy. Nothing in the Istio docs warns you about cross-service interaction.

The Google SRE book talks about retry budgets in Chapter 22, but every example uses uniform values. The mental model it builds is per-service: "this service should retry N times." Not "this service's retries multiply against every other service's retries."

Kubernetes and Istio docs show single-service retry config. Always. I've never seen an official example that shows a 5-service chain with retries: 3 on each one and a diagram of what happens when the leaf fails. The multiplicative explosion is invisible in docs because docs show one VirtualService at a time.

And it works fine in staging. Your staging environment has 2-3 services. 3^3 = 27. That's noise. The bomb only detonates in production, where you have 8-20 services deep and real traffic to amplify.

The math says: concentrate retries

I built a cost model with six components – reliability, amplification, cascade timing, latency, resonance interference, circuit breaker saturation – and ran a constrained optimizer across chain lengths from 4 to 128 services.

Three key principles fell out:

1. Retry volume is a product, not a sum.

V = r₁ × r₂ × ... × rₙ

Each layer multiplies the worst case for every layer below it.

2. Reliability has diminishing returns per layer.

If a service succeeds 95% of the time, one retry gives you 99.75%. A second gives 99.9875%. A third gives 99.9994%. Smaller gains, full multiplicative cost.
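The diminishing returns are easy to reproduce. A sketch, assuming independent attempts that each succeed with probability p:

```python
# Probability a call eventually succeeds with k retries, assuming
# independent attempts with per-attempt success probability p:
# 1 - (1 - p) ** (k + 1)
def reliability(p, retries):
    return 1 - (1 - p) ** (retries + 1)

for k in range(4):
    print(k, round(reliability(0.95, k) * 100, 4))
# 0 95.0 / 1 99.75 / 2 99.9875 / 3 99.9994
```

Each extra retry shaves off a smaller slice of the remaining failure probability, while the retry volume it can generate upstream stays multiplicative.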

3. Circuit breakers flip the equation.

When retry volume exceeds the CB threshold (~5-10 consecutive errors), the breaker trips for all requests. Retries past the CB threshold actively reduce reliability.

The optimizer kept landing on the same answer: for any chain of 8+ services, concentrate all retries on exactly 2 services and set everything else to r=1. Total volume: 12x instead of 6,561x. A 99.8% reduction.

The weird part: this allocation doesn't change when you add more services. I tested 8, 16, 32, 64, 128 services. Same answer every time. The first few positions get retries, everything after that gets r=1. A 512-dimensional optimization problem collapses to a 3-dimensional one.

I proved this analytically – the optimal vector "freezes" once the chain is long enough. Neat result. I was pretty pleased with myself.

Then I ran it on a real service chain and everything fell apart.

The experiment that broke the theory

I deployed 8 services as Docker containers on a VPS. Real TCP connections, real DNS resolution, real resource contention (64MB memory, 0.25 CPU per container). I injected failures: service 5 at 10% failure rate, service 7 at 5%, the rest at 1%. Then I sent 500 concurrent requests and compared my mathematically optimal config against the uniform default.

Normal load (1-10% failure rates):

| Metric | Uniform (r=3) | "Optimal" (r=[1,4,1,3,1,1,1,1]) | Delta |
| --- | --- | --- | --- |
| Success rate | 99.0% | 97.6% | -1.4% |
| Total retries | 84 | 102 | +21% |
| P99 latency | 385ms | 455ms | +70ms |

Stress (5-30% failure rates):

| Metric | Uniform (r=3) | "Optimal" | Delta |
| --- | --- | --- | --- |
| Success rate | 95.2% | 87.8% | -7.4% |
| Total retries | 420 | 487 | +16% |
| P99 latency | 476ms | 583ms | +107ms |

The "optimized" config was worse on every metric. More retries, not fewer. Lower success rate. Higher tail latency. Under stress, 7.4% more requests failed.

My cost model was minimizing the wrong thing.

Where the math goes wrong

The per-service metrics told the story. Under stress, here's where the retries landed:

| Service | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Uniform retries | 29 | 27 | 24 | 169 | 28 | 111 | 32 | 0 |
| "Optimal" retries | 0 | 116 | 0 | 371 | 0 | 0 | 0 | 0 |

The optimized config concentrated 76% of all retries at service 4. Service 4 sits upstream of service 5 (the 30%-failure bottleneck). Every time service 4 retries, it re-sends a request through services 5, 6, 7, and 8. That's 4 downstream hops per retry, through services that are already under stress.

The analytical model minimizes the product of retries (3^8 = 6,561 → 4×3 = 12). But in a real system, the cost of a retry depends on where in the chain it happens:

A retry at position 2 re-traverses 6 downstream services.
A retry at position 7 re-traverses 1 downstream service.

The retry at position 2 is 6x more expensive than the retry at position 7 – but the product-based model treats them identically.

The correct cost function isn't Π rᵢ. It's:

```
cost = Σᵢ rᵢ × (N − i) × fᵢ
```

Where (N - i) is the number of downstream hops and fᵢ is the failure rate at position i. Each retry is priced by how much downstream work it creates.
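The formula can be transcribed directly. The sketch below scores three allocations with it, using the injected normal-load failure rates (service 5 at 10%, service 7 at 5%, 1% elsewhere) as inputs; positions are 1-indexed as in the formula:

```python
# Position-weighted retry cost: each retry at position i is priced by
# the number of downstream hops (N - i) times the local failure rate f_i.
def retry_cost(retries, failure_rates):
    n = len(retries)
    return sum(r * (n - i) * f
               for i, (r, f) in enumerate(zip(retries, failure_rates), start=1))

# Failure rates from the normal-load experiment.
f = [0.01, 0.01, 0.01, 0.01, 0.10, 0.01, 0.05, 0.01]

print(round(retry_cost([3] * 8, f), 2))                   # uniform → 1.77
print(round(retry_cost([1, 4, 1, 3, 1, 1, 1, 1], f), 2))  # concentrated early → 0.85
print(round(retry_cost([1, 1, 1, 3, 1, 1, 1, 1], f), 2))  # next to the hotspot → 0.67
```

Under this cost function, moving the retry budget from position 2 to position 4 (adjacent to the failing service 5) is cheaper, because each retry traverses fewer downstream hops.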

What actually works

The core idea still holds: don't use uniform retries. The 6,561x multiplication problem is real. But the fix isn't "concentrate retries early." It's simpler:

Put your retries close to the failure, not upstream of it.

If service 5 fails 10% of the time, give service 5 a higher retry count – or the service immediately upstream of it (service 4). Don't give service 2 four retries when each retry traverses 6 hops through the failure zone.

The practical retry allocation:

  1. Identify your highest-failure-rate services. Look at rate(istio_requests_total{response_code=~"5.."}[5m]) per service.

  2. Give the retry budget to their immediate neighbors. The service directly upstream of a failure hotspot should get r=2 or 3. The service directly downstream should keep r=1 (it's the one failing – retrying into it from 6 hops away makes things worse).

  3. Everything far from the failure: r=1. Your gateway, your auth service, your API middleware – if they're not adjacent to a failure hotspot, they get r=1. Period.

  4. Never exceed a total product of ~20. Multiply all your retry values along the chain. If the product exceeds 20, you're past the circuit breaker saturation point and additional retries are pure cost.
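The four rules above can be sketched as a small allocator. The 5% hotspot threshold and the single budget value are illustrative assumptions, not from the original:

```python
from math import prod

# Give r=3 to the immediate upstream neighbor of each failure hotspot,
# r=1 everywhere else, and enforce the total-product ceiling of ~20.
def allocate_retries(failure_rates, hotspot_threshold=0.05, budget=3):
    n = len(failure_rates)
    retries = [1] * n
    for i, f in enumerate(failure_rates):
        if f >= hotspot_threshold and i > 0:
            retries[i - 1] = budget   # retry next to the failure, not far upstream
    assert prod(retries) <= 20, "retry product past circuit breaker saturation"
    return retries

# Service 5 (index 4) failing at 10%, service 7 (index 6) at 5%.
print(allocate_retries([0.01, 0.01, 0.01, 0.01, 0.10, 0.01, 0.05, 0.01]))
# → [1, 1, 1, 3, 1, 3, 1, 1]
```

The resulting product is 9, comfortably under the ~20 ceiling, versus 6,561 for uniform r=3.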

The Istio YAML for a service adjacent to a hotspot:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-processor  # directly upstream of the flaky payment service
spec:
  hosts:
    - order-processor
  http:
    - route:
        - destination:
            host: order-processor
      retries:
        attempts: 3
        perTryTimeout: 500ms
        retryOn: "5xx,reset,connect-failure,retriable-status-codes"
      timeout: 2s
```

Everything else:

```yaml
      retries:
        attempts: 1
        perTryTimeout: 500ms
        retryOn: "5xx,connect-failure"
      timeout: 1s
```

One thing that bit me: perTryTimeout also bounds the first attempt, and attempts counts retries on top of it. If you set attempts: 3 and perTryTimeout: 500ms, you're saying "up to 4 tries total, each with a 500ms budget." The outer timeout is the total wall clock for all attempts. Set it to perTryTimeout × (attempts + 1) as a ceiling – that's where the 2s above comes from.
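A back-of-the-envelope helper for that ceiling, assuming Istio's attempts counts retries in addition to the initial try (Envoy's num_retries semantics):

```python
# Total wall-clock ceiling for one request through a retrying hop:
# one initial try plus `attempts` retries, each bounded by per_try_ms.
def outer_timeout_ms(per_try_ms, attempts):
    return per_try_ms * (attempts + 1)

print(outer_timeout_ms(500, 3))  # 2000 — the 2s outer timeout in the hotspot YAML
print(outer_timeout_ms(500, 1))  # 1000 — the 1s outer timeout everywhere else
```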

Also: Istio retries stack with application-level retries. If your Go service has a retry loop in the HTTP client AND the VirtualService has retries configured, you're multiplying again. Audit both. kubectl get virtualservice -A -o yaml | grep -A5 retries gives you the mesh-level view. For app-level, search for retry libraries (go-retryablehttp, resilience4j, polly, tenacity).

The timeout tradeoff the model found

One finding from the cost model that I didn't expect.

Classical SRE wisdom says: gateway timeout must be greater than downstream timeout × retries. If your downstream has 500ms timeout and 4 retries, gateway needs at least 2.0s. Makes sense – don't give up while downstream retries are still running.

The cost model's optimal gateway timeout is 1.4s – below the cascade-consistent minimum of 2.0s.

Why? Two reasons. First, the model penalizes synchronized timeouts across services (they create correlated retry bursts). A 1.4s gateway timeout breaks the synchronization with the 0.5s downstream timeouts. Second, the 600ms saved per request reduces worst-case latency, and in the model, the latency reduction outweighs the cascade penalty from occasionally timing out before downstream retries complete.

I haven't load-tested this specific finding – the Docker experiment compared retry allocations, not timeout values. But the engineering logic is sound: under saturation, shorter gateway timeouts drop failing requests faster, freeing connections and reducing queue depth.

The practical version: keep cascade-consistent timeouts as default. But consider an adaptive threshold – when your 5xx rate crosses 10%, tighten the gateway timeout:

```bash
# Fraction of all requests returning 5xx over the last minute.
RATE=$(kubectl exec -n istio-system deploy/prometheus-server -- \
  promtool query instant http://localhost:9090 \
  'sum(rate(istio_requests_total{response_code=~"5.."}[1m])) / sum(rate(istio_requests_total[1m]))' \
  | awk '{print $3}')  # promtool prints: {} => VALUE @[timestamp]

if (( $(echo "$RATE > 0.10" | bc -l) )); then
  kubectl patch virtualservice api-gateway -n production --type merge \
    -p '{"spec":{"http":[{"timeout":"1400ms"}]}}'
fi
```

Not pretty. But it's better than holding failing requests until your gateway OOMs.

What to do Monday morning

Run this and look at the output:

```bash
kubectl get virtualservice -A -o yaml | grep -B10 -A5 "retries:"
```

Multiply all the attempts values along your longest call chain. If the product is over 50, you have a retry bomb. If it's over 200, you're one partial outage away from a DoorDash-style cascade.

Then:

  1. Find your highest-error-rate services. rate(istio_requests_total{response_code=~"5.."}[5m])
  2. Give their immediate upstream neighbor attempts: 2 or 3.
  3. Set everything else to attempts: 1.
  4. Keep the total product under 20.
  5. Deploy to a canary, watch retry volume drop. Compare end-to-end success rate.

That's it. No new infrastructure. No service mesh upgrade. A config change that takes 20 minutes.

The uniform retry config is wrong. My "optimal" config was also wrong. The actual answer is simpler than both: retries cost more the further they are from the failure. Put them close.


Methodology: Six-component cost model (reliability, amplification, cascade timing, latency, resonance, circuit breaker saturation). The freezing result is proven for feedforward chains with independent failures. Docker experiment: 8 Node.js containers, real TCP, 20 concurrent requests, Prometheus per service. Single run – take the exact percentages with a grain of salt, but the direction is consistent across configs.
