
Ken Imoto

When Retries Turn Hostile — How Control Logic Kills Production Systems

"Your retries are killing us."

A service team received this message from one of its dependencies during an outage. The dependency's API was timing out, so naturally, the client retried: 3 times, 5 times, 10 times. The client thought it was doing the right thing.

From the dependency's perspective, they were at half capacity due to the outage — and receiving several times the normal traffic. Retries were making the outage worse and preventing recovery.

This isn't a fable. In August 2012, Knight Capital's trading system activated legacy code (Power Peg) during a deployment, generating millions of orders over 45 minutes. Orders were never marked as "complete," so the system kept regenerating them. The feedback loop never closed. The structural result: an infinite re-execution loop with the same dynamics as a retry storm. $440 million lost, company effectively bankrupt.

Retries exist to survive failures. But when designed carelessly, retries become the failure.

Three Patterns of Self-Attack

Michael Nygard identified these in Release It! — patterns where production systems attack themselves.

Dogpile

The moment a cache expires, every client simultaneously hits the origin server. A service handling 100 requests/second suddenly receives thousands. The service recovers from the outage, only to be knocked down again by the stampede of queued requests.
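
One common mitigation is to stagger cache expiry so the whole fleet doesn't refresh at the same instant. A minimal sketch, assuming a cache client with a `set(key, value, ttl)` interface; the names and the 10% spread are illustrative, not from the original post:

```python
import random

BASE_TTL = 300  # 5-minute nominal cache lifetime

def cache_set_staggered(cache, key, value):
    # Spread expirations across +/-10% of the base TTL so every client
    # doesn't miss, and stampede the origin, at the same moment
    ttl = int(BASE_TTL * random.uniform(0.9, 1.1))
    cache.set(key, value, ttl)
```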

The moment after recovery is the most dangerous moment. I've seen this loop repeat until the on-call engineer's sanity fails before the server does.

Cascading Failures

Service A depends on B, B depends on C. When C slows down, B's threads block. When B's thread pool exhausts, A's requests back up too. One service's latency ripples through the entire dependency chain.

The nasty part: latency is worse than errors. Errors return fast and free up resources. Latency holds threads and connections hostage. As Nygard puts it, "slow responses are worse than no responses."

The Slow Response Trap

An HTTP client with a 30-second timeout calls a slow service. The thread is occupied for 30 seconds. Meanwhile, requests pile up and the thread pool drains.

Timeout too long: resources held hostage. Timeout too short: normal operations get killed. Getting the timeout value right is harder than it looks. I've heard "we just left it at the default" more times than I'd like to admit. I was guilty of it too.
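
As a concrete example, Python's requests library has no default timeout at all; it will wait indefinitely unless you pass one explicitly. A minimal sketch with separate connect and read timeouts (the URL and the values are illustrative):

```python
import requests

# (connect timeout, read timeout) in seconds: fail fast on unreachable hosts,
# and don't let one slow response hold a worker thread for half a minute
resp = requests.get("https://api.example.com/items", timeout=(3.05, 10))
```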

Three Principles of Safe Retry Design

Retries aren't evil. Thoughtless retries are.

But first, a prerequisite: the target API must be idempotent (sending the same request multiple times produces the same result). If you retry POST /orders three times and get three orders, no retry strategy will save you. That's not a joke — it happens.
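
One common way to make a write endpoint safe to retry is an idempotency key: the client attaches a unique key to each logical operation, and the server deduplicates on it (Stripe's Idempotency-Key header is the best-known example). A minimal sketch using the requests library; the endpoint and helper are illustrative:

```python
import uuid
import requests

def create_order(payload, idempotency_key):
    # The same key is sent on every retry, so duplicates collapse into one order
    return requests.post(
        "https://api.example.com/orders",  # placeholder endpoint
        json=payload,
        headers={"Idempotency-Key": idempotency_key},
        timeout=10,
    )

# Generate the key once per logical order, before entering any retry loop
key = str(uuid.uuid4())
```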

| Principle | What to do | What happens if you don't |
| --- | --- | --- |
| Exponential Backoff | Increase retry intervals: 1→2→4→8s | All clients retry simultaneously, forever |
| Jitter | Add random variance to backoff | Backoff waves synchronize, creating periodic spikes |
| Retry Budget | Cap total retry rate system-wide | Individual retries are rational; collectively, they're destructive |

Exponential Backoff Alone Isn't Enough

If every client starts retrying at the same time, they'll all hit 1s, 2s, 4s simultaneously. The backoff waves synchronize.

Jitter Breaks the Wave

```python
import random

def retry_with_jitter(attempt, base=1, max_delay=60):
    # Exponential backoff, capped at max_delay
    delay = min(base * (2 ** attempt), max_delay)
    # Full Jitter: wait a random amount between 0 and the capped delay
    return random.uniform(0, delay)
```

AWS's blog "Exponential Backoff And Jitter" (2015) recommends Full Jitter. It desynchronizes retry timing across clients.
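
Here's a hedged sketch of how that helper might sit inside a bounded retry loop; call_service, TimeoutError as the failure signal, and the cap of 5 attempts are illustrative choices, not from the AWS post:

```python
import time

MAX_ATTEMPTS = 5  # hard cap: a dead dependency can never trigger unbounded retries

def call_with_retries(call_service):
    for attempt in range(MAX_ATTEMPTS):
        try:
            return call_service()
        except TimeoutError:
            if attempt == MAX_ATTEMPTS - 1:
                raise  # budget exhausted: surface the failure instead of retrying forever
            time.sleep(retry_with_jitter(attempt))
```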

Retry Budget — Controlling Collective Behavior

"If retries exceed 20% of all requests in the last minute, stop issuing new retries."

From the Google SRE Handbook. Each client thinks its retry is rational. But when everyone retries simultaneously, the collective behavior is destructive. Same as traffic: one lane change is rational; everyone changing lanes at once makes the jam worse.
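
A minimal sketch of a per-client retry budget; the sliding window, the 10% ratio, and the class shape are illustrative:

```python
import time
from collections import deque

class RetryBudget:
    """Tracks the ratio of retries to total requests over a sliding window."""

    def __init__(self, ratio=0.10, window_seconds=60):
        self.ratio = ratio
        self.window = window_seconds
        self.events = deque()  # (timestamp, is_retry) pairs

    def _trim(self, now):
        # Drop events that have aged out of the window
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def record(self, is_retry):
        now = time.monotonic()
        self._trim(now)
        self.events.append((now, is_retry))

    def can_retry(self):
        now = time.monotonic()
        self._trim(now)
        total = len(self.events)
        retries = sum(1 for _, is_retry in self.events if is_retry)
        # Allow a retry only while retries stay under the budgeted fraction
        return total == 0 or retries / total < self.ratio
```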

5 Checkpoints for Production Debugging

When you suspect retries or control logic are causing an outage:

| # | Check | What to verify | Danger sign |
| --- | --- | --- | --- |
| 1 | Retry interval | Exponential backoff + jitter implemented? | Hardcoded `sleep(1)` |
| 2 | Retry limit | Maximum retry count set? | `while True` + retry |
| 3 | Timeout value | Not left at default? | No timeout, or >30s |
| 4 | Circuit breaker | Stops requests when dependency is down? | Sends all traffic during outage |
| 5 | Feedback loop | Completion correctly recorded? | Incomplete items get re-processed |

Knight Capital failed on #2 and #5. No order limit, no completion flag. Two missing checkpoints = $440M.
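
Checkpoint #4 deserves a sketch of its own, since it never appears in code above. A minimal circuit breaker might look like this; the thresholds and the simplified half-open probe are illustrative:

```python
import time

class CircuitBreaker:
    """Stops calling a failing dependency; probes it again after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: not calling the dependency")
            self.opened_at = None  # half-open: let one probe request through

        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0  # success closes the circuit again
            return result
```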

Why Control Logic Is Terrifying

Normal code runs millions of times a day — bugs surface quickly. But control logic — retries, timeouts, backoff, circuit breakers — only runs during outages. Outages are rare, so control logic bugs hide for months. When you finally need them, they don't work as expected.

The mechanism designed to survive failures becomes the mechanism that amplifies failures. That's the paradox. And the only way to test control logic during normal operations is to intentionally create failures — chaos engineering. It sounds contradictory, but that's the reality of production operations.
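
You don't need a full chaos-engineering program to start: the control logic itself can be unit-tested by injecting failures. A tiny pytest sketch, assuming the call_with_retries helper and MAX_ATTEMPTS cap sketched earlier:

```python
import time

def test_dead_dependency_sees_bounded_calls(monkeypatch):
    monkeypatch.setattr(time, "sleep", lambda _: None)  # skip real backoff waits in tests
    calls = []

    def always_times_out():
        calls.append(1)
        raise TimeoutError

    try:
        call_with_retries(always_times_out)
    except TimeoutError:
        pass

    # A dependency that never recovers should see exactly MAX_ATTEMPTS calls, not a storm
    assert len(calls) == MAX_ATTEMPTS
```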

Quick Audit

Run this in your codebase right now:

```bash
grep -rn "retry\|retries\|max_attempts\|backoff\|jitter" src/
```

If you don't find explicit backoff, jitter, and retry limits, your production system has the same structural vulnerability as Knight Capital's.


Appendix: Retry Debug Skill (Copy-Paste Ready)

Drop this into your CLAUDE.md or AI agent skill file. It runs the 5-checkpoint audit when you suspect retry-related issues in production.

```markdown
# Retry & Control Logic Debug Skill

## Rule
Do not propose fixes until all 5 checkpoints are verified.

## Checkpoints (run in order)
1. **Retry interval**: Is exponential backoff + jitter implemented? Flag hardcoded `sleep(1)`
2. **Retry limit**: Is a max retry count set? Flag `while True` + retry
3. **Timeout value**: Is it explicitly set (not default)? Flag unset or >30s
4. **Circuit breaker**: Does the system stop requests when dependency is down?
5. **Feedback loop**: Is completion correctly recorded? Flag items that get re-processed without completion marks

## Detection commands
    grep -rn "retry\|retries\|max_attempts\|backoff\|jitter" src/
    grep -rn "timeout\|TIMEOUT\|time_out" src/
    grep -rn "circuit\|breaker\|CircuitBreaker" src/

## Verdict
- All 5 explicit → safe
- 1-2 missing → recommend fix (report which)
- 3+ missing or no retry limit → critical (Knight Capital-class risk)

## Prerequisite
Confirm target API is idempotent before approving any retry design.
```

References

  • Michael Nygard, Release It! (2007, 2018 2nd Edition)
  • Google, Site Reliability Engineering (the SRE Book, 2016), Chapter 22: "Addressing Cascading Failures"
  • AWS Architecture Blog, "Exponential Backoff And Jitter" (2015)
  • SEC Filing: Knight Capital Group, Form 10-Q (2012)
