At DailyWatch we fetch tens of thousands of video metadata records per day from upstream APIs. The realities of dealing with quota limits, transient 5xx errors, DNS hiccups, and rate-limited keys meant our naive while-loop retry pattern was costing us availability and engineering attention every week. This post walks through the retry architecture we converged on — a system that classifies failures, backs off intelligently, rotates credentials, and dead-letters what it cannot handle.
Classify Every Failure Before Retrying
Not every error deserves a retry. Hammering an upstream after a 401 just gets the key banned faster, and retrying a 404 is wasted compute. Before any backoff decision, we bucket failures into three categories:
- Transient: 429, 500, 502, 503, 504, network timeouts, DNS resolution failures
- Permanent: 400, 401, 403, 404, malformed payloads, schema mismatches
- Quota-related: 403 with a quotaExceeded reason, treated as transient but triggering key rotation rather than backoff
Here is the classifier from our PHP fetcher:
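/** Bucket a failed response as 'transient', 'quota', or 'permanent'. */
public function classifyFailure(int $status, ?string $body): string {
    // Rate limiting and server-side errors are worth retrying with backoff.
    if (in_array($status, [429, 500, 502, 503, 504], true)) {
        return 'transient';
    }
    // A 403 carrying quotaExceeded means the key is drained, not forbidden.
    if ($status === 403 && $body && str_contains($body, 'quotaExceeded')) {
        return 'quota';
    }
    // Every other 4xx will not succeed on retry.
    if ($status >= 400 && $status < 500) {
        return 'permanent';
    }
    // Timeouts, DNS failures, and anything unrecognized default to transient.
    return 'transient';
}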
Permanent failures go straight to a dead-letter table with the error reason. Quota failures rotate the API key and re-queue the request immediately. Only transient failures enter the backoff loop.
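The routing rule above can be sketched as a small dispatcher. This is a minimal Python illustration rather than our production code: classify_failure is a direct port of the PHP classifier, while Action and route_failure are hypothetical names introduced only for this example.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    DEAD_LETTER = "dead_letter"  # permanent: record the failure and stop
    ROTATE_KEY = "rotate_key"    # quota: switch keys and re-queue immediately
    BACKOFF = "backoff"          # transient: enter the backoff loop


def classify_failure(status: int, body: Optional[str]) -> str:
    """Python port of the PHP classifier shown earlier."""
    if status in (429, 500, 502, 503, 504):
        return "transient"
    if status == 403 and body and "quotaExceeded" in body:
        return "quota"
    if 400 <= status < 500:
        return "permanent"
    return "transient"  # timeouts, DNS failures, anything unrecognized


def route_failure(status: int, body: Optional[str]) -> Action:
    """Map a failed response to the next step in the pipeline."""
    category = classify_failure(status, body)
    if category == "permanent":
        return Action.DEAD_LETTER
    if category == "quota":
        return Action.ROTATE_KEY
    return Action.BACKOFF
```

Keeping the classification and the routing in separate functions means the classifier can be unit-tested against recorded upstream responses without touching queues or key pools.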
Exponential Backoff With Decorrelated Jitter
The classic doubling delay — 1s, 2s, 4s, 8s — creates thundering herds when many workers hit the same upstream at the same instant. Decorrelated jitter, popularized by the AWS Architecture Blog, picks the next delay uniformly between the base and three times the previous delay, then caps it. Each retry lands at a random offset instead of clustering at predictable doubling boundaries:
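import (
	"math/rand"
	"time"
)

// nextDelay draws a decorrelated-jitter delay from [base, min(cap, 3*prev)).
func nextDelay(prev, base, cap time.Duration) time.Duration {
	minD := base
	maxD := time.Duration(float64(prev) * 3.0)
	if maxD > cap {
		maxD = cap
	}
	span := int64(maxD - minD)
	if span <= 0 {
		// Degenerate range (3*prev <= base, or cap <= base): do not jump to the full cap.
		if minD < cap {
			return minD
		}
		return cap
	}
	return minD + time.Duration(rand.Int63n(span))
}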
We use base=500ms and cap=60s. Six retries give a worst-case budget of roughly three minutes of waiting. Beyond that, the upstream is genuinely broken rather than transiently busy, and the request belongs in the dead-letter queue.
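To make the budget arithmetic concrete, here is a small sketch in Python (chosen only to keep the added examples in one language). It assumes the first delay is seeded with prev = base, which is an assumption about how the loop is started; worst_case_budget and simulated_budget are hypothetical helper names.

```python
import random


def next_delay(prev: float, base: float, cap: float, rng: random.Random) -> float:
    """Decorrelated jitter: uniform between base and min(cap, 3 * prev)."""
    upper = min(cap, prev * 3.0)
    if upper <= base:
        return min(base, cap)  # degenerate range, nothing to randomize
    return rng.uniform(base, upper)


def worst_case_budget(retries: int, base: float, cap: float) -> float:
    """Sum of the largest delay each retry could possibly draw."""
    total, prev = 0.0, base  # assumption: the first delay is seeded with base
    for _ in range(retries):
        prev = min(cap, prev * 3.0)  # upper bound of this retry's range
        total += prev
    return total


def simulated_budget(retries: int, base: float, cap: float, rng: random.Random) -> float:
    """Total wait for one randomly drawn retry sequence."""
    total, prev = 0.0, base
    for _ in range(retries):
        prev = next_delay(prev, base, cap, rng)
        total += prev
    return total


print(worst_case_budget(6, 0.5, 60.0))  # 180.0
```

With base=0.5 and cap=60, the six upper bounds are 1.5, 4.5, 13.5, 40.5, 60, and 60 seconds, which sum to 180 seconds. A randomly drawn sequence usually totals far less, so three minutes is a ceiling rather than a typical wait.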
Key Rotation as a Self-Healing Primitive
YouTube Data API quotas reset at midnight Pacific. Running three keys with disjoint quotas keeps the fetcher productive for roughly 36 hours even if one key is fully drained mid-day. The pool tracks each key's cooldown deadline and a rolling failure count. When a request comes back with quotaExceeded, we put that key on cooldown until the next reset and route subsequent work to a viable peer:
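import time


class KeyPool:
    """Hands out the healthiest API key that is not on cooldown."""

    def __init__(self, keys: list[str]):
        self.keys = [{"key": k, "cooldown_until": 0, "fails": 0} for k in keys]

    def acquire(self) -> str | None:
        """Return the viable key with the fewest failures, or None if all are cooling down."""
        now = time.time()
        viable = [k for k in self.keys if k["cooldown_until"] <= now]
        if not viable:
            return None
        return min(viable, key=lambda k: k["fails"])["key"]

    def report_quota_exceeded(self, key: str) -> None:
        for k in self.keys:
            if k["key"] == key:
                # next_midnight_pacific() is assumed to return the next
                # quota reset as a Unix timestamp.
                k["cooldown_until"] = next_midnight_pacific()
                k["fails"] += 1

    def report_success(self, key: str) -> None:
        # Decay the failure count so one bad day does not down-weight a key forever.
        for k in self.keys:
            if k["key"] == key and k["fails"] > 0:
                k["fails"] -= 1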
This is the self-healing piece: the fetcher recovers from a per-key quota wipeout without human intervention, and the success reporter prevents a key from being permanently down-weighted after one bad day.
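The pool relies on a next_midnight_pacific() helper that the snippet does not define. One possible sketch, assuming the cooldown is stored as a Unix timestamp and using the standard-library zoneinfo module (Python 3.9+; platforms without a system time zone database also need the tzdata package). The optional now parameter is an addition for testability, not part of the original call.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def next_midnight_pacific(now: Optional[datetime] = None) -> float:
    """Return the Unix timestamp of the next 00:00 in US Pacific time."""
    if now is None:
        now = datetime.now(PACIFIC)
    else:
        now = now.astimezone(PACIFIC)  # expects a timezone-aware datetime
    tomorrow = now.date() + timedelta(days=1)
    # Build midnight in Pacific time so daylight saving shifts are handled.
    midnight = datetime(tomorrow.year, tomorrow.month, tomorrow.day, tzinfo=PACIFIC)
    return midnight.timestamp()


sample = datetime(2024, 1, 15, 13, 0, tzinfo=timezone.utc)  # 05:00 PST
reset = next_midnight_pacific(sample)
print(datetime.fromtimestamp(reset, tz=timezone.utc).isoformat())  # 2024-01-16T08:00:00+00:00
```

Computing the reset from the calendar date, rather than adding a fixed number of hours, keeps the cooldown correct on the two days a year when the Pacific day is 23 or 25 hours long.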
Circuit Breakers and Dead Letter Queues
A retry loop without a circuit breaker turns a 30-minute upstream outage into a 30-minute crash loop. We wrap each upstream endpoint in a breaker with three states:
- Closed: normal traffic flows through
- Open: all requests fail-fast for the cooldown window (we use 60s)
- Half-open: one probe request per minute; success closes the breaker, failure reopens it
The half-open probe matters more than people give it credit for. Without it, the breaker either flips back to closed prematurely and floods a partially recovered upstream, or stays open until a human notices.
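The three states can be sketched as a small class. This is an illustrative Python version under stated assumptions, not our production breaker: the failure_threshold trip point and the injectable clock are hypothetical details added for the example, while the 60-second cooldown and single half-open probe follow the description above.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Three-state breaker: closed -> open -> half_open -> closed or open."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical trip point
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Return True if the caller may send a request right now."""
        if self.state == "closed":
            return True
        if self.state == "open" and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = "half_open"  # cooldown elapsed: let one probe through
            return True
        return False  # still open, or a probe is already in flight

    def record_success(self) -> None:
        """A success (including a successful probe) closes the breaker."""
        self.state = "closed"
        self.failures = 0

    def record_failure(self) -> None:
        """A failed probe, or too many failures while closed, opens the breaker."""
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Because a failed probe resets opened_at, the next probe is only allowed after another full cooldown, which yields the one-probe-per-minute behavior with a 60-second window.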
Permanent failures, along with requests that exhaust their retry budget, get serialized to a dead_letter SQLite table with the request payload, error class, and timestamp. A daily job re-evaluates dead-letter entries: sometimes a deleted video reappears, and sometimes the upstream starts returning fields our parser previously choked on. Either way, the fetch pipeline never silently loses work.
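A minimal sketch of such a table, using Python's built-in sqlite3 module. The column names, the reason column, and the helper functions are assumptions made for illustration; only the table name and the three stored facts (payload, error class, timestamp) come from the description above.

```python
import json
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS dead_letter (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    payload     TEXT NOT NULL,
    error_class TEXT NOT NULL,
    reason      TEXT,
    created_at  REAL NOT NULL
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the dead-letter database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def dead_letter(conn: sqlite3.Connection, payload: dict,
                error_class: str, reason: str) -> None:
    """Persist one failed request so it is never silently dropped."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO dead_letter (payload, error_class, reason, created_at) "
            "VALUES (?, ?, ?, ?)",
            (json.dumps(payload), error_class, reason, time.time()),
        )


def entries_since(conn: sqlite3.Connection, cutoff: float) -> list:
    """Entries a daily re-evaluation job could pick up."""
    rows = conn.execute(
        "SELECT id, payload, error_class FROM dead_letter "
        "WHERE created_at >= ? ORDER BY id",
        (cutoff,),
    ).fetchall()
    return [(row_id, json.loads(payload), cls) for row_id, payload, cls in rows]
```

Storing the payload as JSON keeps the table schema stable even when the shape of a request changes, at the cost of not being able to query individual payload fields directly.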
Observability: The Feedback Loop
Self-healing is meaningless if you cannot see it healing. We emit one structured log line per fetch attempt with these fields:
- endpoint: which upstream was called
- attempt: 1-indexed retry counter
- status: HTTP status or local error class
- key_id: last four chars of the API key used
- latency_ms: wall time of the request
- outcome: success, transient, permanent, quota, or circuit_open
These roll up into a 15-minute dashboard showing transient-failure rate, mean retries per success, and circuit-open minutes per endpoint. A spike in mean-retries-per-success is the leading indicator that an upstream is degrading — we usually catch it 20 minutes before it would cause user-visible gaps in the catalog.
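As an illustration of how one log line and one dashboard metric might fit together, here is a short Python sketch. The JSON encoding, the function names, and the exact definition of mean retries per success (the attempt number of each successful line minus one, averaged) are assumptions made for this example.

```python
import json
from typing import Iterable, Union


def log_attempt(endpoint: str, attempt: int, status: Union[int, str],
                api_key: str, latency_ms: float, outcome: str) -> str:
    """Serialize one fetch attempt as a single JSON log line."""
    record = {
        "endpoint": endpoint,
        "attempt": attempt,            # 1-indexed
        "status": status,              # HTTP status or local error class
        "key_id": api_key[-4:],        # never log the full key
        "latency_ms": round(latency_ms, 1),
        "outcome": outcome,
    }
    return json.dumps(record, sort_keys=True)


def mean_retries_per_success(lines: Iterable[str]) -> float:
    """Average number of retries that preceded each successful fetch."""
    retries = []
    for line in lines:
        record = json.loads(line)
        if record["outcome"] == "success":
            retries.append(record["attempt"] - 1)
    return sum(retries) / len(retries) if retries else 0.0


window = [
    log_attempt("videos.list", 1, 503, "example-key-1234", 812.4, "transient"),
    log_attempt("videos.list", 2, 200, "example-key-1234", 140.0, "success"),
    log_attempt("videos.list", 1, 200, "example-key-5678", 95.2, "success"),
]
print(mean_retries_per_success(window))  # 0.5
```

In this sample window, one success needed a single retry and the other needed none, so the metric is 0.5. A sustained rise in that number means fetches are still succeeding but only after more work, which is why it moves before the failure rate does.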
What We Stopped Doing
Things that looked smart on paper but caused us real pain in production:
- Fixed-interval retries, which guarantee synchronized retry storms across workers
- Infinite retry loops, which mask broken endpoints and inflate compute cost
- Catch-all "except Exception: retry" handlers, which swallow real bugs into noisy logs
- Retrying permanent 4xx errors, which only burns quota and earns rate-limit penalties
- A single shared API key, where one quota event takes the whole pipeline down
The combined effect: fetch success rate moved from 94.1% to 99.7% over a quarter, and on-call pages for fetch failures dropped to roughly one per month — almost always a genuine upstream incident rather than something engineering could have prevented with better code.