Building a Self-Healing Video URL Fetch Retry Strategy in Production

#backend #php #python #reliability

At DailyWatch we fetch tens of thousands of video metadata records per day from upstream APIs. The realities of dealing with quota limits, transient 5xx errors, DNS hiccups, and rate-limited keys meant our naive while-loop retry pattern was costing us availability and engineering attention every week. This post walks through the retry architecture we converged on — a system that classifies failures, backs off intelligently, rotates credentials, and dead-letters what it cannot handle.

Classify Every Failure Before Retrying

Not every error deserves a retry. Hammering an upstream after a 401 just gets the key banned faster, and retrying a 404 is wasted compute. Before any backoff decision, we bucket failures into three categories:

Transient: 429, 500, 502, 503, 504, network timeouts, DNS resolution failures
Permanent: 400, 401, 403 (without a quotaExceeded reason), 404, malformed payloads, schema mismatches
Quota-related: 403 with a quotaExceeded reason. It is retried like a transient failure, but it triggers key rotation rather than backoff

Here is the classifier from our PHP fetcher:

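public function classifyFailure(int $status, ?string $body): string {
    // Rate limiting and server-side errors are worth retrying with backoff.
    if (in_array($status, [429, 500, 502, 503, 504], true)) {
        return 'transient';
    }
    // Must run before the generic 4xx check, otherwise a quota 403
    // would be dead-lettered instead of triggering key rotation.
    if ($status === 403 && $body && str_contains($body, 'quotaExceeded')) {
        return 'quota';
    }
    // Every other client error is permanent: retrying cannot fix it.
    if ($status >= 400 && $status < 500) {
        return 'permanent';
    }
    // Anything else (unlisted 5xx, or non-HTTP failures such as timeouts
    // and DNS errors, however the caller encodes them) is treated as retryable.
    return 'transient';
}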

Permanent failures go straight to a dead-letter table with the error reason. Quota failures rotate the API key and re-queue the request immediately. Only transient failures enter the backoff loop.
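
The routing rule above can be sketched as a small dispatcher. This is an illustrative Python sketch rather than the fetcher's actual code: `route_failure` and its three callback parameters are hypothetical names, and `classify_failure` is a straight port of the PHP classifier.

```python
from typing import Callable, Optional


def classify_failure(status: int, body: Optional[str]) -> str:
    """Python port of the PHP classifier shown above."""
    if status in (429, 500, 502, 503, 504):
        return "transient"
    if status == 403 and body and "quotaExceeded" in body:
        return "quota"
    if 400 <= status < 500:
        return "permanent"
    return "transient"


def route_failure(
    status: int,
    body: Optional[str],
    *,
    dead_letter: Callable[[str], None],    # hypothetical: store the request with a reason
    rotate_key: Callable[[], None],        # hypothetical: swap key and re-queue immediately
    schedule_backoff: Callable[[], None],  # hypothetical: enter the backoff loop
) -> str:
    """Send a failed request down exactly one of the three paths."""
    kind = classify_failure(status, body)
    if kind == "permanent":
        dead_letter(f"HTTP {status}")
    elif kind == "quota":
        rotate_key()
    else:
        schedule_backoff()
    return kind
```

Passing the three actions in as callbacks keeps the classification logic free of queue and database dependencies, which makes it easy to unit-test.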

Exponential Backoff With Decorrelated Jitter

The classic doubling delay — 1s, 2s, 4s, 8s — creates thundering herds when many workers hit the same upstream at the same instant. Decorrelated jitter, popularized by the AWS Architecture Blog, picks the next delay uniformly between the base and three times the previous delay, then caps it. Each retry lands at a random offset instead of clustering at predictable doubling boundaries:

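import (
    "math/rand"
    "time"
)

// nextDelay returns a decorrelated-jitter delay drawn uniformly from
// [base, min(cap, 3*prev)). Seed prev with base on the first retry.
func nextDelay(prev, base, cap time.Duration) time.Duration {
    minD := base
    maxD := time.Duration(float64(prev) * 3.0)
    if maxD > cap {
        maxD = cap
    }
    span := int64(maxD - minD)
    if span <= 0 {
        // Empty range (for example prev == 0): fall back to the minimum
        // delay instead of jumping straight to the cap.
        return minD
    }
    return minD + time.Duration(rand.Int63n(span))
}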

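We use base=500ms and cap=60s. Assuming the first delay is seeded with prev = base, six retries wait at most about three minutes in total (upper bounds of 1.5s, 4.5s, 13.5s, 40.5s, 60s, and 60s), and usually far less. Beyond that budget, the upstream is genuinely broken rather than transiently busy, and the request belongs in the dead-letter queue.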
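
One way to sanity-check the budget figure is to compute the worst case directly and compare it with simulated runs. The sketch below is in Python rather than Go, works in seconds, and assumes the first delay is seeded with `prev = base`; the function names are illustrative.

```python
import random


def next_delay(prev: float, base: float, cap: float, rng: random.Random) -> float:
    """Decorrelated jitter in seconds: uniform in [base, min(cap, 3 * prev)]."""
    upper = min(cap, prev * 3.0)
    if upper <= base:
        return base
    return rng.uniform(base, upper)


def worst_case_total(retries: int, base: float, cap: float) -> float:
    """Total wait if every delay lands on its upper bound."""
    total, prev = 0.0, base
    for _ in range(retries):
        prev = min(cap, prev * 3.0)
        total += prev
    return total


def simulated_total(retries: int, base: float, cap: float, rng: random.Random) -> float:
    """Total wait for one randomly drawn retry sequence."""
    total, prev = 0.0, base
    for _ in range(retries):
        prev = next_delay(prev, base, cap, rng)
        total += prev
    return total


if __name__ == "__main__":
    print(worst_case_total(6, 0.5, 60.0))  # 180.0 seconds, i.e. three minutes
    rng = random.Random(42)
    runs = [simulated_total(6, 0.5, 60.0, rng) for _ in range(10_000)]
    print(sum(runs) / len(runs))  # typical total, well below the worst case
```

The gap between the two numbers is the point of the jitter: the three-minute figure is a ceiling, not the expected wait.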

Key Rotation as a Self-Healing Primitive

YouTube Data API quotas reset at midnight Pacific. Running three keys with disjoint quotas (in practice, keys from separate projects, since quota is counted per project) keeps the fetcher productive until the next reset even if one key is fully drained mid-day. The pool tracks each key's cooldown deadline and a rolling failure count. When a key returns quotaExceeded, we put it on cooldown until the next reset and route subsequent work to a viable peer:

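import time
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def next_midnight_pacific() -> float:
    # Assumed implementation; the original helper is not shown in this post.
    # Returns the Unix timestamp of the next midnight in US Pacific time.
    tomorrow = (datetime.now(PACIFIC) + timedelta(days=1)).date()
    return datetime.combine(tomorrow, datetime.min.time(), tzinfo=PACIFIC).timestamp()


class KeyPool:
    def __init__(self, keys: list[str]):
        self.keys = [{"key": k, "cooldown_until": 0, "fails": 0} for k in keys]

    def acquire(self) -> str | None:
        # Skip keys that are cooling down; prefer the one with fewest failures.
        now = time.time()
        viable = [k for k in self.keys if k["cooldown_until"] <= now]
        if not viable:
            return None
        return min(viable, key=lambda k: k["fails"])["key"]

    def report_quota_exceeded(self, key: str) -> None:
        for k in self.keys:
            if k["key"] == key:
                k["cooldown_until"] = next_midnight_pacific()
                k["fails"] += 1

    def report_success(self, key: str) -> None:
        # Decay the failure count so one bad day does not stick forever.
        for k in self.keys:
            if k["key"] == key and k["fails"] > 0:
                k["fails"] -= 1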
This is the self-healing piece: the fetcher recovers from a per-key quota wipeout without human intervention, and the success reporter prevents a key from being permanently down-weighted after one bad day.

Circuit Breakers and Dead Letter Queues

A retry loop without a circuit breaker turns a 30-minute upstream outage into a 30-minute crash loop. We wrap each upstream endpoint in a breaker with three states:

Closed: normal traffic flows through
Open: all requests fail-fast for the cooldown window (we use 60s)
Half-open: one probe request per minute; success closes the breaker, failure reopens it

The half-open probe matters more than people give it credit for. Without it, the breaker either flips back to closed prematurely and floods a partially recovered upstream, or stays open until a human notices.
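
A minimal sketch of the three-state breaker, in Python. The post does not say what trips the breaker, so the consecutive-failure threshold here is an assumption, as are the class and method names. The clock is injectable so the state machine can be tested without sleeping.

```python
import time
from typing import Callable


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,   # assumed trip condition, not from the post
        cooldown: float = 60.0,       # open-state window in seconds
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Return True if the caller may send a request right now."""
        if self.state == "closed":
            return True
        if self.state == "open" and self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: admit exactly one probe request.
            self.state = "half_open"
            return True
        # Still cooling down, or a probe is already in flight.
        return False

    def record_success(self) -> None:
        """A success (including a successful probe) closes the breaker."""
        self.state = "closed"
        self.failures = 0

    def record_failure(self) -> None:
        """A failed probe, or too many failures in a row, opens the breaker."""
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

Because a failed probe restarts the cooldown, a 60-second window yields at most one probe per minute while the upstream stays down.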

Permanent failures, along with transient ones that exhaust their retry budget, get serialized to a dead_letter SQLite table with the request payload, error class, and timestamp. A daily job re-evaluates dead-letter entries: sometimes a deleted video reappears, and sometimes the upstream starts returning fields whose absence previously made our parser choke. Either way, the fetch pipeline never silently loses work.
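
A sketch of what such a table and its two operations might look like with Python's built-in `sqlite3` module. Only the table name and the three stored fields come from the description above; the column names, the JSON encoding of the payload, and the function names are assumptions for illustration.

```python
import json
import sqlite3
import time
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS dead_letter (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    payload     TEXT NOT NULL,   -- JSON-encoded request
    error_class TEXT NOT NULL,
    created_at  REAL NOT NULL    -- Unix timestamp
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the dead-letter store."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def dead_letter(
    conn: sqlite3.Connection,
    payload: dict,
    error_class: str,
    now: Optional[float] = None,
) -> int:
    """Persist a failed request and return its row id."""
    created_at = time.time() if now is None else now
    cur = conn.execute(
        "INSERT INTO dead_letter (payload, error_class, created_at) VALUES (?, ?, ?)",
        (json.dumps(payload), error_class, created_at),
    )
    conn.commit()
    return cur.lastrowid


def due_for_review(conn: sqlite3.Connection, older_than: float) -> list:
    """Entries the daily job should re-evaluate, oldest first."""
    rows = conn.execute(
        "SELECT id, payload, error_class FROM dead_letter "
        "WHERE created_at <= ? ORDER BY id",
        (older_than,),
    ).fetchall()
    return [(row_id, json.loads(p), err) for row_id, p, err in rows]
```

Storing the full request payload is what makes the daily re-evaluation possible: the job can replay each entry without reconstructing it from logs.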

Observability: The Feedback Loop

Self-healing is meaningless if you cannot see it healing. We emit one structured log line per fetch attempt with these fields:

endpoint — which upstream was called
attempt — 1-indexed retry counter
status — HTTP status or local error class
key_id — last four chars of the API key used
latency_ms — wall time of the request
outcome — success, transient, permanent, quota, or circuit_open

These roll up into a 15-minute dashboard showing transient-failure rate, mean retries per success, and circuit-open minutes per endpoint. A spike in mean-retries-per-success is the leading indicator that an upstream is degrading — we usually catch it 20 minutes before it would cause user-visible gaps in the catalog.
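
To make the log format and the headline metric concrete, here is a small Python sketch. The field names follow the list above; the JSON encoding, the function names, and the exact definition of mean retries per success (retries counted on successful attempts only) are assumptions.

```python
import json
from typing import Iterable


def log_line(endpoint: str, attempt: int, status, api_key: str,
             latency_ms: int, outcome: str) -> str:
    """Build one structured log line for a single fetch attempt."""
    return json.dumps(
        {
            "endpoint": endpoint,
            "attempt": attempt,          # 1-indexed
            "status": status,            # HTTP status or local error class
            "key_id": api_key[-4:],      # never log the full key
            "latency_ms": latency_ms,
            "outcome": outcome,
        },
        sort_keys=True,
    )


def mean_retries_per_success(lines: Iterable[str]) -> float:
    """Average number of retries that preceded each successful attempt."""
    records = [json.loads(line) for line in lines]
    successes = [r for r in records if r["outcome"] == "success"]
    if not successes:
        return 0.0
    # attempt is 1-indexed, so a first-try success counts as 0 retries.
    return sum(r["attempt"] - 1 for r in successes) / len(successes)
```

Computing the metric from the same log lines that operators read keeps the dashboard and the raw evidence in agreement.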

What We Stopped Doing

Things that looked smart on paper but caused us real pain in production:

Fixed-interval retries — guarantees synchronized retry storms across workers
Infinite retry loops — masks broken endpoints and inflates compute cost
Catch-all except Exception: retry — swallows real bugs into noisy logs
Retrying on permanent 4xx (anything other than 429 or a quota 403) — only burns quota and earns rate-limit penalties
Single shared API key — one quota event takes the whole pipeline down

The combined effect: fetch success rate moved from 94.1% to 99.7% over a quarter, and on-call pages for fetch failures dropped to roughly one per month — almost always a genuine upstream incident rather than something engineering could have prevented with better code.