Building a Reliable Webhook Delivery System: What Actually Broke and How I Fixed It

#api #backend #python #systemdesign

Webhooks seem simple until a worker crashes mid-delivery, a subscriber goes down for an hour, or a payload gets tampered with in transit.

Here's what I built to handle that, using FastAPI, PostgreSQL, and Redis.

The core problems I solved:

1. Synchronous delivery blocks everything
The naive approach calls the subscriber URL inline, so one slow endpoint stalls your whole ingest path. Fix: persist the event, return 202 Accepted immediately, and deliver asynchronously.
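As a rough, framework-free sketch of that flow (not the production code): the handler stores the event, enqueues its id, and answers with 202, while a background worker does the slow part. The in-memory `EVENTS` dict and `asyncio.Queue` are stand-ins assumed here for the PostgreSQL table and the real job queue, and `ingest`, `worker`, and `deliver` are illustrative names.

```python
import asyncio
import uuid

# Assumed in-memory stand-in for the PostgreSQL events table.
EVENTS = {}

async def ingest(payload, queue):
    """Persist the event, enqueue it, and answer right away with 202."""
    event_id = str(uuid.uuid4())
    EVENTS[event_id] = {"payload": payload, "status": "PENDING"}
    await queue.put(event_id)
    return 202, {"event_id": event_id}

async def worker(queue, deliver):
    """Deliver queued events in the background, off the request path."""
    while True:
        event_id = await queue.get()
        try:
            await deliver(EVENTS[event_id]["payload"])
            EVENTS[event_id]["status"] = "DELIVERED"
        except Exception:
            # A real worker would schedule a retry here instead.
            EVENTS[event_id]["status"] = "FAILED"
        finally:
            queue.task_done()
```

In a FastAPI app the same idea would live in a route handler that returns status 202; the point is only that the subscriber call never happens on the request path.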

2. Workers crash and jobs disappear
If a worker dies mid-delivery, its job stays stuck IN_FLIGHT forever. Fix: a watchdog that sweeps every 30s and requeues anything stale.
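One way to picture the sweep, as a sketch under stated assumptions: jobs live in a plain dict here rather than a database table, the `claimed_at` field and the `PENDING` status are hypothetical names, and the 60-second staleness threshold is a guess, since only the 30s sweep interval is given above.

```python
import time

# Assumed threshold; only the 30s sweep interval is stated in the post.
STALE_AFTER_SECONDS = 60

def requeue_stale(jobs, now=None):
    """Flip IN_FLIGHT jobs whose claim is older than the threshold back to PENDING."""
    now = time.time() if now is None else now
    requeued = []
    for job_id, job in jobs.items():
        stale = (
            job["status"] == "IN_FLIGHT"
            and now - job["claimed_at"] > STALE_AFTER_SECONDS
        )
        if stale:
            job["status"] = "PENDING"
            job["claimed_at"] = None
            requeued.append(job_id)
    return requeued
```

A scheduler would call `requeue_stale` every 30 seconds. Against PostgreSQL the loop would more likely be a single UPDATE with the same two conditions in its WHERE clause.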

3. Retries without backoff make things worse
Hammering a struggling subscriber on failure makes recovery harder. Fix: exponential backoff (2s → 32s, max 5 attempts) using a Redis sorted set as a delay queue, where each member's score is its next-attempt timestamp.
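The schedule and the delay queue can be sketched without Redis. In this illustration a `heapq` list stands in for the sorted set (roughly what ZADD and ZRANGEBYSCORE would do), and "max 5 attempts" is read as five retries so that the delays run 2, 4, 8, 16, 32 seconds. That reading, and the function names, are assumptions.

```python
import heapq

BASE_DELAY = 2   # seconds, the first delay from the post
MAX_RETRIES = 5  # assumption: five retries, giving 2, 4, 8, 16, 32

def backoff_delay(retry):
    """Delay before retry number `retry` (1-based), or None once retries run out."""
    if retry > MAX_RETRIES:
        return None
    return BASE_DELAY * 2 ** (retry - 1)

def schedule(queue, job_id, retry, now):
    """Push the job with score = next attempt timestamp; False means give up."""
    delay = backoff_delay(retry)
    if delay is None:
        return False
    heapq.heappush(queue, (now + delay, job_id))
    return True

def pop_due(queue, now):
    """Pop every job whose score is at or before `now`, earliest first."""
    due = []
    while queue and queue[0][0] <= now:
        due.append(heapq.heappop(queue)[1])
    return due
```

Using the timestamp as the score means a poller only ever has to ask for members with a score at or below the current time, so jobs that are not yet due cost nothing.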

4. One dead subscriber degrades the whole system
Fix: a circuit breaker per subscription. 5 consecutive failures trip it OPEN. After a 60s cooldown, a single probe request tests recovery before normal delivery resumes.
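A minimal sketch of such a breaker, assuming one instance per subscription. The `HALF_OPEN` state name, the class and method names, and the injectable `clock` are illustrative choices, not taken from the actual system; the thresholds (5 failures, 60s) are the ones stated above.

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = None

    def allow(self):
        """Return True if a delivery attempt may go out right now."""
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown:
            self.state = "HALF_OPEN"  # let exactly one probe through
            return True
        return False  # still cooling down, or a probe is already in flight

    def record_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "HALF_OPEN" or self.failures >= self.threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

A worker would call `allow()` before each delivery and report the outcome with `record_success()` or `record_failure()`. Passing the clock in makes the cooldown easy to test without sleeping.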

5. No payload integrity
Fix: a per-subscription HMAC-SHA256 signature on every payload, which the subscriber verifies with hmac.compare_digest so the comparison doesn't leak timing information.
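Signing and verification fit in a few lines of standard-library Python. This is a sketch: the `sign` and `verify` names and the hex encoding of the signature are assumptions, and how the signature travels (for example, in a request header) is left out because the post does not say.

```python
import hashlib
import hmac

def sign(secret: bytes, body: bytes) -> str:
    """Sender side: HMAC-SHA256 over the raw request body, hex-encoded."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Subscriber side: recompute the signature and compare in constant time."""
    expected = sign(secret, body)
    # Compare as bytes so a malformed, non-ASCII signature returns False
    # instead of raising TypeError.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

The signature should be computed over the exact bytes sent on the wire. Re-serializing the JSON on the receiving side can change whitespace or key order and make a valid payload fail verification.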

Result: 99.9% delivery reliability across 10,000+ daily webhooks, with full visibility via Prometheus + Grafana.

Full deep-dive coming soon.