Webhooks seem simple until a worker crashes mid-delivery, a subscriber goes down for an hour, or a payload gets tampered with in transit.
Here's what I actually built to handle that, using FastAPI, PostgreSQL, and Redis.
The core problems I solved:
1. Synchronous delivery blocks everything
The naive approach calls the subscriber URL inline, so one slow endpoint stalls your whole ingest path. Fix: persist the event, return 202 Accepted immediately, and deliver asynchronously.
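A minimal sketch of the accept-then-deliver split, not the production code. It is framework-free so it runs on its own: `accept_event`, `EVENT_STORE`, and `DELIVERY_QUEUE` are hypothetical names, and the in-memory dict and deque only stand in for PostgreSQL and Redis.

```python
import json
import uuid
from collections import deque

# Hypothetical in-memory stand-ins: the real system persists to PostgreSQL
# and queues through Redis, neither of which is wired up here.
EVENT_STORE: dict = {}
DELIVERY_QUEUE: deque = deque()


def accept_event(payload: dict) -> tuple:
    """Persist the event, enqueue it, and acknowledge without delivering."""
    event_id = str(uuid.uuid4())
    EVENT_STORE[event_id] = {"payload": json.dumps(payload), "status": "PENDING"}
    DELIVERY_QUEUE.append(event_id)
    # 202 Accepted: the event is recorded; a worker delivers it later.
    return 202, {"event_id": event_id}
```

In a FastAPI app the same function body would presumably sit behind a POST route declared with a 202 status code; the important part is that no subscriber URL is called on the request path.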
2. Workers crash and jobs disappear
If a worker dies mid-delivery, that job is stuck IN_FLIGHT forever. Fix: a watchdog that sweeps every 30s and requeues any job left IN_FLIGHT for too long.
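The watchdog idea can be sketched roughly like this. The 30s sweep interval comes from above; the staleness threshold `STALE_AFTER_S`, the `Job` fields, and the `PENDING` status name are assumptions made for illustration.

```python
from dataclasses import dataclass

SWEEP_INTERVAL_S = 30   # from the post: how often the watchdog runs
STALE_AFTER_S = 120     # assumption: the post does not state this threshold


@dataclass
class Job:
    job_id: str
    status: str          # "PENDING" or "IN_FLIGHT"
    claimed_at: float    # epoch seconds when a worker picked the job up


def sweep(jobs: list, now: float) -> list:
    """Requeue IN_FLIGHT jobs whose claim is older than STALE_AFTER_S."""
    requeued = []
    for job in jobs:
        if job.status == "IN_FLIGHT" and now - job.claimed_at > STALE_AFTER_S:
            job.status = "PENDING"  # make it claimable by a healthy worker
            requeued.append(job.job_id)
    return requeued
```

A scheduler would call `sweep` once every `SWEEP_INTERVAL_S` seconds. Because a requeued job may already have been partially delivered, subscribers should expect occasional duplicates.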
3. Retries without backoff make things worse
Hammering a struggling subscriber on failure makes recovery harder. Fix: exponential backoff (2s → 32s, max 5 attempts) using a Redis sorted set as a delay queue, where each member's score is its next-attempt timestamp.
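Here is one way the backoff schedule and delay queue could fit together. It reads "2s → 32s, max 5 attempts" as five retries with delays of 2, 4, 8, 16, and 32 seconds, which is an interpretation. `RETRY_KEY` is a hypothetical key name, and `client` is assumed to behave like a redis-py `Redis` instance (only `zadd` and `zrangebyscore` are used).

```python
import time
from typing import Optional

BASE_DELAY_S = 2
MAX_RETRIES = 5
RETRY_KEY = "webhook:retries"  # hypothetical sorted-set key


def backoff_delay(attempt: int) -> Optional[int]:
    """Delay after the given failed attempt (1-based); None when exhausted."""
    if attempt < 1 or attempt > MAX_RETRIES:
        return None
    return BASE_DELAY_S * 2 ** (attempt - 1)  # 2, 4, 8, 16, 32


def schedule_retry(client, job_id: str, attempt: int,
                   now: Optional[float] = None) -> bool:
    """Add the job to the delay queue; return False if retries are used up."""
    delay = backoff_delay(attempt)
    if delay is None:
        return False  # the caller would mark the job as permanently failed
    now = time.time() if now is None else now
    # Score = timestamp of the next attempt, so the set is ordered by due time.
    client.zadd(RETRY_KEY, {job_id: now + delay})
    return True


def due_jobs(client, now: Optional[float] = None) -> list:
    """Return members whose next-attempt timestamp has passed."""
    now = time.time() if now is None else now
    return client.zrangebyscore(RETRY_KEY, 0, now)
```

With several workers polling, reading due jobs and removing them would need to be atomic (for example via a Lua script or a claim step) so two workers do not pick up the same job; that part is left out here.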
4. One dead subscriber degrades the whole system
Fix: a circuit breaker per subscription. Five consecutive failures trip it OPEN. After a 60s cooldown, a single probe tests recovery before normal delivery resumes.
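A compact sketch of that state machine, with one instance per subscription. The thresholds are the ones above; the `HALF_OPEN` state name and the method names are assumptions, and time is passed in explicitly so the behaviour is easy to check.

```python
import time
from typing import Optional

FAILURE_THRESHOLD = 5
COOLDOWN_S = 60


class CircuitBreaker:
    def __init__(self) -> None:
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN" and now - self.opened_at >= COOLDOWN_S:
            self.state = "HALF_OPEN"  # let exactly one probe through
            return True
        return False  # still cooling down, or a probe is already in flight

    def record_success(self) -> None:
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "HALF_OPEN" or self.failures >= FAILURE_THRESHOLD:
            self.state = "OPEN"
            self.opened_at = now
```

A worker would call `allow_request()` before each delivery and skip (or reschedule) the job when it returns False, so a dead subscriber stops consuming worker time.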
5. No payload integrity
Fix: a per-subscription HMAC-SHA256 signature on every payload, verified with hmac.compare_digest, a constant-time comparison that guards against timing attacks.
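Signing and verification can be sketched with the standard library alone. The header name `X-Webhook-Signature` and the hex encoding are assumptions; the post does not say how the signature is transmitted.

```python
import hashlib
import hmac

SIGNATURE_HEADER = "X-Webhook-Signature"  # hypothetical header name


def sign(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign(secret, body)
    # Compare as bytes so a non-ASCII header value cannot raise TypeError.
    return hmac.compare_digest(expected.encode(), received_signature.encode())
```

The signature has to be computed over the exact bytes sent on the wire; re-serializing the JSON on the receiving side before verifying will usually change the bytes and break the check.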
Result: 99.9% delivery reliability across 10,000+ daily webhooks, with full visibility via Prometheus + Grafana.
Full deep-dive coming soon.