How I Built a Real-Time Stripe Webhook Anomaly Detector (Architecture + Code)
Stripe webhooks are the nervous system of any subscription business. When they work, everything is silent. When they break, you often don't find out until a customer complains — or your MRR drops.
I found this out the hard way. A silent charge failure cascade wiped out $800 in MRR before I noticed. So I built a real-time webhook monitor. Here's the architecture and the key implementation patterns.
The Core Problem With Webhook Monitoring
Stripe sends events. You receive them. But "receiving" and "processing correctly" are different things. Common failure modes:
- Duplicate delivery — Stripe retries events aggressively. If your handler isn't idempotent, you get double-charges.
- Silent lapses — Payment fails, subscriber stays active, you don't notice for days.
- Webhook lag — Events pile up because your endpoint is slow or overloaded.
- Fraud spikes — Sudden surge in charge.failed events from card testing attacks.
- Negative invoice anomalies — Credits or refunds creating unexpected negative charges.
A Stripe dashboard doesn't catch these automatically. You need a monitor that watches the stream and fires when something looks wrong.
Architecture: Event Stream → Detectors → Alerts
```
Stripe Webhook Events
        ↓
Webhook Receiver (FastAPI)
        ↓
Event Store (SQLite / Postgres)
        ↓
Detector Engine (runs on every event)
        ↓
Alert Router (email / webhook / Slack)
```
The key design choice: run detectors on every event, not on a schedule. Scheduled checks mean you might miss a 5-minute fraud spike. Event-triggered detection means you catch it within seconds.
The Webhook Receiver
```python
import os

from fastapi import FastAPI, Request, HTTPException
import stripe

STRIPE_WEBHOOK_SECRET = os.environ["STRIPE_WEBHOOK_SECRET"]

app = FastAPI()

@app.post("/webhooks/stripe")
async def stripe_webhook(request: Request):
    payload = await request.body()
    sig_header = request.headers.get("stripe-signature")
    try:
        event = stripe.Webhook.construct_event(
            payload, sig_header, STRIPE_WEBHOOK_SECRET
        )
    except ValueError:
        raise HTTPException(status_code=400, detail="Invalid payload")
    except stripe.error.SignatureVerificationError:
        raise HTTPException(status_code=400, detail="Invalid signature")

    # Store + dispatch (store_event returns False for duplicate deliveries)
    if await store_event(event):
        await run_detectors(event)
    return {"status": "ok"}
```
Always verify the signature. And respond fast: Stripe treats a slow response as a failed delivery and retries the event. If storage or detection ever gets expensive, enqueue that work and return 200 immediately instead of running it inline.
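One way to keep the endpoint fast is to hand the verified event to a background worker through an in-process queue. A runnable sketch of that pattern, where `handle` is a stand-in for the post's `store_event` + `run_detectors` (the dedupe/"detect" logic here is illustrative, not the real detectors):

```python
import asyncio

async def handle(event: dict, seen: set, alerts: list):
    """Stand-in for store_event + run_detectors: dedupe, then 'detect'."""
    if event["id"] in seen:
        return  # duplicate delivery, already processed
    seen.add(event["id"])
    if event["type"] == "charge.failed":
        alerts.append(event["id"])

async def main():
    queue: asyncio.Queue = asyncio.Queue()
    seen: set = set()
    alerts: list = []

    async def worker():
        while True:
            event = await queue.get()
            await handle(event, seen, alerts)
            queue.task_done()

    task = asyncio.create_task(worker())

    # Receiver side: verify, enqueue (cheap), return 200. A Stripe retry
    # of evt_1 is included to show the dedupe in action.
    for ev in [
        {"id": "evt_1", "type": "charge.failed"},
        {"id": "evt_1", "type": "charge.failed"},  # retry of the same event
        {"id": "evt_2", "type": "charge.succeeded"},
    ]:
        queue.put_nowait(ev)

    await queue.join()
    task.cancel()
    return alerts

alerts = asyncio.run(main())
print(alerts)  # evt_1 handled once despite the retry
```

The receiver only pays the cost of a `put_nowait`; everything slow happens after the 200 has gone back to Stripe.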
The Detector Pattern
Each detector is a function that takes an event and returns an alert or None:
```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Alert:
    severity: str  # "low", "medium", "high", "critical"
    detector: str
    message: str
    event_id: str

async def detect_charge_failure_rate(event: dict, db) -> Optional[Alert]:
    """Alert if charge failure rate exceeds threshold in the last hour."""
    if event["type"] not in ("charge.failed", "charge.succeeded"):
        return None

    one_hour_ago = datetime.utcnow() - timedelta(hours=1)
    recent = await db.fetch_events_since(
        one_hour_ago, types=["charge.failed", "charge.succeeded"]
    )
    if len(recent) < 10:  # not enough data to judge a rate
        return None

    failures = [e for e in recent if e["type"] == "charge.failed"]
    failure_rate = len(failures) / len(recent)
    if failure_rate > 0.15:  # 15% threshold
        return Alert(
            severity="high",
            detector="charge_failure_rate",
            message=(
                f"Charge failure rate: {failure_rate:.0%} in last hour "
                f"({len(failures)}/{len(recent)} failed)"
            ),
            event_id=event["id"],
        )
    return None
```
Wire up multiple detectors:
```python
DETECTORS = [
    detect_charge_failure_rate,
    detect_duplicate_events,
    detect_silent_lapse,
    detect_webhook_lag,
    detect_fraud_spike,
    detect_negative_invoice,
    detect_revenue_drop,
]

async def run_detectors(event: dict):
    for detector in DETECTORS:
        alert = await detector(event, db)
        if alert:
            await route_alert(alert)
```
The Idempotency Guard (Critical)
Stripe retries. Your handlers must be idempotent:
```python
import json

async def store_event(event: dict) -> bool:
    """Returns False if event was already processed (skip duplicate work)."""
    event_id = event["id"]
    existing = await db.fetch_one(
        "SELECT id FROM stripe_events WHERE event_id = ?", (event_id,)
    )
    if existing:
        return False  # Already processed — do nothing

    await db.execute(
        "INSERT INTO stripe_events (event_id, type, created, payload) "
        "VALUES (?, ?, ?, ?)",
        (event_id, event["type"], event["created"], json.dumps(event)),
    )
    return True
```
This prevents duplicate alerts from Stripe's retry logic.
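One caveat: a SELECT-then-INSERT check can race if two retries of the same event arrive concurrently. With a unique constraint on `event_id`, the database enforces idempotency atomically instead. A sketch using sqlite3 (table shape assumed from the INSERT above):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS stripe_events (
        event_id TEXT PRIMARY KEY,  -- uniqueness enforced by the database
        type TEXT,
        created INTEGER,
        payload TEXT
    )
    """
)

def store_event_atomic(event: dict) -> bool:
    """INSERT OR IGNORE is atomic: returns False if the row already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO stripe_events (event_id, type, created, payload) "
        "VALUES (?, ?, ?, ?)",
        (event["id"], event["type"], event["created"], json.dumps(event)),
    )
    conn.commit()
    return cur.rowcount == 1  # 1 row inserted = first delivery

ev = {"id": "evt_1", "type": "charge.failed", "created": 1700000000}
print(store_event_atomic(ev))  # True  — first delivery stored
print(store_event_atomic(ev))  # False — retry ignored
```

Postgres gets the same effect with `INSERT ... ON CONFLICT (event_id) DO NOTHING`.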
Silent Lapse Detection
The sneaky one. Payment fails → Stripe retries 3 times over 7 days → subscription cancels → you never noticed because the customer was "still active":
```python
async def detect_silent_lapse(event: dict, db) -> Optional[Alert]:
    """Detect when a subscription cancels after multiple payment failures."""
    if event["type"] != "customer.subscription.deleted":
        return None

    sub = event["data"]["object"]
    customer_id = sub["customer"]

    # Check if this customer had recent payment failures
    recent_failures = await db.fetch_events(
        customer_id=customer_id,
        types=["invoice.payment_failed"],
        since_days=10,
    )
    if len(recent_failures) >= 2:
        # unit_amount is in cents; guard against missing items/price
        items = sub.get("items", {}).get("data", [])
        unit_amount = items[0].get("price", {}).get("unit_amount") if items else 0
        mrr_lost = (unit_amount or 0) / 100
        return Alert(
            severity="high",
            detector="silent_lapse",
            message=(
                f"Silent lapse detected: customer {customer_id} cancelled after "
                f"{len(recent_failures)} payment failures. "
                f"MRR lost: ${mrr_lost:.2f}/mo"
            ),
            event_id=event["id"],
        )
    return None
```
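The fraud-spike detector mentioned earlier falls out of the same event data. A sketch over a pre-fetched window of charge.failed events, keyed on a `card_fingerprint` field you would extract from each charge's payment method details (the field name, window, and thresholds are all assumptions):

```python
from typing import Optional

def detect_fraud_spike(
    recent_failures: list,
    window_minutes: int = 5,
    min_failures: int = 10,
) -> Optional[str]:
    """Card testing looks like many failed charges across many DISTINCT cards
    in a short window, unlike one customer's card retrying and failing."""
    if len(recent_failures) < min_failures:
        return None

    distinct_cards = {e.get("card_fingerprint") for e in recent_failures}
    if len(distinct_cards) >= min_failures * 0.8:  # mostly unique cards
        return (
            f"Possible card testing: {len(recent_failures)} failed charges "
            f"across {len(distinct_cards)} cards in {window_minutes} min"
        )
    return None
```

The distinct-card check is what separates an attack from a single struggling subscriber: ten failures from one fingerprint is a dunning problem, ten failures from ten fingerprints is card testing.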
The Metrics Endpoint
Real-time visibility into detector status:
```python
@app.get("/metrics/detectors")
async def detector_metrics():
    last_24h = datetime.utcnow() - timedelta(hours=24)
    alerts = await db.fetch_alerts_since(last_24h)
    events = await db.fetch_events_since(last_24h)

    detector_counts = {}
    for alert in alerts:
        detector_counts[alert["detector"]] = (
            detector_counts.get(alert["detector"], 0) + 1
        )

    return {
        "events_processed_24h": len(events),
        "alerts_fired_24h": len(alerts),
        "active_detectors": len(DETECTORS),
        "detector_breakdown": detector_counts,
        "last_event_at": events[-1]["created"] if events else None,
    }
```
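The alert router from the architecture diagram can be as small as a Slack incoming-webhook POST. A minimal sketch, assuming alerts serialized as dicts and a placeholder webhook URL (both assumptions, not from the post):

```python
import json
import urllib.request

# Placeholder — substitute your own Slack incoming-webhook URL
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def format_alert(alert: dict) -> dict:
    """Build a Slack incoming-webhook payload from an Alert-shaped dict."""
    return {
        "text": f"[{alert['severity'].upper()}] "
                f"{alert['detector']}: {alert['message']}"
    }

def route_alert(alert: dict) -> None:
    """POST the formatted alert to Slack's incoming webhook."""
    body = json.dumps(format_alert(alert)).encode()
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)
```

Email and generic-webhook routes follow the same shape: format the `Alert`, then push it to one more transport keyed on severity.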
What I Learned
Start with 3 detectors, not 7. The failure rate, silent lapse, and duplicate event detectors catch 90% of real incidents. Add more later.
Tune your thresholds per traffic volume. A 15% failure rate threshold makes sense at 100 charges/hour. At 10 charges/hour, it fires too often.
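One way to make the threshold volume-aware is to widen it at low sample sizes, e.g. a baseline failure rate plus a few binomial standard errors (the baseline and z value here are assumptions, not the post's numbers):

```python
import math

def failure_threshold(n_charges: int,
                      baseline_rate: float = 0.05,
                      z: float = 3.0) -> float:
    """Volume-aware threshold: baseline rate plus z standard errors of a
    binomial proportion. At low volume the bar rises, so two or three
    failed charges out of ten don't page anyone."""
    stderr = math.sqrt(baseline_rate * (1 - baseline_rate) / max(n_charges, 1))
    return baseline_rate + z * stderr

print(round(failure_threshold(100), 3))  # ~0.115 at 100 charges/hour
print(round(failure_threshold(10), 3))   # ~0.257 at 10 charges/hour
```

The fixed 15% cutoff from the detector above is roughly what this formula yields at around a hundred charges per hour, which is why it fires too often at ten.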
Alert fatigue is real. Start with high thresholds, tighten as you learn what's normal for your business.
The webhook lag detector is underrated. Stripe normally delivers events within seconds of creation. If you're seeing 10+ minute delays, something is wrong with your infrastructure.
Store everything. Raw event storage is cheap. You'll need it for debugging at 2 AM.
The Open-Source Version
I packaged this pattern as BillingWatch — self-hosted, MIT licensed. 7 detectors out of the box, real-time metrics endpoint, configurable alert thresholds.
If you're running subscriptions on Stripe and want coverage against the failure modes above: github.com/rmbell09-lang/billingwatch
What webhook failure modes have you hit in production? Always curious what I'm missing.