Ray

How I Built a Real-Time Stripe Webhook Anomaly Detector (Architecture + Code)

Stripe webhooks are the nervous system of any subscription business. When they work, everything is silent. When they break, you often don't find out until a customer complains — or your MRR drops.

I found this out the hard way. A silent charge failure cascade wiped out $800 in MRR before I noticed. So I built a real-time webhook monitor. Here's the architecture and the key implementation patterns.


The Core Problem With Webhook Monitoring

Stripe sends events. You receive them. But "receiving" and "processing correctly" are different things. Common failure modes:

  1. Duplicate delivery: Stripe retries any delivery that doesn't get a 2xx response, and can occasionally send the same event more than once. If your handler isn't idempotent, you repeat side effects: duplicate provisioning, duplicate emails, even double charges if the handler creates payments.
  2. Silent lapses: A payment fails, the subscriber stays active, and you don't notice for days.
  3. Webhook lag: Events pile up because your endpoint is slow or overloaded.
  4. Fraud spikes: A sudden surge in charge.failed events from card-testing attacks.
  5. Negative invoice anomalies: Credits or refunds that produce unexpected negative invoice totals.

A Stripe dashboard doesn't catch these automatically. You need a monitor that watches the stream and fires when something looks wrong.


Architecture: Event Stream → Detectors → Alerts

Stripe Webhook Events
        ↓
Webhook Receiver (FastAPI)
        ↓
Event Store (SQLite / Postgres)
        ↓
Detector Engine (runs on every event)
        ↓
Alert Router (email / webhook / Slack)

The key design choice: run detectors on every event, not on a schedule. Scheduled checks mean you might miss a 5-minute fraud spike. Event-triggered detection means you catch it within seconds.
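To make the event-triggered idea concrete, here is a minimal sketch of a rolling-window counter that is updated on every incoming event rather than on a timer. The class name `SlidingWindowCounter` and its parameters are my own placeholders, not part of any library, and the sketch assumes timestamps arrive roughly in order (Stripe does not guarantee delivery order, so treat the count as approximate).

```python
from collections import deque
from typing import Deque


class SlidingWindowCounter:
    """Counts events inside a rolling time window, updated on every event."""

    def __init__(self, window_seconds: int, threshold: int) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps: Deque[int] = deque()

    def record(self, timestamp: int) -> bool:
        """Add one event; return True if the window now holds >= threshold events."""
        self._timestamps.append(timestamp)
        # Evict events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

A fraud-spike detector could keep one counter with, say, `window_seconds=300` and feed it the `created` timestamp of each charge.failed event. The check runs the moment the event lands, so a 5-minute burst trips it while it is still happening.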


The Webhook Receiver

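import os

import stripe
from fastapi import BackgroundTasks, FastAPI, HTTPException, Request

app = FastAPI()

# Signing secret for this endpoint (whsec_...), read from the environment
STRIPE_WEBHOOK_SECRET = os.environ["STRIPE_WEBHOOK_SECRET"]

@app.post("/webhooks/stripe")
async def stripe_webhook(request: Request, background_tasks: BackgroundTasks):
    # Verify against the raw body; re-serialized JSON will not match the signature
    payload = await request.body()
    sig_header = request.headers.get("stripe-signature")

    try:
        event = stripe.Webhook.construct_event(
            payload, sig_header, STRIPE_WEBHOOK_SECRET
        )
    except ValueError:
        raise HTTPException(status_code=400, detail="Invalid payload")
    except stripe.error.SignatureVerificationError:
        raise HTTPException(status_code=400, detail="Invalid signature")

    # Store first; store_event returns False for an event we have already seen
    is_new = await store_event(event)
    if is_new:
        # Detectors run after the response is sent, so Stripe gets its 200 quickly
        background_tasks.add_task(run_detectors, event)

    return {"status": "ok"}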

Always verify the signature. Always return a 2xx before doing expensive work: Stripe treats a slow or failed response as a failed delivery and retries it.
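If you're curious what signature verification involves, here is a rough standard-library sketch of the scheme Stripe documents: the `Stripe-Signature` header carries a timestamp (`t=`) and one or more `v1=` signatures, each an HMAC-SHA256 of `"{timestamp}.{raw body}"` keyed with the endpoint secret. The function name `verify_stripe_signature` and the `now` parameter are my own, added for illustration and testing. In production, stick with `stripe.Webhook.construct_event`.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(payload: bytes, sig_header: str, secret: str,
                            tolerance: int = 300,
                            now: Optional[float] = None) -> bool:
    """Illustrative check of a Stripe-Signature header (t=...,v1=...)."""
    timestamp = None
    candidates = []
    for part in sig_header.split(","):
        key, _, value = part.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False

    # Reject timestamps outside the tolerance window to limit replay attacks.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:
        return False

    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 signature in the header.
    return any(
        hmac.compare_digest(expected.encode(), candidate.encode())
        for candidate in candidates
    )
```

The important detail is that the HMAC covers the raw request bytes. If your framework parses and re-serializes the JSON before you verify, the signature will no longer match.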


The Detector Pattern

Each detector is an async function that takes an event and a database handle, and returns an Alert or None:

from dataclasses import dataclass
from typing import Optional
from datetime import datetime, timedelta

@dataclass
class Alert:
    severity: str  # "low", "medium", "high", "critical"
    detector: str
    message: str
    event_id: str

async def detect_charge_failure_rate(event: dict, db) -> Optional[Alert]:
    """Alert if charge failure rate exceeds threshold in the last hour."""
    if event["type"] not in ("charge.failed", "charge.succeeded"):
        return None

    one_hour_ago = datetime.utcnow() - timedelta(hours=1)

    recent = await db.fetch_events_since(one_hour_ago, 
                                          types=["charge.failed", "charge.succeeded"])

    if len(recent) < 10:  # Not enough data
        return None

    failures = [e for e in recent if e["type"] == "charge.failed"]
    failure_rate = len(failures) / len(recent)

    if failure_rate > 0.15:  # 15% threshold
        return Alert(
            severity="high",
            detector="charge_failure_rate",
            message=f"Charge failure rate: {failure_rate:.0%} in last hour ({len(failures)}/{len(recent)} failed)",
            event_id=event["id"]
        )
    return None

Wire up multiple detectors:

DETECTORS = [
    detect_charge_failure_rate,
    detect_duplicate_events,
    detect_silent_lapse,
    detect_webhook_lag,
    detect_fraud_spike,
    detect_negative_invoice,
    detect_revenue_drop,
]

async def run_detectors(event: dict):
    for detector in DETECTORS:
        alert = await detector(event, db)
        if alert:
            await route_alert(alert)
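The loop above calls `route_alert`, which this post never shows. One way it could work, sketched here with hypothetical names (`AlertRouter`, `Channel`, `add_channel`), is to register async channel callables with a minimum severity and to isolate failures so that one broken channel doesn't swallow the alert for everyone else. Treat this as an illustration, not the BillingWatch implementation.

```python
import asyncio
import logging
from typing import Awaitable, Callable, List, Tuple

logger = logging.getLogger("billing_monitor")

# Hypothetical channel signature: an async callable that takes the alert text.
Channel = Callable[[str], Awaitable[None]]

SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


class AlertRouter:
    """Send each alert to every channel whose minimum severity it meets."""

    def __init__(self) -> None:
        self._routes: List[Tuple[int, Channel]] = []

    def add_channel(self, channel: Channel, min_severity: str = "low") -> None:
        self._routes.append((SEVERITY_ORDER[min_severity], channel))

    async def route(self, severity: str, message: str) -> int:
        """Deliver the message; return how many channels accepted it."""
        delivered = 0
        for min_level, channel in self._routes:
            if SEVERITY_ORDER[severity] < min_level:
                continue
            try:
                await channel(message)
                delivered += 1
            except Exception:
                # One broken channel should not block the others.
                logger.exception("alert channel failed")
        return delivered


if __name__ == "__main__":
    async def print_channel(message: str) -> None:
        print("ALERT:", message)

    router = AlertRouter()
    router.add_channel(print_channel, min_severity="high")
    asyncio.run(router.route("critical", "Charge failure rate: 22% in last hour"))
```

The same try/except idea applies to the detector loop itself: wrapping each `await detector(event, db)` call keeps one buggy detector from silencing the rest.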

The Idempotency Guard (Critical)

Stripe retries. Your handlers must be idempotent:

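import json

async def store_event(event: dict) -> bool:
    """Returns False if event was already processed (skip duplicate work)."""
    event_id = event["id"]

    # Check-then-insert can race when two deliveries arrive together;
    # a UNIQUE constraint on event_id is the real guarantee against double inserts.
    existing = await db.fetch_one(
        "SELECT id FROM stripe_events WHERE event_id = ?", 
        (event_id,)
    )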

    if existing:
        return False  # Already processed — do nothing

    await db.execute(
        "INSERT INTO stripe_events (event_id, type, created, payload) VALUES (?, ?, ?, ?)",
        (event_id, event["type"], event["created"], json.dumps(event))
    )
    return True

This prevents duplicate alerts from Stripe's retry logic, as long as the caller checks the return value and skips the detectors when it is False.
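A SELECT followed by an INSERT leaves a small window where two concurrent deliveries of the same event can both pass the check. A tighter variant lets the database decide in a single statement. Here is a sketch using the standard-library `sqlite3` module; the helper names `init_store` and `store_event_once` are mine, and the table layout simply mirrors the columns used above.

```python
import json
import sqlite3


def init_store(conn: sqlite3.Connection) -> None:
    # The UNIQUE constraint is what makes the insert below safe under concurrency.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS stripe_events (
               event_id TEXT NOT NULL UNIQUE,
               type     TEXT NOT NULL,
               created  INTEGER NOT NULL,
               payload  TEXT NOT NULL
           )"""
    )


def store_event_once(conn: sqlite3.Connection, event: dict) -> bool:
    """Insert the event; return True only for the first delivery of this event ID."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO stripe_events (event_id, type, created, payload) "
        "VALUES (?, ?, ?, ?)",
        (event["id"], event["type"], event["created"], json.dumps(event)),
    )
    conn.commit()
    # rowcount is 1 when a row was inserted and 0 when the insert was ignored.
    return cursor.rowcount == 1
```

On Postgres the equivalent is `INSERT ... ON CONFLICT (event_id) DO NOTHING`, again checking whether a row was actually written.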


Silent Lapse Detection

The sneaky one. Payment fails → Stripe retries on whatever schedule you've configured (for example, 3 times over 7 days) → subscription cancels → you never noticed because the customer was "still active" the whole time:

async def detect_silent_lapse(event: dict, db) -> Optional[Alert]:
    """Detect when a subscription cancels after multiple payment failures."""
    if event["type"] != "customer.subscription.deleted":
        return None

    sub = event["data"]["object"]
    customer_id = sub["customer"]

    # Check if this customer had recent payment failures
    recent_failures = await db.fetch_events(
        customer_id=customer_id,
        types=["invoice.payment_failed"],
        since_days=10
    )

    if len(recent_failures) >= 2:
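        # Guard against an empty item list or a null unit_amount (e.g. tiered prices)
        items = sub.get("items", {}).get("data") or [{}]
        unit_amount = (items[0].get("price") or {}).get("unit_amount") or 0
        mrr_lost = unit_amount / 100  # first item only; ignores quantity and billing interval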
        return Alert(
            severity="high",
            detector="silent_lapse",
            message=f"Silent lapse detected: customer {customer_id} cancelled after {len(recent_failures)} payment failures. MRR lost: ${mrr_lost:.2f}/mo",
            event_id=event["id"]
        )
    return None
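The MRR figure in that detector reads only the first subscription item and treats its price as monthly. If you sell annual plans or multiple seats, a rough normalization like the one below gives a more honest number. The function name `estimate_monthly_revenue` is my own; the fields it reads (`items.data[].quantity`, `price.unit_amount`, `price.recurring.interval`, `price.recurring.interval_count`) come from Stripe's subscription object. Like the original, it assumes a two-decimal currency.

```python
# Approximate number of billing periods per month for each Stripe interval.
PERIODS_PER_MONTH = {"day": 365 / 12, "week": 52 / 12, "month": 1.0, "year": 1 / 12}


def estimate_monthly_revenue(subscription: dict) -> float:
    """Rough monthly revenue, in major currency units, for a subscription object."""
    total_minor_units = 0.0
    for item in (subscription.get("items") or {}).get("data") or []:
        price = item.get("price") or {}
        unit_amount = price.get("unit_amount") or 0  # None for tiered pricing
        quantity = item.get("quantity")
        if quantity is None:  # metered items carry no quantity
            quantity = 1
        recurring = price.get("recurring") or {}
        interval = recurring.get("interval", "month")
        interval_count = recurring.get("interval_count") or 1
        # A price billed every N intervals contributes 1/N of a period per interval.
        per_month = PERIODS_PER_MONTH.get(interval, 1.0) / interval_count
        total_minor_units += unit_amount * quantity * per_month
    return total_minor_units / 100
```

Tiered and metered prices still come out as zero here, so treat the result as a floor rather than an exact figure.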

The Metrics Endpoint

Real-time visibility into detector status:

@app.get("/metrics/detectors")
async def detector_metrics():
    last_24h = datetime.utcnow() - timedelta(hours=24)

    alerts = await db.fetch_alerts_since(last_24h)
    events = await db.fetch_events_since(last_24h)

    detector_counts = {}
    for alert in alerts:
        detector_counts[alert["detector"]] = detector_counts.get(alert["detector"], 0) + 1

    return {
        "events_processed_24h": len(events),
        "alerts_fired_24h": len(alerts),
        "active_detectors": len(DETECTORS),
        "detector_breakdown": detector_counts,
        "last_event_at": events[-1]["created"] if events else None
    }

What I Learned

  1. Start with 3 detectors, not 7. The failure rate, silent lapse, and duplicate event detectors caught roughly 90% of the real incidents I've seen. Add more later.

  2. Tune your thresholds per traffic volume. A 15% failure rate threshold makes sense at 100 charges/hour. At 10 charges/hour, two failed charges already put you at 20%, so it fires too often.

  3. Alert fatigue is real. Start with high thresholds and tighten them as you learn what's normal for your business.

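  4. The webhook lag detector is underrated. Events normally arrive shortly after Stripe creates them, so compare each event's created timestamp with the time you received it. If you're seeing 10+ minute delays, something is wrong, most likely with your infrastructure.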

  5. Store everything. Raw event storage is cheap. You'll need it for debugging at 2 AM.
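For lesson 2, one simple way to make a rate threshold behave at low volume is to require a minimum number of failures as well as a minimum rate. The sketch below reuses the 15% threshold and the 10-event minimum from the detector earlier in the post; `min_failures=5` and the function name are assumptions of mine that you would tune to your own traffic.

```python
def should_alert_on_failures(failures: int, total: int,
                             rate_threshold: float = 0.15,
                             min_total: int = 10,
                             min_failures: int = 5) -> bool:
    """Require both a high failure rate and a minimum number of failures."""
    if total < min_total or failures < min_failures:
        return False
    return failures / total > rate_threshold
```

With these defaults, 2 failures out of 10 charges (20%) stays quiet, while 20 out of 100 (also 20%) fires.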


The Open-Source Version

I packaged this pattern as BillingWatch — self-hosted, MIT licensed. 7 detectors out of the box, real-time metrics endpoint, configurable alert thresholds.

If you're running subscriptions on Stripe and want coverage against the failure modes above: github.com/rmbell09-lang/billingwatch


What webhook failure modes have you hit in production? Always curious what I'm missing.