Disclosure: I'm a senior backend tech lead and I run HostingGuru. This article mentions HostingGuru once near the end. The four patterns below work on any platform, any framework, with any webhook provider (Stripe, GitHub, Twilio, Postmark). I want it useful even if you never become a customer.
Two weekends ago a founder I do code reviews for sent me a panicked DM at 11pm: "A customer just opened a chargeback because we never unlocked their Pro plan. Stripe says the subscription is active. What's happening?"
His webhook handler had crashed three days earlier at 3:14am on a Sunday. Stripe retried for 72 hours, gave up, marked the event as failed in their dashboard. Nobody was watching. The customer paid $49, got the same free-tier experience, asked for a refund, didn't get a fast reply, opened a dispute. By the time we caught it, his MRR was down one customer and his Stripe account had a $15 dispute fee on top. The fix took an hour. The damage took weeks to unwind.
Webhooks are the load-bearing wall of indie SaaS, and they're built on sand
If you run a SaaS that takes payments, webhooks are how Stripe tells you a customer paid, upgraded, churned, or failed to pay. The redirect after checkout is a UX nicety. The webhook is the source of truth. Without it, your database and Stripe go out of sync, and that gap is where money disappears.
The problem is that webhooks are the worst combination of "load-bearing" and "unreliable". The network blips. Your server returns 502 for the 30 seconds a deploy is booting. Your DB locks because some other query is doing a full table scan. Stripe sends duplicates on purpose (their delivery guarantee is at-least-once, not exactly-once). The webhook can arrive before your DB has the customer record if checkout.session.completed beats your post-checkout redirect.
Stripe retries with exponential backoff for about 3 days, then quietly stops. After that, your state is permanently divergent unless you reconcile.
Here are the four patterns I install on every indie SaaS I touch, usually after the first dropped customer.
Pattern 1: verify signature, acknowledge fast, process async
The most common mistake I see in vibe-coded webhook handlers is doing the actual work inside the HTTP request. The handler verifies the signature, then runs through 15 lines of "if event.type is invoice.paid, update the DB, send an email, hit the analytics API, set a feature flag." All synchronous. All in the request that Stripe is waiting on.
If any of those steps is slow or fails, Stripe sees a timeout or 500 and retries. If your code is not idempotent (more on that next), you double-process. If the DB is locked, the whole request times out.
The pattern is: do the minimum in the HTTP handler. Verify the signature. Persist the raw event. Acknowledge 200. Process asynchronously.
```js
// webhook-handler.js (Express + Postgres)
import express from 'express';
import Stripe from 'stripe';
import { db } from './db.js';

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY);
const router = express.Router();

router.post('/webhooks/stripe',
  // Signature verification needs the raw body, not the parsed JSON.
  express.raw({ type: 'application/json' }),
  async (req, res) => {
    const sig = req.headers['stripe-signature'];
    let event;
    try {
      event = stripe.webhooks.constructEvent(
        req.body,
        sig,
        process.env.STRIPE_WEBHOOK_SECRET
      );
    } catch (err) {
      return res.status(400).send(`Invalid signature: ${err.message}`);
    }

    // Persist the raw event. That's it. Process later.
    try {
      await db.query(
        `INSERT INTO webhook_events (id, type, payload, received_at, status)
         VALUES ($1, $2, $3, NOW(), 'pending')
         ON CONFLICT (id) DO NOTHING`,
        [event.id, event.type, event]
      );
      return res.status(200).send({ received: true });
    } catch (err) {
      console.error('webhook_insert_failed', { event_id: event.id, err });
      return res.status(500).send('persist failed');
    }
  }
);

export default router;
```
The handler now does two things: verify, store. If the DB is reachable, you ack 200. If it isn't, you return 500 and Stripe retries, which is fine because step one (verify) is idempotent and step two has an ON CONFLICT DO NOTHING clause.
The actual work (granting Pro access, sending the welcome email, firing analytics) happens in a worker, which I'll get to in pattern 3.
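For reference, here is one possible shape for the `webhook_events` table that patterns 1 and 3 both touch. The column types are my assumptions, not canon; adjust to your setup:

```sql
-- One possible schema for webhook_events (types are assumptions, not canon).
CREATE TABLE webhook_events (
  id              TEXT PRIMARY KEY,     -- the Stripe event ID (evt_...)
  type            TEXT NOT NULL,
  payload         JSONB NOT NULL,
  status          TEXT NOT NULL DEFAULT 'pending',  -- 'pending' on arrival; the worker moves it onward
  attempts        INT  NOT NULL DEFAULT 0,
  last_error      TEXT,
  received_at     TIMESTAMPTZ NOT NULL,
  next_attempt_at TIMESTAMPTZ,
  processed_at    TIMESTAMPTZ
);
```

The primary key on `id` is what makes the handler's `ON CONFLICT (id) DO NOTHING` work, which is the subject of pattern 2.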
Pattern 2: idempotency via the event ID
Stripe will deliver the same event more than once. Sometimes because of a retry. Sometimes because of their internal failover. Sometimes because of you, when you return 200 after the work succeeded but before the response made it back through your CDN.
The fix is one line of DDL:
ALTER TABLE webhook_events ADD CONSTRAINT webhook_events_pkey PRIMARY KEY (id);
Stripe event IDs are globally unique. If you make the primary key of your webhook_events table the Stripe event ID, the ON CONFLICT DO NOTHING in pattern 1 becomes your dedupe layer. The second time the same event arrives, the insert silently noops and you return 200.
The mistake to avoid: using your internal ID as primary key and the Stripe event ID as a regular column without a unique index. I've seen a team spend a week debugging "why did this customer get three welcome emails."
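If it helps to see the dedupe property isolated from the database, here it is as a tiny in-memory sketch. This is a stand-in for `ON CONFLICT DO NOTHING`, not something you'd ship (it forgets everything on restart):

```js
// In-memory stand-in for the ON CONFLICT dedupe: the first delivery of an
// event ID returns true (process it), every repeat returns false (no-op).
const seen = new Set();

function recordEvent(eventId) {
  if (seen.has(eventId)) return false; // duplicate delivery: silently skip
  seen.add(eventId);
  return true;
}
```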
Idempotency in the processing layer matters just as much. When the worker picks up an event, it should write a "processed" flag in the same transaction as whatever side effect it produces. If the worker crashes in between, the next run reruns. So side effects need to be safe to repeat: granting Pro twice should be a noop, sending the welcome email twice is survivable, sending a $50 refund twice is a Tuesday-morning incident.
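One way to keep side effects repeat-safe is to express them as state transitions that converge: applying the same event twice lands on the same state. A sketch of that idea (the event types are real Stripe names; the plan values are my invention):

```js
// Convergent state transitions: re-applying the same event is a no-op.
function applyEvent(user, event) {
  switch (event.type) {
    case 'invoice.paid':
      return { ...user, plan: 'pro' };  // already 'pro'? still 'pro'
    case 'customer.subscription.deleted':
      return { ...user, plan: 'free' };
    default:
      return user;  // unknown events change nothing
  }
}
```

A counter-style effect ("add 30 days of credit") does not converge; those are exactly the ones that need the processed flag in the same transaction.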
Pattern 3: a dead-letter queue and a worker with backoff
Once the raw events are in the table, you need a worker to drain it. The worker is simple:
```js
// webhook-worker.js
import { db } from './db.js';
import { handleStripeEvent } from './handlers/stripe.js';

const BACKOFF = [10, 60, 300, 1800, 7200]; // seconds

async function processNext() {
  // Claim one due event atomically. The SELECT ... FOR UPDATE and the status
  // change must happen in a single statement: a bare SELECT followed by a
  // separate UPDATE releases the row lock between queries under autocommit,
  // and two workers could grab the same event.
  const { rows } = await db.query(`
    UPDATE webhook_events
    SET status = 'processing'
    WHERE id = (
      SELECT id
      FROM webhook_events
      WHERE status = 'pending'
        AND (next_attempt_at IS NULL OR next_attempt_at <= NOW())
      ORDER BY received_at ASC
      LIMIT 1
      FOR UPDATE SKIP LOCKED
    )
    RETURNING id, type, payload, attempts
  `);
  if (rows.length === 0) return false;
  const event = rows[0];
  try {
    await handleStripeEvent(event);
    await db.query(
      `UPDATE webhook_events SET status = 'done', processed_at = NOW() WHERE id = $1`,
      [event.id]
    );
  } catch (err) {
    const attempts = event.attempts + 1;
    const backoff = BACKOFF[Math.min(attempts - 1, BACKOFF.length - 1)];
    const dead = attempts >= 10;
    await db.query(
      `UPDATE webhook_events
       SET attempts = $1,
           last_error = $2,
           next_attempt_at = NOW() + (INTERVAL '1 second' * $3),
           status = $4
       WHERE id = $5`,
      [attempts, err.message, backoff, dead ? 'dead' : 'pending', event.id]
    );
    console.error('webhook_process_failed', { event_id: event.id, attempts, err });
  }
  return true;
}

// A recursive setTimeout rather than setInterval: the next tick is scheduled
// only after the current drain finishes, so a slow batch never overlaps itself.
async function loop() {
  try {
    while (await processNext()) {}
  } catch (err) {
    console.error('worker_loop_error', err);
  }
  setTimeout(loop, 2000);
}
loop();
```
FOR UPDATE SKIP LOCKED lets you run multiple workers without them stepping on each other. The exponential backoff prevents a poison-pill event from hammering your downstream services. After 10 attempts an event moves to "dead"; wire an alert to that transition (the checklist at the end covers it), and pattern 4 catches anything that still slips through.
This worker is a separate process from your web service. It is not optional. If you run it inside your web service as a setInterval, it dies when the web service redeploys or scales to zero. That is exactly the gap where the founder's chargeback story happened.
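On a Heroku-style platform, the two long-lived processes might be declared like this (process names and file names are illustrative; the exact format depends on your platform):

```
web: node server.js
worker: node webhook-worker.js
```

The reconciliation script from pattern 4 runs from your platform's scheduler or cron, not as a long-lived process.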
Pattern 4: the daily reconciliation cron
Patterns 1 to 3 keep drift small. Pattern 4 fixes the drift that already happened.
Once a day, fetch the source of truth and compare to your DB. For Stripe, that means iterating active subscriptions and verifying each one matches what your app thinks the customer has.
```js
// reconcile-subscriptions.js (scheduled daily at 04:00 UTC)
import Stripe from 'stripe';
import { db } from './db.js';
import { sendAlert } from './alerts.js';

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY);

async function reconcile() {
  const drift = [];
  // stripe-node auto-paginates list() when used as an async iterator.
  for await (const sub of stripe.subscriptions.list({ status: 'active', limit: 100 })) {
    const customerId = sub.customer;
    const planTier = sub.items.data[0].price.lookup_key; // e.g. 'pro_monthly'
    const { rows } = await db.query(
      `SELECT plan FROM users WHERE stripe_customer_id = $1`,
      [customerId]
    );
    if (rows.length === 0) {
      drift.push({ type: 'missing_user', customerId, planTier });
      continue;
    }
    if (rows[0].plan !== planTier) {
      drift.push({
        type: 'plan_mismatch',
        customerId,
        dbPlan: rows[0].plan,
        stripePlan: planTier,
      });
      await db.query(
        `UPDATE users SET plan = $1, plan_synced_at = NOW() WHERE stripe_customer_id = $2`,
        [planTier, customerId]
      );
    }
  }
  if (drift.length > 0) {
    await sendAlert(`Reconciliation drift: ${drift.length} accounts`, drift);
  }
}

reconcile().catch((err) => {
  console.error(err);
  process.exit(1);
});
```
The first time you run this on a 6-month-old codebase, you will find drift. I've never seen this fail to surface at least one mismatch. Log it loudly, repair the divergence, notify yourself.
The reconciliation cron is the safety net that catches whatever the previous three patterns miss: the 3-day Stripe retry window that expired during a holiday, the customer who used the Customer Portal to switch plans while your webhook handler was returning 500, the silent failure that nobody noticed because your error tracking is muted on weekends.
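The sendAlert helper is whatever actually reaches you. Here's a minimal sketch using Telegram's Bot API (the `sendMessage` endpoint is real; the env var names and the truncation limit are my assumptions):

```js
// Push alerts to a Telegram chat via the Bot API.
// Assumes TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID are set in the environment.
// Export both from alerts.js in a real setup.

function formatAlert(title, details) {
  // Truncate so a huge drift report doesn't exceed Telegram's 4096-char message limit.
  const body = JSON.stringify(details, null, 2).slice(0, 3500);
  return `${title}\n\n${body}`;
}

async function sendAlert(title, details) {
  const url = `https://api.telegram.org/bot${process.env.TELEGRAM_BOT_TOKEN}/sendMessage`;
  await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      chat_id: process.env.TELEGRAM_CHAT_ID,
      text: formatAlert(title, details),
    }),
  });
}
```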
For GitHub, Twilio, Postmark and the rest, the same shape applies: if the provider has a list API for the resources its webhooks describe, you can reconcile. Fetch the truth, diff against your DB, repair, alert.
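That shape is mechanical enough to factor out. A provider-agnostic sketch of the diff step (the field names are illustrative, not any provider's API):

```js
// Generic reconcile diff: compare the provider's records ("remote" truth)
// against local records keyed by ID, and report what's missing or mismatched.
function diffRecords(remote, local, field) {
  const localById = new Map(local.map((r) => [r.id, r]));
  const drift = [];
  for (const r of remote) {
    const mine = localById.get(r.id);
    if (!mine) {
      drift.push({ type: 'missing_local', id: r.id });
    } else if (mine[field] !== r[field]) {
      drift.push({ type: 'mismatch', id: r.id, local: mine[field], remote: r[field] });
    }
  }
  return drift;
}
```

Fetching the remote list and repairing the drift stay provider-specific; the diff in the middle never changes.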
What I built (and why this matters)
I run HostingGuru, a managed PaaS for developers, agencies, and non-tech founders. I built it because I was tired of solving the same shape of problem on every client engagement: running a 3-process app (web service, worker, scheduled job) reliably usually means a Kubernetes setup or a $200/month bill. HostingGuru gives you all three primitives on the €35/month Pro tier, with a free tier that does not sleep.
This article exists because the AI log monitoring in HostingGuru flagged exactly this pattern for me last month: a client's webhook worker was silently 500ing on a specific event type for 4 hours before our Telegram alert fired. The pattern detection caught the cluster of "stripe_handle_failed" log lines, classified them as a hot Sentry fingerprint, and pinged me. I want every founder to have that net, even the ones who never become customers.
That's the only HostingGuru mention in this article. The patterns above stand on their own.
What to do tonight, regardless of platform
If you have not done these seven things, do them before you sleep:
1. Open your webhook handler. Move every database write that isn't "insert the raw event" out of the HTTP request. The HTTP handler should be 20 lines or less.
2. Add a unique constraint on the Stripe event ID column (or whichever upstream ID you use). If you cannot enforce idempotency at the DB layer, you do not have it.
3. Schedule a daily reconciliation cron against your payment provider's API. Even if it is a 30-line script that just logs drift, you need the visibility.
4. Add monitoring on webhook 5xx responses. Stripe shows this in its dashboard, but it is a polled view. Push it to whatever alerts you actually read (Telegram, Slack, Discord, email at minimum).
5. Test the failure path: stop your worker for 10 minutes, fire a test event from the Stripe CLI, and confirm it lands in the dead-letter state rather than disappearing. If you have never tested this, you do not have it.
6. Add a "dead" status alert. When any event moves to dead, page yourself. These are real customers in a bad state.
7. Audit how your post-checkout redirect handles the race condition with the webhook. If the user hits the success page before your webhook has fired, what do they see? "Upgrade pending" is correct; "Welcome to Pro" with no Pro access is the dispute path.
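On that redirect race: the success page should render from your database state, never from the mere fact that the redirect happened. A sketch of the decision (the copy and field names are illustrative):

```js
// Success-page state: trust the DB, not the redirect. If the webhook hasn't
// landed yet, say so honestly instead of promising access that isn't granted.
function successPageState(user) {
  return user.plan === 'pro'
    ? { heading: 'Welcome to Pro' }
    : { heading: 'Upgrade pending', note: 'This usually resolves within a minute.' };
}
```

Pair this with a short client-side poll or a "refresh" hint, and the race becomes invisible to the customer.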
That list takes an evening if you don't have any of it. Mine took half a Saturday on a small Express app, including the reconciliation cron.
The closing question
What's the worst webhook failure you've shipped? I'm collecting horror stories for a follow-up post on the rarer edge cases (idempotency keys colliding across test mode and live mode is one I lost a weekend to). Drop yours in the comments and I'll write up the patterns that fix them. The chargeback that prompted this article was a "we'll think about it later" hole in pattern 3, and "later" turned out to be a $64 mistake plus three sleepless days. I'd rather argue about edge cases now than wake up to another DM at 11pm.
Previous posts in this series:
- Heroku just went into "sustaining engineering mode." Here are 5 alternatives whose free tier actually doesn't sleep.
- I built my MVP with Claude Code. Now I need to deploy it. Here's what nobody tells you.
- Your AI app is silently burning $2,000/month and you don't know it. Here are the 5 patterns that bite founders.
- Telegram alerts for any production app, a 5-minute setup (no SaaS, no signup, just curl)
- How I built a Discord 'ship-tracker' bot in a weekend (and the 3-process architecture that keeps it alive 24/7)
- I migrated 12 client projects off Heroku. Here's the playbook (and the 7 things that bit me every single time).
- The Claude Code production checklist: 15 things that aren't obvious until they bite you
- Your indie SaaS has zero working Postgres backups. Here's the 20-minute fix (and the drill you need to run before you sleep tonight).