RAXXO Studios

Posted on Jul 3 • Originally published at raxxo.shop

The Stripe Webhook Patterns I Use to Avoid Lost Events

#ai #productivity #claudecode #automation

Signature verification on the edge blocks forged payloads before they hit my logic
Idempotency keys stop duplicate order creation from Stripe retries
A dead-letter table catches the 0.3 percent of events that fail processing
Manual replay from the Stripe dashboard recovers stuck webhooks in under 5 minutes

I lost 3 paid orders in my first month selling digital products because a webhook silently failed and I never knew. No error, no alert, just a customer email asking where their download link was. After that I rebuilt my Stripe handling around 4 patterns, and I have not lost an event since. Here is exactly what runs on every paid product I ship.

Pattern 1: Verify the Signature at the Edge

The first thing any Stripe webhook endpoint does is prove the request actually came from Stripe. Without this, anyone who finds your endpoint URL can POST fake payment events and trigger free downloads. I have seen unprotected endpoints get hit within hours of going live.

Stripe signs every webhook with a secret. The signature lives in the Stripe-Signature header. My handler reads the raw request body (not the parsed JSON, this matters), then calls the verification function with the raw body, the header, and my signing secret. If it fails, I return a 400 immediately and log nothing beyond a counter.

The detail that trips people up: you have to verify against the raw bytes. If your framework parses the body into JSON before your handler runs, the signature check fails, because re-serializing the parsed object does not reproduce the exact bytes Stripe signed (whitespace, key order, and escaping can all differ). On my edge functions I disable body parsing for that one route and grab the raw stream myself. About 15 minutes of setup, and it closes the biggest hole.
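To make the raw-bytes point concrete, here is a minimal sketch of the check in Python using only the standard library. It follows Stripe's documented scheme (an HMAC-SHA256 over the timestamp, a dot, and the raw body, compared against the v1 value in the Stripe-Signature header). The function name, the placeholder secret, and the 300-second tolerance are assumptions for illustration, and this is not the article's actual handler. In a real handler the official Stripe library's own verification helper is the safer choice.

```python
import hashlib
import hmac
import json
import time
from typing import Optional


def verify_stripe_signature(raw_body: bytes, sig_header: str, secret: str,
                            tolerance_s: int = 300,
                            now: Optional[float] = None) -> bool:
    """Return True only if sig_header carries a valid v1 signature over raw_body."""
    timestamp = None
    candidates = []
    for item in sig_header.split(","):
        key, _, value = item.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False

    # Reject stale timestamps so a captured request cannot be replayed later.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:
        return False

    # The signed payload is "<timestamp>.<raw body>", byte for byte.
    signed_payload = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    # Constant-time comparison against every v1 value in the header.
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)


# Demo with a made-up secret and event: sign the raw bytes, then verify them.
demo_secret = "whsec_demo_placeholder"
demo_raw = b'{"id": "evt_demo", "type": "checkout.session.completed"}'
demo_ts = "1700000000"
demo_sig = hmac.new(demo_secret.encode(), demo_ts.encode() + b"." + demo_raw,
                    hashlib.sha256).hexdigest()
demo_header = "t=" + demo_ts + ",v1=" + demo_sig

raw_ok = verify_stripe_signature(demo_raw, demo_header, demo_secret, now=1700000000)

# Parsing and re-serializing changes the bytes (here: whitespace), so it fails.
reserialized = json.dumps(json.loads(demo_raw), separators=(",", ":")).encode()
parsed_ok = verify_stripe_signature(reserialized, demo_header, demo_secret, now=1700000000)
```

In the demo, raw_ok is True and parsed_ok is False even though both bodies decode to the same JSON object. That is exactly the failure you get when a framework parses the body before the check.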

I run this check at the edge, meaning the verification happens in the function closest to the request before any database call. A forged payload never touches my order logic, never opens a connection, never costs me a compute cycle beyond the rejection. In practice this means a bot spraying my endpoint hits a wall and gives up.

One more thing I learned the hard way: keep separate signing secrets for test mode and live mode. I once deployed the test secret to production and every real payment webhook bounced with a signature error for 40 minutes. I only caught it because I had a dashboard counter (more on that later). Now the secret is an environment variable that gets swapped per environment, and I have a startup log line that prints which mode the key belongs to so a mismatch is obvious.

If signature verification passes, I do exactly one thing next: return a 200 fast and hand the work to the next pattern. Stripe expects a response inside a few seconds, and slow handlers cause retries you do not want.

Pattern 2: Idempotency Keys Against Duplicate Delivery

Stripe does not promise a webhook arrives exactly once. It promises at-least-once. That means the same checkout.session.completed event can land on your endpoint twice, sometimes three times, especially if your first response was slow. Without protection, one purchase becomes two order records, two download emails, two of everything.

Every Stripe event carries a unique id that looks like evt_1abc.... I treat that id as an idempotency key. Before I process anything, I try to insert that event id into a processed_events table with a unique constraint on the id column. If the insert succeeds, this is the first time I have seen it and I process normally. If the insert fails on the unique constraint, I have already handled it, so I return 200 and stop.

The insert-first approach beats a check-then-insert because it avoids a race. If two copies of the same event arrive within milliseconds (it happens), a read-then-write pattern lets both pass the check before either writes. The unique constraint at the database level makes the race impossible. Only one insert wins.
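The insert-first idea can be sketched in a few lines. This version uses SQLite from Python's standard library as a stand-in for whatever database the store actually runs on. The column names type and received_at and the function names are assumptions, not the article's schema.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS processed_events (
               id TEXT PRIMARY KEY,  -- the unique constraint does the real work
               type TEXT NOT NULL,
               received_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def claim_event(conn: sqlite3.Connection, event_id: str, event_type: str) -> bool:
    """Insert first. True means this is the first delivery of the event."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed_events (id, type) VALUES (?, ?)",
                (event_id, event_type),
            )
        return True
    except sqlite3.IntegrityError:
        # The id is already there: a duplicate delivery, so skip the work.
        return False


conn = open_db()
first = claim_event(conn, "evt_demo_1", "checkout.session.completed")
second = claim_event(conn, "evt_demo_1", "checkout.session.completed")
```

Here first is True and second is False. There is no separate SELECT step, so there is no window between a check and a write for a second copy of the event to slip through.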

I store the event id, a timestamp, and the event type. Nothing else. The table stays small and fast. I prune rows older than 30 days with a scheduled job because Stripe will not retry an event that old anyway. My table sits around 4,000 rows on a steady week and the lookup is instant on the indexed id.
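The pruning job is essentially a single DELETE. A rough sketch, again using SQLite as a stand-in and assuming the timestamp lives in a received_at column stored as UTC text (both are assumptions, as is the function name):

```python
import sqlite3


def prune_processed_events(conn: sqlite3.Connection, max_age_days: int = 30) -> int:
    """Delete rows older than max_age_days and return how many were removed."""
    with conn:
        cur = conn.execute(
            "DELETE FROM processed_events WHERE received_at < datetime('now', ?)",
            ("-" + str(max_age_days) + " days",),
        )
    return cur.rowcount


# Demo: one row from 40 days ago, one from right now.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_events ("
    "id TEXT PRIMARY KEY, type TEXT NOT NULL, received_at TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO processed_events VALUES "
    "('evt_demo_old', 'checkout.session.completed', datetime('now', '-40 days'))"
)
conn.execute(
    "INSERT INTO processed_events VALUES "
    "('evt_demo_new', 'checkout.session.completed', datetime('now'))"
)
conn.commit()

removed = prune_processed_events(conn)
```

Only the 40-day-old row is deleted. Any scheduler can run this, as long as the cutoff stays longer than the window in which the same event could still be delivered or resent.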

This one pattern eliminated the double-email complaints that made up most of my early support load. If you want the deeper context on wiring these patterns into a working store, see Claude Blueprint, which walks through the whole paid-product setup I run.

The subtle payoff: idempotency also protects you when you replay events manually during a bug fix. You can safely re-send 200 events at your endpoint knowing the already-processed ones will be skipped. That safety net is what makes Pattern 4 usable without fear.

Pattern 3: Dead-Letter Handling for the Events That Fail

Roughly 0.3 percent of my events fail on the first try. A database timeout, a third-party API being down, a bug I shipped that afternoon. The wrong move is to return a 500 and pray Stripe retries. Stripe does retry, but it backs off exponentially and eventually gives up (in live mode, after up to three days). If your bug lasts longer than the retry window, that event is gone.

My fix is a dead-letter table. When processing throws an error, I catch it, write the full event payload plus the error message and a retry count into a failed_events table, and then return 200 to Stripe. Returning 200 tells Stripe the event was received so it stops retrying on its own schedule. I now own the retry, not Stripe.

A scheduled job runs every 10 minutes and picks up any row in failed_events with a retry count under 5. It reprocesses the stored payload through the same handler. Because Pattern 2 uses idempotency keys, reprocessing an event that partially succeeded the first time will not duplicate anything, as long as each side effect (the order row, the email) is keyed on the event id as well. If it succeeds, I mark the row resolved. If it fails 5 times, I flag it for manual review and send myself an alert.
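Here is one way the dead-letter write and the retry job could look, sketched with SQLite and a pluggable handler. The column names, status values, and function names are all assumptions. The sketch also leaves out the processed_events check. A real retry path has to either bypass that check or release the claim when processing fails, otherwise the stored event would be skipped as a duplicate. The article does not spell out which, so treat that wiring as an open choice.

```python
import json
import sqlite3
from typing import Callable, List

MAX_ATTEMPTS = 5  # the retry cap described above

Handler = Callable[[dict], None]


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE failed_events (
               event_id TEXT PRIMARY KEY,
               payload TEXT NOT NULL,       -- full raw JSON, kept for replay
               last_error TEXT NOT NULL,
               retry_count INTEGER NOT NULL DEFAULT 0,
               status TEXT NOT NULL DEFAULT 'pending'
           )"""
    )
    return conn


def receive(conn: sqlite3.Connection, raw_body: bytes, handler: Handler) -> int:
    """Process one already-verified event. On failure, park it and still return 200."""
    event = json.loads(raw_body)
    try:
        handler(event)
    except Exception as exc:
        with conn:
            conn.execute(
                "INSERT OR IGNORE INTO failed_events (event_id, payload, last_error) "
                "VALUES (?, ?, ?)",
                (event["id"], raw_body.decode("utf-8"), repr(exc)),
            )
    return 200  # the retry is now owned locally, not by Stripe


def retry_pending(conn: sqlite3.Connection, handler: Handler) -> List[str]:
    """Scheduled job body. Returns the ids that just ran out of attempts."""
    flagged = []
    rows = conn.execute(
        "SELECT event_id, payload, retry_count FROM failed_events "
        "WHERE status = 'pending' AND retry_count < ?",
        (MAX_ATTEMPTS,),
    ).fetchall()
    for event_id, payload, retry_count in rows:
        try:
            handler(json.loads(payload))
        except Exception as exc:
            attempts = retry_count + 1
            status = "needs_review" if attempts >= MAX_ATTEMPTS else "pending"
            with conn:
                conn.execute(
                    "UPDATE failed_events SET retry_count = ?, last_error = ?, status = ? "
                    "WHERE event_id = ?",
                    (attempts, repr(exc), status, event_id),
                )
            if status == "needs_review":
                flagged.append(event_id)  # this is where an alert would go out
        else:
            with conn:
                conn.execute(
                    "UPDATE failed_events SET status = 'resolved' WHERE event_id = ?",
                    (event_id,),
                )
    return flagged
```

The list returned by retry_pending is the hook for the alert: anything in it has used up its automatic attempts and needs a human.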

The alert matters more than the automation. When something lands in the dead-letter table and cannot self-heal, I want to know within minutes, not from a customer email 2 days later. I pipe the alert to a private channel and to my phone. Last month one event stuck because a download-hosting provider had an outage. My job retried, the provider came back after 22 minutes, and the next retry cleared it. The customer never noticed a thing.

I also log every dead-letter write with a counter. A sudden spike from 2 per day to 50 tells me I broke something in a deploy before any customer complains. That counter has saved me twice. The same alerting habit carries over to how I monitor everything else, which I cover in more depth at Claude Blueprint.

The mindset shift is simple. Assume events will fail. Build the recovery path first.

Pattern 4: Manual Replay When a Webhook Gets Stuck

Even with the first 3 patterns, sometimes an event needs a human. Maybe I deployed broken code and 12 events failed all 5 automatic retries. Maybe I added a new product type my handler did not recognize. When that happens, I have two recovery routes, and both take under 5 minutes.

Route one is the Stripe dashboard. Every webhook endpoint in Stripe shows a log of recent deliveries with their response codes. I filter to the failed ones, click into a specific event, and hit "Resend". Stripe fires the exact same payload at my endpoint again. Since my signature verification and idempotency both handle re-delivery cleanly, this is safe to do dozens of times. I once resent 30 events in a row after fixing a bug and every one processed correctly on the resend.

Route two is my own dead-letter table. If Stripe already gave up and the event only lives in my failed_events rows, I do not need Stripe at all. I have a small admin action that reprocesses a single row on demand. I fix the underlying code, deploy, then click reprocess. The stored payload runs through the fixed handler.
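The admin action for route two can be very small. A hedged sketch, assuming a failed_events table with event_id, payload, last_error, and status columns (assumed names) and a handler callable that stands in for the fixed order logic:

```python
import json
import sqlite3
from typing import Callable


def reprocess_row(conn: sqlite3.Connection, event_id: str,
                  handler: Callable[[dict], None]) -> bool:
    """Re-run one stored payload on demand, regardless of its retry count."""
    row = conn.execute(
        "SELECT payload FROM failed_events WHERE event_id = ?", (event_id,)
    ).fetchone()
    if row is None:
        raise KeyError(event_id)
    try:
        handler(json.loads(row[0]))
    except Exception as exc:
        # Still broken: record the newest error and leave the row for review.
        with conn:
            conn.execute(
                "UPDATE failed_events SET last_error = ? WHERE event_id = ?",
                (repr(exc), event_id),
            )
        return False
    with conn:
        conn.execute(
            "UPDATE failed_events SET status = 'resolved' WHERE event_id = ?",
            (event_id,),
        )
    return True


# Demo: a row that exhausted its automatic retries, replayed after a fix.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE failed_events (event_id TEXT PRIMARY KEY, payload TEXT NOT NULL, "
    "last_error TEXT NOT NULL, status TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO failed_events VALUES (?, ?, ?, ?)",
    ("evt_demo_1", '{"id": "evt_demo_1", "type": "checkout.session.completed"}',
     "KeyError('product_type')", "needs_review"),
)
conn.commit()

handled = []
resolved = reprocess_row(conn, "evt_demo_1", lambda event: handled.append(event["id"]))
```

Because the row carries the full payload, the replay needs nothing from Stripe. It only needs the fix to be deployed before someone clicks the button.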

The trick to making replay painless is deciding on it early. Do not bolt on replay after your first outage. Store the raw payload from day one. A failed event you did not store is a failed event you cannot recover, and Stripe only keeps its own copy for a limited window. Storing the full JSON costs almost nothing and turns a 2 am panic into a calm click.

I also keep a short runbook in plain text next to my code. Three steps: check the dead-letter table, decide dashboard-resend or admin-reprocess, verify the counter dropped to zero. Having the steps written down means I do not have to think under pressure, I just follow the list. I used the same runbook habit when I wrote my scheduling and syndication flow. Buffer handles the social posting side, which kept that whole pipeline off my plate so I could focus on payment reliability.

Bottom Line

These 4 patterns are not clever. Signature verification blocks forged requests, idempotency keys kill duplicates, a dead-letter table catches failures, and manual replay recovers the rest. Together they turned lost orders from a recurring headache into a problem I have not seen in over a year. The whole setup fits in maybe 150 lines of handler code plus two small tables.

If you sell anything through Stripe, build the recovery path before you launch, not after your first outage. Store the raw payload, add the unique constraint, wire an alert to your phone. The 0.3 percent of events that fail are the ones customers remember, so make those the ones you handle best.

If you want to see how this fits into a full solo product stack, from payments to hosting to the store itself on Shopify, the Claude Blueprint lays out the whole system I run day to day. Start with signature verification this week. It is the fastest win and closes the biggest hole.

This article contains affiliate links. If you sign up through them, I may earn a small commission at no extra cost to you. (Ad)

Top comments (1)

Marcus Kim • Jul 3

The insert-first processed_events table with a unique constraint is the detail I wish more webhook examples led with, because it turns Stripe's at-least-once delivery into something you can reason about under race conditions. Pairing that with storing the raw payload in a dead-letter table and retrying every 10 minutes makes replay an expected path, not an emergency patch. From a founder/engineer angle, this is the right kind of boring reliability work: a customer only sees the missing entitlement, not the clean checkout, so the recovery system is part of the product experience.