Alex Cloudstar

Posted on May 15 • Originally published at alexcloudstar.com

Stripe Webhooks in Production: Idempotency, Retries, and the Mistakes That Cost Me Real Money

#saas #backend #architecture #devtools

The first Stripe webhook bug that cost me actual money happened on a Tuesday. A user signed up for a paid plan, the checkout.session.completed event arrived, my handler created their workspace, and then Stripe retried the same event nine seconds later because my response had taken longer than the timeout. The retry created a second workspace. The user could see both. They picked the first one and ignored the second. A week later they cancelled because their data had silently been split across two accounts and they thought we were buggy.

We were buggy. Not because the webhook code was wrong in any single line, but because I had treated webhooks like a normal API call. They are not a normal API call. They are a message queue with weird rules, and if you do not respect those rules you ship a product that quietly corrupts billing state.

This post is the version of the webhook integration guide I wish I had read before shipping. The Stripe docs are good. They are not pessimistic enough about what happens when your handler meets the real internet.

What Stripe Actually Sends You

A webhook is a POST request from Stripe to a URL you control, with a JSON body describing an event. Events are things that happened: a checkout completed, an invoice was paid, a subscription was updated, a payment failed. Stripe sends every event to every endpoint you have configured to listen for that event type.

The shape of the request is simple. Headers contain a signature. The body is JSON. You verify the signature, parse the body, do something, return 200. Stripe sees the 200, marks the event delivered, and moves on. That is the happy path.

The unhappy path is where the work is. If your endpoint returns anything other than a 2xx within Stripe's timeout, Stripe retries. If your endpoint times out, Stripe retries. If your endpoint returns 200 but you crashed before persisting anything, the event is lost from your side and Stripe thinks it succeeded. If two events arrive at almost the same time, you can process them out of order. If Stripe's own infrastructure has a delivery delay, you can receive a customer.subscription.deleted before the customer.subscription.updated that preceded it.

Three things to internalize before writing a single line of handler code:

Stripe will retry the same event many times. Your handler must be idempotent. Processing the same event twice must produce the same result as processing it once.
Events do not arrive in order. Your handler cannot assume that the event you are reading is the most recent state of the underlying object.
Stripe does not care about your downstream systems. If your database is down, your queue is full, or your downstream API has rate-limited you, that is your problem. Stripe just keeps retrying.

If you build around these three rules from day one, the rest of the work is small. If you do not, you spend the next year writing patches.

Signature Verification, And Why You Cannot Skip It

The first thing your handler does, before parsing the body, is verify the signature. Stripe signs every request with a secret you configured on the endpoint. The signature is in a header called Stripe-Signature. If the signature does not match, drop the request. Period.


import Stripe from 'stripe';

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);
const webhookSecret = process.env.STRIPE_WEBHOOK_SECRET!;

export async function POST(req: Request) {
  const signature = req.headers.get('stripe-signature');
  if (!signature) return new Response('No signature', { status: 400 });

  // Read the raw body: the signature covers the exact bytes Stripe sent
  const rawBody = await req.text();
  let event: Stripe.Event;

  try {
    event = stripe.webhooks.constructEvent(rawBody, signature, webhookSecret);
  } catch {
    return new Response('Invalid signature', { status: 400 });
  }

  // event is now verified; handle it, then acknowledge
  return new Response('ok', { status: 200 });
}

Two things people get wrong here. First, you must use the raw request body, not a parsed JSON object. The signature is computed over the exact bytes Stripe sent. If your framework auto-parses JSON, you have to either disable that for the webhook route or read the raw body separately. Next.js App Router gives you req.text() which works. Express needs express.raw({ type: 'application/json' }) for the route.

Second, the signature header includes a timestamp. The constructEvent helper enforces a default tolerance of 300 seconds, which prevents replay attacks where someone captures a webhook payload and resends it later. Do not extend this tolerance unless you have a specific reason. Five minutes is plenty.
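To make the raw-body and timestamp points concrete, here is a rough sketch of the check the helper performs, based on Stripe's documented scheme: the header carries t=<timestamp> and one or more v1=<signature> values, and the signature is an HMAC-SHA256 over the timestamp, a dot, and the raw body. The function names computeSignature and verifySignature are mine, for illustration only. In real code, keep using constructEvent.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// HMAC-SHA256 over "<timestamp>.<raw body>", hex-encoded.
function computeSignature(rawBody: string, timestamp: number, secret: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${rawBody}`, "utf8").digest("hex");
}

// Hypothetical helper for illustration; not part of the Stripe SDK.
function verifySignature(
  rawBody: string,
  header: string,
  secret: string,
  nowSeconds: number,
  toleranceSeconds = 300
): boolean {
  let timestamp = NaN;
  const candidates: string[] = [];

  // The header looks like "t=1700000000,v1=abc123..."
  for (const part of header.split(",")) {
    const idx = part.indexOf("=");
    if (idx === -1) continue;
    const key = part.slice(0, idx).trim();
    const value = part.slice(idx + 1).trim();
    if (key === "t") timestamp = Number(value);
    if (key === "v1") candidates.push(value);
  }
  if (!Number.isFinite(timestamp) || candidates.length === 0) return false;

  // Reject timestamps older than the tolerance to block replays
  if (nowSeconds - timestamp > toleranceSeconds) return false;

  // Constant-time comparison against each v1 candidate
  const expected = Buffer.from(computeSignature(rawBody, timestamp, secret), "utf8");
  return candidates.some((candidate) => {
    const given = Buffer.from(candidate, "utf8");
    return given.length === expected.length && timingSafeEqual(given, expected);
  });
}
```

Changing a single byte of the body changes the HMAC, which is why a re-serialized JSON object fails verification even when it looks identical.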

If you are testing locally, the Stripe CLI forwards events with a generated webhook secret you can use during development. The signature is real. The tolerance check still applies. This is by design and it catches plenty of bugs before they ship.

Idempotency Is The Whole Job

Every webhook handler is a function that takes an event ID and updates some state. The contract is: processing the same event ID twice must do the same thing as processing it once. That is what idempotency means in this context.

The simplest implementation is a table of processed event IDs.

CREATE TABLE processed_webhook_events (
  event_id TEXT PRIMARY KEY,
  event_type TEXT NOT NULL,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

In the handler, you insert the event ID before doing anything else. If the insert hits a primary key conflict, it returns no rows and you return 200 immediately. The event has already been processed.

const result = await db.query(
  `INSERT INTO processed_webhook_events (event_id, event_type)
   VALUES ($1, $2)
   ON CONFLICT (event_id) DO NOTHING
   RETURNING event_id`,
  [event.id, event.type]
);

if (result.rowCount === 0) {
  // already processed
  return new Response('ok', { status: 200 });
}

This is the cheapest idempotency strategy that works. It has two limitations that bite you in real production.

The first is the window between insert and the actual work. If you insert the event ID, then crash before doing the work, the next retry sees the event ID and skips. The work is lost.

The fix is to wrap the insert and the work in the same database transaction. If the work fails, the transaction rolls back, the event ID is gone, and the retry will process it. If the work succeeds, both the event ID and the work commit together.

await db.transaction(async (tx) => {
  const result = await tx.query(
    `INSERT INTO processed_webhook_events (event_id, event_type)
     VALUES ($1, $2)
     ON CONFLICT (event_id) DO NOTHING
     RETURNING event_id`,
    [event.id, event.type]
  );

  if (result.rowCount === 0) return; // already processed

  await handleEvent(tx, event);
});

The second limitation is that this only protects against duplicate event IDs. If Stripe sends two genuinely different events for the same underlying state change (which can happen, especially with subscription lifecycle events), each event has its own ID and both will be processed. Your handler has to be idempotent at the business level too. Updating a subscription to active when it is already active should be a no-op, not an error.
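A minimal sketch of what business-level idempotency can look like, using an in-memory Map as a stand-in for a subscriptions table. The row shape and the upsertSubscription name are assumptions for illustration, not part of any library.

```typescript
type SubscriptionStatus = "trialing" | "active" | "past_due" | "canceled";

interface SubscriptionRow {
  status: SubscriptionStatus;
  planId: string;
}

// In-memory stand-in for a subscriptions table (an assumption for this sketch)
const subscriptions = new Map<string, SubscriptionRow>();

// Returns true if a write happened, false if the row already matched.
function upsertSubscription(id: string, desired: SubscriptionRow): boolean {
  const current = subscriptions.get(id);
  if (current && current.status === desired.status && current.planId === desired.planId) {
    return false; // already in the desired state: a no-op, not an error
  }
  subscriptions.set(id, { ...desired });
  return true;
}
```

Two different events that describe the same state change both call the same function. The first one writes. The second one finds nothing to do.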

The Reordering Problem

Webhooks do not arrive in order. The Stripe docs say this in a sentence and most readers skip past it. Then they ship a handler that assumes the latest event reflects the latest state, and they get strange bugs that take days to track down.

The clearest version of this problem is subscription lifecycle. A user upgrades from plan A to plan B. Stripe emits a customer.subscription.updated event. Two seconds later they downgrade to plan C. Another customer.subscription.updated event. Both events sit in Stripe's delivery queue. The second one arrives first. Your handler sets the user's plan to C. Then the first event arrives and your handler sets the plan back to B. The user is on B in your database and C in Stripe.

The fix is to never trust the event payload as the source of truth for current state. The event tells you something happened. The current state lives on the object in Stripe. For anything that matters (subscription state, customer state, invoice state), refetch the object from Stripe by ID and use that response.

async function handleSubscriptionUpdated(event: Stripe.Event) {
  const subscriptionId = (event.data.object as Stripe.Subscription).id;

  // do not trust the event payload, refetch
  const subscription = await stripe.subscriptions.retrieve(subscriptionId);

  await upsertSubscriptionState(subscription);
}

This costs you an extra API call per webhook. It is worth it. The extra call returns the canonical current state, regardless of which event arrived in which order. Your database converges to the right state even when the events arrive in the wrong order.

There is a more subtle version of this problem for objects that change rapidly. If a subscription is updated five times in a second, you might get five webhooks and each refetch returns whatever the state is at the time you ask, which might be the same value for all five. That is fine. The end state is correct, which is the only thing that matters.

For events where refetching is not an option (the object has since been deleted, say, or you cannot afford the extra API call), you store the relevant fields from the event and accept that a late arrival might carry stale data. The fix is to compare timestamps, such as the created field every event carries, and only apply the update if it is newer than what you already stored.
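The only-apply-if-newer rule can be sketched like this, assuming you store the event's created timestamp next to the fields you copied from it. The Map stands in for a database table, and applyIfNewer is a hypothetical helper.

```typescript
interface StoredPayment {
  status: string;
  lastEventCreated: number; // Unix seconds, copied from the event's `created` field
}

// In-memory stand-in for a payments table (an assumption for this sketch)
const payments = new Map<string, StoredPayment>();

// Applies the update only if this event is newer than the last one applied.
// Returns true if the stored row changed.
function applyIfNewer(objectId: string, eventCreated: number, status: string): boolean {
  const existing = payments.get(objectId);
  if (existing && existing.lastEventCreated >= eventCreated) {
    return false; // stale or same-second event: keep what we have
  }
  payments.set(objectId, { status, lastEventCreated: eventCreated });
  return true;
}
```

The created field has one-second resolution, so two events in the same second tie and the first one applied wins. That is a limit of this approach, and one more reason to refetch whenever you can.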

Respond Fast, Process Later

Stripe's timeout for webhooks is approximately 30 seconds, but the practical budget you should target is 5 seconds or less. The reason is not Stripe's patience. It is your tail latency. A handler that takes 8 seconds on a good day can take several times that when the database is slow or a downstream API stalls, and on those days you time out, get retried, and double-process events. The solution is to do as little work as possible in the handler itself and offload the rest.

The pattern is:

Verify the signature.
Insert the event into a local queue table or a real message queue.
Return 200 immediately.
Process the queued event asynchronously in a worker.

export async function POST(req: Request) {
  const signature = req.headers.get('stripe-signature');
  if (!signature) return new Response('No signature', { status: 400 });
  const rawBody = await req.text();

  let event: Stripe.Event;
  try {
    event = stripe.webhooks.constructEvent(rawBody, signature, webhookSecret);
  } catch {
    return new Response('Invalid signature', { status: 400 });
  }

  await db.query(
    `INSERT INTO webhook_events (event_id, event_type, payload, status)
     VALUES ($1, $2, $3, 'pending')
     ON CONFLICT (event_id) DO NOTHING`,
    [event.id, event.type, JSON.stringify(event)]
  );

  return new Response('ok', { status: 200 });
}

A separate worker reads from webhook_events, processes each event, and updates the status. This decouples the speed of your processing from the speed of your acknowledgement. Your webhook endpoint becomes a fast, dumb, almost-impossible-to-break inbox.
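Here is a rough sketch of that inbox-plus-worker loop, with an in-memory array standing in for the webhook_events table. The 'processing', 'done', and 'failed' statuses and the enqueue and processNext names are my assumptions. A real worker would run the same steps against the database.

```typescript
type Status = "pending" | "processing" | "done" | "failed";

interface QueuedEvent {
  eventId: string;
  eventType: string;
  status: Status;
  attempts: number;
}

// In-memory stand-in for the webhook_events table
const queue: QueuedEvent[] = [];

// Mirrors INSERT ... ON CONFLICT (event_id) DO NOTHING
function enqueue(eventId: string, eventType: string): boolean {
  if (queue.some((e) => e.eventId === eventId)) return false;
  queue.push({ eventId, eventType, status: "pending", attempts: 0 });
  return true;
}

// Processes the oldest pending event, one at a time.
// Returns the row it touched, or undefined if nothing was pending.
function processNext(handle: (event: QueuedEvent) => void): QueuedEvent | undefined {
  const next = queue.find((e) => e.status === "pending");
  if (!next) return undefined;

  next.status = "processing";
  next.attempts += 1;
  try {
    handle(next);
    next.status = "done";
  } catch {
    next.status = "failed"; // a real worker would schedule a retry here
  }
  return next;
}
```

The endpoint only ever calls enqueue. Everything slow or fragile lives behind processNext, where a failure costs you a retry instead of a timed-out webhook.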

The trade-off is that you now have a second system to operate. The worker needs to handle failures, retries, and dead-letter cases. If you do not have a queueing system already, the cheapest version is a polling worker that selects pending events and processes them one at a time. It is not elegant. It works.
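As a sketch of the retry and dead-letter decision, here is one way to express it as a pure function. The cap of five retries and the one-second base delay are assumptions you would tune, and decideRetry is a hypothetical name.

```typescript
const MAX_RETRIES = 5; // assumption: give up after five retries
const BASE_DELAY_MS = 1000; // assumption: first retry waits one second

type RetryDecision =
  | { action: "retry"; delayMs: number }
  | { action: "dead_letter" };

// retriesSoFar is the number of retries already made for this event.
function decideRetry(retriesSoFar: number): RetryDecision {
  if (retriesSoFar >= MAX_RETRIES) return { action: "dead_letter" };
  // Exponential backoff: 1s, 2s, 4s, 8s, 16s
  return { action: "retry", delayMs: BASE_DELAY_MS * 2 ** retriesSoFar };
}
```

Keeping the decision pure makes it trivial to unit test, and the worker only has to store the retry count and the next run time on the row.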

For a more complete look at queue selection for solo developers, the background jobs and observability question is the broader topic this pattern sits inside.

Race Conditions With Your Own Application

The most insidious bugs come from webhooks racing with user actions. A user completes checkout in their browser. Stripe redirects them to your success_url, and your app tries to read the subscription state to show them their new plan. The webhook has not arrived yet. Your app reads stale state. The user sees the old plan.

There are three approaches to this, in increasing order of robustness.

The first is to poll. After the user lands on the success page, your frontend polls your backend for a few seconds checking whether the subscription is active. The webhook usually arrives within a second or two. The polling stops as soon as it does. This is ugly but it works for most products.

The second is to fetch synchronously. When the user lands on the success page, your backend hits Stripe directly to fetch the current subscription state and write it to your database before responding. The webhook still arrives and is idempotent, but you do not depend on it for the immediate UX. This costs you an extra Stripe API call per checkout but eliminates the race entirely.

The third is to make the synchronous fetch and the webhook converge on the same code path. Your success_url handler calls the same function the webhook would call, passing the subscription ID. The function refetches state from Stripe and upserts. Whichever one runs first wins. The other is a no-op. This is the cleanest answer and it generalises beyond checkout to any user flow that depends on Stripe state.

async function reconcileSubscription(subscriptionId: string) {
  const subscription = await stripe.subscriptions.retrieve(subscriptionId);
  await upsertSubscriptionState(subscription);
}

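// from the webhook, after narrowing on event.type
const subscription = event.data.object as Stripe.Subscription;
await reconcileSubscription(subscription.id);

// from the success_url handler; session.subscription may be an ID, an expanded object, or null
if (typeof session.subscription === 'string') {
  await reconcileSubscription(session.subscription);
}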

The function is idempotent. It can be called from either path with no coordination. The race disappears.

What To Listen For And What To Ignore

Stripe sends many event types. Most products only care about a handful. Listening for events you do not handle carries a small runtime cost (you still have to verify signatures before ignoring them) and a bigger cognitive cost (the events show up in logs and confuse you when you debug).

For a typical SaaS with subscriptions, the events that matter are:

checkout.session.completed: a user completed a hosted checkout. Use this to provision their account.
customer.subscription.created, customer.subscription.updated, customer.subscription.deleted: the canonical subscription lifecycle. Refetch the subscription on every event.
invoice.paid: a recurring invoice was paid. Use this to extend the user's access.
invoice.payment_failed: a recurring invoice failed. Use this to flag the user for dunning, suspend access, or send a payment update email.
customer.subscription.trial_will_end: three days before a trial expires. Useful for sending warning emails.

If you are doing one-time payments, you care about payment_intent.succeeded and payment_intent.payment_failed. If you are doing usage-based billing, you care about invoice.upcoming so you can preview the next bill.

Everything else, ignore unless you have a specific reason to care. Stripe has more than 200 event types. Most are diagnostic or only relevant for specific products (Connect, Issuing, Terminal). Listening to all of them is a recipe for noise.

Configure the endpoint in Stripe's dashboard to only send the events you handle. This reduces the volume hitting your endpoint and reduces the surface area of what can go wrong.
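On the code side, the same idea can be sketched as a small dispatcher that routes by event type and quietly acknowledges anything it does not know. The createDispatcher name and the handler bodies here are placeholders of mine, not a prescribed structure.

```typescript
interface WebhookEvent {
  id: string;
  type: string;
}

type Handler = (event: WebhookEvent) => void;
type DispatchResult = "handled" | "ignored";

// Builds a dispatch function from a map of event type to handler.
function createDispatcher(handlers: Map<string, Handler>) {
  return function dispatch(event: WebhookEvent): DispatchResult {
    const handler = handlers.get(event.type);
    if (!handler) return "ignored"; // unknown type: acknowledge and move on
    handler(event);
    return "handled";
  };
}

// Placeholder handlers for illustration
const provisioned: string[] = [];
const dispatch = createDispatcher(
  new Map<string, Handler>([
    ["checkout.session.completed", (e) => { provisioned.push(e.id); }],
    ["invoice.paid", () => { /* extend access here */ }],
  ])
);
```

An ignored event should still get a 200. Returning an error for a type you do not handle just earns you retries of something you will never process.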

Testing Webhooks Without Losing Your Mind

Local development with webhooks used to be miserable. The Stripe CLI fixed most of it. You run stripe listen --forward-to localhost:3000/api/webhooks/stripe and it forwards real Stripe events from your test account to your local server with a temporary webhook secret. You can also trigger specific events with stripe trigger checkout.session.completed for testing handlers in isolation.

The trigger command is what most people miss. You do not have to manually create checkouts and subscriptions to test every handler. Stripe ships a list of common scenarios you can fire with one command. This makes integration testing tractable.

For unit tests, the Stripe Node SDK exposes the same constructEvent function. You can build a fake event payload, sign it with a test secret, and run your handler against it. This is fast and reliable. The only thing you cannot easily simulate locally is the order in which events arrive, but you can build that into your tests by deliberately calling your handlers out of order and confirming the end state is correct.
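One way to build the out-of-order check into a test is to run every permutation of a small event list and assert the end state is always the same. This is a sketch with a fake in-memory source of truth standing in for the Stripe API. All names here are hypothetical.

```typescript
interface SubEvent {
  id: string;
  subscriptionId: string;
  planInPayload: string;
}

type Db = Map<string, string>;
type EventHandler = (db: Db, event: SubEvent) => void;

// Fake source of truth: in "Stripe", the subscription ended up on plan_c
const stripeState = new Map<string, string>([["sub_1", "plan_c"]]);

// Refetching handler: ignores the payload, reads current state
const handleByRefetch: EventHandler = (db, event) => {
  const current = stripeState.get(event.subscriptionId);
  if (current !== undefined) db.set(event.subscriptionId, current);
};

// Naive handler: trusts whatever the event payload says
const handleByPayload: EventHandler = (db, event) => {
  db.set(event.subscriptionId, event.planInPayload);
};

function permutations<T>(items: T[]): T[][] {
  if (items.length <= 1) return [items.slice()];
  const result: T[][] = [];
  items.forEach((item, i) => {
    const rest = [...items.slice(0, i), ...items.slice(i + 1)];
    for (const perm of permutations(rest)) result.push([item, ...perm]);
  });
  return result;
}

// Runs the handler over every delivery order and collects the distinct end states
function endStates(handler: EventHandler, events: SubEvent[]): Set<string> {
  const states = new Set<string>();
  for (const order of permutations(events)) {
    const db: Db = new Map();
    for (const event of order) handler(db, event);
    states.add(db.get("sub_1") ?? "missing");
  }
  return states;
}

const events: SubEvent[] = [
  { id: "evt_1", subscriptionId: "sub_1", planInPayload: "plan_b" },
  { id: "evt_2", subscriptionId: "sub_1", planInPayload: "plan_c" },
];
```

A handler that is safe under reordering produces exactly one end state. The naive handler produces two, which is the bug from the subscription example showing up in a test instead of in production.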

For end-to-end tests against the real Stripe API in test mode, the trick is to use idempotency keys on every Stripe API call that creates or modifies something. This means a flaky test that retries does not create duplicate customers or charges. The key travels in the Idempotency-Key header; passing the same key with the same parameters returns the cached response instead of creating a new resource.

await stripe.customers.create(
  { email: 'test@example.com' },
  { idempotencyKey: `test-customer-${testRunId}` }
);

This is unrelated to webhook idempotency but worth mentioning because both protect against duplicate work, and people often have one and not the other.

What I Run In Production

The setup that has not bitten me in eighteen months is:

A single webhook endpoint that handles all event types. The endpoint verifies the signature, inserts the raw event into a webhook_events table with status = 'pending', and returns 200. Total work in the request handler: signature verification plus one insert.

A worker process that polls the webhook_events table every second, picks up pending events, and dispatches them to type-specific handlers. The worker uses SELECT ... FOR UPDATE SKIP LOCKED so multiple worker instances can run safely.

Type-specific handlers that refetch the relevant Stripe object before applying any state changes. No handler trusts the event payload as the source of truth for current state.

A retry policy in the worker that retries failed events up to five times with exponential backoff, then moves them to a dead-letter table that pages me if anything lands there. The dead-letter table has had four entries in eighteen months. Each was a genuine bug I needed to know about.

An idempotency check at the start of each handler, even though the queue table also has unique constraints. Belt and braces.

A reconcile function that can be called from both the webhook path and the user-facing checkout success path, so races between the two converge instead of conflicting.

A daily cron job that fetches all active subscriptions from Stripe and reconciles them against my database. This catches anything I missed: dropped events, edge cases, bugs in my own code. It runs at 3am and emails me a diff if it finds anything. In eighteen months it has caught exactly two real issues, both of which were my fault.

That last one is the thing most teams skip. Webhooks are a delivery mechanism, not a guarantee. Stripe themselves recommend periodic reconciliation against the API as the canonical source of truth. If your billing state matters (and if you are reading this, it does), you want a backstop that does not depend on every webhook firing correctly forever.
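The comparison at the heart of that job can be sketched as a pure diff function. The SubState and Mismatch shapes are assumptions for illustration. Fetching the two lists (from the Stripe API and from your database) is left out.

```typescript
interface SubState {
  id: string;
  status: string;
}

interface Mismatch {
  id: string;
  stripeStatus: string | null; // null: not in the list fetched from Stripe
  localStatus: string | null; // null: not in the local database
}

// Compares what Stripe says against what the local database says.
function diffSubscriptions(fromStripe: SubState[], local: SubState[]): Mismatch[] {
  const localById = new Map<string, string>();
  for (const row of local) localById.set(row.id, row.status);

  const stripeIds = new Set<string>();
  const mismatches: Mismatch[] = [];

  // Anything Stripe knows about that is missing or different locally
  for (const sub of fromStripe) {
    stripeIds.add(sub.id);
    const localStatus = localById.get(sub.id) ?? null;
    if (localStatus !== sub.status) {
      mismatches.push({ id: sub.id, stripeStatus: sub.status, localStatus });
    }
  }

  // Anything we hold locally that Stripe did not return
  for (const row of local) {
    if (!stripeIds.has(row.id)) {
      mismatches.push({ id: row.id, stripeStatus: null, localStatus: row.status });
    }
  }
  return mismatches;
}
```

An empty result means the webhooks did their job. A non-empty one is the email you want to get at 3am instead of from a customer.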

What I Would Tell You If You Asked

Most webhook bugs are not in the webhook code. They are in the assumption that webhooks are simple. They are not simple. They are a distributed system with at-least-once delivery, out-of-order events, and timeouts that turn correctness bugs into double-billing incidents.

If you have one weekend to ship a webhook integration that will survive contact with real users, do this:

Verify signatures. Insert events into a queue. Return 200 fast. Process from the queue with idempotent handlers that refetch state from Stripe. Add a daily reconciliation job. Wire up alerts on the dead-letter table.

That is it. Everything else is a refinement on top of that pattern. The pattern itself does not change between a side project doing $50 a month and a SaaS doing $50,000 a month. The volume changes. The architecture does not.

For the broader question of which auth provider sits in front of your billing flow, the auth comparison post covers the trade-offs that matter for a billing-heavy product. And if you are still picking your stack, the stop obsessing about the perfect stack post is the thing I should have read three projects ago.

Webhooks are one of the few areas where the boring, paranoid version of the integration is also the cheapest one to maintain. Build it boring. Build it paranoid. Sleep through your weekends.

DEV Community