edhiblemeer

Posted on May 6 • Edited on May 11

Stripe Webhook Was Silently Failing for 5 Days: The 4xx Retry Trap and the Beginning-of-Month Time Bomb

#stripe #webhook #nestjs #incident

TL;DR

21 invoice.paid webhooks failed for 5 straight days in production.
We only noticed because Stripe sent a "we'll auto-disable this endpoint by 5/10" warning email.
Root cause: a DB integrity gap caused our handler to throw HttpException(BAD_REQUEST) → Stripe treats 4xx as retry-eligible → a retry loop that could never succeed.
Lesson: Stripe webhook 4xx is not "client error, give up." It's "please try again." DB lookup misses should be console.warn + 200 OK.
Bonus lesson: invoice.paid only fires on subscription cycle (once a month). Five days of silent failure went completely unnoticed.

I'm running tasteck, an industry-specific SaaS for the Japanese nightlife industry. This is a real incident report from production.

The wake-up call

One morning, an email from Stripe:

We've encountered an issue when sending an event to your webhook endpoint at https://api.tasteck.tech/.../payment/webhook

We've attempted to send 16 events since 2026-05-01 02:02:21 UTC, all failing.

Stripe will stop sending events to this endpoint by 2026-05-10 02:02:21 UTC.

If we didn't fix it before 5/10, the endpoint would be automatically disabled. Subscription invoice notifications would just stop. That's catastrophic for a billing-driven SaaS.

Triage: is the endpoint dead?

First, a sanity check with curl:

curl -X POST -H "Content-Type: application/json" \
  -d '{"type":"ping"}' \
  https://api.tasteck.tech/.../payment/webhook
# → 201 Created

The endpoint is alive. Only specific events from Stripe are failing. Different problem.

Finding the actual error in PM2 logs

ssh prod "grep -i 'subscription' /home/ec2-user/.pm2/logs/api-error.log | tail"

# → HttpException: Subscription item not found.  (repeated many times)

The handler code (NestJS + TypeORM):

const stripeSubscriptionItem = await em
  .getRepository(StripeSubscriptionItem)
  .createQueryBuilder("item") // .where() lives on the query builder, not the repository
  .where("item.stripe_id = :stripeId", { stripeId: subscriptionId })
  .getOne();

if (!stripeSubscriptionItem) {
  throw new HttpException(
    "Subscription item not found.",
    HttpStatus.BAD_REQUEST  // ← this is the problem
  );
}

Trap #1: Stripe webhook 4xx IS retry-eligible

This is where REST-API instincts betray you.

A normal API: "client gave us bad data → return 4xx → client should fix it → don't retry."

But Stripe webhooks are not a normal API. From their docs:

Stripe considers any HTTP response code in the range 200-299 as a successful delivery. Anything else, including 4xx and 5xx, is treated as a failure and Stripe will retry.

So our chain was:

DB has integrity gap for one customer
Handler can't find the record
Handler throws HttpException(400)
Stripe sees 4xx → schedules retry (exponential backoff)
Retry hits the same DB gap → another 4xx → another retry
After 3 days, Stripe gives up → emits "we'll auto-disable in 7 days" email
Endpoint gets auto-disabled. Game over.

The fix is structural: webhook handlers should almost never return 4xx for application-level "data not found" cases. Log a warning, return 200, move on. The 4xx semantic doesn't fit the protocol.
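
One way to make that rule harder to violate is to classify each handler outcome explicitly before picking a status code. This is a minimal sketch, not our production code: the `WebhookOutcome` type and `statusFor` function are hypothetical names introduced only for illustration, and the 400 for signature failures follows the checklist at the end of this post.

```typescript
// Hypothetical outcome categories for a webhook handler.
// These names are illustrative and not taken from the tasteck codebase.
type WebhookOutcome =
  | { kind: "processed" }
  | { kind: "ignored_event_type"; eventType: string }
  | { kind: "record_not_found"; detail: string }
  | { kind: "invalid_signature" }
  | { kind: "transient_failure"; error: Error };

// Map an outcome to the HTTP status the endpoint should answer with.
function statusFor(outcome: WebhookOutcome): number {
  switch (outcome.kind) {
    case "processed":
    case "ignored_event_type":
      return 200;
    case "record_not_found":
      // A retry cannot create the missing row, so log it and acknowledge.
      console.warn(`[webhook] ${outcome.detail}, skipping`);
      return 200;
    case "invalid_signature":
      // The payload could not be verified, so reject it.
      return 400;
    case "transient_failure":
      // A retry may succeed once the DB or a downstream service recovers.
      return 500;
  }
}
```

With this shape, a lookup miss such as `statusFor({ kind: "record_not_found", detail: "StripeSubscriptionItem not found" })` can only ever produce a 200, and the decision to ask Stripe for a retry has to be made on purpose.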

Trap #2: `invoice.paid` only fires on the 1st of the month

Why didn't we notice for 5 days?

Because invoice.paid is a subscription cycle event. For a monthly subscription, it fires once a month, on renewal day. So:

Day 1 of the month: 20 customers renew → 1 of them is broken → 1 failure that day
Day 2-3: Stripe retries that 1 failure several times → spikes our error log briefly
Day 4-30: nothing happens. Logs are silent. Sentry alerts based on rolling-7-day baselines see no change.

This is a class of bugs I'd call calendar-aligned bugs: they only fire on a specific day of the month, hide inside normal noise, and Sentry's "anomaly detection" can't see them because the baseline includes the spike too.

For SaaS founders, the takeaway:

Daily error count alerts won't catch month-aligned failures.
You need per-event-type success rate alerts that fire on absolute thresholds, not anomaly-based ones.
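
A per-event-type check with an absolute threshold can be sketched roughly as follows. The `Delivery` shape, the `findFailingEventTypes` name and the 0.99 default are all assumptions for illustration; in practice the records would come from your access logs or from the delivery attempts listed in the Stripe Dashboard.

```typescript
// Hypothetical record of one webhook delivery, e.g. parsed from access logs.
interface Delivery {
  eventType: string;
  status: number; // HTTP status our endpoint returned
}

interface EventTypeAlert {
  eventType: string;
  total: number;
  successRate: number;
}

// Return every event type whose 2xx rate is below an absolute threshold.
// No baseline is involved, so a once-a-month failure still trips the check.
function findFailingEventTypes(
  deliveries: Delivery[],
  minSuccessRate: number = 0.99,
): EventTypeAlert[] {
  const counts: Record<string, { total: number; ok: number }> = {};
  for (const d of deliveries) {
    let c = counts[d.eventType];
    if (!c) {
      c = { total: 0, ok: 0 };
      counts[d.eventType] = c;
    }
    c.total += 1;
    if (d.status >= 200 && d.status < 300) c.ok += 1;
  }

  const alerts: EventTypeAlert[] = [];
  for (const eventType of Object.keys(counts)) {
    const c = counts[eventType];
    const successRate = c.ok / c.total;
    if (successRate < minSuccessRate) {
      alerts.push({ eventType, total: c.total, successRate });
    }
  }
  return alerts;
}
```

Because each event type is judged on its own, one failing invoice.paid delivery out of twenty is flagged even when every other event type on the endpoint is healthy.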

The real cause: one customer with no `subscription_items` row

I queried prod RDS to figure out which customer was hosed:

SELECT * FROM company_groups WHERE customer_id = 'cus_xxx';
-- → 1 row (plan='starter')

SELECT * FROM stripe_subscriptions WHERE company_group_id = 96;
-- → 2 rows (1 active, 1 deleted)

SELECT * FROM stripe_subscription_items
  WHERE stripe_subscription_id IN (175, 176);
-- → 0 rows ⚠️

Zero rows. The `stripe_subscription_items` table just had no record for this customer. Probably a missed INSERT during a data migration, or a race during initial subscription creation. We don't know exactly when.

Fix A: data repair (root cause)

Look up the actual subscription item from the Stripe Dashboard:

- subscription item ID: `si_xxx`
- price: `price_xxx` (¥15,000 / month)
- → matches DB plan_type `starter`

Insert the missing row:



INSERT INTO stripe_subscription_items
  (stripe_subscription_id, stripe_id, plan_type, is_annual)
VALUES
  (176, 'sub_xxx', 'starter', 0);

Click "Resend" on a failed event in Stripe Dashboard → 30 seconds later: 201 Created ✅

Fix B: handler robustness (preventive)

Data repair is a per-customer band-aid. To prevent future "data integrity gap → retry storm" cases, change the handler:

// Before (3 places in customer.subscription.deleted and invoice.paid)
if (!stripeSubscriptionItem) {
  throw new HttpException(
    "Subscription item not found.",
    HttpStatus.BAD_REQUEST
  );
}

// After
if (!stripeSubscriptionItem) {
  console.warn(
    `[webhook] StripeSubscriptionItem not found for sub_id=${stripeSub.id} (stripe_id=${subscriptionId}), skipping plan update`
  );
  break;  // exits switch, returns 200
}

Three throw sites across two case blocks (customer.subscription.deleted and invoice.paid) are all replaced with warn + break. Stripe sees 200 and stops retrying. The plan-update side effect is skipped, which is fine here because the customer's plan was already correct (we just couldn't verify it via DB).

Verification

After Fix A, manually triggered retry from Stripe Dashboard:

2026-05-05T05:27:01.500Z POST /payment/webhook 201 2 Stripe/1.0

Next day, Stripe's natural retry of the remaining 20 failed events:

2026-05-06T04:15:28.736Z POST /payment/webhook 201 2 Stripe/1.0
2026-05-06T04:17:25.577Z POST /payment/webhook 201 2 Stripe/1.0

All clean. 5/10 auto-disable risk fully averted.

Checklist for your own webhook handlers

Borrow this if it's useful:

[ ] Are you throwing 4xx for any DB lookup miss in your webhook? → consider warn + 200 instead
[ ] Do you have a default: case that returns 200 for unknown event types?
[ ] Are you alerting on per-event-type success rate, not just total error count? (Catches month-aligned failures)
[ ] Is there a periodic batch checking referential integrity between your subscription / customer / item tables?
[ ] Are your webhook signature verification failures returning 4xx? (They should; that's the correct use of 4xx, since Stripe needs to know its retry won't help.)
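
For the referential-integrity item, the core of such a batch is small. Here is a rough sketch over rows already loaded into memory. The row interfaces, the `status` field and the `findSubscriptionsWithoutItems` name are assumptions for illustration; only the table relationship (an item row pointing at a subscription row by ID) comes from the queries earlier in this post.

```typescript
// Hypothetical row shapes. The relationship mirrors the queries in this
// post, but the field names and types are assumptions.
interface SubscriptionRow {
  id: number;
  companyGroupId: number;
  status: "active" | "deleted";
}

interface SubscriptionItemRow {
  stripeSubscriptionId: number;
}

// Return active subscriptions that have no matching item row.
function findSubscriptionsWithoutItems(
  subscriptions: SubscriptionRow[],
  items: SubscriptionItemRow[],
): SubscriptionRow[] {
  // Index the subscription IDs that have at least one item.
  const hasItem: Record<number, boolean> = {};
  for (const item of items) {
    hasItem[item.stripeSubscriptionId] = true;
  }
  return subscriptions.filter(
    (s) => s.status === "active" && !hasItem[s.id],
  );
}
```

Run on a schedule and wired to an alert whenever the result is non-empty, a check like this would have surfaced the missing row before the renewal webhook ever arrived.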

The meta-lesson

The bug here was small (a missing DB row). The damage was disproportionate because:

The protocol's "4xx = retry" semantic doesn't match REST intuition
Calendar-aligned events hide inside normal logs
Sentry-style anomaly detection can't see month-1 spikes

Webhook integrations are deceptively easy at first and quietly break later. Worth a half-hour audit of yours.

Posting these as I find them. I run tasteck, a vertical SaaS, and I've been writing about the operational side in Build-in-Public posts (Japanese). This is the first one I've written in English; if you want more in this style, say so in the comments.

Top comments (3)

Brian Munz • May 6

great writeup. "Calendar-aligned bug" is a great way to put it, and I've added it to my personal lexicon.
In terms of the warn + 200... one workaround would be to route stripe events through an integration layer rather than hitting your app directly. the platform handles retry and delivery, your app gets a clean event when it's ready to process it. you still need the warn + 200 discipline in that layer, but at least it's separated.
anyway, enjoyed the article. I saved that checklist for later.

arun rajkumar • May 7

The 4xx-as-retry behaviour is the part most teams discover the way you did — at 02:02 UTC on the 1st of the month, by Stripe email. We had nearly the same fingerprint at Atoa with bank webhooks (different rail, identical pathology) and ended up writing it up here: dev.to/mickyarun/payment-webhooks-.... Two things that helped us beyond the "always 200 unless you're sure" rule: (1) every webhook handler returns 200 first and then enqueues the work — you stop confusing transport-layer ack with business-logic success; (2) we synthesise our own "missing event" alert by reconciling against the provider every 15 minutes, because the 5-day silence part of your incident is the hardest signal to engineer for. The schedule-aware alert (knowing invoice.paid runs monthly, so a flat day means nothing) is the second-order fix most teams skip until their first incident.

Harjot Singh • May 30

stripe billing-ops costs (chargebacks, dunning, tax filings) eat indie margins fast. moonshift writes auth/billing/deploy to YOUR github + vercel for $3 flat per shipped saas. no monthly. first run free, no card. moonshift.io