DEV Community

edhiblemeer

Stripe Webhook Was Silently Failing for 5 Days: The 4xx Retry Trap and the Beginning-of-Month Time Bomb

TL;DR

  • 21 invoice.paid webhooks failed for 5 straight days in production.
  • We only noticed because Stripe sent a "we'll auto-disable this endpoint by 5/10" warning email.
  • Root cause: a DB integrity gap caused our handler to throw HttpException(BAD_REQUEST) → Stripe treats 4xx as retry-eligible → a retry loop that could never succeed.
  • Lesson: Stripe webhook 4xx is not "client error, give up." It's "please try again." DB lookup misses should be console.warn + 200 OK.
  • Bonus lesson: invoice.paid only fires on subscription cycle (once a month). Five days of silent failure went completely unnoticed.

I'm running tasteck, an industry-specific SaaS for the Japanese nightlife industry. This is a real incident report from production.

The wake-up call

One morning, an email from Stripe:

We've encountered an issue when sending an event to your webhook endpoint at https://api.tasteck.tech/.../payment/webhook

We've attempted to send 16 events since 2026-05-01 02:02:21 UTC, all failing.

Stripe will stop sending events to this endpoint by 2026-05-10 02:02:21 UTC.

If we didn't fix it before 5/10, the endpoint would be automatically disabled. Subscription invoice notifications would just stop. That's catastrophic for a billing-driven SaaS.

Triage: is the endpoint dead?

First, a sanity check with curl:

curl -X POST -H "Content-Type: application/json" \
  -d '{"type":"ping"}' \
  https://api.tasteck.tech/.../payment/webhook
# → 201 Created

The endpoint is alive. Only specific events from Stripe are failing. Different problem.

Finding the actual error in PM2 logs

ssh prod "grep 'subscription' /home/ec2-user/.pm2/logs/api-error.log | tail"

# → HttpException: Subscription item not found.  (repeated many times)

The handler code (NestJS + TypeORM):

// Look up the local row that mirrors the Stripe subscription item.
const stripeSubscriptionItem = await em
  .getRepository(StripeSubscriptionItem)
  .createQueryBuilder("item")
  .where("item.stripe_id = :stripeId", { stripeId: subscriptionId })
  .getOne();

if (!stripeSubscriptionItem) {
  throw new HttpException(
    "Subscription item not found.",
    HttpStatus.BAD_REQUEST  // → this is the problem
  );
}

Trap #1: Stripe webhook 4xx IS retry-eligible

This is where REST-API instincts betray you.

A normal API: "client gave us bad data → return 4xx → client should fix it → don't retry."

But Stripe webhooks are not a normal API. From their docs:

Stripe considers any HTTP response code in the range 200-299 as a successful delivery. Anything else, including 4xx and 5xx, is treated as a failure and Stripe will retry.

So our chain was:

  1. DB has integrity gap for one customer
  2. Handler can't find the record
  3. Handler throws HttpException(400)
  4. Stripe sees 4xx → schedules a retry (exponential backoff)
  5. Retry hits the same DB gap → another 4xx → another retry
  6. After 3 days, Stripe gives up on the event → sends the "we'll auto-disable in 7 days" email
  7. If nothing changes, the endpoint gets auto-disabled. Game over.

The fix is structural: webhook handlers should almost never return 4xx for application-level "data not found" cases. Log a warning, return 200, move on. The 4xx semantic doesn't fit the protocol.
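One way to make that rule explicit is to classify each handler outcome before picking a status code. The sketch below is an illustration, not our production code: the outcome names, the STATUS_BY_OUTCOME table, and the statusFor helper are all hypothetical. It assumes you want Stripe to retry only when a retry could plausibly succeed.

```typescript
// Hypothetical outcome categories for a webhook handler (names are illustrative).
type WebhookOutcome =
  | "processed"          // event handled normally
  | "ignored"            // event type we do not act on
  | "record_missing"     // our DB has no matching row; a retry cannot fix that
  | "transient_failure"  // e.g. the DB was unreachable; a retry may succeed
  | "bad_signature";     // payload could not be verified as coming from Stripe

// Status returned to Stripe for each outcome.
// Anything outside 2xx counts as a failed delivery and is retried.
const STATUS_BY_OUTCOME: Record<WebhookOutcome, number> = {
  processed: 200,
  ignored: 200,
  record_missing: 200,    // acknowledge, and log a warning on our side
  transient_failure: 500, // ask Stripe to try again later
  bad_signature: 400,     // never acknowledge an unverified payload
};

function statusFor(outcome: WebhookOutcome): number {
  return STATUS_BY_OUTCOME[outcome];
}
```

With a table like this, the "Subscription item not found" case maps to record_missing and gets a 200, while a dropped DB connection still gets a 500 and a retry.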

Trap #2: invoice.paid only fires on the 1st of the month

Why didn't we notice for 5 days?

Because invoice.paid is a subscription cycle event. For a monthly subscription, it fires once a month, on renewal day. So:

  • Day 1 of the month: 20 customers renew → 1 of them is broken → 1 failure that day
  • Day 2-3: Stripe retries that 1 failure several times → spikes our error log briefly
  • Day 4-30: nothing happens. Logs are silent. Sentry alerts based on rolling-7-day baselines see no change.

This is a class of bugs I'd call calendar-aligned bugs: they only fire on a specific day of the month, hide inside normal noise, and Sentry's "anomaly detection" can't see them because the baseline includes the spike too.

For SaaS founders, the takeaway:

  • Daily error count alerts won't catch month-aligned failures.
  • You need per-event-type success rate alerts that fire on absolute thresholds, not anomaly-based ones.
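As a rough sketch of what an absolute-threshold check could look like, the function below groups delivery records by event type and flags any type whose failure count reaches a fixed limit. The DeliveryRecord shape, the function name, and the default threshold of 1 are assumptions for illustration; in practice the records would come from your own access log or metrics store.

```typescript
// One delivery attempt as seen by your own endpoint (shape is assumed).
interface DeliveryRecord {
  eventType: string; // e.g. "invoice.paid"
  status: number;    // HTTP status your endpoint returned
}

interface EventTypeAlert {
  eventType: string;
  failures: number;
  total: number;
}

// Return one alert per event type whose failure count reaches maxFailures.
// No baseline is involved, so a spike that recurs every month still trips it.
function findFailingEventTypes(
  records: DeliveryRecord[],
  maxFailures: number = 1,
): EventTypeAlert[] {
  const stats: Record<string, EventTypeAlert> = {};
  for (const record of records) {
    const entry =
      stats[record.eventType] ?? { eventType: record.eventType, failures: 0, total: 0 };
    entry.total += 1;
    // Stripe only treats 2xx as success, so count everything else as a failure.
    if (record.status < 200 || record.status >= 300) entry.failures += 1;
    stats[record.eventType] = entry;
  }

  const alerts: EventTypeAlert[] = [];
  for (const key of Object.keys(stats)) {
    const entry = stats[key];
    if (entry !== undefined && entry.failures >= maxFailures) alerts.push(entry);
  }
  return alerts;
}
```

Because the threshold is a plain count per event type, a single failing invoice.paid delivery on the 1st is enough to raise an alert, even when the overall error rate looks normal.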

The real cause: one customer with no subscription_items row

I queried prod RDS to figure out which customer was hosed:

SELECT * FROM company_groups WHERE customer_id = 'cus_xxx';
-- → 1 row (plan='starter')

SELECT * FROM stripe_subscriptions WHERE company_group_id = 96;
-- → 2 rows (1 active, 1 deleted)

SELECT * FROM stripe_subscription_items
  WHERE stripe_subscription_id IN (175, 176);
-- → 0 rows ⚠️

Zero rows. The stripe_subscription_items table just had no record for this customer. Probably a missed INSERT during a data migration, or a race during initial subscription creation. We don't know exactly when.

Fix A: data repair (root cause)

Look up the actual subscription item from the Stripe Dashboard:

  • subscription item ID: si_xxx
  • price: price_xxx (¥15,000 / month)
  • → matches DB plan_type starter

Insert the missing row:

INSERT INTO stripe_subscription_items
  (stripe_subscription_id, stripe_id, plan_type, is_annual)
VALUES
  (176, 'sub_xxx', 'starter', 0);

Click "Resend" on a failed event in the Stripe Dashboard → 30 seconds later: 201 Created ✅

Fix B: handler robustness (preventive)

Data repair is a per-customer band-aid. To prevent future "data integrity gap 竊・retry storm" cases, change the handler:

// Before (3 places in customer.subscription.deleted and invoice.paid)
if (!stripeSubscriptionItem) {
  throw new HttpException(
    "Subscription item not found.",
    HttpStatus.BAD_REQUEST
  );
}

// After
if (!stripeSubscriptionItem) {
  console.warn(
    `[webhook] StripeSubscriptionItem not found for sub_id=${stripeSub.id} (stripe_id=${subscriptionId}), skipping plan update`
  );
  break;  // exits the switch; the handler then returns a 2xx
}

That is 3 throw sites across two case blocks (customer.subscription.deleted and invoice.paid), all replaced with warn + break. Stripe sees a 2xx and stops retrying. The plan-update side effect is skipped, which is acceptable here because the customer's plan was already correct; we just couldn't verify it via the DB.
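For readers who want the whole shape in one place, here is a self-contained sketch of the warn + break pattern together with a default case for unknown event types. It is a simplified stand-in, not our NestJS controller: WebhookEvent, FindItem, HandlerResult, and handleWebhookEvent are hypothetical names, and the lookup is synchronous only to keep the sketch short (a real repository call would be awaited).

```typescript
// Minimal event shape; real Stripe events carry many more fields.
interface WebhookEvent {
  type: string;
  subscriptionId: string;
}

// Hypothetical lookup: returns the stored item, or null when the row is missing.
type FindItem = (subscriptionId: string) => { planType: string } | null;

interface HandlerResult {
  status: number;             // HTTP status to send back to Stripe
  appliedPlan: string | null; // plan that was applied, or null if the update was skipped
}

function handleWebhookEvent(
  event: WebhookEvent,
  findItem: FindItem,
  warn: (message: string) => void,
): HandlerResult {
  let appliedPlan: string | null = null;

  switch (event.type) {
    case "invoice.paid":
    case "customer.subscription.deleted": {
      const item = findItem(event.subscriptionId);
      if (!item) {
        // A retry cannot create the missing row, so log and acknowledge.
        warn(`[webhook] item not found for ${event.subscriptionId}, skipping plan update`);
        break;
      }
      appliedPlan = item.planType; // stand-in for the real plan update
      break;
    }
    default:
      // Event types we do not handle are acknowledged as well.
      break;
  }

  return { status: 200, appliedPlan };
}
```

Passing warn in as a parameter is a design choice for the sketch: it makes the skipped-update path easy to assert on in a test, and in production it could simply be console.warn or your logger.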

Verification

After Fix A, manually triggered retry from Stripe Dashboard:

2026-05-05T05:27:01.500Z POST /payment/webhook 201 2 Stripe/1.0

Next day, Stripe's natural retry of the remaining 20 failed events:

2026-05-06T04:15:28.736Z POST /payment/webhook 201 2 Stripe/1.0
2026-05-06T04:17:25.577Z POST /payment/webhook 201 2 Stripe/1.0

All clean. 5/10 auto-disable risk fully averted.

Checklist for your own webhook handlers

Borrow this if it's useful:

  • [ ] Are you throwing 4xx for any DB lookup miss in your webhook? → consider warn + 200 instead
  • [ ] Do you have a default: case that returns 200 for unknown event types?
  • [ ] Are you alerting on per-event-type success rate, not just total error count? (Catches month-aligned failures)
  • [ ] Is there a periodic batch checking referential integrity between your subscription / customer / item tables?
  • [ ] Are your webhook signature verification failures returning 4xx? They should: that's the correct use of 4xx, because an unverified payload must never be acknowledged as processed.
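The referential-integrity item on this list can start very small. The sketch below checks for the exact gap from this incident: an active subscription with no subscription item row. The row shapes are assumptions (in particular, a status column on stripe_subscriptions is a guess), and in a real batch job the two arrays would be loaded from the database.

```typescript
// Minimal row shapes mirroring the tables named in this post (columns are assumed).
interface SubscriptionRow {
  id: number;
  status: string; // assumed values such as "active" or "deleted"
}

interface SubscriptionItemRow {
  stripe_subscription_id: number;
}

// Return the IDs of active subscriptions that have no subscription item row.
function findSubscriptionsWithoutItems(
  subscriptions: SubscriptionRow[],
  items: SubscriptionItemRow[],
): number[] {
  const hasItem: Record<number, boolean> = {};
  for (const item of items) {
    hasItem[item.stripe_subscription_id] = true;
  }
  return subscriptions
    .filter((sub) => sub.status === "active" && hasItem[sub.id] !== true)
    .map((sub) => sub.id);
}
```

Run on a schedule, a non-empty result is the signal to repair the data before the next renewal day instead of after Stripe's warning email.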

The meta-lesson

The bug here was small (a missing DB row). The damage was disproportionate because:

  1. The protocol's "4xx = retry" semantic doesn't match REST intuition
  2. Calendar-aligned events hide inside normal logs
  3. Sentry-style anomaly detection can't see month-1 spikes

Webhook integrations are deceptively easy at first and quietly break later. Worth a half-hour audit of yours.


Posting these as I find them. I run tasteck, a vertical SaaS, and I've been writing about the operational side in Build-in-Public posts (Japanese). This is the first one I've written in English. If you want more in this style, say so in the comments.
