War Story: Stripe 2026 Webhook Misconfiguration Caused $100k in Lost Revenue
Published: October 12, 2026 | Author: Engineering Team @ SaaS Co
It was 2:14 AM on a Monday in March 2026 when our first PagerDuty alert fired. Our checkout success rate had dropped from 99.2% to 0% in the span of 12 minutes. For a SaaS platform processing $400k in daily recurring revenue, that was a five-alarm fire.
The Setup: Stripe Webhooks 101
We’d been using Stripe for payments since 2021, and our webhook integration was battle-tested: we listened for checkout.session.completed events to provision user access, invoice.paid to renew subscriptions, and charge.failed to trigger dunning emails. Our webhook endpoint was a Node.js Lambda function behind API Gateway, with Stripe signature verification enabled, three retries configured, and a dead-letter queue (DLQ) for events that still failed.
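For context, the handler looked roughly like the sketch below. This is a simplified reconstruction, not our production code: the routing is condensed and the provisioning calls are stubbed out as comments.

```typescript
// Simplified sketch of our v1-era handler (reconstruction, not production
// code). constructEvent throws if the signature check fails, which is
// exactly the kind of unhandled exception that later turned into our 500s.
import Stripe from "stripe";
import type {
  APIGatewayProxyEvent,
  APIGatewayProxyResult,
} from "aws-lambda";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);
const signingSecret = process.env.STRIPE_WEBHOOK_SECRET!;

export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  const signature =
    event.headers["Stripe-Signature"] ?? event.headers["stripe-signature"];

  // Verifies the Stripe-Signature header against the raw request body.
  const stripeEvent = stripe.webhooks.constructEvent(
    event.body!,
    signature!,
    signingSecret
  );

  switch (stripeEvent.type) {
    case "checkout.session.completed":
      // ...provision user access
      break;
    case "invoice.paid":
      // ...renew the subscription
      break;
    case "charge.failed":
      // ...trigger a dunning email
      break;
  }
  return { statusCode: 200, body: "ok" };
};
```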
In early 2026, Stripe rolled out v2 of their webhook API, which added support for batched event delivery and mandatory idempotency keys for all event types. We’d scheduled the migration for Q1 2026, and our engineering team had spent 3 weeks testing the v2 integration in staging, which processed 10k+ test events without issue.
The Incident: 12 Minutes of Chaos
On March 9, 2026, we flipped the switch to Stripe Webhooks v2 in production. The first 8 minutes went smoothly: 1.2k checkout events processed, no errors. Then, at 2:14 AM, our error rate spiked to 100%. Every incoming webhook was returning a 500 Internal Server Error.
We jumped into the logs first. API Gateway showed all webhook requests were reaching our Lambda, but the Lambda was throwing an unhandled exception: StripeSignatureVerificationError: No signatures found matching the expected signature for payload. Wait, that didn’t make sense—we’d updated our signature verification logic to support v2’s new signing secret format, right?
We checked the Stripe dashboard: webhook deliveries were showing as "failed" with 500 responses, and Stripe was retrying every 30 minutes. But our checkout flow was broken: users were paying successfully (Stripe showed charges as succeeded), but our system wasn’t provisioning access, so users were emailing support saying they’d paid but couldn’t log in.
The Root Cause: A One-Character Typo
After 45 minutes of debugging, we found the issue. When we updated our environment variables for the v2 migration, we’d copied the new Stripe webhook signing secret from the Stripe dashboard. But the secret had a trailing newline character that we’d accidentally included when pasting into AWS Parameter Store.
Stripe’s v2 signature verification requires an exact match of the signing secret. The trailing newline meant our computed signature never matched the one Stripe sent in the Stripe-Signature header. Every single webhook was failing verification, throwing an error, and returning 500. Stripe’s retry logic would kick in, but since the secret was wrong, every retry also failed.
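To see why a single newline is fatal, here’s a minimal sketch using Stripe’s documented signing scheme (an HMAC-SHA256 over the timestamp and payload, joined by a period). The payload, timestamp, and secret values below are made up for illustration.

```typescript
// Why one newline breaks everything: Stripe computes HMAC-SHA256 over
// "<timestamp>.<payload>" using the signing secret, and the verifier must
// reproduce it exactly. All values below are made up for illustration.
import { createHmac } from "node:crypto";

const payload = '{"id":"evt_123","type":"checkout.session.completed"}';
const timestamp = "1741486440";
const secret = "whsec_example"; // what Stripe actually signs with

const sign = (key: string): string =>
  createHmac("sha256", key).update(`${timestamp}.${payload}`).digest("hex");

// Our Lambda was verifying with the pasted value, i.e. secret + "\n",
// so its computed signature never matched the header Stripe sent.
console.log(sign(secret) === sign(secret + "\n")); // false, every time
```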
But wait: why didn’t our staging tests catch this? In staging, we’d used the "Copy Secret" button in Stripe’s dashboard, which automatically trims whitespace. In production, we’d handled the secret by hand (don’t ask why: a permissions issue with our secret manager that day meant the on-call engineer pasted the value into prod from a terminal). That paste carried a trailing newline, and nobody noticed.
The Impact: $100k Lost Revenue
We fixed the secret at 2:59 AM, 45 minutes after the first alert. But the damage was done:
- 1,247 successful Stripe charges totaling $112k had no corresponding provisioning events. 89% of those users requested refunds within 24 hours, totaling $99.7k in lost revenue (rounded to $100k).
- 12 enterprise customers threatened to churn over the outage, requiring 40+ hours of customer success time to retain.
- Our support team handled 2,100+ tickets in 48 hours, costing $18k in overtime pay.
Stripe’s retry logic had queued all the failed events, but since our endpoint was returning 500s, the retries never succeeded. Once we fixed the secret, we had to manually replay all 1,247 events from the Stripe webhook logs to provision user access, which took another 6 hours of engineering time.
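For anyone facing a similar backlog: the replay boiled down to listing the missed events from the Stripe API for the outage window and re-running our provisioning logic for each one. The sketch below is a reconstruction, not the exact script we ran that night; provisionAccess stands in for our internal provisioning call.

```typescript
// Reconstruction of the replay script: list the missed events from the
// Stripe API for the outage window and re-run provisioning for each.
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

// Stand-in for our internal provisioning logic (not shown here).
declare function provisionAccess(
  session: Stripe.Checkout.Session
): Promise<void>;

async function replayOutage(fromUnix: number, toUnix: number): Promise<void> {
  // stripe-node auto-paginates list calls when used as an async iterator.
  for await (const event of stripe.events.list({
    type: "checkout.session.completed",
    created: { gte: fromUnix, lte: toUnix },
  })) {
    await provisionAccess(event.data.object as Stripe.Checkout.Session);
  }
}
```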
Lessons Learned
We made a lot of mistakes that night, but we walked away with three hard-won rules for webhook integrations:
- Never manually copy-paste secrets into production. Use infrastructure-as-code (IaC) tools like Terraform or AWS Secrets Manager’s API to sync secrets directly from Stripe, with automatic whitespace trimming.
- Add synthetic webhook testing to production canaries. We now send a test ping event to our webhook endpoint 1 minute after any payment-integration change, with an automatic rollback if verification fails (see the sketch after this list).
- Always log unverified webhook payloads to a temporary bucket. We’d originally discarded unverified events to save on storage costs, but now we store all raw webhook payloads, even unverified ones, for 7 days. That would have let us replay events even if our endpoint was broken, without relying on Stripe’s retry queue.
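Here’s what the canary from the second rule looks like in sketch form. It signs its own test payload with the production signing secret, using Stripe’s documented Stripe-Signature header format, and posts it to the live endpoint; if the endpoint can’t verify a payload we signed ourselves, the secret is bad. The endpoint URL is a placeholder, and the rollback wiring is left to the deploy pipeline.

```typescript
// Sketch of the post-deploy canary: sign a test payload with the production
// signing secret and POST it to our own endpoint. If the endpoint can't
// verify a payload we signed ourselves, the secret is bad. The endpoint URL
// is a placeholder; hooking the thrown error up to a rollback is up to you.
import { createHmac } from "node:crypto";

const ENDPOINT = "https://api.example.com/stripe/webhooks"; // placeholder
const secret = process.env.STRIPE_WEBHOOK_SECRET!;

export async function webhookCanary(): Promise<void> {
  const payload = JSON.stringify({ id: "evt_canary", type: "ping" });
  const timestamp = Math.floor(Date.now() / 1000);
  const v1 = createHmac("sha256", secret)
    .update(`${timestamp}.${payload}`)
    .digest("hex");

  const res = await fetch(ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Stripe's documented signature header format: t=<ts>,v1=<hmac>
      "Stripe-Signature": `t=${timestamp},v1=${v1}`,
    },
    body: payload,
  });

  if (res.status !== 200) {
    throw new Error(`webhook canary failed with ${res.status}`);
  }
}
```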
Final Takeaway
Webhook misconfigurations are silent killers: they often don’t show up until traffic spikes or a provider updates their API. A single trailing newline cost us $100k, but it taught us that webhook security and reliability demand the same rigor as any other production system. Test your secrets, automate your config, and always have a replay plan.