ajithmanmu

Posted on May 13

When Stripe's Built-In Dunning Isn't Enough

#aws #stripe #serverless #webdev


Every subscription business eventually hits the same silent killer: involuntary churn. A customer wants to keep paying you, but their card fails. RevenueCat's State of Subscriptions 2026 report highlights payment failures as one of the leading causes of involuntary churn across subscription apps — and that's not just mobile. For Stripe-based web subscriptions, the same problem exists and the default tooling leaves a lot of room for improvement.

Stripe handles the basics — it retries on a schedule and sends a generic email. For an early-stage startup, that's enough.

But once revenue starts to matter, Stripe's defaults start to feel blunt.

A $500/month VIP customer gets the same retry schedule as someone on a free trial. A stolen card — which will never succeed — gets retried the same way as a card that temporarily had insufficient funds. And there's no clean way to know what's happening with any specific customer's failed payment without digging through raw logs.

By sticking to the defaults, you're leaving money on the table.

I built a custom payment recovery system on AWS to fix this. Instead of a one-size-fits-all retry loop, it gives every failed payment its own isolated workflow — one that understands who the customer is, why the payment failed, and exactly how hard to fight before giving up.

The problem with Stripe's default dunning

Stripe's built-in dunning has four real blind spots:

1. Every customer gets the same treatment

A long-time VIP and a day-one trial user both hit the same retry schedule. That's not how you'd handle it manually. High-value customers deserve more patience. Trial users with unproven intent to pay deserve less.

2. No understanding of why the payment failed

Most engineers look at a failed payment and just see "Failed." But the bank usually tells you exactly why — and that reason changes everything.

Expired card → worth retrying
Temporarily insufficient funds → retry in a few days
Stolen card → retrying is pointless, and repeatedly hitting a stolen card can hurt your reputation with card networks

Stripe retries regardless.

3. Retrying hard declines damages your standing

Stolen cards, lost cards, do-not-honor codes — these will never succeed. Retrying them wastes money on Stripe fees and signals to card networks that you're not doing basic fraud hygiene. This can affect your authorization rates over time.

4. No per-customer audit trail

There's no single place that tells you: "This customer failed on Day 0, got retried on Day 3, received an email here, and will be cancelled on Day 14." You piece it together from logs. That makes debugging and support harder than it should be.

The architecture

The entire infrastructure is provisioned with Terraform — API Gateway, Lambda functions, Step Functions, DynamoDB tables, IAM roles, and Secrets Manager. The system sits between Stripe and your business logic:

Stripe → API Gateway → webhook-handler Lambda → Step Functions
                               ↕
                           DynamoDB
                   (idempotency, customer tier, outcomes)

Stack:

Terraform — all infrastructure as code, fully reproducible
API Gateway — receives invoice.payment_failed webhooks from Stripe
webhook-handler Lambda — validates the Stripe signature, reads the decline reason, starts a Step Functions execution
enricher Lambda — looks up the customer's tier in DynamoDB, defaults to standard if no record exists
payment-checker Lambda — polls Stripe to see if the invoice was paid between retries
canceller Lambda — cancels the subscription and writes the outcome to DynamoDB
DynamoDB — idempotency table (prevents double-processing), customer tier table, dunning-state table (final outcomes)

Every failed payment gets its own Step Functions execution. One customer, one workflow, fully isolated.
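The handler's three jobs (verify, de-duplicate, start a workflow) can be sketched with injected dependencies, so the flow is visible without any AWS or Stripe setup. This is a minimal sketch, not the repository's code: `makeWebhookHandler`, `verifyEvent`, `claimEvent`, `startExecution`, and the table and attribute names are all hypothetical. In a real deployment `verifyEvent` would probably wrap `stripe.webhooks.constructEvent`, `claimEvent` a DynamoDB conditional put, and `startExecution` a Step Functions StartExecution call.

```javascript
// Hypothetical factory: dependencies are injected so the flow can be exercised in isolation.
function makeWebhookHandler({ verifyEvent, claimEvent, startExecution }) {
  return async function handler(request) {
    let event;
    try {
      // Expected to throw when the signature header does not match the raw body.
      event = verifyEvent(request.body, request.headers['stripe-signature']);
    } catch (err) {
      return { statusCode: 400, body: 'invalid signature' };
    }

    if (event.type !== 'invoice.payment_failed') {
      return { statusCode: 200, body: 'ignored' };
    }

    // claimEvent resolves to false when this event id was already recorded.
    const firstDelivery = await claimEvent(event.id);
    if (!firstDelivery) {
      return { statusCode: 200, body: 'duplicate' };
    }

    const invoice = event.data.object;
    await startExecution({
      eventId: event.id,
      invoiceId: invoice.id,
      customerId: invoice.customer,
    });
    return { statusCode: 200, body: 'started' };
  };
}

// Assumed shape of the idempotency write. The condition makes a second put
// for the same event id fail instead of overwriting the first one.
// Table and attribute names are placeholders.
function idempotencyPutParams(eventId) {
  return {
    TableName: 'dunning-idempotency',
    Item: { event_id: { S: eventId } },
    ConditionExpression: 'attribute_not_exists(event_id)',
  };
}
```

One design question this sketch leaves open: if `startExecution` fails after the event has been claimed, a redelivered webhook would be treated as a duplicate. A production version would need to release the claim or retry the start.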

Hard vs. soft declines

The first thing the webhook handler does is read charge.outcome.reason from the Stripe charge object. This is one of the most under-used data points in the Stripe API. Most people just see "payment failed" — but the bank tells you exactly why, and that signal is what drives the routing logic.

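// Look up the customer's most recent charge, which carries the issuer's decline reason.
const charges = await stripe.charges.list({ customer: invoice.customer, limit: 1 });
const charge = charges.data[0]; // undefined when the customer has no charges
// Prefer the issuer's reason, then Stripe's failure code, then a sentinel value.
const failureCode = (charge?.outcome?.reason ?? charge?.failure_code) || 'unknown';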

Hard decline codes — stolen_card, lost_card, do_not_honor, pickup_card — go straight to cancellation. No retries, no waiting.

Everything else is a soft decline and enters the retry flow.
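The hard/soft split comes down to a set lookup. Here's a minimal sketch using the four codes listed above; the names `HARD_DECLINE_CODES` and `classifyDecline` are illustrative, not taken from the repository.

```javascript
// Codes the article treats as hard declines. Everything else is soft.
const HARD_DECLINE_CODES = new Set([
  'stolen_card',
  'lost_card',
  'do_not_honor',
  'pickup_card',
]);

// Returns 'hard' for codes that should skip retries, 'soft' otherwise.
// Note that the 'unknown' fallback code is treated as soft.
function classifyDecline(failureCode) {
  return HARD_DECLINE_CODES.has(failureCode) ? 'hard' : 'soft';
}
```

Keeping the list in one place makes it easy to review which codes skip the retry flow, and to adjust that list as you learn how your own customers' banks behave.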

The four recovery paths

Once the decline type and customer tier are known, the Step Functions state machine routes to one of four paths:

Scenario	Decline type	Retry schedule	Total window
VIP	soft	Day 1, Day 3, Day 7	7 days
Trial	soft	Day 1, Day 3	3 days
Standard	soft	Day 3, Day 7, Day 14	14 days
Hard decline	hard	None	Immediate cancel

Why the trial window is short: if the first real payment fails after a trial ends, the intent to pay is unproven. You don't want to provide weeks of free access to someone who signed up with a card they never planned to use.

Why VIP gets more time: these customers have demonstrated value. A payment failure is more likely a temporary issue — card replacement, bank fraud hold — than an intent to leave.
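The routing decision itself is small enough to express as data plus one function. The sketch below copies the retry days from the table above; the lowercase tier keys, the `path` and `action` labels, and the function names are assumptions made for illustration.

```javascript
// Retry days per tier, copied from the table above.
const RETRY_DAYS = {
  vip: [1, 3, 7],
  trial: [1, 3],
  standard: [3, 7, 14],
};

// Unknown or missing tiers fall back to 'standard', mirroring the enricher's default.
function normalizeTier(tier) {
  return Object.keys(RETRY_DAYS).includes(tier) ? tier : 'standard';
}

// Hard declines bypass the schedule entirely; soft declines get their tier's schedule.
function pickRecoveryPath(declineType, tier) {
  if (declineType === 'hard') {
    return { path: 'hard_decline', retryDays: [], action: 'cancel_now' };
  }
  const resolved = normalizeTier(tier);
  return { path: resolved, retryDays: RETRY_DAYS[resolved], action: 'retry' };
}
```

For example, `pickRecoveryPath('soft', 'vip')` yields the three-step VIP schedule, while `pickRecoveryPath('hard', 'vip')` yields an empty schedule and an immediate cancel, regardless of tier.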

The key insight: Step Functions as a per-user scheduled task

The most common engineering approach to retry logic is a database table and a cron job: every hour, scan for rows where next_retry < NOW(). It works, but it's painful to debug. If one customer's retry logic breaks, you're sifting through thousands of rows to find it.

Step Functions changes the mental model entirely.

An execution can sit dormant at a Wait state for days. In production, the wait durations are real:

86,400 seconds = 1 day
259,200 seconds = 3 days
604,800 seconds = 7 days

With Standard workflows, which bill per state transition rather than per second of runtime, an execution costs nothing while it waits. When the timer expires, it picks up exactly where it left off.
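To make the Wait-then-check loop concrete, here is a sketch that generates an Amazon States Language definition from a list of retry days. It rests on several assumptions: retry days are counted from the initial failure (so each Wait covers only the gap since the previous check), the payment-checker returns an object with a boolean `paid` field, and the state names and the `buildRetryDefinition` helper are invented for this example. The actual definition in the repository may be structured differently.

```javascript
const SECONDS_PER_DAY = 86400;

// Builds a Wait -> Check -> Choice chain for each retry day, ending in
// either Recovered (paid) or CancelSubscription (never paid).
function buildRetryDefinition(retryDays, checkerArn, cancellerArn) {
  if (retryDays.length === 0) {
    throw new Error('retryDays must not be empty; hard declines skip this flow');
  }
  const states = {};
  let previousDay = 0;
  retryDays.forEach((day, i) => {
    const isLast = i === retryDays.length - 1;
    states[`WaitUntilDay${day}`] = {
      Type: 'Wait',
      // Only the gap since the previous check, not the absolute day.
      Seconds: (day - previousDay) * SECONDS_PER_DAY,
      Next: `CheckDay${day}`,
    };
    states[`CheckDay${day}`] = {
      Type: 'Task',
      Resource: checkerArn,
      ResultPath: '$.check', // assumed checker output: { paid: true | false }
      Next: `PaidDay${day}`,
    };
    states[`PaidDay${day}`] = {
      Type: 'Choice',
      Choices: [{ Variable: '$.check.paid', BooleanEquals: true, Next: 'Recovered' }],
      Default: isLast ? 'CancelSubscription' : `WaitUntilDay${retryDays[i + 1]}`,
    };
    previousDay = day;
  });
  states.Recovered = { Type: 'Succeed' };
  states.CancelSubscription = { Type: 'Task', Resource: cancellerArn, End: true };
  return { StartAt: `WaitUntilDay${retryDays[0]}`, States: states };
}
```

Calling `buildRetryDefinition([1, 3, 7], checkerArn, cancellerArn)` and passing the result through `JSON.stringify` gives a definition you could hand to Terraform or the Step Functions console.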

This means every failed payment is its own isolated process — like a tiny scheduled program running just for that customer. If a PM asks "what's happening with Customer X right now?", you open the Step Functions console, find their execution by name, and see exactly which state they're in and when the next action fires. You can stop that one execution, skip a step, or inspect the full history — without touching anyone else.
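Finding an execution by name only works if the names are predictable. Below is a sketch of one possible naming helper. Step Functions limits execution names to 80 characters and rejects whitespace and many punctuation characters, so the helper keeps to letters, digits, hyphens, and underscores. Appending the Stripe event id for uniqueness is my assumption; the article only says executions are named by customer and decline type.

```javascript
// Builds a console-friendly execution name such as "cus_123-soft-evt_abc".
function executionName(customerId, declineType, eventId) {
  const raw = `${customerId}-${declineType}-${eventId}`;
  // Replace anything outside the safe character set, then enforce the 80-character cap.
  return raw.replace(/[^A-Za-z0-9_-]/g, '_').slice(0, 80);
}
```

Because the customer id comes first, typing it into the console's execution search narrows the list to that customer's workflows.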

No cron jobs. No table scans. No database flags tracking retry state.

Seeing it in action

To demo all four paths, I wrote a script that creates real Stripe customers with real subscriptions and triggers genuine payment failures:

node scripts/trigger-failure.js all

The script creates a fresh Stripe customer for each scenario, seeds DynamoDB with the correct tier, attaches a subscription to the Premium or Starter product, and confirms the invoice's PaymentIntent with a test card that declines. For hard decline, it uses pm_card_chargeDeclinedStolenCard which produces outcome.reason: stolen_card on the charge — exactly what the webhook handler reads.

In Stripe, each test customer has a real subscription with a real failed invoice, not a mocked event.

Each scenario kicks off its own Step Functions execution, named by customer and decline type so they're readable at a glance.

The routing difference is visible in the execution graphs. VIP takes the full retry path, while a hard decline goes straight to cancel: no wait states, done in under a second.

Every outcome is recorded in DynamoDB.

This pattern extends beyond dunning

The most underrated part of this architecture is that dunning is just the first workflow you plug in.

The webhook handler doesn't care what event it receives. The same infrastructure — API Gateway, Lambda, Step Functions — can orchestrate any Stripe event into any workflow:

customer.subscription.deleted — customer cancels. Trigger a win-back flow: wait 3 days, send a discount offer, wait a week, send a final email, archive the record.
customer.subscription.trial_will_end — Stripe fires this three days before a trial is scheduled to end (or right away if the trial is ended immediately). Start a conversion sequence: "here's what you'll lose", followed by a discount offer, followed by a day-before reminder.
invoice.payment_succeeded after a failure — payment recovered. Send a confirmation, re-activate paused features, notify the account team for high-value customers.
charge.dispute.created — chargeback filed. Immediately alert the team, auto-gather evidence, flag the customer record.

Each of these is a separate state machine sharing the same webhook infrastructure. You're not building a dunning system — you're building a Stripe event orchestration layer. Dunning is just the first workflow you wire up.
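One way to picture the orchestration layer is a lookup from Stripe event type to workflow. The sketch below is illustrative only: the workflow identifiers are made up, and a real handler would map them to state machine ARNs. It also glosses over the "after a failure" condition on `invoice.payment_succeeded`, which would need a check against the dunning-state table.

```javascript
// Hypothetical mapping from Stripe event type to a workflow identifier.
const WORKFLOWS = new Map([
  ['invoice.payment_failed', 'dunning'],
  ['customer.subscription.deleted', 'win-back'],
  ['customer.subscription.trial_will_end', 'trial-conversion'],
  ['invoice.payment_succeeded', 'recovery-confirmation'],
  ['charge.dispute.created', 'dispute-response'],
]);

// Returns the workflow for an event type, or null when the event
// should simply be acknowledged and ignored.
function workflowFor(eventType) {
  return WORKFLOWS.get(eventType) ?? null;
}
```

Adding a new workflow then means adding one entry to the map and one state machine to the Terraform config, with no changes to the webhook plumbing.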

What you'd add in production

The current system handles the retry logic. A production deployment would layer on:

Tiered messaging at each retry point

Not one generic email after 14 days, but a different message at each step:

Day 1: "There was an issue with your payment — here's how to fix it"
Day 7: "Your access will end in 7 days"
Final step for VIPs: offer a one-month discount before cancelling

Because each Wait state is followed by a Lambda invocation, adding email is just wiring up SES or your email provider at each step.
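Choosing which message to send at each step can be a pure function that the post-Wait Lambda calls before handing off to SES. This is a rough sketch: the template keys and the `pickMessage` name are invented, and the generic fallback for steps the list above doesn't cover is my assumption.

```javascript
// Picks a template key for the current retry step.
function pickMessage({ day, tier, isFinalStep }) {
  // VIPs get a discount offer at their last step instead of a standard notice.
  if (isFinalStep && tier === 'vip') return 'vip_discount_offer';
  if (day === 1) return 'payment_issue_how_to_fix';
  if (day === 7) return 'access_ends_in_7_days';
  // Assumed fallback for steps without a dedicated message.
  return 'payment_reminder';
}
```

Keeping the choice separate from the sending code means the copy for each step can change without touching the state machine.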

Pause instead of cancel

Cancellation is destructive — it might delete data or revoke API access. In many cases it's better to pause the account: the customer can still log in and see their data, but can't take new actions. This keeps the door open for them to return without the friction of a full account reset.
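On the Stripe side, one way to stop billing without ending the subscription is the `pause_collection` field on the subscription. Here's a minimal sketch, assuming the `stripe` argument is the official Node client (or anything exposing the same `subscriptions.update` method); the function name is hypothetical. Restricting what a paused customer can do inside your own product is a separate piece of application logic that this does not cover.

```javascript
// Pauses payment collection on a subscription instead of cancelling it.
async function pauseInsteadOfCancel(stripe, subscriptionId) {
  // With behavior 'void', invoices generated while paused are voided
  // rather than collected. The subscription itself stays in place.
  return stripe.subscriptions.update(subscriptionId, {
    pause_collection: { behavior: 'void' },
  });
}
```

Swapping this in for the canceller's cancel call would be the smallest change; whether `void` is the right behavior depends on how you want unpaid periods treated once the customer returns.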

Alerts for hard decline spikes

If 50 customers hit stolen_card errors within ten minutes, that's not a billing issue — it's likely a card-testing attack or a bug in your integration. A CloudWatch alarm feeding into Slack or PagerDuty lets a human intervene before the system mass-cancels legitimate customers.
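The "50 in ten minutes" rule translates fairly directly into CloudWatch alarm settings. The sketch below builds parameters in the shape the PutMetricAlarm API expects. The alarm name, namespace, and metric name are placeholders, and it assumes the webhook handler publishes one data point to that custom metric for every hard decline, which the current system may not do.

```javascript
// Builds PutMetricAlarm parameters for a hard-decline spike alarm.
function hardDeclineAlarmParams(snsTopicArn) {
  return {
    AlarmName: 'dunning-hard-decline-spike', // placeholder name
    Namespace: 'Dunning',                    // placeholder custom namespace
    MetricName: 'HardDeclines',              // placeholder custom metric
    Statistic: 'Sum',
    Period: 600,                             // ten minutes, in seconds
    EvaluationPeriods: 1,
    Threshold: 50,
    ComparisonOperator: 'GreaterThanOrEqualToThreshold',
    TreatMissingData: 'notBreaching',        // quiet periods should not alarm
    AlarmActions: [snsTopicArn],             // e.g. a topic wired to Slack or PagerDuty
  };
}
```

Since the rest of the project is provisioned with Terraform, the same values would more likely live in an alarm resource there; the object form here just makes the thresholds explicit.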

Discount offer state before cancelling VIPs

Add one more state before CancelSubscription for VIP customers: offer a one-month discount. This is low effort and high impact, and it is a small, self-contained change to the state machine definition.

What building this changed

The biggest shift wasn't technical — it was how I think about background jobs.

Before this, my mental model was: "How do I schedule retries?" Which leads to cron jobs, retry tables, and scattered logic.

After building this, the question became: "What's the lifecycle of this event?" Then model that lifecycle directly as a state machine.

Once you see a failed payment as its own workflow rather than a row in a retry queue, the whole design becomes cleaner. The complexity lives in the workflow definition, not in scattered application code.

Is it worth building?

Stripe's defaults are designed for the average business. But your customers aren't average — a VIP paying $500/month deserves a different recovery experience than a trial user on day one.

This system pays for itself the first time a VIP customer's card is successfully retried on Day 7 — a retry that Stripe's default schedule might have already given up on.

If you're running subscriptions at any meaningful scale, go check your Stripe dashboard. Look at how many hard declines you're retrying, and how many VIPs you're losing on the same schedule as trial users. The data is usually enough to make the case for building something better.

The full project — Terraform, all four Lambdas, the Step Functions definition, and the demo trigger script — is on GitHub: https://github.com/ajithmanmu/dunning-system

I'm an AWS Community Builder focused on serverless and subscription infrastructure. If you found this useful, follow for more posts on building production systems on AWS.
