guo king

Posted on Jun 16 • Originally published at spec-coding.dev

'Retry on Error' Is Not a Payment Spec — Write the Failure Matrix Instead

#payments #api #architecture #backend

Most payment specs I've reviewed describe the happy path in three pages and the failure behavior in one sentence: "retry on error."

In my experience, that sentence is where most of the production incidents come from.

A payment workflow spec earns its keep by naming — category by category — exactly what the system does when the card network, the issuer, or the customer refuses to cooperate. Here's the structure I use.

Start with a failure taxonomy, not a flowchart

Before drawing a single box, force the spec to answer one question: what are the categories of failure this workflow can produce? I insist on five, because collapsing them into "error" is what creates the mess:

Network timeout. The processor never answered. The charge may or may not exist. This is the only category where retrying with the same idempotency key is mandatory.
Soft decline. The issuer said no for a recoverable reason — insufficient funds, do-not-honor, expired card. Retry is allowed, but only with customer action or a later attempt window.
Hard decline. Stolen card, pickup card, fraud. Retrying is never correct, and on some networks it increases your risk score. The spec must forbid it.
Fraud review. The processor accepted the auth but punted on a decision. The response is async. The spec must describe the waiting state and the webhook that ends it.
3DS challenge. The issuer demanded Strong Customer Authentication. This is not a failure — it's a branch. The customer sees a redirect or iframe, and the workflow pauses until they finish.

Every downstream decision — retry policy, user messaging, observability — hangs off this five-row matrix. If the taxonomy is wrong, nothing below it will save you.
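As a rough illustration, the matrix can live in code as one policy row per category. This is a minimal sketch: the names `FailureCategory`, `Policy`, and `FAILURE_MATRIX` are hypothetical, and the policy wording simply paraphrases the five categories above.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    NETWORK_TIMEOUT = "network_timeout"
    SOFT_DECLINE = "soft_decline"
    HARD_DECLINE = "hard_decline"
    FRAUD_REVIEW = "fraud_review"
    THREE_DS_CHALLENGE = "three_ds_challenge"


@dataclass(frozen=True)
class Policy:
    immediate_retry: bool  # may the system retransmit right away?
    terminal: bool         # True when no further attempt is ever allowed
    next_step: str


# Hypothetical encoding of the five-row matrix; adjust the wording to your spec.
FAILURE_MATRIX = {
    FailureCategory.NETWORK_TIMEOUT: Policy(
        True, False, "retransmit with the same idempotency key"),
    FailureCategory.SOFT_DECLINE: Policy(
        False, False, "wait for customer action or a later attempt window"),
    FailureCategory.HARD_DECLINE: Policy(
        False, True, "flag the payment method and never retry"),
    FailureCategory.FRAUD_REVIEW: Policy(
        False, False, "hold in a waiting state until the webhook arrives"),
    FailureCategory.THREE_DS_CHALLENGE: Policy(
        False, False, "pause until the customer finishes the challenge"),
}

# Fail loudly at import time if a category is added without a policy row.
missing = set(FailureCategory) - set(FAILURE_MATRIX)
assert not missing, f"no policy for: {missing}"
```

Keeping the completeness check next to the table means a sixth category cannot ship without someone deciding what the system does with it.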

Retry rules that match the category, not the HTTP status

The rule I write into every payment spec, word for word: retry policy is a function of failure category, not of HTTP status.

A 402 from Stripe can be either a soft decline you should let the customer fix or a hard decline you should never touch again. The spec has to branch on the processor's decline code, not the transport code.

Concretely:

card_declined + decline_code: insufficient_funds → up to three retries spaced by the dunning schedule, each gated on a customer action or a scheduled job.
card_declined + decline_code: stolen_card → permanent flag on the payment method; any subsequent attempt fails closed before hitting the network.
Connection error or 5xx with no response body → immediate retry with the same Idempotency-Key, because the processor may have already charged the card and a fresh key would double-charge.
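The three rules above can be sketched as a single decision function. This is illustrative only: `retry_decision` and its return labels are made up for the example, only the two decline codes named above are mapped, and treating every unknown code as "hold for review" is my assumption about a safe default.

```python
from typing import Optional

# Only the two decline codes discussed above; a real spec lists every code
# the processor documents.
SOFT_DECLINE_CODES = {"insufficient_funds"}
HARD_DECLINE_CODES = {"stolen_card"}


def retry_decision(http_status: Optional[int], decline_code: Optional[str]) -> str:
    """Pick a retry action from the failure category, not the HTTP status alone.

    http_status is None when the connection failed before any response arrived.
    """
    # No answer, or a 5xx: the charge may already exist, so reuse the key.
    if http_status is None or http_status >= 500:
        return "retry_same_idempotency_key"
    if decline_code in HARD_DECLINE_CODES:
        return "flag_payment_method_and_fail_closed"
    if decline_code in SOFT_DECLINE_CODES:
        return "schedule_dunning_retry"
    # Assumption: anything unrecognized is held rather than retried blindly.
    return "hold_for_review"
```

Note that the same 402 status produces two different outcomes depending on the decline code, which is the whole point of the rule.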

Idempotency keys belong to the attempt, not the request

The single most common mistake I see: idempotency keys scoped to the HTTP request instead of the logical attempt.

If a timeout triggers a retry and the retry generates a new key, the processor treats it as a new charge. The spec must say, in one sentence: the key is minted when the attempt begins and survives every retransmission inside that attempt. A new attempt — a new customer action, a new dunning cycle, a new order — gets a new key. Nothing in between does.

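Also spec the key's lifetime. Processors typically honor keys for about 24 hours, but confirm the exact window with yours. If a retry crosses that boundary, the key may no longer deduplicate anything, so reconcile against the processor's ledger first (for example, list charges by metadata) instead of assuming the retry is safe.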
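One way to picture attempt-scoped keys, as a minimal sketch: the `Attempt` class and `next_step` helper are hypothetical, and the 24-hour constant is an assumption to verify against your processor.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

KEY_LIFETIME = timedelta(hours=24)  # assumption: check your processor's docs


@dataclass
class Attempt:
    """One logical attempt. The key is minted once, when the attempt begins."""
    order_id: str
    started_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    idempotency_key: str = field(default_factory=lambda: str(uuid.uuid4()))


def next_step(attempt: Attempt, now: datetime) -> str:
    """Decide what a retransmission inside this attempt is allowed to do."""
    if now - attempt.started_at < KEY_LIFETIME:
        return "retransmit_same_key"
    # Past the key's lifetime the processor may not deduplicate any more.
    return "reconcile_with_processor_first"
```

Every retransmission reads `attempt.idempotency_key` instead of generating its own, so a timeout can never mint a second key by accident. Only constructing a new `Attempt` produces a new key.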

3DS is a first-class state, not an error

If the spec treats SCA as an error, the frontend will do something stupid like show a red banner while the issuer is mid-challenge.

The spec needs a state called requires_action (or whatever your processor calls it) with explicit transitions: entered when the auth returns a challenge URL, exited when the webhook confirms success or failure.

Spec the two flavors separately:

In-flow challenge: the client SDK mounts the iframe, blocks interaction, resolves.
Redirect challenge: the browser navigates to the issuer's ACS URL and comes back to a return URL you control.

Nail down the return URL, the query params you expect, and what happens if the customer closes the tab mid-challenge. That last case always gets forgotten, and it's the one that produces stuck subscriptions in production.
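The transitions can be written down as a small lookup table. This is a sketch under assumptions: the event names and the `abandoned` state are hypothetical, and mapping a closed tab to a timeout event is one possible answer to the question above, not the only one.

```python
# (current state, event) -> next state. Event names are illustrative.
TRANSITIONS = {
    ("processing", "challenge_required"): "requires_action",
    # Landing on the return URL alone changes nothing; the webhook decides.
    ("requires_action", "customer_returned"): "requires_action",
    ("requires_action", "webhook_succeeded"): "succeeded",
    ("requires_action", "webhook_failed"): "failed",
    # Assumption: a closed tab surfaces as a timeout fired by a scheduled job.
    ("requires_action", "challenge_timed_out"): "abandoned",
}


def transition(state: str, event: str) -> str:
    """Return the next state, rejecting any transition the spec does not list."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state!r} on {event!r}") from None
```

Because unlisted pairs raise, a frontend cannot move a payment from `requires_action` to an error state just because the challenge is taking a while.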

Auth, capture, and the seven-day cliff

If your workflow does auth-now / capture-later, the spec must call out auth expiry. Most processors auto-void an uncaptured auth at roughly seven days (Stripe is 7, Adyen varies by scheme, some schemes are shorter for debit).

The spec needs to answer: what happens if the fulfillment job runs on day 8? My answer is always the same — require a fresh auth before capture attempts beyond day 5, and treat any capture against an expired auth as a hard failure that opens a new authorization, not a retry.

Multi-capture makes this worse. Spec the partial-capture order, whether over-capture is permitted (usually not), and how refund-before-final-capture interacts with the remaining authorized amount. I've watched a team discover at 2 a.m. that their "simple" refund reduced the capturable balance to zero and killed the next shipment.
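The day-5 and day-8 rules could be expressed roughly like this. It is a sketch, not a processor integration: `capture_action` and its labels are hypothetical, and the 7-day expiry constant is an assumption that should come from your processor's documentation.

```python
from datetime import datetime, timedelta, timezone

REAUTH_AFTER = timedelta(days=5)  # the spec's own safety margin
AUTH_EXPIRY = timedelta(days=7)   # assumption: varies by processor and scheme


def capture_action(authorized_at: datetime, now: datetime) -> str:
    """Decide how a fulfillment job should treat an existing authorization."""
    age = now - authorized_at
    if age >= AUTH_EXPIRY:
        # Expired: record a hard failure and open a new authorization.
        return "fail_and_open_new_authorization"
    if age > REAUTH_AFTER:
        # Still valid on paper, but too close to the cliff to trust.
        return "reauthorize_then_capture"
    return "capture"


auth_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(capture_action(auth_time, auth_time + timedelta(days=8)))
```

The job that runs on day 8 never reaches the capture call, which is what turns the expiry from a production surprise into a specified branch.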

Dunning is a state machine — write it down

For subscription failures, the spec should contain the dunning schedule verbatim, not a vague "we will retry":

Attempt	When	Side effect
1	Immediate, at renewal	—
2	+3 days	Silent retry
3	+7 days	Email sent 24h before the attempt
4	+14 days	Final-notice email
Cancel	+21 days	Access revoked at next cycle boundary

Each transition is a row in a state table: previous state, trigger, new state, side effects. Without this table the team re-argues the schedule every quarter.
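The schedule above translates almost directly into data. In this sketch `DunningStep`, `SCHEDULE`, and `step_after` are hypothetical names, and I am assuming the offsets count days from the renewal date rather than from the previous attempt, which the spec should state explicitly.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class DunningStep:
    label: str
    offset_days: int  # assumption: days since the renewal date
    side_effect: Optional[str]


SCHEDULE = (
    DunningStep("attempt_1", 0, None),
    DunningStep("attempt_2", 3, "silent retry"),
    DunningStep("attempt_3", 7, "email sent 24h before the attempt"),
    DunningStep("attempt_4", 14, "final-notice email"),
    DunningStep("cancel", 21, "access revoked at next cycle boundary"),
)


def step_after(renewal: date, completed: int) -> Optional[Tuple[DunningStep, date]]:
    """Return the next step and its due date after `completed` steps have run."""
    if completed >= len(SCHEDULE):
        return None  # schedule exhausted, nothing left to do
    step = SCHEDULE[completed]
    return step, renewal + timedelta(days=step.offset_days)
```

With the schedule stored as data, changing it is a reviewed diff to one tuple instead of a hunt through retry code.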

The webhook is the source of truth

Non-negotiable clause: the synchronous response from the processor is advisory. The webhook is the ledger.

Forbid any state transition derived only from the API response — capture confirmed, refund settled, dispute opened, 3DS completed all wait for the corresponding event.

Concrete consequence: the spec needs an outbox or a reconciliation job. If the webhook is delayed, the UI shows "processing" longer than the customer expects. The spec owns that tradeoff and picks a timeout after which a job polls the processor directly. I pick 30 seconds for interactive flows, 15 minutes for background ones.
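The fallback rule is small enough to sketch in a few lines. `should_poll` and the flow labels are hypothetical; the two timeouts are the ones I named above, not processor defaults.

```python
from datetime import timedelta

# Timeouts from the spec: how long to wait for the webhook before polling.
WEBHOOK_TIMEOUT = {
    "interactive": timedelta(seconds=30),
    "background": timedelta(minutes=15),
}


def should_poll(flow: str, waited: timedelta, webhook_received: bool) -> bool:
    """True once the webhook is overdue and a job should ask the processor."""
    if webhook_received:
        return False
    return waited >= WEBHOOK_TIMEOUT[flow]
```

The polling job confirms state when the webhook is late, but the state transition still comes from the processor's record rather than from the original synchronous response.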

Acceptance criteria with a real retry scenario

Given a customer with a Visa ending 4242 and a recurring $29 subscription
When the renewal charge returns card_declined / insufficient_funds
Then the payment is marked past_due
  And attempt 2 is scheduled for +3 days with the same payment method
  And no email is sent on this attempt
  And the customer retains access until the grace period expires

Given the client receives a connection timeout on charge creation
When the client retries within 24 hours
Then it reuses the original Idempotency-Key
  And the processor returns the original charge, not a duplicate

Three metrics the spec has to name

Authorization rate broken down by BIN range and card scheme.
Decline-reason distribution with the processor's raw decline_code preserved — not bucketed into "declined".
3DS drop-off: challenges initiated vs challenges completed.

Miss any of these and you're flying blind the first time a single issuer changes its risk model and tanks your approval rate overnight. I also require a dashboard for webhook lag — the gap between the processor's event timestamp and your ingestion timestamp. A growing lag is usually the earliest signal that the payment pipeline is about to page someone.
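For illustration, three of these measurements reduce to very little code. The function names are hypothetical and the inputs are assumed to come from your own event store, with timezone-aware timestamps.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable


def decline_distribution(decline_codes: Iterable[str]) -> Counter:
    """Count raw decline codes without bucketing them into 'declined'."""
    return Counter(decline_codes)


def three_ds_drop_off(initiated: int, completed: int) -> float:
    """Fraction of 3DS challenges that were started but never completed."""
    if initiated == 0:
        return 0.0
    return (initiated - completed) / initiated


def webhook_lag_seconds(event_created_at: datetime, ingested_at: datetime) -> float:
    """Gap between the processor's event timestamp and our ingestion time."""
    return (ingested_at - event_created_at).total_seconds()


created = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
ingested = datetime(2024, 1, 1, 12, 0, 42, tzinfo=timezone.utc)
print(webhook_lag_seconds(created, ingested))  # 42.0
```

Authorization rate by BIN range follows the same pattern: group by the raw attribute first, and aggregate only at the dashboard layer.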

The takeaway

A payment spec is not "describe the charge endpoint." It's a failure-handling document with a small happy path attached. Get the taxonomy right, attach a retry rule to each row, treat 3DS as a branch instead of an error, and let the webhook be the source of truth.

This is adapted from the full guide on Spec Coding, where we maintain spec templates and browser-based generators for API contracts, database migrations, and payment workflows.
