Not a buzzword. Orchestration is a routing layer + a normalized payment model + a reconciliation spine. Here's the engineering view — and an honest build-vs-buy.
- Payment orchestration is three things, not one: a provider-agnostic API, a routing layer (retries, cascades, least-cost / best-auth-rate decisions), and a reconciliation spine that ties every attempt back to money that actually moved.
- You don't need it on day one. You need it the day your codebase grows its third `if provider == "X"` branch, or the day Finance asks "why don't these two reports match?"
- "We'll just add acquirer #2 ourselves" is the classic trap: the easy 20% (a second adapter) is visible; the hard 80% (idempotency across providers, webhook fan-in, one settlement model, dispute plumbing) shows up six months later.
- Build-vs-buy is a real decision with real numbers. This post gives you the component checklist, a provider-agnostic interface sketch, a decision matrix, and a migration path that doesn't bet the checkout on a big-bang cutover.
The symptom: your payment code is branching on the provider
Here's the smell. Somewhere in your codebase there's a function that started innocent:
def charge(order, provider):
    if provider == "acquirer_a":
        return acquirer_a.create_payment(order.amount, order.currency, order.card)
    elif provider == "acquirer_b":
        # different field names, cents vs decimal, different 3DS flow...
        return acquirer_b.charge({"sum": order.amount * 100, "cur": order.currency})  # ...plus more fields
    elif provider == "wallet_x":
        # redirect-based, no card at all, async callback
        ...
Every new payment method or geography adds a branch. The branches leak into refunds, into webhook handlers, into reconciliation, into your fraud checks. Soon "add a payment provider" is a two-sprint project and nobody wants to touch the file.
That branching is the thing orchestration removes. Not by magic — by forcing a normalized payment model at the boundary and pushing every provider's quirks into an adapter behind it.
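As a rough illustration of that boundary, here is a minimal Python sketch. The two adapters, their payload field names, and the `Charge` shape are hypothetical stand-ins that mirror the branches above, not any real provider's API, and the major-unit conversion assumes a two-decimal currency.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Protocol


@dataclass(frozen=True)
class Charge:
    """Normalized charge: the only shape business logic ever sees."""
    amount_minor: int   # integer minor units (e.g. cents)
    currency: str       # ISO 4217 code, e.g. "EUR"
    card_token: str


class ProviderAdapter(Protocol):
    def build_request(self, charge: Charge) -> dict: ...


class AcquirerAAdapter:
    # Hypothetical provider that wants a decimal string in major units.
    def build_request(self, charge: Charge) -> dict:
        major = Decimal(charge.amount_minor) / Decimal(100)  # assumes a 2-decimal currency
        return {"amount": f"{major:.2f}", "currency": charge.currency, "card": charge.card_token}


class AcquirerBAdapter:
    # Hypothetical provider that wants minor units under different field names.
    def build_request(self, charge: Charge) -> dict:
        return {"sum": charge.amount_minor, "cur": charge.currency, "token": charge.card_token}


ADAPTERS: dict[str, ProviderAdapter] = {
    "acquirer_a": AcquirerAAdapter(),
    "acquirer_b": AcquirerBAdapter(),
}


def build_provider_request(charge: Charge, provider_id: str) -> dict:
    # A lookup, not an if/elif chain: adding a provider adds an entry, not a branch.
    return ADAPTERS[provider_id].build_request(charge)
```

The business logic only ever constructs a `Charge`; everything provider-shaped lives in one class per provider.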
If you have exactly one PSP and no plans to add another, you don't have this problem yet. Don't build orchestration for a problem you don't have. The rest of this post is about recognizing when you do have it.
What "orchestration" actually is — the five components
Strip the marketing and a payment orchestration layer is five concrete things. If a "platform" gives you only the first one, you've bought a thin proxy, not orchestration.
| Component | What it owns | What breaks without it |
|---|---|---|
| 1. Normalized payment model | One Charge / Refund / Payout shape; one currency representation; one status vocabulary; one error taxonomy | Provider quirks leak into business logic; every consumer re-learns each provider |
| 2. Routing layer | Which provider gets this transaction; retries; cascades on soft declines; least-cost / best-auth-rate decisioning; circuit breakers | One acquirer outage = checkout down; no recovery of "issuer was flaky for 400ms" declines |
| 3. Retry & cascade engine | Idempotency keys across providers; bounded re-attempts; decline-code awareness; resume-after-3DS | Double charges; infinite retry storms; cascading a stolen_card (don't) |
| 4. Webhook fan-in | Normalizing N providers' async callbacks into one event stream; de-dup; ordering; replay | You write N webhook handlers, each with its own retry semantics; missed/duplicate events |
| 5. Reconciliation spine | Joining attempts → authorizations → captures → settlement files → ledger; flagging mismatches | Finance reports that don't tie out; disputes you can't evidence; silent revenue leakage |
Components 1–3 are what most people picture. Components 4 and 5 are where the real engineering is, and they're the ones a DIY "we just added an adapter" effort almost always skips. A second adapter is a week. A reconciliation spine that survives a contested chargeback eight months later is not.
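To make component 5 less abstract, here is a toy Python sketch of the core join. It assumes captures and parsed settlement rows share a provider reference, and the record shapes and bucket names are invented for illustration. Real settlement files are typically batched and net of fees, which this deliberately ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:            # what our ledger believes happened
    provider_ref: str
    amount_minor: int
    currency: str


@dataclass(frozen=True)
class SettlementRow:      # what a parsed settlement file says was paid out
    provider_ref: str
    amount_minor: int
    currency: str


def reconcile(captures: list[Capture], rows: list[SettlementRow]) -> dict[str, list[str]]:
    """Join captures to settlement rows on provider_ref and bucket the mismatches."""
    settled = {r.provider_ref: r for r in rows}
    captured_refs = {c.provider_ref for c in captures}
    report: dict[str, list[str]] = {
        "matched": [], "amount_mismatch": [], "missing_settlement": [], "unknown_settlement": [],
    }
    for c in captures:
        row = settled.get(c.provider_ref)
        if row is None:
            report["missing_settlement"].append(c.provider_ref)
        elif (row.amount_minor, row.currency) != (c.amount_minor, c.currency):
            report["amount_mismatch"].append(c.provider_ref)
        else:
            report["matched"].append(c.provider_ref)
    # Money the provider says it moved that we have no record of.
    report["unknown_settlement"] = [r.provider_ref for r in rows if r.provider_ref not in captured_refs]
    return report
```

Every bucket other than `matched` is a question Finance will eventually ask, so the point of the spine is to raise it before they do.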
Code: the provider-agnostic boundary
The heart of component 1 is an interface every provider adapter implements. Keep it small and money-shaped:
// The normalized model: providers adapt TO this, not the other way around.
type Money = { amountMinor: bigint; currency: string }; // always minor units, no floats

interface PaymentProvider {
  readonly id: string;
  authorize(req: AuthorizeRequest): Promise<AuthorizeResult>;
  capture(authId: string, amount: Money, idemKey: string): Promise<CaptureResult>;
  refund(captureId: string, amount: Money, idemKey: string): Promise<RefundResult>;
  // Async truth arrives via webhooks, normalized to the same event type:
  parseWebhook(raw: HttpRequest): NormalizedEvent | null;
}

type AuthorizeResult =
  | { status: "approved"; authId: string; providerRef: string }
  | { status: "declined"; code: DeclineCode; retryable: boolean } // taxonomy is OURS
  | { status: "action_required"; kind: "3ds_challenge" | "redirect"; url: string }
  | { status: "error"; transient: boolean }; // network/5xx ≠ decline
Two design rules that save you later:
- Money is always minor units in an integer/bigint. No `amount * 100` scattered across adapters. The conversion happens once, in the adapter, on the way in and out.
- The decline-code taxonomy belongs to you, not to a provider. Each adapter maps that provider's `91` / `do_not_honor` / `try_again_later` onto your enum and sets `retryable`. The routing layer never sees a raw provider code.
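Both rules can be sketched in a few lines of Python. The currency exponent table, the `DeclineCode` members, the raw-code table for a single provider, and the choice of which codes count as retryable are all assumptions made for illustration; your own taxonomy and policy will differ.

```python
from decimal import Decimal
from enum import Enum

# Assumed exponents for illustration; a real table would cover every supported currency.
CURRENCY_EXPONENT = {"EUR": 2, "USD": 2, "JPY": 0, "KWD": 3}


def to_minor(amount: str, currency: str) -> int:
    """Convert a decimal string in major units to integer minor units, exactly."""
    scaled = Decimal(amount).scaleb(CURRENCY_EXPONENT[currency])
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} has too many decimal places for {currency}")
    return int(scaled)


class DeclineCode(Enum):
    ISSUER_UNAVAILABLE = "issuer_unavailable"
    DO_NOT_HONOR = "do_not_honor"
    STOLEN_CARD = "stolen_card"
    UNKNOWN = "unknown"


# Our policy, not the provider's: which normalized codes are worth another attempt.
RETRYABLE = {DeclineCode.ISSUER_UNAVAILABLE}

# Hypothetical raw codes for one provider; each adapter would own its own table.
ACQUIRER_A_CODES = {
    "91": DeclineCode.ISSUER_UNAVAILABLE,
    "try_again_later": DeclineCode.ISSUER_UNAVAILABLE,
    "do_not_honor": DeclineCode.DO_NOT_HONOR,
    "43": DeclineCode.STOLEN_CARD,
}


def normalize_decline(raw_code: str) -> tuple[DeclineCode, bool]:
    # Unmapped codes become UNKNOWN and are treated as non-retryable.
    code = ACQUIRER_A_CODES.get(raw_code, DeclineCode.UNKNOWN)
    return code, code in RETRYABLE
```

For contrast, `int(19.99 * 100)` evaluates to `1998` with binary floats, which is exactly the kind of off-by-one-cent bug a single exact conversion site removes.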
And the routing layer that sits on top is, at its simplest, a scored candidate list — not a black box:
def pick_candidates(txn) -> list[Provider]:
    scored = []
    for p in eligible_providers(txn):        # currency, method, scheme, geography
        if circuit_open(p):                  # provider in cooldown? skip
            continue
        score = (
            0.55 * rolling_auth_rate(p, txn.bin_country, txn.amount_band)  # what works
            + 0.30 * (1 - normalized_cost(p, txn))                         # what's cheap
            + 0.15 * health_score(p)                                       # latency/errors
        )
        scored.append((score, p))
    # Sort on the score only, so tied scores never fall back to comparing Provider objects.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [p for _, p in ranked][:MAX_HOPS]  # bounded!

# Then: try candidate 0; on a *retryable* decline, cascade to candidate 1; stop at
# MAX_HOPS or a wall-clock deadline; never cascade a hard decline (insufficient_funds,
# stolen_card, ...). One idempotency key per *cascade*, reused per hop; see the
# idempotency-keys post in this series.
That's the whole "AI routing" mystique demystified for the simple case: features → score → ordered list → bounded cascade, with every decision logged so you can answer "why did this $4,000 transaction go to acquirer C?" A model can replace the weighted sum later; the explainability requirement doesn't go away when it does.
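The bounded cascade described in the comment above might look roughly like this in Python. The `Attempt` shape, the `authorize` callback signature, and the default hop count and deadline are assumptions for the sketch; a real engine would also persist each attempt rather than append to an in-memory log.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    status: str              # "approved" | "declined" | "error"
    retryable: bool = False  # set by the adapter from our own decline taxonomy


# Hypothetical provider call signature: (provider_id, idempotency_key) -> Attempt
AuthorizeFn = Callable[[str, str], Attempt]


def run_cascade(candidates: list[str], authorize: AuthorizeFn,
                max_hops: int = 3, deadline_s: float = 8.0,
                log: Optional[list] = None) -> Optional[Attempt]:
    idem_key = str(uuid.uuid4())           # one key per cascade, reused on every hop
    started = time.monotonic()
    result = None
    for hop, provider_id in enumerate(candidates[:max_hops]):
        if time.monotonic() - started > deadline_s:
            break                          # wall-clock budget exhausted
        result = authorize(provider_id, idem_key)
        if log is not None:
            log.append((hop, provider_id, idem_key, result.status))
        if result.status == "approved" or not result.retryable:
            break                          # success, or a hard decline: never cascade
    return result
```

Because the key is generated once outside the loop, every logged hop of a cascade shares it, which is also what makes per-cascade attribution possible later.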
When a single PSP is fine — and when you've outgrown it
Adding orchestration has a cost (a new layer in your most critical path). Use it when the pain is real, not aspirational.
| Situation | Single PSP is fine | You've outgrown it |
|---|---|---|
| Geographies | One market, local cards | Multiple regions, local methods (iDEAL, PIX, UPI, SEPA Instant…) |
| Volume | Low enough that a 0.5–2 pp auth-rate gap is noise | High enough that 1 pp of auth rate is a meaningful revenue line |
| Resilience | An hour of PSP downtime is survivable | PSP downtime = direct revenue loss / SLA breach |
| Cost | Interchange-plus is whatever it is | Routing by cost/scheme would save real money at your volume |
| Method sprawl | Cards (+ maybe one wallet) | A growing matrix of methods × providers, each with its own webhook |
| Org | One team owns payments end-to-end | Finance, fraud, and product all consume payment data and it must agree |
A useful litmus test: count the `if provider ==` branches and the distinct webhook handlers. One of each: you're fine. Three or more: orchestration will pay for itself, either as something you build deliberately or something you buy.
Build-vs-buy, honestly
This is a genuine fork, and the honest answer is "it depends — here's on what."
| Dimension | Build it yourself | Buy / adopt a platform |
|---|---|---|
| Upfront engineering | ~6–18 eng-months for a real one (model + routing + retries + webhook fan-in + reconciliation), not the 1-month adapter | Integration weeks, not months |
| Ongoing maintenance | Permanent: new provider quirks, scheme mandates (3DS, SCA, network tokens, VoP…), reconciliation edge cases | Mostly absorbed by the vendor; you track their changes |
| Compliance / PCI scope | You may pull more card data into scope unless you're careful with tokenization | Often reduces your scope (vaulting, redirect/iframe) — verify per vendor |
| Acquirer contracts | You negotiate and hold every contract | Either you bring your own acquirers (BYO-acquiring) or use theirs |
| Control & differentiation | Total control; routing logic can be a competitive edge | Less control over the deep internals; you depend on roadmap |
| The hidden 80% | Idempotency across providers, webhook ordering/de-dup, dispute evidence, settlement-file parsing, ledger truth | Should be solved already — make this an evaluation question, not an assumption |
Rule of thumb: build it if payment routing is core to your product's economics or differentiation (you're a marketplace, a PSP, a platform whose margin lives in routing) and you can fund the ongoing team. Otherwise the maintenance tail — not the initial build — is what makes "buy" win. Either way, the component checklist above is your spec: if you buy, score vendors against all five; if you build, don't ship 1–3 and call it done.
Anti-patterns
- Orchestration as a thin proxy. A normalized API in front of two providers with no reconciliation spine = you built the easy 20% and named it after the hard part. The first contested chargeback exposes it.
- Routing without circuit breakers. "Best auth rate" routing that keeps hammering a provider mid-incident turns one provider's outage into your outage.
- Per-attempt idempotency keys. Regenerating the key on every attempt defeats the point: a retried HTTP call to the same provider can now create a second authorization. One key per cascade, reused on every hop.
- Floating-point money. `amount * 100` in three adapters with three rounding behaviors. Minor units, integers, one conversion site.
- Measuring auth rate per attempt. Cascades make per-attempt auth rate look worse and routing changes look better than reality. Attribute per cascade.
- Big-bang cutover. Moving 100% of checkout to a new orchestration layer over a weekend. Don't.
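The per-attempt versus per-cascade point is easiest to see with numbers. The following Python snippet uses four made-up cascades purely as an illustration of how the two metrics diverge.

```python
# Each inner list is one cascade: the outcome of every hop, in order (made-up data).
cascades = [
    ["approved"],
    ["declined", "approved"],      # recovered on the second provider
    ["declined", "declined"],      # lost after two hops
    ["approved"],
]

attempts = [outcome for cascade in cascades for outcome in cascade]

# Per attempt: 3 approvals out of 6 provider calls.
per_attempt = attempts.count("approved") / len(attempts)

# Per cascade: 3 of the 4 customers ended up with an approved payment.
per_cascade = sum("approved" in cascade for cascade in cascades) / len(cascades)

print(f"per attempt: {per_attempt:.0%}, per cascade: {per_cascade:.0%}")
```

The customer-facing number here is 75%, while the per-attempt figure reads 50% and gets worse the more aggressively you cascade.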
Migration path: strangler-fig, not big bang
If you're adding orchestration to a live system:
- Wrap, don't replace. Put the normalized `PaymentProvider` interface in front of your existing PSP first. Same provider, new boundary. Ship that. Nothing user-visible changes.
- Route a sliver. Send 1% of eligible traffic through the new layer (still to the same provider). Watch auth rate, latency, error rate, and, critically, that reconciliation still ties out.
- Add provider #2 behind the interface. Now it's an adapter, not a branch in business logic. Route a small % to it; compare auth rates per BIN-country/amount-band.
- Turn on cascades on retryable declines, bounded (`MAX_HOPS`, deadline). Measure recovered transactions and double-auth incidents (should be ~0, caught by the reconciliation sweep).
- Move webhooks to the fan-in. One normalized event stream; retire the per-provider handlers.
- Make the ledger the source of truth. Reconciliation runs against the spine, not against one provider's dashboard. Now adding provider #3 is a checklist, not a project.
Each step is independently shippable and independently reversible. That's the point.
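For the webhook step in particular, a minimal fan-in could be sketched as below. The payload shape for `acquirer_a`, the event vocabulary, and the in-memory de-dup set are all hypothetical; a production version would verify signatures, use a durable store, and handle ordering and replay, none of which is shown here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NormalizedEvent:
    provider: str
    event_id: str     # provider-side unique id, used for de-duplication
    payment_ref: str
    kind: str         # our vocabulary, e.g. "captured", "refunded"


def parse_acquirer_a(body: bytes) -> Optional[NormalizedEvent]:
    # Hypothetical payload shape: {"id": ..., "payment": ..., "type": "payment.captured"}
    data = json.loads(body)
    kinds = {"payment.captured": "captured", "payment.refunded": "refunded"}
    kind = kinds.get(data.get("type"))
    if kind is None:
        return None                       # event type we do not consume
    return NormalizedEvent("acquirer_a", data["id"], data["payment"], kind)


class FanIn:
    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()   # in production: a durable store
        self.stream: list[NormalizedEvent] = []    # stand-in for a queue or event log

    def ingest(self, event: Optional[NormalizedEvent]) -> bool:
        """Append the event once; return False for ignored or duplicate deliveries."""
        if event is None:
            return False
        key = (event.provider, event.event_id)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.stream.append(event)
        return True
```

Each additional provider contributes one parser; the de-dup and the downstream consumers stay the same.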
Copy-this checklist
- [ ] Normalized money type — minor units, integer/bigint, converted once per adapter.
- [ ] Decline-code taxonomy is yours; each adapter maps the provider's codes onto it and sets `retryable`.
- [ ] `error` (network/5xx/timeout) is not the same as `declined` in your model.
- [ ] Routing produces a bounded candidate list (`MAX_HOPS`, wall-clock deadline).
- [ ] Every routing decision is logged with its reason (explainable, even if a model picks).
- [ ] Circuit breakers pull a provider out of rotation on error-rate spike; bleed traffic back gradually.
- [ ] One idempotency key per cascade, reused per hop — never regenerated.
- [ ] Webhook fan-in: de-dup, ordering, replay — one normalized event stream, not N handlers.
- [ ] Reconciliation spine joins attempt → auth → capture → settlement file → ledger; flags mismatches.
- [ ] Auth rate measured per cascade, not per attempt; you also watch cost per approved txn.
- [ ] If buying: vendor scored against all five components, not just the normalized API.
- [ ] If building: rollout is strangler-fig (wrap → 1% → provider #2 → cascades → webhooks → ledger), each step reversible.
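As one example from the checklist, the circuit-breaker item (the `circuit_open(p)` check in the routing sketch) could start out as simple as the following Python class. The window size, threshold, and cooldown are arbitrary illustrative defaults, and this version reopens traffic all at once after the cooldown rather than bleeding it back gradually.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens when the recent error rate crosses a threshold, then cools down."""

    def __init__(self, window: int = 20, max_error_rate: float = 0.5,
                 cooldown_s: float = 30.0, clock=time.monotonic) -> None:
        self._results: deque = deque(maxlen=window)  # True = error, False = ok
        self._max_error_rate = max_error_rate
        self._cooldown_s = cooldown_s
        self._clock = clock                          # injectable for tests
        self._opened_at = None

    def record(self, error: bool) -> None:
        self._results.append(error)
        window_full = len(self._results) == self._results.maxlen
        if window_full and sum(self._results) / len(self._results) >= self._max_error_rate:
            self._opened_at = self._clock()          # trip the breaker
            self._results.clear()                    # start a fresh window afterwards

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self._cooldown_s:
            self._opened_at = None                   # cooldown over: let traffic back in
            return False
        return True
```

One breaker per provider is the simplest arrangement; the router skips any provider whose breaker reports `is_open()`.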
*Written by the engineering team at PaynetEasy (payment orchestration & cross-border payouts infrastructure). We write about routing, reconciliation and money-movement correctness at payneteasy.com.*