Payment orchestration for engineers: what it is, when you actually need it, and build-vs-buy

Not a buzzword. Orchestration is a routing layer + a normalized payment model + a reconciliation spine. Here's the engineering view — and an honest build-vs-buy.

Payment orchestration is three things, not one: a provider-agnostic API, a routing layer (retries, cascades, least-cost / best-auth-rate decisions), and a reconciliation spine that ties every attempt back to money that actually moved.
You don't need it on day one. You need it the day your codebase grows its third `if provider == "X"` branch, or the day Finance asks "why don't these two reports match?"
"We'll just add acquirer #2 ourselves" is the classic trap: the easy 20% (a second adapter) is visible; the hard 80% (idempotency across providers, webhook fan-in, one settlement model, dispute plumbing) shows up six months later.
Build-vs-buy is a real decision with real numbers. This post gives you the component checklist, a provider-agnostic interface sketch, a decision matrix, and a migration path that doesn't bet the checkout on a big-bang cutover.

The symptom: your payment code is branching on the provider

Here's the smell. Somewhere in your codebase there's a function that started innocent:

def charge(order, provider):
    if provider == "acquirer_a":
        return acquirer_a.create_payment(order.amount, order.currency, order.card)
    elif provider == "acquirer_b":
        # different field names, cents vs decimal, different 3DS flow...
        return acquirer_b.charge({"sum": order.amount * 100, "cur": order.currency})  # ...plus more provider-specific fields
    elif provider == "wallet_x":
        # redirect-based, no card at all, async callback
        ...

Every new payment method or geography adds a branch. The branches leak into refunds, into webhook handlers, into reconciliation, into your fraud checks. Soon "add a payment provider" is a two-sprint project and nobody wants to touch the file.

That branching is the thing orchestration removes. Not by magic — by forcing a normalized payment model at the boundary and pushing every provider's quirks into an adapter behind it.

If you have exactly one PSP and no plans to add another, you don't have this problem yet. Don't build orchestration for a problem you don't have. The rest of this post is about recognizing when you do have it.

What "orchestration" actually is — the five components

Strip the marketing and a payment orchestration layer is five concrete things. If a "platform" gives you only the first one, you've bought a thin proxy, not orchestration.

Component	What it owns	What breaks without it
1. Normalized payment model	One `Charge` / `Refund` / `Payout` shape; one currency representation; one status vocabulary; one error taxonomy	Provider quirks leak into business logic; every consumer re-learns each provider
2. Routing layer	Which provider gets this transaction; retries; cascades on soft declines; least-cost / best-auth-rate decisioning; circuit breakers	One acquirer outage = checkout down; no recovery of "issuer was flaky for 400ms" declines
3. Retry & cascade engine	Idempotency keys across providers; bounded re-attempts; decline-code awareness; resume-after-3DS	Double charges; infinite retry storms; cascading a `stolen_card` (don't)
4. Webhook fan-in	Normalizing N providers' async callbacks into one event stream; de-dup; ordering; replay	You write N webhook handlers, each with its own retry semantics; missed/duplicate events
5. Reconciliation spine	Joining attempts → authorizations → captures → settlement files → ledger; flagging mismatches	Finance reports that don't tie out; disputes you can't evidence; silent revenue leakage

Components 1–3 are what most people picture. Components 4 and 5 are where the real engineering is, and they're the ones a DIY "we just added an adapter" effort almost always skips. A second adapter is a week. A reconciliation spine that survives a contested chargeback eight months later is not.
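To make component 5 less abstract, here is a minimal sketch of the core join: captures our ledger believes in versus rows parsed from a provider's settlement file. The `Capture` and `SettlementRow` shapes, the `reconcile` function, and the sample references are all hypothetical, and a real spine also has to handle fees, partial captures, FX, and multi-day settlement. Treat it as an illustration of the mismatch classes, not a complete design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    """What our own ledger believes was captured (hypothetical shape)."""
    provider: str
    provider_ref: str
    amount_minor: int
    currency: str


@dataclass(frozen=True)
class SettlementRow:
    """One parsed row of a provider settlement file (hypothetical shape)."""
    provider: str
    provider_ref: str
    amount_minor: int
    currency: str


def reconcile(captures, rows):
    """Return a sorted list of (reason, (provider, provider_ref)) mismatches."""
    ledger = {(c.provider, c.provider_ref): c for c in captures}
    settled = {(r.provider, r.provider_ref): r for r in rows}
    mismatches = []
    for key, cap in ledger.items():
        row = settled.get(key)
        if row is None:
            mismatches.append(("missing_in_settlement", key))
        elif (row.amount_minor, row.currency) != (cap.amount_minor, cap.currency):
            mismatches.append(("amount_mismatch", key))
    for key in settled.keys() - ledger.keys():
        mismatches.append(("missing_in_ledger", key))
    return sorted(mismatches)


captures = [
    Capture("acquirer_a", "A-1", 1999, "EUR"),
    Capture("acquirer_a", "A-2", 500, "EUR"),
    Capture("acquirer_b", "B-1", 4200, "USD"),
]
rows = [
    SettlementRow("acquirer_a", "A-1", 1999, "EUR"),  # ties out
    SettlementRow("acquirer_a", "A-2", 450, "EUR"),   # settled less than captured
    SettlementRow("acquirer_b", "B-9", 100, "USD"),   # money we have no record of
]
for reason, key in reconcile(captures, rows):
    print(reason, key)
```

The join key is the pair (provider, provider reference), which is why the normalized model has to carry `providerRef` on every approved attempt: without it there is nothing to join on eight months later.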

Code: the provider-agnostic boundary

The heart of component 1 is an interface every provider adapter implements. Keep it small and money-shaped:

// The normalized model — providers adapt TO this, not the other way around.
type Money = { amountMinor: bigint; currency: string }; // always minor units, no floats

interface PaymentProvider {
  readonly id: string;
  authorize(req: AuthorizeRequest): Promise<AuthorizeResult>;
  capture(authId: string, amount: Money, idemKey: string): Promise<CaptureResult>;
  refund(captureId: string, amount: Money, idemKey: string): Promise<RefundResult>;
  // Async truth arrives via webhooks, normalized to the same event type:
  parseWebhook(raw: HttpRequest): NormalizedEvent | null;
}

type AuthorizeResult =
  | { status: "approved"; authId: string; providerRef: string }
  | { status: "declined"; code: DeclineCode; retryable: boolean }   // taxonomy is OURS
  | { status: "action_required"; kind: "3ds_challenge" | "redirect"; url: string }
  | { status: "error"; transient: boolean };                        // network/5xx ≠ decline

Two design rules that save you later:

Money is always minor units in an integer/bigint. No `amount * 100` scattered through business logic. The conversion happens once per adapter, at its edge, on the way in and out.
The decline-code taxonomy belongs to you, not to a provider. Each adapter maps that provider's `91` / `do_not_honor` / `try_again_later` onto your enum and sets `retryable`. The routing layer never sees a raw provider code.
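Both rules can be sketched in a few lines of Python for an imaginary adapter. Everything named here is an assumption for illustration: the `DeclineCode` members, the `ACQUIRER_B_CODES` table, the `RETRYABLE` policy set, and the `MINOR_EXPONENT` table (a real system would cover every currency it accepts). The raw codes are written in the ISO 8583 style, but a given provider's actual codes must come from its own documentation.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class DeclineCode(Enum):
    """OUR taxonomy (illustrative subset)."""
    ISSUER_UNAVAILABLE = "issuer_unavailable"
    DO_NOT_HONOR = "do_not_honor"
    INSUFFICIENT_FUNDS = "insufficient_funds"
    STOLEN_CARD = "stolen_card"
    UNKNOWN = "unknown"


# Policy assumption for this sketch: only these are worth a cascade.
RETRYABLE = {DeclineCode.ISSUER_UNAVAILABLE}

# Hypothetical raw-code table for an imaginary "acquirer_b".
ACQUIRER_B_CODES = {
    "91": DeclineCode.ISSUER_UNAVAILABLE,
    "05": DeclineCode.DO_NOT_HONOR,
    "51": DeclineCode.INSUFFICIENT_FUNDS,
    "43": DeclineCode.STOLEN_CARD,
}

# Decimal places per currency (illustrative subset).
MINOR_EXPONENT = {"USD": 2, "EUR": 2, "JPY": 0, "KWD": 3}


@dataclass(frozen=True)
class Money:
    amount_minor: int
    currency: str


def map_decline(raw_code: str):
    """Map a raw provider code onto our enum and decide retryability."""
    code = ACQUIRER_B_CODES.get(raw_code, DeclineCode.UNKNOWN)
    return code, code in RETRYABLE


def to_provider_decimal(money: Money) -> str:
    """Minor units -> decimal string, for a provider that wants '19.99'."""
    exponent = MINOR_EXPONENT[money.currency]
    return str(Decimal(money.amount_minor).scaleb(-exponent))


def from_provider_decimal(raw: str, currency: str) -> Money:
    """Decimal string -> minor units; refuses amounts that would need rounding."""
    scaled = Decimal(raw).scaleb(MINOR_EXPONENT[currency])
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{raw} {currency} is not a whole number of minor units")
    return Money(int(scaled), currency)


print(map_decline("91"))
print(to_provider_decimal(Money(1999, "USD")))
print(from_provider_decimal("500", "JPY"))
```

Unknown raw codes fall through to `UNKNOWN` and are treated as non-retryable here, on the assumption that failing closed beats cascading a decline you don't understand. The conversion uses `Decimal` end to end, so no float ever touches an amount.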

And the routing layer that sits on top is, at its simplest, a scored candidate list — not a black box:

def pick_candidates(txn) -> list[Provider]:
    scored = []
    for p in eligible_providers(txn):           # currency, method, scheme, geography
        if circuit_open(p):                     # provider in cooldown? skip
            continue
        score = (
            0.55 * rolling_auth_rate(p, txn.bin_country, txn.amount_band)  # what works
          + 0.30 * (1 - normalized_cost(p, txn))                           # what's cheap
          + 0.15 * health_score(p)                                         # latency/errors
        )
        scored.append((score, p))
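    # Sort on the score only: with bare tuples, a tie would fall back to comparing providers.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [p for _, p in ranked][:MAX_HOPS]   # bounded!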

# Then: try candidate 0; on a *retryable* decline, cascade to candidate 1; stop at
# MAX_HOPS or a wall-clock deadline; never cascade a hard decline (insufficient_funds,
# stolen_card, ...). One idempotency key per *cascade*, reused per hop — see the
# idempotency-keys post in this series.

That's the "AI routing" mystique, demystified for the simple case: features → score → ordered list → bounded cascade, with every decision logged so you can answer "why did this $4,000 transaction go to acquirer C?" A model can replace the weighted sum later; the explainability requirement doesn't go away when it does.
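The cascade described in the comments above can be sketched as a short loop. This is one possible shape under stated assumptions: `run_cascade`, the `Result` type, and the `authorize(provider, idem_key)` callable are hypothetical names, the defaults for `max_hops` and `deadline_s` are placeholders, and the adapter is assumed to have already set `retryable` from your own taxonomy. Anything not marked retryable, including an error with an unknown outcome, stops the cascade.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Result:
    status: str              # "approved" | "declined" | "error"
    retryable: bool = False  # set by the adapter from OUR taxonomy


def run_cascade(
    candidates: List[str],
    authorize: Callable[[str, str], Result],
    max_hops: int = 3,
    deadline_s: float = 8.0,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Result, str, list]:
    """Try candidates in order; return (final result, idempotency key, hop log)."""
    idem_key = str(uuid.uuid4())   # ONE key for the whole cascade, reused per hop
    started = clock()
    log = []                       # one entry per hop, so the route is explainable
    result = Result("error")       # returned as-is if there were no candidates
    for hop, provider in enumerate(candidates[:max_hops]):
        if clock() - started > deadline_s:
            log.append((hop, provider, "skipped_deadline"))
            break
        result = authorize(provider, idem_key)
        log.append((hop, provider, result.status))
        if result.status == "approved" or not result.retryable:
            break                  # success, or a hard decline: stop here
    return result, idem_key, log


def fake_authorize(provider: str, idem_key: str) -> Result:
    """Stand-in for real adapters: A soft-declines, B approves."""
    outcomes = {
        "acquirer_a": Result("declined", retryable=True),
        "acquirer_b": Result("approved"),
    }
    return outcomes[provider]


result, key, log = run_cascade(["acquirer_a", "acquirer_b", "acquirer_c"], fake_authorize)
print(result.status, log)
```

The injected `clock` is there so the deadline branch can be tested without sleeping. The hop log is the raw material for the "why did this go to acquirer C?" question.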

When a single PSP is fine — and when you've outgrown it

Adding orchestration has a cost (a new layer in your most critical path). Use it when the pain is real, not aspirational.

Situation	Single PSP is fine	You've outgrown it
Geographies	One market, local cards	Multiple regions, local methods (iDEAL, PIX, UPI, SEPA Instant…)
Volume	Low enough that a 0.5–2 pp auth-rate gap is noise	High enough that 1 pp of auth rate is a meaningful revenue line
Resilience	An hour of PSP downtime is survivable	PSP downtime = direct revenue loss / SLA breach
Cost	Interchange-plus is whatever it is	Routing by cost/scheme would save real money at your volume
Method sprawl	Cards (+ maybe one wallet)	A growing matrix of methods × providers, each with its own webhook
Org	One team owns payments end-to-end	Finance, fraud, and product all consume payment data and it must agree

A useful litmus test: count the `if provider ==` branches and the distinct webhook handlers. With one of each, you're fine. With three or more, orchestration will pay for itself, either as something you build deliberately or something you buy.

Build-vs-buy, honestly

This is a genuine fork, and the honest answer is "it depends — here's on what."

Dimension	Build it yourself	Buy / adopt a platform
Upfront engineering	~6–18 eng-months for a real one (model + routing + retries + webhook fan-in + reconciliation), not the 1-month adapter	Integration weeks, not months
Ongoing maintenance	Permanent: new provider quirks, scheme mandates (3DS, SCA, network tokens, VoP…), reconciliation edge cases	Mostly absorbed by the vendor; you track their changes
Compliance / PCI scope	You may pull more card data into scope unless you're careful with tokenization	Often reduces your scope (vaulting, redirect/iframe) — verify per vendor
Acquirer contracts	You negotiate and hold every contract	Either you bring your own acquirers (BYO-acquiring) or use theirs
Control & differentiation	Total control; routing logic can be a competitive edge	Less control over the deep internals; you depend on roadmap
The hidden 80%	Idempotency across providers, webhook ordering/de-dup, dispute evidence, settlement-file parsing, ledger truth	Should be solved already — make this an evaluation question, not an assumption

Rule of thumb: build it if payment routing is core to your product's economics or differentiation (you're a marketplace, a PSP, a platform whose margin lives in routing) and you can fund the ongoing team. Otherwise the maintenance tail — not the initial build — is what makes "buy" win. Either way, the component checklist above is your spec: if you buy, score vendors against all five; if you build, don't ship 1–3 and call it done.

Anti-patterns

Orchestration as a thin proxy. A normalized API in front of two providers with no reconciliation spine = you built the easy 20% and named it after the hard part. The first contested chargeback exposes it.
Routing without circuit breakers. "Best auth rate" routing that keeps hammering a provider mid-incident turns one provider's outage into your outage.
Per-attempt idempotency keys. Regenerating the key on each cascade hop defeats the point — a retried HTTP call to the same provider can now create a second authorization. One key per cascade.
Floating-point money. `amount * 100` in three adapters with three rounding behaviors. Minor units, integers, one conversion site.
Measuring auth rate per attempt. Cascades make per-attempt auth rate look worse and routing changes look better than reality. Attribute per cascade.
Big-bang cutover. Moving 100% of checkout to a new orchestration layer over a weekend. Don't.
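The per-attempt versus per-cascade distinction is easy to see with arithmetic. The numbers below are made up for illustration: 100 checkouts, of which 80 approve on the first hop, 10 are recovered on a second hop, and 10 are declined on both hops. The function name `auth_rates` and the list-of-lists input shape are assumptions of this sketch.

```python
def auth_rates(cascades):
    """cascades: one list of hop outcomes per checkout, e.g. ["declined", "approved"]."""
    attempts = [outcome for cascade in cascades for outcome in cascade]
    per_attempt = sum(o == "approved" for o in attempts) / len(attempts)
    per_cascade = sum("approved" in c for c in cascades) / len(cascades)
    return per_attempt, per_cascade


# Made-up traffic: 80 first-try approvals, 10 recovered by a cascade, 10 lost.
cascades = (
    [["approved"]] * 80
    + [["declined", "approved"]] * 10
    + [["declined", "declined"]] * 10
)
per_attempt, per_cascade = auth_rates(cascades)
print(f"per attempt: {per_attempt:.2%}, per cascade: {per_cascade:.2%}")
```

Here 90 of 120 attempts approve (75%), while 90 of 100 customers get through (90%). The second number is the one the business experiences, and it is the one that should move when a routing change actually helps.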

Migration path: strangler-fig, not big bang

If you're adding orchestration to a live system:

Wrap, don't replace. Put the normalized PaymentProvider interface in front of your existing PSP first. Same provider, new boundary. Ship that. Nothing user-visible changes.
Route a sliver. Send 1% of eligible traffic through the new layer (still to the same provider). Watch auth rate, latency, error rate, and — critically — that reconciliation still ties out.
Add provider #2 behind the interface. Now it's an adapter, not a branch in business logic. Route a small % to it; compare auth rates per BIN-country/amount-band.
Turn on cascades on retryable declines, bounded (MAX_HOPS, deadline). Measure recovered transactions and double-auth incidents (should be ~0, caught by the reconciliation sweep).
Move webhooks to the fan-in. One normalized event stream; retire the per-provider handlers.
Make the ledger the source of truth. Reconciliation runs against the spine, not against one provider's dashboard. Now adding provider #3 is a checklist, not a project.

Each step is independently shippable and independently reversible. That's the point.
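For the "route a sliver" step, one common approach is deterministic bucketing: hash a stable identifier so the same customer always lands on the same side of the split, and raise the percentage without reshuffling who is already in. The sketch below assumes bucketing by a customer ID and uses a hypothetical `in_rollout` helper and salt; whether you bucket by customer, order, or merchant is a design choice this post doesn't prescribe.

```python
import hashlib


def in_rollout(customer_id: str, percent: float, salt: str = "orchestration-v1") -> bool:
    """True if this customer falls inside the first `percent` of 10,000 buckets."""
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0..9999, stable per customer
    return bucket < percent * 100                          # percent=1 -> buckets 0..99


def charge_path(customer_id: str, percent: float) -> str:
    """Pick the code path; both paths still talk to the same provider at this stage."""
    return "new_layer" if in_rollout(customer_id, percent) else "legacy"


print(charge_path("cust-42", 1))
```

Because a bucket below the 1% threshold is also below the 5% threshold, ramping up only adds customers to the new path. Rolling back is setting `percent` to 0.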

Copy-this checklist

[ ] Normalized money type — minor units, integer/bigint, converted once per adapter.
[ ] Decline-code taxonomy is yours; each adapter maps the provider's codes onto it and sets `retryable`.
[ ] `error` (network/5xx/timeout) is not the same as `declined` in your model.
[ ] Routing produces a bounded candidate list (MAX_HOPS, wall-clock deadline).
[ ] Every routing decision is logged with its reason (explainable, even if a model picks).
[ ] Circuit breakers pull a provider out of rotation on error-rate spike; bleed traffic back gradually.
[ ] One idempotency key per cascade, reused per hop — never regenerated.
[ ] Webhook fan-in: de-dup, ordering, replay — one normalized event stream, not N handlers.
[ ] Reconciliation spine joins attempt → auth → capture → settlement file → ledger; flags mismatches.
[ ] Auth rate measured per cascade, not per attempt; you also watch cost per approved txn.
[ ] If buying: vendor scored against all five components, not just the normalized API.
[ ] If building: rollout is strangler-fig (wrap → 1% → provider #2 → cascades → webhooks → ledger), each step reversible.
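As one illustration of the webhook fan-in item, here is a minimal in-memory sketch of de-dup plus out-of-order protection. The `FanIn` class, the `NormalizedEvent` fields, and the `STATUS_RANK` lifecycle are all assumptions made for this example: it presumes each provider sends a stable event ID on redelivery, and it ignores refunds and disputes. A production version would back `_seen` with a durable store and a unique constraint.

```python
from dataclasses import dataclass

# Assumed lifecycle in OUR status vocabulary; state never moves to a lower rank.
STATUS_RANK = {"authorized": 1, "captured": 2, "settled": 3}


@dataclass(frozen=True)
class NormalizedEvent:
    provider: str
    event_id: str      # provider's own event id (assumed stable across redeliveries)
    payment_ref: str
    status: str


class FanIn:
    def __init__(self):
        self._seen = set()   # (provider, event_id) pairs already ingested
        self.state = {}      # payment_ref -> latest status we trust
        self.stream = []     # append-only log of distinct events, kept for replay

    def ingest(self, ev: NormalizedEvent) -> str:
        key = (ev.provider, ev.event_id)
        if key in self._seen:
            return "duplicate"            # provider redelivery: drop it
        self._seen.add(key)
        self.stream.append(ev)
        current = self.state.get(ev.payment_ref)
        if current is not None and STATUS_RANK[ev.status] <= STATUS_RANK[current]:
            return "stale"                # arrived late: logged, state untouched
        self.state[ev.payment_ref] = ev.status
        return "applied"


fan_in = FanIn()
captured = NormalizedEvent("acquirer_a", "evt-2", "pay-1", "captured")
late_auth = NormalizedEvent("acquirer_a", "evt-1", "pay-1", "authorized")
print(fan_in.ingest(captured), fan_in.ingest(late_auth), fan_in.ingest(captured))
print(fan_in.state)
```

Stale events are still appended to the stream, so a replay sees everything that arrived, but they never move a payment backwards.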

*Written by the engineering team at PaynetEasy (payment orchestration & cross-border payouts infrastructure). We write about routing, reconciliation and money-movement correctness at payneteasy.com.*