Saga Compensation for a Payments Flow That Actually Unwinds

#eventdriven #microservices #architecture #python

Book: Event-Driven Architecture Pocket Guide
Also by me: Database Playbook: Choosing the Right Store for Every System You Build
Me: xgabriel.com | GitHub

The card charged at 02:14 UTC. The shipping service threw a 502 at 02:14:03. By 02:14:09 your retry loop had charged the same card a second time and still produced zero shipped boxes. Customer support is now telling a person they were billed $148.50 for a product that does not exist in any warehouse system you own.

That's a saga with no compensation discipline. The four steps reserve_inventory → charge_card → ship → notify look linear on a whiteboard and behave like a minefield in production: each one can fail, each one can succeed-but-time-out, and the retry that "fixes" step 3 can corrupt step 2 if step 2 and its compensation are not idempotent.

This post covers the state machine, persisted state, idempotent compensations, and the failure modes the choreography version cannot see.

Why orchestration wins this one

The saga pattern, as Hector Garcia-Molina and Kenneth Salem described it in 1987, is a sequence of local transactions where each step has a compensating action that semantically undoes it. Two ways to wire those steps together:

Choreography. Each service listens for events and emits events. InventoryReserved triggers charge_card. CardCharged triggers ship. The flow is implicit in the event topology.

Orchestration. A coordinator service owns the state machine. It tells each service what to do, waits for the reply, and decides the next step (or the rollback) based on the result.

For payments, orchestration wins for three reasons that show up the first week you run real traffic:

You need to reason about partial failure in one place. When the orchestrator's state is CARD_CHARGED and ship has just failed, the next move is one decision, not four services arguing through Kafka.
Compensations need ordering. You must refund before you release the inventory hold in some business rules, and the reverse in others. With choreography you encode that ordering in event chains. With orchestration you write a list.
Auditors ask "what happened to order 8821." The orchestrator's persisted state is the answer. Choreography forces you to reconstruct it by joining six topics.
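The "you write a list" point can be sketched as plain data. This is a minimal illustration, not part of the flow built later in this post: the policy names and the compensation_plan helper are hypothetical, and it assumes only the two compensations whose relative order the business rule cares about.

```python
# Hypothetical policies: which compensation runs first depends on the business rule.
COMPENSATION_ORDER = {
    "refund_first": ["refund_card", "release_inventory"],
    "release_first": ["release_inventory", "refund_card"],
}

# Which forward step each compensation undoes.
UNDOES = {
    "refund_card": "charge_card",
    "release_inventory": "reserve_inventory",
}

def compensation_plan(policy: str, completed: set) -> list:
    """Return the compensations to run, in policy order, for completed steps only."""
    return [c for c in COMPENSATION_ORDER[policy] if UNDOES[c] in completed]

plan = compensation_plan("refund_first", {"reserve_inventory", "charge_card"})
```

Changing the ordering for a business rule is then a one-line edit to a list, rather than a rewiring of event subscriptions.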

Chris Richardson's microservices.io saga page calls choreography "simple for simple sagas" and orchestration the move "as the number of steps grows." Payments is not a simple saga. Pick orchestration.

The state machine

Four forward steps. Three compensations (notify has none). One terminal success state, one terminal failed-and-fully-compensated state, and one terminal stuck state for the cases your humans need to look at.

from enum import Enum

class SagaState(str, Enum):
    STARTED = "started"
    INVENTORY_RESERVED = "inventory_reserved"
    CARD_CHARGED = "card_charged"
    SHIPPED = "shipped"
    NOTIFIED = "notified"          # terminal success
    COMPENSATING = "compensating"
    COMPENSATED = "compensated"    # terminal failure
    STUCK = "stuck"                # terminal manual

Forward transitions go top-to-bottom. A failure at any step transitions to COMPENSATING and runs the compensations for every step already completed, in reverse order. The terminal COMPENSATED state means every side effect has been undone. STUCK means a compensation itself failed past the retry budget, and a human owns it now.
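The transition rules in the paragraph above can be written down as a lookup table. This is a sketch under assumptions: the ALLOWED table and the can_transition helper are illustrative names, and treating STUCK as having no automated way out is one possible policy, since a human takes over from there.

```python
from enum import Enum

class SagaState(str, Enum):
    STARTED = "started"
    INVENTORY_RESERVED = "inventory_reserved"
    CARD_CHARGED = "card_charged"
    SHIPPED = "shipped"
    NOTIFIED = "notified"
    COMPENSATING = "compensating"
    COMPENSATED = "compensated"
    STUCK = "stuck"

S = SagaState

# Each non-terminal forward state can advance one step or start compensating.
ALLOWED = {
    S.STARTED: {S.INVENTORY_RESERVED, S.COMPENSATING},
    S.INVENTORY_RESERVED: {S.CARD_CHARGED, S.COMPENSATING},
    S.CARD_CHARGED: {S.SHIPPED, S.COMPENSATING},
    S.SHIPPED: {S.NOTIFIED, S.COMPENSATING},
    S.COMPENSATING: {S.COMPENSATED, S.STUCK},
    # Terminal states: the orchestrator never moves a saga out of these.
    S.NOTIFIED: set(),
    S.COMPENSATED: set(),
    S.STUCK: set(),
}

def can_transition(current: SagaState, target: SagaState) -> bool:
    """True if the orchestrator is allowed to move from current to target."""
    return target in ALLOWED[current]
```

A guard like this is cheap to call before every save, and it turns "the saga skipped a step" from a silent data bug into an immediate error.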

The compensation map for this flow:

| Forward step | Compensation |
| --- | --- |
| `reserve_inventory` | `release_inventory` |
| `charge_card` | `refund_card` |
| `ship` | `cancel_shipment` |
| `notify` | (no compensation, see below) |

Notification is the one step you do not compensate. Once an email is sent, you cannot un-send it. You either ship a follow-up correction email or accept that the customer got a "your order shipped" email a millisecond before you cancelled the shipment, and your support flow handles it. Pretending you can compensate notify is how teams end up with three "your order has been cancelled" emails per failure.

Persisting saga state

The orchestrator must survive a crash mid-saga. That means every state transition writes to durable storage before any external call. Two columns matter: the current state and the idempotency key for the in-flight call.

import uuid
from dataclasses import dataclass, field

@dataclass
class SagaInstance:
    saga_id: str
    order_id: str
    state: SagaState = SagaState.STARTED
    completed_steps: list[str] = field(default_factory=list)
    step_keys: dict[str, str] = field(default_factory=dict)
    last_error: str | None = None

def new_saga(order_id: str) -> SagaInstance:
    return SagaInstance(
        saga_id=str(uuid.uuid4()),
        order_id=order_id,
    )

step_keys holds an idempotency key per step. The first time the orchestrator calls charge_card, it generates a UUID and writes it to step_keys["charge_card"]. If the orchestrator crashes after sending the request but before recording the response, the next attempt sends the same key. The payment service deduplicates. You do not double-charge.

Schema-wise this is one row per saga in Postgres with a JSONB column for completed_steps and step_keys. The Database Playbook covers the trade-offs of putting saga state in Postgres versus a dedicated workflow store like Temporal. The short version: Postgres is fine until you have ten saga types, then you reach for the workflow engine.
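One possible shape for that row, as a rough sketch. The table name sagas, the column names, and the SagaRow and to_params helpers are all assumptions made for illustration; the DDL is held as a string and nothing here talks to a database, so wire it to whichever Postgres driver you use.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical table and column names; adjust to your schema conventions.
SAGA_DDL = """
CREATE TABLE IF NOT EXISTS sagas (
    saga_id         UUID PRIMARY KEY,
    order_id        TEXT NOT NULL,
    state           TEXT NOT NULL,
    completed_steps JSONB NOT NULL DEFAULT '[]',
    step_keys       JSONB NOT NULL DEFAULT '{}',
    last_error      TEXT,
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);
"""

@dataclass
class SagaRow:
    saga_id: str
    order_id: str
    state: str
    completed_steps: list = field(default_factory=list)
    step_keys: dict = field(default_factory=dict)
    last_error: Optional[str] = None

def to_params(row: SagaRow) -> dict:
    """Flatten a saga into bind parameters; JSONB columns travel as JSON text."""
    return {
        "saga_id": row.saga_id,
        "order_id": row.order_id,
        "state": row.state,
        "completed_steps": json.dumps(row.completed_steps),
        "step_keys": json.dumps(row.step_keys),
        "last_error": row.last_error,
    }

params = to_params(SagaRow("s1", "8821", "started",
                           ["reserve_inventory"], {"reserve_inventory": "k1"}))
```

Writing the whole row in one upsert per transition keeps the state and the idempotency keys atomic with each other, which is the property the orchestrator depends on.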

The orchestrator loop

The core loop is a dispatcher: read the current state, run the next forward step (or the next compensation), persist, repeat. Failures bubble up as exceptions and switch the orchestrator into compensation mode.

class StepFailed(Exception):
    pass

FORWARD_ORDER = [
    "reserve_inventory",
    "charge_card",
    "ship",
    "notify",
]

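def run_saga(saga: SagaInstance, services) -> SagaInstance:
    # Resume correctly after a crash: terminal sagas are done, and a
    # saga that was mid-rollback must not re-enter the forward path.
    if saga.state in (
        SagaState.NOTIFIED,
        SagaState.COMPENSATED,
        SagaState.STUCK,
    ):
        return saga
    if saga.state == SagaState.COMPENSATING:
        return compensate(saga, services)
    try:
        for step in FORWARD_ORDER:
            if step in saga.completed_steps:
                continue
            execute_step(saga, step, services)
        saga.state = SagaState.NOTIFIED
        store.save(saga)
        return saga
    except StepFailed as exc:
        saga.last_error = str(exc)
        saga.state = SagaState.COMPENSATING
        store.save(saga)
        return compensate(saga, services)

The if step in saga.completed_steps check is what makes the loop replayable. After a crash, the orchestrator reloads the saga from storage and calls run_saga again. Steps already marked complete are skipped. The first incomplete step runs with its persisted idempotency key, so a duplicate request to a downstream service is a no-op. A saga that was already COMPENSATING when the process died resumes its rollback instead of running forward steps again, and a saga in a terminal state is returned untouched.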

execute_step does the work of generating an idempotency key on first attempt, saving it before the call, and recording completion after:

def execute_step(saga, step, services):
    # Reuse the persisted key on replay; mint one only on first attempt.
    key = saga.step_keys.get(step)
    if key is None:
        key = str(uuid.uuid4())
        saga.step_keys[step] = key
        store.save(saga)  # persist the key BEFORE the remote call
    handler = getattr(services, step)
    try:
        handler(saga.order_id, idempotency_key=key)
    except Exception as exc:
        raise StepFailed(f"{step}: {exc}") from exc
    saga.completed_steps.append(step)
    saga.state = STATE_AFTER[step]
    store.save(saga)  # persist completion AFTER the call returns

STATE_AFTER is a dict mapping each step name to the SagaState it transitions into. The save before the call and the save after the call are both required. Drop either and you reintroduce the double-charge or the lost-completion bug.
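To see the two saves doing their job, here is a small crash-and-replay simulation. It is a simplified sketch, not the orchestrator itself: the saga is a plain dict, InMemoryStore and FakePayments are hypothetical stand-ins, SimulatedCrash models the process dying after the request was sent, and STATE_AFTER uses the string values of SagaState.

```python
import copy
import uuid

# Step name -> state value after the step completes (mirrors SagaState values).
STATE_AFTER = {
    "reserve_inventory": "inventory_reserved",
    "charge_card": "card_charged",
    "ship": "shipped",
    "notify": "notified",
}

class SimulatedCrash(Exception):
    """Stands in for the orchestrator process dying mid-call."""

class InMemoryStore:
    """Hypothetical stand-in for the durable store."""
    def __init__(self):
        self.rows = {}
    def save(self, saga):
        self.rows[saga["saga_id"]] = copy.deepcopy(saga)
    def load(self, saga_id):
        return copy.deepcopy(self.rows[saga_id])

class FakePayments:
    """Deduplicates on the idempotency key, then 'crashes' the caller once."""
    def __init__(self):
        self.charges = {}  # idempotency_key -> order_id
        self.crash_once = True
    def charge_card(self, order_id, idempotency_key):
        self.charges.setdefault(idempotency_key, order_id)
        if self.crash_once:
            self.crash_once = False
            raise SimulatedCrash("died before recording the response")

def execute_step(saga, step, services, store):
    key = saga["step_keys"].get(step)
    if key is None:
        key = str(uuid.uuid4())
        saga["step_keys"][step] = key
        store.save(saga)  # key is durable before the call
    getattr(services, step)(saga["order_id"], idempotency_key=key)
    saga["completed_steps"].append(step)
    saga["state"] = STATE_AFTER[step]
    store.save(saga)  # completion is durable after the call

store, payments = InMemoryStore(), FakePayments()
saga = {"saga_id": "s1", "order_id": "8821", "state": "inventory_reserved",
        "completed_steps": ["reserve_inventory"], "step_keys": {}}
store.save(saga)

try:
    execute_step(saga, "charge_card", payments, store)
except SimulatedCrash:
    pass  # the in-memory saga is lost here

replayed = store.load("s1")  # a fresh process reloads the row
execute_step(replayed, "charge_card", payments, store)
```

The payment fake ends up with exactly one charge because the replay found the key that was saved before the first call. Remove that first save and the replay mints a new UUID, which the fake treats as a second charge.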

Compensations, idempotent by construction

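Compensation runs in reverse order over every step that was attempted, meaning every step with a persisted idempotency key. That includes the step that just failed, because a call that timed out client-side may still have taken effect server-side. Each compensation is its own remote call, each gets its own idempotency key, each persists progress.

COMPENSATIONS = {
    "reserve_inventory": "release_inventory",
    "charge_card": "refund_card",
    "ship": "cancel_shipment",
    # notify has no compensation
}

def compensate(saga, services):
    # Attempted, not just completed: the failed step's outcome is
    # unknown, so its compensation must run too.
    attempted = [s for s in FORWARD_ORDER if s in saga.step_keys]
    for step in reversed(attempted):
        comp = COMPENSATIONS.get(step)
        if comp is None:
            continue
        key = saga.step_keys.setdefault(
            f"comp_{step}", str(uuid.uuid4())
        )
        store.save(saga)  # persist the key before the remote call
        try:
            getattr(services, comp)(
                saga.order_id, idempotency_key=key
            )
        except Exception as exc:
            saga.last_error = f"{comp}: {exc}"
            saga.state = SagaState.STUCK
            store.save(saga)
            return saga
    saga.state = SagaState.COMPENSATED
    store.save(saga)
    return saga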

The idempotency key for refund_card is comp_charge_card's UUID, persisted on first attempt. If the refund call times out and the orchestrator retries, the payments service sees the same key and returns the original refund. No double refunds. No "wait, you refunded the customer twice and now you're asking us to chargeback our own chargeback" support tickets.

A compensation that fails past its retry budget transitions the saga to STUCK. Stuck sagas show up on a dashboard and page someone; a human then decides whether to retry, manually reverse the side effect, or write a one-off database fix. Pretending the orchestrator can recover from every compensation failure is how you end up with sagas that loop forever and a backlog you only discover when the orchestrator's queue depth metric finally alerts at 3am.
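The compensate function above gives up on the first exception, so the retry budget has to live somewhere. One way to sketch it is a small wrapper around the remote call. The names call_with_retry and RetryBudgetExhausted, the default of three attempts, and the exponential backoff are all assumptions for illustration, not a prescribed policy.

```python
import time

class RetryBudgetExhausted(Exception):
    """Raised when every attempt failed; the saga should go to STUCK."""

def call_with_retry(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn() up to `attempts` times, backing off exponentially between tries.

    Safe only because every retry reuses the same idempotency key.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RetryBudgetExhausted(
        f"gave up after {attempts} attempts"
    ) from last_exc
```

Inside compensate, the remote call would be passed in as a zero-argument callable that closes over the persisted key. Because RetryBudgetExhausted is an ordinary Exception, the existing except branch still catches it and moves the saga to STUCK. The sleep parameter is injectable so tests do not have to wait on real backoff.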

What "idempotent compensation" actually requires

Three things, and every saga tutorial that skips one of them produces the bug above.

First, the downstream service must honour the idempotency key. Stripe does this natively with the Idempotency-Key header and a 24-hour replay window; Stripe's docs lay out the contract. Your inventory service probably needs you to build it: a table of (idempotency_key, response) rows, a uniqueness constraint, and an "if you see this key again, return the cached response" path.

Second, the orchestrator must persist the key before the call. Otherwise a crash between "I generated a key" and "I sent the request" produces a fresh key on retry and you lose the deduplication.

Third, the compensation must be safe to call when the forward step never actually took effect. A release_inventory call for an order whose reservation never persisted should return success, not 404. This is where teams get bitten: the forward step timed out client-side but did succeed server-side, the orchestrator thinks it failed, runs the compensation, and the compensation refuses because "there is nothing to release." Design compensations to be tolerant of that ambiguity. Return success for "already released" and "never reserved" alike.
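The first and third requirements can be sketched together on the downstream side. This is a toy inventory service under stated assumptions: it uses an in-memory SQLite database from the standard library in place of a real store, and the table names, response shapes, and the InventoryService class are invented for illustration.

```python
import json
import sqlite3

class InventoryService:
    """Toy service: deduplicates on idempotency key, tolerates empty releases."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        # PRIMARY KEY is the uniqueness constraint on the key.
        self.db.execute(
            "CREATE TABLE idempotency ("
            "idempotency_key TEXT PRIMARY KEY, response TEXT NOT NULL)"
        )
        self.db.execute("CREATE TABLE reservations (order_id TEXT PRIMARY KEY)")

    def _replay(self, key):
        row = self.db.execute(
            "SELECT response FROM idempotency WHERE idempotency_key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def _record(self, key, response):
        self.db.execute(
            "INSERT INTO idempotency VALUES (?, ?)", (key, json.dumps(response))
        )
        return response

    def reserve_inventory(self, order_id, idempotency_key):
        cached = self._replay(idempotency_key)
        if cached is not None:
            return cached  # same key seen before: return the original response
        with self.db:  # side effect and key commit in one transaction
            self.db.execute(
                "INSERT OR IGNORE INTO reservations VALUES (?)", (order_id,)
            )
            return self._record(idempotency_key, {"status": "reserved", "ok": True})

    def release_inventory(self, order_id, idempotency_key):
        cached = self._replay(idempotency_key)
        if cached is not None:
            return cached
        with self.db:
            cur = self.db.execute(
                "DELETE FROM reservations WHERE order_id = ?", (order_id,)
            )
            # Zero rows deleted covers "already released" and "never reserved":
            # both are success from the orchestrator's point of view.
            status = "released" if cur.rowcount else "nothing_to_release"
            return self._record(idempotency_key, {"status": status, "ok": True})

svc = InventoryService()
never_reserved = svc.release_inventory("order-1", "key-release-1")
```

Recording the response in the same transaction as the side effect is the design choice that matters: if the two can commit separately, a crash between them brings back the ambiguity the key was meant to remove. A production version would also need to handle two concurrent requests carrying the same key, which this single-connection sketch does not attempt.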

The state machine plus persisted keys plus tolerant compensations is the package. Skip any leg of the tripod and the saga that "works in staging" produces support tickets in production.

If this was useful

Event-Driven Architecture Pocket Guide is the long version of the trade-offs in this post: saga, CQRS, outbox, the dual-write problem, and the failure modes that cost real money. The Database Playbook covers where to put your saga state and why Postgres-with-JSONB outlasts more clever choices for longer than you expect.