- Book: Event-Driven Architecture Pocket Guide
- Me: xgabriel.com | GitHub
There is a class of bug that only appears in checkout flows on Sunday nights. The order is created. The inventory is reserved. The payment service times out, retries, and charges twice. The shipment label is printed for the second charge. The customer gets one box and an email saying their refund is on the way. Somewhere a junior engineer is googling "distributed transaction rollback Stripe."
A saga is what you build instead of pretending you can have a distributed transaction. It is a state machine. Each step does work in its own service. If a later step fails, earlier steps run a compensation. Not a rollback: you cannot roll back a charge or an email or a label that already printed. You issue a refund. You release the reservation. You cancel the shipment.
Chris Richardson's saga page on microservices.io has the canonical definition. Microsoft's Azure Architecture Center has the cloud-flavored version. Both agree on the same uncomfortable thing: the saga is an explicit state machine you have to write, store, and recover. Hand-waving "we'll just publish events and figure it out" is how you get the Sunday-night bug.
This post builds the canonical order saga in 200 lines of Python: place order, reserve inventory, charge payment, ship. Orchestrated, because orchestration is easier to reason about than choreography when the saga has more than three steps. Durable state, because in-memory sagas die when the pod restarts. Compensations, because that is the entire point.
Orchestration over choreography
Choreography means each service emits an event and other services react. No central coordinator. Beautiful on a slide. In production, debugging a failed order means tailing four log streams and reconstructing the order they fired in.
Orchestration means one service (the saga orchestrator) owns the state machine. It tells inventory to reserve, waits for the reply, tells payment to charge, waits for the reply, and so on. If something fails, it knows exactly which compensations to run and in what order.
Chris Richardson argues for orchestration as participant count grows, and the order saga has four. Orchestration. No drama.
The state machine
The order saga has five states on the happy path. Failures route through COMPENSATING and end in a single terminal state, CANCELLED.
PENDING -> INVENTORY_RESERVED -> PAYMENT_CHARGED
    -> SHIPPED -> COMPLETED
Failure paths (compensations in reverse):
payment fails:     INVENTORY_RESERVED -> COMPENSATING -> CANCELLED
shipping fails:    PAYMENT_CHARGED -> COMPENSATING -> CANCELLED
inventory rejects: PENDING -> CANCELLED
If payment fails, we release the inventory. If shipping fails, we refund the payment and release the inventory. If inventory itself rejects, there is nothing to compensate — we just mark the order cancelled.
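One way to keep the diagram honest is to encode the allowed transitions as data and reject anything else. The sketch below is an illustration under assumptions, not part of the orchestrator that follows: ALLOWED, check_transition, and is_terminal are hypothetical names, and the states are plain strings that mirror the State enum values.

```python
# Hypothetical transition table; state names mirror the State enum values.
ALLOWED: dict[str, set[str]] = {
    "pending": {"inventory_reserved", "cancelled"},
    "inventory_reserved": {"payment_charged", "compensating"},
    "payment_charged": {"shipped", "compensating"},
    "shipped": {"completed"},
    "compensating": {"cancelled"},
    "completed": set(),
    "cancelled": set(),
}

# A state with no outgoing edges is terminal.
TERMINAL = {state for state, nxt in ALLOWED.items() if not nxt}


def check_transition(current: str, to: str) -> None:
    """Raise ValueError if the move is not an edge in the diagram."""
    if to not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {to}")


def is_terminal(state: str) -> bool:
    return state in TERMINAL
```

Calling check_transition at the top of a transition method would turn a modeling mistake into a loud error instead of a quietly corrupted history.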
The orchestrator in 200 lines
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Callable
import json
import time
import uuid


# Raised by the participant clients; defined here so the module stands alone.
class InventoryRejected(Exception):
    pass


class PaymentFailed(Exception):
    pass


class ShippingFailed(Exception):
    pass


class State(str, Enum):
    PENDING = "pending"
    INVENTORY_RESERVED = "inventory_reserved"
    PAYMENT_CHARGED = "payment_charged"
    SHIPPED = "shipped"
    COMPLETED = "completed"
    COMPENSATING = "compensating"
    CANCELLED = "cancelled"


@dataclass
class SagaState:
    saga_id: str
    order_id: str
    customer_id: str
    items: list[dict]
    amount: int  # cents
    state: State = State.PENDING
    reservation_id: str | None = None
    charge_id: str | None = None
    shipment_id: str | None = None
    last_error: str | None = None
    history: list[dict] = field(default_factory=list)

    def transition(self, to: State, **meta) -> None:
        # Append to the audit trail before changing the current state.
        self.history.append({
            "from": self.state.value,
            "to": to.value,
            **meta,
        })
        self.state = to


class SagaStore:
    """Durable state. In production: Postgres row per saga."""

    def __init__(self, db):
        self.db = db

    def save(self, s: SagaState) -> None:
        payload = json.dumps(s.__dict__, default=str)
        # state and updated_at are real columns so the recovery worker
        # can filter on them without parsing the payload.
        self.db.execute(
            "INSERT INTO sagas (id, state, payload, updated_at) "
            "VALUES (%s, %s, %s, now()) "
            "ON CONFLICT (id) DO UPDATE SET state = EXCLUDED.state, "
            "payload = EXCLUDED.payload, updated_at = now()",
            (s.saga_id, s.state.value, payload),
        )

    def load(self, saga_id: str) -> SagaState:
        row = self.db.fetchone(
            "SELECT payload FROM sagas WHERE id = %s",
            (saga_id,),
        )
        data = json.loads(row["payload"])
        # JSON stores the enum as a plain string; restore the State member.
        data["state"] = State(data["state"])
        return SagaState(**data)


class OrderSaga:
    def __init__(self, store, inventory, payment, shipping, alerts):
        self.store = store
        self.inventory = inventory
        self.payment = payment
        self.shipping = shipping
        self.alerts = alerts

    def start(self, order_id, customer_id, items, amount) -> str:
        s = SagaState(
            saga_id=str(uuid.uuid4()),
            order_id=order_id,
            customer_id=customer_id,
            items=items,
            amount=amount,
        )
        self.store.save(s)
        self._reserve_inventory(s)
        return s.saga_id

    def _reserve_inventory(self, s: SagaState) -> None:
        try:
            res_id = self.inventory.reserve(
                s.order_id, s.items,
                idempotency_key=f"reserve:{s.saga_id}",
            )
            s.reservation_id = res_id
            s.transition(State.INVENTORY_RESERVED)
            self.store.save(s)
            self._charge_payment(s)
        except InventoryRejected as e:
            # Nothing has succeeded yet, so there is nothing to compensate.
            s.last_error = str(e)
            s.transition(State.CANCELLED, reason="inventory_rejected")
            self.store.save(s)

    def _charge_payment(self, s: SagaState) -> None:
        try:
            charge = self.payment.charge(
                s.customer_id, s.amount,
                idempotency_key=f"charge:{s.saga_id}",
            )
            s.charge_id = charge
            s.transition(State.PAYMENT_CHARGED)
            self.store.save(s)
            self._ship(s)
        except PaymentFailed as e:
            s.last_error = str(e)
            s.transition(State.COMPENSATING, step="payment")
            self.store.save(s)
            self._compensate(s)

    def _ship(self, s: SagaState) -> None:
        try:
            sid = self.shipping.create(
                s.order_id,
                idempotency_key=f"ship:{s.saga_id}",
            )
            s.shipment_id = sid
            s.transition(State.SHIPPED)
            s.transition(State.COMPLETED)
            self.store.save(s)
        except ShippingFailed as e:
            s.last_error = str(e)
            s.transition(State.COMPENSATING, step="shipping")
            self.store.save(s)
            self._compensate(s)

    def _compensate(self, s: SagaState) -> None:
        # Run compensations in reverse order of completion.
        ok = True
        if s.charge_id:
            ok = self._safe(
                lambda: self.payment.refund(
                    s.charge_id,
                    idempotency_key=f"refund:{s.saga_id}",
                ),
                s, "refund",
            ) and ok
        if s.reservation_id:
            ok = self._safe(
                lambda: self.inventory.release(
                    s.reservation_id,
                    idempotency_key=f"release:{s.saga_id}",
                ),
                s, "release",
            ) and ok
        # Only a fully compensated saga may reach CANCELLED. If any
        # compensation failed, the saga stays in COMPENSATING for a human.
        if ok and s.state != State.CANCELLED:
            s.transition(State.CANCELLED)
        self.store.save(s)

    def _safe(self, fn: Callable, s: SagaState, name: str) -> bool:
        """Retry fn up to five times; return True only if it succeeded."""
        for attempt in range(5):
            try:
                fn()
                return True
            except Exception as e:
                if attempt == 4:
                    s.last_error = f"{name}: {e}"
                    self.alerts.page(
                        f"compensation_failed:{name}",
                        saga_id=s.saga_id,
                        error=str(e),
                    )
                    self.store.save(s)
                    return False
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, 8s
        return False
That is the whole thing. Around 130 lines of orchestrator plus the state and store. Add the participant clients (InventoryClient, PaymentClient, ShippingClient) as boring HTTP wrappers and you are at 200.
The details that earn their lines:
Idempotency keys on every external call. The key is derived from the saga ID and the step name. If the saga restarts mid-step, the next attempt sends the same key. (The pod may have crashed after we sent the request but before we recorded the response.) Inventory, payment, and shipping all dedupe. Microsoft's Azure saga pattern page walks through why idempotency is the only way to make sagas safe under retries.
Durable state on every transition. The orchestrator writes to the saga store before and after every external call. If the process dies, a recovery worker reads pending sagas and resumes from the last persisted state. No state in memory. No "restart from the beginning" drift.
Compensations run in reverse completion order. Refund first (it was last to succeed), then release the reservation. Each compensation is wrapped in _safe, which retries five times with exponential backoff and pages on failure.
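To make the idempotency point concrete, here is a minimal sketch of what a participant has to do on its side. FakePaymentService is a hypothetical in-memory stand-in, not a real payment client, and the key format simply follows the step-plus-saga-ID convention used in the orchestrator.

```python
import uuid


class FakePaymentService:
    """Hypothetical in-memory participant that dedupes on idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}  # idempotency_key -> charge_id
        self.charges_made = 0

    def charge(self, customer_id: str, amount: int, *, idempotency_key: str) -> str:
        # A repeated key returns the original result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        charge_id = f"ch_{uuid.uuid4().hex[:8]}"
        self._results[idempotency_key] = charge_id
        return charge_id


svc = FakePaymentService()
key = "charge:saga-123"
first = svc.charge("cust-1", 4999, idempotency_key=key)
retry = svc.charge("cust-1", 4999, idempotency_key=key)  # e.g. after a crash
```

Both calls return the same charge ID and the customer is charged once. A real participant would keep the key-to-result mapping in durable storage rather than a dict.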
Saga vs distributed transaction
The line that gets glossed over: a saga is not a distributed transaction. There is no rollback. There is no atomic "everything happens or nothing happens" guarantee. The window between "inventory reserved" and "payment charged" is a window in which the system is observably inconsistent (inventory shows reserved, payment shows nothing), and there is no avoiding it without two-phase commit, which you do not want.
What you get instead is eventual consistency with explicit compensation. The system reaches a consistent state, but it takes time, and the consistency you reach might be "this order was cancelled" rather than "this order was placed." The compensation is the contract. If you cannot describe a compensation for a step, that step does not belong in a saga.
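The "no compensation, no saga step" rule can be pushed into the type itself. The sketch below is a deliberately stripped-down illustration: Step and run_saga are hypothetical names, there is no durability, and compensations are assumed to succeed. The point is only that a step cannot be constructed without its compensation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]  # no default: a step without one cannot exist


def run_saga(steps: list[Step]) -> str:
    done: list[Step] = []
    for step in steps:
        try:
            step.action()
            done.append(step)
        except Exception:
            # Undo only the steps that completed, most recent first.
            for finished in reversed(done):
                finished.compensate()
            return "cancelled"
    return "completed"


log: list[str] = []


def declined() -> None:
    raise RuntimeError("card declined")


outcome = run_saga([
    Step("reserve", lambda: log.append("reserve"), lambda: log.append("release")),
    Step("charge", declined, lambda: log.append("refund")),
])
```

Here the charge fails, so only the reservation is released. The refund never runs because the charge never completed.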
This is the part Chris Richardson's saga page hammers on, and it is the part most tutorials elide. The saga is a business-level guarantee, not a technical one. "If the payment fails, we release the inventory" is a thing the business agreed to.
What if compensation fails
This is the question that separates demos from production. The demo assumes refunds always succeed. Production assumes nothing.
The pattern is the one in _safe above: retry with backoff, then alert and dead-letter the compensation to a manual queue. A human looks at it. The order is in the COMPENSATING state until either the retry succeeds or someone marks it resolved. The customer gets an email. The support team has a runbook for "your refund is in flight, here is the ticket number."
What you do not do is silently mark the saga CANCELLED when the refund fails. The state machine has to reflect reality. If the refund didn't happen, the saga didn't reach CANCELLED. It stayed in COMPENSATING. The dashboard shows COMPENSATING > 1h as a red tile. Someone is paged.
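A minimal sketch of that rule, assuming hypothetical names (run_compensations and a page callback) and leaving out backoff and persistence for brevity: the function reports CANCELLED only when every compensation actually succeeded.

```python
from typing import Callable


def run_compensations(
    compensations: list[tuple[str, Callable[[], None]]],
    page: Callable[[str], None],
    attempts: int = 3,
) -> str:
    """Return 'cancelled' only if every compensation succeeded."""
    all_ok = True
    for name, fn in compensations:
        for attempt in range(attempts):
            try:
                fn()
                break
            except Exception:
                if attempt == attempts - 1:
                    # Out of retries: alert a human and keep the saga open.
                    page(f"compensation_failed:{name}")
                    all_ok = False
    return "cancelled" if all_ok else "compensating"


pages: list[str] = []


def broken_refund() -> None:
    raise RuntimeError("provider unavailable")


state = run_compensations(
    [("refund", broken_refund), ("release", lambda: None)],
    page=pages.append,
)
```

The release still runs after the refund gives up, but the saga stays in COMPENSATING and exactly one page goes out.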
The other failure mode worth naming: the compensation succeeds but you crash before recording it. The next attempt re-invokes the refund. Hence the idempotency key on the refund call. Stripe and every reasonable payment provider recognize the duplicate and return the original result.
Oracle's lock-free reservations approach is a database-level take on the same problem: reservations as first-class objects that the engine knows how to compensate. Worth reading if you are on Oracle 23ai. The pattern is the same shape; the database does some of the bookkeeping for you.
What goes in the saga store
The minimum is: saga ID, current state, payload (the immutable inputs), step IDs from external services (reservation, charge, shipment), and the transition history. The history is non-negotiable. When a customer asks why their order is in this state, you want to read a log, not reconstruct it from three services' audit tables.
In Postgres, that is one row per saga with a JSONB payload column and an index on state so the recovery worker can query WHERE state NOT IN ('completed', 'cancelled') AND updated_at < now() - interval '5 minutes'. Anything older than five minutes that is not in a terminal state is a saga that needs a human or a retry.
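The recovery worker's selection logic is small enough to sketch without a database. This is an illustration under assumptions: find_stale and next_action are hypothetical helpers, rows are plain dicts standing in for query results, and the action names are labels rather than real method calls.

```python
from datetime import datetime, timedelta, timezone

TERMINAL_STATES = {"completed", "cancelled"}
STALE_AFTER = timedelta(minutes=5)


def find_stale(rows: list[dict], now: datetime) -> list[dict]:
    """Return sagas that are not terminal and have not moved for five minutes."""
    return [
        r for r in rows
        if r["state"] not in TERMINAL_STATES
        and now - r["updated_at"] > STALE_AFTER
    ]


def next_action(state: str) -> str:
    """Map a persisted state to the step a recovery worker would re-run."""
    return {
        "pending": "reserve_inventory",
        "inventory_reserved": "charge_payment",
        "payment_charged": "ship",
        "shipped": "mark_completed",
        "compensating": "compensate",
    }[state]


now = datetime(2024, 1, 7, 23, 0, tzinfo=timezone.utc)
rows = [
    {"id": "a", "state": "payment_charged", "updated_at": now - timedelta(minutes=12)},
    {"id": "b", "state": "completed", "updated_at": now - timedelta(hours=2)},
    {"id": "c", "state": "pending", "updated_at": now - timedelta(minutes=1)},
]
todo = [(r["id"], next_action(r["state"])) for r in find_stale(rows, now)]
```

Only saga "a" is picked up, and it resumes at shipping. Re-running a step is safe only because every external call carries the same idempotency key it had the first time.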
The Sunday-night bug, revisited
There is no framework that does this for you correctly without you understanding it. Temporal, Camunda, AWS Step Functions all give you better primitives, but the modeling work is the same. The reason sagas feel hard is that distributed-system honesty is hard. Once you write the state machine down and stop pretending you have a transaction, the code is short and the bugs are findable.
The Sunday-night double-charge bug? It does not happen. The retry hits the payment service with the same idempotency key. The payment service returns the original charge. The saga moves on. The next failure mode worth modeling is the dual-write between the orchestrator and its message bus, and that is the outbox pattern, which is its own post.
If this was useful
The Event-Driven Architecture Pocket Guide covers sagas in production — orchestration vs choreography tradeoffs, the outbox pattern for the message-bus side, idempotency at every layer, and the failure modes you only hit after a few thousand orders. If the code above looks like a skeleton you'd want fleshed out into a real service, it is the longer version of this post.
