The Worlds of Distributed Systems — Chapter 6
“How do we do a safe transaction across two services?”
Most answers sound like one of these:
- “2PC…? But nobody really does that in practice.”
- “Event-driven… and eventually consistent…”
- “Write a saga with compensating transactions… (and then nobody writes them)”
In Chapter 5, we got serious about RML-2 failures:
- classify failures by world (RML-1/2/3)
- label them with structured errors (world / action)
- return Action Hints like start-compensation, retry-with-backoff, escalate-history
This chapter is the design layer on top of that:
How to structure the “in-progress conversation” of eventual consistency (RML-2), and make it retryable—via sagas.
1) From “transaction” to “conversation”
1.1 Why local transactions feel easy (RML-1)
In a single service + single DB world (RML-1), the story is simple:
- begin transaction
- update multiple tables
- commit, or rollback
Rollback means:
“Return to the pre-transaction state.”
The DB does it for you.
1.2 Why cross-service flows become a dialog (RML-2)
In distributed systems (RML-2), the flow becomes a long conversation:
- Service A calls Service B
- B updates its own DB
- B calls Service C
- something fails mid-way
Now you must answer:
- How far can we roll back as a conversation?
- Who undoes what—and how?
- Where is the RML-3 boundary (history-bound)?
The framework for pre-deciding that is a Saga.
1.3 Eventual consistency in RML terms
A useful restatement:
- RML-2: the world while you’re still converging toward consistency
- RML-3: the world after convergence, where the result is now “settled history”
Sagas are about managing the RML-2 convergence process:
- undo what can be undone (RML-2)
- escalate what can’t (RML-3)
2) A saga is a “retryable conversation”
2.1 A practical definition
In this series:
A Saga = a sequence of forward steps (T1, T2, …) plus compensations (C1, C2, …) that together form an RML-2 conversation.
Think:
- forward steps: T1, T2, T3, ...
- compensations: C1, C2, C3, ...
If the saga fails at some Tn, you run the compensations for the steps that already completed, in reverse order (C(n-1), ..., C1):
T1 → T2 → T3
↑    ↑    ↑
C1   C2   C3   (on failure, run C in reverse)
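The reverse-order rule can be sketched as a small in-memory runner. This is a minimal illustration, not a production orchestrator: the `SagaStep`, `SagaResult`, and `runSaga` names are hypothetical, steps are synchronous to keep it short (real steps would be async network calls), and it assumes a failed Tn left no effect of its own, so only the previously completed steps are compensated.

```typescript
// Hypothetical shapes for illustration; not taken from any specific library.
type SagaStep = {
  name: string;
  forward: () => void;    // Tn: may throw
  compensate: () => void; // Cn: undoes the effect of Tn
};

type SagaResult =
  | { outcome: "completed" }
  | { outcome: "compensated"; failedStep: string; compensated: string[] };

function runSaga(steps: SagaStep[]): SagaResult {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.forward();
      done.push(step);
    } catch {
      // Tn failed: undo the completed steps, newest first.
      const compensated: string[] = [];
      for (const prev of [...done].reverse()) {
        // A throwing compensation propagates to the caller in this sketch;
        // a real runner would treat that as an escalation.
        prev.compensate();
        compensated.push(prev.name);
      }
      return { outcome: "compensated", failedStep: step.name, compensated };
    }
  }
  return { outcome: "completed" };
}
```

With three steps where T3 throws, this runner calls C2 and then C1, and never calls C3.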
2.2 The most important question: where is the boundary to RML-3?
Sagas are not magic. The critical design move is naming this:
Which steps are truly reversible in RML-2, and which steps are already history-bound (RML-3)?
If you don’t draw this line, you’ll end up pretending an RML-3 action is “rollbackable,” and reality will eventually correct you—painfully.
3) Treat the saga as a state machine
A helpful worldview: a saga is a tiny world with a lifecycle.
stateDiagram-v2
    [*] --> Pending
    Pending --> Running: start
    Running --> Completed: all Tn succeed
    Running --> Compensating: error (RML2 / start-compensation)
    Compensating --> Compensated: all Cn succeed
    Compensating --> Escalated: compensation failed (RML3 / escalate-history)
    Running --> Escalated: history-bound failure detected
    Completed --> [*]
    Compensated --> [*]
    Escalated --> [*]
Mapping from Chapter 5’s error labels:
- world: "RML2", action: "start-compensation" triggers Running → Compensating
- world: "RML3", action: "escalate-history" triggers Running/Compensating → Escalated
This is the big idea:
In RML-2, structured failures aren’t “just exceptions.”
They are state machine triggers.
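One way to make "failures are triggers" concrete is a pure transition function over the lifecycle above. This is a sketch under assumptions: the event names (`all-steps-succeeded`, `all-compensations-succeeded`) and the `nextState` function are hypothetical, and events that do not apply to the current state are simply ignored, which is one possible policy among several.

```typescript
type SagaState =
  | "Pending" | "Running" | "Completed"
  | "Compensating" | "Compensated" | "Escalated";

// Hypothetical event shape; the error variant mirrors the world/action labels.
type SagaEvent =
  | { kind: "start" }
  | { kind: "all-steps-succeeded" }
  | { kind: "all-compensations-succeeded" }
  | { kind: "error"; world: "RML2" | "RML3"; action: "start-compensation" | "escalate-history" };

function nextState(state: SagaState, event: SagaEvent): SagaState {
  switch (state) {
    case "Pending":
      return event.kind === "start" ? "Running" : state;
    case "Running":
      if (event.kind === "all-steps-succeeded") return "Completed";
      if (event.kind === "error") {
        // The action hint, not the exception type, decides the transition.
        return event.action === "start-compensation" ? "Compensating" : "Escalated";
      }
      return state;
    case "Compensating":
      if (event.kind === "all-compensations-succeeded") return "Compensated";
      if (event.kind === "error" && event.action === "escalate-history") return "Escalated";
      return state;
    default:
      // Completed, Compensated, and Escalated are terminal in this sketch.
      return state;
  }
}
```

Keeping the function pure makes the lifecycle easy to test, and the same function can be reused by an orchestrator or by a projection over an event stream.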
4) Designing compensations: don’t chase “perfect reversal”
4.1 “Return to the exact prior state” is often impossible
A common misunderstanding:
“Compensation must restore the exact original state.”
In practice, many effects can’t be fully erased:
- the user already saw the email
- shipping started
- a downstream system recorded something immutable
- time passed; state changed elsewhere
So don’t optimize for perfect reversal. Optimize for:
A business-acceptable end state.
Examples:
- “Cancel after shipping started” → refund + return workflow + explanation
- “Revoke points after order cancellation” → subtract points + record why
4.2 The Reserve → Confirm pattern (a clean RML-2 → RML-3 boundary)
A common safe pattern:
- T1: Reserve (reversible, RML-2)
- T2: Confirm (often history-bound, RML-3 boundary)
- C1: CancelReserve (undo reserve)
T1: Reserve resource ←→ C1: Cancel reservation
T2: Confirm / Settle (RML-3 boundary)
Why it’s powerful:
- fail before confirm → compensation stays inside RML-2
- fail after confirm → treat as RML-3 (refund/correction/explanation)
This pattern makes “history entry points” explicit, instead of accidental.
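A minimal in-memory sketch of Reserve → Confirm for inventory is shown here. The `Inventory` class and its method names are hypothetical and only illustrate the shape of the pattern: `cancelReserve` works while the reservation is still in RML-2 and refuses once `confirm` has crossed the boundary.

```typescript
type ReservationStatus = "reserved" | "confirmed" | "cancelled";

class Inventory {
  private available: number;
  private reservations = new Map<string, { qty: number; status: ReservationStatus }>();

  constructor(initial: number) {
    this.available = initial;
  }

  // T1: reversible hold (RML-2). Repeating the same id is a no-op.
  reserve(id: string, qty: number): void {
    if (this.reservations.has(id)) return;
    if (qty > this.available) throw new Error("insufficient stock");
    this.available -= qty;
    this.reservations.set(id, { qty, status: "reserved" });
  }

  // C1: undo the hold. Safe to call more than once.
  cancelReserve(id: string): void {
    const r = this.reservations.get(id);
    if (!r || r.status === "cancelled") return;
    if (r.status === "confirmed") {
      // Past the boundary: this is no longer a compensation problem.
      throw new Error("already confirmed: handle as RML-3 (refund/correction)");
    }
    r.status = "cancelled";
    this.available += r.qty;
  }

  // T2: the explicit history entry point.
  confirm(id: string): void {
    const r = this.reservations.get(id);
    if (!r || r.status !== "reserved") throw new Error("no active reservation");
    r.status = "confirmed";
  }

  get stock(): number {
    return this.available;
  }
}
```

The design choice worth noting is that the refusal in `cancelReserve` is loud: a caller that tries to "roll back" a confirmed reservation gets an error that points at the RML-3 path instead of a silent success.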
5) Orchestration vs choreography
There are two classic implementation styles.
5.1 Orchestration (central conductor)
flowchart LR
    O[Orchestrator] --> A[Service A]
    O --> B[Service B]
    O --> C[Service C]
Pros
- saga state is centralized and visible
- the flow is easy to change
- “interpret Action Hints” logic fits naturally in one place
Cons
- orchestrator tends to become a god object
- if the orchestrator is down, many sagas stall
5.2 Choreography (event-driven coordination)
flowchart LR
    subgraph Bus[Event Bus]
        E[(Topic / Stream)]
    end
    A[Service A] --> E
    B[Service B] --> E
    C[Service C] --> E
Pros
- higher service autonomy
- scales well
Cons
- end-to-end flow is harder to “see”
- tracing “where the saga is stuck” becomes non-trivial
RML framing:
- Orchestration → orchestrator is the “center of RML-2”
- Choreography → event stream / effect log becomes the “center of RML-2”
Either way, decide early:
Where is saga state observable as a first-class object?
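In a choreographed setup, one possible way to make saga state a first-class object is to project it from the event stream. The sketch below assumes a hypothetical event record shape (`SagaEventRecord`) and assumes events arrive in order; neither comes from a specific broker or library.

```typescript
// Hypothetical event record; field names are assumptions for illustration.
type SagaEventRecord = {
  sagaId: string;
  stepId: string;
  type: "step-done" | "step-failed" | "step-compensated";
  at: string; // ISO-8601 timestamp
};

type SagaView = {
  sagaId: string;
  done: string[];
  compensated: string[];
  failedStep?: string;
  lastUpdate: string;
};

// Folds an ordered event list into one queryable view per saga.
function projectSagas(events: SagaEventRecord[]): Map<string, SagaView> {
  const views = new Map<string, SagaView>();
  for (const e of events) {
    const view: SagaView =
      views.get(e.sagaId) ??
      { sagaId: e.sagaId, done: [], compensated: [], lastUpdate: e.at };
    if (e.type === "step-done") view.done.push(e.stepId);
    else if (e.type === "step-compensated") view.compensated.push(e.stepId);
    else view.failedStep = e.stepId;
    view.lastUpdate = e.at;
    views.set(e.sagaId, view);
  }
  return views;
}
```

With a view like this, "where is the saga stuck?" becomes a lookup by `sagaId` rather than a search across service logs.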
6) Implementation lifelines: IDs, idempotency keys, logs, and time
6.1 You need a Saga ID and Step IDs
At minimum:
- every saga has a sagaId
- every step has a stepId
If you lack IDs, incident response becomes “grep and guess.”
6.2 Idempotency keys are the lifeline of retries
If you ever return Action Hints like:
- retry-local
- retry-with-backoff
…then you implicitly promised:
“This operation can be retried without duplicating harmful effects.”
That promise is fragile unless you enforce:
Retry must reuse the same idempotency key.
A simple logging structure helps keep the world honest:
type ActionHint =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";

type SagaStepLog = {
  sagaId: string;
  stepId: string;
  status: "pending" | "done" | "compensating" | "compensated" | "failed";
  world: "RML1" | "RML2" | "RML3";
  action?: ActionHint;
  errorCode?: string;
  idempotencyKey?: string;
  timestamp: string;
};
Example: derive a stable idempotency key per step:
type ChargeRequest = { paymentId: string; amount: number; currency: string };
type PaymentGateway = {
  charge: (req: ChargeRequest, opts: { idempotencyKey: string }) => Promise<void>;
};

// (Assume `RmlError` is the structured error type from Chapter 5.)
async function chargePayment(
  sagaId: string,
  req: ChargeRequest,
  paymentGateway: PaymentGateway
) {
  // Derived only from stable inputs, so every retry of this step reuses the same key.
  const idempotencyKey = `saga:${sagaId}:charge`;
  try {
    await paymentGateway.charge(req, { idempotencyKey });
  } catch (e) {
    // Simplification: every failure is labeled retryable here. Real code should
    // inspect `e` and return retry-* only for genuinely transient errors.
    throw new RmlError({
      world: "RML2",
      severity: "error",
      action: "retry-with-backoff",
      code: "PAYMENT_TEMPORARY_ERROR",
      message: "Payment gateway temporary error",
      cause: e,
    });
  }
}
Rule of thumb (worth enforcing in review):
If you return retry-*, you must return (or deterministically derive) the idempotency key as part of the contract.
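The receiving side of that contract can be sketched with an in-memory stand-in for a gateway. `InMemoryGateway` and `stepKey` are hypothetical names, and a real gateway would persist the key-to-result mapping durably; the point is only that a retry carrying the same key gets the recorded result back instead of producing a second charge.

```typescript
type ChargeResult = { chargeId: string; amount: number };

// In-memory stand-in for illustration; not a real payment API.
class InMemoryGateway {
  private seen = new Map<string, ChargeResult>();
  private counter = 0;

  charge(amount: number, idempotencyKey: string): ChargeResult {
    const prior = this.seen.get(idempotencyKey);
    if (prior) return prior; // replayed request: return the recorded result
    this.counter += 1;
    const result: ChargeResult = { chargeId: `ch_${this.counter}`, amount };
    this.seen.set(idempotencyKey, result);
    return result;
  }

  // Number of charges that actually took effect.
  get chargeCount(): number {
    return this.counter;
  }
}

// Deterministic key: the same saga and step always yield the same key.
function stepKey(sagaId: string, stepId: string): string {
  return `saga:${sagaId}:${stepId}`;
}
```

Calling `charge` twice with `stepKey("s1", "charge")` leaves `chargeCount` at 1, while a key generated fresh on each attempt (for example a random UUID per retry) would defeat the mechanism.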
6.3 Timeouts and heartbeats: sagas are long-lived
Real sagas get stuck:
- Running for hours
- Compensating for 30 minutes
- a worker died mid-step
- an event stream consumer lagged
So you need:
- heartbeats (progress timestamps)
- alerts for “stuck” states
- a governance rule like: “If a saga is stuck long enough, escalate to RML-3.”
With RML tags, observability can become clean:
- world=RML2 status=running lastUpdate>1h → stuck saga alert
- world=RML2 status=compensating lastUpdate>30m → compensation-stall alert
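Those two alert rules could be expressed as a small check over heartbeat records. This is a sketch: the `SagaHeartbeat` shape and `findStuck` are hypothetical, and the thresholds simply reuse the 1h and 30m values from the examples above rather than recommending them.

```typescript
type SagaStatus = "running" | "compensating" | "completed" | "compensated" | "escalated";

// Hypothetical heartbeat record: the last time the saga made progress.
type SagaHeartbeat = { sagaId: string; status: SagaStatus; lastUpdate: Date };

// Only in-flight states have a limit; terminal states are never "stuck".
const STUCK_AFTER_MS: Partial<Record<SagaStatus, number>> = {
  running: 60 * 60 * 1000,      // 1h
  compensating: 30 * 60 * 1000, // 30m
};

function findStuck(sagas: SagaHeartbeat[], now: Date): SagaHeartbeat[] {
  return sagas.filter((s) => {
    const limit = STUCK_AFTER_MS[s.status];
    return limit !== undefined && now.getTime() - s.lastUpdate.getTime() > limit;
  });
}
```

Passing `now` in as a parameter keeps the check deterministic, which makes the governance rule easy to unit test.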
7) The saga boundary: “from here on, it’s RML-3”
The most dangerous belief is:
“If we try hard enough, we can handle everything inside RML-2.”
No. You must explicitly define what belongs to RML-3:
- money movement (often RML-3 by default)
- externally delivered notifications (hard to undo)
- legally meaningful records (invoices, healthcare records, etc.)
A practical policy set might look like:
- “Funds settlement is RML-3”
- “Email/SMS is RML-2.5-ish; the compensation is a correction notice, not a deletion”
- “Inventory should stay RML-2 via reserve/confirm/release where possible”
This enables an important architectural move:
“This step is too dangerous inside a saga. Split it into an RML-3 workflow.”
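A policy set like this can live in code as a lookup table, so the boundary is reviewed like any other change. The sketch below is an assumption-laden illustration: the `EffectKind` and `Remedy` names are hypothetical, and notifications are filed under RML2 with a correction-notice remedy as one way to encode the "RML-2.5-ish" case.

```typescript
// Hypothetical categories; a real system would list its own effects.
type EffectKind = "inventory-hold" | "notification" | "funds-settlement";
type Remedy = "undo" | "correction-notice" | "rml3-workflow";

const effectPolicy: Record<EffectKind, { world: "RML2" | "RML3"; remedy: Remedy }> = {
  "inventory-hold": { world: "RML2", remedy: "undo" },
  // Delivered messages cannot be deleted, only corrected.
  "notification": { world: "RML2", remedy: "correction-notice" },
  "funds-settlement": { world: "RML3", remedy: "rml3-workflow" },
};

// True when the effect may run as an ordinary, compensable saga step.
function allowedInsideSaga(kind: EffectKind): boolean {
  return effectPolicy[kind].world === "RML2";
}
```

A saga builder could call `allowedInsideSaga` at definition time and reject a step that tries to settle funds inline.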
8) Common saga anti-patterns
8.1 Hidden side effects inside steps
A step T2 quietly triggers an irreversible external API call, but C2 never handles it.
Result:
- “We compensated” becomes a lie.
- Evidence is missing, so debugging becomes archaeology.
Countermeasure:
- inventory all side effects per step
- classify them by world (RML-1/2/3)
- ensure RML-2 effects are handled by compensation somewhere
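The countermeasure can be turned into a mechanical check over a declared side-effect inventory. This is a sketch with hypothetical names (`Effect`, `StepDecl`, `auditEffects`); it only verifies what steps declare, so it cannot catch an effect nobody wrote down.

```typescript
type Effect = {
  name: string;
  world: "RML1" | "RML2" | "RML3";
  handledBy?: string; // id of the compensation that covers this effect
};

type StepDecl = { stepId: string; effects: Effect[] };

function auditEffects(steps: StepDecl[]): { uncompensated: string[]; historyBound: string[] } {
  const uncompensated: string[] = [];
  const historyBound: string[] = [];
  for (const step of steps) {
    for (const e of step.effects) {
      const label = `${step.stepId}:${e.name}`;
      // RML-2 effects must be covered by some compensation.
      if (e.world === "RML2" && !e.handledBy) uncompensated.push(label);
      // RML-3 effects are listed so they get an explicit follow-up workflow.
      if (e.world === "RML3") historyBound.push(label);
    }
  }
  return { uncompensated, historyBound };
}
```

Running this in CI against the saga definition gives reviewers a concrete list to discuss instead of relying on memory.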
8.2 Using compensation to deny RML-3 reality
If the business reality is history-bound (payment settled, goods shipped), pretending it’s “still RML-2” leads to:
- engineers thinking “we rolled back”
- business/legal seeing “we did not”
Countermeasure:
- define an acceptable end state as RML-3 (refund/correction/explanation)
- treat compensation as “conversation repair,” not “history erasure”
8.3 No visibility into saga state
“No saga IDs, no step logs, we’ll grep logs later.”
That’s how incident response fails.
Countermeasure:
- require sagaId + stepId in every log/event at minimum
- keep a single “state machine picture” (like above) in docs
9) Saga design checklist (RML-2 edition)
Business / semantics
- [ ] What single “conversation” does this saga represent?
- [ ] Which steps are reversible in RML-2?
- [ ] Which steps are history-bound (RML-3)?
- [ ] For each step, what is an acceptable end state?
Technical
- [ ] Each T* has a corresponding C* (or an explicit reason why it can’t)
- [ ] sagaId / stepId exist everywhere
- [ ] retry-* implies stable idempotency keys
- [ ] Where is saga state observable: orchestrator or stream?
Operations / observability
- [ ] Logs/traces include world / severity / action
- [ ] Alerts exist for “stuck Running/Compensating”
- [ ] RML3 escalation reliably enters incident/case workflows
Closing — sagas are the tool for accepting RML-2 honestly
Sagas and compensations are not a magic rollback machine.
They don’t:
- turn RML-2 into RML-1
- erase the need to acknowledge RML-3
They do something more realistic:
They help you accept that RML-2 is a world of ongoing dialog, and design the conversation so it’s retryable.
- rewind what can be rewound (as conversation)
- admit what can’t (as history)
- share the boundary via errors, logs, APIs, and org processes
Next:
Chapter 7 — API / Client Design: how (and whether) to expose RML labels to callers, and how to standardize retry and reconciliation behavior across clients and SDKs.