kanaria007

Posted on Mar 2 • Originally published at zenn.dev

Chapter 6 — Sagas & Compensating Transactions: Building “Retryable Conversations”

#distributedsystems #microservices #sre #architecture

The Worlds of Distributed Systems — Chapter 6

“How do we do a safe transaction across two services?”

Most answers sound like one of these:

“2PC…? But nobody really does that in practice.”
“Event-driven… and eventually consistent…”
“Write a saga with compensating transactions… (and then nobody writes them)”

In Chapter 5, we got serious about RML-2 failures:

classify failures by world (RML-1/2/3)
label them with structured errors (world / action)
return Action Hints like start-compensation, retry-with-backoff, escalate-history

This chapter is the design layer on top of that:

How to structure the “in-progress conversation” of eventual consistency (RML-2), and make it retryable—via sagas.

1) From “transaction” to “conversation”

1.1 Why local transactions feel easy (RML-1)

In a single service + single DB world (RML-1), the story is simple:

begin transaction
update multiple tables
commit, or rollback

Rollback means:

“Return to the pre-transaction state.”

The DB does it for you.

1.2 Why cross-service flows become dialog (RML-2)

In distributed systems (RML-2), the flow becomes a long conversation:

Service A calls Service B
B updates its own DB
B calls Service C
something fails mid-way

Now you must answer:

How far can we roll back as a conversation?
Who undoes what—and how?
Where is the RML-3 boundary (history-bound)?

The framework for pre-deciding that is a Saga.

1.3 Eventual consistency in RML terms

A useful restatement:

RML-2: the world while you’re still converging toward consistency
RML-3: the world after convergence, where the result is now “settled history”

Sagas are about managing the RML-2 convergence process:

undo what can be undone (RML-2)
escalate what can’t (RML-3)

2) A saga is a “retryable conversation”

2.1 A practical definition

In this series:

A Saga = a sequence of forward steps (T1, T2, …) plus compensations (C1, C2, …) that together form an RML-2 conversation.

Think:

forward steps: T1, T2, T3, ...
compensations: C1, C2, C3, ...

If the saga fails at some Tn, you run the compensations for the steps already completed, in reverse order (and Cn too, if Tn may have partially applied):

T1 → T2 → T3
↑    ↑    ↑
C1   C2   C3   (on failure, run C in reverse)
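The ordering above can be sketched as a small runner. This is a minimal sketch rather than a production executor: `SagaStep`, `SagaOutcome`, and `runSaga` are hypothetical names introduced for illustration, and the code assumes a failed step leaves no partial effect of its own, so only completed steps are compensated.

```typescript
// Hypothetical shapes for illustration; not taken from any specific library.
type SagaStep = {
  stepId: string;
  forward: () => Promise<void>; // Tn
  compensate: () => Promise<void>; // Cn
};

type SagaOutcome =
  | { status: "completed" }
  | { status: "compensated"; failedStepId: string }
  | { status: "escalated"; failedStepId: string; stuckCompensation: string };

async function runSaga(steps: SagaStep[]): Promise<SagaOutcome> {
  const completed: SagaStep[] = [];

  for (const step of steps) {
    try {
      await step.forward();
      completed.push(step);
    } catch {
      // Assumption: the failed step left no partial effect, so only the
      // steps that already completed are compensated, newest first.
      for (const done of completed.slice().reverse()) {
        try {
          await done.compensate();
        } catch {
          // A failed compensation cannot be ignored: hand it to a human/RML-3 path.
          return {
            status: "escalated",
            failedStepId: step.stepId,
            stuckCompensation: done.stepId,
          };
        }
      }
      return { status: "compensated", failedStepId: step.stepId };
    }
  }
  return { status: "completed" };
}
```

With three steps where T3 throws, this runs T1, T2, then C2, C1, and reports `compensated`. Retries are deliberately left out here to keep the ordering visible.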

2.2 The most important question: where is the boundary to RML-3?

Sagas are not magic. The critical design move is naming this:

Which steps are truly reversible in RML-2, and which steps are already history-bound (RML-3)?

If you don’t draw this line, you’ll end up pretending an RML-3 action is “rollbackable,” and reality will eventually correct you—painfully.

3) Treat the saga as a state machine

A helpful worldview: a saga is a tiny world with a lifecycle.

stateDiagram-v2
  [*] --> Pending
  Pending --> Running: start
  Running --> Completed: all Tn succeed
  Running --> Compensating: error (RML2 / start-compensation)
  Compensating --> Compensated: all Cn succeed
  Compensating --> Escalated: compensation failed (RML3 / escalate-history)
  Running --> Escalated: history-bound failure detected
  Completed --> [*]
  Compensated --> [*]
  Escalated --> [*]

Mapping from Chapter 5’s error labels:

world: "RML2", action: "start-compensation" → Running → Compensating
world: "RML3", action: "escalate-history" → Running/Compensating → Escalated

This is the big idea:

In RML-2, structured failures aren’t “just exceptions.”
They are state machine triggers.
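The diagram can be written down as an explicit transition function. The sketch below is one possible encoding, not the series' reference implementation: the `SagaLifecycleEvent` shape and the `transition` name are assumptions, and anything the diagram does not draw (for example a retry while Compensating) is treated as illegal.

```typescript
type SagaState =
  | "Pending"
  | "Running"
  | "Completed"
  | "Compensating"
  | "Compensated"
  | "Escalated";

// Hypothetical event shape: failures carry the Chapter 5 world/action labels.
type SagaLifecycleEvent =
  | { kind: "start" }
  | { kind: "all-steps-succeeded" }
  | { kind: "all-compensations-succeeded" }
  | {
      kind: "failure";
      world: "RML2" | "RML3";
      action: "start-compensation" | "escalate-history";
    };

function transition(state: SagaState, event: SagaLifecycleEvent): SagaState {
  switch (state) {
    case "Pending":
      if (event.kind === "start") return "Running";
      break;
    case "Running":
      if (event.kind === "all-steps-succeeded") return "Completed";
      if (event.kind === "failure") {
        return event.action === "start-compensation" ? "Compensating" : "Escalated";
      }
      break;
    case "Compensating":
      if (event.kind === "all-compensations-succeeded") return "Compensated";
      if (event.kind === "failure" && event.action === "escalate-history") {
        return "Escalated";
      }
      break;
  }
  // Completed, Compensated, and Escalated are terminal; everything else
  // not drawn in the diagram is rejected rather than silently accepted.
  throw new Error(`illegal transition from ${state} on ${event.kind}`);
}
```

Rejecting undrawn transitions is a design choice: it turns "the saga did something the picture does not show" into a loud error instead of a quiet state drift.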

4) Designing compensations: don’t chase “perfect reversal”

4.1 “Return to the exact prior state” is often impossible

A common misunderstanding:

“Compensation must restore the exact original state.”

In practice, many effects can’t be fully erased:

the user already saw the email
shipping started
a downstream system recorded something immutable
time passed; state changed elsewhere

So don’t optimize for perfect reversal. Optimize for:

A business-acceptable end state.

Examples:

“Cancel after shipping started” → refund + return workflow + explanation
“Revoke points after order cancellation” → subtract points + record why

4.2 The Reserve → Confirm pattern (a clean RML-2 → RML-3 boundary)

A common safe pattern:

T1: Reserve (reversible, RML-2)
T2: Confirm (often history-bound, RML-3 boundary)
C1: CancelReserve (undo reserve)

T1: Reserve resource        ←→  C1: Cancel reservation
T2: Confirm / Settle (RML-3 boundary)

Why it’s powerful:

fail before confirm → compensation stays inside RML-2
fail after confirm → treat as RML-3 (refund/correction/explanation)

This pattern makes “history entry points” explicit, instead of accidental.
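A small in-memory model can show where the boundary sits. This is an illustrative sketch only: the `Inventory` class and its method names are hypothetical, and a real service would persist reservations and expire them after a timeout.

```typescript
type ReservationStatus = "reserved" | "confirmed" | "cancelled";

class Inventory {
  private available: number;
  private reservations = new Map<string, { qty: number; status: ReservationStatus }>();

  constructor(initialStock: number) {
    this.available = initialStock;
  }

  // T1: reversible hold (RML-2). Replays of the same reservationId are ignored.
  reserve(reservationId: string, qty: number): void {
    if (this.reservations.has(reservationId)) return;
    if (qty > this.available) throw new Error("insufficient stock");
    this.available -= qty;
    this.reservations.set(reservationId, { qty, status: "reserved" });
  }

  // C1: undo the hold. Safe to call twice; refuses once the boundary is crossed.
  cancel(reservationId: string): void {
    const r = this.reservations.get(reservationId);
    if (!r || r.status === "cancelled") return;
    if (r.status === "confirmed") {
      throw new Error("already confirmed: history-bound, handle as RML-3");
    }
    this.available += r.qty;
    r.status = "cancelled";
  }

  // T2: crossing into settled history.
  confirm(reservationId: string): void {
    const r = this.reservations.get(reservationId);
    if (!r || r.status === "cancelled") throw new Error("no active reservation");
    r.status = "confirmed";
  }

  stock(): number {
    return this.available;
  }
}
```

The useful property is that `cancel` fails loudly after `confirm`. The code itself refuses to pretend that a confirmed reservation is still reversible, which forces the caller onto the refund/correction path.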

5) Orchestration vs choreography

There are two classic implementation styles.

5.1 Orchestration (central conductor)

flowchart LR
  O[Orchestrator] --> A[Service A]
  O --> B[Service B]
  O --> C[Service C]

Pros

saga state is centralized and visible
the flow is easy to change
“interpret Action Hints” logic fits naturally in one place

Cons

orchestrator tends to become a god object
if the orchestrator is down, many sagas stall
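To make "interpret Action Hints in one place" concrete, here is a sketch of the decision function an orchestrator might own. The `OrchestratorMove` type, the `decideNextMove` name, the retry limits, and the choice to compensate once retries are exhausted are all assumptions made for this example; the meaning given to `abort` (stop without compensating) is likewise a guess, not something this chapter defines.

```typescript
type ActionHint =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";

// Hypothetical set of moves the orchestrator can make next.
type OrchestratorMove =
  | { kind: "retry"; delayMs: number }
  | { kind: "compensate" }
  | { kind: "escalate" }
  | { kind: "stop" };

function decideNextMove(
  hint: ActionHint,
  attempt: number, // 1 for the first failed attempt
  maxAttempts = 3,
  baseDelayMs = 200,
): OrchestratorMove {
  switch (hint) {
    case "retry-local":
      return attempt < maxAttempts ? { kind: "retry", delayMs: 0 } : { kind: "compensate" };
    case "retry-with-backoff":
      // Exponential backoff: 200ms, 400ms, ... with the defaults above.
      return attempt < maxAttempts
        ? { kind: "retry", delayMs: baseDelayMs * 2 ** (attempt - 1) }
        : { kind: "compensate" };
    case "start-compensation":
      return { kind: "compensate" };
    case "escalate-history":
      return { kind: "escalate" };
    case "abort":
      return { kind: "stop" };
  }
}
```

Keeping this as a pure function makes the policy testable without running any service, which also limits how much logic the orchestrator accumulates elsewhere.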

5.2 Choreography (event-driven coordination)

flowchart LR
  subgraph Bus[Event Bus]
    E[(Topic / Stream)]
  end
  A[Service A] --> E
  B[Service B] --> E
  C[Service C] --> E

Pros

higher service autonomy
scales well: no single coordinator sits in the path of every step

Cons

end-to-end flow is harder to “see”
tracing “where the saga is stuck” becomes non-trivial

RML framing:

Orchestration → orchestrator is the “center of RML-2”
Choreography → event stream / effect log becomes the “center of RML-2”

Either way, decide early:

Where is saga state observable as a first-class object?
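In a choreographed design, one way to get that first-class object is to fold the event stream into a per-saga view. The sketch below assumes a hypothetical `StepEvent` shape and that events arrive in order; real streams would also need deduplication and ordering guarantees.

```typescript
// Hypothetical event shape published by each service.
type StepEvent = {
  sagaId: string;
  stepId: string;
  type: "step-succeeded" | "step-failed" | "step-compensated";
  at: string; // ISO-8601 timestamp
};

type SagaView = {
  sagaId: string;
  done: string[];
  compensated: string[];
  failedStep?: string;
  lastUpdate: string;
};

// Fold the stream into one view per saga (assumes events are in order).
function projectSagas(events: StepEvent[]): Map<string, SagaView> {
  const views = new Map<string, SagaView>();
  for (const e of events) {
    const view: SagaView = views.get(e.sagaId) ?? {
      sagaId: e.sagaId,
      done: [],
      compensated: [],
      lastUpdate: e.at,
    };
    if (e.type === "step-succeeded") view.done.push(e.stepId);
    if (e.type === "step-failed") view.failedStep = e.stepId;
    if (e.type === "step-compensated") view.compensated.push(e.stepId);
    view.lastUpdate = e.at;
    views.set(e.sagaId, view);
  }
  return views;
}

// Steps that succeeded but are not yet compensated, newest first.
function awaitingCompensation(view: SagaView): string[] {
  if (view.failedStep === undefined) return [];
  return view.done.filter((id) => view.compensated.indexOf(id) === -1).reverse();
}
```

`awaitingCompensation` answers the "where is the saga stuck" question directly: a non-empty result on a saga whose `lastUpdate` is old is a compensation that never happened.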

6) Implementation lifelines: IDs, idempotency keys, logs, and time

6.1 You need a Saga ID and Step IDs

At minimum:

every saga has a sagaId
every step has a stepId

If you lack IDs, incident response becomes “grep and guess.”
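One way to make the IDs hard to forget is to route every log line through a helper that requires them. This is a minimal sketch with hypothetical names (`SagaLogContext`, `sagaLog`, `linesForSaga`); it emits JSON lines, and a real system would likely attach the same two fields to traces and events as well.

```typescript
type SagaLogContext = { sagaId: string; stepId: string };

// Every line carries sagaId and stepId; extra fields cannot overwrite them
// because the IDs are spread last.
function sagaLog(
  ctx: SagaLogContext,
  message: string,
  fields: Record<string, string> = {},
): string {
  return JSON.stringify({ ...fields, sagaId: ctx.sagaId, stepId: ctx.stepId, message });
}

// Incident response becomes a filter instead of a guess.
function linesForSaga(lines: string[], sagaId: string): string[] {
  return lines.filter((line) => {
    try {
      return JSON.parse(line).sagaId === sagaId;
    } catch {
      return false; // not a structured line; cannot be attributed to a saga
    }
  });
}
```

Unstructured lines are dropped by `linesForSaga`, which is the point: anything that cannot be tied to a `sagaId` is invisible during an incident.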

6.2 Idempotency keys are the lifeline of retries

If you ever return Action Hints like:

retry-local
retry-with-backoff

…then you implicitly promised:

“This operation can be retried without duplicating harmful effects.”

That promise is fragile unless you enforce:

Retry must reuse the same idempotency key.

A simple logging structure helps keep the world honest:

type ActionHint =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";

type SagaStepLog = {
  sagaId: string;
  stepId: string;
  status: "pending" | "done" | "compensating" | "compensated" | "failed";
  world: "RML1" | "RML2" | "RML3";
  action?: ActionHint;
  errorCode?: string;
  idempotencyKey?: string;
  timestamp: string;
};

Example: derive a stable idempotency key per step:

type ChargeRequest = { paymentId: string; amount: number; currency: string };

type PaymentGateway = {
  charge: (req: ChargeRequest, opts: { idempotencyKey: string }) => Promise<void>;
};

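// (Assume `RmlError` is the structured error type from Chapter 5.)
async function chargePayment(
  sagaId: string,
  req: ChargeRequest,
  paymentGateway: PaymentGateway
): Promise<void> {
  // Derived only from the saga and the step, so every retry of this step
  // sends the same key and the gateway can deduplicate.
  const idempotencyKey = `saga:${sagaId}:charge`;

  try {
    await paymentGateway.charge(req, { idempotencyKey });
  } catch (e) {
    // Simplification: every gateway failure is labeled as temporary here.
    // Real code would inspect `e` and choose world/action accordingly.
    throw new RmlError({
      world: "RML2",
      severity: "error",
      action: "retry-with-backoff",
      code: "PAYMENT_TEMPORARY_ERROR",
      message: "Payment gateway temporary error",
      cause: e,
    });
  }
}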

Rule of thumb (worth enforcing in review):

If you return retry-*, you must return (or deterministically derive) the idempotency key as part of the contract.
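The reason the key must be stable shows up when a response is lost after the effect was applied. The sketch below simulates that case with a hypothetical in-memory `FlakyGateway` that deduplicates on the idempotency key; `retryWithBackoff` and `chargeWithRetry` are illustrative helpers, and the injected `sleep` is an assumption that keeps the example testable.

```typescript
type ChargeRequest = { paymentId: string; amount: number; currency: string };

// Hypothetical gateway: applies the charge, then "loses" the first N responses.
class FlakyGateway {
  readonly applied = new Map<string, ChargeRequest>();
  private lostResponses: number;

  constructor(lostResponses: number) {
    this.lostResponses = lostResponses;
  }

  async charge(req: ChargeRequest, opts: { idempotencyKey: string }): Promise<void> {
    // Dedupe: the first request with a given key wins; replays change nothing.
    if (!this.applied.has(opts.idempotencyKey)) {
      this.applied.set(opts.idempotencyKey, req);
    }
    if (this.lostResponses > 0) {
      this.lostResponses -= 1;
      throw new Error("timeout: response lost after the charge was applied");
    }
  }
}

// Retries `op` up to maxAttempts times; returns the attempt that succeeded.
async function retryWithBackoff(
  op: () => Promise<void>,
  maxAttempts: number,
  sleep: (ms: number) => Promise<void>,
  baseDelayMs = 100,
): Promise<number> {
  for (let attempt = 1; ; attempt++) {
    try {
      await op();
      return attempt;
    } catch (e) {
      if (attempt >= maxAttempts) throw e;
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

async function chargeWithRetry(
  gateway: FlakyGateway,
  sagaId: string,
  req: ChargeRequest,
  sleep: (ms: number) => Promise<void>,
): Promise<number> {
  // Derived once, outside the retry loop, so every attempt reuses the same key.
  const idempotencyKey = `saga:${sagaId}:charge`;
  return retryWithBackoff(() => gateway.charge(req, { idempotencyKey }), 5, sleep);
}
```

If the key were generated inside the retried closure instead, each attempt would look like a new charge to the gateway, and a lost response would turn into a double charge.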

6.3 Timeouts and heartbeats: sagas are long-lived

Real sagas get stuck:

Running for hours
Compensating for 30 minutes
a worker died mid-step
an event stream consumer lagged

So you need:

heartbeats (progress timestamps)
alerts for “stuck” states
a governance rule like: “If a saga is stuck long enough, escalate to RML-3.”

With RML tags, observability can become clean:

world=RML2 status=running lastUpdate>1h → stuck saga alert
world=RML2 status=compensating lastUpdate>30m → compensation-stall alert
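The two alert rules above can be expressed as a plain function over heartbeat records. This is a sketch under assumptions: the `SagaHeartbeat` shape and its status values are invented for the example, and the 60 and 30 minute limits simply mirror the thresholds in the rules.

```typescript
// Hypothetical heartbeat record, one per saga.
type SagaHeartbeat = {
  sagaId: string;
  world: "RML1" | "RML2" | "RML3";
  status: "running" | "compensating" | "completed" | "compensated" | "escalated";
  lastUpdate: string; // ISO-8601 timestamp of the last recorded progress
};

type StuckAlert = {
  sagaId: string;
  alert: "stuck-saga" | "compensation-stall";
  idleMinutes: number;
};

const RUNNING_LIMIT_MIN = 60;
const COMPENSATING_LIMIT_MIN = 30;

function findStuckSagas(sagas: SagaHeartbeat[], now: Date): StuckAlert[] {
  const alerts: StuckAlert[] = [];
  for (const s of sagas) {
    if (s.world !== "RML2") continue; // only in-progress conversations can be "stuck"
    const idleMinutes = Math.floor((now.getTime() - Date.parse(s.lastUpdate)) / 60000);
    if (s.status === "running" && idleMinutes > RUNNING_LIMIT_MIN) {
      alerts.push({ sagaId: s.sagaId, alert: "stuck-saga", idleMinutes });
    } else if (s.status === "compensating" && idleMinutes > COMPENSATING_LIMIT_MIN) {
      alerts.push({ sagaId: s.sagaId, alert: "compensation-stall", idleMinutes });
    }
  }
  return alerts;
}
```

Passing `now` in as a parameter keeps the check deterministic, so the same function can back both a scheduled job and a unit test.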

7) The saga boundary: “from here on, it’s RML-3”

The most dangerous belief is:

“If we try hard enough, we can handle everything inside RML-2.”

No. You must explicitly define what belongs to RML-3:

money movement (often RML-3 by default)
externally delivered notifications (hard to undo)
legally meaningful records (invoices, healthcare records, etc.)

A practical policy set might look like:

“Funds settlement is RML-3”
“Email/SMS is RML-2.5-ish; compensation is correction notice, not deletion”
“Inventory should stay RML-2 via reserve/confirm/release where possible”

This enables an important architectural move:

“This step is too dangerous inside a saga. Split it into an RML-3 workflow.”
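A policy set like this is easier to enforce when it lives in code as data. The table below is a sketch only: the `Reversibility` levels, the `EffectKind` names, and the `placement` helper are assumptions chosen to mirror the three example policies, with "correctable" standing in for the "RML-2.5-ish" case.

```typescript
type Reversibility = "reversible" | "correctable" | "history-bound";
type EffectKind = "funds-settlement" | "email-or-sms" | "inventory";

type EffectPolicy = {
  reversibility: Reversibility;
  compensation: string; // what "undo" actually means for this effect
};

// Hypothetical policy table mirroring the example policies above.
const effectPolicies: Record<EffectKind, EffectPolicy> = {
  "funds-settlement": {
    reversibility: "history-bound",
    compensation: "refund via RML-3 workflow",
  },
  "email-or-sms": {
    reversibility: "correctable",
    compensation: "send correction notice",
  },
  inventory: {
    reversibility: "reversible",
    compensation: "release reservation",
  },
};

// History-bound effects are pushed out of the saga into their own workflow.
function placement(kind: EffectKind): "saga-step" | "separate-rml3-workflow" {
  return effectPolicies[kind].reversibility === "history-bound"
    ? "separate-rml3-workflow"
    : "saga-step";
}
```

Because `effectPolicies` is typed as a `Record` over `EffectKind`, adding a new kind of effect without deciding its policy becomes a compile error rather than a review comment.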

8) Common saga anti-patterns

8.1 Hidden side effects inside steps

A step T2 quietly triggers an irreversible external API call, but C2 never handles it.

Result:

“We compensated” becomes a lie.
Evidence is missing, so debugging becomes archaeology.

Countermeasure:

inventory all side effects per step
classify them by world (RML-1/2/3)
ensure every RML-2 effect is handled by a named compensation, and every RML-3 effect by an escalation path
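The countermeasure can be turned into a check that runs in CI or review. This is a hedged sketch: `SideEffect`, `StepDeclaration`, and `auditSideEffects` are hypothetical names, and the check only works if teams actually declare the side effects of each step.

```typescript
type World = "RML1" | "RML2" | "RML3";

// Hypothetical declaration each step author fills in.
type SideEffect = { name: string; world: World; compensatedBy?: string };
type StepDeclaration = { stepId: string; effects: SideEffect[] };

// Returns human-readable findings; an empty list means the inventory is consistent.
function auditSideEffects(steps: StepDeclaration[]): string[] {
  const findings: string[] = [];
  for (const step of steps) {
    for (const effect of step.effects) {
      if (effect.world === "RML2" && !effect.compensatedBy) {
        findings.push(`${step.stepId}: RML-2 effect "${effect.name}" has no compensation`);
      }
      if (effect.world === "RML3" && effect.compensatedBy) {
        findings.push(
          `${step.stepId}: RML-3 effect "${effect.name}" claims a compensation; define an escalation path instead`,
        );
      }
    }
  }
  return findings;
}
```

The second rule guards against the opposite mistake: declaring a compensation for something that is already history-bound.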

8.2 Using compensation to deny RML-3 reality

If the business reality is history-bound (payment settled, goods shipped), pretending it’s “still RML-2” leads to:

engineers thinking “we rolled back”
business/legal seeing “we did not”

Countermeasure:

define an acceptable end state as RML-3 (refund/correction/explanation)
treat compensation as “conversation repair,” not “history erasure”

8.3 No visibility into saga state

“No saga IDs, no step logs, we’ll grep logs later.”

That’s how incident response fails.

Countermeasure:

require sagaId + stepId in every log/event at minimum
keep a single “state machine picture” (like the one in section 3) in docs

9) Saga design checklist (RML-2 edition)

Business / semantics

[ ] What single “conversation” does this saga represent?
[ ] Which steps are reversible in RML-2?
[ ] Which steps are history-bound (RML-3)?
[ ] For each step, what is an acceptable end state?

Technical

[ ] Each T* has a corresponding C* (or an explicit reason why it can’t)
[ ] sagaId / stepId exist everywhere
[ ] retry-* implies stable idempotency keys
[ ] Where is saga state observable: orchestrator or stream?

Operations / observability

[ ] Logs/traces include world / severity / action
[ ] Alerts exist for “stuck Running/Compensating”
[ ] RML3 escalation reliably enters incident/case workflows

Closing — sagas are the tool for accepting RML-2 honestly

Sagas and compensations are not a magic rollback machine.

They don’t:

turn RML-2 into RML-1
erase the need to acknowledge RML-3

They do something more realistic:

They help you accept that RML-2 is a world of ongoing dialog, and design the conversation so it’s retryable.

rewind what can be rewound (as conversation)
admit what can’t (as history)
share the boundary via errors, logs, APIs, and org processes

Chapter 7 — API / Client Design: how (and whether) to expose RML labels to callers, and how to standardize retry and reconciliation behavior across clients and SDKs.
