kanaria007

Originally published at zenn.dev

Chapter 6 — Sagas & Compensating Transactions: Building “Retryable Conversations”

The Worlds of Distributed Systems — Chapter 6

“How do we do a safe transaction across two services?”

Most answers sound like one of these:

  • “2PC…? But nobody really does that in practice.”
  • “Event-driven… and eventually consistent…”
  • “Write a saga with compensating transactions… (and then nobody writes them)”

In Chapter 5, we got serious about RML-2 failures:

  • classify failures by world (RML-1/2/3)
  • label them with structured errors (world / action)
  • return Action Hints like start-compensation, retry-with-backoff, escalate-history

This chapter is the design layer on top of that:

How to structure the “in-progress conversation” of eventual consistency (RML-2), and make it retryable—via sagas.


1) From “transaction” to “conversation”

1.1 Why local transactions feel easy (RML-1)

In a single service + single DB world (RML-1), the story is simple:

  1. begin transaction
  2. update multiple tables
  3. commit, or rollback

Rollback means:

“Return to the pre-transaction state.”

The DB does it for you.

1.2 Why cross-service flows become dialog (RML-2)

In distributed systems (RML-2), the flow becomes a long conversation:

  • Service A calls Service B
  • B updates its own DB
  • B calls Service C
  • something fails mid-way

Now you must answer:

  • How far can we roll back as a conversation?
  • Who undoes what—and how?
  • Where is the RML-3 boundary (history-bound)?

The framework for pre-deciding that is a Saga.

1.3 Eventual consistency in RML terms

A useful restatement:

  • RML-2: the world while you’re still converging toward consistency
  • RML-3: the world after convergence, where the result is now “settled history”

Sagas are about managing the RML-2 convergence process:

  • undo what can be undone (RML-2)
  • escalate what can’t (RML-3)

2) A saga is a “retryable conversation”

2.1 A practical definition

In this series:

A Saga = a sequence of forward steps (T1, T2, …) plus compensations (C1, C2, …) that together form an RML-2 conversation.

Think:

  • forward steps: T1, T2, T3, ...
  • compensations: C1, C2, C3, ...

If the saga fails at some Tn, you run the compensations for the steps that already completed, in reverse order:

T1 → T2 → T3
↑    ↑    ↑
C1   C2   C3   (on failure, run C in reverse)
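
The reverse-order rule can be sketched in a few lines. This is a minimal illustration, not a production runner: `SagaStep`, `SagaResult`, and `runSaga` are hypothetical names, and a compensation that throws simply propagates out of `runSaga` (in a real system that is where escalation would start).

```typescript
// A forward step Tn and its optional compensation Cn. All names are illustrative.
type SagaStep = {
  stepId: string;
  run: () => Promise<void>;          // forward step Tn
  compensate?: () => Promise<void>;  // compensation Cn, if the step is reversible
};

type SagaResult =
  | { outcome: "completed" }
  | { outcome: "compensated"; failedStep: string; compensatedSteps: string[] };

async function runSaga(steps: SagaStep[]): Promise<SagaResult> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.run();
      completed.push(step);
    } catch {
      // Tn failed: undo the steps that already completed, newest first.
      const compensatedSteps: string[] = [];
      for (const done of completed.slice().reverse()) {
        if (done.compensate) {
          await done.compensate();
          compensatedSteps.push(done.stepId);
        }
      }
      return { outcome: "compensated", failedStep: step.stepId, compensatedSteps };
    }
  }
  return { outcome: "completed" };
}
```

Note that the failed step itself is not compensated here. If a step can fail after partially applying its effect, it needs its own cleanup or an idempotent compensation.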

2.2 The most important question: where is the boundary to RML-3?

Sagas are not magic. The critical design move is naming this:

Which steps are truly reversible in RML-2, and which steps are already history-bound (RML-3)?

If you don’t draw this line, you’ll end up pretending an RML-3 action is “rollbackable,” and reality will eventually correct you—painfully.


3) Treat the saga as a state machine

A helpful worldview: a saga is a tiny world with a lifecycle.

stateDiagram-v2
  [*] --> Pending
  Pending --> Running: start
  Running --> Completed: all Tn succeed
  Running --> Compensating: error (RML2 / start-compensation)
  Compensating --> Compensated: all Cn succeed
  Compensating --> Escalated: compensation failed (RML3 / escalate-history)
  Running --> Escalated: history-bound failure detected
  Completed --> [*]
  Compensated --> [*]
  Escalated --> [*]

Mapping from Chapter 5’s error labels:

  • world: "RML2", action: "start-compensation" triggers Running → Compensating
  • world: "RML3", action: "escalate-history" triggers Running/Compensating → Escalated

This is the big idea:

In RML-2, structured failures aren’t “just exceptions.”
They are state machine triggers.
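
One way to make "errors are triggers" concrete is a pure transition function. The sketch below follows the lifecycle diagram in this section; the event names (`all-steps-succeeded` and so on) and the choice to ignore events that do not apply to the current state are assumptions made for illustration. Retry hints are left out on purpose, because they keep the saga in Running.

```typescript
type SagaState =
  | "Pending" | "Running" | "Completed"
  | "Compensating" | "Compensated" | "Escalated";

// Events that can move the saga. The two error shapes mirror the Chapter 5 labels.
type SagaEvent =
  | { kind: "start" }
  | { kind: "all-steps-succeeded" }
  | { kind: "all-compensations-succeeded" }
  | { kind: "error"; world: "RML2"; action: "start-compensation" }
  | { kind: "error"; world: "RML3"; action: "escalate-history" };

function nextState(state: SagaState, event: SagaEvent): SagaState {
  switch (event.kind) {
    case "start":
      return state === "Pending" ? "Running" : state;
    case "all-steps-succeeded":
      return state === "Running" ? "Completed" : state;
    case "all-compensations-succeeded":
      return state === "Compensating" ? "Compensated" : state;
    case "error":
      if (event.action === "escalate-history") {
        // History-bound failure: both Running and Compensating escalate.
        return state === "Running" || state === "Compensating" ? "Escalated" : state;
      }
      // start-compensation only makes sense while steps are still running.
      return state === "Running" ? "Compensating" : state;
  }
}
```

Keeping the function pure means the same table can drive an orchestrator, a projection over an event stream, or a unit test.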


4) Designing compensations: don’t chase “perfect reversal”

4.1 “Return to the exact prior state” is often impossible

A common misunderstanding:

“Compensation must restore the exact original state.”

In practice, many effects can’t be fully erased:

  • the user already saw the email
  • shipping started
  • a downstream system recorded something immutable
  • time passed; state changed elsewhere

So don’t optimize for perfect reversal. Optimize for:

A business-acceptable end state.

Examples:

  • “Cancel after shipping started” → refund + return workflow + explanation
  • “Revoke points after order cancellation” → subtract points + record why

4.2 The Reserve → Confirm pattern (a clean RML-2 → RML-3 boundary)

A common safe pattern:

  • T1: Reserve (reversible, RML-2)
  • T2: Confirm (often history-bound, RML-3 boundary)
  • C1: CancelReserve (undo reserve)

T1: Reserve resource        ←→  C1: Cancel reservation
T2: Confirm / Settle (RML-3 boundary)

Why it’s powerful:

  • fail before confirm → compensation stays inside RML-2
  • fail after confirm → treat as RML-3 (refund/correction/explanation)

This pattern makes “history entry points” explicit, instead of accidental.
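
A minimal in-memory sketch of the pattern, assuming a hypothetical `Inventory` class keyed by reservation ID. The point to notice is that `cancel` refuses to act once `confirm` has happened: past that line the caller has to use an RML-3 path (refund, correction) instead of a rollback.

```typescript
type ReservationStatus = "reserved" | "confirmed" | "cancelled";

class Inventory {
  private available: number;
  private reservations = new Map<string, { qty: number; status: ReservationStatus }>();

  constructor(available: number) {
    this.available = available;
  }

  // T1: Reserve (reversible, RML-2). Repeating the same reservationId is a no-op.
  reserve(reservationId: string, qty: number): boolean {
    if (this.reservations.has(reservationId)) return true;
    if (qty > this.available) return false;
    this.available -= qty;
    this.reservations.set(reservationId, { qty, status: "reserved" });
    return true;
  }

  // T2: Confirm (the RML-3 boundary). After this, cancel() no longer applies.
  confirm(reservationId: string): boolean {
    const r = this.reservations.get(reservationId);
    if (!r || r.status === "cancelled") return false;
    r.status = "confirmed";
    return true;
  }

  // C1: CancelReserve. Returns false when the reservation is already history-bound.
  cancel(reservationId: string): boolean {
    const r = this.reservations.get(reservationId);
    if (!r) return true;                        // nothing was reserved
    if (r.status === "cancelled") return true;  // already released
    if (r.status === "confirmed") return false; // too late for RML-2
    r.status = "cancelled";
    this.available += r.qty;
    return true;
  }

  availableQty(): number {
    return this.available;
  }
}
```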


5) Orchestration vs choreography

There are two classic implementation styles.

5.1 Orchestration (central conductor)

flowchart LR
  O[Orchestrator] --> A[Service A]
  O --> B[Service B]
  O --> C[Service C]

Pros

  • saga state is centralized and visible
  • the flow is easy to change
  • “interpret Action Hints” logic fits naturally in one place

Cons

  • orchestrator tends to become a god object
  • if the orchestrator is down, many sagas stall

5.2 Choreography (event-driven coordination)

flowchart LR
  subgraph Bus[Event Bus]
    E[(Topic / Stream)]
  end
  A[Service A] --> E
  B[Service B] --> E
  C[Service C] --> E

Pros

  • higher service autonomy
  • scales well

Cons

  • end-to-end flow is harder to “see”
  • tracing “where the saga is stuck” becomes non-trivial

RML framing:

  • Orchestration → orchestrator is the “center of RML-2”
  • Choreography → event stream / effect log becomes the “center of RML-2”

Either way, decide early:

Where is saga state observable as a first-class object?
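
In a choreographed setup, one way to get that first-class object is to fold the event stream into a per-saga view. The sketch below assumes a hypothetical event shape (`SagaEventRecord`) and assumes events for a given saga arrive in order; a real stream would need ordering and deduplication guarantees that are out of scope here.

```typescript
// Hypothetical event shape published by each service.
type SagaEventRecord = {
  sagaId: string;
  stepId: string;
  type: "step-done" | "step-failed" | "step-compensated";
  timestamp: string; // ISO 8601
};

// The per-saga picture an operator would want to query.
type SagaView = {
  sagaId: string;
  done: string[];         // completed and not yet compensated
  compensated: string[];
  failedStep?: string;
  lastUpdate: string;
};

function projectSagas(events: SagaEventRecord[]): Record<string, SagaView> {
  const views: Record<string, SagaView> = {};
  for (const ev of events) {
    let view = views[ev.sagaId];
    if (!view) {
      view = { sagaId: ev.sagaId, done: [], compensated: [], lastUpdate: ev.timestamp };
      views[ev.sagaId] = view;
    }
    const stepId = ev.stepId;
    switch (ev.type) {
      case "step-done":
        view.done.push(stepId);
        break;
      case "step-failed":
        view.failedStep = stepId;
        break;
      case "step-compensated":
        view.done = view.done.filter((id) => id !== stepId);
        view.compensated.push(stepId);
        break;
    }
    view.lastUpdate = ev.timestamp;
  }
  return views;
}
```

With orchestration the same view would simply be the orchestrator's own record. The projection is only needed when no single component owns the saga.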


6) Implementation lifelines: IDs, idempotency keys, logs, and time

6.1 You need a Saga ID and Step IDs

At minimum:

  • every saga has a sagaId
  • every step has a stepId

If you lack IDs, incident response becomes “grep and guess.”

6.2 Idempotency keys are the lifeline of retries

If you ever return Action Hints like:

  • retry-local
  • retry-with-backoff

…then you implicitly promised:

“This operation can be retried without duplicating harmful effects.”

That promise is fragile unless you enforce:

Retry must reuse the same idempotency key.

A simple logging structure helps keep the world honest:

type ActionHint =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";

type SagaStepLog = {
  sagaId: string;
  stepId: string;
  status: "pending" | "done" | "compensating" | "compensated" | "failed";
  world: "RML1" | "RML2" | "RML3";
  action?: ActionHint;
  errorCode?: string;
  idempotencyKey?: string;
  timestamp: string;
};

Example: derive a stable idempotency key per step:

type ChargeRequest = { paymentId: string; amount: number; currency: string };

type PaymentGateway = {
  charge: (req: ChargeRequest, opts: { idempotencyKey: string }) => Promise<void>;
};

// (Assume `RmlError` is the structured error type from Chapter 5.)
async function chargePayment(
  sagaId: string,
  req: ChargeRequest,
  paymentGateway: PaymentGateway
) {
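  // Derived only from the saga ID and the step name, so every retry of this
  // step sends the same key.
  const idempotencyKey = `saga:${sagaId}:charge`;

  try {
    await paymentGateway.charge(req, { idempotencyKey });
  } catch (e) {
    // Simplification: every failure is labeled as temporary here. Real code
    // should inspect `e` and avoid a retry hint for permanent failures.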
    throw new RmlError({
      world: "RML2",
      severity: "error",
      action: "retry-with-backoff",
      code: "PAYMENT_TEMPORARY_ERROR",
      message: "Payment gateway temporary error",
      cause: e,
    });
  }
}

Rule of thumb (worth enforcing in review):

If you return retry-*, you must return (or deterministically derive) the idempotency key as part of the contract.
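
To see why the key must stay the same, here is a sketch with a hypothetical `FakeGateway` that applies the charge and then "loses" the response, which is the case retries exist for. `retryWithBackoff` and `chargeWithRetry` are illustrative helpers, not part of any real payment SDK, and the gateway's deduplication by key is an assumption about how the receiving side behaves.

```typescript
type Charge = { paymentId: string; amount: number };

// Hypothetical gateway: deduplicates by idempotency key, and can simulate
// "charge applied, but the response never reached the caller".
class FakeGateway {
  private applied = new Map<string, Charge>();
  private lostResponses: number;

  constructor(lostResponses: number) {
    this.lostResponses = lostResponses;
  }

  async charge(req: Charge, opts: { idempotencyKey: string }): Promise<void> {
    if (!this.applied.has(opts.idempotencyKey)) {
      this.applied.set(opts.idempotencyKey, req); // the effect happens once per key
    }
    if (this.lostResponses > 0) {
      this.lostResponses -= 1;
      throw new Error("timeout: response lost after the charge was applied");
    }
  }

  chargeCount(): number {
    return this.applied.size;
  }
}

// Retries `op` with exponential backoff. `sleep` is injectable so tests need no real delay.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  opts: { maxAttempts: number; baseDelayMs: number; sleep?: (ms: number) => Promise<void> }
): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await op();
    } catch (e) {
      lastError = e;
      if (attempt < opts.maxAttempts - 1) {
        await sleep(opts.baseDelayMs * Math.pow(2, attempt));
      }
    }
  }
  throw lastError;
}

async function chargeWithRetry(
  sagaId: string,
  req: Charge,
  gateway: FakeGateway,
  sleep?: (ms: number) => Promise<void>
): Promise<void> {
  // Derived once, outside the retry loop: every attempt reuses the same key.
  const idempotencyKey = `saga:${sagaId}:charge`;
  await retryWithBackoff(() => gateway.charge(req, { idempotencyKey }), {
    maxAttempts: 5,
    baseDelayMs: 100,
    sleep,
  });
}
```

With two lost responses, `chargeWithRetry` makes three calls but the gateway records one charge. Generating a fresh key inside the loop would record three.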

6.3 Timeouts and heartbeats: sagas are long-lived

Real sagas get stuck:

  • Running for hours
  • Compensating for 30 minutes
  • a worker died mid-step
  • an event stream consumer lagged

So you need:

  • heartbeats (progress timestamps)
  • alerts for “stuck” states
  • a governance rule like: “If a saga is stuck long enough, escalate to RML-3.”

With RML tags, observability can become clean:

  • world=RML2 status=running lastUpdate>1h → stuck saga alert
  • world=RML2 status=compensating lastUpdate>30m → compensation-stall alert
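
The two alert rules above translate almost directly into a scan over saga snapshots. `SagaSnapshot`, `StuckAlert`, and `findStuck` are hypothetical names; the thresholds are the 1h and 30m values from the rules and would normally be configuration.

```typescript
type SagaSnapshot = {
  sagaId: string;
  world: "RML1" | "RML2" | "RML3";
  status: "running" | "compensating" | "completed" | "compensated" | "escalated";
  lastUpdate: string; // ISO 8601 heartbeat / progress timestamp
};

type StuckAlert = { sagaId: string; alert: "stuck-saga" | "compensation-stall" };

const HOUR_MS = 60 * 60 * 1000;
const THRESHOLDS_MS = { running: HOUR_MS, compensating: HOUR_MS / 2 };

// `now` is passed in so the check is deterministic and easy to test.
function findStuck(sagas: SagaSnapshot[], now: Date): StuckAlert[] {
  const alerts: StuckAlert[] = [];
  for (const s of sagas) {
    if (s.world !== "RML2") continue;
    const ageMs = now.getTime() - Date.parse(s.lastUpdate);
    if (s.status === "running" && ageMs > THRESHOLDS_MS.running) {
      alerts.push({ sagaId: s.sagaId, alert: "stuck-saga" });
    } else if (s.status === "compensating" && ageMs > THRESHOLDS_MS.compensating) {
      alerts.push({ sagaId: s.sagaId, alert: "compensation-stall" });
    }
  }
  return alerts;
}
```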

7) The saga boundary: “from here on, it’s RML-3”

The most dangerous belief is:

“If we try hard enough, we can handle everything inside RML-2.”

No. You must explicitly define what belongs to RML-3:

  • money movement (often RML-3 by default)
  • externally delivered notifications (hard to undo)
  • legally meaningful records (invoices, healthcare records, etc.)

A practical policy set might look like:

  • “Funds settlement is RML-3”
  • “Email/SMS is RML-2.5-ish; compensation is a correction notice, not deletion”
  • “Inventory should stay RML-2 via reserve/confirm/release where possible”

This enables an important architectural move:

“This step is too dangerous inside a saga. Split it into an RML-3 workflow.”


8) Common saga anti-patterns

8.1 Hidden side effects inside steps

A step T2 quietly triggers an irreversible external API call, but C2 never handles it.

Result:

  • “We compensated” becomes a lie.
  • Evidence is missing, so debugging becomes archaeology.

Countermeasure:

  • inventory all side effects per step
  • classify them by world (RML-1/2/3)
  • ensure RML-2 effects are handled by compensation somewhere
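
The inventory can be kept as data and checked mechanically, for example in CI. This is a sketch under assumptions: `StepDecl`, `Effect`, and `auditSteps` are hypothetical, and the second rule (flagging an RML-3 effect that claims a compensation) is an interpretation of the "don't pretend RML-3 is rollbackable" advice in this chapter.

```typescript
type World = "RML1" | "RML2" | "RML3";

// One declared side effect of a step, and which compensation (if any) handles it.
type Effect = { name: string; world: World; compensatedBy?: string };
type StepDecl = { stepId: string; effects: Effect[] };

function auditSteps(steps: StepDecl[]): string[] {
  const problems: string[] = [];
  for (const step of steps) {
    for (const effect of step.effects) {
      if (effect.world === "RML2" && !effect.compensatedBy) {
        problems.push(`${step.stepId}: RML2 effect "${effect.name}" has no compensation`);
      }
      if (effect.world === "RML3" && effect.compensatedBy) {
        problems.push(
          `${step.stepId}: RML3 effect "${effect.name}" claims a compensation; use an RML-3 workflow instead`
        );
      }
    }
  }
  return problems;
}
```

The check only sees what is declared, so it does not find hidden side effects by itself. It does make an undeclared one show up as a review question ("why is this call not in the list?").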

8.2 Using compensation to deny RML-3 reality

If the business reality is history-bound (payment settled, goods shipped), pretending it’s “still RML-2” leads to:

  • engineers thinking “we rolled back”
  • business/legal seeing “we did not”

Countermeasure:

  • define an acceptable end state as RML-3 (refund/correction/explanation)
  • treat compensation as “conversation repair,” not “history erasure”

8.3 No visibility into saga state

“No saga IDs, no step logs, we’ll grep logs later.”

That’s how incident response fails.

Countermeasure:

  • require sagaId + stepId in every log/event at minimum
  • keep a single “state machine picture” (like above) in docs

9) Saga design checklist (RML-2 edition)

Business / semantics

  • [ ] What single “conversation” does this saga represent?
  • [ ] Which steps are reversible in RML-2?
  • [ ] Which steps are history-bound (RML-3)?
  • [ ] For each step, what is an acceptable end state?

Technical

  • [ ] Each T* has a corresponding C* (or a documented reason why it has none)
  • [ ] sagaId / stepId exist everywhere
  • [ ] retry-* implies stable idempotency keys
  • [ ] Where is saga state observable: orchestrator or stream?

Operations / observability

  • [ ] Logs/traces include world / severity / action
  • [ ] Alerts exist for “stuck Running/Compensating”
  • [ ] RML3 escalation reliably enters incident/case workflows

Closing — sagas are the tool for accepting RML-2 honestly

Sagas and compensations are not a magic rollback machine.

They don’t:

  • turn RML-2 into RML-1
  • erase the need to acknowledge RML-3

They do something more realistic:

They help you accept that RML-2 is a world of ongoing dialog, and design the conversation so it’s retryable.

  • rewind what can be rewound (as conversation)
  • admit what can’t (as history)
  • share the boundary via errors, logs, APIs, and org processes

Next:

Chapter 7 — API / Client Design: how (and whether) to expose RML labels to callers, and how to standardize retry and reconciliation behavior across clients and SDKs.
