kanaria007

Posted on Feb 19 • Originally published at zenn.dev

Chapter 3 — RML-2 (Dialog World): Rollback as a Conversation

#distributedsystems #architecture #microservices #sre

The Worlds of Distributed Systems — Chapter 3

“Can we roll this back by ourselves?”
Or do we need to talk to someone who now shares the outcome?

In Chapter 2, we defined RML-1 (Closed World) as:

A room where nothing has escaped yet—so failure is safe.

In this chapter, we step outside that room and enter:

RML-2 — Dialog World
(RML = Rollback Maturity Levels: a simple map for how “rollback” changes as the blast radius grows.)

The key idea is simple:

Rollback is not a rewind button. It’s part of an ongoing conversation.

1) What is RML-2? — A world where “someone else exists”

If we define RML-2 in one line:

RML-2 is a world where there is someone else who shares the outcome.

1.1 The boundary between RML-1 and RML-2

RML-1 was “not observable from outside,” roughly meaning:

no external DB writes
no external API calls that cause real effects
no human notifications
no logs that later become inputs to business decisions or audits

The moment you cross into RML-2, the situation changes:

you write a record to a DB that other services read
you call another microservice (RPC / messaging)
you notify users (email / push)
you update state that appears on dashboards or internal tooling

In short:

The moment your action becomes observable by someone else, you are in RML-2.
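As a rough sketch of that boundary test, the four bullet points above can be turned into flags on a small record. The `Effects` type and its field names are hypothetical, invented here for illustration; the only rule encoded is that a single observable effect is enough to leave RML-1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effects:
    """Hypothetical description of what one action touches."""
    writes_shared_db: bool = False
    calls_other_service: bool = False
    notifies_human: bool = False
    updates_visible_state: bool = False


def is_observable_by_others(effects: Effects) -> bool:
    # Any single externally visible effect is enough to leave RML-1.
    return any((
        effects.writes_shared_db,
        effects.calls_other_service,
        effects.notifies_human,
        effects.updates_visible_state,
    ))


def rml_level(effects: Effects) -> str:
    # This check cannot tell RML-2 from RML-3; it only detects leaving RML-1.
    return "RML-2 or higher" if is_observable_by_others(effects) else "RML-1"
```

Whether an observable action is RML-2 or already RML-3 needs a separate question (money, legal responsibility, other organizations), which this sketch deliberately leaves out.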

1.2 The boundary between RML-2 and RML-3

RML-2 is not yet the History World (RML-3).

A rough split:

RML-2
- inside your company / product boundary
- multiple services and humans can still settle outcomes via “conversation”
- failures are often recoverable via retries, status checks, and compensation
RML-3
- money, legal responsibility, social trust
- other organizations / regulation / reality-level events
- you can’t “make it not happen”—you can only layer corrections on top

A simple mental picture:

        Responsibility weight / “history weight”

 high ─────────────────────┐  RML-3: History World
      |                    |   - money, legal, social trust
      |         ┌──────────┘
      |         |  RML-2: Dialog World
      |         |   - multi-party conversation
      |   ┌─────┘
      |   |  RML-1: Closed World
      |   |   - inside one room/process
      +────────────────────────→  blast radius / scope of impact

RML-2 is the middle zone: the conversation zone.

2) See distributed systems as conversations

A useful metaphor for RML-2:

A distributed system is a network of conversations.

2.1 Not Q&A, but multi-turn dialog

If you only look at a single HTTP call, it feels like one round trip:

client → server (request)
server → client (response)

But real operations are almost always multi-turn:

ask the inventory service
request authorization from payment
dispatch shipping
notify the user via email

When any step fails, you often need the continuation of the conversation:

cancel something you already requested
compensate with another operation
explain to the user and provide next steps

So, for RML-2:

Rollback is not “restart the flow.”
It’s “continue the dialog until the situation settles.”
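One way to picture "continuing the dialog" is a runner that remembers which turns already completed and, on failure, sends a compensating turn for each of them in reverse order. This is a minimal sketch, not a workflow engine: `run_dialog` and the step names are hypothetical, and it assumes every compensation succeeds, which real systems cannot assume.

```python
from typing import Callable, List, Tuple

# (name, action, compensation): all three are illustrative placeholders.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_dialog(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Continue the conversation: undo what others already did for us.
            for prev_name, _, compensate in reversed(done):
                compensate()
                log.append(f"compensated:{prev_name}")
            break
        log.append(f"done:{name}")
        done.append(step)
    return log


def shipping_down() -> None:
    raise RuntimeError("shipping unavailable")


log = run_dialog([
    ("reserve_inventory", lambda: None, lambda: None),
    ("authorize_payment", lambda: None, lambda: None),
    ("dispatch_shipping", shipping_down, lambda: None),
])
# log ends with compensated:authorize_payment, then compensated:reserve_inventory
```

The log is the interesting output: it records the whole conversation, including the turns that were taken back.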

2.2 Three kinds of “participants”

In RML-2, the “other side” typically comes in three forms:

Services (microservices / external APIs)
- connected via HTTP/gRPC/messaging
- governed by SLOs, retry policies, error contracts
Humans (users / operators)
- connected via UI, email, Slack, phone
- their actions (read/didn’t read, clicked/didn’t click) matter
Schedulers (queues / batch jobs / workflow engines)
- delay and re-execution are normal
- “I’ll come back later” is expected

Treating them all as “dialog participants” makes the system easier to reason about.

2.3 Timeouts and “silence” are also responses

Conversations include silence.

Distributed systems constantly face this:

you sent a request, but no response arrives (timeout)

Now you must decide:

did the request never arrive?
did it arrive, but the response got lost?
did the other side process it and just not reply?

This ambiguity is why RML-2 needs explicit “silence reactions”:

retry with backoff
check status / poll
cancel / start compensation
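The three reactions can be chained into one escalation path: retry with backoff, then ask for status, then compensate. The sketch below assumes the request is safe to retry (for example, because it carries an idempotency key) and that silence surfaces as Python's built-in `TimeoutError`. The function name, its parameters, and the string results are all hypothetical.

```python
import time
from typing import Callable, Optional


def react_to_silence(
    send: Callable[[], str],
    check_status: Callable[[], Optional[str]],
    compensate: Callable[[], None],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    # 1) Retry with exponential backoff (assumes the request is idempotent).
    for attempt in range(max_attempts):
        try:
            return send()
        except TimeoutError:
            sleep(base_delay * (2 ** attempt))
    # 2) Still silent: ask the other side what it believes happened.
    status = check_status()
    if status is not None:
        return status
    # 3) Nobody knows: start compensation and report that we did.
    compensate()
    return "compensated"
```

Injecting `sleep` keeps the sketch testable without real waiting; the order of escalation is a design choice, not a rule.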

Worldview summary:

Silence is also part of the dialog.

(Later chapters will encode these as concrete patterns like retry-with-backoff, check-status, and start-compensation.)

3) The basic story of RML-2: Promise → Perform → Reconcile

In RML-2, most operations follow a three-step story:

Promise — “Here’s what I’m asking for, under your rules.”
Perform — “Someone actually did something (maybe).”
Reconcile — “We align our views of what happened.”

3.1 Promise: you enter the other side’s assumptions

Calling an API is essentially:

“I want this outcome, under your rules and constraints.”

request formats
quotas / rate limits / size limits
error codes and retry semantics

If you provide an API, you’re promising things too:

what you guarantee on success vs failure
what is safe to retry
what the caller should do next (retry? give up? check status?)
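One way to make the "what should the caller do next" part of a promise concrete is to publish it as data. The outcome labels and the `CONTRACT` table below are invented for illustration and do not come from any real API; the point is only that each outcome maps to exactly one next step.

```python
from enum import Enum


class NextStep(Enum):
    DONE = "done"
    RETRY = "retry"
    CHECK_STATUS = "check_status"
    GIVE_UP = "give_up"


# Hypothetical contract: outcome label -> what the caller should do next.
CONTRACT = {
    "ok": NextStep.DONE,
    "rate_limited": NextStep.RETRY,
    "timeout": NextStep.CHECK_STATUS,
    "invalid_request": NextStep.GIVE_UP,
}


def next_step(outcome: str) -> NextStep:
    # Unknown outcomes fall back to checking status rather than blindly retrying.
    return CONTRACT.get(outcome, NextStep.CHECK_STATUS)
```

Defaulting unknown outcomes to `CHECK_STATUS` is an assumption of this sketch: it prefers asking over repeating an action whose effect is unclear.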

3.2 Perform: the moment one side moves ahead

This is where classic distributed weirdness happens:

the server processed it, but the response was lost
the server crashed mid-flight
the client thinks it succeeded, while the server thinks it didn’t happen (or vice versa)

The important question becomes:

What does each participant believe the world looks like right now?

RML-2 failures are often belief gaps, not just exceptions.
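A belief gap can be made visible by comparing what each side has recorded per request. The sketch assumes both participants keep a simple map from request ID to status; the function name and the status strings are hypothetical.

```python
from typing import Dict, Optional, Tuple


def belief_gaps(
    client: Dict[str, str], server: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return request IDs whose status differs between the two views."""
    gaps: Dict[str, Tuple[Optional[str], Optional[str]]] = {}
    for request_id in client.keys() | server.keys():
        client_view = client.get(request_id)  # None means "never heard of it"
        server_view = server.get(request_id)
        if client_view != server_view:
            gaps[request_id] = (client_view, server_view)
    return gaps


client_view = {"r1": "succeeded", "r2": "failed"}
server_view = {"r1": "succeeded", "r2": "succeeded", "r3": "succeeded"}
gaps = belief_gaps(client_view, server_view)
# r2: the two sides disagree; r3: the client has no record at all
```

Note that no exception was raised anywhere in this example, yet two of three requests need follow-up.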

3.3 Reconcile: not rewind—align histories

Reconciliation closes the belief gap:

ask again to confirm status
retry with idempotency keys
cancel the request
run compensation actions
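Of these, retrying with an idempotency key is the easiest to sketch. The toy in-memory service below is hypothetical (`OrderService` and its method are not from any real library); it shows only the core idea that a repeated key returns the stored result instead of performing the action again.

```python
from typing import Dict


class OrderService:
    """Toy in-memory service; the idempotency-key handling is the point."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.orders_created = 0

    def create_order(self, idempotency_key: str, item: str) -> str:
        # A repeated key returns the earlier result instead of acting again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.orders_created += 1
        order_id = f"order-{self.orders_created}"
        self._results[idempotency_key] = order_id
        return order_id


service = OrderService()
first = service.create_order("key-1", "book")
retry = service.create_order("key-1", "book")  # e.g. after a lost response
```

A caller that never saw the first response can safely send the same request again and end up with the same order ID.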

Over time, you push the system toward:

“A state where all participants share a consistent story.”

This convergence is roughly what people mean by eventual consistency:

You can’t instantly align everyone’s world.
But through dialog (retry, check, compensate), you converge.

RML-2 is the worldview that treats eventual consistency as designed conversation.

4) Talk about “how far rollback goes” in world terms

The key RML-2 design question:

How far do you intend this operation to be rollbackable—realistically?

4.1 Three categories of “rollback scope”

Roughly, RML-2 operations fall into three categories:

(1) Rollbackable by yourself

nothing meaningful has changed outside your boundary, or
your “undo” completes entirely within your own service

(2) Requires dialog with the other side

the other side has state that changed
you need cancel/correct/retry as additional dialog turns

(3) Already escalated into RML-3

the effect has reached “history”
rollback becomes refund/correction/explanation
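The three categories can be written down as a small decision function. This is only a sketch: the two boolean inputs are hypothetical simplifications, and in practice deciding whether an effect has "reached history" is a judgment call rather than a flag.

```python
from enum import Enum


class RollbackScope(Enum):
    SELF = 1     # undo completes inside our own service
    DIALOG = 2   # the other side changed state; cancel/correct/retry needed
    HISTORY = 3  # effect reached RML-3; refund/correction/explanation


def rollback_scope(external_state_changed: bool, reached_history: bool) -> RollbackScope:
    # The heaviest applicable category wins.
    if reached_history:
        return RollbackScope.HISTORY
    if external_state_changed:
        return RollbackScope.DIALOG
    return RollbackScope.SELF
```

Tagging each operation with its scope at design time makes it harder to discover category 3 by accident in production.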

RML-2 design is often:

How wide can you make category (1),
and how well-designed is category (2)?

4.2 Example: payment authorization vs capture

A common split:

Authorize (reserve funds) → still more RML-2-like (often cancellable)
Capture (actually charge) → more RML-3-like (refund/correction territory)

Making that boundary explicit helps:

which APIs are still dialog/compensation scope
which APIs must be treated as history-entry points
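A small state machine can make that boundary explicit. The `Payment` class and its state names are hypothetical and not modeled on any particular payment provider; the sketch only encodes that a hold can be voided, while a capture can only be followed by a refund.

```python
class Payment:
    """Toy payment lifecycle: new -> authorized -> voided | captured -> refunded."""

    def __init__(self) -> None:
        self.state = "new"

    def authorize(self) -> None:
        self._move("new", "authorized")

    def void(self) -> None:
        # Cancelling a hold: still dialog/compensation scope.
        self._move("authorized", "voided")

    def capture(self) -> None:
        # History-entry point: after this, only forward corrections remain.
        self._move("authorized", "captured")

    def refund(self) -> None:
        # A refund is a new event layered on top, not an undo of the capture.
        self._move("captured", "refunded")

    def _move(self, expected: str, target: str) -> None:
        if self.state != expected:
            raise ValueError(f"cannot go from {self.state} to {target}")
        self.state = target
```

Calling `void()` after `capture()` raises `ValueError` here, which is the point: the type of "undo" available depends on which side of the boundary the payment is on.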

5) Failure patterns you should expect (worldview-level)

Implementation details come later (Part II), but the worldview-level traps are worth naming now.

5.1 “Only one side rolled back” (split history)

caller rolls back locally: “it never happened”
callee processed it: “it happened”

Now you have two incompatible timelines.

RML framing:

both participants are in RML-2
local rollback alone created a dialog-wide inconsistency

From there, local rewinds can’t fix it. Only continued dialog can (status checks, compensation, reconciliation).

5.2 Depending on an API that isn’t designed for rollback

you assume compensation exists (“we’ll cancel if needed”)
the other service provides no cancel/undo path

This is a worldview mismatch:

“We treated this as RML-2, but the other side behaves like RML-3.”

If an action is effectively irreversible, you must treat it as history-bound.
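A cheap guard against this mismatch is to list, per action, the compensation you believe exists, and to flag every action that has none. The helper and the action names below are hypothetical; a missing compensation is represented as `None`.

```python
from typing import Callable, Dict, List, Optional


def history_bound_actions(
    compensations: Dict[str, Optional[Callable[[], None]]]
) -> List[str]:
    """Return actions with no cancel/undo path, sorted by name."""
    return sorted(name for name, undo in compensations.items() if undo is None)


plan = {
    "reserve_inventory": lambda: None,  # placeholder for a release call
    "authorize_payment": lambda: None,  # placeholder for a void call
    "send_email": None,                 # no way to unsend
}
irreversible = history_bound_actions(plan)
```

Anything this returns should be reviewed as history-bound before the flow ships, not after the first incident.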

6) Humans are also RML-2 participants

RML-2 isn’t just service-to-service.

6.1 User dialog

UI status displays
“processing / completed” emails
chatbot responses

These are dialog turns.

If you get them wrong, you create a mismatch:

the user believes the old timeline
the system believes the new timeline

That’s an RML-2 split history, but with a human participant.

6.2 Operator dialog

Customer support and operators are often:

Human reconcilers in the Dialog World.

They:

notice inconsistencies
trigger manual corrections
coordinate with users

So RML-2 design should explicitly include:

where automation stops
where humans take over
what “handoff” artifacts exist (logs, case IDs, reason codes)
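The handoff point and its artifacts can be sketched as a record that automation produces once it gives up. Everything here is hypothetical: the `Handoff` fields, the attempt limit, and the log reference format are illustrative placeholders, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Handoff:
    case_id: str
    reason_code: str
    log_refs: List[str] = field(default_factory=list)


def escalate(
    case_id: str, reason_code: str, attempts: int, max_auto_attempts: int = 3
) -> Optional[Handoff]:
    """Return a Handoff once automation has used up its attempts, else None."""
    if attempts < max_auto_attempts:
        return None  # automation keeps trying
    # Hand the operator everything needed to continue the dialog.
    refs = [f"reconcile/{case_id}/attempt-{i}" for i in range(1, attempts + 1)]
    return Handoff(case_id=case_id, reason_code=reason_code, log_refs=refs)
```

The explicit `max_auto_attempts` is where "automation stops"; the returned record is what the human reconciler starts from.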

7) RML-2 checklist (worldview edition)

7.1 Map the participants and the dialog

[ ] Who are the participants (services / humans / schedulers)?
[ ] How many turns can the conversation last?
[ ] If something fails mid-way, how do we continue the dialog to settle it?
[ ] What do we do on silence (timeouts)?

7.2 Define rollback scope

[ ] Which actions are rollbackable by yourself?
[ ] Which actions require dialog (cancel/correct/retry)?
[ ] Where is the boundary where this becomes RML-3?

7.3 Reconciliation ownership

[ ] How do we detect “only one side rolled back”?
[ ] Who is the reconciler (automated or human)?
[ ] What is the explicit dialog flow once inconsistency is detected?

Summary — RML-2 is not “transactions,” it’s conversation

What this chapter is trying to fix is a common misunderstanding:

RML-2 rollback is not a rewind operation.
It’s a conversation that continues until the world settles.

RML-1: inside your room
RML-2: multi-party dialog and reconciliation
RML-3: irreversible history and forward-only correction

If you treat RML-2 as “a simpler distributed transaction,” you’ll miss the real work.
If you treat it as “a conversation among participants,” design becomes clearer—especially around silence, retries, compensation, and eventual consistency.

Chapter 4 — RML-3 (History World): when you can’t delete history, only layer corrections.
