kanaria007

Posted on Feb 14 • Originally published at zenn.dev

Chapter 1 — Thinking About Rollback in Distributed Systems Through Three Worlds (RML-1/2/3)

#distributedsystems #microservices #sre #architecture

The Worlds of Distributed Systems — Chapter 1
— The hidden worldview behind “We can roll back”

In production, you’ll hear this sentence a lot:

“This feature is always rollbackable.”

But… is it really?

How far can you roll back?
Who is responsible for what—engineering, SRE, business, legal?
Are there things you can’t make “not happen”?

Most teams don’t answer these explicitly. They just ship.

This chapter proposes a shared mental model for “rollback” in distributed systems:

RML-1 — Closed World
RML-2 — Dialog World
RML-3 — History World

RML stands for Rollback Maturity Level, but the important part isn’t the name.
It’s the worldview—a label that helps teams ask the right question:

Which world does this operation live in?

1) Why “We can roll back” is a dangerous sentence

In distributed systems, “rollback” is not one operation. It’s a composite:

“Rollback” is the combination of which world you’re in, what you’re undoing, and how far back you can realistically go.

A common gap:

Engineers often imagine: “rollback the DB transaction.”
The business side often imagines: “restore the user’s real world to as-if-it-never-happened.”

If you blur this gap with a vague “we can roll back,” you create hidden risk.

A procedure that looks rigorous does not guarantee real rigor.

Likewise:

Having a rollback procedure is not the same thing as preserving the integrity of the world.

2) The three worlds — an intuitive picture

RML-1/2/3 is a label system for one key question:

Where does the effect of this operation exist right now?

RML-1: Closed World

Everything completes inside your process
No external DB writes, no external API calls, no notifications
The blast radius is basically local memory + temporary files

RML-2: Dialog World

There is a conversation with other services and/or users
Writes to internal DBs, RPC/messages across microservices
Rollback is attempted via sagas, retries, compensation

RML-3: History World

There is shared history across organizations, reality, and regulations
Bank transfers, credit card payments, healthcare/public infrastructure
You don’t erase the past—you acknowledge what happened and layer corrections on top

A rough mental map:

Responsibility weight / rollback cost

^  high ────────────────┐  RML-3: History World
|                       |   - regulation, audits, external orgs
|              ┌────────┘
|              |  RML-2: Dialog World
|              |   - conversations with services/users
|     ┌────────┘
|     | RML-1: Closed World
|     |  - inside a single process
+──────────────────────────>  blast radius / scope of impact

3) RML-1 — Closed World: nothing has escaped yet

Catchphrase:

“Nothing has left the room yet.”

RML-1 is where rollback is “classically” easy:

only local memory and temporary storage change
nothing is visible to other systems or humans yet

3.1 Typical examples

Dry run (compute the result, then discard it)
Tests in an isolated staging environment
“Simulate everything once before executing” batch workflows

In this world, the old-school idea works:

Take a snapshot → if it fails, restore it.
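As a minimal sketch of the snapshot-and-restore idea, assuming the state is plain JSON-serializable data held in one process (the `ClosedRoom` class and the cart shape are hypothetical names used only for illustration):

```typescript
// Hypothetical holder for in-process state. Assumes the state is
// JSON-serializable, so a string copy is a faithful snapshot.
class ClosedRoom<T> {
  private state: T;

  constructor(initial: T) {
    this.state = initial;
  }

  get current(): T {
    return this.state;
  }

  // Apply `mutate`; if it throws, restore the snapshot and report failure.
  attempt(mutate: (state: T) => void): boolean {
    const snapshot = JSON.stringify(this.state); // take a snapshot
    try {
      mutate(this.state);
      return true;
    } catch {
      this.state = JSON.parse(snapshot) as T; // restore it
      return false;
    }
  }
}

const room = new ClosedRoom({ items: ["book"], total: 10 });
const ok = room.attempt((cart) => {
  cart.items.push("pen");
  cart.total += 2;
  throw new Error("validation failed after a partial update");
});
console.log(ok, room.current); // the partial update is gone
```

This only works because nothing outside the process saw the intermediate state. Once `mutate` writes to a database or sends a message, restoring the local snapshot no longer restores the world.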

3.2 The common trap

Teams often think they’re in RML-1, but they’re not:

you ship logs to an external log platform
you send Slack notifications
you “only read,” but the reads hit production DBs (adding load and showing up in access logs)

The moment an effect becomes observable outside your room, you’ve moved into RML-2 or beyond.
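One way to make that rule checkable is to list an operation's effects and take the highest world any of them touches. The sketch below assumes a hand-written mapping from effect kinds to worlds. The effect names and the mapping are illustrative choices based on the examples in this chapter, not a standard.

```typescript
type RmlLevel = 1 | 2 | 3;

// Hypothetical effect categories, chosen to mirror the examples in the text.
type EffectKind =
  | "local-memory"
  | "temp-file"
  | "external-log"
  | "notification"
  | "prod-db-read"
  | "internal-db-write"
  | "rpc"
  | "external-payment"
  | "audited-record";

// Assumed mapping: anything observable outside the process is at least RML-2.
const LEVEL_BY_EFFECT: Record<EffectKind, RmlLevel> = {
  "local-memory": 1,
  "temp-file": 1,
  "external-log": 2,
  "notification": 2,
  "prod-db-read": 2,
  "internal-db-write": 2,
  "rpc": 2,
  "external-payment": 3,
  "audited-record": 3,
};

// An operation lives in the highest world touched by any of its effects.
function classify(effects: EffectKind[]): RmlLevel {
  let level: RmlLevel = 1;
  for (const effect of effects) {
    const candidate = LEVEL_BY_EFFECT[effect];
    if (candidate > level) level = candidate;
  }
  return level;
}

console.log(classify(["local-memory", "temp-file"]));    // a true dry run
console.log(classify(["local-memory", "external-log"])); // the "dry run" that ships logs
```

The second call is the trap described above: one external log line is enough to leave the closed room.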

4) RML-2 — Dialog World: rollback as conversation

Catchphrase:

“There is someone else who shares the outcome.”

In RML-2, actions leave the room:

you write to internal DBs
you call another microservice (RPC / message)
you email/notify a user

From that point on:

It’s no longer “roll back my side and we’re done,”
because there is another party on the other side.

4.1 Rollback as a saga

A common RML-2 tool is the Saga pattern:

split a large flow into steps
each step has do() and undo()
log successful do() steps; on failure, run undo() in reverse

Pseudo-code:

type Step = {
  key: string;                 // stable idempotency key per step
  do: () => Promise<void>;     // must be idempotent
  undo: () => Promise<void>;   // should be idempotent (best-effort)
};

async function runSaga(steps: Step[]): Promise<void> {
  const done: Step[] = [];
  try {
    for (const step of steps) {
      await step.do();
      done.push(step);
    }
  } catch (e) {
    // Compensate in reverse order (best-effort).
    // Copy first so the record of completed steps is not mutated.
    for (const step of [...done].reverse()) {
      try {
        await step.undo();
      } catch (undoError) {
        // In production, record this (step log / alert) and continue.
        console.error(`compensation failed for step ${step.key}`, undoError);
      }
    }
    throw e;
  }
}

The real questions in practice are:

Are your idempotency keys stable across retries and unique per step?
Is your event/log record a trusted single source of truth?
How much do you record about the fact that “compensation happened”?
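To make those questions concrete, here is a small sketch of a step log keyed by idempotency key. It assumes an in-memory `Map` stands in for durable storage, and the names `KeyedStep`, `StepLog`, `runOnce`, and `compensate` are hypothetical.

```typescript
type StepStatus = "done" | "compensated" | "compensation_failed";

type KeyedStep = {
  key: string;                 // stable across retries, unique per step
  run: () => Promise<void>;
  undo: () => Promise<void>;
};

// Hypothetical in-memory step log. A real one would need durable storage.
type StepLog = Map<string, StepStatus>;

// Run a step at most once per key; returns false when the key was already seen.
async function runOnce(log: StepLog, step: KeyedStep): Promise<boolean> {
  if (log.has(step.key)) return false;
  await step.run();
  log.set(step.key, "done");
  return true;
}

// Undo a step only if the log says it is "done", and record the outcome
// either way, so the fact that compensation happened is itself on record.
async function compensate(log: StepLog, step: KeyedStep): Promise<StepStatus | undefined> {
  if (log.get(step.key) !== "done") return log.get(step.key);
  try {
    await step.undo();
    log.set(step.key, "compensated");
  } catch {
    log.set(step.key, "compensation_failed"); // hand off to a human runbook
  }
  return log.get(step.key);
}
```

The log only guards against replays it has recorded. If the process dies after `run` succeeds but before the log write, a retry will call `run` again, which is why the step itself still has to be idempotent.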

4.2 The reality of RML-2

Even with sagas and compensation, perfect rollback is often impossible:

the user already read the email
an admin saw the alert and performed manual operations

Which means:

You can no longer return the world to RML-1 by software alone.

So you must design rollback flows together with human operations: runbooks, escalation paths, and reconciliation steps.

5) RML-3 — History World: shared, irreversible history

Catchphrase:

“You can’t delete history. You can only decide what to do next.”

RML-3 is where events become long-lived shared history:

banking/payment records
healthcare records / public infrastructure logs
legally auditable events

In this world:

“Rollback as if it never happened” basically does not exist.

Instead, you do forward-only corrections:

refunds
corrections
invalidations
remediation + prevention measures (sometimes with apology and public communication)

5.1 The “Effect Ledger” idea

A helpful framing for RML-3 is an Effect Ledger:

record external effects in an append-only log
chain it (hashes, signatures) to make tampering hard
run a reconciler process that compares ledger ↔ reality and issues corrections (refunds, fixes)
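The first two points can be sketched as a hash-chained, append-only list. This is a minimal illustration that assumes Node.js and its built-in `node:crypto` module. It covers hashing only (no signatures), and the entry layout and function names are hypothetical.

```typescript
import { createHash } from "node:crypto";

type LedgerEntry = {
  seq: number;
  effect: string;   // hypothetical encoding, e.g. "charge:order-1:1000"
  prevHash: string; // hash of the previous entry
  hash: string;     // hash over this entry's seq, effect, and prevHash
};

const GENESIS = "0".repeat(64);

function hashEntry(seq: number, effect: string, prevHash: string): string {
  return createHash("sha256").update(`${seq}|${effect}|${prevHash}`).digest("hex");
}

// Appending returns a new array; existing entries are never edited.
function append(ledger: readonly LedgerEntry[], effect: string): LedgerEntry[] {
  const last = ledger[ledger.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  const seq = ledger.length;
  return [...ledger, { seq, effect, prevHash, hash: hashEntry(seq, effect, prevHash) }];
}

// Recompute the chain from the start; any edited entry breaks it.
function verify(ledger: readonly LedgerEntry[]): boolean {
  let prev = GENESIS;
  let index = 0;
  for (const entry of ledger) {
    const expected = hashEntry(entry.seq, entry.effect, entry.prevHash);
    if (entry.seq !== index || entry.prevHash !== prev || entry.hash !== expected) {
      return false;
    }
    prev = entry.hash;
    index++;
  }
  return true;
}
```

A hash chain on its own only makes tampering detectable by whoever can recompute it. Someone who can rewrite the whole list can also rewrite the hashes, which is where signatures or an externally anchored head hash would come in.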

Conceptually:

services append “effects” to the ledger
a reconciler ensures the world converges via additional corrective effects

And rollback becomes:

Not “erase the past,” but
“add correction records on top of the past.”
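A reconciler in this spirit can be sketched as a pure function: it reads the ledger, compares it with what should be true, and returns new entries to append. The sketch assumes a simplified payment ledger and an `expected` map supplied by some other system of record. Both are hypothetical, and only overcharges are corrected automatically.

```typescript
type PaymentEffect = {
  kind: "charge" | "refund";
  orderId: string;
  amount: number; // in minor units, e.g. cents
};

// Net amount the ledger says each order has been charged.
function netByOrder(ledger: readonly PaymentEffect[]): Map<string, number> {
  const net = new Map<string, number>();
  for (const effect of ledger) {
    const signed = effect.kind === "charge" ? effect.amount : -effect.amount;
    net.set(effect.orderId, (net.get(effect.orderId) ?? 0) + signed);
  }
  return net;
}

// Compare the ledger with the expected charge per order and return the
// corrective effects to append. The ledger itself is never modified.
function reconcile(
  ledger: readonly PaymentEffect[],
  expected: ReadonlyMap<string, number>,
): PaymentEffect[] {
  const corrections: PaymentEffect[] = [];
  netByOrder(ledger).forEach((charged, orderId) => {
    const want = expected.get(orderId) ?? 0;
    if (charged > want) {
      corrections.push({ kind: "refund", orderId, amount: charged - want });
    }
    // Undercharges are deliberately left for a human decision in this sketch.
  });
  return corrections;
}

const ledger: PaymentEffect[] = [
  { kind: "charge", orderId: "order-1", amount: 1000 },
  { kind: "charge", orderId: "order-1", amount: 1000 }, // accidental double charge
];
const fixes = reconcile(ledger, new Map([["order-1", 1000]]));
const corrected = [...ledger, ...fixes]; // the double charge stays on record
console.log(corrected.length);
```

Running `reconcile` again on `corrected` returns no further corrections, which is the convergence property the reconciler is there to provide.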

(Yes, this resembles some blockchain instincts—but you don’t need crypto hype to benefit from the append-only correction mindset.)

5.2 When RML-3 is required

finance, healthcare, public sector
cross-company integrations, cross-border data processing
any system where you owe an explanation to a third party later

In RML-3:

responsibility spans organizations
a single service cannot unilaterally “delete the past”

6) Common anti-patterns

6.1 Treating an RML-3 world like RML-1

The most dangerous state is:

Reality includes:
- external payments
- user notifications
- logs subject to audit

But the narrative says:

“We can roll back by just reverting the DB.”

A rollback procedure won’t save you from reality.

6.2 Forgetting that “rollback” means different things by world

In RML-1, rollback means: snapshot restore.
In RML-2, rollback means: reverse the conversation with other services/users.
In RML-3, rollback means: publish corrections and share the facts.
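One way to keep the three meanings apart in code is to give each world its own shape, so nobody can say "rollback" without saying which one. The type and field names below are hypothetical and exist only to illustrate the distinction.

```typescript
type Rollback =
  | { world: 1; restoreSnapshot: string }                   // snapshot id to restore
  | { world: 2; compensations: string[]; runbook: string }  // undo steps plus a human path
  | { world: 3; corrections: string[]; notify: string[] };  // forward-only records

function describe(rollback: Rollback): string {
  switch (rollback.world) {
    case 1:
      return `restore snapshot ${rollback.restoreSnapshot}`;
    case 2:
      return `run ${rollback.compensations.length} compensation(s), then follow ${rollback.runbook}`;
    case 3:
      return `publish ${rollback.corrections.length} correction(s) and inform ${rollback.notify.join(", ")}`;
  }
}

console.log(describe({ world: 3, corrections: ["refund order-1"], notify: ["customer", "finance"] }));
```

Because the union is discriminated on `world`, the compiler rejects a plan that offers a snapshot restore for an RML-3 operation.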

If you discuss all of these under one word (“rollback”), the conversation will break somewhere—guaranteed.

7) How to use this in practice: label features by world

Here’s a simple, pragmatic adoption method.

7.1 Add an RML column to your backlog / design docs

For example:

| Feature | Description | RML |
| --- | --- | --- |
| Simulation / dry run | Uses production-like data but discards results | 1 |
| Create invoice record | Persisted in internal DB; referenced by multiple services | 2 |
| Credit card payment | External payment gateway + user assets | 3 |

When people argue “is this RML-2 or RML-3?”, that’s not a problem—that’s the point.
That’s the discussion you needed.

7.2 Define a “minimum checklist” per world

Example:

RML-1
- snapshot/checkpoint approach
- monitoring/logging for failure patterns
RML-2
- saga/compensation design
- idempotency keys + effect logs
- runbooks for human intervention points
RML-3
- effect ledger strategy
- reconciler ownership and operational flows
- refund/correction/invalidation policies and process

This creates a shared standard:
“Because this feature is RML-2, we must at least do these things.”
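If the labels and checklists live in a repository, the standard can be checked mechanically. The sketch below assumes each feature records its RML label and the checklist items it has completed. The item identifiers and the `Feature` shape are made up for illustration, and the checklists are kept per world exactly as listed above rather than made cumulative.

```typescript
type RmlLevel = 1 | 2 | 3;

// Checklist items from the example above, as hypothetical identifiers.
const MIN_CHECKLIST: Record<RmlLevel, string[]> = {
  1: ["snapshot-or-checkpoint", "failure-monitoring"],
  2: ["saga-compensation", "idempotency-keys-and-effect-logs", "human-runbooks"],
  3: ["effect-ledger", "reconciler-ownership", "correction-policies"],
};

type Feature = {
  name: string;
  rml: RmlLevel;
  completed: string[];
};

// Items the feature's world requires that are not yet marked as completed.
function missingItems(feature: Feature): string[] {
  return MIN_CHECKLIST[feature.rml].filter(
    (item) => feature.completed.indexOf(item) === -1,
  );
}

const invoice: Feature = {
  name: "Create invoice record",
  rml: 2,
  completed: ["saga-compensation"],
};
console.log(missingItems(invoice));
```

A check like this could run in a design review or CI step, turning "we must at least do these things" into a list of concrete gaps per feature.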

7.3 Promote features over time

You don’t have to start with everything as RML-3.

A realistic approach:

design most things as RML-2 initially
promote only the parts that truly become “history” into RML-3

Closing — Rigor in rollback begins with rigor in worldview

To summarize:

Behind “We can roll back,” there are three different worlds: RML-1/2/3
When the world changes, everything changes:
- responsibility boundaries
- what rollback even means
- the required design and operations
If you want rigor, start by making this explicit:

Which world does this feature live in right now?

Next time you’re about to say “it’s rollbackable,” pause for one beat and ask:

“Rollback in which world?”
