kanaria007

Posted on • Originally published at zenn.dev

Chapter 1 — Thinking About Rollback in Distributed Systems Through Three Worlds (RML-1/2/3)

The Worlds of Distributed Systems — Chapter 1
— The hidden worldview behind “We can roll back”

In production, you’ll hear this sentence a lot:

“This feature is always rollbackable.”

But… is it really?

  • How far can you roll back?
  • Who is responsible for what—engineering, SRE, business, legal?
  • Are there things you can’t make “not happen”?

Most teams don’t answer these explicitly. They just ship.

This chapter proposes a shared mental model for “rollback” in distributed systems:

  • RML-1 — Closed World
  • RML-2 — Dialog World
  • RML-3 — History World

RML stands for Rollback Maturity Level, but the important part isn’t the name.
It’s the worldview—a label that helps teams ask the right question:

Which world does this operation live in?


1) Why “We can roll back” is a dangerous sentence

In distributed systems, “rollback” is not one operation. It’s a composite:

“Rollback” is the combination of
which world you’re in, what you’re undoing, and how far back you can realistically go.

A common gap:

  • Engineers often imagine: “rollback the DB transaction.”
  • The business side often imagines: “restore the user’s real world to as-if-it-never-happened.”

If you blur this gap with a vague “we can roll back,” you create hidden risk.

A procedure that looks rigorous does not guarantee real rigor.

Likewise:

Having a rollback procedure
is not the same thing as
preserving the integrity of the world.


2) The three worlds — an intuitive picture

RML-1/2/3 is a label system for one key question:

Where does the effect of this operation exist right now?

RML-1: Closed World

  • Everything completes inside your process
  • No external DB writes, no external API calls, no notifications
  • The blast radius is basically local memory + temporary files

RML-2: Dialog World

  • There is a conversation with other services and/or users
  • Writes to internal DBs, RPC/messages across microservices
  • Rollback is attempted via sagas, retries, and compensation

RML-3: History World

  • There is shared history across organizations, reality, and regulations
  • Bank transfers, credit card payments, healthcare/public infrastructure
  • You don’t erase the past—you acknowledge what happened and layer corrections on top

A rough mental map:

Responsibility weight / rollback cost

^  high ────────────────┐  RML-3: History World
|                       |   - regulation, audits, external orgs
|              ┌────────┘
|              |  RML-2: Dialog World
|              |   - conversations with services/users
|     ┌────────┘
|     | RML-1: Closed World
|     |  - inside a single process
+──────────────────────────>  blast radius / scope of impact

3) RML-1 — Closed World: nothing has escaped yet

Catchphrase:

“Nothing has left the room yet.”

RML-1 is where rollback is “classically” easy:

  • only local memory and temporary storage change
  • nothing is visible to other systems or humans yet

3.1 Typical examples

  • Dry run (compute the result, then discard it)
  • Tests in an isolated staging environment
  • “Simulate everything once before executing” batch workflows

In this world, the old-school idea works:

Take a snapshot → if it fails, restore it.
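As a rough illustration of "snapshot, then restore on failure," here is a minimal sketch. The `Room` class and its method names are hypothetical, and the JSON-based copy is an assumption that only holds for plain, JSON-serializable state. It works precisely because nothing here is visible outside the process.

```typescript
// Hypothetical holder for state that never leaves the process (RML-1).
class Room<T> {
  private state: T;

  constructor(initial: T) {
    this.state = initial;
  }

  // Take a snapshot, run the work against live state, restore on failure.
  // Returns true if the work succeeded, false if the snapshot was restored.
  attempt(work: (state: T) => void): boolean {
    // Assumption: state is plain JSON-serializable data.
    const snapshot = JSON.parse(JSON.stringify(this.state)) as T;
    try {
      work(this.state);
      return true;
    } catch {
      this.state = snapshot; // as if the attempt never happened
      return false;
    }
  }

  get current(): T {
    return this.state;
  }
}

// Usage: a failed attempt leaves no trace in the room.
const cart = new Room({ items: ["book"], total: 10 });
cart.attempt((s) => {
  s.total = 999;
  throw new Error("pricing failed");
});
console.log(cart.current.total);
```

The restore is only honest while the work touches nothing but `state`; a single network call inside `work` breaks the guarantee.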

3.2 The common trap

Teams often think they’re in RML-1, but they’re not:

  • you ship logs to an external log platform
  • you send Slack notifications
  • you “only read,” but the reads hit production DBs (which can add load and leave access traces)

The moment an effect becomes observable outside your room, you’ve moved into RML-2 or beyond.
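One way to make this check mechanical is to list the effects an operation produces and take the highest world any of them touches. The sketch below is illustrative only: the effect names and the mapping are assumptions, and the "RML-2" entries are a floor (the text above says "RML-2 or beyond").

```typescript
// Hypothetical effect categories; a real team would define its own list.
type Effect =
  | "local-memory"
  | "temp-file"
  | "external-log"
  | "chat-notification"
  | "production-db-read"
  | "internal-db-write"
  | "service-call"
  | "external-payment"
  | "audited-record";

type Rml = 1 | 2 | 3;

// Assumed mapping: the minimum world each effect pushes an operation into.
const WORLD_OF_EFFECT: Record<Effect, Rml> = {
  "local-memory": 1,
  "temp-file": 1,
  "external-log": 2,
  "chat-notification": 2,
  "production-db-read": 2,
  "internal-db-write": 2,
  "service-call": 2,
  "external-payment": 3,
  "audited-record": 3,
};

// An operation lives in the highest world reached by any of its effects.
function classify(effects: Effect[]): Rml {
  return effects.reduce<Rml>(
    (highest, effect) =>
      WORLD_OF_EFFECT[effect] > highest ? WORLD_OF_EFFECT[effect] : highest,
    1,
  );
}

// A "dry run" that also ships logs externally is no longer RML-1.
console.log(classify(["local-memory", "temp-file", "external-log"]));
```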


4) RML-2 — Dialog World: rollback as conversation

Catchphrase:

“There is someone else who shares the outcome.”

In RML-2, actions leave the room:

  • you write to internal DBs
  • you call another microservice (RPC / message)
  • you email/notify a user

From that point on:

It’s no longer “roll back my side and we’re done,”
because there is now another party on the other side.

4.1 Rollback as a saga

A common RML-2 tool is the Saga pattern:

  • split a large flow into steps
  • each step has do() and undo()
  • log successful do() steps; on failure, run undo() in reverse

Pseudo-code:

type Step = {
  do: () => Promise<void>;     // must be idempotent
  undo: () => Promise<void>;   // should be idempotent (best-effort)
  key: string;                 // stable idempotency key per step
};

async function runSaga(steps: Step[]) {
  const done: Step[] = [];
  try {
    for (const step of steps) {
      await step.do();
      done.push(step);
    }
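  } catch (e) {
    // Compensate in reverse order (best-effort). The step that threw is not
    // in `done`, so its do() must leave no partial effect or be safe to retry.
    for (const step of [...done].reverse()) {
      try {
        await step.undo();
      } catch (undoError) {
        // In production, record this (step log / alert) and continue.
        console.error(`compensation failed for step ${step.key}`, undoError);
      }
    }
    throw e; // surface the original failure to the caller
  }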
}

The real questions in practice are:

  • Are your idempotency keys designed correctly (stable across retries, unique per logical step)?
  • Is your event/log record a trusted single source of truth?
  • How much do you record about the fact that “compensation happened”?
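To make these questions concrete, here is a small sketch of a step log that uses idempotency keys to skip duplicate work and records compensation as its own fact. `StepLog` and its methods are hypothetical names, and the in-memory array stands in for a durable store; in a real system the action and its log record would need to be committed together, which this sketch does not attempt.

```typescript
type LogEntry = { seq: number; key: string; kind: "done" | "compensated" };

// Hypothetical in-memory step log keyed by idempotency key.
class StepLog {
  private entries: LogEntry[] = [];

  private has(key: string, kind: LogEntry["kind"]): boolean {
    return this.entries.some((e) => e.key === key && e.kind === kind);
  }

  private record(key: string, kind: LogEntry["kind"]): void {
    this.entries.push({ seq: this.entries.length + 1, key, kind });
  }

  // Run the action at most once per key; a retry with the same key is a no-op.
  runOnce(key: string, action: () => void): boolean {
    if (this.has(key, "done")) return false;
    action();
    this.record(key, "done");
    return true;
  }

  // Compensate only steps that ran, and only once; the compensation is
  // appended as a new record rather than deleting the "done" record.
  compensateOnce(key: string, undo: () => void): boolean {
    if (!this.has(key, "done") || this.has(key, "compensated")) return false;
    undo();
    this.record(key, "compensated");
    return true;
  }

  history(): readonly LogEntry[] {
    return this.entries;
  }
}

// Usage: a retried reservation and a retried release each take effect once.
const stepLog = new StepLog();
let reserved = 0;
stepLog.runOnce("order-42:reserve-stock", () => { reserved += 1; });
stepLog.runOnce("order-42:reserve-stock", () => { reserved += 1; }); // skipped
stepLog.compensateOnce("order-42:reserve-stock", () => { reserved -= 1; });
```

Keeping the "compensated" record next to the "done" record is what later lets you explain what happened, not just what the current state is.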

4.2 The reality of RML-2

Even with sagas and compensation, perfect rollback is often impossible:

  • the user already read the email
  • an admin saw the alert and performed manual operations

Which means:

You can no longer return the world to RML-1 by software alone.

So you must design rollback flows together with human operations: runbooks, escalation paths, and reconciliation steps.


5) RML-3 — History World: shared, irreversible history

Catchphrase:

“You can’t delete history. You can only decide what to do next.”

RML-3 is where events become long-lived shared history:

  • banking/payment records
  • healthcare records / public infrastructure logs
  • legally auditable events

In this world:

“Rollback as if it never happened” basically does not exist.

Instead, you do forward-only corrections:

  • refunds
  • corrections
  • invalidations
  • remediation + prevention measures (sometimes with apology and public communication)

5.1 The “Effect Ledger” idea

A helpful framing for RML-3 is an Effect Ledger:

  • record external effects in an append-only log
  • chain it (hashes, signatures) to make tampering hard
  • run a reconciler process that compares ledger ↔ reality and issues corrections (refunds, fixes)

Conceptually:

  • services append “effects” to the ledger
  • a reconciler ensures the world converges via additional corrective effects

And rollback becomes:

Not “erase the past,” but
“add correction records on top of the past.”

(Yes, this resembles some blockchain instincts—but you don’t need crypto hype to benefit from the append-only correction mindset.)
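A minimal sketch of the ledger-plus-reconciler idea follows. It assumes Node.js (for `createHash` from `node:crypto`); the `EffectLedger` class, the `reconcile` function, and the charge/refund shape are all hypothetical. A hash chain alone only makes tampering detectable; signatures and durable storage are out of scope here.

```typescript
import { createHash } from "node:crypto";

type Effect = { id: string; kind: "charge" | "refund"; amount: number };
type Entry = { effect: Effect; prevHash: string; hash: string };

const GENESIS = "0".repeat(64);

function hashEntry(prevHash: string, effect: Effect): string {
  return createHash("sha256")
    .update(prevHash + JSON.stringify(effect))
    .digest("hex");
}

// Hypothetical append-only ledger: entries are added, never edited or removed.
class EffectLedger {
  private entries: Entry[] = [];
  private lastHash = GENESIS;

  append(effect: Effect): Entry {
    const entry: Entry = {
      effect,
      prevHash: this.lastHash,
      hash: hashEntry(this.lastHash, effect),
    };
    this.entries.push(entry);
    this.lastHash = entry.hash;
    return entry;
  }

  // Recompute the chain; any edited entry breaks it.
  verify(): boolean {
    let prev = GENESIS;
    for (const e of this.entries) {
      if (e.prevHash !== prev || e.hash !== hashEntry(prev, e.effect)) return false;
      prev = e.hash;
    }
    return true;
  }

  // Net amount taken from the customer: charges minus refunds.
  net(): number {
    return this.entries.reduce(
      (sum, e) => sum + (e.effect.kind === "charge" ? e.effect.amount : -e.effect.amount),
      0,
    );
  }

  all(): readonly Entry[] {
    return this.entries;
  }
}

// Reconciler: if the ledger shows more charged than expected, append a
// corrective refund. Under-charges return null and are left to humans.
function reconcile(ledger: EffectLedger, expectedNet: number): Effect | null {
  const excess = ledger.net() - expectedNet;
  if (excess <= 0) return null;
  const correction: Effect = {
    id: `correction-${ledger.all().length + 1}`,
    kind: "refund",
    amount: excess,
  };
  ledger.append(correction);
  return correction;
}
```

Note that a double charge is never deleted in this sketch: the ledger ends up with both charges and a refund, which is exactly the history an auditor would want to see.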

5.2 When RML-3 is required

  • finance, healthcare, public sector
  • cross-company integrations, cross-border data processing
  • any system where you owe an explanation to a third party later

In RML-3:

  • responsibility spans organizations
  • a single service cannot unilaterally “delete the past”

6) Common anti-patterns

6.1 Treating an RML-3 world like RML-1

The most dangerous state is:

  • reality includes:

    • external payments
    • user notifications
    • logs subject to audit
  • but the narrative says:

“We can roll back by just reverting the DB.”

A rollback procedure won’t save you from reality.

6.2 Forgetting that “rollback” means different things by world

  • In RML-1, rollback means: snapshot restore.
  • In RML-2, rollback means: reverse the conversation with other services/users.
  • In RML-3, rollback means: publish corrections and share the facts.

If you discuss all of these under one word (“rollback”), the conversation will break somewhere—guaranteed.


7) How to use this in practice: label features by world

Here’s a simple, pragmatic adoption method.

7.1 Add an RML column to your backlog / design docs

For example:

Feature               | Description                                                | RML
Simulation / dry run  | Uses production-like data but discards results             | 1
Create invoice record | Persisted in internal DB; referenced by multiple services  | 2
Credit card payment   | External payment gateway + user assets                     | 3

When people argue “is this RML-2 or RML-3?”, that’s not a problem—that’s the point.
That’s the discussion you needed.

7.2 Define a “minimum checklist” per world

Example:

  • RML-1

    • snapshot/checkpoint approach
    • monitoring/logging for failure patterns
  • RML-2

    • saga/compensation design
    • idempotency keys + effect logs
    • runbooks for human intervention points
  • RML-3

    • effect ledger strategy
    • reconciler ownership and operational flows
    • refund/correction/invalidation policies and process

This creates a shared standard:
“Because this feature is RML-2, we must at least do these things.”
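If you want the checklist to be enforceable rather than aspirational, it can live as data next to the feature labels. The sketch below copies the example checklist above into a lookup table; the `Feature` shape and `missingItems` helper are hypothetical, and whether higher worlds should also inherit lower-world items is left open (this sketch does not assume they do).

```typescript
type Rml = 1 | 2 | 3;

// The example checklist from this section, as data.
const MINIMUM_CHECKLIST: Record<Rml, string[]> = {
  1: [
    "snapshot/checkpoint approach",
    "monitoring/logging for failure patterns",
  ],
  2: [
    "saga/compensation design",
    "idempotency keys + effect logs",
    "runbooks for human intervention points",
  ],
  3: [
    "effect ledger strategy",
    "reconciler ownership and operational flows",
    "refund/correction/invalidation policies and process",
  ],
};

// Hypothetical backlog entry: a feature, its world, and what is already done.
type Feature = { name: string; rml: Rml; done: string[] };

// Return the checklist items for the feature's world that are not done yet.
function missingItems(feature: Feature): string[] {
  return MINIMUM_CHECKLIST[feature.rml].filter(
    (item) => !feature.done.includes(item),
  );
}

// Usage: an RML-2 feature with only the saga design in place.
const invoice: Feature = {
  name: "Create invoice record",
  rml: 2,
  done: ["saga/compensation design"],
};
console.log(missingItems(invoice));
```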

7.3 Promote features over time

You don’t have to start with everything as RML-3.

A realistic approach:

  • design most things as RML-2 initially
  • promote only the parts that truly become “history” into RML-3

Closing — Rigor in rollback begins with rigor in worldview

To summarize:

  • Behind “We can roll back,” there are three different worlds: RML-1/2/3
  • When the world changes, everything changes:

    • responsibility boundaries
    • what rollback even means
    • the required design and operations
  • If you want rigor, start by making this explicit:

Which world does this feature live in right now?

Next time you’re about to say “it’s rollbackable,” pause for one beat and ask:

“Rollback in which world?”
