The Worlds of Distributed Systems — Chapter 1
— The hidden worldview behind “We can roll back”
In production, you’ll hear this sentence a lot:
“This feature is always rollbackable.”
But… is it really?
- How far can you roll back?
- Who is responsible for what—engineering, SRE, business, legal?
- Are there things you can’t make “not happen”?
Most teams don’t answer these explicitly. They just ship.
This chapter proposes a shared mental model for “rollback” in distributed systems:
- RML-1 — Closed World
- RML-2 — Dialog World
- RML-3 — History World
RML stands for Rollback Maturity Level, but the important part isn’t the name.
It’s the worldview—a label that helps teams ask the right question:
Which world does this operation live in?
1) Why “We can roll back” is a dangerous sentence
In distributed systems, “rollback” is not one operation. It’s a composite:
“Rollback” is the combination of
which world you’re in, what you’re undoing, and how far back you can realistically go.
A common gap:
- Engineers often imagine: “rollback the DB transaction.”
- The business side often imagines: “restore the user’s real world to as-if-it-never-happened.”
If you blur this gap with a vague “we can roll back,” you create hidden risk.
A procedure that looks rigorous does not guarantee real rigor.
Likewise:
Having a rollback procedure
is not the same thing as
preserving the integrity of the world.
2) The three worlds — an intuitive picture
RML-1/2/3 is a label system for one key question:
Where does the effect of this operation exist right now?
RML-1: Closed World
- Everything completes inside your process
- No external DB writes, no external API calls, no notifications
- The blast radius is basically local memory + temporary files
RML-2: Dialog World
- There is a conversation with other services and/or users
- Writes to internal DBs, RPC/messages across microservices
- Rollback is attempted via sagas, retries, compensation
RML-3: History World
- There is shared history across organizations, reality, and regulations
- Bank transfers, credit card payments, healthcare/public infrastructure
- You don’t erase the past—you acknowledge what happened and layer corrections on top
A rough mental map:
Responsibility weight / rollback cost
^ high                      ┌──────── RML-3: History World
|                           |  - regulation, audits, external orgs
|                  ┌────────┘
|                  |  RML-2: Dialog World
|                  |  - conversations with services/users
|         ┌────────┘
|         |  RML-1: Closed World
|         |  - inside a single process
+──────────────────────────────────────> blast radius / scope of impact
3) RML-1 — Closed World: nothing has escaped yet
Catchphrase:
“Nothing has left the room yet.”
RML-1 is where rollback is “classically” easy:
- only local memory and temporary storage change
- nothing is visible to other systems or humans yet
3.1 Typical examples
- Dry run (compute the result, then discard it)
- Tests in an isolated staging environment
- “Simulate everything once before executing” batch workflows
In this world, the old-school idea works:
Take a snapshot → if it fails, restore it.
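As a rough illustration of that snapshot idea, here is a minimal sketch. The helper name `applyOrRestore` is made up for this example, and it assumes the state is plain JSON-serializable data that lives entirely in memory, which is exactly the RML-1 condition.

```typescript
// Hypothetical helper: take a snapshot, try the change, hand back the
// snapshot if the change fails. Assumes `state` is JSON-serializable.
function applyOrRestore<T>(
  state: T,
  mutate: (s: T) => void
): { state: T; restored: boolean } {
  // Deep copy taken before any change is made.
  const snapshot = JSON.parse(JSON.stringify(state)) as T;
  try {
    mutate(state);
    return { state, restored: false };
  } catch {
    // The half-mutated object is discarded; the caller gets the snapshot.
    return { state: snapshot, restored: true };
  }
}

const outcome = applyOrRestore({ items: ["book"], total: 12 }, (cart) => {
  cart.items.push("pen");
  cart.total += 3;
  throw new Error("validation failed after a partial update");
});

console.log(outcome.restored, outcome.state.total); // prints: true 12
```

This only works because nobody else has seen the half-updated cart. Once another system can observe it, restoring a local copy no longer restores the world.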
3.2 The common trap
Teams often think they’re in RML-1, but they’re not:
- you ship logs to an external log platform
- you send Slack notifications
- you “only read” but hit production DBs (the reads add load and can show up in query or access logs)
The moment an effect becomes observable outside your room, you’ve moved into RML-2 or beyond.
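One way to catch this trap early is to list the effects an operation has and let the most far-reaching one decide the label. The sketch below is only an illustration: the effect names and the mapping to worlds are assumptions made for this example, and your own team may well draw the lines differently.

```typescript
// Hypothetical effect categories. The names are illustrative, not a standard.
type EffectKind =
  | "local-memory"
  | "temp-file"
  | "production-db-read"
  | "internal-db-write"
  | "internal-rpc"
  | "external-log"
  | "chat-notification"
  | "user-email"
  | "external-payment"
  | "audit-record";

type RmlLevel = 1 | 2 | 3;

// Assumed mapping: anything observable outside the process is at least RML-2.
const WORLD_OF_EFFECT: Record<EffectKind, RmlLevel> = {
  "local-memory": 1,
  "temp-file": 1,
  "production-db-read": 2,
  "internal-db-write": 2,
  "internal-rpc": 2,
  "external-log": 2,
  "chat-notification": 2,
  "user-email": 2,
  "external-payment": 3,
  "audit-record": 3,
};

// An operation lives in the highest world that any of its effects reaches.
function classifyRml(effects: EffectKind[]): RmlLevel {
  let level: RmlLevel = 1;
  for (const effect of effects) {
    const world = WORLD_OF_EFFECT[effect];
    if (world > level) level = world;
  }
  return level;
}

console.log(classifyRml(["local-memory", "temp-file"])); // prints: 1
console.log(classifyRml(["local-memory", "external-log"])); // prints: 2
console.log(classifyRml(["internal-db-write", "external-payment"])); // prints: 3
```

The second call is the trap in miniature: a job that looks like a dry run stops being RML-1 as soon as it ships a log line somewhere else.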
4) RML-2 — Dialog World: rollback as conversation
Catchphrase:
“There is someone else who shares the outcome.”
In RML-2, actions leave the room:
- you write to internal DBs
- you call another microservice (RPC / message)
- you email/notify a user
From that point on:
It’s no longer “roll back my side and we’re done,”
because there is an other side.
4.1 Rollback as a saga
A common RML-2 tool is the Saga pattern:
- split a large flow into steps
- each step has do() and undo()
- log successful do() steps; on failure, run undo() in reverse
Pseudo-code:
```typescript
type Step = {
  do: () => Promise<void>;   // must be idempotent
  undo: () => Promise<void>; // should be idempotent (best-effort)
  key: string;               // stable idempotency key per step
};

async function runSaga(steps: Step[]): Promise<void> {
  const done: Step[] = [];
  try {
    for (const step of steps) {
      await step.do();
      done.push(step); // record only the steps whose do() succeeded
    }
  } catch (e) {
    // Compensate in reverse order (best-effort).
    for (const step of [...done].reverse()) {
      try {
        await step.undo();
      } catch (undoError) {
        // In production, record this (step log / alert) and continue.
        console.error(`undo failed for step ${step.key}`, undoError);
      }
    }
    throw e; // the caller still needs to know the saga failed
  }
}
```
The real questions in practice are:
- Do you design idempotency keys correctly?
- Is your event/log record a trusted single source of truth?
- How much do you record about the fact that “compensation happened”?
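To make the first question concrete, here is a minimal sketch of what an idempotency key buys you. `IdempotentExecutor` is a hypothetical class written for this post, and it keeps keys in memory only. A real system would need durable storage shared by every worker, which this sketch does not attempt.

```typescript
// Hypothetical in-memory executor: the same key runs the operation at most
// once and hands every later caller the first outcome.
class IdempotentExecutor {
  private readonly outcomes = new Map<string, Promise<unknown>>();

  run<T>(key: string, op: () => Promise<T>): Promise<T> {
    const existing = this.outcomes.get(key);
    if (existing !== undefined) {
      return existing as Promise<T>; // same key: reuse the first outcome
    }
    const started = op();
    this.outcomes.set(key, started);
    // Forget failed attempts so that a retry with the same key can run again.
    started.catch(() => {
      if (this.outcomes.get(key) === started) {
        this.outcomes.delete(key);
      }
    });
    return started;
  }
}

async function demoIdempotency(): Promise<number> {
  const executor = new IdempotentExecutor();
  let reservations = 0;
  const reserve = () =>
    executor.run("order-42:reserve-stock", async () => {
      reservations += 1;
    });
  await reserve();
  await reserve(); // for example, a retry after a timeout
  return reservations;
}

demoIdempotency().then((count) => console.log(count)); // prints: 1
```

The key format (`order-42:reserve-stock`) is also just an assumption. What matters is that it is stable across retries of the same step, so a retried do() or undo() does not repeat the effect.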
4.2 The reality of RML-2
Even with sagas and compensation, perfect rollback is often impossible:
- the user already read the email
- an admin saw the alert and performed manual operations
Which means:
You can no longer return the world to RML-1 by software alone.
So you must design rollback flows together with human operations: runbooks, escalation paths, and reconciliation steps.
5) RML-3 — History World: shared, irreversible history
Catchphrase:
“You can’t delete history. You can only decide what to do next.”
RML-3 is where events become long-lived shared history:
- banking/payment records
- healthcare records / public infrastructure logs
- legally auditable events
In this world:
“Rollback as if it never happened” basically does not exist.
Instead, you do forward-only corrections:
- refunds
- corrections
- invalidations
- remediation + prevention measures (sometimes with apology and public communication)
5.1 The “Effect Ledger” idea
A helpful framing for RML-3 is an Effect Ledger:
- record external effects in an append-only log
- chain it (hashes, signatures) to make tampering hard
- run a reconciler process that compares ledger ↔ reality and issues corrections (refunds, fixes)
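The first two bullets can be sketched in a few lines. Everything here is illustrative: `EffectLedger` and its methods are names invented for this example, and the hash function is a tiny non-cryptographic stand-in (FNV-1a) chosen only to keep the sketch self-contained. A real ledger would use a cryptographic hash such as SHA-256, and probably signatures as well.

```typescript
type LedgerEntry = {
  seq: number;
  effect: string; // e.g. "charge:order-42:1200"
  prevHash: string;
  hash: string;
};

// Stand-in hash (32-bit FNV-1a). NOT cryptographic; an assumption made
// to avoid dependencies in this sketch.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("0000000" + h.toString(16)).slice(-8);
}

class EffectLedger {
  private readonly entries: LedgerEntry[] = [];
  private lastHash = "genesis";

  // Append-only: there is deliberately no update or delete method.
  append(effect: string): LedgerEntry {
    const seq = this.entries.length;
    const prevHash = this.lastHash;
    const hash = fnv1a(`${seq}|${effect}|${prevHash}`);
    const entry: LedgerEntry = { seq, effect, prevHash, hash };
    this.entries.push(entry);
    this.lastHash = hash;
    return entry;
  }

  // Recompute the chain; an edited or re-linked entry makes this fail.
  verify(): boolean {
    let prevHash = "genesis";
    for (const e of this.entries) {
      if (e.prevHash !== prevHash) return false;
      if (e.hash !== fnv1a(`${e.seq}|${e.effect}|${e.prevHash}`)) return false;
      prevHash = e.hash;
    }
    return true;
  }

  // Exposes the raw entries so an auditor (or a test) can inspect them.
  entriesForAudit(): LedgerEntry[] {
    return this.entries;
  }
}

const effectLedger = new EffectLedger();
effectLedger.append("charge:order-42:1200");
effectLedger.append("refund:order-42:1200"); // a correction is a new entry
console.log(effectLedger.verify()); // prints: true
```

Note that the refund does not touch the charge entry. Both stay in the log, and the chain makes it hard to quietly rewrite either one.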
Conceptually:
- services append “effects” to the ledger
- a reconciler ensures the world converges via additional corrective effects
And rollback becomes:
Not “erase the past,” but
“add correction records on top of the past.”
(Yes, this resembles some blockchain instincts—but you don’t need crypto hype to benefit from the append-only correction mindset.)
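Here is a minimal sketch of the reconciler half of that idea, under some loud assumptions: effects are reduced to charges and refunds, "reality" is represented by a hypothetical table of what each order should have been charged, and only overcharges are corrected. The names `RecordedEffect` and `reconcile` are invented for this example.

```typescript
type RecordedEffect = {
  kind: "charge" | "refund";
  orderId: string;
  amount: number;
};

// Compare what the ledger says happened with what should be true, and
// return the corrective effects that would close the gap.
// This sketch only handles overcharges; undercharges are left alone.
function reconcile(
  ledger: RecordedEffect[],
  expectedCharge: Record<string, number>
): RecordedEffect[] {
  const net: Record<string, number> = {};
  for (const e of ledger) {
    const sign = e.kind === "charge" ? 1 : -1;
    net[e.orderId] = (net[e.orderId] ?? 0) + sign * e.amount;
  }
  const corrections: RecordedEffect[] = [];
  for (const orderId of Object.keys(net)) {
    const overcharged = net[orderId] - (expectedCharge[orderId] ?? 0);
    if (overcharged > 0) {
      corrections.push({ kind: "refund", orderId, amount: overcharged });
    }
  }
  return corrections;
}

// order-42 was charged twice by mistake; order-7 is fine.
const recorded: RecordedEffect[] = [
  { kind: "charge", orderId: "order-42", amount: 1200 },
  { kind: "charge", orderId: "order-42", amount: 1200 },
  { kind: "charge", orderId: "order-7", amount: 500 },
];
const expectedCharges = { "order-42": 1200, "order-7": 500 };

const firstPass = reconcile(recorded, expectedCharges);
console.log(firstPass.length, firstPass[0].kind, firstPass[0].amount); // prints: 1 refund 1200

// Corrections are appended, never substituted for the original entries.
const afterCorrection = recorded.concat(firstPass);
console.log(reconcile(afterCorrection, expectedCharges).length); // prints: 0
```

The second pass returning nothing is the convergence property: once the correction is itself recorded as an effect, running the reconciler again does not issue another refund.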
5.2 When RML-3 is required
- finance, healthcare, public sector
- cross-company integrations, cross-border data processing
- any system where you owe an explanation to a third party later
In RML-3:
- responsibility spans organizations
- a single service cannot unilaterally “delete the past”
6) Common anti-patterns
6.1 Treating an RML-3 world like RML-1
The most dangerous state is:
- reality includes:
  - external payments
  - user notifications
  - logs subject to audit
- but the narrative says: “We can roll back by just reverting the DB.”
A rollback procedure won’t save you from reality.
6.2 Forgetting that “rollback” means different things by world
- In RML-1, rollback means: snapshot restore.
- In RML-2, rollback means: reverse the conversation with other services/users.
- In RML-3, rollback means: publish corrections and share the facts.
If you discuss all of these under one word (“rollback”), the conversation will break somewhere—guaranteed.
7) How to use this in practice: label features by world
Here’s a simple, pragmatic adoption method.
7.1 Add an RML column to your backlog / design docs
For example:
| Feature | Description | RML |
|---|---|---|
| Simulation / dry run | Uses production-like data but discards results | 1 |
| Create invoice record | Persisted in internal DB; referenced by multiple services | 2 |
| Credit card payment | External payment gateway + user assets | 3 |
When people argue “is this RML-2 or RML-3?”, that’s not a problem—that’s the point.
That’s the discussion you needed.
7.2 Define a “minimum checklist” per world
Example:
- RML-1
  - snapshot/checkpoint approach
  - monitoring/logging for failure patterns
- RML-2
  - saga/compensation design
  - idempotency keys + effect logs
  - runbooks for human intervention points
- RML-3
  - effect ledger strategy
  - reconciler ownership and operational flows
  - refund/correction/invalidation policies and process
This creates a shared standard:
“Because this feature is RML-2, we must at least do these things.”
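If you want that standard to be checkable rather than just agreed on, the checklist can live as data next to the design doc. The sketch below is one possible shape: the type names and the `missingItems` function are hypothetical, and it treats each level's list independently because the checklist above does not say whether higher levels inherit lower ones.

```typescript
type RmlLevel = 1 | 2 | 3;

// The example checklist from this section, encoded as data.
const MINIMUM_CHECKLIST: Record<RmlLevel, string[]> = {
  1: ["snapshot/checkpoint approach", "monitoring/logging for failure patterns"],
  2: [
    "saga/compensation design",
    "idempotency keys + effect logs",
    "runbooks for human intervention points",
  ],
  3: [
    "effect ledger strategy",
    "reconciler ownership and operational flows",
    "refund/correction/invalidation policies and process",
  ],
};

// Hypothetical shape of a design doc entry.
type FeatureDesign = {
  feature: string;
  rml: RmlLevel;
  covered: string[]; // checklist items the design doc already addresses
};

// Returns the checklist items for the feature's world that are not covered.
function missingItems(design: FeatureDesign): string[] {
  return MINIMUM_CHECKLIST[design.rml].filter(
    (item) => design.covered.indexOf(item) === -1
  );
}

const invoiceDesign: FeatureDesign = {
  feature: "Create invoice record",
  rml: 2,
  covered: ["saga/compensation design"],
};

console.log(missingItems(invoiceDesign).length); // prints: 2
```

A check like this could run in design review or CI. The useful part is less the code than the fact that the RML label now has consequences.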
7.3 Promote features over time
You don’t have to start with everything as RML-3.
A realistic approach:
- design most things as RML-2 initially
- promote only the parts that truly become “history” into RML-3
Closing — Rigor in rollback begins with rigor in worldview
To summarize:
- Behind “We can roll back,” there are three different worlds: RML-1/2/3
- When the world changes, everything changes:
  - responsibility boundaries
  - what rollback even means
  - the required design and operations
If you want rigor, start by making this explicit:
Which world does this feature live in right now?
Next time you’re about to say “it’s rollbackable,” pause for one beat and ask:
“Rollback in which world?”