kanaria007

Posted on Feb 14 • Originally published at zenn.dev

The Worlds of Distributed Systems — Align Your Team’s Mental Model

#distributedsystems #microservices #sre #architecture

— A series about rollback and trust, explained through three “worlds.”

When you decide you want to “design distributed systems properly,” you end up thinking about questions like:

How far back do we need to roll back to call it “correct”?
Are sagas and compensating transactions actually enough?
If billing or payment goes wrong, where does engineering responsibility end—and where does organizational responsibility begin?

But in real teams, these questions often get postponed with:

“It works for now. Ship it.”

This series, The Worlds of Distributed Systems, puts those uncomfortable questions on the table—and proposes a simple shared map:

Think in three worlds (RML-1 / RML-2 / RML-3).

0) The Three Worlds — What are RML-1/2/3?

The core of this series is a mental model of three “worlds”:

RML-1 — Closed World
A world where nothing has escaped yet: you can fail safely.
It lives entirely in memory or temporary files and is not externally observable.
RML-2 — Dialog World
A world where services (and users) “talk” their way back to consistency.
Sagas, compensation, retries, and eventual consistency are the main tools here.
RML-3 — History World
A world where money, legal responsibility, and social trust are involved—where you can’t erase the past.
You don’t “delete history”; you can only add refunds, corrections, and explanations on top of it.

Here, RML stands for “Rollback Maturity Level.” But what matters isn’t the name—it’s the worldview.

Think of RML as a label that helps your team share one key question:
“Which world are we responsible for, in this operation?”
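
As a minimal sketch of what such a label could look like in code, a team might tag each operation with the world it touches. Everything here is an assumption made for illustration and not something the series defines: the `World` enum, the `OPERATIONS` table, the operation names, and the rule that a workflow takes the highest world among its steps.

```python
from enum import IntEnum


class World(IntEnum):
    """Hypothetical label for the three worlds described above."""
    RML1_CLOSED = 1   # nothing has escaped yet; failure is safe
    RML2_DIALOG = 2   # services talk their way back to consistency
    RML3_HISTORY = 3  # the past cannot be erased, only corrected


# Illustrative operation names; a real team would list its own.
OPERATIONS = {
    "preview_invoice": World.RML1_CLOSED,
    "reserve_inventory": World.RML2_DIALOG,
    "charge_credit_card": World.RML3_HISTORY,
}


def world_of(operation: str) -> World:
    """Answer: which world are we responsible for in this operation?"""
    return OPERATIONS[operation]


def highest_world(operations: list[str]) -> World:
    """Assumed rule: a workflow is labeled by the highest world it touches."""
    return max(world_of(op) for op in operations)
```

With this table, a checkout flow that previews an invoice and then charges a card would be labeled RML-3, which is the kind of answer the question above is meant to surface early.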

1) What this series is trying to do

This is not a collection of isolated techniques. It’s an attempt to align the worldview behind techniques:

The hidden assumptions inside “we can roll back”
What sagas / eventual consistency truly solve—and what they don’t
Where responsibility splits across engineering, SRE, business, and legal

Target readers:

Backend engineers working on microservices or event-driven systems
SRE / platform engineers
Product engineers / PMs handling “high-stakes” domains (payments, billing, etc.)
Tech leads frequently pulled into incident response

2) Series structure (Table of Contents)

The series is structured as 3 parts + a practical guide + an epilogue.

Part I: The Worlds of Distributed Systems

— Get the map of the three worlds

Chapter 1

Think about rollback through three worlds — RML-1/2/3 mental model

Why “we can roll back” is often a dangerous claim
Intuition for RML-1/2/3
Label features by “which world they live in”
A standalone introductory chapter

Chapter 2

RML-1 — Closed World design principles: build a room where failure is safe

RML-1 = “a temporary, unobservable world”
Patterns like read-only + dry-run, effect dispatchers
Embedding RML-1 inside production (simulation / preview)
The difference between staging and RML-1, and the “logging is harmless” trap
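
To give a feel for the dry-run and effect-dispatcher idea named above, here is a minimal sketch. It assumes one possible shape of the pattern: business logic only describes effects, and a dispatcher decides whether they leave the process. `EffectDispatcher`, `place_order`, and the effect names are hypothetical and not taken from the chapter.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EffectDispatcher:
    """Hypothetical dispatcher: the only place where effects may escape."""
    dry_run: bool = True
    planned: list[str] = field(default_factory=list)

    def dispatch(self, name: str, action: Callable[[], None]) -> None:
        self.planned.append(name)  # recorded in memory only
        if not self.dry_run:
            action()               # the effect leaves RML-1 here


sent: list[str] = []  # stands in for an external system


def place_order(dispatcher: EffectDispatcher) -> None:
    dispatcher.dispatch("send_email", lambda: sent.append("email"))
    dispatcher.dispatch("charge_card", lambda: sent.append("charge"))


# Preview mode: the plan is visible to the caller, but nothing is sent.
preview = EffectDispatcher(dry_run=True)
place_order(preview)
```

After the preview run, `preview.planned` lists both effects while `sent` stays empty. Running the same `place_order` with `dry_run=False` would execute them.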

Chapter 3

RML-2 — Dialog World design principles: rollback as ‘conversation’ between services

Distributed systems as a network of conversations
Promise / Execute / Reconcile as a three-step structure
Designing timeouts and “silence” as part of the dialog
The real meaning of eventual consistency as “reconciliation design”

Chapter 4

RML-3 — History World design principles: irreversible history and forward-only correction

Domains that naturally become History World (payments, healthcare, public systems, etc.)
The “History Hand-off Point” (when an effect becomes history)
Effect ledgers and the idea that rollback becomes refund + correction + explanation
Where your true accountability is tested: after something goes wrong
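
The idea that rollback becomes refund + correction + explanation can be sketched as an append-only ledger. This is an illustrative assumption about what an effect ledger might look like, not the chapter's design. `LedgerEntry`, `EffectLedger`, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    kind: str          # "charge", "refund", or "correction"
    amount_cents: int  # signed: charges positive, refunds negative
    explanation: str   # why this entry exists


class EffectLedger:
    """Hypothetical append-only ledger: entries are never edited or deleted."""

    def __init__(self) -> None:
        self._entries: list[LedgerEntry] = []

    def append(self, entry: LedgerEntry) -> None:
        self._entries.append(entry)

    def refund(self, amount_cents: int, explanation: str) -> None:
        # "Rollback" in the History World: add a compensating entry on top.
        self.append(LedgerEntry("refund", -amount_cents, explanation))

    def balance_cents(self) -> int:
        return sum(e.amount_cents for e in self._entries)

    def history(self) -> tuple[LedgerEntry, ...]:
        return tuple(self._entries)  # read-only view for callers


ledger = EffectLedger()
ledger.append(LedgerEntry("charge", 5000, "Monthly plan"))
ledger.refund(5000, "Charged twice due to a retry bug; refunded in full")
```

The balance returns to zero, but both entries remain in the history. The mistake and its correction are each on record, along with the explanation.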

Part II: Dialog World Patterns (RML-2)

— Design failure and retry as dialog

Chapter 5

Failure design & operational patterns for RML-2 — exceptions, observability, governance

Structured errors carrying world / action / reason
Action hints returned from server to client
Connecting RML to observability policy (e.g., “RML-3 pages at night; RML-1 can wait”)
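
A structured error carrying world / action / reason might look roughly like the sketch below. The class name `WorldError`, the action vocabulary, and the paging rule are assumptions for illustration. The paging rule simply encodes the "RML-3 pages at night; RML-1 can wait" example from the list above.

```python
class WorldError(Exception):
    """Hypothetical structured error carrying world / action / reason."""

    def __init__(self, world: int, action: str, reason: str) -> None:
        super().__init__(f"RML-{world} [{action}] {reason}")
        self.world = world    # 1, 2, or 3
        self.action = action  # hint for the caller, e.g. "retry" or "escalate"
        self.reason = reason  # human-readable cause

    def to_dict(self) -> dict:
        """Shape that could be returned to a client as an action hint."""
        return {
            "world": f"RML-{self.world}",
            "action": self.action,
            "reason": self.reason,
        }


def should_page_at_night(err: WorldError) -> bool:
    # Assumed policy: only History World (RML-3) failures wake someone up.
    return err.world >= 3
```

An alerting rule could then branch on `err.world` instead of parsing error messages.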

Chapter 6

Sagas & compensating transactions — assembling “retryable conversations”

A practical framing of event-driven architecture and sagas
Idempotency keys as the lifeline for retries
Designing retry hints and idempotency together
Connecting “eventual consistency” to the RML-2 → RML-3 boundary
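
To show why idempotency keys matter for retries, here is a small in-memory sketch. `PaymentService` and its fields are hypothetical. A production version would need durable storage and an atomic check-and-store, which this sketch deliberately leaves out.

```python
import uuid


class PaymentService:
    """Hypothetical in-memory service that deduplicates by idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with the same key gets the stored result back
        # instead of triggering a second charge.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {
            "charge_id": f"ch_{self.charges_made}",
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())            # generated once per logical request
first = service.charge(key, 5000)
retry = service.charge(key, 5000)  # e.g. the client timed out and retried
```

Both calls return the same charge, and `service.charges_made` stays at 1, so the retry is safe even though the client could not tell whether the first attempt succeeded.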

Chapter 7

API / client design — how (and whether) to expose RML labels externally

Expressing world / action in REST
Retry-After + exponential backoff + jitter (thundering herd defense)
Notes for GraphQL extensions and gRPC metadata
Standardizing retry strategy in client libraries
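
The combination of Retry-After, exponential backoff, and jitter could be standardized in a client library along these lines. This is a sketch under stated assumptions: `retry_delay` and its parameters are hypothetical, the `Retry-After` value is assumed to be already parsed into seconds, and the jitter variant shown is "full jitter" (a uniformly random wait up to the exponential ceiling).

```python
import random
from typing import Optional


def retry_delay(attempt: int,
                retry_after: Optional[float] = None,
                base: float = 0.5,
                cap: float = 30.0,
                rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The exponential ceiling is min(cap, base * 2**attempt). The jitter is
    drawn uniformly from [0, ceiling] and added on top of any server-provided
    Retry-After, so clients never retry earlier than the server asked and do
    not all come back at the same instant.
    """
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    jitter = rng.uniform(0.0, ceiling)
    return (retry_after or 0.0) + jitter
```

Adding the jitter after `Retry-After`, instead of waiting exactly that long, is one way to avoid a thundering herd: otherwise every client told to wait 10 seconds would return together at second 10.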

Part III: History World & Governance (RML-3)

— Autonomy, governance, and product strategy in the History World

Chapter 8

History World autonomy — the triangle of legal, business, and SRE

Incidents that impact money, legal risk, and brand trust
How these three roles govern History World together
How ToS / contracts link to RML-3 strategy
Where RML-3 incidents land on the P&L

Chapter 9

RML-3 case files — align your incident response worldview

Detect → Contain → Understand → Decide → Act → Learn as six phases
Runbook (technical ops) = mostly RML-2; Playbook (org decisions) = RML-3
A case-file template that records History World properly (timeline with world markers + Decision Log)

Chapter 10

RML as product strategy — designing trust

Add an RML column to your backlog
Roadmapping: “Should this feature be promoted to RML-3 next quarter?”
Connecting RML-2/3 to metrics (error rates, refunds, incident counts)
Making the cost of History World visible

Chapter 11

A recipe for adoption — start small, then grow RML

Five tiny steps you can start tomorrow, for example:
- use the word “world” in conversations
- add a single RML column to the backlog
- add world to one new exception class
Role-based “minimum you should do” (App Eng / SRE / PM / Legal)
A kickoff meeting agenda
Homework: write 3 “your team’s RML-3 cases”

Epilogue

Engineering with a worldview

Taking a beat: “Which world are we talking about?”
Sharing the right question, not a single “correct answer”
A request: build your team’s Worlds of Distributed Systems

3) A quick reading guide

Application engineers
- Read Chapters 1–4 for the worldview
- Then map Chapters 5–7 onto your product’s code and architecture
SRE / platform engineers
- Chapters 3, 5–7, and 9–11 are especially “dense”
- Read with observability and incident response integration in mind
PM / tech leads / legal
- Skim Chapter 1, then Chapter 4, then Chapters 8–11
- Use it to decide: “Where does RML-3 begin for our team?”

4) One question that unifies the series

This entire series is built around a single question:

“Which world is this rollback in?”

Is it an RML-1 rollback inside a closed room?
Is it an RML-2 rollback as dialog and reconciliation?
Or is it an RML-3 rollback in the History World—where you can’t erase the past?

If your team learns to pause for this one beat, distributed system design, operations, and incident response tend to get a little more coherent—and a little less painful.

This was the overview.
From Chapter 1 onward, we’ll walk through The Worlds of Distributed Systems step by step.
