kanaria007

Originally published at zenn.dev

The Worlds of Distributed Systems — Align Your Team’s Mental Model

— A series about rollback and trust, explained through three “worlds.”

When you decide you want to “design distributed systems properly,” you end up thinking about questions like:

  • How far back do we need to roll back to call it “correct”?
  • Are sagas and compensating transactions actually enough?
  • If billing or payment goes wrong, where does engineering responsibility end—and where does organizational responsibility begin?

But in real teams, these questions often get postponed with:

“It works for now. Ship it.”

This series, The Worlds of Distributed Systems, puts those uncomfortable questions on the table—and proposes a simple shared map:

Think in three worlds (RML-1 / RML-2 / RML-3).


0) The Three Worlds — What are RML-1/2/3?

The core of this series is a mental model of three “worlds”:

  • RML-1 — Closed World
    A world where nothing has escaped yet: you can fail safely.
    It lives entirely in memory or temporary files and is not externally observable.

  • RML-2 — Dialog World
    A world where services (and users) “talk” their way back to consistency.
    Sagas, compensation, retries, and eventual consistency are the main tools here.

  • RML-3 — History World
    A world where money, legal responsibility, and social trust are involved—where you can’t erase the past.
    You don’t “delete history”; you can only add refunds, corrections, and explanations on top of it.

Here, RML stands for “Rollback Maturity Level.” But what matters isn’t the name—it’s the worldview.

Think of RML as a label that helps your team share one key question:
“Which world are we responsible for, in this operation?”
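
One way to make that question concrete is to label each operation with its world. The sketch below is only an illustration under assumptions: the `World` enum, the `OPERATION_WORLDS` table, and the operation names are hypothetical and do not come from the series itself.

```python
from enum import Enum


class World(Enum):
    """The three worlds of the RML model."""
    CLOSED = "RML-1"   # nothing has escaped yet; failure is safe
    DIALOG = "RML-2"   # consistency is restored by compensation and retries
    HISTORY = "RML-3"  # the past cannot be erased, only corrected


# Hypothetical operation names, each labeled with the world it lives in.
OPERATION_WORLDS = {
    "preview_invoice": World.CLOSED,
    "reserve_inventory": World.DIALOG,
    "charge_card": World.HISTORY,
}


def world_of(operation: str) -> World:
    """Answer "which world are we responsible for?" for one operation."""
    return OPERATION_WORLDS[operation]
```

With a table like this, `world_of("charge_card")` returns `World.HISTORY`, which gives a code review or an incident channel a shared label to point at.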


1) What this series is trying to do

This is not a collection of isolated techniques. It’s an attempt to align the worldview behind those techniques:

  • The hidden assumptions inside “we can roll back”
  • What sagas / eventual consistency truly solve—and what they don’t
  • Where responsibility splits across engineering, SRE, business, and legal

Target readers:

  • Backend engineers working on microservices or event-driven systems
  • SRE / platform engineers
  • Product engineers / PMs handling “high-stakes” domains (payments, billing, etc.)
  • Tech leads frequently pulled into incident response

2) Series structure (Table of Contents)

The series has three parts, followed by a practical adoption guide (Chapter 11) and an epilogue.

Part I: The Worlds of Distributed Systems

— Get the map of the three worlds

Chapter 1

Think about rollback through three worlds — RML-1/2/3 mental model

  • Why “we can roll back” is often a dangerous claim
  • Intuition for RML-1/2/3
  • Label features by “which world they live in”
  • A standalone introductory chapter

Chapter 2

RML-1 — Closed World design principles: build a room where failure is safe

  • RML-1 = “a temporary, unobservable world”
  • Patterns like read-only + dry-run, effect dispatchers
  • Embedding RML-1 inside production (simulation / preview)
  • The difference between staging and RML-1, and the “logging is harmless” trap
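
The dry-run and effect-dispatcher bullets above can be pictured with a small sketch. This is one possible reading, not the chapter's actual design: `EffectDispatcher`, its `dry_run` flag, and the example effect are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EffectDispatcher:
    """Records every requested side effect; runs it only when dry_run is False."""
    dry_run: bool = True
    planned: list[str] = field(default_factory=list)

    def dispatch(self, description: str, effect: Callable[[], None]) -> None:
        self.planned.append(description)
        if not self.dry_run:
            effect()  # the only place where something leaves the closed world


# Hypothetical usage: preview what would happen without sending anything.
sent: list[str] = []
preview = EffectDispatcher(dry_run=True)
preview.dispatch("email receipt to user", lambda: sent.append("receipt"))
```

After this runs, `preview.planned` lists the intended effect while `sent` is still empty. Routing every side effect through one dispatcher is what would let the same business logic run as a preview inside production.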

Chapter 3

RML-2 — Dialog World design principles: rollback as ‘conversation’ between services

  • Distributed systems as a network of conversations
  • Promise / Execute / Reconcile as a three-step structure
  • Designing timeouts and “silence” as part of the dialog
  • The real meaning of eventual consistency as “reconciliation design”

Chapter 4

RML-3 — History World design principles: irreversible history and forward-only correction

  • Domains that naturally become History World (payments, healthcare, public systems, etc.)
  • The “History Hand-off Point” (when an effect becomes history)
  • Effect ledgers and the idea that rollback becomes refund + correction + explanation
  • Where your true accountability is tested: after something goes wrong
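
As a rough illustration of "refund + correction + explanation" on top of an effect ledger, here is a minimal append-only sketch. The class names, fields, and amounts are assumptions made for this example; the chapter may model the ledger quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    kind: str          # "charge", "refund", or "correction"
    amount_cents: int  # signed: charges are positive, refunds negative
    note: str          # the explanation attached to the entry


class EffectLedger:
    """Append-only: there is deliberately no update or delete method."""

    def __init__(self) -> None:
        self._entries: list[LedgerEntry] = []

    def append(self, entry: LedgerEntry) -> None:
        self._entries.append(entry)

    def balance_cents(self) -> int:
        return sum(e.amount_cents for e in self._entries)

    def history(self) -> tuple[LedgerEntry, ...]:
        return tuple(self._entries)


# Hypothetical usage: a mistaken charge is corrected by adding a refund entry.
ledger = EffectLedger()
ledger.append(LedgerEntry("charge", 1200, "monthly plan"))
ledger.append(LedgerEntry("refund", -1200, "customer was billed in error"))
```

The balance returns to zero, yet both entries remain in `ledger.history()`: the mistake and its correction stay visible.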

Part II: Dialog World Patterns (RML-2)

— Design failure and retry as dialog

Chapter 5

Failure design & operational patterns for RML-2 — exceptions, observability, governance

  • Structured errors carrying world / action / reason
  • Action hints returned from server to client
  • Connecting RML to observability policy (e.g., “RML-3 pages at night; RML-1 can wait”)
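
A structured error carrying world / action / reason might look like the sketch below. The class name `RmlError`, the action strings, and the `should_page_at_night` policy function are hypothetical; only the three fields and the "RML-3 pages at night" example come from the outline above.

```python
class RmlError(Exception):
    """An error that says which world it happened in and what to do next."""

    def __init__(self, world: str, action: str, reason: str) -> None:
        super().__init__(f"[{world}] {reason} (suggested action: {action})")
        self.world = world    # "RML-1", "RML-2", or "RML-3"
        self.action = action  # hypothetical hint, e.g. "retry" or "escalate"
        self.reason = reason  # human-readable cause


def should_page_at_night(error: RmlError) -> bool:
    """Hypothetical alerting policy: only History World errors page at night."""
    return error.world == "RML-3"
```

For example, `RmlError("RML-3", "escalate", "double charge detected")` would page, while the same failure raised with world `"RML-1"` could wait until morning.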

Chapter 6

Sagas & compensating transactions — assembling “retryable conversations”

  • A practical framing of event-driven architecture and sagas
  • Idempotency keys as the lifeline for retries
  • Designing retry hints and idempotency together
  • Connecting “eventual consistency” to the RML-2 → RML-3 boundary
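
To show why idempotency keys matter for retries, here is a minimal in-memory sketch. `PaymentService`, its `charge` method, and the receipt format are invented for illustration; a real service would persist the keys and handle concurrent requests.

```python
class PaymentService:
    """In-memory stand-in for a service that must tolerate retried requests."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A retry with the same key returns the stored result
        # instead of charging a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        receipt = f"receipt-{self.charges_made}-{amount_cents}"
        self._results[idempotency_key] = receipt
        return receipt
```

Calling `charge("order-42", 500)` twice returns the same receipt and leaves `charges_made` at 1, so a client that timed out can retry without creating a second charge.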

Chapter 7

API / client design — how (and whether) to expose RML labels externally

  • Expressing world / action in REST
  • Retry-After + exponential backoff + jitter (thundering herd defense)
  • Notes for GraphQL extensions and gRPC metadata
  • Standardizing retry strategy in client libraries
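
The combination of Retry-After, exponential backoff, and jitter can be sketched as a single delay function. The function name, the default `base` and `cap` values, and the choice of "full jitter" are assumptions for this example, and `retry_after` is assumed to be already parsed into seconds (the HTTP header may also carry a date).

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[float] = None,
                base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Exponential backoff with a cap, then "full jitter": a uniform random
    # delay in [0, ceiling] so clients do not all retry at the same moment.
    ceiling = min(cap, base * (2 ** attempt))
    jitter = random.uniform(0.0, ceiling)
    # If the server sent Retry-After, wait at least that long and add the
    # jitter on top, so clients given the same value still spread out.
    if retry_after is not None:
        return retry_after + jitter
    return jitter
```

Putting this in a shared client library is one way to standardize retry behavior, so that individual callers do not each invent their own timing.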

Part III: History World & Governance (RML-3)

— Autonomy, governance, and product strategy in the History World

Chapter 8

History World autonomy — the triangle of legal, business, and SRE

  • Incidents that impact money, legal risk, and brand trust
  • How these three roles govern History World together
  • How ToS / contracts link to RML-3 strategy
  • Where RML-3 incidents land on the P&L

Chapter 9

RML-3 case files — align your incident response worldview

  • Detect → Contain → Understand → Decide → Act → Learn as six phases
  • Runbook (technical ops) = mostly RML-2; Playbook (org decisions) = RML-3
  • A case-file template that records History World properly (timeline with world markers + Decision Log)

Chapter 10

RML as product strategy — designing trust

  • Add an RML column to your backlog
  • Roadmapping: “Should this feature be promoted to RML-3 next quarter?”
  • Connecting RML-2/3 to metrics (error rates, refunds, incident counts)
  • Making the cost of History World visible

Chapter 11

A recipe for adoption — start small, then grow RML

  • Five tiny steps you can start tomorrow, for example:

    • use the word “world” in conversations
    • add a single RML column to the backlog
    • add world to one new exception class
  • Role-based “minimum you should do” (App Eng / SRE / PM / Legal)
  • A kickoff meeting agenda
  • Homework: write up three RML-3 cases from your own team


Epilogue

Engineering with a worldview

  • Taking a beat: “Which world are we talking about?”
  • Sharing the right question, not a single “correct answer”
  • A request: build your team’s Worlds of Distributed Systems

3) A quick reading guide

  • Application engineers

    • Read Chapters 1–4 for the worldview
    • Then map Chapters 5–7 onto your product’s code and architecture
  • SRE / platform engineers

    • Chapters 3, 5–7, and 9–11 carry the most material for this role
    • Read with observability and incident response integration in mind
  • PM / tech leads / legal

    • Skim Chapter 1, then Chapter 4, then Chapters 8–11
    • Use it to decide: “Where does RML-3 begin for our team?”

4) One question that unifies the series

This entire series is built around a single question:

“Which world is this rollback in?”

  • Is it an RML-1 rollback inside a closed room?
  • Is it an RML-2 rollback as dialog and reconciliation?
  • Or is it an RML-3 rollback in the History World—where you can’t erase the past?

If your team learns to pause for this one beat, distributed system design, operations, and incident response tend to get a little more coherent—and a little less painful.

This was the overview.
From Chapter 1 onward, we’ll walk through The Worlds of Distributed Systems step by step.
