The Worlds of Distributed Systems — Chapter 10
“Will this feature work reliably?”
“Yes—but it depends on which world we’re willing to own.”
So far in this series:
- Chapters 1–4 drew the map of the three worlds,
- Chapters 5–7 drilled into RML-2 (Dialog World) patterns (failures, sagas, APIs),
- Chapters 8–9 focused on RML-3 (History World) autonomy and case files.
In Chapter 10, we zoom out and place RML at the level that decides everything downstream:
RML = a “trust world map” for distributed systems.
Not just an engineering model—but a product strategy tool.
1) Use RML as a world map (not a “maturity level”)
RML is often read as “how mature your rollback is.”
But strategically, it’s more useful as:
A map of how far your product takes responsibility.
A compact restatement:
RML-1 — Closed World
Inside one room: tests, dry runs, simulations.
“A world where you can try freely because you can restart safely.”
RML-2 — Dialog World
Cross-service recovery: retries, compensation, reconciliation.
“A world where you can recover through conversation.”
RML-3 — History World
Money, law, social trust, irreversible records.
“A world where you correct forward, and you must be accountable.”
Now the strategic question becomes:
“Which world does this product step into?”
“Which world do we intentionally not step into (and delegate to others)?”
Once you answer that, these decisions become far easier to align:
- feature scope and granularity
- SLA / SLO promises
- Terms of Service (ToS) and compensation boundaries
- incident response structure and cross-functional ownership
2) Add an “RML column” to your product backlog
In Chapter 7, we put RML into APIs.
At the strategy level, an even earlier move works surprisingly well:
Add an RML column to your feature backlog.
2.1 A simple table
| Feature | Description | RML | Notes |
|---|---|---|---|
| AI inference simulation | Show results only; don’t persist | 1 | production reads only |
| Internal review status update | internal users; DB updates | 2 | reversible via ops |
| Credit card payments | payment gateway + customer assets | 3 | refunds required |
| Tax-grade PDF export | may be submitted to authorities | 3 | versioning + correction log |
Just adding this column changes the review conversation:
- “Is this really RML-2, or is it actually RML-3?”
- “If it’s RML-3, where is the refund/correction/explanation path?”
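Those two review questions can be partly mechanized. Here is a minimal sketch, assuming the backlog can be exported as structured rows. The `BacklogItem` type, its `recovery_path` field, the `review_flags` helper, and the last backlog row are hypothetical names and data invented for illustration; they are not part of the series.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BacklogItem:
    feature: str
    rml: int  # 1 = Closed World, 2 = Dialog World, 3 = History World
    notes: str = ""
    # Hypothetical field: where the refund/correction/explanation path is described.
    recovery_path: Optional[str] = None


def review_flags(items: list[BacklogItem]) -> list[str]:
    """Return review questions for items whose RML label needs discussion."""
    flags = []
    for item in items:
        if item.rml not in (1, 2, 3):
            flags.append(f"{item.feature}: RML label missing or invalid")
        elif item.rml == 3 and not item.recovery_path:
            flags.append(
                f"{item.feature}: RML-3 without a refund/correction/explanation path"
            )
    return flags


backlog = [
    BacklogItem("AI inference simulation", 1, "production reads only"),
    BacklogItem("Internal review status update", 2, "reversible via ops"),
    BacklogItem("Credit card payments", 3, "refunds required", recovery_path="refunds"),
    BacklogItem("Tax-grade PDF export", 3, recovery_path="versioning + correction log"),
    # Made-up row, added only to show what a flagged item looks like.
    BacklogItem("Payout scheduling (hypothetical)", 3),
]

for flag in review_flags(backlog):
    print(flag)
```

The check does not decide whether a label is right. It only makes sure the "where is the path?" question cannot be skipped for anything marked RML-3.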
2.2 Feature “promotion” as a roadmap concept
You don’t need to build everything as RML-3 on day one.
A natural evolution looks like:
- Prototype: RML-1 (simulation only)
- Alpha: RML-2 (internal use / limited customers)
- GA: promote some flows into RML-3
If you make this an explicit roadmap question:
“Do we promote this feature to RML-3 next quarter?”
…then planning becomes concrete:
- when Legal must be involved
- when sagas and idempotency must be redesigned properly
- when you need an Effect Ledger and case files
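One way to keep that roadmap question honest is to track the prerequisites as explicit flags. The sketch below is an assumption about how such a gate could look; `PromotionReadiness` and its field names are hypothetical and simply mirror the three planning items above.

```python
from dataclasses import dataclass


@dataclass
class PromotionReadiness:
    """Hypothetical checklist for promoting a feature from RML-2 to RML-3."""
    legal_reviewed: bool = False
    sagas_and_idempotency_reviewed: bool = False
    effect_ledger_in_place: bool = False
    case_file_process_in_place: bool = False

    def missing(self) -> list[str]:
        # Names of every prerequisite that is still False, in declaration order.
        return [name for name, done in vars(self).items() if not done]

    def ready(self) -> bool:
        return not self.missing()


status = PromotionReadiness(legal_reviewed=True, effect_ledger_in_place=True)
print(status.ready())
print(status.missing())
```

In a roadmap review, the `missing()` list is the concrete work that stands between "we want this in RML-3 next quarter" and actually being able to own it.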
3) RML and metrics: measuring “trust” by world
Product strategy always collapses into metrics.
With RML, you can separate “health” into three layers.
3.1 RML-1 metrics (Closed World)
Focus: Are we able to try safely?
- test coverage / number of simulations
- dry-run execution count and failure patterns
- bugs discovered in RML-1 before production rollout
3.2 RML-2 metrics (Dialog World)
Focus: How much recovery can we automate through conversation?
- saga success rate / compensation activation rate
- RML-2 error counts (`world = RML2`) by code
- auto-retry resolution rate vs human ops intervention rate
- % of critical external calls with idempotency keys
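As a rough illustration, the RML-2 ratios above can be derived from a handful of raw counters. This is a sketch under stated assumptions: the `Rml2Counters` fields and the example numbers are made up, and a real system would pull them from its own telemetry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rml2Counters:
    """Hypothetical raw counters for one reporting window."""
    sagas_started: int
    sagas_succeeded: int
    sagas_compensated: int
    errors_auto_resolved: int   # resolved by automatic retry
    errors_human_resolved: int  # needed human ops intervention
    critical_calls: int
    critical_calls_with_idempotency_key: int


def ratio(part: int, whole: int) -> float:
    """Safe ratio: 0.0 when there is nothing to measure."""
    return part / whole if whole else 0.0


def rml2_metrics(c: Rml2Counters) -> dict[str, float]:
    errors = c.errors_auto_resolved + c.errors_human_resolved
    return {
        "saga_success_rate": ratio(c.sagas_succeeded, c.sagas_started),
        "compensation_activation_rate": ratio(c.sagas_compensated, c.sagas_started),
        "auto_retry_resolution_rate": ratio(c.errors_auto_resolved, errors),
        "human_intervention_rate": ratio(c.errors_human_resolved, errors),
        "idempotency_key_coverage": ratio(
            c.critical_calls_with_idempotency_key, c.critical_calls
        ),
    }


# Made-up numbers, for illustration only.
example = Rml2Counters(
    sagas_started=200,
    sagas_succeeded=190,
    sagas_compensated=8,
    errors_auto_resolved=30,
    errors_human_resolved=10,
    critical_calls=50,
    critical_calls_with_idempotency_key=45,
)
metrics = rml2_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```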
3.3 RML-3 metrics (History World)
Focus: How often do we generate irreversible trust-cost events?
- RML-3 incident count (quarterly)
- per-incident:
  - affected users
  - monetary cost (refunds, credits, ops time)
- time from detection → containment (MTTD / MTTC)
- recurrence rate (same incident class reappears)
At this point, metrics connect directly to:
- P&L (refunds and support cost)
- brand impact
- trust and retention
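The RML-3 figures can be rolled up from individual incident records. The following is a sketch, not a prescribed schema: `Rml3Incident`, its fields, and the sample incidents are assumptions for illustration, and "recurrence" is interpreted here as any incident whose class already appeared in the same reporting window.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Rml3Incident:
    incident_class: str
    affected_users: int
    cost: float  # refunds + credits + ops time, in one currency
    detected_at: datetime
    contained_at: datetime


def summarize(incidents: list[Rml3Incident]) -> dict:
    """Roll up one reporting window (for example, a quarter) of RML-3 incidents."""
    if not incidents:
        return {"count": 0}
    containment = [i.contained_at - i.detected_at for i in incidents]
    per_class = Counter(i.incident_class for i in incidents)
    # Every occurrence of a class beyond the first counts as a recurrence.
    recurrences = sum(n - 1 for n in per_class.values())
    return {
        "count": len(incidents),
        "affected_users": sum(i.affected_users for i in incidents),
        "total_cost": sum(i.cost for i in incidents),
        "mean_time_to_contain": sum(containment, timedelta()) / len(containment),
        "recurrence_rate": recurrences / len(incidents),
    }


# Made-up incidents, for illustration only.
t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    Rml3Incident("double-charge", 12, 100.0, t0, t0 + timedelta(hours=1)),
    Rml3Incident("double-charge", 3, 50.0, t0, t0 + timedelta(hours=2)),
    Rml3Incident("wrong-export", 40, 0.0, t0, t0 + timedelta(hours=3)),
]
summary = summarize(incidents)
print(summary)
```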
3.4 Split dashboards by world
Even a simple dashboard split changes management discussions:
- “Error rate is high but RML-3 incidents are zero” → likely a Dialog World design/ops improvement problem
- “Error rate is low but RML-3 incidents are rising” → your History World governance (Ch.8–9) is the real gap
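The two readings above can be written down as a tiny triage rule. This is only a sketch: the function name, the return labels, and the `high_error_rate` threshold are assumptions, and a rising RML-3 count is given priority over everything else as a design choice of this example.

```python
def triage(
    error_rate: float,
    rml3_incidents: int,
    prev_rml3_incidents: int,
    high_error_rate: float = 0.05,  # arbitrary illustrative threshold
) -> str:
    """Map the two dashboard signals to the discussion they most likely call for."""
    if rml3_incidents > prev_rml3_incidents:
        # Rising History World incidents win, regardless of the error rate.
        return "history-world governance gap"
    if error_rate >= high_error_rate and rml3_incidents == 0:
        return "dialog-world design/ops improvement"
    return "no world-specific signal"


print(triage(error_rate=0.08, rml3_incidents=0, prev_rml3_incidents=0))
print(triage(error_rate=0.01, rml3_incidents=3, prev_rml3_incidents=1))
```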
4) RML across the product lifecycle
4.1 Planning & design
- put RML labels into backlog and specs
- decide “how far we own the world” as part of the requirements
- for RML-3 candidates, involve Legal early
4.2 Implementation
Apply Dialog World patterns (Ch.5–7):
- structured errors with `world` / `action`
- idempotency keys + saga discipline
- RML metadata in API boundaries
- tests that validate promotion conditions (“when does this become RML-3?”)
4.3 Release & operations
- observability tags: `rml.world`, `rml.action`
- RML-3 alerts create incidents automatically
- maintain Runbooks (RML-2) and Playbooks (RML-3)
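To make the first two operational points concrete, here is a minimal sketch of tagging an event and opening an incident when the world is RML-3. `IncidentTracker` is an in-memory stand-in for whatever incident system you use, and the action values (`"retry"`, `"escalate"`) are hypothetical; only the tag names `rml.world` and `rml.action` come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentTracker:
    """In-memory stand-in for a real incident system (hypothetical)."""
    opened: list[dict] = field(default_factory=list)

    def open_incident(self, summary: str, tags: dict) -> None:
        self.opened.append({"summary": summary, "tags": tags})


def record_event(summary: str, world: str, action: str, tracker: IncidentTracker) -> dict:
    """Build the observability tags and open an incident for RML-3 events."""
    tags = {"rml.world": world, "rml.action": action}
    if world == "RML3":
        tracker.open_incident(summary, tags)
    return tags


tracker = IncidentTracker()
record_event("retry storm on inventory API", "RML2", "retry", tracker)
record_event("duplicate charge detected", "RML3", "escalate", tracker)
print(len(tracker.opened))
```

RML-2 events are still tagged, so they show up on the Dialog World dashboard, but only the RML-3 event pages anyone.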
4.4 Sunset & replacement
RML matters even when you shut things down:
- retiring an RML-3 feature:
  - what happens to the history (records, ledgers, retention)?
- migrating systems:
  - how do you shift RML-2 dialogs (APIs/sagas) safely?
  - how do you share RML-3 responsibility during the transition?
A useful sentence for planning:
“Sunsetting a feature means withdrawing from a world.”
That changes the granularity of your sunset plan.
5) The minimum viable adoption set
If this feels heavy, you can adopt RML in stages.
Step 1: adopt the vocabulary only
- say “this is RML-1 / RML-3-ish” in reviews
- add an RML column to backlog
- add an `RML-World` field to incident reports
No implementation changes needed yet.
You’re aligning the meta-language first.
Step 2: implement the minimum RML-2 patterns
- add `world` / `action` to your structured error type (e.g., `RmlError`) (Ch.5)
- add idempotency keys to critical external calls (Ch.6)
- add `X-RML-World` / `X-RML-Action` to API responses (Ch.7)
Now you can:
- tag telemetry by world
- alert on RML-3-ish signals
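A compact sketch of what Step 2 might look like in code follows. It is not the implementation from Chapters 5–7: the constructor signature of `RmlError`, the `to_headers` method, the action value `"retry"`, and the in-memory `charge_once` function that simulates an external call are all assumptions made for illustration.

```python
import uuid


class RmlError(Exception):
    """Structured error carrying the RML world and a suggested action."""

    def __init__(self, code: str, world: str, action: str, message: str = ""):
        super().__init__(message or code)
        self.code = code
        self.world = world
        self.action = action

    def to_headers(self) -> dict[str, str]:
        # Header names follow the text; how they are attached depends on your framework.
        return {"X-RML-World": self.world, "X-RML-Action": self.action}


def new_idempotency_key() -> str:
    return str(uuid.uuid4())


# Simulated external service: remembers the first result seen for each key.
_processed: dict[str, str] = {}


def charge_once(idempotency_key: str, amount_cents: int) -> str:
    """A repeated key returns the first result instead of charging again."""
    if idempotency_key not in _processed:
        _processed[idempotency_key] = f"charged:{amount_cents}"
    return _processed[idempotency_key]


err = RmlError("PAYMENT_TIMEOUT", world="RML2", action="retry")
print(err.to_headers())

key = new_idempotency_key()
first = charge_once(key, 500)
retry = charge_once(key, 500)  # safe to retry: same key, same result
print(first == retry)
```

The point of pairing the two is that an error tagged with a retry action is only safe to act on automatically if the call it describes carries an idempotency key.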
Step 3: use case files only for confirmed RML-3 incidents
- if the org agrees “this is RML-3,” then always write the Ch.9 case file.
This gives you visibility into:
- history-grade costs (refunds, credits)
- where your bottleneck is (detect/contain/decide)
- where your contracts/runbooks/playbooks are drifting
6) Strategy-level anti-patterns
6.1 Treating RML as a compliance label only
If RML becomes “audit paperwork,” it disconnects from real engineering and ops.
Fix:
- always connect it to implementation patterns (Ch.5–7) and governance (Ch.8–9)
6.2 “World inflation” (everything becomes RML-3)
If everything is treated as History World:
- the org becomes permanently incident-fatigued
- decision loops slow to a crawl
Fix:
- pre-agree on “default-to-RML-3” conditions (Ch.8)
- handle the rest in RML-2, and promote only when needed
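Pre-agreed conditions work best when they are simple enough to write down as a rule. The sketch below does not reproduce the Chapter 8 conditions; the `EffectProfile` flags are hypothetical examples loosely derived from the "money, law, irreversible records" description of the History World.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EffectProfile:
    """Hypothetical flags describing what an operation touches."""
    moves_customer_money: bool = False
    creates_legal_record: bool = False
    crosses_service_boundary: bool = False


def classify_world(p: EffectProfile) -> str:
    # Only the pre-agreed conditions default to RML-3; everything else stays lower.
    if p.moves_customer_money or p.creates_legal_record:
        return "RML3"
    if p.crosses_service_boundary:
        return "RML2"
    return "RML1"


print(classify_world(EffectProfile(crosses_service_boundary=True)))
print(classify_world(EffectProfile(moves_customer_money=True, crosses_service_boundary=True)))
```

Because the rule is explicit, "should this be RML-3?" becomes a discussion about changing the conditions, not a case-by-case escalation.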
6.3 Shipping a new business without deciding which world you own
A common failure mode:
- launch quickly
- the History World boundary is left undefined
- incidents arrive before governance exists
Fix:
- require a slide in every planning review:
“Which world does this business step into, and which worlds does it avoid?”
7) Final checklist: do you have a worldview?
7.1 World mapping
- [ ] Can you label major features as RML-1/2/3?
- [ ] Are RML-3 features agreed with Legal and Business?
- [ ] Is “promotion” (RML-2 → RML-3) part of roadmap discussions?
7.2 Implementation & ops
- [ ] StructuredError / ActionHint includes `world` / `action`
- [ ] critical external calls use idempotency keys
- [ ] observability supports `rml.world` / `rml.action`
- [ ] APIs carry RML metadata (REST/GraphQL/gRPC)
7.3 Governance & incidents
- [ ] RML-3 incident definition and escalation rules are documented
- [ ] Legal/Business/SRE triangle ownership (RACI) is explicit
- [ ] case file template includes `world` + Decision Log
- [ ] Runbook (RML-2) and Playbook (RML-3) exist
7.4 Business & trust
- [ ] You can estimate RML-3 incident costs (roughly)
- [ ] ToS compensation scope matches actual RML-3 behavior
- [ ] Reducing RML-3 incidents is treated as a strategic objective
Closing — “Which world is this rollback?”
Distributed systems discussions quickly become technical:
- consistency models
- queues and retries
- sagas and compensation
- APIs, gRPC, messaging…
All of that matters.
But the high-leverage pause—the one this series keeps returning to—is:
“Which world are we talking about?”
- RML-1: a world where you can try safely
- RML-2: a world where you recover through dialog
- RML-3: a world where you carry history and correct forward
When the world changes:
- rollback changes meaning
- incident weight changes
- product responsibility changes
So the next time you’re about to say:
- “We can roll it back,” or
- “We can compensate,”
pause for one beat and ask:
“Rollback in which world—
and which world’s history are we asking whom to carry?”
That question alone tends to upgrade a distributed system from “works most of the time” into something closer to designed trust.
Next:
Chapter 11 — A Field Recipe for RML: Start Small, Grow It