kanaria007

Posted on Mar 19 • Originally published at zenn.dev

Chapter 11 — A Field Recipe for RML: Start Small, Grow It


The Worlds of Distributed Systems — Chapter 11 (Practical Adoption Guide)

“Nice theory… but what do we do tomorrow?”

After Chapter 10, the worldview may feel right—but you might still be unsure about:

where to start,
how far is “too much,”
and how to talk about this across teams.

This chapter is not “more theory.”
It’s a recipe / adoption guide for bringing RML into real work:

You can’t do everything at once.
But you can introduce it gradually—without overhauling the org.

1) Five tiny steps you can start tomorrow

These are intentionally small. You can do them as an individual—even without “organizational permission.”

1.1 Say the word “world” once in a conversation

In a design review or incident retrospective, ask this one question:

“Which world are we talking about—RML-1, RML-2, or RML-3?”

It’s OK if nobody has an answer yet.
The point is to flip the mental switch:

Make “world awareness” part of the discussion.

1.2 Add an `RML` column to a single backlog list

In Jira, Notion, a spreadsheet—anything.

Pick one epic or feature list.
Add a column called RML.
Write 1 / 2 / 3 for a few items, using your best guess.

Later, when you review it with the team, it becomes a productive “discussion starter”:

“Why is this RML-2?”
“Is this actually RML-3 once it touches money/contracts?”

1.3 Add `world` to a new exception class (just once)

Any language is fine. Keep it lightweight.

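// The world a failure belongs to: RML-1, RML-2, or RML-3.
type RmlWorld = "RML1" | "RML2" | "RML3";

class RmlError extends Error {
  readonly world: RmlWorld;

  constructor(args: { world: RmlWorld; message: string }) {
    super(args.message);
    this.name = "RmlError"; // makes the class visible in logs and stack traces
    this.world = args.world;
  }
}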

You don’t need perfect handling on day one.

Even just writing down:

“Which world does this throw assume?”

…changes how engineers reason about failures.
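Once an error carries a world, even a very rough handler can branch on it. The sketch below is one possible first pass, not a prescribed design: `nextStepFor` and the `NextStep` labels are hypothetical names, and the mapping (RML-1 proposals can be dropped, RML-2 goes to a Runbook, RML-3 goes to a Playbook) is an assumption based on how this chapter describes the three worlds.

```typescript
type RmlWorld = "RML1" | "RML2" | "RML3";

class RmlError extends Error {
  readonly world: RmlWorld;

  constructor(args: { world: RmlWorld; message: string }) {
    super(args.message);
    this.name = "RmlError";
    this.world = args.world;
  }
}

// Hypothetical labels for "what do we do next"; rename to fit your team.
type NextStep = "discard-proposal" | "follow-runbook" | "follow-playbook" | "triage";

function nextStepFor(err: unknown): NextStep {
  if (!(err instanceof RmlError)) {
    // Untagged error: someone still has to decide which world it belongs to.
    return "triage";
  }
  switch (err.world) {
    case "RML1":
      return "discard-proposal"; // nothing left the process, nothing to undo
    case "RML2":
      return "follow-runbook"; // technical operations: retries, compensation
    case "RML3":
      return "follow-playbook"; // organizational decision-making
  }
}
```

The useful part is the `"triage"` branch: every untagged error is a visible reminder that the world question has not been answered yet.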

1.4 Retrospect one recent incident with an RML tag

Don’t try to rewrite your entire incident process.

Take a single recent incident and add world markers to the timeline:

“Up to here was RML-2…”
“This point is where it first became RML-3.”

Drawing that one boundary shows exactly when the incident stopped being an operational problem and started leaving a permanent record.
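If the timeline already lives in a tool or a file, the same tagging can be sketched in a few lines. Everything here is illustrative: the `TimelineEntry` shape, the `firstHistoryEntry` helper, and the sample incident are hypothetical, not taken from a real postmortem.

```typescript
type RmlWorld = "RML1" | "RML2" | "RML3";

interface TimelineEntry {
  at: string; // wall-clock time as written in the incident notes
  note: string;
  world: RmlWorld; // best-guess label added during the retrospective
}

// Returns the first entry tagged RML3, or undefined if the incident never got there.
function firstHistoryEntry(timeline: TimelineEntry[]): TimelineEntry | undefined {
  for (const entry of timeline) {
    if (entry.world === "RML3") {
      return entry;
    }
  }
  return undefined;
}

// Hypothetical incident, in chronological order.
const timeline: TimelineEntry[] = [
  { at: "10:02", note: "Payment gateway timeouts begin", world: "RML2" },
  { at: "10:15", note: "Retries enabled, queue draining", world: "RML2" },
  { at: "10:40", note: "Duplicate charges confirmed on customer statements", world: "RML3" },
  { at: "11:05", note: "Refund decision escalated", world: "RML3" },
];

const boundary = firstHistoryEntry(timeline);
// boundary is the 10:40 entry: the point where the incident first became RML-3.
```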

1.5 Write one Runbook and one Playbook (one page each)

From Chapter 9:

Runbook = RML-2 (technical operations)
Playbook = RML-3 (organizational decision-making)

Pick one representative scenario for each:

Runbook: “Payment gateway outage (containment + retries + compensation)”
Playbook: “Incorrect billing occurred (who decides refunds + comms + legal steps)”

Two pages can align the team surprisingly fast.

2) Role-based “do only this” adoption guide

RML becomes easier when each role contributes a small part.

2.1 Application engineers

Add world / action to error types (even minimally)
Add idempotency keys to critical external calls
Ask once per feature: “What’s the RML of this?”

Goal:
Make RML appear in code and design reviews.
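For the idempotency-key item, a minimal sketch of the idea follows. `FakePaymentGateway` is a hypothetical in-memory stand-in for a real provider, and `idempotencyKeyFor` is an assumed helper; real payment APIs differ in how they accept and expire keys.

```typescript
interface ChargeRequest {
  idempotencyKey: string;
  orderId: string;
  amountCents: number;
}

interface ChargeResult {
  chargeId: string;
  amountCents: number;
}

// Hypothetical stand-in for an external payment provider that honors idempotency keys.
class FakePaymentGateway {
  private readonly seen = new Map<string, ChargeResult>();
  private chargeCount = 0;

  charge(req: ChargeRequest): ChargeResult {
    const previous = this.seen.get(req.idempotencyKey);
    if (previous !== undefined) {
      return previous; // replayed request: return the original result, charge nothing
    }
    this.chargeCount += 1;
    const result: ChargeResult = {
      chargeId: `ch_${this.chargeCount}`,
      amountCents: req.amountCents,
    };
    this.seen.set(req.idempotencyKey, result);
    return result;
  }

  totalCharges(): number {
    return this.chargeCount;
  }
}

// Derive the key from the business operation, not from the attempt,
// so every retry of the same order reuses the same key.
function idempotencyKeyFor(orderId: string): string {
  return `charge:${orderId}`;
}

const gateway = new FakePaymentGateway();
const request: ChargeRequest = {
  idempotencyKey: idempotencyKeyFor("order-42"),
  orderId: "order-42",
  amountCents: 1999,
};
const first = gateway.charge(request);
const retry = gateway.charge(request); // e.g. after a client-side timeout
// first and retry share one chargeId, and the customer was charged once.
```

The design choice that matters is where the key comes from: a key generated per attempt would defeat the purpose, because each retry would look like a new charge.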

2.2 SRE / Platform engineers

Add rml.world / rml.action tags to logs/metrics/traces
Create one alert based on RML-3 signals
Prepare one Runbook (RML-2) + one Playbook (RML-3)

Goal:
Make RML flow into dashboards and incident response.
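The tagging and the first alert can start very small. In the sketch below, the field names follow the `rml.world` / `rml.action` tags mentioned above, while the action values (`"retry"`, `"escalate"`), the `tagEvent` and `shouldPage` helpers, and the alert rule itself (page on any RML-3 event in the window) are assumptions for illustration.

```typescript
type RmlWorld = "RML1" | "RML2" | "RML3";

interface TaggedEvent {
  message: string;
  "rml.world": RmlWorld;
  "rml.action": string; // free-form here; the values used below are hypothetical
}

function tagEvent(message: string, world: RmlWorld, action: string): TaggedEvent {
  return { message, "rml.world": world, "rml.action": action };
}

// Assumed alert rule: page as soon as any event in the window is tagged RML3.
function shouldPage(windowEvents: TaggedEvent[]): boolean {
  return windowEvents.some((event) => event["rml.world"] === "RML3");
}

const recentEvents: TaggedEvent[] = [
  tagEvent("charge attempt timed out", "RML2", "retry"),
  tagEvent("duplicate charge detected", "RML3", "escalate"),
];

const page = shouldPage(recentEvents);
// page is true because one event in the window is tagged RML3.
```

In a real stack the same two fields would be attached as structured log fields, metric labels, or span attributes, and the rule would live in the alerting system rather than in application code.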

2.3 Product managers / POs

Allow (or encourage) adding an RML column to the backlog
Put “promotion” on the roadmap agenda, e.g. “Do we promote this to RML-3 next quarter?”
Roughly track RML-3 costs (refunds, coupons, support time)

Goal:
Connect RML to product strategy and P&L reality.
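Rough cost tracking needs nothing more than a sum. The sketch below assumes three inputs per incident (refunds, coupons, support hours) and a flat hourly rate for support time; the `Rml3Cost` shape, the `totalCost` helper, and all numbers are hypothetical.

```typescript
interface Rml3Cost {
  incident: string;
  refunds: number; // in your accounting currency
  coupons: number; // face value of goodwill coupons issued
  supportHours: number; // time spent by support on this incident
}

// Total = refunds + coupons + support hours priced at an assumed flat rate.
function totalCost(cost: Rml3Cost, hourlyRate: number): number {
  return cost.refunds + cost.coupons + cost.supportHours * hourlyRate;
}

// Hypothetical example incident.
const incorrectBilling: Rml3Cost = {
  incident: "Incorrect billing",
  refunds: 1200,
  coupons: 300,
  supportHours: 10,
};

const total = totalCost(incorrectBilling, 50);
// 1200 + 300 + 10 * 50 = 2000
```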

2.4 Legal / Compliance

Define (with engineers + business) what counts as RML-3
Align ToS/contract compensation terms with actual RML-3 operations
Join at least one postmortem for a history-grade incident

Goal:
Agree on shared language for when and how the org enters the History World.

3) A sample RML kickoff meeting agenda (90 minutes)

If you want a “lightweight organizational start,” do one kickoff meeting.

Agenda (90 minutes)

Intro (10 min)

what RML is, in one slide
what you’ll call it internally (RML is fine)

Case sharing (20 min)

pick 1–2 recent incidents or near-misses
draw a timeline
mark where it shifts: “RML-2 → RML-3”

Role perspectives (20 min)

engineers: rollback/retry/compensation reality
business: customer experience + brand impact
legal: contracts/regulation + disclosure obligations

Draft “our org’s RML rules” (30 min)

list default RML-3 cases
define a “when unsure, do X” principle
pick the Playbook owner for RML-3 incidents

Next step (10 min)

choose one service/feature to start labeling
choose one incident type to apply the case-file template to

The trick is not perfection.

Treat it as a meeting to decide the minimum that lets you move tomorrow.

4) FAQ (the questions teams actually ask)

Q1) “Our domain is basically all RML-3… doesn’t that make RML useless?”

In finance/health/public infrastructure, it can feel like “everything is History World.”

Even then, the split still helps if you read “world” in two ways:

World as a processing layer
World as a responsibility boundary

Example:

“Showing appointment candidates” → RML-1 (proposal-only)
“Reserving a slot” → RML-2 (adjustable via operations)
“Medical records / billing / official notes” → RML-3 (legal history)

Even in an RML-3 domain, you usually still have RML-1/2 layers inside it.

Q2) “Isn’t mislabeling RML risky in itself?”

Early on, you will mislabel. That’s normal.

Treat labels as:

Best current hypotheses, updateable over time.

What’s riskier is the default state:

“We don’t know which world we’re in, so we guess mid-incident.”

It’s safer to label, learn, and refine than to operate without a shared map.

Q3) “We’re a small startup—we can’t do all this.”

Small orgs often need RML more, not less—because you can’t afford repeated history-grade mistakes.

Start with minimal scoping:

“Payments are RML-3 and we treat them seriously.”
“Everything else stays RML-2 unless proven otherwise.”

Early boundaries make later growth easier.
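That scoping rule is small enough to write down as code. A minimal sketch, assuming features are grouped by an area name; `RML3_AREAS` and `defaultWorld` are hypothetical names, and the list starts with payments only because that is the example above.

```typescript
type RmlWorld = "RML1" | "RML2" | "RML3";

// Areas the team has explicitly agreed to treat as RML-3. Grow this list deliberately.
const RML3_AREAS: ReadonlyArray<string> = ["payments"];

// Everything not on the list stays RML-2 until someone proves otherwise.
function defaultWorld(area: string): RmlWorld {
  return RML3_AREAS.indexOf(area) >= 0 ? "RML3" : "RML2";
}

const paymentsWorld = defaultWorld("payments"); // "RML3"
const searchWorld = defaultWorld("search"); // "RML2"
```

Promoting an area later is then a one-line change that shows up in code review, which keeps the decision visible.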

5) A small homework assignment

One small exercise to make this “your team’s book”:

Write down three RML-3-ish events in your environment.

Past incidents or imagined “worst days” are both fine.

For each, write:

what happened,
who was affected,
who owned responsibility,
what “history residue” remains.

Those three items become page one of:

Your team’s version of The Worlds of Distributed Systems.

That’s the end of the main content of this book—for now.

Epilogue — Toward Engineering with a Worldview

From here, you can:

remove chapters,
rename terms,
or refactor the entire worldview into your own.

If it helps you build systems that don’t just “work most of the time,” but are designed for trust, then the map did its job.