kanaria007

Posted on Mar 16 • Originally published at zenn.dev

Chapter 10 — RML as Product Strategy: Designing Trust

#distributedsystems #productmanagement #sre #architecture

The Worlds of Distributed Systems — Chapter 10

“Will this feature work reliably?”
“Yes—but it depends on which world we’re willing to own.”

So far in this series:

Chapters 1–4 drew the map of the three worlds,
Chapters 5–7 drilled into RML-2 (Dialog World) patterns (failures, sagas, APIs),
Chapters 8–9 focused on RML-3 (History World) autonomy and case files.

In Chapter 10, we zoom out and place RML at the level that decides everything downstream:

RML = a “trust world map” for distributed systems.

Not just an engineering model—but a product strategy tool.

1) Use RML as a world map (not a “maturity level”)

RML is often read as “how mature your rollback is.”

But strategically, it’s more useful as:

A map of how far your product takes responsibility.

A compact restatement:

RML-1 — Closed World
Inside one room: tests, dry runs, simulations.
“A world where you can try freely because you can restart safely.”
RML-2 — Dialog World
Cross-service recovery: retries, compensation, reconciliation.
“A world where you can recover through conversation.”
RML-3 — History World
Money, law, social trust, irreversible records.
“A world where you correct forward, and you must be accountable.”

Now the strategic question becomes:

“Which world does this product step into?”
“Which world do we intentionally not step into (and delegate to others)?”

Once you answer that, these decisions become far easier to align:

feature scope and granularity
SLA / SLO promises
Terms of Service (ToS) and compensation boundaries
incident response structure and cross-functional ownership

2) Add an “RML column” to your product backlog

In Chapter 7, we put RML into APIs.
At the strategy level, an even earlier move works surprisingly well:

Add an RML column to your feature backlog.

2.1 A simple table

| Feature | Description | RML | Notes |
| --- | --- | --- | --- |
| AI inference simulation | Show results only; don't persist | 1 | reads from production only; no writes |
| Internal review status update | Internal users; DB updates | 2 | reversible via ops |
| Credit card payments | Payment gateway + customer assets | 3 | refunds required |
| Tax-grade PDF export | May be submitted to authorities | 3 | versioning + correction log |

Just adding this column changes the review conversation:

“Is this really RML-2, or is it actually RML-3?”
“If it’s RML-3, where is the refund/correction/explanation path?”
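
As a rough illustration, the RML column and the second review question can be modeled in a few lines of Python. This is only a sketch: `RmlWorld`, `BacklogItem`, and the `correction_path` field are hypothetical names invented here, not part of any tool mentioned in this series.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class RmlWorld(IntEnum):
    CLOSED = 1   # RML-1: tests, dry runs, simulations
    DIALOG = 2   # RML-2: retries, compensation, reconciliation
    HISTORY = 3  # RML-3: money, law, irreversible records


@dataclass
class BacklogItem:
    feature: str
    rml: RmlWorld
    # Hypothetical field: where the refund/correction/explanation path is documented.
    correction_path: str = ""


def review_questions(items: List[BacklogItem]) -> List[str]:
    """Return a review prompt for every RML-3 item that has no correction path yet."""
    return [
        f"{item.feature}: where is the refund/correction/explanation path?"
        for item in items
        if item.rml == RmlWorld.HISTORY and not item.correction_path
    ]


backlog = [
    BacklogItem("AI inference simulation", RmlWorld.CLOSED),
    BacklogItem("Internal review status update", RmlWorld.DIALOG),
    BacklogItem("Credit card payments", RmlWorld.HISTORY, "refund runbook"),
    BacklogItem("Tax-grade PDF export", RmlWorld.HISTORY),
]
print(review_questions(backlog))
```

With this sample backlog, only the PDF export is flagged, because it is labeled RML-3 but has no correction path recorded.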

2.2 Feature “promotion” as a roadmap concept

You don’t need to build everything as RML-3 on day one.
A natural evolution looks like:

Prototype: RML-1 (simulation only)
Alpha: RML-2 (internal use / limited customers)
GA: promote some flows into RML-3

If you make this an explicit roadmap question:

“Do we promote this feature to RML-3 next quarter?”

…then planning becomes concrete:

when Legal must be involved
when sagas and idempotency must be redesigned properly
when you need an Effect Ledger and case files
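
One way to make the promotion question concrete is a small readiness check. The sketch below assumes the three planning items above plus a case file template are the gate; the class, field names, and blocker labels are hypothetical, and a real gate would be whatever your organization agrees on.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromotionReadiness:
    # All field names are illustrative assumptions.
    legal_reviewed: bool
    sagas_idempotent: bool
    effect_ledger_ready: bool
    case_file_ready: bool


def promotion_blockers(r: PromotionReadiness) -> List[str]:
    """List what still blocks promoting a feature from RML-2 to RML-3."""
    checks = [
        (r.legal_reviewed, "Legal review"),
        (r.sagas_idempotent, "saga and idempotency redesign"),
        (r.effect_ledger_ready, "Effect Ledger"),
        (r.case_file_ready, "case file template"),
    ]
    return [name for ok, name in checks if not ok]


print(promotion_blockers(PromotionReadiness(True, False, True, False)))
```

An empty list means the feature is ready to be discussed for promotion; anything else is the work to schedule before next quarter.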

3) RML and metrics: measuring “trust” by world

Product strategy eventually has to be expressed as metrics.

With RML, you can separate “health” into three layers.

3.1 RML-1 metrics (Closed World)

Focus: Are we able to try safely?

test coverage / number of simulations
dry-run execution count and failure patterns
bugs discovered in RML-1 before production rollout

3.2 RML-2 metrics (Dialog World)

Focus: How much recovery can we automate through conversation?

saga success rate / compensation activation rate
RML-2 error counts (world = RML2) by code
auto-retry resolution rate vs human ops intervention rate
% of critical external calls with idempotency keys
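
The rates above can be derived from plain counters. The following is a minimal sketch; the counter names and the exact definitions (for example, what counts as a "succeeded" saga) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class DialogCounters:
    sagas_started: int
    sagas_succeeded: int         # assumed: completed without compensation
    sagas_compensated: int       # assumed: at least one compensation step ran
    errors_auto_resolved: int    # resolved by automatic retry
    errors_human_resolved: int   # needed human ops intervention
    critical_calls: int          # critical external calls observed
    critical_calls_with_key: int # ...of which carried an idempotency key


def ratio(numerator: int, denominator: int) -> float:
    """Safe division: an empty denominator yields 0.0 instead of raising."""
    return numerator / denominator if denominator else 0.0


def dialog_metrics(c: DialogCounters) -> Dict[str, float]:
    errors = c.errors_auto_resolved + c.errors_human_resolved
    return {
        "saga_success_rate": ratio(c.sagas_succeeded, c.sagas_started),
        "compensation_activation_rate": ratio(c.sagas_compensated, c.sagas_started),
        "auto_retry_resolution_rate": ratio(c.errors_auto_resolved, errors),
        "human_intervention_rate": ratio(c.errors_human_resolved, errors),
        "idempotency_key_coverage": ratio(c.critical_calls_with_key, c.critical_calls),
    }


sample = DialogCounters(200, 190, 8, 30, 10, 50, 40)
print(dialog_metrics(sample))
```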

3.3 RML-3 metrics (History World)

Focus: How often do we generate irreversible trust-cost events?

RML-3 incident count (quarterly)
per-incident:
- affected users
- monetary cost (refunds, credits, ops time)
time to detect, and time from detection → containment (MTTD / MTTC)
recurrence rate (same incident class reappears)
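
A hedged sketch of how these numbers might be aggregated from incident records follows. The record shape is hypothetical. MTTD is taken here as occurrence to detection and MTTC as detection to containment, and the recurrence rate is one possible definition: the share of incidents whose class had already appeared in the same period.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class Rml3Incident:
    incident_class: str   # hypothetical label, e.g. "double_charge"
    affected_users: int
    cost: float           # refunds + credits + ops time, in one currency
    occurred_at: datetime
    detected_at: datetime
    contained_at: datetime


def _mean_minutes(deltas) -> float:
    deltas = list(deltas)
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60


def history_metrics(incidents: List[Rml3Incident]) -> Dict[str, float]:
    n = len(incidents)
    if n == 0:
        return {"incident_count": 0}
    distinct_classes = len({i.incident_class for i in incidents})
    return {
        "incident_count": n,
        "affected_users": sum(i.affected_users for i in incidents),
        "total_cost": sum(i.cost for i in incidents),
        "mttd_minutes": _mean_minutes(i.detected_at - i.occurred_at for i in incidents),
        "mttc_minutes": _mean_minutes(i.contained_at - i.detected_at for i in incidents),
        "recurrence_rate": (n - distinct_classes) / n,
    }


def at(hour: int, minute: int) -> datetime:
    return datetime(2024, 1, 1, hour, minute)


quarter = [
    Rml3Incident("double_charge", 10, 100.0, at(10, 0), at(10, 30), at(11, 30)),
    Rml3Incident("double_charge", 5, 50.0, at(9, 0), at(9, 10), at(9, 30)),
    Rml3Incident("wrong_invoice", 1, 25.0, at(12, 0), at(12, 20), at(13, 0)),
]
print(history_metrics(quarter))
```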

At this point, metrics connect directly to:

P&L (refunds and support cost)
brand impact
trust and retention

3.4 Split dashboards by world

Even a simple dashboard split changes management discussions:

“Error rate is high but RML-3 incidents are zero” → likely a Dialog World design/ops improvement problem
“Error rate is low but RML-3 incidents are rising” → your History World governance (Ch.8–9) is the real gap
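
The two readings above can be sketched as code. Everything here is an assumption for illustration: events are plain dicts carrying an `rml.world` tag, and the 5% threshold for a "high" error rate is an arbitrary placeholder you would replace with your own.

```python
from collections import Counter
from typing import Dict, Iterable


def split_by_world(events: Iterable[Dict[str, str]]) -> Counter:
    """Count events per world tag; untagged events get their own bucket."""
    return Counter(e.get("rml.world", "untagged") for e in events)


def reading(error_rate: float, rml3_now: int, rml3_previous: int,
            high_error_rate: float = 0.05) -> str:
    """Map the two dashboard patterns to the area that likely needs attention."""
    if error_rate >= high_error_rate and rml3_now == 0:
        return "Dialog World design/ops improvement"
    if error_rate < high_error_rate and rml3_now > rml3_previous:
        return "History World governance gap"
    return "no clear signal"


events = [
    {"rml.world": "RML2", "code": "TIMEOUT"},
    {"rml.world": "RML2", "code": "CONFLICT"},
    {"rml.world": "RML3", "code": "DOUBLE_CHARGE"},
    {"code": "UNKNOWN"},
]
print(split_by_world(events))
print(reading(error_rate=0.08, rml3_now=0, rml3_previous=0))
```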

4) RML across the product lifecycle

4.1 Planning & design

put RML labels into backlog and specs
decide “how far we own the world” as part of the requirements
for RML-3 candidates, involve Legal early

4.2 Implementation

Apply Dialog World patterns (Ch.5–7):

structured errors with world/action
idempotency keys + saga discipline
RML metadata in API boundaries
tests that validate promotion conditions (“when does this become RML-3?”)
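
To illustrate the idempotency-key item, here is a minimal in-memory sketch of the receiving side: the same key never triggers the side effect twice. `IdempotentCaller` and `charge_card` are hypothetical names. In a real system the key is usually sent to the external service, which performs this deduplication in durable storage.

```python
import uuid
from typing import Callable, Dict


class IdempotentCaller:
    """Runs a side-effecting call at most once per idempotency key (in-memory only)."""

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}

    def call(self, key: str, effect: Callable[[], object]) -> object:
        if key in self._results:
            # Replay: return the stored result without running the effect again.
            return self._results[key]
        result = effect()
        self._results[key] = result
        return result


charges = []  # stands in for the external side effect


def charge_card() -> dict:
    charges.append(1000)
    return {"status": "charged", "amount": 1000}


caller = IdempotentCaller()
key = str(uuid.uuid4())                # one key per logical operation
first = caller.call(key, charge_card)
retry = caller.call(key, charge_card)  # e.g. a retry after a timeout
print(len(charges), first == retry)
```

The point of the sketch is the contract, not the storage: a retry with the same key returns the original outcome instead of charging again.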

4.3 Release & operations

observability tags: rml.world, rml.action
RML-3 alerts create incidents automatically
maintain Runbooks (RML-2) and Playbooks (RML-3)
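
A small routing sketch shows how the tags and the Runbook/Playbook split might fit together. The `Alert` and `IncidentTracker` types are hypothetical stand-ins for whatever alerting and incident tooling you use, and the tag values `"RML2"` / `"RML3"` are assumed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Alert:
    name: str
    tags: Dict[str, str]


@dataclass
class IncidentTracker:
    """Hypothetical stand-in for an incident management tool."""
    opened: List[str] = field(default_factory=list)

    def open_incident(self, title: str) -> None:
        self.opened.append(title)


def route_alert(alert: Alert, tracker: IncidentTracker) -> str:
    """RML-3 alerts open an incident and use the Playbook; RML-2 alerts use the Runbook."""
    world = alert.tags.get("rml.world")
    if world == "RML3":
        tracker.open_incident(alert.name)
        return "playbook"
    if world == "RML2":
        return "runbook"
    return "triage"  # untagged or RML-1: decide manually


tracker = IncidentTracker()
print(route_alert(Alert("double charge detected", {"rml.world": "RML3"}), tracker))
print(route_alert(Alert("saga retry storm", {"rml.world": "RML2"}), tracker))
print(tracker.opened)
```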

4.4 Sunset & replacement

RML matters even when you shut things down:

retiring an RML-3 feature:
- what happens to the history (records, ledgers, retention)?
migrating systems:
- how do you shift RML-2 dialogs (APIs/sagas) safely?
- how do you share RML-3 responsibility during the transition?

A useful sentence for planning:

“Sunsetting a feature means withdrawing from a world.”

That changes the granularity of your sunset plan.

5) The minimum viable adoption set

If this feels heavy, you can adopt RML in stages.

Step 1: adopt the vocabulary only

say “this is RML-1 / RML-3-ish” in reviews
add an RML column to the backlog
add an RML-World field to incident reports

No implementation changes needed yet.
You’re aligning the meta-language first.

Step 2: implement the minimum RML-2 patterns

add world/action to your structured error type (e.g., RmlError) (Ch.5)
add idempotency keys to critical external calls (Ch.6)
add X-RML-World / X-RML-Action to API responses (Ch.7)

Now you can:

tag telemetry by world
alert on RML-3-ish signals
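
The Step 2 items can be sketched together as follows. Only the names `RmlError`, `X-RML-World`, `X-RML-Action`, `rml.world`, and `rml.action` come from this series; the constructor shape and the example values (`"RML3"`, `"escalate"`) are assumptions, since the actual definitions live in the earlier chapters.

```python
from typing import Dict


class RmlError(Exception):
    """Structured error carrying the world it belongs to and a suggested action."""

    def __init__(self, message: str, world: str, action: str) -> None:
        super().__init__(message)
        self.world = world    # assumed values: "RML1", "RML2", "RML3"
        self.action = action  # assumed values: "retry", "compensate", "escalate"


def rml_headers(err: RmlError) -> Dict[str, str]:
    """Expose the error's RML metadata on an API response."""
    return {"X-RML-World": err.world, "X-RML-Action": err.action}


def telemetry_tags(err: RmlError) -> Dict[str, str]:
    """Same metadata as observability tags, so dashboards can split by world."""
    return {"rml.world": err.world, "rml.action": err.action}


def looks_like_rml3(err: RmlError) -> bool:
    """A simple alerting predicate for RML-3-ish signals."""
    return err.world == "RML3"


err = RmlError("ledger write after refund window", world="RML3", action="escalate")
print(rml_headers(err))
print(telemetry_tags(err), looks_like_rml3(err))
```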

Step 3: use case files only for confirmed RML-3 incidents

if the org agrees “this is RML-3,” then always write the Ch.9 case file.

This gives you visibility into:

history-grade costs (refunds, credits)
where your bottleneck is (detect/contain/decide)
where your contracts/runbooks/playbooks are drifting
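
As one illustration of the bottleneck question, a case file timeline can be reduced to its slowest phase. The four timestamps and the ordering detect, then contain, then decide are assumptions made for this sketch; the Ch.9 template may record these differently.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CaseFileTimeline:
    # Hypothetical timestamps extracted from a case file.
    occurred_at: datetime
    detected_at: datetime
    contained_at: datetime
    decided_at: datetime   # when the correction/compensation decision was made


def bottleneck(t: CaseFileTimeline) -> str:
    """Return the phase that took the longest: 'detect', 'contain', or 'decide'."""
    phases = {
        "detect": t.detected_at - t.occurred_at,
        "contain": t.contained_at - t.detected_at,
        "decide": t.decided_at - t.contained_at,
    }
    return max(phases, key=phases.get)


timeline = CaseFileTimeline(
    occurred_at=datetime(2024, 1, 1, 10, 0),
    detected_at=datetime(2024, 1, 1, 10, 5),
    contained_at=datetime(2024, 1, 1, 12, 0),
    decided_at=datetime(2024, 1, 1, 12, 30),
)
print(bottleneck(timeline))
```

Run over several case files, the most frequent answer points at where to invest first.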

6) Strategy-level anti-patterns

6.1 Treating RML as a compliance label only

If RML becomes “audit paperwork,” it disconnects from real engineering and ops.

Fix:

always connect it to implementation patterns (Ch.5–7) and governance (Ch.8–9)

6.2 “World inflation” (everything becomes RML-3)

If everything is treated as History World:

the org becomes permanently incident-fatigued
decision loops slow to a crawl

Fix:

pre-agree on “default-to-RML-3” conditions (Ch.8)
handle the rest in RML-2, and promote only when needed
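
Pre-agreed conditions are easiest to keep honest when they are written down as a rule. The sketch below is only an example: the three RML-3 triggers are hypothetical, loosely derived from the History World description (money, law, irreversible records), and are not the actual Ch.8 conditions.

```python
from dataclasses import dataclass


@dataclass
class EffectProfile:
    # All flags are illustrative assumptions about a feature's side effects.
    moves_money: bool = False
    creates_legal_record: bool = False
    irreversible_external_effect: bool = False
    crosses_service_boundary: bool = False


def default_world(p: EffectProfile) -> int:
    """Return the default RML world: 3 only when a pre-agreed trigger fires."""
    if p.moves_money or p.creates_legal_record or p.irreversible_external_effect:
        return 3
    if p.crosses_service_boundary:
        return 2   # recoverable through retries/compensation
    return 1       # stays inside one room


print(default_world(EffectProfile(crosses_service_boundary=True)))
print(default_world(EffectProfile(moves_money=True, crosses_service_boundary=True)))
```

Anything that does not hit a trigger stays in RML-2 or RML-1 by default, which is what keeps "world inflation" in check.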

6.3 Shipping a new business without deciding which world you own

A common failure mode:

launch quickly
the History World boundary is left undefined
incidents arrive before governance exists

Fix:

require a slide in every planning review:

“Which world does this business step into, and which worlds does it avoid?”

7) Final checklist: do you have a worldview?

7.1 World mapping

[ ] Can you label major features as RML-1/2/3?
[ ] Are RML-3 features agreed with Legal and Business?
[ ] Is “promotion” (RML-2 → RML-3) part of roadmap discussions?

7.2 Implementation & ops

[ ] StructuredError / ActionHint includes world/action
[ ] critical external calls use idempotency keys
[ ] observability supports rml.world / rml.action
[ ] APIs carry RML metadata (REST/GraphQL/gRPC)

7.3 Governance & incidents

[ ] RML-3 incident definition and escalation rules are documented
[ ] Legal/Business/SRE triangle ownership (RACI) is explicit
[ ] case file template includes world + Decision Log
[ ] Runbook (RML-2) and Playbook (RML-3) exist

7.4 Business & trust

[ ] You can estimate RML-3 incident costs (roughly)
[ ] ToS compensation scope matches actual RML-3 behavior
[ ] Reducing RML-3 incidents is treated as a strategic objective

Closing — “Which world is this rollback?”

Distributed systems discussions quickly become technical:

consistency models
queues and retries
sagas and compensation
APIs, gRPC, messaging…

All of that matters.

But the high-leverage pause—the one this series keeps returning to—is:

“Which world are we talking about?”

RML-1: a world where you can try safely
RML-2: a world where you recover through dialog
RML-3: a world where you carry history and correct forward

When the world changes:

rollback changes meaning
incident weight changes
product responsibility changes

So the next time you’re about to say:

“We can roll it back,” or
“We can compensate,”

pause for one beat and ask:

“Rollback in which world—
and which world’s history are we asking whom to carry?”

That question alone tends to move a distributed system from “works most of the time” toward something closer to designed trust.

Chapter 11 — A Field Recipe for RML: Start Small, Grow It
