kanaria007

Posted on • Originally published at zenn.dev

Chapter 10 — RML as Product Strategy: Designing Trust

The Worlds of Distributed Systems — Chapter 10

“Will this feature work reliably?”
“Yes, but it depends on which world we’re willing to own.”

So far in this series:

  • Chapters 1–4 drew the map of the three worlds,
  • Chapters 5–7 drilled into RML-2 (Dialog World) patterns (failures, sagas, APIs),
  • Chapters 8–9 focused on RML-3 (History World) autonomy and case files.

In Chapter 10, we zoom out and place RML at the level that decides everything downstream:

RML = a “trust world map” for distributed systems.

Not just an engineering model—but a product strategy tool.


1) Use RML as a world map (not a “maturity level”)

RML is often read as “how mature your rollback is.”

But strategically, it’s more useful as:

A map of how far your product takes responsibility.

A compact restatement:

  • RML-1 — Closed World
    Inside one room: tests, dry runs, simulations.
    “A world where you can try freely because you can restart safely.”

  • RML-2 — Dialog World
    Cross-service recovery: retries, compensation, reconciliation.
    “A world where you can recover through conversation.”

  • RML-3 — History World
    Money, law, social trust, irreversible records.
    “A world where you correct forward, and you must be accountable.”

Now the strategic question becomes:

“Which world does this product step into?”
“Which world do we intentionally not step into (and delegate to others)?”

Once you answer that, these decisions become far easier to align:

  • feature scope and granularity
  • SLA / SLO promises
  • Terms of Service (ToS) and compensation boundaries
  • incident response structure and cross-functional ownership

2) Add an “RML column” to your product backlog

In Chapter 7, we put RML into APIs.
At the strategy level, an even earlier move works surprisingly well:

Add an RML column to your feature backlog.

2.1 A simple table

Feature                       | Description                       | RML | Notes
------------------------------|-----------------------------------|-----|----------------------------
AI inference simulation       | Show results only; don’t persist  | 1   | production reads only
Internal review status update | internal users; DB updates        | 2   | reversible via ops
Credit card payments          | payment gateway + customer assets | 3   | refunds required
Tax-grade PDF export          | may be submitted to authorities   | 3   | versioning + correction log

Just adding this column changes the review conversation:

  • “Is this really RML-2, or is it actually RML-3?”
  • “If it’s RML-3, where is the refund/correction/explanation path?”
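If your backlog lives in code or a spreadsheet export, the RML column can also drive those review questions automatically. Here is a minimal sketch of the idea; `BacklogItem`, `correction_path`, and `review_questions` are hypothetical names chosen for illustration, not part of any tool described in this series.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Rml(IntEnum):
    CLOSED = 1   # RML-1: Closed World
    DIALOG = 2   # RML-2: Dialog World
    HISTORY = 3  # RML-3: History World


@dataclass
class BacklogItem:
    feature: str
    description: str
    rml: Rml
    notes: str = ""
    # Hypothetical field: where the refund/correction/explanation path is documented.
    correction_path: Optional[str] = None


def review_questions(item: BacklogItem) -> list[str]:
    """Return the review questions the RML column should trigger for this item."""
    questions = []
    if item.rml == Rml.DIALOG:
        questions.append("Is this really RML-2, or is it actually RML-3?")
    if item.rml == Rml.HISTORY and not item.correction_path:
        questions.append("Where is the refund/correction/explanation path?")
    return questions


backlog = [
    BacklogItem("AI inference simulation", "Show results only; don't persist",
                Rml.CLOSED, "production reads only"),
    BacklogItem("Internal review status update", "internal users; DB updates",
                Rml.DIALOG, "reversible via ops"),
    BacklogItem("Credit card payments", "payment gateway + customer assets",
                Rml.HISTORY, "refunds required"),
]

for item in backlog:
    for question in review_questions(item):
        print(f"[RML-{int(item.rml)}] {item.feature}: {question}")
```

The point is not the code itself but that the label becomes something a review can query, rather than a comment buried in a ticket.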

2.2 Feature “promotion” as a roadmap concept

You don’t need to build everything as RML-3 on day one.
A natural evolution looks like:

  1. Prototype: RML-1 (simulation only)
  2. Alpha: RML-2 (internal use / limited customers)
  3. GA: promote some flows into RML-3

If you make this an explicit roadmap question:

“Do we promote this feature to RML-3 next quarter?”

…then planning becomes concrete:

  • when Legal must be involved
  • when sagas and idempotency must be redesigned properly
  • when you need an Effect Ledger and case files
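One way to make the promotion question concrete is to treat those planning items as an explicit gate. The sketch below is only an illustration: the `PromotionReadiness` flags and the `promotion_blockers` function are hypothetical names, and a real gate would likely carry owners and dates as well.

```python
from dataclasses import dataclass


@dataclass
class PromotionReadiness:
    """Hypothetical readiness flags for promoting one feature to RML-3."""
    legal_involved: bool = False
    sagas_and_idempotency_redesigned: bool = False
    effect_ledger_ready: bool = False
    case_files_ready: bool = False


def promotion_blockers(readiness: PromotionReadiness) -> list[str]:
    """List what is still missing before the feature can be promoted to RML-3."""
    checks = {
        "Legal involvement": readiness.legal_involved,
        "saga and idempotency redesign": readiness.sagas_and_idempotency_redesigned,
        "Effect Ledger": readiness.effect_ledger_ready,
        "case files": readiness.case_files_ready,
    }
    return [name for name, done in checks.items() if not done]


# Example: Legal is on board and the ledger exists, but two items remain.
payments = PromotionReadiness(legal_involved=True, effect_ledger_ready=True)
print(promotion_blockers(payments))
```

An empty list means the roadmap answer can be "yes"; anything else is the work to schedule before next quarter.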

3) RML and metrics: measuring “trust” by world

Product strategy always collapses into metrics.

With RML, you can separate “health” into three layers.

3.1 RML-1 metrics (Closed World)

Focus: Are we able to try safely?

  • test coverage / number of simulations
  • dry-run execution count and failure patterns
  • bugs discovered in RML-1 before production rollout

3.2 RML-2 metrics (Dialog World)

Focus: How much recovery can we automate through conversation?

  • saga success rate / compensation activation rate
  • RML-2 error counts (world = RML2) by code
  • auto-retry resolution rate vs human ops intervention rate
  • % of critical external calls with idempotency keys
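As a rough sketch, the RML-2 rates above can be computed from a flat log of saga runs. The record shape (`SagaRun`), the `resolved_by` labels, and the `idempotency_key` field are assumptions made for this example; your telemetry will almost certainly look different.

```python
from dataclasses import dataclass


@dataclass
class SagaRun:
    succeeded: bool            # the saga reached its intended end state
    compensated: bool          # at least one compensation step ran
    resolved_by: str = "none"  # "none", "auto_retry", or "human" (hypothetical labels)


def rml2_metrics(runs: list[SagaRun]) -> dict[str, float]:
    """Compute saga-level RML-2 rates; returns an empty dict when there is no data."""
    if not runs:
        return {}
    total = len(runs)
    needed_help = [r for r in runs if r.resolved_by != "none"]
    auto = sum(1 for r in needed_help if r.resolved_by == "auto_retry")
    human = sum(1 for r in needed_help if r.resolved_by == "human")
    helped = len(needed_help) or 1  # avoid division by zero when nothing needed help
    return {
        "saga_success_rate": sum(r.succeeded for r in runs) / total,
        "compensation_activation_rate": sum(r.compensated for r in runs) / total,
        "auto_retry_resolution_rate": auto / helped,
        "human_intervention_rate": human / helped,
    }


def idempotency_coverage(critical_calls: list[dict]) -> float:
    """Share of critical external calls that carried an idempotency key."""
    if not critical_calls:
        return 0.0
    with_key = sum(1 for call in critical_calls if call.get("idempotency_key"))
    return with_key / len(critical_calls)


runs = [
    SagaRun(succeeded=True, compensated=False),
    SagaRun(succeeded=True, compensated=False, resolved_by="auto_retry"),
    SagaRun(succeeded=False, compensated=True, resolved_by="human"),
    SagaRun(succeeded=True, compensated=False, resolved_by="auto_retry"),
]
calls = [{"idempotency_key": "k-1"}, {"idempotency_key": None}]

print(rml2_metrics(runs))
print(idempotency_coverage(calls))
```

Here the auto-retry and human rates are taken over runs that needed any help at all, which is one reasonable reading of "resolution rate vs intervention rate"; other denominators are equally defensible.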

3.3 RML-3 metrics (History World)

Focus: How often do we generate irreversible trust-cost events?

  • RML-3 incident count (quarterly)
  • per-incident:
    • affected users
    • monetary cost (refunds, credits, ops time)
  • time to detect, and time from detection → containment (MTTD / MTTC)
  • recurrence rate (same incident class reappears)
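These History World numbers are simple enough to roll up from a list of incident records. The following is a sketch under assumptions: the `Rml3Incident` fields are hypothetical, and "recurrence rate" is defined here as the share of incidents whose class had already occurred in the same period, which is only one possible definition.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Rml3Incident:
    incident_class: str   # hypothetical classification label
    affected_users: int
    cost: float           # refunds + credits + ops time, in one currency
    detected_at: datetime
    contained_at: datetime


def rml3_summary(incidents: list[Rml3Incident]) -> dict:
    """Roll up RML-3 incidents for one reporting period (e.g. a quarter)."""
    count = len(incidents)
    if count == 0:
        return {"count": 0}
    per_class = Counter(i.incident_class for i in incidents)
    repeats = sum(n - 1 for n in per_class.values())  # occurrences beyond the first
    contain_total = sum(
        (i.contained_at - i.detected_at for i in incidents), timedelta()
    )
    return {
        "count": count,
        "affected_users": sum(i.affected_users for i in incidents),
        "total_cost": sum(i.cost for i in incidents),
        "mean_time_to_contain": contain_total / count,
        "recurrence_rate": repeats / count,
    }


t0 = datetime(2024, 1, 10, 9, 0)
incidents = [
    Rml3Incident("double_charge", 12, 100.0, t0, t0 + timedelta(minutes=30)),
    Rml3Incident("double_charge", 40, 200.0, t0, t0 + timedelta(minutes=90)),
    Rml3Incident("wrong_invoice", 3, 50.0, t0, t0 + timedelta(minutes=60)),
]
summary = rml3_summary(incidents)
print(summary)
```

Time to detect is left out because it needs a "started at" timestamp, which many incident records do not have reliably.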

At this point, metrics connect directly to:

  • P&L (refunds and support cost)
  • brand impact
  • trust and retention

3.4 Split dashboards by world

Even a simple dashboard split changes management discussions:

  • “Error rate is high but RML-3 incidents are zero” → likely a Dialog World design/ops improvement problem
  • “Error rate is low but RML-3 incidents are rising” → your History World governance (Ch.8–9) is the real gap
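The two readings above can be written down as a tiny triage rule. This is a deliberately crude sketch: the `triage` function, the 5% threshold, and the "rising means more than last period" rule are all assumptions for illustration, not recommended values.

```python
def triage(error_rate: float, rml3_now: int, rml3_prev: int,
           high_error_rate: float = 0.05) -> str:
    """Suggest where to look first, given per-world dashboard numbers."""
    rml3_rising = rml3_now > rml3_prev
    if error_rate >= high_error_rate and rml3_now == 0:
        return "dialog-world: improve RML-2 design/ops"
    if error_rate < high_error_rate and rml3_rising:
        return "history-world: close the RML-3 governance gap"
    return "mixed: review both dashboards"


print(triage(error_rate=0.08, rml3_now=0, rml3_prev=0))
print(triage(error_rate=0.01, rml3_now=3, rml3_prev=1))
```

A rule this simple will not survive contact with real data, but it makes the separation of the two dashboards explicit.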

4) RML across the product lifecycle

4.1 Planning & design

  • put RML labels into backlog and specs
  • decide “how far we own the world” as part of the requirements
  • for RML-3 candidates, involve Legal early

4.2 Implementation

Apply Dialog World patterns (Ch.5–7):

  • structured errors with world/action
  • idempotency keys + saga discipline
  • RML metadata in API boundaries
  • tests that validate promotion conditions (“when does this become RML-3?”)

4.3 Release & operations

  • observability tags: rml.world, rml.action
  • RML-3 alerts create incidents automatically
  • maintain Runbooks (RML-2) and Playbooks (RML-3)
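A minimal sketch of the first two bullets: every event carries `rml.world` and `rml.action` tags, and an RML-3 tag opens an incident automatically. `IncidentTracker` and `record_event` are hypothetical in-memory stand-ins for whatever telemetry and incident tooling you actually run, and the tag values ("RML2", "RML3", "retry", "escalate") are assumed for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IncidentTracker:
    """Stand-in for a real incident system; it only collects incidents in memory."""
    incidents: list = field(default_factory=list)

    def open(self, title: str, tags: dict) -> dict:
        incident = {"id": len(self.incidents) + 1, "title": title, "tags": dict(tags)}
        self.incidents.append(incident)
        return incident


def record_event(name: str, tags: dict, telemetry: list,
                 tracker: IncidentTracker) -> Optional[dict]:
    """Store the tagged event; open an incident when it belongs to RML-3."""
    telemetry.append({"event": name, **tags})
    if tags.get("rml.world") == "RML3":
        return tracker.open(f"RML-3 signal: {name}", tags)
    return None


telemetry: list = []
tracker = IncidentTracker()

record_event("inventory.reserve.timeout",
             {"rml.world": "RML2", "rml.action": "retry"}, telemetry, tracker)
opened = record_event("payment.double_charge",
                      {"rml.world": "RML3", "rml.action": "escalate"}, telemetry, tracker)

print(len(telemetry), "events,", len(tracker.incidents), "incident(s)")
```

The RML-2 event stays in telemetry for the Runbook path, while the RML-3 event also lands in the incident queue for the Playbook path.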

4.4 Sunset & replacement

RML matters even when you shut things down:

  • retiring an RML-3 feature:
    • what happens to the history (records, ledgers, retention)?
  • migrating systems:
    • how do you shift RML-2 dialogs (APIs/sagas) safely?
    • how do you share RML-3 responsibility during the transition?

A useful sentence for planning:

“Sunsetting a feature means withdrawing from a world.”

That changes the granularity of your sunset plan.


5) The minimum viable adoption set

If this feels heavy, you can adopt RML in stages.

Step 1: adopt the vocabulary only

  • say “this is RML-1 / RML-3-ish” in reviews
  • add an RML column to backlog
  • add an RML-World field to incident reports

No implementation changes needed yet.
You’re aligning the meta-language first.

Step 2: implement the minimum RML-2 patterns

  • add world/action to your structured error type (e.g., RmlError) (Ch.5)
  • add idempotency keys to critical external calls (Ch.6)
  • add X-RML-World / X-RML-Action to API responses (Ch.7)

Now you can:

  • tag telemetry by world
  • alert on RML-3-ish signals
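The three Step 2 patterns fit in a few lines. This is a sketch, not the implementation from Ch.5–7: the constructor of `RmlError`, the `rml_headers` and `call_once` helpers, and the action values are assumptions, and the in-memory `results` dict merely simulates the side that deduplicates by idempotency key.

```python
import uuid


class RmlError(Exception):
    """Structured error that carries the world it belongs to and a suggested action."""

    def __init__(self, message: str, world: str, action: str):
        super().__init__(message)
        self.world = world    # e.g. "RML2" or "RML3"
        self.action = action  # hypothetical hint such as "retry" or "escalate"


def rml_headers(err: RmlError) -> dict[str, str]:
    """Expose the error's world/action as API response headers."""
    return {"X-RML-World": err.world, "X-RML-Action": err.action}


def call_once(key: str, fn, results: dict):
    """Run fn at most once per idempotency key; replays return the stored result."""
    if key not in results:
        results[key] = fn()
    return results[key]


charges = []  # records every real side effect, to show the retry is harmless


def charge():
    charges.append(1200)
    return {"status": "charged", "amount": 1200}


results: dict = {}
key = str(uuid.uuid4())
first = call_once(key, charge, results)
second = call_once(key, charge, results)  # retry with the same key: no second charge

err = RmlError("gateway timeout", world="RML2", action="retry")
print(rml_headers(err), len(charges))
```

With the world and action on both the error and the response, tagging telemetry and alerting by world needs no further changes to callers.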

Step 3: use case files only for confirmed RML-3 incidents

  • if the org agrees “this is RML-3,” then always write the Ch.9 case file.

This gives you visibility into:

  • history-grade costs (refunds, credits)
  • where your bottleneck is (detect/contain/decide)
  • where your contracts/runbooks/playbooks are drifting

6) Strategy-level anti-patterns

6.1 Treating RML as a compliance label only

If RML becomes “audit paperwork,” it disconnects from real engineering and ops.

Fix:

  • always connect it to implementation patterns (Ch.5–7) and governance (Ch.8–9)

6.2 “World inflation” (everything becomes RML-3)

If everything is treated as History World:

  • the org becomes permanently incident-fatigued
  • decision loops slow to a crawl

Fix:

  • pre-agree on “default-to-RML-3” conditions (Ch.8)
  • handle the rest in RML-2, and promote only when needed
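Pre-agreed conditions work best when they are written down as a rule rather than decided per incident. The sketch below assumes three conditions loosely derived from this chapter's description of History World (money, law, irreversible records); the actual list from Ch.8 is not reproduced here, and the `Effect` fields and `default_world` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Effect:
    """Hypothetical description of what an operation touched."""
    moves_money: bool = False
    legal_exposure: bool = False
    irreversible_record: bool = False


def default_world(effect: Effect) -> str:
    """Default to RML-3 only when a pre-agreed condition holds; otherwise stay in RML-2."""
    if effect.moves_money or effect.legal_exposure or effect.irreversible_record:
        return "RML3"
    return "RML2"


print(default_world(Effect()))                  # a plain cross-service failure
print(default_world(Effect(moves_money=True)))  # a customer was charged
```

Everything that returns "RML2" stays with runbooks; promotion to RML-3 then becomes a deliberate decision instead of the default reaction.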

6.3 Shipping a new business without deciding which world you own

A common failure mode:

  • launch quickly
  • the History World boundary is left undefined
  • incidents arrive before governance exists

Fix:

  • require a slide in every planning review:

“Which world does this business step into, and which worlds does it avoid?”


7) Final checklist: do you have a worldview?

7.1 World mapping

  • [ ] Can you label major features as RML-1/2/3?
  • [ ] Are RML-3 features agreed with Legal and Business?
  • [ ] Is “promotion” (RML-2 → RML-3) part of roadmap discussions?

7.2 Implementation & ops

  • [ ] StructuredError / ActionHint includes world/action
  • [ ] critical external calls use idempotency keys
  • [ ] observability supports rml.world / rml.action
  • [ ] APIs carry RML metadata (REST/GraphQL/gRPC)

7.3 Governance & incidents

  • [ ] RML-3 incident definition and escalation rules are documented
  • [ ] Legal/Business/SRE triangle ownership (RACI) is explicit
  • [ ] case file template includes world + Decision Log
  • [ ] Runbook (RML-2) and Playbook (RML-3) exist

7.4 Business & trust

  • [ ] You can estimate RML-3 incident costs (roughly)
  • [ ] ToS compensation scope matches actual RML-3 behavior
  • [ ] Reducing RML-3 incidents is treated as a strategic objective

Closing — “Which world is this rollback?”

Distributed systems discussions quickly become technical:

  • consistency models
  • queues and retries
  • sagas and compensation
  • APIs, gRPC, messaging…

All of that matters.

But the high-leverage pause—the one this series keeps returning to—is:

“Which world are we talking about?”

  • RML-1: a world where you can try safely
  • RML-2: a world where you recover through dialog
  • RML-3: a world where you carry history and correct forward

When the world changes:

  • rollback changes meaning
  • incident weight changes
  • product responsibility changes

So the next time you’re about to say:

  • “We can roll it back,” or
  • “We can compensate,”

pause for one beat and ask:

“Rollback in which world—
and which world’s history are we asking whom to carry?”

That question alone tends to upgrade a distributed system from “works most of the time” into something closer to designed trust.

Next:

Chapter 11 — A Field Recipe for RML: Start Small, Grow It
