The Worlds of Distributed Systems — Chapter 8
“Technically, we can fix it.
Legally, it’s already considered ‘done.’”
The moment you enter RML-3 (History World), engineering alone is no longer sufficient.
Suddenly, these roles show up at the table—at the same time:
- Legal / Compliance
- Business (Product Owner / Leadership)
- SRE / Platform / Engineering
And the conversation shifts into questions like:
- “Is this an incident or just a bug?”
- “Do we need to apologize? Refund? What does the contract require?”
- “How much do we disclose, and to whom?”
- “What does ‘prevention’ mean here, and how do we prove it?”
This chapter reframes RML-3 as:
Not a technical problem, but an autonomy problem.
In other words: how an organization governs History World decisions—who decides what, at what level, and with what responsibilities.
1) Why RML-3 becomes organizational design
In RML-1 / RML-2, the main actor is usually engineering:
- RML-1: process-local rollback, tests, dry-runs
- RML-2: sagas, compensation, backoff & retry, reconciliation
Many problems can be solved inside an engineering team.
But in RML-3, that collapses quickly. The moment you touch:
- movement of money
- information that already reached customers
- regulation, law, contracts, audits
- security/privacy obligations
…you can’t “make it as-if-it-never-happened” by technical rollback.
From here on, History World must be operated as a shared domain:
- Legal: rules, contracts, liability, privacy, reputation constraints
- Business: customer relationship, brand behavior, cost and trade-offs
- SRE/Engineering: facts, causality, technical limits, prevention measures
So:
RML-3 autonomy = how these three parties split decision-making and accountability for history-grade events.
2) The triangle model of the History World
Here’s the simplest map that keeps teams sane:
              Legal / Compliance
                      ▲
                      │ (rules / contracts / social responsibility)
                      │
    Business ◀────────┼────────▶ SRE / Engineering
(customer / P&L)      │      (facts / system reality)
Each vertex brings a different “truth”:
- Legal optimizes for constraint satisfaction (what you must/must not do)
- Business optimizes for relationship and viability (brand, customer trust, cost)
- SRE/Engineering optimizes for reality alignment (what happened, why, what can be changed)
When a History World incident happens, you’re effectively negotiating:
- what level to record it at
- who must be informed
- what remediation is acceptable
- how public communication should look
- what prevention commitments are real vs performative
That negotiation is not a bug. It is History World governance.
3) Lifecycle of an RML-3 incident
Most history-grade events follow a repeatable lifecycle:
- Detection
- Initial triage
- Impact assessment
- Remediation
- Record & communication
- Prevention
3.1 Detection often starts from RML-2 escalation
If you adopted Chapters 5–7, you already have the clean “entry point”:
- application code throws an error labeled like `world: "RML3", action: "escalate-history"`
- observability/gateways/incident tooling detect and route it
Key principle:
The moment you enter History World should be explicit in code and telemetry—not discovered later by humans.
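To make the principle concrete, here is a minimal sketch of an explicit escalation. Only the two labels come from the text; `HistoryEscalation`, `IncidentRouter`, and `charge` are hypothetical names, and the router is an in-memory stand-in for whatever incident tooling you actually run.

```python
from dataclasses import dataclass, field


class HistoryEscalation(Exception):
    """Raised when code knows it has crossed into History World (RML-3)."""

    def __init__(self, message: str, **context: object) -> None:
        super().__init__(message)
        # Labels copied from the text; everything else here is hypothetical.
        self.labels = {"world": "RML3", "action": "escalate-history"}
        self.context = context


@dataclass
class IncidentRouter:
    """In-memory stand-in for observability / incident tooling."""

    opened: list = field(default_factory=list)

    def handle(self, exc: Exception) -> bool:
        """Open an incident only if the error carries the RML-3 label."""
        labels = getattr(exc, "labels", {})
        if labels.get("world") != "RML3":
            return False
        context = getattr(exc, "context", {})
        self.opened.append({"summary": str(exc), **labels, **context})
        return True


def charge(charge_id: str, already_settled: bool) -> str:
    if already_settled:
        # Money has moved, so compensation alone cannot undo this.
        raise HistoryEscalation("duplicate charge settled", charge_id=charge_id)
    return "ok"


router = IncidentRouter()
try:
    charge("ch_123", already_settled=True)
except HistoryEscalation as exc:
    routed = router.handle(exc)
```

The design choice worth noting is that the label travels with the error itself, so routing does not depend on a human reading a stack trace and recognizing the severity.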
3.2 Initial triage: is it really RML-3?
This phase separates:
- “It looked scary but self-corrected in RML-2” vs
- “This is truly a History World event”
Examples:
- double charge occurred and money actually moved → RML-3 confirmed
- double charge appeared but one side auto-canceled before settlement → could be RML-2 bug / UX issue
Practical rule: when uncertain, treat gray cases as RML-3 until proven otherwise. It’s safer to downgrade than to underreact.
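The practical rule can be written down as a small function. This is a sketch under one assumption: triage facts are expressed as "irreversible external effect" flags with three states (confirmed, ruled out, unknown). The function name and the `money_settled` key are illustrative only.

```python
from typing import Mapping, Optional


def triage(irreversible_effects: Mapping[str, Optional[bool]]) -> str:
    """Classify an event from facts about irreversible external effects.

    True = confirmed, False = ruled out, None = not yet known.
    """
    values = list(irreversible_effects.values())
    if any(v is True for v in values):
        return "RML3"  # confirmed history-grade effect
    if not values or any(v is None for v in values):
        return "RML3"  # gray case: assume RML-3 until proven otherwise
    return "RML2"      # every effect has been ruled out


confirmed = triage({"money_settled": True})   # double charge, money moved
gray = triage({"money_settled": None})        # settlement status unknown
contained = triage({"money_settled": False})  # auto-canceled before settlement
```

A downgrade to RML-2 only happens when every fact has been checked and ruled out, which matches the "safer to downgrade than to underreact" bias.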
3.3 Impact assessment: how “heavy” is the history?
Three axes help:
- Scope: affected users / transactions
- Asset: money, PII, legal exposure, operational integrity
- Time: how long the impact persisted
Example intuition:
- 1 user, small mistaken charge → small RML-3 (CS-led)
- hundreds of invoices mis-sent → medium RML-3 (Business + SRE-led)
- potential confidential leak → large RML-3 (Legal + leadership-level)
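One way to keep the three axes from staying pure intuition is to encode them. The sketch below reproduces the three examples above; the thresholds (100 affected, 24 hours) and the asset names are placeholder assumptions, not values from this chapter, and should be agreed with Legal and Business.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    affected: int          # Scope: users or transactions touched
    asset: str             # Asset: "money", "pii", "confidential", "operational"
    duration_hours: float  # Time: how long the impact persisted


def weigh(impact: Impact) -> tuple[str, str]:
    """Return (size, suggested lead). Thresholds are placeholders."""
    if impact.asset == "confidential":
        return ("large", "Legal + leadership")
    if impact.affected >= 100 or impact.duration_hours >= 24:
        return ("medium", "Business + SRE")
    return ("small", "CS")


one_mistaken_charge = weigh(Impact(affected=1, asset="money", duration_hours=2.0))
missent_invoices = weigh(Impact(affected=300, asset="money", duration_hours=6.0))
possible_leak = weigh(Impact(affected=1, asset="confidential", duration_hours=0.5))
```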
4) Escalation rules: when to treat it as RML-3 by default
Teams lose time when every incident starts with:
“Is this really RML-3?”
So create shared rules.
4.1 Default-to-RML-3 cases
Treat these as RML-3 unless you can prove otherwise:
- Financial discrepancies
  - double charges, overcharges, missing refunds
- Externally visible misdelivery
  - sending data to the wrong party (email/notifications/statements)
- Potential regulatory/contract breach
  - missed deadlines, deletion/retention obligations, audit failures
- Security incident suspicion
  - authz bugs, privilege escalation, data exposure risk
Common thread:
You have an obligation to explain later—what happened, to whom, and what you did about it.
That’s History World.
4.2 Cases that may stay in RML-2
Possibly RML-2-contained:
- temporary sync failures that later reconciled correctly
- issues resolved entirely by retries/compensation with no external visibility
- staging accidents with no production data or external effects
Even here, the “may” depends on shared criteria agreed with Legal/Business—not engineering vibes.
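The shared rules in 4.1 and 4.2 can live as data that all three parties review, rather than as tribal knowledge. A minimal sketch, assuming hypothetical category names that paraphrase the lists above:

```python
# Category names are illustrative paraphrases of sections 4.1 and 4.2.
DEFAULT_RML3 = frozenset({
    "financial_discrepancy",
    "external_misdelivery",
    "regulatory_or_contract_breach",
    "security_suspicion",
})

MAY_STAY_RML2 = frozenset({
    "sync_failure_reconciled",
    "resolved_by_retry_or_compensation",
    "staging_only",
})


def starting_world(category: str, externally_visible: bool) -> str:
    """Pick the world an incident starts in before any human triage."""
    if category in DEFAULT_RML3:
        return "RML3"
    if category in MAY_STAY_RML2 and not externally_visible:
        return "RML2"
    # Unrecognized categories, or "may stay" cases that turned out to be
    # visible outside the system, start as RML-3.
    return "RML3"
```

Keeping the sets in one reviewed file gives Legal and Business a concrete place to agree or object, which is the point of "shared criteria."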
5) A usable RACI table for RML-3
A lightweight RACI (Responsible / Accountable / Consulted / Informed) makes autonomy explicit.
| Phase | Legal | Business (PO/Leadership) | SRE/Engineering | CS/Support |
|---|---|---|---|---|
| Detection | I | I | R/A | I |
| Initial triage | C | C | R/A | I |
| Impact assessment | C/A | C/A | R | C |
| Remediation policy decision | A | A | C | C |
| Remediation execution (technical) | I | I | R/A | C |
| Remediation execution (customer) | C | A | I | R/A |
| Record & reporting | A | C | R | C |
| Prevention measures | C | A | R | C |
This does not need to be “perfect.”
What matters is that, in an incident, nobody is guessing:
- who owns facts and technical limits
- who owns external responsibility and contractual constraints
- who owns customer-facing behavior and cost trade-offs
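If the RACI is stored as data, "nobody is guessing" becomes something you can query and lint. The sketch below transcribes the table above as-is; the helper names `roles_with` and `phases_missing` are hypothetical.

```python
ROLES = ("Legal", "Business", "SRE/Engineering", "CS/Support")

# Cells transcribed from the RACI table, in ROLES order.
RACI = {
    "Detection": ("I", "I", "R/A", "I"),
    "Initial triage": ("C", "C", "R/A", "I"),
    "Impact assessment": ("C/A", "C/A", "R", "C"),
    "Remediation policy decision": ("A", "A", "C", "C"),
    "Remediation execution (technical)": ("I", "I", "R/A", "C"),
    "Remediation execution (customer)": ("C", "A", "I", "R/A"),
    "Record & reporting": ("A", "C", "R", "C"),
    "Prevention measures": ("C", "A", "R", "C"),
}


def roles_with(phase: str, letter: str) -> list[str]:
    """Roles holding the given RACI letter in a phase."""
    return [
        role
        for role, cell in zip(ROLES, RACI[phase])
        if letter in cell.split("/")
    ]


def phases_missing(letter: str) -> list[str]:
    """Phases where no role holds the given letter."""
    return [phase for phase in RACI if not roles_with(phase, letter)]


policy_owners = roles_with("Remediation policy decision", "A")
no_accountable = phases_missing("A")
no_responsible = phases_missing("R")
```

Run against the table as written, every phase has an accountable party, while the policy-decision row names accountable owners but no "R". Whether that is intentional is exactly the kind of question this check is meant to surface before an incident.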
6) The information infrastructure that makes autonomy real
Governance without records becomes “trust me bro.”
6.1 Effect Ledger + Incident Report + Decision Log
To operate History World, you need artifacts that survive time:
- Effect Ledger: what external effects occurred (charges, notices, statements, access grants)
- Incident Report: summary, scope, root cause (technical + process), remediation, prevention
- Decision Log: who decided what, when, based on what information/policy
History World autonomy includes:
Not only “we can explain what happened,”
but “we can explain who decided the response, and why.”
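The three artifacts map naturally onto three record types. This is a sketch of one possible shape, not a schema from the earlier chapters: the field names follow the bullets above, and the sample values are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Effect:
    """One Effect Ledger entry: something that happened outside the system."""
    kind: str       # e.g. "charge", "notice", "statement", "access_grant"
    subject: str
    occurred_at: datetime


@dataclass(frozen=True)
class Decision:
    """One Decision Log entry: who decided what, when, and on what basis."""
    decided_by: str
    decision: str
    basis: str
    decided_at: datetime


@dataclass
class IncidentReport:
    incident_id: str
    summary: str
    scope: str
    root_cause: str
    remediation: str
    prevention: str
    effects: list[Effect] = field(default_factory=list)
    decisions: list[Decision] = field(default_factory=list)

    def explain(self) -> list[str]:
        """What happened, then who decided the response and why."""
        what = [f"effect: {e.kind} on {e.subject}" for e in self.effects]
        who = [
            f"decision: {d.decision} by {d.decided_by} (basis: {d.basis})"
            for d in self.decisions
        ]
        return what + who


t0 = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)  # sample timestamp
report = IncidentReport(
    "INC-1", "double charge", "1 user", "retry without idempotency key",
    "refund issued", "add idempotency key",
    effects=[Effect("charge", "user-42", t0)],
    decisions=[Decision("Business", "full refund", "refund policy", t0)],
)
lines = report.explain()
```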
6.2 Reduce “shadow operations”
A common anti-pattern:
- refunds/corrections done manually in an admin UI
- no durable record of who did what
- later you can’t reconstruct the timeline confidently
Better target:
- admin actions also append to the Effect Ledger
- every manual operation is linked to an incident or ledger record
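A sketch of the "better target": the admin path refuses to run unless it can write a linked ledger entry. `EffectLedger` here is an in-memory stand-in for a durable store, and `admin_refund` is a hypothetical admin action; the actual refund call is left out.

```python
from datetime import datetime, timezone


class EffectLedger:
    """Append-only, in-memory stand-in for a durable Effect Ledger."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, *, actor: str, action: str, target: str,
               incident_id: str) -> dict:
        if not actor or not incident_id:
            raise ValueError("manual operations need an actor and an incident link")
        entry = {
            "seq": len(self._entries) + 1,
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "target": target,
            "incident_id": incident_id,
        }
        self._entries.append(entry)
        return entry

    def timeline(self, incident_id: str) -> list[dict]:
        """Entries for one incident, in the order they were appended."""
        return [e for e in self._entries if e["incident_id"] == incident_id]


def admin_refund(ledger: EffectLedger, *, actor: str, charge_id: str,
                 incident_id: str) -> dict:
    # Record first, so an unlinked refund is rejected before any money moves.
    entry = ledger.append(actor=actor, action="manual_refund",
                          target=charge_id, incident_id=incident_id)
    # The call to the payment provider would go here (out of scope).
    return entry


ledger = EffectLedger()
first = admin_refund(ledger, actor="alice", charge_id="ch_1", incident_id="INC-1")
```

With this in place, reconstructing a timeline is a query against the ledger rather than an interview with whoever was on call.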
7) Two lenses that matter in the boardroom: P&L and ToS
7.1 P&L (Profit & Loss): History World costs are real
RML-3 remediation becomes money:
- refunds
- goodwill coupons/credits
- external legal/consulting fees
- staffing/ops costs for support and comms
If you keep good History World records, you can later answer:
- “How much did RML-3 incidents cost us this year?”
- “How much was refunds vs ops vs goodwill?”
- “Which product area is repeatedly generating history-grade costs?”
That clarity improves prioritization—and makes prevention investment legible.
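Those three questions reduce to simple aggregation once each cost is recorded against an incident. A minimal sketch, assuming a hypothetical record shape with `area`, `category`, and `amount_cents` fields; the sample entries are made up purely to show the mechanics.

```python
from collections import defaultdict

# Illustrative sample only; amounts are integer cents to avoid float drift.
costs = [
    {"incident": "INC-1", "area": "billing", "category": "refund", "amount_cents": 12000},
    {"incident": "INC-1", "area": "billing", "category": "ops", "amount_cents": 30000},
    {"incident": "INC-2", "area": "notifications", "category": "goodwill", "amount_cents": 5000},
    {"incident": "INC-3", "area": "billing", "category": "refund", "amount_cents": 8000},
]


def total_by(entries: list[dict], key: str) -> dict[str, int]:
    """Sum amount_cents grouped by the given field."""
    totals: dict[str, int] = defaultdict(int)
    for entry in entries:
        totals[entry[key]] += entry["amount_cents"]
    return dict(totals)


yearly_total = sum(e["amount_cents"] for e in costs)
by_category = total_by(costs, "category")
by_area = total_by(costs, "area")
costliest_area = max(by_area, key=by_area.get)
```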
7.2 ToS / contracts: the minimum line vs the brand line
Legal contracts often define:
- what is guaranteed vs best-effort
- what compensation is required
- what is excluded/limited
This enables a clean split:
- ToS line: the legal minimum you must do
- Brand line: what Business chooses to do beyond minimum to preserve trust
Important: this is not “ToS justifies anything.”
It’s a way to make the autonomy boundary explicit.
8) Organizational anti-patterns (and fixes)
8.1 “We can roll back” as an argument against Legal
- Engineering: “We can revert the DB.”
- Legal: “But users already received the email.”
Fix: present the honest residue:
“We can roll back within RML-2 up to here.
What remains as history is: X, Y, Z.”
8.2 Treating incidents as “just bug tickets”
If RML-3 events are filed like normal bugs:
- scope/impact/refunds aren’t recorded
- later you can’t tell whether something was RML-2 or RML-3
Fix:
- manage RML-3 as a distinct incident type
- add fields like `RML world`, `external impact`, `refund required`, `external comms required`
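As a sketch of what a distinct incident type could look like, the ticket below carries those four fields and reports which ones are still unanswered for an RML-3 event. The class and method names are hypothetical, and how you enforce this in a real tracker will differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentTicket:
    title: str
    rml_world: str                                  # "RML2" or "RML3"
    external_impact: Optional[bool] = None          # None = not yet answered
    refund_required: Optional[bool] = None
    external_comms_required: Optional[bool] = None

    def unanswered(self) -> list[str]:
        """Fields that must be answered before an RML-3 ticket can close."""
        if self.rml_world != "RML3":
            return []
        answers = {
            "external_impact": self.external_impact,
            "refund_required": self.refund_required,
            "external_comms_required": self.external_comms_required,
        }
        return [name for name, value in answers.items() if value is None]


ticket = IncidentTicket("Invoices mis-sent", "RML3", external_impact=True)
open_questions = ticket.unanswered()
```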
8.3 “Someone will handle it” culture
If SRE ends up doing everything (triage, reporting, prevention) while Legal/Business only rubber-stamp later, autonomy becomes fragile.
Fix:
- enforce the RACI in the postmortem process
- require cross-functional attendance for history-grade incidents
9) Practical checklist (History World autonomy)
Rules & process
- [ ] Do we have a shared definition of “RML-3 incident”?
- [ ] Are escalation conditions documented and agreed cross-functionally?
- [ ] Is the incident lifecycle owned via an explicit RACI?
Information infrastructure
- [ ] Do we have templates for Effect Ledger + Incident Report + Decision Log?
- [ ] Are manual/admin operations recorded and linked to history artifacts?
- [ ] Do `world=RML3` signals trigger incident creation and routing?
Culture
- [ ] Do Legal/Business/SRE attend postmortems for RML-3 events?
- [ ] Do we talk in “which world, how far rollback goes,” not “we can roll back”?
- [ ] Do we treat RML-3 incidents as improvement inputs, not shame events?
Closing — whose world is History World?
RML-3 cannot be operated by:
- engineering alone
- legal alone
- business alone
History World is the organization’s autonomy domain.
The question is not whether you’ll face RML-3 incidents. You will.
The question is whether, when it happens, your organization can:
- identify it cleanly
- decide responsibly
- remediate coherently
- explain credibly
- improve deterministically
Next chapter:
Chapter 9 — RML-3 Case Files: Aligning your incident-response worldview