kanaria007

Posted on Mar 9 • Originally published at zenn.dev

Chapter 8 — Autonomy in the History World: The Legal–Business–SRE Triangle

#distributedsystems #sre #architecture #ai

The Worlds of Distributed Systems — Chapter 8

“Technically, we can fix it.
Legally, it’s already considered ‘done.’”

The moment you enter RML-3 (History World), engineering alone is no longer sufficient.

Suddenly, these roles show up at the table—at the same time:

Legal / Compliance
Business (Product Owner / Leadership)
SRE / Platform / Engineering

And the conversation shifts into questions like:

“Is this an incident or just a bug?”
“Do we need to apologize? Refund? What does the contract require?”
“How much do we disclose, and to whom?”
“What does ‘prevention’ mean here, and how do we prove it?”

This chapter reframes RML-3 as:

Not a technical problem, but an autonomy problem.

In other words: how an organization governs History World decisions—who decides what, at what level, and with what responsibilities.

1) Why RML-3 becomes organizational design

In RML-1 / RML-2, the main actor is usually engineering:

RML-1: process-local rollback, tests, dry-runs
RML-2: sagas, compensation, backoff & retry, reconciliation

At these levels, many problems can be solved inside the engineering team.

But in RML-3, that collapses quickly. The moment you touch:

movement of money
information that already reached customers
regulation, law, contracts, audits
security/privacy obligations

…you can’t “make it as-if-it-never-happened” by technical rollback.

From here on, History World must be operated as a shared domain:

Legal: rules, contracts, liability, privacy, reputation constraints
Business: customer relationship, brand behavior, cost and trade-offs
SRE/Engineering: facts, causality, technical limits, prevention measures

So:

RML-3 autonomy = how these three parties split decision-making and accountability for history-grade events.

2) The triangle model of the History World

Here’s the simplest map that keeps teams sane:

            Legal / Compliance
                  ▲
                  │  (rules / contracts / social responsibility)
                  │
Business ◀────────┼────────▶ SRE / Engineering
(customer / P&L)  │          (facts / system reality)

Each vertex brings a different “truth”:

Legal optimizes for constraint satisfaction (what you must/must not do)
Business optimizes for relationship and viability (brand, customer trust, cost)
SRE/Engineering optimizes for reality alignment (what happened, why, what can be changed)

When a History World incident happens, you’re effectively negotiating:

what level to record it at
who must be informed
what remediation is acceptable
how public communication should look
what prevention commitments are real vs performative

That negotiation is not a bug. It is History World governance.

3) Lifecycle of an RML-3 incident

Most history-grade events follow a repeatable lifecycle:

Detection
Initial triage
Impact assessment
Remediation
Record & communication
Prevention

3.1 Detection often starts from RML-2 escalation

If you adopted Chapters 5–7, you already have the clean “entry point”:

application code throws an error labeled with fields such as world: "RML3" and action: "escalate-history"
observability/gateways/incident tooling detect and route it

Key principle:

The moment you enter History World should be explicit in code and telemetry—not discovered later by humans.
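
One way to make that entry point explicit is a dedicated exception type that carries the labels. This is a minimal sketch, not the series' actual implementation: only the world and action values come from the text, while HistoryEscalation, settle_charge, and route are hypothetical names chosen for illustration.

```python
class HistoryEscalation(Exception):
    """Raised when code knows it is leaving RML-2 and entering History World."""

    def __init__(self, reason: str, **context: str) -> None:
        super().__init__(reason)
        self.reason = reason
        self.context = context
        # Labels taken from the text; everything else here is hypothetical.
        self.world = "RML3"
        self.action = "escalate-history"

    def to_event(self) -> dict:
        # Flat payload that telemetry or incident tooling could route on.
        return {"world": self.world, "action": self.action,
                "reason": self.reason, **self.context}


def settle_charge(charge_id: str, already_settled: bool) -> str:
    # Hypothetical business function: a second settlement cannot be undone
    # by compensation, so it escalates instead of retrying.
    if already_settled:
        raise HistoryEscalation("duplicate settlement", charge_id=charge_id)
    return "settled"


def route(event: dict) -> str:
    # Hypothetical gateway rule: RML3 events open an incident, others are logged.
    return "open-incident" if event.get("world") == "RML3" else "log-only"
```

The point of the sketch is that the code raising the error, not a human reading logs afterwards, is what declares the world change.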

3.2 Initial triage: is it really RML-3?

This phase separates:

“It looked scary but self-corrected in RML-2” vs
“This is truly a History World event”

Examples:

double charge occurred and money actually moved → RML-3 confirmed
double charge appeared but one side auto-canceled before settlement → could be RML-2 bug / UX issue

Practical rule: when uncertain, treat gray cases as RML-3 until proven otherwise. It’s safer to downgrade than to underreact.
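
The practical rule can be sketched as a three-state check, where "unknown" is treated the same as "confirmed". The function name and the use of None for "not yet known" are assumptions made for this example.

```python
from enum import Enum
from typing import Optional


class World(Enum):
    RML2 = "RML2"
    RML3 = "RML3"


def triage_double_charge(money_moved: Optional[bool]) -> World:
    """Classify a suspected double charge.

    money_moved is True or False once settlement status is confirmed,
    and None while nobody knows yet.
    """
    if money_moved is False:
        # One side auto-canceled before settlement: possibly an RML-2 bug or UX issue.
        return World.RML2
    # Confirmed (True) and unknown (None) both land in RML-3;
    # downgrading later is safer than underreacting now.
    return World.RML3
```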

3.3 Impact assessment: how “heavy” is the history?

Three axes help:

Scope: affected users / transactions
Asset: money, PII, legal exposure, operational integrity
Time: how long the impact persisted

Example intuition:

1 user, small mistaken charge → small RML-3 (CS/Support-led)
hundreds of invoices mis-sent → medium RML-3 (Business + SRE-led)
potential confidential leak → large RML-3 (Legal + leadership-level)
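
To show how the three axes might combine, here is a rough sizing sketch. The thresholds (100 users, 24 hours) and the asset labels are illustrative assumptions, not values from this series; a real organization would agree on its own numbers with Legal and Business.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    affected_users: int    # Scope
    asset: str             # Asset, e.g. "money", "notice", "confidential"
    duration_hours: float  # Time


def size_incident(impact: Impact) -> str:
    # Assumed rule: any potential confidential leak is large regardless of scope.
    if impact.asset == "confidential":
        return "large"
    # Assumed thresholds for "hundreds of users" or "persisted for a long time".
    if impact.affected_users >= 100 or impact.duration_hours >= 24:
        return "medium"
    return "small"
```

With these assumed thresholds, Impact(1, "money", 1.0) sizes as small, Impact(300, "notice", 2.0) as medium, and Impact(1, "confidential", 0.5) as large, which mirrors the three intuitions above.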

4) Escalation rules: when to treat it as RML-3 by default

Teams lose time when every incident starts with:

“Is this really RML-3?”

So create shared rules.

4.1 Default-to-RML-3 cases

Treat these as RML-3 unless you can prove otherwise:

Financial discrepancies
- double charges, overcharges, missing refunds
Externally visible misdelivery
- sending data to the wrong party (email/notifications/statements)
Potential regulatory/contract breach
- missed deadlines, deletion/retention obligations, audit failures
Security incident suspicion
- authz bugs, privilege escalation, data exposure risk

Common thread:

You have an obligation to explain later—what happened, to whom, and what you did about it.

That’s History World.

4.2 Cases that may stay in RML-2

Possibly RML-2-contained:

temporary sync failures that later reconciled correctly
issues resolved entirely by retries/compensation with no external visibility
staging accidents with no production data or external effects

Even here, the “may” depends on shared criteria agreed with Legal/Business—not engineering vibes.
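
The two lists can be written down as data so that the default is never argued from scratch. In this sketch the category identifiers are hypothetical names for the bullets above, and treating unlisted categories as RML-3 is an assumption that follows the "gray cases" rule from 3.2. Downgrading a default-RML-3 case is deliberately left out, since that is a human decision.

```python
# Hypothetical category names for the default-to-RML-3 bullets.
DEFAULT_RML3 = {
    "financial_discrepancy",
    "external_misdelivery",
    "regulatory_or_contract_breach",
    "security_suspicion",
}

# Hypothetical category names for the may-stay-in-RML-2 bullets.
MAY_STAY_RML2 = {
    "sync_failure_reconciled",
    "resolved_by_retry_no_external_visibility",
    "staging_only",
}


def default_world(category: str, meets_shared_criteria: bool = False) -> str:
    # Only a listed RML-2 candidate that also meets the criteria agreed
    # with Legal/Business stays in RML-2.
    if category in MAY_STAY_RML2 and meets_shared_criteria:
        return "RML2"
    # Default-RML-3 categories, unlisted categories, and unverified
    # candidates all start as RML-3.
    return "RML3"
```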

5) A usable RACI table for RML-3

A lightweight RACI (Responsible / Accountable / Consulted / Informed) makes autonomy explicit.

Phase                              Legal  Business (PO/Leadership)  SRE/Engineering  CS/Support
Detection                          I      I                         R/A              I
Initial triage                     C      C                         R/A              I
Impact assessment                  C/A    C/A                       R                C
Remediation policy decision        A      A                         C                C
Remediation execution (technical)  I      I                         R/A              C
Remediation execution (customer)   C      A                         I                R/A
Record & reporting                 A      C                         R                C
Prevention measures                C      A                         R                C

This does not need to be “perfect.”
What matters is that, in an incident, nobody is guessing:

who owns facts and technical limits
who owns external responsibility and contractual constraints
who owns customer-facing behavior and cost trade-offs
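
If the RACI lives only in a slide, it tends to be forgotten mid-incident. A small sketch of the table above as data, so tooling can answer "who is accountable for this phase?". The cell values are copied from the table; the who helper is a hypothetical name.

```python
from typing import List

ROLES = ("Legal", "Business", "SRE/Engineering", "CS/Support")

# One tuple per phase, in the same column order as ROLES.
RACI = {
    "Detection": ("I", "I", "R/A", "I"),
    "Initial triage": ("C", "C", "R/A", "I"),
    "Impact assessment": ("C/A", "C/A", "R", "C"),
    "Remediation policy decision": ("A", "A", "C", "C"),
    "Remediation execution (technical)": ("I", "I", "R/A", "C"),
    "Remediation execution (customer)": ("C", "A", "I", "R/A"),
    "Record & reporting": ("A", "C", "R", "C"),
    "Prevention measures": ("C", "A", "R", "C"),
}


def who(phase: str, letter: str) -> List[str]:
    """Return the roles holding the given RACI letter in a phase."""
    return [role for role, cell in zip(ROLES, RACI[phase])
            if letter in cell.split("/")]
```

For example, who("Remediation policy decision", "A") returns both Legal and Business, which makes the shared accountability in that row visible instead of implicit.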

6) The information infrastructure that makes autonomy real

Governance without records becomes “trust me bro.”

6.1 Effect Ledger + Incident Report + Decision Log

To operate History World, you need artifacts that survive time:

Effect Ledger: what external effects occurred (charges, notices, statements, access grants)
Incident Report:
- summary, scope, root cause (technical + process), remediation, prevention
Decision Log:
- who decided what, when, based on what information/policy

History World autonomy includes:

Not only “we can explain what happened,”
but “we can explain who decided the response, and why.”
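
As a rough illustration, the three artifacts could be modeled like this. The field names follow the bullets above where possible; the class names, the explain method, and the remaining fields are assumptions for the sketch, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class EffectRecord:
    effect_id: str
    kind: str  # e.g. "charge", "notice", "statement", "access_grant"
    target: str
    occurred_at: datetime


@dataclass(frozen=True)
class Decision:
    decided_by: str
    decision: str
    basis: str  # the information or policy the decision relied on
    decided_at: datetime


@dataclass
class IncidentReport:
    incident_id: str
    summary: str
    scope: str
    root_cause_technical: str
    root_cause_process: str
    remediation: str
    prevention: str
    effects: List[EffectRecord] = field(default_factory=list)
    decisions: List[Decision] = field(default_factory=list)

    def explain(self) -> List[str]:
        # Chronological answer to "who decided the response, and why".
        ordered = sorted(self.decisions, key=lambda d: d.decided_at)
        return [f"{d.decided_at.isoformat()} {d.decided_by}: "
                f"{d.decision} (basis: {d.basis})" for d in ordered]
```

Keeping decisions as immutable records next to the effects is one way to make the "why" survive as long as the "what".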

6.2 Reduce “shadow operations”

A common anti-pattern:

refunds/corrections done manually in an admin UI
no durable record of who did what
later you can’t reconstruct the timeline confidently

Better target:

admin actions also append to the Effect Ledger
every manual operation is linked to an incident or ledger record
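
A minimal sketch of that target, assuming an in-memory list stands in for whatever durable store you actually use. EffectLedger and its method names are hypothetical; the one rule it enforces is that an admin action without an operator or a linked incident/ledger id is rejected.

```python
from datetime import datetime, timezone
from typing import Dict, List


class EffectLedger:
    """Append-only, in-memory stand-in for a durable ledger."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, str]] = []

    def append_admin_action(self, operator: str, action: str,
                            target: str, linked_to: str) -> Dict[str, str]:
        # Refuse "shadow" operations: every manual action needs an owner
        # and a link to an incident or ledger record.
        if not operator or not linked_to:
            raise ValueError("operator and linked incident/ledger id are required")
        entry = {
            "operator": operator,
            "action": action,
            "target": target,
            "linked_to": linked_to,
            "at": datetime.now(timezone.utc).isoformat(),
        }
        self._entries.append(entry)
        return entry

    def timeline(self, linked_to: str) -> List[Dict[str, str]]:
        # Entries come back in append order, so the timeline can be replayed.
        return [e for e in self._entries if e["linked_to"] == linked_to]
```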

7) Two lenses that matter in the boardroom: P&L and ToS

7.1 P&L (Profit & Loss): History World costs are real

RML-3 remediation becomes money:

refunds
goodwill coupons/credits
external legal/consulting fees
staffing/ops costs for support and comms

If you keep good History World records, you can later answer:

“How much did RML-3 incidents cost us this year?”
“How much was refunds vs ops vs goodwill?”
“Which product area is repeatedly generating history-grade costs?”

That clarity improves prioritization—and makes prevention investment legible.
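
Those questions reduce to simple aggregation once costs are recorded per incident. The rows below are made-up numbers purely for illustration, and the field names (area, category, amount_cents) are assumptions about how such records might look.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def total_by(records: Iterable[Mapping], key: str) -> Dict[str, int]:
    """Sum amount_cents grouped by the given field."""
    totals: Dict[str, int] = defaultdict(int)
    for record in records:
        totals[record[key]] += record["amount_cents"]
    return dict(totals)


# Made-up cost rows; amounts are integer cents to avoid float rounding.
COSTS = [
    {"incident": "INC-1", "area": "billing", "category": "refund", "amount_cents": 12_000},
    {"incident": "INC-1", "area": "billing", "category": "goodwill", "amount_cents": 3_000},
    {"incident": "INC-2", "area": "notifications", "category": "ops", "amount_cents": 45_000},
    {"incident": "INC-3", "area": "billing", "category": "refund", "amount_cents": 8_000},
]
```

Calling total_by(COSTS, "category") answers the refunds vs ops vs goodwill question, and total_by(COSTS, "area") shows which product area keeps generating history-grade costs.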

7.2 ToS / contracts: the minimum line vs the brand line

Contracts and terms of service (ToS) often define:

what is guaranteed vs best-effort
what compensation is required
what is excluded/limited

This enables a clean split:

ToS line: the legal minimum you must do
Brand line: what Business chooses to do beyond minimum to preserve trust

Important: this is not “ToS justifies anything.”
It’s a way to make the autonomy boundary explicit.
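
One way to keep the two lines from blurring is to record them as separate amounts. In this sketch the ToS minimum is assumed to be a refund of the overcharged amount, which is a hypothetical contract term; your actual minimum comes from Legal's reading of the contract, and the names here are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemediationOffer:
    tos_minimum_cents: int  # ToS line: owned by Legal
    brand_extra_cents: int  # Brand line: chosen by Business

    @property
    def total_cents(self) -> int:
        return self.tos_minimum_cents + self.brand_extra_cents


def build_offer(overcharge_cents: int, goodwill_cents: int = 0) -> RemediationOffer:
    # Goodwill can add to the minimum but never reduce it.
    if overcharge_cents < 0 or goodwill_cents < 0:
        raise ValueError("amounts must be non-negative")
    return RemediationOffer(tos_minimum_cents=overcharge_cents,
                            brand_extra_cents=goodwill_cents)
```

Because the two amounts are stored separately, the Decision Log can show which part was required and which part was a choice.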

8) Organizational anti-patterns (and fixes)

8.1 “We can roll back” as an argument against Legal

Engineering: “We can revert the DB.”
Legal: “But users already received the email.”

Fix: present the honest residue:

“We can roll back within RML-2 up to here.
What remains as history is: X, Y, Z.”
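
The "honest residue" statement can be generated from a list of effects if each one is tagged as reversible within RML-2 or not. The tag name reversible_in_rml2 and both function names are assumptions for this sketch.

```python
from typing import Dict, List, Tuple


def split_residue(effects: List[Dict]) -> Tuple[List[str], List[str]]:
    """Separate effects that RML-2 can undo from those that remain as history."""
    rollback = [e["name"] for e in effects if e["reversible_in_rml2"]]
    residue = [e["name"] for e in effects if not e["reversible_in_rml2"]]
    return rollback, residue


def residue_statement(effects: List[Dict]) -> str:
    rollback, residue = split_residue(effects)
    return ("We can roll back within RML-2: " + ", ".join(rollback)
            + ". What remains as history is: " + ", ".join(residue) + ".")
```

Given a reverted DB row tagged reversible and a delivered email tagged not reversible, the statement names the email as what remains, which is the sentence Legal actually needs to hear.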

8.2 Treating incidents as “just bug tickets”

If RML-3 events are filed like normal bugs:

scope/impact/refunds aren’t recorded
later you can’t tell whether something was RML-2 or RML-3

Fix:

manage RML-3 as a distinct incident type
add fields like RML world, external impact, refund required, external comms required
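
Those fields could look like the record below. The consistency rule in __post_init__ (an incident with external impact, a required refund, or required external comms cannot be filed as RML-2) is an assumption derived from the "no external visibility" criterion in 4.2, and the class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentTicket:
    ticket_id: str
    rml_world: str  # "RML2" or "RML3"
    external_impact: bool
    refund_required: bool
    external_comms_required: bool

    def __post_init__(self) -> None:
        if self.rml_world not in {"RML2", "RML3"}:
            raise ValueError("rml_world must be 'RML2' or 'RML3'")
        touches_outside = (self.external_impact or self.refund_required
                           or self.external_comms_required)
        # Assumed rule: anything that reaches customers or money is history-grade.
        if self.rml_world == "RML2" and touches_outside:
            raise ValueError("externally visible incidents should be filed as RML3")
```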

8.3 “Someone will handle it” culture

If SRE ends up doing everything (triage, reporting, prevention) while Legal/Business only rubber-stamp later, autonomy becomes fragile.

Fix:

enforce the RACI in the postmortem process
require cross-functional attendance for history-grade incidents

9) Practical checklist (History World autonomy)

Rules & process

[ ] Do we have a shared definition of “RML-3 incident”?
[ ] Are escalation conditions documented and agreed cross-functionally?
[ ] Is the incident lifecycle owned via an explicit RACI?

Information infrastructure

[ ] Do we have templates for Effect Ledger + Incident Report + Decision Log?
[ ] Are manual/admin operations recorded and linked to history artifacts?
[ ] Do world=RML3 signals trigger incident creation and routing?

Culture

[ ] Do Legal/Business/SRE attend postmortems for RML-3 events?
[ ] Do we talk in terms of “which world, and how far rollback goes,” not just “we can roll back”?
[ ] Do we treat RML-3 incidents as improvement inputs, not shame events?

Closing — whose world is History World?

RML-3 cannot be operated by:

engineering alone
legal alone
business alone

History World is the organization’s autonomy domain.

The question is not whether you’ll face RML-3 incidents. You will.

The question is whether, when one happens, your organization can:

identify it cleanly
decide responsibly
remediate coherently
explain credibly
improve deterministically

Next chapter:

Chapter 9 — RML-3 Case Files: Aligning your incident-response worldview
