kanaria007

Posted on • Originally published at zenn.dev

Chapter 8 — Autonomy in the History World: The Legal–Business–SRE Triangle

The Worlds of Distributed Systems — Chapter 8

“Technically, we can fix it.
Legally, it’s already considered ‘done.’”

The moment you enter RML-3 (History World), engineering alone is no longer sufficient.

Suddenly, these roles show up at the table—at the same time:

  • Legal / Compliance
  • Business (Product Owner / Leadership)
  • SRE / Platform / Engineering

And the conversation shifts into questions like:

  • “Is this an incident or just a bug?”
  • “Do we need to apologize? Refund? What does the contract require?”
  • “How much do we disclose, and to whom?”
  • “What does ‘prevention’ mean here, and how do we prove it?”

This chapter reframes RML-3 as:

Not a technical problem, but an autonomy problem.

In other words: how an organization governs History World decisions—who decides what, at what level, and with what responsibilities.


1) Why RML-3 becomes organizational design

In RML-1 / RML-2, the main actor is usually engineering:

  • RML-1: process-local rollback, tests, dry-runs
  • RML-2: sagas, compensation, backoff & retry, reconciliation

Many problems can be solved inside an engineering team.

But in RML-3, that collapses quickly. The moment you touch:

  • movement of money
  • information that already reached customers
  • regulation, law, contracts, audits
  • security/privacy obligations

…you can’t make it “as if it never happened” with a technical rollback.

From here on, History World must be operated as a shared domain:

  • Legal: rules, contracts, liability, privacy, reputation constraints
  • Business: customer relationship, brand behavior, cost and trade-offs
  • SRE/Engineering: facts, causality, technical limits, prevention measures

So:

RML-3 autonomy = how these three parties split decision-making and accountability for history-grade events.


2) The triangle model of the History World

Here’s the simplest map that keeps teams sane:

            Legal / Compliance
                 ▲
                 │  (rules / contracts / social responsibility)
                 │
Business ◀────────┼────────▶ SRE / Engineering
(customer / P&L)  │          (facts / system reality)

Each vertex brings a different “truth”:

  • Legal optimizes for constraint satisfaction (what you must/must not do)
  • Business optimizes for relationship and viability (brand, customer trust, cost)
  • SRE/Engineering optimizes for reality alignment (what happened, why, what can be changed)

When a History World incident happens, you’re effectively negotiating:

  • what level to record it at
  • who must be informed
  • what remediation is acceptable
  • how public communication should look
  • what prevention commitments are real vs performative

That negotiation is not a bug. It is History World governance.


3) Lifecycle of an RML-3 incident

Most history-grade events follow a repeatable lifecycle:

  1. Detection
  2. Initial triage
  3. Impact assessment
  4. Remediation
  5. Record & communication
  6. Prevention

3.1 Detection often starts from RML-2 escalation

If you adopted Chapters 5–7, you already have the clean “entry point”:

  • application code throws an error labeled like world: "RML3", action: "escalate-history"
  • observability/gateways/incident tooling detect and route it

Key principle:

The moment you enter History World should be explicit in code and telemetry—not discovered later by humans.
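As a minimal sketch of what "explicit in code and telemetry" could look like: the labels `world: "RML3"` and `action: "escalate-history"` come from the text above, while the class name `HistoryEscalation`, the `IncidentRouter` stand-in, and the `order_id` context are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


class HistoryEscalation(Exception):
    """Raised when code knows it has crossed into History World (RML-3)."""

    def __init__(self, reason: str, context: Optional[dict] = None):
        super().__init__(reason)
        # Labels taken from the text; the surrounding shape is an assumption.
        self.labels = {"world": "RML3", "action": "escalate-history"}
        self.context = context or {}


@dataclass
class IncidentRouter:
    """Stand-in for incident tooling: it only collects what gets routed to it."""
    opened: list = field(default_factory=list)

    def route(self, exc: Exception) -> bool:
        labels = getattr(exc, "labels", {})
        if labels.get("world") != "RML3":
            return False  # not a History World signal; normal error handling applies
        self.opened.append({"reason": str(exc), **labels, **getattr(exc, "context", {})})
        return True


def reconcile_charges(order_id: str) -> None:
    # Hypothetical check that found a duplicate charge already settled.
    raise HistoryEscalation("duplicate charge settled", {"order_id": order_id})


router = IncidentRouter()
try:
    reconcile_charges("order-123")
except HistoryEscalation as exc:
    routed = router.route(exc)
```

The point of the design is that the escalation is a typed, labeled event at the place where the code learns it, so tooling can match on labels instead of a human recognizing the situation from logs later.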

3.2 Initial triage: is it really RML-3?

This phase separates:

  • “It looked scary but self-corrected in RML-2” vs
  • “This is truly a History World event”

Examples:

  • double charge occurred and money actually moved → RML-3 confirmed
  • double charge appeared but one side auto-canceled before settlement → could be RML-2 bug / UX issue

Practical rule: when uncertain, treat gray cases as RML-3 until proven otherwise. It’s safer to downgrade than to underreact.
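The double-charge examples and the "gray cases default to RML-3" rule can be sketched as a small function. The function name and its two evidence flags are assumptions for illustration; `None` stands for "not yet known".

```python
from enum import Enum
from typing import Optional


class World(Enum):
    RML2 = "RML2"
    RML3 = "RML3"


def triage_double_charge(money_moved: Optional[bool],
                         auto_canceled_before_settlement: Optional[bool]) -> World:
    """Classify a suspected double charge from the evidence available so far."""
    if money_moved is True:
        return World.RML3  # money actually moved: History World confirmed
    if money_moved is False and auto_canceled_before_settlement is True:
        return World.RML2  # self-corrected before settlement: bug / UX issue
    # Missing or incomplete evidence is a gray case: start in RML-3, downgrade later.
    return World.RML3
```

Only one combination of positive evidence returns `RML2`; every unknown falls through to `RML3`, which encodes "safer to downgrade than to underreact".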

3.3 Impact assessment: how “heavy” is the history?

Three axes help:

  • Scope: affected users / transactions
  • Asset: money, PII, legal exposure, operational integrity
  • Time: how long the impact persisted

Example intuition:

  • 1 user, small mistaken charge → small RML-3 (CS-led)
  • hundreds of invoices mis-sent → medium RML-3 (Business + SRE-led)
  • potential confidential leak → large RML-3 (Legal + leadership-level)
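One way to make the three axes comparable across incidents is a small scoring helper. This is a sketch under loud assumptions: the thresholds (`MEDIUM_SCOPE`, `LONG_DURATION_HOURS`) and the rule that PII or legal exposure is always "large" are invented here to mirror the intuition above, and real values would need to be agreed with Legal and Business.

```python
from dataclasses import dataclass
from enum import Enum


class Asset(Enum):
    MONEY = "money"
    PII = "pii"
    LEGAL_EXPOSURE = "legal_exposure"
    OPERATIONAL_INTEGRITY = "operational_integrity"


@dataclass(frozen=True)
class Impact:
    affected_count: int      # Scope: affected users or transactions
    assets: frozenset        # Asset: which kinds of asset are touched
    duration_hours: float    # Time: how long the impact persisted


# Hypothetical thresholds, not taken from the article.
MEDIUM_SCOPE = 100
LONG_DURATION_HOURS = 24.0


def weight(impact: Impact) -> str:
    """Return 'small', 'medium', or 'large' for an RML-3 event."""
    if impact.assets & {Asset.PII, Asset.LEGAL_EXPOSURE}:
        return "large"   # potential leak or legal exposure: Legal + leadership level
    if impact.affected_count >= MEDIUM_SCOPE or impact.duration_hours >= LONG_DURATION_HOURS:
        return "medium"  # broad or long-lived: Business + SRE-led
    return "small"       # narrow and short: CS-led
```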

4) Escalation rules: when to treat it as RML-3 by default

Teams lose time when every incident starts with:

“Is this really RML-3?”

So create shared rules.

4.1 Default-to-RML-3 cases

Treat these as RML-3 unless you can prove otherwise:

  • Financial discrepancies

    • double charges, overcharges, missing refunds
  • Externally visible misdelivery

    • sending data to the wrong party (email/notifications/statements)
  • Potential regulatory/contract breach

    • missed deadlines, deletion/retention obligations, audit failures
  • Security incident suspicion

    • authz bugs, privilege escalation, data exposure risk

Common thread:

You have an obligation to explain later—what happened, to whom, and what you did about it.

That’s History World.

4.2 Cases that may stay in RML-2

Possibly RML-2-contained:

  • temporary sync failures that later reconciled correctly
  • issues resolved entirely by retries/compensation with no external visibility
  • staging accidents with no production data or external effects

Even here, the “may” depends on shared criteria agreed with Legal/Business—not engineering vibes.
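The two lists in this section can be written down as a shared rule table, so the starting world is looked up rather than debated. The category names mirror the bullets above; the enum, the function name, and the `criteria_agreed` flag (meaning "this case meets criteria Legal and Business have signed off on") are hypothetical.

```python
from enum import Enum


class Category(Enum):
    FINANCIAL_DISCREPANCY = "financial_discrepancy"
    EXTERNAL_MISDELIVERY = "external_misdelivery"
    REGULATORY_OR_CONTRACT_BREACH = "regulatory_or_contract_breach"
    SECURITY_SUSPICION = "security_suspicion"
    TEMPORARY_SYNC_FAILURE = "temporary_sync_failure"
    RESOLVED_BY_RETRY = "resolved_by_retry"
    STAGING_ACCIDENT = "staging_accident"


# Section 4.1: treated as RML-3 unless proven otherwise.
DEFAULT_RML3 = frozenset({
    Category.FINANCIAL_DISCREPANCY,
    Category.EXTERNAL_MISDELIVERY,
    Category.REGULATORY_OR_CONTRACT_BREACH,
    Category.SECURITY_SUSPICION,
})


def starting_world(category: Category, externally_visible: bool,
                   criteria_agreed: bool) -> str:
    """Return the world an incident starts in, before any downgrade."""
    if category in DEFAULT_RML3:
        return "RML3"
    # Section 4.2: staying in RML-2 needs no external visibility AND agreed criteria.
    if externally_visible or not criteria_agreed:
        return "RML3"
    return "RML2"
```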


5) A usable RACI table for RML-3

A lightweight RACI (Responsible / Accountable / Consulted / Informed) makes autonomy explicit.

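| Phase                             | Legal | Business (PO/Leadership) | SRE/Engineering | CS/Support |
|-----------------------------------|-------|--------------------------|-----------------|------------|
| Detection                         | I     | I                        | R/A             | I          |
| Initial triage                    | C     | C                        | R/A             | I          |
| Impact assessment                 | C/A   | C/A                      | R               | C          |
| Remediation policy decision       | A     | A                        | C               | C          |
| Remediation execution (technical) | I     | I                        | R/A             | C          |
| Remediation execution (customer)  | C     | A                        | I               | R/A        |
| Record & reporting                | A     | C                        | R               | C          |
| Prevention measures               | C     | A                        | R               | C          |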

This does not need to be “perfect.”
What matters is that, in an incident, nobody is guessing:

  • who owns facts and technical limits
  • who owns external responsibility and contractual constraints
  • who owns customer-facing behavior and cost trade-offs
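If the RACI lives as data rather than only in a document, incident tooling can answer "who owns this phase?" directly. The sketch below encodes the table from this section as-is; the helper names `roles_with` and `phases_without_accountable` are assumptions.

```python
ROLES = ("Legal", "Business", "SRE/Engineering", "CS/Support")

# One row per phase, cells in the same order as ROLES.
RACI = {
    "Detection":                         ("I",   "I",   "R/A", "I"),
    "Initial triage":                    ("C",   "C",   "R/A", "I"),
    "Impact assessment":                 ("C/A", "C/A", "R",   "C"),
    "Remediation policy decision":       ("A",   "A",   "C",   "C"),
    "Remediation execution (technical)": ("I",   "I",   "R/A", "C"),
    "Remediation execution (customer)":  ("C",   "A",   "I",   "R/A"),
    "Record & reporting":                ("A",   "C",   "R",   "C"),
    "Prevention measures":               ("C",   "A",   "R",   "C"),
}


def roles_with(phase: str, letter: str) -> list:
    """Roles holding the given RACI letter ('R', 'A', 'C', or 'I') in a phase."""
    return [role for role, cell in zip(ROLES, RACI[phase]) if letter in cell.split("/")]


def phases_without_accountable() -> list:
    """Sanity check: every phase should name at least one accountable role."""
    return [phase for phase in RACI if not roles_with(phase, "A")]
```

For example, `roles_with("Impact assessment", "A")` returns Legal and Business, which matches the shared accountability in the table.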

6) The information infrastructure that makes autonomy real

Governance without records becomes “trust me bro.”

6.1 Effect Ledger + Incident Report + Decision Log

To operate History World, you need artifacts that survive time:

  • Effect Ledger: what external effects occurred (charges, notices, statements, access grants)
  • Incident Report:

    • summary, scope, root cause (technical + process), remediation, prevention
  • Decision Log:

    • who decided what, when, based on what information/policy

History World autonomy includes:

Not only “we can explain what happened,”
but “we can explain who decided the response, and why.”
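The three artifacts can be sketched as plain records. The field names follow the bullets above where the text gives them (summary, scope, root cause, remediation, prevention; who, what, when, basis); everything else, including the `explain` helper and the sample values, is a hypothetical shape rather than a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class LedgerEntry:
    """One external effect that occurred (charge, notice, statement, access grant)."""
    effect: str
    subject: str
    at: datetime
    incident_id: Optional[str] = None


@dataclass(frozen=True)
class IncidentReport:
    incident_id: str
    summary: str
    scope: str
    root_cause_technical: str
    root_cause_process: str
    remediation: str
    prevention: str


@dataclass(frozen=True)
class Decision:
    incident_id: str
    decided_by: str
    decision: str
    basis: str  # the information or policy the decision rested on
    at: datetime


def explain(incident_id: str, ledger: list, decisions: list) -> dict:
    """Answer both questions: what happened, and who decided the response and why."""
    return {
        "what_happened": [e.effect for e in ledger if e.incident_id == incident_id],
        "who_decided": [(d.decided_by, d.decision, d.basis)
                        for d in decisions if d.incident_id == incident_id],
    }


# Illustrative sample data only.
now = datetime.now(timezone.utc)
ledger = [LedgerEntry("charge", "user-1", now, "INC-1"),
          LedgerEntry("charge", "user-1", now, "INC-1"),
          LedgerEntry("statement", "user-2", now)]
decisions = [Decision("INC-1", "Business", "refund duplicate charge", "refund policy", now)]
answer = explain("INC-1", ledger, decisions)
```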

6.2 Reduce “shadow operations”

A common anti-pattern:

  • refunds/corrections done manually in an admin UI
  • no durable record of who did what
  • later you can’t reconstruct the timeline confidently

Better target:

  • admin actions also append to the Effect Ledger
  • every manual operation is linked to an incident or ledger record
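A minimal sketch of that target, assuming an in-memory stand-in for the ledger: the manual operation refuses to run without a link to an incident or ledger record, and always appends who did what. `EffectLedger`, `manual_refund`, and `UnlinkedOperationError` are hypothetical names, and the real refund call is left as a comment.

```python
from datetime import datetime, timezone


class UnlinkedOperationError(Exception):
    """A manual operation was attempted without an incident or ledger link."""


class EffectLedger:
    """Append-only, in-memory stand-in for a durable ledger."""

    def __init__(self):
        self._entries = []

    def append(self, *, effect, actor, linked_to, detail):
        entry = {"effect": effect, "actor": actor, "linked_to": linked_to,
                 "detail": detail, "at": datetime.now(timezone.utc)}
        self._entries.append(entry)
        return entry

    def entries(self):
        return tuple(self._entries)  # read-only view; callers cannot edit history


def manual_refund(ledger, *, operator, linked_to, order_id, amount_cents):
    if not linked_to:
        raise UnlinkedOperationError("manual refund needs an incident or ledger record id")
    # The real refund call (payment provider, admin backend) would go here.
    return ledger.append(effect="manual_refund", actor=operator, linked_to=linked_to,
                         detail={"order_id": order_id, "amount_cents": amount_cents})


ledger = EffectLedger()
manual_refund(ledger, operator="alice", linked_to="INC-1",
              order_id="order-123", amount_cents=1200)
try:
    manual_refund(ledger, operator="bob", linked_to="",
                  order_id="order-456", amount_cents=500)
    rejected = False
except UnlinkedOperationError:
    rejected = True
```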

7) Two lenses that matter in the boardroom: P&L and ToS

7.1 P&L (Profit & Loss): History World costs are real

RML-3 remediation becomes money:

  • refunds
  • goodwill coupons/credits
  • external legal/consulting fees
  • staffing/ops costs for support and comms

If you keep good History World records, you can later answer:

  • “How much did RML-3 incidents cost us this year?”
  • “How much was refunds vs ops vs goodwill?”
  • “Which product area is repeatedly generating history-grade costs?”

That clarity improves prioritization—and makes prevention investment legible.
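Given records like these, the three questions reduce to grouping cost lines by a different key. The sketch below assumes a hypothetical `CostLine` record and made-up sample amounts; only the cost kinds (refunds, goodwill, external fees, ops) come from the list above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CostLine:
    incident_id: str
    product_area: str
    kind: str          # "refund", "goodwill", "external_fees", or "ops"
    amount_cents: int  # integer cents avoid float rounding in totals


def total_by(lines, key: str) -> dict:
    """Sum amount_cents grouped by any CostLine attribute name."""
    totals = defaultdict(int)
    for line in lines:
        totals[getattr(line, key)] += line.amount_cents
    return dict(totals)


# Illustrative numbers only.
lines = [
    CostLine("INC-1", "billing", "refund", 1200),
    CostLine("INC-1", "billing", "goodwill", 500),
    CostLine("INC-2", "notifications", "ops", 3000),
    CostLine("INC-3", "billing", "refund", 800),
]
yearly_total = sum(line.amount_cents for line in lines)
by_kind = total_by(lines, "kind")
by_area = total_by(lines, "product_area")
```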

7.2 ToS / contracts: the minimum line vs the brand line

Legal contracts often define:

  • what is guaranteed vs best-effort
  • what compensation is required
  • what is excluded/limited

This enables a clean split:

  • ToS line: the legal minimum you must do
  • Brand line: what Business chooses to do beyond minimum to preserve trust

Important: this is not “ToS justifies anything.”
It’s a way to make the autonomy boundary explicit.


8) Organizational anti-patterns (and fixes)

8.1 “We can roll back” as an argument against Legal

  • Engineering: “We can revert the DB.”
  • Legal: “But users already received the email.”

Fix: present the honest residue:

“We can roll back within RML-2 up to here.
What remains as history is: X, Y, Z.”
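The "honest residue" statement can be generated from a list of effects rather than argued from memory. In this sketch the `Effect` record, its `left_our_boundary` flag, and the sample effect names are all assumptions; deciding what counts as having left the boundary is exactly the part that needs Legal and Business input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effect:
    name: str
    left_our_boundary: bool  # True once a customer, bank, or regulator has seen it


def honest_residue(effects):
    """Split effects into (can still be rolled back, remains as history)."""
    rollbackable = [e.name for e in effects if not e.left_our_boundary]
    residue = [e.name for e in effects if e.left_our_boundary]
    return rollbackable, residue


# Illustrative effects only.
effects = [
    Effect("db_row_updated", False),
    Effect("email_delivered", True),
    Effect("card_charge_settled", True),
]
rollbackable, residue = honest_residue(effects)
statement = (f"We can roll back within RML-2: {', '.join(rollbackable)}. "
             f"What remains as history is: {', '.join(residue)}.")
```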

8.2 Treating incidents as “just bug tickets”

If RML-3 events are filed like normal bugs:

  • scope/impact/refunds aren’t recorded
  • later you can’t tell whether something was RML-2 or RML-3

Fix:

  • manage RML-3 as a distinct incident type
  • add fields like RML world, external impact, refund required, external comms required
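As a sketch of such an incident type: the four fields come from the bullet above, while the `IncidentTicket` class and the `missing_fields` check are hypothetical. The idea is that an RML-3 ticket cannot quietly leave the history questions unanswered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentTicket:
    title: str
    rml_world: str                                  # "RML2" or "RML3"
    external_impact: Optional[str] = None           # None means "nobody has answered yet"
    refund_required: Optional[bool] = None
    external_comms_required: Optional[bool] = None

    def missing_fields(self) -> list:
        """For RML-3 tickets, list the history fields that are still unanswered."""
        if self.rml_world != "RML3":
            return []
        required = ("external_impact", "refund_required", "external_comms_required")
        return [name for name in required if getattr(self, name) is None]


ticket = IncidentTicket("Invoices sent to wrong recipients", "RML3",
                        external_impact="invoices visible to other customers")
bug = IncidentTicket("Retry storm on sync job", "RML2")
```

Here `ticket.missing_fields()` reports the refund and comms questions as open, while the RML-2 bug has no such obligations.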

8.3 “Someone will handle it” culture

If SRE ends up doing everything (triage, reporting, prevention) while Legal/Business only rubber-stamp later, autonomy becomes fragile.

Fix:

  • enforce the RACI in the postmortem process
  • require cross-functional attendance for history-grade incidents

9) Practical checklist (History World autonomy)

Rules & process

  • [ ] Do we have a shared definition of “RML-3 incident”?
  • [ ] Are escalation conditions documented and agreed cross-functionally?
  • [ ] Is the incident lifecycle owned via an explicit RACI?

Information infrastructure

  • [ ] Do we have templates for Effect Ledger + Incident Report + Decision Log?
  • [ ] Are manual/admin operations recorded and linked to history artifacts?
  • [ ] Do world=RML3 signals trigger incident creation and routing?

Culture

  • [ ] Do Legal/Business/SRE attend postmortems for RML-3 events?
  • [ ] Do we talk in “which world, how far rollback goes,” not “we can roll back”?
  • [ ] Do we treat RML-3 incidents as improvement inputs, not shame events?

Closing — whose world is History World?

RML-3 cannot be operated by:

  • engineering alone
  • legal alone
  • business alone

History World is the organization’s autonomy domain.

The question is not whether you’ll face RML-3 incidents. You will.

The question is whether, when one happens, your organization can:

  • identify it cleanly
  • decide responsibly
  • remediate coherently
  • explain credibly
  • improve deterministically

Next chapter:

Chapter 9 — RML-3 Case Files: Aligning your incident-response worldview