kanaria007

Posted on Mar 12 • Originally published at zenn.dev

Chapter 9 — RML-3 Case Files: Aligning Your Incident Response Worldview

#distributedsystems #sre #incidentmanagement

The Worlds of Distributed Systems — Chapter 9

“When an incident happens…
what is supposed to be happening behind the screen?”

In Chapter 8, we reframed RML-3 (History World) as an autonomy domain operated by a triangle:

Legal / Compliance
Business
SRE / Engineering

In this chapter, we move one step closer to the field and answer:

When an RML-3 incident happens, what should the actual response flow look like?

The goal is not “the ideal incident process.”
It’s a practical case-file format that connects:

the RML-1/2/3 worldview, and
the incident response workflow teams actually run.

Think of it as:

A “case file” is a worldview log:
which world changed, when, and who decided what about it.

1) Alert vs Bug vs Incident — separate them by world

In day-to-day operations, teams mix these terms:

“An alert fired”
“We hit a bug”
“We had an incident”

RML makes the distinction crisp.

1.1 Alert

An alert is typically:

a monitoring threshold was exceeded
error rate increased
latency degraded

This usually lives in RML-1/2:

many are transient
there may be user impact
but once recovered, no history-grade residue remains

So alerts often resolve inside the RML-1/2 flow.

1.2 Bug (Defect)

A bug is:

behavior differs from spec
certain conditions cause errors
the system behaves incorrectly, but the effect hasn’t reached the History World

This is mostly RML-1/2 territory:

track it in tickets (JIRA, GitHub issues)
fix it with normal engineering cycles

1.3 Incident

In this series, we define an incident as:

A meaningful responsibility-bearing event that has reached the History World (RML-3).

Examples:

money moved (or should have moved but didn’t)
customers/third parties were affected in the real world
there is regulatory/contractual/social responsibility exposure

In RML terms:

world = "RML3", action = "escalate-history"
should be recorded in an Effect Ledger
requires an incident report / case file

This is what belongs in your “casebook.”
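The three-way split can be made mechanical with a small classifier. The sketch below is illustrative only: the `Signal` shape, the `violatesSpec` flag, and the `classify` function are hypothetical names, and only the `world` values and the `escalate-history` action come from the text above.

```typescript
// Hypothetical signal shape; only `world` and `action` mirror the RML fields above.
type World = "RML1" | "RML2" | "RML3";

interface Signal {
  world: World;
  action?: string;        // e.g. "retry-with-backoff" or "escalate-history"
  violatesSpec: boolean;  // true when behavior differs from the spec
}

type Kind = "alert" | "bug" | "incident";

// Classify by world first: anything that reached RML-3 is an incident,
// no matter how it was first noticed.
function classify(signal: Signal): Kind {
  if (signal.world === "RML3" || signal.action === "escalate-history") {
    return "incident";
  }
  // Still inside RML-1/2: a spec violation is a bug, anything else is an alert.
  return signal.violatesSpec ? "bug" : "alert";
}
```

The ordering of the checks is the point: the world is evaluated before anything else, so a spec violation that already reached RML-3 is never filed as "just a bug."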

2) The RML-3 incident flow: a 6-step casebook

A practical response flow that teams can template as-is:

Detect
Contain
Understand
Decide
Act
Learn

The key improvement is:

Attach “which world this step lives in” to every phase.

2.1 Detect — discovery (RML-2 → RML-3 boundary)

Triggers typically include:

RML-2 code throws world="RML3" errors
gateway/SLO detects elevated “RML-3 errors”
CS/Sales escalates customer complaints

The most important move:

The moment you suspect “this might be RML-3,”
switch into the RML-3 workflow immediately.

Concrete checks:

[ ] incident ticket created with RML3 label
[ ] provisional Effect Ledger record created
[ ] nobody quietly “handled it” with retries/compensation in the hope that no one would notice
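The first two checks can be automated so that suspicion alone opens the case. The following is a minimal sketch, assuming in-memory stand-ins for the ticket system and the Effect Ledger; `openRml3Case`, `IncidentTicket`, and `LedgerRecord` are hypothetical names, not an API from this series.

```typescript
// Hypothetical in-memory stand-ins for a ticket system and an Effect Ledger.
interface IncidentTicket {
  id: string;
  labels: string[];
  openedAt: string; // ISO 8601 timestamp
}

interface LedgerRecord {
  incidentId: string;
  status: "provisional" | "confirmed";
  note: string;
}

const tickets: IncidentTicket[] = [];
const effectLedger: LedgerRecord[] = [];

// Called the moment a signal is suspected to be RML-3.
// Creates the labeled ticket and a provisional ledger record together,
// so neither step depends on someone remembering the other.
function openRml3Case(note: string, now: Date = new Date()): IncidentTicket {
  const seq = ("0000" + (tickets.length + 1)).slice(-4);
  const id = `INC-${now.getUTCFullYear()}-${seq}`;
  const ticket: IncidentTicket = { id, labels: ["RML3"], openedAt: now.toISOString() };
  tickets.push(ticket);
  effectLedger.push({ incidentId: id, status: "provisional", note });
  return ticket;
}
```

The ledger record starts as `provisional` on purpose: it can be confirmed or downscoped later, but the trace that someone suspected RML-3 exists from the first minute.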

2.2 Contain — stop the bleed (mostly RML-2)

Containment means:

prevent further History World damage from accumulating

Typical actions:

feature flag off / kill switch
block a tenant/region
stop new execution paths at RML-2 level (e.g., immediate cancels)

This is SRE/Engineering-led, but if business impact is large, bring in PO/leadership quickly.
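As a rough illustration of the containment actions above, here is a sketch of a kill switch plus tenant block checked before any new execution starts. The flag name `payments.new_execution` and all function names are assumptions made for the example; a real system would back this with a flag service rather than in-memory objects.

```typescript
// Hypothetical in-memory flag store and tenant block list.
const flags: Record<string, boolean> = { "payments.new_execution": true };
const blockedTenants: Record<string, boolean> = {};

// Containment actions: both are cheap, reversible RML-2 operations.
function killSwitch(flag: string): void {
  flags[flag] = false;
}

function blockTenant(tenantId: string): void {
  blockedTenants[tenantId] = true;
}

// Gate evaluated before starting any new execution path.
// Unknown flags are treated as "off" so a typo fails closed.
function mayExecute(flag: string, tenantId: string): boolean {
  return flags[flag] === true && blockedTenants[tenantId] !== true;
}
```

Failing closed on unknown flags is a design choice for this sketch: during containment, refusing to run is usually cheaper than adding one more History World event.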

2.3 Understand — build a single timeline (RML-1/2 analysis + RML-3 records)

Here you answer:

what happened (facts)
which world(s) it happened in (RML-1/2/3)
how much is already written as history (RML-3 residue)

You usually reconcile:

app logs (RML-1/2)
traces/metrics
Effect Ledger (RML-3)

Your output should be a single coherent timeline, not a pile of fragments.
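Building that single timeline is mostly a merge-and-sort over the sources listed above. The sketch below assumes every source can be normalized into a common `TimelineEvent` shape with an ISO 8601 timestamp; the type and the `buildTimeline` function are hypothetical names for illustration.

```typescript
type World = "RML1" | "RML2" | "RML3";

// Hypothetical normalized event; `source` records where the fact came from.
interface TimelineEvent {
  at: string;     // ISO 8601 timestamp
  world: World;
  source: string; // e.g. "app_log", "metrics", "effect_ledger"
  detail: string;
}

// Merge any number of per-source event lists into one list ordered by time.
function buildTimeline(...sources: TimelineEvent[][]): TimelineEvent[] {
  const merged = sources.reduce(
    (acc, events) => acc.concat(events),
    [] as TimelineEvent[]
  );
  return merged.sort((a, b) => Date.parse(a.at) - Date.parse(b.at));
}
```

Keeping `world` and `source` on every event means the merged timeline still shows which facts are internal behavior (RML-1/2) and which are already history (RML-3).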

2.4 Decide — the triangle meeting

This is where Chapter 8’s triangle becomes real:

Legal: minimum obligations (regulation/contracts/privacy)
Business: how far to go (customer trust/brand/cost)
SRE/Engineering: what is technically possible + risk trade-offs

Decision framing:

“How do we update history forward—without pretending we can delete it?”

Typical plan components:

refund / re-run / correction notice / feature disable / rollback-forward
customer notification vs press release vs ToS-based disclosure rules

2.5 Act — remediation + communication (History World updates)

Now you execute:

refunds
data corrections
customer emails/calls
internal and external explanations

Key point: these actions are new History World events.

You do not erase the original record. You add correction events on top of it.
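The "add, never erase" rule maps naturally onto an append-only log. The following is a minimal sketch under stated assumptions: `EffectLedger` here is a hypothetical in-memory class, not the ledger implementation from earlier chapters, and the `amount` sign convention is invented for the example.

```typescript
// Hypothetical ledger event. Positive amounts are money taken from the
// customer, negative amounts are money returned.
interface LedgerEvent {
  id: string;
  type: "charge" | "refund" | "correction_notice";
  amount: number;
  corrects?: string; // id of the earlier event this one corrects
}

class EffectLedger {
  private readonly events: LedgerEvent[] = [];

  // The only write operation: there is deliberately no update or delete.
  append(event: LedgerEvent): void {
    if (
      event.corrects !== undefined &&
      !this.events.some((e) => e.id === event.corrects)
    ) {
      throw new Error(`cannot correct unknown event ${event.corrects}`);
    }
    this.events.push(event);
  }

  // Returns a copy so callers cannot mutate recorded history.
  history(): LedgerEvent[] {
    return this.events.slice();
  }

  // The current state is derived from the full history, corrections included.
  netAmount(): number {
    return this.events.reduce((sum, e) => sum + e.amount, 0);
  }
}
```

After a double charge and one refund, the ledger holds three events rather than one: the mistaken charge stays visible, and the refund points back at it through `corrects`.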

2.6 Learn — prevention + governance improvements (back to RML-1/2)

Postmortem is where you convert RML-3 pain into RML-1/2 improvements:

technical design fixes
monitoring + runbook changes
process/governance gaps (ToS mismatch, escalation rules, ownership)

The objective:

Transform History World events into repeatable improvements—rather than shame artifacts.

3) Runbook vs Playbook (RML mapping)

Ops docs often come in two forms:

Runbook: step-by-step operation manual (How)
Playbook: scenario/decision flow (What/Who)

RML mapping makes this clean:

Runbook ≈ RML-2
- containment, rollback-forward, feature flags, commands
Playbook ≈ RML-3
- who must be involved, how decisions are made, comms strategy

A practical pattern:

RML-2 incidents stay in runbooks
the moment you detect RML3, you hand off to playbooks

4) A one-page case-file template (YAML)

Here is a “casebook template” you can actually use.

incident_id: INC-2025-0001
title: "Possible double charge"
rml_world: RML3
status: resolved # or ongoing, monitoring

detected_at: 2025-06-01T10:23:45Z
detected_by:
  - type: alert
    source: "payment_service.rml3_error_rate"
  - type: cs_ticket
    id: "CS-1234"

classification:
  severity: SEV-1 # align with your org’s scheme
  asset:
    - money
  scope:
    affected_users_estimate: 37
    affected_transactions_estimate: 41

timeline:
  - at: 2025-06-01T10:23:45Z
    world: RML2
    type: detection
    detail: "RML3 error rate crossed threshold"
  - at: 2025-06-01T10:25:00Z
    world: RML2
    type: containment
    detail: "Feature flag off to stop new payment execution"
  - at: 2025-06-01T11:10:00Z
    world: RML3
    type: impact_assessed
    detail: "37 users / 41 transactions impacted"

decisions:
  summary: >
    Full refund to all impacted users + apology coupon.
    No public press release required.
  decided_by:
    - role: legal
      name: "..."
    - role: business
      name: "..."
    - role: sre
      name: "..."

actions:
  technical:
    - type: refund
      world: RML3
      count: 41
    - type: feature_fix
      world: RML2
      detail: "Require idempotency keys for payment saga steps"
  communication:
    - type: user_email
      world: RML3
      template_id: "refund_apology_v2"
    - type: internal_post
      world: RML2
      channel: "#incident"

lessons_learned:
  technical:
    - "Make idempotency keys mandatory for payment saga steps"
  process:
    - "Add CS handoff procedure when RML3 alerts fire"
  governance:
    - "Review ToS compensation clause vs actual practice"

Why this works:

every timeline event carries a world (RML1/2/3)
every action records which world it updates
decisions record who decided what
lessons split into technical/process/governance

This makes later review possible:

Where are we slow: detect, contain, or decide?
Which bottleneck keeps recurring: technical or governance?
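The "where are we slow" question can be answered directly from the timeline. This sketch assumes the timeline has already been parsed from YAML into objects with the same field names as the template; `minutesBetween` is a hypothetical helper, not part of any library.

```typescript
// Mirrors one entry of the `timeline` list in the template above.
interface TimelineEntry {
  at: string;    // ISO 8601 timestamp
  world: "RML1" | "RML2" | "RML3";
  type: string;  // e.g. "detection", "containment", "impact_assessed"
  detail: string;
}

// Minutes elapsed between the first entry of type `from` and the first of type `to`.
// Returns undefined when either phase is missing from the timeline.
function minutesBetween(
  timeline: TimelineEntry[],
  from: string,
  to: string
): number | undefined {
  const start = timeline.filter((e) => e.type === from)[0];
  const end = timeline.filter((e) => e.type === to)[0];
  if (!start || !end) return undefined;
  return (Date.parse(end.at) - Date.parse(start.at)) / 60000;
}
```

With the template's sample data, detection to containment takes 1.25 minutes and containment to impact assessment takes 45 minutes, which points at the Understand phase rather than Contain.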

5) A concrete story: payment saga → RML-3 incident

A common path into RML-3 is an RML-2 mistake.

Example anti-pattern:

// Anti-pattern: retry without an idempotency key (duplicates can happen)
type PaymentGateway = {
  charge: (req: { paymentId: string; amount: number }) => Promise<void>;
};
declare const paymentGateway: PaymentGateway;

// (Assume `RmlError` is the structured error type from Chapter 5.)
async function charge(paymentId: string, amount: number) {
  try {
    await paymentGateway.charge({ paymentId, amount });
  } catch (e) {
    throw new RmlError({
      world: "RML2",
      severity: "error",
      action: "retry-with-backoff",
      code: "PAYMENT_GATEWAY_TEMPORARY_ERROR",
      message: "Temporary payment error",
      cause: e,
    });
  }
}

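The client follows the `retry-with-backoff` action and retries a few times.
If an earlier attempt actually reached the gateway (for example, the charge succeeded but the response timed out), each retry without an idempotency key can be applied as a new charge.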

This is how:

an RML-2 “retry policy” bug
becomes an RML-3 money incident
which then returns as an RML-2 design fix:
- require idempotency keys
- enforce with SDK/lint/tests

It’s the full “world loop” in one example.
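For contrast, here is a sketch of the RML-2 design fix: the same retry loop, but with an idempotency key that stays constant across attempts. Everything in it is an assumption made for illustration: `FakeGateway` is an in-memory stand-in that deduplicates by key, the key format `charge:<paymentId>` is invented, and the code is synchronous so the retry behavior is easy to follow (a real gateway call would be asynchronous, as in the anti-pattern above).

```typescript
interface ChargeRequest {
  idempotencyKey: string;
  paymentId: string;
  amount: number;
}

// Hypothetical gateway that applies each idempotency key at most once
// and can simulate "charge applied, response lost" failures.
class FakeGateway {
  public readonly charges: ChargeRequest[] = [];
  private failuresLeft: number;

  constructor(lostResponses: number) {
    this.failuresLeft = lostResponses;
  }

  charge(req: ChargeRequest): void {
    const alreadyApplied = this.charges.some(
      (c) => c.idempotencyKey === req.idempotencyKey
    );
    if (!alreadyApplied) {
      this.charges.push(req); // the money moves exactly once per key
    }
    if (this.failuresLeft > 0) {
      this.failuresLeft -= 1;
      throw new Error("timeout: response lost after the charge was applied");
    }
  }
}

// Retrying is safe because every attempt reuses the same key.
// Returns the number of attempts that were needed.
function chargeWithRetry(
  gateway: FakeGateway,
  paymentId: string,
  amount: number,
  maxAttempts = 3
): number {
  const idempotencyKey = `charge:${paymentId}`; // derived once, outside the loop
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      gateway.charge({ idempotencyKey, paymentId, amount });
      return attempt;
    } catch (e) {
      if (attempt === maxAttempts) throw e;
      // A real client would wait with backoff here before the next attempt.
    }
  }
  throw new Error("maxAttempts must be at least 1");
}
```

Even when two responses are lost and the client needs three attempts, the fake gateway records a single charge. Generating a fresh key inside the loop would silently bring the duplicate-charge bug back, which is why the key is derived before the first attempt.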

6) Anti-patterns: when case files stop working

6.1 Writing the report only “for appearance”

If the case file is written after the fact to look clean:

timeline diverges from reality
decisions are retroactively rationalized

Fix:

fill the timeline in real time
write decision logs at the moment decisions happen

6.2 Treating an RML-3 incident as a “lighter” bug

This is cultural:

the org doesn’t want to admit history-grade impact
so it gets down-scoped as “just an RML-2 defect”

Result:

later: “why wasn’t this reported/escalated?”
trust damage compounds

Fix:

predefine what counts as RML-3
when unsure, treat gray as RML-3 and downscope later
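Both fixes can be encoded as one triage function. The sketch below reuses the three example criteria from section 1.3 as the predefined definition; the `Rml3Criteria` shape, the `provisional` flag, and the function name are hypothetical, and a real organization would substitute its own documented criteria.

```typescript
type Answer = "yes" | "no" | "unknown";

// Hypothetical predefined criteria, taken from the incident examples in 1.3.
interface Rml3Criteria {
  moneyMoved: Answer;             // or should have moved but did not
  thirdPartiesAffected: Answer;   // customers or third parties, in the real world
  responsibilityExposure: Answer; // regulatory, contractual, or social
}

interface Triage {
  world: "RML2" | "RML3";
  provisional: boolean; // true when RML-3 was chosen only because of uncertainty
}

function triage(c: Rml3Criteria): Triage {
  const answers: Answer[] = [
    c.moneyMoved,
    c.thirdPartiesAffected,
    c.responsibilityExposure,
  ];
  if (answers.indexOf("yes") !== -1) {
    return { world: "RML3", provisional: false };
  }
  // Gray zone: treat as RML-3 now, downscope later once the facts are in.
  if (answers.indexOf("unknown") !== -1) {
    return { world: "RML3", provisional: true };
  }
  return { world: "RML2", provisional: false };
}
```

The `provisional` flag keeps the later downscope honest: it records that the case entered the RML-3 workflow out of caution, not because impact was confirmed.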

6.3 Only writing the technical root cause

If postmortems omit governance/process/ToS aspects, the same class of incident repeats in new shapes.

Fix:

always keep:
- lessons_learned.technical
- lessons_learned.process
- lessons_learned.governance
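This rule is easy to enforce before a case file is closed. A minimal sketch, assuming the case file has already been parsed into an object; `missingLessonCategories` is a hypothetical helper name.

```typescript
// Mirrors the `lessons_learned` section of the template.
interface LessonsLearned {
  technical: string[];
  process: string[];
  governance: string[];
}

// Returns the categories that are absent or empty, in template order.
function missingLessonCategories(lessons: Partial<LessonsLearned>): string[] {
  const required: (keyof LessonsLearned)[] = ["technical", "process", "governance"];
  return required.filter((key) => {
    const entries = lessons[key];
    return entries === undefined || entries.length === 0;
  });
}
```

A review tool or CI check could refuse to mark a case file `resolved` while this list is non-empty, which nudges the postmortem past the purely technical root cause.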

7) Checklist: how ready is your RML-3 casebook?

Definitions

[ ] Can you explain alert vs bug vs incident using RML?
[ ] Is your RML-3 incident definition documented?
[ ] Do you have a “gray → treat as RML-3” principle?

Flow

[ ] Do you have Detect/Contain/Understand/Decide/Act/Learn?
[ ] Do world="RML3" signals auto-create incident tickets?
[ ] Do you have kill switches / feature flags for containment?

Template

[ ] Do case files include a world per timeline event?
[ ] Do decisions capture “who decided what”?
[ ] Are lessons split across technical/process/governance?

Docs

[ ] Runbooks exist for RML-2 containment steps
[ ] Playbooks exist for RML-3 decision + comms steps
[ ] Lessons feed back into runbooks/playbooks

Culture

[ ] Legal/Business/SRE attend RML-3 postmortems
[ ] You can roughly estimate RML-3 costs (refunds, credits, ops)
[ ] RML-3 incidents are treated as learning opportunities

Closing — case files are worldview logs

This chapter’s one-liner:

An RML-3 case file is a log of worldview:
which world changed, what residue remained, who decided what, and how you corrected forward.

RML-1/2 explain how the system behaved internally
RML-3 records what became history
the organization decides how to update history forward
and then you rewrite RML-1/2 design to prevent repeat incidents

Next: Chapter 10 — RML as Product Strategy: Designing Trust
