kanaria007

Posted on • Originally published at zenn.dev

Chapter 9 — RML-3 Case Files: Aligning Your Incident Response Worldview

The Worlds of Distributed Systems — Chapter 9

“When an incident happens…
what is supposed to be happening behind the screen?”

In Chapter 8, we reframed RML-3 (History World) as an autonomy domain operated by a triangle:

  • Legal / Compliance
  • Business
  • SRE / Engineering

In this chapter, we move one step closer to the field and answer:

When an RML-3 incident happens, what should the actual response flow look like?

The goal is not “the ideal incident process.”
It’s a practical case-file format that connects:

  • the RML-1/2/3 worldview, and
  • the incident response workflow teams actually run.

Think of it as:

A “case file” is a worldview log:
which world changed, when, and who decided what about it.


1) Alert vs Bug vs Incident — separate them by world

In day-to-day operations, teams mix these terms:

  • “An alert fired”
  • “We hit a bug”
  • “We had an incident”

RML makes the distinction crisp.

1.1 Alert

An alert is typically:

  • a monitoring threshold was exceeded
  • error rate increased
  • latency degraded

This usually lives in RML-1/2:

  • many are transient
  • there may be user impact
  • but once recovered, no history-grade residue remains

So alerts often resolve inside the RML-1/2 flow.

1.2 Bug (Defect)

A bug is:

  • behavior differs from spec
  • certain conditions cause errors
  • the system acts wrong—but hasn’t reached the History World

This is mostly RML-1/2 territory:

  • track it in tickets (JIRA, GitHub issues)
  • fix it with normal engineering cycles

1.3 Incident

In this series, we define an incident as:

A meaningful responsibility-bearing event that has reached the History World (RML-3).

Examples:

  • money moved (or should have moved but didn’t)
  • customers/third parties were affected in the real world
  • there is regulatory/contractual/social responsibility exposure

In RML terms:

  • world = "RML3", action = "escalate-history"
  • should be recorded in an Effect Ledger
  • requires an incident report / case file

This is what belongs in your “casebook.”
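The three definitions above can be read as a single routing rule. The following is a minimal sketch, assuming a hypothetical `OpsSignal` shape; only the world labels come from this series, while `OpsSignal`, `SignalKind`, and `classifySignal` are names invented for illustration.

```typescript
// Hypothetical shapes for illustration; only the world labels come from the series.
type RmlWorld = "RML1" | "RML2" | "RML3";
type SignalKind = "alert" | "bug" | "incident";

interface OpsSignal {
  world: RmlWorld;
  // true when behavior differs from spec (a defect), false for a threshold breach
  deviatesFromSpec: boolean;
}

// Anything that has reached the History World is an incident,
// regardless of how it was first noticed.
function classifySignal(signal: OpsSignal): SignalKind {
  if (signal.world === "RML3") return "incident";
  return signal.deviatesFromSpec ? "bug" : "alert";
}
```

The point of the sketch is the ordering: the world check comes first, so a spec deviation that moved money is never filed as "just a bug."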


2) The RML-3 incident flow: a 6-step casebook

A practical response flow that teams can template as-is:

  1. Detect
  2. Contain
  3. Understand
  4. Decide
  5. Act
  6. Learn

The key improvement is:

Attach “which world this step lives in” to every phase.

2.1 Detect — discovery (RML-2 → RML-3 boundary)

Triggers typically include:

  • RML-2 code throws world="RML3" errors
  • gateway or SLO monitoring detects an elevated rate of “RML-3 errors”
  • CS/Sales escalates customer complaints

The most important move:

The moment you suspect “this might be RML-3,”
switch into the RML-3 workflow immediately.

Concrete checks:

  • [ ] incident ticket created with RML3 label
  • [ ] provisional Effect Ledger record created
  • [ ] nobody quietly “handled it” via retries/compensation in the hope that no one would notice
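The first two checks can be automated so that they happen together. Below is a rough sketch, assuming in-memory arrays stand in for a real ticketing system and Effect Ledger; `IncidentTicket`, `LedgerRecord`, and `enterRml3Workflow` are hypothetical names, and the ID format simply mirrors the `INC-2025-0001` style used later in this chapter.

```typescript
// Hypothetical in-memory stores; a real system would call a ticketing API
// and write to a durable ledger instead.
interface IncidentTicket { id: string; labels: string[]; openedAt: string }
interface LedgerRecord { incidentId: string; provisional: boolean; note: string; recordedAt: string }

const openTickets: IncidentTicket[] = [];
const effectLedger: LedgerRecord[] = [];

// Call this as soon as a signal is suspected to be RML-3,
// before any retry or compensation is attempted.
function enterRml3Workflow(note: string, now: Date = new Date()): IncidentTicket {
  const sequence = ("000" + (openTickets.length + 1)).slice(-4);
  const ticket: IncidentTicket = {
    id: `INC-${now.getUTCFullYear()}-${sequence}`,
    labels: ["RML3"],
    openedAt: now.toISOString(),
  };
  openTickets.push(ticket);
  // The ledger record is provisional: it can be confirmed or down-scoped later.
  effectLedger.push({
    incidentId: ticket.id,
    provisional: true,
    note,
    recordedAt: now.toISOString(),
  });
  return ticket;
}
```

Creating the ticket and the provisional record in one call makes it harder to end up with one but not the other.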

2.2 Contain — stop the bleed (mostly RML-2)

Containment means:

prevent further History World damage from accumulating

Typical actions:

  • feature flag off / kill switch
  • block a tenant/region
  • stop new execution paths at the RML-2 level (e.g., immediate cancels)

This is SRE/Engineering-led, but if the business impact is large, bring in the product owner (PO) and leadership quickly.
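A kill switch can be as small as a guard at the entry of every path that writes to the History World. This is a minimal sketch, assuming an in-process flag set; `disabledFeatures`, `killSwitch`, `isExecutionAllowed`, and the feature name `payment-execution` are all hypothetical, and a real deployment would usually read flags from a shared flag service.

```typescript
// Hypothetical in-process flag store for illustration only.
const disabledFeatures = new Set<string>();

// Flip the switch: from now on, the named feature starts no new executions.
function killSwitch(feature: string): void {
  disabledFeatures.add(feature);
}

// Guard to call before starting any path that can produce RML-3 effects.
function isExecutionAllowed(feature: string): boolean {
  return !disabledFeatures.has(feature);
}
```

Note that the guard only stops new executions. Effects already written to history are handled in the Decide and Act steps.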

2.3 Understand — build a single timeline (RML-1/2 analysis + RML-3 records)

Here you answer:

  • what happened (facts)
  • which world(s) it happened in (RML-1/2/3)
  • how much is already written as history (RML-3 residue)

You usually reconcile:

  • app logs (RML-1/2)
  • traces/metrics
  • Effect Ledger (RML-3)

Your output should be a single coherent timeline, not a pile of fragments.
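Reconciling the sources is mostly a merge-and-sort problem once every fragment carries a timestamp and a world. The sketch below assumes a hypothetical `TimelineEntry` shape with ISO 8601 UTC timestamps; the `source` values and the function name `buildTimeline` are illustrative, not part of the series.

```typescript
type RmlWorld = "RML1" | "RML2" | "RML3";

// Hypothetical normalized entry; each source is converted to this shape first.
interface TimelineEntry {
  at: string; // ISO 8601 UTC timestamp
  world: RmlWorld;
  source: "app_log" | "trace" | "effect_ledger";
  detail: string;
}

// Merge per-source fragments into one timeline ordered by time.
function buildTimeline(...fragments: TimelineEntry[][]): TimelineEntry[] {
  const merged = ([] as TimelineEntry[]).concat(...fragments);
  return merged.sort((a, b) => Date.parse(a.at) - Date.parse(b.at));
}
```

Keeping `world` and `source` on every entry lets the reader see at a glance where an RML-2 symptom turned into an RML-3 record.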

2.4 Decide — the triangle meeting

This is where Chapter 8’s triangle becomes real:

  • Legal: minimum obligations (regulation/contracts/privacy)
  • Business: how far to go (customer trust/brand/cost)
  • SRE/Engineering: what is technically possible + risk trade-offs

Decision framing:

“How do we update history forward—without pretending we can delete it?”

Typical plan components:

  • refund / re-run / correction notice / feature disable / rollback-forward
  • customer notification vs press release vs ToS-based disclosure rules
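One way to make the triangle enforceable is to treat a decision as incomplete until all three roles are recorded on it. Here is a small sketch under that assumption; `TriangleRole`, `DecisionRecord`, and `missingRoles` are hypothetical names, and the role labels follow the `decided_by` entries in the case-file template later in this chapter.

```typescript
type TriangleRole = "legal" | "business" | "sre";

// Hypothetical decision record; mirrors the decisions section of a case file.
interface DecisionRecord {
  summary: string;
  decidedBy: { role: TriangleRole; name: string }[];
}

// Returns the triangle roles that have not signed off yet.
function missingRoles(decision: DecisionRecord): TriangleRole[] {
  const required: TriangleRole[] = ["legal", "business", "sre"];
  const present = new Set(decision.decidedBy.map((d) => d.role));
  return required.filter((role) => !present.has(role));
}
```

An empty result means the plan has been seen from all three corners; anything else names who still needs to be pulled into the meeting.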

2.5 Act — remediation + communication (History World updates)

Now you execute:

  • refunds
  • data corrections
  • customer emails/calls
  • internal and external explanations

Key point: these actions are new History World events.

You do not erase the original record. You add correction events on top of it.
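The "add, never erase" rule maps naturally onto an append-only log in which a correction points back at the event it corrects. The following is an illustrative sketch, not the Effect Ledger design from earlier chapters; `EffectEvent`, `AppendOnlyLedger`, and the `corrects` field are assumptions made for this example.

```typescript
// Hypothetical event shape for a money ledger.
interface EffectEvent {
  id: string;
  kind: "charge" | "refund";
  amount: number;    // positive amount in minor units (e.g. cents)
  corrects?: string; // id of the earlier event this one corrects
}

class AppendOnlyLedger {
  private readonly events: EffectEvent[] = [];

  // The only write operation: there is no update or delete.
  append(event: EffectEvent): void {
    const target = event.corrects;
    if (target !== undefined && !this.events.some((e) => e.id === target)) {
      throw new Error(`Unknown event to correct: ${target}`);
    }
    this.events.push(event);
  }

  // The net effect is derived from the full history.
  netCharged(): number {
    return this.events.reduce(
      (sum, e) => sum + (e.kind === "charge" ? e.amount : -e.amount),
      0,
    );
  }

  history(): readonly EffectEvent[] {
    return this.events;
  }
}
```

After a duplicate charge and its refund, the net amount is correct again, but the history still shows three events: that is the record an auditor or a customer-support agent needs.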

2.6 Learn — prevention + governance improvements (back to RML-1/2)

Postmortem is where you convert RML-3 pain into RML-1/2 improvements:

  • technical design fixes
  • monitoring + runbook changes
  • process/governance gaps (ToS mismatch, escalation rules, ownership)

The objective:

Transform History World events into repeatable improvements—rather than shame artifacts.


3) Runbook vs Playbook (RML mapping)

Ops docs often come in two forms:

  • Runbook: step-by-step operation manual (How)
  • Playbook: scenario/decision flow (What/Who)

RML mapping makes this clean:

  • Runbook ≈ RML-2

    • containment, rollback-forward, feature flags, commands
  • Playbook ≈ RML-3

    • who must be involved, how decisions are made, comms strategy

A practical pattern:

  • RML-1/2 events (alerts, bugs) stay in runbooks
  • the moment you detect world="RML3", you hand off to playbooks

4) A one-page case-file template (YAML)

Here is a “casebook template” you can actually use.

incident_id: INC-2025-0001
title: "Possible double charge"
rml_world: RML3
status: resolved # or ongoing, monitoring

detected_at: 2025-06-01T10:23:45Z
detected_by:
  - type: alert
    source: "payment_service.rml3_error_rate"
  - type: cs_ticket
    id: "CS-1234"

classification:
  severity: SEV-1 # align with your org’s scheme
  asset:
    - money
  scope:
    affected_users_estimate: 37
    affected_transactions_estimate: 41

timeline:
  - at: 2025-06-01T10:23:45Z
    world: RML2
    type: detection
    detail: "RML3 error rate crossed threshold"
  - at: 2025-06-01T10:25:00Z
    world: RML2
    type: containment
    detail: "Feature flag off to stop new payment execution"
  - at: 2025-06-01T11:10:00Z
    world: RML3
    type: impact_assessed
    detail: "37 users / 41 transactions impacted"

decisions:
  summary: >
    Full refund to all impacted users + apology coupon.
    No public press release required.
  decided_by:
    - role: legal
      name: "..."
    - role: business
      name: "..."
    - role: sre
      name: "..."

actions:
  technical:
    - type: refund
      world: RML3
      count: 41
    - type: feature_fix
      world: RML2
      detail: "Require idempotency keys for payment saga steps"
  communication:
    - type: user_email
      world: RML3
      template_id: "refund_apology_v2"
    - type: internal_post
      world: RML2
      channel: "#incident"

lessons_learned:
  technical:
    - "Make idempotency keys mandatory for payment saga steps"
  process:
    - "Add CS handoff procedure when RML3 alerts fire"
  governance:
    - "Review ToS compensation clause vs actual practice"

Why this works:

  • the timeline includes world: RML1/2/3
  • actions also include which world they update
  • decisions record who decided what
  • lessons split into technical/process/governance

This makes later review possible:

  • where are we slow: detect, contain, or decide?
  • which bottleneck keeps recurring: technical or governance?
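The "where are we slow?" question can be answered mechanically from the timeline section of a case file. Here is a minimal sketch, assuming the timeline has already been parsed into objects with the template's `at` and `type` fields; `CaseEvent`, `PhaseGap`, and `phaseDurations` are hypothetical names.

```typescript
// Subset of a timeline entry from the case-file template.
interface CaseEvent {
  at: string;   // ISO 8601 timestamp
  type: string; // e.g. "detection", "containment", "impact_assessed"
}

interface PhaseGap { from: string; to: string; minutes: number }

// Minutes elapsed between each consecutive pair of timeline events.
function phaseDurations(timeline: CaseEvent[]): PhaseGap[] {
  const sorted = timeline.slice().sort((a, b) => Date.parse(a.at) - Date.parse(b.at));
  const gaps: PhaseGap[] = [];
  let previous: CaseEvent | undefined;
  for (const current of sorted) {
    if (previous !== undefined) {
      const ms = Date.parse(current.at) - Date.parse(previous.at);
      gaps.push({ from: previous.type, to: current.type, minutes: ms / 60000 });
    }
    previous = current;
  }
  return gaps;
}
```

Applied to the three timeline entries in the template above, this yields 1.25 minutes from detection to containment and 45 minutes from containment to impact assessment, which suggests the Understand step is where the time went.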

5) A concrete story: payment saga → RML-3 incident

A common path into RML-3 is an RML-2 mistake.

Example anti-pattern:

// Anti-pattern: retry without an idempotency key (duplicates can happen)
type PaymentGateway = {
  charge: (req: { paymentId: string; amount: number }) => Promise<void>;
};
declare const paymentGateway: PaymentGateway;

// (Assume `RmlError` is the structured error type from Chapter 5.)
async function charge(paymentId: string, amount: number) {
  try {
    await paymentGateway.charge({ paymentId, amount });
  } catch (e) {
    throw new RmlError({
      world: "RML2",
      severity: "error",
      action: "retry-with-backoff",
      code: "PAYMENT_GATEWAY_TEMPORARY_ERROR",
      message: "Temporary payment error",
      cause: e,
    });
  }
}

The client follows the retry-with-backoff action and retries a few times.
Without an idempotency key, the gateway may apply the charge more than once.

This is how:

  • an RML-2 “retry policy” bug
  • becomes an RML-3 money incident
  • which then returns as an RML-2 design fix:

    • require idempotency keys
    • enforce with SDK/lint/tests

It’s the full “world loop” in one example.
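For contrast, the RML-2 design fix might look like the sketch below. It assumes a gateway that deduplicates on an idempotency key; `FakeGateway` is a hypothetical in-memory stand-in (not a real payment SDK) that simulates the dangerous case where the charge is applied but the response is lost, and `chargeWithRetry` omits backoff delays for brevity.

```typescript
// Hypothetical request shape: the same idempotencyKey must never charge twice.
interface ChargeRequest { paymentId: string; amount: number; idempotencyKey: string }

// In-memory stand-in for a gateway that honors idempotency keys.
class FakeGateway {
  private readonly seenKeys = new Set<string>();
  private failuresLeft: number;
  chargesApplied = 0;

  constructor(failFirst: number) {
    this.failuresLeft = failFirst;
  }

  async charge(req: ChargeRequest): Promise<void> {
    // Apply the charge at most once per key.
    if (!this.seenKeys.has(req.idempotencyKey)) {
      this.seenKeys.add(req.idempotencyKey);
      this.chargesApplied += 1;
    }
    // Simulate "charge applied, but the response was lost".
    if (this.failuresLeft > 0) {
      this.failuresLeft -= 1;
      throw new Error("timeout");
    }
  }
}

async function chargeWithRetry(
  gateway: FakeGateway,
  paymentId: string,
  amount: number,
  maxAttempts = 3,
): Promise<void> {
  // One key per logical payment, reused on every retry.
  const idempotencyKey = `charge:${paymentId}`;
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await gateway.charge({ paymentId, amount, idempotencyKey });
      return;
    } catch (e) {
      lastError = e; // backoff delay omitted for brevity
    }
  }
  throw lastError;
}
```

The key is derived from the payment, not from the attempt. Generating a fresh key per retry would reintroduce the original bug.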


6) Anti-patterns: when case files stop working

6.1 Writing the report only “for appearance”

If the case file is written after the fact to look clean:

  • timeline diverges from reality
  • decisions are retroactively rationalized

Fix:

  • fill the timeline in real time
  • write decision logs at the moment decisions happen
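One lightweight way to keep the timeline honest is to store when each entry was written alongside when the event happened. The sketch below assumes a `recordedAt` field that is not part of the template above, and a 15-minute tolerance chosen arbitrarily for illustration; `LoggedEntry` and `isBackfilled` are hypothetical names.

```typescript
// Hypothetical extension of a timeline entry.
interface LoggedEntry {
  at: string;         // when the event happened (ISO 8601)
  recordedAt: string; // when it was written into the case file (ISO 8601)
  detail: string;
}

// Entries written long after the fact are flagged rather than silently blended in.
// The default tolerance is an arbitrary example value.
function isBackfilled(entry: LoggedEntry, toleranceMinutes = 15): boolean {
  const lagMs = Date.parse(entry.recordedAt) - Date.parse(entry.at);
  return lagMs > toleranceMinutes * 60000;
}
```

Backfilled entries are not forbidden; marking them simply tells a later reader which parts of the timeline were reconstructed from memory.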

6.2 Treating an RML-3 incident as a “lighter” bug

This is cultural:

  • the org doesn’t want to admit history-grade impact
  • so it gets down-scoped as “just an RML-2 defect”

Result:

  • later: “why wasn’t this reported/escalated?”
  • trust damage compounds

Fix:

  • predefine what counts as RML-3
  • when unsure, treat gray cases as RML-3 and down-scope later

6.3 Only writing the technical root cause

If postmortems omit governance/process/ToS aspects, the same class of incident repeats in new shapes.

Fix:

  • always keep:

    • lessons_learned.technical
    • lessons_learned.process
    • lessons_learned.governance
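The three-category rule is easy to check before a postmortem is closed. This is a small sketch, assuming the `lessons_learned` section has been parsed into arrays of strings; `LessonsLearned` and `emptyLessonCategories` are hypothetical names.

```typescript
// Parsed form of the lessons_learned section of a case file.
interface LessonsLearned {
  technical: string[];
  process: string[];
  governance: string[];
}

// Returns the categories that have no entries yet.
function emptyLessonCategories(lessons: LessonsLearned): (keyof LessonsLearned)[] {
  const categories: (keyof LessonsLearned)[] = ["technical", "process", "governance"];
  return categories.filter((category) => lessons[category].length === 0);
}
```

An empty category is a prompt for the reviewers rather than an error: "nothing to learn on governance" is a valid conclusion, but it should be a deliberate one.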

7) Checklist: how ready is your RML-3 casebook?

Definitions

  • [ ] Can you explain alert vs bug vs incident using RML?
  • [ ] Is your RML-3 incident definition documented?
  • [ ] Do you have a “gray → treat as RML-3” principle?

Flow

  • [ ] Do you have a documented Detect/Contain/Understand/Decide/Act/Learn flow?
  • [ ] Do world="RML3" signals auto-create incident tickets?
  • [ ] Do you have kill switches / feature flags for containment?

Template

  • [ ] Do case files include a world per timeline event?
  • [ ] Do decisions capture “who decided what”?
  • [ ] Are lessons split across technical/process/governance?

Docs

  • [ ] Runbooks exist for RML-2 containment steps
  • [ ] Playbooks exist for RML-3 decision + comms steps
  • [ ] Lessons feed back into runbooks/playbooks

Culture

  • [ ] Legal/Business/SRE attend RML-3 postmortems
  • [ ] You can roughly estimate RML-3 costs (refunds, credits, ops)
  • [ ] RML-3 incidents are treated as learning opportunities

Closing — case files are worldview logs

This chapter’s one-liner:

An RML-3 case file is a log of worldview:
which world changed, what residue remained, who decided what, and how you corrected forward.

  • RML-1/2 explain how the system behaved internally
  • RML-3 records what became history
  • the organization decides how to update history forward
  • and then you rewrite RML-1/2 design to prevent repeat incidents

Next:

Chapter 10 — RML as Product Strategy: Designing Trust