The Worlds of Distributed Systems — Chapter 9
“When an incident happens…
what is supposed to be happening behind the screen?”
In Chapter 8, we reframed RML-3 (History World) as an autonomy domain operated by a triangle:
- Legal / Compliance
- Business
- SRE / Engineering
In this chapter, we move one step closer to the field and answer:
When an RML-3 incident happens, what should the actual response flow look like?
The goal is not “the ideal incident process.”
It’s a practical case-file format that connects:
- the RML-1/2/3 worldview, and
- the incident response workflow teams actually run.
Think of it as:
A “case file” is a worldview log:
which world changed, when, and who decided what about it.
1) Alert vs Bug vs Incident — separate them by world
In day-to-day operations, teams mix these terms:
- “An alert fired”
- “We hit a bug”
- “We had an incident”
RML makes the distinction crisp.
1.1 Alert
An alert is typically:
- a monitoring threshold was exceeded
- error rate increased
- latency degraded
This usually lives in RML-1/2:
- many are transient
- there may be user impact
- but once recovered, no history-grade residue remains
So alerts often resolve inside the RML-1/2 flow.
1.2 Bug (Defect)
A bug is:
- behavior differs from spec
- certain conditions cause errors
- the system acts wrong—but hasn’t reached the History World
This is mostly RML-1/2 territory:
- track it in tickets (JIRA, GitHub issues)
- fix it with normal engineering cycles
1.3 Incident
In this series, we define an incident as:
A meaningful responsibility-bearing event that has reached the History World (RML-3).
Examples:
- money moved (or should have moved but didn’t)
- customers/third parties were affected in the real world
- there is regulatory/contractual/social responsibility exposure
In RML terms:
- world = "RML3", action = "escalate-history"
- should be recorded in an Effect Ledger
- requires an incident report / case file
This is what belongs in your “casebook.”
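The three definitions above can be read as a precedence rule: anything that has reached the History World is an incident, whatever else is also true of it. A minimal sketch, assuming a hypothetical `OpsSignal` shape (the field names are illustrative and not from the series):

```typescript
// Hypothetical signal shape; these field names are assumptions for illustration.
type OpsSignal = {
  thresholdExceeded: boolean;   // a monitoring threshold fired
  deviatesFromSpec: boolean;    // behavior differs from spec
  reachedHistoryWorld: boolean; // money moved, third parties affected, legal exposure
};

type OpsKind = "incident" | "bug" | "alert" | "none";

// History World impact wins over everything else; a plain alert is the weakest signal.
function classifySignal(s: OpsSignal): OpsKind {
  if (s.reachedHistoryWorld) return "incident";
  if (s.deviatesFromSpec) return "bug";
  if (s.thresholdExceeded) return "alert";
  return "none";
}
```

An alert that also turns out to have moved money classifies as an incident, not as an alert.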
2) The RML-3 incident flow: a 6-step casebook
A practical response flow that teams can template as-is:
- Detect
- Contain
- Understand
- Decide
- Act
- Learn
The key improvement is:
Attach “which world this step lives in” to every phase.
2.1 Detect — discovery (RML-2 → RML-3 boundary)
Triggers typically include:
- RML-2 code throws world="RML3" errors
- gateway/SLO detects elevated “RML-3 errors”
- CS/Sales escalates customer complaints
The most important move:
The moment you suspect “this might be RML-3,”
switch into the RML-3 workflow immediately.
Concrete checks:
- [ ] incident ticket created with RML3 label
- [ ] provisional Effect Ledger record created
- [ ] you didn’t “handle it” via retries/compensation and hope nobody notices
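One way to make the first two checks automatic is a single "open the case" function that runs on suspicion alone. This is a sketch under assumptions: `IncidentTicket`, `ProvisionalEffect`, and `CaseStore` are hypothetical in-memory shapes, since the chapter names the ticket and the Effect Ledger but does not define their fields.

```typescript
// Hypothetical shapes; a real system would back these with a ticketing tool and a durable ledger.
type IncidentTicket = { id: string; labels: string[]; openedAt: string };
type ProvisionalEffect = {
  incidentId: string;
  world: "RML3";
  status: "provisional";
  recordedAt: string;
};

class CaseStore {
  readonly tickets: Record<string, IncidentTicket> = {};
  readonly ledger: ProvisionalEffect[] = [];
}

// Opens the RML-3 workflow. A second call for the same incident returns the
// existing ticket, so an alert and a CS escalation can both trigger it safely.
function openRml3Case(store: CaseStore, incidentId: string, now: Date): IncidentTicket {
  const existing = store.tickets[incidentId];
  if (existing) return existing;

  const ticket: IncidentTicket = {
    id: incidentId,
    labels: ["RML3"],
    openedAt: now.toISOString(),
  };
  store.tickets[incidentId] = ticket;
  store.ledger.push({
    incidentId,
    world: "RML3",
    status: "provisional",
    recordedAt: now.toISOString(),
  });
  return ticket;
}
```

The record is marked provisional on purpose: it can be confirmed or downscoped later, but it exists from the first moment of suspicion.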
2.2 Contain — stop the bleed (mostly RML-2)
Containment means:
prevent further History World damage from accumulating
Typical actions:
- feature flag off / kill switch
- block a tenant/region
- stop new execution paths at RML-2 level (e.g., immediate cancels)
This is SRE/Engineering-led, but if business impact is large, bring in PO/leadership quickly.
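As an illustration of the containment actions above, here is a minimal in-memory kill switch. `KillSwitch` and `startPayment` are hypothetical names; a real deployment would more likely read these flags from a feature-flag service.

```typescript
// Hypothetical in-memory kill switch: a global off switch plus per-tenant blocks.
class KillSwitch {
  private allOff = false;
  private blockedTenants: Record<string, boolean> = {};

  disableAll(): void {
    this.allOff = true;
  }

  blockTenant(tenantId: string): void {
    this.blockedTenants[tenantId] = true;
  }

  allows(tenantId: string): boolean {
    return !this.allOff && this.blockedTenants[tenantId] !== true;
  }
}

// Gate every new execution path on the switch, so containment stops new
// History World effects without touching work that has already been recorded.
function startPayment(sw: KillSwitch, tenantId: string): "started" | "rejected" {
  return sw.allows(tenantId) ? "started" : "rejected";
}
```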
2.3 Understand — build a single timeline (RML-1/2 analysis + RML-3 records)
Here you answer:
- what happened (facts)
- which world(s) it happened in (RML-1/2/3)
- how much is already written as history (RML-3 residue)
You usually reconcile:
- app logs (RML-1/2)
- traces/metrics
- Effect Ledger (RML-3)
Your output should be a single coherent timeline, not a pile of fragments.
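Mechanically, the reconciliation is a merge of several event lists into one list ordered by time, with each entry keeping its world and its source. A minimal sketch, assuming a hypothetical `TimelineEntry` shape and ISO 8601 timestamps:

```typescript
// Hypothetical entry shape; the source names mirror the list above.
type TimelineEntry = {
  at: string; // ISO 8601 timestamp
  world: "RML1" | "RML2" | "RML3";
  source: "app_log" | "trace" | "effect_ledger";
  detail: string;
};

// Merges per-source fragments into one timeline ordered by timestamp.
function buildTimeline(sources: TimelineEntry[][]): TimelineEntry[] {
  const merged: TimelineEntry[] = [];
  for (const src of sources) {
    for (const entry of src) merged.push(entry);
  }
  return merged.sort((a, b) => Date.parse(a.at) - Date.parse(b.at));
}
```

Keeping `world` and `source` on every entry is what lets a reader see where an RML-2 event turned into RML-3 residue.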
2.4 Decide — the triangle meeting
This is where Chapter 8’s triangle becomes real:
- Legal: minimum obligations (regulation/contracts/privacy)
- Business: how far to go (customer trust/brand/cost)
- SRE/Engineering: what is technically possible + risk trade-offs
Decision framing:
“How do we update history forward—without pretending we can delete it?”
Typical plan components:
- refund / re-run / correction notice / feature disable / rollback-forward
- customer notification vs press release vs ToS-based disclosure rules
2.5 Act — remediation + communication (History World updates)
Now you execute:
- refunds
- data corrections
- customer emails/calls
- internal and external explanations
Key point: these actions are new History World events.
You do not erase the original record. You add correction events on top of it.
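The "add, never erase" rule can be sketched as an append-only ledger in which a refund is a new event that points back at the charge it corrects. `EffectEvent` and `AppendOnlyLedger` are hypothetical names; the chapter does not specify the Effect Ledger's schema.

```typescript
// Hypothetical event shape; amounts are in minor units (e.g. cents).
type EffectEvent = {
  id: string;
  kind: "charge" | "refund";
  amount: number;
  corrects?: string; // id of the earlier event this one corrects
};

class AppendOnlyLedger {
  private events: EffectEvent[] = [];

  // The only write operation: there is deliberately no update or delete.
  append(event: EffectEvent): void {
    const target = event.corrects;
    if (target !== undefined && !this.events.some((e) => e.id === target)) {
      throw new Error("correction targets unknown event: " + target);
    }
    this.events.push(event);
  }

  // Returns a copy so callers cannot mutate recorded history.
  all(): EffectEvent[] {
    return this.events.slice();
  }

  netCharged(): number {
    let total = 0;
    for (const e of this.events) {
      total += e.kind === "charge" ? e.amount : -e.amount;
    }
    return total;
  }
}
```

After a duplicate charge and its refund, the ledger holds three events, and the net amount matches what the customer should have paid.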
2.6 Learn — prevention + governance improvements (back to RML-1/2)
Postmortem is where you convert RML-3 pain into RML-1/2 improvements:
- technical design fixes
- monitoring + runbook changes
- process/governance gaps (ToS mismatch, escalation rules, ownership)
The objective:
Transform History World events into repeatable improvements—rather than shame artifacts.
3) Runbook vs Playbook (RML mapping)
Ops docs often come in two forms:
- Runbook: step-by-step operation manual (How)
- Playbook: scenario/decision flow (What/Who)
RML mapping makes this clean:
- Runbook ≈ RML-2
  - containment, rollback-forward, feature flags, commands
- Playbook ≈ RML-3
  - who must be involved, how decisions are made, comms strategy
A practical pattern:
- RML-2 incidents stay in runbooks
- the moment you detect RML3, you hand off to playbooks
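The handoff rule is small enough to encode directly. In this sketch, `selectOpsDoc` and the `ledBy` role names are hypothetical; the roles simply reuse the Chapter 8 triangle.

```typescript
type IncidentWorld = "RML2" | "RML3";
type OpsDoc = { kind: "runbook" | "playbook"; ledBy: string[] };

// RML-2 stays with the runbook; RML-3 hands off to the playbook and the full triangle.
function selectOpsDoc(world: IncidentWorld): OpsDoc {
  if (world === "RML3") {
    return { kind: "playbook", ledBy: ["legal", "business", "sre"] };
  }
  return { kind: "runbook", ledBy: ["sre"] };
}
```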
4) A one-page case-file template (YAML)
Here is a “casebook template” you can actually use.
incident_id: INC-2025-0001
title: "Possible double charge"
rml_world: RML3
status: resolved # or ongoing, monitoring
detected_at: 2025-06-01T10:23:45Z
detected_by:
  - type: alert
    source: "payment_service.rml3_error_rate"
  - type: cs_ticket
    id: "CS-1234"
classification:
  severity: SEV-1 # align with your org’s scheme
  asset:
    - money
  scope:
    affected_users_estimate: 37
    affected_transactions_estimate: 41
timeline:
  - at: 2025-06-01T10:23:45Z
    world: RML2
    type: detection
    detail: "RML3 error rate crossed threshold"
  - at: 2025-06-01T10:25:00Z
    world: RML2
    type: containment
    detail: "Feature flag off to stop new payment execution"
  - at: 2025-06-01T11:10:00Z
    world: RML3
    type: impact_assessed
    detail: "37 users / 41 transactions impacted"
decisions:
  summary: >
    Full refund to all impacted users + apology coupon.
    No public press release required.
  decided_by:
    - role: legal
      name: "..."
    - role: business
      name: "..."
    - role: sre
      name: "..."
actions:
  technical:
    - type: refund
      world: RML3
      count: 41
    - type: feature_fix
      world: RML2
      detail: "Require idempotency keys for payment saga steps"
  communication:
    - type: user_email
      world: RML3
      template_id: "refund_apology_v2"
    - type: internal_post
      world: RML2
      channel: "#incident"
lessons_learned:
  technical:
    - "Make idempotency keys mandatory for payment saga steps"
  process:
    - "Add CS handoff procedure when RML3 alerts fire"
  governance:
    - "Review ToS compensation clause vs actual practice"
Why this works:
- the timeline includes world: RML1/2/3
- actions also include which world they update
- decisions record who decided what
- lessons split into technical/process/governance
This makes later review possible:
- where are we slow? detect/contain/decide?
- what bottleneck is repeated? technical or governance?
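Because every timeline event carries a timestamp and a type, the "where are we slow?" question reduces to arithmetic. A minimal sketch, assuming events have been loaded from the template's `timeline` field into a hypothetical `CaseEvent` array:

```typescript
// Hypothetical shape mirroring the `at` and `type` fields of the template's timeline.
type CaseEvent = { at: string; type: string };

// Minutes from the first `fromType` event to the first `toType` event,
// or undefined if either phase is missing from the case file.
function minutesBetween(
  events: CaseEvent[],
  fromType: string,
  toType: string
): number | undefined {
  let from: CaseEvent | undefined;
  let to: CaseEvent | undefined;
  for (const e of events) {
    if (!from && e.type === fromType) from = e;
    if (!to && e.type === toType) to = e;
  }
  if (!from || !to) return undefined;
  return (Date.parse(to.at) - Date.parse(from.at)) / 60000;
}
```

Applied to the template's example timeline, detection to containment is 1.25 minutes and containment to impact assessment is 45 minutes. Aggregating those numbers across case files shows which phase is the recurring bottleneck.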
5) A concrete story: payment saga → RML-3 incident
A common path into RML-3 is an RML-2 mistake.
Example anti-pattern:
// Anti-pattern: retry without an idempotency key (duplicates can happen)
type PaymentGateway = {
  charge: (req: { paymentId: string; amount: number }) => Promise<void>;
};
declare const paymentGateway: PaymentGateway;

// (Assume `RmlError` is the structured error type from Chapter 5.)
async function charge(paymentId: string, amount: number) {
  try {
    await paymentGateway.charge({ paymentId, amount });
  } catch (e) {
    throw new RmlError({
      world: "RML2",
      severity: "error",
      action: "retry-with-backoff",
      code: "PAYMENT_GATEWAY_TEMPORARY_ERROR",
      message: "Temporary payment error",
      cause: e,
    });
  }
}
The client follows the retry-with-backoff action and retries a few times.
Without idempotency, the gateway may charge multiple times.
This is how:
- an RML-2 “retry policy” bug
- becomes an RML-3 money incident
- which then returns as an RML-2 design fix:
  - require idempotency keys
  - enforce with SDK/lint/tests
It’s the full “world loop” in one example.
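For contrast with the anti-pattern, here is a sketch of the design fix. It uses a synchronous in-memory stand-in (`InMemoryGateway`, a hypothetical name) rather than a real gateway, and it assumes the gateway deduplicates on an `idempotencyKey` field; how a real payment provider exposes idempotency varies and is not specified in this chapter.

```typescript
// `idempotencyKey` is a required field, so a caller cannot omit it and still compile.
type ChargeRequest = { idempotencyKey: string; paymentId: string; amount: number };

// In-memory stand-in for a gateway that deduplicates on the idempotency key.
class InMemoryGateway {
  private processed: Record<string, boolean> = {};
  chargesApplied = 0;

  charge(req: ChargeRequest): void {
    if (this.processed[req.idempotencyKey] === true) return; // duplicate: no second charge
    this.processed[req.idempotencyKey] = true;
    this.chargesApplied += 1;
  }
}

// Simulates a client that retries because it never saw the first response.
// The key is derived from the payment, not from the attempt, so every retry reuses it.
function chargeWithRetries(
  gateway: InMemoryGateway,
  paymentId: string,
  amount: number,
  attempts: number
): void {
  const idempotencyKey = "charge:" + paymentId;
  for (let i = 0; i < attempts; i++) {
    gateway.charge({ idempotencyKey, paymentId, amount });
  }
}
```

Three attempts for the same payment result in one applied charge, which keeps the retry policy inside RML-2.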
6) Anti-patterns: when case files stop working
6.1 Writing the report only “for appearance”
If the case file is written after the fact to look clean:
- timeline diverges from reality
- decisions are retroactively rationalized
Fix:
- fill the timeline in real time
- write decision logs at the moment decisions happen
6.2 Treating an RML-3 incident as a “lighter” bug
This is cultural:
- the org doesn’t want to admit history-grade impact
- so it gets down-scoped as “just an RML-2 defect”
Result:
- later: “why wasn’t this reported/escalated?”
- trust damage compounds
Fix:
- predefine what counts as RML-3
- when unsure, treat gray as RML-3 and downscope later
6.3 Only writing the technical root cause
If postmortems omit governance/process/ToS aspects, the same class of incident repeats in new shapes.
Fix:
- always keep:
  - lessons_learned.technical
  - lessons_learned.process
  - lessons_learned.governance
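This rule is easy to check mechanically before a case file is closed. A minimal sketch, assuming the case file has been parsed into an object whose `lessons_learned` field has the hypothetical `LessonsLearned` shape below:

```typescript
type LessonsLearned = {
  technical?: string[];
  process?: string[];
  governance?: string[];
};

const REQUIRED_LESSON_CATEGORIES: Array<keyof LessonsLearned> = [
  "technical",
  "process",
  "governance",
];

// Returns the categories that are absent or empty; an empty result means the rule holds.
function missingLessonCategories(lessons: LessonsLearned): string[] {
  return REQUIRED_LESSON_CATEGORIES.filter((key) => {
    const entries = lessons[key];
    return entries === undefined || entries.length === 0;
  });
}
```

A check like this could run in CI or in the incident tool when a case file moves to resolved.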
7) Checklist: how ready is your RML-3 casebook?
Definitions
- [ ] Can you explain alert vs bug vs incident using RML?
- [ ] Is your RML-3 incident definition documented?
- [ ] Do you have a “gray → treat as RML-3” principle?
Flow
- [ ] Do you have Detect/Contain/Understand/Decide/Act/Learn?
- [ ] Do world="RML3" signals auto-create incident tickets?
- [ ] Do you have kill switches / feature flags for containment?
Template
- [ ] Do case files include a world per timeline event?
- [ ] Do decisions capture “who decided what”?
- [ ] Are lessons split across technical/process/governance?
Docs
- [ ] Runbooks exist for RML-2 containment steps
- [ ] Playbooks exist for RML-3 decision + comms steps
- [ ] Lessons feed back into runbooks/playbooks
Culture
- [ ] Legal/Business/SRE attend RML-3 postmortems
- [ ] You can roughly estimate RML-3 costs (refunds, credits, ops)
- [ ] RML-3 incidents are treated as learning opportunities
Closing — case files are worldview logs
This chapter’s one-liner:
An RML-3 case file is a log of worldview:
which world changed, what residue remained, who decided what, and how you corrected forward.
- RML-1/2 explain how the system behaved internally
- RML-3 records what became history
- the organization decides how to update history forward
- and then you rewrite RML-1/2 design to prevent repeat incidents
Next:
Chapter 10 — RML as Product Strategy: Designing Trust