The Worlds of Distributed Systems — Chapter 5
“Just catch all exceptions and show the user:
‘An error occurred. Please try again.’”
In distributed systems, this approach doesn’t “handle errors.”
It sets the world on fire.
In Chapter 1, we reframed rollback using three worlds:
- RML-1 — Closed World
- RML-2 — Dialog World
- RML-3 — History World
This chapter zooms into the world where teams stumble most often: RML-2 (Dialog World).
The goal is practical:
- classify failures by which world they belong to
- decide where to catch them and how far to propagate
- connect that design to observability and on-call governance
1) First: classify failures by world
If you pull RML into “failure thinking,” you get a useful picture:
| World | What kind of failure? | Blast radius | Examples |
|---|---|---|---|
| RML-1 | process-local failure | your memory / temp files | OOM, validation failure, local invariants |
| RML-2 | failure during dialog | your service + neighbor services + some users | downstream outage, saga step failure, broker outage |
| RML-3 | history-grade failure | reality shared by org/society | double charge, mis-send, compliance incident |
This chapter focuses on failures happening in RML-2.
In RML-2, the moment an exception happens, you must (often) decide—by pattern, not by vibes:
- is this safe to resolve locally (RML-1-ish)?
- is this a dialog failure requiring saga/reconcile logic (RML-2)?
- is this already history-bound and must escalate (RML-3)?
If you keep this ambiguous, you’ll eventually do something fatal:
You’ll make an RML-3 incident look like it was “handled” inside RML-1.
And that’s how SRE ends up doing overnight incident response by default.
2) The RML-2 “three-layer” model: Local / Dialog / History-bound
When you look at RML-2 failures in practice, it helps to split them into three layers:
- Local Failure (still closed enough)
- Dialog Failure (inside an ongoing conversation)
- History-bound Failure (should be treated as RML-3)
2.1 Local Failure — still inside the room
- nothing meaningful has escaped yet
- input validation, local invariants, per-item batch errors
Handle it like RML-1:
- roll back locally (retry/abort)
- log it
- return a normal error to the caller
If you escalate every local failure as a major incident, operations will burn out fast.
2.2 Dialog Failure — failure as part of the conversation
Examples:
- RPC timeout to a downstream service
- step 3 of a saga fails
- publish to broker fails
These failures need context-aware handling:
- should the saga compensate?
- is this retryable (with backoff)?
- should we pause and reconcile later?
If you just catch and return 500, you cut the conversation off mid-flow and leave the saga in an inconsistent state.
2.3 History-bound Failure — this is already RML-3
Even if you thought you were designing “just RML-2,” you will hit situations like:
- payment already settled externally
- email/notification already delivered
- compliance/legal requires recording the cancellation/correction itself
In these cases, “undo” is not your tool. You must treat it as history:
- refunds
- corrections
- customer communication
- incident/case management
From inside RML-2 code, it may look like “just an exception.”
But operationally, it belongs to RML-3.
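As a rough illustration of the three-layer split, the decision can be written as a small function. This is only a sketch: the `FailureFacts` shape and its two flags are hypothetical names introduced here, and a real system would derive them from saga state or gateway responses.

```typescript
// Hypothetical facts a handler might know at the moment of failure.
type FailureFacts = {
  effectsEscaped: boolean; // did anything leave this process (RPC sent, message published)?
  irreversible: boolean; // did something happen that cannot be undone (payment settled, email delivered)?
};

type Layer = "local" | "dialog" | "history-bound";

// Check the most severe condition first so a history-bound failure
// can never be classified as something milder.
function classifyLayer(facts: FailureFacts): Layer {
  if (facts.irreversible) return "history-bound";
  if (facts.effectsEscaped) return "dialog";
  return "local";
}

// Validation failed before any call went out.
console.log(classifyLayer({ effectsEscaped: false, irreversible: false })); // local
// A downstream RPC timed out mid-saga.
console.log(classifyLayer({ effectsEscaped: true, irreversible: false })); // dialog
// The payment had already settled externally.
console.log(classifyLayer({ effectsEscaped: true, irreversible: true })); // history-bound
```

The ordering of the checks is the design choice that matters: irreversibility wins over everything else.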
3) Label errors structurally: Error → (World, Severity, Action)
The safest starting move is:
Make errors structured.
Here’s a compact TypeScript model (you can adapt it to any stack):
type World = "RML1" | "RML2" | "RML3";
type Severity = "info" | "warn" | "error" | "critical";
type ActionHint =
  | "retry-local" // retry immediately in-place
  | "retry-with-backoff" // retry with backoff/jitter
  | "start-compensation" // begin saga compensation
  | "escalate-history" // treat as RML-3 and escalate
  | "abort"; // stop now

class RmlError extends Error {
  readonly world: World;
  readonly severity: Severity;
  readonly action: ActionHint;
  readonly code: string;
  readonly cause?: unknown;

  constructor(args: {
    world: World;
    severity: Severity;
    action: ActionHint;
    code: string;
    message: string;
    cause?: unknown;
  }) {
    super(args.message);
    this.name = "RmlError";
    this.world = args.world;
    this.severity = args.severity;
    this.action = args.action;
    this.code = args.code;
    this.cause = args.cause;
  }
}
In real systems, you won’t get perfect purity. That’s fine.
Even if you only standardize these two fields, you win:
- `world`: which world this failure belongs to
- `action`: what the caller should do next
Because then exception handling becomes worldview-driven instead of “who guessed right this time.”
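One way to standardize just those two fields is a small catalog keyed by error code. The sketch below is an assumption about how a team might organize it, not part of the RML model itself: `ERROR_CATALOG`, `labelFor`, and the conservative fallback label are hypothetical, while the codes are the ones used in this chapter.

```typescript
type World = "RML1" | "RML2" | "RML3";
type ActionHint =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";

type Label = { world: World; action: ActionHint };

// Hypothetical catalog: one place where every error code gets its world and action.
const ERROR_CATALOG = new Map<string, Label>([
  ["INVALID_INPUT", { world: "RML1", action: "abort" }],
  ["STOCK_RESERVE_FAILED", { world: "RML2", action: "start-compensation" }],
  ["PAYMENT_CANCEL_FAILED", { world: "RML2", action: "retry-with-backoff" }],
  ["PAYMENT_ALREADY_SETTLED", { world: "RML3", action: "escalate-history" }],
]);

// Assumption: unknown codes are labeled conservatively (dialog world, no retry).
const UNKNOWN_LABEL: Label = { world: "RML2", action: "abort" };

function labelFor(code: string): Label {
  return ERROR_CATALOG.get(code) ?? UNKNOWN_LABEL;
}

console.log(labelFor("PAYMENT_ALREADY_SETTLED").action); // escalate-history
console.log(labelFor("SOMETHING_NEW").world); // RML2
```

Keeping the labels in one table means a reviewer can audit every code's world in a single diff instead of hunting through throw sites.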
4) Core handling patterns in RML-2
Pattern A: Local Failure ends locally
Input validation, local invariants, “nothing escaped yet.”
function validateInput(input: unknown) {
  if (!isValid(input)) {
    throw new RmlError({
      world: "RML1",
      severity: "warn",
      action: "abort",
      code: "INVALID_INPUT",
      message: "Invalid input",
    });
  }
}
The caller can:
- show a validation error
- treat the saga as “never started”
No need to elevate to dialog-level handling.
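A caller for this pattern might look like the sketch below. It uses a compact stand-in for `RmlError` so the block is self-contained; `validateOrder`, `handleCreateOrder`, the `startSaga` callback, and the 400/202 status choices are all hypothetical.

```typescript
type World = "RML1" | "RML2" | "RML3";

// Compact stand-in for the chapter's RmlError (only the fields used here).
class RmlError extends Error {
  readonly world: World;
  readonly code: string;
  constructor(world: World, code: string, message: string) {
    super(message);
    this.name = "RmlError";
    this.world = world;
    this.code = code;
  }
}

type HttpResult = { status: number; body: { code: string; message: string } };

// Hypothetical validator: an order needs a non-empty string id.
function validateOrder(input: unknown): { orderId: string } {
  const id = (input as { orderId?: unknown } | null)?.orderId;
  if (typeof id !== "string" || id.length === 0) {
    throw new RmlError("RML1", "INVALID_INPUT", "orderId is required");
  }
  return { orderId: id };
}

function handleCreateOrder(input: unknown, startSaga: (orderId: string) => void): HttpResult {
  try {
    const order = validateOrder(input); // fails before anything has escaped
    startSaga(order.orderId);
    return { status: 202, body: { code: "ACCEPTED", message: "Order accepted" } };
  } catch (e) {
    if (e instanceof RmlError && e.world === "RML1") {
      // Local failure: a normal client error, and the saga never started.
      return { status: 400, body: { code: e.code, message: e.message } };
    }
    throw e; // not local; let a dialog-aware layer decide
  }
}

const started: string[] = [];
const rejected = handleCreateOrder({}, (id) => started.push(id));
console.log(rejected.status, started.length); // 400 0
```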
Pattern B: Dialog Failure must be handled in saga context
Downstream failures should often become saga-level decisions.
async function reserveStock(orderId: string): Promise<void> {
  try {
    await stockService.reserve(orderId);
  } catch (e) {
    throw new RmlError({
      world: "RML2",
      severity: "error",
      action: "start-compensation",
      code: "STOCK_RESERVE_FAILED",
      message: "Failed to reserve stock",
      cause: e,
    });
  }
}
The saga runner can interpret:
world === "RML2" && action === "start-compensation"
…and switch into compensation flow.
Key point:
Don’t “just return 500.”
Decide failure behavior at saga design time.
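To make the saga runner's side concrete, here is a minimal sketch. It is synchronous for brevity (real steps would be async), the `SagaStep` shape and `runSaga` are hypothetical, and `RmlError` is a compact stand-in for the class defined earlier.

```typescript
type World = "RML1" | "RML2" | "RML3";
type ActionHint = "retry-local" | "retry-with-backoff" | "start-compensation" | "escalate-history" | "abort";

// Compact stand-in for the chapter's RmlError.
class RmlError extends Error {
  readonly world: World;
  readonly action: ActionHint;
  constructor(world: World, action: ActionHint, message: string) {
    super(message);
    this.name = "RmlError";
    this.world = world;
    this.action = action;
  }
}

type SagaStep = { name: string; run: () => void; compensate: () => void };
type SagaResult = { ok: true } | { ok: false; failedStep: string; compensated: string[] };

function runSaga(steps: SagaStep[]): SagaResult {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.run();
      done.push(step);
    } catch (e) {
      if (e instanceof RmlError && e.world === "RML2" && e.action === "start-compensation") {
        // Undo completed steps in reverse order.
        const compensated: string[] = [];
        for (const prev of [...done].reverse()) {
          prev.compensate();
          compensated.push(prev.name);
        }
        return { ok: false, failedStep: step.name, compensated };
      }
      throw e; // other worlds/actions are not this runner's decision
    }
  }
  return { ok: true };
}

const log: string[] = [];
const result = runSaga([
  { name: "createOrder", run: () => log.push("run:createOrder"), compensate: () => log.push("undo:createOrder") },
  {
    name: "reserveStock",
    run: () => {
      throw new RmlError("RML2", "start-compensation", "Failed to reserve stock");
    },
    compensate: () => log.push("undo:reserveStock"),
  },
  { name: "chargePayment", run: () => log.push("run:chargePayment"), compensate: () => log.push("undo:chargePayment") },
]);
console.log(log.join(",")); // run:createOrder,undo:createOrder
```

Note that the failed step itself is not compensated here; whether a partially executed step needs cleanup is a per-step design decision.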
Pattern C: History-bound Failure must escalate (RML-3)
Some failures are not “retry vs compensate.” They are “accept history and correct forward.”
async function cancelPayment(paymentId: string): Promise<void> {
  try {
    await paymentGateway.cancel(paymentId);
  } catch (e) {
    if (isAlreadySettledError(e)) {
      throw new RmlError({
        world: "RML3",
        severity: "critical",
        action: "escalate-history",
        code: "PAYMENT_ALREADY_SETTLED",
        message: "Payment is already settled; cannot cancel",
        cause: e,
      });
    }
    // Otherwise treat as an RML-2 retryable failure.
    throw new RmlError({
      world: "RML2",
      severity: "error",
      action: "retry-with-backoff",
      code: "PAYMENT_CANCEL_FAILED",
      message: "Failed to cancel payment (retryable)",
      cause: e,
    });
  }
}
Design intent:
- RML-3 errors → incident/case/ledger workflows
- RML-2 errors → retry/compensation workflows
This “hard split” prevents RML-3 incidents from being silently swallowed inside RML-2 code paths.
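The hard split can be sketched as a single routing function. The workflow names below are hypothetical placeholders for whatever incident, retry, and compensation machinery a team actually runs.

```typescript
type World = "RML1" | "RML2" | "RML3";
type ActionHint = "retry-local" | "retry-with-backoff" | "start-compensation" | "escalate-history" | "abort";

type Labeled = { world: World; action: ActionHint; code: string };
type Workflow = "incident-case" | "retry-queue" | "compensation" | "caller-error";

function routeToWorkflow(err: Labeled): Workflow {
  // The world is checked before the action: an RML-3 error never reaches
  // the retry or compensation branches, even if its action hint is wrong.
  if (err.world === "RML3") return "incident-case";

  if (err.world === "RML2") {
    switch (err.action) {
      case "start-compensation":
        return "compensation";
      case "retry-local":
      case "retry-with-backoff":
        return "retry-queue";
      default:
        return "caller-error";
    }
  }
  return "caller-error"; // RML-1: report to the caller and stop
}

console.log(routeToWorkflow({ world: "RML3", action: "escalate-history", code: "PAYMENT_ALREADY_SETTLED" })); // incident-case
console.log(routeToWorkflow({ world: "RML2", action: "retry-with-backoff", code: "PAYMENT_CANCEL_FAILED" })); // retry-queue
```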
5) Anti-patterns (how teams get burned)
5.1 Catch-all and return 500 for everything
try {
  // many operations
} catch (e) {
  logger.error(e);
  return res.status(500).json({ message: "Internal Server Error" });
}
This hides:
- RML-1 local mistakes
- RML-2 saga failures
- RML-3 history-grade incidents
…inside one bucket: “500.”
What happens in reality:
- it looks “handled”
- but history-grade incidents quietly accumulate
- and one day they surface all at once
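For contrast, a boundary handler that keeps the worlds apart might look like this sketch. The status mapping (400 / 503 / 409) is an assumption made for illustration, and `toBoundaryResponse` and `isLabeled` are hypothetical names; the point is only that unlabeled errors stay visibly unlabeled instead of blending in.

```typescript
type World = "RML1" | "RML2" | "RML3";
type Labeled = { world: World; code: string };

// Duck-typed check so the handler also works on errors from other modules.
function isLabeled(e: unknown): e is Labeled {
  if (typeof e !== "object" || e === null) return false;
  const w = (e as { world?: unknown }).world;
  const c = (e as { code?: unknown }).code;
  return (w === "RML1" || w === "RML2" || w === "RML3") && typeof c === "string";
}

// Assumed mapping; pick statuses that fit your API conventions.
const STATUS_BY_WORLD: Record<World, number> = { RML1: 400, RML2: 503, RML3: 409 };

type BoundaryResponse = { status: number; code: string; world: World | "unlabeled" };

function toBoundaryResponse(e: unknown): BoundaryResponse {
  if (!isLabeled(e)) {
    // Still a 500, but explicitly marked so it can be counted and fixed.
    return { status: 500, code: "UNSTRUCTURED_ERROR", world: "unlabeled" };
  }
  return { status: STATUS_BY_WORLD[e.world], code: e.code, world: e.world };
}

console.log(toBoundaryResponse({ world: "RML3", code: "PAYMENT_ALREADY_SETTLED" }).status); // 409
console.log(toBoundaryResponse(new Error("boom")).world); // unlabeled
```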
5.2 “Just retry” loops
Blind retries can turn incidents into self-inflicted outages:
- retries overload a downstream dependency
- history-bound errors never get fixed by retrying (the retry changes nothing)
- worst case: you repeat a harmful operation
Whether retry helps is world-dependent:
- RML-2 transient failure → retry can work
- RML-3 “already settled / forbidden” → retry is pointless (and often harmful)
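The world-dependence of retries fits in a small guard. This is a sketch; `shouldRetry` and its parameters are hypothetical names.

```typescript
type World = "RML1" | "RML2" | "RML3";
type ActionHint = "retry-local" | "retry-with-backoff" | "start-compensation" | "escalate-history" | "abort";

function shouldRetry(
  err: { world: World; action: ActionHint },
  attempt: number, // 1-based count of attempts already made
  maxAttempts: number
): boolean {
  // History-bound failures are never retried, whatever the hint says.
  if (err.world === "RML3") return false;
  // Bounded retries keep a struggling dependency from being hammered forever.
  if (attempt >= maxAttempts) return false;
  return err.action === "retry-local" || err.action === "retry-with-backoff";
}

console.log(shouldRetry({ world: "RML2", action: "retry-with-backoff" }, 1, 3)); // true
console.log(shouldRetry({ world: "RML3", action: "retry-with-backoff" }, 1, 3)); // false
console.log(shouldRetry({ world: "RML2", action: "retry-with-backoff" }, 3, 3)); // false
```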
5.3 Errors without worldview
If you can’t tell:
- “Which world is this failure?”
- “What should the caller do?”
Then callers will guess. And guessing creates inconsistent handling—exactly what RML is trying to prevent.
6) Connect it to observability: tag by world and severity
Once you have world and severity, you can connect them directly to:
- logs
- metrics
- traces
- alert routing policy
6.1 Logging example
function logRmlError(err: RmlError) {
  logger.error({
    name: err.name,
    message: err.message,
    world: err.world,
    severity: err.severity,
    action: err.action,
    code: err.code,
    // include request_id/trace_id in your logger context as usual
  });
}
6.2 Tracing example (OpenTelemetry-style attributes)
span.setAttribute("rml.world", err.world);
span.setAttribute("rml.severity", err.severity);
span.setAttribute("rml.action", err.action);
span.setAttribute("rml.code", err.code);
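To keep logs, traces, and metrics using the same keys, the attribute set can be built in one place. The sketch below uses plain objects and an in-memory counter rather than any particular telemetry library; `toRmlAttributes` and `countError` are hypothetical helpers.

```typescript
type World = "RML1" | "RML2" | "RML3";
type Severity = "info" | "warn" | "error" | "critical";

type LabeledError = { world: World; severity: Severity; action: string; code: string };

// One function produces the attribute set, so every signal agrees on names.
function toRmlAttributes(err: LabeledError): Record<string, string> {
  return {
    "rml.world": err.world,
    "rml.severity": err.severity,
    "rml.action": err.action,
    "rml.code": err.code,
  };
}

// Minimal in-memory counter standing in for a real metrics client.
const errorCounts = new Map<string, number>();

function countError(err: LabeledError): number {
  const key = `${err.world}|${err.severity}|${err.code}`;
  const next = (errorCounts.get(key) ?? 0) + 1;
  errorCounts.set(key, next);
  return next;
}

const settled: LabeledError = {
  world: "RML3",
  severity: "critical",
  action: "escalate-history",
  code: "PAYMENT_ALREADY_SETTLED",
};
console.log(toRmlAttributes(settled)["rml.world"]); // RML3
console.log(countError(settled)); // 1
```

With a tracing client, the same record could be applied by looping over its entries and calling `span.setAttribute(key, value)` for each.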
6.3 Governance example: on-call routing by worldview
This is where the model becomes operationally powerful:
- RML-3 + critical → page on-call even at night (`rml.world = RML3 AND rml.severity = critical`)
- RML-2 + error → respond during business hours (still important, not instant page)
- RML-1 + warn → aggregate and review later
World tags let you avoid the terrible binary choice:
“Alert on everything” vs “ignore everything.”
Instead you can decide:
“Which world do we page for, at what times, and to whom?”
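A routing policy along these lines could be sketched as follows. Only the three combinations listed above come from the text; how the remaining combinations are handled is an assumption marked in the comments, and `routeAlert` is a hypothetical name.

```typescript
type World = "RML1" | "RML2" | "RML3";
type Severity = "info" | "warn" | "error" | "critical";
type Route = "page-now" | "business-hours" | "digest";

function routeAlert(world: World, severity: Severity): Route {
  // From the policy above: history-grade and critical pages immediately.
  if (world === "RML3" && severity === "critical") return "page-now";
  // Assumption: any other RML-3 signal still gets human attention in hours.
  if (world === "RML3") return "business-hours";
  // From the policy above: dialog errors are handled during business hours.
  // Assumption: RML-2 critical is treated the same way as RML-2 error.
  if (world === "RML2" && (severity === "error" || severity === "critical")) return "business-hours";
  // Everything else (including RML-1 warn) is aggregated for later review.
  return "digest";
}

console.log(routeAlert("RML3", "critical")); // page-now
console.log(routeAlert("RML2", "error")); // business-hours
console.log(routeAlert("RML1", "warn")); // digest
```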
7) Action hints are ignored unless you enforce them
Action hints are “what the caller should do.”
But hints are famously easy to ignore.
Common failure modes:
- `escalate-history` …but the client keeps retrying
- `start-compensation` …but nobody implemented compensation
So you need enforcement, not just modeling.
7.1 Put handling logic into a shared client wrapper
// Assumes sleep, backoffWithJitter, startCompensationFlow, and notifyIncident
// are defined elsewhere in the shared client library.
type RetryPolicy = { maxAttempts: number };

async function callWithRmlHandling<T>(
  f: () => Promise<T>,
  policy: RetryPolicy = { maxAttempts: 3 }
): Promise<T> {
  let attempt = 0;
  while (true) {
    attempt += 1;
    try {
      return await f();
    } catch (e) {
      // Unstructured errors get a conservative label: dialog world, no retry.
      const err = e instanceof RmlError ? e : new RmlError({
        world: "RML2",
        severity: "error",
        action: "abort",
        code: "UNSTRUCTURED_ERROR",
        message: "Unstructured error (wrap it!)",
        cause: e,
      });
      switch (err.action) {
        case "retry-local":
          if (attempt >= policy.maxAttempts) throw err;
          continue;
        case "retry-with-backoff":
          if (attempt >= policy.maxAttempts) throw err;
          await sleep(backoffWithJitter(attempt));
          continue;
        case "start-compensation":
          await startCompensationFlow(err);
          throw err;
        case "escalate-history":
          await notifyIncident(err); // or create a case, append to ledger, etc.
          throw err;
        case "abort":
        default:
          throw err;
      }
    }
  }
}
This makes it harder for application code to “accidentally invent its own retry behavior.”
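The wrapper above calls `sleep` and `backoffWithJitter` without defining them. One possible shape for those helpers is sketched below, using capped exponential backoff with full jitter; the base delay, the cap, and the injectable `random` parameter are assumptions chosen for illustration.

```typescript
// Capped exponential backoff with full jitter:
// the delay is drawn uniformly from [0, min(capMs, baseMs * 2^(attempt - 1))).
function backoffWithJitter(
  attempt: number, // 1-based attempt number
  baseMs: number = 100,
  capMs: number = 5000,
  random: () => number = Math.random // injectable for deterministic tests
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(random() * ceiling);
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// With a fixed "random" value of 0.5 the delays are easy to follow.
const half = () => 0.5;
console.log(backoffWithJitter(1, 100, 5000, half)); // 50
console.log(backoffWithJitter(3, 100, 5000, half)); // 200
console.log(backoffWithJitter(10, 100, 5000, half)); // 2500
```

Jitter matters because many clients failing at the same moment would otherwise retry in lockstep and hit the downstream service in synchronized waves.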
7.2 Put RML hints in API responses too
You can also expose the worldview at the boundary:
HTTP/1.1 409 Conflict
Content-Type: application/json
X-RML-World: RML3
X-RML-Action: escalate-history
{
"code": "PAYMENT_ALREADY_SETTLED",
"message": "Payment is already settled; cannot cancel"
}
Benefits:
- gateways/BFF layers can detect “infinite retry against RML-3 escalation”
- clients can standardize behavior without parsing free-text messages
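On the receiving side, a client or gateway could turn those headers into a decision without reading the message text. This is a sketch: `ResponseLike`, `header`, and `decide` are hypothetical names, and the header names are the ones shown in the example response above.

```typescript
type ResponseLike = { status: number; headers: Record<string, string> };
type ClientDecision = "retry" | "compensate" | "stop-and-escalate" | "fail";

// HTTP header names are case-insensitive, so look them up that way.
function header(res: ResponseLike, name: string): string | undefined {
  const wanted = name.toLowerCase();
  for (const key of Object.keys(res.headers)) {
    if (key.toLowerCase() === wanted) return res.headers[key];
  }
  return undefined;
}

function decide(res: ResponseLike): ClientDecision {
  const world = header(res, "X-RML-World");
  const action = header(res, "X-RML-Action");
  // Either signal is enough to stop: never retry into history.
  if (world === "RML3" || action === "escalate-history") return "stop-and-escalate";
  if (action === "start-compensation") return "compensate";
  if (action === "retry-local" || action === "retry-with-backoff") return "retry";
  return "fail"; // no usable hint: do not guess
}

const settledResponse: ResponseLike = {
  status: 409,
  headers: { "x-rml-world": "RML3", "x-rml-action": "escalate-history" },
};
console.log(decide(settledResponse)); // stop-and-escalate
```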
7.3 Governance via tests and lint
Culture alone won’t prevent drift. Add checks:
- tests that fail if `RML3` errors get mapped to generic 500
- lint rules that forbid swallowing `escalate-history`
- golden test cases for “retryable” vs “non-retryable” classification
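As one possible form for such checks, a test can scan the error catalog and fail on any violation. The `CatalogEntry` shape and `findViolations` are hypothetical; a real suite would load the entries from wherever the team declares its error codes.

```typescript
type World = "RML1" | "RML2" | "RML3";
type ActionHint = "retry-local" | "retry-with-backoff" | "start-compensation" | "escalate-history" | "abort";

type CatalogEntry = { code: string; world: World; action: ActionHint; httpStatus: number };

function findViolations(entries: CatalogEntry[]): string[] {
  const violations: string[] = [];
  for (const e of entries) {
    if (e.world === "RML3" && e.httpStatus === 500) {
      violations.push(`${e.code}: RML3 error mapped to generic 500`);
    }
    if (e.world === "RML3" && (e.action === "retry-local" || e.action === "retry-with-backoff")) {
      violations.push(`${e.code}: RML3 error marked retryable`);
    }
  }
  return violations;
}

const catalog: CatalogEntry[] = [
  { code: "PAYMENT_ALREADY_SETTLED", world: "RML3", action: "escalate-history", httpStatus: 409 },
  { code: "PAYMENT_CANCEL_FAILED", world: "RML2", action: "retry-with-backoff", httpStatus: 503 },
  // A deliberately bad entry to show what the check catches.
  { code: "EMAIL_ALREADY_SENT", world: "RML3", action: "retry-with-backoff", httpStatus: 500 },
];

console.log(findViolations(catalog).length); // 2
```

In CI, the assertion would simply be that `findViolations` returns an empty list for the real catalog.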
8) Practical checklists
8.1 When designing error classes / codes
- [ ] Which world does this failure belong to (RML-1/2/3)?
- [ ] What action do we expect the caller to take?
  - abort / retry-local / retry-with-backoff / start-compensation / escalate-history
- [ ] Do logs/traces include `world`, `action`, `code`?
- [ ] Does alerting policy reflect `world` + `severity`?
8.2 When designing sagas / workflows
- [ ] For each step, what failures are:
  - local (RML-1-ish)
  - dialog (RML-2)
  - history-bound (RML-3 escalation)
- [ ] What happens if an RML-3 failure occurs mid-saga?
  - who owns the case?
  - what is the operator handoff artifact?
- [ ] Do we have a shared handling layer that enforces action hints?
8.3 When designing APIs
- [ ] Which world does this API primarily live in?
- [ ] When a history-bound failure happens:
  - is it detectable via status codes/headers?
  - does the audit trail capture enough context?
Closing — classifying errors by world makes everything easier
The point is simple:
Treat failures not just as “technical problems,”
but as events that occur in a world.
- RML-1 failures can end locally.
- RML-2 failures must be handled as dialog (saga, retries, reconciliation).
- RML-3 failures must be admitted as history and corrected forward.
Once you attach worldview labels (world, action, severity):
- engineers can reason consistently
- SRE can route alerts sanely
- the organization can stop pretending history-grade incidents were “just a 500”
Next, we go deeper into the Dialog World mechanics that make History World survivable: compensations/sagas and idempotency (Chapter 6), and the API/client boundary that carries the worldview (Chapter 7).