kanaria007

Posted on Feb 26 • Originally published at zenn.dev

Chapter 5 — Failure Design for RML-2 (Dialog World): Exceptions, Observability, and Governance

#distributedsystems #observability #sre #microservices

The Worlds of Distributed Systems — Chapter 5

“Just catch all exceptions and show the user:
‘An error occurred. Please try again.’”

In distributed systems, this approach doesn’t “handle errors.”
It sets the world on fire.

In Chapter 1, we reframed rollback using three worlds:

RML-1 — Closed World
RML-2 — Dialog World
RML-3 — History World

This chapter zooms into the world where teams trip most often: RML-2 (Dialog World).

The goal is practical:

classify failures by which world they belong to
decide where to catch them and how far to propagate
connect that design to observability and on-call governance

1) First: classify failures by world

If you apply the RML worlds to failures, you get a useful picture:

World	What kind of failure?	Blast radius	Examples
RML-1	process-local failure	your memory / temp files	OOM, validation failure, local invariants
RML-2	failure during dialog	your service + neighbor services + some users	downstream outage, saga step failure, broker outage
RML-3	history-grade failure	reality shared by org/society	double charge, mis-send, compliance incident

This chapter focuses on failures happening in RML-2.

In RML-2, the moment an exception happens, you usually have to decide, by pattern and not by vibes:

is this safe to resolve locally (RML-1-ish)?
is this a dialog failure requiring saga/reconcile logic (RML-2)?
is this already history-bound and must escalate (RML-3)?

If you keep this ambiguous, you’ll eventually do something fatal:

You’ll make an RML-3 incident look like it was “handled” inside RML-1.

And that’s how SRE ends up doing overnight incident response by default.

2) The RML-2 “three-layer” model: Local / Dialog / History-bound

When you look at RML-2 failures in practice, it helps to split them into three layers:

Local Failure (still closed enough)
Dialog Failure (inside an ongoing conversation)
History-bound Failure (should be treated as RML-3)

2.1 Local Failure — still inside the room

nothing meaningful has escaped yet
input validation, local invariants, per-item batch errors

Handle it like RML-1:

roll back locally (retry/abort)
log it
return a normal error to the caller

If you escalate every local failure as a major incident, operations will burn out fast.

2.2 Dialog Failure — failure as part of the conversation

Examples:

RPC timeout to a downstream service
step 3 of a saga fails
publish to broker fails

These failures need context-aware handling:

should the saga compensate?
is this retryable (with backoff)?
should we pause and reconcile later?

If you just catch and return 500, you cut the conversation off mid-exchange and leave the saga in an inconsistent state.

2.3 History-bound Failure — this is already RML-3

Even if you thought you were designing “just RML-2,” you will hit situations like:

payment already settled externally
email/notification already delivered
compliance/legal requires recording the cancellation/correction itself

In these cases, “undo” is not your tool. You must treat it as history:

refunds
corrections
customer communication
incident/case management

From inside RML-2 code, it may look like “just an exception.”
But operationally, it belongs to RML-3.
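The three layers can be read as a small triage function. The sketch below is one possible encoding, not part of the RML model itself: `FailureFacts` and `classifyLayer` are hypothetical names, and real classification usually needs domain knowledge that two booleans cannot capture.

```typescript
type Layer = "local" | "dialog" | "history-bound";

// Hypothetical facts a handler might know at the moment of failure.
interface FailureFacts {
  // Has any effect left this process (RPC accepted, message published)?
  effectsEscaped: boolean;
  // Is an escaped effect externally final (settled payment, delivered email)?
  externallyFinal: boolean;
}

function classifyLayer(facts: FailureFacts): Layer {
  if (!facts.effectsEscaped) return "local"; // nothing meaningful has escaped
  if (facts.externallyFinal) return "history-bound"; // undo is not available
  return "dialog"; // still inside a conversation that can retry or compensate
}

console.log(classifyLayer({ effectsEscaped: false, externallyFinal: false })); // local
console.log(classifyLayer({ effectsEscaped: true, externallyFinal: false })); // dialog
console.log(classifyLayer({ effectsEscaped: true, externallyFinal: true })); // history-bound
```

When you cannot tell whether an effect escaped (an RPC timeout, for example), it is probably safer to answer `effectsEscaped: true` and let dialog-level handling sort it out.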

3) Label errors structurally: `Error → (World, Severity, Action)`

The safest starting move is:

Make errors structured.

Here’s a compact TypeScript model (you can adapt it to any stack):

type World = "RML1" | "RML2" | "RML3";
type Severity = "info" | "warn" | "error" | "critical";

type ActionHint =
  | "retry-local"         // retry immediately in-place
  | "retry-with-backoff"  // retry with backoff/jitter
  | "start-compensation"  // begin saga compensation
  | "escalate-history"    // treat as RML-3 and escalate
  | "abort";              // stop now

class RmlError extends Error {
  readonly world: World;
  readonly severity: Severity;
  readonly action: ActionHint;
  readonly code: string;
  readonly cause?: unknown;

  constructor(args: {
    world: World;
    severity: Severity;
    action: ActionHint;
    code: string;
    message: string;
    cause?: unknown;
  }) {
    super(args.message);
    this.name = "RmlError";
    this.world = args.world;
    this.severity = args.severity;
    this.action = args.action;
    this.code = args.code;
    this.cause = args.cause;
  }
}

In real systems, you won’t get perfect purity. That’s fine.

Even if you only standardize these two fields, you win:

world — which world this failure belongs to
action — what the caller should do next

Because then exception handling becomes worldview-driven instead of “who guessed right this time.”
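One way to standardize `world` and `action` is a single catalog keyed by error code, so the labels are decided once rather than at every throw site. This is a minimal sketch: `ERROR_CATALOG` and `labelFor` are hypothetical names, the entries reuse codes from this chapter's examples, and the fallback for unknown codes is an assumption.

```typescript
type World = "RML1" | "RML2" | "RML3";
type ActionHint =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";

interface Label {
  world: World;
  action: ActionHint;
}

// Hypothetical catalog: the one place where each code gets its world and action.
const ERROR_CATALOG = new Map<string, Label>([
  ["INVALID_INPUT", { world: "RML1", action: "abort" }],
  ["STOCK_RESERVE_FAILED", { world: "RML2", action: "start-compensation" }],
  ["PAYMENT_CANCEL_FAILED", { world: "RML2", action: "retry-with-backoff" }],
  ["PAYMENT_ALREADY_SETTLED", { world: "RML3", action: "escalate-history" }],
]);

// Unknown codes fall back to RML2 + abort (an assumption: never retry blindly).
function labelFor(code: string): Label {
  return ERROR_CATALOG.get(code) ?? { world: "RML2", action: "abort" };
}

console.log(labelFor("PAYMENT_ALREADY_SETTLED").action); // escalate-history
```

A catalog like this also gives reviewers one file to read when they want to know how a given failure is supposed to be treated.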

4) Core handling patterns in RML-2

Pattern A: Local Failure ends locally

Input validation, local invariants, “nothing escaped yet.”

function validateInput(input: unknown) {
  if (!isValid(input)) {
    throw new RmlError({
      world: "RML1",
      severity: "warn",
      action: "abort",
      code: "INVALID_INPUT",
      message: "Invalid input",
    });
  }
}

The caller can:

show a validation error
treat the saga as “never started”

No need to elevate to dialog-level handling.

Pattern B: Dialog Failure must be handled in saga context

Downstream failures should often become saga-level decisions.

async function reserveStock(orderId: string): Promise<void> {
  try {
    await stockService.reserve(orderId);
  } catch (e) {
    throw new RmlError({
      world: "RML2",
      severity: "error",
      action: "start-compensation",
      code: "STOCK_RESERVE_FAILED",
      message: "Failed to reserve stock",
      cause: e,
    });
  }
}

The saga runner can interpret:

world === "RML2" && action === "start-compensation"

…and switch into compensation flow.

Key point:

Don’t “just return 500.”
Decide failure behavior at saga design time.
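To make the saga runner's side concrete, here is a minimal sketch of a runner that reacts to `start-compensation` by undoing completed steps in reverse order. `SagaStep`, `runSaga`, and `hasLabels` are hypothetical names; a production runner would also persist progress and deal with failures inside compensations, which this sketch ignores.

```typescript
type World = "RML1" | "RML2" | "RML3";
type ActionHint =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";

interface SagaStep {
  name: string;
  run: () => Promise<void>;
  compensate: () => Promise<void>;
}

// Structural check, so any error object carrying these two labels is recognized.
function hasLabels(e: unknown): e is { world: World; action: ActionHint } {
  return typeof e === "object" && e !== null && "world" in e && "action" in e;
}

async function runSaga(steps: SagaStep[]): Promise<"completed" | "compensated"> {
  const done: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.run();
      done.push(step);
    } catch (e) {
      if (hasLabels(e) && e.world === "RML2" && e.action === "start-compensation") {
        // Undo already-completed steps, newest first.
        for (const finished of [...done].reverse()) {
          await finished.compensate();
        }
        return "compensated";
      }
      throw e; // retries and RML-3 escalation are decided elsewhere
    }
  }
  return "completed";
}
```

The failing step itself is not compensated here, on the assumption that a step which threw did not take effect. That assumption does not hold for timeouts, which is one reason reconciliation exists.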

Pattern C: History-bound Failure must escalate (RML-3)

Some failures are not “retry vs compensate.” They are “accept history and correct forward.”

async function cancelPayment(paymentId: string): Promise<void> {
  try {
    await paymentGateway.cancel(paymentId);
  } catch (e) {
    if (isAlreadySettledError(e)) {
      throw new RmlError({
        world: "RML3",
        severity: "critical",
        action: "escalate-history",
        code: "PAYMENT_ALREADY_SETTLED",
        message: "Payment is already settled; cannot cancel",
        cause: e,
      });
    }

    // Otherwise treat as an RML-2 retryable failure.
    throw new RmlError({
      world: "RML2",
      severity: "error",
      action: "retry-with-backoff",
      code: "PAYMENT_CANCEL_FAILED",
      message: "Failed to cancel payment (retryable)",
      cause: e,
    });
  }
}

Design intent:

RML-3 errors → incident/case/ledger workflows
RML-2 errors → retry/compensation workflows

This “hard split” prevents RML-3 incidents from being silently swallowed inside RML-2 code paths.
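The split can be enforced by the compiler rather than by convention. The sketch below routes purely on `world`, using an exhaustive switch; the `Workflow` names and `workflowFor` are hypothetical labels for whatever systems you actually hand off to.

```typescript
type World = "RML1" | "RML2" | "RML3";
type Workflow = "local-handling" | "retry-or-compensation" | "incident-case-ledger";

// Exhaustive switch: adding a new World without a route fails to compile.
function workflowFor(world: World): Workflow {
  switch (world) {
    case "RML1":
      return "local-handling";
    case "RML2":
      return "retry-or-compensation";
    case "RML3":
      return "incident-case-ledger";
    default: {
      const unreachable: never = world;
      throw new Error(`Unhandled world: ${String(unreachable)}`);
    }
  }
}

console.log(workflowFor("RML3")); // incident-case-ledger
```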

5) Anti-patterns (how teams get burned)

5.1 Catch-all and return 500 for everything

try {
  // many operations
} catch (e) {
  logger.error(e);
  return res.status(500).json({ message: "Internal Server Error" });
}

This hides:

RML-1 local mistakes
RML-2 saga failures
RML-3 history-grade incidents

…inside one bucket: “500.”

What happens in reality:

it looks “handled”
but history-grade incidents quietly accumulate
and one day they surface all at once
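A world-aware alternative keeps the three buckets visible at the boundary. This is a rough sketch only: `toHttpReply` is a hypothetical helper, and the status codes per world (400, 503, 409) are illustrative assumptions, not a standard.

```typescript
type World = "RML1" | "RML2" | "RML3";

interface Labeled {
  world: World;
  code: string;
  message: string;
}

interface HttpReply {
  status: number;
  headers: Record<string, string>;
  body: { code: string; message: string };
}

// Assumed mapping for illustration; pick statuses that fit your own API.
const STATUS_BY_WORLD: Record<World, number> = { RML1: 400, RML2: 503, RML3: 409 };

function toHttpReply(err: Labeled): HttpReply {
  return {
    status: STATUS_BY_WORLD[err.world],
    headers: { "X-RML-World": err.world },
    body: { code: err.code, message: err.message },
  };
}

const reply = toHttpReply({
  world: "RML3",
  code: "PAYMENT_ALREADY_SETTLED",
  message: "Payment is already settled; cannot cancel",
});
console.log(reply.status, reply.headers["X-RML-World"]); // 409 RML3
```

A generic 500 then remains only for errors that arrive without any label, which is itself a signal worth counting.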

5.2 “Just retry” loops

Blind retries can turn incidents into self-inflicted outages:

retries overload a downstream dependency
history-bound errors are never fixed by retrying (the retries change nothing)
worst case: you repeat a harmful operation

Whether retry helps is world-dependent:

RML-2 transient failure → retry can work
RML-3 “already settled / forbidden” → retry is pointless (and often harmful)
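That rule fits in a few lines. The following is a sketch of a retry gate, assuming the labels are already available; `shouldRetry` and its parameters are hypothetical.

```typescript
type World = "RML1" | "RML2" | "RML3";
type ActionHint =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";

function shouldRetry(
  world: World,
  action: ActionHint,
  attempt: number,
  maxAttempts: number
): boolean {
  if (world === "RML3") return false; // history-bound: retrying cannot help
  if (action !== "retry-local" && action !== "retry-with-backoff") return false;
  return attempt < maxAttempts; // bounded, so retries cannot loop forever
}

console.log(shouldRetry("RML2", "retry-with-backoff", 1, 3)); // true
console.log(shouldRetry("RML3", "retry-with-backoff", 1, 3)); // false
```

Note that the world check comes first: even a mislabeled RML-3 error that carries a retry hint is refused.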

5.3 Errors without worldview

If you can’t tell:

“Which world is this failure?”
“What should the caller do?”

Then callers will guess. And guessing creates inconsistent handling—exactly what RML is trying to prevent.

6) Connect it to observability: tag by world and severity

Once you have world and severity, you can connect them directly to:

logs
metrics
traces
alert routing policy

6.1 Logging example

function logRmlError(err: RmlError) {
  logger.error({
    name: err.name,
    message: err.message,
    world: err.world,
    severity: err.severity,
    action: err.action,
    code: err.code,
    // include request_id/trace_id in your logger context as usual
  });
}

6.2 Tracing example (OpenTelemetry-style attributes)

span.setAttribute("rml.world", err.world);
span.setAttribute("rml.severity", err.severity);
span.setAttribute("rml.action", err.action);
span.setAttribute("rml.code", err.code);
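Metrics can carry the same labels. The sketch below uses an in-memory map as a stand-in for a real metrics client, so no particular library API is implied; `errorCounts` and `countError` are hypothetical names.

```typescript
type World = "RML1" | "RML2" | "RML3";
type Severity = "info" | "warn" | "error" | "critical";

// In-memory stand-in for a metrics client (an assumption for this sketch).
const errorCounts = new Map<string, number>();

function countError(world: World, severity: Severity, code: string): void {
  const key = `${world}|${severity}|${code}`;
  errorCounts.set(key, (errorCounts.get(key) ?? 0) + 1);
}

countError("RML2", "error", "STOCK_RESERVE_FAILED");
countError("RML2", "error", "STOCK_RESERVE_FAILED");
countError("RML3", "critical", "PAYMENT_ALREADY_SETTLED");

console.log(errorCounts.get("RML2|error|STOCK_RESERVE_FAILED")); // 2
```

If `code` becomes a metric label, keep the set of codes small and fixed; free-form values there tend to blow up label cardinality.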

6.3 Governance example: on-call routing by worldview

This is where the model becomes operationally powerful:

RML-3 + critical → page on-call even at night (rml.world = RML3 AND rml.severity = critical)
RML-2 + error → respond during business hours (still important, not instant page)
RML-1 + warn → aggregate and review later

World tags let you avoid the terrible binary choice:

“Alert on everything” vs “ignore everything.”

Instead you can decide:

“Which world do we page for, at what times, and to whom?”
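Written as code, that policy might look like the sketch below. `routeAlert` and the `Route` names are hypothetical, and two cases the list above does not cover (non-critical RML-3 and critical RML-2) are routed to a business-hours ticket as an assumption.

```typescript
type World = "RML1" | "RML2" | "RML3";
type Severity = "info" | "warn" | "error" | "critical";
type Route = "page-now" | "business-hours-ticket" | "aggregate";

function routeAlert(world: World, severity: Severity): Route {
  // RML-3 + critical pages on-call, even at night.
  if (world === "RML3" && severity === "critical") return "page-now";
  // Assumption: any other RML-3 still deserves a human look during the day.
  if (world === "RML3") return "business-hours-ticket";
  // RML-2 errors matter, but are not an instant page.
  if (world === "RML2" && (severity === "error" || severity === "critical")) {
    return "business-hours-ticket";
  }
  return "aggregate"; // RML-1 and low-severity RML-2: review in bulk later
}

console.log(routeAlert("RML3", "critical")); // page-now
console.log(routeAlert("RML2", "error")); // business-hours-ticket
console.log(routeAlert("RML1", "warn")); // aggregate
```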

7) Action hints are ignored unless you enforce them

Action hints are “what the caller should do.”
But hints are famously easy to ignore.

Common failure modes:

escalate-history …but the client keeps retrying
start-compensation …but nobody implemented compensation

So you need enforcement, not just modeling.

7.1 Put handling logic into a shared client wrapper

type RetryPolicy = { maxAttempts: number };

async function callWithRmlHandling<T>(
  f: () => Promise<T>,
  policy: RetryPolicy = { maxAttempts: 3 }
): Promise<T> {
  let attempt = 0;

  while (true) {
    attempt += 1;
    try {
      return await f();
    } catch (e) {
      const err = e instanceof RmlError ? e : new RmlError({
        world: "RML2",
        severity: "error",
        action: "abort",
        code: "UNSTRUCTURED_ERROR",
        message: "Unstructured error (wrap it!)",
        cause: e,
      });

      switch (err.action) {
        case "retry-local":
          if (attempt >= policy.maxAttempts) throw err;
          continue;

        case "retry-with-backoff":
          if (attempt >= policy.maxAttempts) throw err;
          await sleep(backoffWithJitter(attempt));
          continue;

        case "start-compensation":
          await startCompensationFlow(err);
          throw err;

        case "escalate-history":
          await notifyIncident(err); // or create a case, append to ledger, etc.
          throw err;

        case "abort":
        default:
          throw err;
      }
    }
  }
}

This makes it harder for application code to “accidentally invent its own retry behavior.”
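The wrapper calls `sleep` and `backoffWithJitter` without defining them. One possible shape is sketched here, using the common "full jitter" approach; the `baseMs` and `capMs` defaults and the injectable `random` parameter are assumptions, not values from this series.

```typescript
// Resolves after `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Full jitter: a random delay in [0, min(capMs, baseMs * 2^(attempt - 1))).
// `attempt` is 1-based, matching the counter in the wrapper.
function backoffWithJitter(
  attempt: number,
  baseMs: number = 100,
  capMs: number = 5000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(random() * ceiling);
}

// With a fixed "random" value the delays are easy to inspect.
console.log(backoffWithJitter(1, 100, 5000, () => 0.5)); // 50
console.log(backoffWithJitter(3, 100, 5000, () => 0.5)); // 200
console.log(backoffWithJitter(20, 100, 5000, () => 0.5)); // 2500
```

Passing `random` in makes the function testable; production callers can simply omit it.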

7.2 Put RML hints in API responses too

You can also expose the worldview at the boundary:

HTTP/1.1 409 Conflict
Content-Type: application/json
X-RML-World: RML3
X-RML-Action: escalate-history

{
  "code": "PAYMENT_ALREADY_SETTLED",
  "message": "Payment is already settled; cannot cancel"
}

Benefits:

gateways/BFF layers can detect “infinite retry against RML-3 escalation”
clients can standardize behavior without parsing free-text messages
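On the client side, reading those headers might look like the following sketch. `decideFromHeaders` and `ClientDecision` are hypothetical, and the lookup assumes header names have already been lower-cased by the HTTP client.

```typescript
type ClientDecision = "retry" | "stop-and-escalate" | "stop";

// Header names follow the example response; keys are assumed to be lower-cased.
function decideFromHeaders(headers: Record<string, string>): ClientDecision {
  const world = headers["x-rml-world"];
  const action = headers["x-rml-action"];
  if (world === "RML3" || action === "escalate-history") return "stop-and-escalate";
  if (action === "retry-local" || action === "retry-with-backoff") return "retry";
  return "stop"; // abort, start-compensation, or no hint: do not retry blindly
}

console.log(
  decideFromHeaders({ "x-rml-world": "RML3", "x-rml-action": "escalate-history" })
); // stop-and-escalate
```

Treating a missing hint as "stop" is a deliberate default in this sketch: an unlabeled failure is not evidence that retrying is safe.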

7.3 Governance via tests and lint

Culture alone won’t prevent drift. Add checks:

tests that fail if RML3 errors get mapped to generic 500
lint rules that forbid swallowing escalate-history
golden test cases for “retryable” vs “non-retryable” classification
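A golden-case check can be small. The sketch below is framework-free and takes the mapping under test as a parameter; `GOLDEN`, `GoldenCase`, and `findViolations` are hypothetical names, and the cases reuse codes from this chapter.

```typescript
type World = "RML1" | "RML2" | "RML3";

interface GoldenCase {
  code: string;
  world: World;
  retryable: boolean;
}

// Golden cases: extend this list from your own error catalog.
const GOLDEN: GoldenCase[] = [
  { code: "INVALID_INPUT", world: "RML1", retryable: false },
  { code: "PAYMENT_CANCEL_FAILED", world: "RML2", retryable: true },
  { code: "PAYMENT_ALREADY_SETTLED", world: "RML3", retryable: false },
];

// Returns human-readable violations; an empty array means the check passes.
function findViolations(
  statusFor: (c: GoldenCase) => number,
  isRetryable: (c: GoldenCase) => boolean
): string[] {
  const violations: string[] = [];
  for (const c of GOLDEN) {
    if (c.world === "RML3" && statusFor(c) === 500) {
      violations.push(`${c.code}: RML3 mapped to generic 500`);
    }
    if (isRetryable(c) !== c.retryable) {
      violations.push(`${c.code}: retryable classification drifted`);
    }
  }
  return violations;
}

// A catch-all mapping fails the check; a world-aware one passes.
const catchAll = findViolations(() => 500, (c) => c.retryable);
const worldAware = findViolations(
  (c) => (c.world === "RML3" ? 409 : 500),
  (c) => c.retryable
);
console.log(catchAll.length, worldAware.length); // 1 0
```

Wired into CI, a non-empty result would fail the build, which is the point: drift gets caught before it ships.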

8) Practical checklists

8.1 When designing error classes / codes

[ ] Which world does this failure belong to (RML-1/2/3)?
[ ] What action do we expect the caller to take?
- abort / retry-local / retry-with-backoff / start-compensation / escalate-history
[ ] Do logs/traces include world, action, code?
[ ] Does alerting policy reflect world + severity?

8.2 When designing sagas / workflows

[ ] For each step, what failures are:
- local (RML-1-ish)
- dialog (RML-2)
- history-bound (RML-3 escalation)
[ ] What happens if an RML-3 failure occurs mid-saga?
- who owns the case?
- what is the operator handoff artifact?
[ ] Do we have a shared handling layer that enforces action hints?

8.3 When designing APIs

[ ] Which world does this API primarily live in?
[ ] When a history-bound failure happens:
- is it detectable via status codes/headers?
- does the audit trail capture enough context?

Closing — classifying errors by world makes everything easier

The point is simple:

Treat failures not just as “technical problems,”
but as events that occur in a world.

RML-1 failures can end locally.
RML-2 failures must be handled as dialog (saga, retries, reconciliation).
RML-3 failures must be admitted as history and corrected forward.

Once you attach worldview labels (world, action, severity):

engineers can reason consistently
SRE can route alerts sanely
the organization can stop pretending history-grade incidents were “just a 500”

Next, we go deeper into the Dialog World mechanics that make History World survivable: compensations/sagas and idempotency (Chapter 6), and the API/client boundary that carries the worldview (Chapter 7).
