The Worlds of Distributed Systems — Chapter 7
In Chapters 5–6, we made RML-2 concrete:
- Attach world / severity / action to failures
- Connect them to observability and on-call policy
- Treat sagas as retryable conversations, where retry-with-backoff, start-compensation, and escalate-history are state machine triggers
- Treat idempotency keys as the lifeline of retries
This chapter brings that worldview to the most important interface in real systems:
The boundary between API and client.
“What world is burning behind this HTTP 500?”
We’ll design APIs and clients so they can share RML semantics and enforce correct behavior—rather than leaving everything to guesswork.
1) HTTP status codes don’t tell you which world you’re in
A typical API error looks like this:
HTTP/1.1 500 Internal Server Error
Content-Type: application/json
{ "message": "Internal Server Error" }
From the client’s perspective, this tells you almost nothing:
- Was it invalid input (RML-1-ish)?
- Was it a transient downstream failure (RML-2 retryable)?
- Did we hit a history-bound rule (RML-3 escalation / no retries)?
In RML terms:
“Which world is this error from: RML-1 / RML-2 / RML-3?”
“What should I do next?”
This chapter’s goal:
- Make API/client boundaries RML-aware
- Let clients interpret Action Hints
- Prevent the worst outcome: automation-by-guessing
2) Put RML in API responses: body vs headers vs status
Think of an API response as three layers:
- HTTP status
- HTTP headers
- JSON body
Each has a natural role.
2.1 JSON body: the core error object
The body is what humans and clients read first.
A practical shape:
{
  "code": "PAYMENT_ALREADY_SETTLED",
  "message": "This payment is already settled and cannot be canceled.",
  "world": "RML3",
  "action": "escalate-history",
  "details": {
    "paymentId": "pay_123",
    "settledAt": "2025-12-01T10:23:45Z"
  }
}
- code: stable branching key (logging, UI mapping, analytics)
- message: human-facing message (UI/logs)
- world / action: the RML metadata
- details: domain-specific context
Public APIs note: you may choose to expose fewer fields in the body (e.g., omit world/action) and put RML only in headers. Internal APIs can usually include everything.
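A client that branches on this body needs to confirm the shape before trusting it. The following is a minimal sketch of a type and runtime guard for the error object above; the names RmlErrorBody and isRmlErrorBody are hypothetical, and world/action are treated as optional because public APIs may omit them from the body.

```typescript
type World = "RML1" | "RML2" | "RML3";
type Action =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";

// Hypothetical type mirroring the error object shown above.
interface RmlErrorBody {
  code: string;
  message: string;
  world?: World; // may be absent when RML travels only in headers
  action?: Action;
  details?: Record<string, unknown>;
}

const WORLDS: readonly string[] = ["RML1", "RML2", "RML3"];
const ACTIONS: readonly string[] = [
  "retry-local",
  "retry-with-backoff",
  "start-compensation",
  "escalate-history",
  "abort",
];

// Narrow parsed JSON to RmlErrorBody; rejects unknown world/action values.
function isRmlErrorBody(v: unknown): v is RmlErrorBody {
  if (typeof v !== "object" || v === null) return false;
  const o = v as Record<string, unknown>;
  if (typeof o["code"] !== "string" || typeof o["message"] !== "string") return false;
  const world = o["world"];
  if (world !== undefined && (typeof world !== "string" || WORLDS.indexOf(world) === -1)) return false;
  const action = o["action"];
  if (action !== undefined && (typeof action !== "string" || ACTIONS.indexOf(action) === -1)) return false;
  const details = o["details"];
  if (details !== undefined && (typeof details !== "object" || details === null)) return false;
  return true;
}
```

Note that the bare { "message": "Internal Server Error" } body from section 1 fails this guard because it has no code, which is a reasonable signal to fall back to non-RML handling.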
2.2 HTTP headers: RML metadata + retry control
Headers are ideal for machine-readable policy and cross-cutting behavior:
HTTP/1.1 409 Conflict
Content-Type: application/json
X-RML-World: RML3
X-RML-Action: escalate-history
Retry-After: 0
X-Idempotency-Key: saga:12345:charge
X-Request-ID: req-abcde
- X-RML-World: RML1 / RML2 / RML3
- X-RML-Action: retry-local / retry-with-backoff / start-compensation / escalate-history / abort
- Retry-After: suggested delay (especially for retry-with-backoff)
- X-Idempotency-Key: the retry lifeline (when relevant)
- X-Request-ID: ties UI and support workflows to traces
Now clients can behave consistently:
- RML3 + escalate-history → never auto-retry, show support path / ticket ID
- RML2 + retry-with-backoff → retry using the same idempotency key
- RML1 + abort → show validation feedback immediately
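One detail worth handling on the client: in HTTP, Retry-After can carry either a number of seconds or an HTTP-date. The sketch below converts both forms to milliseconds; parseRetryAfterMs is a hypothetical helper name, and it relies on Date.parse accepting the HTTP-date format, which common JavaScript engines do.

```typescript
// Hypothetical helper: convert a Retry-After header value to milliseconds.
// Returns undefined when the header is missing or unparseable, so the caller
// can fall back to its own backoff.
function parseRetryAfterMs(value: string | null, nowMs: number): number | undefined {
  if (value === null) return undefined;
  const trimmed = value.trim();
  if (trimmed === "") return undefined;

  // Form 1: delay-seconds, a non-negative integer such as "120".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;

  // Form 2: HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
  const at = Date.parse(trimmed);
  if (isNaN(at)) return undefined;

  // A date in the past means "retry now", never a negative delay.
  return Math.max(0, at - nowMs);
}
```

Passing nowMs explicitly (for example Date.now()) keeps the function deterministic and easy to test.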
2.3 HTTP status: keep it coarse
Status codes are a coarse bucket. That’s fine.
A useful mapping (not universal, but practical):
| world | action | status (typical) | example |
|---|---|---|---|
| RML1 | abort | 400 / 422 | validation failure |
| RML2 | retry-with-backoff | 502 / 503 | transient downstream outage |
| RML2 | start-compensation | 409 / 500 | saga step failed mid-flow |
| RML3 | escalate-history | 409 / 422 / 403 | settled payment; forbidden reversal |
Don’t try to encode everything in status codes.
Encode the meaning in X-RML-World and X-RML-Action.
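The table can also live in code, so a server-side test or middleware can flag a response whose status does not fit its world/action. This is a sketch under the assumption that the table rows above are the whole policy; TYPICAL_STATUS and isTypicalStatus are hypothetical names.

```typescript
type World = "RML1" | "RML2" | "RML3";
type Action =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";

// Rows copied from the mapping table above, keyed as "world/action".
const TYPICAL_STATUS: Record<string, readonly number[]> = {
  "RML1/abort": [400, 422],
  "RML2/retry-with-backoff": [502, 503],
  "RML2/start-compensation": [409, 500],
  "RML3/escalate-history": [409, 422, 403],
};

// Statuses the table considers typical for a pair; empty if the pair has no row.
function typicalStatuses(world: World, action: Action): readonly number[] {
  const hit: readonly number[] | undefined = TYPICAL_STATUS[`${world}/${action}`];
  return hit === undefined ? [] : hit;
}

// True when the status is one of the typical ones for this world/action.
function isTypicalStatus(world: World, action: Action, status: number): boolean {
  return typicalStatuses(world, action).indexOf(status) !== -1;
}
```

Because the mapping is "not universal", a mismatch is better treated as a warning in review or CI than as a hard runtime failure.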
3) Internal API vs BFF vs Public API: different exposure levels
3.1 Internal microservice APIs
Assumption: callers understand RML.
So you can return full metadata:
{
  "code": "STOCK_RESERVE_FAILED",
  "message": "Failed to reserve stock.",
  "world": "RML2",
  "action": "start-compensation",
  "details": { "sku": "SKU-1234", "warehouse": "TOKYO-1" }
}
BFFs, orchestrators, and gateways can interpret this directly.
3.2 BFF (Backend for Frontend) as translator
The BFF sits between:
- RML-aware backend world
- UI that must be understandable to humans
Typical responsibilities:
- Receive world/action from downstream
- Translate to a UI-level error model:
  - which field to highlight
  - whether to auto-retry
  - whether to show a “contact support” path
- Optionally soften messaging for end users
Example translation:
Downstream returns:
{
  "code": "PAYMENT_ALREADY_SETTLED",
  "world": "RML3",
  "action": "escalate-history",
  "message": "Payment already settled; cannot cancel."
}
BFF outputs to UI:
- End-user message: “This payment can’t be canceled here. Please contact support.”
- Internal behavior: enqueue incident/case + attach requestId
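The translation step might look roughly like the sketch below. The UiError shape, the toUiError name, and the message strings are assumptions for illustration; a real BFF would also enqueue the incident/case, which is left out here.

```typescript
type World = "RML1" | "RML2" | "RML3";
type Action =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";

interface DownstreamError {
  code: string;
  message: string;
  world: World;
  action: Action;
}

// Hypothetical UI-level error model produced by the BFF.
interface UiError {
  userMessage: string;
  autoRetry: boolean;
  showSupportPath: boolean;
  requestId: string;
}

function toUiError(err: DownstreamError, requestId: string): UiError {
  // History-bound: never retry, soften the message, point to support.
  if (err.world === "RML3" || err.action === "escalate-history") {
    return {
      userMessage: "This cannot be completed here. Please contact support.",
      autoRetry: false,
      showSupportPath: true,
      requestId,
    };
  }
  // Retryable: the UI may keep a spinner while the client retries.
  if (err.action === "retry-with-backoff" || err.action === "retry-local") {
    return { userMessage: "Temporary issue. Retrying...", autoRetry: true, showSupportPath: false, requestId };
  }
  // Validation-style abort: the downstream message is shown as-is.
  if (err.action === "abort") {
    return { userMessage: err.message, autoRetry: false, showSupportPath: false, requestId };
  }
  // start-compensation: internal saga detail, so show a generic message.
  return { userMessage: "Something went wrong. Please try again later.", autoRetry: false, showSupportPath: false, requestId };
}
```

Keeping the function pure (no incident creation, no logging) makes the mapping easy to test; side effects can wrap it.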
3.3 Public APIs
For external developers, you must decide what becomes part of the external contract.
A common approach:
- Body: code / message / details only
- Headers: X-RML-World, X-RML-Action, Retry-After, X-Request-ID
External clients can still follow simple rules:
- RML2 + retry-with-backoff → retry safely
- RML3 + escalate-history → do not retry; escalate to human flow
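As an illustration of that split, a gateway or BFF could project an internal error into the public contract as sketched below. InternalError, PublicResponse, toPublicResponse, and the retryAfterSeconds field are hypothetical names, not part of any existing library.

```typescript
type World = "RML1" | "RML2" | "RML3";
type Action =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";

interface InternalError {
  code: string;
  message: string;
  world: World;
  action: Action;
  details?: Record<string, unknown>;
  retryAfterSeconds?: number; // assumed field; only set for retryable errors
}

interface PublicBody {
  code: string;
  message: string;
  details?: Record<string, unknown>;
}

interface PublicResponse {
  headers: Record<string, string>;
  body: PublicBody;
}

// Move RML metadata into headers and keep the body to code/message/details.
function toPublicResponse(err: InternalError, requestId: string): PublicResponse {
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
    "X-RML-World": err.world,
    "X-RML-Action": err.action,
    "X-Request-ID": requestId,
  };
  if (err.retryAfterSeconds !== undefined) {
    headers["Retry-After"] = String(err.retryAfterSeconds);
  }
  const body: PublicBody = { code: err.code, message: err.message };
  if (err.details !== undefined) body.details = err.details;
  return { headers, body };
}
```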
4) The RML-aware client (SDK) pattern
If you expose X-RML-* but nobody reads it, nothing changes.
So the core move is:
Provide an RML-aware client wrapper and standardize its use.
4.1 Parse RML from headers
type World = "RML1" | "RML2" | "RML3";
type Action =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";
type RmlMeta = { world?: World; action?: Action };

function asWorld(v: string | null): World | undefined {
  if (v === "RML1" || v === "RML2" || v === "RML3") return v;
  return undefined;
}

function asAction(v: string | null): Action | undefined {
  if (
    v === "retry-local" ||
    v === "retry-with-backoff" ||
    v === "start-compensation" ||
    v === "escalate-history" ||
    v === "abort"
  ) return v;
  return undefined;
}

function parseRml(res: Response): RmlMeta {
  return {
    world: asWorld(res.headers.get("X-RML-World")),
    action: asAction(res.headers.get("X-RML-Action")),
  };
}
4.2 Enforce Action Hints
function sleep(ms: number) {
  return new Promise<void>((r) => setTimeout(r, ms));
}

function backoffWithJitter(baseMs: number, attempt: number) {
  const capped = Math.min(baseMs * 2 ** attempt, 10_000);
  const jitter = Math.random() * (capped * 0.2); // ~ +0–20%
  return capped + jitter;
}

export async function fetchWithRmlHandling(
  input: RequestInfo,
  init?: RequestInit,
  opts?: { maxRetries?: number; baseDelayMs?: number }
): Promise<Response> {
  const maxRetries = opts?.maxRetries ?? 3;
  const baseDelayMs = opts?.baseDelayMs ?? 500;
  let attempt = 0;

  while (true) {
    const res = await fetch(input, init);
    if (res.ok) return res;

    const { world, action } = parseRml(res);

    // If no RML metadata: behave like a normal fetch client.
    if (!world || !action) return res;

    // Never auto-retry history-bound failures.
    if (action === "escalate-history" || world === "RML3") return res;

    if (action === "retry-local" || action === "retry-with-backoff") {
      if (attempt >= maxRetries) return res;
      attempt++;

      const retryAfterHeader = res.headers.get("Retry-After");
      const retryAfterMs = retryAfterHeader ? Number(retryAfterHeader) * 1000 : 0;
      const delay =
        action === "retry-with-backoff"
          ? (retryAfterMs || backoffWithJitter(baseDelayMs, attempt))
          : 0;

      if (delay > 0) await sleep(delay);
      continue;
    }

    // For saga compensation triggers, return the response to caller (or hook your saga runner here).
    if (action === "start-compensation") return res;

    // Default: do not invent behavior.
    return res;
  }
}
Key point: action hints only matter if you enforce them via shared code.
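Because fetchWithRmlHandling passes the same init to every attempt, an idempotency key set once on the request is reused by all retries. The sketch below shows one way to build such a request. The helpers withIdempotencyKey and sagaStepKey, the /payments URL, and the request body are hypothetical, and the saga:<sagaId>:<step> key format is only inferred from the single header example in section 2.2.

```typescript
// Hypothetical helper: copy headers and attach a stable idempotency key.
function withIdempotencyKey(
  headers: Record<string, string>,
  key: string
): Record<string, string> {
  return { ...headers, "X-Idempotency-Key": key };
}

// Assumed key format, mirroring the "saga:12345:charge" example.
function sagaStepKey(sagaId: string, step: string): string {
  return `saga:${sagaId}:${step}`;
}

// Build the request once, outside any retry loop, so the key never changes.
const chargeInit = {
  method: "POST",
  headers: withIdempotencyKey(
    { "Content-Type": "application/json" },
    sagaStepKey("12345", "charge")
  ),
  body: JSON.stringify({ amount: 1000 }), // illustrative payload
};

// Usage with the wrapper from this section (URL is illustrative):
// const res = await fetchWithRmlHandling("/payments", chargeInit);
```

Generating the key inside the retry loop would defeat the purpose: each attempt would look like a new operation to the server.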
5) UX by world: don’t show “RML3 error” to users
RML metadata is primarily for clients/ops—not end users.
A healthy UX mapping:
- RML-1 (validation / local issues)
  - highlight fields, inline errors
  - “Please fix and resubmit”
- RML-2 (transient dialog failures)
  - “Temporary issue. Retrying…”
  - keep spinner, auto-retry with backoff
  - after N retries: “Please try later”
- RML-3 (history-bound)
  - “This can’t be completed here. Contact support.”
  - show requestId / case ID
  - quietly create an incident/case on the backend
6) Gateway governance: centralize retry policies and detect bad clients
If you emit RML headers, gateways can enforce org-level safety.
6.1 Centralized retry rules
- If X-RML-World=RML2 and X-RML-Action=retry-with-backoff:
  - gateway retries a limited number of times
- If X-RML-World=RML3:
  - gateway forbids retries
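The two rules above reduce to a small pure function that a gateway plugin could call per upstream response. This is only a sketch: gatewayRetryDecision and its parameters are hypothetical, and real gateways expose retry hooks in their own ways.

```typescript
type GatewayDecision =
  | { retry: true; attempt: number }
  | { retry: false; reason: string };

// world/action are the raw header values (undefined when the header is absent).
function gatewayRetryDecision(
  world: string | undefined,
  action: string | undefined,
  attemptsSoFar: number,
  maxGatewayRetries: number
): GatewayDecision {
  // Rule 2: history-bound failures are never retried, whatever the action says.
  if (world === "RML3") {
    return { retry: false, reason: "RML3 is never retried" };
  }
  // Rule 1: bounded retries for RML2 + retry-with-backoff.
  if (world === "RML2" && action === "retry-with-backoff") {
    if (attemptsSoFar < maxGatewayRetries) {
      return { retry: true, attempt: attemptsSoFar + 1 };
    }
    return { retry: false, reason: "gateway retry budget exhausted" };
  }
  // Anything else: pass the response through untouched.
  return { retry: false, reason: "no retry rule matches" };
}
```

Checking RML3 first means a mislabeled response such as RML3 + retry-with-backoff still fails safe.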
6.2 Detect clients ignoring Action Hints
If you see:
- API responds RML3 + escalate-history
- same client continues calling at high frequency
That’s a sign they ignore the contract. Gateways can:
- rate limit
- block
- notify developers
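Detection can start as a sliding-window counter of RML3 + escalate-history responses per client and endpoint. The sketch below is one possible shape; createEscalationRepeatDetector, the window length, the threshold, and the in-memory store are all assumptions (a real gateway would likely use a shared store).

```typescript
// Returns a function that records one RML3 + escalate-history response and
// reports whether the client has exceeded the threshold within the window.
function createEscalationRepeatDetector(windowMs: number, threshold: number) {
  // In-memory timestamps per "clientId endpoint" key (assumption: one process).
  const hits: Record<string, number[]> = {};

  return function record(clientId: string, endpoint: string, nowMs: number): boolean {
    const key = `${clientId} ${endpoint}`;
    const previous: number[] = hits[key] || [];
    // Keep only timestamps that are still inside the window.
    const recent = previous.filter((t) => nowMs - t < windowMs);
    recent.push(nowMs);
    hits[key] = recent;
    // Flag once the count inside the window exceeds the threshold.
    return recent.length > threshold;
  };
}

// Example: flag a client that hits the same RML3 response more than
// 3 times within 60 seconds.
const recordEscalation = createEscalationRepeatDetector(60_000, 3);
```

A flagged result can then feed whichever response the gateway prefers: rate limiting, blocking, or a notification to the client's developers.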
7) GraphQL / gRPC: where to carry world/action
RML is not REST-specific.
7.1 GraphQL (use extensions)
{
  "data": null,
  "errors": [
    {
      "message": "Payment already settled; cannot cancel.",
      "path": ["cancelPayment"],
      "extensions": {
        "code": "PAYMENT_ALREADY_SETTLED",
        "world": "RML3",
        "action": "escalate-history",
        "requestId": "req-abcde"
      }
    }
  ]
}
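On the client side, reading these fields back out is a small mapping over errors[].extensions. The following sketch assumes the extension keys shown above (code, world, action, requestId); the type names and rmlFromGraphQL are hypothetical and not tied to any GraphQL client library.

```typescript
type World = "RML1" | "RML2" | "RML3";
type Action =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";

interface GraphQLErrorLike {
  message: string;
  path?: ReadonlyArray<string | number>;
  extensions?: Record<string, unknown>;
}

interface GraphQLResponseLike {
  data?: unknown;
  errors?: GraphQLErrorLike[];
}

interface GraphQLRmlMeta {
  code: string | undefined;
  world: World | undefined;
  action: Action | undefined;
  requestId: string | undefined;
}

function asString(v: unknown): string | undefined {
  return typeof v === "string" ? v : undefined;
}

function asWorld(v: unknown): World | undefined {
  return v === "RML1" || v === "RML2" || v === "RML3" ? v : undefined;
}

function asAction(v: unknown): Action | undefined {
  return v === "retry-local" ||
    v === "retry-with-backoff" ||
    v === "start-compensation" ||
    v === "escalate-history" ||
    v === "abort"
    ? v
    : undefined;
}

// One RML record per GraphQL error; unknown or missing values become undefined.
function rmlFromGraphQL(res: GraphQLResponseLike): GraphQLRmlMeta[] {
  return (res.errors || []).map((e) => {
    const ext: Record<string, unknown> = e.extensions || {};
    return {
      code: asString(ext["code"]),
      world: asWorld(ext["world"]),
      action: asAction(ext["action"]),
      requestId: asString(ext["requestId"]),
    };
  });
}
```

Since one GraphQL response can carry several errors, a caller would typically act on the most restrictive one, for example never retrying if any entry is RML3.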
7.2 gRPC (metadata + rich status details)
Use:
- gRPC status codes (FAILED_PRECONDITION, UNAVAILABLE, etc.)
- metadata/trailers: x-rml-world, x-rml-action
- optional structured error details message
The idea is the same: clients must be able to route behavior based on world/action.
8) Anti-patterns
8.1 You defined X-RML-*, but nobody reads it
Result: you changed nothing. Fix it by:
- providing an RML-aware SDK wrapper
- enforcing “no raw fetch/http client” via linting or code review rules
8.2 “World” semantics drift across services
If one service uses RML3 to mean “non-retryable policy violation” and another uses RML3 to mean “temporary restriction,” clients become confused and unsafe.
Fix it by:
- having a single org-level RML meaning table
- documenting endpoint guarantees like “This endpoint never returns RML3”
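A shared meaning table can be a small typed module that every service imports, which makes drift a compile-time or test-time failure instead of a production surprise. In the sketch below the pairings for abort, retry-with-backoff, start-compensation, and escalate-history follow the status table in section 2.3; placing retry-local under RML1 is an assumption, since this chapter does not say which world it belongs to. All names are hypothetical.

```typescript
type World = "RML1" | "RML2" | "RML3";
type Action =
  | "retry-local"
  | "retry-with-backoff"
  | "start-compensation"
  | "escalate-history"
  | "abort";

// Single org-level table: which actions a world is allowed to emit.
// NOTE: "retry-local" under RML1 is an assumption for this sketch.
const ALLOWED_ACTIONS: Record<World, readonly Action[]> = {
  RML1: ["abort", "retry-local"],
  RML2: ["retry-with-backoff", "start-compensation"],
  RML3: ["escalate-history"],
};

// True when a service's (world, action) pair matches the shared table.
function isConsistent(world: World, action: Action): boolean {
  return ALLOWED_ACTIONS[world].indexOf(action) !== -1;
}

// Check an endpoint guarantee such as "This endpoint never returns RML3".
function violatesGuarantee(world: World, neverReturns: readonly World[]): boolean {
  return neverReturns.indexOf(world) !== -1;
}
```

Both checks fit naturally into contract tests: run them against recorded error responses for each endpoint.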
8.3 Exposing RML jargon directly to end users
“RML3 error occurred” is meaningless to users.
Fix it by:
- translating at the BFF/UI layer
- showing meaningful messages + a support path + request ID
9) Checklist
API side
- [ ] JSON body includes code, message, and optional details
- [ ] Headers carry X-RML-World, X-RML-Action, Retry-After, X-Request-ID
- [ ] RML3 + escalate-history returns a non-retryable status (often 409/422/403)
- [ ] Idempotency is contractually supported where retries exist
Client / BFF side
- [ ] Use an RML-aware shared client wrapper
- [ ] Behavior branches on world/action (retry / compensate / escalate)
- [ ] UX differs per world (validation vs retry vs support)
- [ ] Public API docs explain RML headers clearly
Gateway / SRE side
- [ ] Central retry policy exists for RML2 + retry-*
- [ ] Retry forbidden for RML3
- [ ] Detection exists for clients ignoring Action Hints
Closing — carrying the worldview across boundaries makes responsibility easier
The core takeaway:
- HTTP status codes alone can’t express RML semantics
- Add world/action in headers (and optionally body)
- Build an RML-aware client wrapper so Action Hints are enforced
- Translate RML semantics into UX, gateway policy, and incident workflows
- Apply the same idea across REST, GraphQL, and gRPC
At the end of Part II, we now have the full “Dialog World” stack:
- Chapter 5: Failure design + observability + governance
- Chapter 6: Sagas and compensations as retryable conversations
- Chapter 7: API/client boundary carrying the worldview
Next, Part III moves the focus fully into RML-3 (History World)—where distributed systems meet legal responsibility, money, and social trust.