kanaria007

Deterministic Time: Don’t Break Ordering and Replay in Distributed Systems

In distributed systems, what we actually want is not “the correct time.”

We want two things:

  • Ordering: don’t lose which came first.
  • Replay: later, explain the same decision with the same grounds and procedure.

But in real systems we casually lean on created_at (wall clock).
And then everything breaks:

  • NTP adjustments / VM migration / suspend-resume make clocks jump backward or forward
  • Prometheus / logs / warehouse don’t align to the “same 5-minute window”
  • log order collapses → postmortems become “guessing + meetings”

If you have TrueTime-like “strong time,” life is easier.
But here’s the trick:

Don’t treat time as truth. Treat ordering authority as truth — and pin it to logs.

That’s what I mean by deterministic time.

(Target reader: engineers/SREs who have touched Kafka/Prometheus/etc. Code examples are minimal and portable.)


0) Split time into three kinds (mixing them causes accidents)

Treat these as different things:

  • event_time: when the event happened (client/device/source)
  • ingest_time: when the system received/ingested it (server/pipeline boundary)
  • decision_time: the authority used to decide ordering (e.g., seq, watermark conditions, promotion gate) (Not necessarily a wall-clock timestamp.)

Wall clocks (created_at) are mostly for display and search.
They are not stable enough to be the ground truth for ordering or replay.


1) The minimal solution: single-writer sequence (seq)

The strongest, simplest option is:

Pin ordering to a monotonically increasing seq issued by a single authority (e.g., a DB sequence).
Time is not truth. Numbering is truth.

1.1 NDJSON log: seq is the truth of order

{
  "seq": 10498231,
  "ingest_time": "2026-02-20T00:12:01+09:00",
  "event_time": "2026-02-20T00:11:57+09:00",
  "kind": "rollout_decision",
  "run_id": "exp-42:B:2026-02-18",
  "stage": "CANARY_25",
  "decision": { "verdict": "DEGRADE", "reason_codes": ["observation_stale:watermark_skew"] }
}
  • Ordering is defined by seq ascending.
  • event_time may be wrong or skewed — that’s fine; it’s reference info.
  • ingest_time is for measuring delay.

1.2 Why it works

  • node clocks can drift; the sequencer still serializes
  • you can replay in the same order later
  • “which time is correct?” debates disappear — seq is the authority
  • seq doesn’t have to be gapless; treat it as an ordering key, not a count
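As a tiny sketch (hypothetical events; field names follow the NDJSON example above), replay order is just a sort on seq, gaps included:

```python
import json

# Hypothetical NDJSON events; note the gap between 10498233 and 10498240.
# seq is an ordering key, not a count, so gaps are fine.
lines = [
    '{"seq": 10498233, "kind": "rollout_decision"}',
    '{"seq": 10498231, "kind": "rollout_decision"}',
    '{"seq": 10498240, "kind": "correction"}',
]

def replay_order(ndjson_lines: list[str]) -> list[dict]:
    """Ordering is defined by seq ascending; wall clocks never enter the sort."""
    return sorted((json.loads(line) for line in ndjson_lines),
                  key=lambda e: e["seq"])

print([e["seq"] for e in replay_order(lines)])
```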

1.3 Where seq comes from

  • PostgreSQL: BIGSERIAL, GENERATED AS IDENTITY, sequences
  • MySQL: AUTO_INCREMENT
  • Kafka: partition offset (very strong within a partition, not global)
  • SQLite (not recommended as a distributed sequencer): fine for local tooling; audit-grade monotonicity has edge cases (rebuilds / reuse) unless you design for append-only

1.3.1 Tradeoffs: bottleneck and SPOF

A single sequencer can become:

  • a throughput bottleneck
  • a single point of failure (ordering stops)

The move is not “abandon determinism.” It’s:

Decide how global your ordering must be.

Practical levels:

  • (A) Key-scoped total order: tenant_id / run_id / aggregate_id each has its own ordering lane (This is enough for many operational decisions.)
  • (B) Partition order: multiple lanes (Kafka partitions) with strong lane-local order
  • (C) Final consolidation (only when needed): create a final ordering key downstream for audits/reports
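A minimal sketch of level (A), assuming an in-process counter per key; a real system would back each lane with a DB sequence:

```python
from collections import defaultdict
from itertools import count

# One independent, monotonically increasing lane per key
# (tenant_id / run_id / aggregate_id). In-process sketch only.
lanes = defaultdict(lambda: count(1))

def next_seq(key: str) -> int:
    """Issue the next seq for this key's ordering lane."""
    return next(lanes[key])

assert next_seq("run:exp-42") == 1
assert next_seq("run:exp-42") == 2
assert next_seq("run:exp-99") == 1  # other keys are unaffected
```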

1.3.2 Boundary: what this won’t solve

This post focuses on:

  • total order within a key
  • strong order within a partition
  • final ordering for replay/audit

If you require real-time external consistency across partitions (global total order + distributed commit semantics), you’re now in distributed transaction territory (2PC / consensus / Spanner-class designs). This post gives you a replay/audit foundation, not a replacement.


1.4 Replay design: same input, same order, same decision

Even if order is pinned by seq, ops still fails if logs are not replayable.

Replay is not “we read logs and feel convinced.”
Replay is:

Run the same evaluator with the same inputs in the same order, and reproduce the same verdict.

1.4.0 Canonicalization: if you can’t hash it stably, you can’t replay it

If you plan to pin input_digest / snapshot_digest, you must decide how bytes become a digest.

A common failure mode is “we hashed JSON,” and then:

  • key order changes,
  • floats render differently,
  • null/empty fields drift,
  • pretty-print vs minified changes the bytes,

…and your digests become meaningless.

Minimal rules that already prevent most drift:

  • Canonical JSON: stable key ordering, no insignificant whitespace
  • Number contract: avoid floats where you can; if you must use them, define rounding/formatting
  • Explicit null policy: define whether missing vs null are equivalent (usually: they are not)
  • Stable encoding: UTF-8 everywhere

If you want one practical baseline, use a strict JSON Canonicalization Scheme (JCS-like) and treat “digest mismatch” as a deterministic classification (snapshot_mismatch), not a debate.
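A minimal Python sketch of that baseline (sorted keys, no insignificant whitespace, UTF-8). This is JCS-like, not a full JCS implementation, and the number/null contracts still have to come from your policy:

```python
import hashlib
import json

def canonical_digest(obj) -> str:
    """Digest of a canonicalized JSON value: sorted keys, minified, UTF-8."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Key order and pretty-printing no longer change the digest:
assert canonical_digest({"b": 1, "a": 2}) == canonical_digest({"a": 2, "b": 1})
```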

1.4.1 The three conditions

A) Same input

  • input_digest (digest of canonicalized input)
  • schema_version
  • policy_version / policy_digest

B) Same order

  • seq (final order authority)
  • HLC may be useful as provisional order, but final audit order is best pinned by seq

C) Same decision

  • evaluator_version (git sha / build id)
  • decision_fn_digest (e.g., hash of {policy_digest + evaluator_version})
  • keep nondeterministic dependencies out of the evaluator (see below)

1.4.2 Minimal fields to log at decision points

{
  "seq": 10498231,
  "kind": "rollout_decision",
  "run_id": "exp-42:B:2026-02-18",
  "stage": "CANARY_25",

  "schema_version": "v1",
  "policy_version": "2026-02-18",
  "policy_digest": "sha256:...",

  "evaluator_version": "git:3f2c9a1",
  "decision_fn_digest": "sha256:...",

  "input_digest": "sha256:...",
  "snapshot_digest": "sha256:...",

  "decision": { "verdict": "DEGRADE", "reason_codes": ["observation_stale:watermark_skew"] }
}

Two pins matter most:

  • snapshot_digest: fingerprint of the observation snapshot (metrics/log aggregates) used to decide
  • decision_fn_digest: proves “the rulebook didn’t change”

Store the snapshot itself immutably (object storage / warehouse) and fetch by digest during replay.

Practical note: you don’t need full snapshots for every request.
Keep heavy snapshots for decision points (promote/stop/DEGRADE), and keep lightweight references elsewhere.

1.4.3 Snapshot retention: Hot / Cold / Long-term (keep replay affordable)

Replayability dies when snapshots disappear, or when keeping them is so expensive that teams quietly stop.

A simple retention split keeps it realistic:

  • Hot (hours–days): full snapshots for active runs (fast iteration, quick incident response)
  • Cold (weeks–months): compressed snapshots or rollups for audits and “why did this gate fire?”
  • Long-term (months–years, selective): only for high-stakes domains (RML-3-ish decisions), or for representative incidents

The key is to make retention a policy decision (with costs), not an accident.
If a decision is audit-grade, its snapshot must survive long enough to replay it.

1.4.4 Replay procedure

  1. read events in seq order
  2. fetch observation snapshots by snapshot_digest
  3. run evaluator fixed to policy_version + evaluator_version
  4. verify verdict + reason codes match

If mismatch occurs, classify deterministically:

  • policy_mismatch
  • evaluator_mismatch
  • snapshot_mismatch
  • nondeterminism_leak
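Steps 3–4 plus the classification can be sketched as a pure function. Field names follow the decision-log example above; checking from the "rulebook" outward is a judgment call, not a standard:

```python
def classify(logged: dict, replayed: dict) -> str:
    """Deterministically classify a replay result (sketch)."""
    if logged["policy_digest"] != replayed["policy_digest"]:
        return "policy_mismatch"
    if logged["evaluator_version"] != replayed["evaluator_version"]:
        return "evaluator_mismatch"
    if logged["snapshot_digest"] != replayed["snapshot_digest"]:
        return "snapshot_mismatch"
    if logged["decision"] != replayed["decision"]:
        # Same pins, different verdict: nondeterminism leaked into the evaluator.
        return "nondeterminism_leak"
    return "match"
```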

1.4.5 Common “nondeterminism leaks” (and how to seal them)

  • using now() in the decision path → use seq / watermark conditions as decision anchors; wall clock is display-only
  • randomness (sampling) → derive seed from run_id and log it
  • float rounding → contract rounding rules (and treat NaN as DEGRADE)
  • calling external services inside decisions → snapshot first; evaluator reads only the snapshot via digest
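For the randomness leak, one way to seal it (the seed-derivation scheme here is an assumption; pick one, pin it, and log it):

```python
import hashlib
import random

def rng_for_run(run_id: str) -> random.Random:
    """Deterministic RNG: seed = first 8 bytes of sha256(run_id)."""
    seed = int.from_bytes(hashlib.sha256(run_id.encode("utf-8")).digest()[:8],
                          "big")
    return random.Random(seed)

# Same run_id -> same sample sequence, regardless of host or wall clock.
assert rng_for_run("exp-42:B:2026-02-18").random() == \
       rng_for_run("exp-42:B:2026-02-18").random()
```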

Replayability is not about heroics. It’s mostly log design.


2) Multi-source alignment: build it with a Watermark contract

The next pain point is multi-source evaluation:

  • Primary metrics (logs/warehouse) and guardrails (Prometheus) don’t align
  • you either miss violations or DEGRADE forever

Don’t chase perfect synchronization.

Define “evaluation is valid up to here” as a watermark contract.

2.1 Include watermark in the snapshot

{
  "snapshot_id": "snap-2026-02-18T10:15:00+09:00",
  "collected_at": "2026-02-18T10:15:12+09:00",
  "window": "5m",
  "watermark_event_time": "2026-02-18T10:09:30+09:00",
  "sources": [
    {"name": "prometheus", "watermark_event_time": "2026-02-18T10:10:00+09:00"},
    {"name": "bigquery",   "watermark_event_time": "2026-02-18T10:09:30+09:00"}
  ],
  "freshness_seconds": 72,
  "watermark_skew_seconds": 30,
  "metrics": {
    "conversion_rate": 0.131,
    "error_rate_5m": 0.013,
    "p95_latency_ms_5m": 420
  }
}

2.2 Decide “valid enough” by budgets (deterministically)

Example rules:

  • freshness_seconds <= max_freshness_seconds
  • watermark_skew_seconds <= max_watermark_skew_seconds

If budget is exceeded, DEGRADE (hold safely) instead of guessing.

{
  "missing_data_policy": {
    "max_freshness_seconds": 120,
    "max_watermark_skew_seconds": 60,
    "on_stale": "DEGRADE"
  }
}

The goal is not “true time.”
It’s defining decision admissibility under replayable conditions.
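Put together, the budget check is a pure function of snapshot plus policy (field names follow the JSON examples above):

```python
def admissible(snapshot: dict, policy: dict) -> tuple[bool, list[str]]:
    """Deterministic admissibility: stale observations yield DEGRADE reasons."""
    reasons = []
    if snapshot["freshness_seconds"] > policy["max_freshness_seconds"]:
        reasons.append("observation_stale:over_freshness_budget")
    if snapshot["watermark_skew_seconds"] > policy["max_watermark_skew_seconds"]:
        reasons.append("observation_stale:watermark_skew")
    return (not reasons, reasons)

policy = {"max_freshness_seconds": 120, "max_watermark_skew_seconds": 60}
assert admissible({"freshness_seconds": 72, "watermark_skew_seconds": 30},
                  policy)[0]
assert admissible({"freshness_seconds": 300, "watermark_skew_seconds": 30},
                  policy) == (False, ["observation_stale:over_freshness_budget"])
```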

2.2.1 Late data: make the policy explicit (IGNORE / RECOMPUTE / COMPENSATE)

Late data is inevitable. If your system handles it “by vibes,” replayability dies.

Pick one mode and pin it in policy:

A) IGNORE
Drop data beyond lateness budget. Easy to operate; residual error exists but determinism holds.

B) RECOMPUTE
Recompute and update the decision itself. Heavy: you need decision_epoch to track which version is “current.”

C) COMPENSATE
Don’t rewrite past decisions; append a correction event later. Often best for audit-friendly ops.

Policy example:

{
  "late_data_policy": {
    "lateness_budget_seconds": 300,
    "mode": "COMPENSATE",
    "on_over_budget": "IGNORE"
  }
}

Correction event example (append-only):

{
  "seq": 10498500,
  "kind": "correction",
  "corrects": { "seq": 10498231, "snapshot_digest": "sha256:old..." },
  "new_snapshot_digest": "sha256:new...",
  "reason": "late_data_within_budget",
  "effect": { "metric": "conversion_rate", "delta": 0.0004 }
}

Rule: do not rewrite old log rows. Append corrections.

2.3 Operate watermarks with reason codes + SLOs

Watermark is not a feeling. Make it measurable:

  • freshness_seconds
  • watermark_skew_seconds
  • (optional) completeness

Fix vocabulary via typed reason codes:

  • observation_missing:required_source
  • observation_missing:required_metric
  • observation_stale:over_freshness_budget
  • observation_stale:watermark_skew
  • observation_invalid:watermark_parse

Then manage it like ops:

  • staleness rate SLO
  • time-to-fresh SLO

2.4 How to compute a watermark (logs / metrics / traces)

2.4.1 Logs (stream → warehouse)

Watermark is a frontier, not a timestamp.

  • track per-partition processed offsets
  • map offsets to effective event-time (careful: client clocks can lie)
  • watermark = low watermark = min(partition_frontier_times)

If event_time is client-provided, separate:

  • event_time_raw (do not mutate)
  • event_time_effective (used for evaluation; derived by policy)
  • event_time_policy (ACCEPT / CLAMP / QUARANTINE)

If offset→event_time mapping is expensive, fall back to conservative ingest-time estimates and tighten skew budgets so you DEGRADE earlier.

For replay, keep frontier evidence in the snapshot:

  • source_offsets (partition → offset)
  • job_run_id
  • query_digest
  • counts of CLAMP/QUARANTINE
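The frontier rule itself is one line; a sketch with hypothetical per-partition frontiers (epoch seconds):

```python
def low_watermark(partition_frontiers: dict[str, int]) -> int:
    """Watermark = min over per-partition frontier times (the slowest lane)."""
    return min(partition_frontiers.values())

frontiers = {"p0": 1_770_000_300, "p1": 1_770_000_240, "p2": 1_770_000_290}
assert low_watermark(frontiers) == 1_770_000_240  # pinned to the slowest partition
```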

2.4.2 Metrics (Prometheus-like)

Prometheus is sample-time driven. A conservative watermark can be:

  • watermark = min(scrape_latest_ts among required series)

Missing required series is naturally a reason code.

2.4.3 Traces (OTel)

Traces are incomplete and delayed by nature. Don’t “wait forever.”

Contract closure conditions:

  • required spans present
  • trace_end_time + grace_period <= decision_anchor_time (anchor pinned in snapshot)
  • for async workflows: prefer closure markers over “root span ended”

Reason codes:

  • observation_incomplete:trace_not_closed
  • observation_incomplete:required_spans_missing
  • observation_stale:trace_grace_exceeded

Common rule: watermark is always a low watermark (align to the slowest frontier).


3) If you can’t place a single writer: HLC for practical ordering

Sometimes you can’t route everything through one sequencer:

  • offline edge/mobile
  • partitions and temporary disconnects
  • local ordering needed before ingestion

A pragmatic tool is HLC (Hybrid Logical Clock):
wall clock + logical counter, merged on receive to preserve causality-ish order.

3.1 Minimal HLC (stdlib-only)

(Python 3.10+; stdlib only.)

from __future__ import annotations

from dataclasses import dataclass
import time
from typing import Tuple


@dataclass(frozen=True)
class HLCTimestamp:
    physical_ns: int
    logical: int

    def as_tuple(self) -> Tuple[int, int]:
        return (self.physical_ns, self.logical)

    def to_string(self) -> str:
        # Fixed-width hex to avoid lexicographic surprises.
        return f"{self.physical_ns:016x}-{self.logical:016x}"


class HLC:
    """
    Hybrid Logical Clock (minimal).
    - now(): local tick (clamp + logical counter)
    - observe(remote): merge remote timestamp to preserve order-ish causality
    """
    def __init__(self) -> None:
        self._p = 0
        self._l = 0

    def now(self) -> HLCTimestamp:
        pt = time.time_ns()  # wall clock; may jump
        if pt > self._p:
            self._p = pt
            self._l = 0
        else:
            self._l += 1
        return HLCTimestamp(self._p, self._l)

    def observe(self, remote: HLCTimestamp) -> HLCTimestamp:
        pt = time.time_ns()
        rp, rl = remote.physical_ns, remote.logical

        p = max(pt, self._p, rp)
        if p == self._p == rp:
            l = max(self._l, rl) + 1
        elif p == self._p:
            l = self._l + 1
        elif p == rp:
            l = rl + 1
        else:
            l = 0

        self._p, self._l = p, l
        return HLCTimestamp(self._p, self._l)

3.1.1 Operational caveats (don’t skip these)

  • this minimal implementation stores state in memory; reboot can regress → persist last HLC, or sync on startup by observing peers/aggregator
  • a clock-broken node can inject “far future” timestamps → enforce max_clock_skew and quarantine remote timestamps beyond it (log it with reason codes like clock_anomaly:remote_future_exceeded)

3.2 How HLC should be used (best practice)

  • compare HLC as (physical_ns, logical) (tuple order)
  • if you need stable total order, add a tie-breaker (node_id) so sort key is (p, l, node_id)
  • use HLC as provisional order, then assign final seq at ingestion if possible
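The tie-breaker bullet in action (hypothetical timestamps; Python tuples compare lexicographically, so sorting needs no custom comparator):

```python
# Sort key (physical_ns, logical, node_id) gives a stable total order.
stamps = [
    (1_700_000_000_000_000_000, 2, "node-b"),
    (1_700_000_000_000_000_000, 2, "node-a"),
    (1_699_999_999_000_000_000, 7, "node-c"),
]
stamps.sort()
# node-c wins on physical time; node-a beats node-b only via the node_id tie-break.
assert [s[2] for s in stamps] == ["node-c", "node-a", "node-b"]
```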

Practical pattern:

Local: HLC for provisional order
Aggregate: seq for final audit/replay order

You don’t need Spanner for this level of determinism.


4) Long-running operations: stop writing “sync loops,” use state machines

Even with seq / watermark / HLC, long-running drivers (A/B rollouts, canaries, jobs) die if they are written as a single loop.

Processes restart. Deploys happen. Workers crash.

So treat time-long operations as state transitions and persist state.

4.1 Split into two layers: Evaluator vs Orchestrator

  • Evaluator (pure-ish): snapshot + policy → decision. Same input → same output (replayable).
  • Orchestrator (state machine): persists run_id state + resume conditions.

This solves:

  • re-run: state retains stage + digest pins
  • duplicates: enforce run_id + stage uniqueness / idempotency
  • interrupted progress: write-ahead transitions
  • resume: DEGRADE becomes re-enterable hold (wait for snapshot/approval/time)

4.2 Minimal orchestrator state

{
  "run_id": "exp-42:B:2026-02-18",
  "stage": "CANARY_25",
  "policy_version": "2026-02-18",
  "last_snapshot_digest": "sha256:...",
  "last_decision": {
    "verdict": "DEGRADE",
    "reason_codes": ["observation_stale:watermark_skew"],
    "missing": ["p95_latency_ms_5m"]
  },
  "resume": {
    "resume_token": "opaque:rsn_...",
    "requested_actions": [
      {"name": "collect_metric", "params": {"metric": "p95_latency_ms_5m"}},
      {"name": "rerun_gate_check", "params": {"stage": "CANARY_25"}}
    ],
    "next_check_at": "2026-02-18T10:45:00+09:00"
  }
}

next_check_at is operational scheduling time, not ordering authority.

4.3 Beware dual writes: orchestrator state vs ordering log

If your orchestrator state and your seq event log are in different stores, you get classic dual-write failure modes:

  • state updated but event not logged → “it happened” without replay proof
  • event logged but state not updated → duplicates and confusion

Two standard fixes:

A) Event sourcing: write the state-transition event into the seq log as the source of truth; derive orchestrator state as a projection.

B) Outbox: write state update + “planned event” into the same DB tx; forward via outbox to the seq log.

If you use outbox, receiver-side idempotency is mandatory:

  • define an event_id (idempotency key)
  • enforce UNIQUE(event_id) on the receiving log
  • INSERT ... ON CONFLICT DO NOTHING
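A runnable sketch of receiver-side idempotency, using stdlib SQLite as a stand-in for the ordered log (INSERT OR IGNORE is SQLite's spelling of ON CONFLICT DO NOTHING):

```python
import sqlite3

# event_id is the idempotency key; retried outbox deliveries become no-ops.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE seq_log (
        seq      INTEGER PRIMARY KEY AUTOINCREMENT,
        event_id TEXT NOT NULL UNIQUE,
        payload  TEXT NOT NULL
    )
""")

def append_event(event_id: str, payload: str) -> None:
    db.execute(
        "INSERT OR IGNORE INTO seq_log (event_id, payload) VALUES (?, ?)",
        (event_id, payload),
    )

append_event("evt-1", '{"kind":"stage_transition"}')
append_event("evt-1", '{"kind":"stage_transition"}')  # duplicate delivery
assert db.execute("SELECT COUNT(*) FROM seq_log").fetchone()[0] == 1
```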

And keep orchestrator transitions CAS-style:

UPDATE orchestrator_runs
SET stage = :to_stage,
    last_event_id = :event_id,
    updated_at = now()
WHERE run_id = :run_id
  AND stage = :from_stage;

4.3.1 Outbox transfer lag can become your “time source”

If downstream uses the log frontier as a watermark, outbox lag can stall the watermark and drive up DEGRADE rates.

So treat outbox transfer lag as a first-class SLO signal:

  • outbox_oldest_age_seconds
  • outbox_backlog_count
  • outbox_to_seq_p95_seconds

Reason codes:

  • pipeline_stale:outbox_transfer_lag

Policy example (conceptual):

{
  "pipeline_policy": {
    "max_outbox_oldest_age_seconds": 30,
    "on_over_budget": "DEGRADE"
  }
}

Also consider catch-up safety:

  • rate-limit catch-up
  • optionally cap watermark advance rate (catch-up budget)
  • reason-code budget exceedance for deterministic behavior

5) Summary: deterministic time is not “correct time,” it’s “ordering authority”

  • wall clocks drift and lie; don’t use them as truth
  • define truth as an ordering authority:

    • seq (single-writer) as the baseline
    • watermark contracts to decide “evaluation admissibility” across sources
    • HLC for provisional ordering under partition/offline, then finalize with seq if possible
  • make replay real with:

    • digests (input/snapshot/policy)
    • evaluator version pins
    • append-only logs + correction events
  • make long-running “time” operable with:

    • state machines (orchestrator)
    • idempotency + outbox/event sourcing
    • SLOs + reason codes

Time in distributed systems is not physics.
It’s agreement and contracts.

TrueTime is powerful, but you don’t need it to stop your ops from turning into guesswork.

You need a pinned authority for ordering — and the discipline to treat “stale/missing” as first-class (DEGRADE), measured by SLOs.
