kanaria007

Deterministic Time: Don’t Break Ordering and Replay in Distributed Systems

In distributed systems, what we actually want is not “the correct time.”

We want two things:

  • Ordering: don’t lose which came first.
  • Replay: later, explain the same decision with the same grounds and procedure.

But in real systems we casually lean on created_at (wall clock).
And then everything breaks:

  • NTP adjustments / VM migration / suspend-resume make clocks jump backward or forward
  • Prometheus / logs / warehouse don’t align to the “same 5-minute window”
  • log order collapses → postmortems become “guessing + meetings”

If you have TrueTime-like “strong time,” life is easier.
But here’s the trick:

Don’t treat time as truth. Treat ordering authority as truth — and pin it to logs.

That’s what I mean by deterministic time.

(Target reader: engineers/SREs who have touched Kafka/Prometheus/etc. Code examples are minimal and portable.)


0) Split time into three kinds (mixing them causes accidents)

Treat these as different things:

  • event_time: when the event happened (client/device/source)
  • ingest_time: when the system received/ingested it (server/pipeline boundary)
  • decision_time: the authority used to decide ordering (e.g., seq, watermark conditions, promotion gate) (Not necessarily a wall-clock timestamp.)

Wall clocks (created_at) are mostly for display and search.
They are not stable enough to be the ground truth for ordering or replay.


1) The minimal solution: single-writer sequence (seq)

The strongest, simplest option is:

Pin ordering to a monotonically increasing seq issued by a single authority (e.g., a DB sequence).
Time is not truth. Numbering is truth.

1.1 NDJSON log: seq is the truth of order

{
  "seq": 10498231,
  "ingest_time": "2026-02-20T00:12:01+09:00",
  "event_time": "2026-02-20T00:11:57+09:00",
  "kind": "rollout_decision",
  "run_id": "exp-42:B:2026-02-18",
  "stage": "CANARY_25",
  "decision": { "verdict": "DEGRADE", "reason_codes": ["observation_stale:watermark_skew"] }
}
  • Ordering is defined by seq ascending.
  • event_time may be wrong or skewed — that’s fine; it’s reference info.
  • ingest_time is for measuring delay.

1.2 Why it works

  • node clocks can drift; the sequencer still serializes
  • you can replay in the same order later
  • “which time is correct?” debates disappear — seq is the authority
  • seq doesn’t have to be gapless; treat it as an ordering key, not a count
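As a tiny sketch (hypothetical events; field names follow the NDJSON example above), replay order is just a sort on seq, gaps included:

```python
import json

# Hypothetical NDJSON events; note the gap between 10498233 and 10498240.
# seq is an ordering key, not a count, so gaps are fine.
lines = [
    '{"seq": 10498233, "kind": "rollout_decision"}',
    '{"seq": 10498231, "kind": "rollout_decision"}',
    '{"seq": 10498240, "kind": "correction"}',
]

def replay_order(ndjson_lines: list[str]) -> list[dict]:
    """Ordering is defined by seq ascending; wall clocks never enter the sort."""
    return sorted((json.loads(line) for line in ndjson_lines),
                  key=lambda e: e["seq"])

print([e["seq"] for e in replay_order(lines)])
```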

1.3 Where seq comes from

  • PostgreSQL: BIGSERIAL, GENERATED AS IDENTITY, sequences
  • MySQL: AUTO_INCREMENT
  • Kafka: partition offset (very strong within a partition, not global)
  • SQLite (not recommended as a distributed sequencer): fine for local tooling; audit-grade monotonicity has edge cases (rebuilds / reuse) unless you design for append-only

1.3.1 Tradeoffs: bottleneck and SPOF

A single sequencer can become:

  • a throughput bottleneck
  • a single point of failure (ordering stops)

The move is not “abandon determinism.” It’s:

Decide how global your ordering must be.

Practical levels:

  • (A) Key-scoped total order: tenant_id / run_id / aggregate_id each has its own ordering lane (This is enough for many operational decisions.)
  • (B) Partition order: multiple lanes (Kafka partitions) with strong lane-local order
  • (C) Final consolidation (only when needed): create a final ordering key downstream for audits/reports
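A minimal sketch of level (A), assuming an in-process counter per key; a real system would back each lane with a DB sequence:

```python
from collections import defaultdict
from itertools import count

# One independent, monotonically increasing lane per key
# (tenant_id / run_id / aggregate_id). In-process sketch only.
lanes = defaultdict(lambda: count(1))

def next_seq(key: str) -> int:
    """Issue the next seq for this key's ordering lane."""
    return next(lanes[key])

assert next_seq("run:exp-42") == 1
assert next_seq("run:exp-42") == 2
assert next_seq("run:exp-99") == 1  # other keys are unaffected
```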

1.3.2 Boundary: what this won’t solve

This post focuses on:

  • total order within a key
  • strong order within a partition
  • final ordering for replay/audit

If you require real-time external consistency across partitions (global total order + distributed commit semantics), you’re now in distributed transaction territory (2PC / consensus / Spanner-class designs). This post gives you a replay/audit foundation, not a replacement.


1.4 Replay design: same input, same order, same decision

Even if order is pinned by seq, ops still fails if logs are not replayable.

Replay is not “we read logs and feel convinced.”
Replay is:

Run the same evaluator with the same inputs in the same order, and reproduce the same verdict.

1.4.0 Canonicalization: if you can’t hash it stably, you can’t replay it

If you plan to pin input_digest / snapshot_digest, you must decide how bytes become a digest.

A common failure mode is “we hashed JSON,” and then:

  • key order changes,
  • floats render differently,
  • null/empty fields drift,
  • pretty-print vs minified changes the bytes,

…and your digests become meaningless.

Minimal rules that already prevent most drift:

  • Canonical JSON: stable key ordering, no insignificant whitespace
  • Number contract: avoid floats where you can; if you must use them, define rounding/formatting
  • Explicit null policy: define whether missing vs null are equivalent (usually: they are not)
  • Stable encoding: UTF-8 everywhere

If you want one practical baseline, use a strict JSON Canonicalization Scheme (JCS-like) and treat “digest mismatch” as a deterministic classification (snapshot_mismatch), not a debate.
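A minimal Python sketch of that baseline (sorted keys, no insignificant whitespace, UTF-8). This is JCS-like, not a full JCS implementation, and the number/null contracts still have to come from your policy:

```python
import hashlib
import json

def canonical_digest(obj) -> str:
    """Digest of a canonicalized JSON value: sorted keys, minified, UTF-8."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Key order and pretty-printing no longer change the digest:
assert canonical_digest({"b": 1, "a": 2}) == canonical_digest({"a": 2, "b": 1})
```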

1.4.1 The three conditions

A) Same input

  • input_digest (digest of canonicalized input)
  • schema_version
  • policy_version / policy_digest

B) Same order

  • seq (final order authority)
  • HLC may be useful as provisional order, but final audit order is best pinned by seq

C) Same decision

  • evaluator_version (git sha / build id)
  • decision_fn_digest (e.g., hash of {policy_digest + evaluator_version})
  • keep nondeterministic dependencies out of the evaluator (see below)

1.4.2 Minimal fields to log at decision points

{
  "seq": 10498231,
  "kind": "rollout_decision",
  "run_id": "exp-42:B:2026-02-18",
  "stage": "CANARY_25",

  "schema_version": "v1",
  "policy_version": "2026-02-18",
  "policy_digest": "sha256:...",

  "evaluator_version": "git:3f2c9a1",
  "decision_fn_digest": "sha256:...",

  "input_digest": "sha256:...",
  "snapshot_digest": "sha256:...",

  "decision": { "verdict": "DEGRADE", "reason_codes": ["observation_stale:watermark_skew"] }
}

Two pins matter most:

  • snapshot_digest: fingerprint of the observation snapshot (metrics/log aggregates) used to decide
  • decision_fn_digest: proves “the rulebook didn’t change”

Store the snapshot itself immutably (object storage / warehouse) and fetch by digest during replay.

Practical note: you don’t need full snapshots for every request.
Keep heavy snapshots for decision points (promote/stop/DEGRADE), and keep lightweight references elsewhere.

1.4.3 Snapshot retention: Hot / Cold / Long-term (keep replay affordable)

Replayability dies when snapshots disappear, or when keeping them is so expensive that teams quietly stop.

A simple retention split keeps it realistic:

  • Hot (hours–days): full snapshots for active runs (fast iteration, quick incident response)
  • Cold (weeks–months): compressed snapshots or rollups for audits and “why did this gate fire?”
  • Long-term (months–years, selective): only for high-stakes domains (RML-3-ish decisions), or for representative incidents

The key is to make retention a policy decision (with costs), not an accident.
If a decision is audit-grade, its snapshot must survive long enough to replay it.

1.4.4 Replay procedure

  1. read events in seq order
  2. fetch observation snapshots by snapshot_digest
  3. run evaluator fixed to policy_version + evaluator_version
  4. verify verdict + reason codes match

If mismatch occurs, classify deterministically:

  • policy_mismatch
  • evaluator_mismatch
  • snapshot_mismatch
  • nondeterminism_leak
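Steps 3–4 plus the classification can be sketched as a pure function. Field names follow the decision-log example above; checking from the "rulebook" outward is a judgment call, not a standard:

```python
def classify(logged: dict, replayed: dict) -> str:
    """Deterministically classify a replay result (sketch)."""
    if logged["policy_digest"] != replayed["policy_digest"]:
        return "policy_mismatch"
    if logged["evaluator_version"] != replayed["evaluator_version"]:
        return "evaluator_mismatch"
    if logged["snapshot_digest"] != replayed["snapshot_digest"]:
        return "snapshot_mismatch"
    if logged["decision"] != replayed["decision"]:
        # Same pins, different verdict: nondeterminism leaked into the evaluator.
        return "nondeterminism_leak"
    return "match"
```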

1.4.5 Common “nondeterminism leaks” (and how to seal them)

  • using now() in the decision path → use seq / watermark conditions as decision anchors; wall clock is display-only
  • randomness (sampling) → derive seed from run_id and log it
  • float rounding → contract rounding rules (and treat NaN as DEGRADE)
  • calling external services inside decisions → snapshot first; evaluator reads only the snapshot via digest
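For the randomness leak, one way to seal it (the seed-derivation scheme here is an assumption; pick one, pin it, and log it):

```python
import hashlib
import random

def rng_for_run(run_id: str) -> random.Random:
    """Deterministic RNG: seed = first 8 bytes of sha256(run_id)."""
    seed = int.from_bytes(hashlib.sha256(run_id.encode("utf-8")).digest()[:8],
                          "big")
    return random.Random(seed)

# Same run_id -> same sample sequence, regardless of host or wall clock.
assert rng_for_run("exp-42:B:2026-02-18").random() == \
       rng_for_run("exp-42:B:2026-02-18").random()
```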

Replayability is not about heroics. It’s mostly log design.


2) Multi-source alignment: build it with a Watermark contract

The next pain point is multi-source evaluation:

  • Primary metrics (logs/warehouse) and guardrails (Prometheus) don’t align
  • you either miss violations or DEGRADE forever

Don’t chase perfect synchronization.

Define “evaluation is valid up to here” as a watermark contract.

2.1 Include watermark in the snapshot

{
  "snapshot_id": "snap-2026-02-18T10:15:00+09:00",
  "collected_at": "2026-02-18T10:15:12+09:00",
  "window": "5m",
  "watermark_event_time": "2026-02-18T10:09:30+09:00",
  "sources": [
    {"name": "prometheus", "watermark_event_time": "2026-02-18T10:10:00+09:00"},
    {"name": "bigquery",   "watermark_event_time": "2026-02-18T10:09:30+09:00"}
  ],
  "freshness_seconds": 72,
  "watermark_skew_seconds": 30,
  "metrics": {
    "conversion_rate": 0.131,
    "error_rate_5m": 0.013,
    "p95_latency_ms_5m": 420
  }
}

2.2 Decide “valid enough” by budgets (deterministically)

Example rules:

  • freshness_seconds <= max_freshness_seconds
  • watermark_skew_seconds <= max_watermark_skew_seconds

If budget is exceeded, DEGRADE (hold safely) instead of guessing.

{
  "missing_data_policy": {
    "max_freshness_seconds": 120,
    "max_watermark_skew_seconds": 60,
    "on_stale": "DEGRADE"
  }
}

The goal is not “true time.”
It’s defining decision admissibility under replayable conditions.
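Put together, the budget check is a pure function of snapshot plus policy (field names follow the JSON examples above):

```python
def admissible(snapshot: dict, policy: dict) -> tuple[bool, list[str]]:
    """Deterministic admissibility: stale observations yield DEGRADE reasons."""
    reasons = []
    if snapshot["freshness_seconds"] > policy["max_freshness_seconds"]:
        reasons.append("observation_stale:over_freshness_budget")
    if snapshot["watermark_skew_seconds"] > policy["max_watermark_skew_seconds"]:
        reasons.append("observation_stale:watermark_skew")
    return (not reasons, reasons)

policy = {"max_freshness_seconds": 120, "max_watermark_skew_seconds": 60}
assert admissible({"freshness_seconds": 72, "watermark_skew_seconds": 30},
                  policy)[0]
assert admissible({"freshness_seconds": 300, "watermark_skew_seconds": 30},
                  policy) == (False, ["observation_stale:over_freshness_budget"])
```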

2.2.1 Late data: make the policy explicit (IGNORE / RECOMPUTE / COMPENSATE)

Late data is inevitable. If your system handles it “by vibes,” replayability dies.

Pick one mode and pin it in policy:

A) IGNORE
Drop data beyond lateness budget. Easy to operate; residual error exists but determinism holds.

B) RECOMPUTE
Recompute and update the decision itself. Heavy: you need decision_epoch to track which version is “current.”

C) COMPENSATE
Don’t rewrite past decisions; append a correction event later. Often best for audit-friendly ops.

Policy example:

{
  "late_data_policy": {
    "lateness_budget_seconds": 300,
    "mode": "COMPENSATE",
    "on_over_budget": "IGNORE"
  }
}

Correction event example (append-only):

{
  "seq": 10498500,
  "kind": "correction",
  "corrects": { "seq": 10498231, "snapshot_digest": "sha256:old..." },
  "new_snapshot_digest": "sha256:new...",
  "reason": "late_data_within_budget",
  "effect": { "metric": "conversion_rate", "delta": 0.0004 }
}

Rule: do not rewrite old log rows. Append corrections.

2.3 Operate watermarks with reason codes + SLOs

Watermark is not a feeling. Make it measurable:

  • freshness_seconds
  • watermark_skew_seconds
  • (optional) completeness

Fix vocabulary via typed reason codes:

  • observation_missing:required_source
  • observation_missing:required_metric
  • observation_stale:over_freshness_budget
  • observation_stale:watermark_skew
  • observation_invalid:watermark_parse

Then manage it like ops:

  • staleness rate SLO
  • time-to-fresh SLO

2.4 How to compute a watermark (logs / metrics / traces)

2.4.1 Logs (stream → warehouse)

Watermark is a frontier, not a timestamp.

  • track per-partition processed offsets
  • map offsets to effective event-time (careful: client clocks can lie)
  • watermark = low watermark = min(partition_frontier_times)

If event_time is client-provided, separate:

  • event_time_raw (do not mutate)
  • event_time_effective (used for evaluation; derived by policy)
  • event_time_policy (ACCEPT / CLAMP / QUARANTINE)

If offset→event_time mapping is expensive, fall back to conservative ingest-time estimates and tighten skew budgets so you DEGRADE earlier.

For replay, keep frontier evidence in the snapshot:

  • source_offsets (partition → offset)
  • job_run_id
  • query_digest
  • counts of CLAMP/QUARANTINE
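The frontier rule itself is one line; a sketch with hypothetical per-partition frontiers (epoch seconds):

```python
def low_watermark(partition_frontiers: dict[str, int]) -> int:
    """Watermark = min over per-partition frontier times (the slowest lane)."""
    return min(partition_frontiers.values())

frontiers = {"p0": 1_770_000_300, "p1": 1_770_000_240, "p2": 1_770_000_290}
assert low_watermark(frontiers) == 1_770_000_240  # pinned to the slowest partition
```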

2.4.2 Metrics (Prometheus-like)

Prometheus is sample-time driven. A conservative watermark can be:

  • watermark = min(scrape_latest_ts among required series)

Missing required series is naturally a reason code.

2.4.3 Traces (OTel)

Traces are incomplete and delayed by nature. Don’t “wait forever.”

Contract closure conditions:

  • required spans present
  • trace_end_time + grace_period <= decision_anchor_time (anchor pinned in snapshot)
  • for async workflows: prefer closure markers over “root span ended”

Reason codes:

  • observation_incomplete:trace_not_closed
  • observation_incomplete:required_spans_missing
  • observation_stale:trace_grace_exceeded

Common rule: watermark is always a low watermark (align to the slowest frontier).


3) If you can’t place a single writer: HLC for practical ordering

Sometimes you can’t route everything through one sequencer:

  • offline edge/mobile
  • partitions and temporary disconnects
  • local ordering needed before ingestion

A pragmatic tool is HLC (Hybrid Logical Clock):
wall clock + logical counter, merged on receive to preserve causality-ish order.

3.1 Minimal HLC (stdlib-only)

(Python 3.10+; stdlib only.)

from __future__ import annotations

from dataclasses import dataclass
import time
from typing import Tuple


@dataclass(frozen=True)
class HLCTimestamp:
    physical_ns: int
    logical: int

    def as_tuple(self) -> Tuple[int, int]:
        return (self.physical_ns, self.logical)

    def to_string(self) -> str:
        # Fixed-width hex to avoid lexicographic surprises.
        return f"{self.physical_ns:016x}-{self.logical:016x}"


class HLC:
    """
    Hybrid Logical Clock (minimal).
    - now(): local tick (clamp + logical counter)
    - observe(remote): merge remote timestamp to preserve order-ish causality
    """
    def __init__(self) -> None:
        self._p = 0
        self._l = 0

    def now(self) -> HLCTimestamp:
        pt = time.time_ns()  # wall clock; may jump
        if pt > self._p:
            self._p = pt
            self._l = 0
        else:
            self._l += 1
        return HLCTimestamp(self._p, self._l)

    def observe(self, remote: HLCTimestamp) -> HLCTimestamp:
        pt = time.time_ns()
        rp, rl = remote.physical_ns, remote.logical

        p = max(pt, self._p, rp)
        if p == self._p == rp:
            l = max(self._l, rl) + 1
        elif p == self._p:
            l = self._l + 1
        elif p == rp:
            l = rl + 1
        else:
            l = 0

        self._p, self._l = p, l
        return HLCTimestamp(self._p, self._l)

3.1.1 Operational caveats (don’t skip these)

  • this minimal implementation stores state in memory; reboot can regress → persist last HLC, or sync on startup by observing peers/aggregator
  • a clock-broken node can inject “far future” timestamps → enforce max_clock_skew and quarantine remote timestamps beyond it (log it with reason codes like clock_anomaly:remote_future_exceeded)

3.2 How HLC should be used (best practice)

  • compare HLC as (physical_ns, logical) (tuple order)
  • if you need stable total order, add a tie-breaker (node_id) so sort key is (p, l, node_id)
  • use HLC as provisional order, then assign final seq at ingestion if possible
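The tie-breaker bullet in action (hypothetical timestamps; Python tuples compare lexicographically, so sorting needs no custom comparator):

```python
# Sort key (physical_ns, logical, node_id) gives a stable total order.
stamps = [
    (1_700_000_000_000_000_000, 2, "node-b"),
    (1_700_000_000_000_000_000, 2, "node-a"),
    (1_699_999_999_000_000_000, 7, "node-c"),
]
stamps.sort()
# node-c wins on physical time; node-a beats node-b only via the node_id tie-break.
assert [s[2] for s in stamps] == ["node-c", "node-a", "node-b"]
```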

Practical pattern:

Local: HLC for provisional order
Aggregate: seq for final audit/replay order

You don’t need Spanner for this level of determinism.


4) Long-running operations: stop writing “sync loops,” use state machines

Even with seq / watermark / HLC, long-running drivers (A/B rollouts, canaries, jobs) die if they are written as a single loop.

Processes restart. Deploys happen. Workers crash.

So treat time-long operations as state transitions and persist state.

4.1 Split into two layers: Evaluator vs Orchestrator

  • Evaluator (pure-ish): snapshot + policy → decision. Same input → same output (replayable).
  • Orchestrator (state machine): persists run_id state + resume conditions.

This solves:

  • re-run: state retains stage + digest pins
  • duplicates: enforce run_id + stage uniqueness / idempotency
  • interrupted progress: write-ahead transitions
  • resume: DEGRADE becomes re-enterable hold (wait for snapshot/approval/time)

4.2 Minimal orchestrator state

{
  "run_id": "exp-42:B:2026-02-18",
  "stage": "CANARY_25",
  "policy_version": "2026-02-18",
  "last_snapshot_digest": "sha256:...",
  "last_decision": {
    "verdict": "DEGRADE",
    "reason_codes": ["observation_stale:watermark_skew"],
    "missing": ["p95_latency_ms_5m"]
  },
  "resume": {
    "resume_token": "opaque:rsn_...",
    "requested_actions": [
      {"name": "collect_metric", "params": {"metric": "p95_latency_ms_5m"}},
      {"name": "rerun_gate_check", "params": {"stage": "CANARY_25"}}
    ],
    "next_check_at": "2026-02-18T10:45:00+09:00"
  }
}

next_check_at is operational scheduling time, not ordering authority.

4.3 Beware dual writes: orchestrator state vs ordering log

If your orchestrator state and your seq event log are in different stores, you get classic dual-write failure modes:

  • state updated but event not logged → “it happened” without replay proof
  • event logged but state not updated → duplicates and confusion

Two standard fixes:

A) Event sourcing: write the state-transition event into the seq log as the source of truth; derive orchestrator state as a projection.

B) Outbox: write state update + “planned event” into the same DB tx; forward via outbox to the seq log.

If you use outbox, receiver-side idempotency is mandatory:

  • define an event_id (idempotency key)
  • enforce UNIQUE(event_id) on the receiving log
  • INSERT ... ON CONFLICT DO NOTHING
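A runnable sketch of receiver-side idempotency, using stdlib SQLite as a stand-in for the ordered log (INSERT OR IGNORE is SQLite's spelling of ON CONFLICT DO NOTHING):

```python
import sqlite3

# event_id is the idempotency key; retried outbox deliveries become no-ops.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE seq_log (
        seq      INTEGER PRIMARY KEY AUTOINCREMENT,
        event_id TEXT NOT NULL UNIQUE,
        payload  TEXT NOT NULL
    )
""")

def append_event(event_id: str, payload: str) -> None:
    db.execute(
        "INSERT OR IGNORE INTO seq_log (event_id, payload) VALUES (?, ?)",
        (event_id, payload),
    )

append_event("evt-1", '{"kind":"stage_transition"}')
append_event("evt-1", '{"kind":"stage_transition"}')  # duplicate delivery
assert db.execute("SELECT COUNT(*) FROM seq_log").fetchone()[0] == 1
```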

And keep orchestrator transitions CAS-style:

UPDATE orchestrator_runs
SET stage = :to_stage,
    last_event_id = :event_id,
    updated_at = now()
WHERE run_id = :run_id
  AND stage = :from_stage;

4.3.1 Outbox transfer lag can become your “time source”

If downstream uses the log frontier as a watermark, outbox lag can stall the watermark and drive up DEGRADE rates.

So treat outbox transfer lag as a first-class SLO signal:

  • outbox_oldest_age_seconds
  • outbox_backlog_count
  • outbox_to_seq_p95_seconds

Reason codes:

  • pipeline_stale:outbox_transfer_lag

Policy example (conceptual):

{
  "pipeline_policy": {
    "max_outbox_oldest_age_seconds": 30,
    "on_over_budget": "DEGRADE"
  }
}

Also consider catch-up safety:

  • rate-limit catch-up
  • optionally cap watermark advance rate (catch-up budget)
  • reason-code budget exceedance for deterministic behavior

5) Summary: deterministic time is not “correct time,” it’s “ordering authority”

  • wall clocks drift and lie; don’t use them as truth
  • define truth as an ordering authority:

    • seq (single-writer) as the baseline
    • watermark contracts to decide “evaluation admissibility” across sources
    • HLC for provisional ordering under partition/offline, then finalize with seq if possible
  • make replay real with:

    • digests (input/snapshot/policy)
    • evaluator version pins
    • append-only logs + correction events
  • make long-running “time” operable with:

    • state machines (orchestrator)
    • idempotency + outbox/event sourcing
    • SLOs + reason codes

Time in distributed systems is not physics.
It’s agreement and contracts.

TrueTime is powerful, but you don’t need it to stop your ops from turning into guesswork.

You need a pinned authority for ordering — and the discipline to treat “stale/missing” as first-class (DEGRADE), measured by SLOs.
