<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Artem Gurtovoi</title>
    <description>The latest articles on DEV Community by Artem Gurtovoi (@temich).</description>
    <link>https://dev.to/temich</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2708971%2Fbac20274-170c-45c5-897f-8098e913d9a2.jpeg</url>
      <title>DEV Community: Artem Gurtovoi</title>
      <link>https://dev.to/temich</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/temich"/>
    <language>en</language>
    <item>
      <title>Log Ordering in Distributed Systems</title>
      <dc:creator>Artem Gurtovoi</dc:creator>
      <pubDate>Tue, 25 Nov 2025 07:16:41 +0000</pubDate>
      <link>https://dev.to/temich/log-ordering-in-distributed-systems-3bg0</link>
      <guid>https://dev.to/temich/log-ordering-in-distributed-systems-3bg0</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Extend trace context with a Lamport clock scoped to an execution branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Definition
&lt;/h2&gt;

&lt;p&gt;Ordering log events produced across distributed systems is fundamentally constrained by the nature of independent physical clocks. Wall-clock timestamps cannot provide a reliable global sequence, because each machine maintains its own oscillator with unavoidable drift. Even under NTP or similar protocols, timestamp discrepancies accumulate continuously due to rate differences, network delays, and local scheduling effects.&lt;/p&gt;

&lt;p&gt;Any distributed operation introduces uncertainty in temporal order. Network latency, buffering, batching, and concurrency all contribute to the inability to determine whether two events across services occurred in a particular order when relying solely on physical time. The concept of temporal ordering is meaningful only within a single clock domain; across independent clocks, timestamps cannot be meaningfully compared.&lt;/p&gt;

&lt;p&gt;Attempts to merge logs by wall-clock timestamps therefore yield interleavings that may not reflect causality. The resulting timeline may hide the true execution structure of the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations of Timestamp-Based Approaches
&lt;/h2&gt;

&lt;p&gt;Conventional log ingestion and aggregation systems typically sort events using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the timestamp recorded by the service, or
&lt;/li&gt;
&lt;li&gt;the timestamp of ingestion by the collector.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both approaches suffer inherent limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clock drift is indistinguishable from communication delay.&lt;/strong&gt; No observation can determine whether a later timestamp originated from drift or from slow delivery.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamps encode no causal information.&lt;/strong&gt; An event with a greater timestamp may not depend on an event with a smaller one.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collectors alter event order.&lt;/strong&gt; Batching and transport buffering introduce additional nondeterministic reordering unrelated to causality.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sorting heuristics obscure underlying issues.&lt;/strong&gt; When tools reorder logs using timestamps, clock drift becomes invisible; without reordering, the sequence appears chaotic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because globally correct ordering cannot be derived from wall-clock time, most platforms expose ordering policies to users, implicitly acknowledging the limitations of timestamp-based ordering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution: Logical Clock Propagation
&lt;/h2&gt;

&lt;p&gt;A more robust approach is to derive ordering from &lt;strong&gt;causality&lt;/strong&gt; rather than physical time. This is achieved by propagating two logical values—&lt;code&gt;branch&lt;/code&gt; and &lt;code&gt;sequence&lt;/code&gt;—along the execution path of each trace.&lt;/p&gt;

&lt;h3&gt;
  
  
  Overview
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2cclx6xr52hc1g1krht.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2cclx6xr52hc1g1krht.jpg" alt=" " width="800" height="803"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each request carries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;trace_id&lt;/code&gt; – uniquely identifies the execution.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;branch&lt;/code&gt; – a hierarchical identifier describing the request’s execution path through the system.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sequence&lt;/code&gt; – a monotonically increasing counter local to the current branch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every log entry produced within that request includes &lt;code&gt;(trace_id, branch, sequence)&lt;/code&gt;, allowing deterministic reconstruction of execution order.&lt;/p&gt;
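&lt;p&gt;A minimal Python sketch of such a context object (the class and method names are illustrative, not part of the article):&lt;/p&gt;

```python
import uuid

class LogContext:
    """Carries (trace_id, branch, sequence) along one execution branch."""

    def __init__(self, trace_id=None, branch="/", sequence=0):
        self.trace_id = trace_id or uuid.uuid4().hex
        self.branch = branch
        self.sequence = sequence

    def advance(self):
        # Increment the Lamport counter before each causally
        # significant step (service call, DB access, ...).
        self.sequence += 1

    def annotate(self, message):
        # Attach the ordering metadata to a log entry.
        return {
            "trace_id": self.trace_id,
            "branch": self.branch,
            "sequence": self.sequence,
            "message": message,
        }

ctx = LogContext(trace_id="abc123")
first = ctx.annotate("received request")
ctx.advance()
second = ctx.annotate("calling downstream service")
```

&lt;p&gt;In practice the three fields would travel with the normal trace context (e.g. as extra propagated headers) rather than as an in-process object.&lt;/p&gt;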

&lt;h3&gt;
  
  
  Sequence Increment
&lt;/h3&gt;

&lt;p&gt;The sequence value increases with each causally significant step performed along the same branch: service-to-service calls, internal operations, or downstream interactions such as database access.&lt;/p&gt;

&lt;p&gt;Events within a branch are totally ordered by the sequence field.&lt;/p&gt;

&lt;h3&gt;
  
  
  Branch Creation
&lt;/h3&gt;

&lt;p&gt;When execution diverges into parallel or independent subpaths, a new branch identifier is created by extending the current branch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;root branch: &lt;code&gt;/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;first parallel branch: &lt;code&gt;/0/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;second parallel branch: &lt;code&gt;/1/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each new branch initializes its own sequence counter starting at zero.&lt;/p&gt;

&lt;p&gt;Sibling branches represent concurrent execution and therefore remain unordered relative to each other.&lt;/p&gt;
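&lt;p&gt;Branch derivation can be sketched as a one-line helper (the &lt;code&gt;fork&lt;/code&gt; name is hypothetical; only the resulting identifiers are specified above):&lt;/p&gt;

```python
def fork(branch, child_index):
    # Extend the current branch with the child's ordinal;
    # each child then starts its own sequence counter at 0.
    return f"{branch}{child_index}/"

root = "/"
left = fork(root, 0)      # first parallel branch
right = fork(root, 1)     # second parallel branch
nested = fork(right, 2)   # descends from the second branch
```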

&lt;h2&gt;
  
  
  Observations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Causal Ordering
&lt;/h3&gt;

&lt;p&gt;The branch–sequence mechanism encodes &lt;strong&gt;happens-before&lt;/strong&gt; relations explicitly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Events within the same branch are totally ordered by &lt;code&gt;sequence&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;Branch prefixes encode ancestry (e.g., &lt;code&gt;/1/2/&lt;/code&gt; descends from &lt;code&gt;/1/&lt;/code&gt;).
&lt;/li&gt;
&lt;li&gt;Sibling branches (e.g., &lt;code&gt;/0/&lt;/code&gt; and &lt;code&gt;/1/&lt;/code&gt;) represent concurrency and therefore remain unordered.&lt;/li&gt;
&lt;/ul&gt;
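&lt;p&gt;The three rules above can be sketched as a comparator (illustrative code, not from the article; as a simplification it treats any ancestor-branch event as earlier, regardless of where on the parent the fork occurred):&lt;/p&gt;

```python
def causal_order(a, b):
    # a and b are (branch, sequence) pairs. Returns -1 if a
    # happens-before b, 1 for the reverse, 0 if equal, and None
    # when the events are concurrent (sibling branches).
    branch_a, seq_a = a
    branch_b, seq_b = b
    if branch_a == branch_b:
        if seq_a == seq_b:
            return 0
        return -1 if seq_b > seq_a else 1
    if branch_b.startswith(branch_a):
        return -1   # a lies on an ancestor branch of b
    if branch_a.startswith(branch_b):
        return 1    # b lies on an ancestor branch of a
    return None     # siblings: no causal order exists
```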

&lt;h3&gt;
  
  
  Practical Properties
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Unaffected by clock drift, NTP adjustments, or physical-time inconsistencies.
&lt;/li&gt;
&lt;li&gt;Represents execution structure explicitly, enabling reliable causal reconstruction.
&lt;/li&gt;
&lt;li&gt;Requires minimal instrumentation: two small metadata fields propagated through normal trace context.
&lt;/li&gt;
&lt;li&gt;Suitable both for operational debugging and post-incident analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By deriving ordering from execution structure instead of timekeeping infrastructure, the branch–sequence method provides deterministic, causally accurate ordering within each distributed trace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Considerations
&lt;/h2&gt;

&lt;p&gt;There are important considerations and trade-offs with using logical clocks for log ordering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Causality over chronology:&lt;/strong&gt; The solution intentionally prioritizes causal order (the happens-before relation) over interpretations based on each server’s local clock readings. In diagnostic scenarios, the causal chain of events provides more reliable information than the timestamps generated by individual machines. For instance, if an error in Service C originates from a failure in Service A, the logical ordering will correctly place A’s events before C’s, even if a chronological sort based on local timestamps is distorted by clock drift, transient I/O delays, or other timing uncertainties.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Not a global order:&lt;/strong&gt; This mechanism does not produce a single total order of all events across the entire system in real time – nor is that typically needed. Each trace (request) is ordered internally, but events from different traces remain incomparable except by wall-clock time. This is acceptable, because unrelated requests do not have a causal ordering between them anyway. Even within one trace, if two branches execute in parallel, their events are concurrent and any ordering between them is somewhat arbitrary.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://martinfowler.com/articles/patterns-of-distributed-systems/lamport-clock.html" rel="noopener noreferrer"&gt;Lamport Clock&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Logical_clock" rel="noopener noreferrer"&gt;Logical Clock&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>distributedsystems</category>
      <category>telemetry</category>
    </item>
    <item>
      <title>Distributed Peer Indexing</title>
      <dc:creator>Artem Gurtovoi</dc:creator>
      <pubDate>Wed, 19 Mar 2025 12:41:39 +0000</pubDate>
      <link>https://dev.to/temich/distributed-peer-indexing-5ckb</link>
      <guid>https://dev.to/temich/distributed-peer-indexing-5ckb</guid>
      <description>&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;The modulo partitioning algorithm &lt;code&gt;task.id % replicas == index&lt;/code&gt; requires knowing the number of task-processing instances running in the cluster and the index of the current instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Forces
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Static configuration is not an option (due to dynamic scaling / failover).&lt;/li&gt;
&lt;li&gt;In a distributed system, there is no concept of a &lt;em&gt;global current time&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;An algorithm that emits &lt;code&gt;(index, replicas)&lt;/code&gt; once every &lt;code&gt;interval&lt;/code&gt; seconds, using a shared Redis key and an atomic increment.&lt;/p&gt;

&lt;p&gt;Define the following parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;name&lt;/code&gt;: the name of the task processor (e.g. &lt;code&gt;mail-sender&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;interval&lt;/code&gt;: indexing interval that is deliberately greater than the expected clock skew among instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the start of each &lt;code&gt;interval&lt;/code&gt; in &lt;a href="https://en.wikipedia.org/wiki/Unix_time" rel="noopener noreferrer"&gt;Unix epoch&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Calculate the ordinal number of the current interval: &lt;code&gt;number = ceil(now() / interval)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Compose a &lt;code&gt;key&lt;/code&gt; as &lt;code&gt;{name}:{number}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Atomically increment a &lt;code&gt;key&lt;/code&gt; in Redis (&lt;a href="https://redis.io/docs/latest/commands/incr/" rel="noopener noreferrer"&gt;INCR&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;index&lt;/code&gt; is defined (set in step 5)

&lt;ul&gt;
&lt;li&gt;Get the value of the previous key &lt;code&gt;{name}:{number-1}&lt;/code&gt; as &lt;code&gt;replicas&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If the &lt;code&gt;replicas&lt;/code&gt; is defined, emit &lt;code&gt;(index, replicas)&lt;/code&gt; &lt;strong&gt;algorithm result&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Store the response from step 3 in &lt;code&gt;index&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
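&lt;p&gt;A runnable sketch of the loop, with a plain dictionary standing in for Redis (helper names and the zero-based index interpretation are assumptions, not from the article):&lt;/p&gt;

```python
import math
from collections import defaultdict

store = defaultdict(int)  # stands in for Redis; INCR is atomic there

def incr(key):
    store[key] += 1
    return store[key]

def tick(state, name, now, interval):
    # One iteration of the loop above for a single instance.
    # `state` holds this instance's previous INCR response as "index".
    number = math.ceil(now / interval)         # step 1
    key = f"{name}:{number}"                   # step 2
    result = None
    if state.get("index") is not None:         # step 4
        replicas = store.get(f"{name}:{number - 1}")
        if replicas is not None:
            # INCR responses are 1-based; emit a zero-based index
            # suitable for task.id % replicas == index.
            result = (state["index"] - 1, replicas)
    state["index"] = incr(key)                 # steps 3 and 5
    return result

a, b = {}, {}
tick(a, "mail-sender", now=10, interval=10)    # first interval: register only
tick(b, "mail-sender", now=10, interval=10)
res_a = tick(a, "mail-sender", now=20, interval=10)
res_b = tick(b, "mail-sender", now=20, interval=10)
```

&lt;p&gt;After the second interval, instance &lt;code&gt;a&lt;/code&gt; learns it is index 0 of 2 replicas and &lt;code&gt;b&lt;/code&gt; index 1 of 2, matching the caveat that the first result arrives only after the first full interval.&lt;/p&gt;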

&lt;h2&gt;
  
  
  Safe index transition
&lt;/h2&gt;

&lt;p&gt;If the &lt;code&gt;index&lt;/code&gt; or &lt;code&gt;replicas&lt;/code&gt; changes, the algorithm consumer must stop consuming new tasks and execute &lt;em&gt;safe index transition&lt;/em&gt;, to prevent task duplication or loss.&lt;/p&gt;

&lt;p&gt;The transition can be implemented in a manner similar to the described algorithm, using a dedicated &lt;code&gt;{name}:transition&lt;/code&gt; key. However, this process is considered outside the scope of this document.&lt;/p&gt;

&lt;h2&gt;
  
  
  Extension
&lt;/h2&gt;

&lt;p&gt;If the system clocks are synchronized too precisely (skew smaller than a round-trip time to Redis), the order in which instances reach Redis becomes effectively random in each interval, which may result in continuous index transitions.&lt;/p&gt;

&lt;p&gt;To mitigate this, the algorithm can be extended with a random delay:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Before starting the algorithm, define a random constant clock skew, significantly smaller than &lt;code&gt;interval&lt;/code&gt;: &lt;code&gt;skew = random() * (interval / 2)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Start the algorithm at each &lt;code&gt;interval + skew&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
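&lt;p&gt;A small sketch of the extension (function names are assumptions):&lt;/p&gt;

```python
import random

def pick_skew(interval):
    # Fixed once at instance startup (step 1); by construction
    # significantly smaller than the interval itself.
    return random.random() * (interval / 2)

def next_run(now, interval, skew):
    # Unix time at which this instance runs its next iteration (step 2).
    number = now // interval + 1
    return number * interval + skew
```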

&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The first result becomes known only after between &lt;code&gt;interval&lt;/code&gt; and &lt;code&gt;interval × 2&lt;/code&gt; seconds.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>distributedsystems</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Decentralized Request Throttling</title>
      <dc:creator>Artem Gurtovoi</dc:creator>
      <pubDate>Wed, 19 Mar 2025 12:30:21 +0000</pubDate>
      <link>https://dev.to/temich/decentralized-request-throttling-40b4</link>
      <guid>https://dev.to/temich/decentralized-request-throttling-40b4</guid>
      <description>&lt;h2&gt;
  
  
  Problem
&lt;/h2&gt;

&lt;p&gt;To prevent unnecessary load, API requests should be throttled when they are made maliciously or due to an error.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;a href="https://microservices.io/patterns/apigateway.html" rel="noopener noreferrer"&gt;API Gateway&lt;/a&gt; of a distributed system runs in multiple instances, each with an in-memory state that does not allow tracking the current quota usage.&lt;/li&gt;
&lt;li&gt;Precise quota enforcement (per request, per second) is not critical; the goal is to prevent significant overuse.&lt;/li&gt;
&lt;li&gt;Quota configuration is static.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Forces
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;No per-request I/O is allowed (centralized solutions do not fit).&lt;/li&gt;
&lt;li&gt;In a distributed system, there is no concept of a &lt;em&gt;global current time&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Failure to retrieve the quota state should not result in Gateway failure.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;Implement in-memory quotas in each process, periodically synchronizing them asynchronously using Redis.&lt;/p&gt;

&lt;p&gt;Consider a basic throttling rule:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No more than &lt;code&gt;MAX_REQUESTS&lt;/code&gt; within &lt;code&gt;INTERVAL&lt;/code&gt; time for any API route (&lt;code&gt;KEY&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;If the limit is exceeded, block requests to &lt;code&gt;KEY&lt;/code&gt; for &lt;code&gt;COOLDOWN&lt;/code&gt; seconds.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Concept
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Divide &lt;code&gt;INTERVAL&lt;/code&gt; into &lt;code&gt;N&lt;/code&gt; spans (&lt;code&gt;N &amp;gt;= 2&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;At the end of each span, atomically increment the value in Redis (&lt;a href="https://redis.io/docs/latest/commands/incrby/" rel="noopener noreferrer"&gt;INCRBY&lt;/a&gt;) by the number of requests received for each &lt;code&gt;KEY&lt;/code&gt;, using a key composed of the &lt;code&gt;KEY&lt;/code&gt; and the current ordinal number of &lt;code&gt;INTERVAL&lt;/code&gt;
in &lt;a href="https://en.wikipedia.org/wiki/Unix_time" rel="noopener noreferrer"&gt;Unix epoch&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;If the returned value exceeds &lt;code&gt;MAX_REQUESTS&lt;/code&gt;, block access to &lt;code&gt;KEY&lt;/code&gt;, remove after &lt;code&gt;COOLDOWN&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If the write operation fails and &lt;code&gt;REQUESTS &amp;gt; MAX_REQUESTS / N&lt;/code&gt; (i.e., the quota is exceeded locally), block access to &lt;code&gt;KEY&lt;/code&gt;, remove after &lt;code&gt;COOLDOWN&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
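&lt;p&gt;A runnable sketch of steps 2 and 3, with a dictionary standing in for Redis (names are assumptions; the write-failure fallback of step 4 and COOLDOWN expiry are not modeled):&lt;/p&gt;

```python
from collections import defaultdict

store = defaultdict(int)   # stands in for Redis; INCRBY is atomic there

MAX_REQUESTS = 100
INTERVAL = 60
N = 6                      # spans per INTERVAL

def incrby(key, amount):
    store[key] += amount
    return store[key]

def flush_span(span_counts, now, blocked):
    # End-of-span synchronization for one gateway instance.
    # `span_counts` maps each KEY to requests seen during the span;
    # `blocked` collects throttled KEYs (removed after COOLDOWN).
    number = now // INTERVAL                   # ordinal number of INTERVAL
    for key, requests in span_counts.items():
        total = incrby(f"{key}:{number}", requests)
        if total > MAX_REQUESTS:               # step 3: global quota exceeded
            blocked.add(key)
    span_counts.clear()

blocked = set()
flush_span({"GET /users": 60}, now=120, blocked=blocked)
flush_span({"GET /users": 50}, now=130, blocked=blocked)   # cumulative 110
```

&lt;p&gt;The second flush pushes the shared counter for the same &lt;code&gt;INTERVAL&lt;/code&gt; past &lt;code&gt;MAX_REQUESTS&lt;/code&gt;, so the route is blocked without any per-request I/O.&lt;/p&gt;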

&lt;p&gt;Each API Gateway instance will generate the &lt;code&gt;KEY&lt;/code&gt; based on its own local time, which, in general, will lead to simultaneous writes (from Redis’s local time perspective) to different &lt;code&gt;KEY&lt;/code&gt;s. However, the total contribution of each node to each &lt;code&gt;KEY&lt;/code&gt; will correspond to the actual request rate experienced by that node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ewt591lb6atkddhe5x3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ewt591lb6atkddhe5x3.jpg" alt="Desynchronization" width="640" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Dividing the &lt;code&gt;INTERVAL&lt;/code&gt; into spans smooths the desynchronization effect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extension
&lt;/h3&gt;

&lt;p&gt;Introduce &lt;code&gt;nodes&lt;/code&gt;, the approximate number of active API Gateway instances, to improve the algorithm.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Initially, &lt;code&gt;nodes&lt;/code&gt; equals &lt;code&gt;1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;During each &lt;code&gt;INTERVAL&lt;/code&gt;, sum the total request count for each &lt;code&gt;KEY&lt;/code&gt;, keeping both the current and the previous value.&lt;/li&gt;
&lt;li&gt;At the end of each interval:

&lt;ul&gt;
&lt;li&gt;Read the value from the &lt;code&gt;KEY&lt;/code&gt; for the previous interval. At this point, it is assumed that all nodes have switched to the next interval.&lt;/li&gt;
&lt;li&gt;Update the &lt;code&gt;nodes&lt;/code&gt; value by dividing the response from Redis by the number of &lt;code&gt;REQUESTS&lt;/code&gt; counted for the previous interval.&lt;/li&gt;
&lt;li&gt;Set the previous value to the current value.&lt;/li&gt;
&lt;li&gt;Set the current value to &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Update point 4 of the Concept: &lt;code&gt;REQUESTS * nodes &amp;gt; MAX_REQUESTS / N&lt;/code&gt;.
This will increase the precision of the local quota enforcement.&lt;/li&gt;
&lt;li&gt;Add a new step: if &lt;code&gt;REQUESTS * nodes &amp;gt; MAX_REQUESTS&lt;/code&gt;, block access to &lt;code&gt;KEY&lt;/code&gt;, remove after &lt;code&gt;COOLDOWN&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
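&lt;p&gt;The &lt;code&gt;nodes&lt;/code&gt; update can be sketched as (helper name and rounding choice are assumptions):&lt;/p&gt;

```python
def update_nodes(redis_total, local_previous):
    # Estimate the number of active gateway instances: every node
    # contributed its own request count to the shared KEY for the
    # previous interval, so total / local approximates the node count.
    if redis_total is None or not local_previous:
        return 1                   # initial value / no local traffic
    return max(1, round(redis_total / local_previous))

nodes = update_nodes(redis_total=900, local_previous=300)   # about 3 nodes
```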

&lt;p&gt;When nodes are added or removed, the algorithm will adapt in the upcoming intervals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Caveats
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;In the worst-case scenario (extremely unlikely), the quota is exceeded by &lt;code&gt;MAX_REQUESTS / N&lt;/code&gt; per span on each node.&lt;/li&gt;
&lt;li&gt;Time desynchronization between nodes should be insignificant for the selected &lt;code&gt;INTERVAL&lt;/code&gt; (i.e., &lt;code&gt;INTERVAL&lt;/code&gt; &amp;gt;&amp;gt; desync). See Extension point 3.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>distributedsystems</category>
      <category>apigateway</category>
    </item>
  </channel>
</rss>
