What is Order and Why is it important in distributed systems
Before we jump into order, lets make some assumptions about the nodes in a distributed system. Nodes(either physical or logical),
- communicate by message passing
- over an unreliable network
- with no shared memory
What is Order
Order defines the sequence in which events occur. That is we are interested in a property which defines a relationship of happens before. Given two events A and B, we want to understand whether
- A happens before B or vice versa
- Or we cant say anything about the order in which A or B occur
Why do we care about order in distributed systems
A system which can do only thing at a time will have a well defined total order of operations. if people pass through a door which allows only one person at a time, then each person will have a well defined predecessor and successor. However in a distributed system since there are multiple nodes, there may be conflicts which may be introduced.
Assume a scenario where successive writes go to different nodes for the same shopping cart. Which node has the correct view of the cart?
Hence it becomes important in distributed systems to reason about ordering, to resolve conflicts.
Total and Partial Order
A total order is a binary relationship which defines an order for each element in a set. Two distinct elements are comparable if one of them is greater than the other. In a partially ordered set, there are some elements that cannot be compared, and hence a partial order does not specify the exact order for each element.
The natural state in a distributed system is about partial order. Neither the node nor the network makes any guarantee about relative order, but at each node you can observe local total order.
Mathematically, total and partial orders have the following properties
Total Order
Total Order is
- Transitive -- if a <= b and b <= c then a <= c
- Antisymmetric -- if a <= b and b <= a then a = b
- Totality -- a <= b or b <= a for all a,b in X
Partial Order
Partial Order is
- Transitive -- if a <= b and b <= c then a <= c
- Antisymmetric -- if a <= b and b <= a then a = b
- Reflexive -- a <= a for all a in X
Totality implies reflexivity, however for some elements in a partial order, the totality does not apply, so some elements of the set cannot be compared.
Time
Time is a source of order. It allows us to define the order of operations. In a sense time is like any other counter. Time has two useful interpretations when applied to programs.
- Order Time as a source of order can be used in a few ways -- We can attach time-stamps to un-ordered events to order them. -- We can use the notion of time to determine whether something happened in chronological order. -- We can utilize time to enforce ordering of operations, like detecting out of order updates.
- Duration Duration is used to define boundary conditions for algorithms. For example, duration can be used to define network partitions, high latency or the unavailability of a node using heartbeat.
Time and Clocks
Global Clocks
The global clock is a globally available clock with perfect accuracy. A global clock provides a total order of events across nodes even if the nodes never communicate. However this is an idealized view of the world. In reality clock synchronization is only possible to a certain level of accuracy. Fundamentally, global clocks are restricted in accuracy by the nature of spacetime
Global clocks assume that clocks on a node all start at the same value and never drift. These assumptions are inherently weak and there are multiple cases, where the assumption breaks, like an out of sync node joining a cluster, or the clocks on nodes drifting apart. However there are systems which assume clock synchronization and apply semantics of last write wins, Cassandra being one of them. Another is Google's Spanner which assumes a worst case clock drift.
Local Clocks
Local clocks are clocks on each node. Local clocks cannot be used to determine whether a remote timestamp happens before a local timestamp. Hence with local clocks, you cannot compare timestamps from two nodes. Local clocks assign a partial order; there is total ordering of events on a node, however events cannot be ordered across nodes.
Causal/Logical Clocks
Causal or Logical clocks assume that there are no clocks and use other mechanisms of tracking causality. Logical clocks use counters and communication between nodes to determine whether something happened before, after or concurrently with something else. Counters can be used to order events locally on a node(timestamps of local clocks can be considered as a special case of counters). In the absence of communication, there would be partial order across nodes. However for ordering across nodes we require communication or message passing.
Lamport clocks and Vector clocks are logical clocks which are used as a replacement for physical clocks. they rely on counters and communication between nodes to determine the order of events in distributed systems. These clocks provide a counter that is usable across nodes. Voldemort and Riak are distributed systems that use vector clocks.
Lamport Clock
Lamport formulated the happens before construct for ordering events across nodes in his paper in 1978, Time, Clocks and Ordering of Events . To understand Lamport clocks lets first look at the how counters are maintained for a process/node
-- Whenever a process does work, increment the counter
-- Whenever a process sends a message, include the counter in the message
-- Whenever a process receives a message, set the counter to max(local_counter, received_counter) + 1
Lamport clocks allow counters to be compared across nodes using a partial order. For two events A and B, if counter(A) < counter(B)
-- Then A may have happened before B
-- Or A and B cannot be compared
In simplicity what this says is that if an event comes before another, then the logical clock of the event comes before the other.
Looking at it another way, if A < B IF
-- A and B both occur on the same node/process.
-- A and B are synchronization pairs. A synchronization pair is response to a previous event. So B is a response for A. As an example, A is a GET for performing an update(B).
Vector Clock
A Vector clock is an extension of Lamport clocks. A vector clock maintains an array of N logical clocks, one for each node. Rather than incrementing a common counter, each node updates its logical clock in the array. The rules defining Vector clocks are
-- Whenever a process does work, increment the logical clock on that node in the array.
-- Whenever a process sends a message, it includes the complete array in the message.
-- When a process receives a message, update each element in the array to max(local,received) and increment the logical clock value representing the current node in the array.
The way to determine whether A happens before B is based on
if A < B IFF VectorClock(A) < VectorClock(B). Assuming three nodes N1, N2 and N3 and the vector clock at A to be [1,0,0] and the vector clock at B being [2,1,0] implies that A happens before B.
Top comments (0)