One of the hardest things about distributed systems is that we often find ourselves needing to approach them very differently than other problems in computing. Distributed computing forces us to reevaluate how we’d approach even the simplest obstacles in a single system model.
We recently began exploring one such example of this when we took a closer look at ticking clocks and how unreliable they are in a distributed system! As we learned, there is no single global clock in a distributed system, which makes it hard to ever agree on what time it is. Understanding the caveats of clocks and time is crucial to wrapping one’s head around the distributed computing model.
But why is time specifically so important? Why do we really care what time it is, and why does it matter that we know when things happened? And if it matters, how do we go about figuring out what time it is without a global clock to rely upon? As we’ll learn, it’s not so much the clocks and time that we care about, but rather, the events that they’re always tied to.
Although we’ve already dipped our toes into the topic of time, there’s still a lot more left for us to uncover. In fact, we won’t even be done talking about time by the end of this post! Now, it might seem odd to spend so much time learning about, well, time. Why does time matter so much in a distributed system, anyway? If we think about it more deeply, we might realize that we don’t specifically care about knowing what time it is in a distributed system. What we care about is being able to rely on time to help us order things.
The notion of time helps us order events in a single system as well as a distributed one.
Time is particularly useful when it comes to ordering events in a system. Most programmers rely on and expect some kind of time format when they deal with data or processes; this format is usually something like a timestamp, a sequence of characters encoded to represent a certain moment in time. We use timestamps to tell when some data was created or updated, when a message was sent or received, and when a process was started or completed. In actuality, we don’t care about the timestamp on its own; it’s really the timestamp being tied to some data or event that makes it useful to us.
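To make this concrete, here’s a tiny sketch (with a hypothetical event record and event name) of what “a timestamp tied to an event” looks like in code. The timestamp alone is just a string of characters; pairing it with an event is what gives it meaning.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# A hypothetical event record: the timestamp only becomes useful
# because it is attached to a specific event.
@dataclass
class Event:
    name: str
    timestamp: datetime

# Record when a (made-up) event happened, using the local clock.
event = Event("user_signed_up", datetime.now(timezone.utc))

# ISO 8601 is one common way to encode a moment in time as characters.
print(event.name, event.timestamp.isoformat())
```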
The trouble is, however, that we humans have our own notion of time and how it ought to work.
We tend to think of time as being linear; an event happens at “some” time, another event happens after it at “another” time, and a final event happens last, at some “final” time.
The human mental model of time assumes a few things.
- First, we all agree on what time it is. We have timezones and shared reference clocks (like the atomic clocks that define UTC), and there is rarely any discrepancy between us when it comes to what time it is in a given location.
- Second, we tend to think of events happening as one after another. Event A happens first, then event B, and finally, event C.
- Third, we can determine the exact order of events based on which event happened before another. In other words, if event A happened before event B, and we know the time of A and B, we know the exact ordering of the two events.
While this might seem obvious to us, this mental model of time doesn’t work in the context of distributed systems! There are some major differences between our human mental model of how time works and the reality of time in distributed systems.
The first difference is one that we already know: distributed systems have no global clock, so there is just no way to be sure of what time it is on a different machine in a different location. The second difference to note is that events don’t just happen one after another! Events can happen at the same time, in parallel, or concurrently. Both of these points lead us to one more major difference: events happen in different places, on different nodes in the system. And because they can happen at the same time, and because it’s not possible to know what time it is without a global clock, the events that occur all over a distributed network are hard to order!
The human model of time and the reality of time in a distributed system lead to two very different approaches to ordering events in a system. In the human model approach, we know exactly what time it is, so it’s easy to figure out how to order some events — we can just use each event’s timestamp! But in the distributed system approach, we don’t have the luxury of having all the information (read: knowing the time) about all the events in the system. These two approaches each have a name; let’s learn a bit more about them.
Our view of time aligns well with a single system’s model of time. In a single system — as opposed to a distributed one — each event can be placed in a specific order based on when it occurred. This is also known as a total ordering of events.
When events in a system follow a total order, then every event in that system has a specific order in which it occurred. In other words, when we know exactly when each of the events occurred, we know the total order of all the events in a system.
For example, in the illustration shown above, if we knew that event A occurred first, followed by event B, and ending with event C, then we would know the total order of all the events. Events in a single system can be totally ordered since we have a local clock to help us figure out exactly when an event happened, which allows us to figure out the total ordering of all the events.
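A total ordering is easy to sketch in code. With a single local clock, every event gets a comparable timestamp, so sorting by timestamp produces one definitive order. The event names and clock readings below are made up for illustration.

```python
# Each event is a (name, local clock reading) pair. On a single
# system, every reading comes from the same clock, so all events
# are mutually comparable.
events = [("B", 2), ("C", 3), ("A", 1)]

# Sorting by the clock reading yields the total order of events.
total_order = sorted(events, key=lambda e: e[1])

print([name for name, _ in total_order])  # ['A', 'B', 'C']
```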
Yet a distributed system is very different from a single system! Sure, an individual node in a system will know what time it is based on its local clock, but it won’t know what time it is on another node. It’s almost as if each node in a distributed system can get halfway to a total ordering…it can order its own events, but if it receives any messages/incoming events from other nodes in the system, it won’t know what to do with them.
This brings us to partial ordering, which is a “less sure” version of total ordering. In a partial order, we can’t be sure of the exact order of all the events in the system. Instead, all we can be sure about is the order of events that are reliant upon one another.
Let’s think back to our example three events: A, B, and C. In a single system or in a system with a total order, we were able to know the exact order of each of these events. But in a distributed system, that likely would not have been the case! Instead, we probably would have been sure about some of the events, but not others.
For example, let’s say that we are sure about when event B and event C occurred. In fact, let’s say that we are even sure that event B occurred before event C. But what about event A? We’re not at all sure about when that event occurred!
In reality, because we don’t know when event A occurred, it could have theoretically happened at any time: before event B, after event C, or even right between the two events! All we can say with certainty is that event B occurred before C.
Indeed, our partial sureness about how to order these events is what makes this set of events a partial ordering: we’re only “partially” sure about the order of these events.
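The example above can be sketched as a set of known “happened before” pairs. Anything we can’t derive from those pairs is simply unknown. The event names are the hypothetical A, B, and C from the example.

```python
# A partial order, represented as the pairs we *know*: here, only
# that B happened before C. A is unordered relative to both.
known = {("B", "C")}

def happened_before(a, b, pairs):
    """True only if a is known to precede b, directly or transitively."""
    if (a, b) in pairs:
        return True
    return any(x == a and happened_before(y, b, pairs) for x, y in pairs)

def compare(a, b, pairs):
    if happened_before(a, b, pairs):
        return "before"
    if happened_before(b, a, pairs):
        return "after"
    return "unknown"  # incomparable: the order simply isn't determined

print(compare("B", "C", known))  # before
print(compare("A", "B", known))  # unknown
```

Notice that `compare` has a third outcome that a total order never needs: "unknown". That third outcome is exactly what makes the ordering partial.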
In a distributed system, we mostly deal with partially ordered events, simply because individual nodes can be sure about how to order local events, but they can’t always be sure about how to order events that are happening on a different node. A node in our distributed system will send messages out to other nodes, which can’t necessarily be sure of when those messages were sent. Similarly, a node in our system will receive its own set of incoming messages, and it can’t be sure of what time those messages were sent, either!
So how do we reconcile this lack of knowledge throughout our system? Well, it involves some reframing of how we think about time…and whether we even need it at all (?!).
In the beginning of this post, we asked ourselves why we even cared about time. We realized that the reason it kept coming up was because we relied on time to order events in our system. Along the way, we learned that sometimes we can be sure about the order of events (total ordering) and other times, well, we can’t be sure of everything (partial ordering). But if we circle back to our original question, we might realize that we don’t actually care about time, but rather the order in which things happen.
Perhaps we don’t actually need to worry about the exact time that each event in our system occurred. Instead, maybe we should try to just figure out when those events happened in a way that doesn’t require us to even think about clocks or time!
All we really want to know is the order in which some events occurred. And why do we care about the ordering of events? Well, the major benefit of ordering events is so that we can figure out how one event caused another to occur. So, the question is: how can we get the benefit of ordering events without all the headache of figuring out the time at which each event happened?
The answer to this mystery is causal ordering, which helps us order events not based on the time that they occurred, but rather, based on cause and effect. Causal ordering reframes how we think about events. If we can just figure out which events cause other events, we can come up with a loose ordering of how those events occurred.
In other words: we don’t need to care about time anymore! We can sort of skirt around it, and jump right to the heart of the matter: figuring out if one event came before another. In the next post, we’ll dive into the logic behind causal ordering and figuring out the chronological relationships between events. In the meantime, rest easy knowing that we’ve got a (logical) time solution ahead of us.
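As a small taste of where this is headed, here’s a toy sketch (with made-up event names) of ordering events purely by cause and effect: each event records which event directly caused it, and we derive the order by walking those cause links — no clock or timestamp in sight.

```python
# Each entry reads "this event was caused by that event":
# A caused B, and B caused C.
causes = {"B": "A", "C": "B"}

def causally_precedes(a, b):
    """True if a is an ancestor of b in the chain of causes."""
    while b in causes:
        b = causes[b]
        if b == a:
            return True
    return False

print(causally_precedes("A", "C"))  # True: A caused B, which caused C
print(causally_precedes("C", "A"))  # False: C didn't cause A
```

This is only a sketch of the intuition, not the real machinery; the logical clocks that make this practical are the subject of the next post.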
Understanding the ordering of events in a distributed system is super helpful to know before learning about more complex time-related topics. It can be hard to find resources that specifically explain total and partial ordering in the context of distributed systems, but there are a few that do a pretty good job, which I’ve listed below!
- Distributed Systems for Fun and Profit, Mikito Takada
- Partial ordering of events in a distributed system, Jesse Bangs
- Distributed Systems: Ordering and Consistency, Professor A.F. Cooper
- Causal Ordering, Jamie Brandon
- Time, Clocks, and the Ordering of Events in a Distributed System, Leslie Lamport