When we started off learning about distributed systems, we knew that this topic in particular can be a little hard. And in the beginning of this series, we didn’t even get too deep into the weeds about what makes distributed systems hard.
Instead, we focused on what exactly they were, and what makes for a good, well-designed system (spoiler alert: the perfect system doesn’t even exist, but we strive towards it!). We didn’t really dive too deep into the things that make distributed systems difficult to deal with until recently, when we explored topics like downtime, availability, and fault-tolerant systems. This week, it’s time to go even deeper, down into the trenches where things start to go wrong in our systems.
Now that we’ve been introduced to faults and have a better idea about some of the things that could potentially go wrong, it’s time to understand what we’re really talking about when it comes to faults, what they look like, and the subsequent headaches that they can cause. So let’s dig right in and try to get a better handle on what we’re dealing with when it comes to faults!
In our recent run-in with faults, we discussed them in a very concrete sense: in the form of some piece of hardware going haywire (read: failing). But, faults don’t just occur in a hardware-gone-bad scenario. The actual definition and usage of the term “fault” is more vague than what we initially may have thought.
A fault is really just anything in our system that is different from what we expect it to be. Whenever some piece of our system deviates from its expected behavior, or whenever something unexpectedly occurs in our system, that behavior itself is a fault!
Now, going by this definition alone, we can start to see that it’s not just failing or broken hardware that can “behave unexpectedly”. Nope, there are a ton of different potential points in our system that could do that, and yes, you guessed it: all of those potential points are just faults in our system, waiting to occur.
It’s important to distinguish between the concepts of a “fault waiting to happen” and a “fault that already exploded in our face”!
Thankfully, there are two terms that come in handy here. If we have a point in our system that has the potential to behave unexpectedly but hasn’t actually manifested that “unexpected” behavior just yet, we can refer to it as a latent fault, since it exists, but is dormant, and isn’t actually affecting any part of our system in any way (…yet!).
However, when a fault actually reveals itself (that is to say, when a fault no longer has just the potential to behave in an unexpected way, but actually does the unexpected thing), we refer to it by another name. An active fault is one that actually deviates from what we expect, rises to the surface of our system, and affects other parts of it.
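To make the latent/active distinction a bit more concrete, here is a minimal sketch. The function `average_latency` and its inputs are made up for illustration; the point is only that the same code can carry a dormant fault for a long time before some input wakes it up.

```python
def average_latency(samples):
    # Latent fault: nothing here plans for an empty list of samples.
    # As long as callers always pass at least one sample, it stays dormant.
    return sum(samples) / len(samples)


# The fault exists, but it is latent: this call behaves as expected.
print(average_latency([120, 80, 100]))  # 100.0

# An input nobody planned for activates the fault.
try:
    average_latency([])
except ZeroDivisionError as exc:
    print("fault became active:", exc)
```

Nothing about the code changed between the two calls; only the conditions did, which is exactly why latent faults are so easy to miss.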
When an active fault occurs in our system, it ripples through the system and causes a domino effect. Usually a fault (which is, by definition, some kind of unexpected behavior) causes an error, which then results in a failure. But what does that even mean, exactly? Time to find out!
While a fault is latent, it could be dormant and waiting around at any place in our system where some unexpected behavior is permitted to occur. Usually, the unexpected behavior is something we haven’t considered or handled in any way. Once a fault becomes active, it tends to be a bit easier to understand where it originates from.
We’ll recall that distributed systems are made up of nodes that are communicating with one another, so a fault could originate from any node within the system, wherever the unexpected flaw in the system happens to live. So, for our purposes, we can talk about faults abstractly enough to say that a fault originates within some node in the system.
Once the fault in a node actually manifests itself and becomes active, it causes some unexpected behavior; as we might guess, different faults will result in different behaviors, but the common thread here is that they’re unexpected. An unexpected behavior in our system effectively means that some part of our system did something that we did not plan for, and that yields an incorrect result: an error!
We’ve likely run into errors in some way, shape, or form in our lives — whether as the creators of software or hardware, or as the consumers of it. In the context of distributed systems, an error is actually a way of a fault surfacing itself. Errors are manifestations of faults within our system, and when an error occurs and then spreads or propagates through the system…well that’s when things start to really come to our attention.
If an error occurs because of a fault and is not handled in some way, things start to get out of hand.
Specifically, the unhandled error caused by a single fault within one node can now begin to impact the rest of the system.
If an error is not hidden from the rest of the system, then the error propagates outward; from the perspective of the rest of the system, the node where the fault originated is now behaving unexpectedly, since it is responding with an error, rather than whatever the system expected it to respond with! This unexpected behavior from the node is what the system perceives as a failure, or an incorrect result or behavior for that “faulty” node.
Failures themselves are a whole topic within distributed systems, and we’ll talk about them again soon, I promise! But now that we have a better sense of the flow of how a fault surfaces, becomes active, and causes an error (and later, a failure), let’s try to understand the different types of faults that can occur. This will give us better insight into what kinds of failures we can expect to see down the road.
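The fault, error, failure chain can be sketched in a few lines. Everything here is hypothetical: the `Node` class, its inventory, and the choice to treat an unknown item as zero stock are assumptions made purely to show the difference between an error that is handled inside a node and one that escapes it.

```python
class Node:
    def __init__(self, name, inventory):
        self.name = name
        self.inventory = inventory

    def stock_for(self, item):
        # Latent fault: no plan for items missing from the inventory.
        # When it activates, the incorrect result is an error (KeyError).
        return self.inventory[item]


def handle_masked(node, item):
    # The error is handled at the node boundary, so the rest of the
    # system never sees it. Treating "unknown" as 0 is an assumption.
    try:
        return node.stock_for(item)
    except KeyError:
        return 0


def handle_unmasked(node, item):
    # The error is not handled; it propagates outward, and the caller
    # now perceives this node as having failed.
    return node.stock_for(item)


node = Node("inventory-1", {"apples": 3})
print(handle_masked(node, "pears"))  # 0

try:
    handle_unmasked(node, "pears")
except KeyError as exc:
    print("caller observes a failure:", repr(exc))
```

Both handlers sit on top of the same fault. The only difference is whether the resulting error is contained or allowed to reach the rest of the system.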
As we’ve now learned, the definition of a fault is broader than what we first thought it to be. Since a fault is anything that behaves unexpectedly, a fault could be any hardware, software, network, or operational aspect of the system that does something we didn’t plan for! So, faults themselves can be caused by many reasons and don’t just have to stem from one place.
But since faults can look like so many different things — because they can come from different places — how can we better make sense of them? Well, thankfully, we can use some predetermined categories to help us understand what kind of fault we are dealing with. There are three main types of faults: transient, intermittent, and permanent.
A transient fault is a fault that happens once, and then doesn’t ever happen again. For example, a fault in the network might result in a request that is being sent from one node to another to time out or fail. However, if the same request is made between the two nodes again and succeeds, that fault has disappeared, which is how we can define it as transient.
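Because a transient fault disappears on its own, simply trying again is often enough to get past it. Here is a small sketch of that idea, assuming a made-up `FlakyLink` class that times out on its first send only; the retry count and delay are illustrative defaults, not recommendations.

```python
import time


class FlakyLink:
    """Hypothetical network link that times out on the first send only."""

    def __init__(self):
        self.attempts = 0

    def send(self, payload):
        self.attempts += 1
        if self.attempts == 1:
            raise TimeoutError("request timed out")
        return f"ack:{payload}"


def send_with_retry(link, payload, retries=3, delay=0.0):
    # Try the same request again; a transient fault should not recur.
    for attempt in range(1, retries + 1):
        try:
            return link.send(payload)
        except TimeoutError:
            if attempt == retries:
                raise  # still failing: probably not transient after all
            time.sleep(delay)


link = FlakyLink()
print(send_with_retry(link, "ping"))  # ack:ping
print(link.attempts)                  # 2
```

If the retries run out, the function re-raises the timeout, which is a hint that we may be looking at one of the other kinds of fault instead.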
An intermittent fault is one that occurs once, seems to go away, and then occurs again! Intermittent faults are some of the hardest ones to debug and deal with, since they masquerade as transient faults at first, but then come back — sometimes with inconsistency. A good example of this is with loose connections in hardware, where sometimes it seems like the connection works, but occasionally (and often erratically) the connection just stops working for a bit.
Finally, a permanent fault is one that just does not go away after it first occurs. A permanent fault occurs once, and then continues to persist until it has been addressed. For example, if part of a system runs out of memory, hits an infinite loop, or crashes unexpectedly, that “broken” state will just continue to be the same until someone (or some part of the system) fixes it or replaces it entirely.
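One way to keep the three categories straight is to look at a timeline of observations and count how many separate "episodes" of faulty behavior show up. The sketch below does just that; the `classify` function and its labels are assumptions for illustration, and it can only ever describe what has been observed so far (which is exactly why an intermittent fault can pass for a transient one at first).

```python
from itertools import groupby


def classify(history):
    """history: booleans in time order, True meaning the fault was observed."""
    # Count runs of consecutive faulty observations ("episodes").
    episodes = sum(1 for faulty, _ in groupby(history) if faulty)
    if episodes == 0:
        return "no fault observed"
    if episodes > 1:
        return "intermittent"
    # Exactly one episode: has it gone away yet?
    return "permanent (so far)" if history[-1] else "transient (so far)"


print(classify([False, True, False, False]))        # transient (so far)
print(classify([False, True, False, True, False]))  # intermittent
print(classify([False, True, True, True]))          # permanent (so far)
```

The "(so far)" in the labels is deliberate: one more observation can turn a transient fault into an intermittent one, or reveal that a seemingly permanent fault has cleared.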
Last but not least, each of these three types of faults comes in two flavors! And this is where faults become even more tricky. Every transient, intermittent, or permanent fault runs the risk of either being a fail-silent fault or a Byzantine fault.
A fail-silent fault (closely related to what is sometimes called a fail-stop fault) is one where the node where the fault originated actually stops working. In this specific flavor of fault, when the origin node stops working, it will either produce no result (error/output) whatsoever, or it will produce some sort of output that indicates that the node actually failed. In a fail-silent fault, there is no guarantee that the node with the fault will actually give us an error, so it’s possible that we won’t even know that a fault occurred!
By contrast, in a Byzantine fault, the origin node does actually produce an error output, but it doesn’t always produce the same error output. And, confusingly, even though the node is producing errors, it continues to run! In a Byzantine fault, a node could behave inconsistently in the exact errors that it surfaces, which means that a single fault within a node could actually result in the node responding with a variety of errors, each potentially different from the last!
As we might be able to imagine, both a fail-silent fault and a Byzantine fault seem like treacherous scenarios. And our systems should aim to account for those situations (although, it’s important to note that we can’t build completely fault-free, fault-tolerant systems…though it’s nice to strive in that direction). Faults are a cornerstone of key distributed systems discussions, specifically because they are hard to deal with, reason about, and consider while building our systems. But now that we know how to talk about them, what they’re called, and what they might look like, we’re far more equipped than before to take these crafty creatures on whenever we encounter them next in the distributed system wilderness!
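To see how different these two flavors look from the outside, here is a toy sketch. Both classes are hypothetical, the error strings are invented, and the Byzantine node simply cycles through them so that the example stays deterministic; a real node would be far less predictable.

```python
import itertools


class FailSilentNode:
    def handle(self, request):
        # The node has stopped working: callers get no result at all.
        return None


class ByzantineNode:
    def __init__(self):
        # Made-up error outputs, cycled to keep the sketch repeatable.
        self._responses = itertools.cycle(
            ["error: disk full", "error: bad checksum", "error: timeout"]
        )

    def handle(self, request):
        # The node keeps running, but answers the same request differently.
        return next(self._responses)


silent = FailSilentNode()
byzantine = ByzantineNode()

print([silent.handle("read x") for _ in range(3)])
# [None, None, None]
print([byzantine.handle("read x") for _ in range(3)])
# ['error: disk full', 'error: bad checksum', 'error: timeout']
```

From the caller's side, the fail-silent node gives us nothing to go on, while the Byzantine node gives us plenty of output that we can't trust to be consistent.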
Faults are a big topic in distributed systems, and many folks have written about how to understand and design for fault tolerance in a system. There’s a lot of good content out there on faults and how they fit into the larger narrative of distributed systems, but the resources below are some of my favorites.
- Fault Tolerance in Distributed Systems, Sumit Jain
- Fault Tolerance: Reliable Systems from Unreliable Components, Jerome H. Saltzer and M. Frans Kaashoek
- Distributed Systems: Fault Tolerance, Professor Jussi Kangasharju
- Recovery and Fault Tolerance, Professor Tong Lai Y
- Fault Tolerance, Paul Krzyzanowski