Vaidehi Joshi

Posted on Jun 5, 2019 • Originally published at Medium on Jun 5, 2019

Weeding Out Distributed System Bugs

#distributedsystems #beginners #computerscience

As we’ve learned more and more about distributed systems, we’ve seen the many ways that things can go wrong. More specifically, we’ve seen that there are just so many possible scenarios in which a portion of a large system can fail.

Since failure is inevitable in any sized system, we ought to do ourselves a favor and better understand it. So far, we’ve been talking about different kinds of faults and failures in a fairly abstract sense. It’s time for us to get a little more concrete, however. Sure, we vaguely understand the different flavors and profiles of how things can go wrong within a system, and why they are problematic (yet inevitable) within our system. But how can we start to understand these flaws in a system in a more tangible form?

In order to do that, we need to think more deeply about the different ways in which failures present themselves to us in the systems that we all deal with every day, as both consumers and creators of software. For most of us, failures in our systems present themselves as bugs. The ways in which a bug might appear to us, however, can make all the difference in how we are able to understand it in a more concrete way.

So what kinds of bugs do we deal with, exactly, and how do they impact the failures of a distributed system? Well, that’s the mystery we’re about to solve!

(Hardware) failures of the past

Before we dive into identifying the bugs of today, let’s take a quick detour into the past. It’s easy to get overwhelmed when thinking about failures and faults in a system, so before we get too deep into the trenches, we ought to take a moment to see how the landscape of bugs in computing has changed over the years.

Until the 1980s, the major focus of computing was hardware. More specifically, much of the field was focused on how to make hardware just generally better, simply because hardware was the major limiting factor in many ways. For example, if we wanted to make a machine faster and more performant back then, we needed bigger hardware; that was the only way to have enough space for all the circuits we needed! And if we wanted more circuits, each of which was already pretty large, then we also needed to be prepared for them to use a lot of power and give off a lot of heat.

These issues begin to highlight some of the clear possible faults that could have popped up within a system just a few decades ago. As we already know, hardware faults (such as a circuit overheating, or a network wiring issue causing a widespread outage) are what lead to hardware failures. If any aspect of the hardware fails, then that failure will likely cause some form of downtime in a system, which we know makes a system less reliable. Until the 1980s, hardware faults were very much a real and common problem.

But these days, the story is a little different. We experience far less downtime due to hardware faults than we did just forty years ago. Many years of concentrated effort have helped improve the hardware we all rely upon on a daily basis now! In the past three decades, the size of circuits has decreased, allowing us to pack more circuits into a smaller space, and those circuits produce less heat and use much less energy. Circuits have also become easier and cheaper to produce. This has also allowed us to create smaller devices, like laptop computers, tablets, and smartphones, to mention just a few.

This doesn’t mean that there are no hardware problems whatsoever, though! Even small devices with smaller circuits inside of them will experience hardware failures. Network issues that cause downtime are still likely to happen, even though the frequency with which they occur has definitely dropped off. Hardware disks are still prone to failures, which makes it tricky to read (much less write) data to them. And, of course, just because hardware has improved doesn’t mean that it doesn’t require maintenance and upgrades; since these are still requirements, they will still result in planned downtime.

Overall, however, the changes we’ve seen in hardware have been quite a net positive for computing. So, if hardware has improved…what else could be a contributing factor to failures in a distributed system? Why, our dear friend software, of course!

Even in the most well-tested systems, software failures are responsible for a significant amount of downtime. We know these “failures” by another name: bugs.

Improvements in hardware notwithstanding, it is bugs in the software of distributed systems that result in unexpected and unplanned downtime. Many studies estimate that 25 to 35% of the downtime in a system is caused by bugs in software.

Software failures as a major pain point in a distributed system

The interesting aspect of this story is the fact that, even within systems that are fairly well-established and have rigorous testing practices in place, studies have found that the actual percentage of software-related downtime doesn’t really ever drop below that 25% threshold! There are just some bugs that still seem to exist, even with well thought-out tests and quality control.

(Software) problems of the present

The bugs that still exist in more mature systems (those that have rigorous testing, for example) are also known as residual bugs, and they can be classified into two separate categories:

The two main kinds of “residual” software bugs

Bohrbugs, which are named for Niels Bohr and Ernest Rutherford’s model of the atom, and
Heisenbugs, which are named as a pun on Werner Heisenberg’s uncertainty principle.

These two bugs have been researched by many different computer scientists; three of the most notable ones include Jim Gray, Bruce Lindsey, and Andrea (“Anita”) Borr, and we’ll read about the fruits of their labor in a little bit. Between these two different kinds of “residual” bugs, one is definitely way easier to wrap our heads around than the other. So let’s start with that one first!

The Bohrbug is a bug that most (all?) programmers will encounter while tinkering with software. A Bohrbug is a bug that can reliably be reproduced, given the right conditions. For example, if we noticed that a bug occurred in a piece of software and closely observed the situation in which it happened, then, if it was a Bohrbug, we would be able to reproduce it by re-creating the same situation.

A Bohrbug is pretty easy to localize and pinpoint to a certain part of a codebase. As developers, this is a huge boon, since it means that we can reliably find and then fix the Bohrbug, as annoying as it might be!
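To make that a bit more tangible, here is a minimal sketch of what a Bohrbug might look like in Python. The function `average_latency_ms` and the helper `reproduce` are hypothetical names made up for this illustration; the point is only that the same input triggers the same failure on every single attempt.

```python
def average_latency_ms(samples):
    """Return the mean of a list of latency samples, in milliseconds."""
    # Hypothetical bug: there is no guard for an empty list,
    # so an empty input always divides by zero.
    return sum(samples) / len(samples)


def reproduce(samples, attempts=5):
    """Run the same input several times and record how each run ended."""
    outcomes = []
    for _ in range(attempts):
        try:
            average_latency_ms(samples)
            outcomes.append("ok")
        except ZeroDivisionError:
            outcomes.append("crash")
    return outcomes


if __name__ == "__main__":
    print(reproduce([12.0, 18.0]))  # a normal input succeeds on every attempt
    print(reproduce([]))            # the empty input crashes on every attempt
```

Because the crash depends only on the input, we can recreate it at will, point to the exact line responsible, and verify a fix by running the same input again.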

Interestingly, when Jim Gray and Bruce Lindsey were researching Bohrbugs in more mature systems, they posited that the frequency of these reproducible little bugs actually reduced as a system grew older and more stable.

However, Anita Borr’s research added a bit more nuance to this. She found that the percentage of Bohrbugs didn’t strictly continue to drop as a system grew more stable; rather, her research found that, with each new upgrade or scheduled maintenance that was introduced into the system, there was also a slight uptick of Bohrbugs, since significant changes in the system were still very capable of introducing reproducible bugs.

Thankfully, even though new Bohrbugs might be introduced with these system-wide changes, at least they can be reproduced (and hopefully, fixed!). But things aren’t always that simple in the world of software (of course). Some bugs don’t always behave the same…in fact, some of them seem to behave differently when we try to investigate them!

Dealing with difficult, distributed bugs

There is one species of bug that is particularly relevant to distributed systems, and it’s finally time for us to come face-to-face with it in this series. I’m talking, of course, about the Heisenbug!

A Heisenbug can be super frustrating to deal with as a programmer. This is a bug that actually behaves differently when it is observed closely. As one begins to investigate a Heisenbug, it may change how it manifests itself. In some cases, when a Heisenbug is, well, being debugged, it disappears completely. And in some situations, when certain conditions are recreated in an effort to reproduce the bug, the bug just won’t appear! Pretty frustrating, right?

For example, a bug such as a data structure running out of space, or a portion of a program that overflows some allocated memory in a production environment, might not be easy to reproduce locally or in a test; however, this exact bug could cause a system to crash, which makes it potentially fatal!
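Another common source of Heisenbugs (a different one from the memory example above) is a race condition, where the outcome depends on the order in which concurrent steps happen to run. The sketch below is a simplified model rather than real threaded code: the order of steps is passed in explicitly as a `schedule`, so that each ordering can be replayed on demand. The names `run_schedule`, `RACY`, and `SERIAL` are assumptions made for this illustration.

```python
def run_schedule(schedule):
    """Replay two workers incrementing a shared counter in a given order.

    Each worker increments in two steps: "read" copies the counter into
    a local value, and "write" stores local + 1 back into the counter.
    `schedule` is a list of (worker, step) pairs, in execution order.
    """
    counter = 0
    local = {}
    for worker, step in schedule:
        if step == "read":
            local[worker] = counter
        elif step == "write":
            counter = local[worker] + 1
    return counter


# Both workers read before either one writes: one update gets lost.
RACY = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]

# Each worker finishes before the other starts: both updates survive.
SERIAL = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]

if __name__ == "__main__":
    print(run_schedule(RACY))    # counter ends at 1
    print(run_schedule(SERIAL))  # counter ends at 2
```

In a real system we don’t get to choose the schedule; timing does. One plausible reason a bug like this seems to vanish under observation is that adding logging or attaching a debugger changes the timing, which can nudge the system toward the harmless ordering.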

Heisenbugs are incredibly hard to reason about because it’s hard to localize them and pin them down. And, because they’re hard to reproduce reliably, they’re hard to identify and thus difficult to actually solve!

Heisenbugs are especially relevant to distributed systems because they’re more likely to occur in a distributed system than in a localized, central one. These kinds of bugs are actually indicators of problems and failures in the system that occurred much earlier than when the bug manifested itself.

Heisenbugs are much more common in distributed systems.

A Heisenbug is usually a red flag that something else went wrong in the system a while ago and is only surfacing now; it just so happens that it is surfacing in the form of this bug.

In actuality, a Heisenbug is just a long-delayed side effect of a much earlier problem.

In more mature distributed systems, it is Heisenbugs, not Bohrbugs, that cause major failures and system crashes. If we think about it more deeply, this starts to make sense; there are many moving parts in a distributed system, and many dependencies and nodes that rely upon one another. A failure that appears to be coming from one node in the system might actually be three nodes removed from a failure that originated elsewhere, but propagated throughout the system. While a Bohrbug might be easy to reproduce, localize, and reason about, a Heisenbug is much trickier to think about and thus, to fix.
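Here is a rough sketch of how a failure can surface a few hops away from where it began. Three plain functions stand in for three nodes; all of the names (`ingest`, `enrich`, `aggregate`, `run_pipeline`) and the sensor data are hypothetical. This toy version is deterministic, so it only illustrates the distance between cause and symptom, not the unpredictability of a real Heisenbug.

```python
def ingest(raw):
    """Node 1: parse a raw reading into a record."""
    try:
        return {"sensor": raw["sensor"], "value": float(raw["value"])}
    except ValueError:
        # The real fault: a malformed value is silently replaced
        # with None instead of being rejected right here.
        return {"sensor": raw["sensor"], "value": None}


def enrich(record):
    """Node 2: add metadata and forward the record without checking it."""
    return {**record, "unit": "celsius"}


def aggregate(records):
    """Node 3: the first place the bad value is actually used."""
    return sum(r["value"] for r in records) / len(records)


def run_pipeline(raw_readings):
    """Send every reading through all three nodes and return the average."""
    return aggregate([enrich(ingest(r)) for r in raw_readings])


if __name__ == "__main__":
    good = [{"sensor": "a", "value": "20.0"}, {"sensor": "b", "value": "22.0"}]
    bad = good + [{"sensor": "c", "value": "n/a"}]
    print(run_pipeline(good))
    try:
        run_pipeline(bad)
    except TypeError as error:
        # The crash happens while aggregating, two nodes away from the cause.
        print("aggregate failed:", error)
```

Someone reading only the error would start digging into `aggregate`, even though the fix belongs in `ingest`. In a real distributed system those three steps might live on different machines, with queues and retries in between, which makes the trail even harder to follow.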

Anita Borr (my new favorite researcher on residual bugs) actually found in her research that many engineers have a hard time reasoning about Heisenbugs, and that attempts to fix a Heisenbug can actually cause more problems than there were to begin with! So if you’ve been feeling like Heisenbugs are tricky little creatures and hard to contend with, don’t worry; the research agrees with you!

Resources

Software failures in distributed systems are pretty cool to learn about, and you can learn a whole lot more about them. There is a lot of interesting research and writing on how to deal with and guard against Heisenbugs in your system. Check out some of the resources below if you’re curious to learn more!

Reliable Distributed Systems: Technologies, Web Services, and Applications, Kenneth Birman
Why Do Computers Stop and What Can Be Done About It?, Jim Gray
Introduction to Distributed Systems, University of Washington
Heisenbugs and Bohrbugs: Why are they different?, Richard Martin (?)
Protecting Applications Against Heisenbugs, Chris Hobbs

Top comments (3)

Dylan Anthony • Jun 6 '19

Great post! I love the comparison to Bohr and Heisenberg, I hadn’t heard that before.

The Heisenberg bugs are definitely more prevalent in any situation where there are parallel tasks, whether it be multi-threading or completely distributed.

Also, with IoT we’ve encountered several Heisenberg hardware failures! Environmental changes can have effects on hardware that are near impossible to reproduce in the lab.

Vaidehi Joshi • Jun 6 '19

Great point re: hardware failures in IoT! I haven't done much hardware stuff but I can imagine that, even with the many improvements we've seen, hardware failures are really hard to deal with whenever they do happen

rhymes • Jun 18 '19

Great post, thank you Vaidehi!