Welcome to episode 4 of our webinar series, From Theory to Practice. Blameless’s Matt Davis and Kurt Andersen were joined by Joanna Mazgaj, Director of Production Support at Tala, and Laura Nolan, Principal Software Engineer at Stanza Systems. They tackled a tricky and often overlooked aspect of incident management: problem detection.
It can be tempting to gloss over problem detection when building an incident management process. The process might start with classifying and triaging the problem and declaring an incident accordingly. The fact that the problem was detected in the first place is treated as a given, something assumed to have already happened before the process starts. Sometimes it is as simple as your monitoring tools or a customer report bringing your attention to an outage or other anomaly. But there will always be problems that won’t be caught by conventional means, and those are often the ones needing the most attention.
Our panel comes from diverse backgrounds, with Laura working at a very new and small startup and Joanna focusing on production at a large company, but each had experience dealing with problem detection challenges. The problems that are difficult to detect will vary greatly depending on what you’re focused on observing, but our panel found thought processes that consistently helped.
You might think of your system as having two basic states: working and broken, or healthy and unhealthy. This binary way of thinking is nice and simple for declaring incidents, but can be very misleading. It may lead you to overlook problems that exist in the gray areas between success and failure.
Kurt Andersen gave an example of this type of failure that’s becoming more relevant today: gray failure in machine learning projects, resulting from the training data drifting away from reality. Machine learning projects can give very powerful results, using a process where an algorithm is fed large amounts of labeled data until it learns to apply the same classifications to new data. For example, an algorithm can be trained to identify species of birds from a picture after being shown thousands of labeled pictures.
But what happens when the supplied data starts drifting away from accuracy? If the algorithm starts misidentifying birds because it was trained on bad data or has learned incorrect patterns, it won’t throw an error. A user trying to learn the name of a species likely won’t be able to tell that the result is incorrect. The system will start to fail in a subtle way that requires deliberate attention to detect and address.
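One lightweight way to notice this kind of silent failure is to compare the distribution of a model’s recent predictions against a baseline captured when the model was known to be healthy. This is a minimal sketch of that idea, not a prescription; the labels and threshold are illustrative, and real drift detection usually involves more than label frequencies:

```python
from collections import Counter

def label_distribution(labels):
    """Normalize a list of predicted labels into a frequency distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def drift_score(baseline, recent):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    all_labels = set(baseline) | set(recent)
    return 0.5 * sum(abs(baseline.get(l, 0) - recent.get(l, 0)) for l in all_labels)

# Baseline: what the model predicted during validation, when we trusted it.
baseline = label_distribution(["sparrow"] * 60 + ["robin"] * 30 + ["finch"] * 10)
# Recent production predictions have shifted heavily toward one class.
recent = label_distribution(["sparrow"] * 95 + ["robin"] * 5)

DRIFT_THRESHOLD = 0.2  # tuning this is domain-specific
if drift_score(baseline, recent) > DRIFT_THRESHOLD:
    print("prediction drift detected -- review model inputs")
```

The point is not the specific metric but that the check exists at all: without something deliberately comparing “now” to “known good,” the model keeps returning confident answers and nothing pages anyone.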
Laura Nolan pointed out that this type of gray failure is an example of an even more general problem – how do you know what “correct” is in the first place? “If you know something is supposed to be a source of truth, how did you double check that?” she asked. “In some cases there are ways, but it is a challenge.”
There’s no single way to detect “incorrectness” in a system when the system’s definition of “correct” can drift away from your intent. What’s important is identifying where this can happen in your system, and building efficient and reliable processes (even if they’re partially manual) to double check that you haven’t drifted into gray failure.
Another big challenge in problem detection: even if you’ve detected a problem, is it the right problem? Joanna gave an example: your system has an outage or another very high priority incident that needs dealing with. When you dive into the system to find the cause of the outage, you end up finding five other problems with the system. This is only natural, as complex systems are always “sort of broken”. But are any of these problems the problem, the one causing impact to users?
Matt shared an example from his personal life. When getting an MRI to detect a problem with his hearing, doctors found congestion in his sinuses. It wasn’t causing his hearing issues, and furthermore, the doctors guessed that if they gave an MRI to anyone, they’d probably find the same sinus congestion. Some problems you detect, while certainly problems, are ones that most systems simply “live with” and are unrelated to what you’re trying to find.
However, a system that keeps functioning despite its problems isn’t guaranteed to be safe. Laura discussed how a robust system can run healthily with all sorts of problems happening behind the scenes. This can be a double-edged sword. If these problems eventually accumulate into something unmanageable, it can be difficult to sort through the cause and effect of everything that has been piling up. For example, if you find your system “suddenly” runs out of usable memory, it could be many small memory leaks, individually unnoticeable, adding up. At the same time, the problems resulting from insufficient memory can seem like issues in themselves, instead of just symptoms of this one problem.
These tangled and obscured causes and effects are inevitable in a complex system. At the same time, you can’t overreact and waste time on every minor problem you see. Tools like SLOs, which proactively alert you before an issue starts impacting customer happiness, can help you strike a balance.
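As a rough illustration of that balance, an SLO-based check doesn’t alert on any single failure; it tracks how much of a predefined error budget has been spent, so minor problems are tolerated until they collectively threaten the objective. The numbers below are hypothetical:

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent for this window.

    slo_target: e.g. 0.999 means 0.1% of requests are allowed to fail.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% target leaves no budget at all
    return max(0.0, 1 - failed_requests / allowed_failures)

# A 99.9% SLO over 1,000,000 requests allows 1,000 failures.
# 400 failures so far means 60% of the budget is left -- no need to page anyone yet.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"{remaining:.0%} of error budget left")
```

In practice you would alert on the *rate* of budget consumption (burn rate) rather than a single snapshot, but the core idea is the same: the threshold for action is tied to user-facing impact, not to the mere existence of errors.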
Focusing on the user experience can help you understand some problems that are otherwise impossible to detect, including ones that occur when your system is functioning entirely as intended. These problems can result from a situation where your system is behaving as you expect, but not as your user expects. If the user relies on certain outputs of your system, and it produces a different output, it can create a hugely impactful problem without anything appearing wrong on your end.
Laura gave an example of this sort of problem. Data centers use a network topology known as a “Clos network” for redundancy and reliability. To put it simply, this design provides multiple paths between endpoints, so when a link fails, traffic is rerouted over an alternate path, generally providing uninterrupted connection through this small and common kind of failure. However, a customer’s system might react immediately to the link failing, causing a domino effect of major failure resulting from the mostly normal operations of the data center. So where does the problem exist, in the data center’s system or the user’s system? Both are functioning as intended. Laura suggests that the problem lies only in the interaction between the two systems – difficult to detect, to say the least!
Another high-profile example of this type of problem happened with a glitch where UberEats users in India were able to order free food. In this case, an error given by a payment service UberEats had integrated with was incorrectly parsed by UberEats as “success”. The problem only occurred in the space between how the message was generated and how it was interpreted.
This example teaches a good lesson in detecting and preventing this sort of problem. Building robust processes for handling what your system receives from other systems is essential – you can’t assume things will always arrive as you expect. Err on the side of caution, and have “safe” responses to data that your system can’t interpret. Test your integrations with external systems, simulating every type of output that could come in.
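The lesson above can be sketched as a fail-closed parser: any status the system doesn’t explicitly recognize as success is treated as a failure, so an unfamiliar error code can never be misread as a completed payment. The status names here are hypothetical, not the actual integration from the UberEats incident:

```python
# Explicit allow-list of statuses we treat as a completed payment.
SUCCESS_STATUSES = {"CAPTURED", "SETTLED"}
# Known failure codes, handled distinctly only so they can be logged precisely.
KNOWN_FAILURE_STATUSES = {"DECLINED", "INSUFFICIENT_FUNDS", "TIMEOUT"}

def interpret_payment_response(response: dict) -> str:
    """Map an external payment response to an internal outcome, failing closed."""
    status = response.get("status")
    if status in SUCCESS_STATUSES:
        return "success"
    if status in KNOWN_FAILURE_STATUSES:
        return "failure"
    # Unknown or missing status: the safe response is to reject the payment
    # and flag it for review, never to assume success.
    return "failure"
```

The design choice worth noting is the direction of the default: success requires an exact match against a short allow-list, while everything else, including statuses the upstream service invents tomorrow, falls through to failure.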
We hope you’re enjoying our continued deep dives into the challenges of SRE in From Theory to Practice. Check out the full episode here, and look forward to more episodes coming soon. Have an idea for a topic? Share it with us in our Slack Community!