5 tips on debugging a production outage

#devops #webdev #architecture

1. Tools

There's no way to debug an outage if you can't extract information from your systems. You'll need some tooling to give you insight into what's going on inside them. That can be as basic as log files, or as advanced as some of the amazing observability tools that exist nowadays.

I clearly remember how overwhelming it was for me as a new developer to try to navigate the tools that we have. It took quite some time before I was somewhat comfortable with searching through our logs. There is so much information that it can be hard to know where to start.

One way you can get more comfortable with your tooling is by doing small exercises. Last week, I organized a session for my team with exactly this goal. They got a whole list of questions about our applications and had to find the answers. How many requests did we serve in the last day? How many of those failed? What was the most common failure reason?

This achieves three things:

You'll have to navigate the UI of your observability tools
You'll have to find the right information to look at, whether that is alerts, dashboards or metrics
You'll find out how well you are able to interpret the information you get from your tools

Doing this in a situation where there is no pressure can be very valuable.
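To make the exercise concrete, here is a minimal sketch of answering those three questions from raw logs. It assumes a JSON-lines access log with `status` and `error` fields, and that a status of 500 or higher counts as a failure. Those names and the sample data are made up for illustration, so adapt them to whatever your own tooling exports.

```python
import json
from collections import Counter

# Hypothetical JSON-lines access log; the field names are assumptions.
SAMPLE_LOG = """\
{"path": "/cart", "status": 200}
{"path": "/checkout", "status": 500, "error": "db_timeout"}
{"path": "/cart", "status": 200}
{"path": "/checkout", "status": 502, "error": "upstream_unavailable"}
{"path": "/pay", "status": 500, "error": "db_timeout"}
"""


def summarize(log_text):
    """Count requests, failures, and the most common failure reason."""
    records = [json.loads(line) for line in log_text.splitlines() if line.strip()]
    # Assumption: any 5xx status counts as a failed request.
    failed = [r for r in records if r["status"] >= 500]
    reasons = Counter(r.get("error", "unknown") for r in failed)
    top_reason = reasons.most_common(1)[0][0] if reasons else None
    return {"total": len(records), "failed": len(failed), "top_reason": top_reason}


summary = summarize(SAMPLE_LOG)
print(summary)
# {'total': 5, 'failed': 3, 'top_reason': 'db_timeout'}
```

In most observability tools you would answer the same questions with a query or a dashboard rather than a script, but working through it once by hand shows which fields you actually depend on.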

Tip 1: Get comfortable with your tools

2. Infrastructure

I was once part of a long-lasting outage of our webshop. The impact was significant: the whole website was down. We weren't seeing any requests on our load balancers, yet all our customers were getting error pages.

None of us knew exactly how our application worked from this perspective. We often debugged production issues coming from our application, but everything that happened between the browser of a customer and our application was somewhat unknown.

Because of this, the outage lasted a long time. By the time we located the issue, we were already multiple hours in. With a better understanding of our whole infrastructure, the impact would have been much lower.

A quick overview of your application documented somewhere can make all the difference. Knowing how the components of your application interact is crucial, especially because most problems occur at connections between components.
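Such an overview doesn't have to be a diagram. As a rough sketch, the map below lists which component sends traffic to which, and a small search returns the chain of connections a request follows. The component names are hypothetical and not our actual setup.

```python
from collections import deque

# Hypothetical component map: each key sends traffic to the components listed.
TOPOLOGY = {
    "browser": ["dns", "cdn"],
    "dns": [],
    "cdn": ["load_balancer"],
    "load_balancer": ["application"],
    "application": ["database", "payment_api"],
    "database": [],
    "payment_api": [],
}


def request_path(topology, start, goal):
    """Breadth-first search for the shortest chain of components from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in topology.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route between the two components


path = request_path(TOPOLOGY, "browser", "database")
# Each pair of neighbours is a connection worth checking during an outage.
hops = list(zip(path, path[1:]))
for source, target in hops:
    print(f"{source} -> {target}")
```

With a map like this, "no requests on the load balancer while customers see error pages" immediately narrows the search to the connections that come before the load balancer.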

Tip 2: Know your infrastructure

3. Experiment

It's very helpful to approach debugging in a methodical way. The free Google Site Reliability Engineering book dives into a lot of the details of how you can make sure your debugging efforts are effective.

The general idea is very similar to a scientific experiment. At every step, you formulate a hypothesis based on the information you have. Then, you verify whether that hypothesis is true. Based on the new information you just obtained, you repeat the process. This structured approach helps because it keeps you from acting on assumptions you haven't verified.
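The loop can be sketched in a few lines. Everything here is illustrative: the observations are canned values standing in for real dashboards and logs, and the hypotheses are fixed up front for brevity, whereas in a real investigation each new hypothesis follows from what the previous check revealed.

```python
# Canned observations standing in for real dashboards and logs (all hypothetical).
OBSERVATIONS = {
    "app_error_rate": 0.0,
    "lb_requests_per_min": 0,
    "dns_resolves": False,
}

# Each hypothesis pairs a statement with a check against the observations.
HYPOTHESES = [
    ("The application is throwing errors", lambda o: o["app_error_rate"] > 0.01),
    ("Traffic reaches the load balancer", lambda o: o["lb_requests_per_min"] > 0),
    ("DNS no longer resolves our domain", lambda o: not o["dns_resolves"]),
]


def investigate(hypotheses, observations):
    """Check hypotheses one by one and keep a record of every result."""
    notes = []
    for statement, check in hypotheses:
        confirmed = check(observations)
        notes.append((statement, confirmed))
    return notes


notes = investigate(HYPOTHESES, OBSERVATIONS)
for statement, confirmed in notes:
    print(("CONFIRMED" if confirmed else "REFUTED"), "-", statement)
```

Note that a refuted hypothesis is still progress: it rules something out and goes into the record alongside the confirmed ones.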

Tip 3: Hypothesize an explanation, check this hypothesis, repeat

4. Summarize

Inevitably, you'll get stuck in your debugging at some point. This can be tough to deal with, especially if you feel pressure from something being broken.

This is the time to summarize what you've learned about the problem so far. It really helps if someone else is there to check your summary for gaps or inconsistencies. It can also help to write down everything you learn about a problem. That makes it easy to go over it again, and the notes are valuable later when you evaluate the incident, for example in a post-mortem.

When you do this, you'll usually end up with one of a few outcomes:

You mention two observations that seem to contradict each other: there are no errors in the application logs, but I get an error page when I do a request
You notice there is something you haven't looked at yet: the application can't talk to our database anymore, did we change anything in our configuration?
Your observations point you to an obvious conclusion: the database looks fine, but the load balancer shows errors, so the problem is probably in the application

In all cases, it will be easier to think of the next thing to investigate.
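Writing things down doesn't need special tooling. As a small sketch, a timestamped list of notes that can be printed as a summary is already enough to share with a colleague or reuse in a post-mortem. The class and method names below are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentNotes:
    """A hypothetical helper that keeps timestamped observations for one incident."""

    title: str
    entries: list = field(default_factory=list)

    def note(self, text, at=None):
        # Default to the current UTC time when no timestamp is given.
        at = at or datetime.now(timezone.utc)
        self.entries.append((at, text))

    def summary(self):
        # Render the notes in chronological order, one per line.
        lines = [self.title]
        for at, text in sorted(self.entries, key=lambda entry: entry[0]):
            lines.append(f"{at:%H:%M} {text}")
        return "\n".join(lines)


incident = IncidentNotes("Webshop returns error pages")
incident.note("No errors in the application logs",
              at=datetime(2024, 1, 5, 9, 12, tzinfo=timezone.utc))
incident.note("Load balancer shows zero requests",
              at=datetime(2024, 1, 5, 9, 5, tzinfo=timezone.utc))
print(incident.summary())
```

Reading the summary back in order is often what surfaces the contradiction or the gap.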

Tip 4: When stuck, summarize what you know

5. Practice

Chances are, your applications don't randomly break every day. In a lot of cases, there are numerous safeguards in place to prevent outages. This can mean that you get out of touch with the current architecture, observability tooling, dashboards and metrics.

This is one of the reasons why PagerDuty has a weekly "Failure Friday", in which they simulate outages in a controlled way. This way you're guaranteed to look at production systems regularly. You can keep your knowledge of systems you don't touch that often fresh, and you can stay up to date on the current setup of applications that change often.