
bfuller

Posted on • Originally published at opencontext.com

Six Days and Seven Nights: An Outage Tale

I'm moving older posts to dev.to. This post is from January 31, 2023.

Imagine you're the on-call DevOps person and you get paged at 11pm on a Wednesday. You check to see what's alerting, which policy is failing, what's down. You're hoping for something simple. It's not.

What you see is that some random part of your code base is alerting. You've NEVER seen this happen before, and you've been with the company for at least a year. Now you're feeling a sort of way I can only describe as the feeling of watching the movie Hereditary: that jolt of adrenaline where you're not sure what fresh Hell awaits you.

Wait, it gets worse.

You page the folks listed in the…nope, there is no runbook for this part of the code. You start wondering, "WHO WORKS ON THIS?!?!?!" So you backtrack. "Think, think, think, how are we going to triage this?" OK. You have someone check the audit logs, someone else on the monitoring tools, and the release engineer (if you have one) gets called in to sort out what might have changed.

What you have here is the makings of a proper DevOps horror story.

Turns out, no one owns the code that changed. Which means DAYS into this thing you are still pulling in new people as you untangle the problem. You're thinking to yourself, "Seriously, how did it take us so long to realize Megan was the SME?"

You have literally ALL EYES on you. Managers, Directors, shoot, even the C-level is on you to figure out what's going on. And all you needed, as you finally understand on the seventh night, was Megan.

Megan got in there and fixed the issue in under an hour. Megan, who isn't the project owner, the CODE OWNER, or even on the current team working the project. Because there isn't a team. Megan is the only person left from the team that built and maintained this piece of code, which, up until this incident, just worked.

It’s common enough that I’m sure most of us know someone who has experienced this or been intimately involved in this scenario.

But what leads every company into this trap?

Human error? Inexperience? Flawed technology? Inadequate budgets? Pressure?

Solving Problems and Why Context Switching Sucks

One of the things I love about being a Product Manager is solving problems. Over the years, I've heard some stories about a Day in the Life. In fact, I used to sit with the Ops team at one company; they would let me look over their shoulders while they were problem solving, and even during a few incidents.

They'd walk me through how they problem solve, why they start where they start, and flash all the different tools on the screen. I'd ask, isn't it hard to context switch between all of these? They'd shrug: you get used to it, you create a system, and you break up the problem solving so that the experts on a given set of tools each have a focus. The problem that eluded them was the simple one of trying to figure out who worked on what. Sometimes it was clear, sometimes it was not.

Sometimes simple problems are the hardest to solve. If we all knew who was working on which piece of code, or infrastructure, or who created that one template we all use over and over, that would solve so many problems. Not only for Ops but for my Security friends as well.

OpenContext continues to work on how to solve that. We have a pretty neat roadmap of tools we want to implement this year based on problems we see folks trying to solve. (Shoutout to everyone who has and continues to give us feedback.)

The Problem Behind the Incident

Let's walk through a specific scenario. We have a fictitious company; let's call it Scatter.ly. At Scatter.ly we have two teams, Squirrels and Raccoons, working on different product lines.

Today we will focus on the Raccoons' Retail App, but it's important to understand that there are three product lines managed by the two primary teams:

  • Dumpster (Raccoons)
  • Blue Sky (Raccoons)
  • Orchard (Squirrels)

Following me so far? Needless to say, there is bound to be code crossover and some legacy code that may not be owned by anyone right now. Not to mention new code on top of those old templates that just makes everything easier, until it doesn’t.

Even in our small demo environment, you can see that the shape of the company is complex.

[Image: graph of Scatter.ly's components, with Raccoon-owned and Squirrel-owned areas]

The upper right portion of the graph is the code owned by the Raccoon team, with the lower left owned by the Squirrel team. The Raccoons built the first product and everything else came after.

You can see clearly that the Squirrels had the opportunity to learn and grow from the Raccoons' scrappy start. I'm guessing most of you have a sense for the shape of your company. There might even be parts that you deal with every day and think to yourself, "You're killing me, Smalls!!" We've all done it, felt it, and are likely living it. We've got you.
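To make the ownership question concrete, here's a minimal sketch in Python of the kind of map the Scatter.ly graph encodes: product lines, the teams behind them, and a component whose original team is gone. The names and the product-line assignments are made up for illustration; this is not OpenContext's data model.

```python
# Hypothetical ownership map for Scatter.ly -- illustrative only.

PRODUCT_LINES = {
    "Dumpster": "Raccoons",
    "Blue Sky": "Raccoons",
    "Orchard": "Squirrels",
}

COMPONENTS = {
    # component -> (product line, owning team; None means no current team)
    "retail-app": ("Dumpster", "Raccoons"),
    "retail-dns": ("Dumpster", "Raccoons"),
    "blue-sky-sdk": ("Blue Sky", None),  # legacy code, original team dissolved
}

def who_owns(component: str) -> str:
    """Answer the 11pm question: who do we page for this component?"""
    product, team = COMPONENTS[component]
    if team is None:
        return f"nobody on record (legacy code under the {product} product line)"
    return team

if __name__ == "__main__":
    for name in COMPONENTS:
        print(f"{name}: {who_owns(name)}")
```

The interesting row is the one where the team is None; that gap is what the rest of this post is about.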

Let’s go to our outage. Retail App is down. In fact, it’s been down for hours and NO ONE can figure out what the heck even happened. The on-call teams are asking themselves:

  • Who owns what aspect or code path associated with Retail App?
  • Was there a template used to build this part of the code? Who created it? How long ago? Are they still here? What if someone built something on top of a random template?
  • Did someone touch code that sat for ages without incident? When? How did they change it?

What the team does know for sure is that whatever "it" is, it's causing a massive outage.

I've sat with amazing folks who have had variations of this problem: who broke what, when, and how, and most importantly, can I fix it quickly?

It’s All in a Day in the Life of a DevOps/Cloud/SRE/Platform Engineering Person

Without OpenContext or tools like it, there is a lot of research: deep dives, backtracking merges across teams, and pinpointing when it all went wrong. And that's assuming the failure was immediate and not more of a cascading one.

That's when you need to have someone who just sort of knows. If you don't have someone with both explicit knowledge and tacit knowledge, you're potentially looking at a multi-day problem.

In several instances I've watched teams divide and conquer to solve a problem. Someone looks at the logs to see if there are any anomalies worth investigating. If so, they do just that: they investigate further.

Others look at monitoring, perhaps starting with Nagios, to check the infrastructure and networking. If they find anything suspicious, they investigate.

Others might be looking at Honeycomb or DataDog.

The rough one is when they need to go to GitHub to track what changed and who changed it, and to identify any unseen connections. This is hard if you have a monorepo, and it's hard if you have services. It's not as hard if you only have a handful of devs committing code.

The teams I'm talking about are the ones with 50+ devs.
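When the hunt does come down to "what changed under this path recently, and who touched it," a quick first pass can be scripted. Here's a rough sketch that shells out to git; the repo path, service path, and time window are placeholders, not details from the story.

```python
# A minimal sketch: list recent commits touching a path, with author and date.
# Paths and the time window below are hypothetical -- adjust for your repo.
import subprocess

def recent_changes(repo_dir: str, path: str, since: str = "7 days ago") -> list[dict]:
    """Return recent commits under `path` as dicts with sha, author, date, subject."""
    out = subprocess.run(
        ["git", "log", f"--since={since}", "--date=short",
         "--pretty=format:%h|%an|%ad|%s", "--", path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    commits = []
    for line in out.splitlines():
        sha, author, date, subject = line.split("|", 3)
        commits.append({"sha": sha, "author": author, "date": date, "subject": subject})
    return commits

if __name__ == "__main__":
    for c in recent_changes(".", "services/retail-app", since="14 days ago"):
        print(f'{c["date"]}  {c["author"]:<20}  {c["subject"]}')
```

It doesn't answer the ownership question on its own, but it shrinks "everything in the repo" down to a short list of commits and humans to start from.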

Different teams have different cadences for code delivery, and sometimes you have folks who go outside the release schedule. Maybe you had a patch on one product line. Often you have devs on call for this, but maybe you don't. I think we've all seen the meme and know the feeling.

Often these are the problems that take days and evenings to resolve. They are hard-fought problems that require a level of communication, problem solving, and teamwork that is frankly impressive to watch. This is the problem we are solving for out of the gate.

What Product Managers Dream for Their Future Users

So how do we do that in OpenContext? We've got a video on how you can troubleshoot a service outage like the one we've described in 2 minutes (instead of 2 hours, or days, or weeks...), but keep reading for a breakdown of the steps involved.

For the retail app going down, I’d start at the context page. Here we can see who owns the Retail App. Great! Now we know WHO we should be talking to.

Just to be sure, we'll check that there isn't an issue with DNS. Looking down at the platform components, we can see retail-dns. Clicking on it, we are now at the retail-dns context page. Awesome!! The PagerDuty card shows us that there aren't any current incidents associated with retail-dns. Brilliant.

Next, we will check out the Run Book for Retail App, which we can find under the Aux Components.

After reviewing the doc, we work through it, but the problem remains.

At this point, we take stock of the Platform and Code Components. The team is on call but no one on the team is able to figure out what’s going on. Until… you scroll down to Blue Sky SDK!!!

In this case, we'll say the component is tied to an individual, Fernando, not a team. If you go to the context page for Blue Sky SDK, you'll see Fernando listed there even though he's the Raccoon team's manager.

[Image: Blue Sky SDK context page showing Fernando as the owner]

You can confirm this by clicking on the link to Fernando’s context page. From here you can sort out how to reach Fernando.
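The lookup we just did by clicking around boils down to a simple fallback: prefer a team owner, fall back to an individual, and surface how to reach them. Here's a tiny sketch of that logic with made-up data; it isn't OpenContext's API, just the shape of the question.

```python
# Hypothetical component-to-owner data -- not OpenContext's API or schema.
COMPONENT_OWNERS = {
    "retail-app": {"team": "Raccoons"},
    "retail-dns": {"team": "Raccoons"},
    "blue-sky-sdk": {"individual": "Fernando"},  # no current team owns it
}

CONTACTS = {
    "Raccoons": "#raccoons-oncall",
    "Fernando": "@fernando (Raccoon team manager)",
}

def escalation_contact(component: str) -> str:
    """Prefer a team owner; fall back to an individual; admit when there's neither."""
    owners = COMPONENT_OWNERS.get(component, {})
    owner = owners.get("team") or owners.get("individual")
    if owner is None:
        return "no owner on record -- time to go find your Megan"
    return CONTACTS.get(owner, owner)

print(escalation_contact("blue-sky-sdk"))  # -> @fernando (Raccoon team manager)
```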

Context and Tacit Knowledge For The Win

In this scenario, Fernando could be assigned because you had a contractor on the project and decided to list Fernando as the owner, and that team may not work at Scatter.ly anymore. Fernando may have someone doing work on the side, but no one officially assigned to the project. OR Fernando was the code owner before he was promoted to team manager. Either way, the remaining context lives with the company, and the team now has a human they can reach out to for help resolving this incident.

In talking to our Infrastructure, DevOps, and Platform Eng friends, we know that having all the information easily accessible, so the person can evaluate the problem and apply their tacit knowledge, is the ideal path forward.

Getting the right information to the right people when they need it allows those people to leverage their knowledge in a more effective and streamlined way. We love that for our Ops community.

Problem solving with OpenContext

OpenContext brings together all of the components of your business so you can understand everything you need to know about building and growing your product. Learn more about how it works or how it could help with what you're working on, and subscribe to our blog to stay in the loop on our latest life-changing resources for SREs.
