<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Emily Arnott</title>
    <description>The latest articles on DEV Community by Emily Arnott (@emilyarnott).</description>
    <link>https://dev.to/emilyarnott</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F500208%2F89a4034d-61f5-4d0e-8c3d-abb2a7019ec3.png</url>
      <title>DEV Community: Emily Arnott</title>
      <link>https://dev.to/emilyarnott</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/emilyarnott"/>
    <language>en</language>
    <item>
      <title>What's difficult about problem detection? - Three Key Takeaways</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Wed, 14 Sep 2022 17:50:56 +0000</pubDate>
      <link>https://dev.to/blameless/whats-difficult-about-problem-detection-three-key-takeaways-36dd</link>
      <guid>https://dev.to/blameless/whats-difficult-about-problem-detection-three-key-takeaways-36dd</guid>
      <description>&lt;p&gt;Welcome to &lt;a href="https://info.blameless.com/whats-difficult-about-problem-detection-0"&gt;episode 4&lt;/a&gt; of our webinar series, From Theory to Practice. Blameless’s Matt Davis and Kurt Andersen were joined by Joanna Mazgaj, Director of Production Support at Tala, and Laura Nolan, Principal Software Engineer at Stanza Systems. They tackled a tricky and often overlooked aspect of incident management: problem detection.&lt;/p&gt;

&lt;p&gt;It can be tempting to gloss over problem detection when building an incident management process. The process might start with &lt;a href="https://www.blameless.com/incident-response/incident-classification"&gt;classifying and triaging&lt;/a&gt; the problem and declaring an incident accordingly. The fact that the problem was detected in the first place is treated as a given, something assumed to have already happened before the process starts. Sometimes it is as simple as your monitoring tools or a customer report bringing your attention to an outage or other anomaly. But there will always be problems that won’t be caught with conventional means, and those are often the ones needing the most attention.&lt;/p&gt;

&lt;p&gt;Our panel came from diverse backgrounds, with Laura working at a very new and small startup and Joanna focusing on production at a large company, but each had experience dealing with problem detection challenges. The problems that are difficult to detect will vary greatly depending on what you’re focused on observing, but our panel found thought processes that consistently helped.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://info.blameless.com/whats-difficult-about-problem-detection-0"&gt;Listen to their discussion&lt;/a&gt; to learn from their experiences. Or, if you prefer to read, I’ve summarized three key insights in this blog post as I have for previous episodes (&lt;a href="https://www.blameless.com/sre/sre-from-theory-to-practice-whats-difficult-about-incident-command"&gt;one&lt;/a&gt;, &lt;a href="https://www.blameless.com/sre/sre-from-theory-to-practice-whats-difficult-about-on-call-discussion"&gt;two&lt;/a&gt;, and &lt;a href="https://www.blameless.com/sre/whats-difficult-about-tech-debt"&gt;three&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;Losing focus on the truth and gray failure&lt;/h2&gt;

&lt;p&gt;You might think of your system as having two basic states: working and broken, or healthy and unhealthy. This binary way of thinking is nice and simple for &lt;a href="https://www.blameless.com/incident-response/is-it-really-an-incident"&gt;declaring incidents&lt;/a&gt;, but can be very misleading. It may lead you to overlook problems that exist in the gray areas between success and failure.&lt;/p&gt;

&lt;p&gt;Kurt Andersen gave an example of this type of failure that’s becoming more relevant today: gray failure in machine learning projects, resulting from the training data drifting away from reality. Machine learning projects can give very powerful results, using a process where an algorithm is fed tons of labeled data until it learns to apply the same labels to new data. For example, an algorithm can be trained to identify species of birds from a picture after being shown thousands of labeled pictures.&lt;/p&gt;

&lt;p&gt;But what happens when the supplied data starts drifting away from accuracy? If the algorithm starts misidentifying birds because it’s working from bad data or has learned incorrect patterns, it won’t throw up an error. A user trying to learn the name of a species likely won’t be able to tell that the result is incorrect. The system will start to fail in a subtle way that requires deliberate attention to detect and address.&lt;/p&gt;

&lt;p&gt;Laura Nolan pointed out that this type of gray failure is an example of an even more general problem – how do you know what “correct” is in the first place? “If you know something is supposed to be a source of truth, how did you double check that?” she asked. “In some cases there are ways, but it is a challenge.”&lt;/p&gt;

&lt;p&gt;There’s no single way to detect “incorrectness” in a system when the system’s definition of “correct” can drift away from your intent. What’s important is identifying where this can happen in your system, and building efficient and reliable processes (even if they’re partially manual) to double check that you haven’t drifted into gray failure.&lt;/p&gt;
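&lt;p&gt;As a rough illustration of such a double check (my own sketch, not something from the episode – the labels and the 10% threshold are hypothetical), you could periodically compare the distribution of your system’s outputs against a trusted baseline and flag when it shifts too far:&lt;/p&gt;

```python
from collections import Counter

def drift_detected(baseline_labels, recent_labels, threshold=0.1):
    """Flag drift when any output class's share of results moves more
    than `threshold` away from its share in a trusted baseline."""
    def shares(labels):
        counts = Counter(labels)
        total = len(labels)
        return {label: count / total for label, count in counts.items()}

    base, recent = shares(baseline_labels), shares(recent_labels)
    classes = set(base).union(recent)
    # Largest shift in any single class's share of the output
    return max(abs(base.get(c, 0.0) - recent.get(c, 0.0)) for c in classes) > threshold
```

&lt;p&gt;A check this simple won’t catch every kind of drift, but it turns “has our definition of correct moved?” into a question you can ask on a schedule, rather than only after users complain.&lt;/p&gt;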

&lt;h2&gt;Detecting the wrong problems and mixing up symptom and cause&lt;/h2&gt;

&lt;p&gt;Another big challenge in problem detection: even if you’ve detected a problem, is it the right problem? Joanna gave an example: your system has an outage or another very high-priority incident that needs dealing with. When you dive into the system to find the cause of the outage, you end up finding five other problems with it. This is only natural, as complex systems are always “sort of broken”. But are any of these problems the problem, the one causing impact to users?&lt;/p&gt;

&lt;p&gt;Matt shared an example from his personal life. When he got an MRI to diagnose a problem with his hearing, doctors found congestion in his sinuses. It wasn’t causing his hearing issues, and furthermore, the doctors guessed that an MRI of almost anyone would show the same congestion. Some of the problems you detect, while certainly problems, are ones that most systems simply “live with” and are unrelated to what you’re trying to find.&lt;/p&gt;

&lt;p&gt;However, a system that keeps functioning despite its problems offers no guarantee of safety. Laura discussed how a robust system can run healthily with all sorts of problems happening behind the scenes. This can be a double-edged sword. If these problems eventually accumulate into something unmanageable, it can be difficult to sort through the cause and effect of everything that has been piling up. For example, if you find your system “suddenly” runs out of usable memory, it could be because many small memory leaks, individually unnoticeable, have added up. At the same time, the problems resulting from insufficient memory can seem like issues in themselves, instead of just symptoms of this one problem.&lt;/p&gt;

&lt;p&gt;These tangled and obscured causes and effects are inevitable in a complex system. At the same time, you can’t overreact and waste time on every minor problem you see. Tools like SLOs, which alert you before an issue starts impacting customer happiness, can help you strike a balance.&lt;/p&gt;
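&lt;p&gt;To make the SLO idea concrete (a minimal sketch of my own, assuming a simple availability SLO – not a description of any particular tool), you could compare the rate you’re burning error budget against the rate the SLO allows, and only page when the burn is unsustainable:&lt;/p&gt;

```python
def burn_rate(errors, requests, slo_target=0.999):
    """How fast the error budget is burning: 1.0 means on pace to spend
    exactly the whole budget by the end of the SLO window."""
    error_budget = 1.0 - slo_target        # allowed failure fraction
    observed_error_rate = errors / requests
    return observed_error_rate / error_budget

def should_page(errors, requests, slo_target=0.999, max_rate=10.0):
    """Page only when the budget is burning far faster than sustainable,
    instead of alerting on every individual blip."""
    return burn_rate(errors, requests, slo_target) > max_rate
```

&lt;p&gt;The point of the design is the balance the panel described: minor problems burn a little budget and get logged, while only customer-visible trouble wakes someone up.&lt;/p&gt;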

&lt;h2&gt;Problems occurring in the intersection of systems&lt;/h2&gt;

&lt;p&gt;Focusing on the user experience can help you understand some problems that are otherwise impossible to detect, even ones that occur when your system is functioning entirely as intended. These problems can result from a situation where your system is behaving as you expect, but not as your user expects. If the user relies on certain outputs of your system, and it produces a different output, it can create a hugely impactful problem without anything appearing wrong on your end.&lt;/p&gt;

&lt;p&gt;Laura gave an example of this sort of problem. Data centers use a topology known as a “Clos network” for redundancy and reliability. To put it simply, if a link fails, traffic shifts to a backup link, generally providing an uninterrupted connection through this small and common kind of failure. However, a customer’s system might react immediately to the link failing, causing a domino effect of major failure resulting from the mostly normal operations of the data center. So where does the problem exist: in the data center’s system or the user’s system? Both are functioning as intended. Laura suggests that the problem lies only in the interaction between the two systems – difficult to detect, to say the least!&lt;/p&gt;

&lt;p&gt;Another high-profile example of this type of problem happened with a glitch where &lt;a href="https://twitter.com/gergelyorosz/status/1502947315279187979"&gt;UberEats users in India were able to order free food&lt;/a&gt;. In this case, an error given by a payment service UberEats had integrated with was incorrectly parsed by UberEats as “success”. The problem only occurred in the space between how the message was generated and how it was interpreted.&lt;/p&gt;


&lt;p&gt;This example teaches a good lesson in detecting and preventing this sort of problem. Building robust processes for handling what your system receives from other systems is essential – you can’t assume things will always arrive as you expect. Err on the side of caution, and have “safe” responses to data that your system can’t interpret. Simulate your links with external systems to make sure you cover all types of output that could come in.&lt;/p&gt;
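&lt;p&gt;A minimal sketch of that “safe response” idea (the function and status names here are hypothetical, not taken from the UberEats incident): only an explicitly recognized success counts as success, and anything unknown fails safe:&lt;/p&gt;

```python
KNOWN_SUCCESS = {"CHARGE_COMPLETED"}
KNOWN_FAILURE = {"CHARGE_DECLINED", "CHARGE_ERROR"}

def payment_succeeded(response):
    """Interpret an external payment response defensively: only an
    explicit, recognized success status counts as success."""
    status = response.get("status")
    if status in KNOWN_SUCCESS:
        return True
    if status in KNOWN_FAILURE:
        return False
    # Unknown or missing status: fail safe instead of assuming success,
    # and ideally log it so the new status gets investigated.
    return False
```

&lt;p&gt;The opposite default – treating anything that isn’t a known failure as success – is exactly the kind of gap between systems where these problems live.&lt;/p&gt;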


&lt;p&gt;We hope you’re enjoying our continued deep dives into the challenges of SRE in From Theory to Practice. Check out the full episode &lt;a href="https://info.blameless.com/whats-difficult-about-problem-detection-0"&gt;here&lt;/a&gt;, and look forward to more episodes coming soon. Have an idea for a topic? Share it with us in our &lt;a href="https://www.blameless.com/slack-community"&gt;Slack Community&lt;/a&gt;!&lt;/p&gt;


</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>A Chat with Lex Neva of SRE Weekly</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Mon, 22 Aug 2022 20:29:50 +0000</pubDate>
      <link>https://dev.to/blameless/a-chat-with-lex-neva-of-sre-weekly-4ck4</link>
      <guid>https://dev.to/blameless/a-chat-with-lex-neva-of-sre-weekly-4ck4</guid>
      <description>&lt;p&gt;Since 2015, Lex Neva has been publishing &lt;a href="https://www.blameless.com/sre/sre-weekly-interview"&gt;SRE Weekly&lt;/a&gt;. If you’re interested enough in reading about SRE to have found this post, you’re probably familiar with it. If not, there’s a lot of great articles to catch up on! Lex selects around 10 entries from across the internet for each issue, focusing on everything from SRE best practices to the socio- side of systems to major outages in the news.&lt;/p&gt;

&lt;p&gt;I had always figured Lex must be among the most well-read people in SRE, and likely #1. I met up with Lex on a call, and was so excited to chat with him about how SRE Weekly came to be, how it continues to run, and his perspective on SRE.&lt;/p&gt;

&lt;h2&gt;The origins of SRE Weekly&lt;/h2&gt;

&lt;p&gt;I felt like an appropriate start to our conversation was to ask about the start of SRE Weekly: why did he take on this project? Like the creators of many good projects, Lex was motivated to “be the change he wanted to see”. He was an avid reader of &lt;a href="https://devopsweeklyarchive.com/"&gt;Devops Weekly&lt;/a&gt;, but wished that something similar existed for SRE. With so much great and educational content created in the SRE space, shouldn’t there be something to help people find the very best?&lt;/p&gt;

&lt;p&gt;“I wanted there to be a list of things related to SRE every week, and such a thing didn’t exist, and I’m like… Oh.” Lex explained. “I almost fell into it sideways, I thought this was gonna be a huge time sink, but it ended up being pretty fun, actually.”&lt;/p&gt;

&lt;h2&gt;How SRE Weekly is made&lt;/h2&gt;

&lt;p&gt;When thinking about the logistics of SRE Weekly, one question likely comes to mind: how? How does he have time to read all those articles? SRE is a methodology of methodologies, a practice that encourages building and improving practices. Lex certainly embodies this with his efficient method of finding and digesting dozens of articles a week.&lt;/p&gt;

&lt;p&gt;First, he finds new articles. For this, RSS feeds are his favorite tool. Once he’s got a buffer of new articles queued up, he uses an Android application called &lt;a href="https://www.hyperionics.com/atVoice/"&gt;@voice&lt;/a&gt; to listen to them with text-to-speech – at 2.5x speed! Building up the ability to comprehend an article at that speed is a challenge, but for someone tackling the writing output of the entire community, it’s worth it.&lt;/p&gt;

&lt;p&gt;To choose which articles to include, Lex doesn’t have any sort of strict requirements. He’s interested in articles that can bring new ideas or perspectives, but also likes to periodically include well-written introductory articles to get people up to speed. Things that focus on the socio- side of the sociotechnical spectrum also interest him, especially when highlighting the diversity of voices in SRE.&lt;/p&gt;

&lt;p&gt;Incident retrospectives are also a genre of post that Lex likes to highlight. Companies posting public statements about outages they’ve experienced and what they’ve learned is a trend Lex wants to see grow. Although they might seem to only tell the story of one incident at one company, good incident retrospectives can bring out a more universal lesson. “An incident is like an unexpected situation that can teach us something – if it’s something that made you surprised about your system, it probably can teach someone else about their system too.” &lt;/p&gt;

&lt;p&gt;Lex explained how in the aviation industry, massive leaps forward in reliability were made when competing airlines started sharing what they learned after crashes. They realized that any potential competitive advantages should be secondary to working together to keep people safe. “The more you share about your incidents, the more we can realize that everyone makes errors, that we’re all human,” Lex says. Promoting incident retrospectives is how he can further these beneficial trends. &lt;/p&gt;

&lt;h2&gt;Lex’s view of SRE&lt;/h2&gt;

&lt;p&gt;As someone with a front row seat to the evolution of SRE, I was curious what sort of trends Lex had seen and how he foresees them growing and changing. We touched on many subjects, but I’ll cover three major ones here:&lt;/p&gt;

&lt;h3&gt;Going beyond the Google SRE book&lt;/h3&gt;

&lt;p&gt;Since it was published in 2016, the &lt;a href="https://sre.google/sre-book/table-of-contents/"&gt;Google SRE book&lt;/a&gt; has been the canonical text when it comes to SRE. In recent years, however, the idea that this book shouldn’t be the end-all be-all is becoming more prominent. At SREcon 21, Niall Murphy, one of the book’s authors, &lt;a href="https://www.youtube.com/watch?v=7Ktzu0qvS6c&amp;amp;t=779s"&gt;ripped it up live on camera&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Lex has seen this shift in attitudes in a lot of recent writing, and he’s happy to see a more diverse understanding of what SRE can be: “Even if Google came up with the term SRE, lots of companies had been doing this sort of work for even longer,” Lex said. “I want SRE to not just mean the technical core of making a reliable piece of code – although that’s important too – but to encompass everything that goes into building a reliable system.”&lt;/p&gt;

&lt;p&gt;As SRE becomes more popular, companies of all sizes are seeing the benefits and wanting to hop aboard. Not all of these companies can muster the same resources as Google… Actually, practically only Google is at Google’s level! Lex has been seeing more lessons emerge around the challenges of doing SRE at other scales, like startups, where there aren’t any extra resources to spare. &lt;/p&gt;

&lt;h3&gt;Broadening what an SRE can be&lt;/h3&gt;

&lt;p&gt;As we break away from the Google SRE book, we also start to break away from traditional descriptions of what a Site Reliability Engineer needs to do. “SRE is still in growing pains,” Lex said. “We’re still trying to figure out what we are. But it’s not a bad thing. I’ve embraced that there’s a lot under the umbrella.”&lt;/p&gt;

&lt;p&gt;We often think of the “Engineer” in Site Reliability Engineer as being like “Software Engineer”, that is, someone who primarily writes code. But Lex encourages a more holistic view: that SRE is about engineering reliability into a system, which involves so much more than just writing code. He’s been seeing more writing and perspectives from SREs who have “writing code” as a small percentage of their duties – even 0%.&lt;/p&gt;

&lt;p&gt;“They’re focusing more on the people side of things, the incident response, and coming up with the policies that engender reliability in their company… And I think there’s room for that in SRE, because at the heart of it is still engineering, it’s still the engineering mindset. If you only do the technical side of things, you’re really missing out.”&lt;/p&gt;

&lt;h3&gt;Diversifying the perspectives of SREs&lt;/h3&gt;

&lt;p&gt;Alongside diversifying the role of SREs, Lex hopes to see more diversity among SREs themselves. In our closing discussion, I asked Lex what message he would broadcast to everyone in this space if he could. “It’s all about the people,” he said. “These complex systems that we’re building, they will always have people. They’re a critical piece of the infrastructure, just as much as servers.”&lt;/p&gt;

&lt;p&gt;Even if what we build in SRE seems to be governed just by technical interactions, people are intrinsic to making those systems reliable. This isn’t a negative; this isn’t just people being “error-makers”. People are what gives a system strength and resiliency. To this point, Lex highlighted what can make this socio- side of systems better: diversity and inclusion.&lt;/p&gt;

&lt;p&gt;“Inclusion is important for the reliability of our socio-technical systems because we need to understand the perspective of all our users, not just the ones that are like us. That means thinking across race, gender expression, class, neurodivergence, everything. It’s an area where we need to do better.” Lex hopes to highlight the richness in this diversity in SRE Weekly.&lt;/p&gt;

&lt;p&gt;As people standing at the relative beginning of SRE, working together to build and evolve the practice, we’re given both a challenge and an opportunity. In order to truly understand and engineer reliability into what we do, we need to proactively discuss our goals and how we’re achieving them. We hope you take the time to reflect on the learning that many great SRE writers share through spaces like SRE Weekly.&lt;/p&gt;

&lt;p&gt;Read SRE Weekly &lt;a href="https://sreweekly.com/"&gt;here&lt;/a&gt;, and follow it on Twitter &lt;a href="https://twitter.com/SREWeekly"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>SRE: From Theory to Practice | What’s difficult about tech debt?</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Wed, 10 Aug 2022 18:39:56 +0000</pubDate>
      <link>https://dev.to/emilyarnott/sre-from-theory-to-practice-whats-difficult-about-tech-debt-49k5</link>
      <guid>https://dev.to/emilyarnott/sre-from-theory-to-practice-whats-difficult-about-tech-debt-49k5</guid>
      <description>&lt;p&gt;In &lt;a href="https://info.blameless.com/whats-difficult-about-technical-debt"&gt;episode 3 of From Theory to Practice&lt;/a&gt;, Blameless’s Matt Davis and Kurt Andersen were joined by Liz Fong-Jones of Honeycomb.io and Jean Clermont of Flatiron to discuss two words dreaded by every engineer: technical debt. So what is technical debt? Even if you haven’t heard the term, I’m sure you’ve experienced it: parts of your system that are left unfixed or not quite up to par, but no one seems to have the time to work on.&lt;/p&gt;

&lt;p&gt;Pretend your software system is a house. Tech debt is the leak in your sink that you haven’t gotten around to fixing yet. Tech debt is the messy office you haven’t organized in a while. It’s also the new shelf you bought but haven’t installed. To-do’s quickly build up over time. Even if certain tasks are quick, there are just so many of them that it’s tough to know where to start.&lt;/p&gt;

&lt;p&gt;As software systems become more complex, with more integrations, microservices, and features, it becomes more likely that technical debt will accumulate. How do you get ahead of it before it piles up? And how do you deal with tech debt you already have, without sacrificing velocity on new projects?&lt;/p&gt;

&lt;p&gt;It’s a question every engineer faces. We were excited to have two experts joining our panel to dive into the issue. &lt;a href="https://mobile.twitter.com/lizthegrey"&gt;Liz Fong-Jones&lt;/a&gt; is an SRE with 16 years of experience in improving the reliability of sociotechnical systems. She’s seen a wide scale of tech debt, from the challenges of high-velocity startups avoiding accumulating tech debt, to paying down tech debt built up over years at major enterprise orgs. Jean Clermont is a program manager at Flatiron, a medical tech company. Handling incident management and building resilience when dealing with something as critical as the human body requires a proactive mind for long-term issues, including tech debt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://info.blameless.com/whats-difficult-about-technical-debt"&gt;Watch the recording&lt;/a&gt; to hear their insights. And as I did for &lt;a href="https://www.blameless.com/sre/sre-from-theory-to-practice-whats-difficult-about-on-call-discussion"&gt;episode one&lt;/a&gt; and &lt;a href="https://www.blameless.com/sre/sre-from-theory-to-practice-whats-difficult-about-incident-command"&gt;episode two&lt;/a&gt;, I’ll be summarizing three key insights from their discussion!&lt;/p&gt;

&lt;h2&gt;Paying off tech debt means really knowing where it is&lt;/h2&gt;

&lt;p&gt;It can be difficult to track where tech debt is accumulating. Even when you have the opportunity to proactively reduce tech debt, how do you know what to tackle first? This is a question that has to be answered both proactively and reactively.&lt;/p&gt;

&lt;p&gt;Proactively, you should try to log technical debt as it’s created. Sometimes technical debt is inevitable. You might need to implement a fix or feature that has a toil-intensive process to maintain it, or is unintuitive when expanding on it. Matt suggests logging whenever making a change that creates these issues. That way, when you’re looking to reduce tech debt, you have a list ready of issues to address. You can also use this list when estimating how long future tasks and updates will take – you can compensate for the delays that the tech debt will cause when working in specific areas of the codebase. Judging these delays can motivate you to deal with the most damaging tech debt.&lt;/p&gt;
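&lt;p&gt;A tech debt log doesn’t need heavy tooling. As a hypothetical sketch (the fields are my own suggestion, not from the episode), even a small structured record of each piece of debt and the recurring toil it imposes lets you sort for the most damaging items first:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class DebtEntry:
    """One logged piece of tech debt, recorded when the shortcut is taken."""
    area: str                  # part of the codebase it lives in
    description: str           # what the shortcut was and why
    weekly_toil_hours: float   # recurring cost it imposes on the team

def most_damaging(log, top=3):
    """Surface the debt imposing the most recurring toil first."""
    return sorted(log, key=lambda e: e.weekly_toil_hours, reverse=True)[:top]
```

&lt;p&gt;The same entries double as the padding estimate the panel mentioned: when a project touches an area with logged debt, its recurring cost tells you roughly how much delay to expect.&lt;/p&gt;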

&lt;p&gt;Liz pointed out that focusing entirely on this “known” tech debt can “lull you into a false sense of security”. A lot of tech debt is created without people noticing, in small decisions that have unexpected consequences. Jean described tech debt as like an “iceberg”, where the vast majority of it could be hidden below the surface. Running up against this unknown tech debt is likely to be even more damaging than expected tech debt, as you won’t be able to proactively compensate for it.&lt;/p&gt;

&lt;p&gt;So how do you find this tech debt “beneath the surface”? You need to look for the symptoms of it. Liz highlights the two major ways tech debt manifests: making it harder to develop new software, and increasing the toil required to maintain the system. Look for common processes that are very toilsome, or review projects that hit a lot of unexpected hurdles. Underlying tech debt could be the cause.&lt;/p&gt;

&lt;h2&gt;Paying down tech debt incrementally is best&lt;/h2&gt;

&lt;p&gt;It can be tempting to make proclamations like “we’ll spend until the end of the quarter dealing with all our tech debt, and then we’ll be fine going forward”. It’s the same behavior people have towards financial debt, or other tasks they’ve been avoiding. Rather than having to deal with them as they crop up, one imagines that the future will bring a totally different attitude or opportunity that allows easy cleanup of the neglected tasks.&lt;/p&gt;

&lt;p&gt;Unfortunately, this plan doesn’t usually work out. If you aren’t in the habit of dealing with tech debt continuously, switching gears into focusing on it will be jarring and limit productivity. You likely won’t have a good understanding of where to start paying down tech debt if you usually ignore it. Moreover, you’ll immediately start accumulating tech debt again, without having the habits in place to track it and deal with it.&lt;/p&gt;

&lt;p&gt;Instead, deal with tech debt incrementally, solving small parts as you become aware of them. Liz suggested that on-call engineers find chances to invest in battling tech debt. Working on documentation and runbooks can help counter the toil of tech debt. Even if you don’t have the bandwidth to overhaul code itself, having documentation and processes will reduce the problems tech debt causes. It also highlights where effort should be spent to make fundamental changes when possible. Every step you take to reduce tech debt helps more than waiting for the perfect time to try to wipe it all out.&lt;/p&gt;

&lt;h2&gt;Incentivize dealing with tech debt&lt;/h2&gt;

&lt;p&gt;Paying down tech debt isn’t always the most glamorous or visible work. Unfortunately, compared to developing new features that get celebrated releases, cleaning up or documenting old code may not get the same recognition. Rather than ignoring this disparity between the types of work and hoping that people will rise to the challenge of tech debt regardless, find ways to make tech debt work more visible and celebrated.&lt;/p&gt;

&lt;p&gt;One option, suggested by Matt Davis, is to set up programs like “bug bounties” for troublesome pieces of tech debt. Whoever finds the time and energy to rectify the tech debt would be able to claim the bounty. This bounty could be an actual financial reward, or some recognition in a team-wide or organization-wide meeting.&lt;/p&gt;

&lt;p&gt;Another option is to add clearing tech debt to the requirements of sprints and projects alongside new feature work. For example, a project plan could include three new features and one major piece of tech debt dealt with. This plan, suggested by Kurt, puts tech debt on the same level as feature development, equally instrumental in finishing the project. This helps the work get recognized as it’s tracked alongside the rest of the new feature work.&lt;/p&gt;

</description>
      <category>sre</category>
    </item>
    <item>
      <title>SRE: From Theory to Practice | What's difficult about incident command</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Wed, 27 Jul 2022 16:45:02 +0000</pubDate>
      <link>https://dev.to/blameless/sre-from-theory-to-practice-whats-difficult-about-incident-command-4icm</link>
      <guid>https://dev.to/blameless/sre-from-theory-to-practice-whats-difficult-about-incident-command-4icm</guid>
      <description>&lt;p&gt;A few weeks ago we released &lt;a href="https://info.blameless.com/whats-difficult-about-incident-command-lp"&gt;episode two&lt;/a&gt; of our ongoing webinar series, SRE: From Theory to Practice. In this series, we break down a challenge facing SREs through an open and honest discussion. Our topic this episode was “what’s difficult about incident command?” When things go wrong, who is in charge? And what does it feel like to do that role? To discuss, Jake Englund and Matt Davis from Blameless were joined by Varun Pal, Staff SRE at Procore, and Alyson Van Hardenburg, Engineering Manager at Honeycomb.&lt;/p&gt;

&lt;p&gt;To explore how organizations felt about incident command, we asked about the role on our &lt;a href="https://www.blameless.com/slack-community"&gt;community Slack channel&lt;/a&gt;, an open space for SRE discussion. We found that most organizations don’t have dedicated incident commander roles. Instead, on-call engineers are trained to take on the command role when appropriate. Because of this wide range of people who could end up wearing the incident commander hat, it’s important to have an empathetic understanding of exactly what the role entails.&lt;/p&gt;

&lt;p&gt;With this conversation, we wanted to work through what incident command theoretically entails, and connect it to the messy reality of what it often looks like. As we did for &lt;a href="https://www.blameless.com/sre/sre-from-theory-to-practice-whats-difficult-about-on-call-discussion"&gt;last episode&lt;/a&gt;, we’ll highlight three key takeaways as an introduction to the episode.&lt;/p&gt;

&lt;h2&gt;Create support structures for groups of incident commanders&lt;/h2&gt;

&lt;p&gt;Varun discussed how at Procore, he and his colleague started an incident commander “guild”, a group of people who may have to take on the incident command role that meets weekly. Before starting the guild, they recognized that each person taking on the role may have vastly different areas of expertise and perspectives on how incidents should be run. When reviewing incidents in retrospectives, they’d often find inconsistencies based on who was commanding the incident. This created challenges for finding patterns across incidents, and for using consistent methods to investigate the causes of incidents. This was the impetus to gather incident commanders in this new guild.&lt;/p&gt;

&lt;p&gt;By bringing together everyone who could wear the incident commander hat, they not only got everyone on the same page, but on the “best” page. Everyone could contribute what they found most effective, synthesizing their experiences into an agreed-upon methodology to which everyone adheres. The program was started from the bottom up, knowing that the time and energy invested would make everyone’s lives easier in the long run.&lt;/p&gt;

&lt;p&gt;Perhaps even more important than coming up with good procedures, the incident commander guild provides solidarity and empathy. It’s a safe space for people who respond to incidents to share in one another’s triumphs, and commiserate and vent about frustrations. Incident command is tough work: it’s a job that can have you leaping out of bed at 3am and suddenly being asked to direct a team of other tired people. Without support, people can quickly burn out.&lt;/p&gt;

&lt;h2&gt;Empathize with anxiety around expertise&lt;/h2&gt;

&lt;p&gt;“I’d rather be on-call 24/7 for something I’m the subject matter expert on than spend 5 minutes being incident commander for something I don’t know about,” said Jake in our discussion. It might be an exaggeration – but not a huge one. Everyone else on the call echoed this sentiment. The anxiety around not knowing is only sensible. In the crunch time of an incident, no one wants to cause further delay because they don’t know how something works.&lt;/p&gt;

&lt;p&gt;The first step to addressing this, as Jake emphasized, is to realize that engineers are not fungible. You can’t assume that every engineer has the expertise and experience of every other engineer. For engineers to be effective on-call, they need to be brought up to speed on how the system functions. Without that, you can’t assume that “deploying people” at a problem will have any effect.&lt;/p&gt;

&lt;p&gt;Even with training, some engineers will always be more familiar with some service areas than others, perhaps because they worked on the project itself. No matter how prepared they are for on-call in general, this relative lack of expertise will always cause anxiety: people will inevitably fear an incident that exposes the things they don’t know, or even the things they don’t know that they don’t know. This is why Alyson emphasized, to everyone’s agreement, that the subject matter expert shouldn’t have to be the incident commander. Good incident response shouldn’t be about “getting lucky” and having the expert on call, but about establishing learning and processes that help anyone solve issues.&lt;/p&gt;

&lt;p&gt;Since this anxiety is to some extent inevitable, the important thing is to empathize with it and set up systems that account for it. Often, there will be designated people to escalate to. It’s helpful to know who to call, but it can be intimidating if you think you’re bothering someone with a question you “ought to know”. One panelist brought up the idea of “on-call buddies”: someone you feel comfortable contacting even when you’re unsure of what you “ought to know”. Then, even if neither of you knows, you’ll feel more encouraged to escalate further. In general, escalation policies shouldn’t be strict and linear, but based on expertise and connections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Incident command is like first aid
&lt;/h2&gt;

&lt;p&gt;We’ve looked at some best practices to make life better for incident commanders, but a key question remains: what exactly is incident command? Is it a duty that rotates through everyone on-call, with that designated person taking command for every incident on that shift? Or is it determined at the time of the incident – perhaps the person who first responds to the incident, or the most expert person on-call, or the most senior person involved in each incident? Or should you hire a designated incident command person? What duties does someone have when they’re on incident command?&lt;/p&gt;

&lt;p&gt;When discussing these questions, our panel concluded that… it depends. Who an incident commander is and what they do may vary from org to org, and from incident to incident. But when you’re building up the practice yourself, “it depends” isn’t a very helpful answer. That’s why I wanted to highlight a framework for incident command suggested by Alyson: incident command is like first aid.&lt;/p&gt;

&lt;p&gt;First aid isn’t about fully treating a patient, or even fully diagnosing them. It’s about taking charge of a situation and making sure critical tasks are happening and not falling through the cracks. Alyson described a scene where you witness an accident and immediately give direction: “you, elevate the head and try to stop the bleeding”; “you, call an ambulance”, etc. Instructing particular people bypasses the &lt;a href="https://en.wikipedia.org/wiki/Bystander_effect"&gt;bystander effect&lt;/a&gt; and ensures the task is completed.&lt;/p&gt;

&lt;p&gt;When you’re the incident commander, it can be helpful to focus on this role of immediate task allocation, instead of getting bogged down immediately by diagnosis and response itself. Matt also emphasized the importance of the incident commander knowing when to step away. You’ll naturally want to see every step of the incident through, but trying to power through when exhausted can have diminishing returns. For the sake of the incident, and your own health, it’s important to take breaks. During that time, hand off the command role to someone else. Matt suggests the incident lead is a good option. No one expects the first aid person to stay at the patient’s bedside all night.&lt;/p&gt;

&lt;p&gt;We hope you’re enjoying our look into the real experiences of engineers in SRE: From Theory to Practice. You can look forward to new episodes coming soon. If you have an SRE topic that you’d like to see covered by a panel, please let us know on &lt;a href="https://twitter.com/blamelesshq"&gt;Twitter&lt;/a&gt; or on our &lt;a href="https://www.blameless.com/slack-community"&gt;Slack community channel&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>SRE: From Theory to Practice | What's difficult about on-call?</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Mon, 20 Jun 2022 19:29:21 +0000</pubDate>
      <link>https://dev.to/blameless/sre-from-theory-to-practice-whats-difficult-about-on-call-4o6k</link>
      <guid>https://dev.to/blameless/sre-from-theory-to-practice-whats-difficult-about-on-call-4o6k</guid>
      <description>&lt;p&gt;We launched the first episode of a webinar series to tackle one of the major challenges facing organizations: on-call. &lt;a href="https://info.blameless.com/whats-difficult-about-on-call"&gt;SRE: From Theory to Practice - What’s difficult about on-call&lt;/a&gt; sees Blameless engineers Kurt Andersen and Matt Davis joined by Yvonne Lam, staff software engineer at Kong, and Charles Cary, CEO of Shoreline, for a fireside chat about everything on-call. &lt;/p&gt;

&lt;p&gt;As software becomes more ubiquitous and necessary in our lives, our standards for reliability grow alongside it. It’s no longer acceptable for an app to go down for days, or even hours. But incidents are inevitable in such complex systems, and automated incident response can’t handle every problem.&lt;/p&gt;

&lt;p&gt;This set of expectations and challenges means having engineers on-call and ready to fix problems 24/7 is a standard practice in tech companies. Although necessary, on-call comes with its own set of challenges. Here are the results of a survey we ran asking on-call engineers what they find most difficult about the practice:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aJv3Udso--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vcui24kwi7by5suplsnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aJv3Udso--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vcui24kwi7by5suplsnl.png" alt="A bar graph showing the most difficult factors for on-call based on a survey" width="880" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These results indicate that on-call engineers primarily struggle with the absence of practical resources, like runbooks and role management. However, to solve a practical problem, taking a holistic approach to the systems behind that problem is often necessary.&lt;/p&gt;

&lt;p&gt;When we see these challenges in the world of SRE, we want to dive into the challenges behind the challenges, building a holistic approach to improvement. Each episode of this new webinar series will tackle a different set of challenges in the world of SRE from this perspective. We think that honest and open conversations between experts are an enlightening and empathetic way to improve these practices.&lt;/p&gt;

&lt;p&gt;As the title suggests, this webinar bridges theoretical discussion of how on-call ought to be with practical implementation advice. If you don’t have the time to watch, here are three key takeaways. I’ll be continuing a series of wrap-up blog posts alongside each episode, so you can keep up with the conversation no matter what format you prefer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Internal on-call is just as important as external
&lt;/h2&gt;

&lt;p&gt;Yvonne works mostly on internal tools for Kong. As a result, the incidents she’s alerted to aren’t the typical ones that directly impact customers, like a service being down. Instead, her on-call shifts are spent fighting fires that prevent teams from integrating and deploying new code. Ultimately, this is just as customer-impacting as an outage: when internal tools fail, teams can’t quickly deploy fixes for customer-facing problems. It can be even worse – if these tools are shared between teams, an internal outage can be totally debilitating.&lt;/p&gt;

&lt;p&gt;Yvonne sharing her experiences with these challenges kicked off a discussion of the importance of internal on-call. She discussed how sometimes these internal issues can be uniquely hard to pinpoint. Rather than having a suite of observability tools reporting on the customer experience, internal monitoring is often just engineers vaguely reporting that a tool “seems slow”. Internal issues need some of the same structures that help with external outages, adapted to their unique aspects.&lt;/p&gt;

&lt;p&gt;To achieve this mix, it’s important to have some universal standards of impact. SRE advocates tools like &lt;a href="https://www.blameless.com/sre/slis-understand-users-needs"&gt;SLIs and SLOs&lt;/a&gt; to measure incidents in what ultimately matters most: if customers are satisfied with their experience. You can apply this thinking to internal issues too. Think of your engineers as “internal customers” and build their “user journeys” in terms of what tools they rely on, the impact of those tools failing on deploying code, etc. This will help you build internal on-call systems that reflect the importance of internal system reliability, alongside the resources to support it, like runbooks. Investing in these resources strategically requires understanding the positive impact they’d have.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Think of your engineers as “internal customers” and build their “user journeys” in terms of what tools they rely on, the impact of those tools failing on deploying code, etc.&lt;/p&gt;
&lt;/blockquote&gt;
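&lt;p&gt;To make the “internal customer” framing concrete, here’s a minimal sketch of an SLI/SLO check for an internal deploy pipeline. The data, the 10-minute cutoff, and the 95% objective are illustrative assumptions, not numbers from the webinar:&lt;/p&gt;

```python
# Hypothetical SLI for an internal deploy pipeline.
# Each record is one deploy attempt: (succeeded, duration_seconds).
deploys = [
    (True, 240), (True, 310), (False, 900),
    (True, 280), (True, 1500), (True, 260),
]

# SLI: fraction of deploys that succeed AND finish within 10 minutes.
good = sum(1 for ok, secs in deploys if ok and secs <= 600)
sli = good / len(deploys)

SLO = 0.95  # target: 95% of deploys are "good"

# How much of the error budget has been spent (>1.0 means exhausted).
error_budget_used = (1 - sli) / (1 - SLO)

print(f"SLI: {sli:.2%}, budget used: {error_budget_used:.1f}x")
```

&lt;p&gt;The same shape of calculation works for any internal “user journey” – CI queue time, artifact build success, staging availability – once you’ve decided what counts as a good experience for the engineers relying on it.&lt;/p&gt;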

&lt;h2&gt;
  
  
  Assessing customer impact is hard to learn
&lt;/h2&gt;

&lt;p&gt;We’ve discussed the importance of a &lt;a href="https://www.blameless.com/sre/reliability-for-non-engineering-teams"&gt;universal language&lt;/a&gt; that reflects customer happiness, but how do you learn that language? Charles discussed the challenge in building up this intuition. Customer impact is a complex metric with many factors – how much the incident affects a service, how necessary that service is to customer experiences, how many customers use that service, how important those customers are in terms of business… the list goes on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blameless.com/incident-response/incident-classification"&gt;Incident classification&lt;/a&gt; systems and SLO impact can help a lot in judging incident severity, but there will always be incidents that fall outside of expectations and patterns. All of our participants related to experiences where they just “knew” that an incident was a bigger deal than the metrics said. Likewise, they could remember times that following the recommended runbook for an incident to a T would have caused further issues. Charles gave an example of knowing that restarting a service, although potentially a fix for the incident, could also cause data loss, and needing to assess the risk and reward.&lt;/p&gt;

&lt;p&gt;Ultimately, the group agreed that some things can’t be taught directly, but have to be built from experience – working on-call shifts and learning from more experienced engineers. The most important lesson: when you don’t know what to do and the runbook is unclear, get help! We often think of incident severity as what dictates how you escalate, but what if you don’t know the severity? Matt emphasized the importance of having a psychologically safe space, where people feel comfortable alerting other people whenever they feel unsure. Escalation shouldn’t feel like giving up on the problem, but a tool that helps provide the most effective solution.&lt;/p&gt;

&lt;p&gt;The group discussed some of the nuances of escalation. Escalation shouldn’t be thought of as a linear hierarchy, but a process where the best person for a task is called in. Find the people who are just “one hop away” from you, where you can call them in to handle something you don’t have the capacity for. Incident management is a complex process with many different roles and duties; you shouldn’t need to handle everything on your own. The person you call won’t always be the person who's the most expert in the subject area. Sometimes someone you have a personal relationship with, like a mentor, will be the most useful to call upon. The social dynamics we all have as humans can’t be ignored in situations like on-call, and can even be a strength.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Escalation shouldn’t feel like giving up on the problem, but a tool that helps provide the most effective solution.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Lower the cost of being wrong
&lt;/h2&gt;

&lt;p&gt;Social dynamics come into play a lot with on-call. As we discussed, there can be a lot of hesitation when it comes to escalating. People naturally want to be the one that solves the problem, the hero. They might see escalating as akin to admitting defeat. If they escalate to an expert, they might feel embarrassed that the expert will judge their efforts so far as being “wrong”, and might defer escalating to avoid that feeling of wrongness.&lt;/p&gt;

&lt;p&gt;To counteract this, Yvonne summarized wonderfully: “you have to lower the cost of being wrong”. Promote a blameless culture, where everyone’s best intentions are assumed. This will make people feel safe from judgment when escalating or experimenting. Matt focused on the idea of incidents as learning opportunities, unique chances to see the faults in the inner workings of your system. The more people fear being wrong, the more they pass up exploring this opportunity and finding valuable insights.&lt;/p&gt;

&lt;p&gt;The fear of being wrong can also lead to what Kurt described as “the mean time to innocence factor” – when an incident occurs, each team races to prove that they weren’t at fault and bear no responsibility for solving the problem. Escaping the challenge of solving the problem is a very understandable human desire, but this game of incident hot potato needs to be avoided. Again, lower the cost of being wrong to keep people at the table: it doesn’t matter if your code caused the crash, what matters is that the service is restored and a lesson is learned.&lt;/p&gt;

&lt;p&gt;The group also discussed getting more developers and other stakeholders to work on-call for their own projects. Development choices will always have some ramifications that you can’t really understand until you experience them firsthand. Developer on-call builds this understanding and empathy between developers and operations teams. Once again, lowering the cost of being wrong makes on-call more approachable. It shouldn’t be a dreadful, intimidating experience, but a chance to learn and grow, something to be embraced.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The more people fear being wrong, the more they pass up exploring this opportunity and finding valuable insights.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We hope you enjoyed the first episode of SRE: From Theory to Practice. Please look forward to more episodes dealing with other challenges in SRE, and the challenges behind the challenges. &lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>Canary Deployments | The Benefits of an Iterative Approach</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Mon, 24 Jan 2022 22:09:33 +0000</pubDate>
      <link>https://dev.to/blameless/canary-deployments-the-benefits-of-an-iterative-approach-4flc</link>
      <guid>https://dev.to/blameless/canary-deployments-the-benefits-of-an-iterative-approach-4flc</guid>
      <description>&lt;p&gt;At Blameless, we want to embrace all the benefits of the SRE best practices we preach. We’re proud to announce that we’ve started using a new system of feature flagging with canaried and iterative rollouts. This is a system where new releases are broken down and flagged based on the features each part of the release implements. Then, an increasing subset of users are given access to an increasing number of features. By avoiding big changes for big groups, we reduce the chances of major outages and provide a more reliable product faster.&lt;/p&gt;

&lt;p&gt;Of course, switching to this system comes with challenges and decisions to make. In this blog post, we’ll share what we’ve learned about canarying and flagging best practices. We’ll look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why you should consider an iterative canarying approach to releases&lt;/li&gt;
&lt;li&gt;Knowing when it’s safe to expand and iterate&lt;/li&gt;
&lt;li&gt;Understanding how users rely on your services to find the ideal groups to canary&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why do iterative canarying releases?
&lt;/h2&gt;

&lt;p&gt;Iterative canarying is more involved than traditional big releases. You need to look at the code being deployed and flag everything that comprises each new feature. You’ll also need to tag groups of users. Finally, instead of one big release, you do several smaller releases where more groups of users receive more features each time. Flagging features and making user groups creates overhead, and each release will take a bit of additional time. However, the benefits of this system are worth it. Here are a few to consider:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More reliable services.&lt;/strong&gt; Perhaps the biggest benefit of this approach is improved reliability for your service. Changes in production are a common source of incidents and outages. With small, iterative deployments, you’ll know that things can only go wrong for a subset of your services. Likewise, canarying means that only a subset of your users will be affected. By preventing major outages, you’ll greatly improve the perceived reliability of your service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous feedback.&lt;/strong&gt; By iterating through new features, you have many more opportunities to hear feedback from users. You’ll be able to tell what should be improved before you commit to the entire feature set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More manageable operations.&lt;/strong&gt; By reducing the scope and spacing out each release, you give operations teams the opportunity to ensure they’re ready to support each feature.&lt;/p&gt;

&lt;p&gt;This approach does come with challenges. As this approach only deals with the deployment of code, you don’t have to retroactively change your code to be modular or feature-flagged. However, you need to build new practices for code going forward. Developers need to invest time and energy building new habits for development. They need to instill a mindset of everything being modular and iterative. Switching gears like this can initially cause development to slow, but the payoff of better releases is worth it.&lt;/p&gt;
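&lt;p&gt;The flag-and-group mechanics described above can be sketched in a few lines. The feature and cohort names here are hypothetical, and a real system would use a feature-flag service rather than an in-memory dict:&lt;/p&gt;

```python
# Minimal feature-flag canary sketch: each flag maps to the set of
# user cohorts that currently receive the feature.
flags = {
    "new-dashboard": {"internal", "beta"},
    "fast-search": {"internal"},
}

def is_enabled(feature: str, user_groups: set) -> bool:
    """A user sees a feature if any of their cohorts is in the rollout."""
    return bool(flags.get(feature, set()) & user_groups)

def expand_rollout(feature: str, group: str) -> None:
    """One iteration step: add another cohort to a feature's rollout."""
    flags.setdefault(feature, set()).add(group)

# A beta user sees the dashboard but not fast-search...
print(is_enabled("new-dashboard", {"beta"}))   # True
print(is_enabled("fast-search", {"beta"}))     # False
# ...until the next iteration expands the canary.
expand_rollout("fast-search", "beta")
print(is_enabled("fast-search", {"beta"}))     # True
```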

&lt;h2&gt;
  
  
  Balancing canarying and iteration safely
&lt;/h2&gt;

&lt;p&gt;Our release system has essentially two dimensions: the amount of users accessing new features, the canary size; and the new features they have access to, the iteration. Our goal is to expand both of these until they encompass all users and all features without ever making leaps large enough to jeopardize the reliability of each change. How do you find this balance and cadence?&lt;/p&gt;

&lt;p&gt;First, you need to build a release roadmap. This is a project that product, development, and operations teams should share. It outlines which features should be included in each iteration, and which canary groups should receive them. It should also contain an aspirational timeline for each stage. However, this timeline shouldn’t be written in stone. You’ll need to adjust your rollout speed based on how each iteration is performing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ymuV0pCo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ktgjo3mrf9ihfema530d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ymuV0pCo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ktgjo3mrf9ihfema530d.png" alt="Image description" width="880" height="420"&gt;&lt;/a&gt;&lt;br&gt;
Graph of Phased Rollout Approach&lt;/p&gt;

&lt;p&gt;The key is monitoring data with feature flagging. You need to be able to see how each feature is individually performing. Whether or not a given feature should be rolled out can depend more on the performance of specific other features, rather than the overall health of the system. Blameless uses &lt;a href="https://www.blameless.com/sre/how-to-choose-monitoring-tools"&gt;monitoring tools&lt;/a&gt; such as &lt;a href="https://www.sumologic.com/"&gt;Sumologic&lt;/a&gt; to parse the information our system outputs. It allows us to break down which features are causing issues or unreliability, and which are stable enough to be built upon.&lt;/p&gt;
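&lt;p&gt;A rollout gate along these lines might look like the following sketch, where expanding one feature depends on the stability of the features it builds on. The names, error rates, and threshold are illustrative, not Blameless’s actual monitoring queries:&lt;/p&gt;

```python
# Sketch of a per-feature rollout gate: expand a feature's canary only
# if its own error rate, and the error rates of features it builds on,
# are below a threshold.
error_rates = {            # errors per 1,000 requests, from monitoring
    "auth-refresh": 0.4,
    "new-dashboard": 2.1,
    "fast-search": 0.2,
}
depends_on = {"fast-search": ["new-dashboard"]}

THRESHOLD = 1.0

def safe_to_expand(feature: str) -> bool:
    """Gate on the feature itself plus everything it depends on."""
    checks = [feature] + depends_on.get(feature, [])
    return all(error_rates[f] < THRESHOLD for f in checks)

print(safe_to_expand("auth-refresh"))  # True
print(safe_to_expand("fast-search"))   # False: built on an unstable feature
```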

&lt;p&gt;Once you know a given iteration is safe, you can roll it out to the designated groups. The modular setup gives an extra layer of protection, as individual features can be rolled back without impacting the entire system. Don’t depend on this going off without a hitch, though. As with any other backup system, you need to simulate rolling back an iteration to understand what your options actually are.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the right canary groups
&lt;/h2&gt;

&lt;p&gt;Another best practice for canarying releases is to use customized specific canary groups. Intuitively, you might just break your users down into indiscriminate chunks — maybe 10 groups of 10% each. This works fine to get many of the benefits of canarying, but you can get even more insights with tailor-made canary groups.&lt;/p&gt;

&lt;p&gt;To do this, you first need to understand how your users interact with your service. Blameless uses tools such as &lt;a href="https://www.pendo.io/"&gt;Pendo&lt;/a&gt; to see how much each user relies on each feature. This is supplemented by meeting with customer success teams, who can relay reports from customers on what matters most to them. Creating things like &lt;a href="https://www.blameless.com/sre/slis-understand-users-needs"&gt;user journeys and SLIs&lt;/a&gt; can quantify this importance.&lt;/p&gt;

&lt;p&gt;Once you have profiles for your users, create groups for each iteration based on the features in that iteration. Some qualities that you’d like your ideal canary group to have include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;They use the feature.&lt;/strong&gt; If you roll out an updated feature to a group of users that don’t even notice, you won’t get the feedback you need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They don’t use the feature too much.&lt;/strong&gt; On the other hand, updates are more likely to have outages during these canarying phases. Avoid users that wholly rely on a feature to keep them safe from this risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They provide feedback.&lt;/strong&gt; Some users are more inclined to discuss what they think of new updates. These communicative users are ideal canaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, you won’t necessarily find a perfect set of users for every feature. The important thing is considering these things when building canary groups and choosing the best candidates you have.&lt;/p&gt;
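&lt;p&gt;One hypothetical way to turn those three qualities into a ranking – the customer names and scoring weights here are invented for illustration:&lt;/p&gt;

```python
# Sketch of picking a canary group using the three qualities above.
# Each profile: (feature usage rate 0-1, gives feedback?).
users = {
    "acme":     (0.05, False),  # barely uses the feature
    "globex":   (0.45, True),   # moderate use, communicative
    "initech":  (0.98, True),   # wholly relies on the feature
    "umbrella": (0.50, False),
}

def canary_score(usage: float, feedback: bool) -> float:
    # "Uses the feature, but not too much": score peaks at moderate usage.
    moderation = 1 - abs(usage - 0.5) * 2
    # "Provides feedback": communicative users get a bonus.
    return moderation + (0.5 if feedback else 0.0)

ranked = sorted(users, key=lambda u: canary_score(*users[u]), reverse=True)
print(ranked[0])  # the best canary candidate
```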

&lt;h2&gt;
  
  
  What you’ll need
&lt;/h2&gt;

&lt;p&gt;To make these frequent, iterative, and specifically targeted deployments, you’ll need to have a strong deployment system first. Practices like &lt;a href="https://www.blameless.com/devops/devops-ci-cd"&gt;CI/CD&lt;/a&gt; are necessary for this speed and flexibility.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Universal Language: Reliability for Non-Engineering Teams</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Tue, 18 Jan 2022 16:31:08 +0000</pubDate>
      <link>https://dev.to/blameless/the-universal-language-reliability-for-non-engineering-teams-g4b</link>
      <guid>https://dev.to/blameless/the-universal-language-reliability-for-non-engineering-teams-g4b</guid>
      <description>&lt;p&gt;We talk about reliability a lot from the context of software engineering. We ask questions about service availability, or how important it is for specific users. But when organizations face outages, it becomes immediately obvious that the reliability of an online service or application is something that impacts the entire business with &lt;a href="https://surfingcomplexity.blog/2021/11/24/how-much-did-that-outage-cost/"&gt;significant costs&lt;/a&gt;. A mindset of putting reliability first is a business imperative that all teams should share.&lt;/p&gt;

&lt;p&gt;But what does reliability mean for people outside of engineering? And how does it translate into best practices for other teams? In this blog post, we’ll investigate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What reliability means&lt;/li&gt;
&lt;li&gt;Business benefits for adopting a reliability mindset&lt;/li&gt;
&lt;li&gt;How specific business functions adopt a reliability mindset&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What does reliability mean?
&lt;/h2&gt;

&lt;p&gt;Reliability is often something people only think about when it isn’t there. This perspective explains what reliability means to your users. &lt;a href="https://www.blameless.com/sre/availability-maintainability-reliability-whats-the-difference"&gt;It isn’t just about the availability of services&lt;/a&gt;, but how important those services are to users. Let’s compare two different incidents. One causes lag briefly for a service that everyone, including your most valuable customers, uses. The other causes a total outage for an hour, but just for a service that a very small percentage of users, who are generally on a low-tier subscription, ever access. Which one causes more damage to users’ perception of reliability? Being able to answer this question is key to building a reliability mindset and framework to make important decisions.&lt;/p&gt;

&lt;p&gt;A mindset of putting reliability first should be fundamental to how the organization makes decisions. Ultimately, having users access your services without frustration is what keeps them around, and keeping users around is what keeps your organization around. This is why we call reliability “feature #1”: no matter how impressive your other features are, it doesn’t matter if users can’t reliably use them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The business benefits of a reliability mindset
&lt;/h2&gt;

&lt;p&gt;Since it’s fundamental to the entire business, engineers shouldn’t be alone when thinking and planning for reliability. Having reliability inform the decisions of every team cultivates what Google describes as the “strategic” and “visionary” phases of &lt;a href="https://cloud.google.com/blog/products/devops-sre/the-five-phases-of-organizational-reliability"&gt;the reliability spectrum&lt;/a&gt;. These phases are relative to the capabilities of each company. It isn’t about checking off certain milestones as much as building something that works for all of your teams. Let’s look at the benefits of this shared standard of reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  A universal language of what matters
&lt;/h3&gt;

&lt;p&gt;Different teams within an organization can have very different perspectives and priorities. Picture a product team hoping to release an important new feature as soon as possible, an operations team looking to reconfigure the deployment process, and customer success driving to change the development roadmap in response to customer feedback. These goals can conflict, and each team’s perspective and desire for their project to take priority is valid. How do you decide?&lt;/p&gt;

&lt;p&gt;All three teams have a valid claim that their priority is essential to user happiness. The cohorts of users that each project focuses on are different — one is focused on current customers, one is focused on the dev teams (who are internal users), and one is focused on prospective customers. You can’t simply compare the number of “users” affected.&lt;/p&gt;

&lt;p&gt;A reliability mindset provides clarity and a way to make decisions by combining factors, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The types of user experience affected&lt;/li&gt;
&lt;li&gt;Frequency of affected user experiences &lt;/li&gt;
&lt;li&gt;Importance of affected aspects of users’ experience &lt;/li&gt;
&lt;li&gt;How detrimental to their experience a given change could be&lt;/li&gt;
&lt;li&gt;Business importance of keeping the affected user cohorts satisfied&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your ultimate goal is a metric that expresses how much user satisfaction could change based on a decision. This metric would apply not just to code changes, but decisions made by any team. It creates a universal, cross-team language with which to discuss user satisfaction levels and therefore business impact.&lt;/p&gt;
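&lt;p&gt;As a rough illustration, the factors above could be combined into a single score. The multiplicative form and the example values are an assumption for this sketch, not a formula from the post:&lt;/p&gt;

```python
# Sketch: combine the listed factors into one impact score for a
# proposed change. Each input is normalized to 0-1.
def impact_score(frequency, importance, severity, cohort_value):
    """frequency: how often the affected experience occurs;
    importance: how much it matters to users' experience;
    severity: how badly the change could degrade it;
    cohort_value: business importance of the affected cohort."""
    return frequency * importance * severity * cohort_value

# A risky change to a core, frequently used experience for key customers:
core = impact_score(frequency=0.9, importance=0.9, severity=0.7, cohort_value=0.9)
# The same code-level risk on a rarely used feature for low-tier users:
niche = impact_score(frequency=0.1, importance=0.4, severity=0.7, cohort_value=0.2)
print(core > niche)  # True: identical code risk, very different impact
```

&lt;p&gt;Whatever form the score takes, the point is that it applies to any team’s decision – a marketing launch that spikes traffic can be scored the same way as a code change.&lt;/p&gt;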

&lt;p&gt;Of course, determining these factors for each decision isn’t trivial. It requires extensive research into how different user cohorts engage with your service, with continuous discussion and revision. This isn’t a downside of reliability focus, however: it’s one of its biggest strengths. Having an ongoing org-wide discussion of what matters to customers is one of the best ways to break down silos and spread knowledge and insight. Getting a complete view of the customer experience comes from many teams — support, sales, success — that can be consolidated under the metric of reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using reliability metrics to hit the brakes or the gas pedal
&lt;/h2&gt;

&lt;p&gt;Let’s look at our three distinct functional projects again. If you’ve established a working definition of reliability, you’ll be in a much better position to prioritize each project based on impact and overarching goals. However, this might not be as simple as just seeing which one makes the biggest positive impact, or has the lowest potential negative impact. You should set a baseline of acceptable reliability. Consider how satisfied the cohorts of users are with service reliability thus far.&lt;/p&gt;

&lt;p&gt;Improving the reliability of a service that users are happy with may not be appreciated or even noticed. Trying to indefinitely improve reliability has rapidly mounting costs and diminishing returns. What’s important is maintaining service reliability at a level that doesn’t cause user pain or friction. When looking at reliability metrics, the focus should be on keeping them at or slightly above that point, rather than improving them as much as possible. At some point, marginal improvement won’t make any positive impact on user satisfaction. Time and energy improving past this point is better spent elsewhere.&lt;/p&gt;

&lt;p&gt;It’s important not just to agree on what reliability looks like for different user experiences, but to also determine the point at which each user experience becomes too unreliable: your &lt;a href="https://www.blameless.com/resources/slos-what-why-and-how"&gt;service level objective&lt;/a&gt;. How each user experience is doing compared to that agreed-upon objective determines how to prioritize projects across teams.&lt;/p&gt;

&lt;p&gt;Returning to our previous scenario, let’s say we know the following about each project’s current user experiences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users are demanding the new feature, but their continued use isn’t dependent on it coming out immediately&lt;/li&gt;
&lt;li&gt;Customer feedback provided to the customer success team for the roadmap change isn’t based on current pain, and can be postponed&lt;/li&gt;
&lt;li&gt;The current deployment process creates a delay of at least a day for each deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first two projects will certainly increase customer satisfaction more than a backend change that customers aren’t aware of. However, not implementing those projects right away won’t cause customer unhappiness to the extent that they leave the service. On the other hand, leaving the deployment process as-is could easily lead to scenarios where customers experience pain. Having an agreed-upon view of acceptable reliability allows you to justifiably prioritize operations’ projects.&lt;/p&gt;

&lt;p&gt;By looking at how close each user experience is to unacceptable reliability, you can judge for any given project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If users are happy with the experience, projects that risk an acceptable decrease in reliability can be safe&lt;/li&gt;
&lt;li&gt;If users are unhappy with the experience, projects that improve the reliability of the experience have to be prioritized&lt;/li&gt;
&lt;li&gt;If users’ happiness with the experience is only slightly above the acceptable level, projects dealing with that user experience can be safely deprioritized&lt;/li&gt;
&lt;/ul&gt;
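&lt;p&gt;As a rough sketch, the three rules above can be expressed as a triage function that compares each user experience’s measured reliability to its agreed-upon acceptable level. The names, numbers, and comfort margin here are illustrative assumptions, not values from any real service:&lt;/p&gt;

```python
# Illustrative triage of projects by how close each user experience is
# to its agreed-upon acceptable reliability level. All names and
# thresholds here are hypothetical examples.

def triage(measured_reliability: float, slo_target: float,
           comfort_margin: float = 0.0005) -> str:
    """Classify a user experience relative to its acceptable level.

    Both arguments are success fractions, e.g. 0.999 for 99.9%.
    """
    headroom = measured_reliability - slo_target
    if headroom < 0:
        # Below the acceptable level: reliability work comes first.
        return "prioritize reliability projects"
    if headroom < comfort_margin:
        # Just above acceptable: hold steady, avoid risky changes.
        return "deprioritize risky projects"
    # Comfortably above acceptable: some reliability risk is affordable.
    return "risky projects are safe"

# Hypothetical user experiences: (measured reliability, acceptable level).
experiences = {
    "checkout": (0.9985, 0.999),
    "search":   (0.9992, 0.999),
    "reports":  (0.9999, 0.999),
}

for name, (measured, target) in experiences.items():
    print(f"{name}: {triage(measured, target)}")
```

&lt;p&gt;The comfort margin is itself a judgment call: it encodes how much headroom above the acceptable level your teams want before taking on reliability risk.&lt;/p&gt;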

&lt;p&gt;Beyond engineering projects, this mentality can apply to marketing campaigns, hiring choices, design roadmaps, and more. It allows for a big picture perspective that keeps everyone happy while still pushing onwards as effectively as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost-benefit of reliability investments
&lt;/h2&gt;

&lt;p&gt;Having an agreed-upon framework for what reliability means to your customers is critical for quantifying and weighing decisions. It also lets you better plan for an investment’s payback. This applies to any investment an organization makes, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New hires&lt;/li&gt;
&lt;li&gt;Infrastructure tools&lt;/li&gt;
&lt;li&gt;New policies or procedures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one will have some initial costs, with the expectation that it will ultimately create more value than that cost. &lt;em&gt;But how do you know how much is being lost and how much you could stand to gain?&lt;/em&gt; Thinking of it in purely financial terms is too narrow. You can’t put a hard dollar value on many of the returns.&lt;/p&gt;

&lt;p&gt;The answer is, unsurprisingly at this point, to think about how to improve and maintain user satisfaction through the lens of reliability. Although a hard dollar value might be difficult to calculate, it is possible to consider the impact on user experiences after an investment is implemented. Likewise, you can think of how that experience could be impacted by the cost of the investment. Potential challenges caused by the investment could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Issues caused by teams getting up to speed on a new policy&lt;/li&gt;
&lt;li&gt;Reprioritizing other projects to implement a tool&lt;/li&gt;
&lt;li&gt;The loss of resources that could otherwise have been spent on a user experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you have a thorough cost-benefit investment plan, you can use the same perspective of aiming to maintain an acceptable reliability. An investment can look like it has a big payoff for only a small cost, but if that small cost pushes a user experience over the line and causes user pain, it still may not be worth it.&lt;/p&gt;

&lt;p&gt;For example, a customer success team could determine that investing in a new process for receiving feedback is a worthy investment. Initially, it would cause some frustration among customers who were used to the previous feedback template, but in the long run it would allow for much faster feedback. However, if you know that the user cohort that relies on that template is unhappy and on the verge of leaving the service, that initial cost may not be acceptable at the time. You may need to make that cohort happier with the service before moving ahead with the investment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reliability for specific teams
&lt;/h2&gt;

&lt;p&gt;How different business functions think about reliability depends on whether they affect user satisfaction directly or indirectly. For revenue teams, like marketing and sales, decisions won’t directly impact user experiences for the product.&lt;/p&gt;

&lt;p&gt;However, revenue teams create an expectation in the market for what the experience will be. Because reliability is based on user perception, raised expectations can shift the point at which users are satisfied. On the other hand, lowering expectations will also lower interest in the product. Like any other team, revenue teams have to achieve their goals without sacrificing reliability.&lt;/p&gt;

&lt;p&gt;Finding the connection points between reliability and user satisfaction across all functions is critical to driving top-line decisions from the executive team down. This allows you to understand the impact of any decision on the goals of every team.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tKjF_g68--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ufzp8bun5c5x4fzpxnb.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tKjF_g68--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ufzp8bun5c5x4fzpxnb.PNG" alt="How Reliability Works for Different Teams" width="880" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reliability is a team sport. It’s so fundamental to business success that it can’t just be the focus and responsibility of engineers. Instead, every team needs to align on what reliability means and how to prioritize based on it. In upcoming articles, we’ll look at how specific teams can better prioritize around reliability metrics and how they connect to other teams and the wider business goals.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>Building an SRE Team with Specialization</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Tue, 11 Jan 2022 19:15:20 +0000</pubDate>
      <link>https://dev.to/emilyarnott/building-an-sre-team-with-specialization-1kjb</link>
      <guid>https://dev.to/emilyarnott/building-an-sre-team-with-specialization-1kjb</guid>
      <description>&lt;p&gt;As organizations progress in their reliability journey, they may build a dedicated team of site reliability engineers. This team can be structured in two major ways: a distributed model, where SREs are embedded in each project team, providing guidance and support for that team; and a centralized model, where one team provides infrastructure and processes for the entire organization. Most structures will be some combination of these ideas, with some SREs focusing on specific projects and other SRE projects completed as an SRE team.&lt;/p&gt;

&lt;p&gt;When looking at centralized models of SRE teams, there are further distinctions to make based on the role of each SRE. One perspective says every SRE should be a generalist, capable of performing every duty of the role. This has the advantage of being very robust - if each SRE can do any given job, any person’s absence won’t cause an issue. On the other hand, you could run into a “jack of all trades, master of none” issue, where your potential is limited. This is where the specialization perspective can help.&lt;/p&gt;

&lt;p&gt;In this blog post, we’ll look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The advantages of an SRE team where each member is a specialist&lt;/li&gt;
&lt;li&gt;Some SRE specialist roles and how they help&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why specialize in SRE?
&lt;/h2&gt;

&lt;p&gt;The SRE role is extremely diverse. An SRE may be tasked with contributing to the code base of the service, writing policies and procedures for development practices, spreading cultural values, and everything in between. Even if tasks aren’t in the official SRE job description, SREs are often the ones who pick up &lt;a href="https://www.blameless.com/sre/the-most-underappreciated-skill-for-sres"&gt;glue work&lt;/a&gt;. This is work that isn’t technically anyone’s job, but is necessary for work to proceed - &lt;a href="https://noidea.dog/glue"&gt;gluing everyone else’s efforts together&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finding one person who’s an expert in and enthusiastic about all these different responsibilities is difficult. The roles an SRE can take on often appear to be polar opposites. Sometimes an SRE is the reliability guardian, reining in teams to make sure they don’t breach SLOs. At other times, an SRE is the champion of failure, encouraging teams to take risks as long as they’re ready to learn from them.&lt;/p&gt;

&lt;p&gt;Fortunately, SREs are well positioned to be able to specialize without losing the big picture perspective. They ultimately need to contextualize everything happening in development or operations in a way that has significance across the organization. That means that even if they focus on one aspect of the SRE job, they’ll be performing those duties aligned with the entire team.&lt;/p&gt;

&lt;p&gt;Specializing in SRE allows people to spend more of their time and energy on their strongest areas. If you have someone who’s excellent at writing code, but not so much of a public speaker, you can have them work away at infrastructure and in-house tools, and let them skip giving a values update at all-hands. Conversely, SREs can come from less technical backgrounds. If you have someone who is great at developing policy, but can’t grapple with the depths of your codebase, you can let them be a full-time educator and policy writer.&lt;/p&gt;

&lt;p&gt;This specialization mindset can even help with hiring. When building your team, you can have roles you’re hoping to fill. You can then look for people who specialize in those roles as far back as the job posting. Just remember that people will always grow and change. As Blameless SRE Jake Englund points out, SRE is a discipline that is “constantly inventing tools”, redefining its capabilities as a whole. Having role experts will naturally educate and encourage the rest of your team purely through what Jake describes as “osmosis”. &lt;/p&gt;

&lt;p&gt;Having specialized SREs should be balanced with giving people the opportunity to expand their roles and take on more functions. This will create the perfect blend of people playing to their strengths while still having the bases covered if someone’s missing. Jake also emphasizes how much this can help the entire team grow. By having experts leading learning, you end up with stronger engineers who are more satisfied with their jobs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Specialist roles in SRE teams
&lt;/h2&gt;

&lt;p&gt;Now that we’ve looked at how to set up a team of specialists and why you might want to, let’s look at some of the specialist roles you can have. Keep in mind that one person can serve parts of multiple roles, so don’t look for an exact 1:1 fit. Instead, look for people who can grow into these archetypes to get these benefits.&lt;/p&gt;

&lt;h3&gt;
  
  
  The educator
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Who they are:&lt;/strong&gt; SRE is all about building policies, processes, cultural values, and infrastructure that the whole organization can benefit from. Of course, these benefits only happen if the teams actually use them! The educator is someone who teaches and encourages teams to use SRE practices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What they do:&lt;/strong&gt; Educators can lead infosessions on new SRE practices to get people up to speed. They can also track how much practices are being used, and gather information on why things might go underutilized. If required, they can provide hands-on coaching to help people advance their abilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills they need:&lt;/strong&gt; Educators need to be able to convince people to make the investment of adopting new practices. They need to be experts on the tangible benefits of adoption, able to cite specific figures where relevant. At the same time, they need to be personable and empathetic. They need to understand the pains that can come with switching to new practices, and convey that understanding as they build those connections.&lt;/p&gt;

&lt;h3&gt;
  
  
  The SLO Guard
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Who they are:&lt;/strong&gt; One of the key tools in SRE is the &lt;a href="https://www.blameless.com/sre/service-level-objectives-slos-lessons-learned"&gt;service level objective&lt;/a&gt;, or SLO. SLOs set the point at which a service’s unreliability starts to negatively impact customers. Teams set up policies, like slowing development or emergency code freezes, to prevent SLO breaches. SLOs should be understood and monitored across the organization, but this role specializes in being an absolute defense against breaches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What they do:&lt;/strong&gt; The SLO guard makes sure that the SLO isn’t breached by building and implementing preventative policies. This isn’t the full story, though. They also need to ensure that the SLO is measuring what it needs to. This involves setting up SLO review meetings, incorporating additional monitoring tools to get more sophisticated data, and researching user expectations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills they need:&lt;/strong&gt; While discussing different SRE roles, Blameless SRE Jake Englund mentioned the value of “someone who will say no”. When everyone is enthusiastic about some new feature push, no one wants to be the dissenting voice. Telling someone that development needs to be delayed to preserve the SLO is a skill in itself, one that requires an unwavering commitment to reliability, the expertise to back up their decision, and buy-in across teams to support the plan.&lt;/p&gt;
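&lt;p&gt;As a hedged illustration of the kind of check an SLO guard might automate, the sketch below compares how much error budget has been consumed in an SLO window against how much of the window has elapsed. The function name, messages, and thresholds are hypothetical, not a standard:&lt;/p&gt;

```python
# Illustrative error-budget check an SLO guard might run. The function
# name, messages, and thresholds are hypothetical, not a standard API.

def error_budget_status(slo_target: float, window_minutes: int,
                        elapsed_minutes: int, bad_minutes: float) -> str:
    """Compare budget burned so far to the share of the window elapsed.

    An slo_target of 0.999 over a 30-day window allows ~43.2 "bad" minutes.
    """
    budget_minutes = (1.0 - slo_target) * window_minutes
    if bad_minutes >= budget_minutes:
        return "budget exhausted: freeze risky changes"
    # Burn rate > 1 means the budget is burning faster than time passes,
    # so the window is on track to end in a breach.
    burn_rate = (bad_minutes / budget_minutes) / (elapsed_minutes / window_minutes)
    if burn_rate > 1.0:
        return "burning hot: slow down"
    return "on track"

# About a week into a 30-day (43,200-minute) window, 5 bad minutes so far.
print(error_budget_status(0.999, 43_200, 10_000, 5.0))
```

&lt;p&gt;A check like this gives the “someone who will say no” a number to point to when recommending that development slow down.&lt;/p&gt;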

&lt;h3&gt;
  
  
  Infrastructure architect
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Who they are:&lt;/strong&gt; This role is focused on building SRE infrastructure that the entire organization can use. This covers many different types of projects, each with its own sub-specialization: internal tools for monitoring or resolving incidents, documentation and runbooks for procedures, processes for completing projects, or even cultural values to guide people’s decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What they do:&lt;/strong&gt; Infrastructure architects are in constant communication with other teams to see what’s needed most. Educators can serve as a conduit for these relationships, compiling what they hear into a big picture report. Once the priorities are clear and aligned among teams, the architect works away at building. Of course, these infrastructure meta-projects are developed along the same workflow and processes as any other project. Therefore, the architect is a sort of SRE-developer and needs to work closely with development teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills they need:&lt;/strong&gt; The skills needed depend greatly on the type of infrastructure being developed. In some cases, this is one of the most development-focused SRE roles, and so deep knowledge of the organization’s codebase is a must. If focused more on policy and procedure, the architect may not need coding skills, but will still need to understand how their processes will work on the level of development. Either way, this is primarily a technical role, focused on engineering solutions to specific needs. In our discussion, Jake emphasized the idea of SREs existing on a range of socialness - if educators are on the social end, architects can be on the other extreme.&lt;/p&gt;

&lt;h3&gt;
  
  
  Incident response leader
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Who they are:&lt;/strong&gt; Having processes in place to respond effectively and thoroughly to incidents is a major part of SRE. The incident response leader takes responsibility for making your organization as incident-ready as possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What they do:&lt;/strong&gt; The incident response leader plays a role before, during, and after incidents. Before incidents, they lead in setting up runbooks, on-call schedules, and other tools to help respondents. Of course, all of this is done in collaboration with the teams that will be responding. During incidents, they serve as a procedure expert that ensures teams are working effectively. If there’s disagreement over roles and responsibilities, or when to escalate, the incident response leader can serve as a point of authority to keep things moving. After the incident, the incident response leader can drive the creation of a retrospective. This document gathers the lessons of the incident and serves as a hub for followup tasks. The leader makes sure this document is created, reviewed, and acted on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Skills they need:&lt;/strong&gt; Incident response leaders need both strong people skills - to understand how people behave under pressure and empathize with their limits - and infrastructural skills - to know how the tools they build will interact with the system. They also need a strong ability to prioritize based on the bigger picture. Their world is one where everything is on fire, and they need to distinguish quickly between a big fire and a little fire. This means having a perspective that’s zoomed out to the entire organization while still able to see the little issues that each incident can bring.&lt;/p&gt;

&lt;p&gt;Having a team of specialists can be a challenge, but it leads to opportunities. Of course, you may have SREs who can embody several of these specializations; they aren’t mutually exclusive. There’s simply a tradeoff in where each person invests their time. People also have their personal interests, something we can appreciate and lean into. By allowing people to flourish in their skills without losing the robustness of shared knowledge, you’ll build the strongest possible team.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>How Disaster Ready Are Your Backup Systems, Really?</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Wed, 05 Jan 2022 18:08:45 +0000</pubDate>
      <link>https://dev.to/emilyarnott/how-disaster-ready-are-your-backup-systems-really-51em</link>
      <guid>https://dev.to/emilyarnott/how-disaster-ready-are-your-backup-systems-really-51em</guid>
      <description>&lt;p&gt;In SRE, we believe that some failure is inevitable. Complex systems receiving updates will eventually experience incidents that you can’t anticipate. What you can do is be ready to mitigate the damage of these incidents as much as possible.&lt;/p&gt;

&lt;p&gt;One facet of disaster readiness is &lt;a href="https://www.blameless.com/incident-respons"&gt;incident response&lt;/a&gt; - setting up procedures to solve the incident and restore service as quickly as possible. Another strategy involves reducing the chances for failure with tactics like &lt;a href="https://www.blameless.com/devops/minimizing-spofs-during-summer-slowdown"&gt;reducing single points of failure&lt;/a&gt;. Today, we’ll talk about a third type of readiness: having backup systems and redundancies to quickly restore function when things go very wrong.&lt;/p&gt;

&lt;p&gt;Having backup systems can give organizations peace of mind: no matter what goes wrong, you just switch to the backup system for a bit, and then everything is fine... right? But will it really go so smoothly when disaster hits? In this blog post, we’ll help you ensure that your backup systems will perform as expected when you need them most by looking at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The value of running restoration drills&lt;/li&gt;
&lt;li&gt;Thinking holistically and laterally about incidents&lt;/li&gt;
&lt;li&gt;Increasing resilience through imagining and preparing for “black swan” events&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The value of running restoration drills
&lt;/h2&gt;

&lt;p&gt;Many organizations have backups for their data and infrastructure that they can switch to when the main system fails. But what exactly is this “switch”? Consider &lt;a href="https://www.blameless.com/incident-response/on-call-team-faces-worst-case-sunday-scaries"&gt;this story&lt;/a&gt; shared with me by an engineer about a nightmare outage caused by a total wipe of their databases. The team had backup databases in place, but they needed to be decompressed before they could be used. How long would that take? They had no idea.&lt;/p&gt;

&lt;p&gt;This engineer’s story is all too common. Organizations feel secure that everything is backed up, but they can’t actually rely on those backups to be immediately available in a disaster. Also key in the engineer’s story is a lack of resources: as they didn’t have an in-house infrastructure team, they had to rely on a single inexperienced person following an out-of-date runbook.&lt;/p&gt;

&lt;p&gt;The solution to this nightmare is running regular restoration drills. Simulate a situation where everything needs to get switched from the production systems to the backups. How long does it take you? Are there any obstacles you encounter that you can remove now? Also look at what resources you’re relying on. Are individual people being consulted for advice? Are you using infrastructure to access runbooks? What if those people were missing, or that infrastructure was also down? You’ll want to prepare for these possibilities too.&lt;/p&gt;

&lt;p&gt;Once you’ve finished these drills, collectively review where improvements could be made. And then - this is the most important part - schedule the next drill. As the codebase changes and databases grow, keep making sure that backup restoration runs smoothly.&lt;/p&gt;

&lt;p&gt;Don’t become complacent with untested backups. In a recent discussion, Blameless SRE &lt;a href="https://www.linkedin.com/in/zeblith/"&gt;Jake Englund&lt;/a&gt; summarized it thusly: when it comes to having a backup policy, if you aren't testing your restore process then you can't be certain your backups are useful, and if you aren't sure that your backups are useful then they probably aren't.&lt;/p&gt;
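&lt;p&gt;One way to turn “how long would restoration take?” into a concrete, repeatable question is to wrap the restore procedure in a timed drill that checks the result against a recovery time objective (RTO). A minimal sketch, assuming your restore step can be scripted; the function names and the stand-in restore here are placeholders:&lt;/p&gt;

```python
# Minimal restore-drill harness (sketch). The restore step and the
# recovery time objective (RTO) here are placeholders for your own.
import time

def run_drill(restore_fn, rto_seconds: float) -> dict:
    """Time a scripted restore and check it against the RTO."""
    start = time.monotonic()
    restore_fn()  # your real, scripted restore procedure goes here
    elapsed = time.monotonic() - start
    return {"elapsed_s": elapsed, "met_rto": elapsed <= rto_seconds}

def fake_restore():
    # Stand-in for e.g. decompressing and loading a backup database.
    time.sleep(0.1)

result = run_drill(fake_restore, rto_seconds=1.0)
print(f"restore took {result['elapsed_s']:.2f}s, met RTO: {result['met_rto']}")
```

&lt;p&gt;Running a harness like this on a schedule keeps the answer to “how long does the switch take?” current as the codebase and databases change.&lt;/p&gt;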

&lt;h2&gt;
  
  
  Thinking holistically and laterally about incidents
&lt;/h2&gt;

&lt;p&gt;When something goes wrong, it can be tempting to think in the singular: some&lt;em&gt;thing&lt;/em&gt; goes wrong. The server goes down, a typo in code causes an error, high traffic causes latency, etc. But really, most incidents create a domino effect of other failures. When preparing for failure, it’s important to consider &lt;em&gt;every&lt;/em&gt;thing that could go wrong.&lt;/p&gt;

&lt;p&gt;Here are some types of things to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Will your typical tools for communicating also be down?&lt;/li&gt;
&lt;li&gt;Will resources like runbooks be available if your tools go down?&lt;/li&gt;
&lt;li&gt;Will the services you use to restore backups also go down?&lt;/li&gt;
&lt;li&gt;Will people you expect to be able to respond be dealing with bigger priorities in the event of a major outage? Will they be available at all?&lt;/li&gt;
&lt;li&gt;Will engineering teams be stressed, burned out, and incapable of performing to their normal standards?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each organization will have its own issues that could arise during incidents. Past incidents are your best teacher for finding them. Create &lt;a href="https://www.blameless.com/product/incident-retrospectives"&gt;incident retrospectives&lt;/a&gt; to investigate the causes and effects of incidents. Techniques like &lt;a href="https://www.blameless.com/incident-response/how-to-analyze-contributing-factors-blamelessly"&gt;contributing factor analyses&lt;/a&gt; help you uncover these interconnected issues.&lt;/p&gt;

&lt;p&gt;Once you’ve identified these issues, make sure your backup plan compensates for them. Don’t leave anything out: consider every factor, from the technical to the personal. If there’s an in-house tool you use to spin up new servers, don’t assume you’ll have it. If engineers will be panicking when something goes wrong, make sure the solution is obviously marked and easily accessed.&lt;/p&gt;

&lt;p&gt;Really think outside the box, and dig deep into your proposed solutions to uncover problems with them that could occur. An example Jake shared is relying on backup generators arriving via truck as a solution to a power outage — what if the truck gets stuck in traffic or breaks down? Don’t be content with just one solution; have a solution for if your solution breaks, and have a backup for your backup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Imagining and preparing for “black swan” events
&lt;/h2&gt;

&lt;p&gt;A &lt;a href="https://en.wikipedia.org/wiki/Black_swan_theory"&gt;“black swan” event&lt;/a&gt; is one that is nearly impossible to predict or even imagine, but causes catastrophic damage. In retrospect, it may seem obvious that the black swan event was a possibility; however, before it happens, it’s unthinkable.&lt;/p&gt;

&lt;p&gt;An example of a black swan event in tech is the recent &lt;a href="https://en.wikipedia.org/wiki/2021_Facebook_outage"&gt;Facebook outage&lt;/a&gt;. Facebook didn’t prepare for a total collapse of their DNS servers, nor could they imagine the many problems that came downstream from them - like being &lt;a href="https://futurism.com/the-byte/facebook-employees-have-reportedly-been-locked-out"&gt;unable to physically enter their offices&lt;/a&gt;. If a normal incident creates a domino effect, a black swan event can be like knocking over a house of cards.&lt;/p&gt;

&lt;p&gt;So how do you prepare for an unthinkable incident? One strategy involves getting creative. Jake shared an example from his time at Google: simulate that the entire Mountain View Google HQ has been hit by a meteor. During the practice response, stop yourself every time you try to contact someone there, access a server hosted there, or even rely on the bandwidth managed there. You can’t: it’s been hit by a meteor.&lt;/p&gt;

&lt;p&gt;Now, are your headquarters actually going to be wiped off the map by a meteor? Almost certainly not. And if they were, would the branch departments really be scrambling to restore service? No, they’d likely have bigger concerns. But by attacking this worst-case scenario, you prepare yourself for other events that you couldn’t otherwise imagine.&lt;/p&gt;

&lt;p&gt;Jake emphasizes the importance of testing “not just for what you want to test”. The point of disaster preparedness isn’t to get the results you want, but to uncover vulnerabilities and drive systemic change. Jake describes this idea as differentiating between robustness - testing for everything you know could go wrong - and resilience - “testing what you hope you won’t have to know.” Generally, Jake finds that orgs are very good at the former and very bad at the latter.&lt;/p&gt;

&lt;p&gt;Building resilience by testing for the unknown is a practice that requires iteration and reflection. There’s no one right way to do it, and no one right frequency for exploring these scenarios. The important thing is to thoroughly document your process and results. Then analyze which types of experiments are yielding insights, and build future tests around them. Keep up the practice by always scheduling the next experiment when the last one finishes.&lt;/p&gt;

&lt;p&gt;But what solution could possibly exist for these apocalyptic scenarios? Jake discussed one viewpoint, which may seem counterintuitive at first. Generally a path to maturity and growth for an organization involves first relying on third party tooling, then building more and more tools and infrastructure in-house. Major enterprise organizations may build their own communication, alerting, and tracking tools.&lt;/p&gt;

&lt;p&gt;However, black swan events suggest that there could be an even further stage of maturity: incorporating third party tools as backups. If you can’t use your tools to solve issues with your tools, you should have some other tools at the ready. Of course, like any backup system, you’d need to run drills to ensure functionality will actually be restored by the switch.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>How to Write Meaningful Retrospectives</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Mon, 13 Dec 2021 19:57:07 +0000</pubDate>
      <link>https://dev.to/blameless/how-to-write-meaningful-retrospectives-24p3</link>
      <guid>https://dev.to/blameless/how-to-write-meaningful-retrospectives-24p3</guid>
      <description>&lt;p&gt;One of the foundations of incident management in SRE practice is the incident retrospective. It documents all the learnings from an incident and serves as a checklist for follow-up actions. If we step back, there are &lt;a href="https://www.blameless.com/incident-response/incident-retrospective-postmortem-template"&gt;7 main elements to a retrospective&lt;/a&gt;. When done right, these elements help you better understand an incident, what it reveals about the system as a whole, and how to build lasting solutions. In this article, we’ll break down how to elevate these 7 elements to produce more meaningful retrospectives.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Messages to stakeholders
&lt;/h2&gt;

&lt;p&gt;Incident retrospectives can be the core of your communication with customers and other stakeholders after an incident. We talk a lot about how retrospectives function best when they involve input and feedback from all relevant stakeholders. That doesn't necessarily mean squeezing tons of folks into one meeting or sending out one long PDF to a large group without thoughtful consideration. &lt;/p&gt;

&lt;p&gt;The best example of this is distinguishing between customer stakeholders and internal team stakeholders. Customers should be kept in the loop and assured that a resolution is imminent or has already come, but they probably don't need to know (or shouldn't know) the minutiae.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blameless.com/incident-response/how-to-communicate-incident-retrospectives-to-stakeholders"&gt;Communicating retrospectives to stakeholders&lt;/a&gt; requires empathizing with how they use your services. Describe the incident in the context of what matters most. But don’t beat around the bush, either — you don’t want to come across like you’re hiding or downplaying the impact. A simple, factual statement such as “if you use service x to do y, you lost that ability for 12 hours” is enough to convey your understanding.&lt;/p&gt;

&lt;p&gt;Once you’ve established the impact, start to regain trust. Reassure stakeholders about relevant things that didn’t go wrong. In the aftermath of an incident, stakeholders could be worried that there are other problems that weren’t reported. Explicitly state that there wasn’t any data lost, or private information made public, or any other relevant concerns.&lt;/p&gt;

&lt;p&gt;Share your action plans with stakeholders too. They may not have the context to understand the details of your solution, but you can explain the impact your plan will have. Be direct to convey your confidence. Again, simple statements work great: “the outage was caused by insufficient server bandwidth. A new process will automatically expand bandwidth in response to increased load. This will prevent an incident like this in the future.” This is the language of scientific research, which removes personal pronouns from the prose. It’s a great way to keep statements simple, avoid finger-pointing, and remain factual and ideally data-driven.&lt;/p&gt;

&lt;p&gt;By expanding your message to stakeholders in this way, they’ll understand that their pain has been understood, and addressed systematically and enduringly. &lt;/p&gt;

&lt;h2&gt;
  
  
  2. Monitoring context
&lt;/h2&gt;

&lt;p&gt;In more technical retrospectives, generally for study by internal development teams, it’s useful to include any monitoring data your system captures at the time of the incident. Did the incident occur during significant traffic? Did it also lead to slowdowns in other areas of the system? This information can lead to helpful revelations.&lt;/p&gt;

&lt;p&gt;But you can go even further with this data! Include long-term baseline measurements for these metrics to provide a standard of comparison. You might notice that some metrics follow a regular pattern that accounts for apparent anomalies during the incident. Don’t mistake coincidence for causation.&lt;/p&gt;

&lt;p&gt;Also note where your monitoring data was insufficient. Can you think of any metrics that, if you were capturing them, could have tipped you off about the incident earlier? One of the main goals of the retrospective is to drive systemic change. Look for these opportunities to improve your monitoring system.&lt;/p&gt;
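&lt;p&gt;As a minimal, hypothetical sketch of this baseline comparison (the metric values and the threshold below are invented for illustration, not taken from any real monitoring system), you might flag an incident-window metric only when it deviates well beyond its long-term norm:&lt;/p&gt;

```python
# Hypothetical sketch: compare an incident-window metric to its long-term
# baseline before treating it as a contributing signal.
from statistics import mean, stdev

def is_anomalous(incident_samples, baseline_samples, z_threshold=3.0):
    """Flag the incident window only if it deviates from the baseline."""
    mu = mean(baseline_samples)
    sigma = stdev(baseline_samples)
    if sigma == 0:
        return mean(incident_samples) != mu
    z = abs(mean(incident_samples) - mu) / sigma
    return z >= z_threshold

# A regular weekly pattern may explain a spike that looks incident-related;
# here the incident window clearly exceeds the baseline's normal variation.
baseline = [100, 102, 98, 101, 99, 103, 97]  # requests/sec, normal weeks
incident = [250, 260, 255]                   # requests/sec during the incident
print(is_anomalous(incident, baseline))      # prints True
```

A check like this is one way to separate genuine anomalies from ordinary fluctuation before recording them in the retrospective.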

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QlL0zpaM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jh8tfo5pmtsr45hqe6b5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QlL0zpaM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jh8tfo5pmtsr45hqe6b5.png" alt="Graph showing an incident in the context of other data" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Communication timelines
&lt;/h2&gt;

&lt;p&gt;Hopefully you have a tool to easily build a communication timeline from Slack, MS Teams, or whatever else you use to chat. It’s important to know what steps were taken, how long they took, and when breakthroughs were made. Include information about what roles people played and what tasks they were assigned. &lt;/p&gt;

&lt;p&gt;However, it’s also important to see where miscommunication occurred. Did people do redundant work? Were some tasks or steps forgotten or skipped over? Were there misunderstandings about expectations? Note these issues &lt;strong&gt;blamelessly&lt;/strong&gt;. It’s not someone’s fault if they overlooked something; they were doing their best in a stressful situation. That’s why you need policies and procedures to cover the gaps. Investigate these issues to develop policies that would prevent them.&lt;/p&gt;

&lt;p&gt;Inevitably, your war room discussion will have some chatter. You probably want to make an “all-business” retrospective that leaves out anything irrelevant, and that’s likely the right move for retrospectives that will be seen by external stakeholders. For internal retrospectives, though, this extra expression can be valuable. It’s good to see how people were feeling during an incident, when they felt stress and relief. It can open up thinking about the human side of incident response, and makes the retrospective more fun to review later.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Contributing factors
&lt;/h2&gt;

&lt;p&gt;A big part of the retrospective is uncovering why the incident happened. Without determining that, you can’t make systemic changes to be stronger for next time. The key to making meaningful and enduring changes is to dig deep. Techniques such as the &lt;a href="https://en.wikipedia.org/wiki/Five_whys"&gt;five whys&lt;/a&gt; can help you find the causes behind causes. Illustrating it with tools like the &lt;a href="https://en.wikipedia.org/wiki/Ishikawa_diagram"&gt;Ishikawa diagram&lt;/a&gt; can make it easier to understand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JIiiAJNF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9b7fv2pvttqfjspoat1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JIiiAJNF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9b7fv2pvttqfjspoat1.png" alt="Ishikawa diagram for a retrospective" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When digging for these factors, be holistic. Don’t just think about technical issues, but dive into problems with training, headcounts, stress, personal factors in engineers’ lives — anything that could have impacted how people work on your system. Pulling management and other teams into these discussions may be necessary when reflecting on major incidents.&lt;/p&gt;

&lt;p&gt;Of course, all this investigation should be done &lt;strong&gt;blamelessly&lt;/strong&gt;. Assume everyone’s good faith and best intentions. If a mistake was made, look into what information or safeguards could have prevented it. Settling for punishing an individual will prevent you from making major systemic improvements.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Technical analysis
&lt;/h2&gt;

&lt;p&gt;This section is mostly for your engineering teams. If there are factors here that non-engineers should understand, be sure to present that information and its impact elsewhere in the retrospective report. Here, you should be detailed enough that future engineers can get useful information when resolving similar incidents.&lt;/p&gt;

&lt;p&gt;As you did with monitoring data, you should include information about how the code should work, and how it usually works. This context is important, as the intended function of the code may have changed by the time someone reviews it. You should also discuss how future development is expected to impact code in production. Knowing how the code is expected to run in production helps you recognize quickly when behavior deviates during an incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Follow-up actions
&lt;/h2&gt;

&lt;p&gt;This is one of the most important parts of the retrospective. All of your learning about why the incident happened should transform into actions. Find ways to change the factors that led to the incident. The retrospective can act as a hub for tracking these items. As you review the retrospective, check that they’re progressing.&lt;/p&gt;

&lt;p&gt;The follow-up actions don’t just have to address the direct causes of the incident. This is also an opportunity to improve your incident response policies, your tools for measuring the impact of incidents (like &lt;a href="https://www.blameless.com/sre/service-level-objectives"&gt;SLOs&lt;/a&gt;), your monitoring setup, even your retrospective standards! You can never be too holistic when solving problems.&lt;/p&gt;

&lt;p&gt;To motivate people to work on these follow-up tasks, include some context. Summarize why each action was chosen after the incident. Also discuss the impact it will have in preventing future incidents. No one wants to spend time responding to the same or similar incidents over and over. It’s not only soul-destroying for the team, quickly leading to burnout; it’s also not good for business. Include enough information that people will understand the importance without having to reread the entire retrospective.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Narrative
&lt;/h2&gt;

&lt;p&gt;Including a narrative summary of the incident is often overlooked. It won’t contain any new information, but it’s still useful. All of this information can be overwhelming, so use this part of the retrospective as a way to make the incident approachable for future study. Think about it in terms of a story. You start by describing things as they should be. Then you introduce the problem and how it disrupts the norm. Walk through the experience of the affected customers as well as the team that was tasked to solve the issue. Cover what they tried, what worked, and what they learned.&lt;/p&gt;

&lt;p&gt;Rather than details, you should focus on impact in this section. How severe was the incident, and what made it so? When studying the incident later, many details will be irrelevant to the current system. However, understanding how people responded when things went very poorly will always be a useful lesson.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Analyze Contributing Factors Blamelessly</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Mon, 06 Dec 2021 21:34:04 +0000</pubDate>
      <link>https://dev.to/blameless/how-to-analyze-contributing-factors-blamelessly-189f</link>
      <guid>https://dev.to/blameless/how-to-analyze-contributing-factors-blamelessly-189f</guid>
      <description>&lt;p&gt;SRE advocates addressing problems blamelessly. When something goes wrong, don’t try to determine who is at fault. Instead, look for systemic causes. Adopting this approach has many benefits, from the practical to the cultural. Your system will become more resilient as you learn from each failure. Your team will also feel safer when they don’t fear blame, leading to more initiative and innovation.&lt;/p&gt;

&lt;p&gt;Learning everything you can from incidents is a challenge. Understanding the benefits and best practices of analyzing contributing factors can help. In this blog post, we’ll look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A definition for root cause analysis&lt;/li&gt;
&lt;li&gt;A definition for contributing factor analysis&lt;/li&gt;
&lt;li&gt;How to choose between RCAs and contributing factor analysis&lt;/li&gt;
&lt;li&gt;Best practices for contributing factor analyses&lt;/li&gt;
&lt;li&gt;How to incorporate learning from analyses back into development&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is a root cause analysis?
&lt;/h2&gt;

&lt;p&gt;Root cause analysis, or RCA, is a method for finding the reason an incident occurred. Here it is, summarized in four steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identify the incident.&lt;/strong&gt; You should understand the exact boundary of what is and isn’t considered part of the incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create a timeline.&lt;/strong&gt; Log all events impacting the system. Start when the aberrant behavior begins and end when the system returns to normal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Judge the events for causality.&lt;/strong&gt; Consider the impact of each event leading up to the incident. Did it directly or indirectly cause the incident? Was it necessary for the incident to happen? Was it irrelevant?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build a causal diagram.&lt;/strong&gt; A causal diagram or graph is an illustrative tool. It shows how events contribute to the incident. Here is an example:&lt;/li&gt;
&lt;/ol&gt;
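&lt;p&gt;The four steps above can be sketched as data. The following is a minimal, hypothetical illustration — the event names, timestamps, and causality judgments are invented for this example — showing a timeline (step 2) whose judged causes (step 3) become the nodes of the causal diagram (step 4):&lt;/p&gt;

```python
# Minimal sketch of the RCA steps as data: a timeline of events, each judged
# for causality. All event names and judgments are invented for illustration.
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: str    # when the event occurred
    description: str  # what happened
    causality: str    # "direct", "indirect", or "irrelevant"

# Step 2: log all events from the first aberrant behavior to recovery.
timeline = [
    Event("09:00", "New feature launched", "indirect"),
    Event("09:05", "Server load exceeds capacity", "direct"),
    Event("09:06", "Unrelated cron job runs", "irrelevant"),
    Event("09:30", "Backup servers brought online", "irrelevant"),
]

# Step 3: keep only the events judged to have caused the incident; these
# become the nodes of the causal diagram in step 4.
causes = [e for e in timeline if e.causality in ("direct", "indirect")]
for e in causes:
    print(f"{e.timestamp} {e.description} ({e.causality})")
```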

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hQ5R7FfJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8mrru1509g8srvcq7gou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hQ5R7FfJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8mrru1509g8srvcq7gou.png" alt="Causal diagram showing how new server updates being deployed on Fridays coinciding with a new feature launch can cause servers to overload, resulting in an outage." width="600" height="900"&gt;&lt;/a&gt;&lt;br&gt;
Example causal diagram&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a contributing factor analysis?
&lt;/h2&gt;

&lt;p&gt;A contributing factor analysis is another methodology for examining an incident. Rather than pinpoint a single root cause of an incident, the contributing factor analysis looks for a broader range of factors. This is a more holistic approach. It considers technical, procedural, and cultural factors. For the above example of a server outage, here are some factors you may also consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The feature launch schedule doesn’t account for server update timings&lt;/li&gt;
&lt;li&gt;No policy to scale up server availability for feature launches&lt;/li&gt;
&lt;li&gt;Server architecture could be updated to support more traffic&lt;/li&gt;
&lt;li&gt;Incident response team could be overworked with new feature launch, delaying backup server availability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Contributing factor analysis should be part of a larger incident retrospective approach. Teams should try to identify contributing factors that can lead to actionable change.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you choose between an RCA and a contributing factor analysis?
&lt;/h2&gt;

&lt;p&gt;RCAs and contributing factor analyses each have their use cases. RCAs are often formally required, while contributing factor analysis is a useful internal tool. Let’s break down why.&lt;/p&gt;

&lt;h3&gt;
  
  
  When are RCAs used?
&lt;/h3&gt;

&lt;p&gt;RCAs can be part of an organization’s official response to an incident. Because they are often public-facing, they have strict guidelines for formatting. This standardization can be challenging. In a &lt;a href="https://www.blameless.com/blog/modern-operations-best-practices-from-engineering-leaders-at-new-relic-and-tenable"&gt;discussion with Blameless&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/benders/"&gt;Nic Benders&lt;/a&gt; from &lt;a href="https://newrelic.com/"&gt;New Relic&lt;/a&gt; shared his thoughts on RCAs:&lt;/p&gt;

&lt;p&gt;“The RCA process is a little bit of a bad word inside of New Relic. We see those letters most often accompanied by ‘Customer X wants an RCA.’ Engineers hate it because they are already embarrassed about the failure and now they need to write about it in a way that can pass Legal review.”&lt;/p&gt;

&lt;p&gt;Even if they’re unpleasant, RCAs can be necessary. Customers have come to expect openness around failure. &lt;a href="https://www.linkedin.com/in/khannadheeraj/"&gt;Dheeraj Khanna&lt;/a&gt; from &lt;a href="https://www.tenable.com/"&gt;Tenable&lt;/a&gt; explains:&lt;/p&gt;

&lt;p&gt;“Today, the industry has become more tolerant to accepting the fact that if you have a vendor, either a SaaS shop or otherwise, it is okay for them to have technical failures. The one caveat is that you are being very transparent to the customer. That means that you are publishing your community pages, and you have enough meat in your status page or updates.”&lt;/p&gt;

&lt;h3&gt;
  
  
  When are contributing factor analyses used?
&lt;/h3&gt;

&lt;p&gt;Contributing factor analyses help translate the causes of an incident into actionable changes. As this document is for internal use, teams can be more open about the failure and about what they can improve.&lt;/p&gt;

&lt;p&gt;Nic Benders discusses the shortcomings of RCAs in capturing these areas. “It remains challenging for me to try and find a way to address those people skills and process issues. Technology is the one lever that we pull a lot, so we put a ton of technical fixes in place. But, there are three elements to those incidents. And I worry that we're not doing a good job approaching the other two: people skills and processes.”&lt;/p&gt;

&lt;p&gt;When trying to learn the most you can from incidents, looking at all contributing factors is a must. Although you may need both types of analysis, contributing factor analyses are often more useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Best practices for blameless contributing factor analysis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Remove the value of blame.&lt;/strong&gt; While analyzing an incident, blame offers an easy answer. Holding an individual at fault removes responsibility from the system, implying that no changes to the system are necessary; the work is already done. Don’t settle for blame as a solution. By focusing on systemic causes, you can learn more and improve your system further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Look beyond individuals.&lt;/strong&gt; Humans aren't perfect. Imagine that while conducting a retrospective, the team realizes an alert was triggered but a team member ignored it. Why? It's time to dig deeper than the individual. Are alerts often noisy or irrelevant? Has this person had enough on-call training and experience? Or have they been on call for too long without a break? Asking these questions leads to meaningful lessons, and is the best way to ensure the mistake doesn’t happen again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Celebrate failure.&lt;/strong&gt; When uncovering factors, celebrate each one as an opportunity for learning. It may seem that the more factors you uncover, the more work you’ve made for yourselves. You don’t want this to discourage team members from suggesting other factors. Create a psychologically safe environment for people to brainstorm. Make sure each contribution is valued.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to feed learning from analyses back into development
&lt;/h3&gt;

&lt;p&gt;One of the key benefits of a contributing factor analysis is generating actionable insights into the system. But how do you ensure that these lessons lead to changes in development and policy? Here are some tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a central repository of required actions per incident&lt;/li&gt;
&lt;li&gt;Invite development teams to incident review meetings&lt;/li&gt;
&lt;li&gt;Bake action items into future sprints, working with product when necessary&lt;/li&gt;
&lt;li&gt;Link learning and tasks to larger initiatives for the organization&lt;/li&gt;
&lt;li&gt;Have review meetings after task completion to ensure the desired changes occurred&lt;/li&gt;
&lt;/ul&gt;
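&lt;p&gt;As a rough, hypothetical sketch of the first tip (the class name, incident ID, and action items below are invented for illustration), a central repository of required actions per incident could be as simple as:&lt;/p&gt;

```python
# Hypothetical sketch of a central repository of follow-up actions per
# incident, so review meetings can check that items are progressing.
from collections import defaultdict

class ActionRepository:
    def __init__(self):
        self._actions = defaultdict(list)  # incident id -> action items

    def add(self, incident_id, description):
        self._actions[incident_id].append(
            {"description": description, "done": False})

    def complete(self, incident_id, description):
        for item in self._actions[incident_id]:
            if item["description"] == description:
                item["done"] = True

    def open_items(self, incident_id):
        """Items still to review in the next incident review meeting."""
        return [i["description"] for i in self._actions[incident_id]
                if not i["done"]]

repo = ActionRepository()
repo.add("INC-42", "Add autoscaling policy for feature launches")
repo.add("INC-42", "Review on-call alert thresholds")
repo.complete("INC-42", "Review on-call alert thresholds")
print(repo.open_items("INC-42"))  # prints the one remaining open action
```

In practice this role is usually played by a ticketing system; the point is that action items live in one queryable place rather than scattered across documents.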

&lt;p&gt;Keep a cycle flowing between the causes of incidents and the changes you make. This will help your system continually improve in relevant ways.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>DevOps &amp; SRE Words Matter: How Our Language has Evolved</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Thu, 18 Nov 2021 20:31:15 +0000</pubDate>
      <link>https://dev.to/blameless/devops-sre-words-matter-how-our-language-has-evolved-dp5</link>
      <guid>https://dev.to/blameless/devops-sre-words-matter-how-our-language-has-evolved-dp5</guid>
<description>&lt;p&gt;As the tech world changes, language changes with it. New technologies will always introduce new terms and descriptions to provide clear understanding. For example, the emergence of the cloud introduced language to describe the changing relationship between servers and clients. Then, of course, product providers will also dictate how their products are to be described, e.g. describing services as “cloud-native”.&lt;/p&gt;

&lt;p&gt;On other occasions, language changes through deliberate effort to influence behavior. Thought leaders will often invent alternative words to describe existing ideas in order to effect cultural change. Even a slight change in diction can massively affect one’s engagement, attitude, and even one’s worldview. In this blog, we’ll look at how language colors how we perceive our environments, and we’ll break down three examples of how language has evolved in tech.&lt;/p&gt;

&lt;h2&gt;
  
  
  How language affects and shifts world perspectives
&lt;/h2&gt;

&lt;p&gt;We all have associations with language. Because of our past experiences and culture, different types of messages will trigger different emotional responses. The language we use thus &lt;a href="https://www.psychologytoday.com/ca/blog/the-biolinguistic-turn/201702/how-the-language-we-speak-affects-the-way-we-think"&gt;influences the way we think&lt;/a&gt;. Whether our associations are positive or negative can impact things such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether we dread something or get excited by it&lt;/li&gt;
&lt;li&gt;How important we perceive something to be&lt;/li&gt;
&lt;li&gt;If we perceive something to be collaborative or combative...&lt;/li&gt;
&lt;li&gt;Innovative or legacy&lt;/li&gt;
&lt;li&gt;Bleeding-edge or mainstream&lt;/li&gt;
&lt;li&gt;Safe or provocative&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  “Postmortem” vs. “Retrospective”
&lt;/h2&gt;

&lt;p&gt;Both of these terms refer to &lt;a href="https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how"&gt;a document&lt;/a&gt; that summarizes a past incident and the steps that were taken to resolve it. “Postmortem” was originally a medical term dating back to the 1820s. The metaphorical usage of examining other things after their “death” has been widely used in many industries, including tech.&lt;/p&gt;

&lt;p&gt;In recent years, many organizations have been differentiating the idea of a retrospective from a postmortem as the cultural mindset shifts toward ongoing learning from events and failures. The two practices are commonly considered to have some small differences, such as the timing and content of the documents. However, just as important as these differences are the psychological effects of the terminology being used, especially when these may be conducted in a high-pressure environment. Here are some of the reasons we’re using “retrospective” instead of &lt;a href="https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how"&gt;“postmortem” at Blameless&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The negativity of postmortems:&lt;/strong&gt; death has a negative association in most people’s minds. As responders attend to incidents, the negative connotation lingers. Engineers may feel worried about the consequences of an incident, and the idea of “death” surrounding this process may encourage feelings of guilt and fear. Remove these negative associations, and people will be more eager to review what actually occurred and take the time to revisit it as a team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The finality of postmortems:&lt;/strong&gt; at Blameless, we don’t see failure as the end. We see it as an opportunity to learn and grow, a starting point for positive change. Postmortems are very final; no examination happens “post-postmortem”. A retrospective implies that you’re looking back at something that happened recently or occurred a while ago, and that could still have a purpose in the future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The wide scope of retrospectives:&lt;/strong&gt; a postmortem is defined by the single moment of failure and works backwards to determine the causes. A retrospective is concerned with more than just the direct causes of failure. Instead, it seeks to tell the complete story of the service, systems, and people, up to and beyond the incident.&lt;/p&gt;

&lt;p&gt;We want our &lt;a href="https://www.blameless.com/blog/incident-retrospective-postmortem-template"&gt;incident retrospectives&lt;/a&gt; to be documents that we are proud to contribute to, that serve as hubs of learning and impetus for change going forward. We believe that by using the word “retrospective”, it conveys this intent much better than “postmortem”.&lt;/p&gt;

&lt;h2&gt;
  
  
  “Root Cause Analysis” vs “Contributing Factors Analysis”
&lt;/h2&gt;

&lt;p&gt;When determining why something went wrong, there are several competing schools of thought. The root cause analysis, or RCA, is a popular tool for uncovering the reason for failure. The idea of a “root cause” as the primary factor causing failure dates back to the early 1900s, with “root cause analysis” emerging as a concept in engineering companies in the 1930s. It is commonly attributed to Sakichi Toyoda of Toyota, who developed the &lt;a href="https://en.wikipedia.org/wiki/Five_whys"&gt;Five Whys&lt;/a&gt; technique to find root causes.&lt;/p&gt;

&lt;p&gt;Contributing factor analysis is a more recent term that has been growing in popularity. It also seeks to understand the causes of an incident, but with a different mindset. That mindset is reflected in the language itself as much as any specific practice. Let’s look at some examples of these differences, and why we at Blameless feel the contributing factors analysis is more useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The singularity of RCAs:&lt;/strong&gt; the most obvious difference is that a root cause analysis refers to a singular root cause, where contributing factors emphasizes multiple factors. This is more important than it may seem. If you set out looking for a singular cause, you’ll resist branching out to other impactful areas. For example, if you only look for an engineering cause, you’ll disregard factors arising from product design or team culture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hierarchy of RCAs:&lt;/strong&gt; the idea of a “root” cause is that it is the source from which other causes grow and branch off. Understanding what causes are more significant for the incident is necessary to properly prioritize follow-up items, but it isn’t the full story. You have to also consider how these changes will affect the team and system as a whole. Thinking about each factor’s contribution without trying to determine which is the “root” keeps you more open-minded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The neutrality of contribution:&lt;/strong&gt; when considering the cause of an incident, you’ll be inclined to find failures, mistakes, and other negative things. Instead you can think about every factor that contributed to the story of the incident - including things that went well, like helpful playbooks and good communication. The totality of this factor analysis gives you a more complete picture of how to respond to incidents going forward.&lt;/p&gt;

&lt;p&gt;Blameless advocates SRE as a holistic practice, one that incorporates learning from all available sources. The Contributing Factors Analysis brings in as many sources as possible to best understand incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  “Disaster Recovery” vs “Incident Response”
&lt;/h2&gt;

&lt;p&gt;The overall process initiated by something going wrong has gone by different names over the years. The attitudes people have towards this have changed alongside the evolution of language and terminology. At first, organizations typically referred to this as disaster recovery. This terminology dates back to the 1970s, when it focused on how systems would recover if natural (or other) disasters wiped out infrastructure and its ability to operate.&lt;/p&gt;

&lt;p&gt;As IT systems became more virtual, outages started to be caused by a much wider range of technical aspects other than natural disasters. Organizations moved to referring to this process as incident response to reflect the range of problems and new processes and tools. Also, the processes themselves evolved along with the technology changes. Let’s look at how these terms reflect the attitudes of each era, and why we now use incident response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The singularity of recovery:&lt;/strong&gt; incident response, sometimes referred to as incident management, is much more than just restoring the environment to its previous state. After services are back online, you still need to gather information from the incident itself and build a retrospective, develop action items to carry the learning forward, and review the effectiveness of the response steps and procedures. Recovery is really only the first step towards resolution, and doesn’t convey how you can get the most learning and improvement from each incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The severity of disasters:&lt;/strong&gt; people see disasters as major catastrophic events. Setting up policies and procedures to trigger only in the event of a “disaster” is a very high bar. However, your incident response process should work just as efficiently for all incidents. In other words, not all incidents are “Sev 1,” so knowing the right steps to take for each incident is equally important. We believe there’s learning in every incident, and so every incident is worth responding to properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The inevitability of incidents:&lt;/strong&gt; disasters are also thought of as something to avoid at all costs. Any effort spent on reducing the chances of a disaster would be justified, given how severe disasters can be for both customers and engineering teams. A goal of zero disasters seems reasonable. However, we know that 100% reliability is impossible. By recognizing the inevitability of incidents, you embrace them and avoid overspending on infrastructure and other resources in trying to prevent them. Using the term “incidents” vs “disasters” helps team members understand their true inevitability and impact.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
