<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Blameless</title>
    <description>The latest articles on DEV Community by Blameless (@blameless).</description>
    <link>https://dev.to/blameless</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2528%2F7d5c2a13-b1f4-4ec2-af61-b7306a36094f.jpg</url>
      <title>DEV Community: Blameless</title>
      <link>https://dev.to/blameless</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/blameless"/>
    <language>en</language>
    <item>
      <title>What's difficult about problem detection? - Three Key Takeaways</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Wed, 14 Sep 2022 17:50:56 +0000</pubDate>
      <link>https://dev.to/blameless/whats-difficult-about-problem-detection-three-key-takeaways-36dd</link>
      <guid>https://dev.to/blameless/whats-difficult-about-problem-detection-three-key-takeaways-36dd</guid>
      <description>&lt;p&gt;Welcome to &lt;a href="https://info.blameless.com/whats-difficult-about-problem-detection-0"&gt;episode 4&lt;/a&gt; of our webinar series, From Theory to Practice. Blameless’s Matt Davis and Kurt Andersen were joined by Joanna Mazgaj, Director of Production Support at Tala, and Laura Nolan, Principal Software Engineer at Stanza Systems. They tackled a tricky and often overlooked aspect of incident management: problem detection.&lt;/p&gt;

&lt;p&gt;It can be tempting to gloss over problem detection when building an incident management process. The process might start with &lt;a href="https://www.blameless.com/incident-response/incident-classification"&gt;classifying and triaging&lt;/a&gt; the problem and declaring an incident accordingly. The fact that the problem was detected in the first place is treated as a given, something assumed to have already happened before the process starts. Sometimes it is as simple as your monitoring tools or a customer report bringing your attention to an outage or other anomaly. But there will always be problems that won’t be caught with conventional means, and those are often the ones needing the most attention.&lt;/p&gt;

&lt;p&gt;Our panel came from diverse backgrounds, with Laura working at a very new and small startup and Joanna focusing on production at a large company, but each had experience dealing with problem detection challenges. The problems that are difficult to detect will vary greatly depending on what you’re focused on observing, but our panel found thought processes that consistently helped.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://info.blameless.com/whats-difficult-about-problem-detection-0"&gt;Listen to their discussion&lt;/a&gt; to learn from their experiences. Or, if you prefer to read, I’ve summarized three key insights in this blog post as I have for previous episodes (&lt;a href="https://www.blameless.com/sre/sre-from-theory-to-practice-whats-difficult-about-incident-command"&gt;one&lt;/a&gt;, &lt;a href="https://www.blameless.com/sre/sre-from-theory-to-practice-whats-difficult-about-on-call-discussion"&gt;two&lt;/a&gt;, and &lt;a href="https://www.blameless.com/sre/whats-difficult-about-tech-debt"&gt;three&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;Losing focus on the truth and gray failure&lt;/h2&gt;

&lt;p&gt;You might think of your system as having two basic states: working and broken, or healthy and unhealthy. This binary way of thinking is nice and simple for &lt;a href="https://www.blameless.com/incident-response/is-it-really-an-incident"&gt;declaring incidents&lt;/a&gt;, but can be very misleading. It may lead you to overlook problems that exist in the gray areas between success and failure.&lt;/p&gt;

&lt;p&gt;Kurt Andersen gave an example of this type of failure that’s becoming more relevant today: gray failure in machine learning projects, resulting from the training data drifting away from reality. Machine learning projects can give very powerful results, using a process where an algorithm is fed large amounts of labeled data until it learns to apply the same classifications to new data. For example, an algorithm can be trained to identify species of birds from a picture after being shown thousands of labeled pictures.&lt;/p&gt;

&lt;p&gt;But what happens when the supplied data starts drifting away from accuracy? If the algorithm starts misidentifying birds because it’s working from bad data or has learned incorrect patterns, it won’t throw an error. A user trying to learn the name of a species won’t likely be able to tell that the result is incorrect. The system will start to fail in a subtle way that requires deliberate attention to detect and address.&lt;/p&gt;

&lt;p&gt;Laura Nolan pointed out that this type of gray failure is an example of an even more general problem – how do you know what “correct” is in the first place? “If you know something is supposed to be a source of truth, how did you double check that?” she asked. “In some cases there are ways, but it is a challenge.”&lt;/p&gt;

&lt;p&gt;There’s no single way to detect “incorrectness” in a system when the system’s definition of “correct” can drift away from your intent. What’s important is identifying where this can happen in your system, and building efficient and reliable processes (even if they’re partially manual) to double check that you haven’t drifted into gray failure.&lt;/p&gt;
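&lt;p&gt;As one hypothetical sketch of such a double check (our own illustration, not something proposed in the episode): you can periodically compare the distribution of a model’s recent outputs against a trusted baseline, and flag large shifts for human review.&lt;/p&gt;

```python
from collections import Counter

def drift_score(baseline_labels, recent_labels):
    """Total variation distance between two label distributions.

    0.0 means the distributions are identical; 1.0 means completely disjoint.
    """
    base = Counter(baseline_labels)
    recent = Counter(recent_labels)
    n_base, n_recent = len(baseline_labels), len(recent_labels)
    labels = set(base) | set(recent)
    return 0.5 * sum(
        abs(base[label] / n_base - recent[label] / n_recent) for label in labels
    )

# Last month the classifier said 70% sparrow / 30% robin; this week it flipped.
baseline = ["sparrow"] * 70 + ["robin"] * 30
recent = ["sparrow"] * 30 + ["robin"] * 70

score = drift_score(baseline, recent)
if score > 0.2:  # threshold is illustrative and must be tuned per system
    print(f"possible gray failure: label drift score {score:.2f}")
```

&lt;p&gt;A check like this won’t tell you why the outputs shifted – the model, the input data, or the world itself may have changed – but it turns a silent drift into a detectable signal.&lt;/p&gt;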

&lt;h2&gt;Detecting the wrong problems and mixing up symptom and cause&lt;/h2&gt;

&lt;p&gt;Another big challenge in problem detection: even if you’ve detected a problem, is it the right problem? Joanna gave an example of your system having an outage or another very high priority incident that needs dealing with. When you dive into the system to find the cause of the outage, you end up finding five other problems with the system. This is only natural, as complex systems are always “sort of broken”. But are any of these problems the problem, the one causing impact to users?&lt;/p&gt;

&lt;p&gt;Matt shared an example from his personal life. When getting an MRI to detect a problem with his hearing, doctors found congestion in his sinuses. It wasn’t causing his hearing issues, and furthermore, the doctors guessed that an MRI of almost anyone would reveal the same congestion. Some of the problems you detect, while certainly problems, are ones that most systems simply “live with”, and are unrelated to what you’re trying to find.&lt;/p&gt;

&lt;p&gt;However, a system functioning despite its problems isn’t a guarantee of safety. Laura discussed how a robust system can run healthily with all sorts of problems happening behind the scenes. This can be a double-edged sword. If these problems eventually accumulate into something unmanageable, it can be difficult to sort through the cause and effect of everything that has been piling up. For example, if you find your system “suddenly” runs out of usable memory, it could be because many small memory leaks, individually unnoticeable, have added up. At the same time, the problems resulting from insufficient memory can seem like issues in themselves, instead of just symptoms of this one problem.&lt;/p&gt;

&lt;p&gt;These tangled and obscured causes and effects are inevitable in a complex system. At the same time, you can’t overreact and waste time on every minor problem you see. Tools like SLOs, which proactively alert you before an issue starts impacting customer happiness, can help you strike a balance.&lt;/p&gt;
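&lt;p&gt;As a rough sketch of how that balance can be quantified (the numbers and names here are illustrative, not a Blameless feature): an error budget derived from an SLO tells you how many failures you can absorb before customers notice, so alerts fire on budget burn rather than on every blip.&lt;/p&gt;

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left; 0.0 or below means the SLO is blown."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% target leaves no budget at all
    return 1 - failed_requests / allowed_failures

# A 99.9% availability SLO over 1,000,000 requests allows ~1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"{remaining:.0%} of the error budget is left")
if 1 - remaining > 0.75:  # alert once three quarters of the budget is burned
    print("alert: approaching customer-impacting territory")
```

&lt;p&gt;With a policy like this, 400 failed requests out of a million barely registers, while a sustained burn trips an alert well before users feel it.&lt;/p&gt;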

&lt;h2&gt;Problems occurring in the intersection of systems&lt;/h2&gt;

&lt;p&gt;Focusing on the user experience can help you understand some problems that are otherwise impossible to detect, even ones that occur when your system is functioning entirely as intended. These problems can result from a situation where your system is behaving as you expect, but not as your user expects. If the user relies on certain outputs of your system, and it produces a different output, it can create a hugely impactful problem without anything appearing wrong on your end.&lt;/p&gt;

&lt;p&gt;Laura gave an example of this sort of problem. Data centers use a network design known as a “Clos network” for redundancy and reliability. To put it simply, this design creates a new backup link if a link fails, generally providing an uninterrupted connection across this small and common failure. However, a customer’s system might react immediately to the link failing, causing a domino effect of major failure resulting from the mostly normal operations of the data center. So where does the problem exist, in the data center’s system or the user’s system? Both are functioning as intended. Laura suggests that the problem lies only in the interaction between the two systems – difficult to detect, to say the least!&lt;/p&gt;

&lt;p&gt;Another high-profile example of this type of problem happened with a glitch where &lt;a href="https://twitter.com/gergelyorosz/status/1502947315279187979"&gt;UberEats users in India were able to order free food&lt;/a&gt;. In this case, an error given by a payment service UberEats had integrated with was incorrectly parsed by UberEats as “success”. The problem only occurred in the space between how the message was generated and how it was interpreted.&lt;/p&gt;


&lt;p&gt;This example teaches a good lesson in detecting and preventing this sort of problem. Building robust processes for handling what your system receives from other systems is essential – you can’t assume things will always arrive as you expect. Err on the side of caution, and have “safe” responses to data that your system can’t interpret. Simulate your links with external systems to make sure you cover all types of output that could come in.&lt;/p&gt;
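&lt;p&gt;A minimal sketch of that “safe response” idea (the status names here are invented for illustration, not UberEats’s actual integration): map only the provider statuses you have explicitly verified, and fail closed on everything else.&lt;/p&gt;

```python
# Statuses explicitly mapped from the (hypothetical) payment provider's docs.
KNOWN_STATUSES = {
    "CHARGE_OK": "success",
    "CARD_DECLINED": "failure",
    "TIMEOUT": "retry",
}

def interpret_payment_response(status):
    """Map a provider status to an internal outcome, failing closed on unknowns."""
    # Err on the side of caution: an unrecognized status is treated as a
    # failure to investigate, never silently as a success.
    return KNOWN_STATUSES.get(status, "failure")

print(interpret_payment_response("CHARGE_OK"))       # success
print(interpret_payment_response("ERR_UNKNOWN_42"))  # failure
```

&lt;p&gt;The design choice is the default in the last line: a message generated in a way you never anticipated lands on the safe side of the interaction, instead of being parsed as “success”.&lt;/p&gt;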


&lt;p&gt;We hope you’re enjoying our continued deep dives into the challenges of SRE in From Theory to Practice. Check out the full episode &lt;a href="https://info.blameless.com/whats-difficult-about-problem-detection-0"&gt;here&lt;/a&gt;, and look forward to more episodes coming soon. Have an idea for a topic? Share it with us in our &lt;a href="https://www.blameless.com/slack-community"&gt;Slack Community&lt;/a&gt;!&lt;/p&gt;


</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>A Chat with Lex Neva of SRE Weekly</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Mon, 22 Aug 2022 20:29:50 +0000</pubDate>
      <link>https://dev.to/blameless/a-chat-with-lex-neva-of-sre-weekly-4ck4</link>
      <guid>https://dev.to/blameless/a-chat-with-lex-neva-of-sre-weekly-4ck4</guid>
      <description>&lt;p&gt;Since 2015, Lex Neva has been publishing &lt;a href="https://www.blameless.com/sre/sre-weekly-interview"&gt;SRE Weekly&lt;/a&gt;. If you’re interested enough in reading about SRE to have found this post, you’re probably familiar with it. If not, there’s a lot of great articles to catch up on! Lex selects around 10 entries from across the internet for each issue, focusing on everything from SRE best practices to the socio- side of systems to major outages in the news.&lt;/p&gt;

&lt;p&gt;I had always figured Lex must be among the most well-read people in SRE, and likely #1. I met up with Lex on a call, excited to chat with him about how SRE Weekly came to be, how it continues to run, and his perspective on SRE.&lt;/p&gt;

&lt;h2&gt;The origins of SRE Weekly&lt;/h2&gt;

&lt;p&gt;I felt like an appropriate start to our conversation was to ask about the start of SRE Weekly: why did he take on this project? Like many creators of good projects, Lex was motivated to “be the change he wanted to see”. He was an avid reader of &lt;a href="https://devopsweeklyarchive.com/"&gt;Devops Weekly&lt;/a&gt;, but wished that something similar existed for SRE. With so much great and educational content created in the SRE space, shouldn’t there be something to help people find the very best?&lt;/p&gt;

&lt;p&gt;“I wanted there to be a list of things related to SRE every week, and such a thing didn’t exist, and I’m like… Oh.” Lex explained. “I almost fell into it sideways, I thought this was gonna be a huge time sink, but it ended up being pretty fun, actually.”&lt;/p&gt;

&lt;h2&gt;How SRE Weekly is made&lt;/h2&gt;

&lt;p&gt;When thinking about the logistics of SRE Weekly, one question likely comes to mind: how? How does he have time to read all those articles? SRE is a methodology of methodologies, a practice that encourages building and improving practices. Lex certainly embodies this with his efficient method of finding and digesting dozens of articles a week.&lt;/p&gt;

&lt;p&gt;First, he finds new articles. For this, RSS feeds are his favorite tool. Once he’s got a buffer of new articles queued up, he uses an Android application called &lt;a href="https://www.hyperionics.com/atVoice/"&gt;@voice&lt;/a&gt; to listen to them with text-to-speech – at 2.5x speed! Building up the ability to comprehend an article at that speed is a challenge, but for someone tackling the writing output of the entire community, it’s worth it.&lt;/p&gt;

&lt;p&gt;To choose which articles to include, Lex doesn’t have any sort of strict requirements. He’s interested in articles that can bring new ideas or perspectives, but also likes to periodically include well-written introductory articles to get people up to speed. Things that focus on the socio- side of the sociotechnical spectrum also interest him, especially when highlighting the diversity of voices in SRE.&lt;/p&gt;

&lt;p&gt;Incident retrospectives are also a genre of post that Lex likes to highlight. Companies posting public statements about outages they’ve experienced and what they’ve learned is a trend Lex wants to encourage. Although they might seem to only tell the story of one incident at one company, good incident retrospectives can bring out a more universal lesson. “An incident is like an unexpected situation that can teach us something – if it’s something that made you surprised about your system, it probably can teach someone else about their system too.”&lt;/p&gt;

&lt;p&gt;Lex explained how in the aviation industry, massive leaps forward in reliability were made when competing airlines started sharing what they learned after crashes. They realized that any potential competitive advantages should be secondary to working together to keep people safe. “The more you share about your incidents, the more we can realize that everyone makes errors, that we’re all human,” Lex says. Promoting incident retrospectives is how he can further these beneficial trends. &lt;/p&gt;

&lt;h2&gt;Lex’s view of SRE&lt;/h2&gt;

&lt;p&gt;As someone with a front row seat to the evolution of SRE, I was curious what sort of trends Lex had seen and how he foresees them growing and changing. We touched on many subjects, but I’ll cover three major ones here:&lt;/p&gt;

&lt;h3&gt;Going beyond the Google SRE book&lt;/h3&gt;

&lt;p&gt;Since it was published in 2016, the &lt;a href="https://sre.google/sre-book/table-of-contents/"&gt;Google SRE book&lt;/a&gt; has been the canonical text when it comes to SRE. In recent years, however, the idea that this book shouldn’t be the end-all be-all is becoming more prominent. At SREcon 21, Niall Murphy, one of the book’s authors, &lt;a href="https://www.youtube.com/watch?v=7Ktzu0qvS6c&amp;amp;t=779s"&gt;ripped it up live on camera&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Lex has seen this shift in attitudes in a lot of recent writing, and he’s happy to see a more diverse understanding of what SRE can be: “Even if Google came up with the term SRE, lots of companies had been doing this sort of work for even longer,” Lex said. “I want SRE to not just mean the technical core of making a reliable piece of code – although that’s important too – but to encompass everything that goes into building a reliable system.”&lt;/p&gt;

&lt;p&gt;As SRE becomes more popular, companies of all sizes are seeing the benefits and wanting to hop aboard. Not all of these companies can muster the same resources as Google… Actually, practically only Google is at Google’s level! Lex has been seeing more lessons emerge around the challenges of doing SRE at other scales, like startups, where there aren’t any extra resources to spare.&lt;/p&gt;

&lt;h3&gt;Broadening what an SRE can be&lt;/h3&gt;

&lt;p&gt;As we break away from the Google SRE book, we also start to break away from traditional descriptions of what a Site Reliability Engineer needs to do. “SRE is still in growing pains,” Lex said. “We’re still trying to figure out what we are. But it’s not a bad thing. I’ve embraced that there’s a lot under the umbrella.”&lt;/p&gt;

&lt;p&gt;We often think of the “Engineer” in Site Reliability Engineer as being like “Software Engineer”, that is, someone who primarily writes code. But Lex encourages a more holistic view: that SRE is about engineering reliability into a system, which involves so much more than just writing code. He’s been seeing more writing and perspectives from SREs who have “writing code” as a small percentage of their duties – even 0%.&lt;/p&gt;

&lt;p&gt;“They’re focusing more on the people side of things, the incident response, and coming up with the policies that engender reliability in their company… And I think there’s room for that in SRE, because at the heart of it is still engineering, it’s still the engineering mindset. If you only do the technical side of things, you’re really missing out.”&lt;/p&gt;

&lt;h3&gt;Diversifying the perspectives of SREs&lt;/h3&gt;

&lt;p&gt;Alongside diversifying the role of SREs, Lex hopes to see more diversity among SREs themselves. In our closing discussion, I asked Lex what message he would broadcast to everyone in this space if he could. “It’s all about the people,” he said. “These complex systems that we’re building, they will always have people. They’re a critical piece of the infrastructure, just as much as servers.”&lt;/p&gt;

&lt;p&gt;Even if what we build in SRE seems to be governed just by technical interactions, people are intrinsic to making those systems reliable. This isn’t a negative; this isn’t just people being “error-makers”. People are what gives a system strength and resiliency. To this point, Lex highlighted what can make this socio- side of systems better: diversity and inclusion.&lt;/p&gt;

&lt;p&gt;“Inclusion is important for the reliability of our socio-technical systems because we need to understand the perspective of all our users, not just the ones that are like us. That means thinking across race, gender expression, class, neurodivergence, everything. It’s an area where we need to do better.” Lex hopes to highlight the richness in this diversity in SRE Weekly.&lt;/p&gt;

&lt;p&gt;As people standing at the relative beginning of SRE, working together to build and evolve the practice, we’re given both a challenge and an opportunity. In order to truly understand and engineer reliability into what we do, we need to proactively discuss our goals and how we’re achieving them. We hope you take the time to reflect on the learning that many great SRE writers share through spaces like SRE Weekly.&lt;/p&gt;

&lt;p&gt;Read SRE Weekly &lt;a href="https://sreweekly.com/"&gt;here&lt;/a&gt;, and follow it on Twitter &lt;a href="https://twitter.com/SREWeekly"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>SRE: From Theory to Practice | What's difficult about incident command</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Wed, 27 Jul 2022 16:45:02 +0000</pubDate>
      <link>https://dev.to/blameless/sre-from-theory-to-practice-whats-difficult-about-incident-command-4icm</link>
      <guid>https://dev.to/blameless/sre-from-theory-to-practice-whats-difficult-about-incident-command-4icm</guid>
      <description>&lt;p&gt;A few weeks ago we released &lt;a href="https://info.blameless.com/whats-difficult-about-incident-command-lp"&gt;episode two&lt;/a&gt; of our ongoing webinar series, SRE: From Theory to Practice. In this series, we break down a challenge facing SREs through an open and honest discussion. Our topic this episode was “what’s difficult about incident command?” When things go wrong, who is in charge? And what does it feel like to do that role? To discuss, Jake Englund and Matt Davis from Blameless were joined by Varun Pal, Staff SRE at Procore, and Alyson Van Hardenburg, Engineering Manager at Honeycomb.&lt;/p&gt;

&lt;p&gt;To explore how organizations felt about incident command, we asked about the role on our &lt;a href="https://www.blameless.com/slack-community"&gt;community Slack channel&lt;/a&gt;, an open space for SRE discussion. We found that most organizations don’t have dedicated incident commander roles. Instead, on-call engineers are trained to take on the command role when appropriate. Because of this wide range of people who could end up wearing the incident commander hat, it’s important to have an empathetic understanding of exactly what the role entails.&lt;/p&gt;

&lt;p&gt;With this conversation, we wanted to work through what incident command theoretically entails, and connect it to the messy reality of what it often looks like. As we did for &lt;a href="https://www.blameless.com/sre/sre-from-theory-to-practice-whats-difficult-about-on-call-discussion"&gt;last episode&lt;/a&gt;, we’ll highlight three key takeaways as an introduction to the episode.&lt;/p&gt;

&lt;h2&gt;Create support structures for groups of incident commanders&lt;/h2&gt;

&lt;p&gt;Varun discussed how at Procore, he and his colleague started an incident commander “guild”, a group of people who may have to take on the incident command role that meets weekly. Before starting the guild, they recognized that each person taking on the role may have vastly different areas of expertise and perspectives on how incidents should be run. When reviewing incidents in retrospectives, they’d often find inconsistencies based on who was commanding the incident. This created challenges for finding patterns across incidents, and for using consistent methods to investigate the causes of incidents. This was the impetus to gather incident commanders in this new guild.&lt;/p&gt;

&lt;p&gt;By bringing together everyone who could wear the incident commander hat, they not only got everyone on the same page, but on the “best” page: everyone could contribute what they found most effective, synthesizing the group’s experience into an agreed-upon set of best practices that everyone adheres to. The program was started from the bottom up, knowing that the time and energy invested would make everyone’s lives easier in the long run.&lt;/p&gt;

&lt;p&gt;Perhaps even more important than coming up with good procedures, the incident commander guild provides solidarity and empathy. It’s a safe space for people who respond to incidents to share in one another’s triumphs, and commiserate and vent about frustrations. Incident command is tough work: it’s a job that can have you leaping out of bed at 3am and suddenly being asked to direct a team of other tired people. Without support, people can quickly burn out.&lt;/p&gt;

&lt;h2&gt;Empathize with anxiety around expertise&lt;/h2&gt;

&lt;p&gt;“I’d rather be on-call 24/7 for something I’m the subject matter expert on than spend 5 minutes being incident commander for something I don’t know about,” said Jake in our discussion. It might be an exaggeration – but not a huge one. Everyone else on the call echoed this sentiment. The anxiety around not knowing is only sensible. In the crunch time of an incident, no one wants to cause further delay because they don’t know how something works.&lt;/p&gt;

&lt;p&gt;The first step to addressing this, as Jake emphasized, is to realize that engineers are not fungible. You can’t assume that every engineer has the expertise and experience of every other engineer. For engineers to be effective on-call, they need to be brought up to speed on system functioning. Without that, you won’t be able to know that “deploying people” to resolve a problem will have any effect.&lt;/p&gt;

&lt;p&gt;Even with training, some engineers will always be more familiar with some service areas than others, perhaps because they worked on the project itself. No matter how prepared they are for on-call in general, this relative lack of expertise will always cause anxiety: people will inevitably fear an incident that exposes the things they don’t know, or even the things they don’t know that they don’t know. This is why Alyson emphasized, to everyone’s agreement, that subject matter experts shouldn’t be the incident commander. Good incident response shouldn’t be about “getting lucky” and having the expert on call, but establishing learning and processes that help anyone solve issues.&lt;/p&gt;

&lt;p&gt;Since this anxiety is to some extent inevitable, the important thing is to empathize with it and set up systems that support it. Often, there will be designated people to escalate to. It’s helpful to know who to call, but it can be intimidating if you think you’re bothering someone with a question you “ought to know”. One panelist brought up the subject of “on-call buddies”, someone you trust yourself to contact even when you’re unsure of what you “ought to know”. Then, even if both of you don’t know, you can be more encouraged to escalate further. In general, escalation policies shouldn’t be strict and linear, but more based on expertise and connections.&lt;/p&gt;

&lt;h2&gt;Incident command is like first aid&lt;/h2&gt;

&lt;p&gt;We’ve looked at some best practices to make life better for incident commanders, but a key question remains: what exactly is incident command? Is it a duty that rotates through everyone on-call, with that designated person taking command for every incident on that shift? Or is it determined at the time of the incident – perhaps the person who first responds to the incident, or the most expert person on-call, or the most senior person involved in each incident? Or should you hire a designated incident command person? What duties does someone have when they’re on incident command?&lt;/p&gt;

&lt;p&gt;When discussing these questions, our panel concluded that… it depends. Who an incident commander is and what they do may vary from org to org, and from incident to incident. But when you’re building up the practice yourself, “it depends” isn’t a very helpful answer. That’s why I wanted to highlight a framework for incident command suggested by Alyson: incident command is like first aid.&lt;/p&gt;

&lt;p&gt;First aid isn’t about fully treating a patient, or even fully diagnosing them. It’s about taking charge of a situation and making sure critical tasks are happening and not falling through the cracks. Alyson described a scene where you witness an accident and immediately give direction: “you, elevate the head and try to stop the bleeding”; “you, call an ambulance”; etc. Instructing particular people bypasses the &lt;a href="https://en.wikipedia.org/wiki/Bystander_effect"&gt;bystander effect&lt;/a&gt; and ensures the task is completed.&lt;/p&gt;

&lt;p&gt;When you’re the incident commander, it can be helpful to focus on this role of immediate task allocation, instead of getting bogged down immediately by diagnosis and response itself. Matt also emphasized the importance of the incident commander knowing when to step away. You’ll naturally want to see every step of the incident through, but trying to power through when exhausted can have diminishing returns. For the sake of the incident, and your own health, it’s important to take breaks. During that time, hand off the command role to someone else. Matt suggests the incident lead is a good option. No one expects the first aid person to stay at the patient’s bedside all night.&lt;/p&gt;

&lt;p&gt;We hope you’re enjoying our look into the real experiences of engineers in SRE: From Theory to Practice. You can look forward to new episodes coming soon. If you have an SRE topic that you’d like to see covered by a panel, please let us know on &lt;a href="https://twitter.com/blamelesshq"&gt;Twitter&lt;/a&gt; or on our &lt;a href="https://www.blameless.com/slack-community"&gt;Slack community channel&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>SRE: From Theory to Practice | What's difficult about on-call?</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Mon, 20 Jun 2022 19:29:21 +0000</pubDate>
      <link>https://dev.to/blameless/sre-from-theory-to-practice-whats-difficult-about-on-call-4o6k</link>
      <guid>https://dev.to/blameless/sre-from-theory-to-practice-whats-difficult-about-on-call-4o6k</guid>
      <description>&lt;p&gt;We launched the first episode of a webinar series to tackle one of the major challenges facing organizations: on-call. &lt;a href="https://info.blameless.com/whats-difficult-about-on-call"&gt;SRE: From Theory to Practice - What’s difficult about on-call&lt;/a&gt; sees Blameless engineers Kurt Andersen and Matt Davis joined by Yvonne Lam, staff software engineer at Kong, and Charles Cary, CEO of Shoreline, for a fireside chat about everything on-call. &lt;/p&gt;

&lt;p&gt;As software becomes more ubiquitous and necessary in our lives, our standards for reliability grow alongside it. It’s no longer acceptable for an app to go down for days, or even hours. But incidents are inevitable in such complex systems, and automated incident response can’t handle every problem.&lt;/p&gt;

&lt;p&gt;This set of expectations and challenges means having engineers on-call and ready to fix problems 24/7 is a standard practice in tech companies. Although necessary, on-call comes with its own set of challenges. Here are the results of a survey we ran asking on-call engineers what they find most difficult about the practice:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aJv3Udso--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vcui24kwi7by5suplsnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aJv3Udso--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vcui24kwi7by5suplsnl.png" alt="A bar graph showing the most difficult factors for on-call based on a survey" width="880" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These results indicate that on-call engineers primarily struggle with the absence of practical resources, like runbooks and role management. However, to solve a practical problem, taking a holistic approach to the systems behind that problem is often necessary.&lt;/p&gt;

&lt;p&gt;When we see these challenges in the world of SRE, we want to dive into the challenges behind the challenges, building a holistic approach to improvement. Each episode of this new webinar series will tackle a different set of challenges in the world of SRE with this perspective. We think that honest and open conversations between experts are an enlightening and empathetic way to improve in these practices.&lt;/p&gt;

&lt;p&gt;As the title suggests, this webinar bridges theoretical discussion of how on-call ought to be with practical implementation advice. If you don’t have the time to watch, here are three key takeaways. I’ll be continuing a series of wrap-up blog posts alongside each episode, so you can keep up with the conversation no matter what format you prefer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Internal on-call is just as important as external
&lt;/h2&gt;

&lt;p&gt;Yvonne works mostly on internal tools for Kong. As a result, the incidents she’s alerted to aren’t the typical ones that directly impact customers, like a service being down. Instead, her on-call shifts are spent fighting fires that prevent teams from integrating and deploying new code. Ultimately, this is just as customer-impacting as an outage: when internal tools fail, teams can’t deploy fixes for outages quickly. It can be even worse – if these tools are shared between teams, an internal outage can be totally debilitating.&lt;/p&gt;

&lt;p&gt;Yvonne sharing her experiences with these challenges kicked off a discussion of the importance of internal on-call. She discussed how these internal issues can sometimes be uniquely hard to pinpoint. Rather than a suite of observability tools reporting on the customer experience, internal monitoring is often just engineers vaguely reporting that a tool “seems slow”. Internal issues need some of the same structures that help with external outages, while also accounting for what makes internal issues unique.&lt;/p&gt;

&lt;p&gt;To achieve this mix, it’s important to have some universal standards of impact. SRE advocates using tools like &lt;a href="https://www.blameless.com/sre/slis-understand-users-needs"&gt;SLIs and SLOs&lt;/a&gt; to measure incidents by what ultimately matters most: whether customers are satisfied with their experience. You can apply this thinking to internal issues too. Think of your engineers as “internal customers” and build their “user journeys” in terms of what tools they rely on, the impact of those tools failing on their ability to deploy code, etc. This will help you build internal on-call systems that reflect the importance of internal system reliability, along with the resources to support it, like runbooks. Investing in these resources strategically requires understanding the positive impact they’d have.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Think of your engineers as “internal customers” and build their “user journeys” in terms of what tools they rely on, the impact of those tools failing on deploying code, etc.&lt;/p&gt;
&lt;/blockquote&gt;
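To make the “internal customers” idea concrete, you can give an internal tool its own SLI and SLO, just as you would a customer-facing service. The sketch below is illustrative only: the deploy-pipeline journey, the event data, and the 99% target are assumptions, not anything prescribed in the discussion.

```python
# Sketch: treating engineers as "internal customers" by giving an
# internal tool (here, a deploy pipeline) its own SLI and SLO.
# The 99% target and the event data are illustrative assumptions.

def sli_success_ratio(events):
    """SLI: fraction of deploy attempts that succeeded."""
    if not events:
        return 1.0
    return sum(1 for ok in events if ok) / len(events)

SLO_TARGET = 0.99  # assumed objective for the internal deploy journey

deploy_events = [True] * 197 + [False] * 3  # 197 of 200 deploys succeeded
sli = sli_success_ratio(deploy_events)

print(f"Deploy SLI: {sli:.3f}")  # 0.985
if sli < SLO_TARGET:
    print("SLO breached: prioritize internal reliability work")
```

A breach of this internal SLO is the signal that runbooks, alerting, or tooling for the internal journey deserve investment, the same way an external SLO breach would.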

&lt;h2&gt;
  
  
  Assessing customer impact is hard to learn
&lt;/h2&gt;

&lt;p&gt;We’ve discussed the importance of a &lt;a href="https://www.blameless.com/sre/reliability-for-non-engineering-teams"&gt;universal language&lt;/a&gt; that reflects customer happiness, but how do you learn that language? Charles discussed the challenge in building up this intuition. Customer impact is a complex metric with many factors – how much the incident affects a service, how necessary that service is to customer experiences, how many customers use that service, how important those customers are in terms of business… the list goes on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blameless.com/incident-response/incident-classification"&gt;Incident classification&lt;/a&gt; systems and SLO impact can help a lot in judging incident severity, but there will always be incidents that fall outside of expectations and patterns. All of our participants related to experiences where they just “knew” that an incident was a bigger deal than the metrics said. Likewise, they could remember times that following the recommended runbook for an incident to a T would have caused further issues. Charles gave an example of knowing that restarting a service, although potentially a fix for the incident, could also cause data loss, and needing to assess the risk and reward.&lt;/p&gt;

&lt;p&gt;Ultimately, the group agreed that some things can’t be taught directly, but have to be built from experience – working on-call shifts and learning from more experienced engineers. The most important lesson: when you don’t know what to do and the runbook is unclear, get help! We often think of incident severity as what dictates how you escalate, but what if you don’t know the severity? Matt emphasized the importance of having a psychologically safe space, where people feel comfortable alerting other people whenever they feel unsure. Escalation shouldn’t feel like giving up on the problem, but a tool that helps provide the most effective solution.&lt;/p&gt;

&lt;p&gt;The group discussed some of the nuances of escalation. Escalation shouldn’t be thought of as a linear hierarchy, but a process where the best person for a task is called in. Find the people who are just “one hop away” from you, where you can call them in to handle something you don’t have the capacity for. Incident management is a complex process with many different roles and duties; you shouldn’t need to handle everything on your own. The person you call won’t always be the foremost expert in the subject area. Sometimes someone you have a personal relationship with, like a mentor, will be the most useful to call upon. The social dynamics we all have as humans can’t be ignored in situations like on-call, and can even be a strength.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Escalation shouldn’t feel like giving up on the problem, but a tool that helps provide the most effective solution.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Lower the cost of being wrong
&lt;/h2&gt;

&lt;p&gt;Social dynamics come into play a lot with on-call. As we discussed, there can be a lot of hesitation when it comes to escalating. People naturally want to be the one that solves the problem, the hero. They might see escalating as akin to admitting defeat. If they escalate to an expert, they might feel embarrassed that the expert will judge their efforts so far as being “wrong”, and might defer escalating to avoid that feeling of wrongness.&lt;/p&gt;

&lt;p&gt;To counteract this, Yvonne summarized wonderfully: “you have to lower the cost of being wrong”. Promote a blameless culture, where everyone’s best intentions are assumed. This will make people feel safe from judgment when escalating or experimenting. Matt focused on the idea of incidents as learning opportunities, unique chances to see the faults in the inner workings of your system. The more people fear being wrong, the more they pass up exploring this opportunity and finding valuable insights.&lt;/p&gt;

&lt;p&gt;The fear of being wrong can also lead to what Kurt described as “the mean time to innocence factor” – when an incident occurs, each team races to prove that they weren’t at fault and bear no responsibility for solving the problem. Escaping the challenge of solving the problem is a very understandable human desire, but this game of incident hot potato needs to be avoided. Again, lower the cost of being wrong to keep people at the table: it doesn’t matter if your code caused the crash, what matters is that the service is restored and a lesson is learned.&lt;/p&gt;

&lt;p&gt;The group also discussed getting more developers and other stakeholders to work on-call for their own projects. Development choices will always have some ramifications that you can’t really understand until you experience them firsthand. Developer on-call builds this understanding and empathy between developers and operations teams. Once again, lowering the cost of being wrong makes on-call more approachable. It shouldn’t be a dreadful, intimidating experience, but a chance to learn and grow, something to be embraced.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The more people fear being wrong, the more they pass up exploring this opportunity and finding valuable insights.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We hope you enjoyed the first episode of SRE: From Theory to Practice. Please look forward to more episodes dealing with other challenges in SRE, and the challenges behind the challenges. &lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>Canary Deployments | The Benefits of an Iterative Approach</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Mon, 24 Jan 2022 22:09:33 +0000</pubDate>
      <link>https://dev.to/blameless/canary-deployments-the-benefits-of-an-iterative-approach-4flc</link>
      <guid>https://dev.to/blameless/canary-deployments-the-benefits-of-an-iterative-approach-4flc</guid>
      <description>&lt;p&gt;At Blameless, we want to embrace all the benefits of the SRE best practices we preach. We’re proud to announce that we’ve started using a new system of feature flagging with canaried and iterative rollouts. This is a system where new releases are broken down and flagged based on the features each part of the release implements. Then, an increasing subset of users are given access to an increasing number of features. By avoiding big changes for big groups, we reduce the chances of major outages and provide a more reliable product faster.&lt;/p&gt;

&lt;p&gt;Of course, switching to this system comes with challenges and decisions to make. In this blog post, we’ll share what we’ve learned about canarying and flagging best practices. We’ll look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why you should consider an iterative canarying approach to releases&lt;/li&gt;
&lt;li&gt;Knowing when it’s safe to expand and iterate&lt;/li&gt;
&lt;li&gt;Understanding how users rely on your services to find the ideal groups to canary&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why do iterative canarying releases?
&lt;/h2&gt;

&lt;p&gt;Iteration and canarying are more involved than a traditional big release. You need to look at the code being deployed and flag everything that comprises each new feature. You’ll also need to tag groups of users. Finally, instead of one big release, you do several smaller releases in which more groups of users receive more features each time. Flagging features and making user groups creates overhead, and each release will take a bit of additional time. However, the benefits of this system are worth it. Here are a few to consider:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More reliable services.&lt;/strong&gt; Perhaps the biggest benefit of this approach is improved reliability for your service. Changes in production are a common source of incidents and outages. With small, iterative deployments, you’ll know that things can only go wrong for a subset of your services. Likewise, canarying means that only a subset of your users will be affected. By preventing major outages, you’ll greatly improve the perceived reliability of your service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous feedback.&lt;/strong&gt; By iterating through new features, you have many more opportunities to hear feedback from users. You’ll be able to tell what should be improved before you commit to the entire feature set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More manageable operations.&lt;/strong&gt; By reducing the scope and spacing out each release, you give operations teams the opportunity to ensure they’re ready to support each feature.&lt;/p&gt;

&lt;p&gt;This approach does come with challenges. As this approach only deals with the deployment of code, you don’t have to retroactively change your code to be modular or feature-flagged. However, you need to build new practices for code going forward. Developers need to invest time and energy building new habits for development. They need to instill a mindset of everything being modular and iterative. Switching gears like this can initially cause development to slow, but the payoff of better releases is worth it.&lt;/p&gt;
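The mechanics described above — flagging each feature with the iteration it ships in, and tagging users into canary groups that are rolled forward over time — can be sketched as a simple gating check. All the flag names, group names, and assignments below are invented for illustration; a real system would back this with a feature-flagging service rather than hard-coded dictionaries.

```python
# Sketch of iterative canary gating: each feature is flagged with the
# iteration it ships in, and each user belongs to a canary group that
# has been rolled forward to some iteration. All names are illustrative.

FEATURE_ITERATION = {"new_dashboard": 1, "bulk_export": 2, "ai_summary": 3}

# Canary groups and the latest iteration each group has received so far.
GROUP_ITERATION = {"internal": 3, "early_adopters": 2, "general": 1}

USER_GROUP = {"alice": "internal", "bob": "early_adopters", "carol": "general"}

def is_enabled(user, feature):
    """A feature is on for a user if their group's rollout has reached
    the iteration that feature shipped in."""
    group = USER_GROUP.get(user, "general")
    return GROUP_ITERATION[group] >= FEATURE_ITERATION[feature]

print(is_enabled("alice", "ai_summary"))   # True: internal group is on iteration 3
print(is_enabled("carol", "bulk_export"))  # False: general users are still on iteration 1
```

Expanding the rollout then means bumping a group’s iteration number (or moving users between groups), and rolling back a problematic feature means lowering the iteration for the affected groups without touching the rest of the release.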

&lt;h2&gt;
  
  
  Balancing canarying and iteration safely
&lt;/h2&gt;

&lt;p&gt;Our release system has essentially two dimensions: the number of users accessing new features, the canary size; and the new features they have access to, the iteration. Our goal is to expand both of these until they encompass all users and all features, without ever making leaps large enough to jeopardize the reliability of each change. How do you find this balance and cadence?&lt;/p&gt;

&lt;p&gt;First, you need to build a release roadmap. This is a project that product, development, and operations teams should share. It outlines which features should be included in each iteration, and which canary groups should receive them. It should also contain an aspirational timeline for each stage. However, this timeline shouldn’t be written in stone. You’ll need to adjust your rollout speed based on how each iteration is performing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ymuV0pCo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ktgjo3mrf9ihfema530d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ymuV0pCo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ktgjo3mrf9ihfema530d.png" alt="Image description" width="880" height="420"&gt;&lt;/a&gt;&lt;br&gt;
Graph of Phased Rollout Approach&lt;/p&gt;

&lt;p&gt;The key is monitoring data with feature flagging. You need to be able to see how each feature is individually performing. Whether or not a given feature should be rolled out can depend more on the performance of specific other features, rather than the overall health of the system. Blameless uses &lt;a href="https://www.blameless.com/sre/how-to-choose-monitoring-tools"&gt;monitoring tools&lt;/a&gt; such as &lt;a href="https://www.sumologic.com/"&gt;Sumologic&lt;/a&gt; to parse the information our system outputs. It allows us to break down which features are causing issues or unreliability, and which are stable enough to be built upon.&lt;/p&gt;

&lt;p&gt;Once you know a given iteration is safe, you can roll it out to the designated groups. The modular setup gives an extra layer of protection, as individual features can be rolled back without impacting the entire system. Don’t depend on this going off without a hitch, though. Like any other backup system, simulating the need to roll back an iteration is necessary to understand what your options actually are.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the right canary groups
&lt;/h2&gt;

&lt;p&gt;Another best practice for canarying releases is to use customized canary groups. Intuitively, you might just break your users down into indiscriminate chunks — maybe 10 groups of 10% each. This works fine to get many of the benefits of canarying, but you can get even more insight with tailor-made canary groups.&lt;/p&gt;

&lt;p&gt;To do this, you first need to understand how your users interact with your service. Blameless uses tools such as &lt;a href="https://www.pendo.io/"&gt;Pendo&lt;/a&gt; to see how much each user relies on each feature. This is supplemented by meeting with customer success teams, who can relay reports from customers on what matters most to them. Creating things like &lt;a href="https://www.blameless.com/sre/slis-understand-users-needs"&gt;user journeys and SLIs&lt;/a&gt; can quantify this importance.&lt;/p&gt;

&lt;p&gt;Once you have profiles for your users, create groups for each iteration based on the features in that iteration. Some qualities that you’d like your ideal canary group to have include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;They use the feature.&lt;/strong&gt; If you roll out an updated feature to a group of users that don’t even notice, you won’t get the feedback you need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They don’t use the feature too much.&lt;/strong&gt; On the other hand, updates are more likely to have outages during these canarying phases. Avoid users that wholly rely on a feature to keep them safe from this risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They provide feedback.&lt;/strong&gt; Some users are more inclined to discuss what they think of new updates. These communicative users are ideal canaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, you won’t necessarily find a perfect set of users for every feature. The important thing is to weigh these qualities when building canary groups and to choose the best candidates you have.&lt;/p&gt;
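One rough way to operationalize those three qualities is to score candidate users and rank them. Everything in this sketch — the weights, the usage threshold, and the sample data — is an invented assumption, not a Blameless formula; a real implementation would pull usage numbers from an analytics tool like Pendo.

```python
# Sketch: scoring candidate canary users against the qualities above.
# Moderate feature usage beats none (they won't notice) or heavy
# reliance (too risky), and communicative users score higher.
# All weights and thresholds are invented for illustration.

def canary_score(weekly_uses, gives_feedback, heavy_use_threshold=50):
    if weekly_uses == 0:
        usage_score = 0.0   # they use the feature too little to notice changes
    elif weekly_uses >= heavy_use_threshold:
        usage_score = 0.3   # they rely on it too much to expose to risk
    else:
        usage_score = 1.0   # moderate use: an ideal canary
    feedback_score = 1.0 if gives_feedback else 0.2
    return 0.6 * usage_score + 0.4 * feedback_score

candidates = {
    "user_a": (0, True),    # never uses the feature
    "user_b": (12, True),   # moderate use, communicative
    "user_c": (80, False),  # relies heavily, rarely comments
}
ranked = sorted(candidates, key=lambda u: canary_score(*candidates[u]), reverse=True)
print(ranked)  # ['user_b', 'user_a', 'user_c']
```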

&lt;h2&gt;
  
  
  What you’ll need
&lt;/h2&gt;

&lt;p&gt;To make these frequent, iterative, and specifically targeted deployments, you’ll need a strong deployment system first. Practices like &lt;a href="https://www.blameless.com/devops/devops-ci-cd"&gt;CI/CD&lt;/a&gt; are necessary for this speed and flexibility.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Universal Language: Reliability for Non-Engineering Teams</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Tue, 18 Jan 2022 16:31:08 +0000</pubDate>
      <link>https://dev.to/blameless/the-universal-language-reliability-for-non-engineering-teams-g4b</link>
      <guid>https://dev.to/blameless/the-universal-language-reliability-for-non-engineering-teams-g4b</guid>
      <description>&lt;p&gt;We talk about reliability a lot from the context of software engineering. We ask questions about service availability, or how important it is for specific users. But when organizations face outages, it becomes immediately obvious that the reliability of an online service or application is something that impacts the entire business with &lt;a href="https://surfingcomplexity.blog/2021/11/24/how-much-did-that-outage-cost/"&gt;significant costs&lt;/a&gt;. A mindset of putting reliability first is a business imperative that all teams should share.&lt;/p&gt;

&lt;p&gt;But what does reliability mean for people outside of engineering? And how does it translate into best practices for other teams? In this blog post, we’ll investigate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What reliability means&lt;/li&gt;
&lt;li&gt;Business benefits for adopting a reliability mindset&lt;/li&gt;
&lt;li&gt;How specific business functions adopt a reliability mindset.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What does reliability mean?
&lt;/h2&gt;

&lt;p&gt;Reliability is often something people only think about when it isn’t there. This perspective explains what reliability means to your users. &lt;a href="https://www.blameless.com/sre/availability-maintainability-reliability-whats-the-difference"&gt;It isn’t just about the availability of services&lt;/a&gt;, but how important those services are to users. Let’s compare two different incidents. One causes lag briefly for a service that everyone, including your most valuable customers, uses. The other causes a total outage for an hour, but just for a service that a very small percentage of users, who are generally on a low-tier subscription, ever access. Which one causes more damage to users’ perception of reliability? Being able to answer this question is key to building a reliability mindset and framework to make important decisions.&lt;/p&gt;

&lt;p&gt;A mindset of putting reliability first should be fundamental to how the organization makes decisions. Ultimately, having users access your services without frustration is what keeps them around, and keeping users around is what keeps your organization around. This is why we call reliability “feature #1”: no matter how impressive your other features are, it doesn’t matter if users can’t reliably use them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The business benefits of a reliability mindset
&lt;/h2&gt;

&lt;p&gt;Since it’s fundamental to the entire business, engineers shouldn’t be alone when thinking and planning for reliability. Having reliability inform the decisions of every team cultivates what Google describes as the “strategic” and “visionary” phases of &lt;a href="https://cloud.google.com/blog/products/devops-sre/the-five-phases-of-organizational-reliability"&gt;the reliability spectrum&lt;/a&gt;. These phases are relative to the capabilities of each company. It isn’t about checking off certain milestones as much as building something that works for all of your teams. Let’s look at the benefits of this shared standard of reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  A universal language of what matters
&lt;/h3&gt;

&lt;p&gt;Different teams within an organization can have very different perspectives and priorities. Picture a product team hoping to release an important new feature as soon as possible, an operations team looking to reconfigure the deployment process, and customer success driving to change the development roadmap in response to customer feedback. These goals can conflict, and each team’s perspective and desire for their project to take priority is valid. How do you decide?&lt;/p&gt;

&lt;p&gt;All three teams have a valid claim that their priority is essential to user happiness. The cohorts of users that each project focuses on are different — one is focused on current customers, one is focused on the dev teams (who are internal users), and one is focused on prospective customers. You can’t simply compare the number of “users” affected.&lt;/p&gt;

&lt;p&gt;A reliability mindset provides clarity and a way to make decisions by combining factors, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The types of user experience affected&lt;/li&gt;
&lt;li&gt;Frequency of affected user experiences &lt;/li&gt;
&lt;li&gt;Importance of affected aspects of users’ experience &lt;/li&gt;
&lt;li&gt;How detrimental to their experience a given change could be&lt;/li&gt;
&lt;li&gt;Business importance of keeping the affected user cohorts satisfied&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your ultimate goal is a metric that expresses how much user satisfaction could change based on a decision. This metric would apply not just to code changes, but decisions made by any team. It creates a universal, cross-team language with which to discuss user satisfaction levels and therefore business impact.&lt;/p&gt;
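As a hedged sketch of what such a metric might look like, the factors above can be normalized and combined into a single comparable score. The multiplicative form, the inputs, and both example decisions below are illustrative assumptions, not a prescribed formula.

```python
# Sketch: expressing a decision's expected effect on user satisfaction
# as one comparable number, combining the factors listed above.
# Every weight and input here is an illustrative assumption.

def impact_score(frequency, importance, severity, cohort_value):
    """All inputs are normalized to 0..1.
    frequency:    how often the affected user experience occurs
    importance:   how central that experience is to users
    severity:     how detrimental the change could be to it
    cohort_value: business importance of the affected user cohort
    """
    return frequency * importance * severity * cohort_value

# Comparing two hypothetical decisions on the same scale:
feature_launch = impact_score(0.8, 0.9, 0.3, 0.7)   # broad reach, mild risk
deploy_rework  = impact_score(0.4, 0.9, 0.8, 0.9)   # narrower reach, higher risk

print(f"{feature_launch:.3f} vs {deploy_rework:.3f}")  # 0.151 vs 0.259
```

The point is not the particular arithmetic but that any team’s project, engineering or otherwise, can be placed on the same axis of potential user-satisfaction impact.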

&lt;p&gt;Of course, determining these factors for each decision isn’t trivial. It requires extensive research into how different user cohorts engage with your service, with continuous discussion and revision. This isn’t a downside of reliability focus, however: it’s one of its biggest strengths. Having an ongoing org-wide discussion of what matters to customers is one of the best ways to break down silos and spread knowledge and insight. Getting a complete view of the customer experience comes from many teams — support, sales, success — that can be consolidated under the metric of reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using reliability metrics to hit the brakes or the gas pedal
&lt;/h2&gt;

&lt;p&gt;Let’s look at our three distinct functional projects again. If you’ve established a working definition of reliability, you’ll be in a much better position to prioritize each project based on impact and overarching goals. However, this might not be as simple as just seeing which one makes the biggest positive impact, or has the lowest potential negative impact. You should set a baseline of acceptable reliability. Consider how satisfied the cohorts of users are with service reliability thus far.&lt;/p&gt;

&lt;p&gt;Improving the reliability of a service that users are happy with may not be appreciated or even noticed. Trying to indefinitely improve reliability has rapidly mounting costs and diminishing returns. What’s important is maintaining service reliability at a level that doesn’t cause user pain or friction. When looking at reliability metrics, the focus should be on keeping them at or slightly above that point, rather than improving them as much as possible. At some point, marginal improvement won’t make any positive impact on user satisfaction. Time and energy improving past this point is better spent elsewhere.&lt;/p&gt;

&lt;p&gt;It’s important not just to agree on what reliability looks like for different user experiences, but to also determine the point at which each user experience becomes too unreliable: your &lt;a href="https://www.blameless.com/resources/slos-what-why-and-how"&gt;service level objective&lt;/a&gt;. How each user experience is doing compared to that agreed-upon objective determines how to prioritize projects across teams.&lt;/p&gt;

&lt;p&gt;Returning to our previous scenario, let’s say we know the following about each project’s current user experiences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users are demanding the new feature, but their continued use isn’t dependent on it coming out immediately&lt;/li&gt;
&lt;li&gt;Customer feedback provided to the customer success team for the roadmap change isn’t based on current pain, and can be postponed&lt;/li&gt;
&lt;li&gt;The current deployment process creates a delay of at least a day for each deployment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first two projects will certainly increase customer satisfaction more than a backend change that customers aren’t aware of. However, not implementing those projects right away won’t cause customer unhappiness to the extent that they leave the service. On the other hand, leaving the deployment process as-is could easily lead to scenarios where customers experience pain. Having an agreed-upon view of acceptable reliability allows you to justifiably prioritize operations’ projects.&lt;/p&gt;

&lt;p&gt;By looking at how close each user experience is to unacceptable reliability, you can judge for any given project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If users are happy with the experience, projects that risk an acceptable decrease in reliability can be safe&lt;/li&gt;
&lt;li&gt;If users are unhappy with the experience, projects that improve the reliability of the experience have to be prioritized&lt;/li&gt;
&lt;li&gt;If users’ happiness with the experience is slightly above the acceptable level, projects dealing with that user experience could be safely deprioritized.&lt;/li&gt;
&lt;/ul&gt;
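Those three rules reduce to comparing an experience’s current reliability against its agreed objective. A minimal sketch, in which the “slightly above” margin of 0.05 percentage points is an assumed threshold rather than a standard value:

```python
# Sketch of the prioritization rules above: compare a user experience's
# current reliability to its agreed SLO. The "slightly above" margin
# is an assumed threshold, not a standard value.

def prioritize(current, slo, margin=0.0005):
    if current < slo:
        return "prioritize reliability work"
    if current <= slo + margin:
        return "hold steady; deprioritize risky changes to this experience"
    return "healthy; risky-but-valuable projects are acceptable"

print(prioritize(0.9985, 0.999))   # below the objective
print(prioritize(0.9991, 0.999))   # just above it
print(prioritize(0.9999, 0.999))   # comfortable headroom
```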

&lt;p&gt;Beyond engineering projects, this mentality can apply to marketing campaigns, hiring choices, design roadmaps, and more. It allows for a big picture perspective that keeps everyone happy while still pushing onwards as effectively as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost-benefit of reliability investments
&lt;/h2&gt;

&lt;p&gt;Having an agreed framework and view of what reliability means for your customers is critical to quantify and weigh decisions. It also gives you the ability to better plan for investment pay-back. This is applicable for any investment an organization makes, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New hires&lt;/li&gt;
&lt;li&gt;Infrastructure tools&lt;/li&gt;
&lt;li&gt;New policies or procedures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one will have some initial costs, with the expectation that it will ultimately create more value than that cost. &lt;em&gt;But how do you know how much is being lost and how much you could stand to gain?&lt;/em&gt; Thinking of it in purely financial terms is too narrow. You can’t put a hard dollar value on many of the returns.&lt;/p&gt;

&lt;p&gt;The answer is, unsurprisingly at this point, to think about how to improve and maintain user satisfaction through the lens of reliability. Although a hard dollar value might be difficult to calculate, it is possible to consider the impact to user experiences after an investment is implemented. Likewise, you can think of how that experience could be impacted by the cost of the investment. Potential challenges caused by the investment could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Issues caused by teams getting up to speed on a new policy&lt;/li&gt;
&lt;li&gt;Reprioritizing other projects to implement a tool&lt;/li&gt;
&lt;li&gt;The loss of resources that could have been spent on a user experience otherwise.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you have a thorough cost-benefit investment plan, you can use the same perspective of aiming to maintain an acceptable reliability. An investment can look like it has a big payoff for only a small cost, but if that small cost pushes a user experience over the line and causes user pain, it still may not be worth it.&lt;/p&gt;

&lt;p&gt;For example, a customer success team could determine that investing in a new process for receiving feedback is a worthy investment. Initially, it would cause some frustration among customers who were used to the previous feedback template, but in the long run it would allow for much faster feedback. However, if you know that the user cohort that relies on that template is unhappy and on the verge of leaving the service, that initial cost may not be acceptable at the time. Making that cohort happier with the service could need to happen before you move ahead with the investment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reliability for specific teams
&lt;/h2&gt;

&lt;p&gt;How different business functions think about reliability depends on whether they affect user satisfaction directly or indirectly. For revenue teams, like marketing and sales, decisions won’t directly impact user experiences of the product.&lt;/p&gt;

&lt;p&gt;However, revenue teams create an expectation in the market for what the experience will be. As reliability is based on user perception, expectations can change where a satisfactory point is for users. On the other hand, lowering expectations will also lower interest in the product. Like any other team, revenue teams have to achieve their goals without sacrificing reliability.&lt;/p&gt;

&lt;p&gt;Finding the connection points through reliability and user satisfaction across all functions is critical to driving top-line decisions from the executive team down. This allows you to understand the impact of any decision on the goals of every team.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tKjF_g68--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ufzp8bun5c5x4fzpxnb.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tKjF_g68--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ufzp8bun5c5x4fzpxnb.PNG" alt="How Reliability Works for Different Teams" width="880" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reliability is a team sport. It’s so fundamental to business success that it can’t just be the focus and responsibility of engineers. Instead, every team needs to align on what reliability means and how to prioritize based on it. In upcoming articles, we’ll look at how specific teams can better prioritize around reliability metrics and how they connect to other teams and the wider business goals.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>How to Write Meaningful Retrospectives</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Mon, 13 Dec 2021 19:57:07 +0000</pubDate>
      <link>https://dev.to/blameless/how-to-write-meaningful-retrospectives-24p3</link>
      <guid>https://dev.to/blameless/how-to-write-meaningful-retrospectives-24p3</guid>
      <description>&lt;p&gt;One of the foundations of incident management in SRE practice is the incident retrospective. It documents all the learnings from an incident and serves as a checklist for follow-up actions. If we step back, there are &lt;a href="https://www.blameless.com/incident-response/incident-retrospective-postmortem-template"&gt;7 main elements to a retrospective&lt;/a&gt;. When done right, these elements help you better understand an incident, what it reveals about the system as a whole, and how to build lasting solutions. In this article, we’ll break down how to elevate these 7 elements to produce more meaningful retrospectives.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Messages to stakeholders
&lt;/h2&gt;

&lt;p&gt;Incident retrospectives can be the core of your post-incident communication with customers and other stakeholders. We talk a lot about how retrospectives function best when they involve input and feedback from all relevant stakeholders. That doesn't necessarily mean squeezing tons of folks into one meeting or sending out one long PDF to a large group without thoughtful consideration.&lt;/p&gt;

&lt;p&gt;The best example of this is distinguishing between customer stakeholders and internal team stakeholders. Customers should be kept in the loop and assured that a resolution is imminent or has already come, but they probably don't need to know (or shouldn't know) the minutiae.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blameless.com/incident-response/how-to-communicate-incident-retrospectives-to-stakeholders"&gt;Communicating retrospectives to stakeholders&lt;/a&gt; requires empathizing with how they use your services. Describe the incident in the context of what matters most. But don’t beat around the bush, either — you don’t want to come across like you’re hiding or downplaying the impact. Simple, factual statements such as “if you use service x to do y, you lost that ability for 12 hours” is enough to convey your understanding.&lt;/p&gt;

&lt;p&gt;Once you’ve established the impact, start to regain trust. Reassure stakeholders about relevant things that didn’t go wrong. In the aftermath of an incident, stakeholders could be worried that there are other problems that weren’t reported. Explicitly state that there wasn’t any data lost, or private information made public, or any other relevant concerns.&lt;/p&gt;

&lt;p&gt;Share your action plans with stakeholders too. They may not have the context to understand the details of your solution, but you can explain the impact your plan will have. Be direct to convey your confidence. Again, simple statements work great: “The outage was caused by insufficient server bandwidth. A new process will automatically expand bandwidth in response to increased load. This will prevent an incident like this in the future.” This is the language of scientific research, which removes personal pronouns from the prose. It’s a great way to keep statements simple, avoid finger-pointing, and remain factual and ideally data-driven.&lt;/p&gt;

&lt;p&gt;By expanding your message to stakeholders in this way, they’ll see that their pain has been understood, and addressed systematically and enduringly. &lt;/p&gt;

&lt;h2&gt;
  
  
  2. Monitoring context
&lt;/h2&gt;

&lt;p&gt;In more technical retrospectives, generally for study by internal development teams, it’s useful to include any monitoring data your system captures at the time of the incident. Did the incident occur during significant traffic? Did it also lead to slowdowns in other areas of the system? This information can lead to helpful revelations.&lt;/p&gt;

&lt;p&gt;But you can go even further with this data! Include long-term baseline measurements for these metrics to provide a standard. You might notice that some metrics follow a pattern that accounts for anomalies during the incident. Don’t mistake coincidence for causation.&lt;/p&gt;

&lt;p&gt;Also note where your monitoring data was insufficient. Can you think of any metrics that, if you were capturing them, could have tipped you off about the incident earlier? One of the main goals of the retrospective is to drive systemic change. Look for these opportunities to improve your monitoring system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QlL0zpaM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jh8tfo5pmtsr45hqe6b5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QlL0zpaM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jh8tfo5pmtsr45hqe6b5.png" alt="Graph showing an incident in the context of other data" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Communication timelines
&lt;/h2&gt;

&lt;p&gt;Hopefully you have a tool to easily build a communication timeline from Slack, MS Teams, or whatever else you use to chat. It’s important to know what steps were taken, how long they took, and when breakthroughs were made. Include information about what roles people played and what tasks they were assigned. &lt;/p&gt;

&lt;p&gt;However, it’s also important to see where miscommunication occurred. Did people do redundant work? Were some tasks or steps forgotten or skipped over? Were there misunderstandings about expectations? Note these issues &lt;strong&gt;blamelessly&lt;/strong&gt;. It’s not someone’s fault if they overlooked something; they were doing their best in a stressful situation. That’s why you need policies and procedures to cover the gaps. Investigate these issues to develop policies that would prevent them.&lt;/p&gt;

&lt;p&gt;Inevitably, your war room discussion will have some chatter. You probably want to make an “all-business” retrospective that leaves out anything irrelevant, and that’s likely the right move for retrospectives that will be seen by external stakeholders. For internal retrospectives, though, this extra expression can be valuable. It’s good to see how people were feeling during an incident, when they felt stress and relief. It can open up thinking about the human side of incident response, and makes the retrospective more fun to review later.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Contributing factors
&lt;/h2&gt;

&lt;p&gt;A big part of the retrospective is uncovering why the incident happened. Without determining that, you can’t make systemic changes to be stronger for next time. The key to making meaningful and enduring changes is to dig deep. Techniques such as the &lt;a href="https://en.wikipedia.org/wiki/Five_whys"&gt;five whys&lt;/a&gt; can help you find the causes behind causes. Illustrating it with tools like the &lt;a href="https://en.wikipedia.org/wiki/Ishikawa_diagram"&gt;Ishikawa diagram&lt;/a&gt; can make it easier to understand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JIiiAJNF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9b7fv2pvttqfjspoat1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JIiiAJNF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9b7fv2pvttqfjspoat1.png" alt="Ishikawa diagram for a retrospective" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When digging for these factors, be holistic. Don’t just think about technical issues, but dive into problems with training, headcounts, stress, personal factors in engineers’ lives — anything that could have impacted how people work on your system. Pulling management and other teams into these discussions may be necessary when reflecting on major incidents.&lt;/p&gt;

&lt;p&gt;Of course, all this investigation should be done &lt;strong&gt;blamelessly&lt;/strong&gt;. Assume everyone’s good faith and best intentions. If a mistake was made, look into what information or safeguards could have prevented it. Settling for punishing an individual will prevent you from making major systemic improvements.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Technical analysis
&lt;/h2&gt;

&lt;p&gt;This is a section mostly for your engineering teams. If there are factors here that should be understood by non-engineers, be sure to provide that information and its impact elsewhere in the retrospective report. Here, you should be detailed enough that future engineers can get useful information when resolving future, similar incidents.&lt;/p&gt;

&lt;p&gt;As you did with monitoring data, you should include information about how the code should work, and how it usually works. This context is important, as the intended function of the code may have changed by the time someone reviews it. You should also discuss how future development is expected to impact code in production. Knowing how the code is expected to run in production allows you to be keenly aware when incidents occur.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Followup actions
&lt;/h2&gt;

&lt;p&gt;This is one of the most important parts of the retrospective. All of your learning about why the incident happened should transform into actions. Find ways to change the factors that led to the incident. The retrospective can act as a hub for tracking these items. As you review the retrospective, check and make sure they’re progressing.&lt;/p&gt;

&lt;p&gt;The followup actions don’t just have to address the direct causes of the incident. This is also an opportunity to improve your incident response policies, your tools of measuring the impact of incidents (like &lt;a href="https://www.blameless.com/sre/service-level-objectives"&gt;SLOs&lt;/a&gt;), your monitoring setup, even your retrospective standards! You can never be too holistic when solving problems.&lt;/p&gt;

&lt;p&gt;To motivate people to work on these followup tasks, include some context. Summarize why each action was chosen after the incident. Also discuss the impact it will have in preventing future incidents. No one wants to spend time responding to the same or similar incidents over and over. It’s not only soul-destroying for the team, which can quickly lead to burnout. It’s also not good for business. You should include enough information that people will understand the importance without having to reread the entire retrospective.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Narrative
&lt;/h2&gt;

&lt;p&gt;A narrative summary of the incident is often overlooked. It won’t contain any new information, but it’s still useful. All of this information can be overwhelming, so use this part of the retrospective as a way to make the incident approachable for future study. Think about it in terms of a story. You start by describing things as they should be. Then you introduce the problem and how it disrupts the norm. Walk through the experience of the affected customers as well as the team that was tasked to solve the issue. Cover what they tried, what worked, and what they learned.&lt;/p&gt;

&lt;p&gt;Rather than details, you should focus on impact in this section. How severe was the incident, and what made it so? When studying the incident later, many details will be irrelevant to the current system. However, understanding how people responded when things went very poorly will always be a useful lesson.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>How to Analyze Contributing Factors Blamelessly</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Mon, 06 Dec 2021 21:34:04 +0000</pubDate>
      <link>https://dev.to/blameless/how-to-analyze-contributing-factors-blamelessly-189f</link>
      <guid>https://dev.to/blameless/how-to-analyze-contributing-factors-blamelessly-189f</guid>
      <description>&lt;p&gt;SRE advocates addressing problems blamelessly. When something goes wrong, don’t try to determine who is at fault. Instead, look for systemic causes. Adopting this approach has many benefits, from the practical to the cultural. Your system will become more resilient as you learn from each failure. Your team will also feel safer when they don’t fear blame, leading to more initiative and innovation.&lt;/p&gt;

&lt;p&gt;Learning everything you can from incidents is a challenge. Understanding the benefits and best practices of analyzing contributing factors can help. In this blog post, we’ll look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A definition for root cause analysis&lt;/li&gt;
&lt;li&gt;A definition for contributing factor analysis&lt;/li&gt;
&lt;li&gt;How to choose between RCAs and contributing factor analysis&lt;/li&gt;
&lt;li&gt;Best practices for contributing factor analyses&lt;/li&gt;
&lt;li&gt;How to incorporate learning from analyses back into development&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is a root cause analysis?
&lt;/h2&gt;

&lt;p&gt;Root cause analysis, or RCA, is a method for finding the reason an incident occurred. Here it is, summarized in four steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identify the incident.&lt;/strong&gt; You should understand the exact boundary of what is and isn’t considered part of the incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create a timeline.&lt;/strong&gt; Log all events impacting the system. Start when the aberrant behavior begins and end when the system returns to normal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Judge the events for causality.&lt;/strong&gt; Consider the impact of each event leading up to the incident. Did it indirectly or directly cause the incident? Was it necessary for the incident to happen? Was it irrelevant?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build a causal diagram.&lt;/strong&gt; A causal diagram or graph is an illustrative tool. It shows how events contribute to the incident. Here is an example:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hQ5R7FfJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8mrru1509g8srvcq7gou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hQ5R7FfJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8mrru1509g8srvcq7gou.png" alt="Causal diagram showing how new server updates being deployed on Fridays coinciding with a new feature launch can cause servers to overload, resulting in an outage." width="600" height="900"&gt;&lt;/a&gt;&lt;br&gt;
Example causal diagram&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a contributing factor analysis?
&lt;/h2&gt;

&lt;p&gt;A contributing factor analysis is another methodology for examining an incident. Rather than pinpoint a single root cause of an incident, the contributing factor analysis looks for a broader range of factors. This is a more holistic approach. It considers technical, procedural, and cultural factors. For the above example of a server outage, here are some factors you may also consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The feature launch schedule doesn’t account for server update timings&lt;/li&gt;
&lt;li&gt;No policy to scale up server availability for feature launches&lt;/li&gt;
&lt;li&gt;Server architecture could be updated to support more traffic&lt;/li&gt;
&lt;li&gt;Incident response team could be overworked with new feature launch, delaying backup server availability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Contributing factor analysis should be part of a larger incident retrospective approach. Teams should try to identify contributing factors that can lead to actionable change.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you choose between an RCA and a contributing factor analysis?
&lt;/h2&gt;

&lt;p&gt;RCAs and contributing factor analyses each have their use cases. RCAs are often formally required, while contributing factor analyses are a useful internal tool. Let’s break down why.&lt;/p&gt;

&lt;h3&gt;
  
  
  When are RCAs used?
&lt;/h3&gt;

&lt;p&gt;RCAs can be part of an organization’s official response to an incident. Because they are often public-facing, they have strict guidelines for formatting. This standardization can be challenging. In a &lt;a href="https://www.blameless.com/blog/modern-operations-best-practices-from-engineering-leaders-at-new-relic-and-tenable"&gt;discussion with Blameless&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/benders/"&gt;Nic Benders&lt;/a&gt; from &lt;a href="https://newrelic.com/"&gt;New Relic&lt;/a&gt; shared his thoughts on RCAs:&lt;/p&gt;

&lt;p&gt;“The RCA process is a little bit of a bad word inside of New Relic. We see those letters most often accompanied by ‘Customer X wants an RCA.’ Engineers hate it because they are already embarrassed about the failure and now they need to write about it in a way that can pass Legal review.”&lt;/p&gt;

&lt;p&gt;Even if they’re unpleasant, RCAs can be necessary. Customers have come to expect openness around failure. &lt;a href="https://www.linkedin.com/in/khannadheeraj/"&gt;Dheeraj Khanna&lt;/a&gt; from &lt;a href="https://www.tenable.com/"&gt;Tenable&lt;/a&gt; explains:&lt;/p&gt;

&lt;p&gt;“Today, the industry has become more tolerant to accepting the fact that if you have a vendor, either a SaaS shop or otherwise, it is okay for them to have technical failures. The one caveat is that you are being very transparent to the customer. That means that you are publishing your community pages, and you have enough meat in your status page or updates.”&lt;/p&gt;

&lt;h3&gt;
  
  
  When are contributing factor analyses used?
&lt;/h3&gt;

&lt;p&gt;Contributing factor analyses help translate the causes of an incident into actionable changes. As this document is for internal use, teams can be more open about the failure and learn from it.&lt;/p&gt;

&lt;p&gt;Nic Benders discusses the shortcomings of RCAs in capturing these areas. “It remains challenging for me to try and find a way to address those people skills and process issues. Technology is the one lever that we pull a lot, so we put a ton of technical fixes in place. But, there are three elements to those incidents. And I worry that we're not doing a good job approaching the other two: people skills and processes.”&lt;/p&gt;

&lt;p&gt;When trying to learn the most you can from incidents, looking at all contributing factors is a must. Although you may need both types of analysis, contributing factor analyses are often more useful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Best practices for blameless contributing factor analysis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Remove the value of blame.&lt;/strong&gt; While analyzing an incident, blame offers an easy answer. Placing fault on an individual removes responsibility from the system. This means that no changes are necessary to the system; the work is already done. Don’t accept blame as a solution. By focusing on systemic causes, you can learn more and improve your system further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Look beyond individuals.&lt;/strong&gt; Humans aren't perfect. Imagine that while conducting a retrospective, the team realizes an alert was triggered, but a team member ignored it. Why? It's time to dig deeper than the individual. Are alerts often noisy or irrelevant? Has this person had enough on-call training and experience? Or have they been on call for too long without a break? By asking these questions, you can arrive at meaningful lessons. It is the best way to ensure the mistake doesn’t happen again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Celebrate failure.&lt;/strong&gt; When uncovering factors, celebrate each one as an opportunity for learning. It may seem that the more factors you uncover, the more work you’ve made for yourselves. You don’t want this to discourage team members from suggesting other factors. Create a psychologically safe environment for people to brainstorm. Make sure each contribution is valued.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to feed learning from analyses back into development
&lt;/h3&gt;

&lt;p&gt;One of the key benefits of a contributing factor analysis is generating actionable insights into the system. But how do you ensure that these lessons lead to changes in development and policy? Here are some tips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a central repository of required actions per incident&lt;/li&gt;
&lt;li&gt;Invite development teams to incident review meetings&lt;/li&gt;
&lt;li&gt;Bake action items into future sprints, working with product when necessary&lt;/li&gt;
&lt;li&gt;Link learning and tasks to larger initiatives for the organization&lt;/li&gt;
&lt;li&gt;Have review meetings after task completion to ensure the desired changes occurred&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Keep a cycle flowing between the causes of incidents and the changes you make. This will help your system continually improve in relevant ways.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
    </item>
    <item>
      <title>DevOps &amp; SRE Words Matter: How Our Language has Evolved</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Thu, 18 Nov 2021 20:31:15 +0000</pubDate>
      <link>https://dev.to/blameless/devops-sre-words-matter-how-our-language-has-evolved-dp5</link>
      <guid>https://dev.to/blameless/devops-sre-words-matter-how-our-language-has-evolved-dp5</guid>
      <description>&lt;p&gt;As the tech world changes, language changes with it. New technologies will always introduce new terms and descriptions to provide clear understanding. For example, the emergence of the cloud introduced language to describe the changing relationship between servers and clients. Then, of course, product providers will also dictate how their products are to be described, i.e. describing services as “cloud-native”.&lt;/p&gt;

&lt;p&gt;On other occasions, language changes through deliberate effort to influence behavior. Thought leaders will often invent alternative words to describe existing ideas in order to effect cultural change. Even a slight change in diction can massively affect one’s engagement, attitude, and even worldview. In this blog, we’ll look at how language colors how we perceive our environments, and we’ll break down three examples of how language has evolved in tech.&lt;/p&gt;

&lt;h2&gt;
  
  
  How language affects and shifts world perspectives
&lt;/h2&gt;

&lt;p&gt;We all have associations with language. Because of our past experiences and culture, different types of messages will trigger different emotional responses. The language we use thus &lt;a href="https://www.psychologytoday.com/ca/blog/the-biolinguistic-turn/201702/how-the-language-we-speak-affects-the-way-we-think"&gt;influences the way we think&lt;/a&gt;. Whether our associations are positive or negative can impact things such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether we dread something or get excited by it&lt;/li&gt;
&lt;li&gt;How important we perceive something to be&lt;/li&gt;
&lt;li&gt;Whether we perceive something to be collaborative or combative, innovative or legacy, bleeding-edge or mainstream, safe or provocative&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  “Postmortem” vs. “Retrospective”
&lt;/h2&gt;

&lt;p&gt;Both of these terms refer to &lt;a href="https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how"&gt;a document&lt;/a&gt; that summarizes a past incident and the steps that were taken to resolve it. “Postmortem” was originally a medical term dating back to the 1820s. The metaphorical usage of examining other things after their “death” has been widely used in many industries, including tech.&lt;/p&gt;

&lt;p&gt;In recent years, many organizations have been differentiating the idea of a retrospective from a postmortem as the cultural mindset shifts to ongoing learning from events and failures. The two practices are commonly considered to have some small differences, such as the timing and content of the documents. However, just as important as these differences are the psychological effects of the terminology being used, especially when these may be conducted in a high-pressure environment. Here are some of the reasons we’re using “retrospective” instead of &lt;a href="https://www.blameless.com/sre/what-are-blameless-postmortems-do-they-work-how"&gt;“postmortem” at Blameless&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The negativity of postmortems:&lt;/strong&gt; death has a negative association in most people’s minds. As responders attend to incidents, the negative connotation lingers. Engineers may feel worried about the consequences of an incident, and the idea of “death” surrounding this process may encourage feelings of guilt and fear. By removing negative associations, people will be more eager to review and look back at what actually occurred and take the time to revisit it as a team. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The finality of postmortems:&lt;/strong&gt; at Blameless, we don’t see failure as the end. We see it as an opportunity to learn and grow, a starting point for positive change. Postmortems are very final; no examination happens “post-postmortem”. A retrospective implies that you’re looking back at something that just happened or occurred a while ago, and that could still have a purpose in the future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The wide scope of retrospectives:&lt;/strong&gt; a postmortem is defined by the single moment of failure and works backwards to determine the causes. A retrospective is concerned with more than just the direct causes of failure. Instead, it seeks to tell the complete story of the service, systems, and people, up to and beyond the incident.&lt;/p&gt;

&lt;p&gt;We want our &lt;a href="https://www.blameless.com/blog/incident-retrospective-postmortem-template"&gt;incident retrospectives&lt;/a&gt; to be documents that we are proud to contribute to, that serve as hubs of learning and impetus for change going forward. We believe that by using the word “retrospective”, it conveys this intent much better than “postmortem”.&lt;/p&gt;

&lt;h2&gt;
  
  
  “Root Cause Analysis” vs “Contributing Factors Analysis”
&lt;/h2&gt;

&lt;p&gt;When determining why something went wrong, there are several competing schools of thought. The root cause analysis, or RCA, is a popular tool for uncovering the reason for failure. The idea of a “root cause” as being the primary factor causing failure dates back to the early 1900s, with “root cause analysis” emerging as a concept in engineering companies in the 1930s. It is commonly attributed to Sakichi Toyoda, the industrialist whose company grew into the Toyota group, who developed the &lt;a href="https://en.wikipedia.org/wiki/Five_whys"&gt;Five Whys&lt;/a&gt; technique to find root causes.&lt;/p&gt;

&lt;p&gt;Contributing factor analysis is a more recent term that has been growing in popularity. It also seeks to understand the causes of an incident, but with a different mindset. That mindset is reflected in the language itself as much as any specific practice. Let’s look at some examples of these differences, and why we at Blameless feel the contributing factors analysis is more useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The singularity of RCAs:&lt;/strong&gt; the most obvious difference is that a root cause analysis refers to a singular root cause, where contributing factors emphasizes multiple factors. This is more important than it may seem. If you set out looking for a singular cause, you’ll resist branching out to other impactful areas. For example, if you only look for an engineering cause, you’ll disregard factors arising from product design or team culture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hierarchy of RCAs:&lt;/strong&gt; the idea of a “root” cause is that it is the source from which other causes grow and branch off. Understanding what causes are more significant for the incident is necessary to properly prioritize follow-up items, but it isn’t the full story. You have to also consider how these changes will affect the team and system as a whole. Thinking about each factor’s contribution without trying to determine which is the “root” keeps you more open-minded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The neutrality of contribution:&lt;/strong&gt; when considering the cause of an incident, you’ll be inclined to find failures, mistakes, and other negative things. Instead you can think about every factor that contributed to the story of the incident - including things that went well, like helpful playbooks and good communication. The totality of this factor analysis gives you a more complete picture of how to respond to incidents going forward.&lt;/p&gt;

&lt;p&gt;Blameless advocates SRE as a holistic practice, one that incorporates learning from all available sources. The Contributing Factors Analysis brings in as many sources as possible to best understand incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  “Disaster Recovery” vs “Incident Response”
&lt;/h2&gt;

&lt;p&gt;The overall process initiated by something going wrong has gone by different names over the years. The attitudes people have towards this have changed alongside the evolution of language and terminology. At first, organizations typically referred to this as disaster recovery. This terminology dates back to the 1970s, where it focused on how systems would recover if natural (or other) disasters wiped out infrastructure and its ability to operate.&lt;/p&gt;

&lt;p&gt;As IT systems became more virtual, outages started to be caused by a much wider range of technical aspects other than natural disasters. Organizations moved to referring to this process as incident response to reflect the range of problems and new processes and tools. Also, the processes themselves evolved along with the technology changes. Let’s look at how these terms reflect the attitudes of each era, and why we now use incident response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The singularity of recovery:&lt;/strong&gt; incident response, sometimes referred to as incident management, is much more than just restoring the environment to its previous state. After services are back online, you still need to gather information from the incident itself and build a retrospective, develop action items to carry the learning forward, and review the effectiveness of the response steps and procedures. Recovery is really only the first step towards resolution, and doesn’t convey how you can get the most learning and improvement from each incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The severity of disasters:&lt;/strong&gt; people see disasters as major catastrophic events. Setting up policies and procedures to trigger only in the event of a “disaster” is a very high bar. However, your incident response process should work just as efficiently for all incidents. In other words, not all incidents are “Sev 1”, and so knowing the right steps to take for each incident is equally important. We believe there’s learning in every incident, and so every incident is worth responding to properly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The inevitability of incidents:&lt;/strong&gt; disasters are also thought of as something to avoid at all costs. Any effort spent on reducing the chances of a disaster would be justified, given how severe disasters can be for both customers and engineering teams. A goal of zero disasters seems reasonable. However, we know that 100% reliability is impossible. By recognizing the inevitability of incidents, you embrace them and avoid overspending on infrastructure and other resources in trying to prevent them. Using the term “incidents” vs. “disasters” helps team members understand their true inevitability and impact.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>Here's what SLIs AREN'T</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Thu, 11 Nov 2021 18:28:11 +0000</pubDate>
      <link>https://dev.to/blameless/heres-what-slis-arent-p9n</link>
      <guid>https://dev.to/blameless/heres-what-slis-arent-p9n</guid>
      <description>&lt;p&gt;SLIs, or service level indicators, are powerful metrics of service health. They’re often built up from simpler metrics that are monitored from the system. SLIs transform lower level machine data into something that captures user happiness.&lt;/p&gt;

&lt;p&gt;Your organization might already have processes with this same goal. Techniques like real-time telemetry and using synthetic data also build metrics that meaningfully represent service health. In this article, we’ll break down how these techniques vary, and the unique benefits of adopting SLIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are SLIs?
&lt;/h2&gt;

&lt;p&gt;Service level indicators, or SLIs, are metrics that represent your service’s health in specific areas. They can be simple metrics, like the percentile latency of requests to a method, or complex metrics, like a latency histogram across 3 different methods. Complexity aside, the most important goal of an SLI is to quantify customer satisfaction. &lt;/p&gt;

&lt;p&gt;For example, you might have SLIs that reflect the user experience of adding an item to their shopping cart, which triggers a pop-up showing the cart’s current contents. These metrics can include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How long it takes for the database to update the shopping cart internally&lt;/li&gt;
&lt;li&gt;How long it takes for the database to update the customer’s current total cost&lt;/li&gt;
&lt;li&gt;How long it takes for the shopping cart pop-up to load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might also know that most users don’t immediately click through the pop-up to see the full shopping cart page. If you are experimenting with advanced SLIs, you can give the pop-up link metric less weight in the SLI than the others. The end result is a single &lt;strong&gt;composite SLI&lt;/strong&gt;, composed of all of the smaller indicators above, that represents how satisfied the customer is with adding something to their cart.&lt;/p&gt;

&lt;p&gt;SLIs are always tied to an SLO, or service level objective. The SLO sets the point at which the company no longer accepts the unreliability of the SLI and the resulting inconvenience to customers. In our example, you might determine that adding something to the shopping cart should take 500ms or less 99% of the time. Maintaining this SLO, by keeping this latency SLI at or above 99%, ensures your customers remain happy.&lt;/p&gt;
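As an illustrative sketch of the example above (all metric names, latency samples, thresholds, and weights here are hypothetical, not taken from a real system), a composite SLI can be computed as a weighted average of per-metric SLIs and compared against a 99% SLO:

```python
# Hypothetical composite-SLI sketch; every number below is made up.

def latency_sli(samples_ms, threshold_ms):
    """Fraction of operations completing within threshold_ms."""
    return sum(s <= threshold_ms for s in samples_ms) / len(samples_ms)

metrics = {
    # name: (observed latencies in ms, threshold in ms, weight)
    "cart_db_update":    ([120, 90, 480, 610, 200], 500, 1.0),
    "total_cost_update": ([60, 45, 70, 300, 80], 500, 1.0),
    "popup_load":        ([150, 900, 200, 180, 160], 500, 0.5),  # down-weighted
}

weighted = sum(latency_sli(s, t) * w for s, t, w in metrics.values())
composite_sli = weighted / sum(w for _, _, w in metrics.values())

slo = 0.99
print(f"composite SLI: {composite_sli:.1%}, SLO breached: {composite_sli < slo}")
```

The down-weighting of `popup_load` mirrors the idea that fewer users rely on the pop-up link, so its unreliability hurts overall satisfaction less.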

&lt;p&gt;Now that we’ve taken a look at what SLIs are, let’s look at what SLIs AREN’T.&lt;/p&gt;

&lt;h2&gt;What is real-time telemetry?&lt;/h2&gt;

&lt;p&gt;Real-time telemetry is the practice of observing data coming from a system as it runs. The concept of telemetry is used in a wide variety of industries, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agricultural telemetry places monitoring stations in fields to relay data about the conditions of the crops&lt;/li&gt;
&lt;li&gt;Medical telemetry includes devices embedded in the body which transmit reports on health conditions&lt;/li&gt;
&lt;li&gt;Aerospace telemetry uses sensors to relay the conditions of aircraft back to pilots&lt;/li&gt;
&lt;li&gt;Retail telemetry tracks sales of each product at each location and correlates them to find trends&lt;/li&gt;
&lt;li&gt;Server monitoring tracks CPU usage over time to indicate over- or under-utilization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In each example, the basic process is the same: a monitoring tool is deployed within the system which then reports back to a central repository. The repository is then analyzed to make informed decisions about the system.&lt;/p&gt;

&lt;p&gt;This process is the same for telemetry in software. Code is added to each service that continually updates a log of the service’s behavior. The log is then monitored to determine the health of the system.&lt;/p&gt;
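In software terms, that instrumentation can be sketched as follows. This is a minimal, hypothetical example, not a specific monitoring tool's API; the metric name and log structure are assumptions:

```python
# Minimal in-process telemetry sketch: a request handler records a
# timestamped measurement to a log, standing in for the central
# repository a real monitoring pipeline would write to.
import time

telemetry_log = []  # stand-in for a central telemetry repository

def record(metric, value):
    """Append one timestamped data point to the log."""
    telemetry_log.append({"ts": time.time(), "metric": metric, "value": value})

def handle_request():
    start = time.perf_counter()
    # ... the service's real work would happen here ...
    elapsed_ms = (time.perf_counter() - start) * 1000
    record("request_latency_ms", elapsed_ms)

handle_request()
print(telemetry_log[-1]["metric"])
```

A monitor would then aggregate entries from this log to judge the health of the system.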

&lt;h3&gt;How are SLIs and real-time telemetry different?&lt;/h3&gt;

&lt;p&gt;Both SLIs and real-time telemetry report on the health and reliability of your system. However, &lt;strong&gt;SLIs are more focused on user experience&lt;/strong&gt; than overall system health. &lt;a href="https://www.blameless.com/blog/definition-of-software-reliability"&gt;Reliability is a subjective term&lt;/a&gt; reflecting how users perceive the responsiveness of your service. SLIs are based on the aspects of your service that quantify customer satisfaction, whereas telemetry generally reports neutrally.&lt;/p&gt;

&lt;p&gt;Because of this focus on the user experience, &lt;strong&gt;SLIs use more black box monitoring than telemetry&lt;/strong&gt;. &lt;a href="https://devops.com/black-box-vs-white-box-monitoring-what-you-need-to-know/"&gt;Black box and white box monitoring&lt;/a&gt; refer to whether data is gathered from within the system’s code (white box) or by testing the system from the outside, as a user would (black box). SLIs aim to account for every factor in the most critical user experiences, so gathering data from a user’s perspective is helpful.&lt;/p&gt;

&lt;p&gt;Also because of the focus on user experience, &lt;strong&gt;SLIs are always tied to an SLO, or &lt;a href="https://www.blameless.com/blog/sli-slo-sla"&gt;service level objective&lt;/a&gt;&lt;/strong&gt;. SLOs are set to the point where the user is pained by the unreliability of the SLI. Unlike telemetry, which neutrally reports on system health, SLIs are always seen in the context of an SLO. Until the SLO is in danger of being breached, changes in the SLI aren’t always cause for alarm. You can monitor the rate at which the SLI approaches the SLO, and adjust your velocity accordingly. SLIs allow you to prioritize responses based on customer impact.&lt;/p&gt;
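One way to make "the rate at which the SLI approaches the SLO" concrete is to track the remaining error budget. This is a hedged illustration with made-up numbers, not a prescribed alerting rule:

```python
# Illustrative sketch: an SLI only matters relative to its SLO. The error
# budget is the allowed unreliability (1 - SLO); how much of it is spent
# tells you whether a dip in the SLI is actually cause for alarm.

def error_budget_remaining(sli, slo):
    """Fraction of the error budget still unspent (1.0 = untouched)."""
    budget = 1 - slo   # e.g. 1% allowed unreliability for a 99% SLO
    spent = 1 - sli    # observed unreliability
    return max(0.0, 1 - spent / budget)

slo = 0.99
for sli in (0.999, 0.995, 0.990, 0.985):
    remaining = error_budget_remaining(sli, slo)
    print(f"SLI {sli:.1%} -> {remaining:.0%} of error budget remaining")
```

An SLI of 99.9% against a 99% SLO leaves most of the budget intact, so velocity can stay high; a reading below the SLO means the budget is exhausted and customer-facing reliability work takes priority.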

&lt;h2&gt;What is synthetic data?&lt;/h2&gt;

&lt;p&gt;Synthetic data refers to data which isn’t directly observed from a system, but comes from simulations of the system. This helps you gather information about how the system would respond in rare situations, or situations that are difficult to directly measure.&lt;/p&gt;

&lt;p&gt;Synthetic data can also refer to simulating usage of your real system in order to gather results. This helps you see the effects of rare or extreme use cases, or use cases that are difficult to observe when they naturally occur.&lt;/p&gt;

&lt;p&gt;In both cases, you’re abstracting away from your real system or real users to access new information. Getting accurate results requires accurate models. You need to determine whether the investment in building models is worth the information gained.&lt;/p&gt;
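A hedged sketch of the simulation idea, with entirely made-up distribution parameters, might generate synthetic request latencies that include a rare spike you would otherwise have to wait to observe:

```python
# Illustrative synthetic-data sketch: simulating request latencies with an
# occasional large spike, to probe a rare scenario without waiting for it
# to occur in production. The model and its parameters are assumptions.
import random

def simulate_latencies(n, base_ms=100.0, spike_prob=0.02, spike_ms=2000.0, seed=42):
    """Generate n simulated latencies; a small fraction include a large spike."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    samples = []
    for _ in range(n):
        latency = rng.gauss(base_ms, 20.0)
        if rng.random() < spike_prob:
            latency += spike_ms
        samples.append(max(0.0, latency))
    return samples

data = simulate_latencies(1000)
p99 = sorted(data)[int(0.99 * len(data))]
print(f"simulated p99 latency: {p99:.0f} ms")
```

The value of such a model depends entirely on how well `spike_prob` and friends match reality, which is exactly the modeling investment the paragraph above describes.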

&lt;h3&gt;How are SLIs and synthetic data different?&lt;/h3&gt;

&lt;p&gt;Whereas synthetic data is helpful for extreme cases, &lt;strong&gt;SLIs focus on the most common and important use cases&lt;/strong&gt;. SLIs can be built by studying &lt;a href="https://www.blameless.com/blog/slis-understand-users-needs"&gt;user journeys&lt;/a&gt;, which track how a user typically interacts with your service. The goal is to encapsulate the most common ways users rely on your service into metrics.&lt;/p&gt;

&lt;p&gt;SLIs and synthetic data also differ in their intent. Synthetic data is usually created for a particular experiment or test. The service is modeled under the chosen conditions or is accessed with the chosen use cases. Once the scenario is explored, that particular use of synthetic data is likely discontinued. On the other hand, &lt;strong&gt;SLIs continually reflect the real use of services in production&lt;/strong&gt;. Rather than seeking new scenarios, you’re making sure incidents don’t impact regular operations.&lt;/p&gt;

&lt;p&gt;Here’s a summary of some key differences between SLIs, real time telemetry, and synthetic data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iLE5lfQn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zyusbsoipehppxzx93y9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iLE5lfQn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zyusbsoipehppxzx93y9.png" alt="Table showing SLIs vs telemetry and synthetic data" width="655" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Why try SLIs?&lt;/h2&gt;

&lt;p&gt;SLIs have many unique benefits for your organization. Here are a few worth considering:&lt;/p&gt;

&lt;h3&gt;SLIs align goals on customer happiness&lt;/h3&gt;

&lt;p&gt;It can be difficult to know where to allocate your resources for improving reliability. Ultimately, you know that customer happiness is the most important factor for your organization. But how do you know your efforts will make your customers happy? SLIs provide the solution.&lt;/p&gt;

&lt;p&gt;SLIs are built by studying user journeys. These model the most common ways customers use your services. If most of your customers use the search functions and login page for your site, you can prioritize those service areas highly in your SLIs. Conversely, if very few customers use another service area, you can reduce the number of SLIs or even eliminate them for that service.&lt;/p&gt;

&lt;p&gt;When considering development projects or operations policies, you can consider how they’ll affect the SLI. Let’s revisit our example of the shopping cart update SLI. If you were to make a change to how the database links items for sale with customers, it could change the speed of the involved metrics. You can estimate how such a change would propagate to the SLI. If it would risk breaching the SLO, you should reevaluate the decision. If not, you can be more confident in moving ahead.&lt;/p&gt;

&lt;p&gt;This creates a bridge between the most basic monitoring data and the ultimate goal of customer happiness. All teams can look at how their choices will impact the basic metrics, and align their decision-making based on the SLI.&lt;/p&gt;

&lt;h3&gt;SLIs quantify customer happiness in an actionable way&lt;/h3&gt;

&lt;p&gt;Since SLIs reflect the areas that impact customer happiness, they allow you to track customer happiness as a metric. SLIs are also always tied to an SLO, which sets the point where the SLI becomes unacceptable to the customer. These metrics allow you to see how much an incident impacts your customers. This allows you to triage and classify incidents in a meaningful and actionable way.&lt;/p&gt;

&lt;p&gt;For example, if you experience a server outage that takes down certain service areas, it can be difficult to understand exactly what the impact was. A very small blip in the availability of a crucial service might bother customers more than a longer failure of a seldom-used service. SLIs can put this all in an actionable context. Incidents that cause big customer impacts will receive proportionally big responses. &lt;/p&gt;

&lt;h3&gt;SLIs drive learning and growth&lt;/h3&gt;

&lt;p&gt;Your SLIs and SLOs shouldn’t be set once and then forgotten about. Instead, they should be continually reviewed and revised as your customers’ needs change. Don’t think of this as a burden, but an opportunity. Revisiting your SLIs is the perfect chance to study your users’ behaviours again. Challenge your assumptions of what customers need most from your services. The lessons SLIs teach you can improve even your largest strategic roadmaps.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>Self-Compassion Instead of Self-Blame</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Fri, 22 Oct 2021 19:17:31 +0000</pubDate>
      <link>https://dev.to/blameless/self-compassion-instead-of-self-blame-46el</link>
      <guid>https://dev.to/blameless/self-compassion-instead-of-self-blame-46el</guid>
      <description>&lt;p&gt;The tech industry is competitive and not without challenges. People are always growing and improving by pushing their limits. Innovation comes in many forms. In order to foster a healthy culture while allowing people to flourish, organizations must carefully enact policies. Growth should be encouraged while discouraging competition and comparison. One of the core policies organizations implement to achieve these goals is blamelessness.&lt;/p&gt;

&lt;p&gt;Every company can &lt;a href="https://www.blameless.com/sre/why-companies-can-benefit-from-blameless-culture"&gt;benefit from a blameless culture&lt;/a&gt;, and it is necessary for &lt;a href="https://www.blameless.com/sre/elephant-in-the-blameless-war-room-accountability"&gt;true accountability&lt;/a&gt;. However, even when an organization espouses a blameless culture, individuals may hold themselves to impossible standards and blame themselves when things go wrong. They may believe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blaming themselves counts as blamelessness because they aren’t blaming others
&lt;/li&gt;
&lt;li&gt;Self-blame is best for the organization because it doesn’t impact the team
&lt;/li&gt;
&lt;li&gt;Holding themselves responsible is the best way to improve their skills and reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Although these values are admirable, self-blame is precisely the wrong approach to bettering the organization and improving skills. In this post, we’ll show you how to deal with the urge to self-blame by looking at its downsides and instead focus on how to be self-compassionate.&lt;/p&gt;

&lt;h2&gt;Downsides to self-blame&lt;/h2&gt;

&lt;p&gt;People self-blame because they think it is the best way to achieve their goals. The idea is that self-blame forces the onus on oneself to change behavior in order to achieve better future outcomes. However, the downsides to self-blame greatly outweigh the perceived benefits. Let’s look at some examples of how:&lt;/p&gt;

&lt;h3&gt;Self-blame limits systemic improvement&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.blameless.com/sre/why-companies-can-benefit-from-blameless-culture"&gt;We’ve discussed&lt;/a&gt; how blaming others doesn’t lead to meaningful solutions. Even if an individual made a mistake, punishing that individual won’t prevent the mistake from reoccurring. Instead, you need to find the systemic issues that allowed the mistake to be made.&lt;/p&gt;

&lt;p&gt;The same is true when you blame yourself. It’s easy to say “I just won’t make that mistake again,” but it’s insufficient. You need to make the effort to dig into all of the factors that led you to make that choice.&lt;/p&gt;

&lt;p&gt;This process can be frustrating or painful. You might discover fundamental issues with your system that require major overhauls. You could also realize that your understanding of parts of your system is insufficient, and that you need to take the time to ask questions and learn. Just remember that it isn’t your fault: your actions were likely a consequence of the available resources and policies. Analyzing these factors is the only way to reduce the chances of future mistakes.&lt;/p&gt;

&lt;h3&gt;Self-blame leads to lower productivity in the long term&lt;/h3&gt;

&lt;p&gt;You may be tempted to self-blame because it looks like the fastest resolution to the situation. You simply resolve to do better and move on. It seems like this is the best way to maintain the momentum and productivity of your team. However, in the long term, self-blame is detrimental to productivity.&lt;/p&gt;

&lt;p&gt;If you don’t look into systemic causes, incidents are more likely to recur. Not only would the causes of these incidents go unchanged, but you’d miss out on learnings that can help you solve new incidents. Self-blame limits your growth, stifling your ability to improve your productivity. Carol Dweck, a professor of psychology who studies mindset, &lt;a href="https://onedublin.org/2012/06/19/stanford-universitys-carol-dweck-on-the-growth-mindset-and-education/"&gt;describes&lt;/a&gt; a self-blaming mindset as “fixed”. If you believe innate traits within yourself caused the incident, you won’t be able to grow. Instead, see your traits and the systems around them as changeable, and you’ll cultivate a &lt;a href="https://www.blameless.com/sre/how-to-build-an-sre-team-with-a-growth-mindset"&gt;growth mindset&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Self-blame also leads to stress and burnout. These factors cannot be ignored; no one is immune. As you internalize the idea that incidents are your fault, you’ll begin to doubt yourself. You’ll be unable to confidently make decisions and move forward with projects. The more responsibility you feel you have for your system, the heavier your burden will seem. The stress that accumulates undoes any benefit you think self-blame might bring.&lt;/p&gt;

&lt;h3&gt;Self-blame lowers team solidarity&lt;/h3&gt;

&lt;p&gt;Taking on self-blame is isolating. If you believe that an incident was caused solely by a deficiency in your work, there’s no way to communicate that to your teammates. Instead, if you share the context of the incident with everyone, the team grows together.&lt;/p&gt;

&lt;p&gt;The context of the incident can involve technical aspects of the system you misunderstood, or resources that were unavailable. You’ll likely find that your teammates have struggled with similar problems, and together you can solve them. No one should shame you for not knowing something; instead, you should celebrate an opportunity to strengthen your system.&lt;/p&gt;

&lt;p&gt;Personal challenges also provide context for the incident. If you feel stressed, overworked, or disoriented because of any aspect of your life, share how you’re feeling with your team. This isn’t making an excuse for a mistake; instead, it’s giving your team the opportunity to support you better and improve the whole team’s abilities. No one can be expected to operate at 100% all the time, and holding yourself to that standard will lead to problems. Build a resilient team by sharing the load. Open, honest communication will facilitate that end-goal.&lt;/p&gt;

&lt;h2&gt;How to be self-compassionate&lt;/h2&gt;

&lt;p&gt;Even if you know that self-blame isn’t helpful, you may still habitually blame yourself. To move past this instinct, you need to replace self-blame with &lt;a href="https://en.wikipedia.org/wiki/Self-compassion"&gt;self-compassion&lt;/a&gt;. This practice can be transformative in every area of your life, and its many elements and benefits go well beyond the scope of this post. Let’s take a look at a few examples of how it applies to self-blame at work.&lt;/p&gt;

&lt;h3&gt;Treating yourself as a teammate&lt;/h3&gt;

&lt;p&gt;You wouldn’t blame a teammate, so why blame yourself? It sounds simple, but this perspective can help you avoid your instinct for self-blame. Maybe you think the explanations you have for why you made a mistake are insufficient, just excuses. It’s easy to be dismissive of your own perspective.&lt;/p&gt;

&lt;p&gt;Now imagine a teammate is explaining this context to you. You’d be sympathetic, and work with them to improve the situation, rather than just blame them. If you treat yourself the same way, you’ll be able to move on to a better solution. Berkeley’s Greater Good in Action program has an &lt;a href="https://ggia.berkeley.edu/practice/how_would_you_treat_a_friend"&gt;exercise&lt;/a&gt; you can try to help practice this perspective.&lt;/p&gt;

&lt;h3&gt;Be mindful of your emotions&lt;/h3&gt;

&lt;p&gt;When something goes wrong, it can be frustrating, humiliating, confusing, disorienting, or otherwise painful. Self-blame can be a response to these feelings. Because you feel bad, you may be inclined to treat yourself badly. You might think that blaming yourself for what happened is a way to control and take ownership of these feelings.&lt;/p&gt;

&lt;p&gt;However strong this urge is, remember that self-blame won’t make these feelings go away. Be mindful and observe your feelings and reactions. Don’t try to stop them or control them. Focus instead on how to make the most meaningful improvement to the situation: working as a team to discover systemic issues.&lt;/p&gt;

&lt;p&gt;In a &lt;a href="https://self-compassion.org/wp-content/uploads/publications/Mindfulness_and_SC_chapter_in_press.pdf"&gt;paper&lt;/a&gt; for &lt;em&gt;Mindfulness and Self-Regulation&lt;/em&gt;, authors Kristin D. Neff and Katie A. Dahm break down how mindfulness helps with self-compassion: when we aren’t mindful of negative thoughts and feelings, we can become “overidentified” with them, seeing them as aspects of our self. Believing this inherently leads to self-blame. If instead we recognize passing thoughts and feelings for what they are, we can be compassionate with ourselves when working through them.&lt;/p&gt;

&lt;h3&gt;Disconnect self-evaluation from self-worth&lt;/h3&gt;

&lt;p&gt;When you invest yourself in your work, your self-image and self-worth may be affected by your performance. This can be motivating when things are going well: your hard work translates into capability and confidence. But when something goes wrong, this mindset backfires.&lt;/p&gt;

&lt;p&gt;If you feel that good work makes you a good person, then you might think that bad work makes you a bad person. This leads to self-blame, as you attribute the incident to your personal flaws. Rise above this pattern by remembering that your self-worth isn’t dependent on your productivity as an engineer. Being a good engineer isn’t about how fast you work or never making mistakes. Instead, it’s about being collaborative, compassionate, eager to learn, and willing to admit when you don’t know something. Focus on these qualities when things go wrong.&lt;/p&gt;

&lt;p&gt;Understanding the difference between self-esteem and self-compassion is key to making this realization. While both self-esteem and self-compassion allow you to feel confident and happy, &lt;a href="https://greatergood.berkeley.edu/article/item/try_selfcompassion"&gt;self-compassion is more resilient to setbacks&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Practicing self-compassion&lt;/h3&gt;

&lt;p&gt;Becoming self-compassionate isn’t easy. Don’t expect to shift your mentality overnight, and don’t be discouraged if you catch yourself falling into old habits. Like any practice, building self-compassion takes iteration and continued effort. But it’s so worth it. Your productivity, team camaraderie, and incident response patterns will profoundly improve. Plus, you’ll enjoy your work more.&lt;/p&gt;

&lt;p&gt;However, blamelessness doesn’t mean that things will be easy and smooth sailing all the time. It also means confronting systemic issues and resolving inadequacies. This can be a painful process, perhaps even more painful than just blaming yourself and moving on. However, it’s a pain that will lead to meaningful improvement. &lt;a href="http://ccare.stanford.edu/"&gt;The Center for Compassion and Altruism Research and Education&lt;/a&gt; at Stanford Medicine and the &lt;a href="https://greatergood.berkeley.edu/"&gt;Greater Good Center&lt;/a&gt; at Berkeley have great resources for continuing your journey to self-compassion.&lt;/p&gt;

&lt;h2&gt;How to encourage self-compassion&lt;/h2&gt;

&lt;p&gt;Management can discourage self-blame and encourage self-compassion with policy and practices. Here are some suggestions for how to cultivate this at your organization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don’t use comparative feedback of employees&lt;/strong&gt; - feedback on performance should be given in private and compare the employee’s performance only to established standards, not to how other employees are performing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create space for sharing challenges at meetings&lt;/strong&gt; - at standups or other team meetings, encourage employees to share what challenges they’re facing - both system-wide and role-specific obstacles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check in with employees after incidents&lt;/strong&gt; - after major incidents, check in one-on-one with employees that were involved. Make sure they aren’t blaming themselves for the incident, and that they know they have support.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>Elephant in the Blameless War Room: Accountability</title>
      <dc:creator>Emily Arnott</dc:creator>
      <pubDate>Thu, 08 Jul 2021 18:26:04 +0000</pubDate>
      <link>https://dev.to/blameless/elephant-in-the-blameless-war-room-accountability-40ef</link>
      <guid>https://dev.to/blameless/elephant-in-the-blameless-war-room-accountability-40ef</guid>
      <description>&lt;p&gt;We’ve always advocated that every company can benefit from a &lt;a href="https://www.blameless.com/blog/why-companies-can-benefit-from-blameless-culture"&gt;blameless culture&lt;/a&gt;. Fostering a blameless culture can profoundly &lt;a href="https://www.mckinsey.com/industries/technology-media-and-telecommunications/our-insights/developer-velocity-how-software-excellence-fuels-business-performance"&gt;boost&lt;/a&gt; your organization in powerful ways, from employee retention to developer velocity and innovation. However, there’s an elephant in the room when we talk about blamelessness with executives: accountability. When things go wrong, people still need to get fired, right? &lt;/p&gt;

&lt;p&gt;In a discussion with &lt;a href="https://www.linkedin.com/in/ajay-varia-b880812"&gt;Ajay Varia&lt;/a&gt;, former VPE of &lt;a href="https://www.masterclass.com/"&gt;MasterClass&lt;/a&gt; and COO of &lt;a href="https://emeritus.org/"&gt;Emeritus&lt;/a&gt;, he shared an example of a real incident and how he held space for a blameless resolution without sacrificing accountability. An engineer was making changes to the administrative panel of what they thought was a testing environment. Regrettably, it turned out to actually be controlling the production environment. Their changes caused a significant outage for the service.&lt;/p&gt;

&lt;p&gt;Imagine an executive pressing to know who was responsible for the outage. How would you respond to this demand for accountability while maintaining the ideal of blamelessness?  In this blog post, we’ll look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What executives want when they blame&lt;/li&gt;
&lt;li&gt;How to skillfully respond to demands for blame&lt;/li&gt;
&lt;li&gt;When is accountability fair game?&lt;/li&gt;
&lt;li&gt;How to be blamelessly accountable&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What does a blaming executive want?&lt;/h2&gt;

&lt;p&gt;Although we might not agree with their blameful approach when they ask “Who’s responsible for this incident and what should we do about this person?”, we need to remember that the executive’s goal is the same as ours: to solve the problem and ensure the company’s success. Just like us, they take responsibility for the situation and are eager to restore the service to health.&lt;/p&gt;

&lt;p&gt;The executive likely has three goals in mind: dealing with the person involved, resolving and preventing the incident, and restoring trust with affected stakeholders. Given their distance from the day-to-day context of the incident, they may see blaming an individual as one of the only ways to meet their goals.&lt;/p&gt;

&lt;p&gt;To understand what an executive wants to achieve when they look for someone to blame, we must empathize with their perspective. They may have certain assumptions about the situation or how their actions will affect it. We must account for those assumptions and meet them where they are so we can skillfully and constructively respond to their questions.&lt;/p&gt;

&lt;h3&gt;Assumptions about the incident&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;This incident should never have happened.&lt;/strong&gt; The executive may not realize that failure is inevitable in complex systems. They may assume that if people did their jobs right, then there should be no incidents or outages. Therefore, the person who caused the incident must either not care or not be skilled enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Punishment will deter others from making the same mistake.&lt;/strong&gt; If they assume the mistake was made because of negligence, they could see punishment as a “wake up call” that would make other engineers more dutiful. They may think that remembering the punished engineer would make other engineers spend more time double checking which admin panel they access.&lt;/p&gt;

&lt;h3&gt;Assumptions about the person involved&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;A skillful engineer would never make this mistake,&lt;/strong&gt; and therefore the person involved must be unskilled. A lack of context around the issue may lead the executive to assume that it resulted entirely from individual error, with no systemic factors. With our admin panel mixup example, they might assume that any good engineer would be able to consistently identify the production panel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Removing the person removes the problem.&lt;/strong&gt; Given their first assumption, it follows that they’d see this as an effective solution. If they removed the unskilled engineer responsible and only skilled engineers remained, the problem wouldn’t reoccur.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without punishment, the engineer won’t appreciate that they made a mistake.&lt;/strong&gt; When dealing with the stress of the incident, it’s easy for the executive to feel that they’re alone in their frustration. They may believe that the engineer lacks the perspective to understand the significance of the incident and feel bad about it. They may see punishment as a way of conveying the impact to the engineer.&lt;/p&gt;

&lt;h3&gt;Assumptions about stakeholders&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Punishment is the most persuasive way to alleviate customer concerns and restore trust.&lt;/strong&gt; Even if the executive believes in blamelessness and addressing systemic causes, they may worry that their stakeholders don’t. They may assume that only removing the involved person will satisfy stakeholders worried about the incident reoccurring. &lt;a href="https://sreweekly.com/sre-weekly-issue-272/"&gt;Blame-heavy press releases&lt;/a&gt; reflect this concern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stakeholders may expect punishment to maintain fairness.&lt;/strong&gt; Some stakeholders might assume that the incident was due to negligence. These stakeholders could include other teams in your organization impacted by the incident, such as customer success. Since customer success managers are held accountable for the churn that follows directly from the incident, the executive could assume that the engineering team needs to experience a similar degree of negative consequence to uphold a fair distribution of accountability across the company. They could see punishment as the best method to achieve this.&lt;/p&gt;

&lt;h2&gt;How can you skillfully respond to a demand for blame?&lt;/h2&gt;

&lt;p&gt;Given these assumptions and the deep desire to resolve the problem, it makes sense that the executive would look to blame. To respond to this demand without blame, you have to convince the executive that their goals will &lt;em&gt;still&lt;/em&gt; be met. You must assure them that their concerns about the person involved, the incident itself, and the stakeholders will all be addressed.&lt;/p&gt;

&lt;h3&gt;Responses about the incident&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Engineers in fight-or-flight mode cannot problem-solve well.&lt;/strong&gt; Finding sources of blame during an incident will likely make the resolution slower, not faster. Engineers &lt;a href="https://scholarworks.wmich.edu/cgi/viewcontent.cgi?article=3179&amp;amp;context=jssw"&gt;who are stressed&lt;/a&gt; about their job security are unlikely to resolve complex problems as quickly as they would if their minds could focus on the incident at hand. Since engineering is about solving complex problems, blame can undermine an engineer’s overall productivity and effectiveness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A systemic change is more enduring and beneficial.&lt;/strong&gt; Digging into systemic causes can provide insights into the most fundamental assumptions about your organization. It will produce changes that help regardless of who is on the team or who is at the helm. In this case, Ajay asked, “Why do the testing and production environments look so similar? Can we create a big red banner warning the team when they are making changes to production?”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complex system failures are inevitable.&lt;/strong&gt; There will always be incidents we cannot prepare for, but it’s not all hopeless. We can take measures to get better at detection, get faster at mitigation, and get more proactive with prevention. We can measure improvements across these three areas and focus reliability efforts on the most revenue-critical user journeys. Expecting engineers to keep the service running 100% of the time is not realistic; even our biological hearts do not achieve 100% reliability. Help the executive see incidents as &lt;a href="https://twitter.com/allspaw/status/1051252775311613952"&gt;unplanned investments in reliability&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Responses about the person involved&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Anyone in that position could have made the same mistake.&lt;/strong&gt; When explaining the systemic causes, show how they would apply equally to any engineer, even the most senior architect. Therefore, punishing the specific person who happened to be there won’t necessarily prevent the issue in the future.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We could make this mistake again with a different person.&lt;/strong&gt; Since anyone could have made the mistake, it could easily happen again unless something systemic changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No one wanted this outcome, least of all the engineer involved.&lt;/strong&gt; Assure executives that everyone on the team is aligned on the importance of reliability and customer trust. Re-establish trust with the executive by demonstrating that the engineer and the team overall clearly understand the impact of the incident and are committed to improving the system. Emphasize the resulting action plan as proof of this commitment.&lt;/p&gt;

&lt;h3&gt;Responses to the stakeholders affected&lt;/h3&gt;

&lt;p&gt;Even if you don’t directly interface with the customers or other stakeholders, you can advise the executive to respond in these ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Our action plan will inspire confidence.&lt;/strong&gt; If the goals and timeline of the action items are communicated to stakeholders, they will see how this approach produces a more reliable outcome than blame would.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We can acknowledge their pain without blame.&lt;/strong&gt; There are ways to show stakeholders that you understand the pain the outage has caused them without resorting to retribution. Ajay recommended that people in leadership positions hear out everything from the stakeholders’ perspective. It isn’t about finding a scapegoat, but about making sure stakeholders understand they aren’t being dismissed or trivialized.&lt;/p&gt;

&lt;p&gt;By responding to a demand for blame with these assurances, you can convince the executive that their goals will be met without needing to resort to blame.&lt;/p&gt;

&lt;h2&gt;When is accountability fair game?&lt;/h2&gt;

&lt;p&gt;There could still be situations where someone needs to be held personally accountable; it may even be the best option for advancing systemic change. But there are prerequisites to meet before holding a person accountable.&lt;/p&gt;

&lt;p&gt;Here are some questions to ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Were expectations for this person’s job clear, realistic and documented? &lt;/li&gt;
&lt;li&gt;Did this person make multiple mistakes as a direct result of lacking skill, good intentions, or earnest effort, and little to nothing else?&lt;/li&gt;
&lt;li&gt;Have you shared feedback about gaps in their performance on a consistent basis?&lt;/li&gt;
&lt;li&gt;Do you and/or other members of your team have reasons to believe that this person is not sufficiently coachable? &lt;/li&gt;
&lt;li&gt;Do you have consistent and reliable evidence that this person cannot be trusted to meet the explicitly stated expectations associated with their role?&lt;/li&gt;
&lt;li&gt;Does your organization’s culture acknowledge that complex system failures are inevitable? &lt;/li&gt;
&lt;li&gt;Did you look for contributing factors of their mistakes?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer to all of the above questions is yes, then holding the person accountable is fair.&lt;/p&gt;

&lt;p&gt;However, as you can see, the traditional definition of accountability - attribution and punishment - falls under performance management, which should be an &lt;em&gt;intentional and separate process&lt;/em&gt; from incident resolution. Assigning blame for inevitable system failures is not an appropriate substitute for performance management.&lt;/p&gt;

&lt;h2&gt;How to be blamelessly accountable&lt;/h2&gt;

&lt;p&gt;Accountability isn’t incompatible with blamelessness. In fact, &lt;strong&gt;true accountability - ownership of making the system better going forward - requires blamelessness.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Blame is an easy way out. It allows people to punish a “responsible” individual and call it a day. But what the company really needs is for leaders and teams to do the hard work of solving the complex, nuanced challenges of the system.&lt;/p&gt;

&lt;p&gt;True accountability incorporates nuance and &lt;a href="https://www.blameless.com/blog/twitters-reliability-journey"&gt;faces forward&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;For our example of the production panel being mistaken for the testing panel, here’s how Ajay took accountability as a leader. He asked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why do the admin panels for the production and testing environments look so similar? Should production have a big flashing banner reminding you that you’re working in production?&lt;/li&gt;
&lt;li&gt;Should a single person be able to make changes to admin in the production environment? Should there be a two-person verification system?&lt;/li&gt;
&lt;li&gt;Should every engineer be given the ability to make changes on the production admin panel? Maybe most engineers only need to make changes in testing.&lt;/li&gt;
&lt;/ul&gt;
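
&lt;p&gt;To make the first guardrail concrete, here is a minimal, hypothetical sketch - the function and environment names are our own assumptions, not anything from the incident described - of a guard that shows a warning banner and demands explicit confirmation before a production change can proceed:&lt;/p&gt;

```python
PROD_ENVS = {"prod", "production"}

def guard_production(env: str, confirm: str = "") -> bool:
    """Return True if a change to `env` may proceed.

    Non-production environments pass silently. Production prints a
    'big red banner' and requires the caller to type the environment
    name back as an explicit confirmation.
    """
    if env.lower() not in PROD_ENVS:
        return True
    banner = "!" * 60
    print(banner)
    print("  WARNING: you are about to modify PRODUCTION")
    print(banner)
    return confirm == env

# A change in testing proceeds; production needs the typed confirmation.
print(guard_production("testing"))                           # True
print(guard_production("production"))                        # False
print(guard_production("production", confirm="production"))  # True
```

&lt;p&gt;The second question - two-person verification - could extend this sketch by requiring a confirmation from a second engineer rather than a self-typed one.&lt;/p&gt;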

&lt;p&gt;We are sure you can imagine how much confidence the ensuing follow-up action items inspire. Ajay showed that companies don’t have to sacrifice accountability to have a blameless culture, nor do they have to default to blame to uphold accountability.&lt;/p&gt;

&lt;p&gt;It takes incredible empathy, stress tolerance, and critical thinking to get blamelessness and accountability working together in harmony, but it is possible.&lt;/p&gt;

&lt;p&gt;So don’t hide the elephant, ride it! &lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
