Hannah Culver for Blameless

Posted on Jun 19, 2020 • Originally published at blameless.com on Jun 19, 2020

Best Practices for Effective Incident Management

#sre #devops

Incident management is a set of processes used by operations teams to respond to latency or downtime, and return a service to its normal state. Incident management practices have long been well-defined through frameworks such as ITIL, but as software systems become more complex, teams increasingly need to adapt their incident management processes accordingly.

Below are five incident management best practices that your team can begin using today to improve the speed, efficiency, and effectiveness of your incident management process.

Why use incident management best practices

In today’s high-stakes, high-availability world, uptime has never been more important. Reliability has become the No. 1 feature for companies, and it can make or break an organization’s revenue and reputation.

Teams responding to incidents have become the soldiers on the front lines for a company’s overall health and well-being. With downtime costs skyrocketing, it’s important that your team is trained, prepared, and ready for battle.

This requires adopting a smooth, effective incident management process in order to resolve issues faster, communicate and collaborate through the process, and learn from these incidents to possibly prevent the same incident from happening again.

Adopt alerting and on-call best practices

Effective incident management begins with setting a strong foundation. Alerting and on-call procedures are crucial for your team’s success. When you’re experiencing an incident, this is how you determine what kind of incident you’re facing as well as who to call for help.

Set alerts that matter

There is such a thing as too much information. When your on-call team is getting paged at 12:34 AM, 1:11 AM, 2:46 AM, and on until dawn, it can be impossible for them to respond adequately to each alert. When pager fatigue sets in, quality and efficiency go down the drain. You need to determine what’s worth alerting on, and what isn’t.

One way to do this is by thinking about your customers first and determining SLIs, or service level indicators.

The touchpoints between the user and your service will involve requests and responses – the building blocks of SLIs. For each touchpoint you identify, you should be able to break down the specific SLIs measuring that interaction, such as the latency of the site’s response, the availability of key functions, and the liveness of data customers are accessing.

Next, you’ll use those SLIs to create SLOs, or service level objectives. An SLO is the internal threshold you want your SLI to stay within to keep your customers happy. If the SLI breaches this threshold, an alert should be triggered.

Alerts on SLOs are helpful to diagnose the severity of the incident as well as “quantify impact to clients: when an SLO-alert fires, the responder knows that a client is impacted. Not only do SLO alerts indicate that client’s are affected, they also indicate how many requests are affected.”
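
To make the SLI-to-SLO-to-alert chain concrete, here is a minimal sketch in Python. It assumes an availability SLI measured as the fraction of successful requests. The names SLO, availability_sli, and check_slo, along with the 99.9% target, are made up for illustration and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A hypothetical SLO: the minimum fraction of good requests we aim for."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic means nothing failed
    return good_requests / total_requests


def check_slo(slo: SLO, good_requests: int, total_requests: int) -> dict:
    """Compare the measured SLI to the SLO and count the failed requests."""
    sli = availability_sli(good_requests, total_requests)
    return {
        "slo": slo.name,
        "sli": sli,
        "should_alert": sli < slo.target,
        "affected_requests": total_requests - good_requests,
    }


if __name__ == "__main__":
    checkout = SLO(name="checkout-availability", target=0.999)
    print(check_slo(checkout, good_requests=99_850, total_requests=100_000))
```

The result carries both pieces of information described above: whether the responder should be paged, and how many requests were affected.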

Prepare your team for on-call

Once you’ve been alerted to an incident, it’s just as important to make sure that your team is prepared to respond, no matter what the level of severity. While there are many components to this, two rise above the rest as priorities to focus on for a healthy on-call process.

Make your engineers feel safe. If your engineers are afraid of failure, or are insecure in their knowledge of the service, they can be hesitant to respond. As noted in the Google SRE book, “Stress hormones like cortisol and corticotropin-releasing hormone (CRH) are known to cause behavioral consequences—including fear—that can impair cognitive functions and cause suboptimal decision making.” Avoid this by cultivating a blameless culture and arranging for engineers to shadow on-call when learning the service. Additionally, make sure that each trained engineer spends adequate time on-call in order to grow accustomed to making decisions under pressure.

Take a qualitative approach to on-call. There is a huge difference between spending a weekend on call with no incidents, and spending a weekend on call with 3 high-severity incidents. If we only look at time spent on call, we don’t get an accurate view of who is most likely to be too tired or burnt out to respond to another incident. Instead of going by time on call, take a more qualitative look. This will help cultivate a culture of on-call empathy within your team.
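
One rough way to surface that difference is to score each shift by what actually happened during it rather than by hours alone. The sketch below is only an illustration: the severity weights and the off-hours multiplier are invented numbers that your team would need to agree on, and a score like this should start a conversation, not replace one.

```python
# Hypothetical weights: how much each incident severity "costs" a responder.
SEVERITY_WEIGHTS = {1: 8, 2: 4, 3: 1}  # assumption: SEV1 is the most severe
OFF_HOURS_MULTIPLIER = 2  # assumed penalty for pages outside working hours


def on_call_load(incidents):
    """Score a shift from (severity, off_hours) tuples, one per incident."""
    score = 0
    for severity, off_hours in incidents:
        weight = SEVERITY_WEIGHTS[severity]
        score += weight * OFF_HOURS_MULTIPLIER if off_hours else weight
    return score


if __name__ == "__main__":
    quiet_weekend = []
    rough_weekend = [(1, True), (1, True), (1, False)]
    print("quiet:", on_call_load(quiet_weekend))
    print("rough:", on_call_load(rough_weekend))
```

Both weekends are the same length on the calendar, but the scores make it obvious which engineer is more likely to need a break before the next page.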

So, you have your alerts set up and your on-call team is prepared. What comes next?

Prioritize incidents and use runbooks to get ahead of the curve

You’ve been alerted that you have an incident, and you know who to call. But is it time to ring everyone? It’s important to know whether an incident requires waking your entire team in the middle of the night, or if it can wait until Monday morning. It’s also important to know what steps to take once the incident is discovered.

One way to determine the severity of incidents is by customer impact. After all, if your customers won’t know anything is wrong, it can probably wait a few hours until your team has had the chance to wake up and grab a cup of coffee.

For example, PagerDuty published a chart with their defined severity levels, which our team at Blameless has adapted for our own internal processes.

This may not be accurate for your team or service, but it’s important to determine this so your team members can make the right call during an incident. Key information like this should also be baked into a comprehensive runbook.
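
As a sketch of what impact-based prioritization could look like once it is written down, the Python below maps customer impact to a severity level. The three levels, the 25% threshold, and the classify function are hypothetical and are not PagerDuty's or Blameless's actual definitions. Substitute the levels your own team agrees on.

```python
from enum import Enum


class Severity(Enum):
    # Hypothetical levels; each value is the expected response.
    SEV1 = "page the whole response team now"
    SEV2 = "page the on-call engineer now"
    SEV3 = "handle during business hours"


def classify(customers_affected_pct: float, core_feature_down: bool) -> Severity:
    """Pick a severity from customer impact (thresholds are assumptions)."""
    if core_feature_down or customers_affected_pct >= 25.0:
        return Severity.SEV1
    if customers_affected_pct > 0.0:
        return Severity.SEV2
    return Severity.SEV3  # customers cannot tell anything is wrong


if __name__ == "__main__":
    level = classify(customers_affected_pct=0.0, core_feature_down=False)
    print(level.name, "->", level.value)
```

Encoding the rules this way means the person who gets paged at 2 AM doesn't have to decide from scratch whether to wake everyone else.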

Runbooks, which are predefined procedures meant to be performed by operators, are important components of incident response. They help with:

Automating the toil from incidents when possible
Describing what to do in the event of an incident

Runbooks can tell you where to check for code, who to escalate to, as well as what the incident postmortem or retrospective process looks like, and can be tailored to the specific type and severity of incidents.

Though runbooks are very versatile and customizable, there are some components that all good runbooks should contain. According to AWS, here are a few of these must-haves:

Requirements to be able to execute the runbook
Constraints on the execution of the runbook
Procedure steps and expected outcomes
Escalation procedures
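
The four must-haves above can be sketched as a simple data structure. This is a minimal illustration, assuming you keep runbooks as structured data. The Runbook and Step classes and the missing_sections check are invented here and are not an AWS format.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str
    expected_outcome: str


@dataclass
class Runbook:
    """Hypothetical runbook record covering the four must-have sections."""
    title: str
    requirements: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    steps: list[Step] = field(default_factory=list)
    escalation: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of any must-have sections that are still empty."""
        sections = {
            "requirements": self.requirements,
            "constraints": self.constraints,
            "steps": self.steps,
            "escalation": self.escalation,
        }
        return [name for name, content in sections.items() if not content]


if __name__ == "__main__":
    draft = Runbook(
        title="Database failover",
        requirements=["Production database access"],
        steps=[Step("Promote the replica", "Replica accepts writes")],
    )
    print("Still missing:", draft.missing_sections())
```

A check like missing_sections could run whenever a runbook is added or edited, so incomplete runbooks are caught before an incident rather than during one.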

With prioritization and runbooks, your incidents are on the right path towards speedy resolution. But there are some additional incident management best practices that you’ll need to pay attention to as well.

Set defined roles, responsibilities, and communication guidelines

There are countless moving pieces during an incident, and even if you have runbooks, it can be difficult to keep in touch with your team about what you’ve done and haven’t done.

This is especially true in the era of remote work when you can’t simply go to your teammate’s desk or head to the incident war room to check in.

Instead, we need to focus on improving our collaboration skills with defined roles and responsibilities and communication guidelines.

Roles and responsibilities for incident management

There are four main roles during incident management, and each role has different responsibilities. With smaller teams, sometimes you’ll need to combine these roles in order to cover all your bases, and that’s fine. As long as someone takes charge of the responsibilities, the roles can be combined in the way that best fits your team.

Incident Commander: The Incident Commander's job is to run the incident, and their ultimate goal is to bring the incident to completion as fast as possible. When the incident starts, the person who is paged first is generally the Incident Commander by default and is responsible for kicking off the triage process.

Communication Lead: The Comms Lead is in charge of communications leadership, though for smaller incidents, this role is typically subsumed by the Incident Commander. Communication responsibilities include keeping both customers and management apprised of the situation, as well as communicating progress within the team.

Technical Lead: This individual is knowledgeable in the technical domain in question, and helps to drive the technical resolution by liaising with Subject Matter Experts.

Scribe: This person may not be active in the incident, but they transcribe key information during the incident. With today’s tools, this role could be automated through bots that execute tasks such as grabbing log files and highlighting key information in the channel. Regardless of how it’s done, taking notes during an incident is incredibly important to get the full value.
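
For smaller teams, combining roles can be as simple as cycling through whoever is available. The sketch below is one possible approach, not a prescribed one. It assumes responders are listed in the order they were paged, so the first person becomes Incident Commander, and the role ordering and the assign_roles helper are hypothetical.

```python
# Assumed priority order: earlier roles get a dedicated person first.
ROLES = ["Incident Commander", "Technical Lead", "Communication Lead", "Scribe"]


def assign_roles(responders):
    """Map each role to a responder, doubling up when the team is small.

    `responders` is assumed to be in paging order, so the first person
    paged becomes the Incident Commander.
    """
    if not responders:
        raise ValueError("at least one responder is required")
    return {
        role: responders[index % len(responders)]
        for index, role in enumerate(ROLES)
    }


if __name__ == "__main__":
    for role, person in assign_roles(["Ana", "Ben"]).items():
        print(f"{role}: {person}")
```

With two responders, the first person ends up as both Incident Commander and Communication Lead, which matches the common pattern of the Incident Commander absorbing communications on smaller incidents.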

Establishing communication guidelines

Once the roles have been filled and responsibilities doled out, you need to understand how teammates are expected to communicate with each other during an incident. While it’s important to know whether the protocol involves communicating over Slack and Zoom, or whether your team chats over Microsoft Teams, it’s even more important to know how to treat one another.

Every engineer makes mistakes; it’s how lessons are learned. When an incident happens, it’s easy to place blame on the last person who pushed code.

However, people are never the root cause of an incident; processes are. To be great at incident response, you will need to be compassionate in the face of these mistakes and seek to learn from them.

Issues won’t just cause incidents; they’ll pop up during incidents. Sometimes a fix can cause more damage to a service than it repairs, and you’ll need to learn to have compassion during these moments too.

Instead of getting angry with a team member, remember that they are just trying to help. Everyone is making the decisions they feel are best at that moment in time with the information they have.

Create comprehensive incident retrospectives

It is important that good incident management spans the whole lifecycle of an incident, beyond resolving or closing an incident. Even after resolution, there are important steps to complete for exceptional incident management. Creating comprehensive incident retrospectives to properly document what happened is key to overall success. Not only is it a record that your team can refer back to during future incidents, but it’s also something that you can share more widely to help spread knowledge within the entire organization.

There’s a craft to creating retrospectives that are valuable, however. Below are some tips to help:

Use visuals. As Steve McGhee says, “A ‘what happened’ narrative with graphs is the best textbook-let for teaching other engineers how to get better at progressing through future incidents.” Graphs give an engineer a quick, in-depth view of what was happening during the incident, even when they read the retrospective days, weeks, or years later.

Use timelines. Using timelines when writing postmortems is very valuable. However, they also require the perfect balance of information. Too much to sift through, and the postmortem will become cluttered. Too little and it’s vague. To get clarity on this, try asking an engineer from another team to read through the timeline. Record where they have questions or feel that there is too much information and adjust accordingly.

Publish promptly. Promptness has two main benefits: first, it allows the authors of the postmortem to report on the incident while it is still fresh in their minds, and second, it reassures affected customers sooner, leaving less opportunity for churn.

Be blameless. Remember, people are not points of failure, everyone is doing their best, and failure is guaranteed to happen.

Tell a story. An incident is a story. To tell a story well, many components must work together. Make sure that your postmortems have all the necessary parts to create a compelling and helpful narrative.

Once you have a retrospective that you are proud to publish, it’s time to make sure all that knowledge is fed back into your system. Otherwise, this incident will have just been a hit to the business, and a missed opportunity for learning.

Close the circle in your incident management lifecycle

With the increasing frequency of incidents and complexity of systems, it’s not enough to simply fix an issue, fill out a quick Google doc for a retrospective, and move on. We need to make sure that we’re taking every opportunity to close the learning gap and take proactive, remediative actions in our incident management lifecycle.

To do this, make sure to track all follow-up items assigned from each incident. If some action items are lengthy, costly fixes, make sure to discuss with the product teams how they can be prioritized.
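
Tracking doesn't need heavy tooling to get started. As a minimal sketch, assuming each follow-up item has an owner and a due date, something like the following can flag work that has slipped. The ActionItem fields and the overdue helper are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """A hypothetical follow-up item created during a retrospective."""
    incident_id: str
    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items, today):
    """Return the open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


if __name__ == "__main__":
    items = [
        ActionItem("INC-1", "Add alert on queue depth", "ana", date(2020, 6, 1)),
        ActionItem("INC-1", "Update failover runbook", "ben", date(2020, 6, 1), done=True),
        ActionItem("INC-2", "Load test checkout", "cy", date(2020, 7, 15)),
    ]
    for item in overdue(items, today=date(2020, 6, 19)):
        print(f"{item.incident_id}: {item.description} (owner: {item.owner})")
```

Reviewing the overdue list on a regular cadence gives you a concrete starting point for prioritization conversations with product teams.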

Additionally, you’ll need to make sure that you share concerns about these issues with stakeholders and adjoining teams to create battle plans. If reliability is being compromised for new features, you’ll need to discuss ways to incentivize reliability and encourage buy-in from all stakeholders.

Lastly, you will need to regularly examine your SLOs and error budget policies. This helps keep you apprised of changing customer expectations and makes sure you’re on the same page as your consumers.

If you’re consistently exceeding your error budgets yet customer satisfaction isn’t being affected, perhaps you’re not giving your team enough slack. If you’re meeting your SLOs but customers are unhappy, maybe it’s time to make your criteria more stringent.
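
The arithmetic behind an error budget is simple enough to sketch. Assuming a request-based SLO, the budget is the share of requests allowed to fail in a window. The function names and the example numbers below are illustrative only.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests allowed to fail in the window under the SLO."""
    return (1.0 - slo_target) * total_requests


def budget_consumed(slo_target: float, total_requests: int,
                    failed_requests: int) -> float:
    """Fraction of the error budget used; above 1.0 means it is exceeded."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target (or no traffic) leaves no room for failure.
        return float("inf") if failed_requests else 0.0
    return failed_requests / budget


if __name__ == "__main__":
    # Assumed example: a 99.9% SLO over one million requests in the window.
    used = budget_consumed(0.999, 1_000_000, failed_requests=1_500)
    print(f"Error budget consumed: {used:.0%}")
```

Comparing this number against customer satisfaction over several windows is what tells you whether the SLO is too strict, too loose, or about right.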

Incident management best practices are crucial components to your team’s success during a crisis. They help you: