A few weeks ago we released episode two of our ongoing webinar series, SRE: From Theory to Practice. In this series, we break down a challenge facing SREs through an open and honest discussion. Our topic this episode was “what’s difficult about incident command?” When things go wrong, who is in charge? And what does it feel like to do that role? To discuss, Jake Englund and Matt Davis from Blameless were joined by Varun Pal, Staff SRE at Procore, and Alyson Van Hardenburg, Engineering Manager at Honeycomb.
To explore how organizations felt about incident command, we asked about the role on our community Slack channel, an open space for SRE discussion. We found that most organizations don’t have dedicated incident commander roles. Instead, on-call engineers are trained to take on the command role when appropriate. Because of this wide range of people who could end up wearing the incident commander hat, it’s important to have an empathetic understanding of exactly what the role entails.
With this conversation, we wanted to work through what incident command theoretically entails, and connect it to the messy reality of what it often looks like. As we did for the last episode, we’ll highlight three key takeaways as an introduction to the episode.
Varun discussed how at Procore, he and his colleague started an incident commander “guild”: a group that meets weekly and brings together the people who may have to take on the incident command role. Before starting the guild, they recognized that each person taking on the role may have vastly different areas of expertise and perspectives on how incidents should be run. When reviewing incidents in retrospectives, they’d often find inconsistencies based on who was commanding the incident. This created challenges for finding patterns across incidents, and for using consistent methods to investigate the causes of incidents. This was the impetus to gather incident commanders in this new guild.
By bringing together everyone who could wear the incident commander hat, they not only got everyone on the same page, but on the “best” page. This meant collecting the expertise of everyone in the group and establishing those best practices as a methodology to which everyone adheres. Each member could contribute what they found most effective, synthesizing the group’s experiences into an agreed-upon set of practices. The program was started from the bottom up, knowing that the time and energy invested would make everyone’s lives easier in the long run.
Perhaps even more important than coming up with good procedures, the incident commander guild provides solidarity and empathy. It’s a safe space for people who respond to incidents to share in one another’s triumphs, and commiserate and vent about frustrations. Incident command is tough work: it’s a job that can have you leaping out of bed at 3am and suddenly being asked to direct a team of other tired people. Without support, people can quickly burn out.
“I’d rather be on-call 24/7 for something I’m the subject matter expert on than spend 5 minutes being incident commander for something I don’t know about,” said Jake in our discussion. It might be an exaggeration – but not a huge one. Everyone else on the call echoed this sentiment. The anxiety around not knowing is only sensible. In the crunch time of an incident, no one wants to cause further delay because they don’t know how something works.
The first step to addressing this, as Jake emphasized, is to realize that engineers are not fungible. You can’t assume that every engineer has the expertise and experience of every other engineer. For engineers to be effective on-call, they need to be brought up to speed on system functioning. Without that, you can’t be sure that “deploying people” to resolve a problem will have any effect.
Even with training, some engineers will always be more familiar with some service areas than others, perhaps because they worked on the project itself. No matter how prepared they are for on-call in general, this relative lack of expertise will always cause anxiety: people will inevitably fear an incident that exposes the things they don’t know, or even the things they don’t know that they don’t know. This is why Alyson emphasized, to everyone’s agreement, that subject matter experts shouldn’t be the incident commander. Good incident response shouldn’t be about “getting lucky” and having the expert on call, but establishing learning and processes that help anyone solve issues.
Since this anxiety is to some extent inevitable, the important thing is to empathize with it and set up systems that support the people who feel it. Often, there will be designated people to escalate to. It’s helpful to know who to call, but it can be intimidating if you think you’re bothering someone with a question you “ought to know”. One panelist brought up the subject of “on-call buddies”: someone you feel comfortable contacting even when you’re unsure of what you “ought to know”. Then, even if neither of you knows the answer, you can feel more encouraged to escalate further. In general, escalation policies shouldn’t be strict and linear, but more based on expertise and connections.
We’ve looked at some best practices to make life better for incident commanders, but a key question remains: what exactly is incident command? Is it a duty that rotates through everyone on-call, with that designated person taking command for every incident on that shift? Or is it determined at the time of the incident – perhaps the person who first responds to the incident, or the most expert person on-call, or the most senior person involved in each incident? Or should you hire a designated incident command person? What duties does someone have when they’re on incident command?
When discussing these questions, our panel concluded that… it depends. Who an incident commander is and what they do may vary from org to org, and from incident to incident. But when you’re building up the practice yourself, “it depends” isn’t a very helpful answer. That’s why we want to highlight a framework for incident command suggested by Alyson: incident command is like first aid.
First aid isn’t about fully treating a patient, or even fully diagnosing them. It’s about taking charge of a situation and making sure critical tasks are happening and not falling through the cracks. Alyson described a scene where you witness an accident and immediately give direction: “you, elevate the head and try to stop the bleeding”; “you, call an ambulance”, etc. Instructing particular people bypasses the bystander effect and ensures the task is completed.
When you’re the incident commander, it can be helpful to focus on this role of immediate task allocation, instead of getting bogged down immediately by diagnosis and response itself. Matt also emphasized the importance of the incident commander knowing when to step away. You’ll naturally want to see every step of the incident through, but trying to power through when exhausted can have diminishing returns. For the sake of the incident, and your own health, it’s important to take breaks. During that time, hand off the command role to someone else. Matt suggests the incident lead is a good option. No one expects the first aid person to stay at the patient’s bedside all night.
We hope you’re enjoying our look into the real experiences of engineers in SRE: From Theory to Practice. You can look forward to new episodes coming soon. If you have an SRE topic that you’d like to see covered by a panel, please let us know on Twitter or on our Slack community channel.