Until recently I worked on the GOV.UK team at the UK Government Digital Service. One thing this team is really good at is responding to incidents and learning from them.
I've been thinking a lot about this lately as my new team is still formalising how to respond to incidents, and as a result it's a lot more chaotic when stuff actually goes wrong. But it doesn't have to be this way.
I think most teams would benefit from treating incidents as a normal thing that occasionally happens, rather than something scary and bad.
What should you do during an incident?
An incident can be any unexpected interruption to a service people care about. For example:
- the service is completely unavailable
- something is unacceptably slow
- functionality is broken in a way that stops people using it
- you discover a security vulnerability or breach
During the incident you don't always need to understand the root causes. The goal is to restore normal service as quickly as possible, and communicate what is happening to the people affected (and management).
However, you want to be able to understand the root causes later on, so you should keep a record of actions you took and things you observed. Without this it can be impossible to understand what happened, particularly if some of the actions you took actually made things worse (this can happen). So write stuff down.
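If you don't have a tool for this already, even a tiny script will do. Here's a minimal sketch of a timestamped incident log in Python. The names (`IncidentLog`, `TimelineEntry`, `record`) are made up for illustration, not something GOV.UK uses, and a shared document or chat channel works just as well.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    at: datetime
    kind: str  # either "action" or "observation"
    who: str
    note: str


@dataclass
class IncidentLog:
    """Hypothetical helper: collects what people did and saw during an incident."""
    title: str
    entries: list = field(default_factory=list)

    def record(self, kind, who, note, at=None):
        # Separate actions from observations so the review can tell
        # what responders did apart from what they saw.
        if kind not in ("action", "observation"):
            raise ValueError("kind must be 'action' or 'observation'")
        entry = TimelineEntry(at or datetime.now(timezone.utc), kind, who, note)
        self.entries.append(entry)
        return entry

    def timeline(self):
        # Sort by time, because people often write things down late or out of order.
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M} UTC [{e.kind}] {e.who}: {e.note}" for e in ordered
        )


log = IncidentLog("Site unavailable")
log.record("observation", "alice", "Error rate rising on the frontend")
log.record("action", "bob", "Restarted the app servers")
print(log.timeline())
```

The point is the habit rather than the tool: every entry has a time, a person, and a note, which is all you need to read out a timeline later.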
It's easy for one person to overlook important details during an incident. Ideally there should be more than one person responding so you can discuss what to do next. On the other hand, having multiple people crowding around one desk just increases the pressure on one person without actually helping them. If this is happening, nominate a person to lead the incident and let them request help as needed.
The importance of the comms part depends on the impact to users. If the GOV.UK website goes down, that blocks access to government information and services, and people notice. So the GOV.UK team always assigns one person to update the status page and field questions from people who are not actively resolving the incident.
Run a postmortem after the incident
After the incident you should get everyone who was involved in a room to talk about what happened and why it happened that way. This is the postmortem (AKA incident review).
Here's a rough agenda for a postmortem:
- Remind everyone how the postmortem works
- Get someone to read out the timeline of how the incident unfolded (if you didn't produce one during the incident, write it up before the meeting so people can review ahead of time)
- Read out key details: what the impact was, how long it took to restore service, and whether there's any further investigation or fixes needed
- Discuss causes and key decisions that were made, and write down any recommendations you have
- Decide which recommendations are worth doing, and who should make them happen
You can write up a report of the postmortem as you go. If you captured the timeline during the incident all you have to do is write up your conclusions and recommendations.
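To give a feel for how little the report needs to contain, here's a rough sketch that assembles one from the pieces on the agenda above. The field names and layout are my own assumptions for illustration, not a standard template, so adapt them to whatever your team already uses.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    description: str
    owner: str = "unassigned"
    accepted: bool = False  # decided in the meeting: is this worth doing?


def render_report(title, impact, time_to_restore, timeline, recommendations):
    """Build a plain-text report from the timeline and the meeting's conclusions."""
    lines = [
        f"Incident report: {title}",
        "",
        f"Impact: {impact}",
        f"Time to restore service: {time_to_restore}",
        "",
        "Timeline:",
    ]
    lines += [f"- {entry}" for entry in timeline]
    lines += ["", "Recommendations:"]
    for rec in recommendations:
        if rec.accepted:
            status = f"accepted, owner: {rec.owner}"
        else:
            status = "not taken forward"
        lines.append(f"- {rec.description} ({status})")
    return "\n".join(lines)


print(render_report(
    title="Site unavailable",
    impact="Users could not load any pages",
    time_to_restore="42 minutes",
    timeline=["10:02 Restarted the app servers", "10:05 Error rate back to normal"],
    recommendations=[
        Recommendation("Add alerting on error rate", owner="alice", accepted=True),
        Recommendation("Rewrite the deploy tool"),
    ],
))
```

Recording the recommendations you decided against is deliberate: it shows they were considered, which saves relitigating them after the next incident.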
Blameless postmortems are effective postmortems
It's very tempting in these meetings to latch onto a single root cause and focus exclusively on that, but that's overly simplistic. An incident can involve complex systems of software and people, and there are many factors that influence the way the incident plays out.
We stuck the retrospective prime directive on the walls of our meeting rooms. It says:
Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.
Some people roll their eyes at this, because it paints a rosy picture of everyone being perfect all the time, but that's not really the point. The goal of the postmortem is to learn from what happened and avoid the same problems happening again. Assigning blame is a very ineffective way to do that, because it encourages people to hide their mistakes and not address them.
It's also very easy to look back on an event and assume the outcome was obvious, when it really wasn't at the time (this is called hindsight bias). So it helps to assume people responded in a reasonable way, and think about what information and tools they had access to when they made decisions.
Everyone accepts some level of risk in order to get software out the door, so we shouldn't panic when incidents actually happen. The best thing we can do when something goes wrong is to learn from it, and I think blameless postmortems are one of the best tools we have for actually improving the way we do things. GOV.UK would not be as resilient as it is if not for this continuous feedback and improvement.
I think you can also learn a lot from reading other people's incident reports (see this collection of postmortems for some interesting ones).
How do you handle incidents where you work?