Mat

It's ok for things to go wrong

Until recently I worked on the GOV.UK team at the UK government digital service. One thing this team is really good at is responding to incidents and learning from them.

I've been thinking a lot about this lately as my new team is still formalising how to respond to incidents, and as a result it's a lot more chaotic when stuff actually goes wrong. But it doesn't have to be this way.

I think most software teams would benefit from treating incidents as a normal thing that occasionally happens, rather than something scary and bad, to be avoided at all costs.

[Image: dog drinking coffee in a burning house, captioned "this is fine"]

What should you do during an incident?

An incident can be any unexpected interruption to a service people care about. For example:

  • the service is completely unavailable
  • something is unacceptably slow
  • functionality is broken in a way that stops people using it
  • you discover a security vulnerability or breach

During the incident you don't always need to understand the root causes. The goal is to restore normal service as quickly as possible, and communicate what is happening to the people affected (and management).

However, you want to be able to understand the root causes later on, so you should keep a record of actions you took and things you observed. Without this it can be impossible to understand what happened, particularly if some of the actions you took actually made things worse (this can happen). So write stuff down.
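
If you want something lower-effort than editing a document under pressure, a tiny helper script can do the note-taking for you. This is just a minimal sketch (not from the original post): the log file name and the row format are assumptions, chosen so entries can be pasted straight into the timeline table in the template further down.

```python
#!/usr/bin/env python3
"""Append a timestamped note to a shared incident log.

Minimal sketch for keeping a record during an incident; the log file
name and row format are assumptions, not an established standard.
"""
import sys
from datetime import datetime, timezone

LOG_FILE = "incident-log.md"  # hypothetical shared log file


def log(note: str) -> None:
    # Use UTC so the timeline is unambiguous across time zones.
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    with open(LOG_FILE, "a") as f:
        f.write(f"| {timestamp} | {note} |\n")


if __name__ == "__main__":
    log(" ".join(sys.argv[1:]) or "(empty note)")
```

Running it with a note as the argument (for example `python3 incident_log.py "restarted app servers, error rate unchanged"`) records the action with a timestamp, and the accumulated rows can be reviewed or pruned when you write up the postmortem.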

Don't panic

It's easy for one person to overlook important details during an incident. Ideally there should be more than one person responding so you can discuss what to do next. On the other hand, having multiple people crowding around one desk (or spamming a slack channel) can increase the pressure on the person responding without actually helping them. If this is happening, nominate a person to lead the incident and let them request help as needed.

The importance of the comms part depends on the impact on users. If the GOV.UK website goes down, that blocks access to government information and services, and people notice. So they always assign one person to update the status page and field questions from people who are not actively resolving the incident.

Run a postmortem after the incident

After the incident you should get everyone who was involved in a room to talk about what happened and why it happened that way. This is the postmortem (AKA incident review).

Here's a rough agenda for a postmortem:

  1. Remind everyone how the postmortem works
  2. Get someone to read out the timeline of how the incident unfolded (if you didn't produce one during the incident, write it up before the meeting so people can review ahead of time)
  3. Read out key details: what the impact was, how long it took to restore service, and whether there's any further investigation or fixes needed
  4. Discuss causes and key decisions that were made, and write down any recommendations you have
  5. Decide which recommendations are worth doing, and who should make them happen

You can write up a report of the postmortem as you go. If you captured the timeline during the incident, all you have to do is write up your conclusions and recommendations.

Blameless postmortems are effective postmortems

It's very tempting in these meetings to latch onto a single root cause and focus exclusively on that, but that's overly simplistic. An incident usually involves complex systems of software and people, and there are many factors that influence the way it plays out.

There's a thing called the retrospective prime directive, which we stuck on the walls of our meeting rooms. It says:

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

Some people roll their eyes at this, because it paints a rosy picture of everyone being perfect all the time, but that's not really the point. The goal of the postmortem is to learn from what happened and avoid the same problems happening again. Assigning blame is a very ineffective way to do that, because it encourages people to hide their mistakes and not address them.

It's also very easy to look back on an event and assume the outcome was obvious, when it really wasn't at the time (this is called hindsight bias). So it helps to assume people responded in a reasonable way, and think about what information and tools they had access to when they made decisions.

I've attended some incident reviews where senior stakeholders were invited. In my opinion, this is a very bad idea. When there is a power differential it's very difficult to keep the meeting on track. It's harder for team members to admit mistakes, and probing questions from stakeholders can easily steer the conversation back to proximate causes. If the stakeholder is accountable for a particular area, they will be focused on that, and it will be harder to get the group thinking about the bigger picture.

Good questions to ask during a postmortem

Good meeting hygiene makes a massive difference. You should always nominate someone to take notes and someone to ensure the meeting runs smoothly.

Part of this is encouraging people to be precise in their language and to distinguish between observations and post-incident interpretations.

Here are some prompts you can use to dig into what really happened and why. These are from The Field Guide To Understanding Human Error by Sidney Dekker (there's also a printable version).

Cues

  • What were you focused on?
  • What were you seeing?
  • What were you expecting to happen?

Interpretation

  • If you had to describe the situation to your colleague at that point, what would you have told them?

Errors

  • What mistakes (for example in interpretation) were likely at this point?

Previous knowledge/experience

  • Were you reminded of any previous experience?
  • Did this situation fit a standard scenario?
  • Were you trained to deal with this situation?
  • Were there any rules that applied clearly here?
  • Did any other sources of knowledge suggest what to do?

Goals

  • What were you trying to achieve?
  • Were there multiple goals at the same time?
  • Was there time pressure or other limitations on what you could do?

Taking Action

  • How did you judge you could influence the course of events?
  • Did you discuss or mentally imagine a number of options or did you know straight away what to do?

Outcome

  • Did the outcome fit your expectation?
  • Did you have to update your assessment of the situation?

Communications

  • What communication medium(s) did you prefer to use (phone, chat, email, video conference, etc.)?
  • Did you make use of more than one communication channel at once?

Help

  • Did you ask anyone for help?
  • What signal brought you to ask for support or assistance?
  • Were you able to contact the people you needed to contact?

Example template

[Update: September 2020]

Here's a template I used with another team to record incident and postmortem details in a GitHub repository. This is based on an example from Site Reliability Engineering, by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy.

# Title (incident #)

### Linked tickets
Include here any tickets/pull requests related to the incident or its follow up actions.

### Summary
Explain what happened in one or two sentences.

### Impact
How did this affect the users of the service? Quantify the impact if possible.

### Detection
How did the problem get noticed?

If it was raised through a ticket, what level was it reported as (low/medium/high impact)?

### Resolution
How did we restore service?

### Contributing factors
Discuss this in an incident review meeting. The goal is to identify systemic problems, so you can then generate action items which will make the system safer.

Do not stop at a single root cause. When things go wrong there are usually multiple points where an intervention could have averted or mitigated the incident.

## Action Items

- To be agreed

## Lessons Learned

### What went well

- Something good

### What went wrong
- Something bad

### Where we got lucky
- Something fortunate

## Timeline
All times are assumed to be local time unless specified otherwise.

| Time | Event |
| -----|-------|
| YYYY-mm-dd HH:MM:SS | Something happened | 
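
One way to organise these reports (my assumption, not something the template prescribes) is to keep one markdown file per incident in the repository, named by date plus a short description, for example `incidents/2020-09-14-search-outage.md`. That makes it easy to link a postmortem from tickets and to find past incidents when something similar happens again.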


Summary

Everyone accepts some level of risk in order to get software out the door, so we shouldn't panic when incidents actually happen. The best thing we can do when something goes wrong is to learn from it, and I think blameless postmortems are one of the best tools we have for actually improving the way we do things. GOV.UK would not be as resilient as it is if not for this continuous feedback and improvement.

I think you can also learn a lot from reading other people's incident reports (see this collection of postmortems for some interesting ones).

How do you handle incidents where you work?
