TL;DR: Get the template here
The more time you spend working for tech companies, the more likely it is that you will run a retrospective for an incident.
There are lots of theories out there about how to perform retros, from Amazon's "Five Whys" strategy to Netflix's Dispatch tooling.
For most engineering teams, however, incident response ends with the fix and a couple of words or paragraphs about what happened. This is a great starting place, but oftentimes focusing on what happened over what caused it to happen means we miss the chance to see the patterns in our incidents and responses, and to learn from and build on them. Making the shift from thinking about incidents to thinking about the systems that produce them is a great next step to improve your responses.
I've been using a template to record all the information related to an incident for a while now, and I thought you all might like to see it!
The template focuses on a few main sections and is a simple-to-copy Google Doc. You can find it here! The rest of this post will be a breakdown of each section and how to use it!
The header of the document focuses on some simple details about the event. First, put in the incident's date and give this document a title. Focus on something short and descriptive like "Product Name AWS outage". Ideally, you will be filling this out on the same day it happens, but if not, make sure the date at the top is the date the incident happened. This makes it easy to search logs, chat messages, and PRs for what happened that day if people want to do more investigation.
The next section is Details, which records the type of event. Type definitions are largely up to you, but make sure they are standardized and documented somewhere. Some examples might be "Vulnerability", "Third Party", or "Production Bug". I'd recommend starting off with just a few types and adapting your system as you gather more information - if a few incidents are similar and don't cleanly fit a type, consider adding a new one.
Up next is the name of the Assignee - this is the incident leader, the person responsible for handling the incident (and hopefully the person filling out the template).
Next is a bit of metadata on this incident: Were customers impacted, what notifications were sent when it happened, and what was the source of detection (i.e. who or what reported this incident)?
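If you ever want to track these header and details fields outside of the doc, one option is to mirror them in a small record type. The following is a minimal Python sketch, not part of the template itself: the class and field names (`IncidentType`, `IncidentHeader`, `source_of_detection`, and so on) are assumptions chosen for illustration, the three types are just the examples mentioned above, and the sample values are made up.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class IncidentType(Enum):
    """Hypothetical starter set of incident types; adapt it to your own list."""
    VULNERABILITY = "Vulnerability"
    THIRD_PARTY = "Third Party"
    PRODUCTION_BUG = "Production Bug"


@dataclass
class IncidentHeader:
    """Header and details fields of the template (field names are assumed)."""
    title: str
    incident_date: date        # the day the incident happened, not the write-up day
    incident_type: IncidentType
    assignee: str              # the incident leader
    customers_impacted: bool
    source_of_detection: str   # who or what reported the incident
    notifications_sent: list[str] = field(default_factory=list)


# Made-up example values for illustration only.
header = IncidentHeader(
    title="Product Name AWS outage",
    incident_date=date(2024, 1, 15),
    incident_type=IncidentType.THIRD_PARTY,
    assignee="Example Engineer",
    customers_impacted=True,
    source_of_detection="on-call alert",
    notifications_sent=["status page update"],
)
```

Using an enum for the type means a typo like "Thrid Party" fails loudly instead of quietly creating a new category, which keeps the types standardized as the list grows.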
The next bit is a table for writing up the details of the incident.
First, type up a quick description of what's happening. Feel free to tweak and refine this as the situation changes - at the end you should have a short overview of the event and solution.
This is a great section for tossing quick notes as you go. It helps keep track of various findings and artifacts while you are still learning about them. This section is most useful when you are still working on the incident, but it can also store any links/notes that don't fit into other categories later.
Here is where we start to get to the important bits of the document. Instead of focusing on root cause analysis or the Five Whys, we're going to focus on contributors and enablers. No incident is ever caused by just one thing; rather, incidents are the result of a flaw in a complex system of tooling, engineers, and opportunity costs. Here, we focus on the things that enabled this incident to occur or contributed to the impact it had. For example, a broken production push might have been enabled by a misconfigured deployment file or a lack of testing around a specific area of the application. Silenced logs and an engineer on vacation might have contributed to the impact it had. By logging these things, we get an idea of the holes in our system.
Mitigators are the things that went right - what made this have less of an impact than it could have? For example, perhaps a critical notification was caught by the on-call process or an engineer happened to be looking at a page when they noticed a bug. These kinds of insights help you develop resilient pathways through which your incidents travel.
This section is relatively straightforward - were customers impacted and what was the extent of that impact? Did production go down for everyone or was this a small bug for a single customer?
This section is pretty similar to the "Contributors & Enablers" section, but more focused on "one-off" events that made this more difficult to handle. Was there a holiday that made it hard to coordinate? Maybe the logs weren't logging correctly or the incident monitoring software didn't ping right away. These events aren't always solvable, but they do give a clearer picture of your weak spots.
This section is for any artifacts that were created for the resolution of this incident - charts, documents, dashboards, etc. Anything that was generated for the explicit purpose of gathering/collating data for the resolution of the incident.
This section is for all the links, PRs, tickets, etc. that are related to this incident. This should give a holistic view of all the work related to this incident.
The final page contains a basic table template for tracking the timeline of the event with a simple branching color scheme to highlight co-incident work. It's a lot easier to fill this out as you go, but it can be filled out after the fact as well. Reading this should give a casual observer a basic understanding of what happened and what steps were taken to solve the incident.
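To make the shape of that timeline table concrete, here is a rough Python sketch of the same idea. It is an illustration under assumptions, not how the Google Doc works: `TimelineEntry`, `render_timeline`, and the `track` field (standing in for the color that marks a branch of parallel work) are hypothetical names, and the entries are invented.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEntry:
    when: datetime
    track: str  # stands in for the color used for a branch of parallel work
    note: str


def render_timeline(entries):
    """Return the timeline as plain-text rows, ordered by time."""
    rows = []
    # sorted() is stable, so entries with the same timestamp keep their input order
    for entry in sorted(entries, key=lambda e: e.when):
        rows.append(f"{entry.when:%H:%M} | {entry.track:<10} | {entry.note}")
    return "\n".join(rows)


# Invented entries; they can be jotted down in any order and sorted later.
entries = [
    TimelineEntry(datetime(2024, 1, 15, 9, 20), "rollback", "Started reverting the deploy"),
    TimelineEntry(datetime(2024, 1, 15, 9, 5), "main", "Alert received by on-call"),
    TimelineEntry(datetime(2024, 1, 15, 9, 20), "comms", "Posted status page update"),
]
print(render_timeline(entries))
```

Two entries sharing a timestamp but sitting on different tracks are the text equivalent of two colored branches running side by side.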
This document should help you start tracking and logging the important information around your incidents, but all that data isn't super useful unless you are learning from it. A great way to do that is to have incident review meetings monthly - each team puts together a quick slideshow with the data contained here and presents it to the other teams. Bonus points for taking notes and pointing out commonalities. Even bonus-er points for coming up with some strategies to try out to lessen the impact of contributors or make mitigators even more impactful.
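If the contributors and mitigators from several write-ups are copied into a simple list, spotting commonalities can be as small as a tally. The sketch below assumes a made-up structure (a list of dicts with a `contributors` key) and a hypothetical helper named `recurring`; none of it comes from the template itself.

```python
from collections import Counter

# Invented example data: each incident lists its contributors and enablers.
incidents = [
    {"title": "Checkout outage",
     "contributors": ["misconfigured deployment file", "silenced logs"]},
    {"title": "Slow dashboard",
     "contributors": ["missing tests", "silenced logs"]},
    {"title": "Failed export",
     "contributors": ["misconfigured deployment file", "silenced logs"]},
]


def recurring(incidents, key, min_count=2):
    """Return (item, count) pairs for items listed under `key`
    in at least `min_count` incidents, most frequent first."""
    counts = Counter()
    for incident in incidents:
        # set() so an item repeated within one incident is counted once
        counts.update(set(incident.get(key, [])))
    return [(item, n) for item, n in counts.most_common() if n >= min_count]


print(recurring(incidents, "contributors"))
```

The same helper works for mitigators by passing a different key, which makes it easy to bring both "what keeps hurting us" and "what keeps saving us" to the monthly review.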
I hope this template is helpful! What does your incident response look like? What kind of learning does it optimize for?