Let’s face it: no matter what you do, things will go sideways. Whether it’s a bug in your code, a complication from your sources, or unexpected behaviour from your teammates, you occasionally have to fix new issues. It’s annoying as hell when something you thought was fixed pops up again. And it’s always during your lunch hour!
Don’t you hate that?
We have to face the challenges that come with failure and take the lessons we learn to become stronger. We shouldn't be afraid to fail, as it can be the first step in our journey to resilience. We have to use our mistakes to become more experienced and more successful.
You’ve been doing that your entire life, haven’t you? What if I told you that you can do it even better?
By implementing a postmortem process in your work – or even in your personal life – you can gain invaluable insights from your mistakes and use them as an opportunity to make positive changes. This process can help you develop a deeper understanding of the situation. You can also use it to inform future decisions and behaviours. Not only can this help you to become a better problem solver, but it can also provide an invaluable opportunity for personal growth.
Thanks for reading Data Gibberish! Subscribe for free to receive new posts and support my work.
I’m going to tell you what a postmortem document is and how to write one in three easy steps. I’ll also share some of my favourite reports I’ve read over the years. On top of that, I have prepared a postmortem template as a gift to you at the end of this article.
Definition
A postmortem is a document that describes the details of an incident. We usually write it after we resolve an outage. It is intended to provide an analysis of the incident, including the reasons behind it, any lessons learned, and potential action points.
At this point, you may be wondering: Why would I write a document when I have already fixed the problem? Think about it:
Writing a postmortem demands concentration and focus to ensure all necessary details are considered. It offers you the opportunity to analyse the incident in more depth and identify any issues that may be present in your processes and tools, enabling you to make your project more robust. By writing a postmortem document, you can ensure you are better prepared for similar events in the future and you can put in place the necessary changes to prevent them from occurring again.
But wait — there is more!
A postmortem report is a useful tool for providing stakeholders with information about a certain incident, offering them a detailed description of the problem as it unfolded. This report provides a clear and concise account of the incident which helps to build trust between you and your stakeholders. That allows them to feel confident in your team’s ability to handle similar situations in the future. But be careful, because with great trust comes great responsibility. 😉
Now, if you have never heard the term postmortem, it may sound a bit creepy. At least that was the case for me before I learned what it does and how it works. It turns out a postmortem is an incredibly valuable tool, used by teams to document their successes and failures and to identify ways to improve in the future. With this knowledge, postmortems can be seen in a completely different light: the light of shared knowledge and lessons learned.
_Can I be completely frank with you?_ I prefer using the term incident report, as postmortem still gives me the chills.
Does that make any sense to you? Would you like to learn how to write incident reports? Let’s dive in!
Document structure
Writing a postmortem is a straightforward process. All you need to do is compile three sections, making sure that each one communicates key elements of the process. Don’t get tricked. The process may be simple, but you should take it seriously and approach it with great care. You want to put all the details correctly and optimise it for the readers.
Now, here’s the step-by-step process:
Summary
You usually start with an overview of what happened. Include the key facts, such as affected dashboards, number of affected records, duration of the downtime, and any other information that may be relevant. Keep this section concise, so that readers who are not interested in all the technical details can still understand what occurred.
In this section, you should also include a root cause analysis of the initial failure. To identify the root cause of the incident, you’d typically use the "Five Whys" technique: you ask "why?" about the problem, then ask "why?" again about each answer, until you reach the underlying issue rather than a symptom.
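To make the "Five Whys" idea concrete, here is a minimal Python sketch. The incident, the answers, and the function name `format_five_whys` are all made up for illustration; the only point is that each answer explains the statement before it, and the last answer is treated as the root cause.

```python
# Hypothetical incident: a dashboard showed stale numbers.
# Each entry answers "why?" about the statement before it.
symptom = "The sales dashboard showed yesterday's numbers"
whys = [
    "The nightly load job did not finish",
    "The job ran out of disk space on the worker",
    "Temporary files from earlier runs were never cleaned up",
    "The cleanup step was removed during a refactoring",
    "No review checklist or test covers the cleanup step",
]


def format_five_whys(symptom, whys):
    """Render the chain of questions and answers as plain text."""
    lines = [f"Symptom: {symptom}"]
    for number, answer in enumerate(whys, start=1):
        lines.append(f"Why #{number}: {answer}")
    # The last answer in the chain is treated as the root cause.
    lines.append(f"Root cause: {whys[-1]}")
    return "\n".join(lines)


report = format_five_whys(symptom, whys)
print(report)
```

Notice that the chain ends at a process gap, not at a person. Five is a rule of thumb, so stop earlier or continue longer if that is where the real cause sits.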
Details
In this section, you provide an overview of the steps taken to diagnose and resolve the downtime. You’d usually use a timeline to structure this section, which would allow you to break down the process into distinct stages. This helps to ensure that all the necessary steps are taken to identify and address the issue as quickly as possible. Additionally, employing a timeline aids you in identifying any potential areas of weakness in your troubleshooting process. This allows you to assess your approach and make any necessary changes that can help you to improve your response in the future.
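As a rough illustration of such a timeline, the sketch below stores a few made-up events and derives how long the incident took to detect and to resolve. The timestamps, the events, and the helper `minutes_between` are assumptions for the example, not part of any standard.

```python
from datetime import datetime

# Hypothetical timeline entries: (timestamp, what happened).
timeline = [
    ("2023-01-16 12:05", "Nightly load job fails"),
    ("2023-01-16 12:40", "Stakeholder reports stale dashboard"),
    ("2023-01-16 12:55", "On-call engineer finds the full disk"),
    ("2023-01-16 13:30", "Disk cleaned up and job re-run successfully"),
]


def minutes_between(start, end, fmt="%Y-%m-%d %H:%M"):
    """Return the whole minutes between two timestamp strings."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Failure -> first report, and failure -> fix.
time_to_detect = minutes_between(timeline[0][0], timeline[1][0])
time_to_resolve = minutes_between(timeline[0][0], timeline[-1][0])

for timestamp, event in timeline:
    print(f"{timestamp}  {event}")
print(f"Time to detect: {time_to_detect} min")
print(f"Time to resolve: {time_to_resolve} min")
```

A long gap between the failure and its detection, as in this made-up case, is exactly the kind of weakness a timeline makes visible.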
But wait, let me tell you something very important!
Do I have your attention?
OK, now:
More often than not, you might be inclined to point a finger at whoever seems to be the source of a mistake. You must not do that! It is vital to foster a blameless culture within the team and to focus on what you can learn from the situation, rather than placing blame on one particular individual or group. Mistakes are an invaluable opportunity for growth, as long as you address them in the right way, and a blameless approach allows everyone to learn from them.
Actually, would you like to know a secret?
Usually, there’s not a single person responsible for an incident. Even in the rare cases when just one individual is involved, you should dig a bit deeper and look for the actual reason. Instead of blaming people, you should make your project a space for creative minds to learn in a safe environment. Here are a few questions you can ask: Should that person have had access to the system that failed? Did anybody review their work? Had you set up any monitoring and notifications?
Takeaways
You can't let your failures define you. You have to let your failures teach you.
― Barack Obama
That certainly holds true for all aspects of life, including work. You can do your best to prevent failures, yet they still occur, time and time again. What people see in you as a person and a professional is not the failures themselves, but how you use them to learn and grow. We should look into the past to build a better future. That is why I believe this section is the most important in the whole postmortem document.
Sounds good? Let’s see how to put that into action.
In this section, document what went well and what went poorly. Aim for objectivity and focus on the facts. Analyse the incident and extract any lessons learned. This could include insights about your tools, communication issues, or anything else. Don't jump to solutions yet; just consider what you learned from the incident. It is crucial to be honest and objective when reflecting on the downtime. Ask yourself questions such as: _What processes could have been improved? Were there any tools or resources that could have been better utilised? Were there any communication barriers that could have been addressed?_ Consider the entire outage and make sure to evaluate it from all angles. This will help ensure you extract the most important lessons for handling similar incidents in the future.
Now that you have gained valuable knowledge and insights from those lessons, it’s time to think about how to make your project better and how to prevent the same issues from occurring in the future. Write down a list of action points. You can do that in free text or in the form of Jira tickets. Just make sure those are linked in the postmortem document. Take your time. It is important to think about any risks that could arise from making changes, and how to mitigate them, including the effect on other projects you might have. Discuss timeframes with everyone involved and make sure your actions really make your world a better place.
That’s it. Three sections. You can make them as long or as short as you want. You can of course add more sections if you wish. There are no strict rules, as long as it works for you, your team, and your stakeholders. Just be sure to get all the details right, avoid blaming people, and focus on learning.
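If you like to keep your reports in a consistent shape, the three sections can be sketched as a small data structure. This is only one possible layout under my own assumptions: the class name `Postmortem`, its fields, and the sample incident are hypothetical, and a plain document works just as well.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """A three-section incident report: summary, details, takeaways."""
    title: str
    summary: str  # overview of the incident plus its root cause
    timeline: list = field(default_factory=list)       # the "Details" section
    lessons: list = field(default_factory=list)        # takeaways: what we learned
    action_points: list = field(default_factory=list)  # takeaways: what we will do

    def render(self):
        """Return the report as plain text, one section after another."""
        parts = [self.title, "", "Summary", self.summary, "", "Details"]
        parts += [f"- {entry}" for entry in self.timeline]
        parts += ["", "Takeaways"]
        parts += [f"- Lesson: {lesson}" for lesson in self.lessons]
        parts += [f"- Action: {action}" for action in self.action_points]
        return "\n".join(parts)


# Hypothetical incident, for illustration only.
pm = Postmortem(
    title="Stale sales dashboard (hypothetical example)",
    summary="The nightly load job failed because the worker disk was full.",
    timeline=["12:05 Job fails", "12:40 Issue reported", "13:30 Job re-run"],
    lessons=["Nobody was alerted when the job failed"],
    action_points=["Add a disk-usage alert to the worker"],
)
text = pm.render()
print(text)
```

Note that the structure has no field for who caused the incident, which keeps the report blameless by design.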
Showcase
Okay, I know what you are thinking: That looks great. Can I see some examples? Yes, you can!
As mentioned in the beginning, I’m going to share some of the best incident reports I have read over the years:
- Monzo deployed a faulty update preventing customers from doing some basic operations
- Cloudflare broke their DNS service because of wrong beliefs about time
- GitLab accidentally removed production data from their primary database and was down for about one working day
These examples are not here to shame the companies for their outages. On the contrary, I’d like to praise them for the effort they made to write those documents and be transparent. Those reports are examples of dedication to providing the best possible service, strengthening relationships, and sharing knowledge. I’d strongly encourage you to steal (I mean, learn) something from them.
Recap
To summarise:
Incident reports provide a comprehensive overview of the incident, a root cause analysis, a timeline of the steps taken to resolve the downtime, and a list of action points to prevent similar issues from occurring in the future. They are a fantastic way to learn from your mistakes and share knowledge with your audience. It is vital to foster a blameless culture and focus on the learnings that can be taken away from the situation, rather than attempting to place blame on one particular individual or group.
We didn’t just talk about what a postmortem document is. We discussed how to structure the document and what our focus should be. We also saw some extremely good examples from a few well-known companies.
And that’s it!
Well, not really. As I promised, I have a small gift for you: a postmortem template. It contains some basic examples of how to use it. I am sure the template will be useful whether you are just starting your journey in structured incident reporting or already have some experience with it.
All you need to do to access the document is to follow these three easy steps:
- Click the button below
- Input your name and email
- Click the button in the confirmation email
ProTip: You need to copy the document to your Google Drive account if you want to be able to edit it.
Yes, I want to get the template!
Don’t just take my word for it. If you are looking for a way to learn more from your mistakes, try writing postmortem documents yourself.
Now, it’s over to you.
What do you think? Let me know in the comments, or message me directly. Please, be brutally honest!