Jesse van Herk for Jobber

Posted on Nov 26, 2021

Incident Post-Mortems at Jobber

#productivity #devops #postmortems

No matter how stable your software product is, occasionally things go wrong in production, and Jobber is committed to doing a post-mortem investigation to follow up and learn from each incident.

At a high level, an incident post-mortem answers these questions:

What went wrong?
What did we do to fix it?
What will we do differently, so it doesn't happen again?
What went well during the incident, that we should keep doing?

As we’ve grown and moved to a remote working environment, we’ve changed our process to work better for remote teams and super busy schedules. This is a summary of what we’re doing to make sure that incidents remain rare and our customers can keep getting their work done!

Our process

Our process is broken down into four steps: resolve the incident, investigate it, debrief about it, then share the results.

Collect data during the incident. We collect as much data as we can in a Slack channel dedicated to incidents, keeping it organized with threads. This includes server graphs, snippets from logs, and screenshots showing what was going on at each point in the incident. It doesn’t all end up being useful, but it’s nice to have everything collected when you start going through the investigation.

Start the investigation right away. We get one of the involved people to take on the role of lead investigator, which really means they’re in charge of making sure the investigation gets done, the post-mortem document gets filled in, and the debrief gets held. Starting it right away makes sure nothing gets lost.

Review the results within a week. While things are still fresh, hold a debrief to review the post-mortem document, discuss the action items, and make any edits needed. This is a 30-60 minute Zoom session with the team involved in the incident as well as reps from other departments (mainly the customer support/escalation team).

Share the results as soon as the debrief is done, so everyone gets a chance to learn from it! We post it to a Slack channel that the whole company has access to, for transparency.

New Challenges

With a larger company, people working in all sorts of time zones, and everyone being remote, scheduling and coordinating got a lot more complicated. The process is still mostly the same, but with some tweaks to keep it effective.

Shorter timelines

We’ve shortened the timeline expectations - getting the incident doc started faster and the debrief done sooner helps us capture all the data while it’s still fresh, and lets everyone involved get back to their sprint work sooner.

Assume async

Scheduling the debrief sooner means that it’s harder to find a spot in everyone’s calendars. Rather than pushing the meeting further and further out, do more of the work asynchronously. Make sure the document can stand on its own, and use Slack to ask people for their contributions.

We also record the debrief (easy with Zoom) so that anyone who couldn’t attend can watch it later, and nobody has to worry about missing out.

Simple incident doc template

We’re using a wiki template for consistency, and over time we’ve repeatedly simplified the template so there are fewer sections to worry about.

Setting it up with a button to auto-create the new page from the template works well.

The template has sections for:

Impact and Scope
Trigger (what started the incident)
Resolution (what ended up fixing it)
Timeline of events
Root Cause
What went well
What didn’t go well
Action items
Data & Analysis (all the charts!)
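
As a rough illustration of what an auto-created page might start from, here is a minimal Python sketch that renders an empty skeleton from the section list above. The function name, the title format, the "TODO" placeholders, and the example incident are all made up for this sketch; it is not Jobber's actual wiki template or tooling.

```python
# Section names come from the list above; everything else is illustrative.
TEMPLATE_SECTIONS = [
    "Impact and Scope",
    "Trigger",
    "Resolution",
    "Timeline of events",
    "Root Cause",
    "What went well",
    "What didn't go well",
    "Action items",
    "Data & Analysis",
]


def render_incident_doc(title: str, incident_date: str) -> str:
    """Return a plain-text skeleton with one empty section per heading."""
    # Hypothetical title format; a real wiki page would use its own markup.
    lines = [f"Post-mortem: {title} ({incident_date})", ""]
    for section in TEMPLATE_SECTIONS:
        lines.append(section)
        lines.append("TODO")  # placeholder for the investigator to fill in
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up incident, purely to show the output shape.
    print(render_incident_doc("Example outage", "2021-01-01"))
```

Keeping the section list in one place means that simplifying the template later is a one-line change.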

Asking for input from customer-facing teams right away

Our customer success team always has great input and is able to help fill in gaps in the timeline. We reach out to them early so there’s time for their input to be added into the post-mortem doc before the debrief. Waiting for the debrief is too late!

Tracking action items in Jira

Why track action item progress in an incident doc when we already have a standard tool for tracking work? As soon as we can, we get all action items from post-mortems in as Jira tickets so they can be assigned to backlogs and don’t get lost.

We also have some reports set up to view the list of outstanding post-mortem actions - driven by a post-mortem label on the items.
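
To make the label-driven approach concrete, here is a small Python sketch under some assumptions: the label is literally named "post-mortem", action items are filed as the "Task" issue type, and the payload follows the general shape of a Jira REST (v2-style) issue-creation body with a plain-text description. The function names, project key, and URL are hypothetical, and the sketch only builds data; it does not call Jira.

```python
import json

# Assumed label name; the post only says "a post-mortem label".
POSTMORTEM_LABEL = "post-mortem"


def build_issue_payload(project_key: str, summary: str, incident_doc_url: str) -> dict:
    """Build a request body for one action item (not sent anywhere here)."""
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": summary,
            # Link back to the incident doc so the ticket has context.
            "description": f"Action item from post-mortem: {incident_doc_url}",
            "issuetype": {"name": "Task"},  # assumed issue type
            "labels": [POSTMORTEM_LABEL],
        }
    }


def outstanding_actions_jql(label: str = POSTMORTEM_LABEL) -> str:
    """Return a JQL filter for unresolved issues carrying the label."""
    return f'labels = "{label}" AND resolution = Unresolved ORDER BY created ASC'


if __name__ == "__main__":
    # Hypothetical project key and wiki URL.
    payload = build_issue_payload(
        "OPS", "Add alert for queue backlog", "https://wiki.example.com/incident-123"
    )
    print(json.dumps(payload, indent=2))
    print(outstanding_actions_jql())
```

A saved filter built on a query like this is one way a report of outstanding post-mortem actions could be driven by the label alone.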

Have a section for “things we should do if we have time”

Realistically, not all action items are actually actionable - some are more aspirational or are something we just need everyone to keep in mind. In order to keep the Jira action items clearer, we’ve included this section as a spot to put the things we think are important but we couldn’t turn into assignable/trackable work.

Our approach is that it’s better to have a smaller set of action items that we actually do than a giant list of things we’d like to do given infinite time.

Keep it Blameless

This one isn’t actually new, but it’s well worth repeating! We’re interested in what happened and what we’re going to do to fix it going forward, not in pointing fingers.

"Removing blame from a postmortem gives people the confidence to escalate issues without fear."
– the SRE book

About Jobber

We're hiring for remote positions across Canada at all software engineering levels!

Our awesome Jobber technology teams span across Payments, Infrastructure, AI/ML, Business Workflows & Communications. We work on cutting edge & modern tech stacks using React, React Native, Ruby on Rails, & GraphQL.

If you want to be a part of a collaborative work culture, help small home service businesses scale and create a positive impact on our communities, then visit our careers site to learn more!
