Post-Mortem Best Practices That Actually Drive Change

#sre #postmortem #incidents #devops

The Post-Mortem Nobody Learns From

I've sat through hundreds of post-mortems. Most follow the same pattern: something breaks, someone writes a Google Doc, we have a meeting, we list action items, nobody follows up, and the same thing happens again three months later.

Here's how to break the cycle.

The Blameless Culture Trap

"Blameless" doesn't mean "actionless." The biggest failure mode I see is teams that use blameless culture as an excuse to avoid accountability.

Blameless means: we don't punish the person who pushed the bad deploy.
Blameless does NOT mean: nobody is responsible for fixing the systemic issue.

My Post-Mortem Template

# Incident: [SERVICE] [SYMPTOM] on [DATE]

## Impact
- Duration: X minutes
- Users affected: N
- Revenue impact: $X
- SLO budget consumed: X%

## Timeline (UTC)
- HH:MM - First alert fired
- HH:MM - On-call acknowledged
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Service recovered
- HH:MM - All-clear declared

## Root Cause
[2-3 sentences. Technical but readable.]

## Contributing Factors
1. [Factor that made the incident possible]
2. [Factor that made detection slow]
3. [Factor that made resolution slow]

## What Went Well
- [Something that worked]
- [Something that helped]

## What Went Wrong
- [Process failure]
- [Technical gap]

## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| ...    | ...   | P1/P2/P3 | ...      | Open   |

## Lessons Learned
[1-2 paragraphs of genuine insight]
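The Duration and SLO budget lines in the Impact section can be derived from the timeline rather than estimated. Here is a minimal sketch, assuming a time-based availability SLO measured over a 30-day window, with the whole outage counted as downtime. The function name, the 99.9% target, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def budget_consumed_pct(outage: timedelta, slo_target: float,
                        window: timedelta = timedelta(days=30)) -> float:
    """Percent of the error budget used by a single outage.

    Assumes a time-based availability SLO (e.g. 0.999) measured over
    `window`, with the full outage counted as downtime.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    error_budget = window * (1 - slo_target)  # allowed downtime in the window
    return 100 * (outage / error_budget)


# Hypothetical timeline entries (UTC): "First alert fired" and "Service recovered".
first_alert = datetime(2024, 1, 15, 14, 2)
recovered = datetime(2024, 1, 15, 14, 22)

duration = recovered - first_alert
print(f"Duration: {duration.total_seconds() / 60:.0f} minutes")
print(f"SLO budget consumed: {budget_consumed_pct(duration, 0.999):.1f}%")
```

With these made-up numbers, a 20-minute outage uses a bit under half of the roughly 43 minutes of downtime that a 99.9% target allows in 30 days. If your SLO is request-based rather than time-based, the same ratio applies to failed requests instead of minutes.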

The Action Item Problem

Action items from post-mortems have a 30% completion rate industry-wide. That's terrible. Here's why:

- Too many items (I've seen post-mortems with 15 action items)
- No clear ownership
- No deadline
- No follow-up mechanism
- Competing with feature work

The Fix: Three Rules

Rule 1: Maximum 3 action items per post-mortem.

If you can't narrow it to 3, you haven't identified the real problems.

Rule 2: Every action item gets a Jira ticket scheduled in the next sprint.

Not "someday." Not "backlog." Next sprint. If it's not important enough for next sprint, it's not an action item.

Rule 3: Review completion in the next post-mortem.

Start every post-mortem meeting by reviewing open action items from previous incidents. This creates accountability without blame.

# Post-mortem meeting agenda

1. Review open action items (10 min)
   - Incident #42: "Add circuit breaker" — DONE
   - Incident #43: "Add canary deploys" — IN PROGRESS (blocked on CI)
   - Incident #44: "Fix retry logic" — NOT STARTED (reassigning)

2. Current incident review (30 min)
   - Timeline walkthrough
   - Contributing factors
   - Action items (max 3)

3. Pattern analysis (10 min)
   - Any recurring themes?
   - Systemic issues to address?
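The first two rules, plus the ownership and deadline gaps listed earlier, can be checked mechanically before a post-mortem is published. This is a rough sketch, not a real integration: the `ActionItem` fields, the `check_action_items` helper, and the sprint names are all hypothetical, and nothing here talks to an actual ticket tracker.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

MAX_ACTION_ITEMS = 3  # Rule 1


@dataclass
class ActionItem:
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None
    ticket: Optional[str] = None  # tracker issue key, if one exists
    sprint: Optional[str] = None  # sprint the ticket is scheduled in


def check_action_items(items: List[ActionItem], next_sprint: str) -> List[str]:
    """Return a list of rule violations; an empty list means the items pass."""
    problems = []
    if len(items) > MAX_ACTION_ITEMS:
        problems.append(
            f"{len(items)} action items; the limit is {MAX_ACTION_ITEMS}"
        )
    for item in items:
        if not item.owner:
            problems.append(f"'{item.action}' has no owner")
        if item.due is None:
            problems.append(f"'{item.action}' has no due date")
        if not item.ticket:
            problems.append(f"'{item.action}' has no ticket")
        elif item.sprint != next_sprint:  # Rule 2: next sprint, not backlog
            problems.append(f"'{item.action}' is not in {next_sprint}")
    return problems


items = [
    ActionItem("Add circuit breaker", "alice", date(2024, 2, 1), "OPS-101", "Sprint 12"),
    ActionItem("Fix retry logic", None, None, "OPS-102", "Backlog"),
]
for problem in check_action_items(items, next_sprint="Sprint 12"):
    print(problem)
```

A check like this could run when the document is submitted for review, so a post-mortem with unowned or unscheduled items never reaches the meeting.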

The Metric That Matters

Track Repeat Incident Rate: what percentage of incidents have the same root cause as a previous incident?

When we started tracking this, our repeat rate was 45%. After implementing the three rules above, it dropped to 12% over six months.

That's the real measure of whether your post-mortems are working.
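One way to compute Repeat Incident Rate is sketched below. It assumes each incident is tagged with a short root-cause label and that the list is in chronological order. The function name and the sample labels are hypothetical, and how you normalize root causes into comparable labels is the hard part this sketch glosses over.

```python
from typing import Iterable


def repeat_incident_rate(root_causes: Iterable[str]) -> float:
    """Percent of incidents whose root cause matches an earlier incident.

    `root_causes` holds one label per incident, oldest first. Labels are
    compared case-insensitively after trimming whitespace.
    """
    seen = set()
    total = 0
    repeats = 0
    for cause in root_causes:
        key = cause.strip().lower()
        total += 1
        if key in seen:
            repeats += 1  # same root cause as a previous incident
        else:
            seen.add(key)
    return 100 * repeats / total if total else 0.0


# Hypothetical root-cause labels for five incidents.
history = [
    "retry storm",
    "expired certificate",
    "Retry storm",
    "bad config push",
    "expired certificate",
]
print(f"Repeat incident rate: {repeat_incident_rate(history):.0f}%")
```

Here two of the five incidents repeat an earlier root cause, so the rate is 40%. Tracking the number per quarter makes the trend visible.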

If you're looking for better incident learning loops and pattern detection across your post-mortems, check out what we're building at Nova AI Ops.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
