DEV Community

Samson Tanimawo

Incident Retrospectives Without Blame

I've run over 100 post-mortems. The worst ones end with 'Alice will be more careful.' The best ones end with 'we fixed the system.' Here's how you get from the first to the second.

The language rule

Ban these phrases from retros:

  • 'Should have...'
  • 'Alice forgot to...'
  • 'If only...'

Replace with:

  • 'The system let this happen because...'
  • 'The runbook didn't cover...'
  • 'The signal was missing...'

People make mistakes. Systems that allow those mistakes to reach production are the actual bug.
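
If you want a mechanical nudge, the language rule can be checked against a retro draft before the meeting. This is a minimal sketch in Python, not an existing tool: `BLAME_PATTERNS` and `find_blame_phrases` are hypothetical names, and the patterns only cover the three phrases listed above ('forgot to' stands in for 'Alice forgot to...').

```python
import re

# Hypothetical pattern list, built from the banned phrases in this post.
# Extend it with whatever blame language shows up in your own retros.
BLAME_PATTERNS = [
    re.compile(r"\bshould have\b", re.IGNORECASE),
    re.compile(r"\bforgot to\b", re.IGNORECASE),
    re.compile(r"\bif only\b", re.IGNORECASE),
]


def find_blame_phrases(text):
    """Return (line_number, matched_phrase) pairs for each banned phrase found."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for pattern in BLAME_PATTERNS:
            match = pattern.search(line)
            if match:
                hits.append((line_number, match.group(0)))
    return hits


draft = (
    "Alice forgot to run the migration.\n"
    "The runbook didn't cover rollbacks.\n"
    "We should have caught this."
)
for line_number, phrase in find_blame_phrases(draft):
    print(f"line {line_number}: rephrase '{phrase}' as a system statement")
```

A check like this only flags wording. Someone still has to rewrite the sentence so it describes what the system allowed.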

The 5 whys that work

Classic 5 whys often end at human error. Push through that.

  • Why did the outage happen? Alice deployed a broken config.
  • Why did the broken config deploy? Our config validation didn't catch it.
  • Why didn't validation catch it? It didn't cover this edge case.
  • Why didn't it cover it? We didn't have a test for this case.
  • Why didn't we have a test? Nobody owns the config validation pipeline.

Action item: assign an owner to config validation and add the missing test. That's a system fix.
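
The 'push through' step can also be written down as a check: a chain of whys is not finished while its last answer still names a person. This is a rough sketch under stated assumptions. The `Why` class and `ends_at_person` function are hypothetical, and the name matching is a plain substring test, so it is only as good as the list of names you give it.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    answer: str


def ends_at_person(chain, people):
    """True if the last answer in the chain names one of the given people."""
    if not chain:
        return False
    last_answer = chain[-1].answer.lower()
    # Crude substring match; assumes `people` holds the names used in the retro.
    return any(name.lower() in last_answer for name in people)


# The chain from the example above.
chain = [
    Why("Why did the outage happen?", "Alice deployed broken config."),
    Why("Why did broken config deploy?", "Our config validation didn't catch it."),
    Why("Why didn't validation catch it?", "It didn't cover this edge case."),
    Why("Why didn't it cover it?", "We didn't have a test for this case."),
    Why("Why didn't we have a test?", "Nobody owns the config validation pipeline."),
]

print(ends_at_person(chain[:1], ["Alice"]))  # stopping after the first why
print(ends_at_person(chain, ["Alice"]))      # the full chain
```

Stopping after the first why leaves the chain pointing at Alice. The full chain ends at an ownership gap, which is something an action item can fix.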

The people thing

Blameless doesn't mean consequence-free. If someone keeps making the same mistake after the system is fixed, that's a management issue, not a retro issue. Handle it privately, not in the retro.

Retros are for learning. Everything else belongs elsewhere.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
