Mrinal Narang

Posted on Jun 29

Blameless Postmortems in Practice

#devops #management #sre

Most teams claim they do blameless postmortems.

Then the incident happens.

"Jane didn't validate the input."

"The on-call missed the alert."

"We should have caught this in code review."

That's blame. It's just dressed up in process language.

The Gap

Blameless postmortems aren't about ignoring human error. They're about understanding why a reasonable person made a decision that, in hindsight, was wrong.

The question isn't: "What did Jane do wrong?"

It's: "What made Jane's action seem reasonable at the time?"

If you can't answer the second question, your postmortem isn't blameless. It's just performative.

What Actually Happens

Blameless postmortem (real):

"The deployment happened without running tests. Why?

The test environment was down for maintenance.
Nobody documented which environment Jane should use instead.
It was 11 PM on a Friday.
Jane has deployed 200 times without incident.
The process allowed skipping tests if 'urgent.'

So we added automated test gates that can't be bypassed. We documented the backup environment. We made urgent deployments require two people."

Blame-driven postmortem (disguised):

"The deployment happened without running tests.

Root cause: Insufficient process discipline.

Action item: Remind team to follow procedures."

The first changes behavior. The second just documents that someone messed up.
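The action items in the real postmortem can be read as a gate in the deploy pipeline rather than a reminder to people. Here is a minimal sketch, assuming a pipeline that knows whether tests passed, whether a deployment is flagged urgent, and who is involved. The `Deployment` class, its fields, and `gate_failures` are hypothetical names for illustration, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    # Hypothetical fields; map them to whatever your pipeline actually records.
    tests_passed: bool
    urgent: bool = False
    people: frozenset = frozenset()  # everyone signing off, including the deployer


def gate_failures(deployment: Deployment) -> list[str]:
    """Return the reasons a deployment is blocked (an empty list means it may proceed)."""
    failures = []
    # The test gate has no 'urgent' exception, so it cannot be skipped.
    if not deployment.tests_passed:
        failures.append("tests have not passed")
    # Urgent deployments need a second person involved.
    if deployment.urgent and len(deployment.people) < 2:
        failures.append("urgent deployments require two people")
    return failures


if __name__ == "__main__":
    friday_night = Deployment(tests_passed=False, urgent=True, people=frozenset({"jane"}))
    for reason in gate_failures(friday_night):
        print("blocked:", reason)
```

The point of the sketch is that the 11 PM deployment gets stopped by the system itself, and the error message tells the engineer what is missing instead of relying on them to remember a procedure.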

The Test

Read your last three postmortems.

Count how many times you see:

"Person X should have..."
"We should have caught..."
"Insufficient discipline..."
"Better communication would have..."

If the focus is on what people should do differently, you're not doing blameless postmortems. You're doing blame with better language.
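The count described above can be roughly automated. This is a minimal sketch, assuming your postmortems are available as plain text. The patterns are hypothetical approximations of the four phrases listed, and a hit is only a prompt to reread that sentence, not proof of blame.

```python
import re
from collections import Counter

# Hypothetical patterns modeled on the phrases listed above;
# adjust them to your own team's vocabulary.
BLAME_PATTERNS = {
    # Covers both "Person X should have..." and "We should have caught..."
    "should have": re.compile(r"\bshould(?:\s+have|'ve)\b", re.IGNORECASE),
    "insufficient discipline": re.compile(
        r"\binsufficient\s+(?:\w+\s+)?discipline\b", re.IGNORECASE
    ),
    "better communication": re.compile(r"\bbetter\s+communication\b", re.IGNORECASE),
}


def count_blame_phrases(text: str) -> Counter:
    """Count how often each blame-flavored pattern appears in a postmortem's text."""
    counts = Counter()
    for label, pattern in BLAME_PATTERNS.items():
        counts[label] = len(pattern.findall(text))
    return counts


if __name__ == "__main__":
    sample = (
        "The deployment happened without running tests. "
        "Root cause: insufficient process discipline. "
        "We should have caught this in code review."
    )
    for label, n in count_blame_phrases(sample).items():
        print(f"{label}: {n}")
```

Running it over your last three postmortems gives a quick first pass before you read them in full.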

Real blameless postmortems focus on:

What system allowed this to happen?
What information was missing?
What would have made the better decision obvious?
What tool could have caught this?

The Shift That Matters

Blame mindset: "How do we stop people from doing this?"

Blameless mindset: "How do we build systems where the wrong decision is harder than the right one?"

Example:

Blame: "The engineer deployed without approval."

Action: "Require additional manual approvals before deployment."

Result: Engineers find workarounds. Deployments slow. Nothing changes.

Blameless: "The engineer deployed without approval. Why did that seem reasonable?"

Answer: "The approval process was taking 2 hours, and the customer issue was urgent. The engineer bypassed it."

Action: "Implement auto-approval for critical hotfixes if all tests pass."

Result: Urgent deployments don't require workarounds. Actual behavior changes.
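The blameless action item above amounts to a small policy rule. Here is a minimal sketch of it, assuming the pipeline can tell whether a change is flagged as a critical hotfix and whether all tests passed. `DeployRequest`, its fields, and `is_approved` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRequest:
    # Hypothetical fields; map them to whatever your pipeline exposes.
    is_critical_hotfix: bool
    all_tests_passed: bool
    has_manual_approval: bool = False


def is_approved(request: DeployRequest) -> bool:
    """Decide whether a deployment may proceed under the auto-approval rule."""
    # Fast path: a critical hotfix with all tests passing needs no manual wait.
    if request.is_critical_hotfix and request.all_tests_passed:
        return True
    # Everything else still goes through the normal manual approval.
    return request.has_manual_approval


if __name__ == "__main__":
    urgent_fix = DeployRequest(is_critical_hotfix=True, all_tests_passed=True)
    print("approved" if is_approved(urgent_fix) else "needs manual approval")
```

The design choice is that the sanctioned path is now faster than the workaround, so the engineer facing an urgent customer issue has no reason to bypass it.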

The Questions That Reveal Blame

"Why did the on-call miss the alert?"

vs.

"Why didn't the on-call see the alert? Was the alert buried in noise? Was the alert configured wrong? Was the on-call context-switching too much?"

The first question assumes blame. The second set of questions uncovers the system behind the miss.

"The engineer didn't validate input."

vs.

"Why wasn't input validation enforced at the framework level? Why didn't the linter catch this? Why was this pattern possible?"

The first is a statement about the engineer. The questions that follow are about the system.

What Actually Works

Document the decision-making context. Not judgment.

"The engineer believed the data was validated upstream" is context.

"The engineer was careless" is judgment.

Ask: "If this exact situation happened tomorrow, would the same decision seem reasonable to a competent person?"

If yes, it's a system problem. Fix the system.

If no, you've found something else.

The Honest Part

Real blameless postmortems are harder than blame-driven ones.

It's easier to say "Person did bad thing" than to trace the systems that made the bad thing seem reasonable.

It requires admitting that your process enabled the failure.

It requires changing things instead of just documenting them.

But it's the only approach that actually changes behavior.

Teams that claim "blameless" but still use postmortems as accountability theater don't fix anything. They just have better documentation of blame.

Teams that actually ask "why would a reasonable person make this decision?" build systems where the failures stop happening.

Check your last postmortem. What were the action items?

If they're mostly about "team discipline" or "better communication," you're doing blame with better language.

If they're about systems, tools, and removing friction from the right path, you're actually being blameless.

