DEV Community

Ian Johnson

Post-mortems and RCAs: why you should be doing them

The hard part of incidents isn't surviving them. It's making them count.

Most teams handle the acute phase reasonably well. Something breaks, people swarm, the symptoms get suppressed, the dashboards go green. The team is exhausted but relieved. The natural instinct is to move on. There's a backlog waiting and nobody wants to spend another hour on something that's already over.

This is where the lesson gets thrown away.

Post-mortem vs RCA

A quick terminology note, because these get used interchangeably and they shouldn't.

A post-mortem is the document and the conversation that follow an incident. It captures the timeline, the response, the impact, what worked, what didn't, and what the team commits to changing as a result. It is fundamentally a learning artifact.

A Root Cause Analysis is a technique (typically the "five whys" or a variant) used inside the post-mortem to push past the symptom and find the underlying conditions that allowed the incident to happen. RCA is one of the tools you use to fill in the most important section of the post-mortem.

You can run an RCA without writing a post-mortem (and many teams do, badly, in Slack threads that disappear). You can write a post-mortem without doing real RCA (and many teams do, ending up with documents that say things like "the database fell over" without ever asking why). The combination is what actually produces learning.

Blameless, or worthless

The single most important property of a post-mortem is that it is blameless. Not "blameless except for this one obvious case." Not "blameless but we'll note who was on call." Blameless.

The reason is mechanical, not moral. The moment people believe a post-mortem might be used to assign blame, the information dries up. Engineers stop volunteering what they were actually thinking when they shipped the change. Operators stop admitting they didn't understand the runbook. The document becomes a careful work of fiction, optimized to protect its authors, and the real causes go uninvestigated.

You cannot fix a system you cannot see clearly. Blame guarantees you cannot see clearly.

The framing that works is the one John Allspaw articulated at Etsy more than a decade ago: assume everyone involved acted reasonably given the information they had at the time. If their actions led to an incident, the interesting question is what made those actions look reasonable. That question is almost always answered by something about the system — missing signals, unclear ownership, confusing tooling, inadequate testing — and those are things you can actually fix.

Actionable, or it didn't happen

A post-mortem that ends with "the team will be more careful next time" is not a post-mortem. It is a feeling, written down.

The output of a useful post-mortem is a list of action items that are concrete, owned, and tracked. "Add alerting on queue depth above 10,000, owned by the platform team, target this sprint" is an action item. "Improve our monitoring" is not.

The better lists go a step further and ask: is this specific to this incident, or does it represent a class of problem? A misconfigured retry policy is a specific bug. A pattern where retry policies are easy to misconfigure and hard to test is a class. The class is where the leverage is. Fixing it prevents not just this incident but the dozen variants you haven't had yet.
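One way to attack the class rather than the single bug is to check every retry policy against a few invariants, instead of patching the one that misbehaved. The sketch below is illustrative only: `RetryPolicy`, its fields, the limits, and the service names are all hypothetical and would need to match whatever your own configuration looks like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    # Hypothetical shape of a retry policy; adapt the fields to your own config.
    max_attempts: int
    base_delay_seconds: float
    max_delay_seconds: float


def policy_violations(policy: RetryPolicy) -> list[str]:
    """Return the reasons a policy looks unsafe (empty list if none)."""
    problems = []
    # The limits here are assumptions chosen for illustration.
    if not 1 <= policy.max_attempts <= 10:
        problems.append("max_attempts must be between 1 and 10")
    if policy.base_delay_seconds <= 0:
        problems.append("base_delay_seconds must be positive")
    if policy.max_delay_seconds < policy.base_delay_seconds:
        problems.append("max_delay_seconds must be >= base_delay_seconds")
    return problems


def validate_all(policies: dict[str, RetryPolicy]) -> dict[str, list[str]]:
    """Check every named policy and return only those with violations."""
    report = {}
    for name, policy in policies.items():
        problems = policy_violations(policy)
        if problems:
            report[name] = problems
    return report


# Hypothetical services: one sane policy, one of the easy-to-write bad ones.
POLICIES = {
    "payments-client": RetryPolicy(3, 0.5, 8.0),
    "search-client": RetryPolicy(50, 0.0, 8.0),
}
```

Run against every policy on every commit, a check like `validate_all(POLICIES)` flags the variants you haven't hit yet, not just the one that caused the incident.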

Where TDD comes in

Here is where the discipline gets concrete.

Test-Driven Development, at its core, is a feedback loop: you write a failing test that describes the behavior you want, then write the code that makes it pass. The test is the specification, the safety net, and the regression check, all in one artifact.

The same loop applies, almost perfectly, to incidents. The behavior you want is "this specific failure mode does not happen." You write a test that reproduces the failure or, if reproducing the exact failure is impractical, a test that exercises the condition that allowed it. You watch the test fail. Then you fix the underlying issue and watch the test pass. That test now lives in your pipeline forever.

This is the practice sometimes called bug-fix-by-test, and it converts a one-time lesson into permanent organizational knowledge. The next engineer who tries to reintroduce the problem (and someone will, six months from now, while refactoring something unrelated) gets stopped by a failing test with a clear name and a link back to the post-mortem that produced it.
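As a rough sketch of what such a test might look like, assume a hypothetical incident in which an uncapped exponential backoff produced absurdly long retry delays. The function `backoff_delays`, the test names, and the post-mortem ID `PM-0000` are all invented for illustration; the point is only the shape: a test named after the failure mode, with a pointer back to the write-up.

```python
import unittest


def backoff_delays(base_seconds, factor, max_attempts, cap_seconds):
    """Return the delay before each retry, using capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = []
    for attempt in range(max_attempts):
        # The min() is the fix: before it, delays grew without bound.
        delays.append(min(base_seconds * factor ** attempt, cap_seconds))
    return delays


class TestRetryBackoffRegression(unittest.TestCase):
    """Regression tests for hypothetical post-mortem PM-0000 (placeholder ID).

    Each test pins down one condition that allowed the incident.
    """

    def test_delay_never_exceeds_cap(self):
        delays = backoff_delays(base_seconds=1, factor=2,
                                max_attempts=10, cap_seconds=30)
        self.assertLessEqual(max(delays), 30)

    def test_number_of_retries_is_bounded(self):
        delays = backoff_delays(base_seconds=1, factor=2,
                                max_attempts=10, cap_seconds=30)
        self.assertEqual(len(delays), 10)

    def test_zero_attempts_is_rejected(self):
        with self.assertRaises(ValueError):
            backoff_delays(base_seconds=1, factor=2,
                           max_attempts=0, cap_seconds=30)
```

Saved as a test module, this runs with `python -m unittest`. Written before the fix (that is, without the `min()`), `test_delay_never_exceeds_cap` fails, which is the red step of the loop; adding the cap turns it green.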

Without this step, the lesson lives in the heads of the people who were on the call. Those people leave. The lesson goes with them. The same incident happens again in eighteen months and the new team wonders why nobody saw it coming.

The pipeline as institutional memory

This is where post-mortems connect back to the broader argument about Continuous Delivery. The pipeline is not just a verification system for new changes. It is the place where every lesson the organization has ever learned gets enforced automatically, on every commit, regardless of who made it.

Every post-mortem that produces a test makes the pipeline a little smarter. Over years, the accumulated tests become a kind of institutional memory that does not depend on anyone remembering anything. A junior engineer who joins next year inherits all the scars without having to feel them, because the scars are encoded as tests that fail when they should.

This is how teams actually get better over time, as opposed to claiming to. Without the test, the post-mortem is a story you tell yourself about being a learning organization. With the test, you are one.

The minimum viable practice

If you aren't doing this today, you don't need to roll out a perfect SRE program to start. The minimum viable practice is small:

After every incident worth the name, hold a meeting. Write a document. Run a five whys. Identify two or three concrete action items, with owners and dates. For at least one of them, write a test that would have caught the problem. Merge the test. Move on.

That is the whole loop. It is not complicated. The hard part is doing it consistently, including the times when everyone is tired and the next incident is already starting.

But every time you do, the system gets a little harder to break in the same way twice. Which, in the end, is the only definition of "more reliable" that actually means anything.

Top comments (3)

Gilder Miller

The TDD angle is the strongest point here. Turning a one-time incident into a regression check that lives in the pipeline is the only way to prevent the same failure from resurfacing later.
I agree with the actionable requirement too. You can't just say 'investigate this' and hope it gets done.
How do you handle the "write a test" step when the root cause is a race condition or timing issue that is hard to reproduce consistently?

Ian Johnson • Edited

Great question! If the race condition is within the app, then a race-condition 'integration' test works well. If it is at a higher level, for example traffic, performance, or edge cases, I'll typically handle that with performance tests, or with E2E tests as a last resort (because E2E is expensive). If you are worried about consistency, mutation testing is a really good way to address this.
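For the in-app case, one common way to make a race reproducible (offered here as an illustration, not necessarily the approach meant above) is to inject a pause between the read and the write so that two threads are forced to interleave. Everything in this sketch is hypothetical: the `Counter` class, the `pause` hook, and the 0.5-second timeout are assumptions chosen to keep the example small.

```python
import threading


class Counter:
    """Toy shared counter; `pause` is a test hook between the read and the write."""

    def __init__(self, use_lock=True, pause=None):
        self.value = 0
        self._lock = threading.Lock() if use_lock else None
        self._pause = pause or (lambda: None)

    def increment(self):
        if self._lock is None:
            self._read_then_write()
        else:
            with self._lock:
                self._read_then_write()

    def _read_then_write(self):
        current = self.value
        self._pause()  # the window in which another thread can interfere
        self.value = current + 1


def run_two_increments(use_lock):
    """Run two concurrent increments with a forced interleaving point."""
    barrier = threading.Barrier(2)

    def pause():
        try:
            # Without the lock both threads meet here, so both have read 0.
            # With the lock only one thread can arrive, so the wait times out.
            barrier.wait(timeout=0.5)
        except threading.BrokenBarrierError:
            pass

    counter = Counter(use_lock=use_lock, pause=pause)
    threads = [threading.Thread(target=counter.increment) for _ in range(2)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

A regression test would then assert that `run_two_increments(use_lock=True)` returns 2. Running it with `use_lock=False` shows the lost update the lock prevents, although that half still depends on both threads starting within the timeout, so treat it as a demonstration rather than a guarantee.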

Gilder Miller

Thank you Ian.
I'm also wondering whether you'd be open to discussing this further on other platforms.
Talk soon!