The ritual most senior engineers skip
Everyone in this field has heard the word "pre-mortem." Almost nobody runs one as an actual habit.
The people who do ship cleaner work. The cost is small: about 10 minutes to generate the failure list, 25 to 30 for the whole ritual. What it buys you is avoiding the moment on a post-incident call where someone says "yeah, we knew this could happen and shipped anyway." Plenty of Sev-1s aren't preventable. For the ones that are, this is the cheapest way I know to catch them early.
What follows is the version I actually use: the mental flip that makes it work, the 10-minute structure, the three ways you can respond to each risk, and the cases where running a pre-mortem is just procrastination wearing a lab coat.
One thing to hold onto if you skip the rest: the useful work happens in past tense, before the incident, in concrete sentences.
Why risk registers do not work
A risk register is a passive list. It gets filed somewhere nobody opens, written in language nobody acts on.
It tells you "what could go wrong" without telling you what to do about any of it. The entries are also written in the future conditional ("might happen," "could cause"), which your brain quietly files under low-probability, deal-with-it-later. You read the list and feel covered: you did the responsible thing, you wrote the risks down.
The other problem is structural. The shipping plan and the register usually live in two different documents, so the register never forces anything back into the plan. It sits next to the work instead of changing it.
After a decade-plus of shipping, here's my honest read: most risk registers get filled in the afternoon before launch to satisfy a checkbox. Nobody who writes one believes in it. Which is fine, because usually nobody reads it afterward either.
A pre-mortem has the opposite shape. It's active, it reads like a story, and it lives inside the plan.
The cognitive shift: past tense as forcing function
A pre-mortem is a story you write in the wrong direction. You write the post-incident retrospective before the incident, in past tense, with detail.
The past tense is what forces specificity. The study on prospective hindsight (Mitchell, Russo and Pennington, 1989) found that imagining an outcome has already happened makes people better at identifying the reasons behind it; Klein's HBR article puts the improvement at about 30%. That's the lever the pre-mortem pulls. It's also why "we missed the deadline" is useless and "row-level locks piled up under the signup spike and transactions queued past the 30-second timeout" is not.
Put the two side by side. The lazy version is "it broke." The useful version: the 30-second timeout cascaded into a 502 wall by minute three of the signup spike, conversion dropped 40% at peak hour because the payment provider rate-limited us, and the alerting system went quiet for 18 minutes because the alerting stack was its own alert sink.
The second version you can actually test. Every clause points at a specific failure mode you can prevent, shrink, or at least watch for.
This isn't pessimism dressed up. It's negative thinking pointed at the plan, run like a debugging session. You're not predicting the future. You're trying to break the document while it's still cheap to break.
I ran one last week on a SaaS I'm shipping. Ten failure modes in ten minutes. Ranked by impact times likelihood, the top three were the lock-contention story, the payment rate-limit story, and the silent-alerting story above. Each one got a countermeasure before ship. But the bigger payoff was sequencing: the pre-mortem saved me two days of work, because writing down how the wrong order failed showed me that a feature I'd put first should have been third. You only see the right sequence after you've described the wrong one falling over.
The 10-minute protocol
The "10 minutes" is the generation core, not the whole thing. Steps 1 and 2 eat the ten minutes; steps 3 through 5 add another 15 to 20. So the full ritual is closer to half an hour. Most teams stop after step 2, which is the whole reason pre-mortems get a reputation for not working. The discipline is in finishing.
The original method (Klein's) is a group exercise: five to ten minutes of imagined-failure writing, then a round-robin read-out, with no fixed count. What I describe below is an opinionated solo adaptation for ship-level calls, the thing I actually run on a Monday morning before a non-trivial deploy.
Solo, you only catch the failures you can already picture. So for anything launch-scale (new market, new product line, architecture you can't easily walk back), run it as a 30-minute group exercise with anonymous submission. That way the most junior person on the team can write down the failure mode the founder is too close to see. The brainstorming research is fairly brutal about face-to-face group ideation; anonymous parallel generation tends to beat it by 20 to 40%.
The five steps:
1. Set a timer for 10 minutes. The time-box is what stops you polishing.
2. Write 10 specific past-tense failure narratives. Concrete, not categories. Vendor outages, traffic shapes, a regulatory surprise, a key person out sick, a third-party API quietly changing its contract. Under 6 and you're still being abstract; over 12 and you're padding. Ten is the right squeeze for a 10-minute clock: one a minute, no editing.
3. Rank each one by impact times likelihood. Three buckets: ship-killing, recoverable, and acceptable.
4. The top 3 get countermeasures in the plan itself. Not in a separate doc. In the plan, as real tasks with an owner and a date.
5. The rest go in a deferred-risk file that you actually re-read at the next milestone.
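The ranking and split in the last three steps can be sketched in a few lines of Python. This is an illustration, not a prescribed tool: the 1-to-5 scales, the bucket cut-offs, the example scores, and the names `Failure`, `bucket`, and `triage` are all assumptions made up for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    narrative: str   # one concrete, past-tense sentence
    impact: int      # assumed scale: 1 (minor) to 5 (ship-killing)
    likelihood: int  # assumed scale: 1 (unlikely) to 5 (near-certain)

    @property
    def score(self) -> int:
        # "Impact times likelihood", as in the ranking step.
        return self.impact * self.likelihood


def bucket(score: int) -> str:
    # Cut-offs are illustrative assumptions; tune them to your own scale.
    if score >= 15:
        return "ship-killing"
    if score >= 6:
        return "recoverable"
    return "acceptable"


def triage(failures, top_n=3):
    """Return (needs a countermeasure in the plan, goes to the deferred-risk file)."""
    ranked = sorted(failures, key=lambda f: f.score, reverse=True)
    return ranked[:top_n], ranked[top_n:]


failures = [
    Failure("Row-level locks piled up under the signup spike.", 5, 4),
    Failure("The payment provider rate-limited us at peak hour.", 5, 3),
    Failure("The alerting stack went quiet for 18 minutes.", 4, 3),
    Failure("A key person was out sick during launch week.", 2, 2),
]

top, deferred = triage(failures)
for f in top:
    print(f.score, bucket(f.score), f.narrative)
```

The scores matter less than the forced ordering: only the first three get an owner and a date, and everything else is written down for the next milestone rather than dropped.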
The protocol is deliberately dumb. The hard part is running it before every ship that takes more than a day to roll back.
Countermeasure design: three patterns
For each of the top three, you pick exactly one response. Not two.
Eliminate
Change the plan so the failure mode can't happen. This is the most expensive option and the most durable. For the lock-contention case, what landed was moving primary-key allocation off a sequence and onto UUIDv7 on the signup-burst path, running READ COMMITTED. The failure mode doesn't get smaller. It's gone.
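For readers who haven't met UUIDv7: it is a 128-bit identifier whose first 48 bits are a Unix timestamp in milliseconds, followed by version, variant, and random bits (RFC 9562), so IDs generated later sort after earlier ones without a shared counter. A minimal sketch of that layout follows. The function name `uuid7_sketch` is hypothetical, and this is not the code from the launch described here; in production you would more likely lean on your database or a maintained library.

```python
import os
import time
import uuid
from typing import Optional


def uuid7_sketch(unix_ms: Optional[int] = None) -> uuid.UUID:
    """Build a UUIDv7-shaped value: 48-bit ms timestamp, version 7, random tail."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000

    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    rand_a = rand >> 68                           # top 12 bits
    rand_b = rand & ((1 << 62) - 1)               # low 62 bits

    value = (unix_ms & ((1 << 48) - 1)) << 80     # bits 127..80: timestamp
    value |= 0x7 << 76                            # bits 79..76: version 7
    value |= rand_a << 64                         # bits 75..64: random
    value |= 0b10 << 62                           # bits 63..62: RFC variant
    value |= rand_b                               # bits 61..0: random
    return uuid.UUID(int=value)


new_id = uuid7_sketch()
print(new_id, new_id.version)
```

Each caller generates its key locally, which is the point of the countermeasure: there is no shared allocator left on the signup-burst path to queue behind.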
Attenuate
Shrink the blast radius for when it does happen. Cheaper, and partial. If the alerting stack can go dark, put an external uptime probe on a different vendor in a different region. The alerting system can still die; you just hear about it in 60 seconds instead of 18 minutes.
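The logic of such a probe is small enough to sketch. This is an illustration under assumptions: `probe_once`, `should_page`, and the stand-in fetchers are hypothetical names, and the real HTTP call to your health endpoint (run from the other vendor and region) is left out so the sketch runs without a network.

```python
from typing import Callable


def probe_once(fetch_status: Callable[[], int]) -> bool:
    """Return True if the health check answered with HTTP 200."""
    try:
        return fetch_status() == 200
    except OSError:
        # Timeouts and refused connections count as "down".
        return False


def should_page(last_success_s: float, now_s: float,
                max_silence_s: float = 60.0) -> bool:
    """Page a human once the target has been silent longer than the budget."""
    return now_s - last_success_s > max_silence_s


# Stand-in fetchers; a real one would wrap an HTTP GET to the health endpoint.
def healthy() -> int:
    return 200


def timed_out() -> int:
    raise TimeoutError("probe timed out")


print(probe_once(healthy))    # True
print(probe_once(timed_out))  # False
print(should_page(last_success_s=0.0, now_s=61.0))  # True
```

The probe has to page through a channel that does not depend on the alerting stack it is watching, otherwise it repeats the original failure.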
Instrument
Accept the failure mode but make sure you detect it in minutes, not hours. Cheapest of the three, and it leans entirely on someone being around to act on the signal. If you can't design the lock contention out before ship, then instrument lock-wait-time histograms and transaction-queue depth, alert on the 99th percentile, and accept that you'll respond in ten minutes rather than never having the bug.
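As a rough sketch of the alerting rule, assuming you already collect lock-wait samples in milliseconds: compute the 99th percentile over a window and compare it to a threshold. The nearest-rank method, the function names, and the 500 ms threshold are assumptions for illustration; a real setup would normally do this inside the metrics system rather than in application code.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples in window")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def lock_wait_alert(lock_waits_ms, threshold_ms: float) -> bool:
    """Fire when the p99 lock wait in the window exceeds the threshold."""
    return percentile(lock_waits_ms, 99) > threshold_ms


# 98 fast waits plus two slow ones: the median looks fine, the tail does not.
window = [5] * 98 + [900, 1200]
print(percentile(window, 50))        # 5
print(percentile(window, 99))        # 900
print(lock_wait_alert(window, 500))  # True
```

Alerting on the tail rather than the average is what makes this useful here: contention shows up in the slowest transactions long before it moves the median.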
There's a fourth pattern in the textbooks: transfer, meaning you push the risk onto a vendor SLA, insurance, or a contract clause. I leave it out on purpose. Mid-launch, transfer is usually not yours to pull; when it is, procurement moves in weeks, not inside a 10-minute loop. ISO 31000 and PMBOK keep it on the list. The version that survives contact with a Monday deploy doesn't.
Most launches end up using all three across their top-3 failures. The discipline is one pattern per failure. The moment you let yourself pick two, the countermeasure list bloats and the plan turns back into theater.
When NOT to run a pre-mortem
Skip it on low-stakes reversible stuff. A one-line PR, a typo, a copy tweak on an internal page. The ten minutes won't earn themselves back.
Skip it on calls you've already argued into the ground, where the choice is genuinely a coin flip. Running a pre-mortem on a coin flip is just a way to avoid deciding.
And skip it on anything you've already shipped. That's a post-mortem: a different ritual, with different head-space and different stakes.
The line I use: run a pre-mortem on anything that takes more than a day to roll back if it goes sideways. Under that bar, the ritual costs more than it saves.
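Written as a predicate, the skip rules above look roughly like this. It's a sketch with hypothetical parameter names, and "a day" is taken as 24 hours.

```python
def should_run_premortem(rollback_hours: float,
                         already_shipped: bool = False,
                         coin_flip: bool = False) -> bool:
    """Apply the skip rules: shipped work, coin flips, and cheap rollbacks are out."""
    if already_shipped:
        return False  # that is a post-mortem, a different ritual
    if coin_flip:
        return False  # already argued out; decide instead of analyzing
    return rollback_hours > 24  # more than a day to roll back


print(should_run_premortem(rollback_hours=72))   # True
print(should_run_premortem(rollback_hours=0.5))  # False
```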
One honest limit. A pre-mortem only reaches as far as your imagination. The real incident is very often the eleventh failure mode, the one nobody wrote down. Run them anyway; catching seven of eight ordinary failures still beats catching zero. Just don't let a complete-looking list trick you into thinking you've seen the whole risk picture.
Closing rule
If you can't describe how your next ship fails in 10 specific past-tense sentences, you haven't planned it. You've hoped. And hope has never held up under peak-hour traffic.
Sources
- Klein, G. (2007). "Performing a Project Premortem." Harvard Business Review, September 2007 — https://hbr.org/2007/09/performing-a-project-premortem (canonical origin of the method)
- Mitchell, D. J., Russo, J. E., & Pennington, N. (1989). "Back to the future: Temporal perspective in the explanation of events." Journal of Behavioral Decision Making, 2(1), 25-38 — https://doi.org/10.1002/bdm.3960020103 (empirical foundation for "past-tense forces specificity"; the prospective hindsight study)
- Kahneman, D. (2011). Thinking, Fast and Slow. Chapter 24, on pre-mortem as antidote to optimism bias and planning fallacy
- ISO 31000:2018 — Risk management guidelines — https://www.iso.org/standard/65694.html (canonical risk-treatment framework; eliminate/attenuate/instrument is an engineering simplification of clauses 6.5.2-6.5.3)
- PMBOK Guide, 7th Edition (PMI, 2021) — https://www.pmi.org/standards/pmbok (alternative risk-response taxonomy: escalate / avoid / transfer / mitigate / accept)
- Klein, G. (2009). Streetlights and Shadows: Searching for the Keys to Adaptive Decision Making. MIT Press — https://mitpress.mit.edu/9780262013390/streetlights-and-shadows/ (extended Klein corpus on naturalistic decision-making and pre-mortem limits)
- Mullen, B., Johnson, C., & Salas, E. (1991). "Productivity loss in brainstorming groups: A meta-analytic integration." Basic and Applied Social Psychology, 12(1), 3-23 — https://doi.org/10.1207/s15324834basp1201_1 (why anonymous parallel generation outperforms face-to-face group brainstorming)
- Bezos, J. — Type 1 vs Type 2 decisions, 2016 Amazon shareholder letter (restating the 2015 framing) — https://www.aboutamazon.com/news/company-news/2016-letter-to-shareholders (conceptual ancestor of the "one day to roll back" threshold)