DEV Community

Sonia Bobrik


What Five Years of Reading Production Postmortems Taught Me About Code Quality

I have spent an embarrassing number of weekends reading public postmortems from companies whose engineering blogs I respect, and the pattern that emerges contradicts almost everything junior developers are told about writing good code. The incidents that take down major platforms rarely involve clever algorithms gone wrong or exotic concurrency bugs. They involve mundane decisions that seemed reasonable at code review time, made by engineers who were following established conventions. A community thread on this exact topic, where practitioners traded stories about their worst production incidents, reinforced something I have suspected for a while. The gap between writing code that works and writing code that survives production at scale is not bridged by frameworks or methodologies. It is bridged by changing how you think about what code is actually doing when nobody is watching.

The Myth of the Senior Engineer

There is a comforting narrative in our industry that experience automatically translates to better code. Hire enough senior engineers, the thinking goes, and your systems will become more reliable. Incident reports tell a different story. The Cloudflare outage of July 2019 was triggered by a regex pattern written by experienced engineers, reviewed by experienced engineers, and passed by an automated test suite that did not measure CPU consumption. The pattern caused catastrophic backtracking in their WAF, CPU usage spiked across their network, and within minutes a significant portion of the traffic Cloudflare proxies worldwide was returning 502 errors.
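To make "catastrophic backtracking" concrete, here is a minimal sketch of a naive backtracking matcher for the textbook pattern (a+)+$, which is not the pattern from the Cloudflare incident. The function name count_steps is my own. It counts how many group boundaries the matcher tries before giving up on an input that almost matches.

```python
from typing import Tuple


def count_steps(text: str) -> Tuple[bool, int]:
    """Match the toy pattern (a+)+$ with naive backtracking and count attempts."""
    steps = 0

    def match_groups(pos: int) -> bool:
        nonlocal steps
        # Find how far a single a+ group could extend from pos.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Try every possible group length, longest first, as a backtracking engine would.
        for stop in range(end, pos, -1):
            steps += 1
            if stop == len(text):
                return True  # consumed the whole input: match
            if match_groups(stop):
                return True
        return False

    return match_groups(0), steps


if __name__ == "__main__":
    # A trailing "!" forces the matcher to exhaust every way of splitting the run of a's.
    for n in (4, 8, 16):
        matched, steps = count_steps("a" * n + "!")
        print(n, matched, steps)
```

For a failing input with n leading a's, this matcher makes 2^n - 1 attempts, so each extra character doubles the work. A test suite that only checks whether the pattern matches the right strings will never notice that.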

What separates engineers who consistently ship resilient code from those who do not is something less glamorous than expertise. It is a willingness to remain suspicious of code that appears correct. The Cloudflare engineers were not careless. They were operating within a system that did not catch a class of failure that humans are poorly equipped to predict.

Reading Code Versus Running Code

Most code review focuses on whether code does what it claims. This is necessary but insufficient. The more important question is what the code does when its inputs violate the assumptions baked into its design. Defensive programming has fallen out of fashion, partly because it adds visual noise and partly because type systems are supposed to handle these concerns. Both arguments are weaker than they appear when you examine real failures.

The Knight Capital incident of 2012 remains one of the cleanest examples of how operational decisions interact with code in ways that pure software thinking misses. New order-routing code reused a flag that had once activated a long-retired feature, and the rollout missed one of eight servers. On that server the flag woke dead code that had been dormant for years. The dormant code treated ordinary orders as instructions to flood the market, and the company lost approximately 440 million dollars in 45 minutes. No single piece was buggy in the traditional sense. The reused flag, the incomplete deployment, and the dead code that was never removed combined into a catastrophic accident.
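One cheap guard against this class of accident is to keep retired flag names on a list and refuse to start when one reappears. The sketch below uses hypothetical flag names and a hypothetical check_flags helper. It illustrates the idea and does not describe anything Knight Capital ran.

```python
# Hypothetical flag names, for illustration only.
ACTIVE_FLAGS = {"new_router"}
RETIRED_FLAGS = {"legacy_router"}  # names that must never be reused


class FlagError(ValueError):
    """Raised when a configuration requests a retired or unknown flag."""


def check_flags(requested):
    """Return the requested flags as a set, or raise if any is retired or unknown."""
    requested = set(requested)
    retired = requested & RETIRED_FLAGS
    if retired:
        raise FlagError(f"retired flags must not be reused: {sorted(retired)}")
    unknown = requested - ACTIVE_FLAGS
    if unknown:
        raise FlagError(f"unknown flags: {sorted(unknown)}")
    return requested
```

Failing loudly at startup turns a silent reinterpretation of an old flag into an error someone has to read before anything runs.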

The lesson here is not that you need better deployment scripts, though you probably do. The lesson is that code does not exist in isolation from the systems that deploy, configure, and operate it. Martin Fowler's essay on continuous integration, a practice that grew out of Kent Beck's Extreme Programming, treats code and its delivery pipeline as a single system whose behavior emerges from their interaction. Every developer should read that essay annually until it changes how they structure their commits.

What Type Systems Cannot Save You From

Strong type systems eliminate entire categories of bugs, and I will defend Rust's borrow checker against any complaint about ergonomics. But types describe shape, not meaning. A function signature that accepts a Duration tells you the units, but it cannot tell you whether 30 seconds is an appropriate timeout for your use case. The Robinhood outage during the March 2020 market volatility was not caused by type errors. It was caused by capacity assumptions that broke when trading volume exceeded what their architecture had been provisioned to handle.
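The Duration point can be sketched in a few lines. Python's timedelta guarantees the value is a span of time, but only an explicit check records what range is sensible for a given call. The bounds and the checked_timeout name below are assumptions for a hypothetical service. Real bounds should come from measured latency.

```python
from datetime import timedelta

# Assumed bounds for a hypothetical downstream call; derive yours from measurements.
MIN_TIMEOUT = timedelta(milliseconds=50)
MAX_TIMEOUT = timedelta(seconds=5)


def checked_timeout(value: timedelta) -> timedelta:
    """Return value unchanged if it lies inside the documented bounds, else raise."""
    if not isinstance(value, timedelta):
        raise TypeError("timeout must be a timedelta")
    if not (MIN_TIMEOUT <= value <= MAX_TIMEOUT):
        raise ValueError(f"timeout {value} is outside [{MIN_TIMEOUT}, {MAX_TIMEOUT}]")
    return value
```

A 30-second timeout passes the type check and fails this one, which is exactly the gap between shape and meaning.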

This is where engineering judgment lives, and it cannot be automated away. Reading code for what it implicitly assumes about its environment, its inputs, and its operational context is a skill that develops through deliberate practice. The engineers I trust most are the ones who, when reviewing a pull request, ask questions about scenarios that have not happened yet rather than only verifying that intended behavior works.

Practical Habits That Pay Compound Interest

Over years of trying to improve at this, certain practices have proven worth the investment. Writing them down feels almost embarrassing because they sound obvious, but their obviousness is precisely why they are routinely skipped.

  • Reading the failure mode before the success path when reviewing code, looking specifically at what happens when network calls hang, when data is malformed, or when callers retry aggressively
  • Treating comments as commitments rather than decoration, with explicit notes about why something was done a particular way and what assumptions would invalidate the approach
  • Building tools to inspect production behavior rather than relying on logging statements added retroactively, because by the time you need observability for an incident, it is too late to instrument
  • Deleting code as a primary activity rather than only adding code, because every unused branch, deprecated flag, and obsolete configuration is a future incident waiting for the right deployment to activate it
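As one illustration of reading the failure mode first, here is a minimal sketch of a retry helper that bounds how aggressively a caller retries. The name call_with_retries and its parameters are my own, and it assumes the wrapped operation enforces its own timeout and raises TimeoutError or ConnectionError when it fails.

```python
import random
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(); on TimeoutError or ConnectionError retry with capped, jittered backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the original error
            # Full jitter: wait a random fraction of the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())
```

The sleep and rng parameters exist so a test can inject fakes. The retry budget and the jitter are the parts a reviewer should look for, because unbounded, synchronized retries are how a brief hiccup becomes an outage.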

The Honest Conclusion

I have stopped believing that good engineers write fewer bugs. The engineers I have worked with whose systems run reliably are not the ones with cleaner commits or more elegant abstractions. They are the ones who have internalized that their code will be deployed by tired humans, run on infrastructure they do not control, called by clients with different assumptions than the documentation describes, and maintained by future colleagues who lack the context that made the original decisions feel obvious.

This framing changes what counts as quality. A function that handles its happy path beautifully but fails opaquely when the database hiccups is worse than one with clunkier code that produces clear error messages and degrades predictably. Code is a hypothesis about how the world will behave, and every line carries an implicit prediction. The engineers who get better are the ones who treat each production incident not as a failure to be moved past but as evidence that their model of the world was wrong in a specific, learnable way.
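Here is a small sketch of what "degrades predictably" might look like. The load_profile function, the ProfileUnavailable error, and the dictionary cache are all hypothetical. The point is only that the caller learns whether the data is stale and gets a specific error when nothing can be served.

```python
class ProfileUnavailable(Exception):
    """Raised when neither the database nor the cache can supply a profile."""


def load_profile(user_id, fetch_from_db, cache):
    """Return (profile, is_stale), falling back to the cache when the database call fails."""
    try:
        profile = fetch_from_db(user_id)
    except (TimeoutError, ConnectionError) as exc:
        if user_id in cache:
            return cache[user_id], True  # stale, and explicit about it
        raise ProfileUnavailable(
            f"profile {user_id!r}: database failed ({exc!r}) and no cached copy exists"
        ) from exc
    cache[user_id] = profile  # refresh the fallback copy on every success
    return profile, False
```

It is clunkier than a one-line fetch, but every way it can fail is written down where a reviewer can see it.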

The path forward is not more frameworks or better tools, though those help. It is the unglamorous discipline of staying curious about why working code sometimes stops working, and refusing to accept explanations that protect your ego at the expense of your understanding.
