DEV Community

Michael burry

I Broke Prod 3 Times — Here's How Proper Retesting Would Have Saved Us

I've been in software for eight years. I've survived death marches, a startup pivot that rewrote half the codebase in six weeks, and a migration to microservices that nobody fully understood until it was already in production.

But the three incidents I think about most aren't the big architectural disasters. They're the ones that started with a developer — sometimes me — saying: "It's just a small fix. We already tested this. Ship it."

This is the story of those three incidents, what actually went wrong, and how a proper retesting protocol would have stopped each one before it became a 2 AM Slack storm.

If you want the structured playbook, here's a solid retesting guide to bookmark. But if you want the human version — the version with the panic and the postmortems and the lessons that actually stuck — keep reading.


Incident #1: The "One-Line Fix" That Took Down Checkout

What happened

It was a Tuesday afternoon. A bug had been sitting in our backlog for two sprints — a minor formatting issue in how we displayed discount codes at checkout. Wrong case, nothing functional, just cosmetic. The ticket had been deprioritized twice because it wasn't affecting conversions.

Then a customer-facing exec noticed it during a demo and suddenly it was P1.

Our developer found the fix in about four minutes. Literally one line — a .toLowerCase() call on the coupon input field. She tested it locally, it looked great, and we pushed it to production through our fast-track deploy process (which existed specifically for "low-risk" cosmetic fixes).

Within 20 minutes, our error monitoring lit up. Checkout was failing for anyone who had a coupon applied.

The root cause: our coupon validation logic upstream was case-sensitive. It expected codes in uppercase. The .toLowerCase() fix made the UI display correctly, but broke the validation handshake. Coupons that were valid were now being rejected as invalid. Customers were losing discounts in the middle of checkout and abandoning.

We rolled back in 40 minutes. The incident window was about an hour.

What proper retesting would have caught

The fix was never tested against the full checkout flow — only the display behavior. A proper retest would have included:

  • Boundary testing: What happens when a valid uppercase coupon is entered after this change?
  • Integration verification: Does the front-end input still communicate correctly with the validation service?
  • End-to-end scenario: Complete a checkout with a coupon applied.

The fix was cosmetic on the surface but touched an input field with downstream dependencies. Retesting only the visual output while ignoring the functional chain is how one-line fixes become one-hour outages.

Lesson: There is no such thing as a "cosmetic" fix that touches user input. The blast radius of any change to an input field includes everything downstream of that field.
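Here's a minimal sketch of the mismatch, with hypothetical coupon codes and function names standing in for the real checkout code:

```javascript
// Hypothetical sketch of the incident, not the actual codebase.
// The upstream validation service was case-sensitive and only knew
// uppercase codes.
const VALID_COUPONS = new Set(["SAVE20", "WELCOME10"]);

function validateCoupon(code) {
  // Case-sensitive check, as in the upstream service.
  return VALID_COUPONS.has(code);
}

// The "cosmetic" one-line fix: lowercase the input before submitting.
function buggySubmit(rawInput) {
  return validateCoupon(rawInput.toLowerCase()); // valid codes now rejected
}

// A safer fix: lowercase for display only, and send the canonical
// uppercase form to the validator.
function fixedSubmit(rawInput) {
  return validateCoupon(rawInput.trim().toUpperCase());
}
```

An end-to-end retest with a single valid coupon would have hit `buggySubmit` rejecting `"SAVE20"` immediately, long before production did.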


Incident #2: The Regression Nobody Ran

What happened

Six months later, different team, same pattern.

We had a nasty bug in our notifications service — users weren't receiving email confirmations for certain account actions. It had been reported by a handful of users, confirmed by QA, and assigned to a senior engineer who tracked it to a race condition in our async job queue.

The fix was genuinely complex. It took three days, two code reviews, and a solid round of unit testing before it was merged. QA verified the specific scenario from the bug report — the exact action that triggered the race condition — and it passed cleanly. Ticket closed. Sprint closed. Everyone went home.

The following Monday we discovered that password reset emails had stopped working entirely.

The notifications service powered both flows. The fix had resolved the race condition for account confirmations by changing how jobs were enqueued — but that change had altered behavior for the password reset flow in a way nobody had mapped out. Password reset emails had been silently failing since Friday's deploy.

We caught it because a new employee tried to reset their password on their first day and got nothing. Not exactly the onboarding experience we aimed for.

What proper retesting would have caught

The QA engineer verified the bug report scenario. Nobody ran a broader regression on the notifications service.

What was missing:

  • Component-level regression: After fixing the queue logic, every feature that uses the notifications service should have been retested — not just the broken one.
  • Dependency mapping: A quick audit of "what else calls this service?" before closing the ticket.
  • Smoke test in staging: A post-deploy smoke test covering core user flows (including password reset) would have surfaced this within minutes of Friday's deploy.

The unit tests were thorough for the race condition. But unit tests don't catch integration-level regressions. The component was fixed; the system was broken.

Lesson: Retesting a bug fix means retesting the component, not just the scenario. Map your dependencies before you close the ticket.
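The component-level retest above can be sketched as a tiny smoke suite that enumerates every flow going through the shared service, not just the one from the bug report. All names here are hypothetical stand-ins:

```javascript
// Hypothetical smoke suite: every flow that uses the notifications
// service gets exercised after any change to the queue logic.
const notificationFlows = {
  accountConfirmation: (send) => send("confirm", "user@example.com"),
  passwordReset: (send) => send("reset", "user@example.com"),
  invoiceReceipt: (send) => send("invoice", "user@example.com"),
};

// Stub transport that records what each flow enqueued.
function makeRecorder() {
  const sent = [];
  return { send: (type, to) => sent.push({ type, to }), sent };
}

function smokeTestNotifications() {
  const failures = [];
  for (const [name, run] of Object.entries(notificationFlows)) {
    const recorder = makeRecorder();
    run(recorder.send);
    if (recorder.sent.length === 0) failures.push(name);
  }
  return failures; // empty array means every flow enqueued something
}
```

The point isn't the stub itself; it's that the suite is keyed off the dependency list for the service, so a silently dead `passwordReset` flow fails loudly within minutes of a deploy instead of on someone's first day.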


Incident #3: We Tested in the Wrong Environment

What happened

This one is the most embarrassing, because by this point we had a retesting checklist. We had learned from the previous incidents. We were doing the thing.

Except we weren't doing the thing in the right place.

A bug had been reported where users on a specific legacy plan tier were getting incorrect pricing displayed on their dashboard. The pricing logic was in a configuration service that read from a database table. A developer found the issue — a missing condition in a query — fixed it, and QA tested it thoroughly in our staging environment. All plan tiers displayed correctly. Ticket verified. Deployed to production Friday afternoon.

By Saturday morning, we had support tickets from enterprise customers — not the legacy tier, but our highest-value accounts — saying their pricing looked wrong.

What had happened: our staging database was months out of date. Enterprise plan configurations that existed in production didn't exist in staging. The query fix was correct, but it had an unintended side effect on plan types that our staging data didn't include. We tested correctly in an environment that didn't reflect reality.

The fix was straightforward, but the enterprise customer trust damage took weeks to repair.

What proper retesting would have caught

The retesting process was sound. The environment was the problem.

  • Production-parity staging: Our staging database needed to be refreshed with anonymized production data regularly — especially before testing anything that touches pricing or plan configuration.
  • Edge case data coverage: Any fix that touches multi-tier logic should be tested against a representative sample of all active configurations, not just the ones that happen to exist in staging.
  • Pre-deploy validation gate: A quick sanity check in a production-like environment before any pricing-related deploy, full stop.

We had a checklist. The checklist didn't include "verify the environment reflects production data." It does now.

Lesson: A perfect retesting process in an imperfect environment is still a broken process. Environment parity is not a DevOps nicety — it's a testing prerequisite.
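One way to make that prerequisite concrete is a pre-deploy gate that diffs the configurations staging actually contains against what production has. A rough sketch, with made-up tier names:

```javascript
// Hypothetical pre-deploy gate: verify staging data covers every plan
// tier that exists in production before testing pricing changes there.
function missingTiers(productionTiers, stagingTiers) {
  const staged = new Set(stagingTiers);
  return productionTiers.filter((tier) => !staged.has(tier));
}

// In our incident, enterprise configs existed only in production:
const prod = ["legacy", "standard", "pro", "enterprise"];
const staging = ["legacy", "standard", "pro"];

const gap = missingTiers(prod, staging);
// gap is non-empty here, which should block the deploy until staging
// is refreshed with anonymized production data.
```

A check this simple, run as a deploy-checklist gate, turns "our staging database was months out of date" from a Saturday-morning discovery into a pre-deploy failure message.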


The Pattern Across All Three

Looking back at these incidents, the surface-level causes are different — wrong scope, missed dependencies, wrong environment. But they all share the same root: we treated retesting as confirmation of the fix, not as verification of the system.

The developer fixed what was broken. QA confirmed it was fixed. Nobody asked: what else could this have changed?

That question — "what else?" — is the difference between retesting and real retesting.

Here's the mental model that changed how our team thinks about this:

A bug fix is a delta. Retesting is the process of understanding the full impact of that delta — not just the intended impact.

Every fix has:

  1. The intended effect (the bug is gone).
  2. The potential unintended effects (what else the change touches).
  3. The environmental assumptions (does this hold in production, not just staging?).

Good retesting covers all three.


What We Changed After Incident #3

After the third incident, we stopped treating retesting as a QA-phase activity and started treating it as a shared engineering responsibility. Here's what actually changed in our process:

Developers now write regression tests as part of bug fixes. Not a separate story, not a future sprint item — part of the same PR. If you fixed it, you prove it with a test that would have caught it.

Bug tickets now require a dependency field. Before a fix goes to QA, the developer lists every component, service, or data model the fix touches. QA uses that list to scope the retest.

Staging data is refreshed before any pricing, billing, or configuration change. Non-negotiable gate in our deploy checklist.

We run a smoke test suite on every production deploy. Ten minutes, covers our twenty most critical user flows. It's caught three would-be incidents in the eight months since we introduced it.

None of this is revolutionary. It's the stuff every retesting guide recommends. The difference is that now we actually do it, because we remember what it felt like when we didn't.


The Uncomfortable Truth About "Fast" Teams

Here's the thing nobody says out loud: the pressure to skip retesting almost always comes from the top. Developers and QA engineers generally know when a fix needs more testing. They feel it. But when a manager is asking why a ticket isn't closed, or when a sprint is ending and the board needs to be cleared, the path of least resistance is to mark it done and hope.

That hope is expensive. An hour of proper retesting costs an engineer an hour. An incident costs engineering hours, support hours, customer trust, and sometimes revenue.

The math is not complicated. The organizational will to do the math is.

If you're a team lead or an engineering manager reading this: the single most effective thing you can do for your production stability is to give your QA team explicit permission to slow down and retest properly. Make it a cultural norm that reopening a ticket for insufficient testing is a sign of diligence, not failure.

The alternative is finding out at 2 AM.


If your team is building a retesting process from scratch or tightening up an existing one, this retesting guide is worth the read. Fewer war stories, more frameworks — but the lessons rhyme.
