I Shipped a Bug to Production That Cost Us 3 Hours of Downtime

#ai #devops #programming #discuss

It was a Tuesday afternoon. Nothing felt different about the deploy. Same process as always. Green tests. Approved PR. Merged to main. Deployed.
Seventeen minutes later the alerts started.

What Happened

We had a background job that processed orders and updated inventory counts across our marketplace. The job ran every few minutes and had been running reliably for months without a single issue.

I had made what I genuinely believed was a minor change. A refactor. Cleaner logic, same behavior. I had tested it locally. The tests passed. A colleague reviewed it and approved it. Everything looked fine.

What I had not accounted for was a race condition that only appeared under concurrent load.

In local testing and in our staging environment, the job ran sequentially. In production it ran concurrently across multiple workers. The refactored logic I had written assumed sequential execution. Under concurrent load two workers would occasionally read the same inventory record at the same time, both calculate an update based on the same stale value, and both write back — the second write overwriting the first.

The result was inventory counts that drifted further from reality with every job run. Orders were being accepted for products that were no longer in stock. The customer facing side kept working normally which is why the alerts took seventeen minutes to fire. By the time we caught it the damage was done.

Three hours to identify, roll back, audit the affected records, correct the data, and verify the fix. Three hours of degraded service. A handful of orders that had to be manually cancelled and refunded. A postmortem that took longer than the incident itself.

All of it from a change I had described in the PR as "minor refactor, no behavioral changes."

The Mistakes I Made

Looking back there were several distinct failures that each independently could have prevented this.

I did not read the existing tests carefully enough.

The existing tests covered the sequential case because the original code had been written with sequential execution in mind. The tests passed because I had not broken the sequential behavior — I had broken the concurrent behavior that the tests never covered. I looked at green tests and assumed coverage. Green tests mean the tested cases pass. They say nothing about the cases that were not tested.

I assumed staging was representative of production.

Our staging environment runs a single worker. Production runs multiple. This difference had never mattered before because nothing we had deployed before this change was sensitive to concurrency. I knew staging was a single worker. I did not think about whether that mattered for this specific change. It did.

I described the change inaccurately in the PR.

"Minor refactor, no behavioral changes" is the kind of description that makes reviewers less careful not more. My colleague approved it quickly because the description signaled low risk. If I had described it accurately — "refactoring the inventory update logic, please check whether the new approach handles concurrent writes correctly" — the review would have been different. The description I wrote was not intentionally misleading. It reflected my own incorrect assessment of the risk. That is worse in some ways. I had convinced myself it was minor before I convinced anyone else.

I had no concurrency testing in my local verification process.

I ran the job locally once and watched it complete successfully. I did not run it concurrently. I did not run it under load. I did not simulate multiple workers. Running a background job once in a local environment and calling it tested is not testing the job — it is testing one execution path under ideal conditions.

What We Changed After

We added a staging environment worker count that matches production.

This sounds obvious in retrospect. It was not obvious until it cost us three hours. Staging now runs the same number of workers as production for any service where concurrency matters. The environment is still not a perfect replica but the concurrency profile now matches.

We added explicit concurrency tests for any job that touches shared state.

Not just unit tests for the logic. Integration tests that spin up multiple workers, run them simultaneously against the same test data, and verify the outcome is consistent. These tests are slower. They are also the only tests that would have caught this specific failure.

We changed how we describe PR risk in reviews.

We added a required field to our PR template: "Concurrency considerations." It can be filled with "N/A — this change does not touch shared state" which takes ten seconds. For changes that do touch shared state it forces the author to think about and articulate the concurrency implications before a reviewer sees the code. The reviewer then knows to focus there specifically.

We introduced a pre-deploy checklist for background jobs.

Not a long checklist. Four questions. Does this job touch shared state? Does the logic assume sequential execution? Has it been tested under concurrent load? Does the rollback procedure work without manual data correction? The checklist takes two minutes. The incident it is designed to prevent took three hours.

The Thing I Keep Coming Back To

The bug itself was not the most interesting part of this story.
The most interesting part is how many independent checkpoints it sailed through without being caught.

I did not catch it in my own review of the change. My colleague did not catch it in the PR review. The automated tests did not catch it. The staging deployment did not catch it.

Each of those checkpoints failed for a different reason. My self review failed because I had already decided the change was minor and was not looking carefully. The PR review failed because my description set a low risk frame that the reviewer accepted. The automated tests failed because they covered the wrong execution model. Staging failed because the environment did not match production on the one dimension that mattered.
A failure that gets through four independent checkpoints is not bad luck. It is a systems problem. The checkpoints were not actually independent — they were all downstream of my initial incorrect assessment that the change was low risk. Once I had decided it was minor, every subsequent checkpoint was biased toward confirming that assessment.

This is the uncomfortable version of the lesson. It is not just "write better tests" or "stage more carefully." It is that your own confidence in a change actively degrades the quality of the checks that follow it. The times you are most certain a change is safe are the times you most need to deliberately stress test that certainty.

What I Do Differently Now

Before merging anything that touches background jobs, queues, or shared state I now ask one question regardless of how minor the change feels:
What would have to be true about the execution environment for this to fail silently?

Not loudly. Not with an immediate error that gets caught in tests. Silently — in a way that produces correct looking output under normal conditions and wrong output under specific conditions that do not exist in development or staging.

That question changes how I look at a change. It forces me to think about the gap between the environment where I tested and the environment where it will run.

It would have caught this one. It has caught two potential issues since. Neither of them made it to production.
The best lesson from a production incident is not a new process. It is a new question you ask yourself before you need the process.

Exact Solution is a certified refurbished electronics marketplace shipping across Europe. We stock the best refurbished laptops from Apple, Dell, HP, and Lenovo — all fully tested and ready to ship.