In the fast-paced world of software engineering, there are two kinds of bugs:
- Bugs that made it to production
- Bugs that didn't
- Bugs that were caused by an "off-by-one" error
Today, I would like to share a story of #2, one that was caught at the last possible moment.
The Setup: A Simple Clean-up Exercise
At a retail company with an e-commerce website, a junior developer was assigned a straightforward task: deprecate an unused field from an internal REST API.
How The Bug Was Born
As the junior dev navigated through 22 files to remove traces of the deprecated field, they spotted what looked like a "refactoring opportunity":
    if (product.type === ProductType.BEST_SELLER) {
        doSomeThing();
    } else {
        doSameThing();
    }
Noticing that both branches seemed to do the same thing (or so they thought), they “cleaned it up” to:
    doSameThing();
But one tiny detail went unnoticed: the two function names differ by a single letter (doSomeThing versus doSameThing), so the branches were never identical.
And because this was legacy code, no unit tests covered this logic.
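For illustration, here is a minimal sketch of the kind of characterization test that could have flagged the change. Only `ProductType.BEST_SELLER`, `doSomeThing`, and `doSameThing` come from the snippet above. The `REGULAR` value, the `Product` shape, the `handleProduct` wrapper, and the recording stubs are assumptions made up for this example, not the real code.

```typescript
// Hypothetical stand-ins: the real enum and product shape are not shown in the story.
enum ProductType {
  BEST_SELLER = "BEST_SELLER",
  REGULAR = "REGULAR", // assumed second value, for illustration only
}

interface Product {
  type: ProductType;
}

// Stub handlers that record their own name, so a test can see which one ran.
const calls: string[] = [];
const doSomeThing = (): void => {
  calls.push("doSomeThing");
};
const doSameThing = (): void => {
  calls.push("doSameThing");
};

// The original branching logic, wrapped in a function so it can be exercised.
function handleProduct(product: Product): void {
  if (product.type === ProductType.BEST_SELLER) {
    doSomeThing();
  } else {
    doSameThing();
  }
}

// Characterization helper: reports which handler(s) ran for a given product type.
function handlerFor(type: ProductType): string {
  calls.length = 0; // reset the recording between checks
  handleProduct({ type });
  return calls.join(",");
}
```

With checks that expect `handlerFor(ProductType.BEST_SELLER)` to be `"doSomeThing"` and `handlerFor(ProductType.REGULAR)` to be `"doSameThing"`, collapsing the branches into a single `doSameThing()` call would make the first check fail immediately.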
The Code Review That Didn't Catch It
By the time the pull request (PR) was ready, the developer’s IDE had also applied auto-linting across the touched files.
So, the PR ended up big: the one-line logic change sat among formatting-only diffs across the touched files.
The result was, as expected, "LGTM".
The Disabled End-To-End Test
Meanwhile, another team maintained an end-to-end (E2E) test for the BEST_SELLER product. This test simulated a user placing an order on the website.
Unfortunately, this test failed intermittently due to data issues, such as the test product going out of stock.
So, when the test failed, the reflex action of the Quality Engineer (QE) was:
- Disable the test first.
- Investigate and rectify the data issue next.
- Don’t be the person blocking a production release.
This time, however, something was different.
After disabling the test, the QE verified that the test product was actually in stock, so they came to me (the software engineer) for help.
The Pressure to Release
It was a "lucky" day: both frontend and backend had production releases scheduled.
The QE team, having dealt with countless data issues in the past, was inclined to dismiss the issue and sign off on the releases.
But something felt off. The issue affected a range of products in the staging environment, not just the test product.
With pressure mounting, we escalated the issue to the Delivery Lead, who thankfully was understanding and gave us time to investigate.
This decision saved us.
By systematically comparing staging and production behavior, we finally traced the issue back to the root cause: a tiny refactoring mistake hidden in a big PR.
Had we not caught it, customers trying to buy our best-selling products wouldn’t have been able to place orders.
Key Takeaways
1. One PR should do one thing only.
Mixing refactoring with unrelated changes makes it harder to review (and catch bugs).
2. When refactoring, ensure the changes are covered by unit tests.
If there aren’t any, write them first.
3. When enforcing a new linting rule, reformat the whole codebase in the same PR that introduces it.
Later PRs then contain only their own changes, and the next person who raises one will thank you.
4. Never disable an E2E test without fully understanding the root cause.
A failing test might be the only warning sign before disaster strikes.
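On takeaway 4, one way to make a flaky E2E test easier to triage is to check its data preconditions explicitly, so that "the test product is out of stock" and "the order flow is broken" show up as different outcomes. The sketch below is only an illustration of that idea: `StockLookup`, `Triage`, `runOrderTest`, and the `placeOrder` callback are all hypothetical names, and a real test would most likely be asynchronous and drive a browser rather than call a function.

```typescript
// All names here are hypothetical; the story does not show the real E2E test.
interface StockLookup {
  // Returns the units available for a SKU in the environment under test.
  unitsInStock(sku: string): number;
}

type Triage =
  | { kind: "passed" }
  | { kind: "data-issue"; reason: string }
  | { kind: "regression"; reason: string };

function runOrderTest(
  sku: string,
  stock: StockLookup,
  placeOrder: (sku: string) => boolean,
): Triage {
  // Precondition: report bad test data under its own label instead of a generic failure.
  if (stock.unitsInStock(sku) <= 0) {
    return { kind: "data-issue", reason: `${sku} is out of stock` };
  }
  // The data is valid, so a failure from here on points at the code under test.
  if (!placeOrder(sku)) {
    return {
      kind: "regression",
      reason: `order for ${sku} failed even though stock was available`,
    };
  }
  return { kind: "passed" };
}
```

With a split like this, only a `data-issue` result would be a candidate for a quick data fix, while a `regression` result would be a reason to hold the release and investigate.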