We had 847 tests. Green checkmarks across the board. 100% coverage on our critical paths. I was proud of that dashboard.
Then a user reported that our checkout was double-charging on Safari. Another said the password reset emails weren't arriving. Within 24 hours we had 14 confirmed bugs — and our CI pipeline was still proudly green.
That's when I realized: 100% code coverage is a vanity metric that makes you feel safe while your users burn.
The Illusion of Coverage
Here's what our test suite was great at:
- Testing individual functions in isolation
- Verifying happy paths with clean inputs
- Catching regressions in pure utility functions
Here's what it completely missed:
- Browser-specific behavior — Safari's date parsing is different from Chrome's. Our test runner used Node.js. No browser, no Safari.
- Race conditions — Two API calls firing simultaneously? Our mocked fetch resolved instantly. In production, timing matters.
- Integration gaps — Each module had tests. The connections between modules did not.
- Real-world data — Our fixtures were clean. User data is never clean.
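To make the browser-specific item concrete: one commonly reported difference is that older Safari versions return `Invalid Date` for space-separated timestamps such as `2024-01-15 10:30:00`, while Chrome and Node.js accept them. The post doesn't record which input tripped things up, so treat that format as an assumption. Here's a minimal sketch of a parser that sidesteps engine-specific string parsing; the `parseTimestamp` name and the decision to treat input as UTC are hypothetical choices for illustration.

```javascript
// Hypothetical helper: parse "YYYY-MM-DD HH:mm:ss" (or with a "T") without
// relying on how a particular engine's Date constructor reads the string.
function parseTimestamp(text) {
  const match = /^(\d{4})-(\d{2})-(\d{2})[ T](\d{2}):(\d{2}):(\d{2})$/.exec(text);
  if (!match) {
    throw new RangeError(`Unrecognized timestamp: ${text}`);
  }
  const [year, month, day, hour, minute, second] = match.slice(1).map(Number);
  // Assumption: the timestamp is in UTC. Date.UTC uses zero-based months.
  return new Date(Date.UTC(year, month - 1, day, hour, minute, second));
}

console.log(parseTimestamp("2024-01-15 10:30:00").toISOString());
```

Because the numeric parts go through `Date.UTC`, the result no longer depends on which engine happens to run the code.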
The Bug That Started It All
A user in Japan reported being charged twice for a single purchase. We couldn't reproduce it locally. Our payment integration tests passed every time.
The root cause: on slow networks, the checkout button could be submitted twice. Our mock API responded in 12ms. Real networks took around 800ms. That gap was enough for impatient fingers to click twice.
The fix was 3 lines of code:
const [isSubmitting, setIsSubmitting] = useState(false);
// Submit handler: setIsSubmitting(true) before the request, false once it settles
// Button: disabled={isSubmitting}
Three lines. But the test suite — our beautiful 847-test suite — had zero tests for this scenario because nobody wrote a test for "user clicks button twice."
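The same idea can be expressed outside React as an in-flight guard around the submit handler, which also makes the "user clicks button twice" case easy to test. This is a minimal sketch, not the code from our checkout: `createSubmitGuard` and `slowCharge` are hypothetical names, and `slowCharge` stands in for a real payment call.

```javascript
// Hypothetical helper: wraps an async submit function so that calls made
// while a previous call is still pending are ignored.
function createSubmitGuard(submit) {
  let inFlight = false;
  return async function guardedSubmit(...args) {
    if (inFlight) return undefined; // repeat click while pending: do nothing
    inFlight = true;
    try {
      return await submit(...args);
    } finally {
      inFlight = false; // allow a new attempt once the request settles
    }
  };
}

// Stand-in for a payment request on a slow network (20ms here to keep it quick).
let charges = 0;
const slowCharge = () =>
  new Promise((resolve) => setTimeout(() => { charges += 1; resolve("charged"); }, 20));

const guardedCharge = createSubmitGuard(slowCharge);

// Two "clicks" in a row: only the first one reaches slowCharge.
Promise.all([guardedCharge(), guardedCharge()]).then((results) => {
  console.log(results, charges);
});
```

The key detail is that the stand-in takes real time to resolve. With an instant mock, the second click never overlaps the first and the bug stays hidden.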
The 14-Bug Autopsy
After that incident, we categorized all 14 bugs:
| Bug Category | Count | Why Our Tests Missed It |
|---|---|---|
| Browser compatibility | 4 | ❌ No cross-browser tests |
| Race conditions | 3 | ❌ Mocks too fast |
| Edge-case user input | 3 | ❌ Fixtures too clean |
| Third-party API changes | 2 | ❌ No contract testing |
| Time zone bugs | 2 | ❌ All tests ran in UTC |
14 bugs. Zero caught by CI. The problem wasn't that we didn't have enough tests — we had the wrong kind of tests.
What We Changed
1. Added Integration Tests at Module Boundaries
Unit tests check the bricks. Integration tests check the mortar. We added tests specifically for the connections between services — where most real bugs hide.
2. Started Running Tests in Real Browsers
We added Playwright for critical user flows: checkout, auth, search. These run against real Chrome and Firefox instances. Safari is next.
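For reference, here's roughly what a multi-browser Playwright setup can look like. This is a sketch, not our actual config: the `./e2e` directory is a hypothetical path. `defineConfig`, `devices`, and the device names come from `@playwright/test`. Note that the `Desktop Safari` entry runs Playwright's WebKit build, which is Safari's engine rather than Safari itself, so it narrows the gap without fully closing it.

```javascript
// playwright.config.js (requires the @playwright/test package)
const { defineConfig, devices } = require("@playwright/test");

module.exports = defineConfig({
  testDir: "./e2e", // hypothetical location of the end-to-end tests
  projects: [
    { name: "chromium", use: { ...devices["Desktop Chrome"] } },
    { name: "firefox", use: { ...devices["Desktop Firefox"] } },
    // WebKit approximates Safari; it is not the shipping Safari browser.
    { name: "webkit", use: { ...devices["Desktop Safari"] } },
  ],
});
```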
3. Mock Network Latency
Instead of instant mock responses, we randomized delays between 100ms and 2000ms. This surfaced race conditions we never knew existed.
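Here's a minimal sketch of that idea, assuming the mock is a plain async function with a `fetch`-like signature. The `withLatency` and `randomDelay` names are hypothetical, and the injectable `random` parameter is only there so a test can pin the delay.

```javascript
// Pick an integer delay in [minMs, maxMs]. `random` is injectable for tests.
function randomDelay(minMs, maxMs, random = Math.random) {
  return minMs + Math.floor(random() * (maxMs - minMs + 1));
}

// Wrap a mock fetch so every call waits a random amount of time first.
// Defaults mirror the 100ms to 2000ms range described above.
function withLatency(fetchImpl, { minMs = 100, maxMs = 2000, random = Math.random } = {}) {
  return async function delayedFetch(...args) {
    const delay = randomDelay(minMs, maxMs, random);
    await new Promise((resolve) => setTimeout(resolve, delay));
    return fetchImpl(...args);
  };
}

// Usage: an instant mock becomes a slow, jittery one.
const instantMock = async (url) => ({ ok: true, url });
const slowMock = withLatency(instantMock, { minMs: 1, maxMs: 5 }); // short range for the demo
slowMock("/api/checkout").then((response) => console.log(response.ok));
```

Because two wrapped calls can now finish in either order, tests that silently depended on response ordering start failing, which is the point.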
4. Contract Testing for APIs
We used Pact to verify that our frontend's expectations of backend APIs actually match reality. Two bugs disappeared the day we added this.
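To show the underlying idea without reproducing Pact's API, here's a hand-rolled sketch: the frontend writes down the fields and types it relies on, and a check reports any mismatch in a real response. This is deliberately not Pact, which also handles provider verification and sharing contracts between teams. The `orderContract` fields and the `contractViolations` helper are hypothetical.

```javascript
// Hypothetical consumer expectation: fields the checkout page reads from an order.
const orderContract = { id: "string", totalCents: "number", currency: "string" };

// Return a list of human-readable mismatches between a contract and a payload.
function contractViolations(contract, payload) {
  return Object.entries(contract)
    .filter(([field, expectedType]) => typeof payload[field] !== expectedType)
    .map(([field, expectedType]) =>
      `${field}: expected ${expectedType}, got ${typeof payload[field]}`);
}

// A backend that renamed totalCents to total would be flagged here.
console.log(contractViolations(orderContract, { id: "o1", total: 12, currency: "JPY" }));
```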
5. Time Zone Roulette
We randomize the test runner's timezone. Half our date bugs appeared within the first week.
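A small sketch of why this works, under a few assumptions: the zone list and the `pickTimeZone` and `dayOfMonthIn` names are hypothetical, and `Intl.DateTimeFormat` is used to make the zone explicit so the example behaves the same on any machine. The date bug it shows is a classic one: a date-only ISO string is parsed as midnight UTC, so it can land on the previous calendar day west of Greenwich.

```javascript
// Hypothetical pool of zones, chosen to cover negative, positive, and extreme offsets.
const ZONES = ["UTC", "America/Los_Angeles", "Asia/Tokyo", "Australia/Sydney", "Pacific/Kiritimati"];

// Pick a zone for this test run. `random` is injectable so a run can be replayed.
function pickTimeZone(random = Math.random) {
  return ZONES[Math.floor(random() * ZONES.length)];
}

// Day of the month for a given instant, as seen from a given time zone.
function dayOfMonthIn(date, timeZone) {
  const formatter = new Intl.DateTimeFormat("en-US", { timeZone, day: "numeric" });
  return Number(formatter.format(date));
}

const shipDate = new Date("2024-03-10"); // date-only ISO string: midnight UTC
console.log(dayOfMonthIn(shipDate, "UTC"));                 // 10
console.log(dayOfMonthIn(shipDate, "America/Los_Angeles")); // 9, the previous day
console.log(pickTimeZone());
```

In a Jest project, one place to apply this would be a `globalSetup` file that sets `process.env.TZ = pickTimeZone()` and logs the chosen zone so a failing run can be reproduced. That wiring is an assumption about the setup, not a description of ours.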
The New Philosophy
Coverage tells you what code runs. It doesn't tell you what breaks.
Now we track different metrics:
- Bug escape rate — bugs found by users vs. caught in CI
- Mean time to detection — how fast our tests find regressions
- Integration test coverage — not line coverage, but scenario coverage
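One way to turn the first metric into a number, assuming "escape rate" means the share of all known bugs that reached users (the list above only says "found by users vs. caught in CI", so this exact formula is an assumption, and `bugEscapeRate` is a hypothetical helper):

```javascript
// Fraction of known bugs that escaped to users. Returns 0 when no bugs are recorded.
function bugEscapeRate({ foundByUsers, caughtInCi }) {
  const total = foundByUsers + caughtInCi;
  return total === 0 ? 0 : foundByUsers / total;
}

// The incident described earlier: 14 bugs reported by users, none caught by CI.
console.log(bugEscapeRate({ foundByUsers: 14, caughtInCi: 0 })); // 1
```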
Our total test count went down (we deleted 200+ redundant unit tests). Our bug escape rate went down 80%.
The dashboard looks less impressive. The product works better.
Have you been burned by "green tests, broken production"? What testing gaps surprised you most? I'd love to hear your war stories in the comments.