DEV Community

JEONSEWON
JEONSEWON

Posted on

My test suite said "pass." It was lying by 19 tests.

Before adding a feature, I ran the suite to confirm a clean baseline. It came back 139 passed, 3 skipped. Green-ish, no failures. The easy move is to nod and start building.
I didn't, because the number was wrong. My known-good baseline is 158. A run that comes back at 139 isn't "close enough" — it's 19 tests short, and I have no idea whether those 19 are passing, failing, or simply not there.

The cause turned out to be boring: a fresh virtualenv hadn't installed the adapter extras, so every OpenTelemetry-dependent test silently skipped or never got collected. Nothing errored. The run looked healthy precisely because the missing tests couldn't fail — they just weren't run. That's the most dangerous kind of green: not "everything passed," but "the things that would have caught a problem weren't even asked."
Why stop the whole feature for this? Because my entire way of working is "compare before and after my change to see what broke." If the starting point isn't the real 158, that comparison is meaningless — I'd later have no way to tell whether my new code broke something or whether it was already broken. A baseline you can't trust poisons every result built on top of it.

Installed the extras, reran: 158 passed, 0 skipped. Now there's a real starting line. Only then did I let myself build.
The lesson isn't about pytest. It's that "no failures" and "everything was actually checked" are different claims, and the gap between them is exactly where false confidence lives. Count what should be there, not just what's red.
Code: github.com/JEONSEWON/Clew-by-Custos

BuildInPublic #AIAgents #Testing #DevTools

Top comments (0)