My Test-Selection Tool Skipped 73% of the Suite. That Number Was the Bug.

BreadWasEaten — Sat, 20 Jun 2026 06:57:06 +0000

Test Impact Analysis is a simple promise: don't re-run the tests your change couldn't possibly have broken. Record which tests touch which code once, then use your git diff to run only the affected subset. Google and Meta do it internally; pytest-testmon does a local-dev version. I built an open-source one for pytest called tia — and the most important thing I learned had nothing to do with the algorithm.
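The core idea fits in a few lines. What follows is a minimal sketch, not tia's actual code: the names `ImpactMap` and `select_tests`, and the sample test ids and files, are all invented for illustration. It assumes the impact map is a plain mapping from test id to the (file, line) locations that test executed.

```python
from typing import Dict, Set, Tuple

# A "location" is (file path, line number). All names here are illustrative.
Location = Tuple[str, int]
ImpactMap = Dict[str, Set[Location]]  # test id -> locations it executed


def select_tests(impact_map: ImpactMap, changed: Set[Location]) -> Set[str]:
    """Return every test whose recorded locations overlap the changed ones."""
    return {test for test, touched in impact_map.items() if touched & changed}


impact_map: ImpactMap = {
    "test_login": {("auth.py", 10), ("auth.py", 11), ("util.py", 3)},
    "test_logout": {("auth.py", 20), ("util.py", 3)},
    "test_render": {("views.py", 5)},
}

# Pretend the git diff touched line 3 of util.py.
selected = select_tests(impact_map, {("util.py", 3)})
print(sorted(selected))  # ['test_login', 'test_logout']
```

Note that a change to a location no test ever recorded selects nothing here. A real selector needs a conservative fallback for that case (new files, new lines), which this sketch leaves out.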

It was this: the first time my benchmark looked great, it was lying to me.

A number too good to be true

To prove tia actually saved time, I avoided the trap of a toy demo where every test hits a unique function — that's rigged to look good. I pointed it at Flask, a real third-party suite, and replayed its last 15 real commits. For each one: record the impact map on the parent commit, then measure how much of the 483-test suite tia would skip for that change.

The result came back: median skip rate 73%. I almost wrote it down and moved on.
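For clarity, here is how a per-commit skip rate and its median can be computed. This is a sketch under assumptions: the function name `skip_rate` is made up, and the per-commit counts are placeholder values, not the numbers from the Flask replay.

```python
from statistics import median

TOTAL_TESTS = 483  # size of the Flask suite quoted above


def skip_rate(selected: int, total: int = TOTAL_TESTS) -> float:
    """Fraction of the suite that the selector did NOT run for one commit."""
    return 1 - selected / total


# Hypothetical per-commit selection counts, NOT the measured data.
selected_per_commit = [483, 120, 60, 400, 10]
rates = [skip_rate(n) for n in selected_per_commit]
print(f"median skip rate: {median(rates):.0%}")
```

The median hides which commits drove the number, so it is worth looking at the per-commit rates too: a commit that touches the shared core should sit near 0% skipped.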

Why I didn't celebrate

73% felt too good. Flask is small and tightly coupled — a lot of its tests funnel through the same request-handling core. If I changed that core, a correct selector should pull in most of the suite, not skip three-quarters of it. A number that good, on a codebase that coupled, wasn't a win. It was a symptom.

So I dug in instead of celebrating. The bug was upstream of my own code, in coverage.py.

The false negative wearing a costume

tia records which test touches which line using coverage.py's "dynamic contexts": it switches the active context to each test's id, so the coverage data knows which test executed which line. On Python 3.12+, coverage.py defaults to a fast new measurement core called sysmon (built on sys.monitoring). sysmon, at the time, recorded only the FIRST context to hit each line and dropped the rest.

Read that again in the context of test selection. If test_a and test_b both call a shared helper, sysmon recorded that the helper's lines belonged to test_a — and silently forgot test_b. So when I later changed that helper, tia "knew" only test_a touched it, and skipped test_b.
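The failure mode is easy to reproduce in a toy model. The sketch below does not use coverage.py at all; it simulates the two recording behaviours on a made-up execution trace, and the function names `record_first_only` and `record_all` are illustrative.

```python
from collections import defaultdict

# Execution trace as (test id, line) pairs; both tests run the shared helper.
trace = [
    ("test_a", "helper.py:1"),
    ("test_a", "a.py:1"),
    ("test_b", "helper.py:1"),
    ("test_b", "b.py:1"),
]


def record_first_only(events):
    """Model of the lossy behaviour: a line keeps only its first context."""
    contexts = {}
    for test, line in events:
        contexts.setdefault(line, {test})
    return contexts


def record_all(events):
    """Model of correct recording: a line keeps every context that ran it."""
    contexts = defaultdict(set)
    for test, line in events:
        contexts[line].add(test)
    return dict(contexts)


changed_line = "helper.py:1"
print(record_first_only(trace)[changed_line])  # {'test_a'}
print(sorted(record_all(trace)[changed_line]))  # ['test_a', 'test_b']
```

With the lossy map, a change to the helper selects one test instead of two, and the missing one counts as "skipped" in the benchmark.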

That's not a high skip rate. That's a false negative — the one thing a test-selection tool must never do. The impressive 73% was tia confidently skipping tests that would have caught the change. The costume was a "feature"; underneath was the cardinal sin.

The fix, and the honest number

The fix was to force coverage's C tracer during recording:

COVERAGE_CORE=ctrace

It records every context per line. After that, all 483 tests mapped correctly — and the skip rate fell to its honest value: about 21% median on Flask.
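One way to make sure the variable is set before coverage starts measuring is to run the recording pass in a subprocess with an explicit environment. This is a sketch, not how tia does it internally (the post doesn't say); `recording_env` and `recording_command` are hypothetical helpers, and the command is only built here, not executed.

```python
import os
import sys


def recording_env(base=None):
    """Copy an environment and pin coverage.py to its C tracer."""
    env = dict(os.environ if base is None else base)
    env["COVERAGE_CORE"] = "ctrace"
    return env


def recording_command(pytest_args=()):
    """Build (but do not run) a pytest invocation for the recording pass."""
    return [sys.executable, "-m", "pytest", *pytest_args]


# To actually run the recording pass:
#   subprocess.run(recording_command(["tests/"]), env=recording_env(), check=True)
print(recording_env({"PATH": "/usr/bin"}))
```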

21% is a far less exciting number. It is also a true one. And it taught me to publish the floor, not the ceiling.

But a floor alone is its own kind of lie

Flask is near the worst case for test selection: small, tightly coupled. Publishing only its number would undersell the tool as badly as the 73% oversold it. So I ran the same per-commit replay on boltons — a modular utility library where each module is largely independent.

There, the honest median skip on real logic changes is about 96%. A change to one function runs the two tests that call it, not the other 400.

Same tool. The only variable is how decoupled your tests are. So the honest pitch isn't a single number, it's a range:

Flask (tightly-coupled, worst case): ~21% median skip
boltons (modular library): ~96% median skip

Most real codebases live somewhere in between. Which means the skip rate is, quietly, a measurement of your own coupling.

Guarding the sin on purpose

After being burned by a false negative I couldn't see, I stopped trusting "it looks right." I wrote an adversarial test: build a small repo, record the map, then mutate every single covered function and assert that tia re-selects every test that exercises it.

Then — the part that actually matters — I deliberately broke the selector to confirm the test fails when a false negative is introduced. A guarantee you've never watched fail is just a hope with good posture.
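The shape of that guard, reduced to an in-memory model, looks roughly like this. It is a sketch under assumptions: every name is invented, functions are identified by name instead of by real source mutation, and the ground truth is handed in directly, where the real test would get it by mutating code in a small repo and seeing which tests are affected.

```python
from typing import Callable, Dict, Set

ImpactMap = Dict[str, Set[str]]  # test id -> names of functions it exercises
Selector = Callable[[ImpactMap, str], Set[str]]


def correct_selector(impact_map: ImpactMap, changed_fn: str) -> Set[str]:
    return {t for t, fns in impact_map.items() if changed_fn in fns}


def broken_selector(impact_map: ImpactMap, changed_fn: str) -> Set[str]:
    """Deliberately lossy: returns at most one test per changed function."""
    hits = sorted(correct_selector(impact_map, changed_fn))
    return set(hits[:1])


def find_false_negatives(select: Selector, truth: ImpactMap) -> Dict[str, Set[str]]:
    """Mutate each covered function in turn; report tests wrongly skipped."""
    missed = {}
    for fn in set().union(*truth.values()):
        expected = {t for t, fns in truth.items() if fn in fns}
        skipped = expected - select(truth, fn)
        if skipped:
            missed[fn] = skipped
    return missed


truth: ImpactMap = {
    "test_a": {"helper", "a"},
    "test_b": {"helper", "b"},
}

assert find_false_negatives(correct_selector, truth) == {}
# The guard must fail on a broken selector, or it is guarding nothing.
assert find_false_negatives(broken_selector, truth) == {"helper": {"test_b"}}
```

The second assertion is the important one: it is the check that the check itself can fail.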

What it doesn't do

None of this makes tia magic. Dynamic dispatch (getattr, eval, reflection) can't be resolved statically in general; tia detects it and degrades to running every test that touches that file, and you should still run the full suite on a cadence. It overlaps heavily with testmon for the common case. Its genuinely distinct edges are narrow: it tracks non-Python dependencies (change a fixture file and it reruns the tests that actually read it), and it's built to share one map across CI machines, which testmon struggles with.
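As an illustration of the "detect and degrade" idea, here is one plausible heuristic built on the standard ast module. How tia actually detects dynamic dispatch isn't described in this post, so treat `DYNAMIC_CALLS`, `uses_dynamic_dispatch`, and `granularity` as assumptions made for the sketch.

```python
import ast

DYNAMIC_CALLS = {"getattr", "setattr", "eval", "exec", "__import__"}


def uses_dynamic_dispatch(source: str) -> bool:
    """True if the module calls one of DYNAMIC_CALLS by its bare name."""
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in DYNAMIC_CALLS
        ):
            return True
    return False


def granularity(source: str) -> str:
    """Choose line-level selection, or fall back to file-level when unsure."""
    return "file" if uses_dynamic_dispatch(source) else "line"


print(granularity("def f(x):\n    return x + 1\n"))  # line
print(granularity("def g(obj, name):\n    return getattr(obj, name)()\n"))  # file
```

A name-based check like this misses aliases (`g = getattr`) and other routes such as importlib, which is one more reason to keep a periodic full-suite run as a backstop.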

But the thing I'm proud of isn't a feature. It's that when the tool looked brilliant, I assumed it was broken — and it was. If you build anything that decides what NOT to run, that instinct is the entire job.

Try it / break it

pip install pytest-tia

Repo, benchmarks, and the full write-ups: https://github.com/breadMSA/pytest-tia

If you read one thing, read the benchmark methodology and try to break it. I'd much rather find the next "73%" from you than from a user who trusted a skipped test.

DEV Community: BreadWasEaten