jghiringhelli
I Had 93% Test Coverage. Then I Ran Mutation Testing.

Line coverage is not wrong. It measures exactly what it says: the percentage of code lines executed when the test suite runs. 93.1% means 93.1% of lines are executed. That number is correct.

It does not tell you whether any test detects anything.


The Difference Between Execution and Detection

A test that calls calculateTax(100) and asserts result !== null executes the function completely. The line coverage tool marks it as covered. It will not catch a bug that returns the wrong tax rate. It will not catch a sign error. It will not catch a rounding failure. Every mutation that leaves the function returning something non-null will survive.

That test covers the line. It does not test the behavior.
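A concrete version of that weak test, using a hypothetical `calculateTax` (the function body and the 8% rate are illustrative, not from the paper):

```javascript
// Hypothetical implementation: 8% flat rate, rounded to cents.
function calculateTax(amount) {
  return Math.round(amount * 0.08 * 100) / 100;
}

// Weak test: executes every line, detects almost nothing.
// A mutant that returns 0, -8, or the amount itself still passes.
console.assert(calculateTax(100) !== null, "weak: non-null");

// Stronger tests: pin the exact value, so rate, sign, and
// rounding mutations all cause a failure.
console.assert(calculateTax(100) === 8, "strong: exact value");
console.assert(calculateTax(0.19) === 0.02, "strong: rounding");
```

Both versions produce identical line coverage. Only the second one produces a mutation score.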

Multiply that pattern across a codebase and you get a test suite with high line coverage and low test effectiveness. Both numbers are accurate. They measure different things.


The Adversarial Experiment

In the multi-agent adversarial experiment described in the Generative Specification white paper (https://doi.org/10.5281/zenodo.19073543), the GS treatment project produced a test suite with 93.1% line coverage.

Then Stryker mutation testing was applied to the treatment project's services layer: 116 effective mutants. Each mutant represents a realistic code change: flipped boolean, swapped comparison operator, removed return value. Stryker checks whether any test fails. If no test fails when the code is mutated, the test suite cannot detect that behavior.
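To make the mutants concrete, here is a sketch of what Stryker does to a single hypothetical line (the `isEligible` function and the age threshold are illustrative):

```javascript
// Hypothetical original line in a discount service:
function isEligible(age) {
  return age >= 65;
}

// Typical Stryker mutants of that one line:
//   age >= 65        →  age > 65
//   age >= 65        →  age < 65
//   return age >= 65 →  return true
//
// A test that only checks isEligible(80) === true kills the
// "age < 65" mutant but not "age > 65": 80 > 65 is still true.
// Boundary tests are what kill the remaining mutants:
console.assert(isEligible(65) === true, "boundary kills 'age > 65'");
console.assert(isEligible(64) === false, "kills 'return true'");
```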

The baseline mutation score: 58.62% MSI.

That is a 34-point gap. 93.1% line coverage, 58.62% mutation score. The distance between where the tests go and what they actually catch. Code that executes during tests and survives every mutation. That is the portion of your codebase where a bug can be introduced, your tests will pass, and your CI will go green.


Why AI-Generated Tests Show This Pattern

Before concluding this is a model problem, consider what it trained on.

The internet has decades of test suites. Most of them are bad. Tests that call a function and assert it returns something truthy. Tests named test_it_works. Tests that mock every dependency and assert nothing about behavior. Tests written to satisfy a coverage gate, not to catch defects. The model trained on all of it.

This may not be the only explanation. The model may also lack the domain specificity to know which boundary conditions matter for a given function. We do not know for certain. What we know is the outcome: tests that execute code without asserting its correctness.

This is not unique to AI-generated tests. Human-written test suites produced after the implementation show the same pattern. The cause is structural: tests written to satisfy a coverage metric optimize for the metric.

The fix is TDD. Write the test first, make it fail, then write the code that makes it pass. A test written before the implementation cannot pass until the implementation is correct. That is the only guarantee the test actually detects something. The model can follow TDD instructions. It does not do so spontaneously.
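A minimal sketch of that cycle, with a hypothetical `slugify` function: the assertion is written first, fails because nothing exists yet, and only then is the code written to satisfy it:

```javascript
// Red: written before any implementation exists, this line fails.
// console.assert(slugify("Hello World") === "hello-world");

// Green: the minimal implementation that makes it pass.
function slugify(title) {
  return title.toLowerCase().trim().replace(/\s+/g, "-");
}

console.assert(slugify("Hello World") === "hello-world");
console.assert(slugify("  Mutation  Testing ") === "mutation-testing");
```

Because the test existed and failed before the code did, you have direct evidence it can fail, which is exactly what a surviving mutant proves the weak tests cannot do.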


What Mutation Testing Unlocked

After the primary results were recorded, three rounds of targeted assertion improvements were run on the treatment project, guided by the surviving mutants Stryker identified. Each round tightened assertions, added boundary conditions, and replaced presence checks with correctness checks.
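A sketch of what one such tightening looks like, with a hypothetical `applyDiscount` service (the function and values are illustrative):

```javascript
// Hypothetical service method of the kind Stryker flags.
function applyDiscount(prices, percent) {
  return prices.map((p) => Math.round(p * (1 - percent / 100)));
}

// Before: a presence check. It survives mutants that skip the
// discount entirely or round in the wrong direction.
console.assert(applyDiscount([100, 50], 10).length > 0);

// After: a correctness check plus a boundary condition.
const discounted = applyDiscount([100, 50], 10);
console.assert(discounted[0] === 90 && discounted[1] === 45);
console.assert(applyDiscount([100], 0)[0] === 100, "0% boundary");
```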

The baseline was 58.62% MSI. After three rounds, the mutation score reached 93.10%, matching line coverage exactly.

When mutation score equals line coverage, the gap is zero: every covered line is verified. The tests not only execute the code, they detect when it breaks.

The line coverage number turned out to be reachable as a mutation score too. Once mutation testing provided a precise, external measure, the AI had a concrete target to iterate against, and quality improved until the two numbers matched.

That is what enforcement does. It moves quality measurement outside the generation loop.


Mutation Testing Is Not Interchangeable with Line Coverage

Line coverage answers: did the test suite execute this line? It does not ask whether the test suite detected anything wrong with the line.

Mutation testing answers: if I change this line, does any test fail? If no test fails, the test suite cannot detect that behavior.
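Under the common definition MSI = killed mutants / effective mutants, the experiment's figures imply roughly these kill counts (68 and 108 are back-calculated from the reported percentages, not stated in the paper):

```javascript
// MSI (mutation score indicator): killed / effective mutants.
const effectiveMutants = 116; // from the experiment's services layer
const msi = (killed) => (killed / effectiveMutants) * 100;

// 58.62% of 116 effective mutants ≈ 68 killed, 48 surviving.
console.assert(Math.round(msi(68) * 100) / 100 === 58.62);
// Reaching 93.10% means killing 108 of the 116.
console.assert(Math.round(msi(108) * 100) / 100 === 93.1);
```

Each of those 48 surviving mutants is a concrete, reproducible code change the suite could not see.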

The same gap appeared in the Shattered Stars migration case study in the paper: line coverage was 80%, mutation score was 58%. A 22-point gap between execution and detection. The adversarial experiment reproduced the pattern under controlled conditions. High line coverage and low mutation score is the signature of tests written to satisfy a metric.

Your coverage number tells you about execution. The mutation score tells you about detection. Both gates are required. Neither substitutes for the other.


Two Numbers

Line coverage: what the coverage tool reports when the suite runs. It tells you where the tests reach, and it sets the ceiling on detection: a mutant in code the tests never execute survives by default.

Mutation score: what Stryker or an equivalent tool produces. It tells you how much of what the tests reach they actually verify.

A codebase with 93.1% line coverage and 58.62% mutation score has a 34-point gap between where the tests go and what they catch. That gap is where bugs survive CI.

When both numbers converge, the gap is zero. That is the target.


The Action

Find the mutation testing tool for your stack. The original Stryker targets JavaScript and TypeScript; sibling projects cover .NET and Scala. If you are on Python, Java, Go, or anything else, you need a different tool. Pick the one that matches your language and run it on your services layer.

| Stack | Tool | Command |
| --- | --- | --- |
| JavaScript / TypeScript | Stryker | `npx stryker run` |
| Python | mutmut | `mutmut run` |
| Java / Kotlin | PIT | `mvn org.pitest:pitest-maven:mutationCoverage` |
| C# / .NET | Stryker.NET | `dotnet stryker` |
| Go | go-mutesting | `go-mutesting ./...` |
| Ruby | mutant | `mutant run` |
| Rust | cargo-mutants | `cargo mutants` |
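For the JavaScript/TypeScript row, a minimal Stryker configuration might look like the sketch below. The file name, glob, and threshold values are illustrative; check the Stryker documentation for your test runner:

```javascript
// stryker.config.mjs — a minimal sketch, not a canonical setup.
export default {
  testRunner: "jest", // or "vitest", "mocha", ...
  mutate: ["src/services/**/*.ts"], // scope to the services layer first
  reporters: ["html", "clear-text", "progress"],
  // Fail the build when MSI drops below 60% — this is the enforcement gate.
  thresholds: { high: 90, low: 70, break: 60 },
};
```

Scoping `mutate` to one layer keeps the first run fast; mutation testing reruns your suite once per mutant, so whole-repo runs can take hours.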

Run it. Then look at the gap between your line coverage and your mutation score.

If both numbers are close, be proud. That means your tests are not just executing code; they are verifying it. Every covered line breaks a test when it changes. That is what a test suite is supposed to do. Most codebases, AI-generated or not, do not start there.

If there is a significant gap, look at which mutants survived. Each one is a category of defect your test suite cannot currently detect. Ask your AI assistant to fix them: give it the list of surviving mutants and ask it to strengthen the assertions. The same model that wrote weak tests can write better ones; it needs the mutation report to know what "better" means.

That is what enforcement does. A gate that measures outside the generation loop produces quality improvement that prompt engineering alone cannot reliably achieve.

Your line coverage number tells you where the tests go. The mutation score tells you what they actually do when they get there.
