When AI writes both the code and the tests, coverage only proves the same blind spot was executed twice.
Every test passed. Nothing was tested.
AI wrote 20 tests for a Laravel controller.
Every endpoint was covered. Every test passed. The coverage report looked healthy.
Then I ran mutation testing.
95 mutants generated. 68 killed. 27 survived.
The tests ran the code. They did not test it.
The new failure mode: shared blind spots
Here's the old setup: a developer writes the code, developer/QA writes the tests. Bugs live in the gap between two mental models, and that gap is doing useful work. Two people, two sets of assumptions, and the disagreements are where the bugs surface.
That's changing. More and more, AI writes both the code and the tests, and the more of your workflow you hand it, the more this bites. Same model, same assumptions, same missed cases.
This is the real failure mode, and it's worse than "AI writes bad tests." The agent writes tests that agree with its own misunderstanding. Writing code forces you into the weeds, where you trip over the blind spots. Reading code skims past them, so the reviewer inherits the agent's assumptions without ever discovering they were assumptions.
The test suite agrees with the bug.
Mutation testing is fifty years old (Lipton 1971, DeMillo–Lipton–Sayward 1978) and there are good primers on the basics elsewhere. This post isn't one. It's about what the technique reveals now that it couldn't before: not human laziness, but agent overconfidence.
The experiment
I'm using PestPHP in a Laravel backend as it is what my current test application is written in but Mutation testing exists in every major language: mutmut for Python, Stryker for JS/TS, PIT for Java. The lesson and the survivor categories generalise.
A real Laravel 12 legacy controller. Cyclomatic complexity 13 (CC 13). The agent generated 16 Feature tests and 4 Unit tests, all passing. Then:
--covered-only restricts mutations to lines the tests actually run. So a survivor is a covered line whose assertions wouldn't notice it breaking, not a line no test ever reached. That separates the two failures coverage smears together: code you never exercised, and code you exercised without checking. This score only judges the second. It assumes coverage is already high and asks the harder question on top.
95 mutants generated. 68 killed. 27 survived.
Pest prints each mutation line by line with the mutator that generated it. A tick means the test suite noticed the change. A cross means it survived.
Each row is one transformation applied to the source. TernaryNegated flips a ternary's condition, IfNegated inverts a conditional, SmallerToSmallerOrEqual swaps < for <=. The survivor on line 25, where < became <= and no test failed, is exactly the date-equality blind spot below.
What survived
| Category | Count | % |
|---|---|---|
Response shape (RemoveArrayItem) |
11 | 41% |
Boundary equality (<= / < / > / >=) |
7 | 26% |
Logic inversion (negations, boolean and/or, ===/!==) |
6 | 22% |
| Loop control flow (break/continue/foreach) | 3 | 11% |
Before running it I made a prediction. The survivors would cluster in the hidden closure branches phpmetrics caught and I'd missed. (phpmetrics counted CC 13. I'd hand-counted 10. The three branches I missed were a ?? default, a short-circuit, and a closure inside ->map().)
I was wrong. The closure had two survivors. Out of twenty-seven.
The real gaps were boring: response bodies, date equality, loop cardinality. Places I'd skimmed because they looked routine, which is exactly where the agent skimmed when it wrote the tests.
What the survivors teach
1. Status code assertions are not contract tests.
assertStatus(422) passes whether the body is ['error' => '...'] or []. Use assertExactJson or assertJsonStructure on error responses. This single change kills 11 of 27 mutants here for free.
2. Date logic needs equality boundaries.
The agent wrote past, future, and "clearly active now" cases. It never wrote $end == $now. Equality on continuous values feels artificial, and that's exactly why it gets missed.
3. A loop with one fixture is not really tested.
With N≤1, break and continue are indistinguishable. Every foreach whose order matters needs a fixture with at least two items, ordered to discriminate.
The agentic feedback loop
Coverage becomes smoke. It tells you the agent didn't forget to write tests. Nothing more.
Mutation testing becomes the feedback loop. Run it on the diff, before review. The agent isn't bad at writing tests, it's bad at knowing which tests matter. Mutation testing tells it, mechanically, which ones do.
So I asked the agent to fix it. Same controller, same prompt, but this time with the surviving mutant list as input.
| Metric | Before | After |
|---|---|---|
| Tests in suite | 20 | 40 |
| Mutants generated | 95 | 106 |
| Killed | 68 | 95 |
| Survivors | 27 | 11 |
| Mutation score | 71.58% | 89.62% |
Doubling the test count took survivors from 27 to 11. Note the mutant count went up: the extra fixtures covered more lines, so Pest generated 106 mutants this run instead of 95. A bigger surface, killed harder. Every category dropped except the corners.
Where QA judgement still matters
The remaining 11 were not all worth chasing.
Seven of the eleven I judged to be equivalent mutants: syntactic changes that don't actually alter behaviour. Removing the 'date' rule from a Laravel validator that already has 'after_or_equal:today' is one of them. Non-dates still fail, because after_or_equal:today needs a parseable date too. Detecting equivalence is undecidable in general, so the tool can't rule these out for you. You read each one and make the call. You flag it and ignore it.
The other four are fixture-cost prohibitive. Killing them is technically possible but expensive. One survivor changes $start > $now to $start >= $now inside a loop. The only test that distinguishes them is a violation starting at exactly now(): frozen time, a multi-item fixture ordered so this case fires first, fragile under clock or timezone drift. Fifteen lines of setup for a one-character mutation that triggers at one specific second.
That's the QA judgement call. Mutation testing surfaces the gap. It does not decide whether the gap is worth closing.
90% is the honest stopping point. High enough to trust the suite, low enough to admit which mutants are intentionally left alive.
A practical /mutate command
You want this to run between the agent's first test pass and your PR review. Split it into two pieces: the work, and the stopping condition.
The work is a slash command. It describes a single pass:
Run pest --mutate on the changed classes.
Categorise survivors. Add a test for each.
Print the score and surviving mutant list.
Do not weaken the production code to kill mutants.
The last line is doing real work. The easiest way to kill a mutant is to delete the code it lives in, so the command isn't allowed to touch production code, only tests.
But a procedure can't teach taste. "Add a test for each survivor" is the step. How you kill the mutant is the part that matters. A response-shape survivor wants assertExactJson, not another assertStatus. A loop survivor wants a two-item ordered fixture, not a second assertion on the same one. I keep that knowledge in a pest-code-writer skill, so every pass writes tests that kill mutants the right way, not the easy way.
That split is what makes the loop worth trusting. The skill knows how to write the test, /mutate runs the pass and holds the guardrails.
The condition belongs in /goal, which exists for exactly this. You give it a completion condition and the model checks after every turn whether it's met:
/goal mutation score ≥ 85% or only equivalent mutants remain
/mutate
The catch: the evaluator only sees the transcript. That's why the command prints the score and the survivor list, if the number isn't on screen, the goal can't see it.
Closing
Mutation testing pressures the tests, not the spec. Every mutant you kill makes the suite a tighter net around today's behaviour, and says nothing about whether today's behaviour was ever right. Kill every mutant on a misread requirement and you just get the wrong answer. That gap needs a human in the loop reading the requirement, not a higher score.
But in a world where AI writes both sides of the contract, "would anyone notice if this changed?" is the question coverage was pretending to answer and never could.
Coverage proves the code ran. Mutation testing proves it matters.



Top comments (0)