AI coding agents can now generate code and tests in seconds.
That’s useful — but it also creates a problem we don’t talk about enough:
We can have high test coverage and passing tests … without actually having meaningful tests.
I recently ran into this at work on JavaScript projects where the code looked well-tested, CI was green, and coverage was high.
Then I added mutation testing.
Everything changed.
What is mutation testing?
Mutation testing is a way of checking whether your tests actually detect bugs.
It works by introducing small, deliberate changes (called mutants) into your code, such as:
- changing > to >=
- flipping conditions
- modifying return values
- removing logic
Then it runs your test suite.
Two outcomes:
- Tests fail → the mutant is killed
- Tests pass → the mutant survives
If mutants survive, it means your tests didn’t catch a real change in behavior.
In short:
Mutation testing tells you whether your tests are actually doing their job.
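To make the killed/survived distinction concrete, here is a minimal hand-rolled sketch in plain Java. It is not how PIT works internally (PIT mutates compiled bytecode and runs your real test suite); the clamp method, the two mutants, and the one-test "suite" are all hypothetical and exist only to show both outcomes side by side.

```java
import java.util.List;
import java.util.function.IntUnaryOperator;
import java.util.function.Predicate;

public class MutantDemo {

    // Original code under test (hypothetical): negative values become 0.
    static int clamp(int x) {
        if (x < 0) return 0;
        return x;
    }

    // Mutant A: the condition has been flipped.
    static int flippedCondition(int x) {
        if (x >= 0) return 0;
        return x;
    }

    // Mutant B: the guarding logic has been removed.
    static int removedLogic(int x) {
        return x;
    }

    // A deliberately thin suite: a single happy-path check.
    static final List<Predicate<IntUnaryOperator>> TESTS = List.of(
        impl -> impl.applyAsInt(5) == 5
    );

    // A mutant is killed when at least one test fails against it.
    static boolean isKilled(IntUnaryOperator impl) {
        return TESTS.stream().anyMatch(test -> !test.test(impl));
    }

    public static void main(String[] args) {
        System.out.println("original passes: " + !isKilled(MutantDemo::clamp));            // true
        System.out.println("flipped killed:  " + isKilled(MutantDemo::flippedCondition)); // true
        System.out.println("removed killed:  " + isKilled(MutantDemo::removedLogic));     // false
    }
}
```

The flipped condition is caught because it changes the result for the one input the suite checks. The removed guard survives because no test ever passes a negative number, which is exactly the kind of gap a surviving mutant points at.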
Why this matters in the AI agent era
AI tools are great at generating tests quickly.
But those tests often:
- cover happy paths only
- miss edge cases
- contain weak assertions
- verify implementation details rather than results
This leads to a dangerous mismatch:
- High coverage ✔️
- Passing tests ✔️
- Real confidence ❌
Mutation testing helps close that gap.
It answers a better question:
“Would these tests actually catch a bug?”
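As a rough illustration of the "weak assertions" problem, the sketch below compares a test that only checks that something came back with one that asserts the actual result. The conversion method and both tests are hypothetical, and the tests are written as plain boolean methods rather than JUnit tests so the example stays self-contained.

```java
import java.util.function.DoubleUnaryOperator;

public class WeakAssertionDemo {

    // Hypothetical code under test: Celsius to Fahrenheit.
    static double toFahrenheit(double celsius) {
        return celsius * 9 / 5 + 32;
    }

    // The same method with a bug: the +32 offset is missing.
    static double brokenToFahrenheit(double celsius) {
        return celsius * 9 / 5;
    }

    // Weak test: runs the code (so it counts toward coverage) but only
    // checks that a number came back.
    static boolean weakTest(DoubleUnaryOperator impl) {
        return !Double.isNaN(impl.applyAsDouble(100));
    }

    // Stronger test: asserts the expected result for a known input.
    static boolean strongTest(DoubleUnaryOperator impl) {
        return Math.abs(impl.applyAsDouble(100) - 212.0) < 1e-9;
    }

    public static void main(String[] args) {
        System.out.println("weak test, broken code:    " + weakTest(WeakAssertionDemo::brokenToFahrenheit));   // true
        System.out.println("strong test, broken code:  " + strongTest(WeakAssertionDemo::brokenToFahrenheit)); // false
        System.out.println("strong test, correct code: " + strongTest(WeakAssertionDemo::toFahrenheit));       // true
    }
}
```

Both tests execute the same line, so both contribute identical coverage. Only the second one fails when the code is wrong.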
Mutation testing in a real Java project (PIT + Gradle)
I integrated PIT into a Gradle-based Java project that already used JUnit and JaCoCo.
Setup is straightforward:
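plugins {
    id 'info.solidsoft.pitest' version '1.19.0'
}

pitest {
    // Only mutate classes in the application's own package
    targetClasses = ['com.jonjam.accuweathermcp.*']
    // PIT's STRONGER mutator group (the defaults plus a few extra mutators)
    mutators = ['STRONGER']
    // Enables JUnit 5 support via the pitest-junit5-plugin
    junit5PluginVersion = '1.2.3'
    outputFormats = ['HTML', 'XML']
    // Fail the build if the mutation score drops below 70%
    mutationThreshold = 70
}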
Run it with:
./gradlew pitest
What the results showed
On paper, the test suite already looked solid:
- 99% line coverage (366/368 lines)
- 96% mutation coverage as reported (631/703 mutants killed, which works out to about 90% of all generated mutants)
But mutation testing tells a deeper story.
Coverage answers:
“Did we execute the code?”
Mutation testing answers:
“Would a test fail if the code was wrong?”
The surviving mutants highlighted cases where:
- code was executed but not properly asserted
- edge cases were missing
- tests were too implementation-focused
That’s the key insight:
High coverage ≠ strong tests
A simple example: surviving mutants
Here’s a basic example of what PIT is catching:
// original
if (temperature > 30) {
    return "hot";
}

// mutant
if (temperature >= 30) {
    return "hot";
}
If your tests only check broad behavior like:
- “above 30 is hot”
- “below 30 is not hot”
…then both versions pass, because the two only differ when the temperature is exactly 30 and no test checks that value.
That mutant survives.
And that’s the point: your tests didn’t actually detect a meaningful change.
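A test pinned to the boundary value is what kills this mutant. The sketch below shows the idea in plain Java; in a real project it would be a JUnit test. The "not hot" fallback and the specific values 35 and 20 are assumptions, since the snippet above only shows the condition itself.

```java
import java.util.function.IntFunction;

public class BoundaryTestDemo {

    // Original condition from the example; the "not hot" fallback is assumed.
    static String classify(int temperature) {
        if (temperature > 30) {
            return "hot";
        }
        return "not hot";
    }

    // The mutant: > replaced with >=.
    static String classifyMutant(int temperature) {
        if (temperature >= 30) {
            return "hot";
        }
        return "not hot";
    }

    // Broad checks: one value well above 30, one well below.
    static boolean broadTests(IntFunction<String> impl) {
        return impl.apply(35).equals("hot") && impl.apply(20).equals("not hot");
    }

    // Boundary check: exactly 30 must not count as hot.
    static boolean boundaryTest(IntFunction<String> impl) {
        return impl.apply(30).equals("not hot");
    }

    public static void main(String[] args) {
        System.out.println("broad tests pass on mutant:     " + broadTests(BoundaryTestDemo::classifyMutant));   // true
        System.out.println("boundary test passes on mutant: " + boundaryTest(BoundaryTestDemo::classifyMutant)); // false
        System.out.println("boundary test passes on original: " + boundaryTest(BoundaryTestDemo::classify));     // true
    }
}
```

The broad tests cannot tell the two versions apart. The single check at 30 passes on the original and fails on the mutant, so the mutant is killed.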
Turning mutation testing into an AI feedback loop
The most interesting part of this experiment wasn’t the tool—it was the workflow.
I ended up using mutation testing as a loop for improving AI-generated tests:
- Run PIT
- Inspect surviving mutants
- Strengthen or add targeted tests
- Remove weak or redundant tests
- Re-run until confidence improves
This turns testing into something more dynamic:
not just generating tests, but continuously improving them based on real failure signals
It also works well with AI agents:
- AI generates initial tests
- PIT evaluates them
- AI iterates based on feedback
I have wrapped this loop up in an Agent skill.
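One way to feed PIT's results back to an agent is to pull the surviving mutants out of the XML report and hand them over as plain text. The sketch below is an assumption about how that step could look, not the skill itself. The element and attribute names (mutation, status, mutatedClass, lineNumber, description) reflect my understanding of PIT's XML output and should be checked against a real mutations.xml; the sample data is hand-written, not taken from the project.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SurvivingMutants {

    // Hand-written sample in the assumed shape of a PIT XML report.
    static final String SAMPLE =
        "<mutations>"
        + "<mutation detected=\"false\" status=\"SURVIVED\">"
        + "<mutatedClass>com.example.Weather</mutatedClass>"
        + "<lineNumber>12</lineNumber>"
        + "<description>changed conditional boundary</description>"
        + "</mutation>"
        + "<mutation detected=\"true\" status=\"KILLED\">"
        + "<mutatedClass>com.example.Weather</mutatedClass>"
        + "<lineNumber>20</lineNumber>"
        + "<description>negated conditional</description>"
        + "</mutation>"
        + "</mutations>";

    // Returns one line per surviving mutant: class, line number, description.
    static List<String> survivors(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList mutations = doc.getElementsByTagName("mutation");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < mutations.getLength(); i++) {
                Element m = (Element) mutations.item(i);
                if ("SURVIVED".equals(m.getAttribute("status"))) {
                    result.add(text(m, "mutatedClass") + ":" + text(m, "lineNumber")
                        + " " + text(m, "description"));
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse mutation report", e);
        }
    }

    // Text of the first child element with the given tag, or "" if absent.
    static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        // prints: com.example.Weather:12 changed conditional boundary
        survivors(SAMPLE).forEach(System.out::println);
    }
}
```

Each output line gives the agent a class, a line number, and the change that went undetected, which is usually enough context to ask for one targeted test per survivor.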
Why this matters
In a world where AI can generate tests instantly, the bottleneck is no longer writing tests.
It’s trusting them.
We need better signals than:
- “tests pass”
- “coverage is high”
Mutation testing gives us that signal.
It tells us whether our tests would actually catch broken behavior.
Closing thought
If you’re already using line coverage tools like JaCoCo, mutation testing is the next step up in confidence.
It shifts the question from:
“Do we have tests?”
to
“Would these tests catch a real bug?”
And in the age of AI-generated tests, that difference matters more than ever.


