Jonathan Harrison

Posted on Jun 12

High Coverage, Low Confidence: What Mutation Testing Reveals About AI-Generated Tests

#ai #java #testing #softwareengineering

AI coding agents can now generate code and tests in seconds.

That’s useful — but it also creates a problem we don’t talk about enough:

We can have high test coverage and passing tests … without actually having meaningful tests.

I recently ran into this at work on JavaScript projects where the code looked well tested, CI was green, and coverage was high.

Then I added mutation testing.

Everything changed.

What is mutation testing?

Mutation testing is a way of checking whether your tests actually detect bugs.

It works by introducing small, deliberate changes (called mutants) into your code, such as:

changing > to >=
flipping conditions
modifying return values
removing logic

Then it runs your test suite.

Two outcomes:

Tests fail → the mutant is killed
Tests pass → the mutant survives

If mutants survive, it means your tests didn’t catch a real change in behavior.

In short:

Mutation testing tells you whether your tests are actually doing their job.
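
To make the killed/survived distinction concrete, here is a minimal hand-rolled sketch in plain Java. It is not how PIT works internally (PIT generates its mutants automatically); the pricing rule, the mutant, and both "suites" are hypothetical names invented for illustration.

```java
import java.util.function.IntUnaryOperator;
import java.util.function.Predicate;

// Illustration only: a real mutation testing tool creates the mutants for you.
final class MutantDemo {

    // Original (hypothetical) logic: orders of 100 or more get 10 off.
    static int original(int total) {
        return total >= 100 ? total - 10 : total;
    }

    // Mutant: the discount logic has been removed.
    static int mutant(int total) {
        return total;
    }

    // Weak suite: executes the code but only checks the result is not negative.
    static boolean weakSuite(IntUnaryOperator price) {
        return price.applyAsInt(150) >= 0 && price.applyAsInt(50) >= 0;
    }

    // Strong suite: asserts the exact expected results.
    static boolean strongSuite(IntUnaryOperator price) {
        return price.applyAsInt(150) == 140 && price.applyAsInt(50) == 50;
    }

    // A mutant is killed when a suite that passes on the original fails on the mutant.
    static String verdict(Predicate<IntUnaryOperator> suite) {
        if (!suite.test(MutantDemo::original)) {
            throw new IllegalStateException("suite must pass on the original code");
        }
        return suite.test(MutantDemo::mutant) ? "SURVIVED" : "KILLED";
    }

    public static void main(String[] args) {
        System.out.println("weak suite:   " + verdict(MutantDemo::weakSuite));   // SURVIVED
        System.out.println("strong suite: " + verdict(MutantDemo::strongSuite)); // KILLED
    }
}
```

Both suites execute every line of `original`, so line coverage is identical. Only the strong suite notices that the discount disappeared.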

Why this matters in the AI agent era

AI tools are great at generating tests quickly.

But those tests often:

cover happy paths only
miss edge cases
contain weak assertions
verify implementation details rather than results

This leads to a dangerous mismatch:

High coverage ✔️
Passing tests ✔️
Real confidence ❌

Mutation testing helps close that gap.

It answers a better question:

“Would these tests actually catch a bug?”

Mutation testing in a real Java project (PIT + Gradle)

I integrated PIT into a Gradle-based Java project that already used JUnit and JaCoCo.

Setup is straightforward:

plugins {
    id 'info.solidsoft.pitest' version '1.19.0'
}

pitest {
    targetClasses = ['com.jonjam.accuweathermcp.*']
    mutators = ['STRONGER']
    junit5PluginVersion = '1.2.3'
    outputFormats = ['HTML', 'XML']
    mutationThreshold = 70
}

Run it with:

./gradlew pitest

What the results showed

On paper, the test suite already looked solid:

99% line coverage (366/368 lines)
96% mutation coverage (631/703 mutants killed)

But mutation testing tells a deeper story.

Coverage answers:

“Did we execute the code?”

Mutation testing answers:

“Would a test fail if the code was wrong?”

The surviving mutants highlighted cases where:

code was executed but not properly asserted
edge cases were missing
tests were too implementation-focused

That’s the key insight:

High coverage ≠ strong tests

A simple example: surviving mutants

Here’s a basic example of what PIT is catching:

// original
if (temperature > 30) {
    return "hot";
}

// mutant
if (temperature >= 30) {
    return "hot";
}

If your tests only check broad behavior like:

“above 30 is hot”
“below 30 is not hot”

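…then both versions pass, because they only behave differently when the temperature is exactly 30 and no test checks that value.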

That mutant survives.

And that’s the point: your tests didn’t actually detect a meaningful change.
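
One way to kill this mutant is to pin the boundary itself. The sketch below uses plain Java so it runs without any dependencies; in a real project these checks would be JUnit assertions. The class name, the helper methods, and the "not hot" fallback are assumptions, since the original snippet only shows the `if` branch.

```java
import java.util.function.IntFunction;

// Sketch only: compares broad checks with boundary checks on the same mutant.
final class TemperatureBoundaryDemo {

    // Original condition from the example.
    static String classify(int temperature) {
        if (temperature > 30) {
            return "hot";
        }
        return "not hot"; // assumed fallback, not shown in the original snippet
    }

    // Boundary mutant: > becomes >=.
    static String classifyMutant(int temperature) {
        if (temperature >= 30) {
            return "hot";
        }
        return "not hot";
    }

    // Broad checks: values well away from the boundary.
    static boolean broadChecks(IntFunction<String> f) {
        return f.apply(35).equals("hot") && f.apply(20).equals("not hot");
    }

    // Boundary checks: exactly 30 and its neighbour 31.
    static boolean boundaryChecks(IntFunction<String> f) {
        return f.apply(30).equals("not hot") && f.apply(31).equals("hot");
    }

    public static void main(String[] args) {
        // true: the mutant survives the broad checks
        System.out.println(broadChecks(TemperatureBoundaryDemo::classifyMutant));
        // false: the boundary checks kill the mutant
        System.out.println(boundaryChecks(TemperatureBoundaryDemo::classifyMutant));
    }
}
```

In JUnit 5 the decisive check would be a single line along the lines of `assertEquals("not hot", classify(30))`.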

Turning mutation testing into an AI feedback loop

The most interesting part of this experiment wasn’t the tool—it was the workflow.

I ended up using mutation testing as a loop for improving AI-generated tests:

Run PIT
Inspect surviving mutants
Strengthen or add targeted tests
Remove weak or redundant tests
Re-run until confidence improves

This turns testing into something more dynamic:

not just generating tests, but continuously improving them based on real failure signals

It also works well with AI agents:

AI generates initial tests
PIT evaluates them
AI iterates based on feedback

I wrapped this loop up in an agent skill.
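
For the "inspect surviving mutants" step, the XML report enabled by `outputFormats` can be turned into a short list to hand to an agent. This is a rough sketch, not part of the skill itself: it assumes each `<mutation>` element carries a `status` attribute and `sourceFile`, `lineNumber`, and `description` child elements, so check that against the report your PIT version produces. The class name and the sample XML are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch only: the report layout is an assumption, verify it against your own output.
final class SurvivingMutants {

    // Hypothetical two-mutant report used to exercise the parser.
    static final String SAMPLE =
        "<mutations>"
        + "<mutation detected='false' status='SURVIVED'>"
        + "<sourceFile>Weather.java</sourceFile><lineNumber>12</lineNumber>"
        + "<description>changed conditional boundary</description></mutation>"
        + "<mutation detected='true' status='KILLED'>"
        + "<sourceFile>Weather.java</sourceFile><lineNumber>20</lineNumber>"
        + "<description>negated conditional</description></mutation>"
        + "</mutations>";

    // Returns one "file:line description" entry per surviving mutant.
    static List<String> list(InputStream xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
            NodeList mutations = doc.getElementsByTagName("mutation");
            List<String> survivors = new ArrayList<>();
            for (int i = 0; i < mutations.getLength(); i++) {
                Element m = (Element) mutations.item(i);
                if ("SURVIVED".equals(m.getAttribute("status"))) {
                    survivors.add(text(m, "sourceFile") + ":" + text(m, "lineNumber")
                        + " " + text(m, "description"));
                }
            }
            return survivors;
        } catch (Exception e) {
            throw new IllegalStateException("could not read mutation report", e);
        }
    }

    // Text of the first child element with the given tag, or "" if it is missing.
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8));
        list(in).forEach(System.out::println); // Weather.java:12 changed conditional boundary
    }
}
```

In practice the input would be the report file PIT writes to its output directory rather than an inline string. The resulting list is compact enough to paste into a prompt asking for tests that target those lines.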

Why this matters

In a world where AI can generate tests instantly, the bottleneck is no longer writing tests.

It’s trusting them.

We need better signals than:

“tests pass”
“coverage is high”

Mutation testing gives us that signal.

It tells us whether our tests would actually catch broken behaviour.

Closing thought

If you’re already using line coverage tools like JaCoCo, mutation testing is the next step up in confidence.

It shifts the question from:

“Do we have tests?”

to:

“Would these tests catch a real bug?”

And in the age of AI-generated tests, that difference matters more than ever.
