DEV Community

Jonathan Harrison

High Coverage, Low Confidence: What Mutation Testing Reveals About AI-Generated Tests

AI coding agents can now generate code and tests in seconds.

That’s useful — but it also creates a problem we don’t talk about enough:

We can have high test coverage and passing tests … without actually having meaningful tests.

I recently ran into this at work on JavaScript projects where the code looked well-tested, CI was green, and coverage was high.

Then I added mutation testing.

Everything changed.

What is mutation testing?

Mutation testing is a way of checking whether your tests actually detect bugs.

It works by introducing small, deliberate changes (called mutants) into your code, such as:

  • changing > to >=
  • flipping conditions
  • modifying return values
  • removing logic

Then it runs your test suite.

Two outcomes:

  • Tests fail → the mutant is killed
  • Tests pass → the mutant survives

If mutants survive, it means your tests didn’t catch a real change in behavior.
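
To make the kill/survive idea concrete, here is a minimal hand-rolled sketch in plain Java. It is not how PIT works internally: the free-shipping rule, the four mutants, and the two-check "suite" are all hypothetical, written by hand purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntPredicate;
import java.util.function.Predicate;

class MutantLoopSketch {

    // Hypothetical code under test: orders above 50 get free shipping.
    static final IntPredicate ORIGINAL = total -> total > 50;

    // A "test suite" modelled as a predicate: true means every check passed.
    static final Predicate<IntPredicate> SUITE =
            rule -> rule.test(100) && !rule.test(10);

    // Hand-written mutants; a real tool generates these automatically.
    static Map<String, IntPredicate> mutants() {
        Map<String, IntPredicate> mutants = new LinkedHashMap<>();
        mutants.put("> changed to >=", total -> total >= 50);
        mutants.put("condition flipped", total -> total <= 50);
        mutants.put("always true", total -> true);
        mutants.put("always false", total -> false);
        return mutants;
    }

    // A mutant is killed when the suite fails against it.
    static long countKilled(Predicate<IntPredicate> suite) {
        return mutants().values().stream()
                .filter(mutant -> !suite.test(mutant))
                .count();
    }

    public static void main(String[] args) {
        if (!SUITE.test(ORIGINAL)) {
            throw new IllegalStateException("the suite must pass on unmodified code");
        }
        mutants().forEach((name, mutant) ->
                System.out.println((SUITE.test(mutant) ? "survived: " : "killed:   ") + name));
        System.out.println(countKilled(SUITE) + " of " + mutants().size() + " mutants killed");
    }
}
```

With these two checks, three of the four mutants are killed. The `>=` mutant survives because no check uses the value 50, the only input where it behaves differently from the original.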

In short:

Mutation testing tells you whether your tests are actually doing their job.


Why this matters in the AI agent era

AI tools are great at generating tests quickly.

But those tests often:

  • cover happy paths only
  • miss edge cases
  • contain weak assertions
  • verify implementation details rather than results

This leads to a dangerous mismatch:

  • High coverage ✔️
  • Passing tests ✔️
  • Real confidence ❌

Mutation testing helps close that gap.

It answers a better question:

“Would these tests actually catch a bug?”

Mutation testing in a real Java project (PIT + Gradle)

I integrated PIT into a Gradle-based Java project that already used JUnit and JaCoCo.

Setup is straightforward:

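plugins {
    // Gradle plugin that runs PIT
    id 'info.solidsoft.pitest' version '1.19.0'
}

pitest {
    targetClasses = ['com.jonjam.accuweathermcp.*']  // classes to mutate
    mutators = ['STRONGER']                          // PIT's STRONGER mutator group instead of the defaults
    junit5PluginVersion = '1.2.3'                    // version of the PIT JUnit 5 plugin to use
    outputFormats = ['HTML', 'XML']                  // HTML for reading, XML for tooling
    mutationThreshold = 70                           // fail the build below a 70% mutation score
}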

Run it with:

./gradlew pitest

What the results showed

On paper, the test suite already looked solid:

  • 99% line coverage (366/368 lines)
  • 90% mutation coverage (631/703 mutants killed)

But mutation testing tells a deeper story.

Coverage answers:

“Did we execute the code?”

Mutation testing answers:

“Would a test fail if the code was wrong?”

The surviving mutants highlighted cases where:

  • code was executed but not properly asserted
  • edge cases were missing
  • tests were too implementation-focused

That’s the key insight:

High coverage ≠ strong tests
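
The "executed but not properly asserted" case is easy to reproduce in miniature. The sketch below is plain Java rather than JUnit so it runs on its own. `windLabel`, its mutant, and both sets of checks are hypothetical and not taken from the project. The mutant replaces one return value with an empty string, which is one kind of change a mutation tool might make.

```java
import java.util.function.IntFunction;

class CoverageVersusAssertions {

    // Hypothetical code under test.
    static String windLabel(int kph) {
        if (kph > 40) {
            return "windy";
        }
        return "calm";
    }

    // Hand-written mutant: the first return value is replaced with an empty string.
    static String windLabelMutant(int kph) {
        if (kph > 40) {
            return "";
        }
        return "calm";
    }

    // Weak checks: execute every line but only assert that something came back.
    static boolean weakChecksPass(IntFunction<String> label) {
        return label.apply(60) != null && label.apply(10) != null;
    }

    // Stronger checks: same inputs, but assert on the actual result.
    static boolean strongChecksPass(IntFunction<String> label) {
        return "windy".equals(label.apply(60)) && "calm".equals(label.apply(10));
    }

    public static void main(String[] args) {
        System.out.println("weak checks pass on mutant:   "
                + weakChecksPass(CoverageVersusAssertions::windLabelMutant));
        System.out.println("strong checks pass on mutant: "
                + strongChecksPass(CoverageVersusAssertions::windLabelMutant));
    }
}
```

Both sets of checks run every line of `windLabel`, so a line coverage tool would score them the same. Only the stronger checks fail against the mutant.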

Mutation testing report for the Java project

Mutation testing report for a single Java class showing surviving mutants

A simple example: surviving mutants

Here’s a basic example of what PIT is catching:

// original
if (temperature > 30) {
    return "hot";
}

// mutant
if (temperature >= 30) {
    return "hot";
}

If your tests only check broad behavior like:

  • “above 30 is hot”
  • “below 30 is not hot”

…then both versions pass, because they only differ when temperature is exactly 30 and neither test uses that value.

That mutant survives.

And that’s the point: your tests didn’t actually detect a meaningful change.
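
Killing this mutant takes a check at the boundary itself. The sketch below uses plain Java so it runs without a test framework. In a real project these would be ordinary JUnit assertions. The "not hot" return value is an assumption, since the snippet above only shows the "hot" branch.

```java
import java.util.function.IntFunction;

class BoundaryChecks {

    // The original condition; "not hot" for the other branch is assumed.
    static String original(int temperature) {
        if (temperature > 30) {
            return "hot";
        }
        return "not hot";
    }

    // The mutant: > replaced with >=.
    static String mutant(int temperature) {
        if (temperature >= 30) {
            return "hot";
        }
        return "not hot";
    }

    // Broad checks: one value well above 30, one well below.
    static boolean broadChecksPass(IntFunction<String> classify) {
        return "hot".equals(classify.apply(35)) && "not hot".equals(classify.apply(20));
    }

    // Boundary checks: exactly 30 and the first value above it.
    static boolean boundaryChecksPass(IntFunction<String> classify) {
        return "not hot".equals(classify.apply(30)) && "hot".equals(classify.apply(31));
    }

    public static void main(String[] args) {
        System.out.println("broad checks pass on mutant:    " + broadChecksPass(BoundaryChecks::mutant));
        System.out.println("boundary checks pass on mutant: " + boundaryChecksPass(BoundaryChecks::mutant));
    }
}
```

The broad checks pass on both versions. The boundary checks pass on the original and fail on the mutant, which is exactly what "killed" means.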

Turning mutation testing into an AI feedback loop

The most interesting part of this experiment wasn’t the tool—it was the workflow.

I ended up using mutation testing as a loop for improving AI-generated tests:

  1. Run PIT
  2. Inspect surviving mutants
  3. Strengthen or add targeted tests
  4. Remove weak or redundant tests
  5. Re-run until confidence improves

This turns testing into something more dynamic:

not just generating tests, but continuously improving them based on real failure signals

It also works well with AI agents:

  • AI generates initial tests
  • PIT evaluates them
  • AI iterates based on feedback

I have wrapped the above up in an agent skill.
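
One way to feed step 2 of the loop to an agent is to boil the XML report down to a list of survivors. The sketch below is not the skill itself, just a rough illustration using the JDK's built-in XML parser. It assumes the report contains `mutation` elements with a `status` attribute and `mutatedClass`, `mutatedMethod`, `lineNumber` and `mutator` children. Check those names against your own report before relying on them.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SurvivorReport {

    // Returns one line per surviving mutant: "Class.method:line mutator".
    // Element and attribute names are assumptions about the report format.
    static List<String> survivors(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse mutation report", e);
        }
        NodeList mutations = doc.getElementsByTagName("mutation");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < mutations.getLength(); i++) {
            Element mutation = (Element) mutations.item(i);
            if (!"SURVIVED".equals(mutation.getAttribute("status"))) {
                continue; // only surviving mutants point at weak tests
            }
            result.add(text(mutation, "mutatedClass") + "." + text(mutation, "mutatedMethod")
                    + ":" + text(mutation, "lineNumber") + " " + text(mutation, "mutator"));
        }
        return result;
    }

    // Text of the first child element with the given tag, or "?" if it is missing.
    static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "?" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SurvivorReport.java <path to XML report>");
            return;
        }
        survivors(Files.readString(Path.of(args[0]))).forEach(System.out::println);
    }
}
```

The output is short enough to paste into a prompt, so the agent can target the exact class, method and line each surviving mutant points at instead of regenerating tests blindly.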

Why this matters

In a world where AI can generate tests instantly, the bottleneck is no longer writing tests.

It’s trusting them.

We need better signals than:

  • “tests pass”
  • “coverage is high”

Mutation testing gives us that signal.

It tells us whether our tests would actually catch broken behavior.

Closing thought

If you’re already using line coverage tools like JaCoCo, mutation testing is the next step up in confidence.

It shifts the question from:

“Do we have tests?”

to

“Would these tests catch a real bug?”

And in the age of AI-generated tests, that difference matters more than ever.
