I Went Looking for Better AI-Generated Tests and Ended Up Building Mutagen

krish-arya — Mon, 08 Jun 2026 00:13:42 +0000

Over the past few months I've been seeing more and more tools that generate tests with LLMs.

The demos are impressive.

Point a model at a codebase, wait a few seconds, and suddenly you have dozens of new tests.

But every time I saw one of those demos, I found myself wondering the same thing:

How do we know the generated tests are actually good?

Coverage helps, but only up to a point.

A test can execute a line of code without really validating its behavior.

And an AI-generated test can look perfectly reasonable while completely missing the bug you'd actually care about.
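
To make that gap concrete, here is a toy example. The function and test names are made up for illustration and are not taken from Mutagen. Both tests give the same line coverage, but only one of them would catch a wrong formula.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test."""
    return price - price * percent / 100


def test_runs_but_checks_nothing():
    # Executes every line of apply_discount, so coverage looks perfect,
    # yet this test would still pass if the formula were wrong.
    apply_discount(200.0, 10.0)


def test_checks_behavior():
    # Same coverage as above, but this fails if the arithmetic changes.
    assert apply_discount(200.0, 10.0) == 180.0
```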

That question turned into a weekend project.

I built Mutagen.

What is Mutagen?

At a high level, Mutagen generates tests using an LLM and then immediately starts breaking the code under test to see whether those tests notice.

The workflow looks roughly like this:

Clone or ingest a repository
Analyze the codebase and identify testing targets
Generate pytest tests with an LLM
Execute them in an isolated environment
Mutate the source code and run mutation testing
Keep only the tests that can detect the introduced defects

The mutation-testing step is the interesting part.

Instead of asking:

Did the test run?

Did coverage increase?

Mutagen asks:

If I intentionally break the code, does this test notice?

If the answer is no, the generated test gets discarded.
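
Here is a minimal sketch of that question in code. Mutagen itself uses mutmut for this step; the version below is a hand-rolled toy with a single mutation operator, and every name in it (`is_adult`, `FlipGtE`, `kills_mutant`) is hypothetical.

```python
import ast

# Hypothetical target function, kept as source so it can be mutated.
SOURCE = """
def is_adult(age):
    return age >= 18
"""


class FlipGtE(ast.NodeTransformer):
    """Toy mutation operator: rewrite `>=` as `>`."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return its is_adult function."""
    namespace = {}
    exec(compile(tree, "<target>", "exec"), namespace)
    return namespace["is_adult"]


def weak_test(fn):
    # Never probes the boundary value, so the mutant slips through.
    return fn(30) is True and fn(5) is False


def strong_test(fn):
    # Adds the boundary case that separates `>=` from `>`.
    return weak_test(fn) and fn(18) is True


def kills_mutant(test):
    original = load(ast.parse(SOURCE))
    mutated_tree = ast.fix_missing_locations(FlipGtE().visit(ast.parse(SOURCE)))
    mutant = load(mutated_tree)
    # A useful test passes on the original and fails on the mutant.
    return test(original) and not test(mutant)


print(kills_mutant(weak_test))    # False
print(kills_mutant(strong_test))  # True
```

Both tests pass on the real code and cover the same line. Only the second one would be kept by a mutation gate.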

The Unexpected Part

One thing I didn't expect while building it was how much of the challenge had nothing to do with LLMs.

A lot of the work ended up being around orchestration and reliability.

Runs needed to be resumable.

Targets needed to be processed in parallel.

Failures needed to be recoverable.

State transitions needed to be explicit rather than hidden inside control flow.
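
As a rough illustration of what "explicit" can mean here, the sketch below keeps the allowed transitions in a table instead of scattering them across if/else branches. The state names and the `advance` helper are assumptions made for this example, not Mutagen's actual state machines.

```python
from enum import Enum


class TargetState(Enum):
    # Hypothetical lifecycle of one test target.
    PENDING = "pending"
    GENERATED = "generated"
    EXECUTED = "executed"
    MUTATION_TESTED = "mutation_tested"
    KEPT = "kept"
    DISCARDED = "discarded"
    FAILED = "failed"


# Every legal move is listed here; anything else is rejected.
ALLOWED = {
    TargetState.PENDING: {TargetState.GENERATED, TargetState.FAILED},
    TargetState.GENERATED: {TargetState.EXECUTED, TargetState.FAILED},
    # Going back to GENERATED models a repair attempt after a failed run.
    TargetState.EXECUTED: {
        TargetState.MUTATION_TESTED,
        TargetState.GENERATED,
        TargetState.FAILED,
    },
    # Going back to GENERATED models a strengthen attempt after survivors.
    TargetState.MUTATION_TESTED: {
        TargetState.KEPT,
        TargetState.DISCARDED,
        TargetState.GENERATED,
    },
    TargetState.KEPT: set(),
    TargetState.DISCARDED: set(),
    TargetState.FAILED: set(),
}


def advance(current: TargetState, new: TargetState) -> TargetState:
    """Return the new state, or raise if the transition is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```

With a table like this, a bug that tries to skip a step fails loudly at the transition instead of silently corrupting a run.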

What started as:

"Generate a test and run mutmut"

eventually grew into:

Clean / Hexagonal Architecture
Multiple pluggable LLM providers
AST-based target selection
Mutation-gated test generation
SQLite-backed checkpointing
Parallel execution
Optional call-graph analysis
Optional retrieval over existing test suites
CI-friendly exit codes
A little over 11k lines of source code
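
For a sense of what SQLite-backed checkpointing can look like, here is a small sketch using Python's standard `sqlite3` module. The table layout, the state strings, and the helper names are assumptions for illustration; Mutagen's real schema may look nothing like this.

```python
import sqlite3


def open_checkpoint(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a checkpoint database with one row per target."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS targets ("
        "name TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn


def save_state(conn: sqlite3.Connection, name: str, state: str) -> None:
    """Record the latest state for a target, overwriting any earlier one."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO targets (name, state) VALUES (?, ?)",
            (name, state),
        )


def unfinished(conn: sqlite3.Connection) -> list:
    """Targets a resumed run still has to work on."""
    rows = conn.execute(
        "SELECT name FROM targets "
        "WHERE state NOT IN ('kept', 'discarded') ORDER BY name"
    )
    return [row[0] for row in rows]


conn = open_checkpoint()
save_state(conn, "pkg.parse", "generated")
save_state(conn, "pkg.render", "kept")
save_state(conn, "pkg.parse", "executed")
print(unfinished(conn))  # ['pkg.parse']
```

Pointing `open_checkpoint` at a file path instead of `:memory:` is what would let a crashed run pick up where it left off.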

The Part I'm Most Proud Of

The feature I'm happiest with is probably the feedback loop.

When mutations survive, Mutagen feeds those surviving mutants back to the model and asks it to strengthen the test against the exact blind spots it missed.

So instead of blindly generating tests, it iterates against evidence.

Not:

"Write another test."

But:

"You missed these specific defects. Try again."

That small change ended up making a huge difference.

What I Learned

The project is now open source, installable from PyPI, and fully documented.

What started as:

"Let's see if I can generate better tests."

somehow turned into thousands of lines of code, mutation testing pipelines, state machines, checkpointing systems, retrieval mechanisms, and significantly less sleep than I'd planned for the weekend.

And honestly, I'm glad it did.

Because the most interesting thing I learned wasn't about LLMs.

It was this:

Generating something is easy.
Proving it's actually useful is where the real engineering begins.

Mutagen is my attempt at tackling that problem, at least for tests.

Check It Out

GitHub

krish-arya / mutagen

Mutagen

LLM-assisted test generation, validated by mutation testing.

Mutagen ingests a Python repository, finds under-covered functions, generates pytest tests for them with an LLM, and keeps only the tests that actually kill mutants of the target. It is built on Clean Architecture with a strict dependency rule, two explicit state machines, full async I/O, SQLite-backed resume, and structured logging.

repo ──► ingest ──► select targets ──► generate tests ──► run ──► mutate ──► keep / discard ──► report
                                          ▲        │
                                          └── repair / strengthen loops ──┘

What it does

For each selected target the pipeline:

Generates a pytest module from the function's source, imports, and surrounding context — matching your project's existing test style.
Runs it in an isolated subprocess sandbox (timeout + resource limits + flakiness detection via a double-run).
Mutation-gates it with mutmut: if the tests don't kill enough mutants, the surviving mutants become feedback for a regeneration…
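
As a rough sketch of the sandbox step described above, the snippet below runs a command in a subprocess with a timeout and repeats it once to spot flaky results. It uses only the standard library, leaves out resource limits entirely, and the helper names are assumptions rather than Mutagen's API.

```python
import subprocess
import sys


def run_once(cmd, timeout_s=30.0):
    """Return True if the command exits with code 0 within the timeout."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # a hung test run counts as a failure
    return result.returncode == 0


def run_twice(cmd, timeout_s=30.0):
    """Run the same command twice and classify the outcome."""
    first = run_once(cmd, timeout_s)
    second = run_once(cmd, timeout_s)
    if first != second:
        return "flaky"  # same code, different verdicts
    return "passed" if first else "failed"


# Example call with a hypothetical path to a generated test module:
# run_twice([sys.executable, "-m", "pytest", "-q", "tests/test_generated.py"])
```

A test that flips between runs tells you nothing about a mutant, so it makes sense to filter it out before the mutation gate.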


Documentation

mutagen-web.vercel.app

Installation

pip install mutagen-ai

I'd genuinely love feedback from engineers, testers, and builders.

If you've worked with AI-generated tests before, I'm especially curious whether you've run into the same trust problem.

P.S. You can email me directly at krisharya2k5@gmail.com if you have any suggestions!

DEV Community: krish-arya