<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: krish-arya</title>
    <description>The latest articles on DEV Community by krish-arya (@krisharya).</description>
    <link>https://dev.to/krisharya</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1780472%2Fc9fc89bb-6769-4d4a-9e65-a5bb5aff618d.png</url>
      <title>DEV Community: krish-arya</title>
      <link>https://dev.to/krisharya</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/krisharya"/>
    <language>en</language>
    <item>
      <title>I Went Looking for Better AI-Generated Tests and Ended Up Building Mutagen</title>
      <dc:creator>krish-arya</dc:creator>
      <pubDate>Mon, 08 Jun 2026 00:13:42 +0000</pubDate>
      <link>https://dev.to/krisharya/i-went-looking-for-better-ai-generated-tests-and-ended-up-building-mutagen-3i93</link>
      <guid>https://dev.to/krisharya/i-went-looking-for-better-ai-generated-tests-and-ended-up-building-mutagen-3i93</guid>
      <description>&lt;p&gt;Over the past few months I've been seeing more and more tools that generate tests with LLMs.&lt;/p&gt;

&lt;p&gt;The demos are impressive.&lt;/p&gt;

&lt;p&gt;Point a model at a codebase, wait a few seconds, and suddenly you have dozens of new tests.&lt;/p&gt;

&lt;p&gt;But every time I saw one of those demos, I found myself wondering the same thing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How do we know the generated tests are actually good?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Coverage helps, but only up to a point.&lt;/p&gt;

&lt;p&gt;A test can execute a line of code without really validating its behavior.&lt;/p&gt;

&lt;p&gt;And an AI-generated test can look perfectly reasonable while completely missing the bug you'd actually care about.&lt;/p&gt;

&lt;p&gt;That question turned into a weekend project.&lt;/p&gt;

&lt;p&gt;I built &lt;strong&gt;Mutagen&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Mutagen?
&lt;/h2&gt;

&lt;p&gt;At a high level, Mutagen generates tests using an LLM and then immediately starts trying to break them.&lt;/p&gt;

&lt;p&gt;The workflow looks roughly like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clone or ingest a repository&lt;/li&gt;
&lt;li&gt;Analyze the codebase and identify testing targets&lt;/li&gt;
&lt;li&gt;Generate pytest tests with an LLM&lt;/li&gt;
&lt;li&gt;Execute them in an isolated environment&lt;/li&gt;
&lt;li&gt;Mutate the source code and run mutation testing&lt;/li&gt;
&lt;li&gt;Keep only the tests that can detect the introduced defects&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19m4gbyld3zxn7ngf6zd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19m4gbyld3zxn7ngf6zd.png" alt=" " width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The mutation-testing step is the interesting part.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Did the test run?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;or&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Did coverage increase?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Mutagen asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If I intentionally break the code, does this test notice?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the answer is &lt;strong&gt;no&lt;/strong&gt;, the generated test gets discarded.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Unexpected Part
&lt;/h2&gt;

&lt;p&gt;One thing I didn't expect while building it was how much of the challenge had nothing to do with LLMs.&lt;/p&gt;

&lt;p&gt;A lot of the work ended up being around orchestration and reliability.&lt;/p&gt;

&lt;p&gt;Runs needed to be resumable.&lt;/p&gt;

&lt;p&gt;Targets needed to be processed in parallel.&lt;/p&gt;

&lt;p&gt;Failures needed to be recoverable.&lt;/p&gt;

&lt;p&gt;State transitions needed to be explicit rather than hidden inside control flow.&lt;/p&gt;

&lt;p&gt;What started as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Generate a test and run mutmut"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;eventually grew into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean / Hexagonal Architecture&lt;/li&gt;
&lt;li&gt;Multiple pluggable LLM providers&lt;/li&gt;
&lt;li&gt;AST-based target selection&lt;/li&gt;
&lt;li&gt;Mutation-gated test generation&lt;/li&gt;
&lt;li&gt;SQLite-backed checkpointing&lt;/li&gt;
&lt;li&gt;Parallel execution&lt;/li&gt;
&lt;li&gt;Optional call-graph analysis&lt;/li&gt;
&lt;li&gt;Optional retrieval over existing test suites&lt;/li&gt;
&lt;li&gt;CI-friendly exit codes&lt;/li&gt;
&lt;li&gt;A little over 11k lines of source code&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Part I'm Most Proud Of
&lt;/h2&gt;

&lt;p&gt;The feature I'm happiest with is probably the feedback loop.&lt;/p&gt;

&lt;p&gt;When mutations survive, Mutagen feeds those surviving mutants back to the model and asks it to strengthen the test against the exact blind spots it missed.&lt;/p&gt;

&lt;p&gt;So instead of blindly generating tests, it iterates against evidence.&lt;/p&gt;

&lt;p&gt;Not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Write another test."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"You missed these specific defects. Try again."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That small change ended up making a huge difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The project is now open source, installable from PyPI, and fully documented.&lt;/p&gt;

&lt;p&gt;What started as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Let's see if I can generate better tests."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;somehow turned into thousands of lines of code, mutation testing pipelines, state machines, checkpointing systems, retrieval mechanisms, and significantly less sleep than I'd planned for the weekend.&lt;/p&gt;

&lt;p&gt;And honestly, I'm glad it did.&lt;/p&gt;

&lt;p&gt;Because the most interesting thing I learned wasn't about LLMs.&lt;/p&gt;

&lt;p&gt;It was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Generating something is easy.&lt;br&gt;
Proving it's actually useful is where the real engineering begins.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Mutagen is my attempt at tackling that problem, at least for tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check It Out
&lt;/h2&gt;

&lt;h3&gt;
  
  
  GitHub
&lt;/h3&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/krish-arya" rel="noopener noreferrer"&gt;
        krish-arya
      &lt;/a&gt; / &lt;a href="https://github.com/krish-arya/mutagen" rel="noopener noreferrer"&gt;
        mutagen
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Mutagen&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;LLM-assisted test generation, validated by mutation testing.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Mutagen ingests a Python repository, finds under-covered functions, generates
pytest tests for them with an LLM, and keeps only the tests that actually
&lt;strong&gt;kill mutants&lt;/strong&gt; of the target. It is built on Clean Architecture with a strict
dependency rule, two explicit state machines, full async I/O, SQLite-backed
resume, and structured logging.&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;repo ──► ingest ──► select targets ──► generate tests ──► run ──► mutate ──► keep / discard ──► report
                                          ▲        │
                                          └── repair / strengthen loops ──┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What it does&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;For each selected target the pipeline:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Generates&lt;/strong&gt; a pytest module from the function's source, imports, and
surrounding context — matching your project's existing test style.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runs&lt;/strong&gt; it in an isolated subprocess sandbox (timeout + resource limits
flakiness detection via a double-run).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mutation-gates&lt;/strong&gt; it with &lt;a href="https://github.com/boxed/mutmut" rel="noopener noreferrer"&gt;mutmut&lt;/a&gt;: if the
tests don't kill enough mutants, the surviving mutants become &lt;strong&gt;feedback&lt;/strong&gt;
for a regeneration…&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/krish-arya/mutagen" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Documentation
&lt;/h3&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://mutagen-web.vercel.app" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;mutagen-web.vercel.app&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;mutagen-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'd genuinely love feedback from engineers, testers, and builders.&lt;/p&gt;

&lt;p&gt;If you've worked with AI-generated tests before, I'm especially curious whether you've run into the same trust problem.&lt;/p&gt;

&lt;p&gt;ps:you can mail me personally at &lt;a href="mailto:krisharya2k5@gmail.com"&gt;krisharya2k5@gmail.com&lt;/a&gt; is you have any suggestions !!!&lt;/p&gt;

</description>
      <category>python</category>
      <category>testing</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
