<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Wes Nishio</title>
    <description>The latest articles on DEV Community by Wes Nishio (@wesnishio).</description>
    <link>https://dev.to/wesnishio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F706247%2F4b293b0b-8aeb-4548-870e-6d709797bfcf.png</url>
      <title>DEV Community: Wes Nishio</title>
      <link>https://dev.to/wesnishio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wesnishio"/>
    <language>en</language>
    <item>
      <title>One Web Fetch Ate 28% of Our PR Cost</title>
      <dc:creator>Wes Nishio</dc:creator>
      <pubDate>Fri, 10 Apr 2026 04:38:03 +0000</pubDate>
      <link>https://dev.to/gitautoai/one-web-fetch-ate-28-of-our-pr-cost-3kdc</link>
      <guid>https://dev.to/gitautoai/one-web-fetch-ate-28-of-our-pr-cost-3kdc</guid>
      <description>&lt;h1&gt;
  
  
  One Web Fetch Ate 28% of Our PR Cost
&lt;/h1&gt;

&lt;h2&gt;
  
  
  58K tokens for a yes/no question
&lt;/h2&gt;

&lt;p&gt;Our agent was investigating whether Jest 30 supports the &lt;code&gt;@jest-config&lt;/code&gt; docblock pragma. It called &lt;code&gt;fetch_url&lt;/code&gt; on the Jest configuration docs page. That single page converted to 58,348 tokens of markdown - navigation menus, sidebar links, configuration options for &lt;code&gt;testMatch&lt;/code&gt;, &lt;code&gt;moduleNameMapper&lt;/code&gt;, &lt;code&gt;transform&lt;/code&gt;, and dozens of other settings the agent didn't need. All 58K tokens went into Claude Opus 4.6's context window.&lt;/p&gt;

&lt;p&gt;The agent needed one fact. It got an encyclopedia.&lt;/p&gt;

&lt;p&gt;This wasn't a one-off. Every &lt;code&gt;fetch_url&lt;/code&gt; call inflated the next agent turn by 10K-58K tokens. Worse, in an agentic loop, those tokens compound: they stay in the conversation history and get re-sent on every subsequent API call. We checked production data - that single Jest docs fetch added ~32K tokens to each of the 22 remaining turns. The compounded cost of one web page ate &lt;strong&gt;28% of the entire PR's agent cost&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;

&lt;p&gt;We replaced &lt;code&gt;fetch_url&lt;/code&gt; with &lt;code&gt;web_fetch&lt;/code&gt;, adding Claude Haiku 4.5 as a summarization layer. The tool now takes a &lt;code&gt;prompt&lt;/code&gt; parameter describing what information to extract. After HTML-to-markdown conversion, Haiku reads the full page and returns only the relevant content. The main model receives a focused summary instead of the raw page.&lt;/p&gt;

&lt;p&gt;For that Jest docs page: instead of 58K tokens hitting Opus, Haiku reads them (at 5x cheaper input pricing), returns a ~200-token summary with the answer, and Opus processes only that summary. The token waste drops from 99%+ to near zero.&lt;/p&gt;
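&lt;p&gt;A minimal sketch of the idea, with the fetcher and the Haiku-backed summarizer injected as callables (the names, the 4-chars-per-token estimate, and the pass-through threshold are illustrative assumptions, not GitAuto's actual code):&lt;/p&gt;

```python
# Sketch of a summarizing fetch tool. `web_fetch`, the injected
# callables, and the token threshold are illustrative assumptions.
from typing import Callable

def web_fetch(url: str, prompt: str,
              fetch: Callable[[str], str],
              summarize: Callable[[str, str], str],
              max_raw_tokens: int = 2000) -> str:
    """Fetch a page; return it raw if small, else return only what a
    cheap model extracts for `prompt`."""
    markdown = fetch(url)                 # HTML-to-markdown conversion
    approx_tokens = len(markdown) // 4    # rough: ~4 chars per token
    if approx_tokens > max_raw_tokens:
        return summarize(prompt, markdown)  # cheap model (e.g. Haiku)
    return markdown
```

&lt;p&gt;Small pages pass through untouched; only large pages pay for a cheap-model call, and the expensive model never sees the raw page.&lt;/p&gt;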

&lt;p&gt;We also split the tool in two:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;web_fetch&lt;/code&gt;&lt;/strong&gt; - fetches HTML, summarizes with Haiku, returns the summary. For documentation and articles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;curl&lt;/code&gt;&lt;/strong&gt; - returns raw content with no processing. For JSON APIs and text files where exact content matters.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why the model can't solve this itself
&lt;/h2&gt;

&lt;p&gt;Opus is smart enough to ignore irrelevant content on a web page. But by the time it sees the content, you've already paid for the input tokens. The 58K tokens are in the context window whether Opus reads them carefully or skims past them. Asking a model to "focus on the relevant parts" doesn't reduce the input token count.&lt;/p&gt;

&lt;p&gt;The filtering has to happen before the tokens reach the expensive model. That's an application-layer decision - the model has no way to say "don't send me the tokens I'm about to ignore."&lt;/p&gt;

&lt;h2&gt;
  
  
  The broader pattern
&lt;/h2&gt;

&lt;p&gt;Any time your agent pipeline has a step that produces large output, feeds into an expensive model, and only needs a fraction of the content - insert a cheap model as a filter.&lt;/p&gt;

&lt;p&gt;In Python with pytest, verbose test output can be thousands of lines. In Java with Maven or Gradle, build logs run hundreds of KB. CI logs in any language are full of ANSI escape codes, download progress bars, and dependency resolution noise. All of this gets stuffed into the reasoning model's context window and stays there for every subsequent turn.&lt;/p&gt;
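&lt;p&gt;Even before adding a cheap model, some of this noise can be stripped deterministically. A sketch (the "Downloading"/"Progress" heuristics are illustrative, not a real CI log grammar):&lt;/p&gt;

```python
import re

# ANSI color/style sequences like "\x1b[31m".
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")

def strip_ci_noise(log: str) -> str:
    """Pre-filter a CI log before it reaches the expensive model.
    The 'Downloading'/'Progress' line heuristics are illustrative."""
    cleaned = ANSI_RE.sub("", log)
    kept = [line for line in cleaned.splitlines()
            if "Downloading" not in line
            and not line.strip().startswith("Progress")]
    return "\n".join(kept)
```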

&lt;p&gt;Model capability and model cost are different axes. Route high-volume, low-complexity work (summarize this page) to cheap models. Route low-volume, high-complexity work (reason about the summary) to expensive ones.&lt;/p&gt;

</description>
      <category>tokenoptimization</category>
      <category>costreduction</category>
      <category>modelrouting</category>
      <category>aiagents</category>
    </item>
    <item>
      <title>Why PR Bodies Should Tell the Story</title>
      <dc:creator>Wes Nishio</dc:creator>
      <pubDate>Fri, 10 Apr 2026 01:24:55 +0000</pubDate>
      <link>https://dev.to/gitautoai/why-pr-bodies-should-tell-the-story-1bdo</link>
      <guid>https://dev.to/gitautoai/why-pr-bodies-should-tell-the-story-1bdo</guid>
      <description>&lt;h1&gt;
  
  
  Why PR Bodies Should Tell the Story
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;When an AI agent creates a pull request, the PR body typically contains the original issue description or schedule coverage info. After the agent finishes working - writing tests, fixing bugs, making trade-offs - none of that context appears in the PR body. The reviewer opens the PR and sees the original instructions, then has to piece together what actually happened from the diff and comments.&lt;/p&gt;

&lt;p&gt;This is backwards. The PR body should be the first thing a reviewer reads to understand what was done, what bugs were found, and what they need to verify.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we built
&lt;/h2&gt;

&lt;p&gt;After the agent completes its work, we now call Claude Sonnet 4.6 (via Anthropic's API) with the full context of what happened - the PR title, changed files with diffs, agent comments, and the agent's completion summary - and ask it to generate a structured summary. This gets appended to the PR body using HTML comment markers for idempotent upserts. Every call is recorded in our llm_requests table for cost tracking.&lt;/p&gt;
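&lt;p&gt;A sketch of assembling that context into a single prompt (the section names match what's described below; the exact wording and field layout are assumptions):&lt;/p&gt;

```python
def build_summary_prompt(pr_title: str, diffs: str,
                         comments: str, completion: str) -> str:
    """Assemble the context given to the summarizer model. Section
    names match the article; the exact wording is an assumption."""
    return (
        "Generate a structured PR summary with sections "
        "'What I Tested', 'Potential Bugs Found', and 'Non-Code Tasks'. "
        "Always include the bugs section; omit Non-Code Tasks if empty.\n\n"
        f"PR title: {pr_title}\n\n"
        f"Changed files and diffs:\n{diffs}\n\n"
        f"Agent comments:\n{comments}\n\n"
        f"Completion summary:\n{completion}"
    )
```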

&lt;p&gt;The generated section includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What I Tested - specific functions, behaviors, and edge cases covered, referencing actual code from the diff&lt;/li&gt;
&lt;li&gt;Potential Bugs Found - edge cases, untested paths, or workarounds the agent discovered. Our agent (Claude Opus 4.6) tries to break the code before users do, so it often finds issues that need reviewer attention. If a bug was found, the summary explains whether it was actually fixed or worked around.&lt;/li&gt;
&lt;li&gt;Non-Code Tasks - tasks outside the code review like env vars to set, migrations to run, or configs to update&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bugs section is always present - if none were found, it says so explicitly. Non-Code Tasks is omitted when not applicable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The implementation
&lt;/h2&gt;

&lt;p&gt;The core is a pure function &lt;code&gt;upsert_pr_body_section&lt;/code&gt; that uses regex to find HTML comment markers (&lt;code&gt;&amp;lt;!-- GITAUTO_UPDATE --&amp;gt;...&amp;lt;!-- /GITAUTO_UPDATE --&amp;gt;&lt;/code&gt;) in the PR body. If the section exists, it replaces the content. If not, it appends with a &lt;code&gt;---&lt;/code&gt; separator before the first agent section.&lt;/p&gt;
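&lt;p&gt;In Python, the upsert idea looks roughly like this (a sketch of the behavior described above; the real function may differ in details):&lt;/p&gt;

```python
import re

def upsert_pr_body_section(body: str, marker: str, content: str) -> str:
    """Idempotent upsert: replace the marked section if it exists,
    otherwise append it after a separator. Sketch of the idea only."""
    start = f"<!-- {marker} -->"
    end = f"<!-- /{marker} -->"
    section = f"{start}\n{content}\n{end}"
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end),
                         re.DOTALL)
    if pattern.search(body):
        # Lambda replacement avoids re interpreting backslashes in content.
        return pattern.sub(lambda _: section, body)
    return f"{body}\n\n---\n\n{section}"
```

&lt;p&gt;Running it twice with new content replaces the section instead of duplicating it, which is what keeps repeated agent runs from growing the PR body.&lt;/p&gt;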

&lt;p&gt;The trigger type (dashboard, schedule, check suite, review comment) determines both the marker name and the prompt used for generation. This mapping lives in &lt;code&gt;constants/triggers.py&lt;/code&gt; alongside the trigger type definitions, keeping the configuration centralized.&lt;/p&gt;
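&lt;p&gt;The mapping can be sketched as a plain dict (the marker names and prompts here are illustrative, not the actual contents of &lt;code&gt;constants/triggers.py&lt;/code&gt;):&lt;/p&gt;

```python
# Illustrative mapping; GitAuto's constants/triggers.py may differ.
TRIGGER_SECTIONS = {
    "dashboard": ("GITAUTO_DASHBOARD",
                  "Summarize the work done for this dashboard request."),
    "schedule": ("GITAUTO_SCHEDULE",
                 "Summarize the scheduled coverage work."),
    "check_suite": ("GITAUTO_CHECK_SUITE",
                    "Summarize how the failing check was fixed."),
    "review_comment": ("GITAUTO_REVIEW",
                       "Summarize the changes made in response to review."),
}

def section_config(trigger: str) -> tuple:
    """Return (marker name, generation prompt) for a trigger type."""
    return TRIGGER_SECTIONS[trigger]
```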

&lt;h2&gt;
  
  
  Why this matters for code review
&lt;/h2&gt;

&lt;p&gt;The hardest part of reviewing AI-generated PRs isn't reading the diff - it's understanding the intent. Why did the agent change this file? Did it find any issues? What should I look at carefully?&lt;/p&gt;

&lt;p&gt;By having the agent explain itself in the PR body, reviewers spend less time on archaeology and more time on actual review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key design decisions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude-generated, not template strings&lt;/strong&gt;: Early versions used hardcoded strings like "Fixed the failing CI check." This told reviewers nothing. Claude writes context-aware summaries because it has the agent's full completion reason.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotent upserts, not appends&lt;/strong&gt;: If the agent runs again on the same PR, the section is replaced, not duplicated. This keeps PR bodies clean.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Original body preserved&lt;/strong&gt;: The agent's sections are always appended after a separator. The original PR body (issue description, instructions) is never modified.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>prworkflow</category>
      <category>codereview</category>
      <category>aiagents</category>
      <category>devrel</category>
    </item>
    <item>
      <title>Why Retargeting a PR Explodes the Diff</title>
      <dc:creator>Wes Nishio</dc:creator>
      <pubDate>Wed, 08 Apr 2026 22:23:15 +0000</pubDate>
      <link>https://dev.to/gitautoai/why-retargeting-a-pr-explodes-the-diff-3n3i</link>
      <guid>https://dev.to/gitautoai/why-retargeting-a-pr-explodes-the-diff-3n3i</guid>
      <description>&lt;h1&gt;
  
  
  Why Retargeting a PR Explodes the Diff
&lt;/h1&gt;

&lt;p&gt;A reviewer asked us to change a PR's base from, say, &lt;code&gt;release/20260401&lt;/code&gt; to &lt;code&gt;release/20260501&lt;/code&gt;. Simple request. GitHub even has an API for it: &lt;code&gt;PATCH /repos/{owner}/{repo}/pulls/{number}&lt;/code&gt; with a new &lt;code&gt;base&lt;/code&gt; field.&lt;/p&gt;
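&lt;p&gt;The call itself is a one-field PATCH, sketched here as a request builder rather than a live HTTP client (the owner/repo values are placeholders):&lt;/p&gt;

```python
def retarget_pr_request(owner: str, repo: str, number: int,
                        new_base: str) -> tuple:
    """Build the GitHub REST call that changes a PR's base branch.
    This is metadata-only: it does not touch git history."""
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}"
    return url, {"base": new_base}
```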

&lt;p&gt;GitAuto called it. The base branch label changed. And the PR diff went from 5 files to 300+.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Went Wrong
&lt;/h2&gt;

&lt;p&gt;GitHub's "change base branch" is &lt;strong&gt;metadata-only&lt;/strong&gt;. It updates which branch the PR targets but does nothing to the git history. When that doesn't matter - say, retargeting from &lt;code&gt;main&lt;/code&gt; to &lt;code&gt;develop&lt;/code&gt; where &lt;code&gt;develop&lt;/code&gt; was forked from &lt;code&gt;main&lt;/code&gt; - the diff stays clean because the commit graph still makes sense.&lt;/p&gt;

&lt;p&gt;But the two release branches were &lt;strong&gt;siblings&lt;/strong&gt;. Both were cut from &lt;code&gt;main&lt;/code&gt; at different points in time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main ──────●──────────────●──────
           │              │
           ▼              ▼
     release/0401    release/0501
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the PR originally targeted &lt;code&gt;release/0401&lt;/code&gt;, Git computed the merge base between that branch and the PR head. The diff showed only the PR's actual changes. After the API call switched the target to &lt;code&gt;release/0501&lt;/code&gt;, Git recomputed the merge base - now between a completely different branch and the same PR head. Every file that differed between the two release branches appeared in the PR diff.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Human Would Do
&lt;/h2&gt;

&lt;p&gt;A developer would run &lt;code&gt;git rebase --onto release/0501 release/0401 pr-branch&lt;/code&gt;. This replays the PR's commits on top of the new base, and the diff goes back to showing only the actual changes.&lt;/p&gt;

&lt;p&gt;But rebase has two problems for automation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Merge conflicts can halt execution.&lt;/strong&gt; Rebase replays commits one by one. If any commit conflicts with the new base, git stops and waits for manual resolution. An automated system can't resolve conflicts interactively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shallow clones break rebase.&lt;/strong&gt; Many CI systems and automation tools clone with &lt;code&gt;--depth 1&lt;/code&gt; for speed. Rebase needs the full commit history to find the fork point and replay commits. With a shallow clone, it simply fails.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Fix for Automation
&lt;/h2&gt;

&lt;p&gt;Instead of replaying commits, save the end result and rewrite it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Save&lt;/strong&gt; the PR's actual file changes (contents from the current branch)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change&lt;/strong&gt; the base branch on GitHub (the metadata part)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reset&lt;/strong&gt; the local branch to the new base (&lt;code&gt;git fetch&lt;/code&gt; + &lt;code&gt;git reset&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rewrite&lt;/strong&gt; the saved files onto the new base&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Force push&lt;/strong&gt; to update the remote&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is deterministic and conflict-free. It doesn't matter how the files got to their current state - we just read the final contents, reset to the new base, and write them back. Works with any clone depth.&lt;/p&gt;
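&lt;p&gt;The steps above can be sketched as git invocations. Saving the PR's final file contents (step 1) and the GitHub metadata PATCH (step 2) happen before these commands; restoring the saved contents is marked inline. Illustrative, not GitAuto's exact code:&lt;/p&gt;

```python
def retarget_commands(new_base: str, pr_branch: str) -> list:
    """Reset-and-rewrite flow as git command lists, assuming a local
    checkout of the PR branch. Steps 1-2 happen before these commands."""
    return [
        ["git", "fetch", "origin", new_base],
        ["git", "checkout", pr_branch],
        ["git", "reset", "--hard", f"origin/{new_base}"],
        # ...write the saved file contents back into the worktree here...
        ["git", "add", "-A"],
        ["git", "commit", "-m", f"Retarget onto {new_base}"],
        ["git", "push", "--force", "origin", pr_branch],
    ]
```

&lt;p&gt;Because the flow only resets and rewrites final file contents, there is nothing to replay and nothing to conflict, regardless of clone depth.&lt;/p&gt;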

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;When an API says it changes something, verify it changes &lt;strong&gt;everything&lt;/strong&gt; that needs changing. GitHub's API truthfully changes the base branch - the metadata. But "retarget a PR" in a developer's mind means "make the diff show only my changes against the new base." That requires git-level surgery that no REST API currently provides.&lt;/p&gt;

&lt;p&gt;If you maintain release branches cut from the same trunk, be aware that retargeting PRs between them is not a one-API-call operation. The label moves instantly. The diff needs work.&lt;/p&gt;

&lt;p&gt;GitAuto handles this automatically. See our &lt;a href="https://gitauto.ai/docs/actions/sibling-branch-retarget?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;docs on sibling branch retarget&lt;/a&gt; for details.&lt;/p&gt;

</description>
      <category>git</category>
      <category>githubapi</category>
      <category>pullrequests</category>
      <category>releasebranches</category>
    </item>
    <item>
      <title>Our Agent Had the Checklist and Ignored It</title>
      <dc:creator>Wes Nishio</dc:creator>
      <pubDate>Wed, 08 Apr 2026 18:22:41 +0000</pubDate>
      <link>https://dev.to/gitautoai/our-agent-had-the-checklist-and-ignored-it-478g</link>
      <guid>https://dev.to/gitautoai/our-agent-had-the-checklist-and-ignored-it-478g</guid>
      <description>&lt;h1&gt;
  
  
  Our Agent Had the Checklist and Ignored It
&lt;/h1&gt;

&lt;p&gt;We run an &lt;a href="https://gitauto.ai/blog/what-100-percent-test-coverage-cant-measure?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;LLM-based quality gate&lt;/a&gt; that evaluates tests across 41 checks in 8 categories: business logic, adversarial inputs, security, error handling, and others. When the gate fails, the agent is told to improve the tests and try again.&lt;/p&gt;

&lt;p&gt;Last week our agent - Claude Opus 4.6 - burned all its iterations rewriting tests for a CLI tool that parses CSV files and writes to a database. The quality gate failed on three specific categories every single time: adversarial inputs, security, and error handling. The agent never once added a test for any of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Agent Did Instead
&lt;/h2&gt;

&lt;p&gt;The agent had the full 41-check quality checklist in its system prompt. It knew which categories exist. When told "quality gate failed," here's what it did across 9 commits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Changed &lt;code&gt;expect(spy).not.toHaveBeenCalledWith(msg)&lt;/code&gt; to &lt;code&gt;expect(spy.mock.calls.filter(...)).toHaveLength(0)&lt;/code&gt;. A "spy" in testing is a wrapper that records how a function was called - what arguments it received and how many times. Both assertions check the same thing: "this function was never called with this message." The agent just rewrote the syntax without changing what's being tested.&lt;/li&gt;
&lt;li&gt;Added a test combining all CLI flags together - useful, but not adversarial&lt;/li&gt;
&lt;li&gt;Added a test for path normalization (backslash replacement) - general coverage, not security&lt;/li&gt;
&lt;li&gt;Repeated similar cosmetic rewrites for the remaining commits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not a single null input test. Not a single injection test. Not a single error message test. The agent had the information. It just didn't use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Happens
&lt;/h2&gt;

&lt;p&gt;If you've worked with LLMs, you've seen this pattern. Given a vague directive ("improve quality") and a detailed reference (the checklist), the model takes the path of least resistance. Rewriting an existing assertion is easier than designing a new adversarial test from scratch. The model satisfies the surface instruction ("I improved the tests") without addressing the substance.&lt;/p&gt;

&lt;p&gt;This is the same behavior you see when asking an LLM to "review this code" - it often comments on formatting and naming instead of identifying logical bugs. The easy observations come first. The hard analysis gets skipped.&lt;/p&gt;

&lt;p&gt;A good engineer, given vague feedback from a reviewer, would either ask clarifying questions or self-review against the checklist they already have. Claude Opus 4.6 had both options available - it could have asked for clarification through its tools, or systematically walked through the 41 checks it had in its system prompt. Instead, it made a small tweak and hoped that would be enough. Then did it again. And again. Nine times.&lt;/p&gt;

&lt;p&gt;That's not what a capable engineer does. That's what a lazy one does - make a cosmetic change, submit, and hope the reviewer doesn't look too closely. It's a very human behavior, but not one we want a model to have learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Compounding Problem
&lt;/h2&gt;

&lt;p&gt;The quality gate returned a generic message: "Quality gate failed. Evaluate and improve test quality per coding standards." The agent knew the checklist existed but didn't know which 3 out of 41 checks actually failed. So it had to guess, and guessing led to cosmetic edits.&lt;/p&gt;

&lt;p&gt;But even with specific feedback, one of the three failures was a false positive. The gate flagged &lt;code&gt;path.resolve()&lt;/code&gt; as a command injection vector. It's not - it's a path normalization function. No amount of test-writing would satisfy that check.&lt;/p&gt;

&lt;p&gt;So the agent faced three problems simultaneously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It was lazy - it didn't systematically work through the checklist&lt;/li&gt;
&lt;li&gt;The feedback was generic - it didn't know which checks failed&lt;/li&gt;
&lt;li&gt;One check was wrong - a false positive that could never pass&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What We Changed
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Specific feedback&lt;/strong&gt;: The error message now includes the exact failing checks with reasons. Instead of "quality gate failed," the agent sees &lt;code&gt;adversarial.null_undefined_inputs: No tests for null CLI arguments&lt;/code&gt; and &lt;code&gt;security.command_injection: No tests for malicious input values&lt;/code&gt;. This removes the guessing.&lt;/p&gt;
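&lt;p&gt;A sketch of turning gate results into that message (the check IDs are the ones from the example above; the exact formatting is an assumption):&lt;/p&gt;

```python
def format_gate_feedback(failures: dict) -> str:
    """Render failing checks as the specific feedback the agent sees,
    instead of a generic 'quality gate failed'. Sketch only."""
    if not failures:
        return "Quality gate passed."
    lines = [f"{check}: {reason}"
             for check, reason in sorted(failures.items())]
    return "Quality gate failed on these checks:\n" + "\n".join(lines)
```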

&lt;p&gt;&lt;strong&gt;Same model for judging&lt;/strong&gt;: The quality gate was using a weaker model than the agent itself - a cheaper model evaluating a more capable model's work. Now both use the same model, which reduces false positives like the &lt;code&gt;path.resolve&lt;/code&gt; judgment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Escape hatch&lt;/strong&gt;: After 3 consecutive failures with no progress, accept the current quality and move on. Some checks may be false positives, and burning iterations on unfixable failures wastes compute. We get a Slack notification when this triggers so we can investigate.&lt;/p&gt;
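&lt;p&gt;The escape hatch can be sketched as a decision function (the threshold, return values, and "no progress means identical failing checks" definition are illustrative):&lt;/p&gt;

```python
def gate_decision(failed_checks: list, history: list, limit: int = 3) -> str:
    """Retry, pass, or bail out. `history` holds the failing-check
    lists from earlier attempts; `limit` identical failures in a row
    suggests a false positive we can never fix. Illustrative sketch."""
    if not failed_checks:
        return "pass"
    recent = history[-(limit - 1):]
    stuck = (len(recent) == limit - 1
             and all(attempt == failed_checks for attempt in recent))
    if stuck:
        return "accept_and_notify"  # e.g. fire a Slack alert, move on
    return "retry"
```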

&lt;h2&gt;
  
  
  The Lesson
&lt;/h2&gt;

&lt;p&gt;The model is fundamentally capable of writing null input tests, analyzing injection vectors, and designing error handling coverage. It does all of those in other contexts. The capability is there. But capability and behavior are different things - the model's native tendency toward path-of-least-resistance means it won't reliably use its full power without external pressure.&lt;/p&gt;

&lt;p&gt;This is why the fix has to be at the application layer. The checklist in the system prompt gives the model the knowledge. But knowledge alone doesn't produce diligence. Specific, targeted feedback ("these 3 checks failed") works better than comprehensive reference material ("here are all 41 checks") because it closes the gap between what the model &lt;em&gt;can&lt;/em&gt; do and what it &lt;em&gt;will&lt;/em&gt; do. It removes the opportunity to take shortcuts by making the exact problem inescapable.&lt;/p&gt;

&lt;p&gt;This is also why tools like GitAuto exist. The models are powerful enough to write high-quality tests, fix CI failures, and reason about security. But left to their own defaults, they take shortcuts. The application layer - verification gates, specific feedback loops, escape hatches, structured tool calls - is what turns raw model capability into reliable engineering output. The value isn't in the model. It's in making the model actually do the work.&lt;/p&gt;

&lt;h2&gt;
  
  
  We Need a Laziness Eval
&lt;/h2&gt;

&lt;p&gt;The industry benchmarks models on reasoning, coding, math, and knowledge. There are evals for shortcut resistance and multi-step reasoning. But none of them measure laziness - the gap between what a model &lt;em&gt;can&lt;/em&gt; do and what it &lt;em&gt;will&lt;/em&gt; do when not forced. This incident would pass every existing eval. Claude Opus 4.6 can write adversarial tests. It can analyze injection vectors. It can read a checklist and work through it systematically. It just didn't.&lt;/p&gt;

&lt;p&gt;A laziness eval would give the model a task, a reference checklist, and vague feedback ("this isn't good enough"), then measure whether it systematically addresses the checklist or makes cosmetic changes and resubmits. The score isn't whether the model &lt;em&gt;can&lt;/em&gt; solve the problem - it's whether it &lt;em&gt;chooses&lt;/em&gt; to do the hard work when the easy path is available.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>llmbehavior</category>
      <category>qualitygates</category>
      <category>testquality</category>
    </item>
    <item>
      <title>Zero Changes Passed Our Quality Gate</title>
      <dc:creator>Wes Nishio</dc:creator>
      <pubDate>Wed, 08 Apr 2026 02:36:41 +0000</pubDate>
      <link>https://dev.to/gitautoai/zero-changes-passed-our-quality-gate-3h43</link>
      <guid>https://dev.to/gitautoai/zero-changes-passed-our-quality-gate-3h43</guid>
      <description>&lt;h1&gt;
  
  
  Zero Changes Passed Our Quality Gate
&lt;/h1&gt;

&lt;p&gt;We have a pipeline that evaluates test quality beyond coverage. It scores files on &lt;a href="https://gitauto.ai/blog/what-100-percent-test-coverage-cant-measure?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;41 checks&lt;/a&gt; across categories like boundary testing, error handling, and security. When a file scores poorly, the system creates a PR and assigns an AI agent to improve the tests.&lt;/p&gt;

&lt;p&gt;Last week, the agent looked at a test file with 100% line coverage, said "nothing to improve," and closed the task with zero changes. Our verification gate passed it through. The tests were still weak.&lt;/p&gt;

&lt;p&gt;The agent wasn't being clever. Our gate had a gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Tests Actually Looked Like
&lt;/h2&gt;

&lt;p&gt;The test file covered a function that transforms data and returns an object. Every line was exercised. But the assertions only checked that a return value existed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBeDefined&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;not&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toBeNull&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function always returns an object - it can never return &lt;code&gt;undefined&lt;/code&gt; or &lt;code&gt;null&lt;/code&gt;. These assertions pass no matter what the function does. You could replace the entire implementation with &lt;code&gt;return {}&lt;/code&gt; and every test would still be green. They test nothing.&lt;/p&gt;
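&lt;p&gt;The same contrast in pytest terms, using a hypothetical transform function (the article's example is TypeScript; this is an illustrative Python equivalent, not the actual code under test):&lt;/p&gt;

```python
def transform(record: dict) -> dict:
    """Hypothetical function under test: normalizes a name field."""
    return {"id": record["id"], "name": record["name"].strip().title()}

def test_weak():
    result = transform({"id": 1, "name": "  ada lovelace "})
    assert result is not None   # passes even if transform returns {}

def test_meaningful():
    result = transform({"id": 1, "name": "  ada lovelace "})
    # Pins the actual behavior: replacing the body with `return {}`
    # makes this assertion fail.
    assert result == {"id": 1, "name": "Ada Lovelace"}
```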

&lt;h2&gt;
  
  
  The Gap in Our Gate
&lt;/h2&gt;

&lt;p&gt;Our verification step runs when the agent declares the task complete. It checks for lint errors, type errors, and test failures. If everything passes, the task is marked done.&lt;/p&gt;

&lt;p&gt;The agent made zero changes. Zero changes means zero PR files. Zero PR files means nothing to lint, nothing to type-check, nothing to test. Our verification pipeline had nothing to verify, so it passed. "Do nothing" was a valid exit path even when the system had already flagged the tests as weak.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix: Three Layers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prompt-level instructions&lt;/strong&gt;: We added explicit rules telling the agent that 100% coverage doesn't mean the tests are good. The agent's coding standards now include guidance on what useless assertions look like and why &lt;code&gt;toBeDefined()&lt;/code&gt; on a non-nullable return proves nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-change rejection&lt;/strong&gt;: When the agent completes a quality-focused PR with zero changes, we reject the first attempt - the scheduler already determined the tests were weak when it created the PR, so "no changes" contradicts that finding. But if the agent tries again and still makes no changes, we allow completion. Sometimes the tests are genuinely fine and the scheduler was wrong. No infinite loops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM-based evaluation after changes&lt;/strong&gt;: When the agent does make changes, we run the quality evaluation again after all other checks pass (lint, types, tests). This runs last to avoid wasting an LLM call when the agent will need to retry anyway due to syntax errors or test failures.&lt;/p&gt;
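&lt;p&gt;The three layers and their ordering, as a sketch (function names and return strings are illustrative):&lt;/p&gt;

```python
def verify(changed_files: list, attempt: int,
           run_lint, run_types, run_tests, run_llm_quality) -> str:
    """Ordering sketch: zero-change handling first, cheap deterministic
    checks next, the LLM quality evaluation last so a lint or test
    failure never spends an LLM call. Names are illustrative."""
    if not changed_files:
        # Reject the first zero-change completion; allow a repeat,
        # since the scheduler's weakness flag may itself be wrong.
        return "reject: zero changes" if attempt == 1 else "done"
    for name, check in (("lint", run_lint), ("types", run_types),
                        ("tests", run_tests)):
        ok, msg = check()
        if not ok:
            return f"retry: {name} failed ({msg})"
    ok, msg = run_llm_quality()
    return "done" if ok else f"retry: quality failed ({msg})"
```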

&lt;h2&gt;
  
  
  The Cost Problem
&lt;/h2&gt;

&lt;p&gt;The quality evaluation uses an LLM call. Running it costs money. If we run it early and lint fails, the agent fixes the lint error and calls verify again - triggering another LLM evaluation for nearly identical code. By running quality checks last, we only pay for the evaluation when everything else is already clean. One call per successful verification instead of one per attempt.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Broader Pattern
&lt;/h2&gt;

&lt;p&gt;This isn't specific to AI agents. Any automated pipeline with a "no changes needed" exit path has this gap. CI that only runs on changed files. Linters that skip untouched code. Review bots that auto-approve empty diffs.&lt;/p&gt;

&lt;p&gt;The fix is the same everywhere: if the system decided something needs work, don't let "no work done" count as completion. Track why the task was created and verify that the reason was addressed, not just that the pipeline didn't find new problems.&lt;/p&gt;

</description>
      <category>aiagents</category>
      <category>qualitygates</category>
      <category>testquality</category>
      <category>developertooling</category>
    </item>
    <item>
      <title>How We Reached 92% Coverage with GitAuto</title>
      <dc:creator>Wes Nishio</dc:creator>
      <pubDate>Mon, 06 Apr 2026 23:08:47 +0000</pubDate>
      <link>https://dev.to/gitautoai/how-we-reached-92-coverage-with-gitauto-1ll1</link>
      <guid>https://dev.to/gitautoai/how-we-reached-92-coverage-with-gitauto-1ll1</guid>
      <description>&lt;h1&gt;
  
  
  How We Reached 92% Test Coverage with GitAuto
&lt;/h1&gt;

&lt;p&gt;We decided to dogfood &lt;a href="https://gitauto.ai?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;GitAuto&lt;/a&gt; by using it on the GitAuto repository itself. The goal was simple: find out whether we could really achieve high test coverage in a real production codebase. After 3 months, we hit &lt;strong&gt;92% line coverage, 96% function coverage, and 85% branch coverage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's exactly how we did it and what we learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup: 5 Files Per Day, Every Day
&lt;/h2&gt;

&lt;p&gt;Our approach was straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enabled &lt;a href="https://gitauto.ai/dashboard/triggers?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;schedule trigger&lt;/a&gt;:&lt;/strong&gt; Set GitAuto to run automatically every day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5 files per day:&lt;/strong&gt; Configured to target 5 files each morning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weekends included:&lt;/strong&gt; Tests ran 7 days a week&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository size:&lt;/strong&gt; ~250 files total&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The math was simple: at 5 files per day, we'd need roughly 50 days (just under two months) to cover the entire codebase. In reality, it took closer to 3 months because we refined our approach along the way, experimented with different file counts, and occasionally restarted files when we improved the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Daily Routine
&lt;/h2&gt;

&lt;p&gt;Every morning, GitAuto would create 5 pull requests—one for each targeted file. Our review process evolved over time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initially:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check if tests were passing&lt;/li&gt;
&lt;li&gt;Review the test code in detail&lt;/li&gt;
&lt;li&gt;Verify the changes made sense&lt;/li&gt;
&lt;li&gt;Merge if everything looked good&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In the end:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most PRs were green out of the box&lt;/li&gt;
&lt;li&gt;Quick verification that only test files changed (or legitimate bug fixes)&lt;/li&gt;
&lt;li&gt;No code review—trusted the passing tests&lt;/li&gt;
&lt;li&gt;Merge and move on&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Coverage Growth Over Time
&lt;/h2&gt;

&lt;p&gt;We didn't track coverage history from day one, so our &lt;a href="https://gitauto.ai/dashboard/coverage-trends?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;coverage charts&lt;/a&gt; only show the latter half of the journey. The growth rate varies because we adjusted the volume based on what we were working on—when we found issues to fix in GitAuto itself, we ran fewer PRs; when things were stable, we ran up to 10 PRs per day.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwt0s062nosoidh4esb5n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwt0s062nosoidh4esb5n.png" alt="Coverage growth over time showing Statement Coverage reaching 92%, Function Coverage reaching 96%, and Branch Coverage reaching 85%" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How We Actually Develop
&lt;/h2&gt;

&lt;p&gt;Here's important context: we build GitAuto using Claude Code. When we write new features, we do write unit tests for critical parts we especially want to verify. But we don't obsess over coverage or spend significant time writing comprehensive test suites.&lt;/p&gt;

&lt;p&gt;The result? Most features ship with decent but incomplete test coverage. Not 100%, not close. And bugs still happen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is where GitAuto came in.&lt;/strong&gt; It filled the gaps we left, systematically adding tests to increase coverage on files we'd already moved on from.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Results: What 90%+ Coverage Actually Feels Like
&lt;/h2&gt;

&lt;p&gt;Now that we're consistently above 90% coverage with &lt;strong&gt;242 test files, 2,680 test cases running in 3 minutes (67ms per test)&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bugs feel rare.&lt;/strong&gt; We encounter far fewer unexpected issues in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Merges feel safe.&lt;/strong&gt; We have confidence that changes won't break existing functionality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regression testing is faster.&lt;/strong&gt; Automated tests catch issues that used to require manual verification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Development velocity increased.&lt;/strong&gt; Less time spent on manual testing and bug fixes means more time building features.&lt;/p&gt;

&lt;h2&gt;
  
  
  The One Downside
&lt;/h2&gt;

&lt;p&gt;There's one real cost we didn't anticipate: GitHub Actions minutes. Initially, the GitAuto repository ran on GitHub's free tier with no issues. But as coverage increased, so did the number of tests running on every PR.&lt;/p&gt;

&lt;p&gt;We eventually hit the free tier limits and had to upgrade. Now we also optimize by skipping test runs when there are no relevant changes (e.g., Python tests don't run when only documentation changes).&lt;/p&gt;

&lt;p&gt;It's a small price to pay for 90%+ coverage, but worth knowing upfront.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Achieving 92% line coverage, 96% function coverage, and 85% branch coverage wasn't the result of heroic manual effort. It came from:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enabling scheduled automation&lt;/li&gt;
&lt;li&gt;Reviewing and merging 5 PRs each morning&lt;/li&gt;
&lt;li&gt;Trusting the process over 3 months&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're skeptical that high coverage is achievable in real-world codebases, we were too. But the data doesn't lie: consistent, automated test generation works.&lt;/p&gt;

&lt;p&gt;Want to try the same approach on your repository? &lt;a href="https://github.com/apps/gitauto-ai/installations/new" rel="noopener noreferrer"&gt;Install GitAuto&lt;/a&gt; and &lt;a href="https://gitauto.ai/settings/triggers?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;enable the schedule trigger&lt;/a&gt;. Start with 3-5 files per day and let the coverage compound.&lt;/p&gt;

</description>
      <category>testcoverage</category>
      <category>testautomation</category>
      <category>gitauto</category>
      <category>cicd</category>
    </item>
    <item>
      <title>How We Finally Solved Test Discovery</title>
      <dc:creator>Wes Nishio</dc:creator>
      <pubDate>Wed, 01 Apr 2026 04:47:12 +0000</pubDate>
      <link>https://dev.to/gitautoai/how-we-finally-solved-test-discovery-3eji</link>
      <guid>https://dev.to/gitautoai/how-we-finally-solved-test-discovery-3eji</guid>
      <description>&lt;h1&gt;
  
  
  How We Finally Solved Test Discovery
&lt;/h1&gt;

&lt;p&gt;Yesterday I wrote about &lt;a href="https://gitauto.ai/blog/why-our-test-writing-agent-wasted-12-iterations-reading-files?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;why test file discovery is still unsolved&lt;/a&gt;. Three approaches (stem matching, content grepping, hybrid), each failing differently. The hybrid worked best but had a broken ranking function - flat scoring that gave &lt;code&gt;src/&lt;/code&gt; the same weight as &lt;code&gt;src/pages/checkout/&lt;/code&gt;. Today it's solved.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Flat Scoring
&lt;/h2&gt;

&lt;p&gt;The March 30 post ended with this bug: &lt;code&gt;+30&lt;/code&gt; points for any shared parent directory. One shared path component got the same bonus as three. With 3 synthetic inputs, other factors dominated. With 29 real file paths, unrelated test files ranked above relevant ones.&lt;/p&gt;

&lt;p&gt;The fix wasn't tweaking the constant. It was replacing the scoring model entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Five Tiers, Not Points
&lt;/h2&gt;

&lt;p&gt;Instead of adding up weighted scores, we rank by structural relationship. Higher tiers always win over lower ones, regardless of path depth or name similarity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 1 - Colocated tests.&lt;/strong&gt; Same directory, same stem with a test suffix. &lt;code&gt;Button.tsx&lt;/code&gt; and &lt;code&gt;Button.test.tsx&lt;/code&gt; side by side. This is the strongest signal possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 2 - Same-directory content match.&lt;/strong&gt; A test file in the same directory whose source code imports the implementation file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 3 - Path-based match.&lt;/strong&gt; The test file's path contains the implementation stem. &lt;code&gt;tests/test_client.py&lt;/code&gt; for &lt;code&gt;services/client.py&lt;/code&gt;. The classic mirror-tree convention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 4 - Content grep match.&lt;/strong&gt; A test file anywhere in the repo references the implementation file in its source code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tier 5 - Parent directory content match.&lt;/strong&gt; A test file in a parent directory that references the impl. Weakest signal, but still a real connection.&lt;/p&gt;

&lt;p&gt;The key insight: tiers are ordinal, not additive. A Tier 1 match always outranks a Tier 3 match. No combination of bonus points can promote a distant test above a colocated one.&lt;/p&gt;
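
&lt;p&gt;A minimal Python sketch of that ordinal comparison, with simplified tier conditions (the function names and exact checks here are illustrative, not GitAuto's actual implementation):&lt;/p&gt;

```python
import os


def find_tier(impl_path, test_path, test_source):
    # Simplified stand-ins for the five structural relationships
    impl_dir = os.path.dirname(impl_path)
    test_dir = os.path.dirname(test_path)
    stem = os.path.basename(impl_path).split(".")[0]
    test_name = os.path.basename(test_path)
    same_dir = impl_dir == test_dir
    references_impl = stem in test_source
    in_parent_dir = impl_dir.startswith(test_dir) and not same_dir

    if same_dir and test_name.startswith(stem + "."):
        return 1  # colocated: Button.test.tsx next to Button.tsx
    if same_dir and references_impl:
        return 2  # same-directory content match
    if stem in test_path:
        return 3  # path-based mirror-tree match
    if references_impl and not in_parent_dir:
        return 4  # content grep match anywhere in the repo
    if references_impl and in_parent_dir:
        return 5  # parent-directory content match
    return None   # no structural relationship found


def rank_candidates(impl_path, candidates):
    # candidates is a list of (test_path, test_source) pairs; sorting by
    # tier first makes the ranking ordinal, never additive
    scored = []
    for test_path, test_source in candidates:
        tier = find_tier(impl_path, test_path, test_source)
        if tier is not None:
            scored.append((tier, test_path))
    return [path for tier, path in sorted(scored)]
```

&lt;p&gt;Because the tier is compared before anything else, a Tier 1 hit can never be displaced by any pile of weaker matches.&lt;/p&gt;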

&lt;h2&gt;
  
  
  Content-Aware Matching
&lt;/h2&gt;

&lt;p&gt;Path matching alone can't handle barrel re-exports. When a test imports from &lt;code&gt;'@/pages/checkout'&lt;/code&gt; and that resolves to &lt;code&gt;index.tsx&lt;/code&gt;, the string "index" never appears in the import statement. Path matching sees nothing.&lt;/p&gt;

&lt;p&gt;Content-aware matching reads the test file and greps for references to the implementation. If a test file contains &lt;code&gt;import { CheckoutPage } from './index'&lt;/code&gt; or &lt;code&gt;require('./checkout')&lt;/code&gt;, the content grep catches it. Tiers 2, 4, and 5 are the content tiers that fill gaps path-only matching leaves open.&lt;/p&gt;
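
&lt;p&gt;The barrel case is why module resolution, not string matching, is the ground truth. A rough Python sketch of resolving such an import to a concrete file (the alias map and extension list are assumed project conventions, not a universal rule):&lt;/p&gt;

```python
import os


def resolve_import(import_path, alias_map, repo_files):
    # alias_map, e.g. {"@/": "src/"}, is an assumed project convention
    for alias, target in alias_map.items():
        if import_path.startswith(alias):
            import_path = target + import_path[len(alias):]
    # try a direct file match first, then the barrel (index.*) fallback
    for ext in (".ts", ".tsx", ".js", ".jsx"):
        candidate = import_path + ext
        if candidate in repo_files:
            return candidate
        barrel = os.path.join(import_path, "index" + ext)
        if barrel in repo_files:
            return barrel
    return None
```

&lt;p&gt;Here &lt;code&gt;'@/pages/checkout'&lt;/code&gt; resolves to &lt;code&gt;src/pages/checkout/index.tsx&lt;/code&gt; even though the string "index" never appears in the import.&lt;/p&gt;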

&lt;h2&gt;
  
  
  Single-Source Patterns
&lt;/h2&gt;

&lt;p&gt;Every language has its own test naming convention:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.test.ts&lt;/code&gt;, &lt;code&gt;.test.tsx&lt;/code&gt; - JavaScript/TypeScript (Jest, Vitest)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.spec.ts&lt;/code&gt;, &lt;code&gt;.spec.tsx&lt;/code&gt; - Angular, Cypress, Playwright&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;test_*.py&lt;/code&gt; - Python (pytest)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;*_test.go&lt;/code&gt; - Go&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;*Test.java&lt;/code&gt;, &lt;code&gt;*Test.kt&lt;/code&gt; - Java/Kotlin (JUnit)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;*_spec.rb&lt;/code&gt; - Ruby (RSpec)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;*.spec.js&lt;/code&gt; - JavaScript (Mocha, Jasmine)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these are defined once and imported everywhere. Before this change, three different functions each maintained their own pattern list - slightly different, each missing cases the others caught.&lt;/p&gt;
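
&lt;p&gt;The single-source idea can be sketched in a few lines of Python (the module layout is illustrative; the point is that every discovery function imports the same list):&lt;/p&gt;

```python
import fnmatch

# Defined once, imported everywhere, so no function's pattern list drifts
TEST_FILE_PATTERNS = [
    "*.test.ts", "*.test.tsx",   # Jest, Vitest
    "*.spec.ts", "*.spec.tsx",   # Angular, Cypress, Playwright
    "test_*.py",                 # pytest
    "*_test.go",                 # Go
    "*Test.java", "*Test.kt",    # JUnit
    "*_spec.rb",                 # RSpec
    "*.spec.js",                 # Mocha, Jasmine
]


def is_test_file(filename):
    # One predicate shared by path matching, content grep, and ranking
    return any(fnmatch.fnmatch(filename, p) for p in TEST_FILE_PATTERNS)
```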

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Test file discovery looks like a string matching problem. It's actually a ranking problem with structural priors. Flat scoring collapses structure into numbers and loses information. Tiered ranking preserves the structural relationship and makes the algorithm's priorities explicit and debuggable. And the only way to validate ranking is against real data at real scale - not 3 curated inputs that any algorithm can pass.&lt;/p&gt;

</description>
      <category>testdiscovery</category>
      <category>developertooling</category>
      <category>architecture</category>
      <category>solvedproblems</category>
    </item>
    <item>
      <title>What 100% Test Coverage Can't Measure</title>
      <dc:creator>Wes Nishio</dc:creator>
      <pubDate>Wed, 01 Apr 2026 04:47:11 +0000</pubDate>
      <link>https://dev.to/gitautoai/what-100-test-coverage-cant-measure-23i5</link>
      <guid>https://dev.to/gitautoai/what-100-test-coverage-cant-measure-23i5</guid>
      <description>&lt;h1&gt;
  
  
  What 100% Test Coverage Can't Measure
&lt;/h1&gt;

&lt;p&gt;Customers started asking us: "How do you evaluate test quality? What does your evaluation look like?" We had coverage numbers - line, branch, function - and we were driving files to 100%. But we didn't have a good answer for what happens after 100%. Coverage proves every line was exercised. It doesn't say whether the tests are actually good.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coverage Is the Foundation
&lt;/h2&gt;

&lt;p&gt;Coverage tells you which lines ran during testing. That's important. A file at 30% coverage has obvious blind spots. Driving it to 100% forces tests to exercise error branches, conditional paths, and edge cases that might otherwise be ignored. We treat coverage as the primary goal and spend most of our effort getting files there.&lt;/p&gt;

&lt;p&gt;But coverage measures execution, not verification. A test that renders a payment form, types a valid card number, and clicks submit can hit every line and every branch. It proves the happy path works. It doesn't tell you whether the form handles an expired card, a malformed CVV, or a network timeout mid-submission.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Eight Categories After 100%
&lt;/h2&gt;

&lt;p&gt;Once a file reaches 100%, there are categories of testing that coverage can't capture. We built a checklist of 41 checks across eight categories. Each check gets a pass, fail, or not-applicable result per file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Business Logic
&lt;/h3&gt;

&lt;p&gt;Does the test verify that domain rules produce correct results? A pricing function that calculates premiums needs tests for each tier boundary, not just one valid input. State transitions (pending → approved → active) need tests that verify invalid transitions are rejected. Calculation accuracy matters when rounding errors compound across thousands of transactions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adversarial
&lt;/h3&gt;

&lt;p&gt;What happens when inputs are hostile? Null values, empty strings, empty arrays, boundary values (0, -1, MAX_INT), type coercion traps (&lt;code&gt;"0" == false&lt;/code&gt;), oversized inputs, race conditions, and unicode special characters. A function can pass every line with valid inputs and still crash on &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;
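
&lt;p&gt;A minimal sketch of what a table-driven adversarial run looks like, using a &lt;code&gt;divide&lt;/code&gt; stand-in (the helper and the case table are illustrative, not our actual checklist code):&lt;/p&gt;

```python
def divide(a, b):
    # Stand-in under test with an explicit divide-by-zero guard
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b


# Hostile inputs that 100% line coverage never forces you to try
ADVERSARIAL_CASES = [
    (10, 0),        # boundary: zero divisor
    (0, 3),         # boundary: zero numerator
    (-1, 3),        # boundary: negative value
    (10 ** 9, 1),   # oversized input
    (None, 3),      # null value
    ("10", "2"),    # type coercion trap: strings, not numbers
]


def run_adversarial(fn, cases):
    # Record whether each hostile input is handled or explicitly rejected
    results = {}
    for args in cases:
        try:
            results[args] = ("ok", fn(*args))
        except (ValueError, TypeError) as exc:
            results[args] = ("rejected", str(exc))
    return results
```

&lt;p&gt;A crash outside the expected exception types is exactly the kind of failure this category exists to surface.&lt;/p&gt;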

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;Does the code defend against attack vectors? XSS payloads in user-generated content, SQL injection through unsanitized parameters, command injection via shell calls, CSRF on state-changing endpoints, authentication bypass, sensitive data exposure in logs or responses, open redirects, and path traversal (&lt;code&gt;../../etc/passwd&lt;/code&gt;). Security tests verify that malicious input is rejected, not just that valid input is accepted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance
&lt;/h3&gt;

&lt;p&gt;Will this code scale? Quadratic algorithms hide behind small test datasets. N+1 queries don't show up until production traffic hits. Heavy synchronous operations block the event loop. Large imports increase bundle size. Redundant computation wastes cycles on every request. Performance tests catch what functional tests miss because functional tests use small inputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory
&lt;/h3&gt;

&lt;p&gt;Does this code clean up after itself? Event listeners that aren't removed on unmount leak memory on every navigation. Subscriptions and timers that outlive their component accumulate silently. Circular references prevent garbage collection. Closures that capture large scopes retain memory longer than expected. These bugs don't crash - they degrade slowly until the tab or process dies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Error Handling
&lt;/h3&gt;

&lt;p&gt;What does the user see when things go wrong? Graceful degradation means a failed API call shows a retry option, not a blank screen. User-facing error messages should say what happened and what to do next, not expose a raw stack trace or a generic "Something went wrong."&lt;/p&gt;

&lt;h3&gt;
  
  
  Accessibility
&lt;/h3&gt;

&lt;p&gt;Can everyone use it? ARIA attributes tell screen readers what an element does. Keyboard navigation means every interactive element is reachable without a mouse. Focus management ensures modal dialogs trap focus correctly and return it when closed. These aren't nice-to-haves - they're requirements for users who rely on assistive technology.&lt;/p&gt;

&lt;h3&gt;
  
  
  SEO
&lt;/h3&gt;

&lt;p&gt;Is this page discoverable? Meta tags control how search engines and social platforms display the page. Semantic HTML (&lt;code&gt;&amp;lt;article&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;nav&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;main&amp;gt;&lt;/code&gt;) helps crawlers understand page structure. Heading hierarchy (&lt;code&gt;h1&lt;/code&gt; → &lt;code&gt;h2&lt;/code&gt; → &lt;code&gt;h3&lt;/code&gt;, no skipping) signals content relationships. Alt text on images provides context when images can't load or can't be seen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-File, Not Per-Repo
&lt;/h2&gt;

&lt;p&gt;We evaluate quality per file, not per repo. A repo-level score averages away the problems. Per-file evaluation means each source file and its test files are checked against all eight categories independently. Files that fail any check become candidates for test strengthening.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Built
&lt;/h2&gt;

&lt;p&gt;We shipped 41 checks across these eight categories. When a file hits 100% coverage, we automatically evaluate its tests against the full checklist. Each check returns pass, fail, or not-applicable. Files that fail any check get a PR to strengthen the tests. Coverage remains our primary goal - we still spend most effort getting files to 100%. But now we have a concrete answer when customers ask how we evaluate quality beyond coverage numbers. The checklist will evolve as we learn what matters most across different codebases and languages.&lt;/p&gt;
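
&lt;p&gt;The per-file result shape can be sketched as follows (check names and the aggregation rule here are illustrative, not the real 41-check list):&lt;/p&gt;

```python
def evaluate_file(results):
    # results maps a check name to "pass", "fail", or "na"
    failed = [name for name, r in results.items() if r == "fail"]
    return {
        # any single failed check makes the file a strengthening candidate
        "needs_strengthening": bool(failed),
        "failed_checks": failed,
        "applicable": sum(1 for r in results.values() if r != "na"),
    }
```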

&lt;p&gt;See the &lt;a href="https://gitauto.ai/docs/how-it-works/quality-verification/quality-checklist?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;full checklist with all 41 checks&lt;/a&gt; and how &lt;a href="https://gitauto.ai/docs/how-it-works/quality-verification/quality-check-scoring?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;change detection avoids redundant evaluation&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>testquality</category>
      <category>codecoverage</category>
      <category>developertooling</category>
      <category>testingstrategy</category>
    </item>
    <item>
      <title>Test File Discovery Is Still Unsolved</title>
      <dc:creator>Wes Nishio</dc:creator>
      <pubDate>Mon, 30 Mar 2026 23:04:22 +0000</pubDate>
      <link>https://dev.to/gitautoai/test-file-discovery-is-still-unsolved-445c</link>
      <guid>https://dev.to/gitautoai/test-file-discovery-is-still-unsolved-445c</guid>
      <description>&lt;h1&gt;
  
  
  Test File Discovery Is Still Unsolved
&lt;/h1&gt;

&lt;p&gt;Given a file like &lt;code&gt;src/pages/checkout/index.tsx&lt;/code&gt;, which test files should you look at? Sounds simple. It's not.&lt;/p&gt;

&lt;p&gt;We build an AI agent that writes tests. Before the agent starts, we need to find existing test files so it can match the project's testing patterns. We looked at the agent's logs for one real run: 34 iterations total, and 18 of them were spent just reading files - fetching imported modules, searching for type definitions, re-reading files it had already seen. The agent can read 2-3 files per iteration in parallel, but it still burned half its budget on discovery instead of writing tests.&lt;/p&gt;

&lt;p&gt;The agent can solve this on its own - it does search, read, and eventually find the right files. But each iteration costs tokens. We want to pre-load as much context as possible before the agent loop begins, doing deterministically what the agent would do heuristically. Same work, but programmatic, stable, and cheaper. The discovery algorithm is the hard part - especially when you're language-agnostic and can't rely on any single project's conventions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach 1: Stem Matching
&lt;/h2&gt;

&lt;p&gt;Extract the filename stem and search the tree. Say you have &lt;code&gt;src/auth/SessionProvider.tsx&lt;/code&gt; - the stem is &lt;code&gt;SessionProvider&lt;/code&gt;. Walk the file tree, find test files containing "SessionProvider" in their path. This works for most files.&lt;/p&gt;

&lt;p&gt;It fails for generic stems. A file like &lt;code&gt;src/pages/checkout/index.tsx&lt;/code&gt; has stem &lt;code&gt;index&lt;/code&gt;. Grepping for "index" across a codebase matches almost everything - 29 test files in one real repo. The signal drowns in noise.&lt;/p&gt;

&lt;p&gt;We considered falling back to the parent directory name for generic stems (&lt;code&gt;index&lt;/code&gt; -&amp;gt; &lt;code&gt;checkout&lt;/code&gt;). This helps for some cases, but "generic" is a judgment call. Is &lt;code&gt;utils&lt;/code&gt; generic? &lt;code&gt;config&lt;/code&gt;? &lt;code&gt;handler&lt;/code&gt;? Every heuristic creates a new edge case.&lt;/p&gt;
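
&lt;p&gt;Stem matching with the parent-directory fallback fits in a few lines of Python (the generic-stem set is exactly the judgment call in question):&lt;/p&gt;

```python
import os

# "Generic" is a judgment call; this set is illustrative, not definitive
GENERIC_STEMS = {"index", "main", "utils", "config", "handler"}


def stem_candidates(impl_path, all_paths):
    stem = os.path.basename(impl_path).split(".")[0]
    if stem in GENERIC_STEMS:
        # fall back to the parent directory name: index becomes checkout
        parent = os.path.basename(os.path.dirname(impl_path))
        stem = parent or stem
    return [p for p in all_paths if stem in p]
```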

&lt;h2&gt;
  
  
  Approach 2: Content Grepping
&lt;/h2&gt;

&lt;p&gt;Instead of matching paths, grep test file contents for the stem. If a test file imports &lt;code&gt;SessionProvider&lt;/code&gt;, it references that implementation. This catches tests in completely different directories - e.g. a test in &lt;code&gt;src/pages/checkout/&lt;/code&gt; might import &lt;code&gt;../../auth/SessionProvider&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But content grep has a different failure mode. Many JavaScript projects use barrel exports (&lt;code&gt;index.ts&lt;/code&gt; re-exporting everything). A test might import from &lt;code&gt;'@/pages/checkout'&lt;/code&gt; which resolves to &lt;code&gt;index.tsx&lt;/code&gt; at runtime, but the string "index" never appears in the import. The connection exists at the module resolution level, not the string level.&lt;/p&gt;

&lt;p&gt;PHP and Go hit the same problem in different forms. A PHP test file might reference &lt;code&gt;InvoiceService&lt;/code&gt; by class name without any file path in the import. A Go test lives in the same package directory and imports nothing explicitly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach 3: Hybrid (Current)
&lt;/h2&gt;

&lt;p&gt;We now combine both approaches. Path matching (walk the tree for test files whose path contains the stem) plus content grep (find test files that reference the stem in source code). Take the union. This catches both colocated tests and distant tests that import the file.&lt;/p&gt;

&lt;p&gt;The problem shifts from discovery to ranking. A real repo produces 29 test file hits for &lt;code&gt;index.tsx&lt;/code&gt; (from 51 raw grep matches). Five of them are highly relevant (in &lt;code&gt;src/pages/checkout/&lt;/code&gt; subtree). The other 24 are noise. Which 5 do we load into context?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Ranking Bug That Toy Tests Missed
&lt;/h2&gt;

&lt;p&gt;We scored each test file: +100 for name match, +50 for same directory, +30 for shared parent, -1 per distance. We wrote tests with 3 handcrafted files. They passed.&lt;/p&gt;

&lt;p&gt;Then we ran the ranker against 29 real file paths from a production repo. &lt;code&gt;src/index.test.tsx&lt;/code&gt; (the root app test, completely unrelated) ranked #2. &lt;code&gt;src/pages/checkout/components/PayButton/index.test.tsx&lt;/code&gt; (actually relevant) ranked #4.&lt;/p&gt;

&lt;p&gt;The bug: &lt;code&gt;+30&lt;/code&gt; was a flat bonus for any shared parent. One shared component (&lt;code&gt;src/&lt;/code&gt;) got the same +30 as three shared components (&lt;code&gt;src/pages/checkout/&lt;/code&gt;). With 3 synthetic inputs, other scoring factors dominated. With 29 real inputs at varying depths, the flat bonus broke everything.&lt;/p&gt;

&lt;p&gt;The fix was one line: change &lt;code&gt;+30&lt;/code&gt; to &lt;code&gt;common_len * 10&lt;/code&gt; so deeper shared paths score higher.&lt;/p&gt;
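
&lt;p&gt;In sketch form (helper names are ours; the scoring constants are the ones from the bug):&lt;/p&gt;

```python
def shared_components(a, b):
    # Count the leading path components the two files' directories share
    a_parts = a.split("/")[:-1]
    b_parts = b.split("/")[:-1]
    common = 0
    for x, y in zip(a_parts, b_parts):
        if x != y:
            break
        common += 1
    return common


def parent_bonus(impl_path, test_path):
    common_len = shared_components(impl_path, test_path)
    # was a flat +30 for any shared parent; now deeper overlap scores higher
    return common_len * 10
```

&lt;p&gt;With this change, &lt;code&gt;src/index.test.tsx&lt;/code&gt; earns one shared component's worth of bonus while a test under &lt;code&gt;src/pages/checkout/&lt;/code&gt; earns three.&lt;/p&gt;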

&lt;p&gt;This is the mutation testing principle. Imagine an "evil coder" who changes your constant: +30 to +0 or +1000. Do your tests fail? With 3 synthetic inputs, no. The tests pass regardless of the constant's value. That means they prove nothing about it. Only 29 real inputs exposed the flaw.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Remains Unsolved
&lt;/h2&gt;

&lt;p&gt;The fundamental issue is that the mapping between implementation files and test files is a convention, not a computable relationship. Every project invents its own rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Colocated&lt;/strong&gt;: &lt;code&gt;Button.tsx&lt;/code&gt; and &lt;code&gt;Button.test.tsx&lt;/code&gt; side by side&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mirror tree&lt;/strong&gt;: &lt;code&gt;src/auth/Provider.tsx&lt;/code&gt; tested by &lt;code&gt;tests/auth/Provider.test.tsx&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate dir with different naming&lt;/strong&gt;: &lt;code&gt;core/app/Services/Foo.php&lt;/code&gt; tested by &lt;code&gt;core/tests/Unit/Service/FooTest.php&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework magic&lt;/strong&gt;: Go tests in the same package, Python tests discovered by pytest markers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Barrel re-exports&lt;/strong&gt;: The actual file path never appears in any import statement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No single algorithm handles all of these. Path matching fails for different directory structures. Content grep fails for barrel exports and framework-level imports. Even the hybrid approach requires a ranking function, and that ranking function needs real data to validate - not 3 handcrafted inputs.&lt;/p&gt;

&lt;p&gt;If you're building developer tooling that needs to answer "which test covers this file?" - there's no clean answer. The best we've found is: try multiple discovery methods, take the union, rank aggressively, and validate with real repository data at real scale. And even then, you'll miss cases.&lt;/p&gt;

</description>
      <category>testdiscovery</category>
      <category>mutationtesting</category>
      <category>developertooling</category>
      <category>architecture</category>
    </item>
    <item>
      <title>39 Duplicate Jest Errors Cost Us $300</title>
      <dc:creator>Wes Nishio</dc:creator>
      <pubDate>Mon, 30 Mar 2026 01:38:16 +0000</pubDate>
      <link>https://dev.to/gitautoai/how-39-duplicate-jest-errors-burned-300-in-claude-api-costs-47cd</link>
      <guid>https://dev.to/gitautoai/how-39-duplicate-jest-errors-burned-300-in-claude-api-costs-47cd</guid>
      <description>&lt;h1&gt;
  
  
  39 Duplicate Jest Errors Cost Us $300
&lt;/h1&gt;

&lt;p&gt;I'd been going back and forth for days on whether to buy a $300 badminton racket. Comparing models, reading reviews, watching YouTube videos. $300 is $300 - you think about it.&lt;/p&gt;

&lt;p&gt;Then I woke up one morning, checked our Claude API usage dashboard, and found that a single PR had already burned $300 overnight while I was sleeping. The exact amount I'd been agonizing over for days, gone in a few hours of automated retries.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happened
&lt;/h2&gt;

&lt;p&gt;The repo was a React app with Jest tests. It had 39 test files that all imported a shared module. That module had a TypeError. When Jest runs, it executes every test file independently, so the same TypeError appeared 39 times in the CI log - once per file, with identical stack traces.&lt;/p&gt;

&lt;p&gt;Our log cleaning pipeline already:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stripped ANSI escape codes&lt;/li&gt;
&lt;li&gt;Removed node_modules from stack traces&lt;/li&gt;
&lt;li&gt;Extracted the "Summary of all failing tests" section&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it treated each of the 39 copies as a unique error. The cleaned log was still 390K characters. That's roughly 100K tokens embedded in the first message of every API call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Cost $300
&lt;/h2&gt;

&lt;p&gt;GitAuto's agent loop sends the CI log in &lt;code&gt;messages[0]&lt;/code&gt; so the model always has the error context. With 8 retry iterations, each carrying 240K input tokens (the log plus conversation history), the total input token count hit millions. At Claude Opus pricing, that's $300 for one PR that never even got fixed - the error was unfixable by the agent (a missing environment variable in CI).&lt;/p&gt;
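
&lt;p&gt;The back-of-envelope arithmetic (the split between log and history tokens is a rough figure, not measured to the token):&lt;/p&gt;

```python
LOG_TOKENS = 100_000        # the cleaned-but-duplicated CI log
HISTORY_TOKENS = 140_000    # rough figure for the rest of the conversation
ITERATIONS = 8

# The log rides along in messages[0] on every retry, so it is re-billed
# as input on each iteration
per_call_input = LOG_TOKENS + HISTORY_TOKENS
total_input = ITERATIONS * per_call_input  # 1,920,000 input tokens for one PR
```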

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;Three changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deduplication&lt;/strong&gt;: Group identical errors by their stack trace. Instead of showing the same TypeError 39 times, show it once with "39 tests failed with this same error." This reduced the 390K char log to under 10K.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File-based storage&lt;/strong&gt;: For logs that are still large after cleaning (over 50K chars), save the full log to &lt;code&gt;.gitauto/ci_error_log.txt&lt;/code&gt; in the cloned repo. Include a 5K char preview in the initial message. The agent can read or grep the full file on demand instead of carrying it in every API call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-model context windows&lt;/strong&gt;: Replace the hardcoded 200K context window with per-model values. Claude Opus 4.6 and Sonnet 4.6 support 1M tokens. Older models stay at 200K. This prevents unnecessary token trimming on newer models while keeping older models safe.&lt;/li&gt;
&lt;/ul&gt;
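
&lt;p&gt;The deduplication step can be sketched like this (parsing real Jest output is more involved; this just shows grouping by error content rather than source file):&lt;/p&gt;

```python
from collections import Counter


def dedupe_errors(errors):
    # errors is a list of (test_file, stack_trace) pairs; uniqueness is
    # decided by the error content, not the file it came from
    counts = Counter(trace for _, trace in errors)
    blocks = []
    for trace, n in counts.items():
        if n == 1:
            blocks.append(trace)
        else:
            blocks.append(f"{trace}\n({n} tests failed with this same error)")
    return "\n\n".join(blocks)
```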

&lt;h2&gt;
  
  
  Prevention
&lt;/h2&gt;

&lt;p&gt;The root pattern: cleaning pipelines that remove noise but don't deduplicate. ANSI codes and node_modules paths are noise - they add characters without information. But 39 identical errors aren't noise in the traditional sense. Each one is a valid error from a valid test file. The pipeline treated them as unique because they came from different files. The fix was recognizing that the error content, not the source file, determines uniqueness.&lt;/p&gt;

&lt;p&gt;For any system that feeds CI logs to an LLM, the question isn't just "how do I make this log smaller?" but "how many of these errors are actually saying the same thing?"&lt;/p&gt;

</description>
      <category>cilogs</category>
      <category>jest</category>
      <category>tokencosts</category>
      <category>claudeapi</category>
    </item>
    <item>
      <title>Vanilla Claude vs GitAuto Test Generation</title>
      <dc:creator>Wes Nishio</dc:creator>
      <pubDate>Sun, 29 Mar 2026 03:49:43 +0000</pubDate>
      <link>https://dev.to/gitautoai/vanilla-claude-vs-gitauto-test-generation-4d94</link>
      <guid>https://dev.to/gitautoai/vanilla-claude-vs-gitauto-test-generation-4d94</guid>
      <description>&lt;h1&gt;
  
  
  Vanilla Claude vs GitAuto: Test Generation Compared
&lt;/h1&gt;

&lt;p&gt;We ran an experiment. Take a &lt;a href="https://github.com/gitautoai/sample-calculator" rel="noopener noreferrer"&gt;simple Python calculator&lt;/a&gt; - 40 lines of code, four arithmetic operations, and a CLI main function. Give it to vanilla Claude with a generic prompt, then give the same file to GitAuto. Compare the results.&lt;/p&gt;

&lt;p&gt;Both use the same Claude Opus 4.6 model. The difference is in the &lt;strong&gt;system&lt;/strong&gt; around it - the prompts, the pipeline, and the adversarial testing approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Source Code
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;subtract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;divide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cannot divide by zero&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Simple Calculator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Operations: +, -, *, /&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter first number: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter operation (+, -, *, /): &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter second number: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;operations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;subtract&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;divide&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;operations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown operation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;operations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Vanilla Claude: "Write Tests for This"
&lt;/h2&gt;

&lt;p&gt;We pasted this into Claude Opus 4.6 with a generic prompt and asked it to write unit tests. It produced &lt;a href="https://github.com/gitautoai/sample-calculator/pull/11" rel="noopener noreferrer"&gt;19 tests&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5 tests for &lt;code&gt;add&lt;/code&gt; (positive, negative, mixed signs, floats with &lt;code&gt;pytest.approx&lt;/code&gt;, zeros)&lt;/li&gt;
&lt;li&gt;4 tests for &lt;code&gt;subtract&lt;/code&gt; (positive, negative result, negative numbers, floats)&lt;/li&gt;
&lt;li&gt;5 tests for &lt;code&gt;multiply&lt;/code&gt; (positive, by zero, negative, mixed signs, floats)&lt;/li&gt;
&lt;li&gt;5 tests for &lt;code&gt;divide&lt;/code&gt; (positive, float result, negative, mixed signs, divide by zero)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;19 well-written tests&lt;/strong&gt;. Clean structure, good use of &lt;code&gt;pytest.approx&lt;/code&gt; for floats, covers the happy paths and the one explicit error case. But notice what's missing: no &lt;code&gt;main()&lt;/code&gt; tests, no infinity, no duck typing, no type mismatches, no boundary values.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitAuto: 41 Tests
&lt;/h2&gt;

&lt;p&gt;GitAuto generated &lt;strong&gt;41 tests&lt;/strong&gt; for the same file (&lt;a href="https://github.com/gitautoai/sample-calculator/pull/10" rel="noopener noreferrer"&gt;PR #10&lt;/a&gt;). Both handle float precision correctly with &lt;code&gt;pytest.approx&lt;/code&gt; - that's table stakes. The difference is in the categories vanilla Claude skipped entirely:&lt;/p&gt;

&lt;h3&gt;
  
  
  Infinity and NaN
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_infinity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_inf_minus_inf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;float("inf")&lt;/code&gt; is a valid Python value. In 1982, the Vancouver Stock Exchange &lt;a href="https://gitauto.ai/blog/what-are-adversarial-tests?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;lost half its index value&lt;/a&gt; because nobody tested how repeated float operations accumulate. These tests verify behavior with values most developers never think to pass.&lt;/p&gt;

&lt;h3&gt;
  
  
  Duck Typing and Type Mismatches
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_string_concatenation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_type_mismatch_raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;two&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In December 2025, Cloudflare's Lua proxy &lt;a href="https://gitauto.ai/blog/what-are-adversarial-tests?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;went down for 25 minutes&lt;/a&gt; because a nil value appeared where an object was expected - a type error that a dynamic language only surfaces at runtime. These tests document what &lt;code&gt;add&lt;/code&gt; actually does with strings and mixed types, so you know before production does.&lt;/p&gt;

&lt;h3&gt;
  
  
  Division Boundaries and Main Function
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_very_small_divisor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;divide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1e-300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;approx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_invalid_first_number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_mock_print&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_mock_input&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dividing by &lt;code&gt;1e-300&lt;/code&gt; produces &lt;code&gt;1e300&lt;/code&gt; - a valid but astronomically large result. And vanilla Claude never tested &lt;code&gt;main()&lt;/code&gt; at all - no invalid inputs, no empty operators, no error paths. GitAuto generated 9 tests for &lt;code&gt;main()&lt;/code&gt; covering all branches.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Vanilla Claude&lt;/th&gt;
&lt;th&gt;GitAuto&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total tests&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;41&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Happy path tests&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge case tests&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Adversarial tests&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;main()&lt;/code&gt; function&lt;/td&gt;
&lt;td&gt;Not tested&lt;/td&gt;
&lt;td&gt;9 tests covering all branches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Float precision&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infinity/NaN&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duck typing&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type mismatch&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Fair Criticism
&lt;/h2&gt;

&lt;p&gt;Could you close this gap with a better prompt? Partially. Asking Claude to "test edge cases, type coercion, and boundary values" would get you closer. The gap isn't about a secret prompt - it's about doing this &lt;strong&gt;automatically across hundreds of files&lt;/strong&gt; without writing a prompt for each one. On a 14-repo codebase, we took statement coverage from 40% to 70% over 7 months using this approach. No developer wrote a single test prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Basic tests catch bugs you already thought about. Adversarial tests catch bugs you didn't - &lt;a href="https://gitauto.ai/blog/what-are-adversarial-tests?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;the kind that took down the Vancouver Stock Exchange, Bitcoin, and Cloudflare&lt;/a&gt;. The gap between 19 and 41 tests on a calculator becomes the gap between 40% and 70% coverage on a real codebase.&lt;/p&gt;

&lt;p&gt;Read more about &lt;a href="https://gitauto.ai/blog/what-are-adversarial-tests?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;what adversarial tests are&lt;/a&gt;, &lt;a href="https://gitauto.ai/blog/can-you-guess-what-tests-a-calculator-needs?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;try guessing what tests a calculator needs&lt;/a&gt;, or estimate the savings for your team with the &lt;a href="https://gitauto.ai/roi/calculator?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;ROI calculator&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>testgeneration</category>
      <category>claude</category>
      <category>aitesting</category>
      <category>adversarialtesting</category>
    </item>
    <item>
      <title>Can You Guess What Tests a Calculator Needs?</title>
      <dc:creator>Wes Nishio</dc:creator>
      <pubDate>Sun, 29 Mar 2026 03:46:45 +0000</pubDate>
      <link>https://dev.to/gitautoai/can-you-guess-what-tests-a-calculator-needs-2h91</link>
      <guid>https://dev.to/gitautoai/can-you-guess-what-tests-a-calculator-needs-2h91</guid>
      <description>&lt;h1&gt;
  
  
  Can You Guess What Tests a Calculator Needs?
&lt;/h1&gt;

&lt;p&gt;Here's a challenge. Below is a complete Python calculator - 40 lines, four operations, a CLI interface. Before scrolling down, think about what tests you'd write. How many test cases do you need for full coverage?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;subtract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;divide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cannot divide by zero&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Simple Calculator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Operations: +, -, *, /&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter first number: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter operation (+, -, *, /): &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter second number: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;operations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;subtract&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;divide&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;operations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown operation: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;operations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Got your number? Most developers say 10-15 tests. Something like: test each operation with positive numbers, test divide by zero, test invalid operator, test main with each operation. That covers the obvious cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitAuto Generated 41 Tests
&lt;/h2&gt;

&lt;p&gt;We pointed &lt;a href="https://gitauto.ai/?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;GitAuto&lt;/a&gt; at this file via our dashboard. It created a &lt;a href="https://github.com/gitautoai/sample-calculator/pull/10" rel="noopener noreferrer"&gt;PR with 41 tests organized into 5 test classes&lt;/a&gt;. Here's what you probably didn't think of.&lt;/p&gt;

&lt;h3&gt;
  
  
  Did You Test Float Precision?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;approx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;0.1 + 0.2&lt;/code&gt; is &lt;code&gt;0.30000000000000004&lt;/code&gt; in IEEE 754 floating point. A bare &lt;code&gt;==&lt;/code&gt; would fail. This is the most common numerical bug in production systems, and most developers forget to test for it because it works fine with integers.&lt;/p&gt;
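You can verify the claim directly in a REPL:

```python
# Bare equality fails on binary floats; a tolerance-based check passes.
print(0.1 + 0.2)                      # 0.30000000000000004
print(0.1 + 0.2 == 0.3)               # False
print(abs((0.1 + 0.2) - 0.3) < 1e-9)  # True
```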

&lt;h3&gt;
  
  
  Did You Test Infinity?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-inf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;float("inf")&lt;/code&gt; is a valid Python value. Your calculator doesn't reject it. So what happens when someone adds infinity to 1? What about infinity minus infinity? The answer is NaN (Not a Number), which propagates silently through every subsequent calculation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Did You Test Duck Typing?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello world&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ab&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ababab&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python's &lt;code&gt;+&lt;/code&gt; operator concatenates strings, and &lt;code&gt;*&lt;/code&gt; repeats a string by an integer. Your calculator doesn't check input types, so &lt;code&gt;add("hello", " world")&lt;/code&gt; returns &lt;code&gt;"hello world"&lt;/code&gt;. That's not a bug per se - it's Python's defined behavior. But if you don't test it, you won't know when it changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Did You Test Type Mismatches?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;two&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;int + str&lt;/code&gt; raises &lt;code&gt;TypeError&lt;/code&gt; in Python. No validation, no friendly error message - just a raw exception. Is that the behavior you want? Without a test, you don't know this is happening until a user hits it.&lt;/p&gt;
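If a raw `TypeError` isn't the behavior you want, explicit validation is one option. The `safe_add` wrapper below is hypothetical - it is not part of the calculator under test - but it shows what a friendlier contract could look like:

```python
# Hypothetical wrapper: validate types before delegating to +.
def safe_add(a, b):
    if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
        raise TypeError(
            f"add expects numbers, got {type(a).__name__} and {type(b).__name__}"
        )
    return a + b

try:
    safe_add(1, "two")
except TypeError as e:
    print(e)  # add expects numbers, got int and str
```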

&lt;h3&gt;
  
  
  Did You Test Division by 0.0?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;divide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The guard is &lt;code&gt;if b == 0&lt;/code&gt;. Does that catch &lt;code&gt;0.0&lt;/code&gt;? Yes, in Python &lt;code&gt;0.0 == 0&lt;/code&gt; is &lt;code&gt;True&lt;/code&gt;. But it's worth testing explicitly because other languages behave differently, and someone might change the guard to &lt;code&gt;if b is 0&lt;/code&gt; - an identity check that is &lt;code&gt;False&lt;/code&gt; for &lt;code&gt;0.0&lt;/code&gt; and triggers a SyntaxWarning in modern Python.&lt;/p&gt;
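The equality behavior is quick to confirm with the guard extracted into a standalone sketch:

```python
# The `b == 0` guard catches a float zero because 0.0 == 0 is True.
def divide(a, b):
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

print(0.0 == 0)  # True - int/float comparison is by numeric value

try:
    divide(5, 0.0)
except ValueError as e:
    print(e)  # Cannot divide by zero
```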

&lt;h3&gt;
  
  
  Did You Test a Very Small Divisor?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;divide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1e-300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;approx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;1e-300&lt;/code&gt; is not zero, so it passes the division guard. The result is &lt;code&gt;1e300&lt;/code&gt; - a valid but enormous number. In a financial system, this could mean a $1 transaction produces a $10^300 result. The test verifies the calculator doesn't raise an error, but it also documents this potentially dangerous behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  Did You Test Invalid Main Inputs?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Non-numeric input
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raises&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# input: "not_a_number", "+", "3"
&lt;/span&gt;
&lt;span class="c1"&gt;# Empty operator
&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# input: "5", "", "3"
&lt;/span&gt;&lt;span class="n"&gt;mock_print&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assert_any_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown operation: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What if the user types "abc" as a number? &lt;code&gt;float("abc")&lt;/code&gt; raises &lt;code&gt;ValueError&lt;/code&gt;, and there is no catch block - the program crashes. What about an empty string as the operator? It falls through to the "Unknown operation" branch. These are exactly the inputs real users will type.&lt;/p&gt;
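&lt;p&gt;Wiring those inputs up is straightforward with &lt;code&gt;unittest.mock&lt;/code&gt;. This sketch uses a simplified, hypothetical &lt;code&gt;main&lt;/code&gt; - the article's real one also handles &lt;code&gt;-&lt;/code&gt;, &lt;code&gt;*&lt;/code&gt;, and &lt;code&gt;/&lt;/code&gt;:&lt;/p&gt;

```python
from unittest.mock import patch

def main():
    # Simplified stand-in for the calculator's entry point
    a = float(input("First number: "))
    op = input("Operation: ")
    b = float(input("Second number: "))
    if op == "+":
        print(a + b)
    else:
        print(f"Unknown operation: {op}")

# Non-numeric input: float("not_a_number") raises ValueError immediately
with patch("builtins.input", side_effect=["not_a_number", "+", "3"]):
    try:
        main()
        raised = False
    except ValueError:
        raised = True
assert raised

# Empty operator: falls through to the "Unknown operation" branch
with patch("builtins.input", side_effect=["5", "", "3"]), \
     patch("builtins.print") as mock_print:
    main()
mock_print.assert_any_call("Unknown operation: ")
```

&lt;p&gt;Patching &lt;code&gt;builtins.input&lt;/code&gt; with &lt;code&gt;side_effect&lt;/code&gt; feeds each prompt the next value in the list, and &lt;code&gt;assert_any_call&lt;/code&gt; mirrors the &lt;code&gt;mock_print&lt;/code&gt; check in the snippet above.&lt;/p&gt;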

&lt;h2&gt;
  
  
  The Scorecard
&lt;/h2&gt;

&lt;p&gt;If you said 10-15 tests, you're in good company. Here's what the typical developer tests vs what GitAuto tests:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;What developers test&lt;/th&gt;
&lt;th&gt;What GitAuto adds&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Basic arithmetic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2+3=5, 10-4=6, 3*4=12, 10/2=5&lt;/td&gt;
&lt;td&gt;Negative numbers, mixed signs, zero, identity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Division errors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;divide(1,0) raises&lt;/td&gt;
&lt;td&gt;divide(0,0), divide(5,0.0), divide(1,1e-300)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Floating point&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rarely tested&lt;/td&gt;
&lt;td&gt;0.1+0.2 with approx, float division precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infinity/NaN&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rarely tested&lt;/td&gt;
&lt;td&gt;inf+1, inf+(-inf), inf/1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Duck typing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rarely tested&lt;/td&gt;
&lt;td&gt;String concat, string repeat, type mismatch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Main function&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One happy path&lt;/td&gt;
&lt;td&gt;All 4 ops, unknown op, empty op, invalid numbers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~10-15 tests&lt;/td&gt;
&lt;td&gt;41 tests&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Beyond a Calculator
&lt;/h2&gt;

&lt;p&gt;A 40-line calculator is a toy example. Does this pattern hold on real codebases?&lt;/p&gt;

&lt;p&gt;We ran &lt;a href="https://gitauto.ai/?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;GitAuto&lt;/a&gt; across a 14-repo insurance platform over 7 months. Statement coverage went from 40% to 70% - with the same adversarial approach: testing boundary values, type coercion, and untested code paths across hundreds of files. The gap between "obvious tests" and "thorough tests" compounds when you have API handlers, database queries, authentication logic, and business rules instead of &lt;code&gt;add(a, b)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Read more about &lt;a href="https://gitauto.ai/blog/what-are-adversarial-tests?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;what adversarial tests are and why they matter&lt;/a&gt;, &lt;a href="https://gitauto.ai/blog/vanilla-claude-vs-gitauto-test-generation?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;how this compares to generic AI test generation&lt;/a&gt;, or estimate the savings for your team with the &lt;a href="https://gitauto.ai/roi/calculator?utm_source=devto&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;ROI calculator&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>unittesting</category>
      <category>testcoverage</category>
      <category>adversarialtesting</category>
      <category>python</category>
    </item>
  </channel>
</rss>
