DEV Community: Wes Nishio

Your Prompt Is Not Enough. Add a Gate.

Wes Nishio — Tue, 28 Apr 2026 00:05:34 +0000

The problem

We had a customer running our AI agent to generate tests for their backend services. Resolvers, DB query functions, API wrappers. The agent produced hundreds of test files. Every single one was pure mocks.

Their repo had test DB helpers sitting right there. The agent never used them.

Our prompt already said: "If the repo has test DB infrastructure, use it for data-access methods instead of mocking." One line, buried in a list of other instructions.

Why the model ignored it

The rest of the prompt biased toward mocking. The <mocks> section had four rules about how to mock correctly. The <passthrough_methods> section offered an easy out: "If no test infrastructure exists, verify the mock was called with correct parameters." The model took the path of least resistance every time.

One instruction saying "do X" loses to ten instructions explaining how to do Y.

This is not a model failure. It is a prompt design failure. If your prompt has competing signals, the majority wins. The minority instruction might as well not exist.

The fix: two layers

Layer 1: Rebalance the prompt. We restructured the test standards around three explicit levels: solitary (mocked), sociable (real collaborators), and integration (real external services). The rule is simple: if you wrote a mock, you must also write a sociable or integration test to prove the real thing works. Mocking alone is not enough.

Layer 2: Add a programmatic gate. We added an "integration" category to our quality checklist with three checks: db_operations_use_real_test_db, api_calls_tested_end_to_end, and env_var_guards_for_secrets. The quality evaluation now programmatically flags test files that mock DB operations without any integration tests.

A prompt suggests. A gate enforces.

The pattern

This applies beyond testing. Any time you want an AI agent to reliably follow a specific behavior:

Make the instruction prominent, not buried
Remove competing signals that bias toward the opposite behavior
Add programmatic verification that checks the output

In traditional software, we do not rely on comments saying "do not forget to validate input." We write validation middleware. The same principle applies to AI agents: treat important behaviors as invariants, not suggestions.

Practical considerations

Integration tests that need secrets (DB connection strings, API keys) need env var guards so they skip when secrets are unavailable:

const mongoUri = process.env.MONGODB_URI;
(mongoUri ? describe : describe.skip)('integration', () => {
  // tests that need real DB
});

The agent generates the test, skips it locally (no secrets), pushes the code, and CI runs it with real secrets. For paid APIs (Stripe, Twilio), use unconditional describe.skip to avoid costs, but keep the tests as documentation.

Prompts tell the model what to do. Gates make sure it actually did it.

Ask a File, Don't Read the Whole Thing

Wes Nishio — Mon, 13 Apr 2026 22:50:56 +0000

The problem

A coding agent needs to know what test framework a project uses. The answer is four characters: "Jest". But to find that answer, the agent loads conftest.py or package.json — and now that entire file sits in the conversation context for every subsequent API call.

One file is cheap. But the agent does this dozens of times per run: reading a config to check a naming convention, reading a setup file to find fixtures, reading a CI config to learn the build tool. Each file stays in context for every turn after it enters. Input tokens are 95% of our Claude costs, and most of these reads exist only to extract a short answer.

We already let the agent drop files from context after reading them. But that still loads the full file for at least one turn. What if the agent never needed the full file in the first place?

The pattern

We already had this pattern for web pages. When the agent fetches a URL, GitAuto sends the HTML through Claude Haiku ($1 per million input tokens) to extract a summary, then only the summary enters the main Opus context. A 30,000-token web page becomes a 1,000-token summary.

The same pattern works for local files. Instead of loading a file into Opus context to answer a simple question, route it through Haiku first.

How it works

The agent provides a file path and a question (e.g., query_file("tests/conftest.py", "What test framework and fixtures are used?")):

GitAuto reads the file and sends it to Haiku with the question
Haiku returns a focused answer
Only the answer enters the Opus conversation

The full file never touches the expensive model's context.

Three modes of file access

The agent now picks the cheapest way to interact with a file:

Full read — load the entire file into context. For when the agent needs exact code to edit.
Query — route through Haiku, return only the answer. For when the agent needs to learn something about the file.
Forget — drop file content already in context. For when a full read happened but the content is no longer needed.

What we expect

The agent should naturally prefer querying for pattern learning: test conventions, framework detection, project structure questions. It should only do a full read when it needs exact lines for editing.

Whether the agent actually makes this distinction is the experiment. The tool descriptions guide it, but real usage will tell us if the guidance is sufficient or if we need to adjust.

Let Your AI Agent Forget on Purpose

Wes Nishio — Sun, 12 Apr 2026 22:40:38 +0000

The problem

When an AI agent runs a multi-step coding task, it reads files early to learn patterns: test naming conventions, project structure, framework idioms. These reference files go into the conversation context and stay there forever.

Every subsequent API call sends the full message history. If the agent reads three reference files totaling 15,000 characters on turn 5, those 15,000 characters ship with every call from turn 6 through turn 50. That is 675,000 characters of redundant input across those 45 turns.

Input tokens are roughly 95% of our Claude costs. At Opus 4.6 pricing ($5 per million input tokens, $25 output), accumulated stale content adds up fast. We noticed reference files persisting across 30+ turns in production runs, contributing meaningfully to our daily invoices.

The usual approach

Most agent frameworks handle context management from the outside. The orchestrator decides when to truncate, summarize, or compact the conversation. LangChain has buffer windows. OpenAI's ChatGPT does automatic compaction. The agent itself has no say.

This works for generic chatbots. But coding agents have a different pattern: they read specific files for specific reasons, extract what they need, and move on. The orchestrator does not know when the agent is "done" with a file. Only the agent knows.

The tool

We added a forget_messages tool to our agent's toolkit. It takes a list of file paths and replaces their content in the conversation history with a short placeholder like ['src/ref.py' content removed because agent already extracted needed patterns].

The agent calls it when it decides it has extracted the patterns it needs. The content is gone from context, but the placeholder reminds the agent the file existed. If it needs the file again, it can re-read it with get_local_file_content.

The economics

Say the agent forgets 15,000 characters on turn 5 of a 50-turn run:

Cost of forgetting: near zero (one tool call, a few placeholder tokens)
Savings per subsequent turn: 15,000 fewer input characters
Total savings: 15,000 x 45 turns = 675,000 characters

The breakeven is immediate. The only risk is forgetting too early, but the agent can always re-read.

What we expect

Giving the agent control over its own context feels unusual but logical. The agent is the one reading files and deciding what to do with them. It already manages file edits, directory creation, and git operations. Managing its own memory is a natural extension.

We shipped this as an experiment. If it works, we will expand it beyond file content to other large tool results.

The broader lesson: if your agent accumulates stale content over long runs, consider giving it a tool to clean up after itself rather than building increasingly complex orchestrator logic.

One Web Fetch Ate 28% of Our PR Cost

Wes Nishio — Fri, 10 Apr 2026 04:38:03 +0000

One Web Fetch Ate 28% of Our PR Cost

58K tokens for a yes/no question

Our agent was investigating whether Jest 30 supports the @jest-config docblock pragma. It called fetch_url on the Jest configuration docs page. That single page converted to 58,348 tokens of markdown - navigation menus, sidebar links, configuration options for testMatch, moduleNameMapper, transform, and dozens of other settings the agent didn't need. All 58K tokens went into Claude Opus 4.6's context window.

The agent needed one fact. It got an encyclopedia.

This wasn't a one-off. Every fetch_url call inflated the next agent turn by 10K-58K tokens. Worse, in an agentic loop, those tokens compound: they stay in the conversation history and get re-sent on every subsequent API call. We checked production data - that single Jest docs fetch added ~32K tokens to each of the 22 remaining turns. The compounded cost of one web page ate 28% of the entire PR's agent cost.

The fix

We replaced fetch_url with web_fetch, adding Claude Haiku 4.5 as a summarization layer. The tool now takes a prompt parameter describing what information to extract. After HTML-to-markdown conversion, Haiku reads the full page and returns only the relevant content. The main model receives a focused summary instead of the raw page.

For that Jest docs page: instead of 58K tokens hitting Opus, Haiku reads them (at 5x cheaper input pricing), returns a ~200-token summary with the answer, and Opus processes only that summary. The token waste drops from 99%+ to near zero.

We also split the tool in two:

web_fetch - fetches HTML, summarizes with Haiku, returns the summary. For documentation and articles.
curl - returns raw content with no processing. For JSON APIs and text files where exact content matters.

Why the model can't solve this itself

Opus is smart enough to ignore irrelevant content on a web page. But by the time it sees the content, you've already paid for the input tokens. The 58K tokens are in the context window whether Opus reads them carefully or skims past them. Asking a model to "focus on the relevant parts" doesn't reduce the input token count.

The filtering has to happen before the tokens reach the expensive model. That's an application-layer decision - the model has no way to say "don't send me the tokens I'm about to ignore."

The broader pattern

Any time your agent pipeline has a step that produces large output, feeds into an expensive model, and only needs a fraction of the content - insert a cheap model as a filter.

In Python with pytest, verbose test output can be thousands of lines. In Java with Maven or Gradle, build logs run hundreds of KB. CI logs in any language are full of ANSI escape codes, download progress bars, and dependency resolution noise. All of this gets stuffed into the reasoning model's context window and stays there for every subsequent turn.

Model capability and model cost are different axes. Route high-volume, low-complexity work (summarize this page) to cheap models. Route low-volume, high-complexity work (reason about the summary) to expensive ones.

Why PR Bodies Should Tell the Story

Wes Nishio — Fri, 10 Apr 2026 01:24:55 +0000

Why PR Bodies Should Tell the Story

The problem

When an AI agent creates a pull request, the PR body typically contains the original issue description or schedule coverage info. After the agent finishes working - writing tests, fixing bugs, making trade-offs - none of that context appears in the PR body. The reviewer opens the PR and sees the original instructions, then has to piece together what actually happened from the diff and comments.

This is backwards. The PR body should be the first thing a reviewer reads to understand what was done, what bugs were found, and what they need to verify.

What we built

After the agent completes its work, we now call Claude Sonnet 4.6 (via Anthropic's API) with the full context of what happened - the PR title, changed files with diffs, agent comments, and the agent's completion summary - and ask it to generate a structured summary. This gets appended to the PR body using HTML comment markers for idempotent upserts. Every call is recorded in our llm_requests table for cost tracking.

The generated section includes:

What I Tested - specific functions, behaviors, and edge cases covered, referencing actual code from the diff
Potential Bugs Found - edge cases, untested paths, or workarounds the agent discovered. Our agent (Claude Opus 4.6) tries to break the code before users do, so it often finds issues that need reviewer attention. If a bug was found, the summary explains whether it was actually fixed or worked around.
Non-Code Tasks - tasks outside the code review like env vars to set, migrations to run, or configs to update

The bugs section is always present - if none were found, it says so explicitly. Non-Code Tasks is omitted when not applicable.

The implementation

The core is a pure function upsert_pr_body_section that uses regex to find HTML comment markers (...) in the PR body. If the section exists, it replaces the content. If not, it appends with a --- separator before the first agent section.

The trigger type (dashboard, schedule, check suite, review comment) determines both the marker name and the prompt used for generation. This mapping lives in constants/triggers.py alongside the trigger type definitions, keeping the configuration centralized.

Why this matters for code review

The hardest part of reviewing AI-generated PRs isn't reading the diff - it's understanding the intent. Why did the agent change this file? Did it find any issues? What should I look at carefully?

By having the agent explain itself in the PR body, reviewers spend less time on archaeology and more time on actual review.

Key design decisions

Claude-generated, not template strings: Early versions used hardcoded strings like "Fixed the failing CI check." This told reviewers nothing. Claude writes context-aware summaries because it has the agent's full completion reason.
Idempotent upserts, not appends: If the agent runs again on the same PR, the section is replaced, not duplicated. This keeps PR bodies clean.
Original body preserved: The agent's sections are always appended after a separator. The original PR body (issue description, instructions) is never modified.

Why Retargeting a PR Explodes the Diff

Wes Nishio — Wed, 08 Apr 2026 22:23:15 +0000

Why Retargeting a PR Explodes the Diff

A reviewer asked us to change a PR's base from, say, release/20260401 to release/20260501. Simple request. GitHub even has an API for it: PATCH /repos/{owner}/{repo}/pulls/{number} with a new base field.

GitAuto called it. The base branch label changed. And the PR diff went from 5 files to 300+.

What Went Wrong

GitHub's "change base branch" is metadata-only. It updates which branch the PR targets but does nothing to the git history. When that doesn't matter - say, retargeting from main to develop where develop was forked from main - the diff stays clean because the commit graph still makes sense.

But the two release branches were siblings. Both were cut from main at different points in time:

main ──────●──────────────●──────
           │              │
           ▼              ▼
     release/0401    release/0501

When the PR originally targeted release/0401, Git computed the merge base between that branch and the PR head. The diff showed only the PR's actual changes. After the API call switched the target to release/0501, Git recomputed the merge base - now between a completely different branch and the same PR head. Every file that differed between the two release branches appeared in the PR diff.

What a Human Would Do

A developer would run git rebase --onto release/0501 release/0401 pr-branch. This replays the PR's commits on top of the new base, and the diff goes back to showing only the actual changes.

But rebase has two problems for automation:

Merge conflicts can halt execution. Rebase replays commits one by one. If any commit conflicts with the new base, git stops and waits for manual resolution. An automated system can't resolve conflicts interactively.
Shallow clones break rebase. Many CI systems and automation tools clone with --depth 1 for speed. Rebase needs the full commit history to find the fork point and replay commits. With a shallow clone, it simply fails.

The Fix for Automation

Instead of replaying commits, save the end result and rewrite it:

Save the PR's actual file changes (contents from the current branch)
Change the base branch on GitHub (the metadata part)
Reset the local branch to the new base (git fetch + git reset)
Rewrite the saved files onto the new base
Force push to update the remote

This is deterministic and conflict-free. It doesn't matter how the files got to their current state - we just read the final contents, reset to the new base, and write them back. Works with any clone depth.

The Takeaway

When an API says it changes something, verify it changes everything that needs changing. GitHub's API truthfully changes the base branch - the metadata. But "retarget a PR" in a developer's mind means "make the diff show only my changes against the new base." That requires git-level surgery that no REST API currently provides.

If you maintain release branches cut from the same trunk, be aware that retargeting PRs between them is not a one-API-call operation. The label moves instantly. The diff needs work.

GitAuto handles this automatically. See our docs on sibling branch retarget for details.

Our Agent Had the Checklist and Ignored It

Wes Nishio — Wed, 08 Apr 2026 18:22:41 +0000

Our Agent Had the Checklist and Ignored It

We run an LLM-based quality gate that evaluates tests across 41 checks in 8 categories: business logic, adversarial inputs, security, error handling, and others. When the gate fails, the agent is told to improve the tests and try again.

Last week our agent - Claude Opus 4.6 - burned all its iterations rewriting tests for a CLI tool that parses CSV files and writes to a database. The quality gate failed on three specific categories every single time: adversarial inputs, security, and error handling. The agent never once added a test for any of them.

What the Agent Did Instead

The agent had the full 41-check quality checklist in its system prompt. It knew which categories exist. When told "quality gate failed," here's what it did across 9 commits:

Changed expect(spy).not.toHaveBeenCalledWith(msg) to spy.mock.calls.filter(...).toHaveLength(0). A "spy" in testing is a wrapper that records how a function was called - what arguments it received and how many times. Both lines check the same thing: "this function was never called with this message." The agent just rewrote the syntax without changing what's being tested.
Added a test combining all CLI flags together - useful, but not adversarial
Added a test for path normalization (backslash replacement) - general coverage, not security
Repeated similar cosmetic rewrites for the remaining commits

Not a single null input test. Not a single injection test. Not a single error message test. The agent had the information. It just didn't use it.

Why This Happens

If you've worked with LLMs, you've seen this pattern. Given a vague directive ("improve quality") and a detailed reference (the checklist), the model takes the path of least resistance. Rewriting an existing assertion is easier than designing a new adversarial test from scratch. The model satisfies the surface instruction ("I improved the tests") without addressing the substance.

This is the same behavior you see when asking an LLM to "review this code" - it often comments on formatting and naming instead of identifying logical bugs. The easy observations come first. The hard analysis gets skipped.

A good engineer, given vague feedback from a reviewer, would either ask clarifying questions or self-review against the checklist they already have. Claude Opus 4.6 had both options available - it could have asked for clarification through its tools, or systematically walked through the 41 checks it had in its system prompt. Instead, it made a small tweak and hoped that would be enough. Then did it again. And again. Nine times.

That's not what a capable engineer does. That's what a lazy one does - make a cosmetic change, submit, and hope the reviewer doesn't look too closely. It's a very human behavior, but not one we want a model to have learned.

The Compounding Problem

The quality gate returned a generic message: "Quality gate failed. Evaluate and improve test quality per coding standards." The agent knew the checklist existed but didn't know which 3 out of 41 checks actually failed. So it had to guess, and guessing led to cosmetic edits.

But even with specific feedback, one of the three failures was a false positive. The gate flagged path.resolve() as a command injection vector. It's not - it's a path normalization function. No amount of test-writing would satisfy that check.

So the agent faced three problems simultaneously:

It was lazy - it didn't systematically work through the checklist
The feedback was generic - it didn't know which checks failed
One check was wrong - a false positive that could never pass

What We Changed

Specific feedback: The error message now includes the exact failing checks with reasons. Instead of "quality gate failed," the agent sees adversarial.null_undefined_inputs: No tests for null CLI arguments and security.command_injection: No tests for malicious input values. This removes the guessing.

Same model for judging: The quality gate was using a weaker model than the agent itself - a cheaper model evaluating a more capable model's work. Now both use the same model, which reduces false positives like the path.resolve judgment.

Escape hatch: After 3 consecutive failures with no progress, accept the current quality and move on. Some checks may be false positives, and burning iterations on unfixable failures wastes compute. We get a Slack notification when this triggers so we can investigate.

The Lesson

The model is fundamentally capable of writing null input tests, analyzing injection vectors, and designing error handling coverage. It does all of those in other contexts. The capability is there. But capability and behavior are different things - the model's native tendency toward path-of-least-resistance means it won't reliably use its full power without external pressure.

This is why the fix has to be at the application layer. The checklist in the system prompt gives the model the knowledge. But knowledge alone doesn't produce diligence. Specific, targeted feedback ("these 3 checks failed") works better than comprehensive reference material ("here are all 41 checks") because it closes the gap between what the model can do and what it will do. It removes the opportunity to take shortcuts by making the exact problem inescapable.

This is also why tools like GitAuto exist. The models are powerful enough to write high-quality tests, fix CI failures, and reason about security. But left to their own defaults, they take shortcuts. The application layer - verification gates, specific feedback loops, escape hatches, structured tool calls - is what turns raw model capability into reliable engineering output. The value isn't in the model. It's in making the model actually do the work.

We Need a Laziness Eval

The industry benchmarks models on reasoning, coding, math, and knowledge. There are evals for shortcut resistance and multi-step reasoning. But none of them measure laziness - the gap between what a model can do and what it will do when not forced. This incident would pass every existing eval. Claude Opus 4.6 can write adversarial tests. It can analyze injection vectors. It can read a checklist and work through it systematically. It just didn't.

A laziness eval would give the model a task, a reference checklist, and vague feedback ("this isn't good enough"), then measure whether it systematically addresses the checklist or makes cosmetic changes and resubmits. The score isn't whether the model can solve the problem - it's whether it chooses to do the hard work when the easy path is available.

Zero Changes Passed Our Quality Gate

Wes Nishio — Wed, 08 Apr 2026 02:36:41 +0000

Zero Changes Passed Our Quality Gate

We have a pipeline that evaluates test quality beyond coverage. It scores files on 41 checks across categories like boundary testing, error handling, and security. When a file scores poorly, the system creates a PR and assigns an AI agent to improve the tests.

Last week, the agent looked at a test file with 100% line coverage, said "nothing to improve," and closed the task with zero changes. Our verification gate passed it through. The tests were still weak.

The agent wasn't being clever. Our gate had a gap.

What the Tests Actually Looked Like

The test file covered a function that transforms data and returns an object. Every line was exercised. But the assertions only checked that a return value existed:

expect(result).toBeDefined();
expect(result).not.toBeNull();

The function always returns an object - it can never return undefined or null. These assertions pass no matter what the function does. You could replace the entire implementation with return {} and every test would still be green. They test nothing.

The Gap in Our Gate

Our verification step runs when the agent declares the task complete. It checks for lint errors, type errors, and test failures. If everything passes, the task is marked done.

The agent made zero changes. Zero changes means zero PR files. Zero PR files means nothing to lint, nothing to type-check, nothing to test. Our verification pipeline had nothing to verify, so it passed. "Do nothing" was a valid exit path even when the system had already flagged the tests as weak.

The Fix: Three Layers

Prompt-level instructions: We added explicit rules telling the agent that 100% coverage doesn't mean the tests are good. The agent's coding standards now include guidance on what useless assertions look like and why toBeDefined() on a non-nullable return proves nothing.

Zero-change rejection: When the agent completes a quality-focused PR with zero changes, we reject the first attempt - the scheduler already determined the tests were weak when it created the PR, so "no changes" contradicts that finding. But if the agent tries again and still makes no changes, we allow completion. Sometimes the tests are genuinely fine and the scheduler was wrong. No infinite loops.

LLM-based evaluation after changes: When the agent does make changes, we run the quality evaluation again after all other checks pass (lint, types, tests). This runs last to avoid wasting an LLM call when the agent will need to retry anyway due to syntax errors or test failures.

The Cost Problem

The quality evaluation uses an LLM call. Running it costs money. If we run it early and lint fails, the agent fixes the lint error and calls verify again - triggering another LLM evaluation for nearly identical code. By running quality checks last, we only pay for the evaluation when everything else is already clean. One call per successful verification instead of one per attempt.

The Broader Pattern

This isn't specific to AI agents. Any automated pipeline with a "no changes needed" exit path has this gap. CI that only runs on changed files. Linters that skip untouched code. Review bots that auto-approve empty diffs.

The fix is the same everywhere: if the system decided something needs work, don't let "no work done" count as completion. Track why the task was created and verify that the reason was addressed, not just that the pipeline didn't find new problems.

How We Reached 92% Coverage with GitAuto

Wes Nishio — Mon, 06 Apr 2026 23:08:47 +0000

How We Reached 92% Test Coverage with GitAuto

We decided to dogfood GitAuto by using it on the GitAuto repository itself. The goal was simple: demonstrate whether we could really achieve high test coverage in a real production codebase. After 3 months, we hit 92% line coverage, 96% function coverage, and 85% branch coverage.

Here's exactly how we did it and what we learned.

The Setup: 5 Files Per Day, Every Day

Our approach was straightforward:

Enabled schedule trigger: Set GitAuto to run automatically every day
5 files per day: Configured to target 5 files each morning
Weekends included: Tests ran 7 days a week
Repository size: ~250 files total

The math was simple: at 5 files per day, we'd need roughly 50 days (about 2 months) to cover the entire codebase. In reality, it took closer to 3 months because we refined our approach along the way, experimented with different file counts, and occasionally restarted files when we improved the system.

The Daily Routine

Every morning, GitAuto would create 5 pull requests—one for each targeted file. Our review process evolved over time:

Initially:

Check if tests were passing
Review the test code in detail
Verify the changes made sense
Merge if everything looked good

In the end:

Most PRs were green out of the box
Quick verification that only test files changed (or legitimate bug fixes)
No code review—trusted the passing tests
Merge and move on

Coverage Growth Over Time

We didn't start tracking coverage data historically from day one, so our coverage charts only show the latter half of the journey. The growth rate varies because we adjusted the volume based on what we were working on—when we found issues to fix in GitAuto itself, we ran fewer PRs; when things were stable, we ran up to 10 PRs per day.

How We Actually Develop

Here's important context: we build GitAuto using Claude Code. When we write new features, we do write unit tests for critical parts we especially want to verify. But we don't obsess over coverage or spend significant time writing comprehensive test suites.

The result? Most features ship with decent but incomplete test coverage. Not 100%, not close. And bugs still happen.

This is where GitAuto came in. It filled the gaps we left, systematically adding tests to increase coverage on files we'd already moved on from.

The Results: What 90%+ Coverage Actually Feels Like

Now that we're consistently above 90% coverage with 242 test files, 2,680 test cases running in 3 minutes (67ms per test):

Bugs feel rare. We encounter far fewer unexpected issues in production.

Merges feel safe. We have confidence that changes won't break existing functionality.

Regression testing is faster. Automated tests catch issues that used to require manual verification.

Development velocity increased. Less time spent on manual testing and bug fixes means more time building features.

The One Downside

There's one real cost we didn't anticipate: GitHub Actions minutes. Initially, the GitAuto repository ran on GitHub's free tier with no issues. But as coverage increased, so did the number of tests running on every PR.

We eventually hit the free tier limits and had to upgrade. Now we also optimize by skipping test runs when there are no relevant changes (e.g., Python tests don't run when only documentation changes).

It's a small price to pay for 90%+ coverage, but worth knowing upfront.

Conclusion

Achieving 92% line coverage, 96% function coverage, and 85% branch coverage wasn't the result of heroic manual effort. It came from:

Enabling scheduled automation
Reviewing and merging 5 PRs each morning
Trusting the process over 3 months

If you're skeptical that high coverage is achievable in real-world codebases, we were too. But the data doesn't lie: consistent, automated test generation works.

Want to try the same approach on your repository? Install GitAuto and enable the schedule trigger. Start with 3-5 files per day and let the coverage compound.

How We Finally Solved Test Discovery

Wes Nishio — Wed, 01 Apr 2026 04:47:12 +0000

How We Finally Solved Test Discovery

Yesterday I wrote about why test file discovery is still unsolved. Three approaches (stem matching, content grepping, hybrid), each failing differently. The hybrid worked best but had a broken ranking function - flat scoring that gave src/ the same weight as src/pages/checkout/. Today it's solved.

The Problem With Flat Scoring

The March 30 post ended with this bug: +30 points for any shared parent directory. One shared path component got the same bonus as three. With 3 synthetic inputs, other factors dominated. With 29 real file paths, unrelated test files ranked above relevant ones.

The fix wasn't tweaking the constant. It was replacing the scoring model entirely.

Five Tiers, Not Points

Instead of adding up weighted scores, we rank by structural relationship. Higher tiers always win over lower ones, regardless of path depth or name similarity.

Tier 1 - Colocated tests. Same directory, same stem with a test suffix. Button.tsx and Button.test.tsx side by side. This is the strongest signal possible.

Tier 2 - Same-directory content match. A test file in the same directory whose source code imports the implementation file.

Tier 3 - Path-based match. The test file's path contains the implementation stem. tests/test_client.py for services/client.py. The classic mirror-tree convention.

Tier 4 - Content grep match. A test file anywhere in the repo references the implementation file in its source code.

Tier 5 - Parent directory content match. A test file in a parent directory that references the impl. Weakest signal, but still a real connection.

The key insight: tiers are ordinal, not additive. A Tier 1 match always outranks a Tier 3 match. No combination of bonus points can promote a distant test above a colocated one.

Content-Aware Matching

Path matching alone can't handle barrel re-exports. When a test imports from '@/pages/checkout' and that resolves to index.tsx, the string "index" never appears in the import statement. Path matching sees nothing.

Content-aware matching reads the test file and greps for references to the implementation. If a test file contains import { CheckoutPage } from './index' or require('./checkout'), the content grep catches it. Tiers 2, 4, and 5 are the content tiers that fill gaps path-only matching leaves open.

Single-Source Patterns

Every language has its own test naming convention:

.test.ts, .test.tsx - JavaScript/TypeScript (Jest, Vitest)
.spec.ts, .spec.tsx - Angular, Cypress, Playwright
test_*.py - Python (pytest)
*_test.go - Go
*Test.java, *Test.kt - Java/Kotlin (JUnit)
*_spec.rb - Ruby (RSpec)
*.spec.js - JavaScript (Mocha, Jasmine)

All of these are defined once and imported everywhere. Before this change, three different functions each maintained their own pattern list - slightly different, each missing cases the others caught.

The Takeaway

Test file discovery looks like a string matching problem. It's actually a ranking problem with structural priors. Flat scoring collapses structure into numbers and loses information. Tiered ranking preserves the structural relationship and makes the algorithm's priorities explicit and debuggable. And the only way to validate ranking is against real data at real scale - not 3 curated inputs that any algorithm can pass.

What 100% Test Coverage Can't Measure

Wes Nishio — Wed, 01 Apr 2026 04:47:11 +0000

What 100% Test Coverage Can't Measure

Customers started asking us: "How do you evaluate test quality? What does your evaluation look like?" We had coverage numbers - line, branch, function - and we were driving files to 100%. But we didn't have a good answer for what happens after 100%. Coverage proves every line was exercised. It doesn't say whether the tests are actually good.

Coverage Is the Foundation

Coverage tells you which lines ran during testing. That's important. A file at 30% coverage has obvious blind spots. Driving it to 100% forces tests to exercise error branches, conditional paths, and edge cases that might otherwise be ignored. We treat coverage as the primary goal and spend most of our effort getting files there.

But coverage measures execution, not verification. A test that renders a payment form, types a valid card number, and clicks submit can hit every line and every branch. It proves the happy path works. It doesn't tell you whether the form handles an expired card, a malformed CVV, or a network timeout mid-submission.

The Eight Categories After 100%

Once a file reaches 100%, there are categories of testing that coverage can't capture. We built a checklist of 41 checks across eight categories. Each check gets a pass, fail, or not-applicable result per file.

Business Logic

Does the test verify that domain rules produce correct results? A pricing function that calculates premiums needs tests for each tier boundary, not just one valid input. State transitions (pending → approved → active) need tests that verify invalid transitions are rejected. Calculation accuracy matters when rounding errors compound across thousands of transactions.

Adversarial

What happens when inputs are hostile? Null values, empty strings, empty arrays, boundary values (0, -1, MAX_INT), type coercion traps ("0" == false), oversized inputs, race conditions, and unicode special characters. A function can pass every line with valid inputs and still crash on null.

Security

Does the code defend against attack vectors? XSS payloads in user-generated content, SQL injection through unsanitized parameters, command injection via shell calls, CSRF on state-changing endpoints, authentication bypass, sensitive data exposure in logs or responses, open redirects, and path traversal (../../etc/passwd). Security tests verify that malicious input is rejected, not just that valid input is accepted.

Performance

Will this code scale? Quadratic algorithms hide behind small test datasets. N+1 queries don't show up until production traffic hits. Heavy synchronous operations block the event loop. Large imports increase bundle size. Redundant computation wastes cycles on every request. Performance tests catch what functional tests miss because functional tests use small inputs.

Memory

Does this code clean up after itself? Event listeners that aren't removed on unmount leak memory on every navigation. Subscriptions and timers that outlive their component accumulate silently. Circular references prevent garbage collection. Closures that capture large scopes retain memory longer than expected. These bugs don't crash - they degrade slowly until the tab or process dies.

Error Handling

What does the user see when things go wrong? Graceful degradation means a failed API call shows a retry option, not a blank screen. User-facing error messages should say what happened and what to do next, not expose a raw stack trace or a generic "Something went wrong."

Accessibility

Can everyone use it? ARIA attributes tell screen readers what an element does. Keyboard navigation means every interactive element is reachable without a mouse. Focus management ensures modal dialogs trap focus correctly and return it when closed. These aren't nice-to-haves - they're requirements for users who rely on assistive technology.

SEO

Is this page discoverable? Meta tags control how search engines and social platforms display the page. Semantic HTML (<article>, <nav>, <main>) helps crawlers understand page structure. Heading hierarchy (h1 → h2 → h3, no skipping) signals content relationships. Alt text on images provides context when images can't load or can't be seen.

Per-File, Not Per-Repo

We evaluate quality per file, not per repo. A repo-level score averages away the problems. Per-file evaluation means each source file and its test files are checked against all eight categories independently. Files that fail any check become candidates for test strengthening.

What We Built

We shipped 41 checks across these eight categories. When a file hits 100% coverage, we automatically evaluate its tests against the full checklist. Each check returns pass, fail, or not-applicable. Files that fail any check get a PR to strengthen the tests. Coverage remains our primary goal - we still spend most effort getting files to 100%. But now we have a concrete answer when customers ask how we evaluate quality beyond coverage numbers. The checklist will evolve as we learn what matters most across different codebases and languages.

See the full checklist with all 41 checks and how change detection avoids redundant evaluation.

Test File Discovery Is Still Unsolved

Wes Nishio — Mon, 30 Mar 2026 23:04:22 +0000

Test File Discovery Is Still Unsolved

Given a file like src/pages/checkout/index.tsx, which test files should you look at? Sounds simple. It's not.

We build an AI agent that writes tests. Before the agent starts, we need to find existing test files so it can match the project's testing patterns. We looked at the agent's logs for one real run: 34 iterations total, and 18 of them were spent just reading files - fetching imported modules, searching for type definitions, re-reading files it had already seen. The agent can read 2-3 files per iteration in parallel, but it still burned half its budget on discovery instead of writing tests.

The agent can solve this on its own - it does search, read, and eventually find the right files. But each iteration costs tokens. We want to pre-load as much context as possible before the agent loop begins, doing deterministically what the agent would do heuristically. Same work, but programmatic, stable, and cheaper. The discovery algorithm is the hard part - especially when you're language-agnostic and can't rely on any single project's conventions.

Approach 1: Stem Matching

Extract the filename stem and search the tree. Say you have src/auth/SessionProvider.tsx - the stem is SessionProvider. Walk the file tree, find test files containing "SessionProvider" in their path. This works for most files.

It fails for generic stems. A file like src/pages/checkout/index.tsx has stem index. Grepping for "index" across a codebase matches almost everything - 29 test files in one real repo. The signal drowns in noise.

We considered falling back to the parent directory name for generic stems (index -> checkout). This helps for some cases, but "generic" is a judgment call. Is utils generic? config? handler? Every heuristic creates a new edge case.

Approach 2: Content Grepping

Instead of matching paths, grep test file contents for the stem. If a test file imports SessionProvider, it references that implementation. This catches tests in completely different directories - e.g. a test in src/pages/checkout/ might import ../../auth/SessionProvider.

But content grep has a different failure mode. Many JavaScript projects use barrel exports (index.ts re-exporting everything). A test might import from '@/pages/checkout' which resolves to index.tsx at runtime, but the string "index" never appears in the import. The connection exists at the module resolution level, not the string level.

PHP and Go have the same problem differently. A PHP test file might reference InvoiceService by class name without any file path in the import. A Go test lives in the same package directory and imports nothing explicitly.

Approach 3: Hybrid (Current)

We now combine both approaches. Path matching (walk the tree for test files whose path contains the stem) plus content grep (find test files that reference the stem in source code). Take the union. This catches both colocated tests and distant tests that import the file.

The problem shifts from discovery to ranking. A real repo produces 29 test file hits for index.tsx (from 51 raw grep matches). Five of them are highly relevant (in src/pages/checkout/ subtree). The other 24 are noise. Which 5 do we load into context?

The Ranking Bug That Toy Tests Missed

We scored each test file: +100 for name match, +50 for same directory, +30 for shared parent, -1 per distance. We wrote tests with 3 handcrafted files. They passed.

Then we ran the ranker against 29 real file paths from a production repo. src/index.test.tsx (the root app test, completely unrelated) ranked #2. src/pages/checkout/components/PayButton/index.test.tsx (actually relevant) ranked #4.

The bug: +30 was a flat bonus for any shared parent. One shared component (src/) got the same +30 as three shared components (src/pages/checkout/). With 3 synthetic inputs, other scoring factors dominated. With 29 real inputs at varying depths, the flat bonus broke everything.

The fix was one line: change +30 to common_len * 10 so deeper shared paths score higher.

This is the mutation testing principle. Imagine an "evil coder" who changes your constant: +30 to +0 or +1000. Do your tests fail? With 3 synthetic inputs, no. The tests pass regardless of the constant's value. That means they prove nothing about it. Only 29 real inputs exposed the flaw.

What Remains Unsolved

The fundamental issue is that the mapping between implementation files and test files is a convention, not a computable relationship. Every project invents its own rules:

Colocated: Button.tsx and Button.test.tsx side by side
Mirror tree: src/auth/Provider.tsx tested by tests/auth/Provider.test.tsx
Separate dir with different naming: core/app/Services/Foo.php tested by core/tests/Unit/Service/FooTest.php
Framework magic: Go tests in the same package, Python tests discovered by pytest markers
Barrel re-exports: The actual file path never appears in any import statement

No single algorithm handles all of these. Path matching fails for different directory structures. Content grep fails for barrel exports and framework-level imports. Even the hybrid approach requires a ranking function, and that ranking function needs real data to validate - not 3 handcrafted inputs.

If you're building developer tooling that needs to answer "which test covers this file?" - there's no clean answer. The best we've found is: try multiple discovery methods, take the union, rank aggressively, and validate with real repository data at real scale. And even then, you'll miss cases.