The Minimum Viable Test Suite for Working with Agents

#ai #agents #testing #webdev

The advice "you need more tests" is correct in roughly the same way "you should eat better" is correct: technically true, infinitely deferrable, and not actionable until someone makes it specific.

Teams adopting agents tend to hear the same advice in slightly different form: "your test coverage needs to be high before you let agents work on real code." This is also correct in roughly the same way. It is not wrong. It is just not the question. The question is: which tests, where, and in what order, to get the most agent reliability out of the least investment.

Most teams have nowhere near the coverage they think they need. Most teams also do not need the coverage they think they do. The middle ground, strategic coverage at the seams that matter, is small enough to actually build and big enough to actually help.

Why "more tests" is the wrong frame

Test coverage measures quantity, not quality. The total percentage tells you how much code is exercised; it tells you almost nothing about whether the tests would catch the failures you care about. A codebase with 90% coverage made of weak assertions ("the function returned without throwing") is less safe than a codebase with 40% coverage made of strong ones ("the function returned the correct value, given these inputs, with these side effects").

Agents amplify this distinction. An agent can ship code that passes a weak test as readily as code that passes a strong one. The weak test was producing false confidence in human-authored work too; agents just produce that work faster, so the false confidence compounds faster.

The frame that works is not "more tests" but "tests that would fail when the agent gets it wrong." Coverage is a side effect of having those. Chasing coverage directly produces code that exercises lines without verifying behavior.
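To make the weak-versus-strong distinction concrete, here is a minimal sketch. The function `apply_discount` and its bug are invented for illustration; the point is only that both tests execute the same lines, so both count equally toward coverage, but only one of them notices the bug.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function under test, with a deliberate bug:
    it subtracts the percentage as if it were a number of cents."""
    return price_cents - percent  # intended: price_cents * (100 - percent) // 100


def weak_test() -> bool:
    """Weak assertion: passes as long as the call does not raise."""
    try:
        apply_discount(1000, 10)
        return True
    except Exception:
        return False


def strong_test() -> bool:
    """Strong assertion: passes only if the values are correct for known inputs."""
    return apply_discount(1000, 10) == 900 and apply_discount(1000, 0) == 1000


if __name__ == "__main__":
    print("weak test passes:", weak_test())
    print("strong test passes:", strong_test())
```

Against the buggy version above, the weak test passes and the strong test fails. That gap is what "tests that would fail when the agent gets it wrong" means in practice.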

The seams that matter

Some parts of your codebase are higher-leverage to test than others. The high-leverage seams have three properties: they are boundaries where one part of the system meets another, they encode business rules the rest of the system depends on, and they fail in ways that are hard to detect by inspection.

API contracts are the canonical example: the boundary between your service and its callers. If the contract changes silently (the response shape shifts, a field becomes optional, an error code changes), every caller can break in subtle ways. Agents are particularly likely to make this kind of change while "cleaning up" a handler. A contract test catches it on the PR. Without one, the change surfaces in someone else's outage.
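A contract test can be as small as a field-by-field check of the response shape. The sketch below is one way to do it under stated assumptions: the endpoint, the field names in `USER_CONTRACT`, and the stand-in `get_user_handler` are all hypothetical, and a real test would call your actual handler or HTTP client instead.

```python
from typing import Any

# Hypothetical contract for a "get user" response: field name -> required type.
USER_CONTRACT: dict[str, type] = {"id": int, "email": str, "is_active": bool}


def contract_violations(body: dict[str, Any], contract: dict[str, type]) -> list[str]:
    """Return a sorted list of violations; an empty list means the shape holds."""
    problems = []
    for field, expected in contract.items():
        if field not in body:
            problems.append(f"missing field: {field}")
        elif type(body[field]) is not expected:  # exact type, so True is not an int
            actual = type(body[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field in body.keys() - contract.keys():
        problems.append(f"unexpected field: {field}")
    return sorted(problems)


def get_user_handler(user_id: int) -> dict[str, Any]:
    """Stand-in for the real handler; replace with a call into your service."""
    return {"id": user_id, "email": "a@example.com", "is_active": True}


def test_get_user_response_matches_contract() -> None:
    assert contract_violations(get_user_handler(1), USER_CONTRACT) == []
```

The check is deliberately strict: a missing field, an extra field, or a changed type all count as violations, because any of them can break a caller that was written against the old shape.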

Data layer boundaries are the second class. The seam where your code meets a database, a message queue, an external service. Most agent-introduced bugs at this seam are not the obvious "it does not work" kind; they are the "it works but does the wrong thing under load" kind. A few good integration tests at this boundary catch a disproportionate share of those.
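As an illustration of the "works but does the wrong thing" class, consider a queue consumer that must not apply the same message twice. The sketch below uses an in-memory SQLite database as a stand-in for the real data store; the table names, `record_payment`, and the redelivery scenario are assumptions made for the example, not a description of any particular system.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance_cents INTEGER NOT NULL)"
    )
    conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 0)")
    conn.commit()
    return conn


def record_payment(conn: sqlite3.Connection, message_id: str, account: str, cents: int) -> None:
    """Hypothetical consumer: apply a payment at most once per message_id."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_messages (message_id) VALUES (?)",
            (message_id,),
        )
        if cur.rowcount == 0:
            return  # this message was already processed; do nothing
        conn.execute(
            "UPDATE accounts SET balance_cents = balance_cents + ? WHERE name = ?",
            (cents, account),
        )


def balance(conn: sqlite3.Connection, name: str) -> int:
    row = conn.execute(
        "SELECT balance_cents FROM accounts WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


def test_redelivered_message_is_applied_once() -> None:
    conn = make_db()
    record_payment(conn, "msg-1", "alice", 500)
    record_payment(conn, "msg-1", "alice", 500)  # simulated redelivery
    assert balance(conn, "alice") == 500  # assert on stored state, not a return value
```

If a cleanup pass removed the `processed_messages` check, every single-delivery test would still pass. Only a test that replays the message and inspects the stored balance would fail.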

Critical business rules are the third. The pricing logic, the permission checks, the rules that determine who can do what to which resource. These are the rules where being subtly wrong is worse than being obviously broken. They earn dedicated tests because nothing else in the codebase will catch their failures, and because the cost of getting them wrong is high.

If you have nothing else, having these three categories tested well is enough to make working with agents meaningfully safer.

The boring 80%

Most of your code is not at a seam. It is internal: helpers, formatters, view logic, glue. This code benefits from tests, but it does not benefit enough to be the first investment.

The reason is twofold. First, internal code fails in ways the next layer of code will catch. A formatting helper that produces the wrong output makes the rendered page look wrong, which the visual test or the e2e test or the user catches. Second, internal code is refactored frequently, which means tests at this layer have a higher maintenance burden per bug caught. The ROI is real but lower.

The implication: do not start by trying to test everything. Start by testing the seams. Get to a place where the agent cannot ship a change at a boundary without proving the contract still holds. The internal code can be tested as it matures, when patterns stabilize, when the cost of writing the test is amortized over a long-lived function rather than a function that will be deleted next week.

This is a deliberate inversion of the "test as you go" advice that works for human-authored code. With agents, the throughput is high enough that "as you go" produces a lot of tests of code that will not survive. Front-load tests where they matter; back-load tests where they do not.

The first three tests to write

If you are starting from a codebase with effectively no tests and you want to add the minimum viable set this week, these are the three to write first:

A contract test for your most-called API endpoint. Send a request that exercises the happy path. Assert on the full response shape, including types of fields. If your agent ever produces a handler change that breaks the shape, this test fails before the change merges.

An integration test for your most-critical write path. The signup flow, the checkout, the place where the wrong outcome would be visible to a user or an auditor. Drive it end-to-end against a real-enough environment. Assert on the resulting state, not just the response.

A unit test for the trickiest business rule you can think of, with at least three cases: the happy case, the edge case that broke once in the past, and the case the team always argues about during code review. Naming the test after the rule itself is more useful than naming it after the function under test.
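The third test might look like the sketch below. The rule itself (free shipping at $50.00 or more after discounts), the threshold constant, and the function name are hypothetical; substitute whichever rule your team actually argues about. Note that the test is named after the rule, not after `qualifies_for_free_shipping`.

```python
FREE_SHIPPING_THRESHOLD_CENTS = 5000  # hypothetical rule: $50.00 or more, after discounts


def qualifies_for_free_shipping(subtotal_cents: int, discount_cents: int = 0) -> bool:
    """Hypothetical rule under test: the threshold applies to the discounted total."""
    return subtotal_cents - discount_cents >= FREE_SHIPPING_THRESHOLD_CENTS


def test_orders_of_50_dollars_or_more_after_discounts_ship_free() -> None:
    # Happy case: comfortably above the threshold.
    assert qualifies_for_free_shipping(8000)

    # Edge case (the kind that broke once): exactly at the threshold counts.
    assert qualifies_for_free_shipping(5000)

    # Contested case: the subtotal clears the threshold, but the discount
    # pulls the total below it, so the order does not qualify.
    assert not qualifies_for_free_shipping(6000, discount_cents=1500)
```

A name like this reads as a statement of the rule in the test report, so a failure tells a reviewer which business decision changed rather than which function did.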

Three tests. An afternoon of work. These will not give you coverage; they will give you a disproportionate share of the protection that coverage is supposed to represent. The rest of the suite is built outward from there.

The weekly ritual

Once you have the seams covered, the practice is small: every time the agent ships something wrong that should have been caught, write the test that would have caught it. Add it before the fix lands, and confirm that it fails against the broken code. The fix proves the test was right; the test prevents the regression.
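One turn of that loop, sketched with an invented off-by-one bug. Both `last_page_buggy` and `last_page_fixed` are hypothetical; in a real repository there is one function, and the two versions are the commits before and after the fix.

```python
from typing import Callable


def last_page_buggy(total_items: int, page_size: int) -> int:
    """Hypothetical agent-written version: drops the final partial page."""
    return total_items // page_size


def last_page_fixed(total_items: int, page_size: int) -> int:
    """The fix: ceiling division, so a partial page still counts."""
    return -(-total_items // page_size)


def partial_last_page_is_counted(last_page: Callable[[int, int], int]) -> bool:
    """The regression check, written before the fix lands."""
    return last_page(21, 10) == 3 and last_page(20, 10) == 2


if __name__ == "__main__":
    print("against buggy code:", partial_last_page_is_counted(last_page_buggy))
    print("against fixed code:", partial_last_page_is_counted(last_page_fixed))
```

Seeing the check fail against the buggy version first is what shows it targets the actual mistake; a test that was only ever green proves much less.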

This is the same loop that runs in the post-mortem ritual, just shorter. The agent's mistakes are smaller and more frequent than incidents, but the lesson is the same: a one-time bug becomes permanent knowledge only when it is encoded as a test.

A team that runs this loop for six months has a test suite shaped to the actual failure modes the agent produces in their codebase. That suite is more valuable than any amount of bulk coverage built ahead of time. It is grown rather than written.

The minimum viable suite is the starting point. The discipline is what grows it.