Sofia Bennett

When Generated Tests Pass but Don’t Protect: a Practical Look at AI-Produced Unit Tests

We ran into a recurring issue while introducing model-assisted test generation into a backend codebase: the tests the model produced would often pass, but they didn’t actually validate the important behavior. On the surface everything looked fine — green CI, coverage numbers ticking up — until a production bug exposed an invariant none of the generated tests checked. That gap forced us to treat every AI-produced test as a draft, not a replacement for human reasoning.

Early on we used the model interactively on crompt.ai to scaffold tests for REST handlers and small service objects. The generator copied common patterns (setup, happy-path assertions) and repeated them across endpoints, which boosted velocity but introduced blind spots: missing edge cases, weak assertions, and mocked behavior that didn’t reflect real system interactions.
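
A hypothetical but representative sketch of what those scaffolds looked like (the handler and payload names are invented for illustration): setup with a hard-coded fixture, one call, and a single generic success assertion, repeated with minor variations across endpoints.

```python
def create_order(payload):
    # Hypothetical stand-in for one of our REST handlers.
    return {"ok": True, "order": payload}


def test_create_order_happy_path():
    # Setup: a hard-coded fixture, near-identical across the generated tests.
    payload = {"sku": "ABC-123", "quantity": 1}
    response = create_order(payload)
    # Happy-path assertion only: generic success, no check on the resulting state.
    assert response["ok"]
```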

How the failure surfaced during development

We noticed it when a change to an upstream deserialization library began silently tolerating nulls; several AI-generated tests let the new behavior through because they used broad truthy assertions instead of checking domain invariants. The CI pipeline still passed because the tests asserted generic success codes rather than the actual state transitions or error handling we expected.

Debugging showed a pattern: the model favored short, common examples. It generated assertions like assertTrue(response.ok) or equality checks against hard-coded fixtures instead of exercising boundary conditions. When we switched to multi-turn iterations in the chat interface to refine tests, the model would dutifully add another example but still miss the same corner cases unless we described them explicitly.
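
To make the gap concrete, here is a minimal, hypothetical sketch (handler and field names invented): the first test mirrors the truthy assertions the generator wrote and passes even when a null slips through, while the second asserts the domain invariant and fails, which is exactly what we wanted CI to do.

```python
def create_order(payload):
    # After the upstream change, a missing or None quantity is silently tolerated.
    return {"ok": True,
            "order": {"sku": payload.get("sku"), "quantity": payload.get("quantity")}}


def test_generated_weak_assertion():
    # The kind of assertion the generator produced: passes even with the null.
    response = create_order({"sku": "ABC-123", "quantity": None})
    assert response["ok"]


def test_domain_invariant_we_actually_needed():
    # The invariant we care about: an order must carry a positive quantity.
    order = create_order({"sku": "ABC-123", "quantity": None})["order"]
    assert order["quantity"] is not None and order["quantity"] > 0  # fails, exposing the regression
```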

Why this was subtle and easy to miss

Several small behaviors combined to hide the problem. First, the model tends to produce the most common, strongly represented patterns from its training data: setup, happy path, teardown. That led to many superficially different tests that were semantically shallow. Second, coverage tools and CI badges provide false confidence; if a generated test executes a line, coverage marks it covered even when the assertion is meaningless.
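
A toy illustration of the coverage trap, with a made-up helper: the single test below executes every line of normalize_discount, so line coverage reports it fully covered, yet the assertion would accept almost any return value.

```python
def normalize_discount(percent):
    # Clamp a discount percentage into the 0-100 range, treating None as 0.
    if percent is None:
        percent = 0
    return max(0, min(percent, 100))


def test_normalize_discount_covered_but_unprotected():
    result = normalize_discount(None)
    assert result is not None  # coverage: 100%; protection: close to none
```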

Third, developers mentally discounted generated tests. We assumed the model had captured the obvious checks and focused review time elsewhere. That cognitive shift reduced the scrutiny on assertions and allowed subtle omissions to persist. The result was a suite that looked comprehensive but failed to assert the domain invariants that would catch real regressions.

How small model behaviors compounded into larger problems

Individually, the model’s tendencies look like minor annoyances: repeated naming, boilerplate assertions, and a bias toward the most common examples. Together they create systemic risk. When dozens of generated tests follow the same weak template, you get a brittle safety net — many green tests, but low real-world protection. In our case, that led to a production regression that required manual test authoring and a postmortem.

Our practical mitigations were procedural rather than magical. We introduced a short checklist for reviewing generated tests (identify invariants, add negative and boundary cases, avoid fragile mocks) and used an external fact-checking pass with a deep research flow to verify nontrivial assumptions. Ultimately the model helped bootstrap tests, but we treat its output as a draft requiring explicit human verification of invariants and failure modes.
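
For reference, this is roughly the shape a test took after a checklist pass. The handler, error type, and limits are hypothetical, but the pattern is the one we converged on: parametrized negative and boundary cases, assertions on the resulting state rather than a generic success flag, and no over-specified mocks.

```python
import pytest


class OrderError(Exception):
    pass


def create_order(payload):
    # Hypothetical hardened handler: rejects missing, non-positive, and oversized quantities.
    quantity = payload.get("quantity")
    if quantity is None or quantity <= 0:
        raise OrderError("quantity must be a positive integer")
    if quantity > 1000:
        raise OrderError("quantity exceeds per-order limit")
    return {"ok": True, "order": {"sku": payload["sku"], "quantity": quantity}}


@pytest.mark.parametrize("quantity", [None, 0, -1, 1001])
def test_create_order_rejects_invalid_quantities(quantity):
    # Negative and boundary cases the generated suite never exercised.
    with pytest.raises(OrderError):
        create_order({"sku": "ABC-123", "quantity": quantity})


@pytest.mark.parametrize("quantity", [1, 1000])
def test_create_order_accepts_boundary_quantities(quantity):
    order = create_order({"sku": "ABC-123", "quantity": quantity})["order"]
    assert order["quantity"] == quantity  # assert the state we expect, not just "ok"
```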
