Sofia Bennett

When Generated Tests Pass but Don't Catch the Bug: Lessons from AI-Produced Unit Tests

I remember a late-night sprint where a model-generated test suite gave us a warm sense of safety and then let a regression slip into production. The model produced dozens of unit tests that all passed locally and in CI; coverage numbers rose and reviewers nodded. What we didn't notice was that many of those tests were validating the implementation's current behavior rather than the intended specification, so they confirmed the bug as the expected outcome.

This is an AI failure mode I see often: tests that look correct because they mirror the implementation's outputs. In practice the model tended to synthesize assertions by replaying the code's own logic and to rely on trivial input/output pairs. The failure was subtle because every metric we watched (test counts, line coverage, a green CI) looked healthy. If you use AI interactively for iteration, e.g. a crompt.ai-assisted workflow, it can accelerate writing tests, but it can also accelerate confirmation bias when the tests aren't checked against independent oracles.

How the problem surfaced during development

We first noticed a job failing only after a downstream system started rejecting certain records. The unit tests had all been updated by the model days earlier, and they passed in CI. Investigation showed the generated tests were using the same helper logic as the production code to construct expected values, so they effectively asserted implementation equals implementation.
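To make the pattern concrete, here is a minimal, hypothetical sketch (the names `batch_records` and `build_expected_batches` are illustrative, not our real code): the test derives its expected value from the same slicing logic the implementation uses, so the assertion cannot disagree with the code under test.

```python
# Hypothetical sketch of the vacuous pattern the generated tests followed;
# batch_records and build_expected_batches are illustrative names.

def batch_records(records, batch_size):
    # Production code under test: split records into fixed-size batches.
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

def build_expected_batches(records, batch_size):
    # Test "helper" copied from the implementation: same slicing logic,
    # and therefore the same bugs, as the code it is meant to check.
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

def test_batch_records_matches_helper():
    records = list(range(10))
    # The expected value is derived from the implementation's own logic,
    # so this assertion passes regardless of whether batch_records is correct.
    assert batch_records(records, 4) == build_expected_batches(records, 4)
```

The test above raises coverage and reads as reasonable in a diff, yet it encodes no knowledge of what the batching is supposed to do.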

The model had also preferred examples that matched common cases and avoided edge inputs. Because the tests were synthetic and narrow, they didn't exercise boundary conditions or error paths. A simple off-by-one bug in record batching went undetected because the tests only checked inputs whose length divided evenly into batches, not invariants like every record appearing in exactly one output batch.
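Here is a hedged reconstruction of that class of bug (the names and numbers are invented for illustration): a happy-path test on an evenly divisible input passes, while a simple count invariant on a boundary-length input exposes the dropped record.

```python
# Hypothetical reconstruction of the off-by-one; the real bug and names differed.

def batch_records_buggy(records, batch_size):
    # Off-by-one: the range stops one step early, silently dropping the
    # final partial batch when len(records) is not a multiple of batch_size.
    return [records[i:i + batch_size]
            for i in range(0, len(records) - batch_size + 1, batch_size)]

def test_happy_path_passes_anyway():
    # Input length is an exact multiple of batch_size, so nothing is dropped.
    assert batch_records_buggy(list(range(8)), 4) == [[0, 1, 2, 3], [4, 5, 6, 7]]

def test_boundary_case_would_have_caught_it():
    # 9 records in batches of 4 should preserve every record.
    records = list(range(9))
    batches = batch_records_buggy(records, 4)
    # This fails against the buggy implementation: 8 records survive, not 9.
    assert sum(len(batch) for batch in batches) == len(records)
```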

Why this failure mode is subtle and easy to miss

Several small behaviors compounded: the model copied naming and helper functions it found in the repository, used templated assertion styles, and favored happy-path examples. That made failures hard to spot because review patterns that flag missing tests tend to rely on counts and surface-level checks, not oracle quality. Humans skim diffs; a block of well-formed test functions reads as correct even if their assertions are vacuous.

Another subtlety is that models often assume the repository's current behavior is the ground truth. If the codebase already contains a bug, the model treats that as expected output. Tests generated in that context inherit the same mistake unless prompted with an explicit specification or counterexample.

Practical mitigations and small changes that help

We adopted a few low-friction practices that reduced recurrence. First, require at least one independent oracle per test: either a hard-coded expected value from a spec, a reference implementation, or a property-based assertion that expresses invariants. Use a multi-turn review with a chat interface to iteratively ask the model to produce negative and boundary cases rather than just more positive examples.
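As a sketch of the property-based option, assuming the `hypothesis` library and a hypothetical `batch_records` function under test, the invariants below act as an oracle that is independent of the implementation's internal slicing logic:

```python
# Sketch of a property-based oracle, assuming the hypothesis library and a
# hypothetical batch_records(records, batch_size) function under test.
from hypothesis import given, strategies as st

def batch_records(records, batch_size):
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

@given(st.lists(st.integers()), st.integers(min_value=1, max_value=50))
def test_batching_preserves_records_and_respects_size(records, batch_size):
    batches = batch_records(records, batch_size)
    # Invariant 1: no record is dropped, duplicated, or reordered.
    assert [record for batch in batches for record in batch] == records
    # Invariant 2: every batch is non-empty and within the requested size.
    assert all(0 < len(batch) <= batch_size for batch in batches)
```

Either invariant would have failed against the off-by-one batching bug above, because the flattened output would no longer equal the input.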

Second, combine generated tests with automated cross-checks against external sources: a lightweight deep-research-style verification step helped us fetch authoritative behavior examples and compare them to the model's outputs. Treat generated tests as drafts that need independent oracles and randomized inputs; failing to do so turns a green CI into a false sense of security.
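A lightweight way to express the drafts-plus-oracle-plus-randomized-inputs idea is a differential check: feed the same randomized inputs to the production function and to a deliberately naive reference written straight from the spec, and assert that they agree. The names below are again hypothetical:

```python
# Sketch of a randomized differential check; batch_records stands in for the
# production function, reference_batch_records is a deliberately naive
# restatement of the spec, written independently of the production code.
import random

def batch_records(records, batch_size):
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

def reference_batch_records(records, batch_size):
    batches, current = [], []
    for record in records:
        current.append(record)
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches

def test_production_matches_reference_on_random_inputs():
    rng = random.Random(0)  # seeded so any failure is reproducible
    for _ in range(200):
        records = [rng.randint(0, 1000) for _ in range(rng.randint(0, 37))]
        batch_size = rng.randint(1, 10)
        assert batch_records(records, batch_size) == \
            reference_batch_records(records, batch_size)
```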
