Sofia Bennett

When AI-Generated Tests Pass but Miss the Point: a Postmortem

We started using a code-generation model to produce unit and integration tests for a small payments service. The idea was sensible: save time on boilerplate, get quick coverage, and let engineers focus on tricky edge cases. Instead, we ended up with a suite that looked green locally and in CI while missing several regressions that later reached production. Part of the reason was pragmatic: our UI snapshot strategy included images of rendered receipts, and we briefly experimented with automated fixture creation using tools like the AI Image Generator to produce representative screenshots. That amplified our confidence in the golden-file approach, but it also hid behavior mismatches whenever the assertions targeted incidental output instead of the user-visible contract.

How the failure surfaced in CI

The immediate sign was subtle. A new release introduced a rounding change in a helper function, and every test kept passing. Only after a customer reported incorrect totals in an edge case did we trace the regression. The generated tests asserted on the presence of a specific DOM node and a formatted string fragment, not on numeric equality or the currency rounding rule itself. We had iteratively asked the model to “add tests for receipt rendering” using a multi-turn workflow in our chat interface. Because the session focused on shapes and example outputs, the assistant produced tests that matched the samples it was shown. That pattern repeated across multiple generated files: brittle sample-matching instead of assertions on the underlying rule.
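To make the pattern concrete, here is a minimal sketch of the kind of test the model produced, next to the assertion it should have made. The names (`renderReceipt`, `computeTotal`, the coffee line item) are hypothetical stand-ins inlined only so the example runs on its own; a Jest or Vitest-style runner is assumed, and this is not our actual service code.

```typescript
// receipt.test.ts - a hedged sketch, not our real service code.
// renderReceipt and computeTotal are hypothetical stand-ins, inlined only so
// the example is self-contained; Jest/Vitest-style globals are assumed.
type Item = { name: string; price: number };

const computeTotal = (items: Item[]): number =>
  Math.round(items.reduce((sum, i) => sum + i.price, 0) * 100) / 100;

const renderReceipt = (items: Item[]): string =>
  `<span class="total">Total: $${computeTotal(items).toFixed(2)}</span>`;

// What the model generated: assertions mirroring the sample it was shown.
// A rounding change inside computeTotal sails straight through this.
test("renders the receipt", () => {
  const html = renderReceipt([{ name: "Coffee", price: 3.499 }]);
  expect(html).toContain('<span class="total">'); // incidental markup detail
  expect(html).toContain("Total: $");             // formatted string fragment
});

// What the contract actually needed: the rounding rule on the numeric total.
test("rounds the total to the nearest cent", () => {
  expect(computeTotal([{ name: "Coffee", price: 3.499 }])).toBeCloseTo(3.5, 2);
});
```

The first test passes whether the helper rounds, truncates, or does something else entirely; only the second one pins down the behavior the regression actually broke.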

Why the model missed the real assertions

At scale, the model’s tendencies compounded: it favors concise, high-precision matches that reflect the examples it was given. If you show it three rendered receipts with certain phrasing, it learns to assert the phrasing rather than the invariant (for example, “total == sum(items)”). Those small behavioral choices, matching tokens verbatim and preferring surface-level checks, make generated tests fragile to intentional refactors or small algorithmic changes. We also saw the model default to mocking strategies that hid state mutations: it stubbed internal helpers instead of exercising them, so the tests could never catch a bug inside those helpers. To avoid that trap in future reviews, we added a lightweight verification pass with a separate deep research step to cross-check generated assertions against documented business rules, which helped expose gaps faster.
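For contrast, a hedged sketch of what asserting the invariant looks like instead of asserting the phrasing. Again, `applyDiscount` and `computeTotal` are hypothetical helpers inlined only so the snippet is self-contained; the point is that the second test exercises the discount helper for real rather than matching a sample string or stubbing the helper out.

```typescript
// A hedged sketch, not our real code: applyDiscount and computeTotal are
// hypothetical helpers, inlined so the example is self-contained.
type Item = { name: string; price: number };

const applyDiscount = (total: number, pct: number): number =>
  Math.round(total * (1 - pct) * 100) / 100;

const computeTotal = (items: Item[], discountPct = 0): number =>
  applyDiscount(items.reduce((sum, i) => sum + i.price, 0), discountPct);

// Surface-level check the model tended to produce: verbatim token matching.
// It breaks on harmless copy changes and never touches the discount rule.
test("shows the discounted total", () => {
  const line = `Total: $${computeTotal([{ name: "Tea", price: 10 }], 0.1).toFixed(2)}`;
  expect(line).toBe("Total: $9.00");
});

// Invariant-style check: total == sum(items) with the discount rule applied,
// exercising applyDiscount for real instead of mocking it away.
test("total equals the discounted sum of item prices", () => {
  const items = [{ name: "Tea", price: 10 }, { name: "Cake", price: 4.2 }];
  const sum = items.reduce((s, i) => s + i.price, 0); // 14.2
  expect(computeTotal(items, 0.1)).toBeCloseTo(sum * 0.9, 2);
});
```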

Practical mitigations and trade-offs

Treat generated tests as drafts, not guarantees. In practice:

- Require a human to annotate the intended invariant for each generated test: is it checking an implementation detail or the contract?
- Prefer property-based or contract-style tests where possible.
- Verify each test actually fails on incorrect behavior by running its assertion against a deliberately broken implementation, a manual mutation-testing step (both checks are sketched below).
- Operationally, add quick mutation checks in CI and keep generated tests in a reviewable folder with clear metadata about what they assert.

These steps cost some time up-front, but they convert brittle, example-driven tests into durable checks. The core lesson: small, predictable model behaviors (literal matching, conservative mocking) can cascade into high-confidence but low-value tests unless you build verification around them.
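A minimal sketch of those two checks, assuming Jest-style globals and the fast-check library for the property-based part; `computeTotal` is the same hypothetical helper as in the earlier snippet, inlined again for self-containment.

```typescript
import fc from "fast-check";

type Item = { name: string; price: number };

// Hypothetical helper under test, inlined for the sketch.
const computeTotal = (items: Item[]): number =>
  Math.round(items.reduce((sum, i) => sum + i.price, 0) * 100) / 100;

// 1. Property-style contract: for any items, total == rounded sum of prices.
test("total equals the rounded sum of item prices", () => {
  fc.assert(
    fc.property(fc.array(fc.integer({ min: 0, max: 100_000 })), (cents) => {
      const items = cents.map((c, i) => ({ name: `item-${i}`, price: c / 100 }));
      const expected = cents.reduce((s, c) => s + c, 0) / 100;
      expect(computeTotal(items)).toBeCloseTo(expected, 2);
    })
  );
});

// 2. Manual mutation check: a deliberately broken implementation must be
// rejected by the same assertion, otherwise the test proves nothing.
test("the rounding assertion rejects a truncating implementation", () => {
  const mutatedTotal = (items: Item[]): number =>
    Math.floor(items.reduce((sum, i) => sum + i.price, 0)); // mutant: drops cents
  expect(mutatedTotal([{ name: "Coffee", price: 3.499 }])).not.toBeCloseTo(3.5, 2);
});
```

The mutation check is deliberately crude: it only shows that the assertion would reject one obviously wrong implementation, which is the cheapest version of the verification described above and easy to automate in CI.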
