We started using model-assisted test generation to speed up coverage for a legacy service. At first it felt productive: a prompt plus a few examples produced a handful of tidy unit tests that ran locally and on CI. I iterated with the tool’s chat interface to refine prompts and add edge cases, and the model happily produced more assertions each time. The problem showed up slowly. Tests were syntactically correct, coverage numbers ticked up, and reviewers skimmed green reports and moved on. Only after a production bug made it past CI did we realize the new tests were validating the wrong things — they exercised mocked internals and repeated tautological assertions rather than checking real behavior.
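To make the failure mode concrete, here is a minimal sketch of the kind of test we kept accepting. The names (get_user_profile, the cache client) are hypothetical and the real service code looked different, but the shape of the assertion was the same: it checks that a mock was touched, not what the function returns.

```python
# Hypothetical example of the pattern we kept accepting: the test verifies
# that a mocked dependency was called, but never checks the returned value.
from unittest.mock import MagicMock


def get_user_profile(user_id, cache_client):
    """Toy stand-in for the service code: fetch from cache, transform, return."""
    raw = cache_client.get(f"user:{user_id}")
    return {"id": user_id, "name": raw["name"].title()}


def test_get_user_profile_calls_cache():
    cache = MagicMock()
    cache.get.return_value = {"name": "ada lovelace"}

    get_user_profile(42, cache_client=cache)

    # Tautological: this passes even if the transformation is wrong or missing,
    # because it only asserts that the mock was touched.
    cache.get.assert_called_once_with("user:42")
```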
How this failure surfaced
The immediate symptom was a CI pipeline that stayed reliably green while error reports arrived from production. The failing flow involved a caching layer where generated tests asserted that the cache client was called, not that the returned payload matched the expected transformation. Because the model tended to copy familiar patterns it had seen in the repo, many tests used the same mocking approach and never validated the serialized output. We investigated with manual reproductions and a small integration test that hit the real cache and serialization path. That integration test failed, revealing the discrepancy. We also ran a dedicated verification pass, cross-referencing documentation and runtime traces, to confirm that the model had copied a mocked example into many places where a real behavioral assertion was needed. Accepting generated tests without deeper verification amplified the problem; each new test reinforced the same gap.
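The integration test that exposed the gap was, in spirit, closer to the sketch below. The names are again hypothetical and an in-memory fake stands in for the real cache and serialization stack we exercised; the point is that the assertion targets the transformed, serialized payload rather than call counts.

```python
# Behavior-focused test in the spirit of the integration test that finally
# failed: it asserts on the transformed payload instead of on mock calls.
import json


class InMemoryCache:
    """Stand-in for the real cache client used in the integration test."""

    def __init__(self, data):
        self._data = data

    def get(self, key):
        return self._data[key]


def get_user_profile(user_id, cache_client):
    raw = cache_client.get(f"user:{user_id}")
    return {"id": user_id, "name": raw["name"].title()}


def test_profile_payload_matches_expected_transformation():
    cache = InMemoryCache({"user:42": {"name": "ada lovelace"}})

    payload = get_user_profile(42, cache_client=cache)

    # The assertions the generated tests were missing: check the observable
    # output, including the serialized form that actually ships.
    assert payload == {"id": 42, "name": "Ada Lovelace"}
    assert json.dumps(payload, sort_keys=True) == '{"id": 42, "name": "Ada Lovelace"}'
```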
Why it was subtle and easy to miss
Generated tests are deceptive: they look like hand-written tests, they run, and CI gives you fast green feedback. The model’s pattern-completion behavior prefers high-probability templates (mock, call, assert called), which align with many valid unit-testing idioms. Reviewers rarely dig into assertion semantics on every generated test, especially when the tests increase coverage metrics and conform to project style. Small model behaviors compound here. The model often hallucinates helper functions or reuses existing fixtures verbatim, which blurs the surface difference between correct and vacuous tests. Context-window limits also encourage the model to rely on local examples instead of global contracts, so multi-file interactions get ignored. These micro-behaviors create a corpus of passing tests that nonetheless provide poor guarantees.
Practical mitigations
Treat generated tests as drafts. Enforce a checklist requiring that every test contain at least one assertion on observable behavior (output content, a side effect, or an integration result) rather than on internal wiring. Where possible, add a small integration or contract test that exercises the full path; these are slower but catch exactly the class of issue we saw. Use tools and manual cross-checking, such as a focused verification step or a lightweight deep-research pass, to verify the assumptions the model made about APIs. Finally, track provenance: annotate generated tests with a short note and require human sign-off for coverage-increasing changes. Small habits (rejecting tautological assertions, adding negative cases, preferring output-driven checks) significantly reduce risk. Our team now runs a periodic audit of generated tests and keeps a short playbook on what a valid assertion looks like; those changes caught the next bug before it reached users and remind us that generated code is a helpful draft, not a guarantee of correctness. For an overall sanity check on the AI workflow, we keep a centralized reference like crompt.ai available to coordinate prompts and review patterns.
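As a rough illustration of what the periodic audit can look for, the sketch below is a heuristic, not our exact tooling: it assumes pytest-style tests with plain assert statements and flags test functions whose only assertions are mock-call checks.

```python
# Rough audit heuristic (illustrative sketch): walk test files and flag test
# functions that contain mock-call assertions but no plain `assert` on an
# observable value.
import ast
import pathlib

MOCK_ASSERTIONS = {
    "assert_called", "assert_called_once", "assert_called_with",
    "assert_called_once_with", "assert_any_call", "assert_not_called",
}


def flag_tautological_tests(path):
    """Yield (file name, test name) pairs that look mock-only."""
    tree = ast.parse(path.read_text())
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            has_plain_assert = any(
                isinstance(n, ast.Assert) for n in ast.walk(node)
            )
            has_mock_assert = any(
                isinstance(n, ast.Call)
                and isinstance(n.func, ast.Attribute)
                and n.func.attr in MOCK_ASSERTIONS
                for n in ast.walk(node)
            )
            if has_mock_assert and not has_plain_assert:
                yield path.name, node.name


if __name__ == "__main__":
    for test_file in pathlib.Path("tests").rglob("test_*.py"):
        for filename, test_name in flag_tautological_tests(test_file):
            print(f"{filename}: {test_name} asserts only on mock calls")
```

A check like this misses unittest-style self.assertEqual calls and async test functions, so its output is best treated as a review queue rather than a hard CI gate.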