James M

When Generated Tests Pass but Don't Protect: LLMs Creating Superficial Unit Tests

We started using a language model to generate unit tests for an internal service to speed up coverage on legacy endpoints. The idea was simple: feed the function signature and a few examples, ask the model for tests, run them, and accept any green bar as progress. At first this seemed to work: plenty of trivial assertions appeared, CI started reporting more tests, and the surface metric (test count) improved quickly.

That optimism faded when bugs the newly generated tests hadn't caught reached production. The model tended to produce assertions tied to the current implementation rather than to the contract: exact return values or mocked internals instead of behavioral properties. Our experiments ran through a web-based assistant, which sped up iteration but also hid how brittle those tests were (crompt.ai).

What actually went wrong

The generated tests mostly verified the same happy-path examples that had been provided in prompts. For a function that normalized timestamps and dropped timezone offsets, tests asserted string equality to a single canonical output instead of checking normalization properties like monotonicity or idempotence. One test used a fixture that hard-coded an internal helper’s output, so changes in the helper invalidated the test but not the contract.
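To make the contrast concrete, here is a minimal sketch in Python/pytest style. normalize_timestamp is a hypothetical stand-in for the function described above, not our actual implementation; the shape of the two tests is the point.

```python
from datetime import datetime, timezone


def normalize_timestamp(value: str) -> str:
    # Hypothetical stand-in: parse ISO-8601, convert to UTC, drop the offset.
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return dt.isoformat()


# What the model generated: one exact output pinned to one prompt example.
def test_normalize_exact_string():
    assert normalize_timestamp("2024-03-01T12:00:00+02:00") == "2024-03-01T10:00:00"


# What the contract actually needs: a behavioral property such as idempotence.
def test_normalize_is_idempotent():
    once = normalize_timestamp("2024-03-01T12:00:00+02:00")
    assert normalize_timestamp(once) == once
```

The first test invalidates itself whenever the canonical output format shifts; the second holds for as long as the contract does.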

There were also subtler issues, such as asserting exact object identity or ordering where the contract permitted unordered collections. A snippet like assertEquals([a, b, c], result) passes only when the order matches; the generated test encoded an incidental ordering rather than set semantics, so an internal data-structure swap that merely reordered results broke the test while telling us nothing about the contract.
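The fix we applied in those cases was straightforward; a sketch in Python for consistency (the original snippet was in our service's own test framework, and get_tags here is hypothetical): compare contents, not positions, when the contract allows any order.

```python
from collections import Counter


def test_tags_ignore_ordering():
    result = ["b", "c", "a"]      # imagine this came from a hypothetical get_tags()
    expected = ["a", "b", "c"]

    # Brittle: assert expected == result  -> fails on any reordering.
    # Contract-shaped: compare as multisets so duplicates still count.
    assert Counter(result) == Counter(expected)
```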

How it surfaced during development

Problems showed up after a bugfix refactor: we rewired internal calls and optimized a data path. The test suite stayed green because many generated tests asserted implementation details (mocked responses, exact error messages) rather than observable behavior. The signal in CI was especially misleading: more tests, the same escaped faults, and a surface metric that suggested improvement, which made triage harder for engineers.
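One recurring pattern was pinning exact error text. A small sketch, with entirely hypothetical names, of what that looked like next to the behavior-shaped alternative:

```python
import pytest


class PriceError(Exception):
    pass


def quote(currency: str, amount: float) -> float:
    # Hypothetical stand-in for the service function under test.
    rates = {"EUR": 1.1, "GBP": 1.3}
    if currency not in rates:
        raise PriceError(f"unknown currency: {currency}")
    return amount * rates[currency]


# Implementation-shaped: the exact message is not part of the contract,
# so rewording it breaks the test without any behavioral change.
def test_unknown_currency_message():
    with pytest.raises(PriceError, match=r"^unknown currency: XXX$"):
        quote("XXX", 100)


# Behavior-shaped: only the contract (the right error type) is asserted.
def test_unknown_currency_raises():
    with pytest.raises(PriceError):
        quote("XXX", 100)
```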

We also observed flakiness: generated tests occasionally depended on the current wall-clock time or on unseeded randomness. That produced intermittent failures we initially blamed on the environment. The root cause was that the model extrapolated common test idioms without the repository-level context needed to make stable choices.
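The stabilization we ended up reviewing for was mundane: pass the clock and the RNG in explicitly so tests can pin them. A minimal sketch, with hypothetical names:

```python
import random
from datetime import datetime, timedelta, timezone


def token_expiry(now: datetime) -> datetime:
    # Hypothetical stand-in: takes the clock as an argument instead of
    # calling datetime.now() internally.
    return now + timedelta(hours=1)


def make_nonce(rng: random.Random) -> int:
    # Hypothetical stand-in: takes an RNG instead of using the global random module.
    return rng.randint(0, 2**32 - 1)


def test_expiry_is_one_hour_out():
    fixed_now = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)  # no wall clock
    assert token_expiry(fixed_now) == fixed_now + timedelta(hours=1)


def test_nonce_is_deterministic_under_a_fixed_seed():
    assert make_nonce(random.Random(42)) == make_nonce(random.Random(42))
```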

Why this was subtle and how small behaviors compounded

The model’s tendency to mimic examples and common idioms is useful but dangerous in testing. Small behaviors — echoing prompt examples, preferring concrete literals, and defaulting to obvious frameworks — compounded into a test suite that enforced the current code shape instead of the intended contract. Each individually modest generation choice looked fine; together they created a safety illusion.

Mitigation required a combination of human review, property-based tests, and verification workflows. In practice we used a short multi-turn debugging loop to ask the model for stronger assertions, running a guided review in our chat flow to iterate on test intent rather than copy-pasting outputs. We also cross-checked edge cases in a separate verification pass with a deep research step to surface uncommon inputs. Treat generated tests as drafts: they reduce typing, but they need contract-focused hardening before they earn trust.
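For the property-based part we leaned on Hypothesis (an assumption here that your project can take it on as a dev dependency). A sketch of the idempotence property from earlier, now exercised across generated offsets rather than a single prompt example:

```python
from datetime import datetime, timedelta, timezone

from hypothesis import given, strategies as st


def normalize_timestamp(value: str) -> str:
    # Same hypothetical stand-in as in the earlier sketch.
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return dt.isoformat()


# Fixed-offset timezones from UTC-14:00 to UTC+14:00 (a superset of real offsets).
offsets = st.integers(min_value=-14 * 60, max_value=14 * 60).map(
    lambda minutes: timezone(timedelta(minutes=minutes))
)


@given(st.datetimes(min_value=datetime(1970, 1, 1),
                    max_value=datetime(2100, 1, 1),
                    timezones=offsets))
def test_normalization_is_idempotent(ts):
    once = normalize_timestamp(ts.isoformat())
    assert normalize_timestamp(once) == once
```

A property like this survives refactors that change the canonical output format, which is exactly what the exact-string tests could not do.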
