We adopted an assistant to scaffold unit tests for a data-processing service and quickly got a suite of passing tests. The tests checked shapes, a handful of example inputs, and HTTP responses — everything green on CI. Production incidents began when malformed payloads and edge-case timestamps slipped through; the test suite never flagged them.
That experience taught us a specific lesson: model-generated tests often reflect high-frequency patterns from training data rather than the invariants we actually care about. In our case the AI produced tidy example-driven assertions instead of property checks or negative tests. I put together a short write-up and used the team's AI workflow on crompt.ai to reproduce the issue and to examine how our pipeline had treated generated tests as finished work rather than drafts.
What went wrong in the generated tests
The assistant produced tests that asserted exact object equality for a few canonical inputs. That looks thorough at first, but it missed three failure modes: nulls in nested fields, timestamps that needed timezone normalization, and duplicated IDs. The generated assertions matched the assistant's example output, so they would always pass for the happy path the model favored.
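For concreteness, here is a minimal sketch of the shape those generated tests took (the module, function, and field names are hypothetical, simplified from our service): one canonical payload, one exact-equality assertion, and nothing that exercises the failure modes above.

```python
# Illustrative sketch of an example-driven, happy-path-only test.
# `my_service.normalize_event` and the payload fields are hypothetical stand-ins.
from my_service import normalize_event  # hypothetical module under test


def test_normalize_event_happy_path():
    payload = {
        "id": "evt-1",
        "ts": "2024-01-01T00:00:00Z",
        "meta": {"source": "web"},
    }
    # Exact-equality assertion pinned to the assistant's own example output.
    assert normalize_event(payload) == {
        "id": "evt-1",
        "ts": "2024-01-01T00:00:00+00:00",
        "meta": {"source": "web"},
    }
    # Nothing here exercises meta=None, a "+05:30" source offset,
    # or two events sharing the same id.
```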
Two small model behaviors compounded here. First, the model is biased toward common, concise examples: it suggested a handful of positive cases and default values. Second, it tends to produce deterministic-looking output, so it never suggested randomized or property-based tests that exercise edge rules. We did start using the assistant in multi-turn refinement via the chat interface to ask for more edge cases, but the team treated those follow-ups as optional rather than required.
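By contrast, the property-based tests we had to request explicitly looked roughly like the following. This is a sketch using hypothesis; `normalize_event` and its normalize-to-UTC contract are the same assumptions carried over from the example above.

```python
# Property-based sketch: instead of one canonical payload, generate missing/null
# fields and arbitrary timezone offsets, then assert an invariant that must hold
# for every input. Note: st.timezones() needs IANA timezone data available
# (the tzdata package on some platforms).
from hypothesis import given, strategies as st

from my_service import normalize_event  # hypothetical module under test

payloads = st.fixed_dictionaries({
    "id": st.text(min_size=1),
    "ts": st.datetimes(timezones=st.timezones()).map(lambda d: d.isoformat()),
    "meta": st.one_of(st.none(), st.dictionaries(st.text(), st.text())),
})


@given(payloads)
def test_timestamps_normalize_to_utc(payload):
    result = normalize_event(payload)
    # Assumed contract: whatever offset the source carried, the normalized
    # timestamp is rendered in UTC.
    assert result["ts"].endswith("+00:00")
```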
How the failure surfaced during development
The issue appeared intermittently: some production messages were missing optional fields, others contained timestamp strings from external systems. Because our CI relied on the generated tests, the regressions slipped past code review. Debugging showed the tests validated output structure but not business rules like idempotency or numeric tolerances.
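The missing business-rule checks were small. Sketches of the idempotency and numeric-tolerance assertions we later added look roughly like this (function names again hypothetical):

```python
# Business-rule assertions the generated suite never contained; the structure
# checks alone would keep passing even if these properties broke.
import pytest

from my_service import aggregate_totals, normalize_event  # hypothetical


def test_normalization_is_idempotent():
    payload = {"id": "evt-1", "ts": "2024-01-01T05:30:00+05:30", "meta": None}
    once = normalize_event(payload)
    # Idempotency: re-processing an already-normalized event must be a no-op.
    assert normalize_event(once) == once


def test_totals_within_numeric_tolerance():
    events = [{"id": "a", "amount": 0.1}, {"id": "b", "amount": 0.2}]
    # Compare floats with a tolerance instead of exact equality
    # (0.1 + 0.2 != 0.3 exactly in binary floating point).
    assert aggregate_totals(events) == pytest.approx(0.3)
```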
What made this subtle was the false confidence of a green CI run. The team assumed test coverage equaled behavioral coverage. We did have a small verification pass in which developers cross-checked model suggestions against a manual inventory and a checklist informed by a deep-research approach, but that step was applied inconsistently. The root cause was not a single missing assertion; it was a cultural shortcut: treating AI output as production-ready instead of as a draft to be stress-tested.
Practical mitigations and takeaways
In subsequent iterations we changed the workflow: require at least one property-based test or fuzzed input per function and add contract tests that validate invariants rather than example outputs. We also added mutation testing to reveal brittle assertions and started annotating AI-generated tests clearly so reviewers would pay more attention.
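The annotation piece is the simplest to show. Here is a sketch of how we flag scaffolded tests with a custom pytest marker; `ai_generated` is our own convention, not anything built into pytest.

```python
# conftest.py -- register a project-specific marker so reviewers and CI can
# tell assistant-scaffolded tests from hand-written adversarial ones.
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "ai_generated: scaffolded by the assistant; must be paired with a "
        "property-based or contract test before merge",
    )


# test_ingest.py -- usage: the marker makes provenance explicit in review and
# lets CI enforce the pairing rule, e.g. by counting marked vs unmarked tests.
import pytest


@pytest.mark.ai_generated
def test_happy_path_shape():
    ...
```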
AI-generated tests are valuable for scaffolding, but they shouldn't replace thinking about invariants. Small model preferences (example-driven answers, omission of negative cases) can be harmless in isolation but dangerous in aggregate; the fix is a combination of process (mandatory adversarial tests) and tooling (mutation and property-based testing) that forces the model's blind spots to surface during review.