
Sofia Bennett


When AI-Generated Tests Pass But Don't Protect: A case study in false assurance

I was pair-programming with a colleague when our green CI started feeling like a mirage. We had an LLM generate a suite of unit tests for a new parsing module; the tests were syntactically correct, ran quickly, and the pipeline showed all checks passing. Yet a week later a subtle input permutation in production caused incorrect outputs that none of the generated tests caught. At first glance the tests looked fine: good coverage numbers, readable assertions, and familiar structure. The problem was they validated the wrong things. The model had focused on surface-level examples and implementation details it inferred from the code, not on the real invariants our product required. The tests gave us confidence, but not correctness.

How the failure surfaced during development

The failure didn’t show up during local runs or unit-test-only CI stages. It appeared only after an integration test against upstream services, triggered by slightly different whitespace and key ordering in a JSON payload. The generated tests asserted exact string matches and concrete example values rather than properties like idempotency and normalization. We traced this to two behaviors: the model replicated patterns it saw in the prompt and code (example-based assertions), and it preferred concise, single-case tests. That made the generated suite brittle. In one instance a test asserted that a date string matched "2024-01-01" instead of asserting that parsed dates are timezone-agnostic, which is what mattered in production. We realized the green bar was masking an important class of errors.
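
To make the difference concrete, here is a minimal sketch of the two styles. The parse_event_date function below is a hypothetical stand-in for our parsing module, not the real code:

```python
from datetime import datetime, timezone

# Hypothetical stand-in for the parsing routine under test.
def parse_event_date(raw: str) -> datetime:
    return datetime.fromisoformat(raw).astimezone(timezone.utc)

# Roughly what the generated suite gave us: an exact-value check tied to one example.
def test_parse_event_date_example():
    assert parse_event_date("2024-01-01T00:00:00+00:00").date().isoformat() == "2024-01-01"

# What we actually needed: the invariant that equivalent instants parse to the
# same UTC moment, no matter which offset the upstream service happens to send.
def test_parse_event_date_is_timezone_agnostic():
    same_instant = ["2024-01-01T00:00:00+00:00", "2023-12-31T19:00:00-05:00"]
    assert len({parse_event_date(raw) for raw in same_instant}) == 1
```

Against a correct implementation both tests pass; but if a change quietly drops offset handling, only the second one fails.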

Why this was subtle and easy to miss

Two things made the issue slip past us. First, the generated tests mirrored our code’s structure and naming closely, so human reviewers assumed they were meaningful. Second, coverage metrics and pass/fail statuses treat any assertion as a valid oracle — they don’t measure whether assertions express the correct invariants. The tests raised the signal (assertion count) but not the right signal (semantic correctness). Model tendencies compound the subtlety. LLMs tend to favor concrete examples and canonical patterns from training data. When asked to produce tests, they’ll often hard-code example inputs rather than generate randomized or property-based checks. That small behavior — preferring deterministic examples — scales into a large blind spot once those examples diverge from real-world variability.
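
One way to break that habit is to ask for property-based checks up front. Here is a sketch using Hypothesis (our library choice, not something the model picked on its own), again with the hypothetical parse_event_date stand-in from above:

```python
from datetime import datetime, timedelta, timezone

from hypothesis import given, strategies as st

# Same hypothetical stand-in as in the earlier sketch.
def parse_event_date(raw: str) -> datetime:
    return datetime.fromisoformat(raw).astimezone(timezone.utc)

# Generate aware datetimes with arbitrary fixed offsets instead of one canned example.
offsets = st.integers(min_value=-14 * 60, max_value=14 * 60).map(
    lambda minutes: timezone(timedelta(minutes=minutes))
)
instants = st.datetimes(
    min_value=datetime(2000, 1, 1),
    max_value=datetime(2030, 1, 1),
    timezones=offsets,
)

@given(instants)
def test_parsing_is_offset_insensitive(dt):
    # Property: rendering the same instant with any offset parses to the same UTC moment.
    assert parse_event_date(dt.isoformat()) == dt.astimezone(timezone.utc)
```

The point isn't this particular property; it's that randomized inputs force the conversation about which invariant the test is supposed to protect.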

Small model behaviors that compounded into a larger problem

There were multiple small, explainable behaviors at play: example overfitting, conservative assertions, and a lack of cross-file context. The generator didn’t notice that the normalization routine lived in a different module or that downstream services canonicalize strings differently. Each omission seemed minor, but combined they produced a test suite that was internally consistent and externally wrong. The mitigations we found practical were modest: add adversarial cases, use property-based tests, and force the model to propose edge-case inputs during generation. We also added a verification pass where a teammate would challenge the generated tests with failure-mode prompts rather than accept them at face value. For deeper verification and cross-referencing of assumptions, lean on a dedicated process like deep research when you need to validate behavioral claims.
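
The adversarial cases were often nothing fancier than parametrized variants of the same logical payload. A sketch of what that looked like, against a hypothetical normalize_payload helper (the names and cases are illustrative, not our actual module):

```python
import json

import pytest

# Hypothetical stand-in for the normalization routine under test.
def normalize_payload(raw: str) -> dict:
    data = json.loads(raw)
    return {key.strip().lower(): value for key, value in data.items()}

CANONICAL = '{"user_id": 7, "status": "active"}'

# Adversarial variants of the same logical payload: reordered keys, extra
# whitespace, and differently cased keys, the kind of drift upstream services produce.
ADVERSARIAL = [
    '{"status": "active", "user_id": 7}',
    '{\n  "user_id": 7,\n  "status": "active"\n}',
    '{"User_Id": 7, "STATUS": "active"}',
]

@pytest.mark.parametrize("raw", ADVERSARIAL)
def test_normalization_collapses_surface_variation(raw):
    assert normalize_payload(raw) == normalize_payload(CANONICAL)
```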

Lessons learned and a cautious checklist

Treat model-generated tests as drafts, not oracles. During review, ask: which invariants are we actually checking? Replace hard-coded examples with property checks where possible, introduce randomized inputs, and simulate integration boundaries. Iteratively refine tests using an interactive loop rather than a single generation step — for example, via a chat interface to surface counterexamples and probe assumptions. Finally, remember that green CI can be misleading. Use independent verification and keep a lightweight skepticism: the model can accelerate writing tests, but human judgment must still decide whether those tests ensure the behaviors that matter. For a quick sanity check on how AI tools like those on crompt.ai fit into your workflow, start by revisiting your pipeline’s gate criteria and linking them back to product invariants.
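
Simulating the integration boundary can also be lightweight: re-serialize the same payload the ways an upstream service plausibly might, and assert the parser lands on one normalized result. A minimal sketch, reusing the hypothetical normalize_payload stand-in from earlier:

```python
import json

# Same hypothetical stand-in as in the earlier sketch.
def normalize_payload(raw: str) -> dict:
    data = json.loads(raw)
    return {key.strip().lower(): value for key, value in data.items()}

def test_normalization_survives_upstream_reserialization():
    payload = {"user_id": 7, "status": "active", "tags": ["new", "trial"]}
    # Render the same payload the ways an upstream service might:
    # compact separators, pretty-printed, and with keys re-sorted.
    renderings = [
        json.dumps(payload, separators=(",", ":")),
        json.dumps(payload, indent=2),
        json.dumps(payload, sort_keys=True),
    ]
    normalized = [normalize_payload(rendering) for rendering in renderings]
    assert all(result == normalized[0] for result in normalized)
```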


-Sofia:)
