We started using an LLM-based assistant to speed up writing unit tests for a legacy service. In the early cycles it felt productive: the model returned test files that ran, covered lines, and satisfied our CI gates. I drove the interaction through a multi-turn chat interface, iterating prompts to get tests in our preferred framework and style.
Two weeks later a minor data-formatting bug reached production. The tests were all green, but none of them asserted the business constraint that actually failed. That discrepancy (tests that are syntactically correct and quick to run, yet semantically empty) is what I want to unpack. It isn't a single hallucination or a syntax error; it's a pattern that arises from how the model composes plausible-looking assertions.
How the failure surfaced in CI and later in prod
CI was deceptive. The generated tests leaned on broad assertions: that no exception was thrown, or that the serialized JSON was non-empty, without validating the required fields inside it. They ran quickly and shrank our noisy backlog, so reviewers accepted them. We only noticed the problem when a downstream service rejected a payload in production because a timestamp was in the wrong timezone, a domain rule the tests never encoded.
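To make the pattern concrete, here is a minimal sketch (in Python with pytest) of the kind of test the assistant kept producing. serialize_order and its fields are hypothetical stand-ins for our legacy serializer, not the real code.

```python
import json


def serialize_order(order: dict) -> str:
    # Hypothetical stand-in for the legacy serializer: it echoes whatever
    # timestamp it was given, which is exactly the bug the test below misses.
    return json.dumps({"id": order["id"], "created_at": order["created_at"]})


def test_serialize_order_returns_json():
    # Typical generated test: it only checks that no exception is raised and
    # that *some* JSON comes back. It never asserts that created_at is
    # normalized to UTC or that required fields are present.
    payload = serialize_order({"id": 1, "created_at": "2024-03-01T09:00:00+05:30"})
    assert payload
    assert json.loads(payload)  # still says nothing about the domain rule
```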
We started treating the assistant as a test-generator black box and integrated it into our dev workflow via the team's tooling page on crompt.ai. That made the issue worse: developers read green CI results as a signal of correctness rather than as a basic smoke check. The model's tendency to produce 'comfortable' tests that mirror common examples (e.g., an assertEquals on a returned string) masked the absence of property checks and edge-case scenarios.
Why the problem was subtle and easy to miss
There are three subtle behaviors at play. First, the model optimizes for plausible-seeming code that resembles its training patterns; this favors superficial assertions. Second, it prefers compact examples that demonstrate usage rather than exhaustive coverage. Third, multi-file context and domain-specific invariants were missing from the prompt, so the model filled gaps with generic scaffolding. Together these behaviors produce tests that look like real unit tests but only validate implementation details the model guessed.
We did pair the assistant with a verification step, using a deep research workflow to cross-check edge cases, but that process was ad hoc. The real gap was in translating domain rules (e.g., timezone normalization, mandatory fields, idempotency) into testable assertions. The model's shortcuts compounded: a few superficially plausible tests reduced human scrutiny, which let more shortcuts slip into the codebase.
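For contrast, this is roughly what encoding the timezone rule as an assertion looks like. normalize_timestamp is a hypothetical helper (the real service has its own), but the asserted behavior is the actual domain rule: any offset must be converted to the same instant in UTC, and timestamps without an offset must be rejected.

```python
from datetime import datetime, timezone

import pytest


def normalize_timestamp(ts: str) -> str:
    # Hypothetical implementation of the rule: parse an ISO-8601 timestamp,
    # require an explicit offset, and re-emit the same instant in UTC.
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        raise ValueError("timestamp must carry an explicit UTC offset")
    return dt.astimezone(timezone.utc).isoformat()


def test_offsets_are_normalized_to_utc():
    # The rule that bit us: a +05:30 input must come out as the same instant
    # expressed in UTC, not merely as "some valid-looking string".
    assert normalize_timestamp("2024-03-01T09:00:00+05:30") == "2024-03-01T03:30:00+00:00"


def test_naive_timestamps_are_rejected():
    # Negative path: a timestamp with no offset is ambiguous and must fail
    # loudly instead of being passed through.
    with pytest.raises(ValueError):
        normalize_timestamp("2024-03-01T09:00:00")
```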
Mitigations that helped in practice
We introduced a checklist for any AI-generated test: map each test to a specific acceptance criterion, require one negative test per positive path, and avoid tests that only assert 'no exception'. We also made prompt engineering changes: include concrete examples of failure cases, enforce property-based checks, and ask the model to produce small invariants rather than generic asserts.
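As one example of the 'small invariants' point, here is a property-based sketch using hypothesis against the same hypothetical normalize_timestamp helper from above. The invariant is that, whatever offset the input carries, the output always denotes the same instant and always carries a zero UTC offset.

```python
from datetime import datetime, timedelta, timezone

from hypothesis import given, strategies as st


def normalize_timestamp(ts: str) -> str:
    # Same hypothetical helper as in the earlier sketch.
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        raise ValueError("timestamp must carry an explicit UTC offset")
    return dt.astimezone(timezone.utc).isoformat()


@given(
    st.datetimes(min_value=datetime(2000, 1, 1), max_value=datetime(2030, 1, 1)),
    st.integers(min_value=-12, max_value=14),
)
def test_normalization_preserves_instant_and_forces_utc(naive_dt, offset_hours):
    # Attach an arbitrary fixed offset, normalize, and check the invariant
    # rather than one hand-picked example value.
    aware = naive_dt.replace(tzinfo=timezone(timedelta(hours=offset_hours)))
    result = datetime.fromisoformat(normalize_timestamp(aware.isoformat()))
    assert result.utcoffset() == timedelta(0)  # output is always UTC
    assert result == aware                     # and denotes the same instant
```

A check like this catches whole classes of inputs a hand-written example would miss, which is exactly the gap the generated tests left open.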
Finally, treat generated tests as drafts. The assistant accelerates scaffolding but can't infer undocumented business rules. Combine AI-generated tests with human-authored property checks and occasional mutation testing to ensure the suite actually protects behavior rather than just exercising code paths.