The test suite passed. The CI pipeline went green. The code review got approved and the PR merged. Two weeks later, a bug turned up in production that nobody saw coming, in a function that had been sitting in the repository since a junior developer on the team - who happened to be a language model - suggested it in an autocomplete.
This is not a hypothetical. It is the pattern that development teams are reporting with increasing consistency as AI coding tools become a standard part of the workflow. The code works in test conditions. It breaks in production conditions. And because the failure mode looks like any other regression, teams spend days debugging something that was never right to begin with.
Understanding why this happens is the first step to preventing it.
Tests Verify What You Already Knew to Test For
The core issue is simple: tests can only verify the behavior they were written to verify. When a developer writes code manually, they think about the edge cases as they write. That thought process is imperfect, but it produces tests that at least cover the scenarios the developer considered.
AI coding tools do not think about edge cases the same way. They produce implementations that handle the inputs specified in the visible context. If the visible context includes a function signature, a docstring, and a few example cases, the generated implementation handles those cases correctly. It does not reason about what happens with unexpected inputs, concurrent modifications, network failures, or data that does not match the expected format.
The test suite, written before AI assistance was introduced, was designed to catch regressions against known behavior. It was not designed to catch the specific failure modes of AI-generated implementations. So the new code passes the existing tests - because those tests verify exactly the scenarios the AI handled correctly - and fails in production under the scenarios nobody thought to test.
The Context Gap That Creates Production Failures
AI tools work from the context visible in the prompt. That context is typically a function signature, the surrounding code in the file, and whatever examples appear in the immediate vicinity. It does not include:
API documentation for the external services being called. The AI knows the common usage patterns for popular APIs from training data. It does not know the specific error states, rate limits, and edge cases documented for the version of the API your application depends on.
Your infrastructure's specific behavior. The AI does not know that your database replication lag runs at 200-300ms in peak traffic, that your message queue occasionally delivers duplicates, or that your load balancer drops connections with idle timeouts shorter than your background job duration.
Recent changes in your codebase. The AI does not know about the refactor that changed how user IDs are formatted, the library upgrade that changed the return type of a method, or the feature flag that changes which code path runs for certain user segments.
Each of these gaps creates a category of production failure that tests will not catch unless someone specifically wrote tests for that scenario.
"The AI does not know that the function it is writing will run behind a load balancer with a 30-second timeout, or that the database it is querying has a read replica with a half-second replication lag. Those details live in infrastructure documentation and institutional memory, not in the code it can see." - Dennis Traina, 137Foundry
Three Categories of Production Failures from AI Code
Performance failures under real load. The most common: an implementation that is functionally correct and fast on development data, but that degrades significantly on production data at scale. AI tools generate database queries that work on a table with 100 rows and cause timeouts on a table with 10 million rows. They generate code that holds locks longer than necessary, that makes serial network calls where parallel calls would work, or that allocates memory in a loop in a way that causes garbage collection pauses at volume.
These failures do not appear in tests because test environments do not replicate production data volume or concurrent load. The code passes tests because the tests are not designed to measure performance at scale.
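A minimal sketch of the query-shape problem, using an in-memory SQLite table with a hypothetical `orders` schema for illustration. Both functions return the same answer, which is exactly why tests pass either way; only the second shape survives at production volume.

```python
import sqlite3

# Hypothetical schema for illustration: an orders table keyed by user_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(uid, uid * 1.5) for uid in range(1, 101) for _ in range(3)],
)

def totals_n_plus_one(user_ids):
    # One round trip per user: invisible on 100 rows, a timeout risk
    # when the table holds millions and the loop runs per request.
    return {
        uid: conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?", (uid,)
        ).fetchone()[0]
        for uid in user_ids
    }

def totals_batched(user_ids):
    # One query for all users: the shape a reviewer should ask for.
    placeholders = ",".join("?" * len(user_ids))
    rows = conn.execute(
        f"SELECT user_id, SUM(total) FROM orders WHERE user_id IN ({placeholders}) "
        "GROUP BY user_id",
        list(user_ids),
    ).fetchall()
    return dict(rows)

ids = list(range(1, 101))
assert totals_n_plus_one(ids) == totals_batched(ids)  # identical results, very different cost
```

A functional test comparing the two will always pass; only a review question about data volume, or a load test against production-sized data, distinguishes them.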
Error handling gaps for real-world failure conditions. The second most common: an implementation that handles the success path correctly but fails silently or incorrectly when a dependency is unavailable, rate-limited, or returning an error. AI tools generate code that assumes external services will always respond successfully. When a downstream service returns a 503, or when an API rate limit is hit, the generated code either throws an unhandled exception or returns incorrect data as if the call had succeeded.
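One hedged sketch of the reviewer's fix, with a simulated flaky dependency standing in for a real service (the `ServiceUnavailable` exception and `flaky_service` function are invented for illustration). The point is the wrapper: retry transient failures with backoff, and re-raise loudly when attempts are exhausted rather than returning data as if the call succeeded.

```python
import time

class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 from a downstream dependency."""

def flaky_service(_state={"calls": 0}):
    # Simulated dependency for the sketch: fails twice, then succeeds.
    _state["calls"] += 1
    if _state["calls"] <= 2:
        raise ServiceUnavailable("503 Service Unavailable")
    return {"status": "ok"}

def call_with_retry(fn, attempts=3, base_delay=0.01):
    # Retry transient failures with exponential backoff. On exhaustion,
    # re-raise instead of swallowing the error: a visible failure beats
    # silently returning incorrect data.
    for attempt in range(attempts):
        try:
            return fn()
        except ServiceUnavailable:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

result = call_with_retry(flaky_service)
```

The generated success-path version would be the bare `fn()` call with no `except` clause at all, and it passes any test that mocks the dependency to succeed.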
State assumptions that break under concurrency. Subtler and harder to diagnose: AI-generated code that assumes it is the only writer to a shared resource. Code that reads a value, operates on it, and writes it back without locking. Code that assumes a database row it checked five lines ago has not changed. Code that assumes a file it opened is still present by the time it reads from it. These bugs are invisible in single-threaded tests and intermittent in production, which makes them disproportionately expensive to debug.
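The read-modify-write race can be sketched with a shared in-memory counter (the same pattern applies to a database row checked a few lines before it is written). `increment_unsafe` is the shape AI tools tend to produce; `increment_safe` holds a lock across the read and the write so no other writer can interleave.

```python
import threading

counter = 0
lock = threading.Lock()

def increment_unsafe(n):
    # Read, compute, write back. Each += is really three steps
    # (load, add, store), so concurrent writers can interleave
    # between them and silently lose updates.
    global counter
    for _ in range(n):
        counter += 1

def increment_safe(n):
    # Hold the lock across the read-modify-write so the value
    # cannot change between the read and the write.
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=increment_safe, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == 40_000  # deterministic only because of the lock
```

Swap in `increment_unsafe` and the final count becomes nondeterministic under contention, which is precisely why the bug is invisible in single-threaded tests and intermittent in production.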
What Actually Prevents These Failures
The solution is not more comprehensive test suites, though better tests help. It is a review process that specifically looks for the categories of issues that AI tools produce.
Before an AI-generated function is merged, someone should verify: what happens when each external call fails, what the function does under concurrent access to shared state, and whether the implementation would degrade under the data volumes it will see in production. These are not new review questions - they are the questions that careful developers apply to all code. AI assistance just makes it easier to merge code where these questions were not asked.
Static analysis tools catch some of these issues automatically. Semgrep can detect common concurrency anti-patterns. SonarQube tracks complexity that correlates with error-handling gaps. ESLint catches missing error handling in promise chains. But automated analysis catches surface patterns, not semantic correctness.
The semantic review - the question of whether the code is correct for your specific production environment - still requires a human who understands both the code and the context.
Making the Gap Visible
One practical change: add a field to your PR template that asks developers to indicate which parts of a PR were AI-generated, and what failure scenarios they specifically reviewed before submitting. This does not slow down the workflow significantly. It does change the implicit accountability model - from "the AI generated it and it looks fine" to "I reviewed this code and I own what it does in production."
Teams using AI tools with strong code governance find that the productivity gains are real and sustainable. Teams that adopt the tools without changing their review process find that the gains are partially or fully offset by debugging time and production incidents.
137Foundry works with development teams on building production-grade applications where code quality governance is established from the start, including when AI tools are part of the workflow. For a complete look at how to structure that governance - including what policies to formalize and what metrics to track - the guide on using AI coding tools without introducing technical debt covers the full process in detail.
The tests passing is not the same as the code being correct. That has always been true. AI coding tools just make it easier to write code where the gap between those two things is larger than it appears.