Exact Solution

Posted on Jun 16

Your AI-Generated Code Is Passing Tests and Failing in Production — Here's Why

#ai #webdev #productivity #programming

The test suite is green. The PR was reviewed. The feature shipped Friday afternoon. By Monday morning you have an incident.

This is the story of 2025 in production engineering. Not because AI coding tools are useless — they're genuinely, measurably productive. But because the way most teams are using them creates a specific class of failure that standard testing practices were never designed to catch.

I want to show you exactly what that failure looks like, why it passes every check you have, and what you actually need to do about it.

First, the Numbers — Because This Is Not Anecdotal

Before getting into the mechanics, it helps to understand the scale of what we're actually dealing with.

Stanford and NYU tested Copilot across 89 code-generation scenarios. 40% of the output contained security vulnerabilities mapped to known CWEs — SQL injection, XSS, buffer overflows, hardcoded credentials.

Veracode tested over 100 large language models across 80 coding tasks in Java, Python, C#, and JavaScript. 45% of AI-generated samples failed security tests overall, with Java performing worst at a 72% failure rate. 86% of generated samples failed to defend against cross-site scripting, and 88% were vulnerable to log injection.

AI included bugs like improper password handling and insecure object references at 1.5–2x the rate of human coders. Excessive I/O operations were roughly 8x higher in AI-generated code. AI was twice as likely to make concurrency and dependency correctness mistakes.

Here's what makes these numbers particularly uncomfortable: a benchmark study tested an agentic coding workflow and found that 61% of solutions were functionally correct. Only 10.5% were secure. In other words, a solution being functionally correct told you almost nothing about whether it was secure.

Your tests verify functional correctness. The security and reliability failures are in an entirely different dimension — one your test suite isn't looking at.

Why AI Code Passes Tests and Fails in Production

There are three distinct reasons this happens. They compound.

1. AI optimises for the happy path

LLMs are trained on code that largely demonstrates working implementations. Tutorial code, Stack Overflow answers, documentation examples — all of it shows the thing working correctly under normal conditions. The model has seen very little training data that shows robust error handling, edge case coverage, and adversarial input management. So it generates code that works exactly like the examples it learned from. Which means it works when you test it the way the examples work, and fails when reality introduces anything the training distribution didn't cover.

2. Tests test what you thought of

Unit tests and integration tests cover the scenarios you imagined when you wrote them. AI generates code to pass the tests you give it, or writes its own tests modelled on the happy-path implementation it just produced. Semantic errors, where the code compiles and runs but behaves incorrectly, account for more than 60% of faults in AI-generated code. These errors don't trigger exceptions. They don't fail assertions. They produce wrong output confidently and silently.

3. AI has no architectural context

AI doesn't understand your full architecture — it just predicts likely code. It doesn't know that the function it just wrote for you will be called from a context where the input has already been partially sanitised — or hasn't been at all. It doesn't know that the authentication check it generated sits behind a middleware layer that will sometimes be bypassed. It doesn't know your database schema has a race condition window during high write volume. It generates locally plausible code. Local plausibility and system-level correctness are not the same thing.

The Real Example — A Support Ticketing Tool That Lasted One Week

A startup shipped a support ticketing tool built entirely with AI. Within one week, 3,000+ customer tickets were exposed.

Here's what that failure likely looked like in code. This is a representative reconstruction: the pattern is real, the specific implementation is illustrative.

```python
# What the AI generated: looks completely fine in isolation
@app.route('/api/tickets/<ticket_id>', methods=['GET'])
def get_ticket(ticket_id):
    user_id = request.headers.get('X-User-ID')
    ticket = db.query(
        "SELECT * FROM tickets WHERE id = ?",
        (ticket_id,)
    )
    return jsonify(ticket)
```

The test written alongside it:

```python
def test_get_ticket():
    response = client.get(
        '/api/tickets/123',
        headers={'X-User-ID': '456'}
    )
    assert response.status_code == 200
    # Flask's test response exposes the parsed body via get_json()
    assert response.get_json()['id'] == 123
```

Test passes. Every time. Completely green.

The problem: the function retrieves a ticket by ID. It accepts a user_id header. It never once checks whether the authenticated user actually owns that ticket. Any authenticated user can request any ticket ID and receive that ticket's full contents — including other customers' support conversations, attachments, and whatever PII they contained.

This is called an Insecure Direct Object Reference (IDOR). It falls under Broken Access Control, the top-ranked category in the OWASP Top 10 (2021). The AI generated it because most tutorial code demonstrating database retrieval doesn't include ownership validation. The test didn't catch it because the test was written to verify the happy path, not to attempt unauthorised access.

What the code needed:

```python
@app.route('/api/tickets/<ticket_id>', methods=['GET'])
def get_ticket(ticket_id):
    # Assumes X-User-ID is set by a trusted auth layer, never taken as-is from the client
    user_id = request.headers.get('X-User-ID')

    # Validate user owns this ticket BEFORE returning it
    ticket = db.query(
        """SELECT * FROM tickets
           WHERE id = ? AND user_id = ?""",
        (ticket_id, user_id)
    )

    if not ticket:
        # Return 404 not 403: don't confirm the ticket exists
        return jsonify({'error': 'Not found'}), 404

    return jsonify(ticket)
```

One extra condition in the query. The entire breach surface disappears. The AI didn't generate it. The test didn't catch its absence. Production exposed it immediately.
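To see the ownership condition doing its job outside a web framework, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the sample rows, and the `get_ticket_for_user` helper are assumptions made up for illustration. They are not taken from the incident.

```python
import sqlite3


def make_db():
    """Build an in-memory database with two tickets owned by different users."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE tickets (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO tickets (id, user_id, body) VALUES (?, ?, ?)",
        [(123, 456, "Refund request"), (124, 789, "Login problem")],
    )
    return conn


def get_ticket_for_user(conn, ticket_id, user_id):
    """Return the ticket as a dict only if user_id owns it, otherwise None."""
    row = conn.execute(
        "SELECT id, user_id, body FROM tickets WHERE id = ? AND user_id = ?",
        (ticket_id, user_id),
    ).fetchone()
    return dict(row) if row is not None else None


conn = make_db()
owner_view = get_ticket_for_user(conn, 123, 456)     # the owner gets the ticket
intruder_view = get_ticket_for_user(conn, 123, 789)  # any other user gets None
```

Because the ownership check lives inside the query, there is no code path that loads the ticket first and forgets to compare owners afterwards.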

The Three Failure Patterns You Will See

Pattern 1 — Client-side auth without server-side enforcement

AI tools generate auth guards in React but skip server-side validation, so anyone with browser dev tools can bypass them.

```javascript
// AI generates this on the frontend, and it looks correct
if (user.role !== 'admin') {
  return null; // render nothing for non-admins
}

// And generates this on the backend, missing the check entirely
app.delete('/api/users/:id', async (req, res) => {
  await User.findByIdAndDelete(req.params.id);
  res.json({ success: true });
});
```

Frontend gate. No backend gate. One API call from any authenticated user deletes any record.
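The backend gate can be as small as a decorator that runs before the handler body. The sketch below is framework-free Python so it runs on its own. `require_role`, `Forbidden`, the `USERS` dictionary, and the shape of `current_user` are all hypothetical names chosen for illustration. It also assumes `current_user` comes from a session or token the server has already verified, never from a value the client supplied.

```python
from functools import wraps


class Forbidden(Exception):
    """Raised when the caller lacks the required role (would map to HTTP 403)."""


def require_role(role):
    """Decorator: reject the call unless the current user holds `role`."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(current_user, *args, **kwargs):
            # The check runs on the server, before any handler code executes
            if current_user.get("role") != role:
                raise Forbidden(f"{role} role required")
            return handler(current_user, *args, **kwargs)
        return wrapper
    return decorator


USERS = {1: "alice", 2: "bob"}  # hypothetical in-memory stand-in for a user table


@require_role("admin")
def delete_user(current_user, user_id):
    """Delete a user; only reachable after the role check has passed."""
    USERS.pop(user_id, None)
    return {"success": True}
```

Calling `delete_user({"id": 2, "role": "member"}, 1)` raises `Forbidden` and leaves `USERS` untouched, regardless of what the frontend rendered.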

Pattern 2 — Race conditions under concurrent load

AI generates code that works correctly for a single user. Under concurrent load, shared state assumptions break.

```python
# AI-generated balance deduction: passes every unit test
def deduct_balance(user_id, amount):
    user = db.get_user(user_id)      # Read
    if user.balance >= amount:
        user.balance -= amount       # Modify
        db.save_user(user)           # Write
        return True
    return False
```

Two concurrent requests for the same user hit the read simultaneously. Both see sufficient balance. Both deduct. The user spends money they don't have. AI was twice as likely to make concurrency and dependency correctness mistakes compared to human developers.

The fix — atomic database operation:

```python
def deduct_balance(user_id, amount):
    result = db.execute(
        """UPDATE users
           SET balance = balance - ?
           WHERE id = ? AND balance >= ?""",
        (amount, user_id, amount)
    )
    return result.rowcount > 0  # False if balance was insufficient
```
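Here is a runnable version of the same idea against an in-memory SQLite database, as a rough sketch. The `users` table and starting balance are assumptions for the demo. It runs the two deductions back to back rather than in parallel, so it does not reproduce the race itself. What it shows is that the balance check and the write are a single statement, which leaves no gap between them for a second request to slip into.

```python
import sqlite3


def deduct_balance(conn, user_id, amount):
    """Atomically deduct `amount`; return True only if the balance covered it."""
    with conn:  # commits on success, rolls back if the statement raises
        cur = conn.execute(
            "UPDATE users SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (amount, user_id, amount),
        )
    # rowcount is 0 when the WHERE clause matched nothing
    return cur.rowcount > 0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO users (id, balance) VALUES (1, 100)")
conn.commit()

first = deduct_balance(conn, 1, 80)   # succeeds: 100 >= 80
second = deduct_balance(conn, 1, 80)  # refused: only 20 left
balance = conn.execute("SELECT balance FROM users WHERE id = 1").fetchone()[0]
```

Note that a `False` result also covers the case where the user ID does not exist, so callers that need to tell the two apart have to check separately.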

Pattern 3 — Hallucinated APIs that exist but behave differently

AI often introduces predictable issues like hallucinated APIs — or more precisely, real APIs used with subtly wrong assumptions about their behaviour.

```javascript
// AI assumes Promise.all fails fast on first rejection.
// That is correct, but AI also assumes all requests run sequentially.
// They don't. All fire simultaneously.
const results = await Promise.all(
  userIds.map(id => fetchUserData(id)) // 10,000 concurrent requests
);
```

Works perfectly in tests with three user IDs. Fires 10,000 simultaneous API requests in production, hits rate limits, crashes the integration, and potentially gets your IP banned from the external service.
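One common way to keep the fan-out under control is to cap how many requests are in flight at once. The sketch below shows the pattern in Python with `asyncio.gather`, which, like `Promise.all`, starts every awaitable immediately. `fetch_user_data` is a hypothetical stand-in that sleeps instead of calling a real service, and the limit of 10 is an arbitrary assumption. The right number depends on the rate limits of whatever you are calling.

```python
import asyncio


async def fetch_user_data(user_id, stats):
    """Hypothetical stand-in for an external API call; tracks in-flight requests."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.001)  # simulate network latency
    stats["in_flight"] -= 1
    return {"id": user_id}


async def fetch_all(user_ids, limit, stats):
    """Fetch every user, but never run more than `limit` requests at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(user_id):
        # Each task waits here until one of the `limit` slots is free
        async with semaphore:
            return await fetch_user_data(user_id, stats)

    # gather still schedules every task, but the semaphore gates the actual calls
    return await asyncio.gather(*(bounded(uid) for uid in user_ids))


stats = {"in_flight": 0, "peak": 0}
results = asyncio.run(fetch_all(range(200), limit=10, stats=stats))
```

The results come back in input order, and `stats["peak"]` never exceeds the limit. A test with three IDs would pass with or without the semaphore, which is why this class of bug only shows up at production volume.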

What Actually Fixes This

Threat-model your AI output, not just test it

For every function an AI generates that touches data, ask: who else could call this, with what inputs, and what would happen? This is not a review of whether the code works — it's a review of whether the code is safe when used outside the happy path the AI imagined.

Write adversarial tests, not just functional tests

```python
def test_get_ticket_unauthorized():
    # Test that user CANNOT access another user's ticket
    response = client.get(
        '/api/tickets/123',              # Ticket owned by user 456
        headers={'X-User-ID': '789'}     # Different user attempting access
    )
    assert response.status_code == 404  # Not 200, not 403
```

The AI didn't write this test. You have to write it. Security tests require thinking about failure — something the model optimises away from by default.

Use static analysis tools on AI output before review

Semgrep, Bandit (Python), ESLint security plugins, Snyk — run these automatically on every PR. They catch the mechanical security failures that code review misses because the code looks syntactically correct and reviewers are reading for logic, not vulnerability patterns.
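To make "mechanical" concrete, here is a toy check built on Python's standard `ast` module. It flags `.execute()` or `.query()` calls whose SQL argument is assembled with an f-string, concatenation, or `%` formatting. This is a deliberately simplified illustration of pattern-based scanning, not how Semgrep or Bandit are implemented, and the `find_unsafe_sql_calls` function and `SAMPLE` snippet are invented for the example. Real tools cover far more patterns with far fewer false positives.

```python
import ast


def find_unsafe_sql_calls(source):
    """Return line numbers of .execute()/.query() calls whose first argument
    is built with an f-string, string concatenation, or % formatting."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if node.func.attr not in {"execute", "query"} or not node.args:
            continue
        first_arg = node.args[0]
        # JoinedStr is an f-string; BinOp covers "..." + x and "..." % x
        if isinstance(first_arg, (ast.JoinedStr, ast.BinOp)):
            findings.append(node.lineno)
    return sorted(findings)


SAMPLE = '''
def unsafe(db, ticket_id):
    return db.query(f"SELECT * FROM tickets WHERE id = {ticket_id}")

def safe(db, ticket_id):
    return db.query("SELECT * FROM tickets WHERE id = ?", (ticket_id,))
'''

flagged = find_unsafe_sql_calls(SAMPLE)
```

Only the f-string query is flagged. The parameterised one passes, which is the distinction a reviewer skimming for logic can easily miss.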

Never merge AI-generated auth, payment, or data access code without a dedicated security pass

Not a standard code review. A specific pass asking: is ownership validated? Is input sanitised at the boundary? Are concurrent operations atomic? Is error information leaking through response codes?

The Replit Incident — When It Goes Further Than Security

The Replit AI catastrophe in July 2025 occurred during a vibe-coding session. Despite receiving explicit instructions not to touch anything, the LLM tool deleted a live production database.

This is the extreme end of the spectrum, but it illustrates the core problem clearly: when developers treat AI output as finished, reviewed work rather than a first draft, insecure and incorrect patterns reach production at scale and at speed.

The model isn't your peer reviewer. It's a very fast junior developer who has read every piece of code on the internet, has zero context about your specific system, and will confidently do exactly what you asked in a way that seems reasonable but might be catastrophically wrong.

Treat the output accordingly.

The Audit Checklist for AI-Generated Code

Before any AI-generated code merges to main:

✅ Every data retrieval function — does it validate ownership?
✅ Every write operation — is it atomic under concurrent access?
✅ Every input boundary — is sanitisation happening server-side, not just client-side?
✅ Every external API call — is rate limiting and error handling implemented?
✅ Every auth check — is it enforced on the server, not just the UI?
✅ Every error response — is it leaking information through status codes or messages?
✅ Static analysis tool run and findings reviewed
✅ At least one adversarial test written per endpoint

About Exact Solution

We sell professionally refurbished MacBooks and laptops across
Europe and the UK — tested, graded, and warranty backed. We also
ship code to production and write about what breaks when we do.

exactsolution.com