sehwan Moon

Posted on Jun 23

Why AI-Generated Code Passes Tests But Breaks Production (With Examples)

#security #webdev #ai #programming

You ship AI-generated code. Tests go green. CI passes. Production breaks anyway.

This happens because AI models optimize for plausibility, not correctness. The code looks right. It has the right function names, the right return shape, even a comment explaining what it does. But the logic is hollow.

Here are the 5 patterns I see most often — and how to catch them automatically.

Pattern 1: The DB Query That Goes Nowhere

# AI wrote this. Tests pass (they mock the DB).
def get_dashboard_stats(user_id):
    rows = db.execute(
        "SELECT event_type, COUNT(*) FROM events WHERE user_id=? GROUP BY event_type",
        (user_id,)
    ).fetchall()
    return {"status": "ok", "user_id": user_id}
    # rows is fetched and silently discarded

The AI fetched the data. It just... forgot to include it in the return value. The function always returns the same {"status": "ok", "user_id": ...} payload regardless of what's in the database.

This is called DEAD_DB_RESULT. The query runs, costs you latency and DB load, and produces nothing.

Why tests miss it: Unit tests mock the DB call and only assert on the status field. They never notice that the return value doesn't contain any of the queried data.
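One way to make a test notice the discarded rows is to run the function against a real in-memory database and assert on the data itself. The sketch below is an assumption about what the function was meant to return: the `counts` key, the `db` parameter, and the `make_test_db` helper are illustrative names, not part of the original code.

```python
import sqlite3


def get_dashboard_stats(db, user_id):
    """Return per-event-type counts for one user."""
    rows = db.execute(
        "SELECT event_type, COUNT(*) FROM events WHERE user_id=? GROUP BY event_type",
        (user_id,),
    ).fetchall()
    # The queried rows now flow into the return value ("counts" is an assumed key).
    return {"status": "ok", "user_id": user_id, "counts": dict(rows)}


def make_test_db():
    """Build an in-memory SQLite database with a few known events (hypothetical fixture)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT)")
    db.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(1, "click"), (1, "click"), (1, "view"), (2, "click")],
    )
    return db


stats = get_dashboard_stats(make_test_db(), 1)
# Asserting on the data, not just the status, is what exposes a discarded query.
assert stats["counts"] == {"click": 2, "view": 1}
```

Against the original version, the same assertion fails with a KeyError, which is the signal the mocked test never produced.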

Pattern 2: The Save Function That Doesn't Save

def save_user_preferences(user_id, preferences):
    validated = validate_preferences(preferences)
    return {"status": "saved", "user_id": user_id}
    # No INSERT. No UPDATE. Nothing was written.

The AI named it save_. It returns {"status": "saved"}. Users see a success message. Nothing was persisted.

This is MISSING_WRITE. The function name implies a write operation, but there's no INSERT, UPDATE, execute(), or file write anywhere in the body.

In production: Users change their settings, get a success toast, come back tomorrow and their settings are reset.
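A round-trip test catches this: save, read back through a separate path, and compare. The following is a minimal sketch assuming SQLite storage; the `preferences` table, the `load_user_preferences` helper, and the body of `validate_preferences` are all hypothetical stand-ins.

```python
import json
import sqlite3


def validate_preferences(preferences):
    """Hypothetical validator: only checks that preferences is a dict."""
    if not isinstance(preferences, dict):
        raise ValueError("preferences must be a dict")
    return preferences


def save_user_preferences(db, user_id, preferences):
    validated = validate_preferences(preferences)
    # The write that the original version was missing.
    db.execute(
        "INSERT OR REPLACE INTO preferences (user_id, data) VALUES (?, ?)",
        (user_id, json.dumps(validated)),
    )
    db.commit()
    return {"status": "saved", "user_id": user_id}


def load_user_preferences(db, user_id):
    """Read preferences back; returns None when nothing was stored."""
    row = db.execute(
        "SELECT data FROM preferences WHERE user_id=?", (user_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE preferences (user_id INTEGER PRIMARY KEY, data TEXT)")
save_user_preferences(db, 7, {"theme": "dark"})
# Round trip: what was saved must be readable afterwards.
assert load_user_preferences(db, 7) == {"theme": "dark"}
```

The design choice here is to assert on persisted state rather than on the returned status, since the status is exactly the part the broken version gets right.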

Pattern 3: Parameters That Do Nothing

def calculate_discount(user_tier, purchase_amount, promo_code):
    # AI generates plausible-looking logic...
    base_discount = 0.10
    return {"discount": base_discount, "final": purchase_amount * 0.90}
    # user_tier and promo_code are never used

The function signature accepts 3 parameters. The return value uses exactly one of them (purchase_amount), multiplied by a hardcoded 0.90. The other two are accepted and ignored.

This is INPUT_OUTPUT_DISCONNECTED. Every user gets 10% off, regardless of tier or promo code.

Why the AI does this: It sees the function signature in the prompt and generates a return value that looks correct for the common case. The edge cases — what should happen for user_tier="premium" or promo_code="SAVE20" — are never exercised in the prompt.
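For contrast, here is a sketch of a version where every parameter influences the result. The tier rates, the promo table, and the "best single discount wins" rule are invented for illustration; the real business rules are not given in this post.

```python
# Hypothetical rate tables; real values would come from the business rules.
TIER_DISCOUNTS = {"standard": 0.10, "premium": 0.15}
PROMO_DISCOUNTS = {"SAVE20": 0.20}


def calculate_discount(user_tier, purchase_amount, promo_code=None):
    """Apply the better of the tier discount and the promo discount."""
    tier_rate = TIER_DISCOUNTS.get(user_tier, 0.0)
    promo_rate = PROMO_DISCOUNTS.get(promo_code, 0.0)
    # Assumed rule: discounts do not stack, the larger one applies.
    discount = max(tier_rate, promo_rate)
    return {
        "discount": discount,
        "final": round(purchase_amount * (1 - discount), 2),
    }


# A test that varies each input and checks that the output changes.
standard = calculate_discount("standard", 100, None)
premium = calculate_discount("premium", 100, None)
promo = calculate_discount("standard", 100, "SAVE20")
assert standard["discount"] != premium["discount"]
assert standard["discount"] != promo["discount"]
```

The last three lines are the useful part: a test that changes one input at a time and requires the output to move will fail against the hardcoded version.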

Pattern 4: The Exception Handler That Lies

def process_payment(amount, card_token):
    try:
        result = payment_gateway.charge(amount, card_token)
        transaction_id = result["transaction_id"]
        db.save_transaction(transaction_id, amount)
        return {"success": True, "transaction_id": transaction_id}
    except Exception:
        return {"success": True, "transaction_id": None}
        # Payment failed. We're telling the user it succeeded.

The except branch returns {"success": True}. A failed payment looks identical to a successful one.

This is SILENT_FAILURE. The try block does something meaningful. The except block returns success regardless of what went wrong.

In production: Payments fail silently. Users think they paid. You have no error logs. Finance is confused.
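A sketch of a handler that reports the failure instead of hiding it, plus a test that forces the failure path. `FailingGateway`, the injected `gateway` and `save_transaction` arguments, and the `error` field are hypothetical; the original snippet uses module-level `payment_gateway` and `db` objects.

```python
import logging

logger = logging.getLogger("payments")


class FailingGateway:
    """Hypothetical stand-in for a payment gateway that always declines."""

    def charge(self, amount, card_token):
        raise RuntimeError("card declined")


def process_payment(gateway, save_transaction, amount, card_token):
    try:
        result = gateway.charge(amount, card_token)
        transaction_id = result["transaction_id"]
        save_transaction(transaction_id, amount)
        return {"success": True, "transaction_id": transaction_id}
    except Exception as exc:
        # Log with traceback so the failure is visible, then report it honestly.
        logger.exception("payment failed for amount=%s", amount)
        return {"success": False, "transaction_id": None, "error": str(exc)}


saved = []
outcome = process_payment(
    FailingGateway(), lambda tid, amt: saved.append((tid, amt)), 50, "tok_test"
)
# The failure path must not claim success and must not record a transaction.
assert outcome["success"] is False
assert saved == []
```

Tests that only exercise the happy path cannot distinguish this version from the lying one, so the forced failure is the part that matters.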

Pattern 5: debug=True in Production

# The AI helpfully set this during development
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000, debug=True)

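With Flask, debug=True enables the Werkzeug interactive debugger. Anyone who can reach the server and trigger an unhandled exception is shown a Python console in their browser that can execute arbitrary code on your server. The console is PIN-protected, but Flask's documentation still warns against running the debugger in production. FastAPI has no app.run(); its debug flag exposes tracebacks in error responses rather than a console, which is still an information leak.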

This is DEBUG_MODE_RISK. It's one of the most common findings in AI-generated backend code. The AI set it to True during scaffolding and never removed it.
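One common mitigation is to make debug mode opt-in through the environment, so the default is off. This is a minimal sketch: the `APP_DEBUG` variable name and the `debug_enabled` helper are assumptions, and the Flask call is shown only as a comment so the snippet runs without Flask installed.

```python
import os


def debug_enabled(environ=os.environ):
    """Return True only when APP_DEBUG is explicitly set to a truthy value."""
    return environ.get("APP_DEBUG", "").strip().lower() in {"1", "true", "yes"}


# Assumed usage with a Flask app object named `app`:
# if __name__ == "__main__":
#     app.run(host="127.0.0.1", port=8000, debug=debug_enabled())

# With nothing set, debug stays off.
assert debug_enabled({}) is False
```

Binding to 127.0.0.1 in the commented usage is a second assumption: it keeps a development server off the network even if someone does turn debugging on.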

Catching All of This Automatically

I built AINAScan specifically for these patterns. It uses AST analysis (not LLM guessing) to detect all 5 of the above — plus 43 more.

Try it: AINAScan — free, no signup for single files

For your whole project: zip it and upload. Up to 200 files scanned in parallel.

What you get per file:

✅/🔴/⚠️ for all 48 patterns
Senior code analysis (SILENT_FAILURE, EMPTY_EXCEPT, deep nesting...)
Code structure: function list, danger sinks, call graph
🧠 AI confidence score on each BLOCK issue (backed by 1.9M+ knowledge edges)

It catches what Semgrep and Bandit miss, because those tools weren't designed for AI-generated code patterns. They look for known vulnerability signatures. They don't check whether your save function actually saves, or whether your parameters affect your output.
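To give a feel for what an AST-based check of "do your parameters affect your output" can look like, here is a deliberately simple sketch using Python's standard ast module. It is not AINAScan's implementation, which this post does not show; the `unused_parameters` function is a hypothetical illustration that only flags parameters never read anywhere in the function body.

```python
import ast


def unused_parameters(source):
    """Map each function name to the parameters its body never reads."""
    findings = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        args = node.args
        params = [a.arg for a in args.posonlyargs + args.args + args.kwonlyargs]
        # Collect every name that is read (Load context) inside the body.
        used = {
            n.id
            for stmt in node.body
            for n in ast.walk(stmt)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
        }
        unused = [p for p in params if p not in used and p not in {"self", "cls"}]
        if unused:
            findings[node.name] = unused
    return findings


SAMPLE = '''
def calculate_discount(user_tier, purchase_amount, promo_code):
    base_discount = 0.10
    return {"discount": base_discount, "final": purchase_amount * 0.90}
'''

assert unused_parameters(SAMPLE) == {"calculate_discount": ["user_tier", "promo_code"]}
```

A check this shallow has obvious limits: a parameter that is read but never reaches the return value would pass, which is where real data-flow analysis would have to take over.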

Why Traditional Linters Miss These

Tool	Approach	Misses
Semgrep	Rule-based pattern matching on known vulnerability signatures	Semantic logic bugs
Bandit	Security-focused AST rules	Functional correctness
Pylint	Style + basic errors	Intent vs. implementation
AINAScan	Semantic data-flow + intent	Nothing in the 48 patterns

The common thread: traditional tools check what the code does syntactically. AI-generated bugs are semantic — the code is syntactically valid, often stylistically fine, but functionally wrong.

If you're shipping AI-generated code to production, run it through a scan first. Takes 5 seconds. Could save you a very bad Friday night.

🔗 ainascan.dev · GitHub (⭐ to unlock ZIP scan + scan history)
