The Productivity Illusion Has Numbers Now
Pull requests per author are up 20% year-over-year. Leadership loves it. Dashboards have never looked better.
Then CodeRabbit published their State of AI vs Human Code Generation report, and the other shoe dropped. AI-generated code produces 1.7x as many bugs as human-written code. Not toy examples. Not hallucination artifacts caught during review. Shipped bugs. Production bugs. The kind that page you at 3 AM.
Incidents per pull request climbed 23.5% over the same period that velocity went up 20%. The industry optimized for throughput and forgot to measure what that throughput actually delivers.
The Data That Should Make Teams Uncomfortable
The severity breakdown is where things get ugly.
AI-generated code doesn't just produce more defects. It produces 1.3x to 1.7x more critical and major issues compared to human-written code. These aren't lint warnings or style violations. These are the bugs that trigger post-mortems and cost you customers.
Developers accept AI suggestions with minimal modification 40-60% of the time. Think about what that means. More than half the AI-generated code entering production hasn't been meaningfully reviewed by the person who requested it.
One organization documented 47 subtle bugs that passed every automated test but broke in production. All 47 cleared CI, cleared code review, cleared staging. They only surfaced when real users hit real edge cases under real load.
Gartner projected a 2,500% increase in generative AI software defects, with 75% of technology leaders facing moderate-to-severe technical debt from AI-accelerated coding by 2026. We're there now. Even if Gartner was off by half, the remaining number is still a crisis.
Why AI Code Breaks Differently
A language model doesn't understand code. It predicts the most statistically likely next token. That produces code which looks correct, follows patterns, uses the right APIs, handles the obvious paths.
But bugs don't live in obvious paths. They live at boundaries and race conditions, in state interactions that weren't in any training example. The model generates that code with the exact same confidence it uses for a trivial getter method.
Human developers carry context from past failures. They remember the outage caused by unvalidated input. They remember the data corruption from a missing transaction boundary. AI models start from zero every single time.
That difference shows up in specific, predictable patterns.
Bug Pattern #1: SQL Injection by Default
AI models have seen millions of database queries. They'll happily generate one that works. Whether it's safe is a different question entirely.
Here's a typical AI-generated endpoint for fetching user profiles:
```python
# AI-generated: looks clean, works fine in testing
from flask import Flask, request, jsonify
import sqlite3

app = Flask(__name__)

def get_user(username):
    conn = sqlite3.connect("app.db")
    cursor = conn.cursor()
    query = f"SELECT id, username, email FROM users WHERE username = '{username}'"
    cursor.execute(query)
    result = cursor.fetchone()
    conn.close()
    if result:
        return {"id": result[0], "username": result[1], "email": result[2]}
    return None

@app.route("/api/user")
def user_endpoint():
    username = request.args.get("username", "")
    user = get_user(username)
    if user:
        return jsonify(user)
    return jsonify({"error": "User not found"}), 404
```
This passes every test a team would write against it. Request a user, get a user. Request a missing user, get a 404. Clean.
Except it's wide open to SQL injection. Send `' OR '1'='1` as the username, and the entire users table comes back. The AI produced confident, well-structured, dangerously vulnerable code.
The fixed version:
```python
# Fixed: parameterized query, context manager, proper error handling
from flask import Flask, request, jsonify
import sqlite3

app = Flask(__name__)

DATABASE = "app.db"

def get_db_connection():
    conn = sqlite3.connect(DATABASE)
    conn.row_factory = sqlite3.Row
    return conn

def get_user(username):
    conn = get_db_connection()
    try:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT id, username, email FROM users WHERE username = ?",
            (username,)
        )
        result = cursor.fetchone()
        if result:
            return dict(result)
        return None
    except sqlite3.Error:
        return None
    finally:
        conn.close()

@app.route("/api/user")
def user_endpoint():
    username = request.args.get("username", "")
    if not username or len(username) > 150:
        return jsonify({"error": "Invalid username"}), 400
    user = get_user(username)
    if user:
        return jsonify(user)
    return jsonify({"error": "User not found"}), 404
```
Three fixes: parameterized queries, input validation, and proper error handling with resource cleanup. A developer who's lived through a SQL injection incident would write this version instinctively. The AI doesn't have that instinct.
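The payload is easy to demonstrate without a running app. Here's a minimal sketch against a throwaway in-memory SQLite database (the table and rows are invented for illustration) showing how the same input behaves against the interpolated query versus the parameterized one:

```python
import sqlite3

# Throwaway in-memory database with two illustrative rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, username TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "alice", "alice@example.com"), (2, "bob", "bob@example.com")],
)

payload = "' OR '1'='1"

# Vulnerable pattern: string interpolation lets the payload rewrite the WHERE clause
query = f"SELECT id, username, email FROM users WHERE username = '{payload}'"
leaked = conn.execute(query).fetchall()
print(len(leaked))  # 2: every row in the table comes back

# Parameterized pattern: the payload is bound as a literal string value
safe = conn.execute(
    "SELECT id, username, email FROM users WHERE username = ?", (payload,)
).fetchall()
print(len(safe))  # 0: no user has that literal name
```

The interpolated version turns the WHERE clause into `username = '' OR '1'='1'`, which is true for every row. The parameterized version never lets the input touch the SQL grammar at all.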
Bug Pattern #2: Race Conditions That Only Bite Under Load
AI-generated concurrent code often looks textbook-correct but falls apart when multiple threads or requests interact. Inventory, booking, and financial systems are full of this.
```javascript
// AI-generated: inventory deduction for an e-commerce checkout
const db = require("./db");

async function purchaseItem(userId, itemId, quantity) {
  const item = await db.query(
    "SELECT stock FROM inventory WHERE item_id = $1",
    [itemId]
  );
  if (item.rows[0].stock < quantity) {
    throw new Error("Insufficient stock");
  }
  await db.query(
    "UPDATE inventory SET stock = stock - $1 WHERE item_id = $2",
    [quantity, itemId]
  );
  await db.query(
    "INSERT INTO orders (user_id, item_id, quantity) VALUES ($1, $2, $3)",
    [userId, itemId, quantity]
  );
  return { success: true };
}
```
Read it carefully. The stock check and the stock deduction are separate operations with no atomicity guarantee. Two users buying the last item at the same time both pass the stock check, both deduct, and now stock is negative. Classic TOCTOU (time-of-check-to-time-of-use) bug.
This will pass every unit test. It'll work fine in staging where there's one QA person clicking around. It'll explode on Black Friday.
```javascript
// Fixed: atomic operation with transaction and row-level locking
const db = require("./db");

async function purchaseItem(userId, itemId, quantity) {
  const client = await db.getClient();
  try {
    await client.query("BEGIN");
    const item = await client.query(
      "SELECT stock FROM inventory WHERE item_id = $1 FOR UPDATE",
      [itemId]
    );
    if (!item.rows.length) {
      throw new Error("Item not found");
    }
    if (item.rows[0].stock < quantity) {
      throw new Error("Insufficient stock");
    }
    await client.query(
      "UPDATE inventory SET stock = stock - $1 WHERE item_id = $2",
      [quantity, itemId]
    );
    await client.query(
      "INSERT INTO orders (user_id, item_id, quantity, created_at) VALUES ($1, $2, $3, NOW())",
      [userId, itemId, quantity]
    );
    await client.query("COMMIT");
    return { success: true };
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```
`SELECT ... FOR UPDATE` locks the row. The transaction ensures atomicity. The `ROLLBACK` on error prevents partial writes. A developer who's debugged oversold inventory at 2 AM knows these patterns cold. The AI generated the first version confidently and skipped the concurrency guarantees that matter.
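You can reproduce the same time-of-check-to-time-of-use gap in-process without a database. A minimal Python sketch, where a barrier stands in for unlucky scheduling and the inventory is a plain dict (the deduction itself is guarded, because the bug is the gap between check and deduction, not the deduction):

```python
import threading

stock = {"n": 1}                # one unit left in inventory
barrier = threading.Barrier(5)  # forces the worst-case interleaving deterministically
dec_lock = threading.Lock()     # the deduction alone is atomic, like a single UPDATE

def buy_unsafe():
    if stock["n"] >= 1:         # check: all five buyers see stock == 1
        barrier.wait()          # every thread passes the check before any deduction
        with dec_lock:
            stock["n"] -= 1     # act: five deductions of the last unit

threads = [threading.Thread(target=buy_unsafe) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
unsafe_final = stock["n"]
print(unsafe_final)             # -4: oversold by four

# Same scenario with check and deduction in one critical section,
# the in-process analogue of SELECT ... FOR UPDATE inside a transaction.
stock["n"] = 1
lock = threading.Lock()

def buy_safe():
    with lock:
        if stock["n"] >= 1:
            stock["n"] -= 1

threads = [threading.Thread(target=buy_safe) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
safe_final = stock["n"]
print(safe_final)               # 0: exactly one purchase succeeds
```

The barrier is artificial, but production traffic provides the same interleaving for free; it just does it rarely enough that tests never see it.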
Bug Pattern #3: Silent Error Swallowing
This one is everywhere in AI-generated code. The model knows that error handling is expected, so it generates try/catch blocks. But it often generates them in ways that hide failures instead of surfacing them.
```go
// AI-generated: config loader for a microservice
package config

import (
	"encoding/json"
	"os"
)

type Config struct {
	DatabaseURL string `json:"database_url"`
	APIKey      string `json:"api_key"`
	Port        int    `json:"port"`
}

func LoadConfig(path string) Config {
	cfg := Config{
		Port: 8080,
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return cfg
	}
	json.Unmarshal(data, &cfg)
	return cfg
}
```
No crash. No panic. If the config file is missing or malformed, the service starts with defaults. The `json.Unmarshal` error is silently discarded. A service running with an empty `DatabaseURL` and `APIKey` won't fail immediately. It'll fail later, on the first request, with an error message that has nothing to do with the actual problem.
This is exactly the kind of bug that passes every automated test (because tests use valid config files) and only surfaces in production when a deployment goes sideways.
```go
// Fixed: fail fast with clear errors
package config

import (
	"encoding/json"
	"fmt"
	"os"
)

type Config struct {
	DatabaseURL string `json:"database_url"`
	APIKey      string `json:"api_key"`
	Port        int    `json:"port"`
}

func LoadConfig(path string) (Config, error) {
	var cfg Config
	data, err := os.ReadFile(path)
	if err != nil {
		return cfg, fmt.Errorf("reading config file %s: %w", path, err)
	}
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cfg, fmt.Errorf("parsing config file %s: %w", path, err)
	}
	if cfg.DatabaseURL == "" {
		return cfg, fmt.Errorf("config: database_url is required")
	}
	if cfg.APIKey == "" {
		return cfg, fmt.Errorf("config: api_key is required")
	}
	if cfg.Port == 0 {
		cfg.Port = 8080
	}
	return cfg, nil
}
```
The fixed version returns errors instead of swallowing them, wraps errors with context, and validates required fields. The service crashes at startup with a clear message instead of limping along broken.
Fail fast is a principle that comes from operational pain. AI models haven't operated anything.
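The same fail-fast shape translates to any language. A minimal Python sketch, using the same field names as the Go example (the `load_config` helper and `REQUIRED` tuple are invented for illustration):

```python
import json

REQUIRED = ("database_url", "api_key")

def load_config(raw: str) -> dict:
    """Parse config JSON and fail loudly instead of limping along with defaults."""
    cfg = json.loads(raw)  # raises on malformed JSON rather than hiding it
    for key in REQUIRED:
        if not cfg.get(key):
            raise ValueError(f"config: {key} is required")
    cfg.setdefault("port", 8080)  # defaults only for genuinely optional fields
    return cfg

cfg = load_config('{"database_url": "postgres://db", "api_key": "k"}')
print(cfg["port"])  # 8080

try:
    load_config('{"database_url": "postgres://db"}')
except ValueError as err:
    print(err)  # config: api_key is required
```

The crash at startup is the feature: the error names the config file problem directly, instead of surfacing an hour later as a connection failure three layers away.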
Bug Pattern #4: The "Works On My Machine" Auth Check
AI-generated authorization code frequently checks authentication (who are you?) while skipping authorization (what are you allowed to do?). It's a subtle distinction that models consistently get wrong.
```javascript
// AI-generated: endpoint to delete a user's project
import express from "express";
import { authenticateToken } from "./middleware/auth";
import { db } from "./db";

const router = express.Router();

router.delete("/projects/:projectId", authenticateToken, async (req, res) => {
  try {
    const { projectId } = req.params;
    const project = await db.query(
      "SELECT * FROM projects WHERE id = $1",
      [projectId]
    );
    if (project.rows.length === 0) {
      return res.status(404).json({ error: "Project not found" });
    }
    await db.query("DELETE FROM projects WHERE id = $1", [projectId]);
    return res.json({ message: "Project deleted" });
  } catch (err) {
    return res.status(500).json({ error: "Internal server error" });
  }
});
```
Authenticated? Yes. Authorized? Not checked. Any logged-in user can delete any other user's project. The AI generated a perfectly structured endpoint that's missing the one check that actually matters.
```javascript
// Fixed: verify the requesting user owns the resource
router.delete("/projects/:projectId", authenticateToken, async (req, res) => {
  try {
    const { projectId } = req.params;
    const userId = req.user.id;
    const project = await db.query(
      "SELECT * FROM projects WHERE id = $1",
      [projectId]
    );
    if (project.rows.length === 0) {
      return res.status(404).json({ error: "Project not found" });
    }
    if (project.rows[0].owner_id !== userId) {
      return res.status(403).json({ error: "Forbidden" });
    }
    await db.query("DELETE FROM projects WHERE id = $1", [projectId]);
    return res.json({ message: "Project deleted" });
  } catch (err) {
    console.error("Delete project error:", err);
    return res.status(500).json({ error: "Internal server error" });
  }
});
```
One if statement. That's it. One ownership check is the difference between a working endpoint and an IDOR vulnerability. But the AI-generated version looks complete. It has auth middleware. It has error handling. It returns proper status codes. Every signal says "this code was reviewed."
That's what makes AI-generated security bugs so dangerous. They hide behind well-structured code.
The Common Thread
Every example above shares the same root cause: the AI optimized for the happy path.
It generated code that works when inputs are valid, when one request runs at a time, when config files exist, and when every user is honest. Production isn't like that. Production is hostile, and it runs a thousand requests at once with half the config missing.
Human developers who've been burned by production know this. They write defensive code not because it's fun but because they've lived through the exact class of failure they're now guarding against. AI models don't carry that context.
What Actually Helps
Abandoning AI tools isn't realistic, and it's not necessary. But using them without adjusting the review process is negligent at this point. The data is too clear to ignore.
Treat AI output like a junior developer's first draft. Read every line. Question every assumption. The clean formatting isn't evidence of correctness.
Write tests that target the boundaries, not the happy path. If AI-generated code passes all tests but breaks in production, the tests are testing the wrong things. Fuzz inputs. Simulate concurrent access. Test with missing config, expired tokens, malformed payloads.
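Here's a sketch of what boundary-targeting assertions look like, against a hypothetical username validator mirroring the 150-character limit from the fixed Flask endpoint earlier (the `is_valid_username` helper is invented for illustration):

```python
# Hypothetical validator mirroring the length check in the fixed endpoint
def is_valid_username(username: str) -> bool:
    return bool(username) and len(username) <= 150

# Happy path: the only case most suites cover
assert is_valid_username("alice")

# Boundaries: empty, exactly at the limit, one past the limit
assert not is_valid_username("")
assert is_valid_username("a" * 150)
assert not is_valid_username("a" * 151)

# Hostile input: injection payloads are legal strings here, and the test
# documents that the defense lives downstream in parameterized queries
assert is_valid_username("' OR '1'='1")
print("boundary checks passed")
```

Four of those six assertions are exactly the cases the AI-generated version of the endpoint never considered.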
Track defect rates by code origin. Teams can't improve what they don't measure. Tag AI-assisted PRs and compare their incident rates against fully human-written code. You'll probably hate the numbers.
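The measurement itself doesn't need tooling to start. A sketch with made-up PR records (the tags and counts are illustrative only, not real data):

```python
# Hypothetical PR records tagged by code origin; numbers are invented
prs = [
    {"origin": "ai", "incidents": 2},
    {"origin": "ai", "incidents": 0},
    {"origin": "ai", "incidents": 1},
    {"origin": "human", "incidents": 0},
    {"origin": "human", "incidents": 1},
]

def incidents_per_pr(records, origin):
    """Average production incidents per PR for one origin tag."""
    matching = [r for r in records if r["origin"] == origin]
    return sum(r["incidents"] for r in matching) / len(matching)

print(incidents_per_pr(prs, "ai"))     # 1.0
print(incidents_per_pr(prs, "human"))  # 0.5
```

Once the tag exists on every PR, the comparison is a one-liner in whatever dashboard the team already has.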
Adopt a "reject by default" posture. That 40-60% minimal-modification acceptance rate is a cultural problem. Flip it. Start from "this is probably wrong" and verify before accepting. That single shift would cut AI-related defects dramatically.
Slow down deliberately. Shipping 20% more PRs with 23.5% more incidents per PR is negative ROI. Shipping 10% more PRs with the same incident rate is actual progress.
The Bottom Line
AI coding tools reduce boilerplate, help explore unfamiliar APIs, and speed up prototyping. Those benefits are real. But the data from CodeRabbit is equally real: 1.7x more bugs, 1.3-1.7x more severe bugs, and a 23.5% increase in incidents per PR.
The teams that'll come out ahead aren't the ones shipping the most code. They're the ones that learned to distrust AI output just enough to catch the SQL injection, the race condition, the silent error, and the missing auth check before they hit production.
Speed without correctness is just generating future incidents faster.
Has your team shipped an AI-generated bug that made it past review and tests? The industry needs war stories, not marketing decks. Share yours below.