5 AI Code Bugs That Pass All Your Tests — Until Production
We shipped AI-assisted code to production for 6 months before we realized our test suite was giving us false confidence. Everything passed locally. Everything passed CI. Then things started breaking in ways that felt... haunted.
Here are the 5 categories of bugs we found that are unique to AI-generated code — and that your existing toolchain is blind to.
Bug #1: The Phantom Import
The most common and most dangerous. AI models sometimes reference packages from their training data that were deleted, never published, or are private.
// Looks perfectly fine. Compiles. Types check. Tests pass.
import { validate } from 'email-validator-pro';
import { formatDate } from 'date-fns-utils';
import { middleware } from 'express-auth-helper';
None of these packages exist on npm.
They compile because TypeScript validates type definitions, not whether a package actually exists on any registry. They can pass tests when the import is mocked or the code path never runs. They blow up in production the moment the bundler or runtime actually tries to resolve them.
We found 14 phantom imports across our codebase. The most insidious one was lodash-deep-clone — it looked so real that three developers reviewed the PR and didn't catch it.
Why your linter misses it: ESLint and Prettier check syntax, not package registry existence. They don't know what's on npm.
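The registry check doesn't need to be fancy. As a first line of defense, you can pull the bare package names out of a file's imports and compare them against what package.json actually declares. This is an illustrative sketch (regex-based, ESM-only); a real tool would parse the AST and confirm each name against the npm registry itself.

```typescript
// Extract bare package names from static ESM imports in a source string.
function extractPackageNames(source: string): string[] {
  const names: string[] = [];
  const importRe = /import\s+[^'"]*['"]([^'"]+)['"]/g;
  let match: RegExpExecArray | null;
  while ((match = importRe.exec(source)) !== null) {
    const spec = match[1];
    // Skip relative paths and Node builtins
    if (spec.startsWith('.') || spec.startsWith('/') || spec.startsWith('node:')) continue;
    // '@scope/name/sub' -> '@scope/name'; 'name/sub' -> 'name'
    const parts = spec.split('/');
    names.push(spec.startsWith('@') ? parts.slice(0, 2).join('/') : parts[0]);
  }
  return names;
}

// Report imports that package.json never declared.
function findPhantomImports(source: string, declared: Set<string>): string[] {
  return extractPackageNames(source).filter((name) => !declared.has(name));
}
```

Run something like this over changed files in a pre-commit hook and the `lodash-deep-clone` class of bug never reaches review.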
Bug #2: The Time-Traveled API
AI models trained on older code will confidently use APIs that have been deprecated or removed.
// This worked in 2022. Does not work in 2026.
const result = await fetch(url).then(res => res.buffer());
// res.buffer() was never part of the fetch standard: it came from the
// node-fetch library, which deprecated it in favor of res.arrayBuffer()
// This one's worse: it looks like valid React
componentDidMount() {
  this.setState({ loading: true });
  this.unsubscribe = firebase.auth().onAuthStateChanged(/* ... */);
}
// Firebase v9 replaced this namespaced API with the modular one;
// without the compat shim, firebase.auth() no longer exists
The code looks right. It matches patterns the model saw thousands of times. But the API surface has moved on. Your tests might still pass if they use mocks that implement the old API.
Why your linter misses it: No linter tracks API deprecation across library versions.
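One stopgap that doesn't need AI at all: a deny-list of call patterns you already know are dead, checked on every diff. The two entries below are illustrative; a real tool would derive deprecations from the lockfile's actual dependency versions.

```typescript
// Known-deprecated call patterns, each with a remediation hint.
const DEPRECATED_PATTERNS: { pattern: RegExp; hint: string }[] = [
  {
    pattern: /\.buffer\(\)/,
    hint: 'res.buffer() is node-fetch-only and deprecated; use res.arrayBuffer()',
  },
  {
    pattern: /\bfirebase\.\w+\(\)/,
    hint: 'namespaced Firebase calls are gone; use the modular API (getAuth, onAuthStateChanged)',
  },
];

// Return the hint for every deprecated pattern found in the source.
function findDeprecatedCalls(source: string): string[] {
  return DEPRECATED_PATTERNS
    .filter(({ pattern }) => pattern.test(source))
    .map(({ hint }) => hint);
}
```

It's crude, but it converts "tribal knowledge about what broke last quarter" into a check that runs in milliseconds.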
Bug #3: The Context Fault Line
This one is subtle and maddening. The AI generates correct code for a function, but it doesn't match the surrounding context.
// File: userService.ts
async function createUser(data: CreateUserDTO) {
  // AI generates this correctly, matching the schema...
  const hash = await bcrypt.hash(data.password, 10);
  const user = await prisma.user.create({
    data: { email: data.email, passwordHash: hash }
  });
  // ...but then generates THIS, which contradicts the actual Prisma schema
  await prisma.auditLog.create({
    data: { userId: user.id, action: 'created' }
    // auditLog has no 'userId' field; the schema calls it 'actorId'
  });
}
Each piece looks correct in isolation. The function compiles. But at runtime, you get a cryptic Prisma error because the field names don't match the actual schema.
This happens because the AI loses context of your schema definitions when generating code further into a session.
Why your linter misses it: Cross-file semantic coherence is not a linter's job.
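Prisma's generated client types catch many of these mismatches at compile time, but when code slips past them (plain JavaScript, `any` casts, a stale generated client), a CI step can still cross-check field names against schema.prisma directly. A minimal sketch, assuming the common `name Type` field layout:

```typescript
// Parse model names and their field names out of a schema.prisma string.
function parseModelFields(schema: string): Map<string, Set<string>> {
  const models = new Map<string, Set<string>>();
  const modelRe = /model\s+(\w+)\s*\{([^}]*)\}/g;
  let match: RegExpExecArray | null;
  while ((match = modelRe.exec(schema)) !== null) {
    const fields = new Set<string>();
    for (const line of match[2].split('\n')) {
      const field = line.trim().match(/^(\w+)\s+\w+/);
      if (field) fields.add(field[1]);
    }
    models.set(match[1], fields);
  }
  return models;
}
```

With this in place, a check like `models.get('AuditLog')?.has('userId')` returning false flags the mismatch before anything hits runtime.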
Bug #4: The Security Trojan
AI models sometimes generate code patterns that are textbook security vulnerabilities — not because they're malicious, but because those patterns are common in training data.
// AI-generated "convenience" function
app.post('/api/search', (req, res) => {
// Direct string interpolation in a database query
const results = await db.query(
`SELECT * FROM products WHERE name LIKE '%${req.body.query}%'`
);
res.json(results);
});
Or more subtly:
// AI loves eval() for dynamic config parsing
const config = eval(`(${rawConfigString})`);
// Or this pattern, which builds and runs a function from user input:
const result = new Function('data', 'return ' + userFormula)(inputData);
These pass code review because they look like legitimate (if slightly unusual) patterns. They pass tests because tests use sanitized inputs.
Why your linter misses it: ESLint's security plugins catch some of these, but AI-generated security issues often use novel patterns that bypass rule-based detection.
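The fixes themselves are mechanical once you see them. A hedged sketch, assuming a node-postgres-style `query(text, values)` interface (the placeholder syntax, `$1` here, varies by driver):

```typescript
// Stand-in for a driver with parameterized queries (node-postgres style).
interface Db {
  query(text: string, values: unknown[]): Promise<unknown[]>;
}

// User input travels as a bound value; it never becomes part of the SQL text.
async function searchProducts(db: Db, userQuery: string): Promise<unknown[]> {
  return db.query('SELECT * FROM products WHERE name LIKE $1', [`%${userQuery}%`]);
}

// For the dynamic-config case: JSON.parse covers what eval() was doing,
// without executing arbitrary code.
function parseConfig(raw: string): unknown {
  return JSON.parse(raw);
}
```

The point is that even a hostile `userQuery` like `'; DROP TABLE products; --` arrives at the database as data, not as SQL.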
Bug #5: The Over-Engineered Abstraction
This isn't a crash bug. It's a maintainability bomb.
AI models are trained on a lot of enterprise codebases. They love to generate:
// What you asked for: "validate email"
function isValidEmail(email: string): boolean {
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}
// What the AI generated:
interface ValidationRule<T> {
  validate(value: T): ValidationResult;
  transform?(value: T): T;
}
interface ValidationResult {
  isValid: boolean;
  errors: ValidationError[];
  metadata?: Record<string, unknown>;
}
interface ValidationError {
  code: string;
  message: string;
  severity: 'error' | 'warning' | 'info';
  context?: ValidationContext;
}
class EmailValidationStrategy implements ValidationRule<string> {
  // ... 87 lines of code
}
For a simple email validation. The code compiles. Tests pass. But now you have 87 lines of indirection that nobody understands and nobody wants to modify.
We found a 340-line "Plugin Architecture" that was auto-generated to handle what should have been a 15-line switch statement.
Why your linter misses it: There's no ESLint rule for "this is unnecessarily complex." Code review might catch it, but when you're reviewing 20 AI-assisted PRs a day, the pattern fatigue is real.
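There's no rule for "unnecessarily complex," but crude heuristics can triage. One we've found useful as a signal, not a verdict: the share of a file's non-empty lines spent declaring abstractions rather than doing work. The regex and the idea of thresholding this ratio are illustrative, not a published metric.

```typescript
// Ratio of abstraction-declaration lines (interfaces, abstract classes,
// type aliases) to all non-empty lines in a source string.
function abstractionRatio(source: string): number {
  const lines = source.split('\n').filter((line) => line.trim().length > 0);
  if (lines.length === 0) return 0;
  const abstractions = lines.filter((line) =>
    /^(export\s+)?(interface|abstract\s+class|type\s+\w+)\b/.test(line.trim())
  ).length;
  return abstractions / lines.length;
}
```

A three-line email validator scores 0; the generated validation framework above scores high enough to warrant a human look.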
How We're Catching These Now
After months of painful production incidents, we built Open Code Review — an open-source CI/CD tool specifically designed to catch these AI-native bug categories.
# Install and scan in seconds
npx @opencodereview/cli scan src/ --sla L1
L1 (instant, no AI needed) catches:
- ✅ Phantom imports (registry verification)
- ✅ Deprecated API calls (AST-aware)
- ✅ Security anti-patterns
- ✅ Over-engineering heuristics
L2 (optional, local Ollama) adds:
- ✅ Cross-file context coherence
- ✅ Semantic similarity analysis
- ✅ AI-powered quality scoring
L3 (coming soon):
- ✅ Full LLM code review pass
It runs 100% locally — no API keys, no data leaves your machine. CI integration takes one line:
# GitHub Actions
- uses: raye-deng/open-code-review@v1
with:
sla: L1
threshold: 60
github-token: ${{ secrets.GITHUB_TOKEN }}
The Real Lesson
AI coding assistants are incredibly powerful. We're not going back to hand-writing everything. But the bugs they introduce are qualitatively different from human bugs. They look right. They test right. They compile right. And they fail in production in ways that feel almost impossible to debug.
Your existing QA pipeline — linters, unit tests, even human code review — was built for human bugs. AI bugs need AI-aware tooling.
What AI-specific bugs have you found in production? I'm collecting patterns and would love to hear your war stories.
Open Code Review is open source at github.com/raye-deng/open-code-review. PRs, issues, and war stories welcome.