Every developer I know has a story about AI-generated code that looked completely right and was completely wrong. Not "wrong" in an obvious way — wrong in the way that costs you a Tuesday.
I've shipped production systems where AI wrote a meaningful portion of the codebase. These are the failure modes I've stopped being surprised by.
1. Hallucinated imports that pass linting
The model confidently reaches for pandas.DataFrame.to_parquet(engine='pyarrow', schema_evolution=True). That parameter does not exist. The code passes linting because the import resolves and nothing statically validates the keyword arguments. It fails at runtime, in production, on the one path you didn't test.
Why it happens: the model has seen thousands of DataFrame snippets and infers plausible-sounding parameters from patterns across libraries. It doesn't distinguish "I've seen this exact call" from "this feels right given everything I've read."
Fix: for any library call involving optional parameters, check the generated call against the actual library docs or run it in a real interpreter before committing. Don't trust the linter alone.
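One cheap guard, as a minimal sketch: check the generated keyword arguments against the real signature before trusting the call. The check_kwargs helper below is hypothetical, not a pandas or linter feature, and it can only warn when the target forwards **kwargs (as to_parquet does), which is exactly when you need the docs or a real interpreter.

```python
import inspect

import pandas as pd


def check_kwargs(func, kwargs):
    """Flag keyword arguments the real function doesn't declare."""
    params = inspect.signature(func).parameters
    takes_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    unknown = [name for name in kwargs if name not in params]
    if unknown and not takes_var_kwargs:
        raise TypeError(f"{func.__qualname__} does not accept {unknown}")
    if unknown and takes_var_kwargs:
        print(f"{unknown} fall through to **kwargs; check the docs for {func.__qualname__}")


# The generated call from above: 'engine' is real, 'schema_evolution' is not.
check_kwargs(pd.DataFrame.to_parquet, {"engine": "pyarrow", "schema_evolution": True})
```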
2. Confident wrong refactoring
You ask the model to "refactor this function to be more readable." It hands back something cleaner. You merge it. Three weeks later, a subtle change in variable scope or early-return logic breaks a downstream assertion nobody noticed.
The model optimised for the appearance of correctness — shorter, more idiomatic code — without tracking all the invariants the original author was quietly maintaining.
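A contrived sketch of the shape this takes (the function and its collaborators are made up): the original's early return quietly guarantees that a missing record is never cached, and the tidier version drops that invariant.

```python
# Original: the early return guarantees a missing record is never cached.
def load_user(user_id, cache, db):
    if user_id in cache:
        return cache[user_id]
    record = db.fetch(user_id)
    if record is None:
        return None                  # deliberately not cached
    cache[user_id] = record
    return record


# "Cleaner" refactor: shorter and more idiomatic, but it now caches None,
# so a user created after the first miss stays invisible until restart.
def load_user_refactored(user_id, cache, db):
    if user_id not in cache:
        cache[user_id] = db.fetch(user_id)
    return cache[user_id]
```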
Fix: treat AI refactors like any external PR. Require a line-by-line diff review and a passing run of the existing test suite, not just a visual scan.
3. Stale context poisoning
Long sessions are particularly dangerous. By message 40, the model's working context has accumulated contradictory instructions, out-of-date schema definitions, and revised requirements that weren't cleanly updated. It starts synthesising code from a model of your codebase that no longer matches reality.
This is different from hallucination. The model is using your instructions — just the version from 30 messages ago.
Fix: for complex tasks, periodically start a fresh context with a clean system prompt rather than accumulating state across a single long session. Treat context windows like short-term memory, not a persistent source of truth.
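A minimal sketch of one way to do that (the paths and prompt layout are illustrative, not any particular tool's API): rebuild the prompt from the files that are true right now instead of replaying the conversation.

```python
from pathlib import Path


def fresh_context(task: str) -> str:
    """Compose a new prompt from the current source of truth,
    not from 40 messages of partially revised state."""
    schema = Path("db/schema.sql").read_text()           # illustrative paths
    conventions = Path("docs/conventions.md").read_text()
    return (
        "You are working in this codebase.\n\n"
        f"Current schema:\n{schema}\n\n"
        f"Conventions:\n{conventions}\n\n"
        f"Task:\n{task}"
    )
```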
4. Non-determinism breaking CI
AI-generated tests that rely on set iteration order, floating-point equality, or datetime.now() will pass locally and fail in CI at a rate that's maddening to debug. The model writes tests that pass in its internal simulation of the code but doesn't account for environment-specific non-determinism.
Fix: ban == comparisons on floats and timestamps in any AI-generated test. Add a lint rule. Make the model explain why each assertion is deterministic before you accept it.
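As a sketch of both patterns side by side (pytest assumed; the invoice code is invented for illustration):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

import pytest


@dataclass
class Invoice:
    total: float
    created_at: datetime


def make_invoice(items, tax_rate, now=None):
    subtotal = sum(price for _, price in items)
    return Invoice(total=subtotal * (1 + tax_rate), created_at=now or datetime.now())


# Flaky: exact float equality and a wall-clock read both vary by environment.
def test_invoice_total_flaky():
    invoice = make_invoice([("widget", 19.99)], tax_rate=0.0825)
    assert invoice.total == 21.639175              # float rounding may differ
    assert invoice.created_at == datetime.now()    # already stale when compared


# Deterministic: tolerance on the float, clock injected instead of read.
def test_invoice_total():
    fixed_now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    invoice = make_invoice([("widget", 19.99)], tax_rate=0.0825, now=fixed_now)
    assert invoice.total == pytest.approx(21.64, abs=0.01)
    assert invoice.created_at == fixed_now
```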
5. The "working" code that doesn't handle scale
The model has seen plenty of sequential, synchronous Python. It has seen far less production async code under real concurrency. Generated async handlers frequently contain subtle race conditions — shared state accessed across coroutines without locks, cancellation paths that leave resources open, retry logic that doesn't account for partial writes.
The code works perfectly in your local test. Under 50 concurrent requests, it quietly corrupts state.
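A minimal reproduction of the pattern, as a toy asyncio sketch: the check and the write are separated by an await, so two coroutines can both pass the check before either one writes.

```python
import asyncio

balance = 100                      # shared state across coroutines


async def withdraw_racy(amount):
    global balance
    if balance >= amount:          # check...
        await asyncio.sleep(0)     # ...any awaited I/O yields control here...
        balance -= amount          # ...write: another coroutine may have won
        return True
    return False


async def withdraw_safe(amount, lock):
    global balance
    async with lock:               # check and write now happen atomically
        if balance >= amount:
            await asyncio.sleep(0)
            balance -= amount
            return True
        return False


async def main():
    global balance
    await asyncio.gather(withdraw_racy(60), withdraw_racy(60))
    print("racy:", balance)        # -20: both coroutines passed the check

    balance = 100
    lock = asyncio.Lock()          # created inside the running loop
    await asyncio.gather(withdraw_safe(60, lock), withdraw_safe(60, lock))
    print("safe:", balance)        # 40: the second withdrawal is refused


asyncio.run(main())
```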
Fix: any AI-generated concurrency code gets a mandatory review against the asyncio documentation and your race condition checklist before merge. If you don't have a checklist, make one.
The meta-bug
All of these share a root cause: the model is optimising for plausibility, not correctness. Code that looks like correct code scores well in training. Code that subtly fails under edge conditions is hard to distinguish from working code in text form.
This doesn't mean AI coding is broken. It means the review practices most teams inherited from the "all code is human-written" era are the wrong shape for "30–50% of this was generated."
Your reviewers need to be looking for a different class of bug. The obvious ones, the model largely gets right. It's the plausible-but-wrong — the confident hallucination, the silent invariant violation, the context-poisoned refactor — that will cost you time until you build specific habits around catching them.
What patterns have you hit that aren't on this list? I'd genuinely like to know.