
Vivian Voss

Posted on • Originally published at vivianvoss.net

AI Code Generation: The Hallucination Tax

[Illustration: two developers compared. Green: lean vanilla code, fewer hallucinations, reviewable in minutes. Red: framework-heavy code, 19.6% fake packages, debugging takes longer than writing.]

Performance-Fresser — Episode 20

"AI will write your code! 55% faster! Ship in half the time!"

METR ran a randomised controlled trial. Sixteen experienced developers, 246 tasks, mature codebases averaging one million lines of code. Result: developers using AI were 19% slower. Not faster. Slower.

The developers themselves believed they were 20% faster. They were not. One does admire the confidence.

The Hallucination

19.6% of AI-recommended packages do not exist. Nearly one in five imports point to packages that were never published. 43% of those hallucinated packages reappear consistently across re-queries. The AI does not guess randomly. It hallucinates with conviction, and it hallucinates the same things repeatedly.

This is not an edge case. Across 756,000 code samples and 16 models, the pattern is remarkably consistent. Attackers have noticed, naturally. "Slopsquatting" registers packages matching AI-hallucinated names on npm and PyPI, turning the model's confidence into a supply chain attack vector. Rather entrepreneurial of them.

40% of GitHub Copilot's generated code contains security vulnerabilities (NYU, 89 scenarios, 1,692 programs). Developers with AI access write significantly less secure code than those without, whilst being considerably more confident that their code is secure. Stanford measured this across 47 developers in Python, JavaScript, and C. The less they questioned the AI, the more vulnerabilities they introduced. One does wonder whether "confidence" and "competence" have always been this loosely coupled.
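The NYU and Stanford findings are about code that compiles, runs, and looks correct. A minimal sketch (table and function names invented for the example) of the interpolated-SQL pattern models frequently produce, next to the parameterised version a reviewer should insist on:

```python
import sqlite3

def find_user_unsafe(conn, username):
    # The pattern AI tools often suggest: string interpolation into SQL.
    # It compiles and runs, and it is injectable:
    # username = "x' OR '1'='1" returns every row in the table.
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{username}'"
    ).fetchall()

def find_user_safe(conn, username):
    # Parameterised query: the driver binds the value, injection fails.
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (username,)
    ).fetchall()
```

Both functions return identical results for honest input, which is precisely why the unsafe one survives a confident glance.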

The Complexity Tax

Here is what the benchmarks reveal but the marketing rather conveniently omits: AI performs measurably worse on complex, abstracted code. Framework-specific conventions, proprietary APIs, deep dependency chains: these are the contexts where hallucination rates climb. The more you ask the model to navigate, the more creatively it invents.

20.41% of code hallucinations stem from incorrect API usage. The more framework-specific the API, the more the model confuses conventions, invents methods that do not exist, and mixes patterns from different versions. Higher Halstead complexity, larger vocabulary, deeper abstraction: all correlate with higher failure rates in LLM-generated code. One might call it poetic justice: the abstractions designed to simplify development are now the abstractions that confuse the tool designed to simplify development.

Vanilla code in the language's standard library produces cleaner AI output. Not because the AI is smarter. Because there is less to hallucinate about. Fewer abstractions, fewer proprietary patterns, fewer opportunities for the model to confidently fabricate something that compiles but does not work.
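A toy illustration of the point (the function is invented for the example): grouping with nothing but the standard library gives the model no proprietary surface to fabricate, and gives the reviewer nothing to look up:

```python
from collections import defaultdict

def group_by(items, key):
    """Plain-stdlib grouping: no third-party import for the model to
    misremember, no framework method for it to invent a variant of."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)
```

The equivalent one-liner pulled from some utility library is exactly where invented imports and invented method signatures creep in.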

JavaScript illustrates this rather neatly: 21.3% hallucinated imports versus 15.8% in Python. More packages in the ecosystem means more hallucination surface. The complexity you built for humans to struggle with is now the complexity AI struggles with too. The tax, as one might say, compounds.

The Model

Not all models are equal, and the tooling matters as much as the model behind it.

Copilot autocompletes lines. It predicts the next token based on your current file. An agentic model in a proper development environment reasons about architecture, reads your project structure, navigates across files, and understands context at a system level. The difference is not incremental. It is categorical.

Fewer than 44% of AI-generated suggestions were accepted in the METR study. The developers spent more time evaluating, adjusting, and discarding suggestions than they would have spent writing the code themselves. The tool that was meant to remove friction became the friction. Quite the achievement.

Choosing the right model for the right task is engineering. Using whatever ships with your IDE is hope. Hope is a marvellous thing. It is not, however, a deployment strategy.

The Code Quality Decline

GitClear analysed 211 million lines of code and measured the impact of AI adoption on code quality:

  • Refactoring collapsed: from 25% of changed lines (2021) to under 10% (2024)
  • Code cloning surged: copy-pasted lines rose from 8.3% to 12.3% (a 48% increase)
  • Code churn doubled: lines reverted or updated within two weeks doubled versus the 2021 baseline

The AI generates code faster. The code gets replaced faster. The net effect on the codebase is not acceleration. It is the accumulation of disposable code that nobody refactors because the AI will cheerfully generate more. One does wonder whether "velocity" was always a euphemism for "volume."

The Pattern

AI code generation is a brilliant instrument. In the hands of someone who understands both the tool and the codebase, it accelerates work that would otherwise be tedious. In the hands of someone working inside a framework they do not fully understand, it amplifies the confusion at machine speed.

The answer is not more AI. It is less complexity. Reduce what the model must carry in context. Write code a human can read in one pass: lean, minimal, close to the language's own idioms. The AI will follow, because there is less to get wrong.

45% of developers say debugging AI code takes longer than writing it themselves. One does suspect this has less to do with the AI and rather more to do with not understanding the framework the AI is writing for. A developer who masters the fundamentals of the language itself reviews AI output in seconds. A developer buried in abstractions cannot review anyone's code, including their own.

61% say AI produces code that "looks correct but is not reliable." The hallucination is not in the model. It is in the expectation that a tool can navigate complexity you have not mastered yourself.

The Trust Erosion

The developer community is noticing. Stack Overflow's 2025 survey (65,000 respondents) tells the story:

  • 84% use or plan to use AI tools (adoption is not the problem)
  • Only 29% trust AI accuracy (down from 40% the previous year)
  • Favourability dropped from 72% to 60%
  • 75% still prefer asking a human over trusting AI output (rather telling)
  • 77% say "vibe coding" is not part of their professional work (one does hope)

The industry adopted the tool before it understood the tool. Now the understanding is catching up, and confidence is dropping. Not because AI got worse. Because expectations met reality. One does find that reality has rather poor timing.

The Lever

The solution is not abandoning AI. It is reducing what the AI must navigate.

Write lean code: close to the language's own idioms, minimal abstractions, no framework magic. A function that does one thing, named clearly, understood where it is read. The AI generates better output for this code because there is less to hallucinate about. The human reviews it faster because there is less to misunderstand.
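What that looks like in practice, as a sketch (the function is invented for the example): one job, plain stdlib, readable in a single pass:

```python
def parse_duration(text: str) -> int:
    """Convert '90s', '5m', or '2h' into seconds.

    One job, no framework, no configuration object: a human can
    verify it in one pass, and a model asked to extend it has
    nothing hidden to hallucinate about.
    """
    units = {"s": 1, "m": 60, "h": 3600}
    value, unit = text[:-1], text[-1]
    if unit not in units or not value.isdigit():
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(value) * units[unit]
```

Reviewing this takes seconds whether a colleague wrote it or a model did, which is the entire point.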

This is not nostalgia. It is architecture for the AI era. The same principles that made code maintainable for humans now make it reviewable when generated by machines.

Write lean. The AI will follow.

Read the full article on vivianvoss.net →


By Vivian Voss — System Architect & Software Developer. Follow me on LinkedIn for daily technical writing.
