Is AI-Generated Code Buggier? The 2025-26 Data

#security #ai #benchmarking

Across the major 2025 and 2026 studies, AI-generated code carries measurably more defects: Veracode found that 45% of AI-generated code samples introduced an OWASP Top 10 vulnerability, CodeRabbit measured 1.7x more issues per pull request, Apiiro tracked a 10x rise in monthly security findings, and a USENIX Security 2025 study found that roughly one in five packages recommended in LLM-generated code don't exist. BrassCoders, the bug scanner for AI coders, exists because those numbers land in real codebases.

The figures below come from named sources with linked methodology. Several are from vendors who sell scanning tools, so they're labeled as vendor reports and paired with academic work where it exists.

Security Vulnerabilities In AI Code

BrassCoders scans for the weakness classes Veracode measured: in the 2025 GenAI Code Security Report, 45% of AI-generated code samples introduced an OWASP Top 10 vulnerability, tested across more than 100 models on 80 real-world coding tasks.

The failure rate varied by language. Java fared worst at 72%, with Python, JavaScript, and C# lower but still substantial. The test was a measurement, not a survey: each sample was checked for a known vulnerability class. It is a widely cited anchor for the question "is AI code secure," and the answer it gives is that nearly half the time it isn't.

More Issues Per Pull Request

BrassCoders runs in the pre-merge slot where the extra issues show up: CodeRabbit's State of AI vs Human Code Generation report measured 1.7x more issues on AI-assisted pull requests, 10.83 per PR against 6.45 for human-only work, across 470 open-source GitHub PRs.

This is a vendor report from a company selling AI review, so weigh it accordingly. The useful detail is the per-PR delta: the extra four-or-so issues per pull request are the triage load a reviewer absorbs on every AI-assisted change. That load is what a deterministic first pass is meant to cut down before a human ever looks.
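The per-PR arithmetic is easy to reproduce. The sketch below uses only the two averages quoted from the CodeRabbit report; the figure of 25 AI-assisted PRs is a hypothetical team volume chosen for illustration, and the projection assumes the report's averages carry over to your repository, which they may not.

```python
# Averages quoted above from the CodeRabbit report.
AI_ISSUES_PER_PR = 10.83
HUMAN_ISSUES_PER_PR = 6.45

# Ratio behind the "1.7x" headline and the absolute per-PR delta.
ratio = AI_ISSUES_PER_PR / HUMAN_ISSUES_PER_PR
extra_per_pr = AI_ISSUES_PER_PR - HUMAN_ISSUES_PER_PR


def extra_issues(ai_prs: int) -> float:
    """Extra issues to triage across `ai_prs` AI-assisted PRs.

    ASSUMPTION: the report's averages hold for your codebase.
    """
    return extra_per_pr * ai_prs


print(f"ratio: {ratio:.2f}x")
print(f"extra issues per PR: {extra_per_pr:.2f}")
# 25 PRs is a hypothetical volume, not a figure from the report.
print(f"extra issues across 25 PRs: {extra_issues(25):.1f}")
```

The ratio rounds to the report's 1.7x, and the delta is the "four-or-so" extra issues per pull request mentioned above.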

Vulnerabilities Scale With Adoption

BrassCoders matters more as AI-assisted velocity climbs, which is exactly the trend Apiiro reported: a 10x rise in monthly security findings between December 2024 and June 2025, with privilege-escalation paths up 322%.

Apiiro's analysis drew on several thousand developers across tens of thousands of enterprise repositories. More code shipped faster means more findings in absolute terms, and the mix shifted toward higher-severity architectural and privilege-escalation flaws. Without a gate, faster shipping compounds the backlog of findings.

Packages That Don't Exist

BrassCoders flags imports that don't resolve before installation, the defense against a defect class unique to AI: a USENIX Security 2025 study found 19.7% of packages recommended in LLM-generated code did not exist, and 43% of those fabricated names recurred across repeated prompts.

The study generated 2.23 million package references from 576,000 code samples across 16 models. Open-source models hallucinated more (21.7%) than commercial ones (5.2%). Because the fake names repeat, an attacker can register one and wait, an attack called slopsquatting. A real proof-of-concept package planted this way drew tens of thousands of downloads. An import either resolves or it doesn't, which makes this a clean deterministic check.
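To make "an import either resolves or it doesn't" concrete, here is a minimal sketch of a local resolution check using only the Python standard library. It is not BrassCoders' implementation, and the function names are hypothetical. It also has a real limitation: it only reports whether a module can be found in the current environment, not whether a package of that name exists on a registry, and import names can differ from distribution names (for example, `yaml` is provided by `PyYAML`).

```python
import ast
import importlib.util


def top_level_imports(source: str) -> set[str]:
    """Collect the top-level module names imported by Python source text."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level > 0 marks a relative import, which points at the
            # project itself, so it is skipped here.
            names.add(node.module.split(".")[0])
    return names


def resolves(name: str) -> bool:
    """True if the module can be located in the current environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for unusual entries in sys.modules;
        # treat those as unresolved rather than crashing the check.
        return False


def unresolved_imports(source: str) -> list[str]:
    """Return the imported top-level modules that do not resolve locally."""
    return sorted(n for n in top_level_imports(source) if not resolves(n))


# `totally_made_up_pkg_xyz` is a deliberately fake name for illustration.
sample = "import json\nimport totally_made_up_pkg_xyz\nfrom os import path\n"
print(unresolved_imports(sample))
```

Parsing with `ast` rather than importing the file means the check never executes the code under inspection, which matters when the point is to catch a suspicious dependency before it is installed.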

The Perception Gap

BrassCoders closes a gap the data keeps surfacing: developers trust AI code more than the measurements warrant. Snyk's 2023 report found nearly 80% of developers believe AI-generated code is more secure than human-written code, while METR's 2025 randomized trial found experienced developers took 19% longer with AI tools yet believed the tools had made them 20% faster.

The Snyk survey and the METR randomized trial measure different things, but they point the same way: confidence in AI output runs ahead of its measured quality. A deterministic gate doesn't argue with the confidence. It just checks the code.

What The Numbers Mean For Your Workflow

BrassCoders turns these aggregate figures into a per-commit check: 12 scanners run against your AI-generated Python and emit the findings as YAML, deterministically and offline. The studies describe the problem at population scale; the scan addresses it on your branch.

The plain read on the data is that AI coding assistants ship more defects and recommend packages that don't exist, and developers underestimate both. Pair the numbers with a gate that runs every time.

pip install brasscoders
brasscoders --offline scan /path/to/your/project
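One way to make the gate "run every time" is to wrap the scan command in a small script that a Git hook or CI step can call. The sketch below is an assumption-laden illustration: `run_gate` and `SCAN_COMMAND` are hypothetical names, the command is the one shown above pointed at the current directory, and it assumes a nonzero exit code signals findings or a failed scan, which this article does not confirm.

```python
import subprocess
import sys

# Same invocation as the command above, pointed at the current directory.
SCAN_COMMAND = ["brasscoders", "--offline", "scan", "."]


def run_gate(command: list[str]) -> int:
    """Run a scan command and return its exit code.

    Returns 127 (the shell convention for "command not found") when the
    executable is missing, so an uninstalled scanner fails the gate
    instead of silently passing.
    """
    try:
        return subprocess.run(command, check=False).returncode
    except FileNotFoundError:
        print(f"gate: {command[0]!r} not found on PATH", file=sys.stderr)
        return 127


# In a hook script, finish with:
#     sys.exit(run_gate(SCAN_COMMAND))
# ASSUMPTION: the scanner exits nonzero when it reports findings.
```

Failing closed when the tool is missing is a deliberate choice here: a gate that passes because it never ran gives the same false confidence the studies describe.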
