I've been watching the AI coding explosion with fascination and a bit of concern. Here's what's happening: 84% of developers now use or plan to use AI coding tools, but only 3% "highly trust" AI output. That trust number? It's dropping, not rising.
After digging through the research, I found something unexpected. The real problem isn't pure AI code or pure human code. It's the stuff in between: PRs where 25-50% of the code is AI-generated and the rest is human-written. These hybrid PRs create review challenges that neither pure human nor pure AI code produces. There's a name for this effect, borrowed from robotics, and understanding it explains a lot about why code review is getting harder.
Adoption Is Through the Roof, But Trust Is Cratering
The numbers tell a strange story. The Stack Overflow 2025 Developer Survey drew responses from over 49,000 developers and found that 84% are using or planning to use AI tools, up from 76% in 2024 and 70% in 2023. Daily usage among professional developers hit 51% in 2025. Among developers using AI agents, GitHub Copilot and ChatGPT dominate the market, with 68% and 82% share respectively.
Velocity is way up. GitHub's Octoverse 2025 data shows 43.2 million pull requests merged monthly, a 23% year-over-year increase, with nearly 1 billion commits (up 25%). Duolingo reported a 70% increase in PR volume after adopting Copilot.
But here's where it gets weird. Trust in AI accuracy has fallen dramatically. 46% of developers don't trust AI tool accuracy in 2025, up from 31% in 2024. Only 3% report "high trust" in AI output.
The DORA 2025 State of AI-Assisted Software Development report surveyed nearly 5,000 technology professionals and found that 90% now use AI tools (up 14% from 2024), but AI adoption continues to correlate with increased instability. Teams get higher throughput but also higher change failure rates and more rework. IT Revolution's analysis of the report put it bluntly: "AI adoption correlated with worsened software delivery performance," with testing, code review, and quality assurance practices that "are not equipped to handle this new, accelerated pace."
We're shipping more code faster than ever, but we trust it less. That's a problem.
AI Code Has More Issues and Takes Longer to Review
Independent analysis shows AI-generated code creates more work for reviewers. CodeRabbit's 2025 analysis of 470 open-source GitHub PRs found AI-generated submissions contained about 1.7x as many issues overall: 10.83 issues per PR versus 6.45 for human-written code. Critical issues appeared 40% more frequently in AI code (341 per 100 PRs vs. 240 for human), while major issues were 70% higher.
The types of issues are different too. Logic and correctness issues were 75% more common in AI PRs. Error and exception handling gaps appeared at nearly twice the rate. Security vulnerabilities are a real concern, particularly XSS (2.74x higher) and insecure object references (1.91x higher). And the one that matters most for reviewers: readability issues were 3x more common in AI-generated code.
This translates directly into more time spent reviewing. A Sonar survey found 38% of developers say reviewing AI-generated code requires more effort than reviewing human-written code. A research study by Alami and Ernst (2025) found that "the cognitive load sometimes is higher in dealing with LLM-generated feedback due to its excessive details."
The fundamental issue: 61% of developers agree AI often produces code that "looks correct but isn't reliable."
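To make that concrete, here's a hedged, hypothetical example of the pattern (the service, URL, and function are invented for illustration). It reads cleanly, which is exactly the problem: the error-handling gap the CodeRabbit data flags only shows up when you deliberately ask about the unhappy path.

```python
import json
import urllib.request

def fetch_user_settings(user_id: str) -> dict:
    """Fetch a user's settings from a (hypothetical) internal settings service."""
    url = f"https://settings.internal.example.com/users/{user_id}"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.loads(resp.read())
    except Exception:
        # Looks like sensible defensive coding, but it silently swallows
        # timeouts, 5xx responses, and malformed JSON alike; callers can't
        # distinguish "user has no settings" from "the service is down."
        return {}
```

A reviewer skimming in human-logic mode sees a tidy function with a fallback; a reviewer in verification mode asks what that empty dict is hiding. The gap between "reads well" and "behaves well" is what the 61% are describing.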
What Robotics Teaches Us About Hybrid Code
Back in 1970, Japanese roboticist Masahiro Mori wrote an essay called "Bukimi no Tani" (不気味の谷), which we know as "The Uncanny Valley." He noticed something strange: as robots become more human-like, people respond more positively to them, until they hit a critical threshold. At that point, when robots look almost human but not quite, people find them deeply unsettling. Something 95% human seems more disturbing than something 50% human.
Neuroscience research figured out why. A 2012 fMRI study by Saygin et al. found that viewing androids with human appearance but mechanical movement produced stronger brain activity in action perception areas than viewing either humans or robots. The brain expects consistency. When appearance and movement don't match, it creates prediction errors that cascade through your neural system.
Here's the key insight for code review: near-human entities aren't judged on robot standards, they're judged on human standards. As Saygin et al. put it: "A robot which has an appearance in the uncanny valley range is not judged as a robot doing a passable job at pretending to be human, but instead as an abnormal human doing a bad job at seeming like a normal person."
This evaluation framework shift explains why hybrid code, code that looks human but has AI-like error patterns, triggers cognitive friction that pure AI code doesn't.
Hybrid Code Forces You to Switch Evaluation Modes Constantly
Code review already takes up a lot of time. Microsoft Research found developers spend about 6 hours per week on reviews, while Google data showed median reviewers handle 4 changes weekly. This baseline cognitive investment gets worse when reviews require different evaluation approaches within a single PR.
Research by Gloria Mark at UC Irvine found that knowledge workers on average take 23 minutes and 15 seconds to fully return to a task after an interruption. For programming specifically, research by Chris Parnin found programmers need 10-15 minutes to start editing code after resuming from interruption, with only 10% of programming sessions resuming in under one minute. The cost is real: research by Rubinstein, Meyer, and Evans found that task-switching can cost up to 40% of productive time compared to single-task focus.
Not all switches are equal. Research distinguishes between environmental switches (changing tools or tabs) and conceptual switches (changing mental models or work types). Hybrid PR review is a conceptual switch. Verifying AI-generated patterns requires different cognitive operations than understanding human-written logic.
Sophie Leroy's research on "attention residue" explains what's happening. When you switch tasks mid-stream, part of your attention stays stuck on the prior task. Incomplete or unresolved evaluations create stronger residue, and the more engaging the interrupted task, the greater the residue. In hybrid PR review, each transition between evaluating human code logic and verifying AI code patterns leaves attention residue that makes your next evaluation worse.
Why AI Code Looks Right But Isn't
AI code generation optimizes for plausibility, not correctness. This is a fundamental characteristic that shapes what reviewers need to look for. Research has shown that LLMs "suffer from confabulations (or hallucinations), which can result in them making plausible but incorrect statements." In practice, these models operate "like a statistical pattern matcher" without semantic understanding, generating code that looks correct without verifying it actually is.
The hallucination problem is consistent and substantial. A University of Texas at San Antonio study analyzing 2.23 million code samples found 19.7% contained hallucinated package references: imports and API calls pointing to libraries that don't exist. GPT-series models performed better (a 5.2% hallucination rate), while open-source models reached 21.7%. Alarmingly, when researchers re-ran the same prompts 10 times, 43% of hallucinated packages appeared in every run, making these errors consistent and therefore harder to catch.
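Hallucinated dependencies are one of the few AI failure modes you can screen for mechanically before a human ever reads the diff. Here's a minimal sketch, assuming a Python repo checked out locally; the base branch and helper names are illustrative, not part of any particular tool. It collects the top-level imports in changed files and flags any that don't resolve in the current environment.

```python
import ast
import importlib.util
import subprocess
import sys
from pathlib import Path

def changed_python_files(base: str = "origin/main") -> list[str]:
    """Return .py files changed relative to a base branch (deleted files excluded)."""
    out = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=d", base, "--", "*.py"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line.endswith(".py")]

def top_level_imports(path: str) -> set[str]:
    """Collect the top-level package names a file imports."""
    tree = ast.parse(Path(path).read_text(encoding="utf-8"), filename=path)
    names: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names

if __name__ == "__main__":
    suspects = sorted({
        name
        for path in changed_python_files()
        for name in top_level_imports(path)
        if importlib.util.find_spec(name) is None  # neither stdlib nor installed
    })
    if suspects:
        print("Unresolvable imports (possible hallucinations):", ", ".join(suspects))
        sys.exit(1)
```

A missing spec only means the name doesn't resolve locally: it could be a first-party module, a typo, or a genuinely hallucinated (or typosquatted) package, so treat a hit as a prompt for scrutiny rather than a verdict.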
Stanford research under Professor Dan Boneh found participants with AI coding assistants wrote significantly less secure code than those without, with vulnerability increases in 4 of 5 programming tasks. Worse: users with AI assistance were more likely to believe their code was secure even when it contained more vulnerabilities. NYU's study analyzed 1,689 programs and found 40% contained potentially exploitable vulnerabilities.
AI code also creates maintainability problems you won't see in initial review. GitClear's analysis of 211 million lines of code from 2020-2024 found code churn (code reverted or updated within two weeks) increased from 3.1% to 5.7%, with an 8-fold increase in code duplication. They characterized AI-generated code as resembling "an itinerant contributor, prone to violate the DRY-ness of the repos visited."
Pure AI and Pure Human Code Each Allow Consistent Review Strategies
The uncanny valley effect shows up specifically because hybrid code prevents reviewers from using consistent mental models. When you're reviewing pure human code, you engage top-down comprehension. You're familiar with the author's patterns, team conventions, and architectural context. The task is understanding intent and evaluating logic choices.
For pure AI code, you can apply pattern verification. You're checking for hallucinated dependencies, validating API correctness, testing edge case handling, and verifying security properties. You know the code's origins, so you calibrate your skepticism appropriately. Span's AI code detection gives teams this visibility, identifying AI-generated code with over 95% accuracy.
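What does "testing edge case handling" look like in practice? Often just a reviewer-added parametrized test aimed at the boundaries the generated code was never prompted about. A hedged sketch; the `chunk` helper below is a stand-in for an AI-generated function, not code from any real PR.

```python
import pytest

def chunk(items: list, size: int) -> list[list]:
    """Split items into consecutive batches of at most `size` (stand-in for an AI-generated helper)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]

# Reviewer-added edge cases: empty input, size larger than the list,
# an exact multiple, and the invalid-size path a happy-path demo never hits.
@pytest.mark.parametrize(
    ("items", "size", "expected"),
    [
        ([], 3, []),
        ([1, 2], 5, [[1, 2]]),
        ([1, 2, 3, 4], 2, [[1, 2], [3, 4]]),
    ],
)
def test_chunk_edges(items, size, expected):
    assert chunk(items, size) == expected

def test_chunk_rejects_nonpositive_size():
    with pytest.raises(ValueError):
        chunk([1, 2, 3], 0)
```

The point isn't this particular helper; it's that verification mode has concrete, repeatable moves, which is why knowing a section is AI-generated changes how efficiently you can review it.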
Microsoft Research found that reviewer expertise with specific artifacts significantly affects review usefulness, with useful comments increasing dramatically in a reviewer's first year of familiarity. This matches what cognitive science tells us: experts use fundamentally different comprehension strategies than novices, processing familiar patterns far more efficiently.
Hybrid code defeats both strategies. Human-written sections invite top-down comprehension that may miss AI-inserted hallucinations. AI-generated sections requiring verification interrupt the logic evaluation of surrounding human code. Each transition forces you to switch mental models, a conceptual context switch that creates significant cognitive overhead.
The cognitive load stacks up quickly. Sweller's cognitive load theory builds on the well-known limits of working memory, classically estimated at about 7 items held simultaneously. Hybrid PR review demands tracking the AI code patterns, the human code patterns, how they interact, the correctness of each type, potential hallucinations, and integration with the broader codebase. That easily exceeds working memory capacity, which leads to more oversight errors.
Velocity Gains Are Creating Review Bottlenecks
The gap between how fast we can generate code and how fast we can review it is growing. AI-assisted development produces larger PRs: Faros AI found PR sizes increased 154%, and DORA's 2025 research notes that "bigger changesets are riskier, something that DORA's research has long supported."
The 2025 Cortex report found PRs per author increased 20% year-over-year while incidents per PR increased 23.5%, with change failure rates rising about 30%.
Review capacity isn't scaling to match. Only 10.2% of developers currently use AI for committing and reviewing code, while 58.7% have no plans to adopt AI for review tasks. The workflow is lopsided: AI speeds up code generation but human reviewers remain the bottleneck.
As one developer put it in Stack Overflow's survey: "I expect that as tools mature I will be able to switch from primarily writing code to primarily reviewing generated code."
Elite teams, in benchmarks drawn from 6.1 million PRs, achieve cycle times under 26 hours with rework rates under 3%. These benchmarks get harder to hit when review cognitive load increases. Code Climate's research found the median organization averages a 7% rework rate, with bottom performers exceeding 10%, often driven by "lack of familiarity with the codebase," which is exactly what hybrid AI/human code creates.
What This Means for Your Team
The research paints a clear picture. Code that falls between clearly-AI and clearly-human triggers the uncanny valley effect: your evaluation standards shift to human-code expectations, your sensitivity to imperfections increases, and you can't apply a consistent mental model. Add the roughly 23 minutes it takes to refocus after an interruption and the up-to-40% productivity cost of task-switching, and it becomes clear why hybrid PRs feel uniquely demanding.
Three practical principles emerge from this. First, clear attribution helps. Knowing which code sections are AI-generated lets reviewers apply appropriate verification strategies rather than constantly switching modes. Span's platform provides this visibility, detecting AI-generated code at the chunk level so teams understand exactly where AI contributions exist.
Second, batch structure matters. Separating AI-generated code from human modifications reduces within-PR context switching. Third, review training needs to evolve. The skills for evaluating AI code (hallucination detection, security pattern verification, edge case validation) differ from traditional logic review and benefit from explicit development.
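One lightweight way to get both attribution and batch separation, without waiting on tooling, is a commit-message convention: tag AI-assisted commits with a trailer and give them their own review pass. The trailer name, base branch, and script below are illustrative conventions, not an established standard and not a description of Span's detection.

```python
import subprocess

AI_TRAILER = "AI-Assisted:"  # illustrative convention, e.g. "AI-Assisted: copilot"

def commits_by_origin(base: str = "origin/main") -> tuple[list[str], list[str]]:
    """Split branch commits into AI-assisted and human-only, based on a commit-message trailer."""
    out = subprocess.run(
        ["git", "log", "--format=%H%x1f%B%x1e", f"{base}..HEAD"],
        capture_output=True, text=True, check=True,
    )
    ai_commits, human_commits = [], []
    for record in out.stdout.split("\x1e"):
        if not record.strip():
            continue
        sha, _, message = record.partition("\x1f")
        marked = any(line.startswith(AI_TRAILER) for line in message.splitlines())
        (ai_commits if marked else human_commits).append(sha.strip())
    return ai_commits, human_commits

if __name__ == "__main__":
    ai_commits, human_commits = commits_by_origin()
    print(f"Review pass 1, verification mode: {len(ai_commits)} AI-assisted commits")
    print(f"Review pass 2, logic mode: {len(human_commits)} human-only commits")
```

Whether the attribution comes from a trailer, a bot, or chunk-level detection matters less than the effect: the reviewer stays in one evaluation mode per pass instead of toggling line by line.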
Mori's original recommendation for robotics applies here: "Because of the risk inherent in trying to increase their degree of human likeness to scale the second peak, I recommend that designers instead take the first peak as their goal." For AI code, this suggests making AI contributions clearly identifiable rather than trying to make them indistinguishable from human code. Navigate around the valley rather than through it.




