DEV Community

Gerus Lab

The Productivity Hallucination: Why AI-Generated Code Is Breaking Your DevOps Pipeline

Your Team Writes 3x More Code Than Last Year. So Why Is Nothing Shipping Faster?

Here's a number that should terrify every engineering manager: 84% of developers now use AI coding tools daily. Here's the number that should terrify them more: only 29% trust what they ship to production.

We at Gerus-lab have built over 14 products — from Web3 platforms on TON and Solana to AI-powered SaaS tools. We've watched the AI coding revolution from the inside. And after six months of integrating AI agents into our own pipeline, we have one conclusion that nobody wants to hear:

AI isn't making your team faster. It's creating a hallucination of productivity.


The Data Says You're Slower

Let's start with the most uncomfortable study of 2026.

METR ran a randomized controlled trial — not a survey, not vibes — where experienced developers performed real tasks in their own open-source repositories, with and without AI assistance. The result? With AI, tasks completed 19% slower. But here's the kicker: developers estimated they were 20% faster.

That's a 39-percentage-point gap between perception and reality.

We call this the Productivity Hallucination: you generate more PRs, close more tickets, write more lines of code — and genuinely feel more productive. But the code reaches production at the same speed. Or slower.

Faros AI confirmed this at scale. Their telemetry across 10,000 developers and 1,255 teams with high AI adoption showed:

  • Tasks closed: +21%
  • PRs created: +98%
  • PR size: +154%
  • Review time: +91%
  • Bugs per developer: +9%

Twice as many PRs. Each one twice as large. Reviews taking twice as long. More bugs getting through despite more review effort.

The bottleneck didn't disappear — it moved. From writing code to reviewing it.


The Pipeline Is Choking

CircleCI analyzed 28 million CI/CD workflows in 2026 and reached a damning conclusion: while teams create far more code, less of it makes it into production.

Google's DORA Report adds context: every 25% increase in AI adoption correlates with a 1.5% drop in delivery speed and a 7.2% decline in stability. Not because the code is bad — because the processes downstream can't digest the volume.

At Gerus-lab, we saw this firsthand. One of our engineers generated 11 PRs in a single Tuesday — up from his usual 2-3 per week. Those 11 PRs sat in review for an average of four days. Three lingered for over a week. By the time the last one merged, it had conflicts with main that took another hour to resolve.

By Friday, the two senior engineers handling reviews looked, and I quote, "like they'd been through a war."

Code was written faster than ever. It reached production at the same speed as before.


Why AI Code Is Harder to Review Than Human Code

This is the part most people miss.

Reviewing your colleague's code is one thing — you discussed the approach beforehand, you know their style, you understand their reasoning. AI-generated code has zero context. Every naming choice, every structural decision, every error handling pattern must be evaluated from scratch.

One AI-generated PR often consumes as much cognitive energy as two or three familiar PRs from teammates. And after 200-400 lines, review quality drops sharply regardless of who's reviewing.

VirtusLab's research on cognitive debt (published April 10, 2026) found something even more troubling: developers who regularly use AI rate their own skills higher than those who don't. But on tests without AI assistance, they score worse. AI masks skill gaps by producing working code in areas where the developer would have struggled. From the developer's perspective, it looks like "I solved it." In reality, the tool solved it — and the developer didn't notice they don't understand the result.

This is the second form of the Productivity Hallucination: not just an illusion of team speed, but an illusion of individual competence.


The Million-Line Graveyard

A fintech firm rolled out Cursor across their engineering team. Monthly code volume jumped from 25,000 lines to 250,000. The result wasn't a productivity boost — it was a backlog of one million lines of unreviewed code.

GitHub recorded a 400% increase in AI-generated contributions in 2025. The Python Software Foundation reported a 60% spike in maintainer burnout, with 78% of respondents blaming AI submissions.

And here's the spiral that every team using AI coding for 6+ months recognizes: features that shipped in two days now take two weeks. Each AI session optimizes for the current task without understanding the overall architecture. Business logic bleeds across layers. Files that started single-purpose accumulate unrelated concerns. Verification debt grows 30-40% per quarter.

Daniel Stenberg, author of curl, noticed something fascinating: AI-generated vulnerability reports used to be obvious garbage — easy to reject. Now they're genuinely good. But there are so many of them that maintainers physically can't keep up. Internet Bug Bounty suspended intake entirely.

When AI code was bad, it was easy to filter. Now that it looks good, verification costs more, not less.


What Actually Works: The Kapwing Model

After reviewing dozens of case studies, we found one company that cracked it: Kapwing.

25 people. In Q1 2026, every employee — designers, sales leads, content writers, support — committed code to production. 108 PRs through AI agents in one quarter. Production incidents went down. They canceled their quarterly bug bash because integration with their bug tracker closed issues automatically. 36 engineering days saved per quarter.

What made Kapwing different?

Not the tools. The process:

  1. Curated first tasks — Every non-technical person got a specific, pre-selected task for their first PR. Not "figure it out."
  2. Five months of phased rollout — Infrastructure → Training → Automation. Not "here's Codex, good luck."
  3. Engineers review everything — Every PR, including those from non-technical staff.
  4. Hard PR size limits — 400 lines maximum, enforced.

The CEO personally led the training. Not delegated to an "AI evangelist," not dumped in Notion.


The Framework We Use at Gerus-lab

After our own painful learning curve, we built a framework at Gerus-lab that balances AI speed with delivery quality:

1. Measure Output, Not Input

Stop celebrating PR count and lines of code. Track what matters:

  • Lead time to production (from first commit to deploy)
  • Review latency (how long PRs wait)
  • Rollback rate (how often deploys fail)
  • Incident rate (what breaks after deploy)

If input metrics rise while output metrics stagnate — you have a Productivity Hallucination.
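As a rough sketch, these output metrics can be computed straight from PR and deploy timestamps. The record fields below (`first_commit`, `review_requested`, and so on) are illustrative placeholders, not from any particular tracker's API:

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical PR records; field names are our own, swap in whatever
# your VCS and deploy tooling actually expose.
prs = [
    {"first_commit": datetime(2026, 3, 2, 9, 0),
     "review_requested": datetime(2026, 3, 2, 11, 0),
     "first_review": datetime(2026, 3, 4, 15, 0),
     "deployed": datetime(2026, 3, 6, 10, 0),
     "rolled_back": False},
    {"first_commit": datetime(2026, 3, 3, 14, 0),
     "review_requested": datetime(2026, 3, 3, 16, 0),
     "first_review": datetime(2026, 3, 3, 18, 0),
     "deployed": datetime(2026, 3, 5, 9, 0),
     "rolled_back": True},
]

def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600

# Lead time to production: first commit -> deploy.
lead_times = [hours(pr["deployed"] - pr["first_commit"]) for pr in prs]
# Review latency: review requested -> first review.
latencies = [hours(pr["first_review"] - pr["review_requested"]) for pr in prs]
# Rollback rate: share of deploys that were reverted.
rollback_rate = sum(pr["rolled_back"] for pr in prs) / len(prs)

print(f"median lead time: {median(lead_times):.1f}h")       # 70.0h
print(f"median review latency: {median(latencies):.1f}h")   # 27.0h
print(f"rollback rate: {rollback_rate:.0%}")                # 50%
```

Medians beat averages here: one PR that sits for a week shouldn't be smoothed away by ten that merged in an hour.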

2. Cap PR Size at 400 Lines

AI generated an entire feature? Split it into three PRs. This is non-negotiable. After 400 lines, human review quality collapses. We enforce this with a CI check that blocks oversized PRs.
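A size gate like this can be a short script in the pipeline. The sketch below is one possible shape, not our exact check; the `origin/main` base branch and the CI wiring are assumptions you'd adapt:

```python
import subprocess
import sys

MAX_CHANGED_LINES = 400  # the hard limit; tune to taste

def changed_lines(numstat: str) -> int:
    """Sum added + deleted lines from `git diff --numstat` output.
    Binary files report '-' for both counts and are skipped."""
    total = 0
    for line in numstat.strip().splitlines():
        added, deleted, _path = line.split("\t", 2)
        if added == "-" or deleted == "-":
            continue
        total += int(added) + int(deleted)
    return total

def main() -> int:
    # Diff the PR branch against its base; 'origin/main' is an assumption.
    out = subprocess.run(
        ["git", "diff", "--numstat", "origin/main...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    n = changed_lines(out)
    if n > MAX_CHANGED_LINES:
        print(f"PR too large: {n} changed lines (limit {MAX_CHANGED_LINES}). Split it.")
        return 1
    print(f"PR size OK: {n} changed lines.")
    return 0

# In CI, invoke with: if __name__ == "__main__": sys.exit(main())
```

A nonzero exit code fails the CI job, which is what actually blocks the merge; the message alone changes nothing.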

3. Separate AI-Generated from Human-Written Code

Every AI-generated PR in our pipeline gets a label. Not for blame — for calibration. AI code gets extra review attention because it has no author context. We track metrics separately so we can see if AI-heavy sprints correlate with more incidents.
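A hypothetical sketch of that calibration step: given PR records carrying an `ai-generated` label (our naming convention, nothing standard), split the incident rate by origin:

```python
from collections import defaultdict

# Illustrative PR log; the label name and fields are our own convention.
prs = [
    {"id": 101, "labels": ["ai-generated"], "caused_incident": False},
    {"id": 102, "labels": [], "caused_incident": False},
    {"id": 103, "labels": ["ai-generated"], "caused_incident": True},
    {"id": 104, "labels": [], "caused_incident": False},
]

def incident_rate_by_origin(prs):
    """Group PRs into 'ai' vs 'human' by label, return incident rate per group."""
    counts = defaultdict(lambda: {"prs": 0, "incidents": 0})
    for pr in prs:
        origin = "ai" if "ai-generated" in pr["labels"] else "human"
        counts[origin]["prs"] += 1
        counts[origin]["incidents"] += pr["caused_incident"]
    return {o: c["incidents"] / c["prs"] for o, c in counts.items()}

print(incident_rate_by_origin(prs))  # {'ai': 0.5, 'human': 0.0}
```

The comparison is the whole point: if AI-labeled PRs cause incidents at the same rate as human ones, your review process is absorbing the extra risk; if not, you've found where to add scrutiny.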

4. Rotate Review Load

AI code creates asymmetric load: the person writing does less work while the reviewer does more. We cap review assignments per person per day and rotate aggressively to prevent reviewer burnout.
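One way to sketch the cap-and-rotate idea, with made-up reviewer names and a cap of three reviews per person per day (the cap value is an illustration, not our exact number):

```python
from collections import Counter
from itertools import cycle

DAILY_REVIEW_CAP = 3  # max reviews per reviewer per day; pick your own

def assign_reviews(prs, reviewers, cap=DAILY_REVIEW_CAP):
    """Round-robin PRs across reviewers, skipping anyone at the daily cap.
    Returns {pr_id: reviewer} for what fits today; the rest waits for tomorrow."""
    load = Counter()
    assignments = {}
    pool = cycle(reviewers)
    for pr_id in prs:
        # Try each reviewer at most once per PR before giving up for today.
        for _ in range(len(reviewers)):
            reviewer = next(pool)
            if load[reviewer] < cap:
                assignments[pr_id] = reviewer
                load[reviewer] += 1
                break
    return assignments

today = assign_reviews(prs=[f"PR-{i}" for i in range(1, 9)],
                       reviewers=["alice", "bob", "carol"])
```

Anything that doesn't fit under the caps simply isn't assigned today, which makes the review backlog visible instead of silently piling onto whoever says yes.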

5. Architecture Reviews Before AI Generation

Before anyone prompts an AI agent to build a feature, we do a 15-minute architecture review: Where does this code live? What modules does it touch? What are the boundaries? AI doesn't understand your system's architecture. A human should define the map before AI fills in the territory.


The 25-40% Sweet Spot

Anthropic's latest assessment suggests AI-generated code is now 41-42% of all code globally. But their research points to a sustainable quality threshold somewhere between 25% and 40%. Above that range, degradation starts eating the gains.

This matches our experience at Gerus-lab. On projects where AI generates roughly a third of the code, we see genuine speed improvements. Beyond that, review queues balloon, architectural coherence degrades, and the team starts spending more time fixing AI-introduced issues than they saved by generating them.

The goal isn't maximum AI usage. It's optimal AI usage.


The Uncomfortable Truth

78% of employees use AI without IT approval. A third rarely check the output. 15% hide their usage from managers.

Your PR pipeline is probably already half AI-generated code — and nobody in leadership knows. Input metrics show a throughput increase, and management is happy. What actually made it to production, and who verified it? Nobody tracks that, because officially, nobody deployed AI.

This isn't adoption. It's shadow engineering.


Stop Measuring the Wrong Things

AI coding tools are powerful. We use them daily at Gerus-lab. But power without process creates chaos.

The companies that will win aren't the ones generating the most code — they're the ones delivering the most value. And right now, those are very different things.

If you're seeing input metrics soar while delivery stays flat, you don't have a productivity gain. You have a productivity hallucination.

The cure isn't less AI. It's better engineering.


At Gerus-lab, we help teams build products that actually ship — from Web3 platforms and AI tools to SaaS products. If you're struggling with delivery velocity, we've been there. Let's talk.


What's your experience? Has AI actually sped up your team's delivery, or just the coding part? Drop a comment — we're genuinely curious.
