Everyone keeps quoting METR's July 2025 study: developers thought AI sped them up by 20%, but they were actually 19% slower. It became the canonical "AI productivity is a mirage" data point. The line has been cited in every skeptical AI think-piece since.
Then a quieter thing happened. METR ran a follow-up. Anthropic ran a controlled trial. A team in Helsinki and elsewhere ran a two-stage RCT. SAP ran a wearables study. A diff-in-diff study on Cursor adoption dropped in November 2025. MIT economists ran three field experiments. None of these went viral. None of them tell the same story.
We went looking. This is what the controlled, quasi-experimental, and instrumented field-study literature on AI coding productivity actually says in early 2026 — not the surveys, not the vendor case studies, not the "developers report" claims. Randomized controlled trials where they exist, rigorous quasi-experiments and longitudinal natural experiments where they're the cleanest evidence available, and one diff-in-diff on Cursor that's too good to leave out.
Spoiler: the picture is not "AI gives 10x." It's also not "AI makes you slower." It's "modest gains, real costs, depends a lot on what you measure, and the headline number changes every six months."
The evidence has gotten richer over the last six months. The conclusion has gotten less crisp.
The Discourse Problem
Before the studies, a note on what gets counted as evidence.
Most claims about AI coding productivity come from one of three sources: vendor case studies (GitHub, Cursor, Copilot teams reporting on their own customers), developer surveys (Stack Overflow, JetBrains, DORA), and individual blog posts. None of these are useless. All of them are weak.
Vendor case studies have an obvious incentive problem. Surveys measure perception, not behavior — and the perception/behavior gap is the entire point of the METR finding. Individual blog posts are anecdotes.
Randomized controlled trials are scarce because they're expensive and slow. You need real developers, real tasks, control conditions, and enough sample size to detect modest effects against high task-to-task variance. The papers below are what we have.
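To make "enough sample size" concrete: a simple two-arm comparison powered to detect a modest effect needs far more developers than most of these trials recruited. A back-of-envelope power calculation, a sketch rather than any specific study's design:

```python
from statsmodels.stats.power import TTestIndPower

# Participants needed per arm to detect a modest effect (Cohen's d = 0.3)
# at 80% power with a two-sided alpha of 0.05.
n = TTestIndPower().solve_power(effect_size=0.3, power=0.8, alpha=0.05)
print(round(n))  # ~175 developers per arm
```

Within-subject designs like METR's, where each developer works both with and without AI across many tasks, buy back some of that power, which is how a 16-developer study can detect anything at all.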
1. METR's "We Still Don't Know" Follow-Up
The original METR study ran in early 2025: 16 experienced open-source developers, 246 tasks on their own mature repositories, mostly Cursor Pro with Claude 3.5/3.7 Sonnet. Result: developers were 19% slower with AI but estimated afterwards that it had sped them up by 20%. The perception gap drove the citation count.
METR's follow-up started in August 2025. Same authors, larger pool (57 developers), more compact tasks, lower stipend ($50/hr instead of $150/hr to broaden recruitment). The plan was to validate or update the original finding using more recent tooling — Claude 4, GPT-5-era models.
The Feb 2026 write-up is unusually candid for a research blog post. The short version: the experiment broke.
Among the 10 developers who returned from the original study, AI use produced an 18% speedup — a sign-flip from the original. Among 47 new participants, the effect dropped to 4% speedup, statistically indistinguishable from zero. But METR flagged the results as unreliable: a substantial fraction of developers declined non-AI tasks ("I don't work this way anymore, I'm not going to pretend"), which biases the comparison downward for AI. The authors stopped short of publishing a headline number.
The most-cited paper in this entire conversation does not, as of mid-2026, have a clean follow-up using current tooling. METR is redesigning the experiment.
Practical reading: the original 19%-slower finding came from experienced devs working on their own legacy repos with early-2025 tools. It was always a narrow result. It does not generalize to junior developers, greenfield projects, or 2026 tooling, and the follow-up didn't give us a new number to hang anything on.
2. Anthropic on Skill Formation: 17% Comprehension Drop
Shen & Tamkin (Feb 2026) ran a different kind of trial. Fifty-two mostly junior engineers, all with at least a year of weekly Python, were asked to learn Trio — an asynchronous programming library none of them had used. Half got an AI coding assistant on top of search and docs. Half got search and docs only.
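For context on what participants had to learn: Trio's structured-concurrency model differs enough from asyncio that even experienced Python developers start near zero, which is what made it a good test library. A minimal example of its core nursery pattern (ours, not one of the study's tasks):

```python
import trio

async def fetch(name: str, delay: float) -> None:
    # Simulate an I/O-bound task.
    await trio.sleep(delay)
    print(f"{name} done")

async def main() -> None:
    # A nursery is Trio's structured-concurrency primitive: every task
    # started inside it must finish before the block exits.
    async with trio.open_nursery() as nursery:
        nursery.start_soon(fetch, "a", 1.0)
        nursery.start_soon(fetch, "b", 2.0)

trio.run(main)
```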
The productivity finding was the unremarkable one: no statistically significant difference in time to complete the learning tasks. The interesting finding was the test that came after.
When researchers gave participants a comprehension test on Trio — code reading, debugging, conceptual questions — the AI-assisted group scored 17% lower. Cohen's d = 0.738, p = 0.010. That's not noise. It's roughly the equivalent of dropping two letter grades.
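For readers who don't carry effect sizes around: Cohen's d is the difference in group means scaled by the pooled standard deviation, and 0.74 sits in the medium-to-large range by the usual conventions. A quick sketch of the computation, with invented scores rather than the study's data:

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    # Difference in means divided by the pooled standard deviation.
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)

# Hypothetical comprehension scores: control group vs AI-assisted group.
control = np.array([72, 80, 65, 78, 70, 75, 68, 82])
assisted = np.array([60, 66, 55, 70, 58, 64, 52, 69])
print(cohens_d(control, assisted))
```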
The mechanism was visible in how people used the tool. Participants who used the AI for conceptual inquiry — asking what something meant, requesting explanations, posing follow-up questions — scored 65% or higher on the comprehension test. Participants who delegated code generation — "write this function for me" — scored below 40%.
Anthropic's own writeup is straightforward about the implication: AI helps you finish; it can hurt your understanding of what you finished. InfoQ covered it under "Reduces Developer Skill Mastery by 17%."
This is the study that matters most for juniors and for any code anyone has to maintain later. The author of the AI-generated code is, by the time the bug report arrives, often somebody who couldn't pass a comprehension test on it.
3. Echoes of AI: The Speedup That Doesn't Carry
The Echoes of AI paper is the cleanest two-stage design in the recent literature. 151 participants, 95% professional developers. Java with Spring Boot, working on RecipeFinder — a deliberately ~2 KLoC app salted with code smells, an injected bug, and incomplete tests.
Phase 1: original developers add a new feature. Some get an AI assistant, some don't. With AI, median completion time dropped by 30.7%. Among habitual AI users (developers who had already integrated AI into their daily workflow), the speedup was 55.9%. This is one of the strongest controlled-trial results favoring AI in the literature.
Phase 2: a different developer, without AI, extends the same code. This is the cost-of-AI-code question reframed as a controlled experiment. Does code written with AI cost more to maintain than code written without?
The answer was: not measurably. Phase 2 showed no statistically significant difference in completion time or code quality between the AI-authored and human-authored features. A Bayesian analysis put it bluntly: any maintainability advantage or disadvantage from AI use was "at most small and highly uncertain."
This cuts two ways. For AI optimists, it's the result they've been waiting for — a real, large Phase 1 speedup with no detectable downstream tax. For AI skeptics, it's a single study on a small codebase with a specific stack, and the question of long-term maintainability lives on a timescale Phase 2 didn't measure.
Both are right. The paper is the cleanest piece of recent evidence and doesn't, on its own, settle the question.
4. The SAP Wearables Study: Cognitive Load Is the Hidden Cost
A team studied SAP developers at work using a measurement stack heavier than anything else in this list: multi-day diary surveys, full screen/keyboard/mouse capture, and physiological wristband biometrics — heart rate variability, electrodermal activity — to estimate cognitive load directly.
This is not a randomized controlled trial. It's an observational, deeply instrumented field study at a single company, with small participant counts in each coding session. Treat the findings as descriptive, not causal. Worth including because it picks up something the time-to-completion studies miss.
Two patterns from the controlled coding sessions (Java: coding, debugging, docs, unit tests, brainstorming):
Moderate AI use sped developers up. Heavy AI use slowed them down. Not a contradiction — the relationship between AI usage intensity and productivity was non-monotonic. There was a sweet spot, and developers past it spent more time verifying answers, rephrasing prompts, and switching between code and chat than they'd have spent just doing the task.
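"Non-monotonic" in concrete terms: productivity rises with usage up to a point and then falls. A toy inverted-U, with invented coefficients, just to make the shape explicit:

```python
import numpy as np

# Invented inverted-U: productivity as a function of the share of working
# time spent interacting with the AI. Illustrative only, not fitted to SAP data.
usage = np.linspace(0, 1, 101)
productivity = 1 + 0.8 * usage - 1.2 * usage ** 2

peak = usage[np.argmax(productivity)]
print(f"sweet spot at ~{peak:.2f} of time with AI")  # peaks at 1/3 on this curve
```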
Context-switching tanked productivity independently. Developers who flipped frequently between editor, chat window, and the AI's output produced less and were measurably more cognitively loaded than developers who used AI in longer focused stretches. A related IEEE paper from the same group made cognitive-load measurement the central question.
The qualitative finding was the one most worth quoting: AI was simultaneously perceived as raising productivity and raising cognitive load. Developers reported feeling faster and feeling more tired. The wristband data tracked the second half of that.
The implication for tool design is clear: chat-style interaction outperformed inline completions when the task required actual reasoning, and the productivity-per-cognitive-watt of AI usage drops sharply once usage gets compulsive. The implication for individual practice is more uncomfortable: feeling faster is not evidence of being faster, and feeling fine is not evidence that the effort was cheap.
5. Cursor: A Diff-in-Diff with Hard Numbers
Speed at the Cost of Quality, published November 2025 and revised in a v2 titled "Does AI-Assisted Coding Deliver? A Difference-in-Differences Study of Cursor's Impact on Software Projects," is the closest thing the literature has to a longitudinal natural experiment. The authors compared open-source projects that adopted Cursor against matched controls that didn't, tracking commits, lines added, and code-quality metrics over months.
The headline finding is temporal: velocity rises sharply in the first month, then returns toward baseline.
Month 1 after Cursor adoption:
- Lines added: +281.3% versus matched controls
- Commits: +55.4%
Month 2:
- Lines added: +48.4%
- Commits: +14.5%
Month 3:
- Both metrics return to baseline
That pattern looks more like a temporary adoption shock than a durable step change.
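For anyone who hasn't run one: the workhorse behind numbers like these is a two-way fixed-effects regression over a panel of project-months, treated projects versus matched controls. A minimal static version of the setup (the file and column names are hypothetical, and the paper estimates month-by-month dynamic effects rather than a single coefficient):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical panel: one row per project per month. `adopted` is 1 for
# treated projects in post-adoption months, 0 otherwise.
df = pd.read_csv("project_months.csv")  # columns: project, month, adopted, loc_added

# Two-way fixed effects: project dummies absorb level differences between
# projects, calendar-month dummies absorb shocks common to all projects.
# The coefficient on `adopted` is the diff-in-diff estimate.
model = smf.ols("loc_added ~ adopted + C(project) + C(month)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["project"]}
)
print(model.params["adopted"], model.pvalues["adopted"])
```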
But the quality metrics don't settle. Across the full window:
- Static analysis warnings: +29.7%
- Code complexity (cognitive complexity, the standard tooling metric): +40.7%
These rises are persistent, not transient, and the paper's authors check whether the complexity increase is just a side effect of writing more code. It isn't. After controlling for velocity dynamics, Cursor adoption still adds a 9.0% baseline increase in code complexity that doesn't go away.
The paper's regression of future velocity against accumulated complexity points to a tech-debt feedback loop: a 100% increase in code complexity is associated with a 64.5% decrease in development velocity over time, and a 100% increase in static analysis warnings with a 50.3% decrease.
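One way to read an elasticity like that, assuming the regression is log-log (our assumption; check the paper's specification):

```python
import math

# Under a log-log model, "100% more complexity -> 64.5% less velocity"
# corresponds to an elasticity beta satisfying 2 ** beta = 1 - 0.645.
beta = math.log(1 - 0.645, 2)
print(beta)  # about -1.49: velocity falls faster than complexity grows
```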
The mechanism the paper proposes: AI lets you ship a feature in month 1. The shipped code is denser and more warning-prone. Future work in that codebase is slower because the codebase is harder to reason about. The framework is plausible and the within-sample numbers are large, but this is an observational diff-in-diff on open-source projects — the authors themselves flag limits on external validity, and the long-term causal claim is the most contestable part of the paper.
A related study, AI IDEs or Autonomous Agents?, looking at the transition from IDE-based AI assistants to autonomous coding agents, found analogous patterns: a "significant, large, but transient" velocity increase paired with "significant and persistent" rises in static-analysis warnings and code complexity. Different tools, similar shape.
6. The Counter-Evidence: MIT/Microsoft/Accenture
Honest research roundups have to include the inconvenient evidence. The strongest counter to the "AI doesn't really help" reading is Cui, Demirer, Jaffe, Musolff, Peng, and Salz (2025), now peer-reviewed in Management Science — three field experiments at Microsoft, Accenture, and an anonymous Fortune 100 company, randomizing GitHub Copilot access across 4,867 developers.
The headline: a 26.08% increase in completed tasks for Copilot-equipped developers, pooled across all three sites. Code commits up 13.55%. Compilations up 38.38%. None of these are small.
The catch — and it's important — is in the subgroup analysis. The gains skewed heavily to less-experienced developers. Senior engineers at the same companies showed smaller effects. The size of the developer pool also matters: 4,867 across three large enterprises is structurally different from METR's 16 senior open-source maintainers working on their own repos.
So the field-experiment evidence and the open-source RCT evidence are not in direct contradiction. They are measuring different populations. The best-supported reading is narrower: Copilot access improved throughput in large enterprise settings, especially for less-experienced developers; METR-style results look much weaker for senior maintainers working in familiar mature repos. Both findings are robust enough to cite, and they should be cited together.
A separate workplace RCT, "Dear Diary", ran a similar design in a different setting and reached a comparable conclusion: real but modest gains, with high variance.
7. What We Couldn't Find
A few topics where we went looking and came up empty, in case you have better sources:
A clean RCT on agentic coding (Claude Code, Codex, Devin) at scale. There are descriptive empirical studies of agentic pull requests on GitHub — 33,596 PRs across five agents, revert rates by agent ranging from 0.7% for Codex to 7.6% for Copilot — but no randomized trial. This is the next frontier and the literature hasn't caught up.
A productivity study controlling for prompt skill. Every paper above pools developers regardless of how good they are at using the tool. The Anthropic skill-formation study hints that how you use AI dominates whether you use it — but there's no RCT that randomizes prompt training as a separate intervention.
A long-horizon longitudinal study. Cursor's diff-in-diff and the agentic-velocity paper both look at months. The hardest version of the question — what does AI-authored code cost five years in, when nobody remembers why it was written that way — is unanswered. Comprehension debt, if Anthropic's skill-formation result holds, makes this worse.
Quality-controlled study of vibe coding specifically. The whole "build it from a prompt, iterate against the running app" loop hasn't been measured in a controlled setting. We covered the risks and the team rollout question, but the experimental evidence for the workflow is still mostly anecdotal.
If you know of a clean RCT or large-scale natural experiment we missed, send it.
What the Evidence Actually Supports
Pulling these studies together, here is what survives:
AI coding tools deliver real but modest gains, especially for short cycles, smaller teams, less-experienced developers, and tractable tasks. The Microsoft/Accenture field experiments are the strongest evidence for this. Echoes of AI's Phase 1 supports it too.
The gains are not 10x. They are not 5x. They are, in well-designed studies, somewhere between "indistinguishable from zero" and "26-30% on completion time." That is a meaningful gain. It is also nowhere near the productivity miracle that mass-engineer-layoff narratives require.
The gains often come with non-trivial costs. Cognitive load goes up (SAP). Comprehension goes down (Anthropic). Code complexity grows persistently (Cursor diff-in-diff). The maintenance question is undersettled (Echoes of AI Phase 2).
Experienced developers on familiar mature codebases appear to get less benefit, and may pay more in fragmentation. This is the population where AI optimism most often fails experimentally — METR, the SAP heavy-usage curve, the senior-engineer subgroup in the Microsoft RCTs all point in a similar direction.
Tool fluency dominates tool access. Anthropic's skill-formation study is the cleanest single result here: the same tool, used differently, produced 65%+ vs sub-40% test scores. The first-order question for any team rolling out AI is not "do they have access" but "are they using it for inquiry or for delegation."
The 2026 evidence narrows rather than widens the slices where AI actually pays off: junior tasks on tractable code, focused stretches not compulsive checks, inquiry-driven prompting, short-horizon features.
The unresolved question is the second-year shape of the curve. The Cursor diff-in-diff hints at a maintenance tax that arrives after the velocity surge fades, but its design is observational and limited to open-source projects in a handful of languages. Whether the same pattern persists, weakens, or disappears in proprietary enterprise codebases — with different developer populations, review cultures, and quality gates — is exactly the study we don't have yet.
Notes on Sourcing
Numbers above come from the abstracts and results sections of the cited papers, not from press coverage or summaries. Where the press coverage disagrees with the paper (e.g., the Echoes of AI Phase 2 result is often summarized as "downstream cost" — the actual paper says "no significant difference"), we went with what the paper says. If we got something wrong, point it out and we'll fix it.
The full citation set:
- METR original: arXiv 2507.09089 · blog
- METR follow-up: blog
- Shen & Tamkin — Skill Formation: arXiv 2601.20245 · Anthropic · InfoQ
- Echoes of AI: arXiv 2507.00788
- SAP wearables: IEEE 11121737 · IEEE 11024407
- Cursor diff-in-diff (Speed at the Cost of Quality): arXiv 2511.04427
- AI IDEs or Autonomous Agents: arXiv 2601.13597
- Cui et al. — three field experiments: Management Science · SSRN · MIT PDF
- Dear Diary RCT: arXiv 2410.18334
Originally published on Vibehackers.io.