Vibehackers

Anthropic Quietly Showed Their Own Tool Drops Dev Skill 17%

A randomized controlled trial published in February 2026, authored by two Anthropic researchers, tested what happens when developers learn a new library with versus without an AI coding assistant.

The productivity result was boring: no significant difference in completion time.

The mastery result was not.

The 17%

52 mostly-junior engineers. New library (Trio, async Python). Half got AI on top of search + docs. Half got search + docs only.
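If you haven't used Trio: it's structured-concurrency async Python, where tasks are started inside a "nursery" and the enclosing block only exits once every task finishes. A minimal sketch of the flavor of code involved (a generic illustration, not the study's actual task):

```python
# Minimal Trio sketch -- a generic illustration, not the study's task.
import trio

async def fetch(name: str, delay: float) -> None:
    await trio.sleep(delay)  # stand-in for real async I/O
    print(f"{name} finished after {delay}s")

async def main() -> None:
    # The nursery scopes concurrency: this block exits only
    # once both child tasks have completed (or one has crashed).
    async with trio.open_nursery() as nursery:
        nursery.start_soon(fetch, "task-a", 0.1)
        nursery.start_soon(fetch, "task-b", 0.2)

trio.run(main)
```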

After they finished the work, both groups took a comprehension test on what they'd just built. Code reading, debugging, conceptual questions.

The AI-assisted group scored 17% lower.

  • Cohen's d = 0.738
  • p = 0.010
  • Roughly the equivalent of dropping two letter grades

This isn't "AI users felt less confident." This is "they couldn't explain or debug the code they shipped."
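If the stats are unfamiliar: Cohen's d is the gap between two group means divided by their pooled standard deviation, so d = 0.738 means the groups sit roughly three-quarters of a standard deviation apart. A quick sketch with invented scores (not the study's data):

```python
# Cohen's d: standardized mean difference between two groups.
# The score lists below are invented for illustration only.
import math

def cohens_d(a: list[float], b: list[float]) -> float:
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    # Pooled SD, weighted by each group's degrees of freedom.
    pooled_sd = math.sqrt(
        ((len(a) - 1) * var_a + (len(b) - 1) * var_b)
        / (len(a) + len(b) - 2)
    )
    return (mean_a - mean_b) / pooled_sd

no_ai = [85, 60, 75, 50, 90]    # hypothetical comprehension scores
with_ai = [70, 45, 80, 40, 65]
print(round(cohens_d(no_ai, with_ai), 2))  # 0.71, a medium-to-large effect
```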

The paper. Anthropic's own writeup.

The Reason

The split inside the AI group was the most interesting part.

Conceptual-inquiry users — devs who asked things like "what does this do?", "why this pattern?", "explain X" — scored 65% or higher on the comprehension test.

Code-delegation users — devs who prompted "write this function" — scored below 40%.

Same tool. Same task. Same time. Just used differently.

The line from the paper: AI helps you finish. It can hurt your understanding of what you finished.

Why This Matters

If you're using AI to ship faster, you're trading something away. The Anthropic data suggests what that something is.

You're trading the ability to maintain it later.

The author of AI-generated code is, by the time the bug report arrives, often somebody who couldn't pass a comprehension test on it.

That's not a 5-year horizon problem. That's a "next sprint, this same dev" problem.

It's Not Just Anthropic

The Anthropic skill-formation study is the cleanest number. It's not the only one pointing the same direction.

  • METR (July 2025): 16 senior open-source devs working in their own real repos. The devs estimated AI sped them up 20%; they were actually 19% slower. A follow-up study, started in August 2025, fell apart because too many participants refused the non-AI tasks, so METR has no clean follow-up number to publish.

  • Cursor diff-in-diff (Nov 2025): Open-source projects that adopted Cursor saw lines added spike +281% in month 1 and +48% in month 2 before returning to baseline by month 3. Static analysis warnings rose 29.7% and code complexity rose 40.7%. The future-velocity penalty: a 100% increase in code complexity is associated with a 64.5% decrease in development velocity over time.

  • Echoes of AI (2025): 151 developers, Java + Spring Boot. With AI, Phase 1 was 30.7% faster; for habitual AI users, 55.9% faster. In Phase 2, a different developer extended the code without AI, and there was no significant difference in completion time or quality. The downstream cost was statistically zero. The downstream benefit was also statistically zero.

The pattern across studies: modest gains up front, real and measurable costs later, and who pays those costs depends on whether the next developer to touch the code is you.

The Counter-Evidence

To be fair: Cui, Demirer, Jaffe et al. (2025), peer-reviewed in Management Science, ran three field experiments with 4,867 developers at Microsoft, Accenture, and a Fortune 100 company. Result: +26.08% completed tasks with Copilot access.

This is the strongest "AI helps" finding in the literature. It's a real, large, peer-reviewed number.

The catch: the gains skewed heavily toward less-experienced developers. Senior devs at the same companies showed smaller effects, and METR's senior open-source devs on legacy code showed negative effects.

The least-bad reading: AI helps tractable enterprise tasks done by less-experienced devs. AI doesn't help (and may hurt) senior devs working on mature legacy code. Both findings are robust. Both should be cited together.

What You Should Do Differently

If the Anthropic skill-formation result is even directionally right, the practical change is small but real.

Ask AI questions. Don't ask AI to write code.

  • "What's the difference between X and Y here?"
  • "Why does this pattern break in case Z?"
  • "Explain why the docs recommend X instead of Y."

Then write the code yourself. Or write a first version, then ask AI to critique it.

This is exactly the workflow the 65%+ comprehension-test scorers were using. Same tool, dramatically different outcome.

Or: if you delegate code generation, do it on code you're never going to maintain. Throwaways. Spikes. One-shot scripts.

The minute you'll be the one fixing the bug six weeks from now, you want to be in the 65% group, not the sub-40% group.

TL;DR

  • AI coding tools deliver real but modest completion-time gains. Not 10x. Not 5x. Probably 0–30%.
  • They have non-trivial costs the discourse ignores: comprehension drops 17%, code complexity rises 40%, downstream velocity falls.
  • How you use AI dominates whether you use it. Inquiry > delegation, by a 25-point margin on actual comprehension tests.
  • The gains and costs hit different populations. Juniors on tractable code: real wins. Seniors on legacy: real losses.

The "AI 10x developer" framing is the wrong question. The real question is whether your future self can debug the code your present self shipped with AI.

If you want the full breakdown — eight studies, sourcing notes, methodological caveats — we wrote the longer evidence review. This was the short version.
