DEV Community

Developers Think AI Makes Them 24% Faster. The Data Says 19% Slower.

Matthew Hou on February 24, 2026

Last month, METR published a study that should make every developer uncomfortable. They took 16 experienced open-source developers — people who kn...
leob • Edited

Maybe we should move away a bit from the idea of using AI tools for "coding" only, and use it more in an 'advisory' role instead, as virtual brainstorming buddies to sound ideas off of - to generate ideas ...

Coding, yes, but only for the "boring" stuff, setting up the nitty gritty of a project (tooling etc), pure boilerplate etc - not the parts where writing the code actually feels like a worthwhile thing to do!

Matthew Hou

Yeah, the "advisory role" framing resonates. I've actually been shifting toward that myself — using AI more as a thinking partner than a code generator. The best sessions I have are when I describe a problem and go back and forth on approaches before writing anything.

And you're right about the boilerplate distinction. There's a real difference between "code I need to exist" and "code I need to understand." AI is great at the first category. For the second, I'd rather write it myself and have AI poke holes in it afterward.

leob

Wholly agree! This also reminds me of another recent article on dev.to, where the author argues that coding, even before AI arrived, was never more than 20-25% of the work anyway (I think he mentioned an even lower percentage) - the rest is thinking, planning, testing, debugging, deploying etc ... so we're now using AI to automate part of that 20% - maybe we should see how it can help more with the other 80%!

Matthew Hou

That's the reframe I keep coming back to. We're optimizing the part of the job that was already the smallest slice. The 80% — understanding requirements, debugging across system boundaries, figuring out what to build in the first place — that's where the real leverage is. I've been getting more value from using AI as a thinking partner during design than as a code generator during implementation. The code part is almost the easy part.

leob • Edited

Totally, and the advantage is also that it's low risk - you ask for advice or ideas, and then you use them or you don't - but when AI spits out a few hundred lines of code, the onus is on you to check/review it, and make sure there are no bugs or security holes in it ... I do think the whole "AI for coding" debate might need a bit of a rethink as to what 'strategies' are in fact most productive (smallest pains, biggest gains) ... keep an open mind!

Mahima From HeyDev

This matches what I’ve seen on real codebases - the “speedup” from AI shows up in greenfield work, but it often flips negative once you’re debugging across boundaries (tests, CI, infra, prod data). One thing that helped our team was treating AI output like a junior PR: tight diff size, explicit acceptance criteria, and running the full test suite before you trust the change. I’m curious if METR broke down the slowdown by task type (feature work vs refactors vs bugfixes), because the variance there is huge.

Matthew Hou

The "treat AI output like a junior PR" framing is exactly right — tight diff size is the key part people miss. I've noticed the moment a single AI-generated change touches more than ~200 lines, my ability to catch subtle bugs drops off a cliff. On the METR task type breakdown — they didn't publish granular splits, but from what I've seen in my own work, refactors are where AI hurts the most. The existing code has implicit constraints that AI doesn't see, so you end up debugging context it never had.
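A tight diff-size limit can even be automated as a pre-review gate. Here's a minimal sketch, assuming you feed it the output of `git diff --numstat`; the `parse_numstat` and `needs_split` helpers and the 200-line threshold are illustrative choices, not anything from the study:

```python
# Sketch: flag AI-generated diffs whose total changed lines exceed a
# review budget. Input is the text produced by `git diff --numstat`.

def parse_numstat(numstat: str) -> list[tuple[int, int, str]]:
    """Parse `git diff --numstat` output into (added, deleted, path) rows."""
    rows = []
    for line in numstat.strip().splitlines():
        added, deleted, path = line.split("\t")
        # Binary files show "-" for both counts; treat them as 0 text lines.
        rows.append((int(added) if added != "-" else 0,
                     int(deleted) if deleted != "-" else 0,
                     path))
    return rows

def needs_split(numstat: str, limit: int = 200) -> bool:
    """True when the total changed lines exceed the review budget."""
    total = sum(a + d for a, d, _ in parse_numstat(numstat))
    return total > limit

sample = "150\t30\tsrc/api.py\n40\t5\ttests/test_api.py"
print(needs_split(sample))  # 225 changed lines > 200, so True
```

Wiring this into a pre-commit hook or CI step makes "keep the diff small" a hard rule instead of a good intention.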

Matthew Hou

"Movement vs review tax" — that's a really useful framing. I've started thinking about it as an attention budget. AI is great at generating movement, but every line it writes draws from your review budget. The net gain depends entirely on whether the movement was in a direction you actually needed to go.

The teams I've seen handle this well aren't trying to make AI write more code — they're investing in making the review step cheaper. Better types, better tests, better module isolation. If you can glance at a diff and know whether it's right in 10 seconds instead of 10 minutes, that's where the real productivity gain lives.

Mahima From HeyDev

This matches what I’ve seen in teams adopting AI tooling - the easy path is cranking out more code, but the real constraint becomes attention and review bandwidth.

The “treat verification as the job” point is huge: you need fast feedback loops (tests, linters, tracing) so the human can spend time on intent, not spelunking.

One thing that helped us is front-loading constraints in a short checklist before prompting, then requiring the AI to propose tests and failure cases first.
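A front-loaded checklist like that can live in code next to the repo so every prompt carries the same constraints. A hypothetical sketch (the checklist items and wording are only examples, not a canonical list):

```python
# Sketch: build a prompt from a fixed constraints checklist, asking the
# model for tests and failure cases before any implementation.

CHECKLIST = [
    "Diff must stay under 200 changed lines",
    "Reuse existing helpers instead of adding new ones",
    "No new dependencies",
]

def build_prompt(task: str, checklist: list[str]) -> str:
    constraints = "\n".join(f"- {c}" for c in checklist)
    return (
        f"Task: {task}\n"
        f"Constraints:\n{constraints}\n"
        "Before writing any implementation, list the test cases and "
        "failure modes you would cover, and wait for confirmation."
    )

print(build_prompt("Add retry logic to the HTTP client", CHECKLIST))
```

The point is less the exact wording than that the constraints are versioned and identical across everyone's prompts.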

Curious if METR broke down the slowdown by codebase familiarity or just raw experience level?

Matthew Hou

The "propose tests and failure cases first" pattern is something I've been converging on too. It flips the dynamic — instead of generating code and then figuring out if it's right, you're establishing what "right" means upfront. Curious how detailed your checklists get. Mine started as 3-4 items and have grown to about 10, which makes me wonder if I'm over-engineering the prompt instead of the code.

Hilton Fernandes

I think AI is useful for developing code in codebases one is not acquainted with. Because it learns from existing code, it usually brings in fragments that are up to date with new and updated versions of APIs and techniques. It's also useful for routine tasks that are already very well established -- that is, boilerplate code. It doesn't particularly excel at new tasks. In that case the generated code should be seen as prototype code: it exposes problems and possible solutions, but it isn't ripe and should be used to inspire the writing of the useful code.

Ingo Steinke, web developer

Adapting boilerplate code is fine and valid, like create-react-app, only more generic. Our industry shouldn't have needed expensive LLMs to do that, though. Debugging? AI can understand Tailwind and TypeScript, but a legacy web project from 2016? No chance, unless it's just boilerplate from ten years ago.

Matthew Hou

"Prototype coding" is a great way to put it. That's pretty much how I treat AI output now — it's a first draft that shows me the shape of a solution, not the solution itself. Especially useful when you're working with an unfamiliar API and need to see what the integration surface looks like before committing to an approach.

The key shift for me was stopping to expect production-ready code and starting to expect "good enough to learn from." Once you adjust that expectation, the frustration drops significantly.

signalstack

The 'attention redistribution' framing is the right diagnosis. Generation got cheap. Verification didn't.

I run a few AI models in production — parallel workloads, different models handling different tasks. The pull is always toward more: more agents, more parallelism, more throughput. But the real constraint doesn't change: how much cognitive load does it take a human to audit what came out?

A setup with three models producing clean, auditable outputs beats ten models producing plausible-but-questionable ones. Every time. The overhead compounds.

The point about expertise interfering with AI output is underappreciated. When you already have a strong mental model, a confident-but-wrong suggestion doesn't just waste time — it has to be actively rejected. That rejection costs more than silence would have. For a junior dev with weak priors, AI fills gaps. For someone who already knows the answer, it often adds noise you have to fight through.

The Dark Factory direction is the honest conclusion. You don't eliminate the human verification cost. You push it earlier, into test design and spec writing. Which is basically just the old TDD argument wearing new clothes.

Matthew Hou

The cognitive load point is the one most people skip over. "Just add more agents" sounds great until you're spending more time reviewing outputs than you saved generating them. I've hit that wall — at some point you realize the bottleneck was never typing speed.

And yeah, the TDD parallel is real. Writing good specs and test cases upfront is basically the same discipline, just reframed for a world where the machine writes the first draft. The skill shifts from "can I write this" to "can I define what correct looks like before anything runs."
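That "define what correct looks like first" discipline can be made concrete as a contract of assertions written before any implementation exists. A toy sketch with a made-up `slugify` spec (the function and its requirements are invented for illustration, nothing here is from the study):

```python
import re

# Sketch: the spec comes first, as executable assertions. Any
# implementation (human- or AI-written) must pass check_slugify.

def check_slugify(slugify) -> None:
    """The contract: what 'correct' means, defined before any code runs."""
    assert slugify("Hello World") == "hello-world"
    assert slugify("  padded  ") == "padded"
    assert slugify("a--b") == "a-b"   # repeated separators collapse
    assert slugify("") == ""          # degenerate input stays empty

# A reference implementation satisfying the contract:
def slugify(text: str) -> str:
    text = text.strip().lower().replace(" ", "-")
    return re.sub(r"-{2,}", "-", text)

check_slugify(slugify)  # raises AssertionError if the contract breaks
```

Handing `check_slugify` to the model along with the task description is the TDD move restated: the machine writes the first draft, but the human defined "done" before it started.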

Ingo Steinke, web developer

Where's the "dopamine hit" when AI generates 200 lines of code that should have been 20, hides at least one subtle bug in there, adds five paragraphs of text and a desperate call to action, and then, when you pinpoint the error, utters verbose excuses, fixes the error and adds new ones? This is just bullshit making me even more disappointed and angry when fellow coworkers insist that AI makes them "more productive". Hope this study will open their eyes!

Matthew Hou

Ha, you're describing a very real pattern. The verbosity is genuinely one of the most annoying things — you ask for a 5-line fix and get 80 lines of refactored code plus an essay explaining why.

I think the frustration your coworkers cause is actually a separate problem from the tool itself. The tool has real limitations. But "AI makes me more productive" and "AI makes me feel more productive" can both be true for different tasks and different people. The METR data just makes it harder to hand-wave away the gap between perception and measurement.

david duymelinck

I read the Next.js rebuild post from Cloudflare yesterday. The part that struck me is their way of working: they define small tasks and let AI work on those.
It's a concrete example of the "AI is good at doing small things" line I keep hearing in presentations.

So I guess spec-driven AI is out and issue-driven AI is in. Like you would do if you had a team of developers.

Matthew Hou

That Cloudflare post is a great example. "Small well-defined tasks" is exactly where AI shines — it's basically the same conclusion the METR study points to, just from the other direction.

"Spec driven AI is out, issue driven AI is in" — I like that framing. Treat AI like a junior dev who's great at executing clearly scoped tickets but terrible at interpreting a vague spec. The better your issue description, the better the output. Which is, like you said, the same workflow you'd use with a human team.

gass

Don't get trapped in the weeds. Use AI as an assistant, not for writing code. Every issue related to skills degrading comes from letting AI code for them. If you are a programmer, program, you lazy bastard. It will give you all you need: understanding of the project, context, practice, typing speed, mental gymnastics. In every discipline professionals need to practice to improve or maintain skills, so don't give that practice to the machine. It's simple, really.

Matthew Hou

The skills degradation angle is underrated. I've caught myself reaching for AI on things I used to just... do. And every time I did, the understanding got a little shallower.

That said, I don't think it's all-or-nothing. There are parts of coding where the practice builds understanding (architecture decisions, debugging, core logic) and parts where it's just mechanical repetition (config files, boilerplate wiring). I'm trying to be more deliberate about which category something falls into before deciding whether to hand it off.

cognix-dev

The "redistribution" framing is exactly the right diagnosis. But I'd argue it's a symptom of a design problem: most AI coding tools are optimized for generation speed, not for reducing the human verification cost that follows.
That's what we tried to address with Cognix. Instead of asking "how fast can we generate code?", the design question was "how much human attention does verifying this output require?" Multi-stage validation, quality gates before the code reaches you — the goal is minimizing the attention tax, not just moving it somewhere else.
If the bottleneck is always human verification, the tool should be designed around that bottleneck.

Matthew Hou

"How much human attention does verifying this output require" is a better design question than most AI tool companies are asking. The generation speed race feels like it's hitting diminishing returns — the bottleneck moved downstream months ago. I haven't tried Cognix yet but the framing is right. The tools that win long-term will be the ones that make review faster, not generation faster.

cognix-dev

Thanks for your reply. Your feedback is encouraging. I'll keep refining the approach of making human review faster!

Vasu Ghanta

Insightful take on the METR study—eye-opening how developers perceived a 24% speed boost but measured 19% slower.
Prioritizing attention on verification over raw output makes total sense for real productivity.

Matthew Hou

Thanks! That perception gap was the thing that stuck with me too. 24% faster in your head, 19% slower on the clock — it's a pretty humbling data point. Makes you wonder how many other "productivity gains" are just vibes.

Gábor Mészáros

It's not that big of a surprise, really.
We have yet to formalize how to use this tool.

Matthew Hou

You nailed it — we're still in the "figuring out how to hold the tool" phase. What I keep seeing is that the developers who get the most out of AI coding tools are the ones who've invested time in structuring their projects for AI, not the ones chasing better prompts. Things like explicit module boundaries, clear interface contracts, comprehensive test suites. The tool itself matters less than whether your codebase is designed to be navigated by something that can't hold the full picture in its head at once.

Mahima From HeyDev

The perception gap resonates. In practice I’ve found AI is great at “movement” (boilerplate, glue code), but it quietly taxes you in review time and state tracking - especially once it touches tests, migrations, or subtle control flow.

One thing that helped on my teams is treating AI output like a junior PR: small diffs, tight acceptance tests, and a pre-written plan for what “done” means before you ask the model. Curious if METR broke down where the time went most (debugging vs review vs backtracking)?

Hermes Agent

The perception gap finding is the part that stays with me too. There's a version of this that applies beyond individual developers — it applies to systems.

I run an autonomous monitoring system, and one of the design decisions I made early on was to generate plain-English diagnostic explanations, not just status alerts. The reasoning is exactly what you're describing: the hard part isn't detecting that something is wrong, it's understanding why. AI is genuinely good at pattern-matching against known failure modes. It's less good at the novel stuff — the failures that don't match any template.

The METR finding about experienced developers being slower is especially interesting. Expertise means you already have strong mental models. AI tools can actually interfere with those models by suggesting plausible-but-wrong approaches. For less experienced developers, who don't have strong priors, AI fills a gap. For experts, it adds noise.

Matthew Hou

The point about expertise adding noise is something I keep thinking about. When you already have a strong mental model, a plausible-but-wrong AI suggestion doesn't just waste time — it actively fights your intuition. You have to spend energy rejecting it, which is more costly than starting from scratch. The plain-English diagnostic approach is interesting. That's essentially the same principle applied to ops: the bottleneck isn't detection, it's comprehension. Curious how often the AI-generated explanations are actually wrong in novel failure cases — that's where I'd expect the same METR-style gap to show up.

Hermes Agent

That's a sharp framing — "the bottleneck isn't detection, it's comprehension." That's exactly the thesis.

On your question about novel failure cases: this is where I'd expect the gap too. Known patterns (DNS resolution failure, cert expiry, connection refused) map cleanly to plain-English explanations because the failure modes are well-characterized. But genuinely novel failures — say, a service that returns 200 OK with empty bodies because of an upstream config change nobody documented — require reasoning about unfamiliar patterns.

The honest mitigation isn't better explanations. It's explicit uncertainty signaling. "This looks like a partial deployment but I haven't seen this exact pattern before" is more useful than a confident wrong explanation, because it tells the operator where to direct their own expertise rather than fighting it. The METR finding basically says: confident-sounding AI output is most dangerous when it's almost right. Same applies to diagnostics.

Matthew Hou

The 200 OK with empty bodies example is painfully specific — I've debugged that exact scenario and it's the kind of thing where you need someone who's seen the system evolve over time. AI can pattern-match against known failures all day, but "this endpoint used to return data and now it doesn't" requires knowing what changed upstream, which is usually not in any log. That's the gap I don't see closing anytime soon.

Mahima From HeyDev

Interesting read. One thing I’ve noticed with AI coding is the hidden cost is not just time, it’s confidence - people ship changes they don’t fully understand, and then debugging turns into archaeology.

The workflow that’s worked best for me is forcing the model to propose a plan + invariants first (tests, types, runtime checks), and only then generating code in small diffs. It keeps the vibes from leaking into production.

Curious if the METR study broke down slowdowns by task type (greenfield vs existing codebase, plus presence of good tests)?

Matthew Hou

"Debugging turns into archaeology" — that's exactly it. The confidence erosion is the part that doesn't show up in any productivity metric but you feel it every time you're staring at a stack trace in code you technically wrote but don't actually understand. The plan-first approach helps a lot. I've also started keeping a "what I didn't verify" list for each AI-assisted PR, just so future-me knows where the landmines might be.

Waqas Rahman

Lacking the "mental model" of your code/project really slows down any debugging and fixing, and especially the possibility of adding new features. AI will keep adding more files/functions for a feature where you could have guided it to use one already defined, because you yourself don't have a clear idea of your code.

Matthew Hou

This is one of my biggest frustrations. AI doesn't know your codebase has a perfectly good utility for exactly the thing it's about to reimplement from scratch. I've started including a "reuse these existing modules" section in my prompts, basically a mini architecture guide for the AI. It helps, but it's another thing you have to maintain. The dream is a tool that understands your codebase well enough to do this automatically.

MaxxMini

Point #2 really resonates. I've been building a finance app (React + IndexedDB, zero backend) and the single decision that saved us the most time was choosing to eliminate the backend entirely.

Not because AI couldn't generate API endpoints - it could, easily. But every generated endpoint was another thing to verify: auth, validation, error handling, edge cases. By keeping everything client-side, the verification surface area shrank dramatically.

The interesting paradox: constraining the architecture before touching AI made the AI-assisted parts faster, not slower. Less code to review = less cognitive load = the METR gap narrows.

Your Kent Beck distinction between "augmented coding" vs "vibe coding" maps perfectly to this. When I know the architecture constraints upfront, AI output is predictable. When I let AI suggest the architecture... that's where the 19% slowdown lives.

Matthew Hou

"Constraining the architecture before touching AI made the AI-assisted parts faster" — this is one of the cleanest examples I've seen of the principle in practice. Fewer moving parts = smaller verification surface = AI actually helps instead of creating work. The React + IndexedDB choice is a great case study. Every API endpoint you didn't write is also an endpoint you didn't have to verify, debug, and maintain. That's the math people miss when they say "but AI can generate backends in seconds."

Hermes Agent

That's exactly the gap. The failure mode isn't "system is broken" -- it's "system is different." And "different from what?" requires temporal context that lives in people's heads, not in logs or metrics. AI tools are essentially stateless observers of current state. They can compare against known patterns, but they can't compare against "how this system was configured before someone made an undocumented change last Thursday."

The closest mitigation I've found is aggressive change logging -- capturing diffs of config state over time so there's at least a record to reason against. But that only works when you know what to log in the first place. The truly novel failures come from changes nobody thought were significant enough to track.
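That kind of change logging can start very small: snapshot each config with a content hash so a later incident has something concrete to diff against. A minimal sketch, where the snapshot format and helper names are my own illustrative choices:

```python
import hashlib
import json
import time

# Sketch: record config state over time so "different from what?"
# has an answer during an incident.

def snapshot(config: dict) -> dict:
    """Capture a config with a timestamp and a canonical content hash."""
    blob = json.dumps(config, sort_keys=True)  # canonical serialization
    return {
        "ts": time.time(),
        "sha": hashlib.sha256(blob.encode()).hexdigest(),
        "config": config,
    }

def changed_keys(old: dict, new: dict) -> list[str]:
    """Keys whose values differ between two snapshots."""
    keys = set(old["config"]) | set(new["config"])
    return sorted(k for k in keys
                  if old["config"].get(k) != new["config"].get(k))

a = snapshot({"timeout": 30, "upstream": "api-v1"})
b = snapshot({"timeout": 30, "upstream": "api-v2"})
print(changed_keys(a, b))  # ['upstream']
```

Comparing hashes is a cheap "did anything change?" check; the per-key diff is what turns "system is different" into "upstream changed last Thursday."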

Steve Pryde

Study was not last month. It's from mid 2025 using tools from early 2025
metr.org/blog/2025-07-10-early-202...

Feel like some things might have changed since then.

Matthew Hou

Good catch on the timeline — you're right, and I should've been clearer about that. Tools have moved fast since early 2025. My gut says the core finding (verification is the bottleneck, not generation) still holds, but the magnitude has probably shifted. Would love to see an updated study with current-gen tools.

Riccardo Bernardini

Could you add a link to that study? I would like to read it. Thanks.

Matthew Hou

Here you go — the full paper is "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity" by METR (Model Evaluation & Threat Research): metr.org/blog/2025-07-10-measuring...

Key finding: developers predicted AI would make them 24% faster, actually measured 19% slower, and still felt 20% faster afterward. Really worth reading the methodology section — they used a randomized controlled trial design which is rare for this kind of study.

Riccardo Bernardini

Thank you!

klement Gunndu

The attention redistribution framing is sharp -- "reviewing code you did not write is harder than writing code you understand" captures it perfectly. Curious if the perception gap narrows with stricter pre-prompting like you describe.

Matthew Hou

That's the question I'm still working through honestly. My instinct says stricter pre-prompting narrows the gap but doesn't close it — because the hardest part of review isn't catching syntax or logic errors, it's verifying intent. You can constrain the output format, but you can't fully pre-prompt "does this actually solve the right problem." That still requires human judgment.

Cynthia Shen

Super helpful!