Last month, METR published a study that should make every developer uncomfortable.
They took 16 experienced open-source developers — people who kn...
Maybe we should move away a bit from the idea of using AI tools for "coding" only, and use it more in an 'advisory' role instead, as virtual brainstorming buddies to sound ideas off of - to generate ideas ...
Coding, yes, but only for the "boring" stuff, setting up the nitty gritty of a project (tooling etc), pure boilerplate etc - not the parts where writing the code actually feels like a worthwhile thing to do!
Yeah, the "advisory role" framing resonates. I've actually been shifting toward that myself — using AI more as a thinking partner than a code generator. The best sessions I have are when I describe a problem and go back and forth on approaches before writing anything.
And you're right about the boilerplate distinction. There's a real difference between "code I need to exist" and "code I need to understand." AI is great at the first category. For the second, I'd rather write it myself and have AI poke holes in it afterward.
Wholly agree! This also reminds me of another recent article on dev.to, where the author argues that actually coding, even before AI arrived, was never more than 20-25% of the work anyway (I think he mentioned an even lower percentage) - the rest is thinking, planning, testing, debugging, deploying etc ... so we're now using AI to automate part of those 20% - maybe we should see how it can help more with those 80% !
That's the reframe I keep coming back to. We're optimizing the part of the job that was already the smallest slice. The 80% — understanding requirements, debugging across system boundaries, figuring out what to build in the first place — that's where the real leverage is. I've been getting more value from using AI as a thinking partner during design than as a code generator during implementation. The code part is almost the easy part.
Totally, and the advantage is also that it's low risk - you ask for advice or ideas, and then you use them or you don't - but when AI spits out a few hundred lines of code, the onus is on you to check/review it, and make sure there are no bugs or security holes in it ... I do think the whole "AI for coding" debate might need a bit of a rethink as to what 'strategies' are in fact most productive (smallest pains, biggest gains) ... keep an open mind!
Exactly — and the low-risk part is what makes it a no-brainer starting point for people who are still hesitant. You ask for advice, you evaluate it, you use it or you don't. No one's committing AI-generated code to main in that workflow. It's actually the safest possible way to get value from AI while building intuition for where it's reliable and where it falls apart. I've started calling it "advisory mode first, generation mode later" — and honestly, some tasks never graduate from advisory mode, and that's fine.
Totally agree, that's also the way I see it - safety first, unless you're "vibe coding" some sort of funny hobby project and it doesn't really matter ...
I think AI is useful for developing code in codebases one is not acquainted with. Because it learns from existing code, it usually produces fragments that are up to date with new and updated versions of APIs and techniques. It's also useful for routine tasks that are already very well established -- that is, boilerplate code. It doesn't particularly excel at new tasks. In that case the generated code should be seen as prototype code: it exposes problems and possible solutions, but it isn't ripe and should be used to inspire the writing of the real code.
Adapting boilerplate code is fine and valid, like create-react-app, only more generic. Our industry shouldn't have needed expensive LLM models to do that, though. Debugging? AI can understand Tailwind and TypeScript, but a legacy web project from 2016? No chance, unless it's just boilerplate from ten years ago.
"Prototype coding" is a great way to put it. That's pretty much how I treat AI output now — it's a first draft that shows me the shape of a solution, not the solution itself. Especially useful when you're working with an unfamiliar API and need to see what the integration surface looks like before committing to an approach.
The key shift for me was to stop expecting production-ready code and start expecting "good enough to learn from." Once you adjust that expectation, the frustration drops significantly.
The 'attention redistribution' framing is the right diagnosis. Generation got cheap. Verification didn't.
I run a few AI models in production — parallel workloads, different models handling different tasks. The pull is always toward more: more agents, more parallelism, more throughput. But the real constraint doesn't change: how much cognitive load does it take a human to audit what came out?
A setup with three models producing clean, auditable outputs beats ten models producing plausible-but-questionable ones. Every time. The overhead compounds.
The point about expertise interfering with AI output is underappreciated. When you already have a strong mental model, a confident-but-wrong suggestion doesn't just waste time — it has to be actively rejected. That rejection costs more than silence would have. For a junior dev with weak priors, AI fills gaps. For someone who already knows the answer, it often adds noise you have to fight through.
The Dark Factory direction is the honest conclusion. You don't eliminate the human verification cost. You push it earlier, into test design and spec writing. Which is basically just the old TDD argument wearing new clothes.
The cognitive load point is the one most people skip over. "Just add more agents" sounds great until you're spending more time reviewing outputs than you saved generating them. I've hit that wall — at some point you realize the bottleneck was never typing speed.
And yeah, the TDD parallel is real. Writing good specs and test cases upfront is basically the same discipline, just reframed for a world where the machine writes the first draft. The skill shifts from "can I write this" to "can I define what correct looks like before anything runs."
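That "define what correct looks like before anything runs" step can be sketched in a few lines. A minimal illustration, assuming a made-up `slugify` task: the assertions are written first and act as the spec that any generated draft must pass unchanged.

```python
def slugify(title: str) -> str:
    """A candidate implementation (the 'first draft' a tool might produce)."""
    # Lowercase, replace every non-alphanumeric character with a space,
    # then join the remaining words with hyphens.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return "-".join(cleaned.split())

def test_slugify_defines_correct() -> None:
    # These assertions ARE the spec: they exist before any implementation,
    # and any generated code must pass them as-is.
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  spaced   out  ") == "spaced-out"
    assert slugify("already-fine") == "already-fine"

test_slugify_defines_correct()
```

The discipline is the same as TDD: the human effort goes into pinning down "correct" up front, and the generated draft is just whatever happens to satisfy it.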
Where's the "dopamine hit" when AI generates 200 lines of code that should have been 20, hides at least one subtle bug within, and adds five paragraphs of text and a desperate call to action? And when you pinpoint the error, it utters verbose excuses, fixes it, and introduces new ones. This is just bullshit, making me even more disappointed and angry when coworkers insist that AI makes them "more productive". I hope this study will open their eyes!
Ha, you're describing a very real pattern. The verbosity is genuinely one of the most annoying things — you ask for a 5-line fix and get 80 lines of refactored code plus an essay explaining why.
I think the frustration your coworkers cause is actually a separate problem from the tool itself. The tool has real limitations. But "AI makes me more productive" and "AI makes me feel more productive" can both be true for different tasks and different people. The METR data just makes it harder to hand-wave away the gap between perception and measurement.
The context switching tax is underrated. I tracked my own workflow for a week and realized I was spending ~15 minutes per session just re-establishing context after switching between tools — which model knows about this codebase, which one I already gave the architecture doc to, where did I leave off. That's not AI being slow, that's me managing AI being slow. Consolidating the interface helps, but honestly the bigger win was just picking one tool per task type and sticking with it instead of shopping around mid-flow.
Don't get trapped in the weeds. Use AI as an assistant, not for writing code. Every issue with skills degrading comes down to letting AI code for you. If you are a programmer, program, you lazy bastard. It will give you everything you need: understanding of the project, context, practice, typing speed, mental gymnastics. In every discipline professionals need to practice to improve or maintain their skills, so don't hand that practice to the machine. It's simple, really.
The skills degradation angle is underrated. I've caught myself reaching for AI on things I used to just... do. And every time I did, the understanding got a little shallower.
That said, I don't think it's all-or-nothing. There are parts of coding where the practice builds understanding (architecture decisions, debugging, core logic) and parts where it's just mechanical repetition (config files, boilerplate wiring). I'm trying to be more deliberate about which category something falls into before deciding whether to hand it off.
I read the Next.js rebuild post from Cloudflare yesterday. The part that struck me is their way of working: they define small tasks and let AI work on those.
This is a concrete example of the "AI is good at doing small things" line I keep hearing in presentations.
So I guess spec-driven AI is out and issue-driven AI is in. Like you would do if you had a team of developers.
That Cloudflare post is a great example. "Small well-defined tasks" is exactly where AI shines — it's basically the same conclusion the METR study points to, just from the other direction.
"Spec driven AI is out, issue driven AI is in" — I like that framing. Treat AI like a junior dev who's great at executing clearly scoped tickets but terrible at interpreting a vague spec. The better your issue description, the better the output. Which is, like you said, the same workflow you'd use with a human team.
The perception gap finding is genuinely unsettling, and your reframing of the workflow shift is the most honest take I've seen on it.
The Before/With AI comparison hits right. Building AI-powered financial data tools, I've seen the same dynamic — the bottleneck was never code generation speed, it's always been "did we correctly specify what we wanted." AI just makes the cost of a wrong spec hit faster.
Developers who are genuinely more productive with AI are the ones who write rigorous specs and tests before touching a prompt, not the ones who iterate fastest on output. The METR data suggests the industry is confusing velocity with throughput.
One wrinkle worth adding: the 19% slowdown might be underestimating the effect for domain-specific work. When the codebase has non-obvious invariants (financial regulations, edge cases in settlement calculations, etc.), AI-generated code fails in subtle ways that take longer to debug than the time saved writing. That's the real trap.
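A tiny illustration of how a non-obvious money invariant fails in a way that passes casual review (the amounts here are made up; this is the classic float-vs-`Decimal` case, not anything from the study):

```python
from decimal import Decimal

# Three ten-cent entries should settle to exactly thirty cents.
float_total = 0.10 + 0.10 + 0.10
decimal_total = Decimal("0.10") + Decimal("0.10") + Decimal("0.10")

# The float version quietly violates the exact-cents invariant...
assert float_total != 0.30
# ...while Decimal preserves it.
assert decimal_total == Decimal("0.30")
```

Generated code using plain floats here would look perfectly plausible in review, which is exactly the "fails in subtle ways" trap: the bug only surfaces once reconciliation is off by a fraction of a cent.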
The "redistribution" framing is exactly the right diagnosis. But I'd argue it's a symptom of a design problem: most AI coding tools are optimized for generation speed, not for reducing the human verification cost that follows.
That's what we tried to address with Cognix. Instead of asking "how fast can we generate code?", the design question was "how much human attention does verifying this output require?" Multi-stage validation, quality gates before the code reaches you — the goal is minimizing the attention tax, not just moving it somewhere else.
If the bottleneck is always human verification, the tool should be designed around that bottleneck.
"How much human attention does verifying this output require" is a better design question than most AI tool companies are asking. The generation speed race feels like it's hitting diminishing returns — the bottleneck moved downstream months ago. I haven't tried Cognix yet but the framing is right. The tools that win long-term will be the ones that make review faster, not generation faster.
Thanks for your reply. Your feedback is encouraging. I'll keep working carefully on improving human review speed!
Lacking the "mental model" of your code/project really slows down debugging, fixing, and especially the possibility of adding new features. AI will keep adding more files/functions for a feature where you could have guided it to reuse one already defined, because you yourself don't have a clear idea of your code.
This is one of my biggest frustrations. AI doesn't know your codebase has a perfectly good utility for exactly the thing it's about to reimplement from scratch. I've started including a "reuse these existing modules" section in my prompts, basically a mini architecture guide for the AI. It helps, but it's another thing you have to maintain. The dream is a tool that understands your codebase well enough to do this automatically.
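As a sketch of that "reuse these existing modules" prompt section, with entirely hypothetical module names: the idea is just a small reuse guide prepended to every request.

```python
# Hypothetical prompt preamble: a mini architecture guide pasted before each
# request so the model reuses existing utilities instead of reinventing them.
EXISTING_MODULES = {
    "utils/dates.py": "parse_iso(), to_utc() -- all date handling goes here",
    "utils/http.py": "fetch_json() with retries -- do not re-implement retry logic",
    "db/repo.py": "UserRepo.get(), UserRepo.save() -- the only DB access path",
}

def build_prompt(task: str) -> str:
    """Prepend the reuse guide to the actual task description."""
    guide = "\n".join(f"- {path}: {desc}" for path, desc in EXISTING_MODULES.items())
    return (
        "Reuse these existing modules instead of writing new ones:\n"
        f"{guide}\n\nTask: {task}"
    )
```

The downside is exactly the one mentioned above: this guide is one more artifact you have to keep in sync with the codebase by hand.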
It's not that big of a surprise, really.
We have yet to formalize how to use this tool.
You nailed it — we're still in the "figuring out how to hold the tool" phase. What I keep seeing is that the developers who get the most out of AI coding tools are the ones who've invested time in structuring their projects for AI, not the ones chasing better prompts. Things like explicit module boundaries, clear interface contracts, comprehensive test suites. The tool itself matters less than whether your codebase is designed to be navigated by something that can't hold the full picture in its head at once.
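A minimal sketch of what an explicit interface contract can look like in Python, using `typing.Protocol` with hypothetical names: the boundary is pinned down so generated code (or a reviewer) only has to satisfy the contract, not hold the whole codebase in its head.

```python
from typing import Protocol

class RateLimiter(Protocol):
    """The contract: anything with a conforming allow() is a RateLimiter."""
    def allow(self, key: str) -> bool:
        ...

class AllowAll:
    """A trivial conforming implementation (e.g. for tests)."""
    def allow(self, key: str) -> bool:
        return True

def handle(request_key: str, limiter: RateLimiter) -> str:
    # handle() depends only on the contract, not on any concrete limiter,
    # so a generated implementation just has to match the Protocol.
    return "ok" if limiter.allow(request_key) else "throttled"
```

Structural boundaries like this are cheap to write and give both humans and tools a fixed surface to code against.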
Insightful take on the METR study: eye-opening how developers perceived a 24% speed boost while actually measuring 19% slower.
Prioritizing attention on verification over raw output makes total sense for real productivity.
Thanks! That perception gap was the thing that stuck with me too. 24% faster in your head, 19% slower on the clock — it's a pretty humbling data point. Makes you wonder how many other "productivity gains" are just vibes.
The attention redistribution framing is sharp -- reviewing code you did not write is harder than writing code you understand captures it perfectly. Curious if the perception gap narrows with stricter pre-prompting like you describe.
That's the question I'm still working through honestly. My instinct says stricter pre-prompting narrows the gap but doesn't close it — because the hardest part of review isn't catching syntax or logic errors, it's verifying intent. You can constrain the output format, but you can't fully pre-prompt "does this actually solve the right problem." That still requires human judgment.
Could you add a link to that study? I would like to read it. Thanks.
Study was not last month. It's from mid 2025 using tools from early 2025
metr.org/blog/2025-07-10-early-202...
Thank you!
Feel like some things might have changed since then.
Good catch on the timeline — you're right, and I should've been clearer about that. Tools have moved fast since early 2025. My gut says the core finding (verification is the bottleneck, not generation) still holds, but the magnitude has probably shifted. Would love to see an updated study with current-gen tools.
Yes, you have a point. But the key is to make sure the AI doesn't dominate the human. The AI can help with the overall design and architecture and the rest of the steps, but the real speed comes from the prompts you give it: driving the idea and its operation, defining its role, maintaining continuous situational awareness, keeping periodic logs, and literally spelling out the actions a human would take. Otherwise, on a large project the AI will go one way today and another way tomorrow, and as one developer said, "The AI wrote a project in a month, but I revised it in a year."
"The AI wrote a project in a month, but I revised it in a year" — that's going to age really well as a quote. It captures something a lot of teams are learning the hard way right now.
Your point about giving AI specific prompts that define its role and actions is key. The more you treat it like a tool with clear boundaries, the better it performs. The moment you hand it vague direction and hope for the best, you're setting yourself up for exactly that revision cycle you're describing.
Super helpful!