There is something especially ironic about social media posts making blanket statements that are critical of people who make blanket statements about the usefulness of AI in coding. But we know that blanket statements are simplifications, and reality is never as cut-and-dried as a single eye-catching headline.
In the wake of the recently published METR study [Becker, J., Rush, N., Barnes, B., & Rein, D. (2025). Measuring the impact of early-2025 AI on experienced open-source developer productivity. METR (Model Evaluation & Threat Research). https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf], the oversimplified statement that AI will always increase coding productivity is widely being countered by the oversimplified statement that AI cannot increase productivity. My position is that neither claim is true. While that does not make for a very click-baity headline, I'd just like to ask that people consider some of the critically relevant details included in the METR study itself before they make up their minds.
Applies to experienced and highly-skilled developers only
The study authors intentionally decided to "focus on highly skilled open-source developers" by "recruiting experienced engineers with years of expertise in the target repositories." They chose this focus, they said, because "there has been relatively less research in this setting." [METR, page 3]
But, as the study authors themselves caution, "our results are consistent with small greenfield projects or development in unfamiliar codebases seeing substantial speedup from AI assistance." [METR, page 12]
If you are a novice in the coding language or the project you are working on, then these study results do not apply to you.
Does not apply to tasks the authors considered "unrepresentative of most software development work"
The authors cite studies which found that for tasks that are, in my words, fairly boilerplate, where the core effort lies in remembering and physically typing out the code, and where there is likely to be "a large amount of LLM training data," the use of AI showed a "65% increase in the rate of task requirements satisfied." [METR, page 3]
If the task you are working on is mostly about grinding out a lot of boilerplate code, then these study results do not apply to you.
What's the goal?
A significant problem I have in this whole discussion is the definition of the goal, what the study authors call "developer productivity." Time to passing unit tests, measured in billable minutes, is obviously quite important in a setting where the goal is to ship features as fast (and, in the case of commercial projects, as economically) as possible. In that case, if you had a few novice developers tasked with churning out a lot of boilerplate code, then "Yay! AI!"
On the other hand, if your thinking is a little longer term, and you consider the skill growth of your developers to be as valuable as next week's profit margins, you might want your team to spend longer manually researching well-known coding patterns and typing them out by hand; this could lead to deeper understanding, better future recall, and improved long-term performance of the developer as a contributor to your enterprise.
This is just one example, but my point is that we could draw opposite conclusions depending on whether we take a shorter- or longer-term view of how we define success. Using or not using AI could be considered equally good, depending on your goals.
Was there bias?
A potential problem with the study was the authors' stated goal of using "realistic" coding tasks to measure the effect of AI usage. To this end, they asked the developers themselves to select the tasks.
"Each developer provides a list of real issues in their repository to work on as part of this study. Issues are typically bug reports, feature requests, or work items used to coordinate development. They range from brief problem descriptions to detailed analyses and represent work ranging from minutes to hours."
-- METR, page 5
According to at least one of the (only 16) developers who participated in the study, this influenced the types of tasks they selected, with the developer weighing how each task might be affected by AI usage before submitting it.
"There was a selection effect in which tasks I submitted to the study. (a) I didn't want to risk getting randomized to "no AI" on tasks that felt sufficiently important or daunting to do without AI assistence. [sic] (b) Neatly packaged and well-scoped tasks felt suitable for the study, large open-ended greenfield stuff felt harder to legibilize, [sic] so I didn't submit those tasks to study even though AI speed up might have been larger"
-- Bloom, R. (@ruben_bloom). (2025, July 11). X.com post. https://x.com/ruben_bloom/status/1943536052037390531
This may or may not have skewed the results to some degree—I genuinely don't know—but it seems odd that in a study that took pains to carefully randomise some things, the pool of tasks for developers to work from was not better protected from possible bias.
What's next?
I'm glad to see work like the METR study being published. Having data to refer to when evaluating claims about the impact of AI on productivity in real-world scenarios is a nice change. But this is a start. It would be useful to go further:
- Does the outcome vary with the developer's experience in the language and the project?
- Does the complexity or creativity of the task have an effect? Is AI more or less useful for tasks that are straightforward versus those that are more novel or require creative thinking?
- Is there a trade-off where the developer benefits less from using AI, even if the work is completed more quickly? For example, does the developer come away with a shallower understanding of the coding techniques provided by the AI?