originally published at collinwilkins.com
Anthropic surveyed 132 of their own engineers about Claude Code. The numbers looked incredible. 67% more merged PRs per day. Usage jumped from 28% to 59% of daily work. Self-reported productivity gains between 20% and 50%.
Then someone checked the organizational dashboard. The delivery metrics hadn't moved.
That's the productivity paradox, and the gap between those two sets of numbers is where it gets interesting.
I've been running Claude Code as my main dev tool for months now (it's basically replaced my terminal workflow at this point). I've written about context engineering, specialized agents, and agentic orchestration patterns. All of that assumed AI coding tools deliver net-positive outcomes. Turns out the picture is messier than I thought.
The individual numbers are impressive
Those Anthropic survey numbers are all individual metrics — how much each engineer used the tool, how fast they felt, how many PRs they shipped. By any of those measures, the tool was clearly working.
The solo developer story is even more dramatic. A case study published in February 2026 documented one developer delivering what was scoped as a "4 people x 6 months" project in 2 months, working alone. That's a raw 12x multiplier on person-months (the kind of number that gets screenshot'd and passed around without context), and by my rough math, about 3x when you weight for task mix. The breakdown by task type tells the real story:
| Task Type | Speedup |
|---|---|
| Boilerplate and scaffolding | ~10x |
| Complex logic and debugging | ~2x |
| Architecture and planning | Minimal |
That distribution matters. The mechanical work got dramatically faster, but the judgment work barely moved.
This matches my own usage almost exactly. When I'm scaffolding a new module or wiring up boilerplate integrations, Claude Code flies. I can stand up a full project structure in minutes. But the architecture decisions, the "which service owns this data" conversations, the debugging where the root cause is three layers removed from the symptom? Those take the same time they always did.
Faros AI's analysis confirmed the same shape: 21% more tasks completed, 98% more PRs merged.
If you stopped reading here, the conclusion is obvious... Ship Claude Code to your whole team and watch the numbers climb!
Don't stop reading here.
The organizational numbers tell a different story
Faros AI measured DORA metrics on the same teams: deployment frequency, lead time, change failure rate, time to restore service. Unchanged. Meanwhile code review times increased 91%. The METR study found experienced developers on familiar codebases took 19% longer on real-world tasks while estimating they were 20% faster. Developers felt faster. The customer deliverables didn't move.
Where the speedup goes
Code generation got faster but everything downstream of it didn't. Planning, design, prioritization, code review, QA — still run at the same speed. When one stage of the pipeline accelerates and the rest stays flat, you get a pile-up at the next bottleneck, not faster delivery.
What I'm seeing in a lot of teams right now: AI writes the code, opens a PR, and then another AI tool (or human) on the review side suggests meaningful changes. That leads to additional back-and-forth after the PR is already open — churn that doesn't show up in "PRs merged" but absolutely shows up in cycle time.
Anthropic's own survey found that more than 50% of their engineers could "fully delegate" only 0-20% of their daily work to Claude Code. These are Anthropic engineers, on Anthropic's own tool, in an environment optimized for exactly this usage. If the people who built the tool can only fully hand off a fifth of their work, the ceiling for a typical team is lower.
I'd put myself in that 0-20% bucket too. Most of my Claude Code usage is collaborative, not delegated. I'm reviewing output, re-prompting when it drifts, catching architectural decisions the agent doesn't have context for.
The "build it because you can" trap
There's a subtler problem that doesn't show up in any of the studies. Because coding delivery sped up, more features feel feasible. A feature that would have taken two sprints now looks like a long afternoon, and that changes the calculus on whether it's worth building.
It shouldn't. The cost of building a feature was never just the implementation time. It's the maintenance, the cognitive load on the team, the opportunity cost of not building something else, the QA cycles, the documentation, the support burden. AI made the implementation cheaper. It didn't make any of those other costs cheaper.
What I'm seeing is that the bar for "let's just build it" has dropped. It's easy to prompt a new feature into existence, so naturally the threshold for opening a PR lowers. Teams should keep a high bar and think hard about whether a feature is worth shipping at all, regardless of how fast it can be coded.
A lot of teams are also freezing hiring or laying people off based on early perceptions of AI development speed. There's a common assumption that AI simply raises the bar for everybody. In my experience, that's not the case. The gains are uneven, task-dependent, and often illusory when you measure end-to-end.
A Hacker News thread from March 2026 captured the human side of this well. Comments offered the most useful frame: "Do you enjoy the 'micro' of getting bits of code to work, or the 'macro' of building systems that work? If it's the former, you hate AI agents. If it's the latter, you love AI agents." That split is real and doesn't resolve with better tooling. It requires honest conversations about what each person's role becomes.
What this means for team adoption
Your current metrics are probably measuring the wrong things. If you're an engineering manager trying to figure out whether AI coding tools are working, here's what the data actually says.
The first thing I'd change is what you're counting. More PRs per developer is a real number. It doesn't mean the team ships better software faster. If review times are climbing and defect rates are flat, the bottleneck moved from writing to reviewing. Measure the bottleneck, not the part that got faster.
Invest in review infrastructure before scaling AI-generated output. The review time increase isn't a tooling problem you fix with a faster CI pipeline. That's a structural problem. If you're rolling out AI coding tools to a team without simultaneously expanding review capacity, you're building pressure on the part of the pipeline least equipped to absorb it.
Set expectations by task type, not tool type. The speedup distribution from that solo dev case study is the most useful number to take away from this data. Boilerplate flies. Architecture doesn't move.
Track the boring metrics. If you're measuring AI tool ROI through surveys and individual PR counts, you're measuring perception. Track cycle time end-to-end. Look at defects per deploy. Pull time-from-commit-to-production.
Don't let the tool lower your refactoring standards. When you're iterating on a feature, the original design sometimes calls for a refactor. When the LLM can work around the existing structure, the willingness to do that refactor drops. Fight that. Leave the codebase better than you found it, same as always.
Review AI output in a fresh session. AI is biased in the code it writes. Common patterns, familiar abstractions, the path of least resistance. The best way to catch those inefficiencies is to review with fresh eyes, outside the context window that produced the code. A thorough human review in a separate session will catch things that in-context review misses.
Don't use AI to bulldoze friction. The friction you feel during development, the code review pushback, the design debate, the test that keeps failing, that friction exists for a reason. Using AI code generation to power through it faster doesn't remove the underlying problem. It just ships the problem to production. These are the same engineering practices we've always applied.
What you're actually measuring
Confusing "more output" with "better outcomes" is how teams make expensive adoption decisions. The teams that get real value from Claude Code won't be the ones that hand it to every developer and watch PR counts climb. They'll redesign their workflow around the new shape of the work — where writing is cheap, reviewing is expensive, judgment calls haven't gotten any easier, and half the features that feel feasible probably aren't worth building.
I've built my personal workflow around these tools. This piece is about whether they're actually working.
Both questions matter. Most teams are only asking the first one.
Top comments (2)
Strong framing. The interesting part is exactly where the bottleneck moves after code generation gets cheap. As an AI founder, I've seen the same pattern: scaffolding speed improves dramatically, but architecture judgment, review quality, and decision latency still dominate the cycle time. The teams that win probably won't be the ones with the highest PR count, but the ones that redesign the whole workflow around faster iteration and better review.
The truth is, though, Anthropic is pushing updates at a fast pace. They have new features coming out almost daily.