
The J-Curve

AI productivity metrics measure the quantity peak — the time-to-completion gains everyone reports — while structurally hiding the quality trough beneath them. Rework rate, code churn, and code survival are the instruments being built in real time to catch what throughput dashboards miss.

METR ran a randomized controlled trial in mid-2025 with 16 experienced open-source developers working through 246 real-world tasks. The developers were allowed to solve half the tasks with AI assistance through Cursor and Claude; the other half they had to solve unaided. The study measured both how long each task took and how long the developers thought it took. With AI, they were 19 percent slower. They believed they were 20 percent faster. The gap between the two, 39 points, is among the largest measurement errors ever recorded in a controlled study of professional knowledge work.
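To see where the 39 comes from, put both effects on the same scale: percentage points of change relative to unaided work.

```python
# The gap arithmetic behind the METR result, as cited above.
measured_change = +19   # tasks took 19% longer with AI
perceived_change = -20  # developers believed tasks took 20% less time

perception_gap = measured_change - perceived_change
print(perception_gap)   # 39 points between reality and belief
```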

METR revised the result in February 2026 after fixing a selection bias: developers who declined non-AI assignments had been concentrated on the AI side, inflating the apparent slowdown. Adjusted, AI made them four percent slower with a confidence interval that crossed zero. The headline finding survived the correction. The slowdown might be small. The perception gap was not. The developers themselves could not tell whether the tools they used every day were helping them.


The Volume Peak

This is the part the AI productivity story leaves out. Microsoft's controlled study of GitHub Copilot, run inside the company's own engineering organization, reported 55 percent faster task completion. Stanford-affiliated research on pull-request velocity found cycle times falling from 9.6 days to 2.4 days — a 75 percent reduction. Goldman Sachs has projected a 15 percent boost to U.S. labor productivity from generative AI adoption. BCG's 2025 enterprise study found that daily AI users were 64 percent more productive than non-users.

These numbers are real. They measure throughput, time-to-completion, and lines committed per day. They are also the wrong metrics for the question being asked. They sit on the quantity axis of a J-curve, where output is rising. The quality axis sits underneath, and the instruments to measure it have only just been built.


The Quality Trough

GitClear analyzed 211 million lines of changed code from 2023 through 2025 and found that code churn — the rate at which freshly committed code is rewritten or deleted within two weeks — climbed from a 3.3 percent baseline to between 5.7 and 7.1 percent over the period of AI tool adoption. AI-generated code clones grew fourfold. The authors named the pattern acceleration whiplash: individual developers ship faster than ever, but downstream rework accelerates faster still.
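To make the metric concrete, here is a minimal sketch of a two-week churn calculation over hypothetical per-line records; the data shape is invented for illustration and is not GitClear's actual schema or pipeline.

```python
from datetime import datetime, timedelta

# Hypothetical per-line records: when a line was committed, and when
# (if ever) it was later rewritten or deleted. Invented for illustration.
lines = [
    {"committed": datetime(2025, 3, 1), "revised": datetime(2025, 3, 9)},
    {"committed": datetime(2025, 3, 1), "revised": None},
    {"committed": datetime(2025, 3, 2), "revised": datetime(2025, 4, 20)},
]

WINDOW = timedelta(days=14)  # "freshly committed" horizon used above

def churn_rate(records):
    """Share of committed lines rewritten or deleted within the window."""
    churned = sum(
        1 for r in records
        if r["revised"] is not None and r["revised"] - r["committed"] <= WINDOW
    )
    return churned / len(records)

print(f"{churn_rate(lines):.1%}")  # 33.3% in this toy sample
```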

Microsoft's own study tells the second half of the story. Faster completion times, yes, but only a 3.62 percent improvement in code readability and a 5 percentage point uptick in pull-request approval rates. Volume metrics moved 10 to 15 times more than quality metrics. Snyk's analysis of AI-generated Python code found that 29.1 percent contained security weaknesses: chronic technical debt accruing in an inventory invisible to the throughput dashboard.
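The 10-to-15x figure is the loose ratio of the volume gain to each quality gain (loose because one quality number is a relative improvement and the other an absolute uptick):

```python
# Volume vs. quality gains from Microsoft's Copilot study, as cited above.
speedup = 55.0       # percent faster task completion
readability = 3.62   # percent improvement in code readability
approval = 5.0       # percentage-point uptick in PR approval rate

print(speedup / readability)  # ~15.2x
print(speedup / approval)     # 11.0x
```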

The 2025 DORA report — the field's reference benchmark for software delivery performance — added rework rate as its fifth core metric. Healthy code turnover was defined as below 15 percent at 30 days and below 22 percent at 90 days. The report cautioned that AI-generated code should not exceed 1.5 times human turnover. These benchmarks did not exist in 2024. The instrument is being built in real time, three years after the gun went off.
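Composed into a single health check, the thresholds look something like the sketch below. The cutoffs are the ones stated above; the function itself is illustrative, not anything DORA publishes.

```python
def turnover_checks(t30, t90, ai_rate=None, human_rate=None):
    """Evaluate code turnover against the benchmarks cited above."""
    checks = {
        "30-day turnover below 15%": t30 < 0.15,
        "90-day turnover below 22%": t90 < 0.22,
    }
    if ai_rate is not None and human_rate:
        checks["AI turnover within 1.5x human"] = ai_rate <= 1.5 * human_rate
    return checks

# A team running at GitClear's observed churn levels for AI-heavy code
# (7%) against the pre-AI baseline (3.3%):
for name, ok in turnover_checks(0.12, 0.25, ai_rate=0.07, human_rate=0.033).items():
    print(f"{name}: {'pass' if ok else 'fail'}")
# 30-day turnover below 15%: pass
# 90-day turnover below 22%: fail
# AI turnover within 1.5x human: fail
```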


The Asymmetry

Every metric that moves in AI's favor — pull request throughput, time to completion, lines per developer, story points closed — measures the quantity peak. Every metric that exposes the cost — rework rate, code churn, code turnover, security weakness rate, perception gap — is newer, less standardized, and harder to put on a board deck.

This is not accidental. Vendors price their products on per-seat productivity claims. Buyers approve those budgets on completion-time ROI cases. Individual contributors are evaluated on volume — pull requests merged, tickets closed, points completed. Every party in the transaction benefits from quantity metrics in the short run. The measurement system selects for what the buyer can see and the seller can sell.

The Dubach study published in December 2025 sharpened this into what the author called the AI productivity paradox: 93 percent of organizations had adopted AI coding tools, but measured productivity gains had stalled near 10 percent across the broader economy. Either the tools are less valuable than vendors claim, or the value is being absorbed by costs the metrics are not catching. The data on rework, security, and code turnover suggest the second.


The Cognitive Cost

There is a structural reason quality suffers as work sessions shorten. Decades of creativity research document the serial order effect: creative quality peaks late in a session, after quantity output declines, as executive function shifts from generative to evaluative mode. That late-session J-curve in human work is one of the most robust findings in the field, and the right tail of the curve is where the breakthroughs sit. Yet the average focused work session has fallen to 13 minutes 7 seconds in 2026, down 9 percent from 2023, and AI accelerates throughput without reducing cognitive load. The new tooling structurally undermines the sustained attention the late-session shift requires.

The volume peak is real. The quality trough is real. The right tail — where executive function shifts produce the work that justifies all of it — is being amputated by the same tools that move the headline metrics.


Where the Pattern Creates Opportunity

Long the firms building the rework-rate measurement layer. CodeRabbit, GitClear, Faros AI, DX, and Larridin sell instruments for what the throughput dashboards do not catch. Google, as custodian of the DORA framework, holds the standard-setting position. Engineering organizations that resist Copilot-volume-as-KPI and instrument code turnover instead will be cited as case studies in 2027 the way Spotify and Netflix were cited for organizational structure in 2018.

Short AI coding tool vendors priced on per-seat productivity claims that outrun rework data. Short enterprises with KPIs tied to PR throughput, lines per developer, and completion time — the volume metrics will keep going up while net value plateaus, and the gap will become visible to boards over the next four to six quarters. Junior developers, most exposed to amplified work without the late-session executive function to recognize when to stop and rewrite, will absorb the largest share of the technical debt being accumulated invisibly.

The falsifiable claim: by the end of 2027, code survival rate or rework rate displaces pull request throughput as the headline AI productivity metric in DORA-style reports. Companies that publish both volume and survival metrics will report measurably different productivity stories than those publishing only volume. CodeRabbit's 2026 thesis — that this will be the year of AI quality — is itself the market signal that the right tail of the J-curve is now being priced.

The story everyone believes is the volume peak. The story the data is telling is the quality trough. Every measurement system that rewards the first hides the second. Until the instruments change, the gap will widen.


Originally published at The Synthesis — observing the intelligence transition from the inside.
