The 55.8 Percent Productivity Number From Doshi And Vaishnav Is Narrower Than People Think

#llm #ai #productivity

When Doshi and Vaishnav published their controlled experiment on AI code completion in Science (2023), the headline that propagated everywhere was "55.8% faster." Repeat it enough and it becomes received wisdom.

The actual paper measured time-to-completion on a single, well-defined HTTP server task: a problem with a known shape, a stable target, and a scoring function that rewarded a specific solution path. The 55.8% lift was real for that task. It is also the narrowest possible reading of what "AI productivity" means in software work.

A more careful follow-up at HICSS-59 (Stray et al., 2026) looked at sustained workflow integration over weeks instead of a single benchmarked task. The numbers compressed. Across mixed work (greenfield, debugging, refactoring, code review), aggregate time savings landed closer to 10-20%, with high variance across task classes. Debugging and code review barely moved. Greenfield CRUD work moved the most.

That gap between a single-task lab benchmark and an integrated weekly workflow is where most engineering organizations' AI productivity decisions are silently going wrong.
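To see why a large single-task number can shrink to a low-double-digit aggregate, it helps to weight per-class savings by how much time each class takes. The sketch below is a minimal illustration of that arithmetic. The time shares and per-class savings are invented placeholders, not figures from either study, and `aggregate_saving` is a hypothetical helper name.

```python
# Illustrative only: the shares and savings below are made-up placeholders,
# not measurements from Doshi and Vaishnav or from Stray et al.

def aggregate_saving(work_mix):
    """Return the overall fraction of time saved across a mix of task classes.

    work_mix maps task class -> (share_of_baseline_time, fraction_saved).
    The shares must sum to 1.
    """
    total_share = sum(share for share, _ in work_mix.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("time shares must sum to 1")
    # Weighted sum: a big saving on a small slice of the week stays small overall.
    return sum(share * saved for share, saved in work_mix.values())


hypothetical_mix = {
    "greenfield": (0.25, 0.50),   # a quarter of the week, large saving
    "debugging": (0.35, 0.05),    # largest slice, barely moves
    "refactoring": (0.20, 0.10),
    "code_review": (0.20, 0.00),
}

overall = aggregate_saving(hypothetical_mix)
print(f"aggregate time saved: {overall:.1%}")
```

With these placeholder inputs, a 50% saving on greenfield work contributes only 12.5 points to the total, because greenfield is a quarter of the week. Swap in your own shares before drawing any conclusion from it.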

The mechanism gap

A code-completion model does one thing: predict the next plausible token sequence given local context. That is fantastic when the context is a half-finished function with a clear signature and the standard completion is the right one. It is much weaker when the work involves:

Tracing a bug through three repos and a queue
Deciding which refactor is worth doing
Reading existing code to understand intent before touching it
Negotiating a schema change with another team
Writing the test that catches the actual failure mode

None of those are next-token problems.

Where the gains actually compound

Builders shipping production AI workflows in 2025-26 are seeing real, durable lift, but not by turning on Copilot and waiting. The compounding wins look like:

Stack reduction. Skip a build step entirely. Replace a 4-step ETL with a single LLM-and-validator pass for cases where the validator can be trusted.
Context elimination. Cut the time it takes to load a problem into working memory: quick orientation queries on an unfamiliar codebase, API surface lookups, error message triage.
Boilerplate elimination at the boundary. Form validators, type-mapping, mock data, fixture generation.
Spec to first-draft compression. Get a structurally-correct first cut, then spend the saved time on the parts that need taste.
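The "single LLM-and-validator pass" from the stack-reduction item can be sketched roughly as below. This is one possible shape, under stated assumptions: `fake_model` stands in for whatever model call you use, the record schema in `REQUIRED_FIELDS` is invented for illustration, and returning `None` is an assumed signal to fall back to the existing pipeline.

```python
import json
from typing import Callable, Optional

# Hypothetical schema for the example; replace with your own contract.
REQUIRED_FIELDS = {"order_id": str, "amount_cents": int, "currency": str}


def validate(record: dict) -> bool:
    """Accept a record only if every required field has exactly the expected type."""
    return all(
        type(record.get(name)) is kind for name, kind in REQUIRED_FIELDS.items()
    )


def extract_with_validator(
    raw: str, extract: Callable[[str], str]
) -> Optional[dict]:
    """Run one model pass and keep the result only if the validator accepts it.

    Returns None when the output is not valid JSON or fails validation,
    so the caller can route the input to the existing multi-step pipeline.
    """
    try:
        record = json.loads(extract(raw))
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or not validate(record):
        return None
    return record


def fake_model(raw: str) -> str:
    # Stand-in for a real model call; always returns the same JSON string.
    return '{"order_id": "A-17", "amount_cents": 1250, "currency": "EUR"}'


result = extract_with_validator("Order A-17, 12.50 EUR", fake_model)
print(result)
```

The validator is the part that earns the trust. The model output is treated as untrusted input, and anything that fails the check goes back to the slower path rather than into production data.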

What this means for tooling decisions

Stop comparing AI tooling claims on single-task benchmarks. Ask vendors for sustained-workflow time-distribution data over weeks of real engineering work.

Measure your own lift the same way. Pick three task classes, instrument time-to-merge over a 4-week window, compare against baseline.
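A minimal sketch of that measurement, assuming you can export time-to-merge in hours per merged change, tagged by task class, for a baseline window and an AI-assisted window. The task class names and sample durations are invented for illustration, and using the median rather than the mean is a choice made here because a few long-running changes can dominate an average.

```python
from statistics import median


def lift_by_task_class(baseline, with_ai):
    """Fractional reduction in median time-to-merge, per task class.

    Both arguments map task class -> list of time-to-merge values in hours.
    Classes with no data in either window, or a zero baseline, are skipped.
    """
    report = {}
    for task_class, before in baseline.items():
        after = with_ai.get(task_class)
        if not before or not after:
            continue
        base_median = median(before)
        if base_median == 0:
            continue
        report[task_class] = (base_median - median(after)) / base_median
    return report


# Hypothetical 4-week samples, not real measurements.
baseline = {
    "crud_feature": [10, 12, 8, 14],
    "bug_fix": [6, 5, 9],
    "review": [2, 3, 2],
}
with_ai = {
    "crud_feature": [6, 8, 7, 9],
    "bug_fix": [6, 5, 8],
    "review": [2, 3, 2],
}

report = lift_by_task_class(baseline, with_ai)
for task_class, lift in report.items():
    print(f"{task_class}: {lift:.0%} faster at the median")
```

Reporting per class instead of one blended number keeps the variance visible. A real comparison would also need enough samples per class and some control for task size, which this sketch does not attempt.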

Hire for orchestration skill, not typing speed. The bottleneck moved.

Summary

The 55.8 percent number is not wrong; it is narrow. Sustained workflow integration data puts realistic aggregate productivity lift in the low double digits, concentrated in specific task classes.

Sources: Doshi and Vaishnav, Science 2023. Stray et al., HICSS-59 proceedings 2026.