DEV Community

Aamer Mihaysi

The Real Bottleneck in AI Coding Isn't Generation—It's Everything Else

Shopify's CTO Mikhail Parakhin recently shared internal data that should unsettle anyone building with AI agents. Their engineers now consume unlimited Opus-4.6 tokens. Unlimited. The company's AI tool adoption curve shot from gradual to nearly vertical in December 2025. And yet Parakhin keeps obsessing over something that sounds almost mundane: pull request review.

This is the part nobody wants to talk about.

We've spent two years optimizing prompt engineering, agent architectures, and context windows. We've built systems that can generate thousands of lines of code per hour. What we haven't solved is what happens after generation—when that code needs to be reviewed, tested, merged, and deployed without breaking production.

Shopify's data reveals the uncomfortable truth. AI-written code has fewer bugs per line than human-written code. But because models write so much more of it, the absolute number of bugs reaching production has increased. The narrow waist in the development pipeline isn't generation anymore. It's the review and validation layer.

Parakhin describes their solution: expensive models taking turns in a critique loop. One agent writes. Another reviews with a different model. They debate. The code gets rewritten. This takes longer than parallel agent swarms, but produces higher quality. Shopify built their own PR review system because off-the-shelf tools optimize for speed, not depth.
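The critique loop described above can be sketched as a small orchestration pattern. This is a minimal illustration of the control flow only, not Shopify's actual system; the `write` and `review` callables stand in for calls to two different models, and the stub implementations below exist purely to demonstrate the loop.

```python
def critique_loop(write, review, task, max_rounds=3):
    """Alternate a writer model and a reviewer model until the
    reviewer approves or the round budget is exhausted.

    write(task, feedback) -> code string
    review(task, code) -> (verdict, feedback), verdict in {"approve", "revise"}
    """
    code = write(task, feedback=None)
    for _ in range(max_rounds):
        verdict, feedback = review(task, code)
        if verdict == "approve":
            return code, True
        # Rewrite against the reviewer's critique rather than starting fresh.
        code = write(task, feedback=feedback)
    return code, False


# Hypothetical stub models, used only to show the loop terminating.
def stub_writer(task, feedback):
    return "v2" if feedback else "v1"

def stub_reviewer(task, code):
    if code == "v2":
        return "approve", ""
    return "revise", "handle the empty-input case"


final_code, approved = critique_loop(stub_writer, stub_reviewer, "parse config")
```

The key design choice, as the post notes, is that this is serial by construction: each round spends expensive tokens on critique instead of fanning out parallel attempts, trading latency for depth.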

The implications are deeper than tooling choices.

Git was designed for human-scale code velocity. Pull requests assume humans need time to review. CI/CD pipelines assume test suites can keep pace with commit frequency. These assumptions break when code generation becomes instantaneous and abundant.

Shopify now sees deployment cycles lengthening not because of slow generation, but because of test failures, rollback cascades, and the statistical inevitability that more code means more edge cases. They're experimenting with stacked PRs via Graphite. They're questioning whether Git's merge-conflict model—essentially a global mutex on the codebase—can survive machine-scale contribution.

This isn't a Shopify problem. This is an industry problem we've been ignoring while chasing benchmark scores.

The metrics we've been optimizing—tokens per dollar, latency to first token, pass@k on coding benchmarks—assume that generation is the scarce resource. In production environments, the scarce resource has shifted to validation capacity. How many lines of code can your review infrastructure actually certify? How many test failures can your deployment pipeline absorb before rollback becomes the default?

Parakhin's token budget philosophy reveals the shift. He agrees with Jensen Huang's directional insight that engineers should consume more tokens. But he measures success differently. The important ratio isn't tokens generated—it's the budget split between generation and review. Expensive models checking work. Time allocated to critique rather than parallel exploration.

This is why specialized TPU architectures like Google's new agentic-era chips matter less for raw generation throughput than for enabling the multi-model inference patterns that quality validation requires. The future compute bottleneck isn't training or even inference—it's the orchestrated dance of critique, revision, and verification that production code demands.

What Shopify discovered mirrors what every large AI-adopting organization is learning: the hard part of agentic coding was never the coding. It was maintaining software quality at machine speed. It was preserving system reliability when human review became the bottleneck. It was rebuilding development workflows designed around human cognitive limits for a world where those limits no longer apply.

The companies that solve this—really solve it, not just slap AI review tools onto legacy workflows—will capture disproportionate value. Everyone else will drown in generated code they cannot safely ship.

Parakhin is hiring distributed database engineers and ML infrastructure specialists. He needs people who can reimagine how code repositories work when commits arrive at machine frequency. This should tell you everything about where the real engineering challenges have moved.

Generation solved. Integration remains.
