Boris Vera

AI code review is broken. Here’s what nobody wants to admit

I’ve been doing DevOps and infrastructure work for 8 years. Last week I saw a dev on X say he burned $250 in 2 days running Claude for automated PR reviews. That number didn’t surprise me at all.

We tried the same thing. Wire up an LLM to your CI pipeline, send every PR through it, get automated review comments. Sounds great in theory. In practice, you get two problems that nobody in the AI tooling space wants to talk about.
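For reference, the naive version of that setup is only a few lines. Here’s a minimal sketch, assuming the official anthropic Python SDK; the model ID, env vars, prompt, and git invocation are illustrative, not a recommendation:

```python
# Minimal CI review step: send the PR diff to an LLM, print its comments.
# Sketch only -- model ID, env vars, and prompt are illustrative assumptions.
import os
import subprocess

import anthropic

# Diff the PR branch against its base (exact refs depend on your CI checkout).
diff = subprocess.run(
    ["git", "diff", "origin/main...HEAD"],
    capture_output=True, text=True, check=True,
).stdout

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model ID
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": f"Review this pull request diff and list any issues:\n\n{diff}",
    }],
)
print(response.content[0].text)  # a real pipeline would post this back to the PR
```

Every PR, every push, every review round runs something like this. Keep that loop in mind for what follows.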

The cost model is backwards

You’re paying per token. That means a one-line README typo costs the same to review as a 500-line auth refactor. Every review round resends the full diff plus repo context, so you’re paying for the same tokens over and over. You’re not paying for intelligence, you’re paying for repetition.

Someone on X put it perfectly: “the bill is mostly cache misses.”
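Napkin math makes the point. Every number below is an assumption for illustration, not real pricing:

```python
# Back-of-the-envelope: why resending the same context dominates the bill.
# All figures are illustrative assumptions, not real pricing.
PRICE_PER_MTOK_INPUT = 3.00   # $ per million input tokens (assumed)
CONTEXT_TOKENS = 40_000       # repo context resent every round (assumed)
DIFF_TOKENS = 5_000           # the actual changed lines (assumed)
ROUNDS = 4                    # review rounds on a typical PR (assumed)

per_round = (CONTEXT_TOKENS + DIFF_TOKENS) * PRICE_PER_MTOK_INPUT / 1e6
total = per_round * ROUNDS
resent = CONTEXT_TOKENS * (ROUNDS - 1) * PRICE_PER_MTOK_INPUT / 1e6

print(f"cost per round:  ${per_round:.3f}")   # $0.135
print(f"total for PR:    ${total:.3f}")       # $0.540
print(f"resent context:  ${resent:.3f}")      # $0.360 -- two-thirds of the bill
```

With these made-up but plausible numbers, two-thirds of the PR’s bill is the same repo context billed again and again.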

And no, cheaper models don’t fix this. Token prices drop, but usage scales faster: halve the per-token price and teams push three times as much code through the pipe, and the bill still grows by 50%. It’s Jevons paradox applied to AI.

The output is a glorified linter

Here’s what the models actually catch: style inconsistencies, naming conventions, missing error handling, “consider adding a try-catch here.” Stuff your linter already catches or should catch.

Here’s what they miss: cross-file dependencies, architectural issues, subtle logic bugs, race conditions, security implications that require understanding your codebase as a whole. The stuff that actually breaks prod.
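Here’s the kind of thing I mean. A toy check-then-act race (invented for illustration, not from any real codebase) that looks perfectly reasonable in a diff:

```python
# A diff touching only withdraw() looks fine in isolation, but the
# check-then-act pattern is a race once two threads hit it concurrently.
import threading
import time

balance = 100

def withdraw(amount: int) -> None:
    global balance
    if balance >= amount:   # check
        time.sleep(0.01)    # widen the window so the race shows up reliably
        balance -= amount   # act

threads = [threading.Thread(target=withdraw, args=(100,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(balance)  # usually prints -100: both threads passed the check first
```

A reviewer who knows this function is called from concurrent request handlers flags it. A model looking at the diff in isolation compliments your naming.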

So you end up paying cloud API rates for reviews that still need human oversight. You didn’t replace code review. You added a cost layer on top of it.

The real bottleneck has shifted

AI made writing code almost free. A junior dev with Claude or Copilot can produce PRs at 5x the speed they could a year ago. But reviewing that code is still slow, expensive, and manual.

The bottleneck is no longer writing the code. It’s reviewing it.

And the current tools make this worse, not better. Your team opens more PRs than ever, and every one of them generates 800 lines of AI suggestions that someone has to read, triage, and mostly ignore.

So what actually works?

Honestly, I don’t have a perfect answer yet. But after months of banging my head against this, here’s what I believe:

Code review shouldn’t be priced like a chatbot conversation. The value is per review, not per token. The tool should understand your codebase, not just the diff in front of it. And it should prioritize by severity instead of dumping 50 comments of equal weight on every PR.
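To be concrete about that last point, here’s the triage layer I wish these tools shipped. The Finding shape, severity scale, and threshold are my own invented example, not any tool’s actual API:

```python
# Sketch: surface only findings above a severity bar, worst first,
# instead of posting all 50 comments with equal weight.
from dataclasses import dataclass

SEVERITY_RANK = {"critical": 3, "major": 2, "minor": 1, "nit": 0}

@dataclass
class Finding:
    severity: str  # one of SEVERITY_RANK's keys
    file: str
    message: str

def triage(findings: list[Finding], threshold: str = "major") -> list[Finding]:
    """Drop findings below the threshold; sort the rest worst-first."""
    floor = SEVERITY_RANK[threshold]
    kept = [f for f in findings if SEVERITY_RANK[f.severity] >= floor]
    return sorted(kept, key=lambda f: SEVERITY_RANK[f.severity], reverse=True)

findings = [
    Finding("nit", "README.md", "Trailing whitespace"),
    Finding("critical", "auth/session.py", "Session token never expires"),
    Finding("minor", "api/routes.py", "Consider renaming this handler"),
]
for f in triage(findings):
    print(f"[{f.severity}] {f.file}: {f.message}")  # only the critical one posts
```

The point isn’t the ten lines of Python. It’s that expensive human attention should only get spent above the bar.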

The current generation of tools (raw API, Copilot PR review, most GitHub Actions wrappers) all share the same fundamental problem: they treat code review as a text completion task instead of an engineering judgment task.

Someone will fix this. Maybe soon. Until then, we’re all paying senior engineer rates for intern-level output.

Curious what your experience has been. Are you using AI code review? Has anyone actually made it work without the cost spiraling out of control?
