hefty

Posted on Jun 4

AI code review is a routing problem now

#ai #programming #codequality #devops

The weakest version of AI code review is also the easiest one to demo.

Take a diff. Paste it into a model. Ask for a review. Get back a wall of comments that sound reasonable, half of which are too vague to act on and a few of which are just wrong enough to waste everyone's time.

That is not a review system. That is a comment generator.

The useful version looks much less magical. It looks like routing. It decides which parts of a change deserve attention, which reviewer should inspect them, what severity means, when a human needs to approve the result, and when the bot should stay quiet.

That last part matters more than people admit. Developers do not ignore review bots because they hate automation. They ignore review bots because the bots train them to ignore noise.

One prompt cannot be the review process

The naive prompt fails because it has no operating contract.

"Review this PR" sounds clear until you ask what kind of review you mean. Security? Data migration risk? Frontend accessibility? Dependency policy? Performance footguns? API compatibility? Test coverage? Abuse cases? Dead code? Naming?

A senior engineer does not review every diff the same way. A small CSS cleanup and an auth change do not deserve the same review path. A generated test update and a billing migration should not get the same amount of model attention.

So why do we keep asking one general-purpose model to behave like every reviewer at once?

The better pattern is boring and obvious once you see it:

- split review concerns into scoped reviewers
- give each reviewer explicit responsibilities and non-goals
- normalize findings into structured output
- classify severity before bothering a human
- route risky changes differently from cheap ones

That is the difference between "the model had thoughts" and "the system made a review decision."

Specialist reviewers need boundaries, not vibes

Specialist agents are only useful when the specialization is real.

An accessibility reviewer should know what it is allowed to flag and what it should leave alone. A security reviewer should not waste time bike-shedding component names. A migration reviewer should care about rollback paths, data shape, and blast radius. A dependency reviewer should understand lockfile churn, license policy, and transitive risk.

The point is not to create a cute little panel of AI coworkers. The point is to reduce ambiguity.

Every reviewer needs a job description tight enough that the coordinator can judge its output. If two agents can produce the same comment in different words, the system is probably not specialized enough. If a reviewer cannot say "no issue found" without apologizing for it, the system will drown the PR in filler.

This is where AI review starts to feel like normal engineering again. You are defining interfaces. Inputs, outputs, ownership, failure behavior. The agent is just the worker behind one interface.
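To make "job description as interface" concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `ReviewerSpec`, the category names, and the overlap check are illustrations of the idea, not an API from any particular review tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewerSpec:
    """Hypothetical job description for one scoped reviewer."""

    name: str
    responsibilities: frozenset  # categories this reviewer owns
    non_goals: frozenset         # categories it must leave alone

    def accepts(self, category: str) -> bool:
        """A finding is in scope only if the reviewer explicitly owns its category."""
        return category in self.responsibilities and category not in self.non_goals


def overlapping_scopes(specs):
    """Return categories claimed by more than one reviewer."""
    seen, overlap = set(), set()
    for spec in specs:
        overlap |= seen & spec.responsibilities
        seen |= spec.responsibilities
    return overlap


# Illustrative lanes; the category names are assumptions.
security = ReviewerSpec(
    name="security",
    responsibilities=frozenset({"secrets", "authz", "injection"}),
    non_goals=frozenset({"naming", "styling"}),
)
accessibility = ReviewerSpec(
    name="accessibility",
    responsibilities=frozenset({"aria", "contrast", "keyboard-nav"}),
    non_goals=frozenset({"authz", "naming"}),
)
```

A non-empty result from `overlapping_scopes` is a cheap early warning that two reviewers can produce the same comment in different words.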

The coordinator is where trust lives or dies

Running five reviewers in parallel sounds powerful until all five return mediocre findings.

Now the problem is worse. The developer does not have one noisy bot. They have a noisy bot committee.

The coordinator layer is what makes the system survivable. It should dedupe findings, downgrade weak claims, merge overlapping concerns, and decide what deserves to block a PR. It should also preserve uncertainty instead of laundering every guess into confident review language.
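A rough sketch of that coordinator step, in Python. The `Finding` shape, the dedupe key (path, line, category), and the 0.8 confidence threshold are all assumptions chosen for illustration; a real system would tune them.

```python
from dataclasses import dataclass, replace

SEVERITY_ORDER = ["info", "warning", "blocking"]


@dataclass(frozen=True)
class Finding:
    reviewer: str
    path: str
    line: int
    category: str
    severity: str      # one of SEVERITY_ORDER
    confidence: float  # kept in the output so uncertainty stays visible
    message: str


def coordinate(findings, min_blocking_confidence=0.8):
    """Dedupe overlapping findings and downgrade weakly supported blockers."""
    best = {}
    for f in findings:
        key = (f.path, f.line, f.category)
        current = best.get(key)
        # When two reviewers flag the same spot, keep the more confident claim.
        if current is None or f.confidence > current.confidence:
            best[key] = f

    merged = []
    for f in best.values():
        # A low-confidence guess should not block a PR.
        if f.severity == "blocking" and f.confidence < min_blocking_confidence:
            f = replace(f, severity="warning")
        merged.append(f)

    # Most severe first, then a stable order by location.
    return sorted(
        merged,
        key=lambda f: (-SEVERITY_ORDER.index(f.severity), f.path, f.line),
    )


raw = [
    Finding("security", "app/auth.py", 10, "injection", "blocking", 0.9, "unsanitized input"),
    Finding("migration", "app/auth.py", 10, "injection", "warning", 0.6, "possible injection"),
    Finding("frontend", "web/Button.tsx", 3, "aria", "blocking", 0.4, "missing label?"),
]
result = coordinate(raw)
```

Here three raw findings become two: the duplicate on `app/auth.py` collapses into the higher-confidence one, and the shaky frontend blocker is downgraded to a warning.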

This is the part teams underestimate. The value of AI review is not the number of comments generated. It is the number of comments that a developer can act on without doing a forensic audit of the bot.

Good review systems need structured findings. File path. Line range when possible. Category. Severity. Confidence. Suggested fix. Reasoning short enough to read. A clear distinction between "must fix" and "worth considering."
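One possible shape for such a finding, sketched in Python. The field names, the two severity labels, and the 280-character cap on reasoning are assumptions made for the example, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    MUST_FIX = "must_fix"
    CONSIDER = "worth_considering"


@dataclass(frozen=True)
class Finding:
    path: str
    category: str
    severity: Severity
    confidence: float                 # 0.0 to 1.0, as reported by the reviewer
    reasoning: str
    line_start: Optional[int] = None  # line range only when it is known
    line_end: Optional[int] = None
    suggested_fix: Optional[str] = None

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if len(self.reasoning) > 280:  # arbitrary cap to keep reasoning readable
            raise ValueError("reasoning is too long to scan")

    def to_json(self) -> str:
        payload = asdict(self)
        payload["severity"] = self.severity.value  # enums are not JSON-serializable
        return json.dumps(payload, sort_keys=True)


finding = Finding(
    path="app/auth/session.py",
    category="security",
    severity=Severity.MUST_FIX,
    confidence=0.9,
    reasoning="Session token is written to the log at debug level.",
    line_start=41,
    line_end=47,
    suggested_fix="Redact the token before logging.",
)
```

Rejecting malformed findings at construction time means a reviewer that rambles or invents a confidence of 7 fails loudly instead of reaching the PR.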

Without that structure, the review becomes theater. The bot speaks. The human squints. Nobody knows whether the finding is policy, preference, or panic.

Risk tiers beat blanket automation

Not every diff deserves the same machinery.

The practical move is to define risk tiers before the agent gets involved.

Low-risk changes might get cheap lint-style checks and a quick scan. Medium-risk changes might trigger targeted reviewers. High-risk changes should bring in stricter gates, human approval, audit logs, and maybe a rule that the bot can suggest but not approve.
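Written down as data, those tiers might look something like the sketch below. The reviewer names and the exact flags per tier are placeholders; the point is that the policy exists before any model runs.

```python
from dataclasses import dataclass
from enum import Enum


class RiskTier(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class TierPolicy:
    reviewers: tuple        # which review lanes run for this tier
    human_approval: bool    # a person must sign off
    bot_may_approve: bool   # False means the bot can suggest but not approve
    audit_log: bool


# Hypothetical policy table; every value here is an assumption.
POLICIES = {
    RiskTier.LOW: TierPolicy(("lint",), False, True, False),
    RiskTier.MEDIUM: TierPolicy(("security", "frontend"), False, True, True),
    RiskTier.HIGH: TierPolicy(("security", "migration", "dependency"), True, False, True),
}


def bot_can_approve(tier: RiskTier, blocking_findings: int) -> bool:
    """The bot approves only when the tier allows it and nothing is blocking."""
    policy = POLICIES[tier]
    return (
        policy.bot_may_approve
        and not policy.human_approval
        and blocking_findings == 0
    )
```

With this table, `bot_can_approve(RiskTier.HIGH, 0)` is false even for a clean review, which is exactly the "suggest but not approve" rule.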

This is especially true once agents touch production-adjacent tools. The failure mode is no longer just "bad code landed." It can be "the agent called the wrong internal API," "the agent had write access it did not need," or "the agent confidently touched a path with a bigger blast radius than the prompt implied."

Human-in-the-loop is not a step backward. It is the control plane.

The right question is not "can AI approve this by itself?" The right question is "what type of change is this, and what kind of approval makes sense for that risk?"

That framing keeps small changes fast without pretending every change is small.

Tool filtering is part of code review

There is another unglamorous piece here: what the agent can see and call.

Tool sprawl quietly wrecks agent systems. Give a reviewer every MCP server, every repo tool, every internal command, and every document surface, and you have not made it smarter. You have made its decision space messier. You have also made the permission story harder to explain.

Review agents should get the smallest useful tool surface.

If the task is frontend accessibility review, maybe it needs the diff, relevant component files, rendered output, and an accessibility checklist. It probably does not need deploy credentials. If the task is dependency review, it needs package metadata and policy context. It does not need broad write access to the repo.
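A per-reviewer allowlist is one simple way to enforce this. The sketch below is illustrative only: the tool names are invented, and a real setup would map them onto whatever tool registry the agent framework exposes.

```python
# Hypothetical tool names; none of these refer to a real MCP server or CLI.
ALLOWED_TOOLS = {
    "accessibility": {
        "read_diff",
        "read_component_files",
        "render_preview",
        "read_a11y_checklist",
    },
    "dependency": {
        "read_diff",
        "read_package_metadata",
        "read_license_policy",
    },
}


def tools_for(reviewer: str, available: set) -> list:
    """Return only tools that are both installed and allowlisted.

    Unknown reviewers get no tools at all (deny by default).
    """
    return sorted(available & ALLOWED_TOOLS.get(reviewer, set()))


installed = {
    "read_diff",
    "read_component_files",
    "render_preview",
    "read_package_metadata",
    "deploy",
    "write_repo",
}
a11y_tools = tools_for("accessibility", installed)
```

Deny-by-default matters here: adding a new tool to the platform should never silently widen what an existing reviewer can call.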

This also helps with cost. Context is not free. Parallel agents make that painfully obvious. A system that filters diffs, shares common context, and routes only the useful slice to each reviewer will be cheaper and easier to debug than one that dumps everything into every call.
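Routing "only the useful slice" can start as plain filename matching. This is a sketch under that assumption; the lane names and patterns are made up, and matching on the basename alone is a deliberate simplification.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical lanes and filename patterns.
LANE_PATTERNS = {
    "frontend": ["*.tsx", "*.css"],
    "dependency": ["package.json", "*.lock", "requirements*.txt"],
}


def slice_for(lane: str, changed_files: dict) -> dict:
    """Keep only the per-file diffs whose filename matches the lane's patterns."""
    patterns = LANE_PATTERNS.get(lane, [])
    return {
        path: diff
        for path, diff in changed_files.items()
        if any(fnmatchcase(PurePosixPath(path).name, p) for p in patterns)
    }


changed = {
    "web/src/Button.tsx": "<diff text>",
    "package.json": "<diff text>",
    "db/migrations/0007_add_index.sql": "<diff text>",
}
frontend_slice = slice_for("frontend", changed)
```

In this example the frontend reviewer sees one file instead of three, and a lane with no matching files gets an empty slice, which is a natural signal not to run it at all.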

Filtering is not an optimization pass at the end. It is part of the architecture.

What small teams should steal

Most teams do not need a Cloudflare-scale review platform. Copying the whole shape would be silly.

But small teams can steal the important moves.

Start with two or three review lanes instead of one generic bot. For example:

- security and secrets
- risky migrations or data changes
- frontend behavior and accessibility

Write down what each lane is allowed to comment on. Write down what should block a PR. Write down what should be informational. Then make the output structured enough that a human can scan it quickly.

Add a simple risk router. File paths alone can get you surprisingly far at first. Changes under auth, billing, migrations, infrastructure, or permission-sensitive code can trigger stricter review. Docs and test-only changes can take a cheaper path.
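A first version of that router can be a few lines of Python. The directory names below are assumptions about repo layout, and the rule that any sensitive directory wins (even under `tests/`) is a conservative choice, not the only reasonable one.

```python
from pathlib import PurePosixPath

# Hypothetical directory names; adjust to the actual repo layout.
HIGH_RISK_DIRS = {"auth", "billing", "migrations", "infra", "permissions"}
LOW_RISK_DIRS = {"docs", "tests"}
TIER_ORDER = {"low": 0, "medium": 1, "high": 2}


def classify_path(path: str) -> str:
    """Classify one changed file by the directories it lives under."""
    dirs = set(PurePosixPath(path).parts[:-1])
    if dirs & HIGH_RISK_DIRS:
        return "high"  # sensitive directories win, even inside tests/
    if dirs & LOW_RISK_DIRS or path.endswith(".md"):
        return "low"
    return "medium"


def classify_change(paths) -> str:
    """A change is as risky as its riskiest file."""
    return max(
        (classify_path(p) for p in paths),
        key=TIER_ORDER.__getitem__,
        default="low",
    )
```

So `classify_change(["docs/setup.md", "src/billing/invoice.py"])` lands in the high tier, because one billing file is enough to pull the whole PR onto the stricter path.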

Keep an audit trail. Which reviewers ran? Which tools did they use? Which findings were dismissed? Which ones blocked? If the system cannot explain its own behavior, people will not trust it when the stakes go up.
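The audit trail does not need to be fancy. One JSON line per review run answers most of those questions; the record below is a sketch, and its field names and the `example-pr` identifier are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ReviewAuditRecord:
    """Hypothetical audit record for one review run."""

    change_id: str
    reviewers_run: list
    tools_used: dict            # reviewer name -> list of tool names
    findings_dismissed: list
    findings_blocking: list
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: ReviewAuditRecord) -> str:
    """Serialize the record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = ReviewAuditRecord(
    change_id="example-pr",
    reviewers_run=["security", "migration"],
    tools_used={"security": ["read_diff"], "migration": ["read_diff", "read_schema"]},
    findings_dismissed=["naming nit in utils"],
    findings_blocking=["migration has no rollback path"],
)
line = to_log_line(record)
```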

And please add an escape hatch. A broken AI review gate should be visible, logged, and bypassable by the right human. Otherwise you have not built quality infrastructure. You have built a new way for CI to hold a team hostage.
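Here is one way the escape hatch could look, as a sketch. The role names and the three gate states ("pass", "fail", "error") are assumptions; what matters is that every bypass is restricted to the right people and leaves a log entry.

```python
import logging

logger = logging.getLogger("review_gate")

# Hypothetical roles allowed to override the gate.
BYPASS_ROLES = {"maintainer", "oncall"}


def gate_decision(review_state, bypass_by=None, bypass_role=None):
    """Decide whether a PR may proceed.

    review_state is "pass", "fail", or "error" (the gate itself broke).
    Returns "pass", "bypassed", or "blocked".
    """
    if review_state == "pass":
        return "pass"
    if bypass_by and bypass_role in BYPASS_ROLES:
        # Bypasses are allowed but never silent.
        logger.warning(
            "review gate bypassed by %s (%s); gate state was %s",
            bypass_by, bypass_role, review_state,
        )
        return "bypassed"
    if bypass_by:
        logger.warning("bypass denied for %s (%s)", bypass_by, bypass_role)
    return "blocked"
```

Note that a broken gate ("error") blocks by default but can be overridden by the same authorized roles, so an outage in the review service does not freeze every merge.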

Dynamic workflows make this urgent

Parallel subagents are becoming normal product behavior. That is the direction the tools are moving: one agent plans, several agents investigate, another verifies, and the user sees the final result.

That can be genuinely useful for review. Bug hunts, security checks, migration audits, and regression verification all benefit from bounded parallel work.

But parallelism does not remove the need for judgment. It multiplies the need for it.

More agents mean more outputs to reconcile, more token spend to justify, more permission boundaries to define, and more failure modes to observe. If the orchestration layer is weak, parallel review just produces wrong answers faster and in more places.

The winning setup is not "run as many agents as possible."

The winning setup is "route the work so the right agents run, the wrong agents stay idle, and the human only sees findings that survived a real filter."

Final thought

AI code review is not becoming a better comment box. It is becoming infrastructure.

The teams that get value from it will treat review like a routed system with narrow workers, severity rules, permission boundaries, telemetry, and human approval where the blast radius is real.

The teams that do not will keep pasting diffs into a model and wondering why developers stopped reading the output.

The future of AI review is not more comments.

It is knowing which comments are worth making.

Source notes

DEV Community