Prakhar Singh

Posted on May 12 • Edited on Jun 30 • Originally published at prakharsingh.github.io

Agentic code review in production: orchestration, evaluation, and the cost of being wrong

#ai #codereview #llm #devtools

What "agentic" actually buys you over a linter, why single-model approaches stall, and why the false-positive rate, not raw model capability, determines whether the system stays in the loop.

"Agentic" has become a marketing flag, but in code review it carries a precise technical meaning: the system, not the user, decides which tools to invoke against a change, in what order, and how to weight their findings. A linter runs a fixed pipeline. A single-pass language-model reviewer reads the diff and emits comments end-to-end. An agentic reviewer chooses between a compiler, a type checker, a test runner, a secret scanner, a static analyzer, and one or more language-model calls, then arbitrates their disagreements before surfacing a review comment.

The model is one tool among several. The system's value is in the arbitration policy that decides which findings reach the developer.

The orchestration problem

Single-model approaches stall on three axes that pull in different directions: accuracy, latency, and cost. A frontier model gives the strongest multi-step reasoning on a non-trivial change but typically adds several seconds of latency and an order-of-magnitude cost premium per call; a small open-weights model returns in under a second but misses subtle invariants. Three routing strategies cover most of the production space:

Task-classification routing. A lightweight classifier (a smaller model or a rules layer) decides which downstream model handles a request. Style nits, dead-code removal, and import-order checks go to a fast cheap model; logic changes and concurrency reasoning go to a stronger one. This works as long as the classifier is calibrated; misclassification lands hard-reasoning prompts on under-powered models and produces confident-sounding nonsense.
Fallback chains. Try the cheap model first; if self-consistency across N samples is low or a cheap verifier disagrees, escalate. This is robust against classifier drift but doubles cost on the long tail.
Evaluation-driven A/B routing. Maintain an offline evaluation set of past pull requests with human accept/reject outcomes; score model variants on precision and recall against that ground truth and route traffic to whichever variant scores highest on the relevant slice. This is the only strategy that adapts when a model is updated.
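
A minimal sketch of the scoring step behind evaluation-driven routing, under stated assumptions: the `EvalRecord` shape, the slice labels, and the choice of an F-beta score with beta = 0.5 (which weights precision above recall) are all illustrative, not a description of any particular production system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    variant: str       # hypothetical model-variant name
    slice_name: str    # hypothetical slice label, e.g. "style" or "concurrency"
    flagged: bool      # the variant raised a comment on this historical change
    real_issue: bool   # human reviewers confirmed an issue (ground truth)


def precision_recall(records):
    """Precision and recall of one variant on one slice of the eval set."""
    tp = sum(r.flagged and r.real_issue for r in records)
    fp = sum(r.flagged and not r.real_issue for r in records)
    fn = sum(not r.flagged and r.real_issue for r in records)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 favors precision. 0.5 is an assumed choice."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_variant_per_slice(records):
    """Map each slice to the variant with the highest F-beta score on it."""
    grouped = defaultdict(list)
    for r in records:
        grouped[(r.slice_name, r.variant)].append(r)
    best = {}
    for (slice_name, variant), group in grouped.items():
        score = f_beta(*precision_recall(group))
        if slice_name not in best or score > best[slice_name][1]:
            best[slice_name] = (variant, score)
    return {s: v for s, (v, _) in best.items()}
```

Re-running `best_variant_per_slice` against a fresh eval set after each model update is what lets the routing table move with the models instead of lagging behind them.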

In practice, production systems combine all three: classify, fall back on low confidence, and let offline evaluations reshape weights every release cycle.
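
The classify-then-fall-back part of that combination can be sketched as follows. This is a rough illustration, not a reference design: the models are plain callables standing in for real endpoints, and `route_review`, the `"hard"` label, the sample count, and the 0.8 agreement threshold are all assumed names and values.

```python
from collections import Counter
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # takes a diff, returns a review verdict


def agreement(samples: List[str]) -> float:
    """Fraction of samples that match the most common answer."""
    if not samples:
        return 0.0
    return Counter(samples).most_common(1)[0][1] / len(samples)


def route_review(
    diff: str,
    classify: Callable[[str], str],  # hypothetical classifier: "easy" or "hard"
    cheap: Model,
    strong: Model,
    n_samples: int = 5,
    min_agreement: float = 0.8,      # assumed threshold, tune offline
) -> Tuple[str, str]:
    """Return (verdict, route), where route records which path was taken."""
    # 1. Task classification: hard changes skip the cheap model entirely.
    if classify(diff) == "hard":
        return strong(diff), "strong"
    # 2. Fallback chain: sample the cheap model and measure self-consistency.
    samples = [cheap(diff) for _ in range(n_samples)]
    if agreement(samples) < min_agreement:
        # Low agreement: escalate, paying for both models on this request.
        return strong(diff), "escalated"
    return Counter(samples).most_common(1)[0][0], "cheap"
```

Returning the route alongside the verdict keeps the escalation rate measurable, which is the number that tells you how much the long tail is costing.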

Grounding with static analysis and retrieval

A pure language-model review hallucinates fixes: proposing API calls that do not exist, citing version-specific behavior incorrectly, suggesting refactors that break other call sites the model never saw. Two anchors push the hallucination rate down.

First, deterministic static analyzers run in parallel with the language model. Type errors, null dereferences, missing await, unused imports: these are cheap, deterministic, and not worth a model call. The agent uses their output as ground truth and frames its review around facts the static analyzer surfaced, not facts the model invented.

Second, retrieval-augmented generation over the repository itself: prior review threads, commit messages, and the project's design documents. Most code review observations are not novel. The same patterns get flagged across files: null-safety regressions, missing index migrations, inconsistent error wrapping. Retrieving prior review comments scoped to the touched files, modules, or owners shifts the model from generic best-practice advice to comments that match the codebase's established conventions.
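
A small sketch of the scoping step only, assuming prior comments are already stored with the path they were left on. `PriorComment` and `scoped_prior_comments` are hypothetical names, "module" is approximated here as "same directory", and a real system would likely add owner scoping and semantic ranking on top.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Iterable, List


@dataclass(frozen=True)
class PriorComment:
    path: str   # file the historical review comment was left on
    text: str


def scoped_prior_comments(
    touched_files: Iterable[str],
    history: List[PriorComment],
    limit: int = 5,
) -> List[PriorComment]:
    """Return prior comments on the touched files first, then same-directory ones."""
    touched = set(touched_files)
    touched_dirs = {PurePosixPath(p).parent for p in touched}
    same_file = [c for c in history if c.path in touched]
    same_module = [
        c for c in history
        if c.path not in touched and PurePosixPath(c.path).parent in touched_dirs
    ]
    # Exact-file matches rank ahead of module-level matches.
    return (same_file + same_module)[:limit]
```

The returned comments would then be placed in the model's context, so its suggestions are anchored to what reviewers have actually said about this part of the codebase.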

False positives as the dominant cost

Developer trust in an automated reviewer collapses non-linearly: a handful of bad comments is usually enough for the team to start dismissing the bot reflexively. The arithmetic is unforgiving: a 5% false-positive rate at twenty review comments per pull request is, on average, one bogus flag per PR. Within a sprint, the team stops reading the bot's output.
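
The arithmetic can be checked in a few lines. The second function is an addition that assumes each comment is a false positive independently of the others, which real review comments may not satisfy.

```python
def expected_false_positives(fp_rate: float, comments_per_pr: int) -> float:
    """Average number of bogus comments per pull request."""
    return fp_rate * comments_per_pr


def prob_at_least_one(fp_rate: float, comments_per_pr: int) -> float:
    """Chance a PR carries at least one bogus comment, assuming independence."""
    return 1 - (1 - fp_rate) ** comments_per_pr


# Figures from the text: 5% false-positive rate, twenty comments per PR.
print(expected_false_positives(0.05, 20))
print(round(prob_at_least_one(0.05, 20), 2))
```

Under that independence assumption, about 64% of pull requests would carry at least one bogus comment, so most developers would meet a false positive on most of their PRs.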

Three controls keep the rate manageable:

Confidence thresholding - never surface a comment below a calibrated threshold, even if the model is willing to speak.
Dedup against historical dismissals - if a reviewer dismissed an analogous comment six months ago, the same shape of comment on the same file is suspect this time.
A closed feedback loop - every dismissed or accepted comment becomes training signal for the next routing decision and threshold update.

The third is where most teams underinvest. Without the loop, the false-positive rate is whatever the underlying model happens to produce. With it, the rate trends down per release.
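
The three controls can be sketched together as a filter plus a feedback hook. Everything here is an assumption made for illustration: the `Candidate` shape, using a `(path, rule)` pair as the "shape" of a comment, both threshold values, and the in-memory set standing in for a dismissal store. Threshold recalibration from the recorded feedback is left out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Candidate:
    path: str
    rule: str          # hypothetical "shape" key, e.g. "missing-null-check"
    confidence: float  # assumed to be calibrated upstream


def surface(
    candidates: Iterable[Candidate],
    dismissed: Set[Tuple[str, str]],
    threshold: float = 0.8,           # assumed base threshold
    suspect_threshold: float = 0.95,  # assumed stricter bar for dismissed shapes
) -> List[Candidate]:
    """Keep only comments that clear the confidence bar for their shape."""
    kept = []
    for c in candidates:
        # Dedup against history: a previously dismissed shape on the same
        # file is treated as suspect and must clear a stricter threshold.
        bar = suspect_threshold if (c.path, c.rule) in dismissed else threshold
        if c.confidence >= bar:
            kept.append(c)
    return kept


def record_feedback(
    dismissed: Set[Tuple[str, str]], comment: Candidate, accepted: bool
) -> None:
    """Closed loop: every developer outcome updates the dismissal history."""
    key = (comment.path, comment.rule)
    if accepted:
        dismissed.discard(key)
    else:
        dismissed.add(key)
```

Treating a dismissed shape as suspect rather than banned leaves room for a genuinely high-confidence finding to get through, at the cost of one more threshold to calibrate.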

Compliance as a routing constraint

Compliance is not a bolt-on check. It belongs at the same layer as task classification: a first-class routing input, not a separate stage tacked on at the end.

Code touching regulated data (protected health information, payment card numbers, EU resident identifiers) has to route differently. GDPR shapes both transfer (no diffs leaving the controller's processors without a Data Processing Agreement) and retention (logged prompts and completions are themselves processing activity). HIPAA obligations, Business Associate Agreements, and minimum-necessary access determine which model endpoints are eligible to process diffs containing PHI. PCI-DSS controls dictate cardholder-data redaction before model invocation. SOC 2 controls dictate operational guarantees on the reviewer service itself. Bolting any of this on after the fact produces gaps that surface during the first audit, not during development.
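
One way to picture compliance as a routing input is an eligibility filter that runs before any model is chosen. This is a sketch only: the `Endpoint` type, the data-class labels, and the idea that each endpoint declares what it is cleared for are all hypothetical, and the mapping from real agreements (DPA, BAA) to those labels is a legal question this code does not answer.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Endpoint:
    name: str                    # hypothetical endpoint name
    cleared_for: FrozenSet[str]  # data classes this endpoint may process


def eligible_endpoints(
    data_classes: Iterable[str], endpoints: List[Endpoint]
) -> List[Endpoint]:
    """Keep only endpoints cleared for every data class detected in the diff."""
    required = frozenset(data_classes)
    return [e for e in endpoints if required <= e.cleared_for]
```

Cost and accuracy routing would then choose among the survivors only. An empty result means no model call at all, and the change goes to human review instead.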

Closing

Agentic code review is a coordination system with a language model embedded in it, not a language model with tools attached. The hard problems are not in the model; they are in the routing, the grounding, the evaluation, and the feedback loops that decide what the system does next time. Teams that treat the model as the product underinvest in everything that actually determines whether the product stays in use.

Originally published at prakharsingh.github.io/notes/agentic-code-review on 12 May 2026. I'm Prakhar Singh, Founding Engineer at Devzy AI, building an agentic AI system for automated code review across CLI, PR, and IDE surfaces.

Top comments (5)

Theo Valmis • May 13

The false positive problem is underappreciated in most writeups on this. A system that flags 30% of PRs incorrectly doesn't get ignored -- it gets turned off. The reliability bar for staying in the loop is higher than the accuracy bar for being useful in isolation.

The routing strategies you describe are solid. The fallback chain is more robust than task-classification but the cost doubling on escalation is real -- especially in orgs where the cheap model handles 70% of volume but the hard 30% is exactly where confident-sounding nonsense is most dangerous. Have you found that self-consistency sampling is a better escalation signal than classifier confidence on its own?

Rasmus Ros • May 13

Precision usually matters more than recall here because review is an interruption channel. One bogus finding per PR is enough to train people to auto-dismiss, while one missed issue usually just falls back to human review or CI.

Prakhar Singh • May 13 • Edited

"Review is an interruption channel" is the right framing, and it explains why the precision/recall tradeoff lands differently here than in search or classification. A missed issue still gets caught by human review or CI: the cost is latency, not absence. But a bogus finding burns attention directly. Every false positive costs the developer 30-60 seconds to read, contextualize, and dismiss, on top of the trust erosion you mentioned. The trust cost compounds because it's per-person and per-sprint. Once a developer starts dismissing the bot, it doesn't matter that other developers still read it; the signal is already lost for that slice of the team.

The practical takeaway: track per-developer dismiss rate, not just aggregate precision. A reviewer with 95% precision but uneven dismissal patterns (everyone on team A reads it, team B ignores it) has a narrower effective reach than the aggregate suggests.

Andrii Krugliak • May 18

The calibration point is where I keep getting burned. Fallback chains feel safer until you measure cost - on the long tail the cheap-first model fails plausibly rather than loudly, so the verifier never fires and you ship a wrong review without escalating. The fix for me was running the cheap model and the cheap verifier on different decompositions of the same diff, so disagreement actually surfaces. Eval sets rot too - false-positive rate on our 6-month set is 3x what it was on day 1.

Harjot Singh • May 30

"The cost of being wrong" is the right axis to design an agentic reviewer around, and it's the part most demos skip. A review agent that's confidently wrong is worse than no agent, because it manufactures false confidence - a green check nobody double-checks. So the eval harness matters more than the model: you need to measure false-approve and false-reject rates, not just "does it leave plausible comments."

The orchestration insight that follows: route by stakes. Low-risk diffs (formatting, docs, mechanical refactors) can get a cheap fast reviewer with light human spot-checks; high-blast-radius changes (auth, money, migrations) demand the strong model AND a human gate. Spending equal review rigor on every PR is how you either burn budget or miss the dangerous one. Excellent production-grade framing - the evaluation section is the part people most need to copy.