The dirty secret of AI code review is that there is no single "best" model. There are only models that happen to be good at the specific thing you're asking them to do right now.
I learned this the hard way while building 2ndOpinion, an AI code review tool where Claude, Codex, and Gemini cross-check each other's work over MCP. The first version hard-coded one model for every review. The reviews were fine for JavaScript. They were embarrassing for Rust. They were weirdly confident and wrong for SQL.
So we stopped picking a single model and started routing. This post is about how that routing layer actually works — what signals we collect, how the scoring math plays out, and what surprised us when we shipped it.
The problem: model strength is language-specific
When we started tracking per-language accuracy, a clear pattern emerged. If you take the same corpus of reviewed pull requests and score each model on catching real bugs versus hallucinating issues that don't exist, you don't get a uniform leaderboard. You get something that looks more like rock-paper-scissors.
One model might be excellent at spotting async/await footguns in TypeScript but completely miss lifetime issues in Rust. Another might be phenomenal at Python decorator patterns and hopeless at Go's error handling conventions. A third might crush Terraform drift detection but flag perfectly valid Kubernetes manifests as "probably wrong."
This isn't a flaw in any particular model. It's a consequence of training data distribution, RLHF feedback, and the fact that "code" is actually hundreds of very different specialties wearing the same trench coat. Treating "AI code review" as one problem and picking one winner leaves performance on the table for every language that isn't the winner's strong suit.
What --llm auto actually does
When you run a review with no model flag, the CLI calls the auto router:
```shell
npm i -g 2ndopinion-cli
2ndopinion review src/auth.ts
```
Under the hood, --llm auto is the default. It takes three inputs — language, change type, and file size — and picks a model. Here's the Python SDK equivalent, which exposes the same router via a keyword argument:
```python
from secondopinion import Client

client = Client()
result = client.opinion(
    code=open("src/auth.ts").read(),
    language="typescript",
    llm="auto",           # route based on accuracy data
    change_type="bugfix"  # optional hint
)
print(result.findings)
print(result.review_metadata.model_used)  # which model actually ran
```
The review_metadata object is the important part for debugging. Every response tells you which model was picked and why, along with token counts and duration. If you want reproducibility, pin the model explicitly; if you want the best review for this specific request, let the router decide.
The signals that feed the router
There are four signals we weight, in roughly this order:
Language. Detected from file extension, shebang, or an explicit language= argument in the SDK. This is the dominant signal because accuracy variance between models on a given language is much larger than variance on other dimensions.
Change type. A new-feature diff has different review priorities than a bugfix or a refactor. Security-sensitive file paths (auth/, crypto/, anything matching a configurable allowlist) bump a security-audit weight into the decision.
File size and diff size. Very large files hit models with bigger effective context windows. Small targeted diffs can go to faster models without losing accuracy — no point paying for a heavyweight review of a three-line typo fix.
Pattern memory. If we've seen a similar bug pattern in this repo before, we bias toward the model that caught the original. This is a small effect per review, but over a project's lifetime it adds up, because teams tend to re-introduce the same class of bug in different forms.
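To make the shape of those signals concrete, here's a minimal sketch of signal extraction. The helper names, the extension map, and the path allowlist are all illustrative assumptions, not the production heuristics, which handle far more languages and detection fallbacks.

```python
import os

# Hypothetical sketch of signal extraction; the real router's heuristics are richer.
EXT_TO_LANG = {".ts": "typescript", ".py": "python", ".rs": "rust", ".go": "go"}
SECURITY_PATHS = ("auth/", "crypto/")  # stands in for the configurable allowlist

def extract_signals(path: str, source: str, change_type: str = "unknown") -> dict:
    """Collect the routing signals for one file."""
    ext = os.path.splitext(path)[1]
    language = EXT_TO_LANG.get(ext)
    if language is None and source.startswith("#!"):
        # Fall back to the shebang line for extensionless scripts
        language = "python" if "python" in source.splitlines()[0] else "shell"
    return {
        "language": language,
        "change_type": change_type,
        "size_lines": source.count("\n") + 1,
        "security_sensitive": any(seg in path for seg in SECURITY_PATHS),
    }

print(extract_signals("src/auth/session.ts", "export function getSession() {}", "bugfix"))
```

Pattern memory is the one signal this sketch omits, because it needs per-repo history rather than the file in front of you.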
The scoring itself is embarrassingly simple. For each candidate model we compute a weighted sum from the accuracy table and pick the highest. It's not a neural net. It's not an LLM picking another LLM. It's a lookup table and a weighted argmax. We tried fancier approaches and they kept losing to the lookup table, which turns out to be the honest answer in most ML-system stories.
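A weighted argmax over a lookup table fits in a few lines. The accuracy numbers and security bonuses below are made up for illustration; only the shape of the computation reflects what the router does.

```python
# Sketch of the lookup-table router. All numbers are illustrative, not the
# production accuracy table.
ACCURACY = {  # model -> language -> offline-eval accuracy
    "claude": {"typescript": 0.82, "rust": 0.71},
    "codex":  {"typescript": 0.78, "rust": 0.80},
    "gemini": {"typescript": 0.75, "rust": 0.74},
}
SECURITY_BONUS = {"claude": 0.05, "codex": 0.02, "gemini": 0.03}

def route(language: str, security_sensitive: bool = False) -> str:
    """Weighted argmax over the accuracy table — no neural net involved."""
    def score(model: str) -> float:
        s = ACCURACY[model].get(language, 0.5)  # neutral prior for unseen languages
        if security_sensitive:
            s += SECURITY_BONUS[model]
        return s
    return max(ACCURACY, key=score)

print(route("typescript", security_sensitive=True))  # "claude" under these numbers
print(route("rust"))                                 # "codex" under these numbers
```

The only subtlety is the neutral prior for languages with no eval data, which keeps a missing table entry from silently disqualifying a model.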
Where the accuracy data comes from
A router is only as good as the data behind it. Ours comes from three places.
First, offline evals. We maintain a set of benchmark repos per language with known bugs — either ones we inject, or historical CVEs replayed on the vulnerable commit. Every model gets scored on "did you catch this specific bug" and "did you flag something that wasn't actually a problem."
Second, production telemetry. When a user accepts or rejects a finding via 2ndopinion fix or the GitHub PR agent, that's a signal. Rejected findings that were later confirmed as real bugs (via a follow-up commit or a revert) are gold. We only aggregate feedback, never store code — that's a hard constraint baked into the pipeline.
Third, consensus disagreements. When you run a consensus review, three models vote. Disagreements are interesting because they surface cases where one model sees a bug the others miss. Over time, the model that's consistently right on disagreements gets weighted higher for that language.
```shell
# Three-model consensus review — the source of a lot of our training signal
2ndopinion review src/auth.ts --consensus
```
Three credits, one command. The confidence-weighted aggregator takes the three reviews, collapses duplicate findings, and ranks by agreement. High-agreement findings surface first; disagreements get flagged explicitly so a human can adjudicate.
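Here's a rough sketch of that aggregation step. The fingerprint (file, line, category) used to collapse duplicates is an assumption for illustration; the real deduplication is fuzzier than an exact-key match.

```python
from collections import defaultdict

# Illustrative confidence-weighted aggregation: findings sharing a fingerprint
# are collapsed, then ranked by agreement and average confidence.
def aggregate(reviews: list[list[dict]]) -> list[dict]:
    buckets = defaultdict(list)
    for model_findings in reviews:
        for f in model_findings:
            buckets[(f["file"], f["line"], f["category"])].append(f)
    merged = []
    for (path, line, category), group in buckets.items():
        merged.append({
            "file": path, "line": line, "category": category,
            "votes": len(group),                    # how many models agree
            "confidence": sum(f["confidence"] for f in group) / len(group),
            "disputed": len(group) < len(reviews),  # flag disagreements for a human
        })
    # High-agreement, high-confidence findings surface first
    merged.sort(key=lambda f: (f["votes"], f["confidence"]), reverse=True)
    return merged
```

Anything marked `disputed` is exactly the labeled-disagreement signal described above: one model saw something the others didn't, and a human verdict on it updates the accuracy table.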
A concrete example: routing a TypeScript auth change
Say you run this:
```shell
2ndopinion review src/auth/session.ts
```
The router sees:
- Language: TypeScript (file extension and tsconfig detected)
- Change type: bugfix (detected from git diff — a returned value was modified, no new exports)
- File size: 240 lines
- Path signal: auth/ → security-sensitive bump
- Pattern memory: this repo had a session-fixation issue three months ago
The router weights the security-sensitive bump and biases toward whichever model has the strongest track record on auth/session TypeScript bugs in our accuracy table. It runs that single model at three-credits-equivalent depth, returns a review, and the review_metadata field on the response tells you exactly which model was chosen so you can audit the decision.
If any of those signals flip — different language, a new-feature diff, no security-sensitive path — you'd get a different model. That's the whole point.
What surprised us
Two things.
First, the router made the marginal model matter. We used to think of models as tiered — a "best" one, a "good enough" one, a "cheap one for trivial stuff." Once we started routing on language-specific accuracy, the hierarchy collapsed. Models we'd written off as second-tier turned out to dominate specific slices. There is no tier list. There are just specialties.
Second, the router made consensus more valuable, not less. You'd think smart routing would make consensus redundant — why run three models if one is already the best? In practice, consensus is where the router learns. Every disagreement is a labeled data point about where the router's current guess is wrong. We run consensus on a sampled slice of reviews partly to keep the accuracy table fresh.
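One way to picture that feedback loop is a small moving-average update per resolved disagreement. The decay rate and table shape here are assumptions, not the production pipeline, but they capture the "small effect per review that adds up" dynamic.

```python
# Sketch of how a resolved disagreement could feed back into the accuracy table.
ALPHA = 0.05  # small per-review effect; it compounds over a project's lifetime

def record_outcome(table: dict, model: str, language: str, was_right: bool) -> None:
    """Nudge a model's per-language accuracy toward the observed outcome."""
    prev = table.setdefault(model, {}).setdefault(language, 0.5)  # neutral prior
    table[model][language] = (1 - ALPHA) * prev + ALPHA * (1.0 if was_right else 0.0)
```

The small `ALPHA` matters: a single review should never flip a routing decision, but a consistent pattern of being right on disagreements should.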
The takeaway
If you're building anything on top of LLMs, the lesson generalizes past code review: "which model is best" is the wrong question. The right question is "which model is best for this specific request, given what I know about it." Build a router, not a leaderboard.
If you want to see smart model routing in action, the fastest way is to install the CLI and point it at one of your own files:
```shell
# Install the CLI and run a review
npm i -g 2ndopinion-cli
2ndopinion review src/your-file.ts
```
Or try the playground at get2ndopinion.dev and watch the model_used field on the response. You can force a specific model with --llm claude, --llm codex, or --llm gemini to see how the same code gets reviewed differently — which is the fastest way to internalize why routing matters in the first place.
If you've built a routing layer for a different ML-backed product, I'd love to hear what signals ended up mattering most. Drop a comment — I'm especially curious about people who tried fancier approaches before collapsing back to a lookup table.