The math of multi-model consensus: when 3 cheap reviews beat 1 expensive one

#ai #codereview #llm #python

there's a reflex in AI tooling that says: when in doubt, reach for the biggest model. bigger model, better review, fewer escaped bugs. it feels obviously true. but if you actually write down the probabilities, the reflex falls apart for a large class of problems. three smaller, cheaper reviews — read together correctly — can beat one expensive one, and not by a little.

this isn't a vibes argument. it's the same math that makes a mirrored RAID array more reliable than a single expensive disk, and that lets ensemble classifiers beat single models in practice. let me show the numbers, then the catch, then how to actually wire it up.

the single-reviewer ceiling

say your best, most expensive model catches a real bug 80% of the time on a given diff. that's genuinely good. it also means it misses one in five. run it again on the same diff and you don't get to 96% — you get back to 80%, because the second pass has the same blind spots as the first. a model's errors aren't random noise you can average away by re-rolling. they're systematic. the bug it can't see, it can't see twice.

so the ceiling on a single reviewer isn't set by how many times you ask. it's set by the model's correlation with itself, which is close to 1 (sampling temperature buys a little variety, not new knowledge). you are stuck at roughly 80%.
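a quick way to convince yourself: the sketch below simulates the idealized case, where a re-run of the same model has exactly the same blind spots (a correlation of exactly 1, which is an assumption), and compares it with a hypothetical second reviewer whose misses are independent. `simulate_second_pass` and its parameters are made up for illustration.

```python
import random

def simulate_second_pass(recall=0.80, n_bugs=100_000, seed=0):
    """Compare re-running one model against adding an independent reviewer."""
    rng = random.Random(seed)
    same_model_caught = 0
    independent_caught = 0
    for _ in range(n_bugs):
        first = rng.random() < recall
        # assumption: the same model re-run has identical blind spots
        rerun = first
        # hypothetical second reviewer with the same recall, independent misses
        other = rng.random() < recall
        same_model_caught += first or rerun
        independent_caught += first or other
    return same_model_caught / n_bugs, independent_caught / n_bugs

same, indep = simulate_second_pass()
print(f"same model twice: {same:.1%}")
print(f"independent pair: {indep:.1%}")
```

the first number should hover around 80% and the second around 96%. the gap comes entirely from whether the second look fails on the same bugs as the first.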

why independent errors change everything

now suppose instead of one model at 80% you take three models that each catch only 70% — individually worse — but whose mistakes are uncorrelated. a bug one misses, another tends to catch, because they were trained on different data with different objectives and have learned different "smells."

the probability that all three miss the same real bug, if their misses were fully independent, is:

0.30 × 0.30 × 0.30 = 0.027

that's a 97.3% catch rate from three reviewers that are each individually worse than your expensive one. the expensive single model sat at 80%. the trio of cheaper models lands near 97% — purely because their failures don't line up.

real models aren't fully independent, so you never get the textbook number. but the advantage survives a fair amount of correlation before it disappears. here's the same calculation generalized, with a correlation knob (a simple linear blend between the fully independent and fully correlated cases, not a formal correlation coefficient) so you can see how the advantage decays as the models start failing alike:

def union_catch_rate(per_model_recall, n_models, corr=0.0):
    """Approximate catch rate for n independent-ish reviewers.
    corr=0 -> fully independent, corr=1 -> fully correlated (no gain)."""
    miss = 1 - per_model_recall
    independent_all_miss = miss ** n_models
    correlated_all_miss = miss                 # behaves like a single model
    blended = (1 - corr) * independent_all_miss + corr * correlated_all_miss
    return 1 - blended

for c in (0.0, 0.25, 0.5, 0.75):
    rate = union_catch_rate(0.70, 3, corr=c)
    print(f"corr={c:>4}: 3x70% reviewers -> {rate:.1%}")

corr= 0.0: 3x70% reviewers -> 97.3%
corr=0.25: 3x70% reviewers -> 90.5%
corr= 0.5: 3x70% reviewers -> 83.7%
corr=0.75: 3x70% reviewers -> 76.8%

the lesson lives in that table. when your reviewers fail independently (top row) three cheap ones crush one expensive one. when they mostly fail alike (bottom row) the trio lands at 76.8%, below the single 80% model: you've paid three times for less than one good opinion. the entire game is decorrelating your reviewers, which in practice means using different model families, not the same model three times or three models from the same lab fine-tuned off the same base.
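the table implies a crossover somewhere between corr=0.5 and corr=0.75. under the same linear blend used in union_catch_rate (a simplification, so treat the result as a rough guide), you can solve for the correlation at which the trio merely ties the expensive model. `break_even_corr` is a hypothetical helper, not part of any library:

```python
def break_even_corr(cheap_recall, n_models, expensive_recall):
    """Correlation at which n cheap reviewers tie one expensive reviewer,
    under the linear blend: miss = (1 - c) * miss**n + c * miss."""
    miss = 1 - cheap_recall
    independent_all_miss = miss ** n_models
    target_miss = 1 - expensive_recall
    # solve (1 - c) * independent_all_miss + c * miss == target_miss for c
    return (target_miss - independent_all_miss) / (miss - independent_all_miss)

c = break_even_corr(0.70, 3, 0.80)
print(f"3x70% ties 1x80% at corr = {c:.2f}")   # prints 0.63
```

below that value the three cheap reviewers win; above it the single 80% model does. a result above 1 would mean the cheap reviewers win at any correlation, and a negative one that they can never catch up.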

the catch: cost, latency, and false positives

three reviews aren't free, and the naive pitch skips three things you have to account for: money, latency, and false positives. only the last one really bites.

the first is money, but it cuts the surprising way. three small-model calls are often cheaper than one frontier call, not more expensive: a mid-tier model typically runs a fraction of the per-token price of a flagship, and the trio wins on price whenever that fraction is under a third. so "3 cheap beats 1 expensive" is often literally cheaper, not a quality-for-cost trade.

the second is latency, and here you win for free: the three reviews are embarrassingly parallel. fire them concurrently and wall-clock time is the slowest of the three, not the sum — roughly the latency of one call.
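a minimal sketch of that fan-out, using asyncio with sleeps standing in for real API calls. `fake_review`, `review_concurrently`, and the delays are all hypothetical; a real version would swap the sleep for each provider's async client.

```python
import asyncio
import time

async def fake_review(model, diff, delay):
    """Stand-in for one model's review call; delay simulates network latency."""
    await asyncio.sleep(delay)
    return {"model": model, "findings": []}

async def review_concurrently(diff, delays):
    """Start every review at once and wait for all of them to finish."""
    tasks = [fake_review(model, diff, delay) for model, delay in delays.items()]
    return await asyncio.gather(*tasks)

start = time.perf_counter()
results = asyncio.run(review_concurrently("diff text", {"a": 0.1, "b": 0.2, "c": 0.3}))
elapsed = time.perf_counter() - start
print(f"{len(results)} reviews in {elapsed:.2f}s")
```

the elapsed time should sit near the slowest delay (0.3s here) rather than the 0.6s sum.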

the third is the real cost, and it's false positives. three reviewers flag more total stuff, and some of it is noise. union everything blindly and you bury the developer in low-value nits. this is where flat counting fails: "two of three flagged it" treats every model as equally credible on every question, which is plainly false. a model strong on Python concurrency may be weak on SQL injection. the fix is to weight each model's vote by its measured accuracy on this specific language and issue type, and to separate the high-agreement findings from the worth-a-glance solos.

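MODELS = ["a", "b", "c"]
findings = run_parallel(diff, models=MODELS)   # concurrent

# weight each vote by that model's track record on (language, category),
# then normalize by the total weight so the score stays between 0 and 1
def weighted_consensus(finding):
    weights = {m: model_accuracy[m][finding.lang][finding.category]
               for m in MODELS}
    return sum(weights[m] for m in finding.flagged_by) / sum(weights.values())

# score each finding once, then split on the threshold
scored = [(f, weighted_consensus(f)) for f in findings]
high_conf = [f for f, score in scored if score >= THRESHOLD]
worth_a_glance = [f for f, score in scored if score < THRESHOLD]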

that's the difference between an ensemble that helps and one that just yells louder.
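to make the weighting concrete, here is a self-contained toy version. everything in it is assumed for illustration: the `Finding` shape, the accuracy numbers, the 0.6 threshold, and the choice to normalize by the total weight of all models so scores land between 0 and 1.

```python
from dataclasses import dataclass

# hypothetical track record: model -> language -> category -> measured accuracy
MODEL_ACCURACY = {
    "a": {"python": {"security": 0.9, "style": 0.4}},
    "b": {"python": {"security": 0.8, "style": 0.7}},
    "c": {"python": {"security": 0.6, "style": 0.7}},
}
THRESHOLD = 0.6  # assumed cutoff between high-confidence and worth-a-glance

@dataclass(frozen=True)
class Finding:
    summary: str
    lang: str
    category: str
    flagged_by: tuple  # names of the models that raised this finding

def weighted_consensus(finding):
    """Share of total model weight (for this language and category) that flagged it."""
    weights = {m: table[finding.lang][finding.category]
               for m, table in MODEL_ACCURACY.items()}
    return sum(weights[m] for m in finding.flagged_by) / sum(weights.values())

findings = [
    Finding("JWT verified without exp check", "python", "security", ("a", "b")),
    Finding("unused import", "python", "style", ("a",)),
]
for f in findings:
    score = weighted_consensus(f)
    bucket = "high_conf" if score >= THRESHOLD else "worth_a_glance"
    print(f"{bucket:>14}  {score:.2f}  {f.summary}")
```

with these made-up numbers the security finding scores 0.74 and the lone style flag scores 0.22, so a solo vote from a model that is weak in that category drops below the threshold on its own.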

what it looks like in practice

once you weight the votes and parallelize the calls, a three-model review of a staged diff runs in about the time of a single call and reads like this:

$ npx 2ndopinion-cli review --staged

  scanning 4 files · 3 models · weighted consensus · parallel

  src/auth/session.py
    ⚠ high   security      JWT verified without exp check        (consensus 0.94)
    ⚠ med    concurrency   token refresh races on shared dict    (consensus 0.71)
  src/billing/charge.py
    ⚠ high   numeric       float math on currency amounts        (consensus 0.89)
  src/api/routes.py
    · low    style         unused import (1 model, low weight)   (consensus 0.22)

  4 findings · 2 high · 1 med · 1 muted · 1.9s

note the bottom finding: one model flagged it, the weighting knew that model is weak on style calls, so it got muted rather than shoved in your face. that's the math doing triage.

where this goes

you can build the core of this yourself in an afternoon, and you probably should — running three model families against one rubric and unioning the findings is most of the value. the part that's tedious to maintain is the bookkeeping underneath: tracking per-model, per-language, per-issue-type accuracy over time so the weights stay calibrated, and remembering bug shapes you've already seen so they get flagged instantly instead of re-derived every run.
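for a sense of what that bookkeeping involves, here is one possible shape for the accuracy table: a running count per (model, language, category), updated whenever a human confirms or dismisses a finding. `AccuracyTracker` and the smoothing prior are assumptions for the sketch, not a description of how any particular tool does it.

```python
from collections import defaultdict

class AccuracyTracker:
    """Running accuracy per (model, language, category), with a smoothing prior
    so a model with no history starts at 0.5 instead of 0 or 1."""

    def __init__(self, prior_hits=1, prior_total=2):
        self.prior_hits = prior_hits
        self.prior_total = prior_total
        self.hits = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, model, lang, category, was_real_bug):
        """Log one reviewed finding: did the flag turn out to be a real bug?"""
        key = (model, lang, category)
        self.total[key] += 1
        self.hits[key] += int(was_real_bug)

    def accuracy(self, model, lang, category):
        key = (model, lang, category)
        hits = self.hits.get(key, 0) + self.prior_hits
        total = self.total.get(key, 0) + self.prior_total
        return hits / total

tracker = AccuracyTracker()
for outcome in (True, False, False, False):
    tracker.record("a", "python", "style", outcome)

print(f"{tracker.accuracy('a', 'python', 'style'):.2f}")     # 0.33
print(f"{tracker.accuracy('b', 'python', 'security'):.2f}")  # 0.50, no history yet
```

the tedious part is less the class than the feedback loop that feeds it: someone, or something, has to label findings as real or noise.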

that compounding bookkeeping is what 2ndOpinion handles. it runs Claude, Codex, and Gemini in parallel and returns a calibrated, weighted consensus — each model's vote scaled by its measured accuracy per language and issue type rather than a flat majority — plus a pattern memory that recognizes known bug shapes on sight. it ships as an MCP server, a REST API, a CLI, and a GitHub PR agent.

the takeaway works with or without any tool: stop equating "best review" with "biggest model." reach for reviewers that fail differently, run them in parallel, and weight what they say. if you want the calibrated version without maintaining the accuracy tables yourself, get2ndopinion.dev has a $5 starter pack (100 credits) and a 7-day Pro trial — or just run npx 2ndopinion-cli on a staged diff and watch three cheap opinions outvote one expensive guess.