I let Claude and Codex argue about my code for a week. Here's what they caught.

#ai #codereview #llm #devtools

single-model code review has a structural blind spot, and it took me an embarrassingly long time to name it: the model that reviews your diff is the same kind of model that would have written the diff. it shares the failure modes. ask one LLM to find the bug it didn't notice the first time and you often get a confident "looks good to me" — the same confidence that produced the bug.

so for a week i ran every diff in a side project through two models instead of one — Claude and Codex (GPT) — and made them review independently, then compared where they disagreed. the disagreements turned out to be the entire point. here's what actually got caught, why a second opinion works, and how to wire it up yourself.

why two models beat one

the intuition people reach for is "two reviewers catch more than one." true but boring. the sharper version is about correlation of errors.

if a single reviewer misses a class of bug — say, async race conditions — it misses that class every time. running it twice doesn't help; the second pass has the same blind spot. you need a reviewer whose errors are uncorrelated with the first.

different model families fail differently. Claude and GPT were trained on different data mixes, with different RLHF, and they've internalized different "smells." where their judgments diverge is exactly where the uncertainty lives. a bug that one flags and the other misses is a bug worth a human's two minutes. a line both flag is almost certainly real. a line neither flags... well, that's the residual risk you can't escape, but it's smaller than what one model leaves behind.

concretely, over the week the overlap looked roughly like this: about 60% of real issues were caught by both models, 30% by exactly one, and the remaining 10% by neither, so they needed me. that 30% is the whole argument. every one of those is a bug a single-model setup would have shipped if you happened to pick the other model.
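to put a number on the correlation argument, here's a back-of-envelope sketch. the 30% miss rate and the correlation values are made-up inputs for illustration, not measurements from my week. the formula is the standard joint probability of two correlated yes/no events.

```python
from math import sqrt


def joint_miss_rate(miss_a: float, miss_b: float, correlation: float) -> float:
    """Probability that both reviewers miss the same bug.

    miss_a, miss_b: each reviewer's individual miss rate (hypothetical inputs).
    correlation: Pearson correlation between the two "missed it" indicators.
    Not every correlation is achievable when the two miss rates differ.
    """
    independent = miss_a * miss_b
    spread = sqrt(miss_a * (1 - miss_a) * miss_b * (1 - miss_b))
    return independent + correlation * spread


# same model run twice: errors fully correlated, the second pass buys nothing
same_model_twice = joint_miss_rate(0.3, 0.3, correlation=1.0)

# two reviewers with fully independent errors (an idealized best case)
two_families = joint_miss_rate(0.3, 0.3, correlation=0.0)

print(round(same_model_twice, 2))  # 0.3
print(round(two_families, 2))      # 0.09
```

real model pairs land somewhere between those two extremes, which is why the choice of second reviewer matters more than the number of passes.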

the bug each model only half-caught

here's a real one, simplified. a token-bucket rate limiter in TypeScript:

class RateLimiter {
  private tokens: number;
  private lastRefill: number;

  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  tryConsume(cost = 1): boolean {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsed * this.refillPerSec
    );
    this.lastRefill = now;

    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}

Claude flagged the obvious-in-hindsight problem: as written, tryConsume is synchronous, so calls on a single event loop can't interleave and this is fine. but the moment the read and the write end up on opposite sides of an await boundary (say, the token count moves into a shared store), two callers can both read this.tokens before either writes it back, and you over-admit. it suggested keeping the refill-and-consume step atomic.

Codex flagged something else entirely: Date.now() is wall-clock time, so an NTP adjustment or a manual clock change can move it backwards and make elapsed negative, which silently removes tokens and stalls the limiter. it suggested performance.now() or another monotonic source, plus a Math.max(0, elapsed) clamp.

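neither model named both issues on its own. read independently, you'd have patched one and shipped the other. read together, the version that came out handled both: a monotonic clock with a clamp, and a refill-and-consume step that stays synchronous, with no await between the read and the write: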

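tryConsume(cost = 1): boolean {
  // performance.now() is monotonic, so clock adjustments can't move it backwards.
  // the constructor has to seed lastRefill from performance.now() too, not Date.now().
  const now = performance.now();
  // defensive clamp: elapsed can never go negative and drain tokens
  const elapsed = Math.max(0, (now - this.lastRefill) / 1000);
  this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
  this.lastRefill = now;

  // no await between the read and the write, so refill-and-consume stays atomic on one event loop
  if (this.tokens < cost) return false;
  this.tokens -= cost;
  return true;
}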

that's the pattern that repeated all week. each model is a strong reviewer with a specific squint. the value isn't in either review — it's in the diff between them.

the failure mode of consensus

here's the honest caveat, because two models also fail in a way one doesn't: they can agree confidently and both be wrong. this happened with a date-parsing helper that both models pronounced clean. both missed that it assumed MM/DD/YYYY and would silently misread DD/MM input. agreement is not proof. it's a strong prior, not a verdict.
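the helper itself isn't shown here, so this is only a sketch of the bug class, not the actual code. the input string and the is_ambiguous name are hypothetical. the point is that the same string parses cleanly under both conventions, so nothing throws and nothing looks wrong in review.

```python
from datetime import datetime

raw = "03/04/2024"  # hypothetical input: 3 April to someone writing DD/MM

as_mdy = datetime.strptime(raw, "%m/%d/%Y").date()
as_dmy = datetime.strptime(raw, "%d/%m/%Y").date()

print(as_mdy)  # 2024-03-04
print(as_dmy)  # 2024-04-03


def is_ambiguous(raw: str) -> bool:
    """True when a slash-separated date is valid, and different, under both conventions."""
    first, second, _year = (int(part) for part in raw.split("/"))
    return first <= 12 and second <= 12 and first != second


print(is_ambiguous("03/04/2024"))  # True
print(is_ambiguous("13/04/2024"))  # False: 13 can only be a day
```

a check like this doesn't resolve the ambiguity, it only makes it loud, which is more than either model did.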

which is why raw "both said yes" isn't enough. you want to know how reliable each model is on this specific kind of thing. Claude might be stronger on concurrency in TypeScript; Codex might be stronger on numeric edge cases in Python. weighting each model's vote by its track record per language and issue type beats a flat majority. a flat vote treats every reviewer as equally credible on every question, which is obviously false the second you watch them work.
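a minimal sketch of what weighting by track record could look like. everything here is an assumption for illustration: the TRACK_RECORD numbers are invented, the weighting rule (share of total weight held by the models that flagged the issue) is just the simplest one that works, and weighted_confidence is a hypothetical name.

```python
# Hypothetical track record: (model, language, category) -> measured accuracy.
# These values are invented for the example, not real measurements.
TRACK_RECORD = {
    ("claude", "typescript", "concurrency"): 0.9,
    ("gpt", "typescript", "concurrency"): 0.6,
    ("claude", "python", "numeric"): 0.6,
    ("gpt", "python", "numeric"): 0.9,
}
DEFAULT_WEIGHT = 0.5  # used when there is no history for this combination


def weighted_confidence(votes: dict[str, bool], language: str, category: str) -> float:
    """Share of total reviewer weight held by the models that flagged the issue."""
    total = 0.0
    flagged = 0.0
    for model, flagged_it in votes.items():
        weight = TRACK_RECORD.get((model, language, category), DEFAULT_WEIGHT)
        total += weight
        if flagged_it:
            flagged += weight
    return flagged / total if total else 0.0


# One model flags a TypeScript concurrency issue, the other does not.
only_claude = weighted_confidence({"claude": True, "gpt": False}, "typescript", "concurrency")
only_gpt = weighted_confidence({"claude": False, "gpt": True}, "typescript", "concurrency")

print(round(only_claude, 2))  # 0.6
print(round(only_gpt, 2))     # 0.4
```

a flat vote would score both cases 0.5. the weighted version says a solo flag from the reviewer with the better record in that category deserves more of your attention.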

wiring it up yourself

the minimal version is a script: send the diff to two providers with an identical rubric, parse structured findings, and surface the union with agreement annotated. the prompt matters — ask for a severity and a category on each finding so you can diff them programmatically:

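RUBRIC = """Review this diff. Return a JSON list with one object per issue:
{"line": int, "severity": "high|med|low",
 "category": "concurrency|numeric|security|logic|style",
 "issue": str}. Only real issues. Empty list if none."""

# review() stands in for your own provider call: send RUBRIC plus the diff, parse the JSON reply
findings_a = review(diff, rubric=RUBRIC, model="claude")
findings_b = review(diff, rubric=RUBRIC, model="gpt")

# intersect() and symmetric_diff() are helpers you write yourself, matching findings on line + category
both = intersect(findings_a, findings_b)       # flagged by both: high-confidence
solo = symmetric_diff(findings_a, findings_b)  # flagged by one: worth a human glance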
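the snippet above leaves intersect and symmetric_diff undefined. here's one way they could look, assuming two findings count as "the same" when they share a line number and a category. that matching rule is my assumption, and it's deliberately crude: models often disagree by a line or two, so you may want a tolerance window. the sample findings and line numbers are hypothetical.

```python
from typing import TypedDict


class Finding(TypedDict):
    line: int
    severity: str
    category: str
    issue: str


def finding_key(finding: Finding) -> tuple[int, str]:
    """Identity of a finding for matching purposes: same line, same category."""
    return (finding["line"], finding["category"])


def intersect(a: list[Finding], b: list[Finding]) -> list[Finding]:
    """Findings from a that b also reported (keeps a's wording)."""
    keys_b = {finding_key(f) for f in b}
    return [f for f in a if finding_key(f) in keys_b]


def symmetric_diff(a: list[Finding], b: list[Finding]) -> list[Finding]:
    """Findings reported by exactly one of the two reviewers."""
    keys_a = {finding_key(f) for f in a}
    keys_b = {finding_key(f) for f in b}
    only_a = [f for f in a if finding_key(f) not in keys_b]
    only_b = [f for f in b if finding_key(f) not in keys_a]
    return only_a + only_b


# Hypothetical findings, shaped like the rubric's JSON.
findings_a: list[Finding] = [
    {"line": 35, "severity": "high", "category": "concurrency", "issue": "over-admits across await"},
    {"line": 12, "severity": "low", "category": "style", "issue": "unused import"},
]
findings_b: list[Finding] = [
    {"line": 33, "severity": "high", "category": "numeric", "issue": "Date.now() can go backwards"},
    {"line": 12, "severity": "low", "category": "style", "issue": "import never used"},
]

both = intersect(findings_a, findings_b)
solo = symmetric_diff(findings_a, findings_b)

print([f["category"] for f in both])  # ['style']
print([f["category"] for f in solo])  # ['concurrency', 'numeric']
```

note that the two high-severity findings land in solo, not both. that's the rate limiter story in miniature: the interesting bugs are the ones only one reviewer saw.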

that's genuinely most of the value, and you should build it before you buy anything — understanding the mechanism is worth the afternoon.

where this goes

once i had the two-model loop running locally i didn't want to maintain prompt rubrics, JSON parsing, and per-model reliability tracking by hand. that bookkeeping — which model to trust on which language and issue type — is the part that actually compounds over time, and it's tedious to do well.

that's the itch 2ndOpinion scratches. it runs Claude, Codex, and Gemini in parallel and returns a calibrated, weighted consensus, where each model's vote is weighted by its measured accuracy per language and per issue type rather than a flat majority. it also keeps a pattern memory, so a bug shape it's seen before gets flagged instantly instead of re-litigated. you can run it as an MCP server, a REST API, a CLI, or a GitHub PR agent.

the fastest way to feel the difference is the CLI on a real diff:

$ npx 2ndopinion-cli review --staged

  scanning 3 files · 2 models · weighted consensus

  src/rate-limiter.ts
    ⚠ high   concurrency   over-admits across await boundary   (consensus 0.91)
    ⚠ high   numeric       Date.now() can go backwards         (consensus 0.88)
  src/parse-date.ts
    ⚠ med    logic         ambiguous MM/DD vs DD/MM parsing    (consensus 0.64)

  3 issues · 2 high · 1 med · 1.8s

the week's takeaway is simple enough to act on without any tool: stop asking one model to check its own kind of work. add a reviewer that fails differently. if you want the calibrated, weighted version of that without maintaining the plumbing, get2ndopinion.dev has a $5 starter pack (100 credits) and a 7-day Pro trial — or just run npx 2ndopinion-cli on a staged diff and watch where the two models disagree.