When we tell people 2ndOpinion runs every pull request past Claude, Codex, and Gemini and then cross-examines the findings, the most common follow-up is: "Do they actually disagree? Or is this just three models rubber-stamping each other?"
The answer, from about six months of production review logs, is that they disagree often enough to matter. Not on everything — maybe 15% of diffs — but the disagreements cluster on exactly the kinds of bugs that hurt in production: concurrency, null handling, subtle security issues, and "this works but it's going to page you at 3am" architectural smells.
Here are four real cases, lightly anonymized, where the three models read the same code and came back with meaningfully different verdicts. If you're trying to decide whether multi-model review is worth the extra tokens, these are the kinds of arguments it's buying you.
Case 1: The async/await race that only one model saw
The diff was a webhook handler in a Node.js payments service. Roughly:
```javascript
app.post("/webhook/stripe", async (req, res) => {
  const event = verifySignature(req);
  const existing = await db.events.findOne({ id: event.id });
  if (existing) return res.status(200).send("duplicate");
  await processEvent(event);
  await db.events.insert({ id: event.id, status: "processed" });
  res.status(200).send("ok");
});
```
Codex flagged it as a textbook race condition: two copies of the same webhook arriving within milliseconds both pass the findOne check before either has written to events, both run processEvent, and you charge the customer twice. Recommended fix: a unique index on id, plus wrapping the processing in an idempotency-key pattern.
Claude said the code was fine and suggested minor cleanup — extract processEvent into a service, add structured logging.
Gemini agreed with Codex about the race but suggested a different fix — optimistic insert first, catch the unique constraint violation, return early if duplicate. Cleaner on the happy path.
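Gemini's insert-first pattern can be sketched like this. It's illustrative only: the in-memory Set stands in for a database unique index (in production the atomicity must come from the database, not application code), and eventStore, insertEventId, and handleWebhook are names invented for the sketch, not from the diff.

```javascript
// Sketch of the insert-first idempotency pattern. The Set stands in for
// a DB unique index; eventStore / insertEventId / handleWebhook are
// illustrative names, not from the reviewed code.

class DuplicateKeyError extends Error {}

const eventStore = new Set();

// Mimics an insert against a unique index on event id: it either claims
// the id or throws, with no separate check-then-act window.
function insertEventId(id) {
  if (eventStore.has(id)) throw new DuplicateKeyError(`duplicate: ${id}`);
  eventStore.add(id);
}

async function handleWebhook(event, processEvent) {
  try {
    insertEventId(event.id); // claim the id before doing any work
  } catch (e) {
    if (e instanceof DuplicateKeyError) return "duplicate";
    throw e;
  }
  await processEvent(event);
  return "ok";
}
```

Because the duplicate check and the claim are a single operation, two concurrent deliveries of the same event can't both reach processEvent — which is exactly the window the original findOne-then-insert version left open.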
The consensus step flagged the race because two of three models saw it. Without cross-checking, whichever single model you happened to be using would have told you the diff was either shippable or a bug — a coin flip on a payments handler.
The lesson isn't that Claude is worse at concurrency. Rerun this prompt on a different day and the models trade places. The lesson is that any single model has blind spots that are invisible until a different model looks at the same code.
Case 2: The "working" SQL that was quietly injectable
A new internal admin endpoint, Python, roughly:
```python
def search_users(query: str, sort: str = "created_at"):
    sql = f"SELECT * FROM users WHERE email ILIKE %s ORDER BY {sort} DESC"
    return db.execute(sql, [f"%{query}%"])
```
Gemini immediately flagged the SQL injection in the sort parameter — the %s parameterization protects query, but sort is interpolated directly into the string. An attacker who controls sort can turn this into ORDER BY (SELECT ...) DESC and exfiltrate data.
Codex flagged it too, with a suggested allowlist: if sort not in {"created_at", "email", "last_login"}: raise ValueError(...).
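Expanded into a runnable sketch, Codex's allowlist might look like the following. The column set is carried over from its suggestion; splitting the function into a query builder (so it can be exercised without a database handle) is an assumption for the sketch.

```python
# Sketch of the allowlist fix. The column names come from Codex's
# suggestion; returning (sql, params) instead of executing directly is
# a choice made here so the builder is testable without a db handle.

ALLOWED_SORT_COLUMNS = {"created_at", "email", "last_login"}

def build_search_query(query: str, sort: str = "created_at"):
    # Reject anything outside the allowlist before it touches the SQL string.
    if sort not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"unsupported sort column: {sort!r}")
    # sort is now a known-safe literal; query stays parameterized via %s.
    sql = f"SELECT * FROM users WHERE email ILIKE %s ORDER BY {sort} DESC"
    return sql, [f"%{query}%"]
```

The key property: user input never reaches the interpolated part of the string, only the parameterized part.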
Claude said the query was safe because the user parameter was parameterized — and it was technically right about query, but it missed that sort is a user-controllable input from the same request.
This is the most dangerous kind of AI review error: confidently correct about one thing, silent on a worse thing right next to it. A single-model review that happened to land on Claude that day would have said "LGTM." The second opinion is exactly what you want for security-adjacent diffs — one model being wrong is common, two models being wrong in the same direction is rare.
Case 3: The memory leak that wasn't
Sometimes consensus is wrong and the outlier is right. React component, roughly:
```javascript
useEffect(() => {
  const ws = new WebSocket(url);
  ws.onmessage = (e) => setMessages((m) => [...m, e.data]);
  return () => ws.close();
}, [url]);
```
Claude and Gemini both flagged a missing cleanup of the onmessage handler and warned about a memory leak if the component re-mounted rapidly.
Codex pushed back — ws is created inside the effect and closed in the cleanup, so once the effect is torn down nothing references the socket, and it becomes eligible for garbage collection along with the handlers attached to it. The handler doesn't need explicit removal. The two-of-three majority was wrong; the outlier was right.
This is where our cross-examination step earns its keep. Instead of defaulting to "majority wins," the consensus layer asks the dissenting model to defend its position, then asks the other two to respond. In this case Codex explained the GC behavior, Claude acknowledged the correction, and the final verdict downgraded the finding from "bug" to "stylistic nit."
If you only run majority voting, you get the wrong answer on cases like this. If you run proper cross-examination, you get the right answer and the reasoning, which is how engineers actually build trust in AI review.
Case 4: The Rust borrow checker dispute
A small but contentious one. The diff refactored a hot path:
```rust
fn process(items: Vec<Item>) -> Vec<Processed> {
    items.iter()
        .map(|i| transform(i.clone()))
        .collect()
}
```
Codex flagged the .clone() as wasteful. Since the function already owns items, it suggested switching iter() to into_iter() so each element is moved into transform instead of cloned.
Gemini agreed with the performance critique but added a nuance — if Item owns anything expensive to clone (a large buffer, a String-heavy struct), the clone is specifically what you don't want in a hot path.
Claude defended the clone. Its argument: transform takes Item by value, and if the deeper fix of changing its signature to borrow would break fifteen other callers, the one-line clone is the minimal-risk change. "Optimal" and "mergeable" are different targets.
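Concretely, the move-based version Codex and Gemini preferred might look like this sketch. Item, Processed, and transform are stand-ins here, since the real types weren't in the diff.

```rust
// Sketch of the clone-free version. Item, Processed, and transform are
// stand-ins for the real types, which weren't shown in the diff.

#[derive(Debug, PartialEq)]
struct Item(String);

#[derive(Debug, PartialEq)]
struct Processed(String);

fn transform(i: Item) -> Processed {
    Processed(i.0)
}

// into_iter() consumes the Vec, so each Item is moved into transform
// instead of cloned -- no per-element copy on the hot path.
fn process(items: Vec<Item>) -> Vec<Processed> {
    items.into_iter().map(transform).collect()
}
```

Because process already takes the Vec by value, this change is local to the function body — transform's signature and its other callers are untouched.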
None of the three models was wrong. They were optimizing for different objectives, which is a pattern we see constantly — performance versus maintainability, correctness versus velocity, local improvement versus blast-radius. Multi-model review surfaces that there is a tradeoff rather than presenting one model's preferred answer as The Answer. That's usually more useful than a confident single verdict.
What we do with the disagreements
The short version of the product: every review goes to all three models in parallel. Findings that all three agree on are high-confidence and reported first. Findings where models disagree trigger a cross-examination round where each model sees the others' output and gets a chance to revise. Anything still contested is surfaced to the human reviewer with the full argument attached, rather than hidden behind a single "LGTM."
That last part is the one most people underestimate. You don't want the AI to resolve every disagreement — some disagreements are the signal. A human reviewer who sees "Claude says ship, Gemini says block, here's why" makes a better decision than one who sees a single-model verdict in either direction.
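The triage step of that flow can be sketched in a few lines. This is purely illustrative — it is not 2ndOpinion's actual code, model calls are stubbed as plain callables returning finding identifiers, and the cross-examination round itself is omitted.

```python
# Illustrative triage sketch -- not 2ndOpinion's actual code. Each
# reviewer is a callable returning a set of finding identifiers; real
# model calls and the cross-examination round are omitted.

def triage_findings(diff, reviewers):
    findings = {name: set(review(diff)) for name, review in reviewers.items()}

    agreed = set.intersection(*findings.values())       # every reviewer concurs
    contested = set.union(*findings.values()) - agreed  # at least one dissents

    return {
        "high_confidence": sorted(agreed),     # reported first
        "cross_examine": sorted(contested),    # each model sees the others' output
    }
```

Anything still contested after cross-examination goes to the human with the full argument attached, per the flow above.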
If your team is running code review with one model and wondering what the second opinion would say, that's the whole pitch. Install the CLI with npm i -g 2ndopinion-cli, run 2ndopinion review, and see where your models actually disagree. Or wire it into Claude Code / Cursor as an MCP server — docs at get2ndopinion.dev.
We publish a weekly build-in-public update, and this post is part of it. If you have a case where two AI reviewers disagreed on your code and you're curious what a third would say, send it over — the weird diffs are the fun ones.