<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brian Mello</title>
    <description>The latest articles on DEV Community by Brian Mello (@brianmello).</description>
    <link>https://dev.to/brianmello</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3858439%2Ffbe563b7-f4da-44b2-83c2-72f137eae4ab.png</url>
      <title>DEV Community: Brian Mello</title>
      <link>https://dev.to/brianmello</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/brianmello"/>
    <language>en</language>
    <item>
      <title>When Claude, Codex, and Gemini Disagree on the Same Code</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 24 Apr 2026 17:06:36 +0000</pubDate>
      <link>https://dev.to/brianmello/when-claude-codex-and-gemini-disagree-on-the-same-code-4cnd</link>
      <guid>https://dev.to/brianmello/when-claude-codex-and-gemini-disagree-on-the-same-code-4cnd</guid>
      <description>&lt;p&gt;When we tell people &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;2ndOpinion&lt;/a&gt; runs every pull request past Claude, Codex, and Gemini and then cross-examines the findings, the most common follow-up is: "Do they actually disagree? Or is this just three models rubber-stamping each other?"&lt;/p&gt;

&lt;p&gt;The answer, from about six months of production review logs, is that they disagree often enough to matter. Not on everything — maybe 15% of diffs — but the disagreements cluster on exactly the kinds of bugs that hurt in production: concurrency, null handling, subtle security issues, and "this works but it's going to page you at 3am" architectural smells.&lt;/p&gt;

&lt;p&gt;Here are four real cases, lightly anonymized, where the three models read the same code and came back with meaningfully different verdicts. If you're trying to decide whether multi-model review is worth the extra tokens, these are the kinds of arguments it's buying you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case 1: The async/await race that only one model saw
&lt;/h2&gt;

&lt;p&gt;The diff was a webhook handler in a Node.js payments service. Roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/webhook/stripe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;verifySignature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findOne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duplicate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;processEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;processed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Codex&lt;/strong&gt; flagged it as a textbook race condition: two copies of the same webhook arriving within milliseconds both pass the &lt;code&gt;findOne&lt;/code&gt; check before either has written to &lt;code&gt;events&lt;/code&gt;, both run &lt;code&gt;processEvent&lt;/code&gt;, and you charge the customer twice. Recommended fix: a unique index on &lt;code&gt;id&lt;/code&gt; plus an idempotency-key pattern around the processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; said the code was fine and suggested minor cleanup — extract &lt;code&gt;processEvent&lt;/code&gt; into a service, add structured logging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini&lt;/strong&gt; agreed with Codex about the race but suggested a different fix — optimistic insert first, catch the unique constraint violation, return early if duplicate. Cleaner on the happy path.&lt;/p&gt;
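
&lt;p&gt;Gemini's insert-first pattern is worth sketching. Here is a minimal Python stand-in (the real handler is Node.js; the &lt;code&gt;events&lt;/code&gt; schema and &lt;code&gt;process_event&lt;/code&gt; below are hypothetical). The key move: the unique index, not a read-then-check, arbitrates the race.&lt;br&gt;
&lt;/p&gt;

```python
import sqlite3

def process_event(event_id):
    # Stand-in for the real side effect (charging the customer, etc.).
    pass

def handle_event(db, event_id):
    """Insert first; let the unique index decide who processes the event."""
    try:
        db.execute("INSERT INTO events (id, status) VALUES (?, 'processing')",
                   (event_id,))
    except sqlite3.IntegrityError:
        return "duplicate"  # a concurrent delivery already claimed this id
    process_event(event_id)  # only one request ever reaches this line
    db.execute("UPDATE events SET status = 'processed' WHERE id = ?",
               (event_id,))
    return "ok"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id TEXT PRIMARY KEY, status TEXT)")
print(handle_event(db, "evt_1"))  # ok
print(handle_event(db, "evt_1"))  # duplicate
```

&lt;p&gt;Note there is no window between the existence check and the write: the insert &lt;em&gt;is&lt;/em&gt; the check.&lt;/p&gt;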

&lt;p&gt;The consensus step flagged the race because two of three models saw it. Without cross-checking, whichever single model you happened to be using would have told you the diff was either shippable or a bug — a coin flip on a payments handler.&lt;/p&gt;

&lt;p&gt;The lesson isn't that Claude is worse at concurrency. Rerun this prompt on a different day and the models trade places. The lesson is that &lt;em&gt;any&lt;/em&gt; single model has blind spots that are invisible until a different model looks at the same code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case 2: The "working" SQL that was quietly injectable
&lt;/h2&gt;

&lt;p&gt;A new internal admin endpoint, Python, roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users WHERE email ILIKE %s ORDER BY &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; DESC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Gemini&lt;/strong&gt; immediately flagged the SQL injection in the &lt;code&gt;sort&lt;/code&gt; parameter — the &lt;code&gt;%s&lt;/code&gt; parameterization protects &lt;code&gt;query&lt;/code&gt;, but &lt;code&gt;sort&lt;/code&gt; is interpolated directly into the string. An attacker who controls &lt;code&gt;sort&lt;/code&gt; can turn this into &lt;code&gt;ORDER BY (SELECT ...) DESC&lt;/code&gt; and exfiltrate data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex&lt;/strong&gt; flagged it too, with a suggested allowlist: &lt;code&gt;if sort not in {"created_at", "email", "last_login"}: raise ValueError(...)&lt;/code&gt;.&lt;/p&gt;
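
&lt;p&gt;The allowlist fix is a few lines. A sketch in Python (SQLite and &lt;code&gt;LIKE&lt;/code&gt; stand in here for the Postgres &lt;code&gt;ILIKE&lt;/code&gt; original, and the table layout is invented):&lt;br&gt;
&lt;/p&gt;

```python
import sqlite3

ALLOWED_SORTS = {"created_at", "email", "last_login"}

def search_users(db, query, sort="created_at"):
    """Identifiers can't be bound as parameters, so allowlist them instead."""
    if sort not in ALLOWED_SORTS:
        raise ValueError(f"unsupported sort column: {sort!r}")
    # The value still goes through normal parameter binding.
    sql = f"SELECT email FROM users WHERE email LIKE ? ORDER BY {sort} DESC"
    return db.execute(sql, (f"%{query}%",)).fetchall()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (email TEXT, created_at TEXT, last_login TEXT)")
db.execute("INSERT INTO users VALUES ('a@example.com', '2026-01-01', '2026-02-01')")
print(search_users(db, "example"))  # [('a@example.com',)]
try:
    search_users(db, "x", sort="(SELECT 1)")
except ValueError as e:
    print(e)  # the injection attempt is rejected before any SQL runs
```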

&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; said the query was safe because the user parameter was parameterized — and it was technically right about &lt;code&gt;query&lt;/code&gt;, but it missed that &lt;code&gt;sort&lt;/code&gt; is a user-controllable input from the same request.&lt;/p&gt;

&lt;p&gt;This is the most dangerous kind of AI review error: confidently correct about one thing, silent on a worse thing right next to it. A single-model review that happened to land on Claude that day would have said "LGTM." The second opinion is exactly what you want for security-adjacent diffs — one model being wrong is common, two models being wrong in the same direction is rare.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case 3: The memory leak that wasn't
&lt;/h2&gt;

&lt;p&gt;Sometimes consensus is wrong and the outlier is right. React component, roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onmessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setMessages&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; and &lt;strong&gt;Gemini&lt;/strong&gt; both flagged a missing cleanup of the &lt;code&gt;onmessage&lt;/code&gt; handler and warned about a memory leak if the component re-mounted rapidly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex&lt;/strong&gt; pushed back — &lt;code&gt;ws&lt;/code&gt; never escapes the effect, and &lt;code&gt;ws.close()&lt;/code&gt; in the cleanup releases the connection, so after unmount nothing holds a reference to the socket and it's garbage-collected along with its handlers. The handler doesn't need explicit removal. The two-of-three majority was wrong; the outlier was right.&lt;/p&gt;

&lt;p&gt;This is where our cross-examination step earns its keep. Instead of defaulting to "majority wins," the consensus layer asks the dissenting model to defend its position, then asks the other two to respond. In this case Codex explained the GC behavior, Claude acknowledged the correction, and the final verdict downgraded the finding from "bug" to "stylistic nit."&lt;/p&gt;

&lt;p&gt;If you only run majority voting, you get the wrong answer on cases like this. If you run proper cross-examination, you get the right answer &lt;em&gt;and&lt;/em&gt; the reasoning, which is how engineers actually build trust in AI review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case 4: The Rust borrow checker dispute
&lt;/h2&gt;

&lt;p&gt;A small but contentious one. The diff refactored a hot path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Processed&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
        &lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Codex&lt;/strong&gt; flagged the &lt;code&gt;.clone()&lt;/code&gt; as wasteful and suggested taking &lt;code&gt;items&lt;/code&gt; by value and using &lt;code&gt;into_iter()&lt;/code&gt; to move instead of clone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini&lt;/strong&gt; agreed with the performance critique but added a nuance — if &lt;code&gt;Item&lt;/code&gt; owns anything expensive to clone (a large &lt;code&gt;String&lt;/code&gt; or &lt;code&gt;Vec&lt;/code&gt; buffer, say; an &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;T&amp;gt;&amp;gt;&lt;/code&gt; itself clones cheaply, since it's just a refcount bump), the deep copy is exactly what you don't want in a hot path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; defended the clone. Its argument: if &lt;code&gt;transform&lt;/code&gt; is defined for &lt;code&gt;&amp;amp;Item&lt;/code&gt; in the existing codebase and changing it breaks fifteen other callers, the clone is the minimal-risk change. "Optimal" and "mergeable" are different targets.&lt;/p&gt;

&lt;p&gt;None of the three models was wrong. They were optimizing for different objectives, which is a pattern we see constantly — performance versus maintainability, correctness versus velocity, local improvement versus blast-radius. Multi-model review surfaces that there &lt;em&gt;is&lt;/em&gt; a tradeoff rather than presenting one model's preferred answer as The Answer. That's usually more useful than a confident single verdict.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we do with the disagreements
&lt;/h2&gt;

&lt;p&gt;The short version of the product: every review goes to all three models in parallel. Findings that all three agree on are high-confidence and reported first. Findings where models disagree trigger a cross-examination round where each model sees the others' output and gets a chance to revise. Anything still contested is surfaced to the human reviewer with the full argument attached, rather than hidden behind a single "LGTM."&lt;/p&gt;

&lt;p&gt;That last part is the one most people underestimate. You don't want the AI to resolve every disagreement — some disagreements are the signal. A human reviewer who sees "Claude says ship, Gemini says block, here's why" makes a better decision than one who sees a single-model verdict in either direction.&lt;/p&gt;




&lt;p&gt;If your team is running code review with one model and wondering what the second opinion would say, that's the whole pitch. Install the CLI with &lt;code&gt;npm i -g 2ndopinion-cli&lt;/code&gt;, run &lt;code&gt;2ndopinion review&lt;/code&gt;, and see where your models actually disagree. Or wire it into Claude Code / Cursor as an MCP server — docs at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We publish a weekly build-in-public update, and this post is part of it. If you have a case where two AI reviewers disagreed on your code and you're curious what a third would say, send it over — the weird diffs are the fun ones.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>devtools</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>How Smart Model Routing Picks the Right AI for Your Programming Language</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 17 Apr 2026 17:06:47 +0000</pubDate>
      <link>https://dev.to/brianmello/how-smart-model-routing-picks-the-right-ai-for-your-programming-language-2jog</link>
      <guid>https://dev.to/brianmello/how-smart-model-routing-picks-the-right-ai-for-your-programming-language-2jog</guid>
      <description>&lt;p&gt;The dirty secret of AI code review is that there is no single "best" model. There are only models that happen to be good at the specific thing you're asking them to do right now.&lt;/p&gt;

&lt;p&gt;I learned this the hard way while building &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;2ndOpinion&lt;/a&gt;, an AI code review tool where Claude, Codex, and Gemini cross-check each other's work over MCP. The first version hard-coded one model for every review. The reviews were fine for JavaScript. They were embarrassing for Rust. They were weirdly confident and wrong for SQL.&lt;/p&gt;

&lt;p&gt;So we stopped picking a single model and started routing. This post is about how that routing layer actually works — what signals we collect, how the scoring math plays out, and what surprised us when we shipped it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: model strength is language-specific
&lt;/h2&gt;

&lt;p&gt;When we started tracking per-language accuracy, a clear pattern emerged. If you take the same corpus of reviewed pull requests and score each model on catching real bugs versus hallucinating issues that don't exist, you don't get a uniform leaderboard. You get something that looks more like rock-paper-scissors.&lt;/p&gt;

&lt;p&gt;One model might be excellent at spotting async/await footguns in TypeScript but completely miss lifetime issues in Rust. Another might be phenomenal at Python decorator patterns and hopeless at Go's error handling conventions. A third might crush Terraform drift detection but flag perfectly valid Kubernetes manifests as "probably wrong."&lt;/p&gt;

&lt;p&gt;This isn't a flaw in any particular model. It's a consequence of training data distribution, RLHF feedback, and the fact that "code" is actually hundreds of very different specialties wearing the same trench coat. Treating "AI code review" as one problem and picking one winner leaves performance on the table for every language that isn't the winner's strong suit.&lt;/p&gt;

&lt;h2&gt;
  
  
  What &lt;code&gt;--llm auto&lt;/code&gt; actually does
&lt;/h2&gt;

&lt;p&gt;When you run a review with no model flag, the CLI calls the &lt;code&gt;auto&lt;/code&gt; router:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli
2ndopinion review src/auth.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, &lt;code&gt;--llm auto&lt;/code&gt; is the default. It weighs four signals — language, change type, file size, and pattern memory — and picks a model. Here's the Python SDK equivalent, which exposes the same router via a keyword argument:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;secondopinion&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;opinion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/auth.ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;typescript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# route based on accuracy data
&lt;/span&gt;    &lt;span class="n"&gt;change_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bugfix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# optional hint
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;review_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_used&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# which model actually ran
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;review_metadata&lt;/code&gt; object is the important part for debugging. Every response tells you which model was picked and why, along with token counts and duration. If you want reproducibility, pin the model explicitly; if you want the best review for this specific request, let the router decide.&lt;/p&gt;

&lt;h2&gt;
  
  
  The signals that feed the router
&lt;/h2&gt;

&lt;p&gt;There are four signals we weight, in roughly this order:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language.&lt;/strong&gt; Detected from file extension, shebang, or an explicit &lt;code&gt;language=&lt;/code&gt; argument in the SDK. This is the dominant signal because accuracy variance between models on a given language is much larger than variance on other dimensions.&lt;/p&gt;
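
&lt;p&gt;Detection itself is as boring as it sounds. A hypothetical sketch of the extension-then-shebang fallback (the real mapping table is much larger):&lt;br&gt;
&lt;/p&gt;

```python
import os

# Hypothetical extension map; the production table covers far more languages.
EXT_MAP = {".ts": "typescript", ".py": "python", ".rs": "rust", ".go": "go"}

def detect_language(path, first_line=""):
    """Prefer the file extension; fall back to the shebang line."""
    ext = os.path.splitext(path)[1]
    if ext in EXT_MAP:
        return EXT_MAP[ext]
    if first_line.startswith("#!"):
        if "python" in first_line:
            return "python"
        if "bash" in first_line or "/sh" in first_line:
            return "shell"
    return "unknown"

print(detect_language("src/auth.ts"))                    # typescript
print(detect_language("deploy", "#!/usr/bin/env bash"))  # shell
```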

&lt;p&gt;&lt;strong&gt;Change type.&lt;/strong&gt; A new-feature diff has different review priorities than a bugfix or a refactor. Security-sensitive file paths (&lt;code&gt;auth/&lt;/code&gt;, &lt;code&gt;crypto/&lt;/code&gt;, anything matching a configurable allowlist) bump a security-audit weight into the decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File size and diff size.&lt;/strong&gt; Very large files hit models with bigger effective context windows. Small targeted diffs can go to faster models without losing accuracy — no point paying for a heavyweight review of a three-line typo fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern memory.&lt;/strong&gt; If we've seen a similar bug pattern in this repo before, we bias toward the model that caught the original. This is a small effect per review, but over a project's lifetime it adds up, because teams tend to re-introduce the same class of bug in different forms.&lt;/p&gt;

&lt;p&gt;The scoring itself is embarrassingly simple. For each candidate model we compute a weighted sum from the accuracy table and pick the highest. It's not a neural net. It's not an LLM picking another LLM. It's a lookup table and a weighted argmax. We tried fancier approaches and they kept losing to the lookup table, which turns out to be the honest answer in most ML-system stories.&lt;/p&gt;
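
&lt;p&gt;In toy form, the whole router fits on a screen. All numbers and weights below are invented for illustration; the shape is the point:&lt;br&gt;
&lt;/p&gt;

```python
# Lookup table: (model, language) to a historical accuracy score
# (catch rate minus false-positive rate). Values are made up.
ACCURACY = {
    ("claude", "typescript"): 0.81,
    ("codex", "typescript"): 0.78,
    ("gemini", "typescript"): 0.74,
    ("claude", "rust"): 0.70,
    ("codex", "rust"): 0.79,
    ("gemini", "rust"): 0.78,
}
# Per-model bump applied when the diff touches security-sensitive paths.
SECURITY_BONUS = {"claude": 0.03, "codex": 0.02, "gemini": 0.04}

def route(language, security_sensitive=False):
    """Weighted argmax over the accuracy table. No neural net, no meta-LLM."""
    def score(model):
        s = ACCURACY.get((model, language), 0.5)  # neutral prior for unknowns
        if security_sensitive:
            s += SECURITY_BONUS[model]
        return s
    return max(("claude", "codex", "gemini"), key=score)

print(route("typescript"))                     # claude
print(route("rust"))                           # codex
print(route("rust", security_sensitive=True))  # gemini: the bump flips a close call
```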

&lt;h2&gt;
  
  
  Where the accuracy data comes from
&lt;/h2&gt;

&lt;p&gt;A router is only as good as the data behind it. Ours comes from three places.&lt;/p&gt;

&lt;p&gt;First, offline evals. We maintain a set of benchmark repos per language with known bugs — either ones we inject, or historical CVEs replayed on the vulnerable commit. Every model gets scored on "did you catch this specific bug" and "did you flag something that wasn't actually a problem."&lt;/p&gt;

&lt;p&gt;Second, production telemetry. When a user accepts or rejects a finding via &lt;code&gt;2ndopinion fix&lt;/code&gt; or the GitHub PR agent, that's a signal. Rejected findings that were later confirmed as real bugs (via a follow-up commit or a revert) are gold. We only aggregate feedback, never store code — that's a hard constraint baked into the pipeline.&lt;/p&gt;

&lt;p&gt;Third, consensus disagreements. When you run a consensus review, three models vote. Disagreements are interesting because they surface cases where one model sees a bug the others miss. Over time, the model that's consistently right on disagreements gets weighted higher for that language.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Three-model consensus review — the source of a lot of our training signal&lt;/span&gt;
2ndopinion review src/auth.ts &lt;span class="nt"&gt;--consensus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three credits, one command. The confidence-weighted aggregator takes the three reviews, collapses duplicate findings, and ranks by agreement. High-agreement findings surface first; disagreements get flagged explicitly so a human can adjudicate.&lt;/p&gt;
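
&lt;p&gt;The aggregation step is mostly bookkeeping. A simplified sketch (real findings carry confidence scores and code spans; this version just counts votes):&lt;br&gt;
&lt;/p&gt;

```python
from collections import defaultdict

def aggregate(reviews):
    """Group findings by (file, line, category), rank by model agreement."""
    groups = defaultdict(list)
    for model, findings in reviews.items():
        for f in findings:
            key = (f["file"], f["line"], f["category"])
            groups[key].append(model)
    ranked = sorted(groups.items(), key=lambda kv: -len(kv[1]))
    return [
        {"finding": key, "models": models, "contested": len(models) == 1}
        for key, models in ranked
    ]

reviews = {
    "claude": [{"file": "auth.ts", "line": 42, "category": "race"}],
    "codex":  [{"file": "auth.ts", "line": 42, "category": "race"},
               {"file": "auth.ts", "line": 7, "category": "style"}],
    "gemini": [{"file": "auth.ts", "line": 42, "category": "race"}],
}
result = aggregate(reviews)
print(result[0]["models"])     # ['claude', 'codex', 'gemini']
print(result[1]["contested"])  # True
```

&lt;p&gt;High-agreement findings sort to the top; singleton findings are marked contested so the human reviewer sees them as open questions, not verdicts.&lt;/p&gt;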

&lt;h2&gt;
  
  
  A concrete example: routing a TypeScript auth change
&lt;/h2&gt;

&lt;p&gt;Say you run this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion review src/auth/session.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The router sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Language: TypeScript (file extension and tsconfig detected)&lt;/li&gt;
&lt;li&gt;Change type: bugfix (detected from git diff — a returned value was modified, no new exports)&lt;/li&gt;
&lt;li&gt;File size: 240 lines&lt;/li&gt;
&lt;li&gt;Path signal: &lt;code&gt;auth/&lt;/code&gt; → security-sensitive bump&lt;/li&gt;
&lt;li&gt;Pattern memory: this repo had a session-fixation issue three months ago&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The router weights the security-sensitive bump and biases toward whichever model has the strongest track record on auth/session TypeScript bugs in our accuracy table. It runs that single model at three-credits-equivalent depth, returns a review, and the &lt;code&gt;review_metadata&lt;/code&gt; field on the response tells you exactly which model was chosen so you can audit the decision.&lt;/p&gt;

&lt;p&gt;If any of those signals flip — different language, a new-feature diff, no security-sensitive path — you'd get a different model. That's the whole point.&lt;/p&gt;

&lt;h2&gt;
  
  
  What surprised us
&lt;/h2&gt;

&lt;p&gt;Two things.&lt;/p&gt;

&lt;p&gt;First, the router made the marginal model matter. We used to think of models as tiered — a "best" one, a "good enough" one, a "cheap one for trivial stuff." Once we started routing on language-specific accuracy, the hierarchy collapsed. Models we'd written off as second-tier turned out to dominate specific slices. There is no tier list. There are just specialties.&lt;/p&gt;

&lt;p&gt;Second, the router made consensus more valuable, not less. You'd think smart routing would make consensus redundant — why run three models if one is already the best? In practice, consensus is where the router learns. Every disagreement is a labeled data point about where the router's current guess is wrong. We run consensus on a sampled slice of reviews partly to keep the accuracy table fresh.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;If you're building anything on top of LLMs, the lesson generalizes past code review: "which model is best" is the wrong question. The right question is "which model is best for this specific request, given what I know about it." Build a router, not a leaderboard.&lt;/p&gt;

&lt;p&gt;If you want to see smart model routing in action, the fastest way is to run the CLI against a file of your own and watch which model the router picks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the CLI and run a review&lt;/span&gt;
npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli
2ndopinion review src/your-file.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or try the playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; and watch the &lt;code&gt;model_used&lt;/code&gt; field on the response. You can force a specific model with &lt;code&gt;--llm claude&lt;/code&gt;, &lt;code&gt;--llm codex&lt;/code&gt;, or &lt;code&gt;--llm gemini&lt;/code&gt; to see how the same code gets reviewed differently — which is the fastest way to internalize why routing matters in the first place.&lt;/p&gt;

&lt;p&gt;If you've built a routing layer for a different ML-backed product, I'd love to hear what signals ended up mattering most. Drop a comment — I'm especially curious about people who tried fancier approaches before collapsing back to a lookup table.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>programming</category>
    </item>
    <item>
      <title>What 10 Versions of an AI Code Review CLI Taught Me About Developer UX</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 10 Apr 2026 17:12:18 +0000</pubDate>
      <link>https://dev.to/brianmello/what-10-versions-of-an-ai-code-review-cli-taught-me-about-developer-ux-1301</link>
      <guid>https://dev.to/brianmello/what-10-versions-of-an-ai-code-review-cli-taught-me-about-developer-ux-1301</guid>
      <description>&lt;p&gt;You don't learn how developers think by reading docs. You learn by shipping something, watching it fail, and shipping it again.&lt;/p&gt;

&lt;p&gt;I've been building &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;2ndOpinion&lt;/a&gt;, an AI code review tool where multiple models — Claude, Codex, Gemini — cross-check each other's reviews. Over the past few months and ten CLI versions, I've rewritten the developer experience more times than I'd like to admit. Here's what actually stuck.&lt;/p&gt;

&lt;h2&gt;
  
  
  Version 1: The "Just Ship It" Phase
&lt;/h2&gt;

&lt;p&gt;The first version of the CLI did exactly one thing: send your code to three AI models and print their reviews. It worked. Technically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx 2ndopinion-cli &lt;span class="nt"&gt;--file&lt;/span&gt; src/auth.ts &lt;span class="nt"&gt;--models&lt;/span&gt; claude,codex,gemini &lt;span class="nt"&gt;--format&lt;/span&gt; json &lt;span class="nt"&gt;--output&lt;/span&gt; review.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four flags to get a single review. Every run required you to specify the file, models, format, and output. Nobody wants to think that hard before getting feedback on their code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: If your CLI needs a manual, you've already lost.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Smart Default" Breakthrough
&lt;/h2&gt;

&lt;p&gt;The single biggest improvement wasn't a feature — it was removing decisions. Version 0.5.0 introduced one command that just works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion review src/auth.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The tool auto-detects your language, picks the best models for that language based on real accuracy data, and prints a formatted review. No flags required.&lt;/p&gt;

&lt;p&gt;Downloads jumped immediately. Not because the tool got more powerful — it got simpler.&lt;/p&gt;

&lt;p&gt;Behind the scenes, &lt;code&gt;--llm auto&lt;/code&gt; routes your code to whichever models perform best for your specific language. TypeScript reviews go to different models than Python reviews, because we track which models actually catch bugs in each language. But the developer doesn't need to know any of that.&lt;/p&gt;
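&lt;p&gt;The auto-detection step can be sketched in a few lines. This is a hypothetical illustration of the idea, not the CLI's actual mapping:&lt;/p&gt;

```python
from pathlib import Path

# Hypothetical sketch of the auto-detection step: map a file extension
# to a language tag, which the router then uses to pick models.
# The mapping here is illustrative, not the CLI's actual table.
EXT_TO_LANG = {
    ".ts": "typescript",
    ".tsx": "typescript",
    ".py": "python",
    ".go": "go",
    ".rs": "rust",
}

def detect_language(path: str) -> str:
    """Best-effort language detection from the file extension."""
    return EXT_TO_LANG.get(Path(path).suffix.lower(), "unknown")

print(detect_language("src/auth.ts"))  # typescript
```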

&lt;h2&gt;
  
  
  The Feedback That Changed Everything
&lt;/h2&gt;

&lt;p&gt;A developer tried 2ndOpinion and told me: "I got my review. Now what?"&lt;/p&gt;

&lt;p&gt;That question haunted me. Getting a list of issues is step one. But developers don't want a report — they want their code to be better. So I built &lt;code&gt;fix&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion fix src/auth.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One command. It reviews your code, identifies the issues, generates fixes, and applies them. You can review the diff before accepting. The entire loop from "something's wrong" to "it's fixed" happens in your terminal.&lt;/p&gt;

&lt;p&gt;Then came &lt;code&gt;watch&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion watch src/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Continuous monitoring. Save a file, get a review. Like having a pair programmer who never takes a break and never gets passive-aggressive about your variable names.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: The best developer tool is the one that closes the loop. Don't hand developers a problem — hand them a solution.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Multi-Model Insight Nobody Asked For
&lt;/h2&gt;

&lt;p&gt;Here's something I didn't expect: individual AI models are unreliable in predictable ways. Claude is excellent at architectural reasoning but sometimes misses edge cases in error handling. Codex catches implementation bugs that Claude misses. Gemini often spots performance issues the others overlook.&lt;/p&gt;

&lt;p&gt;No single model is "the best." But three models reviewing the same code? They catch what each other misses. That's the core thesis of 2ndOpinion — consensus-based review.&lt;/p&gt;

&lt;p&gt;When all three models agree something is a problem, the confidence is high. When they disagree, that's where the interesting conversations happen. We built a confidence-weighted system that surfaces high-agreement issues first and flags disagreements for human review.&lt;/p&gt;
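&lt;p&gt;The surfacing logic can be pictured as a sort on agreement count. A simplified sketch with invented findings, not the production implementation:&lt;/p&gt;

```python
# Simplified sketch of agreement-first triage: high-agreement issues
# surface first, lone dissents sink to the bottom for human review.
# The finding data is invented for the example.
findings = [
    {"issue": "ambiguous variable name", "models": ["codex"]},
    {"issue": "unhandled promise rejection", "models": ["claude", "codex", "gemini"]},
    {"issue": "missing input validation", "models": ["claude", "gemini"]},
]

# Most agreement first.
triaged = sorted(findings, key=lambda f: len(f["models"]), reverse=True)

for f in triaged:
    tag = "CONSENSUS" if len(f["models"]) == 3 else f"{len(f['models'])}/3"
    print(f"[{tag}] {f['issue']}")
```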

&lt;p&gt;The &lt;code&gt;consensus&lt;/code&gt; command makes this explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion review &lt;span class="nt"&gt;--consensus&lt;/span&gt; src/auth.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three models review in parallel. You get a unified report with confidence scores. Three credits, one command, and a review that's more thorough than any single model could produce.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Got Wrong About Developer UX
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I over-indexed on power users.&lt;/strong&gt; Early versions had flags for everything: model selection, temperature, output format, verbosity levels, custom prompts. Power users loved it. Everyone else bounced.&lt;/p&gt;

&lt;p&gt;The fix was layered complexity. The default command (&lt;code&gt;2ndopinion&lt;/code&gt;) requires zero configuration. Power users can add flags to customize. But the first experience is frictionless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I underestimated CI/CD.&lt;/strong&gt; Developers don't just run tools locally — they run them in pipelines. Version 0.10.0 added &lt;code&gt;--ci&lt;/code&gt;, &lt;code&gt;--json&lt;/code&gt;, and &lt;code&gt;--plain&lt;/code&gt; flags specifically for non-interactive environments. It sounds obvious in retrospect, but I spent months building interactive terminal UI before realizing half my users needed the opposite.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In your GitHub Actions workflow&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--pr&lt;/span&gt; &lt;span class="nv"&gt;$PR_NUMBER&lt;/span&gt; &lt;span class="nt"&gt;--ci&lt;/span&gt; &lt;span class="nt"&gt;--json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;I ignored the "try before you buy" instinct.&lt;/strong&gt; Developers don't sign up for things. They install them, try them, and decide in under 60 seconds. The free playground on &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; — no signup required — exists because I watched too many developers hit a registration wall and leave.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next: The Skills Marketplace
&lt;/h2&gt;

&lt;p&gt;The most surprising thing I've learned is that every team has domain-specific review needs. A fintech team cares about different patterns than a game studio. A team migrating from Python 2 to 3 needs a completely different lens.&lt;/p&gt;

&lt;p&gt;So we're building a skills marketplace where developers can create custom audit skills — specialized review logic for specific domains — and sell them. Creators earn 70% of revenue. It turns tribal knowledge into something shareable and monetizable.&lt;/p&gt;

&lt;p&gt;Think of it as npm for code review intelligence. Someone who's spent five years dealing with Django security footguns can package that knowledge into a skill that catches those issues for every Django developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Ten versions in, the biggest lesson is this: &lt;strong&gt;developer tools win on defaults, not features.&lt;/strong&gt; Every flag you add is a decision you're asking the developer to make. Every decision is friction. Every bit of friction is a reason to close the terminal and move on.&lt;/p&gt;

&lt;p&gt;If you're building developer tools, here's my checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the zero-config experience work?&lt;/li&gt;
&lt;li&gt;Does the tool close the loop (find problem → fix problem)?&lt;/li&gt;
&lt;li&gt;Can it run in CI without modification?&lt;/li&gt;
&lt;li&gt;Can someone try it in under 60 seconds?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to try multi-model AI code review, the CLI is one install away:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli
2ndopinion review your-file.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or try the playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; — no signup, no credit card, just paste code and see what three AI models think.&lt;/p&gt;

&lt;p&gt;I'd love to hear what you've learned building developer tools. What UX lessons took you the longest to figure out? Drop a comment below.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>devtools</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>Single-Model vs Multi-Model AI Code Review: What I Learned Running Both</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 03 Apr 2026 17:07:37 +0000</pubDate>
      <link>https://dev.to/brianmello/single-model-vs-multi-model-ai-code-review-what-i-learned-running-both-2i22</link>
      <guid>https://dev.to/brianmello/single-model-vs-multi-model-ai-code-review-what-i-learned-running-both-2i22</guid>
      <description>&lt;p&gt;I've been obsessing over AI code review for the last year. Not because I think AI will replace code review — I don't — but because I think most developers are leaving a lot of quality signal on the table by using AI review the wrong way.&lt;/p&gt;

&lt;p&gt;Here's the thing nobody talks about: &lt;strong&gt;a single AI model is confidently wrong surprisingly often.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not maliciously wrong. Not obviously wrong. Just... plausible-sounding wrong. It'll flag a false positive, miss a real bug, or give you a high-confidence "looks good" on code that has a subtle race condition. And because the model sounds so sure of itself, you accept it and move on.&lt;/p&gt;

&lt;p&gt;I learned this the hard way. Then I started running multi-model consensus review instead, and it changed my whole mental model of what AI code review should look like.&lt;/p&gt;

&lt;p&gt;Here's what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Single-Model Review
&lt;/h2&gt;

&lt;p&gt;When you pipe code through one model — say, Claude or GPT-4 — you get a single "opinion." That opinion is shaped by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model's training data distribution&lt;/li&gt;
&lt;li&gt;Whatever biases crept in during RLHF&lt;/li&gt;
&lt;li&gt;The specific prompt you used&lt;/li&gt;
&lt;li&gt;The model's current context window state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of those factors are visible to you as the reviewer. You just get a confident-sounding output and have to decide how much to trust it.&lt;/p&gt;

&lt;p&gt;I started noticing patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt; tends to be excellent at spotting architectural smell and async/await patterns. It's more conservative — it'll point out potential issues even when they're not certain bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4 / Codex&lt;/strong&gt; is better at catching common idiom violations and tends to give more opinionated style feedback. It's more decisive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt; has surprisingly strong instincts around security patterns and type safety, particularly in typed languages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't a knock on any model. They're just different lenses. And here's the thing: &lt;strong&gt;a bug that one model misses, another often catches.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Running the Same Code Through Both Approaches
&lt;/h2&gt;

&lt;p&gt;I took a production Node.js service — about 2,000 lines across 12 files — and ran it two ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 1: Single-model review (just Claude)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the CLI&lt;/span&gt;
npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli

&lt;span class="c"&gt;# Review with a single model&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--llm&lt;/span&gt; claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Approach 2: Multi-model consensus (Claude + Codex + Gemini in parallel)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Use consensus mode — 3 models, confidence-weighted&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--consensus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The single-model pass found &lt;strong&gt;14 issues&lt;/strong&gt;: 9 flagged as medium severity, 3 high, 2 low. Took about 8 seconds.&lt;/p&gt;

&lt;p&gt;The consensus pass found &lt;strong&gt;19 issues&lt;/strong&gt;: same 14, plus 5 more. Three of those 5 were real bugs I later confirmed in prod logs.&lt;/p&gt;

&lt;p&gt;But here's the part that matters more than the raw numbers:&lt;/p&gt;

&lt;p&gt;The consensus pass also &lt;strong&gt;filtered out 4 false positives&lt;/strong&gt; that Claude had flagged with high confidence. They got filtered because Codex and Gemini both pushed back — when 2 out of 3 models say "this is fine," the confidence weight pulls the verdict away from "issue."&lt;/p&gt;




&lt;h2&gt;
  
  
  How Confidence-Weighted Consensus Works
&lt;/h2&gt;

&lt;p&gt;The naive approach to multi-model review would be simple majority voting: if 2 of 3 models say something is a bug, call it a bug. That's better than nothing, but it treats all models as equally reliable on all tasks.&lt;/p&gt;

&lt;p&gt;Confidence-weighted consensus is smarter. Each model reports not just &lt;em&gt;what&lt;/em&gt; it found, but &lt;em&gt;how confident&lt;/em&gt; it is. The final verdict weights those signals proportionally.&lt;/p&gt;

&lt;p&gt;So if Claude says "potential null dereference, high confidence" and Codex says "looks fine, medium confidence," the system doesn't just flip a coin. It weights Claude's high-confidence flag more heavily than Codex's medium-confidence dismissal.&lt;/p&gt;

&lt;p&gt;In practice, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unanimous findings&lt;/strong&gt; → almost certainly real, shown at the top&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2/3 agreement, high confidence&lt;/strong&gt; → likely real, worth investigating&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1/3 agreement, low confidence on the finding model&lt;/strong&gt; → deprioritized, often noise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Divergent high-confidence opinions&lt;/strong&gt; → flagged as a "debate" item worth human judgment&lt;/li&gt;
&lt;/ul&gt;
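&lt;p&gt;The weighting itself fits in a few lines. This is an illustrative scheme with invented numbers; the production formula isn't published:&lt;/p&gt;

```python
# Illustrative confidence-weighted vote. The weighting scheme here is
# invented for the example; the production formula isn't published.
def weighted_verdict(votes):
    """votes: list of (says_issue, confidence) pairs, confidence in 0..1.

    Returns a score in 0..1: the confidence-weighted share of models
    that call it an issue.
    """
    for_issue = sum(conf for says_issue, conf in votes if says_issue)
    against = sum(conf for says_issue, conf in votes if not says_issue)
    total = for_issue + against
    return for_issue / total if total else 0.0

# Claude flags with high confidence, Codex dismisses with medium
# confidence: the verdict leans toward "issue" instead of flipping a coin.
print(weighted_verdict([(True, 0.9), (False, 0.6)]))  # 0.6
```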

&lt;p&gt;Here's what that looks like with the Python SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;secondopinion&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;

&lt;span class="c1"&gt;# Run consensus review
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;consensus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;server.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Models agreeing: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[94%] HIGH: Unhandled promise rejection in processWebhook()
  Models agreeing: claude, codex, gemini

[71%] MEDIUM: Missing input validation on userId parameter
  Models agreeing: claude, gemini

[38%] LOW: Variable name 'data' is ambiguous
  Models agreeing: codex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That 38% finding? Probably noise. The 94% finding? Drop everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Single-Model Review Is Still Fine
&lt;/h2&gt;

&lt;p&gt;I want to be fair here. Single-model review isn't bad — it's just different.&lt;/p&gt;

&lt;p&gt;For fast iteration during development, single-model is great. You're not trying to catch every bug; you're trying to get quick feedback while the code is fresh. Running &lt;code&gt;2ndopinion watch&lt;/code&gt; gives you exactly that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Continuous monitoring — single model, fast feedback loop&lt;/span&gt;
2ndopinion watch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For code that's about to merge to main — especially anything touching auth, payments, or data pipelines — the consensus pass is worth the extra 10-15 seconds and the 2 additional credits.&lt;/p&gt;

&lt;p&gt;The mental model I've landed on: &lt;strong&gt;single-model for development velocity, consensus for pre-merge quality gates.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Deeper Lesson: Models Have Blind Spots
&lt;/h2&gt;

&lt;p&gt;The thing I didn't fully appreciate before building multi-model review into my workflow: AI models have systematic blind spots, not random ones.&lt;/p&gt;

&lt;p&gt;If Claude misses a certain class of bug, it tends to &lt;em&gt;consistently&lt;/em&gt; miss that class. It's not a random error — it's a bias in how the model was trained. That means if you only ever use Claude, you'll ship the same categories of bugs repeatedly without ever knowing they're being systematically missed.&lt;/p&gt;

&lt;p&gt;Multi-model consensus surfaces those blind spots by triangulating from different vantage points. It's the same reason we have human code reviewers with different backgrounds look at the same PR.&lt;/p&gt;

&lt;p&gt;One model trained heavily on Python might under-weight JavaScript async patterns. Another trained on a lot of library code might be overly conservative about application-layer error handling. When you combine them, the idiosyncrasies average out.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;If you want to see this difference yourself, there's a free playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; — no signup required. Paste your code, run both modes, and compare the outputs side by side.&lt;/p&gt;

&lt;p&gt;Or install the CLI and try it on your own codebase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli

&lt;span class="c"&gt;# Single model&lt;/span&gt;
2ndopinion review

&lt;span class="c"&gt;# Consensus (3 models, confidence-weighted)&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--consensus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first time you see a consensus pass catch something a single-model review confidently missed, you'll get it. That's the moment the mental model clicked for me.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;2ndOpinion is a multi-model AI code review tool. Claude, Codex, and Gemini cross-check each other's findings via MCP, CLI, Python SDK, REST API, and GitHub PR Agent. Free playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codequality</category>
      <category>codereview</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to Add Multi-Model AI Code Review to Claude Code in 30 Seconds</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Thu, 02 Apr 2026 23:16:45 +0000</pubDate>
      <link>https://dev.to/brianmello/how-to-add-multi-model-ai-code-review-to-claude-code-in-30-seconds-4aoe</link>
      <guid>https://dev.to/brianmello/how-to-add-multi-model-ai-code-review-to-claude-code-in-30-seconds-4aoe</guid>
      <description>&lt;p&gt;You know that moment when Claude reviews your code, gives it the green light, and then two days later you're debugging a production issue that &lt;em&gt;three humans&lt;/em&gt; would have caught immediately?&lt;/p&gt;

&lt;p&gt;Single-model AI code review has a blind spot problem. Each model was trained on different data, has different failure modes, and holds different opinions about what "correct" looks like. When you only ask one AI, you're getting one perspective — and that perspective has systematic gaps.&lt;/p&gt;

&lt;p&gt;Multi-model consensus code review flips the script. Instead of trusting one AI, you get Claude, GPT-4o, and Gemini to cross-check each other. Where all three agree, you can be confident. Where they diverge, &lt;em&gt;that's&lt;/em&gt; where you need to look closer.&lt;/p&gt;

&lt;p&gt;Here's how to set it up in Claude Code in about 30 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Single-Model Review
&lt;/h2&gt;

&lt;p&gt;Let me be direct: single-model AI code review is better than nothing. But it has a fundamental flaw — the model doesn't know what it doesn't know.&lt;/p&gt;

&lt;p&gt;I ran an experiment last month: the same set of 50 bugs, run through Claude, GPT-4o, and Gemini separately. Each model caught some bugs the others missed. GPT-4o was better at certain Python anti-patterns. Gemini caught more async/concurrency issues. Claude excelled at security-related edge cases.&lt;/p&gt;

&lt;p&gt;No model caught everything. But when I used all three in consensus mode? Coverage went up significantly.&lt;/p&gt;

&lt;p&gt;This is the case for multi-model AI code review — it's not about any single model being bad, it's about combining strengths.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up 2ndOpinion via MCP in 30 Seconds
&lt;/h2&gt;

&lt;p&gt;2ndOpinion is an AI-to-AI communication platform that routes your code to multiple models simultaneously and returns a confidence-weighted consensus. It plugs into Claude Code via MCP.&lt;/p&gt;

&lt;p&gt;Here's the config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2ndopinion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2ndopinion-mcp"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"SECONDOPINION_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-api-key-here"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drop that into your Claude Code MCP config file (&lt;code&gt;.mcp.json&lt;/code&gt; in your project root, or register it with &lt;code&gt;claude mcp add&lt;/code&gt;), restart Claude Code, and you're done. No extra dependencies. No separate process to run.&lt;/p&gt;

&lt;p&gt;Once it's wired up, you have access to these tools directly inside Claude Code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;review&lt;/code&gt;&lt;/strong&gt; — standard multi-model code review (uses 2 credits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;consensus&lt;/code&gt;&lt;/strong&gt; — parallel review from 3 models with confidence weighting (3 credits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;debate&lt;/code&gt;&lt;/strong&gt; — multi-round AI debate for architecture decisions (5–7 credits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;bug_hunt&lt;/code&gt;&lt;/strong&gt; — targeted bug detection sweep&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;security_audit&lt;/code&gt;&lt;/strong&gt; — security-focused review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't have to remember which model to pick, either. The &lt;code&gt;--llm auto&lt;/code&gt; flag routes to the best model for your language based on real accuracy data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Your First Consensus Review
&lt;/h2&gt;

&lt;p&gt;Once the MCP is connected, you can trigger a review in plain English inside Claude Code:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Run a consensus code review on this file."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or you can use the CLI directly if you prefer the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install globally&lt;/span&gt;
npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli

&lt;span class="c"&gt;# Review a specific file&lt;/span&gt;
2ndopinion review src/auth/token-validator.ts

&lt;span class="c"&gt;# Full consensus (3 models in parallel)&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--consensus&lt;/span&gt; src/auth/token-validator.ts

&lt;span class="c"&gt;# Watch mode — auto-review on every save&lt;/span&gt;
2ndopinion watch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The consensus output tells you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Where all three models agree&lt;/strong&gt; — high confidence issues, fix these immediately&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where two out of three agree&lt;/strong&gt; — worth a look, especially for complex logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where models disagree&lt;/strong&gt; — the most interesting category; often means an ambiguous design tradeoff&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That last category is my favorite. When GPT-4o says "this is fine" and Claude says "this will blow up under load" — that's a signal to dig in, not dismiss.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Output Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;Here's a real example. I had this Python function I was shipping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users WHERE id = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running &lt;code&gt;2ndopinion review --consensus&lt;/code&gt; on this file returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔴 CONSENSUS (3/3 models agree): SQL injection vulnerability
   Line 3: f-string interpolation in SQL query
   Fix: Use parameterized queries

🟡 MAJORITY (2/3 models): Connection not closed on exception
   Line 2: db.connect() has no context manager / finally block
   Claude, Codex: Flag | Gemini: Acceptable (with connection pooling)

🟢 LOW CONFIDENCE (1/3 models): Return type may be None
   Line 4: fetchone() returns None if no row found
   Only Claude flagged this
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQL injection is obvious in hindsight, and all three models agree: high confidence, fix it now. The connection-handling disagreement is &lt;em&gt;interesting&lt;/em&gt; because it reveals the environment assumptions baked into each model — Gemini assumes a pooled connection that gets reclaimed, the other two don't. And the None return is a low-confidence flag, but a real one: &lt;code&gt;dict(None)&lt;/code&gt; raises a &lt;code&gt;TypeError&lt;/code&gt; the first time a lookup misses.&lt;/p&gt;

&lt;p&gt;This is what multi-model AI code review buys you: not just more issues, but a &lt;em&gt;quality signal&lt;/em&gt; on each issue.&lt;/p&gt;
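&lt;p&gt;For completeness, here's one way to address all three findings at once. The original snippet's &lt;code&gt;db&lt;/code&gt; module isn't shown, so this sketch assumes a SQLite database; the three fixes — parameterized query, guaranteed connection cleanup, explicit &lt;code&gt;None&lt;/code&gt; handling — translate to any DB-API driver:&lt;/p&gt;

```python
import sqlite3
from contextlib import closing
from typing import Optional

def get_user_data(db_path: str, user_id: str) -> Optional[dict]:
    # closing() releases the connection even if execute() raises.
    # (sqlite3's own context manager only manages transactions, not closing.)
    with closing(sqlite3.connect(db_path)) as conn:
        conn.row_factory = sqlite3.Row
        # Parameterized query: the driver binds user_id; nothing is interpolated.
        row = conn.execute(
            "SELECT * FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        # fetchone() returns None when no row matches; surface that explicitly
        # instead of letting dict(None) raise a TypeError.
        return dict(row) if row is not None else None
```

&lt;p&gt;Note that the return type is now honestly &lt;code&gt;Optional[dict]&lt;/code&gt;, which pushes the "no such user" decision to the caller instead of hiding it behind a crash.&lt;/p&gt;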

&lt;h2&gt;
  
  
  Pattern Memory and Regression Tracking
&lt;/h2&gt;

&lt;p&gt;One thing that makes 2ndOpinion useful beyond a one-off review is that it builds project context over time. It tracks which patterns it's flagged before, so it can alert you when the same class of bug reappears in a different file.&lt;/p&gt;

&lt;p&gt;If you fixed an authentication bypass three weeks ago and a new PR introduces a structurally similar issue, 2ndOpinion flags it as a regression. No additional config required — it builds this context automatically per project.&lt;/p&gt;
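&lt;p&gt;A regression tracker along these lines is easy to sketch: fingerprint each flagged issue by its category plus the structural shape of the offending code, then match new findings against stored fingerprints. Everything below is illustrative — it's my guess at the shape of the mechanism, not 2ndOpinion's internals:&lt;/p&gt;

```python
import hashlib
import re
from typing import Optional

def fingerprint(category: str, snippet: str) -> str:
    # Replace identifiers and numbers with placeholders so the fingerprint
    # captures the structural shape of the code, not the exact names in it.
    shape = re.sub(r"[A-Za-z_]\w*", "ID", snippet)
    shape = re.sub(r"\d+", "N", shape)
    shape = re.sub(r"\s+", " ", shape).strip()
    return hashlib.sha256(f"{category}:{shape}".encode()).hexdigest()[:16]

class PatternMemory:
    def __init__(self) -> None:
        # fingerprint -> file where the pattern was first flagged
        self.seen: dict[str, str] = {}

    def record(self, category: str, snippet: str, path: str) -> None:
        self.seen.setdefault(fingerprint(category, snippet), path)

    def check(self, category: str, snippet: str) -> Optional[str]:
        # Returns the original file if this looks like a known pattern regressing.
        return self.seen.get(fingerprint(category, snippet))

memory = PatternMemory()
memory.record("sql-injection",
              'conn.execute(f"SELECT * FROM users WHERE id = \'{user_id}\'")',
              "src/users.py")
# Same structural bug, different identifiers, different table, different file:
hit = memory.check("sql-injection",
                   'db.execute(f"SELECT * FROM orders WHERE id = \'{order_id}\'")')
# hit == "src/users.py"
```

&lt;p&gt;The normalization step is what makes the second snippet match the first even though every identifier changed — which is exactly the "structurally similar issue" behavior described above.&lt;/p&gt;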

&lt;p&gt;Combined with the GitHub PR Agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Review PR #42 from the CLI&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--pr&lt;/span&gt; 42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;...and you get automated multi-model review on every pull request, with regression awareness. The PR gets an inline comment breakdown — agreements, disagreements, and confidence levels — before a human reviewer ever opens it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Marketplace: Build Audits, Earn Revenue
&lt;/h2&gt;

&lt;p&gt;This is the part that surprised me most. 2ndOpinion has a skills marketplace where you can publish custom audit types. If you've got deep expertise in, say, Rust memory safety or Django security patterns, you can package that into an audit skill, publish it, and earn 70% of every credit spent running it.&lt;/p&gt;

&lt;p&gt;It's an interesting model: the platform benefits from domain expertise that no general-purpose LLM has, and the experts get a revenue stream from codifying what they know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Without Signing Up
&lt;/h2&gt;

&lt;p&gt;If you want to kick the tires before committing, there's a free playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; — no signup required. Paste a code snippet, pick your review type, and see what three models think.&lt;/p&gt;

&lt;p&gt;For the full MCP + Claude Code integration, you'll need an API key, but the setup overhead is genuinely minimal. One JSON config, one restart, and you're running confidence-weighted multi-model code review on every file you touch.&lt;/p&gt;




&lt;p&gt;Single-model AI code review is table stakes at this point. If you're serious about code quality, the next step is getting your AIs to argue with each other, then paying attention to both where they agree and where they don't.&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; or the &lt;a href="https://github.com/bdubtronux/2ndopinion" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; to dig into the details.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
