<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brian Mello</title>
    <description>The latest articles on DEV Community by Brian Mello (@brianmello).</description>
    <link>https://dev.to/brianmello</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3858439%2Ffbe563b7-f4da-44b2-83c2-72f137eae4ab.png</url>
      <title>DEV Community: Brian Mello</title>
      <link>https://dev.to/brianmello</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/brianmello"/>
    <language>en</language>
    <item>
      <title>I Let Three AIs Argue About My Vibe-Coded App — Here's What They Caught</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 15 May 2026 17:08:33 +0000</pubDate>
      <link>https://dev.to/brianmello/i-let-three-ais-argue-about-my-vibe-coded-app-heres-what-they-caught-3c2c</link>
      <guid>https://dev.to/brianmello/i-let-three-ais-argue-about-my-vibe-coded-app-heres-what-they-caught-3c2c</guid>
      <description>&lt;p&gt;I built a small side project in Cursor over a weekend. Login, dashboard, a couple of forms, a Stripe-style checkout flow. The kind of thing that &lt;em&gt;feels&lt;/em&gt; done. Clicking around, everything works. The vibes are immaculate.&lt;/p&gt;

&lt;p&gt;So I did the responsible adult thing: I shipped it.&lt;/p&gt;

&lt;p&gt;It broke in three places within 48 hours. None of the breaks were in code I had written by hand. They were in code an AI had generated that I had skimmed, nodded at, and moved on from.&lt;/p&gt;

&lt;p&gt;That's the trap of vibe coding. The AI is fluent. You're fluent at &lt;em&gt;reading&lt;/em&gt; what the AI made. Neither of you is the kind of pedantic loser who notices that the "Cancel" button on the checkout modal actually submits the form on mobile Safari because someone forgot to add &lt;code&gt;type="button"&lt;/code&gt; somewhere three components deep.&lt;/p&gt;

&lt;p&gt;This is the story of the second app I built, where I tried something different. I let three AI testing agents argue about my app before I shipped it. They caught seven things I would have missed. They also disagreed with each other in ways that, weirdly, made me trust the result more.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;The app was a simple expense-splitting tool — Splitwise but uglier and free. Built in Cursor, deployed on Vercel, total dev time around eight hours spread across two evenings. By the end I had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Email/password signup&lt;/li&gt;
&lt;li&gt;A "create group" flow&lt;/li&gt;
&lt;li&gt;An "add expense" form with split logic&lt;/li&gt;
&lt;li&gt;A settle-up view&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Standard vibe-coded SaaS skeleton. Worked on my machine. Looked fine on my phone.&lt;/p&gt;

&lt;p&gt;Instead of clicking around for an hour and calling it good, I pointed 2ndOpinion Testing at the URL. The pitch on the box is "AI agents test your app like real users, then cross-examine each other's findings." I'd seen the demo. This was the first time I'd used it on something I actually cared about.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "three AIs arguing" actually looks like
&lt;/h2&gt;

&lt;p&gt;The product runs three different model-backed agents against your app concurrently. Each one explores independently — clicking, typing, navigating like a confused new user who has never seen the thing before. They each file a report on what's broken or weird.&lt;/p&gt;

&lt;p&gt;Then comes the part that earns the courtroom metaphor in the marketing: the agents cross-examine each other's findings. Agent A claims the signup form is broken. Agent B says they signed up just fine. The system makes them reproduce, defend, or retract.&lt;/p&gt;

&lt;p&gt;You don't end up with three separate reports to read. You end up with one verdict: here's what's actually wrong, here's what one agent thought was wrong but couldn't reproduce, here's what all three independently flagged.&lt;/p&gt;

&lt;p&gt;Reading the final verdict felt like reading the minutes of a deposition. In a good way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The seven things they caught
&lt;/h2&gt;

&lt;p&gt;I'll walk through them in increasing order of "ouch, I should have caught that."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The signup email field accepted "test" as a valid email.&lt;/strong&gt; All three agents flagged this. Front-end validation was just &lt;code&gt;required&lt;/code&gt;, no &lt;code&gt;type="email"&lt;/code&gt;. Cursor had generated a form with the bare minimum and I hadn't tightened it. Five-second fix. Would have looked terrible the first time a real user mistyped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The "Add Expense" form let you submit $0.&lt;/strong&gt; Two of three agents tried it, both succeeded, both filed it. The third agent said "this is probably intentional, some groups track zero-dollar IOUs." The system made them argue about it. They settled on "probably a bug, ask the developer." It was a bug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The settle-up calculation rounded wrong on three-way splits.&lt;/strong&gt; $10 split three ways became $3.33 + $3.33 + $3.33, which is $9.99. Someone was always going to be a penny off. One agent caught it by splitting a coffee three ways and noticing the totals didn't reconcile. The other two had only tested two-way splits.&lt;/p&gt;
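&lt;p&gt;For reference, the standard fix is to do the math in integer cents and hand the leftover pennies out explicitly, rather than dividing floats. A minimal Python sketch (not the app's actual code):&lt;/p&gt;

```python
def split_evenly(total_cents, n):
    # Split an amount in integer cents so the shares always sum to the total.
    # The naive float version (10.00 / 3 = 3.33 each) drops a cent.
    base, remainder = divmod(total_cents, n)
    # The first `remainder` people pay one extra cent each.
    return [base + 1] * remainder + [base] * (n - remainder)

shares = split_evenly(1000, 3)  # $10.00 split three ways
print(shares)       # [334, 333, 333]
print(sum(shares))  # 1000 -- reconciles exactly
```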

&lt;p&gt;&lt;strong&gt;4. Pressing Enter in the "group name" field submitted the form before I'd added any members.&lt;/strong&gt; Only one agent caught this — the others were filling forms by clicking the submit button like polite humans. The one that pressed Enter found a half-broken state where the group existed but had no members and couldn't be edited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. The mobile nav menu didn't close after navigating.&lt;/strong&gt; Two agents flagged it. Classic AI-generated React component thing. The menu had open/close state, but route changes didn't reset it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. The password reset email link 404'd.&lt;/strong&gt; I had not, in fact, set up the password reset route. The "Forgot password?" link went to &lt;code&gt;/reset-password&lt;/code&gt; which did not exist. I had written the link before writing the page and never come back to it. One agent found this by clicking every link on the login screen. Embarrassing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. The Stripe-style checkout for the (currently mocked) "Pro" tier accepted submissions but didn't go anywhere.&lt;/strong&gt; I had stubbed out the Pro upgrade flow and forgotten about it. The button looked real. The page it led to was a 404.&lt;/p&gt;

&lt;p&gt;Seven real things. None of them catastrophic, all of them the kind of thing that, on a launch day with twenty people poking at your app, accumulates into "this product feels janky."&lt;/p&gt;

&lt;h2&gt;
  
  
  The part I didn't expect: the disagreements
&lt;/h2&gt;

&lt;p&gt;The disagreements are what convinced me this approach actually works. Here are two:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Was the signup flow too slow?&lt;/strong&gt; One agent flagged the signup as "slow, took 4 seconds." The other two said it felt normal. The system made the first agent show its work. Turned out it had been testing on a throttled connection it had picked up from somewhere in its state, and the other two hadn't. The finding got retracted. If I had just had one agent, I'd have gone hunting for a phantom performance problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Was the "delete group" confirmation modal confusing?&lt;/strong&gt; Two agents thought the wording was unclear. The third said it was fine. The argument ended with "this is subjective, flagging for human review." That's the right answer. The tool wasn't pretending to be sure when it wasn't.&lt;/p&gt;

&lt;p&gt;I have used single-AI testing tools before. They sound confident about everything, including the wrong things. Watching three agents disagree and then resolve felt much closer to the experience of having three different humans review a PR. Some things were unanimous. Some things were noise. The noise got filtered before it got to me.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd tell another vibe coder
&lt;/h2&gt;

&lt;p&gt;A few things, in order of how often I've now had to repeat them to friends:&lt;/p&gt;

&lt;p&gt;You don't need to learn Playwright. You don't need to write Cypress specs. You don't even need to know what "end-to-end testing" is in the traditional sense. If you built your app in Bolt, Lovable, v0, or Replit, the testing tool you want is the same kind of thing — point it at a URL, let it figure out what to do.&lt;/p&gt;

&lt;p&gt;You do need to test &lt;em&gt;before&lt;/em&gt; you ship, not after. The temptation when you've spent a weekend vibing with an AI is to deploy on Sunday night, post on X, and hope. Resist. A 20-minute pre-flight on a Sunday afternoon catches the seven things that would have been a soft launch disaster.&lt;/p&gt;

&lt;p&gt;You should care about the disagreements more than the agreements. If your testing tool always sounds 100% confident, it's lying to you. Real bugs aren't unanimous. The interesting findings are the ones where one agent saw something and the others didn't — and you get told whether the holdout was right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;If you've vibe-coded anything in the last month and it's sitting in a Vercel deployment waiting for you to feel brave enough to share the link, I'd run it through this before you do.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://testing.get2ndopinion.dev" rel="noopener noreferrer"&gt;Try 2ndOpinion Testing →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You paste a URL. Three AIs argue about it. You ship with fewer surprises. That's the whole product.&lt;/p&gt;

&lt;p&gt;The Splitwise-but-uglier app is still up, by the way. Seven fewer embarrassments than it would have had. I'll take it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>vibecoding</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How to Add Multi-Model AI Code Review to Your CI/CD Pipeline</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Sat, 09 May 2026 17:31:55 +0000</pubDate>
      <link>https://dev.to/brianmello/how-to-add-multi-model-ai-code-review-to-your-cicd-pipeline-1p4i</link>
      <guid>https://dev.to/brianmello/how-to-add-multi-model-ai-code-review-to-your-cicd-pipeline-1p4i</guid>
      <description>&lt;p&gt;Running AI code review locally is fine for solo work. The moment you have a team, the question becomes: how do I make the AI an actual gate in the pipeline, not a thing one person remembers to run before they push?&lt;/p&gt;

&lt;p&gt;This is a walkthrough for wiring &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;2ndOpinion&lt;/a&gt; — the multi-model AI code review CLI — into a CI/CD pipeline. I'll show GitHub Actions in full, then sketch the same pattern for GitLab CI and CircleCI. The interesting decisions aren't where the YAML goes; they're around consensus thresholds, blocking vs informational mode, and what happens when Claude, Codex, and Gemini disagree on the same diff (which, from our review logs, is roughly 15% of the time).&lt;/p&gt;

&lt;h2&gt;
  
  
  What "AI code review in CI" actually means
&lt;/h2&gt;

&lt;p&gt;There are two shapes this takes, and the YAML is almost identical for either. The difference is the policy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Informational mode.&lt;/strong&gt; Every PR runs the review. Findings are posted as a comment or check annotation. Nothing blocks merge. Humans decide what to do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocking mode.&lt;/strong&gt; Review runs on every PR. If the consensus surface flags a HIGH severity finding, the check fails and merge is blocked until the author either fixes it or someone with override permission ships anyway.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I recommend starting in informational mode for the first week or two. AI reviewers — even three of them cross-examining each other — surface false positives. You want the team to learn the noise floor before the bot can block their merges; otherwise the first false-positive blocker generates a Slack thread that ends with "let's just turn this off."&lt;/p&gt;

&lt;h2&gt;
  
  
  The minimum GitHub Actions config
&lt;/h2&gt;

&lt;p&gt;Here's the workflow file I use as a starting point. Drop it in &lt;code&gt;.github/workflows/ai-review.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AI Code Review&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;opened&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;synchronize&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;reopened&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
      &lt;span class="na"&gt;pull-requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;fetch-depth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# need full history for diffs&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Node&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;20'&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install 2ndOpinion CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm install -g 2ndopinion-cli&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run multi-model review&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ANTHROPIC_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.OPENAI_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;GOOGLE_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GOOGLE_API_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;2ndopinion review \&lt;/span&gt;
            &lt;span class="s"&gt;--base origin/${{ github.base_ref }} \&lt;/span&gt;
            &lt;span class="s"&gt;--head HEAD \&lt;/span&gt;
            &lt;span class="s"&gt;--format github-comment \&lt;/span&gt;
            &lt;span class="s"&gt;--severity-threshold medium \&lt;/span&gt;
            &lt;span class="s"&gt;--comment-pr ${{ github.event.pull_request.number }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth calling out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;fetch-depth: 0&lt;/code&gt;&lt;/strong&gt; is necessary because &lt;code&gt;actions/checkout&lt;/code&gt; defaults to a shallow clone, and the CLI needs full history to compute the actual PR diff against the base branch. Skip this and your review runs against an empty diff, which produces a confidently empty review.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three API keys.&lt;/strong&gt; Multi-model review means three providers. If you only set one, the CLI degrades to single-model mode and prints a warning. That's fine for a smoke test, but the whole reason you're doing this is the multi-model surface — the disagreement signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;--severity-threshold medium&lt;/code&gt;&lt;/strong&gt; suppresses LOW findings in the PR comment. LOW is mostly nits and style preferences, and posting them on every PR trains your team to ignore the bot. Keep MEDIUM and HIGH visible; suppress LOW.&lt;/p&gt;

&lt;h2&gt;
  
  
  Going from informational to blocking
&lt;/h2&gt;

&lt;p&gt;To turn this into a merge gate, change one flag and one branch protection setting.&lt;/p&gt;

&lt;p&gt;In the workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;2ndopinion review \&lt;/span&gt;
  &lt;span class="s"&gt;--base origin/${{ github.base_ref }} \&lt;/span&gt;
  &lt;span class="s"&gt;--head HEAD \&lt;/span&gt;
  &lt;span class="s"&gt;--format github-comment \&lt;/span&gt;
  &lt;span class="s"&gt;--severity-threshold medium \&lt;/span&gt;
  &lt;span class="s"&gt;--fail-on high \&lt;/span&gt;
  &lt;span class="s"&gt;--comment-pr ${{ github.event.pull_request.number }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--fail-on high&lt;/code&gt; flag tells the CLI to exit with a non-zero status if any HIGH severity finding has consensus from at least 2 of 3 models. The 2-of-3 threshold matters — it's why you don't want to block on single-model verdicts. Any single model can confidently invent a critical bug. Two models independently flagging the same critical bug is meaningfully harder to fake.&lt;/p&gt;

&lt;p&gt;Then in &lt;strong&gt;Settings → Branches → Branch protection&lt;/strong&gt; for your default branch, add the &lt;code&gt;AI Code Review / review&lt;/code&gt; check to the required checks list. Now the merge button is gated.&lt;/p&gt;

&lt;p&gt;I'd hold this back for at least a week of informational-mode runs. Look at the false positive rate. If you're getting more than one false HIGH per ten PRs, tune the consensus threshold up to 3-of-3 instead of 2-of-3 before flipping the gate on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;--fail-on high --consensus-required &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's stricter — only blocks when all three models agree the finding is HIGH. False positive rate drops, false negative rate goes up. Tradeoff worth making early; you can loosen later once the team trusts the bot.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitLab CI
&lt;/h2&gt;

&lt;p&gt;Same pattern, different YAML. &lt;code&gt;.gitlab-ci.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ai-code-review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node:20&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_PIPELINE_SOURCE == 'merge_request_event'&lt;/span&gt;
  &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;GIT_DEPTH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm install -g 2ndopinion-cli&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;2ndopinion review&lt;/span&gt;
        &lt;span class="s"&gt;--base origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME&lt;/span&gt;
        &lt;span class="s"&gt;--head HEAD&lt;/span&gt;
        &lt;span class="s"&gt;--format gitlab-note&lt;/span&gt;
        &lt;span class="s"&gt;--severity-threshold medium&lt;/span&gt;
        &lt;span class="s"&gt;--comment-mr $CI_MERGE_REQUEST_IID&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CLI knows about GitLab's note format and uses &lt;code&gt;CI_JOB_TOKEN&lt;/code&gt; automatically if it's available in the environment, so you don't need to set up a separate token unless you want bot-attributed comments.&lt;/p&gt;

&lt;h2&gt;
  
  
  CircleCI
&lt;/h2&gt;

&lt;p&gt;CircleCI's config doesn't have the same first-class PR concept, but the CLI handles it. &lt;code&gt;.circleci/config.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2.1&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ai-review&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;docker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cimg/node:20.11&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;checkout&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm install -g 2ndopinion-cli&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run review&lt;/span&gt;
          &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;2ndopinion review \&lt;/span&gt;
              &lt;span class="s"&gt;--base origin/main \&lt;/span&gt;
              &lt;span class="s"&gt;--head HEAD \&lt;/span&gt;
              &lt;span class="s"&gt;--format json \&lt;/span&gt;
              &lt;span class="s"&gt;--output review.json&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;store_artifacts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;review.json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CircleCI doesn't have a native PR-comment surface, so I store the review as a build artifact and add a separate small script to POST the JSON to the GitHub PR via a personal access token. Less elegant than the GitHub Actions path, but it works.&lt;/p&gt;
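&lt;p&gt;That glue script can be tiny. Here's a hedged sketch using only the Python standard library — GitHub's issue-comments endpoint is real, but the shape of &lt;code&gt;review.json&lt;/code&gt; (a &lt;code&gt;findings&lt;/code&gt; list with &lt;code&gt;severity&lt;/code&gt; and &lt;code&gt;summary&lt;/code&gt; fields) is my assumption about the CLI's JSON output, so adjust the field names to whatever your version emits:&lt;/p&gt;

```python
import json
import urllib.request

def format_comment(findings):
    # Render the review findings (assumed shape) as a markdown comment body.
    if not findings:
        return "AI review: no findings."
    lines = [f"- [{f['severity']}] {f['summary']}" for f in findings]
    return "AI review findings:\n" + "\n".join(lines)

def post_comment(owner, repo, pr_number, body, token):
    # PR comments go through the issues API; needs a PAT with repo scope.
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{pr_number}/comments"
    req = urllib.request.Request(
        url,
        data=json.dumps({"body": body}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 201 on success

# Example of the body this would post:
demo = format_comment([{"severity": "HIGH", "summary": "Unvalidated redirect in /auth"}])
print(demo)
```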

&lt;h2&gt;
  
  
  What to do when the models disagree
&lt;/h2&gt;

&lt;p&gt;The reason multi-model review is in CI in the first place is that disagreements are signal, not noise. The CLI's default behavior on a finding where models split:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3-of-3 agree (HIGH):&lt;/strong&gt; posted as a HIGH finding, blocks merge if &lt;code&gt;--fail-on high&lt;/code&gt; is set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2-of-3 agree (HIGH):&lt;/strong&gt; posted as a HIGH finding with the dissenting model's argument attached, blocks if &lt;code&gt;--consensus-required 2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1-of-3 (HIGH):&lt;/strong&gt; posted as a NOTE-level finding with the model's argument and the other two models' counter-arguments. Never blocks. Visible to humans.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last category is the most underrated output. About 8% of our diffs produce a 1-of-3 HIGH where exactly one model is convinced something is broken and the other two say it's fine. Most of those are false positives by the lone model. But about a quarter of them — by far the most interesting quarter — are real bugs that two models missed. You don't want those silently dropped, but you also don't want them blocking merges. NOTE-level surfacing is the right answer.&lt;/p&gt;
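&lt;p&gt;If it helps to see the gating rules as code, here's a toy Python restatement of the policy above — the function name and signature are mine, not the CLI's internals:&lt;/p&gt;

```python
def should_block(severity, agreeing_models, consensus_required=2):
    # A HIGH finding gates the merge only when at least
    # `consensus_required` of the three models independently flag it.
    return severity == "HIGH" and agreeing_models >= consensus_required

# 3-of-3 and 2-of-3 HIGH block at the default threshold...
print(should_block("HIGH", 3))  # True
print(should_block("HIGH", 2))  # True
# ...while a lone dissenter or lower severity is surfaced as a note.
print(should_block("HIGH", 1))    # False
print(should_block("MEDIUM", 3))  # False
# Tightening to --consensus-required 3 drops the 2-of-3 case too.
print(should_block("HIGH", 2, consensus_required=3))  # False
```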

&lt;h2&gt;
  
  
  Cost and time, in case you're worried about either
&lt;/h2&gt;

&lt;p&gt;Median review on a typical 200-line diff: about 40 seconds wall-clock and roughly $0.06 in combined API spend across the three providers. That's wall-clock time the developer doesn't spend; it runs in parallel with the rest of the CI matrix. The cost works out to less than a tenth of what most teams pay for any single human reviewer's hour, which is the right comparison — multi-model review doesn't replace human review, it replaces the human reviewer asking "did you check for race conditions" by hand.&lt;/p&gt;

&lt;p&gt;We've seen teams skip the AI review step for files larger than 1000 lines or generated files (lockfiles, schema dumps) — &lt;code&gt;--exclude '**/*.lock'&lt;/code&gt; and &lt;code&gt;--max-diff-lines 1000&lt;/code&gt; handle both.&lt;/p&gt;




&lt;p&gt;If you want to try this on a real repo, the CLI is &lt;code&gt;npm install -g 2ndopinion-cli&lt;/code&gt; and the docs for every flag mentioned above are at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt;. The MCP server flavor (for plugging the same review engine into Claude Code or Cursor as an agent tool) is also there.&lt;/p&gt;

&lt;p&gt;We publish a weekly build-in-public update; this post is part of it. If you wire 2ndOpinion into your CI and one of your three models flags something the other two missed on a real diff, send the case over — those are the ones we use to tune the consensus thresholds.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>devops</category>
      <category>cicd</category>
    </item>
    <item>
      <title>How to Test Your AI-Built App Without Writing a Single Test</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 01 May 2026 17:07:41 +0000</pubDate>
      <link>https://dev.to/brianmello/how-to-test-your-ai-built-app-without-writing-a-single-test-b64</link>
      <guid>https://dev.to/brianmello/how-to-test-your-ai-built-app-without-writing-a-single-test-b64</guid>
      <description>&lt;p&gt;You opened Cursor. You typed "build me a booking app." Forty-five minutes later, you have something that runs. The login works. The calendar mostly works. You ship it.&lt;/p&gt;

&lt;p&gt;Then a friend tries it and the date picker goes blank on iOS. Another user finds that hitting back after a failed payment leaves the form locked. Someone else can't sign up because their email has a plus sign in it.&lt;/p&gt;

&lt;p&gt;Welcome to the gap nobody talks about in the vibe coding era: AI-built apps are easier to ship than ever, and exactly as buggy as you'd expect from code you didn't fully read. Traditional testing — the unit tests, the integration tests, the Selenium suites — assumes you have time, expertise, and patience to write them. Vibe coders have none of those. So the apps go out untested.&lt;/p&gt;

&lt;p&gt;This post is about the shift that's quietly happening: AI testing tools that act like real users, find real bugs in your AI-built app, and never ask you to write a line of test code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why traditional testing fails vibe coders
&lt;/h2&gt;

&lt;p&gt;Selenium is from 2004. Cypress and Playwright are better, but the workflow hasn't really changed: you write a script that says click this, type that, assert this. Then your AI rebuilds the navbar and your selectors break. You spend an afternoon fixing tests instead of shipping features.&lt;/p&gt;

&lt;p&gt;The friction is bad enough for full-time engineers. For someone who built their app in Bolt or Lovable over a weekend, it's a hard no. You're not going to become a QA engineer. You're going to ship and hope.&lt;/p&gt;

&lt;p&gt;There is a third option, and it turns out it's pretty obvious in hindsight: have the AI do the testing too.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shift: AI agents that test like users
&lt;/h2&gt;

&lt;p&gt;The new generation of testing tools doesn't ask you to describe what to test. It asks for a URL.&lt;/p&gt;

&lt;p&gt;You point the tool at your live app. An agent opens it, looks at the page, and behaves like a curious human. It clicks things. It fills in forms. It tries weird inputs. It tries to break the flow. It does the things you'd do if you sat down to test your own app, except it doesn't get bored after the third happy path.&lt;/p&gt;

&lt;p&gt;This is fundamentally different from script-based testing. Scripts only test what you told them to test. An agent explores. It can find the dead end you didn't know existed because you never thought to script the click that gets you there.&lt;/p&gt;

&lt;p&gt;The closest analogy is hiring a junior QA contractor who's actually thorough — except this one shows up in thirty seconds and costs less than a sandwich.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "no scripts" actually means
&lt;/h2&gt;

&lt;p&gt;When I say no scripts, I mean it literally. No selectors. No fixtures. No mocks. No test framework to install. The mental model is more like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Here's my app: testing.example.com&lt;/li&gt;
&lt;li&gt;Here's a sentence about what it does: a booking flow for a yoga studio&lt;/li&gt;
&lt;li&gt;Find the bugs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the entire input. You spend more time writing the sentence than configuring anything else.&lt;/p&gt;

&lt;p&gt;The output is a verdict. Not a failed test name and a stack trace — a plain-English description of what's broken, why it matters, and how to reproduce it. Screenshots included.&lt;/p&gt;

&lt;p&gt;For a vibe coder, this is the whole point. You don't want to learn the testing tool. You want to know if your app is shippable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three AIs walk into a courtroom
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting. A single AI agent tests your app and tells you it found three bugs. How do you know it's right? AI agents hallucinate. They confidently report "the login button doesn't work" when actually they just couldn't find it because a cookie banner was in the way.&lt;/p&gt;

&lt;p&gt;The fix is the same fix that's working everywhere else AI gets used in production: you ask more than one model, and you make them justify themselves.&lt;/p&gt;

&lt;p&gt;In 2ndOpinion's testing product, three different agents test your app independently. Then they cross-examine each other's findings, courtroom style. Did Agent A really see this bug, or did it misread the page? Can Agent B reproduce it? Does Agent C agree that the failure mode is what A says it is?&lt;/p&gt;

&lt;p&gt;What you get back is a verdict with confidence levels. The bugs that all three agents independently found are almost certainly real. The ones only one agent flagged usually aren't. This cuts the false-positive rate dramatically and saves you from chasing ghosts.&lt;/p&gt;

&lt;p&gt;If you've ever been burned by an AI tool that confidently lied to you, this is the cure. Make them argue. The truth tends to survive the argument.&lt;/p&gt;

&lt;h2&gt;
  
  
  A typical workflow
&lt;/h2&gt;

&lt;p&gt;Here's what testing an AI-built app looks like when you remove the scripting:&lt;/p&gt;

&lt;p&gt;You finish a feature in Cursor, v0, Lovable, Replit — wherever you build. You deploy it. You paste the URL into your testing tool. You add a one-line description of what the app does and which flow you care about. You hit go.&lt;/p&gt;

&lt;p&gt;A few minutes later, you have a list of issues. Not "test_login_button_failed at line 47." Something like: "If a user enters an email with a plus sign, the signup form silently fails. No error message appears, the button just stops responding. Reproduced in Chrome and Safari."&lt;/p&gt;

&lt;p&gt;You take that to your AI coding tool. You paste the bug. You ask for a fix. You redeploy. You re-run the test. You ship.&lt;/p&gt;

&lt;p&gt;The total cycle is maybe twenty minutes. Compare that to writing a single Cypress test from scratch, which also takes twenty minutes — except at the end you've covered exactly one flow, not the app.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this catches that you'd miss
&lt;/h2&gt;

&lt;p&gt;The category of bugs that AI testing finds reliably is the one vibe coders most often ship by accident.&lt;/p&gt;

&lt;p&gt;Edge cases in input handling. The plus sign in the email. The apostrophe in the last name. The phone number with a country code.&lt;/p&gt;

&lt;p&gt;Broken back buttons and refresh behavior. The state that doesn't persist. The form that reposts on refresh. The "session expired" page that has no way out.&lt;/p&gt;

&lt;p&gt;Mobile-specific weirdness. The viewport that doesn't scroll. The keyboard that covers the submit button. The autofocus that fights with the iOS keyboard.&lt;/p&gt;

&lt;p&gt;Auth flows that work for the happy path and explode otherwise. Wrong password. Expired link. Already-registered email. OAuth cancellation halfway through.&lt;/p&gt;

&lt;p&gt;These are the bugs your friends find on Twitter the day after you launch. They're also the ones a single AI rarely catches reliably, which is why the multi-agent cross-examination matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this is heading
&lt;/h2&gt;

&lt;p&gt;Two things are going to happen over the next year. First, this kind of testing becomes default. Pasting a URL and getting a verdict will feel as obvious as pasting an error into ChatGPT. Second, the tools that survive will be the ones that handle disagreement honestly — that show you when their agents argued, who won, and why.&lt;/p&gt;

&lt;p&gt;The vibe coder workflow has been bottlenecked on testing for two years. The unblocking is happening right now, and it doesn't involve learning Selenium.&lt;/p&gt;

&lt;p&gt;If you've built something in Cursor, Bolt, or Lovable and you're nervous about shipping it: that's a reasonable feeling, and the tools to act on it finally exist.&lt;/p&gt;




&lt;p&gt;If you want to try this on something you've built, &lt;a href="https://testing.get2ndopinion.dev" rel="noopener noreferrer"&gt;2ndOpinion Testing&lt;/a&gt; is the macOS desktop app I'm building for exactly this. Paste a URL, get a verdict. No scripts, no selectors, no test framework to learn.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>vibecoding</category>
      <category>productivity</category>
    </item>
    <item>
      <title>When Claude, Codex, and Gemini Disagree on the Same Code</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 24 Apr 2026 17:06:36 +0000</pubDate>
      <link>https://dev.to/brianmello/when-claude-codex-and-gemini-disagree-on-the-same-code-4cnd</link>
      <guid>https://dev.to/brianmello/when-claude-codex-and-gemini-disagree-on-the-same-code-4cnd</guid>
      <description>&lt;p&gt;When we tell people &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;2ndOpinion&lt;/a&gt; runs every pull request past Claude, Codex, and Gemini and then cross-examines the findings, the most common follow-up is: "Do they actually disagree? Or is this just three models rubber-stamping each other?"&lt;/p&gt;

&lt;p&gt;The answer, from about six months of production review logs, is that they disagree often enough to matter. Not on everything — maybe 15% of diffs — but the disagreements cluster on exactly the kinds of bugs that hurt in production: concurrency, null handling, subtle security issues, and "this works but it's going to page you at 3am" architectural smells.&lt;/p&gt;

&lt;p&gt;Here are four real cases, lightly anonymized, where the three models read the same code and came back with meaningfully different verdicts. If you're trying to decide whether multi-model review is worth the extra tokens, these are the kinds of arguments it's buying you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case 1: The async/await race that only one model saw
&lt;/h2&gt;

&lt;p&gt;The diff was a webhook handler in a Node.js payments service. Roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/webhook/stripe&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;verifySignature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findOne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duplicate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;processEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;processed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Codex&lt;/strong&gt; flagged it as a textbook race condition: two copies of the same webhook arriving within milliseconds both pass the &lt;code&gt;findOne&lt;/code&gt; check before either has written to &lt;code&gt;events&lt;/code&gt;, both run &lt;code&gt;processEvent&lt;/code&gt;, and you charge the customer twice. Recommended fix: a unique index on &lt;code&gt;id&lt;/code&gt; plus wrapping the processing in an idempotency-key pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; said the code was fine and suggested minor cleanup — extract &lt;code&gt;processEvent&lt;/code&gt; into a service, add structured logging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini&lt;/strong&gt; agreed with Codex about the race but suggested a different fix — optimistic insert first, catch the unique constraint violation, return early if duplicate. Cleaner on the happy path.&lt;/p&gt;
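&lt;p&gt;Gemini's insert-first pattern fits in a few lines. Here's a minimal Python sketch of the control flow, not the handler from the diff: the in-memory set stands in for a table with a unique index on the event id, which is what actually makes the claim atomic in a real database.&lt;/p&gt;

```python
# Sketch of the insert-first idempotency pattern. The set plays the
# role of a table with a UNIQUE index on event id; "already present"
# plays the role of catching the unique-constraint violation.
processed_ids = set()
charges = []  # the side effect we must not duplicate

def handle_webhook(event_id):
    # Claim the id BEFORE doing any work. With a real database this is
    # an INSERT that either succeeds or raises a unique violation;
    # the constraint, not this check, is what makes the claim atomic.
    if event_id in processed_ids:
        return "duplicate"
    processed_ids.add(event_id)
    charges.append(event_id)  # the non-idempotent work
    return "ok"

# Two copies of the same webhook arrive back to back.
print(handle_webhook("evt_123"))  # ok
print(handle_webhook("evt_123"))  # duplicate
print(len(charges))               # 1, customer charged once
```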

&lt;p&gt;The consensus step flagged the race because two of three models saw it. Without cross-checking, whichever single model you happened to be using would have told you the diff was either shippable or a bug — a coin flip on a payments handler.&lt;/p&gt;

&lt;p&gt;The lesson isn't that Claude is worse at concurrency. Rerun this prompt on a different day and the models trade places. The lesson is that &lt;em&gt;any&lt;/em&gt; single model has blind spots that are invisible until a different model looks at the same code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case 2: The "working" SQL that was quietly injectable
&lt;/h2&gt;

&lt;p&gt;A new internal admin endpoint, Python, roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users WHERE email ILIKE %s ORDER BY &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; DESC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Gemini&lt;/strong&gt; immediately flagged the SQL injection in the &lt;code&gt;sort&lt;/code&gt; parameter — the &lt;code&gt;%s&lt;/code&gt; parameterization protects &lt;code&gt;query&lt;/code&gt;, but &lt;code&gt;sort&lt;/code&gt; is interpolated directly into the string. An attacker who controls &lt;code&gt;sort&lt;/code&gt; can turn this into &lt;code&gt;ORDER BY (SELECT ...) DESC&lt;/code&gt; and exfiltrate data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex&lt;/strong&gt; flagged it too, with a suggested allowlist: &lt;code&gt;if sort not in {"created_at", "email", "last_login"}: raise ValueError(...)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; said the query was safe because the user parameter was parameterized — and it was technically right about &lt;code&gt;query&lt;/code&gt;, but it missed that &lt;code&gt;sort&lt;/code&gt; is a user-controllable input from the same request.&lt;/p&gt;

&lt;p&gt;This is the most dangerous kind of AI review error: confidently correct about one thing, silent on a worse thing right next to it. A single-model review that happened to land on Claude that day would have said "LGTM." The second opinion is exactly what you want for security-adjacent diffs — one model being wrong is common, two models being wrong in the same direction is rare.&lt;/p&gt;
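&lt;p&gt;For concreteness, here is the allowlist fix applied to the snippet above. A sketch only: the function name is invented, and it's restructured to return the SQL and parameters rather than execute them, so the shape of the fix is visible without a database.&lt;/p&gt;

```python
# Allowlist the ORDER BY column instead of interpolating user input.
ALLOWED_SORTS = {"created_at", "email", "last_login"}

def build_search_query(query, sort="created_at"):
    if sort not in ALLOWED_SORTS:
        raise ValueError(f"unsupported sort column: {sort!r}")
    # sort is now one of a fixed set of literals, never attacker input;
    # the search term still goes through %s parameterization as before.
    sql = f"SELECT * FROM users WHERE email ILIKE %s ORDER BY {sort} DESC"
    return sql, [f"%{query}%"]
```

&lt;p&gt;The allowlist is necessary because placeholders like &lt;code&gt;%s&lt;/code&gt; can only parameterize values, not identifiers — an &lt;code&gt;ORDER BY&lt;/code&gt; column name can't be passed as a bound parameter, so it has to be validated instead.&lt;/p&gt;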

&lt;h2&gt;
  
  
  Case 3: The memory leak that wasn't
&lt;/h2&gt;

&lt;p&gt;Sometimes consensus is wrong and the outlier is right. React component, roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="nf"&gt;useEffect&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onmessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setMessages&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; and &lt;strong&gt;Gemini&lt;/strong&gt; both flagged a missing cleanup of the &lt;code&gt;onmessage&lt;/code&gt; handler and warned about a memory leak if the component re-mounted rapidly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex&lt;/strong&gt; pushed back — because &lt;code&gt;ws&lt;/code&gt; is created inside the effect and &lt;code&gt;ws.close()&lt;/code&gt; is called in the cleanup, nothing retains the socket after unmount, so the socket and its &lt;code&gt;onmessage&lt;/code&gt; handler become garbage-collectible together. The handler doesn't need explicit removal. The two-of-three majority was wrong; the outlier was right.&lt;/p&gt;

&lt;p&gt;This is where our cross-examination step earns its keep. Instead of defaulting to "majority wins," the consensus layer asks the dissenting model to defend its position, then asks the other two to respond. In this case Codex explained the GC behavior, Claude acknowledged the correction, and the final verdict downgraded the finding from "bug" to "stylistic nit."&lt;/p&gt;

&lt;p&gt;If you only run majority voting, you get the wrong answer on cases like this. If you run proper cross-examination, you get the right answer &lt;em&gt;and&lt;/em&gt; the reasoning, which is how engineers actually build trust in AI review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case 4: The Rust borrow checker dispute
&lt;/h2&gt;

&lt;p&gt;A small but contentious one. The diff refactored a hot path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Processed&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="nf"&gt;.iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
        &lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Codex&lt;/strong&gt; flagged the &lt;code&gt;.clone()&lt;/code&gt; as wasteful and suggested taking &lt;code&gt;items&lt;/code&gt; by value and using &lt;code&gt;into_iter()&lt;/code&gt; to move instead of clone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini&lt;/strong&gt; agreed with the performance critique but added a nuance — if &lt;code&gt;Item&lt;/code&gt; contains anything expensive to clone (like an &lt;code&gt;Arc&amp;lt;Mutex&amp;lt;T&amp;gt;&amp;gt;&lt;/code&gt;), the clone is specifically what you don't want in a hot path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude&lt;/strong&gt; defended the clone. Its argument: if &lt;code&gt;transform&lt;/code&gt; is defined for &lt;code&gt;&amp;amp;Item&lt;/code&gt; in the existing codebase and changing it breaks fifteen other callers, the clone is the minimal-risk change. "Optimal" and "mergeable" are different targets.&lt;/p&gt;

&lt;p&gt;None of the three models was wrong. They were optimizing for different objectives, which is a pattern we see constantly — performance versus maintainability, correctness versus velocity, local improvement versus blast-radius. Multi-model review surfaces that there &lt;em&gt;is&lt;/em&gt; a tradeoff rather than presenting one model's preferred answer as The Answer. That's usually more useful than a confident single verdict.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we do with the disagreements
&lt;/h2&gt;

&lt;p&gt;The short version of the product: every review goes to all three models in parallel. Findings that all three agree on are high-confidence and reported first. Findings where models disagree trigger a cross-examination round where each model sees the others' output and gets a chance to revise. Anything still contested is surfaced to the human reviewer with the full argument attached, rather than hidden behind a single "LGTM."&lt;/p&gt;

&lt;p&gt;That last part is the one most people underestimate. You don't want the AI to resolve every disagreement — some disagreements are the signal. A human reviewer who sees "Claude says ship, Gemini says block, here's why" makes a better decision than one who sees a single-model verdict in either direction.&lt;/p&gt;




&lt;p&gt;If your team is running code review with one model and wondering what the second opinion would say, that's the whole pitch. Install the CLI with &lt;code&gt;npm i -g 2ndopinion-cli&lt;/code&gt;, run &lt;code&gt;2ndopinion review&lt;/code&gt;, and see where your models actually disagree. Or wire it into Claude Code / Cursor as an MCP server — docs at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We publish a weekly build-in-public update, and this post is part of it. If you have a case where two AI reviewers disagreed on your code and you're curious what a third would say, send it over — the weird diffs are the fun ones.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>devtools</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>How Smart Model Routing Picks the Right AI for Your Programming Language</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 17 Apr 2026 17:06:47 +0000</pubDate>
      <link>https://dev.to/brianmello/how-smart-model-routing-picks-the-right-ai-for-your-programming-language-2jog</link>
      <guid>https://dev.to/brianmello/how-smart-model-routing-picks-the-right-ai-for-your-programming-language-2jog</guid>
      <description>&lt;p&gt;The dirty secret of AI code review is that there is no single "best" model. There are only models that happen to be good at the specific thing you're asking them to do right now.&lt;/p&gt;

&lt;p&gt;I learned this the hard way while building &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;2ndOpinion&lt;/a&gt;, an AI code review tool where Claude, Codex, and Gemini cross-check each other's work over MCP. The first version hard-coded one model for every review. The reviews were fine for JavaScript. They were embarrassing for Rust. They were weirdly confident and wrong for SQL.&lt;/p&gt;

&lt;p&gt;So we stopped picking a single model and started routing. This post is about how that routing layer actually works — what signals we collect, how the scoring math plays out, and what surprised us when we shipped it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: model strength is language-specific
&lt;/h2&gt;

&lt;p&gt;When we started tracking per-language accuracy, a clear pattern emerged. If you take the same corpus of reviewed pull requests and score each model on catching real bugs versus hallucinating issues that don't exist, you don't get a uniform leaderboard. You get something that looks more like rock-paper-scissors.&lt;/p&gt;

&lt;p&gt;One model might be excellent at spotting async/await footguns in TypeScript but completely miss lifetime issues in Rust. Another might be phenomenal at Python decorator patterns and hopeless at Go's error handling conventions. A third might crush Terraform drift detection but flag perfectly valid Kubernetes manifests as "probably wrong."&lt;/p&gt;

&lt;p&gt;This isn't a flaw in any particular model. It's a consequence of training data distribution, RLHF feedback, and the fact that "code" is actually hundreds of very different specialties wearing the same trench coat. Treating "AI code review" as one problem and picking one winner leaves performance on the table for every language that isn't the winner's strong suit.&lt;/p&gt;

&lt;h2&gt;
  
  
  What &lt;code&gt;--llm auto&lt;/code&gt; actually does
&lt;/h2&gt;

&lt;p&gt;When you run a review with no model flag, the CLI calls the &lt;code&gt;auto&lt;/code&gt; router:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli
2ndopinion review src/auth.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood, &lt;code&gt;--llm auto&lt;/code&gt; is the default. It takes three inputs — language, change type, and file size — and picks a model. Here's the Python SDK equivalent, which exposes the same router via a keyword argument:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;secondopinion&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;opinion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/auth.ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;typescript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# route based on accuracy data
&lt;/span&gt;    &lt;span class="n"&gt;change_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bugfix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# optional hint
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;review_metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_used&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# which model actually ran
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;review_metadata&lt;/code&gt; object is the important part for debugging. Every response tells you which model was picked and why, along with token counts and duration. If you want reproducibility, pin the model explicitly; if you want the best review for this specific request, let the router decide.&lt;/p&gt;

&lt;h2&gt;
  
  
  The signals that feed the router
&lt;/h2&gt;

&lt;p&gt;There are four signals we weight, in roughly this order:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language.&lt;/strong&gt; Detected from file extension, shebang, or an explicit &lt;code&gt;language=&lt;/code&gt; argument in the SDK. This is the dominant signal because accuracy variance between models on a given language is much larger than variance on other dimensions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change type.&lt;/strong&gt; A new-feature diff has different review priorities than a bugfix or a refactor. Security-sensitive file paths (&lt;code&gt;auth/&lt;/code&gt;, &lt;code&gt;crypto/&lt;/code&gt;, anything matching a configurable allowlist) bump a security-audit weight into the decision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File size and diff size.&lt;/strong&gt; Very large files get routed to models with bigger effective context windows. Small targeted diffs can go to faster models without losing accuracy — no point paying for a heavyweight review of a three-line typo fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern memory.&lt;/strong&gt; If we've seen a similar bug pattern in this repo before, we bias toward the model that caught the original. This is a small effect per review, but over a project's lifetime it adds up, because teams tend to re-introduce the same class of bug in different forms.&lt;/p&gt;

&lt;p&gt;The scoring itself is embarrassingly simple. For each candidate model we compute a weighted sum from the accuracy table and pick the highest. It's not a neural net. It's not an LLM picking another LLM. It's a lookup table and a weighted argmax. We tried fancier approaches and they kept losing to the lookup table, which turns out to be the honest answer in most ML-system stories.&lt;/p&gt;
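&lt;p&gt;A toy version of that weighted argmax looks like this; the model names in the table, the accuracy values, and the security bonuses are all made up for illustration, not our production data:&lt;/p&gt;

```python
# Per-(model, language) accuracy table plus a signal bump, then argmax.
# All numbers below are illustrative placeholders.
ACCURACY = {
    ("claude", "typescript"): 0.86,
    ("codex", "typescript"): 0.85,
    ("gemini", "typescript"): 0.78,
}
SECURITY_BONUS = {"claude": 0.02, "codex": 0.05, "gemini": 0.04}

def route(language, security_sensitive=False):
    def score(model):
        s = ACCURACY.get((model, language), 0.0)
        if security_sensitive:
            s += SECURITY_BONUS.get(model, 0.0)
        return s
    # A lookup table and a weighted argmax, nothing fancier.
    return max(("claude", "codex", "gemini"), key=score)

print(route("typescript"))                           # claude (0.86)
print(route("typescript", security_sensitive=True))  # codex (0.90 beats 0.88)
```

&lt;p&gt;The point of the example: flipping one signal changes the winner, which is exactly the behavior a single hard-coded model can't give you.&lt;/p&gt;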

&lt;h2&gt;
  
  
  Where the accuracy data comes from
&lt;/h2&gt;

&lt;p&gt;A router is only as good as the data behind it. Ours comes from three places.&lt;/p&gt;

&lt;p&gt;First, offline evals. We maintain a set of benchmark repos per language with known bugs — either ones we inject, or historical CVEs replayed on the vulnerable commit. Every model gets scored on "did you catch this specific bug" and "did you flag something that wasn't actually a problem."&lt;/p&gt;

&lt;p&gt;Second, production telemetry. When a user accepts or rejects a finding via &lt;code&gt;2ndopinion fix&lt;/code&gt; or the GitHub PR agent, that's a signal. Rejected findings that were later confirmed as real bugs (via a follow-up commit or a revert) are gold. We only aggregate feedback, never store code — that's a hard constraint baked into the pipeline.&lt;/p&gt;

&lt;p&gt;Third, consensus disagreements. When you run a consensus review, three models vote. Disagreements are interesting because they surface cases where one model sees a bug the others miss. Over time, the model that's consistently right on disagreements gets weighted higher for that language.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Three-model consensus review — the source of a lot of our training signal&lt;/span&gt;
2ndopinion review src/auth.ts &lt;span class="nt"&gt;--consensus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three credits, one command. The confidence-weighted aggregator takes the three reviews, collapses duplicate findings, and ranks by agreement. High-agreement findings surface first; disagreements get flagged explicitly so a human can adjudicate.&lt;/p&gt;
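&lt;p&gt;A stripped-down version of that aggregation step (the real duplicate matcher is fuzzier than an exact key, so treat this as a sketch):&lt;/p&gt;

```python
from collections import defaultdict

def aggregate(reviews):
    """Collapse duplicate findings across models and rank by agreement.

    reviews: {model_name: [finding_dict, ...]} where each finding has
    file, line, and category keys. Deduping on that exact triple is a
    simplifying assumption.
    """
    grouped = defaultdict(set)
    for model, findings in reviews.items():
        for f in findings:
            grouped[(f["file"], f["line"], f["category"])].add(model)
    # High-agreement findings first; anything short of unanimity is
    # flagged as disputed so a human can adjudicate.
    ranked = sorted(grouped.items(), key=lambda kv: -len(kv[1]))
    return [
        {"finding": key,
         "models": sorted(models),
         "disputed": len(models) != len(reviews)}
        for key, models in ranked
    ]
```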

&lt;h2&gt;
  
  
  A concrete example: routing a TypeScript auth change
&lt;/h2&gt;

&lt;p&gt;Say you run this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion review src/auth/session.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The router sees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Language: TypeScript (file extension and tsconfig detected)&lt;/li&gt;
&lt;li&gt;Change type: bugfix (detected from git diff — a returned value was modified, no new exports)&lt;/li&gt;
&lt;li&gt;File size: 240 lines&lt;/li&gt;
&lt;li&gt;Path signal: &lt;code&gt;auth/&lt;/code&gt; → security-sensitive bump&lt;/li&gt;
&lt;li&gt;Pattern memory: this repo had a session-fixation issue three months ago&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The router weights the security-sensitive bump and biases toward whichever model has the strongest track record on auth/session TypeScript bugs in our accuracy table. It runs that single model at three-credits-equivalent depth, returns a review, and the &lt;code&gt;review_metadata&lt;/code&gt; field on the response tells you exactly which model was chosen so you can audit the decision.&lt;/p&gt;

&lt;p&gt;If any of those signals flip — different language, a new-feature diff, no security-sensitive path — you'd get a different model. That's the whole point.&lt;/p&gt;
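&lt;p&gt;The signal-extraction half of that example can be sketched in a few heuristics. The detectors below are illustrative stand-ins, not the production ones:&lt;/p&gt;

```python
# Illustrative signal extraction: file extension to language, path to
# security sensitivity, and a crude diff heuristic for change type.
SECURITY_PATHS = ("auth/", "crypto/", "payments/")
EXT_TO_LANG = {".ts": "typescript", ".py": "python", ".go": "go"}

def extract_signals(path, added_lines, repo_bug_history=()):
    ext = path[path.rfind("."):]
    new_exports = any("export" in line for line in added_lines)
    return {
        "language": EXT_TO_LANG.get(ext, "unknown"),
        "security_sensitive": any(p in path for p in SECURITY_PATHS),
        # modified returns with no new exports reads as a bugfix
        "change_type": "feature" if new_exports else "bugfix",
        "pattern_memory": "session-fixation" in repo_bug_history,
    }
```

&lt;p&gt;Run on the example above, it yields language typescript, security_sensitive true, change_type bugfix, pattern_memory true: exactly the signal vector the router then scores.&lt;/p&gt;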

&lt;h2&gt;
  
  
  What surprised us
&lt;/h2&gt;

&lt;p&gt;Two things.&lt;/p&gt;

&lt;p&gt;First, the router made the marginal model matter. We used to think of models as tiered — a "best" one, a "good enough" one, a "cheap one for trivial stuff." Once we started routing on language-specific accuracy, the hierarchy collapsed. Models we'd written off as second-tier turned out to dominate specific slices. There is no tier list. There are just specialties.&lt;/p&gt;

&lt;p&gt;Second, the router made consensus more valuable, not less. You'd think smart routing would make consensus redundant — why run three models if one is already the best? In practice, consensus is where the router learns. Every disagreement is a labeled data point about where the router's current guess is wrong. We run consensus on a sampled slice of reviews partly to keep the accuracy table fresh.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;If you're building anything on top of LLMs, the lesson generalizes past code review: "which model is best" is the wrong question. The right question is "which model is best for this specific request, given what I know about it." Build a router, not a leaderboard.&lt;/p&gt;

&lt;p&gt;If you want to see smart model routing in action, the fastest way is to install the CLI and run a review; the router picks the model for you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the CLI and run a review&lt;/span&gt;
npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli
2ndopinion review src/your-file.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or try the playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; and watch the &lt;code&gt;model_used&lt;/code&gt; field on the response. You can force a specific model with &lt;code&gt;--llm claude&lt;/code&gt;, &lt;code&gt;--llm codex&lt;/code&gt;, or &lt;code&gt;--llm gemini&lt;/code&gt; to see how the same code gets reviewed differently — which is the fastest way to internalize why routing matters in the first place.&lt;/p&gt;

&lt;p&gt;If you've built a routing layer for a different ML-backed product, I'd love to hear what signals ended up mattering most. Drop a comment — I'm especially curious about people who tried fancier approaches before collapsing back to a lookup table.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>programming</category>
    </item>
    <item>
      <title>What 10 Versions of an AI Code Review CLI Taught Me About Developer UX</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 10 Apr 2026 17:12:18 +0000</pubDate>
      <link>https://dev.to/brianmello/what-10-versions-of-an-ai-code-review-cli-taught-me-about-developer-ux-1301</link>
      <guid>https://dev.to/brianmello/what-10-versions-of-an-ai-code-review-cli-taught-me-about-developer-ux-1301</guid>
      <description>&lt;p&gt;You don't learn how developers think by reading docs. You learn by shipping something, watching it fail, and shipping it again.&lt;/p&gt;

&lt;p&gt;I've been building &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;2ndOpinion&lt;/a&gt;, an AI code review tool where multiple models — Claude, Codex, Gemini — cross-check each other's reviews. Over the past few months and ten CLI versions, I've rewritten the developer experience more times than I'd like to admit. Here's what actually stuck.&lt;/p&gt;

&lt;h2&gt;
  
  
  Version 1: The "Just Ship It" Phase
&lt;/h2&gt;

&lt;p&gt;The first version of the CLI did exactly one thing: send your code to three AI models and print their reviews. It worked. Technically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx 2ndopinion-cli &lt;span class="nt"&gt;--file&lt;/span&gt; src/auth.ts &lt;span class="nt"&gt;--models&lt;/span&gt; claude,codex,gemini &lt;span class="nt"&gt;--format&lt;/span&gt; json &lt;span class="nt"&gt;--output&lt;/span&gt; review.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four flags to get a single review. Every run required you to specify the file, models, format, and output. Nobody wants to think that hard before getting feedback on their code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: If your CLI needs a manual, you've already lost.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Smart Default" Breakthrough
&lt;/h2&gt;

&lt;p&gt;The single biggest improvement wasn't a feature — it was removing decisions. Version 0.5.0 introduced one command that just works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion review src/auth.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The tool auto-detects your language, picks the best models for that language based on real accuracy data, and prints a formatted review. No flags required.&lt;/p&gt;

&lt;p&gt;Downloads jumped immediately. Not because the tool got more powerful — it got simpler.&lt;/p&gt;

&lt;p&gt;Behind the scenes, &lt;code&gt;--llm auto&lt;/code&gt; routes your code to whichever models perform best for your specific language. TypeScript reviews go to different models than Python reviews, because we track which models actually catch bugs in each language. But the developer doesn't need to know any of that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Feedback That Changed Everything
&lt;/h2&gt;

&lt;p&gt;A developer tried 2ndOpinion and told me: "I got my review. Now what?"&lt;/p&gt;

&lt;p&gt;That question haunted me. Getting a list of issues is step one. But developers don't want a report — they want their code to be better. So I built &lt;code&gt;fix&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion fix src/auth.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One command. It reviews your code, identifies the issues, generates fixes, and applies them. You can review the diff before accepting. The entire loop from "something's wrong" to "it's fixed" happens in your terminal.&lt;/p&gt;

&lt;p&gt;Then came &lt;code&gt;watch&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion watch src/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Continuous monitoring. Save a file, get a review. Like having a pair programmer who never takes a break and never gets passive-aggressive about your variable names.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson: The best developer tool is the one that closes the loop. Don't hand developers a problem — hand them a solution.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Multi-Model Insight Nobody Asked For
&lt;/h2&gt;

&lt;p&gt;Here's something I didn't expect: individual AI models are unreliable in predictable ways. Claude is excellent at architectural reasoning but sometimes misses edge cases in error handling. Codex catches implementation bugs that Claude misses. Gemini often spots performance issues the others overlook.&lt;/p&gt;

&lt;p&gt;No single model is "the best." But three models reviewing the same code? They catch what each other misses. That's the core thesis of 2ndOpinion — consensus-based review.&lt;/p&gt;

&lt;p&gt;When all three models agree something is a problem, the confidence is high. When they disagree, that's where the interesting conversations happen. We built a confidence-weighted system that surfaces high-agreement issues first and flags disagreements for human review.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;--consensus&lt;/code&gt; flag makes this explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2ndopinion review &lt;span class="nt"&gt;--consensus&lt;/span&gt; src/auth.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three models review in parallel. You get a unified report with confidence scores. Three credits, one command, and a review that's more thorough than any single model could produce.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Got Wrong About Developer UX
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I over-indexed on power users.&lt;/strong&gt; Early versions had flags for everything: model selection, temperature, output format, verbosity levels, custom prompts. Power users loved it. Everyone else bounced.&lt;/p&gt;

&lt;p&gt;The fix was layered complexity. The default command (&lt;code&gt;2ndopinion&lt;/code&gt;) requires zero configuration. Power users can add flags to customize. But the first experience is frictionless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I underestimated CI/CD.&lt;/strong&gt; Developers don't just run tools locally — they run them in pipelines. Version 0.10.0 added &lt;code&gt;--ci&lt;/code&gt;, &lt;code&gt;--json&lt;/code&gt;, and &lt;code&gt;--plain&lt;/code&gt; flags specifically for non-interactive environments. It sounds obvious in retrospect, but I spent months building interactive terminal UI before realizing half my users needed the opposite.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In your GitHub Actions workflow&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--pr&lt;/span&gt; &lt;span class="nv"&gt;$PR_NUMBER&lt;/span&gt; &lt;span class="nt"&gt;--ci&lt;/span&gt; &lt;span class="nt"&gt;--json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;I ignored the "try before you buy" instinct.&lt;/strong&gt; Developers don't sign up for things. They install them, try them, and decide in under 60 seconds. The free playground on &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; — no signup required — exists because I watched too many developers hit a registration wall and leave.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next: The Skills Marketplace
&lt;/h2&gt;

&lt;p&gt;The most surprising thing I've learned is that every team has domain-specific review needs. A fintech team cares about different patterns than a game studio. A team migrating from Python 2 to 3 needs a completely different lens.&lt;/p&gt;

&lt;p&gt;So we're building a skills marketplace where developers can create custom audit skills — specialized review logic for specific domains — and sell them. Creators earn 70% of revenue. It turns tribal knowledge into something shareable and monetizable.&lt;/p&gt;

&lt;p&gt;Think of it as npm for code review intelligence. Someone who's spent five years dealing with Django security footguns can package that knowledge into a skill that catches those issues for every Django developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;Ten versions in, the biggest lesson is this: &lt;strong&gt;developer tools win on defaults, not features.&lt;/strong&gt; Every flag you add is a decision you're asking the developer to make. Every decision is friction. Every bit of friction is a reason to close the terminal and move on.&lt;/p&gt;

&lt;p&gt;If you're building developer tools, here's my checklist: Does the zero-config experience work? Does the tool close the loop (find problem → fix problem)? Can it run in CI without modification? Can someone try it in under 60 seconds?&lt;/p&gt;

&lt;p&gt;If you want to try multi-model AI code review, the CLI is one install away:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli
2ndopinion review your-file.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or try the playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; — no signup, no credit card, just paste code and see what three AI models think.&lt;/p&gt;

&lt;p&gt;I'd love to hear what you've learned building developer tools. What UX lessons took you the longest to figure out? Drop a comment below.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>devtools</category>
      <category>buildinpublic</category>
    </item>
    <item>
      <title>Single-Model vs Multi-Model AI Code Review: What I Learned Running Both</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Fri, 03 Apr 2026 17:07:37 +0000</pubDate>
      <link>https://dev.to/brianmello/single-model-vs-multi-model-ai-code-review-what-i-learned-running-both-2i22</link>
      <guid>https://dev.to/brianmello/single-model-vs-multi-model-ai-code-review-what-i-learned-running-both-2i22</guid>
      <description>&lt;p&gt;I've been obsessing over AI code review for the last year. Not because I think AI will replace code review — I don't — but because I think most developers are leaving a lot of quality signal on the table by using AI review the wrong way.&lt;/p&gt;

&lt;p&gt;Here's the thing nobody talks about: &lt;strong&gt;a single AI model is confidently wrong surprisingly often.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not maliciously wrong. Not obviously wrong. Just... plausible-sounding wrong. It'll flag a false positive, miss a real bug, or give you a high-confidence "looks good" on code that has a subtle race condition. And because the model sounds so sure of itself, you accept it and move on.&lt;/p&gt;

&lt;p&gt;I learned this the hard way. Then I started running multi-model consensus review instead, and it changed my whole mental model of what AI code review should look like.&lt;/p&gt;

&lt;p&gt;Here's what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With Single-Model Review
&lt;/h2&gt;

&lt;p&gt;When you pipe code through one model — say, Claude or GPT-4 — you get a single "opinion." That opinion is shaped by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model's training data distribution&lt;/li&gt;
&lt;li&gt;Whatever biases crept in during RLHF&lt;/li&gt;
&lt;li&gt;The specific prompt you used&lt;/li&gt;
&lt;li&gt;The model's current context window state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of those factors are visible to you as the reviewer. You just get a confident-sounding output and have to decide how much to trust it.&lt;/p&gt;

&lt;p&gt;I started noticing patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt; tends to be excellent at spotting architectural smell and async/await patterns. It's more conservative — it'll point out potential issues even when they're not certain bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4 / Codex&lt;/strong&gt; is better at catching common idiom violations and tends to give more opinionated style feedback. It's more decisive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt; has surprisingly strong instincts around security patterns and type safety, particularly in typed languages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't a knock on any model. They're just different lenses. And here's the thing: &lt;strong&gt;a bug that one model misses, another often catches.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Running the Same Code Through Both Approaches
&lt;/h2&gt;

&lt;p&gt;I took a production Node.js service — about 2,000 lines across 12 files — and ran it two ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Approach 1: Single-model review (just Claude)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install the CLI&lt;/span&gt;
npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli

&lt;span class="c"&gt;# Review with a single model&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--llm&lt;/span&gt; claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Approach 2: Multi-model consensus (Claude + Codex + Gemini in parallel)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Use consensus mode — 3 models, confidence-weighted&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--consensus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The single-model pass found &lt;strong&gt;14 issues&lt;/strong&gt;: 9 flagged as medium severity, 3 high, 2 low. Took about 8 seconds.&lt;/p&gt;

&lt;p&gt;The consensus pass found &lt;strong&gt;19 issues&lt;/strong&gt;: same 14, plus 5 more. Three of those 5 were real bugs I later confirmed in prod logs.&lt;/p&gt;

&lt;p&gt;But here's the part that matters more than the raw numbers:&lt;/p&gt;

&lt;p&gt;The consensus pass also &lt;strong&gt;filtered out 4 false positives&lt;/strong&gt; that Claude had flagged with high confidence. Those were caught because Codex and Gemini both disagreed — and when 2 out of 3 models say "this is fine," the confidence weight pulls the verdict away from "issue."&lt;/p&gt;




&lt;h2&gt;
  
  
  How Confidence-Weighted Consensus Works
&lt;/h2&gt;

&lt;p&gt;The naive approach to multi-model review would be simple majority voting: if 2 of 3 models say something is a bug, call it a bug. That's better than nothing, but it treats all models as equally reliable on all tasks.&lt;/p&gt;

&lt;p&gt;Confidence-weighted consensus is smarter. Each model reports not just &lt;em&gt;what&lt;/em&gt; it found, but &lt;em&gt;how confident&lt;/em&gt; it is. The final verdict weights those signals proportionally.&lt;/p&gt;

&lt;p&gt;So if Claude says "potential null dereference, high confidence" and Codex says "looks fine, medium confidence," the system doesn't just flip a coin. It weights Claude's high-confidence flag more heavily than Codex's medium-confidence dismissal.&lt;/p&gt;
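&lt;p&gt;In rough pseudocode terms (the 50% threshold and the confidence numbers are assumptions for illustration, not the production logic):&lt;/p&gt;

```python
def verdict(opinions):
    """Confidence-weighted vote on a single finding.

    opinions: list of (says_bug, confidence) pairs, one per model.
    Returns a label plus the weighted agreement score.
    """
    weight_for = sum(conf for says_bug, conf in opinions if says_bug)
    total = sum(conf for _, conf in opinions)
    agreement = weight_for / total
    label = "issue" if agreement > 0.5 else "ok"
    return label, round(agreement, 2)

# Claude's high-confidence flag outweighs Codex's medium-confidence
# dismissal, so the finding survives.
print(verdict([(True, 0.9), (False, 0.6)]))  # -> ('issue', 0.6)
```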

&lt;p&gt;In practice, this means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unanimous findings&lt;/strong&gt; → almost certainly real, shown at the top&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2/3 agreement, high confidence&lt;/strong&gt; → likely real, worth investigating&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1/3 agreement, low confidence from the flagging model&lt;/strong&gt; → deprioritized, often noise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Divergent high-confidence opinions&lt;/strong&gt; → flagged as a "debate" item worth human judgment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's what that looks like with the Python SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;secondopinion&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;

&lt;span class="c1"&gt;# Run consensus review
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;consensus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;server.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  Models agreeing: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[94%] HIGH: Unhandled promise rejection in processWebhook()
  Models agreeing: claude, codex, gemini

[71%] MEDIUM: Missing input validation on userId parameter
  Models agreeing: claude, gemini

[38%] LOW: Variable name 'data' is ambiguous
  Models agreeing: codex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That 38% finding? Probably noise. The 94% finding? Drop everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Single-Model Review Is Still Fine
&lt;/h2&gt;

&lt;p&gt;I want to be fair here. Single-model review isn't bad — it's just different.&lt;/p&gt;

&lt;p&gt;For fast iteration during development, single-model is great. You're not trying to catch every bug; you're trying to get quick feedback while the code is fresh. Running &lt;code&gt;2ndopinion watch&lt;/code&gt; gives you that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Continuous monitoring — single model, fast feedback loop&lt;/span&gt;
2ndopinion watch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For code that's about to merge to main — especially anything touching auth, payments, or data pipelines — the consensus pass is worth the extra 10-15 seconds and the 2 additional credits.&lt;/p&gt;

&lt;p&gt;The mental model I've landed on: &lt;strong&gt;single-model for development velocity, consensus for pre-merge quality gates.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Deeper Lesson: Models Have Blind Spots
&lt;/h2&gt;

&lt;p&gt;The thing I didn't fully appreciate before building multi-model review into my workflow: AI models have systematic blind spots, not random ones.&lt;/p&gt;

&lt;p&gt;If Claude misses a certain class of bug, it tends to &lt;em&gt;consistently&lt;/em&gt; miss that class. It's not a random error — it's a bias in how the model was trained. That means if you only ever use Claude, you'll ship the same categories of bugs repeatedly without ever knowing they're being systematically missed.&lt;/p&gt;

&lt;p&gt;Multi-model consensus surfaces those blind spots by triangulating from different vantage points. It's the same reason we have human code reviewers with different backgrounds look at the same PR.&lt;/p&gt;

&lt;p&gt;One model trained heavily on Python might under-weight JavaScript async patterns. Another trained on a lot of library code might be overly conservative about application-layer error handling. When you combine them, the idiosyncrasies average out.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;If you want to see this difference yourself, there's a free playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; — no signup required. Paste your code, run both modes, and compare the outputs side by side.&lt;/p&gt;

&lt;p&gt;Or install the CLI and try it on your own codebase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli

&lt;span class="c"&gt;# Single model&lt;/span&gt;
2ndopinion review

&lt;span class="c"&gt;# Consensus (3 models, confidence-weighted)&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--consensus&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first time you see a consensus pass catch something a single-model review confidently missed, you'll get it. That's the moment it clicked for me.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;2ndOpinion is a multi-model AI code review tool. Claude, Codex, and Gemini cross-check each other's findings via MCP, CLI, Python SDK, REST API, and GitHub PR Agent. Free playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codequality</category>
      <category>codereview</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to Add Multi-Model AI Code Review to Claude Code in 30 Seconds</title>
      <dc:creator>Brian Mello</dc:creator>
      <pubDate>Thu, 02 Apr 2026 23:16:45 +0000</pubDate>
      <link>https://dev.to/brianmello/how-to-add-multi-model-ai-code-review-to-claude-code-in-30-seconds-4aoe</link>
      <guid>https://dev.to/brianmello/how-to-add-multi-model-ai-code-review-to-claude-code-in-30-seconds-4aoe</guid>
      <description>&lt;p&gt;You know that moment when Claude reviews your code, gives it the green light, and then two days later you're debugging a production issue that &lt;em&gt;three humans&lt;/em&gt; would have caught immediately?&lt;/p&gt;

&lt;p&gt;Single-model AI code review has a blind spot problem. Each model was trained on different data, has different failure modes, and holds different opinions about what "correct" looks like. When you only ask one AI, you're getting one perspective — and that perspective has systematic gaps.&lt;/p&gt;

&lt;p&gt;Multi-model consensus code review flips the script. Instead of trusting one AI, you get Claude, GPT-4o, and Gemini to cross-check each other. Where all three agree, you can be confident. Where they diverge, &lt;em&gt;that's&lt;/em&gt; where you need to look closer.&lt;/p&gt;

&lt;p&gt;Here's how to set it up in Claude Code in about 30 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Single-Model Review
&lt;/h2&gt;

&lt;p&gt;Let me be direct: single-model AI code review is better than nothing. But it has a fundamental flaw — the model doesn't know what it doesn't know.&lt;/p&gt;

&lt;p&gt;I ran an experiment last month. I fed the same set of 50 bugs to Claude, GPT-4o, and Gemini separately. Each model caught some bugs the others missed. GPT-4o was better at certain Python anti-patterns. Gemini caught more async/concurrency issues. Claude excelled at security-related edge cases.&lt;/p&gt;

&lt;p&gt;No model caught everything. But when I used all three in consensus mode? Coverage went up significantly.&lt;/p&gt;

&lt;p&gt;This is the case for multi-model AI code review — it's not about any single model being bad, it's about combining strengths.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up 2ndOpinion via MCP in 30 Seconds
&lt;/h2&gt;

&lt;p&gt;2ndOpinion is an AI-to-AI communication platform that routes your code to multiple models simultaneously and returns a confidence-weighted consensus. It plugs into Claude Code via MCP.&lt;/p&gt;

&lt;p&gt;Here's the config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2ndopinion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2ndopinion-mcp"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"SECONDOPINION_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-api-key-here"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Drop that into your Claude Code MCP config file (usually &lt;code&gt;~/.claude/mcp_config.json&lt;/code&gt;), restart Claude Code, and you're done. No extra dependencies. No separate process to run.&lt;/p&gt;
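&lt;p&gt;If you already have other MCP servers registered, you'll want to merge rather than paste over the whole file. A small helper like this — hypothetical, not something 2ndOpinion ships — does it non-destructively:&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path

def add_mcp_server(config_path: Path, name: str, entry: dict) -> dict:
    # Read the existing config if there is one, add or replace just this
    # one server entry, and write the file back -- other servers survive.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("mcpServers", {})[name] = entry
    config_path.write_text(json.dumps(config, indent=2))
    return config

# Demo against a temp file; point config_path at your real MCP config
# (e.g. ~/.claude/mcp_config.json) for actual use.
demo_path = Path(tempfile.mkdtemp()) / "mcp_config.json"
cfg = add_mcp_server(demo_path, "2ndopinion", {
    "command": "npx",
    "args": ["-y", "2ndopinion-mcp"],
    "env": {"SECONDOPINION_API_KEY": "your-api-key-here"},
})
```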

&lt;p&gt;Once it's wired up, you have access to these tools directly inside Claude Code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;review&lt;/code&gt;&lt;/strong&gt; — standard multi-model code review (2 credits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;consensus&lt;/code&gt;&lt;/strong&gt; — parallel review from 3 models with confidence weighting (3 credits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;debate&lt;/code&gt;&lt;/strong&gt; — multi-round AI debate for architecture decisions (5–7 credits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;bug_hunt&lt;/code&gt;&lt;/strong&gt; — targeted bug detection sweep&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;security_audit&lt;/code&gt;&lt;/strong&gt; — security-focused review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't have to pick a model by hand, either: the &lt;code&gt;--llm auto&lt;/code&gt; flag routes each review to the best model for your language, based on real accuracy data.&lt;/p&gt;

&lt;h2&gt;Running Your First Consensus Review&lt;/h2&gt;

&lt;p&gt;Once the MCP is connected, you can trigger a review in plain English inside Claude Code:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Run a consensus code review on this file."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or you can use the CLI directly if you prefer the terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install globally&lt;/span&gt;
npm i &lt;span class="nt"&gt;-g&lt;/span&gt; 2ndopinion-cli

&lt;span class="c"&gt;# Review a specific file&lt;/span&gt;
2ndopinion review src/auth/token-validator.ts

&lt;span class="c"&gt;# Full consensus (3 models in parallel)&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--consensus&lt;/span&gt; src/auth/token-validator.ts

&lt;span class="c"&gt;# Watch mode — auto-review on every save&lt;/span&gt;
2ndopinion watch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The consensus output tells you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Where all three models agree&lt;/strong&gt; — high-confidence issues; fix these immediately&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where two out of three agree&lt;/strong&gt; — worth a look, especially for complex logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where models disagree&lt;/strong&gt; — the most interesting category; often means an ambiguous design tradeoff&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That last category is my favorite. When GPT-4o says "this is fine" and Claude says "this will blow up under load" — that's a signal to dig in, not dismiss.&lt;/p&gt;
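&lt;p&gt;The three-tier triage itself is simple enough to sketch. Assuming each issue comes back tagged with the models that flagged it (the field names here are illustrative, not the platform's actual output schema), the bucketing is just a count:&lt;/p&gt;

```python
def triage(flags: dict, total_models: int = 3) -> dict:
    # Bucket each issue by how many of the models flagged it.
    tiers = {}
    for issue, models in flags.items():
        if len(models) == total_models:
            tiers[issue] = "consensus"   # all agree: fix immediately
        elif len(models) >= 2:
            tiers[issue] = "majority"    # worth a look
        else:
            tiers[issue] = "disputed"    # ambiguous tradeoff: dig in
    return tiers

tiers = triage({
    "sql-injection": ["claude", "gpt-4o", "gemini"],
    "unclosed-connection": ["claude", "gpt-4o"],
    "none-return": ["claude"],
})
# {'sql-injection': 'consensus', 'unclosed-connection': 'majority',
#  'none-return': 'disputed'}
```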

&lt;h2&gt;What the Output Actually Looks Like&lt;/h2&gt;

&lt;p&gt;Here's a real example. I had this Python function I was shipping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_user_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users WHERE id = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running &lt;code&gt;2ndopinion review --consensus&lt;/code&gt; on this file returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔴 CONSENSUS (3/3 models agree): SQL injection vulnerability
   Line 3: f-string interpolation in SQL query
   Fix: Use parameterized queries

🟡 MAJORITY (2/3 models): Connection not closed on exception
   Line 2: db.connect() has no context manager / finally block
   Claude, GPT-4o: Flag | Gemini: Acceptable (with connection pooling)

🟢 LOW CONFIDENCE (1/3 models): Return type may be None
   Line 4: fetchone() returns None if no row found
   Only Claude flagged this
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQL injection is obvious in hindsight — all three models agree, high confidence. The connection handling disagreement is &lt;em&gt;interesting&lt;/em&gt; — it tells me something about the environment assumptions baked into each model. And the None return type is a low-confidence flag worth noting for future-proofing.&lt;/p&gt;

&lt;p&gt;This is what multi-model AI code review buys you: not just more issues, but a &lt;em&gt;quality signal&lt;/em&gt; on each issue.&lt;/p&gt;
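&lt;p&gt;For what it's worth, a version that addresses all three findings might look like this — using &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for the article's unspecified &lt;code&gt;db&lt;/code&gt; client:&lt;/p&gt;

```python
import sqlite3
from contextlib import closing
from typing import Optional

def get_user_data(db_path: str, user_id: str) -> Optional[dict]:
    # Parameterized query closes the SQL injection hole; closing()
    # guarantees the connection is released even if execute() raises.
    # (sqlite3 connections used as plain context managers commit
    # transactions but do NOT close -- hence contextlib.closing.)
    with closing(sqlite3.connect(db_path)) as conn:
        conn.row_factory = sqlite3.Row
        row = conn.execute(
            "SELECT * FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        # fetchone() returns None when no row matches -- surface that
        # instead of crashing inside dict().
        return dict(row) if row is not None else None
```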

&lt;h2&gt;Pattern Memory and Regression Tracking&lt;/h2&gt;

&lt;p&gt;One thing that makes 2ndOpinion useful beyond a one-off review is that it builds project context over time. It tracks which patterns it's flagged before, so it can alert you when the same class of bug reappears in a different file.&lt;/p&gt;

&lt;p&gt;If you fixed an authentication bypass three weeks ago and a new PR introduces a structurally similar issue, 2ndOpinion flags it as a regression. No additional config required — it builds this context automatically per project.&lt;/p&gt;
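&lt;p&gt;The docs don't say how that matching works under the hood, but one plausible sketch — purely my guess, not 2ndOpinion's actual implementation — is a structural fingerprint: normalize away identifiers and literals, then hash what's left, so two structurally similar snippets in different files collide:&lt;/p&gt;

```python
import ast
import hashlib

def pattern_fingerprint(source: str) -> str:
    # Hash the *shape* of the code: blank out names, argument names, and
    # literal values so only the structure contributes to the digest.
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
        elif isinstance(node, ast.Constant):
            node.value = None
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.name = "_"
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()[:12]

# Same shape, different identifiers -> same fingerprint.
a = pattern_fingerprint("def check(t):\n    return t == 'admin'")
b = pattern_fingerprint("def verify(role):\n    return role == 'root'")
```

&lt;p&gt;Index those digests per project and a "new" bug that matches an old fingerprint becomes a regression alert rather than a fresh finding.&lt;/p&gt;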

&lt;p&gt;Combined with the GitHub PR Agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Review PR #42 from the CLI&lt;/span&gt;
2ndopinion review &lt;span class="nt"&gt;--pr&lt;/span&gt; 42
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;...and you get automated multi-model review on every pull request, with regression awareness. The PR gets an inline comment breakdown — agreements, disagreements, and confidence levels — before a human reviewer ever opens it.&lt;/p&gt;

&lt;h2&gt;The Marketplace: Build Audits, Earn Revenue&lt;/h2&gt;

&lt;p&gt;This is the part that surprised me most. 2ndOpinion has a skills marketplace where you can publish custom audit types. If you've got deep expertise in, say, Rust memory safety or Django security patterns, you can package that into an audit skill, publish it, and earn 70% of every credit spent running it.&lt;/p&gt;

&lt;p&gt;It's an interesting model: the platform benefits from domain expertise that no general-purpose LLM has, and the experts get a revenue stream from codifying what they know.&lt;/p&gt;

&lt;h2&gt;Try It Without Signing Up&lt;/h2&gt;

&lt;p&gt;If you want to kick the tires before committing, there's a free playground at &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; — no signup required. Paste a code snippet, pick your review type, and see what three models think.&lt;/p&gt;

&lt;p&gt;For the full MCP + Claude Code integration, you'll need an API key, but the setup overhead is genuinely minimal. One JSON config, one restart, and you're running confidence-weighted multi-model code review on every file you touch.&lt;/p&gt;




&lt;p&gt;Single-model AI code review is table stakes at this point. If you're serious about code quality, the next step is getting your AIs to argue with each other — and paying attention to where they agree.&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://get2ndopinion.dev" rel="noopener noreferrer"&gt;get2ndopinion.dev&lt;/a&gt; or the &lt;a href="https://github.com/bdubtronux/2ndopinion" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; to dig into the details.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>codereview</category>
      <category>devtools</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
