Running AI code review locally is fine for solo work. The moment you have a team, the question becomes: how do I make the AI an actual gate in the pipeline, not a thing one person remembers to run before they push?
This is a walkthrough for wiring 2ndOpinion — the multi-model AI code review CLI — into a CI/CD pipeline. I'll show GitHub Actions in full, then sketch the same pattern for GitLab CI and CircleCI. The interesting decisions aren't where the YAML goes; they're around consensus thresholds, blocking vs informational mode, and what happens when Claude, Codex, and Gemini disagree on the same diff (which, from our review logs, is roughly 15% of the time).
What "AI code review in CI" actually means
There are two shapes this takes, and the YAML is almost identical for either. The difference is the policy:
- Informational mode. Every PR runs the review. Findings are posted as a comment or check annotation. Nothing blocks merge. Humans decide what to do.
- Blocking mode. Review runs on every PR. If the consensus surface flags a HIGH severity finding, the check fails and merge is blocked until the author either fixes it or someone with override permission ships anyway.
I recommend starting in informational mode for the first week or two. AI reviewers — even three of them cross-examining each other — surface false positives. You want the team to learn the noise floor before the bot can block their merges, otherwise the first false-positive blocker generates a Slack thread that ends with "let's just turn this off."
## The minimum GitHub Actions config

Here's the workflow file I use as a starting point. Drop it in `.github/workflows/ai-review.yml`:
```yaml
name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # need full history for diffs
      - name: Set up Node
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      - name: Install 2ndOpinion CLI
        run: npm install -g 2ndopinion-cli
      - name: Run multi-model review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          2ndopinion review \
            --base origin/${{ github.base_ref }} \
            --head HEAD \
            --format github-comment \
            --severity-threshold medium \
            --comment-pr ${{ github.event.pull_request.number }}
```
A few things worth calling out:
`fetch-depth: 0` is necessary because `actions/checkout` defaults to a shallow clone, and the CLI needs full history to compute the actual PR diff against the base branch. Skip it and your review runs against an empty diff, which produces a confidently empty review.

Three API keys. Multi-model review means three providers. If you only set one, the CLI degrades to single-model mode and prints a warning. That's fine for a smoke test, but the whole reason you're doing this is the multi-model surface — the disagreement signal.

`--severity-threshold medium` suppresses LOW findings in the PR comment. LOW is mostly nits and style preferences, and posting them on every PR trains your team to ignore the bot. Keep MEDIUM and HIGH visible; suppress LOW.
## Going from informational to blocking
To turn this into a merge gate, change one flag and one branch protection setting.
In the workflow:
```shell
2ndopinion review \
  --base origin/${{ github.base_ref }} \
  --head HEAD \
  --format github-comment \
  --severity-threshold medium \
  --fail-on high \
  --comment-pr ${{ github.event.pull_request.number }}
```
The `--fail-on high` flag tells the CLI to exit with a non-zero status if any HIGH severity finding has consensus from at least 2 of 3 models. The 2-of-3 threshold matters — it's why you don't want to block on single-model verdicts. Any single model can confidently invent a critical bug. Two models independently flagging the same critical bug is meaningfully harder to fake.
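You can make that intuition concrete with a back-of-envelope calculation. This is an illustration, not measured data: it assumes each model produces the same false HIGH independently with probability `p`. Independence is optimistic (models trained on similar data correlate), so treat the consensus numbers as a best case:

```python
# Back-of-envelope: probability that a false HIGH survives a k-of-3
# consensus gate, assuming each of the 3 models independently produces
# the same false finding with probability p. Independence is an
# idealization; correlated models push the real rates higher.
from math import comb

def false_block_rate(p: float, k: int, n: int = 3) -> float:
    """P(at least k of n models agree on the same false HIGH)."""
    return sum(comb(n, m) * p**m * (1 - p)**(n - m) for m in range(k, n + 1))

p = 0.10  # assumed single-model false-HIGH rate, purely illustrative
false_block_rate(p, 1)  # ≈ 0.271 — any single model can block
false_block_rate(p, 2)  # ≈ 0.028 — 2-of-3 gate, ~10x fewer false blocks
false_block_rate(p, 3)  # ≈ 0.001 — 3-of-3 gate, strictest
```

Even under these toy numbers, the 2-of-3 gate cuts the false-block rate by an order of magnitude relative to single-model blocking, which is the whole argument for consensus thresholds.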
Then in Settings → Branches → Branch protection for your default branch, add the `AI Code Review / review` check to the required checks list. Now the merge button is gated.
I'd hold this back for at least a week of informational-mode runs. Look at the false positive rate. If you're getting more than one false HIGH per ten PRs, tune the consensus threshold up to 3-of-3 instead of 2-of-3 before flipping the gate on:
```shell
--fail-on high --consensus-required 3
```
That's stricter — only blocks when all three models agree the finding is HIGH. False positive rate drops, false negative rate goes up. Tradeoff worth making early; you can loosen later once the team trusts the bot.
## GitLab CI

Same pattern, different YAML. `.gitlab-ci.yml`:
```yaml
ai-code-review:
  stage: test
  image: node:20
  rules:
    - if: $CI_PIPELINE_SOURCE == 'merge_request_event'
  variables:
    GIT_DEPTH: 0
  script:
    - npm install -g 2ndopinion-cli
    - 2ndopinion review
      --base origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME
      --head HEAD
      --format gitlab-note
      --severity-threshold medium
      --comment-mr $CI_MERGE_REQUEST_IID
```
The CLI knows about GitLab's note format and uses `CI_JOB_TOKEN` automatically if it's available in the environment, so you don't need to set up a separate token unless you want bot-attributed comments.
## CircleCI

CircleCI's config doesn't have the same first-class PR concept, but the CLI handles it. `.circleci/config.yml`:
```yaml
version: 2.1

jobs:
  ai-review:
    docker:
      - image: cimg/node:20.11
    steps:
      - checkout
      - run: npm install -g 2ndopinion-cli
      - run:
          name: Run review
          command: |
            2ndopinion review \
              --base origin/main \
              --head HEAD \
              --format json \
              --output review.json
      - store_artifacts:
          path: review.json

workflows:
  review:
    jobs:
      - ai-review
```

Note the `workflows` stanza: a 2.1 config with jobs but no workflow never runs the job.
CircleCI doesn't have a native PR-comment surface, so I store the review as a build artifact and add a separate small script to POST the JSON to the GitHub PR via a personal access token. Less elegant than the GitHub Actions path, but it works.
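That posting script can be very small. Here's a sketch in Python — not part of the CLI, and the `review.json` schema it assumes (a `findings` array with `severity` and `title` keys) is an illustration, not documented 2ndOpinion output. The endpoint is GitHub's real REST "create issue comment" API, which also works on pull requests:

```python
# Sketch: render review.json as markdown and post it to the PR.
# Assumptions (not 2ndopinion conventions): the review.json schema,
# and that repo/PR number/token are passed in by the caller.
import json
import urllib.request

def build_comment(review: dict) -> str:
    """Render the review JSON as a markdown comment body."""
    lines = ["## AI Code Review"]
    for finding in review.get("findings", []):
        lines.append(f"- **{finding['severity'].upper()}** {finding['title']}")
    return "\n".join(lines)

def post_comment(repo: str, pr_number: int, token: str, body: str) -> None:
    """POST the comment via GitHub's REST API using a personal access token."""
    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"
    req = urllib.request.Request(
        url,
        data=json.dumps({"body": body}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)

# Usage in CI, after the review step (env var names are up to you):
#   review = json.load(open("review.json"))
#   post_comment("acme/widgets", 1234, os.environ["GITHUB_PAT"],
#                build_comment(review))
```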
## What to do when the models disagree
The reason multi-model review is in CI in the first place is that disagreements are signal, not noise. The CLI's default behavior on a finding where models split:
- 3-of-3 agree (HIGH): posted as a HIGH finding; blocks merge if `--fail-on high` is set.
- 2-of-3 agree (HIGH): posted as a HIGH finding with the dissenting model's argument attached; blocks if `--consensus-required 2`.
- 1-of-3 (HIGH): posted as a NOTE-level finding with the model's argument and the other two models' counter-arguments. Never blocks. Visible to humans.
That last category is the most underrated output. About 8% of our diffs produce a 1-of-3 HIGH where exactly one model is convinced something is broken and the other two say it's fine. Most of those are false positives by the lone model. But about a quarter of them — by far the most interesting quarter — are real bugs that two models missed. You don't want those silently dropped, but you also don't want them blocking merges. NOTE-level surfacing is the right answer.
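The routing policy is easy to state as code. This is a sketch that mirrors the documented behavior above, not the CLI's actual internals:

```python
# Sketch of the finding-routing policy described above (not the CLI's
# implementation). `agree` = number of models (out of 3) flagging the
# finding HIGH; `required` = the --consensus-required value.
def route_finding(agree: int, required: int = 2) -> str:
    if agree >= required:
        return "high/blocking"       # fails the check when --fail-on high is set
    if agree >= 2:
        return "high/informational"  # posted as HIGH with the dissent attached
    if agree == 1:
        return "note"                # lone-model finding: visible, never blocks
    return "none"

route_finding(3, 2)  # "high/blocking"
route_finding(2, 3)  # "high/informational" — below threshold, still surfaced
route_finding(1, 2)  # "note"
```

The key property: tightening `required` from 2 to 3 only moves findings between the blocking and informational buckets; nothing ever gets silently dropped.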
## Cost and time, in case you're worried about either
Median review on a typical 200-line diff: about 40 seconds wall-clock and roughly $0.06 in combined API spend across the three providers. That's wall-clock time the developer doesn't spend; it runs in parallel with the rest of the CI matrix. The cost works out to less than a tenth of what most teams pay for any single human reviewer's hour, which is the right comparison — multi-model review doesn't replace human review, it replaces the human reviewer asking "did you check for race conditions" by hand.
We've seen teams skip the AI review step for files larger than 1,000 lines or generated files (lockfiles, schema dumps) — `--exclude '**/*.lock'` and `--max-diff-lines 1000` handle both.
If you want to try this on a real repo, the CLI is `npm install -g 2ndopinion-cli` and the docs for every flag mentioned above are at get2ndopinion.dev. The MCP server flavor (for plugging the same review engine into Claude Code or Cursor as an agent tool) is also there.
We publish a weekly build-in-public update; this post is part of it. If you wire 2ndOpinion into your CI and one of your three models flags something the other two missed on a real diff, send the case over — those are the ones we use to tune the consensus thresholds.