Mike

Posted on • Originally published at mad-it.agency

A 4-model adversarial review pipeline that costs $0.30 per audit

TL;DR: Single-model code review has blind spots, and models from the same family share them. A panel of frontier models from different vendor families catches more, but only if a final verification step throws out the hallucinations the panel converged on.

Crucible is a Claude Code skill that does both for about $0.30 a run.


If GPT misses a bug, the runner-up GPT-class model usually misses it too. Same training data, same blind spots. The fix isn't a
bigger model. It's a panel of structurally different ones.

I built Crucible to test this. It's a Claude Code skill that walks a codebase file by
file, runs each file through four frontier models from different vendor families, then has Claude verify the findings against the
actual source before showing them to me.

This post is about the orchestration tricks that made it actually work.

## The panel

Four models, four families. As of writing, the default panel is:

  • DeepSeek V4-Pro
  • Google Gemini 3.1 Pro
  • Moonshot Kimi K2.6
  • MiniMax M2.7

Family diversity matters more than raw benchmark scores. Two top-tier OpenAI-family models will share weaknesses. A DeepSeek + Gemini + Kimi + MiniMax panel does not.

## Pass 1: chained review

Each file goes through the panel in order. Model A finds issues. Model B sees A's findings, validates them, and adds its own. Model C consolidates. Model D ranks by severity and emits the structured output.

This works because adversarial pressure compounds. Model B is more rigorous when it's grading Model A's homework than when it's writing its own.
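
A minimal sketch of that chain, assuming the OpenAI SDK pointed at OpenRouter; the model IDs, prompt wording, and function names are placeholders rather than Crucible's actual internals:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Placeholder IDs -- resolve the real ones from OpenRouter's model list.
PANEL = ["<deepseek-id>", "<gemini-id>", "<kimi-id>", "<minimax-id>"]

ROLES = [
    "Review this file and list every defect you find.",
    "Validate the prior reviewer's findings, then add anything they missed.",
    "Consolidate all findings so far and merge duplicates.",
    "Rank the consolidated findings by severity and emit structured JSON.",
]

def chained_review(path: str, source: str) -> str:
    findings = "(none yet)"
    for model, role in zip(PANEL, ROLES):
        resp = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f"{role}\n\nFile: {path}\n{source}\n\nPrior findings:\n{findings}",
            }],
        )
        # Each model's output becomes the next model's "prior findings".
        findings = resp.choices[0].message.content or ""
    return findings  # model D's severity-ranked output
```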

## Pass 2: blind parallel mode

--blind runs the same panel concurrently. No model sees another's output. Findings that overlap on file plus line plus topic
become "consensus" findings and get ranked higher. Findings only one model raised are tagged but not promoted.

Blind mode is slower per file but catches different things than chained mode. Chained converges. Blind diverges.
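
A rough sketch of the consensus grouping, assuming each model's findings carry file, line, and topic fields (the field names and vote logic are illustrative):

```python
import asyncio
from collections import defaultdict

async def blind_review(panel, path, source, review_one):
    # review_one(model, path, source) is an async reviewer returning a list
    # of findings shaped like {"file": ..., "line": ..., "topic": ...}.
    results = await asyncio.gather(*(review_one(m, path, source) for m in panel))

    # Bucket findings by (file, line, topic): overlap across models = consensus.
    buckets = defaultdict(list)
    for model_findings in results:
        for f in model_findings:
            buckets[(f["file"], f["line"], f["topic"])].append(f)

    merged = []
    for group in buckets.values():
        f = dict(group[0])
        f["votes"] = len(group)
        f["consensus"] = len(group) > 1  # promoted in ranking
        merged.append(f)
    return sorted(merged, key=lambda f: f["votes"], reverse=True)
```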

## Pass 3: Claude verification

This is the step everyone skips, and it's the one that matters. After the panel finishes, Claude reads every CRITICAL and HIGH finding back against the actual source code and marks each one:

  • Confirmed: matches the code
  • Refined: real issue, severity adjusted
  • Disputed: panel hallucinated, source disagrees
  • Needs human judgment: real tradeoff, not a bug

Claude can also add up to three findings the panel missed. In practice the panel rarely misses things at HIGH severity. It's the
LOW and MEDIUM that get pruned hardest in this pass.
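
As a sketch, the four verdicts might be modeled like this, with disputed findings dropped before anything reaches the report; the enum and field names are mine, not Crucible's schema:

```python
from enum import Enum

class Verdict(str, Enum):
    CONFIRMED = "confirmed"               # matches the code
    REFINED = "refined"                   # real issue, severity adjusted
    DISPUTED = "disputed"                 # panel hallucinated, source disagrees
    NEEDS_HUMAN = "needs_human_judgment"  # real tradeoff, not a bug

def surviving_findings(findings: list[dict]) -> list[dict]:
    # Disputed findings are the panel's hallucinations; drop them outright.
    return [f for f in findings if f.get("verdict") != Verdict.DISPUTED]
```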

## What broke and how I fixed it

A few orchestration bugs I hit. Sharing them in case you're building something similar.

### Reasoning models eat your output

OpenRouter's chat.completions returns .content and .reasoning_content separately on DeepSeek-R1 and Nemotron. The actual review lives in reasoning_content after a <think>...</think> block. If your wrapper only reads .content, you'll get empty strings half the time and not know why.

Crucible's extractor reads both fields and strips <think> blocks before parsing.
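
A minimal version of such an extractor, assuming the OpenAI-SDK response shape; the fallback order is my assumption:

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def extract_review(choice) -> str:
    msg = choice.message
    # Prefer .content; fall back to .reasoning_content, which the SDK
    # surfaces as an extra field on some OpenRouter reasoning models.
    text = msg.content or getattr(msg, "reasoning_content", None) or ""
    # Strip the chain-of-thought block before parsing the review itself.
    return THINK_BLOCK.sub("", text).strip()
```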

### Free-tier model caches drift

I had an existing skill that called OpenRouter's free tier. Free models swap behavior weekly. Crucible has its own paid-only model cache at ~/.crucible/models.json with a 72-hour TTL, refreshed by discover-premium.sh.
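
A sketch of that 72-hour TTL check; the path and TTL come from above, but the "models" key inside the cache file is a guess:

```python
import json
import time
from pathlib import Path

CACHE = Path.home() / ".crucible" / "models.json"
TTL = 72 * 3600  # 72 hours

def cached_panel() -> list[str] | None:
    # Treat a missing or stale cache the same way: force a refresh.
    if not CACHE.exists() or time.time() - CACHE.stat().st_mtime > TTL:
        return None  # caller should re-run discover-premium.sh
    return json.loads(CACHE.read_text()).get("models")
```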

### Mid-run network blips lose work

Findings persist as they land, so a rate-limit hiccup at file 40 of 60 doesn't cost you the prior 39. Resume a partial run with --resume <run-id>.
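
A sketch of that persistence, with a hypothetical JSONL layout (Crucible's actual on-disk format may differ):

```python
import json
from pathlib import Path

def findings_log(run_id: str) -> Path:
    d = Path.home() / ".crucible" / "runs" / run_id
    d.mkdir(parents=True, exist_ok=True)
    return d / "findings.jsonl"

def persist(run_id: str, path: str, findings: list[dict]) -> None:
    # Append one record per file as soon as its review completes.
    with findings_log(run_id).open("a") as f:
        f.write(json.dumps({"file": path, "findings": findings}) + "\n")

def completed_files(run_id: str) -> set[str]:
    # On --resume, skip any file that already has a persisted record.
    log = findings_log(run_id)
    if not log.exists():
        return set()
    with log.open() as f:
        return {json.loads(line)["file"] for line in f}
```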

### Panel drift mid-run

Freeze your model panel at the start of the run. No --auto cache mutation. The same four model IDs that started the run finish
it. Otherwise file 1 might be reviewed by Kimi K2.6 and file 30 by Kimi K2.7, with subtly different output formats that break the
aggregator.
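
One way to implement the freeze: snapshot the resolved model IDs into a per-run manifest at start and read only the snapshot thereafter. File names here are hypothetical:

```python
import json
from pathlib import Path

def manifest(run_id: str) -> Path:
    return Path.home() / ".crucible" / "runs" / run_id / "manifest.json"

def freeze_panel(run_id: str, panel: list[str]) -> None:
    m = manifest(run_id)
    m.parent.mkdir(parents=True, exist_ok=True)
    m.write_text(json.dumps({"panel": panel}))

def frozen_panel(run_id: str) -> list[str]:
    # Read only the snapshot mid-run; never re-resolve from the live cache.
    return json.loads(manifest(run_id).read_text())["panel"]
```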

## Cost

A 5-file run through a 3-model SOTA panel on a small project: $0.05 to $0.15. A 50-file run: about $0.50.

The first real audit on a 1,162-line project (5 files, 4 models) cost $0.30 and surfaced 30 findings. Nine were HIGH severity.
All nine were real after Claude verification. Three of them were things I'd reviewed myself and missed.

## Try it


```bash
git clone https://github.com/Bambushu/crucible.git ~/.claude/skills/crucible
export OPENROUTER_API_KEY=sk-or-...
# restart Claude Code
```

Then from any project:

```
/crucible                              # current branch's diff
/crucible --all                        # whole repo
/crucible --paths "src/**/*.ts"        # glob
/crucible --diff main...HEAD           # range
```

MIT licensed. Issues and PRs welcome.
