Most AI coding posts focus on the code: which model writes cleaner functions, which one needs less prompting, which one hallucinates less.
But the code isn't usually where my projects break.
The plan is.
A pipeline that almost shipped
A few weeks ago I asked Claude Code to plan a BigQuery dedup pipeline. Routine stuff. Pull events from Postgres into GCS, load into BigQuery, dedup by event ID, impute some missing checkout rows.
The plan came back in maybe 90 seconds. Six steps, clean SQL, sensible-looking error handling. I almost just told it to start coding.
Then I tried something. I sent the same plan to Codex and Gemini, and asked each one separately to break it.
Three models. Same plan. No shared context. None of them knew what the others wrote.
Here's what came back.
What three models found
All three caught the same dedup bug:
The INSERT INTO order_events_dedup step wasn't idempotent. Any retry doubled yesterday's rows. The existing alert ("less than 50% of expected") was one-sided and would never fire on an over-count.
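For illustration, one idempotent shape is a MERGE keyed on the event ID: a retry re-matches existing rows instead of appending them again. Everything here (order_events_raw, event_date, ingested_at, the matching schemas) is my assumption, not the plan's actual SQL:

MERGE order_events_dedup AS t
USING (
  -- keep one row per event_id from the day's raw load
  SELECT * EXCEPT(rn)
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY event_id
                              ORDER BY ingested_at DESC) AS rn
    FROM order_events_raw
    WHERE event_date = @run_date
  )
  WHERE rn = 1
) AS s
ON t.event_id = s.event_id
WHEN NOT MATCHED THEN
  INSERT ROW;  -- assumes source and target schemas match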
That's the easy one. The interesting findings were the ones only one model caught.
Only Claude caught this:
Step D's correlated subquery had unqualified column references. WHERE m2.user_id = user_id doesn't bind the way the writer intended under BigQuery's scoping rules; the unqualified user_id resolves to m2 itself, not to the outer table. The imputation step would silently do nothing after day one. The pipeline's whole purpose (filling in missing checkout events) would fail invisibly for 2–8 weeks before anyone noticed.
Codex and Gemini both quoted this exact SQL block in their reviews. Neither tested whether the join actually binds.
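A hedged reconstruction of the failure (the table and column names are my invention, not the plan's actual SQL). With the reference unqualified, the predicate compares m2 to itself and is always true; NOT EXISTS then only passes while the table has no checkout rows at all, which is why it works on day one and never again:

INSERT INTO order_events_dedup (user_id, event_type, event_ts)
SELECT m1.user_id, 'checkout', m1.inferred_ts
FROM inferred_checkouts AS m1
WHERE NOT EXISTS (
  SELECT 1
  FROM order_events_dedup AS m2
  -- broken:  WHERE m2.user_id = user_id   (binds as m2.user_id = m2.user_id)
  WHERE m2.user_id = m1.user_id            -- intended: qualify the outer alias
    AND m2.event_type = 'checkout'
);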
Only Gemini caught this:
Midnight-boundary race. The same event retried at 23:59:59 on Day 1 and 00:00:02 on Day 2 lands in two different daily partitions. Step C's GROUP BY only sees within a partition. The cross-partition pair never gets deduped.
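One possible fix, sketched under the same assumed schema: widen the dedup window so the cross-midnight pair lands in the same window, then keep only the rows that belong to the partition being rebuilt:

SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_ts) AS rn
  FROM order_events_raw
  -- look back one partition so the 23:59:59 / 00:00:02 pair is visible together
  WHERE event_date BETWEEN DATE_SUB(@run_date, INTERVAL 1 DAY) AND @run_date
)
WHERE rn = 1
  AND DATE(event_ts) = @run_date;  -- still write only today's partition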
Only Codex caught this:
A truncated CSV lands in GCS and the BigQuery load succeeds anyway. Up to 50% of the data can be silently lost while still passing the row-count alert, because a truncated file is still syntactically valid CSV.
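A guard that would catch it, assuming the export step writes an expected row count into a manifest table (export_manifest and its columns are hypothetical, not part of the repo):

-- returns a row (and should fail the run) on any count mismatch,
-- over-counts included, unlike the one-sided "less than 50%" alert
SELECT m.expected_rows, COUNT(e.event_id) AS loaded_rows
FROM export_manifest AS m
LEFT JOIN order_events_raw AS e
  ON e.event_date = m.export_date
WHERE m.export_date = @run_date
GROUP BY m.expected_rows
HAVING COUNT(e.event_id) <> m.expected_rows;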
Three different blind spots. Three different models. If I'd just gone with any one model's review, I'd have shipped two of these bugs.
The workflow
# build one prompt: system prompt + the plan under review
PROMPT=$(cat prompts/system-prompt.md plan.md)

mkdir -p out

# fan out to three CLIs in parallel, no shared context
echo "$PROMPT" | claude --print > out/claude.md &
echo "$PROMPT" | codex exec --skip-git-repo-check > out/codex.md &
echo "$PROMPT" | gemini --skip-trust > out/gemini.md &
wait

# 4th call merges and ranks the three reviews
{ cat prompts/consolidation-prompt.md \
    out/claude.md out/codex.md out/gemini.md; } \
  | claude --print > out/ranked.md
Three CLIs in parallel. Same prompt. No shared context. A fourth call to merge.
Wall time: 5–15 minutes (the merge step dominates). Cost: about $0.10–0.20 for a sample plan, $0.50–2.00 for a production-size one.
The prompt that does the work
The prompt sent to all three models has one job: force concrete failure scenarios, reject abstract advice. Five dimensions:
1. HIDDEN ASSUMPTIONS — ordering, uniqueness, atomicity, data
freshness, caller behavior. What does this design implicitly
depend on?
2. DEPENDENCY FAILURES — upstream/downstream services, external
APIs, databases, messaging. What breaks if a dependency
degrades?
3. BOUNDARY INPUTS — empty, single, huge batch, malicious,
malformed.
4. MISUSE PATHS — caller misbehavior, user skipping steps,
out-of-order operations.
5. ROLLBACK & BLAST RADIUS — how to recover, scope of damage.
5-minute detection vs 5-day detection?
For each scenario:
- TRIGGER: what causes it
- IMPACT: who is affected, how badly
- DETECTABILITY: how long until noticed
Reject abstract advice like "add monitoring". Specify what
metric, what threshold, what alert.
That last paragraph is doing most of the work. Without it you get "consider rate limiting" and "ensure proper error handling." With it you get the midnight-boundary race.
What I actually learned
Three models in parallel isn't impressive. Anyone can run three CLIs. What surprised me is how little the reviews overlap once you get past the consensus bugs.
Claude tends to over-warn. It flags five defensive checks that aren't really bugs. But it actually reads the SQL.
Codex is concise. It skips integration details, but it notices file-format and infra failure modes the others gloss over.
Gemini stays surface-level a lot of the time. But when it does dig in, it's often a concurrency or partition issue the others missed.
You don't get this from ensemble averaging. The consensus findings are the obvious ones. The unique findings are the ones a single-model review would have quietly missed.
That's the whole point.
Not a multi-agent framework
This is a workflow, not a system. No orchestrator, no shared scratchpad, no consensus protocol, no agent class hierarchy. Three CLIs in parallel. A fourth call to merge.
If you want an installed framework with marketplace plugins, there are several. This is the opposite shape: ~30 lines you paste into your CLAUDE.md, and the next time you ask Claude Code to review a plan, it fans out to Codex and Gemini in parallel and brings back a merged report.
I wrote it up
The full method, both case studies (the BigQuery pipeline above plus a Cloud Run + Workflows deploy), and the 100-line redteam.sh are in a small repo:
https://github.com/permoon/multi-model-redteam
Three install tiers depending on what you have set up:
- Tier 0: paste 30 lines into CLAUDE.md. No install.
- Tier 1: git clone and run the bash script.
- Tier 2: copy the prompt into Claude / ChatGPT / Gemini's chat UI. One model only, but better than no frame.
It's also a teaching repo. Seven chapters, from "why one LLM isn't enough" to the parallel script.
Open question
The reason I'm posting this: I want to know if other people are doing something similar.
Are you red-teaming AI-generated plans before letting the model implement them? With one model? Multiple? Or are you mostly trusting the plan and reviewing the code afterward?
If you've tried this and it didn't work for you, I'd especially like to hear that.
