6 months solo on a multi-agent PR reviewer. 10.93 vs 3.80 blockers/PR (Claude alone) on my benchmark. Please test on real PRs and tell me where it's wrong

Baessi — Sun, 10 May 2026 17:23:08 +0000

TL;DR: I built a 3-LLM code reviewer (Claude + GPT-5 + Gemini, which deliberate with each other). On my synthetic-bug
benchmark it flags about 2.9× as many blockers per PR as Claude alone (what I'm calling depth), at the same catch rate.
But 15 synthetic PRs is not enough. I need YOUR PRs to validate or kill the hypothesis.

Background:
Six months ago, Claude reviewing solo kept missing things I considered blockers, or flagged them but called them
"minor". I tried running more models in parallel and having them deliberate. Result on my private corpus:

Claude alone: 3.80 blockers/PR
3-agent council: 10.93 blockers/PR
Both 100% catch on synthetic bugs
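
For anyone checking the arithmetic behind the headline, here is a minimal sketch in TypeScript. The two averages are the ones quoted above; the PR count of 15 comes from the TL;DR, and treating it as the size of the same corpus is an assumption, so the implied totals are only indicative.

```typescript
// Averages quoted in the post (blockers flagged per PR).
const claudeAlone = 3.80;
const council = 10.93;

// Assumption: the private corpus is the same 15 synthetic PRs named in the TL;DR.
const prCount = 15;

// How many times more blockers the council flags per PR.
const ratio = council / claudeAlone;

// Total blocker counts implied by the averages, rounded to whole findings.
const impliedTotals = {
  claudeAlone: Math.round(claudeAlone * prCount),
  council: Math.round(council * prCount),
};

console.log(ratio.toFixed(2)); // 2.88
console.log(impliedTotals);    // { claudeAlone: 57, council: 164 }
```

Note that this ratio only measures how much gets flagged. It says nothing about how many of those blockers are valid, which is what real-PR feedback would show.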

The pattern I found when debugging the gap: one model skips a missing test that another catches, and an issue Claude
rates "minor" gets rated a blocker by Gemini. A single agent has no second perspective.
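
To make that pattern concrete, here is a rough sketch of the simplest possible merge: take the union of findings and keep the highest severity any model assigned. This is not Conclave's actual deliberation logic, which is not shown in this post; the `Finding` shape, the `key` field, and `mergeFindings` are all hypothetical names for illustration.

```typescript
type Severity = "nit" | "minor" | "major" | "blocker";
const RANK: Record<Severity, number> = { nit: 0, minor: 1, major: 2, blocker: 3 };

// Hypothetical shape of one model's finding.
interface Finding {
  model: string;      // e.g. "claude", "gpt-5", "gemini"
  key: string;        // hypothetical stable id for the issue
  severity: Severity;
}

interface Merged {
  key: string;
  severity: Severity; // highest severity any model assigned
  raisedBy: string[];
}

// Union of all findings, escalating each issue to its highest reported severity.
function mergeFindings(findings: Finding[]): Merged[] {
  const byKey = new Map<string, Merged>();
  for (const f of findings) {
    const cur = byKey.get(f.key);
    if (!cur) {
      byKey.set(f.key, { key: f.key, severity: f.severity, raisedBy: [f.model] });
      continue;
    }
    if (cur.raisedBy.indexOf(f.model) < 0) cur.raisedBy.push(f.model);
    if (RANK[f.severity] > RANK[cur.severity]) cur.severity = f.severity;
  }
  return Array.from(byKey.values());
}

// Made-up example: Claude downgrades one issue and misses another entirely.
const merged = mergeFindings([
  { model: "claude", key: "auth.ts:missing-test", severity: "minor" },
  { model: "gemini", key: "auth.ts:missing-test", severity: "blocker" },
  { model: "gpt-5", key: "query.ts:unescaped-input", severity: "blocker" },
]);

const blockers = merged.filter((m) => m.severity === "blocker");
console.log(blockers.length); // 2 (Claude alone would have reported 0)
```

A max-severity union like this can only raise the blocker count, never lower it, so a higher blockers/PR number is expected by construction. Whether the extra blockers are real is the open question.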

The bigger feature is PRD-aware review. Put your spec in .conclave/prd.md and the agents flag deviations from it as
first-class blockers: scope creep, route mismatches, forgotten acceptance criteria.

What I need:

Run it on a real PR and tell me where it's wrong
Compare it against your usual reviewer (Claude / Cursor / human)
Send me false positives; I incorporate them into the federated failure catalog
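
If you want to report a comparison in a form that is easy to aggregate, one option is to label each blocker as valid or false positive and compute precision per reviewer. The sketch below is only a suggestion; the `LabeledBlocker` shape and `tallyByReviewer` are hypothetical and not part of the Conclave CLI, and the sample rows are made up.

```typescript
type Verdict = "valid" | "false-positive";

// Hypothetical record: one blocker, labeled by you after reading it.
interface LabeledBlocker {
  reviewer: string; // e.g. "council", "claude-alone", "human"
  pr: string;       // any PR identifier
  verdict: Verdict;
}

interface Tally {
  valid: number;
  falsePositive: number;
  precision: number; // valid / (valid + falsePositive)
}

// Count valid and false-positive blockers per reviewer and derive precision.
function tallyByReviewer(rows: LabeledBlocker[]): Map<string, Tally> {
  const out = new Map<string, Tally>();
  for (const r of rows) {
    const t = out.get(r.reviewer) ?? { valid: 0, falsePositive: 0, precision: 0 };
    if (r.verdict === "valid") t.valid += 1;
    else t.falsePositive += 1;
    t.precision = t.valid / (t.valid + t.falsePositive);
    out.set(r.reviewer, t);
  }
  return out;
}

// Made-up sample labels for a single PR.
const tallies = tallyByReviewer([
  { reviewer: "council", pr: "example-1", verdict: "valid" },
  { reviewer: "council", pr: "example-1", verdict: "valid" },
  { reviewer: "council", pr: "example-1", verdict: "valid" },
  { reviewer: "council", pr: "example-1", verdict: "false-positive" },
  { reviewer: "claude-alone", pr: "example-1", verdict: "valid" },
  { reviewer: "claude-alone", pr: "example-1", verdict: "valid" },
]);

console.log(tallies.get("council"));      // { valid: 3, falsePositive: 1, precision: 0.75 }
console.log(tallies.get("claude-alone")); // { valid: 2, falsePositive: 0, precision: 1 }
```

Reporting valid counts alongside precision keeps the two questions separate: how much more the council finds, and how much of it is noise.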

How:

Demo (3 free/day): https://conclave-ai.dev/#try
GitHub App + BYO key = unlimited free
CLI: npm i -g @conclave-ai/cli

Source-available (FSL-1.1-Apache-2.0): https://github.com/seunghunbae-3svs/conclave-ai
Stack: TS / Node 20 / Cloudflare Workers + Containers + D1 / Mastra. 26 packages, 2691 tests.

Limitations I know:

Beta, things break
Cost scaling on large diffs untested
Spec-mismatch only useful if you maintain a PRD
I'm one person + Claude pair-programming — bus factor 1

If the numbers don't survive contact with real codebases, I want to know. Poke holes.

https://github.com/seunghunbae-3svs/conclave-ai

DEV Community: Baessi
