TL;DR: I built a 3-LLM code reviewer (Claude, GPT-5, and Gemini deliberating with each other). On my synthetic-bug benchmark it surfaces roughly 3× as many blockers per PR at the same catch rate as Claude alone. But 15 synthetic PRs is not enough evidence. I need YOUR PRs to validate or kill the hypothesis.
Background:
Six months ago, solo-Claude review kept missing things I considered blockers but that it labeled "minor". I tried running more models in parallel with a deliberation step. Results on my private corpus:
- Claude alone: 3.80 blockers/PR
- 3-agent council: 10.93 blockers/PR
- Both at a 100% catch rate on the synthetic bugs
The pattern, once I debugged the gap: one model overlooks a missing test that another catches, and what Claude rates "minor" Gemini escalates to a blocker. A single agent has no second perspective.
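
Conceptually, the merge step looks something like this simplified TypeScript sketch. It is not the real pipeline (the actual deliberation argues over the content of each finding, not just its location); the `Finding` shape and the file:line overlap rule here are purely illustrative:

```ts
// Simplified sketch: merge findings from several agents and keep the most
// severe opinion when two agents flag the same location.

type Severity = "minor" | "major" | "blocker";

interface Finding {
  agent: string;    // e.g. "claude", "gpt-5", "gemini"
  file: string;
  line: number;
  severity: Severity;
  note: string;
}

const rank: Record<Severity, number> = { minor: 0, major: 1, blocker: 2 };

// Two findings "overlap" here if they touch the same file and line.
function mergeFindings(findings: Finding[]): Finding[] {
  const byLocation = new Map<string, Finding>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}`;
    const existing = byLocation.get(key);
    if (!existing || rank[f.severity] > rank[existing.severity]) {
      byLocation.set(key, f); // escalate to the harshest rating
    }
  }
  return [...byLocation.values()];
}

// Claude calls it minor, Gemini calls it a blocker -> it surfaces as a blocker.
const merged = mergeFindings([
  { agent: "claude", file: "auth.ts", line: 42, severity: "minor", note: "no test for token expiry" },
  { agent: "gemini", file: "auth.ts", line: 42, severity: "blocker", note: "token expiry path untested" },
]);
console.log(merged);
```

That escalation rule is the core of why the per-PR blocker counts diverge so much between solo review and the council.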
The bigger feature is PRD-aware review. Put a `.conclave/prd.md` in the repo and the agents flag deviations from the spec as first-class blockers: scope creep, route mismatches, forgotten acceptance criteria.
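
A PRD here is just plain markdown listing the routes and acceptance criteria that reviewers tend to forget. Something like this (illustrative only, not a required schema):

```md
# Checkout v2 PRD

## Routes
- POST /api/checkout: creates an order, returns 201 with the order id
- GET /api/orders/:id: owner-only

## Acceptance criteria
- [ ] Declined cards return 402 with a machine-readable error code
- [ ] Order creation is idempotent per cart id
- [ ] No new dependencies outside the payments package
```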
What I need:
- Run it on a real PR and tell me where it's wrong
- Compare it against your usual reviewer (Claude, Cursor, a human)
- Send me the false positives; I fold them into a federated failure catalog
How:
- Demo (3 free/day): https://conclave-ai.dev/#try
- GitHub App + bring-your-own API key = unlimited and free
- CLI: npm i -g @conclave-ai/cli
Source-available (FSL-1.1-Apache-2.0): https://github.com/seunghunbae-3svs/conclave-ai
Stack: TS / Node 20 / Cloudflare Workers + Containers + D1 / Mastra. 26 packages, 2691 tests.
Limitations I know:
- Beta, things break
- Cost scaling on large diffs is untested
- Spec-mismatch detection is only useful if you maintain a PRD
- I'm one person pair-programming with Claude, so the bus factor is 1
If the numbers don't survive contact with real codebases, I want to know. Poke holes.