Baessi
6 months solo on a multi-agent PR reviewer. 10.93 vs 3.80 blockers/PR (Claude alone) on my benchmark — please test on real PRs and tell me where it's wrong

TL;DR: I built a 3-LLM code reviewer (Claude + GPT-5 + Gemini, deliberating with each other). On my synthetic-bug
benchmark it surfaces ~3× as many blockers at the same catch rate as Claude alone. But 15 synthetic PRs is not
enough. I need YOUR PRs to validate or kill the hypothesis.

Background:
Six months ago, solo Claude reviews kept labeling as "minor" things I considered blockers. I tried running more
models in parallel with a deliberation step. Result on my private corpus:

  • Claude alone: 3.80 blockers/PR
  • 3-agent council: 10.93 blockers/PR
  • Both 100% catch on synthetic bugs
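For clarity, "blockers/PR" here is just the average count of blocker-severity findings surfaced per reviewed PR. A minimal sketch of that aggregation (the `ReviewResult` shape and reviewer names are my illustration, not Conclave's actual schema):

```typescript
// Hypothetical benchmark record -- field names are illustrative only.
interface ReviewResult {
  prId: string;
  reviewer: "claude-solo" | "council";
  blockers: number; // blocker-severity findings surfaced on this PR
}

// Average blockers surfaced per PR for one reviewer configuration.
function blockersPerPr(
  results: ReviewResult[],
  reviewer: ReviewResult["reviewer"],
): number {
  const rows = results.filter((r) => r.reviewer === reviewer);
  if (rows.length === 0) return 0;
  const total = rows.reduce((sum, r) => sum + r.blockers, 0);
  return total / rows.length;
}

// Toy data, not the real corpus.
const sample: ReviewResult[] = [
  { prId: "pr-1", reviewer: "claude-solo", blockers: 4 },
  { prId: "pr-1", reviewer: "council", blockers: 11 },
  { prId: "pr-2", reviewer: "claude-solo", blockers: 3 },
  { prId: "pr-2", reviewer: "council", blockers: 10 },
];

console.log(blockersPerPr(sample, "claude-solo")); // 3.5
console.log(blockersPerPr(sample, "council")); // 10.5
```

Note the catch-rate on synthetic bugs is the same for both setups; only the depth (count per PR) differs.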

The pattern, after debugging the gap: one model overlooks a missing test that another catches. What Claude calls
"minor", Gemini escalates to a blocker. A single agent has no second perspective.
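The escalation pattern above can be sketched as a severity merge across agents. This is my own minimal illustration of the idea, not Conclave's actual deliberation code; the types and the keep-the-highest-severity rule are assumptions:

```typescript
// Severity-escalation sketch: union findings from all agents, and where
// they disagree on severity, keep the highest.
type Severity = "minor" | "major" | "blocker";
const rank: Record<Severity, number> = { minor: 0, major: 1, blocker: 2 };

interface Finding {
  id: string; // stable key, e.g. "missing-test:src/auth.ts" (hypothetical)
  severity: Severity;
  agent: string;
}

function deliberate(findingsByAgent: Finding[][]): Map<string, Finding> {
  const merged = new Map<string, Finding>();
  for (const findings of findingsByAgent) {
    for (const f of findings) {
      const prev = merged.get(f.id);
      // A "minor" from one model becomes a blocker if another says so.
      if (!prev || rank[f.severity] > rank[prev.severity]) {
        merged.set(f.id, f);
      }
    }
  }
  return merged;
}

const claude: Finding[] = [
  { id: "missing-test:auth", severity: "minor", agent: "claude" },
];
const gemini: Finding[] = [
  { id: "missing-test:auth", severity: "blocker", agent: "gemini" },
];
console.log(deliberate([claude, gemini]).get("missing-test:auth")?.severity);
// "blocker"
```

The real system deliberates rather than mechanically taking a max, but the max-severity union is the simplest model of why a second perspective adds findings without lowering the catch rate.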

The bigger feature is PRD-aware review: agents read .conclave/prd.md and flag spec deviations as first-class
blockers. Scope creep, route mismatches, forgotten acceptance criteria.
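As a rough sketch of the "forgotten acceptance criteria" check: pull checklist items out of the PRD markdown and flag any that the PR never mentions. The parsing and substring matching here are my own simplification; Conclave's agents do this with LLM judgment, not string search:

```typescript
// Extract "- [ ] ..." / "- [x] ..." acceptance-criteria lines from PRD markdown
// (the checklist convention is an assumption about how a prd.md might look).
function parseCriteria(prdMarkdown: string): string[] {
  return prdMarkdown
    .split("\n")
    .filter((line) => /^\s*-\s*\[[ x]\]/i.test(line))
    .map((line) => line.replace(/^\s*-\s*\[[ x]\]\s*/i, "").trim());
}

// Flag criteria that the PR text never references -- candidates for
// "forgotten acceptance criteria" blockers.
function forgottenCriteria(criteria: string[], prText: string): string[] {
  const haystack = prText.toLowerCase();
  return criteria.filter((c) => !haystack.includes(c.toLowerCase()));
}

// Toy PRD and PR body; in practice the PRD comes from .conclave/prd.md.
const prd = "# PRD\n- [ ] rate limit login endpoint\n- [ ] audit log on delete\n";
const prBody = "Adds audit log on delete for admin actions.";
console.log(forgottenCriteria(parseCriteria(prd), prBody));
// [ "rate limit login endpoint" ]
```

Naive matching like this would obviously drown in false positives on real PRs, which is exactly why the deliberation layer sits on top; the sketch only shows where the PRD enters the pipeline.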

What I need:

  • Run it on a real PR and tell me where it's wrong
  • Compare it against your usual reviewer (Claude / Cursor / human)
  • Send me false positives; I fold them into a federated failure catalog

How to try it:

Source-available (FSL-1.1-Apache-2.0): https://github.com/seunghunbae-3svs/conclave-ai
Stack: TS / Node 20 / Cloudflare Workers + Containers + D1 / Mastra. 26 packages, 2691 tests.

Limitations I know:

  • Beta, things break
  • Cost scaling on large diffs untested
  • Spec-mismatch only useful if you maintain a PRD
  • I'm one person pair-programming with Claude (bus factor: 1)

If the numbers don't survive contact with real codebases, I want to know. Poke holes.

https://github.com/seunghunbae-3svs/conclave-ai
