Baessi
6 months solo on a multi-agent PR reviewer. 10.93 vs 3.80 blockers/PR (Claude alone) on my benchmark — please test on real PRs and tell me where it's wrong

TL;DR: I built a 3-LLM code reviewer (Claude + GPT-5 + Gemini, deliberating with each other). On my synthetic-bug
benchmark it surfaces ~3× as many blockers at the same catch rate as Claude alone. But 15 synthetic PRs is not
enough. I need YOUR PRs to validate or kill the hypothesis.

Background:
Six months ago, solo Claude reviews kept labeling as "minor" things I considered blockers. I tried running more
models in parallel with a deliberation step. Result on my private corpus:

  • Claude alone: 3.80 blockers/PR
  • 3-agent council: 10.93 blockers/PR
  • Both 100% catch on synthetic bugs
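For clarity, "blockers/PR" here is just the average count of blocker-severity findings surfaced per reviewed PR. A minimal sketch of that aggregation (the `ReviewResult` shape and reviewer names are my illustration, not Conclave's actual schema):

```typescript
// Hypothetical benchmark record -- field names are illustrative only.
interface ReviewResult {
  prId: string;
  reviewer: "claude-solo" | "council";
  blockers: number; // blocker-severity findings surfaced on this PR
}

// Average blockers surfaced per PR for one reviewer configuration.
function blockersPerPr(
  results: ReviewResult[],
  reviewer: ReviewResult["reviewer"],
): number {
  const rows = results.filter((r) => r.reviewer === reviewer);
  if (rows.length === 0) return 0;
  const total = rows.reduce((sum, r) => sum + r.blockers, 0);
  return total / rows.length;
}

// Toy data, not the real corpus.
const sample: ReviewResult[] = [
  { prId: "pr-1", reviewer: "claude-solo", blockers: 4 },
  { prId: "pr-1", reviewer: "council", blockers: 11 },
  { prId: "pr-2", reviewer: "claude-solo", blockers: 3 },
  { prId: "pr-2", reviewer: "council", blockers: 10 },
];

console.log(blockersPerPr(sample, "claude-solo")); // 3.5
console.log(blockersPerPr(sample, "council")); // 10.5
```

Note the catch-rate on synthetic bugs is the same for both setups; only the depth (count per PR) differs.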

The pattern, after debugging the gap: one model overlooks a missing test that another catches. What Claude calls
"minor", Gemini escalates to a blocker. A single agent has no second perspective.
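The escalation pattern above can be sketched as a severity merge across agents. This is my own minimal illustration of the idea, not Conclave's actual deliberation code; the types and the keep-the-highest-severity rule are assumptions:

```typescript
// Severity-escalation sketch: union findings from all agents, and where
// they disagree on severity, keep the highest.
type Severity = "minor" | "major" | "blocker";
const rank: Record<Severity, number> = { minor: 0, major: 1, blocker: 2 };

interface Finding {
  id: string; // stable key, e.g. "missing-test:src/auth.ts" (hypothetical)
  severity: Severity;
  agent: string;
}

function deliberate(findingsByAgent: Finding[][]): Map<string, Finding> {
  const merged = new Map<string, Finding>();
  for (const findings of findingsByAgent) {
    for (const f of findings) {
      const prev = merged.get(f.id);
      // A "minor" from one model becomes a blocker if another says so.
      if (!prev || rank[f.severity] > rank[prev.severity]) {
        merged.set(f.id, f);
      }
    }
  }
  return merged;
}

const claude: Finding[] = [
  { id: "missing-test:auth", severity: "minor", agent: "claude" },
];
const gemini: Finding[] = [
  { id: "missing-test:auth", severity: "blocker", agent: "gemini" },
];
console.log(deliberate([claude, gemini]).get("missing-test:auth")?.severity);
// "blocker"
```

The real system deliberates rather than mechanically taking a max, but the max-severity union is the simplest model of why a second perspective adds findings without lowering the catch rate.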

The bigger feature is PRD-aware review: agents read .conclave/prd.md and flag spec deviations as first-class
blockers. Scope creep, route mismatches, forgotten acceptance criteria.
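As a rough sketch of the "forgotten acceptance criteria" check: pull checklist items out of the PRD markdown and flag any that the PR never mentions. The parsing and substring matching here are my own simplification; Conclave's agents do this with LLM judgment, not string search:

```typescript
// Extract "- [ ] ..." / "- [x] ..." acceptance-criteria lines from PRD markdown
// (the checklist convention is an assumption about how a prd.md might look).
function parseCriteria(prdMarkdown: string): string[] {
  return prdMarkdown
    .split("\n")
    .filter((line) => /^\s*-\s*\[[ x]\]/i.test(line))
    .map((line) => line.replace(/^\s*-\s*\[[ x]\]\s*/i, "").trim());
}

// Flag criteria that the PR text never references -- candidates for
// "forgotten acceptance criteria" blockers.
function forgottenCriteria(criteria: string[], prText: string): string[] {
  const haystack = prText.toLowerCase();
  return criteria.filter((c) => !haystack.includes(c.toLowerCase()));
}

// Toy PRD and PR body; in practice the PRD comes from .conclave/prd.md.
const prd = "# PRD\n- [ ] rate limit login endpoint\n- [ ] audit log on delete\n";
const prBody = "Adds audit log on delete for admin actions.";
console.log(forgottenCriteria(parseCriteria(prd), prBody));
// [ "rate limit login endpoint" ]
```

Naive matching like this would obviously drown in false positives on real PRs, which is exactly why the deliberation layer sits on top; the sketch only shows where the PRD enters the pipeline.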

What I need:

  • Run it on a real PR and tell me where it's wrong
  • Compare it against your usual reviewer (Claude / Cursor / human)
  • Send me false positives; I fold them into a federated failure catalog

How to try it:

Source-available (FSL-1.1-Apache-2.0): https://github.com/seunghunbae-3svs/conclave-ai
Stack: TS / Node 20 / Cloudflare Workers + Containers + D1 / Mastra. 26 packages, 2691 tests.

Limitations I know:

  • Beta, things break
  • Cost scaling on large diffs untested
  • Spec-mismatch only useful if you maintain a PRD
  • I'm one person pair-programming with Claude (bus factor: 1)

If the numbers don't survive contact with real codebases, I want to know. Poke holes.

https://github.com/seunghunbae-3svs/conclave-ai
