I kept asking a single LLM for decisions and getting one confident answer with zero accountability. So I built Boardroom: instead of one reply, a decision goes through a board of AI directors that argue — and the verdict gets scored against reality over time.
npx boardroom-ai
It's open source, zero-dependency Node, ~2k lines: https://github.com/leokwon68/boardroom-ai
The problem with one-shot answers
Ask one model "should I raise prices 20%?" and you get a fluent, confident paragraph. But there's no dissent, no one trying to prove it wrong, and crucially no record of whether it was right. You can't tell a lucky guess from a good call.
How a meeting runs
- Positions (R1) — 3+ director personas answer independently, each with a fixed lens (numbers, execution reality, risk).
- Cross-examination (R2) — each director attacks the weakest claim on the table and may change their mind.
- Verdict — a chair rules with a confidence score and a falsifier: "here's what would prove this wrong, and by when."
- Red team — a separate pass tries to kill the verdict. If it survives, confidence is adjusted down.
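Here's a rough sketch of how those four stages could chain together. It is not the actual Boardroom code: `runMeeting`, the `model` object and its four methods, the field names, and the 0-100 confidence scale are all assumptions made for illustration. Real model calls would be async; the stand-ins below are synchronous so the control flow stays easy to read.

```javascript
// Hypothetical sketch: `model` bundles four stand-ins for LLM calls.
function runMeeting(question, directors, model) {
  // R1: every director answers independently through a fixed lens.
  const positions = directors.map((d) => ({
    seat: d.name,
    claim: model.position(d, question),
  }));

  // R2: each director sees the other seats' claims and may revise their own.
  const revised = positions.map((own, i) => ({
    seat: own.seat,
    claim: model.crossExamine(
      directors[i],
      own,
      positions.filter((_, j) => j !== i)
    ),
  }));

  // Verdict: the chair rules with a confidence score and a dated falsifier.
  const verdict = model.chair(question, revised);

  // Red team: a separate pass tries to kill the verdict.
  const attack = model.redTeam(verdict);
  if (!attack.survives) {
    return { ...verdict, status: "killed", objections: attack.objections };
  }
  // A surviving verdict still loses confidence (the penalty size is assumed).
  const confidence = Math.max(0, verdict.confidence - attack.penalty);
  return { ...verdict, confidence, status: "stands", objections: attack.objections };
}

// Canned answers instead of real model calls; all values are made up.
const stubModel = {
  position: (d) => `${d.lens}: hold prices for now`,
  crossExamine: (d, own, others) => own.claim,
  chair: () => ({
    ruling: "Hold prices",
    confidence: 70,
    falsifier: "A 5% test raise causes no extra churn",
    reviewBy: "2025-06-30",
  }),
  redTeam: () => ({ survives: true, penalty: 10, objections: ["Small sample"] }),
};

const directors = [
  { name: "CFO", lens: "numbers" },
  { name: "COO", lens: "execution reality" },
  { name: "CRO", lens: "risk" },
];

const result = runMeeting("Should I raise prices 20%?", directors, stubModel);
console.log(result.status, result.confidence); // stands 60
```

Passing the model calls in as an argument keeps the meeting logic testable without a network, which is handy when the interesting part is the sequence rather than any single prompt.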
The part I actually care about: a batting average
Every verdict lands in a ledger with a review date. Later you mark it hit or miss, and the board keeps a running average. It's the first time I've had an AI that has to face its own track record instead of being confidently wrong forever.
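To make the ledger idea concrete, here's a minimal in-memory sketch. The names (`createLedger`, `record`, `mark`, `due`, `battingAverage`) and the entry fields are hypothetical, not Boardroom's real storage format. Dates are assumed to be ISO `YYYY-MM-DD` strings, which compare correctly as plain strings.

```javascript
// Hypothetical ledger: an in-memory array; field names are assumptions.
function createLedger() {
  const entries = [];
  return {
    // File a verdict with the date it should be reviewed on.
    record(verdict, reviewDate) {
      const entry = { id: entries.length + 1, verdict, reviewDate, outcome: null };
      entries.push(entry);
      return entry.id;
    },
    // A human scores the verdict later as "hit" or "miss".
    mark(id, outcome) {
      if (outcome !== "hit" && outcome !== "miss") {
        throw new Error(`unknown outcome: ${outcome}`);
      }
      const entry = entries.find((e) => e.id === id);
      if (!entry) throw new Error(`no ledger entry ${id}`);
      entry.outcome = outcome;
    },
    // Verdicts whose review date has passed but that nobody has scored yet.
    due(today) {
      return entries.filter((e) => e.outcome === null && e.reviewDate <= today);
    },
    // Hits divided by scored verdicts; unscored entries are left out.
    battingAverage() {
      const scored = entries.filter((e) => e.outcome !== null);
      if (scored.length === 0) return null;
      return scored.filter((e) => e.outcome === "hit").length / scored.length;
    },
  };
}

// Example data is invented for the demo.
const ledger = createLedger();
const a = ledger.record("Hold prices", "2025-03-01");
const b = ledger.record("Hire a second engineer", "2025-04-01");
const c = ledger.record("Drop the free tier", "2025-09-01");
ledger.mark(a, "hit");
ledger.mark(b, "miss");

console.log(ledger.battingAverage()); // 0.5
console.log(ledger.due("2025-10-01").length); // 1
```

Returning `null` for an empty record keeps "no data yet" distinct from a genuine 0% average.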
It can also execute
Approve a verdict and an executor runs the plan with a real browser (Playwright MCP), shell, and files — then files evidence. Risky steps (payments, posting, trades) are always held for a human. Nothing irreversible happens without one click.
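The gating rule can be sketched as a simple check before each step runs. Everything here is an assumption for illustration: the step shape (`id`, `kind`, `run`), the `RISKY_KINDS` set (taken from the payments / posting / trades examples above), and the choice to halt the whole plan at the first unapproved risky step.

```javascript
// Assumed risk categories, mirroring the examples in the post.
const RISKY_KINDS = new Set(["payment", "post", "trade"]);

// Hypothetical executor loop: runs steps in order and stops at the first
// risky step that a human has not explicitly approved.
function executePlan(steps, approvedIds = new Set()) {
  const evidence = [];
  for (const step of steps) {
    const needsHuman = RISKY_KINDS.has(step.kind) && !approvedIds.has(step.id);
    if (needsHuman) {
      return { evidence, heldAt: step.id };
    }
    // Safe or approved: run it and file what happened as evidence.
    evidence.push({ id: step.id, kind: step.kind, result: step.run() });
  }
  return { evidence, heldAt: null };
}

// Made-up plan; `run` would drive a browser, shell, or file write in practice.
const plan = [
  { id: "s1", kind: "browse", run: () => "pricing page screenshot saved" },
  { id: "s2", kind: "payment", run: () => "invoice paid" },
  { id: "s3", kind: "file", run: () => "report written" },
];

const firstRun = executePlan(plan);
console.log(firstRun.heldAt); // s2
console.log(firstRun.evidence.length); // 1
```

Calling `executePlan(plan, new Set(["s2"]))` models the one-click approval and lets the plan run to the end. This sketch restarts from the top on each call; a real executor would presumably remember which steps are already done.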
Tech notes
- Zero dependencies. Engine + HTTP server + Telegram bridge are pure Node + fetch.
- Runs on Claude Code with no API key, or paste one key (Anthropic / OpenAI / Gemini).
- Mix models per seat — chair on one model, a GPT seat next to a Claude seat. Routing picks the provider per model id.
- npx boardroom-ai opens a local web UI; there's also a hosted version: https://boardroom-cloud.vercel.app
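For the per-seat routing mentioned above, one plausible approach is a prefix lookup on the model id. This is a guess at the shape, not the project's actual routing: the prefix table, the `providerFor` name, and the placeholder model ids are all hypothetical.

```javascript
// Assumed prefix table; real model ids and the project's routing may differ.
const PROVIDER_PREFIXES = [
  ["claude", "anthropic"],
  ["gpt", "openai"],
  ["gemini", "gemini"],
];

// Pick a provider from the model id, failing loudly on unknown ids.
function providerFor(modelId) {
  const id = modelId.toLowerCase();
  const match = PROVIDER_PREFIXES.find(([prefix]) => id.startsWith(prefix));
  if (!match) throw new Error(`no provider for model id: ${modelId}`);
  return match[1];
}

// Placeholder ids, not real model names.
const seats = [
  { seat: "chair", model: "claude-example-1" },
  { seat: "numbers", model: "gpt-example-1" },
  { seat: "risk", model: "gemini-example-1" },
];

for (const s of seats) {
  console.log(`${s.seat} -> ${providerFor(s.model)}`);
}
```

Throwing on an unrecognized id surfaces a typo in a seat's config at startup instead of midway through a meeting.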
Honest limitations
- Outcome scoring is manual — you mark hit/miss. (Automating it fairly is hard.)
- The debate can groupthink if personas are too similar. Diverse lenses matter — you can even name seats after real people (a Buffett seat on margin-of-safety, a Dalio seat on macro).
- The executor is powerful, so I keep it gated on purpose.
The question I'm chewing on
Is a batting average for an AI's judgment genuinely useful — a forcing function that makes the thing earn trust — or is it just theater? Would love this community's take.