I built an AI board of directors that debates a decision, then scores itself

leokwon68 — Thu, 11 Jun 2026 16:55:39 +0000

I kept asking a single LLM for decisions and getting one confident answer with zero accountability. So I built Boardroom: instead of one reply, a decision goes through a board of AI directors that argue — and the verdict gets scored against reality over time.

npx boardroom-ai

It's open source, zero-dependency Node, ~2k lines: https://github.com/leokwon68/boardroom-ai

The problem with one-shot answers

Ask one model "should I raise prices 20%?" and you get a fluent, confident paragraph. But there's no dissent, no one trying to prove it wrong, and, crucially, no record of whether it was right. You can't tell a lucky guess from a good call.

How a meeting runs

Positions (R1) — 3+ director personas answer independently, each with a fixed lens (numbers, execution reality, risk).
Cross-examination (R2) — each director attacks the weakest claim on the table and may change their mind.
Verdict — a chair rules with a confidence score and a falsifier: "here's what would prove this wrong, and by when."
Red team — a separate pass tries to kill the verdict. If it survives, confidence is adjusted down.
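
To make the four stages concrete, here is a minimal sketch of one meeting. It is not Boardroom's actual code: runMeeting, the ask callback, the 0-100 confidence scale, and the penalty field are all assumptions for illustration, and the stubbed answers are made up.

```javascript
// Hypothetical sketch of one meeting; not Boardroom's actual code.
// `ask(role, prompt)` stands in for a model call and returns a plain object.
// A real model call would be async; it is synchronous here to keep the sketch short.
function runMeeting(question, directors, ask) {
  // R1: each director answers independently through a fixed lens.
  const positions = directors.map((d) => ({
    director: d.name,
    ...ask(d.name, `As the ${d.lens} director, answer: ${question}`),
  }));

  // R2: each director sees every position and may revise their own.
  const revised = positions.map((p) => ({
    director: p.director,
    ...ask(p.director, `Attack the weakest claim in: ${JSON.stringify(positions)}`),
  }));

  // Verdict: the chair rules with a confidence (assumed 0-100) and a falsifier.
  const verdict = ask("chair", `Rule on: ${JSON.stringify(revised)}`);

  // Red team: a separate pass tries to kill the verdict.
  const attack = ask("red-team", `Try to kill: ${JSON.stringify(verdict)}`);
  if (attack.killed) return { ...verdict, status: "killed" };

  // Survived: confidence is still adjusted down by the red team's penalty.
  const confidence = Math.max(0, verdict.confidence - attack.penalty);
  return { ...verdict, confidence, status: "survived" };
}

// Stubbed model with made-up answers, so the sketch runs offline.
const stubAsk = (role) => {
  if (role === "chair") {
    return {
      decision: "raise prices 10%",
      confidence: 70,
      falsifier: "churn rises above 5%",
      reviewBy: "2026-09-01",
    };
  }
  if (role === "red-team") return { killed: false, penalty: 10 };
  return { stance: "for", claim: `${role} sees upside` };
};

const directors = [
  { name: "numbers", lens: "numbers" },
  { name: "execution", lens: "execution reality" },
  { name: "risk", lens: "risk" },
];
const result = runMeeting("Should I raise prices 20%?", directors, stubAsk);
console.log(result.status, result.confidence); // survived 60
```

Passing ask in as a parameter keeps the flow testable without a key; swapping the stub for a real provider call would mean making runMeeting async.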

The part I actually care about: a batting average

Every verdict lands in a ledger with a review date. Later you mark it hit or miss, and the board keeps a running average. It's the first time I've had an AI that has to face its own track record instead of being confidently wrong forever.
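
The arithmetic behind the average is simple enough to sketch. The ledger shape below (id, reviewBy, outcome) is an assumption rather than Boardroom's real schema, and the entries are invented; the one design choice worth noting is that unreviewed verdicts are left out of the denominator.

```javascript
// Hypothetical ledger entries; field names are assumptions, not Boardroom's schema.
const ledger = [
  { id: 1, reviewBy: "2026-07-01", outcome: "hit" },
  { id: 2, reviewBy: "2026-07-15", outcome: "miss" },
  { id: 3, reviewBy: "2026-08-01", outcome: "hit" },
  { id: 4, reviewBy: "2026-09-01", outcome: null }, // not reviewed yet
];

// Batting average = hits / scored verdicts. Pending verdicts are excluded,
// so an unreviewed call counts neither for nor against the board.
function battingAverage(entries) {
  const scored = entries.filter((e) => e.outcome === "hit" || e.outcome === "miss");
  if (scored.length === 0) return null; // no track record yet
  const hits = scored.filter((e) => e.outcome === "hit").length;
  return hits / scored.length;
}

// Verdicts whose review date has passed but that nobody has marked yet.
// ISO dates (YYYY-MM-DD) compare correctly as plain strings.
function dueForReview(entries, today) {
  return entries.filter((e) => e.outcome === null && e.reviewBy <= today);
}

console.log(battingAverage(ledger).toFixed(3)); // 0.667
console.log(dueForReview(ledger, "2026-09-02").map((e) => e.id)); // [ 4 ]
```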

It can also execute

Approve a verdict and an executor runs the plan with a real browser (Playwright MCP), a shell, and file access, then files evidence. Risky steps (payments, posting, trades) are always held for a human. Nothing irreversible happens without a human clicking to approve it.
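
One way to picture the gate, as a sketch under stated assumptions: the step kinds, the runUntilGate name, and the plan shape are hypothetical, and the execute callback stands in for the real browser, shell, and file actions.

```javascript
// Hypothetical step kinds that must wait for a human.
// The post names payments, posting, and trades as the risky categories.
const RISKY_KINDS = new Set(["payment", "post", "trade"]);

// Run steps in order and stop at the first risky step nobody has approved.
// `execute` stands in for the real side effect.
function runUntilGate(steps, approvedIds, execute) {
  const done = [];
  for (const step of steps) {
    if (RISKY_KINDS.has(step.kind) && !approvedIds.has(step.id)) {
      return { done, heldAt: step.id }; // stop and wait for a human
    }
    execute(step);
    done.push(step.id);
  }
  return { done, heldAt: null };
}

const plan = [
  { id: "a", kind: "browse" },
  { id: "b", kind: "payment" },
  { id: "c", kind: "file" },
];
const log = [];
const first = runUntilGate(plan, new Set(), (s) => log.push(s.id));
console.log(first); // { done: [ 'a' ], heldAt: 'b' }
```

Stopping at the first held step, rather than skipping past it, keeps later steps from running ahead of a payment they may depend on. A real executor would also need to remember which steps already ran before resuming; this sketch does not.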

Tech notes

Zero dependencies. Engine + HTTP server + Telegram bridge are pure Node + fetch.
Runs on Claude Code with no API key, or with a single pasted key (Anthropic / OpenAI / Gemini).
Mix models per seat — chair on one model, a GPT seat next to a Claude seat. Routing picks the provider per model id.
npx boardroom-ai opens a local web UI; there's also a hosted version: https://boardroom-cloud.vercel.app
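
Per-seat routing could look roughly like the sketch below. The prefix table, providerFor, and the placeholder model ids are assumptions for illustration; the actual routing rules in the repo may differ.

```javascript
// Hypothetical prefix-to-provider table; ids and rules are illustrative only.
const PROVIDER_PREFIXES = [
  ["claude", "anthropic"],
  ["gpt", "openai"],
  ["gemini", "google"],
];

// Pick a provider from the model id alone, failing loudly on unknown ids.
function providerFor(modelId) {
  const id = modelId.toLowerCase();
  const match = PROVIDER_PREFIXES.find(([prefix]) => id.startsWith(prefix));
  if (!match) throw new Error(`No provider known for model id: ${modelId}`);
  return match[1];
}

// Placeholder model ids, one per seat.
const seats = {
  chair: "claude-example",
  numbers: "gpt-example",
  risk: "gemini-example",
};
const routing = Object.fromEntries(
  Object.entries(seats).map(([seat, model]) => [seat, providerFor(model)])
);
console.log(routing);
// { chair: 'anthropic', numbers: 'openai', risk: 'google' }
```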

Honest limitations

Outcome scoring is manual — you mark hit/miss. (Automating it fairly is hard.)
The debate can slide into groupthink if personas are too similar. Diverse lenses matter: you can even name seats after real people (a Buffett seat on margin-of-safety, a Dalio seat on macro).
The executor is powerful, so I keep it gated on purpose.

The question I'm chewing on

Is a batting average for an AI's judgment genuinely useful — a forcing function that makes the thing earn trust — or is it just theater? Would love this community's take.

Repo: https://github.com/leokwon68/boardroom-ai

DEV Community: leokwon68