Suat

I Built a Multi-LLM Debate Engine That Fact-Checks Itself in Real Time

When you ask one LLM a question, you get one answer. When you ask five LLMs the same question, you get five answers and no way to tell which is right.

The naive fix — make them vote, or make them argue, or summarize them all — turns out to make things worse, not better. LLMs are prone to sycophancy; when one confidently states a wrong fact, the others tend to concede rather than push back. Add a summarizer on top and you get a polished, cited-looking answer that is confidently wrong.

I wanted a different shape: a structured debate between agents with different roles, plus a sixth agent whose only job is to fact-check the others mid-debate — before any of them gets a chance to agree with a hallucination.

This post is a walkthrough of what I built, why it works, and where it doesn't. The code is on GitHub under MIT: capitansuat/swarm-debate.

The shape of the problem

Imagine you ask five LLMs: "Is Acme Corp's recent acquisition of Beta Inc going to close by year end?"

You'll get responses that sound like this, rewritten for brevity:

Model A: "Morgan Stanley's November 28 M&A tracker shows the deal at 85% approval probability..."
Model B: "According to the DOJ Second Request docket DOJ-HSR-2025-4471..."
Model C: "The Wall Street Journal reported on October 17 that both parties received antitrust clearance..."

One of those is real. The other two are fabrications — a made-up Morgan Stanley tracker (dated in the future, which makes it impossible), and a DOJ docket number that doesn't exist.

A human reading the three responses side by side can probably spot the odd ones out. A downstream pipeline that summarizes them into a single answer cannot, because the fabricated citations carry the same rhetorical weight as the real one. Ask a second LLM to synthesize the three and the odds it flags the fabrications are low; it will more likely produce a smoothly paraphrased answer that treats all three sources as equivalent.

The pattern I borrowed

While reading about Mixture-of-Experts language models, I came across the shared expert pattern. In an MoE model with routing, each input token selects K experts to process it. But some architectures also include one shared expert that runs on every token, regardless of what the router picks. The shared expert handles general competence; the routed experts handle specialization.

This is a strong structural answer to the debate problem: what if the "shared expert" in a multi-agent system is just... a fact-checker?

The shape would look like this:

```
Round 1:
  Analyst -> opinion
  Strategist -> opinion
  Devil's Advocate -> opinion
  Researcher -> opinion
  Validator -> reads all four, fact-checks every concrete claim

Round 2:
  Each persona sees the previous round's output
  AND the validator's findings (OK / WARN / FAIL markers)
  AND is told: "Do not use claims marked FAIL"
  ... generates a new, hopefully more grounded opinion
  Validator runs again on the new outputs

Round 3: same pattern, then synthesize
```
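The loop above can be sketched in plain Python. This is a minimal sketch of the round structure, not the engine's actual code; names like `ask_persona` and `run_validator` are placeholder callables standing in for real LLM calls:

```python
def build_prompt(topic, transcript, findings):
    """Assemble a persona prompt from the topic, the previous round,
    and the validator's carried-forward findings."""
    parts = [f"Topic: {topic}"]
    if transcript:
        last = transcript[-1]["opinions"]
        parts += [f"{p} said: {text}" for p, text in last.items()]
    fails = [f for f in findings if f.startswith("[FAIL]")]
    if fails:
        parts.append("Claims marked FAIL were verified wrong; do not reuse them:")
        parts += fails
    return "\n".join(parts)

def run_debate(topic, personas, rounds, ask_persona, run_validator):
    """Fixed-round debate loop: personas speak, then the Validator
    (who never debates) checks them, and its findings feed the next round."""
    findings = []      # validator markers carried into the next round
    transcript = []
    for _ in range(rounds):
        opinions = {}
        for persona in personas:
            prompt = build_prompt(topic, transcript, findings)
            opinions[persona] = ask_persona(persona, prompt)
        findings = run_validator(opinions)  # reads all opinions, takes no side
        transcript.append({"opinions": opinions, "findings": findings})
    return transcript
```

The only state that crosses round boundaries is the transcript and the marker list, which is what keeps the Validator out of the debate itself.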

The key design choices:

  1. The Validator does not debate. It doesn't take sides, doesn't argue, only verifies.
  2. Validator output is filtered before injection. Other agents see only the structured markers, not the full validator reasoning. Otherwise they start quoting the validator as a peer, which defeats the point.
  3. FAIL findings carry forward explicitly. The next round's prompt literally says "claims marked FAIL were verified wrong; do not reuse them." This is not subtle; it's what makes the pattern work.
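Design choice 2 can be implemented as little more than a line filter: keep only lines that start with a marker and drop the validator's free-form reasoning. A sketch, assuming the validator emits one marker per line as in the format above:

```python
import re

# Lines the other agents are allowed to see: [OK], [WARN], or [FAIL] markers.
MARKER = re.compile(r"^\[(OK|WARN|FAIL)\]")

def filter_validator_output(raw: str) -> list[str]:
    """Keep only structured marker lines; drop the validator's surrounding
    reasoning so the other agents can't start quoting it as a peer."""
    return [line.strip() for line in raw.splitlines()
            if MARKER.match(line.strip())]
```

Anything the validator says outside the marker format simply never reaches the debaters.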

What the Validator actually sees

The Validator's system prompt is strict and narrow. Paraphrased:

```
You are a validator. You do NOT participate in the debate.
Read what was said this round. Identify verifiable claims:
numbers, dates, company names, reports, URLs, events.

For each concrete claim, you MUST use web_search to verify.
Future-dated source claims (e.g. "May 25 report" cited on April 24)
are automatically [FAIL].

Output format:
  [OK]   <claim> — verified, source URL: ...
  [WARN] <claim> — suspect, reason: ...
  [FAIL] <claim> — fabricated or wrong, correction: ..., source URL: ...
```
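The future-dating rule is mechanical enough that you don't even need the LLM for it. A sketch of an out-of-band check; the date format it recognizes ("Month DD" or "Month DD, YYYY") is my simplifying assumption, and real claims would need richer date parsing:

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_NUM = {m: i + 1 for i, m in enumerate(MONTHS)}

# Matches "November 28" or "November 28, 2025" style citations.
DATE_RE = re.compile(r"(" + "|".join(MONTHS) + r")\s+(\d{1,2})(?:,\s*(\d{4}))?")

def is_future_dated(claim: str, today: date) -> bool:
    """True if the claim cites a date after `today` (an automatic [FAIL]).
    A claim with no year is assumed to mean the current year."""
    m = DATE_RE.search(claim)
    if not m:
        return False
    month, day, year = m.group(1), int(m.group(2)), m.group(3)
    cited = date(int(year) if year else today.year, MONTH_NUM[month], day)
    return cited > today
```

Catching impossible dates deterministically means the validator's search budget goes to the claims that actually need verification.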

Run this on the acquisition example and you get something like:

```
[OK]   WSJ reported antitrust clearance on Oct 17 (wsj.com/articles/..., 2025-10-17)
[FAIL] "Morgan Stanley M&A tracker, November 28" — today is October 20, future-dated
[FAIL] "DOJ Second Request docket DOJ-HSR-2025-4471" — no such filing in PACER or DOJ records
[WARN] "85% approval probability" — probability figure unsourced, no widely published tracker confirms
```

Those FAIL lines get inlined into the next round's prompt. Model A, which fabricated the Morgan Stanley citation, reads its own claim marked [FAIL] and is told not to reuse it. In my test runs, the same model, given the same topic, in the very next round, correctly drops the fabrication and reframes its argument around real data. No fine-tuning, no retraining — just structured feedback during generation.
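Parsing the markers into structured findings makes that carry-forward explicit. A sketch with field names of my own choosing, assuming the claim and the note are separated by a plain hyphen (the engine's real separator and schema may differ):

```python
import re
from dataclasses import dataclass

@dataclass
class Finding:
    status: str   # "OK", "WARN", or "FAIL"
    claim: str
    note: str     # correction, reason, or source URL

# "[FAIL] <claim> - <note>"; the note part is optional.
LINE_RE = re.compile(r"^\[(OK|WARN|FAIL)\]\s+(.*?)(?:\s+-\s+(.*))?$")

def parse_findings(lines):
    out = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            out.append(Finding(m.group(1), m.group(2), m.group(3) or ""))
    return out

def fail_block(findings):
    """Text inlined into the next round's persona prompts."""
    fails = [f for f in findings if f.status == "FAIL"]
    if not fails:
        return ""
    lines = ["The following claims were verified WRONG. Do not reuse them:"]
    lines += [f"- {f.claim} ({f.note})" for f in fails]
    return "\n".join(lines)
```

Keeping the injection as a dumb string template, rather than asking another model to summarize the findings, is what makes the feedback loop auditable in the log.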

Before/after numbers from a real run

I ran the same 4-persona × 3-round debate twice on the same topic. The only difference: the first run had a broken Validator (timeouts mid-round so most fact-checks didn't land). The second had the Validator running cleanly every round.

| Metric | Run 1 (broken validator) | Run 2 (clean) |
| --- | --- | --- |
| Persona calls completed | 9/12 | 12/12 |
| Validator rounds that ran | 1/3 | 3/3 |
| Fabricated citations in log | 2 | 0 |
| Validator FAIL markers | 1 | 3 |
| Verified source URLs in log | ~5 | ~20 |
| Total runtime | 26 min | 30 min |

Four extra minutes of runtime. Two fewer fabrications surviving to the synthesis step. For any downstream use that treats the synthesis as input — a decision support pipeline, a summary for a human in a hurry, a training dataset — this is a disproportionately good trade.

The implementation is boring (intentionally)

The engine is one Python file, under 600 lines, pure stdlib plus PyYAML. Personas are YAML. Providers are OpenAI-compatible HTTP endpoints with a dispatcher that also knows how to shell out to CLI tools (useful if you already pay for a chat subscription and would rather reuse that access than buy API credits).

```
swarm-debate/
├── src/
│   ├── swarm_debate.py    # the engine
│   ├── config.yaml        # providers, timeouts
│   └── personas.yaml      # the six roles
├── examples/
│   ├── topics.md          # topics that produce good debates
│   └── product-brief-...  # example context document
```

I deliberately kept model names and auth patterns out of the hot path. Which model backs each persona lives in personas.yaml; the engine itself doesn't care. You can run the whole thing on local Ollama if you want, or mix cheap local models for the debating personas with a single cloud-backed model for the Validator.
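For orientation, a persona entry might look roughly like this. The field names are my guess at the schema, not the repo's actual format; check src/personas.yaml for the real fields:

```yaml
# Hypothetical schema - see src/personas.yaml in the repo for the real one.
validator:
  provider: openai_compatible   # or a CLI tool, or local Ollama
  model: your-model-here
  reasoning_effort: high        # the highest-leverage quality knob for this role
  system_prompt: |
    You are a validator. You do NOT participate in the debate...
```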

Things that surprised me

The validator is the bottleneck by a wide margin. On my setup, the debating personas each took 30-180 seconds per round. The Validator took 300+ seconds because it has to read all four persona outputs and run a web search per claim. If you want this faster, lowering reasoning effort on the validator is the single highest-leverage knob.

Quality is non-linear in reasoning effort for the validator specifically. Cheap validator = performative. It nods at claims without actually looking anything up. It might say [OK] "according to Reuters" without verifying that Reuters actually said the thing. You can tell from the log: a good validator produces URLs; a cheap one produces vague attributions. This matches the intuition that fact-checking is harder than answering.

Personas with single-responsibility prompts outperform multi-responsibility prompts. An early version had the Researcher persona double as the validator — "when you research, also fact-check the others." Argument quality dropped, fact-check quality dropped, and both responsibilities became half-hearted. Splitting them fixed both.

What's not solved

A few things I left on the roadmap because I didn't want to ship speculative solutions:

  • Round adaptivity. All debates run a fixed number of rounds. Most topics converge by round 3 anyway, but "no new information" detection would save time on easy questions.
  • Async validator. The validator currently blocks the next round. Running it in parallel is straightforward but changes the injection semantics.
  • Meta-validator. Two validators from different model families, disagreements flagged. Cheap insurance against validator-specific failure modes.
  • Persona reliability metrics. Track which personas accumulate the most FAIL markers in your domain. In my runs one persona was noticeably more prone to fabrication than the others; I'd rather surface that data than guess.
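For the round-adaptivity item above, the simplest "no new information" check I'd try is embedding-free text overlap between consecutive rounds. A sketch of the idea, not something that exists in the repo:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def converged(prev_opinions: dict, curr_opinions: dict, threshold: float = 0.8) -> bool:
    """Stop early if every persona's new opinion heavily overlaps its last one."""
    return all(jaccard(prev_opinions[p], curr_opinions[p]) >= threshold
               for p in curr_opinions)
```

Word overlap is a crude proxy for "nothing new was said", but it's cheap enough to run every round and easy to tune against real logs before reaching for embeddings.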

Try it

```bash
git clone https://github.com/capitansuat/swarm-debate.git
cd swarm-debate
pip install -r requirements.txt
mkdir -p swarm/debates
cp src/*.yaml swarm/

python3 src/swarm_debate.py \
  --topic "Should we migrate our 65kLOC TypeScript backend to Rust?" \
  --agents analyst,strategist,devils_advocate,researcher \
  --rounds 3
```

Edit swarm/personas.yaml to point at whatever providers you have (API keys, CLI tools, local Ollama — any combination works). The dispatcher figures out which path to use based on what's configured.

The output is a Markdown log with all rounds, validator findings, and a synthesis section ready to pipe into a strong model for the final answer.

Credit

The shared-expert idea came from reading the OpenMythos community repository — a speculative reconstruction of a hypothetical MoE language model. OpenMythos is architecture-level speculation rather than a runnable model, and its specific claims about actual production systems are unverified, but the structural idea of one expert always running alongside the routed experts is a real pattern in MoE research and it translates cleanly into multi-agent systems.

If you run it on a topic I wouldn't think to try, I'd like to see the log. Open an issue with the result attached — it's the kind of feedback that tells me whether the pattern generalizes or works only in my specific workload.

Repo: capitansuat/swarm-debate
