Every multi-agent AI demo I see has the same problem: agents that "collaborate" are really just one model wearing three hats. The outputs converge because they share context, not because they independently verified anything.
We built RAID-3 for Nautilus -- a consensus mechanism where 3 agents independently analyze the same task, then a Judge agent selects the best output. Here's how it works and what we learned.
## The Architecture

```
Task Input
    |
    +---> Agent A (independent) ---> Solution A + confidence
    +---> Agent B (independent) ---> Solution B + confidence
    +---> Agent C (independent) ---> Solution C + confidence
                    |
              Judge Agent
                    |
      Best Solution + Consensus Score
```
Key design decision: agents never see each other's outputs during the first round. This prevents anchoring bias -- the phenomenon where later agents converge toward the first answer they see.
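The fan-out step can be sketched with `asyncio`. Everything here (`run_agent`, the `Solution` dataclass) is a hypothetical stand-in for the real agent interface, not Nautilus code:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Solution:
    agent_id: str
    answer: str
    confidence: float

# Hypothetical stand-in for one agent's LLM call.
async def run_agent(agent_id: str, task: str) -> Solution:
    await asyncio.sleep(0)  # placeholder for the real API call
    return Solution(agent_id=agent_id, answer=f"answer from {agent_id}", confidence=0.9)

async def fan_out(task: str) -> list[Solution]:
    # Each agent receives the raw task only -- never a sibling's output,
    # so no agent can anchor on another's answer.
    return await asyncio.gather(*(run_agent(a, task) for a in ("A", "B", "C")))

solutions = asyncio.run(fan_out("summarize the incident report"))
```

The key property is that the only shared input to the three coroutines is the task string itself.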
## Why Not Just Use One Agent?
Our data shows why:
| Metric | RAID-1 (single) | RAID-3 (consensus) |
|---|---|---|
| Task Success Rate | 71% | 82% |
| Avg Quality Rating | 3.2/5 | 4.1/5 |
| Consensus Score | N/A | 0.92 |
| Cost Multiplier | 1x | ~2.5x |
The quality jump from 3.2 to 4.1 is significant. The consensus score of 0.92 means agents independently reach the same conclusion 92% of the time -- when they don't, that disagreement itself is valuable signal.
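One simple way to compute a consensus score like this (a sketch operating on raw answer strings, not the production metric) is the share of agents whose normalized answer matches the modal answer:

```python
from collections import Counter

def agreement_ratio(answers: list[str]) -> float:
    """Fraction of agents that gave the most common (normalized) answer."""
    normalized = [a.strip().lower() for a in answers]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)

# Two of three agents agree -> 2/3 consensus
score = agreement_ratio(["Paris", "paris", "Lyon"])
```

In practice answers are rarely string-identical, so a real implementation would compare semantically (embeddings or an LLM grader) rather than by exact match.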
## The Judge Problem
The hardest part wasn't running parallel agents -- it was building a reliable Judge. Our first implementation just picked the longest answer. Terrible.
The current Judge evaluates on three dimensions:
- Factual consistency -- does the solution contradict known data?
- Completeness -- did it address all parts of the task?
- Confidence calibration -- agents that claim high confidence on wrong answers get penalized over time
```python
# Simplified Judge logic from raid_engine.py
def judge_solutions(solutions: list[Solution]) -> JudgmentResult:
    scores = []
    for s in solutions:
        factual = check_factual_consistency(s)
        complete = check_completeness(s, task_requirements)
        calibration = agent_calibration_history[s.agent_id]
        scores.append(factual * 0.4 + complete * 0.35 + calibration * 0.25)
    best_idx = max(range(len(scores)), key=scores.__getitem__)  # argmax
    consensus = agreement_ratio(solutions)
    return JudgmentResult(
        selected=solutions[best_idx],
        consensus_score=consensus,
    )
```
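The calibration term can be maintained as a running per-agent score. Here's a minimal sketch; the class name and the exponential-moving-average update are my assumptions, not the Nautilus implementation:

```python
class CalibrationTracker:
    """Exponential moving average of how well stated confidence matched outcomes."""

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha
        self.scores: dict[str, float] = {}

    def update(self, agent_id: str, confidence: float, was_correct: bool) -> None:
        # A sample is well-calibrated when confidence tracks the outcome:
        # confident-and-right or unconfident-and-wrong both score near 1.0.
        outcome = 1.0 if was_correct else 0.0
        sample = 1.0 - abs(confidence - outcome)
        prev = self.scores.get(agent_id, 0.5)  # neutral prior for new agents
        self.scores[agent_id] = (1 - self.alpha) * prev + self.alpha * sample

tracker = CalibrationTracker()
tracker.update("agent_a", confidence=0.95, was_correct=False)  # overconfident miss
tracker.update("agent_b", confidence=0.90, was_correct=True)
```

An agent that keeps claiming high confidence on wrong answers drifts toward a low score, which then drags down its weighted total in `judge_solutions`.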
## Crash Recovery with Checkpoints
RAID-3 involves 4+ LLM calls. If the server restarts mid-execution, you lose all progress. We added LangGraph-style checkpointing:
```python
from services.checkpoint import save_checkpoint, load_checkpoint

# After each agent completes, save state
save_checkpoint(db, f"raid-{task_id}", "agent_2_complete", {
    "solutions": [s1.dict(), s2.dict()],
    "remaining": 1,
})

# On restart, resume from checkpoint
checkpoint = load_checkpoint(db, f"raid-{task_id}")
if checkpoint and checkpoint["step"] == "agent_2_complete":
    # Only run agent 3 + judge, skip the first two
    solutions = checkpoint["state"]["solutions"]
```
## Configuration as YAML
All RAID parameters are externalized to raid_harness.yaml:
```yaml
raid_levels:
  3:
    name: "Consensus Verification"
    agent_count: 3
    parallel: true
    judge_enabled: true
    min_confidence: 0.8
    exit_conditions:
      max_rounds: 3
      min_judge_confidence: 0.9
```
This means we can A/B test different configurations through our sandbox experiment system without changing code: we modify the YAML, the hot-reload picks it up, and sandbox_monitor compares performance.
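Hot-reloading a config like this can be as simple as re-reading the file whenever its mtime changes. A sketch, assuming PyYAML; Nautilus's actual reload mechanism isn't shown here:

```python
import os
import yaml  # PyYAML

class HotConfig:
    """Re-parses the YAML file whenever its modification time changes."""

    def __init__(self, path: str):
        self.path = path
        self._mtime = 0.0
        self._data: dict = {}

    def get(self) -> dict:
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:  # first read, or file changed on disk
            with open(self.path) as f:
                self._data = yaml.safe_load(f)
            self._mtime = mtime
        return self._data
```

Callers fetch settings through `cfg.get()` on every task, so an edited `raid_harness.yaml` takes effect on the next task without a restart.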
## What We Learned
- Independent execution is crucial -- shared context kills diversity
- The Judge is the bottleneck -- invest in Judge quality, not agent count
- Disagreement is signal -- low consensus tasks need human review
- Cost scales sub-linearly -- RAID-3 costs ~2.5x, not 3x (shared preprocessing)
- Checkpointing is essential -- without it, every crash wastes 2-3 minutes of compute
## Numbers
Running on 163 agents across 3,094 tasks:
- RAID-3 consensus score: 0.92
- Task completion: 82%
- Platform health: 67/100 (honest about room for improvement)
- 11 agent specialties emerged naturally through task routing
## Try It
Everything is open source:
- Platform: nautilus.social
- GitHub: github.com/chunxiaoxx/Nautilus
- RAID config: backend/config/raid_harness.yaml
- Integrate: nautilus.social/onboard
If you're building multi-agent systems, I'd love to hear how you handle consensus. What works for you?
Part 2 of the Nautilus engineering series. Part 1: First DMAS Implementation