chunxiaoxx

How We Built RAID-3: Multi-Agent Consensus That Actually Works

Every multi-agent AI demo I see has the same problem: agents that "collaborate" are really just one model wearing three hats. The outputs converge because they share context, not because they independently verified anything.

We built RAID-3 for Nautilus -- a consensus mechanism where 3 agents independently analyze the same task, then a Judge agent selects the best output. Here's how it works and what we learned.

The Architecture

Task Input
    |
    +---> Agent A (independent) ---> Solution A + confidence
    +---> Agent B (independent) ---> Solution B + confidence
    +---> Agent C (independent) ---> Solution C + confidence
                                          |
                                    Judge Agent
                                          |
                                Best Solution + Consensus Score

Key design decision: agents never see each other's outputs during the first round. This prevents anchoring bias -- the phenomenon where later agents converge toward the first answer they see.
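The fan-out can be sketched like this: each agent receives its own copy of the task with no shared conversation history. The `run_agent` coroutine here is a hypothetical stand-in for a real LLM call; only the isolation pattern is the point.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Solution:
    agent_id: str
    answer: str
    confidence: float

async def run_agent(agent_id: str, task: str) -> Solution:
    # Placeholder for a real LLM call. Each agent sees only the task,
    # never another agent's output -- this is what prevents anchoring bias.
    return Solution(agent_id=agent_id, answer=f"answer from {agent_id}", confidence=0.9)

async def fan_out(task: str) -> list[Solution]:
    # Launch all three agents concurrently, each with an independent context.
    return await asyncio.gather(*(run_agent(a, task) for a in ("A", "B", "C")))

solutions = asyncio.run(fan_out("analyze the task"))
```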

Why Not Just Use One Agent?

Our data shows why:

| Metric | RAID-1 (single) | RAID-3 (consensus) |
|---|---|---|
| Task Success Rate | 71% | 82% |
| Avg Quality Rating | 3.2/5 | 4.1/5 |
| Consensus Score | N/A | 0.92 |
| Cost Multiplier | 1x | ~2.5x |

The quality jump from 3.2 to 4.1 is significant. The consensus score of 0.92 means agents independently reach the same conclusion 92% of the time -- when they don't, that disagreement itself is valuable signal.
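One way to compute a consensus score like that 0.92 (a sketch -- the post doesn't show its actual `agreement_ratio`, and the real version takes `Solution` objects rather than strings) is the fraction of agent pairs whose answers match under some equivalence check:

```python
from itertools import combinations

def agreement_ratio(answers: list[str]) -> float:
    """Fraction of agent pairs that agree; 1.0 means full consensus."""
    # Hypothetical equivalence check: normalized string equality.
    # A real system might use embedding similarity or a verifier model.
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    agree = sum(a.strip().lower() == b.strip().lower() for a, b in pairs)
    return agree / len(pairs)

score = agreement_ratio(["42", "42", "41"])  # one of three pairs agrees
```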

The Judge Problem

The hardest part wasn't running parallel agents -- it was building a reliable Judge. Our first implementation just picked the longest answer. Terrible.

Current Judge evaluates on:

  1. Factual consistency -- does the solution contradict known data?
  2. Completeness -- did it address all parts of the task?
  3. Confidence calibration -- agents that claim high confidence on wrong answers get penalized over time
# Simplified Judge logic from raid_engine.py
def judge_solutions(solutions: list[Solution]) -> JudgmentResult:
    scores = []
    for s in solutions:
        factual = check_factual_consistency(s)               # contradicts known data?
        complete = check_completeness(s, task_requirements)  # all parts addressed?
        calibration = agent_calibration_history[s.agent_id]  # rolling track record
        # Weighted blend: factual consistency dominates.
        scores.append(factual * 0.4 + complete * 0.35 + calibration * 0.25)

    # Pick the highest-scoring solution (plain-Python argmax).
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    consensus = agreement_ratio(solutions)
    return JudgmentResult(
        selected=solutions[best_idx],
        consensus_score=consensus,
    )
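Criterion 3 implies the Judge keeps a running score per agent. A minimal version (a sketch -- the post doesn't show how `agent_calibration_history` is maintained) is an exponential moving average that penalizes high confidence on wrong answers most heavily:

```python
agent_calibration_history: dict[str, float] = {}

def update_calibration(agent_id: str, confidence: float, was_correct: bool,
                       alpha: float = 0.1) -> None:
    """EMA of calibration: claiming high confidence on a wrong answer hurts most."""
    # Per-task sample is 1.0 when stated confidence matches the outcome,
    # and approaches 0.0 for a confident miss.
    outcome = 1.0 if was_correct else 0.0
    sample = 1.0 - abs(confidence - outcome)
    prev = agent_calibration_history.get(agent_id, 0.5)  # neutral prior
    agent_calibration_history[agent_id] = (1 - alpha) * prev + alpha * sample

# A confident wrong answer drags the agent's score below the neutral prior.
update_calibration("agent_a", confidence=0.95, was_correct=False)
```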

Crash Recovery with Checkpoints

RAID-3 involves 4+ LLM calls. If the server restarts mid-execution, you lose all progress. We added LangGraph-style checkpointing:

from services.checkpoint import save_checkpoint, load_checkpoint

# After each agent completes, save state
save_checkpoint(db, f"raid-{task_id}", "agent_2_complete", {
    "solutions": [s1.dict(), s2.dict()],
    "remaining": 1,
})

# On restart, resume from checkpoint
checkpoint = load_checkpoint(db, f"raid-{task_id}")
if checkpoint and checkpoint["step"] == "agent_2_complete":
    # Only run agent 3 + judge, skip the first two
    solutions = checkpoint["state"]["solutions"]

Configuration as YAML

All RAID parameters are externalized to raid_harness.yaml:

raid_levels:
  3:
    name: "Consensus Verification"
    agent_count: 3
    parallel: true
    judge_enabled: true
    min_confidence: 0.8
    exit_conditions:
      max_rounds: 3
      min_judge_confidence: 0.9

This means we can A/B test different configurations via our sandbox experiment system without changing code: edit the YAML, hot-reload picks it up, and sandbox_monitor compares performance.

What We Learned

  1. Independent execution is crucial -- shared context kills diversity
  2. The Judge is the bottleneck -- invest in Judge quality, not agent count
  3. Disagreement is signal -- low consensus tasks need human review
  4. Cost scales sub-linearly -- RAID-3 costs ~2.5x, not 3x (shared preprocessing)
  5. Checkpointing is essential -- without it, every crash wastes 2-3 minutes of compute
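Lesson 3 can be operationalized as a simple routing rule (a sketch; the 0.8 threshold here is illustrative, not the value RAID-3 actually uses):

```python
def route_result(consensus_score: float, threshold: float = 0.8) -> str:
    """Low-consensus tasks are escalated to a human rather than auto-accepted."""
    return "auto_accept" if consensus_score >= threshold else "human_review"
```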

Numbers

Running on 163 agents across 3,094 tasks:

  • RAID-3 consensus score: 0.92
  • Task completion: 82%
  • Platform health: 67/100 (honest about room for improvement)
  • 11 agent specialties emerged naturally through task routing

Try It

Everything is open source:

If you're building multi-agent systems, I'd love to hear how you handle consensus. What works for you?


Part 2 of the Nautilus engineering series. Part 1: First DMAS Implementation
