This is a submission for the OpenClaw Challenge — OpenClaw in Action.
I Built a Multi-Agent Research Pipeline That Catches AI Confabulation Before It Reaches My Users
LLMs are great at sounding confident. That's the problem.
An LLM will tell you that commit a3f9b2c added user authentication last Tuesday, that the /api/v2/users endpoint returns 200 OK, and that your Pro subscription is $19/month — all with complete certainty, all potentially wrong. This is confabulation: the model generating plausible-sounding text that fills gaps in its knowledge, delivered with full confidence.
In production AI systems, this erodes user trust, breaks integrations, and sends people down blind alleys. I built a system to catch it before it reaches anyone. Here's what I built and how OpenClaw powers it.
What I Built
A multi-agent research pipeline where findings go through three rounds before reaching the user:
- Gap dig — parallel agents investigate specific knowledge gaps
- Consensus vote — three agents (Scout, Auditor, Dev) vote on each finding
- Validation — challenged findings get tested against the real environment
The system is orchestrated by a Research Orchestrator that manages phase transitions, coordinates agent spawning, and synthesizes final output. It's built entirely on OpenClaw with FastMCP servers and OpenClaw's native multi-agent spawning.
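Stripped of the agent machinery, the orchestrator is a phase machine: run each phase in order, feed each one the previous phase's output. A minimal sketch, where the phase names, `run_pipeline`, and the handler callables are all illustrative stand-ins rather than the actual research_orchestrator.py:

```python
# Phases of the pipeline, in the order the orchestrator runs them
PHASES = ["orientation", "gap_dig", "consensus", "validation", "synthesis"]

def run_pipeline(topic, handlers):
    """Run each phase in order; each handler receives the accumulated
    state dict and returns an updated one for the next phase."""
    state = {"topic": topic}
    for phase in PHASES:
        state = handlers[phase](state)
    return state
```

In the real system each handler spawns or coordinates agents; here they are just functions, which keeps the control flow easy to test in isolation.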
How I Used OpenClaw
Multi-Agent Spawning
OpenClaw can spawn sub-agents with custom prompts and session management. The Research Orchestrator uses this to launch parallel gap-dig agents:
    from agents.personas import get_persona, get_spawn_prompt

    # Build a gap-dig agent prompt with persona + memory
    prompt = get_spawn_prompt(
        agent_type="research",
        task=f"Investigate this specific gap: {gap}",
        context=loaded_memory,
    )

    # Spawn it as a sub-agent, get results back
    result = sessions_spawn(
        task=prompt,
        mode="run",
        timeoutSeconds=300,
    )
Each sub-agent is scoped to its gap, outputs structured findings, and terminates. No shared state between agents — they're genuinely independent, which is what makes the consensus vote meaningful.
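The fan-out itself can be sketched with a thread pool. Here `dig_gaps` and the injected `spawn` callable are my illustrative names, not the pipeline's actual API; in the real system, OpenClaw's sessions_spawn plays the role of `spawn`:

```python
from concurrent.futures import ThreadPoolExecutor

def dig_gaps(gaps, spawn, timeout_seconds=300):
    """Fan out one sub-agent per gap and collect structured findings.

    `spawn` is whatever launches a sub-agent and returns its result;
    injecting it keeps the fan-out logic testable without a live agent."""
    def dig(gap):
        prompt = f"Investigate this specific gap: {gap}"
        return spawn(task=prompt, mode="run", timeoutSeconds=timeout_seconds)

    # One worker per gap: the agents are independent, so run them all at once
    with ThreadPoolExecutor(max_workers=len(gaps)) as pool:
        return list(pool.map(dig, gaps))
```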
FastMCP Servers
Three FastMCP servers extend OpenClaw's capabilities for the pipeline:
Consensus Server — voting and scoring:
# Three agents vote. Finding is confirmed only if consensus ≥ 0.6
submit_vote(finding_id, Vote(
agent="auditor",
vote_type=VoteType.CHALLENGE,
confidence=0.75,
reason="GitHub was 3 days stale; local git disagreed"
))
Validation Server — reality testing:
# Test git claims against actual repo state
# Test API claims against live endpoints
# Test URL claims with actual HTTP requests
run_validation(finding_id, environment="local_api")
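Under the hood, reality testing is mostly a dispatch table from claim type to a concrete check. A sketch under that assumption, with hypothetical checker names (the real validation_server.py exposes this through FastMCP and covers more claim types):

```python
import subprocess
import urllib.request
from urllib.error import HTTPError, URLError

def check_url(url, timeout=10):
    """URL claim: issue a real HTTP request and report what came back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"ok": True, "status": resp.status}
    except HTTPError as e:
        return {"ok": False, "status": e.code}   # e.g. 404: finding fails
    except URLError as e:
        return {"ok": False, "error": str(e.reason)}

def check_git_commit(sha, repo="."):
    """Git claim: does the commit actually exist in the local repo?"""
    result = subprocess.run(
        ["git", "-C", repo, "cat-file", "-e", f"{sha}^{{commit}}"],
        capture_output=True,
    )
    return {"ok": result.returncode == 0}

CHECKERS = {"url": check_url, "git_commit": check_git_commit}

def run_validation(claim_type, target, **kwargs):
    """Route a challenged claim to the checker for its type."""
    checker = CHECKERS.get(claim_type)
    if checker is None:
        return {"ok": False, "error": f"no checker for {claim_type!r}"}
    return checker(target, **kwargs)
```

The key property is that every checker touches the real environment (the network, the repo) rather than asking another model for its opinion.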
Calendar + Git Tools — support infrastructure for agent coordination.
These are registered as MCP tool servers in OpenClaw's gateway config. The agent calls them via the standard MCP interface — no custom wiring needed.
Agent Personas with Memory Compounding
Each agent role (Scout, Auditor, Dev, Writer) has:
- A persona file — thinking style, default questions, voice
- A memory file — accumulates experience across sessions
    # Persona defines how the agent approaches a task
    class ResearchAgent:
        thinking_style = "investigative"  # Asks "what's actually here?"
        default_questions = [
            "What's the specific gap no one talks about?",
            "What's the evidence for this claim?",
        ]
        voice = "Found something real: ..."

    # Memory compounds across sessions
    # Every confirmed finding gets written to memory/agents/research-agent.md
Over time, each persona deepens in its domain. Scout gets better at finding gaps. Auditor gets sharper at spotting weak evidence. The memory system is our own implementation — SQLite-backed with read/write/search/compact tools via FastMCP.
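A minimal version of that store fits in a few lines of sqlite3. The `AgentMemory` class and its schema below are illustrative, not the actual agent_memory_mcp.py, but they show the read/write/search/compact shape:

```python
import sqlite3

class AgentMemory:
    """Sketch of a SQLite-backed per-agent memory store."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memory ("
            " agent TEXT, entry TEXT,"
            " ts DATETIME DEFAULT CURRENT_TIMESTAMP)"
        )

    def write(self, agent, entry):
        """Append a confirmed finding to this agent's memory."""
        self.db.execute("INSERT INTO memory (agent, entry) VALUES (?, ?)",
                        (agent, entry))
        self.db.commit()

    def search(self, agent, term):
        """Substring search over one agent's memory, oldest first."""
        rows = self.db.execute(
            "SELECT entry FROM memory WHERE agent = ? AND entry LIKE ?"
            " ORDER BY ts, rowid", (agent, f"%{term}%"))
        return [r[0] for r in rows]

    def compact(self, agent, keep_last=100):
        """Drop everything but the newest `keep_last` rows for this agent."""
        self.db.execute(
            "DELETE FROM memory WHERE agent = ? AND rowid NOT IN ("
            " SELECT rowid FROM memory WHERE agent = ?"
            " ORDER BY ts DESC, rowid DESC LIMIT ?)",
            (agent, agent, keep_last))
        self.db.commit()
```

Wrapping these four methods as FastMCP tools is what lets any agent in the pipeline read and extend its own memory.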
Cron-Driven Automation
The pipeline runs on a schedule. Nightly research cycles run autonomously, with findings staged for morning review:
# Cron: every weekday at 8 AM ET
0 8 * * 1-5 research-orchestrator --topic=$(cat ~/.research/today_topic)
Failed cycles self-repair via a cron health monitor. If a job times out or drifts from its session, the health system detects and fixes it automatically.
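One simple way to implement that detection is a heartbeat file: the pipeline touches it as it runs, and the health monitor flags the job as stuck once the heartbeat goes stale. A sketch under that assumption (the function names and threshold are mine, not the actual monitor's):

```python
import time
from pathlib import Path

STALE_AFTER = 2 * 60 * 60  # a cycle silent for 2 hours is considered stuck

def heartbeat(path):
    """Each pipeline phase touches this file as it completes."""
    Path(path).write_text(str(time.time()))

def is_stale(path, now=None, stale_after=STALE_AFTER):
    """True if the job never started or stopped heartbeating in time."""
    p = Path(path)
    if not p.exists():
        return True
    now = time.time() if now is None else now
    return now - float(p.read_text()) > stale_after
```

A second cron entry runs the staleness check and restarts or re-attaches the orchestrator when it returns true.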
Demo
Here's what the system actually outputs. For a research task on "x402 ecosystem readiness":
Phase 1 — Orientation produced 5 specific gaps:
- What x402 endpoints are actually deployed and in use?
- What does the auth model look like in practice?
- What's the real revenue potential for a new endpoint?
- What are the failure modes in token refresh?
- Is the developer ecosystem mature enough to build on?
Phase 2 — Gap Dig ran 5 parallel agents, one per gap.
Phase 3 — Consensus voted on 8 findings:
Finding: "x402 wallet address xyz has received 0 transactions"
- Scout: CONFIRM (confidence 0.7) — "Confirmed on-chain"
- Auditor: CONFIRM (confidence 0.85) — "Direct observation"
- Dev: CHALLENGE (confidence 0.6) — "Wallet address may be wrong"
→ Consensus: 0.32 (challenged) → Sent to Validation
Phase 4 — Validation tested the wallet address:
$ curl https://api.x402.org/wallet/xyz
→ 404 Not Found (wallet not found)
→ Validation: FAIL — finding is wrong
The finding that looked most confirmed got rejected by validation. This is the system working correctly.
What I Learned
Distributed skepticism beats a single validator
Adding a validator (one more LLM call) just doubles the confabulation risk. Distributed skepticism — three agents with genuinely different roles, looking at the same claim from different angles — surfaces the uncertainty that single-model confidence hides.
The architecture matters more than the model
The quality of the output comes from the phase structure (survey → dig → vote → validate → synthesize), not from which LLM powers each agent. We run on MiniMax-M2.7 for speed and cost. The architecture is the product.
OpenClaw makes multi-agent practical
The hard parts of multi-agent — session management, memory across agents, tool sharing via MCP, cron-driven automation — are all handled by OpenClaw's infrastructure. The Research Orchestrator just coordinates. This makes it practical to run multi-agent systems that would otherwise require significant custom infrastructure.
Named entity preservation is still hard
TurboQuant handles context window compression well, but named entities (commit hashes, wallet addresses, API endpoints) get lost in extractive summarization. For research that relies on specific facts, this matters. We're evaluating LLM-backed compaction via Mnemo Cortex to handle this better.
Source Code
- agents/servers/research_orchestrator.py — pipeline conductor
- agents/servers/consensus_server.py — voting system
- agents/servers/validation_server.py — reality testing
- servers/agent_memory_mcp.py — SQLite-backed agent memory
- agents/personas/ — Scout, Auditor, Dev, Writer persona definitions
All registered as FastMCP servers in OpenClaw. Runs on a cron schedule. Self-healing via cron health monitor.
No video demo — but the system runs every day on actual research tasks. Check the commit history for the full implementation.