DEV Community

MrClaw207
Adaptive Research: Turn One Question Into a Multi-Agent Investigation

OpenClaw Challenge Submission 🦞

This is a submission for the OpenClaw Challenge — Wealth of Knowledge.


When you ask an AI agent to research something, it usually does one of two things: it finds what you could find yourself in five minutes, or it generates a polished-sounding answer that's completely wrong. Both are useless.

What you actually want is a system that surveys the landscape, identifies specific knowledge gaps, digs into each one with targeted research, catches disagreements before they become confident lies, and validates claims against reality before presenting the final answer.

That's what the adaptive research pipeline does. Here's how it works.


The Problem With One-Shot Research

The default research pattern — one agent, one query, one answer — has a fundamental flaw: the agent has no way to know what it doesn't know. It will confidently tell you that the commit a3f9b2c added user authentication last Tuesday, that the /api/v2/users endpoint returns 200 OK, and that your Pro subscription is $19/month — all potentially wrong, all delivered with equal confidence.

This is confabulation: the model filling gaps with plausible-sounding text. It isn't lying. It genuinely believes what it's saying. And it has no mechanism to self-correct without an external check.

The answer isn't "add a validator." A single additional LLM call just trades one model for another — same confabulation risk, doubled latency.

The answer is distributed skepticism: multiple agents with different roles, looking at the same claim from different angles, voting before anything goes to the user.


The Five-Phase Pipeline

Phase 1: Orientation

You start with a question — something like "should I use x402 or a traditional API for an AI agent product?"

Phase 1 takes that question and decomposes it into specific knowledge gaps instead of trying to answer it directly. The Scout agent outputs things like:

  • "What is x402 and how does the authentication model differ from API keys?"
  • "What are real-world adoption rates for x402 in production?"
  • "What does the pricing comparison look like for comparable workloads?"
  • "What edge cases exist in x402 token refresh that implementations get wrong?"

This is the most underrated step in research. Most research fails because it starts too broad ("tell me about x402") or too narrow ("is x402 better than Stripe?"). The orientation phase produces specific, addressable questions.
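Under an assumed, purely illustrative schema (not the pipeline's actual data model), Scout's decomposition can be captured as structured records keyed to the original topic:

```python
from dataclasses import dataclass

@dataclass
class KnowledgeGap:
    """One specific, addressable question from the orientation phase.
    Illustrative schema -- not the pipeline's actual data model."""
    question: str
    topic: str

topic = "x402 vs traditional API for an AI agent product"
gaps = [
    KnowledgeGap(q, topic)
    for q in [
        "What is x402 and how does the authentication model differ from API keys?",
        "What are real-world adoption rates for x402 in production?",
    ]
]
```

Each record then becomes the work order for exactly one agent in the next phase.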

Phase 2: Gap Dig

Each gap from Phase 1 gets assigned to a dedicated research agent — one agent per gap, working in parallel.

This is where most single-agent pipelines fall apart. They try to answer all the gaps in one pass, and the result is surface-level answers to each question that no one would use for a real decision.

The gap dig agents each output:

  • Specific findings with cited sources
  • Confidence level (high/medium/low)
  • Specific uncertainties the agent couldn't resolve
  • New gaps that emerged during research
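The four outputs above map naturally onto a small record type. This is a sketch with hypothetical field names, not the actual schema the gap dig agents emit:

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    """One gap-dig result. Field names are illustrative, not the pipeline's schema."""
    claim: str                  # the specific finding
    sources: list[str]          # cited sources backing it
    confidence: str             # "high" | "medium" | "low"
    uncertainties: list[str] = field(default_factory=list)  # unresolved points
    new_gaps: list[str] = field(default_factory=list)       # follow-up questions

finding = Finding(
    claim="x402 uses per-request payment headers instead of long-lived API keys",
    sources=["https://example.com/x402-spec"],  # placeholder URL
    confidence="medium",
    new_gaps=["How are refunds handled on failed requests?"],
)
```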

Phase 3: Consensus

Here's where we prevent confabulation.

Findings from all gap dig agents go to the Consensus Server, where three agents vote on each finding:

  • Scout — the original researcher. Probably biased toward confirming what it found.
  • Auditor — the skeptic. Challenges assumptions and looks for counterexamples.
  • Dev — the implementation checker. Verifies whether the finding holds up in code or reality.

Each agent votes: confirm (+1), challenge (-1), or uncertain (0), weighted by their confidence level:

consensus_score = (confirms × confidence - challenges × confidence) / num_voting_agents
Score     Status      Action
≥ 0.6     Confirmed   Goes to synthesis
0.3–0.6   Challenged  Sent to Validation Server
< 0.3     Rejected    Discarded
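A minimal sketch of the scoring rule and thresholds, assuming a simple Vote shape (the real consensus_server.py may represent votes differently):

```python
from dataclasses import dataclass

@dataclass
class Vote:
    agent: str
    value: int         # +1 confirm, -1 challenge, 0 uncertain
    confidence: float  # 0.0-1.0, the agent's stated confidence

def consensus_score(votes: list[Vote]) -> float:
    """Confidence-weighted mean of votes across all voting agents."""
    if not votes:
        return 0.0
    return sum(v.value * v.confidence for v in votes) / len(votes)

def classify(score: float) -> str:
    """Map a score to a status using the thresholds above."""
    if score >= 0.6:
        return "confirmed"
    if score >= 0.3:
        return "challenged"
    return "rejected"

votes = [
    Vote("scout", +1, 0.9),    # original researcher confirms
    Vote("auditor", -1, 0.6),  # skeptic challenges
    Vote("dev", +1, 0.7),      # implementation check confirms
]
score = consensus_score(votes)  # (0.9 - 0.6 + 0.7) / 3 = 0.333...
```

Note how one confident challenge drags a finding from "confirmed" territory down into the challenged band, which is exactly the behavior you want: disagreement routes the claim to validation rather than to the user.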

The critical insight: an agent can be highly confident AND wrong. A single model saying "I'm 95% sure" sounds reassuring. But confidence is about internal consistency, not ground truth. When three agents with different prompts and roles look at the same claim, their disagreements surface the uncertainty that raw confidence hides.

Phase 4: Validation

Consensus catches disagreements. But it doesn't catch confabulation — if the voting agents share the same blind spot, a wrong claim sails through with full agreement.

This is where the Validation Server comes in. Challenged findings get tested against reality:

  • Git claims → check the actual commit history
  • API uptime claims → curl the endpoint
  • Price claims → search for corroboration
  • URL claims → attempt the request

Only findings that survive both consensus AND validation make it into the final report.
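The reality checks can be sketched as a dispatch from claim type to a concrete test. Function names and the dispatch table are hypothetical; the actual interface in validation_server.py may differ:

```python
import subprocess
import urllib.request

def check_git_commit(sha: str, repo: str = ".") -> bool:
    """Git claim: does this commit actually exist in the repo's history?"""
    result = subprocess.run(
        ["git", "cat-file", "-e", sha], cwd=repo, capture_output=True
    )
    return result.returncode == 0

def check_url(url: str, timeout: float = 5.0) -> bool:
    """URL / uptime claim: does the claimed endpoint actually respond?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        return False

# Claim type -> validator; price claims would need a corroborating search instead.
VALIDATORS = {
    "git": check_git_commit,
    "url": check_url,
    "api_uptime": check_url,
}
```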

Phase 5: Synthesis

The synthesis agent takes only confirmed and validated findings and produces the final output. No hedging. No "on the other hand." Only things that were confirmed by multiple agents and tested against reality.
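The filter feeding synthesis is simple. A sketch, assuming each finding carries its status from the consensus round plus a validation flag:

```python
def select_for_synthesis(findings: list[dict]) -> list[dict]:
    """Keep confirmed findings, plus challenged ones that passed validation."""
    return [
        f for f in findings
        if f["status"] == "confirmed"
        or (f["status"] == "challenged" and f.get("validated", False))
    ]

findings = [
    {"claim": "A", "status": "confirmed"},
    {"claim": "B", "status": "challenged", "validated": True},
    {"claim": "C", "status": "challenged", "validated": False},
    {"claim": "D", "status": "rejected"},
]
kept = select_for_synthesis(findings)  # A and B survive
```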


The Architecture

The system runs on OpenClaw with three FastMCP servers working together:

Research Orchestrator (research_orchestrator.py) — the conductor:

run_research_cycle(topic: str, phase: str)
# Phase transitions: orientation → gap_id → targeted_dig → consensus → validate → synthesize

Consensus Server (consensus_server.py) — the voting layer:

submit_vote(finding_id: str, vote: Vote)      # Submit agent vote
get_consensus_results(finding_id: str)         # Get voting results
get_challenged_findings()                      # Get challenged items

Validation Server (validation_server.py) — the reality check:

run_validation(finding_id: str, environment: str)  # Test against real environment

Plus agent personas — Scout, Auditor, Dev, and Hemingway — each with distinct memory files, voices, and thinking styles that compound across sessions.


Real Results

We ran this pipeline three times:

1. Nvidia AITune research — orientation produced 6 specific gaps. Each was researched in parallel. The consensus round caught two claims that sounded plausible but had weak evidence. The validation round caught one claim about CUDA version compatibility that was simply wrong. Final output: accurate, actionable, with confidence levels.

2. MiniMax-M2.7 model capabilities — consensus voting identified that our understanding of context window limits was uncertain. Validation confirmed the actual specs before we built on incorrect assumptions.

3. x402 ecosystem research — gap dig found 9 deployed endpoints with $0 revenue. Challenge phase correctly identified that the monetization model was unrealistic for a new account. Validation confirmed: the wallet had zero incoming transactions.


What Makes This Different From "Just Adding a Validator"

Most people hear "consensus" and think "second opinion." That's not what this is.

A second opinion is still one model with one reasoning path, giving you a thumbs up or down. It has the same confabulation risks as the original.

The key difference is independence: three agents with different system prompts, different roles, and different knowledge bases. Scout has the research context. Auditor has the skeptic's lens. Dev has the implementation reality check. Their job isn't to agree with each other — it's to surface disagreements that would otherwise stay hidden.

"Consensus catches disagreements. Validation catches confabulation."

There are two questions here. "Is this true?" often can't be answered without a ground-truth oracle. "Do multiple independent agents believe this?" can. We use the answerable question as a proxy for the unanswerable one.


When to Use It

Not every question needs this. A simple factual lookup — "what's the capital of France?" — is faster with a single agent call.

Use adaptive research when:

  • The answer will affect a significant decision
  • There are multiple competing claims to evaluate
  • You need to cite sources for something important
  • The domain is outside your direct expertise

The upfront cost is higher. The output quality is significantly better.


Source code: agents/servers/research_orchestrator.py, agents/servers/consensus_server.py, agents/servers/validation_server.py — registered as FastMCP servers in OpenClaw.
