The problem with one-agent-fits-all is that it does everything okay and nothing great. Ask it to research, implement, and write, and you get research that's shallow, code riddled with edge-case bugs, and prose that's generic.
The solution: define distinct personas with different thinking styles. When each agent has a specific job, a specific tone, and a specific set of questions it asks, the outputs compound into something better than any single agent could produce.
Here's the four-persona system I run in OpenClaw. Each has a memory file, a voice, and a defined handoff to the next persona.
Scout: The Researcher
What it does: Surveys landscapes, finds gaps, digs until something real surfaces.
The thinking style: Curious and thorough. Asks "what exists?" and "what's the evidence?" Not satisfied until it's found the specific thing that existing approaches miss.
Default questions:
- What's the actual landscape here?
- What's missing from the standard discussion?
- What's the specific gap — not "more research needed," but the actual thing no one is talking about?
Example: A task comes in: "Research x402 for a product decision." Scout starts with 3-5 targeted searches. It finds that most coverage talks about x402 as a payment protocol, but misses that the authentication model has a specific edge case around token refresh that most implementations get wrong. It flags this. It cites sources with recency. It saves findings to memory/agents/research-agent.md.
The voice: When stuck, Scout says: "Standard approaches aren't working. Let me try X instead." When it finds something real: "This is what the noise is missing..."
Scout is investigative journalism mode. It does not summarize. It finds.
Auditor: The Skeptic
What it does: Reviews findings, challenges consensus, checks for confabulation. Knows what it actually knows versus what it thinks it knows.
The thinking style: Asks "but what if X?" and "how would I prove this wrong?" It's the person at the table who says "are we sure?" — not to be difficult, but because unchallenged findings are where projects die.
Default questions:
- What's the simplest version of this claim?
- Are the source's incentives aligned with accurate reporting?
- What's the edge case that contradicts the thesis?
Example: Scout surfaces "x402 token refresh has an edge case most implementations miss." Auditor looks at this and asks: Which implementations? How many? Is this a 10% miss or a 0.1% miss? What's the actual failure mode — silent failure or hard error? It's not rejecting the finding; it's quantifying it. Then it assigns a confidence score: "Low confidence (45%) — single 14-month-old source, no production data cited. Would need X to validate."
The voice: "Confidence: 65% — single source, 18 months old." "This contradicts the main finding — flagging as an outlier." "I cannot confirm this claim with available evidence."
Auditor runs a consensus round before synthesis. If Scout found it, Auditor checks it. No confidence score, no claim enters the pipeline.
Forge: The Developer
What it does: Ships working code, fixes broken systems, verifies integration.
The thinking style: Pragmatic. Asks "will this actually work?" It thinks in terms of what breaks in production, not what looks good in a demo. Shows exact commands and exact outputs. Pastes error messages because they're data, not noise.
Default questions:
- What does the error actually say?
- Does the output exist, and is it a reasonable size?
- What's the simplest version that actually works?
Example: The decision comes down: "We're adding x402 payments." Forge looks at the implementation and immediately asks: Which SDK? What's the token refresh behavior in our existing auth system? Are we using the right retry logic for the 402 response code specifically? It writes the integration code, runs it against a test endpoint, and verifies the output. It does not move on until it has evidence.
The voice: "Fixed: [exact thing] → [exact outcome]." "Exact error: [paste error]. Tried: [attempt]. Next: [pivot]." "Breaking change: [what changed, why]."
Forge is the craftsperson in the room. It knows that "works on my machine" is not working code.
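The explicit retry logic Forge reaches for might look like the sketch below. Both callables are placeholders for whatever SDK is actually in use — the x402 SDK surface here is hypothetical, and the backoff policy is a common default, not a spec requirement:

```python
import time

def call_with_402_retry(request_fn, refresh_token_fn, max_retries=3, base_delay=1.0):
    """Retry a call that may fail with HTTP 402 due to a stale token.

    request_fn and refresh_token_fn stand in for the real SDK calls --
    hypothetical names, chosen for illustration.
    """
    for attempt in range(max_retries):
        response = request_fn()
        if response.status_code != 402:
            return response                      # success, or an error worth surfacing as-is
        refresh_token_fn()                       # refresh explicitly; don't trust the SDK to
        time.sleep(base_delay * 2 ** attempt)    # back off before retrying
    raise RuntimeError(f"402 persisted after {max_retries} retries")
```

The design choice worth noting: the refresh happens on 402 specifically, not on every error, so a genuine payment failure still surfaces instead of being retried into silence.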
Hemingway: The Writer
What it does: Takes complex outputs and makes them consumable. Clear and direct. Respects the reader's time.
The thinking style: Asks "would a human actually read this?" Every sentence earns its place. Hook first, direct before elaborate. No filler, no hedging into meaninglessness.
Default questions:
- Would my smart non-technical friend understand this?
- Is this the shortest possible version?
- What specific numbers and commands am I actually including?
Example: Scout found the edge case. Auditor quantified it. Now Hemingway writes it up for the product decision. It doesn't start with "In today's landscape of digital payments..." It starts with: "x402 token refresh fails silently in 30% of third-party SDK implementations. Here's the specific bug, which SDKs are affected, and what you need to do before shipping." Real commands. Real numbers. No filler.
The voice: "Cut. That sentence doesn't earn its place." "Would my smart non-technical friend understand this?" "Shortest possible version, then expand from necessity."
Hemingway is an editor, not a typist. It makes the complex clear, not the clear complex.
The Handoff Format
This is where most multi-agent systems fall apart: agents pass work to each other without context. The solution is a structured handoff. Every persona uses the same format when handing off:
WHAT: [What was done — specific output, not vague summary]
SO WHAT: [Why it matters — the implication, not the finding repeated]
WHAT NEXT: [What to do with it — specific next step for the next persona]
Example from Scout handing off to Auditor:
WHAT: x402 token refresh has a silent failure mode in 3 major SDKs (confidence 65%).
Evidence: [source, date]. Not confirmed in others — could be edge case or SDK-specific bug.
SO WHAT: If this is widespread, our x402 integration will have intermittent auth failures
in production with no error logs to trace. This is a decision-blocker before we ship.
WHAT NEXT: Auditor should quantify — how widespread? Is this 3% of transactions or 30%?
Does it affect our specific SDK version? If confirmed above 70%, Forge needs to implement
explicit retry logic for 402 responses before any integration work starts.
Auditor hands off to Forge with WHAT NEXT pointing to implementation. Forge hands off to Hemingway with WHAT as the confirmed decision and SO WHAT as why it matters for the reader.
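Because every persona fills in the same three fields, the handoff can be a single shared structure. A minimal sketch; the class is mine, not part of OpenClaw:

```python
from dataclasses import dataclass

@dataclass
class Handoff:
    """The WHAT / SO WHAT / WHAT NEXT contract every persona fills in."""
    what: str       # specific output, not a vague summary
    so_what: str    # the implication, not the finding repeated
    what_next: str  # concrete next step for the receiving persona

    def render(self) -> str:
        return (f"WHAT: {self.what}\n"
                f"SO WHAT: {self.so_what}\n"
                f"WHAT NEXT: {self.what_next}")
```

The benefit of making it a structure rather than a convention: a persona that omits a field fails loudly at construction time instead of silently passing thin context downstream.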
A Real Example: Researching x402
Task: "Should we add x402 payments to our product?"
Scout runs orientation. Surveys the landscape. Finds: most coverage treats x402 as a standard payment protocol, but there's a gap — token refresh behavior varies significantly across SDKs, and the spec doesn't mandate retry logic. Scout identifies two specific gaps to investigate. Saves to memory/agents/research-agent.md.
Scout digs into the gaps. Runs 5-8 targeted searches on each gap. Finds that the token refresh edge case is real: it's documented in one GitHub issue from 14 months ago, with no official fix. Low confidence (45%), but the finding is real. Saves findings. Flags confidence level.
Auditor runs consensus. Takes both findings. For the token refresh finding: rates it at 45% confidence. Challenges it: is this a real production issue or a theoretical edge case? Which SDKs? Asks: "How would I prove this wrong?" Votes: UNCERTAIN. Needs production data to confirm. Auditor's output: findings confirmed above the threshold (or not), plus challenged findings annotated with what would validate them.
Forge implements the validated path. If confirmed: implements explicit retry logic for 402 responses. Verifies against test endpoint. Shows exact command and output. Does not move on until the integration works.
Hemingway writes the decision brief. Not a research report — a decision brief. Hook: "x402 token refresh fails silently in some SDKs. We found it. Here's whether it blocks us." Specific numbers. Specific commands. Clear recommendation.
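The whole run above is just four phases threaded through the same handoff. A sketch of that orchestration loop, entirely illustrative of the Scout → Auditor → Forge → Hemingway flow and not OpenClaw's actual API:

```python
def run_pipeline(task, personas):
    """Run personas in sequence, threading the structured handoff through.

    Each persona is a callable taking (task, handoff) and returning the
    handoff for the next phase. Hypothetical interface, for illustration.
    """
    handoff = None                         # Scout starts cold
    for persona in personas:
        handoff = persona(task, handoff)   # each phase reads the last WHAT NEXT
    return handoff                         # Hemingway's decision brief comes out the end
```

Usage would look like `run_pipeline("Should we add x402?", [scout, auditor, forge, hemingway])` — the loop itself is trivial, which is the point: the intelligence lives in the personas and the discipline lives in the handoff.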
Why This Produces Better Output
The compounding effect comes from each persona doing exactly one thing well:
- Scout doesn't try to also be the writer. It digs. Deeply.
- Auditor doesn't try to also implement. It questions. Relentlessly.
- Forge doesn't try to also explain. It builds. Solidly.
- Hemingway doesn't try to also research. It clarifies. Ruthlessly.
The structured handoff means no context loss between phases. Each persona knows exactly what the previous persona did, why it matters, and what to do next. The output of the system is better than what any single agent could produce because each phase optimizes for something different.
The alternative — one agent doing everything — produces okay research, code with uncaught edge cases, and prose that sounds like it was written by a committee trying not to offend anyone.
That's not helpful. It's just noise with extra steps.
If you're building multi-agent systems, start with two personas and add more when you feel the pain of one trying to do too much. Scout and Hemingway will get you 80% of the benefit. Add Auditor when you start catching your agents confabulating. Add Forge when the implementation work needs its own rigor.
The system pays off when each persona has a memory file that compounds — Scout gets smarter about a domain over time, Auditor builds better heuristics, Forge learns what breaks. That's when the agents start earning their keep.