<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Wilbur AI</title>
    <description>The latest articles on DEV Community by Wilbur AI (@wilburlabs_731).</description>
    <link>https://dev.to/wilburlabs_731</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891943%2F5a06d5d1-2e05-43a4-8257-b4a2f0021ef2.jpeg</url>
      <title>DEV Community: Wilbur AI</title>
      <link>https://dev.to/wilburlabs_731</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wilburlabs_731"/>
    <language>en</language>
    <item>
      <title>I Built an AI System Where Agents Argue — Then Learn From the Argument</title>
      <dc:creator>Wilbur AI</dc:creator>
      <pubDate>Wed, 22 Apr 2026 07:26:19 +0000</pubDate>
      <link>https://dev.to/wilburlabs_731/i-built-an-ai-system-where-agents-argue-then-learn-from-the-argument-41l9</link>
      <guid>https://dev.to/wilburlabs_731/i-built-an-ai-system-where-agents-argue-then-learn-from-the-argument-41l9</guid>
      <description>&lt;p&gt;Most AI agent systems come in two flavors: a single autonomous agent&lt;br&gt;
looping until it declares victory (AutoGPT), or multiple agents&lt;br&gt;
dividing labor on a task (CrewAI). I wanted a third flavor — agents&lt;br&gt;
that actually disagree with each other, and a system that gets smarter&lt;br&gt;
from those disagreements.&lt;/p&gt;

&lt;p&gt;This is the story of building Agora, what I borrowed from existing&lt;br&gt;
projects, and the one thing that's genuinely new.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyro599lvuohs8bcm320.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyro599lvuohs8bcm320.gif" alt=" " width="560" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with "one AI thinks for you"
&lt;/h2&gt;

&lt;p&gt;When engineers plan a feature, they don't just have one brain compute&lt;br&gt;
the answer. They research prior art. They debate. Someone pokes holes.&lt;br&gt;
Then they act. Each role catches different blind spots.&lt;/p&gt;

&lt;p&gt;Single-LLM systems skip all of that. You get one answer, often with&lt;br&gt;
confidence that isn't earned.&lt;/p&gt;

&lt;p&gt;Multi-agent systems like CrewAI get you partway there — agents divide&lt;br&gt;
labor — but they rarely argue. The "Research agent" hands off to&lt;br&gt;
"Write agent" who hands off to "Edit agent" in a pipeline. No one's&lt;br&gt;
job is to push back.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Agora does differently
&lt;/h2&gt;

&lt;p&gt;Agora has a council: Scout (research), Architect (design), Critic&lt;br&gt;
(challenges), Synthesizer (summary), plus an optional Sentinel&lt;br&gt;
(security review). They run in &lt;strong&gt;parallel&lt;/strong&gt; on the same input. Their&lt;br&gt;
outputs go to the Synthesizer, who notes where they agreed and where&lt;br&gt;
they disagreed, then turns the discussion into action items you&lt;br&gt;
approve before anything executes.&lt;/p&gt;

&lt;p&gt;Then the part I think is genuinely new: after the session, Agora runs&lt;br&gt;
&lt;strong&gt;two&lt;/strong&gt; skill extractors independently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution skill extractor&lt;/strong&gt; — "what worked for this task type"
(learned from the Executor's tool-calling trace)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discussion skill extractor&lt;/strong&gt; — "how did the roles disagree, how
was it resolved" (learned from the council transcript)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second one has a dedicated prompt (&lt;code&gt;_DISCUSS_PROMPT&lt;/code&gt; in the code)&lt;br&gt;
that explicitly asks "what did each perspective contribute" and "how&lt;br&gt;
were disagreements resolved". It's structurally impossible for a&lt;br&gt;
single-agent system to produce this signal — there's no one to&lt;br&gt;
disagree with.&lt;/p&gt;
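&lt;p&gt;To make the split concrete, here's a minimal sketch of that dual-extraction step. Everything here (class and function names, the summary strings) is my illustration, not Agora's actual API; only the "two extractors, run independently, neither sees the other's input" shape comes from the design above.&lt;/p&gt;

```python
# Hypothetical sketch of the dual-extraction step; names are
# assumptions, only the two-independent-extractors shape is Agora's.
from dataclasses import dataclass

@dataclass
class Skill:
    kind: str      # "execution" or "discussion"
    summary: str

def extract_execution_skill(tool_trace):
    # "What worked for this task type", from the Executor's trace.
    return Skill("execution", f"{len(tool_trace)} tool calls reviewed")

def extract_discussion_skill(transcript):
    # "How did the roles disagree, how was it resolved",
    # from the council transcript of (role, message) pairs.
    roles = sorted({role for role, _ in transcript})
    return Skill("discussion", "perspectives: " + ", ".join(roles))

def extract_all(tool_trace, transcript):
    # The extractors run independently; neither sees the other's input.
    return [extract_execution_skill(tool_trace),
            extract_discussion_skill(transcript)]
```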

&lt;h2&gt;
  
  
  Honest positioning
&lt;/h2&gt;

&lt;p&gt;I stand on two projects' work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeerFlow&lt;/strong&gt; (ByteDance) gave me the sandbox execution model and
the memory design&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hermes Agent&lt;/strong&gt; (Nous Research) gave me the learn-skills-from-execution pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agora's original contribution is the council discussion layer&lt;br&gt;
&lt;strong&gt;and&lt;/strong&gt; the discussion-skill extraction. Both originals are credited in&lt;br&gt;
the README.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User input
  ↓
Moderator (routing) → QUICK / DISCUSS / EXECUTE / CLARIFY
  ↓
DISCUSS branch (parallel):
  Scout (web search) ║ Architect (design) ║ Critic (challenge)
                      ↓
  Synthesizer (merges to action items)
  ↓
User approves
  ↓
Executor (tool-calling loop: read / write / patch / shell)
  ↓
SkillExtractor (discussion skill ‖ execution skill, independent)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
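&lt;p&gt;The Moderator's routing step can be sketched as a four-way decision. The route names match the diagram; the keyword heuristic below is a stand-in (the real Moderator is an LLM call), so treat everything except the four routes as hypothetical.&lt;/p&gt;

```python
# Sketch of the Moderator's four-way routing; routes come from the
# diagram, the heuristic is a placeholder for the real LLM call.
from enum import Enum

class Route(Enum):
    QUICK = "quick"        # answer directly, no council
    DISCUSS = "discuss"    # fan out to the full council
    EXECUTE = "execute"    # straight to the tool-calling Executor
    CLARIFY = "clarify"    # ask the user a follow-up question

def route(user_input: str) -> Route:
    text = user_input.lower()
    if any(w in text for w in ("design", "plan", "tradeoff")):
        return Route.DISCUSS
    if any(w in text for w in ("run", "fix", "apply")):
        return Route.EXECUTE
    if text.endswith("?"):
        return Route.QUICK
    return Route.CLARIFY
```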



&lt;h2&gt;
  
  
  Implementation details worth highlighting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Three-tier skill matching
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# backend/agora/skills/store.py
&lt;/span&gt;&lt;span class="nf"&gt;match_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Primary: semantic similarity
&lt;/span&gt;&lt;span class="nf"&gt;match_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# Fallback: LLM relevance check
&lt;/span&gt;&lt;span class="nf"&gt;match_keyword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# Last resort: keyword overlap
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works in environments without embedding providers. Each tier has&lt;br&gt;
different cost/quality tradeoffs — the system degrades gracefully&lt;br&gt;
instead of hard-failing.&lt;/p&gt;
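&lt;p&gt;A sketch of how the fallback chain might be wired, assuming the three function names above; the internals are placeholders, not the real &lt;code&gt;store.py&lt;/code&gt;:&lt;/p&gt;

```python
# Graceful-degradation sketch: each tier returns None when its
# backing service is unavailable, and the chain falls through.
def match_embedding(query, skills, embed=None):
    if embed is None:
        return None   # no embedding provider configured
    # ... cosine similarity over skill embeddings would go here
    return None

def match_llm(query, skills, llm=None):
    if llm is None:
        return None   # no judge model available
    return None

def match_keyword(query, skills):
    # Last resort: crude token overlap, always available.
    q = set(query.lower().split())
    best, best_score = None, 0
    for s in skills:
        score = len(q.intersection(s.lower().split()))
        if score > best_score:
            best, best_score = s, score
    return best

def match(query, skills, embed=None, llm=None):
    # Try each tier in descending quality order; fall through on None.
    for tier in (lambda: match_embedding(query, skills, embed),
                 lambda: match_llm(query, skills, llm),
                 lambda: match_keyword(query, skills)):
        hit = tier()
        if hit is not None:
            return hit
    return None
```

&lt;p&gt;The point of the shape: a missing tier yields &lt;code&gt;None&lt;/code&gt; instead of raising, which is what "degrades gracefully" means in practice.&lt;/p&gt;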

&lt;h3&gt;
  
  
  Parallel council execution
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;scout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;architect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;critic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;synthesizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Council wall-clock time stays close to single-agent response time&lt;br&gt;
instead of scaling with role count. This is what makes the&lt;br&gt;
"council debate" pattern actually usable in a product.&lt;/p&gt;
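&lt;p&gt;One practical wrinkle with a bare &lt;code&gt;asyncio.gather&lt;/code&gt;: if any council member raises, the first exception propagates and the other results are lost. A hedged variant (agent names here are stand-ins, not Agora's real role objects) keeps the surviving opinions and collects the failures:&lt;/p&gt;

```python
# Fault-tolerant council round: return_exceptions=True turns each
# failure into a value instead of aborting the whole gather.
import asyncio

async def run_council(agents):
    results = await asyncio.gather(
        *(agent() for agent in agents), return_exceptions=True
    )
    # Keep successful opinions; surface failures separately.
    opinions = [r for r in results if not isinstance(r, Exception)]
    failures = [r for r in results if isinstance(r, Exception)]
    return opinions, failures

async def demo():
    async def scout():
        return "prior art found"
    async def critic():
        raise RuntimeError("model timeout")
    return await run_council([scout, critic])
```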

&lt;h3&gt;
  
  
  SSE streaming across agents
&lt;/h3&gt;

&lt;p&gt;The web UI shows all four agents streaming tokens simultaneously.&lt;br&gt;
Without this, users wait 30 seconds for a blob of text. With it,&lt;br&gt;
the "council discussing" metaphor feels alive. Worth every line of&lt;br&gt;
the frontend work.&lt;/p&gt;
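&lt;p&gt;On the wire this is one event stream with each frame tagged by agent, so the client can route tokens to the right panel. A minimal framing sketch (the JSON field names are my assumptions, not Agora's actual protocol):&lt;/p&gt;

```python
# Server-sent-event framing for an interleaved multi-agent stream.
# Each frame is "data: {json}\n\n"; the agent field tells the UI
# which panel the token belongs to.
import json

def sse_frame(agent: str, token: str) -> str:
    payload = json.dumps({"agent": agent, "token": token})
    return f"data: {payload}\n\n"

def interleave(streams):
    # streams: dict of agent name to its pending token list.
    # Round-robin so no single agent starves the others on screen.
    frames = []
    while any(streams.values()):
        for agent, tokens in streams.items():
            if tokens:
                frames.append(sse_frame(agent, tokens.pop(0)))
    return frames
```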

&lt;h2&gt;
  
  
  Testing AI systems
&lt;/h2&gt;

&lt;p&gt;Two-tier strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unit tests with mocks&lt;/strong&gt; (188) — verify control flow, prompt
structure, data shape&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration tests with LLM-as-Judge&lt;/strong&gt; (15) — verify actual
output quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unit tests catch regressions fast; integration tests catch quality&lt;br&gt;
drift when you change prompts. Both matter.&lt;/p&gt;
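&lt;p&gt;The mock tier can be illustrated with a toy synthesizer test; the function and prompt here are invented for the sketch, but the pattern (assert on prompt structure and output shape, never on model quality) is the one described above:&lt;/p&gt;

```python
# Unit-test pattern: mock the LLM, assert on control flow, prompt
# structure, and data shape. No network, no model, runs in ms.
import asyncio
from unittest.mock import AsyncMock

async def synthesize(llm, opinions):
    # Toy synthesizer: prompt the model, keep only bullet lines.
    prompt = "Merge these council opinions into action items:\n" + "\n".join(opinions)
    reply = await llm(prompt)
    return [line for line in reply.splitlines() if line.startswith("- ")]

def test_synthesize_shapes_prompt_and_output():
    llm = AsyncMock(return_value="- add caching\n- write tests\nignored")
    items = asyncio.run(synthesize(llm, ["use redis", "beware invalidation"]))
    # Data shape: only bullet lines survive.
    assert items == ["- add caching", "- write tests"]
    # Prompt structure: every opinion made it into the prompt.
    sent = llm.call_args.args[0]
    assert "use redis" in sent
    assert "beware invalidation" in sent
```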

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Web UI polish (discussion visualization still needs work)&lt;/li&gt;
&lt;li&gt;MCP server support (external tool integrations beyond the built-in
set)&lt;/li&gt;
&lt;li&gt;Skill versioning (current skills are immutable; roll-forward
should be possible)&lt;/li&gt;
&lt;li&gt;Dynamic sub-agent generation (on-demand specialist roles)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/wilbur-labs/Agora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;Agora
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env  &lt;span class="c"&gt;# add CLAUDE_API_KEY&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
open http://localhost:8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MIT license. FastAPI + Next.js. Works with Claude, Azure OpenAI,&lt;br&gt;
OpenAI, or any OpenAI-compatible endpoint (including local Ollama&lt;br&gt;
/ vLLM).&lt;/p&gt;
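&lt;p&gt;For a local endpoint, an OpenAI-compatible base URL is typically all the configuration needed. The variable names below are illustrative guesses; check &lt;code&gt;.env.example&lt;/code&gt; for the real keys:&lt;/p&gt;

```shell
# Hypothetical .env fragment pointing Agora at a local Ollama server.
# Variable names are illustrative; see .env.example for the real keys.
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama   # Ollama ignores the key, but clients require one
OPENAI_MODEL=llama3.1
```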

&lt;h2&gt;
  
  
  Ask
&lt;/h2&gt;

&lt;p&gt;If you've seen prior art for extracting skills from multi-agent&lt;br&gt;
disagreement specifically, please tell me. I've done due diligence&lt;br&gt;
on the usual suspects but it's a small world and I could easily&lt;br&gt;
have missed something. GitHub issues are open.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
