<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Wilbur AI</title>
    <description>The latest articles on DEV Community by Wilbur AI (@wilburlabs_731).</description>
    <link>https://dev.to/wilburlabs_731</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891943%2F5a06d5d1-2e05-43a4-8257-b4a2f0021ef2.jpeg</url>
      <title>DEV Community: Wilbur AI</title>
      <link>https://dev.to/wilburlabs_731</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wilburlabs_731"/>
    <language>en</language>
    <item>
      <title>I Built an AI System Where Agents Argue — Then Learn From the Argument</title>
      <dc:creator>Wilbur AI</dc:creator>
      <pubDate>Wed, 22 Apr 2026 07:26:19 +0000</pubDate>
      <link>https://dev.to/wilburlabs_731/i-built-an-ai-system-where-agents-argue-then-learn-from-the-argument-41l9</link>
      <guid>https://dev.to/wilburlabs_731/i-built-an-ai-system-where-agents-argue-then-learn-from-the-argument-41l9</guid>
      <description>&lt;p&gt;Most AI agent systems come in two flavors: a single autonomous agent&lt;br&gt;
looping until it declares victory (AutoGPT), or multiple agents&lt;br&gt;
dividing labor on a task (CrewAI). I wanted a third flavor — agents&lt;br&gt;
that actually disagree with each other, and a system that gets smarter&lt;br&gt;
from those disagreements.&lt;/p&gt;

&lt;p&gt;This is the story of building Agora, what I borrowed from existing&lt;br&gt;
projects, and the one thing that's genuinely new.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyro599lvuohs8bcm320.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyro599lvuohs8bcm320.gif" alt=" " width="560" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with "one AI thinks for you"
&lt;/h2&gt;

&lt;p&gt;When engineers plan a feature, they don't just have one brain compute&lt;br&gt;
the answer. They research prior art. They debate. Someone pokes holes.&lt;br&gt;
Then they act. Each role catches different blind spots.&lt;/p&gt;

&lt;p&gt;Single-LLM systems skip all of that. You get one answer, often with&lt;br&gt;
confidence that isn't earned.&lt;/p&gt;

&lt;p&gt;Multi-agent systems like CrewAI get you partway there — agents divide&lt;br&gt;
labor — but they rarely argue. The "Research agent" hands off to&lt;br&gt;
"Write agent" who hands off to "Edit agent" in a pipeline. No one's&lt;br&gt;
job is to push back.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Agora does differently
&lt;/h2&gt;

&lt;p&gt;Agora has a council: Scout (research), Architect (design), Critic&lt;br&gt;
(challenges), Synthesizer (summary), plus an optional Sentinel&lt;br&gt;
(security review). They run in &lt;strong&gt;parallel&lt;/strong&gt; on the same input. Their&lt;br&gt;
outputs go to the Synthesizer, who notes where they agreed and where&lt;br&gt;
they disagreed, then turns the discussion into action items you&lt;br&gt;
approve before anything executes.&lt;/p&gt;

&lt;p&gt;Then the part I think is genuinely new: after the session, Agora runs&lt;br&gt;
&lt;strong&gt;two&lt;/strong&gt; skill extractors independently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution skill extractor&lt;/strong&gt; — "what worked for this task type"
(learned from the Executor's tool-calling trace)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discussion skill extractor&lt;/strong&gt; — "how did the roles disagree, how
was it resolved" (learned from the council transcript)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second one has a dedicated prompt (&lt;code&gt;_DISCUSS_PROMPT&lt;/code&gt; in the code)&lt;br&gt;
that explicitly asks "what did each perspective contribute" and "how&lt;br&gt;
were disagreements resolved". It's structurally impossible for a&lt;br&gt;
single-agent system to produce this signal — there's no one to&lt;br&gt;
disagree with.&lt;/p&gt;
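&lt;p&gt;To make the split concrete, here's a minimal sketch of that dual-extraction step. Everything here (class and function names, the summary strings) is my illustration, not Agora's actual API; only the "two extractors, run independently, neither sees the other's input" shape comes from the design above.&lt;/p&gt;

```python
# Hypothetical sketch of the dual-extraction step; names are
# assumptions, only the two-independent-extractors shape is Agora's.
from dataclasses import dataclass

@dataclass
class Skill:
    kind: str      # "execution" or "discussion"
    summary: str

def extract_execution_skill(tool_trace):
    # "What worked for this task type", from the Executor's trace.
    return Skill("execution", f"{len(tool_trace)} tool calls reviewed")

def extract_discussion_skill(transcript):
    # "How did the roles disagree, how was it resolved",
    # from the council transcript of (role, message) pairs.
    roles = sorted({role for role, _ in transcript})
    return Skill("discussion", "perspectives: " + ", ".join(roles))

def extract_all(tool_trace, transcript):
    # The extractors run independently; neither sees the other's input.
    return [extract_execution_skill(tool_trace),
            extract_discussion_skill(transcript)]
```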

&lt;h2&gt;
  
  
  Honest positioning
&lt;/h2&gt;

&lt;p&gt;I stand on two projects' work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeerFlow&lt;/strong&gt; (ByteDance) gave me the sandbox execution model and
the memory design&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hermes Agent&lt;/strong&gt; (Nous Research) gave me the learn-skills-from-execution pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agora's original contribution is the council discussion layer&lt;br&gt;
&lt;strong&gt;and&lt;/strong&gt; the discussion-skill extraction. Both originals are credited in&lt;br&gt;
the README.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User input
  ↓
Moderator (routing) → QUICK / DISCUSS / EXECUTE / CLARIFY
  ↓
DISCUSS branch (parallel):
  Scout (web search) ║ Architect (design) ║ Critic (challenge)
                      ↓
  Synthesizer (merges to action items)
  ↓
User approves
  ↓
Executor (tool-calling loop: read / write / patch / shell)
  ↓
SkillExtractor (discussion skill ‖ execution skill, independent)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
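&lt;p&gt;The Moderator's routing step can be sketched as a four-way decision. The route names match the diagram; the keyword heuristic below is a stand-in (the real Moderator is an LLM call), so treat everything except the four routes as hypothetical.&lt;/p&gt;

```python
# Sketch of the Moderator's four-way routing; routes come from the
# diagram, the heuristic is a placeholder for the real LLM call.
from enum import Enum

class Route(Enum):
    QUICK = "quick"        # answer directly, no council
    DISCUSS = "discuss"    # fan out to the full council
    EXECUTE = "execute"    # straight to the tool-calling Executor
    CLARIFY = "clarify"    # ask the user a follow-up question

def route(user_input: str) -> Route:
    text = user_input.lower()
    if any(w in text for w in ("design", "plan", "tradeoff")):
        return Route.DISCUSS
    if any(w in text for w in ("run", "fix", "apply")):
        return Route.EXECUTE
    if text.endswith("?"):
        return Route.QUICK
    return Route.CLARIFY
```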



&lt;h2&gt;
  
  
  Implementation details worth highlighting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Three-tier skill matching
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# backend/agora/skills/store.py
&lt;/span&gt;&lt;span class="nf"&gt;match_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Primary: semantic similarity
&lt;/span&gt;&lt;span class="nf"&gt;match_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# Fallback: LLM relevance check
&lt;/span&gt;&lt;span class="nf"&gt;match_keyword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# Last resort: keyword overlap
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works in environments without embedding providers. Each tier has&lt;br&gt;
different cost/quality tradeoffs — the system degrades gracefully&lt;br&gt;
instead of hard-failing.&lt;/p&gt;
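&lt;p&gt;A sketch of how the fallback chain might be wired, assuming the three function names above; the internals are placeholders, not the real &lt;code&gt;store.py&lt;/code&gt;:&lt;/p&gt;

```python
# Graceful-degradation sketch: each tier returns None when its
# backing service is unavailable, and the chain falls through.
def match_embedding(query, skills, embed=None):
    if embed is None:
        return None   # no embedding provider configured
    # ... cosine similarity over skill embeddings would go here
    return None

def match_llm(query, skills, llm=None):
    if llm is None:
        return None   # no judge model available
    return None

def match_keyword(query, skills):
    # Last resort: crude token overlap, always available.
    q = set(query.lower().split())
    best, best_score = None, 0
    for s in skills:
        score = len(q.intersection(s.lower().split()))
        if score > best_score:
            best, best_score = s, score
    return best

def match(query, skills, embed=None, llm=None):
    # Try each tier in descending quality order; fall through on None.
    for tier in (lambda: match_embedding(query, skills, embed),
                 lambda: match_llm(query, skills, llm),
                 lambda: match_keyword(query, skills)):
        hit = tier()
        if hit is not None:
            return hit
    return None
```

&lt;p&gt;The point of the shape: a missing tier yields &lt;code&gt;None&lt;/code&gt; instead of raising, which is what "degrades gracefully" means in practice.&lt;/p&gt;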

&lt;h3&gt;
  
  
  Parallel council execution
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;scout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;architect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;critic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;synthesizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Council wall-clock time stays close to single-agent response time&lt;br&gt;
instead of scaling with role count. This is what makes the&lt;br&gt;
"council debate" pattern actually usable in a product.&lt;/p&gt;
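&lt;p&gt;One practical wrinkle with a bare &lt;code&gt;asyncio.gather&lt;/code&gt;: if any council member raises, the first exception propagates and the other results are lost. A hedged variant (agent names here are stand-ins, not Agora's real role objects) keeps the surviving opinions and collects the failures:&lt;/p&gt;

```python
# Fault-tolerant council round: return_exceptions=True turns each
# failure into a value instead of aborting the whole gather.
import asyncio

async def run_council(agents):
    results = await asyncio.gather(
        *(agent() for agent in agents), return_exceptions=True
    )
    # Keep successful opinions; surface failures separately.
    opinions = [r for r in results if not isinstance(r, Exception)]
    failures = [r for r in results if isinstance(r, Exception)]
    return opinions, failures

async def demo():
    async def scout():
        return "prior art found"
    async def critic():
        raise RuntimeError("model timeout")
    return await run_council([scout, critic])
```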

&lt;h3&gt;
  
  
  SSE streaming across agents
&lt;/h3&gt;

&lt;p&gt;The web UI shows all four agents streaming tokens simultaneously.&lt;br&gt;
Without this, users wait 30 seconds for a blob of text. With it,&lt;br&gt;
the "council discussing" metaphor feels alive. Worth every line of&lt;br&gt;
the frontend work.&lt;/p&gt;
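&lt;p&gt;On the wire this is one event stream with each frame tagged by agent, so the client can route tokens to the right panel. A minimal framing sketch (the JSON field names are my assumptions, not Agora's actual protocol):&lt;/p&gt;

```python
# Server-sent-event framing for an interleaved multi-agent stream.
# Each frame is "data: {json}\n\n"; the agent field tells the UI
# which panel the token belongs to.
import json

def sse_frame(agent: str, token: str) -> str:
    payload = json.dumps({"agent": agent, "token": token})
    return f"data: {payload}\n\n"

def interleave(streams):
    # streams: dict of agent name to its pending token list.
    # Round-robin so no single agent starves the others on screen.
    frames = []
    while any(streams.values()):
        for agent, tokens in streams.items():
            if tokens:
                frames.append(sse_frame(agent, tokens.pop(0)))
    return frames
```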

&lt;h2&gt;
  
  
  Testing AI systems
&lt;/h2&gt;

&lt;p&gt;Two-tier strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unit tests with mocks&lt;/strong&gt; (188) — verify control flow, prompt
structure, data shape&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration tests with LLM-as-Judge&lt;/strong&gt; (15) — verify actual
output quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unit tests catch regressions fast; integration tests catch quality&lt;br&gt;
drift when you change prompts. Both matter.&lt;/p&gt;
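&lt;p&gt;The mock tier can be illustrated with a toy synthesizer test; the function and prompt here are invented for the sketch, but the pattern (assert on prompt structure and output shape, never on model quality) is the one described above:&lt;/p&gt;

```python
# Unit-test pattern: mock the LLM, assert on control flow, prompt
# structure, and data shape. No network, no model, runs in ms.
import asyncio
from unittest.mock import AsyncMock

async def synthesize(llm, opinions):
    # Toy synthesizer: prompt the model, keep only bullet lines.
    prompt = "Merge these council opinions into action items:\n" + "\n".join(opinions)
    reply = await llm(prompt)
    return [line for line in reply.splitlines() if line.startswith("- ")]

def test_synthesize_shapes_prompt_and_output():
    llm = AsyncMock(return_value="- add caching\n- write tests\nignored")
    items = asyncio.run(synthesize(llm, ["use redis", "beware invalidation"]))
    # Data shape: only bullet lines survive.
    assert items == ["- add caching", "- write tests"]
    # Prompt structure: every opinion made it into the prompt.
    sent = llm.call_args.args[0]
    assert "use redis" in sent
    assert "beware invalidation" in sent
```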

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Web UI polish (discussion visualization still needs work)&lt;/li&gt;
&lt;li&gt;MCP server support (external tool integrations beyond the built-in
set)&lt;/li&gt;
&lt;li&gt;Skill versioning (current skills are immutable; roll-forward
should be possible)&lt;/li&gt;
&lt;li&gt;Dynamic sub-agent generation (on-demand specialist roles)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/wilbur-labs/Agora.git
&lt;span class="nb"&gt;cd &lt;/span&gt;Agora
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env  &lt;span class="c"&gt;# add CLAUDE_API_KEY&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
open http://localhost:8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MIT license. FastAPI + Next.js. Works with Claude, Azure OpenAI,&lt;br&gt;
OpenAI, or any OpenAI-compatible endpoint (including local Ollama&lt;br&gt;
/ vLLM).&lt;/p&gt;
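&lt;p&gt;For a local endpoint, an OpenAI-compatible base URL is typically all the configuration needed. The variable names below are illustrative guesses; check &lt;code&gt;.env.example&lt;/code&gt; for the real keys:&lt;/p&gt;

```shell
# Hypothetical .env fragment pointing Agora at a local Ollama server.
# Variable names are illustrative; see .env.example for the real keys.
OPENAI_BASE_URL=http://localhost:11434/v1
OPENAI_API_KEY=ollama   # Ollama ignores the key, but clients require one
OPENAI_MODEL=llama3.1
```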

&lt;h2&gt;
  
  
  Ask
&lt;/h2&gt;

&lt;p&gt;If you've seen prior art for extracting skills from multi-agent&lt;br&gt;
disagreement specifically, please tell me. I've done due diligence&lt;br&gt;
on the usual suspects but it's a small world and I could easily&lt;br&gt;
have missed something. GitHub issues are open.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
