<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stephen Trembley</title>
    <description>The latest articles on DEV Community by Stephen Trembley (@sturnaai).</description>
    <link>https://dev.to/sturnaai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891384%2Faba05fc7-caa9-4d21-86e1-f3e59420271f.png</url>
      <title>DEV Community: Stephen Trembley</title>
      <link>https://dev.to/sturnaai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sturnaai"/>
    <language>en</language>
    <item>
      <title>How We Built a Self-Healing Agent Marketplace with 201 Competing AI Agents</title>
      <dc:creator>Stephen Trembley</dc:creator>
      <pubDate>Tue, 21 Apr 2026 22:06:38 +0000</pubDate>
      <link>https://dev.to/sturnaai/how-we-built-a-self-healing-agent-marketplace-with-201-competing-ai-agents-43kf</link>
      <guid>https://dev.to/sturnaai/how-we-built-a-self-healing-agent-marketplace-with-201-competing-ai-agents-43kf</guid>
      <description>&lt;p&gt;Most agent frameworks assume you know the best agent for the job before the job starts. You pick a model, wire a DAG, and hope it holds.&lt;/p&gt;

&lt;p&gt;We didn't know. So we made 201 agents compete for every task — and let outcomes decide.&lt;/p&gt;

&lt;p&gt;This is the architecture behind &lt;a href="https://sturna.ai" rel="noopener noreferrer"&gt;Sturna.ai&lt;/a&gt;, and why we call it the octopus brain.&lt;/p&gt;

&lt;h2&gt;The Problem with Static DAGs&lt;/h2&gt;

&lt;p&gt;LangGraph, CrewAI, AutoGen — they're all variations of the same idea: you compose agents into a fixed graph. Agent A calls Agent B which calls Agent C. The flow is known at design time.&lt;/p&gt;

&lt;p&gt;That works until it doesn't.&lt;/p&gt;

&lt;p&gt;In production, task diversity is brutal. A single "analyze my competitors" intent might need a web scraper, a summarizer, a data formatter, and a report writer — or it might need completely different agents depending on which competitors, which market, which output format. Static graphs require you to anticipate all of this upfront. You can't.&lt;/p&gt;

&lt;p&gt;The deeper problem: when a node fails, the whole DAG fails. There's no self-healing. There's no "try something else." You get an error and you restart.&lt;/p&gt;

&lt;h2&gt;The Octopus Brain Model&lt;/h2&gt;

&lt;p&gt;An octopus has a central brain but its arms have their own neural clusters — each arm can act semi-independently, process information locally, and adapt without waiting for central coordination.&lt;/p&gt;

&lt;p&gt;We built Sturna with the same principle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Central coordinator&lt;/strong&gt; receives an intent and broadcasts it to all capable agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;201 specialized agents&lt;/strong&gt; each evaluate the task independently and submit proposals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitive routing&lt;/strong&gt; selects the best proposal based on past performance, confidence scores, and task type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution layer&lt;/strong&gt; runs the winning agent — and if it fails, automatically routes to the next best proposal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No fixed DAG. No predetermined path. The route emerges from competition.&lt;/p&gt;
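&lt;p&gt;The selection step can be sketched in a few lines of TypeScript. This is only the shape of the idea, not Sturna's real API; &lt;code&gt;Proposal&lt;/code&gt; and &lt;code&gt;orderByCompetition&lt;/code&gt; are illustrative names:&lt;/p&gt;

```typescript
// Hypothetical sketch of "the route emerges from competition": given all
// submitted proposals, order them so the runner has a winner plus
// ready-made fallbacks. Names are illustrative, not Sturna's API.
interface Proposal {
  agentId: string;
  confidence: number; // 0-1, self-reported by the agent
}

// Highest confidence first; everything after index 0 is a fallback path.
function orderByCompetition(proposals: Proposal[]): Proposal[] {
  return [...proposals].sort((a, b) => b.confidence - a.confidence);
}
```

&lt;p&gt;Everything past the winner is kept, not discarded: that ordered tail is what makes the failover described below possible.&lt;/p&gt;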

&lt;h2&gt;What "Self-Healing" Actually Means&lt;/h2&gt;

&lt;p&gt;When people say "self-healing," they usually mean retry logic. Retry the same thing 3 times, then give up.&lt;/p&gt;

&lt;p&gt;That's not healing. That's hoping.&lt;/p&gt;

&lt;p&gt;Sturna's self-healing is architectural:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Every task has N competing proposals, ranked by predicted success&lt;/li&gt;
&lt;li&gt;If agent #1 fails, the system doesn't restart — it promotes agent #2&lt;/li&gt;
&lt;li&gt;Agent #2 runs with full context of what agent #1 attempted&lt;/li&gt;
&lt;li&gt;Failure data feeds back into routing scores, making future routing smarter&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The agents aren't just competing for the first run. They're competing across every run, accumulating performance history that shapes every future routing decision.&lt;/p&gt;
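&lt;p&gt;As a minimal sketch, the promote-the-next-proposal loop looks roughly like this. The interfaces are assumptions for illustration, not Sturna's actual types:&lt;/p&gt;

```typescript
// Illustrative failover loop: try agents in ranked order, and hand each
// backup a record of what earlier agents attempted. Names are hypothetical.
interface Attempt {
  agentId: string;
  error: string;
}

interface RankedAgent {
  id: string;
  // execute resolves to a result string, or rejects on failure
  execute: (task: string, priorAttempts: Attempt[]) => any;
}

// Promote the next proposal on failure instead of restarting the task.
async function runWithFailover(ranked: RankedAgent[], task: string) {
  const attempts: Attempt[] = [];
  for (const agent of ranked) {
    try {
      return await agent.execute(task, attempts);
    } catch (err) {
      attempts.push({ agentId: agent.id, error: String(err) });
    }
  }
  throw new Error("all " + ranked.length + " proposals failed");
}
```

&lt;p&gt;The key design choice is that &lt;code&gt;attempts&lt;/code&gt; travels with the task: failover is a continuation, not a blind retry.&lt;/p&gt;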

&lt;h2&gt;The Numbers After 6 Months in Production&lt;/h2&gt;

&lt;p&gt;After running this system across thousands of real tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;201 active agents&lt;/strong&gt; across 14 capability categories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;86%+ first-attempt success rate&lt;/strong&gt; (vs ~60% with our original static routing)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;45-second median time-to-value&lt;/strong&gt; from intent to delivered result&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-healing triggered on ~14% of tasks&lt;/strong&gt; — those tasks still complete; they just take a second pass&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 86% number is the one I'm most proud of. That's not accuracy on benchmarks — that's real tasks from real users completing successfully on the first agent attempt.&lt;/p&gt;

&lt;h2&gt;Competitive Routing vs Static DAGs: The Real Tradeoff&lt;/h2&gt;

&lt;p&gt;I want to be honest about what you give up with competitive routing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Static DAG&lt;/th&gt;
&lt;th&gt;Competitive Routing&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Predictability&lt;/td&gt;
&lt;td&gt;High — same path every time&lt;/td&gt;
&lt;td&gt;Lower — path varies by agent performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debuggability&lt;/td&gt;
&lt;td&gt;Easy — trace the graph&lt;/td&gt;
&lt;td&gt;Harder — need proposal replay logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency (simple tasks)&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;Higher — broadcast overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency (complex tasks)&lt;/td&gt;
&lt;td&gt;Higher — no fallback path&lt;/td&gt;
&lt;td&gt;Lower — parallel evaluation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure recovery&lt;/td&gt;
&lt;td&gt;Manual — fix the DAG&lt;/td&gt;
&lt;td&gt;Automatic — next proposal promoted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Improvement over time&lt;/td&gt;
&lt;td&gt;Manual — you retune&lt;/td&gt;
&lt;td&gt;Automatic — routing learns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For simple, well-scoped tasks you run thousands of times, static DAGs win on predictability. For diverse, open-ended tasks where failure matters, competitive routing wins on resilience.&lt;/p&gt;

&lt;p&gt;We built Sturna for the second category.&lt;/p&gt;

&lt;h2&gt;How Agents Submit Proposals&lt;/h2&gt;

&lt;p&gt;Each agent in Sturna exposes a &lt;code&gt;canHandle(intent)&lt;/code&gt; method that returns a confidence score (0-1) and an execution plan. When a task comes in:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified — real implementation has more context&lt;/span&gt;
&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;AgentProposal&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;agentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;estimatedDuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;executionPlan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;requiredCapabilities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Coordinator broadcasts and collects&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proposals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Rank by: confidence × historical success rate × recency&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ranked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rankProposals&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;proposals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;agentHistory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ranking function is the core IP. Confidence alone isn't enough — an agent can be overconfident on task types it's bad at. We weight heavily by actual historical success rate, with recency bias (recent performance matters more than old performance).&lt;/p&gt;
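&lt;p&gt;A toy version of that weighting, to make the idea concrete. The half-life constant and the neutral prior below are invented for illustration; the actual ranking function is not public:&lt;/p&gt;

```typescript
// Toy scoring sketch: confidence multiplied by a recency-weighted success
// rate. All constants here are made up, not Sturna's real values.
interface HistoryEntry {
  success: boolean;
  ageDays: number;
}

// Exponential decay: a run from halfLifeDays ago counts half as much.
function recencyWeightedSuccess(history: HistoryEntry[], halfLifeDays = 14): number {
  if (history.length === 0) return 0.5; // neutral prior for brand-new agents
  let weighted = 0;
  let total = 0;
  for (const h of history) {
    const w = Math.pow(0.5, h.ageDays / halfLifeDays);
    weighted += w * (h.success ? 1 : 0);
    total += w;
  }
  return weighted / total;
}

// Confidence alone is not trusted; it is discounted by demonstrated success.
function rankScore(confidence: number, history: HistoryEntry[]): number {
  return confidence * recencyWeightedSuccess(history);
}
```

&lt;p&gt;With this shape, an overconfident agent with recent failures ranks below a modest agent with a clean recent record, which is exactly the behavior the ranking needs.&lt;/p&gt;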

&lt;h2&gt;What We Got Wrong First&lt;/h2&gt;

&lt;p&gt;Two things killed our first two versions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version 1: Too much competition.&lt;/strong&gt; Broadcasting to all 201 agents created ~400ms of overhead even before execution started. We added capability tagging — agents declare what they can handle, and broadcast only goes to capable agents. Overhead dropped to ~30ms.&lt;/p&gt;
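&lt;p&gt;Capability tagging reduces to a pre-broadcast filter. A sketch, with field names that are assumptions rather than Sturna's schema:&lt;/p&gt;

```typescript
// Sketch of capability tagging: the coordinator broadcasts only to agents
// whose declared capabilities cover the intent's requirements.
interface TaggedAgent {
  id: string;
  capabilities: string[]; // declared by the agent at registration time
}

// Filter before broadcast, so 201 evaluations shrink to a relevant handful.
function capableAgents(agents: TaggedAgent[], required: string[]): TaggedAgent[] {
  return agents.filter((a) => required.every((c) => a.capabilities.includes(c)));
}
```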

&lt;p&gt;&lt;strong&gt;Version 2: No proposal replay.&lt;/strong&gt; When an agent failed, the next agent started completely fresh. Users saw inconsistent results. We built a context handoff layer — the winning backup agent receives what the failed agent attempted, and can continue rather than restart.&lt;/p&gt;

&lt;p&gt;The context handoff was 3 weeks of work and cut re-execution time in half.&lt;/p&gt;
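&lt;p&gt;A hypothetical shape for that handoff payload, to show why it enables continuation rather than restart. Every field name here is an assumption for illustration:&lt;/p&gt;

```typescript
// Illustrative context-handoff payload a backup agent might receive.
// None of these field names come from Sturna's actual implementation.
interface HandoffContext {
  originalIntent: string;
  failedAgentId: string;
  failureReason: string;
  partialOutput: string | null; // what the failed agent produced, if anything
  stepsCompleted: string[];
}

// Continue from partial work when it exists; otherwise restart from step 0.
function resumeStep(ctx: HandoffContext): number {
  return ctx.partialOutput === null ? 0 : ctx.stepsCompleted.length;
}
```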

&lt;h2&gt;Where This Goes&lt;/h2&gt;

&lt;p&gt;The 201-agent number isn't a ceiling. Every new capability we add is a new agent. The routing system gets better the more agents compete — more data, more diversity, more paths to success.&lt;/p&gt;

&lt;p&gt;We're currently working on agent coalitions: groups of agents that propose to handle a task collaboratively, with shared execution context. The octopus brain, but with arms that can coordinate.&lt;/p&gt;

&lt;p&gt;If you're building agent infrastructure and want to compare notes, we're at &lt;a href="https://sturna.ai" rel="noopener noreferrer"&gt;sturna.ai&lt;/a&gt;. The system is live and handling real production traffic — we'd rather learn from builders than pitch in the abstract.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post covers the architecture as it exists today. The numbers are from our internal dashboards as of April 2026.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
