<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Robin King</title>
    <description>The latest articles on DEV Community by Robin King (@surimple).</description>
    <link>https://dev.to/surimple</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3911897%2F6508757a-fcc7-47ea-bfe5-f5b62013559b.jpg</url>
      <title>DEV Community: Robin King</title>
      <link>https://dev.to/surimple</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/surimple"/>
    <language>en</language>
    <item>
      <title>CLMA Frame Test</title>
      <dc:creator>Robin King</dc:creator>
      <pubDate>Mon, 04 May 2026 14:20:16 +0000</pubDate>
      <link>https://dev.to/surimple/clma-frame-test-446</link>
      <guid>https://dev.to/surimple/clma-frame-test-446</guid>
      <description>&lt;h1&gt;
  
  
  CLMA vs Web Chat: Putting Iterative Verification to the Test
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Posted on May 4, 2026 · #CLMA #MultiAgent #CodeGeneration #EventSourcing #Comparison #Python&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;All code is open source on GitHub: &lt;a href="https://github.com/kriely/CLMA" rel="noopener noreferrer"&gt;github.com/kriely/CLMA&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is a companion piece to &lt;a href="https://dev.to/kriely/building-clma-2p6b"&gt;Building CLMA: A Self-Verifying Multi-Agent Framework from Scratch&lt;/a&gt;. In that article, I described the framework. Here, I put it to the test — head to head against a plain web chat, same model, same problem.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Same LLM&lt;/strong&gt; (DeepSeek) tasked with writing the same code. No human intervention on either side. Two questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Q1&lt;/strong&gt; — Thread-safe bounded blocking queue (put/get with timeout)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Q5&lt;/strong&gt; — Event sourcing framework for a bank account (events, replay, serialization, optimistic concurrency, business rules, freeze/unfreeze)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For Q5, the CLMA version went through &lt;strong&gt;3 automated iteration rounds&lt;/strong&gt; (Solver → Verifier → Refiner → Verifier → Refiner → Verifier → Evaluator). The web chat version was a single-shot output.&lt;/p&gt;




&lt;h2&gt;
  
  
  Q1: Bounded Blocking Queue
&lt;/h2&gt;

&lt;p&gt;Both implementations passed all &lt;strong&gt;12 test cases&lt;/strong&gt; — basic put/get, blocking/unblocking behavior, timeout, edge cases (maxsize=1, maxsize=0), queue state queries, and invalid capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;12/12 pass for both.&lt;/strong&gt; On the surface, a draw. But the engineering quality tells a different story.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLMA Version (1.py)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Two separate Conditions — put and get don't contend
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;not_empty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;not_full&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_lock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# time.monotonic() — immune to system clock adjustments
&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;full&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;Full&lt;/span&gt;
        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;not_full&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;not_full&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Web Chat Version (2.py)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Single Condition — functional but suboptimal
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cond&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Condition&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# time.time() — affected by system clock changes
&lt;/span&gt;&lt;span class="n"&gt;deadline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;full&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;deadline&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;QueueFull&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cond&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Differences
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;CLMA&lt;/th&gt;
&lt;th&gt;Web Chat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Conditions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2 (not_empty / not_full) — put/get don't contend&lt;/td&gt;
&lt;td&gt;1 — notify() may wake wrong waiter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Clock&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;time.monotonic()&lt;/code&gt; — immune to NTP adjustments&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;time.time()&lt;/code&gt; — affected by system clock changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Timeout accounting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Remaining time decremented each iteration from monotonic reads&lt;/td&gt;
&lt;td&gt;Remaining time recomputed from a wall-clock &lt;code&gt;deadline&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exception names&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Full&lt;/code&gt;, &lt;code&gt;Empty&lt;/code&gt; — concise&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;QueueFull&lt;/code&gt;, &lt;code&gt;QueueEmpty&lt;/code&gt; — verbose&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Edge case&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Handles &lt;code&gt;timeout &amp;lt; 0&lt;/code&gt; defensively&lt;/td&gt;
&lt;td&gt;No check for negative timeout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Comments&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;Chinese&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Both pass all tests, but CLMA's design is more robust for high-concurrency scenarios. Two Conditions prevent head-of-line blocking between producers and consumers. &lt;code&gt;time.monotonic()&lt;/code&gt; avoids a real-world bug class (NTP jumps causing premature or delayed timeouts). The difference matters under load, not in a single-threaded test.&lt;/p&gt;
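
&lt;p&gt;To make the pattern concrete, here is a minimal, self-contained sketch of the monotonic-deadline wait — an illustrative helper of my own, not code lifted from &lt;code&gt;1.py&lt;/code&gt;:&lt;/p&gt;

```python
import threading
import time


def wait_not_full(not_full, is_full, timeout=None):
    """Wait on a Condition until is_full() turns False, on a monotonic deadline.

    Illustrative sketch of the pattern CLMA chose: time.monotonic() is
    unaffected by NTP or manual clock changes, so the total wait can never
    silently stretch or shrink. Returns True on success, False on timeout.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    with not_full:
        while is_full():
            if deadline is None:
                not_full.wait()  # block indefinitely
            else:
                remaining = deadline - time.monotonic()
                if remaining > 0:
                    not_full.wait(remaining)  # recompute budget each pass
                else:
                    return False  # timed out while still full
    return True
```

&lt;p&gt;The same shape works for the &lt;code&gt;not_empty&lt;/code&gt; side of &lt;code&gt;get()&lt;/code&gt;; pairing one Condition per direction is what lets producers and consumers wake each other without contending.&lt;/p&gt;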




&lt;h2&gt;
  
  
  Q5: Event Sourcing Framework
&lt;/h2&gt;

&lt;p&gt;This is where the gap opens wider. Both implement an event-sourced bank account with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Events&lt;/strong&gt;: account opened, deposited, withdrawn, frozen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event store&lt;/strong&gt; with optimistic concurrency control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event replay&lt;/strong&gt; (rebuild aggregate state from history)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Serialization / deserialization&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business rules&lt;/strong&gt;: no negative deposits, no over-withdrawal, no withdrawal on frozen account&lt;/li&gt;
&lt;/ul&gt;
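
&lt;p&gt;The replay requirement boils down to a pure fold over the event history. A minimal sketch of the idea both versions implement — class and field names here are illustrative, not taken from &lt;code&gt;3.py&lt;/code&gt; or &lt;code&gt;4.py&lt;/code&gt;:&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Deposited:
    amount: int


@dataclass
class Withdrawn:
    amount: int


@dataclass
class Frozen:
    pass


@dataclass
class Unfrozen:
    pass


def replay(events):
    """Rebuild account state purely from its event history.

    State is never stored directly; it is always derivable by folding
    the ordered event stream into (balance, frozen).
    """
    balance, frozen = 0, False
    for event in events:
        if isinstance(event, Deposited):
            balance += event.amount
        elif isinstance(event, Withdrawn):
            balance -= event.amount
        elif isinstance(event, Frozen):
            frozen = True
        elif isinstance(event, Unfrozen):
            frozen = False
    return balance, frozen
```

&lt;p&gt;Deposit 100 and 50, withdraw 30, and replay yields a balance of 120 — the exact check used in the test suite below.&lt;/p&gt;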

&lt;h3&gt;
  
  
  CLMA Version (4.py) — After 3 Iterations
&lt;/h3&gt;

&lt;p&gt;The automated Verifier caught two things the initial output missed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Round 1 → Round 2:&lt;/strong&gt; "Where's the &lt;code&gt;Unfrozen&lt;/code&gt; event? A frozen account can never be unfrozen — this is incomplete."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Round 2 → Round 3:&lt;/strong&gt; "The freeze implementation blocks withdrawals, but should it also block deposits? This is a business policy decision — document it explicitly."&lt;/p&gt;

&lt;p&gt;Result — &lt;strong&gt;CLMA adds the &lt;code&gt;Unfrozen&lt;/code&gt; event&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Unfrozen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggregate_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aggregate_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the &lt;code&gt;BankAccount&lt;/code&gt; handles it properly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Deposited&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;       &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Withdrawn&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;     &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Frozen&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_frozen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Unfrozen&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;      &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_frozen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# ← Added by Verifier
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Web Chat Version (3.py) — Single Shot
&lt;/h3&gt;

&lt;p&gt;The web version has a clean architecture — proper &lt;code&gt;Event&lt;/code&gt; base class, &lt;code&gt;register_event&lt;/code&gt; decorator, &lt;code&gt;payload()&lt;/code&gt; abstraction, serialization round-trip. But &lt;strong&gt;it has no &lt;code&gt;Unfrozen&lt;/code&gt; event.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@register_event&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AccountFrozen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;aggregate_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
    &lt;span class="c1"&gt;# ... no Unfrozen counterpart exists
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;freeze()&lt;/code&gt; method works, but there's no &lt;code&gt;unfreeze()&lt;/code&gt;. Once frozen, the account stays frozen forever.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Test Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;CLMA&lt;/th&gt;
&lt;th&gt;Web Chat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Event basics (IDs, timestamps, types)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serialization / deserialization&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event replay (deposit 100+50, withdraw 30 = 120)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business rules (no negative, no overdraft)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Freeze → reject withdrawal&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unfreeze → allow operations again&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;❌ Missing&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimistic concurrency&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both frameworks pass all standard event sourcing tests. But the &lt;strong&gt;missing &lt;code&gt;Unfrozen&lt;/code&gt; event&lt;/strong&gt; in the web chat version is not a cosmetic issue — it's a domain modeling gap. In any real banking system, frozen accounts need a thaw mechanism.&lt;/p&gt;
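
&lt;p&gt;For reference, the optimistic concurrency row both versions pass reduces to a version check on append. A minimal sketch — the API here is illustrative, not either file's actual interface:&lt;/p&gt;

```python
class ConcurrencyError(Exception):
    """Raised when the caller's expected version no longer matches the stream."""


class EventStore:
    """Append-only, in-memory event store with optimistic concurrency control."""

    def __init__(self):
        self._streams = {}  # aggregate_id to ordered list of events

    def append(self, aggregate_id, events, expected_version):
        stream = self._streams.setdefault(aggregate_id, [])
        if len(stream) != expected_version:
            # Someone else appended since the caller loaded; reject the write.
            raise ConcurrencyError(
                f"expected version {expected_version}, store is at {len(stream)}"
            )
        stream.extend(events)
        return len(stream)  # new version

    def load(self, aggregate_id):
        return list(self._streams.get(aggregate_id, []))
```

&lt;p&gt;A writer that loaded at version 1 and tries to append after another writer got there first is rejected instead of silently clobbering history — the property the "Optimistic concurrency" test exercises.&lt;/p&gt;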

&lt;h3&gt;
  
  
  Why CLMA Found It
&lt;/h3&gt;

&lt;p&gt;The third iteration round is where the value shows. The Verifier's feedback was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The freeze flow is incomplete. Freezing is an operation that must be reversible. Consider adding an Unfrozen event and updating the aggregate to apply it."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A human reviewer would spot this too. But the CLMA Verifier catches it automatically, in seconds, with no developer in the loop. This is the difference between code review as a &lt;em&gt;process&lt;/em&gt; and code review as a &lt;em&gt;single-shot prompt&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Q1 (Blocking Queue)&lt;/th&gt;
&lt;th&gt;Q5 (Event Sourcing)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CLMA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12/12 ✅ + better design&lt;/td&gt;
&lt;td&gt;Full feature set ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Web Chat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12/12 ✅ + usable but less robust&lt;/td&gt;
&lt;td&gt;Missing &lt;code&gt;Unfrozen&lt;/code&gt; event ❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For simple, well-defined problems (Q1), a single-shot chat prompt gets you 90% of the way. The CLMA advantage is marginal — better engineering choices, but the output is functionally equivalent.&lt;/p&gt;

&lt;p&gt;For complex, multi-faceted problems (Q5) where completeness matters — domain events, edge cases, business rules — &lt;strong&gt;the iterative verification loop earns its keep.&lt;/strong&gt; The 3 rounds of automated review caught a real domain modeling gap that a single prompt missed. Not because the LLM couldn't write an &lt;code&gt;Unfrozen&lt;/code&gt; event, but because no single prompt can anticipate all the completeness conditions of a non-trivial domain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pattern is clear:&lt;/strong&gt; Generation quality is already good. Verification quality is where the gap is. And verification is exactly what CLMA automates.&lt;/p&gt;




&lt;h2&gt;
  
  
  Files
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;1.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CLMA — bounded blocking queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;2.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Web chat — bounded blocking queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;3.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Web chat — event sourcing framework&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;4.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CLMA — event sourcing framework (3 iterations)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_compare.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Q1 test suite — 12 cases for both&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_q5_compare.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Q5 test suite — auto-detects class names&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All comparison files are in the &lt;a href="https://github.com/kriely/CLMA" rel="noopener noreferrer"&gt;CLMA repository&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tags: #CLMA #MultiAgent #CodeGeneration #EventSourcing #Comparison #Python #DeepSeek&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Building CLMA: A Self-Verifying Multi-Agent Framework from Scratch</title>
      <dc:creator>Robin King</dc:creator>
      <pubDate>Mon, 04 May 2026 12:34:05 +0000</pubDate>
      <link>https://dev.to/surimple/building-clma-a-self-verifying-multi-agent-framework-from-scratch-3068</link>
      <guid>https://dev.to/surimple/building-clma-a-self-verifying-multi-agent-framework-from-scratch-3068</guid>
      <description>&lt;h1&gt;
  
  
  Building CLMA: A Self-Verifying Multi-Agent Framework from Scratch
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Posted on May 4, 2026 · #LLM #MultiAgent #CodeGeneration #OpenSource #SystemDesign #WebUI #SSE&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;All code is open source on GitHub: &lt;a href="https://github.com/kriely/CLMA" rel="noopener noreferrer"&gt;github.com/kriely/CLMA&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1: The Problem — LLMs Can't Self-Verify
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The One-Off Generation Trap
&lt;/h3&gt;

&lt;p&gt;If you've spent any time using AI for coding, you've experienced this cycle: ask → get code → try to run → it fails → paste error → get fix → something else breaks → lather, rinse, repeat.&lt;/p&gt;

&lt;p&gt;Each iteration costs you time, context switching, and cognitive energy. The LLM itself never knows whether its output actually &lt;em&gt;works&lt;/em&gt; — it just predicts tokens. It produces code, but it cannot &lt;em&gt;verify&lt;/em&gt; code.&lt;/p&gt;

&lt;p&gt;This is the fundamental asymmetry of LLM-assisted coding today: &lt;strong&gt;generation is cheap, but verification is manual&lt;/strong&gt;. And as tasks grow from "write a sort function" to "build a microservice architecture with authentication, rate limiting, and a PostgreSQL backend", the gap between "code that looks right" and "code that actually works" becomes a chasm.&lt;/p&gt;

&lt;p&gt;Most existing solutions paper over this gap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Direct prompting&lt;/strong&gt; — ask once, hope for the best&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat-based refinement&lt;/strong&gt; — human-in-the-loop for every error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent frameworks&lt;/strong&gt; — chain multiple LLM calls, but still no automated quality gate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG + tools&lt;/strong&gt; — give the LLM more context, but still no feedback loop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of them ask the hard question: &lt;em&gt;How do you know the output is good?&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Seed of an Idea
&lt;/h3&gt;

&lt;p&gt;The idea for CLMA (Closed-Loop Multi-Agent) came from a simple observation: &lt;strong&gt;if one LLM call is unreliable, and a human checking its output is slow, what if we let a &lt;em&gt;second&lt;/em&gt; LLM call verify the first one's output — and then give that feedback back to the first one to improve?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the core loop: Solver produces code → Verifier checks it → Refiner improves it → repeat until scores pass a threshold. No human in the middle.&lt;/p&gt;

&lt;p&gt;But turning that simple idea into a working system took months of iteration, dozens of wrong turns, and a fundamental rethinking of what "multi-agent" actually means.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Core Architecture
&lt;/h3&gt;

&lt;p&gt;CLMA is built in three layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────┐
│  Web UI (Flask + SSE + SVG)         │
│  Real-time flow graphs &amp;amp; gauges     │
├─────────────────────────────────────┤
│  Python Interface (pybind11)        │
│  Agent orchestration &amp;amp; scoring      │
├─────────────────────────────────────┤
│  C++17 Core Engine                  │
│  Orchestrator · DAG · Rule Engine   │
│  Token Monitor · Plugin Manager     │
└─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The C++ core handles performance-critical paths — DAG processing, rule matching, and token tracking — while the Python layer manages agent orchestration, LLM API calls, and scoring logic. The Web UI communicates via Server-Sent Events for real-time streaming of every agent action.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Five Agent Roles
&lt;/h3&gt;

&lt;p&gt;Every query passes through some subset of these five agents:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Prompt Template&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Refiner&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reformulates the user's query into a structured task. Extracts implicit requirements.&lt;/td&gt;
&lt;td&gt;"Restate the task clearly. Identify edge cases."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reasoner&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Produces a solution strategy without writing code. Plans the approach.&lt;/td&gt;
&lt;td&gt;"Outline the algorithm. Consider time/space complexity."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Solver&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generates the actual implementation code.&lt;/td&gt;
&lt;td&gt;"Write production-quality code following the plan."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Verifier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reviews the Solver's output. Checks correctness, completeness, and potential bugs.&lt;/td&gt;
&lt;td&gt;"Review this code. List issues by severity."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scores the final output on three dimensions. Decides if iteration is needed.&lt;/td&gt;
&lt;td&gt;"Rate this solution on reasonableness, executability, and satisfaction."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Evaluator produces a three-dimensional score:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reasonableness&lt;/strong&gt; (0–1): Does the approach make sense for the problem?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executability&lt;/strong&gt; (0–1): Would the code actually run without errors?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Satisfaction&lt;/strong&gt; (0–1): Does the output fully address the user's query?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall = Reasonableness × 0.4 + Executability × 0.4 + Satisfaction × 0.2&lt;/p&gt;

&lt;p&gt;If the overall score falls below a configurable threshold (default 0.7), the framework loops back: Refiner receives Verifier's feedback, Solver generates an improved version, Verifier checks again, and Evaluator re-scores. This continues up to &lt;code&gt;max_iterations&lt;/code&gt; (default 3).&lt;/p&gt;
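
&lt;p&gt;In code, the scoring and the loop it drives fit in a few lines. This is a behavioral sketch with placeholder function names, not CLMA's actual orchestrator:&lt;/p&gt;

```python
def overall(reasonableness, executability, satisfaction):
    """Evaluator's weighted score, per the formula above."""
    return reasonableness * 0.4 + executability * 0.4 + satisfaction * 0.2


def run_loop(generate, verify, evaluate, threshold=0.7, max_iterations=3):
    """Closed loop: generate a candidate, verify it, feed the feedback back
    into generation, until the overall score passes the threshold or the
    iteration budget runs out."""
    feedback = None
    candidate = None
    for _ in range(max_iterations):
        candidate = generate(feedback)   # Solver, conditioned on prior feedback
        feedback = verify(candidate)     # Verifier's review
        if overall(*evaluate(candidate)) >= threshold:
            break                        # good enough; stop iterating
    return candidate
```

&lt;p&gt;Note that the loop always returns its best-so-far candidate even when the budget is exhausted — the framework degrades to "best of 3" rather than failing outright.&lt;/p&gt;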

&lt;h3&gt;
  
  
  Why Three Scores?
&lt;/h3&gt;

&lt;p&gt;A single score is too coarse for meaningful iteration. Consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High reasonableness + low executability → the approach is sound but the implementation has bugs → Verifier should focus on code issues&lt;/li&gt;
&lt;li&gt;Low reasonableness + high executability → the code runs but solves the wrong problem → Reasoner needs to rethink the approach&lt;/li&gt;
&lt;li&gt;Low satisfaction → the output is technically correct but misses the user's intent → Refiner should re-examine the query&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By separating the three dimensions, each agent gets targeted feedback about &lt;em&gt;what&lt;/em&gt; specifically needs improvement, rather than a vague "score too low, try again."&lt;/p&gt;
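
&lt;p&gt;The routing described in those three bullets can be sketched as a small dispatch table. This is my own illustration of the idea; CLMA's actual dispatch logic may differ:&lt;/p&gt;

```python
def route_feedback(scores, threshold=0.7):
    """Map the weakest failing dimension to the agent that should act on it."""
    targets = {
        "reasonableness": "Reasoner",  # code runs, but solves the wrong problem
        "executability": "Verifier",   # approach is sound, implementation is buggy
        "satisfaction": "Refiner",     # technically correct, misses the user's intent
    }
    failing = {dim: s for dim, s in scores.items() if threshold > s}
    if not failing:
        return None  # every dimension passes; no iteration needed
    weakest = min(failing, key=failing.get)
    return targets[weakest]
```

&lt;p&gt;Low executability routes back to the Verifier's code-level review, low reasonableness to the Reasoner's plan, and low satisfaction to the Refiner's reading of the query — targeted feedback instead of "score too low, try again."&lt;/p&gt;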




&lt;h2&gt;
  
  
  Part 2: From Single Loop to Adaptive Network — The Evolution of Execution Modes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Naive First Attempt
&lt;/h3&gt;

&lt;p&gt;When I started CLMA, the architecture was embarrassingly simple: a linear pipeline. Take the user's query → pass it through five agents in sequence → output the result. No iteration, no scoring, no feedback.&lt;/p&gt;

&lt;p&gt;It didn't work well.&lt;/p&gt;

&lt;p&gt;The first real version was the &lt;strong&gt;Single Closed Loop&lt;/strong&gt; — and it looked like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkriely%2FCLMA%2Fmain%2Fblog%2Fimages%2Fsingle%25E6%2589%25A7%25E8%25A1%258C%25E6%25B5%2581%25E7%25A8%258B.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkriely%2FCLMA%2Fmain%2Fblog%2Fimages%2Fsingle%25E6%2589%25A7%25E8%25A1%258C%25E6%25B5%2581%25E7%25A8%258B.gif" alt="Single Loop execution flow — the framework iterates through Refiner → Reasoner → Solver → Verifier → Evaluator until scores pass the threshold." width="760" height="496"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│  Query                                                  │
│    ↓                                                    │
│  Refiner → Reasoner → Solver → Verifier → Evaluator     │
│    ↑                                         │          │
│    └────── score &amp;lt; threshold? ───────────────┘          │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The loop:&lt;/strong&gt; Solver generates code → Verifier reviews it → if Evaluator scores below threshold, Refiner gets the feedback and the loop restarts. Each iteration builds on the previous one's Verifier feedback.&lt;/p&gt;
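&lt;p&gt;A minimal runnable sketch of that control flow (the agent callables and &lt;code&gt;evaluate()&lt;/code&gt; signature are stand-ins for illustration, not CLMA's real interfaces):&lt;/p&gt;

```python
# A minimal runnable sketch of the Single Loop, with agent callables standing
# in for Refiner/Reasoner/Solver/Verifier and evaluate() playing the Evaluator.
# These interfaces are illustrative, not CLMA's real ones.

def single_loop(query, agents, evaluate, threshold=0.7, max_rounds=3):
    """Iterate the agent chain until the score passes or rounds run out."""
    feedback = None
    best_score, best_output = 0.0, None
    for _ in range(max_rounds):
        state = {"query": query, "feedback": feedback}
        for agent in agents:                       # Refiner → Reasoner → Solver → Verifier
            state = agent(state)
        score = evaluate(state)                    # Evaluator
        if score > best_score:                     # keep the best round, not the last
            best_score, best_output = score, state.get("output")
        if score >= threshold:
            break
        feedback = state.get("verifier_feedback")  # seeds the next round's Refiner
    return best_score, best_output
```

Tracking the best round rather than the last one also guards against score oscillation, where a later round regresses.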

&lt;p&gt;This was the first time I saw the self-verification idea actually working. A query like "implement a thread-safe LRU cache" would start with a reasonable-but-flawed first attempt, then refine through 2–3 iterations into production-quality code — all without human intervention.&lt;/p&gt;

&lt;p&gt;But the Single Loop had a glaring problem: &lt;strong&gt;it treated every query the same way.&lt;/strong&gt; A "hello world" query and a "design a distributed rate limiter" query both went through the same 5-agent pipeline with the same iteration logic. The trivial query took 8 seconds when it should have taken 2. The complex query took 40 seconds when it needed more structured decomposition.&lt;/p&gt;

&lt;h3&gt;
  
  
  DAG Mode: Parallel Decomposition
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkriely%2FCLMA%2Fmain%2Fblog%2Fimages%2F%25E6%2589%2593%25E5%25BC%2580DAG%25E5%2590%258E%25E7%259A%2584single%25E6%2589%25A7%25E8%25A1%258C.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkriely%2FCLMA%2Fmain%2Fblog%2Fimages%2F%25E6%2589%2593%25E5%25BC%2580DAG%25E5%2590%258E%25E7%259A%2584single%25E6%2589%25A7%25E8%25A1%258C.gif" alt="DAG Mode — the C++ DAG processor decomposes tasks into parallel sub-tasks, executing them concurrently." width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first major iteration was &lt;strong&gt;DAG (Directed Acyclic Graph) mode.&lt;/strong&gt; Instead of running agents sequentially, the C++ DAG processor would:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parse the user's query to identify independent sub-tasks&lt;/li&gt;
&lt;li&gt;Build a dependency graph&lt;/li&gt;
&lt;li&gt;Execute parallel sub-tasks concurrently&lt;/li&gt;
&lt;li&gt;Aggregate and verify the combined output&lt;/li&gt;
&lt;/ol&gt;
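&lt;p&gt;The real DAG processor is C++; a minimal Python sketch of the same scheduling idea (all names are illustrative) might look like:&lt;/p&gt;

```python
# The real DAG processor is C++; this Python sketch only illustrates the
# scheduling idea: find sub-tasks whose prerequisites are done, run that
# wave concurrently, repeat until everything has executed.
from concurrent.futures import ThreadPoolExecutor

def run_dag(tasks, deps, solve):
    """tasks: names; deps: {task: set of prerequisites}; solve(name, inputs) -> result."""
    done = {}
    remaining = set(tasks)
    with ThreadPoolExecutor() as pool:
        while remaining:
            # every task whose prerequisites have all finished is ready now
            ready = [t for t in remaining if deps.get(t, set()).issubset(done)]
            if not ready:
                raise ValueError("dependency cycle detected")
            futures = {
                t: pool.submit(solve, t, {d: done[d] for d in deps.get(t, set())})
                for t in ready
            }
            for t, fut in futures.items():
                done[t] = fut.result()  # aggregate this wave's results
            remaining -= set(ready)
    return done
```

For the REST API example, auth, CRUD, and schema would run as one concurrent wave, with a merge step waiting on all three.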

&lt;p&gt;For multi-component tasks like "build a REST API with auth, CRUD endpoints, and a PostgreSQL schema," DAG mode decomposes the three components, solves them in parallel, and merges the results. This cut total time from 40s (serial) to ~20s (parallel) for the same quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trade-off:&lt;/strong&gt; DAG mode works well for clearly separable tasks — components with clean interfaces and independent logic. But for tasks that need deep reasoning about a single complex problem, the bottleneck shifts from parallelism to iteration quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nested Multi-Loop: Strategy + Execution
&lt;/h3&gt;

&lt;p&gt;Some problems are too complex for a single loop. Consider "design and implement a distributed task scheduler with leader election, worker pools, and fault tolerance." You need &lt;em&gt;strategic decisions&lt;/em&gt; first (consensus protocol? Raft or Paxos? task distribution model?), &lt;em&gt;then&lt;/em&gt; implementation.&lt;/p&gt;

&lt;p&gt;The Nested Multi-Loop architecture addresses this with two concentric loops:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkriely%2FCLMA%2Fmain%2Fblog%2Fimages%2Fmulti%25E6%2589%25A7%25E8%25A1%258C%25E6%25B5%2581%25E7%25A8%258B.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkriely%2FCLMA%2Fmain%2Fblog%2Fimages%2Fmulti%25E6%2589%25A7%25E8%25A1%258C%25E6%25B5%2581%25E7%25A8%258B.gif" alt="Nested Multi-Loop — outer strategy loop plans the architecture, inner execution loop implements each component." width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Outer loop (strategy):&lt;/strong&gt; Planner → Commander → Producer → Verifier → Evaluator. This loop handles architectural decisions, component decomposition, and high-level design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inner loop (execution):&lt;/strong&gt; Refiner → Reasoner → Solver → Verifier → Evaluator. Runs &lt;em&gt;inside&lt;/em&gt; each component, iterating on implementation quality.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The outer loop's output becomes the inner loop's input for each component. The inner loop's results feed back into the outer loop's Verifier. This hierarchical iteration catches both design-level and implementation-level issues in a single pass.&lt;/p&gt;
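&lt;p&gt;A toy sketch of the two concentric loops, with &lt;code&gt;plan()&lt;/code&gt;, &lt;code&gt;implement()&lt;/code&gt;, and &lt;code&gt;verify_design()&lt;/code&gt; as hypothetical stand-ins for the outer strategy agents, the inner execution loop, and the outer Verifier:&lt;/p&gt;

```python
# A toy version of the two concentric loops. plan(), implement(), and
# verify_design() are hypothetical stand-ins for the outer strategy agents,
# the inner execution loop, and the outer Verifier respectively.

def nested_multi_loop(query, plan, implement, verify_design, max_outer=2):
    results = {}
    for _ in range(max_outer):
        components = plan(query)                         # outer loop: architecture
        results = {c: implement(c) for c in components}  # inner loop per component
        ok, notes = verify_design(results)               # inner results feed outer Verifier
        if ok:
            break
        query = f"{query}\n[design feedback] {notes}"    # re-plan with feedback
    return results
```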

&lt;p&gt;&lt;strong&gt;The pain point:&lt;/strong&gt; Nested Multi-Loop is powerful but slow. A single query can take 40–60 seconds to complete, and the flow graph visualization becomes dense enough to require zoom and pan controls.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adaptive Agent Network: Self-Organizing Topology
&lt;/h3&gt;

&lt;p&gt;The biggest insight came from watching how users actually interacted with the four modes. Most users defaulted to one mode — usually Single Loop — and never switched. The framework had all these execution modes, but no one was using them because &lt;em&gt;choosing the right mode&lt;/em&gt; required understanding the framework's internals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if the framework could choose for you?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's the Adaptive Agent Network (AAN):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkriely%2FCLMA%2Fmain%2Fblog%2Fimages%2Faan%25E6%2589%25A7%25E8%25A1%258C%25E6%25B5%2581%25E7%25A8%258B.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fkriely%2FCLMA%2Fmain%2Fblog%2Fimages%2Faan%25E6%2589%25A7%25E8%25A1%258C%25E6%25B5%2581%25E7%25A8%258B.gif" alt="AAN Mode — the Router Agent analyzes the query and selects the optimal execution topology automatically." width="760" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AAN introduces a &lt;strong&gt;Router Agent&lt;/strong&gt; that runs before execution begins. The Router analyzes the query and picks from four topologies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topology&lt;/th&gt;
&lt;th&gt;When It's Chosen&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Direct&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Trivial queries (effective length &amp;lt; 15 chars, no code intent)&lt;/td&gt;
&lt;td&gt;Single Solver call → score → done. ~2s latency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Most medium-complexity queries&lt;/td&gt;
&lt;td&gt;Refiner → Reasoner → Solver → Verifier → Evaluator, with iterative score feedback (up to 3 rounds).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parallel&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Queries with explicit parallel keywords ("分别" [separately], "both", "multiple")&lt;/td&gt;
&lt;td&gt;Solves modules concurrently, then merges via Integrator.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tree&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Complex architectural queries ("architecture", "system", "subsystem")&lt;/td&gt;
&lt;td&gt;Recursive binary decomposition: splits the problem into sub-problems, solves each leaf independently, merges bottom-up.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Router uses a heuristic-based classifier rather than another LLM call (which would be expensive and defeat the purpose):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;effective_len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cjk_count&lt;/span&gt;  &lt;span class="c1"&gt;# Chinese chars have higher info density
&lt;/span&gt;&lt;span class="n"&gt;has_code_intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;写&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;implement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;build&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...])&lt;/span&gt;
&lt;span class="n"&gt;is_trivial&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;effective_len&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;has_code_intent&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
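&lt;p&gt;Expanded into a runnable topology chooser that follows the table above (keyword lists and the CJK weighting are illustrative approximations, not CLMA's exact values):&lt;/p&gt;

```python
# The heuristic above, expanded into a runnable topology chooser that follows
# the table. Keyword lists and the CJK weighting are illustrative
# approximations, not CLMA's exact values.

def choose_topology(query):
    cjk_count = sum(1 for ch in query if "\u4e00" <= ch <= "\u9fff")
    effective_len = len(query) + cjk_count  # CJK chars carry more information per char
    has_code_intent = any(kw in query for kw in ["写", "implement", "build", "code"])

    if effective_len < 15 and not has_code_intent:
        return "direct"    # trivial query: single Solver call
    if any(kw in query for kw in ["分别", "both", "multiple"]):
        return "parallel"  # explicit parallel intent
    if any(kw in query for kw in ["architecture", "system", "subsystem"]):
        return "tree"      # recursive decomposition
    return "chain"         # default: iterative five-agent chain
```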



&lt;h3&gt;
  
  
  The AAN Chain Breakthrough
&lt;/h3&gt;

&lt;p&gt;The most interesting evolution was the Chain topology itself. Initially, Chain was a single-pass pipeline — run all five agents once, score once, done. No iteration. The thinking was: "If the Router already selected Chain, the query should be straightforward enough for one pass."&lt;/p&gt;

&lt;p&gt;That was wrong.&lt;/p&gt;

&lt;p&gt;Even medium-complexity queries — "build a subset of Photoshop in HTML" — need iterative refinement. The first pass might cover basic features (pen tool, color picker, save) but miss important ones (layers, selection tools, undo/redo). Without iteration, the Verifier's feedback is wasted — the user sees the first attempt, and that's that.&lt;/p&gt;

&lt;p&gt;So Chain evolved too. The current implementation runs the same closed-loop iteration as the original Single Loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Round 1: Refiner → Reasoner → Solver → Verifier → Evaluator → score = 0.66 ❌
Round 2: Refiner (with Verifier feedback) → Reasoner → Solver → Verifier → Evaluator → score = 0.70 ✅ → Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each round passes the previous Verifier's feedback to the Refiner, which uses it to guide the Solver toward specific improvements. The iteration stops when scores meet the threshold or max rounds are reached.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Current Mode Matrix
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Typical Time (small task)&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fast Path&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"hello world", trivial math&lt;/td&gt;
&lt;td&gt;~1–3s&lt;/td&gt;
&lt;td&gt;Greetings, trivial queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Single Loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Function implementations, algorithms&lt;/td&gt;
&lt;td&gt;~8–20s&lt;/td&gt;
&lt;td&gt;Well-defined coding tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DAG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-component features&lt;/td&gt;
&lt;td&gt;~20–45s&lt;/td&gt;
&lt;td&gt;APIs, services, parallel work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Nested Multi-Loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;System architecture&lt;/td&gt;
&lt;td&gt;~45–90s&lt;/td&gt;
&lt;td&gt;Full-stack design + implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AAN&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Whatever you throw at it&lt;/td&gt;
&lt;td&gt;2–90s&lt;/td&gt;
&lt;td&gt;Mixed workloads, adaptive routing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Tip:&lt;/strong&gt; The times above reflect small-to-medium tasks. As project complexity grows — longer code output, deeper iteration loops, more parallel sub-tasks — actual processing time scales proportionally. A large system architecture query under Nested Multi-Loop can take 2–3 minutes, while a simple bug fix under Fast Path resolves in seconds. Choose your architecture mode based on the &lt;em&gt;scope of the task&lt;/em&gt;, not the &lt;em&gt;clock on the wall&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What I Learned About Mode Design
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't make users choose.&lt;/strong&gt; AAN was the most recent addition for a reason — it took seeing users stick to one mode before realizing that choice itself is a UX failure. The framework should infer what the user needs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Iteration &amp;gt; Parallelism for quality.&lt;/strong&gt; DAG's parallel execution cuts wall-clock time, but it's the Single Loop's iterative feedback that actually improves output quality. The best combination is both — AAN Chain's closed-loop iteration with DAG's parallel sub-task execution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Benchmark everything.&lt;/strong&gt; Without a scoring system, you can't tell whether a new mode is actually better. The three-dimensional score (reasonableness × executability × satisfaction) made it possible to compare modes quantitatively.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Part 3: The Web UI Deep Dive — Making Multi-Agent Execution Visible
&lt;/h2&gt;

&lt;p&gt;A multi-agent system is inherently invisible. The user types a query, agents talk to each other inside the framework, and minutes later an answer appears. But &lt;em&gt;what happened in between?&lt;/em&gt; Which agent ran? What did it produce? Did the framework iterate? Why did it take so long?&lt;/p&gt;

&lt;p&gt;Without this visibility, multi-agent systems are black boxes — and developers (rightly) distrust black boxes.&lt;/p&gt;

&lt;p&gt;The CLMA Web UI was designed from day one to answer one question: &lt;strong&gt;"What is the framework doing right now?"&lt;/strong&gt; Every agent action, every score change, every iteration is streamed to the browser in real-time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5f4c3mug6xse2ugls3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5f4c3mug6xse2ugls3v.png" alt="CLMA Web UI — Dark theme with live execution flow graph, score gauge, and session management sidebar." width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcioii2pla560no9nh62a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcioii2pla560no9nh62a.png" alt="Day mode — one-click theme toggle inverts the interface for well-lit environments." width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  SSE-Driven Architecture
&lt;/h3&gt;

&lt;p&gt;The Web UI uses Server-Sent Events (SSE) rather than WebSockets or polling. Why SSE?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unidirectional is enough.&lt;/strong&gt; The browser only receives events; it never needs to send commands to the backend mid-stream. SSE is simpler than WebSockets for this use case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard HTTP.&lt;/strong&gt; No special server support needed — Flask's streaming responses work out of the box.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic reconnection.&lt;/strong&gt; Browsers natively reconnect dropped SSE connections, which matters when backend processing can take 40+ seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The event stream carries typed payloads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;event: agent_start
data: {"agent": "solver", "agent_label": "Solver", "iteration": 1, "timestamp": ...}

event: agent_complete
data: {"agent": "solver", "content_preview": "...", "duration_ms": 2340, ...}

event: iteration
data: {"iteration": 2, "scores": {"reasonableness": 0.75, ...}, ...}

event: done
data: {"result": {"content": "...", "score": {...}}, ...}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each event type triggers a different UI update — no polling, no manual refresh, no "waiting for response" spinner that tells you nothing.&lt;/p&gt;
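&lt;p&gt;The wire format behind those payloads is simple enough to sketch. &lt;code&gt;format_sse()&lt;/code&gt; below is an illustrative helper, not CLMA's code; a Flask streaming response would yield these strings with mimetype &lt;code&gt;text/event-stream&lt;/code&gt;:&lt;/p&gt;

```python
# The text/event-stream wire format behind the typed payloads above.
# format_sse() is an illustrative helper, not CLMA's code; a Flask streaming
# response would yield these strings with mimetype "text/event-stream".
import json

def format_sse(event, payload):
    """One typed event: an 'event:' line, a 'data:' line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"

frame = format_sse("agent_start", {"agent": "solver", "iteration": 1})
# frame == 'event: agent_start\ndata: {"agent": "solver", "iteration": 1}\n\n'
```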

&lt;h3&gt;
  
  
  Score Gauge
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkzs519gclvvus59zg0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkzs519gclvvus59zg0p.png" alt="The score gauge consolidates three evaluation dimensions into one visual readout. Green = passing (≥0.7), yellow = marginal, red = needs improvement." width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The score gauge is the most-watched element in the UI. It shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Overall score&lt;/strong&gt; as a large circular gauge (green/yellow/red)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three sub-scores&lt;/strong&gt; as animated bars (reasonableness, executability, satisfaction)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iteration count&lt;/strong&gt; showing which round we're on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gauge animates on each score update, giving a visceral sense of "getting better" as the framework iterates. This was a deliberate design choice — seeing the needle move from 0.66 to 0.70 across iterations builds trust in the process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Execution Timeline &amp;amp; Flow Graph
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0dkgsdjq3st16ljdn42.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0dkgsdjq3st16ljdn42.png" alt="After execution completes, the output panel shows the final code with syntax highlighting, execution results, and timing information." width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flow graph is rendered as an inline SVG that updates in real-time as agents complete. Each agent appears as a node with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent name and icon&lt;/strong&gt; (Solver → 🛠, Verifier → 🔍, Evaluator → 📊)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duration&lt;/strong&gt; — how long this agent took&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status&lt;/strong&gt; — running, completed, or failed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token usage&lt;/strong&gt; — prompt/completion tokens for this call&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the mode is AAN (Adaptive Agent Network), the flow graph adapts its layout based on the Router's topology decision — showing a single node for Direct mode, a linear chain for Chain mode, parallel branches for Parallel mode, and a recursive tree for Tree mode.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session Management Sidebar
&lt;/h3&gt;

&lt;p&gt;The sidebar lists all past sessions, grouped by date:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click any session to reload its full output and scores&lt;/li&gt;
&lt;li&gt;Sessions show query preview, score summary, and mode used&lt;/li&gt;
&lt;li&gt;Today's sessions have a separate summary (total queries, completions, tokens)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This turns CLMA into a persistent workspace rather than a one-shot chatbot. You can compare how different modes handled the same query, revisit past iterations, and track scoring trends over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration Panels
&lt;/h3&gt;

&lt;p&gt;CLMA supports deep runtime configuration without restarting the server. Three settings panels are accessible from the UI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffef2wfiklu516kyxt01s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffef2wfiklu516kyxt01s.png" alt="API Configuration panel — switch between 5+ LLM providers, configure API keys, base URLs, and model selection. Zero-downtime switching." width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Configuration&lt;/strong&gt; lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Switch between OpenAI, Anthropic, DeepSeek, Gemini, and local models at runtime&lt;/li&gt;
&lt;li&gt;Configure API keys, base URLs, and model names&lt;/li&gt;
&lt;li&gt;Test the connection before submitting queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiz1m108nv01bqp96io47.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiz1m108nv01bqp96io47.png" alt="Rules Configuration — YAML-based rule engine that customizes how the framework interprets different query types." width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rules Configuration&lt;/strong&gt; exposes the C++ rule engine's patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define custom validation methods per query type&lt;/li&gt;
&lt;li&gt;Configure automatic code execution triggers&lt;/li&gt;
&lt;li&gt;Set sandbox tiering rules by language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uv66f25m1chbk9s3v6p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3uv66f25m1chbk9s3v6p.png" alt="Tools Configuration — enable/disable execution environments (Python, C++, Shell, Node.js) and set sandbox timeout limits." width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools Configuration&lt;/strong&gt; manages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which execution environments are enabled (Python, C++, Shell, Node.js)&lt;/li&gt;
&lt;li&gt;Sandbox timeout limits&lt;/li&gt;
&lt;li&gt;Token budget and max iterations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Theme System
&lt;/h3&gt;

&lt;p&gt;The UI defaults to dark mode (designed for late-night coding sessions), with a one-click toggle to invert to light mode. The toggle uses a CSS filter approach — &lt;code&gt;invert(1) hue-rotate(180deg)&lt;/code&gt; — which works across all elements without needing separate light/dark CSS variables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5f4c3mug6xse2ugls3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5f4c3mug6xse2ugls3v.png" alt="Dark mode vs light mode side-by-side comparison. No refresh needed, instant toggle." width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Design Philosophy: Visibility Builds Trust
&lt;/h3&gt;

&lt;p&gt;The most important design lesson from building the CLMA UI: &lt;strong&gt;users trust systems they can watch.&lt;/strong&gt; A system that produces output in a black box is always suspect — no matter how good the output is. A system that shows every step, every agent, every score change, and every iteration builds confidence through transparency.&lt;/p&gt;

&lt;p&gt;When a user sees:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Router analyzes the query → decides it's a Chain topology&lt;/li&gt;
&lt;li&gt;Refiner restructures the task&lt;/li&gt;
&lt;li&gt;Solver generates 1,200 lines of code in 4.2 seconds&lt;/li&gt;
&lt;li&gt;Verifier identifies 3 potential issues&lt;/li&gt;
&lt;li&gt;Score = 0.66 → below threshold → iterating&lt;/li&gt;
&lt;li&gt;Second pass addresses all 3 issues&lt;/li&gt;
&lt;li&gt;Score = 0.92 → passing → done&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;...they don't just trust the output more. They understand &lt;em&gt;why&lt;/em&gt; the framework made the decisions it did. And when something goes wrong, they know where to look.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: Lessons Learned, Mistakes Made, and What's Next
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Hardest Lessons
&lt;/h3&gt;

&lt;p&gt;After months of building CLMA, here are the things I wish I'd known from day one.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. "Agent" is an abstraction leak, not a solution
&lt;/h4&gt;

&lt;p&gt;The term "agent" sounds sophisticated, but it's dangerously vague. In CLMA, an agent is just a prompt template + a context builder + an LLM call. There's no persistent state, no tool use, no memory (in the agentic sense). The framework orchestrates these calls, not the agents themselves.&lt;/p&gt;

&lt;p&gt;I spent weeks early on designing elaborate agent communication protocols (who talks to whom? how do they share context? what if an agent goes rogue?) before realizing that &lt;strong&gt;the simplest architecture was the right one:&lt;/strong&gt; linear data flow with structured context injection. The complexity should live in the orchestration, not the agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Don't over-model the agents. Model the data flow between them.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Three scores &amp;gt; one score, but not by much
&lt;/h4&gt;

&lt;p&gt;The three-dimensional scoring (reasonableness, executability, satisfaction) was a late addition — and it was the right call. A single score doesn't tell the Verifier &lt;em&gt;what&lt;/em&gt; to fix. But in practice, two of the three dimensions are highly correlated for code generation tasks: if the code is executable, it's usually reasonable, and vice versa.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I'd do differently:&lt;/strong&gt; Make scoring adaptive. For code generation tasks, weigh executability higher. For design tasks, weigh reasonableness higher. The dimensions should adjust based on the Router's classification, not be fixed weights.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. AAN was the hardest feature to get right
&lt;/h4&gt;

&lt;p&gt;The Adaptive Agent Network sounds elegant in theory — "the framework chooses its own topology!" — but the Router heuristic is fragile. A query like "分别用python和javascript实现排序算法" ("implement a sorting algorithm in Python and JavaScript, respectively") triggers Parallel mode (correctly), but "分别实现python排序和javascript排序" ("implement Python sorting and JavaScript sorting, respectively") triggers...&lt;/p&gt;

&lt;p&gt;The AAN Router has gone through 6 major revisions. It started as a single &lt;code&gt;len(query)&lt;/code&gt; threshold, evolved into keyword matching, then effective length (accounting for Chinese character density), then code-intent detection, and recently added closed-loop iteration to the Chain topology itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AAN Router will never be perfect,&lt;/strong&gt; and that's okay. The design goal isn't perfection — it's "better than always defaulting to Single Loop." Any heuristic that beats the baseline is a win.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Performance is a UX problem, not just an engineering one
&lt;/h4&gt;

&lt;p&gt;The biggest complaint about multi-agent systems is latency. "Why does it take 30 seconds?"&lt;/p&gt;

&lt;p&gt;Early on, I tried to optimize the agents — shorter prompts, single-shot generation, parallel calls. It helped, but not enough. What &lt;em&gt;actually&lt;/em&gt; improved user perception was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time SSE streaming.&lt;/strong&gt; Watching agents complete in sequence makes the wait feel productive, not wasted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Placeholder events.&lt;/strong&gt; The moment the Router decides the topology, the UI shows all agent nodes in the flow graph — even before they start. Users see the full pipeline up front.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token counters.&lt;/strong&gt; Showing token usage per call gives a concrete "here's what you're paying for" sense of progress.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Users will wait 40 seconds if they can see progress. They won't wait 10 seconds in a black box.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Testing multi-agent systems is qualitatively different
&lt;/h4&gt;

&lt;p&gt;Unit-testing a single LLM call is straightforward — assert the output format, check for common failure modes, replay with fixed seeds. Testing a 5-agent pipeline with iterative feedback loops is a different beast.&lt;/p&gt;

&lt;p&gt;Categories of bugs I encountered:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;How We Catch It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent B reads Agent A's output from a &lt;em&gt;previous&lt;/em&gt; query&lt;/td&gt;
&lt;td&gt;C++ session_id isolation + Python memory reset per query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Context leaks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Similar experiences from unrelated queries pollute the Solver's prompt&lt;/td&gt;
&lt;td&gt;Separate context builders per agent, with assert statements for placeholder keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Template drift&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One agent's prompt template adds a &lt;code&gt;{placeholder}&lt;/code&gt; that doesn't exist in context&lt;/td&gt;
&lt;td&gt;Automated script that extracts placeholders from all templates and validates them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cancellation race&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User cancels mid-stream, but the next agent starts anyway&lt;/td&gt;
&lt;td&gt;Shared &lt;code&gt;_stream_cancelled&lt;/code&gt; flag checked before every LLM call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Score oscillation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Round 2 scores better than Round 1, but Round 3 scores worse&lt;/td&gt;
&lt;td&gt;Track &lt;code&gt;best_score&lt;/code&gt; across iterations, not just the last score&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
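&lt;p&gt;The cancellation-race fix from the table can be sketched with a shared &lt;code&gt;threading.Event&lt;/code&gt;. A simplified sketch, not the framework's actual code — &lt;code&gt;run_agents&lt;/code&gt; and the fake LLM are illustrative, only the &lt;code&gt;_stream_cancelled&lt;/code&gt; flag name comes from the table:&lt;/p&gt;

```python
import threading

_stream_cancelled = threading.Event()  # shared flag, set by the cancel endpoint

def run_agents(agents, call_llm):
    """Check the shared cancellation flag before every LLM call, so a user
    cancelling mid-stream stops the pipeline before the next agent starts."""
    results = []
    for agent in agents:
        if _stream_cancelled.is_set():
            break  # stop before launching the next agent
        results.append(call_llm(agent))
    return results

# Toy run: the fake LLM sets the flag as a side effect, simulating a user
# cancelling while the first agent is still streaming.
calls = []
def fake_llm(agent):
    calls.append(agent)
    _stream_cancelled.set()
    return f"{agent}: ok"

out = run_agents(["solver", "reviewer", "fixer"], fake_llm)
```

&lt;p&gt;Checking before the call (rather than after) is the part that closes the race: a cancellation that lands between two agents never starts the next one.&lt;/p&gt;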

&lt;p&gt;&lt;strong&gt;The biggest practical win:&lt;/strong&gt; prompt-level validation. Every agent's context template is checked for &lt;code&gt;{placeholder}&lt;/code&gt; keys before execution. Missing keys are filled with empty strings rather than crashing, but the mismatch is logged. This single check caught more bugs than all the integration tests combined.&lt;/p&gt;
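&lt;p&gt;In Python, that check falls out of the standard library's &lt;code&gt;string.Formatter&lt;/code&gt;. A minimal sketch of the idea — fill missing keys with empty strings, log the mismatch — with a hypothetical &lt;code&gt;render_template&lt;/code&gt; helper:&lt;/p&gt;

```python
import logging
from string import Formatter

def render_template(template: str, context: dict) -> str:
    """Fill a prompt template, substituting empty strings for missing
    {placeholder} keys instead of raising KeyError, and log each mismatch."""
    # Formatter().parse yields (literal, field_name, spec, conversion) tuples.
    keys = {name for _, name, _, _ in Formatter().parse(template) if name}
    missing = keys - context.keys()
    for key in missing:
        logging.warning("template placeholder %r has no value in context", key)
    safe = {**{k: "" for k in missing}, **context}
    return template.format(**safe)

# {history} is absent from the context: it is logged and rendered as "".
prompt = render_template("Task: {task}\nPrior attempts: {history}", {"task": "sort a list"})
```

&lt;p&gt;Running the same extraction over every template offline is what turns this runtime guard into the validation script from the table above.&lt;/p&gt;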

&lt;h3&gt;
  
  
  What I'd Do Differently
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Build the Router first.&lt;/strong&gt; AAN should have been the default from day one. The explicit mode selection UI (Fast Path / Single Loop / DAG / Multi-Loop) was useful for debugging but harmful for user experience. Users don't want to think about execution modes. They want to type a query and get a good result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Instrument everything from the beginning.&lt;/strong&gt; The token monitor, duration tracker, and scoring system were added reactively, in response to user complaints ("why did it take so long?" "why is this score so low?"). If I'd built the measurement infrastructure first, I would have caught several design flaws months earlier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Use a single LLM provider for development.&lt;/strong&gt; Switching between OpenAI, Anthropic, DeepSeek, and local models during development introduced confounding variables. Behavioral differences between providers (prompt sensitivity, JSON output format, refusal patterns) made it hard to isolate bugs in the framework itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Ship the CLI first, Web UI second.&lt;/strong&gt; The Flask Web UI is useful and visually compelling, but it adds a dependency layer that complicates setup. A CLI-first approach would have let early users try CLMA with zero configuration and provided faster feedback cycles.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where CLMA Goes Next
&lt;/h3&gt;

&lt;p&gt;The framework is actively used for personal projects, but there's plenty of room to grow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Near-term (next few months):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-turn conversations&lt;/strong&gt; — currently, each query is stateless. The next version will support follow-up queries with access to the session history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved AAN Router&lt;/strong&gt; — move from heuristic rules to a lightweight classifier (a small LLM call or an embedding-based model) for more accurate topology selection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sandbox expansion&lt;/strong&gt; — Java, Go, Rust execution environments via Docker containers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Medium-term:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plugin system&lt;/strong&gt; — the C++ PluginManager exists but needs better documentation and a curated registry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed execution&lt;/strong&gt; — multi-machine agent orchestration for very large tasks (entire repository generation).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic experience storage&lt;/strong&gt; — successful query-solution pairs are already saved; the next step is automatic retrieval and reuse for similar queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Long-term vision:&lt;/strong&gt;&lt;br&gt;
CLMA is a step toward &lt;strong&gt;self-improving code generation&lt;/strong&gt; — a system that not only generates and verifies code, but learns from its successes and failures to generate better code over time. The experience store, scoring system, and iterative feedback loop are the foundational pieces. The next step is connecting them into a continuous learning cycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;Building CLMA taught me that &lt;strong&gt;the bottleneck in code generation is not generation — it's verification.&lt;/strong&gt; Every LLM can produce plausible-looking code. The hard part is knowing whether it's &lt;em&gt;actually&lt;/em&gt; correct, and what to do about it when it isn't.&lt;/p&gt;

&lt;p&gt;The closed-loop approach works because it mirrors how good developers work:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Write a draft&lt;/li&gt;
&lt;li&gt;Review it critically&lt;/li&gt;
&lt;li&gt;Fix the problems&lt;/li&gt;
&lt;li&gt;Repeat until it's good enough&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The CLMA framework just automates this process — and makes it visible, measurable, and improvable.&lt;/p&gt;
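&lt;p&gt;The four steps above can be sketched as a short loop. A toy sketch, not CLMA's implementation — &lt;code&gt;generate&lt;/code&gt;, &lt;code&gt;review&lt;/code&gt;, and &lt;code&gt;fix&lt;/code&gt; stand in for real agents, and it also shows the &lt;code&gt;best_score&lt;/code&gt; tracking from the score-oscillation row earlier:&lt;/p&gt;

```python
def closed_loop(generate, review, fix, threshold=0.9, max_rounds=4):
    """Sketch of the draft -> review -> fix loop, keeping the best-scoring
    draft across rounds so a worse Round 3 never overwrites a better Round 2."""
    draft = generate()
    best_draft, best_score = draft, review(draft)
    for _ in range(max_rounds - 1):
        if best_score >= threshold:
            break  # good enough: stop iterating
        draft = fix(best_draft)
        score = review(draft)
        if score > best_score:  # track best_score, not just the last score
            best_draft, best_score = draft, score
    return best_draft, best_score

# Toy run with oscillating scores (0.5, then 0.8, then 0.6):
# the 0.8 draft survives as the best result.
scores = iter([0.5, 0.8, 0.6])
result = closed_loop(lambda: "v1", lambda d: next(scores),
                     lambda d: d + "+fix", max_rounds=3)
```

&lt;p&gt;Fixing from &lt;code&gt;best_draft&lt;/code&gt; rather than the last draft is the design choice that makes oscillation harmless: a bad round costs one iteration, not the accumulated progress.&lt;/p&gt;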

&lt;p&gt;If this series inspired you to think differently about LLM-generated code, or if you have ideas for making CLMA better, I'd love to hear from you. Open an issue, submit a PR, or just star the repo — it all helps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/kriely/CLMA" rel="noopener noreferrer"&gt;github.com/kriely/CLMA&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Built with Hermes &amp;amp; DeepSeek.&lt;/strong&gt; Every line of CLMA — from the C++17 DAG engine to the SVG gauges in the Web UI — was written with the help of Hermes (my AI agent companion) running on DeepSeek's API. I'm a developer with ideas, not a big team with a budget. Hermes and DeepSeek are the tools that let me ship those ideas.&lt;br&gt;
&lt;em&gt;Because ideas shouldn't wait for the perfect stack — they should just be built.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Tags: #LLM #MultiAgent #CodeGeneration #OpenSource #SystemDesign #WebUI #SSE #DeepSeek&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
