Robin King

CLMA vs Web Chat: Putting Iterative Verification to the Test

Posted on May 6, 2026 · #CLMA #MultiAgent #CodeGeneration #EventSourcing #Comparison #Python

All code is open source on GitHub: github.com/kriely/CLMA


This is a companion piece to Building CLMA: A Self-Verifying Multi-Agent Framework from Scratch. In that article, I described the framework. Here, I put it to the test: head to head against a plain web chat, same model, same problems.


The Setup

Same LLM (DeepSeek) tasked with writing the same code. No human intervention on either side. Two questions:

  • Q1 — Thread-safe bounded blocking queue (put/get with timeout)
  • Q5 — Event sourcing framework for a bank account (events, replay, serialization, optimistic concurrency, business rules, freeze/unfreeze)

For Q5, the CLMA version went through 3 automated iteration rounds (Solver → Verifier → Refiner → Verifier → Refiner → Verifier → Evaluator). The web chat version was a single-shot output.
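The iteration loop above can be sketched as plain Python. This is illustrative only: the function names and signatures are my own shorthand, not CLMA's real API.

```python
from typing import Callable, List

def run_pipeline(
    task: str,
    solve: Callable[[str], str],
    verify: Callable[[str, str], List[str]],
    refine: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Solver draft, then up to max_rounds of Verifier -> Refiner."""
    code = solve(task)                       # Solver: initial draft
    for _ in range(max_rounds):
        issues = verify(task, code)          # Verifier: list of gaps found
        if not issues:
            break                            # clean report: stop early
        code = refine(task, code, issues)    # Refiner: address the feedback
    return code
```

The web chat run is equivalent to `max_rounds = 0`: whatever the Solver emits first is the final answer.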


Q1: Bounded Blocking Queue

Both implementations passed all 12 test cases — basic put/get, blocking/unblocking behavior, timeout, edge cases (maxsize=1, maxsize=0), queue state queries, and invalid capacity.

12/12 pass for both. On the surface, a draw. But the engineering quality tells a different story.
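To make the test categories concrete, here is roughly what two of the cases look like, sketched against the stdlib `queue.Queue`, which exposes the same `put`/`get`-with-timeout surface as both generated classes. The real suite lives in `test_compare.py`.

```python
import queue
import threading
import time

q = queue.Queue(maxsize=1)
q.put("a")

# Case: put on a full queue with a timeout must raise Full
try:
    q.put("b", timeout=0.05)
    timed_out = False
except queue.Full:
    timed_out = True

# Case: a consumer on another thread unblocks a waiting producer
def consume():
    time.sleep(0.05)
    q.get()          # drains "a", making room

t = threading.Thread(target=consume)
t.start()
q.put("b", timeout=1.0)   # succeeds once the consumer drains one item
t.join()
```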

CLMA Version (1.py)

```python
# Two separate Conditions — put and get don't contend
self.not_empty = threading.Condition(self._lock)
self.not_full = threading.Condition(self._lock)
```
```python
# time.monotonic() — immune to system clock adjustments
remaining = timeout
while self.full():
    if remaining is not None:
        if remaining <= 0:
            raise Full
        start = time.monotonic()
        self.not_full.wait(remaining)
        remaining -= time.monotonic() - start
    else:
        self.not_full.wait()
```

Web Chat Version (2.py)

```python
# Single Condition — functional but suboptimal
self.cond = threading.Condition()
```
```python
# time.time() — affected by system clock changes
deadline = time.time() + timeout
while self.full():
    remaining = deadline - time.time()
    if remaining <= 0:
        raise QueueFull
    self.cond.wait(timeout=remaining)
```

Key Differences

| Aspect | CLMA | Web Chat |
| --- | --- | --- |
| Conditions | 2 (`not_empty` / `not_full`): put/get don't contend | 1: `notify()` may wake the wrong waiter |
| Clock | `time.monotonic()`: immune to NTP adjustments | `time.time()`: affected by system clock changes |
| Timeout precision | Exact decrement per loop iteration | Deadline calculated once |
| Exception names | `Full`, `Empty` (concise) | `QueueFull`, `QueueEmpty` (verbose) |
| Edge case | Handles `timeout < 0` defensively | No check for negative timeout |
| Comments | English | Chinese |

Verdict: Both pass all tests, but CLMA's design is more robust for high-concurrency scenarios. Two Conditions prevent head-of-line blocking between producers and consumers. time.monotonic() avoids a real-world bug class (NTP jumps causing premature or delayed timeouts). The difference matters under load, not in a single-threaded test.
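For reference, here is a condensed sketch of what the two CLMA choices look like together in one class: separate Conditions sharing one lock, plus a monotonic-clock deadline. This is my own minimal reconstruction, not the generated `1.py`.

```python
import threading
import time

class Full(Exception): pass
class Empty(Exception): pass

class BoundedQueue:
    """Minimal sketch: two Conditions over one shared lock, monotonic timeouts."""

    def __init__(self, maxsize: int):
        if maxsize <= 0:
            raise ValueError("maxsize must be positive")
        self._maxsize = maxsize
        self._items = []
        self._lock = threading.Lock()
        self.not_empty = threading.Condition(self._lock)
        self.not_full = threading.Condition(self._lock)

    def put(self, item, timeout=None):
        with self.not_full:
            deadline = None if timeout is None else time.monotonic() + timeout
            while len(self._items) >= self._maxsize:
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    raise Full
                self.not_full.wait(remaining)
            self._items.append(item)
            self.not_empty.notify()   # wakes a consumer, never another producer

    def get(self, timeout=None):
        with self.not_empty:
            deadline = None if timeout is None else time.monotonic() + timeout
            while not self._items:
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    raise Empty
                self.not_empty.wait(remaining)
            item = self._items.pop(0)
            self.not_full.notify()    # wakes a producer, never another consumer
            return item
```

Because `put` only ever notifies `not_empty` and `get` only ever notifies `not_full`, a wakeup always targets a thread that can actually make progress; with a single Condition, `notify()` can wake a producer when only producers are waiting on a full queue.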


Q5: Event Sourcing Framework

This is where the gap opens wider. Both implement an event-sourced bank account with:

  • Events: account opened, deposited, withdrawn, frozen
  • Event store with optimistic concurrency control
  • Event replay (rebuild aggregate state from history)
  • Serialization / deserialization
  • Business rules: no negative deposits, no over-withdrawal, no withdrawal on frozen account
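The optimistic concurrency piece is the least familiar item on that list, so here is a hypothetical sketch of the check both versions implement. The class and method names are illustrative, not taken from either generated file: an append fails unless the caller's expected version matches the stream length, i.e. unless it read the latest state.

```python
class ConcurrencyError(Exception):
    pass

class InMemoryEventStore:
    """Illustrative store: one event list per aggregate, version = list length."""

    def __init__(self):
        self._streams = {}   # aggregate_id -> list of events

    def append(self, aggregate_id, events, expected_version):
        stream = self._streams.setdefault(aggregate_id, [])
        if len(stream) != expected_version:
            # Another writer appended since this caller loaded the aggregate
            raise ConcurrencyError(
                f"expected version {expected_version}, store is at {len(stream)}"
            )
        stream.extend(events)

    def load(self, aggregate_id):
        return list(self._streams.get(aggregate_id, []))
```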

CLMA Version (4.py) — After 3 Iterations

The automated Verifier caught two things the initial output missed:

Round 1 → Round 2: "Where's the Unfrozen event? A frozen account can never be unfrozen — this is incomplete."

Round 2 → Round 3: "The freeze implementation blocks withdrawals, but should it also block deposits? This is a business policy decision — document it explicitly."

Result — CLMA adds the Unfrozen event:

```python
class Unfrozen(Event):
    def __init__(self, aggregate_id: str, reason: str = "",
                 event_id=None, timestamp=None):
        super().__init__(aggregate_id, event_id, timestamp)
        self.reason = reason
```

And the BankAccount handles it properly:

```python
def _apply(self, event: Event) -> None:
    if isinstance(event, Deposited):       self.balance += event.amount
    elif isinstance(event, Withdrawn):     self.balance -= event.amount
    elif isinstance(event, Frozen):        self.is_frozen = True
    elif isinstance(event, Unfrozen):      self.is_frozen = False  # ← added after Verifier feedback
    else: raise ValueError(...)
```

Web Chat Version (3.py) — Single Shot

The web version has a clean architecture — proper Event base class, register_event decorator, payload() abstraction, serialization round-trip. But it has no Unfrozen event.

```python
@register_event
class AccountFrozen(Event):
    def __init__(self, aggregate_id: str): ...
    # ... no Unfrozen counterpart exists
```

The freeze() method works, but there's no unfreeze(). Once frozen, the account stays frozen forever.

The Test Results

| Category | CLMA | Web Chat |
| --- | --- | --- |
| Event basics (IDs, timestamps, types) | ✅ | ✅ |
| Serialization / deserialization | ✅ | ✅ |
| Event replay (deposit 100+50, withdraw 30 = 120) | ✅ | ✅ |
| Business rules (no negative, no overdraft) | ✅ | ✅ |
| Freeze → reject withdrawal | ✅ | ✅ |
| Unfreeze → allow operations again | ✅ | ❌ Missing |
| Optimistic concurrency | ✅ | ✅ |

Both frameworks pass all standard event sourcing tests. But the missing Unfrozen event in the web chat version is not a cosmetic issue — it's a domain modeling gap. In any real banking system, frozen accounts need a thaw mechanism.
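The replay test in the table reduces to a left fold over the event stream. A hypothetical sketch with illustrative event classes (both generated versions implement the same idea inside their aggregates):

```python
from dataclasses import dataclass

@dataclass
class Deposited:
    amount: int

@dataclass
class Withdrawn:
    amount: int

def replay(events):
    """Rebuild account balance purely from history."""
    balance = 0
    for e in events:
        if isinstance(e, Deposited):
            balance += e.amount
        elif isinstance(e, Withdrawn):
            balance -= e.amount
    return balance

replay([Deposited(100), Deposited(50), Withdrawn(30)])   # -> 120
```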

Why CLMA Found It

The third iteration round is where the value shows. The Verifier's feedback was:

"The freeze flow is incomplete. Freezing is an operation that must be reversible. Consider adding an Unfrozen event and updating the aggregate to apply it."

A human reviewer would spot this too. But the CLMA Verifier catches it automatically, in seconds, with no developer in the loop. That is the difference between code review as a process and code review as a one-off prompt.


What This Means

| | Q1 (Blocking Queue) | Q5 (Event Sourcing) |
| --- | --- | --- |
| CLMA | 12/12 ✅, better design | Full feature set ✅ |
| Web Chat | 12/12 ✅, usable but less robust | Missing `Unfrozen` event ❌ |

For simple, well-defined problems (Q1), a single-shot chat prompt gets you 90% of the way. The CLMA advantage is marginal — better engineering choices, but the output is functionally equivalent.

For complex, multi-faceted problems (Q5) where completeness matters — domain events, edge cases, business rules — the iterative verification loop earns its keep. The 3 rounds of automated review caught a real domain modeling gap that a single prompt missed. Not because the LLM couldn't write an Unfrozen event, but because no single prompt can anticipate all the completeness conditions of a non-trivial domain.

The pattern is clear: Generation quality is already good. Verification quality is where the gap is. And verification is exactly what CLMA automates.


Files

| File | Description |
| --- | --- |
| `1.py` | CLMA: bounded blocking queue |
| `2.py` | Web chat: bounded blocking queue |
| `3.py` | Web chat: event sourcing framework |
| `4.py` | CLMA: event sourcing framework (3 iterations) |
| `test_compare.py` | Q1 test suite: 12 cases, run against both |
| `test_q5_compare.py` | Q5 test suite: auto-detects class names |

All comparison files are in the CLMA repository.


