Robin King

CLMA vs Web Chat: Putting Iterative Verification to the Test

Posted on May 6, 2026 · #CLMA #MultiAgent #CodeGeneration #EventSourcing #Comparison #Python

All code is open source on GitHub: github.com/kriely/CLMA


This is a companion piece to Building CLMA: A Self-Verifying Multi-Agent Framework from Scratch. In that article, I described the framework. Here, I put it to the test: head to head against a plain web chat, same model, same problems.


The Setup

Same LLM (DeepSeek) tasked with writing the same code. No human intervention on either side. Two questions:

  • Q1 — Thread-safe bounded blocking queue (put/get with timeout)
  • Q5 — Event sourcing framework for a bank account (events, replay, serialization, optimistic concurrency, business rules, freeze/unfreeze)

For Q5, the CLMA version went through 3 automated iteration rounds (Solver → Verifier → Refiner → Verifier → Refiner → Verifier → Evaluator). The web chat version was a single-shot output.
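The iteration loop above can be sketched as plain Python. This is illustrative only: the function names and signatures are my own shorthand, not CLMA's real API.

```python
from typing import Callable, List

def run_pipeline(
    task: str,
    solve: Callable[[str], str],
    verify: Callable[[str, str], List[str]],
    refine: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Solver draft, then up to max_rounds of Verifier -> Refiner."""
    code = solve(task)                       # Solver: initial draft
    for _ in range(max_rounds):
        issues = verify(task, code)          # Verifier: list of gaps found
        if not issues:
            break                            # clean report: stop early
        code = refine(task, code, issues)    # Refiner: address the feedback
    return code
```

The web chat run is equivalent to `max_rounds = 0`: whatever the Solver emits first is the final answer.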


Q1: Bounded Blocking Queue

Both implementations passed all 12 test cases — basic put/get, blocking/unblocking behavior, timeout, edge cases (maxsize=1, maxsize=0), queue state queries, and invalid capacity.

12/12 pass for both. On the surface, a draw. But the engineering quality tells a different story.
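To make the test categories concrete, here is roughly what two of the cases look like, sketched against the stdlib `queue.Queue`, which exposes the same `put`/`get`-with-timeout surface as both generated classes. The real suite lives in `test_compare.py`.

```python
import queue
import threading
import time

q = queue.Queue(maxsize=1)
q.put("a")

# Case: put on a full queue with a timeout must raise Full
try:
    q.put("b", timeout=0.05)
    timed_out = False
except queue.Full:
    timed_out = True

# Case: a consumer on another thread unblocks a waiting producer
def consume():
    time.sleep(0.05)
    q.get()          # drains "a", making room

t = threading.Thread(target=consume)
t.start()
q.put("b", timeout=1.0)   # succeeds once the consumer drains one item
t.join()
```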

CLMA Version (1.py)

```python
# Two separate Conditions — put and get don't contend
self.not_empty = threading.Condition(self._lock)
self.not_full = threading.Condition(self._lock)
```
```python
# time.monotonic() — immune to system clock adjustments
remaining = timeout
while self.full():
    if remaining is not None:
        if remaining <= 0:
            raise Full
        start = time.monotonic()
        self.not_full.wait(remaining)
        remaining -= time.monotonic() - start
    else:
        self.not_full.wait()
```

Web Chat Version (2.py)

```python
# Single Condition — functional but suboptimal
self.cond = threading.Condition()
```
```python
# time.time() — affected by system clock changes
deadline = time.time() + timeout
while self.full():
    remaining = deadline - time.time()
    if remaining <= 0:
        raise QueueFull
    self.cond.wait(timeout=remaining)
```

Key Differences

| Aspect | CLMA | Web Chat |
| --- | --- | --- |
| Conditions | 2 (`not_empty` / `not_full`): put/get don't contend | 1: `notify()` may wake the wrong waiter |
| Clock | `time.monotonic()`: immune to NTP adjustments | `time.time()`: affected by system clock changes |
| Timeout precision | Exact decrement per loop iteration | Deadline calculated once |
| Exception names | `Full`, `Empty` (concise) | `QueueFull`, `QueueEmpty` (verbose) |
| Edge case | Handles `timeout < 0` defensively | No check for negative timeout |
| Comments | English | Chinese |

Verdict: Both pass all tests, but CLMA's design is more robust for high-concurrency scenarios. Two Conditions prevent head-of-line blocking between producers and consumers. time.monotonic() avoids a real-world bug class (NTP jumps causing premature or delayed timeouts). The difference matters under load, not in a single-threaded test.
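For reference, here is a condensed sketch of what the two CLMA choices look like together in one class: separate Conditions sharing one lock, plus a monotonic-clock deadline. This is my own minimal reconstruction, not the generated `1.py`.

```python
import threading
import time

class Full(Exception): pass
class Empty(Exception): pass

class BoundedQueue:
    """Minimal sketch: two Conditions over one shared lock, monotonic timeouts."""

    def __init__(self, maxsize: int):
        if maxsize <= 0:
            raise ValueError("maxsize must be positive")
        self._maxsize = maxsize
        self._items = []
        self._lock = threading.Lock()
        self.not_empty = threading.Condition(self._lock)
        self.not_full = threading.Condition(self._lock)

    def put(self, item, timeout=None):
        with self.not_full:
            deadline = None if timeout is None else time.monotonic() + timeout
            while len(self._items) >= self._maxsize:
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    raise Full
                self.not_full.wait(remaining)
            self._items.append(item)
            self.not_empty.notify()   # wakes a consumer, never another producer

    def get(self, timeout=None):
        with self.not_empty:
            deadline = None if timeout is None else time.monotonic() + timeout
            while not self._items:
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    raise Empty
                self.not_empty.wait(remaining)
            item = self._items.pop(0)
            self.not_full.notify()    # wakes a producer, never another consumer
            return item
```

Because `put` only ever notifies `not_empty` and `get` only ever notifies `not_full`, a wakeup always targets a thread that can actually make progress; with a single Condition, `notify()` can wake a producer when only producers are waiting on a full queue.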


Q5: Event Sourcing Framework

This is where the gap opens wider. Both implement an event-sourced bank account with:

  • Events: account opened, deposited, withdrawn, frozen
  • Event store with optimistic concurrency control
  • Event replay (rebuild aggregate state from history)
  • Serialization / deserialization
  • Business rules: no negative deposits, no over-withdrawal, no withdrawal on frozen account
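The optimistic concurrency piece is the least familiar item on that list, so here is a hypothetical sketch of the check both versions implement. The class and method names are illustrative, not taken from either generated file: an append fails unless the caller's expected version matches the stream length, i.e. unless it read the latest state.

```python
class ConcurrencyError(Exception):
    pass

class InMemoryEventStore:
    """Illustrative store: one event list per aggregate, version = list length."""

    def __init__(self):
        self._streams = {}   # aggregate_id -> list of events

    def append(self, aggregate_id, events, expected_version):
        stream = self._streams.setdefault(aggregate_id, [])
        if len(stream) != expected_version:
            # Another writer appended since this caller loaded the aggregate
            raise ConcurrencyError(
                f"expected version {expected_version}, store is at {len(stream)}"
            )
        stream.extend(events)

    def load(self, aggregate_id):
        return list(self._streams.get(aggregate_id, []))
```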

CLMA Version (4.py) — After 3 Iterations

The automated Verifier caught two things the initial output missed:

Round 1 → Round 2: "Where's the Unfrozen event? A frozen account can never be unfrozen — this is incomplete."

Round 2 → Round 3: "The freeze implementation blocks withdrawals, but should it also block deposits? This is a business policy decision — document it explicitly."

Result — CLMA adds the Unfrozen event:

```python
class Unfrozen(Event):
    def __init__(self, aggregate_id: str, reason: str = "",
                 event_id=None, timestamp=None):
        super().__init__(aggregate_id, event_id, timestamp)
        self.reason = reason
```

And the BankAccount handles it properly:

```python
def _apply(self, event: Event) -> None:
    if isinstance(event, Deposited):       self.balance += event.amount
    elif isinstance(event, Withdrawn):     self.balance -= event.amount
    elif isinstance(event, Frozen):        self.is_frozen = True
    elif isinstance(event, Unfrozen):      self.is_frozen = False  # ← added after Verifier feedback
    else: raise ValueError(...)
```

Web Chat Version (3.py) — Single Shot

The web version has a clean architecture — proper Event base class, register_event decorator, payload() abstraction, serialization round-trip. But it has no Unfrozen event.

```python
@register_event
class AccountFrozen(Event):
    def __init__(self, aggregate_id: str): ...
    # ... no Unfrozen counterpart exists
```

The freeze() method works, but there's no unfreeze(). Once frozen, the account stays frozen forever.

The Test Results

| Category | CLMA | Web Chat |
| --- | --- | --- |
| Event basics (IDs, timestamps, types) | ✅ | ✅ |
| Serialization / deserialization | ✅ | ✅ |
| Event replay (deposit 100+50, withdraw 30 = 120) | ✅ | ✅ |
| Business rules (no negative, no overdraft) | ✅ | ✅ |
| Freeze → reject withdrawal | ✅ | ✅ |
| Unfreeze → allow operations again | ✅ | ❌ Missing |
| Optimistic concurrency | ✅ | ✅ |

Both frameworks pass all standard event sourcing tests. But the missing Unfrozen event in the web chat version is not a cosmetic issue — it's a domain modeling gap. In any real banking system, frozen accounts need a thaw mechanism.
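The replay test in the table reduces to a left fold over the event stream. A hypothetical sketch with illustrative event classes (both generated versions implement the same idea inside their aggregates):

```python
from dataclasses import dataclass

@dataclass
class Deposited:
    amount: int

@dataclass
class Withdrawn:
    amount: int

def replay(events):
    """Rebuild account balance purely from history."""
    balance = 0
    for e in events:
        if isinstance(e, Deposited):
            balance += e.amount
        elif isinstance(e, Withdrawn):
            balance -= e.amount
    return balance

replay([Deposited(100), Deposited(50), Withdrawn(30)])   # -> 120
```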

Why CLMA Found It

The third iteration round is where the value shows. The Verifier's feedback was:

"The freeze flow is incomplete. Freezing is an operation that must be reversible. Consider adding an Unfrozen event and updating the aggregate to apply it."

A human reviewer would spot this too. But the CLMA Verifier catches it automatically, in seconds, with no developer in the loop. That is the difference between code review as a process and code review as a one-off prompt.


What This Means

| | Q1 (Blocking Queue) | Q5 (Event Sourcing) |
| --- | --- | --- |
| CLMA | 12/12 ✅, better design | Full feature set ✅ |
| Web Chat | 12/12 ✅, usable but less robust | Missing `Unfrozen` event ❌ |

For simple, well-defined problems (Q1), a single-shot chat prompt gets you 90% of the way. The CLMA advantage is marginal — better engineering choices, but the output is functionally equivalent.

For complex, multi-faceted problems (Q5) where completeness matters — domain events, edge cases, business rules — the iterative verification loop earns its keep. The 3 rounds of automated review caught a real domain modeling gap that a single prompt missed. Not because the LLM couldn't write an Unfrozen event, but because no single prompt can anticipate all the completeness conditions of a non-trivial domain.

The pattern is clear: Generation quality is already good. Verification quality is where the gap is. And verification is exactly what CLMA automates.


Files

| File | Description |
| --- | --- |
| `1.py` | CLMA: bounded blocking queue |
| `2.py` | Web chat: bounded blocking queue |
| `3.py` | Web chat: event sourcing framework |
| `4.py` | CLMA: event sourcing framework (3 iterations) |
| `test_compare.py` | Q1 test suite: 12 cases, run against both |
| `test_q5_compare.py` | Q5 test suite: auto-detects class names |

All comparison files are in the CLMA repository.


