CLMA vs Web Chat: Putting Iterative Verification to the Test
Posted on May 6, 2026 · #CLMA #MultiAgent #CodeGeneration #EventSourcing #Comparison #Python
All code is open source on GitHub: github.com/kriely/CLMA
This is a companion piece to Building CLMA: A Self-Verifying Multi-Agent Framework from Scratch. In that article, I described the framework. Here, I put it to the test — head to head against a plain web chat, same model, same problem.
The Setup
Same LLM (DeepSeek) tasked with writing the same code. No human intervention on either side. Two questions:
- Q1 — Thread-safe bounded blocking queue (put/get with timeout)
- Q5 — Event sourcing framework for a bank account (events, replay, serialization, optimistic concurrency, business rules, freeze/unfreeze)
For Q5, the CLMA version went through 3 automated iteration rounds (Solver → Verifier → Refiner → Verifier → Refiner → Verifier → Evaluator). The web chat version was a single-shot output.
Q1: Bounded Blocking Queue
Both implementations passed all 12 test cases — basic put/get, blocking/unblocking behavior, timeout, edge cases (maxsize=1, maxsize=0), queue state queries, and invalid capacity.
12/12 pass for both. On the surface, a draw. But the engineering quality tells a different story.
CLMA Version (1.py)
```python
# Two separate Conditions — put and get don't contend
self.not_empty = threading.Condition(self._lock)
self.not_full = threading.Condition(self._lock)
```

```python
# time.monotonic() — immune to system clock adjustments
remaining = timeout
while self.full():
    if remaining is not None:
        if remaining <= 0:
            raise Full
        start = time.monotonic()
        self.not_full.wait(remaining)
        remaining -= time.monotonic() - start
    else:
        self.not_full.wait()
```
Web Chat Version (2.py)
```python
# Single Condition — functional but suboptimal
self.cond = threading.Condition()
```

```python
# time.time() — affected by system clock changes
deadline = time.time() + timeout
while self.full():
    remaining = deadline - time.time()
    if remaining <= 0:
        raise QueueFull
    self.cond.wait(timeout=remaining)
```
Key Differences
| Aspect | CLMA | Web Chat |
|---|---|---|
| Conditions | 2 (`not_empty` / `not_full`) — put/get don't contend | 1 — `notify()` may wake the wrong waiter |
| Clock | `time.monotonic()` — immune to NTP adjustments | `time.time()` — affected by system clock changes |
| Timeout precision | Exact decrement per loop iteration | Once-calculated deadline |
| Exception names | `Full`, `Empty` — concise | `QueueFull`, `QueueEmpty` — verbose |
| Edge case | Handles `timeout < 0` defensively | No check for negative timeout |
| Comments | English | Chinese |
Verdict: Both pass all tests, but CLMA's design is more robust for high-concurrency scenarios. Two Conditions prevent head-of-line blocking between producers and consumers. time.monotonic() avoids a real-world bug class (NTP jumps causing premature or delayed timeouts). The difference matters under load, not in a single-threaded test.
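Stitched together, the two-Condition design reads roughly like the sketch below. This is my own minimal reconstruction, not the code from 1.py: the class name, the list-backed buffer, and the `maxsize <= 0` check are illustrative assumptions; only the Condition/monotonic pattern mirrors the snippets above.

```python
import threading
import time
from queue import Empty, Full


class BoundedQueue:
    """Minimal sketch of the two-Condition, monotonic-clock design."""

    def __init__(self, maxsize: int):
        if maxsize <= 0:  # treat non-positive capacity as invalid (an assumption)
            raise ValueError("maxsize must be positive")
        self._items: list = []
        self._maxsize = maxsize
        self._lock = threading.Lock()
        # Two Conditions sharing one lock: producers wait on not_full,
        # consumers on not_empty, so a notify() wakes the right kind of waiter.
        self.not_empty = threading.Condition(self._lock)
        self.not_full = threading.Condition(self._lock)

    def put(self, item, timeout=None):
        with self.not_full:
            remaining = timeout
            while len(self._items) >= self._maxsize:
                if remaining is not None:
                    if remaining <= 0:  # covers negative timeouts defensively
                        raise Full
                    start = time.monotonic()
                    self.not_full.wait(remaining)
                    remaining -= time.monotonic() - start
                else:
                    self.not_full.wait()
            self._items.append(item)
            self.not_empty.notify()

    def get(self, timeout=None):
        with self.not_empty:
            remaining = timeout
            while not self._items:
                if remaining is not None:
                    if remaining <= 0:
                        raise Empty
                    start = time.monotonic()
                    self.not_empty.wait(remaining)
                    remaining -= time.monotonic() - start
                else:
                    self.not_empty.wait()
            item = self._items.pop(0)
            self.not_full.notify()
            return item
```

A quick single-threaded check: a full `maxsize=1` queue raises `Full` after the timeout elapses, and `get` frees a slot again.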
Q5: Event Sourcing Framework
This is where the gap opens wider. Both implement an event-sourced bank account with:
- Events: account opened, deposited, withdrawn, frozen
- Event store with optimistic concurrency control
- Event replay (rebuild aggregate state from history)
- Serialization / deserialization
- Business rules: no negative deposits, no over-withdrawal, no withdrawal on frozen account
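The replay mechanic at the heart of both versions can be reduced to a fold over the event history. A minimal sketch (the event names match the article; the dataclasses and the `replay` helper are illustrative, not the code from either file):

```python
from dataclasses import dataclass


# Hypothetical minimal event types; the real files carry IDs,
# timestamps, and serialization on top of this.
@dataclass
class Deposited:
    amount: float


@dataclass
class Withdrawn:
    amount: float


def replay(events) -> float:
    """Rebuild the account balance purely from the event history."""
    balance = 0.0
    for event in events:
        if isinstance(event, Deposited):
            balance += event.amount
        elif isinstance(event, Withdrawn):
            balance -= event.amount
    return balance


# Mirrors the replay test below: deposit 100 + 50, withdraw 30 == 120
history = [Deposited(100), Deposited(50), Withdrawn(30)]
```

The aggregate never stores its balance directly; the state is whatever the history folds to, which is exactly why a missing event type silently corrupts the model.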
CLMA Version (4.py) — After 3 Iterations
The automated Verifier caught two things the initial output missed:
Round 1 → Round 2: "Where's the Unfrozen event? A frozen account can never be unfrozen — this is incomplete."
Round 2 → Round 3: "The freeze implementation blocks withdrawals, but should it also block deposits? This is a business policy decision — document it explicitly."
Result — CLMA adds the Unfrozen event:
```python
class Unfrozen(Event):
    def __init__(self, aggregate_id: str, reason: str = "", ...):
        super().__init__(aggregate_id, event_id, timestamp)
        self.reason = reason
```
And the BankAccount handles it properly:
```python
def _apply(self, event: Event) -> None:
    if isinstance(event, Deposited):
        self.balance += event.amount
    elif isinstance(event, Withdrawn):
        self.balance -= event.amount
    elif isinstance(event, Frozen):
        self.is_frozen = True
    elif isinstance(event, Unfrozen):
        self.is_frozen = False  # ← Added by Verifier
    else:
        raise ValueError(...)
```
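To see why this matters at replay time, consider a stripped-down aggregate (names here are illustrative, not CLMA's exact classes): without an `Unfrozen` handler, a history containing a thaw would either raise or leave the account frozen forever after a rebuild.

```python
class Frozen:
    """Marker event: account frozen."""


class Unfrozen:
    """Marker event: account thawed."""


class Account:
    def __init__(self):
        self.is_frozen = False

    def _apply(self, event) -> None:
        # Each event type must have a handler, or replay cannot
        # faithfully reconstruct the state transitions.
        if isinstance(event, Frozen):
            self.is_frozen = True
        elif isinstance(event, Unfrozen):
            self.is_frozen = False


acct = Account()
for event in [Frozen(), Unfrozen()]:
    acct._apply(event)
# After replaying freeze then thaw, the rebuilt account is unfrozen.
```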
Web Chat Version (3.py) — Single Shot
The web version has a clean architecture — proper Event base class, register_event decorator, payload() abstraction, serialization round-trip. But it has no Unfrozen event.
```python
@register_event
class AccountFrozen(Event):
    def __init__(self, aggregate_id: str): ...

# ... no Unfrozen counterpart exists
```
The freeze() method works, but there's no unfreeze(). Once frozen, the account stays frozen forever.
The Test Results
| Category | CLMA | Web Chat |
|---|---|---|
| Event basics (IDs, timestamps, types) | ✅ | ✅ |
| Serialization / deserialization | ✅ | ✅ |
| Event replay (deposit 100+50, withdraw 30 = 120) | ✅ | ✅ |
| Business rules (no negative, no overdraft) | ✅ | ✅ |
| Freeze → reject withdrawal | ✅ | ✅ |
| Unfreeze → allow operations again | ✅ | ❌ Missing |
| Optimistic concurrency | ✅ | ✅ |
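The optimistic concurrency row both versions pass boils down to a version check on append. A minimal in-memory sketch (class and method names are my own; both files implement richer stores):

```python
class ConcurrencyError(Exception):
    """Raised when the expected version no longer matches the stream."""


class EventStore:
    """In-memory event store with optimistic concurrency control."""

    def __init__(self):
        self._streams: dict[str, list] = {}  # aggregate_id -> event list

    def append(self, aggregate_id: str, events: list, expected_version: int) -> int:
        stream = self._streams.setdefault(aggregate_id, [])
        # Reject the write if another writer appended since we loaded.
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {len(stream)}"
            )
        stream.extend(events)
        return len(stream)  # the new version

    def load(self, aggregate_id: str) -> list:
        return list(self._streams.get(aggregate_id, []))
```

A writer that loaded at version 0 succeeds; a second writer still holding version 0 gets a `ConcurrencyError` instead of silently clobbering the first append.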
Both frameworks pass the core event sourcing checks, but the missing Unfrozen event in the web chat version is not a cosmetic issue: it's a domain modeling gap. In any real banking system, frozen accounts need a thaw mechanism.
Why CLMA Found It
The third iteration round is where the value shows. The Verifier's feedback was:
"The freeze flow is incomplete. Freezing is an operation that must be reversible. Consider adding an Unfrozen event and updating the aggregate to apply it."
A human reviewer would spot this too. But the CLMA Verifier catches it automatically, in seconds, with no developer in the loop. This is the difference between code review as a process and code review as a downloaded prompt.
What This Means
| | Q1 (Blocking Queue) | Q5 (Event Sourcing) |
|---|---|---|
| CLMA | 12/12 ✅, better design | Full feature set ✅ |
| Web Chat | 12/12 ✅, usable but less robust | Missing Unfrozen event ❌ |
For simple, well-defined problems (Q1), a single-shot chat prompt gets you 90% of the way. The CLMA advantage is marginal — better engineering choices, but the output is functionally equivalent.
For complex, multi-faceted problems (Q5) where completeness matters — domain events, edge cases, business rules — the iterative verification loop earns its keep. The 3 rounds of automated review caught a real domain modeling gap that a single prompt missed. Not because the LLM couldn't write an Unfrozen event, but because no single prompt can anticipate all the completeness conditions of a non-trivial domain.
The pattern is clear: Generation quality is already good. Verification quality is where the gap is. And verification is exactly what CLMA automates.
Files
| File | Description |
|---|---|
| 1.py | CLMA — bounded blocking queue |
| 2.py | Web chat — bounded blocking queue |
| 3.py | Web chat — event sourcing framework |
| 4.py | CLMA — event sourcing framework (3 iterations) |
| test_compare.py | Q1 test suite — 12 cases for both |
| test_q5_compare.py | Q5 test suite — auto-detects class names |
All comparison files are in the CLMA repository.
Tags: #CLMA #MultiAgent #CodeGeneration #EventSourcing #Comparison #Python #DeepSeek