For the past year, the agent-tooling conversation has been dominated by memory: vector stores, knowledge graphs, session persistence. Useful — I build and maintain an open-source memory server myself. But after months of running multiple AI agents against real work every day, I am convinced the expensive failures were never memory failures. They were coordination failures: an agent acting on an unverified claim, two agents mutating the same state, a "small" change that was not small.
The fix that worked for me did not come from machine learning. It came from how courts and audits work: separation of roles, adversarial review, verdicts on the record.
The frame
In my current workflow, my agents operate under a protocol with four roles:
| Role | Mandate |
|---|---|
| Conductor | Orients, assigns scoped tasks, issues decisions. In this protocol, it does not implement production code. |
| Devil's advocate | Tries to refute. Every high-risk claim or spec gets a verdict — PASS, AMEND, or BLOCK — steelman first, then critique. |
| Executors | Do the scoped work — research, analysis, implementation — and report evidence, not summaries. |
| Human operator | Approves anything irreversible. No mutation without an explicit go. |
Nothing relies on goodwill or prompt-engineering finesse. Every message is a row in a SQLite database, addressed [FROM → TO], claimed before work starts, and answerable later. The transcript is the audit trail.
A typical assignment looks like this on the wire (content redacted, shape real):
[H][STATUS][CONDUCTOR -> EXECUTOR_2] scoped task + constraints (read-only; propose, don't mutate)
[M][STATUS][EXECUTOR_2] CLAIM — references the assignment, declares scope, starts work
[M][A][EXECUTOR_2 -> CONDUCTOR] report: summary / evidence / limits / next
[H][A][ADVOCATE -> CONDUCTOR] verdict: PASS-AMEND + numbered amendments
[H][DECISION][CONDUCTOR] approves items, citing the verdict; defers the rest to the human
[M][STATUS][EXECUTOR_2] executed: counts + verification query results
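To make "every message is a row" concrete, here is a minimal sketch of how lines like the ones above could be stored and rendered back. The table name, column names, and the helpers `open_db`, `post`, and `transcript` are all hypothetical. The project's real schema lives in the repository and may look quite different.

```python
import sqlite3

# Kinds and priorities as listed in this post; everything else here is assumed.
KINDS = {"Q", "A", "STATUS", "DECISION", "PING", "WATERMARK"}
PRIORITIES = {"H", "M", "L", "INFO"}

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE messages (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    priority  TEXT NOT NULL,
    kind      TEXT NOT NULL,
    from_role TEXT NOT NULL,
    to_role   TEXT,            -- NULL when the message has no addressee
    body      TEXT NOT NULL
);
"""


def open_db(path=":memory:"):
    """Open a database and create the illustrative messages table."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def post(conn, priority, kind, from_role, body, to_role=None):
    """Validate kind and priority, then insert one message row and return its id."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority!r}")
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO messages (priority, kind, from_role, to_role, body) "
            "VALUES (?, ?, ?, ?, ?)",
            (priority, kind, from_role, to_role, body),
        )
    return cur.lastrowid


def transcript(conn):
    """Render all rows, oldest first, in the bracketed wire shape."""
    rows = conn.execute(
        "SELECT priority, kind, from_role, to_role, body FROM messages ORDER BY id"
    )
    lines = []
    for priority, kind, from_role, to_role, body in rows:
        address = f"{from_role} -> {to_role}" if to_role else from_role
        lines.append(f"[{priority}][{kind}][{address}] {body}")
    return lines


if __name__ == "__main__":
    db = open_db()
    post(db, "H", "STATUS", "CONDUCTOR", "scoped task + constraints", to_role="EXECUTOR_2")
    post(db, "M", "STATUS", "EXECUTOR_2", "CLAIM")
    print("\n".join(transcript(db)))
```

The point of the sketch is only that the transcript and the audit trail are the same data: nothing is logged separately.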
Three anonymized cases from recent weeks show why this pays.
Case 1: The confident wrong fact
An executor compiled a preparation document ahead of an important professional meeting. One background fact about the organization involved was plausible, widely repeated in secondary sources — and wrong. The advocate's standing rule is that web signals are hypotheses, not facts; its refutation pass demanded a primary source, found a conflicting primary source, and returned AMEND. The fact was corrected in the same review cycle, before the document was used. The cost of being confidently wrong in that room is hard to quantify; the cost of the review was one extra pass.
Case 2: The "15-minute config fix"
A corrective plan for an internal business application contained what looked like the smallest item on the list: change one default value in a config file. Estimated effort: fifteen minutes. The advocate's verdict reframed it: the value in question set the application's default access-control behavior, which makes it a security policy decision rather than a tweak, and the existing test suite explicitly asserted the old behavior. Verdict: BLOCK until scoped as a policy change with its own spec, with the tests updated in the same commit. The fifteen-minute fix was real work pretending to be trivial. The review gate caught the pretense.
Case 3: A day on the record
One recent working day, four role sessions — running on two different model families (Claude and Codex) — coordinated through the same database: addressed task queues, claim-before-work so no two agents grab the same item, watermark cursors so each role knows what it has read, and explicit human approval before any mutation. Two sessions were terminated mid-day during a workstation cleanup; their successors re-bound to the same roles with full continuity, because the state lived in the database, not in a chat window. By evening: five adversarial verdicts issued, two executor lanes completed, sixty-nine database mutations applied — every one of them after a recorded decision, none silently.
How it works
The mechanics are deliberately boring:
- Topics and roles. A debate topic is a row. Roles are declared per topic and bound to concrete sessions, with generation counters: when a session dies, its successor rebinds as `g2` and continuity is explicit, not assumed.
- Messages. Each message carries a kind (`Q`, `A`, `STATUS`, `DECISION`, `PING`, `WATERMARK`), a priority (H/M/L/INFO), and is addressed by role and convention. Kind semantics are validated before insert.
- Claims. Work starts with a CLAIM referencing the assignment: two agents cannot silently grab the same item, and stale claims can be reclaimed with an audit row.
- Verdicts and gates. Specs go through the advocate before implementation. The chain is SPEC → advocate PASS → decision → build → diff → commit; skipping a step is a recorded protocol breach. Work is also tagged by type — analysis, review, implementation — and implementation-tagged work fails closed rather than dispatching without approval.
- Cursors. Watermarks make reads resumable; every role knows exactly what it has and has not seen.
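The cursor mechanic in the last bullet can be sketched in a few lines. This is an illustration under assumptions, not the project's code: the `cursors` table, the use of the message `id` as the watermark, and the `read_new` helper are all hypothetical.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create two illustrative tables: messages and one watermark per role."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE messages (
            id      INTEGER PRIMARY KEY AUTOINCREMENT,
            to_role TEXT NOT NULL,
            body    TEXT NOT NULL
        );
        CREATE TABLE cursors (
            role      TEXT PRIMARY KEY,
            watermark INTEGER NOT NULL   -- highest message id this role has read
        );
        """
    )
    return conn


def read_new(conn, role):
    """Return bodies of messages addressed to `role` above its watermark, then advance it."""
    with conn:  # the watermark update is committed when this block exits
        row = conn.execute(
            "SELECT watermark FROM cursors WHERE role = ?", (role,)
        ).fetchone()
        watermark = row[0] if row else 0
        rows = conn.execute(
            "SELECT id, body FROM messages WHERE to_role = ? AND id > ? ORDER BY id",
            (role, watermark),
        ).fetchall()
        if rows:
            conn.execute(
                "INSERT OR REPLACE INTO cursors (role, watermark) VALUES (?, ?)",
                (role, rows[-1][0]),
            )
    return [body for _, body in rows]
```

Because the watermark is a row rather than in-process state, a successor session that binds to the same role would resume from the same position, which is the continuity property described in Case 3.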
It runs on one local SQLite file: no message broker, no orchestration framework. Everything is queryable after the fact, which is the whole point: the coordination layer is just rows, and rows can be audited.
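With no message broker in the picture, mutual exclusion has to come from the database itself. One way to get claim-before-work is a uniqueness constraint on the assignment. The sketch below assumes a hypothetical `claims` table and `claim_audit` table; how the real protocol decides that a claim is stale is not covered here, so `reclaim` simply takes a reason string.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create illustrative claim and audit tables."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE claims (
            assignment_id INTEGER PRIMARY KEY,  -- at most one live claim per assignment
            claimed_by    TEXT NOT NULL
        );
        CREATE TABLE claim_audit (
            id            INTEGER PRIMARY KEY AUTOINCREMENT,
            assignment_id INTEGER NOT NULL,
            old_owner     TEXT NOT NULL,
            new_owner     TEXT NOT NULL,
            reason        TEXT NOT NULL
        );
        """
    )
    return conn


def claim(conn, assignment_id, role):
    """Return True if this call created the claim, False if one already existed."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO claims (assignment_id, claimed_by) VALUES (?, ?)",
                (assignment_id, role),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key rejected a second claim on the same assignment.
        return False


def reclaim(conn, assignment_id, new_role, reason):
    """Transfer an existing claim to `new_role` and record the transfer as an audit row."""
    with conn:  # ownership change and audit row commit together or not at all
        row = conn.execute(
            "SELECT claimed_by FROM claims WHERE assignment_id = ?", (assignment_id,)
        ).fetchone()
        if row is None:
            raise LookupError(f"no claim on assignment {assignment_id}")
        conn.execute(
            "UPDATE claims SET claimed_by = ? WHERE assignment_id = ?",
            (new_role, assignment_id),
        )
        conn.execute(
            "INSERT INTO claim_audit (assignment_id, old_owner, new_owner, reason) "
            "VALUES (?, ?, ?, ?)",
            (assignment_id, row[0], new_role, reason),
        )
```

The losing agent gets an explicit `False` instead of silently duplicating work, and a reclaim cannot happen without leaving a row behind.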
The full protocol, with message kinds, state transitions, and failure-handling rules, is in docs/DEBATE_PROTOCOL.md.
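The protocol document is the authority on gate rules. Purely as an illustration of what "fails closed" can mean in practice, here is a sketch in which implementation-tagged work refuses to run unless an approving decision row exists. The `decisions` table, the `GateClosed` exception, and the `dispatch` helper are hypothetical, and rejecting unknown work types is my assumption for the sketch rather than a documented rule.

```python
import sqlite3

# Work types named in this post.
WORK_TYPES = ("analysis", "review", "implementation")


class GateClosed(Exception):
    """Raised when work is refused because no approval is on record."""


def open_db(path=":memory:"):
    """Create an illustrative decisions table."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE decisions (
            id       INTEGER PRIMARY KEY AUTOINCREMENT,
            task_id  INTEGER NOT NULL,
            approved INTEGER NOT NULL CHECK (approved IN (0, 1))
        );
        """
    )
    return conn


def dispatch(conn, task_id, work_type, run):
    """Call `run()` only if the gate allows it; otherwise raise GateClosed."""
    if work_type not in WORK_TYPES:
        # Assumption: anything unrecognized is refused rather than waved through.
        raise GateClosed(f"unknown work type {work_type!r}")
    if work_type == "implementation":
        approved = conn.execute(
            "SELECT 1 FROM decisions WHERE task_id = ? AND approved = 1 LIMIT 1",
            (task_id,),
        ).fetchone()
        if approved is None:
            raise GateClosed(f"task {task_id} has no recorded approval")
    return run()
```

The default path is refusal: the absence of a row blocks the work, so a missing or lost decision can never be mistaken for consent.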
What this is not
This is not autonomy. The human approves every irreversible step, and that is the point: the protocol exists to make delegation reviewable, not to remove the reviewer. It is also not free — adversarial review costs latency and tokens, and on trivial work it is overhead. The honest accounting is that I spend review cycles the way one buys insurance: most days it returns nothing, and then one day it returns the whole premium. Scheduled, unattended runs on a Linux host are where this is heading next — early plumbing exists, but I am not claiming it works yet.
Why "reviewable coordination"
I deliberately avoid grander labels. What I can defend from evidence is narrower: multiple agents, multiple vendors' models, one shared queue, adversarial review before consequential action, a human gate on mutations, and a complete audit trail — on a single SQLite file. If your agents only remember things, they will remember your mistakes fluently. Reviewable coordination is what catches the mistakes before they ship.
The protocol and the memory server it runs on are open source: https://github.com/RMANOV/sqlite-memory-mcp — start with docs/DEBATE_PROTOCOL.md.