Problem: Current LLM agent frameworks treat the chat history as the single source of truth for state. This is architecturally equivalent to a kernel persisting its state only through stdin/stdout logs. It works temporarily, but predictably fails under load.
Three measurable failure modes:
- Undetected execution loops: no watchdog. The agent re-runs `write_file('config.json', data)` because the confirmation fell out of context. Tokens burn until max_iterations.
- Silent state corruption: LLM emits invalid JSON for a tool call. Some frameworks swallow it and proceed with `null`. Others abort. None roll back the file system. A half-written file persists.
- Quadratic token cost: context grows every iteration (O(n²) attention). No budgeting, no signal before truncation.
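The quadratic growth also shows up in billed input tokens, because every call re-sends the whole history. The sketch below illustrates this under a simplifying assumption that is not from the benchmark: each iteration adds a constant number of tokens (500 here, chosen purely for illustration). The function name is hypothetical.

```python
def cumulative_input_tokens(iterations: int, tokens_per_step: int) -> int:
    """Total input tokens sent when every call re-sends the full history."""
    total = 0
    context = 0
    for _ in range(iterations):
        context += tokens_per_step  # history grows by one step
        total += context            # the whole history is sent again
    return total


# Assumed 500 tokens per step; the closed form is tokens_per_step * n * (n + 1) / 2.
print(cumulative_input_tokens(10, 500))  # 27500
print(cumulative_input_tokens(90, 500))  # 2047500
```

Going from 10 to 90 iterations multiplies the step count by 9 but the cumulative input by roughly 74.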
These aren't bugs. They are architectural consequences of treating a probabilistic system (LLM) as a general-purpose deterministic machine. With documented tool-call hallucination rates of 2-5% (ToolAlpaca, API-Bank), relying on the model to self-manage state is untenable past ~50 tool calls: if errors are independent, the chance of at least one bad call in 50 is roughly 64% at a 2% rate and 92% at 5%.
The AEP approach: Instead of state-in-context, we define a deterministic sandbox operated by a microkernel runtime with an 8-register address space (R0-R7):
| Register | Function |
|---|---|
| R0 | Program counter |
| R1 | Watchdog timer (deadline, loop counter, state hash window) |
| R2 | Context budget (tokens used, remaining) |
| R3 | Sandbox state (content hash) |
| R4 | Error register (structured stderr: code + payload) |
| R5 | Schema registry (last tool + params) |
| R6 | Transaction buffer (write-ahead log for rollback) |
| R7 | Executive metadata (task_id, depth, cumulative tokens) |
These registers do not live in the LLM context. They live in the runtime — Python, Rust, Go, whatever. The LLM only interacts with them through tool calls routed by the microkernel, never through chat messages. This decouples context (compressible) from operational state (exact).
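One way to picture the register file is a plain data structure owned by the runtime. This is a minimal sketch, not the AEP reference implementation: the class name `RegisterFile`, the field names, and the field types are all assumptions chosen to mirror the table above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class RegisterFile:
    """Hypothetical layout of R0-R7, held by the runtime and never placed in the chat context."""
    r0_program_counter: int = 0
    r1_watchdog: Dict[str, Any] = field(
        default_factory=lambda: {"deadline": None, "loop_counter": 0, "hash_window": []})
    r2_context_budget: Dict[str, int] = field(
        default_factory=lambda: {"tokens_used": 0, "tokens_remaining": 0})
    r3_sandbox_hash: str = ""
    r4_error: Optional[Dict[str, Any]] = None  # structured stderr: code + payload
    r5_schema_registry: Dict[str, Any] = field(
        default_factory=lambda: {"last_tool": None, "params": None})
    r6_transaction_buffer: List[Any] = field(default_factory=list)  # write-ahead log
    r7_metadata: Dict[str, Any] = field(
        default_factory=lambda: {"task_id": None, "depth": 0, "cumulative_tokens": 0})


regs = RegisterFile()
regs.r2_context_budget["tokens_remaining"] = 8_000  # illustrative budget, not a spec default
regs.r0_program_counter += 1
```

Because the object lives in process memory, pruning or compressing chat messages cannot change any of these values.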
Resilience mechanics (AEP-0008):
Watchdog (R1): After every tool execution, the runtime hashes the sandbox. If hash == previous_hash, it increments a loop counter. When counter >= threshold (default 3), the task is ejected with WATCHDOG_LOOP. This catches cycles without state progress, not just call count.
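The watchdog rule can be sketched in a few lines. This is an illustration under stated assumptions rather than the runtime's actual code: `hash_sandbox`, `Watchdog`, and `WatchdogLoop` are hypothetical names, SHA-256 over sorted file contents is one possible content hash, and resetting the counter whenever the hash changes is an assumption the text above does not spell out.

```python
import hashlib
from pathlib import Path
from typing import Optional


def hash_sandbox(root: Path) -> str:
    """Content hash over every file under root (relative path + bytes), in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(root).as_posix().encode())
            digest.update(b"\0")
            digest.update(path.read_bytes())
    return digest.hexdigest()


class WatchdogLoop(Exception):
    """Signals that the task is ejected with WATCHDOG_LOOP."""


class Watchdog:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.previous_hash: Optional[str] = None
        self.loop_counter = 0

    def observe(self, state_hash: str) -> None:
        """Call after every tool execution with the current sandbox hash."""
        if state_hash == self.previous_hash:
            self.loop_counter += 1
        else:
            self.loop_counter = 0  # assumption: any state change counts as progress
        self.previous_hash = state_hash
        if self.loop_counter >= self.threshold:
            raise WatchdogLoop("WATCHDOG_LOOP")
```

With the default threshold, the fourth consecutive identical hash (three tool executions with no state change) triggers the ejection, regardless of how many calls were made before that.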
ACID transactions (R4, R6): Every mutation passes schema validation. On violation:
- Rollback via WAL replay (R6 restores previous sandbox)
- Structured error injected in R4: `{code, expected schema, received payload, recovery hint}`
- Runtime returns R4 as tool result; model parses and self-corrects
- Three consecutive rollbacks on the same tool → watchdog abort
Net effect: invalid JSON never touches the filesystem. Corrupted state is reverted before any external process reads it.
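The rollback path above can be sketched with an in-memory stand-in for the sandbox. Everything here is an assumption made for illustration: the `Runtime` class, the batch tool `write_files`, the `SCHEMA_VIOLATION` code, and the hand-rolled field check (used in place of a real schema validator). The dict `sandbox` stands in for the file system, and the WAL stores pre-images that are replayed newest-first.

```python
import json

_MISSING = object()  # marks "path did not exist before the write"


class WatchdogAbort(Exception):
    """Raised after too many consecutive rollbacks on the same tool."""


class Runtime:
    def __init__(self, max_rollbacks: int = 3):
        self.sandbox = {}    # stand-in for the file system: path -> content
        self.wal = []        # R6: (path, pre-image) logged before each write
        self.r4 = None       # R4: last structured error
        self.rollbacks = {}  # consecutive rollbacks per tool
        self.max_rollbacks = max_rollbacks

    def _rollback(self) -> None:
        # Replay the log newest-first, restoring each pre-image.
        while self.wal:
            path, before = self.wal.pop()
            if before is _MISSING:
                self.sandbox.pop(path, None)
            else:
                self.sandbox[path] = before

    def write_files(self, raw_args: str) -> dict:
        """Hypothetical batch tool: raw_args is a JSON list of {"path", "content"} objects."""
        tool = "write_files"
        try:
            entries = json.loads(raw_args)  # JSONDecodeError is a ValueError
            if not isinstance(entries, list):
                raise ValueError("expected a JSON array")
            for entry in entries:
                if not (isinstance(entry, dict)
                        and isinstance(entry.get("path"), str)
                        and isinstance(entry.get("content"), str)):
                    raise ValueError("each entry needs string 'path' and 'content'")
                # Log the pre-image before applying the write.
                self.wal.append((entry["path"], self.sandbox.get(entry["path"], _MISSING)))
                self.sandbox[entry["path"]] = entry["content"]
        except ValueError as exc:
            self._rollback()
            self.rollbacks[tool] = self.rollbacks.get(tool, 0) + 1
            if self.rollbacks[tool] >= self.max_rollbacks:
                raise WatchdogAbort(tool) from exc
            self.r4 = {
                "code": "SCHEMA_VIOLATION",
                "expected": [{"path": "str", "content": "str"}],
                "received": raw_args,
                "hint": str(exc),
            }
            return self.r4  # returned to the model as the tool result
        self.wal.clear()  # commit
        self.rollbacks[tool] = 0
        self.r4 = None
        return {"ok": True}
```

If the second entry of a batch is malformed, the first entry's write is undone before the error is returned, so a reader of the sandbox never observes the half-applied batch.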
Benchmark — Controlled methodology:
Pipeline: agent transforms 20 CSV spreadsheets (diverse schemas, mixed encoding, up to 15 columns) from natural language instructions. Baseline: same agent + same model (Claude 3.5 Sonnet, max_iterations=90) without AEP runtime. n=50 per arm, shuffled, temp=0.
| Metric | Baseline | AEP Runtime | Delta |
|---|---|---|---|
| Tokens consumed (mean) | 312,450 | 62,890 | -79.9% |
| Schema accuracy (first call) | 64% | 86% | +22pp |
| Loop rate (>=3 cycles, no progress) | 18% | 0% | -18pp |
| Post-execution file corruption | 2 cases | 0 cases | -100% |
| Wall time (mean) | 8m42s | 2m13s | -74.5% |
Methodology notes (read before citing):
- 95% CI for tokens: ±4.2% baseline, ±3.1% AEP
- Only spreadsheet pipeline tested — no code gen, web scraping, or pure CoT data yet
- Schema accuracy measures payload-passing validation, not semantic output correctness
- Full fixture set + run script at `benchmark/fixtures/` and `benchmark/run_benchmark.sh`
The roughly 80% token reduction breaks down as 55 percentage points from context compression (tool messages pruned after WAL confirm), 20 from loop elimination, and 5 from fast rollback (1-2 iterations vs 5-8).
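The headline deltas can be re-derived from the table's raw means. The snippet below is only an arithmetic check on the published numbers; `pct_change` is a helper introduced here for illustration.

```python
def pct_change(baseline: float, treated: float) -> float:
    """Relative change from baseline to treated, in percent."""
    return (treated - baseline) / baseline * 100


tokens_delta = pct_change(312_450, 62_890)
wall_delta = pct_change(8 * 60 + 42, 2 * 60 + 13)  # 8m42s and 2m13s in seconds

# Attribution of the token reduction, in percentage points of baseline.
breakdown = {"context compression": 55, "loop elimination": 20, "fast rollback": 5}

print(f"tokens: {tokens_delta:.1f}%")        # tokens: -79.9%
print(f"wall time: {wall_delta:.1f}%")       # wall time: -74.5%
print(f"sum of breakdown: {sum(breakdown.values())} pp")  # sum of breakdown: 80 pp
```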
What exists today:
Repo: https://github.com/ferreiratechnology2025-max/CogniX
The core spec (AEP-0001 through AEP-0012) is frozen at v1.1.0. The Compliance Kit (compliance/) has 11 YAML tests:
- Watchdog ejection on N cycles with no state delta
- Rollback restores sandbox after invalid schema
- R4 captures structured error with code
- WAL persists before apply()
- Task isolation in concurrent execution
- Context budget ejection at limit
- Watchdog bypass for idempotent tools
- Rollback does not affect unrelated sandbox state
- Forced hash collision behavior
- WAL lock contention timeout
- Independent benchmark reproducibility
What we're asking the community:
- Audit the spec (AEP-0001 through 0012 in `spec/`). If R6 transaction semantics don't match your use case, open an issue describing the gap.
- Port the runtime to Rust or Go. The Python runtime is a POC; the spec is language-agnostic.
- Run the benchmark independently. `run_benchmark.sh` takes ~3 minutes on commodity hardware.
The protocol is published. The tests are available. The engineering speaks for itself.