DEV Community

Romel


Agent Execution Protocol v1.1 — A microkernel runtime for LLM agents with watchdog timers and ACID transactions

Problem: Current LLM agent frameworks treat the chat history as the single source of truth for state. This is architecturally equivalent to a kernel persisting its state only through stdin/stdout logs. It works temporarily, but predictably fails under load.

Three measurable failure modes:

  1. Undetected execution loops: there is no watchdog. The agent re-runs write_file('config.json', data) because the confirmation fell out of context, and tokens burn until max_iterations.
  2. Silent state corruption: the LLM emits invalid JSON for a tool call. Some frameworks swallow it and proceed with null; others abort. None roll back the file system, so a half-written file persists.
  3. Quadratic token cost: the full history is resent on every iteration, so cumulative input tokens grow as O(n²) in the number of iterations. There is no budgeting and no signal before truncation.
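
The third failure mode can be made concrete with a small calculation. This is a minimal sketch, assuming a hypothetical 2,000-token starting context and 500 tokens appended per iteration (both numbers are illustrative, not taken from the benchmark):

```python
def cumulative_input_tokens(iterations: int, base: int = 2_000, per_step: int = 500) -> int:
    """Total input tokens read when the whole history is resent on each iteration."""
    total = 0
    context = base
    for _ in range(iterations):
        total += context      # the model reads the entire context on this call
        context += per_step   # the tool call and its result are appended afterwards
    return total


for n in (10, 20, 40):
    print(n, cumulative_input_tokens(n))
# 10 42500
# 20 135000
# 40 470000
```

Doubling the iteration count from 20 to 40 more than triples the cumulative cost, which is the quadratic behavior described above.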

These aren't bugs. They are architectural consequences of treating a probabilistic system (LLM) as a general-purpose deterministic machine. With documented tool-call hallucination rates of 2-5% (ToolAlpaca, API-Bank), relying on the model to self-manage state is untenable past ~50 tool calls.
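
To see why ~50 tool calls is a plausible breaking point, here is a rough calculation. It assumes each tool call fails independently at the quoted per-call rate, which is a simplification (real failures are likely correlated):

```python
def p_at_least_one_failure(per_call_rate: float, calls: int) -> float:
    """Probability of one or more bad tool calls, assuming independent calls."""
    return 1.0 - (1.0 - per_call_rate) ** calls


for rate in (0.02, 0.05):
    print(f"rate={rate:.0%}, 50 calls -> {p_at_least_one_failure(rate, 50):.1%}")
# rate=2%, 50 calls -> 63.6%
# rate=5%, 50 calls -> 92.3%
```

Under that assumption, a 50-call task has roughly a 64% to 92% chance of hitting at least one hallucinated call, so some recovery mechanism outside the model is needed.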


The AEP approach: Instead of state-in-context, we define a deterministic sandbox operated by a microkernel runtime with an 8-register address space (R0-R7):

| Register | Function |
|---|---|
| R0 | Program counter |
| R1 | Watchdog timer (deadline, loop counter, state hash window) |
| R2 | Context budget (tokens used, remaining) |
| R3 | Sandbox state (content hash) |
| R4 | Error register (structured stderr: code + payload) |
| R5 | Schema registry (last tool + params) |
| R6 | Transaction buffer (write-ahead log for rollback) |
| R7 | Executive metadata (task_id, depth, cumulative tokens) |

These registers do not live in the LLM context. They live in the runtime — Python, Rust, Go, whatever. The LLM only interacts with them through tool calls routed by the microkernel, never through chat messages. This decouples context (compressible) from operational state (exact).
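
As an illustration of registers living in the runtime rather than in the prompt, the sketch below models R0-R7 as a plain Python dataclass owned by a tiny dispatcher. The names `Registers`, `Microkernel`, and `dispatch`, the field layouts, and the default budget are all hypothetical; they are not taken from the AEP spec or the CogniX repo.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Registers:
    """Hypothetical in-memory layout for R0-R7; field shapes are assumptions."""
    r0_pc: int = 0
    r1_watchdog: Dict[str, Any] = field(
        default_factory=lambda: {"deadline": None, "loop_count": 0, "hash_window": []})
    r2_budget: Dict[str, int] = field(
        default_factory=lambda: {"used": 0, "remaining": 100_000})
    r3_state_hash: str = ""
    r4_error: Optional[Dict[str, Any]] = None
    r5_schema: Dict[str, Any] = field(default_factory=dict)
    r6_wal: List[Any] = field(default_factory=list)
    r7_meta: Dict[str, Any] = field(
        default_factory=lambda: {"task_id": "demo", "depth": 0, "cumulative_tokens": 0})


class Microkernel:
    """Routes tool calls and owns the registers; the model never sees them directly."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]) -> None:
        self.tools = tools
        self.regs = Registers()

    def dispatch(self, tool_name: str, params: Dict[str, Any]) -> Any:
        self.regs.r0_pc += 1                                       # advance program counter
        self.regs.r5_schema = {"tool": tool_name, "params": params}  # remember last call
        return self.tools[tool_name](**params)                     # only the result goes back


kernel = Microkernel({"echo": lambda text: text})
print(kernel.dispatch("echo", {"text": "hello"}))  # hello
print(kernel.regs.r0_pc)                           # 1
```

The point of the design is that only the return value of `dispatch` re-enters the chat context; the register contents stay exact regardless of how the context is compressed.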


Resilience mechanics (AEP-0008):

Watchdog (R1): After every tool execution, the runtime hashes the sandbox. If hash == previous_hash, it increments a loop counter. When counter >= threshold (default 3), the task is ejected with WATCHDOG_LOOP. This catches cycles without state progress, not just call count.
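
A minimal sketch of the watchdog logic described above. It assumes the sandbox can be modeled as a JSON-serializable dict and that the counter resets when the hash changes; the reset rule and the names `Watchdog`, `observe`, and `sandbox_hash` are assumptions, not taken from the spec.

```python
import hashlib
import json
from typing import Any, Dict, Optional


class WatchdogLoop(Exception):
    """Raised when the sandbox stops changing for `threshold` consecutive steps."""


def sandbox_hash(sandbox: Dict[str, Any]) -> str:
    # Canonical JSON (sorted keys) so equal states always hash identically.
    payload = json.dumps(sandbox, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class Watchdog:
    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.previous_hash: Optional[str] = None
        self.loop_count = 0

    def observe(self, sandbox: Dict[str, Any]) -> None:
        """Call after every tool execution."""
        current = sandbox_hash(sandbox)
        if current == self.previous_hash:
            self.loop_count += 1   # no state progress since the last call
        else:
            self.loop_count = 0    # assumption: progress resets the counter
        self.previous_hash = current
        if self.loop_count >= self.threshold:
            raise WatchdogLoop("WATCHDOG_LOOP")
```

With the default threshold, the first observation records a baseline hash and the fourth identical observation raises `WatchdogLoop`, since that is the third repeat without a state change.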

ACID transactions (R4, R6): Every mutation passes schema validation. On violation:

  1. Rollback via WAL replay (R6 restores previous sandbox)
  2. Structured error injected in R4: {code, expected schema, received payload, recovery hint}
  3. Runtime returns R4 as tool result — model parses and self-corrects
  4. Three consecutive rollbacks on the same tool → watchdog abort

Net effect: invalid JSON never touches the filesystem. Corrupted state is reverted before any external process reads it.
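
The four steps above can be sketched as follows. This is an illustrative toy, not the AEP runtime: the sandbox is an in-memory dict, "schema validation" is reduced to checking that the payload parses as JSON, and the names `TransactionalSandbox`, `write_json`, and the `SCHEMA_VIOLATION` code are hypothetical.

```python
import json
from typing import Any, Dict, List, Optional, Tuple


class WatchdogAbort(Exception):
    """Raised after too many consecutive rollbacks on the same tool."""


class TransactionalSandbox:
    def __init__(self, max_rollbacks: int = 3) -> None:
        self.files: Dict[str, str] = {}
        self.wal: List[Tuple[str, Optional[str]]] = []  # stands in for R6
        self.error: Optional[Dict[str, Any]] = None     # stands in for R4
        self.rollbacks: Dict[str, int] = {}
        self.max_rollbacks = max_rollbacks

    def _rollback(self) -> None:
        # Undo logged writes in reverse order, restoring the prior contents.
        while self.wal:
            path, old = self.wal.pop()
            if old is None:
                self.files.pop(path, None)
            else:
                self.files[path] = old

    def write_json(self, path: str, raw: str) -> Dict[str, Any]:
        tool = "write_json"
        self.wal.append((path, self.files.get(path)))  # log before apply
        self.files[path] = raw                         # tentative write
        try:
            json.loads(raw)                            # validation step
        except json.JSONDecodeError as exc:
            self._rollback()
            self.rollbacks[tool] = self.rollbacks.get(tool, 0) + 1
            if self.rollbacks[tool] >= self.max_rollbacks:
                raise WatchdogAbort(tool) from exc
            self.error = {
                "code": "SCHEMA_VIOLATION",
                "expected": "valid JSON document",
                "received": raw,
                "hint": f"fix the syntax error at position {exc.pos} and retry",
            }
            return self.error                          # returned as the tool result
        self.wal.clear()                               # commit
        self.rollbacks[tool] = 0
        self.error = None
        return {"code": "OK"}
```

A failed write returns the structured error as the tool result and leaves `files` exactly as it was before the call; the third consecutive failure on the same tool raises `WatchdogAbort` instead.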


Benchmark — Controlled methodology:

Pipeline: agent transforms 20 CSV spreadsheets (diverse schemas, mixed encoding, up to 15 columns) from natural language instructions. Baseline: same agent + same model (Claude 3.5 Sonnet, max_iterations=90) without AEP runtime. n=50 per arm, shuffled, temp=0.

| Metric | Baseline | AEP Runtime | Delta |
|---|---|---|---|
| Tokens consumed (mean) | 312,450 | 62,890 | -79.9% |
| Schema accuracy (first call) | 64% | 86% | +22pp |
| Loop rate (>=3 cycles, no progress) | 18% | 0% | -18pp |
| Post-execution file corruption | 2 cases | 0 cases | -100% |
| Wall time (mean) | 8m42s | 2m13s | -74.5% |

Methodology notes (read before citing):

  • 95% CI for tokens: ±4.2% baseline, ±3.1% AEP
  • Only spreadsheet pipeline tested — no code gen, web scraping, or pure CoT data yet
  • Schema accuracy measures payload-passing validation, not semantic output correctness
  • Full fixture set + run script at benchmark/fixtures/ and benchmark/run_benchmark.sh

The roughly 80% token reduction breaks down, in percentage points of baseline tokens, as: 55 from context compression (tool messages pruned after WAL confirm), 20 from loop elimination, and 5 from fast rollback (1-2 iterations vs 5-8).
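
For readers who want to check the arithmetic, the deltas in the table can be recomputed from the reported means. The snippet only uses numbers already stated in this post; the helper names are ad hoc.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


def seconds(minutes: int, secs: int) -> int:
    return minutes * 60 + secs


token_delta = pct_change(312_450, 62_890)
time_delta = pct_change(seconds(8, 42), seconds(2, 13))

# Reported breakdown of the token saving, in percentage points of baseline.
breakdown = {"context compression": 55, "loop elimination": 20, "fast rollback": 5}

print(f"tokens: {token_delta:.1f}%")                  # tokens: -79.9%
print(f"wall time: {time_delta:.1f}%")                # wall time: -74.5%
print(f"breakdown total: {sum(breakdown.values())}")  # breakdown total: 80
```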


What exists today:

Repo: https://github.com/ferreiratechnology2025-max/CogniX

The core spec (AEP-0001 through AEP-0012) is frozen at v1.1.0. The Compliance Kit (compliance/) has 11 YAML tests:

  1. Watchdog ejection on N cycles with no state delta
  2. Rollback restores sandbox after invalid schema
  3. R4 captures structured error with code
  4. WAL persists before apply()
  5. Task isolation in concurrent execution
  6. Context budget ejection at limit
  7. Watchdog bypass for idempotent tools
  8. Rollback does not affect unrelated sandbox state
  9. Forced hash collision behavior
  10. WAL lock contention timeout
  11. Independent benchmark reproducibility

What we're asking the community:

  • Audit the spec (AEP-0001 through 0012 in spec/). If R6 transaction semantics don't match your use case, open an issue describing the gap.
  • Port the runtime to Rust or Go. The Python runtime is a POC; the spec is language-agnostic.
  • Run the benchmark independently. run_benchmark.sh takes ~3 minutes on commodity hardware.

The protocol is published. The tests are available. The engineering speaks for itself.
