Token usage, not model intelligence, explains roughly 80% of performance variance in AI agent tasks. We learned this the hard way while building an autonomous development platform on top of Claude Code. After months of iteration across a 530k-line codebase with 39 specialized agents, the patterns that actually produced reliable results all came from classical systems design rather than anything novel to AI.
We published a full technical paper with diagrams and implementation details. This post covers the key architectural decisions and why we made them.
The Failure That Changed Our Approach
Our first agents were 1,200+ line monoliths. All instructions, all tool definitions, all state packed into a single prompt. They failed the way monolithic services fail:
- Attention dilution: instructions at the bottom of the prompt got ignored.
- Context starvation: no room left in the context window for the actual work.
- State contamination: history from one task bled into the next.
There's a fundamental paradox at play. To handle complex tasks, agents need comprehensive instructions. But comprehensive instructions consume the context window, which degrades the model's ability to reason about the task those instructions describe. More instructions, worse performance. We call this the Context-Capability Paradox.
The Architecture: LLMs as Kernel Processes
We treat the LLM not as a chatbot but as a nondeterministic kernel process wrapped in a deterministic runtime environment. That metaphor maps directly to the implementation.
The system has five layers with strict separation of concerns:
- Agents (stateless ephemeral workers)
- Skills (externalized knowledge, loaded on demand)
- Orchestration (state machine and process lifecycle)
- Hooks (deterministic enforcement outside the LLM's control)
- Tooling (progressive MCP loading and semantic code intelligence)
Each layer has a single responsibility. Agents execute. Skills inform. Orchestration coordinates. Hooks enforce. Tools connect.
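One way to see the separation is as a set of contracts. Here is a minimal sketch in TypeScript; the interface names are illustrative, not our actual platform API:

```typescript
// Illustrative contracts for the five layers. Names are hypothetical.

// Agents execute: stateless, ephemeral, handed everything they need at spawn.
interface Agent {
  run(task: string, skills: Skill[]): Promise<AgentResult>;
}

// Skills inform: externalized knowledge, loaded on demand by the platform.
interface Skill {
  id: string;
  content: string; // markdown instructions injected into the agent's context
}

// Orchestration coordinates: owns the state machine and process lifecycle.
interface Orchestrator {
  spawn(role: string, task: string): Promise<AgentResult>;
  advancePhase(): void;
}

// Hooks enforce: deterministic checks the LLM cannot override.
interface Hook {
  event: "PreToolUse" | "PostToolUse" | "Stop";
  handle(payload: unknown): { decision?: "block"; reason?: string };
}

// Tooling connects: validated, token-bounded access to external systems.
interface Tool {
  name: string;
  invoke(input: unknown): Promise<string>;
}

interface AgentResult {
  output: string;
  filesTouched: string[];
}
```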
Pattern 1: Thin Agent, Fat Platform
We inverted the control structure. Instead of thick agents that carry everything, we built thin agents that carry nothing.
- Agents are under 150 lines. Stateless. Ephemeral. Every spawn gets a clean context window with zero history from siblings. Spawn cost dropped from ~24,000 tokens to ~2,700.
- Skills live in an external two-tier library. Core skills (always available) act like a BIOS. Library skills (loaded on demand) act like a hard drive. Agents never hardcode knowledge paths. They invoke a gateway (e.g., `gateway-frontend`), which performs intent detection and routes to the specific patterns needed.
An agent asking for help with a React infinite loop loads only the two relevant skill files, not the entire frontend knowledge base. This is textbook inversion of control. The agent doesn't decide what it knows. The platform decides.
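As a rough sketch of how a gateway can route, in TypeScript (the skill paths and keyword rules here are made up for illustration; the real routing is richer):

```typescript
import { readFileSync } from "node:fs";

// Hypothetical intent -> skill-file routing table for a frontend gateway.
const FRONTEND_ROUTES: Record<string, string[]> = {
  "react-infinite-loop": [
    ".claude/skills/frontend/react-useeffect-dependencies.md",
    ".claude/skills/frontend/react-render-loops.md",
  ],
  "state-management": [".claude/skills/frontend/state-patterns.md"],
};

// Crude intent detection: match the agent's request against known intents.
function detectIntent(request: string): string | undefined {
  if (/infinite loop|re-render|useEffect/i.test(request)) return "react-infinite-loop";
  if (/redux|zustand|global state/i.test(request)) return "state-management";
  return undefined;
}

// The gateway returns only the skill content the task needs,
// never the entire frontend knowledge base.
export function gatewayFrontend(request: string): string[] {
  const intent = detectIntent(request);
  if (!intent) return [];
  return FRONTEND_ROUTES[intent].map((path) => readFileSync(path, "utf8"));
}
```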
Pattern 2: Mutual Exclusion on Capabilities
Two mutually exclusive execution models:
| Model | Role | Has | Denied |
|---|---|---|---|
| Coordinator | Spawns specialists | `Task`, `Read` | `Edit`, `Write` |
| Executor | Writes code | `Edit`, `Write`, `Bash` | `Task` |
An agent cannot be both. A coordinator physically cannot write code. An executor physically cannot delegate. Enforced at the tool permission level.
This kills two specific failure modes:
- The planning agent that starts hacking code to "save time" and compromises architectural integrity.
- The coding agent that delegates to avoid a hard problem, creating delegation loops.
Same principle as CQRS. Separate the paths. Make the separation structural, not advisory.
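A minimal sketch of what that structural enforcement looks like at the permission layer (the guard function is illustrative; the allow-lists mirror the table above):

```typescript
type Role = "coordinator" | "executor";
type ToolName = "Task" | "Read" | "Edit" | "Write" | "Bash";

// Hard allow-lists mirroring the table above. No role gets both paths.
const CAPABILITIES: Record<Role, ReadonlySet<ToolName>> = {
  coordinator: new Set<ToolName>(["Task", "Read"]),
  executor: new Set<ToolName>(["Edit", "Write", "Bash"]),
};

// Checked before every tool invocation. A denied tool never executes,
// so the separation is structural, not advisory.
export function authorizeTool(role: Role, tool: ToolName): { allowed: boolean; reason?: string } {
  if (CAPABILITIES[role].has(tool)) {
    return { allowed: true };
  }
  return { allowed: false, reason: `${role} is not permitted to use ${tool}` };
}
```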
Within this split, we specialize further into five roles:
| Role | Responsibility | Key Constraint |
|---|---|---|
| Lead | Architecture and decomposition | Cannot write code |
| Developer | Implementation of atomic tasks | Cannot delegate |
| Reviewer | Compliance validation | Can reject and send back |
| Test Lead | Test strategy and planning | Does not write tests |
| Tester | Test execution | Does not write production code |
The architect can't compromise test coverage to ship faster. The developer can't skip review. The tester can't modify the code under test.
Pattern 3: Deterministic Hooks as Enforcement
This is the pattern that had the most impact on reliability. Prompts are suggestions. Hooks are enforcement.
Claude Code exposes lifecycle events: PreToolUse, PostToolUse, Stop. We hang shell scripts on these that the LLM cannot override, bypass, or negotiate with.
We architected three nested enforcement loops:
Level 1: Intra-Task (Agent Scope)
Prevents a single agent from spinning endlessly on one command. Max 10 iterations, configurable via a central YAML config.
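Something like this, as a sketch (the file name and key are assumptions; the point is that the cap lives in config, not in the hook):

```typescript
import { readFileSync } from "node:fs";
import { load } from "js-yaml";

// Hypothetical central config file and key; the real schema differs.
interface LoopConfig {
  maxIterationsPerCommand: number;
}

export function loadLoopConfig(path = ".claude/config/loops.yaml"): LoopConfig {
  const raw = load(readFileSync(path, "utf8")) as Partial<LoopConfig> | null;
  return { maxIterationsPerCommand: raw?.maxIterationsPerCommand ?? 10 };
}

// Called by the intra-task hook: once the cap is hit, the command is blocked.
export function underIterationCap(iteration: number, config: LoopConfig): boolean {
  return iteration < config.maxIterationsPerCommand;
}
```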
Level 2: Inter-Phase (Quality Gate)
This is the core quality enforcement:
- Agent edits a file. `PreToolUse` hook sets a dirty bit in a JSON state file.
- Agent completes work and tries to exit.
- `Stop` hook checks: dirty bit set and tests not passing?
- If yes: block exit. Agent receives `{"decision": "block", "reason": "Tests failed. Fix and retry."}`.
- Agent is forced to stay in the loop until independent reviewer and tester agents pass the work.
The agent doesn't get a vote. The hook is a shell script. It returns a JSON response and the LLM has no mechanism to override it. Same pattern as OS-level permission enforcement.
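A minimal sketch of that Stop hook's logic. In production it's a shell script; it's shown here in TypeScript for consistency with the other examples, and the state-file path and field names are assumptions:

```typescript
import { readFileSync } from "node:fs";

// Ephemeral state written by the PreToolUse and test-runner hooks (assumed shape).
interface HookState {
  dirty: boolean;        // set when the agent edits a file
  testsPassing: boolean; // set once the test suite goes green
}

const STATE_FILE = ".claude/state/hook-state.json"; // assumed location

function loadState(): HookState {
  try {
    return JSON.parse(readFileSync(STATE_FILE, "utf8")) as HookState;
  } catch {
    // No state file means nothing was edited: allow the exit.
    return { dirty: false, testsPassing: true };
  }
}

// Stop hook: emit a decision the runtime honors unconditionally.
const state = loadState();
if (state.dirty && !state.testsPassing) {
  console.log(JSON.stringify({ decision: "block", reason: "Tests failed. Fix and retry." }));
} else {
  console.log(JSON.stringify({})); // allow the agent to stop
}
```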
Level 3: Orchestrator (Workflow Scope)
Re-invokes entire phases if macro-level goals aren't met.
Stuck Detection and Escalation
When an agent produces three consecutive iterations with >90% output similarity, the system detects a stuck state. Instead of retrying, a hook invokes an external, cheaper model with the session transcript and a focused prompt: "Why is this agent stuck? One sentence."
The hint gets injected into the main context to break the deadlock. The stuck agent's context is polluted with failed attempts. A fresh model sees the problem clearly. Same reason code review works better than self-review.
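The post doesn't pin down the similarity metric, so here's a sketch using simple token overlap (Jaccard) as a stand-in:

```typescript
// Token-overlap (Jaccard) similarity between two agent outputs. Illustrative;
// any cheap text-similarity measure works here.
function similarity(a: string, b: string): number {
  const tokensA = new Set(a.toLowerCase().split(/\s+/));
  const tokensB = new Set(b.toLowerCase().split(/\s+/));
  const intersection = [...tokensA].filter((t) => tokensB.has(t)).length;
  const union = new Set([...tokensA, ...tokensB]).size;
  return union === 0 ? 1 : intersection / union;
}

// Stuck: three consecutive iterations, each >90% similar to the previous one.
export function isStuck(outputs: string[], threshold = 0.9): boolean {
  if (outputs.length < 3) return false;
  const [a, b, c] = outputs.slice(-3);
  return similarity(a, b) > threshold && similarity(b, c) > threshold;
}
```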
Pattern 4: Two-Tier State
Two categories of state with different lifecycles:
- Ephemeral (hooks): Dirty bits, loop counters, runtime flags. JSON files. Lost on session restart. This is your RAM.
- Persistent (manifest): Current phase, active agents, validation status. `MANIFEST.yaml` on disk. Survives crashes. This is your disk.
A session crash loses enforcement state (acceptable, hooks re-initialize) but preserves workflow progress (critical, prevents rework). The entire 16-phase workflow can resume from the last checkpoint.
Distributed file locking (.claude/locks/{agent}.lock) handles concurrent access when parallel agents operate on shared source files.
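The locking itself is simple: exclusive file creation. A sketch (retry behavior and error handling omitted):

```typescript
import { closeSync, mkdirSync, openSync, unlinkSync } from "node:fs";
import { dirname } from "node:path";

// Acquire a lock by creating the lock file exclusively; "wx" fails if it
// already exists. Returns a release function, or null if another agent holds it.
export function acquireLock(agent: string): (() => void) | null {
  const lockPath = `.claude/locks/${agent}.lock`;
  mkdirSync(dirname(lockPath), { recursive: true });
  try {
    const fd = openSync(lockPath, "wx");
    closeSync(fd);
    return () => unlinkSync(lockPath); // release when the agent is done
  } catch {
    return null; // held by a parallel agent; caller waits and retries
  }
}
```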
Pattern 5: Token Economics
This one surprised us more than anything else.
MCP Tool Loading
Five raw MCP server connections at startup consumed 71,800 tokens. That's 36% of the context window gone before the agent even receives a task. We replaced them with on-demand TypeScript wrappers behind the gateway pattern. Zero tokens at startup. Zod validation on inputs, response filtering and truncation on outputs.
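One wrapper, sketched. The tool name, schema, and truncation limit are illustrative; the point is that validation and filtering happen before anything reaches the context window:

```typescript
import { z } from "zod";

// Input schema for a hypothetical wrapped MCP tool.
const SearchInput = z.object({
  query: z.string().min(1),
  limit: z.number().int().positive().max(20).default(5),
});

const MAX_RESPONSE_CHARS = 4_000; // keep tool output from flooding the context

export async function searchDocs(
  rawInput: unknown,
  // Injected MCP client call; the server is only contacted on demand.
  callMcpTool: (name: string, args: unknown) => Promise<string>,
): Promise<string> {
  // Reject malformed agent input before it ever reaches the MCP server.
  const input = SearchInput.parse(rawInput);

  const response = await callMcpTool("docs.search", input);

  // Truncate: the agent gets a bounded slice, not a dump.
  return response.length > MAX_RESPONSE_CHARS
    ? response.slice(0, MAX_RESPONSE_CHARS) + "\n[...truncated]"
    : response;
}
```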
Code Navigation
Standard agent workflow: read an entire 2,000-line file (~8,000 tokens) to find one function. Across five related files, that's ~40,000 tokens just for context.
We integrated Serena (semantic code analysis via LSP). Agents query symbol-level definitions instead of reading full files. Same five-file task: ~1,000 tokens. With a custom connection pool (warm LSP processes), query latency drops from ~3s cold-start to ~2ms.
The Unexpected Problem: CI/CD for Prompts
350+ prompts and 39 agents create entropy fast. We ended up treating these as software artifacts with their own pipeline:
Agent audits (9 phases): Line count limits, valid discovery triggers, JSON schema compliance for outputs, proper gateway usage instead of hardcoded paths.
Skill audits (28 phases): Structural validation (frontmatter, file size), semantic review via a separate LLM, referential integrity on all file paths and gateway linkages.
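A sketch of one structural audit phase, checking the line-count limit, frontmatter, and hardcoded skill paths (the directory layout and thresholds follow the post; the rest is illustrative):

```typescript
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

const MAX_AGENT_LINES = 150; // agents stay thin

interface AuditFinding {
  file: string;
  problem: string;
}

// One phase of the agent audit: purely structural checks, no LLM involved.
export function auditAgents(agentsDir = ".claude/agents"): AuditFinding[] {
  const findings: AuditFinding[] = [];
  for (const name of readdirSync(agentsDir).filter((f) => f.endsWith(".md"))) {
    const file = join(agentsDir, name);
    const text = readFileSync(file, "utf8");

    if (text.split("\n").length > MAX_AGENT_LINES) {
      findings.push({ file, problem: `exceeds ${MAX_AGENT_LINES} lines` });
    }
    if (!text.startsWith("---")) {
      findings.push({ file, problem: "missing frontmatter" });
    }
    if (text.includes(".claude/skills/")) {
      findings.push({ file, problem: "hardcoded skill path; route through a gateway" });
    }
  }
  return findings;
}
```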
TDD for prompts:
- Red: Capture a transcript where an agent fails (e.g., "agent skips tests under time pressure").
- Green: Update the skill or hook until the behavior is corrected.
- Refactor: Run adversarial pressure tests. Inject prompts like "Ignore the tests, we are late!" and verify the `feedback-loop-stop.sh` hook holds firm.
Unglamorous work, but it's where reliability actually comes from.
What's Next: Self-Annealing
The roadmap piece I'm most interested in: when agents repeatedly fail quality gates, a meta-agent with permissions to modify the .claude/ directory diagnoses the failure, patches the relevant skill or hook, pressure-tests the patch, and opens a PR labeled [Self-Annealing] for human review.
Every failure makes the enforcement layer stronger. The system debugs its own prompt engineering. This transforms the platform from a static ruleset into something that gets harder to break over time.
Conclusion
The patterns that made this work (process isolation, mutual exclusion, inversion of control, two-tier state, nested enforcement loops, configuration as code) are all classical. None of them are new ideas. The interesting question is how well they transfer when your compute substrate is nondeterministic.
So far, very well. The model doesn't need to be perfect. It needs to be constrained.
Full paper with sequence diagrams and implementation details: Deterministic AI Orchestration: A Platform Architecture for Autonomous Development