Every week, a new AI agent framework launches. LangChain, CrewAI, AutoGen, Magentic-One — the list grows faster than anyone can evaluate.
They all solve the same problem: how do you make an LLM do multi-step tasks? Chain some prompts, give it tools, add memory. Ship it.
But none of them answer the question that actually matters in production: what happens when your agent crashes at 3am?
I run 8 AI agents that manage my solo company — CEO, CFO, COO, Marketing, Accountant, Lawyer, CTO, and an Improver that upgrades the others. They share a persistent knowledge graph, consult each other automatically, and post content to social media while I sleep.
They crash. Regularly.
Why AI Agents Aren't Containers
Here's the core problem most frameworks ignore: AI agents are deeply stateful.
A web server is (mostly) stateless. Kill the container, spin up a new one from the same image. No data lost. Kubernetes was designed for exactly this pattern.
AI agents are different:
- Context accumulates — an agent mid-task holds a conversation history, tool call results, intermediate reasoning. Lose that, and it starts over from scratch.
- Failures are semantic, not just process failures — "the agent entered an infinite loop and burned $50 in API tokens" is different from "the container OOM-killed." You need supervision that understands what went wrong, not just that something stopped.
- Coordination requires state — agents that collaborate share context, delegate subtasks, track who's done what. Kill one, and the others are left with stale references.
- Costs are real — every crashed-and-restarted agent potentially re-runs expensive LLM calls. Crash recovery isn't just about uptime. It's about not burning money.
Most frameworks deal with this by... not dealing with it. They assume the happy path. If something fails, you restart the whole script manually.
That works for demos. It doesn't work when your agent is supposed to post a tweet at 14:00 UTC every day, rain or shine.
Erlang Solved This in 1986
In 1986, Joe Armstrong and the Ericsson team had a problem: build telephone switches that handle millions of concurrent calls with 99.999% uptime. That's 5.26 minutes of downtime per year.
Their solution: don't prevent crashes. Expect them and recover automatically.
This led to OTP (Open Telecom Platform) and its killer feature: supervision trees.
The core idea is simple:
- Every process has a supervisor — a parent process whose only job is watching children
- When a child crashes, the supervisor restarts it according to a defined strategy
- Supervisors can supervise other supervisors — creating a tree of fault tolerance
- The restart happens in microseconds, not seconds
Here's what a basic agent supervisor looks like in Elixir:
```elixir
defmodule AgentSupervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def init(_opts) do
    children = [
      {AgentWorker, id: :ceo, role: :strategy, model: :claude_sonnet},
      {AgentWorker, id: :marketing, role: :content, model: :claude_sonnet},
      {AgentWorker, id: :accountant, role: :tax, model: :claude_haiku},
      {MemoryServer, path: "memory.jsonl"},
      {SchedulerWorker, interval: :timer.minutes(5)}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```
Three restart strategies cover every failure pattern:
- `:one_for_one` — only restart the crashed process. Perfect for independent agents.
- `:one_for_all` — restart everything if one crashes. Use when tightly coupled agent teams have shared state where partial state is worse than a full restart.
- `:rest_for_one` — restart the crashed process and everything started after it. Useful when later agents depend on earlier ones.
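To make the three strategies concrete outside of OTP, here is a minimal sketch of the same policies in plain JavaScript. The `Supervisor` class, its `handleCrash` hook, and the child-spec shape are all illustrative, not an Erlang API:

```javascript
// Each child spec is { id, start }, where start() returns a fresh "process".
// handleCrash(id) applies the restart strategy when a child is detected dead.
class Supervisor {
  constructor(childSpecs, strategy = "one_for_one") {
    this.childSpecs = childSpecs;
    this.strategy = strategy;
    this.children = new Map();
    for (const spec of childSpecs) this.children.set(spec.id, spec.start());
  }

  handleCrash(id) {
    if (this.strategy === "one_for_one") {
      // Restart only the crashed child.
      const spec = this.childSpecs.find(s => s.id === id);
      this.children.set(id, spec.start());
    } else if (this.strategy === "one_for_all") {
      // Restart every child from its original spec.
      for (const spec of this.childSpecs) {
        this.children.set(spec.id, spec.start());
      }
    } else if (this.strategy === "rest_for_one") {
      // Restart the crashed child and every child started after it.
      const idx = this.childSpecs.findIndex(s => s.id === id);
      for (const spec of this.childSpecs.slice(idx)) {
        this.children.set(spec.id, spec.start());
      }
    }
  }
}
```

In real OTP the restart also honors intensity limits (max restarts per time window) so a permanently broken child escalates to its parent supervisor instead of restarting forever; this sketch omits that.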
What This Looks Like in Practice
Here's a real scenario from my system. My agents share a persistent knowledge graph stored as a JSONL file — one JSON object per line, each representing an entity or relation. Eight agents read and write to this file through a Model Context Protocol (MCP) memory server. Every strategic decision, client pipeline update, prompt run timestamp, and lesson learned goes here.
The race condition I hit was textbook. When multiple agents fire parallel tool calls — say, `create_entities` and `create_relations` in the same batch — both operations would:
- Read the entire JSONL file into memory
- Parse every line into an in-memory graph
- Append their new entities/relations
- Serialize the full graph back to disk
Step 4 is the problem. Both operations read the same file state. Both write back the full graph plus their additions. The second write obliterates the first's additions entirely. No error, no warning — data just vanishes.
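The lost update is easy to reproduce. Here is a toy version in JavaScript, with the "file" as an in-memory string and the async gap between read and write simulated; the `unsafeAdd` name is mine:

```javascript
// Shared "file" holding the serialized graph.
let file = JSON.stringify({ entities: [] });

async function unsafeAdd(name) {
  const graph = JSON.parse(file);  // 1–2. read and parse the whole file
  await Promise.resolve();         // simulated async boundary (real code awaits fs I/O)
  graph.entities.push({ name });   // 3. append the new entity
  file = JSON.stringify(graph);    // 4. serialize the FULL graph back
}

async function demo() {
  // Two parallel tool calls, like create_entities + create_relations in one batch.
  await Promise.all([unsafeAdd("ceo"), unsafeAdd("cfo")]);
  return JSON.parse(file).entities.map(e => e.name);
}
```

Both calls read the empty graph before either writes, so only one name survives — the second write silently erases the first, with no error raised anywhere.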
In a typical framework, this would mean:
- Agent tries to read memory → gets a JSON parse error (if a write was interrupted mid-line)
- Agent crashes or returns garbage
- I wake up, see broken output, manually debug the JSONL file
- Fix the file, restart everything
- Repeat next time it happens
With supervision trees:
- Memory server process detects corruption on load
- Process crashes — intentionally. In Erlang, crashing is a feature, not a bug.
- Supervisor restarts the memory server in microseconds
- On restart, the init callback runs auto-repair: wraps each `JSON.parse` in a try/catch, skips corrupt lines, deduplicates entities by name and relations by `from|type|to` key
- Agents resume with clean data
- I'm asleep. Everything just works.
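Here is roughly what such an auto-repair loader can look like, assuming a JSONL format where each line is an entity with a unique name or a relation with a from/type/to triple. The function and field names are my assumptions for illustration, not the MCP memory server's actual API:

```javascript
// Parse a JSONL string defensively: skip corrupt lines, collapse duplicates.
function loadGraph(jsonl) {
  const entities = new Map();   // name -> entity (last write wins)
  const relations = new Map();  // "from|type|to" -> relation
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    let item;
    try {
      item = JSON.parse(line);  // a crash mid-write leaves a truncated line
    } catch {
      console.warn("skipping corrupt line:", line.slice(0, 40));
      continue;                 // skip it instead of crashing the whole load
    }
    if (item.type === "entity") {
      entities.set(item.name, item);
    } else if (item.type === "relation") {
      relations.set(`${item.from}|${item.relationType}|${item.to}`, item);
    }
  }
  return {
    entities: [...entities.values()],
    relations: [...relations.values()],
  };
}
```

The important property is that a single bad line costs you one record, not the entire graph.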
The fix I implemented to address the root cause: a local fork of the MCP memory server with three additions:
- Async mutex — a queue-based lock that serializes all write operations. When one `saveGraph()` is running, subsequent calls wait their turn. This eliminates the read-modify-write race entirely.
- Atomic writes — every save writes to a `.tmp` file first, then renames it over the original. A crash mid-write gives you either the old complete file or the new complete file — never a half-written mess.
- Auto-repair on load — the graph loader wraps each line's `JSON.parse` in a try/catch. Corrupt lines get skipped with a warning. Duplicate entities (same name) and duplicate relations (same from/type/to triple) are collapsed.
Here's roughly what the mutex pattern looks like:
```javascript
class Mutex {
  constructor() { this._queue = []; this._locked = false; }

  async acquire() {
    return new Promise(resolve => {
      if (!this._locked) { this._locked = true; resolve(); }
      else { this._queue.push(resolve); }
    });
  }

  release() {
    if (this._queue.length > 0) this._queue.shift()();
    else this._locked = false;
  }
}

// Every mutating operation goes through the lock:
async createEntities(entities) {
  await this.mutex.acquire();
  try {
    const graph = await this.loadGraph();  // read
    graph.entities.push(...entities);      // modify
    await this.saveGraph(graph);           // write (atomic)
  } finally { this.mutex.release(); }
}
```
This is exactly the kind of infrastructure problem that disappears on the BEAM. Erlang processes don't share memory. Each process has its own heap. There's no concurrent write to the same file because the memory server is a single GenServer processing messages sequentially from its mailbox — mutual exclusion is built into the execution model, not bolted on with a mutex.
The key insight: the supervision tree doesn't prevent the bug. It makes the bug survivable. The corrupt write still happens occasionally (on the JavaScript version — the BEAM version wouldn't have this class of bug at all), but the system recovers before anyone notices.
Each Process Is an Island
Processes on the BEAM (Erlang's virtual machine) have properties that map perfectly to AI agents:
- Isolation — each process has its own heap memory. A crash in one can't corrupt another. Your Marketing agent going haywire can't touch the Accountant's tax calculations.
- Lightweight — each process is ~2KB. You can run hundreds of thousands on a single machine. An 8-agent system with tool workers, a memory server, and a scheduler process would fit comfortably on a machine with 256MB RAM.
- Preemptive scheduling — the BEAM VM enforces fair CPU sharing. One agent stuck in an expensive computation can't starve the others. Every agent gets its turn.
- Message passing — agents communicate by sending immutable messages. No shared mutable state, no locks, no race conditions (except at I/O boundaries, which is where the mutex comes in).
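For contrast, the sequential-mailbox idea is small enough to sketch in JavaScript: a toy `Mailbox` (my name, not a real GenServer) that queues messages and handles them strictly one at a time, which is what gives a single-process memory server mutual exclusion for free:

```javascript
// Messages are handled sequentially by one async handler, never concurrently.
class Mailbox {
  constructor(handler) {
    this.handler = handler;  // async function processing one message
    this.queue = [];
    this.running = false;
  }

  send(msg) {
    this.queue.push(msg);
    if (!this.running) this._drain();
  }

  async _drain() {
    this.running = true;
    while (this.queue.length > 0) {
      await this.handler(this.queue.shift()); // one message at a time
    }
    this.running = false;
  }
}
```

Note that this still lives inside one shared heap; the BEAM adds per-process isolation on top, so a crashing handler cannot corrupt its neighbors.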
Compare this to running AI agents as Python threads or async tasks. One unhandled exception can take down the entire process. One memory leak slowly poisons the whole system. One blocking call freezes everything.
My current system runs on Node.js with a hand-rolled mutex and atomic file writes to paper over exactly these problems. It works — 91% scheduler success rate, auto-repairing memory, months of uptime. But every fix is fighting the runtime instead of working with it. On the BEAM, process isolation and sequential mailbox processing eliminate entire categories of bugs before you write a line of application code.
Why This Matters Now
AI agents are moving from demos to production. And production means:
- Agents that run 24/7, not just during a demo
- Real money flowing through API calls ($0.01 per prompt adds up quick when an agent loops)
- Users depending on outputs — posts that need to go out, invoices that need to be generated, compliance deadlines that can't be missed
- Multiple agents coordinating, where one failure cascades if not contained
The industry is rediscovering problems that telecom solved decades ago. Ericsson's AXD 301 switch achieved 99.9999999% uptime — nine nines — using these exact patterns. Not because the hardware never failed, but because the software expected failure and recovered faster than users noticed.
Your AI agent doesn't need nine nines. But it does need to survive a 3am crash without you waking up to fix it.
The Counterargument
"But I'm not going to rewrite my Python agent in Elixir."
Fair. And you don't have to. The supervision tree pattern is more important than the language:
- Wrap agents in health-check loops that detect hangs and kill them
- Checkpoint state regularly so a restart doesn't lose everything
- Set budget caps that pause agents before they burn your API credits
- Monitor semantically — is the agent making progress, or is it looping?
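Those four mitigations can live in one small wrapper regardless of language. Here is a hedged sketch in JavaScript; the `superviseAgent` function, the agent interface (`done`, `step`, `checkpoint`), and the thresholds are all illustrative assumptions, not a real framework API:

```javascript
// Run an agent step-by-step with a spend cap, checkpoints, and a
// semantic stall detector ("no progress for N steps" = looping).
async function superviseAgent(agent, { maxSpendUsd = 5, maxStalledSteps = 3 } = {}) {
  let spend = 0;
  let stalled = 0;
  let lastProgress = null;

  while (!agent.done()) {
    const step = await agent.step();  // one tool call / LLM turn
    spend += step.costUsd;
    await agent.checkpoint();         // a restart won't lose accumulated context

    // Semantic monitoring: is the agent actually getting anywhere?
    stalled = step.progressMarker === lastProgress ? stalled + 1 : 0;
    lastProgress = step.progressMarker;

    if (spend >= maxSpendUsd) return { status: "budget_cap", spend };
    if (stalled >= maxStalledSteps) return { status: "looping", spend };
  }
  return { status: "done", spend };
}
```

The `progressMarker` could be anything cheap to compare: the current subtask id, a hash of the last tool result, or the length of the output so far.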
But if you're choosing a foundation for a new agent system — especially one that needs to run multiple coordinating agents reliably — I'd argue the BEAM gives you a 40-year head start. These patterns aren't libraries you install. They're built into the runtime.
What I'd Build Next
If I were starting a new AI agent platform from scratch today:
- Process-per-agent with OTP supervisors
- State checkpointing to PostgreSQL on every tool call
- Per-agent spend tracking with configurable budget caps
- PubSub for inter-agent messaging — no external message queue needed
- Telemetry hooks for observability (OpenTelemetry + Sentry)
This is roughly what I'm building with OpenClaw Cloud, and it's why I chose Elixir for the stack. Not because Elixir is trendy, but because the problem — running many stateful, failure-prone, communicating processes — is literally what the BEAM was designed for.
I'm João, a solo developer from Portugal building SaaS with Elixir and Phoenix. I recently wrote about running a solo company with AI agent departments — this article is the technical deep-dive on why that system stays reliable. Find me on X (@joaosetas).