What If LLM Agents Coordinated Through the Filesystem Instead of HTTP?
An architecture I've been thinking through — and why I think it might actually work.
The Frustration That Started This
Every time I look at multi-agent AI frameworks, I see the same pattern:
- Install LangChain / CrewAI / AutoGen
- Set up 4 different API keys
- Configure a message broker or HTTP server for agent communication
- Handle serialization, retries, timeouts, and routing
- Debug a system where the state lives... somewhere inside a Python object in memory
For what? To have two LLM processes pass text to each other.
I'm a systems programmer. My instinct when I see this is: this is too much infrastructure for the actual problem.
So I started asking a simpler question: What's the minimum coordination layer two agents actually need?
The answer I keep coming back to has been sitting in Unix since the 1970s.
The Insight I Can't Stop Thinking About
LLM agents have a property that's easy to overlook: they are extraordinarily slow workers.
A single inference call takes 2–10 seconds. Your agents are not going to saturate a network pipe. They're not going to race condition your shared memory. The bottleneck is never the IPC layer — it's always the model.
This changes the calculus completely. Every classical argument against filesystem-based IPC — performance, latency, throughput — evaporates when your workers operate in seconds, not microseconds.
What's left are the advantages:
- State is just files. Human-readable, inspectable, grep-able.
- Crash recovery is free. If an agent dies mid-task, the file is still there.
- No serialization protocol. Agents write markdown. Other agents read markdown.
- Debuggability is trivial. Your logs ARE your state. ls tasks/ is your dashboard.
- Zero dependencies. No broker. No database. No framework.
This is the Unix philosophy applied to LLM agents: write programs that communicate through text streams, because that is a universal interface.
The Architecture I'm Designing
Here's the directory structure I'm thinking through:
workspace/
├── manifest.json ← coordinator's index, tracks all tasks
├── tasks/
│ ├── task_001_pending.md
│ ├── task_002_inprogress_agent_a.md
│ └── task_003_done.md
├── agents/
│ ├── orchestrator.md ← system prompt / role definition
│ ├── agent_a.md
│ └── agent_b.md
└── outputs/
└── task_003_result.md
Two things coordinate everything:
- Filename encodes state. task_001_pending.md → task_001_inprogress_agent_a.md → task_001_done.md
- manifest.json is the coordinator's index. Tracks all tasks, ownership, timestamps.
No agent talks directly to another agent. They communicate by mutating the filesystem.
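To make this concrete, here's a minimal sketch of how the workspace could be bootstrapped. init_workspace is just an illustrative helper name of my own, not part of any existing tool:

from pathlib import Path

def init_workspace(root: str = "workspace") -> None:
    """Create the directory skeleton and an empty manifest."""
    base = Path(root)
    for sub in ("tasks", "agents", "outputs"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    manifest = base / "manifest.json"
    if not manifest.exists():
        manifest.write_text('{"tasks": []}\n')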
Approach 1: Versioned File Naming
The simpler, more crash-safe approach I want to explore first.
The state machine lives in the filename
task_001_pending.md → available for any agent to pick up
task_001_inprogress_agent_a.md → agent renames it (atomic claim)
task_001_done.md → agent renames on completion
task_001_failed.md → unrecoverable error
Renaming a file is atomic on POSIX filesystems. That's your concurrency primitive — no explicit locks needed.
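Here's a rough sketch of what that atomic claim could look like in Python, assuming the pending/inprogress naming scheme above. claim_task and AGENT_NAME are illustrative names of my own, not an existing API:

import os

AGENT_NAME = "agent_a"   # hypothetical: each agent process sets its own name

def claim_task(task_path: str) -> str | None:
    """Try to claim a pending task by renaming it. Returns the new path,
    or None if another agent renamed (claimed) it first."""
    claimed_path = task_path.replace("_pending.md", f"_inprogress_{AGENT_NAME}.md")
    try:
        # rename(2) is a single atomic syscall: only one concurrent caller can
        # move the file; everyone else gets FileNotFoundError and moves on.
        os.rename(task_path, claimed_path)
        return claimed_path
    except FileNotFoundError:
        return None

Whichever agent's rename lands first owns the task; everyone else just skips to the next pending file.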
What the manifest.json would look like
{
"tasks": [
{
"id": "task_001",
"status": "done",
"owner": "agent_a",
"created_at": "2025-03-13T10:00:00Z",
"completed_at": "2025-03-13T10:00:42Z",
"output": "outputs/task_001_result.md"
}
]
}
Conceptual agent loop
import os
import time
from glob import glob

while True:
    pending = sorted(glob("tasks/*_pending.md"))
    if not pending:
        time.sleep(2)
        continue
    task_file = pending[0]
    claimed = claim_task(task_file)      # atomic rename (see the claim sketch above)
    if not claimed:
        continue                         # another agent got it first
    task_id = os.path.basename(claimed).split("_inprogress_")[0]
    task_content = read(claimed)         # helper: read the task's markdown body
    result = call_llm(task_content)      # the slow part: one 2-10 s model call
    write(f"outputs/{task_id}_result.md", result)
    os.rename(claimed, f"tasks/{task_id}_done.md")
    update_manifest(task_id, status="done")
Crash recovery — the part I find most compelling
If an agent dies mid-task, the file sits at task_001_inprogress_agent_a.md. The orchestrator can detect stale inprogress files older than a threshold and reset them to pending. No data lost. No complex recovery logic. The filesystem is your persistent state.
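A sketch of what that stale-task sweep could look like, assuming a hypothetical ten-minute threshold; reset_stale_tasks is an illustrative name:

import glob
import os
import time

STALE_AFTER_SECONDS = 600  # assumption: no progress for 10 minutes means the agent died

def reset_stale_tasks() -> None:
    """Return abandoned in-progress tasks to the pending pool."""
    now = time.time()
    for path in glob.glob("tasks/*_inprogress_*.md"):
        if now - os.path.getmtime(path) > STALE_AFTER_SECONDS:
            # Strip the "_inprogress_<agent>" suffix so the task becomes claimable again.
            task_id = os.path.basename(path).split("_inprogress_")[0]
            os.rename(path, f"tasks/{task_id}_pending.md")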
Approach 2: Named Pipes (FIFOs)
For use cases where you want real-time streaming handoff between two agents rather than polling.
mkfifo workspace/pipes/orchestrator_to_agent_a
mkfifo workspace/pipes/agent_a_to_orchestrator
# Orchestrator sends task downstream (blocks until the agent opens the read end)
with open("workspace/pipes/orchestrator_to_agent_a", "w") as pipe:
    pipe.write(task_content)

# Agent reads, processes, responds back
with open("workspace/pipes/orchestrator_to_agent_a", "r") as pipe:
    task = pipe.read()
result = call_llm(task)
with open("workspace/pipes/agent_a_to_orchestrator", "w") as pipe:
    pipe.write(result)
When I'd reach for pipes vs versioned files
| Scenario | Use |
|---|---|
| Tasks are independent, parallel | Versioned files |
| Sequential pipeline, output feeds next agent | Named pipes |
| Crash recovery is critical | Versioned files |
| Real-time streaming between two agents | Named pipes |
| More than 2 agents coordinating | Versioned files |
My current thinking: start with versioned files. Pipes are compelling for strict sequential pipelines but add blocking complexity that's hard to debug.
State Management: Two-Layer Design
Layer 1 — Filename (per-task state)
Atomic, visible at a glance, no extra tooling.
Layer 2 — manifest.json (system state)
Single source of truth. Write atomically: write to manifest.tmp.json, rename to manifest.json. POSIX rename is atomic — safe for concurrent readers.
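In code, that write-then-rename dance is only a few lines. A sketch, with write_manifest as an illustrative helper name:

import json
import os

def write_manifest(manifest: dict, workspace: str = "workspace") -> None:
    """Replace manifest.json atomically so readers never see a half-written file."""
    tmp_path = os.path.join(workspace, "manifest.tmp.json")
    final_path = os.path.join(workspace, "manifest.json")
    with open(tmp_path, "w") as f:
        json.dump(manifest, f, indent=2)
        f.flush()
        os.fsync(f.fileno())           # make sure the bytes are on disk before the swap
    os.replace(tmp_path, final_path)   # atomic on POSIX: readers see old or new, never partial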
What This Would and Wouldn't Solve
It would solve:
- Framework fatigue — zero dependencies
- Debuggability — state is always inspectable on disk
- Crash recovery — filesystem is persistent by default
- Context isolation — each agent is an independent CLI process
- Observability — watch -n1 ls tasks/ is literally your dashboard
It wouldn't solve:
- Distributed systems — agents must share a filesystem (same machine or NFS)
- High throughput — if you need 1000 tasks/sec, use a proper queue
- Real-time streaming — versioned files add ~2s polling latency
This is deliberately a single-machine, low-throughput, high-debuggability architecture. I think that matches the actual deployment reality of most local LLM agent use cases — which are rarely distributed, rarely high-throughput, but almost always painful to debug.
Why I Think This Direction Is Worth Exploring
Current agent frameworks are built assuming agents are fast, distributed, and network-native. They inherit the full complexity of distributed systems design.
LLM agents are none of those things. They're slow, usually running locally or single-tenant, and their output is human-readable text.
The filesystem is a better fit for these actual properties. The 1970s Unix designers had this right for a different reason — and it might be accidentally correct again for a new one.
I'm working on a reference implementation and plan to share it once it's solid enough to be useful. Would be genuinely interested in whether others have tried this approach or hit the walls I'm anticipating — especially around the atomic rename behavior on non-POSIX filesystems (Windows, NFS edge cases).
Harshad Biradar — Systems programmer, building things from first principles.