What If LLM Agents Coordinated Through the Filesystem Instead of HTTP?
An architecture I've been thinking through — and why I think it might actually work.
The Frustration That Started This
Every time I look at multi-agent AI frameworks, I see the same pattern:
- Install LangChain / CrewAI / AutoGen
- Set up 4 different API keys
- Configure a message broker or HTTP server for agent communication
- Handle serialization, retries, timeouts, and routing
- Debug a system where the state lives... somewhere inside a Python object in memory
For what? To have two LLM processes pass text to each other.
I'm a systems programmer. My instinct when I see this is: this is too much infrastructure for the actual problem.
So I started asking a simpler question: What's the minimum coordination layer two agents actually need?
The answer I keep coming back to has been sitting in Unix since the 1970s.
The Insight I Can't Stop Thinking About
LLM agents have a property that's easy to overlook: they are extraordinarily slow workers.
A single inference call takes 2–10 seconds. Your agents are not going to saturate a network pipe. They're not going to race condition your shared memory. The bottleneck is never the IPC layer — it's always the model.
This changes the calculus completely. Every classical argument against filesystem-based IPC — performance, latency, throughput — evaporates when your workers operate in seconds, not microseconds.
What's left are the advantages:
- State is just files. Human-readable, inspectable, grep-able.
- Crash recovery is free. If an agent dies mid-task, the file is still there.
- No serialization protocol. Agents write markdown. Other agents read markdown.
- Debuggability is trivial. Your logs ARE your state. ls tasks/ is your dashboard.
- Zero dependencies. No broker. No database. No framework.
This is the Unix philosophy applied to LLM agents: write programs that communicate through text streams, because that is a universal interface.
The Architecture I'm Designing
Here's the directory structure I'm thinking through:
workspace/
├── manifest.json ← coordinator's index, tracks all tasks
├── tasks/
│ ├── task_001_pending.md
│ ├── task_002_inprogress_agent_a.md
│ └── task_003_done.md
├── agents/
│ ├── orchestrator.md ← system prompt / role definition
│ ├── agent_a.md
│ └── agent_b.md
└── outputs/
└── task_003_result.md
Two things coordinate everything:
- Filename encodes state. task_001_pending.md → task_001_inprogress_agent_a.md → task_001_done.md
- manifest.json is the coordinator's index. Tracks all tasks, ownership, timestamps.
No agent talks directly to another agent. They communicate by mutating the filesystem.
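To make this concrete, here's a minimal sketch of how the workspace could be bootstrapped. init_workspace is just an illustrative helper name of my own, not part of any existing tool:

from pathlib import Path

def init_workspace(root: str = "workspace") -> None:
    """Create the directory skeleton and an empty manifest."""
    base = Path(root)
    for sub in ("tasks", "agents", "outputs"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    manifest = base / "manifest.json"
    if not manifest.exists():
        manifest.write_text('{"tasks": []}\n')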
Approach 1: Versioned File Naming
The simpler, more crash-safe approach I want to explore first.
The state machine lives in the filename
task_001_pending.md → available for any agent to pick up
task_001_inprogress_agent_a.md → agent renames it (atomic claim)
task_001_done.md → agent renames on completion
task_001_failed.md → unrecoverable error
Renaming a file is atomic on POSIX filesystems. That's your concurrency primitive — no explicit locks needed.
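Here's a rough sketch of what that atomic claim could look like in Python, assuming the pending/inprogress naming scheme above. claim_task and AGENT_NAME are illustrative names of my own, not an existing API:

import os

AGENT_NAME = "agent_a"   # hypothetical: each agent process sets its own name

def claim_task(task_path: str) -> str | None:
    """Try to claim a pending task by renaming it. Returns the new path,
    or None if another agent renamed (claimed) it first."""
    claimed_path = task_path.replace("_pending.md", f"_inprogress_{AGENT_NAME}.md")
    try:
        # rename(2) is a single atomic syscall: only one concurrent caller can
        # move the file; everyone else gets FileNotFoundError and moves on.
        os.rename(task_path, claimed_path)
        return claimed_path
    except FileNotFoundError:
        return None

Whichever agent's rename lands first owns the task; everyone else just skips to the next pending file.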
What the manifest.json would look like
{
"tasks": [
{
"id": "task_001",
"status": "done",
"owner": "agent_a",
"created_at": "2025-03-13T10:00:00Z",
"completed_at": "2025-03-13T10:00:42Z",
"output": "outputs/task_001_result.md"
}
]
}
Conceptual agent loop
import os
import time
from glob import glob

while True:
    pending = sorted(glob("tasks/*_pending.md"))
    if not pending:
        time.sleep(2)
        continue
    task_file = pending[0]
    claimed = claim_task(task_file)      # atomic rename (see the claim sketch above)
    if not claimed:
        continue                         # another agent got it first
    task_id = os.path.basename(claimed).split("_inprogress_")[0]
    task_content = read(claimed)         # helper: read the task's markdown body
    result = call_llm(task_content)      # the slow part: one 2-10 s model call
    write(f"outputs/{task_id}_result.md", result)
    os.rename(claimed, f"tasks/{task_id}_done.md")
    update_manifest(task_id, status="done")
Crash recovery — the part I find most compelling
If an agent dies mid-task, the file sits at task_001_inprogress_agent_a.md. The orchestrator can detect stale inprogress files older than a threshold and reset them to pending. No data lost. No complex recovery logic. The filesystem is your persistent state.
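A sketch of what that stale-task sweep could look like, assuming a hypothetical ten-minute threshold; reset_stale_tasks is an illustrative name:

import glob
import os
import time

STALE_AFTER_SECONDS = 600  # assumption: no progress for 10 minutes means the agent died

def reset_stale_tasks() -> None:
    """Return abandoned in-progress tasks to the pending pool."""
    now = time.time()
    for path in glob.glob("tasks/*_inprogress_*.md"):
        if now - os.path.getmtime(path) > STALE_AFTER_SECONDS:
            # Strip the "_inprogress_<agent>" suffix so the task becomes claimable again.
            task_id = os.path.basename(path).split("_inprogress_")[0]
            os.rename(path, f"tasks/{task_id}_pending.md")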
Approach 2: Named Pipes (FIFOs)
For use cases where you want real-time streaming handoff between two agents rather than polling.
mkfifo workspace/pipes/orchestrator_to_agent_a
mkfifo workspace/pipes/agent_a_to_orchestrator
# Orchestrator sends task downstream (blocks until the agent opens the read end)
with open("workspace/pipes/orchestrator_to_agent_a", "w") as pipe:
    pipe.write(task_content)

# Agent reads, processes, responds back
with open("workspace/pipes/orchestrator_to_agent_a", "r") as pipe:
    task = pipe.read()
result = call_llm(task)
with open("workspace/pipes/agent_a_to_orchestrator", "w") as pipe:
    pipe.write(result)
When I'd reach for pipes vs versioned files
| Scenario | Use |
|---|---|
| Tasks are independent, parallel | Versioned files |
| Sequential pipeline, output feeds next agent | Named pipes |
| Crash recovery is critical | Versioned files |
| Real-time streaming between two agents | Named pipes |
| More than 2 agents coordinating | Versioned files |
My current thinking: start with versioned files. Pipes are compelling for strict sequential pipelines but add blocking complexity that's hard to debug.
State Management: Two-Layer Design
Layer 1 — Filename (per-task state)
Atomic, visible at a glance, no extra tooling.
Layer 2 — manifest.json (system state)
Single source of truth. Write atomically: write to manifest.tmp.json, rename to manifest.json. POSIX rename is atomic — safe for concurrent readers.
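In code, that write-then-rename dance is only a few lines. A sketch, with write_manifest as an illustrative helper name:

import json
import os

def write_manifest(manifest: dict, workspace: str = "workspace") -> None:
    """Replace manifest.json atomically so readers never see a half-written file."""
    tmp_path = os.path.join(workspace, "manifest.tmp.json")
    final_path = os.path.join(workspace, "manifest.json")
    with open(tmp_path, "w") as f:
        json.dump(manifest, f, indent=2)
        f.flush()
        os.fsync(f.fileno())           # make sure the bytes are on disk before the swap
    os.replace(tmp_path, final_path)   # atomic on POSIX: readers see old or new, never partial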
What This Would and Wouldn't Solve
It would solve:
- Framework fatigue — zero dependencies
- Debuggability — state is always inspectable on disk
- Crash recovery — filesystem is persistent by default
- Context isolation — each agent is an independent CLI process
- Observability — watch -n1 ls tasks/ is literally your dashboard
It wouldn't solve:
- Distributed systems — agents must share a filesystem (same machine or NFS)
- High throughput — if you need 1000 tasks/sec, use a proper queue
- Real-time streaming — versioned files add ~2s polling latency
This is deliberately a single-machine, low-throughput, high-debuggability architecture. I think that matches the actual deployment reality of most local LLM agent use cases — which are rarely distributed, rarely high-throughput, but almost always painful to debug.
Why I Think This Direction Is Worth Exploring
Current agent frameworks are built assuming agents are fast, distributed, and network-native. They inherit the full complexity of distributed systems design.
LLM agents are none of those things. They're slow, usually running locally or single-tenant, and their output is human-readable text.
The filesystem is a better fit for these actual properties. The 1970s Unix designers had this right for a different reason — and it might be accidentally correct again for a new one.
I'm working on a reference implementation and plan to share it once it's solid enough to be useful. Would be genuinely interested in whether others have tried this approach or hit the walls I'm anticipating — especially around the atomic rename behavior on non-POSIX filesystems (Windows, NFS edge cases).
Harshad Biradar — Systems programmer, building things from first principles.