DEV Community

HARSHAD BIRADAR


What If LLM Agents Coordinated Through the Filesystem Instead of HTTP?


An architecture I've been thinking through — and why I think it might actually work.


The Frustration That Started This

Every time I look at multi-agent AI frameworks, I see the same pattern:

  • Install LangChain / CrewAI / AutoGen
  • Set up 4 different API keys
  • Configure a message broker or HTTP server for agent communication
  • Handle serialization, retries, timeouts, and routing
  • Debug a system where the state lives... somewhere inside a Python object in memory

For what? To have two LLM processes pass text to each other.

I'm a systems programmer. My instinct when I see this is: this is too much infrastructure for the actual problem.

So I started asking a simpler question: What's the minimum coordination layer two agents actually need?

The answer I keep coming back to has been sitting in Unix since the 1970s.


The Insight I Can't Stop Thinking About

LLM agents have a property that's easy to overlook: they are extraordinarily slow workers.

A single inference call takes 2–10 seconds. Your agents are not going to saturate a network pipe. They're not going to race condition your shared memory. The bottleneck is never the IPC layer — it's always the model.

This changes the calculus completely. Every classical argument against filesystem-based IPC — performance, latency, throughput — evaporates when your workers operate in seconds, not microseconds.

What's left are the advantages:

  • State is just files. Human-readable, inspectable, grep-able.
  • Crash recovery is free. If an agent dies mid-task, the file is still there.
  • No serialization protocol. Agents write markdown. Other agents read markdown.
  • Debuggability is trivial. Your logs ARE your state. ls tasks/ is your dashboard.
  • Zero dependencies. No broker. No database. No framework.

This is the Unix philosophy applied to LLM agents: write programs that communicate through text streams, because that is a universal interface.


The Architecture I'm Designing

Here's the directory structure I'm thinking through:

workspace/
├── manifest.json              ← coordinator's index, tracks all tasks
├── tasks/
│   ├── task_001_pending.md
│   ├── task_002_inprogress_agent_a.md
│   └── task_003_done.md
├── agents/
│   ├── orchestrator.md        ← system prompt / role definition
│   ├── agent_a.md
│   └── agent_b.md
└── outputs/
    └── task_003_result.md

Two things coordinate everything:

  1. Filename encodes state. task_001_pending.md → task_001_inprogress_agent_a.md → task_001_done.md
  2. manifest.json is the coordinator's index. Tracks all tasks, ownership, timestamps.

No agent talks directly to another agent. They communicate by mutating the filesystem.
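Bootstrapping that layout is small enough to sketch. Here's a minimal Python version (the `init_workspace` name and the empty-manifest shape are my assumptions, not a fixed API):

```python
import json
import os

def init_workspace(root="workspace"):
    """Create the workspace layout and an empty manifest.

    init_workspace is a hypothetical helper; the layout mirrors
    the tree above (tasks/, agents/, outputs/, manifest.json).
    """
    for sub in ("tasks", "agents", "outputs"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    manifest = os.path.join(root, "manifest.json")
    if not os.path.exists(manifest):
        with open(manifest, "w") as f:
            json.dump({"tasks": []}, f, indent=2)
```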


Approach 1: Versioned File Naming

The simpler, more crash-safe approach I want to explore first.

The state machine lives in the filename

task_001_pending.md              → available for any agent to pick up
task_001_inprogress_agent_a.md   → agent renames it (atomic claim)
task_001_done.md                 → agent renames on completion
task_001_failed.md               → unrecoverable error

Renaming a file is atomic on POSIX filesystems, as long as source and destination live on the same filesystem. That's your concurrency primitive — no explicit locks needed.
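The claim itself is one rename. A sketch of what that primitive might look like (`claim_task` is a hypothetical helper name; if two agents race for the same file, exactly one rename succeeds and the loser gets FileNotFoundError):

```python
import os

def claim_task(task_file, agent_name):
    """Atomically claim a pending task by renaming it.

    Returns the new path on success, or None if another agent
    claimed it first (the source file no longer exists).
    """
    claimed = task_file.replace("_pending", f"_inprogress_{agent_name}")
    try:
        os.rename(task_file, claimed)  # atomic on POSIX, same filesystem
        return claimed
    except FileNotFoundError:
        return None
```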

What the manifest.json would look like

{
  "tasks": [
    {
      "id": "task_001",
      "status": "done",
      "owner": "agent_a",
      "created_at": "2025-03-13T10:00:00Z",
      "completed_at": "2025-03-13T10:00:42Z",
      "output": "outputs/task_001_result.md"
    }
  ]
}

Conceptual agent loop

import glob
import os
import time

AGENT = "agent_a"

while True:
    pending = sorted(glob.glob("tasks/*_pending.md"))
    if not pending:
        time.sleep(2)
        continue

    task_file = pending[0]
    task_id = os.path.basename(task_file).split("_pending")[0]
    claimed = f"tasks/{task_id}_inprogress_{AGENT}.md"
    try:
        os.rename(task_file, claimed)  # atomic claim
    except FileNotFoundError:
        continue  # another agent got it first

    with open(claimed) as f:
        task_content = f.read()
    result = call_llm(task_content)  # the inference call, elided here

    with open(f"outputs/{task_id}_result.md", "w") as f:
        f.write(result)
    os.rename(claimed, f"tasks/{task_id}_done.md")
    update_manifest(task_id, status="done")  # elided here

Crash recovery — the part I find most compelling

If an agent dies mid-task, the file sits at task_001_inprogress_agent_a.md. The orchestrator can detect stale inprogress files older than a threshold and reset them to pending. No data lost. No complex recovery logic. The filesystem is your persistent state.
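A stale-task sweep for the orchestrator could look like this (`reset_stale_tasks` and the 300-second threshold are my assumptions; file mtime is a crude but workable staleness signal):

```python
import glob
import os
import time

STALE_AFTER = 300  # seconds; assumed threshold, tune per workload

def reset_stale_tasks(tasks_dir="tasks"):
    """Return in-progress files untouched for too long to pending.

    Returns the list of task ids that were reset.
    """
    reset = []
    for path in glob.glob(os.path.join(tasks_dir, "*_inprogress_*.md")):
        if time.time() - os.path.getmtime(path) > STALE_AFTER:
            task_id = os.path.basename(path).split("_inprogress")[0]
            pending = os.path.join(tasks_dir, f"{task_id}_pending.md")
            os.rename(path, pending)  # atomic reset, same primitive as the claim
            reset.append(task_id)
    return reset
```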


Approach 2: Named Pipes (FIFOs)

For use cases where you want real-time streaming handoff between two agents rather than polling.

mkfifo workspace/pipes/orchestrator_to_agent_a
mkfifo workspace/pipes/agent_a_to_orchestrator
# Orchestrator process: send a task downstream.
# Note: open() on a FIFO blocks until the other end is opened too.
with open("pipes/orchestrator_to_agent_a", "w") as pipe:
    pipe.write(task_content)

# Agent process: read the task, process it, respond back.
# read() returns once the writer closes its end.
with open("pipes/orchestrator_to_agent_a", "r") as pipe:
    task = pipe.read()

result = call_llm(task)

with open("pipes/agent_a_to_orchestrator", "w") as pipe:
    pipe.write(result)

When I'd reach for pipes vs versioned files

| Scenario | Use |
| --- | --- |
| Tasks are independent, parallel | Versioned files |
| Sequential pipeline, output feeds next agent | Named pipes |
| Crash recovery is critical | Versioned files |
| Real-time streaming between two agents | Named pipes |
| More than 2 agents coordinating | Versioned files |

My current thinking: start with versioned files. Pipes are compelling for strict sequential pipelines but add blocking complexity that's hard to debug.


State Management: Two-Layer Design

Layer 1 — Filename (per-task state)

Atomic, visible at a glance, no extra tooling.

Layer 2 — manifest.json (system state)

Single source of truth. Write atomically: write to manifest.tmp.json, rename to manifest.json. POSIX rename is atomic — safe for concurrent readers.
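That write-then-rename pattern, sketched (`write_manifest` is a hypothetical helper; the fsync before the rename is optional hardening against a crash between write and rename):

```python
import json
import os

def write_manifest(manifest, path="manifest.json"):
    """Atomically replace the manifest: write a temp file, then rename.

    Concurrent readers see either the old or the new manifest,
    never a partially written one.
    """
    tmp = path.replace(".json", ".tmp.json")
    with open(tmp, "w") as f:
        json.dump(manifest, f, indent=2)
        f.flush()
        os.fsync(f.fileno())  # ensure bytes hit disk before the rename
    os.rename(tmp, path)  # atomic on POSIX, same filesystem
```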


What This Would and Wouldn't Solve

It would solve:

  • Framework fatigue — zero dependencies
  • Debuggability — state is always inspectable on disk
  • Crash recovery — filesystem is persistent by default
  • Context isolation — each agent is an independent CLI process
  • Observability — watch -n1 ls tasks/ is literally your dashboard

It wouldn't solve:

  • Distributed systems — agents must share a filesystem (same machine or NFS)
  • High throughput — if you need 1000 tasks/sec, use a proper queue
  • Real-time streaming — versioned files add ~2s polling latency

This is deliberately a single-machine, low-throughput, high-debuggability architecture. I think that matches the actual deployment reality of most local LLM agent use cases — which are rarely distributed, rarely high-throughput, but almost always painful to debug.


Why I Think This Direction Is Worth Exploring

Current agent frameworks are built assuming agents are fast, distributed, and network-native. They inherit the full complexity of distributed systems design.

LLM agents are none of those things. They're slow, usually running locally or single-tenant, and their output is human-readable text.

The filesystem is a better fit for these actual properties. The 1970s Unix designers had this right for a different reason — and it might be accidentally correct again for a new one.

I'm working on a reference implementation and plan to share it once it's solid enough to be useful. Would be genuinely interested in whether others have tried this approach or hit the walls I'm anticipating — especially around the atomic rename behavior on non-POSIX filesystems (Windows, NFS edge cases).


Harshad Biradar — Systems programmer, building things from first principles.
