sly-the-fox

Posted on Mar 16

Your Multi-Agent System Has a Memory Problem

#agents #ai #architecture #python

Every multi-agent system I've seen break in production broke for the same reason: agents couldn't remember what already happened.

Not hallucination. Not bad prompts. Memory. The system had no way to persist decisions, track what was active, or share context between agents that needed to coordinate.

I run 39 agents in a production operating system. Solving the memory problem was harder than designing the agents themselves.

The three kinds of agent memory

Agent memory isn't one thing. It's at least three distinct problems, and most systems only solve the first.

1. Session memory (what happened in this conversation)

This is what you get for free with most LLM APIs. The context window. It works until the conversation ends, then it's gone.

# This is NOT memory. This is a conversation buffer.
messages = [
    {"role": "user", "content": "Deploy the new config"},
    {"role": "assistant", "content": "Config deployed to staging."}
]
# Session ends. Everything above disappears.

2. Shared state (what's true right now across all agents)

This is where most systems fall apart. When Agent A makes a decision, Agent B needs to know about it before it acts. Not eventually. Before.

# Shared state: a set of canonical files every agent reads
SHARED_STATE = {
    "active/now.md": "Current tasks, owners, and status",
    "active/priorities.md": "Ranked priority list",
    "active/blockers.md": "What's stuck and why",
    "core/history/decisions.md": "Decisions with rationale"
}

def before_action(agent, task):
    """Every agent reads shared state before acting."""
    context = {}
    for path, purpose in SHARED_STATE.items():
        context[path] = read_file(path)
    return context

The key insight: shared state must be file-based (or database-backed), not held in any single agent's context. Agents come and go. The state persists.

3. Institutional memory (what happened over time, why decisions were made)

This is the layer nobody builds until they need it. Two weeks into running a multi-agent system, someone asks "why is the deployment config set to X?" No agent remembers. The decision happened in a conversation that's long gone.

# A decision record with rationale
DECISION = {
    "id": "DEC-042",
    "date": "2026-03-15",
    "decision": "Use file-based shared state instead of Redis",
    "rationale": "Agents run in ephemeral sessions. Files persist across "
                 "sessions without requiring a running service.",
    "decided_by": "architect",
    "alternatives_considered": ["Redis", "SQLite", "env vars"]
}

If you don't record the "why," you'll revisit the same decisions repeatedly. I've watched agents re-debate resolved questions because the rationale wasn't captured anywhere they could read.

The shared state pattern

Here's the pattern that works in my system. It's simple, and that's the point.

Rule 1: Canonical files are the source of truth.

Not agent memory. Not conversation history. Files on disk. Every agent reads them. A small set of agents are authorized to write them.

Rule 2: Read before you act.

Before any agent takes action, it reads relevant shared state. This is non-negotiable. An agent that acts without reading current state will contradict decisions made since its last session.

Rule 3: Write-back is immediate.

When an agent completes a task, it updates shared state in the same operation. Not "later." Not "at the end of the session." Immediately. Stale state causes cascading errors.

def complete_task(agent, task, result):
    """Task completion always includes state update."""
    # 1. Do the work
    execute(task)

    # 2. Update shared state (same operation)
    update_file("active/now.md", mark_complete(task))
    update_file("core/history/decisions.md", record_decision(task, result))

    # 3. Never separate these steps

Rule 4: Ownership is explicit.

Not every agent can write to every file. The PM agent owns active/now.md. The architect owns core/history/decisions.md. Contention on shared files goes to a governor.

What this looks like in practice

My system's shared state layer is a directory called active/ with seven files:

File	Purpose	Owner
`now.md`	Current tasks and status	PM
`priorities.md`	Ranked priorities	Strategist
`blockers.md`	What's stuck	PM
`risks.md`	Active risk register	Sentinel
`improvements.md`	Proposed system changes	Improver
`inbox.md`	Unrouted incoming items	Router
`coherence-metrics.md`	System health signals	Evaluator

Every agent reads the files relevant to its work before acting. Seven agents write to them. The governor resolves conflicts.

Is it elegant? No. It's a directory of markdown files. But it's been running for two weeks without a single state conflict, and every agent sees the same picture of reality.

Start here

If your multi-agent system has more than two agents:

Create one shared file that tracks "what's active right now"
Make every agent read it before acting
Make the responsible agent update it after completing work
Log decisions with rationale, not just outcomes

You can get sophisticated later. Databases, event streams, vector stores. But the pattern underneath is always the same: agents need a shared, persistent, authoritative picture of the world. Without it, you're running parallel amnesiacs.

Building Sigil for cryptographic audit trails, because shared state is only trustworthy if you can prove it wasn't tampered with. More on agent architecture: The Alignment Layer.

DEV Community

Your Multi-Agent System Has a Memory Problem

The three kinds of agent memory

The shared state pattern

What this looks like in practice

Start here

Top comments (0)