We've had our 5-agent system running on a Mac Mini for several weeks now.
The hardest part wasn't the prompts. It wasn't the model selection. It wasn't even the tool integrations.
It was state management — making sure every agent always knows what it was doing, what it just did, and what to do if it restarts mid-task.
Here's the exact pattern we use.
## Why Agents Fail in Production (It's Not the Prompts)
Most people debug AI agent problems by tweaking prompts. That fixes maybe 20% of issues.
The other 80% are state problems:
- Agent restarts and re-does work it already completed
- Agent loses context mid-task and starts over from scratch
- Two agents overwrite each other's output because neither tracks "who's working on this"
- Agent completes a subtask but doesn't record it, so the next loop retries it
These aren't LLM problems. They're software engineering problems.
## The Three-File Pattern
Every agent in our system reads and writes three files per loop:
### 1. `current-task.json`: What am I working on right now?
```json
{
  "task_id": "tweet-20260307-0900",
  "task": "post_library_27_tweet",
  "status": "in_progress",
  "started_at": "2026-03-07T08:55:00-07:00",
  "context": {
    "tweet_text": "We cut our AI agent API spend...",
    "target_time": "09:00 MT"
  }
}
```
Before doing any work, the agent reads this file. If `status` is `in_progress` from a prior session, it resumes instead of starting over. If `status` is `done`, it moves on to the next task.
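A minimal sketch of that read-and-resume check, assuming the file layout above (the helper name `load_task` is ours, not from the system):

```python
import json
from pathlib import Path

TASK_FILE = Path("current-task.json")

def load_task():
    """Return the task to resume, or None if nothing is in flight."""
    if not TASK_FILE.exists():
        return None
    task = json.loads(TASK_FILE.read_text())
    if task.get("status") == "done":
        return None  # finished: the caller should queue the next task
    return task  # in_progress from a prior session: resume, don't restart
```

A crashed agent that calls `load_task()` on startup gets back exactly the context it wrote before dying.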
### 2. `memory/YYYY-MM-DD.md`: What happened today?
A raw log of every action taken: timestamp, what was done, and what the output was.
This isn't just for debugging — the agent reads recent entries before each loop to avoid repeating work.
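One way to keep that log append-only, sketched under the same layout (the entry format and helper name are illustrative, not from the article):

```python
import datetime
from pathlib import Path

MEMORY_DIR = Path("memory")

def log_action(action: str, output: str) -> Path:
    """Append one timestamped entry to today's log and return its path."""
    MEMORY_DIR.mkdir(exist_ok=True)
    path = MEMORY_DIR / f"{datetime.date.today().isoformat()}.md"
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with path.open("a") as f:  # append-only: never rewrite history
        f.write(f"- {stamp} | {action} | {output}\n")
    return path
```

Because entries are only ever appended, "read recent entries before each loop" is just tailing today's file.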
### 3. `MEMORY.md`: What do I know long-term?
Curated knowledge that persists across days. Patterns that worked, rules that apply, decisions that were made. The agent reads this at session start to load its persona and constraints.
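What goes in that file is up to you; a hypothetical sketch of the shape (the section names are ours):

```markdown
# MEMORY.md

## Persona
You are the posting agent. Tone: direct, no hype.

## Standing rules
- Never post between 22:00 and 06:00 MT.
- Always check current-task.json before starting new work.

## Patterns that worked
- Threads posted at 09:00 MT outperform afternoon posts.
```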
## The Loop Structure
Every agent loop follows the same structure:
1. READ `current-task.json` → am I mid-task?
2. READ `memory/today.md` → what did I do recently?
3. READ `MEMORY.md` → what are my standing rules?
4. DO the work
5. WRITE `current-task.json` (status update)
6. WRITE `memory/today.md` (log what I did)
7. If task complete: clear `current-task.json`
This means agents can be killed and restarted at any point. They'll pick up where they left off.
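The loop above, sketched under the same file layout. `do_work` stands in for the agent's actual task execution, and the memory reads/writes (steps 2, 3, and 6) are elided to keep the skeleton short:

```python
import json
from pathlib import Path

TASK_FILE = Path("current-task.json")

def run_loop(do_work):
    """One iteration: read state, do the work, persist, clear when done."""
    # 1. READ current-task.json: am I mid-task?
    if not TASK_FILE.exists():
        return  # nothing in flight; a scheduler would queue the next task
    task = json.loads(TASK_FILE.read_text())
    if task.get("status") == "done":
        TASK_FILE.unlink()  # stale state: clear it and move on
        return
    # 4. DO the work
    finished = do_work(task)
    if finished:
        # 7. Task complete: clear the state file
        TASK_FILE.unlink()
    else:
        # 5. WRITE the status update so a restart resumes, not restarts
        task["status"] = "in_progress"
        TASK_FILE.write_text(json.dumps(task, indent=2))
```

Killing the process anywhere in this loop is safe: the worst case is re-running a step whose state update never landed, which is why each task needs to be idempotent.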
## The Handoff Pattern (Multi-Agent)
When Agent A hands work to Agent B, A writes a handoff file:
```json
{
  "from": "suki",
  "to": "kai",
  "task": "deploy_blog_post",
  "payload": {
    "post_slug": "ai-agent-state-management",
    "ready": true
  },
  "timestamp": "2026-03-07T09:00:00-07:00"
}
```
B reads this file at the start of its loop. No direct agent-to-agent communication needed — the filesystem is the message bus.
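A sketch of both sides of that handoff. The agent names come from the article; the `handoffs/` directory and the file-naming scheme are our assumptions:

```python
import datetime
import json
from pathlib import Path

HANDOFF_DIR = Path("handoffs")

def write_handoff(src: str, dst: str, task: str, payload: dict) -> Path:
    """Agent A drops a file addressed to agent B; no sockets, no queues."""
    HANDOFF_DIR.mkdir(exist_ok=True)
    msg = {
        "from": src,
        "to": dst,
        "task": task,
        "payload": payload,
        "timestamp": datetime.datetime.now().astimezone().isoformat(),
    }
    path = HANDOFF_DIR / f"{dst}-{task}.json"
    path.write_text(json.dumps(msg, indent=2))
    return path

def read_handoffs(agent: str) -> list:
    """Agent B consumes every handoff addressed to it, oldest first."""
    msgs = []
    for p in sorted(HANDOFF_DIR.glob(f"{agent}-*.json")):
        msgs.append(json.loads(p.read_text()))
        p.unlink()  # consume the message so the next loop won't re-run it
    return msgs
```

Deleting the file on read is what makes the handoff one-shot; if B crashes mid-task, its own `current-task.json` (not the handoff file) is what carries the resume state.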
## Why This Works
- **Idempotency:** Every task can be safely retried. The state file prevents double work.
- **Observability:** When something breaks, you can read the state files to see exactly what happened.
- **Simplicity:** No complex orchestration framework. Just JSON files and a read-before-write discipline.
## The Full Architecture
This is one pattern in a larger system. If you want the full state management config template — with variants for solo agents and multi-agent handoffs — it's in the Operator's Playbook at askpatrick.co.
If you're running agents in production and dealing with reliability issues, state management is likely your problem.