The Multi-Agent AI Stack Nobody Documents Properly (My 5-Agent Setup That Runs 24/7)

I've been running a 5-agent AI system in production for several months. Not a demo. Not a weekend project. A system that actually does work autonomously while I sleep.

Most articles about multi-agent AI are either toy examples or vendor marketing. This one is neither. Here's what actually works, what breaks, and how I structured the whole thing.


Why Single-Agent AI Fails at Scale

A single agent handling a complex task is like asking one person to simultaneously research a problem, write the code, review it for bugs, and deploy it — while keeping everything in their head.

Context windows are the first wall you hit. When you stuff a 15-step task into a single prompt, the model starts losing coherence around step 8. It hallucinates, contradicts itself, or forgets constraints you stated at the top.

The second problem is error propagation. A single agent that makes a bad assumption early carries that error through every subsequent step. There's no checkpoint.

The third problem is specialization. A generalist agent asked to do security review and feature development at the same time does both poorly. The mental model for "write new code fast" conflicts directly with "scrutinize everything for vulnerabilities."

The solution isn't a smarter single agent. It's decomposition.


The 5-Agent Architecture

Here's how I split responsibilities:

Researcher — reads context, searches memory, gathers existing patterns. Never writes code. Only produces structured findings.

Planner — takes researcher output and produces a step-by-step execution plan with explicit success criteria per step.

Coder — implements exactly what the plan specifies. Doesn't deviate. Doesn't "improve" things outside scope.

Reviewer — reads the diff, checks against the original spec, flags security issues, type errors, edge cases. Produces a structured verdict: pass, revise, or reject.

Executor — handles deployment, file writes, git commits, test runs. The only agent with side effects.

The key insight: only the Executor touches the real world. Every other agent produces text that feeds the next agent. This makes the whole pipeline auditable and reversible.


Orchestration: The Part Nobody Shows

Most tutorials show agents as boxes in a diagram. Here's the actual Python orchestration loop I use:

import anthropic

client = anthropic.Anthropic()

AGENTS = {
    "researcher": "You are a researcher. Read the task and return structured findings. No code.",
    "planner": "You are a planner. Take findings and produce a numbered execution plan with success criteria.",
    "coder": "You are a coder. Implement the plan exactly. Return only the changed files as JSON.",
    "reviewer": "You are a reviewer. Check the implementation against the plan. Return: PASS, REVISE, or REJECT with reasons.",
    "executor": "You are an executor. Apply approved changes. Return a summary of actions taken.",
}

def run_agent(role: str, context: str, max_tokens: int = 2048) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=max_tokens,
        system=AGENTS[role],
        messages=[{"role": "user", "content": context}]
    )
    return response.content[0].text

def run_pipeline(task: str) -> dict:
    findings = run_agent("researcher", f"Task: {task}")
    plan = run_agent("planner", f"Findings:\n{findings}")
    implementation = run_agent("coder", f"Plan:\n{plan}", max_tokens=4096)

    verdict = run_agent("reviewer", f"Plan:\n{plan}\n\nImplementation:\n{implementation}")

    if "REJECT" in verdict:
        return {"status": "rejected", "reason": verdict}

    if "REVISE" in verdict:
        implementation = run_agent("coder", f"Revise based on: {verdict}\n\nPlan:\n{plan}")

    result = run_agent("executor", f"Apply:\n{implementation}")
    return {"status": "complete", "result": result}

This is simplified, but it's structurally accurate. The real version adds retry logic, structured output parsing, and per-agent cost tracking.
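
The retry piece is smaller than it sounds. A minimal sketch, assuming you retry on connection and rate-limit errors with exponential backoff (the attempt count and delays here are illustrative, not my production values):

import time

def run_agent_with_retry(role: str, context: str, max_tokens: int = 2048,
                         max_retries: int = 3) -> str:
    # Retry transient API failures; re-raise once retries are exhausted.
    for attempt in range(max_retries):
        try:
            return run_agent(role, context, max_tokens)
        except (anthropic.APIConnectionError, anthropic.RateLimitError):
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, then 2s between attempts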


Memory and State Management

Agents are stateless by default. This is a feature, not a bug — but you need to handle state explicitly.

I use three layers:

Short-term context: the conversation thread passed between agents in the same pipeline run. This is just strings being concatenated.

Medium-term working memory: a local JSON file that stores task state. If the pipeline crashes at step 3, it resumes from step 3, not step 1.

import json, pathlib

STATE_FILE = pathlib.Path("pipeline_state.json")

def save_state(task_id: str, step: str, data: dict):
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[task_id] = {"step": step, "data": data}
    STATE_FILE.write_text(json.dumps(state, indent=2))

def load_state(task_id: str) -> dict | None:
    if not STATE_FILE.exists():
        return None
    state = json.loads(STATE_FILE.read_text())
    return state.get(task_id)

Long-term pattern memory: after a successful pipeline run, I store what worked — the plan structure, which reviewer feedback triggered revisions, final token counts. This becomes context for the Researcher on the next run.
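
A minimal version of that long-term store, assuming a JSON Lines file (the field names are illustrative, not my exact schema):

import json, pathlib, time

PATTERNS_FILE = pathlib.Path("pattern_memory.jsonl")

def record_run(task: str, plan: str, revisions: int, tokens_used: int):
    # One appended record per successful pipeline run.
    entry = {"ts": time.time(), "task": task, "plan": plan,
             "revisions": revisions, "tokens": tokens_used}
    with PATTERNS_FILE.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def recall_patterns(limit: int = 5) -> str:
    # The Researcher gets the most recent successful runs as extra context.
    if not PATTERNS_FILE.exists():
        return ""
    return "\n".join(PATTERNS_FILE.read_text().splitlines()[-limit:])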


Cost Control in Practice

Running 5 agents per task adds up. Here's how I keep it under control.

Model routing: not every step needs Sonnet. The Researcher and Planner run on Haiku for most tasks. Only the Coder and Reviewer run on Sonnet. The Executor is often deterministic enough to skip the LLM entirely.

MODEL_ROUTING = {
    "researcher": "claude-haiku-4-5-20251001",
    "planner": "claude-haiku-4-5-20251001",
    "coder": "claude-sonnet-4-6",
    "reviewer": "claude-sonnet-4-6",
    "executor": "claude-haiku-4-5-20251001",
}
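
Wiring this into run_agent is one changed line: look up the model per role instead of hardcoding it.

def run_agent(role: str, context: str, max_tokens: int = 2048) -> str:
    response = client.messages.create(
        model=MODEL_ROUTING[role],  # per-role routing replaces the hardcoded model
        max_tokens=max_tokens,
        system=AGENTS[role],
        messages=[{"role": "user", "content": context}]
    )
    return response.content[0].text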

Token budgets per agent: I set hard max_tokens limits. The Researcher doesn't need 4096 tokens. Constraining it forces concise output and reduces cost downstream (that output becomes input for the next agent).
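
Concretely, that's another per-role table. These numbers are ballpark, not a recommendation; tune them against your own tasks:

TOKEN_BUDGETS = {
    "researcher": 1024,  # findings should be terse
    "planner": 1536,
    "coder": 4096,       # needs room for full file contents
    "reviewer": 1024,
    "executor": 512,
}

findings = run_agent("researcher", f"Task: {task}",
                     max_tokens=TOKEN_BUDGETS["researcher"])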

Caching: for tasks that share context — like a codebase summary — I pass the same system prompt prefix and let Anthropic's prompt caching handle it. On repeated runs, this cuts 60–70% of input costs for the context-heavy agents.
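
A sketch of what that looks like with the Messages API: the shared prefix goes first as a system content block marked with cache_control. CODEBASE_SUMMARY is a placeholder for whatever shared context you reuse across agents.

CODEBASE_SUMMARY = pathlib.Path("codebase_summary.md").read_text()

def run_agent_cached(role: str, context: str, max_tokens: int = 2048) -> str:
    response = client.messages.create(
        model=MODEL_ROUTING[role],
        max_tokens=max_tokens,
        system=[
            # Identical across agents and runs, so repeated calls hit the
            # cache and pay the reduced cache-read rate.
            {"type": "text", "text": CODEBASE_SUMMARY,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": AGENTS[role]},
        ],
        messages=[{"role": "user", "content": context}],
    )
    return response.content[0].text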

My current average cost per full pipeline run is around $0.04–0.08 for a typical coding task. Running 50 tasks a day costs about $2–4.


What Actually Breaks (And How to Handle It)

Reviewer loops: sometimes the Reviewer keeps returning REVISE and the Coder keeps making the same mistake. I cap revision cycles at 2. On the third failure, the pipeline halts and logs the failure for human review.
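
The cap is a small loop around the coder/reviewer pair. A sketch, assuming the simplified string verdicts from earlier:

MAX_REVISIONS = 2

def review_loop(plan: str, implementation: str) -> tuple[str, str]:
    for attempt in range(MAX_REVISIONS + 1):
        verdict = run_agent("reviewer",
                            f"Plan:\n{plan}\n\nImplementation:\n{implementation}")
        if "PASS" in verdict or "REJECT" in verdict:
            return verdict, implementation
        if attempt == MAX_REVISIONS:
            break  # third REVISE in a row: stop burning tokens
        implementation = run_agent("coder",
                                   f"Revise based on: {verdict}\n\nPlan:\n{plan}",
                                   max_tokens=4096)
    return "HALT: human review needed", implementation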

Executor drift: if the Executor gets vague implementation output from the Coder, it will hallucinate file paths. The fix is requiring the Coder to return structured JSON with explicit file paths and content — no ambiguity.
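
Validating that output before the Executor ever sees it is cheap insurance. A sketch, assuming the Coder is prompted to return a JSON list of {"path": ..., "content": ...} objects:

def validate_implementation(raw: str) -> list[dict]:
    # Reject anything that isn't a list of {path, content} dicts
    # before the Executor can act on it.
    files = json.loads(raw)  # raises ValueError on non-JSON output
    if not isinstance(files, list):
        raise ValueError("expected a JSON list of file objects")
    for f in files:
        if not isinstance(f, dict) or not isinstance(f.get("path"), str) \
                or not isinstance(f.get("content"), str):
            raise ValueError(f"malformed file entry: {f!r}")
    return files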

Context bleed: if you're not careful, the Planner starts echoing back things the Researcher said verbatim, inflating token usage for no benefit. Each agent prompt explicitly says "do not repeat prior context, only produce your role's output."

Running this system 24/7 means these edge cases happen constantly. The pipeline needs to be defensive, not optimistic.


I packaged the full architecture, prompts, and cost breakdown into a guide: Multi-Agent AI Stack: The Complete Builder Guide — covers orchestration, memory, error recovery, and real deployment costs.

What's your biggest pain point when moving from a single agent to a multi-agent setup — context management, cost, or something else entirely?
