Why Multi-Agent Systems Are a Trap (And What I Learned the Hard Way)

#ai #python #architecture #programming

There's a moment in every ambitious AI engineering project where you convince yourself that more agents means more power. I hit that moment early in building my Python orchestration framework — and I spent several painful weeks learning exactly why that intuition is wrong.

The seductive pitch: decompose complex tasks into specialized sub-agents, run them in parallel, let them coordinate. What actually happened was a reliability nightmare that taught me more about agentic architecture than any framework documentation ever could.

The Problem I Actually Built

My Python orchestration system was designed to automate complex, multi-step workflows — the kind that require planning, research, code generation, and validation to happen in a coherent sequence. Early on, I structured it as a web of parallel agents: a planner, several workers, a validator, and a synthesizer, all exchanging structured messages.

On paper it was elegant. In practice, it had three failure modes I couldn't engineer away:

Context drift. Each agent only saw the slice of information it was handed. The worker writing one module couldn't see what the worker writing another module had decided. By the time the synthesizer tried to combine outputs, I had conflicting assumptions baked into the results — variable names that clashed, patterns that contradicted each other, interfaces that didn't align.

Cascading partial failures. When one agent produced ambiguous output, every downstream agent amplified the ambiguity. A planner that returned a slightly underspecified task description produced workers that each interpreted it differently. Nothing failed loudly. Everything just drifted, quietly, until the final output was incoherent.

Debugging opacity. When something went wrong in a parallel multi-agent system, tracing the failure was miserable. Was it the planner? One of the workers? The message-passing layer? I'd rebuilt the worst parts of distributed systems debugging inside a single Python process.

The Architecture Shift That Fixed It

The fix wasn't clever. It was humbling: I flattened the architecture.

Instead of parallel agents exchanging messages, I restructured the system around a single linear execution chain where each step receives the full accumulated context of everything that came before. The pattern looks like this:

import anthropic

client = anthropic.Anthropic()

def run_orchestrated_task(task: str) -> str:
    # Full context accumulates at each step
    context = {
        "task": task,
        "steps_completed": [],
        "artifacts": {}
    }

    # Each agent in the chain receives complete context
    for step_name, instruction in agent_steps:
        result = run_step(client, context, step_name, instruction)
        context["steps_completed"].append({
            "agent": step_name,
            "result": result
        })

    return context["artifacts"].get("final_output", "")

This single change eliminated context drift entirely. The implementer sees the planner's full reasoning. The validator sees every decision the implementer made. Nothing is hidden, nothing is re-interpreted from scratch.

Why Parallelism Is Usually the Wrong Bet

The appeal of parallel agents is speed. But in most agentic workflows, the bottleneck isn't compute — it's correctness. You're not waiting on CPU cycles; you're waiting on language model calls that need to produce coherent, consistent output.

Parallelism trades coherence for speed. For tasks where consistency matters — and in my experience, almost all production tasks require consistency — that's a bad trade.

The cases where I've found parallelism genuinely useful are narrow: independent sub-tasks with no shared state, clearly-bounded scope, and results that are combined by a human reviewer rather than automatically synthesized. Think: running the same analysis on ten separate documents and returning ten separate reports. Not: building one coherent system from ten parallel contributors.

The Context Engineering Insight

What changed my mental model was reframing the problem. I'd been thinking about agent architecture as a parallelism problem. It's actually a context problem.

The question isn't "how many agents can work at once?" It's "what does each agent need to know to make the right decision?"

When I started designing with that question first, the architecture became obvious. Every agent needs the full history of what was decided before it. That requirement naturally pushes you toward sequential execution — not because parallelism is impossible, but because the information dependencies make it impractical.

Here's the pattern I settled on for my orchestration framework:

import anthropic
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ExecutionContext:
    original_task: str
    decisions: list[dict[str, Any]] = field(default_factory=list)
    artifacts: dict[str, str] = field(default_factory=dict)

    def record_decision(self, agent: str, decision: str, rationale: str):
        self.decisions.append({
            "agent": agent,
            "decision": decision,
            "rationale": rationale
        })

    def to_prompt_context(self) -> str:
        sections = [f"Task: {self.original_task}"]
        if self.decisions:
            sections.append("Decisions made so far:")
            for d in self.decisions:
                sections.append(
                    f"  [{d['agent']}] {d['decision']}\n"
                    f"    Rationale: {d['rationale']}"
                )
        return "\n\n".join(sections)


def run_step(
    ctx: ExecutionContext,
    agent_name: str,
    instruction: str
) -> str:
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system=ctx.to_prompt_context(),
        messages=[{"role": "user", "content": instruction}]
    )
    result = message.content[0].text
    ctx.record_decision(agent_name, result[:200], "see full output")
    return result

The key piece is to_prompt_context() — every agent's system prompt is rebuilt from the full decision history. Nothing is ever hidden.

What I'd Tell Myself Earlier

If I were starting the orchestration framework again, I'd establish these constraints from the beginning:

One writer at a time. Only one agent writes or modifies shared artifacts at a given moment. Research agents, planning agents, and analysis agents can run in parallel if their outputs are truly independent. But anything touching a shared artifact runs sequentially.

Explicit state over implicit message passing. Instead of agents exchanging messages, agents read from and write to an explicit execution context. The state is the contract. There's no ambiguity about what the next agent knows.

Fail loudly or not at all. If an agent can't complete its step with the context it has, it should raise an exception rather than produce ambiguous output. Silent degradation is the enemy of debuggable systems.

Sequential is not slow. The processing time for a ten-step agent chain is dominated by the LLM call latency, not the orchestration overhead. Parallelism rarely moves the needle in ways users notice.

The Reliability Payoff

After flattening the architecture, the system became boring in the best way. Failures were rare, and when they happened, they were traceable. The planner's output was in the log. The implementer's reasoning was in the log. The validator's critique was in the log. I could read the execution history like a transcript and immediately see where things went wrong.

That debugging experience is itself evidence for the architecture. If you can't read what your agents did and understand why, you've built something too complex to maintain.

Practical Next Steps

If you're building agentic systems and finding reliability hard to achieve:

Flatten before scaling. Get a single-chain architecture working reliably before adding any parallelism.
Serialize your context explicitly. Write the to_prompt_context() function and think carefully about what every agent actually needs to know.
Log decisions, not just outputs. The rationale behind each decision is what makes the system debuggable.
Test failure modes deliberately. Give agents ambiguous inputs and watch what cascades. That's where your architecture's assumptions are hiding.

The multi-agent dream is real, but it's further away than the frameworks suggest. The path there runs through reliable single-chain architectures that you actually understand — not through distributed complexity that looks impressive until it doesn't.

If this was useful, I write about building production AI and agentic systems at learn-agentic-ai.com — including hands-on learning paths available in both English and Brazilian Portuguese. Come build something real.