I used to blame prompts for most multi-agent failures.
If a Research Agent kept bouncing work back to an Analyst Agent, and the Analyst Agent kicked it to a Writer Agent, and the Writer Agent asked a Reviewer Agent for approval, I assumed the fix was better instructions.
Usually it wasn’t.
The real issue was org design.
Once I started looking at how CrewAI, OpenAI Agents SDK, AutoGen, and OpenClaw actually structure delegation, one pattern kept showing up:
narrow workers + one coordinator beats peer-to-peer agent teams once the workflow has 2 or more delegated steps.
That sounds less exciting than “autonomous AI team.” It’s also what survives production.
The failure mode looks like prompting, but it isn’t
Here’s the classic loop:
- Research Agent finds partial info
- Analyst Agent asks for more context
- Writer Agent wants a summary first
- Reviewer Agent asks for another source
- Research Agent starts over
Everybody is active.
Nobody owns the result.
That’s not intelligence. That’s committee behavior.
I ran into a good thread on r/openclaw where someone said it cleanly: when agents are designed as peers, nobody clearly owns the outcome.
That matches what I’ve seen in real systems. The agents aren’t under-prompted.
They’re over-authorized.
The boring answer: most frameworks already agree
The funny part is the major frameworks mostly point in the same direction.
Anthropic: start simpler than you want to
Anthropic’s guidance on agents is basically: use simple, composable patterns first. Don’t jump straight to giant autonomous systems.
That’s a strong signal.
A lot of what gets marketed as autonomy is just architecture debt with nicer branding.
CrewAI: hierarchy is explicit for a reason
CrewAI makes hierarchy a deliberate choice, not an accident.
You have to opt into it:
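```python
from crewai import Crew, Process

# researcher, analyst, and writer are Agent instances defined elsewhere;
# the crew's tasks are omitted here for brevity.
crew = Crew(
    agents=[researcher, analyst, writer],
    process=Process.hierarchical,  # opt in to a manager-led process
    manager_llm="gpt-4",           # model used for the manager
)
```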
That matters.
You don’t get “everyone delegates to everyone” by default. You choose a manager. You choose hierarchy. You make authority explicit.
Framework authors know what happens when every worker starts freelancing.
AutoGen: group chat is flexible, orchestrators are easier to debug
AutoGen has a Group Chat pattern where a manager selects the next speaker in a shared thread.
That can work.
It can also drift badly if your termination conditions are weak.
Its Mixture of Agents pattern is closer to what I trust in production:
- worker agents do scoped jobs
- one orchestrator coordinates
- aggregation is explicit
- termination is explicit
Less magical. More maintainable.
I’ll take that trade every time.
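The four properties above can be sketched in plain Python. This is not AutoGen's API; `orchestrate`, the `Worker` type, and the `aggregate` and `is_done` callbacks are hypothetical names used only to show one coordinator with explicit aggregation and an explicit stop condition.

```python
from typing import Callable

# Hypothetical: a worker maps a task string to a text result.
Worker = Callable[[str], str]


def orchestrate(
    task: str,
    workers: dict[str, Worker],
    aggregate: Callable[[dict[str, str]], str],
    is_done: Callable[[str], bool],
    max_rounds: int = 3,
) -> dict:
    """Run scoped workers, aggregate explicitly, and stop on an explicit condition."""
    answer = ""
    for round_no in range(1, max_rounds + 1):
        # Each worker does its scoped job; workers never call each other.
        outputs = {name: work(task) for name, work in workers.items()}
        # Aggregation happens in exactly one place.
        answer = aggregate(outputs)
        # Termination is decided by the orchestrator, not by the workers.
        if is_done(answer):
            return {"status": "done", "rounds": round_no, "answer": answer}
    # Hard cap so the run cannot drift indefinitely.
    return {"status": "max_rounds_reached", "rounds": max_rounds, "answer": answer}
```

The round cap is the part that matters most: even if `is_done` never fires, the run ends with a status you can inspect.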
OpenAI Agents SDK: handoffs are boundaries, not vibes
OpenAI’s Agents SDK gets something important right: delegation is explicit.
A triage agent hands off to a specialist through transfer tools, not random peer chatter.
That means you can reason about transitions.
And the guardrail model is even more useful than most people realize:
- input guardrails
- output guardrails
- tool guardrails
If you want checkpoints at every delegated step, the important one is usually tool guardrails.
That’s where work crosses boundaries.
OpenClaw: scoped state beats hive-mind memory
OpenClaw’s filesystem layout tells you a lot about its design philosophy.
Per-agent state is isolated:
~/.openclaw/agents/<agentId>/...
~/.openclaw/agents/<agentId>/sessions
~/.openclaw/agents/<agentId>/agent/auth-profiles.json
And the CLI reinforces that model:
openclaw agents add work
openclaw agents list --bindings
That’s not cosmetic.
It’s a hint: agents should be scoped units with local state, not one giant shared consciousness writing into the same pile of memory.
What actually breaks in peer-to-peer agent systems
In practice, the same 3 things fail over and over.
1. Authority gets blurry
If the Research Agent, Analyst Agent, and Writer Agent can all reopen the task, redefine the goal, or delegate sideways, you don’t have a workflow.
You have a Slack channel with no manager.
There should be exactly one place where these decisions happen:
- approve final answer
- reopen completed work
- terminate the run
- escalate scope
If more than one agent can do that, expect loops.
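One way to make that single point of authority concrete is a small permission table that is checked before any control action. This is a minimal sketch; the role names, `CONTROL_ACTIONS`, `authorize`, and `owners` are hypothetical and not part of any framework mentioned here.

```python
# Hypothetical sketch: only one role may perform workflow-level control actions.
CONTROL_ACTIONS = {"approve_final", "reopen_work", "terminate_run", "escalate_scope"}

# Assumed role names; workers get no control actions at all.
PERMISSIONS = {
    "supervisor": set(CONTROL_ACTIONS),
    "research_agent": set(),
    "analyst_agent": set(),
    "writer_agent": set(),
}


def authorize(role: str, action: str) -> bool:
    """Return True only if `role` is allowed to perform `action`."""
    if action not in CONTROL_ACTIONS:
        raise ValueError(f"unknown control action: {action}")
    return action in PERMISSIONS.get(role, set())


def owners(action: str) -> list[str]:
    """List every role that can perform `action`; the list should have length 1."""
    return [role for role, allowed in PERMISSIONS.items() if action in allowed]
```

A startup check that every action in `CONTROL_ACTIONS` has exactly one owner catches blurry authority before the first run.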
2. Memory gets polluted
This is the quieter failure.
If every worker can write directly into shared memory, temporary guesses become durable facts.
A bad search result becomes “known context.”
A one-off exception becomes policy.
A speculative assumption gets replayed in future runs.
A much better pattern is:
- workers emit structured memory proposals
- supervisor or curator validates them
- only approved memory gets committed
That sounds strict because it should be.
3. Stopping conditions are vague
A lot of agent systems don’t fail because the model is dumb.
They fail because nobody defined done.
If your workflow can’t answer these questions, it will wander:
- what counts as task completion?
- who can terminate?
- how many retries are allowed?
- what happens when confidence is low?
- when does the system escalate instead of asking again?
If those rules aren’t explicit, the model will improvise. Improvisation is expensive.
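Those questions can be answered in code instead of in a prompt. Here is a minimal sketch, assuming a numeric confidence score per result; `StopPolicy`, `next_action`, and the default values are illustrative, not taken from any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopPolicy:
    """Hypothetical run policy; field names and defaults are illustrative."""

    max_retries: int = 2         # how many retries are allowed per step
    min_confidence: float = 0.7  # below this, a result is not accepted


def next_action(policy: StopPolicy, confidence: float, retries_used: int) -> str:
    """Decide what the supervisor does after a worker returns."""
    if confidence >= policy.min_confidence:
        return "accept"
    if retries_used < policy.max_retries:
        return "retry"
    # Low confidence and no retries left: escalate instead of asking again.
    return "escalate"
```

Because the function has only three outcomes, every run ends in an accept or an escalation after a bounded number of retries.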
The pattern I trust now
If I’m building something for n8n, Make, Zapier, OpenClaw, or a custom Python worker, this is the default shape I want.
1. One supervisor owns the outcome
Only one agent can:
- approve final output
- reopen work
- terminate the run
- decide whether another worker gets involved
That agent is the coordinator. Not a peer.
2. Workers get narrow jobs
Good worker roles:
- Web Search Agent
- Data Extraction Agent
- SQL Agent
- Refund Agent
- Writer Agent
- Code Review Agent
Bad worker role:
- General Strategic Collaborator That Can Reinterpret The Whole Task
Workers should do one thing well.
3. Workers return artifacts, not workflow opinions
I want outputs like:
- ranked source list
- JSON extraction
- SQL result
- draft with citations
- code diff
- confidence score
I do not want:
“I think the Planning Agent should ask the Reviewer Agent whether we should revisit the earlier assumptions.”
That’s how you turn a 3-step task into synthetic middle management.
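A typed return value is one way to enforce this. The sketch below is an assumption about how you might do it: `Artifact`, `ALLOWED_KINDS`, and `accept_artifact` are hypothetical names, and the rule is simply that any field outside the schema (such as a routing suggestion) is rejected.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical set of artifact kinds, mirroring the outputs listed above.
ALLOWED_KINDS = {"ranked_sources", "json_extraction", "sql_result", "draft", "code_diff"}
ALLOWED_FIELDS = {"kind", "payload", "confidence", "citations"}


@dataclass
class Artifact:
    """Worker output: data only, no opinions about the workflow."""

    kind: str
    payload: Any
    confidence: float
    citations: list[str] = field(default_factory=list)


def accept_artifact(raw: dict) -> Artifact:
    """Build an Artifact, rejecting unknown kinds and extra keys like 'next_agent'."""
    extra = set(raw) - ALLOWED_FIELDS
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    if raw.get("kind") not in ALLOWED_KINDS:
        raise ValueError(f"unknown artifact kind: {raw.get('kind')!r}")
    return Artifact(**raw)
```

A worker that tries to smuggle in `"next_agent": "reviewer"` fails validation at the boundary instead of steering the run.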
4. Memory has an owner
Either:
- the supervisor writes memory, or
- a dedicated memory curator writes memory
Workers can propose memory events.
They should not casually commit them.
A simple shape looks like this:
```json
{
  "event_type": "user_preference_candidate",
  "scope": "account_level",
  "evidence": ["User explicitly requested CSV export twice"],
  "confidence": 0.82,
  "proposed_by": "support_agent"
}
```
Then a higher-authority component decides whether that becomes durable memory.
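That higher-authority component can be very small. This is a sketch under assumptions: `MemoryCurator` is a hypothetical class, the in-memory lists stand in for real storage, and the confidence threshold is something you would choose yourself.

```python
class MemoryCurator:
    """Hypothetical gatekeeper: workers propose events, the curator commits them."""

    REQUIRED = ("event_type", "scope", "evidence", "confidence", "proposed_by")

    def __init__(self, min_confidence: float):
        self.min_confidence = min_confidence        # threshold chosen by the caller
        self.committed: list[dict] = []             # stands in for durable storage
        self.rejected: list[tuple[dict, str]] = []  # kept for later inspection

    def review(self, event: dict) -> bool:
        """Commit the event if it passes basic checks; otherwise record why not."""
        missing = [key for key in self.REQUIRED if key not in event]
        if missing:
            reason = f"missing fields: {missing}"
        elif not event["evidence"]:
            reason = "no evidence"
        elif event["confidence"] < self.min_confidence:
            reason = "confidence below threshold"
        else:
            self.committed.append(event)
            return True
        self.rejected.append((event, reason))
        return False
```

With the example event above (confidence 0.82), a curator created with `min_confidence=0.8` commits it, while one created with `min_confidence=0.9` records it as rejected with a reason you can audit.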
5. Every boundary gets a checkpoint
If an agent:
- calls a tool
- writes a file
- hits an API
- triggers another agent
- updates memory
that boundary should be inspectable and stoppable.
Not later. At that point.
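A decorator is one way to put a checkpoint on each boundary. This is a minimal sketch: `checkpoint`, `CheckpointBlocked`, the policy function, and the `/tmp/` rule are all hypothetical, and the wrapped function does no real I/O.

```python
import functools
from typing import Any, Callable


class CheckpointBlocked(Exception):
    """Raised when a boundary crossing is stopped before it happens."""


def checkpoint(kind: str, policy: Callable[[str, dict], bool], log: list):
    """Wrap a boundary-crossing function so each call is logged and can be stopped."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(**kwargs: Any) -> Any:  # keyword-only so the log is readable
            allowed = policy(kind, kwargs)
            log.append({"kind": kind, "args": kwargs, "allowed": allowed})
            if not allowed:
                raise CheckpointBlocked(f"{kind} blocked")
            return fn(**kwargs)

        return wrapper

    return decorator


# Example wiring with a hypothetical policy: no file writes outside /tmp/.
audit_log: list = []


def only_tmp_writes(kind: str, args: dict) -> bool:
    return kind != "write_file" or args["path"].startswith("/tmp/")


@checkpoint("write_file", only_tmp_writes, audit_log)
def write_file(path: str, text: str) -> str:
    return f"would write {len(text)} chars to {path}"  # no real I/O in this sketch
```

The decision is made and logged before the wrapped function runs, so a blocked call never reaches the tool, file, or API.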
A practical implementation sketch
Here’s a simple orchestrator pattern in Python-style pseudocode:
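```python
class Supervisor:
    def __init__(self, workers, memory_store):
        self.workers = workers            # maps worker name -> worker object
        self.memory_store = memory_store  # any object with a write(event) method

    def run(self, task):
        plan = self.make_plan(task)
        results = []
        for step in plan:
            result = self.dispatch(step)
            if not self.validate(result, step):
                # Stop here instead of letting workers renegotiate the task.
                return {"status": "failed", "reason": "validation_failed"}
            if result.get("memory_event"):
                self.review_memory_event(result["memory_event"])
            results.append(result)
        return self.finalize(results)

    def make_plan(self, task):
        # Placeholder: assumes the task already lists its steps.
        return task["steps"]

    def dispatch(self, step):
        worker = self.workers[step["worker"]]
        return worker.execute(step["input"])

    def validate(self, result, step):
        # Placeholder: swap in per-step schema or quality checks.
        return isinstance(result, dict)

    def review_memory_event(self, event):
        # Only the supervisor commits memory, and only at high confidence.
        if event["confidence"] > 0.9:
            self.memory_store.write(event)

    def finalize(self, results):
        return {"status": "done", "results": results}
```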
And a worker stays narrow:
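```python
def web_search(query):
    # Placeholder: replace with a call to your real search client.
    raise NotImplementedError


class SearchWorker:
    def execute(self, input_data):
        query = input_data["query"]
        results = web_search(query)
        # Return an artifact only; routing decisions stay with the supervisor.
        return {
            "sources": results[:5],  # keep the first five results
            "confidence": 0.77,      # placeholder; compute a real score in practice
            "memory_event": None,    # nothing proposed for durable memory
        }
```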
That’s boring by design.
Boring is good.
Why this matters even more in automations
This gets more important when agents are running inside automations.
If you’re wiring LLM workflows into:
- n8n
- Make
- Zapier
- OpenClaw
- custom cron jobs
- queue workers
- internal support pipelines
you usually care less about “agent personality” and more about:
- predictable completion
- bounded retries
- observable failure modes
- stable cost
- easy debugging
That’s why hierarchical orchestration wins so often.
It fails in ways you can inspect.
Peer-to-peer systems often fail as conversation sprawl, which is much harder to debug and much more annoying to pay for.
The cost problem nobody mentions enough
This is where architecture decisions hit your bill.
Loose multi-agent collaboration burns tokens fast.
Every sideways clarification, every repeated summary, every “just checking alignment” turn adds cost. If you’re on per-token pricing, bad org design doesn’t just create bugs. It creates a billing problem.
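A rough way to see why extra turns are expensive: in a shared thread, each new turn re-reads the history so far, so input tokens grow faster than the turn count. The function below is a back-of-the-envelope sketch with simplifying assumptions (fixed tokens per turn, no caching or summarization, output tokens ignored); the name `thread_input_tokens` and its parameters are hypothetical.

```python
def thread_input_tokens(turns: int, tokens_per_turn: int, base_context: int = 0) -> int:
    """Total input tokens when every turn re-reads the full history so far.

    Simplifying assumptions: each turn adds `tokens_per_turn` tokens to the
    thread, nothing is cached or summarized, and output tokens are ignored.
    """
    total = 0
    history = base_context
    for _ in range(turns):
        total += history            # the model reads everything said so far
        history += tokens_per_turn  # then this turn's text joins the thread
    return total
```

With illustrative numbers (`base_context=1000`, `tokens_per_turn=500`), 3 turns read 4,500 input tokens and 6 turns read 13,500, so doubling the chatter triples the input under these assumptions.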
That’s one reason this topic matters for teams running production agents and automations.
When you have workflows running all day in n8n, Make, Zapier, or custom workers, you want the system to do useful work, not generate internal meetings.
That’s also why predictable pricing changes how you design these systems. If you’re using a drop-in OpenAI-compatible API like Standard Compute, you can be more aggressive about running real automations continuously without babysitting token spend. Flat monthly pricing removes a lot of the hesitation around agent-heavy workloads.
But even then, I’d still fix the org chart first.
Unlimited compute is great.
Wasted compute is still wasted.
Which pattern is better?
Short version:
| Approach | What it gets right |
|---|---|
| CrewAI Hierarchical Process | Manager coordinates and validates work; hierarchy is explicit; authority is clear |
| OpenAI Agents SDK Handoffs | Delegation happens through explicit transfer boundaries; guardrails can be attached where work crosses tools |
| AutoGen Mixture of Agents | Single orchestrator plus worker layers is easier to reason about than free-form group chat |
| OpenClaw Scoped Agents | Per-agent state and bindings encourage isolation instead of one giant shared memory blob |
If I need a workflow that runs reliably at 2 a.m., I’m picking hierarchical orchestration over peer collaboration most of the time.
Not because it’s prettier.
Because I can debug it.
Are some loops still prompt problems?
Yes.
Sometimes the issue really is:
- weak tool schema
- bad retry logic
- sloppy termination condition
- poor context packing
- model unreliability
Not every loop is an org-chart problem.
But a surprising number are.
And I think people reach for prompt tuning way too early because it feels easier than admitting the system has no real authority structure.
My current rule of thumb
If a task spans 2 or more delegated steps, I default to this:
- one supervisor
- narrow workers
- explicit handoffs
- scoped memory
- hard checkpoints
- explicit termination
Only after that do I start tuning prompts.
Because the counterintuitive fix for looping agents is usually giving them less freedom, not more.
That can feel like you’re making the system less intelligent.
In practice, you’re making it less political.
And that’s usually what production agent systems need.
If your agents keep looping, I wouldn’t start by rewriting the personality prompt.
I’d redraw the org chart.