Lars Winstand

Posted on May 24 • Originally published at standardcompute.com

I stopped blaming prompts for agent loops when I realized my agents had a management problem

#ai #automation #python #agents

I used to blame prompts for most multi-agent failures.

If a Research Agent kept bouncing work back to an Analyst Agent, and the Analyst Agent kicked it to a Writer Agent, and the Writer Agent asked a Reviewer Agent for approval, I assumed the fix was better instructions.

Usually it wasn’t.

The real issue was org design.

Once I started looking at how CrewAI, OpenAI Agents SDK, AutoGen, and OpenClaw actually structure delegation, one pattern kept showing up:

narrow workers + one coordinator beats peer-to-peer agent teams once the workflow has 2 or more delegated steps.

That sounds less exciting than “autonomous AI team.” It’s also what survives production.

The failure mode looks like prompting, but it isn’t

Here’s the classic loop:

Research Agent finds partial info
Analyst Agent asks for more context
Writer Agent wants a summary first
Reviewer Agent asks for another source
Research Agent starts over

Everybody is active.
Nobody owns the result.

That’s not intelligence. That’s committee behavior.

I ran into a good thread on r/openclaw where someone said it cleanly: when agents are designed as peers, nobody clearly owns the outcome.

That matches what I’ve seen in real systems. The agents aren’t under-prompted.

They’re over-authorized.

The boring answer: most frameworks already agree

The funny part is the major frameworks mostly point in the same direction.

Anthropic: start simpler than you want to

Anthropic’s guidance on agents is basically: use simple, composable patterns first. Don’t jump straight to giant autonomous systems.

That’s a strong signal.

A lot of what gets marketed as autonomy is just architecture debt with nicer branding.

CrewAI: hierarchy is explicit for a reason

CrewAI doesn’t accidentally make hierarchy a deliberate choice.

You have to opt into it:

from crewai import Crew, Process

crew = Crew(
    agents=[researcher, analyst, writer],
    process=Process.hierarchical,
    manager_llm="gpt-4"
)

That matters.

You don’t get “everyone delegates to everyone” by default. You choose a manager. You choose hierarchy. You make authority explicit.

Framework authors know what happens when every worker starts freelancing.

AutoGen: group chat is flexible, orchestrators are easier to debug

AutoGen has a Group Chat pattern where a manager selects the next speaker in a shared thread.

That can work.

It can also drift badly if your termination conditions are weak.

Its Mixture of Agents pattern is closer to what I trust in production:

worker agents do scoped jobs
one orchestrator coordinates
aggregation is explicit
termination is explicit

Less magical. More maintainable.

I’ll take that trade every time.

OpenAI Agents SDK: handoffs are boundaries, not vibes

OpenAI’s Agents SDK gets something important right: delegation is explicit.

A triage agent hands off to a specialist through transfer tools, not random peer chatter.

That means you can reason about transitions.

And the guardrail model is even more useful than most people realize:

input guardrails
output guardrails
tool guardrails

If you want checkpoints at every delegated step, the important one is usually tool guardrails.

That’s where work crosses boundaries.

OpenClaw: scoped state beats hive-mind memory

OpenClaw’s filesystem layout tells you a lot about its design philosophy.

Per-agent state is isolated:

~/.openclaw/agents/<agentId>/...
~/.openclaw/agents/<agentId>/sessions
~/.openclaw/agents/<agentId>/agent/auth-profiles.json

And the CLI reinforces that model:

openclaw agents add work
openclaw agents list --bindings

That’s not cosmetic.

It’s a hint: agents should be scoped units with local state, not one giant shared consciousness writing into the same pile of memory.

What actually breaks in peer-to-peer agent systems

In practice, the same 3 things fail over and over.

1. Authority gets blurry

If the Research Agent, Analyst Agent, and Writer Agent can all reopen the task, redefine the goal, or delegate sideways, you don’t have a workflow.

You have a Slack channel with no manager.

There should be exactly one place where these decisions happen:

approve final answer
reopen completed work
terminate the run
escalate scope

If more than one agent can do that, expect loops.

2. Memory gets polluted

This is the quieter failure.

If every worker can write directly into shared memory, temporary guesses become durable facts.

A bad search result becomes “known context.”
A one-off exception becomes policy.
A speculative assumption gets replayed in future runs.

A much better pattern is:

workers emit structured memory proposals
supervisor or curator validates them
only approved memory gets committed

That sounds strict because it should be.

3. Stopping conditions are vague

A lot of agent systems don’t fail because the model is dumb.
They fail because nobody defined done.

If your workflow can’t answer these questions, it will wander:

what counts as task completion?
who can terminate?
how many retries are allowed?
what happens when confidence is low?
when does the system escalate instead of asking again?

If those rules aren’t explicit, the model will improvise. Improvisation is expensive.

The pattern I trust now

If I’m building something for n8n, Make, Zapier, OpenClaw, or a custom Python worker, this is the default shape I want.

1. One supervisor owns the outcome

Only one agent can:

approve final output
reopen work
terminate the run
decide whether another worker gets involved

That agent is the coordinator. Not a peer.

2. Workers get narrow jobs

Good worker roles:

Web Search Agent
Data Extraction Agent
SQL Agent
Refund Agent
Writer Agent
Code Review Agent

Bad worker role:

General Strategic Collaborator That Can Reinterpret The Whole Task

Workers should do one thing well.

3. Workers return artifacts, not workflow opinions

I want outputs like:

ranked source list
JSON extraction
SQL result
draft with citations
code diff
confidence score

I do not want:

I think the Planning Agent should ask the Reviewer Agent whether we should revisit the earlier assumptions.

That’s how you turn a 3-step task into synthetic middle management.

4. Memory has an owner

Either:

the supervisor writes memory, or
a dedicated memory curator writes memory

Workers can propose memory events.
They should not casually commit them.

A simple shape looks like this:

{
  "event_type": "user_preference_candidate",
  "scope": "account_level",
  "evidence": ["User explicitly requested CSV export twice"],
  "confidence": 0.82,
  "proposed_by": "support_agent"
}

Then a higher-authority component decides whether that becomes durable memory.

5. Every boundary gets a checkpoint

If an agent:

calls a tool
writes a file
hits an API
triggers another agent
updates memory

that boundary should be inspectable and stoppable.

Not later. At that point.

A practical implementation sketch

Here’s a simple orchestrator pattern in Python-style pseudocode:

class Supervisor:
    def __init__(self, workers, memory_store):
        self.workers = workers
        self.memory_store = memory_store

    def run(self, task):
        plan = self.make_plan(task)

        for step in plan:
            result = self.dispatch(step)

            if not self.validate(result, step):
                return {"status": "failed", "reason": "validation_failed"}

            if result.get("memory_event"):
                self.review_memory_event(result["memory_event"])

        return self.finalize(plan)

    def dispatch(self, step):
        worker = self.workers[step["worker"]]
        return worker.execute(step["input"])

    def review_memory_event(self, event):
        if event["confidence"] > 0.9:
            self.memory_store.write(event)

And a worker stays narrow:

class SearchWorker:
    def execute(self, input_data):
        query = input_data["query"]
        results = web_search(query)
        return {
            "sources": results[:5],
            "confidence": 0.77,
            "memory_event": None
        }

That’s boring by design.

Boring is good.

Why this matters even more in automations

This gets more important when agents are running inside automations.

If you’re wiring LLM workflows into:

n8n
Make
Zapier
OpenClaw
custom cron jobs
queue workers
internal support pipelines

you usually care less about “agent personality” and more about:

predictable completion
bounded retries
observable failure modes
stable cost
easy debugging

That’s why hierarchical orchestration wins so often.

It fails in ways you can inspect.

Peer-to-peer systems often fail as conversation sprawl. Those are much harder to debug, and much more annoying to pay for.

The cost problem nobody mentions enough

This is where architecture decisions hit your bill.

Loose multi-agent collaboration burns tokens fast.

Every sideways clarification, every repeated summary, every “just checking alignment” turn adds cost. If you’re on per-token pricing, bad org design doesn’t just create bugs. It creates a billing problem.

That’s one reason this topic matters for teams running production agents and automations.

When you have workflows running all day in n8n, Make, Zapier, or custom workers, you want the system to do useful work, not generate internal meetings.

That’s also why predictable pricing changes how you design these systems. If you’re using a drop-in OpenAI-compatible API like Standard Compute, you can be more aggressive about running real automations continuously without babysitting token spend. Flat monthly pricing removes a lot of the hesitation around agent-heavy workloads.

But even then, I’d still fix the org chart first.

Unlimited compute is great.
Wasted compute is still wasted.

Which pattern is better?

Short version:

Approach	What it gets right
CrewAI Hierarchical Process	Manager coordinates and validates work; hierarchy is explicit; authority is clear
OpenAI Agents SDK Handoffs	Delegation happens through explicit transfer boundaries; guardrails can be attached where work crosses tools
AutoGen Mixture of Agents	Single orchestrator plus worker layers is easier to reason about than free-form group chat
OpenClaw Scoped Agents	Per-agent state and bindings encourage isolation instead of one giant shared memory blob

If I need a workflow that runs reliably at 2 a.m., I’m picking hierarchical orchestration over peer collaboration most of the time.

Not because it’s prettier.
Because I can debug it.

Are some loops still prompt problems?

Yes.

Sometimes the issue really is:

weak tool schema
bad retry logic
sloppy termination condition
poor context packing
model unreliability

Not every loop is an org-chart problem.

But a surprising number are.

And I think people reach for prompt tuning way too early because it feels easier than admitting the system has no real authority structure.

My current rule of thumb

If a task spans 2 or more delegated steps, I default to this:

one supervisor
narrow workers
explicit handoffs
scoped memory
hard checkpoints
explicit termination

Only after that do I start tuning prompts.

Because the counterintuitive fix for looping agents is usually giving them less freedom, not more.

That can feel like you’re making the system less intelligent.

In practice, you’re making it less political.

And that’s usually what production agent systems need.

If your agents keep looping, I wouldn’t start by rewriting the personality prompt.

I’d redraw the org chart.

DEV Community