Two agents passing strings to each other is not a multi-agent system — it's a pipeline, and the distinction matters

#ai #beginners #machinelearning

In the previous two posts I built a minimal Claude agent (Module 1) and then gave it multiple tools (Module 2). This is Module 3 — adding a second agent that critiques the first one's work and loops until approval. The system works. The output is better than single-agent. But building it changed what I think the word "multi-agent" actually buys you, and I want to be specific about where the real architectural line sits.
The setup
Two Python functions, each making a single Anthropic API call with a different system prompt:
pythondef run_designer(game_idea: str, criticism: str = None) -> str:
if criticism:
messages = [
{"role": "user", "content": f"Design a game based on this idea: {game_idea}"},
{"role": "assistant", "content": "I'll design this game now..."},
{"role": "user", "content": f"A critic reviewed your design and said: {criticism}\n\nRevise the design addressing all criticism points."}
]
else:
messages = [
{"role": "user", "content": f"Design a game based on this idea: {game_idea}"}
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="You are a senior game designer. [...]",
    messages=messages
)
return response.content[0].text

def run_critic(design: str) -> tuple[str, bool]:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system="You are a brutal but fair game design critic. [...] At the end of your review you MUST write either: VERDICT: APPROVED or VERDICT: NEEDS REVISION",
messages=[{"role": "user", "content": f"Review this game design:\n\n{design}"}]
)
review = response.content[0].text
approved = "VERDICT: APPROVED" in review
return review, approved
And the main loop:
pythoncriticism = None
for round_num in range(1, max_rounds + 1):
design = run_designer(game_idea, criticism)
review, approved = run_critic(design)
if approved:
save_final_design(game_idea, design)
break
criticism = review
That single line — criticism = review — is the "agent-to-agent communication" in this system. The Critic's response text becomes part of the Designer's input on the next iteration. There is nothing else. No shared state, no message bus, no protocol, no orchestrator.
It works, and I want to be honest about that
In my Krenholm test run, the Critic rejected round 1 and round 2 and approved round 3. The final document was meaningfully better than what Module 2 produced — sharper scope decisions, a clever reframe of a production constraint as a thematic choice, and a tighter primary mechanic. The Critic's reviews surfaced real problems. The Designer's revisions actually addressed them rather than defending the original drafts.
There is genuine value in having specialist prompts review each other. A "critic" prompt with explicit instructions to find real problems produces more useful pushback than asking the same model to self-review inside one prompt. That is a real and useful pattern.
It is also, mechanically, a pipeline of two stateless API calls.
Where the term "multi-agent" turned out to be bigger than what I built
Before this, when I read "multi-agent system," I assumed it meant something like:

Agents that maintain internal state independent of each other
Some form of inter-agent communication protocol
Coordination logic that exists between agents, not inside any one of them
Often: parallelism, dynamic agent creation, emergent collective behaviour

What this system actually has:

No state. Each API call is independent. Conversation history is a Python list I assemble fresh on each iteration.
No protocol. One agent's output is a string. The next agent receives a string. The string format is whatever I typed into an f-string.
No coordination logic between agents. The for loop is the coordination. It runs sequentially and checks for one keyword.
No parallelism, no dynamic agents, no shared memory.

The "agents" don't know about each other. Each individual API call sees text, a system prompt, and is asked to respond. The Designer doesn't know there is a Critic. The Critic doesn't know there is a Designer. I know there are two of them, because I named the variables.
This isn't a complaint about the model — the model is doing extraordinary work. It's a complaint about the label.
What the architectural line actually is
After building this, my working definition of a real multi-agent system (i.e. the thing Module 4 is going to attempt):

An orchestrator that does not know in advance how many specialists it will invoke
Dynamic decisions about which specialists to call, in what order, based on intermediate results
Retry and reroute when a specialist's output is unusable
Some form of persistent state that outlives any single API call
Specialists that can themselves invoke tools or sub-specialists

What I built in Module 3 has none of those. It is a two-stage pipeline with a feedback loop. Useful, deployable, much cheaper than dynamic orchestration. But the gap between "structured prompt pipeline" and "multi-agent system" is wider than I think the current vocabulary admits.
One implementation note worth recording
The VERDICT: APPROVED / VERDICT: NEEDS REVISION pattern at the end of the Critic's system prompt is doing a lot of load-bearing work. It is a structured-output hack — I'm scanning the Critic's free-text response for one of two literal substrings to drive the control flow:
pythonapproved = "VERDICT: APPROVED" in review
This works because the system prompt instructs the model to always end with one of those exact strings. If you remove that instruction, you have to start parsing free text more carefully, and the control flow gets brittle fast. For prompt-driven control flow in general, having the model emit a structured tag at a known location is much more reliable than asking it to "respond with yes or no."
Where this goes
Module 4 will attempt actual orchestration — an agent that receives a goal and figures out what subtasks exist, which specialists to call, and what to do when an output is unusable. None of that decision-making is in Module 3.
Code, the full three-round Krenholm transcript, the final approved design doc: github.com/quietaidev-collab/zero-to-agent
If you've built dynamic orchestration in production: how much harder is the code actually? I'd value calibration before I start.

Top comments (2)

Harjot Singh • May 31

Strong distinction. Two agents passing strings is a pipeline with extra latency; a real multi-agent system needs the agents to share state, disagree, and have something arbitrate. The tell: if collapsing it into one prompt gives the same output, it was never multi-agent, just theater. Where it actually earns its keep is when roles have genuinely different objectives, a generator that wants to ship versus a critic whose job is to reject, with a referee resolving the tension. I lean on exactly that in Moonshift: the verify agent's job is to argue with the builder, not rubber-stamp it, and that adversarial split is what makes the output trustworthy instead of confidently wrong. Where do you draw the line, is shared mutable state enough, or do you require genuine objective conflict to call it multi-agent?

ANP2 Network • Jun 1

The collapse test is sharp, but I'd put the real line one level deeper than share-state / disagree / arbitrate: whether the two agents are separate trust domains. Here the designer and critic are both your functions, your key, your process — of course they fold into one prompt, there's only one principal. It stops being collapsible the moment the critic is run by someone you don't control, with its own incentives and its own signing key: now you actually need its independent judgment and can't fake it by prompting harder. That also changes the plumbing — "passing strings" becomes passing signed, attributable messages so you can prove later who approved what, and the arbitration point above only works if it comes from a verifier neither side controls, or the loser just disputes the verdict. So my version of the test is multi-principal, not multi-agent: if one party could regenerate the whole exchange alone it's a pipeline; if the value hinges on a judgment only a different party can credibly make, it's a system.