In the previous two posts I built a minimal Claude agent (Module 1) and then gave it multiple tools (Module 2). This is Module 3 — adding a second agent that critiques the first one's work and loops until approval. The system works. The output is better than single-agent. But building it changed what I think the word "multi-agent" actually buys you, and I want to be specific about where the real architectural line sits.
The setup
Two Python functions, each making a single Anthropic API call with a different system prompt:
pythondef run_designer(game_idea: str, criticism: str = None) -> str:
if criticism:
messages = [
{"role": "user", "content": f"Design a game based on this idea: {game_idea}"},
{"role": "assistant", "content": "I'll design this game now..."},
{"role": "user", "content": f"A critic reviewed your design and said: {criticism}\n\nRevise the design addressing all criticism points."}
]
else:
messages = [
{"role": "user", "content": f"Design a game based on this idea: {game_idea}"}
]
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
system="You are a senior game designer. [...]",
messages=messages
)
return response.content[0].text
def run_critic(design: str) -> tuple[str, bool]:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system="You are a brutal but fair game design critic. [...] At the end of your review you MUST write either: VERDICT: APPROVED or VERDICT: NEEDS REVISION",
messages=[{"role": "user", "content": f"Review this game design:\n\n{design}"}]
)
review = response.content[0].text
approved = "VERDICT: APPROVED" in review
return review, approved
And the main loop:
pythoncriticism = None
for round_num in range(1, max_rounds + 1):
design = run_designer(game_idea, criticism)
review, approved = run_critic(design)
if approved:
save_final_design(game_idea, design)
break
criticism = review
That single line — criticism = review — is the "agent-to-agent communication" in this system. The Critic's response text becomes part of the Designer's input on the next iteration. There is nothing else. No shared state, no message bus, no protocol, no orchestrator.
It works, and I want to be honest about that
In my Krenholm test run, the Critic rejected round 1 and round 2 and approved round 3. The final document was meaningfully better than what Module 2 produced — sharper scope decisions, a clever reframe of a production constraint as a thematic choice, and a tighter primary mechanic. The Critic's reviews surfaced real problems. The Designer's revisions actually addressed them rather than defending the original drafts.
There is genuine value in having specialist prompts review each other. A "critic" prompt with explicit instructions to find real problems produces more useful pushback than asking the same model to self-review inside one prompt. That is a real and useful pattern.
It is also, mechanically, a pipeline of two stateless API calls.
Where the term "multi-agent" turned out to be bigger than what I built
Before this, when I read "multi-agent system," I assumed it meant something like:
Agents that maintain internal state independent of each other
Some form of inter-agent communication protocol
Coordination logic that exists between agents, not inside any one of them
Often: parallelism, dynamic agent creation, emergent collective behaviour
What this system actually has:
No state. Each API call is independent. Conversation history is a Python list I assemble fresh on each iteration.
No protocol. One agent's output is a string. The next agent receives a string. The string format is whatever I typed into an f-string.
No coordination logic between agents. The for loop is the coordination. It runs sequentially and checks for one keyword.
No parallelism, no dynamic agents, no shared memory.
The "agents" don't know about each other. Each individual API call sees text, a system prompt, and is asked to respond. The Designer doesn't know there is a Critic. The Critic doesn't know there is a Designer. I know there are two of them, because I named the variables.
This isn't a complaint about the model — the model is doing extraordinary work. It's a complaint about the label.
What the architectural line actually is
After building this, my working definition of a real multi-agent system (i.e. the thing Module 4 is going to attempt):
An orchestrator that does not know in advance how many specialists it will invoke
Dynamic decisions about which specialists to call, in what order, based on intermediate results
Retry and reroute when a specialist's output is unusable
Some form of persistent state that outlives any single API call
Specialists that can themselves invoke tools or sub-specialists
What I built in Module 3 has none of those. It is a two-stage pipeline with a feedback loop. Useful, deployable, much cheaper than dynamic orchestration. But the gap between "structured prompt pipeline" and "multi-agent system" is wider than I think the current vocabulary admits.
One implementation note worth recording
The VERDICT: APPROVED / VERDICT: NEEDS REVISION pattern at the end of the Critic's system prompt is doing a lot of load-bearing work. It is a structured-output hack — I'm scanning the Critic's free-text response for one of two literal substrings to drive the control flow:
pythonapproved = "VERDICT: APPROVED" in review
This works because the system prompt instructs the model to always end with one of those exact strings. If you remove that instruction, you have to start parsing free text more carefully, and the control flow gets brittle fast. For prompt-driven control flow in general, having the model emit a structured tag at a known location is much more reliable than asking it to "respond with yes or no."
Where this goes
Module 4 will attempt actual orchestration — an agent that receives a goal and figures out what subtasks exist, which specialists to call, and what to do when an output is unusable. None of that decision-making is in Module 3.
Code, the full three-round Krenholm transcript, the final approved design doc: github.com/quietaidev-collab/zero-to-agent
If you've built dynamic orchestration in production: how much harder is the code actually? I'd value calibration before I start.
Top comments (0)