DEV Community: Quietai.dev

Two agents passing strings to each other is not a multi-agent system — it's a pipeline, and the distinction matters

Quietai.dev — Sun, 31 May 2026 12:14:23 +0000

In the previous two posts I built a minimal Claude agent (Module 1) and then gave it multiple tools (Module 2). This is Module 3 — adding a second agent that critiques the first one's work and loops until approval. The system works. The output is better than single-agent. But building it changed what I think the word "multi-agent" actually buys you, and I want to be specific about where the real architectural line sits.
The setup
Two Python functions, each making a single Anthropic API call with a different system prompt:
```python
from typing import Optional

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def run_designer(game_idea: str, criticism: Optional[str] = None) -> str:
    if criticism:
        # Revision round. Note the previous design is not replayed: the assistant
        # turn is a placeholder, so the Designer sees only the idea and the critique.
        messages = [
            {"role": "user", "content": f"Design a game based on this idea: {game_idea}"},
            {"role": "assistant", "content": "I'll design this game now..."},
            {"role": "user", "content": f"A critic reviewed your design and said: {criticism}\n\nRevise the design addressing all criticism points."},
        ]
    else:
        messages = [
            {"role": "user", "content": f"Design a game based on this idea: {game_idea}"}
        ]

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system="You are a senior game designer. [...]",
        messages=messages,
    )
    return response.content[0].text


def run_critic(design: str) -> tuple[str, bool]:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a brutal but fair game design critic. [...] At the end of your review you MUST write either: VERDICT: APPROVED or VERDICT: NEEDS REVISION",
        messages=[{"role": "user", "content": f"Review this game design:\n\n{design}"}],
    )
    review = response.content[0].text
    # Control flow hinges on this literal substring appearing in the review.
    approved = "VERDICT: APPROVED" in review
    return review, approved
```
And the main loop:
```python
criticism = None
for round_num in range(1, max_rounds + 1):
    design = run_designer(game_idea, criticism)
    review, approved = run_critic(design)
    if approved:
        save_final_design(game_idea, design)
        break
    criticism = review
```
That single line — criticism = review — is the "agent-to-agent communication" in this system. The Critic's response text becomes part of the Designer's input on the next iteration. There is nothing else. No shared state, no message bus, no protocol, no orchestrator.
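To make that concrete, the loop can be exercised with no API calls at all. The sketch below is a hypothetical refactor, not the code in the repo: `review_loop`, `fake_designer`, and `fake_critic` are names invented here, and the two fakes stand in for `run_designer` and `run_critic`. The only thing that crosses from one "agent" to the other is a string.

```python
from typing import Callable, Optional, Tuple

Designer = Callable[[str, Optional[str]], str]
Critic = Callable[[str], Tuple[str, bool]]


def review_loop(designer: Designer, critic: Critic, game_idea: str,
                max_rounds: int = 3) -> Tuple[str, bool, int]:
    """Run designer -> critic until approval or until max_rounds is used up.

    Returns (last design, approved flag, rounds used).
    """
    criticism: Optional[str] = None
    design = ""
    approved = False
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        design = designer(game_idea, criticism)
        review, approved = critic(design)
        if approved:
            break
        criticism = review  # the entire "agent-to-agent" channel
    return design, approved, rounds


# Stand-ins for the real API-backed functions (assumptions for illustration).
def fake_designer(game_idea: str, criticism: Optional[str] = None) -> str:
    if criticism is None:
        return f"Draft design for {game_idea}"
    return f"Revised design for {game_idea} (addressing: {criticism})"


def fake_critic(design: str) -> Tuple[str, bool]:
    if design.startswith("Revised"):
        return "Scope is realistic now. VERDICT: APPROVED", True
    return "Scope is too large. VERDICT: NEEDS REVISION", False


design, approved, rounds = review_loop(fake_designer, fake_critic, "a textile mill sim")
print(rounds, approved)
print(design)
```

Passing the two functions in as arguments is a design choice made for this sketch: it lets the control flow be tested separately from the prompts.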
It works, and I want to be honest about that
In my Krenholm test run, the Critic rejected round 1 and round 2 and approved round 3. The final document was meaningfully better than what Module 2 produced — sharper scope decisions, a clever reframe of a production constraint as a thematic choice, and a tighter primary mechanic. The Critic's reviews surfaced real problems. The Designer's revisions actually addressed them rather than defending the original drafts.
There is genuine value in having specialist prompts review each other. A "critic" prompt with explicit instructions to find real problems produces more useful pushback than asking the same model to self-review inside one prompt. That is a real and useful pattern.
It is also, mechanically, a pipeline of two stateless API calls.
Where the term "multi-agent" turned out to be bigger than what I built
Before this, when I read "multi-agent system," I assumed it meant something like:

Agents that maintain internal state independent of each other
Some form of inter-agent communication protocol
Coordination logic that exists between agents, not inside any one of them
Often: parallelism, dynamic agent creation, emergent collective behaviour

What this system actually has:

No state. Each API call is independent. Conversation history is a Python list I assemble fresh on each iteration.
No protocol. One agent's output is a string. The next agent receives a string. The string format is whatever I typed into an f-string.
No coordination logic between agents. The for loop is the coordination. It runs sequentially and checks for one keyword.
No parallelism, no dynamic agents, no shared memory.

The "agents" don't know about each other. Each individual API call sees text, a system prompt, and is asked to respond. The Designer doesn't know there is a Critic. The Critic doesn't know there is a Designer. I know there are two of them, because I named the variables.
This isn't a complaint about the model — the model is doing extraordinary work. It's a complaint about the label.
What the architectural line actually is
After building this, my working definition of a real multi-agent system (i.e. the thing Module 4 is going to attempt):

An orchestrator that does not know in advance how many specialists it will invoke
Dynamic decisions about which specialists to call, in what order, based on intermediate results
Retry and reroute when a specialist's output is unusable
Some form of persistent state that outlives any single API call
Specialists that can themselves invoke tools or sub-specialists

What I built in Module 3 has none of those. It is a two-stage pipeline with a feedback loop. Useful, deployable, much cheaper than dynamic orchestration. But the gap between "structured prompt pipeline" and "multi-agent system" is wider than I think the current vocabulary admits.
One implementation note worth recording
The VERDICT: APPROVED / VERDICT: NEEDS REVISION pattern at the end of the Critic's system prompt is doing a lot of load-bearing work. It is a structured-output hack — I'm scanning the Critic's free-text response for one of two literal substrings to drive the control flow:
```python
approved = "VERDICT: APPROVED" in review
```
This works because the system prompt instructs the model to always end with one of those exact strings. If you remove that instruction, you have to start parsing free text more carefully, and the control flow gets brittle fast. For prompt-driven control flow in general, having the model emit a structured tag at a known location is much more reliable than asking it to "respond with yes or no."
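One way to make the scan a little less fragile is to look for the tag only at the known location, the last non-empty line, instead of anywhere in the text. This is a sketch, not what Module 3 does: `parse_verdict` is a hypothetical helper, and returning `None` for a missing or misplaced tag (so the caller can treat it as "not approved") is my assumption about the safest default.

```python
import re
from typing import Optional

# Matches a line consisting only of the verdict tag, tolerating case
# differences, surrounding whitespace, and markdown emphasis such as **...**.
_VERDICT_RE = re.compile(
    r"^[\s*_]*VERDICT:\s*(APPROVED|NEEDS REVISION)[\s*_.]*$",
    re.IGNORECASE,
)


def parse_verdict(review: str) -> Optional[bool]:
    """Return True/False for a verdict on the last non-empty line, else None."""
    lines = [line for line in review.splitlines() if line.strip()]
    if not lines:
        return None
    match = _VERDICT_RE.match(lines[-1])
    if match is None:
        return None
    return match.group(1).upper() == "APPROVED"


# A review that mentions the approval tag in passing but ends with a rejection.
tricky = "This is not ready for VERDICT: APPROVED yet.\nVERDICT: NEEDS REVISION"
print("VERDICT: APPROVED" in tricky)  # prints True (the substring check is fooled)
print(parse_verdict(tricky))          # prints False
```

In the main loop this would be used as `approved = parse_verdict(review) is True`, so that an unparseable review counts as another revision round.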
Where this goes
Module 4 will attempt actual orchestration — an agent that receives a goal and figures out what subtasks exist, which specialists to call, and what to do when an output is unusable. None of that decision-making is in Module 3.
Code, the full three-round Krenholm transcript, the final approved design doc: github.com/quietaidev-collab/zero-to-agent
If you've built dynamic orchestration in production: how much harder is the code actually? I'd value calibration before I start.

My agent re-ran a tool it didn't like the output of — multi-tool agents and the thing the docs don't tell you about tool descriptions

Quietai.dev — Wed, 27 May 2026 19:47:49 +0000

In the last post I built a minimal Claude agent with one tool, no framework, just the Anthropic SDK. This is the follow-up: three tools, and the discovery that most of the agent's planning behaviour was coming from somewhere I hadn't been paying attention to.

The setup: three tools, no specified order
Module 1's agent had one tool (save_game_design). Module 2 adds two more:
```python
def search_similar_games(genre: str, mechanics: str) -> str:
    # returns reference games for a genre
    ...

def estimate_dev_time(features: list, team_size: int) -> str:
    base_weeks = len(features) * 2
    adjusted = base_weeks / max(team_size, 1)
    solo_note = " Note: as a solo dev, budget 3x this estimate." if team_size == 1 else ""
    return f"Estimated development time: {adjusted:.0f}-{adjusted*1.5:.0f} weeks.{solo_note}"
```
The tool definitions are where it gets interesting. Notice the description fields:
```python
tools = [
    {
        "name": "search_similar_games",
        "description": "Search for similar existing games to use as reference "
                       "for scope and mechanics. Call this early to ground the "
                       "design in reality.",
        ...
    },
    {
        "name": "estimate_dev_time",
        "description": "Estimate development time based on planned features and "
                       "team size. Call this before finalizing the design to "
                       "ensure scope is realistic.",
        ...
    },
    {
        "name": "save_game_design",
        "description": "Save the completed game design document to a file. "
                       "Only call this when the full design is ready.",
        ...
    }
]
```
"Call this early." "Call this before finalizing." "Only call this when the full design is ready." Those phrases are the entire sequencing logic. There is no orchestration code. The system prompt lists the steps as a suggestion but doesn't enforce them. The loop is identical to Module 1 — branch on block.name to dispatch the right function.
What it did with one prompt
Same Krenholm prompt as Module 1. The agent's actual sequence:
[search_similar_games] → Prison Architect, RimWorld, Dwarf Fortress
[estimate_dev_time] → 24-36 weeks (3x for solo dev)
[estimate_dev_time] → 20-30 weeks ← it ran this one again
[save_game_design]
Between the two estimates, the model's text output:

"24–36 weeks for the core is workable, but 3x polish is a serious warning. Let me trim scope smartly — dropping procedural maps and day/night cycle — and re-estimate the leaner version."

It observed a result, evaluated it, cut two features from its own feature list, and re-ran the estimate to verify. The observe → evaluate → adjust loop, with no code telling it to do that. The only thing that makes re-running possible is that the loop keeps going as long as stop_reason == "tool_use" — the agent can call the same tool as many times as it decides to.
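The two numbers in that log are consistent with the formula in estimate_dev_time. Assuming the function shown earlier is exactly what ran and team_size was 1 (the solo-dev note suggests so), 24-36 weeks corresponds to 12 features and 20-30 weeks to 10, which matches the two features the model said it dropped. The feature names below are placeholders; the real lists are in the terminal log, not in this post.

```python
def estimate_dev_time(features: list, team_size: int) -> str:
    # Same formula as the Module 2 tool: two weeks per feature, divided by team size.
    base_weeks = len(features) * 2
    adjusted = base_weeks / max(team_size, 1)
    solo_note = " Note: as a solo dev, budget 3x this estimate." if team_size == 1 else ""
    return f"Estimated development time: {adjusted:.0f}-{adjusted*1.5:.0f} weeks.{solo_note}"


# Placeholder feature names (hypothetical); only the count matters to the formula.
full_scope = [f"feature_{i}" for i in range(12)]
trimmed_scope = full_scope[:10]  # two features cut, as in the transcript

first = estimate_dev_time(full_scope, team_size=1)
second = estimate_dev_time(trimmed_scope, team_size=1)
print(first)   # Estimated development time: 24-36 weeks. Note: as a solo dev, budget 3x this estimate.
print(second)  # Estimated development time: 20-30 weeks. Note: as a solo dev, budget 3x this estimate.
```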
The thing the docs underplay: tool descriptions are planning instructions
I came into Module 2 expecting to write orchestration logic — some state machine deciding "research, then design, then estimate, then save." I wrote none. The sequencing came entirely from the natural-language description fields.
This reframes what a tool definition is. It's not just an API contract telling the model what arguments to pass. The description is a planning hint the model reads when deciding whether and when to call the tool relative to the others. "Call this early" and "call this before finalizing" are, functionally, a plan written in prose.
Concretely: in an earlier run with a vaguer save-tool description, the agent saved the document before finishing the design. Tightening the description to "Only call this when the full design is ready" fixed the ordering without touching the system prompt or the loop. If your multi-tool agent calls things in the wrong order, look at the descriptions before you reach for orchestration code.
One implementation gotcha: don't assume one tool_use block per response
At one point the model said two steps were "independent" and it could run them together. In my run the calls still arrived sequentially — but the API permits multiple tool_use blocks in a single response, so the defensive way to write the loop is to iterate over all of them and return all the corresponding tool_result blocks in the next message:
```python
if response.stop_reason == "tool_use":
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = dispatch(block.name, block.input)  # branch on name
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result
            })
    conversation_history.append({"role": "user", "content": tool_results})
    continue
```
Assume one-tool-per-turn and you'll silently drop calls.
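The `dispatch` call in that loop is not shown in the post. One plausible shape, as a sketch: a dict from tool name to Python function, with unknown names and bad arguments turned into an error string for the model rather than an exception. The three functions here are stubs standing in for the real tools, and returning errors as strings is my assumption, not something the post specifies.

```python
from typing import Any, Callable, Dict


# Stub tools standing in for the real Module 2 functions (illustration only).
def search_similar_games(genre: str, mechanics: str) -> str:
    return f"Reference games for {genre} with {mechanics}: (stub)"


def estimate_dev_time(features: list, team_size: int) -> str:
    weeks = len(features) * 2 / max(team_size, 1)
    return f"Estimated development time: {weeks:.0f}-{weeks * 1.5:.0f} weeks."


def save_game_design(title: str, content: str) -> str:
    return f"(stub) would save '{title}' ({len(content)} characters)"


TOOLS: Dict[str, Callable[..., str]] = {
    "search_similar_games": search_similar_games,
    "estimate_dev_time": estimate_dev_time,
    "save_game_design": save_game_design,
}


def dispatch(name: str, tool_input: Dict[str, Any]) -> str:
    """Call the tool registered under `name` with the model-supplied arguments."""
    tool = TOOLS.get(name)
    if tool is None:
        return f"Error: unknown tool '{name}'"
    try:
        return tool(**tool_input)
    except TypeError as exc:
        # Typically missing or unexpected arguments: report back to the model
        # instead of crashing the loop.
        return f"Error calling {name}: {exc}"


print(dispatch("estimate_dev_time", {"features": ["a", "b"], "team_size": 1}))
print(dispatch("render_trailer", {}))
```

Adding a fourth tool then means one new function and one new dict entry; the loop itself does not change.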
The Windows tax, still being paid
Committing Module 2, PowerShell rejected &&:
The token '&&' is not a valid statement separator in this version.
Every git tutorial uses git add . && git commit && git push. On Windows PowerShell 5.1 that chain is a parse error, so you run them as three separate commands (PowerShell 7 and later accept &&). Minor, but it's the third Windows-specific paper cut in two modules. If you're following along on Windows, expect these.
What it produced
The design document is meaningfully better than Module 1's — not because the model improved, but because it had tools feeding it reality and it used them on its own output. The doc cites the reference games directly in the art direction, includes a realistic dev-time roadmap, and has an "Intentionally Cut Features" section listing what the agent removed to keep scope shippable.
The mechanic that produced all of it is still just the loop: reason, call a tool, read the result, decide whether to keep going. The leverage turned out to be in the prose around the loop, not the loop itself.
Code and full terminal log: github.com/quietaidev-collab/zero-to-agent
Module 3 next: two of these agents talking to each other.

I built a Claude agent without a framework and something I didn't expect to find weird turned out to be weird

Quietai.dev — Fri, 22 May 2026 11:10:31 +0000

I built my first agent last night using the Anthropic Python SDK directly. No LangChain, no CrewAI, no orchestration framework. About 60 lines of code, one tool, one goal. The point was to see the mechanics without abstractions on top of them.
Three things stood out. None of them were what I expected going in.

The conversation history array is the agent
There is no hidden state. The "memory" is a Python list you maintain yourself. Every iteration, you append the model's full response, including any tool_use blocks, and then append any tool results back into the list before the next API call.
```python
conversation_history = []
conversation_history.append({"role": "user", "content": user_input})

while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        system=system_prompt,
        tools=tools,
        messages=conversation_history,
    )

    conversation_history.append({
        "role": "assistant",
        "content": response.content
    })

    if response.stop_reason == "tool_use":
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = save_game_design(**block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })

        conversation_history.append({
            "role": "user",
            "content": tool_results
        })
        continue

    break
```

The non-obvious shape: tool results go back in as a user message, not assistant or system. The content is an array of tool_result blocks keyed by tool_use_id from the previous turn. Get this wrong and the model can't connect its own decision to the result of that decision.
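A cheap guard against getting this shape wrong is to check, before the next API call, that every tool_use id in the assistant turn has a tool_result in the user turn that follows. The sketch below works on plain dicts so it can run without the SDK; `unanswered_tool_calls` is a hypothetical helper and the ids are made up. If you store the SDK's block objects in the history, as the loop above does, you would read `block.type` and `block.id` instead of dict keys.

```python
from typing import Any, Dict, List


def unanswered_tool_calls(assistant_content: List[Dict[str, Any]],
                          user_content: List[Dict[str, Any]]) -> List[str]:
    """Return tool_use ids from the assistant turn that have no tool_result."""
    requested = [b["id"] for b in assistant_content if b.get("type") == "tool_use"]
    answered = {b["tool_use_id"] for b in user_content if b.get("type") == "tool_result"}
    return [tool_id for tool_id in requested if tool_id not in answered]


# Example turns written as plain dicts (ids are invented for illustration).
assistant_turn = [
    {"type": "text", "text": "Saving the design now."},
    {"type": "tool_use", "id": "toolu_example_1", "name": "save_game_design",
     "input": {"title": "Example", "content": "# Example"}},
]
user_turn = [
    {"type": "tool_result", "tool_use_id": "toolu_example_1", "content": "Saved."},
]

print(unanswered_tool_calls(assistant_turn, user_turn))  # prints []
print(unanswered_tool_calls(assistant_turn, []))         # prints ['toolu_example_1']
```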
I expected this to feel like an abstraction. It doesn't. It feels like maintaining a transcript. Which is what it is.

stop_reason == "tool_use" is the entire control flow
The loop terminates when the model's response has any stop_reason other than "tool_use", usually "end_turn". That's it. There is no explicit "task complete" signal.
This is worth sitting with. The agent stops because the model decides not to call any more tools. Whether the task was actually completed correctly is something you have to verify outside the loop. The model's decision to stop is a judgment, not a status code.
In practice this means your system prompt is doing real work. Mine ended with:

Save the design when you think it's ready. Don't wait for permission.

Without that last sentence, the model drafts, then asks "should I save this now?" — which kills the agentic behaviour. With it, the model just calls save_game_design() when it judges the work is done.
The line between "obedient chatbot" and "agent that takes initiative" is one sentence in a system prompt. I had not expected the leverage to be that concentrated.
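Since the loop ending says nothing about whether the work got done, the check has to live after it. A minimal sketch of such a check, assuming the task was "a design file should now exist": `design_was_saved` is a hypothetical helper, and the 200-character threshold is an arbitrary placeholder for whatever "not obviously empty" means for your task.

```python
import tempfile
from pathlib import Path


def design_was_saved(path: Path, min_chars: int = 200) -> bool:
    """True if the file exists and holds at least min_chars characters
    after stripping surrounding whitespace."""
    if not path.is_file():
        return False
    text = path.read_text(encoding="utf-8")
    return len(text.strip()) >= min_chars


# Demonstration against a temporary directory rather than a real agent run.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "design.md"
    before = design_was_saved(target)           # nothing written yet
    target.write_text("# Design\n" + "x" * 300, encoding="utf-8")
    after = design_was_saved(target)            # file present and long enough
    print(before, after)  # prints False True
```

This only verifies that something was written. Whether the document is any good is still a judgment call, which is the point of the section.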

The tool's description field is documentation the model reads
```python
tools = [{
    "name": "save_game_design",
    "description": "Save the completed game design document to a file. "
                   "Call this when you have a complete game design ready.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "The game title, used for the filename"},
            "content": {"type": "string", "description": "The full game design document in markdown format"}
        },
        "required": ["title", "content"]
    }
}]
```
The description field is not for me. It is the only documentation the model has about when to call this tool. My first version had a vaguer description and the agent kept calling save mid-conversation, before the design was actually finished. Tightening it to "Call this when you have a complete game design ready" fixed the behaviour.
Same for the per-parameter descriptions: they steer which arguments the model produces. If your agent calls tools at the wrong times or with weird arguments, the fix is almost always in these strings.
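The strings steer the model, but nothing stops you from also checking its arguments before calling the function. A minimal sketch, assuming the input_schema shape shown above: `missing_required` is a hypothetical helper that only looks at the "required" list and does no type checking.

```python
from typing import Any, Dict, List

# Same tool definition as in the post, repeated so the example runs on its own.
save_tool: Dict[str, Any] = {
    "name": "save_game_design",
    "description": "Save the completed game design document to a file. "
                   "Call this when you have a complete game design ready.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string", "description": "The game title, used for the filename"},
            "content": {"type": "string", "description": "The full game design document in markdown format"},
        },
        "required": ["title", "content"],
    },
}


def missing_required(tool_def: Dict[str, Any], tool_input: Dict[str, Any]) -> List[str]:
    """List the required parameter names that the model did not supply."""
    required = tool_def.get("input_schema", {}).get("required", [])
    return [key for key in required if key not in tool_input]


print(missing_required(save_tool, {"title": "Example"}))  # prints ['content']
```

When the list is non-empty, one option is to return it to the model as the tool_result content so it can retry with complete arguments.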

Three crashes worth saving you
Outdated model string. claude-sonnet-4-20250514 is gone. Current at time of writing: claude-sonnet-4-6. The model itself will confidently give you the old string. Verify against current docs.
Windows file encoding. If the model puts a Unicode character into a file write on Windows, you'll get UnicodeEncodeError: 'charmap' codec can't encode.... Fix:
```python
open(filename, 'w', encoding='utf-8')
```
One parameter.
API key in Git history. Commit .gitignore before .env. If your key ever touches a commit — even one you immediately reverted — rotate it on the Anthropic console. GitHub's secret scanner is good. It is not your security model.
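For the encoding crash in particular, here is a sketch of what a save function with the fix might look like. The real save_game_design is in the repo and may differ; the filename scheme and the `output_dir` parameter here are assumptions. The `title` and `content` parameters follow the tool's input schema.

```python
import re
import tempfile
from pathlib import Path


def save_game_design(title: str, content: str, output_dir: str = ".") -> str:
    """Write the design to <output_dir>/<slug>.md as UTF-8 and report the path."""
    # Turn the title into a conservative filename (assumed scheme, not the repo's).
    slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_").lower() or "design"
    path = Path(output_dir) / f"{slug}.md"
    # encoding="utf-8" avoids UnicodeEncodeError on Windows, where the default
    # text encoding is often a legacy code page such as cp1252.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)
    return f"Design saved to {path}"


# Round-trip a string with non-ASCII characters through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    message = save_game_design("My Game!", "café ✓", output_dir=tmp)
    round_trip = (Path(tmp) / "my_game.md").read_text(encoding="utf-8")
    print(message.startswith("Design saved to"), round_trip == "café ✓")  # prints True True
```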

The actual thing I want to flag
When I ran the working version, I gave the agent a vague prompt about an Estonian factory. It produced a complete game design document for a real 19th century textile mill I had not mentioned — three interlocking game loops, win and lose conditions pulled from historical events, scope estimate, art direction. Then it called save_game_design() on its own. I had been making coffee. When I came back the file was on my hard drive.
What I want to flag is not that this happened. It is that I read what the agent had written and agreed with its judgment that the document was finished.
That's a different sentence than "I was impressed." It is also a sentence I had not expected to be writing about a piece of Python I had just finished debugging.
The frameworks (LangGraph, CrewAI) are elaborations of the loop above with conventions on top. Building the unframeworked version once is the fastest way to understand what those frameworks are actually doing.
Full code and the design document the agent produced: github.com/quietaidev-collab/zero-to-agent