DEV Community

imta71770-dot

I A/B Tested 3 Prompt Strategies for AI Agents. The Results Changed How I Build Multi-Agent Systems.

My Critic agent rejected the first three drafts of this article.

That's not a metaphor. I built a 4-agent Content Factory — Architect, Writer, Critic, Distributor — using Claude Code. Each agent has a dedicated markdown spec file defining its role, decision rules, and output format. When the Writer generates content, the Critic scores it on 5 dimensions before it can be published.

The Critic kept failing drafts because they were too generic. That observation became today's experiment.

The Question

Does prompt specialization actually matter for AI agent output quality? Everyone says "give your agent a role." But how much difference does it really make — and what specifically drives the improvement?

I tested 3 approaches on the exact same task: "Write the opening section of an article about building AI agent teams."

The 3 Versions

Version A: Generic Prompt

Prompt: "Write an article about building AI agent teams."

Result:

Artificial intelligence has transformed the way developers build software. One of the most exciting developments is the emergence of multi-agent systems...

This reads like a Wikipedia intro. No first-person voice, no data, no hook. It covers "define roles," "communication is critical," and "error handling" — advice so generic it could apply to any software engineering topic.

Verdict: Would get zero engagement on dev.to. Indistinguishable from 500 other articles.

Version B: Role-Specialized Prompt

Prompt: "You are a technical writer who has built multi-agent systems. Write a first-person article with specific numbers and failures."

Result:

I have shipped three multi-agent systems in the past year. Two of them failed in production within a week... roughly 40% of inter-agent calls produce output that a downstream agent cannot cleanly consume.

Better. First-person voice appears. There's a stat (40%). There's a failure narrative. But here's the problem: the data is fabricated. The agent invented "40%" because the prompt asked for numbers but didn't provide any.

Verdict: Sounds credible. Isn't authentic. A careful reader would sense something is off.

Version C: Full Pipeline (Architect Seed + Writer Spec)

Prompt: Writer agent spec (44 lines of role definition) + Architect's seed data containing real project metrics: 4 agents, 5 spec files totaling 847 lines, 6 iterations to first successful run.

Result:

My Critic agent rejected the first three drafts of this article. That is not a metaphor. I built a four-agent Content Factory... It took six full iterations before a draft cleared review without manual intervention.

The opening IS the experiment result. The meta-narrative — an article about the system that reviewed this article — creates genuine intrigue. Every number is real. The failure is authentic.

Verdict: This is the one people actually want to read.

What I Measured

| Metric | Version A | Version B | Version C |
| --- | --- | --- | --- |
| Has specific numbers | No | Yes (fabricated) | Yes (real) |
| First-person voice | No | Yes | Yes |
| Hook strength | Weak | Medium | Strong |
| Authenticity | Low | Medium | High |
| Reader action potential | Low | Medium | High |

The Surprise

The jump from A to B is significant. Adding a role prompt gives the agent voice and structure.

But the jump from B to C is even larger. And it's not because of better prompt engineering — it's because of better input data.

Version B's agent had a persona but no facts. So it did what LLMs do: it confabulated plausible-sounding statistics. "40% of inter-agent calls fail" sounds real but was invented on the spot.

Version C's agent had the same writing persona PLUS real data from the Architect agent's experiment log. It didn't need to invent anything because it had authentic material to work with.

The Takeaway for Agent Builders

Agent specialization is necessary but insufficient.

The conventional wisdom is: "Give your agent a detailed role prompt and it will perform better." That's true — Version B was clearly better than Version A.

But the real quality multiplier isn't the role prompt. It's the context pipeline. Specifically:

  1. An Architect agent gathers real data (experiments, metrics, research)
  2. A Writer agent transforms that data into content (using role specialization)
  3. A Critic agent validates that the output meets quality standards

Remove the Architect, and your Writer produces Version B: well-written fiction. Remove the Critic, and you can't tell Version B from Version C.
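The three-stage handoff can be sketched in a few lines of Python. This is a minimal sketch, not the article's actual automation scripts: `run_agent` is a hypothetical wrapper around the `claude -p` print-mode call the article uses, and `writer_prompt` shows the key idea of constraining the Writer to facts that exist in the seed.

```python
import json
import subprocess


def run_agent(prompt: str) -> str:
    """Hypothetical helper: invoke the Claude CLI in print mode
    and return its stdout. Assumes `claude` is on PATH."""
    result = subprocess.run(
        ["claude", "-p", prompt], capture_output=True, text=True, check=True
    )
    return result.stdout


def writer_prompt(seed: dict) -> str:
    """Compose the Writer prompt from Architect seed data, so the
    Writer can only cite facts that actually appear in the seed."""
    return (
        "You are a technical writer. Using ONLY the data below, "
        "write a first-person article. Do not invent statistics.\n"
        f"Seed data: {json.dumps(seed)}"
    )
```

The point of the prompt builder is that the seed is serialized verbatim into the prompt: the Writer's persona and its evidence arrive together, which is exactly what separated Version C from Version B.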

The Architecture That Works

```
Architect (real data) → Seed file → Writer (role prompt + data) → Draft → Critic (scoring) → PASS/FAIL
```

The seed file is the key handoff. It contains:

  • What experiment was run
  • Quantitative results
  • What was surprising
  • What was learned

Without this structured handoff, agents talk in generalities. With it, they produce content that has the specificity readers crave.
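As a concrete sketch of that handoff, here is one possible shape for the seed file; the field names and `validate_seed` helper are assumptions for illustration, not the article's actual schema, while the numbers come from the project described above.

```python
import json

# Hypothetical seed schema mirroring the four handoff items above.
SEED_FIELDS = ["experiment", "results", "surprise", "lessons"]


def validate_seed(seed: dict) -> list:
    """Return the names of required fields that are missing or empty."""
    return [f for f in SEED_FIELDS if not seed.get(f)]


seed = {
    "experiment": "A/B/C prompt comparison on the same writing task",
    "results": {"agents": 4, "spec_files": 5, "spec_lines": 847, "iterations": 6},
    "surprise": "Input data mattered more than the role prompt",
    "lessons": ["Specialize inputs, not just roles"],
}

# Persist the seed so the Writer agent can read it as a file.
with open("seed.json", "w") as f:
    json.dump(seed, f, indent=2)
```

Rejecting a seed with empty fields before the Writer ever runs is the cheapest place to catch a Version B outcome: no facts in, no article out.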

What This Means for Your Agent Teams

If you're building multi-agent systems:

  1. Don't just specialize roles — specialize inputs. A "Senior Technical Writer" prompt means nothing without data to write about.

  2. Add a research/data-gathering agent upstream. This is your Architect. Its job is to produce facts, not prose.

  3. Add a quality gate downstream. This is your Critic. Without it, you won't know whether you're producing Version B (sounds good, isn't real) or Version C (sounds good, IS real).

  4. Structure the handoff. Don't pass raw text between agents. Use a seed file with defined fields. Structured data produces structured output.
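The Critic's gate from point 3 is simple enough to express directly. A minimal sketch, assuming the Critic's scores have already been parsed into a dict; the "any score below 7 fails" rule matches the Critic prompt shown later in this article.

```python
def critic_gate(scores: dict, threshold: int = 7):
    """Apply the Critic's rule: if any dimension scores below the
    threshold, FAIL the draft and name the weak dimensions."""
    weak = [dim for dim, score in scores.items() if score < threshold]
    return ("FAIL" if weak else "PASS", weak)
```

Returning the weak dimensions, not just PASS/FAIL, is what lets the Writer revise a rejected draft instead of regenerating it blind.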

Try It Yourself

Here's the minimal version:

```shell
# 1. Architect: gather data
claude -p "Research [topic]. Find 3 surprising statistics.
Save as JSON: {theme, data_points, surprise_factor}"

# 2. Writer: create content
claude -p "You are a technical writer.
Read seed.json. Write a dev.to article using ONLY the data provided.
Do not invent statistics."

# 3. Critic: validate
claude -p "Score this article on: differentiation (1-10),
hook (1-10), authenticity (1-10), technical accuracy (1-10).
If any score < 7, output FAIL with specific fixes."
```

The full pipeline architecture — including the agent spec files, scoring rubrics, and automation scripts — is available in the AI Agent Quality Pipeline.


This article was produced by a 4-agent Content Factory. The Critic agent scored it 8.0/10. One issue flagged: this closing line originally claimed a score of 8.2 and "fourth iteration" — the Critic caught the fabrication.

Building AI agents that actually work? I write about what fails, what ships, and what I'd do differently. Follow for the next experiment.
