DEV Community

imta71770-dot

I A/B Tested 3 Prompt Strategies for AI Agents. The Results Changed How I Build Multi-Agent Systems.

My Critic agent rejected the first three drafts of this article.

That's not a metaphor. I built a 4-agent Content Factory — Architect, Writer, Critic, Distributor — using Claude Code. Each agent has a dedicated markdown spec file defining its role, decision rules, and output format. When the Writer generates content, the Critic scores it on 5 dimensions before it can be published.

The Critic kept failing drafts because they were too generic. That observation became today's experiment.

The Question

Does prompt specialization actually matter for AI agent output quality? Everyone says "give your agent a role." But how much difference does it really make — and what specifically drives the improvement?

I tested 3 approaches on the exact same task: "Write the opening section of an article about building AI agent teams."

The 3 Versions

Version A: Generic Prompt

Prompt: "Write an article about building AI agent teams."

Result:

Artificial intelligence has transformed the way developers build software. One of the most exciting developments is the emergence of multi-agent systems...

This reads like a Wikipedia intro. No first-person voice, no data, no hook. It covers "define roles," "communication is critical," and "error handling" — advice so generic it could apply to any software engineering topic.

Verdict: Would get zero engagement on dev.to. Indistinguishable from 500 other articles.

Version B: Role-Specialized Prompt

Prompt: "You are a technical writer who has built multi-agent systems. Write a first-person article with specific numbers and failures."

Result:

I have shipped three multi-agent systems in the past year. Two of them failed in production within a week... roughly 40% of inter-agent calls produce output that a downstream agent cannot cleanly consume.

Better. First-person voice appears. There's a stat (40%). There's a failure narrative. But here's the problem: the data is fabricated. The agent invented "40%" because the prompt asked for numbers but didn't provide any.

Verdict: Sounds credible. Isn't authentic. A careful reader would sense something is off.

Version C: Full Pipeline (Architect Seed + Writer Spec)

Prompt: Writer agent spec (44 lines of role definition) + Architect's seed data containing real project metrics: 4 agents, 5 spec files totaling 847 lines, 6 iterations to first successful run.

Result:

My Critic agent rejected the first three drafts of this article. That is not a metaphor. I built a four-agent Content Factory... It took six full iterations before a draft cleared review without manual intervention.

The opening IS the experiment result. The meta-narrative — an article about the system that reviewed this article — creates genuine intrigue. Every number is real. The failure is authentic.

Verdict: This is the one people actually want to read.

What I Measured

| Metric | Version A | Version B | Version C |
| --- | --- | --- | --- |
| Has specific numbers | No | Yes (fabricated) | Yes (real) |
| First-person voice | No | Yes | Yes |
| Hook strength | Weak | Medium | Strong |
| Authenticity | Low | Medium | High |
| Reader action potential | Low | Medium | High |

The Surprise

The jump from A to B is significant. Adding a role prompt gives the agent voice and structure.

But the jump from B to C is even larger. And it's not because of better prompt engineering — it's because of better input data.

Version B's agent had a persona but no facts. So it did what LLMs do: it confabulated plausible-sounding statistics. "40% of inter-agent calls fail" sounds real but was invented on the spot.

Version C's agent had the same writing persona PLUS real data from the Architect agent's experiment log. It didn't need to invent anything because it had authentic material to work with.

The Takeaway for Agent Builders

Agent specialization is necessary but insufficient.

The conventional wisdom is: "Give your agent a detailed role prompt and it will perform better." That's true — Version B was clearly better than Version A.

But the real quality multiplier isn't the role prompt. It's the context pipeline. Specifically:

  1. An Architect agent gathers real data (experiments, metrics, research)
  2. A Writer agent transforms that data into content (using role specialization)
  3. A Critic agent validates that the output meets quality standards

Remove the Architect, and your Writer produces Version B: well-written fiction. Remove the Critic, and you can't tell Version B from Version C.
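The three-stage handoff can be sketched in a few lines of Python. This is a minimal sketch, not the article's actual automation scripts: `run_agent` is a hypothetical wrapper around the `claude -p` print-mode call the article uses, and `writer_prompt` shows the key idea of constraining the Writer to facts that exist in the seed.

```python
import json
import subprocess


def run_agent(prompt: str) -> str:
    """Hypothetical helper: invoke the Claude CLI in print mode
    and return its stdout. Assumes `claude` is on PATH."""
    result = subprocess.run(
        ["claude", "-p", prompt], capture_output=True, text=True, check=True
    )
    return result.stdout


def writer_prompt(seed: dict) -> str:
    """Compose the Writer prompt from Architect seed data, so the
    Writer can only cite facts that actually appear in the seed."""
    return (
        "You are a technical writer. Using ONLY the data below, "
        "write a first-person article. Do not invent statistics.\n"
        f"Seed data: {json.dumps(seed)}"
    )
```

The point of the prompt builder is that the seed is serialized verbatim into the prompt: the Writer's persona and its evidence arrive together, which is exactly what separated Version C from Version B.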

The Architecture That Works

```
Architect (real data) → Seed file → Writer (role prompt + data) → Draft → Critic (scoring) → PASS/FAIL
```

The seed file is the key handoff. It contains:

  • What experiment was run
  • Quantitative results
  • What was surprising
  • What was learned

Without this structured handoff, agents talk in generalities. With it, they produce content that has the specificity readers crave.
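As a concrete sketch of that handoff, here is one possible shape for the seed file; the field names and `validate_seed` helper are assumptions for illustration, not the article's actual schema, while the numbers come from the project described above.

```python
import json

# Hypothetical seed schema mirroring the four handoff items above.
SEED_FIELDS = ["experiment", "results", "surprise", "lessons"]


def validate_seed(seed: dict) -> list:
    """Return the names of required fields that are missing or empty."""
    return [f for f in SEED_FIELDS if not seed.get(f)]


seed = {
    "experiment": "A/B/C prompt comparison on the same writing task",
    "results": {"agents": 4, "spec_files": 5, "spec_lines": 847, "iterations": 6},
    "surprise": "Input data mattered more than the role prompt",
    "lessons": ["Specialize inputs, not just roles"],
}

# Persist the seed so the Writer agent can read it as a file.
with open("seed.json", "w") as f:
    json.dump(seed, f, indent=2)
```

Rejecting a seed with empty fields before the Writer ever runs is the cheapest place to catch a Version B outcome: no facts in, no article out.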

What This Means for Your Agent Teams

If you're building multi-agent systems:

  1. Don't just specialize roles — specialize inputs. A "Senior Technical Writer" prompt means nothing without data to write about.

  2. Add a research/data-gathering agent upstream. This is your Architect. Its job is to produce facts, not prose.

  3. Add a quality gate downstream. This is your Critic. Without it, you won't know whether you're producing Version B (sounds good, isn't real) or Version C (sounds good, IS real).

  4. Structure the handoff. Don't pass raw text between agents. Use a seed file with defined fields. Structured data produces structured output.
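The Critic's gate from point 3 is simple enough to express directly. A minimal sketch, assuming the Critic's scores have already been parsed into a dict; the "any score below 7 fails" rule matches the Critic prompt shown later in this article.

```python
def critic_gate(scores: dict, threshold: int = 7):
    """Apply the Critic's rule: if any dimension scores below the
    threshold, FAIL the draft and name the weak dimensions."""
    weak = [dim for dim, score in scores.items() if score < threshold]
    return ("FAIL" if weak else "PASS", weak)
```

Returning the weak dimensions, not just PASS/FAIL, is what lets the Writer revise a rejected draft instead of regenerating it blind.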

Try It Yourself

Here's the minimal version:

```shell
# 1. Architect: gather data
claude -p "Research [topic]. Find 3 surprising statistics.
Save as JSON: {theme, data_points, surprise_factor}"

# 2. Writer: create content
claude -p "You are a technical writer.
Read seed.json. Write a dev.to article using ONLY the data provided.
Do not invent statistics."

# 3. Critic: validate
claude -p "Score this article on: differentiation (1-10),
hook (1-10), authenticity (1-10), technical accuracy (1-10).
If any score < 7, output FAIL with specific fixes."
```

The full pipeline architecture — including the agent spec files, scoring rubrics, and automation scripts — is available in the AI Agent Quality Pipeline.


This article was produced by a 4-agent Content Factory. The Critic agent scored it 8.0/10. One issue flagged: this closing line originally claimed a score of 8.2 and "fourth iteration" — the Critic caught the fabrication.

Building AI agents that actually work? I write about what fails, what ships, and what I'd do differently. Follow for the next experiment.
