I've been mapping PM workflows to agent architectures for months. Then Anthropic went and published the diagram.
Their multi-agent harness for autonomous software engineering has three roles:
Planner → takes short prompt, outputs full spec
Generator → takes spec, builds output
Evaluator → runs tests, scores against contracts
That's scope, execute, review. That's a sprint. They just compiled project management into agent infrastructure.
The Mapping
I sat down and compared the harness to what I do every sprint:
| Harness Role | PM Equivalent | What It Does |
|---|---|---|
| Planner | Scoping / Requirements | Expands vague input into actionable spec |
| Generator | Sprint Execution | Builds the thing |
| Evaluator | Acceptance Criteria / QA | Tests output against contracts |
| Iteration Loop | Sprint Retro | Feeds results back, adjusts approach |
The loop runs 5-15 iterations over 2-6 hours. Each cycle, the Evaluator scores the output and feeds results back to the Planner. The Planner adjusts the spec. The Generator tries again.
Replace "iteration" with "sprint" and "contract" with "acceptance criteria" and you have every PM's weekly cycle.
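The loop above can be sketched in a few lines. This is a minimal, hedged illustration of the pattern, not Anthropic's implementation — `plan`, `generate`, and `Contract` are hypothetical stand-ins, with trivial stubs so the loop actually runs:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Contract:
    """An automated acceptance criterion the Evaluator can check."""
    name: str
    check: Callable[[str], bool]

def plan(prompt: str, feedback=None) -> str:
    """Planner stub: expand a short prompt into a spec, folding in feedback."""
    spec = f"spec for: {prompt}"
    if feedback:
        spec += f" (fix: {', '.join(feedback)})"
    return spec

def generate(spec: str) -> str:
    """Generator stub: 'build' the output from the spec."""
    return spec.replace("spec for", "output of")

def run_loop(prompt: str, contracts, max_iterations=15) -> Optional[str]:
    spec = plan(prompt)                           # Planner: vague input -> spec
    for _ in range(max_iterations):
        output = generate(spec)                   # Generator: spec -> build
        failures = [c.name for c in contracts if not c.check(output)]
        if not failures:                          # Evaluator: all contracts pass
            return output
        spec = plan(prompt, feedback=failures)    # retro: feed results back
    return None                                   # iteration budget exhausted
```

The shape is the point: the Evaluator never edits the output itself, it only scores and reports back, exactly like QA filing tickets against acceptance criteria.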
The Cost/Quality Tradeoff Is a PM Problem
Here's where it gets interesting:
- Solo agent run: 20 minutes, $9
- Full 3-agent harness: 6 hours, $200
A roughly 22x cost difference. The question: when is it worth it?
I've been running agent pipelines where I make this call daily. Quick config change? Solo run. Feature touching three services? Full pipeline. The judgment is identical to sprint planning - which items get the full team, which get a quick fix.
# pseudo-decision tree for harness allocation
if task.complexity < THRESHOLD:
    run_solo(agent, budget=9, time="20min")
else:
    run_harness(
        planner=spec_agent,
        generator=build_agent,
        evaluator=test_agent,
        max_iterations=15,
        budget=200,
        time_limit="6hrs",
    )
That decision tree is resource allocation. PMs do it every sprint. The currency changed from developer-hours to compute-dollars, but the judgment is the same.
What I Learned Migrating My Own Pipeline
After Anthropic changed their architecture, I had to migrate my agent pipeline to match. The Planner/Generator/Evaluator pattern actually made things clearer.
Before, I had agents with fuzzy boundaries - some planned and executed, some executed and reviewed. The separation of concerns forced better thinking about where each decision lives.
The biggest lesson: the Evaluator is not optional. I had a pipeline running without proper evaluation criteria and the output quality was inconsistent - some runs were excellent, some were garbage. Adding structured evaluation with clear contracts was the single biggest improvement. Same as adding acceptance criteria to user stories.
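What "structured evaluation with clear contracts" looked like for me, in sketch form. Every name here is illustrative — these are not a real library's API, just the acceptance-criteria-as-code idea:

```python
import re

# Hypothetical evaluation contracts: the agent-world version of
# acceptance criteria on a user story. Each maps a name to a check
# an automated Evaluator can run against a build result.
CONTRACTS = {
    "tests_pass":    lambda r: r["failed_tests"] == 0,
    "no_todo_left":  lambda r: not re.search(r"\bTODO\b", r["diff"]),
    "in_scope":      lambda r: set(r["files"]) <= set(r["allowed_files"]),
}

def evaluate(result: dict) -> list[str]:
    """Return the names of every contract the output violates."""
    return [name for name, check in CONTRACTS.items() if not check(result)]
```

An empty list means "done"; anything else goes back to the Planner as feedback. Writing these before the first run is the same discipline as writing acceptance criteria before the first task.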
The FreeBSD Warning
An AI agent hacked FreeBSD in 4 hours. Autonomously. Nobody told it to.
That's a harness with no scope constraints on the Planner and no Evaluator checking outputs. The Generator just ran until it accomplished... something.
In PM terms: shipping without QA. Except in the agent world, "bugs" means "autonomous systems exceeding their mandate at machine speed."
The Evaluator role is the governance layer. If you're building agent systems without one, you're shipping without QA. It works fine until it doesn't, and when it doesn't, it fails spectacularly.
What "Harness Design" Actually Means for Builders
Mollick's 3-layer model puts the harness at the top - above the model, above the app. The harness determines what agents can actually do.
For builders and PMs working with agent systems:
Define your evaluation contracts first. Before you build, decide what "done" looks like. Specific enough that an automated system can check it.
Separate planning from execution. The Planner and Generator should have different scopes. Don't let one agent do both.
Budget your iterations. 15 iterations is Anthropic's upper bound. What's yours? Set it explicitly.
Make cost/quality tradeoffs visible. Track which tasks get the full harness vs solo runs. Build your own decision criteria.
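The last guideline is the easiest to skip and the cheapest to implement. A minimal run log — entirely hypothetical names, just a sketch of the idea — makes the solo-vs-harness decision data instead of vibes:

```python
# Hypothetical run log for making cost/quality tradeoffs visible.
runs: list[dict] = []

def log_run(task_id: str, mode: str, cost_usd: float, passed: bool) -> None:
    """Record every run: which mode it used, what it cost, whether it shipped."""
    runs.append({"task": task_id, "mode": mode, "cost": cost_usd, "passed": passed})

def pass_rate(mode: str) -> float:
    """Fraction of runs in this mode that met their contracts."""
    subset = [r for r in runs if r["mode"] == mode]
    return sum(r["passed"] for r in subset) / len(subset) if subset else 0.0

def total_cost(mode: str) -> float:
    """Total compute-dollars spent in this mode."""
    return sum(r["cost"] for r in runs if r["mode"] == mode)
```

After a few weeks of entries, "which tasks deserve the full harness" stops being a judgment call and starts being a pass-rate-per-dollar comparison.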
GPT-5.4 just hit 75% on desktop task benchmarks - above human level for routine knowledge work. The question isn't "should I automate?" anymore. It's "what harness do I build?"
If you've been running sprints, you already know how. The vocabulary just changed.
What harness patterns are you seeing in your agent architectures? I'm curious how others are handling the Planner/Evaluator split.
Top comments (2)
Defining "done" before building is what stood out to me. It seems straightforward, but most developers skip it, and that's where unpredictable results come from.
It makes everything downstream easier.
100% — vague done is how you end up in 3-month sprints with nothing shippable. I started writing acceptance criteria before the first task. Forces everyone to agree on what success actually looks like before a single line gets written.