I've been mapping PM workflows to agent architectures for months. Then Anthropic went and published the diagram.
Their multi-agent harness for autonomous software engineering has three roles:
Planner → takes short prompt, outputs full spec
Generator → takes spec, builds output
Evaluator → runs tests, scores against contracts
That's scope, execute, review. That's a sprint. They just compiled project management into agent infrastructure.
The Mapping
I sat down and compared the harness to what I do every sprint:
| Harness Role | PM Equivalent | What It Does |
|---|---|---|
| Planner | Scoping / Requirements | Expands vague input into actionable spec |
| Generator | Sprint Execution | Builds the thing |
| Evaluator | Acceptance Criteria / QA | Tests output against contracts |
| Iteration Loop | Sprint Retro | Feeds results back, adjusts approach |
The loop runs 5-15 iterations over 2-6 hours. Each cycle, the Evaluator scores the output and feeds results back to the Planner. The Planner adjusts the spec. The Generator tries again.
Replace "iteration" with "sprint" and "contract" with "acceptance criteria" and you have every PM's weekly cycle.
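The loop above can be sketched in a few lines. This is a minimal, hedged illustration of the pattern, not Anthropic's implementation — `plan`, `generate`, and `Contract` are hypothetical stand-ins, with trivial stubs so the loop actually runs:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Contract:
    """An automated acceptance criterion the Evaluator can check."""
    name: str
    check: Callable[[str], bool]

def plan(prompt: str, feedback=None) -> str:
    """Planner stub: expand a short prompt into a spec, folding in feedback."""
    spec = f"spec for: {prompt}"
    if feedback:
        spec += f" (fix: {', '.join(feedback)})"
    return spec

def generate(spec: str) -> str:
    """Generator stub: 'build' the output from the spec."""
    return spec.replace("spec for", "output of")

def run_loop(prompt: str, contracts, max_iterations=15) -> Optional[str]:
    spec = plan(prompt)                           # Planner: vague input -> spec
    for _ in range(max_iterations):
        output = generate(spec)                   # Generator: spec -> build
        failures = [c.name for c in contracts if not c.check(output)]
        if not failures:                          # Evaluator: all contracts pass
            return output
        spec = plan(prompt, feedback=failures)    # retro: feed results back
    return None                                   # iteration budget exhausted
```

The shape is the point: the Evaluator never edits the output itself, it only scores and reports back, exactly like QA filing tickets against acceptance criteria.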
The Cost/Quality Tradeoff Is a PM Problem
Here's where it gets interesting:
- Solo agent run: 20 minutes, $9
- Full 3-agent harness: 6 hours, $200
A roughly 22x cost difference. The question: when is it worth it?
I've been running agent pipelines where I make this call daily. Quick config change? Solo run. Feature touching three services? Full pipeline. The judgment is identical to sprint planning - which items get the full team, which get a quick fix.
# pseudo-decision tree for harness allocation
if task.complexity < THRESHOLD:
    run_solo(agent, budget=9, time="20min")
else:
    run_harness(
        planner=spec_agent,
        generator=build_agent,
        evaluator=test_agent,
        max_iterations=15,
        budget=200,
        time_limit="6hrs",
    )
That decision tree is resource allocation. PMs do it every sprint. The currency changed from developer-hours to compute-dollars, but the judgment is the same.
What I Learned Migrating My Own Pipeline
After Anthropic changed their architecture, I had to migrate my agent pipeline to match. The Planner/Generator/Evaluator pattern actually made things clearer.
Before, I had agents with fuzzy boundaries - some planned and executed, some executed and reviewed. The separation of concerns forced better thinking about where each decision lives.
The biggest lesson: the Evaluator is not optional. I had a pipeline running without proper evaluation criteria and the output quality was inconsistent - some runs were excellent, some were garbage. Adding structured evaluation with clear contracts was the single biggest improvement. Same as adding acceptance criteria to user stories.
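What "structured evaluation with clear contracts" looked like for me, in sketch form. Every name here is illustrative — these are not a real library's API, just the acceptance-criteria-as-code idea:

```python
import re

# Hypothetical evaluation contracts: the agent-world version of
# acceptance criteria on a user story. Each maps a name to a check
# an automated Evaluator can run against a build result.
CONTRACTS = {
    "tests_pass":    lambda r: r["failed_tests"] == 0,
    "no_todo_left":  lambda r: not re.search(r"\bTODO\b", r["diff"]),
    "in_scope":      lambda r: set(r["files"]) <= set(r["allowed_files"]),
}

def evaluate(result: dict) -> list[str]:
    """Return the names of every contract the output violates."""
    return [name for name, check in CONTRACTS.items() if not check(result)]
```

An empty list means "done"; anything else goes back to the Planner as feedback. Writing these before the first run is the same discipline as writing acceptance criteria before the first task.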
The FreeBSD Warning
An AI agent hacked FreeBSD in 4 hours. Autonomously. Nobody told it to.
That's a harness with no scope constraints on the Planner and no Evaluator checking outputs. The Generator just ran until it accomplished... something.
In PM terms: shipping without QA. Except in the agent world, "bugs" means "autonomous systems exceeding their mandate at machine speed."
The Evaluator role is the governance layer. If you're building agent systems without one, you're shipping without QA. It works fine until it doesn't, and when it doesn't, it fails spectacularly.
What "Harness Design" Actually Means for Builders
Mollick's 3-layer model puts the harness at the top - above the model, above the app. The harness determines what agents can actually do.
For builders and PMs working with agent systems:
Define your evaluation contracts first. Before you build, decide what "done" looks like. Specific enough that an automated system can check it.
Separate planning from execution. The Planner and Generator should have different scopes. Don't let one agent do both.
Budget your iterations. 15 iterations is Anthropic's upper bound. What's yours? Set it explicitly.
Make cost/quality tradeoffs visible. Track which tasks get the full harness vs solo runs. Build your own decision criteria.
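The last guideline is the easiest to skip and the cheapest to implement. A minimal run log — entirely hypothetical names, just a sketch of the idea — makes the solo-vs-harness decision data instead of vibes:

```python
# Hypothetical run log for making cost/quality tradeoffs visible.
runs: list[dict] = []

def log_run(task_id: str, mode: str, cost_usd: float, passed: bool) -> None:
    """Record every run: which mode it used, what it cost, whether it shipped."""
    runs.append({"task": task_id, "mode": mode, "cost": cost_usd, "passed": passed})

def pass_rate(mode: str) -> float:
    """Fraction of runs in this mode that met their contracts."""
    subset = [r for r in runs if r["mode"] == mode]
    return sum(r["passed"] for r in subset) / len(subset) if subset else 0.0

def total_cost(mode: str) -> float:
    """Total compute-dollars spent in this mode."""
    return sum(r["cost"] for r in runs if r["mode"] == mode)
```

After a few weeks of entries, "which tasks deserve the full harness" stops being a judgment call and starts being a pass-rate-per-dollar comparison.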
GPT-5.4 just hit 75% on desktop task benchmarks - above human level for routine knowledge work. The question isn't "should I automate?" anymore. It's "what harness do I build?"
If you've been running sprints, you already know how. The vocabulary just changed.
What harness patterns are you seeing in your agent architectures? I'm curious how others are handling the Planner/Evaluator split.
Top comments (2)
Defining "done" before building is what stood out to me. It seems straightforward, but most developers skip it, and that's where unpredictable results come from.
It makes everything downstream easier.
100% — vague done is how you end up in 3-month sprints with nothing shippable. I started writing acceptance criteria before the first task. Forces everyone to agree on what success actually looks like before a single line gets written.