MergeShield

Originally published at mergeshield.dev

Anthropic Says Use More Agents to Fix Agent Code. Here's What's Missing.

Last week, Anthropic published their recommended architecture for building production apps with Claude Code. The core idea: a multi-agent harness where a Planner expands prompts into specs, a Generator implements features, and an Evaluator grades output against criteria.

It's a solid pattern, loosely inspired by GANs: one system creates, another critiques, and the tension drives quality up.

But there's a gap nobody seems to be talking about.

Anthropic's multi-agent harness: Planner, Generator, and Evaluator - all the same Claude model, sharing the same training data and the same blind spots.

The Shared Blind Spot Problem

When your Generator is Claude and your Evaluator is also Claude, they share the same training data, the same biases, and the same blind spots.

It's like asking your coworker to proofread something they helped you write. They'll catch typos. But the structural problems - the wrong assumptions, the edge cases neither of you considered - those survive because you both have the same mental model of what "correct" looks like.

We've seen this play out:

  • Auth flows that passed evaluation but used client-side token storage with no expiry
  • API endpoints that both agents considered "complete" but that shipped with no rate limiting
  • Database queries that worked in tests but had no indexes for production scale

The Generator optimizes for "does it work?" The Evaluator asks the same question slightly differently. Nobody asks: "What would break this in production?"
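The rate-limiting gap above is typical. A minimal token-bucket limiter like the sketch below is the kind of production guard neither agent flags as missing when the prompt only asks for a working endpoint - the parameters here are illustrative, not a prescription:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: allow short bursts, throttle sustained load."""

    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec          # tokens refilled per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=1.0, capacity=2)
assert bucket.allow() and bucket.allow()  # burst of 2 is fine
assert not bucket.allow()                 # third immediate call is throttled
```

A "does it work?" evaluation passes the endpoint without this class; only a reviewer asking "what would break this in production?" notices it's absent.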

What Same-Model Evaluators Miss

AI models have consistent failure patterns when generating code. These aren't random - they're systematic:

Happy-path optimization. AI writes code that handles expected input perfectly. Edge cases, concurrent access, network timeouts get skipped because the model optimizes for the prompt scenario, not production scenarios.

Security as afterthought. Models treat security like junior devs often do - something you add after the feature works. Hardcoded secrets, missing CSRF protection, SQL injection vectors.

Blast radius blindness. When an agent modifies auth middleware, it doesn't reason about how many services depend on that module. Models think locally, not systemically.

Test coverage gaps. AI-generated tests mirror the implementation. If the code has a bug, the test often encodes that bug as expected behavior.
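The test-mirroring failure is easy to see in a toy example. In this hypothetical sketch, the spec says pages are 1-indexed, the generated code treats them as 0-indexed, and the generated test encodes the bug as expected behavior:

```python
def paginate(items, page, per_page):
    # Bug: spec says page 1 is the FIRST page, so this should be
    # (page - 1) * per_page. The model's test never catches it.
    start = page * per_page
    return items[start:start + per_page]

# AI-generated test, derived from the code rather than the spec - it passes:
assert paginate(list(range(10)), 1, 3) == [3, 4, 5]

# An external check written against the spec ("page 1 returns the first
# three items") would fail here, surfacing the bug the mirrored test hid:
assert paginate(list(range(10)), 1, 3) != [0, 1, 2]
```

The implementation-derived test and the code agree with each other perfectly; they just both disagree with the spec.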

Why External Evaluation Changes Everything

Mature engineering orgs don't ask the developer who wrote code to also write the security review. They have separate teams with separate checklists:

  • Security review looks for attack vectors, not functionality
  • Architecture review looks for coupling and blast radius, not correctness
  • Performance review looks for bottlenecks, not feature completeness

The same applies to AI code. External evaluation should score across dimensions the generator wasn't optimizing for:

  • Security - auth changes, secrets, injection risks
  • Blast Radius - how many components affected
  • Test Gaps - whether tests actually cover new behavior
  • Dependencies - supply chain concerns
  • Breaking Changes - API contract modifications

When evaluation criteria are orthogonal to generation criteria, you catch problems the generator structurally cannot see.
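As a rough sketch of what "orthogonal criteria" means in code: an external evaluator can score a diff along the dimensions above without caring whether the feature works. The signal lists and weights below are illustrative placeholders, not any real product's heuristics:

```python
# Illustrative keyword signals per risk dimension (placeholders, not real rules).
SIGNALS = {
    "security": ("secret", "password", "token", "eval(", "subprocess"),
    "blast_radius": ("auth/", "middleware", "schema", "config"),
    "breaking_changes": ("removed", "renamed", "deprecated"),
}

def score_diff(diff: str, touched_files: list, test_files_changed: int) -> dict:
    """Score a diff on dimensions the generator wasn't optimizing for."""
    text = (diff + " " + " ".join(touched_files)).lower()
    scores = {dim: sum(sig in text for sig in sigs) for dim, sigs in SIGNALS.items()}
    scores["blast_radius"] += len(touched_files)        # more files, wider blast
    scores["test_gaps"] = int(test_files_changed == 0)  # code changed, tests didn't
    return scores

# A one-line auth change with no accompanying test changes lights up
# exactly the axes a "does it work?" evaluator ignores:
risk = score_diff("- check_token(request)\n+ pass", ["auth/middleware.py"], 0)
```

Note that nothing in `score_diff` runs the code or checks correctness - that's the generator-evaluator loop's job. This layer only asks what could go wrong.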

The Missing Piece: Trust That Evolves

Anthropic's harness treats every sprint the same: the first feature gets identical evaluation to the fiftieth. No memory, no learning.

But in real teams, trust is earned. A dev who consistently ships clean code gets less scrutiny on routine changes. AI agents should work the same way:

  • New agents start with maximum scrutiny
  • Each clean PR builds trust incrementally
  • High-risk findings reset trust immediately
  • Trusted agents auto-merge low-risk changes
  • Untrusted agents require human review

The harness gives you per-sprint quality control. Trust scoring gives you quality control that compounds over time.
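The bullet points above map directly to a small state machine. This is a hypothetical sketch - the threshold and increment values are illustrative assumptions, not a recommended policy:

```python
class AgentTrust:
    AUTO_MERGE_THRESHOLD = 0.75  # illustrative cutoff

    def __init__(self):
        self.score = 0.0  # new agents start with maximum scrutiny

    def record_clean_pr(self):
        # Each clean PR builds trust incrementally, capped at 1.0.
        self.score = min(1.0, self.score + 0.25)

    def record_high_risk_finding(self):
        # High-risk findings reset trust immediately.
        self.score = 0.0

    def can_auto_merge(self, pr_risk: str) -> bool:
        # Only trusted agents auto-merge, and only low-risk changes.
        return pr_risk == "low" and self.score >= self.AUTO_MERGE_THRESHOLD

agent = AgentTrust()
for _ in range(3):
    agent.record_clean_pr()
assert agent.can_auto_merge("low")       # trust earned over three clean PRs
assert not agent.can_auto_merge("high")  # high-risk always needs review
agent.record_high_risk_finding()
assert not agent.can_auto_merge("low")   # one bad finding, back to zero
```

The asymmetry is the point: trust accumulates slowly and collapses instantly, which mirrors how review scrutiny works on human teams.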

The Complete Picture

Anthropic's harness solves code quality within a single session. But it doesn't address:

  • Cross-session learning (does the agent improve over time?)
  • Multi-agent governance (Claude + Copilot + Cursor in one repo)
  • Risk-proportional review (dependency bump vs auth middleware change)
  • Audit trail (which agent, what risk score, what decision)

The generator-evaluator loop handles the inner feedback cycle. Governance handles everything outside - organizational policies, trust relationships, risk-based routing.

The complete stack: inner-loop quality (harness) + external risk scoring across 6 dimensions + trust-based governance routing.

What to Do About It

  1. Use the harness pattern for inner-loop quality. It works.
  2. Add external evaluation with different criteria the generator wasn't optimizing for.
  3. Build trust incrementally. Track which agents produce clean code. Let data drive review policy.
  4. Automate what's safe. Low-risk PRs from trusted agents don't need human review.
  5. Keep an audit trail. When production breaks, trace which agent introduced the change.
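Steps 3-5 compose into one routing decision plus a log entry. A minimal sketch, with all names and thresholds as illustrative assumptions:

```python
import time

def route_pr(agent: str, risk: str, trust: float, audit_log: list) -> str:
    """Route a PR by risk and trust, recording every decision for audit."""
    if risk == "low" and trust >= 0.8:
        decision = "auto_merge"    # automate what's safe
    else:
        decision = "human_review"  # everything else gets a person
    audit_log.append({
        "agent": agent,
        "risk": risk,
        "trust": trust,
        "decision": decision,
        "ts": time.time(),
    })
    return decision

log = []
assert route_pr("claude-code", "low", 0.9, log) == "auto_merge"
assert route_pr("cursor", "high", 0.9, log) == "human_review"
assert len(log) == 2  # when production breaks, every decision is traceable
```

The audit log is the piece teams skip and regret: without it, "which agent introduced this change, and why did it merge?" has no answer.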

The harness gives you better code. Governance gives you confidence that what ships is safe. You need both.


This is the approach we're building at MergeShield - external risk scoring across 6 dimensions, per-agent trust scores that evolve over time, and auto-merge rules for trusted agents. Try the interactive demo to see it in action.
