I've been building products with Claude Code for months. Every time I asked "is this implementation correct?", the answer was "yes, it's properly implemented." Every time. Even when the code had bugs that broke in production.
Then Anthropic published a blog post that explained exactly why. I mapped my setup against their findings, and realized: my evaluator layer was almost empty.
Here's how I rebuilt it.
Jump to:
- What Anthropic's experiment showed
- Mapping this to Claude Code
- Layer 1: Rules — always-on review criteria
- Layer 2: Skills — on-demand reviewers
- Layer 3: Agent separation — who builds vs who reviews
- 3 principles for evaluation design
- Final file structure
- Harness design checklist
What Anthropic's experiment showed
In March 2026, Anthropic published "Harness design for long-running apps" — experiments where AI agents autonomously built apps over multi-hour sessions.
The headline finding:
Agents asked to evaluate their own work tend to confidently praise it, even when it's clearly mediocre to human observers.
It gets worse. Agents would spot real problems, then wave them off as unimportant and approve anyway. They'd skim instead of testing edge cases. Anthropic themselves put it bluntly: "Claude is an inadequate QA agent out of the box."
Their fix was splitting generation from evaluation — three agents:
| Agent | Role |
|---|---|
| Planner | Expands a one-line prompt into a full product spec |
| Generator | Writes the code |
| Evaluator | Clicks through the running app, finds bugs, scores against criteria |
The difference was stark:
| Setup | Time | Cost | Result |
|---|---|---|---|
| Solo (no evaluator) | 20 min | $9 | Core features broken |
| Full harness (with evaluator) | 6 hrs | $200 | Basic functionality worked + AI features |
The evaluator checked 27 criteria per sprint and filed bug reports like "fillRectangle exists but doesn't fire on mouseUp."
Reading this, it clicked: Claude Code's config system can give you the same split.
Mapping to Claude Code
Here's the mapping:
| Anthropic's agent | Claude Code equivalent | What goes here |
|---|---|---|
| Planner | CLAUDE.md + planning skills | Project context, constraints, design rationale |
| Generator | Claude Code + technical skills | Code generation |
| Evaluator | Review agents + rules + hooks | Quality gates, automated checks |
When I laid out my own setup this way, the gap was obvious. CLAUDE.md had project context. Skills had coding patterns. But nothing was actually checking whether the output was correct. I was asking the generator to review its own work — the exact failure mode Anthropic documented.
Layer 1: Rules — always-on review criteria
Files in ~/.claude/rules/ load every session. Put things here that AI won't do on its own but that matter in production.
Here's a taste of my Supabase/PostgreSQL rules (30 total):
```markdown
# ~/.claude/rules/supabase-postgres.md

## Index FK columns
PostgreSQL does NOT auto-index foreign key columns.

## RLS performance
Wrap auth.uid() in SELECT to prevent per-row execution:
- BAD:  using (user_id = auth.uid())
- GOOD: using (user_id = (select auth.uid()))

## Cursor-based pagination
- BAD:  OFFSET (slow on deep pages)
- GOOD: WHERE id > $last_id ORDER BY id LIMIT 20
```
Without this rule, the AI generates auth.uid() without the SELECT wrapper every time. Works fine with small tables. Tests pass. Then production slows to a crawl as rows grow. Classic "surface-level test that misses the deeper bug" — exactly what Anthropic described.
I think of rules as "never step on this mine again" files. Every rule started as a real mistake.
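The pagination rule is easiest to see in code. Here's a minimal TypeScript sketch of the cursor pattern over an in-memory table — `Row` and `fetchPage` are illustrative names, not any real API; a real implementation would run the equivalent `WHERE id > $last_id` query against the database:

```typescript
// Cursor-based pagination, mirroring the rule's
// "WHERE id > $last_id ORDER BY id LIMIT n" pattern.
type Row = { id: number; title: string };

function fetchPage(rows: Row[], lastId: number | null, limit: number): Row[] {
  return rows
    .filter((r) => lastId === null || r.id > lastId) // WHERE id > $last_id
    .sort((a, b) => a.id - b.id)                     // ORDER BY id
    .slice(0, limit);                                // LIMIT n
}

// Walk a 5-row table in pages of 2. Each request passes the last id seen,
// so the cost per page stays flat no matter how deep you go — unlike
// OFFSET, which rescans every skipped row.
const rows: Row[] = [1, 2, 3, 4, 5].map((id) => ({ id, title: `post ${id}` }));
const page1 = fetchPage(rows, null, 2);        // ids 1, 2
const page2 = fetchPage(rows, page1[1].id, 2); // ids 3, 4
```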
Layer 2: Skills — on-demand reviewers
Rules load every session, so too many will eat your context window. Anything task-specific goes into skills (.claude/skills/), which show only their titles by default and load on demand.
You can wire up keyword-triggered activation:
```markdown
# ~/.claude/CLAUDE.md (excerpt)

## Automatic Skill Activation

### Testing / TDD
**Trigger:** test, TDD, coverage, unit test
**Action:** Run test-driven-development skill

### Bug / Error handling
**Trigger:** bug, error, debug, broken
**Action:** Run systematic-debugging skill

### Completion check
**Trigger:** done, verify, review, check
**Action:** Run verification-before-completion skill
```
The completion check is the big one. In Anthropic's experiment, the evaluator ran checks at the end of each sprint. Same idea here — say "verify this" and a quality checklist kicks in behind the scenes.
I run 27 skills total. 7 auto-activate on keywords.
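For reference, a skill is just a directory containing a SKILL.md whose frontmatter supplies the title Claude sees by default. Here's a stripped-down sketch of what a completion-check skill can look like — the structure follows Claude Code's skill format, but the checklist items here are simplified for illustration:

```markdown
# ~/.claude/skills/verification-before-completion/SKILL.md
---
name: verification-before-completion
description: Run before declaring any task done
---
Before reporting a task as complete:
1. Run the test suite — paste the actual output, not a summary
2. Exercise at least one edge case by hand
3. Re-read the original request and list any unmet requirements
4. If anything fails, fix it and restart this checklist
```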
Layer 3: Agent separation — who builds vs who reviews
The core insight from Anthropic: separate the builder from the reviewer. In Claude Code, you do this through agent orchestration:
```markdown
# ~/.claude/rules/agents.md

## Role: Manager
You are a manager and agent orchestrator.

**Rules:**
- Never implement directly — delegate all implementation to Sub Agents
- Break tasks into small units and run PDCA cycles

## Delegation

### Always delegate:
1. Code implementation
2. Debugging
3. Test creation

### Manager handles directly:
1. Task decomposition
2. Progress verification
3. Plan adjustment
```
The trick: pin the main Claude as "manager = reviewer" and push all implementation to Sub Agents. This gives Claude Code the same generator/evaluator split that Anthropic proved works.
Main Claude doesn't write code. It reviews what Sub Agents produce. Planner + evaluator = main Claude. Generator = Sub Agents.
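The Sub Agents themselves are defined as markdown files under ~/.claude/agents/, each with frontmatter telling Claude when to use it. A bare-bones implementer might look like this (a sketch, not a full config):

```markdown
# ~/.claude/agents/implementer.md
---
name: implementer
description: Writes code for a single, narrowly scoped task. Use for all implementation work.
---
You are an implementer. You receive one small, well-defined task.
- Implement exactly what was asked, nothing more
- Report back with what you changed and how you verified it
- Do not approve your own work — the manager reviews it
```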
3 principles for evaluation design
1. Weight your criteria toward what AI misses
Anthropic's frontend experiment used 4 criteria: design quality, originality, technical execution, functionality. They weighted design and originality higher — because the AI already did well on technical execution and functionality.
Lower the weight on things AI handles naturally. Raise it on things AI overlooks.
In practice:
- Low weight (AI's fine): syntactically correct code, basic API endpoints
- High weight (AI misses): performance traps (auth.uid() without SELECT), UX decisions that affect conversion, security blind spots
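To make the weighting concrete, here's a TypeScript sketch of a weighted rubric using Anthropic's four criteria names. The weights and scores are made up for illustration — Anthropic didn't publish their actual numbers — but the shape is the point: high marks on the criteria AI already nails can't mask a weak score on the ones it misses.

```typescript
// A weighted evaluation rubric. Criteria the AI handles naturally
// (technical execution, functionality) get low weight; criteria it
// overlooks (design, originality) get high weight.
type Criterion = { name: string; weight: number; score: number }; // score: 0-10

function weightedScore(criteria: Criterion[]): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  const weightedSum = criteria.reduce((sum, c) => sum + c.weight * c.score, 0);
  return weightedSum / totalWeight;
}

const review: Criterion[] = [
  { name: "design quality",      weight: 3, score: 4 }, // AI misses: weigh high
  { name: "originality",         weight: 3, score: 5 },
  { name: "technical execution", weight: 1, score: 9 }, // AI's fine: weigh low
  { name: "functionality",       weight: 1, score: 9 },
];

// (3*4 + 3*5 + 1*9 + 1*9) / 8 = 5.625 — the 9s don't rescue the 4 and 5.
const overall = weightedScore(review);
```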
2. Prune your harness as models improve
From Anthropic:
Every component of a harness encodes an assumption about what the model can't do alone, and these assumptions can be wrong or quickly become outdated as models improve.
When they went from Opus 4.5 to 4.6, sprint decomposition became unnecessary — the newer model handled long sessions on its own. Same thing happens with your Claude Code rules. Something essential today may be dead weight next quarter.
I've revised my CLAUDE.md 8 times. The test for each line: "If I delete this, will the AI make mistakes?" No? Cut it.
3. Better models mean more harness possibilities, not fewer
The space of interesting harness combinations expands rather than contracts as models improve.
Sounds backwards, but it matches my experience. As the AI gets smarter, you can delegate more. But the remaining evaluation points get more nuanced, more subtle, and higher-stakes. The easy rules go away. The hard judgment calls stay.
Final file structure
```
~/.claude/
├── CLAUDE.md                  # Planner layer (project overview + skill triggers)
├── rules/
│   ├── agents.md              # Generator/evaluator split
│   ├── supabase-postgres.md   # Review criteria (DB/RLS, 30 rules)
│   ├── react-nextjs.md        # Review criteria (React/Next.js)
│   ├── security.md            # Review criteria (security)
│   ├── coding-style.md        # Code quality
│   └── testing.md             # Test quality
└── skills/                    # On-demand reviewers
    ├── verification-before-completion   # End-of-task checks
    ├── systematic-debugging             # Debug-time checks
    ├── test-driven-development          # TDD enforcement
    └── ... (27 total)
```
How it maps to Anthropic's architecture:
- CLAUDE.md → Planner artifact (the product spec)
- rules/ → Review criteria (the sprint contract)
- skills/ → Specialist reviewers (activate when needed)
- agents.md → Builder/reviewer separation
Harness design checklist
Planner layer (CLAUDE.md)
- [ ] Project overview and constraints documented
- [ ] Design decisions and their rationale included
- [ ] Under 200 lines (bloated CLAUDE.md degrades instruction-following)
Review criteria (rules/)
- [ ] Cover what AI misses, not what it already does well
- [ ] Weighted toward production-critical concerns
- [ ] Regularly pruned as models improve
On-demand reviewers (skills/)
- [ ] Separated by domain (DB, React, security, etc.)
- [ ] Keyword auto-activation configured
- [ ] Completion verification skill exists
Builder/reviewer separation (agents.md)
- [ ] Main Claude delegates implementation to Sub Agents
- [ ] Verification flow for implementation results
Links
- Anthropic's original post: Harness design for long-running apps
- Claude Code docs: docs.anthropic.com
- Japanese version (Zenn): https://zenn.dev/lova_man/articles/99777e473b3c2c
