I've been building products with Claude Code for months. Every time I asked "is this implementation correct?", the answer was "yes, it's properly implemented." Every time. Even when the code had bugs that broke in production.
Then Anthropic published a blog post that explained exactly why. I mapped my setup against their findings, and realized: my evaluator layer was almost empty.
Here's how I rebuilt it.
Jump to:
- What Anthropic's experiment showed
- Mapping this to Claude Code
- Layer 1: Rules — always-on review criteria
- Layer 2: Skills — on-demand reviewers
- Layer 3: Agent separation — who builds vs who reviews
- 3 principles for evaluation design
- Final file structure
- Harness design checklist
What Anthropic's experiment showed
In March 2026, Anthropic published "Harness design for long-running apps" — experiments where AI agents autonomously built apps over multi-hour sessions.
The headline finding:
Agents asked to evaluate their own work tend to confidently praise it, even when it's clearly mediocre to human observers.
It gets worse. Agents would spot real problems, then wave them off as unimportant and approve anyway. They'd skim instead of testing edge cases. Anthropic themselves put it bluntly: "Claude is an inadequate QA agent out of the box."
Their fix was splitting generation from evaluation — three agents:
| Agent | Role |
|---|---|
| Planner | Expands a one-line prompt into a full product spec |
| Generator | Writes the code |
| Evaluator | Clicks through the running app, finds bugs, scores against criteria |
The difference was stark:
| Setup | Time | Cost | Result |
|---|---|---|---|
| Solo (no evaluator) | 20 min | $9 | Core features broken |
| Full harness (with evaluator) | 6 hrs | $200 | Basic functionality worked + AI features |
The evaluator checked 27 criteria per sprint and filed bug reports like "fillRectangle exists but doesn't fire on mouseUp."
Reading this, it clicked: Claude Code's config system can give you the same split.
Mapping to Claude Code
Here's the mapping:
| Anthropic's agent | Claude Code equivalent | What goes here |
|---|---|---|
| Planner | CLAUDE.md + planning skills | Project context, constraints, design rationale |
| Generator | Claude Code + technical skills | Code generation |
| Evaluator | Review agents + rules + hooks | Quality gates, automated checks |
When I laid out my own setup this way, the gap was obvious. CLAUDE.md had project context. Skills had coding patterns. But nothing was actually checking whether the output was correct. I was asking the generator to review its own work — the exact failure mode Anthropic documented.
Layer 1: Rules — always-on review criteria
Files in ~/.claude/rules/ load every session. Put things here that AI won't do on its own but that matter in production.
Here's a taste of my Supabase/PostgreSQL rules (30 total):
```markdown
# ~/.claude/rules/supabase-postgres.md

## Index FK columns
PostgreSQL does NOT auto-index foreign key columns.

## RLS performance
Wrap auth.uid() in SELECT to prevent per-row execution:
- BAD:  using (user_id = auth.uid())
- GOOD: using (user_id = (select auth.uid()))

## Cursor-based pagination
- BAD:  OFFSET (slow on deep pages)
- GOOD: WHERE id > $last_id ORDER BY id LIMIT 20
```
Without this rule, the AI generates auth.uid() without the SELECT wrapper every time. Works fine with small tables. Tests pass. Then production slows to a crawl as rows grow. Classic "surface-level test that misses the deeper bug" — exactly what Anthropic described.
I think of rules as "never step on this mine again" files. Every rule started as a real mistake.
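The pagination rule is easiest to see in code. Here's a minimal TypeScript sketch of the cursor pattern over an in-memory table — `Row` and `fetchPage` are illustrative names, not any real API; a real implementation would run the equivalent `WHERE id > $last_id` query against the database:

```typescript
// Cursor-based pagination, mirroring the rule's
// "WHERE id > $last_id ORDER BY id LIMIT n" pattern.
type Row = { id: number; title: string };

function fetchPage(rows: Row[], lastId: number | null, limit: number): Row[] {
  return rows
    .filter((r) => lastId === null || r.id > lastId) // WHERE id > $last_id
    .sort((a, b) => a.id - b.id)                     // ORDER BY id
    .slice(0, limit);                                // LIMIT n
}

// Walk a 5-row table in pages of 2. Each request passes the last id seen,
// so the cost per page stays flat no matter how deep you go — unlike
// OFFSET, which rescans every skipped row.
const rows: Row[] = [1, 2, 3, 4, 5].map((id) => ({ id, title: `post ${id}` }));
const page1 = fetchPage(rows, null, 2);        // ids 1, 2
const page2 = fetchPage(rows, page1[1].id, 2); // ids 3, 4
```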
Layer 2: Skills — on-demand reviewers
Rules load every session, so too many will eat your context window. Anything task-specific goes into skills (.claude/skills/), which show only their titles by default and load on demand.
You can wire up keyword-triggered activation:
```markdown
# ~/.claude/CLAUDE.md (excerpt)

## Automatic Skill Activation

### Testing / TDD
**Trigger:** test, TDD, coverage, unit test
**Action:** Run test-driven-development skill

### Bug / Error handling
**Trigger:** bug, error, debug, broken
**Action:** Run systematic-debugging skill

### Completion check
**Trigger:** done, verify, review, check
**Action:** Run verification-before-completion skill
```
The completion check is the big one. In Anthropic's experiment, the evaluator ran checks at the end of each sprint. Same idea here — say "verify this" and a quality checklist kicks in behind the scenes.
I run 27 skills total. 7 auto-activate on keywords.
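For reference, a skill is just a directory containing a SKILL.md whose frontmatter supplies the title Claude sees by default. Here's a stripped-down sketch of what a completion-check skill can look like — the structure follows Claude Code's skill format, but the checklist items here are simplified for illustration:

```markdown
# ~/.claude/skills/verification-before-completion/SKILL.md
---
name: verification-before-completion
description: Run before declaring any task done
---
Before reporting a task as complete:
1. Run the test suite — paste the actual output, not a summary
2. Exercise at least one edge case by hand
3. Re-read the original request and list any unmet requirements
4. If anything fails, fix it and restart this checklist
```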
Layer 3: Agent separation — who builds vs who reviews
The core insight from Anthropic: separate the builder from the reviewer. In Claude Code, you do this through agent orchestration:
```markdown
# ~/.claude/rules/agents.md

## Role: Manager
You are a manager and agent orchestrator.

**Rules:**
- Never implement directly — delegate all implementation to Sub Agents
- Break tasks into small units and run PDCA cycles

## Delegation

### Always delegate:
1. Code implementation
2. Debugging
3. Test creation

### Manager handles directly:
1. Task decomposition
2. Progress verification
3. Plan adjustment
```
The trick: pin the main Claude as "manager = reviewer" and push all implementation to Sub Agents. This gives Claude Code the same generator/evaluator split that Anthropic proved works.
Main Claude doesn't write code. It reviews what Sub Agents produce. Planner + evaluator = main Claude. Generator = Sub Agents.
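The Sub Agents themselves are defined as markdown files under ~/.claude/agents/, each with frontmatter telling Claude when to use it. A bare-bones implementer might look like this (a sketch, not a full config):

```markdown
# ~/.claude/agents/implementer.md
---
name: implementer
description: Writes code for a single, narrowly scoped task. Use for all implementation work.
---
You are an implementer. You receive one small, well-defined task.
- Implement exactly what was asked, nothing more
- Report back with what you changed and how you verified it
- Do not approve your own work — the manager reviews it
```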
3 principles for evaluation design
1. Weight your criteria toward what AI misses
Anthropic's frontend experiment used 4 criteria: design quality, originality, technical execution, functionality. They weighted design and originality higher — because the AI already did well on technical execution and functionality.
Lower the weight on things AI handles naturally. Raise it on things AI overlooks.
In practice:
- Low weight (AI's fine): syntactically correct code, basic API endpoints
- High weight (AI misses): performance traps (auth.uid() without SELECT), UX decisions that affect conversion, security blind spots
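To make the weighting concrete, here's a TypeScript sketch of a weighted rubric using Anthropic's four criteria names. The weights and scores are made up for illustration — Anthropic didn't publish their actual numbers — but the shape is the point: high marks on the criteria AI already nails can't mask a weak score on the ones it misses.

```typescript
// A weighted evaluation rubric. Criteria the AI handles naturally
// (technical execution, functionality) get low weight; criteria it
// overlooks (design, originality) get high weight.
type Criterion = { name: string; weight: number; score: number }; // score: 0-10

function weightedScore(criteria: Criterion[]): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  const weightedSum = criteria.reduce((sum, c) => sum + c.weight * c.score, 0);
  return weightedSum / totalWeight;
}

const review: Criterion[] = [
  { name: "design quality",      weight: 3, score: 4 }, // AI misses: weigh high
  { name: "originality",         weight: 3, score: 5 },
  { name: "technical execution", weight: 1, score: 9 }, // AI's fine: weigh low
  { name: "functionality",       weight: 1, score: 9 },
];

// (3*4 + 3*5 + 1*9 + 1*9) / 8 = 5.625 — the 9s don't rescue the 4 and 5.
const overall = weightedScore(review);
```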
2. Prune your harness as models improve
From Anthropic:
Every component of a harness encodes an assumption about what the model can't do alone, and these assumptions can be wrong or quickly become outdated as models improve.
When they went from Opus 4.5 to 4.6, sprint decomposition became unnecessary — the newer model handled long sessions on its own. Same thing happens with your Claude Code rules. Something essential today may be dead weight next quarter.
I've revised my CLAUDE.md 8 times. The test for each line: "If I delete this, will the AI make mistakes?" No? Cut it.
3. Better models mean more harness possibilities, not fewer
The space of interesting harness combinations expands rather than contracts as models improve.
Sounds backwards, but it matches my experience. As the AI gets smarter, you can delegate more. But the remaining evaluation points get more nuanced, more subtle, and higher-stakes. The easy rules go away. The hard judgment calls stay.
Final file structure
```
~/.claude/
├── CLAUDE.md                  # Planner layer (project overview + skill triggers)
├── rules/
│   ├── agents.md              # Generator/evaluator split
│   ├── supabase-postgres.md   # Review criteria (DB/RLS, 30 rules)
│   ├── react-nextjs.md        # Review criteria (React/Next.js)
│   ├── security.md            # Review criteria (security)
│   ├── coding-style.md        # Code quality
│   └── testing.md             # Test quality
└── skills/                    # On-demand reviewers
    ├── verification-before-completion   # End-of-task checks
    ├── systematic-debugging             # Debug-time checks
    ├── test-driven-development          # TDD enforcement
    └── ... (27 total)
```
How it maps to Anthropic's architecture:
- CLAUDE.md → Planner artifact (the product spec)
- rules/ → Review criteria (the sprint contract)
- skills/ → Specialist reviewers (activate when needed)
- agents.md → Builder/reviewer separation
Harness design checklist
Planner layer (CLAUDE.md)
- [ ] Project overview and constraints documented
- [ ] Design decisions and their rationale included
- [ ] Under 200 lines (bloated CLAUDE.md degrades instruction-following)
Review criteria (rules/)
- [ ] Cover what AI misses, not what it already does well
- [ ] Weighted toward production-critical concerns
- [ ] Regularly pruned as models improve
On-demand reviewers (skills/)
- [ ] Separated by domain (DB, React, security, etc.)
- [ ] Keyword auto-activation configured
- [ ] Completion verification skill exists
Builder/reviewer separation (agents.md)
- [ ] Main Claude delegates implementation to Sub Agents
- [ ] Verification flow for implementation results
Links
- Anthropic's original post: Harness design for long-running apps
- Claude Code docs: docs.anthropic.com
- Japanese version (Zenn): https://zenn.dev/lova_man/articles/99777e473b3c2c
