# Building a Website with Anthropic's Generator-Evaluator Loop
Anthropic recently published their harness design for long-running apps — a GAN-inspired architecture where a Generator builds and an Evaluator critiques in a loop. I replicated it with Kiro CLI to build a marketing website autonomously.
Result: 12 iterations, 3.5 hours, zero manual coding.
## The Architecture

```
Planner (1x) → Generator ↔ Evaluator (12x)
```
Each agent is a separate CLI process — clean slate, no shared context:

```bash
#!/bin/bash
# orchestrator.sh — spawns fresh sessions per agent
KIRO="/Users/me/.local/bin/kiro-cli"

invoke_kiro() {
  "$KIRO" chat --no-interactive --trust-all-tools \
    "Read $1 and execute all instructions."
}

# Planner (once)
invoke_kiro "prompts/planner.md"

# Generator ↔ Evaluator loop
for ((i=1; i<=12; i++)); do
  pkill -f "next dev" 2>/dev/null; sleep 2
  invoke_kiro "prompts/generator.md"
  invoke_kiro "prompts/evaluator.md"
done
```
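The loop above always runs all 12 iterations. One possible extension — not part of the original script — is to stop early once the evaluator is satisfied. This sketch assumes, hypothetically, that the evaluator ends its report with a `VERDICT: PASS` line:

```shell
#!/bin/bash
# Sketch only: early exit for the orchestrator loop.
# "VERDICT: PASS" is an assumed report format, not from the original harness.
report=".harness/eval-report.md"
mkdir -p .harness
printf 'Design Quality: 8/10\nVERDICT: PASS\n' > "$report"  # stand-in for a real evaluator run

for ((i=1; i<=12; i++)); do
  # ... invoke_kiro on generator.md and evaluator.md here, as in the real script ...
  if grep -q '^VERDICT: PASS' "$report"; then
    echo "converged after iteration $i"
    break
  fi
done
```

The cap of 12 still bounds cost; the grep just lets a good run finish sooner.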
Communication happens only through files:

- The Planner writes `.harness/spec.md`
- The Evaluator writes `.harness/eval-report.md`
- The Generator reads both
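Because files are the only channel, a missing artifact means an agent silently worked from nothing. A small fail-fast check before each invocation (the file names are from the post; the check itself is my addition):

```shell
# Sketch: verify the contract files exist before invoking the next agent.
mkdir -p .harness
touch .harness/spec.md .harness/eval-report.md  # stand-ins for real agent output

for f in .harness/spec.md .harness/eval-report.md; do
  if [ ! -f "$f" ]; then
    echo "missing contract file: $f" >&2
    exit 1
  fi
done
echo "contract files present"
```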
## The Evaluator Uses Playwright
This is the key differentiator. The evaluator doesn't just read code — it browses the live site:
```markdown
# evaluator.md (excerpt)

### Start Server
nohup npx next dev --port 3000 > /dev/null 2>&1 & disown; sleep 8

### Test with Playwright MCP
1. browser_navigate to http://localhost:3000
2. browser_snapshot at 1440x900, 768x1024, 375x812
3. browser_click all interactive elements
4. Score: Design Quality, Originality, Craft, Functionality

### Always provide feedback — never just "PASS"
```
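One fragile spot in that excerpt is the fixed `sleep 8`: the dev server may need more or less time to come up. A sketch of polling for readiness instead, using Python's `http.server` as a stand-in for `next dev` so the example is self-contained:

```shell
# Sketch: wait until the server answers instead of sleeping a fixed time.
python3 -m http.server 3000 >/dev/null 2>&1 &  # stand-in for `npx next dev --port 3000`
server_pid=$!

ready=0
for _ in $(seq 1 40); do          # poll for up to ~10 seconds
  if curl -sf -o /dev/null http://localhost:3000/; then
    ready=1
    break
  fi
  sleep 0.25
done

kill "$server_pid" 2>/dev/null
echo "ready=$ready"
```

In the real harness the evaluator prompt could instruct the agent to run a loop like this before the first `browser_navigate`.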
## The Design Skill (Anti-AI-Slop)
Both Generator and Evaluator reference Anthropic's frontend design skill:
```markdown
## NEVER
- Inter, Roboto, Arial, Space Grotesk
- Purple gradients on white backgrounds
- Predictable card layouts

## ALWAYS
- Bold, distinctive font choices
- Unexpected layouts, asymmetry
- Commit fully to a direction
```
Generic AI patterns score max 5/10. This forces creative risk-taking.
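For the orchestrator to act on those scores, it has to pull the number out of the evaluator's report. A sketch, assuming a report line shaped like `Design Quality: 5/10` (the exact format depends on your evaluator prompt):

```shell
# Sketch: extract a numeric score from an assumed "Category: N/10" report line.
line='Design Quality: 5/10'       # stand-in for a line grepped from .harness/eval-report.md
score=$(printf '%s\n' "$line" | sed -n 's|.*: *\([0-9][0-9]*\)/10.*|\1|p')
echo "score=$score"
```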
## Iteration Progression
| Iteration | What Happened | Design Score |
|---|---|---|
| 1 | Generic dark site, Inter font | 5/10 |
| 2 | Amber palette pivot, custom SVGs | 6/10 |
| 3 | Fixed a11y, responsive, fonts | 6/10 |
| 4 | Terminal Noir pivot — IBM Plex Mono, grain, scanlines | 7/10 |
| 5-12 | Refinement — animations, reduced-motion, polish | 7-8/10 |
The biggest jumps came from pivots, not refinements.
## Key Lessons
- Separate generator from evaluator — AI can't judge its own work
- Clean slate per invocation — prevents context anxiety
- Playwright > code review — catches visual bugs code review misses
- Penalize generic patterns — or the model converges on AI slop
- Pivots > polish — tell the generator to scrap and restart if scores stagnate
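That last lesson can be mechanized: track the score history and flag a pivot when it goes flat. A sketch in shell — the three-flat-scores threshold is my own choice, not from the post:

```shell
# Sketch: flag a pivot when the last three design scores are identical.
scores="5 6 6 6"                  # stand-in for scores parsed from past eval reports

# Count distinct values among the last three scores; 1 means the score is flat.
distinct=$(printf '%s\n' $scores | tail -n 3 | sort -u | wc -l | tr -d ' ')

pivot=0
if [ "$distinct" -eq 1 ]; then
  pivot=1                         # signal the generator to scrap and restart
fi
echo "pivot=$pivot"
```

The pivot signal could be appended to `.harness/eval-report.md` so the next generator invocation sees it.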
## Two Architectures
I built templates for both use cases:
**Frontend (design-focused):**
- No sprints, single build
- 5-15 iterations of generate → evaluate → improve
- Scoring: Design Quality, Originality, Craft, Functionality
**Full-stack (correctness-focused):**
- Sprint-based with contracts
- Generator + Evaluator negotiate "done" before coding
- Pass/fail per sprint, max 3 retries
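The per-sprint retry policy can be sketched as a bounded loop; the `sprint_passes` stub below is hypothetical and stands in for a full generator-plus-evaluator round:

```shell
# Sketch: pass/fail per sprint with a cap of 3 retries.
max_retries=3
sprint_passes() { [ "$1" -ge 2 ]; }   # hypothetical stub: "passes" on the 2nd attempt

result="fail"
attempt=1
while [ "$attempt" -le "$max_retries" ]; do
  if sprint_passes "$attempt"; then
    result="pass"
    break
  fi
  attempt=$((attempt + 1))
done
echo "sprint result=$result after attempt $attempt"
```

A sprint that exhausts its retries would leave `result=fail`, which the orchestrator can surface instead of silently moving on.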
## Try It

```bash
git clone https://github.com/Mnemo-mcp/Harness
cd Harness
./orchestrator.sh
```

Requires: Kiro CLI + Playwright MCP configured in `.kiro/settings/mcp.json`.
The model is the engine. The harness is what makes it drive straight.