# Building a Website with Anthropic's Generator-Evaluator Loop
Anthropic recently published their harness design for long-running apps — a GAN-inspired architecture where a Generator builds and an Evaluator critiques in a loop. I replicated it with Kiro CLI to build a marketing website autonomously.
Result: 12 iterations, 3.5 hours, zero manual coding.
## The Architecture

```
Planner (1x) → Generator ↔ Evaluator (12x)
```
Each agent is a separate CLI process — clean slate, no shared context:

```bash
#!/bin/bash
# orchestrator.sh — spawns fresh sessions per agent
KIRO="/Users/me/.local/bin/kiro-cli"

invoke_kiro() {
  "$KIRO" chat --no-interactive --trust-all-tools \
    "Read $1 and execute all instructions."
}

# Planner (once)
invoke_kiro "prompts/planner.md"

# Generator ↔ Evaluator loop
for ((i=1; i<=12; i++)); do
  pkill -f "next dev" 2>/dev/null; sleep 2
  invoke_kiro "prompts/generator.md"
  invoke_kiro "prompts/evaluator.md"
done
```
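The loop above always runs all 12 iterations. One possible extension — not part of the original script — is to stop early once the evaluator is satisfied. This sketch assumes, hypothetically, that the evaluator ends its report with a `VERDICT: PASS` line:

```shell
#!/bin/bash
# Sketch only: early exit for the orchestrator loop.
# "VERDICT: PASS" is an assumed report format, not from the original harness.
report=".harness/eval-report.md"
mkdir -p .harness
printf 'Design Quality: 8/10\nVERDICT: PASS\n' > "$report"  # stand-in for a real evaluator run

for ((i=1; i<=12; i++)); do
  # ... invoke_kiro on generator.md and evaluator.md here, as in the real script ...
  if grep -q '^VERDICT: PASS' "$report"; then
    echo "converged after iteration $i"
    break
  fi
done
```

The cap of 12 still bounds cost; the grep just lets a good run finish sooner.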
Communication happens only through files:

- The Planner writes `.harness/spec.md`
- The Evaluator writes `.harness/eval-report.md`
- The Generator reads both
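Because files are the only channel, a missing artifact means an agent silently worked from nothing. A small fail-fast check before each invocation (the file names are from the post; the check itself is my addition):

```shell
# Sketch: verify the contract files exist before invoking the next agent.
mkdir -p .harness
touch .harness/spec.md .harness/eval-report.md  # stand-ins for real agent output

for f in .harness/spec.md .harness/eval-report.md; do
  if [ ! -f "$f" ]; then
    echo "missing contract file: $f" >&2
    exit 1
  fi
done
echo "contract files present"
```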
## The Evaluator Uses Playwright
This is the key differentiator. The evaluator doesn't just read code — it browses the live site:
```markdown
# evaluator.md (excerpt)

### Start Server
nohup npx next dev --port 3000 > /dev/null 2>&1 & disown; sleep 8

### Test with Playwright MCP
1. browser_navigate to http://localhost:3000
2. browser_snapshot at 1440x900, 768x1024, 375x812
3. browser_click all interactive elements
4. Score: Design Quality, Originality, Craft, Functionality

### Always provide feedback — never just "PASS"
```
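One fragile spot in that excerpt is the fixed `sleep 8`: the dev server may need more or less time to come up. A sketch of polling for readiness instead, using Python's `http.server` as a stand-in for `next dev` so the example is self-contained:

```shell
# Sketch: wait until the server answers instead of sleeping a fixed time.
python3 -m http.server 3000 >/dev/null 2>&1 &  # stand-in for `npx next dev --port 3000`
server_pid=$!

ready=0
for _ in $(seq 1 40); do          # poll for up to ~10 seconds
  if curl -sf -o /dev/null http://localhost:3000/; then
    ready=1
    break
  fi
  sleep 0.25
done

kill "$server_pid" 2>/dev/null
echo "ready=$ready"
```

In the real harness the evaluator prompt could instruct the agent to run a loop like this before the first `browser_navigate`.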
## The Design Skill (Anti-AI-Slop)
Both Generator and Evaluator reference Anthropic's frontend design skill:
```markdown
## NEVER
- Inter, Roboto, Arial, Space Grotesk
- Purple gradients on white backgrounds
- Predictable card layouts

## ALWAYS
- Bold, distinctive font choices
- Unexpected layouts, asymmetry
- Commit fully to a direction
```
Generic AI patterns score max 5/10. This forces creative risk-taking.
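For the orchestrator to act on those scores, it has to pull the number out of the evaluator's report. A sketch, assuming a report line shaped like `Design Quality: 5/10` (the exact format depends on your evaluator prompt):

```shell
# Sketch: extract a numeric score from an assumed "Category: N/10" report line.
line='Design Quality: 5/10'       # stand-in for a line grepped from .harness/eval-report.md
score=$(printf '%s\n' "$line" | sed -n 's|.*: *\([0-9][0-9]*\)/10.*|\1|p')
echo "score=$score"
```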
## Iteration Progression
| Iteration | What Happened | Design Score |
|---|---|---|
| 1 | Generic dark site, Inter font | 5/10 |
| 2 | Amber palette pivot, custom SVGs | 6/10 |
| 3 | Fixed a11y, responsive, fonts | 6/10 |
| 4 | Terminal Noir pivot — IBM Plex Mono, grain, scanlines | 7/10 |
| 5-12 | Refinement — animations, reduced-motion, polish | 7-8/10 |
The biggest jumps came from pivots, not refinements.
## Key Lessons
- Separate generator from evaluator — AI can't judge its own work
- Clean slate per invocation — prevents context anxiety
- Playwright > code review — catches visual bugs code review misses
- Penalize generic patterns — or the model converges on AI slop
- Pivots > polish — tell the generator to scrap and restart if scores stagnate
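That last lesson can be mechanized: track the score history and flag a pivot when it goes flat. A sketch in shell — the three-flat-scores threshold is my own choice, not from the post:

```shell
# Sketch: flag a pivot when the last three design scores are identical.
scores="5 6 6 6"                  # stand-in for scores parsed from past eval reports

# Count distinct values among the last three scores; 1 means the score is flat.
distinct=$(printf '%s\n' $scores | tail -n 3 | sort -u | wc -l | tr -d ' ')

pivot=0
if [ "$distinct" -eq 1 ]; then
  pivot=1                         # signal the generator to scrap and restart
fi
echo "pivot=$pivot"
```

The pivot signal could be appended to `.harness/eval-report.md` so the next generator invocation sees it.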
## Two Architectures
I built templates for both use cases:
**Frontend (design-focused):**
- No sprints, single build
- 5-15 iterations of generate → evaluate → improve
- Scoring: Design Quality, Originality, Craft, Functionality
**Full-stack (correctness-focused):**
- Sprint-based with contracts
- Generator + Evaluator negotiate "done" before coding
- Pass/fail per sprint, max 3 retries
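The per-sprint retry policy can be sketched as a bounded loop; the `sprint_passes` stub below is hypothetical and stands in for a full generator-plus-evaluator round:

```shell
# Sketch: pass/fail per sprint with a cap of 3 retries.
max_retries=3
sprint_passes() { [ "$1" -ge 2 ]; }   # hypothetical stub: "passes" on the 2nd attempt

result="fail"
attempt=1
while [ "$attempt" -le "$max_retries" ]; do
  if sprint_passes "$attempt"; then
    result="pass"
    break
  fi
  attempt=$((attempt + 1))
done
echo "sprint result=$result after attempt $attempt"
```

A sprint that exhausts its retries would leave `result=fail`, which the orchestrator can surface instead of silently moving on.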
## Try It

```bash
git clone https://github.com/Mnemo-mcp/Harness
cd Harness
./orchestrator.sh
```

Requires: Kiro CLI + Playwright MCP configured in `.kiro/settings/mcp.json`.
The model is the engine. The harness is what makes it drive straight.