Nikhil tiwari

Building a Website with Anthropic's Generator-Evaluator Loop (Harness Engineering)

Anthropic recently published their harness design for long-running apps — a GAN-inspired architecture where a Generator builds and an Evaluator critiques in a loop. I replicated it with Kiro CLI to build a marketing website autonomously.

Result: 12 iterations, 3.5 hours, zero manual coding.

The Architecture

Planner (1x) → Generator ↔ Evaluator (12x)

Each agent is a separate CLI process — clean slate, no shared context:

#!/bin/bash
# orchestrator.sh — spawns fresh sessions per agent

KIRO="/Users/me/.local/bin/kiro-cli"

invoke_kiro() {
  "$KIRO" chat --no-interactive --trust-all-tools \
    "Read $1 and execute all instructions."
}

# Planner (once)
invoke_kiro "prompts/planner.md"

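# Generator ↔ Evaluator loop
for ((i=1; i<=12; i++)); do
  # Stop any dev server left running by the previous evaluation,
  # then give the port a moment to free up
  pkill -f "next dev" 2>/dev/null
  sleep 2
  invoke_kiro "prompts/generator.md"
  invoke_kiro "prompts/evaluator.md"
done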

Communication is only through files:

  • Planner writes .harness/spec.md
  • Evaluator writes .harness/eval-report.md
  • Generator reads both
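Because files are the only channel between agents, a missing or empty handoff file means the next agent starts with nothing to work from. A minimal sketch of a guard the orchestrator could run between steps is shown here; `require_file` is a hypothetical helper, not part of the original harness.

```shell
# require_file PATH
# Succeeds only if PATH exists and is non-empty (hypothetical helper).
require_file() {
  local path="$1"
  if [ ! -s "$path" ]; then
    echo "missing or empty handoff file: $path" >&2
    return 1
  fi
}

# Possible use in orchestrator.sh, after the planner has run:
#   invoke_kiro "prompts/planner.md"
#   require_file ".harness/spec.md" || exit 1
```

The same check could run on .harness/eval-report.md before each generator pass after the first.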

The Evaluator Uses Playwright

This is the key differentiator. The evaluator doesn't just read code — it browses the live site:

# evaluator.md (excerpt)

### Start Server
nohup npx next dev --port 3000 > /dev/null 2>&1 & disown; sleep 8

### Test with Playwright MCP
1. browser_navigate to http://localhost:3000
2. browser_snapshot at 1440x900, 768x1024, 375x812
3. browser_click all interactive elements
4. Score: Design Quality, Originality, Craft, Functionality

### Always provide feedback — never just "PASS"
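The excerpt waits a fixed 8 seconds for the dev server, which can be too short on a cold start and wasteful on a warm one. One alternative, sketched here under the assumption that the readiness probe is supplied by the caller, is to poll until the server answers. `wait_until` is a hypothetical helper and not something the original prompt uses.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND about once per second until it succeeds,
# or gives up once TIMEOUT_SECONDS have been waited (hypothetical helper).
wait_until() {
  local timeout_s="$1"
  shift
  local waited=0
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Possible use in the "Start Server" step (assumes curl is installed):
#   nohup npx next dev --port 3000 > /dev/null 2>&1 &
#   wait_until 30 curl -sf http://localhost:3000 || echo "dev server did not start" >&2
```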

The Design Skill (Anti-AI-Slop)

Both Generator and Evaluator reference Anthropic's frontend design skill:

## NEVER
- Inter, Roboto, Arial, Space Grotesk
- Purple gradients on white backgrounds
- Predictable card layouts

## ALWAYS
- Bold, distinctive font choices
- Unexpected layouts, asymmetry
- Commit fully to a direction

Designs that fall back on generic AI patterns are capped at 5/10, so the generator has to take creative risks to score higher.
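The post does not show where the 5/10 ceiling is enforced; presumably the evaluator prompt applies it. If you wanted the orchestrator to enforce it as a backstop, a minimal sketch could look like this. The `cap_score` function and its yes/no flag are hypothetical, and the sketch assumes whole-number scores.

```shell
# cap_score RAW_SCORE USES_GENERIC_PATTERNS
# Prints the score, limited to 5 when the second argument is "yes".
# Hypothetical helper; assumes RAW_SCORE is an integer from 0 to 10.
cap_score() {
  local raw="$1" generic="$2" cap=5
  if [ "$generic" = "yes" ] && [ "$raw" -gt "$cap" ]; then
    echo "$cap"
  else
    echo "$raw"
  fi
}

# Example: a polished but generic design cannot score above the cap.
#   cap_score 8 yes
#   cap_score 8 no
```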

Iteration Progression

| Iteration | What Happened | Design Score |
|-----------|---------------|--------------|
| 1 | Generic dark site, Inter font | 5/10 |
| 2 | Amber palette pivot, custom SVGs | 6/10 |
| 3 | Fixed a11y, responsive, fonts | 6/10 |
| 4 | Terminal Noir pivot: IBM Plex Mono, grain, scanlines | 7/10 |
| 5-12 | Refinement: animations, reduced-motion, polish | 7-8/10 |

The biggest jumps came from pivots, not refinements.

Key Lessons

  1. Separate generator from evaluator: a model is an unreliable judge of its own work
  2. Clean slate per invocation: prevents context anxiety
  3. Playwright > code review: browsing the live site catches visual bugs that reading code misses
  4. Penalize generic patterns: otherwise the model converges on AI slop
  5. Pivots > polish: tell the generator to scrap and restart if scores stagnate
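Lesson 5 depends on noticing when scores have stopped improving. A minimal sketch of one way to detect that, assuming the orchestrator collects one integer design score per iteration, is shown here. The `is_stagnant` function and its window parameter are hypothetical and not taken from the harness.

```shell
# is_stagnant WINDOW SCORE...
# Succeeds (exit 0) when none of the last WINDOW scores is higher than
# the score just before them. Fails when there is not enough history.
# Hypothetical helper; assumes integer scores, oldest first.
is_stagnant() {
  local window="$1"
  shift
  [ "$#" -gt "$window" ] || return 1
  # Drop older scores so $1 is the baseline and the rest is the window.
  shift $(( $# - window - 1 ))
  local baseline="$1"
  shift
  local score
  for score in "$@"; do
    if [ "$score" -gt "$baseline" ]; then
      return 1
    fi
  done
  return 0
}

# Possible use: with a window of 1, scores of 5, 6, 6 count as stalled.
#   if is_stagnant 1 5 6 6; then echo "scores stalled: ask for a pivot"; fi
```

How the pivot request reaches the generator is left open here; given the file-only communication described earlier, appending it to a file the generator already reads would be one option.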

Two Architectures

I built templates for both use cases:

Frontend (design-focused):

  • No sprints, single build
  • 5-15 iterations of generate → evaluate → improve
  • Scoring: Design Quality, Originality, Craft, Functionality

Full-stack (correctness-focused):

  • Sprint-based with contracts
  • Generator + Evaluator negotiate "done" before coding
  • Pass/fail per sprint, max 3 retries
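The pass/fail rule with bounded retries in the full-stack template can be sketched as a small loop. This version reads "max 3 retries" as one initial attempt plus up to three more, which is an interpretation rather than something the post states. `run_sprint` and the two command names passed to it are hypothetical.

```shell
# run_sprint MAX_RETRIES BUILD_CMD CHECK_CMD
# Runs BUILD_CMD, then CHECK_CMD. Repeats until CHECK_CMD succeeds or
# MAX_RETRIES retries have been used after the first attempt.
# Hypothetical helper; BUILD_CMD and CHECK_CMD are command or function names.
run_sprint() {
  local max_retries="$1" build_cmd="$2" check_cmd="$3"
  local attempt=0
  while [ "$attempt" -le "$max_retries" ]; do
    "$build_cmd"
    if "$check_cmd"; then
      echo "PASS after $((attempt + 1)) attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "FAIL after $((max_retries + 1)) attempt(s)"
  return 1
}

# Possible use, where generate_sprint and evaluate_sprint would wrap
# the generator and evaluator invocations for one sprint:
#   run_sprint 3 generate_sprint evaluate_sprint || exit 1
```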

Try It

git clone https://github.com/Mnemo-mcp/Harness
cd Harness
./orchestrator.sh

Requires: Kiro CLI + Playwright MCP configured in .kiro/settings/mcp.json.
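Since the evaluator depends on Playwright MCP, it can help to confirm the configuration is in place before starting a multi-hour run. The sketch below is a plain text match on the config file, not a validation of Kiro's configuration format, and `has_playwright_mcp` is a hypothetical helper.

```shell
# has_playwright_mcp CONFIG_PATH
# Succeeds if CONFIG_PATH exists and mentions "playwright" anywhere.
# Hypothetical helper; a text match only, it does not parse the JSON.
has_playwright_mcp() {
  local config="$1"
  [ -f "$config" ] && grep -qi "playwright" "$config"
}

# Possible use before running ./orchestrator.sh:
#   has_playwright_mcp ".kiro/settings/mcp.json" || {
#     echo "Playwright MCP does not appear to be configured" >&2
#     exit 1
#   }
```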


The model is the engine. The harness is what makes it drive straight.

