Single-page Claude writes beautifully. At 5 pages it drifts. Here's the harness I built.

tiezhu — Thu, 18 Jun 2026 05:17:55 +0000

TL;DR

I gave Claude / Codex a Figma file + a PRD and asked for 5-10 React pages of a working app. Single-page output is great. Multi-page output drifts in 4 specific ways. I spent ~3 months building a harness with 14 gates × auto-retry × handoff JSON to stop the drift. 10 demos, 54 screens, 4 unrelated business domains, build-green rate 100%.

Code: https://github.com/JiuwenDragon/harness-mini

The honest opening

Every "Figma to code with AI" demo on Twitter shows one screen. That's a real result: Claude's vision is genuinely good at single-page UI. I verified this many times during my research. Giving Claude a screenshot plus a paragraph of PRD produces a page I'd score 70-80 out of 100 in about 30 seconds.

The promise breaks at 5+ screens. Here are the 4 drift modes I measured.

Drift mode 1: Inconsistent copy

| Screen 1 | Screen 2 | Screen 3 |
| --- | --- | --- |
| Username: Zhang San | Username: Li Si | Username: Test User |

An LLM doesn't carry a "world state" across page generations. Without explicit injection, it re-invents details such as the username on every page.
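
A minimal sketch of what "explicit injection" could look like, assuming one frozen fixture dict is prepended to every per-page prompt. The names `WORLD_STATE` and `build_page_prompt` and the fixture values are hypothetical illustrations, not harness-mini's actual API.

```python
import json

# Hypothetical shared fixture: one source of truth for copy reused across screens.
WORLD_STATE = {"username": "Zhang San", "accountLabel": "Main account"}


def build_page_prompt(screen_name: str, prd_excerpt: str, world_state: dict) -> str:
    """Prefix every per-page prompt with the same frozen fixture data."""
    fixture = json.dumps(world_state, ensure_ascii=False, sort_keys=True, indent=2)
    return (
        f"Screen: {screen_name}\n"
        "Use ONLY these fixture values for user-facing data; do not invent names.\n"
        f"{fixture}\n\n"
        f"PRD excerpt:\n{prd_excerpt}\n"
    )


# Two different screens receive the identical fixture block.
home_prompt = build_page_prompt("home", "Show the account overview.", WORLD_STATE)
transfer_prompt = build_page_prompt("transfer", "Let the user send money.", WORLD_STATE)
```

Because both prompts embed the same serialized fixture, a later gate can check generated pages against it instead of trusting the model to remember.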

Drift mode 2: Dead-link routing

// Screen "transfer" generated:
<button onClick={() => router.push("/banking/home")}>  // ← /banking
// Screen "home" actually at:
app/bank/home/page.tsx                                  // ← /bank

Single-page review never catches this. Click-through breaks.
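
A check for this can be sketched in a few lines, assuming Next.js App Router conventions (`app/**/page.tsx`) and literal string targets in `router.push`. The function names are hypothetical, and the sketch ignores route groups, dynamic segments, and `<Link href>`, which a real gate would need to handle.

```python
import re
from pathlib import PurePosixPath

# Matches router.push("/some/path") or router.push('/some/path') with a literal target.
PUSH_RE = re.compile(r"""router\.push\(\s*["'](/[^"']*)["']\s*\)""")


def routes_from_pages(page_files: list[str]) -> set[str]:
    """Derive URL routes from page files laid out as app/<segments>/page.tsx."""
    routes = set()
    for f in page_files:
        parts = PurePosixPath(f).parts
        if len(parts) >= 2 and parts[0] == "app" and parts[-1] == "page.tsx":
            routes.add("/" + "/".join(parts[1:-1]))
    return routes


def dead_links(sources: dict[str, str], page_files: list[str]) -> list[tuple[str, str]]:
    """Return (file, target) pairs whose router.push target has no matching page."""
    known = routes_from_pages(page_files)
    found = []
    for path, code in sorted(sources.items()):
        for target in PUSH_RE.findall(code):
            normalized = target.rstrip("/") or "/"
            if normalized not in known:
                found.append((path, target))
    return found


pages = ["app/bank/home/page.tsx", "app/bank/transfer/page.tsx"]
sources = {"app/bank/transfer/page.tsx": 'onClick={() => router.push("/banking/home")}'}
broken = dead_links(sources, pages)
```

Here `broken` flags the `/banking/home` target from the example above, because only `/bank/home` and `/bank/transfer` exist on disk.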

Drift mode 3: Shared state drift

Take a zustand store with 5 keys (user, balance, lastTx, recent[], selected). By screen 4 the LLM forgets 2-3 of them and makes up new ones, so the same business concept ends up with three different variable names.
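
One way to catch invented keys, sketched under the assumption that pages read the store through selector calls like `useStore((s) => s.balance)`. The hook name `useStore` and the function name are assumptions; destructured reads would need a different pattern.

```python
import re

# Contract: the only keys pages may read from the shared store (the 5 keys named above).
CONTRACT_KEYS = {"user", "balance", "lastTx", "recent", "selected"}

# Matches selector-style reads such as useStore((s) => s.balance) or useStore(s => s.user).
SELECTOR_RE = re.compile(r"useStore\(\s*\(?\s*(\w+)\s*\)?\s*=>\s*\1\.(\w+)")


def unknown_store_keys(code: str, contract: set[str]) -> set[str]:
    """Return store keys that the page reads but the contract does not define."""
    used = {key for _param, key in SELECTOR_RE.findall(code)}
    return used - contract


page_code = """
const balance = useStore((s) => s.balance);
const acct = useStore((s) => s.accountBalance);
const last = useStore(state => state.lastTransaction);
"""
invented = unknown_store_keys(page_code, CONTRACT_KEYS)
```

In this example `invented` holds `accountBalance` and `lastTransaction`, two made-up names for concepts the contract already calls `balance` and `lastTx`.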

Drift mode 4: "Claimed done" hallucination

> Codex: All 10 pages generated, ready to preview.
> me: npm run build
> 3 pages: red. 2 pages: empty <div /> stubs. 1 page: import path wrong.

This one is the most painful. Without an external check, "claimed done" ≠ done.
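
A rough sketch of an external "done" check, assuming each screen lives at `app/<screen>/page.tsx`. The stub regex is a heuristic I made up for illustration (it only catches an empty lowercase HTML element or an empty fragment), and `classify_pages` and `build_is_green` are hypothetical names.

```python
import re
import subprocess

# Heuristic stub detector: `return <div />`, `return (<div></div>)`, or `return <></>`.
STUB_RE = re.compile(r"return\s*\(?\s*(?:<[a-z]*\s*/>|<([a-z]*)>\s*</\1>)")


def classify_pages(claimed_screens: list[str], files: dict[str, str]) -> dict[str, str]:
    """Label each claimed screen as ok, stub, or missing, ignoring the model's claim."""
    report = {}
    for screen in claimed_screens:
        code = files.get(f"app/{screen}/page.tsx")
        if code is None:
            report[screen] = "missing"
        elif STUB_RE.search(code):
            report[screen] = "stub"
        else:
            report[screen] = "ok"
    return report


def build_is_green(project_dir: str) -> bool:
    """Run the real build; only the exit code counts, not what the model said."""
    result = subprocess.run(
        ["npm", "run", "build"], cwd=project_dir, capture_output=True, text=True
    )
    return result.returncode == 0


files = {
    "app/home/page.tsx": "export default function Page() { return <main><h1>Home</h1></main>; }",
    "app/transfer/page.tsx": "export default function Page() { return <div />; }",
}
report = classify_pages(["home", "transfer", "history"], files)
```

The point of the design is that neither function reads the model's summary message; both look only at files on disk and the build's exit code.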

What the harness does (architecture)

Figma + PRD
    ↓ intake (fixture split)
    ↓ contract (frozen spec)
    ↓ generate (codex / claude / gemini)
    ↓ 14 gates (semantic / PRD / spec / UI hygiene / build / cross-canvas)
    ↓ visual review (human)
    ↓ web-preview (clickable)

Each gate is scoped to one constraint. Why? The Constraint Decay paper (arXiv 2605.06445) reports that stuffing 10+ constraints into one prompt drops LLM performance by 30 percentage points.

The retry loop: when a gate fails, the gate's structured error report (not a vague "try again") is fed back to the LLM. Reflexion-style.
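
The loop can be sketched roughly as follows, assuming a gate is a function that returns a structured report and a generator is a function that accepts the previous errors. `GateReport`, `run_with_retry`, and the fake generator are hypothetical stand-ins, not the harness's real interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GateReport:
    gate: str
    passed: bool
    errors: list[str] = field(default_factory=list)  # specific, actionable messages


def run_with_retry(
    generate: Callable[[list[str]], str],
    gate: Callable[[str], GateReport],
    max_attempts: int = 3,
) -> tuple[str, list[GateReport]]:
    """Regenerate until the gate passes, feeding its error list into the next attempt."""
    feedback: list[str] = []
    history: list[GateReport] = []
    code = ""
    for _attempt in range(max_attempts):
        code = generate(feedback)
        report = gate(code)
        history.append(report)
        if report.passed:
            break
        feedback = report.errors  # structured errors, not "try again"
    return code, history


def fake_generate(feedback: list[str]) -> str:
    # Stand-in for an LLM call: fixes the route only when told what is wrong.
    return 'router.push("/bank/home")' if feedback else 'router.push("/banking/home")'


def route_gate(code: str) -> GateReport:
    if '"/bank/home"' in code:
        return GateReport("routing", True)
    return GateReport("routing", False, ["router.push target /banking/home has no page; use /bank/home"])


final_code, history = run_with_retry(fake_generate, route_gate)
```

The caller inspects `history[-1].passed` to tell a fixed page from one that exhausted its attempts, and the full history is what would go into an audit log.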

The handoff: each stage emits *_status.json so a new operator (or a new LLM session) can pick up without reading the conversation.
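
A minimal sketch of the handoff idea, assuming one `<stage>_status.json` file per stage. The stage identifiers are adapted from the architecture diagram above, and the JSON fields (`stage`, `passed`, `failures`) are hypothetical, not the harness's actual schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

# Stage order adapted from the architecture diagram; identifiers are assumed.
STAGES = ["intake", "contract", "generate", "gates", "visual_review", "web_preview"]


def write_status(out_dir: Path, stage: str, passed: bool, failures: list) -> Path:
    """Persist the outcome of one stage so nobody has to re-read the conversation."""
    status = {"stage": stage, "passed": passed, "failures": failures}
    path = out_dir / f"{stage}_status.json"
    path.write_text(json.dumps(status, indent=2), encoding="utf-8")
    return path


def resume_point(out_dir: Path) -> Optional[str]:
    """Return the first stage that has not passed yet, or None if all are done."""
    for stage in STAGES:
        path = out_dir / f"{stage}_status.json"
        if not path.exists():
            return stage
        if not json.loads(path.read_text(encoding="utf-8"))["passed"]:
            return stage
    return None


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    write_status(out, "intake", True, [])
    write_status(out, "contract", True, [])
    write_status(out, "generate", False, ["app/transfer/page.tsx: import path wrong"])
    resumed = resume_point(out)  # a fresh session would restart here
```

A new operator or a new LLM session only needs the output directory: `resume_point` tells it where to pick up, and the `failures` list tells it why.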

Why 14 gates and not 1 big one

Constraint Decay (arXiv 2605.06445) measured the drop directly.
Lost in the Middle (arXiv 2307.03172) shows that LLMs use information buried in the middle of a long prompt less reliably than information at the start or end.
So I push one check per gate, max ~3 constraints per LLM round.
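
To illustrate the "max ~3 constraints per LLM round" rule, here is a small sketch that splits a list of outstanding constraints into rounds. The function name and the example constraint strings are hypothetical; how the harness actually batches them may differ.

```python
def plan_rounds(constraints: list[str], max_per_round: int = 3) -> list[list[str]]:
    """Split outstanding constraints into rounds of at most max_per_round items."""
    if max_per_round < 1:
        raise ValueError("max_per_round must be at least 1")
    return [
        constraints[i : i + max_per_round]
        for i in range(0, len(constraints), max_per_round)
    ]


outstanding = [
    "username must match the fixture",
    "router.push targets must exist",
    "store keys must match the contract",
    "no empty stub pages",
    "build must pass",
]
rounds = plan_rounds(outstanding)
```

With five outstanding constraints this yields two rounds (three, then two), so no single prompt carries the whole list.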

Generalization evidence (4-domain ablation)

| Domain | Theme color | Screens | Build pass |
| --- | --- | --- | --- |
| Banking | Deep red | 10 | 10/10 |
| Fitness | Orange | 3 | 3/3 |
| Travel | Blue | 3 | 3/3 |
| Shoes | Black | 3 | 3/3 |

Same 14 gates. Same Codex/Claude/Gemini providers swapped via contract. No per-domain prompt tuning.

Why not just use Builder.io / Locofy / v0 / Figma Make

| Tool | Strength | Why it's not what I needed |
| --- | --- | --- |
| Builder.io Visual Copilot | 2M+ training data, Mitosis IR | SaaS, no PRD dimension, no audit trail |
| Locofy LDM | Large Design Model | SaaS, design system requires strict Auto Layout |
| Figma Make | Highest fidelity (EPAM benchmark) | No public API, browser-only, $16/mo seat |
| v0 (Vercel) | Tight shadcn/Next.js | Figma link silently downgrades to screenshot (loses metadata) |

These are all great for "single dev makes a pretty page." None give me multi-page consistency + PRD enforcement + audit log + on-prem + provider swap, which is the actual enterprise need.

What I'd do differently if starting over

Don't fight LLMs on single-page output. Claude with vision is already 80% there. Build the harness around what they're bad at: cross-page consistency and "claimed done."
Build a deterministic IR earlier. I attempted this (in the style of Builder.io's Mitosis) and abandoned it at the first rendering bug. That was the wrong call: the IR is what Builder.io's whole architecture pivots on.
Get visual diff automated. I still rely on human visual review. Design2Code (arXiv 2403.03163) shows CLIP-score / CW-SSIM / TreeBLEU as auto metrics — should have wired one in.

Stuff that's open source

https://github.com/JiuwenDragon/harness-mini

14 gates as discrete Python scripts under scripts/
10 demo fixtures with full codex/claude/gemini traces
HE evolution log: every iteration with root cause + fix + prediction (87 entries)
Docs: design rationale + maturity map + workflow

MIT license (I should add the file — open to PR).

Papers cited

Constraint Decay (arXiv 2605.06445)
Lost in the Middle (arXiv 2307.03172)
Design2Code (arXiv 2403.03163)
Reflexion (arXiv 2303.11366)
Handoff Debt (arXiv 2606.02875)

Happy to answer questions in comments. The most useful feedback would be: "what other drift modes have you seen at >5 pages."

DEV Community: tiezhu