<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: tiezhu</title>
    <description>The latest articles on DEV Community by tiezhu (@xiaolangtizi).</description>
    <link>https://dev.to/xiaolangtizi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3989065%2Ff33502e8-8c43-453e-9719-742a553432f3.png</url>
      <title>DEV Community: tiezhu</title>
      <link>https://dev.to/xiaolangtizi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xiaolangtizi"/>
    <language>en</language>
    <item>
      <title>Single-page Claude writes beautifully. At 5 pages it drifts. Here's the harness I built.</title>
      <dc:creator>tiezhu</dc:creator>
      <pubDate>Thu, 18 Jun 2026 05:17:55 +0000</pubDate>
      <link>https://dev.to/xiaolangtizi/single-page-claude-writes-beautifully-at-5-pages-it-drifts-heres-the-harness-i-built-4cdn</link>
      <guid>https://dev.to/xiaolangtizi/single-page-claude-writes-beautifully-at-5-pages-it-drifts-heres-the-harness-i-built-4cdn</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I gave Claude / Codex a Figma file + a PRD and asked for 5-10 React pages of a working app. &lt;strong&gt;Single-page output is great. Multi-page output drifts in 4 specific ways.&lt;/strong&gt; I spent ~3 months building a harness that combines 14 gates, auto-retry, and handoff JSON to stop the drift. Results so far: 10 demos, 54 screens, 4 unrelated business domains, 100% build-green rate.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/JiuwenDragon/harness-mini" rel="noopener noreferrer"&gt;https://github.com/JiuwenDragon/harness-mini&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The honest opening
&lt;/h2&gt;

&lt;p&gt;Every "Figma to code with AI" demo on Twitter shows one screen. That's a real result — Claude vision is genuinely good at single-page UI. I verified this many times during my research: &lt;strong&gt;giving Claude a screenshot + a paragraph of PRD produces a 70-80 point page in 30 seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The promise breaks at 5+ screens. Here are the 4 drift modes I measured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Drift mode 1: Inconsistent copy
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Screen 1&lt;/th&gt;
&lt;th&gt;Screen 2&lt;/th&gt;
&lt;th&gt;Screen 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Username: Zhang San&lt;/td&gt;
&lt;td&gt;Username: Li Si&lt;/td&gt;
&lt;td&gt;Username: Test User&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;An LLM doesn't carry a "world state" across page generations. Without explicit injection, it reinvents names and sample data on every screen.&lt;/p&gt;
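
&lt;p&gt;A minimal sketch of what "explicit injection" could look like: one frozen fixture is serialized and prepended to every per-screen prompt. The names &lt;code&gt;WORLD_STATE&lt;/code&gt; and &lt;code&gt;build_page_prompt&lt;/code&gt;, the fixture fields, and the prompt wording are all assumptions for illustration, not the harness's actual code.&lt;/p&gt;

```python
import json

# Hypothetical shared fixture: one source of truth for on-screen data.
WORLD_STATE = {
    "user": {"name": "Zhang San"},
    "balance": "12,800.00",
}


def build_page_prompt(screen: str, prd_excerpt: str, world: dict) -> str:
    """Prepend the same frozen fixture to every per-screen prompt."""
    fixture = json.dumps(world, ensure_ascii=False, sort_keys=True, indent=2)
    return (
        f"Screen: {screen}\n"
        f"PRD: {prd_excerpt}\n"
        "Use ONLY the following fixture for names, amounts and labels:\n"
        f"{fixture}\n"
    )


# Every screen gets the identical fixture, so the username cannot drift.
prompts = [
    build_page_prompt(screen, "Show the account owner.", WORLD_STATE)
    for screen in ("home", "transfer", "history")
]
```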

&lt;h2&gt;
  
  
  Drift mode 2: Dead-link routing
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Screen "transfer" generated:&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;button&lt;/span&gt; &lt;span class="na"&gt;onClick&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/banking/home&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;  // ← /banking
// Screen "home" actually at:
app/bank/home/page.tsx                                  // ← /bank
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Single-page review never catches this. Click-through breaks.&lt;/p&gt;
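
&lt;p&gt;A cross-page check along these lines can catch the mismatch before anyone clicks. This is a sketch under assumptions: it takes a Next.js App Router layout (&lt;code&gt;app/.../page.tsx&lt;/code&gt;), only sees string-literal &lt;code&gt;router.push&lt;/code&gt; targets, and works on an in-memory map of file path to source instead of the real file tree. The function names are hypothetical.&lt;/p&gt;

```python
import re
from pathlib import PurePosixPath

# Matches router.push("/some/path") or router.push('/some/path').
PUSH_RE = re.compile(r"""router\.push\(\s*["']([^"']+)["']""")


def route_of(page_file: str) -> str:
    """Map an App Router file like app/bank/home/page.tsx to /bank/home."""
    parts = PurePosixPath(page_file).parent.parts[1:]  # drop the leading "app"
    return "/" + "/".join(parts)


def dead_links(sources: dict[str, str]) -> list[tuple[str, str]]:
    """Return (file, target) pairs whose push target has no matching page."""
    known = {route_of(f) for f in sources if f.endswith("page.tsx")}
    return [
        (f, target)
        for f, src in sorted(sources.items())
        for target in PUSH_RE.findall(src)
        if target not in known
    ]


sources = {
    "app/bank/home/page.tsx": "export default function Home() { return null }",
    "app/bank/transfer/page.tsx": 'router.push("/banking/home")',
}
report = dead_links(sources)
```

&lt;p&gt;Here &lt;code&gt;report&lt;/code&gt; flags the transfer screen, because &lt;code&gt;/banking/home&lt;/code&gt; is not among the routes derived from the page files.&lt;/p&gt;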

&lt;h2&gt;
  
  
  Drift mode 3: Shared state drift
&lt;/h2&gt;

&lt;p&gt;Take a Zustand store with 5 keys (user, balance, lastTx, recent[], selected). By screen 4 the LLM forgets 2-3 of them and makes up new ones, so the same business concept ends up with three different variable names.&lt;/p&gt;
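
&lt;p&gt;One way to detect this is to compare the keys each screen selects against the keys frozen in the contract. The sketch below assumes the store hook is called &lt;code&gt;useStore&lt;/code&gt; and that screens read state through simple arrow selectors; both are assumptions, and a regex like this will miss destructuring or more complex selectors.&lt;/p&gt;

```python
import re

# The 5 keys from the store described above.
CONTRACT_KEYS = {"user", "balance", "lastTx", "recent", "selected"}

# Matches useStore((s) => s.key) and useStore(s => s.key); hook name is assumed.
SELECTOR_RE = re.compile(r"useStore\(\s*\(?(\w+)\)?\s*=>\s*\1\.(\w+)")


def unknown_store_keys(source: str, allowed: set[str]) -> set[str]:
    """Return store keys a screen reads that are not in the contract."""
    used = {m.group(2) for m in SELECTOR_RE.finditer(source)}
    return used - allowed


screen4 = """
const balance = useStore((s) => s.balance)
const last = useStore((s) => s.lastTransaction)
const me = useStore(s => s.currentUser)
"""
unknown = unknown_store_keys(screen4, CONTRACT_KEYS)
```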

&lt;h2&gt;
  
  
  Drift mode 4: "Claimed done" hallucination
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; Codex: All 10 pages generated, ready to preview.
&amp;gt; me: npm run build
&amp;gt; 3 pages: red. 2 pages: empty &amp;lt;div /&amp;gt; stubs. 1 page: import path wrong.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one is the most painful. Without an external check, "claimed done" ≠ done.&lt;/p&gt;
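
&lt;p&gt;The external check can be as small as running the real build and trusting its exit code over the model's message. A hedged sketch: &lt;code&gt;verify_done&lt;/code&gt; and the shape of its result are hypothetical, and a Python one-liner stands in for &lt;code&gt;npm run build&lt;/code&gt; so the example runs anywhere.&lt;/p&gt;

```python
import subprocess
import sys


def verify_done(cmd: list[str], cwd: str = ".") -> dict:
    """Run the real build; trust the exit code, not the model's claim."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return {
        "passed": proc.returncode == 0,
        "exit_code": proc.returncode,
        # Keep only the tail of the log so it fits into a retry prompt.
        "log_tail": (proc.stdout + proc.stderr)[-2000:],
    }


# In a real project the command would be ["npm", "run", "build"].
ok = verify_done([sys.executable, "-c", "print('build ok')"])
bad = verify_done([sys.executable, "-c", "import sys; sys.exit(1)"])
```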




&lt;h2&gt;
  
  
  What the harness does (architecture)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Figma + PRD
    ↓ intake (fixture split)
    ↓ contract (frozen spec)
    ↓ generate (codex / claude / gemini)
    ↓ 14 gates (semantic / PRD / spec / UI hygiene / build / cross-canvas)
    ↓ visual review (human)
    ↓ web-preview (clickable)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each gate is &lt;strong&gt;scoped to one constraint&lt;/strong&gt;. Why? See the Constraint Decay paper (arXiv 2605.06445): stuffing 10+ constraints into one prompt drops LLM performance by 30 percentage points.&lt;/p&gt;

&lt;p&gt;The retry loop: when a gate fails, the gate's structured error report (not a vague "try again") is fed back to the LLM. Reflexion-style.&lt;/p&gt;
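
&lt;p&gt;Roughly, the loop has this shape. It is a sketch, not the harness's code: &lt;code&gt;GateReport&lt;/code&gt;, &lt;code&gt;run_with_retry&lt;/code&gt;, and the limit of 3 attempts are assumptions, and the generator and gate are stand-ins for the LLM call and a real routing check.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class GateReport:
    """Structured result of one gate; all field names are hypothetical."""
    gate: str
    passed: bool
    errors: list[str] = field(default_factory=list)


def run_with_retry(
    generate: Callable[[list[str]], str],
    gate: Callable[[str], GateReport],
    max_attempts: int = 3,
) -> tuple[Optional[str], GateReport, int]:
    """Regenerate until the gate passes, feeding back its concrete errors."""
    feedback: list[str] = []
    for attempt in range(1, max(1, max_attempts) + 1):
        code = generate(feedback)
        report = gate(code)
        if report.passed:
            return code, report, attempt
        feedback = report.errors  # specific findings, not a vague "try again"
    return None, report, attempt


# Stand-ins for the LLM call and a routing gate.
def fake_generate(feedback: list[str]) -> str:
    return 'router.push("/bank/home")' if feedback else 'router.push("/banking/home")'


def routing_gate(code: str) -> GateReport:
    if "/banking/" in code:
        return GateReport("routing", False, ["/banking/home has no page; use /bank/home"])
    return GateReport("routing", True)


code, report, attempts = run_with_retry(fake_generate, routing_gate)
```

&lt;p&gt;The first attempt fails the gate, the error text goes into the second call, and the second attempt passes.&lt;/p&gt;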

&lt;p&gt;The handoff: each stage emits &lt;code&gt;*_status.json&lt;/code&gt; so a new operator (or a new LLM session) can pick up without reading the conversation.&lt;/p&gt;
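
&lt;p&gt;To illustrate the idea, here is a sketch of writing and reading such status files. The stage names follow the architecture diagram above, but the JSON fields (&lt;code&gt;stage&lt;/code&gt;, &lt;code&gt;passed&lt;/code&gt;, &lt;code&gt;notes&lt;/code&gt;) and both helper functions are assumptions; the real schema may differ.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

# Stage order taken from the pipeline diagram; identifiers are hypothetical.
STAGES = ["intake", "contract", "generate", "gates", "visual_review", "web_preview"]


def write_status(run_dir: Path, stage: str, passed: bool, notes: list[str]) -> Path:
    """Persist one stage's outcome as STAGE_status.json (hypothetical schema)."""
    path = run_dir / f"{stage}_status.json"
    payload = {"stage": stage, "passed": passed, "notes": notes}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def resume_point(run_dir: Path) -> Optional[str]:
    """Return the first stage without a passing status file, or None if all passed."""
    for stage in STAGES:
        path = run_dir / f"{stage}_status.json"
        if not path.exists():
            return stage
        if not json.loads(path.read_text(encoding="utf-8"))["passed"]:
            return stage
    return None


run_dir = Path(tempfile.mkdtemp())
write_status(run_dir, "intake", True, ["fixture split into 5 screens"])
write_status(run_dir, "contract", True, ["spec frozen"])
next_stage = resume_point(run_dir)
```

&lt;p&gt;A fresh session only needs &lt;code&gt;resume_point&lt;/code&gt; and the notes in the last file to know where to continue.&lt;/p&gt;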

&lt;h2&gt;
  
  
  Why 14 gates and not 1 big one
&lt;/h2&gt;

&lt;p&gt;Constraint Decay (arXiv 2605.06445) measured the drop directly.&lt;br&gt;
Lost in the Middle (arXiv 2307.03172) shows that LLMs tend to miss information buried in the middle of long prompts.&lt;br&gt;
So I push &lt;strong&gt;one check per gate&lt;/strong&gt;, max ~3 constraints per LLM round.&lt;/p&gt;
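
&lt;p&gt;Capping the constraints per round amounts to simple batching. A sketch, assuming constraints are plain strings and each batch becomes its own LLM round; the function name and the example constraints are made up for illustration.&lt;/p&gt;

```python
def batch_constraints(constraints: list[str], max_per_round: int = 3) -> list[list[str]]:
    """Split constraints into rounds of at most max_per_round items each."""
    return [
        constraints[i:i + max_per_round]
        for i in range(0, len(constraints), max_per_round)
    ]


constraints = [
    "use fixture names only",
    "routes must exist",
    "store keys from contract",
    "no empty stub pages",
    "imports must resolve",
    "copy matches PRD",
    "build must pass",
]
rounds = batch_constraints(constraints)
```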

&lt;h2&gt;
  
  
  Generalization evidence (4-domain ablation)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Color&lt;/th&gt;
&lt;th&gt;Screens&lt;/th&gt;
&lt;th&gt;Build pass&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Banking&lt;/td&gt;
&lt;td&gt;Deep red&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;10/10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fitness&lt;/td&gt;
&lt;td&gt;Orange&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Travel&lt;/td&gt;
&lt;td&gt;Blue&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shoes&lt;/td&gt;
&lt;td&gt;Black&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3/3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Same 14 gates. Same Codex/Claude/Gemini providers swapped via contract. No per-domain prompt tuning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not just use Builder.io / Locofy / v0 / Figma Make
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Strength&lt;/th&gt;
&lt;th&gt;Why it's not what I needed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Builder.io Visual Copilot&lt;/td&gt;
&lt;td&gt;2M+ training data, Mitosis IR&lt;/td&gt;
&lt;td&gt;SaaS, no PRD dim, no audit trail&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Locofy LDM&lt;/td&gt;
&lt;td&gt;Large Design Model&lt;/td&gt;
&lt;td&gt;SaaS, design system requires strict Auto Layout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Figma Make&lt;/td&gt;
&lt;td&gt;Highest fidelity (EPAM benchmark)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;No public API&lt;/strong&gt;, browser-only, $16/mo seat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v0 (Vercel)&lt;/td&gt;
&lt;td&gt;Tight shadcn/Next.js&lt;/td&gt;
&lt;td&gt;Figma link silently downgrades to screenshot (loses metadata)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are all great for "single dev makes a pretty page." None give me &lt;strong&gt;multi-page consistency + PRD enforcement + audit log + on-prem + provider swap&lt;/strong&gt;, which is the actual enterprise need.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently if starting over
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't fight LLMs on single-page output&lt;/strong&gt;. Claude with vision is already 80% there. Build the harness around what they're bad at: cross-page consistency and "claimed done."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build a deterministic IR earlier&lt;/strong&gt;. I attempted this (in the style of Builder.io's Mitosis) and abandoned it at the first rendering bug. That was the wrong call: the IR is what Builder.io's whole architecture pivots on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Get visual diff automated&lt;/strong&gt;. I still rely on human visual review. Design2Code (arXiv 2403.03163) shows CLIP-score / CW-SSIM / TreeBLEU as automatic metrics; I should have wired one in.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Stuff that's open source
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/JiuwenDragon/harness-mini" rel="noopener noreferrer"&gt;https://github.com/JiuwenDragon/harness-mini&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;14 gates as discrete Python scripts under &lt;code&gt;scripts/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;10 demo fixtures with full codex/claude/gemini traces&lt;/li&gt;
&lt;li&gt;HE evolution log: every iteration with root cause + fix + prediction (87 entries)&lt;/li&gt;
&lt;li&gt;Docs: design rationale + maturity map + workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MIT license (I still need to add the license file; PRs welcome).&lt;/p&gt;

&lt;h2&gt;
  
  
  Papers cited
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Constraint Decay (arXiv 2605.06445)&lt;/li&gt;
&lt;li&gt;Lost in the Middle (arXiv 2307.03172)&lt;/li&gt;
&lt;li&gt;Design2Code (arXiv 2403.03163)&lt;/li&gt;
&lt;li&gt;Reflexion (arXiv 2303.11366)&lt;/li&gt;
&lt;li&gt;Handoff Debt (arXiv 2606.02875)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy to answer questions in comments. The most useful feedback would be: "what other drift modes have you seen at &amp;gt;5 pages."&lt;/p&gt;

</description>
      <category>claude</category>
      <category>react</category>
      <category>showdev</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
