From 4–6 Revisions to 0–2: Adding a TDD Loop to AI-driven Flutter UI

Hyohyeok Jeong — Sat, 18 Apr 2026 20:35:04 +0000

TL;DR

After wiring a loop of golden tests and Figma screenshot diffs into my UI workflow, my average number of manual correction rounds dropped from 4–6 to 0–2. Here is how I got there.

The problem I wanted to solve

The dev-cycle skill (Korean), which had been working great for backend and infra work, did not carry over well to the frontend. With tests, lints, and CLAUDE.md in place, I could reliably get well-formed code that followed project patterns and passed behavior checks — but the AI was not actually building the UI to match what Figma MCP was handing it.

On top of that, I was running tasks in parallel across git worktrees. Launching each worktree as a running app, eyeballing it, and iterating by hand had become the obvious bottleneck.

At some point I caught myself wondering: should I just hand the AI business logic and API integration, and keep UI manual? But I wanted to preserve the productivity gains I had come to rely on, so I sat down and asked what concretely had to change.

Boiling it down, I had two problems:

1. I want the implementation to match the design file as closely as possible.
2. I want an easy way to verify and iterate on the implemented design.

Why it was failing: Figma MCP mapping limits + fuzzy "done" criteria

For "1. match the design file closely," two root causes stood out:

- Figma MCP exports design context as React + Tailwind CSS. Mapping that into Flutter/Dart + Material is inherently lossy, and it is not something I can fix at the user level. Better to compensate elsewhere.
- For well-formed code I have a concrete spec: tests and lints. Pass them and you are done. For UI there is no equivalent, so the AI has no objective "when is this done?" signal. I suspected this was fixable, though no obvious approach presented itself at first.

For "2. easy verification and iteration," I figured I could automate my way out:

- Inspect simple components with the widget_book package
- Use golden tests to capture composed, logic-bound UI as images
- Verify actual behavior before CI review

The fix: wrap the UI in a verification loop

The core idea is simple: give the AI a concrete, TDD-style spec for the UI too.

Once I decided to lean on golden tests, the next question was how to define "passing." I fed the AI both the Figma screenshot (via Figma MCP) and the golden screenshot of the current implementation, and had it build its own diff checklist, review the gap, and patch it.

My first instinct was a pixel diff for a quantitative score. But rendering differences between Figma (web) and Flutter (app) are unavoidable, and chasing them can easily wreck the code — the AI itself pointed that out — so I kept the evaluation qualitative.

Results

[Image: original design]

Baseline approach (plan → build). Colors, fonts, and spacing drifted from the design. Components were not reused. Initial output was fast, but I averaged 4–6 correction rounds.

| Metric | Baseline | Golden test + Figma MCP screenshot diff |
| --- | --- | --- |
| Initial plan/build tokens | ~160k tokens + (revision cost × 5) | ~250k tokens + (revision cost × 1) |
| Avg. manual correction rounds | 4–6 | 0–2 |
| Visual fidelity (color/font/spacing) | Low | High |
| Component reuse | Low | High* |

* Component reuse rules already live in CLAUDE.md, but the extra review pass in the new skill seems to enforce them a second time, and the effect compounds.

How the comparison actually works

The reason this produces a measurable delta is that the AI is now handed a concrete, spec-style rubric for the UI. When a UI task completes, it builds two checklists on its own.

① Spec-level diff — Figma vs rendered golden

Each component attribute (typography, spacing, borders, colors, shadow, etc.) is compared one item at a time between the Figma source and the rendered golden, and each is marked as ✅ match / ⚠️ approximation / ❌ mismatch. A simplified excerpt looks like this (the real checklist has 18+ rows):

| Attribute | Figma | Rendered (golden) | Verdict |
| --- | --- | --- | --- |
| Label typography | Pretendard Medium 16, #1A1A1A | `LabelLarge(16/500)`, `onSurface` | ✅ matches |
| Focused border | 2px primary (#597D2E) | 2px `AppColors.primary` | ✅ exact match |
| Field radius | 8px | 8px (`fieldRadius`) | ✅ exact match |
| Disabled fill | #F5F5F5 (neutral10) | `AppColors.surfaceContainerLow` | ⚠️ approximation (token-semantic mapping) |
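To make the shape of that checklist concrete, here is a minimal Dart sketch of the four rows from the excerpt. It is only an illustration: in the workflow described here the checklist is text that the AI writes and reviews, and the names `Verdict`, `ChecklistRow`, `rowsWith`, and `needsPatch`, as well as the rule that only mismatches force a patch, are assumptions rather than part of the actual skill.

```dart
// Hypothetical data model for the spec-level checklist. The real checklist
// is AI-written text; this only mirrors its columns.
enum Verdict { match, approximation, mismatch }

class ChecklistRow {
  const ChecklistRow(this.attribute, this.figma, this.rendered, this.verdict);

  final String attribute;
  final String figma;
  final String rendered;
  final Verdict verdict;
}

// Rows copied from the excerpt above.
const sampleRows = [
  ChecklistRow('Label typography', 'Pretendard Medium 16, #1A1A1A',
      'LabelLarge(16/500), onSurface', Verdict.match),
  ChecklistRow('Focused border', '2px primary (#597D2E)',
      '2px AppColors.primary', Verdict.match),
  ChecklistRow('Field radius', '8px', '8px (fieldRadius)', Verdict.match),
  ChecklistRow('Disabled fill', '#F5F5F5 (neutral10)',
      'AppColors.surfaceContainerLow', Verdict.approximation),
];

// Returns the rows that carry the given verdict.
List<ChecklistRow> rowsWith(List<ChecklistRow> rows, Verdict verdict) =>
    rows.where((row) => row.verdict == verdict).toList();

// Assumed gate rule: any mismatch requires a patch; approximations are
// surfaced to the human reviewer instead of blocking.
bool needsPatch(List<ChecklistRow> rows) =>
    rows.any((row) => row.verdict == Verdict.mismatch);

void main() {
  for (final verdict in Verdict.values) {
    print('${verdict.name}: ${rowsWith(sampleRows, verdict).length}');
  }
  print('needs patch: ${needsPatch(sampleRows)}');
}
```

Keeping approximations separate from mismatches reflects the qualitative stance taken earlier: a token-semantic mapping such as the disabled fill is something to flag, not something to chase to pixel parity.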

② Intent check — golden variants vs design intent

Each golden variant per state (default, focused, disabled, no_label_enabled, no_label_focused, …) is checked against design intent at the state level. Here the bar is not pixel parity but "does this state convey what the design meant?"

The AI runs both passes, patches the gaps it finds, and then hands the result to me for approval. These checklists are triggered automatically at the UI Review Gate stage of the broader workflow.
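For readers who have not written golden tests before, the per-state variants mentioned in the intent check can be generated with Flutter's standard `matchesGoldenFile` matcher. The sketch below is a minimal example, not the project's actual test: it uses a plain `TextField` as a stand-in for the real component, and the `FieldVariant` class, the `goldens/` file names, and the configuration behind each state name are assumptions.

```dart
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

// One golden variant: a name plus the state it renders (hypothetical class).
class FieldVariant {
  const FieldVariant(this.name,
      {this.label, this.enabled = true, this.focused = false});

  final String name;
  final String? label;
  final bool enabled;
  final bool focused;
}

// Names follow the states listed above; the configurations are assumptions.
const variants = [
  FieldVariant('default', label: 'Email'),
  FieldVariant('focused', label: 'Email', focused: true),
  FieldVariant('disabled', label: 'Email', enabled: false),
  FieldVariant('no_label_enabled'),
  FieldVariant('no_label_focused', focused: true),
];

void main() {
  for (final v in variants) {
    testWidgets('field golden: ${v.name}', (tester) async {
      await tester.pumpWidget(
        MaterialApp(
          debugShowCheckedModeBanner: false,
          home: Scaffold(
            body: Center(
              // The captured image is limited to this RepaintBoundary.
              child: RepaintBoundary(
                key: const Key('golden-target'),
                child: Padding(
                  padding: const EdgeInsets.all(8),
                  child: SizedBox(
                    width: 320,
                    child: TextField(
                      enabled: v.enabled,
                      decoration: InputDecoration(
                        labelText: v.label,
                        border: const OutlineInputBorder(),
                      ),
                    ),
                  ),
                ),
              ),
            ),
          ),
        ),
      );

      if (v.focused) {
        // Tapping gives the field focus; settle the label animation.
        await tester.tap(find.byType(TextField));
        await tester.pumpAndSettle();
      }

      // Compares against goldens/field_<name>.png next to this test file.
      await expectLater(
        find.byKey(const Key('golden-target')),
        matchesGoldenFile('goldens/field_${v.name}.png'),
      );
    });
  }
}
```

Running `flutter test --update-goldens` writes the baseline PNGs, and those files are the images that get compared against the Figma screenshots. One caveat worth knowing: Flutter's test environment renders text with a placeholder test font unless the project fonts are loaded in the test setup, so typography comparisons are only meaningful once the real fonts are loaded.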

The flow, end to end

1. Create a worktree
2. Pull metadata and screenshots through Figma MCP
3. Write widget tests (TDD RED)
4. Implement until widget tests pass (TDD GREEN)
5. Add golden tests
6. Compare Figma screenshots against golden screenshots, then patch
7. Human review
8. Simplify the code
9. Iterate review + patch
10. Done
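The widget-test steps in the flow above (TDD RED, then GREEN) might look like the following sketch. `LabeledField` is a hypothetical component invented for illustration; it is defined inline only so the file runs on its own. In the RED step it would not exist yet, or would be an empty stub, so the tests fail first, and the GREEN step fills it in until they pass.

```dart
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

// Hypothetical component under test. During TDD RED this would be missing
// or stubbed; this is the minimal GREEN implementation.
class LabeledField extends StatelessWidget {
  const LabeledField({super.key, required this.label, this.enabled = true});

  final String label;
  final bool enabled;

  @override
  Widget build(BuildContext context) {
    return TextField(
      enabled: enabled,
      decoration: InputDecoration(labelText: label),
    );
  }
}

void main() {
  // Pumps the field inside the Material scaffolding that TextField needs.
  Future<void> pumpField(WidgetTester tester, {bool enabled = true}) {
    return tester.pumpWidget(
      MaterialApp(
        home: Scaffold(
          body: LabeledField(label: 'Email', enabled: enabled),
        ),
      ),
    );
  }

  testWidgets('shows its label', (tester) async {
    await pumpField(tester);
    expect(find.text('Email'), findsOneWidget);
  });

  testWidgets('accepts input when enabled', (tester) async {
    await pumpField(tester);
    await tester.enterText(find.byType(TextField), 'a@b.c');
    expect(find.text('a@b.c'), findsOneWidget);
  });

  testWidgets('passes the disabled state through', (tester) async {
    await pumpField(tester, enabled: false);
    final field = tester.widget<TextField>(find.byType(TextField));
    expect(field.enabled, isFalse);
  });
}
```

These tests pin down behavior only. Visual fidelity is left to the golden tests and the Figma comparison in the later steps, which keeps the two kinds of "done" criteria separate.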

Wrap-up

The sample size is small, but in practice this has been working well for me. What I noticed most was that the wait time inside the correction loop shrank and I had to step in less often.

A few common-sense rules that pair well with it:

- Break tasks into smaller chunks than you normally would
- Build component-first, then compose
- Lean on widget_book, a component-catalog tool analogous to Storybook in the web ecosystem

If you regularly build UI from design files, this approach is worth trying on your next task. The full setup is open-sourced here: flutter-golden-cycle.

References

Flutter & Figma MCP (live session) — https://www.youtube.com/live/d7qrvytOxSA
Figma implement skill — figma-implement-design

This post was originally written in Korean by a human. The English translation was produced by an AI model.

DEV Community: Hyohyeok Jeong