TL;DR
After wiring a golden tests + Figma screenshot diff loop into my UI workflow, my average number of manual correction rounds dropped from 4–6 to 0–2. Here is how I got there.
The problem I wanted to solve
The dev-cycle skill (Korean), which had been working great for backend and infra work, did not carry over well to the frontend. With tests, lints, and CLAUDE.md in place, I could reliably get well-formed code that followed project patterns and passed behavior checks — but the AI was not actually building the UI to match what Figma MCP was handing it.
On top of that, I was running tasks in parallel across git worktrees. Launching each worktree as a running app, eyeballing it, and iterating by hand had become the obvious bottleneck.
At some point I caught myself wondering: should I just hand the AI business logic and API integration, and keep UI manual? But I wanted to preserve the productivity gains I had come to rely on, so I sat down and asked what concretely had to change.
Boiling it down, I had two problems:
- I want the implementation to match the design file as closely as possible.
- I want an easy way to verify and iterate on the implemented design.
Why it was failing: Figma MCP mapping limits + fuzzy "done" criteria
For "1. match the design file closely," two root causes stood out:
- Figma MCP exports design context as React + Tailwind CSS. Mapping that into Flutter/Dart + Material is inherently lossy, and it is not something I can fix at the user level. Better to compensate elsewhere.
- For well-formed code I have a concrete spec — tests and lints. Pass them and you are done. For UI, there is no equivalent. The AI has no objective "when is this done?" signal. I suspected this was fixable, though no obvious approach presented itself at first.
For "2. easy verification and iteration," I figured I could automate my way out:
- Inspect simple components with the `widget_book` package
- Use golden tests to capture composed, logic-bound UI as images
- Verify actual behavior before CI review
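To make the golden-test idea concrete, here is a minimal sketch of what such a test looks like in Flutter. The widget under test and the golden file path are illustrative placeholders, not the actual components from this project:

```dart
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

void main() {
  testWidgets('golden: primary button matches reference image', (tester) async {
    // Pump the component under test in isolation.
    await tester.pumpWidget(
      const MaterialApp(
        home: Scaffold(
          body: Center(
            child: ElevatedButton(onPressed: null, child: Text('Submit')),
          ),
        ),
      ),
    );

    // Compare the rendered widget against the stored golden image.
    // Run `flutter test --update-goldens` once to (re)generate it.
    await expectLater(
      find.byType(ElevatedButton),
      matchesGoldenFile('goldens/primary_button.png'),
    );
  });
}
```

The generated PNG is what gets fed back to the AI alongside the Figma screenshot for the diff pass.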
The fix: wrap the UI in a verification loop
The core idea is simple: give the AI a concrete, TDD-style spec for the UI too.
Once I decided to lean on golden tests, the next question was how to define "passing." I fed the AI both the Figma screenshot (via Figma MCP) and the golden screenshot of the current implementation, and had it build its own diff checklist, review the gap, and patch it.
My first instinct was a pixel diff for a quantitative score. But rendering differences between Figma (web) and Flutter (app) are unavoidable, and chasing them can easily wreck the code — the AI itself pointed that out — so I kept the evaluation qualitative.
Results
Original design
Baseline approach (plan → build). Colors, fonts, and spacing drifted from the design. Components were not reused. Initial output was fast, but I averaged 4–6 correction rounds.
| | Baseline | Golden test + Figma MCP screenshot diff |
|---|---|---|
| Initial plan/build tokens | ~160k tokens + (revision cost × 5) | ~250k tokens + (revision cost × 1) |
| Avg. manual correction rounds | 4–6 | 0–2 |
| Visual fidelity (color/font/spacing) | Low | High |
| Component reuse | Low | High* |
* Component reuse rules already live in `CLAUDE.md`, but the extra review pass in the new skill seems to enforce them a second time, and the effect compounds.
How the comparison actually works
The reason this produces a measurable delta is that the AI is now handed a concrete, spec-style rubric for the UI. When a UI task completes, it builds two checklists on its own.
① Spec-level diff — Figma vs rendered golden
Each component attribute (typography, spacing, borders, colors, shadow, etc.) is compared one item at a time between the Figma source and the rendered golden, and each is marked as ✅ match / ⚠️ approximation / ❌ mismatch. A simplified excerpt looks like this (the real checklist has 18+ rows):
| Attribute | Figma | Rendered (golden) | Verdict |
|---|---|---|---|
| Label typography | Pretendard Medium 16, #1A1A1A | `LabelLarge` (16/500), `onSurface` | ✅ matches |
| Focused border | 2px primary (#597D2E) | 2px `AppColors.primary` | ✅ exact match |
| Field radius | 8px | 8px (`fieldRadius`) | ✅ exact match |
| Disabled fill | #F5F5F5 (neutral10) | `AppColors.surfaceContainerLow` | ⚠️ approximation (token-semantic mapping) |
② Intent check — golden variants vs design intent
Each golden variant per state (default, focused, disabled, no_label_enabled, no_label_focused, …) is checked against design intent at the state level. Here the bar is not pixel parity but "does this state convey what the design meant?"
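The per-state variants can be produced mechanically by looping over the design states and emitting one golden file each. A minimal sketch, assuming a plain `TextField` and state names mirroring the Figma variants (both are illustrative, not the post's actual component):

```dart
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

void main() {
  // One golden file per design state, mirroring the Figma variant names.
  final states = <String, Widget>{
    'default': const TextField(decoration: InputDecoration(labelText: 'Name')),
    'disabled': const TextField(
      enabled: false,
      decoration: InputDecoration(labelText: 'Name'),
    ),
    'no_label_enabled': const TextField(decoration: InputDecoration()),
  };

  for (final entry in states.entries) {
    testWidgets('golden: text field (${entry.key})', (tester) async {
      await tester.pumpWidget(
        MaterialApp(home: Scaffold(body: Center(child: entry.value))),
      );
      // Each state gets its own reference image to check against intent.
      await expectLater(
        find.byType(TextField),
        matchesGoldenFile('goldens/text_field_${entry.key}.png'),
      );
    });
  }
}
```

Each resulting image is what the intent check reviews at the state level.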
The AI runs both passes, patches the gaps it finds, and then hands the result to me for approval. These checklists are triggered automatically at the UI Review Gate stage of the broader workflow.
The flow, end to end
- Create a worktree
- Pull metadata and screenshots through Figma MCP
- Write widget tests (TDD RED)
- Implement until widget tests pass (TDD GREEN)
- Add golden tests
- Compare Figma screenshots against golden screenshots, then patch
- Human review
- Simplify the code
- Iterate review + patch
- Done
Wrap-up
The sample size is small, but in practice this has been working well for me. What struck me most was that the wait time inside the correction loop shrank, and the number of times I had to step in dropped.
A few common-sense rules that pair well with it:
- Break tasks into smaller chunks than you normally would
- Build component-first, then compose
- Lean on `widget_book`, a component-catalog tool analogous to Storybook in the web ecosystem
If you regularly build UI from design files, this approach is worth trying on your next task. The full setup is open-sourced here: flutter-golden-cycle.
References
- Flutter & Figma MCP (live session) — https://www.youtube.com/live/d7qrvytOxSA
- Figma implement skill — figma-implement-design
This post was originally written in Korean by a human. The English translation was produced by an AI model.