How I built a visual regression testing tool for i18n — and what AI got wrong along the way
I didn't start this hackathon with VisualLingo.
I started it with three AI models, a list of previous winners, and a problem I couldn't stop thinking about: what do developers actually struggle with when they add i18n to their apps?
Not the translation part. Everyone solves that. The part after.
## How I Even Landed on This Idea
When the Lingo.dev Hackathon #2 opened, I did something a little unusual. Instead of just brainstorming on my own, I gave the hackathon overview to Claude, ChatGPT, and Kimi K2.5 separately, asked each of them to generate ideas, and told them to use Twitter search to find what real developers were complaining about with i18n.
Then I gave each AI the previous hackathon winners and asked them to refine their ideas based on what had already been done — so I wasn't just rebuilding something that already won.
Then I cross-fed their ideas to each other. GPT's ideas went to Claude. Claude's went to Kimi. Kimi's went to GPT. Each one was asked to arrange the combined pool into a priority list based on originality, technical depth, and real-world usefulness.
It sounds like overkill. But I wasn't willing to commit 2-3 days to an idea I wasn't sure about. That triangulation process gave me genuine confidence that the gap I was seeing — nobody is testing what translations do to layouts — was real and worth building for.
The idea that kept rising to the top: a tool that catches visual regressions caused by translation. Not text bugs. Layout bugs. The kind that only show up when German text is 35% longer than English, or when Arabic flips your entire UI right-to-left.
That became VisualLingo.
## The Problem (And Why It's Sneaky)
Most developers think i18n is done when the translations are done.
Ship the Japanese strings. Wire up the Arabic locale. Run `lingo localize`. Done, right?
Here's what actually happens: German text is ~35% longer than English. Japanese can render wider or more compact depending on the characters. Arabic flips your entire layout right-to-left. None of these changes show up in your unit tests. None of them get caught by your translation pipeline. They show up when a real user in Berlin opens your app and sees a button that says "Anmeldung" getting cut off halfway.
They show up in production. After shipping. After the demo.
And the "just test it manually" answer doesn't scale. Make a change to your pricing card. Now check it in English, German, Japanese, and Arabic. Four browser tabs, four page loads, four visual inspections — every single time, for every page, for every change. Nobody does this consistently. So the bugs ship.
## The Plan (And Why Planning Actually Mattered)
Once I had the idea, I went back to Claude and asked it to break the implementation into 5-6 concrete steps rather than one big prompt.
This was intentional. I've learned that AI works significantly better when it's not overloaded. Give it one giant "build me this entire tool" prompt and it hallucinates, cuts corners, makes assumptions. Break it into focused steps and it stays precise. So the plan looked roughly like:
- Set up Next.js app with locale routing
- Integrate Lingo.dev CLI for automated translation
- Build Playwright screenshot runner (one per locale per page)
- Implement pixelmatch comparison engine
- Build the results dashboard
- Wire everything together with a single `npm run diff` command
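In practice, that wiring can be little more than a few npm scripts. A sketch, where the `scripts/diff.mjs` entry point and script names are illustrative, not the actual repo layout:

```jsonc
// package.json (sketch; script targets are illustrative)
{
  "scripts": {
    "dev": "next dev -p 3002",
    "translate": "npx lingo localize",
    "diff": "node scripts/diff.mjs"
  }
}
```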
I used Antigravity to implement it — working in a proper IDE keeps your project structured in a way that copy-pasting into a chat window never does. Files stay organised, context stays intact, the AI doesn't lose track of what already exists.
The plan was solid. The execution, though — that's where things got honest.
## What AI Got Wrong (And What I Almost Shipped)
Here's the part nobody writes about.
After following the plan step by step, AI told me I was good to go. Ready to submit. And I almost did.
Instead, I went through everything myself — not trusting the AI's self-assessment, but actually checking: is this thing in my project? Is that actually integrated?
That's when I found it.
Lingo.dev — the entire point of the hackathon — wasn't properly integrated. The tool was doing visual diffs, yes. But it wasn't actually using the Lingo.dev CLI for translations. It was kind of working around it. For a Lingo.dev hackathon, that's not a minor oversight. That would have been a disqualifying gap.
I was frustrated. Genuinely. After the whole plan, after the step-by-step process specifically designed to avoid this kind of mistake, the AI had still missed something fundamental. And fixing it meant going back through steps I thought were done, re-prompting, re-integrating, burning through context limits I'd been trying to preserve.
The process got annoying. But I fixed it.
Then AI said I was good to go again.
I tested it myself again anyway.
And it was still breaking. The dashboard was showing errors, images weren't rendering correctly. After debugging, I found the root cause: I'd been using pseudo-languages for testing — placeholder locales that Lingo.dev doesn't actually support. Switching to real supported locales (Japanese, German, Arabic) fixed everything.
Two "you're ready to ship" moments from AI. Two bugs that would have killed the submission.
## What I Actually Built
VisualLingo is a visual regression testing tool for i18n. The core idea: if translation can break your layout, then translating is a visual change — and visual changes need visual tests.
The workflow is simple:
1. Run translations (Lingo.dev CLI → en, ja, de, ar)
2. Screenshot every page in every locale (Playwright)
3. Compare screenshots pixel-by-pixel vs saved baselines (pixelmatch)
4. Report: PASS if diff < threshold, FAIL with highlighted diff if not
5. View results on a dashboard with side-by-side visual diffs
One command — `npm run diff` — tells you exactly which locale broke, which page, and where on the screen.
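Under the hood, the runner just expands that matrix of locales and pages into concrete screenshot jobs before handing them to Playwright. A minimal sketch of that expansion (the helper name `buildJobs` and the exact path scheme are my assumptions):

```javascript
// Expand locales × pages into concrete screenshot jobs.
// Paths follow the "<locale>-<page>.png" naming used by the baselines.
function buildJobs(baseUrl, locales, pages) {
  const jobs = [];
  for (const locale of locales) {
    for (const page of pages) {
      jobs.push({
        locale,
        page,
        url: `${baseUrl}/${locale}/${page}`,
        baseline: `.tmp/baselines/${locale}-${page}.png`,
        screenshot: `.tmp/current/${locale}-${page}.png`,
      });
    }
  }
  return jobs;
}

const jobs = buildJobs("http://localhost:3002", ["en", "ja", "de", "ar"], ["demo"]);
console.log(jobs.length); // 4 locales × 1 page = 4 jobs
```

Each job then becomes one `page.goto(url)` plus one full-page screenshot in Playwright.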
The config is intentionally minimal:
```jsonc
// visuallingo.config.json
{
  "baseUrl": "http://localhost:3002",
  "locales": ["en", "ja", "de", "ar"]
}
```

```jsonc
// i18n.json (Lingo.dev config)
{
  "targets": ["ja", "de", "ar"]
}
```
## The Pixel-Diff Engine
This is the part I'm most proud of technically.
On first run, VisualLingo boots in baseline mode — it screenshots every page in every locale and saves them as the "known good" state:
```
.tmp/
  baselines/
    en-demo.png
    ja-demo.png
    de-demo.png
    ar-demo.png
```
Every subsequent run screenshots the same pages, then runs pixelmatch on each pair:
```js
import fs from "fs";
import { PNG } from "pngjs";
import pixelmatch from "pixelmatch";

const baseline = PNG.sync.read(fs.readFileSync(baselinePath));
const current = PNG.sync.read(fs.readFileSync(screenshotPath));

const { width, height } = baseline;
const diff = new PNG({ width, height });

const mismatchedPixels = pixelmatch(
  baseline.data,
  current.data,
  diff.data,
  width,
  height,
  { threshold: 0.1 }
);

const diffPercent = (mismatchedPixels / (width * height)) * 100;
const passed = diffPercent < 0.5; // <0.5% pixel diff = PASS
```
The diff image — a visual map of exactly which pixels changed, highlighted in red — goes straight to the dashboard.
Things I had to figure out the hard way:
- **Image dimensions must match.** If content reflows between runs, Playwright might screenshot at a slightly different height, and `pixelmatch` throws on dimension mismatches. Fix: delete baselines and regenerate after structural changes.
- **Bootstrap order matters.** Generate baselines on a broken state and everything will "pass" forever — because you're comparing broken-to-broken. Baselines must be created on a known good state.
- **The dev server must restart after code changes.** Obvious in hindsight. Cost me 20 minutes of confusion.
- **Port matters.** The app ran on 3002 (not 3000/3001, which were occupied). Every hardcoded URL in the tool had to point there. A small thing that caused a surprising amount of friction.
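Since `pixelmatch` throws on any size mismatch, it's worth failing with a readable message before calling it. A small guard along these lines (a sketch; the function name and error wording are mine):

```javascript
// Fail fast with an actionable message instead of letting pixelmatch throw.
function assertSameDimensions(baseline, current, name) {
  if (baseline.width !== current.width || baseline.height !== current.height) {
    throw new Error(
      `${name}: size changed from ${baseline.width}x${baseline.height} ` +
        `to ${current.width}x${current.height}. ` +
        `Delete .tmp/baselines and re-run to regenerate.`
    );
  }
}

assertSameDimensions({ width: 1280, height: 720 }, { width: 1280, height: 720 }, "en-demo");
console.log("dimensions match");
```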
## Seeing It Work
The demo app has intentional layout bugs baked in — specific to certain locales — so you can watch the detection happen in real time.
```bash
# Step 1: Create baselines (known good state)
npm run diff
# Output: 4 locales bootstrapped. Baselines saved.

# Step 2: Break a layout
# (e.g. give the Arabic pricing card a fixed width that clips the text)

# Step 3: Detect the regression
npm run diff
# Output: 3 passed, 1 failed
#   ar-demo: 8.3% pixel diff — FAIL
```
The dashboard at `localhost:3002/dashboard` shows:
- ✅ English — PASS
- ✅ Japanese — PASS
- ✅ German — PASS
- ❌ Arabic — FAIL (with the red diff overlay showing exactly what changed)
No manually written test assertions. No checking four browser tabs. One command, one dashboard.
## What Lingo.dev Made Possible
The Lingo.dev integration isn't just a feature — it's what makes the whole tool realistic.
The hardest part of visual regression testing for i18n isn't the pixel comparison. It's having up-to-date, real translations to test against. If you're manually maintaining JSON files, your test coverage is only as good as the last time someone updated them. Stale translations mean you're testing a state your users will never actually see.
With the Lingo.dev CLI, translations are always fresh:

```bash
npx lingo localize
# Translates en.json → ja.json, de.json, ar.json automatically
```
The full loop becomes:
Code change → `npx lingo localize` → `npm run diff` → dashboard
That's a CI-ready pipeline. Every PR could run this, catch layout regressions in any locale, and block merge on failures — before a single user sees a broken button.
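A sketch of what that GitHub Actions workflow might look like. This is untested; the secret name `LINGODOTDEV_API_KEY` and the `wait-on` step are assumptions about the setup, not the project's actual pipeline:

```yaml
name: visual-i18n-diff
on: [pull_request]

jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npx playwright install --with-deps chromium
      # Refresh translations so the diff runs against current strings
      - run: npx lingo localize
        env:
          LINGODOTDEV_API_KEY: ${{ secrets.LINGODOTDEV_API_KEY }}
      # Start the app, wait for it, then run the visual diff
      - run: npm run dev & npx wait-on http://localhost:3002
      - run: npm run diff
```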
## The Lesson I'm Taking From This
Everyone says "ship early." Move fast. Get it out.
I think that advice is incomplete — especially when you're working heavily with AI.
AI will tell you you're done before you're done. It doesn't know what it doesn't know. It can't test your project. It can't feel the frustration of a dashboard that doesn't render. It confidently said "you're ready" twice, and twice it was wrong.
What saved me both times was slowing down and checking myself. Not trusting the AI's self-assessment. Not shipping the moment it said go.
If I were doing this again, I'd say: plan for the full time you have. Not 1 day, not 2. Use the whole window. One day to plan and build. One day to actually use the thing and find what's broken. One day to fix, polish, and prepare your presentation. That's a different product than what you ship at the end of day one — and it's one you can actually be proud of.
## What's Next for VisualLingo
The tool right now is a local dev workflow. Here's where it goes:
- **CI integration** — a GitHub Action that runs on every PR, comments with a locale diff report, and blocks merge on regressions.
- **Per-locale thresholds** — Arabic RTL layouts differ more from English baselines than German does. Configurable pass/fail thresholds per locale would reduce false positives.
- **Component-level diffing** — diff just the navbar or the pricing card, not the full page. More surgical, less noise.
- **Historical tracking** — store diff percentages over time so you can see which locales are trending toward breakage before they actually break.
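The per-locale thresholds could be as small as an override map with a global fallback. A hypothetical sketch (this config shape doesn't exist yet; it's what the feature might look like):

```javascript
// Hypothetical future config: per-locale pass/fail thresholds in percent,
// falling back to a global default when a locale has no override.
function thresholdFor(locale, config) {
  return config.thresholds?.[locale] ?? config.defaultThreshold;
}

const config = { defaultThreshold: 0.5, thresholds: { ar: 1.5 } };
console.log(thresholdFor("ar", config)); // 1.5 (RTL layouts get more slack)
console.log(thresholdFor("de", config)); // 0.5 (global default)
```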
## Final Thought
Adding i18n to your app is the right call. More users, more markets, more reach. But "it works in English" and "it works in all your locales" are genuinely different things — and the gap between them is full of clipped buttons and overflowing text that nobody caught because nobody had the time to test four languages manually after every change.
VisualLingo closes that gap. Pixel-perfect diffs, automated across every locale, in a workflow any developer can run in under a minute.
Built in 2-3 days. Debugged honestly. Catches bugs before your users do.
*VisualLingo was built for the Lingo.dev Hackathon #2 (Feb 2026). GitHub: github.com/VasuBansal7576/VisualLingo*