In agencies, testing exists but it's mostly manual. Someone opens the site before deploy, clicks around, checks the forms, makes sure nothing looks broken. Sometimes there are unit tests with Vitest or Jest. Sometimes there aren't, depending on the project and the deadline. When the deadline is in three days, automated testing is the first thing to go.
And inside that manual process, there's one type of testing that's almost impossible to do well by hand: visual.
I've been researching visual regression testing for a while. I built it in research projects, tried it in side experiments, read a lot. What strikes me most isn't the tool itself. It's how rarely it's used in agency work, given the problem it solves and how easy it's become to set up.
I'm not here to tell you my team has it running in production and we're rockstars. I'm here to share what I learned researching it, hoping to save you the first few hours.
## Why visual testing is the hardest
Think about the types of testing you know.
Unit tests with Vitest. They check that a function returns what it should. Fast, deterministic, easy to write.
Integration tests. Slower, more fragile, still manageable.
End-to-end. Slow, fragile, environment-heavy. Less common.
Visual. Checks that the site looks the way it should. And here everything breaks. What does "looks good" even mean? How do you express that as an assertion? `expect(button).toBeVisible()` tells you nothing about padding, color, border-radius, or how a component looks next to three others.
Visual testing breaks the mental model of "logical assertion over code". What you want to assert is perceptual, not logical. That's why most stacks skip it.
## What VRT actually does
It solves the problem indirectly. Instead of describing how something "looks good", it compares screenshots. You take a baseline, take another after a change, compare pixel by pixel. If the diff goes over a threshold, the test fails.
Not elegant in theory. Pragmatic and it works.
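In Playwright terms, the entire mechanism is a single assertion. Stripped to its essence, a spec looks like this (the full setup comes later in the article):

```typescript
import { test, expect } from '@playwright/test'

test('visual: home', async ({ page }) => {
  await page.goto('/')
  // First run: saves a baseline PNG next to the spec.
  // Every later run: re-captures and pixel-compares against that baseline,
  // failing once the diff crosses the configured threshold.
  await expect(page).toHaveScreenshot('home.png')
})
```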
This catches the bugs every agency knows. You change a Tailwind variable, touch a design system component, update a UI dependency. Unit tests pass. Lighthouse passes. Site goes live. Days later a client reports the secondary button has a thicker border, or the footer lost its padding.
## Why it's barely used
Four reasons, none technical.
First, agency testing is mostly manual. The human eye is bad at catching subtle visual changes. Nobody notices a padding going from 16 to 14 pixels during a quick walkthrough on a Friday afternoon. And since the manual process is already in place and "works", nobody questions what it systematically misses.
Second, the first attempt always fails. Someone sets it up on a Friday, by Monday the pipeline is red with 20 false positives, the team disables it, it dies in the repo.
Third, the team learning curve. Not because it's complex, but because every dev needs to understand baselines, when to update them, how to review diffs in a PR. If only one person gets it, it doesn't scale.
Fourth, and this one bothers me the most. Most content on the topic is written by SaaS vendors (Percy, Chromatic, Applitools) selling their product. They tell you it's complex, you need their tool, self-hosted doesn't scale. It's not true. Playwright ships everything you need.
## The stack that makes sense
- Playwright for running tests and comparing screenshots
- GitHub Actions for CI
- Screenshots versioned in your repo
- A masking convention for dynamic content
No external services. No licenses. All inside your repo.
```typescript
// playwright.config.ts
import { defineConfig, devices } from '@playwright/test'

export default defineConfig({
  testDir: './tests/visual',
  expect: {
    toHaveScreenshot: {
      // Fail when more than 1% of pixels differ between baseline and capture
      maxDiffPixelRatio: 0.01,
      // Per-pixel color tolerance: 0 is an exact match, 1 accepts anything
      threshold: 0.2,
    },
  },
  projects: [
    { name: 'mobile', use: devices['iPhone 13'] },
    { name: 'desktop', use: { viewport: { width: 1440, height: 900 } } },
  ],
})
```
A base test:
```typescript
// tests/visual/pages.spec.ts
import { test, expect } from '@playwright/test'

const pages = ['/', '/about', '/services', '/contact']

for (const path of pages) {
  test(`visual: ${path}`, async ({ page }) => {
    await page.goto(path)
    // Wait for the network to go quiet, then for web fonts to finish
    // loading, so we never screenshot the fallback font mid-swap.
    await page.waitForLoadState('networkidle')
    await page.evaluate(() => document.fonts.ready)
    // '/' -> 'home', '/about' -> 'about', '/services/web' -> 'services-web'
    const name = path === '/' ? 'home' : path.slice(1).replace(/\//g, '-')
    await expect(page).toHaveScreenshot(`${name}.png`, {
      fullPage: true,
      mask: [page.locator('[data-vrt-mask]')],
      maskColor: '#808080',
    })
  })
}
```
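A note on workflow: the first run of `npx playwright test` writes the baseline PNGs into a `pages.spec.ts-snapshots/` folder next to the spec, one per project and platform. After an intentional visual change, `npx playwright test --update-snapshots` regenerates them, and that regenerated set is what goes through the review flow described further down.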
## Setup is faster than ever
Two years ago, building this stack from scratch was two or three days of focused work. Today, with current codegen tools, it takes a couple of hours if you know what to ask for. The barrier dropped. The knowledge of what to ask for is still critical. If you don't know you need document.fonts.ready before the screenshot, no tool will suggest it without context.
Here's the prompt I use to scaffold it on new projects:
```
I need visual regression testing in this project.
Stack: [Nuxt 3 / Next.js / WordPress / whatever]
CI: GitHub Actions

Implement:
1. Install Playwright as a dev dependency. Configure Chromium only.
2. Create `playwright.config.ts` with two projects: mobile (iPhone 13)
   and desktop (1440x900). Threshold 0.2, maxDiffPixelRatio 0.01.
3. Create `tests/visual/pages.spec.ts` iterating over the main public
   routes. For each route:
   - Wait for networkidle
   - Wait for document.fonts.ready
   - Scroll to bottom to force lazy loading
   - Disable animations via injected CSS
   - Take a fullPage screenshot, masking [data-vrt-mask] elements
     with maskColor #808080
4. Create `.github/workflows/visual-regression.yml` that:
   - Runs on every PR
   - Shards by project (mobile, desktop)
   - Caches node_modules and Playwright browsers
   - Uploads playwright-report as an artifact on failure
   - Does NOT auto-update baselines
5. Add scripts to package.json:
   - `test:visual` to run tests
   - `test:visual:update` to update baselines locally
6. Document in `tests/visual/README.md`:
   - How to run tests locally
   - When and how to approve new baselines
   - The [data-vrt-mask] convention

Do not install Percy, Chromatic, or external services.
Self-hosted with Playwright only.
```
Drop it into your AI tooling at the repo root and the scaffold comes out in one pass. The prompt isn't the point. Understanding why each decision is there is.
## Masking dynamic content: where it all breaks
This is by far the hardest part of the whole setup. Everything else you can build in an afternoon. Masking can take weeks of fine-tuning if your site has any real complexity.
An honest list of the enemies:
- **Dates and timestamps**. "3 hours ago", "Published May 5, 2026". They change every run.
- **Sliders and carousels**. They start in different positions, change height when slide content has different size, often have autoplay you can't easily disable.
- **Animations and transitions**. A 400ms fade-in produces a different screenshot at ms 200 vs ms 600. Multiply by every component using `motion-safe` or GSAP.
- **WYSIWYG content**. The client edits CMS text and breaks your test. But you can't mask the whole block or you're testing nothing.
- **Lazy-loaded images**. Depending on browser request order, the page has different heights mid-render. Full-page screenshots come out different each time.
- **External iframes**. YouTube, Vimeo, maps, Calendly, chat widgets. Different load times, rotating content.
- **Cookie banners, A/B tests, geolocation**. The site the CI bot sees isn't the site you see, and sometimes it's not even the same between two runs (see the sketch after this list).
- **Skeletons and loading states**. Capture early, you see the skeleton. Capture late, you see the data. Both valid, both break the test.
- **Async fonts (FOUT/FOIT)**. Screenshot before fonts load, you get Arial. After, you get Inter. Catastrophic diff.
- **Real-time counters**. Any "1,234 users online" widget guarantees a false positive every run.
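For cookie banners specifically, the cleanest fix is often to never show one to the test browser. A minimal sketch, assuming the banner keys off a localStorage flag (the key and value here are hypothetical; A/B tests and geolocation need equivalent pinning on the app side):

```typescript
import { test } from '@playwright/test'

test.beforeEach(async ({ page }) => {
  // Runs before any page script: pre-set the (assumed) consent flag so the
  // banner never renders for the CI bot.
  await page.addInitScript(() => {
    localStorage.setItem('cookie-consent', 'accepted')
  })
})
```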
False positives are worse than no tests. They train the team to ignore warnings. Two months in, nobody reads the reports and the setup dies.
The rule that worked for me: anything with genuinely dynamic content gets a `data-vrt-mask` attribute. Not tooling magic, team discipline:
```vue
<template>
<article class="post">
<h2>{{ post.title }}</h2>
<time data-vrt-mask>{{ post.publishedAt }}</time>
<YouTubeEmbed data-vrt-mask :id="post.videoId" />
<HeroSlider data-vrt-mask />
<p>{{ post.excerpt }}</p>
</article>
</template>
```
Playwright paints that region a neutral gray (`#808080`) before comparing. The rest gets compared pixel by pixel.
Use neutral gray, not Playwright's default magenta. When a dev opens a report with 15 diffs and half are screaming magenta blocks, their brain enters alarm mode and approves everything in a hurry. Neutral gray makes real diffs pop and masked regions disappear visually.
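If you'd rather enforce that convention than remember it, one option is a small shared factory for the screenshot options; the helper name and file path here are illustrative:

```typescript
// tests/visual/helpers.ts (hypothetical): one place to define how every
// spec masks dynamic regions, so the convention can't drift between files.
import type { Page } from '@playwright/test'

export function vrtScreenshotOptions(page: Page) {
  return {
    fullPage: true,
    mask: [page.locator('[data-vrt-mask]')],
    maskColor: '#808080', // neutral gray, not the default magenta
  }
}
```

Each assertion then becomes `await expect(page).toHaveScreenshot(name, vrtScreenshotOptions(page))`.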
For animations, the highest-impact trick is killing them globally during the test:
```typescript
await page.addStyleTag({
content: `
*, *::before, *::after {
animation-duration: 0s !important;
animation-delay: 0s !important;
transition-duration: 0s !important;
transition-delay: 0s !important;
}
`,
})
```
This single injected style tag eliminates half the false positives.
For lazy images, force-scroll to the bottom before capturing. It's ugly. It works. Put it in a helper and never look at it again.
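A minimal sketch of that helper, under the assumption that lazy loading is IntersectionObserver-based (the `scrollToBottom` name is mine):

```typescript
import type { Page } from '@playwright/test'

export async function scrollToBottom(page: Page): Promise<void> {
  await page.evaluate(async () => {
    // Step through the page one viewport at a time so lazy-load observers
    // fire, pausing briefly to let image requests start.
    for (let y = 0; y <= document.body.scrollHeight; y += window.innerHeight) {
      window.scrollTo(0, y)
      await new Promise((resolve) => setTimeout(resolve, 100))
    }
    // Return to the top so the fullPage capture starts from a clean state.
    window.scrollTo(0, 0)
  })
}
```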
## Baseline approval
When a diff appears, someone decides if it's a bug or an intentional change. The common mistake is letting each dev approve their own baselines locally with `--update-snapshots` and committing the result. In three months you have approved baselines hiding real bugs. Someone shipped a broken margin, approved it, and now it's the official state. The test passes: green and lying.
What works:
1. Baselines update via PR. Never direct commit to main.
2. The PR shows the visual diffs as an artifact.
3. At least one reviewer who isn't the author.
```yaml
- uses: actions/upload-artifact@v4
if: always()
with:
name: playwright-report
path: playwright-report/
retention-days: 14
```
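When a run fails, the reviewer downloads that artifact and opens the HTML report: for each failed assertion it shows the expected baseline, the actual capture, and the diff side by side, which is all you need to make the bug-or-intentional call.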
## When it's not worth it
If your site barely changes (one client, monthly deploys), the overhead isn't worth it. Baseline maintenance is real work.
If you have a shared design system across multiple clients, deploy several times a week, and every visual bug means an awkward conversation with a client lead, this is the best thing you'll add this year.
VRT doesn't catch logic bugs, accessibility bugs, or performance issues. It doesn't replace unit tests if you have them. It doesn't replace manual QA on critical flows like checkout or login. It's one more layer, the one covering the gap the others leave and the human eye misses worst.
## Closing
Visual regression testing is one of those tools that, when you look at it closely, you wonder why everyone isn't using it. The answer is a mix of inertia, vendor-driven content bias, and the fact that the first attempt always fails.
Visual testing remains the hardest. And it's still mostly done by hand, which in practice means it's not done at all. With today's tooling, the barrier to entry is gone. What's left is the decision.
I don't have the perfect recipe because I don't have it running in production yet. But after researching it deeply, I'm convinced any agency with a shared design system and frequent deploys should have this. And the self-hosted Playwright setup is fully viable, contrary to what the SaaS vendors want to sell you.
I'm starting to roll it into a project soon. If you already use it in production and learned things this article doesn't cover, write them up. We need more honest content on this.