
Michael Warren

Posted on • Originally published at michaelwarren.dev

Why is Visual Difference Testing still so hard?

This post is going to be half rant and half educational. At least that’s what I’m aiming for.

What is VDT?

The concept goes by a few names, but is essentially the same regardless of terminology. You might have heard of “Visual Regression Testing” (VRT) or “Visual Difference Testing” (VDT). It might be called something else in your sphere, but the idea is the same. Test the visual parts of your application so that when/if anything has visually changed, you can do something about it. I like to call it VDT because I think of the practice as hunting for differences between the current state of a UI and the new state. The difference might not be a regression; it might be a purposeful change or even an acknowledged cross-browser difference like focus states on buttons between Chromium and WebKit.

The most common practice—and I think maybe the actual only way to do it—is to render the part of your UI you want to test, take a screenshot of that scenario and save it. Then on your next test run, take the same screenshot of the same scenario again and compare the old screenshot against the new one somehow. There are a few ways to compare. I’ve seen methods that involve setting each screenshot to have a lot of transparency and overlaying the two on top of one another to see where differences might be. The most common method seems to be a pixel-by-pixel analysis of each screenshot and logging where there are differences. Some of the pixel tools are the old school deterministic route, and some of the newer ones are “AI-driven” because of course they are. Can’t build any kind of tool these days without sloppin’ some good ’ole AI on it somehow.
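To make the pixel-by-pixel idea concrete, here's a minimal sketch of a diff over two same-sized RGBA buffers. Real tools like pixelmatch layer anti-aliasing detection and perceptual color math on top of this; the function and variable names here are invented for illustration.

```typescript
// Minimal pixel-diff sketch: compare two same-sized RGBA buffers,
// count mismatched pixels, and paint them pink in an output image.
type Rgba = Uint8ClampedArray; // 4 bytes per pixel: r, g, b, a

function diffPixels(baseline: Rgba, current: Rgba, width: number, height: number) {
  if (baseline.length !== current.length) throw new Error("screenshots differ in size");
  const out = new Uint8ClampedArray(baseline); // start from the baseline, mark diffs on top
  let diffCount = 0;
  for (let i = 0; i < width * height; i++) {
    const o = i * 4;
    const same =
      baseline[o] === current[o] &&
      baseline[o + 1] === current[o + 1] &&
      baseline[o + 2] === current[o + 2] &&
      baseline[o + 3] === current[o + 3];
    if (!same) {
      diffCount++;
      // hot pink: the classic "this one is different!" marker
      out[o] = 255; out[o + 1] = 105; out[o + 2] = 180; out[o + 3] = 255;
    }
  }
  return { diffCount, out };
}
```

Note that a single anti-aliased box-shadow edge lights up dozens of pixels in a loop this naive, which is exactly the problem the rest of this post complains about.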

Why do we need VDT anyway?

You know why a lot of devs say they hate CSS? Because they don’t really understand what it does and don’t really know the ramifications of making stylistic changes in their UIs? Well, VDT helps devs react to those visual changes. If you test what it was before you changed it, and test what it looks like after you change it, then ostensibly you can see what you changed, decide what to do about it, and whether or not to continue with your desired approach. Like any testing, VDT is very useful in not releasing buggy UIs, except the difference is that the bugs are visual, not logical.

VDT has different use & importance depending on your product

VDT matters in different ways for different kinds of products. If you mainly work on an application, then you probably aren’t as concerned with small stylistic shifts that can happen. It doesn’t super matter that some button or form element is 4px left of where it was the last run. You are probably more interested in using VDT as a “smoke test” for a broken app, such as guarding against your best friend Claude slipping a * { opacity: 0.6; } into your global stylesheet while you were dozing off at your computer waiting for your token-burning machine to be done “thinking”. Devs focused on sites and apps will most likely test screenshots of entire pages/flows just to confirm that nothing is wildly and obviously broken, like the main “pay us money” CTA button being covered up by a rogue ad placement.

For UI component library devs like me, particularly design system devs like me, VDT is much more involved. Design systems are very interconnected libraries by nature. Design system components get used all over the place in other components, so it is very easy to make a change to a foundational component that yields wildly unexpected results across the codebase in a component you didn’t even look at. For library devs, running VDT is more of a “break the build immediately, throw a wall of red text” kind of situation rather than an “oh yeah, we should really fix that padding on the sign-up page sometime” one.

Also, we test components that have many more variants/states than the specific composition of an entire app. The humble button component has a theme, variant, rank, pill shape, with icons, loading state, and all the interactivity styles for :hover, :focus-visible, etc. To do VDT correctly, every single one of those visually distinct states should be tested so that we can have confidence we aren’t introducing unwanted problems (tech debt) into our libs.

The fact that VDT is such an integral part of UI library development is why it’s STUPID that it’s still so damn hard to wire up.

<rant>

Here are the stupid hard things about VDT

Having just spent WAY too long trying to hook up a VDT approach, even in the bursting-at-the-seams-with-AI world we live in (AI didn’t help at all, btw; it utterly failed at a lot of the final approach), here are the things that I’ve run across lately that are bugging me about VDT being so hard to get right.

Counting pixels sucks

It seems like the best we can really do is to examine the color data for each pixel in two screenshots and then mark one of them pink and say “this one is different!”. If we do that over and over again, we will know all the tiny differences in each screenshot. But do you want to know how easy it is to absolutely destroy the pixel counting approach? Just add a box-shadow to your component with a blur or spread. The dithering approach in each browser can be enough to trigger changes being registered, which might be enough to end up failing your build. Isn’t that great?! We UI devs have no control over the way a browser draws its translucent shadows, but they can for sure destroy our pipelines.

Thresholds are never good enough

Part of the initial problem with the “counting pixels for differences” approach is that we know in advance that not ALL of the pixels in the screenshot are ones we care about. We don’t really care that much about box-shadow dithering, or how the browser makes text bold. So most of the VDT tools out there allow us to set thresholds for “how many differences are ok before we get mad and break your build”. Most of the thresholds are expressed in terms of percentages—i.e., how much of the screenshot can be different before angry errors happen. But this dumb threshold approach isn’t ever good enough, because box-shadow differences can add up quickly and still lead to a bunch of false positives with no real way to actually get your CI build to pass. So after struggling for hours, devs will inevitably just raise the thresholds so they can continue along with their lives, and that introduces more chance for actual visual regressions to happen without CI ever going red.
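For reference, the failure condition behind most of those percentage knobs reduces to something like this (the names here are invented; Playwright spells it maxDiffPixelRatio, other tools differ):

```typescript
// A percentage threshold is just "differing pixels / total pixels" compared
// against an allowance. The allowance soaks up dithering noise... and real
// regressions of the same size.
function passesThreshold(
  diffCount: number,
  width: number,
  height: number,
  maxDiffRatio: number, // e.g. 0.01 means 1% of pixels may differ
): boolean {
  return diffCount / (width * height) <= maxDiffRatio;
}
```

The trap is visible in the math: a 1% allowance on a 1280×720 screenshot quietly permits over 9,000 changed pixels.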

Updating the screenshots in CI when things change on purpose is always hard

Not every visual change is a problem. We update our UIs. We fix bugs. We make breaking changes to innovate. Any time we make purposeful visual changes, the baseline images need to be updated. We have to mark somehow that some of the baseline images are ok to change, possibly alongside others that are not ok to change. How do you mark purposeful changes if you are trying to automate VDT? CI builds can’t really accept user input, and asking the user “Did you mean to change these buttons to green?” destroys the whole concept of automation anyway. Deciding when & how to purposefully update the baseline images is complicated because it’s essentially cache invalidation, which is one of the hardest problems in computing.

Image storage is a huge pain

Admittedly, this pain point might just be because I’m a design system dev, but we have a LOT of images. Our button component alone has 500 different screenshots. And that is including the main “kitchen sink” page that renders all the default states and variants together in a single screenshot. Our design system buttons have :hover, :focus-visible, and :active, so our VDT script has to test each of those individually.

If there is a way to hover over more than one element at a time in an automated test, please let me know over on Bluesky.

So our VDT script has to generate a screenshot for each of those interactive states, for each of our 4 different kinds of buttons. Our system also has 3 themes, and light & dark modes. So that’s nearing 20 images for each different stylistic variant of buttons.
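The multiplication is easy to sketch. The dimension names below are invented, but the cartesian-product shape is exactly why the counts balloon:

```typescript
// Every screenshot scenario is one cell in a cartesian product of
// the component's visual dimensions.
function variantMatrix(...dimensions: string[][]): string[][] {
  return dimensions.reduce<string[][]>(
    (combos, dim) => combos.flatMap((combo) => dim.map((value) => [...combo, value])),
    [[]],
  );
}

const combos = variantMatrix(
  ["primary", "secondary", "tertiary", "danger"],  // kinds of button (invented names)
  ["default", "hover", "focus-visible", "active"], // interactive states
  ["theme-a", "theme-b", "theme-c"],               // themes
  ["light", "dark"],                               // modes
);
// 4 * 4 * 3 * 2 = 96 scenarios, before pill/icon/loading variants join in
```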

Deciding where to keep all those images is a headache. For VDT to work, we need a “baseline” screenshot and a “current” screenshot. While running your VDT test, that’s 1,000 images or so JUST for buttons. Multiply that same concept for the other components with lots of stylistic variants, like badges and alerts, and the pile of images stacks up real quick.

Storing them in the repo

The simplest option might be to just store the baseline screens in your repo alongside your code and check them in. That totally works, but GitHub was not designed for binary file storage; it was designed for text files. GitHub offers its “Large File Storage” (LFS) feature so that you don’t have to clone down those 1,000 binaries and/or have them stored in each version of the history of your repo. But if you have a ton, then you still have to work around how GitHub’s LFS storage works. I haven’t tried that yet, because I opted for what a lot of folks do.

Store them in an external place or tool

Because the repo is a cumbersome place to store images, devs start looking for more suitable places: AWS S3 buckets and such that are designed to handle large batches of file types beyond just text. Some devs even opt to use an off-the-shelf tool that just does all the image storage and the pixel counting in a single place for you, while charging you an enterprise license fee and undoubtedly having a borderline unusable dashboard UI for viewing VDT test results. If you store your VDT screens in an external bucket like S3, that’s less complex, but you still have to manage getting the images to & from your build in order to do the pixel matching.

If you roll your own pixel matching using some library, then your build pipeline now has to have a step in it called “Go get all the baseline images from S3”. That adds latency and performance cost to your build that will add up. Now your CI pipeline takes an extra few minutes over and above actually generating the screens and doing the pixel matching. Plus the complexity cost of adding build scripts to put the screens where the pixel-match test tool expects them to be. Ever written a script to run through a pile of images and put them into subdirectory structures by splitting the file name, so you can put the button screenshots in the /button/tests/__screenshots__ folder so that the test runner knows where to look? Giant pain in the ass.
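If you've been spared writing that script, the core of it looks roughly like this. The component--theme--mode--state.png naming scheme and the __screenshots__ layout are invented for illustration; yours will differ:

```typescript
// Bucket flat screenshot files into the per-component folder structure a
// test runner expects, by parsing metadata back out of the file name.
function routeScreenshot(fileName: string): string {
  const [component] = fileName.replace(/\.png$/, "").split("--");
  return `${component}/tests/__screenshots__/${fileName}`;
}
```

So "button--theme-a--dark--hover.png" gets routed into the button component's screenshot folder, and the whole scheme silently breaks the day someone renames a component.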

External tools can take results out of your pipeline

So you don’t want to roll all that yourself and you opt for an external tool. Your boss says the license is worth it and the dashboard UI isn’t really that bad, it’ll just take some training and getting used to. You’ll still have to write a script that integrates with that tool, or you’ll have to design your pipeline such that all the VDT tests run directly on the tool’s platform (looking at you, Saucelabs). Either way, viewing the results most likely happens in the dashboard of that tool and NOT in your pipeline. Sure, you do the best you can, but your CI probably ends up a lot like mine, where it will fail and error telling you that some screenshot did a bad, but you’ll have to actually log in to that tool’s UI to actually see what about that particular screenshot failed.

External tools require more management

External tools mean more logins, more management of team access, and more heartache. If you rely on an external tool for results, then everyone on your team needs to be able to get in to see them and also to run their own tests on the PR branches. The tool I use (not going to name and shame cuz it’s likely not their fault) requires access tokens to send images up to their environment to be tested. That means that every team member needs to have their own token, because having just a single token for the whole team to share is a security no-no. So guess who gets to manage the team’s access to the tool? You & me. Good luck getting Claude’s help trying to figure out how to generate a new token for a new team member when you only go to that screen once a year. :)

VDT UIs are necessarily messy

This gripe is just me yelling at clouds, I suppose. I don’t really think there is a whole lot that VDT tool companies can do about their complicated UIs. I suppose that VDT UIs are just inherently complicated because of how complex VDT itself is. But that said, every UI I’ve ever seen in an external VDT testing tool has been super busy and full of options I never seem to know how to use. There are buttons for turning the diff highlights on and off again, zooming the images, flipping back and forth from the baseline image to the current one, accepting desired purposeful changes to the baselines, and on and on.

And even if all of those features were wiped away, you’re still left scrolling through potentially thousands of similar-looking screenshots trying to find the one marked red. Better hope the filtering options in the external tool are on point. More fun still: if the tool allows you to batch up your tests into groups, that’s great, but guess who writes the integration code that decides on the group names? You & me. I do it by file name. When I generate the screenshot, I make the file name have a predictable format that includes the browser, theme, mode, and some property & value combo being tested. Then when I go to upload my screens to the external tool, I get to parse that filename back out again and group all those screens into batches that are basically tied to whatever the external tool UI is made to show. Figuring out how to make sense of thousands of images so you can filter and sort them later on in a UI you don’t control is painful.

There are too many unpredictable timing variables

Since the main purpose of a VDT test is to render a screenshot in an automated pipeline environment, a lot of care must be taken to ensure that each visual scenario being tested is actually the same scenario over and over again. There are too many things about building UIs on the web that make this consistency hard. Does your component load a custom web font? Then your screenshot needs to wait until that font is loaded and visually applied before taking the screenshot; otherwise your VDT tests will be full of false positives. The font will eventually load and your beautiful typography is fine, but if the screenshot gets taken too fast, you’ll bust your thresholds. We can detect when a font is loaded, but there’s no way to wait until that re-paint has been completed, as far as I know. So we just guess and hope it’s done?

I am a web component developer, so I also have to wait until each of my web components is defined and has run its first update cycle before taking a screenshot. I use Lit, which definitely helps a ton. Lit has a handy updateComplete property that is a Promise<boolean> that resolves true each time the component render cycle is completed. But what if there are child web components in the shadow root? Waiting until every component has completed its render cycle is another way that false positives can plague us. There is just no simple way to know for sure, so we setTimeout for some random number and hope it’s long enough.
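The closest thing to a fix I know of is browser-only code along these lines: wait for fonts, then recursively await updateComplete down through shadow roots, then give the browser a couple of frames to paint. Treat it as a hedged sketch, not a guarantee. updateComplete and document.fonts.ready are real APIs, but nothing here can prove the final paint actually happened, which is why the setTimeout fudge never fully dies:

```typescript
// Browser-only sketch: settle fonts and Lit-style render cycles before a
// screenshot. Best-effort convention, not a guarantee of "fully painted".
async function settleRendering(root: ParentNode = document): Promise<void> {
  await document.fonts.ready; // resolves once in-flight font loads finish
  for (const el of root.querySelectorAll("*")) {
    const maybeLit = el as Element & { updateComplete?: Promise<boolean> };
    if (maybeLit.updateComplete) await maybeLit.updateComplete;
    if (el.shadowRoot) await settleRendering(el.shadowRoot); // recurse into shadow DOM
  }
  // two animation frames: a rough "the settled state has had a chance to paint"
  await new Promise(requestAnimationFrame);
  await new Promise(requestAnimationFrame);
}
```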

Testing animating elements is impossible

What about trying to visually test the loading state of your button with its delightful spinner spinning away? Can’t test that in a screenshot! There is no way to guarantee (that I know of) that the screenshot will be taken at the exact same time relative to when that loading spinner started spinning, so you either get the flakiest test known to man, or you’re like me and just say “I guess we can’t test animated things with VDT at all”. Then you chuck those loading state VDT tests out the window and just hope you never break them accidentally.
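One hedged partial escape, if your setup bottoms out in Playwright like mine does: its screenshot API can freeze CSS-driven animation for you (finite animations are fast-forwarded to completion, infinite ones canceled), so the spinner gets captured in a stable state instead of not at all. It does nothing for JS-driven animation, and the URL below is made up:

```typescript
import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("http://localhost:3000/button-loading"); // hypothetical test page
await page.screenshot({
  path: "button-loading.png",
  animations: "disabled", // Playwright settles CSS animations/transitions before capture
});
await browser.close();
```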

But that’s not all!

Screenshot size management

Since VDT takes a screenshot of all or a portion of the page, care must also be taken that the images are the right size. If you want to take a screenshot of a single button, but you code page.screenshot() with a 2000px viewport, you won’t even be able to see the button. Therefore, to make the screens both usable for automated testing and visually scannable by a human (sorry AI, you still aren’t good enough at images yet to help here much either), you will find yourself having to design some test UI page that carefully crafts predictable container elements that you can screenshot into predictable sizes.

Remember that if the size of a screenshot changes from the size of the baseline image, that’s a failed test! Even if the contents of the image didn’t change, because pixel diffing is dumb.

And each visual testing environment has its own default viewport. I use Vitest Browser Mode, which spins up Playwright under the hood, and I have no idea what the default viewport size is when a headless Chromium instance spins up. Is it the last viewport size that it was set to the last time it was run, or some static default? I have also seen that the size of the screenshot that you take with the screenshot() function depends on what size the viewport is. Your components also depend on what size the viewport is, and it needs to be the same every time or you’ll get a pile of false-positive failures.
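The usual defense is to stop trusting defaults entirely and pin the viewport in the test setup, something like this Playwright-flavored sketch (the sizes are arbitrary; the point is that they're explicit and identical on every run):

```typescript
import { chromium } from "playwright";

const browser = await chromium.launch();
// An explicit viewport means screenshot dimensions and responsive layout
// are the same on every machine and every run.
const page = await browser.newPage({ viewport: { width: 800, height: 600 } });
```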

You basically have to code a mini app

If you are testing components like me, they probably don’t exist entirely in isolation. If that button is themeable, then you need to test it IN the themes. So you need to spin up a test page with that theme’s global CSS on it, but none of the others, because your tests need to run in isolation. Your components also work in dark mode? Then you need to set that up also. Oh yeah, that’ll add another 500 images now.

Kitchen sink tests don’t help a ton

You try to avoid billions of images, so you create “kitchen sink” screenshot tests where you render lots of different variants of the same component type on the same page at the same time. That’s a fun script to write. Nested for...of loops for days! Also remember that not every property and value your components have actually has a visual effect. That label= property that just sets what the text is doesn’t matter, so you’ll also have to manually make an array of the props that DO have a visual effect and only render/test those to help keep the screens down to manageable numbers (not possible).
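That "array of the props that DO matter" ends up looking something like this sketch (all prop and tag names invented): a hand-maintained allowlist of visually significant props, expanded into the kitchen-sink render list with those nested loops:

```typescript
// Expand only visually significant props into kitchen-sink render cases.
type PropValues = Record<string, string[]>;

const allProps: PropValues = {
  variant: ["primary", "secondary"],
  rank: ["default", "pill"],
  label: ["Save", "Submit"], // pure text content: excluded from the matrix
};
const visualProps = ["variant", "rank"]; // the hand-maintained allowlist

function renderCases(props: PropValues, visual: string[]): string[] {
  let cases: string[] = [""];
  for (const name of visual) {
    cases = cases.flatMap((attrs) =>
      props[name].map((value) => `${attrs} ${name}="${value}"`.trim()),
    );
  }
  return cases.map((attrs) => `<my-button ${attrs}></my-button>`);
}
```

Two variants times two ranks gives four render cases here; label never enters the matrix, which is exactly the manual bookkeeping being complained about.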

And once you make your kitchen sink screenshot, the error message you get when something is wrong is the opposite of helpful. Because you grouped your components together in an attempt to cut down on bandwidth and storage sizes, the error you’ll get when you bust a threshold will be something like “One or more of these components, or something about the mini-app UI around them, broke something. You get to go visually look at the kitchen sink image to figure out what it was now.” So your CI pipeline will just break, and you need to pull up the actual screenshot to figure out which variant(s) are the problem.

Why do we do this to ourselves?

Is the screenshot method truly the only way we really have of verifying visual issues? To me it seems like a huge disconnect that the only approach we seem to have come up with as a community to test the malleable, fluid world of stuff on the web is to create an absolutely rigid screenshot approach that absolutely cannot change in any way (notwithstanding the purposeful updates) or else the whole system breaks. Does that seem wrong? Why is this super rigid VDT testing format the accepted approach?

If this is the only way to do VDT, then VDT is just one of those sucky things about building stuff on the web, like naming CSS selectors and responding to web component haters on Twitter. If the screenshot method isn’t the only way, where are the other ways? Is anyone out there working on a better way to visually statically analyze our components and apps so that we don’t have to do this stuff? Can we come up with a mechanism that isn’t so prone to timing issues, race conditions, and brittle scenario building? I sure hope so. I’m tired.
