SamReid

Posted on May 21 • Originally published at grabdiff.com

Why your uptime monitor says everything's fine while users see a white screen

#webdev #monitoring #javascript #devops

It was 11:47 PM on a Thursday when the Slack messages started rolling in.

"Hey, the checkout page looks broken."

"Is the site down? I'm seeing a blank screen."

"Tried three different browsers, same thing."

I pulled up our uptime monitor. Green. Every check passing. Response times normal. HTTP 200 across the board. By every metric the monitor tracked, the site was healthy.

It was not healthy.

The checkout page - the single most important page in the entire application - was rendering a completely white screen for every user. Had been for about 40 minutes. Our monitoring had no idea.

That night cost us somewhere around four hours of investigation, a patch at 2 AM, and a post-mortem I'd rather forget. The root cause was a JavaScript runtime error triggered by a third-party A/B testing script that loaded asynchronously in the browser. Our monitor's simple HTTP check never executes JavaScript at all: it got its 200 and moved on.

The monitor was doing exactly what it was designed to do. That was the problem.

What a ping monitor actually checks

Let's be precise about what most uptime monitoring tools do, because I think a lot of developers have a fuzzy mental model here.

A basic uptime monitor makes an HTTP request to your URL. It checks:

Did the server respond?
Was the status code in the 2xx range?
Was the response time under some threshold?

That's it. Some monitors add a string match - "check that the response body contains the word 'homepage'" or whatever. That's slightly better. But not by much.

The thing is, your users don't care about any of that. They care whether the page works - whether the content they expect is visible, the buttons they need to click are there, and the thing they're trying to do is actually possible.

An HTTP 200 response with a blank body is still an HTTP 200. A page that returns your shell HTML but fails to mount any React components is still an HTTP 200. A cached CDN response that's three weeks stale is still an HTTP 200.
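To make that concrete, here's a minimal sketch of the logic a basic check boils down to. The function name `ping_check`, its parameters, and the 2-second default are assumptions for illustration, not any particular tool's API.

```python
from typing import Optional


def ping_check(status: int, elapsed_s: float, body: str,
               max_elapsed_s: float = 2.0,
               must_contain: Optional[str] = None) -> bool:
    """Return True if a basic uptime check would report the page as healthy."""
    if not 200 <= status < 300:        # status code in the 2xx range?
        return False
    if elapsed_s > max_elapsed_s:      # response time under the threshold?
        return False
    if must_contain is not None and must_contain not in body:
        return False                   # optional string match on the raw body
    return True


# A 200 with an empty body passes.
print(ping_check(200, 0.12, ""))  # True

# So does shell HTML whose root element never gets mounted: the string match
# only sees the raw markup, not what the browser renders from it.
shell = '<html><title>Checkout</title><div id="root"></div></html>'
print(ping_check(200, 0.12, shell, must_contain="Checkout"))  # True

# An actual server error is the kind of failure this does catch.
print(ping_check(503, 0.12, ""))  # False
```

Nothing in that function ever looks at what a browser would draw, which is the whole gap.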

The failure modes nobody warns you about

Here are the actual production incidents I've seen (or personally caused) that a ping monitor will completely miss.

JavaScript crash on load

Your server renders the page and sends back valid HTML. But somewhere in the client-side bundle, there's an unhandled exception - a null reference, a failed import, an API that returned an unexpected shape. The page stays blank, or partially rendered, or stuck in a loading state. The server was fine. The HTTP response was fine. The user experience was not fine.

These are especially nasty because they often only affect certain browsers, certain screen sizes, or users in certain states (logged in vs. logged out, items in cart vs. empty cart).

CDN serving stale content

You deployed a fix. Your origin server is running the new code. But your CDN edge nodes are still serving the old, broken version to 80% of your users because cache invalidation didn't propagate correctly, or someone forgot to add a cache-busting header, or the CDN's purge API returned 200 but didn't actually do the thing.

Your monitor hits the origin, or an edge node that did get purged. Clean. Users hit a different CDN edge. Broken.
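One way to notice this is to check both paths and compare which build each one is serving. Here's a minimal sketch, assuming your deploy stamps a build identifier into a response header. The header name `X-Build-Id` and both function names are made up for illustration; your stack may expose the version some other way.

```python
from typing import Mapping


def build_id(headers: Mapping[str, str], name: str = "x-build-id") -> str:
    """Look up a header case-insensitively (HTTP header names ignore case)."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return lowered.get(name.lower(), "")


def edge_is_stale(origin: Mapping[str, str], edge: Mapping[str, str]) -> bool:
    """True when the edge response reports a different build than the origin.

    If neither response carries the header, this returns False: the sketch
    cannot tell, so it stays quiet rather than guessing.
    """
    return build_id(origin) != build_id(edge)


# Hypothetical response headers from the two paths.
origin_headers = {"X-Build-Id": "a1b2c3"}
edge_headers = {"x-build-id": "9f8e7d", "Age": "86400"}

print(edge_is_stale(origin_headers, edge_headers))    # True
print(edge_is_stale(origin_headers, origin_headers))  # False
```

In practice you would fill those dictionaries from two real requests, one straight to the origin and one through the public hostname.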

React/Next.js hydration failure

You're server-side rendering a React app. The server sends down HTML, which looks great in your monitor's response check. Then the client-side JavaScript tries to "hydrate" that HTML - attach event listeners, reconcile state - and something goes wrong. When it fails badly, the page looks rendered, but nothing is interactive. Buttons don't respond. Forms won't submit. The hydration error is sitting in the browser console where nobody ever looks.

An A/B test or feature flag went sideways

Someone enabled a new variation in your experimentation platform. It works fine for 90% of users. For 10% - one variant bucket, or a specific segment, or users with a certain cookie - it injects a script that breaks layout, hides a critical element, or throws an error. Your monitor isn't in that segment. Your monitor sees the happy path.

The checkout button just... isn't there anymore

A CSS change. A template logic error. A component that conditionally renders based on some state that's slightly wrong. Your add-to-cart button, your checkout button, your submit form - just gone. The page looks fine at a glance. The hero image is there, the nav is there, the product photos are there. But the one element users actually need to convert? Missing.

A monitor checking for HTTP 200 and a response time under 2 seconds has no idea.

Why visual monitoring is conceptually different

The insight is simple: instead of asking "did the server respond?", ask "does the page look right?"

Visual monitoring takes a screenshot of your page - using a real browser, running real JavaScript, waiting for render to complete - and compares it against a known-good baseline. If the current screenshot differs from the baseline beyond some threshold, something has changed and you get an alert.

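This catches most of the scenarios above because it evaluates the end result that users actually see, not the HTTP response that arrives before any of the interesting rendering work occurs. The exceptions are failures that leave the pixels intact. A hydration failure where the page looks rendered but nothing responds needs a scripted interaction (click the button, check what happens) on top of the screenshot, and a broken A/B variation is only caught if the monitoring browser lands in the affected segment.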

The diff approach is also useful beyond pure "is it broken" detection. It catches:

Layout shifts you didn't intentionally make
Text that changed in a way it shouldn't have (wrong price, wrong copy)
Elements that disappeared
New elements that appeared (a cookie banner blocking content, an error message you didn't know was showing)
Visual regressions from a deploy that seemed fine but quietly broke something

The key is the baseline. You take a screenshot when everything is known to be working, mark it as the baseline, and then every subsequent check compares against that. When the diff exceeds your threshold, alert.
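The comparison itself can be sketched in a few lines. This is a deliberately naive version, assuming both screenshots are already decoded into rows of RGB tuples; the names `diff_ratio` and `should_alert` and the 1% default threshold are my own choices for illustration, not how any specific tool does it.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each 0-255
Image = List[List[Pixel]]      # rows of pixels


def diff_ratio(baseline: Image, current: Image) -> float:
    """Fraction of pixels that differ between two images."""
    same_shape = len(baseline) == len(current) and all(
        len(a) == len(b) for a, b in zip(baseline, current)
    )
    if not same_shape:
        return 1.0  # a size change counts as a full mismatch
    total = sum(len(row) for row in baseline)
    if total == 0:
        return 0.0
    changed = sum(
        1
        for row_a, row_b in zip(baseline, current)
        for a, b in zip(row_a, row_b)
        if a != b
    )
    return changed / total


def should_alert(baseline: Image, current: Image, threshold: float = 0.01) -> bool:
    """Alert when more than `threshold` of the pixels changed."""
    return diff_ratio(baseline, current) > threshold


WHITE, BLUE = (255, 255, 255), (0, 90, 255)

# Baseline: a 10x10 white page with a 4x2 blue "checkout button".
baseline = [[WHITE] * 10 for _ in range(10)]
for y in (4, 5):
    for x in range(3, 7):
        baseline[y][x] = BLUE

# Current: the button is gone.
current = [[WHITE] * 10 for _ in range(10)]

print(diff_ratio(baseline, current))    # 0.08
print(should_alert(baseline, current))  # True
```

A missing button changes only 8% of this toy page, which is why the threshold has to be set well below "most of the page changed".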

The alert should include the diff image - not just "something changed," but a visual showing exactly what changed, highlighted. That's the thing that makes it immediately actionable instead of just alarming.
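One simple way to produce that highlight is to find the smallest box that covers every changed pixel and draw it on the screenshot. A minimal sketch, again assuming images as rows of RGB tuples; `changed_bbox` is a hypothetical helper name, and a real tool might highlight several regions rather than one.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def changed_bbox(baseline: Image, current: Image) -> Optional[Box]:
    """Smallest box covering every pixel that differs, or None if identical."""
    xs: List[int] = []
    ys: List[int] = []
    for y, (row_a, row_b) in enumerate(zip(baseline, current)):
        for x, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


WHITE, BLUE = (255, 255, 255), (0, 90, 255)

# Baseline has a blue button at columns 3-6, rows 4-5; current has lost it.
baseline = [[WHITE] * 10 for _ in range(10)]
for y in (4, 5):
    for x in range(3, 7):
        baseline[y][x] = BLUE
current = [[WHITE] * 10 for _ in range(10)]

print(changed_bbox(baseline, current))   # (3, 4, 6, 5)
print(changed_bbox(baseline, baseline))  # None
```

The box coordinates are what you would hand to an image library to draw the highlight on the attached screenshot.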

The practical gotchas

Visual monitoring isn't magic, and there are real tradeoffs.

Dynamic content. If your page shows the current time, a live stock price, or a user-specific greeting, those will show up as diffs on every single check. You need to either mask those regions, or structure your pages so dynamic content is in predictable locations you can exclude.
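Masking can be as simple as skipping any pixel that falls inside an excluded rectangle before counting differences. A minimal sketch under the same "rows of RGB tuples" assumption; `masked_diff_ratio` and the rectangle format are illustrative choices, not a standard.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Rect = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def is_masked(x: int, y: int, masks: Sequence[Rect]) -> bool:
    """True if (x, y) falls inside any excluded rectangle."""
    return any(l <= x <= r and t <= y <= b for l, t, r, b in masks)


def masked_diff_ratio(baseline: Image, current: Image,
                      masks: Sequence[Rect]) -> float:
    """Fraction of unmasked pixels that differ."""
    compared = changed = 0
    for y, (row_a, row_b) in enumerate(zip(baseline, current)):
        for x, (a, b) in enumerate(zip(row_a, row_b)):
            if is_masked(x, y, masks):
                continue  # dynamic region: ignore it entirely
            compared += 1
            if a != b:
                changed += 1
    return changed / compared if compared else 0.0


WHITE, GRAY = (255, 255, 255), (90, 90, 90)

# A 4x4 page whose top row is a clock that redraws on every check.
baseline = [[WHITE] * 4 for _ in range(4)]
current = [[GRAY] * 4] + [[WHITE] * 4 for _ in range(3)]

print(masked_diff_ratio(baseline, current, masks=[]))              # 0.25
print(masked_diff_ratio(baseline, current, masks=[(0, 0, 3, 0)]))  # 0.0
```

The tradeoff is that anything that breaks inside a masked region is invisible to the check, so masks are worth keeping as small as you can.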

Rendering variability. Fonts rendering slightly differently on different machines, antialiasing differences, animated elements caught mid-transition - all of these can cause false positives if your diff threshold is too sensitive. You need a threshold that's high enough to ignore noise but low enough to catch real problems. Getting this calibrated takes some tuning.
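One common way to absorb that noise is a per-pixel tolerance: small color shifts are ignored, and only pixels that moved by more than the tolerance count toward the diff. A minimal sketch; the function name and the default tolerance of 16 are assumptions you would tune against your own screenshots.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def tolerant_diff_ratio(baseline: Image, current: Image,
                        pixel_tolerance: int = 16) -> float:
    """Fraction of pixels whose largest channel difference exceeds the tolerance."""
    total = changed = 0
    for row_a, row_b in zip(baseline, current):
        for a, b in zip(row_a, row_b):
            total += 1
            if max(abs(ca - cb) for ca, cb in zip(a, b)) > pixel_tolerance:
                changed += 1
    return changed / total if total else 0.0


GRAY = (200, 200, 200)
baseline = [[GRAY, GRAY], [GRAY, GRAY]]
current = [
    [(205, 198, 200), GRAY],  # antialiasing-sized shift
    [GRAY, (0, 0, 0)],        # a real change
]

print(tolerant_diff_ratio(baseline, current, pixel_tolerance=0))   # 0.5
print(tolerant_diff_ratio(baseline, current, pixel_tolerance=16))  # 0.25
```

With zero tolerance the antialiasing shift doubles the reported diff; with a tolerance it drops out and only the real change remains.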

Auth-gated pages. If you want to monitor pages behind a login, you need to handle authentication in your screenshot flow. This is doable - you can script login sequences - but it adds complexity.

Cost of headless Chrome. Running Chrome instances at scale is not free. It's memory-hungry, and taking a screenshot takes real time compared to a simple HTTP request. Checking many URLs at a 1-minute interval is meaningfully more expensive than running a ping monitor against the same list.
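A quick back-of-the-envelope calculation shows how this adds up. The numbers below (50 URLs, 5 seconds per screenshot) are made-up inputs for illustration, not measurements.

```python
def browser_hours_per_day(urls: int, interval_min: float,
                          seconds_per_check: float) -> float:
    """Total browser time per day, in hours, for a fixed check schedule."""
    checks_per_day = urls * (24 * 60 / interval_min)
    return checks_per_day * seconds_per_check / 3600


# Assumed workload: 50 URLs, checked every minute, 5 s of browser time each.
# 50 * 1440 = 72,000 checks/day; 72,000 * 5 s = 360,000 s = 100 hours.
print(browser_hours_per_day(urls=50, interval_min=1, seconds_per_check=5))  # 100.0

# The same URLs checked every 15 minutes need a fifteenth of that.
print(browser_hours_per_day(urls=50, interval_min=15, seconds_per_check=5))
```

Under those assumptions you are paying for about 100 hours of Chrome per day, which is roughly four browsers running around the clock. Stretching the interval is the cheapest lever.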

These are all solvable problems, but they're real, and anyone selling you a visual monitoring solution that doesn't mention them is glossing over the hard parts.

Where this leaves you

I'm not saying throw out your existing uptime monitor. Ping monitors are cheap, fast, and useful for catching the simplest failures - server down, TLS expired, DNS broken. Keep them.

But treat them as what they are: a passing check is a necessary but not sufficient condition for "the site is working." The gap between "the server responded with 200" and "users can actually use this page" is where a lot of production incidents live, and most teams don't find out about those incidents until users tell them.

After the Thursday night incident I described at the start, I spent a weekend investigating visual monitoring options and ended up down a rabbit hole of how headless Chrome screenshot diffing actually works. None of the existing tools felt quite right - either too expensive, too complex to configure, or not quite targeted at the "is this page visually broken" question.

So I built GrabDiff. It takes screenshots of your URLs on a schedule using headless Chrome, diffs them against a stored baseline, and emails you with the diff image attached when something looks off. Free plan covers three monitors, no credit card required.

Your uptime monitor is probably lying to you right now. Hopefully about something minor.