DEV Community

SamReid

The 5 things traditional uptime monitors miss (and how to catch them)

Your uptime monitor is probably green right now. That doesn't mean everything is working.

HTTP ping monitors are good at one thing: checking whether your server responds. They're essentially useless for everything that happens after the response leaves your server - the JavaScript execution, the rendering, the CDN edge nodes, the client-side state that has to be right for your page to actually work.

I got tired of finding out about these failures from users instead of from my monitor. That's why I built GrabDiff: it takes actual screenshots of your pages, diffs them against a known-good baseline, and emails you the diff image when something looks off. Free plan, three monitors, no card.

But first, here are the five categories of failures I've seen (and caused) in production that an HTTP monitor will miss completely, plus how you actually catch them.


1. JavaScript crashes on load

This is the most common silent failure on modern web apps, and the one most developers underestimate.

Your server sends back valid HTML. HTTP 200, response time under 500ms, your monitor is happy. Then the client-side bundle executes. Somewhere in there - a null reference on a property that's undefined in some edge case, a third-party script that assumes something about the DOM that isn't true, an API response that came back in a shape the frontend didn't expect - an unhandled exception gets thrown. The page freezes. Or goes blank. Or renders halfway and stops.

From your monitor's perspective: everything is fine.

From your user's perspective: white screen.

What makes this nasty: JavaScript errors are often conditional. They might affect logged-in users but not logged-out ones, or only users on certain plans, with certain browser versions, or with certain cookies or localStorage state. Your monitor hits the URL fresh, unauthenticated, with a clean browser, so it's not in the affected cohort.

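How to catch it: Visual monitoring. Take a screenshot with a real headless browser and compare it against a known-good baseline. A blank page or partial render shows up immediately as a large pixel diff, which standard HTTP monitoring cannot catch. For the conditional errors above, the headless browser also has to load the page in the affected state (for example, logged in as a test user); otherwise it starts from the same clean slate as the HTTP check.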
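To make "large pixel diff" concrete, here is a minimal sketch of just the comparison step, in shell with awk. It assumes both screenshots have already been converted to plain (ASCII, `P2`) PGM grayscale files of the same size; `pixel_diff_percent` is a hypothetical name, and production tools usually work on PNGs and add a per-pixel tolerance to ignore anti-aliasing noise rather than requiring exact equality.

```shell
# pixel_diff_percent BASELINE CURRENT
# Prints the percentage of pixels that differ between two plain (P2) PGM files.
# Assumption: screenshots were already converted to P2 PGM elsewhere.
pixel_diff_percent() {
  awk '
    FNR == 1 { f++ }                       # f = 1 for BASELINE, 2 for CURRENT
    {
      c = index($0, "#")                   # strip PGM comments
      if (c) $0 = substr($0, 1, c - 1)
      for (i = 1; i <= NF; i++) tok[f, ++n[f]] = $i
    }
    END {
      # Token layout: magic, width, height, maxval, then width*height pixels.
      if (tok[1, 1] != "P2" || tok[2, 1] != "P2") {
        print "expected two plain (P2) PGM files" > "/dev/stderr"; exit 2
      }
      if (tok[1, 2] != tok[2, 2] || tok[1, 3] != tok[2, 3]) {
        print "image dimensions differ" > "/dev/stderr"; exit 2
      }
      total = tok[1, 2] * tok[1, 3]
      if (total == 0) { print "empty image" > "/dev/stderr"; exit 2 }
      changed = 0
      for (i = 5; i < 5 + total; i++) if (tok[1, i] != tok[2, i]) changed++
      printf "%.2f\n", 100 * changed / total
    }' "$1" "$2"
}
```

A blank page compared against a fully rendered baseline lands far above any small noise threshold; where exactly to set that threshold is a per-page judgment call.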


2. CDN serving stale or broken content

You fixed the bug. Deployed. Checked the origin. Everything looks correct. And then the Slack DMs start: users are still seeing the broken version.

CDN cache invalidation is notoriously unreliable. The failure modes include:

  • Purge API returned 200 but didn't actually purge - this happens more than vendors want to admit
  • Edge nodes in some regions updated, others didn't - your origin check hit one data center, users are hitting another
  • Cache-Control headers were wrong - a max-age=86400 header set during a period when things were broken means users get the broken version for up to 24 more hours
  • The CDN cached a redirect or an error page - the error from 45 minutes ago is still being served from cache, possibly with a 200 status instead of the original 503
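You can see how long an edge is allowed to keep serving its copy by reading the caching headers it returns. The sketch below reads response headers (for example from `curl -sI https://your-site.example/`, a placeholder URL) and prints the seconds of freshness left: `s-maxage` if present (it applies to shared caches such as CDNs), otherwise `max-age`, minus the `Age` header. It is a simplification under stated assumptions: it ignores `Expires`, `no-cache`/`no-store`, and CDN-specific headers such as `Surrogate-Control`, a successful purge overrides it, and `remaining_freshness` is a hypothetical name.

```shell
# remaining_freshness: read HTTP response headers on stdin and print how many
# more seconds a shared cache may serve this copy without revalidating.
# Simplified: only looks at Cache-Control (s-maxage / max-age) and Age.
remaining_freshness() {
  tr -d '\r' | awk '
    { line = tolower($0) }
    line ~ /^cache-control:/ {
      n = split(substr(line, 15), parts, ",")
      for (i = 1; i <= n; i++) {
        gsub(/[ \t]/, "", parts[i])
        if (parts[i] ~ /^s-maxage=[0-9]+$/) smax = substr(parts[i], 10) + 0
        else if (parts[i] ~ /^max-age=[0-9]+$/) max = substr(parts[i], 9) + 0
      }
    }
    line ~ /^age:/ { age = substr(line, 5) + 0 }
    END {
      life = (smax != "") ? smax : max     # shared caches prefer s-maxage
      if (life == "") { print "no max-age or s-maxage" > "/dev/stderr"; exit 1 }
      rem = life - age
      print (rem > 0 ? rem : 0)
    }'
}
```

For the `max-age=86400` case above, an edge that reports `Age: 3600` may keep serving its copy for another 82800 seconds (23 hours).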

Your HTTP monitor hits the origin directly, or hits a CDN edge node that happens to have fresh cache. Users are hitting different edges.

How to catch it: Monitor from multiple geographic locations, and monitor what the page looks like, not just what status code it returns. A CDN serving an old broken page will return HTTP 200 with content that doesn't match your current baseline: a status-code check passes, while a visual diff against the baseline flags the discrepancy.


3. React/Next.js hydration failures

Server-side rendering gives you the best of both worlds: fast initial paint from pre-rendered HTML, then full interactivity once the JavaScript loads and "hydrates" the DOM.

When hydration goes wrong, you get the worst of both worlds.

The server sends perfectly rendered HTML. Your monitor checks it, sees a 200, sees the content in the response body, marks it as healthy. The user's browser receives that HTML and renders it visually - the page looks fine. Then React tries to hydrate: match the server-rendered DOM against what the client-side bundle would have rendered, attach event listeners, take over control.

If there's a mismatch - different data, different component state, a prop that resolves differently on client vs. server - React throws a hydration error. Depending on how bad the mismatch is, the page might silently fail and stay un-interactive, throw an error and remount (causing a flash and losing state), or crash entirely.

The user sees a page that looks correct but where buttons do nothing and forms don't submit.

How to catch it: Visual monitoring alone doesn't fully catch this one, because a hydration failure might not change how the page looks. What you really need here is headless browser monitoring that actually interacts with the page (clicks a button, submits a form) instead of just screenshotting it. Visual monitoring at least catches the cases where hydration failures cause visible layout breaks or blank sections.


4. Visual regressions from deploys

This one is subtle and often dismissed until it bites you.

You deployed a CSS change that seemed harmless. Or bumped a dependency. Or refactored a component. The page still loads, still returns 200, still has all the right content in the DOM. But something looks different - a font changed, a button moved, a section collapsed, a layout broke on certain viewport widths.

Maybe it's minor enough that you don't notice it in manual testing. Maybe it's only visible at certain screen sizes you didn't test. Maybe it's on a page that isn't part of your standard QA flow.

Users notice. Users get confused. Users don't convert. And nobody knows why conversion dropped 15% last Tuesday because it's not in any error log - it wasn't an error, it was just wrong.

How to catch it: This is exactly what visual diffing is built for. Take a screenshot before and after every deploy, compare them, and have a human approve any visual change. In CI, end-to-end visual testing tools like Percy do this before the change reaches production; visual uptime monitoring applies the same comparison to production itself.

The key distinction: CI visual tests run on your test environment before deploy. Production visual monitoring catches the regressions that slip through - the ones that only appear with real data, real CDN behavior, or real third-party scripts.
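The screenshot, compare, approve loop can be sketched in a few lines of shell. This is a simplified illustration, not how Percy or GrabDiff implement it: `check_page`, `approve_page`, and the `BASELINE_DIR` layout are hypothetical, and the byte-for-byte `cmp` stands in for a real pixel diff with a tolerance (two screenshots that look identical can still differ at the byte level).

```shell
# Hypothetical layout: approved baselines live in $BASELINE_DIR/<page>.png.
: "${BASELINE_DIR:=./baselines}"

# check_page PAGE SCREENSHOT
# Returns 0 if SCREENSHOT is byte-identical to the approved baseline,
# 1 if it differs (a human should review it), 2 if no baseline exists yet.
check_page() {
  local baseline="$BASELINE_DIR/$1.png"
  if [ ! -f "$baseline" ]; then
    echo "$1: no approved baseline yet" >&2
    return 2
  fi
  if cmp -s "$baseline" "$2"; then
    return 0
  fi
  echo "$1: screenshot differs from the approved baseline, review before approving" >&2
  return 1
}

# approve_page PAGE SCREENSHOT
# Called after a human has reviewed the change: the screenshot becomes the new baseline.
approve_page() {
  mkdir -p "$BASELINE_DIR" && cp "$2" "$BASELINE_DIR/$1.png"
}
```

In a post-deploy job you would screenshot each page, run `check_page`, and call `approve_page` only after someone has actually looked at the change.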


5. Cron jobs and background workers silently dying

This one doesn't get talked about enough in the uptime monitoring context, because it's not about a web page being down - it's about a process that isn't running when it should be.

Your nightly data export job. Your email digest cron. Your subscription renewal checker. Your database backup task. These run in the background, they don't have HTTP endpoints to ping, and when they die - because of a deploy that changed an environment variable they depended on, a library update that broke a dependency, a database connection that started timing out - they die silently.

No alert. No log entry that anyone's watching. Just a job that was supposed to run at 3 AM and didn't.

You find out a week later when a customer asks why their export data is a week stale. Or when your database backup is missing and you need it.

How to catch it: Heartbeat monitoring. The pattern is: your cron job sends an HTTP ping to a monitoring endpoint at the end of each successful run. If the endpoint doesn't receive a ping within the expected period plus a grace window, it fires an alert. This inverts the monitoring model: instead of checking whether something is up, you're checking whether something ran.
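On the monitoring side, "period plus grace" is just a deadline comparison. Here is a minimal sketch of the check a heartbeat service might run for each monitor; `heartbeat_status` is a hypothetical name, all values are in seconds, and a real service would also handle the very first expected ping and avoid re-alerting on every check.

```shell
# heartbeat_status LAST_PING PERIOD GRACE [NOW]
# LAST_PING and NOW are Unix timestamps; PERIOD and GRACE are durations in seconds.
# Prints "ok" until LAST_PING + PERIOD + GRACE has passed, then "late".
heartbeat_status() {
  local last_ping=$1 period=$2 grace=$3 now=${4:-$(date +%s)}
  local deadline=$((last_ping + period + grace))
  if [ "$now" -gt "$deadline" ]; then
    echo late
  else
    echo ok
  fi
}
```

For a nightly job (period 86400) with a 30-minute grace window (1800), a job that last pinged at 3:00 AM is reported late from 3:30 AM the next day onward.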

# At the end of your cron job. Run the script with `set -e` (or chain the
# steps with &&) so this line is only reached if every earlier step succeeded.
curl -fsS "https://grabdiff.com/ping/your-monitor-slug" > /dev/null

If that ping doesn't arrive on schedule, you get alerted. It's simple, it's reliable, and it catches the entire class of "background job died silently" failures.
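To make sure the ping only ever reflects a successful run, you can wrap the job in a small helper that pings only when the job exits 0. This is a sketch assuming bash (or another shell with `local`); `run_with_heartbeat` is a hypothetical helper, not part of any tool.

```shell
# run_with_heartbeat PING_URL COMMAND [ARGS...]
# Runs COMMAND and pings PING_URL only if it exits 0; returns COMMAND's status.
run_with_heartbeat() {
  local ping_url=$1 status
  shift
  if "$@"; then
    status=0
  else
    status=$?
    echo "job failed with status $status, not sending heartbeat" >&2
    return "$status"
  fi
  # -f: fail on HTTP errors, -sS: quiet except for errors,
  # --retry/--max-time: tolerate a flaky network without hanging the job.
  if ! curl -fsS --retry 3 --max-time 10 "$ping_url" > /dev/null; then
    # The job itself succeeded; the missing ping will surface as a late heartbeat.
    echo "heartbeat ping to $ping_url failed" >&2
  fi
  return "$status"
}
```

Usage: `run_with_heartbeat "https://grabdiff.com/ping/your-monitor-slug" ./nightly-export.sh`, where the script name is a placeholder for your own job.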


The full picture

A complete monitoring setup that actually catches production failures looks like this:

| Check | Tool | Catches |
|---|---|---|
| Server responds | HTTP ping (Pingdom, UptimeRobot) | Server down, DNS broken, TLS expired |
| Page renders correctly | Visual screenshot monitor | JS crashes, blank pages, CDN stale cache, visual regressions |
| Cron jobs run | Heartbeat monitor | Silent background job failures |
| SSL/domain expiry | Certificate monitor | Expiring certs, domain renewals |

You need all four layers. HTTP ping is necessary but covers maybe 40% of what actually goes wrong. Visual monitoring and heartbeat monitoring cover most of the rest.


What I built

After running into enough of these failures - mostly categories 1, 2, and 5 - I built GrabDiff to handle the visual monitoring and heartbeat pieces alongside standard uptime checks.

It screenshots your URLs on a schedule using headless Chrome, diffs them against a baseline, and sends you the diff image in an alert when something changes. It also handles heartbeat monitoring for cron jobs and background workers, and tracks SSL/domain expiry.

Free plan covers three monitors. If you're runnin
