DEV Community: SamReid

Your cron jobs are probably failing silently and you have no idea

SamReid — Fri, 22 May 2026 06:44:44 +0000

A user emailed last year to ask why their weekly export was two months stale.

I looked at the cron job. It was still in the crontab, and the logs said it was completing successfully. Except the newest log entries were two months old. The job had stopped completing two months earlier, and nobody had noticed because there were no alerts, no errors, no way to notice - its output just... wasn't there anymore.

The cause was a deploy that changed an environment variable the job depended on. The job would start, hit the config error, and exit with a non-zero code. Cron would note this in syslog. Nobody was watching syslog.

This is the most common category of "invisible" production failure. The job exists. You can see it in crontab. It just isn't doing its work. And nothing is watching whether it did.

Why ping monitoring doesn't work for cron jobs

If your service has an HTTP endpoint, you can ping it. Cron jobs don't have HTTP endpoints. You could add one - a /status route that returns the last run time - but now you're building monitoring infrastructure into every job, and you still have to remember to check that endpoint on a schedule that aligns with when the job runs.

The bigger problem: a ping monitor checks whether something is reachable. For cron jobs, the question is whether something happened. Those are completely different things.

The host that runs your weekly backup job could be perfectly reachable, and that would tell you nothing about whether the backup actually ran last Sunday at 3 AM.

Heartbeat monitoring: the inverted model

Heartbeat monitoring inverts the check. Instead of the monitor asking "is the service up?", the job tells the monitor "I finished successfully."

The pattern:

Create a heartbeat monitor with a period (how often the job runs) and a grace period (how long after the expected time to wait before alerting)
At the end of your cron job - after all the work is done, on the success path - send an HTTP ping to the heartbeat URL
If the monitor doesn't receive a ping within period + grace, fire an alert

That's it. The failure modes it catches:

Job didn't run at all (cron config broken, cron daemon down, environment issue)
Job ran but exited early with an error before reaching the success ping
Job ran, succeeded, but took longer than expected (catches jobs that are silently degrading over time)
Job was removed or disabled by accident

The beauty of it is that the monitoring lives outside the job. You don't have to instrument every job heavily - you add one line at the end of the success path.

The implementation

At the simplest level, heartbeat monitoring is just:

# at the end of your script, after all work is done
curl -fsS --retry 3 "https://grabdiff.com/ping/your-unique-slug" > /dev/null

GrabDiff gives you a unique unguessable URL for each heartbeat monitor. Set the period to match your cron schedule, set a grace period (I use 15 minutes for jobs that run hourly, a few hours for daily jobs), and you're done. If the ping doesn't arrive within that window, you get an email.

For application code instead of shell scripts, it's the same idea:

func runExportJob(ctx context.Context) error {
    // do the work
    if err := exportData(ctx); err != nil {
        return fmt.Errorf("export failed: %w", err) // no ping sent
    }

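    // only ping on success
    resp, err := http.Get(os.Getenv("HEARTBEAT_URL"))
    if err != nil {
        slog.Warn("heartbeat ping failed", "err", err)
        // don't fail the job over a monitoring ping failure
        return nil
    }
    defer resp.Body.Close() // always close the body so the connection is released
    if resp.StatusCode >= 300 {
        slog.Warn("heartbeat ping rejected", "status", resp.Status)
    }
    return nil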
}

A few things worth noting:

Ping only on the success path. The monitor needs to distinguish "ran and succeeded" from "ran and failed." If you ping on both success and failure, you lose that signal.

Don't fail the job if the ping fails. Your backup job shouldn't fail because the monitoring endpoint was momentarily unreachable. Log it, but keep it separate from the job's exit code.

Include the job output in your error handling somewhere. The heartbeat tells you the job didn't run - it doesn't tell you why. Make sure your job logs to somewhere you can check when an alert fires.

Choosing period and grace

The period should match your cron schedule exactly. If your job runs 0 3 * * * (3 AM daily), your period is 24 hours.

Grace period is trickier. You want it long enough to not alert on jobs that take slightly longer than usual, but short enough to catch actual failures before they become a problem.

My rules of thumb:

Minute-level jobs: grace = 5 minutes
Hourly jobs: grace = 15 minutes
Daily jobs: grace = 2–4 hours (depending on how bad a missed run is)
Weekly jobs: grace = 12 hours

For anything with a business impact - billing runs, data exports, email digests - I keep the grace short and accept the occasional false positive from a slow run. A false positive is annoying. A missed billing run is worse.

The jobs worth monitoring

Every shop has a slightly different list, but the categories that almost always have critical unmonitored jobs:

Data exports and reports. Whatever you're generating for customers or stakeholders on a schedule. When these stop, you find out a week later.

Billing and subscription processing. Failed renewal attempts, expired trial follow-ups, invoice generation. Silent failures here have direct revenue impact.

Email digests and notifications. Users set expectations based on these. When they stop arriving, your support queue fills up.

Database backups. The one you really, really don't want to discover has been failing when you actually need a restore.

Search index updates. If your search depends on a nightly rebuild job and the job stops, search quietly degrades until someone notices results are stale.

Cache warming and pre-computation. These often run before peak traffic. If they don't run, you don't notice until peak traffic hits and things are slow.

Go through your crontab right now. For each job, ask: "How would I know if this stopped running?" If the answer is "a user would tell me," that job needs a heartbeat.

Start/fail endpoints for richer monitoring

Some heartbeat systems (including GrabDiff) support optional start and fail endpoints in addition to the success ping.

Start endpoint: ping when the job begins. Lets you track job duration and alert if a job runs too long.
Fail endpoint: explicit failure ping for when you want to differentiate "didn't run" from "ran and explicitly failed."

The start endpoint is the one I actually use regularly. Combined with the success ping, you get duration tracking. If a job that normally takes 3 minutes suddenly takes 45 minutes, that's worth knowing about even if it technically "succeeded."

#!/bin/bash
curl -fsS "https://grabdiff.com/ping/your-slug/start" > /dev/null

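# ... do work ... (run_job below is a placeholder for your actual command)

# Test the job's exit status directly. Checking $? at this point would
# report the status of the start ping above, not of the job.
if ! run_job; then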
    curl -fsS "https://grabdiff.com/ping/your-slug/fail" > /dev/null
    exit 1
fi

curl -fsS "https://grabdiff.com/ping/your-slug" > /dev/null

This is probably more than you need for most jobs. One ping at the end of the success path is the right place to start.

The monitoring gap between "it exists" and "it ran"

The broader pattern here: there's a gap between "the system is configured to do a thing" and "the thing actually happened." Ping monitors cover the first half. Heartbeat monitoring covers the second.

Your cron job exists. Your renewal process is configured. Your backup job is in the schedule. The question is whether it's running.

For any job where the answer to "how would I know if this stopped?" is "I wouldn't," add a heartbeat. It's a one-line change to your script and a two-minute setup in your monitoring tool. The jobs I've seen cause the most damage are almost always ones where someone set them up, confirmed they ran once, and then never thought about them again until something downstream broke.

The two-month-stale export was embarrassing. The fix took about 90 seconds.

I wrote this because heartbeat monitoring is one of those things that nobody tells you about until after you've had the incident that makes you wish you'd known. It's not in most intro-to-devops content, and it probably should be.

If you've had a cron job go silent on you - or if you're running jobs right now that you realize have no heartbeat after reading this - drop a comment. I'm also curious whether anyone has a good pattern for monitoring jobs that are supposed to not run (maintenance windows, feature flags that disable background work). That one's trickier and I haven't landed on a clean solution.

How an expired SSL cert took down our checkout for six hours (and what I should have had watching)

SamReid — Fri, 22 May 2026 06:34:12 +0000

The site was "up." The monitor said so. HTTP 200, response times normal, no alerts.

What the monitor didn't know - what I didn't know - was that our SSL certificate had expired 87 minutes earlier and every user hitting the site was getting a certificate error in their browser. Not a down page. Not a 5xx. A cert error. The kind where browsers show a big red warning screen and most users immediately close the tab.

For a checkout flow, that's about as bad as the server being down. Worse, actually, because at least a down server triggers your uptime alert.

This is the post-mortem.

What happened

We were running Let's Encrypt with certbot and auto-renewal configured. The renewal was supposed to happen when the cert had 30 days left. It had been working fine for about 18 months.

Then it didn't.

The renewal job ran, hit a DNS validation error - our DNS provider had a 30-minute API hiccup that day - and failed silently. Certbot logged the failure, but nobody was watching certbot logs. The retry ran 12 hours later, same issue. Then it was fine. But by then, the "success" window had passed and the cert expired before the next attempt.

Let's Encrypt auto-renewal fails for reasons that feel random at the time:

DNS propagation delays when you're using DNS-01 validation and your DNS provider has latency
Rate limits - Let's Encrypt limits failed validations (5 per hostname, per account, per hour), so a burst of failures causes subsequent retries to also fail
Firewall or load balancer changes that block the HTTP-01 validation path on port 80
File permission issues on the cert directory after a system update
Webhook or deploy hook failures - the cert renews but the service doesn't reload to pick up the new cert

In our case it was DNS validation timing plus a log nobody was watching. The cert expired at 3:14 PM. The Slack alert - from a user, not a monitor - came in at 4:58 PM.

Why my uptime monitor missed it for four hours

GrabDiff monitors SSL expiry now, which is part of why I built it. But at the time I was using a basic HTTP ping monitor. Here's what it was doing:

Make HTTP request to our URL
Check for 200 response
Mark as healthy

The problem is step 1. The monitor was connecting via HTTP (port 80) and following the redirect to HTTPS. The redirect itself returned 301, healthy. Then the HTTPS request... also returned 200?

Sort of. The monitor wasn't validating the SSL certificate. It was making the HTTPS request with cert verification disabled, because false positives from cert issues in test environments made that the default in a lot of ping monitoring setups. So it dutifully checked the response code, got a 200 (from behind the expired cert that browsers were rejecting), and marked everything green.

Four hours of "everything is fine."

What proper SSL monitoring actually checks

SSL expiry monitoring should check a few distinct things:

1. Certificate expiry date - the obvious one. Get the cert's Not After field and alert at configurable thresholds. I alert at 30 days and 7 days. If you're using Let's Encrypt with 90-day certs, a 30-day warning gives you two full renewal windows to fix it.

2. Full-chain validation - not just that a cert exists, but that the entire chain from your cert to the root CA is valid. Intermediate cert issues cause browser errors even when your cert itself hasn't expired.

3. Cert actually served matches expected domain - if something went wrong with your load balancer config and it's serving the cert for a different domain, that's a browser error even with a valid cert.

4. Port 443 is actually accepting connections - a "port not open" situation is different from "cert expired" but both cause the same user-facing result.

5. The cert returned matches what's on disk - this catches the case where renewal succeeded but the service didn't reload and is still serving the old, expired cert.

A ping monitor does none of these. A lot of "SSL monitoring" tools only do #1, which misses the cases that actually catch you off guard.

What I'd do differently

Monitor the cert directly, not via HTTP. Connect to port 443, do the TLS handshake, and inspect the cert that's actually being served. Don't just check the expiry date - validate the chain.

Set alert thresholds that give you time to fix things manually. Let's Encrypt certs renew at 30 days remaining. I alert at 30 days (something's wrong with auto-renewal) and 7 days (it's still not fixed and now it's urgent). That gives me 23 days between "something's wrong" and "now panic."

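Watch the renewal process, not just the cert. Certbot logs failures, but a log nobody reads is not an alert. Set up a heartbeat instead - certbot's --deploy-hook runs only after a successful renewal, so it can ping a monitoring URL. If the heartbeat doesn't arrive within period + grace, alert. If the hook reloads the web server first and pings only when the reload succeeds, this catches the "cert renewed but didn't reload" case too.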

Test your renewal before it matters. certbot renew --dry-run in your staging environment, regularly. Not just once when you set it up.

The monitoring stack I run now

For SSL specifically: I use GrabDiff for the cert expiry checks - it connects directly to port 443, validates the full chain, and alerts at 30 days and 7 days with enough context to know what's actually wrong (expiry date, issuer, which check failed).

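For the renewal heartbeat: I have certbot's --deploy-hook send a ping to a GrabDiff heartbeat monitor after each successful renewal. If it doesn't ping within 93 days (the 90-day Let's Encrypt cert lifetime plus a few days of grace), I get alerted. A tighter window is safer, though: renewals normally land around day 60, so a window of roughly 60 days plus grace flags a silent renewal failure while the old cert is still valid, whereas 93 days only fires after it has expired.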

The six-hour checkout outage cost us - I'd rather not quantify it. The monitoring stack that would have caught it costs $9/month. That math is not complicated.

The broader lesson

SSL expiry is one of the most embarrassing categories of outage because it's entirely predictable. You know the cert will expire. You have the date. The only question is whether you catch it before your users do.

The same is true for domain expiry, for that matter. I've seen teams let their primary domain expire because the renewal email went to a former employee's address and nobody caught it. The monitoring there is trivial - check the WHOIS expiry date, alert 60 days out. But people don't do it until they have to learn the hard way.

If your current monitoring setup would have missed the scenario I described above - HTTP 200 from an expired-cert server - it's worth spending 20 minutes fixing that before you have your own version of this post-mortem to write.

I wrote this mostly to stop myself from having to explain this incident verbally ever again. Now I can just link it.

But seriously - SSL expiry outages are embarrassing in a specific way because they're so avoidable, and I've seen them happen to teams that clearly knew what they were doing otherwise. If you've had your own cert-expiry story (or a renewal failure that was weirder than mine), I'd like to hear it in the comments. Knowing the failure modes other people have hit is the only way to build a monitoring checklist that actually covers the real world.

The 5 things traditional uptime monitors miss (and how to catch them)

SamReid — Fri, 22 May 2026 06:31:48 +0000

Your uptime monitor is probably green right now. That doesn't mean everything is working.

HTTP ping monitors are good at one thing: checking whether your server responds. They're essentially useless for everything that happens after the response leaves your server - the JavaScript execution, the rendering, the CDN edge nodes, the client-side state that has to be right for your page to actually work.

I got tired of finding out about these failures from users instead of from my monitor. That's why I built GrabDiff - it takes actual screenshots of your pages, diffs them against a known-good baseline, and emails you the diff image when something looks off. Free plan, three monitors, no card.

But first, here are the five categories of failures I've seen (and caused) in production that an HTTP monitor will miss completely, plus how you actually catch them.

1. JavaScript crashes on load

This is the most common silent failure on modern web apps, and the one most developers underestimate.

Your server sends back valid HTML. HTTP 200, response time under 500ms, your monitor is happy. Then the client-side bundle executes. Somewhere in there - a null reference on a property that's undefined in some edge case, a third-party script that assumes something about the DOM that isn't true, an API response that came back in a shape the frontend didn't expect - an unhandled exception gets thrown. The page freezes. Or goes blank. Or renders halfway and stops.

From your monitor's perspective: everything is fine.

From your user's perspective: white screen.

What makes this nasty: JavaScript errors are often conditional. They affect logged-in users but not logged-out ones. They affect users on certain plans, with certain browser versions, with certain cookies or localStorage state. Your monitor is hitting the URL fresh, unauthenticated, with a clean browser - it's not in the affected cohort.

How to catch it: Visual monitoring - take a screenshot with a real headless browser and compare it against a known-good baseline. A blank page or partial render will show up immediately as a large pixel diff. Standard HTTP monitoring cannot catch this.

2. CDN serving stale or broken content

You fixed the bug. Deployed. Checked the origin. Everything looks correct. And then the Slack DMs start: users are still seeing the broken version.

CDN cache invalidation is notoriously unreliable. The failure modes include:

Purge API returned 200 but didn't actually purge - this happens more than vendors want to admit
Edge nodes in some regions updated, others didn't - your origin check hit one data center, users are hitting another
Cache-Control headers were wrong - a max-age=86400 header set during a period when things were broken means users get the broken version for up to 24 more hours
The CDN cached a redirect or an error page - your 503 error page from 45 minutes ago is still being served from cache, sometimes even with a 200 status code

Your HTTP monitor hits the origin directly, or hits a CDN edge node that happens to have fresh cache. Users are hitting different edges.

How to catch it: Monitor from multiple geographic locations, and monitor what the page looks like, not just what status code it returns. A CDN serving an old broken page will return HTTP 200 with content that doesn't match your current baseline. A status-code check can't see that discrepancy; a visual diff can.

3. React/Next.js hydration failures

Server-side rendering gives you the best of both worlds: fast initial paint from pre-rendered HTML, then full interactivity once the JavaScript loads and "hydrates" the DOM.

When hydration goes wrong, you get the worst of both worlds.

The server sends perfectly rendered HTML. Your monitor checks it, sees a 200, sees the content in the response body, marks it as healthy. The user's browser receives that HTML and renders it visually - the page looks fine. Then React tries to hydrate: match the server-rendered DOM against what the client-side bundle would have rendered, attach event listeners, take over control.

If there's a mismatch - different data, different component state, a prop that resolves differently on client vs. server - React throws a hydration error. Depending on how bad the mismatch is, the page might: silently fail and leave the page un-interactive, throw an error and remount (causing a flash and losing state), or crash entirely.

The user sees a page that looks correct but where buttons do nothing and forms don't submit.

How to catch it: Visual monitoring alone doesn't fully catch this one - a hydration failure might not visually change the page. What you really need here is headless browser monitoring that actually interacts with the page (clicks a button, submits a form) rather than just screenshotting it. But visual monitoring at least catches the cases where hydration failures cause visible layout breaks or blank sections.

4. Visual regressions from deploys

This one is subtle and often dismissed until it bites you.

You deployed a CSS change that seemed harmless. Or bumped a dependency. Or refactored a component. The page still loads, still returns 200, still has all the right content in the DOM. But something looks different - a font changed, a button moved, a section collapsed, a layout broke on certain viewport widths.

Maybe it's minor enough that you don't notice it in manual testing. Maybe it's only visible at certain screen sizes you didn't test. Maybe it's on a page that isn't part of your standard QA flow.

Users notice. Users get confused. Users don't convert. And nobody knows why conversion dropped 15% last Tuesday because it's not in any error log - it wasn't an error, it was just wrong.

How to catch it: This is exactly what visual diffing is built for. Take a screenshot before and after every deploy, compare them, and require a human to approve any visual change before it goes to production. This is what end-to-end visual testing tools like Percy do for CI, and what visual uptime monitoring does for production.

The key distinction: CI visual tests run on your test environment before deploy. Production visual monitoring catches the regressions that slip through - the ones that only appear with real data, real CDN behavior, or real third-party scripts.

5. Cron jobs and background workers silently dying

This one doesn't get talked about enough in the uptime monitoring context, because it's not about a web page being down - it's about a process that isn't running when it should be.

Your nightly data export job. Your email digest cron. Your subscription renewal checker. Your database backup task. These run in the background, they don't have HTTP endpoints to ping, and when they die - because of a deploy that changed an environment variable they depended on, a library update that broke a dependency, a database connection that started timing out - they die silently.

No alert. No log entry that anyone's watching. Just a job that was supposed to run at 3 AM and didn't.

You find out a week later when a customer asks why their export data is a week stale. Or when your database backup is missing and you need it.

How to catch it: Heartbeat monitoring. The pattern is: your cron job sends an HTTP ping to a monitoring endpoint at the end of each successful run. If the endpoint doesn't receive a ping within period + grace, it fires an alert. This inverts the monitoring model - instead of checking whether something is up, you're checking whether something ran.

# At the end of your cron job
curl -fsS "https://grabdiff.com/ping/your-monitor-slug" > /dev/null

If that ping doesn't arrive on schedule, you get alerted. It's simple, it's reliable, and it catches the entire class of "background job died silently" failures.

The full picture

A complete monitoring setup that actually catches production failures looks like this:

Check	Tool	Catches
Server responds	HTTP ping (Pingdom, UptimeRobot)	Server down, DNS broken, TLS expired (only if the monitor verifies certificates)
Page renders correctly	Visual screenshot monitor	JS crashes, blank pages, CDN stale cache, visual regressions
Cron jobs run	Heartbeat monitor	Silent background job failures
SSL/domain expiry	Certificate monitor	Expiring certs, domain renewals

You need all four layers. HTTP ping is necessary but covers maybe 40% of what actually goes wrong. Visual monitoring and heartbeat monitoring cover most of the rest.

What I built

After running into enough of these failures - mostly categories 1, 2, and 5 - I built GrabDiff to handle the visual monitoring and heartbeat pieces alongside standard uptime checks.

It screenshots your URLs on a schedule using headless Chrome, diffs them against a baseline, and sends you the diff image in an alert when something changes. It also handles heartbeat monitoring for cron jobs and background workers, and tracks SSL/domain expiry.

Free plan covers three monitors. If you're runnin

How to build a visual uptime monitor with Go and headless Chrome

SamReid — Fri, 22 May 2026 06:25:34 +0000

Most uptime monitors work by making an HTTP request and checking the response code. It's fast, cheap, and catches about half the things that actually go wrong in production.

The other half - JavaScript crashes, CDN serving stale cache, React hydration failures, missing elements - only show up when you look at what the page actually renders, not what the server returns.

This is a walkthrough of how I built GrabDiff - the visual monitoring piece specifically: capture screenshots with headless Chrome, diff them against a baseline, and send an alert when something looks wrong. I'll use Go and chromedp, which is what GrabDiff runs under the hood. If you'd rather just use the thing than build it, GrabDiff has a free plan with three monitors and no card required.

The architecture

The core loop is simple:

On a schedule, capture a screenshot of a URL using headless Chrome
Compare it pixel-by-pixel against a stored baseline image
If the diff percentage exceeds a threshold, send an alert with the diff image attached
Otherwise, store the new screenshot (optionally updating the baseline over time)

The interesting engineering is in steps 2 and 3 - getting the diff right and making alerts actionable.
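Before getting into the pieces, here is a minimal sketch of one pass through that loop. The Checker type and its function fields are hypothetical stand-ins for the capture, diff, storage, and alerting code, wired in as plain functions so the control flow can be read and tested on its own. The sketch leaves out archiving each new screenshot and treats a missing baseline as "first run".

```go
package main

import "fmt"

// Checker wires the steps together. The function fields are hypothetical
// stand-ins for the real capture, storage, diff, and alert code.
type Checker struct {
	Capture      func(url string) ([]byte, error)
	LoadBaseline func(id string) (img []byte, found bool, err error)
	SaveBaseline func(id string, img []byte) error
	DiffPercent  func(baseline, current []byte) (float64, error)
	Alert        func(id string, diffPercent float64) error
	Threshold    float64 // percent of pixels allowed to differ
}

// Check runs one pass for a monitor and reports whether an alert was sent.
func (c Checker) Check(id, url string) (alerted bool, err error) {
	shot, err := c.Capture(url)
	if err != nil {
		return false, fmt.Errorf("capture: %w", err)
	}
	base, found, err := c.LoadBaseline(id)
	if err != nil {
		return false, fmt.Errorf("load baseline: %w", err)
	}
	if !found {
		// First run: nothing to compare against, so this screenshot becomes the baseline.
		return false, c.SaveBaseline(id, shot)
	}
	pct, err := c.DiffPercent(base, shot)
	if err != nil {
		return false, fmt.Errorf("diff: %w", err)
	}
	if pct > c.Threshold {
		return true, c.Alert(id, pct)
	}
	return false, nil
}

func main() {
	store := map[string][]byte{"home": []byte("page-v1")}
	c := Checker{
		Capture:      func(url string) ([]byte, error) { return []byte("page-v2"), nil },
		LoadBaseline: func(id string) ([]byte, bool, error) { b, ok := store[id]; return b, ok, nil },
		SaveBaseline: func(id string, img []byte) error { store[id] = img; return nil },
		// Fake diff for the demo: 0% if the bytes match, 100% otherwise.
		DiffPercent: func(a, b []byte) (float64, error) {
			if string(a) == string(b) {
				return 0, nil
			}
			return 100, nil
		},
		Alert: func(id string, pct float64) error {
			fmt.Printf("ALERT %s: %.1f%% changed\n", id, pct)
			return nil
		},
		Threshold: 1.0,
	}
	alerted, err := c.Check("home", "https://example.com")
	fmt.Println(alerted, err)
}
```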

Capturing screenshots with chromedp

chromedp is a Go library that controls Chrome via the DevTools Protocol. It handles the browser lifecycle, navigation, and screenshot capture.

package screenshot

import (
    "context"
    "time"

    "github.com/chromedp/chromedp"
)

func Capture(url string) ([]byte, error) {
    opts := append(chromedp.DefaultExecAllocatorOptions[:],
        chromedp.Flag("headless", true),
        chromedp.Flag("disable-gpu", true),
        chromedp.Flag("no-sandbox", true),
        chromedp.WindowSize(1280, 800),
    )

    allocCtx, cancel := chromedp.NewExecAllocator(context.Background(), opts...)
    defer cancel()

    ctx, cancel := chromedp.NewContext(allocCtx)
    defer cancel()

    ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    var buf []byte
    err := chromedp.Run(ctx,
        chromedp.Navigate(url),
        chromedp.Sleep(2*time.Second), // wait for JS to settle
        chromedp.FullScreenshot(&buf, 90),
    )
    if err != nil {
        return nil, err
    }
    return buf, nil
}

A few things worth noting:

chromedp.Sleep(2*time.Second) is a blunt instrument but effective. For most pages, 2 seconds is enough for the JavaScript to execute and the page to reach a stable state. For pages with complex async data fetching you might need more, or you can use chromedp.WaitVisible to wait for a specific element.

chromedp.FullScreenshot captures the entire page height, not just the viewport. This is usually what you want for monitoring - you care about the whole page, not just what happens to be visible above the fold.

The 90 in FullScreenshot is JPEG quality. Passing 100 makes FullScreenshot produce a PNG instead (larger files, lossless). chromedp.CaptureScreenshot also returns a PNG, but only of the viewport, not the full page.

Pixel diffing

Once you have a screenshot, you need to compare it against the baseline. The core operation is straightforward: decode both images, iterate over pixels, count how many differ by more than some per-channel threshold.

package screenshot

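import (
    "bytes"
    "fmt"
    "image"
    "image/color"
    _ "image/jpeg"
    "image/png"
    "math"
)

type DiffResult struct {
    DiffPercent float64
    DiffImage   []byte // PNG with differences highlighted
}

func Diff(baseline, current []byte) (*DiffResult, error) {
    baseImg, _, err := image.Decode(bytes.NewReader(baseline))
    if err != nil {
        return nil, err
    }
    currImg, _, err := image.Decode(bytes.NewReader(current))
    if err != nil {
        return nil, err
    }

    bounds := baseImg.Bounds()
    // Full-page screenshots change size when the page height changes. At()
    // returns a zero color for out-of-bounds coordinates, which would skew
    // the diff, so a size mismatch is reported as an error instead.
    if currImg.Bounds() != bounds {
        return nil, fmt.Errorf("size mismatch: baseline %v, current %v", bounds, currImg.Bounds())
    }
    diffImg := image.NewRGBA(bounds)

    var diffPixels int
    totalPixels := bounds.Dx() * bounds.Dy()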

    for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
        for x := bounds.Min.X; x < bounds.Max.X; x++ {
            br, bg, bb, _ := baseImg.At(x, y).RGBA()
            cr, cg, cb, _ := currImg.At(x, y).RGBA()

            // RGBA() returns values in [0, 65535]
            dr := math.Abs(float64(br) - float64(cr))
            dg := math.Abs(float64(bg) - float64(cg))
            db := math.Abs(float64(bb) - float64(cb))

            // threshold: 10% channel difference (6553 out of 65535)
            if dr > 6553 || dg > 6553 || db > 6553 {
                diffPixels++
                // highlight in red
                diffImg.Set(x, y, color.RGBA{R: 255, G: 0, B: 0, A: 255})
            } else {
                // keep original, slightly dimmed for context
                r, g, b, a := currImg.At(x, y).RGBA()
                diffImg.Set(x, y, color.RGBA{
                    R: uint8(r>>8) / 2,
                    G: uint8(g>>8) / 2,
                    B: uint8(b>>8) / 2,
                    A: uint8(a >> 8),
                })
            }
        }
    }

    diffPercent := float64(diffPixels) / float64(totalPixels) * 100

    var buf bytes.Buffer
    if err := png.Encode(&buf, diffImg); err != nil {
        return nil, err
    }

    return &DiffResult{
        DiffPercent: diffPercent,
        DiffImage:   buf.Bytes(),
    }, nil
}

The diff image produced here shows changed pixels in red against a dimmed version of the current screenshot. This gives you at a glance where the change is - useful when you're trying to tell whether it's a minor layout shift or something more serious.

On threshold tuning: 1% is a good starting point for DiffPercent. Anything above that is almost certainly a real change. Below 0.1% is usually antialiasing noise. The right number depends on how dynamic your pages are.

Storing baselines

You need to store the baseline image somewhere. For a simple setup, an object store (S3, Backblaze B2, R2) works well - store the baseline under a key like {monitor_id}/baseline.jpg and update it when the user explicitly marks a new baseline.

// On first check, or when user resets the baseline
func (s *Store) SetBaseline(ctx context.Context, monitorID string, img []byte) error {
    key := fmt.Sprintf("%s/baseline.jpg", monitorID)
    return s.upload(ctx, key, img, "image/jpeg")
}

// On each check
func (s *Store) GetBaseline(ctx context.Context, monitorID string) ([]byte, error) {
    key := fmt.Sprintf("%s/baseline.jpg", monitorID)
    return s.download(ctx, key)
}

One design decision worth thinking about: should the baseline update automatically? There are arguments either way. If you update it automatically after every "clean" check, you adapt to intentional page changes without manual intervention, but every change that slips under your threshold accumulates silently, and you'll eventually be diffing against something that looks nothing like your original known-good state. If you require explicit resets, the baseline stays pinned to a state someone actually approved, at the cost of a manual step after every intentional change.

GrabDiff requires explicit baseline resets. The reasoning: if you're updating the baseline automatically, you can drift into a state where "clean" means "whatever the page looked like yesterday" rather than "the page as I intended it." Explicit resets keep you honest.

Alerting

When DiffPercent exceeds your threshold, you want to notify someone fast. The two most useful channels are email (with the diff image attached) and webhooks.

type Alert struct {
    MonitorURL  string
    DiffPercent float64
    DiffImage   []byte
    CheckedAt   time.Time
}

func (s *EmailSender) SendAlert(ctx context.Context, to string, a Alert) error {
    body := fmt.Sprintf(
        "Visual change detected on %s\n\nDiff: %.2f%% of pixels changed\nChecked at: %s\n\nSee attached diff image.",
        a.MonitorURL, a.DiffPercent, a.CheckedAt.Format(time.RFC1123),
    )

    msg := gomail.NewMessage()
    msg.SetHeader("From", s.from)
    msg.SetHeader("To", to)
    msg.SetHeader("Subject", fmt.Sprintf("[GrabDiff] Change detected: %s", a.MonitorURL))
    msg.SetBody("text/plain", body)
    msg.AttachReader("diff.png", bytes.NewReader(a.DiffImage))

    return s.dialer.DialAndSend(msg)
}

The diff image as an attachment is the key thing. An alert that just says "something changed" is nearly useless - you have to go look at the site yourself to know if it matters. An alert with a diff image attached tells you immediately whether this is "someone changed a button color" or "the entire main content area is gone."
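The webhook channel mentioned above has no example, so here is a minimal sketch of what one might look like. Everything in it is an assumption for illustration: the WebhookPayload shape, the JSON field names, and the PostWebhook helper are hypothetical, and the diff image is left out of the payload (a real implementation would need to link or embed it somehow).

```go
package main

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
)

// WebhookPayload is a hypothetical JSON body for a change alert.
type WebhookPayload struct {
    MonitorURL  string  `json:"monitor_url"`
    DiffPercent float64 `json:"diff_percent"`
    CheckedAt   string  `json:"checked_at"` // RFC 3339 timestamp
}

// Encode serializes the payload to JSON.
func (p WebhookPayload) Encode() ([]byte, error) {
    return json.Marshal(p)
}

// PostWebhook sends the payload to endpoint and treats any non-2xx
// response as a delivery failure.
func PostWebhook(ctx context.Context, client *http.Client, endpoint string, p WebhookPayload) error {
    body, err := p.Encode()
    if err != nil {
        return fmt.Errorf("encode payload: %w", err)
    }
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(body))
    if err != nil {
        return fmt.Errorf("build request: %w", err)
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := client.Do(req)
    if err != nil {
        return fmt.Errorf("send webhook: %w", err)
    }
    defer resp.Body.Close()

    if resp.StatusCode < 200 || resp.StatusCode > 299 {
        return fmt.Errorf("webhook returned %s", resp.Status)
    }
    return nil
}
```

Passing an http.Client with a Timeout set keeps a slow receiver from stalling the check loop.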

Scheduling

For the scheduling loop, a simple approach is a ticker per monitor:

func (w *Worker) Run(ctx context.Context, monitor Monitor) {
    ticker := time.NewTicker(monitor.Interval)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if err := w.check(ctx, monitor); err != nil {
                slog.Error("check failed", "monitor", monitor.ID, "err", err)
            }
        }
    }
}

For anything beyond a handful of monitors, you'll want a proper job queue (River, Asynq, or even a simple Postgres-backed queue) rather than in-process goroutines. In-process schedulers don't survive restarts gracefully and make horizontal scaling harder.

The tradeoffs you'll hit in production

False positives from dynamic content. Timestamps, "last updated" labels, live counters, personalized greetings - these change on every screenshot and will trigger alerts constantly. You either need to mask those regions before diffing, or accept a higher threshold that makes you insensitive to small changes everywhere.
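Masking can be as simple as painting the dynamic rectangles a solid color in both images before comparing them. The following is a sketch under assumptions: the Region type and the MaskRegions and whiteCanvas helpers are hypothetical names, and it assumes you already know the pixel coordinates of the areas to ignore.

```go
package main

import (
    "image"
    "image/color"
    "image/draw"
)

// Region is a rectangle, in pixel coordinates, to exclude from the diff.
type Region struct{ X, Y, W, H int }

// MaskRegions returns a copy of src with each region painted solid black.
// Apply it to both the baseline and the current screenshot, so the masked
// areas are identical in both and contribute no changed pixels.
func MaskRegions(src image.Image, regions []Region) *image.RGBA {
    out := image.NewRGBA(src.Bounds())
    draw.Draw(out, out.Bounds(), src, src.Bounds().Min, draw.Src)

    black := image.NewUniform(color.Black)
    for _, r := range regions {
        // Clip to the image so out-of-range regions are ignored safely.
        rect := image.Rect(r.X, r.Y, r.X+r.W, r.Y+r.H).Intersect(out.Bounds())
        draw.Draw(out, rect, black, image.Point{}, draw.Src)
    }
    return out
}

// whiteCanvas stands in for a decoded screenshot in examples.
func whiteCanvas(w, h int) *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, w, h))
    for i := range img.Pix {
        img.Pix[i] = 255 // opaque white
    }
    return img
}
```

For example, MaskRegions(whiteCanvas(4, 4), []Region{{X: 1, Y: 1, W: 2, H: 2}}) blacks out the middle four pixels and leaves the border white. The tradeoff is that a masked region can break completely without you noticing, so keep the masks as small as the dynamic content allows.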

Headless Chrome resource usage. A single Chrome instance capturing a screenshot uses roughly 200-400MB of RAM and takes 3-8 seconds depending on page complexity. If you're running hundreds of monitors at frequent intervals, you need a pool of browser instances and careful scheduling to avoid spiking resource usage.
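A full browser pool is more than fits here, but the concurrency cap at the heart of one is small. This sketch uses a buffered channel as a counting semaphore; CapturePool and its methods are hypothetical names, and it only limits how many captures run at once rather than managing Chrome processes.

```go
package main

import "context"

// CapturePool is a hypothetical limiter on concurrent browser captures.
type CapturePool struct{ slots chan struct{} }

// NewCapturePool allows at most max captures to run at the same time.
func NewCapturePool(max int) *CapturePool {
    return &CapturePool{slots: make(chan struct{}, max)}
}

// Acquire blocks until a slot is free or ctx is cancelled.
func (p *CapturePool) Acquire(ctx context.Context) error {
    select {
    case p.slots <- struct{}{}:
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}

// TryAcquire takes a slot only if one is free right now.
func (p *CapturePool) TryAcquire() bool {
    select {
    case p.slots <- struct{}{}:
        return true
    default:
        return false
    }
}

// Release returns a slot. Call it exactly once per successful acquire.
func (p *CapturePool) Release() { <-p.slots }
```

Using the RAM figures above, a cap of 8 would bound Chrome at very roughly 1.6-3.2GB. TryAcquire suits a ticker-driven loop that would rather skip a tick than queue behind a saturated pool; Acquire suits a job queue worker that should wait.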

Authentication. Monitoring pages behind a login requires scripting the auth flow:

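// Log in first, then navigate to the monitored page and capture it.
err := chromedp.Run(ctx,
    chromedp.Navigate("https://app.example.com/login"),
    chromedp.SendKeys(`input[name="email"]`, email),
    chromedp.SendKeys(`input[name="password"]`, password),
    chromedp.Click(`button[type="submit"]`),
    chromedp.WaitVisible(`#dashboard`), // blocks until the post-login page renders
    chromedp.Navigate(targetURL),
    chromedp.FullScreenshot(&buf, 90), // quality below 100 produces a JPEG
)
if err != nil {
    slog.Error("authenticated capture failed", "err", err)
}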

This works, but it's fragile - login flows change, CAPTCHAs appear, session handling has edge cases. Plan for maintenance.

SSRF. If you're accepting URLs from users, you need to validate them before passing them to Chrome. Users will point your monitor at http://169.254.169.254/latest/meta-data/ or internal network addresses. Resolve the hostname and check every resulting IP against loopback, RFC 1918, and link-local ranges before making any request.
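As a rough illustration of that IP check, the sketch below resolves a hostname and rejects it if any address is internal. The blockedIP and CheckHost names are hypothetical, and net.IP.IsPrivate requires Go 1.17 or later.

```go
package main

import (
    "fmt"
    "net"
)

// blockedIP reports whether ip is in a range a user-supplied URL
// should never be allowed to reach.
func blockedIP(ip net.IP) bool {
    return ip.IsLoopback() || // 127.0.0.0/8, ::1
        ip.IsPrivate() || // RFC 1918 plus IPv6 unique local addresses
        ip.IsLinkLocalUnicast() || // 169.254.0.0/16 (cloud metadata), fe80::/10
        ip.IsLinkLocalMulticast() ||
        ip.IsUnspecified() // 0.0.0.0, ::
}

// CheckHost resolves host and returns an error if any resulting
// address is blocked. IP literals are checked without a DNS query.
func CheckHost(host string) error {
    ips, err := net.LookupIP(host)
    if err != nil {
        return fmt.Errorf("resolve %q: %w", host, err)
    }
    for _, ip := range ips {
        if blockedIP(ip) {
            return fmt.Errorf("host %q resolves to blocked address %s", host, ip)
        }
    }
    return nil
}
```

A check like this runs before Chrome does its own DNS lookup, so on its own it leaves a window in which the name could resolve to something else the second time, and it does nothing about redirects. Treat it as one layer rather than the whole defense.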

What this gets you

A working version of the above - capture, diff, alert - is maybe 500 lines of Go. It'll catch blank pages, missing elements, major layout regressions, and CDN serving stale content. It won't catch every failure, but it catches the ones that HTTP monitors miss entirely.

If you want to run it yourself, the full approach is essentially what I described. If you'd rather not manage the Chrome instances and the storage and the scheduling, I built GrabDiff to handle all of that - it does the screenshot diffing, sends you the diff image in the alert, and handles SSL/domain/cron monitoring alongside the visual checks. Free plan, three monitors, no credit card.

The point is that "HTTP 200" and "the page works" are not the same thing, and the gap between them is where the interesting production failures live. Visual monitoring is how you close that gap.

Why your uptime monitor says everything's fine while users see a white screen

SamReid — Thu, 21 May 2026 01:36:09 +0000

It was 11:47 PM on a Thursday when the Slack messages started rolling in.

"Hey, the checkout page looks broken."

"Is the site down? I'm seeing a blank screen."

"Tried three different browsers, same thing."

I pulled up our uptime monitor. Green. Every check passing. Response times normal. HTTP 200 across the board. By every metric the monitor tracked, the site was healthy.

It was not healthy.

The checkout page - the single most important page in the entire application - was rendering a completely white screen for every user. Had been for about 40 minutes. Our monitoring had no idea.

That night cost us somewhere around four hours of investigation, a patch at 2 AM, and a post-mortem I'd rather forget. The root cause was a JavaScript runtime error triggered by a third-party A/B testing script that loaded asynchronously, after our monitor's simple HTTP check had already returned 200 and moved on.

The monitor was doing exactly what it was designed to do. That was the problem.

What a ping monitor actually checks

Let's be precise about what most uptime monitoring tools do, because I think a lot of developers have a fuzzy mental model here.

A basic uptime monitor makes an HTTP request to your URL. It checks:

Did the server respond?
Was the status code in the 2xx range?
Was the response time under some threshold?

That's it. Some monitors add a string match - "check that the response body contains the word 'homepage'" or whatever. That's slightly better. But not by much.
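To make that concrete, here is roughly what such a check amounts to in code. This is a sketch, not any particular product's implementation: the Result type and the Check and Healthy functions are hypothetical names, and the 1 MiB body limit is an arbitrary choice.

```go
package main

import (
    "context"
    "io"
    "net/http"
    "strings"
    "time"
)

// Result holds what a basic uptime check observes.
type Result struct {
    Responded  bool
    StatusCode int
    Elapsed    time.Duration
    BodyMatch  bool
}

// Check performs one GET and records the outcome. An empty mustContain
// disables the string match, because every string contains "".
func Check(ctx context.Context, client *http.Client, url, mustContain string) Result {
    start := time.Now()
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return Result{}
    }
    resp, err := client.Do(req)
    if err != nil {
        return Result{} // the server did not respond
    }
    defer resp.Body.Close()

    // Read at most 1 MiB of the body for the string match.
    body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
    return Result{
        Responded:  true,
        StatusCode: resp.StatusCode,
        Elapsed:    time.Since(start),
        BodyMatch:  err == nil && strings.Contains(string(body), mustContain),
    }
}

// Healthy applies the three criteria from the list plus the string match.
func (r Result) Healthy(maxLatency time.Duration) bool {
    return r.Responded &&
        r.StatusCode >= 200 && r.StatusCode <= 299 &&
        r.Elapsed < maxLatency &&
        r.BodyMatch
}
```

Nothing in Result knows whether any JavaScript ran or whether a single pixel was painted. A page that returns its HTML shell quickly with a 200 is Healthy by this definition no matter what the browser does with it afterwards.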

The thing is, your users don't care about any of that. They care whether the page works - whether the content they expect is visible, the buttons they need to click are there, and the thing they're trying to do is actually possible.

An HTTP 200 response with a blank body is still an HTTP 200. A page that returns your shell HTML but fails to mount any React components is still an HTTP 200. A cached CDN response that's three weeks stale is still an HTTP 200.

The failure modes nobody warns you about

Here are the actual production incidents I've seen (or personally caused) that a ping monitor will completely miss.

JavaScript crash on load

Your server renders the page and sends back valid HTML. But somewhere in the client-side bundle, there's an unhandled exception - a null reference, a failed import, an API that returned an unexpected shape. The page stays blank, or partially rendered, or stuck in a loading state. The server was fine. The HTTP response was fine. The user experience was not fine.

These are especially nasty because they often only affect certain browsers, certain screen sizes, or users in certain states (logged in vs. logged out, items in cart vs. empty cart).

CDN serving stale content

You deployed a fix. Your origin server is running the new code. But your CDN edge nodes are still serving the old, broken version to 80% of your users because cache invalidation didn't propagate correctly, or someone forgot to add a cache-busting header, or the CDN's purge API returned 200 but didn't actually do the thing.

Your monitor hits the origin. Clean. Users hit the CDN edge. Broken.

React/Next.js hydration failure

You're server-side rendering a React app. The server sends down HTML, which looks great in your monitor's response check. Then the client-side JavaScript tries to "hydrate" that HTML - attach event listeners, reconcile state - and something goes wrong. The page looks rendered, but nothing is interactive. Buttons don't respond. Forms won't submit. The hydration error is sitting in the browser console where nobody ever looks.

An A/B test or feature flag went sideways

Someone enabled a new variation in your experimentation platform. It works fine for 90% of users. For 10% - the variant group, or a specific segment, or users with a certain cookie - it injects a script that breaks layout, hides a critical element, or throws an error. Your monitor isn't in that segment. Your monitor sees the happy path.

The checkout button just... isn't there anymore

A CSS change. A template logic error. A component that conditionally renders based on some state that's slightly wrong. Your add-to-cart button, your checkout button, your submit form - just gone. The page looks fine at a glance. The hero image is there, the nav is there, the product photos are there. But the one element users actually need to convert? Missing.

A monitor checking for HTTP 200 and a response time under 2 seconds has no idea.

Why visual monitoring is conceptually different

The insight is simple: instead of asking "did the server respond?", ask "does the page look right?"

Visual monitoring takes a screenshot of your page - using a real browser, running real JavaScript, waiting for render to complete - and compares it against a known-good baseline. If the current screenshot differs from the baseline beyond some threshold, something has changed and you get an alert.

This catches all the scenarios above because it's evaluating the end result that users actually see, not an intermediate HTTP handshake that happens before any of the interesting rendering work occurs.

The diff approach is also useful beyond pure "is it broken" detection. It catches:

Layout shifts you didn't intentionally make
Text that changed in a way it shouldn't have (wrong price, wrong copy)
Elements that disappeared
New elements that appeared (a cookie banner blocking content, an error message you didn't know was showing)
Visual regressions from a deploy that seemed fine but quietly broke something

The key is the baseline. You take a screenshot when everything is known to be working, mark it as the baseline, and then every subsequent check compares against that. When the diff exceeds your threshold, alert.

The alert should include the diff image - not just "something changed," but a visual showing exactly what changed, highlighted. That's the thing that makes it immediately actionable instead of just alarming.

The practical gotchas

Visual monitoring isn't magic, and there are real tradeoffs.

Dynamic content. If your page shows the current time, a live stock price, or a user-specific greeting, those will show up as diffs on every single check. You need to either mask those regions, or structure your pages so dynamic content is in predictable locations you can exclude.

Rendering variability. Fonts rendering slightly differently on different machines, antialiasing differences, animated elements caught mid-transition - all of these can cause false positives if your diff threshold is too sensitive. You need a threshold that's high enough to ignore noise but low enough to catch real problems. Getting this calibrated takes some tuning.
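One common way to absorb that kind of noise, separate from the overall percentage threshold, is a per-pixel tolerance: a pixel only counts as changed if some color channel moved by more than a small amount. The sketch below assumes both screenshots are already decoded to *image.RGBA at the same size; DiffPercentWithTolerance and grayCanvas are hypothetical names, and a good tolerance value is something you would have to find by experiment.

```go
package main

import "image"

// DiffPercentWithTolerance returns the percentage (0-100) of pixels where
// any of R, G, or B differs by more than tolerance. Images with different
// bounds are treated as fully changed.
func DiffPercentWithTolerance(a, b *image.RGBA, tolerance uint8) float64 {
    bounds := a.Bounds()
    if bounds != b.Bounds() {
        return 100
    }
    if bounds.Empty() {
        return 0
    }
    changed := 0
    for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
        for x := bounds.Min.X; x < bounds.Max.X; x++ {
            p, q := a.RGBAAt(x, y), b.RGBAAt(x, y)
            if absDiff(p.R, q.R) > tolerance ||
                absDiff(p.G, q.G) > tolerance ||
                absDiff(p.B, q.B) > tolerance {
                changed++
            }
        }
    }
    return 100 * float64(changed) / float64(bounds.Dx()*bounds.Dy())
}

// absDiff returns |x - y| without unsigned underflow.
func absDiff(x, y uint8) uint8 {
    if x > y {
        return x - y
    }
    return y - x
}

// grayCanvas stands in for a decoded screenshot in examples.
func grayCanvas(w, h int, v uint8) *image.RGBA {
    img := image.NewRGBA(image.Rect(0, 0, w, h))
    for i := 0; i < len(img.Pix); i += 4 {
        img.Pix[i], img.Pix[i+1], img.Pix[i+2], img.Pix[i+3] = v, v, v, 255
    }
    return img
}
```

Two 10x10 canvases at gray levels 200 and 203 report 0% changed with a tolerance of 5, and 100% with a tolerance of 2. That is the calibration problem in miniature: the tolerance decides what counts as a changed pixel, and the percentage threshold decides how many changed pixels matter.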

Auth-gated pages. If you want to monitor pages behind a login, you need to handle authentication in your screenshot flow. This is doable - you can script login sequences - but it adds complexity.

Cost of headless Chrome. Running Chrome instances at scale is not free. It's memory-hungry, and taking screenshots takes real time compared to a simple HTTP request. A 1-minute check interval for many URLs is meaningfully more expensive to operate than a ping monitor.

These are all solvable problems, but they're real, and anyone selling you a visual monitoring solution that doesn't mention them is glossing over the hard parts.

Where this leaves you

I'm not saying throw out your existing uptime monitor. Ping monitors are cheap, fast, and useful for catching the simplest failures - server down, TLS expired, DNS broken. Keep them.

But treat them for what they are: a necessary but not sufficient condition for "the site is working." The gap between "the server responded with 200" and "users can actually use this page" is where a lot of production incidents live, and most teams don't find out about those incidents until users tell them.

After the Thursday night incident I described at the start, I spent a weekend investigating visual monitoring options and ended up down a rabbit hole of how headless Chrome screenshot diffing actually works. None of the existing tools felt quite right - either too expensive, too complex to configure, or not quite targeted at the "is this page visually broken" question.

So I built GrabDiff. It takes screenshots of your URLs on a schedule using headless Chrome, diffs them against a stored baseline, and emails you with the diff image attached when something looks off. Free plan covers three monitors, no credit card required.

Your uptime monitor is probably lying to you right now. Hopefully about something minor.