Every monitoring tool on the market will tell you when your site is down. None of them will tell you when your site returns HTTP 200 OK but serves a broken experience.
We built Sitewatch to close that gap. This post covers the detection engineering behind it — what we check, how we confirm failures, and why traditional uptime monitoring has a fundamental blind spot.
The Problem: 200 OK Is Not "OK"
A standard uptime monitor sends an HTTP request, gets a 200 status code, and marks your site as healthy. But a 200 response tells you almost nothing about whether your site actually works.
Here are real failure patterns that return 200 OK:
- A Vercel deploy changes your bundle hash. main.a4f2c.js now 404s, but the HTML document still returns 200. Your app shell loads. Nothing renders.
- Cloudflare serves app.js with Content-Type: text/html instead of application/javascript. Chrome silently blocks execution. The page loads — with zero interactivity.
- A WordPress plugin misconfigures redirects. /checkout redirects to /checkout redirects to /checkout. The browser gives up. Your uptime tool never follows the chain.
- After an AWS migration, your domain resolves to a decommissioned server. It serves stale content from six months ago. HTTP 200. Green dashboard.
Every one of these is invisible to a ping-based monitor. Every one of them is a real incident that affects users.
Our Detection Model: 11 Rules Across 3 Categories
We don't just check "is the server responding." We run 11 detection rules across three categories:
Asset Integrity
This is where most silent failures live.
ASSET_MISSING — We check whether critical JS, CSS, and image assets return valid responses. After a deploy, bundle filenames often change (hashed filenames like main.a4f2c.js). If the HTML references an asset that now 404s, 403s, or 5xxs, the page is broken even though the document loaded fine.
ASSET_MIME_MISMATCH — We verify that the Content-Type header matches what the browser expects. A JavaScript file served as text/html will be silently blocked by the browser's MIME type checking. The page loads. The script never executes. No error in the server logs.
We use HEAD requests to check status codes and MIME types — no browser overhead, no JavaScript execution cost. If HEAD fails or returns ambiguous results, we fall back to GET.
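The two asset rules above boil down to a decision over the response metadata. Here's a minimal sketch of that decision as a pure function — the rule names match the post, but the code and the MIME table are illustrative, not Sitewatch's actual implementation:

```python
# Map file extensions to the Content-Type values a browser will accept.
# This table is an illustrative assumption, not an exhaustive list.
EXPECTED_MIME = {
    ".js": {"application/javascript", "text/javascript"},
    ".css": {"text/css"},
}

def classify_asset(ext: str, status: int, content_type: str) -> list[str]:
    """Apply ASSET_MISSING and ASSET_MIME_MISMATCH to one asset response."""
    issues = []
    if status >= 400:
        issues.append("ASSET_MISSING")
    # Strip parameters like "; charset=utf-8" before comparing.
    mime = content_type.split(";")[0].strip().lower()
    expected = EXPECTED_MIME.get(ext)
    if expected and mime not in expected:
        issues.append("ASSET_MIME_MISMATCH")
    return issues

# A JS bundle served as HTML: the request succeeds, the browser blocks it.
print(classify_asset(".js", 200, "text/html; charset=utf-8"))
# A hashed bundle that 404s after a deploy.
print(classify_asset(".js", 404, "text/html"))
```

Keeping the rule logic separate from the HEAD/GET fetching makes it trivial to test against recorded responses.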
Routing & Resolution
REDIRECT_LOOP — We follow redirect chains and detect circular patterns before ERR_TOO_MANY_REDIRECTS kills the request. We log the full chain with status codes and final destination.
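One way to detect a circular chain before the browser's ERR_TOO_MANY_REDIRECTS would fire is to follow Location headers manually and stop on the first repeated URL. A sketch, where `fetch_location` stands in for an HTTP call and is an assumption for illustration:

```python
def walk_redirects(start_url, fetch_location, max_hops=10):
    """Return (chain, looped): the visited URLs and whether a cycle was found.

    fetch_location(url) should return the Location header of the response,
    or None when the URL is a final (non-redirect) destination.
    """
    chain = [start_url]
    seen = {start_url}
    url = start_url
    for _ in range(max_hops):
        nxt = fetch_location(url)
        if nxt is None:
            return chain, False         # chain terminated normally
        chain.append(nxt)
        if nxt in seen:                 # revisiting a URL => REDIRECT_LOOP
            return chain, True
        seen.add(nxt)
        url = nxt
    return chain, True                  # exceeded max hops: treat as loop-like

# The /checkout -> /checkout example from earlier in the post:
redirects = {"/checkout": "/checkout"}
print(walk_redirects("/checkout", redirects.get))
```

Logging the full `chain` gives you exactly the evidence described above: every hop, in order, plus the point where the cycle closes.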
HOST_DRIFT — We track the resolved IP and server headers over time. If your domain suddenly resolves to a different origin — common after DNS migrations, CDN changes, or failover misconfigurations — we flag it and compare the content fingerprint against the known baseline.
NON_HTML_PAGE — If a URL that should serve a web page returns JSON ({"error":"unauthorized"}), XML, or an empty body, we catch it. This happens more than you'd think with API gateways sitting in front of web apps.
Availability & Performance
UNAVAILABLE — The classic check. 5xx responses, timeouts, connection refused. We still do this — it's just not the only thing we do.
Plus broken link checking (up to 50 links per page), API endpoint monitoring, and response structure validation.
Confirmation: The 2-of-3 Retry Model
False positives are the fastest way to lose trust in a monitoring tool. If your team gets woken up at 3am for a transient CDN glitch that resolved itself in 8 seconds, they stop trusting alerts. Then they miss the real ones.
We use a 2-of-3 retry confirmation model:
- First check detects an issue.
- We immediately re-check twice more.
- The issue must fail on at least 2 of 3 consecutive checks to become an incident.
A single transient failure — a momentary CDN hiccup, a network blip, a slow edge propagation — never triggers an alert. This eliminates the noise without adding dangerous delay. The entire confirmation cycle completes within the check interval.
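The confirmation logic above can be sketched in a few lines: run up to three checks and promote to an incident only if at least two fail. Here `check` is any callable returning True on failure, an assumption for illustration; the early exits are one way to avoid redundant re-checks once the outcome is decided:

```python
def confirm_incident(check, runs: int = 3, threshold: int = 2) -> bool:
    """2-of-3 confirmation: True only if `check` fails on >= threshold of runs."""
    failures = 0
    for i in range(runs):
        if check():
            failures += 1
        if failures >= threshold:
            return True                         # confirmed: no need to re-check
        remaining = runs - 1 - i
        if failures + remaining < threshold:
            return False                        # can no longer reach threshold
    return failures >= threshold
```

A transient blip (one failure, then two passes) returns False and never alerts; a persistent failure confirms after the second failed check.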
For multi-region checks, this runs independently per region. A CDN edge failure in EU doesn't need to reproduce in US to become an incident — but it does need to reproduce within its own region.
Fingerprinting: One Problem, One Incident
When an asset breaks, every check that hits that asset will detect the same failure. Without deduplication, a broken JS bundle could generate dozens of alerts over a few hours.
We generate a SHA-256 fingerprint for each detected issue — combining the failure type, affected URL, and relevant metadata. If an active incident already exists with the same fingerprint, we don't create a new one. We also enforce a 30-minute per-incident cooldown on alert dispatches.
The result: 12 consecutive checks detecting the same broken asset → 1 alert, not 12.
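A minimal sketch of the fingerprint-and-dedup idea: a SHA-256 over the failure type, URL, and stable metadata, plus an active-incident set. The field choices and the in-memory `ACTIVE` set are assumptions for illustration, not the production design:

```python
import hashlib
import json

ACTIVE: set[str] = set()   # fingerprints of currently open incidents

def fingerprint(failure_type: str, url: str, metadata: dict) -> str:
    # sort_keys makes the hash stable regardless of dict insertion order.
    payload = json.dumps([failure_type, url, metadata], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def open_incident(failure_type: str, url: str, metadata: dict) -> bool:
    """Return True if a new incident was created, False if deduplicated."""
    fp = fingerprint(failure_type, url, metadata)
    if fp in ACTIVE:
        return False       # same problem, same incident: no new alert
    ACTIVE.add(fp)
    return True
```

Twelve consecutive detections of the same broken asset produce one True followed by eleven Falses: one incident, one alert.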
Root Cause Classification
Detecting a failure is half the problem. The other half is telling you why it broke.
We classify every incident into one of 10 cause families across three domains:
Infrastructure: DNS resolution failure, origin server error, SSL/TLS certificate issue, network connectivity.
Application: Deployment artifact missing, application error, configuration drift.
Content Delivery: CDN cache misconfiguration, third-party dependency failure, redirect misconfiguration.
Classification uses evidence from the check data — HTTP headers, response content, redirect behavior, MIME types, IP resolution, and fingerprint diffs. We assign a confidence score up to 90%. A high-confidence diagnosis (70%+) means the evidence strongly matches a known failure pattern.
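One simple shape for evidence-based classification: each cause family gets a set of evidence predicates, and confidence scales with how many match, capped at 90%. The two families and their predicates below are illustrative assumptions, not Sitewatch's real rules:

```python
# Each cause family maps to predicates over an evidence dict.
RULES = {
    "deployment_artifact_missing": [
        lambda e: e.get("asset_status") == 404,
        lambda e: e.get("asset_hashed") is True,     # hashed bundle filename
        lambda e: e.get("document_status") == 200,   # page itself loaded fine
    ],
    "redirect_misconfiguration": [
        lambda e: e.get("redirect_loop") is True,
        lambda e: e.get("redirect_hops", 0) > 5,
    ],
}

def classify(evidence: dict) -> tuple[str, int]:
    """Return the best-matching cause family and a confidence capped at 90%."""
    best, best_conf = "unknown", 0
    for family, predicates in RULES.items():
        hits = sum(1 for p in predicates if p(evidence))
        conf = min(90, int(90 * hits / len(predicates)))
        if conf > best_conf:
            best, best_conf = family, conf
    return best, best_conf
```

With all three deployment predicates matching, the diagnosis comes back at the 90% cap; partial evidence yields a proportionally lower score.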
Stack-Aware Context
Knowing what broke and why is useful. Knowing how to fix it for your specific stack is what actually saves time.
We detect 23 tech stacks by analyzing HTTP response headers (X-Powered-By, Server, X-Vercel-Id), HTML meta tags, script patterns, and asset URL structures. Detection happens automatically with every check — no configuration required.
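The header side of stack detection can be sketched as a signature table matched against the response. The three signatures below are examples only; the full matcher also inspects meta tags, script patterns, and asset URL structures:

```python
# Each stack maps to a predicate over lowercased response headers.
SIGNATURES = {
    "Vercel": lambda h: "x-vercel-id" in h,
    "Netlify": lambda h: h.get("server", "").lower().startswith("netlify"),
    "Express": lambda h: "express" in h.get("x-powered-by", "").lower(),
}

def detect_stacks(headers: dict) -> list[str]:
    """Return every stack whose header signature matches (keys case-insensitive)."""
    h = {k.lower(): v for k, v in headers.items()}
    return [name for name, matches in SIGNATURES.items() if matches(h)]
```

Because the signatures only read headers the origin already sends, detection costs nothing beyond the check request itself, which is why it can run on every check with no configuration.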
When an incident fires, the fix guidance is tailored to your stack. A "deployment artifact missing" incident on Vercel gets different remediation steps than the same failure on Netlify or a self-hosted Nginx setup.
Multi-Region: Catching Regional Divergence
CDN failures are often regional. Your site might work perfectly from US-East while EU users get stale cached assets from a failed purge.
Multi-region checks run the full detection suite from multiple geographic points. Each region runs its own independent 2-of-3 retry confirmation. We compare results across regions to distinguish between:
- Global outage — all regions failing
- Regional divergence — specific edges serving different content, stale caches, or geo-routing sending traffic to wrong origins
This catches CDN edge divergence, geo-routing misconfigurations, and regional failover failures — problems that are invisible if you only check from one location.
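The cross-region comparison reduces to a small classification over per-region results. In this sketch each region reports a confirmed pass/fail plus a content fingerprint; the three labels come from the post, while the data shape is an assumption:

```python
def classify_regions(results: dict[str, dict]) -> str:
    """results: region -> {"failed": bool, "fingerprint": str}."""
    failed = [region for region, r in results.items() if r["failed"]]
    if len(failed) == len(results):
        return "GLOBAL_OUTAGE"          # every region failing
    fingerprints = {r["fingerprint"] for r in results.values()}
    if failed or len(fingerprints) > 1:
        # Some edges failing, or edges serving different content (stale
        # caches, geo-routing to the wrong origin).
        return "REGIONAL_DIVERGENCE"
    return "HEALTHY"
```

For example, US passing with fingerprint "a" while EU fails or serves fingerprint "b" is flagged as regional divergence rather than a global outage.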
What We Don't Do
We don't run a headless browser. We don't execute JavaScript. We don't render the page and take screenshots.
This is a deliberate choice. Browser-based monitoring is slow, expensive, and fragile. It catches visual regressions — which matters — but it's a different problem from "your assets are broken and your site is non-functional."
Our checks are HTTP-based. They're fast, cheap to run at high frequency, and they catch the structural failures that actually take sites down in production. Visual regression testing is complementary but separate.
The Architecture in Summary
Check cycle:
1. Fetch document (GET)
2. Verify status, content-type, body structure
3. Extract and verify critical assets (HEAD requests)
4. Follow redirect chains
5. Compare fingerprints against baseline
6. Run detection rules (11 rules)
7. 2-of-3 retry on failures
8. Classify root cause (10 families, up to 90% confidence)
9. Match tech stack (23 stacks)
10. Generate incident with evidence + fix guidance
11. Deduplicate and dispatch alerts
All of this runs every 5–30 minutes depending on plan, from multiple regions, without requiring any code changes or agent installation on your infrastructure.
Try It
Sitewatch has a free tier — 1 site, 30-minute checks, no credit card. If you manage multiple sites and want to catch the failures your current monitoring misses, that's what we built it for.
If you have questions about the detection approach or want to dig into specific failure patterns, drop a comment — happy to go deeper on any of this.