I ran deep checks on 50 production sites. 23 of them had silent failures their uptime monitors missed. Here's every failure type, why it happens, and how to check for it yourself.
1. Expired or Misconfigured SSL Certificates (8 sites)
The most common failure — and the most avoidable.
Eight sites had SSL issues. Three had certificates that had expired within the last 30 days. Two had incomplete certificate chains (missing intermediate CA), which meant the site worked in Chrome on desktop but threw security warnings in Safari and on mobile devices. The remaining three had certificates that didn't match the domain — likely from a server migration or load balancer change where someone forgot to update the cert.
Why uptime monitors miss this: Most ping-based monitors hit the IP address or follow the first redirect. They don't validate the full certificate chain, check expiry, or verify that the certificate matches the domain. The server responds with 200, so the check passes.
How common is this? According to Keyfactor's 2024 PKI report, 88% of companies experienced an unplanned outage due to an expired certificate in the past two years. Microsoft Teams went down for three hours in 2020 because someone forgot to renew an auth certificate. Epic Games had a 5.5-hour outage from an expired wildcard cert.
How to check yourself:
# Check certificate expiry date
echo | openssl s_client -servername yoursite.com -connect yoursite.com:443 2>/dev/null | openssl x509 -noout -dates
# Check certificate chain completeness
echo | openssl s_client -servername yoursite.com -connect yoursite.com:443 2>/dev/null | grep -E "verify return|depth"
# Check if cert matches the domain
echo | openssl s_client -servername yoursite.com -connect yoursite.com:443 2>/dev/null | openssl x509 -noout -subject -issuer
If `notAfter` is in the past, your visitors are seeing browser warnings. If the chain verification fails, mobile users are getting blocked.
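The expiry check above can also be turned into an early warning with `openssl x509 -checkend`, which exits non-zero when the certificate expires within a given window. A minimal sketch; `yoursite.com` and the 14-day threshold are placeholders to adapt:

```shell
# Alert ahead of expiry instead of after: -checkend fails if the cert
# will expire within the given number of seconds. 14 days is an arbitrary
# threshold -- tune it to your renewal cadence.
DAYS=14
echo | openssl s_client -servername yoursite.com -connect yoursite.com:443 2>/dev/null \
  | openssl x509 -noout -checkend $(( DAYS * 86400 )) \
  || echo "WARNING: certificate for yoursite.com expires within $DAYS days"
```

Drop this into a daily cron job and you hear about the expiry two weeks before your visitors do.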
2. Stale CDN Content After Deploys (5 sites)
Five sites served HTML or assets that were one or more deployments old. The page loaded. It looked fine to a quick glance. But it was yesterday's version — including a known bug that the team thought they'd already fixed.
How it happens: The origin server has the new content, but CDN edge nodes are serving cached versions. Cache invalidation either failed, wasn't triggered, or hasn't propagated to all edge locations. This is particularly common with Cloudflare, CloudFront, and Vercel's edge network when cache-control headers aren't set correctly.
How nasty is it? A 2023 Catchpoint study of 150+ major e-commerce sites found that over 70% of "mysterious" production bugs reported by users were actually stale or inconsistent cached content. The website was working — but the CDN was serving yesterday's reality. As recently as December 2025, a Shopify merchant reported their CDN was serving mismatched product images in search results for weeks.
One of the sites I checked was serving HTML from 3 deploys ago to users in Europe, while US users saw the current version. The team had no idea because they were all based in the US.
How to check yourself:
# Generate a SHA-256 fingerprint of your page content
curl -s https://yoursite.com | sha256sum
# Compare with a different region (if you have a VPN)
# Or run it again after a deploy — if the hash doesn't change,
# your CDN is serving stale content
# Check cache headers to see if CDN is even invalidating
curl -sI https://yoursite.com | grep -iE "cache-control|x-cache|age|cf-cache"
If the `age` header shows a high number after a fresh deploy, or `x-cache` says `HIT` when you expected new content — your users aren't seeing what you shipped.
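If you know your origin server's address, you can diff the edge response against the origin directly: curl's `--resolve` pins the hostname to an IP, skipping DNS and therefore the CDN. A sketch, assuming your origin serves TLS for the hostname; `ORIGIN_IP` is a placeholder:

```shell
# Hash the page as served by the CDN edge, then as served by the origin
# directly (--resolve maps the hostname to a fixed IP for this request).
# ORIGIN_IP is a placeholder for your origin server's address.
ORIGIN_IP=203.0.113.10
edge=$(curl -s https://yoursite.com | sha256sum | awk '{print $1}')
origin=$(curl -s --resolve yoursite.com:443:"$ORIGIN_IP" https://yoursite.com | sha256sum | awk '{print $1}')
if [ "$edge" = "$origin" ]; then
  echo "edge matches origin"
else
  echo "STALE: edge content differs from origin"
fi
```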
3. JavaScript Bundles That 404'd (4 sites)
Four sites had their HTML load fine (200 OK), but the JavaScript files it referenced no longer existed on the server.
How it happens: A deployment pushes new HTML that references app.3f8a2c.js, but the CDN edge node still serves the old HTML pointing to app.9d1b4e.js — which was deleted in the new release. This is the evil cousin of stale CDN content: the HTML is stale, but the assets it points to have been purged.
The result: a blank white page. Or a page that looks half-loaded — the header renders from server-side HTML, but nothing interactive works because the client-side JavaScript never boots.
This is extremely common with SPAs. Netlify's support forums have hundreds of threads about it. AWS Amplify has a known issue where Next.js static pages stop working after a deploy because JavaScript files return 404. It happens any time users have cached HTML from a previous deploy that references JS bundles that no longer exist.
How to check yourself:
# Fetch the HTML and check every script source
curl -s https://yoursite.com | grep -oP 'src="\K[^"]*\.js(?=")' | while read -r url; do
  # Resolve relative paths against the site root
  if [[ "$url" != http* ]]; then
    url="https://yoursite.com$url"
  fi
  status=$(curl -o /dev/null -s -w "%{http_code}" "$url")
  echo "$status $url"
done
If any of those return 404, your users are seeing a broken page while your uptime monitor shows green.
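The same loop works for stylesheets: a page whose CSS 404s renders as raw unstyled HTML while the monitor still sees 200. A sketch using the same placeholder domain:

```shell
# Every <link href="...css"> in the HTML should also return 200.
curl -s https://yoursite.com \
  | grep -oP 'href="\K[^"]*\.css(?=")' \
  | while read -r url; do
      # Resolve relative paths against the site root (naive; ignores <base>)
      [[ "$url" != http* ]] && url="https://yoursite.com$url"
      echo "$(curl -o /dev/null -s -w '%{http_code}' "$url") $url"
    done
```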
4. Redirect Chains and Loops (3 sites)
Three sites had redirect chains — sequences of 301/302 redirects that either took too many hops or looped entirely.
The worst example: A site redirected HTTP → HTTPS → www → non-www → HTTPS again, creating a loop. Every visitor hit this loop. The browser gives up after 20 redirects and shows ERR_TOO_MANY_REDIRECTS. The uptime monitor? It only followed the first redirect, got a 301, and marked it as "up."
How it happens: Conflicting redirect rules between the web server (nginx/Apache), the CDN (Cloudflare page rules), and the application (.htaccess or middleware). Someone adds a "force HTTPS" rule in Cloudflare and in nginx, or a "redirect to www" rule in the CDN that conflicts with a "redirect to non-www" rule in the application.
How to check yourself:
# Follow all redirects and show the chain
curl -sIL https://yoursite.com 2>&1 | grep -iE "^(HTTP/|location:)"
If you see more than 2 redirects, something is misconfigured. If you see the same URL appear twice, you have a loop.
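curl can also count the hops for you: `%{num_redirects}` and `%{url_effective}` are standard `-w` write-out variables, and `--max-redirs` makes a loop fail fast instead of bouncing forever. Same placeholder domain:

```shell
# Follow redirects (capped at 10), then report hop count, final URL, and status.
# curl exits non-zero (code 47) if the chain exceeds --max-redirs -- a likely loop.
curl -s -o /dev/null -L --max-redirs 10 \
  -w 'hops=%{num_redirects} final=%{url_effective} status=%{http_code}\n' \
  https://yoursite.com \
  || echo "redirect chain exceeded 10 hops: probable loop"
```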
5. WordPress Plugin Conflicts / White Screen of Death (2 sites)
Two WordPress sites rendered a completely blank page — the infamous White Screen of Death (WSOD). The server returned 200 OK with an empty HTML body. Both had recently updated plugins.
How common is this? It's one of the most searched WordPress errors on the internet. It happens when a plugin update introduces a PHP fatal error, exceeds the server's memory limit, or conflicts with another plugin or the active theme. The server doesn't crash — PHP catches the fatal error and returns an empty response with a 200 status code.
The specific cases I found: One site had updated a page builder plugin (Elementor) that conflicted with a custom theme. The other had a WooCommerce extension that exceeded the PHP memory limit after an update. Both sites had been showing a blank page for over 48 hours without the site owners knowing — because their uptime monitor said the server was responding fine.
How to check yourself:
# Check if the page has any actual content
content_length=$(curl -s https://yoursite.com | wc -c)
echo "Response body: $content_length bytes"
# If this is under 500 bytes for a full page, something is very wrong
# Check for PHP errors in the response headers
curl -sI https://yoursite.com | grep -iE "x-php|x-powered|content-length"
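A blank 200 can also be caught structurally: a healthy page closes its `</html>` tag and contains strings you know should always be there. A sketch; `wp-content` is just an assumed marker, substitute any string your real page always renders:

```shell
# Fetch once, then make two cheap structural checks on the body.
body=$(curl -s https://yoursite.com)
echo "$body" | grep -q '</html>' \
  || echo "WARNING: body never closes </html> -- possible White Screen of Death"
# "wp-content" is an assumed marker; use any string your page always contains.
echo "$body" | grep -q 'wp-content' \
  || echo "WARNING: expected content marker missing from response"
```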
6. Mixed Content Blocking (1 site)
One site served its main page over HTTPS but loaded images and a font file over HTTP. Modern browsers block mixed content silently from the user's perspective — no error page, nothing visible outside the developer console. The images just don't load, and the font falls back to a system font.
How it happens: Usually after an SSL migration. The site moves to HTTPS, but some hardcoded http:// references remain in the database (common in WordPress), in a template file, or in a third-party embed. It can also happen when a CDN or image hosting service changes its URL scheme.
How to check yourself:
# Find HTTP resources on an HTTPS page
curl -s https://yoursite.com | grep -oP '(src|href)="http://[^"]*"'
If that returns anything, those resources are being blocked by the browser and your visitors aren't seeing them.
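The src/href grep misses two common carriers of mixed content: `srcset` attributes and inline CSS `url(...)` values. A quick follow-up sweep, same placeholder domain:

```shell
# Catch http:// references in srcset attributes and inline url(...) values,
# which the plain src/href grep does not match.
curl -s https://yoursite.com | grep -oE 'srcset="[^"]*http://[^"]*"'
curl -s https://yoursite.com | grep -oE 'url\([^)]*http://[^)]*\)'
```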
The Pattern
Here's what all 23 failures had in common:
The server was up. The HTTP status code was 200. The uptime monitor said green.
Traditional uptime monitoring checks one thing: did the server respond? That's a useful signal — but it's the minimum signal. It tells you the infrastructure is alive. It tells you nothing about whether the application works.
And the numbers back this up: by some estimates, 85% of website bugs are first detected by users, not by monitoring tools. Not because monitoring doesn't exist — but because it's checking the wrong thing.
The gap between "server responds" and "site works for users" is where all 23 of these failures lived. And it's exactly the gap that most teams don't monitor.
What Real Monitoring Looks Like
After running this audit, the checks that actually caught problems were:
- SSL certificate validation — Is the cert valid, unexpired, chain complete, and matching the domain? Not just "does the connection work."
- Asset integrity checks — Does every script and stylesheet referenced in the HTML actually load with the correct status code?
- Content fingerprinting — Is the content what you expect, or is the CDN serving a stale version? A SHA-256 hash of the response body catches silent content drift.
- Redirect chain resolution — Does the URL resolve cleanly in ≤2 hops, or does it loop?
- Response body validation — Is the page actually rendering content, or is it a 200 OK with an empty body?
- Multi-region checks — A site can be broken in Frankfurt but fine in Virginia. Single-location checks miss regional CDN failures entirely.
This is what I built Sitewatch to do — run all of these checks continuously, from multiple regions, and alert when something breaks at the application layer, not just the infrastructure layer. Free for 1 site, no credit card.
Quick Audit Checklist
You don't need a tool to start. Run these against any production site right now:
- Check SSL certificate expiry and chain with `openssl s_client`
- Verify every `<script>` and `<link>` resource returns 200
- Follow the full redirect chain with `curl -sIL` — max 2 hops, no loops
- Hash the response body with `sha256sum` before and after a deploy
- Check response body size — under 500 bytes means something is wrong
- Repeat from a different geographic location
If any check fails — you just found a bug your uptime monitor has been ignoring.
Have you found similar silent failures in production? Drop a comment — I'm collecting these for a follow-up post.