How an expired SSL cert took down our checkout for six hours (and what I should have had watching)

#webdev #security #devops #monitoring

The site was "up." The monitor said so. HTTP 200, response times normal, no alerts.

What the monitor didn't know - what I didn't know - was that our SSL certificate had expired 87 minutes earlier and every user hitting the site was getting a certificate error in their browser. Not a down page. Not a 5xx. A cert error. The kind where browsers show a big red warning screen and most users immediately close the tab.

For a checkout flow, that's about as bad as the server being down. Worse, actually, because at least a down server triggers your uptime alert.

This is the post-mortem.

What happened

We were running Let's Encrypt with certbot and auto-renewal configured. The renewal was supposed to happen when the cert had 30 days left. It had been working fine for about 18 months.

Then it didn't.

The renewal job ran, hit a DNS validation error - our DNS provider had a 30-minute API hiccup that day - and failed without alerting anyone. Certbot logged the failure, but nobody was watching the certbot logs. The retry ran 12 hours later and hit the same issue. Then the DNS API was fine again. But by then the window for a successful renewal had passed, and the cert expired before the next attempt.

Let's Encrypt auto-renewal fails for reasons that feel random at the time:

DNS propagation delays when you're using DNS-01 validation and your DNS provider has latency
Rate limits - Let's Encrypt limits failed validations (5 per account, per hostname, per hour), so retries that come too soon after a run of failures get rejected as well
Firewall or load balancer changes that block the HTTP-01 validation path on port 80
File permission issues on the cert directory after a system update
Webhook or deploy hook failures - the cert renews but the service doesn't reload to pick up the new cert

In our case it was DNS validation timing plus a log nobody was watching. The cert expired at 3:14 PM. The Slack alert - from a user, not a monitor - came in at 4:58 PM.
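One way to turn a logged failure into an alert is to wrap the renewal command and check its exit status. This is a minimal sketch, assuming `certbot renew` exits non-zero when a renewal attempt fails; `notify` is a hypothetical placeholder for whatever alerting channel you already have.

```python
import subprocess
import sys
from typing import Callable, Sequence


def notify(message: str) -> None:
    """Hypothetical alert sink; swap in Slack, email, or a pager call."""
    print(f"ALERT: {message}", file=sys.stderr)


def run_renewal(
    command: Sequence[str] = ("certbot", "renew", "--quiet"),
    alert: Callable[[str], None] = notify,
) -> bool:
    """Run the renewal command and alert if it exits non-zero."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Keep only the tail of the output so the alert stays readable.
        tail = (result.stderr or result.stdout).strip()[-500:]
        alert(f"{' '.join(command)} exited {result.returncode}: {tail}")
        return False
    return True
```

Calling `run_renewal()` from cron or a systemd timer instead of invoking certbot directly means a failed attempt pages someone the same day rather than sitting in a log file.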

Why my uptime monitor missed it for four hours

GrabDiff monitors SSL expiry now, which is part of why I built it. But at the time I was using a basic HTTP ping monitor. Here's what it was doing:

1. Make HTTP request to our URL
2. Check for 200 response
3. Mark as healthy

The problem is step 1. The monitor was connecting via HTTP (port 80) and following the redirect to HTTPS. The redirect itself returned 301, healthy. Then the HTTPS request... also returned 200?

Sort of. The monitor wasn't validating the SSL certificate. It was making the HTTPS request with cert verification disabled, because false positives from cert issues in test environments made that the default in a lot of ping monitoring setups. So it dutifully checked the response code, got a 200 (from behind the expired cert that browsers were rejecting), and marked everything green.
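The difference between the two behaviours is a couple of lines of client configuration. Here is a rough sketch using Python's standard library; `build_context` and `status_code` are illustrative names, not what any particular monitor uses internally.

```python
import ssl
import urllib.request


def build_context(verify: bool = True) -> ssl.SSLContext:
    """Return a TLS context that either validates certificates or skips validation."""
    context = ssl.create_default_context()  # verifies chain and hostname by default
    if not verify:
        # What a lenient ping monitor effectively does: accept any certificate.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def status_code(url: str, verify: bool = True, timeout: float = 10.0) -> int:
    """Fetch url and return the HTTP status.

    With verify=True an expired or mismatched cert raises urllib.error.URLError
    instead of returning a status code.
    """
    with urllib.request.urlopen(url, timeout=timeout, context=build_context(verify)) as resp:
        return resp.status
```

With `verify=False`, a server behind an expired cert still answers 200, which is exactly the green checkmark described above. With `verify=True`, the same request fails before any status code exists.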

Four hours of "everything is fine."

What proper SSL monitoring actually checks

SSL expiry monitoring should check a few distinct things:

1. Certificate expiry date - the obvious one. Get the cert's Not After field and alert at configurable thresholds. I alert at 30 days and 7 days. If you're using Let's Encrypt with 90-day certs, the 30-day warning lines up with the point where auto-renewal should already have kicked in, so it fires when renewal is stuck and still leaves a month to fix it.

2. Full-chain validation - not just that a cert exists, but that the entire chain from your cert to the root CA is valid. Intermediate cert issues cause browser errors even when your cert itself hasn't expired.

3. Cert actually served matches expected domain - if something went wrong with your load balancer config and it's serving the cert for a different domain, that's a browser error even with a valid cert.

4. Port 443 is actually accepting connections - a "port not open" situation is different from "cert expired" but both cause the same user-facing result.

5. The cert returned matches what's on disk - this catches the case where renewal succeeded but the service didn't reload and is still serving the old, expired cert.

A ping monitor does none of these. A lot of "SSL monitoring" tools only do #1, which misses the cases that actually catch you off guard.
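Check #5 is the one that is least often automated, so here is a sketch of it: compare the SHA-256 fingerprint of the leaf cert the server presents with the first cert in the file on disk. The function names are illustrative, and the assumption is a fullchain-style PEM file with the leaf first (for certbot, typically `fullchain.pem` under `/etc/letsencrypt/live/<your-domain>/`).

```python
import hashlib
import ssl

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"
PEM_END = "-----END CERTIFICATE-----"


def leaf_fingerprint(pem_text: str) -> str:
    """SHA-256 fingerprint of the first certificate in a PEM bundle."""
    # fullchain-style files list the leaf first, then the intermediates.
    start = pem_text.index(PEM_BEGIN)
    end = pem_text.index(PEM_END) + len(PEM_END)
    der = ssl.PEM_cert_to_DER_cert(pem_text[start:end])
    return hashlib.sha256(der).hexdigest()


def served_matches_disk(host: str, disk_path: str, port: int = 443) -> bool:
    """Compare the cert the server presents with the one stored at disk_path."""
    # get_server_certificate fetches the cert without validating it,
    # which is fine here because only the bytes are being compared.
    served_pem = ssl.get_server_certificate((host, port))
    with open(disk_path, encoding="ascii") as handle:
        disk_pem = handle.read()
    return leaf_fingerprint(served_pem) == leaf_fingerprint(disk_pem)
```

A mismatch right after a renewal usually means the service has not reloaded yet. This check has to run on the host that holds the cert files, so it complements an external probe rather than replacing it.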

What I'd do differently

Monitor the cert directly, not via HTTP. Connect to port 443, do the TLS handshake, and inspect the cert that's actually being served. Don't just check the expiry date - validate the chain.
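A direct probe along those lines might look like the sketch below. It uses only the standard library, validates the chain and hostname during the handshake, and separates "port closed" from "cert rejected". The labels and the result shape are my own choices for illustration.

```python
import socket
import ssl


def classify_error(exc: BaseException) -> str:
    """Map a connection failure to a coarse label for alerting."""
    # Order matters: the ssl errors are subclasses of OSError.
    if isinstance(exc, ssl.SSLCertVerificationError):
        return "cert_invalid"   # expired, broken chain, or hostname mismatch
    if isinstance(exc, ssl.SSLError):
        return "tls_error"      # handshake failed for another reason
    if isinstance(exc, ConnectionRefusedError):
        return "port_closed"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    return "network_error"


def probe(host: str, port: int = 443, timeout: float = 10.0) -> dict:
    """Handshake with full validation and report what the server presented."""
    context = ssl.create_default_context()  # validates chain and hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except OSError as exc:
        return {"ok": False, "problem": classify_error(exc), "detail": str(exc)}
    return {"ok": True, "not_after": cert["notAfter"], "issuer": cert["issuer"]}
```

The `detail` string carries the verifier's own message (for example that the certificate has expired), which is the context you want in the alert.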

Set alert thresholds that give you time to fix things manually. Let's Encrypt certs renew at 30 days remaining. I alert at 30 days (something's wrong with auto-renewal) and 7 days (it's still not fixed and now it's urgent). That gives me 23 days between "something's wrong" and "now panic."
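The two thresholds translate into very little code. This sketch assumes the `notAfter` string format that Python's `SSLSocket.getpeercert()` returns; the severity names are illustrative.

```python
import ssl
from datetime import datetime, timezone
from typing import Optional

WARN_DAYS = 30      # auto-renewal should have run by now
CRITICAL_DAYS = 7   # still not fixed, now urgent


def days_remaining(not_after: str, now: Optional[datetime] = None) -> float:
    """Days until expiry, given notAfter as returned by getpeercert()."""
    now = now or datetime.now(timezone.utc)
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch
    return (expires - now.timestamp()) / 86400


def severity(days: float) -> str:
    """Classify remaining lifetime against the two alert thresholds."""
    if days <= 0:
        return "expired"
    if days <= CRITICAL_DAYS:
        return "critical"
    if days <= WARN_DAYS:
        return "warning"
    return "ok"
```

A healthy setup can briefly touch the warning threshold until the next scheduled renewal run, so one option is to page only when the warning persists for a day.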

Watch the renewal logs. Not the SSL cert itself, but the renewal process. Set up a heartbeat - certbot's --deploy-hook can ping a monitoring URL on successful renewal. If the heartbeat doesn't arrive within period + grace, alert. On its own the ping only proves the cert was renewed. If the hook also reloads the web server and pings only after the reload succeeds, it catches the "cert renewed but didn't reload" case too.
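The heartbeat only says something about the reload if the hook does the reload itself and pings afterwards. Here is a sketch of such a hook script. The heartbeat URL and the reload command are placeholders I made up for illustration, and `RENEWED_DOMAINS` is the environment variable certbot exports to deploy hooks.

```python
import os
import subprocess
import urllib.request
from typing import Callable, Sequence

# Both values are placeholders for illustration, not real endpoints or services.
HEARTBEAT_URL = "https://monitoring.example.com/ping/certbot-renewal"
RELOAD_COMMAND = ("systemctl", "reload", "nginx")


def ping(url: str) -> None:
    """Send the heartbeat; any HTTP client would do."""
    with urllib.request.urlopen(url, timeout=10):
        pass


def deploy_hook(
    reload_command: Sequence[str] = RELOAD_COMMAND,
    send_ping: Callable[[str], None] = ping,
    url: str = HEARTBEAT_URL,
) -> bool:
    """Reload the web server, then ping only if the reload succeeded."""
    if subprocess.run(reload_command).returncode != 0:
        return False  # no heartbeat, so the monitor alerts once the window lapses
    # certbot exports RENEWED_DOMAINS to deploy hooks; printed here for context.
    print("renewed:", os.environ.get("RENEWED_DOMAINS", "(unknown)"))
    send_ping(url)
    return True
```

Saved as a script that calls `deploy_hook()` at the bottom, this can be passed to `certbot renew --deploy-hook`. A failed reload then looks the same to the monitor as a renewal that never happened, which is the desired outcome.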

Test your renewal before it matters. certbot renew --dry-run in your staging environment, regularly. Not just once when you set it up.

The monitoring stack I run now

For SSL specifically: I use GrabDiff for the cert expiry checks - it connects directly to port 443, validates the full chain, and alerts at 30 days and 7 days with enough context to know what's actually wrong (expiry date, issuer, which check failed).

For the renewal heartbeat: I have certbot's --deploy-hook send a ping to a GrabDiff heartbeat monitor after each successful renewal. Renewing a 90-day cert at 30 days remaining means successful renewals land roughly 60 days apart, so if no ping arrives within 67 days (that interval plus a week of grace), I get alerted with about three weeks of validity left. That catches the silent renewal failures before they become a problem.
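It is worth checking how much certificate lifetime is left at the moment a missed heartbeat would alert. Below is a small sketch of that arithmetic, assuming 90-day certs, renewal at 30 days remaining, and a heartbeat sent at each issuance; the names are illustrative.

```python
from datetime import datetime, timedelta, timezone

CERT_LIFETIME = timedelta(days=90)   # 90-day certs, as in the post
RENEW_BEFORE = timedelta(days=30)    # renewal point used in the post
GRACE = timedelta(days=7)
WINDOW = CERT_LIFETIME - RENEW_BEFORE + GRACE  # 67 days


def heartbeat_overdue(last_ping: datetime, now: datetime,
                      window: timedelta = WINDOW) -> bool:
    """True once more than `window` has passed since the last successful renewal."""
    return now - last_ping > window


def days_left_when_alert_fires(window: timedelta = WINDOW,
                               lifetime: timedelta = CERT_LIFETIME) -> int:
    """Cert validity remaining when the alert fires; negative means already expired."""
    return (lifetime - window).days
```

With the 67-day window the alert fires with 23 days of validity left. Any window longer than the cert lifetime gives a negative result, meaning the heartbeat alert would arrive after users are already seeing the browser warning.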

The six-hour checkout outage cost us - I'd rather not quantify it. The monitoring stack that would have caught it costs $9/month. That math is not complicated.

The broader lesson

SSL expiry is one of the most embarrassing categories of outage because it's entirely predictable. You know the cert will expire. You have the date. The only question is whether you catch it before your users do.

The same is true for domain expiry, for that matter. I've seen teams let their primary domain expire because the renewal email went to a former employee's address and nobody caught it. The monitoring there is trivial - check the WHOIS expiry date, alert 60 days out. But people don't do it until they have to learn the hard way.
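For the domain side, a rough sketch of the WHOIS check follows. It assumes a registry whose WHOIS output includes a "Registry Expiry Date" line in ISO format, as the .com/.net registry server does. Other TLDs use different servers and field names, so treat the default server and the regex as assumptions to adjust.

```python
import re
import socket
from datetime import datetime, timezone
from typing import Optional

EXPIRY_LINE = re.compile(r"^\s*Registry Expiry Date:\s*(\S+)",
                         re.MULTILINE | re.IGNORECASE)


def fetch_whois(domain: str, server: str = "whois.verisign-grs.com") -> str:
    """Plain WHOIS query: send the name, read until the server closes."""
    with socket.create_connection((server, 43), timeout=10) as sock:
        sock.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while data := sock.recv(4096):
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def domain_days_remaining(whois_text: str,
                          now: Optional[datetime] = None) -> Optional[float]:
    """Days until the registry expiry date, or None if the field is absent."""
    match = EXPIRY_LINE.search(whois_text)
    if match is None:
        return None
    expires = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
    expires = expires.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400
```

Alerting when `domain_days_remaining(fetch_whois("yourdomain.com"))` drops below 60 matches the threshold above. A `None` result deserves its own alert, since it means the check itself has stopped working.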

If your current monitoring setup would have missed the scenario I described above - HTTP 200 from an expired-cert server - it's worth spending 20 minutes fixing that before you have your own version of this post-mortem to write.

I wrote this mostly to stop myself from having to explain this incident verbally ever again. Now I can just link it.

But seriously - SSL expiry outages are embarrassing in a specific way because they're so avoidable, and I've seen them happen to teams that clearly knew what they were doing otherwise. If you've had your own cert-expiry story (or a renewal failure that was weirder than mine), I'd like to hear it in the comments. Knowing the failure modes other people have hit is the only way to build a monitoring checklist that actually covers the real world.