Why your monitoring is missing the dumbest outages

#devops #networking #security #systems

An expired cert took down a subdomain on a Saturday with zero alerts. Here is the bash sweep that catches the failures hiding in plain sight.

A TLS certificate on a random internal subdomain expired on a Saturday. We had great uptime checks on the main app, but nobody thought to monitor the auth subdomain.

No logs screamed, no exceptions fired, no deploys failed. The certificate just quietly hit its expiration date and stopped being valid. The first alert we got was a frustrated Slack message with a screenshot of Chrome's full-page security warning.

It was a completely stupid outage, which is exactly why it's worth talking about. As an industry we obsess over things that fail loudly. We set up complex alerting for 500s, OOM kills, and pod restarts. Then we completely ignore the infrastructure that fails by simply hitting a deadline.

The silent killers

There are three domain-level failures that produce zero errors in your application logs, both before and after they break.

TLS certificate expiry is the most common. Valid one second, hard fail the next. Browsers block the page, APIs reject the handshake, and your app looks like it's down even though the servers are running perfectly.

Domain registration expiry is the nightmare scenario. The actual domain name lapses, DNS stops resolving entirely, and you get to deal with registrar redemption fees and downtime measured in days instead of minutes.

DNS misconfiguration is the sneakiest. Maybe a record change hasn't reached every resolver yet, a CNAME is pointing to the wrong place, or a long TTL means caches are still handing an old IP address to a chunk of your users.

You have to go look at the dates and the records yourself. Your app has no idea these things exist.

The bash cheatsheet

Most of this uses tools you already have on your machine. I keep these in a scratchpad for whenever I spin up a new service or change a domain config.

Does the DNS even resolve?

dig +short A    example.com      # IPv4
dig +short AAAA example.com      # IPv6
dig +short      example.com NS   # nameservers
dig +short      example.com MX   # mail
dig +short      example.com TXT  # SPF / DKIM / verification records

If you just made a change and want to bypass your local cache to see if it propagated, hit a public resolver directly:

dig @1.1.1.1 +short example.com
dig @8.8.8.8 +short example.com
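
If you'd rather not eyeball two outputs, here is a minimal sketch that asks a few public resolvers for the same record and tells you whether they agree. The function names and the resolver list (I added 9.9.9.9) are my own choices for illustration, not anything standard.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the resolver list are illustrative assumptions.

# Print the answers one resolver returns, sorted and comma-joined on one line.
resolve_via() {
  local resolver=$1 name=$2 type=${3:-A}
  dig @"$resolver" +short "$type" "$name" | sort | paste -sd, -
}

# Print one "resolver answer" line per public resolver.
check_propagation() {
  local name=$1 type=${2:-A} r
  for r in 1.1.1.1 8.8.8.8 9.9.9.9; do
    printf '%s %s\n' "$r" "$(resolve_via "$r" "$name" "$type")"
  done
}

# Read "resolver answer" lines on stdin; succeed only if every answer is identical.
answers_agree() {
  awk '!seen[$2]++ { n++ } END { exit !(n <= 1) }'
}
```

Run it as `check_propagation example.com | tee /dev/stderr | answers_agree || echo "resolvers disagree"`. One caveat: round-robin or geo-aware DNS can legitimately hand different answers to different resolvers, so treat a mismatch as a prompt to look closer, not as proof something is broken.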

The TLS check that would have saved my weekend

echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null \
  | openssl x509 -noout -dates -subject -issuer

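You are looking for the notAfter= date. If you want to get fancy and actually calculate the days left, pipe it through some basic date math. Note that date -d is GNU syntax, so on macOS or BSD you will need gdate from coreutils or a different way to parse the date: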

exp=$(echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null \
  | openssl x509 -noout -enddate | cut -d= -f2)
echo "$(( ( $(date -d "$exp" +%s) - $(date +%s) ) / 86400 )) days left"
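
The same math is easier to reuse as a pair of functions. This is a sketch, assuming GNU date; the names days_until and cert_days_left are mine. Passing the current time as an optional second argument keeps the arithmetic easy to check by hand.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU date (date -d). Function names are illustrative.

# Convert an openssl notAfter string into whole days remaining.
# $2 is an optional "now" in epoch seconds; it defaults to the current time.
days_until() {
  local expiry=$1 now=${2:-$(date +%s)} exp_epoch
  exp_epoch=$(date -d "$expiry" +%s) || return 1
  # Integer division truncates toward zero, so a cert that expired
  # less than a day ago reports 0 rather than -1.
  echo $(( (exp_epoch - now) / 86400 ))
}

# Fetch the leaf certificate for host[:port] and print the days left on it.
cert_days_left() {
  local host=$1 port=${2:-443} exp
  exp=$(openssl s_client -servername "$host" -connect "$host:$port" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
  [ -n "$exp" ] || return 1   # handshake failed or no certificate came back
  days_until "$exp"
}
```

Usage is just `cert_days_left auth.example.com`. If all you need is a yes or no, `openssl x509 -noout -checkend 2592000` exits non-zero when the cert expires within the next 30 days, and it skips the date parsing entirely.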

The WHOIS registration check

whois example.com | grep -iE 'expir|registrar|status'

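Look for the expiry date (labeled Registry Expiry Date on .com and .net; other registries use different labels) and the status codes like clientTransferProhibited. Once this date passes, the domain typically stops resolving, and if it sits through the grace and redemption periods you lose it.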
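
WHOIS output is not standardized, so pulling the date out takes a bit of pattern matching. Here is a rough sketch that grabs the first line that looks like an expiry date. The function name is made up, and the pattern only covers the "Expiry Date" and "Expiration Date" labels, so expect to adjust it for registries that word it differently.

```shell
#!/usr/bin/env bash
# Sketch only: extract_expiry is an illustrative name, and the label pattern
# is an assumption that will not match every registry's WHOIS format.

# Read whois output on stdin and print the value of the first expiry-date line.
extract_expiry() {
  tr -d '\r' \
    | grep -iE 'expir(y|ation) date:' \
    | head -n 1 \
    | sed -E 's/^[^:]*:[[:space:]]*//'   # drop the label, keep the timestamp
}
```

Use it as `whois example.com | extract_expiry`. An empty result means the pattern missed, not that the domain never expires, so check the raw output before trusting it.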

Turning one-liners into actual safety

Running these manually finds today's problem. It doesn't prevent next month's problem.

You really need to take that openssl snippet and wrap it in a basic cron job or set up a check with the Prometheus blackbox exporter. Have it ping your Slack channel at 30 days, 14 days, and 7 days out. For the domain registration, just turn on auto-renew at your registrar and put the actual expiration date on your calendar as a fallback.
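
As a starting point, the cron version might look something like this. It is a sketch under a few assumptions: GNU date, a Slack incoming webhook whose URL lives in a SLACK_WEBHOOK_URL environment variable (my placeholder name), and a daily schedule. I also chose to alert every day once you are inside the final week, on top of the 30 and 14 day pings.

```shell
#!/usr/bin/env bash
# Sketch only: SLACK_WEBHOOK_URL, the function names, and the alert policy
# are illustrative assumptions. Assumes GNU date and a once-a-day cron run.

# Succeed on the days a daily run should post: exactly 30 and 14 days out,
# then every day from 7 days out onward (including after expiry).
should_alert() {
  local days=$1
  case $days in
    30|14) return 0 ;;
  esac
  [ "$days" -le 7 ]
}

# Post a plain-text message to a Slack incoming webhook.
# The message is dropped into JSON unescaped, so keep quotes out of it.
notify_slack() {
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"$1\"}" \
    "${SLACK_WEBHOOK_URL:?set SLACK_WEBHOOK_URL first}" >/dev/null
}

# Check one host and alert if its certificate is close to expiry.
check_host() {
  local host=$1 exp days
  exp=$(openssl s_client -servername "$host" -connect "$host:443" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
  if [ -z "$exp" ]; then
    notify_slack "Could not read the TLS certificate for $host"
    return 1
  fi
  days=$(( ( $(date -d "$exp" +%s) - $(date +%s) ) / 86400 ))
  if should_alert "$days"; then
    notify_slack "TLS certificate for $host expires in $days days"
  fi
}

# Hosts come from the command line; with no arguments this does nothing.
for host in "$@"; do
  check_host "$host"
done
```

Save it somewhere like /usr/local/bin/cert-check.sh (a hypothetical path) and schedule it with a crontab line such as `0 9 * * * SLACK_WEBHOOK_URL=... /usr/local/bin/cert-check.sh example.com auth.example.com`. The important part is that every subdomain goes on that list, not just the main app.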

Running these three commands manually is fine when you are troubleshooting or spinning something up. But I got tired of juggling terminal tabs and parsing openssl output just to ask a simple question. I ended up building a single-page scratchpad that glues this data together so I don't have to run the commands on a Sunday morning. It pulls the WHOIS registration data, DNS records, and SSL cert transparency info into one view at astound.tools. It's strictly for the ad-hoc "is this domain healthy?" check, not a replacement for actual monitoring and alerting.

Outages from expired certs and lapsed domains are the most embarrassing things to explain in a post-mortem because the fix is literally just looking at a calendar.