AlertSleep

Posted on Apr 17

How We Handle SSL Certificate Expiration Alerts at Scale

It was a Tuesday morning in June 2021. LinkedIn — a platform used daily by hundreds of millions of professionals — went partially down. Not because of a DDoS attack, a bad deploy, or a database failure. Their SSL certificate had expired.

The issue was resolved within hours, but the damage was done: broken links, frustrated users, and a very public reminder that one of the most preventable failures in infrastructure still catches well-resourced engineering teams off guard. LinkedIn was not alone. Microsoft Teams suffered a similar SSL expiry incident in 2020. Spotify has had certificate-related hiccups. Even government sites regularly show up in outage reports because of expired certs.

If it can happen to them, it can happen to you.

What SSL Certificates Actually Are (and Why They Expire)

An SSL/TLS certificate is a cryptographically signed document that proves your server is who it says it is. It binds your domain name to a public key, and a trusted Certificate Authority (CA) vouches for that binding.

There are three main validation levels:

DV (Domain Validation) — Cheapest and fastest. CA only verifies you control the domain. Used by most personal sites and small services. 90-day Let's Encrypt certs fall here.
OV (Organization Validation) — CA verifies the organization's legal existence. Common for company sites.
EV (Extended Validation) — Strictest vetting. Used by banks and payment platforms.

Historically, SSL certificates were issued for terms of a year or more. In 2020, Apple, Google, and Mozilla enforced a hard cap of 398 days for certificates trusted in their browsers. By then, Let's Encrypt had already popularized 90-day certificates, arguing that shorter lifespans reduce the damage window if a certificate is compromised.

The result: certificates expire faster than ever, and the margin for error is shrinking.

Why Manual Tracking Fails

When a team has two or three certificates, the spreadsheet approach works fine. Someone adds a row, sets a calendar reminder, done.

Then the company grows. Suddenly you have:

A wildcard cert for *.yourdomain.com
A separate cert for api.yourdomain.com managed by a different team
A staging cert someone set up and forgot about
A cert for a third-party integration endpoint you technically own
Let's Encrypt auto-renew that "should be working" but nobody has verified in six months

The spreadsheet becomes stale. Calendar reminders get snoozed. The person who set up the cert leaves the company. Auto-renewal fails silently because the DNS challenge no longer resolves correctly after a migration.

This is not a people problem. It is a systems problem. Manual tracking does not scale.

The Alert Timeline That Actually Works

After dealing with enough SSL-related incidents, many SRE teams settle on a tiered alerting strategy along these lines:

Days Until Expiry	Alert Type	Who Gets Notified
60 days	Awareness ping	Primary engineer / infra team
30 days	Action required	Team lead + primary engineer
14 days	Escalation	Manager + entire team
7 days	All-hands	Engineering leadership
1 day	Emergency	PagerDuty / on-call rotation

The 60-day notification is intentionally low-urgency. It gives the responsible party time to renew without pressure. By the time you hit 7 days, something has already gone wrong in your process — the earlier alerts were missed or ignored. The 1-day alert should be treated like a production incident.

The key insight: alert early enough that the first notification is never urgent. If your team is routinely panicking at 7 days or fewer, your alert window is too short.
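The table above can be expressed as a small lookup. This is a minimal sketch, not a standard implementation: the names `ALERT_TIERS` and `alert_tier` are hypothetical, and it assumes each row means "at or below this many days remaining".

```python
from typing import Optional, Tuple

# Thresholds mirror the alert timeline table: (days, alert type, audience).
# Ordered from most to least urgent so the first match wins.
ALERT_TIERS = [
    (1, "Emergency", "PagerDuty / on-call rotation"),
    (7, "All-hands", "Engineering leadership"),
    (14, "Escalation", "Manager + entire team"),
    (30, "Action required", "Team lead + primary engineer"),
    (60, "Awareness ping", "Primary engineer / infra team"),
]


def alert_tier(days_remaining: int) -> Optional[Tuple[str, str]]:
    """Return (alert type, audience) for the most urgent tier reached, or None."""
    for threshold, alert_type, audience in ALERT_TIERS:
        if days_remaining <= threshold:
            return alert_type, audience
    return None  # more than 60 days left: nothing to send
```

A certificate with 10 days left lands in the 14-day "Escalation" tier, and anything already expired (zero or negative days) is treated as an emergency.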

Checking SSL Expiry: Code Examples

Using openssl CLI

# Check expiry date for a domain
echo | openssl s_client -connect yourdomain.com:443 -servername yourdomain.com 2>/dev/null \
  | openssl x509 -noout -dates

# Output:
# notBefore=Jan  1 00:00:00 2025 GMT
# notAfter=Mar 31 23:59:59 2025 GMT

To get the number of days remaining:

EXPIRY=$(echo | openssl s_client -connect yourdomain.com:443 -servername yourdomain.com 2>/dev/null \
  | openssl x509 -noout -enddate | cut -d= -f2)

EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s 2>/dev/null || date -jf "%b %d %T %Y %Z" "$EXPIRY" +%s)
NOW_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))

echo "Days until expiry: $DAYS_LEFT"

Using Node.js

const tls = require('tls');

function checkSSLExpiry(hostname, port = 443) {
  return new Promise((resolve, reject) => {
    const socket = tls.connect({ host: hostname, port, servername: hostname }, () => {
      const cert = socket.getPeerCertificate();
      socket.end();

      const expiryDate = new Date(cert.valid_to);
      const now = new Date();
      const daysRemaining = Math.floor((expiryDate - now) / (1000 * 60 * 60 * 24));

      resolve({ hostname, expiryDate, daysRemaining });
    });

    socket.on('error', reject);
  });
}

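checkSSLExpiry('yourdomain.com')
  .then(info => {
    console.log(`${info.hostname}: ${info.daysRemaining} days remaining`);

    if (info.daysRemaining <= 7) {
      console.error('CRITICAL: Certificate expires in 7 days or less!');
    } else if (info.daysRemaining <= 30) {
      console.warn('WARNING: Certificate expires soon.');
    }
  })
  .catch(err => {
    // Connection failures and certificates that fail verification
    // (including already-expired ones) reject the promise.
    console.error(`SSL check failed: ${err.message}`);
  });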

Using Python

import ssl
import socket
from datetime import datetime, timezone

def check_ssl_expiry(hostname: str, port: int = 443) -> dict:
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as ssock:
            cert = ssock.getpeercert()

    expiry_str = cert['notAfter']
    expiry_date = datetime.strptime(expiry_str, '%b %d %H:%M:%S %Y %Z').replace(tzinfo=timezone.utc)
    days_remaining = (expiry_date - datetime.now(tz=timezone.utc)).days

    return {'hostname': hostname, 'days_remaining': days_remaining}

result = check_ssl_expiry('yourdomain.com')
print(f"{result['hostname']}: {result['days_remaining']} days remaining")
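One caveat with the Python example: `ssl.create_default_context()` verifies the certificate during the handshake, so a certificate that has already expired raises `ssl.SSLCertVerificationError` instead of returning a negative day count. A monitoring loop probably wants to tell that case apart from a host that is simply unreachable. The wrapper below is a sketch under that assumption; `safe_check` and its status strings are hypothetical names, and the checker is passed in so any function with the same shape as `check_ssl_expiry` can be used.

```python
import ssl
from typing import Callable


def safe_check(hostname: str, checker: Callable[[str], dict]) -> dict:
    """Run a certificate check and classify failures instead of raising."""
    try:
        result = checker(hostname)
        return {
            "hostname": hostname,
            "status": "ok",
            "days_remaining": result["days_remaining"],
        }
    except ssl.SSLCertVerificationError as exc:
        # Expired, self-signed, or hostname-mismatched certificates end up here.
        return {"hostname": hostname, "status": "cert_invalid", "detail": str(exc)}
    except OSError as exc:
        # DNS failures, refused connections, and timeouts are all OSError subclasses.
        return {"hostname": hostname, "status": "unreachable", "detail": str(exc)}
```

Usage would look like `safe_check('yourdomain.com', check_ssl_expiry)`. A `cert_invalid` result deserves the same urgency as a 1-day alert, since browsers are likely already rejecting the certificate.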

Automating Checks with a Cron Job

A simple cron-based approach for teams managing a small number of domains:

#!/bin/bash
# /usr/local/bin/check-ssl-certs.sh

DOMAINS=("yourdomain.com" "api.yourdomain.com" "dashboard.yourdomain.com")
ALERT_EMAIL="infra-team@yourcompany.com"
WARN_DAYS=30

for DOMAIN in "${DOMAINS[@]}"; do
  EXPIRY=$(echo | openssl s_client -connect "${DOMAIN}:443" -servername "${DOMAIN}" 2>/dev/null \
    | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)

  if [ -z "$EXPIRY" ]; then
    echo "ERROR: Could not retrieve cert for ${DOMAIN}" \
      | mail -s "SSL Check Failed: ${DOMAIN}" "$ALERT_EMAIL"
    continue
  fi

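  # Convert the expiry to whole days remaining. `date -d` is GNU date syntax;
  # on macOS/BSD use `date -jf "%b %d %T %Y %Z"` as in the earlier example.
  DAYS_LEFT=$(( ($(date -d "$EXPIRY" +%s) - $(date +%s)) / 86400 ))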

  if [ "$DAYS_LEFT" -le "$WARN_DAYS" ]; then
    echo "SSL cert for ${DOMAIN} expires in ${DAYS_LEFT} days (${EXPIRY})" \
      | mail -s "SSL Warning: ${DOMAIN} expires in ${DAYS_LEFT} days" "$ALERT_EMAIL"
  fi
done

Add to crontab to run daily at 8 AM:

0 8 * * * /usr/local/bin/check-ssl-certs.sh

This gets you to a functional baseline. The limitation: it only works when your cron runner is healthy, and it has no concept of alert escalation or historical tracking.
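To illustrate what escalation with a little history could look like, here is a minimal sketch that remembers the last threshold alerted per domain and only fires again when a more urgent one is crossed. Everything in it is an assumption for illustration: the function names, the JSON state file, and the choice to reuse the thresholds from the alert timeline.

```python
import json
from pathlib import Path
from typing import Optional

THRESHOLDS = (60, 30, 14, 7, 1)  # days until expiry, from the alert timeline


def current_threshold(days_remaining: int) -> Optional[int]:
    """Return the tightest threshold reached, or None if above all of them."""
    reached = [t for t in THRESHOLDS if days_remaining <= t]
    return min(reached) if reached else None


def newly_crossed(domain: str, days_remaining: int, state: dict) -> Optional[int]:
    """Return a threshold only the first time a domain crosses it."""
    threshold = current_threshold(days_remaining)
    if threshold is None:
        state.pop(domain, None)  # renewed or far from expiry: reset history
        return None
    last = state.get(domain)
    if last is not None and last <= threshold:
        return None  # already alerted at this tier or a more urgent one
    state[domain] = threshold
    return threshold


def load_state(path: Path) -> dict:
    """Read the per-domain alert history, or start empty."""
    return json.loads(path.read_text()) if path.exists() else {}


def save_state(path: Path, state: dict) -> None:
    """Persist the alert history between runs."""
    path.write_text(json.dumps(state, indent=2))
```

With this in place a daily run sends one message per tier rather than one per day, which keeps the 60-day ping from turning into noise that people learn to ignore.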

External Monitoring as a Safety Net

Self-hosted cron jobs are a good first layer. They are not sufficient on their own. The machine running your cron job could be the same machine whose cert expires. Or the job runs but silently fails because your SMTP relay is down.

External monitoring services check your SSL certificates from outside your infrastructure, on a schedule, and alert you through independent channels (email, Slack, PagerDuty, SMS). This separation is the point — if your infrastructure has a problem, you still get notified.

AlertSleep is one example: it monitors SSL certificates continuously, tracks expiry dates across all your domains, and fires alerts at configurable thresholds — without requiring you to manage any infrastructure for the monitoring itself. For teams that want visibility without operational overhead, this kind of external check is a meaningful complement to internal automation.

Managing SSL at Scale: 50+ Certificates

When you cross the threshold of managing 50 or more certificates, new problems emerge.

Build a certificate inventory. Know which cert covers which domain, when it was issued, when it expires, who owns renewal, and whether it auto-renews. A simple internal wiki page is better than nothing. A proper certificate management tool is better still.
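As a rough sketch of what such an inventory might hold, the record below carries the fields listed in the previous paragraph. The class and function names are hypothetical, and a real inventory would more likely live in a database or a certificate management tool than in code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class CertRecord:
    name: str                 # short label for the certificate
    domains: Tuple[str, ...]  # every hostname the certificate covers
    issued: date
    expires: date
    owner: str                # team or person responsible for renewal
    auto_renews: bool


def expiring_within(inventory: List[CertRecord], days: int, today: date) -> List[CertRecord]:
    """Return records expiring within `days` of `today`, soonest first."""
    due = [c for c in inventory if (c.expires - today).days <= days]
    return sorted(due, key=lambda c: c.expires)
```

Sorting by expiry date makes the most urgent renewal the first row anyone sees, and the `owner` field answers the "who renews this?" question before it becomes urgent.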

Wildcard certificates need special attention. A *.yourdomain.com wildcard might cover dozens of subdomains. If it expires, all of them break simultaneously. The blast radius of a wildcard expiry is much larger than a single-domain cert.
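To estimate that blast radius, it helps to list which known hostnames a wildcard actually covers. The sketch below uses hypothetical helper names and relies on the usual matching rule that `*` stands for exactly one left-most label, so `*.yourdomain.com` covers neither the bare `yourdomain.com` nor `a.b.yourdomain.com`.

```python
from typing import Iterable, List


def wildcard_covers(pattern: str, hostname: str) -> bool:
    """Return True if a certificate name (plain or wildcard) covers hostname."""
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if not pattern.startswith("*."):
        return pattern == hostname
    suffix = pattern[1:]  # e.g. ".yourdomain.com"
    if not hostname.endswith(suffix):
        return False
    label = hostname[: -len(suffix)]
    # The wildcard stands for exactly one non-empty label.
    return bool(label) and "." not in label


def blast_radius(pattern: str, hostnames: Iterable[str]) -> List[str]:
    """List the hostnames that would break if this certificate expired."""
    return [h for h in hostnames if wildcard_covers(pattern, h)]
```

Running `blast_radius` over the hostnames in your inventory gives a concrete list to attach to the wildcard's renewal ticket.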

Treat auto-renewal as a process, not a guarantee. Let's Encrypt auto-renewal via certbot or ACME clients is reliable under normal conditions. It fails when DNS records change, when ports 80/443 are firewalled during the renewal window, or when the renewal configuration drifts after infrastructure changes. Verify that auto-renewal is actually succeeding, not just scheduled.
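One way to verify this from the outside is to flag certificates that are still being served well inside the window in which the client should already have renewed them. The sketch below assumes a 30-day renewal window and a 3-day grace period. Both numbers and the function name are illustrative, so check what your ACME client is actually configured to do.

```python
from datetime import datetime, timedelta, timezone


def renewal_overdue(
    not_after: datetime,
    now: datetime,
    renew_window_days: int = 30,  # assumed: client renews this long before expiry
    grace_days: int = 3,          # assumed: slack for retries before flagging
) -> bool:
    """Return True if an auto-renewed certificate should have been replaced already."""
    remaining = not_after - now
    return remaining < timedelta(days=renew_window_days - grace_days)


# Example: a certificate with 20 days left, under a 30-day renewal window,
# suggests that renewal attempts have been failing.
example_now = datetime(2025, 3, 1, tzinfo=timezone.utc)
stalled = renewal_overdue(example_now + timedelta(days=20), example_now)
```

The point of the check is that it fires weeks before expiry, while a failed renewal is still a routine ticket rather than an incident.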

Use centralized alerting. Sending expiry alerts directly to individual engineers does not work at scale. Route all SSL alerts to a shared channel (Slack #infra-alerts) and a ticketing system. Coverage should not depend on any single person being available.
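A shared channel can be as simple as an incoming webhook. The sketch below builds a message and posts it as JSON using only the standard library. The function names are hypothetical and the webhook URL is something you would supply; Slack-style incoming webhooks accept a JSON body with a `text` field, but confirm the payload shape for whatever chat or ticketing system you use.

```python
import json
import urllib.request


def build_alert_payload(domain: str, days_remaining: int, owner_team: str) -> dict:
    """Build a chat message body for a certificate expiry alert."""
    text = (
        f"SSL certificate for {domain} expires in {days_remaining} days "
        f"(owner: {owner_team})"
    )
    return {"text": text}


def post_to_webhook(webhook_url: str, payload: dict, timeout: int = 10) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Including the owning team in the message keeps the alert actionable even when the person who originally set up the certificate is unavailable.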

Closing Thoughts

SSL certificate expiration is a solved problem. The tools exist, the alert timelines are well-established, and the failure modes are well-documented. What keeps it a recurring cause of incidents is the gap between knowing what to do and actually having it in place.

The LinkedIn outage in 2021 was not a failure of knowledge. It was a failure of process. Somewhere in the chain, a certificate slipped through without the right person getting the right alert at the right time.

The fix is not complicated: external monitoring as your safety net, tiered alerts with enough lead time to act calmly, and an inventory that does not live in one person's head.

The goal is to make certificate expiry the most boring part of your infrastructure. An alert fires at 60 days, someone renews, done. No incident, no postmortem, no Tuesday morning scramble.

Setting up SSL monitoring for the first time? AlertSleep's SSL monitoring handles the external check layer and alert routing out of the box — worth a look before you build your own.

DEV Community