Expired certificates cause more outages than they should. Every time, the post-mortem says 'we'll monitor expiry dates.' Every time, six months later, someone forgets.
Here's how to actually solve it.
The two rules
Rule 1: Don't manage certs manually. If a human has to remember to renew, the system is broken. Use Let's Encrypt + cert-manager (or your cloud's equivalent) and let the machines handle it.
Rule 2: Monitor expiry as an SLI. 'Days until cert expires' is a metric. Alert at 14 days and again at 7 days. Page someone at 3 days.
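As a rough illustration of Rule 2, the sketch below turns a certificate's expiry date into a 'days left' number and maps it onto the 14/7/3-day thresholds. It uses only the Python standard library. The function names and severity labels are made up for this example, and `fetch_not_after` assumes the endpoint presents a certificate that passes normal validation.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> datetime:
    """Connect with validation on and return the leaf cert's expiry (UTC)."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like "Jun  1 12:00:00 2025 GMT"
    seconds = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def days_until_expiry(not_after: datetime, now: datetime) -> float:
    """The SLI itself: fractional days between now and expiry."""
    return (not_after - now).total_seconds() / 86400


def severity(days_left: float) -> str:
    """Map the SLI onto the thresholds from Rule 2 (labels are hypothetical)."""
    if days_left <= 3:
        return "page"
    if days_left <= 7:
        return "alert_7d"
    if days_left <= 14:
        return "alert_14d"
    return "ok"
```

In a real setup you would more likely export `days_until_expiry` as a gauge and let the alerting system apply the thresholds, so they can change without redeploying the probe.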
The gotchas
Certs you didn't know about. Internal services with self-signed certs that someone deployed in 2019 and nobody has touched since. Scan your infrastructure. Inventory everything.
Client certs. mTLS clients can have expired certs too. These are harder to find because they're often distributed across devices.
Third-party APIs. You don't manage their certs, but your integration breaks when one expires without notice. Monitor outbound connections with TLS validation turned on.
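One way to watch a third-party endpoint, sketched with Python's standard library: connect with validation on, and separate 'their cert expired' from other failures. The names `probe` and `classify_verify_code` and the result labels are hypothetical. The value 10 is OpenSSL's `X509_V_ERR_CERT_HAS_EXPIRED` verify code.

```python
import socket
import ssl

# OpenSSL verify code for "certificate has expired".
X509_V_ERR_CERT_HAS_EXPIRED = 10


def classify_verify_code(code: int) -> str:
    """Separate an expired cert from every other validation failure."""
    return "expired" if code == X509_V_ERR_CERT_HAS_EXPIRED else "invalid"


def probe(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Attempt a validated TLS handshake and report what happened."""
    ctx = ssl.create_default_context()  # hostname and chain validation are on
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return "ok"
    except ssl.SSLCertVerificationError as exc:
        return classify_verify_code(exc.verify_code)
    except ssl.SSLError:
        return "tls_error"  # handshake failed for a non-certificate reason
    except OSError:
        return "unreachable"  # DNS failure, refused connection, timeout
```

Running `probe` on a schedule against each outbound dependency gives you a signal that is separate from your own request error rate, which helps when deciding whose cert is the problem.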
The renewal that silently fails. Automated renewal can break after a config change. Nobody notices, because nothing visibly changes until the old cert expires. Alert on renewal failures, not just expiry dates.
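If your renewal tooling does not expose a failure signal directly, one indirect check is to ask whether the cert being served is older than it should be. The sketch below assumes renewal is expected about two-thirds of the way through the cert's lifetime, plus a one-day grace period. Both values are assumptions for illustration, so match them to however your automation is configured. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def renewal_overdue(
    not_before: datetime,
    not_after: datetime,
    now: datetime,
    renew_fraction: float = 2 / 3,          # assumed renewal point in the lifetime
    grace: timedelta = timedelta(days=1),   # assumed slack before alerting
) -> bool:
    """True if the served cert is still the old one past its expected renewal time."""
    lifetime = not_after - not_before
    expected_renewal = not_before + lifetime * renew_fraction
    return now > expected_renewal + grace


# Example: a 90-day cert issued on 1 Jan should have been replaced around 1 Mar.
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = datetime(2024, 3, 31, tzinfo=timezone.utc)
overdue = renewal_overdue(issued, expires, datetime(2024, 3, 5, tzinfo=timezone.utc))
```

This fires weeks before the expiry alerts would, which is the point: it catches the broken renewal while there is still plenty of lifetime left on the old cert.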
The quarterly audit
Once a quarter:
- List every domain/service that uses TLS
- Verify the renewal automation is working
- Check that monitoring actually fires (trigger a test alert with a staging cert)
- Delete certs that belong to services that no longer exist
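Two of the audit steps above reduce to comparing lists, which is easy to script. A minimal sketch, assuming you can produce three inputs: a tracked inventory mapping each cert's hostname to its owning service, the set of hostnames a scan found serving TLS, and the set of services that still exist. All names here are hypothetical.

```python
def audit_inventory(
    tracked: dict[str, str],    # cert hostname -> owning service
    discovered: set[str],       # hostnames a scan found serving TLS
    live_services: set[str],    # services that still exist
) -> dict[str, list[str]]:
    """Report certs nobody tracks and certs whose service is gone."""
    untracked = sorted(discovered - set(tracked))
    orphaned = sorted(
        host for host, service in tracked.items() if service not in live_services
    )
    return {"untracked": untracked, "orphaned": orphaned}


report = audit_inventory(
    tracked={"api.example.com": "api", "old.example.com": "legacy"},
    discovered={"api.example.com", "wiki.internal"},
    live_services={"api"},
)
```

Here `untracked` is the 'certs you didn't know about' list and `orphaned` is the deletion candidate list. Both still need a human to confirm before anything is added or removed.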
The emotional truth
Nobody wants to work on cert management. That's why it breaks. Make it someone's explicit quarterly responsibility and reward them for boring success. Do that, and cert outages stop being a recurring event.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com