DEV Community

Samson Tanimawo


TLS Certificate Management Without Tears

Expired certificates cause more outages than they should. Every time, the post-mortem says 'we'll monitor expiry dates.' Every time, six months later, someone forgets.

Here's how to actually solve it.

The two rules

Rule 1: Don't manage certs manually. If a human has to remember to renew, the system is broken. Use Let's Encrypt + cert-manager (or your cloud's equivalent) and let the machines handle it.

Rule 2: Monitor expiry as an SLI. 'Days until cert expires' is a metric. Send a non-paging alert at 14 days and again at 7 days. Page someone at 3 days.
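As a minimal sketch of Rule 2, the thresholds can be expressed as a small classifier. The function names and the level names (`warn_14d`, `warn_7d`, `page`) are made up for illustration; wire the result into whatever alerting system you already run.

```python
from datetime import datetime, timezone

# Thresholds from Rule 2, checked from most to least urgent.
# The level names are hypothetical labels, not from any particular tool.
THRESHOLDS = [(3, "page"), (7, "warn_7d"), (14, "warn_14d")]


def days_until_expiry(not_after: datetime, now: datetime) -> float:
    """Return the 'days until cert expires' metric (negative if already expired)."""
    return (not_after - now).total_seconds() / 86400


def severity(days_left: float) -> str:
    """Map days remaining to an alert level; an expired cert still pages."""
    for limit, level in THRESHOLDS:
        if days_left <= limit:
            return level
    return "ok"


if __name__ == "__main__":
    expiry = datetime(2030, 1, 15, tzinfo=timezone.utc)
    today = datetime(2030, 1, 10, tzinfo=timezone.utc)
    print(severity(days_until_expiry(expiry, today)))
```

Exporting the raw number as a metric and applying the thresholds in your alerting rules works just as well; the point is that the thresholds live in config, not in someone's calendar.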

The gotchas

Certs you didn't know about. Internal services with self-signed certs that someone deployed in 2019 and nobody has touched since. Scan your infrastructure. Inventory everything.
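One rough way to start that inventory, assuming you already have a list of host/port pairs to probe: grab whatever certificate each endpoint presents and group endpoints by certificate fingerprint. Verification is deliberately turned off here so self-signed certs are collected too. This is for inventory only, never for real traffic. The helper names are hypothetical.

```python
import hashlib
import socket
import ssl


def fingerprint(der_bytes: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate, as lowercase hex."""
    return hashlib.sha256(der_bytes).hexdigest()


def fetch_der_cert(host: str, port: int = 443, timeout: float = 5.0):
    """Return the DER bytes of the presented cert without validating it."""
    context = ssl.create_default_context()
    # Inventory only: accept self-signed and expired certs so we can see them.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)


def build_inventory(targets, fetch=fetch_der_cert):
    """Group 'host:port' endpoints by the fingerprint of the cert they serve."""
    inventory = {}
    for host, port in targets:
        try:
            der = fetch(host, port)
        except OSError:
            # Unreachable or not speaking TLS; skipped in this sketch.
            continue
        if der is None:
            continue
        inventory.setdefault(fingerprint(der), []).append(f"{host}:{port}")
    return inventory
```

The `fetch` parameter exists so the grouping logic can be exercised without a network. In a real scan you would probably also record the endpoints that failed, since "unreachable" is itself worth investigating.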

Client certs. mTLS clients can have expired certs too. These are harder to find because they're often distributed across devices.

Third-party APIs. You don't manage their certs, but your integration breaks when those certs expire without notice. Monitor outbound connections with TLS validation turned on.
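A minimal probe for an outbound dependency, sketched with Python's standard library. `ssl.create_default_context()` validates the chain and hostname by default, so an expired or mismatched cert raises an error instead of passing quietly. The function names are hypothetical, and the host in the usage comment is a placeholder.

```python
import socket
import ssl
import time


def seconds_left(not_after: str, now: float) -> float:
    """Seconds remaining, given a 'notAfter' string as returned by getpeercert()."""
    return ssl.cert_time_to_seconds(not_after) - now


def check_endpoint(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Connect with full validation and return days until the leaf cert expires.

    Raises ssl.SSLCertVerificationError if the cert is expired, untrusted,
    or does not match the hostname.
    """
    context = ssl.create_default_context()  # chain and hostname checks are on
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return seconds_left(cert["notAfter"], time.time()) / 86400


# Usage (placeholder host):
# days = check_endpoint("api.example.com")
```

Run it on a schedule against each third-party host you depend on. A validation error is an incident signal, and the returned number gives you early warning before their cert lapses.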

The renewal that silently fails. Automated renewal breaks because of a config change. Nobody notices, because nothing visibly changes until the old cert expires. Alert on renewal failures, not just expiry dates.
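If your renewal tool doesn't expose a failure signal you can alert on, one indirect check is this: a cert that is still live well inside its renewal window means renewal probably isn't happening. The sketch below assumes renewal is scheduled 30 days before expiry with a 2-day grace period. Both numbers are assumptions, so match them to your own automation.

```python
def renewal_overdue(
    days_left: float,
    renew_before_days: float = 30.0,  # assumed renewal window; adjust to your setup
    grace_days: float = 2.0,          # assumed slack for retries and propagation
) -> bool:
    """True if the cert should already have been replaced by automation."""
    return days_left < renew_before_days - grace_days


def overdue_certs(days_left_by_name: dict) -> list:
    """Return the names of certs whose renewal looks stuck, sorted for stable output."""
    return sorted(
        name for name, days in days_left_by_name.items() if renewal_overdue(days)
    )
```

With these defaults the signal fires at roughly 28 days remaining, long before the 14-day expiry warning, which leaves time to fix the automation rather than hand-renew in a panic.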

The quarterly audit

Once a quarter:

  1. List every domain/service that uses TLS
  2. Verify the renewal automation is working
  3. Check monitoring is actually firing (test alert on a staging cert)
  4. Delete certs that belong to services that no longer exist
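Steps 1 and 4 of the audit boil down to comparing two lists. A minimal sketch, assuming you can export the names covered by your certs and the names of services that are actually live (both inputs are hypothetical; where they come from depends on your setup):

```python
def audit(cert_names: set, live_services: set) -> dict:
    """Compare the cert inventory against live services."""
    return {
        # Certs whose service is gone: candidates for deletion (step 4).
        "orphaned_certs": sorted(cert_names - live_services),
        # Live services missing from the inventory: gaps in step 1.
        "services_without_cert": sorted(live_services - cert_names),
    }


if __name__ == "__main__":
    report = audit(
        {"api.internal", "old-batch.internal"},
        {"api.internal", "billing.internal"},
    )
    print(report)
```

Both lists should be empty at the end of the audit. Anything left over is either a cert to delete or a service nobody is monitoring.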

The emotional truth

Nobody wants to work on cert management. That's why it breaks. Make it someone's explicit quarterly responsibility and reward them for boring success. Do that, and cert outages stop being a recurring post-mortem item.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
