Postmortem: How a Cert-Manager 1.14 Bug Broke TLS for 1000+ Microservices

Incident Summary

On March 12, 2024, at 14:30 UTC, our platform engineering team began a rolling update of Cert-Manager from version 1.13.5 to 1.14.0 across all 12 production Kubernetes clusters, following a successful staging validation. At 15:05 UTC, our monitoring systems began firing critical alerts for TLS handshake failures, HTTP 503 errors, and certificate expiration warnings across 1,247 microservices. Measured from the first alert, the incident lasted 2 hours and 17 minutes, with full recovery confirmed at 17:22 UTC. Estimated impact: 12 minutes of partial downtime for 89% of customer-facing services and approximately $240k in lost transaction volume.

Incident Timeline

14:30 UTC: Cert-Manager 1.14.0 rollout begins across all production clusters, using a 5% canary phase per cluster.
14:47 UTC: Canary phase completes successfully; full rollout to all nodes starts.
15:05 UTC: First alert triggers for TLS handshake failures in the payments microservice cluster.
15:12 UTC: Incident response team assembles; initial investigation points to expired TLS certificates across multiple services.
15:28 UTC: Root cause identified: Cert-Manager 1.14.0 webhook rejects all Certificate renewal requests, blocking new cert issuance.
15:35 UTC: Mitigation starts: roll back Cert-Manager to 1.13.5 across all clusters.
16:10 UTC: Rollback completes for 8 of 12 clusters; TLS recovery begins for 60% of affected services.
17:22 UTC: All clusters rolled back, all certificates renewed, full service recovery confirmed.
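
The intervals implied by this timeline can be checked with a few lines of Python. This is only a sketch for verifying the arithmetic; the dictionary keys are illustrative names, and the timestamps are copied from the entries above.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all UTC, same day).
# The key names are illustrative labels, not part of any tooling.
TIMELINE = {
    "rollout_start": "14:30",
    "first_alert": "15:05",
    "root_cause": "15:28",
    "rollback_start": "15:35",
    "resolved": "17:22",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Rollout start to first alert.
time_to_detect = minutes_between(TIMELINE["rollout_start"], TIMELINE["first_alert"])
# First alert to root cause identification.
time_to_root_cause = minutes_between(TIMELINE["first_alert"], TIMELINE["root_cause"])
# First alert to confirmed recovery.
incident_duration = minutes_between(TIMELINE["first_alert"], TIMELINE["resolved"])

hours, minutes = divmod(incident_duration, 60)
print(f"time to detect: {time_to_detect} min")
print(f"time to root cause: {time_to_root_cause} min")
print(f"incident duration: {hours} h {minutes} min")
```

Measuring from the first alert reproduces the 2 hours and 17 minutes quoted in the summary; measuring from the start of the rollout would add the 35 minutes during which the failure went undetected.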

Root Cause Analysis

Cert-Manager 1.14.0 introduced a breaking change in the cert-manager-webhook component, which validates Custom Resource (CR) submissions for Certificate, Issuer, and ClusterIssuer resources. The change added a strict validation check for the secretName field in Certificate resources, requiring it to match a new annotation cert-manager.io/secret-name that was not present in existing resources deployed prior to 1.14.0.

This validation logic contained a bug: it applied the check to all Certificate resources, including those created under Cert-Manager 1.13.x and earlier, which did not have the new annotation. As a result, the webhook rejected all renewal requests for existing certificates, and blocked new certificate creation. Since Cert-Manager automatically renews certificates 30 days before expiration, the rollout coincided with a batch of renewals for 1,000+ certificates, all of which failed simultaneously.
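
The failure mode described above can be modeled in a few lines. The sketch below is not the webhook's actual code (which is written in Go); it is a simplified Python model under the assumption that the check compares secretName against the cert-manager.io/secret-name annotation. The Certificate class, the function names, and the sample resource names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

# Annotation name as given in this postmortem.
SECRET_NAME_ANNOTATION = "cert-manager.io/secret-name"


@dataclass
class Certificate:
    """Hypothetical, minimal stand-in for a Certificate resource."""
    name: str
    secret_name: str
    annotations: Dict[str, str] = field(default_factory=dict)


def validate_strict(cert: Certificate) -> bool:
    """Model of the buggy behavior: a missing annotation counts as a mismatch."""
    return cert.annotations.get(SECRET_NAME_ANNOTATION) == cert.secret_name


def validate_tolerant(cert: Certificate) -> bool:
    """One possible fix: enforce the match only when the annotation is present."""
    annotation = cert.annotations.get(SECRET_NAME_ANNOTATION)
    return annotation is None or annotation == cert.secret_name


# A resource created before the upgrade: no annotation at all.
legacy = Certificate(name="payments-tls", secret_name="payments-tls-secret")
# A resource carrying a matching annotation.
annotated = Certificate(
    name="orders-tls",
    secret_name="orders-tls-secret",
    annotations={SECRET_NAME_ANNOTATION: "orders-tls-secret"},
)

print(validate_strict(legacy))    # rejected under the strict check
print(validate_tolerant(legacy))  # accepted once missing annotations are tolerated
```

A regression test for annotation-less resources, as proposed in the preventative actions, would assert exactly the difference between these two functions for the legacy case.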

Because renewal requests were rejected, expired certificates were not replaced, which caused TLS termination failures at the ingress controller level. Services using these certificates could not establish TLS connections and returned 503 errors to clients.

Mitigation and Resolution

Once the root cause was identified, the incident response team initiated an immediate rollback to Cert-Manager 1.13.5, the last known stable version. The rollback was performed one cluster at a time to avoid further disruption, with canary validation after each cluster. After all clusters were rolled back, the team manually triggered certificate renewal for all affected resources, which completed within 47 minutes. A post-rollback audit confirmed no lingering certificate issues.
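
The rollback procedure, one cluster at a time with validation after each, can be sketched as a simple loop. This is an illustration of the control flow only: the rollback and is_healthy callables are hypothetical placeholders for whatever deployment tooling and health checks are in use, and the cluster names are made up.

```python
from typing import Callable, Iterable, List


def rollback_clusters(
    clusters: Iterable[str],
    rollback: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Roll back one cluster at a time, validating each before moving on.

    Stops at the first cluster that fails validation so a bad rollback
    does not spread to the remaining clusters.
    """
    completed: List[str] = []
    for cluster in clusters:
        rollback(cluster)
        if not is_healthy(cluster):
            raise RuntimeError(f"validation failed after rolling back {cluster}")
        completed.append(cluster)
    return completed


# Usage with stand-in callables; real ones would call the deployment tooling.
clusters = [f"prod-{i:02d}" for i in range(1, 13)]  # hypothetical names, 12 clusters
rolled_back: List[str] = []
done = rollback_clusters(clusters, rolled_back.append, lambda cluster: True)
print(f"rolled back {len(done)} clusters")
```

Processing clusters sequentially trades speed for safety, which is consistent with the rollback taking well over half an hour to reach 8 of 12 clusters.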

Lessons Learned

Staging validation gaps: Our staging environment did not have a representative sample of production Certificate resources, missing the annotation mismatch that triggered the bug. We now replicate 10% of production Cert-Manager resources to staging for validation.
Canary rollout scope: The 5% canary phase did not include certificate renewal workloads, which are triggered by timer rather than deployment events. We now add renewal workload simulation to canary validation.
Cert-Manager version pinning: We previously allowed automatic minor version updates; we now pin to patch versions and require 72 hours of staging runtime before production rollout.
Alerting improvements: We added a new alert for Cert-Manager webhook rejection rates, which would have fired within 2 minutes of the rollout, reducing time to detection by 33 minutes.
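
The idea behind the webhook rejection rate alert can be sketched as a sliding-window check. In practice this would more likely be expressed as a rule in the monitoring system; the Python version below only illustrates the logic, and the class name, window size, threshold, and minimum sample count are assumptions chosen for the example.

```python
from collections import deque


class RejectionRateAlert:
    """Sliding-window rejection-rate check over webhook admission results."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05,
                 min_samples: int = 20) -> None:
        self.threshold = threshold      # fire above this fraction of rejections
        self.min_samples = min_samples  # avoid firing on a handful of requests
        self.results = deque(maxlen=window_size)  # True means "rejected"

    def record(self, rejected: bool) -> None:
        """Record the outcome of one admission request."""
        self.results.append(rejected)

    def rate(self) -> float:
        """Fraction of rejected requests in the current window."""
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)

    def firing(self) -> bool:
        """True when enough samples exist and the rate exceeds the threshold."""
        return len(self.results) >= self.min_samples and self.rate() > self.threshold


alert = RejectionRateAlert(window_size=50, threshold=0.05, min_samples=10)
for _ in range(10):
    alert.record(False)   # healthy traffic before the upgrade
print(alert.firing())
for _ in range(10):
    alert.record(True)    # every request rejected after the upgrade
print(alert.rate(), alert.firing())
```

The minimum sample count matters here: admission requests for Certificate resources can be sparse, so a rate computed over very few requests would be noisy.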

Preventative Actions

Implement mandatory staging validation for all Cert-Manager updates, including renewal workload simulation.
Pin Cert-Manager to specific patch versions, with manual approval for minor version upgrades.
Add webhook rejection rate and certificate expiration lag alerts to all cluster monitoring dashboards.
Contribute a fix for the 1.14.x webhook validation bug to the Cert-Manager open-source project, with a regression test for annotation-less Certificate resources.
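
The version pinning policy above lends itself to an automated gate. The following is a minimal sketch, assuming versions are written as MAJOR.MINOR.PATCH with an optional leading "v"; the function names are hypothetical and this is not part of Cert-Manager or any deployment tool.

```python
import re
from typing import Tuple

# Accepts "1.13.5" or "v1.13.5"; rejects floating tags such as "1.14" or "latest".
_PINNED_VERSION = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_pinned(version: str) -> Tuple[int, int, int]:
    """Parse a fully pinned version, raising ValueError if it is not pinned."""
    match = _PINNED_VERSION.match(version)
    if match is None:
        raise ValueError(f"not pinned to a patch version: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def needs_manual_approval(current: str, target: str) -> bool:
    """True when the upgrade changes the major or minor version."""
    return parse_pinned(current)[:2] != parse_pinned(target)[:2]


print(needs_manual_approval("1.13.5", "1.14.0"))  # minor bump
print(needs_manual_approval("1.13.5", "1.13.6"))  # patch bump only
```

Under this check, the 1.13.5 to 1.14.0 upgrade from this incident would have required manual approval, while a patch-level update within 1.13.x would not.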