Ankush Choudhary Johal

Originally published at johal.in

Postmortem: 1-Hour Outage When Let's Encrypt Rate Limited Our Ingress Certificates

May 20, 2024 | Incident Duration: 1 hour (09:20 - 10:20 UTC)

Executive Summary

On May 15, 2024, our production services experienced a 1-hour total outage caused by Let's Encrypt rate limiting our account after a misconfigured test environment and faulty cert-manager settings triggered excessive duplicate certificate requests. All TLS-terminated ingress endpoints returned handshake errors, making our application inaccessible to users globally.

Timeline of Events

  • 08:45 UTC: A faulty cron job in the test environment begins recreating the test namespace every 2 minutes; each recreation triggers a new certificate request for test.example.com via our cert-manager Let's Encrypt production issuer (a sketch of this ingress-to-issuer wiring follows the timeline).
  • 09:10 UTC: Let's Encrypt begins returning HTTP 429 (Too Many Requests) for test.example.com certificate requests: we had exceeded the Duplicate Certificate limit of 5 per week for the same set of hostnames.
  • 09:15 UTC: Cert-manager's global backoff blocks all certificate requests from our account to respect Let's Encrypt rate limits.
  • 09:20 UTC: A scheduled production deployment of the app.example.com ingress starts. Cert-manager deletes the existing certificate for app.example.com (still valid for 60 days) and attempts to request a new one, which the rate limit blocks. The ingress controller falls back to an untrusted self-signed certificate, causing TLS errors for all users.
  • 09:20 UTC: First user reports 503 errors and TLS handshake failures when accessing our application.
  • 09:25 UTC: On-call engineering team identifies TLS errors, checks cert-manager logs, and discovers repeated 429 errors from Let's Encrypt's ACME API.
  • 09:30 UTC: Root cause identified: faulty cron job generating excessive cert requests, combined with cert-manager configured to delete existing certificates on ingress updates.
  • 09:35 UTC: Faulty cron job disabled; cert-manager configuration updated to preserve existing certificates.
  • 09:40 UTC: Team switches to a secondary Let's Encrypt account (unused for 6 months) to bypass the rate limit on the primary account.
  • 09:55 UTC: New certificates for all production ingress resources are provisioned via the secondary account.
  • 10:20 UTC: All services are fully restored, ending the 1-hour outage.
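
For context, the ingress-to-issuer wiring behind the 08:45 and 09:20 entries looked roughly like the sketch below. This is a minimal reconstruction, not our exact manifests; the resource names, namespace, and ingress class are assumptions. With cert-manager's ingress-shim, an annotated ingress causes cert-manager to reconcile a Certificate against the referenced issuer, which is why each namespace recreation and the 09:20 deployment produced fresh certificate requests.

```yaml
# Hypothetical reconstruction; names, namespace, and ingress class are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: production
  annotations:
    # ingress-shim: cert-manager creates and manages a Certificate for the
    # TLS hosts below, using the referenced Let's Encrypt production issuer.
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-example-com-tls   # cert-manager writes the issued cert here
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```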

Root Cause Analysis

Three compounding factors led to the outage:

  1. Faulty test environment cron job: A deprecated cron job in our test environment was configured to recreate the test namespace every 2 minutes to "reset" state, but was not disabled after the test completed. Each namespace recreation triggered a new ingress deployment, which cert-manager processed as a request for a new certificate for test.example.com, even though the existing certificate was still valid for 89 days.
  2. Cert-manager misconfiguration: Our cert-manager ClusterIssuer was configured with cert-manager.io/preserveCerts: "false", meaning existing Certificate resources were deleted and recreated on every ingress update. When the Let's Encrypt rate limit hit, cert-manager could not provision new certificates, but had already deleted valid production certificates during a separate deployment, leaving ingress endpoints with no valid TLS cert.
  3. Shared rate limit scope: All environments (test, staging, production) used the same primary Let's Encrypt account, so excessive requests from the test environment triggered rate limits that blocked production certificate renewals.
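
To make the shared-account problem concrete, the issuer that every environment referenced looked roughly like the sketch below. It is hedged: the issuer name, contact email, and secret name are placeholders. Because test, staging, and production all pointed at the same ClusterIssuer and ACME account key, the test environment's request storm consumed the same rate-limit budget that production renewals depended on.

```yaml
# Sketch of the shared issuer (placeholder names and email).
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com             # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key    # the single shared ACME account key
    solvers:
      - http01:
          ingress:
            class: nginx
```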

Remediation Steps

  • Immediately disabled the faulty test environment cron job to stop excessive certificate requests.
  • Updated cert-manager ClusterIssuer to set cert-manager.io/preserveCerts: "true" to prevent deletion of valid existing certificates.
  • Switched production ingress resources to a secondary Let's Encrypt account that had not hit rate limits, allowing immediate provisioning of new certificates (see the issuer sketch after this list).
  • Deployed new certificates to all production ingress resources and verified TLS termination worked correctly.
  • Submitted a rate limit waiver request to Let's Encrypt for our primary account, which was approved 24 hours later.
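
The account switch referenced above amounts to deploying a second issuer that points at a different ACME account key and contact email, then repointing production ingresses at it. A hedged sketch with placeholder names, assuming the secondary account's key is already stored in the referenced secret:

```yaml
# Fallback issuer used during recovery (placeholder names).
# A different email and privateKeySecretRef means cert-manager uses a separate
# ACME account, which has its own per-account rate-limit budget.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod-secondary
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-secondary@example.com        # placeholder
    privateKeySecretRef:
      name: letsencrypt-secondary-account-key    # secondary account key
    solvers:
      - http01:
          ingress:
            class: nginx
```

With a layout like this, the switch itself is just changing each production ingress's cert-manager.io/cluster-issuer annotation to the secondary issuer name.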

Lessons Learned

  • Isolate environments by Let's Encrypt account: Use separate Let's Encrypt accounts for test/staging and production to prevent cross-environment rate limit contamination.
  • Test cert-manager changes in staging: All cert-manager configuration updates must be tested in a staging environment with simulated rate limit errors before production deployment.
  • Add alerting for cert-manager errors: Implement alerts for HTTP 429 responses from Let's Encrypt, certificates expiring in under 7 days, and failed provisioning attempts (a rule sketch follows this list).
  • Preserve existing certificates: Never delete valid existing certificates on ingress updates unless the certificate's domain set has changed. Use cert-manager's preserveCerts setting to enforce this.
  • CI/CD pipeline guards: Add checks to prevent excessive redeployments of ingress resources (e.g., rate limit ingress deployments to 2 per hour per namespace).
  • Backup TLS certificates: Store backup copies of all production TLS certificates in a secure vault to deploy quickly if cert-manager fails.
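
The alerting item above can be expressed as Prometheus rules against cert-manager's metrics. The sketch below assumes the Prometheus Operator is installed; verify the metric names and labels against the cert-manager version in use, since they vary between releases.

```yaml
# Hedged alerting sketch; confirm metric names/labels for your cert-manager version.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-alerts
  namespace: monitoring
spec:
  groups:
    - name: cert-manager
      rules:
        - alert: CertManagerACMERateLimited
          # Any HTTP 429 from the ACME server in the last 15 minutes.
          expr: increase(certmanager_http_acme_client_request_count{status="429"}[15m]) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "cert-manager is being rate limited by the ACME server"
        - alert: CertificateExpiringSoon
          # Any managed certificate with less than 7 days of validity left.
          expr: (certmanager_certificate_expiration_timestamp_seconds - time()) < 7 * 24 * 3600
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.name }} in {{ $labels.namespace }} expires in under 7 days"
```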

Conclusion

This outage was entirely preventable with better environment isolation and cert-manager configuration. We have implemented all remediation steps and lessons learned to ensure this type of incident does not recur. We also thank Let's Encrypt for their quick waiver approval and ongoing work providing free TLS certificates to the web.
