My certificates were renewing, the logs said CertificateIssued, but my pods were still screaming about TLS handshake failures. It's the classic "everything looks green in the dashboard but the app is broken" scenario. I had a fully automated pipeline using cert-manager and Cloudflare DNS-01, yet my internal services were intermittently failing to validate the very certificates they were using.
If you've already set up the basic ClusterIssuer and think you're done, you've likely only hit the happy path. The real friction starts when you move from a single static IP to a dynamic environment or when you realize Kubernetes is lying to you about how it resolves DNS.
The DNS-01 Foundation
For those who haven't wrestled with this, DNS-01 is the only sane way to handle TLS in a homelab or private cloud. Unlike HTTP-01, which requires opening port 80 to the world and routing traffic to a specific challenge pod, DNS-01 proves ownership by dropping a TXT record into your DNS provider.
I use cert-manager for this because manually rotating certificates is a job for people who enjoy waking up at 3 AM to fix a production outage. The basic setup involves a ClusterIssuer that talks to the Cloudflare API.
    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: cloudflare
    spec:
      acme:
        email: admin@example.com
        server: https://acme-v02.api.letsencrypt.org/directory
        privateKeySecretRef:
          name: cloudflare-acme-account-key
        solvers:
          - selector:
              dnsZones:
                - example.com
            dns01:
              cloudflare:
                apiTokenSecretRef:
                  name: cloudflare-dns01-token
                  key: token
The most common point of failure here isn't the YAML; it's the API token. Cloudflare's permissions are granular. If you give the token Zone:Read but forget DNS:Edit, the challenge stays pending indefinitely because cert-manager can never create the TXT record. I once spent two hours debugging a "network timeout" that was actually just a 403 Forbidden from the Cloudflare API.
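When the issuer appears to hang, it helps to look at the raw HTTP status instead of guessing. What follows is a minimal sketch, assuming curl is installed. The helper names cf_status and explain_status are hypothetical, and the hints reflect generic HTTP semantics rather than anything Cloudflare documents. Listing records only exercises read access, so a 200 here does not prove the token can write; for that, the Challenge resources that cert-manager creates (kubectl describe challenge) carry the error from the actual TXT write.

```shell
# Hypothetical helper: print the HTTP status of a DNS-record listing call.
# $1 = API token, $2 = zone ID. curl prints 000 when no HTTP response arrived.
cf_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $1" \
    "https://api.cloudflare.com/client/v4/zones/$2/dns_records"
}

# Hypothetical helper: map a status code to a debugging hint.
explain_status() {
  case "$1" in
    2??) echo "ok: the token can read this zone" ;;
    401|403) echo "auth problem: check the token and its permissions" ;;
    429) echo "rate limited: back off and retry" ;;
    000) echo "no HTTP response: this one really is the network" ;;
    *) echo "unexpected status: $1" ;;
  esac
}
```

Usage would look like explain_status "$(cf_status "$CLOUDFLARE_API_TOKEN" "$ZONE_ID")", where both variables are values you supply.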
The ndots Trap
Once the certificates are issued, a new problem emerges: resolution. I noticed that some pods could reach internal services via their TLS names, while others failed with certificate signed by unknown authority or simply timed out.
The culprit was the Kubernetes ndots setting. By default, K8s sets ndots: 5. This means if a hostname has fewer than five dots, the resolver tries to append all the search domains listed in /etc/resolv.conf before trying the absolute name.
When a pod tries to connect to api.example.com, it doesn't just look up that name. It tries api.example.com.namespace.svc.cluster.local, then api.example.com.svc.cluster.local, and so on. This creates a massive amount of DNS noise and, in some edge cases with certain DNS providers or internal resolvers, leads to the wrong IP being returned or the request being dropped. I've written about this specific nightmare in my post on Wildcard DNS and ndots:5.
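The lookup order described above can be sketched as a small shell function. This is a simplified model of glibc-style stub resolver behavior, not the resolver's actual code; other libc implementations differ in the details. The function name lookup_order and the default namespace in the example are assumptions for illustration.

```shell
# Print the names a stub resolver would try, in order (simplified model).
# usage: lookup_order NAME NDOTS SEARCH_DOMAIN...
lookup_order() {
  name=$1; ndots=$2; shift 2
  # A trailing dot marks the name as fully qualified: no search list is applied.
  case "$name" in
    *.) echo "$name"; return ;;
  esac
  # Count the dots in the name.
  only_dots=$(printf '%s' "$name" | tr -cd '.')
  dots=${#only_dots}
  if [ "$dots" -ge "$ndots" ]; then
    echo "$name"                          # enough dots: try the name as-is first
    for d in "$@"; do echo "$name.$d"; done
  else
    for d in "$@"; do echo "$name.$d"; done
    echo "$name"                          # absolute name only as a last resort
  fi
}
```

Calling lookup_order api.example.com 5 default.svc.cluster.local svc.cluster.local cluster.local lists three search-domain candidates before the bare name. With 2 as the second argument, api.example.com comes first.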
The fix is to explicitly set ndots: 2 for pods that need to talk to external services frequently. This tells the resolver: "if there are at least two dots, just try the name as-is first."
    spec:
      containers:
        - name: ai-agent-worker
          image: my-agent:latest
      dnsConfig:
        options:
          - name: ndots
            value: "2"
Adding this simple block stopped the intermittent TLS handshake failures. It's a detail that isn't in the cert-manager docs because it's a Kubernetes networking behavior, not a certificate issue. But in production, those two things are inextricably linked.
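To confirm the setting actually landed in a pod, read its /etc/resolv.conf. Here is a small sketch, with ndots_of as a hypothetical helper name. It reads resolv.conf content on stdin and falls back to 1, the resolver's default when no ndots option is present.

```shell
# Hypothetical helper: print the effective ndots value from resolv.conf content on stdin.
ndots_of() {
  # Pull the number out of a line such as "options ndots:2 timeout:1".
  value=$(sed -n 's/^options.*ndots:\([0-9][0-9]*\).*/\1/p' | tail -n 1)
  # No ndots option means the resolver default of 1 applies.
  echo "${value:-1}"
}
```

For a running pod, something like kubectl exec POD_NAME -- cat /etc/resolv.conf | ndots_of should print 2 once the dnsConfig block is applied.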
Automating the Dynamic IP Headache
DNS-01 solves the identity problem, but it doesn't solve the reachability problem. If your ISP gives you a dynamic IP or, worse, puts you behind CGNAT, your A records become useless the moment your modem reboots.
I needed a way to keep my external services (like a Plex instance or a private dashboard) accessible without manually updating Cloudflare every time my IP shifted. I chose a GitOps-managed CronJob over a standalone script on a VM because I want my entire infrastructure state defined in code.
The logic here has to be smarter than a simple curl and update. On a corporate network or certain residential fiber connections, the address you get back from curl ifconfig.me is not always one you should publish: it might be a private IP, a CGNAT address, or nothing at all if the request fails. Updating your public DNS record to a 10.x.x.x address is a great way to take your services offline for everyone.
I built a small wrapper that validates the current public IP before pushing the update to Cloudflare.
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: cloudflare-ddns-updater
    spec:
      schedule: "*/5 * * * *" # Check every 5 minutes
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: Never # Job pods must set Never or OnFailure
              containers:
                - name: ddns-updater
                  image: guatulab/cloudflare-ddns:latest
                  env:
                    - name: CLOUDFLARE_API_TOKEN
                      valueFrom:
                        secretKeyRef:
                          name: cloudflare-ddns-credentials
                          key: token
                    - name: ZONE # zone name, used to look up the zone ID
                      value: example.com
                    - name: DOMAIN
                      value: services.example.com
                  command: ["/bin/sh", "-c"]
                  args:
                    - |
                      API="https://api.cloudflare.com/client/v4"
                      AUTH="Authorization: Bearer $CLOUDFLARE_API_TOKEN"
                      CURRENT_IP=$(curl -s ifconfig.me)
                      [ -n "$CURRENT_IP" ] || { echo "Could not determine public IP."; exit 1; }
                      # Prevent updating DNS with internal/CGNAT IPs.
                      # POSIX case patterns, because this runs under /bin/sh and [[ =~ ]] is bash-only.
                      case "$CURRENT_IP" in
                        10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[0-1].*|100.6[4-9].*|100.[7-9][0-9].*|100.1[01][0-9].*|100.12[0-7].*)
                          echo "Detected private IP: $CURRENT_IP. Skipping update."
                          exit 1
                          ;;
                      esac
                      # Resolve the zone ID through the API (a DNS lookup returns an IP, not a zone ID)
                      ZONE_ID=$(curl -s "$API/zones?name=$ZONE" -H "$AUTH" | jq -r '.result[0].id')
                      # Fetch current record once to avoid unnecessary API calls
                      RECORD=$(curl -s "$API/zones/$ZONE_ID/dns_records?type=A&name=$DOMAIN" -H "$AUTH")
                      RECORD_ID=$(echo "$RECORD" | jq -r '.result[0].id')
                      RECORD_IP=$(echo "$RECORD" | jq -r '.result[0].content')
                      [ "$RECORD_ID" != "null" ] || { echo "No A record found for $DOMAIN."; exit 1; }
                      if [ "$CURRENT_IP" != "$RECORD_IP" ]; then
                        echo "IP changed from $RECORD_IP to $CURRENT_IP. Updating..."
                        curl -s -X PUT "$API/zones/$ZONE_ID/dns_records/$RECORD_ID" \
                          -H "$AUTH" \
                          -H "Content-Type: application/json" \
                          -d "{\"type\":\"A\",\"name\":\"$DOMAIN\",\"content\":\"$CURRENT_IP\",\"ttl\":120,\"proxied\":true}"
                      else
                        echo "IP unchanged. Doing nothing."
                      fi
A few notes on this implementation:
- PUT vs POST: I use PUT to update an existing record ID rather than POST to create a new one. This prevents duplicate A records for the same hostname.
- Proxied Status: I set proxied: true to keep the Cloudflare WAF and CDN in front of my home IP. Exposing your home IP directly is an invitation for botnets to scan your open ports.
- TTL: I keep the TTL at 120 seconds. If your IP changes, you don't want to wait an hour for DNS propagation to finish. Cloudflare manages the TTL itself for proxied records, so this value matters mostly if you ever switch the record to DNS-only.
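The validation step is the part most worth testing in isolation. As a sketch only, here is a stricter standalone check; is_public_ipv4 is a hypothetical name, not something shipped in the image above. Besides the RFC 1918 ranges, it rejects malformed input (an HTML error page, an empty string), the CGNAT range 100.64.0.0/10, loopback, and link-local addresses.

```shell
# Succeeds (exit 0) only for a well-formed, publicly routable IPv4 address.
is_public_ipv4() {
  ip=$1
  # Reject anything that is not four dot-separated groups of 1-3 digits.
  printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
  # Split into octets ($1..$4) and range-check each one.
  old_ifs=$IFS; IFS=.
  set -- $ip
  IFS=$old_ifs
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
  # Reject "this network", private, and loopback ranges.
  if [ "$1" -eq 0 ] || [ "$1" -eq 10 ] || [ "$1" -eq 127 ]; then return 1; fi
  if [ "$1" -eq 172 ] && [ "$2" -ge 16 ] && [ "$2" -le 31 ]; then return 1; fi
  if [ "$1" -eq 192 ] && [ "$2" -eq 168 ]; then return 1; fi
  # Reject CGNAT (100.64.0.0/10) and link-local (169.254.0.0/16).
  if [ "$1" -eq 100 ] && [ "$2" -ge 64 ] && [ "$2" -le 127 ]; then return 1; fi
  if [ "$1" -eq 169 ] && [ "$2" -eq 254 ]; then return 1; fi
  return 0
}
```

In an updater script it would gate the API call, for example: is_public_ipv4 "$CURRENT_IP" || exit 1.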
The Gotchas and Tradeoffs
While this setup is largely "set and forget," there are a few things that can still bite you.
Rate Limiting
If you have a massive number of certificates and a very short renewal window, you can hit Cloudflare's API rate limits. I've seen this happen when a cluster restart triggers 50+ Certificate requests simultaneously. The fix is to implement a staggered renewal or use a single wildcard certificate for all internal services.
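For the wildcard route, a single Certificate resource can cover every subdomain. The sketch below prints such a manifest. The function name render_wildcard_cert, the resource name, and the secret name are all hypothetical; the issuer name cloudflare matches the ClusterIssuer shown earlier. A Certificate is namespaced, so the resulting Secret lands in whichever namespace you apply it to.

```shell
# Hypothetical helper: print a cert-manager Certificate for *.DOMAIN and DOMAIN.
render_wildcard_cert() {
  domain=$1
  # Derive a Kubernetes-safe resource name, e.g. wildcard-example-com.
  name="wildcard-$(printf '%s' "$domain" | tr '.' '-')"
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: $name
spec:
  secretName: $name-tls
  dnsNames:
    - "*.$domain"
    - "$domain"
  issuerRef:
    name: cloudflare
    kind: ClusterIssuer
EOF
}
```

Piping the output into kubectl, as in render_wildcard_cert example.com | kubectl apply -f -, would request one certificate instead of fifty.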
Token Scope
I strongly advise against using a Global API Key. If your Kubernetes cluster is compromised and you've stored a Global Key in a Secret, the attacker has full control over your entire Cloudflare account. Use a scoped API Token with the absolute minimum permissions: Zone.DNS:Edit and Zone.Zone:Read. For more on securing secrets in K8s, check out my post on SealedSecrets.
The CGNAT Wall
If you are truly behind CGNAT (where your WAN IP is shared with hundreds of other customers), no amount of DDNS will help. In that case, you have to stop fighting the network and switch to a tunnel. I've used Cloudflare Tunnels (cloudflared) for this, but if you want to keep traffic internal, a Tailscale subnet router is a better bet.
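One rough way to tell whether you are in this situation is to compare the WAN address your router reports with the address the outside world sees. This is a heuristic sketch, not a definitive test, and behind_upstream_nat is a hypothetical name. It flags a WAN address inside the CGNAT range 100.64.0.0/10, or any mismatch between the two addresses.

```shell
# Succeeds (exit 0) when the router's WAN IP suggests NAT upstream of your router.
# $1 = WAN IP shown in the router's status page, $2 = IP reported by an external service.
behind_upstream_nat() {
  wan_ip=$1; public_ip=$2
  # A WAN address in 100.64.0.0/10 is carrier-grade NAT space.
  case "$wan_ip" in
    100.6[4-9].*|100.[7-9][0-9].*|100.1[01][0-9].*|100.12[0-7].*) return 0 ;;
  esac
  # Otherwise, a mismatch means something between you and the internet rewrites the address.
  [ "$wan_ip" != "$public_ip" ]
}
```

If behind_upstream_nat "$WAN_IP" "$(curl -s ifconfig.me)" succeeds, inbound connections to the published address will likely never reach your router, and a tunnel is the way out.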
Summary of the Workflow
When I build out new infrastructure, I follow this sequence to avoid the pain I've described:
| Component | Tool | Purpose | Key Detail |
|---|---|---|---|
| Issuance | cert-manager | Automated TLS | Use scoped API Tokens, not Global Keys |
| Validation | Cloudflare DNS-01 | Zero-port exposure | Ensure DNS:Edit permissions are set |
| Resolution | K8s dnsConfig | Fix handshake errors | Set ndots: 2 for external-facing pods |
| Reachability | Custom CronJob | Dynamic IP handling | Validate against private/CGNAT IPs before update |
The goal isn't just to get a green checkmark from Let's Encrypt. The goal is a system where the certificates are valid, the DNS resolves instantly, and the IP updates automatically without me having to touch a terminal. If you're building similar AI agent orchestration or IoT pipelines, getting the networking layer right is non-negotiable. If you need help architecting this for a production environment, you can find my infrastructure consulting services here.
The gap between the documentation and a working system is usually filled with these small, annoying details. The docs tell you how to install cert-manager; they don't tell you that ndots: 5 will make your certificates feel like they're broken. Focus on the resolution path and the API permissions, and the rest usually falls into place.