Guatu

Posted on May 29 • Originally published at guatulabs.dev

Cloudflare DNS-01: Fixing the Gap Between Automation and Reality

#cloudflare #certmanager #kubernetes #dns01

My certificates were renewing, the logs said CertificateIssued, but my pods were still screaming about TLS handshake failures. It's the classic "everything looks green in the dashboard but the app is broken" scenario. I had a fully automated pipeline using cert-manager and Cloudflare DNS-01, yet my internal services were intermittently failing to validate the very certificates they were using.

If you've already set up the basic ClusterIssuer and think you're done, you've likely only hit the happy path. The real friction starts when you move from a single static IP to a dynamic environment or when you realize Kubernetes is lying to you about how it resolves DNS.

The DNS-01 Foundation

For those who haven't wrestled with this, DNS-01 is the only sane way to handle TLS in a homelab or private cloud. Unlike HTTP-01, which requires opening port 80 to the world and routing traffic to a specific challenge pod, DNS-01 proves ownership by publishing a TXT record under _acme-challenge.<your-domain> through your DNS provider's API. Nothing inside the cluster has to be reachable from the internet.

I use cert-manager for this because manually rotating certificates is a job for people who enjoy waking up at 3 AM to fix a production outage. The basic setup involves a ClusterIssuer that talks to the Cloudflare API.

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: cloudflare
spec:
  acme:
    email: admin@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: cloudflare-acme-account-key
    solvers:
      - selector:
          dnsZones:
            - example.com
        dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-dns01-token
              key: token

The most common point of failure here isn't the YAML, it's the API token. Cloudflare's permissions are granular. If you give the token Zone:Read but forget DNS:Edit, the issuer will hang indefinitely while trying to create the TXT record. I've spent two hours debugging a "network timeout" that was actually just a 403 Forbidden from the Cloudflare API.
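To see the real status code instead of a vague timeout, you can call the Cloudflare API directly with the same token. The following is a minimal sketch, not a complete diagnostic: `cf_status` and `require_ok` are hypothetical helper names, and `ZONE_ID` and `CLOUDFLARE_API_TOKEN` are assumed to be set in your shell. A successful read does not prove the token can write records, but a 403 here points at permissions rather than the network.

```shell
#!/bin/sh
# Hypothetical helper: print only the HTTP status code of a GET request
# made with the token in CLOUDFLARE_API_TOKEN.
cf_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" "$1"
}

# Hypothetical helper: succeed only when the status passed as $1 is 200.
require_ok() {
  if [ "$1" = "200" ]; then
    echo "ok"
  else
    echo "Cloudflare API returned HTTP $1" >&2
    return 1
  fi
}

# Example usage (assumes ZONE_ID is the ID of your zone):
#   status=$(cf_status "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records?type=TXT")
#   require_ok "$status"
```

If the status is 403, check the token's permissions in the Cloudflare dashboard before touching the cluster.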

The `ndots` Trap

Once the certificates are issued, a new problem emerges: resolution. I noticed that some pods could reach internal services via their TLS names, while others failed with certificate signed by unknown authority or simply timed out.

The culprit was the Kubernetes ndots setting. By default, K8s sets ndots: 5. This means if a hostname has fewer than five dots, the resolver tries to append all the search domains listed in /etc/resolv.conf before trying the absolute name.

When a pod tries to connect to api.example.com, it doesn't just look up that name. It tries api.example.com.namespace.svc.cluster.local, then api.example.com.svc.cluster.local, and so on. This creates a massive amount of DNS noise and, in some edge cases with certain DNS providers or internal resolvers, leads to the wrong IP being returned or the request being dropped. I've written about this specific nightmare in my post on Wildcard DNS and ndots:5.
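The lookup order described above can be modelled in a few lines of shell. This is a simplified sketch of glibc-style search-list behaviour, not a real resolver; `lookup_order` is a hypothetical helper, and the search domains in the usage example assume a pod in the `default` namespace.

```shell
#!/bin/sh
# Print the names a resolver would try, in order.
# Usage: lookup_order NAME NDOTS SEARCH_DOMAIN...
lookup_order() {
  name=$1
  ndots=$2
  shift 2

  # A trailing dot marks the name as fully qualified: no search list applies
  case $name in
    *.) echo "$name"; return 0 ;;
  esac

  # Count the dots in the name
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c | tr -d ' ')

  # Enough dots: the name is tried as-is before the search list
  if [ "$dots" -ge "$ndots" ]; then
    echo "$name."
  fi

  for domain in "$@"; do
    echo "$name.$domain."
  done

  # Too few dots: the name as-is is only the last resort
  if [ "$dots" -lt "$ndots" ]; then
    echo "$name."
  fi
}

# Example usage:
#   lookup_order api.example.com 5 default.svc.cluster.local svc.cluster.local cluster.local
#   lookup_order api.example.com 2 default.svc.cluster.local svc.cluster.local cluster.local
```

With ndots set to 5 the absolute name is printed last, after three cluster-internal candidates. With ndots set to 2 it is printed first.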

The fix is to explicitly set ndots: 2 for pods that need to talk to external services frequently. This tells the resolver: "if there are at least two dots, just try the name as-is first."

spec:
  containers:
    - name: ai-agent-worker
      image: my-agent:latest
  dnsConfig:
    options:
      - name: ndots
        value: "2"

Adding this simple block stopped the intermittent TLS handshake failures. It's a detail that isn't in the cert-manager docs because it's a Kubernetes networking behavior, not a certificate issue. But in production, those two things are inextricably linked.
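To confirm the setting actually reached a running pod, you can read the resolver configuration the kubelet wrote into it. A small sketch, assuming a hypothetical helper named `get_ndots` that reads resolv.conf content on stdin; the pod name in the usage comment is a placeholder.

```shell
#!/bin/sh
# Print the ndots value from resolv.conf content on stdin.
# Falls back to 1, the resolver default, when no ndots option is present.
get_ndots() {
  value=$(sed -n 's/^options.*ndots:\([0-9][0-9]*\).*/\1/p' | tail -n 1)
  echo "${value:-1}"
}

# Example usage (pod name is a placeholder):
#   kubectl exec ai-agent-worker -- cat /etc/resolv.conf | get_ndots
```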

Automating the Dynamic IP Headache

DNS-01 solves the identity problem, but it doesn't solve the reachability problem. If your ISP gives you a dynamic IP or, worse, puts you behind CGNAT, your A records become useless the moment your modem reboots.

I needed a way to keep my external services (like a Plex instance or a private dashboard) accessible without manually updating Cloudflare every time my IP shifted. I chose a GitOps-managed CronJob over a standalone script on a VM because I want my entire infrastructure state defined in code.

The logic here has to be smarter than a simple curl and update. If you're on a corporate network or certain residential fibers, curl ifconfig.me might return a private IP or a CGNAT address. Updating your public DNS record to a 10.x.x.x address is a great way to take your services offline for everyone.

I built a small wrapper that validates the current public IP before pushing the update to Cloudflare.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cloudflare-ddns-updater
spec:
  schedule: "*/5 * * * *" # Check every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never # Job pods must use Never or OnFailure
          containers:
            - name: ddns-updater
              image: guatulab/cloudflare-ddns:latest
              env:
                - name: CLOUDFLARE_API_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: cloudflare-ddns-credentials
                      key: token
                - name: ZONE_NAME
                  value: example.com
                - name: DOMAIN
                  value: services.example.com
              command: ["/bin/sh", "-c"]
              args:
                - |
                  set -eu
                  API="https://api.cloudflare.com/client/v4"
                  AUTH="Authorization: Bearer $CLOUDFLARE_API_TOKEN"

                  CURRENT_IP=$(curl -fsS ifconfig.me)
                  # Prevent updating DNS with private (RFC 1918) or CGNAT (100.64.0.0/10) IPs.
                  # grep -E works in plain /bin/sh; [[ ... =~ ... ]] is bash-only.
                  if echo "$CURRENT_IP" | grep -Eq '^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.|100\.(6[4-9]|[7-9][0-9]|1[01][0-9]|12[0-7])\.)'; then
                    echo "Detected private or CGNAT IP: $CURRENT_IP. Skipping update."
                    exit 1
                  fi

                  # The zone ID comes from the Cloudflare API, not from DNS.
                  # Look it up once, fetch the A record once, and reuse both.
                  ZONE_ID=$(curl -fsS "$API/zones?name=$ZONE_NAME" -H "$AUTH" | jq -r '.result[0].id')
                  RECORD=$(curl -fsS "$API/zones/$ZONE_ID/dns_records?type=A&name=$DOMAIN" -H "$AUTH")
                  RECORD_ID=$(echo "$RECORD" | jq -r '.result[0].id')
                  RECORD_IP=$(echo "$RECORD" | jq -r '.result[0].content')

                  # Only call the write endpoint when the address actually changed
                  if [ "$CURRENT_IP" != "$RECORD_IP" ]; then
                    echo "IP changed from $RECORD_IP to $CURRENT_IP. Updating..."
                    curl -fsS -X PUT "$API/zones/$ZONE_ID/dns_records/$RECORD_ID" \
                      -H "$AUTH" \
                      -H "Content-Type: application/json" \
                      -d "{\"type\":\"A\",\"name\":\"$DOMAIN\",\"content\":\"$CURRENT_IP\",\"ttl\":120,\"proxied\":true}"
                  else
                    echo "IP unchanged. Doing nothing."
                  fi

A few notes on this implementation:

The PUT vs POST: I use PUT to update an existing record ID rather than POST to create a new one. This prevents duplicate A records for the same hostname.
Proxied Status: I set proxied: true to keep the Cloudflare WAF and CDN in front of my home IP. Exposing your home IP directly is an invitation for botnets to scan your open ports.
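TTL: I keep the TTL at 120 seconds. If your IP changes, you don't want to wait an hour for DNS propagation to finish. Note that Cloudflare manages the TTL itself for proxied records, so this value matters mainly if you ever switch the record to DNS-only.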
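The address validation is the part most worth testing in isolation. Below is a minimal sketch of such a check in POSIX sh, assuming a hypothetical function name `is_public_ipv4`. It rejects the RFC 1918 ranges, the CGNAT range 100.64.0.0/10, loopback, and link-local addresses, and it does not attempt to cover every reserved range.

```shell
#!/bin/sh
# Return 0 if $1 looks like a publicly routable IPv4 address, 1 otherwise.
is_public_ipv4() {
  case $1 in
    # RFC 1918 private ranges, loopback, link-local
    10.*|192.168.*|127.*|169.254.*) return 1 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 1 ;;
    # CGNAT shared address space: 100.64.0.0 to 100.127.255.255
    100.6[4-9].*|100.[7-9][0-9].*|100.1[01][0-9].*|100.12[0-7].*) return 1 ;;
  esac
  # Anything that is not a dotted quad (error pages, empty output) is rejected
  echo "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Example usage:
#   if is_public_ipv4 "$CURRENT_IP"; then echo "safe to publish"; fi
```

The final grep also protects against the lookup service returning an HTML error page or nothing at all, which would otherwise be pushed straight into the DNS record.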

The Gotchas and Tradeoffs

While this setup is largely "set and forget," there are a few things that can still bite you.

Rate Limiting

If you have a massive number of certificates and a very short renewal window, you can hit Cloudflare's API rate limits. I've seen this happen when a cluster restart triggers 50+ Certificate requests simultaneously. The fix is to implement a staggered renewal or use a single wildcard certificate for all internal services.
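As one way to apply the wildcard option, a single Certificate resource can cover every hostname directly under the zone. The sketch below writes such a manifest to a file. It assumes the `cloudflare` ClusterIssuer defined earlier, while the resource name, namespace, secret name, and file name are hypothetical placeholders.

```shell
#!/bin/sh
# Write a wildcard Certificate manifest that references the "cloudflare" ClusterIssuer.
# All names other than the issuer are placeholders.
cat > wildcard-cert.yaml <<'EOF'
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: default
spec:
  secretName: wildcard-example-com-tls
  dnsNames:
    - "*.example.com"
    - example.com
  issuerRef:
    name: cloudflare
    kind: ClusterIssuer
EOF

# Apply it once you have reviewed the file:
#   kubectl apply -f wildcard-cert.yaml
```

The resulting Secret lives in a single namespace, so workloads in other namespaces need their own copy or a shared ingress that terminates TLS for them.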

Token Scope

I strongly advise against using a Global API Key. If your Kubernetes cluster is compromised and you've stored a Global Key in a Secret, the attacker has full control over your entire Cloudflare account. Use a scoped API Token with the absolute minimum permissions: Zone.DNS:Edit and Zone.Zone:Read. For more on securing secrets in K8s, check out my post on SealedSecrets.

The CGNAT Wall

If you are truly behind CGNAT (where your WAN IP is shared with hundreds of other customers), no amount of DDNS will help. In that case, you have to stop fighting the network and switch to a tunnel. I've used Cloudflare Tunnels (cloudflared) for this, but if you want to keep traffic internal, a Tailscale subnet router is a better bet.

Summary of the Workflow

When I build out new infrastructure, I follow this sequence to avoid the pain I've described:

| Component | Tool | Purpose | Key Detail |
| --- | --- | --- | --- |
| Issuance | cert-manager | Automated TLS | Use scoped API Tokens, not Global Keys |
| Validation | Cloudflare DNS-01 | Zero-port exposure | Ensure `DNS:Edit` permissions are set |
| Resolution | K8s `dnsConfig` | Fix handshake errors | Set `ndots: 2` for external-facing pods |
| Reachability | Custom CronJob | Dynamic IP handling | Validate against private/CGNAT IPs before update |

The goal isn't just to get a green checkmark from Let's Encrypt. The goal is a system where the certificates are valid, the DNS resolves instantly, and the IP updates automatically without me having to touch a terminal. If you're building similar AI agent orchestration or IoT pipelines, getting the networking layer right is non-negotiable. If you need help architecting this for a production environment, you can find my infrastructure consulting services here.

The gap between the documentation and a working system is usually filled with these small, annoying details. The docs tell you how to install cert-manager; they don't tell you that ndots: 5 will make your certificates feel like they're broken. Focus on the resolution path and the API permissions, and the rest usually falls into place.
