I was trying to pull a model from an external registry into my cluster when the connection just died. No 404, no timeout, just a blunt SSL: CERTIFICATE_VERIFY_FAILED. The error message specifically complained about a RemoteCertificateNameMismatch.
THE SYMPTOM
The error was explicit: the certificate presented by the IP address I was hitting didn't match the hostname I requested. When I checked the logs, the client was attempting to connect to registry.ollama.ai, but the server responding was actually my local ingress controller, presenting a certificate for *.example.com.
WHAT I EXPECTED
I expected the pod to reach out to the public internet, hit the real Ollama registry, and pull the layers. Since the registry is a public endpoint, it should have bypassed my internal DNS logic entirely. I had a wildcard DNS entry set up for my internal domain, so I assumed any name that didn't exist internally would simply fail or resolve via the upstream provider.
WHAT ACTUALLY HAPPENED
The culprit was the Kubernetes default ndots:5 setting in /etc/resolv.conf.
In Kubernetes, the default DNS configuration is designed to prioritize internal service discovery. When a pod tries to resolve a name, the resolver counts the dots in that name. If the count is below the ndots value (which defaults to 5), the resolver doesn't treat the name as a Fully Qualified Domain Name (FQDN). Instead, it iteratively appends the search domains listed in the pod's /etc/resolv.conf to the hostname, checking whether a local service exists before going to the internet.
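For context, a pod's /etc/resolv.conf typically looks something like this (the nameserver IP, the namespace, and the example.com search entry here are assumptions about this particular cluster, not universal values):

```
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local example.com
options ndots:5
```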
Because registry.ollama.ai only has two dots, it failed the ndots:5 threshold. The resolver started a loop:
- registry.ollama.ai.namespace.svc.cluster.local -> NXDOMAIN
- registry.ollama.ai.svc.cluster.local -> NXDOMAIN
- registry.ollama.ai.cluster.local -> NXDOMAIN
- ...and so on.
The search path never got that far cleanly. One of my search domains sits in the same hierarchy as my wildcard DNS record (*.example.com), which points to my internal load balancer. When the resolver appended that domain to the hostname, the resulting name matched the wildcard, the query "succeeded", and the request was routed to my internal ingress. My ingress saw a request for a domain it "kinda" recognized via the wildcard logic and served the default wildcard certificate. The TLS handshake failed because registry.ollama.ai is definitely not *.example.com.
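The expansion behavior is easy to sketch. This hypothetical shell snippet mimics what the resolver does with a 2-dot name under ndots:5 (the search list, including the example.com entry, is an assumption about this cluster's resolv.conf):

```shell
# Simulate glibc-style search-list expansion for a name with
# fewer dots than the configured ndots threshold.
name="registry.ollama.ai"
search="default.svc.cluster.local svc.cluster.local cluster.local example.com"

# Count the dots in the name
dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)

if [ "$((dots))" -lt 5 ]; then
  # Fewer dots than ndots: each search domain is tried first...
  for domain in $search; do
    echo "query: $name.$domain"
  done
fi
# ...and only then the name as typed
echo "query: $name"
```

The fourth query, registry.ollama.ai.example.com, is exactly the name the wildcard record swallows.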
THE FIX
You can't just "fix" the global DNS settings without breaking service discovery for every other pod in the cluster. The fix is to target the specific workloads that need to reach external services and adjust their dnsConfig.
If you are hitting an external hostname with only one or two dots (like huggingface.co or registry.ollama.ai), you need to set ndots to a value no greater than the number of dots in your target hostname, so the resolver tries it as an absolute name first.
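A quick way to pick a safe value is to count the dots in the target hostname (a trivial sketch; substitute your own hostname):

```shell
# ndots must be <= the dot count for the name to be tried as absolute first
host="registry.ollama.ai"
ndots_needed=$(printf '%s' "$host" | tr -cd '.' | wc -c)
echo "$host has $((ndots_needed)) dots -> set ndots to at most $((ndots_needed))"
```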
For a deployment pulling from a registry, I updated the spec like this:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-worker
spec:
  template:
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "2"  # names with 2+ dots are tried as absolute first
      containers:
        - name: ollama
          image: ollama/ollama
          env:
            - name: OLLAMA_HOST
              value: "registry.ollama.ai"
```
If you're dealing with even simpler hostnames (a single dot), you might need ndots: 1.
To verify what's happening, I used nslookup from inside a debug pod, which is the easiest way to see the search path in action:
```shell
# Run a temporary debug pod
kubectl run dns-debug --rm -it --image=curlimages/curl -- /bin/sh

# Inside the pod, check the resolution path
nslookup registry.ollama.ai
```
If you see the resolver attempting to append .svc.cluster.local (and the other search domains) before finding the real IP, you know ndots is the problem. A quick cross-check: a trailing dot (nslookup registry.ollama.ai.) marks the name as absolute and skips the search list entirely, so if that resolves correctly while the bare name doesn't, the diagnosis is confirmed.
WHY THIS MATTERS
This isn't just a "homelab" problem. If you're running multi-tenant clusters or using complex service meshes, this behavior can cause massive latency spikes because of the extra DNS queries, or worse, silent routing errors.
When you introduce wildcard DNS for internal tooling—which is common when you want to automate infrastructure as code—you increase the surface area for these "accidental" matches.
The lesson: if a pod needs to talk to the outside world using a short hostname, don't trust the default ndots:5. Explicitly set the dnsConfig in your deployment manifest. It's a small bit of extra configuration that prevents a massive headache during TLS debugging.
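If you have several such workloads, the same dnsConfig can live in a standalone strategic-merge patch rather than each manifest (a sketch reusing the ollama-worker deployment from the example above):

```yaml
# ndots-patch.yaml -- apply with:
#   kubectl patch deployment ollama-worker --type=merge --patch-file ndots-patch.yaml
spec:
  template:
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "2"
```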