War Story: How We Fixed a DNS Resolution Issue in Kubernetes 1.32 for 100 Microservices
Last month, our platform team faced one of the most disruptive outages in recent memory: a DNS resolution failure in our Kubernetes 1.32 cluster that took down 100+ microservices, impacted 40% of customer traffic, and lasted 2 hours before we identified the root cause. This is the story of how we debugged it, fixed it, and made sure it can't happen again.
The Outage Begins
It started at 14:00 UTC with a flood of PagerDuty alerts: 500s from our API gateway, failed health checks on 80% of our microservices, and reports of customers unable to reach core features. Initial triage showed that every failing service was logging `dial tcp: lookup <service-name> on 10.96.0.10:53: no such host`. The common thread? All affected services ran on our newly upgraded Kubernetes 1.32 cluster, which we'd migrated to 3 days earlier.
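If you're ever triaging something similar, a quick-and-dirty way to gauge the blast radius is to grep recent pod logs for the resolver error. A sketch, assuming a prod namespace and a 15-minute window (both placeholders):

```bash
# Scan every pod in a namespace for recent "no such host" resolver errors.
# Namespace and time window are placeholders -- adjust to your environment.
for pod in $(kubectl get pods -n prod -o name); do
  if kubectl logs -n prod --since=15m --tail=200 "$pod" 2>/dev/null \
      | grep -q 'no such host'; then
    echo "DNS errors in ${pod}"
  fi
done
```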
Debugging the DNS Failure
We first checked CoreDNS, the cluster's DNS server. The CoreDNS pods were running and healthy, and their logs showed no errors. We then tested resolution from a throwaway debug pod, which worked fine, but the same lookup from an affected microservice pod failed with the same no-such-host error (commands below).
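The pod name in the second command is a placeholder, and the affected pod's image needs nslookup available for this to work:

```bash
# Lookup from a throwaway debug pod -- this worked:
kubectl run -i --tty --rm debug --image=busybox --restart=Never -- \
  nslookup my-service.default.svc.cluster.local

# The same lookup from an affected pod failed with "no such host":
kubectl exec -it my-app-5d9f7c6b8-x2k4j -- \
  nslookup my-service.default.svc.cluster.local
```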
Next, we compared `/etc/resolv.conf` between the debug pod and the affected pods. Both pointed at the CoreDNS ClusterIP (10.96.0.10) with the same search domains, yet DNS queries from the affected pods never seemed to reach CoreDNS: running tcpdump on a node hosting affected pods, we saw no DNS traffic leaving the pod's network namespace.
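Roughly what that comparison looked like (pod names are placeholders; the tcpdump requires root on the node):

```bash
# Compare resolver config between the healthy debug pod and an affected pod:
kubectl exec debug -- cat /etc/resolv.conf
kubectl exec my-app-5d9f7c6b8-x2k4j -- cat /etc/resolv.conf

# On the node hosting the affected pod, watch for queries headed to CoreDNS:
sudo tcpdump -ni any port 53 and host 10.96.0.10
```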
Root Cause: Kubernetes 1.32’s New DNS Policy Default
Two hours into the incident, we found the culprit in the Kubernetes 1.32 release notes (which we'd only skimmed during the migration): the default `ndots` value written into pod `resolv.conf` for pods with `dnsPolicy: ClusterFirst` had been raised from 1 to 5. The change was meant to speed up DNS resolution for multi-level domains, but it had an unintended side effect for us: a service FQDN like `my-service.default.svc.cluster.local` contains only 4 dots, and the resolver treats any name with fewer than `ndots` dots as relative, trying the search domains before the literal name. That turned our lookups into mangled names like `my-service.default.svc.cluster.local.default.svc.cluster.local`, which return NXDOMAIN and surfaced in our services as no such host.
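Here's the mechanism in miniature. The resolv.conf contents below are reconstructed for illustration; the trailing-dot lookup at the end is a handy sanity check while debugging, because a trailing dot marks the name as absolute and bypasses the search list entirely:

```bash
# A pod's /etc/resolv.conf after the upgrade (reconstructed):
#
#   nameserver 10.96.0.10
#   search default.svc.cluster.local svc.cluster.local cluster.local
#   options ndots:5
#
# With ndots:5, any name with fewer than 5 dots goes through the search
# list first, so my-service.default.svc.cluster.local (4 dots) becomes:
#
#   my-service.default.svc.cluster.local.default.svc.cluster.local
#   my-service.default.svc.cluster.local.svc.cluster.local
#   my-service.default.svc.cluster.local.cluster.local
#
# all of which are NXDOMAIN. A trailing dot skips the search list:
kubectl exec debug -- nslookup my-service.default.svc.cluster.local.
```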
The Fix
We had two options: roll back to Kubernetes 1.31 (an estimated 4-hour operation that risked more downtime) or apply a quick fix. We chose the latter: we added a `dnsConfig` block with `ndots: 1` to every microservice pod spec, overriding the new default of 5. Since `kubectl patch` operates on named resources, we listed the deployments by label and patched all 100 in parallel with a short script (below). Within 15 minutes, all services were back online, and DNS resolution worked as expected.
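A sketch of that script. The app=myapp label is a placeholder for however your deployments are selected, and the patch file is a standard strategic-merge patch:

```bash
# Write the ndots override as a patch file...
cat <<'EOF' > ndots-patch.yaml
spec:
  template:
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "1"
EOF

# ...then patch every matching deployment, 10 at a time. Each patch
# changes the pod template, which triggers a rolling restart -- that's
# what actually picks up the new resolv.conf.
kubectl get deployments -l app=myapp -o name \
  | xargs -P10 -I{} kubectl patch {} --patch-file ndots-patch.yaml
```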
Post-Incident Review
We conducted a blameless post-incident review and identified three key gaps:
1. We'd only skimmed the Kubernetes 1.32 release notes, missing the ndots change.
2. Our staging environment didn't mirror production's microservice count, so the issue never surfaced in testing.
3. We had no alerting for DNS resolution failures across pods.
We implemented three fixes:
1. Added a release-note review step to all cluster upgrade runbooks, with a focus on DNS, networking, and security changes.
2. Scaled our staging environment to match production's microservice count.
3. Added Prometheus alerts for dns_lookup_errors across all pods, with a dashboard to track DNS health (rule sketch below).
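For reference, a sketch of the alert rule. dns_lookup_errors is our internal shorthand; the rule below keys on coredns_dns_responses_total, the response metric recent CoreDNS releases export, and the 5% threshold is a starting point, not gospel (healthy clusters see some NXDOMAIN from normal search-path expansion):

```bash
# Prometheus rule file (sketch). Fires when more than 5% of cluster DNS
# responses over 5 minutes are SERVFAIL or NXDOMAIN.
cat <<'EOF' > dns-health-rules.yaml
groups:
  - name: dns-health
    rules:
      - alert: ClusterDNSErrorRateHigh
        expr: |
          sum(rate(coredns_dns_responses_total{rcode=~"SERVFAIL|NXDOMAIN"}[5m]))
            / sum(rate(coredns_dns_responses_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "More than 5% of cluster DNS queries are failing"
EOF
```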
Lessons Learned
- Always read the full release notes before a cluster upgrade, especially changes to core components like the kubelet and CoreDNS.
- Test cluster upgrades with production-like workloads to catch edge cases like DNS configuration changes.
- Override default DNS settings explicitly if your workloads depend on specific `resolv.conf` behavior.
- Monitor DNS resolution errors as a key metric for cluster health.
Conclusion
This outage taught us that even a small configuration change (like `ndots` going from 1 to 5) can have a massive impact at scale. By sharing this war story, we hope other teams can avoid the same pitfall when upgrading to Kubernetes 1.32, and will prioritize thorough testing and release-note review for every cluster change.