Hey there, fellow Kubernetes wranglers!
William here, and I'm glad to post my 3rd article! If you're knee-deep in cluster configs and scratching your head over mysterious DNS hiccups in your pods, this post is for you, written by a senior engineer who's spent more hours than I'd like debugging K8s networking quirks. This 5-minute read dives into a real-world saga of fixing DNS resolution delays in a production namespace (shoutout to "jasmine" – our quirky testbed for social media cron jobs). We'll cover the root cause, the fixes, the pitfalls (including why AI advice isn't always golden), and how we finally nailed it. By the end, you'll have actionable tips to tweak your pod DNS settings and avoid 5-second lookup nightmares. Let's debug!
Where it all started
It all kicked off in our Kubernetes cluster (running v1.20-ish, based on some trial-and-error revelations we'll get to). We had a namespace packed with services and a whopping 33 CronJobs handling everything from social media posts to email notifications. Think handle-socialmedia-video-posts firing every minute – high-frequency stuff.
Pods were running fine... until they weren't. Logs showed intermittent DNS resolution failures or delays. A simple curl to an internal service would hang for exactly 5 seconds before succeeding. This was a killer for our time-sensitive CronJobs; missed schedules meant delayed posts and unhappy users. At first, I blamed the network overlay (we're on Calico), but a kubectl exec into a pod followed by an nslookup revealed the culprit: every lookup was cycling through a string of unnecessary search-domain resolutions before the real name was even tried.
Enter the infamous ndots:5 default in /etc/resolv.conf. In K8s, any name with fewer than 5 dots (even my-service.default.svc.cluster.local only has 4) is first expanded through the search domains, generating lookups for names that don't exist, before it's finally tried as-is – and those dead-end queries are where the timeouts come from.
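For context, here's roughly what a pod's /etc/resolv.conf looks like in a cluster like ours (the values are illustrative for the jasmine namespace – your nameserver IP and search list will differ):
nameserver 10.96.0.10
search jasmine.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

# With ndots:5, a lookup for api.tiktok.com (2 dots, so fewer than 5) expands first:
#   api.tiktok.com.jasmine.svc.cluster.local  -> NXDOMAIN (or a timeout if the query is dropped)
#   api.tiktok.com.svc.cluster.local          -> NXDOMAIN (or a timeout)
#   api.tiktok.com.cluster.local              -> NXDOMAIN (or a timeout)
#   api.tiktok.com                            -> finally resolves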
Figuring out the reason
Diving deeper, I grepped pod logs and ran cat /etc/resolv.conf via kubectl exec. Sure enough: options ndots:5 timeout:2 attempts:2. That ndots:5 is the Kubernetes default, written into the pod by the kubelet. For our setup – short service names and frequent external API calls (e.g., to TikTok or YouTube) – it meant up to four extra failed lookups per request, each burning resolver time before the real query even went out, and that's where the 5-second delays came from.
After cross-referencing the official docs and a few debugging guides, it clicked: we needed to override this with dnsConfig in the pod spec, setting ndots:2 to prioritize absolute names and cut the fluff.
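In manifest form, that override looks roughly like this (a minimal sketch – the labels and image are placeholders, only the dnsConfig block matters):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-service
  namespace: jasmine
spec:
  selector:
    matchLabels:
      app: php-service
  template:
    metadata:
      labels:
        app: php-service
    spec:
      dnsPolicy: ClusterFirst          # keep the default policy; dnsConfig just tweaks it
      dnsConfig:
        options:
          - name: ndots
            value: "2"                 # overrides the kubelet's ndots:5
      containers:
        - name: php
          image: php:8-fpm             # placeholder image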
But here's the twist – for Deployments (like our php-service), it was straightforward. For CronJobs? The config lives deeper in the jobTemplate.
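To make the nesting concrete, here's a sketch of the same override on a CronJob (schedule, image, and command are illustrative) – note how dnsConfig sits under spec.jobTemplate.spec.template.spec:
apiVersion: batch/v1                   # batch/v1beta1 on ~1.20 clusters – check your API version!
kind: CronJob
metadata:
  name: handle-socialmedia-video-posts
  namespace: jasmine
spec:
  schedule: "* * * * *"                # every minute, as in our setup
  jobTemplate:
    spec:
      template:
        spec:
          dnsConfig:                   # two levels deeper than in a Deployment
            options:
              - name: ndots
                value: "2"
          containers:
            - name: worker
              image: php:8-cli         # placeholder image
              command: ["php", "post.php"]   # placeholder command
          restartPolicy: Never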
What really helped
The game-changer was patching the resources directly with kubectl patch. For Deployments:
kubectl patch deployment php-service -n jasmine --type=json -p='[{"op":"add","path":"/spec/template/spec/dnsConfig","value":{"options":[{"name":"ndots","value":"2"}]}}]'
This triggers a rollout, and new pods spawn with the updated /etc/resolv.conf. Bam – our php-service pods showed ndots:2 immediately after a kubectl rollout restart.
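If you want to double-check, something like this does the trick (it assumes the Deployment's pods carry an app=php-service label – adjust the selector to yours):
kubectl rollout status deployment/php-service -n jasmine
# grab one of the freshly created pods and inspect its resolv.conf
POD=$(kubectl get pods -n jasmine -l app=php-service -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$POD" -n jasmine -- cat /etc/resolv.conf | grep ndots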
For CronJobs, the path is longer (as per the API spec):
kubectl patch cronjob process-automatic-post -n jasmine --type=json -p='[{"op":"add","path":"/spec/jobTemplate/spec/template/spec/dnsConfig","value":{"options":[{"name":"ndots","value":"2"}]}}]'
To apply to all 33 at once:
kubectl get cronjob -n jasmine --no-headers -o name | xargs -I {} kubectl patch {} -n jasmine --type=json -p='[{"op":"add","path":"/spec/jobTemplate/spec/template/spec/dnsConfig","value":{"options":[{"name":"ndots","value":"2"}]}}]'
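A quick way to confirm the patch landed on every CronJob spec (before any new Jobs even run) is a jsonpath scan along these lines:
# prints each CronJob name followed by its dnsConfig options; an empty column means the patch didn't land
kubectl get cronjob -n jasmine -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.jobTemplate.spec.template.spec.dnsConfig.options}{"\n"}{end}'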
New Jobs from future schedules now create pods with ndots:2. To verify: kubectl create job --from=cronjob/<name> test-ndots -n jasmine and check the new pod's /etc/resolv.conf (or its spec).
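In practice the check looked something like this (process-automatic-post is one of our CronJobs; give the Job a moment to spawn its pod before the second command):
kubectl create job test-ndots --from=cronjob/process-automatic-post -n jasmine
# the Job controller labels its pods with job-name, so we can inspect the spec directly
kubectl get pods -n jasmine -l job-name=test-ndots -o jsonpath='{.items[0].spec.dnsConfig}'
kubectl delete job test-ndots -n jasmine   # clean up the test job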
Bonus: We scripted a checker to scan running pods:
NS=jasmine
# list Running pods, then pull the ndots value out of each pod's resolv.conf
kubectl get pods -n "$NS" --no-headers | awk '$3=="Running" {print $1}' | while read -r pod; do
  value=$(kubectl exec "$pod" -n "$NS" -- cat /etc/resolv.conf | grep -o 'ndots:[0-9]\+' | cut -d: -f2)
  [ "$value" = "2" ] && echo "✓ $pod → ndots:2" || echo "✗ $pod → ndots:${value:-5}"
done
This gave us instant feedback on which pods had picked up the change.
Major blockers
Oh boy, the roadblocks were real. Here's what nearly derailed us:
Reality check: GitHub issues
Scouring GitHub, we hit classics like intermittent 5s DNS delays (#56903) and calls to let the kubelet set a default ndots (#127137). One eye-opener was a CronJob-specific issue (#76790) where fresh pods failed DNS right after creation – exactly our symptom! These threads validated ndots as the villain, but they also warned about cluster-wide tweaks (e.g., via DaemonSet overrides) that could break other namespaces.
Don't blindly follow AI answers (we had to back it all out)
Early on, an AI suggested patching CronJobs with a bogus path: /spec/jobTemplate/spec/template/spec/podDNSConfig. Spoiler: podDNSConfig doesn't exist! It threw "unknown field" errors and wasted hours. We had to cross-verify against the official API docs and test manually. Lesson: AI is great for ideas, but always run kubectl explain cronjob.spec.jobTemplate.spec.template.spec.dnsConfig to confirm paths.
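For reference, these are the sanity checks that would have saved us hours (plain kubectl explain, nothing exotic – run them against your own cluster):
# valid path for Deployments
kubectl explain deployment.spec.template.spec.dnsConfig
# valid path for CronJobs – note the extra jobTemplate.spec level
kubectl explain cronjob.spec.jobTemplate.spec.template.spec.dnsConfig
# the bogus field the AI suggested – kubectl explain errors out on it
kubectl explain cronjob.spec.jobTemplate.spec.template.spec.podDNSConfig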
Key Updates
As we iterated, we added monitoring: a bash loop running the checker every 30s, logging to ndots-full.log with timestamps. This captured the transition – from all pods at ndots:5 to a mix, then fully fixed.
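The loop itself was nothing fancy – roughly this, assuming the checker above is saved as check-ndots.sh (that file name is just what we happened to use):
# append a timestamped snapshot of pod ndots values every 30 seconds
while true; do
  echo "=== $(date '+%Y-%m-%d %H:%M:%S') ===" >> ndots-full.log
  ./check-ndots.sh >> ndots-full.log 2>&1
  sleep 30
done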
We also patched the sneaky cron-service Deployment (a long-running worker mimicking cron behavior) the same way as the others. And for future-proofing, we eyed cluster-wide defaults via kubelet flags, but stuck to per-resource patches to avoid side effects.
Successful resolution and a big hug
After the correct patches, victory! New CronJob pods (e.g., process-automatic-posts-118ch) finally showed ndots:2, slashing DNS times from 5s to <100ms. Our social feeds flowed smoothly again, and there were no more missed schedules. A big virtual hug to the K8s community – docs, GitHub threads, and even Stack Overflow rants saved the day. Special thanks to that one issue commenter who yelled "CHECK YOUR API VERSION!" – you were right.
Conclusion
Troubleshooting K8s DNS is a rite of passage: start with logs, verify configs, and don't trust unverified paths (AI or otherwise). Key takeaway: set ndots:2 via dnsConfig for snappy resolution in high-lookup workloads. Test with manual Jobs, monitor rollouts, and always consult the API reference. If you're hitting similar snags, drop a comment – let's debug together. Happy clustering! 🚀