Why I Wrote This Article — The Incident That Sparked Everything
This article didn’t come from curiosity.
It came from pain.
One morning, I received a message from the DevOps team:
“Some services are failing to resolve hostnames again —
we’re getting Temporary failure in name resolution.”
And this wasn’t the first time.
It had happened before — randomly, unpredictably, quietly causing latency and connection failures.
As the new Cloud Architect, the responsibility landed on my desk:
“We need this fixed forever — no more band-aids.”
So I started investigating.
(NOTE: I’ll publish a second article soon with the full debugging journey.)
Nothing looked broken at first.
Pods healthy. Cluster stable. CoreDNS replicas running.
No crashes. No alerts.
But something felt off — so I went deep into metrics.
And there it was:
CoreDNS wasn’t resolving —
it was drowning in NXDOMAIN.
Thousands per second.
It wasn’t an outage.
It was a storm, a silent performance killer: around 80% of our DNS queries were coming back with response code NXDOMAIN.
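Here is roughly how that ratio shows up in CoreDNS metrics; a sketch only, the Prometheus URL is a placeholder and the exact metric labels can differ between CoreDNS versions:

# Sketch: compute the NXDOMAIN share from CoreDNS metrics via the Prometheus
# HTTP API. The Prometheus URL is a placeholder; metric names/labels may
# differ slightly between CoreDNS versions.
import requests

PROM = "http://prometheus.monitoring:9090"  # placeholder

def instant(query: str) -> float:
    r = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    r.raise_for_status()
    return float(r.json()["data"]["result"][0]["value"][1])

nxdomain = instant('sum(rate(coredns_dns_responses_total{rcode="NXDOMAIN"}[5m]))')
total = instant('sum(rate(coredns_dns_responses_total[5m]))')
print(f"NXDOMAIN share: {nxdomain / total:.0%}")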
And the storm had one surprising source…
🕵️ The Real Breakthrough — It Was One Hostname
When I traced DNS traffic volume by hostname,
the data made me stop.
It wasn’t many hostnames.
It wasn’t dozens.
It was one: between 80% and 90% of the DNS queries were for a single host.
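One way to get this per-hostname view is to tally CoreDNS query logs; a sketch, assuming the CoreDNS log plugin is enabled with its default format:

# Sketch: tally queried names from CoreDNS logs to find the dominant hostname.
# Assumes the CoreDNS `log` plugin is enabled with its default format; adjust
# the regex if your log format differs.
# Usage (pod selector is the usual one for CoreDNS, adjust for your cluster):
#   kubectl logs -n kube-system -l k8s-app=kube-dns --tail=-1 | python3 top_dns_names.py
import re
import sys
from collections import Counter

QUERY = re.compile(r'"[A-Z]+ IN (\S+) ')  # e.g. "A IN rabbitmq.eu-west-1... udp ..."

names = Counter()
for line in sys.stdin:
    match = QUERY.search(line)
    if match:
        names[match.group(1)] += 1

for qname, count in names.most_common(10):
    print(f"{count:8d}  {qname}")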
Our RabbitMQ endpoint — the heart of our event-driven system — contained only four dots:
rabbitmq.eu-west-1.aws.company.production
And with the Kubernetes default of ndots=5,
four dots meant the resolver didn't treat it as a fully qualified name.
Instead, Kubernetes expanded it through every search domain in the pod:
rabbitmq.eu-west-1.aws.company.production.default.svc.cluster.local ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production.svc.cluster.local ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production.cluster.local ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production.eu-west-1.compute.internal ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production ✅ finally correct
For each failed attempt:
- an A lookup ❌
- an AAAA lookup ❌
🟡 4 to 8 extra DNS queries for every single valid lookup (see the sketch below)
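To make the blow-up concrete, here is a minimal sketch of the expansion; the search domains are the pod defaults shown later in this article, and the resolver behavior is simulated rather than queried:

# Minimal sketch: simulate how the resolver expands a name that has fewer dots
# than ndots. Search domains are the EKS pod defaults shown later in this
# article; no real DNS queries are sent.
HOST = "rabbitmq.eu-west-1.aws.company.production"  # 4 dots < ndots:5
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "eu-west-1.compute.internal",
]

candidates = [f"{HOST}.{domain}" for domain in SEARCH] + [HOST]
queries = [(name, rtype) for name in candidates for rtype in ("A", "AAAA")]

for name, rtype in queries:
    print(f"{rtype:4} {name}")

# 5 candidate names x 2 record types = 10 queries, 8 of them guaranteed NXDOMAIN,
# for every single successful lookup.
print(f"\n{len(queries)} queries total, {2 * len(SEARCH)} of them wasted")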
RabbitMQ is used everywhere — messaging, telemetry, queues, notifications.
So every single lookup meant more queries → more NXDOMAIN → more pressure on CoreDNS.
We weren’t resolving DNS.
We were manufacturing DNS traffic.
⚡ The One-Character Fix That Saved Us
Under pressure and needing a fast mitigation,
I tried a tiny change that felt almost silly:
I added a trailing dot to the hostname.
Just one dot:
rabbitmq.eu-west-1.aws.company.production.
That trailing dot tells the Linux resolver:
“This is a fully qualified domain.
Do not apply search paths.”
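You can see that distinction with dnspython (the same library the benchmark below uses); the hostname here is just our endpoint, for illustration:

# Quick check with dnspython: a name with a trailing dot is parsed as absolute,
# so search-domain expansion is skipped entirely.
import dns.name

configured = "rabbitmq.eu-west-1.aws.company.production"  # relative name
fqdn = configured + "."                                   # absolute name

print(dns.name.from_text(configured, origin=None).is_absolute())  # False -> search list applies
print(dns.name.from_text(fqdn, origin=None).is_absolute())        # True  -> queried as-is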
The effect was instant:
- ❌ NXDOMAIN flood dropped immediately
- 💡 CoreDNS CPU reduced by ≈50%
- ⚡ Lookup performance improved ~5x
- 🧘 Zero failures since
- 😊 Developers finally stopped pinging me about DNS issues
We didn’t scale DNS.
We didn’t tune CoreDNS.
We didn’t rewrite applications.
We removed unnecessary work.
One dot → stability restored.
We later applied additional DNS optimizations to handle even larger query loads
— more on that in the next article.
🔍 The Root Cause: ndots in /etc/resolv.conf
Every Kubernetes pod has a resolver config like:
search default.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
nameserver 172.20.0.10
options ndots:5
The ndots value controls how many dots a hostname must contain before the resolver tries it as-is.
If the hostname's dot count is < ndots → the search domains are tried first, one by one, before the literal name.
This Kubernetes default exists to support internal service discovery:
service → service.default.svc.cluster.local → resolves successfully
But for external hostnames?
🚩 Disaster waiting to happen.
🧪 Benchmark — Measured Results
I used this Python script to test the ndots effect:
#!/usr/bin/env python3
import argparse
import asyncio
import time

import dns.asyncresolver


async def main() -> None:
    parser = argparse.ArgumentParser(
        description="Measure DNS lookup time for multiple queries using current resolv.conf (ndots/search)."
    )
    parser.add_argument("host", help="Hostname to resolve (bare name to exercise ndots/search)")
    parser.add_argument(
        "-n",
        "--queries",
        type=int,
        default=100,
        help="Number of concurrent queries to issue (default: 100)",
    )
    parser.add_argument(
        "-t",
        "--timeout",
        type=float,
        default=2.0,
        help="Per-query timeout in seconds (default: 2.0)",
    )
    args = parser.parse_args()

    resolver = dns.asyncresolver.Resolver()  # uses /etc/resolv.conf (ndots/search respected)
    resolver.timeout = args.timeout
    resolver.lifetime = args.timeout
    resolver.use_edns(False)  # disable EDNS for these test queries

    async def one_query() -> None:
        try:
            await resolver.resolve(args.host, "A", search=True)
        except Exception:
            # Ignore failures; we only care about timing behavior.
            pass

    tasks = [asyncio.create_task(one_query()) for _ in range(args.queries)]
    start = time.monotonic()
    await asyncio.gather(*tasks)
    elapsed = time.monotonic() - start

    print(f"{args.queries} queries for '{args.host}' in {elapsed:.3f}s ({elapsed/args.queries:.4f}s/query)")


if __name__ == "__main__":
    asyncio.run(main())
Python async DNS resolver test:
Before fix (no trailing dot)
python ndots_async_bench.py rabbitmq.eu-west-1.aws.xxxxx.production -n 100
100 queries for 'rabbitmq.eu-west-1.aws.xxxxx.production' in 2.207s (0.0221s/query)
python ndots_async_bench.py rabbitmq.eu-west-1.aws.xxxxx.production -n 10
10 queries for 'rabbitmq.eu-west-1.aws.xxxxx.production' in 0.302s (0.0302s/query)
100 queries → 2.207s (0.0221 s/query)
10 queries → 0.302s (0.0302 s/query)
After fix (trailing dot → FQDN)
rabbitmq.eu-west-1.aws.company.production.
python ndots_async_bench.py rabbitmqc.eu-west-1.aws.xxxxx.production. -n 100
100 queries for 'rabbitmqc.eu-west-1.aws.xxxxx.production.' in 0.399s (0.0040s/query)
python ndots_async_bench.py rabbitmqc.eu-west-1.aws.xxxxx.production. -n 10
10 queries for 'rabbitmqc.eu-west-1.aws.xxxxx.production.' in 0.095s (0.0095s/query)
100 queries → 0.399s (0.0040 s/query)
10 queries → 0.095s (0.0095 s/query)
🚀 DNS became ~5x faster (0.0221 s/query → 0.0040 s/query)
🧨 NXDOMAIN traffic dropped nearly in half
🛠 Fixing the Problem (Two Options)
1️⃣ Use Fully Qualified Domain Names with a trailing dot
Examples:
api.company.internal.
googleapis.com.
rabbitmq.eu-west-1.aws.company.production.
✔ Easiest fix
✔ No Kubernetes changes
✔ Zero search-domain expansion
✔ Best performance
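If the hostname comes from configuration, a tiny normalization helper keeps this consistent across services; this is a hypothetical sketch, and the env var name and default value are illustrative:

# Hypothetical helper: normalize externally configured hostnames to absolute
# form before handing them to clients. Env var and default are illustrative.
import os


def as_fqdn(host: str) -> str:
    """Append a trailing dot so the resolver skips search-domain expansion."""
    return host if host.endswith(".") else host + "."


rabbitmq_host = as_fqdn(os.environ.get("RABBITMQ_HOST", "rabbitmq.eu-west-1.aws.company.production"))
print(rabbitmq_host)  # rabbitmq.eu-west-1.aws.company.production.

One thing to verify per client: some libraries and TLS hostname checks are strict about trailing dots, so test the change service by service.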
2️⃣ Reduce ndots for external workloads
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
AWS docs state:
You can reduce the number of requests to CoreDNS by lowering the ndots option of your workload or fully qualifying your domain requests by including a trailing . (e.g. api.example.com. ).
📘 References
Kubernetes Docs
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
Linux Resolver
https://man7.org/linux/man-pages/man5/resolv.conf.5.html
🎯 Final Thought
Sometimes the biggest reliability problems
come from the smallest defaults.
ndots=5 is perfect for Kubernetes internal services…
but for external hostnames it can quietly overwhelm DNS
and drag performance down across the entire cluster.
One dot fixed everything.
Fix it once → enjoy peace and performance forever.
💬 If you'd like Part 2 (the full debugging journey — how I traced and proved the root cause), comment below:
Show me the debugging story