Ahmed Shendy

🚀 The Hidden DNS Misconfiguration That Was Killing Performance in Our EKS Cluster (and How We Fixed it)

Why I Wrote This Article — The Incident That Sparked Everything

This article didn’t come from curiosity.

It came from pain.

One morning, I received a message from the DevOps team:

“Some services are failing to resolve hostnames again —

we’re getting Temporary failure in name resolution.”

And this wasn’t the first time.

It had happened before — randomly, unpredictably, quietly causing latency and connection failures.

As the new Cloud Architect, the responsibility landed on my desk:

“We need this fixed forever — no more band-aids.”

So I started investigating.

(NOTE: I’ll publish a second article soon with the full debugging journey.)

Nothing looked broken at first.

Pods healthy. Cluster stable. CoreDNS replicas are running.

No crashes. No alerts.

But something felt off — so I went deep into metrics.

And there it was:

CoreDNS wasn’t resolving —

it was drowning in NXDOMAIN.

Thousands per second.

[Figure: NXDOMAIN vs. NOERROR responses over time (queries per second)]

It wasn’t an outage.

It was a storm, a silent performance killer: around 80% of our DNS queries were coming back with response code NXDOMAIN.

And the storm had one surprising source…

🕵️ The Real Breakthrough — It Was One Hostname

When I traced DNS traffic volume by hostname,

the data made me stop.

It wasn’t many hostnames.

It wasn’t dozens.

It was one. Roughly 80% to 90% of all DNS queries were for a single hostname.

[Figure: NXDOMAIN vs. NOERROR responses over time for the RabbitMQ hostname (queries per second)]

Our RabbitMQ endpoint — the heart of our event-driven system — contained only four dots:

rabbitmq.eu-west-1.aws.company.production

And with the Kubernetes default of ndots:5,

four dots fell below the threshold, so the resolver didn't treat it as a fully qualified domain name.

Instead, the pod's resolver tried it against every search domain first:

rabbitmq.eu-west-1.aws.company.production.default.svc.cluster.local ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production.svc.cluster.local ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production.cluster.local ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production.eu-west-1.compute.internal ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production ✅ finally correct

For each attempt:

  • A lookup ❌
  • AAAA lookup ❌

🟡 4 to 8 extra DNS queries for every single valid lookup (4 failed search attempts, doubled when both A and AAAA records are requested)
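
The expansion above can be reproduced in a few lines of Python. This is a minimal sketch of the resolver's ordering rule, not the actual glibc code. `candidate_names` is a hypothetical helper, and the search list is assumed to match the pod's /etc/resolv.conf shown later in this article.

```python
def candidate_names(host: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return the names a glibc-style resolver would try, in order (simplified)."""
    # A trailing dot marks the name as absolute: the search list is skipped.
    if host.endswith("."):
        return [host]
    expanded = [f"{host}.{domain}." for domain in search]
    absolute = f"{host}."
    # Fewer dots than ndots: search domains first, the name as written last.
    if host.count(".") < ndots:
        return expanded + [absolute]
    # Enough dots: the name as written is tried first.
    return [absolute] + expanded


# Assumed search list, copied from the resolv.conf example in this article.
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "eu-west-1.compute.internal",
]

names = candidate_names("rabbitmq.eu-west-1.aws.company.production", SEARCH)
for name in names:
    print(name)

wasted = len(names) - 1  # every attempt before the final one is an NXDOMAIN
print(f"wasted lookups: {wasted} (A only) to {wasted * 2} (A + AAAA)")
```

Calling the same helper with the trailing-dot form of the hostname returns a single candidate, which is the whole point of the fix described next.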

RabbitMQ is used everywhere — messaging, telemetry, queues, notifications.

So every millisecond meant more queries → more NXDOMAIN → more pressure.

We weren’t resolving DNS.

We were manufacturing DNS traffic.

⚡ The One-Character Fix That Saved Us

Under pressure and needing a fast mitigation,

I tried a tiny change that felt almost silly:

I added a trailing dot to the hostname.

Just one dot:

rabbitmq.eu-west-1.aws.company.production.

That trailing dot tells the Linux resolver:

“This is a fully qualified domain.

Do not apply search paths.”

The effect was instant:

  • ❌ NXDOMAIN flood dropped immediately
  • 💡 CoreDNS CPU reduced by ≈50%
  • ⚡ Lookup performance improved ~5x
  • 🧘 Zero failures since
  • 😊 Developers finally stopped pinging me about DNS issues

We didn’t scale DNS.

We didn’t tune CoreDNS.

We didn’t rewrite applications.

We removed unnecessary work.

One dot → stability restored.

We later applied additional DNS optimizations to handle even larger query loads

— more on that in the next article.

🔍 The Root Cause: ndots in /etc/resolv.conf

Every Kubernetes pod using the default ClusterFirst DNS policy gets an /etc/resolv.conf like this:

search default.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
nameserver 172.20.0.10
options ndots:5

The ndots value controls how many dots a hostname must contain before the resolver tries it as an absolute name first.

If hostname dot-count < ndots → the search domains are tried first, and the name as written is tried last
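
To check which of your own hostnames fall under the threshold, the rule can be applied to a pod's resolver config directly. The following is a rough sketch: `parse_resolv_conf` and `search_first` are hypothetical helpers, the parser only handles the three directives shown above, and the sample text is the config from this article.

```python
RESOLV_CONF = """\
search default.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
nameserver 172.20.0.10
options ndots:5
"""


def parse_resolv_conf(text: str) -> dict:
    """Extract nameservers, search list, and ndots from resolv.conf text."""
    config = {"nameservers": [], "search": [], "ndots": 1}  # resolver default is ndots:1
    for raw in text.splitlines():
        fields = raw.split()
        # Skip blank lines and comment lines.
        if not fields or fields[0].startswith(("#", ";")):
            continue
        keyword, values = fields[0], fields[1:]
        if keyword == "nameserver" and values:
            config["nameservers"].append(values[0])
        elif keyword == "search":
            config["search"] = values
        elif keyword == "options":
            for option in values:
                if option.startswith("ndots:"):
                    config["ndots"] = int(option.split(":", 1)[1])
    return config


def search_first(host: str, ndots: int) -> bool:
    """True if the search list is tried before the name as written."""
    return not host.endswith(".") and host.count(".") < ndots


config = parse_resolv_conf(RESOLV_CONF)
for host in (
    "service",
    "googleapis.com",
    "rabbitmq.eu-west-1.aws.company.production",
    "rabbitmq.eu-west-1.aws.company.production.",
):
    order = "search list first" if search_first(host, config["ndots"]) else "as written first"
    print(f"{host} -> {order}")
```

In a running pod, the same function could be fed the contents of /etc/resolv.conf instead of the inline sample.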

This Kubernetes default exists to support internal service discovery:

service → service.default.svc.cluster.local → resolves successfully

But for external hostnames?

🚩 Disaster waiting to happen.

🧪 Benchmark — Measured Results
I used this Python script (it requires the dnspython package) to measure the ndots effect:

#!/usr/bin/env python3
import argparse
import asyncio
import time

import dns.asyncresolver


async def main() -> None:
    parser = argparse.ArgumentParser(
        description="Measure DNS lookup time for multiple queries using current resolv.conf (ndots/search)."
    )
    parser.add_argument("host", help="Hostname to resolve (bare name to exercise ndots/search)")
    parser.add_argument(
        "-n",
        "--queries",
        type=int,
        default=100,
        help="Number of concurrent queries to issue (default: 100)",
    )
    parser.add_argument(
        "-t",
        "--timeout",
        type=float,
        default=2.0,
        help="Per-query timeout in seconds (default: 2.0)",
    )
    args = parser.parse_args()

    resolver = dns.asyncresolver.Resolver()  # uses /etc/resolv.conf (ndots/search respected)
    resolver.timeout = args.timeout
    resolver.lifetime = args.timeout
    resolver.use_edns(False)  # use_edns is a method; assigning to it would only shadow it

    async def one_query() -> None:
        try:
            await resolver.resolve(args.host, "A", search=True)
        except Exception:
            # Ignore failures; we only care about timing behavior.
            pass

    tasks = [asyncio.create_task(one_query()) for _ in range(args.queries)]
    start = time.monotonic()
    await asyncio.gather(*tasks)
    elapsed = time.monotonic() - start
    print(f"{args.queries} queries for '{args.host}' in {elapsed:.3f}s ({elapsed/args.queries:.4f}s/query)")


if __name__ == "__main__":
    asyncio.run(main())

Python async DNS resolver test:

Before fix (no trailing dot)

python ndots_async_bench.py rabbitmq.eu-west-1.aws.xxxxx.production -n 100
100 queries for 'rabbitmq.eu-west-1.aws.xxxxx.production' in 2.207s (0.0221s/query) 

python ndots_async_bench.py rabbitmq.eu-west-1.aws.xxxxx.production -n 10
10 queries for 'rabbitmq.eu-west-1.aws.xxxxx.production' in 0.302s (0.0302s/query)

100 queries → 2.207s (0.0221 s/query)
10 queries → 0.302s (0.0302 s/query)

After fix (trailing dot → FQDN)
rabbitmq.eu-west-1.aws.company.production.

python ndots_async_bench.py rabbitmqc.eu-west-1.aws.xxxxx.production. -n 100
100 queries for 'rabbitmqc.eu-west-1.aws.xxxxx.production.' in 0.399s (0.0040s/query) 
python ndots_async_bench.py rabbitmqc.eu-west-1.aws.xxxxx.production. -n 10 
10 queries for 'rabbitmqc.eu-west-1.aws.xxxxx.production.' in 0.095s (0.0095s/query)

100 queries → 0.399s (0.0040 s/query)
10 queries → 0.095s (0.0095 s/query)
🚀 DNS became ~5x faster (about 5.5x at 100 concurrent queries, about 3.2x at 10)
🧨 NXDOMAIN traffic dropped nearly in half
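
For transparency, here is how the speedup follows from the benchmark totals above. The numbers are copied from the two runs; the dictionary names are just for illustration.

```python
# Total elapsed seconds from the benchmark runs, keyed by query count.
before = {100: 2.207, 10: 0.302}  # no trailing dot
after = {100: 0.399, 10: 0.095}   # trailing dot (FQDN)

speedup = {n: before[n] / after[n] for n in before}

for n in sorted(speedup):
    per_query_before_ms = before[n] / n * 1000
    per_query_after_ms = after[n] / n * 1000
    print(
        f"{n} queries: {per_query_before_ms:.1f} ms -> "
        f"{per_query_after_ms:.1f} ms per query ({speedup[n]:.1f}x faster)"
    )
```

The gain is larger at higher concurrency, which is consistent with the search-list expansion multiplying the number of in-flight queries.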

[Figure: NXDOMAIN vs. NOERROR responses over time (queries per second) after the fix for one hostname]

🛠 Fixing the Problem (Two Options)

1️⃣ Use Fully Qualified Domain Names with a trailing dot

Examples:

api.company.internal.
googleapis.com.
rabbitmq.eu-west-1.aws.company.production.

✔ Easiest fix
✔ No Kubernetes changes
✔ Zero search-domain expansion
✔ Best performance
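
If hostnames reach your application through config files or environment variables, a small normalizer can apply this option in one place. This is only a sketch: `to_fqdn` is a hypothetical helper, and it assumes the value is a bare hostname or IP address with no port or URL scheme attached.

```python
import ipaddress


def to_fqdn(host: str) -> str:
    """Append a trailing dot so the resolver skips the search list."""
    # Leave empty values and already-absolute names untouched.
    if not host or host.endswith("."):
        return host
    # IP literals are never resolved through DNS, so they get no dot.
    try:
        ipaddress.ip_address(host)
        return host
    except ValueError:
        return host + "."


print(to_fqdn("rabbitmq.eu-west-1.aws.company.production"))
print(to_fqdn("rabbitmq.eu-west-1.aws.company.production."))
print(to_fqdn("172.20.0.10"))
```

Some clients may treat a trailing dot differently in TLS hostname checks or HTTP Host headers, so it is worth testing each client before rolling this out widely.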

2️⃣ Reduce ndots for external workloads

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"

AWS docs state:

You can reduce the number of requests to CoreDNS by lowering the ndots option of your workload or fully qualifying your domain requests by including a trailing . (e.g. api.example.com. ).
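
If you generate or patch manifests programmatically, the same dnsConfig option can be set on a pod spec represented as a plain dictionary. The sketch below assumes nothing beyond the standard library; `with_ndots`, the container name, and the image are hypothetical.

```python
import copy
import json


def with_ndots(pod_spec: dict, ndots: int) -> dict:
    """Return a copy of a pod spec with dnsConfig ndots set to the given value."""
    spec = copy.deepcopy(pod_spec)
    options = spec.setdefault("dnsConfig", {}).setdefault("options", [])
    # Drop any existing ndots entry so the option appears exactly once.
    options[:] = [opt for opt in options if opt.get("name") != "ndots"]
    # Kubernetes expects option values as strings.
    options.append({"name": "ndots", "value": str(ndots)})
    return spec


# Hypothetical pod spec for a workload that mostly calls external hostnames.
pod_spec = {
    "containers": [{"name": "worker", "image": "example/worker:latest"}],
}

patched = with_ndots(pod_spec, 2)
print(json.dumps(patched, indent=2))
```

With ndots set to 2, a four-dot name such as the RabbitMQ endpoint is tried as written first, while short in-cluster names with fewer than two dots still go through the search list.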

📘 References

Kubernetes Docs
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/

AWS EKS
https://docs.aws.amazon.com/eks/latest/best-practices/scale-cluster-services.html#:~:text=Reduce%20external%20queries%20by%20lowering%20ndots

Linux Resolver
https://man7.org/linux/man-pages/man5/resolv.conf.5.html

🎯 Final Thought

Sometimes the biggest reliability problems
come from the smallest defaults.

ndots=5 is perfect for Kubernetes internal services…
but for external hostnames it can quietly overwhelm DNS
and drag performance down across the entire cluster.

One dot fixed everything.

Fix it once → enjoy peace and performance forever.

💬 If you'd like Part 2 (the full debugging journey — how I traced and proved the root cause), comment below:

Show me the debugging story

Top comments (3)

KamalMostafa

Good read, never understood the trailing dots in DNS. now it makes sense why its needed. What I liked the most is that its not AI written. keep writing plz.

Ahmed Shendy

Many thanks for your comment Kamal

Kevin Chapman

Show me the debugging story