Ahmed Shendy

🚀 The Hidden DNS Misconfiguration That Was Killing Performance in Our EKS Cluster (and How We Fixed it)

Why I Wrote This Article — The Incident That Sparked Everything

This article didn’t come from curiosity.

It came from pain.

One morning, I received a message from the DevOps team:

“Some services are failing to resolve hostnames again — we’re getting Temporary failure in name resolution.”

And this wasn’t the first time.

It had happened before — randomly, unpredictably, quietly causing latency and connection failures.

As the new Cloud Architect, the responsibility landed on my desk:

“We need this fixed forever — no more band-aids.”

So I started investigating.

(NOTE: I’ll publish a second article soon with the full debugging journey.)

Nothing looked broken at first.

Pods healthy. Cluster stable. CoreDNS replicas running.

No crashes. No alerts.

But something felt off — so I went deep into metrics.

And there it was:

CoreDNS wasn’t resolving —

it was drowning in NXDOMAIN.

Thousands per second.

NXDOMAIN vs NOERROR over time (queries per second)

It wasn’t an outage.

It was a storm — a silent performance killer: around 80% of our DNS queries were returning NXDOMAIN.

And the storm had one surprising source…

🕵️ The Real Breakthrough — It Was One Hostname

When I traced DNS traffic volume by hostname,

The data made me stop.

It wasn’t many hostnames.

It wasn’t dozens.

It was one. Around 80% to 90% of the DNS queries were for a single host.

NXDOMAIN vs NOERROR over time for the RabbitMQ hostname (queries per second)

Our RabbitMQ endpoint — the heart of our event-driven system — contained only four dots:

rabbitmq.eu-west-1.aws.company.production

And with the Kubernetes default ndots=5,

the resolver didn’t treat it as a fully qualified domain.

Instead, Kubernetes expanded it through every search domain in the pod:

rabbitmq.eu-west-1.aws.company.production.default.svc.cluster.local ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production.svc.cluster.local ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production.cluster.local ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production.eu-west-1.compute.internal ❌ NXDOMAIN
rabbitmq.eu-west-1.aws.company.production ✅ finally correct

For each attempt:

  • An A-record lookup ❌
  • An AAAA-record lookup ❌

🟡 4 to 8 extra DNS queries for every single valid lookup
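
To make the mechanics concrete, here is a simplified Python sketch of the candidate order the glibc resolver walks through (the search domains mirror the pod resolv.conf shown later in this article; real resolvers have more edge cases):

# Simplified sketch of glibc-style search-list expansion (not the real resolver).
NDOTS = 5
SEARCH_DOMAINS = [  # mirrors the pod's /etc/resolv.conf shown below
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "eu-west-1.compute.internal",
]

def candidates(name: str):
    if name.endswith("."):                # trailing dot: absolute name, no expansion
        return [name]
    expanded = [f"{name}.{domain}" for domain in SEARCH_DOMAINS]
    if name.count(".") >= NDOTS:          # "enough" dots: try the name as-is first
        return [name] + expanded
    return expanded + [name]              # otherwise: search list first, name last

for candidate in candidates("rabbitmq.eu-west-1.aws.company.production"):
    print(candidate)                      # each candidate is queried for A and AAAA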

RabbitMQ is used everywhere — messaging, telemetry, queues, notifications.

So every millisecond meant more queries → more NXDOMAIN → more pressure.

We weren’t resolving DNS.

We were manufacturing DNS traffic.

⚡ The One-Character Fix That Saved Us

Under pressure and needing a fast mitigation,

I tried a tiny change that felt almost silly:

I added a trailing dot to the hostname.

Just one dot:

rabbitmq.eu-west-1.aws.company.production.

That trailing dot tells the Linux resolver:

“This is a fully qualified domain.

Do not apply search paths.”
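
You can see the same distinction programmatically. For example, with dnspython (the same library used in the benchmark script below), a name with a trailing dot is parsed as absolute, while the bare form stays relative and is eligible for search-list expansion:

# Sketch using dnspython: a trailing dot marks the name as absolute (rooted),
# so the resolver will not append any search domains to it.
import dns.name

bare = dns.name.from_text("rabbitmq.eu-west-1.aws.company.production", origin=None)
fqdn = dns.name.from_text("rabbitmq.eu-west-1.aws.company.production.", origin=None)

print(bare.is_absolute())  # False -> subject to ndots/search expansion
print(fqdn.is_absolute())  # True  -> resolved exactly as written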

The effect was instant:

  • ❌ NXDOMAIN flood dropped immediately
  • 💡 CoreDNS CPU reduced by ≈50%
  • ⚡ Lookup performance improved ~5x
  • 🧘 Zero failures since
  • 😊 Developers finally stopped pinging me about DNS issues

We didn’t scale DNS.

We didn’t tune CoreDNS.

We didn’t rewrite applications.

We removed unnecessary work.

One dot → stability restored.
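
The change itself was simply appending the dot wherever the hostname is configured. As a purely illustrative sketch (the helper and variable names are hypothetical, not from our codebase):

# Hypothetical helper: normalize external hostnames to absolute form so the
# resolver skips search-domain expansion entirely.
def as_fqdn(host: str) -> str:
    return host if host.endswith(".") else host + "."

RABBITMQ_HOST = as_fqdn("rabbitmq.eu-west-1.aws.company.production")
print(RABBITMQ_HOST)  # rabbitmq.eu-west-1.aws.company.production.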

We later applied additional DNS optimizations to handle even larger query loads

— more on that in the next article.

🔍 The Root Cause: ndots in /etc/resolv.conf

Every Kubernetes pod has a resolver config like:

search default.svc.cluster.local svc.cluster.local cluster.local eu-west-1.compute.internal
nameserver 172.20.0.10
options ndots:5

The ndots value controls how many dots must exist in a hostname before it is treated as an absolute FQDN.

If the hostname's dot count < ndots → the search domains are appended.

This Kubernetes default exists to support internal service discovery:

service → service.default.svc.cluster.local → resolves successfully

But for external hostnames?

🚩 Disaster waiting to happen.
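
Before changing anything, it helps to check what a given pod actually sees. Here is a minimal sketch that parses /etc/resolv.conf and estimates the worst-case query amplification for one hostname (the hostname below is just an example):

#!/usr/bin/env python3
# Sketch: parse a pod's /etc/resolv.conf and estimate how many DNS queries a
# single lookup of a non-FQDN hostname can trigger (candidates x A/AAAA).

def parse_resolv_conf(path: str = "/etc/resolv.conf"):
    search, ndots = [], 1  # resolver defaults when the options are absent
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts or parts[0][0] in "#;":
                continue
            if parts[0] == "search":
                search = parts[1:]
            elif parts[0] == "options":
                for opt in parts[1:]:
                    if opt.startswith("ndots:"):
                        ndots = int(opt.split(":", 1)[1])
    return search, ndots

if __name__ == "__main__":
    host = "rabbitmq.eu-west-1.aws.company.production"  # example hostname
    search, ndots = parse_resolv_conf()
    expands = not host.endswith(".") and host.count(".") < ndots
    candidates = (len(search) + 1) if expands else 1
    print(f"search domains: {search}")
    print(f"ndots: {ndots}")
    print(f"'{host}': up to {candidates} candidates -> {2 * candidates} queries (A + AAAA)")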

🧪 Benchmark — Measured Results
I used the following Python script (built on dnspython's async resolver) to test the ndots effect:

#!/usr/bin/env python3
import argparse
import asyncio
import time

import dns.asyncresolver


async def main() -> None:
    parser = argparse.ArgumentParser(
        description="Measure DNS lookup time for multiple queries using current resolv.conf (ndots/search)."
    )
    parser.add_argument("host", help="Hostname to resolve (bare name to exercise ndots/search)")
    parser.add_argument(
        "-n",
        "--queries",
        type=int,
        default=100,
        help="Number of concurrent queries to issue (default: 100)",
    )
    parser.add_argument(
        "-t",
        "--timeout",
        type=float,
        default=2.0,
        help="Per-query timeout in seconds (default: 2.0)",
    )
    args = parser.parse_args()

    resolver = dns.asyncresolver.Resolver()  # uses /etc/resolv.conf (ndots/search respected)
    resolver.timeout = args.timeout
    resolver.lifetime = args.timeout
    resolver.use_edns(False)  # disable EDNS so each candidate is a plain DNS query

    async def one_query() -> None:
        try:
            await resolver.resolve(args.host, "A", search=True)
        except Exception:
            # Ignore failures; we only care about timing behavior.
            pass

    tasks = [asyncio.create_task(one_query()) for _ in range(args.queries)]
    start = time.monotonic()
    await asyncio.gather(*tasks)
    elapsed = time.monotonic() - start
    print(f"{args.queries} queries for '{args.host}' in {elapsed:.3f}s ({elapsed/args.queries:.4f}s/query)")


if __name__ == "__main__":
    asyncio.run(main())


Python async DNS resolver test:

Before fix (no trailing dot)

python ndots_async_bench.py rabbitmq.eu-west-1.aws.xxxxx.production -n 100
100 queries for 'rabbitmq.eu-west-1.aws.xxxxx.production' in 2.207s (0.0221s/query) 

python ndots_async_bench.py rabbitmq.eu-west-1.aws.xxxxx.production -n 10
10 queries for 'rabbitmq.eu-west-1.aws.xxxxx.production' in 0.302s (0.0302s/query)

100 queries → 2.207s (0.0221 s/query)
10 queries → 0.302s (0.0302 s/query)

After fix (trailing dot → FQDN)
rabbitmq.eu-west-1.aws.company.production.

python ndots_async_bench.py rabbitmqc.eu-west-1.aws.xxxxx.production. -n 100
100 queries for 'rabbitmqc.eu-west-1.aws.xxxxx.production.' in 0.399s (0.0040s/query) 
python ndots_async_bench.py rabbitmqc.eu-west-1.aws.xxxxx.production. -n 10 
10 queries for 'rabbitmqc.eu-west-1.aws.xxxxx.production.' in 0.095s (0.0095s/query)


100 queries → 0.399s (0.0040 s/query)
10 queries → 0.095s (0.0095 s/query)
🚀 DNS became ~5x faster
🧨 NXDOMAIN traffic dropped nearly in half

NXDOMAIN vs NOERROR over time (queries per second) after the fix, for one hostname

🛠 Fixing the Problem (Two Options)

1️⃣ Use Fully Qualified Domain Names with a trailing dot

Examples:

api.company.internal.
googleapis.com.
rabbitmq.eu-west-1.aws.company.production.

✔ Easiest fix
✔ No Kubernetes changes
✔ Zero search-domain expansion
✔ Best performance

2️⃣ Reduce ndots for external workloads

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"

AWS docs state:

You can reduce the number of requests to CoreDNS by lowering the ndots option of your workload or fully qualifying your domain requests by including a trailing . (e.g. api.example.com. ).

📘 References

Kubernetes Docs
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/

AWS EKS
https://docs.aws.amazon.com/eks/latest/best-practices/scale-cluster-services.html#:~:text=Reduce%20external%20queries%20by%20lowering%20ndots

Linux Resolver
https://man7.org/linux/man-pages/man5/resolv.conf.5.html

🎯 Final Thought

Sometimes the biggest reliability problems
come from the smallest defaults.

ndots=5 is perfect for Kubernetes internal services…
but for external hostnames it can quietly overwhelm DNS
and drag performance down across the entire cluster.

One dot fixed everything.

Fix it once → enjoy peace and performance forever.

💬 If you'd like Part 2 (the full debugging journey — how I traced and proved the root cause), comment below:

Show me the debugging story
