DevHelm

Posted on May 17 • Edited on Jun 2 • Originally published at devhelm.io

How to Fix Slow DNS Lookup: A Complete Troubleshooting Guide

#guides #infrastructure #reliability

Every connection your application makes starts with a DNS lookup. When that lookup is slow — or fails entirely — the symptoms range from vague latency increases to hard-down pages that return ERR_NAME_NOT_RESOLVED. This guide walks through how to fix slow DNS lookup issues, diagnose two of the most common DNS errors (DNS_PROBE_FINISHED_NXDOMAIN and "DNS server not responding"), and set up monitoring so these problems never wake you up at 3 AM again.

Why DNS lookups slow down

A DNS lookup traverses multiple layers before returning an IP address. Your stub resolver asks a recursive resolver, which queries root nameservers, then TLD nameservers, then the authoritative nameserver for the domain. Each hop adds latency. In the best case (a warm cache hit on the recursive resolver), resolution costs little more than one round trip to that resolver, often a few milliseconds, and drops under 1 ms only when the answer is already cached on your own machine. In the worst case (a cold cache, long CNAME chains, DNSSEC validation, and an authoritative server on another continent), it can exceed 500 ms.

The most common causes of slow DNS resolution:

Overloaded or distant ISP resolvers. ISP DNS servers are shared infrastructure. During peak hours, query times spike from 20 ms to 200 ms or more.
Low TTL values. A TTL of 60 seconds means a cached answer is discarded one minute after it was fetched, so the next query has to go back upstream to the authoritative nameserver. TTLs under 300 seconds are a common source of unnecessary latency.
CNAME chains. Each CNAME adds an extra lookup. A domain with three CNAME hops can require up to four resolutions before returning an A record (fewer when one nameserver can answer several hops in a single response).
IPv6 fallback. When a system queries for AAAA records first and the authoritative server is slow to respond (or doesn't support IPv6), the client waits for a timeout before falling back to A records — adding 2–5 seconds.
VPN and split-tunnel DNS conflicts. Corporate VPNs often route DNS traffic through a tunnel to an internal resolver, adding 50–150 ms of round-trip latency that doesn't exist when the VPN is off.
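
To put rough numbers on the TTL and CNAME points above, here is a small back-of-the-envelope sketch in Python. The function names are illustrative, and the figures are worst-case upper bounds: they assume the record is queried continuously and that no intermediate answer is cached.

```python
def max_refreshes_per_day(ttl_seconds: int) -> int:
    """Upper bound on how often one cache must re-fetch a record in 24 hours."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return 86_400 // ttl_seconds


def lookups_for_cname_chain(cname_hops: int) -> int:
    """Worst-case resolutions: one per CNAME hop plus the final A record."""
    if cname_hops < 0:
        raise ValueError("cname_hops cannot be negative")
    return cname_hops + 1


if __name__ == "__main__":
    for ttl in (30, 60, 300, 3600):
        print(f"TTL {ttl:>4}s: up to {max_refreshes_per_day(ttl)} refreshes/day per cache")
    print("3 CNAME hops:", lookups_for_cname_chain(3), "lookups")
```

Raising a TTL from 60 to 300 seconds cuts the worst-case refresh count per cache by a factor of five, which is why the 300-second floor comes up repeatedly in this guide.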

Measure first — what "slow" actually means

Before changing anything, measure your current DNS performance. The dig command (Linux/macOS) and nslookup (Windows) are the standard diagnostic tools.

Measure with dig:

dig devhelm.io

The output you care about is at the bottom:

;; ANSWER SECTION:
devhelm.io.          300     IN      A       143.198.168.42

;; Query time: 24 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Sun May 11 14:32:07 UTC 2026
;; MSG SIZE  rcvd: 56

The Query time line is what matters. Here is a reference table:

Query time	Rating	Action
< 15 ms	Excellent	No action needed
15–50 ms	Good	Acceptable for most workloads
50–100 ms	Poor	Switch resolver or investigate upstream
100+ ms	Critical	Immediate action required
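
If you want to apply the table in a script, a minimal sketch might look like this. The function name is made up for illustration, and because the table's ranges share their endpoints, the sketch assumes each boundary value (15, 50, 100) belongs to the slower band.

```python
def rate_query_time(ms: float) -> tuple:
    """Map a query time in milliseconds to the (rating, action) pair from the table."""
    if ms < 0:
        raise ValueError("query time cannot be negative")
    if ms < 15:
        return ("Excellent", "No action needed")
    if ms < 50:
        return ("Good", "Acceptable for most workloads")
    if ms < 100:
        return ("Poor", "Switch resolver or investigate upstream")
    return ("Critical", "Immediate action required")


if __name__ == "__main__":
    # 24 ms is the query time from the sample dig output above.
    print(rate_query_time(24))
```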

Compare resolvers directly:

dig @1.1.1.1 devhelm.io | grep "Query time"
dig @8.8.8.8 devhelm.io | grep "Query time"
dig @9.9.9.9 devhelm.io | grep "Query time"

If your default resolver is 3–5x slower than Cloudflare (1.1.1.1) or Google (8.8.8.8), that is the first thing to fix.
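
The comparison can be automated. The sketch below parses the "Query time" line from dig output and flags a default resolver that is at least three times slower than a reference. The helper names and the 3x default are assumptions taken from the rule of thumb above, and dig_query_time only works where dig is installed.

```python
import re
import subprocess
from typing import Optional

QUERY_TIME = re.compile(r"^;; Query time: (\d+) msec", re.MULTILINE)


def parse_query_time(dig_output: str) -> Optional[int]:
    """Extract the query time in milliseconds from dig output, or None if absent."""
    match = QUERY_TIME.search(dig_output)
    return int(match.group(1)) if match else None


def dig_query_time(server: str, name: str) -> Optional[int]:
    """Run dig against one resolver and return its reported query time."""
    result = subprocess.run(
        ["dig", f"@{server}", name],
        capture_output=True, text=True, timeout=15,
    )
    return parse_query_time(result.stdout)


def is_slow_relative_to(default_ms: int, reference_ms: int, factor: float = 3.0) -> bool:
    """True when the default resolver is at least `factor` times slower.

    The reference is clamped to 1 ms so a reported 0 msec does not flag everything.
    """
    return default_ms >= factor * max(reference_ms, 1)
```

A single measurement is noisy, and the first query may be a cold-cache miss, so it is worth repeating each query a few times before drawing a conclusion.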

Measure with nslookup on Windows:

nslookup devhelm.io
Server:  resolver1.isp.net
Address:  192.168.1.1

Non-authoritative answer:
Name:    devhelm.io
Address:  143.198.168.42

nslookup does not show query time directly. For timing on Windows, use PowerShell:

Measure-Command { Resolve-DnsName devhelm.io } | Select-Object TotalMilliseconds

Fix slow DNS lookup on your machine

These fixes address the most common causes of slow resolution, in order of impact.

1. Flush your local DNS cache

Stale or corrupted cache entries can cause lookups to hang or return wrong results. Flush first, then re-test.

macOS:

sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder

Linux (systemd-resolved):

sudo resolvectl flush-caches
resolvectl statistics | grep "Current Cache Size"

Windows:

ipconfig /flushdns
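
For a setup script that has to run on several operating systems, the three flush commands above can be selected by platform. This is only a sketch: the function names are illustrative, and the Linux entry assumes systemd-resolved is the active resolver, which is not true of every distribution.

```python
import platform
import subprocess

# Commands copied from the sections above, keyed by platform.system() values.
FLUSH_COMMANDS = {
    "Darwin": [
        ["sudo", "dscacheutil", "-flushcache"],
        ["sudo", "killall", "-HUP", "mDNSResponder"],
    ],
    "Linux": [["sudo", "resolvectl", "flush-caches"]],  # assumes systemd-resolved
    "Windows": [["ipconfig", "/flushdns"]],
}


def flush_commands(system=None):
    """Return the list of commands that flush the DNS cache on `system`."""
    system = system or platform.system()
    if system not in FLUSH_COMMANDS:
        raise ValueError(f"no known flush command for {system!r}")
    return FLUSH_COMMANDS[system]


def flush_dns_cache(system=None):
    """Run each flush command in order, stopping at the first failure."""
    for command in flush_commands(system):
        subprocess.run(command, check=True)
```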

2. Switch to a faster public resolver

If your ISP resolver is slow, change to Cloudflare (1.1.1.1), Google (8.8.8.8), or Quad9 (9.9.9.9). These resolvers have global anycast networks that consistently resolve in under 15 ms from most locations.

Linux (systemd-resolved; replace eth0 with your interface name):

sudo resolvectl dns eth0 1.1.1.1 1.0.0.1
resolvectl dns eth0 # verify (no sudo needed to read the setting)

macOS (System Settings > Network > DNS, or from the terminal):

networksetup -setdnsservers Wi-Fi 1.1.1.1 1.0.0.1

3. Disable IPv6 DNS if you do not use it

If your network does not have working IPv6 connectivity, AAAA queries add timeout delays to every lookup. Test whether IPv6 is the problem:

dig AAAA devhelm.io @1.1.1.1 | grep "Query time"
dig A devhelm.io @1.1.1.1 | grep "Query time"

If the AAAA query is significantly slower or times out, consider disabling IPv6 resolution on your machine or configuring your resolver to deprioritize AAAA lookups.
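
The dig test above bypasses the operating system's resolver. To see what applications actually experience, you can time the same comparison through the system resolver. The sketch below does that with Python's socket.getaddrinfo. The function names are illustrative, and the resolve parameter exists only so the timing logic can be exercised without a network.

```python
import socket
import time


def time_lookup(host, family, resolve=socket.getaddrinfo):
    """Return (elapsed_ms, succeeded) for one lookup restricted to `family`."""
    start = time.perf_counter()
    try:
        resolve(host, None, family=family, type=socket.SOCK_STREAM)
        succeeded = True
    except socket.gaierror:
        succeeded = False
    return (time.perf_counter() - start) * 1000.0, succeeded


def compare_families(host, resolve=socket.getaddrinfo):
    """Time an IPv4-only (A) and an IPv6-only (AAAA) lookup for the same host."""
    return {
        "A": time_lookup(host, socket.AF_INET, resolve),
        "AAAA": time_lookup(host, socket.AF_INET6, resolve),
    }
```

Calling compare_families("devhelm.io") returns both timings. Because the OS cache is involved, flush it first if you want cold-cache numbers.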

4. Check your VPN's DNS configuration

VPNs commonly override DNS settings, routing queries through the tunnel. If DNS is slow only when connected to a VPN:

cat /etc/resolv.conf   # Linux: check which DNS server is active (127.0.0.53 is the systemd-resolved stub; run resolvectl status for the real upstream)
scutil --dns | head -20 # macOS: check DNS configuration

If the resolver points to a VPN-provided address (e.g., 10.x.x.x), configure split-tunnel DNS so that only internal domains route through the VPN resolver.
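
As a rough illustration of that check, the following sketch reads resolv.conf-style text and lists nameservers in private address ranges. The function names are made up, and a private address is only a hint: home routers use private addresses too, so compare the result with the VPN connected and disconnected.

```python
import ipaddress


def nameservers(resolv_conf_text: str) -> list:
    """Return the addresses on `nameserver` lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def private_nameservers(resolv_conf_text: str) -> list:
    """Return nameservers in private ranges, ignoring loopback stubs like 127.0.0.53."""
    found = []
    for server in nameservers(resolv_conf_text):
        try:
            address = ipaddress.ip_address(server.split("%")[0])  # drop any IPv6 zone id
        except ValueError:
            continue  # not an IP literal; skip it
        if address.is_private and not address.is_loopback:
            found.append(server)
    return found
```

On Linux you could feed it open("/etc/resolv.conf").read().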

How to fix DNS_PROBE_FINISHED_NXDOMAIN

DNS_PROBE_FINISHED_NXDOMAIN means the DNS resolver returned an NXDOMAIN response — the domain does not exist in DNS. Chrome, Edge, and Brave all surface this as an error page. The domain either genuinely does not exist, or something between your machine and the authoritative nameserver is blocking or misconfiguring the lookup.

Diagnosis, in order:

1. Verify the domain is correct. Typos account for most NXDOMAIN errors. Check for swapped letters, missing hyphens, and wrong TLDs (.com vs .io vs .dev).

2. Test from multiple resolvers. If your default resolver returns NXDOMAIN but a public resolver resolves the domain, your resolver has stale or filtered data:

dig example.com @1.1.1.1
dig example.com @8.8.8.8
dig example.com @"$(awk '/^nameserver/ {print $2; exit}' /etc/resolv.conf)"

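3. Check the authoritative nameserver directly. List the domain's NS records first, then query one of the returned nameservers (ns1.registrar.com below is a placeholder). This confirms whether the delegation is configured correctly at the registrar and whether the zone itself serves the record: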

dig NS example.com @1.1.1.1
dig example.com @ns1.registrar.com

If the authoritative server itself returns NXDOMAIN, the domain's DNS zone is misconfigured or the domain has expired. Check with your registrar.

4. Flush DNS and restart the DNS client. A cached NXDOMAIN response (negative caching, per RFC 2308) persists for the zone's negative-caching TTL, which is the lower of the SOA record's own TTL and its MINIMUM field. On some zones that is an hour or more.

5. Check your hosts file. A local override in /etc/hosts (Linux/macOS) or C:\Windows\System32\drivers\etc\hosts (Windows) can shadow DNS entirely. Remove any stale entries for the domain.

6. Check Chrome's secure DNS setting. With "Use secure DNS" enabled, Chrome can send its queries over DNS over HTTPS to a different resolver than your system default. If that resolver cannot see the name (an internal-only domain, for example) or holds different data, Chrome shows NXDOMAIN while other applications resolve the domain fine. Navigate to chrome://settings/security, check the "Use secure DNS" setting, and make sure it matches your intended resolver.
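
The first three diagnosis steps boil down to comparing response codes from different resolvers. The sketch below reads the status field from dig's header line and applies the same reasoning. The function names and the wording of the returned hints are illustrative, not part of any tool.

```python
import re
from typing import Optional

STATUS = re.compile(r"status: ([A-Z]+)")


def parse_status(dig_output: str) -> Optional[str]:
    """Return the response code (NOERROR, NXDOMAIN, SERVFAIL, ...) from dig output."""
    match = STATUS.search(dig_output)
    return match.group(1) if match else None


def diagnose(default_status, public_statuses, authoritative_status=None) -> str:
    """Turn per-resolver response codes into a next step, following the list above."""
    if authoritative_status == "NXDOMAIN":
        return "zone misconfigured or domain expired: check with your registrar"
    public_resolves = any(status == "NOERROR" for status in public_statuses)
    if default_status == "NXDOMAIN" and public_resolves:
        return "default resolver has stale or filtered data: flush caches or switch resolver"
    if default_status == "NXDOMAIN":
        return "no resolver knows the name: check spelling, then query the authoritative server"
    return "default resolver answers: check the hosts file and browser DNS settings"
```

For example, diagnose("NXDOMAIN", ["NOERROR", "NOERROR"]) points at the default resolver, which matches step 2.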

How to fix DNS server not responding

"DNS server not responding" means your machine sent a DNS query and received no reply at all — not even an error. This is different from NXDOMAIN (which is a valid response saying "this domain does not exist"). No response means the resolver itself is unreachable or unresponsive.

Systematic diagnosis:

Step 1: Confirm basic connectivity

Separate "network is down" from "DNS is down":

ping -c 3 1.1.1.1

If ping fails, the problem is your network connection, not DNS. Check cables, Wi-Fi, and router.

If ping succeeds, your network is fine but DNS is specifically broken. Continue.

Step 2: Test the DNS port directly

DNS uses UDP port 53, with TCP port 53 for large responses. Force a query over TCP to check whether that path to a resolver works:

dig @1.1.1.1 devhelm.io +tcp +timeout=5

If this works but normal queries fail, something is blocking UDP port 53 — commonly a firewall, router ACL, or ISP filter.
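
To make the UDP-versus-TCP distinction concrete, here is a minimal sketch that builds a raw DNS query by hand (RFC 1035 wire format) and sends it over each transport. All function names are illustrative, it handles IPv4 resolvers only, and it does not parse the reply: any reply at all is enough to show the path is open.

```python
import socket
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Encode a single-question DNS query with the recursion-desired flag set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS 1 = IN


def _recv_exact(sock, count: int) -> bytes:
    """Read exactly `count` bytes from a stream socket."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("connection closed before full reply")
        data += chunk
    return data


def query_udp(server: str, name: str, timeout: float = 5.0) -> bytes:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        reply, _ = sock.recvfrom(4096)
        return reply


def query_tcp(server: str, name: str, timeout: float = 5.0) -> bytes:
    message = build_query(name)
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        # Over TCP, each DNS message is prefixed with its 2-byte length.
        sock.sendall(struct.pack("!H", len(message)) + message)
        (length,) = struct.unpack("!H", _recv_exact(sock, 2))
        return _recv_exact(sock, length)


def udp_blocked(server: str, name: str, timeout: float = 5.0) -> bool:
    """True when TCP port 53 answers but UDP port 53 does not."""
    try:
        query_tcp(server, name, timeout)
    except OSError:
        return False  # TCP fails too, so this is not a UDP-only problem
    try:
        query_udp(server, name, timeout)
    except OSError:
        return True
    return False
```

A True result from udp_blocked("1.1.1.1", "devhelm.io") matches the firewall, router ACL, or ISP filter scenario described above.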

Step 3: Check your router

Home and office routers often run a local DNS forwarder. If the router's DNS process crashes or its upstream configuration is wrong, all devices on the network lose DNS.

Access your router admin panel (typically 192.168.1.1)
Check the configured upstream DNS servers
Try setting them to 1.1.1.1 and 8.8.8.8 as primary and secondary
Reboot the router

Step 4: Check for firewall or security software blocking DNS

Firewalls (especially on corporate networks), antivirus software, and parental control tools sometimes intercept or block DNS traffic. Temporarily disable them one at a time to isolate the culprit. On Linux, you can also list firewall rules that mention port 53:

sudo iptables -L -n | grep -w 53

Step 5: Try DNS over HTTPS (DoH)

If your ISP is throttling or intercepting standard DNS (UDP/TCP port 53), DNS over HTTPS bypasses the interception by sending queries over HTTPS on port 443:

Firefox: Settings > Privacy & Security > Enable DNS over HTTPS
Chrome: Settings > Security > Use secure DNS > Select Cloudflare or Google
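System-wide (Linux): Configure systemd-resolved with DNSOverTLS=yes. Note that this enables DNS over TLS (DoT), a related encrypted transport that uses TCP port 853 rather than HTTPS on port 443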
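
You can also test DoH directly before changing any browser setting. The sketch below uses Cloudflare's JSON DoH endpoint (https://cloudflare-dns.com/dns-query with an Accept: application/dns-json header). The function names are illustrative, and only resolve_over_https touches the network.

```python
import json
import urllib.parse
import urllib.request

CLOUDFLARE_DOH = "https://cloudflare-dns.com/dns-query"


def build_doh_request(name: str, record_type: str = "A", endpoint: str = CLOUDFLARE_DOH):
    """Build a GET request for the JSON flavor of DNS over HTTPS."""
    query = urllib.parse.urlencode({"name": name, "type": record_type})
    return urllib.request.Request(
        f"{endpoint}?{query}",
        headers={"Accept": "application/dns-json"},
    )


def parse_doh_answers(body: str) -> list:
    """Return the data field of each answer, or [] on a non-zero DNS status."""
    payload = json.loads(body)
    if payload.get("Status") != 0:  # 0 is NOERROR, 3 is NXDOMAIN
        return []
    return [answer["data"] for answer in payload.get("Answer", [])]


def resolve_over_https(name: str, record_type: str = "A", timeout: float = 5.0) -> list:
    """Resolve a name over port 443, bypassing anything that filters port 53."""
    request = build_doh_request(name, record_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_doh_answers(response.read().decode("utf-8"))
```

If resolve_over_https("devhelm.io") returns addresses while plain dig queries time out, interception or blocking on port 53 is the likely cause.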

When the problem is upstream

Sometimes slow DNS is outside your control. Before blaming your resolver or network, check whether the problem is upstream:

Authoritative nameserver issues. The domain owner's nameserver may be slow or misconfigured. Test with dig +trace example.com to see exactly where in the resolution chain the delay occurs.
CDN misrouting. CDNs that use DNS-based geographic routing, such as AWS CloudFront, pick an edge node based on where your resolver appears to be. If your resolver's IP geolocation is wrong, you may be routed to a distant edge node. This is common with VPNs and small ISP resolvers.
Registrar glue record problems. If a domain's nameservers are under the same domain (e.g., ns1.example.com for example.com), the registrar must provide glue records — the A records for the nameservers themselves. Missing glue records create a circular dependency that manifests as timeouts.
Enterprise split-horizon DNS. In corporate environments, internal and external DNS zones overlap. A query for api.company.com might resolve to an internal IP on VPN and a public IP off VPN — or fail entirely if the split-horizon configuration has gaps.
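
When reading dig +trace output, the lines that matter for timing are the ";; Received ... from ... in N ms" lines printed after each step of the delegation chain. A small sketch for pulling out the slowest step follows. The function names are illustrative, and the sample values in the usage note are made up.

```python
import re

HOP = re.compile(
    r"^;; Received \d+ bytes from (\S+)#\d+\(([^)]*)\) in (\d+) ms",
    re.MULTILINE,
)


def trace_hops(trace_output: str) -> list:
    """Return (server_name, milliseconds) for each step in dig +trace output."""
    return [(name, int(ms)) for _address, name, ms in HOP.findall(trace_output)]


def slowest_hop(trace_output: str):
    """Return the (server_name, milliseconds) pair with the highest latency, or None."""
    hops = trace_hops(trace_output)
    return max(hops, key=lambda hop: hop[1]) if hops else None
```

Given a trace where the root answered in 23 ms and a TLD server in 187 ms, slowest_hop returns the TLD server, which tells you the delay is upstream of your own resolver.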

Prevent DNS failures with monitoring

Everything you have done so far in this guide — flushing caches, switching resolvers, tracing NXDOMAIN responses, checking firewall rules — is reactive. You noticed a problem, diagnosed it, and fixed it. That reactive investigation is exactly the kind of work a runbook codifies so the next engineer doesn't repeat it from scratch. But the next DNS failure will not look like this one. An A record vanishes because someone fat-fingers a Terraform apply. A TTL gets dropped to 30 seconds during a migration and never gets reverted. Resolution times creep from 20 ms to 150 ms over three weeks because an upstream nameserver is quietly degrading. None of these announce themselves. They just erode your reliability until a user files a ticket or your on-call phone rings at 3 AM — and your MTTR climbs because the failure mode was unfamiliar.

A single "is DNS working?" check does not cover this. What you need is a layered set of assertions that catches the different ways DNS silently breaks.

Layer 1: Does it resolve at all?

The most fundamental check. A dns_resolves assertion confirms that your domain actually returns records — that the A record exists, that the AAAA record exists, that the response is not NXDOMAIN or SERVFAIL. If your A record disappears because of a zone file mistake or a registrar lapse, you find out in five minutes instead of five hours when customers start reporting a blank page.

Check both A and AAAA record types. Even if your application is IPv4-only, a broken AAAA record causes timeout-based fallback delays on clients that try IPv6 first — the exact problem covered in the IPv6 section above. Monitoring both means you catch issues on either path.

Layer 2: Does it resolve fast enough?

DNS that technically resolves but takes 200 ms adds 200 ms to every single page load, every API call, every webhook delivery. This latency is invisible in dashboards that only track HTTP response time because the DNS overhead happens before the connection even opens.

Two thresholds give you the coverage you need. A hard failure assertion (dns_response_time with a maxMs of 100) fires when resolution exceeds a critical ceiling — something is actively broken, whether that is an overloaded resolver, a network path change, or an authoritative server on another continent. A softer warning assertion (dns_response_time_warn with a warnMs of 50) fires at a lower threshold so you catch gradual degradation before it compounds into an outage. The warning gives you time to investigate during business hours. The hard failure pages your on-call immediately.

Layer 3: Are the TTLs healthy?

Low TTLs are a silent performance killer, and they show up constantly in the kinds of issues this guide covers. A TTL of 30 seconds means every visitor's browser, every edge server, and every recursive resolver on the planet discards the cached record every half minute and triggers a full recursive lookup. During a migration, it is common practice to temporarily lower TTLs to speed up propagation — and then forget to raise them back afterward.

A dns_ttl_low assertion with a minTtl of 300 catches exactly this. If someone — or an automated provisioning tool — drops your TTL below five minutes, you get a warning before the extra lookup load starts inflating resolution times across the board.

Layer 4: Check from multiple vantage points

DNS is not globally consistent. A record that resolves correctly from a probe in us-east might be stale, missing, or pointing to the wrong IP in ap-south because of propagation delays, regional resolver differences, or geo-DNS misconfigurations. If you only check from one region, you are testing your DNS health from one perspective and assuming the rest of the world agrees. It often does not.

Running checks from at least three regions — us-east, eu-west, and ap-south — ensures your monitoring reflects what your actual users experience rather than what a single datacenter sees.
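
As a sketch of what a multi-region comparison could look like, the function below takes the answers each region saw and reports the regions that disagree with the majority. The function name and input shape are hypothetical, and with geo-DNS some disagreement is intentional, so treat the output as a prompt to investigate rather than proof of a fault.

```python
from collections import Counter


def regions_out_of_step(answers_by_region: dict) -> dict:
    """Return the regions whose answer set differs from the most common one."""
    normalized = {
        region: frozenset(addresses)
        for region, addresses in answers_by_region.items()
    }
    if not normalized:
        return {}
    majority, _count = Counter(normalized.values()).most_common(1)[0]
    return {
        region: sorted(addresses)
        for region, addresses in normalized.items()
        if addresses != majority
    }
```

An empty answer list for one region, for example, shows up as that region mapped to an empty list.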

Layer 5: Check against specific nameservers

By default, each probe region uses whatever recursive resolver is locally available. That is usually fine, but it means you can miss issues that are specific to a particular public resolver. Explicitly setting your nameservers to 1.1.1.1 and 8.8.8.8 — Cloudflare and Google, the two most widely used public resolvers — lets you test resolution from the same infrastructure your users are most likely hitting. If your domain resolves from Google but not Cloudflare (or vice versa), that points to a propagation issue or a resolver-specific caching problem that would otherwise be invisible until someone on the affected resolver reports it.

Putting it all together

Here is a complete DNS monitor configuration that implements all five layers:

{
  "name": "production-dns-health",
  "type": "DNS",
  "frequencySeconds": 300,
  "regions": ["us-east", "eu-west", "ap-south"],
  "config": {
    "hostname": "yourapp.com",
    "recordTypes": ["A", "AAAA"],
    "nameservers": ["1.1.1.1", "8.8.8.8"]
  },
  "assertions": [
    {"config": {"type": "dns_resolves"}},
    {"config": {"type": "dns_response_time", "maxMs": 100}},
    {"config": {"type": "dns_response_time_warn", "warnMs": 50}, "severity": "WARN"},
    {"config": {"type": "dns_ttl_low", "minTtl": 300}, "severity": "WARN"}
  ]
}

Every five minutes, from three continents, this monitor resolves yourapp.com for both A and AAAA records against Cloudflare's and Google's DNS. It fails hard if the domain does not resolve at all or if resolution takes longer than 100 ms. It warns if resolution exceeds 50 ms or if the TTL drops below 300 seconds.

The severity: "WARN" on the TTL and response time warning assertions is deliberate. These are degradation signals, not outage signals — they belong in a dashboard and a Slack channel, not in your PagerDuty rotation. The resolution check and the hard response time ceiling default to error severity, which is what triggers your incident workflow based on your severity levels. The distinction matters: you want to know about creeping latency during business hours, and you want to be woken up for a missing A record.
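
To make the assertion semantics concrete, here is a rough sketch of how the four assertions could be evaluated against one probe result. This is not DevHelm's implementation: the result dictionary shape, the function names, and the "ERROR" label for the default severity are all assumptions made for illustration.

```python
# The assertions list mirrors the JSON configuration above.
ASSERTIONS = [
    {"config": {"type": "dns_resolves"}},
    {"config": {"type": "dns_response_time", "maxMs": 100}},
    {"config": {"type": "dns_response_time_warn", "warnMs": 50}, "severity": "WARN"},
    {"config": {"type": "dns_ttl_low", "minTtl": 300}, "severity": "WARN"},
]


def passes(assertion: dict, result: dict) -> bool:
    """Return True when one assertion holds for one (hypothetical) probe result."""
    config = assertion["config"]
    kind = config["type"]
    if kind == "dns_resolves":
        return bool(result["records"])
    if kind == "dns_response_time":
        return result["response_ms"] <= config["maxMs"]
    if kind == "dns_response_time_warn":
        return result["response_ms"] <= config["warnMs"]
    if kind == "dns_ttl_low":
        return result["ttl"] >= config["minTtl"]
    raise ValueError(f"unknown assertion type: {kind}")


def findings(assertions: list, result: dict) -> list:
    """Return (severity, assertion_type) for every assertion that fails."""
    return [
        (assertion.get("severity", "ERROR"), assertion["config"]["type"])
        for assertion in assertions
        if not passes(assertion, result)
    ]


if __name__ == "__main__":
    probe = {"records": ["143.198.168.42"], "response_ms": 72, "ttl": 60}
    for severity, kind in findings(ASSERTIONS, probe):
        print(severity, kind)
```

A probe that resolves in 72 ms with a 60-second TTL produces two WARN findings and no error, which is the dashboard-not-pager outcome the severity split is meant to give you.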

If your DNS infrastructure sits behind Cloudflare, you can also track their operational status through their public status feed — useful for distinguishing between your DNS issues and theirs.

You can create this monitor through the DevHelm dashboard, or from the terminal:

devhelm monitor create --type dns

Start monitoring free — DNS, HTTP, TCP, and ICMP checks from five continents, with the full CLI and API surface included.
