DNS gets blamed last and breaks first: my symptom-to-root-cause playbook

#dns #networking #devops #sysadmin

Every incident I have chased that started with "the site is down" and ended in DNS had one thing in common: the error messages pointed everywhere except DNS. Timeouts, TLS warnings, a login screen that spins forever. You go poking at the app, the load balancer, the database, and an hour later it turns out a resolver was handing back a stale answer the whole time.

So I stopped trusting symptoms and started isolating layers. Here is the process I actually run, in order.

Three shapes of the same problem

Almost every DNS failure shows up as one of three things.

NXDOMAIN. The resolver asked the authoritative servers and got a definitive "this name does not exist." Sometimes that is true: an expired domain, a typo. More often the record exists now and you are staring at a cached negative answer that has not expired yet. Resolvers cache "does not exist" just like they cache real answers. If the name resolves at the authoritative source but your resolver says NXDOMAIN, you are not debugging, you are waiting.

Resolver not responding. Nothing resolves and browsing by name just stops. Usually the resolver itself is unreachable: an ISP outage, a firewall quietly eating port 53 (DNS needs both UDP and TCP), or the wrong resolver addresses in the config.

Slow resolution. Pages hang for a few seconds, then load fast. That is the classic signature of a dead primary resolver, where every lookup burns a full timeout before falling back to the secondary.
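
The three shapes are easy to tell apart from a shell. A rough sketch, assuming `dig` is installed and that `/etc/resolv.conf` lists your real resolvers (on systemd-resolved hosts it usually shows only the local stub, and macOS keeps its list elsewhere). The function names are mine, not part of any tool:

```shell
# dig_status: read `dig` output on stdin and print the response status
# (NOERROR, NXDOMAIN, SERVFAIL, ...) or "no-reply" if nothing answered.
dig_status() {
  s=$(sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
  echo "${s:-no-reply}"
}

# list_nameservers [FILE]: print each "nameserver" address from a
# resolv.conf-style file (defaults to /etc/resolv.conf).
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# probe_resolvers NAME [FILE]: query each configured resolver on its own,
# one try with a 2-second timeout, so a dead one shows up by address.
# dig exits non-zero when no server replied at all.
probe_resolvers() {
  for ns in $(list_nameservers "${2:-/etc/resolv.conf}"); do
    if dig @"$ns" "$1" +time=2 +tries=1 +short >/dev/null 2>&1; then
      echo "$ns ok"
    else
      echo "$ns no-answer"
    fi
  done
}

# Usage:
#   dig example.com | dig_status      # NXDOMAIN vs NOERROR vs no-reply
#   probe_resolvers example.com       # which resolver is the dead one?
```

If the first resolver in the list prints `no-answer` and the second prints `ok`, that is the slow-resolution shape: every lookup waits out the first one before it reaches the second.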

The one idea that makes this easy

Separate three questions and never let them blur together:

Can I reach the network at all?
Can I resolve names?
Is the answer I get back correct?

Each step answers exactly one of those.

Reach the network, no names involved:

ping 1.1.1.1

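If that fails you most likely have a connectivity problem, not a DNS problem. A few networks drop ICMP, so try a second address before you conclude anything. Either way, stop blaming DNS.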

Can you resolve at all:

dig google.com

Raw IP works but names do not? DNS is now confirmed as the failing layer.

Is the answer correct, and whose fault is a bad answer:

dig example.com            # your resolver
dig @1.1.1.1 example.com   # a known-good public one

This is the decisive split. If a public resolver answers correctly and yours does not, the fault is your resolver or the path to it. Point at 1.1.1.1 or 8.8.8.8 and keep working while you fix it. If the public resolver is also wrong, the problem most likely lives on the authoritative side: the zone itself or the delegation pointing at it.
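
The three questions chain together into one small script. This is a sketch under a few assumptions, not a finished tool: the function names are invented here, it only looks at A records, and it treats 1.1.1.1 as the known-good reference. Names served by GeoDNS or round robin can legitimately return different answers from different resolvers, so read a mismatch as a prompt to look closer, not as proof.

```shell
# answers RESOLVER NAME: sorted A-record answers from one resolver.
# Pass "" as RESOLVER to use the system default. Lines starting with ";"
# (dig's error and status comments) are dropped.
answers() {
  if [ -n "$1" ]; then
    dig @"$1" "$2" A +short +time=2 +tries=1 | grep -v '^;' | sort
  else
    dig "$2" A +short +time=2 +tries=1 | grep -v '^;' | sort
  fi
}

# verdict REACH LOCAL PUBLIC: map the three checks onto the layer to inspect.
# REACH is "ok" or "fail"; LOCAL and PUBLIC are the answers from each resolver.
verdict() {
  if [ "$1" != "ok" ]; then
    echo "connectivity"
  elif [ -z "$2" ] && [ -z "$3" ]; then
    echo "zone-or-delegation"
  elif [ -z "$2" ]; then
    echo "local-resolver"
  elif [ "$2" != "$3" ]; then
    echo "mismatch"
  else
    echo "consistent"
  fi
}

# triage NAME: reach, resolve, correct, in that order.
triage() {
  if ping -c 2 1.1.1.1 >/dev/null 2>&1; then reach=ok; else reach=fail; fi
  verdict "$reach" "$(answers "" "$1")" "$(answers 1.1.1.1 "$1")"
}

# Usage:
#   triage example.com
```

Keeping `verdict` separate from the commands that gather its inputs means the decision logic can be checked without touching the network.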

When you suspect a broken delegation, one command settles it:

dig +trace example.com

That walks the chain from the root servers down to the authoritative nameservers and shows every referral. A registrar pointing at the wrong nameservers jumps straight out.
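
The trace output is long, and the lines that matter are the `;; Received ... from ...` ones that name who answered each step. A small filter, assuming the line format printed by the BIND 9 `dig` builds I have used (other versions may word it differently); `trace_hops` is a made-up name:

```shell
# trace_hops: read `dig +trace` output on stdin and print one line per hop:
# responding server name, its address, and the round-trip time.
trace_hops() {
  sed -n 's/^;; Received [0-9]* bytes from \([^#]*\)#[0-9]*(\([^)]*\)) in \([0-9]*\) ms.*/\2 \1 \3ms/p'
}

# Usage:
#   dig +trace example.com | trace_hops
```

The last line is the server that gave the final answer. If it is not one of the nameservers you expect to be authoritative, the delegation is where to look.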

The cache is the liar

The most confusing DNS bugs, the "works on my machine, fails on the server" ones, are almost always stale caches. Flush yours before you theorize:

# macOS
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder
# Linux (systemd-resolved)
sudo resolvectl flush-caches
# Windows
ipconfig /flushdns
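
If you keep this in a runbook that gets used on mixed fleets, a tiny helper can pick the right line. A sketch only: `flush_cmd` is a hypothetical name, it prints the command instead of running it, and the Linux branch assumes systemd-resolved (hosts running nscd, dnsmasq, or no local cache at all need something else).

```shell
# flush_cmd [OS]: print the cache-flush command for a platform.
# OS defaults to the output of `uname -s`.
flush_cmd() {
  case "${1:-$(uname -s)}" in
    Darwin) echo "sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder" ;;
    Linux) echo "sudo resolvectl flush-caches" ;;
    MINGW*|MSYS*|CYGWIN*) echo "ipconfig /flushdns" ;;
    *) echo "unknown platform: flush manually"; return 1 ;;
  esac
}

# Usage:
#   flush_cmd            # show the command for this machine
#   flush_cmd Darwin     # show the macOS command
```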

And because your local cache can lie to you, it helps to get a second opinion from outside your own network. Full disclosure: I build a set of network tools called Trace Warrior, and the DNS lookup runs the query from a neutral vantage point and shows every record type at once, which is the fastest way to tell "my resolver is stale" from "the record is genuinely wrong."

The other half is not waiting for a human to notice. Most DNS incidents I have seen were a record that changed and nobody clocked it: a fat-fingered edit, or worse, a hijack. We ended up adding a DNS record monitor that watches for drift and alerts the moment an answer changes, because "should this record exist?" is a miserable question to be asking in the middle of an outage.
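
The core of drift detection fits in a few lines of shell. To be clear, this is not how our monitor is built. It is a minimal illustration of the idea, with a hypothetical function name and baseline path, and it will raise false alarms on records that rotate by design.

```shell
# record_drift BASELINE_FILE CURRENT_ANSWER
# First run saves the answer as the baseline; later runs compare against it.
record_drift() {
  if [ ! -f "$1" ]; then
    printf '%s\n' "$2" > "$1"
    echo "baseline-saved"
  elif [ "$(cat "$1")" = "$2" ]; then
    echo "unchanged"
  else
    echo "changed"
  fi
}

# Usage (from cron, for example):
#   record_drift /var/tmp/example.com.A "$(dig @1.1.1.1 example.com A +short | sort)"
```

Sorting the answer before comparing keeps a multi-record set from looking "changed" just because the resolver returned it in a different order.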

What actually prevents the next one

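Lower TTLs before a planned change, raise them after. Lower them at least one old TTL ahead of the change, so caches holding the long value have time to expire. A TTL left at 86400 can turn a five-minute cutover into as much as a full day of split traffic.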
Redundant nameservers on separate networks. Two nameservers on the same subnet is one nameserver wearing a disguise.
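Reverse DNS (PTR records) for anything that sends mail, or your mail quietly stops being delivered. Many receiving servers reject or junk messages from an IP with no PTR record, or one that does not match.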
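
The first and last items are quick to check from a shell. A sketch with assumptions spelled out: `ttl_report` is a name I made up, `ns1.example.com` stands in for one of your authoritative nameservers, and 192.0.2.10 stands in for your mail server's address.

```shell
# ttl_report MAX: read `dig +noall +answer` lines on stdin and flag every
# record whose TTL (second field) is above MAX seconds.
ttl_report() {
  awk -v max="$1" '$2 ~ /^[0-9]+$/ {
    print $1, $4, $2, (($2 + 0 > max + 0) ? "lower-before-change" : "ok")
  }'
}

# Usage:
#   # Ask an authoritative server so you see the configured TTL,
#   # not a cache counting down.
#   dig @ns1.example.com example.com A +norecurse +noall +answer | ttl_report 300
#
#   # Reverse DNS: the PTR name should resolve back to the same address.
#   dig -x 192.0.2.10 +short
```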

The whole trick is refusing to guess. Reach, resolve, correct, in that order. Run the steps and DNS stops being mysterious and turns into a short walk to whichever layer is lying to you.

What is the DNS bug that took you longest to catch? Mine was a validating resolver throwing SERVFAIL on a broken DNSSEC signature while everyone else resolved fine. I would genuinely love to hear the one that got you.