DEV Community

Naz Quadri

Posted on • Originally published at nazquadri.dev

Your DNS is Lying to You


What Actually Happens Between a URL and the First Byte

Reading time: ~13 minutes


You typed api.example.com into your browser — or curl'd it, or your service tried to connect to it — and something happened. Some bytes arrived. You moved on.

Underneath, DNS resolved that name for you. And DNS is not a lookup table. It is a distributed, eventually consistent database with a 40-year-old trust model, deployed across millions of machines that have no obligation to agree with each other. When it goes wrong — and it does go wrong — the failure modes are some of the most maddening in all of networking, because the answer you get looks valid. It's just wrong.

There are four distinct roles. Most people know one of them.


The Bug That Made This Click

Picture this: a microservice can't connect to a dependency. Health checks pass. curl works fine from your laptop. The service throws connection errors that make no sense.

The service is running in a Docker container. Inside the container, curl api.internal.corp returns a different IP than dig api.internal.corp run from the same container.

Different. IP.

Same host. Same moment. Different tool. Different answer.

We'll learn exactly why that's possible before the end of this post.


The Cast of Characters

Before the mechanics, let's name the players. There are four distinct roles in the DNS resolution chain, and conflating them is the source of most confusion.

The stub resolver lives on your machine. It's not really a DNS server — it can't do much on its own. It's the code in libc (or your OS networking stack) that takes a hostname and says "I need an IP for this" by forwarding the question to someone who can actually answer it. On Linux, that's getaddrinfo(). On macOS it goes through mDNSResponder. Every DNS query your applications make starts here.
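You can poke the stub resolver directly from any language. A minimal Python sketch using the same getaddrinfo() path your applications take (localhost is used here so it runs without any external DNS):

```python
import socket

# Ask the OS stub resolver, exactly as curl or your service would.
# This walks /etc/hosts, the local cache, and resolv.conf in order;
# it is NOT a raw DNS query.
for family, socktype, proto, canonname, sockaddr in socket.getaddrinfo(
    "localhost", 443, type=socket.SOCK_STREAM
):
    print(family.name, sockaddr[0])
```

Swap "localhost" for any hostname and you see exactly what your application would connect to, overrides and all.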

The recursive resolver (also called a "full-service resolver" or sometimes misleadingly the "DNS server") does the actual work. This is the server your stub resolver talks to. Its job is to walk the DNS tree from the root all the way down to a definitive answer. Your ISP runs one. Google runs one at 8.8.8.8. Cloudflare at 1.1.1.1. Your office probably has one too.

The authoritative nameserver actually owns the answer. If you bought example.com and set up your DNS records, you pointed your registrar at some authoritative nameservers. Those servers are the canonical source of truth for your zone. They don't do recursion — they just answer questions about records they own.

The root nameservers are where recursion starts when a recursive resolver has no cached answer. There are exactly 13 of them by IP — hundreds of physical machines behind those 13 addresses via anycast. Why 13? Because the original DNS protocol used 512-byte UDP packets, and 13 was the most nameserver records that would fit 🤷‍♂️. They don't know where api.example.com is, but they know who handles .com.

DNS resolution chain — four layers of delegation


What Actually Happens When You Type a URL

Let's trace it. You type https://api.example.com/v1/users and hit Enter.

Your browser extracts the hostname: api.example.com. It calls into the OS resolver. Before any network packet leaves your machine, the OS checks three things in order:

First, /etc/hosts. This is a flat text file that predates DNS by over a decade. It's checked before anything else, unconditionally. If api.example.com appears in /etc/hosts, the search is over — no network query happens at all. This is why adding entries to /etc/hosts works for local development, and it's also why malware occasionally modifies it to redirect banking sites. It's also why many devs are confused when their DNS changes don't seem to take effect: they have a stale hosts-file entry they forgot about from six months ago.
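A minimal override looks like this (the IP is a hypothetical example from the documentation range):

```
# /etc/hosts — checked before any DNS query, unconditionally
127.0.0.1     localhost
203.0.113.7   api.example.com    # local override; wins over real DNS
```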

Second, the local DNS cache. Your OS, and often a local daemon (systemd-resolved on modern Linux, mDNSResponder on macOS), keeps a cache of recent answers. If the cache has a fresh entry, done.

Third, and only if neither of those had an answer: a query goes out to the recursive resolver specified in /etc/resolv.conf.

# /etc/resolv.conf — the file that decides where your DNS queries go
nameserver 192.168.1.1      # your router, probably
nameserver 8.8.8.8          # fallback: Google
search corp.internal        # try appending this domain to short names

That nameserver line is the only thing most developers know about /etc/resolv.conf. The search directive is where it gets interesting — and where short hostnames like db can silently resolve to db.corp.internal, which is either convenient or baffling depending on the day.

That's why db works on your laptop but fails in CI: one has a search corp.internal entry and the other doesn't.
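The expansion the stub resolver performs is simple to sketch. Assuming the default ndots:1 behavior, a name containing no dot gets the search suffixes tried first:

```python
# Sketch of search-domain expansion (default ndots:1 semantics).
search_domains = ["corp.internal"]          # from `search` in resolv.conf

def candidates(name: str) -> list[str]:
    if "." in name:                          # "qualified enough": try as-is first
        return [name] + [f"{name}.{d}" for d in search_domains]
    return [f"{name}.{d}" for d in search_domains] + [name]

print(candidates("db"))          # ['db.corp.internal', 'db']
print(candidates("db.prod"))     # ['db.prod', 'db.prod.corp.internal']
```

The real logic lives in your libc and is tunable via the ndots option in resolv.conf; this toy version just shows why `db` quietly becomes `db.corp.internal`.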


The Recursive Resolver Earns Its Name

Your query reaches the recursive resolver. Let's say it's a cold cache — never seen api.example.com before.

The resolver starts at the top.

It queries one of the 13 root nameserver IPs (hardcoded into all resolver software as the "root hints"). The root server doesn't know api.example.com. It responds with a referral: "I don't know, but .com is handled by these nameservers."

The resolver then queries a .com Top Level Domain (TLD) nameserver. The TLD server doesn't know api.example.com. It responds with a referral: "I don't know, but example.com is handled by these nameservers."

The resolver then queries an authoritative nameserver for example.com. This one knows. It returns an A record (IPv4) or AAAA record (IPv6) for api.example.com, along with a TTL — a "time to live" value in seconds.

The resolver caches the answer for TTL seconds and returns it to your stub resolver. Your stub resolver hands it to getaddrinfo(). Your browser gets an IP. The connection starts.

That whole chain — root → TLD → authoritative — happened in the background, probably in under 100ms. On a warm cache, it's a single hop and maybe 5ms.

Query trace for api.example.com:
  → root nameserver (hardcoded IPs)
    ← "ask .com TLD at 192.5.6.30"
  → .com TLD nameserver (192.5.6.30)
    ← "ask ns1.example.com at 93.184.216.10"
  → ns1.example.com (93.184.216.10)
    ← "api.example.com A 198.51.100.42  TTL 300"
  → your stub resolver
    ← 198.51.100.42

Three round trips. More if any of those delegations weren't cached. And critically: the recursive resolver that did all this work is running on someone else's machine, which you do not control, and which has its own cache that it shares with everyone else who uses it.
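Each of those round trips is a small UDP packet — the same 512-byte limit that capped the roots at 13. For a feel of the wire format, here is a toy sketch of one detail from RFC 1035: the question name is encoded as length-prefixed labels, with no dots on the wire:

```python
def encode_qname(name: str) -> bytes:
    """Encode 'api.example.com' as DNS wire-format labels:
    b'\\x03api\\x07example\\x03com\\x00'."""
    out = b""
    for label in name.rstrip(".").split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"   # empty label terminates the name

print(encode_qname("api.example.com"))
```

Real queries add a header, type, and class around this, but the label encoding is why a hostname label maxes out at 63 bytes.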


TTL Is Not a Suggestion, It's Also Not a Guarantee

TTL (Time to Live) is the record's expiry hint. If your A record has TTL 300, it means "cache this for 300 seconds, then check again."

Here's what TTL cannot do: it cannot tell resolvers that already have a cached answer to throw it away. When you update a DNS record, the old answer is still valid in every cache that holds it, until their individual TTLs expire. If your TTL was 24 hours (86400 seconds), some resolvers will be serving the old answer for up to 24 hours after your change.

This is why "just flush DNS" is not a real answer to a propagation problem. You can flush your local machine's cache. You cannot flush Google's cache. You cannot flush your ISP's cache. You cannot flush the cache of the recursive resolver your user's mobile carrier uses.

What you can do: lower your TTL well before a migration. If you know you're moving an IP next Tuesday, set your TTL to 60 seconds on Friday. Let the short TTL propagate. Do the migration. The blast radius of stale caches is 60 seconds instead of 24 hours.

What you cannot do: change the TTL and have it take effect immediately. The TTL change itself has to propagate, and it propagates at the old TTL.

That's right. The new TTL doesn't matter until the old TTL expires and resolvers re-fetch the record. Plan accordingly.
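The timing works out like this — a sketch with hypothetical numbers:

```python
# Sketch: planning a migration when the record currently has a 24h TTL.
OLD_TTL = 86400   # seconds; what remote caches may hold right now
NEW_TTL = 60      # seconds; the value you lower it to

# Worst case: a resolver fetched the record just before you lowered the
# TTL, so it won't re-fetch (and see TTL=60) for another OLD_TTL seconds.
wait_before_migration = OLD_TTL

print(f"Lower the TTL at least {wait_before_migration // 3600}h before moving the IP.")
print(f"Then the stale-answer window shrinks from {OLD_TTL}s to {NEW_TTL}s.")
```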

That's why your DNS change isn't working yet. The old answer is cached at some resolver with a TTL of 3600 and there are 47 minutes left. Verify by querying the authoritative server directly: dig @ns1.example.com api.example.com — that bypasses all caches and shows what the authoritative server has right now.


CNAME Chains and Why They're Weird

An A record maps a name to an IP. A CNAME record maps a name to another name — a canonical alias.

api.example.com  CNAME  loadbalancer.us-east-1.elb.amazonaws.com

When a resolver sees a CNAME, it has to resolve the target too. So api.example.com → look up loadbalancer.us-east-1.elb.amazonaws.com → that returns an A record. Two lookups, one query from your perspective.
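From the stub-resolver side, you can ask getaddrinfo() to report the canonical name it ended up at — i.e. the tail of any CNAME chain — with the AI_CANONNAME flag (shown on localhost so it runs anywhere):

```python
import socket

# AI_CANONNAME asks the resolver for the canonical name of the host,
# which is what remains after any CNAME records have been followed.
infos = socket.getaddrinfo(
    "localhost", None, type=socket.SOCK_STREAM, flags=socket.AI_CANONNAME
)
family, socktype, proto, canonname, sockaddr = infos[0]
print(canonname or "localhost", sockaddr[0])
```

Point it at a CDN-fronted hostname and canonname will typically be the provider's name, not the one you asked for.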

CNAMEs are used everywhere — CDNs, load balancers, cloud services — because they let you point a name at another name that the provider controls. When AWS moves your load balancer, they update their A record; your CNAME keeps working.

The rule everyone forgets: you cannot have a CNAME at a zone apex. The zone apex is the bare domain itself — example.com with nothing in front of it. Why? Because CNAME has to be the only record for a name (it replaces the name entirely), but the zone apex needs SOA and NS records. You can't have a CNAME and also have NS records. The DNS spec doesn't allow it.

This is why CDN and DNS providers invented CNAME flattening (Cloudflare calls it CNAME at the root, Route53 calls it ALIAS records). When you point example.com at example.com.cdn.cloudflare.net, the provider does the CNAME lookup at query time and returns a flat A record to the client. From the outside, it looks like an A record. It's not. It's a CNAME that your DNS provider is silently expanding.

CNAME resolution — normal vs flattened

This matters when you're debugging. If dig example.com A returns an IP directly but you know you set up a CNAME at the root, the flattening is working. If it returns a CNAME at the apex, something's wrong with your provider config. Either way, a flattened CNAME and a plain A record look identical to the application layer.

That's why you can't put a CNAME on example.com itself — CNAME semantics conflict with the SOA and NS records that every zone apex must have. Your DNS provider works around this with record flattening, which looks like an A record to the outside world.


Why dig and curl Give Different Answers

Back to my Docker debugging story. curl api.internal.corp returned a different IP than dig api.internal.corp. How?

Because they use different resolution paths.

dig is a DNS tool. It talks directly to a DNS resolver — by default, whatever is in /etc/resolv.conf, or one you specify with @. It bypasses the OS name-resolution stack entirely: no /etc/hosts, no nsswitch.conf, just a raw DNS query on the wire.

curl uses getaddrinfo(). That function goes through the full OS name resolution stack, including /etc/nsswitch.conf — the OS's routing table for name resolution, a priority-ordered list of where to look. On a typical Linux machine it looks like:

# /etc/nsswitch.conf
hosts:          files dns myhostname

That files entry means /etc/hosts runs first — and curl reads it, while dig does not.

Inside Docker, it gets more interesting. Docker injects its own nameserver into the container's /etc/resolv.conf, pointing at Docker's embedded resolver at 127.0.0.11. That resolver handles Docker network DNS (container names, service names) and forwards everything else upstream, so it may return different answers for internal names than an external DNS server would. dig run without an explicit @server still reads /etc/resolv.conf — so inside the container, dig api.internal.corp was querying Docker's resolver, not the corporate DNS. And Docker's resolver didn't know about the internal service.

The rule: dig shows you what a DNS query returns. curl shows you what the application stack resolves. They are not always the same query against the same server.

When they disagree, the question is: which one matches what your application uses? Usually it's the curl path, because your application also calls getaddrinfo().

That's why dig and curl gave different answers inside that Docker container — they resolved through different layers: curl went through the container's full getaddrinfo() stack (/etc/hosts first, per nsswitch.conf), while dig sent a raw query to Docker's embedded resolver.


The Trust Problem Nobody Talks About

DNS was designed in 1983. The original protocol has no cryptographic authentication. A resolver asks a question; an authoritative server answers. There's nothing in the original design that proves the answer came from the real authoritative server.

This isn't a theoretical concern. DNS spoofing and cache poisoning are real attacks. An attacker who can intercept or forge DNS responses can redirect any hostname to any IP — transparently, with no visible error to the user.

The fix is DNSSEC — DNS Security Extensions. DNSSEC adds cryptographic signatures to DNS records. Authoritative servers sign their records with a private key; validators check those signatures against a public key published in the parent zone. The chain of trust runs from the root all the way down.

DNSSEC deployment is... fragmented. The root zone is signed. Many TLDs are signed. Many individual domains are not. And critically, many recursive resolvers don't validate DNSSEC even if the zone is signed — they just pass the signatures along. Validation has to happen somewhere for it to matter, and the chain has a lot of links.

This is why a domain sometimes fails to resolve with SERVFAIL even though its records look correct — someone in the delegation chain has a misconfigured or expired signing key, and a validating resolver refuses to return an answer it can't verify.

DoT (DNS over TLS) and DoH (DNS over HTTPS) are a different layer of protection. They encrypt the query in transit — so your ISP can't see what you're looking up, and a network attacker can't intercept the packet. But they don't solve the authoritative trust problem. You're still trusting the resolver at the other end.

The honest summary: DNS's trust model is "trust whoever answers." DNSSEC tries to fix that. Its deployment is patchy. DoT/DoH protect the wire, not the answer.


Practical Debugging Toolkit

When DNS is misbehaving, you want to query different layers explicitly rather than guessing:

# What does the authoritative server say right now?
dig @ns1.example.com api.example.com A

# What does your configured resolver return (including its cache)?
dig api.example.com A

# What would an uncached query to a specific public resolver return?
dig @8.8.8.8 api.example.com A

# Trace the full recursive delegation chain
dig +trace api.example.com A

# Show what getaddrinfo() would return (follows /etc/hosts, nsswitch.conf)
# getent ships with glibc on most Linux distros; not available on macOS
getent hosts api.example.com

# Check your resolver configuration
cat /etc/resolv.conf
resolvectl status    # systemd-resolved environments

The +trace flag on dig is particularly useful when something is broken in the delegation chain itself — wrong glue records, expired DS records, missing NS entries. It shows you every hop.

When dig @authoritative returns the right answer but your application gets something different, the problem is between your application and that authoritative server. Work backwards: your OS cache, your container's resolver, your corporate split-horizon DNS, your VPN's DNS override.


The Thing Worth Holding Onto

I have a love-hate relationship with DNS. It's 40 years old, it was designed when the internet was a few hundred hosts, the trust model was "everyone on the network is trustworthy," and the failure modes of a globally distributed cache were... well, not fully thought through.

And yet it works. Hundreds of billions of queries a day, run by competing organisations with no central coordinator, and it almost never falls over. When it does fail, the failures are the worst kind: subtle. An answer that looks valid but is stale. A resolution path that bypasses the record you just updated. A CNAME chain that doesn't behave the way you modelled it. You don't get an error. You get the wrong IP, served confidently. (Oh dear. Maybe LLMs learned from DNS.)

Every DNS debugging session I've ever had ended the same way: I wasn't querying the layer I thought I was querying. The answer is always cached somewhere you forgot to check. And you do this so rarely you basically have to re-teach yourself the DNS stack each time.


Further Reading

  • man 1 dig — More flags than you'll ever need, but +trace, +short, and @server will cover 90% of debugging sessions.
  • DNS and BIND, 5th ed. — The definitive reference. Dense. Worth having on a shelf.
  • How DNS Works (howdns.works) — Comic-style visual walkthrough of the resolution chain. Good for sharing with someone who's new to it.
  • Cloudflare's DNS Learning Center — Technically accurate, well-illustrated, and they have a vested interest in you understanding DNS.
  • RFC 1034, RFC 1035 — The original 1987 DNS specs. Remarkably readable for an RFC. Much of what's described here is in those two documents.

I'm writing a book about what makes developers irreplaceable in the age of AI. Join the early access list →


Naz Quadri once spent four hours debugging a DNS issue that turned out to be a single line he put into /etc/hosts in 2021. He blogs at nazquadri.dev. Rabbit holes all the way down 🐇🕳️.
