<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SAI RAM</title>
    <description>The latest articles on DEV Community by SAI RAM (@sai_ram_0000).</description>
    <link>https://dev.to/sai_ram_0000</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2169650%2F5f4ceeb8-5c63-4c17-85a8-52beb60125a5.jpeg</url>
      <title>DEV Community: SAI RAM</title>
      <link>https://dev.to/sai_ram_0000</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sai_ram_0000"/>
    <language>en</language>
    <item>
      <title>How DNS Actually Works: Resolution Hierarchy, Caching, and Production Failure Modes</title>
      <dc:creator>SAI RAM</dc:creator>
      <pubDate>Sat, 20 Jun 2026 10:47:27 +0000</pubDate>
      <link>https://dev.to/sai_ram_0000/how-dns-actually-works-resolution-hierarchy-caching-and-production-failure-modes-195h</link>
      <guid>https://dev.to/sai_ram_0000/how-dns-actually-works-resolution-hierarchy-caching-and-production-failure-modes-195h</guid>
      <description>&lt;h2&gt;
  
  
  DNS Is an Indirection Layer, Not a Lookup Table
&lt;/h2&gt;

&lt;p&gt;The "phonebook" metaphor everyone reaches for is actively misleading — and worse, it frames DNS as solved infrastructure when it's anything but. A phonebook is a static mapping. You look up a name, you get a number, done. DNS is something fundamentally different: a decoupling mechanism that separates stable human-readable identifiers from the volatile, ephemeral IP addresses underneath them. That distinction has enormous architectural consequences.&lt;/p&gt;

&lt;p&gt;When Google migrates a backend cluster from one datacenter to another, no client breaks. No user re-bookmarks anything. No API integration requires a config change. The domain &lt;code&gt;google.com&lt;/code&gt; stays fixed while the IP reality beneath it shifts completely — DNS absorbs the entire change. The name is the contract; the address is an implementation detail. Every major piece of internet infrastructure is built on top of this indirection, and understanding it shapes how you architect for resilience.&lt;/p&gt;

&lt;p&gt;CDN edge nodes, anycast load balancers, blue-green deployment targets, multi-cloud failover — none of these work without DNS as a transparent redirection primitive. When Cloudflare routes you to their nearest PoP, or AWS Route 53 returns a different A record based on your source region, they're exploiting this indirection to shape traffic without touching the client. The client never knows, and by design, never needs to.&lt;/p&gt;

&lt;p&gt;Here's the counterintuitive part: DNS isn't slow because it's distributed across thousands of servers worldwide. It's &lt;em&gt;fast&lt;/em&gt; because of that distribution — aggressive resolver caching at every layer combined with anycast routing means most queries never travel far. The latency you occasionally see in DNS is almost always a cache miss penalty, not a property of the system at rest.&lt;/p&gt;

&lt;p&gt;This reframes what TTL tuning actually is. It's not an ops detail — it's you setting the durability of the indirection contract. I've seen teams cut over cloud providers and get burned because they lowered TTLs &lt;em&gt;during&lt;/em&gt; the migration window rather than 24–48 hours before. By then, stale records are already distributed across resolvers worldwide and there's nothing you can do but wait out the original TTL. Drop it to 60 seconds two days before the cutover; let propagation happen before you flip the switch, not after.&lt;/p&gt;

&lt;p&gt;Negative caching — how long a resolver holds onto NXDOMAIN responses — operates by the same logic and bites engineers just as hard when a new record isn't resolving as expected. The mechanics of how resolvers navigate this hierarchy, from your OS cache outward to the root, reveal why those timing concerns aren't theoretical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Resolution Hierarchy: 7 Steps from Keystroke to IP
&lt;/h2&gt;

&lt;p&gt;Here's something that surprises engineers the first time they think through it carefully: the overwhelming majority of DNS queries never leave the recursive resolver's cache. The root servers — the 13 logical anchors of the entire DNS hierarchy — handle a vanishingly small fraction of total query volume. Cloudflare's 1.1.1.1 resolver fields billions of queries daily, and nearly all of them are served from memory. The elaborate recursive machinery exists for cache misses, which in practice means cold starts and TTL expirations.&lt;/p&gt;

&lt;p&gt;Understanding why requires tracing the full resolution waterfall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cache-first hierarchy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Resolution is a layered fallback chain, and each layer is a cache hit opportunity that short-circuits everything below it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Keystroke → Browser cache (0ms)
          → OS stub resolver + /etc/hosts (~1ms)
          → Recursive resolver cache (~5ms)
          → Root nameserver (authoritative miss, ~20–100ms full round-trip)
          → TLD nameserver
          → Authoritative nameserver
          → Response propagates back up the chain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The browser maintains its own DNS cache with independent TTL tracking; Chrome exposes this at &lt;code&gt;chrome://net-internals/#dns&lt;/code&gt;. A cache hit here costs nothing measurable. On a miss, the query drops to the OS stub resolver, which by default checks &lt;code&gt;/etc/hosts&lt;/code&gt; before consulting the configured recursive resolver. The stub resolver is deliberately thin: it doesn't perform recursion itself, it delegates. All the real work (iterative queries, referral chasing, DNSSEC validation) happens inside the recursive resolver. Three widely used public resolvers are Cloudflare (&lt;code&gt;1.1.1.1&lt;/code&gt;), Google Public DNS (&lt;code&gt;8.8.8.8&lt;/code&gt;), and OpenDNS (&lt;code&gt;208.67.222.222&lt;/code&gt;), each running globally distributed anycast infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;/etc/hosts and why it still matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/etc/hosts&lt;/code&gt; file is evaluated before any network query, which makes it a blunt but effective override mechanism. Local dev environments exploit this constantly, mapping &lt;code&gt;api.myapp.local&lt;/code&gt; to &lt;code&gt;127.0.0.1&lt;/code&gt; without touching DNS infrastructure. Container orchestration leans on a related piece of local resolver configuration: Kubernetes runs CoreDNS as the cluster's DNS server and configures each pod's &lt;code&gt;/etc/resolv.conf&lt;/code&gt; to point at it, enabling service discovery via names like &lt;code&gt;my-service.default.svc.cluster.local&lt;/code&gt; without external DNS round-trips. CoreDNS answers these from its in-memory view of the cluster's Services, so the query never leaves the cluster. I've seen engineers chase mysterious resolution failures in k8s for an hour before realizing a manually edited &lt;code&gt;/etc/hosts&lt;/code&gt; entry was answering the lookup before CoreDNS ever saw it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where cache misses actually go&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On a full recursive resolution, the resolver starts at a root server — not to get the final answer, but to get a referral. The root knows which nameservers are authoritative for &lt;code&gt;.com&lt;/code&gt;, &lt;code&gt;.io&lt;/code&gt;, &lt;code&gt;.dev&lt;/code&gt;, and so on. The resolver follows the referral to the TLD nameserver, which returns a referral to the domain's authoritative nameservers. A third query to the authoritative server finally yields the record. Each layer's response is cached with the TTL specified in that response, not a TTL the resolver invents.&lt;/p&gt;
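
&lt;p&gt;The referral chain above can be sketched as a toy model. This is a minimal illustration rather than a real resolver: the dictionaries, the server labels (&lt;code&gt;tld-com&lt;/code&gt;, &lt;code&gt;auth-example&lt;/code&gt;), the address &lt;code&gt;192.0.2.10&lt;/code&gt;, and the 300-second TTL are all made up, and no network traffic is sent.&lt;/p&gt;

```python
# Toy model of iterative resolution. Every name, address, and TTL here is
# invented for illustration; nothing touches the network.
ROOT = {"com.": "tld-com"}                                 # root: TLD to TLD server
TLD_COM = {"example.com.": "auth-example"}                 # TLD: domain to authoritative server
AUTH_EXAMPLE = {"www.example.com.": ("192.0.2.10", 300)}   # name to (address, TTL)
SERVERS = {"tld-com": TLD_COM, "auth-example": AUTH_EXAMPLE}


def resolve(name, cache, now):
    """Return (address, upstream_queries) for a fully qualified name."""
    if name in cache:
        address, expires = cache[name]
        if max(0, expires - now):              # TTL remaining: answer from cache
            return address, 0
    labels = name.rstrip(".").split(".")
    tld = labels[-1] + "."
    domain = ".".join(labels[-2:]) + "."
    tld_server = SERVERS[ROOT[tld]]            # query 1: root refers us to the TLD servers
    auth_server = SERVERS[tld_server[domain]]  # query 2: TLD refers us to the authoritative server
    address, ttl = auth_server[name]           # query 3: the authoritative answer
    cache[name] = (address, now + ttl)         # cached with the TTL from the response
    return address, 3


cache = {}
print(resolve("www.example.com.", cache, now=0))     # ('192.0.2.10', 3)
print(resolve("www.example.com.", cache, now=100))   # ('192.0.2.10', 0)
print(resolve("www.example.com.", cache, now=300))   # ('192.0.2.10', 3)
```

&lt;p&gt;The first call walks all three levels, the second is answered from cache, and the third arrives just as the TTL runs out and walks the chain again.&lt;/p&gt;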

&lt;p&gt;This is where each layer's distinct failure surface becomes operationally significant. A recursive resolver with a poisoned cache corrupts every downstream client. A TLD server with elevated latency inflates resolution time for every cold-start query to that TLD. An authoritative server that returns inconsistent TTLs across its nameserver fleet creates a thundering herd when the shortest-TTL version expires and triggers simultaneous re-resolution from thousands of resolvers.&lt;/p&gt;

&lt;p&gt;The TTL semantics at the authoritative layer have the most direct production impact — and that's what makes record types and their individual TTL behavior worth examining in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Root Servers and TLD Servers: The Authoritative Spine
&lt;/h2&gt;

&lt;p&gt;Here's a misconception worth correcting early: there are not 13 root servers. There are 13 root server &lt;em&gt;names&lt;/em&gt; — &lt;code&gt;a.root-servers.net&lt;/code&gt; through &lt;code&gt;m.root-servers.net&lt;/code&gt; — backed by over 1,600 physical instances distributed globally via anycast routing. The "13" number is a direct artifact of original DNS design constraints: a UDP packet carrying root server data couldn't exceed 512 bytes, which capped the A record count at 13. Anycast sidesteps this entirely — your resolver queries &lt;code&gt;198.41.0.4&lt;/code&gt; (the &lt;code&gt;a&lt;/code&gt; root), and BGP routes that packet to whichever physical instance is topologically nearest, often within single-digit milliseconds.&lt;/p&gt;

&lt;p&gt;What root servers actually do is narrower than most engineers assume. They don't return final IPs. They don't know where &lt;code&gt;google.com&lt;/code&gt; lives. They inspect the rightmost label of a query — the TLD — and respond with NS records pointing to the authoritative TLD servers for that zone. A query for &lt;code&gt;api.stripe.com&lt;/code&gt; gets back a referral to Verisign's &lt;code&gt;.com&lt;/code&gt; nameservers, nothing more. Root servers are directory pointers, not answer sources.&lt;/p&gt;

&lt;p&gt;The TLD layer is where scale becomes genuinely impressive. Verisign operates the &lt;code&gt;.com&lt;/code&gt; and &lt;code&gt;.net&lt;/code&gt; TLDs — &lt;code&gt;.com&lt;/code&gt; alone carries approximately 170 million registered domains and fields billions of queries per day across a globally distributed infrastructure. The TLD nameservers hold NS records for every registered domain under that zone: Verisign doesn't know Stripe's IPs, but it knows which nameservers are authoritative for &lt;code&gt;stripe.com&lt;/code&gt;. That delegation — root to TLD to authoritative — is what makes DNS a distributed system rather than a centralized database. No single server holds the full namespace; authority is partitioned recursively by zone boundary.&lt;/p&gt;

&lt;p&gt;This delegation chain also explains something that frustrates engineers during deployments: a newly registered domain does not always resolve right away. The registrar has to submit the delegation to the TLD registry, and the registry has to publish it in the live zone. Large registries such as Verisign typically publish within minutes, but some TLDs still update on scheduled cycles, and a resolver that asked too early may hold a cached NXDOMAIN for a while longer, which is where the familiar "up to 48 hours" guidance comes from. I've seen engineers provision infrastructure, register a fresh domain, and then spend an afternoon confused about why their resolver returns NXDOMAIN: the NS delegation wasn't yet visible to that resolver. The domain exists in the registrar's database, but that's a separate system from the live zone file the registry is serving.&lt;/p&gt;

&lt;p&gt;The authoritative nameserver sitting at the bottom of this chain is where actual resource records live — A, AAAA, CNAME, MX, and the rest — and its behavior under load has its own set of sharp edges worth understanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authoritative Nameservers and DNS Record Types
&lt;/h2&gt;

&lt;p&gt;Here's something the resolution hierarchy glosses over: every recursive resolver, no matter how sophisticated its caching strategy, is ultimately fetching data from a nameserver that answers only for the zones it hosts and knows nothing about the rest of the namespace. The authoritative nameserver holds the canonical records for its zone and signals that ownership by setting the AA (Authoritative Answer) bit in its responses. When you see AA=1, you're at the end of the delegation chain, and that answer didn't come from cache.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Record Taxonomy That Actually Matters in Production
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;A and AAAA&lt;/strong&gt; are the terminal answers: IPv4 and IPv6 addresses respectively. Everything else in DNS either routes you toward them or carries out-of-band metadata. A &lt;strong&gt;CNAME&lt;/strong&gt; introduces an alias: &lt;code&gt;www.example.com CNAME example-prod.cdn.net&lt;/code&gt; tells resolvers to restart the lookup with the new name. The constraint that bites people constantly is that a CNAME cannot coexist with other records at the same node. At the zone apex (&lt;code&gt;example.com&lt;/code&gt; itself, not &lt;code&gt;www&lt;/code&gt;) you're required to have NS and SOA records. A CNAME there therefore breaks the rule set out in RFC 1034 and restated in RFC 1912, which creates a real problem when you want to point your root domain directly at a CDN hostname.&lt;/p&gt;
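
&lt;p&gt;The coexistence rule is easy to lint for before a zone change ships. The sketch below assumes a zone already parsed into &lt;code&gt;(owner name, record type)&lt;/code&gt; pairs; that representation and the function name &lt;code&gt;cname_conflicts&lt;/code&gt; are hypothetical, and it ignores the DNSSEC record types that are allowed to sit alongside a CNAME.&lt;/p&gt;

```python
from collections import defaultdict


def cname_conflicts(records):
    """Return owner names where a CNAME shares a node with another record type.

    Simplification: DNSSEC types such as RRSIG and NSEC, which may legally
    coexist with a CNAME, are not special-cased here.
    """
    types_by_name = defaultdict(set)
    for name, rtype in records:
        types_by_name[name].add(rtype)
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and types != {"CNAME"}
    )


# Hypothetical zone contents.
zone = [
    ("example.com.", "SOA"),
    ("example.com.", "NS"),
    ("example.com.", "CNAME"),       # conflicts with the mandatory apex SOA and NS
    ("www.example.com.", "CNAME"),   # fine: nothing else lives at this name
]
print(cname_conflicts(zone))   # ['example.com.']
```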

&lt;p&gt;The canonical workaround is vendor-specific. Cloudflare's &lt;strong&gt;CNAME Flattening&lt;/strong&gt; resolves the CNAME chain internally and returns the final A record as if it were authoritative for the apex. AWS Route 53's &lt;strong&gt;ALIAS record&lt;/strong&gt; does the same — it's a Route 53 abstraction, not a real DNS record type, that lets you map &lt;code&gt;example.com&lt;/code&gt; directly to an ALB or CloudFront distribution. Both approaches solve the RFC violation by doing the indirection server-side before the response leaves the nameserver. I've seen this misconfiguration burn teams who migrate to a CDN, correctly update &lt;code&gt;www&lt;/code&gt;, then wonder why the naked domain returns SERVFAIL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MX records&lt;/strong&gt; encode mail routing priority — lower number means higher preference. &lt;strong&gt;NS records&lt;/strong&gt; delegate zone authority to specific nameservers. &lt;strong&gt;SOA&lt;/strong&gt; (Start of Authority) is the zone's metadata header: primary nameserver, admin contact (encoded as an email with the &lt;code&gt;@&lt;/code&gt; replaced by &lt;code&gt;.&lt;/code&gt;), serial number, and the refresh/retry/expire/minimum TTL intervals that govern secondary nameserver behavior. The serial number is the synchronization primitive — secondaries compare their local serial against the primary's SOA serial, and a higher primary serial triggers a zone transfer (AXFR or IXFR). Forget to increment the serial after edits on a primary, and secondaries will silently serve stale data.&lt;/p&gt;
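
&lt;p&gt;The serial check a secondary performs can be sketched in a few lines. SOA serials are 32-bit values compared with the wraparound-aware arithmetic of RFC 1982; the function name and the sample serials below are illustrative, and a real secondary also applies the refresh, retry, and expire timers.&lt;/p&gt;

```python
def primary_is_newer(local_serial, primary_serial):
    """RFC 1982 style comparison of 32-bit SOA serials.

    The primary counts as newer when it is ahead of the local copy by
    between 1 and 2**31 - 1, measured modulo 2**32, so a serial that wraps
    past 4294967295 back to 0 still reads as an increase.
    """
    ahead_by = (primary_serial - local_serial) % 2**32
    return ahead_by in range(1, 2**31)


print(primary_is_newer(2026062001, 2026062002))   # True: transfer the zone
print(primary_is_newer(2026062002, 2026062002))   # False: already in sync
print(primary_is_newer(2026062002, 2026062001))   # False: serial went backwards
print(primary_is_newer(4294967295, 0))            # True: wrapped around
```

&lt;p&gt;The third case is the silent-staleness scenario: an edit on the primary without a serial bump, or with a lower serial, never triggers a transfer.&lt;/p&gt;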

&lt;p&gt;&lt;strong&gt;TXT records&lt;/strong&gt; carry arbitrary text, but in practice they're load-bearing infrastructure. SPF records (&lt;code&gt;v=spf1 include:...&lt;/code&gt;) tell receiving MTAs which IPs are authorized to send mail for your domain. DKIM records publish the public key that verifies message signatures. When your mail infrastructure changes — new ESP, additional sending IP range, rotated DKIM keys — and the DNS records don't follow, deliverability degrades silently. Messages don't bounce; they land in spam or get dropped, and the feedback loop is slow enough that the DNS drift often goes unnoticed for days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple A records&lt;/strong&gt; for the same name give you round-robin DNS: nameservers and resolvers typically rotate the order of the address set between responses, and most clients connect to the first address they receive. Twitter/X has historically published several A records for its primary hostnames with short TTLs as a first-layer distribution mechanism before traffic even reaches a load balancer. The critical caveat: clients cache and pin to one address for the TTL duration, so the balancing is statistical at best. With a 60-second TTL you get reasonable spread at scale; with a 300-second TTL and sticky clients, one host can absorb a disproportionate share.&lt;/p&gt;

&lt;p&gt;Understanding what lives inside a zone — and how authoritative servers signal ownership of that data — sets up the more operationally interesting question of how that data propagates, ages, and goes wrong across the caching layer between you and the authoritative source.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caching, TTLs, and the Propagation Delay Trap
&lt;/h2&gt;

&lt;p&gt;Here's the counterintuitive part: you don't control when the internet "sees" your DNS change. You only control how long resolvers are &lt;em&gt;allowed&lt;/em&gt; to cache the previous answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTL is a lease duration, not a cache expiry signal.&lt;/strong&gt; Every DNS record carries a TTL value — set by the zone owner — that tells resolvers how many seconds they may serve that answer from cache before re-querying. A &lt;code&gt;3600&lt;/code&gt; on an A record means any resolver that fetched it can serve the cached IP for up to an hour without touching your authoritative nameserver again. Once that window closes, the resolver queries upstream and gets whatever answer exists at that moment.&lt;/p&gt;

&lt;p&gt;This is the mechanism behind what the industry misleadingly calls "DNS propagation." There is no central push. No broadcast. No synchronization event. "Propagation" is just the gradual expiration of cached copies scattered across tens of thousands of recursive resolvers worldwide, each on its own independent TTL countdown. When engineers say "waiting for DNS to propagate," they mean waiting for the old TTL to drain everywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The migration trap.&lt;/strong&gt; I've watched this burn engineers repeatedly during blue-green cutovers: the zone record sits at &lt;code&gt;TTL 3600&lt;/code&gt;. Migration window opens, engineer drops TTL to &lt;code&gt;60&lt;/code&gt; and updates the A record simultaneously. Every resolver that fetched the record in the last hour is still caching the old IP, with up to a full hour left on its local TTL clock. That TTL change is invisible to them; they already have the answer. The new 60-second TTL only applies to resolvers fetching &lt;em&gt;after&lt;/em&gt; the change. The fix is straightforward but requires discipline: lower your TTL to 60–300 seconds &lt;strong&gt;24–48 hours before&lt;/strong&gt; the cutover. Strictly, one full interval of the old TTL is enough for well-behaved resolvers to pick up the short TTL; the extra day or two is margin for resolvers that hold records longer than they should. Then do the cutover. Then restore the long TTL post-migration.&lt;/p&gt;
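
&lt;p&gt;The arithmetic behind the trap fits in one function. This is a back-of-the-envelope sketch that assumes every resolver honors TTLs exactly; the function name and parameters are my own, not part of any DNS tooling.&lt;/p&gt;

```python
def worst_case_stale_seconds(old_ttl, new_ttl, seconds_since_ttl_lowered):
    """Longest time any TTL-honoring resolver can keep serving the old address.

    A resolver that cached the record just before the TTL was lowered still
    holds it under the old TTL; one that fetched afterwards holds it under
    the new TTL. The worst case is whichever of the two expires later.
    """
    return max(new_ttl, old_ttl - seconds_since_ttl_lowered)


# TTL lowered and A record changed in the same edit: a full hour of stale answers.
print(worst_case_stale_seconds(3600, 60, seconds_since_ttl_lowered=0))       # 3600
# TTL lowered 30 minutes before the cutover: still half an hour.
print(worst_case_stale_seconds(3600, 60, seconds_since_ttl_lowered=1800))    # 1800
# TTL lowered a day ahead: the stale window shrinks to the new TTL.
print(worst_case_stale_seconds(3600, 60, seconds_since_ttl_lowered=86400))   # 60
```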

&lt;p&gt;&lt;strong&gt;Negative caching compounds this.&lt;/strong&gt; NXDOMAIN responses are also cached, for a duration set by your zone's SOA record: per RFC 2308, the lower of the SOA record's own TTL and its &lt;code&gt;minimum&lt;/code&gt; field. Delete a record, then immediately recreate it? Resolvers that caught the deletion can serve &lt;code&gt;NXDOMAIN&lt;/code&gt; for the full negative cache duration (often 30 minutes to an hour) regardless of the new record's existence. I've seen a developer delete and recreate a CNAME during debugging and spend 30 minutes convinced their zone was broken, when resolvers were simply serving a stale negative cache entry.&lt;/p&gt;
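
&lt;p&gt;How long an NXDOMAIN can linger is worth computing before you delete anything. RFC 2308 sets the negative-cache lifetime to the lower of the SOA record's own TTL and the SOA &lt;code&gt;minimum&lt;/code&gt; field; the optional &lt;code&gt;resolver_cap&lt;/code&gt; parameter below is an assumption standing in for resolvers that impose their own upper limit.&lt;/p&gt;

```python
def negative_cache_seconds(soa_record_ttl, soa_minimum, resolver_cap=None):
    """How long a resolver may keep answering NXDOMAIN for a missing name."""
    ttl = min(soa_record_ttl, soa_minimum)
    if resolver_cap is not None:     # hypothetical resolver-side upper limit
        ttl = min(ttl, resolver_cap)
    return ttl


print(negative_cache_seconds(soa_record_ttl=86400, soa_minimum=3600))                    # 3600
print(negative_cache_seconds(soa_record_ttl=900, soa_minimum=3600))                      # 900
print(negative_cache_seconds(soa_record_ttl=86400, soa_minimum=3600, resolver_cap=600))  # 600
```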

&lt;p&gt;&lt;strong&gt;The harder problem: resolver-side TTL clamping.&lt;/strong&gt; Even a perfectly timed cutover with correctly lowered TTLs can behave unexpectedly because some ISP resolvers impose a minimum cache floor, ignoring TTLs below 60–300 seconds and caching longer than specified. AWS Route 53 health-check failover with a &lt;code&gt;60s&lt;/code&gt; TTL should, on paper, reach clients within about one TTL interval of the health check failing, but in practice ISP clamping can extend client impact by several minutes. Fast-failover designs that depend on sub-minute TTLs need to account for this floor you don't control.&lt;/p&gt;

&lt;p&gt;Understanding the caching layer is prerequisite to reasoning about the infrastructure sitting on top of it — particularly the anycast and GeoDNS architectures that use TTLs as a traffic-steering lever.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anycast, GeoDNS, and the Infrastructure That Makes DNS Fast
&lt;/h2&gt;

&lt;p&gt;Here's something worth internalizing: when you query &lt;code&gt;1.1.1.1&lt;/code&gt;, you're not talking to a single server. You're talking to whichever of Cloudflare's 300+ points of presence BGP has decided is topologically closest to you at that moment. The IP is the same everywhere. The server handling your query is not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anycast routing&lt;/strong&gt; works by announcing the same IP prefix from many locations simultaneously, usually all under the operator's own autonomous system. BGP's path selection (preferring shorter AS paths and lower IGP cost) then routes each query to the topologically nearest PoP, which is usually, though not always, the geographically nearest one. No client-side configuration, no explicit load balancing tier, no DNS round-robin. The network fabric itself is the load balancer. Cloudflare's anycast deployment achieves a median global query latency under 14ms precisely because most queries never travel more than a few hundred miles. Failover is implicit: if a PoP goes dark, its prefix announcement is withdrawn and BGP reroutes traffic to the next-nearest site, typically within seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GeoDNS&lt;/strong&gt; operates at a higher layer. Rather than routing packets to the nearest infrastructure, it returns different &lt;em&gt;answers&lt;/em&gt; based on where the query originates. Same domain name, different IP pools depending on region. Netflix does this at scale: &lt;code&gt;open.netflix.com&lt;/code&gt; resolves to US edge clusters for US users and EU edge clusters for European users — not because the domain is different, but because the authoritative nameserver inspects the source and tailors the response. This enables both latency optimization and regulatory compliance (data residency requirements often mandate that European user traffic stays within EU-hosted infrastructure).&lt;/p&gt;

&lt;p&gt;CDNs have made GeoDNS central to their architecture. When you configure a CDN in front of your origin, the CDN's authoritative nameserver becomes responsible for resolving users to the nearest edge node. The CDN isn't just a cache — DNS is literally the first routing decision in the request path. Getting that decision wrong adds latency that no amount of edge caching can recover.&lt;/p&gt;

&lt;p&gt;The sharp edge here is the &lt;strong&gt;resolver IP vs. client IP problem&lt;/strong&gt;. An authoritative GeoDNS server doesn't see the end user's IP — it sees the recursive resolver's IP. A user in São Paulo hitting Google Public DNS (&lt;code&gt;8.8.8.8&lt;/code&gt;) might get routed to a US east coast cluster because Google's resolver appears to originate from a US data center. &lt;strong&gt;EDNS Client Subnet (ECS)&lt;/strong&gt; addresses this by embedding a truncated client subnet prefix (typically &lt;code&gt;/24&lt;/code&gt; for IPv4) in the query, giving the authoritative server enough geographic signal to route accurately. The trade-off is cache fragmentation: Google Public DNS now caches responses keyed on subnet, not just on query name, which meaningfully reduces resolver-side hit rates for popular CDN-backed domains.&lt;/p&gt;
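
&lt;p&gt;The truncation ECS performs is easy to see with Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module. This sketch only computes the prefix a resolver might attach; it does not build a DNS query. The &lt;code&gt;/56&lt;/code&gt; default for IPv6 and the sample addresses (taken from documentation ranges) are assumptions for illustration.&lt;/p&gt;

```python
import ipaddress


def ecs_prefix(client_ip, v4_bits=24, v6_bits=56):
    """Truncate a client address to the subnet an ECS-enabled resolver would send."""
    address = ipaddress.ip_address(client_ip)
    bits = v4_bits if address.version == 4 else v6_bits
    # strict=False zeroes the host bits instead of rejecting the address.
    return str(ipaddress.ip_network(f"{client_ip}/{bits}", strict=False))


print(ecs_prefix("203.0.113.77"))            # 203.0.113.0/24
print(ecs_prefix("2001:db8:1234:5678::1"))   # 2001:db8:1234:5600::/56
```

&lt;p&gt;Every client in the same &lt;code&gt;/24&lt;/code&gt; maps to the same prefix, which is what the resolver keys its cache on, and why one popular name can end up with many cached variants.&lt;/p&gt;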

&lt;p&gt;I've found ECS is often invisible until a GeoDNS misconfiguration produces inexplicably wrong-region responses — at which point understanding the resolver/client IP distinction becomes urgent.&lt;/p&gt;

&lt;p&gt;The caching and geographic routing decisions discussed here ripple directly into how DNS failures manifest in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Failure Modes and Operational Edge Cases
&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable truth: DNS failure modes are disproportionately severe relative to their apparent complexity. A single misconfigured record, an expired signature, or a lapsed domain registration can silently erase your entire service from the internet. Engineers who treat DNS as "solved infrastructure" get surprised by this repeatedly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DNS-based failover has a TTL floor.&lt;/strong&gt; If your A record has a 300-second TTL and your primary datacenter goes down at T+0, resolvers with cached responses will keep sending traffic there until their copies expire, which for a resolver that cached just before the outage means T+300. With resolver implementations that don't strictly honor TTL expiry, it's longer. Design your failover SLAs around this reality: with a 5-minute TTL, full failover takes at least 5 minutes in the worst case, plus however long it takes to detect the failure and change the record. I've seen teams set TTLs to 30 seconds pre-migration, then forget to restore them, and suddenly they're hammering authoritative servers with 10x normal query volume under load.&lt;/p&gt;
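
&lt;p&gt;A rough budget for DNS failover can be sketched as follows. The model is deliberately simple and the parameter names are mine: it adds the time to detect the failure and publish the new record to the longest a resolver may hold the old answer, where &lt;code&gt;resolver_floor&lt;/code&gt; stands in for resolvers that refuse to cache for less than some minimum.&lt;/p&gt;

```python
def worst_case_failover_seconds(record_ttl, detection_seconds, resolver_floor=0):
    """Upper bound on how long some clients keep receiving the dead address."""
    effective_ttl = max(record_ttl, resolver_floor)   # clamping resolvers stretch short TTLs
    return detection_seconds + effective_ttl


# 300s TTL, 30s to detect the outage and publish the new record.
print(worst_case_failover_seconds(300, detection_seconds=30))                     # 330
# 60s TTL, but some resolvers hold records for at least 300s.
print(worst_case_failover_seconds(60, detection_seconds=30, resolver_floor=300))  # 330
```

&lt;p&gt;The second call shows why cutting the TTL from 300 to 60 seconds may buy nothing for clients behind a clamping resolver.&lt;/p&gt;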

&lt;p&gt;&lt;strong&gt;Split-horizon DNS is operationally treacherous.&lt;/strong&gt; Serving different answers to internal vs. external clients — typically via separate views in BIND or Route 53 private zones — breaks silently when misconfigured. A service that resolves correctly from the corporate VPN might NXDOMAIN from a CI runner with a different resolver path. The failure doesn't announce itself; requests just route somewhere unexpected or fail entirely. I've found this most often surfaces when engineers rotate VPN infrastructure or add new subnets without updating the view ACLs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DNSSEC failures are catastrophic, not graceful.&lt;/strong&gt; When DNSSEC is configured correctly, it's invisible. When it breaks — expired RRSIG records, failed key rollovers, broken DS record chains — validating resolvers return hard SERVFAIL for the entire zone. The domain appears to vanish. A classic failure: a zone administrator activates a new Key Signing Key (KSK) but forgets to publish the updated DS record at the parent zone first. Validating resolvers immediately start failing the delegation chain. The fix requires parent zone cooperation and propagation time you don't have during an incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DNS amplification is a protocol-level attack surface.&lt;/strong&gt; Attackers spoof a victim's IP, send small queries (typically ANY or DNSKEY requests) to open resolvers, and those resolvers send large responses (up to a 50x amplification factor) to the spoofed address. Mitigation requires BCP 38 source-address filtering at network edges, so that spoofed packets never leave their origin network, and disabling open recursion on your authoritative infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Registrar-level failures sit above everything else in the stack.&lt;/strong&gt; The GoDaddy DNS outage in 2012 took down authoritative nameservers for millions of domains via a botched internal router update; your own nameserver reliability is irrelevant when the servers your NS delegation points at are unreachable. Worse, if your domain registration lapses, no amount of authoritative server redundancy helps. The June 2021 Fastly outage is instructive from the other direction: DNS itself worked fine, but CDN infrastructure dependent on it collapsed, a reminder that DNS sits at the base of every reliability assumption above it.&lt;/p&gt;

&lt;p&gt;Understanding these failure modes changes how you architect around DNS, particularly inside clusters, where DNS doubles as the service discovery layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  DNS in Modern Infrastructure: Service Discovery and Kubernetes
&lt;/h2&gt;

&lt;p&gt;Here's what makes DNS elegant as a service discovery mechanism: every language runtime already knows how to use it, no sidecar required.&lt;/p&gt;

&lt;p&gt;Kubernetes CoreDNS resolves names like &lt;code&gt;my-service.my-namespace.svc.cluster.local&lt;/code&gt; to the cluster-internal virtual IP of a Service. For headless services (&lt;code&gt;clusterIP: None&lt;/code&gt;), that behavior changes: DNS returns &lt;em&gt;all&lt;/em&gt; backing pod IPs directly, which is exactly what StatefulSets need so clients can address &lt;code&gt;postgres-0.postgres.default.svc.cluster.local&lt;/code&gt; as a stable identity across rescheduling.&lt;/p&gt;
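
&lt;p&gt;The naming scheme is regular enough to build by hand, which helps when you are debugging a lookup. The helper names below are my own, and &lt;code&gt;cluster.local&lt;/code&gt; is only the common default cluster domain; a given cluster may be configured differently.&lt;/p&gt;

```python
def service_fqdn(service, namespace="default", cluster_domain="cluster.local"):
    """DNS name of a Service: resolves to its virtual IP, or to pod IPs if headless."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def stateful_pod_fqdn(statefulset, ordinal, service, namespace="default",
                      cluster_domain="cluster.local"):
    """Stable per-pod name for a StatefulSet governed by a headless Service."""
    return f"{statefulset}-{ordinal}.{service_fqdn(service, namespace, cluster_domain)}"


print(service_fqdn("my-service", "my-namespace"))
# my-service.my-namespace.svc.cluster.local
print(stateful_pod_fqdn("postgres", 0, "postgres"))
# postgres-0.postgres.default.svc.cluster.local
```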

&lt;p&gt;The operational trap is TTL interaction with connection pooling. CoreDNS typically returns TTLs of 5–30 seconds, but HTTP/2 and gRPC clients hold persistent connections. When a pod restarts and gets a new IP, pooled connections targeting the stale IP continue routing to a dead endpoint until the connection errors out — the DNS TTL becomes irrelevant because the client never re-resolved.&lt;/p&gt;

&lt;p&gt;Consul compounds this in a different direction: services register with health checks running every 10s, and DNS responses reflect current health state — but a 30s TTL means an unhealthy endpoint stays cached across clients for up to 30s after its first failed check. I've seen this gap silently inflate error rates during deployments when engineers assume health-check-integrated DNS provides instant failover.&lt;/p&gt;

&lt;p&gt;The fundamental trade-off DNS-based discovery makes is freshness for ubiquity — understanding that trade-off is what separates its correct use from its misuse in latency-sensitive or high-churn environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways for Engineers Designing with DNS
&lt;/h2&gt;

&lt;p&gt;Pre-lower TTLs 24–48 hours before any migration. Once you flip the record, you've surrendered control to cached copies at the original TTL — there's no mechanism to invalidate them.&lt;/p&gt;

&lt;p&gt;DNS failover speed is bounded by TTL, full stop. If you need sub-minute recovery, layer in application-level health checks and connection retries. Shorter TTLs alone increase resolver load without closing the gap meaningfully.&lt;/p&gt;

&lt;p&gt;CNAME-at-apex is an RFC violation. Use ALIAS records (Route 53) or CNAME Flattening (Cloudflare) when pointing a root domain to a CDN or load balancer hostname.&lt;/p&gt;

&lt;p&gt;GeoDNS accuracy isn't guaranteed — ECS support varies across resolvers. Test from diverse vantage points before trusting your latency models.&lt;/p&gt;

&lt;p&gt;Treat DNS as infrastructure. Version-control zone files, manage records via Terraform or provider APIs, and audit regularly for dangling CNAMEs. A CNAME pointing to a deprovisioned S3 bucket or Heroku app is a live subdomain takeover vector — tools like &lt;code&gt;aquatone&lt;/code&gt; and &lt;code&gt;dnsrecon&lt;/code&gt; make automated auditing straightforward. Zone changes belong in PRs, not console sessions.&lt;/p&gt;

</description>
      <category>infrastructure</category>
      <category>networking</category>
      <category>sre</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How I Traced One Browser Request from Keystroke to Rendered Page</title>
      <dc:creator>SAI RAM</dc:creator>
      <pubDate>Sat, 20 Jun 2026 10:44:50 +0000</pubDate>
      <link>https://dev.to/sai_ram_0000/how-i-traced-one-browser-request-from-keystroke-to-rendered-page-1fpp</link>
      <guid>https://dev.to/sai_ram_0000/how-i-traced-one-browser-request-from-keystroke-to-rendered-page-1fpp</guid>
      <description>&lt;h2&gt;
  
  
  I Just Wanted to Know Why &lt;a href="http://www.google.com" rel="noopener noreferrer"&gt;www.google.com&lt;/a&gt; Loads So Fast
&lt;/h2&gt;

&lt;p&gt;I was sitting at my desk one evening, typed &lt;code&gt;www.google.com&lt;/code&gt;, and the page was fully loaded before I could finish thinking the thought. Under 200 milliseconds. I remember pausing and genuinely wondering — &lt;em&gt;how?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not "how" in a hand-wavy sense. Actually how. My fingers pressed keys on a keyboard. Somehow, a fully rendered Google homepage appeared on my screen, pulling content from servers that could be thousands of miles away, in less time than it takes to blink. What just happened in that gap?&lt;/p&gt;

&lt;p&gt;I started pulling on the thread. Turns out that ~200ms is not one thing — it's a stack of layers, each one solving a different problem, each one adding its own slice of latency. There's a layer that translates &lt;code&gt;www.google.com&lt;/code&gt; into a number your computer can actually route to. A layer that opens a reliable channel across a chaotic network. A layer that encrypts everything so nobody between you and Google can read it. And finally the layer that actually asks for the page and receives it back.&lt;/p&gt;

&lt;p&gt;What surprised me most wasn't the complexity — it was how logical it all is once you trace it step by step. Every layer exists because someone hit a wall and had to solve a specific problem. Understanding those problems makes the whole stack click into place in a way that no amount of memorising acronyms ever does.&lt;/p&gt;

&lt;p&gt;So let me walk you through exactly what happens, layer by layer, in the order it actually occurs — from the moment you press Enter to the moment the page appears.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1: DNS Resolution — Finding Google's Address Before Anything Else
&lt;/h2&gt;

&lt;p&gt;Before a single TCP packet leaves my machine, the browser needs to translate &lt;code&gt;www.google.com&lt;/code&gt; into an IP address. I used to think of this as a simple "phonebook lookup." It's not — and understanding why changed how I think about infrastructure migrations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cache hierarchy most developers underestimate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Resolution doesn't start with a DNS server. It starts with the browser's own in-memory cache, then falls through to the OS cache (after checking &lt;code&gt;/etc/hosts&lt;/code&gt;), and only then hits a recursive resolver — typically your ISP's or something like &lt;code&gt;8.8.8.8&lt;/code&gt;. The overwhelming majority of queries die right there at the recursive resolver's cache and never travel further. The full recursive walk is the exception, not the rule.&lt;/p&gt;

&lt;p&gt;When it &lt;em&gt;is&lt;/em&gt; a full cache miss, here's what actually happens for &lt;code&gt;www.google.com&lt;/code&gt;: the recursive resolver asks a root server, which responds with a referral to Verisign's &lt;code&gt;.com&lt;/code&gt; TLD nameservers. The resolver then queries one of those and gets referred to Google's nameservers, such as &lt;code&gt;ns1.google.com&lt;/code&gt;. Finally, it asks that authoritative nameserver and receives an address like &lt;code&gt;142.250.183.100&lt;/code&gt; (the exact answer varies by location and time). Counting my own query to the resolver, that's four round trips, three of them on the resolver's side. From my client's perspective it looks like one, because the recursive resolver does all the legwork. That's the design: offload the heavy lifting to infrastructure that can cache aggressively at scale.&lt;/p&gt;
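
&lt;p&gt;As a rough sketch of that referral walk, here is a toy model in Python. The zone tables, the &lt;code&gt;resolve&lt;/code&gt; function, and the server labels are all made up for illustration; a real resolver speaks the DNS wire protocol and honours TTLs, which this sketch ignores.&lt;/p&gt;

```python
# Toy model of iterative resolution. All zone data below is hypothetical;
# a real resolver speaks the DNS wire protocol and expires entries by TTL.
REFERRALS = {
    "root": {"com.": "tld-com"},                    # root refers .com to the TLD servers
    "tld-com": {"google.com.": "ns1.google.com"},   # TLD refers google.com to its nameserver
}
ANSWERS = {
    "ns1.google.com": {"www.google.com.": "142.250.183.100"},
}

cache = {}


def resolve(name):
    """Return (address, upstream_queries) for name, caching the answer."""
    if name in cache:
        return cache[name], 0          # cache hit: no upstream traffic at all
    server = "root"
    queries = 0
    while True:
        queries += 1                   # one round trip to the current server
        if server in ANSWERS and name in ANSWERS[server]:
            cache[name] = ANSWERS[server][name]
            return cache[name], queries
        # Follow the referral for the longest zone that encloses the name.
        zones = REFERRALS.get(server, {})
        matches = [z for z in zones if name == z or name.endswith("." + z)]
        if not matches:
            raise LookupError("no referral for " + name)
        server = zones[max(matches, key=len)]


print(resolve("www.google.com."))   # ('142.250.183.100', 3)
print(resolve("www.google.com."))   # ('142.250.183.100', 0)
```

&lt;p&gt;The first call costs three upstream queries (root, TLD, authoritative); the second costs none, which is the whole point of caching at the resolver.&lt;/p&gt;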

&lt;p&gt;&lt;strong&gt;The "13 root servers" thing is a misconception&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are 13 &lt;em&gt;logical&lt;/em&gt; root server names (&lt;code&gt;a.root-servers.net&lt;/code&gt; through &lt;code&gt;m.root-servers.net&lt;/code&gt;), but they're backed by over 1,600 physical instances distributed globally via anycast. The number 13 isn't a scalability ceiling; it's an artifact of fitting all root server addresses into a single 512-byte UDP packet, the original DNS message size limit. Anycast routing means your query lands on whichever instance is nearest in routing terms, which is usually (though not always) the geographically closest one, not some single overloaded machine in a basement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TTL is a dial, not a setting you configure once&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TTL is DNS's cache invalidation mechanism, and it's the most operationally interesting part. Set it too high (say, &lt;code&gt;86400&lt;/code&gt; seconds, a full day) and a botched server migration can leave users hitting a dead IP for up to 24 hours. Set it too low and resolvers have to re-query far more often, adding latency on every cache miss. I've been bitten by both ends of this.&lt;/p&gt;

&lt;p&gt;The pattern I've found most useful in practice: before a planned migration, drop TTL to &lt;code&gt;60&lt;/code&gt; seconds roughly 48 hours ahead — enough time for the old high TTL to expire everywhere. Execute the migration. Then raise TTL back to &lt;code&gt;3600&lt;/code&gt; once the new records are confirmed stable. TTL becomes a dial you tune based on how much agility you need versus how much cache efficiency you want.&lt;/p&gt;
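
&lt;p&gt;The timing of that pattern is easy to get wrong by a few hours, so here is a minimal sketch of the arithmetic. The function name, its parameters, and the example dates are assumptions for illustration, not part of any DNS provider's API.&lt;/p&gt;

```python
from datetime import datetime, timedelta


def migration_plan(cutover, old_ttl_s, low_ttl_s=60, lead=timedelta(hours=48)):
    """Return the key timestamps for the lower-TTL, migrate, raise-TTL pattern."""
    lower_at = cutover - lead
    # Caches filled just before the TTL was lowered still hold the old TTL.
    old_ttl_gone = lower_at + timedelta(seconds=old_ttl_s)
    if old_ttl_gone > cutover:
        raise ValueError("lead time is shorter than the old TTL")
    # After cutover, a well-behaved cache keeps the old address for at most low_ttl_s.
    stale_until = cutover + timedelta(seconds=low_ttl_s)
    return {
        "lower_ttl_at": lower_at,
        "old_ttl_expired_by": old_ttl_gone,
        "stale_until": stale_until,
    }


plan = migration_plan(datetime(2026, 6, 20, 12, 0), old_ttl_s=86400)
for step, when in plan.items():
    print(step, when)
```

&lt;p&gt;With an old TTL of one day and a 48-hour lead, every well-behaved cache has picked up the 60-second TTL a full day before cutover, and the stale window after the switch is about a minute.&lt;/p&gt;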

&lt;p&gt;DNS is fundamentally a &lt;em&gt;decoupling layer&lt;/em&gt;: it separates stable, human-readable names from volatile infrastructure IPs. That's exactly why a CDN can route the same hostname to a server in Frankfurt for me and a server in Singapore for someone else — all without touching the client.&lt;/p&gt;

&lt;p&gt;That geographic routing trick depends entirely on what happens &lt;em&gt;after&lt;/em&gt; DNS hands back an address. Which is where TCP enters the picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 2: TCP Handshake — One Round Trip Before a Single Byte of Real Data
&lt;/h2&gt;

&lt;p&gt;With an IP address in hand, my browser immediately tries to open a TCP connection — and this is where I first started internalizing &lt;em&gt;latency as a physical constraint&lt;/em&gt;, not just a number in a monitoring dashboard.&lt;/p&gt;

&lt;p&gt;The three-way handshake is elegantly simple and, for a fresh connection, effectively unavoidable: client sends SYN, server replies SYN-ACK, client confirms with ACK. The browser can send its first HTTP request right behind that ACK, but not before the SYN-ACK has come back. The server can't skip straight to receiving data because both sides first need to prove bidirectional reachability and agree on initial sequence numbers: TCP's entire reliability model depends on both sides confirming they can both send &lt;em&gt;and&lt;/em&gt; receive before the connection is considered open.&lt;/p&gt;

&lt;p&gt;That handshake costs exactly one RTT. And RTT is just geography wearing a disguise.&lt;/p&gt;

&lt;p&gt;I made this concrete by running a quick measurement from different VPS locations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-w&lt;/span&gt; &lt;span class="s1"&gt;'%{time_connect}\n'&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /dev/null &lt;span class="nt"&gt;-s&lt;/span&gt; https://www.google.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



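&lt;p&gt;From a Frankfurt server: ~15ms. From a Mumbai server hitting a London origin: ~150ms. (curl's &lt;code&gt;time_connect&lt;/code&gt; is measured from the start of the request, so it also includes the DNS lookup; subtract &lt;code&gt;time_namelookup&lt;/code&gt; to isolate the handshake.) That 150ms is gone before a single byte of application data moves. This is precisely why CDN edge nodes exist: not just to cache content, but to physically shorten the handshake path. When you're in Mumbai hitting a CDN PoP that's also in Mumbai, that 150ms collapses to ~5ms.&lt;/p&gt;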

&lt;p&gt;But TCP's costs don't stop at connection setup. The protocol also guarantees ordered delivery, retransmission of lost packets, and congestion control — and those guarantees create a subtle trap called &lt;strong&gt;head-of-line blocking&lt;/strong&gt;. In HTTP/1.1 over TCP, if packet #4 in a sequence is dropped, packets #5 through #50 sit waiting in the receive buffer even if they arrived intact. Every request on the connection stalls. HTTP/2 multiplexing helped at the application layer, but the underlying TCP stream still blocks. That single frustration is essentially the design motivation behind HTTP/3: by moving to QUIC over UDP, each stream becomes independently reliable, so one lost packet no longer freezes the world.&lt;/p&gt;
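
&lt;p&gt;The in-order delivery rule behind head-of-line blocking fits in a few lines. This is a simplified model (whole packets with small integer sequence numbers, and a hypothetical &lt;code&gt;deliverable&lt;/code&gt; helper), not how a real TCP stack tracks byte ranges.&lt;/p&gt;

```python
def deliverable(arrived, next_expected=1):
    """Return the packets TCP can hand to the application, in order.

    `arrived` is the set of sequence numbers sitting in the receive buffer.
    Delivery stops at the first gap, however much data arrived after it.
    """
    delivered = []
    while next_expected in arrived:
        delivered.append(next_expected)
        next_expected += 1
    return delivered


# Packets 1 to 50 were sent; only packet 4 was lost in transit.
arrived = set(range(1, 51)) - {4}
print(deliverable(arrived))                 # [1, 2, 3]
print(len(deliverable(arrived | {4})))      # 50, once the retransmission lands
```

&lt;p&gt;Forty-six packets are sitting in the buffer, yet the application sees only three until the retransmission of packet 4 arrives.&lt;/p&gt;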

&lt;p&gt;The handshake is just the beginning of TCP's hidden tax. Once TLS enters the picture, the bill gets larger.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 3: TLS Handshake — Encryption Isn't Free, But TLS 1.3 Made It Cheaper
&lt;/h2&gt;

&lt;p&gt;With the TCP connection established, the browser immediately kicks off a TLS handshake — and this is where I spent the most time squinting at Wireshark traces trying to understand &lt;em&gt;why&lt;/em&gt; things worked the way they did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TLS 1.2 cost you two round trips&lt;/strong&gt; before a single byte of encrypted application data could flow. The client said hello, the server replied with its certificate and cipher preferences, the client responded with key material, and only then did encryption begin. At 50ms RTT — not unusual for cross-continental traffic — that's 100ms of pure ceremony before the browser can even ask for the HTML.&lt;/p&gt;

&lt;p&gt;TLS 1.3 collapsed this to one round trip by making the client &lt;em&gt;guess&lt;/em&gt; upfront. The Client Hello now includes a &lt;code&gt;key_share&lt;/code&gt; extension — the client assumes the server will negotiate X25519 (the most common elliptic-curve Diffie-Hellman group) and proactively sends its half of the key exchange alongside the hello. If the guess is right, the server can respond with its own key share, its certificate, and a Finished message all in one flight. Encryption starts immediately after.&lt;/p&gt;

&lt;p&gt;If the guess is wrong — say the server only supports P-256 — you get a &lt;code&gt;HelloRetryRequest&lt;/code&gt; and you're back to two round trips. This is why server operators advertise their supported groups clearly and why X25519 became the de facto default.&lt;/p&gt;
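
&lt;p&gt;The guess-and-retry logic can be sketched as a small decision function. This is a deliberately simplified model: the function name is made up, and it assumes the server picks the first group in its own preference list that the client also supports, which real servers are free to do differently.&lt;/p&gt;

```python
def tls13_round_trips(client_key_shares, client_supported, server_preference):
    """Round trips before application data in a simplified TLS 1.3 handshake.

    Assumption for this sketch: the server picks the first group in its own
    preference list that the client also supports.
    """
    for group in server_preference:
        if group in client_supported:
            # A key share for the chosen group was already in the Client Hello.
            if group in client_key_shares:
                return 1
            # Otherwise the server sends HelloRetryRequest and the client retries.
            return 2
    raise ValueError("no mutually supported group; the handshake fails")


client_supported = ["x25519", "secp256r1"]
print(tls13_round_trips(["x25519"], client_supported, ["x25519", "secp256r1"]))  # 1
print(tls13_round_trips(["x25519"], client_supported, ["secp256r1"]))            # 2
```

&lt;p&gt;The second call is the P-256-only server from the paragraph above: the client supports the group but did not send a share for it, so the handshake pays for an extra round trip.&lt;/p&gt;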

&lt;p&gt;The certificate itself does double duty. It carries the server's public key, which in TLS 1.3 is used to verify the server's signature over the handshake rather than to exchange keys (that is the job of the ephemeral key shares), and it proves identity by chaining up to a root CA your OS already trusts. In Chrome DevTools' Security tab, you can trace this chain concretely: &lt;code&gt;*.google.com&lt;/code&gt; is signed by a Google Trust Services intermediate, which is signed by a root CA pre-embedded in your trust store. Break any link (expired cert, mismatched hostname, untrusted root) and the browser hard-stops. On a site that enforces HSTS, there's no "just this once" click-through for TLS failures.&lt;/p&gt;

&lt;p&gt;One thing I found genuinely surprising: after TLS completes, your ISP can still see that you connected to &lt;code&gt;142.250.183.100&lt;/code&gt;. The IP header is plaintext, and the &lt;code&gt;server_name&lt;/code&gt; extension in the Client Hello — SNI — announces &lt;code&gt;www.google.com&lt;/code&gt; before encryption begins. The &lt;em&gt;content&lt;/em&gt; of your request is hidden; the &lt;em&gt;destination&lt;/em&gt; is not.&lt;/p&gt;

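&lt;p&gt;For returning visitors, &lt;strong&gt;session tickets&lt;/strong&gt; let the client skip most of the handshake. The server issues an encrypted ticket once the handshake completes; on the next connection the client presents it, and the two sides resume without repeating the certificate exchange. With TLS 1.3's optional 0-RTT mode the client can even send encrypted data in its first flight. That is a meaningful win for repeat pageloads, though early data can be replayed, so it is only safe for idempotent requests.&lt;/p&gt;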

&lt;p&gt;With the secure channel finally open, the browser has one more thing left to do before Google's servers can respond with HTML.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 4: HTTP Request and Response — Finally Asking for the Page
&lt;/h2&gt;

&lt;p&gt;With TLS established, the browser finally sends what we've been building toward: an HTTP GET request for &lt;code&gt;/&lt;/code&gt;. But even here, the protocol choices matter more than I initially appreciated.&lt;/p&gt;

&lt;p&gt;Modern browsers negotiate HTTP/2 during the TLS handshake (via ALPN). That single detail eliminates a hack that defined HTTP/1.1 performance for years: browsers opening up to six parallel TCP connections per origin just to fetch multiple resources simultaneously. HTTP/2 multiplexes all requests over &lt;em&gt;one&lt;/em&gt; connection as independent streams. No request queueing at the HTTP layer, no connection overhead per asset.&lt;/p&gt;

&lt;p&gt;The efficiency compounds with &lt;strong&gt;HPACK header compression&lt;/strong&gt;. On a page firing 80+ sub-requests, headers like &lt;code&gt;User-Agent&lt;/code&gt;, &lt;code&gt;Accept-Language&lt;/code&gt;, and &lt;code&gt;Cookie&lt;/code&gt; repeat identically. HPACK encodes them as small integer indices against tables that both ends of the connection keep in sync. What was 800 bytes of repeated header data becomes a handful of integers, which is genuinely measurable when it's multiplied across every request.&lt;/p&gt;
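
&lt;p&gt;To show the indexing idea in isolation, here is a toy table in Python. It is not the HPACK wire format: real HPACK adds a predefined static table, size-bounded eviction, Huffman coding, and its own index numbering. The &lt;code&gt;HeaderTable&lt;/code&gt; class and the header values are invented for this sketch.&lt;/p&gt;

```python
class HeaderTable:
    """Toy shared table: each side of the connection keeps its own copy."""

    def __init__(self):
        self.entries = []

    def encode(self, headers):
        out = []
        for pair in headers:
            if pair in self.entries:
                out.append(self.entries.index(pair))   # repeat: send a small integer
            else:
                self.entries.append(pair)
                out.append(pair)                       # first sighting: send it in full
        return out

    def decode(self, encoded):
        headers = []
        for item in encoded:
            if isinstance(item, int):
                headers.append(self.entries[item])     # look the index up locally
            else:
                self.entries.append(item)              # mirror the sender's table
                headers.append(item)
        return headers


request = [("user-agent", "demo/1.0"), ("accept-language", "en")]
sender, receiver = HeaderTable(), HeaderTable()
first, second = sender.encode(request), sender.encode(request)
print(second)                                # [0, 1]
print(receiver.decode(first) == request)     # True
print(receiver.decode(second) == request)    # True
```

&lt;p&gt;The first request travels in full; every identical request after it shrinks to two integers, because both tables evolve in lockstep.&lt;/p&gt;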

&lt;p&gt;The Chrome DevTools Network tab makes this concrete. Filter by type, enable the connection column, and watch the waterfall. HTML arrives first, then CSS and JS appear as overlapping bars on the &lt;em&gt;same&lt;/em&gt; connection row — that's the multiplexing made visible. Images follow in parallel streams. It looks nothing like HTTP/1.1's staggered, connection-per-resource pattern.&lt;/p&gt;

&lt;p&gt;The response headers are equally deliberate. Static assets like &lt;code&gt;main.a3f92c.js&lt;/code&gt; (a content-addressed filename with a hash baked in) arrive with &lt;code&gt;Cache-Control: max-age=31536000, immutable&lt;/code&gt;. The browser won't touch the network for that file for up to a year. The hash changing is the invalidation mechanism. HTML, by contrast, typically gets &lt;code&gt;no-cache&lt;/code&gt; (revalidate before every reuse) or a short &lt;code&gt;max-age&lt;/code&gt;, ensuring the latest asset URLs always reach the client.&lt;/p&gt;
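
&lt;p&gt;A minimal sketch of how a cache might read those directives follows. Both helper functions are hypothetical, and the freshness check is heavily simplified: it ignores heuristic freshness, &lt;code&gt;s-maxage&lt;/code&gt;, &lt;code&gt;Expires&lt;/code&gt;, and revalidation.&lt;/p&gt;

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        name, _, arg = part.strip().partition("=")
        if name:
            directives[name.lower()] = arg.strip('"') or None
    return directives


def can_reuse_without_network(value, age_seconds):
    """True if a cached copy of this age may be used without contacting the server."""
    d = parse_cache_control(value)
    if "no-store" in d or "no-cache" in d:
        return False
    max_age = int(d.get("max-age") or 0)   # simplification: no max-age means not fresh
    return max_age > age_seconds


one_day = 24 * 60 * 60
print(can_reuse_without_network("max-age=31536000, immutable", one_day))   # True
print(can_reuse_without_network("no-cache", 0))                            # False
```

&lt;p&gt;The hashed asset is served straight from cache a day later, while a response marked &lt;code&gt;no-cache&lt;/code&gt; always goes back to the server first.&lt;/p&gt;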

&lt;p&gt;The &lt;code&gt;Accept-Encoding: br, gzip&lt;/code&gt; request header is the browser advertising Brotli support; the server then chooses &lt;code&gt;br&lt;/code&gt;, which typically compresses text assets 15–25% better than gzip. That saving means fewer bytes on the wire before the rendering engine has anything to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Taught Me: Every Layer Is a Trade-Off Frozen in Time
&lt;/h2&gt;

&lt;p&gt;Running through this exercise, the thing that hit me hardest was the latency math. Add it up for a user 150ms away: DNS lookup (20–120ms on a cold cache), TCP handshake (150ms), TLS 1.3 handshake (150ms), HTTP request/response (150ms minimum). You're sitting at &lt;strong&gt;roughly 470–570ms before the first byte of HTML even arrives&lt;/strong&gt;, and that's before parsing, subresource fetching, or rendering a single pixel. On a fast connection. Every millisecond in that budget has a name and an owner.&lt;/p&gt;
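
&lt;p&gt;The budget is simple enough to write down as a function. This is a back-of-the-envelope sketch with made-up names; it assumes each step is strictly sequential, costs a whole number of round trips, and that the server takes no time to think.&lt;/p&gt;

```python
def time_to_first_byte_ms(rtt_ms, dns_ms, tls_round_trips=1):
    """Sum the sequential costs before the first byte of HTML arrives."""
    tcp_ms = rtt_ms                       # SYN / SYN-ACK: one round trip
    tls_ms = tls_round_trips * rtt_ms     # 1 for TLS 1.3, 2 for TLS 1.2
    http_ms = rtt_ms                      # request out, first byte back
    return dns_ms + tcp_ms + tls_ms + http_ms


for dns_ms in (20, 120):
    print(dns_ms, time_to_first_byte_ms(rtt_ms=150, dns_ms=dns_ms))
# 20 470
# 120 570
```

&lt;p&gt;With the inputs above, the total works out to 470ms on a fast DNS lookup and 570ms on a slow one. Passing &lt;code&gt;tls_round_trips=2&lt;/code&gt; models TLS 1.2 and adds another full round trip on top.&lt;/p&gt;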

&lt;p&gt;What reframed my thinking was realizing each layer is a solution to a real problem that existed at a specific moment in history. TCP solved packet loss on unreliable ARPANET-era networks. TLS solved plaintext eavesdropping as the web went commercial. HTTP/2 solved HTTP/1.1's serial request problem. Now HTTP/3 over QUIC is solving what TCP itself got wrong — specifically, head-of-line blocking. A single dropped packet in TCP stalls every stream on the connection. QUIC handles packet loss per-stream in user space, so one lost packet on your stylesheet doesn't freeze your JavaScript download. This matters most on lossy mobile networks where packet loss is routine, not exceptional.&lt;/p&gt;

&lt;p&gt;Once you see the stack as a latency budget, performance optimization becomes a targeting problem. CDN edges attack RTT directly by moving the server closer. Preconnect hints attack the handshake cost — &lt;code&gt;&amp;lt;link rel="preconnect" href="https://fonts.googleapis.com"&amp;gt;&lt;/code&gt; triggers DNS + TCP + TLS for a third-party origin &lt;em&gt;before&lt;/em&gt; the browser even parses the stylesheet that requests it, turning sequential handshakes into parallel ones. Caching attacks repeat-visit cost entirely.&lt;/p&gt;

&lt;p&gt;Every optimization targets specific layers of that budget. And knowing which layer you're in tells you what tools you actually have.&lt;/p&gt;

</description>
      <category>networking</category>
      <category>dns</category>
      <category>tls</category>
      <category>http</category>
    </item>
  </channel>
</rss>
