DNS: The SRE's Most Underrated Skill

#sre #devops #dns #networking

I've seen more outages caused by DNS than by code. And it's always the same story: the team shipped, something broke, and three hours into debugging someone said, 'wait, is it DNS?'

It's always DNS.

Why DNS bites SREs specifically

DNS is invisible until it breaks. It's cached at every layer (OS, resolver, app, CDN). TTLs are rarely what you expect. And it's usually owned by 'the networking team', which is actually just one guy who left the company in 2022.

The debugging mindset

When something weird happens, especially 'works from my laptop, broken in prod,' check DNS before you check code.

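Run dig +short HOSTNAME from the affected host, not from your laptop.
Check the TTL: dig HOSTNAME shows it in the second column of the answer section. A caching resolver reports the time remaining, so the number counts down between queries. Short TTL (60s)? Probably fine. Long TTL (86400)? You have a problem during rollout.
Is the resolver returning stale records? Try dig @8.8.8.8 HOSTNAME to bypass the local cache. A public resolver caches too, and it can't see internal-only records, so for the source of truth run dig +trace HOSTNAME or query the zone's authoritative nameserver directly.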
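
If you run these checks often, they are easy to script. Here is a minimal sketch, assuming dig is installed on the affected host. The function names (parse_dig_answer, dig_answer, diff_addresses) are made up for illustration. It parses the output of dig +noall +answer and compares the addresses returned by two resolvers.

```python
import subprocess


def parse_dig_answer(text):
    """Parse `dig +noall +answer` output into (name, ttl, rtype, value) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and dig comments
        # Answer lines look like: NAME TTL CLASS TYPE RDATA
        parts = line.split(None, 4)
        if len(parts) != 5 or not parts[1].isdigit():
            continue
        name, ttl, _cls, rtype, value = parts
        records.append((name, int(ttl), rtype, value))
    return records


def dig_answer(hostname, resolver=None):
    """Run dig against the default resolver, or against `resolver` if given."""
    cmd = ["dig", "+noall", "+answer", hostname]
    if resolver:
        cmd.insert(1, "@" + resolver)
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=10
    )
    return parse_dig_answer(result.stdout)


def diff_addresses(local, other):
    """Return (addresses only in local, addresses only in other)."""
    local_ips = {v for _, _, t, v in local if t in ("A", "AAAA")}
    other_ips = {v for _, _, t, v in other if t in ("A", "AAAA")}
    return local_ips - other_ips, other_ips - local_ips
```

Calling diff_addresses(dig_answer(host), dig_answer(host, "8.8.8.8")) from the affected host gives two sets. If either one is non-empty, the two resolvers disagree, which usually points at a stale cache or a split-horizon setup.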

The 3 DNS setups I've seen break

1. Split-horizon DNS with cached results. Internal resolver returns one IP, external returns another. Your service caches the wrong one. Mysterious connection failures ensue.

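2. Short TTL during migration, long TTL in resolver. You set the TTL to 60s for a cutover. But any resolver that cached the record while its TTL was still 86400 keeps serving that copy until it expires, and it doesn't learn about the new TTL until it re-queries. Unless you lowered the TTL at least one old-TTL period (here, a day) before the cutover, your cutover doesn't propagate for a day.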

3. DNS-based health checks with slow propagation. You remove a bad host from DNS. Clients keep hitting it because of cache. The outage continues for up to the length of the TTL, and longer for any client that caches lookups past it.
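
Setups 2 and 3 come down to the same cache arithmetic, which is easier to see in a toy model. The sketch below is a simplified simulation, not how any particular resolver is implemented. TtlCache, the "app" name, the addresses, and the integer clock are all made up for illustration.

```python
class TtlCache:
    """Toy caching resolver: keeps an answer until the TTL it was fetched with expires."""

    def __init__(self):
        self._entries = {}  # name -> (value, expires_at)

    def resolve(self, name, now, authoritative):
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # still cached: the authoritative data is not consulted
        value, ttl = authoritative[name]
        self._entries[name] = (value, now + ttl)
        return value


# The record starts with a one-day TTL and gets cached at t=0.
authoritative = {"app": ("10.0.0.1", 86400)}
cache = TtlCache()
first = cache.resolve("app", now=0, authoritative=authoritative)

# At t=100 the operator lowers the TTL to 60s and changes the address.
authoritative["app"] = ("10.0.0.2", 60)

# The cache never sees the new TTL; it keeps the old answer for the full day.
during = cache.resolve("app", now=200, authoritative=authoritative)
after = cache.resolve("app", now=86400, authoritative=authoritative)
```

Here first and during are both the old address, and only after picks up the new one. Lowering the TTL helps only for caches that fetched the record after the TTL was lowered.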