I've seen more outages caused by DNS than by code. And it's always the same story: the team shipped, something broke, and three hours into debugging someone said, 'wait, is it DNS?'
It's always DNS.
Why DNS bites SREs specifically
DNS is invisible until it breaks. It's cached at every layer (OS, resolver, app, CDN). TTLs are rarely what you expect. And it's usually owned by 'the networking team', which is actually just one guy who left the company in 2022.
The debugging mindset
When something weird happens, especially 'works from my laptop, broken in prod,' check DNS before you check code.
- Run dig +short on the hostname from the affected host.
- Check the TTL: dig HOSTNAME. Short TTL (60s)? Probably fine. Long TTL (86400)? You have a problem during rollout.
- Is the resolver returning stale records? Try dig @8.8.8.8 HOSTNAME to bypass local cache.
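If you run these checks often, they can be scripted. Here is a minimal Python sketch, assuming dig is installed on the affected host. The function names, the 3600-second "long TTL" threshold, and the example.com sample data in the usage note are my own illustrative choices, not part of any tool. Keep in mind that a caching resolver reports the remaining TTL on its cached copy, not the TTL set on the record.

```python
import subprocess

# Hypothetical threshold: anything at or above this is flagged as "long".
LONG_TTL_SECONDS = 3600


def parse_answers(dig_output):
    """Parse `dig +noall +answer` text into (name, ttl, type, value) tuples."""
    records = []
    for line in dig_output.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blanks and dig comment lines
        fields = line.split()
        # Answer lines look like: NAME TTL CLASS TYPE RDATA...
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        name, ttl, _cls, rtype = fields[:4]
        records.append((name, int(ttl), rtype, " ".join(fields[4:])))
    return records


def query(hostname, resolver=None):
    """Ask the default resolver, or a specific one such as 8.8.8.8."""
    cmd = ["dig", "+noall", "+answer", hostname]
    if resolver:
        cmd.insert(1, "@" + resolver)
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=10
    )
    return parse_answers(result.stdout)


def compare(local, public):
    """Flag long TTLs in the local answer and A-record disagreement."""
    findings = []
    for name, ttl, rtype, value in local:
        if ttl >= LONG_TTL_SECONDS:
            findings.append(f"long TTL {ttl}s on {name} {rtype} {value}")
    local_ips = {v for _, _, t, v in local if t == "A"}
    public_ips = {v for _, _, t, v in public if t == "A"}
    if local_ips != public_ips:
        findings.append(
            f"resolvers disagree: local={sorted(local_ips)} "
            f"public={sorted(public_ips)}"
        )
    return findings
```

From the affected host, compare(query("example.com"), query("example.com", "8.8.8.8")) returns a list of findings. A disagreement is not automatically a bug (split-horizon and geo-DNS setups disagree on purpose), but it tells you where to look first.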
The 3 DNS setups I've seen break
1. Split-horizon DNS with cached results. Internal resolver returns one IP, external returns another. Your service caches the wrong one. Mysterious connection failures ensue.
2. Short TTL during migration, long TTL in resolver. You set the TTL to 60s for a cutover. Your downstream service's resolver cached the record earlier, while its TTL was still 86400, and it keeps serving that copy until it expires. Your cutover doesn't propagate for up to a day.
3. DNS-based health checks with slow propagation. You remove a bad host from DNS. Clients keep hitting it because of cache. The outage continues for up to the length of the TTL.
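The second and third failures come down to the same mechanism: a cache holds an answer until the TTL it was cached with runs out, no matter what you change upstream in the meantime. Here is a toy Python simulation of that, under simplifying assumptions. The CachingResolver class, the api.example.com name, and the 192.0.2.x addresses are all made up for illustration, and real resolvers add their own twists such as TTL caps and floors.

```python
class CachingResolver:
    """Toy cache: keeps an answer until the TTL it was cached with expires."""

    def __init__(self, authoritative):
        self.authoritative = authoritative  # name -> (ip, ttl_seconds)
        self.cache = {}                     # name -> (ip, expires_at)

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached and now < cached[1]:
            return cached[0]                # still fresh: upstream is not asked
        ip, ttl = self.authoritative[name]
        self.cache[name] = (ip, now + ttl)
        return ip


# Hypothetical zone: one record with a one-day TTL.
zone = {"api.example.com": ("192.0.2.10", 86400)}
resolver = CachingResolver(zone)

# t=0: the resolver caches the old IP, valid until t=86400.
resolver.resolve("api.example.com", now=0)

# t=3600: cutover. New IP, and the TTL is lowered to 60s at the same moment.
zone["api.example.com"] = ("192.0.2.20", 60)

at_cutover = resolver.resolve("api.example.com", now=3600)      # old IP
just_before = resolver.resolve("api.example.com", now=86399)    # still old IP
after_expiry = resolver.resolve("api.example.com", now=86400)   # new IP at last
```

Lowering the TTL at cutover time changes nothing for this resolver, because it never asks upstream again until the old copy expires. The 60s TTL only starts to matter after that.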
The rule
Lower your TTL before you need to. Not during the outage. A long TTL on production records is a loaded gun.
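"Before you need to" has a number attached: the old TTL. A minimal Python sketch of the arithmetic, assuming caches honour TTLs; the function name and the margin parameter are hypothetical and only here for illustration.

```python
from datetime import datetime, timedelta


def latest_ttl_lowering_time(cutover, old_ttl_seconds, margin_seconds=0):
    """Latest moment to publish the lowered TTL before a planned cutover.

    A cache that fetched the record just before the change can hold it for
    the full old TTL, so the lowered TTL must be live at least that long
    before the cutover. margin_seconds is optional extra slack.
    """
    return cutover - timedelta(seconds=old_ttl_seconds + margin_seconds)


# Example: cutover planned for noon on 2 January, record TTL is one day.
cutover = datetime(2024, 1, 2, 12, 0)
deadline = latest_ttl_lowering_time(cutover, old_ttl_seconds=86400)
print(deadline.isoformat())  # 2024-01-01T12:00:00
```

With a one-day TTL, the deadline falls a full day before the cutover, which is why this has to be a planning step and not something you do during the incident.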
DNS deserves respect. Learn it. Love it. Debug it first.
Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com