NTCTech

Originally published at rack2cloud.com

Why Your DNS Failover Didn't Actually Fail Over

Field Notes — Engineering Notes from the Complexity Gap | Rack2Cloud
The failover was declared at 02:14. The runbook was followed. DNS records updated. Health checks passing on secondary. The on-call engineer closed the incident bridge call at 02:31 with a single line in the ticket: failover complete. At 02:32, a monitoring alert fired. Traffic was still hitting the dead primary.

The DNS record had changed in seconds. The traffic moved 18 minutes later. Only one of those numbers mattered.

This is DNS failover testing failure in its most common form — not a misconfiguration, not a vendor bug, not a missed step. Every layer in the stack behaved exactly as designed. The system still failed operationally.

Figure: the Declaration Gap — four caching layers between record change and traffic movement.

What the Runbook Said Was Done

The runbook covered the right things. TTL had been pre-reduced to 60 seconds during the maintenance window two weeks prior. The health check interval on the secondary was 30 seconds. The DNS record update propagated to the authoritative nameservers within 90 seconds of execution. By every documented metric, the failover was complete.

The team was not wrong to close the bridge. They were wrong about what "complete" meant.

DNS failover is treated as a discrete event — you change the record, propagation happens, traffic moves. The mental model is a switch: off, then on. The operational reality is a drainage problem. Traffic does not move when the record changes. Traffic moves when every active path that was routing to the old record has expired its cached state and re-resolved. Those are different events, separated by an amount of time that no runbook entry captures.

The Declaration Gap is the period between when the failover is declared complete and when traffic has actually moved. In this case, it was 18 minutes. In environments with more caching layers, it can be significantly longer.
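The drainage model can be sketched as a toy calculation. The layer names and TTL values below are illustrative assumptions, not figures from the incident; the point is that the fraction of traffic still reaching the old primary decays in steps, one per cache layer, rather than at the instant the record changes:

```python
def remaining_traffic_on_primary(t_seconds, layer_ttls):
    """Fraction of traffic still hitting the old primary t seconds
    after the record change, assuming each layer holds its cached
    record for its full TTL and layers expire independently."""
    stale = [ttl for ttl in layer_ttls.values() if ttl > t_seconds]
    return len(stale) / len(layer_ttls)

# Illustrative layers and TTLs, not the incident's real values:
layers = {
    "recursive_resolver":  60,   # honoured the pre-reduced 60 s TTL
    "client_os_cache":     120,  # browser/OS DNS cache
    "enterprise_resolver": 600,  # corporate resolver minimum-TTL policy
    "cdn_origin_cache":    900,  # independent 15 min origin cache
}

for t in (0, 90, 300, 960):
    print(t, remaining_traffic_on_primary(t, layers))
```

Even with the DNS TTL tuned to 60 seconds, the model shows traffic fully moving only when the slowest layer expires — which is exactly the gap between "record changed" and "traffic moved".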

The Four Layers That Each Did Their Job

This is the part worth understanding precisely. The failure was not caused by any single layer behaving incorrectly. It was caused by four layers each behaving correctly — and nobody having modelled what that looked like in combination.

01 — DNS TTL — TTL had been pre-reduced to 60 seconds. Resolvers that re-queried after expiry got the new record immediately. TTL did its job. The problem is that TTL is a floor, not a ceiling. Resolvers are not required to honour TTL exactly, and under load many cache longer than the declared value. The 60-second TTL reduced the blast radius. It did not eliminate it.
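The floor-not-ceiling behaviour can be modelled as a resolver-side clamp, a policy some recursive resolvers apply to reduce upstream query load. The function and values here are an illustrative sketch, not any specific resolver's implementation:

```python
def effective_cache_time(record_ttl, resolver_min_ttl=0,
                         resolver_max_ttl=86400):
    """Toy model of a recursive resolver that clamps cached record
    lifetimes to its own policy window: the record's declared TTL is
    honoured only between the resolver's min and max bounds."""
    return min(max(record_ttl, resolver_min_ttl), resolver_max_ttl)

# A 60 s TTL behind a resolver with a 300 s minimum is held 5x longer:
print(effective_cache_time(60, resolver_min_ttl=300))
```

This is why pre-reducing TTL shrinks the blast radius without eliminating it: the authoritative side sets the TTL, but the resolving side decides how long to keep it.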

02 — Health Check Lag — The health check confirmed the secondary was healthy before the failover was declared. That check passed. What it did not model was the transition period — the window between the primary being declared dead and all traffic paths having drained away from it. Health checks measure endpoint state. They do not measure traffic distribution state.

03 — CDN Origin Cache — The CDN had its own origin resolution cache with a TTL independent of the DNS TTL. When the DNS record changed, the CDN did not immediately re-resolve the origin. It served from its cached origin record for the remainder of its own TTL window. Traffic transiting the CDN continued to reach the old origin until the CDN's internal cache expired — a separately timed event nobody had factored into the RTO calculation.

04 — Client-Side Resolver Persistence — Enterprise clients behind corporate recursive resolvers, browsers with internal DNS caches, and mobile devices with persistent resolver state all maintained their own cached records independently of what the authoritative nameserver was serving. Every one of these clients honoured its own caching logic correctly. The system still failed operationally.

Figure: four caching layers that each behaved correctly while the system failed operationally.

What DNS Failover Testing Actually Needs to Measure

Most DNS failover testing validates the wrong thing. A test that confirms the DNS record updated and the health check passed has validated the protection plane. It has not validated the recovery plane — whether traffic actually moved, when it moved, and what the distribution looked like during the transition window.

Diagnostic: "How do you know traffic moved — and how long after declaration did you check?"

A test that surfaces the Declaration Gap needs to measure traffic distribution, not DNS state. It needs to run active traffic against the production path — including CDN-transited requests and enterprise resolver-cached clients. It needs to timestamp when the DNS record change executes and when traffic distribution on the secondary crosses a defined threshold. The gap between those two timestamps is the real RTO contribution from DNS failover.
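A test shaped this way might look like the sketch below. Both hooks are hypothetical: `change_record` stands in for your DNS provider's update call, and `probe_secondary_share` for an application-layer metric (load-balancer stats, access-log sampling) reporting the fraction of live requests reaching the secondary:

```python
import time

def measure_declaration_gap(change_record, probe_secondary_share,
                            threshold=0.95, poll_interval=5.0,
                            timeout=1800.0):
    """Timestamp the DNS record change, then poll real traffic
    distribution until the secondary carries `threshold` of requests.
    Returns the gap in seconds, or None if the timeout is reached."""
    t_change = time.monotonic()
    change_record()                      # execute the DNS update
    deadline = t_change + timeout
    while time.monotonic() < deadline:
        if probe_secondary_share() >= threshold:
            return time.monotonic() - t_change
        time.sleep(poll_interval)
    return None
```

The returned value, not the nameserver propagation time, is the number that belongs in the RTO model.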

Pre-reducing TTL before a planned failover is necessary but not sufficient. The CDN cache TTL needs its own pre-reduction step. Ignoring this makes the CDN the binding constraint on traffic movement regardless of how aggressively the DNS TTL was tuned.
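The binding-constraint arithmetic is simple to state; the TTL values below are illustrative, not from the incident:

```python
def dns_failover_floor(*cache_ttls):
    """Lower bound on the Declaration Gap: traffic cannot fully move
    until the slowest cache in the path has expired, however low the
    DNS TTL itself is tuned."""
    return max(cache_ttls)

# DNS TTL pre-reduced to 60 s, CDN origin cache left at, say, 900 s:
print(dns_failover_floor(60, 900))  # the CDN dominates the floor
# Pre-reduce the CDN origin TTL too, and the floor drops with it:
print(dns_failover_floor(60, 60))
```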

Monitoring during the failover window needs to watch traffic distribution at the application layer, not DNS propagation at the nameserver layer.

The Transferable Principle

DNS failover is not complete when the record changes. It is complete when traffic distribution changes.

That distinction rewrites the RTO model for any architecture that depends on DNS-based failover. The RTO contribution from a DNS failover event is not the TTL value. It is the time required for all active traffic paths to drain their cached state and re-resolve. These drainage events happen independently, on different timers, with no coordination signal between them.

Testing needs to validate this explicitly — not as a one-time exercise, but on the same cadence as the RTO it is supposed to guarantee. An architecture with a 15-minute RTO commitment that has never measured its Declaration Gap does not have a 15-minute RTO. It has a 15-minute aspiration and an unknown operational reality.

The record changed in seconds. The traffic moved 18 minutes later. Only one of those numbers mattered.

