An 8-minute outage from a dead NLB and a JVM that cached DNS forever

#devops #infrastructure #kubernetes #sre

TL;DR: We drained a Network Load Balancer during a planned migration, and one internal service kept hammering the dead IPs for 8 minutes. The cause wasn't the failover. It was a JVM caching the DNS answer forever. The fix was a 30-second JVM DNS cache TTL and a shorter deregistration delay, not a smarter system.

Last month I ran a routine migration on one of our internal control-plane services at Buildkite. Move it behind a new NLB, drain the old one, done before lunch. We've got a small platform team, four of us, and this was meant to be a non-event.

It was not a non-event.

The bit where everything looked fine

I shifted the Route 53 record to the new NLB, watched the new targets go healthy, and started draining the old load balancer. Traffic on the new path climbed. Traffic on the old path... did not drop. One service, a Java scheduler that fans out build metadata, kept slamming the old NLB's IP addresses like nothing had changed.

The old targets were already deregistering. So we had requests landing on backends mid-shutdown, timing out at 10 seconds each, retrying, timing out again. Error rate on that service went from basically zero to 60%. For 8 minutes.

The annoying part? Every other service flipped over within about 60 seconds. Our Go services, our Node workers, all fine. Only the JVM one was stuck.

Why one runtime behaved differently

Here's the thing nobody on the team had clocked. The default DNS caching behaviour is wildly different depending on what's making the request.

| Runtime | Default DNS cache TTL | What actually happened |
|---|---|---|
| JVM (security manager off) | 30s | Usually fine |
| JVM (security manager on) | Forever (networkaddress.cache.ttl=-1) | Cached the dead NLB IPs until restart |
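| Go net resolver | No in-process cache; follows the upstream resolver, which honours record TTL | Re-resolved in ~60s |
| Node 18 | No in-process cache by default; follows the OS resolver, which honours record TTL | Re-resolved in ~60s |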
| Python requests | No caching at the lib layer | Re-resolved per connection pool refresh |

NLBs hand you IP addresses behind a DNS name, and those IPs belong to the load balancer itself, so a new NLB means a new set of IPs. The contract is: you re-resolve and follow the record. The JVM, with a security manager active, defaults networkaddress.cache.ttl to -1, which means cache the first successful answer for the life of the process. So our scheduler resolved the name once at boot, three weeks ago, and never looked again.

The DNS record had a 60-second TTL. Didn't matter. The JVM never asked DNS a second time.
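If you want to see which policy a given JVM is configured with, here is a minimal sketch that reads the security property and interprets it. The class and method names are made up for illustration. It only reads networkaddress.cache.ttl from the security properties; when that comes back unset, the JVM's built-in default applies, and that default depends on whether a security manager is installed. Treating every negative value like -1 is an assumption of this sketch.

```java
import java.security.Security;

/** Hypothetical helper: reports the configured DNS cache policy for successful lookups. */
final class DnsCachePolicyCheck {

    /**
     * Interprets the raw value of networkaddress.cache.ttl.
     * null or empty means the property is not set in java.security,
     * so the JVM falls back to its built-in default.
     */
    static String describe(String rawTtl) {
        if (rawTtl == null || rawTtl.trim().isEmpty()) {
            return "unset (JVM default applies)";
        }
        int ttl;
        try {
            ttl = Integer.parseInt(rawTtl.trim());
        } catch (NumberFormatException e) {
            return "unparseable: " + rawTtl;
        }
        if (ttl < 0) {
            // -1 is the documented "cache forever" value; other negatives
            // are treated the same way here as an assumption.
            return "cache forever";
        }
        if (ttl == 0) {
            return "no caching";
        }
        return "cache for " + ttl + "s";
    }

    public static void main(String[] args) {
        String raw = Security.getProperty("networkaddress.cache.ttl");
        System.out.println("networkaddress.cache.ttl -> " + describe(raw));
    }
}
```

Running this inside the same image and with the same launch flags as the real service is the point; a laptop JVM may well be configured differently.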

The fix

Two lines in the JVM security config, pushed through our base image:

```properties
# $JAVA_HOME/conf/security/java.security
networkaddress.cache.ttl=30
networkaddress.cache.negative.ttl=5
```

Thirty seconds of caching, five seconds for negative lookups so a transient NXDOMAIN doesn't get pinned either. We bake this into the base image now so every JVM service inherits it. No app code change.
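For a service that can't pick up the base image change, the same two properties can in principle be set from code at startup. This is a sketch of that alternative, not what the fix above does, and the class name is hypothetical. As far as I know the JVM reads these values once, when its address cache policy is first initialised, so the call has to run before the first hostname lookup in the process.

```java
import java.security.Security;

/** Hypothetical startup hook: same values as the java.security change, set from code. */
final class DnsCacheBootstrap {

    /** Sets a 30s positive cache and a 5s negative cache for hostname lookups. */
    static void applyShortDnsCache() {
        Security.setProperty("networkaddress.cache.ttl", "30");
        Security.setProperty("networkaddress.cache.negative.ttl", "5");
    }

    public static void main(String[] args) {
        // Assumption: nothing has resolved a hostname yet at this point.
        applyShortDnsCache();
        System.out.println("ttl=" + Security.getProperty("networkaddress.cache.ttl")
                + " negative.ttl=" + Security.getProperty("networkaddress.cache.negative.ttl"));
    }
}
```

The base image route is still the better default, because it doesn't rely on every service remembering to call a hook first.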

Then the second half of the problem: even with re-resolution, the old NLB was draining too slowly. Default deregistration delay is 300 seconds. During a planned cutover that's an eternity of half-dead targets accepting connections. We dropped it for this service:

```shell
aws elbv2 modify-target-group-attributes \
  --target-group-arn "$TG_ARN" \
  --attributes Key=deregistration_delay.timeout_seconds,Value=30
```

Short TTL plus short drain meant the next test cutover finished in under 90 seconds across every runtime. No error spike.

Where the LLM bit sneaks in

One reason this stung is that the same scheduler kicks off build steps that call an LLM for flaky-test classification. Those calls go out through an AI gateway, Bifrost in our case, which already does its own provider failover and re-resolution on the upstream side. So that path stayed healthy the whole time. Which made the outage extra confusing at first, because the part everyone assumed was fragile (the model calls) was solid, and the boring internal HTTP call was the thing on fire. Good reminder that the exotic dependency isn't always the one that bites you.

Trade-offs and limitations

A short DNS TTL isn't free, and I won't pretend it is.

More DNS queries. Dropping the JVM cache to 30s means up to 120 lookups per hour per hostname per JVM, versus a single lookup for the life of the process under the old "resolve once" behaviour. For us that's noise against a Route 53 resolver, but if you're doing thousands of resolutions a second you'll want to check your resolver isn't the new bottleneck.

30s is still 30s. This shortens the failover window, it doesn't remove it. For a true hot failover you need connection-level health checks and active draining, not just DNS. We treat DNS as the coarse knob and deregistration delay as the fine one.

Negative caching trade-off. A 5-second negative TTL means a brief DNS hiccup gets retried fast, but it also means a genuinely down name gets queried more aggressively. Fine for internal services, worth a second look if you're resolving something rate-limited.

It's per-runtime. There's no single switch. The JVM fix does nothing for a Go binary, and the Go default did nothing wrong here. You have to know each runtime's behaviour, which is exactly the gap that caused this.
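To put a rough number on the extra DNS load from the first trade-off, here is a back-of-the-envelope sketch. It assumes the worst case, where every cached name is looked up again as soon as its entry expires. The class name and the fleet figures in main are illustrative, not our real numbers.

```java
/** Hypothetical calculator: worst-case DNS lookups generated by a given cache TTL. */
final class DnsLookupBudget {

    /** Upper bound on lookups per hour for one hostname in one JVM (one refresh per TTL window). */
    static long maxLookupsPerHour(int ttlSeconds) {
        if (ttlSeconds <= 0) {
            throw new IllegalArgumentException("ttlSeconds must be positive");
        }
        // Ceiling division: a partial window at the end of the hour still costs one lookup.
        return (3600L + ttlSeconds - 1) / ttlSeconds;
    }

    /** Same bound scaled across distinct hostnames and JVM processes. */
    static long fleetLookupsPerHour(int ttlSeconds, int hostnames, int jvms) {
        return maxLookupsPerHour(ttlSeconds) * hostnames * jvms;
    }

    public static void main(String[] args) {
        System.out.println("per name, per JVM: " + maxLookupsPerHour(30));
        // Illustrative fleet: 5 upstream names, 40 JVMs.
        System.out.println("fleet: " + fleetLookupsPerHour(30, 5, 40));
    }
}
```

Real traffic will usually land below this bound, since a name that nobody is calling doesn't get refreshed.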

What I'd tell past me

Test the failover, not the failover system. We had a lovely runbook for swapping NLBs and zero tests for what the clients actually did when we swapped. "Never had a DNS issue" just meant we'd never drained a load balancer with a JVM watching.

Next game day, we're killing an NLB on purpose and watching every runtime re-resolve. If something caches forever, I'd rather find out at 2pm on a Tuesday than during a real migration.
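A game-day probe for the JVM side could look something like this sketch. It resolves a name through the JVM's own resolver on a fixed interval and prints whenever the set of addresses changes, so a process that caches forever shows up as a line that never changes. The class name, argument order, and defaults are all assumptions for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.time.Instant;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical probe: shows when this JVM's view of a hostname actually changes. */
final class ResolveWatcher {

    /** Resolves through the JVM (and therefore its DNS cache); empty set if the name does not resolve. */
    static Set<String> resolve(String host) {
        Set<String> addresses = new TreeSet<>();
        try {
            for (InetAddress address : InetAddress.getAllByName(host)) {
                addresses.add(address.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // Leave the set empty so a failed lookup is visible as a change.
        }
        return addresses;
    }

    /** True when the two address sets differ. */
    static boolean changed(Set<String> previous, Set<String> current) {
        return !previous.equals(current);
    }

    public static void main(String[] args) throws InterruptedException {
        // Usage: java ResolveWatcher.java <host> <polls> <intervalMillis>
        String host = args.length > 0 ? args[0] : "localhost";
        int polls = args.length > 1 ? Integer.parseInt(args[1]) : 3;
        long intervalMillis = args.length > 2 ? Long.parseLong(args[2]) : 1000L;

        Set<String> previous = resolve(host);
        System.out.println(Instant.now() + " initial " + previous);
        for (int i = 1; i < polls; i++) {
            Thread.sleep(intervalMillis);
            Set<String> current = resolve(host);
            if (changed(previous, current)) {
                System.out.println(Instant.now() + " changed " + previous + " -> " + current);
            }
            previous = current;
        }
    }
}
```

Run it with the same JVM flags and security config as the real service, start it before the cutover, and keep it polling until well after the old load balancer is gone.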