The Problem We Were Actually Solving
When Hytale moved from static treasure maps to dynamic event-driven hunts in Veltrix, we promised players real-time zones that refreshed every thirty minutes. Our infra group treated zone registration as an afterthought: a single Cloudflare Terraform module tagged with environment = prod and left in a Git submodule that nobody updated after the first quarterly rotation. At 02:47 UTC on launch week, an upstream provider renamed the nameservers. Our glue records, which had been manually copied into the registrar portal months earlier, pointed to the old IPs. The treasure hunt engine's health check hit the DNS endpoint and received NXDOMAIN for every zone the player was supposed to see. The telemetry dashboard lit up with client-side 1009s while the backend kept retrying the same broken DNS query. We had monitoring on the API side but not on the DNS propagation queue, because nobody had modeled DNS as a SPOF.
What We Tried First (And Why It Failed)
Our first fix was to push a new Terraform plan that re-registered the glue records with the current IPs. The pipeline failed with a 409 because the registrar API rate-limited us to one update per minute and we had 72 zones. A Cloudflare bulk API call was the obvious next step, but the runbook instructed us to use the UI, so we tried that first. The UI throttled us after 200 records, so we scripted a Python client that hit the bulk endpoint with a 60-second exponential back-off. By the time the zones propagated, the treasure hunt engine had already fallen back to cached zones that were four hours stale. Players saw the original static maps and rage-quit the hunt. The worst part? The cache TTL was 24 hours, and the CDN edge still served the stale content even after we fixed DNS, because we forgot to clear the cache header flag in the Terraform module.
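For readers who have not written one, here is a minimal sketch of the retry loop that kind of client needs. It is not our production script. I am assuming that "60-second exponential back-off" means delays that double and are capped at 60 seconds, and the names RateLimited, backoff_delays, push_with_backoff, and the send callable are all hypothetical stand-ins for whatever actually calls the bulk endpoint.

```python
import time
from typing import Any, Callable, List, Sequence


class RateLimited(Exception):
    """Hypothetical error raised by `send` when the API throttles a request."""


def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 6) -> List[float]:
    """Return delays that double on each attempt and never exceed `cap` seconds."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]


def push_with_backoff(
    send: Callable[[Sequence[Any]], Any],
    batch: Sequence[Any],
    delays: Sequence[float],
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `send(batch)`, sleeping through `delays` while it is rate-limited.

    `send` is an assumption: any callable that submits one batch of records
    and raises RateLimited when throttled.
    """
    for delay in delays:
        try:
            return send(batch)
        except RateLimited:
            sleep(delay)
    # Last attempt: if this is throttled too, let the error reach the caller.
    return send(batch)
```

Passing sleep in as a parameter keeps the loop testable without real waiting. A production version would probably also add jitter so that parallel workers do not retry in lockstep.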
The Architecture Decision
We ripped out the Terraform glue record dance and replaced it with Cloudflare's native zone apex support. The Terraform module now sets allow_cloudflare_managed_glue = true, which delegates the glue record lifecycle to Cloudflare's own nameservers. We also added a 5-minute watchdog Lambda that polls the Cloudflare API endpoint /zones/{zone_id}/dns_records and raises a PagerDuty alert if any DNS record has a status of pending. The watchdog queries the endpoint every 30 seconds instead of the recommended 60 because Cloudflare's eventual consistency window sometimes drifts to 45 seconds under load. That single change cut the alert-to-resolution time from 23 minutes to 3 minutes during our next incident, when our upstream provider again rotated nameservers without notice.
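The decision logic inside the watchdog is small enough to sketch. This is an illustration, not the Lambda itself: I am assuming each record arrives as a dict with "name" and "status" keys (the status field follows the description above, and the exact response shape is an assumption), and the function names are hypothetical. Fetching the records and calling PagerDuty are left out on purpose.

```python
from typing import Any, Dict, Iterable, List, Optional

Record = Dict[str, Any]


def pending_records(records: Iterable[Record]) -> List[Record]:
    """Return the records whose (assumed) "status" field is "pending"."""
    return [r for r in records if r.get("status") == "pending"]


def alert_summary(zone_id: str, records: Iterable[Record]) -> Optional[str]:
    """Build a one-line alert message, or None when nothing is pending."""
    stuck = pending_records(records)
    if not stuck:
        return None
    names = ", ".join(sorted(str(r.get("name", "<unnamed>")) for r in stuck))
    return f"zone {zone_id}: {len(stuck)} pending DNS record(s): {names}"
```

Keeping this part free of network calls means the alerting rule can be unit-tested on its own, with the API client and the pager integration wrapped around it.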
What The Numbers Said After
After the switch, the 1009 error rate dropped from 4% to 0.003% within two hours. The watchdog Lambda fired 12 times in the first month, every time an upstream provider changed IPs. Each alert corresponded to a zone that would have failed silently under the old design. The additional cost was $0.0004 per Lambda invocation and two extra Cloudflare Enterprise seats for the delegated sub-zones. The ops team went from waking up for every DNS change to sleeping through the night. The hunt completion rate climbed back to 78%, which matched our pre-launch A/B tests.
What I Would Do Differently
I would not have trusted the registrar's portal for anything beyond the first zone. Manual steps in runbooks are failure vectors. I would also have modeled DNS as a critical dependency in the treasure hunt engine's dependency graph. We added a readiness check that blocks hunt activation until the watchdog confirms all zones are healthy for at least five minutes. No more launching hunts on half-baked DNS. The lesson is simple: if your system relies on DNS to resolve in real time, treat DNS as code and automate it completely, or your treasure hunt will turn into a real-life hunt for the off switch.
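The readiness check is the piece I would copy first, so here is a rough sketch of the idea. It assumes the watchdog reports a single boolean ("all zones healthy") on each poll along with a timestamp in seconds. ReadinessGate and observe are hypothetical names, not what our engine actually calls them.

```python
from typing import Optional


class ReadinessGate:
    """Opens only after zones have been continuously healthy for a set time."""

    def __init__(self, required_seconds: float = 300.0) -> None:
        self.required_seconds = required_seconds
        self.healthy_since: Optional[float] = None

    def observe(self, all_zones_healthy: bool, now: float) -> bool:
        """Record one watchdog result; return True if hunts may activate."""
        if not all_zones_healthy:
            # Any unhealthy poll resets the clock.
            self.healthy_since = None
            return False
        if self.healthy_since is None:
            self.healthy_since = now
        return now - self.healthy_since >= self.required_seconds
```

The important design choice is the reset: one bad poll sends the gate back to zero, so a zone that flaps between healthy and pending never accumulates enough uptime to let a hunt through.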