The Search Engine We Built to Diagnose Our Own Outages

#webdev #programming #architecture #systems

The problem we were actually solving was not fun.

It was 2:17 AM on a Sunday. The Veltrix cluster, stuck in soft-DNS-fail mode, was flapping like a caught marlin, and the pager was screaming. Our Hytale service had 3,400 players online, each one holding a connection that looked green in the frontend but was silently racking up 500 ms of tail latency. The Prometheus heatmap showed a perfect gradient from green to red as traffic spiked from 200 rps to 2,100 rps. We had horizontal pod scaling, pod anti-affinity, the cluster autoscaler, all the Kubernetes sugar we'd copied from the GKE SRE workbook. And yet connections were still bouncing off the NGINX ingress controller, because the back-pressure signal from the pods wasn't reaching the NGINX config fast enough.

The Grafana alert screamed HytaleIngressLatency95p > 450ms for 5m. The tool we reached for first was the Hytale operator's emergency playbook: kubectl describe ingress hytale-ingress, kubectl get pods -o wide, kubectl top pods. Each command took 4–6 seconds to return because the API server was saturated. We had planned for pod sprawl, but we hadn't planned for API sprawl. After 22 minutes of finger-on-keyboard bingo we finally spotted the real culprit: the NGINX controller was recompiling its Lua balancer table every time the cluster autoscaler spun up a new node, and a LuaJIT cache miss sent each recompile down a pathological worst-case path. The nginx-lua-prometheus metrics showed balancer_compile_duration at 1.8 s on every pod restart.
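The `for 5m` clause in that alert means the threshold has to be breached continuously before anything pages. As a rough illustration only (this is not how Grafana or Prometheus evaluate rules internally, and `HoldAlert`, `first_firing`, and the sample values are all hypothetical), the hold logic can be sketched in Python:

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Tuple


@dataclass
class HoldAlert:
    """Fires only after a value stays above `threshold` for `hold_seconds`."""

    threshold: float
    hold_seconds: float
    # Timestamp at which the current unbroken breach began, if any.
    _breach_start: Optional[float] = field(default=None, init=False, repr=False)

    def observe(self, timestamp: float, value: float) -> bool:
        """Feed one sample; return True if the alert is firing at this sample."""
        if value <= self.threshold:
            self._breach_start = None  # any recovery resets the hold timer
            return False
        if self._breach_start is None:
            self._breach_start = timestamp
        return timestamp - self._breach_start >= self.hold_seconds


def first_firing(samples: Iterable[Tuple[float, float]],
                 alert: HoldAlert) -> Optional[float]:
    """Return the timestamp of the first sample at which the alert fires."""
    for timestamp, value in samples:
        if alert.observe(timestamp, value):
            return timestamp
    return None


# Made-up p95 samples: (seconds since start, latency in ms), one per minute.
SAMPLES = [(0, 300), (60, 500), (120, 520), (180, 400), (240, 480),
           (300, 490), (360, 500), (420, 510), (480, 530), (540, 550)]

if __name__ == "__main__":
    alert = HoldAlert(threshold=450, hold_seconds=300)
    print(first_firing(SAMPLES, alert))
```

With these made-up samples, the dip at 180 s resets the timer, so the alert first fires at 540 s rather than 360 s. A hold window like this trades paging delay for fewer false alarms.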

The architecture decision was brutal but necessary.

We ripped out the NGINX ingress controller and replaced it with Traefik Enterprise 2.10. The tradeoff was clear: Traefik used a streaming configuration endpoint (/-/stream) that delivered the live balancer config to every pod in under 100 ms, but the memory footprint jumped from 180 MiB to 320 MiB per replica. That was the cost of moving from a static Lua cache to a streaming hot-reload model. We put Traefik behind an AWS Network Load Balancer to keep the external traffic path clean, and we set up a dedicated Prometheus sidecar called traefik-sidecar-scraper that scraped only the /metrics endpoint every second, avoiding the API server entirely. The p95 latency dropped from 450 ms to 98 ms within two minutes of the rollout.
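For readers wondering where a p95 figure like this comes from: if the latency is exported as a Prometheus-style cumulative histogram, the percentile is an estimate interpolated from bucket counts. The sketch below shows that general technique under stated assumptions. It is not the query we ran, the bucket boundaries and counts are invented, and `histogram_quantile` here is a hand-written Python helper that only mirrors the idea behind the PromQL function of the same name.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    Interpolates linearly inside the bucket that contains the target rank.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the overflow bucket: report the last finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented example: upper bounds in seconds, cumulative request counts.
EXAMPLE_BUCKETS = [(0.05, 900), (0.1, 960), (0.25, 990),
                   (0.5, 1000), (math.inf, 1000)]

if __name__ == "__main__":
    p95 = histogram_quantile(0.95, EXAMPLE_BUCKETS)
    print(f"estimated p95: {p95 * 1000:.1f} ms")
```

Because the result is interpolated, its precision is limited by how finely the buckets are cut around the threshold you care about.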

What the numbers said after.

We ran a 7-day dark canary on the Traefik ingress before cutting over 100% of traffic. During the canary, the traefik_sidecar_scraper errors_total metric stayed flat at 3.2 × 10⁻⁵ (roughly one error every 8 hours), while the kube-apiserver_request_total for that namespace dropped from 12,800 rps to 4,200 rps. The player-facing p95 latency settled at 48 ms on the Traefik path versus 92 ms on the legacy NGINX path for the same traffic pattern. The memory overhead we accepted was 140 MiB per Traefik pod, but we gained 18 GiB of headroom on the API server, which meant the kubectl commands we'd been cursing at now returned in 200–400 ms instead of 4–6 seconds.
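The figures above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the helper names are made up for illustration, and it assumes the 3.2 × 10⁻⁵ error figure is a rate in errors per second, which the numbers suggest but the metric name alone does not confirm.

```python
def mib_overhead(before_mib: float, after_mib: float) -> float:
    """Extra memory per replica after the switch, in MiB."""
    return after_mib - before_mib


def hours_between_events(rate_per_second: float) -> float:
    """Mean gap between events for a given per-second rate, in hours."""
    return 1.0 / rate_per_second / 3600.0


def relative_drop(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after` (0.25 means 25% lower)."""
    return (before - after) / before


if __name__ == "__main__":
    # 180 MiB -> 320 MiB per replica
    print(f"memory overhead: {mib_overhead(180, 320):.0f} MiB")
    # Assumed unit: errors per second
    print(f"gap between errors: {hours_between_events(3.2e-5):.1f} h")
    # API server request rate for the namespace: 12,800 -> 4,200 rps
    print(f"API request drop: {relative_drop(12_800, 4_200):.0%}")
    # p95 on the legacy NGINX path vs the Traefik path: 92 ms -> 48 ms
    print(f"p95 reduction: {relative_drop(92, 48):.0%}")
```

Under that per-second assumption the error gap works out to about 8.7 hours, the API server saw about two thirds fewer requests, and the p95 was a little under half of the legacy path's.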

What I would do differently.

The one thing I'd change is the balancer algorithm. Traefik 2.10 ships with round-robin by default, but we needed consistent-hash with source IP for the Hytale sticky sessions. Tuning the traefik.routing.service.weight label to force consistent-hash added 270 ms to the initial connection setup because the table had to be rebuilt on every restart. Next time I'll bake the algorithm into the Traefik Helm chart upfront and run a load test with 5,000 concurrent WebSocket connections before the canary. I'll also log the balancer_compile_duration metric every 10 seconds instead of every 60 seconds; the 60-second window hid the 1.8 s spike during autoscaling events, and that spike is the exact moment the pager fired.
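For anyone unfamiliar with why consistent hashing on source IP gives sticky sessions, here is a minimal sketch of the general technique in Python. It is not Traefik's implementation or our configuration: `HashRing`, the virtual-node count, and the pod names are all hypothetical choices made for illustration.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    """Toy consistent-hash ring that maps a client IP to a backend."""

    def __init__(self, backends: Iterable[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        # Sorted list of (point on the ring, backend name).
        self._ring: List[Tuple[int, str]] = []
        for backend in backends:
            self.add(backend)

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 as an integer point on the ring.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, backend: str) -> None:
        # Several virtual nodes per backend smooth out the key distribution.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{backend}#{i}"), backend))

    def remove(self, backend: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != backend]

    def lookup(self, client_ip: str) -> str:
        if not self._ring:
            raise LookupError("no backends registered")
        # First ring point at or after the client's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (self._hash(client_ip), ""))
        return self._ring[idx % len(self._ring)][1]


if __name__ == "__main__":
    ring = HashRing(["pod-a", "pod-b", "pod-c"])
    before = ring.lookup("203.0.113.7")
    ring.remove("pod-b")
    after = ring.lookup("203.0.113.7")
    print(before, after)
```

The property that matters for sticky sessions is that removing a backend only remaps the clients that were on it; everyone else keeps their pod. The cost is the one described above: the ring has to be built before the first lookup, so a rebuild on every restart shows up as connection setup latency.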

DEV Community
