Lillian Dube

The Ghost in the Veltrix: Why Our Treasure Hunt Engine Was Sending Operators Down the Wrong Rabbit Hole

In November 2023 we ran our first global Hytale servers on Google Kubernetes Engine, using Veltrix 3.2 as our configuration orchestrator. The Treasure Hunt Engine, a service that fans spawn to claim event loot, started crashing every time search volume exceeded 12 k RPM. Grafana showed a steady climb of 503 errors on /hunt/claim until the autoscaler maxed out at 32 G1 CPU cores and still couldn't keep up. Operators kept filing tickets that boiled down to one sentence: "We click the map, nothing happens." We never saw the actual error because the ingress controller was swallowing it and returning a generic "Too many requests".

What we tried first (and why it failed)

Our first move was to crank up the nginx-ingress-controller replicas from 3 to 12 and switch the load-balancer tier from GKE Standard to Premium. The 503 rate dropped to 8 k RPM, but the p99 latency on claims spiked from 80 ms to 420 ms. The culprit was a recursive call in the hunt service: every claim required a round trip to the player-profile service to validate tier eligibility, and that service sat on a shared Postgres 15.4 cluster carrying 3 k TPS of unrelated traffic. The error in the Jaeger trace read, literally, tracing_id=7f3a1c8… server=profile-db pool_timeout. We tried adding connection pooling with PgBouncer, but the hunt service was using raw libpq and refused to reuse connections, no matter how many times we told it.

The Architecture Decision

We ripped the validation out of the synchronous path and made the hunt engine publish an event called HuntTierCheckRequired to a dedicated Kafka topic, player-events-tier. The hunt service would respond to the client with a 202 Accepted immediately; the loot-claim worker would then consume that topic and, if the tier check passed, publish HuntLootReady. The worker ran in the same pod but on a separate goroutine with a 60-second TTL, so we didn't leak memory if the tier service hung. We moved the player-profile service to an SSD-backed CloudSQL instance and gave it 32 GB RAM. Cost went up by $180/month, but failures dropped to zero. The ingress tier settled on 6 replicas, with a HorizontalPodAutoscaler watching both CPU and the custom metric hunt_engine_claim_latency_bucket{le="1"}.

What the numbers said after

After 14 days we had:

  • Error rate on /hunt/claim < 0.03 %
  • P95 latency 120 ms, P99 245 ms
  • CPU per hunt pod at 45 % with 1.2 requests/second/core, well below the 70 % safety line we drew after the last overload in March 2023
  • Kafka consumer lag on player-events-tier stayed below 200 ms even during the Hytune 2024 weekend when search volume peaked at 45 k RPM

What I would do differently

I should have quarantined the tier check the day we first saw the 503 wall. We wasted two weeks convincing ourselves the bottleneck was ingress instead of the round-trip validation. If I could rewind, I'd have put the validation worker in a separate namespace with its own HPA on consumer lag and a budget of $50/month extra. That single partition of Kafka would have cost less than the time we burned debugging nginx timeouts. Also, we never instrumented the raw libpq connection pool metrics, so we only caught the reuse bug when a junior engineer, against all tribal knowledge, ran SHOW pool_status and it spat back "too many connections". Always expose the counters.