Veltrix Treasure Hunt Engine Blew Up When We Let Players Configure It

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The Veltrix cluster had to run thousands of parallel treasure hunts for Hytale's weekend events. Each hunt was a lightweight workflow: spawn N agents, evaluate M objectives, return results. We wrapped it in a Go microservice called hunt-service and told the artists to tune concurrency by editing a config file named hunt.yaml. The file was mounted read-only in the pod, so the artists kept re-deploying the entire service just to change the value of maxConcurrentHunts.

In our staging environment, maxConcurrentHunts was set to 50. Production was set to 100 because someone ran a non-representative load test on their laptop. When the first weekend event hit, the service was OOM-killed almost immediately: the heap spiked to 2.3 GB when 1,200 hunts raced in parallel. The memory profiler showed that each hunt goroutine was allocating 1.9 MB of scratch buffers for the ECS query plan. That translated to 1.9 MB × 1,200 = 2.28 GB before GC could even start, which matches the 2.3 GB in the OOM line.
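The arithmetic above can be turned around to ask how many hunts a given heap budget can hold. The following is a rough sketch, not code from hunt-service: the function names are made up for illustration, the 2000 MB budget is an assumed figure, and the estimate counts only the 1.9 MB scratch buffers the profiler reported.

```go
package main

import "fmt"

// scratchPerHuntMB is the per-hunt scratch allocation reported by the profiler.
const scratchPerHuntMB = 1.9

// peakScratchMB estimates the scratch memory held when `hunts` hunts run at once.
func peakScratchMB(hunts int) float64 {
	return scratchPerHuntMB * float64(hunts)
}

// maxHuntsForBudget returns the largest hunt count whose scratch buffers fit
// inside budgetMB. It ignores every other allocation, so treat the result as
// an upper bound rather than a safe setting.
func maxHuntsForBudget(budgetMB float64) int {
	return int(budgetMB / scratchPerHuntMB)
}

// report formats the estimate for a given hunt count.
func report(hunts int) string {
	return fmt.Sprintf("%d hunts -> about %.0f MB of scratch buffers", hunts, peakScratchMB(hunts))
}

func main() {
	fmt.Println(report(1200))
	// 2000 MB is an assumed budget for illustration.
	fmt.Println("hunts that fit in a 2000 MB budget:", maxHuntsForBudget(2000))
}
```

Even this crude bound would have flagged 1,200 parallel hunts as more than a 2 GB heap could carry.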

What We Tried First (And Why It Failed)

We started by teaching the artists to use kubectl set env instead of re-deploying the binary. That worked for a week until someone ran kubectl set env hunt-service MAX_CONCURRENT=2000 because they misread the Jira ticket. The service accepted the value, the Pod restarted, and the cluster briefly hit 8,000 concurrent hunts. Veltrix's scheduling tier choked on 8,000 ECS task definitions, the ECS agent CPU spiked to 420% on every node, and hunt-service latency went from 80 ms p95 to 4.2 s p95. Our SLO for hunt latency was 200 ms end-to-end.
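The root issue in that incident was that the service accepted whatever integer it was handed. A minimal sketch of startup validation could look like the following; parseMaxConcurrent, defaultMaxHunts, and hardCeiling are hypothetical names, and the ceiling of 1000 is an assumed value that would need to come from real capacity numbers.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	defaultMaxHunts = 50   // the staging value mentioned above
	hardCeiling     = 1000 // assumed ceiling, for illustration only
)

// parseMaxConcurrent validates the raw MAX_CONCURRENT string. An empty value
// falls back to the default; anything non-numeric or out of range is rejected
// so the process can refuse to start instead of running with a bad limit.
func parseMaxConcurrent(raw string) (int, error) {
	if raw == "" {
		return defaultMaxHunts, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("MAX_CONCURRENT %q is not an integer: %w", raw, err)
	}
	if n < 1 || n > hardCeiling {
		return 0, fmt.Errorf("MAX_CONCURRENT %d is outside the allowed range [1, %d]", n, hardCeiling)
	}
	return n, nil
}

func main() {
	// In a real service the raw value would come from os.Getenv("MAX_CONCURRENT").
	for _, raw := range []string{"", "100", "2000", "abc"} {
		n, err := parseMaxConcurrent(raw)
		fmt.Printf("raw=%q limit=%d err=%v\n", raw, n, err)
	}
}
```

With a check like this, the misread value of 2000 would have been rejected at startup rather than reaching the scheduler.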

Next we extracted maxConcurrentHunts into a ConfigMap and made the Pod read it at startup. That eliminated the re-deploy path, but it introduced a new failure mode: a value read once at startup is never refreshed in a running Pod, and a ConfigMap marked immutable: true cannot be edited at all, only deleted and recreated. During the second event, an operator ran kubectl apply -f hunt-config.yaml with a typo in the file. The ConfigMap update failed silently, the Pod never restarted, and the old value of 2000 remained. The cluster stayed at 8,000 hunts until we manually rolled a new image that forced a restart.

The Architecture Decision

We needed a way to change concurrency without touching the binary or the ConfigMap. The obvious answer was a sidecar that watched a key in Consul for the current maxConcurrentHunts and surfaced that value via an environment variable. Consul's watch mechanism gave us atomicity without Pod restarts.

We replaced the static hunt.yaml with a new HuntConfig struct in hunt-service that reads MAX_CONCURRENT from the sidecar's environment. The sidecar is a tiny Go binary (6 MB image) that polls Consul every 500 ms. If the value changes, it execs into the hunt-service container and replaces the single environment variable using nsenter. That keeps the Pod running, avoids image rebuilds, and gives us the atomic write we needed.
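For a new limit to matter without a restart, hunt-service has to hold it somewhere a running process can swap safely. The sketch below shows one way the in-process side might look; it is an assumption about the shape of HuntConfig, not the actual struct, and it leaves out how the value arrives from the sidecar. The method names are invented, and atomic.Int64 needs Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// HuntConfig holds the live concurrency limit and the current in-flight count.
type HuntConfig struct {
	maxConcurrent atomic.Int64
	inFlight      atomic.Int64
}

// NewHuntConfig returns a config with the given starting limit.
func NewHuntConfig(limit int64) *HuntConfig {
	c := &HuntConfig{}
	c.maxConcurrent.Store(limit)
	return c
}

// SetMax swaps the limit in a single atomic store.
func (c *HuntConfig) SetMax(limit int64) { c.maxConcurrent.Store(limit) }

// TryAcquire reserves a slot for one hunt if the limit allows it.
// The compare-and-swap loop keeps two goroutines from taking the same slot.
func (c *HuntConfig) TryAcquire() bool {
	for {
		cur := c.inFlight.Load()
		if cur >= c.maxConcurrent.Load() {
			return false
		}
		if c.inFlight.CompareAndSwap(cur, cur+1) {
			return true
		}
	}
}

// Release frees a slot when a hunt finishes.
func (c *HuntConfig) Release() { c.inFlight.Add(-1) }

// String reports the current usage for logs.
func (c *HuntConfig) String() string {
	return fmt.Sprintf("in-flight %d of %d", c.inFlight.Load(), c.maxConcurrent.Load())
}

func main() {
	cfg := NewHuntConfig(2)
	fmt.Println(cfg.TryAcquire(), cfg.TryAcquire(), cfg.TryAcquire())
	cfg.SetMax(3) // raise the limit while hunts are running
	fmt.Println(cfg.TryAcquire(), cfg)
}
```

Lowering the limit in this sketch does not cancel hunts already running; it only stops new ones from starting until enough slots are released.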

The tradeoff is that we now depend on Consul. We already run Consul for service discovery, so the risk is low, but Consul can wedge during a network partition. We added a 3-second timeout on the sidecar poll and a fallback to the last-known good value if Consul becomes unreachable. If Consul is down for more than 30 seconds, hunt-service still runs with the last-known value, which is safer than failing.

What The Numbers Said After

After we rolled the Consul sidecar to production on July 12, 2024, the spike of 4,122 concurrent hunts dropped to 1,800 within 60 seconds as the sidecar pushed the default maxConcurrentHunts of 500. The service heap stabilized at 890 MB, well below the 2 GB limit. The p95 hunt latency went from 4.2 s back to 95 ms inside two minutes.

The sidecar itself added less than 0.3 ms of jitter to each hunt start and consumed 1.8 MB of RSS. Consul write latency stayed below 40 ms p95 for 99.9% of updates. During the August 4 weekend event, we ramped maxConcurrentHunts from 500 to 2,500 at 19:12, and the sidecar applied the change in 1.2 seconds without any operator intervention. No OOM events, no SLO breaches.

What I Would Do Differently

We should have modeled concurrency as a first-class resource from day one instead of letting it hide in a YAML file. If we had made hunt-service expose a /config POST endpoint guarded by RBAC, we could have avoided the sidecar entirely. The sidecar works, but it's a workaround for a missing abstraction.

We also underestimated the human factor. Artists will always tweak knobs that look like they control scale, even when told not to. The sidecar lets them tweak safely, but the real fix is a UI that surfaces hunt concurrency as a dial with guardrails, not a raw integer. Next time I design a system with tunable concurrency, I'm shipping a CRD called HuntConcurrency that the UI can scale directly, with an admission controller that rejects values above the cluster's capacity.