The Veltrix Treasure Hunt Engine Was a Scale Death Trap—Until We Fixed the Config Layer

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In 2024 Veltrix launched a geolocation treasure hunt engine that handled 5 million concurrent players during the Halloween event. The system used a Kubernetes cluster running on GKE with 250 nodes, each m5.2xlarge (8 vCPU, 32 GB), and a Cassandra 4.1 backend for events. The config layer (four Helm charts and 117 parameters) was supposed to auto-scale everything: pods, clusters, read replicas, cache tiers. Instead, at 2.3 M concurrent users the config layer let the write-ahead log grow to 89 GB per pod, Cassandra GC pauses hit 3.2 s, and HPA oscillation started at 60 % utilization. The pod count climbed from 250 to 780 in 11 minutes, but latency for treasure claims still spiked past 4 s. We stared at Datadog's JVM GC dashboard and realized the config layer was reading the same 15 environment variables 3700 times per second, thrashing the kubelet.

What We Tried First (And Why It Failed)

Our first attempt was to push the config layer into a sidecar called veltrix-config-agent that reloaded every 5 seconds. The agent used inotify on a shared volume mounted from an EFS CSI driver. At 1.1 M users we hit 42 % disk IO wait because EFS throughput scales with the number of file handles, and each pod opened 17 handles per config file. The reload loop also ran a SHA-256 digest on every file to detect changes, which added 18 ms per poll. When 300 pods restarted simultaneously during a canary, the EFS burst credits drained in 12 seconds. Cassandra nodes faulted because the agent couldn't read the seed list fast enough, causing a 34-second gossip timeout. We disabled the sidecar after the third incident and reverted to static env-vars, but that cost us dynamic cache sizing, so latency crept back up.
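To make the per-poll cost concrete, here is a minimal sketch of digest-based change detection. It is not the veltrix-config-agent code: the function names are hypothetical, and an in-memory map of path to bytes stands in for the shared volume.

```python
from __future__ import annotations

import hashlib


def digest(content: bytes) -> str:
    """Return the SHA-256 hex digest of a config file's bytes."""
    return hashlib.sha256(content).hexdigest()


def changed_files(files: dict[str, bytes], seen: dict[str, str]) -> list[str]:
    """Return the paths whose digest differs from the previous poll.

    `files` maps a path to its current bytes (an assumed stand-in for
    reading the shared volume). `seen` maps a path to the digest recorded
    on the previous poll and is updated in place.
    """
    changed = []
    for path, content in sorted(files.items()):
        current = digest(content)
        if seen.get(path) != current:
            changed.append(path)
            seen[path] = current
    return changed
```

Note that every poll rehashes every file even when nothing changed, so the cost is paid whether or not there is anything to reload.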

The Architecture Decision

We scrapped the sidecar and moved the config layer into a dedicated service called config-orchestrator that runs as a StatefulSet in the same AZ but on separate nodes. The orchestrator exposes a gRPC endpoint /GetRuntimeConfig and streams diffs via a single persistent channel per pod. Each diff is a Protocol Buffers message under 2 KB, so it fits in a single TCP packet. The orchestrator pulls config from Git (main branch) and validates it with an OPA policy that takes 5 ms per file. If validation fails, the orchestrator keeps the last known good config and surfaces a metric config_orchestrator_validation_failure_total{reason="schema"}.
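The last-known-good behaviour can be sketched as follows. This is an illustration under assumptions, not the orchestrator's code: a plain Python function stands in for the OPA policy, a Counter stands in for the config_orchestrator_validation_failure_total metric, and the cache_size_mb rule is hypothetical.

```python
from collections import Counter
from typing import Callable, Optional


class ConfigHolder:
    """Keeps the most recent config that passed validation."""

    def __init__(self, validate: Callable[[dict], Optional[str]]):
        # `validate` returns None on success or a short failure reason.
        self._validate = validate
        self.last_known_good: Optional[dict] = None
        # Assumed stand-in for the failure metric, keyed by reason.
        self.validation_failures: Counter = Counter()

    def offer(self, candidate: dict) -> Optional[dict]:
        """Accept the candidate if valid; otherwise keep the previous config."""
        reason = self._validate(candidate)
        if reason is None:
            self.last_known_good = candidate
        else:
            self.validation_failures[reason] += 1
        return self.last_known_good


def require_cache_size(config: dict) -> Optional[str]:
    """Hypothetical schema rule: cache_size_mb must be a positive integer."""
    size = config.get("cache_size_mb")
    if isinstance(size, int) and not isinstance(size, bool) and size > 0:
        return None
    return "schema"
```

Offering a valid config and then an invalid one leaves the valid config in place and increments the "schema" failure count by one.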

To handle scale we sharded the orchestrator by region: 5 shards, 20 pods each, m5.xlarge (4 vCPU, 16 GB). Each pod streams to at most 1000 game servers. We switched the Helm charts to use Helm Secrets with SOPS and AES-256, so the git push triggers a sealed-secret update that the orchestrator picks up in 300 ms median. The game servers now run with a 30-second resync interval instead of five seconds, so the kubelet stops thrashing.
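The shard sizing implies a ceiling on how many game servers can hold a stream at once. A quick back-of-the-envelope check, using only the numbers above (the function names are illustrative):

```python
import math

SHARDS = 5
PODS_PER_SHARD = 20
MAX_STREAMS_PER_POD = 1000


def max_game_servers() -> int:
    """Upper bound on concurrently streaming game servers."""
    return SHARDS * PODS_PER_SHARD * MAX_STREAMS_PER_POD


def orchestrator_pods_needed(game_servers: int) -> int:
    """Minimum orchestrator pods for a given number of game servers."""
    return math.ceil(game_servers / MAX_STREAMS_PER_POD)
```

With 5 shards of 20 pods, that is 100 orchestrator pods and room for up to 100,000 streams.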

The cache tier was the next bottleneck. We replaced the static Redis cluster with a dynamic redis-autoscaler that listens to the orchestrator's /ConfigChanged stream. When a new cache tier (e.g., redis-slots-1024) is introduced, the autoscaler spins up the new tier in 42 seconds and drains old traffic via a Lua script that migrates keys in batches of 1000. The latency for treasure claims dropped from 4.2 s to 800 ms at 4.8 M concurrent users.
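The batching idea behind the drain can be illustrated in a few lines. This is not the Lua script: it is a Python sketch in which two dicts stand in for the old and new cache tiers, and the helper names are hypothetical.

```python
from typing import Dict, Iterator, List


def batches(keys: List[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive slices of `keys`, each at most `size` long."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def migrate(old: Dict[str, bytes], new: Dict[str, bytes], size: int = 1000) -> int:
    """Move every key from `old` to `new` one batch at a time.

    Returns the number of batches processed. The key list is snapshotted
    up front, so removing keys from `old` during the loop is safe.
    """
    moved_batches = 0
    for batch in batches(sorted(old), size):
        for key in batch:
            new[key] = old.pop(key)
        moved_batches += 1
    return moved_batches
```

Working in fixed-size batches bounds how much work any single step does, which is presumably why the drain avoids moving the whole keyspace at once.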

What The Numbers Said After

After the change the system handled 6.2 M concurrent users on the same 250 GKE nodes without latency exceeding 950 ms p99. The orchestrator's CPU usage stayed flat at 2.4 vCPU across 100 pods. The gRPC stream used 1.2 Mbps network egress per shard, well inside the 5 Mbps limit we set in the VPC CNI. Cassandra write latency stayed below 120 ms p99 because we tuned the compaction strategy to TimeWindowCompactionStrategy with 1-hour windows, and the orchestrator now scales concurrent_reads from 32 to 128 when the treasure submission rate exceeds 5000 per second. The HPA oscillation window widened from 60 seconds to 300 seconds, so we reduced max pods to 350 and saved 30 % of cluster costs.
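The concurrent_reads rule reduces to a threshold check. A minimal sketch, assuming a simple two-level switch with no hysteresis (the text does not say whether the real orchestrator steps gradually or damps flapping near the threshold); the constant and function names are illustrative:

```python
BASE_CONCURRENT_READS = 32
BOOSTED_CONCURRENT_READS = 128
SUBMISSION_RATE_THRESHOLD = 5000  # treasure submissions per second


def concurrent_reads_for(submissions_per_second: float) -> int:
    """Pick the concurrent_reads value for the observed submission rate."""
    if submissions_per_second > SUBMISSION_RATE_THRESHOLD:
        return BOOSTED_CONCURRENT_READS
    return BASE_CONCURRENT_READS
```

A rate of exactly 5000 per second stays at 32, since the rule triggers only when the rate exceeds the threshold.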

What I Would Do Differently

I would not have let the config layer become a distributed file system. The sidecar and EFS experiment cost us three on-call pages and $80k in burst credits. Next time I'd embed the config in the game-server image itself and use a union mount only for overrides, but keep the union mount size under 1 MB. I'd also insist on a single source of truth for config: no Helm values, no env-vars, no ConfigMaps. One Git repo, one policy engine, one gRPC stream. And I'd ban any config that requires more than 10 ms to validate or 100 ms to serialize.
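A budget like that is easy to enforce mechanically, for example as a CI gate. The following is a rough sketch under assumptions: the helper names are hypothetical, and wall-clock timing with time.perf_counter is only one way to measure.

```python
import time
from typing import Callable

VALIDATE_BUDGET_MS = 10.0
SERIALIZE_BUDGET_MS = 100.0


def elapsed_ms(fn: Callable[[], object]) -> float:
    """Run `fn` once and return the wall-clock time it took, in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000.0


def within_budget(validate_ms: float, serialize_ms: float) -> bool:
    """True if both measurements are at or under their budgets."""
    return (
        validate_ms <= VALIDATE_BUDGET_MS
        and serialize_ms <= SERIALIZE_BUDGET_MS
    )
```

In practice the timings would come from elapsed_ms wrapped around the real validation and serialization calls, ideally taken over several runs rather than a single sample.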

