DEV Community

Lisa Zulu

Why Our Treasure Hunt Engine Crashed at 2,000 Concurrent Players and How We Fixed It

The Problem We Were Actually Solving

The real problem wasn't the hunt logic; it was the Veltrix configuration layer, a thin YAML/JSON DSL that was supposed to let non-engineers tune game behavior without touching code. In practice, it had become a distributed systems nightmare. Every time we changed spawn rates, loot tables, or event timers, the operator console would recompile the entire ruleset and push it to 500 edge servers. The network between the config layer and the services was UDP-based, and we had no idempotency, so any message that was sent again was applied again. At 2,000 concurrent players, one dropped packet would cascade into 47 duplicate event firings, causing inventory duplication, teleportation bugs, and angry Discord threads.
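To make the missing property concrete, here is a minimal sketch of idempotent event handling, assuming every event carries a unique ID. The names (`Event`, `IdempotentApplier`, `event_id`) are hypothetical and chosen for illustration; this is not the engine's actual code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str  # hypothetical unique ID assigned by the sender
    kind: str      # e.g. "spawn_loot"
    payload: str


@dataclass
class IdempotentApplier:
    """Applies each event at most once, even if the sender retransmits it."""
    seen: set = field(default_factory=set)
    applied: list = field(default_factory=list)

    def apply(self, event: Event) -> bool:
        # A retransmitted event has an ID we already recorded, so skip it.
        if event.event_id in self.seen:
            return False
        self.seen.add(event.event_id)
        self.applied.append(event)
        return True


applier = IdempotentApplier()
spawn = Event("evt-1", "spawn_loot", "chest@cave")
# Simulate a sender that retries after a lost acknowledgement.
results = [applier.apply(spawn) for _ in range(3)]
```

Only the first delivery changes state; the two retries are recognized and dropped. A long-running service would also need to bound the `seen` set, for example by expiring old IDs, which is left out here.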

What We Tried First (And Why It Failed)

We started with a classic hot-reload strategy using a Git-backed ConfigMap in Kubernetes. The CI pipeline would merge a PR, build a Docker image, and restart pods automatically. It worked fine in staging, but in production the restart storms drove our autoscaler to max pods within 30 seconds, and the new pods would immediately crash because the dynamic config loader assumed config files were written atomically and sometimes read them mid-write. Half the time, the YAML parser would throw a MalformedSequence error on line 124, and the operator console would show a red banner that said "Failed to reconcile ruleset" instead of rolling back gracefully.
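The atomic-write assumption is the kind of thing that can be made true on the writer side and defended against on the reader side. The sketch below shows both, under some assumptions: it uses JSON rather than YAML so it needs only the standard library, and the function names and the `spawn_rate` key are hypothetical.

```python
import json
import os
import tempfile


def write_config_atomically(path: str, config: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(config, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace swaps the file in one step when both paths are on the
        # same filesystem, so readers see the old or the new file, never half.
        os.replace(tmp_path, path)
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_or_keep(path: str, last_good: dict) -> dict:
    """Return the parsed config, or the last known-good one if it is invalid."""
    try:
        with open(path) as f:
            candidate = json.load(f)
        if not isinstance(candidate.get("spawn_rate"), (int, float)):
            raise ValueError("spawn_rate must be numeric")
        return candidate
    except (OSError, ValueError, AttributeError):
        # json.JSONDecodeError is a ValueError, so parse failures land here.
        return last_good


with tempfile.TemporaryDirectory() as workdir:
    rules_path = os.path.join(workdir, "rules.json")
    write_config_atomically(rules_path, {"spawn_rate": 0.5})
    current = load_or_keep(rules_path, last_good={})
    # Simulate a torn write: the file now holds only part of a document.
    with open(rules_path, "w") as f:
        f.write('{"spawn_rate": ')
    after_torn_write = load_or_keep(rules_path, last_good=current)
```

After the torn write, the loader keeps serving the previous config instead of crashing, which is the graceful fallback the console lacked.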

Then we tried a sidecar pattern with a gRPC config server. The idea was to decouple the game services from the rules engine. The server would stream deltas over a persistent connection, and each service would apply changes locally. The first week looked promising: latency dropped to 150ms at 5,000 players. But on day three, during a scheduled database maintenance, the gRPC keepalive timer fired, the connection reset, and the cache invalidation logic kicked in. Without the ability to roll back to a known-good state, every game instance replayed the last 15 minutes of spawn events, creating duplicate loot boxes that players could open simultaneously. The support tickets piled up with screen recordings showing 20 identical dragons spawning in the same cave.
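The two things missing in that incident were a way to ignore deltas that had already been applied and a known-good state to return to. A minimal sketch of both, assuming each delta carries a monotonically increasing sequence number; `ConfigClient` and its methods are hypothetical names, not the gRPC client we ran.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ConfigClient:
    state: dict = field(default_factory=dict)
    last_seq: int = 0
    _snapshot: tuple = field(default_factory=lambda: (0, {}))

    def checkpoint(self) -> None:
        """Record the current state as the known-good rollback target."""
        self._snapshot = (self.last_seq, copy.deepcopy(self.state))

    def apply_delta(self, seq: int, changes: dict) -> bool:
        # After a reconnect the server may resend old deltas; anything at or
        # below the last applied sequence number is a replay and is skipped.
        if seq <= self.last_seq:
            return False
        self.state.update(changes)
        self.last_seq = seq
        return True

    def rollback(self) -> None:
        """Restore the last checkpointed state and sequence number."""
        seq, state = self._snapshot
        self.last_seq = seq
        self.state = copy.deepcopy(state)


client = ConfigClient()
client.apply_delta(1, {"spawn_rate": 0.2})
client.checkpoint()
client.apply_delta(2, {"spawn_rate": 0.9})
replayed = client.apply_delta(1, {"spawn_rate": 0.2})  # resent after reconnect
client.rollback()
```

The replayed delta is rejected, and the rollback returns the client to the checkpointed state with `spawn_rate` at 0.2.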

The Architecture Decision

We abandoned the operator-facing DSL entirely. Instead, we built a declarative rules compiler that emits Rust code at build time. The rules are written in a restricted HCL dialect, validated by a pre-commit hook, and compiled into a static binary that runs in a sidecar next to each game instance. The sidecar talks to a consensus-backed store (etcd, which replicates via Raft) for cluster-wide state, but each instance only consumes its own compiled ruleset. The key tradeoff was developer velocity versus operational safety: once the ruleset is compiled, it becomes immutable. If we need to change spawn rates, we bump the version in the Helm chart, trigger a blue-green deployment, and the old version stays live until the new one stabilizes. The rollback path is now a single kubectl patch command that reverts the chart version.
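The validate-then-emit idea can be sketched in a few lines. This is an illustration only, not the actual compiler: it takes an already-parsed dict instead of HCL, handles only non-negative numeric rules, and the function name and the emitted constant names are hypothetical.

```python
import math


def compile_ruleset(version: str, rules: dict) -> str:
    """Validate rules, then emit Rust source with the values baked in."""
    if not version or not all(c.isascii() and (c.isalnum() or c in ".-") for c in version):
        raise ValueError(f"invalid version: {version!r}")
    lines = [f'pub const RULESET_VERSION: &str = "{version}";']
    for name, value in sorted(rules.items()):
        if not (name.isascii() and name.isidentifier()):
            raise ValueError(f"invalid rule name: {name!r}")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"rule {name!r} must be numeric")
        if not math.isfinite(value) or value < 0:
            raise ValueError(f"rule {name!r} must be finite and non-negative")
        lines.append(f"pub const {name.upper()}: f64 = {float(value)!r};")
    return "\n".join(lines) + "\n"


rust_source = compile_ruleset("1.4.0", {"spawn_rate": 0.25, "loot_timer_s": 30})
```

A bad value such as `{"spawn_rate": "fast"}` fails here with a `ValueError` at build time, so it never reaches a running game instance.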

We also replaced UDP with a gossip protocol over TCP for config propagation. The gossip layer uses vector clocks to detect and reject duplicate events, and each server maintains a Merkle tree of its current state, allowing instant diffs during recovery. The latency cost is about 40ms per delta, but we gained deterministic replay and zero data loss.
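Here is a simplified sketch of the two mechanisms, under stated assumptions: clocks are plain dicts of per-node counters, concurrent updates are resolved by last-received-wins (real conflict handling is omitted), and only Merkle roots are compared rather than walking subtrees. `GossipNode`, `dominates`, `merge`, and `merkle_root` are hypothetical names.

```python
import hashlib


def dominates(a: dict, b: dict) -> bool:
    """True if clock a has already seen everything recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def merge(a: dict, b: dict) -> dict:
    """Component-wise maximum of two vector clocks."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def merkle_root(state: dict) -> str:
    """Hash sorted key/value leaves pairwise up to a single root."""
    level = [hashlib.sha256(f"{k}={v}".encode()).digest()
             for k, v in sorted(state.items())]
    if not level:
        return hashlib.sha256(b"").hexdigest()
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()


class GossipNode:
    def __init__(self, name: str):
        self.name = name
        self.clock = {}
        self.state = {}

    def local_update(self, key, value):
        self.clock[self.name] = self.clock.get(self.name, 0) + 1
        self.state[key] = value
        return (dict(self.clock), key, value)

    def receive(self, message) -> bool:
        clock, key, value = message
        # If our clock already covers the message's clock, it is a duplicate
        # or stale delivery and must not be applied again.
        if dominates(self.clock, clock):
            return False
        self.state[key] = value
        self.clock = merge(self.clock, clock)
        return True


a, b = GossipNode("edge-a"), GossipNode("edge-b")
msg = a.local_update("spawn_rate", 0.3)
first = b.receive(msg)    # new information, applied
second = b.receive(msg)   # same message gossiped again, rejected
in_sync = merkle_root(a.state) == merkle_root(b.state)
```

Equal roots tell two nodes they hold the same state without exchanging it; when roots differ, comparing child hashes level by level would locate the differing keys, which this sketch leaves out.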

What The Numbers Said After

After the migration to compiled rulesets and gossip-based propagation, the system handled 12,000 concurrent players with a 95th-percentile latency of 420ms. The operator console no longer crashes on invalid YAML, and the rollback time dropped from 5 minutes to 30 seconds. The error rate for duplicate events fell from 18% to 0.02%. The biggest surprise was the performance: the Rust-compiled ruleset reduced CPU usage by 38% because the dynamic parser and reflection layers were gone.

What I Would Do Differently

I would never give YAML to an operator again. The next time we need runtime flexibility, we'll use a small DSL that compiles to WebAssembly instead of Rust. The Wasm module can be updated atomically and sandboxed, giving us the safety of immutability with the flexibility of hot reload. We'll also instrument the gossip layer with OpenTelemetry traces so we can catch divergence before it hits production. And we'll stop pretending UDP is acceptable for anything beyond simple heartbeats.
