Why Hytale Server Operators Keep Losing the Treasure Hunt Race — And How We Fixed It at Scale

#webdev #programming #career #productivity

The Problem We Were Actually Solving

The treasure hunt system is a phased sequence: discovery, clue solving, final chest reveal. Once we hit 3k concurrent players, the clue-solving phase would stall. Players refreshed the UI every second expecting new clues, but the backend kept sending the same stale payload because the scheduler was still polling a queue that had moved on. Net result: 800 ms p99 latency on the /clue endpoint, and player complaints in Discord that read "We waited twenty minutes and got nothing, server is broken." Infrastructure graphs showed CPU flatlines and GC pressure spikes every thirty minutes, exactly when the scheduler woke up to redistribute clues.

We dug into the Veltrix documentation and found a single line buried in Appendix C: "The TreasureHuntEngine uses a single-threaded dispatcher with no backpressure mechanism." There was no mention of throttling, retries, or concurrency hints. The canonical sample server code forked on GitHub simply added new workers on demand, with no rate limiting and no circuit breaker. We were treating a distributed clue stream like a batch job.

What We Tried First (And Why It Failed)

My first instinct was horizontal scaling: spin up three identical scheduler pods behind an Nginx ingress with a rate limiter. We set the limit to 1500 requests/minute, which sounded safe. Within twenty minutes the Nginx worker count exploded; we hit the file descriptor limit of 1024 (the common per-process default, not a kernel-wide cap) and Nginx started dropping connections. Then the scheduler pods fell into livelock: every pod tried to claim the same clue batch because the leader election lease timed out while Nginx was still completing TCP handshakes.

Next we tried Kafka: stream clue events into a compacted topic and let each pod consume at its own pace. Kafka handled the volume, but we forgot about ordering and about which pod owned each clue set. Two pods simultaneously pulled the same clue set, duplicated work, and sent two different solution hashes to the same player. Chat erupted with players accusing each other of hacking. We rolled back in under an hour, by which time players had already created memes and sold fake treasure maps on the in-game auction house.

The Architecture Decision

We needed a single source of truth for clue assignment that respected both ordering and backpressure. We settled on a Redis Streams topology with a Lua script for atomic assignment. The Lua script ran inside Redis, so the decision was atomic with no external coordination:

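-- KEYS[1] = clue stream, KEYS[2] = hash mapping clue_id -> stream_id
-- ARGV[1] = clue_id, ARGV[2] = player_id
local stream_id = redis.call('XADD', KEYS[1], 'MAXLEN', '~', 10000, '*', 'clue_id', ARGV[1], 'player_id', ARGV[2])
redis.call('HSET', KEYS[2], ARGV[1], stream_id)
return stream_id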

The KEYS[1] stream was capped at roughly 10k messages to bound memory (the '~' makes the MAXLEN trim approximate), and KEYS[2] was a Redis hash mapping clue_id -> stream_id so we could validate a submitted solution against the exact message. Each scheduler pod did a blocking read on the stream with a 2-second timeout (in stream terms that is XREAD or XREADGROUP with BLOCK 2000; BLPOP only works on lists), then ran the Lua script to atomically claim the next clue. If the read timed out, the pod parked itself for 5 seconds and tried again. No Nginx rate limits, no Kafka ordering problems, just a single Redis instance handling 50k ops/sec under load.
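
The bookkeeping the script performs can be modelled without Redis. The Go sketch below is an in-memory stand-in, assuming a mutex in place of Redis's single-threaded script execution; `Assigner`, `Assign`, and `Validate` are hypothetical names, the ID format is a placeholder, and the trim here is exact whereas Redis trims approximately with '~'.

```go
package main

import (
	"fmt"
	"sync"
)

// entry mirrors the two fields the Lua script writes to the stream.
type entry struct {
	ID, ClueID, PlayerID string
}

// Assigner is an in-memory stand-in for the stream (KEYS[1]) and the
// clue_id -> stream_id hash (KEYS[2]). It is a model, not a Redis client.
type Assigner struct {
	mu     sync.Mutex
	seq    int
	maxLen int
	stream []entry
	byClue map[string]string
}

func NewAssigner(maxLen int) *Assigner {
	return &Assigner{maxLen: maxLen, byClue: make(map[string]string)}
}

// Assign appends an entry and records its ID under one lock, so no other
// caller can observe the append without the matching hash update.
func (a *Assigner) Assign(clueID, playerID string) string {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.seq++
	id := fmt.Sprintf("%d-0", a.seq) // placeholder ID format
	a.stream = append(a.stream, entry{id, clueID, playerID})
	if len(a.stream) > a.maxLen {
		a.stream = a.stream[len(a.stream)-a.maxLen:] // exact trim to maxLen
	}
	a.byClue[clueID] = id
	return id
}

// Validate reports whether a submission refers to the entry that is
// currently recorded for the clue.
func (a *Assigner) Validate(clueID, streamID string) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	id, ok := a.byClue[clueID]
	return ok && id == streamID
}

func main() {
	a := NewAssigner(10000)
	id := a.Assign("clue-42", "player-7")
	fmt.Println(a.Validate("clue-42", id))      // submission with the recorded ID
	fmt.Println(a.Validate("clue-42", "999-0")) // stale or forged ID
}
```

The point of the model is the pairing: the append and the hash update happen as one step, which is what lets a later submission be checked against exactly one message.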

We replaced the Veltrix-provided Node scheduler with a Go worker built on the official Redis client and added a Prometheus histogram, clue_latency_seconds. The Go runtime kept GC pauses below 2 ms even at a 500 MB heap. We also pinned maxmemory-policy to noeviction in redis.conf so in-flight clue state wouldn't disappear during a failover.

What The Numbers Said After

Within three days the p99 clue latency dropped from 800 ms to 45 ms. Weekly concurrent players climbed from 10k to 14k before we even touched vertical scaling. Redis Streams memory stayed flat at 2.3 GB; the stream cap kept growth bounded, and we saw no surprise OOMs. Player Discord complaints about missing clues fell 94%; the remaining 6% were all timezone edge cases where players logged in before the scheduler woke up. The final server cost increased by 3% (the Redis instance plus two extra scheduler pods), while revenue from in-game treasure chests climbed 22% because players kept playing instead of rage-quitting.

What I Would Do Differently

I should not have trusted the Veltrix documentation or the sample code. The single line about the single-threaded dispatcher was a red flag, but we skipped it because we assumed the framework would handle scale. Next time I'll grep the open-source repo for every mention of concurrency, backpressure, and failure modes before even provisioning the first VM. I'd also shard the Redis Streams by clue type earlier; at 20k concurrent players we already see CPU steal on the Redis host during daily clue redistribution. A hash ring on clue_id prefixes would have let us split the load cleanly without rewriting the Lua script. Finally, I'd insist on chaos testing from the start, for example killing a scheduler pod mid-clue redistribution, to validate the Lua script's idempotency under partial failure. That kind of test saved us once when a Redis replica hung during failover: because the script executes atomically, no half-applied assignment was left behind and no player lost progress.
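
For readers who have not built one, a hash ring over clue_id prefixes could look roughly like the Go sketch below. It is a minimal illustration, assuming clue IDs of the form "type:number" (the post does not specify the format) and hypothetical shard names; `Ring`, `NewRing`, and `ShardFor` are invented for this example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// Ring is a minimal consistent-hash ring with virtual nodes.
type Ring struct {
	points []uint32          // sorted hash positions
	owner  map[uint32]string // position -> shard name
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places `replicas` virtual nodes per shard on the ring.
func NewRing(shards []string, replicas int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, s := range shards {
		for i := 0; i < replicas; i++ {
			p := hash32(fmt.Sprintf("%s#%d", s, i))
			if _, taken := r.owner[p]; taken {
				continue // keep the first owner on a hash collision
			}
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// ShardFor hashes the clue_id prefix (the text before the first ':') and
// walks clockwise to the next virtual node. An empty ring returns "".
func (r *Ring) ShardFor(clueID string) string {
	if len(r.points) == 0 {
		return ""
	}
	prefix, _, _ := strings.Cut(clueID, ":")
	h := hash32(prefix)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing([]string{"redis-a", "redis-b", "redis-c"}, 64)
	for _, id := range []string{"riddle:101", "riddle:102", "map:7"} {
		fmt.Println(id, "->", ring.ShardFor(id))
	}
}
```

Because only the prefix is hashed, every clue of one type lands on the same shard, and the Lua script stays unchanged: the ring only decides which Redis instance receives the call.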
