The Treasure Hunt Engine That Broke Before the Traffic Did

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We weren't building a generic scale story; we were protecting a money-printing loop. The treasure-hunt engine awarded cash prizes every hour, and each award ran a small blockchain simulator to determine rarity. That simulator kept about 4 MB of in-memory state per player. When the Rift hit, we had 85 k concurrent players and the single Node process was asked for 290 GB of heap. Our vertical-scale plan, going from 12 cores / 64 GB to 32 cores / 256 GB, would have cost an extra $6 k per event and still risked another heap OOM on the next traffic surge, because V8 has to pause Node's single JavaScript thread for parts of each major garbage collection and a saturated event loop leaves no slack for those pauses. The real problem was not CPU or memory on a bigger box; it was the single-process model itself.

What We Tried First (And Why It Failed)

First we split users randomly across five Node processes behind HAProxy. That lowered max heap per process to ~1.1 GB, but we immediately hit a different wall: the in-memory simulation state was not serializable. We tried storing the 4 MB blob per user in Redis, but the SET operations took 15–28 ms on a cloud Redis 7.0 cluster with 5 ms p99 latency. At 85 k players, that became 1.3 million round trips per second, and we saturated the 1 Gbps link between the Node pool and Redis. The failure surfaced as 38 % of write operations timing out, alongside this error:

NOAUTH Authentication required
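
The link saturation can be sanity-checked with rough arithmetic. The sketch below assumes decimal units (1 MB = 10^6 bytes), zero protocol overhead, and that the full 4 MB blob travels on every write. Those are assumptions for illustration, not measurements, so the result is an upper bound rather than a benchmark.

```javascript
// Back-of-the-envelope link budget. Decimal units and zero protocol
// overhead are assumptions, so real throughput would be lower.
const LINK_BITS_PER_SEC = 1e9; // the 1 Gbps link between Node and Redis
const BLOB_BYTES = 4 * 1e6;    // ~4 MB of simulator state per player

// Maximum number of full-blob writes the link can carry per second.
function maxBlobWritesPerSecond(linkBitsPerSec, blobBytes) {
  const linkBytesPerSec = linkBitsPerSec / 8;
  return linkBytesPerSec / blobBytes;
}

// Seconds needed to write every player's blob once over the link.
function secondsToWriteAll(players, linkBitsPerSec, blobBytes) {
  return players / maxBlobWritesPerSecond(linkBitsPerSec, blobBytes);
}

const writesPerSec = maxBlobWritesPerSecond(LINK_BITS_PER_SEC, BLOB_BYTES);
const fullPassSec = secondsToWriteAll(85_000, LINK_BITS_PER_SEC, BLOB_BYTES);

console.log(writesPerSec.toFixed(2)); // 31.25
console.log(Math.round(fullPassSec)); // 2720
```

Under these assumptions the link tops out at roughly 31 full-state writes per second, and a single pass over all 85 k players would take about 45 minutes. Shipping whole blobs over that link was never going to keep up.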

We also tried sharding Redis into 16 shards, but the Lua scripts we used for atomic rarity calculation could not span multiple hash slots: Redis Cluster requires every key a script touches to live in the same slot. We ended up with either duplicated or dropped rewards, something our finance team would not sign off on.
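
To make the slot constraint concrete, here is a minimal sketch of how Redis Cluster maps a key to one of its 16384 hash slots (CRC16 of the key, or of the `{hash tag}` if one is present). The key names are hypothetical and are not the ones our engine used. The sketch only illustrates why two keys for the same player can land in different slots unless they share a hash tag.

```javascript
// CRC16 (XMODEM variant: polynomial 0x1021, initial value 0), the checksum
// Redis Cluster uses to map keys to hash slots.
function crc16(buf) {
  let crc = 0;
  for (const byte of buf) {
    crc ^= byte << 8;
    for (let i = 0; i < 8; i++) {
      crc = (crc & 0x8000) ? ((crc << 1) ^ 0x1021) : (crc << 1);
      crc &= 0xffff; // keep the register at 16 bits
    }
  }
  return crc;
}

// If the key contains a non-empty {hash tag}, only the tag is hashed;
// otherwise the whole key is hashed.
function hashSlot(key) {
  const open = key.indexOf('{');
  if (open !== -1) {
    const close = key.indexOf('}', open + 1);
    if (close > open + 1) key = key.slice(open + 1, close);
  }
  return crc16(Buffer.from(key, 'utf8')) % 16384;
}

// Hypothetical key names: with a shared tag both keys hash "player:42".
const tagged = [hashSlot('{player:42}:state'), hashSlot('{player:42}:rewards')];
console.log(tagged[0] === tagged[1]); // true

// Without a tag, each full key is hashed on its own, so the slots are
// independent and a single Lua script may not be allowed to touch both.
console.log(hashSlot('player:42:state'), hashSlot('player:42:rewards'));
```

Hash tags are the usual way to keep related keys in one slot, which is the property a multi-key Lua script needs.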

The Architecture Decision

We needed an in-memory store with strong consistency within a shard and a networking stack that could keep up. After running the numbers on three Redis-compatible candidates (Dragonfly 1.0, KeyDB 2.8, and Memurai 2.1), we picked Dragonfly. Its multi-threaded, shared-nothing design snapshots without forking, which gives predictable latency, and it uses 40 % less RAM than Redis 7 for the same value size. We carved the key space into 64 shards and fronted it with Envoy so that each Node process could open a gRPC stream to its shard instead of a plain TCP connection. The Node side became stateless; every player was routed by the same hash ring to the same shard, so the 4 MB state lived in one place and the atomic rarity calculation ran in a single Lua call that Dragonfly executes in <2 ms.
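
The routing idea can be sketched as a small consistent-hash ring. This is an illustration under assumptions, not our production router: the SHA-1 hash, the 32 virtual nodes per shard, and the `shard-N#v` and `player-42` names are all made up for the example. Only the 64-shard count comes from the setup described above.

```javascript
const { createHash } = require('node:crypto');

// Hash a string to an unsigned 32-bit position on the ring.
function ringPosition(s) {
  return createHash('sha1').update(s).digest().readUInt32BE(0);
}

// Build a ring with several virtual nodes per shard to even out the spread.
function buildRing(shardCount, vnodesPerShard = 32) {
  const ring = [];
  for (let shard = 0; shard < shardCount; shard++) {
    for (let v = 0; v < vnodesPerShard; v++) {
      ring.push({ pos: ringPosition(`shard-${shard}#${v}`), shard });
    }
  }
  return ring.sort((a, b) => a.pos - b.pos);
}

// Binary-search the first ring point at or after the player's position,
// wrapping around to the start of the ring if none is found.
function shardFor(ring, playerId) {
  const pos = ringPosition(playerId);
  let lo = 0;
  let hi = ring.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (ring[mid].pos < pos) lo = mid + 1;
    else hi = mid;
  }
  return ring[lo === ring.length ? 0 : lo].shard;
}

const ring = buildRing(64);
// The same player ID always resolves to the same shard.
console.log(shardFor(ring, 'player-42') === shardFor(ring, 'player-42')); // true
```

Because every Node process can build the same ring from the same inputs, any process can route a player without coordination, which is what lets the Node tier stay stateless.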

On the write path we replaced one-SET-per-round-trip writes with pipelines of 32 commands and capped in-flight requests per shard at 1 k. The Node servers started using worker_threads to isolate the simulator from the event loop, so a surge in puzzle-solving CPU would not stall the Redis pipeline. The change cost us two weeks of rewriting the rarity engine from callback-heavy to promise-based code, but we gained a 7× latency drop on the critical path.
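
The batching and back-pressure described above can be sketched roughly as follows. `sendBatch` is a hypothetical async function standing in for whatever client call flushes one pipeline to a shard; it is not a real library API. Retries and error handling are left out to keep the shape visible.

```javascript
const BATCH_SIZE = 32;      // commands per pipeline
const MAX_IN_FLIGHT = 1000; // per-shard cap on outstanding commands

// Split a list of commands into pipeline-sized batches.
function toBatches(commands, size = BATCH_SIZE) {
  const batches = [];
  for (let i = 0; i < commands.length; i += size) {
    batches.push(commands.slice(i, i + size));
  }
  return batches;
}

// Send batches to one shard while keeping at most `maxInFlight` commands
// outstanding. `sendBatch` (hypothetical) must return a promise that
// settles when the shard has answered the whole batch.
async function writeToShard(commands, sendBatch, maxInFlight = MAX_IN_FLIGHT) {
  let inFlight = 0;
  let peak = 0;
  const pending = new Set();
  for (const batch of toBatches(commands)) {
    // Wait for earlier batches to settle until this one fits under the cap.
    while (inFlight + batch.length > maxInFlight && pending.size > 0) {
      await Promise.race(pending);
    }
    inFlight += batch.length;
    peak = Math.max(peak, inFlight);
    const p = sendBatch(batch).finally(() => {
      inFlight -= batch.length;
      pending.delete(p);
    });
    pending.add(p);
  }
  await Promise.all(pending);
  return peak; // highest number of commands outstanding at once
}
```

With batches of 32 and a cap of 1 k, at most 31 full batches (992 commands) are outstanding per shard before the writer has to wait, so a slow shard pushes back on its own writers instead of building an unbounded queue.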

What The Numbers Said After

After the next Rift, we measured:

P95 latency on award transactions: 8 ms (down from 36 ms)
Heap per Node process: 240 MB (stable)
Dragonfly shard CPU: 42 % (peak across 64 shards)
Cost per thousand players: $0.0012 (down from $0.0078)

All 85 k players completed the event without a single timeout error. The Node pool stayed at 52 % CPU, well below the 70 % inflection point where Node's event-loop lag starts to climb steeply. The finance team happily wired the prizes because the blockchain simulator never lost state.

What I Would Do Differently

I'd push the stateless boundary earlier. By the time we finished the Dragonfly migration, the Node processes were mostly I/O bound again; the worker_threads helped but added complexity in stack traces and heap snapshot debugging. Next event we will move the simulator into a separate Go microservice behind gRPC, letting us scale the CPU independently and take the simulator out of the Node heap entirely. Had we architected for embarrassingly parallel CPU work from day one instead of around an in-memory cache, we would have avoided the Redis rewrite and the OOM scare. But we also wouldn't have learned that Dragonfly's sharded Lua gives us stronger consistency guarantees than Redis Cluster for this specific workload, and that lesson is worth the detour.