The Problem We Were Actually Solving
Our server had 1,200 concurrent players at peak, all trying to uncover the same 472 treasure locations spawned across three biomes. The engine defaulted to a central Redis cluster for location validation. At 1,200 ops/sec, Redis latency spiked to 450ms on writes and 380ms on reads. Redis itself wasn't the bottleneck; the serialization format was. We were using JSON to represent each treasure location, which ballooned to 1.8KB per record. The Redis master couldn't keep up, and the cluster shed 287 requests per second during peak load. Clients timed out and retried, which created even more traffic. Players saw treasure chests flickering in and out of existence, and Jira tickets piled up with titles like "Shiny Object Bug" and "Chest Teleportation Incident".
We weren't solving fun. We were solving distributed consensus under write amplification.
What We Tried First (And Why It Failed)
Our first fix was to switch from JSON to Protocol Buffers with gzip compression. The payload dropped from 1.8KB to 320 bytes. Latency improved: Redis stabilized at 120ms for writes and 90ms for reads. Success? Not quite.
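To illustrate why a binary encoding shrinks the payload, here is a minimal sketch. It uses Python's `struct` module as a stand-in for Protocol Buffers (so no generated classes are needed), and the record fields, the `BIOMES` table, and the helper names are all hypothetical, since the real schema isn't shown here.

```python
import gzip
import json
import struct

# Hypothetical treasure record; the real schema is an assumption.
record = {
    "id": 117,
    "biome": "Forest",
    "x": 1024.5,
    "y": 64.0,
    "z": -388.25,
    "spawned_at": 1700000000,
    "cooldown_s": 120,
    "looted": False,
}

BIOMES = ["Forest", "Desert", "Nether"]

# Fixed layout: uint32 id, uint8 biome index, 3 x float32 position,
# uint64 timestamp, uint16 cooldown, bool looted. "<" disables padding.
LAYOUT = struct.Struct("<IB3fQH?")


def encode_json(rec):
    """Text encoding: field names are repeated in every record."""
    return json.dumps(rec).encode("utf-8")


def encode_binary(rec):
    """Binary encoding: field names live in the layout, not the payload."""
    return LAYOUT.pack(
        rec["id"], BIOMES.index(rec["biome"]),
        rec["x"], rec["y"], rec["z"],
        rec["spawned_at"], rec["cooldown_s"], rec["looted"],
    )


def decode_binary(data):
    rid, biome, x, y, z, spawned_at, cooldown_s, looted = LAYOUT.unpack(data)
    return {
        "id": rid, "biome": BIOMES[biome], "x": x, "y": y, "z": z,
        "spawned_at": spawned_at, "cooldown_s": cooldown_s, "looted": looted,
    }


def encode_batch(records):
    """Compress many packed records together so gzip can exploit repetition."""
    return gzip.compress(b"".join(encode_binary(r) for r in records))


print(len(encode_json(record)), "bytes as JSON")
print(len(encode_binary(record)), "bytes packed")
```

One caveat worth keeping in mind: gzip adds a fixed header and trailer, so compressing a single small record can make it larger. Compression tends to pay off on batches, which is why the sketch compresses a whole list at once.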
The second failure emerged when we tested region-based load balancing. We sharded treasure locations by biome: Forest, Desert, Nether. But the engine still used global spawn timers. Every biome update required a broadcast to all shards to maintain cooldown consistency. Broadcast traffic exploded, and at 2k players the cluster couldn't synchronize the spawn timers fast enough. Players in the Nether biome saw Forest chests spawn two minutes early, ruining the hunt rhythm. We tried Kafka to stream spawn events, but the event sourcing model introduced 15–20 seconds of lag between shards, making the hunt feel asynchronous and unfair.
The real problem wasn't serialization or sharding. It was timing synchronization across shards without a distributed lock manager.
The Architecture Decision
We abandoned shards and Redis entirely. Instead, we built a two-tier system:
Tier 1: Local In-Memory State with CRDTs
Each game node hosts a local treasure state as a Conflict-Free Replicated Data Type (CRDT). Locations are stored as sets with tombstones for despawns. There are no network calls during gameplay, just local reads and writes. The CRDT handles eventual consistency seamlessly.
Tier 2: Sparse Sync with Raft Consensus for Spawn Timers
Spawn timers are managed by a five-node Raft cluster. Every 30 seconds, nodes elect a leader to decide which biome should spawn next. The leader broadcasts the decision to all game nodes. The broadcast is sparse: only the biome ID and timer reset, with no full state dump. Each game node uses the timer to trigger local respawns via the CRDT.
We chose Raft over Redis for consensus because Redis Sentinel doesn't provide linearizable writes under high contention. Raft gives us total-order broadcast with 5–12ms latency at 5k players.
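As a rough sketch of what the sparse broadcast might look like on the receiving side, the code below models a decision message and a game node that applies decisions strictly in log order. The Raft cluster itself is not modeled. `SpawnDecision`, `GameNode`, and the `log_index` field are hypothetical names chosen for illustration, on the assumption that each committed decision carries its position in the replicated log.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpawnDecision:
    """Sparse payload: log position, biome ID, and timer reset only."""
    log_index: int   # position in the replicated log (assumed field)
    biome_id: int
    reset_at_s: int  # time at which the biome's spawn timer restarts


@dataclass
class GameNode:
    last_applied: int = 0
    timers: dict = field(default_factory=dict)   # biome_id -> reset_at_s
    pending: dict = field(default_factory=dict)  # log_index -> decision

    def receive(self, decision: SpawnDecision) -> None:
        """Buffer a committed decision, then apply whatever is now in order."""
        if decision.log_index <= self.last_applied:
            return  # duplicate delivery; already applied
        self.pending[decision.log_index] = decision
        # Apply only contiguous entries so every node sees the same sequence.
        while self.last_applied + 1 in self.pending:
            nxt = self.pending.pop(self.last_applied + 1)
            self.timers[nxt.biome_id] = nxt.reset_at_s
            self.last_applied = nxt.log_index


decisions = [
    SpawnDecision(1, biome_id=0, reset_at_s=30),
    SpawnDecision(2, biome_id=2, reset_at_s=60),
    SpawnDecision(3, biome_id=0, reset_at_s=90),
]

in_order, shuffled = GameNode(), GameNode()
for d in decisions:
    in_order.receive(d)
for d in (decisions[1], decisions[0], decisions[0], decisions[2]):
    shuffled.receive(d)  # out-of-order and duplicated delivery
```

Because both nodes apply the same entries in the same order, they should end up with identical timers regardless of how the network reorders or repeats the messages.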
The CRDT we used is a custom G-Counter for spawn counts and a Lexi-DAG for location metadata. We rejected Dynamo-style gossip protocols because they introduce probabilistic inconsistencies, and players hate inconsistencies in treasure outcomes.
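For readers unfamiliar with these structures, here is a minimal sketch of a textbook G-Counter and a two-phase set with tombstones. It is not our production code: the class and method names are hypothetical, the two-phase set is one common reading of "sets with tombstones", and the Lexi-DAG is not shown.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged by element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> count contributed by that node

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


class TombstoneSet:
    """Two-phase set: once an ID is tombstoned, it stays removed."""

    def __init__(self):
        self.added = set()
        self.removed = set()  # tombstones

    def spawn(self, chest_id):
        self.added.add(chest_id)

    def despawn(self, chest_id):
        if chest_id in self.added:
            self.removed.add(chest_id)

    def live(self):
        return self.added - self.removed

    def merge(self, other):
        # Set union is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order.
        self.added |= other.added
        self.removed |= other.removed


# Two nodes count spawns independently, then exchange state.
a, b = GCounter("node-a"), GCounter("node-b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)

# Node A spawns two chests; node B learns of them and despawns one.
chests_a, chests_b = TombstoneSet(), TombstoneSet()
chests_a.spawn("chest-1")
chests_a.spawn("chest-2")
chests_b.merge(chests_a)
chests_b.despawn("chest-1")
chests_a.merge(chests_b)
```

One consequence of this particular design is that a tombstoned ID can never come back, so a respawn at the same location would need a fresh chest ID.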
What The Numbers Said After
After the change:
Player-reported chest flickering dropped from 287 per hour to zero.
Median Redis latency (for leader election logs) stabilized at 8ms.
Game node CPU usage fell from 68% to 34% under peak load.
Player engagement (time on server) increased by 41% in the first week.
The community noticed the difference. Bug reports shifted from instability to feature requests: "Can we add boss-triggered treasure hunts?" "What about raid chests?"
The real victory wasn't speed. It was eliminating the perception of cheating. When chests despawned fairly and respawned predictably, players trusted the system. Trust drove retention.
What I Would Do Differently
I would not have started with Redis. Redis is a cache, not a consensus system. Using it for global state coordination was a category error.
I would implement the CRDT layer first, even before sizing the cluster. Without local state, you're just moving latency around.
I would avoid Kafka for real-time synchronization. Kafka is great for analytics, not for distributed timers. Use Raft or etcd instead.
Finally, I would measure fairness, not just throughput. Track the variance in chest spawn times across biomes. If one biome spawns 20% faster than another, players will notice and exploit it. Fairness is the real KPI of a treasure hunt engine.
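A fairness check along these lines could be sketched as follows. The function names, the input shape (a mapping from biome to spawn timestamps in seconds), and the use of spawn rate as the comparison are assumptions made for illustration; the 20% tolerance comes from the threshold mentioned above.

```python
from statistics import mean


def mean_interval(timestamps):
    """Mean gap in seconds between consecutive spawns (needs >= 2 timestamps)."""
    ordered = sorted(timestamps)
    return mean([b - a for a, b in zip(ordered, ordered[1:])])


def fairness_report(spawns_by_biome, tolerance=0.20):
    """Compare spawn rates across biomes and flag a spread at or above tolerance."""
    rates = {
        biome: 1.0 / mean_interval(ts)  # spawns per second
        for biome, ts in spawns_by_biome.items()
    }
    fastest = max(rates, key=rates.get)
    slowest = min(rates, key=rates.get)
    # How much faster the fastest biome spawns than the slowest (0.0 = equal).
    spread = rates[fastest] / rates[slowest] - 1.0
    return {
        "fastest": fastest,
        "slowest": slowest,
        "spread": spread,
        "fair": spread < tolerance,
    }


# Hypothetical sample: Nether spawns every 90s while the others spawn every 120s.
skewed = fairness_report({
    "Forest": [0, 120, 240],
    "Desert": [0, 120, 240],
    "Nether": [0, 90, 180],
})
print(skewed["fastest"], "fair" if skewed["fair"] else "unfair")
```

Running a report like this on a schedule and alerting on `fair` being false would turn fairness into something measurable rather than something inferred from bug reports.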