The Moment Our Treasure Hunt Engine Became a Distributed Systems Headache

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We thought we were building a fun game. Players find virtual treasures in physical venues using Bluetooth beacons. The backend is simple: a treasure hunt engine that tracks active hunts, validates beacon proximity, updates inventory, and maintains global leaderboards. We estimated our Redis memory usage at 8GB and budgeted for 300ms read latency. At launch, everything looked perfect.

Then we hit the first viral event: 5000 concurrent players. Redis memory spiked to 32GB before we even noticed the alert. The OOM killer stepped in while players screamed about negative treasure counts. Our availability cratered from 99.9% to 87%. Worse, our leaderboard writes started timing out at 3 seconds because a single Redis node couldn't handle the throughput.

We tried vertical scaling first: we migrated to a 64GB Redis Enterprise node. Cost went from $1200/month to $4800/month. Performance improved slightly, but the single point of failure remained. When the node rebooted for maintenance, the hunt was suspended for 90 seconds. Players demanded ticket refunds in droves.

What We Tried First (And Why It Failed)

After vertical scaling, our next attempt was a naive optimization. We moved to Redis Cluster, sharding by player ID. This solved the memory problem but introduced a new nightmare: cross-slot operations. Leaderboard updates required reading and writing keys on multiple shards, and Redis Cluster rejects multi-key commands whose keys hash to different slots. We saw latency spikes to 3.8 seconds during leaderboard refresh. Worse, we hit MOVED redirects constantly, which our Java client didn't handle well. The error logs filled with "CROSSSLOT Keys in request don't hash to the same slot".
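
The CROSSSLOT errors follow from how Redis Cluster places keys: each key maps to one of 16384 hash slots (CRC16 of the key, modulo 16384), and when a key contains a non-empty {...} hash tag, only the text inside the braces is hashed. The sketch below reimplements that mapping in plain Python purely for illustration. The key names are made up, and nothing here claims to be the key layout we actually ran.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 (XMODEM variant): polynomial 0x1021, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Return the Redis Cluster hash slot (0..16383) for a key."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag counts; with "{}" the whole key is hashed.
        if end != -1 and end > start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


player_ids = ("1001", "1002", "1003")

# Hypothetical per-player keys: each one is hashed in full, so the slots
# are generally different and a multi-key command can fail with CROSSSLOT.
per_player = [key_slot(f"player:{pid}:score") for pid in player_ids]

# Hypothetical keys sharing the hash tag {hunt:42}: all map to one slot.
tagged = [key_slot(f"{{hunt:42}}:player:{pid}:score") for pid in player_ids]

print("untagged slots:", per_player)
print("tagged slots:  ", tagged)
```

Hash tags make multi-key operations legal again, but every key sharing a tag lands on the same shard, so a single hot tag recreates the hotspot that sharding was supposed to remove.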

Next, we tried caching leaderboards in application memory. We used Caffeine with 10-second TTLs. This worked for read-heavy leaderboards but failed spectacularly for inventory updates. A player claiming a treasure triggered a cache invalidation storm that cascaded to 8000 cache misses per second. Our Redis CPU usage dropped from 95% to 40%, but our application CPU doubled because of the resulting cache stampede.

Finally, we tried Kafka for event sourcing. We had players emitting events like TreasureClaimed, HuntCompleted, and LeaderboardUpdated. This created a perfect Kafka partition explosion: 5000 partitions for 5000 concurrent players. Our consumer group lag hit 15 minutes during peak load. The Kafka brokers ran out of file descriptors. ZooKeeper became the bottleneck.

The Architecture Decision

We stopped optimizing the cache and started redesigning boundaries.

We split the treasure hunt engine into three bounded contexts:

Session Management: Local in-memory cache per node, no persistence. Each hunt session lives entirely in application memory. We set a hard limit: 2000 concurrent sessions per pod. When exceeded, we shed load with HTTP 503. No Redis dependency at all.
Inventory Service: Redis Cluster with sharding by player ID, but with a strict consistency model for writes. All writes are synchronous and durable. We use Redis Streams for change data capture, not for real-time reads. Reads go to the local cache first, with a 1-second TTL. This reduces Redis load by 70%.
Leaderboard Service: Write-behind cache with eventual consistency. We use a dedicated Redis Cluster for leaderboards only. Writes update an in-memory leaderboard first, and Redis receives them later in batches rather than as one write per score change. Reads hit the in-memory leaderboard first. When a hunt ends, we flush the in-memory leaderboard to Redis as a batch write. This reduces Redis writes by 92%.
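
The leaderboard flow (absorb score changes in memory, serve reads from memory, push batches to Redis) can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the production code: BufferedLeaderboard, add_points, and the flush_batch callback are hypothetical names, and a plain dict stands in for Redis.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple


class BufferedLeaderboard:
    """In-memory leaderboard that batches its writes to a backing store."""

    def __init__(self, flush_batch: Callable[[Dict[str, int]], None]) -> None:
        self._scores: Dict[str, int] = defaultdict(int)
        self._dirty: Set[str] = set()      # players changed since last flush
        self._flush_batch = flush_batch    # hypothetical sink, e.g. a Redis writer

    def add_points(self, player_id: str, points: int) -> None:
        # Writes touch memory only; the backing store is not contacted here.
        self._scores[player_id] += points
        self._dirty.add(player_id)

    def top(self, n: int) -> List[Tuple[str, int]]:
        # Reads are served from memory: highest score first, ties by player id.
        return sorted(self._scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

    def flush(self) -> int:
        """Send one batch with the latest score per changed player."""
        if not self._dirty:
            return 0
        batch = {pid: self._scores[pid] for pid in self._dirty}
        self._flush_batch(batch)
        # Cleared only after the sink accepted the batch, so a failed
        # flush is retried on the next call.
        self._dirty.clear()
        return len(batch)


store: Dict[str, int] = {}                 # stand-in for the Redis cluster
board = BufferedLeaderboard(store.update)

updates = [("ana", 10), ("bo", 5), ("ana", 7), ("cy", 12), ("bo", 9)]
for player, points in updates:
    board.add_points(player, points)

written = board.flush()
print(f"{len(updates)} score updates became {written} store writes")
print(board.top(2))
```

The write reduction comes from coalescing: several updates to the same player between flushes collapse into a single write. The cost is that a pod crash loses whatever has not been flushed yet, which is why this only suits data that tolerates eventual consistency.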

We also introduced a beacon scoring service that runs offline. Beacons emit raw events to Kafka, but scoring happens in batch jobs every 30 seconds. This decoupled real-time proximity validation from score calculation, reducing Redis memory usage by 60%.
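
To make the batching idea concrete, here is a small Python sketch that groups raw events into 30-second windows and sums points per player. The BeaconEvent fields (including a points value carried on the event) and the score_by_window function are assumptions for illustration. Reading from Kafka is left out entirely, and the events are just an in-memory list.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple

WINDOW_SECONDS = 30  # batch interval mentioned above


class BeaconEvent(NamedTuple):
    """Hypothetical shape of a raw beacon event."""
    player_id: str
    beacon_id: str
    timestamp: float  # seconds since some epoch
    points: int


def score_by_window(events: Iterable[BeaconEvent]) -> Dict[int, Dict[str, int]]:
    """Map each window start time to the points each player earned in it."""
    windows: Dict[int, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for ev in events:
        # Floor the timestamp to the start of its 30-second window.
        window_start = int(ev.timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[window_start][ev.player_id] += ev.points
    return {start: dict(scores) for start, scores in windows.items()}


events = [
    BeaconEvent("ana", "b1", 3.0, 10),
    BeaconEvent("bo", "b2", 12.5, 5),
    BeaconEvent("ana", "b3", 29.9, 7),
    BeaconEvent("ana", "b1", 30.0, 4),
    BeaconEvent("cy", "b4", 61.0, 12),
]

result = score_by_window(events)
for start in sorted(result):
    print(start, result[start])
```

Each window produces one small aggregate per player instead of one write per beacon hit, which is the property that lets scoring lag behind proximity validation without touching the hot path.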

The most controversial decision: we stopped using Redis for active hunt tracking. Instead, we used SQLite with WAL mode in each pod. Each hunt session writes to a local SQLite file. At the end of the hunt, we merge the SQLite file into the global inventory via a background job. This eliminated the Redis hotspot completely. Our write latency dropped from 400ms to 12ms per hunt step.
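
A rough sketch of the per-pod SQLite idea, using Python's standard sqlite3 module. The table layout, the function names, and the uniqueness constraint that makes repeated claims idempotent are all assumptions made for this example, and the merge step is reduced to computing a per-player summary that a background job could apply to the global inventory.

```python
import os
import sqlite3
import tempfile
from typing import Dict


def open_hunt_db(path: str) -> sqlite3.Connection:
    """Open (or create) a local hunt database in WAL mode."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL").fetchone()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hunt_steps ("
        " id INTEGER PRIMARY KEY,"
        " player_id TEXT NOT NULL,"
        " treasure_id TEXT NOT NULL,"
        " UNIQUE (player_id, treasure_id))"
    )
    return conn


def record_claim(conn: sqlite3.Connection, player_id: str, treasure_id: str) -> bool:
    """Record a claim; return False if this player already claimed it."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO hunt_steps (player_id, treasure_id)"
            " VALUES (?, ?)",
            (player_id, treasure_id),
        )
    return cur.rowcount == 1


def summarize_for_merge(conn: sqlite3.Connection) -> Dict[str, int]:
    """Per-player treasure counts, ready to hand to a merge job."""
    rows = conn.execute(
        "SELECT player_id, COUNT(*) FROM hunt_steps GROUP BY player_id"
    )
    return dict(rows.fetchall())


with tempfile.TemporaryDirectory() as tmpdir:
    conn = open_hunt_db(os.path.join(tmpdir, "hunt-42.db"))
    first = record_claim(conn, "ana", "chest-1")
    duplicate = record_claim(conn, "ana", "chest-1")  # same claim again
    record_claim(conn, "ana", "chest-2")
    record_claim(conn, "bo", "chest-1")
    summary = summarize_for_merge(conn)
    conn.close()

print(first, duplicate, summary)
```

The trade-off is durability scope: the file lives and dies with the pod, so anything not yet merged is lost if the pod's disk goes away before the background job runs.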

What The Numbers Said After

After the redesign, we reached 10,000 concurrent players during a major event. We measured the following:

Session Management: 0 Redis cache misses, 0 Redis dependency. 99.99% uptime.
Inventory Service: Redis CPU usage at 25%, memory at 12GB. Write latency 15ms p95, read latency 3ms p95.
Leaderboard Service: Redis CPU usage at 35%, memory at 8GB. Reads served from in-memory leaderboard 99.8% of the time. Batch flush to Redis every 30 seconds at 1500 writes/second.
End-to-end hunt validation: 22ms p95 from beacon hit to treasure claim.
Cost: Redis cluster went from $4800/month to $1500/month. We reduced our Kafka brokers from 9 to 3. Our application pods scaled from 8 to 12 during peak, but each pod used 50% less CPU.

We also stopped seeing the dreaded CacheMissException. Our error rate went from 13% to 0.09%.

The most surprising outcome: players reported the game felt faster even though we added latency to beacon scoring. The reason? No more Redis timeouts during leaderboard updates. The illusion of speed came from consistent responsiveness, not raw latency.

What I Would Do Differently

I would never use Redis for active hunt sessions again. The idea of a single data store for everything is seductive, but it's a trap. Session state is ephemeral, and Redis is expensive for ephemeral data. Local SQLite or even in-memory data stores like Dragonfly are better choices.

The second mistake was over-engineering the beacon scoring pipeline. We started with Kafka for event streaming, but the complexity