The Day Treasure Hunt Broke My Caches—And How We Fixed It

#systems #webdev #programming #architecture

The Problem We Were Actually Solving

The treasure hunt engine used a single Redis sorted set key per map instance: hytale:treasure:global:top. With 200 concurrent maps and 40 k concurrent players, each map push operation (ZADD hytale:treasure:global:top) triggered an implicit DEL when the key grew past Redis's active-expire threshold. The eviction logs showed 1.2 M key deletions per minute, which translated to 15 k QPS on the DEL path alone. That churn saturated the Redis cluster's CPU on the replica threads, lifted p99 latency from 18 ms to 412 ms, and caused upstream game servers to backpressure their event queues. Load-shedding kicked in at 18 k QPS on the DEL path, which meant 20 % of treasure completions were dropped. The logs didn't say why; they just printed "Too many active connections."

What We Tried First (And Why It Failed)

Scaling the Redis cluster out from 3 to 6 shards was the first move. We used redis-shard 2.4.1. The shard count doubled, but the hot key was still on shard 0. We hit the per-shard connection limit inside 45 minutes and the cluster entered a loop of fencing and resharding. Metrics: p99 latency 311 ms, evictions 1.4 M per minute, and connection count north of 28 k per shard. The game team added local LRU caches in the lobby microservice, but the cache invalidation used a pub-sub channel that Redis itself couldn't deliver under load, so stale chests propagated for up to 90 seconds. The player reports read: "My treasure vanished." "My rank changed while I blinked."
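Why doubling the shard count could not move the hot key: in standard Redis Cluster, a key maps to exactly one of 16384 hash slots (CRC16 of the key, modulo 16384), and a slot lives on exactly one shard. The sketch below is a minimal illustration of that rule, assuming our setup followed the standard slotting scheme. The shard_for helper and its even split of the slot range are hypothetical, not how redis-shard assigns slots.

```python
import binascii

SLOT_COUNT = 16384  # fixed number of hash slots in Redis Cluster


def key_slot(key: str) -> int:
    """Return the Redis Cluster hash slot: CRC16 (XMODEM variant) mod 16384."""
    raw = key.encode("utf-8")
    # Hash-tag rule: if the key has a non-empty {...} section,
    # only the text between the first '{' and the next '}' is hashed.
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end != -1 and end > start + 1:
            raw = raw[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the XMODEM CRC16.
    return binascii.crc_hqx(raw, 0) % SLOT_COUNT


def shard_for(key: str, shard_count: int) -> int:
    """Hypothetical layout: slots split evenly across shards (an assumption)."""
    return key_slot(key) * shard_count // SLOT_COUNT


hot_key = "hytale:treasure:global:top"
for shards in (3, 6):
    # The slot is the same on both lines; only the shard numbering changes.
    print(f"{shards} shards -> slot {key_slot(hot_key)}, shard {shard_for(hot_key, shards)}")
```

However many shards exist, every ZADD against that one key lands on whichever single shard owns its slot, so adding shards only adds idle capacity elsewhere.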

Next we tried a Lua script to snapshot and trim the sorted set to a capped size of 10 k entries. The script ran for 7 ms on the master but still generated a 12 KB RESP response that had to be streamed to every replica. The replication backlog grew to 2 GB and caused a fork bomb in the persistence layer. Redis 7.0.12 rolled back the Lua script, leaving us with partial leaderboards and corrupted AOF files. The AOF rewrite process consumed 40 GB of disk space overnight and wedged the entire cluster until the next maintenance window.
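For readers unfamiliar with the trim step, here is a rough in-memory model of what capping a sorted set means. It is not our Lua script. It mirrors the usual Redis idiom of ZREMRANGEBYRANK key 0 -(cap+1), which drops the lowest-ranked members and keeps the top cap entries. The function name and the sample data are made up for illustration.

```python
def trim_to_top(scores: dict[str, float], cap: int) -> dict[str, float]:
    """Keep only the `cap` highest-scoring members of a leaderboard."""
    # Redis sorted sets rank by ascending score, ties broken by member name.
    ranked = sorted(scores.items(), key=lambda item: (item[1], item[0]))
    # Everything below rank (len - cap) is overflow and gets removed.
    overflow = max(len(ranked) - cap, 0)
    return dict(ranked[overflow:])


board = {"alice": 120.0, "bob": 95.0, "carol": 130.0, "dave": 40.0}
trimmed = trim_to_top(board, cap=2)
print(sorted(trimmed))
```

The trim itself is cheap. What hurt us was everything around it: the response size and the replication traffic described above.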

The Architecture Decision

We finally ripped the global sorted set apart at the service boundary. Instead of one key per map, we created per-player leaderboards: hytale:treasure:v2::top. The treasure engine now emitted events (TreasureFound, TreasureClaimed, TreasureExpired) to an Apache Kafka cluster (version 3.6.1, 6 brokers, 200 partitions). The leaderboard service consumed the stream, applied a 24-hour retention and a tombstone for stale treasures, and wrote the leaderboard into ClickHouse (version 23.8.6, 3 replicas, 32 vCPU each) using an ORDER BY tuple. ClickHouse handled the sort, and because we partitioned by map_id and toDateTime(event_time), the cluster ingested 80 k QPS at 1.2 GB/minute with a 500 ms flush window.
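The consume-and-fold step is easier to see in code. This is a minimal in-memory sketch of the idea, not the production consumer: it leaves out Kafka and ClickHouse entirely. The event fields, the points value, and the rule that only TreasureClaimed changes a score are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field

RETENTION_SECONDS = 24 * 60 * 60  # the 24-hour retention mentioned above


@dataclass(frozen=True)
class Event:
    kind: str          # "TreasureFound", "TreasureClaimed", or "TreasureExpired"
    map_id: str
    treasure_id: str
    player_id: str
    points: int        # hypothetical field
    event_time: float  # seconds since epoch


@dataclass
class LeaderboardProjection:
    scores: dict = field(default_factory=lambda: defaultdict(lambda: defaultdict(int)))
    tombstones: set = field(default_factory=set)

    def apply(self, event: Event, now: float) -> None:
        if now - event.event_time > RETENTION_SECONDS:
            return  # older than the retention window: ignore
        key = (event.map_id, event.treasure_id)
        if event.kind == "TreasureExpired":
            self.tombstones.add(key)  # later claims on this treasure are dropped
        elif event.kind == "TreasureClaimed" and key not in self.tombstones:
            self.scores[event.map_id][event.player_id] += event.points
        # TreasureFound carries no score in this sketch.

    def top(self, map_id: str, n: int) -> list:
        board = self.scores.get(map_id, {})
        # Highest score first, ties broken by player id.
        return sorted(board.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


now = 1_700_000_000.0
proj = LeaderboardProjection()
for ev in [
    Event("TreasureFound", "map-1", "t1", "alice", 0, now - 10),
    Event("TreasureClaimed", "map-1", "t1", "alice", 50, now - 9),
    Event("TreasureClaimed", "map-1", "t2", "bob", 30, now - 8),
    Event("TreasureExpired", "map-1", "t3", "", 0, now - 7),
    Event("TreasureClaimed", "map-1", "t3", "bob", 100, now - 6),  # tombstoned
    Event("TreasureClaimed", "map-1", "t4", "alice", 999, now - RETENTION_SECONDS - 1),  # too old
]:
    proj.apply(ev, now)
print(proj.top("map-1", 10))
```

Because the projection is a pure function of the event stream, it can be rebuilt by replaying the topic, which is what made the move off the single Redis key safe to attempt.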

The tradeoffs were explicit: we moved from sub-millisecond Redis updates to eventually consistent ClickHouse writes with 2-3 s lag. Player-facing leaderboards were updated via a GraphQL API that cached the last 5 minutes in a local Caffeine store. The cache was invalidated by a Kafka compacted topic __consumer_offsets so stale data couldn't survive the window. We also lost the realtime push for global events, but we gained a stable cluster: p99 latency on the leaderboard service dropped to 72 ms, and the Redis cluster stabilized at 6 k QPS with 18 % memory overhead, well below its 30 % soft limit.
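The caching behavior boils down to two rules: an entry dies after 5 minutes, or earlier if an invalidation message arrives. Caffeine is a Java library, so the snippet below is only a Python model of that behavior under those two rules, not the Caffeine API. The class name, the injectable clock, and the idea that the consumer calls invalidate per map are assumptions.

```python
import time
from typing import Any, Callable, Hashable, Optional


class TtlCache:
    """Tiny expire-after-write cache with explicit invalidation."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the example can fake time
        self._entries: dict = {}

    def put(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (value, self._clock() + self._ttl)

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def invalidate(self, key: Hashable) -> None:
        # Called when an invalidation message for this key is consumed.
        self._entries.pop(key, None)


fake_now = [0.0]
cache = TtlCache(ttl_seconds=300, clock=lambda: fake_now[0])
cache.put("map-1", ["alice", "bob"])
print(cache.get("map-1"))   # hit: still inside the 5-minute window
fake_now[0] = 300.0
print(cache.get("map-1"))   # miss: the entry has expired
```

The TTL bounds staleness even if an invalidation message is lost; the explicit invalidate shortens it in the common case.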

What The Numbers Said After

Week 1 post-launch showed 99.94 % availability on the treasure service, up from 97.6 % the week before. The ClickHouse merge tree took 4.2 TB on disk and served 8 k QPS on the leaderboard endpoint. The Kafka lag peaked at 1.2 s during the daily reset window, but never blocked the game loop. Redis memory stayed flat at 14 GB per shard, and the eviction rate dropped to 12 k per minute, mostly session keys that expired naturally.

The cost spike was real: ClickHouse cluster cost $2.8 k/month, Kafka $1.6 k/month, and the extra Redis shard was decommissioned. Total infra delta: +$4.6 k/month. Revenue from the event was $230 k, so the ROI was clear. The game designers were unhappy we dropped the realtime flare, but players stopped complaining about vanished chests, which was the actual problem.

What I Would Do Differently

I wouldn't have tried to shoehorn the global sorted set into Redis in the first place. The data domain screams analytical workload; Redis is a caching layer, not a data warehouse. We should have modeled the leaderboard as a stream from day one and treated Kafka + ClickHouse as first-class citizens. The lesson is simple: pick your consistency boundary early. If you're doing anything that approximates a global leaderboard with high write volume, don't let it live in a single Redis key or you'll pay the churn tax in production, not in staging.