A Week in the Life of a Treasure Hunt Engine that Almost Went Off the Rails

#systems #webdev #programming #architecture

The Problem We Were Actually Solving

Our player base exploded from 400k to 1.2M during Black Friday week, while the Treasure Hunt event gave 100k concurrent players a 30-second window to solve 5 puzzles. Rewards were dynamic: gold coins, exclusive skins, or nothing. The business wanted sub-second latency on /hunt so the UI felt instant, but the AWS cost ceiling was $0.06 per player. We chose Redis with a bloom filter because the treasure engine doesn't need consistency for coins, only existence. What we didn't model was the write amplification from bloom regeneration and the session churn ratio of 2.1 per minute during the event. That churn meant TTL ≤ 30 s to keep memory bounded, but then the bloom false-positive rate spiked from 1 % to 12 % because the filter recycled every second. At 12 % false positives we were hitting Aurora with 80k point lookups per second, each query costing 30 ms on a db.t3.2xlarge. The SLA burned, PagerDuty pages fired, and finance sent a Slack alert titled "Budget vs Reality: +37 %".
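To make the Aurora numbers concrete, here is a back-of-envelope sketch using Little's law (average concurrency = arrival rate × time in system). The 80k lookups/s, 30 ms, 1 % and 12 % figures come from the paragraph above; the function names are hypothetical, and the linear-scaling helper assumes every Aurora lookup was caused by a false positive, which we never verified.

```python
# Back-of-envelope check of the Aurora load during the incident.
AURORA_LOOKUPS_PER_S = 80_000  # point lookups per second at the 12 % spike
QUERY_LATENCY_S = 0.030        # 30 ms per point lookup
FP_RATE_SPIKE = 0.12           # false-positive rate during the event
FP_RATE_TARGET = 0.01          # false-positive rate the filter was sized for


def in_flight_queries(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's law: average number of queries in flight at steady state."""
    return arrival_rate_per_s * latency_s


def scaled_lookups(lookups_per_s: float, fp_now: float, fp_target: float) -> float:
    """Lookup rate at a different false-positive rate.

    Assumption (not measured): every lookup is triggered by a false positive,
    so the load scales linearly with the false-positive rate.
    """
    return lookups_per_s * fp_target / fp_now


if __name__ == "__main__":
    concurrency = in_flight_queries(AURORA_LOOKUPS_PER_S, QUERY_LATENCY_S)
    at_target = scaled_lookups(AURORA_LOOKUPS_PER_S, FP_RATE_SPIKE, FP_RATE_TARGET)
    print(f"queries in flight at the spike: {concurrency:.0f}")
    print(f"lookups/s if the filter had held 1 %: {at_target:.0f}")
```

Under those assumptions the database has to hold roughly 2,400 queries in flight at once, and a filter that had held its 1 % target would have sent it about a twelfth of that traffic.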

What We Tried First (And Why It Failed)

We started with Redis 7.2 in clustered mode, three shards, replication factor 2, and a global session prefix. The bloom filter came from the RedisBloom v2.4.4 module, configured with two parameters: capacity 10M and error rate 0.01. We set the memory limit to 12 GiB per shard and assumed a TTL of 5 minutes because 5 minutes felt long enough for a hunt session. The first load test with 20k concurrent players showed 150 ms p99, but we didn't simulate session churn. On the third day of internal QA the bloom hit 92 % memory usage in 45 minutes and OOM-killed the cluster. We switched to LFU eviction, then to a tiered setup with a 1 GiB hotset and a 10 GiB overflow. Hotset eviction still nuked the bloom filter and triggered regeneration on every request, and the cycle repeated. On the live day we saw the false positives spike and Aurora melt. We rolled back to a flat Redis hash (string type, value size 2.0 KiB) and set the TTL to 30 s. The p99 latency settled back to 220 ms, CPU on the instances dropped to 18 %, and cost per player stayed inside budget. Lesson learned: bloom filters and high-churn sessions are a toxic mix when your SLA is measured in milliseconds.
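For a sense of scale, the textbook formulas for a classic bloom filter can be applied to the two parameters above (capacity 10M, error rate 0.01). This is only a sketch: RedisBloom's real memory footprint and hashing scheme differ from the textbook model, the function names are hypothetical, and the overfill case illustrates one generic way a false-positive rate climbs, not a confirmed diagnosis of our incident.

```python
import math


def bloom_bits(capacity: int, error_rate: float) -> int:
    """Optimal bit count for a classic bloom filter: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))


def bloom_hashes(capacity: int, bits: int) -> int:
    """Optimal number of hash functions: k = (m / n) * ln 2, rounded."""
    return max(1, round(bits / capacity * math.log(2)))


def false_positive_rate(inserted: int, bits: int, hashes: int) -> float:
    """Expected false-positive rate after `inserted` distinct items."""
    return (1 - math.exp(-hashes * inserted / bits)) ** hashes


if __name__ == "__main__":
    n, p = 10_000_000, 0.01
    m = bloom_bits(n, p)
    k = bloom_hashes(n, m)
    print(f"bits: {m}, MiB: {m / 8 / 2**20:.1f}, hash functions: {k}")
    print(f"rate at capacity: {false_positive_rate(n, m, k):.4f}")
    print(f"rate at 2x capacity: {false_positive_rate(2 * n, m, k):.4f}")
```

In the textbook model a single filter with these parameters needs only about 11 to 12 MiB, so a filter that fills a 12 GiB shard points at how many filters (or regenerations) were alive at once rather than at the size of any one of them.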

The Architecture Decision

We rewrote the session store to a single Redis hash per player, keyed as session:{uid}:{event_id}. No bloom filter. Instead we relied on the hash primitive's O(1) lookup and a 30-second sliding TTL. We chose Redis stack v7.2 (ElastiCache) with 3 shards, replication factor 2, and 16 MiB cluster bus bandwidth. The memory ceiling was set to 48 GiB total across shards, well below the previous 60 GiB limit. We disabled persistence for the session store because losing a hunt session on restart was acceptable; the reward is idempotent and cached in DynamoDB with a 5-minute TTL anyway. For the treasure data itself we kept Aurora PostgreSQL but moved the treasure list to an append-only DynamoDB table so the hot path only reads the current treasure row via primary key. The treasure row is refreshed by an EventBridge pipe that fires every 30 seconds and updates a single Aurora row. The new write path is 1 write to the Redis hash (expiration extended) and 1 conditional write to Aurora only if the player hasn't collected this treasure before. We measured 1 ms average for the Redis write, 30 ms average for the Aurora conditional write, and p99 latency stayed at 220 ms even at 100k concurrent players.
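The write path described above can be sketched roughly as follows. This is not our production code: FakeRedis is an in-memory stand-in that mimics the hset and expire calls of a redis-py client, an in-memory SQLite table stands in for Aurora, and the collected table, the last_treasure field, and the function names are all hypothetical. On PostgreSQL the conditional write would more likely be INSERT ... ON CONFLICT DO NOTHING.

```python
import sqlite3

SESSION_TTL_S = 30  # sliding TTL from the post


class FakeRedis:
    """In-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self) -> None:
        self.hashes: dict[str, dict[str, str]] = {}
        self.ttls: dict[str, int] = {}

    def hset(self, name: str, mapping: dict[str, str]) -> int:
        self.hashes.setdefault(name, {}).update(mapping)
        return len(mapping)

    def expire(self, name: str, time: int) -> bool:
        self.ttls[name] = time  # a real server would start a countdown here
        return name in self.hashes


def make_db() -> sqlite3.Connection:
    """Hypothetical schema: one row per (player, treasure) ever collected."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE collected (uid TEXT, treasure_id TEXT, "
        "PRIMARY KEY (uid, treasure_id))"
    )
    return db


def session_key(uid: str, event_id: str) -> str:
    return f"session:{uid}:{event_id}"


def record_hunt(cache, db, uid: str, event_id: str, treasure_id: str) -> bool:
    """Touch the session hash, then insert the reward only if it is new.

    Returns True when this call actually granted the treasure.
    """
    key = session_key(uid, event_id)
    cache.hset(key, mapping={"last_treasure": treasure_id})
    cache.expire(key, SESSION_TTL_S)  # extend the sliding TTL on every write
    cur = db.execute(
        "INSERT OR IGNORE INTO collected (uid, treasure_id) VALUES (?, ?)",
        (uid, treasure_id),
    )
    db.commit()
    return cur.rowcount == 1  # 0 means the player already had this treasure
```

Calling record_hunt twice with the same player and treasure grants the reward once and refreshes the TTL both times, which is the idempotency the design leans on. With a real client the hset and expire calls could be sent in one pipeline to keep the Redis side to a single round trip.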

What The Numbers Said After

After the rollout the metrics looked like this:

Prometheus scrape every 15 s:

Redis sessions: 9.8M keys, 38 GiB used, evictions 0, hit rate 99.8 %.
Aurora treasure table: 800k point reads/s during peak, p95 18 ms, p99 32 ms.
/hunt endpoint: p99 220 ms, p95 140 ms, error rate 0.03 %.
Cost: $5.8k for Redis ElastiCache, $3.1k for Aurora, $1.2k for DynamoDB; total $10.1k for 1.2M players, or $0.0084 per player.
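The cost line is easy to sanity-check. The snippet below simply redoes the arithmetic from the figures above and compares the result with the $0.06 per-player ceiling; the variable and function names are illustrative only.

```python
# Per-service spend in USD, as listed in the metrics above.
COSTS_USD = {"redis_elasticache": 5_800, "aurora": 3_100, "dynamodb": 1_200}
PLAYERS = 1_200_000
CEILING_PER_PLAYER_USD = 0.06  # budget ceiling stated at the top of the post


def cost_per_player(costs: dict[str, int], players: int) -> float:
    """Total spend divided evenly across all players."""
    return sum(costs.values()) / players


if __name__ == "__main__":
    per_player = cost_per_player(COSTS_USD, PLAYERS)
    print(f"total: ${sum(COSTS_USD.values()):,}")
    print(f"per player: ${per_player:.4f}")
    print(f"headroom vs ceiling: {CEILING_PER_PLAYER_USD / per_player:.1f}x")
```

By this arithmetic the final bill sits at roughly a seventh of the ceiling.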

Cost alerts never fired. The on-call rotation recorded zero latency pages during the event. The bloom filter is now only used offline for generating treasure drop probabilities; it never touches the critical path. Its false-positive rate is 0.4 % because we run it once per hour on a nightly job against a static seed list.

What I Would Do Differently

I would not have trusted the RedisBloom module for high-churn sessions without measuring regeneration cost under real churn. In hindsight we should have modeled session churn as a Poisson process with λ = 2.1. That would