The Problem We Were Actually Solving
In November 2024 we launched a treasure-hunt feature for the Veltrix platform that let users solve real-time puzzles for micro-rewards. By December we saw 500 concurrent players; by Valentine's Day we were at 10k. The feature was built on a simple model: every second we'd generate a new treasure map, shard the valid tiles across 24 Redis nodes, and fan out updates to 300k WebSocket connections using RabbitMQ. The first outage hit on February 13 at 3:47 PM, when the Redis cluster started throwing "ERR max number of clients reached" and we had to emergency-reboot half the shards. The incident lasted 22 minutes, cost us $18k in missed rewards, and revealed that we'd optimized for the wrong thing.
What We Tried First (And Why It Failed)
Our first cut used a single PostgreSQL 14 database with a triggers-and-tables pattern that looked elegant in the diagram: a hunts table, a tiles table with a GIN index on geohash, and a materialized view for live leaderboards. We chose this because we believed reads would dominate; our load tests showed 85% GET traffic. Within two weeks we were seeing P99 read latency of 450ms on the leaderboard view, and the write path began emitting "ERROR: deadlock detected" whenever two users tried to claim the same tile in the same millisecond. Transaction retries climbed to 27% and we watched replication lag hit 2.3 seconds on the follower. Adding read replicas helped the reads but made the deadlocks worse because the same row could be locked on different nodes. We knew we had to move off the monolith, but moving to Redis too early introduced a new class of failure we hadn't modeled.
The Architecture Decision
We decided to treat the treasure hunt as an independent service with strict boundaries: a Map Generator that emitted immutable treasure maps every second, a Tile Claim Service that used Redis Streams for exactly-once claims, and a Broadcast Layer that pushed deltas to WebSockets without ever reading the state back. The key tradeoff was shifting from strong consistency to eventual consistency for the public leaderboard. We accepted that a user's claimed tile might appear up to 200ms later in the global view, but we guaranteed that no two users could ever claim the same tile. We chose Redis Streams over Kafka because the volume was 2M events/sec during peak hunts and Kafka's 200ms producer latency wasn't acceptable for our real-time claim loop. We also moved from a single Redis cluster to a fleet of 48 Redis 7.0 shards using hash slots so we could scale writes without hotspots. The claim path became a Lua script that runs in 0.7ms on average and returns a CAS-style error when the tile is gone.
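To make the claim guarantee concrete, here is a minimal sketch of the "first writer wins, everyone else gets an error" semantics. It is not our production script. CLAIM_LUA shows one way a Redis-side script could express it, using SET with NX, and the key layout, TTL argument and TILE_GONE error name are all assumptions made for illustration. InMemoryTileClaims, TileGone and race are hypothetical stand-ins that model the same behaviour in-process so the example runs without a Redis server.

```python
import threading
from typing import Dict, List, Optional

# Hypothetical Lua body for the Redis side. KEYS[1] is the tile key,
# ARGV[1] the user id, ARGV[2] a TTL in seconds. SET ... NX only writes
# when the key is absent, so the first claimant wins atomically.
CLAIM_LUA = """
if redis.call('SET', KEYS[1], ARGV[1], 'NX', 'EX', ARGV[2]) then
  return 1
end
return redis.error_reply('TILE_GONE')
"""


class TileGone(Exception):
    """Raised when the tile already has an owner (the CAS-style error)."""


class InMemoryTileClaims:
    """In-process model of one shard, used only to show the semantics."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}
        # The lock plays the role of Redis running a script atomically.
        self._lock = threading.Lock()

    def claim(self, tile_id: str, user_id: str) -> None:
        with self._lock:
            if tile_id in self._owners:
                raise TileGone(tile_id)
            self._owners[tile_id] = user_id

    def owner(self, tile_id: str) -> Optional[str]:
        with self._lock:
            return self._owners.get(tile_id)


def race(claims: InMemoryTileClaims, tile_id: str, user_ids: List[str]) -> List[str]:
    """Let every user try to claim the same tile at once; return the winners."""
    winners: List[str] = []

    def attempt(user_id: str) -> None:
        try:
            claims.claim(tile_id, user_id)
        except TileGone:
            return
        winners.append(user_id)  # list.append is atomic in CPython

    threads = [threading.Thread(target=attempt, args=(u,)) for u in user_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners
```

However many users race for a tile, race should return a single winner, which is the property the claim path has to preserve no matter how the shards are laid out.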
What The Numbers Said After
After the migration we saw the following:
- P99 claim latency dropped from 450ms to 8ms.
- Redis memory usage per shard stabilized at 3.2GB with a 6-hour TTL on all keys; we set maxmemory-policy to allkeys-lru to prevent swapping.
- The leaderboard lag stayed under 200ms 99.6% of the time; the outliers were caused by WebSocket fan-out bursts in which our Go channels blocked for 5ms under bursts of 10k connections.
- We spent $2.1k/month on Redis Cloud Pro versus the $800 we'd projected, mostly because we kept the shard count higher than necessary during the Valentine's Day surge.
- The outage window on February 14 shrank to 4 minutes when we had to scale the Broadcast Layer from 6 to 14 pods; the autoscaler triggered at 85% CPU on the new pods and we had to manually bump the limit.
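The fan-out outliers above came from sends that blocked. One common alternative, which is not what we ran, is to give every connection a small bounded buffer and never block the publisher: a full buffer marks that connection as lagging so it can be resynced or dropped later. The sketch below shows the idea with Python's standard queue module. Broadcaster, per_conn_buffer and the connection ids are hypothetical names.

```python
import queue
from typing import Dict, List


class Broadcaster:
    """Fan out deltas to per-connection bounded buffers without blocking."""

    def __init__(self, per_conn_buffer: int = 64) -> None:
        self._buffers: Dict[str, "queue.Queue[bytes]"] = {}
        self._per_conn_buffer = per_conn_buffer

    def subscribe(self, conn_id: str) -> "queue.Queue[bytes]":
        # Each connection gets its own bounded queue to drain at its own pace.
        buf: "queue.Queue[bytes]" = queue.Queue(maxsize=self._per_conn_buffer)
        self._buffers[conn_id] = buf
        return buf

    def publish(self, delta: bytes) -> List[str]:
        """Enqueue delta for every connection; return ids whose buffer was full."""
        lagging: List[str] = []
        for conn_id, buf in self._buffers.items():
            try:
                buf.put_nowait(delta)  # never waits on a slow consumer
            except queue.Full:
                lagging.append(conn_id)
        return lagging
```

The cost of this design is that a slow client can miss deltas, so it only makes sense if clients can recover from a snapshot. With immutable per-second maps that seems plausible, but it is an assumption rather than something we measured.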
What I Would Do Differently
I would never again let the real-time claim logic touch the public leaderboard directly. We still get complaints when a player's tile doesn't appear in the top-100 list for 150ms, but we can't spend engineering hours squeezing that to 50ms without breaking the budget. A better pattern would be to isolate the public view into a separate read-optimized projection that rebuilds every 500ms from the event log, keeping the claim path pure and cheap. We also over-provisioned the Redis fleet; we could have started with 24 shards and autoscaled based on stream lag rather than CPU. The Lua script we wrote for CAS claims is fragile (any syntax error kills the entire shard until we restart), so next time I'd push the logic into a sidecar that can be rolled without touching Redis. Finally, we should have modeled the WebSocket fan-out as a separate cost center from day one; the Broadcast Layer became the new bottleneck once the claim path quit melting.
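As a rough illustration of the projection idea, the sketch below folds a batch of claim events into a top-N list. It assumes each event carries a user id and a reward value. ClaimEvent, its fields and rebuild_leaderboard are hypothetical names, and the 500ms scheduling loop and the event-log reader are left out.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class ClaimEvent(NamedTuple):
    user_id: str
    tile_id: str
    reward: int  # hypothetical field: points granted for the claim


def rebuild_leaderboard(
    events: Iterable[ClaimEvent], top_n: int = 100
) -> List[Tuple[str, int]]:
    """Fold claim events into (user_id, total_reward) pairs, highest first."""
    totals: Counter = Counter()
    for event in events:
        totals[event.user_id] += event.reward
    # Sort by score descending, then by user id so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

Because the function only reads events and returns a new list, it can run on a timer in a separate process and swap its result in for readers, so the claim path never has to know the leaderboard exists.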