How a $2 million Treasure Hunt Engine Blew Up When 10k Users Hit Redis Streams

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Veltrix gives gamers 12 hours to collect virtual artifacts hidden inside real-world POIs. Each artifact publishes an event to Redis Streams so every player's client stays in sync. When we scaled to 5k users, latency stayed under 35 ms and Redis memory usage tracked at 6.2 GB. At 8k users, Redis memory usage spiked to 17.1 GB, and the consumer group started to stall because a single shard couldn't keep up with writes. We scaled to three shards and added a sidecar consumer per pod, but Redis then returned CROSSSLOT errors consistently after about 2.1 million events, exactly the point where our Lua script tried to read from three different hash slots in one atomic block.
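For context on why that error appears: Redis Cluster maps every key to one of 16384 hash slots by taking CRC16 of the key modulo 16384, and a multi-key command or script is rejected with CROSSSLOT when its keys land in different slots. If a key contains a non-empty {...} section, only that section is hashed. The sketch below reproduces that mapping in plain Java so key layouts can be checked offline. The class name SlotCalculator and the key names are illustrative assumptions, not our production code.

```java
import java.nio.charset.StandardCharsets;

/** Sketch of the Redis Cluster key-to-slot mapping (CRC16/XMODEM mod 16384). */
final class SlotCalculator {
    private static final int SLOT_COUNT = 16384;

    /** CRC16 with polynomial 0x1021 and initial value 0 (the XMODEM variant). */
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = (crc & 0x8000) != 0 ? (crc << 1) ^ 0x1021 : crc << 1;
                crc &= 0xFFFF; // keep the register at 16 bits
            }
        }
        return crc;
    }

    /** Returns the hash slot for a key, honoring the {hash tag} rule. */
    static int slotFor(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            // Only a non-empty tag between the first '{' and the next '}' counts.
            if (close > open + 1) {
                key = key.substring(open + 1, close);
            }
        }
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % SLOT_COUNT;
    }

    /** True when every key maps to the same slot, so one script may touch them all. */
    static boolean sameSlot(String... keys) {
        for (int i = 1; i < keys.length; i++) {
            if (slotFor(keys[i]) != slotFor(keys[0])) {
                return false;
            }
        }
        return true;
    }
}
```

With hypothetical key names such as {zone:Q1}:player:42 and {zone:Q1}:artifact:7, sameSlot returns true because both keys hash only the zone:Q1 tag. A check like this in a unit test would flag a script whose keys cannot be co-located before it reaches a cluster.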

What We Tried First (And Why It Failed)

We tried the usual tricks. First, we moved all player keys to one slot by storing them with HSET in a single hash keyed by player ID. No dice, because the Lua script still read artifact metadata from a different key family. Then we switched to the Jedis cluster client and opted for MGET reads, hoping batching would reduce cross-slot traffic. The client threw JedisClusterMaxRedirectsException: Too many redirects after 327 redirects in 43 seconds; we had not accounted for the 5 ms network hop between pods in the same AZ. Finally, we wired a fan-out Redis pub/sub channel so every pod listened to a global channel, but the message rate climbed to 24k msg/s and the kernel socket buffer overflowed, producing EAGAIN errors that left artifacts undiscovered for up to 3.8 s.

The Architecture Decision

We ripped out Redis Streams entirely. Instead we built a partitioned write path: a Kafka topic prizes-by-zone partitioned by the GPS quadrant (Q1…Q4) where the artifact was located. Each quadrant writes to a dedicated PostgreSQL table using a logical partition key zone_id. On the read side, we run a set of Spring Boot microservices that fan out reads to the correct PostgreSQL partition using a local shard router we wrote in 300 lines of Java. The router keeps an in-process LRU cache of 10k entries, so 78 % of the artifact lookups never touch the database. We moved the real-time event sync to a WebSocket fan-out: each microservice maintains a local pub/sub ring buffer of the last 100 events in its quadrant and pushes deltas over the socket. Total memory footprint dropped from 17 GB to 850 MB per pod, and p99 latency fell from 112 ms to 28 ms at 15k users.
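To make the read path concrete, here is a minimal sketch of a quadrant router with an in-process LRU cache. It is not the 300-line router itself. The class name ZoneShardRouter, the artifacts_q1 to artifacts_q4 table names, and the loader callback that stands in for the PostgreSQL query are all assumptions made for illustration. The LRU behavior comes from the standard LinkedHashMap access-order mode.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.BiFunction;

/** Sketch: route artifact lookups to a per-quadrant table, with an LRU cache in front. */
final class ZoneShardRouter {

    /** Access-ordered map that drops the least recently used entry once over capacity. */
    private static final class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true); // true = iterate in access order, oldest first
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;
        }
    }

    private final Map<String, String> cache;
    // Hypothetical stand-in for the database call: (tableName, artifactId) -> row payload.
    private final BiFunction<String, String, String> loader;

    ZoneShardRouter(int cacheCapacity, BiFunction<String, String, String> loader) {
        this.cache = new LruCache<>(cacheCapacity);
        this.loader = loader;
    }

    /** Maps a zone_id (Q1..Q4) to its table name; the naming scheme is assumed. */
    static String tableFor(String zoneId) {
        switch (zoneId) {
            case "Q1":
            case "Q2":
            case "Q3":
            case "Q4":
                return "artifacts_" + zoneId.toLowerCase(Locale.ROOT);
            default:
                throw new IllegalArgumentException("Unknown zone: " + zoneId);
        }
    }

    /** Returns the cached payload when present, otherwise loads it from the zone's table. */
    synchronized String lookup(String zoneId, String artifactId) {
        String cacheKey = zoneId + ":" + artifactId;
        String cached = cache.get(cacheKey); // get() also marks the entry as recently used
        if (cached != null) {
            return cached;
        }
        String loaded = loader.apply(tableFor(zoneId), artifactId);
        if (loaded != null) {
            cache.put(cacheKey, loaded);
        }
        return loaded;
    }
}
```

Constructing it as new ZoneShardRouter(10_000, loader) would match the 10k-entry cache described above. A single synchronized method keeps the sketch short; a production router would likely want a concurrent cache to avoid serializing every lookup.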

What The Numbers Said After

Post-launch metrics were brutal but honest. Before the rewrite, Redis memory climbed 3× in 45 minutes with 15k users, and the node memory eviction rate hit 0.28 evictions per second, causing consumer lag to spike 21×. Kafka lag on the new topic stayed flat at 89 ms. PostgreSQL replication lag on the partitioned tables was under 12 ms. The fan-out WebSocket ring buffer reduced outbound messages per pod from 4k msg/s to 1.2k msg/s, cutting CPU usage on the load balancer from 78 % to 22 %. Cost per 10k users dropped from $4.80 to $1.30 because we downsized the Redis nodes from three r5.2xlarge to one modest cache.t3.medium and shrank the Kafka cluster from three brokers to two.

What I Would Do Differently

I would not have trusted Redis Streams for a partitioned workload that needs atomic multi-key reads. The Lua scripting promise is seductive, but the cross-slot errors are a minefield once you exceed 1 million events. Next time I'd start with a partitioned Kafka topic and a per-partition CDC consumer that writes directly to a sharded PostgreSQL table. I'd also wire a lightweight in-process ring buffer for WebSocket fan-out from day one; the 300-line router was easier to unit-test than the Redis Lua scripts, and it never panicked under load. Finally, I would have budgeted for a chaos-engineering day that simulates 20k concurrent users before the first public beta. Our staging environment never reproduced the CROSSSLOT storm because it had only 3k users.
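For anyone who wants a starting point for that ring buffer, here is a rough sketch of the idea: keep the last N events per quadrant in a fixed array, give each event a sequence number, and send a client only the events it has not seen yet. The class name EventRingBuffer, the String event payload, and the convention that a new client passes -1 are assumptions for illustration. The WebSocket push itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: fixed-size buffer holding the most recent events for one quadrant. */
final class EventRingBuffer {
    private final String[] slots;
    private long nextSeq = 0; // sequence number the next appended event will get

    EventRingBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.slots = new String[capacity];
    }

    /** Stores the event, overwriting the oldest one when full, and returns its sequence number. */
    synchronized long append(String event) {
        slots[(int) (nextSeq % slots.length)] = event;
        return nextSeq++;
    }

    /**
     * Returns buffered events with a sequence number greater than lastSeenSeq, oldest first.
     * Events that were already overwritten are silently skipped.
     */
    synchronized List<String> deltasSince(long lastSeenSeq) {
        long oldestBuffered = Math.max(0, nextSeq - slots.length);
        long from = Math.max(oldestBuffered, lastSeenSeq + 1);
        List<String> deltas = new ArrayList<>();
        for (long seq = from; seq < nextSeq; seq++) {
            deltas.add(slots[(int) (seq % slots.length)]);
        }
        return deltas;
    }
}
```

A buffer created with new EventRingBuffer(100) would correspond to the 100-event window mentioned earlier. A client that reconnects after falling more than 100 events behind gets only what is still buffered, so it would need a full resync from the database in that case.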