Treasure Hunts Ran Fine at 10k Players—Then We Hit 150k and the Redis Cluster Blew Up

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

By week four the daily active players hit 150 000 and we started seeing Redis cluster resets every time a big streamer announced a drop. The error was simple:

MOVED 12345 10.2.3.4:6381

Clients would retry, the Lua script would re-run, and loot could be duplicated or missed entirely. Worse, 10 % of chests would stay in the found state even after server restart because the slot migration had torn the atomicity we thought Lua gave us. The business metric that mattered was chest-opening fairness: we promised no two players could open the same chest. That promise was now broken and we had four hours to make it right before the next scheduled global event.

What We Tried First (And Why It Failed)

We bolted on Redis Streams as an event bus. Every chest open published a message; a worker service updated a Postgres table with FOR UPDATE SKIP LOCKED to enforce uniqueness. Median latency climbed to 130 ms. The worker kept crashing under 200 000 events/sec because the lock acquisition itself became the bottleneck. The error that killed us was:

ERROR: canceling statement due to statement timeout

At 500 ms we had already breached the SLA. We tried increasing the timeout, but then we introduced another class of bugs: players got stuck for three seconds while the lock waited, and streamers viewers assumed the game had frozen. We rolled back.

Next we split the domain: Redis handled only the spatial part (player position, chest ID range) and Postgres handled only the loot draw. We moved the uniqueness guarantee to INSERT … ON CONFLICT DO NOTHING inside a single statement. Still, the P99 latency stayed north of 200 ms and we were burning 400 RPS on Postgres just to draw loot tables.

The Architecture Decision

We ripped out Redis as state store altogether. Instead we built a sharded PostgreSQL 14 cluster with logical replication to read replicas for search. The new engine is a single SQL transaction:

BEGIN;
SELECT id FROM chests
WHERE zone = $1 AND found = false
ORDER BY random() LIMIT 1
FOR UPDATE SKIP LOCKED;
UPDATE chests SET found_by = $2, found_at = now()
WHERE id = $3 AND found = false;
COMMIT;

We chose PostgreSQL because it is the only system that can give us row-level atomicity, random selection, and strong consistency inside one round-trip from the client. We shard chests by zone on a 32-bit hash of the zone id; the shard key lives in Postgres, not Redis.

To keep the game loop feeling real-time we pushed location updates to a WebSocket fan-out service written in Go that talks directly to the shard leader. The treasure-hunt endpoints themselves run in Node.js behind an nginx L7 load balancer; each request hits Postgres, not Redis.

Cost went up: we doubled our cloud spend, but the guarantee held. At 400 000 concurrent players the median latency is 18 ms, P99 42 ms, and the zero-dupe promise measured by synthetic chests is 100 % over every 15-minute window.

What The Numbers Said After

We installed Prometheus scrapers on the PostgreSQL metrics exporter. The key indicator is:

pg_stat_database_conflicts 0

No serialisation failures, no deadlock retries. The transaction depth never exceeded three.

We also watched the logical replication lag. After the first replay storms our pg_stat_replication.pg_current_wal_lsn lag peaked at 90 MB, but we throttled the WAL sender to 50 MB/s and the replicas stayed within 200 ms of leader. That was acceptable because treasure-hunt events are not read-your-own-writes; the WebSocket fan-out picks up changes within 200 ms anyway.

Redis still runs for leaderboards and rate limiting, but the treasure engine itself is PostgreSQL only. We saved five engineer-weeks of Lua debugging and we no longer fear the announcement tweet.

What I Would Do Differently

I would never let Redis own business consistency again. The moment you need uniqueness or ordering guarantees, move it to the durable store. I would also stop using random() in SQL; it is not deterministic and violates testability. Instead we now seed a deterministic PRNG in the application layer with the chest zone and player id as seed, then use ORDER BY id OFFSET floor(random_0_1 * count). The latency stayed the same and the tests became reproducible.

We also over-provisioned Postgres too early. The shard count was 8, but the hot zones (starting zones and popular streams) had skew. We now use Citus 11 auto-shard with dynamic worker addition and we watch pg_stat_bgwriter maxwritten_clean. When the count exceeds 1000 writes/sec we spin a new worker. That elasticity cost us two days to tune, but it saved us from provisioning 30 % more capacity up front.