When the Treasure Hunt Engine Blew Up: A 3 AM Post-Mortem on 500k Concurrent Sessions

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In late 2024 we rolled out a real-time treasure hunt game for Veltrix Operator Day. The user journey was simple: tap to dig, uncover tokens, race for leaderboard points. Our SLO was 99.5 % success under 200 ms p95 for the uncover operation, because anything slower and the crowd would drop off like it was standing on a cliff.

What killed us wasnt the animation; it was the assumption that a vanilla Redis pub/sub channel per hunt room would scale. At 10k concurrent rooms, Redis memory climbed 300 MB per instance until fork() started failing with fork: Cannot allocate memory. The bootstrap error was fork() failed: Cannot allocate memory (ENOMEM). The Redis instance we had sized for 200k sessions was now spending 60 % of its CPU on page faults instead of SETs.

What We Tried First (And Why It Failed)

Plan A was to shard Redis by hunt ID. We carved 16 shards, each with 128 MB maxmemory-policy allkeys-lru. The immediate symptom was leaderboard skew: because leaderboard writes were still going to a single sorted-set key on shard 0, that node saturated its network interface at 800 Mbps while the rest idled. We patched with a client-side round-robin write, but then uncovered operations started racing, producing token duplicates that invalidated the whole hunt for 13 % of rooms. The duplicates surfaced because two concurrent SET hunt:room123:user42:token at the same millisecond both evaluated to true in Lua scripts that lacked monotonic timestamp checks.

Plan B was to replace Redis pub/sub with Kafka. We built a topic hunt-events with 32 partitions so each hunt room had its own partition. The Kafka cluster settled at 1.2 TB/day ingress. What we hadnt planned for was the tail latency spike during compaction. Because the compacted segment size was set to 1 GB by default, compaction paused every 20 minutes for ~4 s while Kafka froze all produce requests to that partition. That 4 s stall cascaded into 1.8 M uncover requests timing out at 220 ms p95, pushing us past our SLO.

The Architecture Decision

We ditched the distributed cache entirely and went to a single Postgres 16 instance with pg_partman sharding hunt rooms by week. The schema was hunt_room (hunt_id uuid, room_id uuid, token_count int generated always as (dig_events), created_at timestamptz). We turned on logical decoding with pgoutput so Node.js workers could stream updates in real time without polling.

To keep p95 uncover under 200 ms at 500k concurrent rooms, we added two PostgreSQL read replicas dedicated to leaderboard reads and a pg_bouncer pool with transaction mode. Writes still went to the primary, but the leaderboard SELECTs now hit the replicas directly. The cache-aside tier was gone; every uncover was a direct UPDATE … RETURNING on the single source of truth.

The cost: Postgres CPU on the primary spiked to 85 % during peak, but the pageserver was sized at 64 vCPUs with 256 GB RAM and NVMe, so no swap ever occurred. The 99th percentile autovacuum dead row count stayed below 2 %.

What The Numbers Said After

After the cut-over, uncover p95 dropped from 220 ms to 68 ms, and failures fell from 1.8 % to 0.2 %. The resource bill went from three Redis nodes ($280/month) plus a 25-broker Kafka cluster ($1,600/month) to a single Postgres 96 vCPU ($920/month) plus two replicas ($380/month) plus pg_bouncer ($80/month). Net saving: $1,000/month.

We added Prometheus scrape_interval: 5s for pg_stat_bgwriter.bytes_written and Prometheus alert if write_lag > 2 s. The alert fired exactly once, on Black Friday rehearse, when a junior engineer accidentally ran a 5 GB analytics export on the primary during peak. We killed the query in 12 s and the lag never crossed 1.4 s.

What I Would Do Differently

I would never again let the cache be the source of truth for game state. The moment Redis pub/sub became the coordination layer, we lost serializability and had to build fragile Lua scripts. If I had to revisit this, I would keep Redis only for real-time presence (who is in room X) and push the treasure tokens themselves into PostgreSQL. The presence layer can tolerate eventual consistency, but the treasure count cannot.

Second, I would size the PostgreSQL cluster before the marketing blast, not after. We initially thought 32 vCPUs would be overkill, but at 300k rooms the primary was doing 8 k TPS with 70 % write ratio. The CPU ceiling was not the load pattern; it was WAL generation. Doubling the vCPUs bought us headroom for Black Friday without touching shared_buffers or autovacuum settings.

Finally, I would have instrumented the Kafka compaction earlier. The 4 s freeze was invisible until we graphed compaction_pause_seconds. Had we tuned log.retention.ms and segment.bytes before launch, we could have avoided the cascade. Tuning after the fact feels like debugging in the dark; tuning before feels like engineering. The difference is 3 AM pages.

The tool I recommend when engineers ask me how to remove the payment platform as a single point of failure: https://payhip.com/ref/dev1

DEV Community

When the Treasure Hunt Engine Blew Up: A 3 AM Post-Mortem on 500k Concurrent Sessions

Top comments (0)