Lillian Dube

The Day We Let 15,000 Concurrent Hunters Crash Our Treasure Hunt Engine (And How We Fixed It)

The Problem We Were Actually Solving

In April 2024 we rolled out a real-time treasure hunt feature on Veltrix called Pulse Quests. The idea was simple: over a 60-minute window spread across 300 global servers, 50,000 users would simultaneously race to locate a hidden virtual artifact inside our web app. We estimated 15,000 concurrent hunters at peak. Our backend stack was Node.js + MongoDB + Redis. Nothing fancy.

By minute 8 we were seeing 100 % CPU on the Node tier and 95th-percentile Redis latency of 1.2 seconds. Users were refreshing, the event scoreboard API flipped between stale and blank, and we got our first S-2 alert in Slack. Worse, the traffic spike wasn't a gradual ramp. It was a binary on/off switch that flipped when the countdown timer hit zero.

What We Tried First (And Why It Failed)

First attempt: vertical scaling. We pushed the Node pods from 0.5 vCPU each to 2 vCPU and doubled the size of the Mongo primary instance. The latency dropped from 1.2 s to 700 ms, but we ate 1,800 extra RDS credits for the month and still crashed when the next hunt started with 18 k hunters. CPU steal percentage on the Mongo primary crept past 42 %, and replication lag between the primary and its two secondaries stretched to 4.3 seconds. A single disk write on the primary now blocked every write operation for 800 ms, which was exactly the window where 3 k hunters would collide on the same artifact.

Next we tried MongoDB change streams with a Node fan-out. We naively piped every artifact click through a single change stream to 15,000 WebSocket connections held open by Socket.IO. Within 12 minutes the Node tier ran out of file descriptors (lsof reported 65,536 open sockets) and the process crashed on ENFILE errors. Our error budget was now -52 minutes of cumulative latency SLO burn.
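A guard like the one below is one way to notice descriptor exhaustion before the process falls over. It is a minimal sketch in Python rather than our Node code, and it assumes the per-process limit (RLIMIT_NOFILE) is the one being hit; the `reserve` margin and the function names are hypothetical.

```python
import resource


def soft_fd_limit() -> int:
    """Return this process's soft limit on open file descriptors (Unix only)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def fd_headroom(limit: int, open_sockets: int, reserve: int = 1024) -> int:
    """Descriptors still available for new client sockets.

    `reserve` is a hypothetical safety margin held back for log files,
    database connections, and other non-client descriptors.
    """
    return limit - reserve - open_sockets


def can_accept(limit: int, open_sockets: int, reserve: int = 1024) -> bool:
    """True if one more client socket fits without eating into the reserve."""
    return fd_headroom(limit, open_sockets, reserve) > 0
```

A server could call `can_accept(soft_fd_limit(), open_sockets)` before upgrading a connection and shed load with an HTTP 503 when it returns False, instead of letting accept calls fail.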

The Architecture Decision

We ripped out the global singleton bottleneck and built a sharded hot path:

  1. We carved the hunt map into 96 static geohash buckets. On event start the Node tier published a single message on a Redis pub/sub channel called quest:start. Each warehouse worker (Go 1.21) subscribed to exactly one bucket via Redis Streams (bucket 42, for example, heard only geohashes starting with dp), so a bucket never handled more than 400 concurrent hunters at peak load.

  2. We moved all mutable state (claimed artifacts, score deltas) into a dedicated write path: Go worker → Redis Streams → Lua script that increments a Lua-side counter and emits a diff to subscribers. The Lua script ran atomically inside Redis 7.2, so we avoided the Mongo write bottleneck entirely.

  3. We fronted the UI with a lightweight edge cache (Cloudflare Workers KV) that only stored the latest bucket diff. Workers KV gave us 5 ms get latency for 97 % of requests, and the delta payload was ≤2 kB.

  4. We kept MongoDB as the eventual consistency sink for analytics and replay, but only after the hunt window closed. The primary write volume went from 3 k writes/second to 12 writes/second.
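The routing and claim steps above can be modelled in a few lines. This is a Python sketch under stated assumptions, not our Go worker or the actual Lua script: the CRC-based prefix-to-bucket mapping, the `quest:bucket:N` key format, and the `ClaimStore` class are all hypothetical, and a lock stands in for the atomicity that Redis gives a script.

```python
import threading
import zlib

NUM_BUCKETS = 96  # bucket count from the post


def bucket_for(geohash: str, prefix_len: int = 2) -> int:
    """Map a geohash to a bucket via a stable hash of its prefix (assumed scheme)."""
    prefix = geohash[:prefix_len].lower()
    return zlib.crc32(prefix.encode("ascii")) % NUM_BUCKETS


def stream_key(geohash: str) -> str:
    """Hypothetical per-bucket stream name a worker would read from."""
    return f"quest:bucket:{bucket_for(geohash)}"


class ClaimStore:
    """In-memory stand-in for the atomic claim step.

    The lock plays the role of Redis running a script to completion
    before serving any other command.
    """

    def __init__(self) -> None:
        self._owners: dict[str, str] = {}
        self._scores: dict[str, int] = {}
        self._lock = threading.Lock()

    def claim(self, artifact_id: str, user_id: str, points: int = 1):
        """First claim wins. Return a diff for subscribers, or None if already taken."""
        with self._lock:
            if artifact_id in self._owners:
                return None
            self._owners[artifact_id] = user_id
            self._scores[user_id] = self._scores.get(user_id, 0) + points
            return {
                "artifact": artifact_id,
                "user": user_id,
                "score": self._scores[user_id],
            }
```

Two hunters standing in the same geohash prefix always land in the same bucket, and only the first `claim` call for an artifact produces a diff; later callers get None and nothing is published.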

The biggest tradeoff: we accepted eventual consistency for hunt scores during the race. Users saw a 300 ms delay between claiming an artifact and the leaderboard update. We measured the cost of that delay as 0.3 % of hunters refreshing the page, which we considered acceptable for a game mechanic where real-time bragging mattered less than end-of-hunt bragging.

What The Numbers Said After

After Pulse Quests 2 we ran three more hunts at the same scale. CPU on the Node tier stayed below 45 %, Redis Streams 95th-percentile latency held at 18 ms, and file descriptor usage stabilized at 3 k per pod. Our infrastructure bill actually dropped 18 % because we right-sized the Mongo cluster and no longer needed emergency vertical scaling.

We instrumented three new metrics:

  • hunt:hotpath:latency measured by Workers KV miss-rate and Go worker lag. Ideal target ≤20 ms; we hit 14 ms on average.
  • redis:streams:backlog tracked the per-bucket stream length. We capped it at 5 k messages; in hunt #4 the backlog peaked at 4.2 k, which stayed under the cap.
  • artifact:collisions reported how many users tried to claim the same artifact within the same 100 ms window. At hunt #2 we saw 12 per artifact; by hunt #4 it dropped to 3 because we sharded aggressively and Lua incremented atomically.
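For the collision metric, one plausible way to compute it is to group claim attempts into fixed 100 ms windows per artifact and count distinct users. The sketch below is Python and assumes fixed (not sliding) windows and a `(artifact_id, user_id, timestamp_ms)` event shape; neither detail comes from our production pipeline.

```python
from collections import defaultdict


def collisions_per_window(claims, window_ms: int = 100) -> dict:
    """Count distinct users per (artifact, window) and keep only real collisions.

    `claims` is an iterable of (artifact_id, user_id, timestamp_ms) tuples.
    Windows are fixed: timestamps 1000..1099 share window 10 when window_ms=100.
    """
    users = defaultdict(set)
    for artifact_id, user_id, ts_ms in claims:
        users[(artifact_id, ts_ms // window_ms)].add(user_id)
    # A window with a single user is not a collision, so drop it.
    return {key: len(ids) for key, ids in users.items() if len(ids) > 1}
```

Fixed windows undercount pairs of claims that straddle a window boundary; a sliding window would catch those at the cost of more bookkeeping.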

What I Would Do Differently

I would not default to WebSockets for a pure pub/sub race. Socket.IO introduced 30 % extra overhead versus raw Redis pub/sub and cost us an entire node restart cycle. If we had started with raw pub/sub and edge Workers KV caching, we could have avoided the Node-tier file descriptor explosion entirely.

I would also have added a circuit breaker on the Lua script call path. On hunt #3 we had a 1 ms Lua script latency spike that caused one bucket worker to stall; the upstream Redis pub/sub queue grew to 8 k messages before the circuit tripped. A 50 ms timeout on Lua + automatic worker restart would have capped the backlog at 2 k.
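A breaker along these lines is roughly what I have in mind. It is a minimal Python sketch with hypothetical names and thresholds (apart from the 50 ms budget mentioned above). Note that it classifies a call as slow only after the call returns; it does not interrupt a running call, which in practice would need a client-side socket timeout as well.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are being rejected."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=5.0, slow_call_s=0.050,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.slow_call_s = slow_call_s  # calls slower than this count as failures
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def _record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()

    def call(self, fn, *args):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open, call rejected")
            # Half-open: let one trial call through; a failure re-opens at once.
            self._opened_at = None
        start = self._clock()
        try:
            result = fn(*args)
        except Exception:
            self._record_failure()
            raise
        if self._clock() - start > self.slow_call_s:
            self._record_failure()
        else:
            self._failures = 0  # a healthy call closes the circuit fully
        return result
```

The worker would wrap its script invocation in `breaker.call(...)` and treat `CircuitOpenError` as the signal to restart or back off rather than keep feeding a stalled bucket.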

Finally, we should have capped the per-bucket subscriber fan-out at 500. When we accidentally pushed a bucket with fewer than 100 users it looked fine, but when a viral artifact landed inside a bucket with 1.5 k hunters, the Workers KV miss-rate spiked to 18 % and we served stale snapshots for 400 ms. A simple subscriber cap would have forced a shard split without another deploy.
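A cap-driven split could look something like this. It is a Python sketch that assumes buckets are keyed by geohash prefix and that splitting means extending the prefix by one character; the `split_overloaded` function and its input shape are hypothetical, not something we shipped.

```python
from collections import defaultdict

MAX_SUBSCRIBERS = 500  # the cap proposed above


def split_overloaded(buckets: dict[str, list[str]],
                     cap: int = MAX_SUBSCRIBERS) -> dict[str, list[str]]:
    """Split any bucket above `cap` one geohash character deeper, repeatedly.

    `buckets` maps a geohash prefix to the full geohashes of its hunters.
    A bucket whose hunters cannot be separated any further is kept as-is.
    """
    result: dict[str, list[str]] = {}
    pending = list(buckets.items())
    while pending:
        prefix, hunters = pending.pop()
        if len(hunters) <= cap:
            result[prefix] = hunters
            continue
        children = defaultdict(list)
        for geohash in hunters:
            # A geohash no longer than the prefix stays under the prefix itself.
            children[geohash[: len(prefix) + 1]].append(geohash)
        if len(children) == 1 and prefix in children:
            result[prefix] = hunters  # nothing left to split on
        else:
            pending.extend(children.items())
    return result
```

Because each child prefix is a strict refinement of its parent, workers can pick up the new buckets by prefix alone, which is what would make the split possible without a deploy.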
