Treasure Hunt Engine: The Day We Realized the Event Bus Was Our Constraint

#webdev #programming #rust #performance

The Problem We Were Actually Solving

We weren't just chasing p99 latency; we were solving a fundamental mismatch between the event model and the treasure hunt logic. Each treasure hunt round emits thousands of micro-events: player joins, item picks, time updates, leaderboard recalculations, and real-time notifications. The Node.js event loop was choking under the backpressure. The BullMQ worker was blocked on Redis pubsub, not because of network latency, but because Node.js's single-threaded event loop couldn't keep up with the rate of incoming events. The Redis server itself was fine: CPU at 12%, memory at 68%, no evictions. The bottleneck wasn't the queue or the data store. It was the runtime.

I profiled the process with 0x and saw that 78% of CPU time was spent in uv__io_poll, libuv's I/O polling routine (the epoll wrapper on Linux). The Node.js process was spending more time waiting for events than processing them. And because BullMQ uses Redis streams, every publish and consume was a network round trip. The 250 microsecond RTT from us-east-1 to the Redis cluster added up when we were publishing 47,000 events per second. The p99 latency grew far faster than linearly with the number of concurrent players. At 5,000 players, it was 80ms. At 10,000 players, it was 2.3 seconds. The system wasn't scaling linearly. It was falling off a cliff.
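To put rough numbers on that paragraph, here is a back-of-the-envelope sketch. It assumes exactly one Redis round trip per event, which is a simplification (the real command mix isn't given), and it fits a power law through only the two latency measurements quoted above, so treat the exponent as illustrative. The function names are hypothetical.

```rust
/// Average number of requests that must be in flight to sustain
/// `rate_per_sec` when each one takes `rtt_micros` to complete.
/// This is Little's law: L = lambda * W.
/// Assumption: one round trip per event.
fn required_in_flight(rate_per_sec: f64, rtt_micros: f64) -> f64 {
    rate_per_sec * (rtt_micros / 1_000_000.0)
}

/// Exponent k in `latency ~ players^k`, fitted through two measurements.
/// k = 1 would be linear scaling; k = 0.5 would be square-root scaling.
fn scaling_exponent(
    players_a: f64,
    latency_a: f64,
    players_b: f64,
    latency_b: f64,
) -> f64 {
    (latency_b / latency_a).ln() / (players_b / players_a).ln()
}
```

With the figures above, `required_in_flight(47_000.0, 250.0)` comes out to 11.75, meaning about twelve commands have to be in flight at all times just to cover the network wait. `scaling_exponent(5_000.0, 80.0, 10_000.0, 2_300.0)` comes out to roughly 4.85, far above 1 (linear).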

What We Tried First (And Why It Failed)

We tried scaling BullMQ workers horizontally. We spun up 8 workers behind an SQS queue. The SQS throughput was fine (50,000 events/sec sustained), but BullMQ's Redis backpressure became a distributed locking nightmare. Workers fought over the same Redis key ranges, and the Redis pubsub fanout created a thundering herd on the Node.js event loop. We saw lock contention in XREADGROUP with 200ms timeouts. We tried sharding the Redis streams into 16 shards. The shard imbalance was brutal: some shards got 3x the load. We tried upgrading Node.js to 20. Same behavior. We even tried using ioredis for connection pooling, but the fundamental mismatch remained: Node.js was an event-driven runtime pretending to be a stream processor.
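Shard imbalance is easy to quantify once you have per-shard event counts. The sketch below is one way to do it, not the tooling used here: it reports the busiest shard's load relative to the mean and lists the shards that exceed a chosen multiple of the mean. The function names and the threshold are hypothetical.

```rust
/// Ratio of the busiest shard's load to the mean load across all shards.
/// 1.0 means perfectly balanced. Returns None for empty input or zero load.
fn imbalance_ratio(events_per_shard: &[u64]) -> Option<f64> {
    let total: u64 = events_per_shard.iter().sum();
    if events_per_shard.is_empty() || total == 0 {
        return None;
    }
    let mean = total as f64 / events_per_shard.len() as f64;
    let max = *events_per_shard.iter().max()? as f64;
    Some(max / mean)
}

/// Indices of shards whose load exceeds `threshold` times the mean load.
fn hot_shards(events_per_shard: &[u64], threshold: f64) -> Vec<usize> {
    let total: u64 = events_per_shard.iter().sum();
    if events_per_shard.is_empty() || total == 0 {
        return Vec::new();
    }
    let mean = total as f64 / events_per_shard.len() as f64;
    events_per_shard
        .iter()
        .enumerate()
        .filter(|&(_, &count)| count as f64 > threshold * mean)
        .map(|(index, _)| index)
        .collect()
}
```

As an illustration with made-up counts: if 15 of 16 shards each see 100 events and one sees 300, the ratio is about 2.67, and `hot_shards` with a threshold of 2.0 flags only that one shard.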

Then we tried denoising the events: filtering out duplicate player actions, compressing payloads, and batching events. That reduced volume by 38%, but the p99 latency still rose with player count. The issue wasn't event volume. It was the runtime's inability to handle the concurrency model we needed.
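For readers who want a picture of what the denoising step can look like, here is a minimal sketch of the dedupe and batch parts (payload compression is left out). The `Event` fields and the rule that an identical player, action, and tick triple counts as a duplicate are assumptions for illustration; the actual event schema isn't shown in this post.

```rust
use std::collections::HashSet;

/// Hypothetical event shape, not the real schema.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Event {
    player_id: u64,
    action: String,
    tick: u64,
}

/// Drops events that are exact repeats of one already seen,
/// keeping the first occurrence and preserving arrival order.
fn dedupe(events: Vec<Event>) -> Vec<Event> {
    let mut seen = HashSet::new();
    events
        .into_iter()
        .filter(|event| seen.insert(event.clone()))
        .collect()
}

/// Groups events into batches of at most `batch_size`, in order.
fn batch(events: Vec<Event>, batch_size: usize) -> Vec<Vec<Event>> {
    assert!(batch_size > 0, "batch_size must be positive");
    events
        .chunks(batch_size)
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

Batching cuts the number of round trips, and deduping cuts the number of events, but neither changes how the consumer schedules work, which is consistent with the latency still climbing.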

The Architecture Decision

We had to accept that Node.js was the constraint. Not Redis. Not BullMQ. The runtime itself. We spun up a prototype in Go. Using go-redis with a streaming consumer group, we hit 320,000 events/sec on the same c5.4xlarge instance with under 100ms p99 latency. The memory allocation profile from pprof showed 1.2 MiB/sec of GC pressure, nothing compared to Node.js's 47 MiB/sec. But Go wasn't the only option. We also tested Rust.

We built a minimal Rust prototype using Tokio, Redis streams via redis-rs, and a hand-rolled event router. The first version used std::thread for concurrency, but that led to thread starvation under load. We switched to Tokio with 8 worker tasks and a single multiplexed Redis connection. The memory footprint was 8.7 MiB RSS at idle, peaking at 42 MiB under 100,000 events/sec. The Tokio runtime's work-stealing scheduler meant no idle threads and no CPU wasted waiting for events. The p95 event latency was 18ms and the p99 was 47ms. We ran a load test for 12 hours with 20,000 concurrent players and 2.1 million events per minute. Zero GC pauses, zero memory leaks, zero crashes.
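The p95 and p99 figures are order statistics over per-event latencies. The post doesn't say which percentile definition the load test used, so the sketch below assumes the nearest-rank definition, one common choice. The function name is hypothetical.

```rust
/// Nearest-rank percentile: the smallest sample such that at least
/// `p` percent of all samples are less than or equal to it.
/// Returns None for empty input or a `p` outside 0..=100.
fn percentile(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Rank is 1-based; multiply before dividing to limit rounding error.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    let index = rank.max(1) - 1;
    Some(sorted[index])
}
```

Sorting a copy of every sample is fine for a one-off load test. At sustained production rates a streaming histogram is usually the better fit, since it keeps memory bounded.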

The architecture decision wasn't just about language. It was about the concurrency model. We moved from a centralized event bus (Redis streams) to a partitioned event log with local in-memory buffering. Each shard had its own Redis stream, and a Rust worker consumed it with async I/O, processed events in order, and pushed updates to a local pubsub channel for the game server. The game server itself remained in Node.js, but now it listened to local events via Unix sockets, cutting RTT from 250 microseconds to 4 microseconds. The entire stack became pipeline-based: Redis → Rust → Node.js → frontend.
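The core idea of a partitioned log is that a stable key decides the shard, so all events for that key land in one place and keep their relative order. Here is a minimal in-memory sketch of that routing step. It leaves out Redis, Tokio, and the Unix socket hop entirely, and it assumes the hunt ID is the partition key and FNV-1a is the hash; the post doesn't state either, and all names are hypothetical.

```rust
/// 64-bit FNV-1a hash. Chosen here only because it is tiny and
/// deterministic; the hash used in production is not stated.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &byte in bytes {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Maps a partition key to a shard index in 0..shard_count.
fn shard_for(hunt_id: &str, shard_count: usize) -> usize {
    assert!(shard_count > 0, "shard_count must be positive");
    (fnv1a(hunt_id.as_bytes()) % shard_count as u64) as usize
}

/// Hypothetical event shape: `seq` is the arrival order.
#[derive(Debug, Clone, PartialEq)]
struct GameEvent {
    hunt_id: String,
    seq: u64,
}

/// Splits a stream of events into per-shard logs. Events that share a
/// hunt_id always go to the same shard, in their original order.
fn partition(events: &[GameEvent], shard_count: usize) -> Vec<Vec<GameEvent>> {
    let mut shards = vec![Vec::new(); shard_count];
    for event in events {
        shards[shard_for(&event.hunt_id, shard_count)].push(event.clone());
    }
    shards
}
```

Because each shard's log is consumed by a single worker, ordering within a hunt needs no locks. The tradeoff is that one very busy hunt can still make its shard hot.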

What The Numbers Said After

We deployed the Rust worker to production on a c5.large (2 vCPUs, 4 GiB RAM). The worker handled 60,000 events/sec with 4ms p99 latency. The Node.js game server CPU dropped from 85% to 18%. Memory usage fell from 1.4 GiB to 320 MiB. The Redis cluster CPU dropped from 78% to 22%, and we reduced shards from 16 to 4. The treasure hunt p99 latency dropped from 2.3 seconds to 62ms. Player complaints about lag disappeared. We added a new feature: real-time leaderboard recalculations every 200ms. The Rust worker handled it with zero additional latency.

But the real win was observability. Tokio's tracing crate gave us per-event latency histograms at 100,000 events/sec. We could see where each event spent time: 31% in Redis XREAD, 28% in event parsing, 14%