When Veltrix's Treasure Hunt Engine Folded at 2,347 Concurrent Players, and How We Fixed It

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In late 2024, our team at Veltrix shipped a real-time multiplayer treasure hunt feature using Redis Streams as the event backbone. It felt fast: sub-100 ms p99 latencies, Go workers chewing through events, and a naive fan-out to all connected clients over WebSockets. We celebrated in Slack. Two weeks later, during a soft launch to 600 beta players, the Redis Streams consumer group started throwing a NOGROUP Handshake error with the message "Consumer group name not found." It turned out that the client load balancer was recycling connections every 30 seconds, which tore down WebSocket sessions faster than our health checks could react. We scrambled to raise the consumer group timeout from the default 60 s to 300 s, but the real issue was architectural: we had coupled the event stream's identity to a single Redis Stream key and a single consumer group. When connection churn spiked, the entire engine became a state machine that could only recover by replaying every event from offset 0. At 2,347 concurrent players that meant 2.1 million events per minute, a backlog that Redis couldn't clear before the next wave of drops. The game froze, players rage-quit, and the CFO asked me to explain why the infra bill had tripled overnight.
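A back-of-the-envelope model shows why replaying from offset 0 never caught up. This is a minimal sketch, not our production tooling: the arrival rate comes from the 2.1 million events per minute above, while `backlog_after` and both drain rates are hypothetical values chosen only to illustrate the two regimes. The 30-second window matches the load balancer's recycle interval.

```python
def backlog_after(seconds: float, arrival_per_s: float, drain_per_s: float,
                  initial_backlog: float = 0.0) -> float:
    """Events still queued after `seconds`; a queue cannot go below zero."""
    backlog = initial_backlog + (arrival_per_s - drain_per_s) * seconds
    return max(backlog, 0.0)


# 2.1 million events per minute is 35,000 events per second.
ARRIVAL_PER_S = 2_100_000 / 60

# Hypothetical drain rates (assumptions, not measured values):
# one slower than arrivals, one faster.
print(backlog_after(30, ARRIVAL_PER_S, drain_per_s=30_000))  # backlog grows
print(backlog_after(30, ARRIVAL_PER_S, drain_per_s=40_000))  # backlog clears
```

Whenever the drain rate sits below the arrival rate, every 30-second recycle starts from a larger backlog than the last one, which is the spiral described above.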

What We Tried First (And Why It Failed)

Our first fix was to swap the Redis client from redigo to rueidis for pipelining, hoping to cut latency and increase throughput. We also increased the consumer pool from 8 to 32 workers. The NOGROUP errors stopped, but the backlog persisted. Then we added an external Kafka cluster as a dual write: Redis Streams for low-latency fan-out and Kafka for durability and replay. Fan-out latency doubled to 210 ms p99 because every event now took two network hops. Worse, the Kafka consumer group lagged 12 s behind the Redis head, so the leaderboard calculations were off by two full treasure rounds. We then discovered that our treasure-sink service was idempotent only if offsets were monotonic; with two streams, the same event could arrive twice, causing duplicate chest unlocks and angry Discord threads. We tried disabling the idempotency check to keep latency down, and suddenly players got 17 keys for one chest. That broke economy balancing and cost us $28k in fake in-game currency that we had to claw back.
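The duplicate unlocks came from tying idempotency to stream offsets, which stop being comparable once two streams carry the same event. One common alternative is to deduplicate on an event ID assigned once by the producer. The sketch below is an illustration of that idea, not the treasure-sink code we ran; `ChestUnlock`, `TreasureSink`, and the `event_id` field are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChestUnlock:
    event_id: str   # hypothetical: unique ID assigned once by the producer
    player_id: str
    chest_id: str


@dataclass
class TreasureSink:
    seen: set = field(default_factory=set)    # event IDs already applied
    keys: dict = field(default_factory=dict)  # player_id -> keys granted

    def apply(self, event: ChestUnlock) -> bool:
        """Grant one key per unique event; return False for a duplicate."""
        if event.event_id in self.seen:
            return False
        self.seen.add(event.event_id)
        self.keys[event.player_id] = self.keys.get(event.player_id, 0) + 1
        return True


sink = TreasureSink()
unlock = ChestUnlock(event_id="evt-1", player_id="p1", chest_id="c9")
sink.apply(unlock)  # first delivery, e.g. via Redis Streams
sink.apply(unlock)  # second delivery, e.g. via Kafka, is ignored
print(sink.keys)
```

The `seen` set grows without bound here; a real service would presumably expire IDs after the round's retention window.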

The Architecture Decision

We needed a single source of truth that could tolerate connection churn and still produce consistent treasure states. We chose NATS JetStream with a stream per hunt round and a durable consumer per WebSocket session. Each hunt round became a 5-minute append-only log with a 60-minute retention policy and a 10 k message limit to prevent unbounded growth. The WebSocket handler became a stateless NATS subscriber that replayed the last 30 events on reconnect instead of the entire stream. We ran a 24-hour soak test with 1 % packet loss and 30 % connection churn: the new pipeline handled 20,000 concurrent players with a p99 fan-out latency of 45 ms and zero duplicate treasure spawns. The tradeoff was operational complexity: NATS JetStream required cluster formation across three AZs and a three-node Raft quorum, so our infra bill rose from $1,200/month on a single Redis cache.r7g.large to $3,800/month on three c6g.xlarge nodes. We also lost Redis Streams' built-in consumer group rebalancing (NATS required custom leader election via etcd), but we gained deterministic replay and a clear service boundary: the treasure hunt engine is now a separate bounded context from the lobby and economy services.
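The two limits that matter for recovery are the per-round message cap and the bounded replay on reconnect. The sketch below models both in memory so the behavior is easy to see. It is a stand-in under stated assumptions, not the NATS JetStream API: `RoundLog` and its methods are hypothetical names, and the 60-minute retention is not modeled.

```python
from collections import deque
from itertools import islice


class RoundLog:
    """In-memory stand-in for one hunt round's append-only stream."""

    def __init__(self, max_msgs: int = 10_000):
        # Once the cap is reached, the oldest events are discarded first.
        self._events = deque(maxlen=max_msgs)
        self._next_seq = 1

    def append(self, payload: str) -> int:
        """Store an event and return its sequence number."""
        seq = self._next_seq
        self._events.append((seq, payload))
        self._next_seq += 1
        return seq

    def replay_tail(self, n: int = 30) -> list:
        """Return the last n (seq, payload) pairs in order, as on reconnect."""
        start = max(len(self._events) - n, 0)
        return list(islice(self._events, start, None))


log = RoundLog()
for i in range(250):
    log.append(f"event-{i}")

# A reconnecting session gets 30 events, not the full log.
tail = log.replay_tail()
print(len(tail), tail[0][0], tail[-1][0])
```

Reconnect cost is now bounded by `n` regardless of how long the round has been running, which is what broke the replay-from-zero spiral.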

What The Numbers Said After

After the migration to NATS JetStream in March 2025, we pushed the player count to 75,000 concurrent hunters during the Easter event. The fan-out latency stayed under 50 ms p99, and the replay buffer never exceeded 5 k messages per client. The Redis Streams cluster we kept for leaderboard snapshot writes now handles only 3 k messages per second, down from 210 k at peak failure. The Kinesis Firehose we had attached to Redis Streams as a desperate back-pressure valve is gone; we dropped the monthly $2,100 Firehose bill. The new infra cost is $3,800/month for NATS + $320/month for etcd + $420/month for snapshot storage in S3 IA, which comes to $4,540/month. Net delta once the Firehose bill is subtracted: +$2,440/month, but we sleep at night and our NPS rose from 42 to 68.

What I Would Do Differently

I would not have yanked the Redis Streams component entirely. We could have introduced a real service boundary earlier by turning the treasure hunt into a separate microservice that owned its own stream. That would have let us keep Redis for low-latency fan-out while isolating the fan-out storms to a single team. Instead, we inherited a monolithic Node.js process where the WebSocket handler, treasure engine, and leaderboard were tangled together. If I could redo the boundaries today, I'd carve out a Rust-based treasure engine that publishes a compact Protobuf event: TreasureRoundStarted(round=1234, seed=0xdeadbeef). All other services would consume that event, not the raw Redis Stream. That single change would have made the dual-write chaos unnecessary and saved us six weeks of firefighting. The lesson is simple: when your event bus becomes a god object that everything depends on, you're already too late. Draw the boundary before the first player connects.
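To make the "compact event" idea concrete, here is a minimal sketch of the contract that downstream services would consume. It uses a fixed-width binary layout from the standard library as a stand-in for Protobuf, so it runs without generated code. The 32-bit field widths and the `encode`/`decode` names are assumptions; only the event name and its two fields come from the example above.

```python
import struct
from dataclasses import dataclass

# Assumed layout: little-endian uint32 round, then uint32 seed (8 bytes).
_LAYOUT = struct.Struct("<II")


@dataclass(frozen=True)
class TreasureRoundStarted:
    round: int
    seed: int

    def encode(self) -> bytes:
        """Serialize to the fixed 8-byte wire form."""
        return _LAYOUT.pack(self.round, self.seed)

    @classmethod
    def decode(cls, data: bytes) -> "TreasureRoundStarted":
        """Parse the 8-byte wire form back into an event."""
        round_no, seed = _LAYOUT.unpack(data)
        return cls(round=round_no, seed=seed)


event = TreasureRoundStarted(round=1234, seed=0xDEADBEEF)
wire = event.encode()
print(len(wire), TreasureRoundStarted.decode(wire) == event)
```

Consumers depend only on this small message, so the engine can change how it stores or streams rounds internally without touching the lobby or economy services.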

