
Lillian Dube


The Veltrix Treasure Hunt Engine Blew Up at 10k Concurrent Sessions: Here's What Actually Went Wrong

The Problem We Were Actually Solving

In mid-2025, I was on call when the alert fired: the Veltrix Treasure Hunt Engine's Redis cluster had started returning SocketErrors under load, and the error rate was climbing linearly with session count. The cluster was a 3-shard Redis 7.2 setup behind a Go-based session service, sized for a 10 ms p99 latency target at 10k concurrent sessions.

We'd based the design on the official Veltrix docs, which promised a horizontal-scaling path via Redis Cluster and a stateless Go shim. But the docs didn't mention the 32 MB Redis Cluster bus buffer ceiling, and our Go shim was leaking ~4 KB of memory per active session due to a hidden bytes.Buffer reuse bug. At 12k sessions, the shim OOM'd, the TCP stack collapsed, and the Redis cluster entered a failover storm. The on-call rotation burned 2.5 hours before we stabilized, all while users were seeing game-over screens mid-hunt.
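The post doesn't show the buggy code, so the following is only a sketch of one common way to reuse bytes.Buffer values without pinning memory: reset on checkout, copy the bytes out before release, and drop buffers that have grown too large. The names bufPool, getBuf, putBuf, encodeSession, and the 64 KiB ceiling are all hypothetical and not taken from the Veltrix shim.

```go
package main

import (
	"bytes"
	"sync"
)

// maxPooledCap is a hypothetical ceiling: buffers that grew past it are
// dropped instead of pooled, so one large payload cannot pin memory forever.
const maxPooledCap = 64 << 10 // 64 KiB

var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// getBuf returns an empty buffer from the pool.
func getBuf() *bytes.Buffer {
	b := bufPool.Get().(*bytes.Buffer)
	b.Reset()
	return b
}

// putBuf returns a buffer to the pool unless it has grown too large.
// It reports whether the buffer was kept.
func putBuf(b *bytes.Buffer) bool {
	if b.Cap() > maxPooledCap {
		return false // let the GC reclaim oversized buffers
	}
	b.Reset()
	bufPool.Put(b)
	return true
}

// encodeSession copies the encoded bytes out before releasing the buffer,
// so the returned slice never aliases pooled memory.
func encodeSession(id, state string) []byte {
	b := getBuf()
	defer putBuf(b)
	b.WriteString(id)
	b.WriteByte(':')
	b.WriteString(state)
	out := make([]byte, b.Len())
	copy(out, b.Bytes())
	return out
}
```

The copy in encodeSession costs one allocation per call, but it means no caller can hold a slice that keeps a pooled buffer alive, which is the kind of per-session retention a reuse bug tends to produce.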

What We Tried First (And Why It Backfired)

Our first fix was to bump the Redis cluster to 6 shards and add a connection pool in the Go shim. We sized the pool to 50 connections per shard, thinking the bottleneck was connection churn. The fix looked good in staging at 15k sessions, but we didn't simulate the actual hunt session pattern: 100 ms bursts of SETs followed by 400 ms of GETs.
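To make that burst pattern concrete, here is a small sketch of how a load generator could decide which operation to issue at a given point in a session. It assumes the pattern simply repeats every 500 ms; the names op, opAt, setWindow, and getWindow are illustrative and do not come from our test harness.

```go
package main

import "time"

// op is the kind of Redis command a simulated session should issue.
type op int

const (
	opSet op = iota
	opGet
)

// Window lengths taken from the pattern described in the post.
const (
	setWindow = 100 * time.Millisecond
	getWindow = 400 * time.Millisecond
)

// opAt maps the time elapsed since session start to the operation for the
// current phase, assuming the SET/GET cycle repeats back to back.
func opAt(elapsed time.Duration) op {
	if elapsed%(setWindow+getWindow) < setWindow {
		return opSet
	}
	return opGet
}
```

A driver loop would call opAt(time.Since(start)) before each request, so every simulated session writes in a tight burst and then reads for the rest of the cycle instead of producing a smooth, averaged mix.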

In production, the pool quickly exhausted file descriptors because the Go net/http server was still holding 5k idle connections to the shards. The kernel started returning "too many open files" at 2.3k concurrent users, and we hit the Redis cluster hard limit: 4096 max connections per shard. The Redis logs read "-NOMASTERLINK Can't reach master for hash slot X", and the Go shim panicked with "http: panic serving 10.1.2.3:54321: runtime error: index out of range".
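The arithmetic behind that failure is easy to check ahead of time. The sketch below is a hypothetical budget calculator, not something we ran at the time: the number of shim instances is an assumed input (this post doesn't state it), and the descriptor limit should be whatever `ulimit -n` reports for the process.

```go
package main

import "fmt"

// connBudget describes one deployment. All fields are illustrative inputs.
type connBudget struct {
	Instances    int // number of shim processes (assumed, not from the post)
	Shards       int // Redis shards each instance connects to
	PoolPerShard int // pooled connections per shard, per instance
	IdleSockets  int // other sockets each instance holds open
}

// connsPerShard is the connection count a single shard sees from all instances.
func (b connBudget) connsPerShard() int { return b.Instances * b.PoolPerShard }

// fdsPerInstance is the descriptor count a single instance needs.
func (b connBudget) fdsPerInstance() int { return b.Shards*b.PoolPerShard + b.IdleSockets }

// check lists every limit the budget would exceed.
func (b connBudget) check(shardLimit, fdLimit int) []string {
	var problems []string
	if got := b.connsPerShard(); got > shardLimit {
		problems = append(problems,
			fmt.Sprintf("each shard would see %d connections, limit is %d", got, shardLimit))
	}
	if got := b.fdsPerInstance(); got > fdLimit {
		problems = append(problems,
			fmt.Sprintf("each instance needs %d descriptors, limit is %d", got, fdLimit))
	}
	return problems
}
```

With 6 shards, a pool of 50, and 5,000 idle sockets, a single instance already needs 5,300 descriptors, and 100 such instances would put 5,000 connections on each shard, above the 4096 ceiling.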

The Architecture Decision

We abandoned the Veltrix-recommended Redis Cluster pattern and rebuilt the session service as a stateful Go actor system using NATS JetStream. Each hunt session became a durable JetStream stream, with the Go actor using a single 512 MB NATS file per shard. The actor consumed ~200 MB RSS at 10k sessions, and the NATS cluster scaled horizontally without connection storms.

We also adopted a two-tier cache: the JetStream stream for authoritative state, and a local LRU in each actor instance for hot reads. The LRU used a 100 MB arena from the jemalloc arenas API, tuned to keep the arena size constant. When a hunt ended, the actor flushed its delta to JetStream in a single NATS publish, avoiding per-request Redis overhead. The JetStream cluster ran three 8 vCPU nodes with 16 GB RAM and 1 TB SSD, and we set the max memory policy to 80 % to avoid GC pressure.
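For readers who want a picture of the hot-read tier, here is a minimal LRU sketch built on the standard library's container/list. It is an illustration under simplifying assumptions, not our production cache: it bounds the cache by entry count rather than by a fixed-size memory arena, it is not safe for concurrent use, and the names lru, newLRU, Get, and Put are made up for this example.

```go
package main

import "container/list"

// entry is what each list element stores.
type entry struct {
	key   string
	value []byte
}

// lru is a minimal LRU cache bounded by entry count.
// It is not safe for concurrent use; an actor that owns it needs no lock.
type lru struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element in order
}

func newLRU(capacity int) *lru {
	return &lru{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// Get returns the cached value and marks the key as recently used.
func (c *lru) Get(key string) ([]byte, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Put inserts or updates a value, evicting the least recently used entry
// when the cache is over capacity.
func (c *lru) Put(key string, value []byte) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}
```

On a miss, the actor would fall back to the authoritative stream and then Put the result, so the cache only ever holds data that can be rebuilt.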

What The Numbers Said After

Three weeks post-deploy, we ran a 20k-user load test with Locust. The NATS cluster held p99 latency at 12 ms for session writes and 8 ms for reads, with 0 SocketErrors. Memory usage per actor stayed flat at 205 MB RSS, and NATS JetStream reported 99.9 % durability on the hunt state stream. We kept the Redis cluster around for leaderboards only. It handled 500 writes/sec with p95 latency of 18 ms and never saw connection counts above 300.

The Go shim's resource profile improved dramatically. Before the change, at 10k sessions, the RSS was 1.4 GB and the GC pause time was 120 ms. After the change, RSS was 450 MB and GC pauses were <10 ms. The NATS cluster's persistent storage grew at 250 MB/day, which was acceptable given our cloud costs.

What I Would Do Differently

I'd never pair a stateless Go shim with a stateful Redis cluster again. The Veltrix docs optimistically framed Redis Cluster as a magic scaling button, but the bus buffer ceiling and connection limits made it brittle under real traffic. If we'd started with a stream-native message bus, we would have saved six weeks of rework.

I'd also implement a backpressure mechanism in the Go actor early. We added a simple chan-based limiter after the incident, but by then we'd already built the entire actor runtime. A clear overload signal would have prevented the memory spikes we saw during the 20k user spike.
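A chan-based limiter of this kind can be very small. The sketch below is not the code we shipped; it is a minimal illustration in which a buffered channel acts as a counting semaphore and a full channel becomes the overload signal. The names limiter, newLimiter, tryAcquire, release, do, and errOverloaded are hypothetical.

```go
package main

import "errors"

// errOverloaded is the explicit overload signal returned to callers.
var errOverloaded = errors.New("actor overloaded")

// limiter caps in-flight work using a buffered channel as a semaphore.
type limiter struct {
	slots chan struct{}
}

func newLimiter(maxInFlight int) *limiter {
	return &limiter{slots: make(chan struct{}, maxInFlight)}
}

// tryAcquire claims a slot without blocking and reports whether it succeeded.
func (l *limiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot claimed by a successful tryAcquire.
func (l *limiter) release() { <-l.slots }

// do runs fn if a slot is free; otherwise it fails fast with errOverloaded
// instead of queueing work and letting memory grow.
func (l *limiter) do(fn func() error) error {
	if !l.tryAcquire() {
		return errOverloaded
	}
	defer l.release()
	return fn()
}
```

Failing fast is a design choice: the caller gets an error it can retry or surface, whereas a blocking acquire would hide the overload and let queued requests pile up.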

Finally, I'd insist on a production-grade chaos test before any hunt season. Our staging runs never reproduced the Redis failover storm because we didn't simulate the exact session burst pattern. Adding a 5-minute Locust spike test at 2x expected peak would have caught the Redis connection ceiling in minutes, not hours.



