Veltrix 2026: Why Our Treasure Hunt Engine Kept Losing Players at 10k Concurrent Moves

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The promise was simple: every move a player makes drops a virtual coin into a randomly chosen shard of a 64-shard Redis Cluster. At 10,000 concurrent hunters we measured 4.2 million writes per minute. Our SLO for leaderboard reads was p99 ≤ 200 ms. We missed it by 1,900 ms every Tuesday when the shard balancer in redis-cli --cluster rebalance ran the scan-keys operation across all 64 nodes.

The real problem wasnt scaling writes; it was the silent coupling between move ingestion and the leaderboard refresh pipeline. When the balancer swept the cluster, every Redis node blocked for 800–1,200 ms while it scanned 2.3 GB of zset data. During that window the NodeJS workers queued 700,000 zadd commands. The NodeJS event loop froze because the hiredis library used a single thread per connection pool. The backlog grew to 4.2 GB of heap and the GC ran every 2.8 s instead of its target 7 s. Players saw their avatars freeze for 1.8 s, which we measured via synthetic tests as the exact moment they abandoned the hunt.

What We Tried First (And Why It Fail)

Our first attempt was vertical scaling. We bumped the Redis instance size from cache.r6g.16xlarge (64 vCPU, 488 GB RAM) to cache.r6g.32xlarge. The move-ingest pipeline was rewritten in Go to remove NodeJS, but the zset scan in the leaderboard worker still caused the balancer to crawl. The weekly rebalance still took 11 minutes, and the redis-cli cluster rebalance command still emitted:

[ERR] Node XX.XXX.XXX.XXX:6379 is not reachable

The issue wasnt CPU or memory; it was the fact that redis-cli --cluster rebalance triggers a full SCAN on every node. The Redis Cluster specification calls this cluster topology maintenance and it is inherently blocking. We had optimized the wrong layer.

We tried a Lua script to batch the leaderboard updates, but Lua in Redis has a hard 5 ms script timeout. At 800 writes per millisecond the script hit the timeout, returned script killed, and the NodeJS layer retried with exponential backoff for 5 minutes. Players saw error 503 on their top-100 queries.

The Architecture Decision

We ripped out Redis Cluster entirely and switched to a write-behind architecture with Apache Kafka as the durable event bus and Apache Flink as the stateful compute layer. The decision was controversial because the Redis Cluster documentation claims linear scalability, but we proved it only scales writes if you never run topology maintenance during peak hours.

Our new ingestion pipeline:

PlayerMove events → Kafka topic player-moves-v1, 32 partitions, replication factor 3, retention 72 h.
Flink job with RocksDB state backend, keyed on player-id, windowed at 1-second tumbling windows for the top-100 leaderboard.
Flink sink writes to a PostgreSQL table with a BRIN index on (ts, player_id, shard_id) that compresses 3.8 million rows per second to ~400 MB/day.

The RocksDB instance runs on NVMe SSD with 50 GB cache. We disabled OS-level transparent huge pages to avoid unexpected 2 ms stalls. Flink checkpoint interval is 60 s, aligning with Kafka consumer lag SLA of ≤10 s. The PostgreSQL instance is a r6i.8xlarge with 32 vCPU, 256 GB RAM, and GP3 disks at 3,000 IOPS. We chose PostgreSQL over ClickHouse because our product managers wanted geospatial queries later, and we knew ClickHouse would require 2× the storage at this scale.

The tradeoff: we accepted 150 ms of additional latency for leaderboard reads because Flink needs to materialize the window. Before the change our p99 was 180 ms; after it became 330 ms. We measured the drop in player session length: 12 % fewer players reaching the treasure within 5 minutes. The business accepted the trade because 330 ms is still under our 500 ms product SLA and the system now survives rebalances without ECONNRESET.

What The Numbers Said After

After two weeks of traffic at 28k concurrent players:

Kafka end-to-end latency: 8 ms average, 14 ms p99.
Flink checkpoint duration: 4.2 s, with 0 failures over 14,400 checkpoints.
PostgreSQL TPS on leaderboard reads: 14,200 reads/s, 4,300 writes/s.
Redis Clusters move ingestion path is now used only for leaderboard snapshots refreshed every 5 minutes, reducing its load by 98 %.
Player drop-off at 5 minutes dropped from 42 % to 30 % after the change.

The biggest surprise was the cost curve. The new pipeline cost $0.008 per 1,000 player-move events (Kafka + Flink + PostgreSQL) versus $0.003 under the old Redis-only design. However, we were paying $28k/month in on-call burnout because every Tuesday we had to wake someone up to delete the Redis Cluster node that went OOM. The new system runs 95 % of the time without any human intervention.

What I Would Do Differently

If I could reset the clock, I would have built the write-behind pipeline in May 2024 instead of trying to stretch Redis Cluster past its design limits