The Day Veltrix Defaulted and Our Redis Cluster Almost Burned

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In late Q2 2025 we launched a treasure-hunt engine that lived inside the Veltrix event platform. The engine was a light Go service sitting behind an Envoy ingress. It stored every player position in Redis Cluster with three shards and a 5-second TTL on every playerPos:<uid> key. The service scaled horizontally behind the Veltrix router, and we had tuned the client SDK to poll every 2 s. During our first live push we served ~400 RPS and felt pretty good.

At 35 k concurrent players the p99 latency jumped from 90 ms to 2.1 s. More worrying, Redis Cluster started throwing -MOVED 3993 10.0.45.7:6379 errors. We quickly realized the shard mapping was changing underneath us, and that we had never touched the default Redis Cluster configuration of 3 masters and 3 replicas that Veltrix ships with. The default 16384-slot mapping plus ~4 million keys in total meant each slot held roughly 240 keys. That only became a hotspot when a shard dropped a master and the reshuffle ran for 36 seconds while the cluster tried to migrate ~240 k keys per slot.
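For readers who have not met a MOVED reply before: it carries the hash slot and the address of the node that now owns it, and cluster-aware clients normally use it to refresh their slot map. The following is a minimal sketch of pulling those two fields out of the error text; `Redirect` and `parseMoved` are illustrative names, not part of any client library or of our service.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Redirect describes where Redis Cluster says a slot now lives.
type Redirect struct {
	Slot int
	Addr string // host:port of the node that owns the slot
}

// parseMoved extracts the slot and target address from a MOVED error such as
// "MOVED 3993 10.0.45.7:6379". A leading '-' (the RESP error marker) is tolerated.
func parseMoved(msg string) (Redirect, bool) {
	fields := strings.Fields(strings.TrimPrefix(msg, "-"))
	if len(fields) != 3 || fields[0] != "MOVED" {
		return Redirect{}, false
	}
	slot, err := strconv.Atoi(fields[1])
	if err != nil || slot < 0 || slot > 16383 {
		return Redirect{}, false
	}
	return Redirect{Slot: slot, Addr: fields[2]}, true
}

func main() {
	r, ok := parseMoved("-MOVED 3993 10.0.45.7:6379")
	fmt.Println(r.Slot, r.Addr, ok)
}
```

A steady stream of these errors usually means the client's cached slot map is stale, which is consistent with a topology that keeps changing.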

What We Tried First (And Why It Failed)

Our first reflex was to double the shards to six masters and six replicas. We rolled a Terraform module that spawned 12 nodes and updated the topology view. Within five minutes we saw -CLUSTERDOWN Hash slot not served errors. The reason: the 16384-slot space was still the default. Redis Cluster can't split slots dynamically; the total number of slots is fixed, and adding masters only redistributes them. So we went from roughly 5.4k slots per master to 2.7k slots per master, but the hotspot moved instead of disappearing. The memory spikes just shifted to the new shards, and the rebalancing took 56 seconds this time because we had twice as many connections.
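The arithmetic behind that redistribution is easy to check. Here is a small sketch that divides the fixed 16384 slots evenly among n masters; `splitSlots` and `SlotRange` are hypothetical helpers, and the exact boundaries a real tool such as redis-cli picks may differ by a slot or so.

```go
package main

import "fmt"

const totalSlots = 16384 // fixed by Redis Cluster

// SlotRange is an inclusive range of hash slots owned by one master.
type SlotRange struct{ Start, End int }

// splitSlots divides the slot space as evenly as possible among n masters.
// The first (totalSlots % n) masters receive one extra slot.
func splitSlots(n int) []SlotRange {
	if n <= 0 {
		return nil
	}
	ranges := make([]SlotRange, 0, n)
	base, extra := totalSlots/n, totalSlots%n
	start := 0
	for i := 0; i < n; i++ {
		size := base
		if i < extra {
			size++
		}
		ranges = append(ranges, SlotRange{Start: start, End: start + size - 1})
		start += size
	}
	return ranges
}

func main() {
	for _, n := range []int{3, 6} {
		r := splitSlots(n)
		fmt.Printf("%d masters: first range %d-%d (%d slots)\n",
			n, r[0].Start, r[0].End, r[0].End-r[0].Start+1)
	}
}
```

Doubling the masters halves the slots each one owns, but the number of keys that hash to any given slot does not change.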

Next we turned off Redis Cluster entirely and moved to a single Redis Enterprise cluster with 32 shards. That worked fine for a while, but the cost per cluster hour in GCP jumped from $0.52 to $2.34. More critically, the single cluster became the new bottleneck whenever we needed to scale reads; we had to run Redis Enterprise with its proxy tier, which added another 2-3 ms of latency per hop.

Finally we tried moving the writes into a dedicated Kafka topic and letting a Flink job materialize the latest position into a PostgreSQL hypertable. The p99 dropped back to 45 ms while we handled 85 k RPS, but the writes introduced 200 ms of head-of-line blocking during peak seconds. We also had to maintain a separate Elasticsearch index to serve fast bounding-box queries, so our infra bill tripled.

The Architecture Decision

We decided to stay with Redis Cluster and completely redo its topology.

Step 1: We migrated to a 128-slot initial layout instead of spreading keys over the full default range of 16384. Redis Cluster's total slot space is fixed at 16384, so this describes the subset of slots we actually populated, not a different slot count. This gave us 128 * 12 masters = 1536 slots in use. Each slot would hold roughly 2600 keys on average, which kept memory per slot under 120 MB even during bursts. We chose 12 masters because our node shape was n2-highmem-16 with 64 GB RAM, and we never wanted a single node to hold more than ~6 GB of data.

Step 2: We pre-split the slot range manually using Redis CLUSTER ADDSLOTS commands run from a one-off job. We scripted the split across six masters so each got 32 slots. Then we added --cluster-replicas 1 to Terraform and let Terraform deploy six replicas. The rebalancing finished in under 9 seconds because the slot migration moved 800 keys per slot instead of 240 k.
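As a rough illustration of what the one-off job issues, the sketch below builds the argument list for one CLUSTER ADDSLOTS call per master. It assumes the slots are numbered contiguously from 0, which the description above does not specify; `addSlotsArgs` and the `master-N` labels are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// addSlotsArgs builds the arguments for one CLUSTER ADDSLOTS call that
// assigns `count` consecutive slots, starting at `first`, to the receiving node.
func addSlotsArgs(first, count int) []string {
	args := []string{"CLUSTER", "ADDSLOTS"}
	for s := first; s < first+count; s++ {
		args = append(args, strconv.Itoa(s))
	}
	return args
}

func main() {
	const masters, slotsPerMaster = 6, 32 // figures taken from Step 2
	for m := 0; m < masters; m++ {
		args := addSlotsArgs(m*slotsPerMaster, slotsPerMaster)
		// Each command would be sent to a different master node.
		fmt.Printf("master-%d: %s ... %s\n",
			m, strings.Join(args[:3], " "), args[len(args)-1])
	}
}
```

One caveat worth keeping in mind: with the stock cluster-require-full-coverage setting, a cluster stops serving queries while any of the 16384 slots is unassigned, so a partial assignment like this only works if the remaining slots are also covered or that setting is relaxed.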

Step 3: We introduced a Veltrix-side circuit breaker that listens to Redis Cluster's cluster_nodestate events. When the breaker sees a node transition to the fail state, it stops sending new writes for 30 seconds and drains the circuit queue. This saved us from the 36-second window we saw earlier; now the worst-case reshuffle costs us only 1.2 s of tail latency.

Step 4: We split the keyspace further by sharding on the CRC16 hash of each player's UUID, modulo 128. The Envoy Lua filter hashes the incoming uid and rewrites the key prefix to treasure:{hash}:{uid}. This gives us even distribution without hot keys because the CRC16 space is dense and Redis Cluster's slot algorithm maps contiguous ranges to the same master.
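The bucketing itself can be sketched in Go, even though the production version lives in an Envoy Lua filter. The CRC16 variant below is CRC-16/XMODEM, the one Redis Cluster uses for key hashing. The sketch assumes the braces around the bucket are literal, so Redis treats the bucket number as a hash tag, and that {uid} is just a placeholder for the raw uid; `bucketFor` and `positionKey` are hypothetical names.

```go
package main

import "fmt"

const buckets = 128

// crc16 implements CRC-16/XMODEM (polynomial 0x1021, initial value 0),
// the variant Redis Cluster uses to hash keys.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = (crc << 1) ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// bucketFor maps a player uid to one of the 128 buckets.
func bucketFor(uid string) int {
	return int(crc16([]byte(uid))) % buckets
}

// positionKey builds the rewritten key. The braces make the bucket number a
// hash tag, so every key in the same bucket lands in the same cluster slot.
func positionKey(uid string) string {
	return fmt.Sprintf("treasure:{%d}:%s", bucketFor(uid), uid)
}

func main() {
	// Example uid is made up for illustration.
	fmt.Println(positionKey("3f2c9a7e-0000-4000-8000-000000000001"))
}
```

With a hash tag in place, Redis hashes only the text inside the braces, so the 128 buckets occupy at most 128 distinct slots; how evenly those slots fall across masters depends on the slot assignment.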

What The Numbers Said After

After the change we ran load tests with 120 k RPS and 550 k concurrent players.

Redis memory per shard stayed flat at 5.8 GB, free heap 18 %.
p99 latency dropped from 2.1 s to 112 ms; p999 to 290 ms.
-MOVED errors disappeared; we saw zero -CLUSTERDOWN in the hour after the migration.
Shard failover time measured at 8.4 s (was 56 s before).
Cost per million operations dropped from $0.14 to $0.07 because we kept Redis Cluster instead of upgrading to Redis Enterprise.

What I Would Do Differently

I would not have wasted engineering cycles on the single-cluster experiment. It solved the latency but failed the cost test. If we had benchmarked the default 16384-slot mapping earlier—say, with 1 k players instead of 400 RPS—we would have caught the hotspot in our canary and saved three days of rollbacks.

I would also add an automated slot-pruning job that runs once per week. Redis Cluster slots never shrink, and our game cadence means some player cohorts expire after 30 days. We currently pay for empty slot ownership; a 128-slot Redis Cluster still allocates 128 * 16384 bytes of slot metadata, which is negligible, but the principle matters.

Finally, I would push the organization to adopt a chaos-monkey style slot migrator. We have the Terra