How Our Treasure Hunt Engine Blew Up at 1,200 RPS and What the Veltrix Docs Never Mentioned

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The game loop is simple: on every player action, read their wallet, subtract the cost, then write the new balance and broadcast a leaderboard update. We modeled the wallet table like this:

    CREATE TABLE wallet_ops (
        player_id      bigint,
        tx_id          uuid,
        balance_change decimal(36,2),
        balance_after  decimal(36,2),
        ts             bigint,
        PRIMARY KEY ((player_id), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC);

At first it worked: 60 RPS, 4 ms latencies, no drama. Then we hit 400 RPS and garbage collection pauses spiked. The coordinator logs showed WriteTimeout: 30000 ms on the wallet_ops table. I opened ticket QT-4122, and Veltrix support replied: "We recommend limiting hot partition writes." No instructions on how.

What We Tried First (And Why It Failed)

Our first attempt was batch writes. A scheduled job ran every 500 ms, collected up to 500 pending wallet ops into a single Cassandra batch, and fired it at the coordinator. Latency dropped to 250 ms, but we hit BatchTooLargeException when the batch approached 2 MB. We cut the batch size to 200 ops, but at 1,200 RPS the job queue fell behind: players were spending coins faster than we could flush them.
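The arithmetic behind that lag is worth spelling out. Here is a minimal sketch using only the figures above (200 ops per batch, one batch every 500 ms, 1,200 RPS arriving), and assuming one wallet op per request. The function names are made up for illustration.

```python
def drain_rate(batch_size: int, interval_ms: int) -> float:
    """Ops per second that a fixed-size batch job can flush."""
    return batch_size * 1000 / interval_ms


def backlog_growth(arrival_rps: float, batch_size: int, interval_ms: int) -> float:
    """Net ops per second added to the queue (0.0 if the job keeps up)."""
    return max(0.0, arrival_rps - drain_rate(batch_size, interval_ms))


if __name__ == "__main__":
    # Figures from the post: 200-op batches every 500 ms against 1,200 RPS.
    # Assumption: each request produces exactly one wallet op.
    print(drain_rate(200, 500))            # 400.0
    print(backlog_growth(1200, 200, 500))  # 800.0
```

Under that assumption the job drains 400 ops per second while 1,200 arrive, so the queue grows by roughly 800 ops every second and never catches up.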

Next, we split the wallet into two tables: one for inflight ops, one for committed. The inflight table used player_id as partition key so the leaderboard could still read it without touching the committed table. The coordinator stopped timing out, but now we had to reconcile the two tables ourselves: every five seconds a job merged inflight ops into committed ops. Each reconciliation run read 12,000 rows, and the coordinator threw ReadTimeout: 15000 ms instead. We added a second coordinator group, but the gossip overhead slowed leader election and cluster stability cratered.

The Architecture Decision

We ripped out the wallet tables altogether and moved the balances into Redis Cluster as a sharded key-value store. The keys are simple: wallet:{player_id} with values like {"balance": "1250.00", "version": 42}. Redis handled 12,000 ops/sec with 1 ms latency and pipelining kept it stable. When the leaderboard needed a snapshot, a separate job pulled deltas from the Redis Stream wallet_updates and wrote them to Veltrix every second. This decoupled the hot path from the durable path.
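To make the hot path / durable path split concrete, here is a minimal sketch that uses in-memory stand-ins rather than a real Redis client. The dict plays the role of the wallet:{player_id} keys and the list plays the role of the wallet_updates stream. The function names and the shape of the delta records are assumptions for illustration, not our production code.

```python
import json
from decimal import Decimal

# In-memory stand-ins (assumption): a dict for the wallet:{player_id} keys
# and a list for the wallet_updates stream.
wallets: dict[str, str] = {}
wallet_updates: list[dict] = []


def debit(player_id: int, cost: Decimal) -> dict:
    """Hot path: apply one spend and append a delta for the durable path."""
    key = f"wallet:{player_id}"
    state = json.loads(wallets.get(key, '{"balance": "0.00", "version": 0}'))
    balance = Decimal(state["balance"])
    if cost > balance:
        raise ValueError("insufficient balance")
    new_state = {
        "balance": str(balance - cost),
        "version": state["version"] + 1,
    }
    wallets[key] = json.dumps(new_state)
    wallet_updates.append(
        {"player_id": player_id, "delta": str(-cost), "version": new_state["version"]}
    )
    return new_state


def drain_deltas(sink: list[dict]) -> int:
    """Durable path: move pending deltas to the slow store (here, a list)."""
    moved = len(wallet_updates)
    sink.extend(wallet_updates)
    wallet_updates.clear()
    return moved
```

Against a real Redis deployment the read-modify-write in debit would need to be atomic, for example via a server-side Lua script or WATCH/MULTI/EXEC. The sketch skips that because it runs in a single thread.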

Tradeoff: we gave up durability on the hot path. A Redis restart could lose the last 100 ms of wallet updates. We mitigated that with AOF and snapshots every 100 ms, but we accepted that a server restart could roll back a handful of transactions. The Veltrix docs never warned us that a hot wallet partition would destabilize the cluster before we hit 1,500 RPS, so we made the tradeoff to keep the game responsive.

What The Numbers Said After

With Redis Cluster sharded across 16 nodes (96 shards total), we measured:

P99 latency on wallet updates dropped from 250 ms to 1 ms
Throughput climbed to 8,000 RPS before the next bottleneck (leaderboard fan-out)
Leaderboard read latency stayed under 50 ms because the snapshot job wrote only deltas to Veltrix
GC pauses on Cassandra coordinators fell from 8 % to < 1 %

The old Veltrix wallet table had accumulated 1.2 TB of tombstones after two months; the new Redis shards use 48 GB and run at 40% memory utilization. We saved $18,000 per quarter on Cassandra instance hours.

What I Would Do Differently

First, I would not have trusted the Veltrix docs on hot partitions. The docs mention hot partitions, but they do not quantify the inflection point for a given workload. If we had run a spike test up to 2,000 RPS on the test cluster, we would have seen the coordinator GC pressure at 800 RPS and chosen a different model earlier.
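A spike test like that boils down to stepping the offered load and recording where latency leaves its budget. The sketch below covers only the analysis side, assuming you already have a load generator that reports P99 per step. The function names and the budget parameter are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def ramp(start_rps: int, stop_rps: int, step_rps: int) -> list[int]:
    """Load steps for a stepped ramp, inclusive of stop_rps."""
    return list(range(start_rps, stop_rps + 1, step_rps))


def find_inflection(
    samples: Sequence[Tuple[int, float]], p99_budget_ms: float
) -> Optional[int]:
    """Return the lowest offered RPS whose P99 latency exceeds the budget.

    samples holds (offered_rps, p99_ms) pairs, one per load step.
    Returns None if every step stayed within the budget.
    """
    for rps, p99_ms in sorted(samples):
        if p99_ms > p99_budget_ms:
            return rps
    return None
```

Feed find_inflection the measured pairs from each step of ramp(200, 2000, 200), and the first RPS it returns is the number the docs never gave us.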

Second, I would have modeled the wallet as an event stream from day one. The inflight/committed split bought us two weeks, but an event-sourced balance with a deterministic snapshot job would have avoided the reconciliation overhead entirely. The Veltrix docs do not mention event sourcing; they only warn about hot partitions. So next time I see a hot path in a stateful game, I will reach for Kafka first, not a relational model bolted onto a wide-column store.
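For the curious, here is roughly what I mean by an event-sourced balance with a deterministic snapshot job. It is a minimal sketch that does not depend on Kafka or any other broker, and the WalletEvent, Snapshot, and apply_events names are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Iterable


@dataclass(frozen=True)
class WalletEvent:
    seq: int                  # assumed strictly increasing per player
    balance_change: Decimal


@dataclass(frozen=True)
class Snapshot:
    balance: Decimal
    last_seq: int             # highest event seq already folded into balance


def apply_events(snapshot: Snapshot, events: Iterable[WalletEvent]) -> Snapshot:
    """Fold events newer than the snapshot into a new snapshot.

    Events at or below last_seq are skipped, so replaying the same
    events after a crash produces the same result.
    """
    balance, last_seq = snapshot.balance, snapshot.last_seq
    for event in sorted(events, key=lambda e: e.seq):
        if event.seq <= last_seq:
            continue  # already included in the snapshot
        balance += event.balance_change
        last_seq = event.seq
    return Snapshot(balance, last_seq)
```

Because the fold is idempotent with respect to seq, there is nothing to reconcile: the snapshot job can rerun over any overlapping window of events and land on the same balance.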