The Veltrix Treasure Hunt Engine: Why Our First Rewrite Cost Us 3.2 Million Requests Per Second

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The product goal was simple: every player who walks into a building on the map should see the same treasure list within 300 ms. We translated that into a consistency contract: strong consistency on the treasure list keyed by building-ID, and eventual consistency on the global leaderboard that ranks players by total coins collected. The problem was that the engine we inherited from the mobile team assumed eventual consistency everywhere. Their Redis Cluster v6.2.6 shards were sized for 80k ops/sec, and they used Lua scripts to merge deltas on the client. When the Royale drop pushed 1.2M concurrent connections at 00:00 UTC, the Lua scripts collided with Redis's single-threaded event loop: a running script blocks its shard, so every other command queues behind it. We saw 47k script timeouts per minute and a P99 latency of 4.2 seconds on the treasure-list endpoints.

What We Tried First (And Why It Failed)

Our first rollout kept the Lua merges but moved the treasure lists to a Go service backed by a single PostgreSQL 14 cluster with pgbouncer 1.17.0 connection pooling. We figured that strong consistency on the treasure list would be easier to reason about than distributed CRDTs. The migration script ran at 20:00 UTC the night before the drop. Eight minutes in, the write-ahead log started to stall because the WAL receiver could not keep up with the 45k INSERTs/sec coming from the Lua scripts. The DBA on call increased max_wal_size to 4 GB, which only delayed the inevitable. At 21:42 UTC the leader restarted, and the cluster spent 3 minutes in split-brain while pg_rewind fought to reconcile the standby nodes. By the time the service came back, the Lua scripts had enqueued a backlog of 1.9 million treasure events. The Go service fell over trying to replay them through logical decoding, and we hit an OOM at 32 GB RSS.

The Architecture Decision

We ripped out the Lua scripts and replaced the treasure list store with a partitioned RocksDB 8.7.0 tier that we called the Cellar. Each building-ID mapped to one sparse SST file that we updated via a write-behind log, backed by a local WAL rotated every 100 ms. The Cellar was sharded 64 ways across NVMe volumes, giving us 320k ops/sec per node at under 2 ms P95. We fronted the Cellar with a single Envoy 1.26.0 proxy that implemented a consistent-hash policy on building-ID. Downstream, we kept PostgreSQL only for the global leaderboard; we added a TimescaleDB 2.12.0 hypertable partitioned by player-ID so that the 12 million active players stayed within ~300 GB of hot data. The Timescale instance ran on AWS RDS i3.8xlarge with 2 TB gp3 disks and a 30k IOPS burst credit.
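
To make the consistent-hash routing on building-ID concrete, here is a minimal Go sketch of a hash ring with virtual nodes. It is an illustration of the general technique, not Envoy's implementation or our production config; the node names (cellar-0 and so on), the FNV hash, and the replica count are all assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring with virtual nodes.
type Ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> node name
}

// hashKey maps a string to a point on the ring (FNV-1a is an arbitrary choice).
func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places `replicas` virtual points per node on the ring.
func NewRing(nodes []string, replicas int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < replicas; i++ {
			p := hashKey(fmt.Sprintf("%s#%d", n, i))
			if _, taken := r.owner[p]; taken {
				continue // rare hash collision: the first owner keeps the point
			}
			r.owner[p] = n
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// NodeFor returns the node owning the first ring point at or after the
// key's hash, wrapping around at the end. It returns "" for an empty ring.
func (r *Ring) NodeFor(buildingID string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(buildingID)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around
	}
	return r.owner[r.points[i]]
}

func main() {
	// Hypothetical node names; the real Cellar topology is not shown here.
	ring := NewRing([]string{"cellar-0", "cellar-1", "cellar-2"}, 64)
	fmt.Println("building-1138 ->", ring.NodeFor("building-1138"))
}
```

The property that matters for a store keyed by building-ID is that removing or adding a node only remaps the keys that node owned, so most buildings keep hitting the same shard.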

The global winner-notification fanout was the first place we accepted eventual consistency. We switched from WebSockets to NATS 2.9.21 JetStream with a 5-minute deduplication window. Each player subscribed to exactly one JetStream subject: user.. That meant we could replay missed notifications without flooding the clients. The only strong-consistency requirement we kept was that a single write to the Cellar for a building had to be visible to all players before the notification fanout completed. We achieved that by making the Cellar write synchronous in the HuntMaster, while the Timescale leaderboard writes were asynchronous and retried with exponential backoff.

We also introduced a local cache layer, with Dragonfly 1.8.1 acting as an L1 shard for each Envoy instance. The cache TTL was 50 ms, the same as the timeout we gave the Envoy circuit breakers. We tuned the hop-by-hop retry budget to 3 attempts before failing the request to the client, which capped our tail latency at 220 ms P99 even when the Cellar was under 230k concurrent reads.

What The Numbers Said After

The Winter Royale drop went live at 00:00 UTC on 15 December 2025. In the first hour we ingested 2.9 billion treasure updates. Our scrape job on the HuntMaster showed a steady 3.2M requests/sec on the write path with no flapping. The Cellar nodes reported 78k ops/sec per shard at 2.1 ms P95 latency. The PostgreSQL cluster on the leaderboard side handled 420k INSERTs/sec with a write latency of 160 ms at P95 and 1.2 seconds at P99. NATS JetStream delivered 1.8 million winner notifications in the first 2 minutes without a single NACK. The client error rate stayed below 0.04% across all regions.

The cost side was brutal: the Cellar nodes alone ran on 64 r6i.2xlarge instances, each costing $1.092 per hour, or ~$1,680/day in total. The NATS JetStream cluster added another $840/day for 9 m5.2xlarge brokers with 5 TB of gp3 storage each. We saved money by collapsing the Redis Cluster entirely and by moving TimescaleDB to cheaper i3.2xlarge spot instances at $0.24/hour, reducing the leaderboard bill from $3.1k/day to $1.2k/day.
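
For anyone who wants to check the daily figures, the arithmetic is just instances × hourly rate × 24. The short Go sketch below reruns it using only the numbers quoted in this section; the helper name dailyCost is ours, and the per-broker rate is merely what the $840/day figure implies, not a quoted AWS price.

```go
package main

import "fmt"

// dailyCost returns the dollars spent running `instances` machines
// for 24 hours at `hourly` dollars per instance-hour.
func dailyCost(instances int, hourly float64) float64 {
	return float64(instances) * hourly * 24
}

func main() {
	// Cellar: 64 instances at the quoted $1.092/hour.
	cellar := dailyCost(64, 1.092)
	fmt.Printf("Cellar:              $%.2f/day\n", cellar)

	// NATS: $840/day across 9 brokers implies this all-in hourly rate
	// per broker (instance plus storage).
	natsDaily := 840.0
	fmt.Printf("NATS per broker:     $%.2f/hour\n", natsDaily/(9*24))

	// Leaderboard: quoted before and after figures.
	before, after := 3100.0, 1200.0
	fmt.Printf("Leaderboard savings: $%.0f/day\n", before-after)

	// Total of the three quoted line items after the change.
	fmt.Printf("Total:               $%.2f/day\n", cellar+natsDaily+after)
}
```

At the quoted hourly rate the Cellar line works out to about $1,677/day, and the leaderboard change saves about $1,900/day.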