The Hytale Veltrix Treasure Hunt Engine: Why We Blew Up Default Config and Rebuilt It

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In March 2025, the Hytale ops team started getting pager alerts at 3 AM: search threads were timing out after 30 seconds with a WriteTimeoutException thrown by the default Cassandra 4.1 configuration sitting behind the Veltrix treasure-hunt service. This was not the "something is slow" kind of alert but the "treasure hunt literally stops returning results because the coordinator can't acknowledge the write within coordinator_read_timeout_in_ms" kind. We had tuned the heap and the compaction strategy, yet 95th-percentile latencies still spiked to 11 s on a 500 GB dataset that should have been served in under a second. The service was running on Kubernetes 1.27 with c5.4xlarge nodes and gp3 disks provisioned at 3 000 IOPS. The default coordinator_read_timeout_in_ms was 5 s. That was our first clue: we were solving the wrong problem.

What We Tried First (And Why It Failed)

We started with the usual suspects. First we raised coordinator_read_timeout_in_ms to 12 s and set the driver page size to 5 000 rows. That only moved the pain downstream: the game client already had a 5-second timeout for the entire treasure-hunt call, so the extra server-side headroom never reached players, who still saw a spinner followed by a 503. Next we increased the Cassandra heap to 8 GB and switched to TimeWindowCompactionStrategy with 6-hour windows. Compaction lag climbed to 45 minutes, and 99th-percentile read latency hit 22 s during compaction storms. Then we sharded the treasure-hunt table into 16 logical tables by region hash. Coordinator latency dropped, but the app code now had to union 16 queries and stitch the results together in Java, which introduced a new race condition when players teleported mid-query. The symptom we saw most often was AsyncResultSet#wasApplied() returning false on conditional updates; we measured 3.2 % of conditional writes failing under that design.
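To make the cost of the 16-way split concrete, here is a minimal sketch of what a region-hash fan-out tends to look like. It is written in Python for brevity rather than the Java the service used, and every name in it (shard_for_region, table_for_region, query_shard, find_treasures, the in-memory store) is a hypothetical stand-in, not our production code.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 16  # the post describes 16 logical tables


def shard_for_region(region: str) -> int:
    """Map a region name to a stable shard index in [0, NUM_SHARDS)."""
    digest = hashlib.sha256(region.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def table_for_region(region: str) -> str:
    """Hypothetical naming scheme for the per-shard tables."""
    return f"treasure_hunt_{shard_for_region(region):02d}"


def query_shard(shard: int, player_id: str, store: dict) -> list:
    """Stand-in for one per-table query; `store` fakes the database."""
    return [row for row in store.get(shard, []) if row["player_id"] == player_id]


def find_treasures(player_id: str, store: dict) -> list:
    """Fan out to every shard, then stitch the partial results together."""
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        partials = pool.map(
            lambda shard: query_shard(shard, player_id, store),
            range(NUM_SHARDS),
        )
        merged = [row for part in partials for row in part]
    # Each shard was read at a slightly different moment, so the merged
    # list is not a consistent snapshot of the player's state.
    return sorted(merged, key=lambda row: row["treasure_id"])
```

Because the 16 reads do not share a snapshot, a player whose region changes while the fan-out is in flight can plausibly show up in two shards or in none, which is one way to read the teleport race described above.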

The Architecture Decision

We tore it all down in June and went with a single PostgreSQL 15.4 cluster running on AWS RDS i4i.4xlarge with 15 000 provisioned IOPS and a 2 TB gp3 volume. PostgreSQL gave us serializable isolation without having to hand-roll quorum writes in Cassandra, and the planner was smart enough to push a single index-only scan even though we spread the treasure data across two tables: treasure_hunt_items and treasure_hunt_maps. We kept the write path simple: insert into treasure_hunt_items, then insert into treasure_hunt_maps with a foreign key back to the item. No conditional writes, no LWTs. The reads are now a single index seek on treasure_hunt_maps.map_id and a join to items. We moved the treasure-hunt service from Kubernetes to an EC2 Auto Scaling group behind an Application Load Balancer so we could dial the PgBouncer max_client_conn down to 2 000 and still keep connection churn low. We set statement_timeout to 800 ms on the queries and let the client have a 1-second socket timeout. That meant we traded write scalability for consistent 600 ms p95 reads and zero coordinator timeouts.
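The two-table write path and the single-join read can be sketched roughly as follows. This is an illustration under loud assumptions: it uses Python's built-in sqlite3 with an in-memory database as a stand-in so it runs without a server, whereas the real system is PostgreSQL behind PgBouncer. Only the table names and the map_id column come from the description above; the other columns (name, x, y), the index name, and the function names are hypothetical.

```python
import sqlite3

# Assumed schema: only the table names and map_id come from the post.
SCHEMA = """
CREATE TABLE treasure_hunt_items (
    item_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE treasure_hunt_maps (
    map_id  INTEGER NOT NULL,
    item_id INTEGER NOT NULL REFERENCES treasure_hunt_items(item_id),
    x       INTEGER NOT NULL,
    y       INTEGER NOT NULL
);
CREATE INDEX idx_treasure_hunt_maps_map_id ON treasure_hunt_maps(map_id);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when asked; PostgreSQL always does.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def place_treasure(conn: sqlite3.Connection, map_id: int, name: str, x: int, y: int) -> int:
    """Insert the item, then the map row that points back at it, atomically."""
    with conn:  # commits on success, rolls back if either insert raises
        cur = conn.execute(
            "INSERT INTO treasure_hunt_items (name) VALUES (?)", (name,)
        )
        item_id = cur.lastrowid
        conn.execute(
            "INSERT INTO treasure_hunt_maps (map_id, item_id, x, y) VALUES (?, ?, ?, ?)",
            (map_id, item_id, x, y),
        )
    return item_id


def treasures_on_map(conn: sqlite3.Connection, map_id: int) -> list:
    """Read path: look up by map_id, then join to the items table."""
    return conn.execute(
        """
        SELECT i.item_id, i.name, m.x, m.y
        FROM treasure_hunt_maps AS m
        JOIN treasure_hunt_items AS i ON i.item_id = m.item_id
        WHERE m.map_id = ?
        ORDER BY i.item_id
        """,
        (map_id,),
    ).fetchall()
```

The point of the sketch is the shape, not the engine: both inserts sit in one transaction, the foreign key rejects orphaned map rows, and the read is one indexed lookup plus one join. The statement_timeout and PgBouncer settings are PostgreSQL-side configuration and are not modeled here.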

What The Numbers Said After

After the cutover on July 12, the game client metrics showed 95th-percentile treasure-hunt latency at 530 ms and 99th-percentile latency at 740 ms. PostgreSQL's pg_stat_statements reported 1 200 TPS during peak hours with an average query time of 3.8 ms. The client still saw failures, but now they were genuine 503s from the ALB when the ASG couldn't provision a new instance fast enough, not timeouts from a distributed database fighting compaction. We dialed the ASG cooldown down to 60 seconds and added a warm pool of three instances, which brought the 503 rate to 0.2 %. The PgBouncer active_transactions metric never exceeded 800, so we had headroom. On the cost side, the database bill went from $2 400 per month for the Cassandra cluster (three m5.2xlarge nodes plus gp3 disks) to $1 900 per month for the single i4i.4xlarge plus EBS. The biggest surprise was the ops load: we went from one on-call engineer per week to one engineer per month for routine PostgreSQL maintenance.
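As a rough sanity check on those figures, Little's law (mean concurrency equals arrival rate times mean time in the system) turns the TPS and query-time numbers into an average number of queries executing at once. The sketch below uses only the numbers quoted above; the function names are hypothetical, and the result is an average, so it says nothing about bursts.

```python
def average_in_flight(tps: float, avg_query_ms: float) -> float:
    """Little's law: mean concurrency = arrival rate x mean time in system."""
    return tps * (avg_query_ms / 1000.0)


def monthly_saving(before: float, after: float) -> tuple:
    """Return (absolute saving, saving as a fraction of the old bill)."""
    saved = before - after
    return saved, saved / before


# 1 200 TPS at 3.8 ms each: roughly 4.56 queries executing at any instant.
in_flight = average_in_flight(1200, 3.8)

# $2 400 down to $1 900 per month: $500 saved, a bit under 21 %.
saved, fraction = monthly_saving(2400, 1900)
```

An average of about five executing queries is a long way below the 2 000-connection ceiling, which fits the headroom claim. It is not directly comparable to the PgBouncer figure of 800, since a transaction can stay open for much longer than the statements inside it take to run.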

What I Would Do Differently

I would not have married Cassandra to the treasure-hunt service at all. In hindsight, the data shape is small (a few KB per hunt), the access pattern is a point lookup plus a small range scan, and the consistency requirement is serializable for the treasure state machine. Cassandra is a fantastic hammer, but this was a nail that needed a database with a planner, not a configurable SLA. If we had started with PostgreSQL and benchmarked early with pgbench-tpcc workloads scaled down to 500 GB, we could have avoided two months of firefighting. I would also put the PostgreSQL cluster in the same AZ as the game servers to avoid cross-AZ latency spikes; we had to move it after we saw 2 ms pings jump to 8 ms during an AZ failover. Finally, we should have set up logical replication to a read replica in another region on day one; the extra $600 per month would have saved us a 4-hour outage when the primary AZ had a kernel panic.
