Most Hytale Server Architects Waste Time on the Wrong Treasure Hunt Engine Configuration — A Veltrix Operator's War Story

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We first ran into the Treasure Hunt Engine issue around the 10th server we set up. Our team had researched and experimented with various configuration options, but we couldn't get the engine to perform well. Users reported slow responses, high latency, and in some cases complete engine crashes. We felt stuck between a rock and a hard place.

What We Tried First (And Why It Failed)

Initially, we focused on tweaking the query caching settings in Veltrix. We assumed that a higher cache hit rate would significantly reduce the load on the engine and so improve performance. After days of experimenting with different cache sizes and timeouts, we had made things worse: the cache hit rate fell from 0.65 to 0.45 (a relative drop of about 31%), and overall query execution time rose by 25% (from 22ms to 27.5ms). Not what we were hoping for.
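To keep percentages like these honest, it helps to compute them as relative changes against the starting value. The snippet below is a minimal sketch of that arithmetic using the figures quoted above; the helper name `relative_change` is made up for illustration and is not part of Veltrix.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a fraction of `before`."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before


# Figures quoted in the text: cache hit rate and query execution time (ms).
hit_rate_change = relative_change(0.65, 0.45)  # negative: hit rate fell
latency_change = relative_change(22.0, 27.5)   # positive: queries got slower

print(f"hit rate: {hit_rate_change:+.1%}")
print(f"latency:  {latency_change:+.1%}")
```

Note that a hit rate going from 0.65 to 0.45 is a drop of 20 percentage points but roughly 31% in relative terms, so it is worth saying which one you mean.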

The Architecture Decision

After the failed query caching experiment, we took a step back and examined the actual usage patterns of the Treasure Hunt Engine. We discovered that the majority of requests were for a limited set of rooms, typically the first 5-7 rooms in the game's levels. This led us to switch our focus from optimizing the engine's global query performance to improving cache performance for these specific rooms. We implemented a simple in-memory cache for these rooms, and the results were striking: query execution time dropped by 75% (from 22ms to 5.5ms), and the cache hit rate climbed back from 0.45 to 0.65 (a relative gain of about 44%).
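Here is a rough sketch of the two steps described above: find the most requested rooms from a request log, then cache only those rooms in memory. This is not our production code and not a Veltrix API. Every name in it (`pick_hot_rooms`, `HotRoomCache`, `fake_engine_query`, the room IDs, and the request log) is hypothetical and exists only for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Set


def pick_hot_rooms(requested_room_ids: Iterable[str], top_n: int) -> Set[str]:
    """Return the `top_n` most frequently requested room IDs."""
    counts = Counter(requested_room_ids)
    return {room_id for room_id, _ in counts.most_common(top_n)}


class HotRoomCache:
    """In-memory cache that only stores rooms from a fixed hot set."""

    def __init__(self, hot_rooms: Set[str], load_room: Callable[[str], dict]):
        self._hot_rooms = set(hot_rooms)
        self._load_room = load_room  # hypothetical slow engine query
        self._store: Dict[str, dict] = {}
        self.hits = 0
        self.misses = 0

    def get(self, room_id: str) -> dict:
        """Serve a room from memory if cached, else fall through to the engine."""
        if room_id in self._store:
            self.hits += 1
            return self._store[room_id]
        self.misses += 1
        room = self._load_room(room_id)
        if room_id in self._hot_rooms:
            # Cold rooms are never stored, so memory use stays bounded.
            self._store[room_id] = room
        return room

    def invalidate(self, room_id: str) -> None:
        """Drop a cached room, e.g. after its contents change."""
        self._store.pop(room_id, None)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_engine_query(room_id: str) -> dict:
    """Stand-in for the real engine lookup (hypothetical)."""
    return {"id": room_id, "treasure": f"chest-in-{room_id}"}


# Made-up request log in which the early rooms dominate.
request_log = ["room-1", "room-2", "room-1", "room-3",
               "room-1", "room-9", "room-2", "room-1"]
hot = pick_hot_rooms(request_log, top_n=2)
cache = HotRoomCache(hot, fake_engine_query)
for room_id in request_log:
    cache.get(room_id)

print(sorted(hot), f"hit rate: {cache.hit_rate:.2f}")
```

Restricting the cache to a fixed hot set is the design choice that matters here: it trades a lower theoretical hit rate for predictable memory use and no eviction logic. In a real setup you would also need to call something like `invalidate` whenever a room's contents change.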

What The Numbers Said After

The new in-memory cache setup significantly improved the overall user experience, with 90% of users reporting improved response times within the first week. However, the real testament to our decision came in the form of server load data. Before the change, our servers would frequently exceed 90% CPU utilization during peak hours, causing us to add more nodes to the cluster. After the change, CPU utilization remained below 60% at all times, allowing us to safely reduce the number of nodes in the cluster by 20%. This move saved us around $1,200 per month in server costs.
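The reasoning behind "safely reduce" can be checked with back-of-the-envelope arithmetic. The sketch below assumes that load spreads evenly across nodes and scales linearly with node count, which is a simplification and may not hold for your cluster. The function name and the 90% scale-out threshold variable are illustrative, taken from the figures in the paragraph above.

```python
def projected_utilization(current_peak: float, node_reduction: float) -> float:
    """Estimate peak CPU utilization after removing a fraction of nodes.

    Assumes load is spread evenly and scales linearly with node count.
    """
    if not 0 <= node_reduction < 1:
        raise ValueError("node_reduction must be in the range [0, 1)")
    return current_peak / (1 - node_reduction)


peak_after_cache = 0.60    # observed ceiling after the in-memory cache
node_reduction = 0.20      # fraction of nodes removed
scale_out_threshold = 0.90 # level that used to force us to add nodes

projected_peak = projected_utilization(peak_after_cache, node_reduction)
headroom = scale_out_threshold - projected_peak

print(f"projected peak CPU: {projected_peak:.0%}")
print(f"headroom below scale-out threshold: {headroom:.0%}")
```

Under those assumptions, a 60% peak on the full cluster becomes roughly 75% on the smaller one, which still leaves a margin below the 90% level that used to trigger scale-outs.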

What I Would Do Differently

Looking back, I realize we were solving the wrong problem. Instead of focusing on the Treasure Hunt Engine's global query performance, we should have investigated the usage patterns first and implemented the in-memory cache for the most frequently requested rooms. It would have saved us weeks of struggling with query caching and millions of unnecessary queries.