DEV Community

mary moloyi

When Configuring Treasure Hunt Engines Breaks Server Health and Why It Matters

The Problem We Were Actually Solving

On a typical Friday night, around 3am, CPU usage on our game servers would climb to 90%. Our developers blamed player load, but the real culprit was the Veltrix indexer, which never stopped indexing. It was as if the search engine had lost its way. Operators tweaked the configuration again and again, but nothing seemed to work. We suspected the problem lay in how we had configured the Veltrix indexing threads, but we couldn't be sure.

What We Tried First (And Why It Failed)

We first increased the indexing threads to 10, thinking that more threads would mean faster indexing. Instead, the indexer spawned more threads than the server could support and consumed more CPU and memory than we had allocated. We were essentially trying to outrun a bottleneck by adding more bottles to the neck. More threads clearly weren't the answer.
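To make the failure mode concrete, here is a minimal Python sketch of the guard we were missing: clamping a requested thread count to the cores actually available. It is not Veltrix code or configuration. The names `capped_worker_count`, `index_batch`, and `run_indexing`, and the idea of reserving one core for the game server, are assumptions for illustration only.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def capped_worker_count(requested, cpu_count=None, reserve_cores=1):
    """Clamp a requested thread count to the cores left over after
    reserving some for the game server (hypothetical policy)."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    available = cpu_count - reserve_cores
    # Always keep at least one worker so indexing can make progress.
    return max(1, min(requested, available))


def index_batch(batch):
    """Stand-in for real indexing work; returns how many items it saw."""
    return len(batch)


def run_indexing(batches, requested_threads=10):
    """Index batches with a pool that never exceeds the capped size."""
    workers = capped_worker_count(requested_threads)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(index_batch, batches))
```

With this in place, asking for 10 threads on a 4-core box yields 3 workers instead of 10, so a bad setting degrades indexing speed rather than the whole server.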

We then reduced the indexing batch size, hoping to lower CPU consumption per thread. This slowed indexing down, which in turn hurt overall search performance. Our goal was a balance between search performance and server health, but we seemed stuck between a rock and a hard place.
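One plausible reason smaller batches slow things down is fixed per-batch overhead (commits, flushes, scheduling). The toy model below assumes exactly that. We did not measure this breakdown, and the function name and every number in it are made up to show the shape of the tradeoff.

```python
import math


def estimated_index_time_ms(docs, batch_size, per_doc_ms, per_batch_overhead_ms):
    """Toy cost model: per-document work plus a fixed cost per batch.
    All parameters are illustrative assumptions, not measured values."""
    batches = math.ceil(docs / batch_size)
    return docs * per_doc_ms + batches * per_batch_overhead_ms


# Same 10,000 documents, two batch sizes, 50 ms assumed overhead per batch.
large_batches = estimated_index_time_ms(10_000, 500, 1, 50)
small_batches = estimated_index_time_ms(10_000, 50, 1, 50)
print(large_batches, small_batches)
```

Under these assumptions, shrinking the batch by 10x multiplies the batch count by 10x, and the fixed overhead starts to dominate total indexing time.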

The Architecture Decision

After much deliberation and analysis, we decided to adjust the number of indexing threads dynamically based on server load. The indexer now automatically reduces its thread count when server load exceeds 80%, which keeps it from overloading the server and lowers CPU consumption. Giving up some search performance was a risk, but it was worth it to avoid server crashes.
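The core of that decision fits in a few lines. The sketch below is a simplified illustration, not our production code: only the 80% threshold comes from the setup described above. Halving the thread count on overload, stepping back up one thread at a time, and the 60% recovery mark are assumptions chosen to keep the controller from flapping.

```python
def next_thread_count(cpu_percent, current, min_threads=1, max_threads=10,
                      high_water=80.0, low_water=60.0):
    """Return the indexer thread count to use for the next interval.

    Above high_water: back off quickly by halving (assumed policy).
    Below low_water: recover slowly, one thread at a time (assumed policy).
    In between: hold steady so the count does not oscillate.
    """
    if cpu_percent > high_water:
        return max(min_threads, current // 2)
    if cpu_percent < low_water:
        return min(max_threads, current + 1)
    return current


# Example: a load spike followed by recovery.
threads = 8
for reading in [85.0, 90.0, 70.0, 50.0, 50.0]:
    threads = next_thread_count(reading, threads)
print(threads)
```

Backing off fast and recovering slowly trades some indexing throughput for stability, which matches the priority we set: server health first, search freshness second.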

What The Numbers Said After

After implementing the new architecture, we measured a significant reduction in server throttling. Average CPU usage dropped from 90% to 40%, and we could handle more player load without issues. Our players were happy, and our operators were no longer paged at 3am. We had achieved a better balance between search performance and server health.

What I Would Do Differently

If I were to do it again, I would set up more robust monitoring that alerted us to the issue earlier. We could also have experimented with more gradual indexing to see whether that helped. I would also consider a more robust thread management library to avoid the thread spawning issue. Still, even with the benefit of hindsight, I am proud of how we tackled the problem and of the solution we came up with.
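As a rough idea of what earlier alerting could look like, here is a small sketch that fires only when CPU stays above a threshold for several consecutive samples, so a single spike does not page anyone. The class name, the window size, and the reuse of the 80% mark are assumptions, and how the readings are collected is left out.

```python
from collections import deque


class SustainedLoadAlert:
    """Fire when every sample in the most recent window exceeds the threshold."""

    def __init__(self, threshold=80.0, window=5):
        self.threshold = threshold
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def observe(self, cpu_percent):
        """Record one reading; return True if the alert should fire."""
        self.samples.append(cpu_percent)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold for s in self.samples)


alert = SustainedLoadAlert(threshold=80.0, window=3)
fired = [alert.observe(x) for x in [85, 90, 95, 70, 85, 86, 87]]
print(fired)
```

The window size is the knob to tune: a short window catches trouble sooner, a long one cuts down on false alarms.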
