I Still Can't Believe We Spent 6 Months Tuning Our Treasure Hunt Engine for Hytale Servers

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

I work on the Veltrix team, where we run a large-scale Hytale server with thousands of players participating in treasure hunts every day. Our initial implementation of the treasure hunt engine was a simple batch process that ran every hour, updating the treasure locations and sending notifications to players. However, as our player base grew, we started to notice that the engine was causing significant latency issues, with some players experiencing delays of up to 30 minutes between finding a treasure and receiving their rewards. Our team was tasked with re-architecting the treasure hunt engine to reduce latency and improve the overall player experience.

What We Tried First (And Why It Failed)

Our first approach was to try to optimize the existing batch process by increasing the frequency of the updates and adding more powerful hardware to our servers. We went from running the process every hour to every 15 minutes, and we upgraded our servers to the latest generation of CPUs and added more memory. However, despite these changes, we still saw significant latency issues, and our servers were running at nearly 100% utilization. We realized that our batch process was not scalable and that we needed a more fundamental change to our architecture. We also tried to use a message queue to handle the notifications, but we ended up with a backlog of thousands of messages that were never processed. It was clear that we needed a different approach.
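A rough way to see why a shorter interval alone could not fix this: in a batch design, an event has to wait for the next run before any processing starts. The sketch below assumes finds arrive uniformly over time and that each run completes instantly, which is optimistic given the near-100% utilization. The function name is made up for illustration.

```python
def batch_wait_minutes(interval_minutes: float) -> tuple[float, float]:
    """Return (average, worst-case) wait before the next batch run starts.

    Assumes events arrive uniformly at random within the interval and
    ignores the time the batch itself takes to run.
    """
    average = interval_minutes / 2
    worst = interval_minutes
    return average, worst


# Compare the original hourly schedule with the 15-minute schedule.
for interval in (60, 15):
    avg, worst = batch_wait_minutes(interval)
    print(f"every {interval} min: avg wait {avg} min, worst case {worst} min")
```

Under these assumptions, the delay floor is set by the schedule itself, so faster hardware can shorten each run but not the wait for the next one.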

The Architecture Decision

We decided to switch to a streaming-based architecture for the treasure hunt engine. We chose Apache Kafka as our streaming platform and designed the system so that every treasure find event triggers an immediate update to the player's account and a notification to the player. We also added a caching layer using Redis to store treasure locations and player data, which greatly reduced the load on our database. This architecture lets us process events in real time and brought our latency down to less than 1 second. Finally, we added a data quality check at the ingestion boundary to ensure that all events are valid and consistent, which greatly reduced the number of errors in the system.
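To make the flow concrete, here is a minimal sketch of the per-event path: validate at the ingestion boundary, look up the treasure through a cache-aside layer, then credit and notify. It is not our production code. The event fields, the `TreasureCache` class, and `handle_event` are all hypothetical names, a plain dict stands in for Redis, and in a real deployment `handle_event` would be called from a Kafka consumer loop rather than directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreasureFound:
    """Hypothetical event shape; the real schema is not shown in this post."""
    player_id: str
    treasure_id: str
    found_at: float  # Unix timestamp in seconds


REQUIRED_FIELDS = {"player_id": str, "treasure_id": str, "found_at": (int, float)}


def validate(raw: dict) -> TreasureFound:
    """Reject malformed events before anything downstream sees them."""
    for name, expected in REQUIRED_FIELDS.items():
        value = raw.get(name)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"invalid or missing field: {name}")
    if not raw["player_id"] or not raw["treasure_id"]:
        raise ValueError("ids must be non-empty")
    if raw["found_at"] <= 0:
        raise ValueError("found_at must be a positive timestamp")
    return TreasureFound(raw["player_id"], raw["treasure_id"], float(raw["found_at"]))


class TreasureCache:
    """Cache-aside lookup. A dict stands in for Redis in this sketch."""

    def __init__(self, load_from_db):
        self._store = {}
        self._load_from_db = load_from_db
        self.hits = 0
        self.misses = 0

    def get(self, treasure_id: str):
        if treasure_id in self._store:
            self.hits += 1
            return self._store[treasure_id]
        # Miss: fall back to the database and remember the result.
        self.misses += 1
        value = self._load_from_db(treasure_id)
        if value is not None:
            self._store[treasure_id] = value
        return value


def handle_event(raw: dict, cache: TreasureCache, accounts: dict, notify) -> bool:
    """Process one event as it arrives: validate, credit, notify."""
    try:
        event = validate(raw)
    except ValueError:
        return False  # dropped at the ingestion boundary
    treasure = cache.get(event.treasure_id)
    if treasure is None:
        return False  # unknown treasure id
    accounts[event.player_id] = accounts.get(event.player_id, 0) + treasure["reward"]
    notify(event.player_id, f"You found {treasure['name']}!")
    return True


# Example usage with in-memory stand-ins for the database and notifier.
db = {"t1": {"name": "Golden Chest", "reward": 50}}
cache = TreasureCache(db.get)
accounts = {}
handle_event(
    {"player_id": "p1", "treasure_id": "t1", "found_at": 1700000000},
    cache,
    accounts,
    lambda player, message: print(player, message),
)
```

Keeping validation as the first step means a bad event costs one function call and never touches the cache, the account store, or the notifier.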

What The Numbers Said After

After rolling out the streaming-based architecture, we saw a sharp drop in latency and better overall reliability. Latency fell from as much as 30 minutes to less than 1 second, and server utilization dropped from nearly 100% to around 20%. Our error rate fell from 10% to less than 1%. The treasure hunt engine now processes over 10,000 events per minute, with a throughput of over 100 MB per second. The caching layer handles over 50,000 requests per minute, with a hit rate of over 90%. These numbers gave us confidence that the new architecture can handle the volume of events we see.
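For a sense of scale, the quoted figures can be turned into per-second and per-event rates. This is back-of-the-envelope arithmetic only: the inputs are the rounded floor values from the paragraph above ("over 10,000", "over 50,000", "over 90%"), so the results are approximations rather than measurements.

```python
EVENTS_PER_MINUTE = 10_000
CACHE_REQUESTS_PER_MINUTE = 50_000
CACHE_HIT_RATE_PERCENT = 90

# Events handled each second at the quoted event rate.
events_per_second = EVENTS_PER_MINUTE / 60

# Cache lookups issued per event, on average.
cache_requests_per_event = CACHE_REQUESTS_PER_MINUTE / EVENTS_PER_MINUTE

# Requests that miss the cache and fall through to the database.
# Integer math avoids floating-point noise in the result.
db_reads_per_minute = CACHE_REQUESTS_PER_MINUTE * (100 - CACHE_HIT_RATE_PERCENT) // 100

print(f"{events_per_second:.0f} events/s")
print(f"{cache_requests_per_event:.0f} cache requests per event")
print(f"{db_reads_per_minute} database reads/min at a {CACHE_HIT_RATE_PERCENT}% hit rate")
```

Since the real hit rate is above 90%, the database read figure is an upper bound at that request rate.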

What I Would Do Differently

Looking back, I would do several things differently if I had to re-architect our treasure hunt engine again. First, I would start with a streaming-based architecture rather than trying to optimize a batch process. Second, I would invest more time up front in a robust data quality check at the ingestion boundary, since it ended up being a critical component of our system. Third, I would choose a more scalable caching layer, such as a distributed cache, rather than a single-node Redis instance. Finally, I would invest more in monitoring and testing, which would have let us identify and fix issues more quickly. Despite these missteps, I am proud of what we accomplished, and I believe that our treasure hunt engine is now one of the most scalable and reliable in the industry.
