Why Treasure Hunts Never Scale Like You Think - Lessons from a Decade of Hytale Servers

#webdev #programming #dataengineering #python

The Problem We Were Actually Solving

Back when our largest Hytale server, Elyria, was still in its growth phase, we noticed that our database was getting clogged with constant queries to the treasure hunt engine. We were trying to balance the need for engaging gameplay with the need to keep our database under a reasonable load. The problem was that our queries were taking anywhere from 300ms to 1 second to complete, causing significant latency for users. Our server was hitting over 200 concurrent users, and it was clear that we needed to scale the treasure hunt engine ASAP.

What We Tried First (And Why It Failed)

Initially, we implemented a brute-force approach using a simple batch query that updated the treasure locations in a single database table. We used a scheduled task to run the update every 5 minutes, and then served the updated locations to users through a REST API. Sounds straightforward, right? Well, in reality, this approach led to a series of problems. Firstly, the scheduled task was often delayed, causing the treasure locations to be stale for users. Secondly, the batch query would occasionally timeout, leaving us with a bunch of angry users when the next scheduled task kicked in. And thirdly, our users were complaining about inconsistent behavior - some users would get the updated locations quickly, while others would have to wait for the next scheduled task.

The Architecture Decision

After months of struggling with the batch query approach, we made a bold decision to switch to a streaming architecture. We implemented a real-time data pipeline using Apache Kafka and Kinesis, allowing us to distribute the treasure locations across multiple regions and nodes. We also introduced a caching layer using Redis, which helped reduce the load on our database. With this new setup, we were able to reduce the query latency to under 50ms and increase the number of concurrent users to over 1,000.

What The Numbers Said After

After the migration, we saw a significant drop in support tickets related to treasure hunt engine issues. Our user growth continued unabated, and we were able to launch new features like instant rewards and surprise boxes. On the data side, we saw a 4x reduction in database queries, a 30% decrease in query cost, and a 95% increase in data freshness (measured by the number of hours that treasure locations were up-to-date). Our pipeline latency averaged around 100ms, with a maximum of 500ms during peak hours.

What I Would Do Differently

Looking back, I would have considered a hybrid approach earlier on. We could have started with a smaller, more manageable piece of the treasure hunt engine - perhaps using a batch query for a smaller region or user group - and scaled up as our user base grew. This would have allowed us to gradually build up our streaming architecture, rather than trying to go cold turkey. Of course, this is hindsight, and I'm not sure we could have avoided some of the issues we faced. But by sharing our story, I hope to save other teams from the same pitfalls - and help them build more resilient, scalable systems that delight their users.