Veltrix Treasure Hunts Were a Consistency Nightmare Until We Rethought Service Boundaries

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

I still remember the day our treasure hunt engine at Veltrix started to show signs of strain. CPU utilization on our MongoDB instance was spiking to 90 percent, and the error logs were filled with deadlocks from concurrent updates. The cause was not just the volume of users but the fact that our event sourcing mechanism was tightly coupled with our state management. Every time a user found a treasure, the engine had to update the state of the game world, which in turn triggered a cascade of events to keep everything consistent. Our initial implementation combined Apache Kafka for event handling with MongoDB for state management. That worked well for a small user base, but as the numbers grew, so did the latency and the error rates.

What We Tried First (And Why It Failed)

Our first instinct was to optimize the existing implementation. We added more nodes to our MongoDB cluster and increased the partition count on our Kafka topics. We also added a Redis caching layer to reduce the load on the database. These changes provided only temporary relief, and soon we were back where we started. The cache helped with read-heavy workloads, but writes were still a bottleneck, and because the cache was not strongly consistent with the underlying database, we started seeing inconsistent data. It was clear that we needed a more fundamental change to our architecture. We experimented with Amazon DynamoDB as a replacement for MongoDB, but the cost of migration and the limitations of its query model made it a non-starter. We also looked at Apache Cassandra, but the operational complexity and the lack of support for transactions made it less appealing.
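To make the cache inconsistency concrete, here is a minimal sketch of how a cache can drift from the database. It assumes a cache-aside read path, which is a guess on my part about a typical setup rather than a description of the exact code we ran. Plain dictionaries stand in for Redis and MongoDB, and the key name and function names are hypothetical.

```python
# In-memory stand-ins (assumption): a real deployment would use Redis and
# MongoDB clients here. The key and function names are hypothetical.
database: dict[str, int] = {"treasure:42:claims": 0}
cache: dict[str, int] = {}


def read_claims(key: str) -> int:
    """Cache-aside read: serve from the cache, fall back to the database."""
    if key not in cache:
        cache[key] = database[key]
    return cache[key]


def write_claims_naive(key: str, value: int) -> None:
    """Update the database only; any cached copy is now stale."""
    database[key] = value


def write_claims_invalidating(key: str, value: int) -> None:
    """Update the database, then drop the cached copy so the next read refills it."""
    database[key] = value
    cache.pop(key, None)


key = "treasure:42:claims"
read_claims(key)                   # warms the cache with 0
write_claims_naive(key, 1)
stale = read_claims(key)           # 0: the cache and the database disagree

write_claims_invalidating(key, 2)
fresh = read_claims(key)           # 2: the cache was refilled from the database
```

Invalidating on write fixes this single-threaded case, but under concurrent readers and writers a stale value can still be written back to the cache between the database update and the invalidation. That residual window is the kind of problem that caching alone does not remove.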

The Architecture Decision

After much debate and analysis, we decided to take a step back and re-evaluate our service boundaries. We realized that event sourcing and state management were two separate concerns that could be decoupled. We introduced a new service, the Treasure Hunt Orchestrator, responsible for managing the state of the game world and ensuring consistency. It communicates with our event sourcing mechanism through asynchronous APIs, which lets us process events in a more scalable and fault-tolerant manner. We also moved the state of the game world into a graph database, Amazon Neptune, so that we could perform complex queries and traversals more efficiently. This decision was not without tradeoffs: the graph database added complexity to our data model, and the asynchronous APIs introduced additional latency between an event being emitted and the state reflecting it. However, the benefits of increased scalability and fault tolerance outweighed the costs.
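The shape of that decoupling can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Orchestrator's real code: an `asyncio.Queue` stands in for the Kafka topic, an in-memory dictionary stands in for the Neptune graph, and the `TreasureFound` event, the "first finder wins" rule, and the deduplication by `event_id` are all hypothetical choices made for the example.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TreasureFound:
    """Hypothetical event shape; the real event schema is not shown in this post."""
    event_id: str
    user_id: str
    treasure_id: str


@dataclass
class Orchestrator:
    """Single owner of game-world state; applies events one at a time."""
    claimed_by: dict[str, str] = field(default_factory=dict)
    seen: set[str] = field(default_factory=set)

    def apply(self, event: TreasureFound) -> bool:
        """Apply an event idempotently. Returns True if the state changed."""
        if event.event_id in self.seen:
            return False  # redelivered event, already handled
        self.seen.add(event.event_id)
        if event.treasure_id in self.claimed_by:
            return False  # assumed rule: the first finder keeps the treasure
        self.claimed_by[event.treasure_id] = event.user_id
        return True

    async def run(self, queue: asyncio.Queue) -> None:
        """Consume events forever; the queue stands in for a Kafka topic."""
        while True:
            event = await queue.get()
            try:
                self.apply(event)
            finally:
                queue.task_done()


async def main() -> dict[str, str]:
    queue: asyncio.Queue = asyncio.Queue()
    orchestrator = Orchestrator()
    worker = asyncio.create_task(orchestrator.run(queue))

    events = [
        TreasureFound("e1", "alice", "t1"),
        TreasureFound("e1", "alice", "t1"),  # duplicate delivery
        TreasureFound("e2", "bob", "t1"),    # loses to alice
        TreasureFound("e3", "carol", "t2"),
    ]
    for event in events:
        queue.put_nowait(event)  # producers do not wait for the state update

    await queue.join()           # wait until every queued event is processed
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return orchestrator.claimed_by


result = asyncio.run(main())
print(result)  # {'t1': 'alice', 't2': 'carol'}
```

Because only one consumer mutates the state, concurrent finds of the same treasure are serialized instead of contending for the same record, and the deduplication check makes redelivery harmless. The price is the one mentioned above: a producer no longer knows the outcome at the moment it emits the event.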

What The Numbers Said After

The impact of the new architecture was immediate. CPU utilization dropped to 30 percent, and error rates fell by a factor of 5. Latency improved too, with 99th percentile latency dropping from 500ms to 50ms, and the number of users we could support grew by a factor of 10. Our monitoring tools, which included Prometheus and Grafana, showed a significant reduction in deadlocks and concurrent update errors. Operational costs also came down: our MongoDB instance count went from 10 to 2, and our Kafka partition count from 100 to 20. By every measure we tracked, the new architecture was more scalable, more fault-tolerant, and more cost-effective.
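To put the before and after figures on a common scale, the short script below converts each pair into a percentage reduction. The inputs are the numbers quoted above; the metric names and the `reduction_pct` helper are just labels chosen for this example.

```python
# Before/after pairs taken from the figures quoted in this section.
metrics = {
    "cpu_utilization_pct": (90, 30),
    "p99_latency_ms": (500, 50),
    "mongodb_instances": (10, 2),
    "kafka_partitions": (100, 20),
}


def reduction_pct(before: float, after: float) -> float:
    """Relative reduction from before to after, as a percentage."""
    return round((before - after) / before * 100, 1)


summary = {name: reduction_pct(before, after) for name, (before, after) in metrics.items()}

for name, pct in summary.items():
    print(f"{name}: {pct}% lower")
```

This works out to roughly a 67 percent drop in CPU utilization, a 90 percent drop in p99 latency, and an 80 percent drop in both MongoDB instances and Kafka partitions.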

What I Would Do Differently

In hindsight, I would have taken a more iterative approach to our architecture changes. We made some big bets on new technologies and architectures. They paid off, but they also introduced new complexities and challenges. I would have started with smaller, incremental changes and measured their impact before making larger ones. I would also have invested more in automated testing and validation to make sure our changes were correct and did not introduce new errors. And I would have wanted more visibility into the performance and latency of the system, with more detailed monitoring and logging, so that we could identify issues earlier and make more data-driven decisions. Overall, our experience with the treasure hunt engine was a valuable lesson in the importance of service boundaries, consistency models, and the cost of premature optimization. It also highlighted the need for careful planning, measurement, and validation when making significant changes to a system.