Lillian Dube

Veltrixs Treasure Hunt Engine Was A Cautionary Tale Of Premature Optimisation

The Problem We Were Actually Solving

I was tasked with designing a scalable, event-driven treasure hunt engine: users are issued a series of challenges, and the system tracks their progress in real time. It had to handle a large number of concurrent users with minimal latency. I chose Apache Kafka as the message broker and Apache Cassandra as the NoSQL database for user progress. The initial design seemed solid, but I soon realised that the real challenge was not the technology stack but the complexity of the business logic.

What We Tried First (And Why It Failed)

My initial approach was a monolithic architecture in which the business logic and the data storage were tightly coupled. A single Cassandra table stored all user progress, and a single Kafka topic carried all events. As the system started to scale, the Cassandra table became a bottleneck and the Kafka topic started to experience high latency. I tried to optimise by adding more Cassandra nodes and increasing the number of Kafka partitions, but this only masked the symptoms. The root cause was the tight coupling between the business logic and the data storage: the system had not been designed to handle the complexity of the business logic, and it did not scale.
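One way adding partitions can fail to help is a coarse partition key. The sketch below is an illustration only: the article does not say how events were keyed, so the `hunt-1` and `user-N` keys, the workload, and the partition count are all hypothetical, and the MD5-based hash is a stand-in rather than Kafka's real partitioner. It shows that when every event carries the same key, every event lands on one partition no matter how many partitions exist.

```python
import hashlib
from collections import Counter
from typing import Iterable


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash.

    Stand-in for a broker's partitioner, not Kafka's actual algorithm.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def partition_load(keys: Iterable[str], num_partitions: int) -> Counter:
    """Count how many events each partition receives."""
    return Counter(partition_for(key, num_partitions) for key in keys)


# Hypothetical workload: 10,000 events from 1,000 users in one popular hunt.
events = [(f"user-{i % 1000}", "hunt-1") for i in range(10_000)]
NUM_PARTITIONS = 12  # assumed value for illustration

# Keyed by hunt: a single key, so a single partition takes all the traffic.
by_hunt = partition_load((hunt for _, hunt in events), NUM_PARTITIONS)

# Keyed by user: many distinct keys, so the load spreads across partitions.
by_user = partition_load((user for user, _ in events), NUM_PARTITIONS)

print("partitions used when keyed by hunt:", len(by_hunt))
print("partitions used when keyed by user:", len(by_user))
```

Keyed by hunt, one partition receives all 10,000 events, so raising `NUM_PARTITIONS` changes nothing. The same reasoning applies to a Cassandra partition key that concentrates writes on a few partitions.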

The Architecture Decision

I took a step back and re-evaluated the architecture. The business logic and the data storage needed to be decoupled, and the system needed to be designed with scalability and fault tolerance in mind. I moved to a microservices architecture in which each service handles one specific aspect of the business logic. Events are handled by a combination of Apache Kafka and Amazon SQS, user progress is stored in a combination of Apache Cassandra and Amazon DynamoDB, and a Redis caching layer reduces the load on the databases. The new architecture was designed around the complexity of the business logic rather than around a single table and a single topic.
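The decoupling and the caching layer described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the engine's actual code: `ProgressStore`, `InMemoryStore`, `CachedStore`, and `ProgressService` are hypothetical names, a plain dict stands in for Redis, and another dict stands in for Cassandra or DynamoDB. The business logic depends only on a small storage interface, and the cache is a cache-aside wrapper that invalidates on write.

```python
from typing import Dict, Optional, Protocol


class ProgressStore(Protocol):
    """Storage interface the business logic depends on (hypothetical)."""

    def get(self, user_id: str) -> Optional[int]: ...
    def put(self, user_id: str, challenge_index: int) -> None: ...


class InMemoryStore:
    """Stand-in for a database-backed store; counts reads for illustration."""

    def __init__(self) -> None:
        self._rows: Dict[str, int] = {}
        self.reads = 0

    def get(self, user_id: str) -> Optional[int]:
        self.reads += 1
        return self._rows.get(user_id)

    def put(self, user_id: str, challenge_index: int) -> None:
        self._rows[user_id] = challenge_index


class CachedStore:
    """Cache-aside wrapper: read the cache first, fall back to the backing
    store on a miss, and invalidate the cached entry on every write."""

    def __init__(self, backing: ProgressStore) -> None:
        self._backing = backing
        self._cache: Dict[str, int] = {}  # a dict standing in for Redis

    def get(self, user_id: str) -> Optional[int]:
        if user_id in self._cache:
            return self._cache[user_id]
        value = self._backing.get(user_id)
        if value is not None:
            self._cache[user_id] = value
        return value

    def put(self, user_id: str, challenge_index: int) -> None:
        self._backing.put(user_id, challenge_index)
        self._cache.pop(user_id, None)


class ProgressService:
    """Business logic that knows nothing about where progress is stored."""

    def __init__(self, store: ProgressStore) -> None:
        self._store = store

    def complete_challenge(self, user_id: str) -> int:
        """Advance the user by one challenge and return the new count."""
        completed = (self._store.get(user_id) or 0) + 1
        self._store.put(user_id, completed)
        return completed


database = InMemoryStore()
cached = CachedStore(database)
service = ProgressService(cached)

first = service.complete_challenge("user-1")
second = service.complete_challenge("user-1")
```

Because `ProgressService` only sees the `ProgressStore` interface, the backing store can be swapped or wrapped without touching the business logic, which is the property the monolithic design lacked.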

What The Numbers Said After

After implementing the new architecture, I saw a significant reduction in latency and an increase in throughput. Average latency decreased from 500 ms to 50 ms, and throughput increased from 1,000 to 5,000 events per second. The system was able to handle a large number of concurrent users. I used Prometheus and Grafana to monitor it, which let me identify bottlenecks and areas for improvement, and tuning against those metrics achieved a significant reduction in costs.
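As a rough way to read those figures together, Little's law (average items in flight = throughput × average latency) gives an estimate of how many events were in the system at once. The sketch below assumes the reported numbers are steady-state averages measured at the same point in the pipeline, which the measurements above do not confirm; the function names are mine.

```python
def improvement_factor(before: float, after: float) -> float:
    """Return how many times larger `before` is than `after`."""
    return before / after


def events_in_flight(throughput_per_s: int, latency_ms: int) -> float:
    """Little's law: average events in flight = throughput x latency."""
    return throughput_per_s * latency_ms / 1000


# Figures reported above.
latency_before_ms, latency_after_ms = 500, 50
throughput_before, throughput_after = 1000, 5000

latency_gain = improvement_factor(latency_before_ms, latency_after_ms)
throughput_gain = improvement_factor(throughput_after, throughput_before)

in_flight_before = events_in_flight(throughput_before, latency_before_ms)
in_flight_after = events_in_flight(throughput_after, latency_after_ms)

print(f"latency improved {latency_gain:.0f}x")
print(f"throughput improved {throughput_gain:.0f}x")
print(f"events in flight: {in_flight_before:.0f} before, {in_flight_after:.0f} after")
```

Under those assumptions, the new system carried five times the traffic while holding about half as many events in flight (roughly 250 instead of 500), which is consistent with less queueing in front of the database.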

What I Would Do Differently

In hindsight, I would have taken a more incremental approach: start with a simple architecture and add complexity only as needed. I would also have placed more emphasis on monitoring and metrics from the beginning, and used more automated testing to ensure that the system was working as expected. I would have considered Kubernetes and containerisation to simplify deployment and management. The treasure hunt engine was a cautionary tale of premature optimisation. It taught me the value of simplicity and incremental design, and the importance of stepping back to re-evaluate the architecture when things are not working as expected.



