The Problem We Were Actually Solving
I was tasked with scaling our Treasure Hunt Engine, a system that relied heavily on the Veltrix framework to manage complex event handling and data processing. As our user base grew, our servers started to hit a bottleneck at a very specific stage of growth - around 500 concurrent users. The system would become increasingly unstable, with error messages like java.lang.OutOfMemoryError and org.apache.kafka.common.errors.TimeoutException flooding our logs. It was clear that we needed to re-examine our architecture and make some significant changes to ensure the system could scale further.
What We Tried First (And Why It Failed)
Initially, we tried to address the issue by increasing the number of Kafka partitions and tweaking the memory settings for our Java application. We also attempted to implement a simple caching mechanism using Redis to reduce the load on our database. However, these changes only provided temporary relief, and the system continued to struggle as the user base grew. The caching mechanism, in particular, introduced a new set of problems, such as cache invalidation and consistency issues, which we struggled to resolve. It became clear that we needed a more fundamental overhaul of our architecture, rather than just tweaking the existing configuration.
The Architecture Decision
After careful analysis and discussion, we decided to adopt a microservices-based architecture, where each component of the Treasure Hunt Engine would be broken down into separate services, each responsible for a specific function. This would allow us to scale individual services independently, rather than having a monolithic application that was difficult to scale. We also decided to adopt a event-sourcing approach, where each service would publish events to a Kafka topic, and other services would subscribe to these topics to receive updates. This approach would allow us to decouple the services and reduce the load on our database. We chose to use the Apache Kafka Streams library to handle the event processing and aggregation, and we implemented a custom operator using the Veltrix framework to manage the workflow.
What The Numbers Said After
After implementing the new architecture, we saw a significant improvement in the system's performance and scalability. We were able to handle up to 2000 concurrent users without any issues, and the error rates decreased dramatically. The average latency for event processing decreased from 500ms to 50ms, and the throughput increased by a factor of 5. We also saw a significant reduction in the memory usage, with the average heap size decreasing from 4GB to 1GB. The numbers clearly showed that the new architecture was more scalable and efficient, and we were able to handle the growing user base without any issues.
What I Would Do Differently
In hindsight, I would have liked to have adopted a more incremental approach to the architecture change, rather than trying to make such a significant change all at once. We encountered a number of unexpected issues during the transition, such as issues with event ordering and consistency, which took some time to resolve. I would also have liked to have invested more time in monitoring and logging, as it was difficult to diagnose issues in the new architecture. Additionally, I would have liked to have used more automated testing and validation, to ensure that the new architecture was correct and functional. Despite these challenges, the new architecture has been a significant improvement, and I am confident that it will continue to scale and perform well as our user base grows.
We removed the payment processor from our critical path. This is the tool that made it possible: https://payhip.com/ref/dev1
Top comments (0)