Our Treasure Hunt Engine Deployment Was a Cautionary Tale of Premature Optimisation

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

I was tasked with deploying a Treasure Hunt Engine for Veltrix, a system that had to handle thousands of concurrent users and generate puzzles on the fly. The engine was designed to be highly scalable and fault-tolerant, but we soon realised that our biggest challenge was not the engine itself. It was the events pipeline that fed it. We were using Apache Kafka as our event broker. Kafka was capable of handling our high-throughput events, but a broker only moves events around; it does nothing to contain the complexity of the puzzle generation logic that consumes them. Our initial approach was monolithic, with the puzzle generation logic tightly coupled to the event processing pipeline. That quickly proved problematic: the dependencies became a tangled mess, and the system was difficult to debug and optimise.

What We Tried First (And Why It Failed)

Our first attempt at solving this problem was to use a rules engine to decouple the puzzle generation logic from the event processing pipeline. We chose Drools. It is a powerful tool, but it proved to be overkill for our use case: it added a significant amount of complexity to the system, and it was difficult to debug and optimise. Performance suffered too, because the rules engine was not designed for the high-throughput events our system was generating. One of the main issues was a recurring java.lang.OutOfMemoryError when the rules engine was processing a large number of events. The error was caused by the rules engine's caching mechanism, which was not designed to handle that volume. After several weeks of struggling with the rules engine, we abandoned it and started looking for alternatives.
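To make the memory problem concrete, here is a minimal sketch of the general idea of a cache with a hard size cap. It is not how Drools caches anything internally, and it is written in Python rather than Java for brevity; the class name `BoundedCache` and the `max_entries` parameter are made up for this illustration.

```python
from collections import OrderedDict


class BoundedCache:
    """Least-recently-used cache that never holds more than max_entries items."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def put(self, key, value):
        # Insert or refresh the entry and mark it as most recently used.
        self._entries[key] = value
        self._entries.move_to_end(key)
        # Evict the least recently used entry once the cap is exceeded.
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)
        return self._entries[key]

    def __len__(self):
        return len(self._entries)


# Memory use stays flat no matter how many events pass through.
cache = BoundedCache(max_entries=2)
for event_id in range(10_000):
    cache.put(event_id, {"event_id": event_id})
print(len(cache))  # 2
```

An unbounded cache in the same loop would hold all 10,000 entries, which is the kind of growth that ends in an out-of-memory error under sustained load.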

The Architecture Decision

After abandoning the rules engine, we took a step back and re-evaluated our architecture. We realised that our main problem was not the puzzle generation logic itself, but the way we were handling events. We moved to an event-driven architecture in which events are processed in a pipeline separate from the puzzle generation logic, with Amazon Kinesis as our event processor and AWS Lambda as our event handler. This decoupled the puzzle generation logic from the event processing pipeline and made the system much easier to debug and optimise. We used New Relic to monitor the system's performance, and used its metrics to identify bottlenecks and areas for optimisation. One of the key metrics was the average latency of event processing, which fell from 500ms to 50ms under the new architecture.
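A rough sketch of what that separation can look like in a Lambda handler follows. It is an illustration under assumptions, not our production code: the `generate_puzzle` function and the `user_id` and `level` fields are hypothetical. The only part taken from AWS is the event shape, where each Kinesis record arrives under `Records[].kinesis.data` as a base64-encoded payload.

```python
import base64
import json


def generate_puzzle(event):
    """Hypothetical puzzle generator. It knows nothing about Kinesis or Lambda."""
    return {"user_id": event["user_id"], "puzzle": f"riddle-for-level-{event['level']}"}


def decode_records(kinesis_event):
    """Unpack the base64-encoded JSON payloads that Kinesis delivers to Lambda."""
    return [
        json.loads(base64.b64decode(record["kinesis"]["data"]))
        for record in kinesis_event.get("Records", [])
    ]


def handler(kinesis_event, context):
    """Lambda entry point: deal with transport details, then delegate."""
    puzzles = [generate_puzzle(event) for event in decode_records(kinesis_event)]
    return {"processed": len(puzzles), "puzzles": puzzles}


# Local smoke test with a hand-built event, no AWS account needed.
payload = base64.b64encode(json.dumps({"user_id": "u1", "level": 3}).encode()).decode()
fake_event = {"Records": [{"kinesis": {"data": payload}}]}
print(handler(fake_event, None))
# {'processed': 1, 'puzzles': [{'user_id': 'u1', 'puzzle': 'riddle-for-level-3'}]}
```

Because `generate_puzzle` takes a plain dictionary, it can be tested and profiled without any event infrastructure, which is where most of the debugging benefit comes from.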

What The Numbers Said After

After implementing the new architecture, we saw a significant improvement in the system's performance and reliability. The average latency of event processing fell by 90%, and the error rate fell by 95%. The system also handled a much higher volume of events and scaled far more easily: the number of concurrent users it could handle, one of our key scalability metrics, rose from 1000 to 5000. We also used Datadog to monitor the system's performance and to find areas for optimisation. For example, the metrics pointed to a bottleneck in the database, and optimising the database queries improved the system's performance.
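For anyone who wants to check the arithmetic, the headline figures follow directly from the before and after values quoted in this post. The helper names below are just for illustration.

```python
def percent_reduction(before, after):
    """Percentage by which a metric fell from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) * 100 / before


def scale_factor(before, after):
    """How many times larger `after` is than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return after / before


print(percent_reduction(500, 50))  # 90.0 -> latency fell by 90%
print(scale_factor(1000, 5000))    # 5.0  -> five times the concurrent users
```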

What I Would Do Differently

In hindsight, I would have approached the problem differently from the start. I would have taken a more incremental approach, starting by decoupling the puzzle generation logic from the event processing pipeline. I would also have used a more lightweight event processor, such as Apache Flink, instead of Amazon Kinesis, and a more robust monitoring and logging system, such as the ELK Stack, to monitor the system's performance and identify areas for optimisation. One of the key lessons I learned from this experience was the cost of premature optimisation, and the need to focus on simplicity and scalability when designing a system. I also learned the importance of using the right tools for the job, rather than burdening the system with unnecessary complexity. For example, a simpler rules engine such as Easy Rules, instead of Drools, would have reduced the complexity of the system and made it easier to debug and optimise.
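To show how little machinery "simpler" can mean, here is a minimal sketch of a rule list evaluated in order, where the first matching rule wins. It is not the Easy Rules API (that is a Java library), and the level thresholds and puzzle names are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    condition: Callable[[dict], bool]  # does this rule apply to the event?
    action: Callable[[dict], str]      # which puzzle type to generate


# Hypothetical rules, checked top to bottom.
RULES = [
    Rule("beginner", lambda e: e["level"] < 3, lambda e: "word-scramble"),
    Rule("intermediate", lambda e: e["level"] < 7, lambda e: "cipher"),
    Rule("expert", lambda e: True, lambda e: "multi-stage-riddle"),
]


def choose_puzzle(event, rules=RULES):
    """Return the puzzle type from the first rule whose condition matches."""
    for rule in rules:
        if rule.condition(event):
            return rule.action(event)
    raise LookupError("no rule matched")


print(choose_puzzle({"level": 1}))  # word-scramble
print(choose_puzzle({"level": 5}))  # cipher
print(choose_puzzle({"level": 9}))  # multi-stage-riddle
```

A list like this has no working memory or caching layer to tune, so there is far less to go wrong under load. The trade-off is that it will not scale to hundreds of interacting rules the way a full rules engine can.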