Why We Lost Our Treasure Hunt Engine to an Unlikely Event-Driven Denial-of-Service Attack

#webdev #programming #security #appsec

The Problem We Were Actually Solving

On the surface, we were building an event-driven system to handle treasure hunt requests. The real problem was harder: we needed a matchmaking engine that could serve thousands of concurrent users and stay responsive while doing it. We wanted users to interact with the treasure hunt system without noticing any delays or errors.

What We Tried First (And Why It Failed)

When we first started building the treasure hunt engine, we went with a classic pub/sub architecture, using Apache Kafka as our event bus. We set up a series of ZooKeeper instances to manage our Kafka clusters, and our application code would publish events to topics and subscribe to those topics to process them. Sounds simple enough, right? What we failed to consider was how quickly the cost of operating a large number of topics and ZooKeeper instances would grow. As our traffic increased, our infrastructure costs skyrocketed, and our application started to slow down.
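To make the publish/subscribe flow described above concrete, here is a minimal in-memory sketch. It is not Kafka and not our production code: `InMemoryBus`, the topic name `hunt.requested`, and the event fields are all hypothetical, and a real broker adds persistence, partitions, and consumer groups that this toy leaves out.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class InMemoryBus:
    """Toy stand-in for an event bus (assumption: no persistence, no partitions)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to be called for every event on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to each handler on the topic; return how many ran."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = InMemoryBus()
received: List[dict] = []

# Hypothetical topic and payload, used only for illustration.
bus.subscribe("hunt.requested", received.append)
delivered = bus.publish("hunt.requested", {"user_id": "u1", "hunt_id": "h42"})
```

The producer only knows the topic name, and the consumer only knows the handler it registered. That separation is the appeal of pub/sub, and every extra topic is also one more thing to operate.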

The Architecture Decision

After several failed attempts to refactor our system to handle the increased traffic, we realized that we needed to rethink our event-driven architecture from the ground up. We decided to switch to a distributed event store backed by a database like Apache Cassandra, which would keep our event producers decoupled from our event consumers. We also adopted a domain-driven design approach, modeling our business domain as a series of discrete events that could be easily composed and decomposed. This gave us a more modular and scalable system that could handle our high traffic volumes.
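As a rough illustration of the event-store idea, the sketch below keeps an append-only stream of domain events per hunt and rebuilds state by folding over that stream. Everything here is an assumption made for the example: `HuntEvent`, `EventStore`, the event kinds, and the in-memory dictionary standing in for a Cassandra-backed table.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class HuntEvent:
    """A discrete, immutable domain event (field names are illustrative)."""
    hunt_id: str
    kind: str  # e.g. "HuntStarted" or "ClueFound"


class EventStore:
    """Append-only stream per hunt, held in memory for illustration only."""

    def __init__(self) -> None:
        self._streams: Dict[str, List[HuntEvent]] = {}

    def append(self, event: HuntEvent) -> int:
        """Append to the hunt's stream and return the event's position."""
        stream = self._streams.setdefault(event.hunt_id, [])
        stream.append(event)
        return len(stream) - 1

    def read(self, hunt_id: str, from_position: int = 0) -> List[HuntEvent]:
        """Return a copy of the stream starting at the given position."""
        return list(self._streams.get(hunt_id, [])[from_position:])


def clues_found(events: List[HuntEvent]) -> int:
    """Fold a stream into state: count the clues found so far."""
    return sum(1 for e in events if e.kind == "ClueFound")


store = EventStore()
store.append(HuntEvent("h42", "HuntStarted"))
store.append(HuntEvent("h42", "ClueFound"))
last_position = store.append(HuntEvent("h42", "ClueFound"))
found = clues_found(store.read("h42"))
```

Producers only call `append`, and consumers only call `read` from whatever position they last processed, so neither side needs to know about the other.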

What The Numbers Said After

After implementing our new event-driven architecture, our infrastructure costs dropped by over 30%. Our application responded to user requests in under 50ms, and our error rates fell to almost zero. The metrics were a clear testament to the effectiveness of our new architecture.

What I Would Do Differently

If I were to do this project all over again, I would design our event-driven architecture with observability and monitoring in mind from the very start. I would invest in tools like Prometheus and Grafana to track our system's performance and latency, and set up alerts to notify our team of issues before they become major problems. I would also spend more time on testing and validation, making sure our code behaves correctly under heavy load. With that in place, we could have spotted the Denial-of-Service attack early, and we might have kept it from taking our system down in the first place.
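The kind of alert I have in mind can be sketched in a few lines. This is a hypothetical sliding-window check, not a Prometheus rule and not something we ran: `RateAlert`, the limit, and the window size are assumptions, and timestamps are passed in explicitly so the behavior is easy to test (in a live service they might come from `time.monotonic()`).

```python
from collections import deque
from typing import Deque


class RateAlert:
    """Flags when more than `limit` events arrive within `window_s` seconds."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._times: Deque[float] = deque()

    def record(self, now: float) -> bool:
        """Record one event at time `now`; return True if the alert should fire."""
        self._times.append(now)
        # Drop timestamps that have fallen out of the window.
        while self._times and now - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) > self.limit


# Hypothetical thresholds: alert on more than 3 events per second.
alert = RateAlert(limit=3, window_s=1.0)
results = [alert.record(t) for t in (0.0, 0.1, 0.2, 0.3, 2.0)]
```

The fourth event in the burst trips the alert, and the quiet period afterwards clears it. The same check, run against synthetic bursts in a test suite, is one cheap way to see how the system behaves under load before an attacker does it for you.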