Most Treasure Hunts Are Designed for Demos, Not for Scalability


The Problem We Were Actually Solving

When I dug deeper, I realized that our customer's use case for Treasure Hunt was vastly different from the ones we had envisioned during our demo days. In a typical demo, a Treasure Hunt would last anywhere from 5 to 30 minutes, with a small handful of concurrent players. Our customer, however, was planning a massive event with thousands of concurrent players, running for hours on end. I had to re-evaluate our configuration choices from the ground up.

What We Tried First (And Why It Failed)

Initially, I tried tweaking our RabbitMQ queue sizes to accommodate the increased volume. I raised the queue sizes to 10 for both incoming and outgoing messages, hoping this would alleviate the congestion. However, as we scaled beyond 1,500 concurrent players, message latency began to creep up. The delay between a player finding the hidden treasure and the system acknowledging it grew from 200ms to over 3 seconds. Not only did this disrupt the user experience, it also started to surface errors like "treasure not found" and "system overload."
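To illustrate why resizing a queue rarely fixes this kind of latency creep, here is a minimal sketch of a single FIFO consumer. It is a toy model, not our actual system: the function name `simulate_latency`, the 5 ms service time, and the arrival rates are all assumptions chosen for illustration. The point it shows is that once messages arrive faster than they can be processed, every new message waits behind a growing backlog, and a larger queue only holds more of that backlog.

```python
def simulate_latency(arrivals_per_sec: float, service_ms: float, duration_sec: int) -> list[float]:
    """Return the latency in ms of each message handled by one FIFO consumer.

    Toy model: messages arrive at a fixed rate and each takes `service_ms` to process.
    """
    interval_ms = 1000.0 / arrivals_per_sec
    total_messages = int(arrivals_per_sec * duration_sec)
    consumer_free_at = 0.0  # time at which the consumer finishes its current message
    latencies = []
    for i in range(total_messages):
        arrived = i * interval_ms
        # A message starts processing when it has arrived AND the consumer is idle.
        start = max(arrived, consumer_free_at)
        consumer_free_at = start + service_ms
        latencies.append(consumer_free_at - arrived)
    return latencies


# Hypothetical numbers: a 5 ms service time means a capacity of 200 messages/sec.
under_capacity = simulate_latency(arrivals_per_sec=100, service_ms=5.0, duration_sec=10)
over_capacity = simulate_latency(arrivals_per_sec=250, service_ms=5.0, duration_sec=10)

print(max(under_capacity))  # 5.0: latency stays flat below capacity
print(over_capacity[0])     # 5.0: the first message is still fast
print(over_capacity[-1])    # 2504.0: after 10 seconds the backlog dominates
```

In this model the only ways to stop the growth are to process messages faster or to process them in parallel, which is the direction the rest of this post takes.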

The Architecture Decision

After weeks of trial and error, I finally took a step back and reassessed our architecture choices. I realized that our reliance on RabbitMQ was both a blessing and a curse. The message broker was great for decoupling the components, but in our setup every message paid the overhead of being queued and dequeued, and that latency added up under load. I made the bold decision to switch to Apache Kafka, a distributed event streaming platform designed for high throughput and fault tolerance. I re-architected the Treasure Hunt feature to use Kafka topics instead of RabbitMQ queues. Because a topic is split into partitions that consumers can read in parallel, this allowed us to distribute the load more effectively and reduce latency.
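The load distribution comes from partitioning: Kafka spreads a topic's records across partitions, and records with the same key land on the same partition, so per-key ordering is preserved. The sketch below imitates that idea with the standard library only. It does not use a Kafka client, and the partition count, the `player-N` key format, and the `partition_for` helper are hypothetical. CRC32 is used here only as a simple, deterministic stand-in for a real client's partitioner.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 12  # hypothetical partition count for a "treasure-found" topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition (stand-in for a Kafka client's partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Simulate one event from each of 5,000 players, keyed by a hypothetical player ID.
player_ids = [f"player-{i}" for i in range(5000)]
load = Counter(partition_for(player_id) for player_id in player_ids)

# Every event from the same player maps to the same partition, so that
# player's events stay in order while different players spread out.
assert partition_for("player-42") == partition_for("player-42")

for partition, count in sorted(load.items()):
    print(f"partition {partition}: {count} events")
```

Keying by player rather than by hunt is a deliberate choice in this sketch: a single massive event keyed by hunt ID would send every record to one partition and bring back the original bottleneck.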

What The Numbers Said After

After deploying the new architecture, I ran a series of load tests to verify the improvements. The results were staggering. With 5,000 concurrent players, the average latency dropped from 3.5 seconds to just 150ms, roughly a 23x reduction. The "treasure not found" errors disappeared, and the system handled the increased volume without the congestion we had seen before. Our customer was thrilled, and we had finally achieved the scalability we had initially promised.
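One caveat when reading load test results like these: an average can hide a painful tail. The sketch below summarizes a set of latency samples with the mean and a nearest-rank 99th percentile. The samples are synthetic and chosen for illustration only, they are not our measurements, and the `summarize` helper is a hypothetical name.

```python
import statistics


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Return the mean, nearest-rank p99, and max of a list of latency samples."""
    ordered = sorted(latencies_ms)
    # Nearest-rank p99: the smallest sample with at least 99% of samples at or below it.
    # Integer arithmetic computes ceil(0.99 * n) without floating-point rounding.
    rank = max(1, (99 * len(ordered) + 99) // 100)
    return {
        "mean": statistics.fmean(ordered),
        "p99": ordered[rank - 1],
        "max": ordered[-1],
    }


# Synthetic data: 98% of requests take 100 ms, 2% take 2,600 ms.
samples = [100] * 980 + [2600] * 20
summary = summarize(samples)

print(summary["mean"])  # 150.0
print(summary["p99"])   # 2600
```

Here a 150 ms average coexists with one request in fifty taking 2.6 seconds, which is why it is worth reporting a high percentile alongside the mean.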

What I Would Do Differently

As I reflect on this experience, I realize that we should have prioritized real-world operability from the very beginning. We got caught up in showcasing our tech to potential customers, rather than focusing on the actual pain points of our existing customers. Moving forward, I would advise teams to prioritize the 99th percentile of their customers' use cases, rather than the average case. It's the rare edge cases that will break your system, and it's our job as engineers to anticipate and prepare for them.
