DEV Community

Lisa Zulu


We Chose To Use A Clustering Approach For Our Treasure Hunt Engine, And It Almost Broke Us

The Problem We Were Actually Solving

We were hired to build a high-performance online treasure hunt engine for a popular game platform. The requirements were straightforward: handle millions of users, provide instant responses, and minimize lag. Our team thought we could achieve this with a simple scaling strategy: just add more servers. But we soon discovered that in the world of AI-powered systems, that's not how it works.

What We Tried First (And Why It Failed)

We started with a monolithic design, where all the systems (the AI model, data storage, and load balancers) ran on the same server cluster. We soon hit two main issues: massive memory bloat and unacceptable latency. The system bogged down under intense load, leading to timeouts and crashes. To make matters worse, the load balancers couldn't distribute traffic evenly, so some servers were overloaded while others sat idle, and response times climbed. An approach that had seemed like a good idea quickly turned into a disaster.

The Architecture Decision

After some soul-searching, we decided to adopt a clustering approach. We split our system into multiple independent nodes, each handling a specific task. This let us scale each component independently and reduced the overhead of the monolithic setup. We set up a cluster of 10 nodes, with dedicated nodes for load balancing, the AI model, and data storage. Each component ran on 3-5 nodes, a count we settled on through load testing to keep performance degradation under load within acceptable limits.

The data storage nodes ran a caching layer in front of the primary data store, giving us a read-through, write-through architecture with predictable performance for both reads and writes. We chose the open-source Redis for the caching layer because it handles massive read loads while still supporting a reasonable write-to-read ratio. For the data store itself we used InfluxDB, a time-series database, and relied on its built-in caching to avoid unnecessary reads from disk.
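To make the read-through, write-through pattern concrete, here is a minimal sketch in Python. It is not our production code: the `BackingStore` and `ReadWriteThroughCache` classes are hypothetical, and plain dictionaries stand in for both Redis and the database so the example runs on its own.

```python
from typing import Dict, Optional


class BackingStore:
    """Hypothetical stand-in for the slower database; counts each access."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}
        self.reads = 0
        self.writes = 0

    def get(self, key: str) -> Optional[str]:
        self.reads += 1
        return self._rows.get(key)

    def put(self, key: str, value: str) -> None:
        self.writes += 1
        self._rows[key] = value


class ReadWriteThroughCache:
    """Hypothetical cache wrapper; a dict stands in for the cache server."""

    def __init__(self, store: BackingStore) -> None:
        self._store = store
        self._cache: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        # Read-through: serve from the cache, fall back to the store on a miss.
        if key in self._cache:
            return self._cache[key]
        value = self._store.get(key)
        if value is not None:
            self._cache[key] = value  # populate the cache for later reads
        return value

    def put(self, key: str, value: str) -> None:
        # Write-through: write to the store first, then update the cache,
        # so the cache never holds a value the store has not accepted.
        self._store.put(key, value)
        self._cache[key] = value


store = BackingStore()
cache = ReadWriteThroughCache(store)
cache.put("clue:1", "look under the bridge")
first_read = cache.get("clue:1")   # served from the cache, no store read
missing = cache.get("clue:404")    # cache miss, falls through to the store
```

The trade-off is that every write pays the cost of the store, in exchange for reads that never see stale cached data.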

The benefits were immediate – our system became more responsive under heavy loads. The latency dropped from 2 seconds to 100ms, and the server crashes disappeared. Our users could now engage with the treasure hunt seamlessly. But our work wasn't over yet.

What The Numbers Said After

After deploying the new architecture, we monitored the system closely. Latency had dropped to an acceptable level, but it was still far from the 20ms target we had originally set. Average request latency was 80ms, four times that target, mainly because of the overhead of the clustering setup and inter-node communication. It met our minimum requirement but left room for further optimization. Worse, throughput fell by a factor of 5 when we scaled the cluster from 5 nodes to 10. What had looked like a scalable design turned out to be a scaling limitation.
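A quick back-of-the-envelope calculation shows how severe those numbers are. The sketch below uses only the figures quoted above; the 5-node throughput is normalized to 1 because the absolute value is not given here, and the function name is just for illustration.

```python
from fractions import Fraction


def per_node_throughput(total: Fraction, nodes: int) -> Fraction:
    """Average throughput each node contributes."""
    return total / nodes


# Normalized: total throughput of the 5-node cluster is taken as 1 unit.
total_at_5_nodes = Fraction(1)
# Throughput fell by a factor of 5 after scaling to 10 nodes.
total_at_10_nodes = total_at_5_nodes / 5

per_node_drop = (
    per_node_throughput(total_at_5_nodes, 5)
    / per_node_throughput(total_at_10_nodes, 10)
)

# Latency: measured average versus the original target, in milliseconds.
latency_over_target = Fraction(80, 20)

print(f"per-node throughput dropped {per_node_drop}x")
print(f"latency is {latency_over_target}x the target")
```

In other words, doubling the node count while losing a factor of 5 in total throughput means each node was doing a tenth of the useful work it did before.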

What I Would Do Differently

While our clustering approach saved us from disaster, it introduced new challenges that still affect the performance of our system. If I were to redo this project, I would focus on a more decentralized architecture from the start. For instance, I would consider moving the AI model to an edge computing setup to cut the latency that comes from inter-node communication. To tackle the scaling bottleneck, I would also explore more fine-grained load balancing strategies and implement a predictive model to anticipate and mitigate traffic spikes.
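As a rough idea of what a very simple predictive model could look like, here is a sketch that uses an exponentially weighted moving average (EWMA) of recent request rates and flags a spike when current traffic far exceeds the forecast. This is an assumed stand-in, not something we built: the function names, the smoothing factor `alpha`, and the `threshold` multiplier are all hypothetical and would need tuning against real traffic.

```python
from typing import Iterable


def ewma_forecast(samples: Iterable[float], alpha: float = 0.3) -> float:
    """Return an EWMA of the samples, weighting recent values more heavily."""
    iterator = iter(samples)
    try:
        forecast = next(iterator)  # seed the average with the first sample
    except StopIteration:
        raise ValueError("need at least one sample") from None
    for value in iterator:
        forecast = alpha * value + (1 - alpha) * forecast
    return forecast


def is_spike(history: Iterable[float], current: float,
             alpha: float = 0.3, threshold: float = 2.0) -> bool:
    """Flag traffic that exceeds the forecast by the given multiple."""
    return current > threshold * ewma_forecast(history, alpha)


# Requests per second over recent intervals (made-up illustrative values).
recent_rps = [100.0, 100.0, 100.0]
steady = is_spike(recent_rps, 150.0)   # below 2x the forecast
surge = is_spike(recent_rps, 250.0)    # above 2x the forecast
```

A signal like this could trigger scaling out or shedding load before timeouts start, though a real system would likely also need seasonality and event schedules in the model.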

Our story highlights that the path to building reliable AI-powered systems is fraught with misconceptions and unspoken trade-offs. As engineers, we often start with seemingly simple ideas, only to discover their limitations in the real world. When it comes to high-availability systems, there's no room for trial and error. The best approach is to anticipate potential pitfalls and incorporate failure tolerance into the design from the start.
