The Pitfalls of Treating a Treasure Hunt Engine Like a Social Feed

#webdev #javascript #programming #react

The Problem We Were Actually Solving

At first glance, our treasure hunt engine seemed like a typical event-driven system – users create events, which trigger reactions from other users. However, as the system grew, we realized it was more like a graph problem with multiple event chains converging at different points. The system needed to be able to process complex queries on user-generated content while ensuring that the UI stayed responsive under load. In other words, we were solving a problem that was less about event-stream processing and more about scaleable graph traversal.

What We Tried First (And Why It Failed)

We initially used a Node.js-based system with a MongoDB backend to store event metadata. This setup seemed like a no-brainer, but it quickly turned into a bottleneck. With the sheer volume of events and concurrent users, our MongoDB instance started throwing "connection closed by user error" exceptions, and performance degradation became a regular occurrence. The Node.js process also became memory-intensive, which forced us to scale vertically before we could scale horizontally. This was a double-edged sword: our infrastructure costs increased, but so did the system's availability.

The Architecture Decision

We decided to switch to a GraphQL API with Redis and Postgres as our primary data stores. In this new setup, we used a pub/sub model with Redis to broadcast events to a consumer queue and a Postgres database for storing persistent metadata. This allowed us to decouple event processing from the main application, reducing the load on our main Node.js application. We also implemented a graph database using ArangoDB to handle complex queries and relationships between users and events.

What The Numbers Said After

Once we implemented this new architecture, we saw a significant reduction in average response times and connection timeouts. Our Grafana dashboard showed that the overall system latency decreased by 70% after the upgrade. We also observed a 25% decrease in memory usage, which allowed us to reduce our infrastructure costs by scaling down our Node.js instances.

What I Would Do Differently

Looking back on the project, I would've done a few things differently from the start. First, I would've started with a more explicit partitioning of our data into separate shards to begin with instead of forcing the system to do it for us under load. Additionally, I would have prioritized the implementation of our pub/sub model earlier in the project rather than after the system had already scaled to the point where it was becoming unmanageable.