The Default Config Trap: A Cautionary Tale of How I Accidentally Built a Treasure Hunt Engine for 1% of Our Users

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In reality, we were trying to tackle a much more specific problem – how do we build a treasure hunt engine, capable of serving millions of searches per day, while also accommodating the whims of our event-organiser clients? These clients wanted to be able to host multiple event channels, with each channel having its own set of custom rules and features. In other words, we were trying to solve a scaled-up version of the "multiple-choice" problem. Our target was to power 50 concurrent event channels with 1 million concurrent connections each, handling 1M messages per second with 10-second latency SLA. Sounds ambitious, right?

What We Tried First (And Why It Failed)

The initial design we chose was based on a conventional 'microservices' approach. Each event channel would be housed in its own Docker container, leveraging our internal event broker to handle message bus communication between channels. Sounds like a well-thought-out plan, right? Yeah, not so much. After deploying this monstrosity to production, we encountered an "out-of-memory error" on three simultaneous event channels. We later discovered that it was due to a faulty assumption about how many goroutines were being spawned by the Go runtime, causing an avalanche of memory usage. One minute our CPU usage was 50%, and the next, CPU utilization spiked to 80% while our containers started to die, causing an entire event channel to be knocked offline. The 'service discovery' layer took so long to detect dead containers that before it could even start to bring a new one up, the dead container had been restarted multiple times. This had catastrophic consequences for customer satisfaction.

The Architecture Decision

We decided to abandon the conventional 'microservices' approach in favour of our internal message bus implementation, built on top of Apache Kafka and Apache ZooKeeper. This approach allowed us to use a single, scalable, and fault-tolerant message bus to handle all the events across channels. It also ensured that our service discovery layer always remained healthy and didn't experience issues of the 'service not responding' type. Our next task was to build a custom load balancer for our new message bus instance to split messages across channels based on a custom weight.

What The Numbers Said After

Fast-forward a few months – I was shocked to see our metrics indicating 99.99% system uptime, and our response times were consistently under 10 seconds, with zero dead containers reported. As for the CPU usage, our Kafka and ZooKeeper layers weren't even maxing out at 60% in peak hours. We later discovered that we had reduced our memory usage by 50% and CPU usage by 40% across the board.

What I Would Do Differently

Looking back, I'd have invested more time upfront on load testing each component, focusing on how various configurations would behave under heavy load. Specifically, we would have tried to simulate a 9 o'clock morning for 10 days consecutively and verified the behavior for each component. I would have run multiple load tests on our service discovery layer to identify any performance bottlenecks early on and taken the necessary corrective action.