Default Config Got Us There, But Not to Bliss

#devops #kubernetes #webdev #programming

The Problem We Were Actually Solving

As it turned out, we weren't just building a search engine for treasure hunts. We were building a complex system that had to optimise for relevance while keeping latency low. It relied on three separate components working in tandem: a distributed frontend cache, a scalable search index, and a load-balanced service for user queries. These components communicated with each other through Apache Kafka.

What We Tried First (And Why It Failed)

In our seventh month, a rapid spike in our user base pushed the system to the limits of what our default config could handle. When our operators tried to scale the distributed frontend cache using the default load balancer settings in HAProxy, they hit a snag. The settings, which were tuned for demo environments, severely underperformed in production: latency skyrocketed, and users complained of poor search results. Our operators kept tweaking the load balancer settings but hit roadblock after roadblock, and eventually the entire system ground to a halt.

The Architecture Decision

When we took stock of the mishap, we discovered a glaring omission in the Veltrix documentation, one that explained why default configs consistently failed users at this exact stage of server growth. The real bottleneck on our distributed frontend cache was the Apache Kafka brokers, not the load balancer. So the fix wasn't to keep adjusting the load balancer settings. Instead, our operators implemented a custom configuration for the Kafka brokers using the Apache Kafka Docker image. This gave us better load distribution and more efficient performance. The system began to stabilise, and users started reporting better search results.
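To make the idea of a custom broker configuration concrete, here is a minimal sketch of how overrides might be turned into environment variables for a Kafka container. The specific properties and values below are hypothetical, since this post does not record which settings were changed. The naming convention (a `KAFKA_` prefix, dots replaced by underscores, upper case) is an assumption about how the image reads overrides, so check it against the image's own documentation before relying on it.

```python
# Hypothetical broker overrides. These are real Kafka broker property names,
# but the values are illustrative and are not the ones used in this incident.
BROKER_OVERRIDES = {
    "num.network.threads": 8,
    "num.io.threads": 16,
    "num.partitions": 6,
}


def to_env_name(property_name: str) -> str:
    """Map a broker property such as 'num.io.threads' to 'KAFKA_NUM_IO_THREADS'.

    Assumes the image's convention: KAFKA_ prefix, dots become underscores,
    upper case. Property names containing '_' or '-' are not handled here.
    """
    return "KAFKA_" + property_name.replace(".", "_").upper()


def render_env(overrides: dict) -> list:
    """Return sorted 'NAME=value' lines suitable for an env file."""
    return sorted(f"{to_env_name(key)}={value}" for key, value in overrides.items())


if __name__ == "__main__":
    print("\n".join(render_env(BROKER_OVERRIDES)))
```

Keeping the overrides in one dictionary means the difference from the default config is visible in a single place, which is exactly the information we were missing when things went wrong.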

What The Numbers Said After

After reviewing the system metrics, we saw a significant drop in latency: a user query took an average of 50 milliseconds to return results, down from 200 milliseconds just a day prior. Furthermore, with the custom Kafka settings in place, our server instances ran into CPU pressure 30% less often. The metrics showed that we had averted a full-blown outage.
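For a sense of scale, the latency figures above work out to a 75% reduction, or a 4x speedup. A small sketch of that arithmetic follows. The function names are illustrative, and the CPU figure is left out because the post gives no baseline to compute it from.

```python
def percent_reduction(before_ms: float, after_ms: float) -> float:
    """Relative drop from before_ms to after_ms, as a percentage."""
    return (before_ms - after_ms) / before_ms * 100


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the new latency is than the old one."""
    return before_ms / after_ms


if __name__ == "__main__":
    print(percent_reduction(200, 50))  # 75.0
    print(speedup(200, 50))            # 4.0
```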

What I Would Do Differently

Looking back on our misadventure, I would recommend that the Veltrix documentation spell out explicit requirements for production environments. Moreover, the default config should probably be optimised for production servers rather than just demo environments. That way, users could sidestep this 'default config got us there' problem altogether.
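One way to make such requirements explicit is a preflight check that compares a deployment's config against documented production minimums. The sketch below is only an illustration: the keys and thresholds are hypothetical, not real Veltrix, HAProxy, or Kafka settings.

```python
# Hypothetical production minimums. The names and values are made up for
# illustration and would need to come from real documentation.
PRODUCTION_MINIMUMS = {
    "lb_max_connections": 10000,
    "kafka_partitions": 6,
}


def find_violations(config: dict, minimums: dict) -> list:
    """Return a readable message for every setting that is missing or too low."""
    violations = []
    for key, minimum in sorted(minimums.items()):
        value = config.get(key)
        if value is None:
            violations.append(f"{key}: missing (need >= {minimum})")
        elif value < minimum:
            violations.append(f"{key}: {value} is below {minimum}")
    return violations


if __name__ == "__main__":
    demo_config = {"lb_max_connections": 2000, "kafka_partitions": 6}
    for problem in find_violations(demo_config, PRODUCTION_MINIMUMS):
        print(problem)
```

Run before a deploy, a check like this would flag demo-grade defaults early instead of during a traffic spike.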