Configuring Without Compromise: My War Story with the Veltrix Treasure Hunt Engine

#webdev #programming #devops #kubernetes

The Problem We Were Actually Solving

We had built the treasure hunt engine to be a dynamic, real-time thing: clues would pop up on the map, and users could share their progress with friends. It was going to be a web-scale hit, or so we thought. The key to its success was configurability: we had to tweak the algorithm in real time to balance the rate of clue drops against the rate of user progress. Users should feel like they're actually on a treasure hunt, but the pace can't be so slow that they get bored. We deployed the system, and it worked, at first.

What We Tried First (And Why It Failed)

Our initial approach was to use a distributed config store, which seemed like the right move at the time. We set up a ZooKeeper ensemble to distribute the config to our stateful services. It sounded great, but we quickly found a delay of about 30 seconds between changing a config value and the services seeing it, which we attributed to ZooKeeper's watch mechanism. Clue drops were happening at the wrong pace, and users were complaining. We thought it was just a tuning issue, so we cranked up what we called the replication factor (in ZooKeeper terms, the number of servers in the ensemble), hoping to reduce the latency. But it only made things worse: we started getting temporary config mismatches, where services would read different values from ZooKeeper, and the system would crash.
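To make the mismatch problem concrete, here is a minimal sketch of one way to detect it: publish all pacing settings together as a single versioned snapshot, and have each service instance refuse anything older than what it already holds. This is an illustration rather than code from the Veltrix system; `ConfigSnapshot`, `ServiceConfig`, `versions_in_use`, and the field names are hypothetical, and nothing here talks to a real ZooKeeper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigSnapshot:
    """All pacing settings published together under one version number.

    The field names are made up for this example.
    """
    version: int
    clue_drop_interval_s: float
    max_active_clues: int


class ServiceConfig:
    """Holds the snapshot a single service instance is currently using."""

    def __init__(self, initial: ConfigSnapshot) -> None:
        self.current = initial

    def apply(self, incoming: ConfigSnapshot) -> bool:
        """Apply a snapshot only if it is newer; return True if it was applied."""
        if incoming.version <= self.current.version:
            return False  # stale or duplicate delivery, keep what we have
        self.current = incoming
        return True


def versions_in_use(services: list[ServiceConfig]) -> set[int]:
    """Distinct config versions across instances; more than one means a mismatch."""
    return {s.current.version for s in services}


if __name__ == "__main__":
    v1 = ConfigSnapshot(version=1, clue_drop_interval_s=30.0, max_active_clues=5)
    v2 = ConfigSnapshot(version=2, clue_drop_interval_s=20.0, max_active_clues=8)
    a, b = ServiceConfig(v1), ServiceConfig(v1)

    a.apply(v2)                                # only one instance has seen v2 so far
    print(sorted(versions_in_use([a, b])))     # [1, 2]

    b.apply(v2)
    b.apply(v1)                                # late delivery of the old snapshot is ignored
    print(sorted(versions_in_use([a, b])))     # [2]
```

A check like `versions_in_use` would not have prevented the propagation delay, but it turns "services disagree about the config" into something you can alert on instead of something that crashes the system.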

The Architecture Decision

It was then that we finally took a step back and said, "Maybe the config store itself isn't the problem; maybe it's how we're using it." We realized that we were building on a general-purpose distributed config store for something that didn't actually require it. We had a relatively small number of config points that needed to take effect in real time, but they didn't change frequently. Why not use a simpler mechanism? We swapped ZooKeeper for an etcd cluster and, this time, used it as a plain key-value store for our config. It was far faster and more reliable for us, and it gave us the ability to manually override config values in an emergency.
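The "plain key-value store plus manual override" idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: the store is any string-to-string mapping (a dict stands in for etcd here), and the key prefixes, `read_setting`, and `read_float` are hypothetical names, not our production layout or an etcd client API.

```python
from typing import Mapping

# Hypothetical key layout; the real prefixes are not given in this post.
OVERRIDE_PREFIX = "/veltrix/override/"
CONFIG_PREFIX = "/veltrix/config/"


def read_setting(store: Mapping[str, str], name: str, default: str) -> str:
    """Resolve a setting: emergency override first, then the regular value, then the default."""
    for key in (OVERRIDE_PREFIX + name, CONFIG_PREFIX + name):
        value = store.get(key)
        if value is not None:
            return value
    return default


def read_float(store: Mapping[str, str], name: str, default: float) -> float:
    """Read a numeric setting, falling back to the default if the stored value is malformed."""
    raw = read_setting(store, name, str(default))
    try:
        return float(raw)
    except ValueError:
        return default  # a bad value should slow the hunt down, not crash the service


if __name__ == "__main__":
    store = {CONFIG_PREFIX + "clue_drop_interval_s": "30"}
    print(read_float(store, "clue_drop_interval_s", 60.0))  # 30.0

    # An operator sets an override during an incident; it wins over the regular value.
    store[OVERRIDE_PREFIX + "clue_drop_interval_s"] = "5"
    print(read_float(store, "clue_drop_interval_s", 60.0))  # 5.0
```

Keeping the override under its own prefix means removing it is a single delete, after which the regular value applies again.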

What The Numbers Said After

After we made the switch, we saw a dramatic reduction in config-related latency. Clue drops were happening at the right pace, and users were happy again. We also reduced the number of crashes by a factor of 3, mainly because we weren't getting those pesky config mismatches anymore. Our ops team got their sleep back, and our users could keep on hunting. We'd still get the occasional panic-stricken call from the marketing team asking for a last-minute change, but at least the system would survive it.

What I Would Do Differently

In hindsight, I would have taken the simpler route from the start. We spent way too much time trying to optimize a distributed config store for a use case that didn't need it. If I had to do it again, I'd look at a dedicated solution: a service mesh, or a purpose-built configuration service such as Dapr's configuration API or Azure App Configuration. These are designed to handle config changes in real time without all the unnecessary complexity. I'd also measure config-related latency and errors from day one, so we wouldn't have to go through the same cycle of frustration and panic.
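Measuring config propagation from day one doesn't need much. Here is a minimal sketch, assuming the publisher stamps each change with the time it was written and that clocks across machines are reasonably synchronized. `PropagationTracker` and its method names are hypothetical, and a real setup would export these numbers to whatever metrics system you already run.

```python
import time
from typing import Callable, List


class PropagationTracker:
    """Records how long config changes take to reach a service, plus apply failures."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock            # injectable so the tracker can be tested
        self.samples: List[float] = []
        self.errors = 0

    def record_applied(self, published_at: float) -> float:
        """Call when a change takes effect; returns the propagation latency in seconds."""
        latency = max(0.0, self._clock() - published_at)  # clamp small clock skew
        self.samples.append(latency)
        return latency

    def record_error(self) -> None:
        """Call when a change could not be parsed or applied."""
        self.errors += 1

    def worst(self) -> float:
        """Largest latency seen so far, or 0.0 if nothing has been recorded."""
        return max(self.samples, default=0.0)


if __name__ == "__main__":
    tracker = PropagationTracker()
    published_at = time.time()         # in practice this timestamp travels with the config value
    tracker.record_applied(published_at)
    print(f"worst propagation so far: {tracker.worst():.3f}s, errors: {tracker.errors}")
```

With a number like `worst()` on a dashboard, a 30-second propagation delay shows up as a graph on the first day rather than as user complaints weeks later.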

It's funny how sometimes the simplest solutions are the best ones.