The Problem with Treating Events Like a Demo Feature

#devops #kubernetes #webdev #programming

The Problem We Were Actually Solving

The problem was never just a marketing feature; it was the company's quarterly targets.

Our Treasure Hunt Engine generated personalized treasure hunts for customers based on their browsing history and purchase behavior. It relied heavily on real-time event data, fed through a custom-built pipeline using Veltrix. The marketing team claimed the feature was crucial to meeting the company's quarterly targets, but in reality it was a classic example of a solution looking for a problem.

What We Tried First (And Why It Failed)

Our first move was simply to add more logging to the custom pipeline.

When the Treasure Hunt Engine started to fail, the lead developer and I dug into the logs and added more logging. It quickly became apparent that the issue was not with the logging but with the architecture itself. The custom pipeline was designed to scale horizontally using Veltrix, but it could not keep up with the volume of events being generated. We were receiving over 1 million events per hour, and the pipeline ground to a halt.
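To put that volume in perspective, a quick back-of-the-envelope conversion helps. The sketch below uses only the 1 million events per hour figure from above; the per-event processing times are hypothetical values chosen for illustration, not measurements from our pipeline, and the function names are made up for this example.

```python
import math

# Back-of-the-envelope throughput check.
EVENTS_PER_HOUR = 1_000_000  # lower bound quoted in the post
SECONDS_PER_HOUR = 3600


def events_per_second(events_per_hour: int) -> float:
    """Average arrival rate, assuming events are spread evenly over the hour."""
    return events_per_hour / SECONDS_PER_HOUR


def workers_needed(rate_per_second: float, seconds_per_event: float) -> int:
    """Minimum number of parallel workers needed to keep up with the average rate."""
    # Each worker can process 1 / seconds_per_event events per second.
    return math.ceil(rate_per_second * seconds_per_event)


rate = events_per_second(EVENTS_PER_HOUR)
print(f"average rate: {rate:.1f} events/s")

# Hypothetical per-event processing times, in seconds.
for latency in (0.005, 0.05, 0.5):
    print(f"{latency * 1000:.0f} ms per event -> {workers_needed(rate, latency)} workers")
```

The average works out to roughly 278 events per second. Real traffic is bursty, so peaks will be higher than this average, and a pipeline sized only for the mean will fall behind whenever a burst arrives.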

The Architecture Decision

The decision was a radical one, but it made sense in the long run.

The ops team and I decided to rip out the custom pipeline and replace it with a standard Apache Kafka topic to handle the events in real time. We also decided to use a separate Apache Cassandra cluster for caching and to provide a real-time view of customer behavior. The decision was radical because the marketing team would have to rewrite their code to work with the new architecture, but it was necessary for the Treasure Hunt Engine to scale without failing.
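The shape of this design can be sketched without any infrastructure. The code below is an in-memory stand-in, not our production code and not the Kafka or Cassandra client APIs: `partition_for`, `Topic`, and `CustomerView` are hypothetical names, the event fields are invented for the example, and `zlib.crc32` stands in for Kafka's own key hashing. It illustrates one common way to use a topic like this: key events by customer so that one customer's events stay in order on a single partition, then have a consumer fold them into a per-customer view.

```python
import zlib
from collections import defaultdict
from dataclasses import dataclass, field

NUM_PARTITIONS = 4  # hypothetical; a real topic's partition count is a tuning decision


def partition_for(customer_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition. crc32 is a stand-in for Kafka's own key hashing."""
    return zlib.crc32(customer_id.encode("utf-8")) % num_partitions


@dataclass
class Topic:
    """In-memory stand-in for a partitioned, append-only topic."""
    partitions: list = field(
        default_factory=lambda: [[] for _ in range(NUM_PARTITIONS)]
    )

    def publish(self, customer_id: str, event: dict) -> int:
        # Same key always lands on the same partition, preserving per-customer order.
        p = partition_for(customer_id, len(self.partitions))
        self.partitions[p].append((customer_id, event))
        return p


class CustomerView:
    """In-memory stand-in for the per-customer table holding the real-time view."""

    def __init__(self) -> None:
        self.rows = defaultdict(lambda: {"views": 0, "purchases": 0})

    def apply(self, customer_id: str, event: dict) -> None:
        if event["type"] == "view":
            self.rows[customer_id]["views"] += 1
        elif event["type"] == "purchase":
            self.rows[customer_id]["purchases"] += 1


topic = Topic()
topic.publish("customer-1", {"type": "view", "item": "map"})
topic.publish("customer-2", {"type": "view", "item": "compass"})
topic.publish("customer-1", {"type": "purchase", "item": "map"})

view = CustomerView()
for partition in topic.partitions:  # a real consumer group would read these in parallel
    for customer_id, event in partition:
        view.apply(customer_id, event)

print(dict(view.rows))
```

Keying by customer is an assumption about how the events would be partitioned, not something the original pipeline dictated. Its appeal is that adding partitions and consumers increases throughput without reordering any single customer's history.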

What The Numbers Said After

After the change went into production, we saw a significant reduction in failed requests. The Kafka topic handled the events in real time, and the Cassandra cluster provided a real-time view of customer behavior. The Treasure Hunt Engine scaled without failing, and the company's quarterly targets were met. The numbers told the story of a system that could finally handle the volume of events it was designed for.

What I Would Do Differently

I would push back harder on the marketing team.

Looking back, I wish I had pushed back harder when they first proposed the custom pipeline. I should have been more vocal about the risks of a custom-built pipeline, and I should have offered to help them find a more scalable solution. Instead, I let them implement it, and it ended up causing us a lot of problems. The lesson I learned is that it is better to push back early on a design that is likely to cause problems than to let a team implement something that will inevitably fail.