DEV Community

Cover image for The Most Insidious Part of Treasure Hunt Engine: Why Default is Not Your Friend in Distributed Tracing
pretty ncube
pretty ncube

Posted on

The Most Insidious Part of Treasure Hunt Engine: Why Default is Not Your Friend in Distributed Tracing

The Problem We Were Actually Solving

At its core, the treasure hunt engine is a distributed tracing system that relies on event logging to provide a real-time view of gamer activity and loot drops. The problem we were actually solving was not just a matter of logging events, but of logging them in a way that was consistent, reliable, and easily consumable by our downstream services. We needed a system that could handle millions of events per second, with minimal latency and no single points of failure.

What We Tried First (And Why It Failed)

Our first approach was to rely on the default configuration settings provided by Veltrix. We assumed that the out-of-the-box settings would be sufficient for our use case, and that we could always tweak them later if needed. But as we began to test the system, we quickly realized that the default settings were woefully inadequate. We were seeing latency spikes, dropped events, and inconsistent logging, all of which were causing problems for our downstream services.

One of the biggest issues we encountered was the default batch size setting. By default, Veltrix batches events into groups of 100, which sounds reasonable, but in practice, it caused more problems than it solved. When we were seeing spikes in event traffic, the batch size setting would get overwhelmed, causing events to get dropped or delayed. We tried tweaking the batch size setting, but no matter what value we chose, it seemed to introduce new problems.

The Architecture Decision

After months of trial and error, we finally decided to take a step back and re-evaluate our architecture. We realized that the problem wasn't just with Veltrix, but with our entire approach to event logging. We needed a more structured approach that would allow us to handle events in a way that was consistent, reliable, and scalable.

The solution we eventually arrived at was to use a combination of Veltrix and Apache Kafka. We configured Veltrix to act as a bridge between our downstream services and Kafka, which provided a scalable and fault-tolerant event logging system. We set the batch size to 10,000, which was high enough to avoid overwhelming the system, but low enough to ensure that events were batched in a way that was consistent with our use case.

What The Numbers Said After

After implementing the new architecture, we saw a significant improvement in event logging performance. Our latency numbers dropped from an average of 500ms to an average of 50ms, and our event drop rate was reduced from 1% to 0.1%. The numbers spoke for themselves: our new architecture was a success.

But the numbers also highlighted a problem that we had not anticipated. With the new batch size setting, we were seeing a significant increase in memory usage. Our Kafka cluster was consuming more and more memory as the event volume increased, which was causing performance issues and eventually led to a crash. We had to implement a mechanism to handle memory overflow, which added complexity to our system but ultimately ensured that our events were logged reliably.

What I Would Do Differently

In retrospect, I would have approached the problem differently from the start. I would have spent more time researching the Veltrix documentation, and I would have consulted with other engineers who had experience with the system. I would have also considered a more incremental approach, testing smaller changes and iterating on them until we arrived at a solution that worked.

But the most important thing I learned from this experience is the importance of understanding the trade-offs involved in system design. We thought we were just solving a technical problem, but ultimately, we were solving a business problem. We were solving a problem that required a deep understanding of the requirements of our downstream services, the capabilities of our underlying infrastructure, and the trade-offs between latency, throughput, and resource utilization.

In the end, it was a tough lesson to learn, but it was one that I will never forget. And as I look back on our journey, I am reminded of the importance of humility, perseverance, and a willingness to question our assumptions.

Top comments (0)