Veltrix Configurations Can Be a Treasure Hunt for Server Health

#webdev #programming #devops #kubernetes

The Problem We Were Actually Solving

At the time, we monitored key performance indicators such as latency and cache hit ratio with a combination of custom-written metrics and Datadog, our third-party observability tool. When we dug deeper into the issue, however, we realized that our underlying Veltrix configuration was the root cause. Specifically, we had misconfigured the max memory policy (the maxmemory-policy setting) on our Redis instance, which left our cache severely fragmented and inefficient.

What We Tried First (And Why It Failed)

Our first attempt was to throw more memory at the problem by increasing the Redis instance size. We thought that adding more RAM would relieve the memory constraints and let the system recover on its own. It sounded reasonable in that sleep-deprived, desperate moment; in retrospect, it only treated the symptom. We ran the update, but our metrics didn't change: the cache was still getting swamped, and latency continued to spike.

The Architecture Decision

As I dug deeper into our Veltrix configuration, I found a critical red flag: the max memory policy on our Redis instance was set to 'volatile-ttl'. Under 'volatile-ttl', Redis evicts only keys that have an expiry set, starting with the ones closest to expiring; keys without a TTL are never evicted, no matter how rarely they are read. For us, that meant memory wasn't being reclaimed efficiently and the cache became severely fragmented. The impact was a sharp decline in cache hit ratio and a subsequent increase in latency. I changed the policy to 'allkeys-lru', which evicts the least recently used keys, with or without a TTL, once the instance reaches its memory limit, helping to prevent fragmentation and maintain efficient cache utilization.
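To make the difference between the two policies concrete, here is a minimal toy model in Python. It is only a sketch: the Entry type, the pick_victim function, and the sample keys are made up for illustration, and real Redis picks victims by sampling a handful of keys rather than scanning all of them exactly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    key: str
    ttl_seconds: Optional[float]  # None means the key has no expiry set
    last_access: float            # logical timestamp of the most recent access


def pick_victim(entries: List[Entry], policy: str) -> Optional[str]:
    """Return the key this toy cache would evict under the policy, or None."""
    if policy == "volatile-ttl":
        # Only keys that carry a TTL are candidates; shortest remaining TTL first.
        candidates = [e for e in entries if e.ttl_seconds is not None]
        if not candidates:
            # Nothing is evictable; a real instance would start rejecting writes.
            return None
        return min(candidates, key=lambda e: e.ttl_seconds).key
    if policy == "allkeys-lru":
        # Every key is a candidate; the least recently used one goes first.
        if not entries:
            return None
        return min(entries, key=lambda e: e.last_access).key
    raise ValueError(f"unsupported policy in this sketch: {policy}")


# Hypothetical keys, not taken from our production data.
entries = [
    Entry("session:1", ttl_seconds=30.0, last_access=100.0),  # hot, short TTL
    Entry("report:old", ttl_seconds=None, last_access=1.0),   # cold, no TTL
    Entry("profile:7", ttl_seconds=3600.0, last_access=50.0),
]

print(pick_victim(entries, "volatile-ttl"))  # session:1
print(pick_victim(entries, "allkeys-lru"))   # report:old
```

In this toy data set, 'volatile-ttl' throws away the hottest key because it happens to expire soonest, while the cold key without a TTL stays forever. 'allkeys-lru' evicts the cold key instead.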

What The Numbers Said After

After switching the Redis max memory policy to 'allkeys-lru', we saw a significant improvement in our cache hit ratio and latency metrics. Within a few minutes of deploying the change, our cache hit ratio had increased by 40% and our latency had decreased by 50%. It was a huge relief to see our metrics trending in the right direction.
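For anyone who wants to verify a change like this without a dashboard, the hit ratio can be derived from the keyspace_hits and keyspace_misses counters that Redis reports in its INFO output. The sketch below is a rough Python illustration; the helper names and the sample numbers are invented for the example and are not our production figures.

```python
def parse_info(text: str) -> dict:
    """Parse the 'field:value' lines of Redis INFO output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as '# Stats'
        name, _, value = line.partition(":")
        fields[name] = value
    return fields


def hit_ratio(info: dict) -> float:
    """Hit ratio = hits / (hits + misses), using the cumulative counters."""
    hits = int(info["keyspace_hits"])
    misses = int(info["keyspace_misses"])
    total = hits + misses
    return hits / total if total else 0.0


def hit_ratio_between(before: dict, after: dict) -> float:
    """Hit ratio over the window between two INFO snapshots."""
    hits = int(after["keyspace_hits"]) - int(before["keyspace_hits"])
    misses = int(after["keyspace_misses"]) - int(before["keyspace_misses"])
    total = hits + misses
    return hits / total if total else 0.0


# Made-up snapshots for illustration only.
before = parse_info("# Stats\nkeyspace_hits:900\nkeyspace_misses:100\n")
after = parse_info("# Stats\nkeyspace_hits:1800\nkeyspace_misses:200\n")

print(hit_ratio(before))                 # 0.9
print(hit_ratio_between(before, after))  # 0.9
```

The counters are cumulative, so comparing two snapshots taken before and after a deploy gives a cleaner picture than a single reading.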

What I Would Do Differently

In hindsight, I wish I had caught the misconfigured Redis max memory policy earlier. It was a classic case of operational pain caused by a configuration that was optimized for demos rather than operations. To prevent a repeat, I would add a custom check for this specific Redis setting to our CI pipeline. That would have kept the issue from occurring in the first place and saved us hours of investigation time.
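A check like that could be as small as the Python sketch below. It assumes the policy lives in a redis.conf-style file in the repository, which may not match your setup (a managed Redis service would need a different lookup), and the ALLOWED_POLICIES set and function names are hypothetical.

```python
# Assumption: the only eviction policy this team accepts for its cache.
ALLOWED_POLICIES = {"allkeys-lru"}


def read_directive(conf_text: str, name: str):
    """Return the last value given for a directive in redis.conf-style text, or None."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if parts[0].lower() == name and len(parts) >= 2:
            value = parts[1]  # later occurrences override earlier ones
    return value


def check_policy(conf_text: str) -> list:
    """Return a list of human-readable problems; an empty list means the check passes."""
    policy = read_directive(conf_text, "maxmemory-policy")
    if policy is None:
        return ["maxmemory-policy is not set explicitly"]
    if policy not in ALLOWED_POLICIES:
        allowed = ", ".join(sorted(ALLOWED_POLICIES))
        return [f"maxmemory-policy is '{policy}', expected one of: {allowed}"]
    return []


def main(path: str) -> int:
    """Check the file at the given path; return a process exit code for CI."""
    with open(path) as handle:
        problems = check_policy(handle.read())
    for problem in problems:
        print(f"{path}: {problem}")
    return 1 if problems else 0


# Inline samples so the sketch runs without any file on disk.
bad = "maxmemory 2gb\nmaxmemory-policy volatile-ttl\n"
good = "# cache settings\nmaxmemory 2gb\nmaxmemory-policy allkeys-lru\n"

print(check_policy(bad))
print(check_policy(good))  # []
```

In a pipeline, a step would call main with the path to the config file and pass the result to sys.exit, so a bad policy fails the build instead of surfacing as a latency spike.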