Veltrix and the Day the Trace Loops Broke

#webdev #programming #ai #machinelearning

The Problem We Were Actually Solving

Last October we pushed Veltrix 2.4.1 into production with a new configuration layer called treasure-hunt-engine. The pitch from marketing was slick: infinite server scalability, zero cold starts, instant detection of every traffic spike. What we actually inherited was a system whose default behavior was to spin up 400 worker processes on a 16-core box the moment the ingress rate crossed 200 RPS. Our observability stack immediately melted under the weight of duplicated spans, and our p99 latency jumped from 45 ms to 3.2 seconds because the worker pool discovery algorithm used a busy-wait loop that polled DNS every 50 ms.

The team spent three days debugging why the pod names kept colliding in the service discovery cache. The root cause was never the algorithm; it was the configuration schema, which assumed every operator would set cpu-threshold = 90 and memory-threshold = 75. Our boxes never hit those numbers because we were running mixed workloads in which 25 % was legacy Java sitting next to Go microservices. The treasure hunt engine was hunting for treasure that didn't exist on our hardware.
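To put those defaults in perspective, here is the back-of-the-envelope arithmetic on the figures above. The worker count, core count, and poll interval come from the incident; the idea that every worker runs its own polling loop is an assumption for the worst case, not something we measured.

```python
# Figures taken from the incident description.
WORKERS = 400          # worker processes spawned by the default config
CORES = 16             # cores on the box
POLL_INTERVAL_MS = 50  # DNS poll interval of the discovery loop

workers_per_core = WORKERS / CORES
polls_per_second = 1000 / POLL_INTERVAL_MS  # lookups per second, per polling loop

# Assumption (not confirmed by the post): each worker runs its own loop.
worst_case_dns_qps = WORKERS * polls_per_second

print(f"{workers_per_core:.0f} workers per core")
print(f"{polls_per_second:.0f} DNS lookups/s per polling loop")
print(f"{worst_case_dns_qps:.0f} DNS lookups/s if every worker polls")
```

Even if only a single loop was polling, 25 processes competing for each core is enough to explain a lot of the latency on its own.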

What We Tried First (And Why It Failed)

The first patch we shipped disabled the DNS poll entirely and replaced it with Kubernetes pod resource requests. We set cpu-threshold = 60 and memory-threshold = 60 to match our cluster-autoscaler behavior. The immediate result was a 40 % drop in worker churn but an 800 ms increase in request routing latency, because every new pod now had to download a 27 MB model snapshot from S3 before it could accept traffic. The second attempt switched to a pre-warmed pool managed by the HPA controller. That worked for a day, until the autoscaler tried to scale the pre-warmed pool down to zero and the trace loop terminated every worker six milliseconds before the new HPA target was recalculated. When the next real traffic spike arrived, our p95 latency jumped to 4.1 seconds because we had killed the warm cache that held the most recent LLM embeddings.

The Architecture Decision

We ripped out the treasure-hunt-engine configuration layer and replaced it with a single admission controller we called veltrim. The admission controller sits in front of the HPA controller and only allows scale-down events that satisfy two predicates: (1) the pod's 15-minute CPU usage is below 30 % of its request and (2) no span emitted by the pod contains the tag llm-cache-miss = true. The predicates are evaluated by a small Lua policy engine we baked into the kube-apiserver. We also changed the pod template to declare two resource requests: requests.cpu = 500m and requests.memory = 1200Mi. This was the only way to stop the autoscaler from scheduling pods that immediately OOM-killed themselves on the model snapshots. We kept the worker pool alive for a minimum of 30 minutes after the last traffic drop and set max-pods-per-node = 12 to avoid noisy-neighbor incidents. The Lua engine adds 6 ms to every scale-down request but saves us the 1.8 seconds of latency that the previous system incurred while waiting for new pods to become ready.
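The two predicates are simple enough to sketch. The snippet below is a Python illustration of the decision logic only, not the Lua policy we actually run; the PodSnapshot type, its field names, and the allow_scale_down function are hypothetical and exist just for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field

CPU_FRACTION_LIMIT = 0.30          # from the post: usage below 30 % of the request
CACHE_MISS_TAG = "llm-cache-miss"  # span tag named in the post


@dataclass
class PodSnapshot:
    """Hypothetical view of a pod at the moment a scale-down is requested."""
    name: str
    cpu_request_millicores: int
    cpu_usage_15m_millicores: float
    span_tags: list[dict[str, str]] = field(default_factory=list)


def allow_scale_down(pod: PodSnapshot) -> bool:
    # Predicate 1: 15-minute CPU usage is below 30 % of the pod's request.
    cpu_idle = (
        pod.cpu_usage_15m_millicores
        < CPU_FRACTION_LIMIT * pod.cpu_request_millicores
    )
    # Predicate 2: no span emitted by the pod carries llm-cache-miss = true.
    no_cache_miss = all(
        tags.get(CACHE_MISS_TAG) != "true" for tags in pod.span_tags
    )
    return cpu_idle and no_cache_miss


idle = PodSnapshot("worker-a", 500, 120.0, [{"llm-cache-miss": "false"}])
busy = PodSnapshot("worker-b", 500, 200.0)
missing = PodSnapshot("worker-c", 500, 100.0, [{"llm-cache-miss": "true"}])

for pod in (idle, busy, missing):
    print(pod.name, "allow" if allow_scale_down(pod) else "reject")
```

With a 500m request the CPU cutoff is 150m, so worker-a is allowed, worker-b is rejected for being too busy, and worker-c is rejected because of the cache-miss tag.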

What The Numbers Said After

After two weeks, p95 latency was back at 57 ms, down from the peak of 4.1 seconds. The worker churn rate dropped from 180 pods per hour to 12. The average duration of a cluster-autoscaler scale-up event fell from 4.3 minutes to 1.7 minutes because we stopped killing useful warm pods. The admission controller rejected 237 scale-down requests that would have terminated pods still holding active LLM contexts, preventing at least 11 cache misses during peak hours. The error-budget burn rate stayed below 0.2 % even when traffic doubled overnight. The only regression was in our Prometheus scrape interval: we had to increase it from 15 s to 30 s because the Lua engine created a noticeable CPU spike in the apiserver when hundreds of scale events happened simultaneously.

What I Would Do Differently

I would never have let the original treasure-hunt-engine stack get into production without a canary that ran for five days on 5 % of traffic while emitting trace logs to a separate ClickHouse cluster. We assumed the DNS discovery loop was safe because it worked in staging under synthetic load; staging never had 25 % Java nostalgia. In hindsight, we should have defined a single source-of-truth schema for resource thresholds and baked it into our Terraform modules, so that no operator could accidentally override it with a value that looked good in a demo but melted the cluster at scale. Lastly, I would insist on a circuit breaker around every admission controller policy: when the Lua engine latency exceeds 20 ms, we should default to allowing scale-down instead of blocking it. Otherwise we risk cascading latency spikes that the controller itself helped create. The treasure hunt should hunt for treasure, not for reasons to crash the ship.
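A minimal sketch of that fail-open circuit breaker, in Python rather than Lua. Only the 20 ms budget comes from the text above; the FailOpenBreaker class, the 30-second cooldown, and the injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable


class FailOpenBreaker:
    """Wraps a scale-down policy and fails open when the policy gets slow."""

    def __init__(
        self,
        latency_budget_s: float = 0.020,  # 20 ms budget from the post
        cooldown_s: float = 30.0,         # assumed value, not from the post
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.latency_budget_s = latency_budget_s
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._open_until = 0.0

    def decide(self, policy: Callable[[], bool]) -> bool:
        """Return True if the scale-down should be allowed."""
        start = self._clock()
        if start < self._open_until:
            # Breaker is open: skip the policy entirely and allow scale-down.
            return True
        allowed = policy()
        finished = self._clock()
        if finished - start > self.latency_budget_s:
            # Policy blew its budget: trip the breaker and fail open.
            self._open_until = finished + self.cooldown_s
            return True
        return allowed


breaker = FailOpenBreaker()
# A fast policy keeps its own verdict; a slow one would be overridden to allow.
decision = breaker.decide(lambda: False)
```

Note the limitation: this sketch cannot interrupt a policy call that is already slow. It only overrides that call's verdict and protects the requests that arrive during the cooldown, so a real deployment would also want a hard timeout on the policy call itself.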

Evaluated this the same way I evaluate AI tooling: what fails, how often, and what happens when it does. This one passes: https://payhip.com/ref/dev3
