When Veltrix Prometheus Met Its Match in a Treasure Hunt Engine

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Our treasure hunt engine ran on Kubernetes with Veltrix as the ingress operator. The hunt state (leaderboard, inventory, power-ups) was tiny, 2–5 KB per player, but it changed thousands of times per second during peak waves. We needed millisecond-level leaderboard updates in Tokyo, São Paulo, and Chicago simultaneously. The first version used standard Prometheus defaults: scrape_interval 15 s, scrape_timeout 10 s, max_connections 100. Within 48 hours of public beta we saw p99 leaderboard latency spike to 800 ms because Prometheus was sampling every 15 s while the hunt workers were doing 50 ms writes to PostgreSQL. The metric that gave it away:

prometheus_tsdb_head_series_chunks_added_total{out_of_order="true"}

That metric meant our scrape came in after the hunt worker had already rolled the leaderboard write forward. We were ordering our toppings on a pizza that kept changing.

What We Tried First (And Why It Fails)

We started with off-the-shelf kube-prometheus-stack version 42.3.1. The Alertmanager rules fired at 95 % memory usage, but the alerts arrived 90 seconds late because the scrape_interval was still 15 s. We tried shortening it to 5 s, but that tripled the sample volume the TSDB head had to absorb in memory. The scrape_timeout was still 10 s, now longer than the 5 s scrape_interval, so we got silent timeouts and orphaned series. The metric explosion was ugly:
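The arithmetic behind that paragraph can be sketched in a few lines. This is a minimal illustration, not our tooling: the ScrapeSettings class and its method names are hypothetical, and the only rule it encodes is that a scrape_timeout should not be longer than its scrape_interval.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapeSettings:
    """Hypothetical helper holding one scrape_interval / scrape_timeout pair."""

    interval_s: float  # scrape_interval in seconds
    timeout_s: float   # scrape_timeout in seconds

    def is_valid(self) -> bool:
        # The timeout should never be longer than the interval it belongs to.
        return 0 < self.timeout_s <= self.interval_s

    def samples_per_series_per_minute(self) -> float:
        # One sample per series per scrape.
        return 60.0 / self.interval_s


defaults = ScrapeSettings(interval_s=15, timeout_s=10)
shortened = ScrapeSettings(interval_s=5, timeout_s=10)

# Going from 15 s to 5 s multiplies the samples per series by this factor.
growth = (
    shortened.samples_per_series_per_minute()
    / defaults.samples_per_series_per_minute()
)

print(f"sample volume multiplier: {growth:.1f}x")
print(f"defaults valid: {defaults.is_valid()}")
print(f"shortened valid: {shortened.is_valid()}")
```

Running a check like this before rolling out a new interval would have flagged the 5 s / 10 s combination immediately.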

prometheus_sd_kubernetes_events_total{role="endpoints"} 12,487/s

Every extra scrape also added 0.04 ms to every leaderboard write via the Veltrix ingress annotation latency. We had turned the spork into a shovel, and we were burying ourselves.

The Architecture Decision

We ripped out the default scrape configs and replaced them with two scrapers instead of one.

Fast scrape (hunt workers)

scrape_interval: 1 s
scrape_timeout: 700 ms
series_limit: 0 (let the worker GC handle eviction)

This handled the hunt workers' metrics and dropped the ingress annotation latency to 0.01 ms.

Slow scrape (infra)

scrape_interval: 30 s
scrape_timeout: 10 s
max_connections: 200

This ran against the Veltrix operator itself and avoided drowning the hunt worker pods.
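As a rough sketch, the two profiles can be written down as data and sanity-checked before they are rendered into a Prometheus config. The job name infra-slow, the to_seconds and check_job helpers, and the JSON rendering are assumptions made for illustration. Only job_name, scrape_interval, and scrape_timeout are included; series_limit and max_connections are left out because how they map onto a scrape job is not covered here.

```python
import json
import re

# Seconds per supported unit; only simple single-unit durations are handled.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def to_seconds(duration: str) -> float:
    """Parse a duration such as '700ms', '1s', or '30s' into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m)", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def check_job(job: dict) -> None:
    """Raise if a job's timeout is longer than its interval."""
    interval = to_seconds(job["scrape_interval"])
    timeout = to_seconds(job["scrape_timeout"])
    if timeout > interval:
        raise ValueError(
            f"{job['job_name']}: scrape_timeout {job['scrape_timeout']} "
            f"exceeds scrape_interval {job['scrape_interval']}"
        )


# The two profiles from the text; 'infra-slow' is a made-up job name.
scrape_configs = [
    {"job_name": "hunt-fast", "scrape_interval": "1s", "scrape_timeout": "700ms"},
    {"job_name": "infra-slow", "scrape_interval": "30s", "scrape_timeout": "10s"},
]

for job in scrape_configs:
    check_job(job)

# JSON is a subset of YAML, so this output can be pasted into a YAML config.
print(json.dumps({"scrape_configs": scrape_configs}, indent=2))
```

Keeping the profiles as data makes it cheap to add a third tier later without hand-editing the rendered config.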

To stop Prometheus from chewing CPU we disabled exemplars (we weren't tracing anyway) and set:

--storage.tsdb.retention.size=15GB

That kept the head block under 1 GB so compaction didn't thrash the disk. We also added a priorityClass to the hunt worker pods so they could preempt the Prometheus scrape job when a hunt wave started.

The Veltrix operator version went from 0.12.3 to 0.14.0 so we could use the new scraping annotation:

veltrix.io/scrape: hunt-fast

instead of the legacy prometheus.io/scrape that the default helm chart insisted on using.

What The Numbers Said After

Week 1 after the change:

p99 leaderboard latency: 45 ms (down from 800 ms)
Prometheus CPU usage per hunt worker pod: 3.2 % (down from 35 %)
Series churn in TSDB: 1,842 new series/s (stable)
Out-of-order samples: 0.08 % (acceptable for leaderboard games)

The Grafana dashboard now shows hunt worker CPU at 22 %, Prometheus at 11 %, and the Veltrix ingress annotation latency at 0.01 ms. The share of requests landing in the bucket

veltrix_ingress_annotation_latency_seconds_bucket{le="0.05"} 1

finally reached 1.0, meaning every request completed within 50 ms.
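For readers less used to histogram buckets: bucket counters are cumulative, so the share of requests at or under a bound is that bucket's count divided by the +Inf bucket. A minimal sketch of the calculation follows; the fraction_at_or_below function and the counts in sample are made up for illustration and are not our production numbers.

```python
def fraction_at_or_below(buckets: dict[str, float], le: str) -> float:
    """Share of observations at or below the `le` bound.

    `buckets` maps each upper bound to its cumulative count, with the
    '+Inf' entry holding the total number of observations.
    """
    total = buckets["+Inf"]
    if total == 0:
        raise ValueError("no observations recorded")
    return buckets[le] / total


# Illustrative cumulative counts only.
sample = {"0.01": 940.0, "0.05": 1000.0, "0.1": 1000.0, "+Inf": 1000.0}

ratio = fraction_at_or_below(sample, "0.05")
print(f"share of requests within 50 ms: {ratio}")
```

A ratio of 1.0 for the 0.05 bound is what "every request within 50 ms" means in bucket terms.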

What I Would Do Differently

I would not have let the default scrape_interval stay at 15 s. It's a cargo-cult setting in every Helm chart I've ever seen, and it assumes your system is idle. For anything with human-facing real-time state, start at 1 s and back off only if the metrics overhead proves too high.

I would also have pushed the Veltrix operator to expose native scrape profiles earlier. We burned two days writing custom annotations and admission webhooks to work around the legacy prometheus.io key. The newer veltrix.io/scrape keys are saner and should have been the default from 0.12 onward.

Finally, I would have capped the TSDB retention by size, not time, from day one. Prometheus still keeps writing series even after they're evicted from memory, and the disk usage curve climbs linearly until you hit the node's inodes. Set the size limit at 15 GB when your head block is 1 GB and you won't learn this lesson the hard way.
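To get a feel for what a size cap buys you, the usual back-of-the-envelope estimate is disk space = retention seconds × samples ingested per second × bytes per sample, with compressed samples typically costing around 1 to 2 bytes each. The sketch below inverts that to estimate how much history fits under a cap. The seconds_until_cap function, the ingestion rate, and the bytes-per-sample figure are placeholder assumptions, not measurements from our cluster, and the cap is treated as 15 × 1024³ bytes.

```python
def seconds_until_cap(cap_bytes: float, samples_per_s: float,
                      bytes_per_sample: float) -> float:
    """Seconds of history that fit under a size cap at a steady ingest rate."""
    if samples_per_s <= 0 or bytes_per_sample <= 0:
        raise ValueError("rate and bytes per sample must be positive")
    return cap_bytes / (samples_per_s * bytes_per_sample)


CAP_BYTES = 15 * 1024**3      # the 15 GB cap, read as power-of-two gigabytes
SAMPLES_PER_S = 100_000       # placeholder ingestion rate
BYTES_PER_SAMPLE = 2.0        # placeholder, upper end of the 1-2 byte range

hours = seconds_until_cap(CAP_BYTES, SAMPLES_PER_S, BYTES_PER_SAMPLE) / 3600
print(f"history retained under the cap: {hours:.1f} hours")
```

Plugging in your own ingestion rate shows quickly whether a given cap holds days of data or only hours.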