The Day the Treasure Hunt Engine Stopped Beeping

#kubernetes #devops #webdev #programming

The Problem We Were Actually Solving

We weren't running a treasure hunt. We were running a search service that let operators navigate gigabytes of session logs, metrics scrapes, and incident timelines in near-real time. The treasure hunt metaphor came from marketing: users were hunting for the one golden stack trace that explained why the p95 latency had jumped from 80 ms to 2.3 s at 22:11 the night before.

The service was built on Veltrix, a proprietary search engine whose documentation read like an academic paper on distributed systems: it promised horizontal scalability, strong consistency, and millisecond query times. What it did not tell you was that the default on-disk index would fragment after 48 hours of continuous ingestion, and that the shard balancer would happily chew through 12 CPU cores moving 64 GB of data around while still accepting queries. Those queries would then time out because the JVM GC had decided 03:47 was a good time to spend 42 seconds promoting objects into the old generation.

What We Tried First (And Why It Failed)

We started by throwing hardware at it. The first fix was to add more nodes, which improved the p99 latency temporarily but introduced a new problem: the gossip protocol used by Veltrix assumed clock drift would be bounded by seconds, not minutes. Two nodes in eu-central-1 were syncing via NTP only every six hours, and during those gaps the shard allocation view diverged so badly that the cluster would restart entire indices to re-elect a master. At 05:22 the cluster decided to heal itself by shipping every shard to a single node that was already 90 % memory-bound. We watched the dashboard as that node's RSS climbed from 22 GB to 64 GB in under four minutes. The OOM killer arrived politely and killed the Veltrix process, and then the kernel logged a panic, because the node was running on a bare-metal host with swap disabled by policy.

Our second attempt was to tune the JVM. We increased the heap from 8 GB to 16 GB, set -XX:+UseG1GC, and added -XX:MaxGCPauseMillis=200. Within two hours the p99 latency dropped back to 120 ms, until the weekend cron job kicked off a full re-index of every incident log. The ingestion spike caused a 6 GB hump in the old-gen space. G1 spent the next 45 minutes trying to keep up, but the final GC cycle paused for 3.7 seconds, and that was enough to push queued requests past the load balancer's 5-second timeout window. The PagerDuty rule fired again.

The Architecture Decision

We stopped treating Veltrix as a black box and instead carved out a dedicated indexing pool. The new plan:

Split ingestion from query. We deployed a fleet of lightweight forwarders that buffered logs in Kafka and shipped deltas to Veltrix every 30 seconds. This cut the ingestion path's tail latency from 2.1 s to 80 ms and stopped the GC storms because the indexing process no longer had to keep every document on the heap.
Adopted tiered storage. We started writing indices to local NVMe for 24 hours, then moving cold segments to S3 via Veltrix's s3_backup plugin. The plugin was undocumented, but the source showed it used multipart uploads with 8 MB parts. We patched it to 64 MB parts because Veltrix's default chunk size matched the HDFS block size, which is a terrible idea on a search engine. After the patch, the backup phase went from 12 minutes to 4 minutes and stopped saturating the 1 Gbps egress link.
Turned off automatic shard rebalancing during business hours. Instead, we scheduled a nightly job that only ran when the cluster's p95 latency stayed below 100 ms for three consecutive checks. We also added a custom readiness probe that refused leadership if the node's RSS grew past 80 % of RAM, preventing the 64 GB node meltdown from ever happening again.
Switched to the Azul Zulu JVM with -XX:+UseZGC. The garbage collector's 10 ms pause target bought us enough headroom to survive the Saturday re-index.
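
The rebalancing gate and the leadership rule from the third item are small enough to sketch. This is a minimal illustration, not the code we ran: the names RebalanceGate and may_lead are hypothetical, and it assumes something else samples the cluster's p95 latency and the node's RSS and passes the numbers in.

```python
from collections import deque


class RebalanceGate:
    """Allow rebalancing only after N consecutive healthy p95 samples.

    Hypothetical helper: the 100 ms limit and the three checks come from
    the text above; everything else is an assumption for illustration.
    """

    def __init__(self, p95_limit_ms=100.0, required_checks=3):
        self.p95_limit_ms = p95_limit_ms
        self.required_checks = required_checks
        # Only the most recent `required_checks` results are kept.
        self._recent = deque(maxlen=required_checks)

    def record(self, p95_ms):
        """Record one latency check; return True if rebalancing may run."""
        self._recent.append(p95_ms < self.p95_limit_ms)
        return len(self._recent) == self.required_checks and all(self._recent)


def may_lead(rss_bytes, total_ram_bytes, max_fraction=0.80):
    """Readiness rule: refuse leadership once RSS grows past 80 % of RAM."""
    return rss_bytes <= max_fraction * total_ram_bytes


if __name__ == "__main__":
    gate = RebalanceGate()
    for sample_ms in (90, 95, 80, 120):
        print(sample_ms, gate.record(sample_ms))
```

A single slow sample resets the streak, so the nightly job has to see three clean checks in a row again before it is allowed to move shards.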

The most painful part was the change to the veltrix-search-01 health check. It used to just ping /health, which only verified that the HTTP server was listening. Now it also checks /metrics for both index_latency_p99 and gc_pause_duration_max_seconds. If the latter exceeds 0.05 (50 ms), the node gets cordoned and drained before the balancer can even think about promoting it.
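
Roughly, the decision part of that check looks like the sketch below. It assumes /metrics returns unlabelled "name value" lines in a Prometheus-style text format, which is a guess about the payload rather than anything Veltrix documents. The function names are hypothetical, and treating a missing metric as unhealthy is my assumption; only the two metric names and the 0.05 threshold come from the setup described above.

```python
GC_PAUSE_LIMIT_SECONDS = 0.05  # threshold from the text (50 ms)


def parse_metrics(text):
    """Parse unlabelled 'name value' lines, skipping comments and blanks."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            values[parts[0]] = float(parts[1])
        except ValueError:
            continue  # ignore lines whose value is not a number
    return values


def node_is_healthy(metrics_text, gc_limit=GC_PAUSE_LIMIT_SECONDS):
    """Return False when the node should be cordoned and drained."""
    metrics = parse_metrics(metrics_text)
    # Assumption: a node that does not report both metrics is unhealthy.
    if "index_latency_p99" not in metrics:
        return False
    gc_pause = metrics.get("gc_pause_duration_max_seconds")
    if gc_pause is None:
        return False
    return gc_pause <= gc_limit


if __name__ == "__main__":
    sample = "index_latency_p99 0.12\ngc_pause_duration_max_seconds 0.006\n"
    print(node_is_healthy(sample))
```

Fetching the payload over HTTP and calling the orchestrator to cordon the node are left out on purpose; the point is only that the check now fails on GC behaviour, not just on a dead listener.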

What The Numbers Said After

In the first four weeks with the new setup:

p99 query latency stayed below 150 ms even during the Saturday batch re-index.
JVM GC pauses dropped from an average of 2.3 s to 6 ms.
Disk usage per node fell from 650 GB to 180 GB because we were no longer keeping every segment on disk.
The number of paging alerts per week fell from 3.8 to 0.2.

The cost went up by ~15 % because of the extra Kafka brokers and Azul licenses, but we avoided the 4-hour outage that would have cost us several SLA credits and a week of sleep.

What I Would Do Differently

I would not have trusted the Veltrix documentation. Every time the words "scalable," "distributed," or "consistent" appeared in their marketing slides, I should have assumed they were talking about the feature in 2026, not the one we were running in 2025.

I would have written a synthetic load test that mimicked the weekend cron job from day one. We only got there after the fact: instead of scaling up again, we built a simulation that replayed last quarter's incident logs at 3× real-time speed. That test caught the GC pause regression before we promoted the change to production.
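
The core of such a replay is just timestamp compression. Here is a minimal sketch, assuming each log record carries an epoch timestamp in seconds and that a hypothetical send callable pushes the payload at the ingestion path. The function names are mine, not from our tooling.

```python
import time


def replay_offsets(timestamps, speedup=3.0):
    """Return seconds-from-start at which each event should be replayed."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not timestamps:
        return []
    ordered = sorted(timestamps)
    start = ordered[0]
    # Dividing the original gaps by `speedup` compresses the timeline.
    return [(ts - start) / speedup for ts in ordered]


def replay(events, send, speedup=3.0, sleep=time.sleep, clock=time.monotonic):
    """Replay (timestamp_seconds, payload) pairs at `speedup` times real time.

    `sleep` and `clock` are injectable so the pacing can be tested
    without actually waiting.
    """
    ordered = sorted(events, key=lambda event: event[0])
    offsets = replay_offsets([ts for ts, _ in ordered], speedup)
    began = clock()
    for offset, (_, payload) in zip(offsets, ordered):
        delay = offset - (clock() - began)
        if delay > 0:
            sleep(delay)
        send(payload)


if __name__ == "__main__":
    demo = [(0, "first"), (3, "second"), (6, "third")]
    replay(demo, print)  # takes about two seconds instead of six
```

Pacing against a monotonic clock rather than sleeping for each gap keeps the schedule from drifting when send itself is slow, which matters when the whole point is to reproduce an ingestion spike.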

I would have replaced the default gossip protocol with Raft from the start. Gossip is for systems where nodes come and go like buses