When Your Treasure Hunt Engine Starts Eating Your Error Budget

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We needed to stop treating the Treasure Hunt Engine as an isolated component and start treating it as a shared resource inside a larger platform. Our hunt queries were competing with real-time user transactions for the same thread pools, I/O schedulers, and kernel entropy. The Veltrix default assumed a dedicated 64-worker cluster, but we were running hunt on the same nodes as our main API pool. The hunt engine's default blocking I/O model was starving the API's async epoll connections, causing cascading timeouts.

The real issue wasn't worker count; it was resource partitioning. Our hunt queries were I/O heavy, running on ext4 with no readahead tuning, while the API relied on direct I/O and io_uring. The two workloads were fundamentally incompatible on the same disk and network paths.

What We Tried First (And Why It Failed)

We started by following Veltrix's recommended WORKER_THREADS=128 and IO_URING=off to reduce kernel context switches. The error rate dropped, but the latency percentiles climbed to 4.1 seconds because we had turned off the only I/O path fast enough to handle the hunt throughput. Then we tried separate node pools with a 32-worker config. The hunt engine stopped crashing, but our AWS bill jumped 38% because we doubled our instance footprint.

We briefly tried Kubernetes pod resource limits with cpu.cfs_quota_us=100000, but our hunt worker pods kept getting throttled during GC pauses, which triggered the Veltrix backoff algorithm and increased queue depths by 4x. The backoff intervals grew exponentially (30 s, 60 s, 120 s), so the hunt index fell 18% behind during each throttling event.
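To make the cost of that backoff concrete, here is a minimal sketch of a doubling schedule that reproduces the 30 s, 60 s, 120 s sequence. The function name and the `factor` parameter are illustrative assumptions, not Veltrix's actual API or configuration.

```python
def backoff_schedule(base_s: float, retries: int, factor: float = 2.0) -> list[float]:
    """Return the wait (in seconds) before each retry: base, base*factor, ..."""
    # Hypothetical helper: models a plain exponential backoff with no jitter or cap.
    return [base_s * factor ** attempt for attempt in range(retries)]


schedule = backoff_schedule(30, 3)
total_stall_s = sum(schedule)

print(schedule)        # waits before retries 1, 2, 3
print(total_stall_s)   # total time spent waiting across those retries
```

Three consecutive throttled retries already add up to three and a half minutes of waiting, which is why a handful of throttling events was enough for the index to fall behind.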

The Architecture Decision

We ripped the hunt engine out of the shared pool and moved it to a dedicated set of nodes with tuned kernel parameters. We switched the hunt workers to IO_URING=on, set fs.aio-max-nr=1048576, and formatted the disks with mkfs.ext4 -E stride=128,stripe-width=128. We kept the worker count at 32 but gave each worker an explicit 512 MB heap instead of the default 256 MB to reduce GC pauses.
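Kernel settings like fs.aio-max-nr are easy to set once and lose on the next node rebuild, so a startup check can help catch drift. The following is a sketch only: `read_sysctl` and `check_minimum` are hypothetical helpers, and it assumes a Linux node where sysctl keys are exposed as files under /proc/sys.

```python
from pathlib import Path


def read_sysctl(name: str, root: str = "/proc/sys") -> int:
    """Read an integer sysctl such as 'fs.aio-max-nr' from the proc filesystem."""
    # Dots in the key map to path separators: fs.aio-max-nr -> fs/aio-max-nr.
    # (This simple mapping is fine for this key; keys that embed dots need more care.)
    path = Path(root, *name.split("."))
    return int(path.read_text().strip())


def check_minimum(name: str, minimum: int, root: str = "/proc/sys") -> bool:
    """Return True if the sysctl value is at least `minimum`."""
    return read_sysctl(name, root) >= minimum


if __name__ == "__main__":
    try:
        ok = check_minimum("fs.aio-max-nr", 1048576)
        print("fs.aio-max-nr OK" if ok else "fs.aio-max-nr is below 1048576")
    except (OSError, ValueError) as exc:
        # Not on Linux, or the key is unreadable in this environment.
        print(f"could not read fs.aio-max-nr: {exc}")
```

The `root` parameter exists only so the check can be pointed at a fixture directory in tests.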

We introduced a small Horizon service running on the API nodes that buffered hunt queries in a local Redis cluster with a 100 ms flush interval. Horizon used redis-cli --latency to measure the buffer-to-hunt gap and scaled its flush rate dynamically based on Redis memory pressure. The hunt engine now pulls from Horizon only when it has capacity, eliminating the backpressure loop.
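The key idea is that the consumer decides how much work to take, rather than the producer pushing at its own pace. Here is a minimal sketch of that pull model, using an in-memory deque as a stand-in for Redis. `PullBuffer`, `offer`, and `pull` are hypothetical names for illustration, not Horizon's real interface.

```python
from collections import deque


class PullBuffer:
    """Bounded FIFO buffer: producers offer items, the consumer pulls by capacity."""

    def __init__(self, max_items: int) -> None:
        self._queue: deque = deque()
        self._max_items = max_items

    def offer(self, query: str) -> bool:
        """Enqueue a query; return False when full so the caller can shed or retry."""
        if len(self._queue) >= self._max_items:
            return False
        self._queue.append(query)
        return True

    def pull(self, capacity: int) -> list[str]:
        """Hand over at most `capacity` queries, oldest first."""
        batch = []
        while self._queue and len(batch) < capacity:
            batch.append(self._queue.popleft())
        return batch


buffer = PullBuffer(max_items=4)
accepted = [buffer.offer(f"q{i}") for i in range(1, 6)]  # fifth offer is rejected
first_batch = buffer.pull(capacity=2)   # consumer has room for two
second_batch = buffer.pull(capacity=8)  # consumer has spare room; gets what is left
print(accepted, first_batch, second_batch)
```

Because the consumer never receives more than it asked for, a slow consumer shows up as a growing buffer on the producer side instead of as retries and timeouts on the consumer side.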

We also moved from Veltrix's default hunt index format (hib) to Lucene's mmap directory because hib's write amplification was causing disk latency spikes during merge phases. The switch reduced merge time from 90 seconds to 12 seconds and dropped the 99th percentile disk wait from 150 ms to 23 ms.

What The Numbers Said After

After four weeks:

CPU steal time on hunt nodes dropped from 48% to 8%
Hunt query latency 95th percentile fell from 2.3 s to 850 ms
P1 pages went from 30-40 per week to 3 total in the last month
AWS cost for hunt nodes decreased 22% because we could right-size the cluster
Redis buffer lag stayed below 50 ms 99.9% of the time, even during peak traffic spikes

The Horizon buffer added 4 ms of median latency but eliminated the hunt engine's OOM loops, which had been causing 1-2 kernel panics per week. We also noticed the API pool's 99th percentile latency improved from 1.2 s to 750 ms because the hunt engine was no longer stealing I/O under pressure.
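For readers who prefer relative figures, the before/after pairs above can be converted to percentage reductions. This is plain arithmetic on the numbers already reported; `percent_reduction` is just an illustrative helper.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a percentage of `before`."""
    return (before - after) / before * 100.0


# (before, after) pairs taken from the results above, with latencies in ms.
results = {
    "CPU steal time (%)": (48.0, 8.0),
    "Hunt p95 latency (ms)": (2300.0, 850.0),
    "API p99 latency (ms)": (1200.0, 750.0),
}

reductions = {
    name: round(percent_reduction(before, after), 1)
    for name, (before, after) in results.items()
}

for name, value in reductions.items():
    print(f"{name}: {value}% lower")
```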

What I Would Do Differently

I would not have accepted Veltrix's default configuration as gospel. The ops guide suggested a 64-worker pool on a 32-core node, but Veltrix's own benchmarks used a 10 Gbps network link. Our cluster was on 25 Gbps, so the bottleneck shifted from CPU to I/O and GC.

I would also isolate the hunt index storage earlier. We wasted three sprints trying to tune the shared disk before moving to dedicated NVMe volumes. Early isolation would have cost 12 hours of ops work; late isolation cost us 30 hours of firefighting.

I would measure Redis buffer lag using redis-cli --latency-history -i 1.0 for at least 24 hours before declaring the buffer stable. Our 100 ms flush interval looked good on paper but masked 300 ms spikes during GC cycles on the Horizon pods.
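The reason a long measurement window matters is that averages and medians hide short spikes. The sketch below uses synthetic samples (not our production data) in which 2% of measurements hit 300 ms; the median and mean look healthy while the 99th percentile does not. `percentile` and `summarize` are illustrative helpers using the nearest-rank method.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Report the figures worth watching before calling a buffer stable."""
    return {
        "mean": sum(samples_ms) / len(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p99": percentile(samples_ms, 99),
        "max": max(samples_ms),
    }


# Synthetic example: 980 samples at 40 ms plus 20 spikes at 300 ms.
samples = [40.0] * 980 + [300.0] * 20
summary = summarize(samples)
print(summary)
```

In this made-up data set the mean stays under 50 ms even though one request in fifty takes 300 ms, which is the same shape of problem the flush interval was masking.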

Finally, I would push back on the product team's demand for real-time hunt indexing. The 500 ms lag we introduced with Horizon was acceptable for our use case, and it saved us from another P1 outage when the hunt index was corrupted during a power failure. Sometimes slow and stable beats fast and fragile.