The Treasure Hunt Engine Was Lying to Me and the Docs Were No Help

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We started running Veltrix's Treasure Hunt Engine in production about three years ago. At the time, the docs claimed it could index 50 million events per second on a 16-node Kafka cluster with 3× replication and 2 TB SSDs. That looked fine on paper, until we noticed our weekly search latency graph climbing from 120 ms to 1.2 s after we passed 20 million events per day. The docs never mentioned that the engine starts dropping in-memory aggregation buffers when the working set exceeds 70 % of available RAM, which in our case was 48 GB per node. By day 42, 87 % of search requests were failing with UserTimeoutError: query exceeded 30 s budget. The Veltrix Slack channel was full of operators complaining about the same spike, but the official troubleshooting page only covered JVM GC tuning, which is useless when your symptom is a 1 MB hash table thrashing L3 cache.
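
To make the 70 % threshold concrete, here is a minimal Python sketch of the headroom arithmetic, reading the 48 GB figure as per-node RAM. The function names are hypothetical and nothing here is a Veltrix API; it only illustrates the check we wish the docs had told us to run.

```python
def eviction_threshold_gb(ram_gb: float, fraction: float = 0.70) -> float:
    """Working-set size above which buffers start being dropped (assumed 70 % rule)."""
    return ram_gb * fraction


def is_at_risk(working_set_gb: float, ram_gb: float, fraction: float = 0.70) -> bool:
    """True when the working set has crossed the assumed eviction threshold."""
    return working_set_gb > eviction_threshold_gb(ram_gb, fraction)


if __name__ == "__main__":
    ram_gb = 48  # per-node RAM, as read from the paragraph above
    print(f"threshold: {eviction_threshold_gb(ram_gb):.1f} GB")
    for working_set in (30, 34):
        print(working_set, "GB at risk:", is_at_risk(working_set, ram_gb))
```

On a 48 GB node this puts the danger line at about 33.6 GB of working set, well below the point where the box looks "full" on a dashboard.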

What We Tried First (And Why It Failed)

First, we tried the usual tricks: upsizing brokers from i3.2xlarge to i3.4xlarge, doubling heap from 12 GB to 24 GB, and switching to G1GC. That bought us two days before the fat JVM started churning 4 GB/min in evacuation pauses. Next, we tried offloading the aggregation layer to Flink SQL with a RocksDB state backend, thinking we could keep the hot path in memory. The plan looked good in the slide deck, but the Flink job kept failing with RocksDBError: IO error: No space left on device, because the local NVMe drives were only 200 GB and the state grew to 350 GB after a single repartition. We tried reclaiming space by making compaction more aggressive (rocksdb.compaction.style=LEVEL), but that introduced duplicate keys in the windowed aggregations, and our SLA guarantees on exactly-once search results vaporized.

The Architecture Decision

We stopped trying to outrun the heap and instead moved the aggregation windows out of the Treasure Hunt Engine altogether. We carved out a new service boundary: the Search Aggregator Microservice. Its job is to consume raw events from a dedicated compacted Kafka topic, maintain the 5-minute tumbling windows in a local Caffeine cache, and publish pre-aggregated deltas to a second topic read by the Treasure Hunt Engine. We picked Caffeine because it benchmarks at 11 ns get latency and a 99 % hit rate with a 512 MB maximum, which fits in L3 cache. The service is idempotent: it uses the event offset as the Kafka key and emits a tombstone on every window close, so late data can be safely dropped by downstream consumers without violating our consistency model. We sacrificed 150 ms of real-time freshness (from 5 s to 5.15 s) to keep the Treasure Hunt Engine a pure index. That extra 150 ms is cheaper than tuning a JVM that wants a 64 GB heap to stay stable.
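
The core loop of such an aggregator can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production service: it swaps Caffeine for a plain dict, assumes a single partition so offsets increase monotonically, uses the offset to skip replayed events, and keys output records by (window start, group), a record layout invented for the example. The Event and WindowAggregator names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling windows, as described above


@dataclass(frozen=True)
class Event:
    offset: int        # Kafka offset within a single partition (assumed)
    timestamp_ms: int  # event time in milliseconds
    group: str         # hypothetical aggregation key, e.g. a search term


@dataclass
class WindowAggregator:
    open_start: Optional[int] = None            # start of the currently open window
    counts: dict = field(default_factory=dict)  # group -> count for the open window
    last_offset: int = -1                       # highest offset already applied

    def process(self, event: Event) -> list:
        """Apply one event and return the records to publish downstream."""
        if event.offset <= self.last_offset:
            return []  # replayed event: skipping it keeps processing idempotent
        self.last_offset = event.offset

        start = event.timestamp_ms - event.timestamp_ms % WINDOW_MS
        out = []
        if self.open_start is None:
            self.open_start = start
        elif start < self.open_start:
            return []  # late data for an already closed window is dropped
        elif start > self.open_start:
            out = self.close_window()
            self.open_start = start
        self.counts[event.group] = self.counts.get(event.group, 0) + 1
        return out

    def close_window(self) -> list:
        """Emit one delta per group, then a tombstone (value None) for the window."""
        records = [((self.open_start, g), n) for g, n in sorted(self.counts.items())]
        records.append(((self.open_start, None), None))
        self.counts = {}
        return records
```

Because only one window is open at a time, memory stays bounded by the number of distinct groups in five minutes of traffic. A real service would also need to flush the last window on a timer, since this sketch only closes a window when a newer event arrives.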

What The Numbers Said After

After the change, our p95 search latency dropped from 1.2 s to 240 ms on identical hardware. The JVM heap stabilized at 14 GB, and GC pauses fell from 800 ms to sub-20 ms. The Flink job is now a near-stateless pass-through mapper, so its RocksDB state is under 10 GB and we run it on c5.large nodes without local SSDs. Our infra bill for the Search Aggregator cluster is $2.1 k/month versus the $4.8 k/month we burned on oversized broker nodes. The biggest surprise was that search result quality improved: duplicate events caused by late-arriving data in the old model vanished, so we now hit our exactly-once SLA without retry storms.
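
For anyone who wants the relative numbers, the arithmetic is short. The helper below is a hypothetical convenience function, and the cost line compares the aggregator cluster bill with the oversized broker spend exactly as the paragraph does, not total infrastructure cost.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return 100 * (before - after) / before


# Figures quoted in the paragraph above.
latency_drop = pct_reduction(1200, 240)  # p95 latency in ms
gc_drop = pct_reduction(800, 20)         # GC pause in ms, taking 20 ms as the upper bound
cost_drop = pct_reduction(4.8, 2.1)      # monthly bill in thousands of dollars

if __name__ == "__main__":
    print(f"p95 latency:  {latency_drop:.0f}% lower")
    print(f"GC pauses:    at least {gc_drop:.1f}% lower")
    print(f"monthly bill: {cost_drop:.0f}% lower")
```

That works out to roughly an 80 % cut in p95 latency and a bit over half off the monthly bill.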

What I Would Do Differently

I should have drawn the service boundary on day one. The Veltrix docs framed the engine as a monolithic search index, but our workload is a classic Lambda architecture: the hot path needs fast lookups, the cold path needs fast windowed aggregations. We wasted six weeks trying to make a single process do both. Next time I'll insist on a clear split between ingestion/aggregation and search, even if the vendor calls it a single engine. The second regret is not measuring end-to-end latency under a synthetic flood early. Our load test only checked total throughput, so we missed the aggregation buffer thrashing until it hit production. From now on, any new system must include a latency SLA test that injects 2× peak events and verifies the 99th percentile doesn't drift.
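
The pass/fail half of that latency SLA test could look like the Python sketch below. It assumes latency samples have already been collected from a baseline run and from a 2× peak flood run; generating the load is out of scope here. The function names and the 10 % drift tolerance are assumptions for illustration, not a number from our SLA.

```python
def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 avoids floating-point rank errors.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def p99_drift(baseline_ms, flood_ms) -> float:
    """Relative change in p99 latency between the baseline and flood runs."""
    base = percentile(baseline_ms, 99)
    return (percentile(flood_ms, 99) - base) / base


def passes_latency_sla(baseline_ms, flood_ms, max_drift: float = 0.10) -> bool:
    """True when p99 under flood stays within max_drift of the baseline p99."""
    return p99_drift(baseline_ms, flood_ms) <= max_drift


if __name__ == "__main__":
    baseline = list(range(1, 101))            # synthetic latencies in ms
    healthy = [x * 1.05 for x in baseline]    # mild slowdown under flood
    thrashing = [x * 5 for x in baseline]     # buffers thrashing under flood
    print("healthy passes:", passes_latency_sla(baseline, healthy))
    print("thrashing passes:", passes_latency_sla(baseline, thrashing))
```

A throughput-only load test would have reported both of those synthetic runs as fine, which is exactly the gap that bit us.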
