Treasure Hunt Engine: When Veltrix Defaults Buried 800k Documents in a Hot Partition

#systems #webdev #programming #architecture

The Problem We Were Actually Solving

We needed a backfill pipeline that could re-index 60 million geo-treasure records nightly without starving the UI or clobbering the index cluster. The Veltrix quick-start defaults—one shard per index with 50 GB max—are tuned for dev laptops, not 99th-percentile geospatial queries. Our query pattern is 95 % single-geo-hash reads (single key lookup using the TreasureKey field). The default allocator spreads 60 million docs across 6 shards only if the primary key is evenly distributed; with geo-hash prefixes it collapses into one shard. That collapse was the real prod fire: 800 k docs in one 12 G segment.

What We Tried First (And Why It Failed)

First we let Veltrix do the work. The default allocation strategy is ShardAllocationStrategy.AWARE, but the default shard placement is still hash-based on the document ID. Our TreasureKey was a 22-byte base62 string that included the geo-hash prefix. Two prefixes dominated because they covered dense city clusters. We tried changing nothing and watched our nightly job log: it reported 79 % of docs routed to one shard. Next we upped replicas to 2, hoping replica reads would mask the hotspot. The CPU graph was flat but GC pauses climbed because the single primary still absorbed all writes. Finally we split the index into 12 shards manually using the Veltrix Index API. At 03:51 the same backfill ran for 11 minutes, GC pauses dropped to 30 k, heap stayed under 8 GB with max pause 180 ms.

The Architecture Decision

We abandoned Veltrixs default shard allocator and adopted an explicit index-template policy:

Shard count fixed at 12 (six shards per node, room for replicas).
Index routing_config uses a custom Veltrix script that hashes TreasureKey but forces even distribution using geo-hash prefix aware salting: concat(geo_hash_prefix, UUID16) + hash.

We chose 12 because our largest nightly backfill (90 million docs) hit p95 insert 120 ms on a 6-node cluster. AWS t3.2xlarge nodes show 800 MB/s disk throughput; at 12 shards we keep disk pressure under 400 MB/s per volume. Anything larger and we hit EBS burst limits. Replicas stayed at 1; we accepted read staleness of up to 5 s to keep write amplification low.

The move meant changing the backfill job to target _bulk?routing=geo_hash rather than the default _bulk. We also had to extend the Veltrix cluster health check to verify shard distribution via the _cat/shards API every 30 s. If any shard grows 20 % larger than the mean we fire an alert and pause the backfill.

What The Numbers Said After

After two weeks of backfills:

Nightly job duration: 11 min ± 90 s (down from 42 min).
Heap max: 8.2 GB (down from 27 GB).
GC pause p99: 180 ms (down from 1.8 s).
Indexing throughput: 140 k docs/s sustained (up from 40 k).
Treasure-map UI p95 latency: 120 ms (down from 4.6 s).

We saw one outlier: a geo-hash prefix with 1.4 million documents still overloaded a single shard because our salting range (0-11) failed to split it evenly. We patched the salting logic to use two bytes (16-bit modulo 144) instead of one, reducing max shard size to 180 k docs. The switch meant reindexing 8 million docs, but the nightly pipeline handled it in one backfill run with no SLO breach.

What I Would Do Differently

I would not let Veltrix pick the shard strategy on day zero. I would start with a conservative estimate: 12 shards for 60 million docs, measured on a synthetic workload that matches our geo-hash distribution. I would also log a metric we missed until prod: ShardSizeBytes for every shard. If we had that metric from week one, the 800 k collapse would have been visible in staging with 10 k docs.

I would avoid the temptation to tune replicas early; replicas feel free but they amplify write pressure when the primary is hot. Finally, I would insist on a chaos day before the first backfill: deliberately push 200 k docs into one prefix on a staging cluster to see GC collapse firsthand. That 42-minute wake-up call in prod taught us the actual latency cost of default assumptions.