Veltrix Configuration Nightmare: The Day We Realized Hytale Search Was Lying to Our Players

#ai #programming #machinelearning #webdev

The Problem We Were Actually Solving

We were tuning Veltrixs dynamic search service to handle 50,000 concurrent Hytale players who expected instant discovery of servers, maps, and mods. Our SLA demanded p99 latency under 120 ms globally, yet the Korean player cohort—about 12 % of traffic—consistently timed out after 300 ms and returned empty payloads. The indexer logs showed no errors; the load balancer showed no 5xx. The problem wasnt availability; it was language drift. Our vector embeddings were trained on English Hytale documentation and ignored non-Latin scripts. Players who typed Hytale in Hangul, Cyrillic, or Arabic got polite but hollow responses that the demo never prepared us for.

What We Tried First (And Why It Failed)

We began with the default Vespa configuration shipped by the Veltrix operator guide. Vespas RankProfile used cosine similarity over a single multilingual encoder called multilingual-e5-large. On paper the cosine score looked fine—0.86 for Hytale queries—until we dug into per-language recall. Korean queries had a 34 % drop in recall versus English. Our first fix was to enable the built-in Vespa language fallback, but that added 40 ms of extra round-trip because the fallback triggered a second query to a secondary index. The p99 latency jumped from 110 ms to 155 ms, breaching the SLA for everyone.

We then tried splitting the corpus by script using Vespas document attributes: one field for Latin, one for CJK, one for Cyrillic. The recall for Korean queries jumped to 91 %, but the ingestion pipeline suddenly ballooned from 2.1 GB to 4.3 GB per build because we were duplicating every document into three script-specific buckets. Our CI runners on GitHub Actions started timing out after 20 minutes, and the nightly build that once took 8 minutes now took 32 minutes. The operational cost tripled overnight, and the infra team sent me a single emoji: 🔥.

The Architecture Decision

We settled on a hybrid encoder pipeline. We kept the original multilingual-e5-large for English and low-resource languages, but for Korean, Japanese, and Chinese we switched to language-specific encoders: klue-roberta-large for Korean, bert-base-japanese-v3 for Japanese, and bert-base-chinese for Simplified Chinese. To avoid the ingestion bloat, we stored the embeddings in a separate columnar store (Apache Doris) and served them through a lightweight gRPC shim inside the Vespa searcher chain. The gRPC shim added only 8 ms of latency on cold caches and kept the hot Vespa index at 2.4 GB, down from 4.3 GB.

The trickiest part was cache invalidation. We decided to shard by language code at the Vespa node level, so a Korean query always lands on a node that owns the Korean encoder. We used consistent hashing with murmur3 to avoid hotspots. If a languages traffic spiked, we could reroute by changing one line in the cluster config without rebuilding the corpus.

What The Numbers Said After

Traffic split by language after the change:

English p99 latency: 102 ms (down from 110 ms)
Korean p99 latency: 115 ms (up 5 ms but still under 120 ms)
Recall for Korean queries: 94 % (up from 66 %)
Ingestion build time: 10 minutes (down from 32 minutes)
Monthly infra cost: $1,800 (down from $5,200)

The most telling metric was the empty-response rate. Before the fix it was 2.8 % globally, driven entirely by non-English queries. After the fix the empty-response rate dropped to 0.1 % and stayed flat across languages. The demo slides still show multilingual-e5-large as the hero encoder, but in production we quietly route 60 % of traffic through language-specific models. The demo never mentions the shim, the hash ring, or the night we fought the ingestion pipeline. It just shows a glowing Hytale server list.

What I Would Do Differently

I would have measured language-specific recall from day one instead of trusting the demos aggregate numbers. We instrumented recall too late, only after the Korean players started complaining on Discord. Second, I would have capped the multilingual encoders vector dimension sooner. The original Vespa schema produced 1,024-dim vectors; reducing to 768 dim and switching to float16 cut memory by 30 % without measurable recall loss.

Finally, I would have budgeted for the gRPC shim earlier. Our initial plan treated the shim as a temporary workaround; by week four it was the most reliable component in the stack. The lesson is that demos optimize for theatrical multilingual performance, but production survives on boring, language-specific pragmatism.