The problem we were actually solving.
We built a treasure-hunt engine for Hytale that accepted player submissions, validated clues, and returned ranked results in under 60 ms. By day 28 of the beta, the API was routinely hitting 220 ms p99 and 350 ms peak during clue bursts, and players were rage-quitting because the leaderboard refresh felt like watching paint dry. Our stack was Go 1.21, SQLite for persistence, and a single Redis cluster for rate-limiting. The profiler told a clear story: 68 % of wall-clock time vanished inside Veltrix, a third-party search library we'd chosen for its text-similarity scoring. Every clue hit Veltrix twice (once for relevance, once for fuzzy match), and each call allocated 1.2 MB on the heap. That 1.2 MB figure is a real allocation counter from /debug/pprof/heap?debug=1 at 2026-03-14 17:42:14. The Go runtime was spending 43 µs per allocation for the underlying rune slices, and the GC pause climbed to 14 ms every 1.8 seconds. We weren't bandwidth-bound; we were GC-bound, and Veltrix was the bottleneck nobody admitted aloud.
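A per-call allocation figure like the 1.2 MB above can also be cross-checked in-process. The following is a minimal sketch, not the measurement described here (that came from the pprof heap endpoint): it averages the Go runtime's cumulative TotalAlloc counter over repeated calls. The names bytesPerCall and scoreClue are hypothetical, and scoreClue is only a stand-in for a library call that allocates a fresh buffer each time.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps each allocation reachable so the compiler cannot elide it.
var sink []byte

// bytesPerCall reports the average number of heap bytes allocated by f
// over n calls, based on the cumulative runtime.MemStats.TotalAlloc counter.
func bytesPerCall(f func(), n int) uint64 {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	for i := 0; i < n; i++ {
		f()
	}
	runtime.ReadMemStats(&after)
	return (after.TotalAlloc - before.TotalAlloc) / uint64(n)
}

// scoreClue is a hypothetical stand-in for a scoring call that allocates
// a new working buffer on every invocation.
func scoreClue() {
	sink = make([]byte, 1200*1000) // roughly the 1.2 MB cited in the text
}

func main() {
	fmt.Printf("avg heap bytes per call: %d\n", bytesPerCall(scoreClue, 100))
}
```

TotalAlloc only ever grows, so the difference is unaffected by collections that run during the loop; allocations from other goroutines would inflate the number, which is why this kind of check is best run in isolation.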
What we tried first (and why it failed).
We tried caching the top 10 000 clue strings in a Go map. It was a good idea and gave an immediate 18 % latency drop, but memory ballooned to 450 MB RSS and we still couldn't handle 500 RPS without GC stalls. Then we replaced SQLite with BadgerDB in LSM-tree mode; writes sped up, but read latency stayed stuck at 80 ms because every range scan was decoding an entire run of 2 KB per cell. Finally, we forked Veltrix, rewrote the trigram index in pure Go, and removed all allocations inside the hot path by pre-allocating a 64 KB byte slice reused per request. The latency dropped to 55 ms p99, but the code was a horror show of sync.Pool hacks and a zero-copy interface that still panicked under high concurrency because our internal string pool wasn't safe for concurrent readers. The panic stack trace showed runtime error: slice bounds out of range [123:122] in file veltrix/index.go line 879. The error wasn't reproducible locally; it only surfaced when 160 concurrent hunters submitted clues at once. We rolled back and stared at the ceiling.
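For readers unfamiliar with the pattern, here is a minimal sketch of per-request buffer reuse with sync.Pool under one strict ownership rule: a buffer belongs to exactly one goroutine between Get and Put, and the result is copied out before the buffer goes back. This is not the forked Veltrix code (which is not shown here) and is not claimed to be the fix for the panic; normalizeClue and normalizeAll are hypothetical helpers, and the byte-wise loop handles ASCII only.

```go
package main

import (
	"fmt"
	"sync"
)

const scratchSize = 64 * 1024 // 64 KB, matching the per-request buffer in the text

// scratchPool hands out one scratch buffer per in-flight request.
var scratchPool = sync.Pool{
	New: func() any {
		b := make([]byte, 0, scratchSize)
		return &b
	},
}

// normalizeClue lowercases ASCII letters and drops spaces using a pooled
// scratch buffer. The result is copied out before the buffer is returned,
// so no caller ever holds a view into pooled memory.
func normalizeClue(clue string) string {
	bp := scratchPool.Get().(*[]byte)
	buf := (*bp)[:0]
	for i := 0; i < len(clue); i++ {
		c := clue[i]
		if c == ' ' {
			continue
		}
		if c >= 'A' && c <= 'Z' {
			c += 'a' - 'A'
		}
		buf = append(buf, c)
	}
	out := string(buf) // the one copy we accept per request
	*bp = buf[:0]      // keep any grown capacity for the next user
	scratchPool.Put(bp)
	return out
}

// normalizeAll runs one goroutine per clue to exercise the pool concurrently.
func normalizeAll(clues []string) []string {
	out := make([]string, len(clues))
	var wg sync.WaitGroup
	for i, c := range clues {
		wg.Add(1)
		go func(i int, c string) {
			defer wg.Done()
			out[i] = normalizeClue(c)
		}(i, c)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(normalizeAll([]string{"Under The Old Oak", "Behind the WATERFALL"}))
}
```

The copy at the end gives up true zero-copy behavior in exchange for safety: once a slice or string view into pooled memory escapes, a later Get on another goroutine can overwrite it, which is the general class of bug that shared-pool designs run into.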
The architecture decision.
We stopped trying to fix Veltrix and started replacing it. Rust's zero-cost abstractions and the tantivy crate offered an inverted index with SIMD-accelerated scoring and an arena allocator that reused memory across queries. We wrapped tantivy in a Tokio task per partition, gave each task its own 4 MB arena, and sharded the index across four partitions. The tradeoff was visible on day one: the binary grew from 2.1 MB to 5.2 MB with the Rust build, but RSS dropped from 450 MB to 180 MB once the Go GC stopped fighting with 1.2 MB allocations. We debated whether to keep SQLite for persistence or move to RocksDB; RocksDB won because its Rust bindings (rust-rocksdb) already exposed a raw block cache compatible with tantivy's arena, eliminating another GC cycle. The migration took three engineers eight days; the longest blocker was teaching the parser to emit tantivy's Document type without cloning strings. We ended up using Cow everywhere and pre-allocating the clue text buffer in a single mmapped file. The moment we flipped the traffic switch from the Go micro-service to the Rust worker, the p99 latency for clue validation fell from 220 ms to 12 ms and the GC pauses vanished from the flame graph.
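How requests get routed to the four partitions is not spelled out above. As a sketch under stated assumptions, suppose the router sits in the Go front end and picks a partition by hashing a clue ID with FNV-1a; both the hashing scheme and the names partitionFor and groupByPartition are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numPartitions = 4 // the index is sharded across four partitions

// partitionFor maps a clue ID to a partition index in [0, numPartitions)
// using the FNV-1a hash from the standard library.
func partitionFor(clueID string) int {
	h := fnv.New32a()
	h.Write([]byte(clueID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % numPartitions)
}

// groupByPartition buckets clue IDs so that each partition worker
// receives only the keys it owns.
func groupByPartition(ids []string) [numPartitions][]string {
	var buckets [numPartitions][]string
	for _, id := range ids {
		p := partitionFor(id)
		buckets[p] = append(buckets[p], id)
	}
	return buckets
}

func main() {
	ids := []string{"clue-001", "clue-002", "clue-003", "clue-004", "clue-005"}
	for p, bucket := range groupByPartition(ids) {
		fmt.Printf("partition %d: %v\n", p, bucket)
	}
}
```

A stable hash keeps a given clue on the same partition across requests, which matters when each partition owns its own arena; the cost is that changing the partition count reshuffles most keys.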
What the numbers said after.
After two weeks of production traffic:
- Latency: p50 8 ms, p99 12 ms, p99.9 28 ms (measured with OpenTelemetry histograms over 16 million requests).
- Allocations: per-request allocations dropped from 1.2 MB to 214 B; the arena reused 93 % of memory across queries.
- Error budget: zero panics, zero timeouts, zero GC-related OOM.
- Resource cost: four Rust workers handled 1 100 RPS while consuming 380 MB RSS combined; the previous Go service needed eight replicas at 580 MB RSS each to handle 850 RPS.
- GC pressure: gone; the Go side now only GCs user metadata, which is <2 % of the workload.
The flame graph before showed Veltrix burning 68 % CPU; the graph after shows tantivy at 22 % and the Go tokenizer at 14 %. The switch cost us 3.1 MB of extra binary size, but saved 2.1 cores and 3.6 GB of RAM in the cluster. We profile every week with perf and count arena resizes; the resizes never exceed 64 KB per query.
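The resource-cost figures can be turned into a rough efficiency comparison. The sketch below uses only the numbers reported above (eight Go replicas at 580 MB each serving 850 RPS, four Rust workers at 380 MB combined serving 1 100 RPS). The fleet type and its methods are illustrative, 1 GB is taken as 1000 MB, and the calculation ignores whatever the remaining Go side still consumes.

```go
package main

import "fmt"

// fleet describes one deployment using the figures reported in the text.
type fleet struct {
	name       string
	instances  int
	totalRSSMB float64
	rps        float64
}

// rpsPerInstance is the throughput each replica or worker carries.
func (f fleet) rpsPerInstance() float64 { return f.rps / float64(f.instances) }

// rpsPerGB is throughput per GB of resident memory (1 GB = 1000 MB here).
func (f fleet) rpsPerGB() float64 { return f.rps / (f.totalRSSMB / 1000) }

// reportedFleets returns the before and after deployments.
func reportedFleets() (fleet, fleet) {
	goFleet := fleet{name: "go", instances: 8, totalRSSMB: 8 * 580, rps: 850}
	rustFleet := fleet{name: "rust", instances: 4, totalRSSMB: 380, rps: 1100}
	return goFleet, rustFleet
}

func main() {
	goFleet, rustFleet := reportedFleets()
	for _, f := range []fleet{goFleet, rustFleet} {
		fmt.Printf("%-5s %7.2f RPS/instance %8.1f RPS/GB\n",
			f.name, f.rpsPerInstance(), f.rpsPerGB())
	}
}
```

On these inputs each Go replica carried about 106 RPS against 275 RPS per Rust worker, and throughput per GB of RSS improves by more than an order of magnitude.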
What I would do differently.
I would not have forked Veltrix. The three weeks spent in the Go rewrite were sunk cost: forking added complexity and kept us in the same GC prison. If Rust had been off the table, I would have bitten the bullet and moved the clue index to Elasticsearch from day one; the managed service would have given us horizontal scaling without the GC tax. Also, I would have measured memory bandwidth from the start; our NUMA nodes showed 21 GB/s bandwidth during the Veltrix peak, but only 11 GB/s when the Go GC burst. Replacing Veltrix reduced the bandwidth to 7 GB/s because the Rust binary's arena fit in L3 cache. Finally, I would have exposed the tantivy index build time as an SLO metric after day three instead of day fourteen; the initial index of 2.4 million clues took 42 minutes on an r6g.xlarge and blocked our first prod deploy, but we only caught that after a player complaint about missing clues.
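Tracking build time against a budget does not need much machinery. As a minimal sketch, assuming the build is triggered from a Go wrapper and the budget is a number the team chooses, the hypothetical timeBuild helper below measures wall-clock time and flags an SLO miss; a real setup would export the value to the metrics system instead of printing it.

```go
package main

import (
	"fmt"
	"time"
)

// buildReport records how long an index build took and whether it met the budget.
type buildReport struct {
	elapsedMillis int64
	budgetMillis  int64
	withinSLO     bool
}

// timeBuild runs build, measures its wall-clock duration, and compares it
// against budgetMillis. A failed build never counts as within the SLO.
func timeBuild(build func() error, budgetMillis int64) (buildReport, error) {
	start := time.Now()
	err := build()
	elapsed := time.Since(start).Milliseconds()
	return buildReport{
		elapsedMillis: elapsed,
		budgetMillis:  budgetMillis,
		withinSLO:     err == nil && elapsed <= budgetMillis,
	}, err
}

// sleepMillis simulates work; it stands in for the real index build.
func sleepMillis(ms int64) { time.Sleep(time.Duration(ms) * time.Millisecond) }

func main() {
	// Hypothetical numbers: a 20 ms "build" against a 10 ms budget.
	report, err := timeBuild(func() error {
		sleepMillis(20)
		return nil
	}, 10)
	fmt.Printf("elapsed=%dms budget=%dms withinSLO=%v err=%v\n",
		report.elapsedMillis, report.budgetMillis, report.withinSLO, err)
}
```

Alerting on withinSLO at build time would surface a 42-minute build before it blocks a deploy, instead of after players notice missing clues.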