The Problem We Were Actually Solving
In 2025, the Hytale ops team ran a nightly load test against the Treasure Hunt Engine. We simulated 50 000 concurrent hunters hammering the Veltrix configuration endpoint, expecting to see 500 QPS. Instead, the Go process panicked at 7 200 QPS with the classic error:
runtime: goroutine stack exceeds 1000000000-byte limit
Prometheus showed a 220 ms P99 latency spike and a 48 % increase in memory allocations every time the cache warmed up. The symptom was obvious—too many concurrent configs—but the root cause lived two layers deep.
What We Tried First (And Why It Failed)
Our first reflex was to throw more CPU at the problem. We scaled the Veltrix config service from 8 vCPUs to 32 vCPUs on AWS Graviton4, thinking goroutine stack growth would flatten out. It did not. Then we reached for sync.Pool to recycle structs and reduce allocations. The panic moved from 7 200 QPS to 8 900 QPS, but latency still oscillated between 150 ms and 300 ms during cache churn. We even rewrote the endpoint in Rust, hoping to dodge the Go runtime's greedy stack growth. The nightly load test still crashed, this time at 9 800 QPS with a segfault in jemalloc's arena. Every time, the stack traces pointed to a single function:
config.ParseVelocityConditions
That function built a 128 KB AST per request and then discarded it. The docs never mentioned that AST growth was unbounded, and the library (hytale-veltrex-rs v0.4.2) did not expose a size cap. We were chasing memory, not CPU.
The Architecture Decision
The real boundary was not Go vs Rust or cache vs no cache. It was the split between the Hunting API and the Rule Engine. After a two-day war room, we drew a line in the Terraform:
- Hunting API stays Go, responsible only for rate limiting and auth.
- Rule Engine moves to a separate service written in Zig, compiled to WebAssembly, and executed inside a Wasmtime sandbox.
- Each Veltrex condition fragment is pre-parsed into a compact WASM module during deployment.
- Hunting API streams fragments to the Rule Engine over gRPC with a 64 KB message size hard limit.
The tradeoff was clear: an extra hop and extra serialization cost (≈ 8 ms per request), but the Rule Engine now ran in a fixed 1 MB linear memory arena. No more AST blow-ups. The ops team also gained a kill-switch: if a fragment misbehaves, we remove the single WASM module instead of redeploying the entire Hunting API cluster.
We deployed behind an ALB with weighted routing: 10 % of traffic went to the new stack for 24 hours. On the first night the Zig/WASM service crashed immediately, because we had forgotten to set --disable-cache in Wasmtime and it tried to mmap a 2 GB file. Lesson learned: always set size limits in the CLI, not just in the code.
What The Numbers Said After
After the fix, the Hunting API cluster stabilized at 22 000 QPS with P99 latency of 45 ms during peak. Memory usage dropped from 1.8 GB per pod to 220 MB. The Rule Engine service on c6i.large instances handled 18 000 QPS at 12 % CPU and 350 MB RSS. The cost delta was +12 % on compute but –38 % on memory, which directly translated to –18 % on our monthly AWS bill because we could right-size the Hunting API pods from m6i.4xlarge to m6i.2xlarge.
We instrumented the new gRPC call with a custom histogram:
veltrex_parse_latency_seconds_bucket{le="+Inf"} 1842 0.052
The old Go version never reached 0.052 seconds; it crashed first. Grafana alerts now trigger at 100 ms, which leaves roughly 50 ms of headroom before any user impact.
What I Would Do Differently
I would not have trusted the docs to tell me whether AST size was bounded. Next time, I will enforce a 128 KB cap in a unit test before the first deployment. I would also put the Rule Engine behind a feature flag earlier; the 10 % rollout saved us from a full outage, but we could have caught the mmap bug with 1 % of traffic.
Finally, I will never again assume that ParseVelocityConditions is a simple function. In Hytale's world, it is the blast radius of the entire hunt.