Why Hytale Treasure Hunt Engines Stumble Before 1,000 Concurrent Diggers: What Veltrix Does Not Document

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We needed a treasure-hunt engine that could absorb 1,000 to 1,500 concurrent diggers without melting the JVM or the event loop. The naïve design, one thread per cell, blew up at 400 concurrent diggers because each Java platform thread reserves about 1 MB of stack by default and our k8s pods had a 4 GB memory ceiling. We measured wall-clock latency at 2.4 s per /dig under synthetic load, but in production it spiked to 12 s the moment GC kicked in.
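A rough memory budget shows why thread-per-cell runs out of room so early. The sketch below is back-of-the-envelope arithmetic, not a measurement: the 1 MB stack and 4 GB ceiling come from the text, while the 2 GB set aside for heap and other JVM memory is an illustrative assumption, and the function name is made up for this example.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

def max_platform_threads(pod_limit_bytes: int, reserved_bytes: int, stack_bytes: int) -> int:
    """Upper bound on platform threads whose stacks fit beside the reserved memory."""
    available = pod_limit_bytes - reserved_bytes
    return max(available // stack_bytes, 0)

# 4 GB pod ceiling and 1 MB stacks are from the text.
# The 2 GB reservation for heap and metaspace is an assumption.
budget = max_platform_threads(4 * GIB, 2 * GIB, 1 * MIB)
print(budget)        # 2048

# Under that assumption, exhausting the budget at 400 diggers
# corresponds to roughly five live threads per digger.
print(budget / 400)  # 5.12
```

Stack memory is reserved up front but committed lazily, so this is a worst-case bound rather than a prediction of the exact crash point.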

The team's first reaction was to throw money at it: we doubled the memory limit and reduced the thread stack size to 256 KB. GC pause times improved from 1.8 s to 0.9 s, but the OOMs merely shifted to java.lang.OutOfMemoryError: unable to create new native thread. We were still creating one thread per cell per digger.

What We Tried First (And Why It Failed)

We swapped the thread-per-cell executor for a ForkJoinPool with a fixed parallelism of 32. The JVM stopped crashing, but the treasure spawn rules started breaking. The pool would sometimes starve a cell for close to ten seconds, which froze the weekly leaderboard for 8.6 s. Players noticed; Twitch clips happened.

Next, we tried Reactor's Schedulers.boundedElastic() with a 64-thread virtual-thread pool. Virtual threads dropped the per-digger cost from roughly 1 MB of stack to about 2 KB, so the OOM moved from thread creation to the backing carrier-thread limit in Netty. We hit the Netty native epoll event loop ceiling at 1,024 concurrent connections; our pod limits were too low. Rescaling the pods to 8 vCPU / 8 GB RAM only postponed the problem: the event loop still saturated at 1,200 diggers because the treasure-hunt cell broadcast still used a synchronous gossip channel.

The Architecture Decision

We abandoned Veltrix's actor model entirely and replaced it with a two-layer spatial hash:

Layer 0: 4,096 m² cells stored in a Redis Cluster (3 shards, 2 replicas each) with a 10 ms TTL write-behind cache.
Layer 1: Each cell publishes dig events to a Kafka topic partitioned by cell hash mod 128. A Go worker pool (200 goroutines) consumes the topic and updates a Postgres table with a BRIN index on (cell_id, timestamp).
The HTTP tier (Netty, virtual threads) reads the Redis cache for the cell's current treasure state and only writes to the write-behind when the treasure is claimed or expired.
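The cell lookup and partition routing described above can be sketched in a few lines. This is a minimal illustration, not our production code: it assumes square cells (64 m × 64 m gives the 4,096 m² from the text), a hypothetical `cell:col:row` key format, and CRC32 as a stand-in hash. A Kafka producer's default partitioner would hash the key bytes with murmur2 instead, so the partition numbers here will not match a real cluster.

```python
import zlib

CELL_SIDE_M = 64        # assumes square cells: 64 m x 64 m = 4,096 m²
NUM_PARTITIONS = 128    # from the text: cell hash mod 128

def cell_id(x: float, z: float) -> str:
    """Map world coordinates (in metres) to the key of the cell containing them."""
    col = int(x // CELL_SIDE_M)   # floor division keeps negative coordinates consistent
    row = int(z // CELL_SIDE_M)
    return f"cell:{col}:{row}"

def partition_for(cell: str) -> int:
    """Stable hash of the cell key, reduced to a partition number."""
    return zlib.crc32(cell.encode("utf-8")) % NUM_PARTITIONS

print(cell_id(130.5, 63.9))   # cell:2:0
print(cell_id(-1.0, 0.0))     # cell:-1:0

# Every dig in the same cell lands on the same partition, so one consumer
# sees that cell's events in order.
same = partition_for(cell_id(130.5, 63.9)) == partition_for(cell_id(190.0, 10.0))
print(same)                   # True
```

Keying by cell rather than by player is what keeps ordering per cell: two diggers racing for the same treasure are serialized by the partition, not by a lock.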

The spatial hash reduced the per-digger thread count to one virtual thread per HTTP request, plus one Go worker per Kafka partition. We measured a 99th-percentile latency of 18 ms on the /dig path under 1,500 diggers. GC pauses dropped below 50 ms.

Trade-off: We accepted eventual consistency for treasure visibility. A player's claim message might take 80 ms to replicate across regions, but we never lost a treasure and the weekly event ran at 1,650 concurrent diggers with no restarts.
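The read-mostly path and the claim write can be modeled with a small in-memory stand-in. Everything here is hypothetical and for illustration only: the `CellCache` class, its method names, and the state strings are not from our codebase, a dict replaces Redis, and a list replaces the durable sink. The point is the shape of the flow: reads never enqueue anything, and only the first claim or expiry of a cell produces a write.

```python
from collections import deque

class CellCache:
    """In-memory stand-in for the cell cache (the real system uses Redis)."""

    def __init__(self):
        self._state = {}        # cell key -> "available" | "claimed" | "expired"
        self.pending = deque()  # write-behind queue, drained by flush()

    def spawn(self, cell):
        self._state[cell] = "available"

    def read(self, cell):
        """Hot path: a plain lookup that never enqueues a write."""
        return self._state.get(cell)

    def _transition(self, cell, new_state, actor):
        if self._state.get(cell) != "available":
            return False        # already claimed or expired: nothing to write
        self._state[cell] = new_state
        self.pending.append((cell, new_state, actor))
        return True

    def claim(self, cell, player):
        return self._transition(cell, "claimed", player)

    def expire(self, cell):
        return self._transition(cell, "expired", None)

    def flush(self, sink):
        """Drain queued writes into the durable sink; returns how many were written."""
        written = 0
        while self.pending:
            sink.append(self.pending.popleft())
            written += 1
        return written

cache = CellCache()
cache.spawn("cell:2:0")
cache.read("cell:2:0")                     # lookup only, nothing queued
first = cache.claim("cell:2:0", "alice")   # first claim wins
second = cache.claim("cell:2:0", "bob")    # rejected, already claimed
durable = []
cache.flush(durable)
print(first, second, durable)  # True False [('cell:2:0', 'claimed', 'alice')]
```

This sketch is single-threaded. Against a real Redis the check-and-set in `_transition` would have to be atomic, for example inside a Lua script or a transaction, or two concurrent claims could both succeed.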

What The Numbers Said After

After the change we ran a synthetic ramp from 100 to 2,000 diggers in 120 s. The Redis cluster hit 85 % memory on a single shard at 1,600 diggers, so we resharded to 6 shards with 3 replicas. Latency stayed under 20 ms p99. The Go worker pool CPU never exceeded 35 % and the PostgreSQL BRIN index kept writes under 200 tps.
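For anyone reproducing the ramp, the target load at any second is simple to compute. The sketch below assumes a linear ramp, which is an assumption for illustration (the ramp could equally be stepped); the function names are made up here.

```python
def diggers_at(t_s: float, start: int = 100, end: int = 2000, duration_s: float = 120) -> int:
    """Linear ramp: target number of concurrent diggers t_s seconds into the test."""
    t = min(max(t_s, 0.0), duration_s)   # clamp to the ramp window
    return round(start + (end - start) * t / duration_s)

def time_to_reach(target: int, start: int = 100, end: int = 2000, duration_s: float = 120) -> float:
    """Seconds into a linear ramp at which the target load is reached."""
    return (target - start) * duration_s / (end - start)

print(diggers_at(0))                    # 100
print(diggers_at(60))                   # 1050
print(diggers_at(120))                  # 2000
print(round(time_to_reach(1600), 1))    # 94.7
```

With a linear ramp, the 1,600-digger mark where the shard filled up arrives about 95 s in, which leaves only the last 25 s of the run to observe behavior above that load.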

The OOM rate fell from 3.2 crashes per hour to zero. Player reports of missing treasures dropped from 1.8 % to 0.04 %. The event servers' billable vCPU hours increased by 12 %, but the infra cost per concurrent player fell from $0.024 to $0.008 because we stopped over-provisioning pods to handle thread storms.
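The cost figures are easier to compare as relative changes. The arithmetic below uses only the numbers quoted above; the implied capacity multiplier additionally assumes the bill scales linearly with vCPU hours, which is a simplification rather than something we measured.

```python
def relative_change(before: float, after: float) -> float:
    """Signed fractional change from before to after."""
    return (after - before) / before

cost_change = relative_change(0.024, 0.008)
print(f"{cost_change:.1%}")   # -66.7%

# Assumption: total cost is proportional to billable vCPU hours.
# If the bill grew 12 % while cost per player fell to a third,
# the same fleet is serving about 3.4x as many concurrent players.
implied_capacity = 1.12 / (0.008 / 0.024)
print(round(implied_capacity, 2))   # 3.36
```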

What I Would Do Differently

I would not trust Veltrix's configuration layer again. Its actor model is a leaky abstraction: not every treasure cell needs its own thread, and the docs do not mention the hidden thread-per-cell tax.

Version 2 of our engine will push the spatial hash down into the Kafka Streams topology so we can collapse the Go worker pool and the Postgres writes into a single streaming step. We'll use Redis Streams as the outbox, eliminating the write-behind entirely. We expect that to cut the infra cost per player by another 30 % and reduce the leaderboard lag to under 50 ms.

If you are running a Hytale treasure hunt at scale, forget the actor model and build a spatial hash instead, documentation be damned.