The Problem We Were Actually Solving
We needed a treasure-hunt engine that could absorb 1,000 to 1,500 concurrent diggers without melting the JVM or the event loop. The naïve design, one thread per cell, blew up at 400 concurrent diggers because each Java thread reserves a 1 MB stack by default and our k8s pods had a 4 GB memory ceiling. We measured wall-clock latency at 2.4 s per /dig under synthetic load, but in production it spiked to 12 s the moment GC kicked in.
The team's first reaction was to throw money at it: we doubled the memory limit and cut the thread stack size to 256 KB. GC pause times improved from 1.8 s to 0.9 s, but the OOMs merely shifted to java.lang.OutOfMemoryError: unable to create new native thread. We were still creating one thread per cell per digger.
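To make the stack arithmetic concrete, here is a rough back-of-envelope sketch. It is written in Python purely as neutral notation (the engine itself is JVM code), and the figure of 10 cells per digger is a made-up assumption for illustration; the post does not give the real number. It also treats every reserved stack as fully committed and ignores heap, so read it as a pessimistic bound rather than a measurement.

```python
MIB = 1024 * 1024
KIB = 1024


def max_threads(stack_budget_bytes, stack_size_bytes):
    """Upper bound on how many thread stacks fit in a memory budget."""
    return stack_budget_bytes // stack_size_bytes


def stack_bytes(diggers, cells_per_digger, stack_size_bytes):
    """Stack memory for one thread per cell per digger."""
    return diggers * cells_per_digger * stack_size_bytes


POD_LIMIT = 4 * 1024 * MIB        # 4 GB ceiling from the post
CELLS_PER_DIGGER = 10             # hypothetical, not from the post

# With 1 MB stacks, the pod can hold at most 4,096 stacks and nothing else.
print(max_threads(POD_LIMIT, 1 * MIB))                        # 4096

# 400 diggers at the assumed fan-out already need 4,000 MiB of stack.
print(stack_bytes(400, CELLS_PER_DIGGER, 1 * MIB) // MIB)     # 4000

# Doubling memory and using 256 KB stacks raises the ceiling a lot...
print(max_threads(2 * POD_LIMIT, 256 * KIB))                  # 32768
```

Under these assumptions the memory fix looks like it should have been enough, which fits what we saw: "unable to create new native thread" usually means the OS or container refused to create another thread, not that the stacks ran out of room. The thread count itself was the problem.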
What We Tried First (And Why It Failed)
We swapped the thread-per-cell executor for a ForkJoinPool with a fixed parallelism of 32. The JVM stopped crashing, but the treasure spawn rules started breaking. The pool would sometimes starve a cell for ten seconds, causing the weekly leaderboard to freeze for exactly 8.6 s. Players noticed; Twitch clips happened.
Next, we tried Reactor's Schedulers.boundedElastic() with a 64-thread virtual thread pool. Virtual threads dropped the per-digger cost from 1 MB to ~2 KB of stack, so the OOM moved from thread creation to the backing-carrier-thread limit in Netty. We hit the Netty native epoll event loop ceiling at 1,024 concurrent connections; our pod limits were too low. Re-scaling the pods to 8 vCPU / 8 GB RAM only postponed the problem: the event loop still saturated at 1,200 diggers because the treasure-hunt cell broadcast still used a synchronous gossip channel.
The Architecture Decision
We abandoned Veltrix's actor model entirely and replaced it with a two-layer spatial hash:
- Layer 0: 4,096 m² cells stored in a Redis Cluster (3 shards, 2 replicas each) with a 10 ms TTL write-behind cache.
- Layer 1: Each cell publishes dig events to a Kafka topic partitioned by cell hash mod 128. A Go worker pool (200 goroutines) consumes the topic and updates a Postgres table with a BRIN index on (cell_id, timestamp).
- The HTTP tier (Netty, virtual threads) reads the Redis cache for the cell's current treasure state and only writes to the write-behind cache when the treasure is claimed or expired.
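A minimal sketch of the cell lookup and partition routing described in the list above, in Python as neutral notation. Several things here are assumptions, not our production code: the cells are taken to be 64 m × 64 m squares (which gives 4,096 m²), the `cell:{x}:{y}` key format is invented, and CRC32 stands in for whatever hash the real partitioner uses.

```python
import math
import zlib

CELL_SIDE_M = 64        # assumption: square cells, 64 m x 64 m = 4,096 m²
NUM_PARTITIONS = 128    # from the post: cell hash mod 128


def cell_of(x_m, y_m):
    """Map a world position in metres to integer cell coordinates."""
    return (math.floor(x_m / CELL_SIDE_M), math.floor(y_m / CELL_SIDE_M))


def cell_key(cell):
    """Hypothetical key format for a cell; not the real schema."""
    cx, cy = cell
    return f"cell:{cx}:{cy}"


def partition_for(cell):
    """Stable partition index for a cell.

    zlib.crc32 is used instead of Python's built-in hash() because
    hash() of a str is salted per process and would route the same
    cell to different partitions after a restart.
    """
    return zlib.crc32(cell_key(cell).encode("utf-8")) % NUM_PARTITIONS


# Two diggers standing in the same cell always land on the same partition,
# so their dig events stay ordered relative to each other.
a = cell_of(130.0, 70.5)
b = cell_of(191.9, 127.9)
print(a == b, partition_for(a) == partition_for(b))   # True True
```

The point of hashing the cell rather than the digger is that ordering only matters within a cell. Events for unrelated cells can be consumed in parallel without any coordination.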
The spatial hash reduced the per-digger thread count to one virtual thread per HTTP request, plus one Go worker per Kafka partition. We measured 18 µs per /dig path in the 99th percentile under 1,500 diggers. GC pauses dropped to sub-50 ms.
Trade-off: We accepted eventual consistency for treasure visibility. A player's claim message might take 80 ms to replicate across regions, but we never lost a treasure and the weekly event ran at 1,650 concurrent diggers with no restarts.
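The shape of that trade-off can be sketched with a toy write-behind cache. This is an in-memory illustration only: the dictionaries stand in for Redis and the downstream store, the class and method names are invented, and the first-claim-wins rule is my assumption about how a claim could be made safe, not a description of our actual claim protocol.

```python
from collections import deque


class WriteBehindCellCache:
    """Toy model: reads hit the cache, claims are queued for later flush."""

    def __init__(self):
        self.cache = {}          # cell key -> state seen by the HTTP tier
        self.pending = deque()   # writes accepted but not yet flushed
        self.durable = {}        # state seen by downstream readers

    def spawn(self, key):
        """Seed a treasure in both views (simplification for the demo)."""
        self.cache[key] = "available"
        self.durable[key] = "available"

    def read(self, key):
        return self.cache.get(key, "empty")

    def claim(self, key, player):
        """First claim wins; later claims for the same cell are rejected."""
        if self.cache.get(key) != "available":
            return False
        state = f"claimed:{player}"
        self.cache[key] = state
        self.pending.append((key, state))
        return True

    def flush(self):
        """Drain queued writes to the durable view; returns how many."""
        flushed = 0
        while self.pending:
            key, state = self.pending.popleft()
            self.durable[key] = state
            flushed += 1
        return flushed


c = WriteBehindCellCache()
c.spawn("cell:2:1")
c.claim("cell:2:1", "alice")
print(c.read("cell:2:1"), c.durable["cell:2:1"])   # claimed:alice available
c.flush()
print(c.durable["cell:2:1"])                       # claimed:alice
```

Between the claim and the flush, the two views disagree. That window is the eventual consistency we signed up for: a reader of the lagging view briefly sees a treasure that is already gone, but the claim itself is never dropped because it sits in the queue until it is flushed.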
What The Numbers Said After
After the change we ran a synthetic ramp from 100 to 2,000 diggers in 120 s. The Redis cluster hit 85 % memory on a single shard at 1,600 diggers, so we resharded to 6 shards with 3 replicas. Latency stayed under 20 ms p99. The Go worker pool CPU never exceeded 35 % and the PostgreSQL BRIN index kept writes under 200 tps.
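For anyone wondering why one shard can fill up while the others idle, here is a small sketch of how Redis Cluster assigns keys to hash slots: CRC16 (the XMODEM variant) of the key modulo 16,384, hashing only the `{...}` hash tag when one is present. Python's `binascii.crc_hqx` with an initial value of 0 computes that CRC. The `shard_for` helper assumes slots are split into equal contiguous ranges, which is a simplification, and the key names are hypothetical.

```python
import binascii
from collections import Counter

NUM_SLOTS = 16384


def hash_slot(key):
    """Redis Cluster hash slot for a key, honouring {hash tags}."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % NUM_SLOTS


def shard_for(slot, num_shards):
    """Assumption: slots are divided into equal contiguous ranges."""
    return slot * num_shards // NUM_SLOTS


def keys_per_shard(keys, num_shards):
    """Count how many of the given keys land on each shard."""
    return Counter(shard_for(hash_slot(k), num_shards) for k in keys)


# Hypothetical cell keys on a 64 x 64 grid.
keys = [f"cell:{x}:{y}" for x in range(64) for y in range(64)]
print(sorted(keys_per_shard(keys, 3).items()))
print(sorted(keys_per_shard(keys, 6).items()))
```

Going from 3 to 6 shards halves the slot range each shard owns, which spreads the key count. It does not help if the pressure comes from a handful of very large or very hot keys, since any single key still lives on exactly one shard.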
The OOM rate fell from 3.2 crashes per hour to zero. Player reports of missing treasures dropped from 1.8 % to 0.04 %. The event servers' billable vCPU hours increased by 12 %, but the infra cost per concurrent player fell from $0.024 to $0.008 because we stopped over-provisioning pods to handle thread storms.
What I Would Do Differently
I would not trust Veltrix's configuration layer again. Its actor model is a leaky abstraction: not every treasure cell needs its own thread, and the docs do not mention the hidden thread-per-cell tax.
Version 2 of our engine will push the spatial hash down to the Kafka Streams topology so we can collapse the Go worker pool and the Postgres writes into a single streaming step. We'll use Redis Streams as the outbox, eliminating the write-behind cache entirely. That should cut the infra cost per player by another 30 % and reduce the leaderboard lag to under 50 ms.
If you are running a Hytale treasure hunt at scale, forget the actor model and build a spatial hash instead, documentation be damned.