Lisa Zulu
The Moment Our Hytale Treasure Hunt Engine Blew Past 40k Errors per Day

The problem we were actually solving

In 2025, when Hytale's Veltrix engine went live, our in-game treasure hunt system was supposed to drop players into procedurally generated tunnels, mark dig sites with floating runes, and reward them with aetherium shards, all within 8 seconds of spawning. The gameplay loop looked good in the demo: players solved a riddle, the engine streamed the next tunnel chunk, and the camera followed a spline that curved around the newly generated geometry. What the demos never showed was the 40K silent errors per day that appeared only when the A* pathfinder tried to re-route around an ore pocket that had just spawned on top of a spline control point. The spline cache, refreshed by a background coroutine every 200 ms, assumed static geometry; the pathfinder assumed dynamic geometry. The mismatch cost us 14% of our concurrent sessions before we even noticed the metric in New Relic.

What we tried first (and why it failed)

First, we wrapped the spline in an asynchronous observer that rebuilt the curve whenever the terrain chunk streamer signaled a change. The coroutine fired every 200 ms and returned a new Catmull-Rom curve to the renderer. Latency looked acceptable: the 95th percentile curve rebuild time was 98 ms, and the frame budget held at 16.6 ms. Then we ran 500 bots in the staging world for 12 hours. The first explosion happened at bot 247: a race condition in which two chunks streamed simultaneously and the spline observer received two transform updates in the same frame. The Catmull-Rom knots became non-monotonic, the renderer crashed with a NaN vertex buffer, and the error log filled up with InvalidOperationException entries from Unity's Burst jobs. Worse, the observer kept firing because the chunk loader still thought the curve was dirty, pinning the CPU at 98% for 800 ms and triggering a cascading GC that dropped the frame rate to 5 FPS.
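
The failure mode above can be illustrated outside the engine. The following Python sketch is not the game's code; it assumes a non-uniform Catmull-Rom parameterization in which each knot carries a parameter value, and the names `knots_are_monotonic` and `apply_update` are hypothetical. It shows how two updates that are each valid on their own can leave the knot parameters out of order when both land in the same frame.

```python
from typing import Dict, List


def knots_are_monotonic(ts: List[float]) -> bool:
    """Return True when knot parameters are strictly increasing."""
    return all(a < b for a, b in zip(ts, ts[1:]))


def apply_update(ts: List[float], update: Dict[int, float]) -> List[float]:
    """Return a copy of ts with the given knot indices overwritten."""
    out = list(ts)
    for index, value in update.items():
        out[index] = value
    return out


# Hypothetical knot parameters for a five-knot curve.
base = [0.0, 1.0, 2.0, 3.0, 4.0]

# Two chunk updates computed against the same base in the same frame.
update_a = {2: 2.8}  # chunk A pushes knot 2 later
update_b = {3: 2.5}  # chunk B pulls knot 3 earlier

# Each update is valid on its own...
assert knots_are_monotonic(apply_update(base, update_a))
assert knots_are_monotonic(apply_update(base, update_b))

# ...but applying both puts knot 2 after knot 3.
merged = apply_update(apply_update(base, update_a), update_b)
print(knots_are_monotonic(merged))
```

In a non-uniform Catmull-Rom evaluator the interpolation weights divide by the difference between neighboring knot parameters, so a zero or negative interval is one plausible route to NaN vertices. A guard like this, run before the curve is handed to the renderer, would at least turn the crash into a rejected update.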

The architecture decision

We needed a way to freeze the spline geometry at the exact moment the pathfinder asked for a route, but still let the renderer show dynamic ore pockets. The solution was a dual-layer spline: a static reference spline baked at world load time, and a lightweight dynamic offset layer stored per chunk. When a terrain chunk arrived, the chunk loader appended a local offset vector to the affected spline knot instead of rewriting the global curve. The pathfinder read the static reference spline, applied the offsets on the fly, and returned a route within 45 ms. Meanwhile, the renderer consumed the same static spline plus the offset deltas, so tearing artifacts vanished. We moved the offset storage to a 64 KB ring buffer allocated per 256 m chunk, which kept memory per player under 256 KB even with 16 concurrent hunts. The state machine that owned the ring buffer ran on a separate job system thread with a 1 ms time slice, preventing the 800 ms stall we saw before.
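
A minimal Python sketch of the dual-layer idea follows. It is an illustration under assumptions, not the engine's implementation: the class names `ReferenceSpline` and `OffsetRing`, the function `effective_knots`, and the rule that the newest offset per knot wins are all hypothetical, and the ring capacity is counted in entries rather than bytes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class ReferenceSpline:
    """Static knots baked once at world load; never mutated afterwards."""
    knots: Tuple[Vec3, ...]


@dataclass
class OffsetRing:
    """Fixed-capacity ring of (knot_index, offset) entries for one chunk."""
    capacity: int
    entries: List[Tuple[int, Vec3]] = field(default_factory=list)
    head: int = 0  # index of the oldest entry once the ring is full

    def append(self, knot_index: int, offset: Vec3) -> None:
        if len(self.entries) < self.capacity:
            self.entries.append((knot_index, offset))
        else:
            # Overwrite the oldest entry and advance the head.
            self.entries[self.head] = (knot_index, offset)
            self.head = (self.head + 1) % self.capacity

    def in_age_order(self) -> List[Tuple[int, Vec3]]:
        """Entries from oldest to newest."""
        return self.entries[self.head:] + self.entries[:self.head]


def effective_knots(ref: ReferenceSpline, rings: Dict[str, OffsetRing]) -> List[Vec3]:
    """Apply the newest offset per knot on top of the static reference.

    Assumes each knot is owned by a single chunk, so ring iteration order
    does not matter.
    """
    latest: Dict[int, Vec3] = {}
    for ring in rings.values():
        for knot_index, offset in ring.in_age_order():
            latest[knot_index] = offset
    result: List[Vec3] = []
    for i, (x, y, z) in enumerate(ref.knots):
        dx, dy, dz = latest.get(i, (0.0, 0.0, 0.0))
        result.append((x + dx, y + dy, z + dz))
    return result


ref = ReferenceSpline(knots=((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)))
rings = {"chunk_0_0": OffsetRing(capacity=2)}
rings["chunk_0_0"].append(1, (0.0, 1.5, 0.0))  # an ore pocket nudges knot 1 upward
print(effective_knots(ref, rings))
```

Both consumers call the same read path, so the pathfinder and the renderer cannot disagree about where a knot is, and the reference knots themselves are never rewritten.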

What the numbers said after

After the rollout we watched the error rate drop from 40K/day to 0 by week three. The 95th percentile pathfinding latency rose from 28 ms to 45 ms, which still fit inside our 8-second cadence because the hunt timer started after the first AI decision, not after terrain chunk arrival. The profiler showed the new job spent 37 µs on each knot offset versus 2.1 ms in the old monolithic coroutine. Memory usage per session fell from 1.3 MB to 280 KB, partly because we stopped serializing the entire Catmull-Rom knot array on every update. Most surprising was the player feedback: completion time variance shrank from ±3.4 s to ±0.9 s, because the static reference spline kept the curve length constant and the offset layer only nudged geometry in place. The change also removed the Unity Burst job exception; we patched the old code path and saw a 22% reduction in GC allocation bytes per frame, which mattered more than the raw error count.

What I would do differently

I would not trust a coroutine to handle geometry mutations again, no matter how small the delta. Next time we'll reserve a dedicated virtual geometry threading context from engine boot and feed both the pathfinder and the renderer from that single source of truth. We should also bake the spline length into the world manifest so new clients can pre-allocate the offset buffers before the first chunk arrives. We lost two days debugging missing knot IDs in the mobile build because the client tried to guess the curve length. Finally, we need a deterministic replay path for the treasure hunt state so QA can reproduce the NaN vertices in a 120-second test instead of waiting for a 12-hour bot run. The replay would record and reproduce the RNG seed, the chunk load order, and the exact frame when each offset vector was committed. Without it, we'll keep playing whack-a-mole with the same race condition every time we tweak a material shader.
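
As a rough illustration of that replay path, here is a Python sketch that captures the three inputs named above and re-runs them. Everything in it is an assumption for illustration: the `ReplayLog` fields, the `run_hunt` function, and the stand-in rule that each chunk draws one scalar offset from the RNG in load order.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ReplayLog:
    seed: int                             # RNG seed for the hunt
    chunk_order: Tuple[str, ...]          # chunk ids in the order they loaded
    commits: Tuple[Tuple[int, str], ...]  # (frame, chunk_id) offset commits


def run_hunt(log: ReplayLog) -> List[Tuple[int, str, float]]:
    """Re-run the recorded inputs and return (frame, chunk, offset) commits."""
    rng = random.Random(log.seed)  # isolated RNG, no shared global state
    # Each chunk draws its offset in load order, so the RNG stream is reproducible.
    offsets: Dict[str, float] = {
        chunk: rng.uniform(-1.0, 1.0) for chunk in log.chunk_order
    }
    # Replay commits by frame; sorted() is stable, so commits recorded in the
    # same frame keep their recorded order.
    ordered = sorted(log.commits, key=lambda commit: commit[0])
    return [(frame, chunk, offsets[chunk]) for frame, chunk in ordered]


# Two chunks committing in the same frame, as in the original race.
log = ReplayLog(seed=1234, chunk_order=("c1", "c0"), commits=((7, "c0"), (7, "c1")))
first = run_hunt(log)
second = run_hunt(log)
print(first == second)
```

The point of keeping the log immutable and the RNG local is that a failing run can be attached to a bug report and replayed bit-for-bit, including the case where two commits share a frame.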
