pretty ncube

When the Treasure Hunt Engine Eats Itself: My First Production Outage That Taught Me the True Cost of Defaults

The Problem We Were Actually Solving

The treasure-hunt engine is a state machine that advances an epoch every 10s, recomputing leaderboards and validating 100k+ player claims in a single Lua coroutine. One night the Lua heap counter (yes, we were still using debug.getregistry()) jumped from 64MB to 412MB within 90 minutes. By 03:47 the kernel had started swapping, the GC froze for 1.8s, and the epoch stall propagated to every player session.

The on-call rotation reset the process at 03:49, but I could see the same pattern replaying: epoch duration rising linearly with heap size. The SLA required 95th percentile claim validation under 50ms; we were testing at 23ms in staging with synthetic loads, so the regression felt personal.
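For reference, the SLA check described above can be written as a small percentile test. This is a minimal sketch, not our production code: the function names are made up for illustration, and it assumes latencies arrive as plain millisecond samples and that a nearest-rank percentile is an acceptable definition of "95th percentile".

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// Returns None for an empty slice or a percentile outside (0, 100].
fn percentile_ms(samples: &[f64], pct: f64) -> Option<f64> {
    if samples.is_empty() || !(pct > 0.0 && pct <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: the smallest sample with at least pct% of samples at or below it.
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// True when the 95th percentile is under the budget (50 ms in this post).
/// An empty sample set counts as a failure rather than a pass.
fn meets_sla(samples: &[f64], budget_ms: f64) -> bool {
    matches!(percentile_ms(samples, 95.0), Some(p95) if p95 < budget_ms)
}
```

Calling meets_sla(&latencies, 50.0) on each reporting window gives a pass/fail signal. Running it against production-sized windows rather than staging-sized ones is the part that matters.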

What We Tried First (And Why It Failed)

Our first fix was to bump LUA_GCSTEP from 200 to 2000. The theory was that larger steps would let Lua finish collections faster. What actually happened: the major GC cycle took 600ms and paused all player sessions, because Lua coroutines aren't preemptible. The p99 latency graph developed a comb pattern, with good epochs at 30ms and bad epochs at 1.5s.

Next we tried running two Lua states in a sharded cluster. The cross-shard RPC latency added 18ms baseline, and the new Lua states still accumulated memory until they OOMed. The CPU flame graph showed 37% of cycles in luaV_execute, still fighting the interpreter.

The Architecture Decision

We stopped trying to tune the Lua interpreter and wrote a new epoch engine in Rust. Instead of one monolithic Lua coroutine, the Rust version splits the 100k claims into 16 independent segments that parallelize over a Tokio work-stealing scheduler. Each segment uses hashbrown's raw_entry API so we can validate 60k claims/s per core without allocations.
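To make the segmenting idea concrete, here is a rough sketch that uses only the standard library. It swaps Tokio's scheduler for std::thread::scope and leaves out hashbrown entirely, so it illustrates the partitioning, not the real engine. The Claim fields, the hash-by-player routing, and the is_valid rule are all assumptions made for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::thread;

const SEGMENTS: usize = 16;

#[derive(Debug, Clone, PartialEq)]
struct Claim {
    player_id: u64,
    submitted_at: u64, // seconds since the Unix epoch (assumed unit)
}

/// Route a claim by hashing the player id, so every claim from one player
/// lands in the same segment and segments never share state.
fn segment_of(player_id: u64) -> usize {
    let mut hasher = DefaultHasher::new();
    player_id.hash(&mut hasher);
    (hasher.finish() % SEGMENTS as u64) as usize
}

/// Split the epoch's claims into SEGMENTS independent buckets.
fn partition(claims: Vec<Claim>) -> Vec<Vec<Claim>> {
    let mut segments = vec![Vec::new(); SEGMENTS];
    for claim in claims {
        segments[segment_of(claim.player_id)].push(claim);
    }
    segments
}

/// Placeholder rule: a claim is valid if it arrived before the epoch closed.
fn is_valid(claim: &Claim, epoch_end: u64) -> bool {
    claim.submitted_at <= epoch_end
}

/// Validate each segment on its own scoped thread and return the number of
/// valid claims per segment, in segment order.
fn validate_epoch(segments: &[Vec<Claim>], epoch_end: u64) -> Vec<usize> {
    thread::scope(|scope| {
        let handles: Vec<_> = segments
            .iter()
            .map(|segment| {
                scope.spawn(move || segment.iter().filter(|c| is_valid(c, epoch_end)).count())
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("segment worker panicked"))
            .collect()
    })
}
```

Because no segment touches another segment's data, there is no lock on the validation path. That independence is what lets the epoch run in parallel at all.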

The critical tradeoff: we lost the ability to hot-patch game logic at runtime. Our deployment now requires a binary rollout and a safety check in CI that runs the Rust engine against the exact Lua bytecode we retired. That check caught a bug in the claim expiry logic where we were double-counting a timestamp overflow, something luacheck would never see because the overflow wrapped silently in Lua's number type.
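The overflow class of bug is easier to surface in Rust because the arithmetic can be made explicit. The sketch below is illustrative only: the u32 timestamp width, the function names, and the error type are assumptions, not the engine's actual expiry code.

```rust
#[derive(Debug, PartialEq)]
enum ExpiryError {
    /// issued_at + ttl_secs does not fit in a u32 timestamp.
    Overflow,
}

/// Compute the expiry time with checked arithmetic, so an overflow becomes
/// an error the caller has to handle.
fn expiry_checked(issued_at: u32, ttl_secs: u32) -> Result<u32, ExpiryError> {
    issued_at.checked_add(ttl_secs).ok_or(ExpiryError::Overflow)
}

/// The silent failure mode, for comparison: wrapping addition yields a small
/// timestamp far in the past instead of reporting a problem.
fn expiry_wrapping(issued_at: u32, ttl_secs: u32) -> u32 {
    issued_at.wrapping_add(ttl_secs)
}

/// A claim is expired once `now` reaches its expiry time.
fn is_expired(now: u32, issued_at: u32, ttl_secs: u32) -> Result<bool, ExpiryError> {
    Ok(now >= expiry_checked(issued_at, ttl_secs)?)
}
```

With the checked version, an overflowing timestamp fails loudly in a differential test instead of quietly marking a claim as expired or double-counted.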

What The Numbers Said After

Here's the before and after from the production run the week after cutover. All numbers are 5-minute rolling medians measured on c6g.4xlarge (16 vCPU Graviton2):

| Metric | Lua Defaults | Rust Engine |
|---|---|---|
| Heap Growth / hour | 127 MB | 0 MB |
| Epoch Duration | 23 ms | 11 ms |
| P99 Validation | 420 ms | 28 ms |
| RSS After 7 days | 1.2 GB | 89 MB |
| GC Pauses > 100ms | 47 / hour | 0 / hour |

The new endpoints also exposed a latent Redis hotspot: the Lua version had been using EVALSHA with a 512-byte script that serialized the entire claim set, so every call touched ~20k keys and caused a 3ms tail latency. The Rust version switched to HSCAN batches of 1000 keys and cut that tail to 1.1ms.
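The batching loop follows the standard HSCAN contract: start at cursor 0, pass a COUNT hint, and stop when the server hands back cursor 0. The sketch below shows that loop against a hypothetical HashScan trait with an in-memory stand-in. The trait, FakeStore, and scan_claims are invented for illustration and are not a real Redis client API.

```rust
/// Hypothetical abstraction over one HSCAN round trip: takes a cursor and a
/// COUNT hint, returns the next cursor plus a page of field/value pairs.
trait HashScan {
    fn hscan(&mut self, cursor: u64, count: usize) -> (u64, Vec<(String, String)>);
}

/// Walk the whole hash page by page, handing each non-empty page to
/// `on_batch`. Starts at cursor 0 and stops when cursor 0 comes back.
/// Returns the total number of fields seen.
fn scan_claims<S, F>(store: &mut S, count: usize, mut on_batch: F) -> usize
where
    S: HashScan,
    F: FnMut(&[(String, String)]),
{
    let mut cursor = 0;
    let mut seen = 0;
    loop {
        let (next, page) = store.hscan(cursor, count);
        seen += page.len();
        if !page.is_empty() {
            on_batch(&page);
        }
        if next == 0 {
            return seen;
        }
        cursor = next;
    }
}

/// In-memory stand-in used only to exercise the loop; a real client would
/// issue HSCAN over the network.
struct FakeStore {
    fields: Vec<(String, String)>,
}

impl HashScan for FakeStore {
    fn hscan(&mut self, cursor: u64, count: usize) -> (u64, Vec<(String, String)>) {
        let start = cursor as usize;
        // Treat a COUNT of 0 as 1 so the fake always makes progress.
        let end = (start + count.max(1)).min(self.fields.len());
        let next = if end == self.fields.len() { 0 } else { end as u64 };
        (next, self.fields[start..end].to_vec())
    }
}
```

Two caveats apply to real Redis: COUNT is only a hint, so pages may be larger or smaller than requested, and a scan can return the same field more than once. Whatever consumes the batches needs to be idempotent.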

What I Would Do Differently

I would not have assumed Lua's defaults were wrong. The LuaJIT defaults are excellent for short-lived scripts, but for a 24/7 service maintaining 100k+ dynamic states, the interpreter's GC and scheduler are the wrong abstractions.

I would also have measured memory growth from day one instead of trusting the staging suite. The staging cluster only ran 10k synthetic claims, so a 64MB heap was fine; prod was 10x bigger and 100x longer-lived. A single Grafana panel with lua_gc_total_bytes would have prevented the outage.
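The check behind such a panel is simple arithmetic on two heap readings. Here is a minimal sketch, assuming periodic samples of heap size in bytes. The HeapSample type, the function names, and any alert threshold are hypothetical, not values from our dashboards.

```rust
/// One heap reading: seconds since process start and heap size in bytes.
#[derive(Debug, Clone, Copy)]
struct HeapSample {
    at_secs: u64,
    heap_bytes: u64,
}

const MB: f64 = 1024.0 * 1024.0;

/// Average growth in MB per hour between the first and last sample.
/// Returns None when there are no samples, no time has elapsed, or the
/// samples are out of order.
fn growth_mb_per_hour(samples: &[HeapSample]) -> Option<f64> {
    let first = samples.first()?;
    let last = samples.last()?;
    let elapsed = last.at_secs.checked_sub(first.at_secs)?;
    if elapsed == 0 {
        return None;
    }
    let delta_mb = (last.heap_bytes as f64 - first.heap_bytes as f64) / MB;
    Some(delta_mb * 3600.0 / elapsed as f64)
}

/// True when the heap grows faster than the given limit.
fn should_alert(samples: &[HeapSample], limit_mb_per_hour: f64) -> bool {
    matches!(growth_mb_per_hour(samples), Some(rate) if rate > limit_mb_per_hour)
}
```

Plugging in the incident numbers (64MB to 412MB over 90 minutes) gives 232 MB per hour, a rate that almost any sensible threshold would have flagged long before the kernel started swapping.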

Finally, I would have resisted the temptation to preserve Lua compatibility longer. Every week we kept the dual stack added complexity: two build pipelines, two dependency trees, and two places to deploy. Once the Rust engine passed the bytecode compatibility test, we should have killed the Lua path immediately.



