Lillian Dube

The Day We Let the Treasure Hunt Engine Drown in Its Own Cache

The Problem We Were Actually Solving

In 2024 we launched Veltrix, a real-time multiplayer treasure hunt platform that lets thousands of players swipe, dig, and drop in the same virtual world. After the first 90-day growth spike we hit a wall: every Friday at 8 PM UTC the Redis cluster feeding the treasure spawn engine would climb to 98 % memory usage, hit its maxmemory limit, and start rejecting writes with a single error in the logs:

OOM command not allowed when used memory > 'maxmemory'.

Restarting Redis wasn't the fix; it was a 30-second band-aid. Within minutes the memory climbed right back to the same cliff. We discovered that the spawn engine's Lua script, hlt_spawn.lua, was materialising every possible treasure combination (≈ 2.1 M permutations) in a single Redis EVAL call and keeping it resident in a Lua table. The script then tried to push every spawn location into Redis streams in one redis.call, passing unpack(treasures) as the arguments. The Lua table ballooned to 1.4 GB before Redis could evict anything.
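
As a rough sanity check on those figures, dividing the table size by the permutation count gives the average footprint per entry. The sketch below assumes decimal gigabytes (1 GB = 10^9 bytes) and treats both numbers as the rounded values quoted above; the function name is purely illustrative.

```go
package main

import "fmt"

// bytesPerEntry returns the average memory footprint per table entry.
// Assumes decimal units (1 GB = 1e9 bytes); both inputs are the rounded
// figures quoted in the post, so the result is only an estimate.
func bytesPerEntry(totalGB float64, entries float64) float64 {
	return totalGB * 1e9 / entries
}

func main() {
	avg := bytesPerEntry(1.4, 2.1e6)
	fmt.Printf("average footprint: about %.0f bytes per permutation\n", avg)
}
```

Several hundred bytes per permutation is plausible for a Lua table of strings, which suggests the problem was the number of entries held at once, not any single oversized entry.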

What We Tried First (And Why It Failed)

First we capped the table size in hlt_spawn.lua with a Lua #treasures check and a hard truncation:

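-- Drop everything past the first 100000 entries. unpack() on a table this
-- large fails with "too many results to unpack", so trim in place instead.
for i = #treasures, 100001, -1 do
 treasures[i] = nil
end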

The OOMs stopped, but the client began reporting duplicate treasures appearing across zones. We had sacrificed correctness for stability.

Next we moved the permutation logic into a Node.js service, generating only the treasures needed for the current hunt instance. Reads dropped from 4 K ops/sec to 1.2 K ops/sec, but tail latencies spiked to 800 ms when the hunt world reset every 90 seconds. The service also leaked memory because we forgot to clear the hunt:*:treasures key set on reset, and the garbage collector ran only every 30 minutes.
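
The leak described above comes down to per-hunt state that is never released on reset. Here is a minimal sketch of the missing cleanup step, written in Go instead of the original Node.js and using an in-process map as a stand-in for the hunt:*:treasures keys. The type and method names are hypothetical.

```go
package main

import "fmt"

// treasureCache is a hypothetical in-process stand-in for the per-hunt
// treasure sets the service kept (the hunt:*:treasures keys in the post).
type treasureCache struct {
	byHunt map[string][]string
}

func newTreasureCache() *treasureCache {
	return &treasureCache{byHunt: make(map[string][]string)}
}

// Add records a spawned treasure for one hunt instance.
func (c *treasureCache) Add(huntID, treasure string) {
	c.byHunt[huntID] = append(c.byHunt[huntID], treasure)
}

// Reset drops everything held for a hunt. Skipping this step on world
// reset is the kind of omission that lets memory grow without bound.
func (c *treasureCache) Reset(huntID string) {
	delete(c.byHunt, huntID)
}

// Len reports how many treasures are currently held across all hunts.
func (c *treasureCache) Len() int {
	n := 0
	for _, ts := range c.byHunt {
		n += len(ts)
	}
	return n
}

func main() {
	c := newTreasureCache()
	c.Add("hunt-1", "chest@12,40")
	c.Add("hunt-1", "gem@7,3")
	c.Reset("hunt-1") // called on every world reset
	fmt.Println("treasures held after reset:", c.Len())
}
```

Tying the cleanup to the reset event itself, instead of waiting for a periodic collector, keeps the retained set bounded by the number of live hunts.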

The Architecture Decision

We abandoned the monolithic Lua script and redesigned the spawn pipeline around a two-stage cache.

Stage 1 is a pre-computed treasure atlas in Dragonfly 2.0 (a Redis-compatible in-memory store with per-key TTL and Lua sandboxing). A nightly cron job in Go generates atlas files keyed by hunt template version and writes them to an S3 bucket. The cron uses aws s3 cp --storage-class DEEP_ARCHIVE to keep costs at $8.43/month for 12 templates.

Stage 2 is a hot Redis Stream in KeyDB 6.3 (multi-threaded Redis fork) that only holds the 2,048 treasures currently spawned in the active world. The stream is kept to roughly 1 MB; because Redis streams are trimmed by entry count (MAXLEN) or minimum ID (MINID) rather than by byte size, that budget has to be enforced as an entry cap. When the hunt resets, the Go pipeline issues a single XTRIM hunt:treasures MAXLEN 0 and reloads the next atlas slice.
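
The reset step can be sketched as follows, under a few assumptions: the atlas is treated as a flat list of treasure strings, slices are fixed windows of 2,048 entries, and the commands are built as plain argument lists instead of being sent through a specific client library. In the post the XADD loop lives in the Lua script; the sketch flattens both steps into one function for illustration, and the function names are hypothetical.

```go
package main

import "fmt"

// sliceSize is the number of treasures held in the hot stream (from the post).
const sliceSize = 2048

// atlasSlice returns the n-th window of at most sliceSize treasures.
// It returns nil when n is negative or past the end of the atlas.
func atlasSlice(atlas []string, n int) []string {
	start := n * sliceSize
	if n < 0 || start >= len(atlas) {
		return nil
	}
	end := start + sliceSize
	if end > len(atlas) {
		end = len(atlas)
	}
	return atlas[start:end]
}

// resetCommands builds the Redis commands for one world reset: empty the
// stream, then append each treasure in the new slice. The commands are
// returned as argument lists so any client library could send them.
func resetCommands(stream string, slice []string) [][]string {
	cmds := [][]string{{"XTRIM", stream, "MAXLEN", "0"}}
	for _, t := range slice {
		cmds = append(cmds, []string{"XADD", stream, "*", "t", t})
	}
	return cmds
}

func main() {
	atlas := make([]string, 5000) // hypothetical atlas of 5000 treasures
	for i := range atlas {
		atlas[i] = fmt.Sprintf("treasure-%d", i)
	}
	cmds := resetCommands("hunt:treasures", atlasSlice(atlas, 2))
	fmt.Println(len(cmds), "commands; first:", cmds[0])
}
```

Because each reset touches at most one slice, the amount of data in flight is bounded by the slice size no matter how large the full atlas grows.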

The Lua script that once ate the heap shrank to:

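local atlas = KEYS[1]
local stream = KEYS[2]
-- HGET returns a serialized string; JSON encoding is assumed here
local raw = redis.call('HGET', atlas, ARGV[1])
local ok, slice = pcall(cjson.decode, raw)
if not ok or type(slice) ~= 'table' then
 return redis.error_reply('malformed atlas slice')
end
for _, t in ipairs(slice) do
 redis.call('XADD', stream, '*', 't', t)
end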

We added a guarded pcall around the slice deserialization to block malformed atlas files from crashing KeyDB.

What The Numbers Said After

  • Memory usage on KeyDB dropped from 98 % to 42 % during peak.
  • Tail latency on /spawn fell from 800 ms to 34 ms (p99).
  • The atlas generation cron took 6.8 minutes and consumed 0.35 vCPU, down from the previous Node.js micro-service that ran 11 minutes and 1.1 vCPU.
  • AWS cost for Redis (KeyDB) climbed 8 % because we switched to larger cache.r6g.large nodes, but retiring the Node.js micro-services saved $1,240/month, for a net saving of $1,140/month.
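
The cost figures in the last bullet can be reconciled with a little arithmetic. The sketch below uses only the numbers quoted above; the previous KeyDB bill is not stated in the post, so the baseline it prints is derived and should be read as an implied value only.

```go
package main

import "fmt"

// Figures quoted in the post (USD per month, except the percentage).
const (
	retiredSavings = 1240.0 // Node.js micro-services retired
	netSavings     = 1140.0 // net figure quoted in the post
	keydbIncrease  = 0.08   // KeyDB bill rose by 8%
)

// impliedIncrease is the extra KeyDB spend needed to reconcile the two figures.
func impliedIncrease() float64 { return retiredSavings - netSavings }

// impliedBaseline is the previous KeyDB bill that an 8% rise of that size
// would imply. It is derived here, not stated in the post.
func impliedBaseline() float64 { return impliedIncrease() / keydbIncrease }

func main() {
	fmt.Printf("extra KeyDB spend: $%.0f/month\n", impliedIncrease())
	fmt.Printf("implied previous KeyDB bill: $%.0f/month\n", impliedBaseline())
}
```

In other words, the 8 % rise corresponds to about $100/month of extra KeyDB spend, which would put the earlier bill at roughly $1,250/month if the quoted figures are exact.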

What I Would Do Differently

We should have skipped the Node.js detour entirely. The first pivot to truncating the Lua table gave us breathing room but sacrificed correctness, and it took two weeks to unwind that technical debt. In hindsight, we could have sandboxed the Lua script earlier with KeyDB's LUA-ALLOW-BYPASS-EVAL flag and moved the atlas slice logic out of the hot path. That would have let us preserve the Redis-native pipeline while capping memory.

We also over-provisioned KeyDB nodes. After 30 days we enabled active defragmentation and switched to cache.r6g.medium, cutting the Redis bill by another 15 % without touching the latency budget.
