How Veltrix Blew Up Its Treasure Hunt Engine (And How We Fixed It After 3 AM Alerts)

The Problem We Were Actually Solving

Our Treasure Hunt Engine didn't need to be fast; it needed to be predictable. Players could tolerate a 200 ms session flicker once a day, but not a 2-second blackout every time they stepped through a door. The original architecture assumed that zone transitions were rare events. In production, they weren't. We had 120,000 concurrent players clustered in six high-traffic zones during weekend events. Each zone transition invalidated entire in-memory caches that held player state, leaderboards, and active treasure inventories. The engine's cache invalidation strategy was tuned for a demo, not a live system. By week 8, we were no longer meeting our SLO of 99.9% availability.

What We Tried First (And Why It Failed)

My first impulse was to bump the cache TTL to 30 seconds. This killed the cache rebuild spikes but introduced a new problem: duplicate treasure spawns. The game logic used cache keys like zone:12:player:4567 to dedupe spawns. When two players entered the same zone within 30 seconds, their caches expired at slightly different times. Player A would see a ruby spawn, and by the time Player B's cache refreshed, the ruby was gone. The server, however, had already recorded two spawns because the two TTL windows had overlapped. We started seeing duplicate gems in inventories, and players reported the duplicates as cheating.

Next, we tried event sourcing combined with a write-behind cache. We pushed invalidation events to a Kafka topic and updated the cache asynchronously. This introduced 150 ms of additional latency to every inventory update. Worse, Kafka lag during traffic spikes meant that players were seeing stale leaderboards and mismatched treasure counts. Our p95 latency for inventory fetches climbed from 89 ms to 245 ms, and the product team revolted.

The Architecture Decision

We removed the cache invalidation TTL entirely. Instead, we implemented a versioned cache key system. Every time a zone state changed—a treasure spawned, a boundary moved—we incremented a global version counter and broadcast it via Redis pub/sub. Clients used the version number as part of their cache key: zone:12:version:42:player:4567. When a client detected a version mismatch, it performed a full cache rebuild with a 5-second lock. The lock prevented concurrent rebuilds, so duplicate spawns vanished. The version counter was incremented via Lua scripts in Redis to maintain atomicity—no external coordination.
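To make the key scheme concrete, here is a minimal in-process sketch of the idea. It is not our production code: the class name VersionedZoneCache and its methods are hypothetical, a Python lock stands in for the atomicity that Redis gave us, and the pub/sub broadcast and the 5-second lock expiry are left out.

```python
import threading
from typing import Callable, Dict


class VersionedZoneCache:
    """In-process stand-in for the Redis-backed design (hypothetical names)."""

    def __init__(self) -> None:
        self._version = 0
        self._entries: Dict[str, dict] = {}
        self._version_lock = threading.Lock()  # stands in for an atomic Redis increment
        self._rebuild_lock = threading.Lock()  # stands in for the rebuild lock (no expiry here)

    def bump_version(self) -> int:
        """Call whenever zone state changes (treasure spawned, boundary moved)."""
        with self._version_lock:
            self._version += 1
            return self._version

    def current_version(self) -> int:
        with self._version_lock:
            return self._version

    def key_for(self, zone_id: int, player_id: int) -> str:
        """Build a key that embeds the current global version."""
        return f"zone:{zone_id}:version:{self.current_version()}:player:{player_id}"

    def get(self, zone_id: int, player_id: int, rebuild: Callable[[], dict]) -> dict:
        """Return the cached state, rebuilding at most once per version and key."""
        key = self.key_for(zone_id, player_id)
        cached = self._entries.get(key)
        if cached is not None:
            return cached
        with self._rebuild_lock:
            # Re-check under the lock: another thread may have rebuilt already.
            cached = self._entries.get(key)
            if cached is None:
                cached = rebuild()
                self._entries[key] = cached
            return cached


cache = VersionedZoneCache()
cache.bump_version()
state = cache.get(12, 4567, lambda: {"treasures": ["ruby"]})
```

Because the version is part of the key, a bump never overwrites an entry in place. Readers simply miss on the new key and rebuild once, while entries under the old key stay readable until they are dropped.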

The tradeoff? Memory usage increased by 22% because we kept old versions for 60 seconds to allow straggler clients to drain. But the memory cost was predictable and bounded—we scaled Redis vertically once, and the SLO stabilized.
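The 60-second drain window can be sketched as a small bookkeeping structure. This is an illustration under assumptions, not our implementation: VersionRetention and its methods are hypothetical names, and the clock is passed in explicitly so the behavior is easy to follow. In Redis, a similar effect could likely be had by setting an expiry on superseded keys.

```python
from typing import Dict, List, Optional

RETENTION_SECONDS = 60.0  # drain window for straggler clients, from the text above


class VersionRetention:
    """Tracks which superseded versions are still readable (hypothetical helper)."""

    def __init__(self, retention: float = RETENTION_SECONDS) -> None:
        self.retention = retention
        self.current: Optional[int] = None
        self._retired_at: Dict[int, float] = {}

    def publish(self, version: int, now: float) -> None:
        """Make `version` current and start the drain timer on the previous one."""
        if self.current is not None:
            self._retired_at[self.current] = now
        self.current = version

    def is_readable(self, version: int, now: float) -> bool:
        """Current version is always readable; old ones only inside the window."""
        if version == self.current:
            return True
        retired = self._retired_at.get(version)
        return retired is not None and now - retired < self.retention

    def prune(self, now: float) -> List[int]:
        """Drop versions whose drain window has elapsed; return what was dropped."""
        expired = sorted(
            v for v, t in self._retired_at.items() if now - t >= self.retention
        )
        for v in expired:
            del self._retired_at[v]
        return expired
```

The memory overhead stays bounded because each superseded version lives for a fixed window, so the number of retained versions depends on the rate of version bumps rather than growing without limit.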

What The Numbers Said After

After the change, p99 latency for zone transitions dropped from 2.1 seconds to 140 ms. Inventory fetches stabilized at a p95 of 78 ms. Memory per Redis pod rose from 8.2 GB to 10.1 GB, but we absorbed it by doubling memory limits on our Redis Cluster nodes. The SLO hit 99.95% availability for the first time in three months. We still have alerts for version counter collisions, but they're rare (about 0.03% of transitions), and we log them for offline analysis.
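For a sense of what those availability figures mean in wall-clock terms, here is a small worked calculation. The 30-day measurement window is an assumption for illustration, and the function name is hypothetical.

```python
def allowed_downtime_minutes(availability: float, window_days: int = 30) -> float:
    """Downtime budget implied by an availability target over a window.

    `window_days` defaults to 30, which is an assumed window, not one
    stated in the article.
    """
    total_minutes = window_days * 24 * 60
    return (1.0 - availability) * total_minutes


# Roughly 43.2 minutes of budget at 99.9%, and roughly 21.6 minutes at 99.95%.
budget_target = allowed_downtime_minutes(0.999)
budget_achieved = allowed_downtime_minutes(0.9995)
```

Under that assumption, going from 99.9% to 99.95% halves the tolerated downtime in the window.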

What I Would Do Differently

I would have started with the versioned cache key idea instead of patching TTLs. The TTL approach was a cognitive shortcut: instead of solving the real problem (consistency during concurrent transitions), we tried to paper over it with a time heuristic. We also should have modeled cache invalidation as a distributed lock problem from day one. If we'd done a 30-minute load test with 100,000 players during the design phase, we would have caught the TTL race condition before it made it to production.

The real lesson? Configuration isn't just about tuning numbers. It's about designing for failure modes you can't simulate locally. Our demo environment had 500 players max. Production had 120,000. Nothing in the demo prepared us for that discrepancy.