The Day the Veltrix Configs Blew Up My Treasure Hunt Engine

#ai #programming #machinelearning #webdev

The Problem We Were Actually Solving

We'd shipped the Treasure Hunt Engine six months earlier as a real-time scavenger hunt overlay for Hytale. Players raced through generated biomes, solving puzzles to unlock treasure chests. Under the hood, chest generation was orchestrated by a separate service we called Veltrix. Veltrix took a region tag (us-west-3, eu-central-1, or ap-northeast-1), hashed the biome coordinates, and returned a deterministic chest layout. The engine itself didn't know geography; it asked Veltrix for the layout, then streamed the results to clients via WebRTC.
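To make "deterministic chest layout" concrete, here is a minimal Python sketch of the idea, not Veltrix's actual code. The function name, the 64x64 grid, and the chest count are assumptions for illustration; the only point is that the same region tag and coordinates always produce the same layout.

```python
import hashlib
import random


def chest_layout(region_tag: str, biome_x: int, biome_z: int,
                 chests: int = 5) -> list[tuple[int, int]]:
    """Return the same chest positions every time for the same inputs."""
    # hashlib gives a stable digest; the built-in hash() is salted per
    # process for strings, so it would not be deterministic across restarts.
    key = f"{region_tag}:{biome_x}:{biome_z}".encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    rng = random.Random(seed)
    # Hypothetical 64x64 biome grid; the real grid size is not stated.
    return [(rng.randrange(64), rng.randrange(64)) for _ in range(chests)]
```

Because the seed depends only on the inputs, two calls with the same arguments return identical lists, which is the property the engine relied on.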

The problem wasn't speed; it was correctness. When the region tag was wrong or the Veltrix endpoint was unavailable, the hunt-service threw a 500, logged the error, and returned empty chests. Players saw repeating fields and swore the game was broken. Worse, the frontend had no fallback: it just showed a spinner. So players tweeted screenshots of empty fields with titles like "Hytale Treasure Hunt is down again."

What We Tried First (And Why It Failed)

In week one we naively added retries. The hunt-service made three attempts against Veltrix, with exponential backoff, before returning empty. But Veltrix latency itself was unreliable. The service runs on Fly.io with a single Postgres cluster shared across regions. Because of DNS flakiness in Fly's internal mesh, our west-coast region tag sometimes pointed to eu-central-1.
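For reference, the retry approach looks roughly like the sketch below. It is written in Python for illustration (the real service is .NET), and the helper name, the 50 ms base delay, and the reading of "three attempts" as three calls in total are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fetch: Callable[[], T], fallback: T, attempts: int = 3,
                      base_delay: float = 0.05,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch up to `attempts` times, doubling the wait between tries."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            # Wait base_delay, then 2x, 4x, ... but not after the last try.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: hand back the fallback (an empty layout here).
    return fallback
```

Note that every failed attempt adds its own timeout plus the backoff delay to the request, so retries against a slow upstream make the worst case slower, not faster.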

Then we tried a circuit breaker using Polly in .NET. We set failureThreshold to 3, samplingDuration to 10 s, and halfOpenAfter to 30 s. That worked fine in staging, where Fly's DNS never flaked, but in prod the circuit stayed open for six minutes after the spike because Fly's health checks lagged behind actual connectivity. Meanwhile, players had already moved on to other games.
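The circuit breaker pattern itself can be sketched as a small state machine. This is a hand-rolled Python illustration, not Polly's API: the class and method names are hypothetical, and it counts consecutive failures instead of sampling over a 10 s window.

```python
import time


class CircuitBreaker:
    """Open after N failures, then allow a probe once the break elapses."""

    def __init__(self, failure_threshold=3, break_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.break_seconds = break_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: once the break has elapsed, let a probe through.
        return self.clock() - self.opened_at >= self.break_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Re-stamp the open time, so a failed probe restarts the break.
            self.opened_at = self.clock()
```

One consequence of this design is visible in the sketch: each failed probe restarts the break, so the circuit can stay open for several multiples of the break duration if the signal it depends on keeps reporting failure.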

We even tried calling Veltrix through Envoy sidecars, hoping to absorb the flakes. But the sidecar added 28 ms of proxy overhead, which broke our SLA of under 200 ms p95 for chest discovery. Players with 60 fps monitors noticed the extra frame delay and blamed the engine.

The Architecture Decision

We ripped out the sidecars and Polly and wrote a custom topology resolver. The hunt-service now ships with a baked-in region map: us-west-3 → Veltrix endpoint in San Francisco, eu-central-1 → Frankfurt, ap-northeast-1 → Tokyo. The map is compiled into the binary at build time; no runtime DNS quirks.
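A baked-in region map can be as simple as a read-only dictionary. The Python sketch below only illustrates the shape; the endpoint URLs and names are placeholders, not our real hosts.

```python
from types import MappingProxyType

# Placeholder URLs: the real endpoints are not part of this post.
REGION_ENDPOINTS = MappingProxyType({
    "us-west-3": "http://veltrix-sfo.example",       # San Francisco
    "eu-central-1": "http://veltrix-fra.example",    # Frankfurt
    "ap-northeast-1": "http://veltrix-nrt.example",  # Tokyo
})


def resolve_endpoint(region_tag: str) -> str:
    """Look up the endpoint for a region tag; fail loudly if it is unknown."""
    try:
        return REGION_ENDPOINTS[region_tag]
    except KeyError:
        raise ValueError(f"unknown region tag: {region_tag!r}") from None
```

Wrapping the dictionary in MappingProxyType makes accidental mutation at runtime raise a TypeError, which matches the intent of a map fixed at build time.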

When a player spawns, the hunt-service hashes the coordinate pair, picks the closest region tag, and calls Veltrix directly over plain HTTP/1.1 with a 150 ms timeout. If the call fails, the hunt-service returns a cached layout from the last successful request for that biome. The cache key is biomeId + regionTag; we use an in-memory LRU of 1000 layouts with 5-minute TTL. Players see stale chests for ten seconds instead of a broken screen, which is better than nothing.
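The cache-on-failure path can be sketched as follows. This is a minimal Python illustration under assumptions: the class and function names are hypothetical, `fetch` stands in for the Veltrix call with its timeout, and the cache is not thread-safe.

```python
import time
from collections import OrderedDict


class LayoutCache:
    """In-memory LRU keyed by (biome_id, region_tag) with a TTL."""

    def __init__(self, max_entries=1000, ttl_seconds=300.0,
                 clock=time.monotonic):
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (stored_at, layout)

    def put(self, biome_id, region_tag, layout):
        key = (biome_id, region_tag)
        self._entries[key] = (self._clock(), layout)
        self._entries.move_to_end(key)
        # Evict least recently used entries beyond the size limit.
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

    def get(self, biome_id, region_tag):
        key = (biome_id, region_tag)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, layout = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return layout


def layout_or_cached(cache, biome_id, region_tag, fetch):
    """Try the live call; on any failure fall back to the cached layout."""
    try:
        layout = fetch(biome_id, region_tag)
    except Exception:
        return cache.get(biome_id, region_tag)  # None if nothing is cached
    cache.put(biome_id, region_tag, layout)
    return layout
```

The fallback can still return nothing for a biome that has never been fetched successfully or whose entry has expired, so the caller needs a plan for that case too.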

We also added an endpoint, /selfcheck, that the CI pipeline runs after every deploy. It curls the Veltrix endpoint for each region tag with a 100 ms timeout and asserts the chest layout isn't empty. If the check fails, the pipeline rolls back the container. That single test caught a misrouted Fly volume in staging last month and saved us a fire drill.
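The check itself is simple enough to sketch in a few lines of Python. The function name and the `fetch_layout` callable are hypothetical stand-ins for the HTTP call; the sketch only shows the per-region loop and the non-empty assertion.

```python
def selfcheck(region_tags, fetch_layout, timeout=0.1):
    """Return {region_tag: reason} for every region that fails the check."""
    failures = {}
    for tag in region_tags:
        try:
            layout = fetch_layout(tag, timeout=timeout)
        except Exception as exc:
            failures[tag] = f"error: {exc}"
            continue
        if not layout:
            failures[tag] = "empty layout"
    return failures
```

A pipeline step could call this and exit non-zero whenever the returned dictionary is non-empty, which is what would trigger the rollback.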

What The Numbers Said After

The direct region map dropped p95 latency from 218 ms to 94 ms. The 503 rate fell from 1.4 % to 0.02 % across the last quarter. Incident pages dropped 78 % year-over-year. Players in Tokyo stopped tweeting about empty chests.

Cache hit rate stabilized at 43 %, enough to keep the service available during Veltrix brownouts. The in-memory cache uses 32 MB of RAM on a 512 MB container, so we're fine.

One metric I regret not exposing: the number of times Veltrix returns different layouts for the same biome and region. We assumed determinism, but region drift still happens twice a week. We'll add a non-determinism counter next sprint.
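One possible shape for that counter, sketched in Python under assumptions: the class name is hypothetical, layouts are assumed to be JSON-serializable, and a real version would export the count to whatever metrics system is in use.

```python
import hashlib
import json


class DriftCounter:
    """Count how often a (biome, region) pair returns a changed layout."""

    def __init__(self):
        self._fingerprints = {}  # (biome_id, region_tag) -> sha256 hex
        self.drift_count = 0

    def observe(self, biome_id, region_tag, layout) -> bool:
        """Record a layout; return True if it differs from the last one."""
        encoded = json.dumps(layout, sort_keys=True).encode()
        fingerprint = hashlib.sha256(encoded).hexdigest()
        key = (biome_id, region_tag)
        previous = self._fingerprints.get(key)
        self._fingerprints[key] = fingerprint
        if previous is not None and previous != fingerprint:
            self.drift_count += 1
            return True
        return False
```

Storing a fingerprint instead of the full layout keeps the memory cost per biome small and fixed.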

What I Would Do Differently

I should have exposed the region map as a config file instead of baking it into the binary. Then we could hot-patch it during incidents without rebuilding. The current build pipeline takes four minutes to push a new image, and that's four minutes of angry players.
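A config-file version might look like the Python sketch below. The JSON format, the function name, and the placeholder URLs are assumptions; the idea is to read the map from disk and fall back to the built-in defaults if the file is missing or malformed.

```python
import json
from pathlib import Path

# Built-in defaults with placeholder URLs, used when the file is unusable.
DEFAULT_REGION_MAP = {
    "us-west-3": "http://veltrix-sfo.example",
    "eu-central-1": "http://veltrix-fra.example",
    "ap-northeast-1": "http://veltrix-nrt.example",
}


def load_region_map(path) -> dict:
    """Load {region_tag: endpoint} from a JSON file, else use the defaults."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Missing file, unreadable file, bad encoding, or invalid JSON.
        return dict(DEFAULT_REGION_MAP)
    valid = (
        isinstance(data, dict)
        and bool(data)
        and all(isinstance(k, str) and isinstance(v, str)
                for k, v in data.items())
    )
    return data if valid else dict(DEFAULT_REGION_MAP)
```

Falling back to the defaults means a bad hot-patch degrades to the last known-good map instead of taking the service down.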

I also would not trust Fly.io's routing again. We've kept the Envoy sidecar idea on ice. We'll deploy sidecars only if Veltrix adds regional redundancy; otherwise the sidecar overhead is still a liability.

Last, I'd add a client-side fallback: if the hunt-service returns empty chests, the frontend should display a random preset layout and log the incident. That way players aren't staring at a blank screen while we scramble to roll back. The frontend team fought me on it (extra code, extra state), but after the last outage they agreed it's cheaper than another angry tweet.