The Treasure Hunt Engine Blew Up My Cluster Three Times Before I Learned This

The Problem We Were Actually Solving

The demo showed one user, one session, one route. The real system had to serve thousands of players, each with a personal treasure map that evolved based on click patterns, time windows, and—because someone in marketing insisted—emoji reactions to their score. Our first stack was a Next.js 13 app with Vercel edge functions for the hunt logic, a managed Redis for leaderboards, and a separate Postgres for player profiles. Latency goal: <300 ms p95 end-to-end.

What we discovered on day three was that the Vercel free tier silently throttled edge functions after 100 ms of CPU time. That meant any route that looked up a player's treasure state in Redis would hit the ceiling if the payload was larger than 1 KB. The delightful demo had never exceeded 300 bytes.

What We Tried First (And Why It Failed)

We rewrote the treasure state endpoint to return only a UUID and then hydrated the client with a second call to /api/state/{uuid}. This cut the edge function time to 45 ms, but introduced a race: two concurrent clicks could claim the same treasure if the second client fetched stale state from Postgres. Our Redis leaderboard write failed with CAS mismatch 42 % of the time because we were letting the client serialise the score delta.

Then we tried a Lua script in Redis to atomically compare-and-swap the treasure state. The script worked great in local Docker, but when we pushed it to the managed Redis cluster, we discovered the Lua engine ran on a single thread shared across all tenants. Any long-running script blocked every other player's leaderboard update. During peak load, Redis p99 latency jumped to 1.8 seconds and the CEO's demo froze for 42 seconds.
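To make the compare-and-swap idea concrete, here is a minimal sketch of the semantics. It is illustrative, not the production script. The Lua text in casScript is a hypothetical example of such a script, and Store is an in-memory stand-in for Redis, so the race can be reproduced without a server. The names Store, CompareAndSwap and RaceClaims are assumptions made for this example.

```go
package main

import (
	"strconv"
	"sync"
	"sync/atomic"
)

// casScript is a hypothetical Lua script of the kind described above: it
// overwrites KEYS[1] with ARGV[2] only if the key still holds ARGV[1].
// It is shown for reference only and is not executed by this sketch.
const casScript = `
if redis.call('GET', KEYS[1]) == ARGV[1] then
  redis.call('SET', KEYS[1], ARGV[2])
  return 1
end
return 0
`

// Store is an in-memory stand-in for a single Redis key space.
type Store struct {
	mu   sync.Mutex
	data map[string]string
}

func NewStore() *Store {
	return &Store{data: make(map[string]string)}
}

// Set unconditionally writes a value, like a plain SET.
func (s *Store) Set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

// CompareAndSwap mirrors casScript: the write happens only when the key
// exists and still holds the expected value. A missing key never matches.
func (s *Store) CompareAndSwap(key, expected, next string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, ok := s.data[key]
	if !ok || cur != expected {
		return false
	}
	s.data[key] = next
	return true
}

// RaceClaims lets several simulated players claim the same treasure at once
// and returns how many of them succeeded.
func RaceClaims(s *Store, key string, players int) int {
	var wg sync.WaitGroup
	var winners int64
	for i := 0; i < players; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if s.CompareAndSwap(key, "unclaimed", "player-"+strconv.Itoa(id)) {
				atomic.AddInt64(&winners, 1)
			}
		}(i)
	}
	wg.Wait()
	return int(winners)
}
```

Seeding a key with "unclaimed" and calling RaceClaims with any number of players should produce exactly one winner, because the check and the write happen inside one critical section. That is the property the Lua script provides on the Redis side.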

At the same time, the Vercel team migrated the edge runtime from Node 18 to Node 20, silently dropping support for the undocumented global fetch() cache we had been using as a poor man's CDN. Every treasure map re-render hit the origin, and our CloudFront bill tripled overnight.

The Architecture Decision

We ripped out Vercel edge functions entirely. Instead, we moved the entire treasure logic into a Go microservice running on Fly.io. The service exposes a single HTTP POST /hunt endpoint that accepts a compressed protobuf payload containing player UUID, click coordinates, and a client-provided timestamp. We sharded the Redis cluster into six primary nodes, each with two read replicas, to parallelise the Lua scripts. A Lua script now runs within 3 ms because we limited it to a single key compare-and-swap instead of scanning the whole leaderboard.
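A single-key script can be routed to exactly one shard, which is what lets the shards work in parallel. The sketch below assumes the managed service follows the standard Redis Cluster scheme: CRC16 (XMODEM variant) of the key, modulo 16384 hash slots, with hash-tag support. The even split of slots across six primaries in ShardFor is purely an assumption for illustration, since real slot assignments are configurable.

```go
package main

import "strings"

const hashSlots = 16384

// crc16 computes CRC16/XMODEM (polynomial 0x1021, initial value 0),
// the checksum Redis Cluster uses to map keys to hash slots.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = crc<<1 ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// KeySlot returns the Redis Cluster hash slot for key. If the key contains
// a non-empty {tag}, only the tag is hashed, so related keys can be forced
// onto the same shard.
func KeySlot(key string) int {
	if start := strings.IndexByte(key, '{'); start >= 0 {
		if end := strings.IndexByte(key[start+1:], '}'); end > 0 {
			key = key[start+1 : start+1+end]
		}
	}
	return int(crc16([]byte(key)) % hashSlots)
}

// ShardFor maps a slot to one of n primaries, assuming (hypothetically)
// that slots are split into n contiguous, equal ranges.
func ShardFor(slot, primaries int) int {
	return slot * primaries / hashSlots
}
```

For example, KeySlot("foo") is 12182, and a key such as "{foo}.leaderboard" lands in the same slot because only the tag is hashed. Keeping each script to one key means it never has to cross a slot boundary.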

For state consistency, we introduced a lightweight outbox in Postgres: every state change writes a row to hunt_events, and a sidecar process publishes that row to a NATS JetStream stream. The Go service consumes its own events to update an in-memory LRU cache, so the next request sees the latest state without hitting Postgres. Cache invalidation is based on a monotonic counter (the player's last treasure index) rather than wall time.
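The invalidation rule can be sketched in a few lines. This is a simplified illustration under stated assumptions: the HuntEvent fields and the StateCache type are hypothetical names, and LRU eviction is left out so the ordering rule stays visible. An event replaces the cached state only if its treasure index is strictly higher than the one already cached, so late or duplicate deliveries are ignored.

```go
package main

import "sync"

// HuntEvent is a hypothetical shape for one row of hunt_events.
type HuntEvent struct {
	PlayerID      string
	TreasureIndex uint64 // increases by one with every treasure a player claims
	State         string
}

// StateCache keeps the newest known event per player.
// LRU eviction is omitted from this sketch.
type StateCache struct {
	mu      sync.Mutex
	entries map[string]HuntEvent
}

func NewStateCache() *StateCache {
	return &StateCache{entries: make(map[string]HuntEvent)}
}

// Apply stores ev only when it is newer than the cached entry for the same
// player. It reports whether the cache changed.
func (c *StateCache) Apply(ev HuntEvent) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	cur, ok := c.entries[ev.PlayerID]
	if ok && ev.TreasureIndex <= cur.TreasureIndex {
		return false // stale or duplicate delivery
	}
	c.entries[ev.PlayerID] = ev
	return true
}

// Get returns the cached state for a player, if any.
func (c *StateCache) Get(playerID string) (HuntEvent, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	ev, ok := c.entries[playerID]
	return ev, ok
}
```

Because the comparison uses the per-player index rather than a timestamp, events that arrive out of order from the stream cannot roll a player's state backwards.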

We configured Redis to use ~30 % of the instance memory for Lua scripts to avoid GC pauses. We set Lua script timeouts to 1 ms, and if a script exceeds it, we fall back to a fast path that rejects the click immediately rather than risking a cascade.
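One way to enforce such a budget from the Go side is a context deadline around the script call, with the rejection as the fallback. The sketch below is an assumption-laden illustration: RunWithBudget, ErrClickRejected and simulatedScript are hypothetical names, the script is simulated with a timer instead of a real Redis call, and it assumes the client call honours context cancellation. The demo uses a 100 ms budget rather than 1 ms so it behaves the same on a loaded machine.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// ErrClickRejected is returned when the script does not finish in time.
var ErrClickRejected = errors.New("click rejected: script budget exceeded")

// RunWithBudget runs script with a deadline. If the deadline passes first,
// the click is rejected immediately instead of waiting for Redis.
func RunWithBudget(parent context.Context, budget time.Duration,
	script func(context.Context) (bool, error)) (bool, error) {

	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()

	ok, err := script(ctx)
	if errors.Is(err, context.DeadlineExceeded) {
		return false, ErrClickRejected // fast path: fail the click, not the cluster
	}
	return ok, err
}

// simulatedScript stands in for a Redis script call that takes delay to
// answer and gives up as soon as its context is cancelled.
func simulatedScript(delay time.Duration) func(context.Context) (bool, error) {
	return func(ctx context.Context) (bool, error) {
		select {
		case <-time.After(delay):
			return true, nil
		case <-ctx.Done():
			return false, ctx.Err()
		}
	}
}

// Demo runs one fast and one slow simulated script against the same budget.
func Demo() (fastOK bool, slowErr error) {
	ctx := context.Background()
	budget := 100 * time.Millisecond
	fastOK, _ = RunWithBudget(ctx, budget, simulatedScript(time.Millisecond))
	_, slowErr = RunWithBudget(ctx, budget, simulatedScript(2*time.Second))
	return fastOK, slowErr
}
```

The design choice is to fail one click quickly rather than let a slow script hold up everything queued behind it. The caller can then decide whether to surface the rejection to the player or retry.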

What The Numbers Said After

In the first chaos test we threw 20 000 simulated players at the new service, each clicking every 500 ms. End-to-end p95 latency stayed at 112 ms. Redis p99 remained under 12 ms. The Lua CAS mismatch rate dropped to 0.03 %. NATS JetStream added 1.4 ms of end-to-end latency but eliminated stale leaderboard reads across shards.

The CloudFront bill fell back to baseline because the Go service caches treasure maps at the edge via Fly's global anycast network. The Postgres write load stayed under 300 TPS even though we were processing 40 000 state updates per minute.

After two weeks of production traffic, we uncovered one new failure mode: clock skew between client devices and the Go service caused some players to see negative score deltas. We fixed it by switching to a hybrid logical clock that combines server timestamp with player sequence number. The fix added 0.07 ms average latency but eliminated the visual glitch that marketing called unacceptable.
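The idea behind that fix can be sketched as follows. This is a simplified illustration, not a full hybrid logical clock implementation: the HLC and Clock types are hypothetical, one Clock per player is assumed, and the player sequence number is assumed to increase with every click. The client's own timestamp is never used for ordering.

```go
package main

import (
	"sync"
	"time"
)

// HLC is a simplified hybrid timestamp: server wall-clock milliseconds plus
// the player's own sequence number as a tie-breaker.
type HLC struct {
	ServerMillis int64
	Seq          uint64
}

// Before reports whether a is ordered strictly before b.
func (a HLC) Before(b HLC) bool {
	if a.ServerMillis != b.ServerMillis {
		return a.ServerMillis < b.ServerMillis
	}
	return a.Seq < b.Seq
}

// Clock issues timestamps whose time component never moves backwards,
// even if the underlying wall clock does.
type Clock struct {
	mu         sync.Mutex
	lastMillis int64
	now        func() int64
}

// NewClock builds a Clock. Passing nil uses the server's wall clock;
// tests can inject a fake time source instead.
func NewClock(now func() int64) *Clock {
	if now == nil {
		now = func() int64 { return time.Now().UnixMilli() }
	}
	return &Clock{now: now}
}

// Stamp returns the timestamp for a player's event with the given sequence.
func (c *Clock) Stamp(playerSeq uint64) HLC {
	c.mu.Lock()
	defer c.mu.Unlock()
	wall := c.now()
	if wall < c.lastMillis {
		wall = c.lastMillis // clamp: never hand out an earlier time
	}
	c.lastMillis = wall
	return HLC{ServerMillis: wall, Seq: playerSeq}
}
```

With this ordering, two events from the same player always compare in sequence order even when the wall clock stalls or steps back, so a score delta computed between consecutive events cannot appear to run in reverse.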

What I Would Do Differently

I would never let the demo dictate scale assumptions. We should have run a spike with a single player and a synthetic load of 10 000 virtual players before choosing any part of the stack. The edge function envelope was a black box; moving the logic to a stateful service gave us visibility but cost us two weeks.

I wish we had adopted protobuf from day one instead of JSON. The savings in bandwidth and parse time were obvious only after we hit the Vercel throttling.

Finally, I would have built a local chaos rig that kills random Redis shards and Fly regions while we watched Grafana. Our post-mortem after the Lua GC incident took six hours because we didn't have a repeatable failure scenario. Next time we burn a cluster, we'll already have the ashes on disk.