The Veltrix Treasure-Hunt Engine Litmus Test

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In 2024 we shipped the treasure-hunt engine for Veltrix at 2,300 concurrent sessions running 180,000 packets per second across 4 AWS AZs, and everything was fine until Black Friday weekend. On Friday at 14:01 UTC the multi-tenant orchestrator hit a 429 on every DescribeCacheNodes call to ElastiCache. The Redis cluster itself was humming along at <3 ms P99, but the AWS control plane simply could not keep up with the discovery loop we had hard-coded: every 5 seconds the orchestrator issued a DescribeCacheNodes against every shard, multiplied by the number of games, multiplied again by the number of players per game. By 14:12 UTC we had 1.2 million DescribeCacheNodes calls outstanding, each one costing us 328 ms and 4 KB of bandwidth. At that point the Redis control plane started throttling and the latency on Lua script executions jumped from 6 ms to 1.8 seconds. Players started reporting "We couldn't find the chest on the map."
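A minimal sketch of the fan-out arithmetic described above. Only the 5-second interval comes from the incident; the shard, game, and player counts are hypothetical placeholders, and the function name is made up for illustration.

```python
def discovery_calls_per_second(shards: int, games: int, players_per_game: int,
                               interval_s: float) -> float:
    """Control-plane calls per second when every shard is polled once per
    game per player on each interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return shards * games * players_per_game / interval_s


# Hypothetical fleet sizes, for illustration only; the post does not state them.
rate = discovery_calls_per_second(shards=8, games=50, players_per_game=20, interval_s=5)
print(f"{rate:.0f} calls/s")
```

Because the three factors multiply, doubling any one of them doubles the load on the control plane, while the interval only divides it once.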

What We Tried First (And Why It Fails)

Our first configuration file looked like this:

orchestrator:
  redis:
    discovery_interval: 5
    shard_prefix: "hunt"
    ttl: 300
  rate_limit:
    requests_per_second: 1000

We chose 5 seconds because the Redis maintainers' slide deck showed 2-3 seconds for metadata propagation and we erred on the side of caution. We hard-coded the TTL at 300 seconds to match the shortest player game length, reasoning that if a Redis node died we could fail over within one game cycle. The rate_limit stanza was a knee-jerk reaction after we saw the 429s; we throttled the orchestrator to 1,000 RPS against a control plane that AWS later told us caps DescribeCacheNodes at 200 requests per second per AZ. The gap between 1,000 and 200 is why we melted.
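For a client-side limit to help, it has to sit at or below the service quota. The sketch below is a generic token bucket, not the orchestrator's actual limiter; the class name, the burst size, and the 80% headroom are assumptions, and only the 200 requests per second figure comes from the text.

```python
import time


class TokenBucket:
    """Client-side limiter: at most `rate` calls per second on average,
    with bursts of up to `burst` calls."""

    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# 200 requests per second is the per-AZ cap quoted above.
# Running at 80% of it (160) and the burst of 20 are assumptions.
SERVICE_CAP_RPS = 200
limiter = TokenBucket(rate=160, burst=20)
```

A caller would check `limiter.try_acquire()` before each control-plane request and skip or delay the request when it returns `False`.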

We also tied the discovery loop to the game-start event: every time a new game was created we queued a full shard scan. On launch day we had 300 games going live every minute. Each game spawned a DescribeCacheNodes call. That multiplied quickly.
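One way to break that coupling while still polling is to let every game start share a single cached scan. This is a sketch of that alternative, not what was shipped; `CachedTopology` and the `scan` callable are hypothetical names, and the class is not thread-safe as written.

```python
import time
from typing import Callable, List, Optional


class CachedTopology:
    """Shares one shard scan across all callers within `max_age_s` seconds."""

    def __init__(self, scan: Callable[[], List[str]], max_age_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._scan = scan              # hypothetical function that performs the real scan
        self._max_age_s = max_age_s
        self._clock = clock
        self._nodes: List[str] = []
        self._fetched_at: Optional[float] = None  # None means nothing cached yet

    def get(self) -> List[str]:
        now = self._clock()
        # Rescan only when the cache is empty or older than max_age_s.
        if self._fetched_at is None or now - self._fetched_at >= self._max_age_s:
            self._nodes = self._scan()
            self._fetched_at = now
        return self._nodes
```

With a 5-second `max_age_s`, 300 game starts in a minute would trigger at most 12 scans instead of 300, because the scan rate is bounded by the cache age rather than by the launch rate.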

The Architecture Decision

We ripped out the polling loop entirely and replaced it with EventBridge Pipes that subscribe to ElastiCache's native ClusterUpdateEvent stream. The configuration became:

redis:
  event_source:
    bus: "default"
    rule: "cache-cluster-events"
    target: "orchestrator-ingest"
  shard_prefix: "hunt"
  ttl: 300

The cluster itself emits a single event when topology changes: node added, node removed, failover started. The payload is 328 bytes and the stream respects a soft limit of 1,000 events per second per shard. We added a simple deduplication step in an EventBridge Pipe so that if the same event ID arrives twice, the second one is dropped. The orchestrator now receives 12 events per minute instead of 240,000. We tuned the TTL to 60 seconds because the longest failover AWS documented was 47 seconds in 2024, and we wanted one retry window.
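The deduplication described above runs inside the pipe. The same idea can be sketched on the consumer side, assuming each event carries a unique top-level `id` field, as EventBridge events do. The class and function names, and the choice of a time-windowed memory of seen IDs, are assumptions for illustration.

```python
import time
from collections import OrderedDict
from typing import Callable


class EventDeduplicator:
    """Remembers event IDs for `window_s` seconds and rejects repeats in that window."""

    def __init__(self, window_s: float, clock: Callable[[], float] = time.monotonic):
        self._window_s = window_s
        self._clock = clock
        self._seen: "OrderedDict[str, float]" = OrderedDict()  # event ID -> first-seen time

    def is_new(self, event_id: str) -> bool:
        now = self._clock()
        # Evict IDs older than the window; entries are kept in arrival order.
        while self._seen:
            _, seen_at = next(iter(self._seen.items()))
            if now - seen_at < self._window_s:
                break
            self._seen.popitem(last=False)
        if event_id in self._seen:
            return False
        self._seen[event_id] = now
        return True


def handle(event: dict, dedup: EventDeduplicator, apply_topology_change) -> bool:
    """Apply a topology event once; return False when the event is a duplicate."""
    if not dedup.is_new(event["id"]):
        return False
    apply_topology_change(event)
    return True
```

Bounding the memory by time keeps the seen-set small at a dozen events per minute, and a window at least as long as the redelivery horizon is enough to drop the second copy.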

We also moved the shard prefix out of code and into the event rule parameters so we could change it at runtime without a redeploy. That decision saved us when marketing renamed the prefix mid-campaign and we flipped it in the rule in under 30 seconds.

What The Numbers Said After

Post-migration latency on DescribeCacheNodes dropped from 328 ms to 0 ms, because we no longer issued the calls at all. The control-plane 429s vanished. The orchestrator CPU on the c6g.large worker went from 82% to 14%. The Redis cluster itself saw a 7% drop in total packets because we removed the extra discovery calls. Black Friday weekend handled 5,100 concurrent sessions and 410,000 packets per second without a hitch. The only real outage that weekend was a mis-configured autoscaling policy on the game servers, which is a story for another post.

What I Would Do Differently

I should have measured the DescribeCacheNodes call under load before we coded the polling loop. If I had run a simple vegeta test against the control plane at 100 RPS, I would have seen the first 429 after 30 seconds and realized we were building on quicksand. We also should not have tied discovery to game-start events; that coupling meant every game creation triggered a full cluster scan, so control-plane traffic scaled with the launch rate. Event-driven discovery is the only sane model for Redis topologies that grow beyond toy scale. Finally, the TTL of 300 seconds was cargo-culted from an older service. We only changed it after we watched a 47-second failover in staging and realized 300 was overkill. Next time I'll start with a TTL equal to the longest documented failover time and increase it only if failure data shows we need it.
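The kind of measurement meant here can be sketched as a small probe that offers a fixed request rate and reports when the first 429 appears. This is a generic harness rather than a vegeta invocation; `call` is a hypothetical callable that performs one signed control-plane request and returns its HTTP status code.

```python
import time
from typing import Callable, Optional


def seconds_until_first_throttle(call: Callable[[], int], rate_per_s: float,
                                 duration_s: float,
                                 clock: Callable[[], float] = time.monotonic,
                                 sleep: Callable[[float], None] = time.sleep
                                 ) -> Optional[float]:
    """Issue `call` at a fixed rate; return elapsed seconds at the first
    HTTP 429, or None if no 429 is seen within `duration_s`."""
    interval = 1.0 / rate_per_s
    start = clock()
    sent = 0
    while clock() - start < duration_s:
        if call() == 429:
            return clock() - start
        sent += 1
        # Pace against the schedule instead of sleeping a fixed interval, so
        # call latency shorter than the interval does not reduce the rate.
        delay = start + sent * interval - clock()
        if delay > 0:
            sleep(delay)
    return None
```

The loop is single-threaded, so it cannot offer more than one request per call latency; at 328 ms per call, reaching 100 RPS would need the same probe run concurrently across workers.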
