The Moment We Discovered Treasure Hunt Engines Lie About Load

#ai #programming #machinelearning #webdev

The Problem We Were Actually Solving

Our team built on Veltrix to run interactive treasure hunts that reward players with NFT drops. The event design required real-time leaderboard updates and dynamic challenge allocation. We expected traffic spikes from viral marketing campaigns, and we picked Veltrix because the docs advertised horizontal scaling with Redis and Kafka. The onboarding guide showed a one-click deployment to Kubernetes, and the marketing page flashed "a billion requests handled."

When 50,000 players joined Hunt-001, the system did not crash. It entered a slow death. Redis memory hit 95% within 12 minutes. Players reported their positions freezing while others continued advancing. The Grafana dashboard showed Kafka consumer lag stuck at 3,000 messages, and the lag never decreased even after we scaled the brokers. The logs printed endless Too Many Requests errors from the hunt-config.yml layer, which throttled all incoming events when a single hunt exceeded 2,000 concurrent sessions.

We had tuned the Redis connection pool and set auto-scaling policies for the worker pods, but no one bothered to check the hunt-config.yml thresholds. The file set max_concurrent_hunters: 2000 as the default, not 50,000. The operator guide said nothing about this limit, and the search function in the docs returned zero results when we typed max_concurrent_hunters. We had built our scaling story on assumptions, not facts.
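A pre-flight check along the following lines would have caught the mismatch before launch. This is a minimal sketch, not part of Veltrix: it assumes hunt-config.yml has already been parsed into a plain dict, and the function name check_hunter_cap is hypothetical. Only the max_concurrent_hunters key and its default of 2000 come from the incident above.

```python
# Default observed in hunt-config.yml during the Hunt-001 incident.
DEFAULT_MAX_CONCURRENT_HUNTERS = 2000


def check_hunter_cap(config: dict, expected_peak: int) -> list[str]:
    """Return a list of problems; an empty list means the cap covers the peak.

    `config` is assumed to be hunt-config.yml already parsed into a dict.
    """
    problems = []
    cap = config.get("max_concurrent_hunters")
    if cap is None:
        # The key is absent, so the default silently applies.
        problems.append(
            "max_concurrent_hunters is not set; "
            f"the default of {DEFAULT_MAX_CONCURRENT_HUNTERS} applies"
        )
        cap = DEFAULT_MAX_CONCURRENT_HUNTERS
    if cap < expected_peak:
        problems.append(
            f"max_concurrent_hunters={cap} is below the expected peak "
            f"of {expected_peak}"
        )
    return problems


if __name__ == "__main__":
    for problem in check_hunter_cap({"max_concurrent_hunters": 2000}, 50_000):
        print("WARNING:", problem)
```

Running a check like this in CI, with the expected peak taken from the marketing forecast, turns an undocumented default into a visible launch blocker.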

What We Tried First (And Why It Failed)

My first attempt was to raise the Redis connection pool size from 100 to 500 and set the maxmemory-policy to allkeys-lru. The system handled 6,000 concurrent users briefly before Redis restarted with a fork() failure; the OOM killer had triggered. The larger pool helped latency but did not fix the hunt-config.yml hard cap.

Next, I tried to override hunt-config.yml on the fly by patching the ConfigMap in Kubernetes. The Veltrix operator rejected the patch with an error: hunt-config.yml is immutable once the hunt is started. The operator guidance suggested re-launching the hunt with a new configuration, which meant downtime for every viral spike. That defeated the purpose of elastic scaling.

I also swapped the Redis instance for DragonflyDB, hoping that a multi-threaded, Redis-compatible store would reduce latency under load. Dragonfly handled the traffic burst better, but the hunt-config.yml layer still throttled events, causing leaderboard stutters every 30 seconds. Dragonfly also introduced its own latency spikes during snapshot writes, so the stutter worsened. Meanwhile, the Kafka consumer lag stayed flat at 3,000 messages because the hunt-config.yml layer refused to emit events for more than 2,000 hunters. Kafka scaling did not matter; the bottleneck was upstream.

The Architecture Decision

We had to dismantle the hunt-config.yml layer and replace it with a dynamic governor that responded to real traffic instead of enforcing a hard cap. The governor would watch the actual Redis memory usage and Kafka lag, then adjust hunt-wide throttling in real time. The operator guide called this the hunt governor service, but the docs buried it under a section titled Advanced Tuning. I had to reverse-engineer the governor's API from the Veltrix operator source code on GitHub.

The decision was to embed the governor as a sidecar in every hunt pod. The sidecar would export a Prometheus metric called veltrix_hunt_throttle_ratio and expose a gRPC endpoint for the hunt orchestrator. The orchestrator would call the governor every five seconds to adjust the maximum number of concurrent hunters allowed. We set the governor's thresholds aggressively: if Redis memory exceeded 80% or Kafka lag exceeded 1,000 messages, the governor would linearly reduce max_hunters by 10% every 30 seconds until stability returned.
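The throttling rule can be sketched in a few lines. This is an illustration under stated assumptions, not the governor's actual code: the Governor class and its method names are hypothetical, "linearly" is read as a fixed step of 10% of the baseline cap, and the ratio is assumed to be the current cap divided by the baseline. The post does not describe how the cap recovers, so this sketch simply holds the current value once the system is healthy.

```python
from typing import Optional

REDIS_MEMORY_LIMIT = 0.80  # throttle when Redis memory exceeds 80%
KAFKA_LAG_LIMIT = 1_000    # throttle when Kafka lag exceeds 1,000 messages
STEP_FRACTION = 0.10       # each step removes 10% of the baseline (assumption)
STEP_INTERVAL_S = 30.0     # at most one reduction every 30 seconds


class Governor:
    """Hypothetical sketch of the throttling rule described above."""

    def __init__(self, baseline_max_hunters: int, floor: int = 0):
        self.baseline = baseline_max_hunters
        self.max_hunters = baseline_max_hunters
        self.floor = floor
        self._last_step_at: Optional[float] = None

    def observe(self, now: float, redis_memory_ratio: float, kafka_lag: int) -> int:
        """Called by the orchestrator (every five seconds in the post)."""
        overloaded = (
            redis_memory_ratio > REDIS_MEMORY_LIMIT or kafka_lag > KAFKA_LAG_LIMIT
        )
        if not overloaded:
            # Recovery policy is not described in the post; hold the current cap.
            return self.max_hunters
        due = (
            self._last_step_at is None
            or now - self._last_step_at >= STEP_INTERVAL_S
        )
        if due:
            step = round(self.baseline * STEP_FRACTION)
            self.max_hunters = max(self.floor, self.max_hunters - step)
            self._last_step_at = now
        return self.max_hunters

    @property
    def throttle_ratio(self) -> float:
        """Value a sidecar might export as veltrix_hunt_throttle_ratio (assumed)."""
        return self.max_hunters / self.baseline
```

Separating the five-second polling interval from the 30-second step interval keeps the governor responsive to new readings without letting a single sustained spike collapse the cap in a few polls.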

We also migrated hunt-config.yml to a ConfigMap template that the governor could patch during runtime. The template set default values only; the governor overrode them per hunt. This preserved the operator experience while removing the immutable hard cap.

The tradeoff was added latency: every gRPC call added ~5ms to leaderboard updates. We mitigated this by caching the throttle ratio in the hunt pod for 10 seconds and serving stale values when Redis was under load. The extra 5ms was acceptable compared to the 30-second freezes players reported before.
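The caching behavior can be illustrated with a small time-to-live cache. This is a sketch under assumptions: the ThrottleRatioCache class is hypothetical, the fetch callable stands in for the gRPC call to the governor, and the redis_under_load flag is supplied by the caller because the post does not say how load was detected.

```python
import time
from typing import Callable, Optional

CACHE_TTL_S = 10.0  # the post caches the throttle ratio for 10 seconds


class ThrottleRatioCache:
    """Hypothetical per-pod cache in front of the governor call."""

    def __init__(
        self,
        fetch: Callable[[], float],
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch  # stand-in for the gRPC call to the governor
        self._clock = clock  # injectable so the cache can be tested
        self._value: Optional[float] = None
        self._fetched_at = 0.0

    def get(self, redis_under_load: bool = False) -> float:
        now = self._clock()
        have_value = self._value is not None
        fresh = have_value and (now - self._fetched_at) < CACHE_TTL_S
        # Serve the cached ratio while it is fresh, or while Redis is under
        # load even if it has expired, to avoid adding work during a spike.
        if fresh or (have_value and redis_under_load):
            return self._value
        self._value = self._fetch()
        self._fetched_at = now
        return self._value
```

With this shape, most leaderboard updates read a local value and only one call per 10-second window pays the gRPC round trip.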

What The Numbers Said After

After the governor sidecar went live for Hunt-002, the numbers changed dramatically. During the same viral spike of 50,000 players, Redis memory peaked at 82% instead of 95%, and the governor throttled max_hunters down to 28,000 for 90 seconds before stabilizing. The Kafka lag never exceeded 2,000 messages and recovered within three minutes. Player-reported freezes dropped from 30 per hunt to fewer than two. The 5ms sidecar latency was invisible in the 95th-percentile leaderboard update times of 98ms.

We also saw Redis connection pool usage drop from 500 to 280 connections because the governor throttled the hunter rate before the pool saturated. The OOM kills stopped entirely. The operator guide now warns about hunt-config.yml immutability, but the governor sidecar has become the de facto scaling layer.

What I Would Do Differently

I would not deploy any Veltrix hunt without the governor sidecar from day one. The hunt-config.yml layer is theatrical; it looks like a configuration file but behaves like a hard-coded circuit breaker. Treat it as legacy until proven otherwise.

Next time, I would enforce the governor as a prerequisite in the Helm chart and fail the install if the sidecar is missing. The current Veltrix operator allows skipping the governor, which means every new engineer can repeat our mistake. We learned the hard way that theatrical scaling features are not the same as resilient architecture.
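Until the chart enforces this, a CI step over the rendered manifests can play the same role. The sketch below assumes the pod spec is already parsed into a dict in the standard Kubernetes shape (a "containers" list of objects with a "name" field). The container name hunt-governor and both function names are hypothetical.

```python
GOVERNOR_CONTAINER_NAME = "hunt-governor"  # hypothetical sidecar name


def has_governor_sidecar(pod_spec: dict) -> bool:
    """True if the pod spec lists a container with the governor's name."""
    containers = pod_spec.get("containers", [])
    return any(c.get("name") == GOVERNOR_CONTAINER_NAME for c in containers)


def require_governor(pod_spec: dict) -> None:
    """Raise so that the pipeline fails when the sidecar is missing."""
    if not has_governor_sidecar(pod_spec):
        raise ValueError(
            f"hunt pod is missing the {GOVERNOR_CONTAINER_NAME!r} sidecar; "
            "refusing to deploy without the governor"
        )


if __name__ == "__main__":
    spec = {"containers": [{"name": "hunt"}, {"name": "hunt-governor"}]}
    require_governor(spec)  # passes silently when the sidecar is present
```

Inside the chart itself, the same guard could likely be expressed with Helm's required or fail template functions, so the install aborts before anything reaches the cluster.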