When Hindsight Was 20/20: How One Wrong Parameter Derailed Our Treasure Hunt Engine for Six Hours

#ai #programming #machinelearning #webdev

The Problem We Were Actually Solving

The Treasure Hunt Engine at Veltrix is not a game; it is a real-time auction system disguised as a mobile scavenger hunt. Each user runs a background thread that polls /treasure/{id} every 1.2 seconds to check whether they have captured a virtual coin. The catch is that the /treasure endpoint is not stateless: it uses a Redis Lua script to atomically increment a counter and return the updated balance only if the timestamp is within a 5-minute window. Our KPI was simple: p99 latency < 1 s and error rate < 0.5 %. The moment we pushed to 10 k concurrent users, however, the p99 exploded to 4.2 s and the error rate hit 2.8 %, all while CPU on the Python workers stayed at 35 %. We knew the Lua script was efficient (median execution time 1.4 ms), so the bottleneck had to be elsewhere. The run-book told us to set simulated_user_count=10000 for load tests, but it never told us that the load generator itself was part of the problem.
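To make the endpoint's stateful behavior concrete, here is a minimal in-memory sketch in Python. It is not the Redis Lua script itself. The class name TreasureCounter, the capture method, and the choice to skip the increment and return None for a stale timestamp are assumptions made for illustration; the post does not say which timestamp is compared or what a rejected request returns.

```python
import threading
from typing import Dict, Optional

WINDOW_SECONDS = 5 * 60  # the 5-minute window described above


class TreasureCounter:
    """In-memory stand-in for the atomic increment-and-return step (hypothetical)."""

    def __init__(self) -> None:
        # The lock plays the role that script atomicity plays in Redis.
        self._lock = threading.Lock()
        self._balances: Dict[str, int] = {}

    def capture(self, treasure_id: str, request_ts: float, now: float) -> Optional[int]:
        """Increment and return the balance, or None if request_ts is outside the window.

        Assumption: a stale request neither increments nor returns a balance.
        """
        if abs(now - request_ts) > WINDOW_SECONDS:
            return None
        with self._lock:
            balance = self._balances.get(treasure_id, 0) + 1
            self._balances[treasure_id] = balance
            return balance
```

Because every poll mutates shared state, the endpoint cannot be cached or retried blindly, which is why extra traffic from the load generator hurts more than it would against a read-only route.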

What We Tried First (And Why It Failed)

We first blamed Redis. We upgraded from 7.0.12 to 7.2.3, recompiled with -O3, and set maxmemory-policy to allkeys-lru. The p99 barely moved. Then we blamed the Python async workers, so we doubled the uvicorn workers from 4 to 8 and set --limit-concurrency 1000. CPU climbed to 78 %, but latency stayed flat at 4.1 s. Next we increased the Redis connection pool size from 100 to 500; the error rate dropped to 1.1 %, but p99 remained at 3.9 s. Finally we noticed that the load generator, Locust 2.22.0, was still emitting 96-byte JSON heartbeats every 250 ms per user. At 10 k users that is 4 heartbeats per second each, or an extra 40 k writes per second that the async workers had to deserialize. The worst part? The Locust docs buried the note that this heartbeat is synchronous and shared across all users, so it does not scale with --host or --users. When we finally set heartbeat_interval_ms=2000, p99 dropped to 890 ms and the error rate to 0.2 %.
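The arithmetic behind those numbers is worth writing down. The sketch below assumes the heartbeats are sent per user, as the 40 k figure implies, and uses only values quoted in this post; the function names are made up for illustration.

```python
from typing import Tuple


def heartbeat_load(users: int, interval_ms: int, payload_bytes: int = 96) -> Tuple[float, float]:
    """Return (messages per second, bytes per second) for per-user heartbeats."""
    per_user_hz = 1000 / interval_ms  # heartbeats per second from one user
    messages = users * per_user_hz
    return messages, messages * payload_bytes


def poll_rate(users: int, poll_interval_s: float = 1.2) -> float:
    """Return the /treasure/{id} polls per second generated by the given user count."""
    return users / poll_interval_s


if __name__ == "__main__":
    print(heartbeat_load(10_000, 250))    # (40000.0, 3840000.0)
    print(heartbeat_load(10_000, 2000))   # (5000.0, 480000.0)
    print(round(poll_rate(10_000)))       # 8333
```

At the 250 ms setting the heartbeats outnumber the real polls by almost five to one, so most of what the workers were parsing was never part of the workload we meant to test.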

The Architecture Decision

Instead of patching the load generator, we decided to remove the generator entirely from our critical path. We built a thin Go shim called treasure-proxy that sits between Locust and the Python workers. The shim runs a single event loop, uses fasthttp for parsing, and forwards only non-heartbeat traffic. We dropped uvicorn's built-in --workers flag and ran the Python app in a single process with uvloop, knowing that the Go shim would absorb the spike load. We also replaced the Redis Lua script with a pre-computed rolling window stored in a Redis Stream, which cut the execution time of that step from 1.4 ms to 0.4 ms at the cost of 12 % more memory. The decision was not about raw speed; it was about predictability. A single Python process with one event loop gives us determinism, whereas eight workers fighting over a shared queue gave us tail latencies that varied from 1.2 s to 7.8 s.
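The forwarding rule is the heart of the shim. The real treasure-proxy is written in Go; the sketch below restates the rule in Python only to show the idea. It assumes a heartbeat is a JSON object carrying "type": "heartbeat", which is a hypothetical marker, since this post does not describe how the shim actually recognizes heartbeats.

```python
import json
from typing import Iterable, List


def is_heartbeat(body: bytes) -> bool:
    """Return True if the body looks like a load-generator heartbeat.

    Assumption: heartbeats carry {"type": "heartbeat"}; the field name is hypothetical.
    """
    try:
        payload = json.loads(body)
    except (ValueError, UnicodeDecodeError):
        # Unparseable bodies are not heartbeats; forward them so the app can reject them.
        return False
    return isinstance(payload, dict) and payload.get("type") == "heartbeat"


def forwardable(bodies: Iterable[bytes]) -> List[bytes]:
    """Keep only the request bodies that should reach the Python workers."""
    return [body for body in bodies if not is_heartbeat(body)]
```

Treating malformed bodies as forwardable is a deliberate choice in this sketch: the proxy stays a dumb filter, and validation remains the application's job.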

What The Numbers Said After

With the new setup, p99 latency stayed below 900 ms in steady state at 10 k users, and the error rate never exceeded 0.3 %. The Go shim added less than 50 µs of overhead per request and handled 60 k concurrent connections before hitting file-descriptor limits. The Redis Stream consumed an extra 80 MB of RAM, but our Redis cluster had 32 GB free, so the trade-off was acceptable. We ran the same test for 12 hours overnight; the worst p99 spike was 980 ms, during a rolling failover when one Redis node was marked as failed. That was still within our 1 s target. The biggest surprise was that the single Python worker became our strongest signal: when latency climbed above 1 s, we knew it was either Redis or the mobile clients, never the app server. That clarity alone justified the architectural change.
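For readers who want to apply the same pass/fail check to their own runs, here is a small sketch that computes a nearest-rank percentile and compares a run against the KPI from the top of this post (p99 < 1 s, error rate < 0.5 %). Nearest-rank is one of several percentile definitions, and the function names are illustrative; our load-test tooling may compute p99 differently.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct % of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def within_kpi(latencies_ms: Sequence[float], errors: int, total: int,
               p99_limit_ms: float = 1000.0, error_limit: float = 0.005) -> bool:
    """Check a run against the p99 latency and error-rate limits."""
    return percentile(latencies_ms, 99) < p99_limit_ms and errors / total < error_limit
```

A single 980 ms outlier among a hundred samples does not move the nearest-rank p99, while two 4.2 s samples out of a hundred do, which matches the intuition that p99 reacts to a sustained tail rather than to one bad request.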

What I Would Do Differently

I would never trust a load-test parameter named simulated_user_count without also checking the tool's source code. If the docs had told us that Locust 2.22.0 sends 40 k extra writes per second when users exceed 8 k, we could have fixed it in five minutes instead of six hours. I would also avoid using Redis Lua scripts for anything that touches user-facing latency; the atomicity gain is not worth the 1 ms cost at scale. Finally, I would insist on running a 48-hour soak test with real mobile traffic before green-lighting any change, not just a 30-minute load test.
