We Buried the Treasure Hunt Engine Under Two Million Requests And This Is How It Felt

#ai #programming #machinelearning #webdev

The Problem We Were Actually Solving

The Treasure Hunt Engine wasn't supposed to be a bottleneck. It was three Redis clusters, one Go worker pool, and a tiny Lua script that drew a circle on a map every time someone swiped left. Marketing called it a viral growth engine; finance called it a cost center. My job was to keep the redis_write_throughput metric below 50,000 ops/sec per node so the ops team didn't page me at 3 AM. Simple, right?

What broke that illusion was a Black Friday promo that pushed 2.2 M concurrent WebSocket handshakes. Our chaos tests had never gone past 1.2 M because Veltrix, the configuration layer we wedged between the Go workers and Redis, silently added 15 μs of latency per write. That latency added up to a 4 ms p99 write time at 1 M requests. At 2.1 M requests, the Veltrix sidecar hit a GC pause of 47 ms every ten seconds, and the Redis replication lag became a death spiral. The system didn't run out of memory; it ran out of patience.

What We Tried First (And Why It Failed)

We started with the obvious: bump maxmemory on Redis from 8 GB to 16 GB and cross our fingers. The finger-crossing lasted twelve minutes before the cluster ran out of swap and the Linux OOM killer began executing workers with extreme prejudice.

Next we tried sharding Redis based on user_id modulo 32. We wrote a Lua script to rewrite the key prefix. The script worked locally but panicked in production because the Veltrix layer assumed a single logical partition. When Lua rewrote keys, Veltrix tried to move them atomically across partitions, and the distributed lock timed out at 1.8 M requests, dropping 34 % of writes.

Then we blamed the Go workers and rewrote them to batch writes. That reduced the Redis load by 30 %, but introduced a new failure mode: the batch window of 100 ms meant that left swipes made during that window appeared on the map with a 100 ms to 110 ms delay. Users noticed. Product called it unacceptable latency theater.
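To make the trade-off concrete, here is a minimal sketch of a time-window batcher in Go. It is not our original worker code: the names Batcher, NewBatcher, and defaultWindow are hypothetical, and the flush callback stands in for the real Redis write. The point it illustrates is that a write arriving just after a tick sits in memory until the next one, so the window length turns directly into user-visible delay.

```go
package main

import (
	"fmt"
	"time"
)

// defaultWindow mirrors the 100 ms batch window described in the text.
const defaultWindow = 100 * time.Millisecond

// Batcher collects writes and flushes them once per window.
// Hypothetical type for illustration only.
type Batcher struct {
	in     chan string
	window time.Duration
	flush  func(batch []string) // assumed: sends the batch to Redis
}

func NewBatcher(window time.Duration, flush func([]string)) *Batcher {
	return &Batcher{in: make(chan string, 1024), window: window, flush: flush}
}

// Run drains the input channel until it is closed. Pending writes are
// flushed on every tick and once more when the channel closes.
func (b *Batcher) Run() {
	ticker := time.NewTicker(b.window)
	defer ticker.Stop()
	var batch []string
	for {
		select {
		case w, ok := <-b.in:
			if !ok {
				if len(batch) > 0 {
					b.flush(batch)
				}
				return
			}
			batch = append(batch, w)
		case <-ticker.C:
			if len(batch) > 0 {
				b.flush(batch)
				batch = nil // start a fresh slice; the callback may keep the old one
			}
		}
	}
}

func main() {
	b := NewBatcher(defaultWindow, func(batch []string) {
		fmt.Printf("flushed %d writes\n", len(batch))
	})
	for i := 0; i < 3; i++ {
		b.in <- fmt.Sprintf("swipe:%d", i)
	}
	close(b.in)
	b.Run()
}
```

Shrinking the window reduces the delay but also shrinks the batches, which gives back the Redis load that batching was supposed to save.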

Finally we discovered the real sin: Veltrix was using the redis-cli --pipe command under the hood, but the config layer had set pipeline_queue_size to 10,000. At 2.1 M requests, the Redis side of the pipeline queue grew to 80,000 items, and the OS TCP buffer burst from 4 MB to 64 MB. The 4 MB buffer was the magic number we'd never crossed in staging because staging used synthetic traffic. The 64 MB burst saturated the NIC, and the Redis cluster fell off the network for 8 seconds.

The Architecture Decision

We ripped Veltrix out and replaced it with a single in-process Go library called slice_redis. It does two things:

It keeps a fixed-size ring buffer of 4,096 writes per shard, so the GC pressure is predictable and never spikes above 8 ms.
It exposes a blocking Write() call that returns only after the Redis ACK, eliminating the pipeline_queue_size cliff and giving us a clean latency profile.
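The two properties above can be sketched in a few lines of Go. This is an illustration, not the slice_redis source: ShardWriter, pendingWrite, and the send callback are hypothetical names, and a buffered channel of capacity 4,096 plays the role of the fixed-size ring buffer. The send function is assumed to perform the Redis round trip and return once the ACK arrives.

```go
package main

import (
	"errors"
	"fmt"
)

const ringSize = 4096 // per-shard capacity, taken from the text

// pendingWrite is one queued write plus the channel its caller waits on.
type pendingWrite struct {
	key, value string
	ack        chan error
}

// ShardWriter is a hypothetical stand-in for one shard of the library.
type ShardWriter struct {
	queue chan pendingWrite
	send  func(key, value string) error // assumed: blocks until Redis ACKs
}

func NewShardWriter(send func(key, value string) error) *ShardWriter {
	w := &ShardWriter{queue: make(chan pendingWrite, ringSize), send: send}
	go w.loop()
	return w
}

// loop drains the queue in FIFO order and reports each result to its caller.
func (w *ShardWriter) loop() {
	for p := range w.queue {
		p.ack <- w.send(p.key, p.value)
	}
}

// Write returns only after the backend has acknowledged the write.
// When the queue is full, the enqueue itself blocks, so memory stays bounded
// instead of growing with the backlog.
func (w *ShardWriter) Write(key, value string) error {
	p := pendingWrite{key: key, value: value, ack: make(chan error, 1)}
	w.queue <- p
	return <-p.ack
}

func main() {
	store := map[string]string{} // in-memory fake standing in for Redis
	w := NewShardWriter(func(k, v string) error {
		if k == "" {
			return errors.New("empty key")
		}
		store[k] = v
		return nil
	})
	err := w.Write("user:42", "circle")
	fmt.Println(err, store["user:42"])
	fmt.Println(w.Write("", "x"))
}
```

The design choice worth noticing is backpressure: a full queue slows the callers down rather than letting an unbounded pipeline build up behind them.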

We sharded Redis into 64 logical clusters based on a CRC16 of the user_id, so hot users don't stomp the same node. The slice_redis library keeps one connection per shard and reuses it, so we don't pay connection teardown costs during traffic spikes. The p99 latency dropped from 47 ms to 3 ms at 2.1 M requests, and the GC pause never exceeded 8 ms.
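For readers who want to see the shard mapping, here is a small Go sketch. The text above only says "CRC16", so the variant is an assumption: this uses CRC-16/XMODEM (polynomial 0x1021, initial value 0), the same variant Redis Cluster uses for key hash slots. The function names crc16 and shardFor are hypothetical.

```go
package main

import "fmt"

const numShards = 64 // number of logical clusters, taken from the text

// crc16 computes CRC-16/XMODEM bit by bit (polynomial 0x1021, init 0).
// The choice of variant is an assumption, not confirmed by the text.
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = (crc << 1) ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// shardFor maps a user ID to one of numShards shards.
func shardFor(userID string) int {
	return int(crc16([]byte(userID))) % numShards
}

func main() {
	// "123456789" is the standard check input; CRC-16/XMODEM gives 0x31C3.
	fmt.Printf("crc16=%#04x shard=%d\n", crc16([]byte("123456789")), shardFor("123456789"))
}
```

Because the hash spreads neighbouring IDs across shards, sequential user IDs no longer land next to each other the way they do with a plain modulo.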

The cost was 120 MB of additional RAM per Go worker for the ring buffer and connection pool. We accepted that because it was cheaper than buying another 16 GB Redis cluster that would just hide the real problem.

What The Numbers Said After

After the swap, we ran three tests:

Baseline: 1 M requests, p99 latency 1.8 ms, 0 % errors.
Spike: 2.1 M requests, p99 latency 3.0 ms, 0 % errors, GC pause 6 ms.
Longevity: 24 hours at 1.5 M constant load, memory usage flat, no evictions.

Most importantly, the ops team stopped paging me at 3 AM. The redis_write_throughput metric now stays below 45,000 ops/sec even during Black Friday, and the system scales linearly by adding Go workers instead of Redis nodes. We learned that when a configuration layer promises magical scaling without exposing the knobs you actually need (ring buffer size, pipeline queue depth, GC thresholds), it's not scaling; it's hiding the cliff until you fall off it.

What I Would Do Differently

I would never let a configuration layer abstract away the TCP buffer sizes again. If the layer doesn't export the net.core.rmem_max and net.core.wmem_max values that it is implicitly relying on, assume it's a time bomb.
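If the layer won't tell you, the kernel will. On Linux, sysctl values are exposed as files under /proc/sys, with the dots in the name replaced by slashes. The Go sketch below reads the two limits at startup so they can be logged or exported as metrics; the helper names sysctlPath, parseSysctlInt, and readSysctlInt are hypothetical, and the program simply reports "unavailable" on systems without /proc.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// sysctlPath maps a sysctl name such as "net.core.rmem_max"
// to its file under /proc/sys (Linux only).
func sysctlPath(name string) string {
	return "/proc/sys/" + strings.ReplaceAll(name, ".", "/")
}

// parseSysctlInt parses the content of a single-valued sysctl file.
func parseSysctlInt(raw string) (int64, error) {
	return strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
}

// readSysctlInt reads and parses one integer sysctl value.
func readSysctlInt(name string) (int64, error) {
	raw, err := os.ReadFile(sysctlPath(name))
	if err != nil {
		return 0, err
	}
	return parseSysctlInt(string(raw))
}

func main() {
	for _, name := range []string{"net.core.rmem_max", "net.core.wmem_max"} {
		v, err := readSysctlInt(name)
		if err != nil {
			fmt.Printf("%s: unavailable (%v)\n", name, err)
			continue
		}
		fmt.Printf("%s = %d bytes\n", name, v)
	}
}
```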

Second, I would write a chaos test that reproduces the exact traffic pattern we saw in production, not a constant-rate (homogeneous) Poisson arrival process. Our synthetic traffic generator used a mean of 100,000 requests per second with a standard deviation of 20,000. Black Friday gave us a mean of 100,000 but a standard deviation of 500,000. The difference destroyed the pipeline_queue_size assumption in Veltrix.
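One way to build such a generator is to draw per-second request rates from a heavy-tailed distribution whose mean and standard deviation match the production numbers. The Go sketch below uses a log-normal distribution fitted by moment matching. That choice is an assumption made for illustration: we did not measure the actual shape of the Black Friday distribution, only its mean and spread. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// lognormalParams returns the (mu, sigma) of a log-normal distribution
// whose mean and standard deviation equal the given values.
func lognormalParams(mean, stddev float64) (mu, sigma float64) {
	sigma2 := math.Log(1 + (stddev*stddev)/(mean*mean))
	return math.Log(mean) - sigma2/2, math.Sqrt(sigma2)
}

// impliedMean recovers the mean from (mu, sigma); handy as a sanity check.
func impliedMean(mu, sigma float64) float64 {
	return math.Exp(mu + sigma*sigma/2)
}

// sampleRates draws n per-second request rates from the fitted distribution.
func sampleRates(rng *rand.Rand, mean, stddev float64, n int) []float64 {
	mu, sigma := lognormalParams(mean, stddev)
	rates := make([]float64, n)
	for i := range rates {
		rates[i] = math.Exp(mu + sigma*rng.NormFloat64())
	}
	return rates
}

func main() {
	rng := rand.New(rand.NewSource(1)) // fixed seed for repeatable test runs
	for _, sd := range []float64{20_000, 500_000} {
		rates := sampleRates(rng, 100_000, sd, 10_000)
		peak := 0.0
		for _, r := range rates {
			peak = math.Max(peak, r)
		}
		fmt.Printf("stddev=%.0f peak=%.0f req/s\n", sd, peak)
	}
}
```

Both runs share the same mean, but the peak rates they produce are very different, and the peaks are what a fixed-depth queue has to survive.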

Last, I would never trust a vendor who tells me their layer scales cleanly without giving me the source code for the sidecar. Having the source would have let us spot the 10,000-item pipeline queue buried in the config layer two weeks earlier.