Treasure Hunt Engine: Why One Bad Prometheus Rule Sank the Whole Veltrix Event

#webdev #machinelearning #programming #ai

The Problem We Were Actually Solving

Our real goal wasn't fancy LLM prompts or real-time leaderboards. It was keeping the Rails app under 450 ms p99 during peak load, when every team simultaneously scanned a code, requested a new clue, and tried to outbid the person next door for a limited-time power-up. We load-tested with Locust at 5,000 concurrent users and saw that the slowest endpoint was /next-hint, which queried a pgvector store at 180 ms per query. That left only 270 ms for Rails routing, Redis reads for rate-limiting, and our custom concurrency limiter.

The marketing slide said AI, but the product team really wanted a hint scheduler that wouldn't melt under load. We bolted a 1,553-line llama.cpp wrapper, written by the data science intern, onto the hint endpoint, thinking we could cache all possible answers in a nightly cron job. The wrapper had a known hallucination rate of 3.2% on our own test set, but nobody configured the grammar mask to enforce that answers contain only location names. So when someone asked "Where is the next clue hidden?" the engine happily returned "Under your chair in the Sagrada Familia crypt," even though the venue map had no crypt. One user's screenshot went viral, and suddenly the whole event looked like a scam.
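As an illustration of the missing guardrail, here is a minimal sketch of a post-generation check that rejects any answer that is not a known location name. This is not a llama.cpp grammar mask; it is a simpler allowlist filter swapped in for illustration. The constant KNOWN_LOCATIONS, its entries, and the method validated_location are hypothetical names, not part of the original wrapper.

```ruby
require "set"

# Hypothetical allowlist; in practice it would be loaded from the venue map.
KNOWN_LOCATIONS = Set["main stage", "registration desk", "north stairwell"].freeze

# Returns the normalized answer when it exactly names a known location,
# otherwise nil so the caller can fall back to a safe default hint.
def validated_location(answer, allowlist = KNOWN_LOCATIONS)
  normalized = answer.to_s.strip.downcase
  allowlist.include?(normalized) ? normalized : nil
end
```

With a check like this in front of the response, the crypt answer would have come back as nil instead of reaching a user.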

What We Tried First (And Why It Failed)

The first fix seemed obvious: raise the error budget for the /next-hint endpoint from 10% to 30%, so the auto-scaler would spin up more pods when the vector query lagged. We pushed a Helm chart that updated the HPA target CPU from 70% to 85%, thinking the vector store would catch up. (A higher CPU target makes the HPA scale out later, not sooner, so this change worked against the goal.) Five minutes later Prometheus fired the critical rule we had copied from the Kubernetes docs:

expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.1

The rule used a 5-minute window, but our traffic spike lasted 6 minutes and 12 seconds, barely longer than the window, so rate() could not average out the 5xx errors caused by connection pool exhaustion. The alert threshold should have been >0.05 over a 1-minute window, but the on-call rotation had changed the night before and nobody caught it. We tore down two clusters in Barcelona and Singapore before realizing the rule itself was the failure.
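To make the window-length effect concrete, here is a rough sketch that computes a trailing-window error ratio over per-minute buckets. It is a simplification and not how PromQL's rate() works internally; the traffic numbers are made up, and trailing_error_ratios is a hypothetical helper.

```ruby
# Each bucket is one minute of traffic: [total_requests, error_count].
# Returns the trailing-window error ratio at each minute (nil when no traffic).
def trailing_error_ratios(buckets, window_minutes)
  buckets.each_index.map do |i|
    start = [0, i - window_minutes + 1].max
    slice = buckets[start..i]
    total = slice.sum { |requests, _| requests }
    errors = slice.sum { |_, errs| errs }
    total.zero? ? nil : errors.fdiv(total)
  end
end

# Made-up traffic: 4 healthy minutes, a 2-minute error burst, 4 healthy minutes.
traffic = [[1000, 0]] * 4 + [[1000, 300]] * 2 + [[1000, 0]] * 4

one_minute  = trailing_error_ratios(traffic, 1)
five_minute = trailing_error_ratios(traffic, 5)
```

In this toy data the 1-minute window sees the full burst and clears as soon as it ends, while the 5-minute window reports a diluted ratio that stays elevated for several minutes afterwards.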

We also tried to warm the vector cache by pre-running all possible locations through the llama.cpp wrapper. The cache warmed in 37 minutes, but the pre-run used 18 GiB of RAM and caused OOM kills on the smallest VM class. The data science intern hadn't documented that the model buffers the entire vocabulary in memory, so we spent another two hours rebuilding the Docker image with --gpu-layers 0 to force CPU-only inference, which brought RAM back to 2.1 GiB but increased p99 latency from 450 ms to 680 ms.

The Architecture Decision

We dropped the llama wrapper entirely and replaced it with a Postgres materialized view that joined three tables: hints, venues, and a precomputed adjacency list for geospatial proximity. Every night at 02:00 UTC a job ran REFRESH MATERIALIZED VIEW CONCURRENTLY mv_hints_geo, which took 11 minutes and 3 GiB of temp space. After the refresh, the Rails app simply did:

HintsGeo.find_by(venue_id: current_venue.id, sequence: next_sequence)&.text

Latency dropped to 2–7 ms. We still called it AI in the investor deck, but the real AI was the cron job deciding when to refresh the view.

The second decision was to switch the alerting window from 5 minutes to 1 minute and to group alerts by venue shard. We learned that the 5-minute window hid per-shard spikes, so we built a second Prometheus rule template that generated one alert per shard instead of one giant alert. We also added a secondary check that compared the rate of 5xx errors to the actual request rate; if the error rate was >20% and the request rate was <10,000/min, we suppressed the page because it meant the shard was already dead and the cascade was inevitable.
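The suppression check described above can be sketched as a small predicate. The two thresholds come from the text; the constant names, the method name suppress_page?, and its keyword arguments are hypothetical.

```ruby
# Thresholds taken from the text; all names here are hypothetical.
ERROR_RATIO_LIMIT    = 0.20
MIN_REQUESTS_PER_MIN = 10_000

# Returns true when the page should be suppressed: the error ratio is above
# the limit AND traffic is below the floor, which we treated as a dead shard.
def suppress_page?(error_ratio:, requests_per_min:)
  error_ratio > ERROR_RATIO_LIMIT && requests_per_min < MIN_REQUESTS_PER_MIN
end
```

Both conditions must hold, so a busy shard with a high error ratio still pages, as does a quiet shard with a low one.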

What The Numbers Said After

On event day we hit 6,800 concurrent users and the Rails p99 stayed at 320 ms. The vector store latency stayed flat at 170 ms, but we no longer waited for it under load because 94% of requests hit the materialized view. The Prometheus rule fired exactly twice—both times for a shard that lost network connectivity—and the on-call manually drained traffic after 30 seconds.

The materialized view refresh added 11 minutes of nightly downtime to the staging environment, but we accepted it because the production environment didn't need the wrapper anymore. We logged every hint request to BigQuery, and the Dataflow pipeline that counted hallucinations showed zero hallucinations in the first 10,000 requests after the change. The error rate on /next-hint stayed below 0.4%, and the auto-scaler never triggered above 65% CPU.

What I Would Do Differently

I would not let the data science intern own the production wrapper if it hasn't been through a load test under traffic patterns matching real events. I would also require every alert rule to include a dry-run mode that prints the evaluated values for the last hour before it ever deploys to prod.
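A dry-run mode could look something like the sketch below, which replays a threshold over already-evaluated samples and reports what would have fired without paging anyone. The Sample struct, the dry_run method, and the sample values are all hypothetical; a real version would pull the last hour of values from Prometheus.

```ruby
# Hypothetical dry run: replay a rule's threshold over evaluated samples
# and build a report line per sample, without sending any pages.
Sample = Struct.new(:minutes_ago, :value)

def dry_run(samples, threshold:)
  samples.map do |s|
    state = s.value > threshold ? "FIRING" : "ok"
    format("t-%02dm value=%.3f threshold=%.2f %s", s.minutes_ago, s.value, threshold, state)
  end
end

# Made-up values standing in for the last hour, one sample every 15 minutes.
report = dry_run(
  [Sample.new(45, 0.01), Sample.new(30, 0.08), Sample.new(15, 0.02), Sample.new(0, 0.00)],
  threshold: 0.05
)
puts report
```

Reading a report like this before deploy would show at a glance whether a rule would have paged during a normal hour.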

I would cap the materialized view refresh at 30 minutes and pre-warm the view during the evening low-traffic window instead of trusting the 02:00 job to finish before European morning. If the refresh ever exceeded 35 minutes, we would automatically switch to a cached read from an S3-parquet snapshot generated by DuckDB, trading a few extra milliseconds for 100% uptime.
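The fallback rule could be as simple as the following sketch, which picks a read source from how long the current refresh has been running. The 35-minute threshold is from the text; the method name hint_source and the symbols :materialized_view and :parquet_snapshot are hypothetical, and the actual S3 and DuckDB reads are left out.

```ruby
# Threshold from the text; the method name and symbols are hypothetical.
FALLBACK_AFTER_SECONDS = 35 * 60

# Picks the read source based on how long the current refresh has been running.
# refresh_started_at is nil when no refresh is in progress.
def hint_source(refresh_started_at, now: Time.now)
  return :materialized_view if refresh_started_at.nil?

  elapsed = now - refresh_started_at
  elapsed > FALLBACK_AFTER_SECONDS ? :parquet_snapshot : :materialized_view
end
```

Passing the clock in as a keyword argument keeps the decision easy to test without waiting on a real refresh.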

Finally, I would ban any AI feature that cannot be explained by a single SQL query or a