The Problem We Were Actually Solving
We discovered that the hunt micro-service relied on pure vector search (OpenSearch 2.9 with the k-NN plugin for approximate nearest-neighbor search), indexing every clue as a 768-dimensional embedding. The embeddings came from a fine-tuned all-MiniLM-L6-v2 model fed by a Kafka topic that lagged by 60 seconds during peak load. When 15 000 users simultaneously submitted the same clue string, the vector index returned ten near-duplicate hits. Our scoring function then averaged their coordinates, which pushed the cluster centroid into a parking lot two blocks away. Marketing called it creative misdirection. Operations called it an outage hotline generator.
The real user pain was not latency—it was accuracy. A single false treasure location meant 300 support tickets within ten minutes. Our SLA for coordinate correctness was ±5 meters. We missed it by 800 meters.
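The averaging failure is easy to reproduce on paper. The sketch below uses made-up local coordinates in meters (not our production data) and a hypothetical `centroid` helper to show how a single stray near-duplicate drags the mean far outside a ±5 meter budget.

```python
from statistics import fmean

def centroid(points):
    """Average a list of (x, y) points, as a naive scoring function would."""
    xs, ys = zip(*points)
    return (fmean(xs), fmean(ys))

# Hypothetical hits in local meters: nine on the true spot, one stale duplicate 500 m east.
hits = [(0.0, 0.0)] * 9 + [(500.0, 0.0)]

cx, cy = centroid(hits)
print(cx, cy)  # 50.0 0.0 -> ten times the 5 m budget from one bad neighbor
```

Taking the single best hit, or a median, would be more robust here; the mean gives every near-duplicate an equal vote.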
What We Tried First (And Why It Failed)
First we tuned the HNSW parameters: ef_search from 100 to 500, M from 16 to 32. Recall improved from 0.72 to 0.81, but tail latency jumped from 120 ms to 410 ms, too slow for the event overlay in the mobile app. We then tried quantizing the vectors to int8 to cut memory use, but same-venue recall dipped to 0.64 and coordinate drift increased because quantization error warped the nearest-neighbor geometry.
Then we switched to BM25 on the raw clue strings. Our customized BM25F implementation returned results in 28 ms, but lexical matching could not bridge the semantic gap: users typing franglais phrases like "tuck fridg" were served French-language clues when the actual clue was an English one about a fridge in the lobby. The failure rate spiked to 24 percent during the first live event.
The Architecture Decision
We removed the vector layer from the primary request path and built a two-tier lookup in which it survives only as a fallback.
Tier one is a deterministic hash map: SHA-256 of the normalized clue string → venue zone ID (5 m × 5 m grid cell). The map is sharded across 256 Redis Cluster nodes. Writes go through a sidecar that pre-computes the hash and stores the zone centroid. Latency is 3 ms p99.
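A minimal sketch of the tier-one key path, with a plain dict standing in for the Redis Cluster map. The normalization rules (NFKC, case-folding, whitespace collapsing) and all function names are assumptions for illustration; the post does not spell out what the sidecar actually does.

```python
import hashlib
import unicodedata
from typing import Dict, Optional, Tuple

Zone = Tuple[int, int]  # (column, row) of a grid cell

def normalize_clue(clue: str) -> str:
    # Assumed normalization: Unicode NFKC, case-folding, collapsed whitespace.
    text = unicodedata.normalize("NFKC", clue).casefold()
    return " ".join(text.split())

def clue_key(clue: str) -> str:
    """SHA-256 hex digest of the normalized clue string."""
    return hashlib.sha256(normalize_clue(clue).encode("utf-8")).hexdigest()

def zone_id(x_m: float, y_m: float, cell_m: float = 5.0) -> Zone:
    """Map a position in meters to its 5 m x 5 m grid cell."""
    return (int(x_m // cell_m), int(y_m // cell_m))

def zone_centroid(zone: Zone, cell_m: float = 5.0) -> Tuple[float, float]:
    col, row = zone
    return ((col + 0.5) * cell_m, (row + 0.5) * cell_m)

# Stand-in for the sharded Redis map: hash -> zone ID.
zone_by_hash: Dict[str, Zone] = {}

def register(clue: str, x_m: float, y_m: float) -> None:
    zone_by_hash[clue_key(clue)] = zone_id(x_m, y_m)

def lookup(clue: str) -> Optional[Zone]:
    return zone_by_hash.get(clue_key(clue))

register("Fridge in the lobby", 12.3, 7.9)
print(lookup("  fridge   IN the Lobby "))  # (2, 1)
print(zone_centroid((2, 1)))               # (12.5, 7.5)
```

Because the key is an exact hash, any spelling variant that normalization does not absorb is a miss, which is the case tier two exists for.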
Tier two, only when the clue is ambiguous or missing, falls back to the vector index. We reduced the vector space to 128 dimensions using PCA on the fine-tuned embeddings and switched the backend to Milvus 2.3 with IVF_FLAT and nprobe set to 20. The quantization is fp16, trading 15 percent recall for 40 percent memory reduction and consistent 45 ms p99.
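To make the nprobe trade-off concrete, here is a toy pure-Python inverted-file index. It is an illustration of the IVF idea only, not Milvus code: vectors are bucketed by nearest centroid, and a search scans just the `nprobe` closest buckets. The `ToyIVFFlat` class and the two-dimensional sample data are hypothetical.

```python
import math
from collections import defaultdict

def nearest(query, candidates, k):
    """Indices of the k candidates closest to query (Euclidean distance)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: math.dist(query, candidates[i]))
    return order[:k]

class ToyIVFFlat:
    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = defaultdict(list)  # centroid index -> [(item_id, vector)]

    def add(self, item_id, vector):
        cell = nearest(vector, self.centroids, 1)[0]
        self.lists[cell].append((item_id, vector))

    def search(self, query, k=1, nprobe=1):
        # Scan only the nprobe buckets whose centroids are closest to the query.
        probed = nearest(query, self.centroids, nprobe)
        pool = [entry for cell in probed for entry in self.lists[cell]]
        pool.sort(key=lambda entry: math.dist(query, entry[1]))
        return [item_id for item_id, _ in pool[:k]]

index = ToyIVFFlat(centroids=[(0.0, 0.0), (10.0, 0.0)])
index.add("a", (4.9, 0.0))  # lands in bucket 0
index.add("b", (5.2, 0.0))  # lands in bucket 1

query = (5.01, 0.0)
print(index.search(query, nprobe=1))  # ['b'] (true nearest 'a' sits in the unprobed bucket)
print(index.search(query, nprobe=2))  # ['a']
```

Raising nprobe recovers neighbors that sit just across a bucket boundary, at the cost of scanning more vectors per query.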
We also introduced a static fallback: a hard-coded CSV of the top 2 000 known clues, maintained by event operations. If both tiers miss, we serve the CSV centroid and log the miss for labeling. The fallback covers 78 percent of clue collisions, bringing the overall miss rate below 1 percent.
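The full lookup order can be sketched as a short cascade. The tiers are passed in as plain callables, and the CSV column names (`clue`, `x_m`, `y_m`) are assumptions made for this example rather than the real file layout.

```python
import csv
import io
from collections import Counter

def load_static_fallback(csv_text):
    """Parse the operations CSV into {clue: (x_m, y_m)}. Column names are assumed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["clue"]: (float(row["x_m"]), float(row["y_m"])) for row in reader}

def resolve(clue, hash_tier, vector_tier, static_table, stats):
    """Try tier one, then tier two, then the static table; count what happened."""
    for name, lookup in (("hash", hash_tier), ("vector", vector_tier)):
        coord = lookup(clue)
        if coord is not None:
            stats[name] += 1
            return coord
    stats["tier_miss"] += 1  # both tiers missed: record it for later labeling
    return static_table.get(clue)

static_table = load_static_fallback("clue,x_m,y_m\nfridge in the lobby,12.5,7.5\n")
stats = Counter()

def always_miss(clue):
    return None

print(resolve("fridge in the lobby", always_miss, always_miss, static_table, stats))
# (12.5, 7.5)
print(stats["tier_miss"])  # 1
```

A `None` result from `resolve` means all three layers missed, which is the residual miss rate worth alerting on.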
What The Numbers Said After
In the first major event—an indoor mall with 22 000 concurrent users—the deterministic tier handled 98.4 percent of requests at 3 ms p99. The vector tier was invoked 1.6 percent of the time and added an average of 22 ms latency. Coordinate accuracy stayed within ±4 meters for 99.7 percent of submissions. Support tickets for wrong locations dropped from 300 to 8 in the first hour.
The observability dashboard now tracks four key metrics:
- clue-hash-hit-rate: target > 95 %
- vector-fallback-rate: target < 2 %
- coordinate-accuracy-meters: 95th percentile < 5
- event-end-coordinate-drift: < 10 meters across 99 % of users
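The four targets above translate directly into a threshold check. This is a sketch under assumptions: rates are expressed as fractions, the two distance metrics are the already-aggregated percentile values in meters, and the `breached` helper is hypothetical rather than part of our dashboard tooling.

```python
import operator

# Metric name -> (comparison that must hold, target value).
TARGETS = {
    "clue-hash-hit-rate": (operator.gt, 0.95),
    "vector-fallback-rate": (operator.lt, 0.02),
    "coordinate-accuracy-meters": (operator.lt, 5.0),    # 95th percentile
    "event-end-coordinate-drift": (operator.lt, 10.0),   # 99th percentile
}

def breached(observed):
    """Return the metrics that are missing or outside their target."""
    failing = []
    for name, (meets, target) in TARGETS.items():
        value = observed.get(name)
        if value is None or not meets(value, target):
            failing.append(name)
    return failing

# Hit and fallback rates from the mall event; the two distances are made-up values.
sample = {
    "clue-hash-hit-rate": 0.984,
    "vector-fallback-rate": 0.016,
    "coordinate-accuracy-meters": 4.0,
    "event-end-coordinate-drift": 6.0,
}
print(breached(sample))  # []
```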
We also set up a nightly job that recomputes the PCA vectors from the latest fine-tuned model and updates the Redis map in a rolling restart. The window is 30 minutes, small enough to keep the clue set fresh for the next campaign.
What I Would Do Differently
I would not let the fine-tuning pipeline run on a single GPU without a circuit breaker. Twice during load tests a GPU out-of-memory error killed the embeddings worker for 4 minutes, and the vector tier silently degraded to random sampling. Today we gate new model versions behind a canary that compares coordinate drift before promoting to production. If drift increases by more than 10 percent, the rollback script triggers automatically.
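The canary rule reduces to a relative-increase check. A minimal sketch, assuming drift is summarized as a single number in meters for both the production baseline and the canary; the function name and signature are hypothetical.

```python
def should_rollback(baseline_drift_m, canary_drift_m, max_increase=0.10):
    """True when the canary's drift exceeds the baseline by more than max_increase."""
    if baseline_drift_m <= 0:
        raise ValueError("baseline drift must be positive to compute a relative change")
    increase = (canary_drift_m - baseline_drift_m) / baseline_drift_m
    return increase > max_increase

print(should_rollback(4.0, 4.5))  # True  (12.5 percent worse)
print(should_rollback(4.0, 4.2))  # False (5 percent worse)
```

A relative threshold stays meaningful across venues of different sizes, but it gets noisy when the baseline drift is already close to zero, so an absolute floor may be worth adding.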
Second, I would have insisted on a dual-write path from day one. The original system wrote to the vector index only, so the BM25 fallback was an afterthought. Now we write every clue to both tiers synchronously. The cost is an extra 1 ms per write and a 15 percent increase in Redis memory, but it eliminates the surprise gap where a new clue is missing from one tier during peak traffic.
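A sketch of a synchronous dual write, assuming a dict-like hash store and a vector client exposing an `upsert(key, embedding)` method. Both interfaces are hypothetical stand-ins, and the undo-on-failure step is one possible way to keep the tiers consistent, not a description of our production code.

```python
_MISSING = object()

class InMemoryVectorStore:
    """Hypothetical stand-in for the vector index client."""
    def __init__(self):
        self.vectors = {}

    def upsert(self, key, embedding):
        self.vectors[key] = list(embedding)

class DualWriter:
    def __init__(self, hash_store, vector_store):
        self.hash_store = hash_store
        self.vector_store = vector_store

    def write(self, key, zone, embedding):
        previous = self.hash_store.get(key, _MISSING)
        self.hash_store[key] = zone
        try:
            self.vector_store.upsert(key, embedding)
        except Exception:
            # Undo the tier-one write so a clue never exists in only one tier.
            if previous is _MISSING:
                del self.hash_store[key]
            else:
                self.hash_store[key] = previous
            raise

hash_store = {}
vector_store = InMemoryVectorStore()
writer = DualWriter(hash_store, vector_store)
writer.write("clue-key-1", (2, 1), [0.1, 0.2])
print(hash_store["clue-key-1"], vector_store.vectors["clue-key-1"])
# (2, 1) [0.1, 0.2]
```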
Finally, I would treat the static CSV fallback as first-class, not technical debt. It is the simplest form of retrieval we have, yet it carries the least hallucination risk. Marketing can change the clues daily, but the centroids for the static zones never drift because they are measured with a laser rangefinder and locked at build time.