The Problem We Were Actually Solving
Our job was to handle 1.2 million search queries per second on Black Friday with 99.9th-percentile latency under 100 ms. The marketing department's treasure hunt required users to scan a QR code on a cereal box, type in the printed code, and instantly see a personalized discount page. The discount page featured an animated polar bear juggling moose antlers while three AI models generated upsell offers. I did not build the polar bear, but I did build the system that couldn't answer the most basic question: which cereal box code maps to which user session.
By 2 p.m. on the first day we were already past 200,000 queries per minute. The nearest-neighbor index we shipped with the default HNSW parameters started returning codes from last year's campaign. Our vector similarity scores were 0.97, but the underlying document IDs were wrong. The root cause wasn't the similarity algorithm; it was the ingestion pipeline, which used a random hash shard to route documents to the index. During peak load the shard distribution algorithm produced one giant partition and 63 empty ones. Documents vanished into the empty partitions, and the index only had stale data.
What We Tried First (And Why It Fails)
We started with the Veltrix quick-start guide. It tells you to run the indexer with the default 16 partitions and 4 shards. The docs claim 500k vectors per second ingestion on an m6i.4xlarge. In reality, on our m6i.4xlarge, ingestion throughput collapsed to 20k vectors per second once we pushed past 500k total vectors. The bottleneck wasn't CPU or memory; it was the random hash function rebalancing every 60 seconds. Each rebalance triggered a 5-second index freeze while the coordinator recalculated ownership. On Black Friday that freeze lasted 15 seconds because the coordinator process started thrashing the JVM heap. Users saw a loading spinner while the system rebuilt the partition map. The spinner was served from a static HTML file, but 36% of users closed the page during the spin, which hurt our conversion metric more than the latency spike.
We tried increasing the shard count from 4 to 16. The docs say "More shards equals more throughput." What they don't say is that every shard opens a file handle and every file handle consumes 1 MB of direct memory. At 16 shards we hit the system's 16,384 file descriptor limit and the kernel started killing the indexer process with SIGKILL. We switched to epoll-based file handles and bumped the limit to 65,535, but now the coordinator process spent 40% of its CPU scheduling I/O instead of routing documents.
We tried pre-warming the index with 10 million vectors using the bulk loader. The bulk loader script used the same random hash function. After loading, the shard sizes were 3.2 GB, 2.1 GB, 4.7 GB, and 1.1 GB, with the rest at 0.3 GB. The coordinator spent the next hour rebalancing shards in a death spiral. Users who scanned an old code received a discount for a product that had been discontinued six months ago. The legal team sent an email titled "Unauthorized discounts."
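The imbalance is easier to judge as a single number. Here is a minimal Python sketch that divides the largest shard by the mean shard size. It assumes 16 shards in total and reads "the rest" as 0.3 GB per remaining shard; both are assumptions, and the helper name is hypothetical.

```python
def skew_ratio(shard_sizes_gb):
    """Return the largest shard size divided by the mean size (1.0 means perfectly even)."""
    if not shard_sizes_gb:
        raise ValueError("need at least one shard size")
    mean = sum(shard_sizes_gb) / len(shard_sizes_gb)
    if mean == 0:
        raise ValueError("all shards are empty")
    return max(shard_sizes_gb) / mean


# The four large sizes come from the bulk-load run; the twelve 0.3 GB
# entries are an assumption about what "the rest" means.
sizes_gb = [3.2, 2.1, 4.7, 1.1] + [0.3] * 12
ratio = skew_ratio(sizes_gb)
print(f"largest shard is {ratio:.1f}x the mean")
```

Under those assumptions the largest shard is roughly five times the mean, which is the kind of gap that keeps a rebalancer permanently busy.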
The Architecture Decision
On Tuesday night we ripped out the random hash shard allocator and replaced it with a consistent hash ring using ketama. Instead of recalculating ownership every 60 seconds, the ring recomputes only when we add or remove a physical node. The coordinator now keeps a 2 MB in-memory copy of the ring and serves ownership lookups in 8 microseconds. To avoid the file handle explosion we switched to a single shard per index but increased the shard size limit from 2 GB to 8 GB. We turned off auto-rebalancing during peak hours and scheduled it for 3 a.m. when traffic drops below 50k queries per minute.
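For readers who have not used one, a consistent hash ring can be sketched in a few lines of Python. This is a ketama-style illustration, not the code we ran: real ketama hashes with MD5, while this sketch swaps in BLAKE2b, and the node names and the 160 virtual nodes per physical node are assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Ketama-style consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=160):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> physical node
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _position(key):
        # 64-bit BLAKE2b digest as an unsigned integer; stable across processes.
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add_node(self, node):
        for i in range(self.vnodes):
            pos = self._position(f"{node}#{i}")
            if pos not in self._owner:  # skip the (unlikely) position collision
                self._owner[pos] = node
                bisect.insort(self._points, pos)

    def remove_node(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        if not self._points:
            raise LookupError("ring has no nodes")
        # Owner is the first ring position at or after the key, wrapping around.
        idx = bisect.bisect_left(self._points, self._position(key))
        return self._owner[self._points[idx % len(self._points)]]


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"code-{i}" for i in range(2000)]
before = {k: ring.lookup(k) for k in keys}
ring.add_node("node-e")
moved = sum(1 for k in keys if ring.lookup(k) != before[k])
print(f"{moved} of {len(keys)} keys changed owner")
```

Adding a fifth node should move only about a fifth of the keys, and every moved key lands on the new node. Ownership changes only when the set of nodes changes, which is the property that made the 60-second recalculation unnecessary.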
We also stopped using the Veltrix bulk loader for ingestion spikes. Instead we built a two-stage pipeline. The first stage writes raw JSON blobs to S3 in 64 MB chunks. The second stage runs a Flink job that reads from S3, extracts the cereal code and user ID, maps the code to a 64-bit hash, and inserts into the vector index with a deterministic partition key. The partition key maps the 64-bit integer to a shard using a simple modulo: shard = hash % shardCount. The modulo is faster than the ketama ring for point lookups, and at 1.2 million QPS we needed nanoseconds, not microseconds.
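The partition-key step can be sketched as follows. This is a Python illustration rather than the Flink job itself; the function names, the choice of BLAKE2b, and the normalization of the code string are assumptions made for the example.

```python
import hashlib


def code_hash64(code):
    """Map a cereal code to a stable unsigned 64-bit integer."""
    # Normalizing case and whitespace is an assumption about how codes are typed in.
    normalized = code.strip().upper().encode("utf-8")
    # Python's built-in hash() is salted per process, so use a real digest instead.
    return int.from_bytes(hashlib.blake2b(normalized, digest_size=8).digest(), "big")


def shard_for(code, shard_count):
    """Deterministic partition key: shard = hash % shardCount."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return code_hash64(code) % shard_count


print(shard_for("ABC123", 16) == shard_for(" abc123 ", 16))
```

The same code always lands on the same shard, on any machine and in any process. The trade-off is that plain modulo remaps most keys whenever shardCount changes, so it only suits a shard count that stays fixed between rebuilds.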
We set the HNSW parameters to M=32, efConstruction=500, and efSearch=100. The default was M=16, which produced too many candidate lists and increased query latency from 45 ms to 180 ms when the index reached 5 million vectors. At M=32 the RAM footprint grew from 3.8 GB to 6.2 GB per index replica, but we gained 10 ms back in latency because the candidate list size shrank. We also set the max level to 16 instead of the default 8, which decreased the index build time from 45 minutes to 22 minutes for 10 million vectors. Build time mattered because we rebuilt the index every six hours to include new cereal codes.
What The Numbers Said After
After the change, ingestion throughput stabilized at 450k vectors per second on the same m6i.4xlarge instance. The coordinator CPU dropped from 75% to 22%, and the 15-second freeze disappeared. Query latency at 99.9th percentile stayed at 88 ms, with 0 spikes over 300 ms. The polar bear animation continued to rotate, which marketing insisted was critical to user engagement.
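One way to check a run against the original target (99.9% of queries under 100 ms) is to count the slow requests directly. The Python sketch below is illustrative: the function name is hypothetical and the sample latencies are synthetic, not our production data.

```python
def meets_latency_slo(latencies_ms, threshold_ms=100.0, allowed_slow_per_thousand=1):
    """Return True if at most N per thousand requests took threshold_ms or longer."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    slow = sum(1 for value in latencies_ms if value >= threshold_ms)
    # Integer comparison avoids floating-point rounding at the 99.9% boundary.
    return slow * 1000 <= allowed_slow_per_thousand * len(latencies_ms)


# Synthetic samples: 10,000 requests, one of them over the threshold.
samples = [45.0] * 9990 + [88.0] * 9 + [140.0]
print(meets_latency_slo(samples))
```

Counting violations rather than interpolating a percentile keeps the check exact, and it makes the error budget explicit: one slow request per thousand.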
We measured the hallucination rate of the cereal code mapping at 0.002%. Before the fix, when the index froze and returned stale data, the hallucination rate peaked at 18%. The legal team stopped sending angry emails.
The vector similarity score dropped from 0.97 to 0.92 because we now included only active cereal codes. The drop was intentional: we removed 2