Rhea Kapoor

Why Our Vector Search Broke at 2M Queries/Day—And What Fixed It

My Testing Ground

Last year, I built a job-matching prototype handling 10K queries daily. But when usage exploded to 2 million daily interactions, latency spiked to 500ms and timeouts crippled the user experience. Like Jobright’s team, I discovered that keyword-based systems collapse under three real-world demands:

  • Dynamic data: 400K job-posting changes per day (inserts/deletes)
  • Hybrid queries: Combining semantic vectors (job descriptions) with structured filters (location, salary, visa status)
  • Concurrency: 50+ simultaneous searches during traffic spikes

Here’s how I benchmarked solutions—and what actually worked.


1. Why Traditional Databases Fail

I first tried extending PostgreSQL with pgvector. At 10K vectors, responses held steady around 50ms. Past 1M vectors, performance fell apart. Here’s the query shape I was running:

```sql
SELECT * FROM jobs
WHERE location = 'San Francisco' AND visa_sponsor = true
ORDER BY embedding <=> '[0.2, 0.7, ...]'
LIMIT 10;
```

Results at 5M vectors:

  • Latency: 220ms (P95)
  • Writes blocked reads during data ingestion
  • Filtered searches timed out 12% of the time

Failure Analysis:

B-tree indexes optimize for structured filters but degrade when combined with vector similarity scans; concurrent writes exacerbate lock contention.


2. Vector DB Showdown: My Hands-On Tests

I evaluated four architectures using a 10M-vector job dataset (768-dim embeddings). Workload: 1,000 QPS with 30% writes. (My collection setup is sketched after the table.)

| System | Avg. Latency | Filter Accuracy | Ops Overhead |
| --- | --- | --- | --- |
| FAISS (GPU) | 38ms | None¹ | Rebuild index hourly |
| Pinecone | 82ms | 89% | Managed |
| Milvus (open source) | 45ms | 92% | Kubernetes tuning |
| Zilliz Cloud | 49ms | 98% | Zero administration |

¹ FAISS couldn’t combine vector search with filters.
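For context, here’s roughly how I stood up the Milvus collection for these tests. This is a minimal sketch, not the exact benchmark code: the field names, index type, and index parameters are my assumptions.

```python
from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    connections,
)

connections.connect(host="localhost", port="19530")

# Schema mirroring the benchmark: 768-dim embeddings plus the
# structured fields used for hybrid filtering.
schema = CollectionSchema([
    FieldSchema("job_id", DataType.INT64, is_primary=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=768),
    FieldSchema("location", DataType.VARCHAR, max_length=64),
    FieldSchema("visa_sponsor", DataType.BOOL),
])
collection = Collection("jobs", schema)

# HNSW with inner-product distance; M and efConstruction are illustrative.
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "IP",
        "params": {"M": 16, "efConstruction": 200},
    },
)
collection.load()
```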

Key Failures Observed:

  • FAISS: Crashed during bulk deletes. Required daily full-index rebuilds.
  • Pinecone: 120ms+ latency for Asian users (US-only endpoints).
  • Milvus: Spent 3 hours/week tuning Kubernetes pods for memory spikes.
```python
# Hybrid search snippet I used (pymilvus). anns_field and param are
# required by the client; the nprobe value here is illustrative.
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 16}},
    limit=10,
    expr="visa_sponsor == true and location == 'CA'",
    consistency_level="Session",
)
```

3. Consistency Levels: When to Use Which

Most teams overlook consistency—until users see stale job posts. I tested three modes:

| Level | Use Case | Risk |
| --- | --- | --- |
| Strong | Critical writes (e.g., job removal) | 30% slower queries |
| Session | User-facing searches | Stale data if the same session isn’t reused |
| Bounded | Analytics/trends | Up to ~5s of stale data |

Real Bug I Caused:

Using Bounded consistency for job matching caused a deleted role to appear for 4 seconds—triggering user complaints. Fixed by switching to Session.
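The fix, sketched against the same hypothetical search call as above; the only change that mattered was the consistency level:

```python
# Session consistency: a session observes its own writes, so the deleted
# role stopped surfacing immediately in our setup. "Bounded" had allowed
# a few seconds of staleness.
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 16}},
    limit=10,
    expr="visa_sponsor == true and location == 'CA'",
    consistency_level="Session",  # was "Bounded"
)
```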


4. Deployment Tradeoffs: What No One Tells You

I deployed two architectures:

A. Monolithic Cluster

  • Pros: Single endpoint
  • Cons: Query contention; scaling events reset client connections

B. Tiered Sharding (Jobright’s Approach)

Separate clusters for:

  • Core job matching
  • Referral discovery (graph + vectors)
  • Company culture search

Result: 50ms latency at 2K QPS, zero resource contention. The client-side routing is sketched below.
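In client code, the tiering is just separate connections and collections. A minimal sketch, assuming pymilvus connection aliases; the hostnames and collection names are made up:

```python
from pymilvus import Collection, connections

# One connection alias per dedicated cluster.
connections.connect(alias="matching", host="milvus-matching.internal", port="19530")
connections.connect(alias="referrals", host="milvus-referrals.internal", port="19530")
connections.connect(alias="culture", host="milvus-culture.internal", port="19530")

# Each workload gets its own collection on its own cluster, so heavy
# referral-graph traffic can't starve core job matching.
job_matching = Collection("jobs", using="matching")
referrals = Collection("referral_candidates", using="referrals")
culture = Collection("company_culture", using="culture")
```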

Data Ingestion Tip:

Using bulk-insert with 10K vectors/batch reduced write latency by 65% vs. real-time streaming.
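A minimal batching sketch, assuming column-ordered data matching the schema from section 2; the function and variable names are placeholders:

```python
# Batch ingestion: insert 10K vectors per call instead of streaming rows.
BATCH_SIZE = 10_000

def bulk_insert(collection, job_ids, embeddings, locations, visa_flags):
    for start in range(0, len(job_ids), BATCH_SIZE):
        end = start + BATCH_SIZE
        collection.insert([
            job_ids[start:end],
            embeddings[start:end],
            locations[start:end],
            visa_flags[start:end],
        ])
    collection.flush()  # seal segments so the new batches become searchable
```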


5. Why "Zero Ops" Matters More Than Benchmarks

After 6 months with Zilliz Cloud:

  • Zero infrastructure alerts
  • 12+ feature deployments (e.g., real-time salary filters)
  • Cost: $0.0003/query at 2M queries/day (≈$600/day)

Compare this to my Milvus open-source setup:

  • Weekly ops tasks: Index tuning, node rebalancing, version upgrades
  • 3.4 hrs/week engineer overhead → $50K/year hidden cost

My Toolkit Today:

  1. Embedding models: all-MiniLM-L6-v2 for job descriptions (~85% accuracy); encoding sketch below
  2. Vector DB: Managed service for core product (Zilliz/Pinecone)
  3. Self-hosted: Only for non-critical workloads (e.g., internal analytics)
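A minimal encoding sketch with sentence-transformers, using the model from item 1; the sample text is made up:

```python
from sentence_transformers import SentenceTransformer

# Encode job descriptions into dense vectors for the vector DB.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(
    ["Senior backend engineer, San Francisco, visa sponsorship available"],
    normalize_embeddings=True,  # unit vectors, so inner product == cosine
)
```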

Next Experiment:

Testing reranking models (e.g., BAAI/bge-reranker-large) atop vector results to boost match precision. Will share results in a follow-up.
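Here’s the rough shape of that experiment. It’s a sketch under my assumptions: I’m using sentence-transformers’ CrossEncoder to run the reranker, and all variable names are placeholders.

```python
from sentence_transformers import CrossEncoder

# Rerank the top-k vector hits with a cross-encoder: score each
# (query, candidate) pair jointly, then re-sort by that score.
reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:top_n]]
```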

Lesson Learned:

Infrastructure isn’t just about scale. It’s what lets you ship features while sleeping through the night.

Got a vector DB horror story? I’ll benchmark your workload—reach out.
