Rhea Kapoor

Why Our Vector Search Broke at 2M Queries/Day—And What Fixed It

My Testing Ground

Last year, I built a job-matching prototype handling 10K queries daily. But when usage exploded to 2 million daily interactions, latency spiked to 500ms and timeouts crippled the user experience. Like Jobright’s team, I discovered that keyword-based systems collapse under three real-world demands:

  • Dynamic data: 400K job-posting changes per day (inserts/deletes)
  • Hybrid queries: Combining semantic vectors (job descriptions) with structured filters (location, salary, visa status)
  • Concurrency: 50+ simultaneous searches during traffic spikes

Here’s how I benchmarked solutions—and what actually worked.


1. Why Traditional Databases Fail

I first tried extending PostgreSQL with pgvector. At 10K vectors, responses held steady around 50ms. Past 1M vectors, performance fell apart. Here’s the query shape I was running:

```sql
SELECT * FROM jobs
WHERE location = 'San Francisco' AND visa_sponsor = true
ORDER BY embedding <=> '[0.2, 0.7, ...]'
LIMIT 10;
```

Results at 5M vectors:

  • Latency: 220ms (P95)
  • Writes blocked reads during data ingestion
  • Filtered searches timed out 12% of the time

Failure Analysis:

B-tree indexes optimize for structured filters but degrade when combined with vector similarity scans; concurrent writes exacerbate lock contention.


2. Vector DB Showdown: My Hands-On Tests

I evaluated four architectures using a 10M-vector job dataset (768-dim embeddings). Workload: 1,000 QPS with 30% writes. (My collection setup is sketched after the table.)

| System | Avg. Latency | Filter Accuracy | Ops Overhead |
| --- | --- | --- | --- |
| FAISS (GPU) | 38ms | None¹ | Rebuild index hourly |
| Pinecone | 82ms | 89% | Managed |
| Milvus (open source) | 45ms | 92% | Kubernetes tuning |
| Zilliz Cloud | 49ms | 98% | Zero administration |

¹ FAISS couldn’t combine vector search with filters.
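For context, here’s roughly how I stood up the Milvus collection for these tests. This is a minimal sketch, not the exact benchmark code: the field names, index type, and index parameters are my assumptions.

```python
from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    connections,
)

connections.connect(host="localhost", port="19530")

# Schema mirroring the benchmark: 768-dim embeddings plus the
# structured fields used for hybrid filtering.
schema = CollectionSchema([
    FieldSchema("job_id", DataType.INT64, is_primary=True),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=768),
    FieldSchema("location", DataType.VARCHAR, max_length=64),
    FieldSchema("visa_sponsor", DataType.BOOL),
])
collection = Collection("jobs", schema)

# HNSW with inner-product distance; M and efConstruction are illustrative.
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "HNSW",
        "metric_type": "IP",
        "params": {"M": 16, "efConstruction": 200},
    },
)
collection.load()
```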

Key Failures Observed:

  • FAISS: Crashed during bulk deletes. Required daily full-index rebuilds.
  • Pinecone: 120ms+ latency for Asian users (US-only endpoints).
  • Milvus: Spent 3 hours/week tuning Kubernetes pods for memory spikes.
```python
# Hybrid search snippet I used (pymilvus). anns_field and param are
# required by the client; the nprobe value here is illustrative.
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 16}},
    limit=10,
    expr="visa_sponsor == true and location == 'CA'",
    consistency_level="Session",
)
```

3. Consistency Levels: When to Use Which

Most teams overlook consistency—until users see stale job posts. I tested three modes:

| Level | Use Case | Risk |
| --- | --- | --- |
| Strong | Critical writes (e.g., job removal) | 30% slower queries |
| Session | User-facing searches | Stale data if the same session isn’t reused |
| Bounded | Analytics/trends | Up to ~5s of stale data |

Real Bug I Caused:

Using Bounded consistency for job matching caused a deleted role to appear for 4 seconds—triggering user complaints. Fixed by switching to Session.
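The fix, sketched against the same hypothetical search call as above; the only change that mattered was the consistency level:

```python
# Session consistency: a session observes its own writes, so the deleted
# role stopped surfacing immediately in our setup. "Bounded" had allowed
# a few seconds of staleness.
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 16}},
    limit=10,
    expr="visa_sponsor == true and location == 'CA'",
    consistency_level="Session",  # was "Bounded"
)
```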


4. Deployment Tradeoffs: What No One Tells You

I deployed two architectures:

A. Monolithic Cluster

  • Pros: Single endpoint
  • Cons: Query contention; scaling events reset client connections

B. Tiered Sharding (Jobright’s Approach)

Separate clusters for:

  • Core job matching
  • Referral discovery (graph + vectors)
  • Company culture search

Result: 50ms latency at 2K QPS, zero resource contention. The client-side routing is sketched below.
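In client code, the tiering is just separate connections and collections. A minimal sketch, assuming pymilvus connection aliases; the hostnames and collection names are made up:

```python
from pymilvus import Collection, connections

# One connection alias per dedicated cluster.
connections.connect(alias="matching", host="milvus-matching.internal", port="19530")
connections.connect(alias="referrals", host="milvus-referrals.internal", port="19530")
connections.connect(alias="culture", host="milvus-culture.internal", port="19530")

# Each workload gets its own collection on its own cluster, so heavy
# referral-graph traffic can't starve core job matching.
job_matching = Collection("jobs", using="matching")
referrals = Collection("referral_candidates", using="referrals")
culture = Collection("company_culture", using="culture")
```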

Data Ingestion Tip:

Using bulk-insert with 10K vectors/batch reduced write latency by 65% vs. real-time streaming.
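A minimal batching sketch, assuming column-ordered data matching the schema from section 2; the function and variable names are placeholders:

```python
# Batch ingestion: insert 10K vectors per call instead of streaming rows.
BATCH_SIZE = 10_000

def bulk_insert(collection, job_ids, embeddings, locations, visa_flags):
    for start in range(0, len(job_ids), BATCH_SIZE):
        end = start + BATCH_SIZE
        collection.insert([
            job_ids[start:end],
            embeddings[start:end],
            locations[start:end],
            visa_flags[start:end],
        ])
    collection.flush()  # seal segments so the new batches become searchable
```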


5. Why "Zero Ops" Matters More Than Benchmarks

After 6 months with Zilliz Cloud:

  • Zero infrastructure alerts
  • 12+ feature deployments (e.g., real-time salary filters)
  • Cost: $0.0003/query at 2M queries/day (≈$600/day)

Compare this to my Milvus open-source setup:

  • Weekly ops tasks: Index tuning, node rebalancing, version upgrades
  • 3.4 hrs/week engineer overhead → $50K/year hidden cost

My Toolkit Today:

  1. Embedding models: all-MiniLM-L6-v2 for job descriptions (~85% accuracy); encoding sketch below
  2. Vector DB: Managed service for core product (Zilliz/Pinecone)
  3. Self-hosted: Only for non-critical workloads (e.g., internal analytics)
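A minimal encoding sketch with sentence-transformers, using the model from item 1; the sample text is made up:

```python
from sentence_transformers import SentenceTransformer

# Encode job descriptions into dense vectors for the vector DB.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(
    ["Senior backend engineer, San Francisco, visa sponsorship available"],
    normalize_embeddings=True,  # unit vectors, so inner product == cosine
)
```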

Next Experiment:

Testing reranking models (e.g., BAAI/bge-reranker-large) atop vector results to boost match precision. Will share results in a follow-up.
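Here’s the rough shape of that experiment. It’s a sketch under my assumptions: I’m using sentence-transformers’ CrossEncoder to run the reranker, and all variable names are placeholders.

```python
from sentence_transformers import CrossEncoder

# Rerank the top-k vector hits with a cross-encoder: score each
# (query, candidate) pair jointly, then re-sort by that score.
reranker = CrossEncoder("BAAI/bge-reranker-large")

def rerank(query: str, candidates: list[str], top_n: int = 10) -> list[str]:
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:top_n]]
```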

Lesson Learned:

Infrastructure isn’t just about scale. It’s what lets you ship features while sleeping through the night.

Got a vector DB horror story? I’ll benchmark your workload—reach out.
