Marcus Feldman

What I Learned About Vector Databases When Production Demands Bite

It started simply enough: we needed semantic search for our document processing pipeline. Like many teams, I assumed any open-source vector database could handle it. What followed was six months of tuning, benchmarking, and re-architecting as we hit scale. Here’s what matters when theory meets reality.

1. Libraries vs. Systems: The First Crossroads

When prototyping our RAG pipeline, I instinctively reached for Faiss. Its ANN benchmarks were stellar. But the moment we needed:

  • Real-time updates
  • Filtering by metadata (“only search legal documents from 2023”)
  • Concurrent writes

Faiss hit limits. Why? Because it’s fundamentally a library, not a persistent system.

What worked:

# Faiss for static datasets
import faiss

index = faiss.IndexHNSWFlat(768, 32)   # 768-dim vectors, HNSW with 32 links per node
index.add(training_vectors)            # float32 array of shape (n, 768)
distances, ids = index.search(query_vector, k=10)  # query_vector shape: (1, 768)

What failed:

  • No native persistence (had to serialize/deserialize entire index)
  • Filtering required post-search scans, killing latency
  • Rebuilding indexes for new data took 3+ hours at 5M vectors

This is when I realized: approximate search algorithms ≠ production-grade vector database.
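For context, here is roughly what those workarounds looked like (a sketch, not our exact code; doc_metadata is an assumed side lookup we maintained ourselves): "persistence" means dumping and reloading the whole index by hand, and "filtering" means over-fetching and scanning results after the search.

import faiss

# "Persistence": serialize and reload the entire index manually
faiss.write_index(index, "docs.index")
index = faiss.read_index("docs.index")

# "Filtering": over-fetch, then scan hits against a separate metadata store
distances, ids = index.search(query_vector, 200)  # k >> 10 to survive filtering
hits = [
    (i, d)
    for i, d in zip(ids[0], distances[0])
    if doc_metadata[i]["document_type"] == "contract"
    and doc_metadata[i]["year"] == 2023
][:10]

Every filtered query pays for 200 candidates to return 10, and every restart pays for a full index reload.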

2. Filtering Isn’t a Feature – It’s an Architecture Choice

Initial tests with 10k vectors? Qdrant’s payload filters felt magical:

client.search(  
    collection_name="docs",  
    query_vector=query_embedding,  
    query_filter={  
        "must": [{"key": "document_type", "match": {"value": "contract"}}]  
    }  
)  
Enter fullscreen mode Exit fullscreen mode

At 10M vectors, the same filter increased latency from 15ms to 210ms. Why?

  • Pre-filtering (Weaviate/Qdrant): Applies filters before vector search. Low latency for selective filters but dangerous on high-cardinality fields (e.g., user_id).
  • Post-filtering (Early Milvus): Searches first, then applies filters. Predictable vector search time but risks empty results if filters are restrictive.
  • Hybrid (Modern Milvus/Pinecone): Dynamically switches strategies. Requires optimizer statistics, which themselves cost CPU to maintain.

Lesson learned: Test filtering under your *actual data distribution*, not synthetic datasets.
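One rough way to do that (a sketch assuming qdrant-client against a populated docs collection; sample_queries is a placeholder for vectors replayed from your own production traffic):

import time
import numpy as np
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

def p99_latency_ms(queries, flt=None, k=10):
    """Replay queries and return the 99th-percentile latency in milliseconds."""
    timings = []
    for q in queries:
        start = time.perf_counter()
        client.search(
            collection_name="docs",
            query_vector=q,
            query_filter=flt,
            limit=k,
        )
        timings.append((time.perf_counter() - start) * 1000)
    return float(np.percentile(timings, 99))

contract_filter = models.Filter(
    must=[models.FieldCondition(key="document_type", match=models.MatchValue(value="contract"))]
)

# sample_queries: real query embeddings captured from production logs
print("unfiltered p99:", p99_latency_ms(sample_queries))
print("filtered p99:  ", p99_latency_ms(sample_queries, flt=contract_filter))

The gap between those two numbers, measured on your real payload values, tells you far more than any published benchmark.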

3. Consistency Models: When “Good Enough” Isn’t

We almost shipped Weaviate until a critical bug surfaced: search results showed stale versions of documents updated seconds ago. Why? We’d chosen eventual consistency for throughput.

Different engines define consistency differently:

| Engine | Write Visibility | Best For | Risk |
|--------|------------------|----------|------|
| Annoy | Never (read-only) | Static datasets | Data reindexing nightmares |
| Qdrant | Immediate (per shard) | Medium-scale dynamic data | Staleness during rebalancing |
| Milvus | Session (guaranteed) | High-change environments | Higher write latency (~8-15ms) |

The fix? Switched to session consistency in Milvus:

# Session consistency is set on the collection and enforced at read time
client.create_collection(
    collection_name="docs", schema=docs_schema, consistency_level="Session"
)
client.insert(
    collection_name="docs",
    data=[{"id": "doc1", "vector": v, "version": "2025-04-01"}],
)

Added 12ms to writes but eliminated customer complaints about missing updates.
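To make sure the fix actually held, we kept a read-your-writes smoke test along these lines (a minimal sketch assuming pymilvus’s MilvusClient, the docs collection above, and a stand-in embedding; hit access follows MilvusClient’s dict-style results):

import numpy as np
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local deployment

v = np.random.rand(768).tolist()  # stand-in embedding, just for the test
client.insert(
    collection_name="docs",
    data=[{"id": "doc1", "vector": v, "version": "2025-04-01"}],
)

# With Session consistency, a search from the same client must see its own write
hits = client.search(
    collection_name="docs", data=[v], limit=1, output_fields=["version"]
)
assert hits[0][0]["entity"]["version"] == "2025-04-01"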

4. The Scalability Trap

Faiss with GPU acceleration handled 50 QPS with P99 latency under 100ms. At 500 QPS? P99 spiked to 1.2s. GPUs aren’t magic – they parallelize batch operations, not concurrent requests.
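Here’s what that distinction looks like in practice (an illustrative sketch with random data; assumes the faiss-gpu build): throughput comes from batching query vectors into one search call, not from firing single-vector searches concurrently.

import numpy as np
import faiss

d = 768
index_cpu = faiss.IndexFlatIP(d)
index_cpu.add(np.random.rand(100_000, d).astype("float32"))
index = faiss.index_cpu_to_all_gpus(index_cpu)  # requires the faiss-gpu build

queries = np.random.rand(512, d).astype("float32")

# Good: one batched call; the GPU parallelizes across all 512 queries
D, I = index.search(queries, 10)

# Bad: 512 separate calls (what concurrent API requests effectively become);
# each call under-utilizes the GPU and they serialize on the index
for q in queries:
    index.search(q.reshape(1, -1), 10)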

Scaling options we tested:

  • Vertical Scaling (Faiss): 8x GPU → 4x cost for 2x QPS. Diminishing returns.
  • Sharding (Milvus/Qdrant): Split data by tenant_id. Linear scaling but requires shard-aware queries.
  • Replicas (Weaviate): Read-only copies. Simple but doubles storage costs.

Shard-per-tenant reduced P99 latency by 67% but required application logic:

# Route the query to the tenant-specific shard (exact kwarg depends on the engine)
shard_key = tenant_hash % num_shards
client.search(
    collection_name="docs", query_vector=query_embedding, shard_key=shard_key
)

5. Hidden Deployment Tax

Vespa’s ranking performed brilliantly. Then I tried upgrading:

  • 3 hours to migrate schema across 5 nodes
  • Downtime during index rebalancing
  • YAML configs spanning 800+ lines

Operational burden comparison for 5-node clusters:

| Engine | Config Complexity | Rolling Upgrades | Failure Recovery |
|--------|-------------------|------------------|------------------|
| Vespa | High | Manual | Slow (minutes) |
| Qdrant | Medium | Semi-automatic | Fast (<10s) |
| Milvus | Low | Automatic | Fast (<5s) |

We learned: Throughput benchmarks ignore operational overhead at 3 AM.

Where We Landed

After 23 performance tests and 3 infrastructure migrations, we chose sharded Milvus because:

  • Session consistency matched our “no stale reads” requirement
  • Kubernetes operator recovered from node failures without manual intervention
  • Hybrid filtering behaved predictably at 50M+ vectors

But I’m not evangelical about it. Qdrant could win for simpler schemas; Vespa for complex ranking.

What’s Next?

Two unresolved challenges:

  1. Cold Start Penalty: Loading 1B+ vector indexes still takes 8+ minutes. Testing memory-mapped indexes in Annoy 2.0 (rough sketch after this list).
  2. Multi-modal Workloads: Can one engine handle text + image + structured vectors? Evaluating Chroma’s new multi-embedding API.
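For the cold-start question, the pattern we’re evaluating looks roughly like this (a sketch using Annoy’s current Python API; vectors, query_embedding, and the index path are placeholders): build once offline, then mmap the file at startup instead of reading it into RAM.

from annoy import AnnoyIndex

DIM = 768  # placeholder embedding size

# Offline: build and persist once
builder = AnnoyIndex(DIM, "angular")
for i, vec in enumerate(vectors):  # `vectors` is a placeholder iterable
    builder.add_item(i, vec)
builder.build(50)  # 50 trees; tune recall vs. build time
builder.save("docs.ann")

# At startup: load() memory-maps the file, so the process starts serving
# almost immediately and pages are faulted in on demand
index = AnnoyIndex(DIM, "angular")
index.load("docs.ann")
ids = index.get_nns_by_vector(query_embedding, 10)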

Vector databases are still evolving rapidly. Test against your workloads, not marketing claims. Start simple – but expect to revisit decisions at 10x scale.
