It started simply enough: we needed semantic search for our document processing pipeline. Like many teams, I assumed any open-source vector database could handle it. What followed was six months of tuning, benchmarking, and re-architecting as we hit scale. Here’s what matters when theory meets reality.
1. Libraries vs. Systems: The First Crossroads
When prototyping our RAG pipeline, I instinctively reached for Faiss. Its ANN benchmarks were stellar. But the moment we needed:
- Real-time updates
- Filtering by metadata (“only search legal documents from 2023”)
- Concurrent writes

Faiss hit its limits. Why? Because it’s fundamentally a library, not a persistent system.
What worked:
```python
# Faiss for static datasets
import faiss
import numpy as np

index = faiss.IndexHNSWFlat(768, 32)            # 768-dim vectors, M=32 graph links
index.add(training_vectors.astype(np.float32))  # expects an (n, 768) float32 matrix
distances, ids = index.search(query_vector.reshape(1, -1), 10)  # top-10 neighbors
```
What failed:
- No native persistence (we had to serialize/deserialize the entire index; see the sketch after this list)
- Filtering required post-search scans, killing latency
- Rebuilding indexes for new data took 3+ hours at 5M vectors
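Our stopgap for persistence was Faiss’s whole-index serialization, which is exactly the pain point above (the path is illustrative):

```python
# Snapshot and restore: every save rewrites the whole file, and every
# restart reloads the entire index into RAM
faiss.write_index(index, "docs.index")
index = faiss.read_index("docs.index")
```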
This is when I realized: approximate search algorithms ≠ production-grade vector database.
2. Filtering Isn’t a Feature – It’s an Architecture Choice
Initial tests with 10k vectors? Qdrant’s payload filters felt magical:
```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
results = client.search(
    collection_name="docs",
    query_vector=query_embedding,
    query_filter=models.Filter(must=[
        models.FieldCondition(key="document_type",
                              match=models.MatchValue(value="contract"))
    ]),
)
```
At 10M vectors, the same filter increased latency from 15ms to 210ms. Why?
- **Pre-filtering (Weaviate/Qdrant):** applies filters before the vector search. Low latency for selective filters, but dangerous on high-cardinality fields (e.g., `user_id`).
- **Post-filtering (early Milvus):** searches first, then applies filters. Predictable vector-search time, but risks empty results if filters are restrictive.
- **Hybrid (modern Milvus/Pinecone):** dynamically switches strategies. Requires optimizer statistics – which cost CPU.
Lesson learned: Test filtering under your *actual data distribution*, not synthetic datasets.
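A rough harness for that, assuming `real_filters` holds filter payloads sampled from production query logs and `random_query()` returns a real query embedding (both are stand-ins, not library APIs):

```python
import random
import statistics
import time

latencies = []
for f in random.sample(real_filters, 1000):
    t0 = time.perf_counter()
    client.search(collection_name="docs", query_vector=random_query(), query_filter=f)
    latencies.append((time.perf_counter() - t0) * 1000)  # ms

p99 = statistics.quantiles(latencies, n=100)[98]  # 99th percentile
print(f"P99 filtered search: {p99:.1f} ms")
```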
3. Consistency Models: When “Good Enough” Isn’t
We almost shipped Weaviate until a critical bug surfaced: search results showed stale versions of documents updated seconds ago. Why? We’d chosen eventual consistency for throughput.
Different engines define consistency differently:
| Engine | Write Visibility | Best For | Risk |
|---|---|---|---|
| Annoy | Never (read-only) | Static datasets | Data reindexing nightmares |
| Qdrant | Immediate (per shard) | Medium-scale dynamic data | Staleness during rebalancing |
| Milvus | Session (guaranteed) | High-change environments | Higher write latency (~8-15ms) |
The fix? Switched to session consistency in Milvus:
```python
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
# Consistency in Milvus is a read-side guarantee, declared on the
# collection (or per query) rather than on individual inserts
collection = Collection("docs", consistency_level="Session")
collection.insert([{"id": "doc1", "vector": v, "version": "2025-04-01"}])
```
Added 12ms to writes but eliminated customer complaints about missing updates.
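A minimal read-your-writes check against the pymilvus ORM, reusing the collection above (`version` is our own payload field; the values are illustrative):

```python
# Under Session consistency, a client must immediately see its own writes
collection.load()  # query/search require the collection in memory
collection.insert([{"id": "doc2", "vector": v, "version": "2025-04-02"}])
res = collection.query(expr='id == "doc2"', output_fields=["version"])
assert res and res[0]["version"] == "2025-04-02", "stale read detected"
```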
4. The Scalability Trap
Faiss with GPU acceleration handled 50 QPS with P99 latency under 100ms. At 500 QPS, P99 spiked to 1.2s. GPUs aren’t magic – they parallelize batch operations, not concurrent requests.
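The usual workaround is micro-batching: fold concurrent requests into a single batched search so the GPU runs one large matrix operation instead of many tiny ones. A minimal sketch, where `max_batch` and `wait_s` are illustrative tuning knobs:

```python
import queue
import threading

import numpy as np

request_q = queue.Queue()  # items: (query_vector, reply_queue)

def batch_worker(index, k=10, max_batch=64, wait_s=0.005):
    while True:
        batch = [request_q.get()]          # block until one request arrives
        try:
            while len(batch) < max_batch:  # then drain briefly for more
                batch.append(request_q.get(timeout=wait_s))
        except queue.Empty:
            pass
        vectors = np.vstack([vec for vec, _ in batch])  # (b, dim) float32
        distances, ids = index.search(vectors, k)       # one batched call
        for i, (_, reply) in enumerate(batch):
            reply.put((distances[i], ids[i]))           # fan results back out

threading.Thread(target=batch_worker, args=(index,), daemon=True).start()
```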
Scaling options we tested:
- **Vertical Scaling (Faiss):** 8x GPU → 4x cost for 2x QPS. Diminishing returns.
- **Sharding (Milvus/Qdrant):** split data by `tenant_id`. Linear scaling, but requires shard-aware queries.
- **Replicas (Weaviate):** read-only copies. Simple, but doubles storage costs.
Shard-per-tenant reduced P99 latency by 67% but required application logic:
```python
import zlib

# Stable hash so a tenant always routes to the same shard
# (shard-key routing syntax varies by engine; shown schematically)
shard_key = zlib.crc32(tenant_id.encode()) % num_shards
client.search(collection_name="docs", shard_key=shard_key)
```
5. Hidden Deployment Tax
Vespa’s ranking performed brilliantly. Then I tried upgrading:
- 3 hours to migrate schema across 5 nodes
- Downtime during index rebalancing
- YAML configs spanning 800+ lines
Operational burden comparison for 5-node clusters:
| Engine | Config Complexity | Rolling Upgrades | Failure Recovery |
|---|---|---|---|
| Vespa | High | Manual | Slow (minutes) |
| Qdrant | Medium | Semi-automatic | Fast (<10s) |
| Milvus | Low | Automatic | Fast (<5s) |
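One way to put numbers on the failure-recovery column: kill a node, then time how long a known-good query keeps failing. A sketch, with `probe_search` as a placeholder for that query:

```python
import time

def time_to_recover(probe_search, timeout_s=300):
    # Elapsed time until the probe succeeds approximates how long
    # clients see errors after a node failure
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        try:
            probe_search()
            return time.monotonic() - start
        except Exception:
            time.sleep(0.5)
    raise TimeoutError("cluster did not recover within timeout")
```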
We learned: Throughput benchmarks ignore operational overhead at 3 AM.
Where We Landed
After 23 performance tests and 3 infrastructure migrations, we chose sharded Milvus because:
- Session consistency matched our “no stale reads” requirement
- Kubernetes operator handled failures silently
- Hybrid filtering behaved predictably at 50M+ vectors
But I’m not evangelical about it. Qdrant could win for simpler schemas; Vespa for complex ranking.
What’s Next?
Two unresolved challenges:
- Cold Start Penalty: loading 1B+ vector indexes still takes 8+ minutes. We’re testing memory-mapped indexes in Annoy 2.0 (sketch after this list).
- Multi-modal Workloads: Can one engine handle text + image + structured vectors? Evaluating Chroma’s new multi-embedding API.
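On the cold-start front, memory mapping defers page-in until vectors are actually touched. Today’s Annoy (1.x) already works this way through `load()`; a minimal sketch, with an illustrative index path:

```python
from annoy import AnnoyIndex

# load() memory-maps the index file: startup is near-instant and pages
# fault in on demand, at the cost of slower first queries
index = AnnoyIndex(768, "angular")
index.load("docs.ann")  # mmap; no full read into RAM
ids = index.get_nns_by_vector(query_embedding, 10)
```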
Vector databases remain rapidly evolving. Test against your workloads, not marketing claims. Start simple – but expect to revisit decisions at 10x scale.