Marcus Feldman

What I Learned About Vector Databases When Production Demands Bite

It started simply enough: we needed semantic search for our document processing pipeline. Like many teams, I assumed any open-source vector database could handle it. What followed was six months of tuning, benchmarking, and re-architecting as we hit scale. Here’s what matters when theory meets reality.

1. Libraries vs. Systems: The First Crossroads

When prototyping our RAG pipeline, I instinctively reached for Faiss. Its ANN benchmarks were stellar. But the moment we needed:

  • Real-time updates
  • Filtering by metadata (“only search legal documents from 2023”)
  • Concurrent writes

Faiss hit limits. Why? Because it’s fundamentally a library, not a persistent system.

What worked:

# Faiss for static datasets
import faiss

index = faiss.IndexHNSWFlat(768, 32)   # 768-dim vectors, HNSW with 32 links per node
index.add(training_vectors)            # float32 array of shape (n, 768)
distances, ids = index.search(query_vector, k=10)  # query_vector shape: (1, 768)

What failed:

  • No native persistence (had to serialize/deserialize entire index)
  • Filtering required post-search scans, killing latency
  • Rebuilding indexes for new data took 3+ hours at 5M vectors

This is when I realized: approximate search algorithms ≠ production-grade vector database.
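For context, here is roughly what those workarounds looked like (a sketch, not our exact code; doc_metadata is an assumed side lookup we maintained ourselves): "persistence" means dumping and reloading the whole index by hand, and "filtering" means over-fetching and scanning results after the search.

import faiss

# "Persistence": serialize and reload the entire index manually
faiss.write_index(index, "docs.index")
index = faiss.read_index("docs.index")

# "Filtering": over-fetch, then scan hits against a separate metadata store
distances, ids = index.search(query_vector, 200)  # k >> 10 to survive filtering
hits = [
    (i, d)
    for i, d in zip(ids[0], distances[0])
    if doc_metadata[i]["document_type"] == "contract"
    and doc_metadata[i]["year"] == 2023
][:10]

Every filtered query pays for 200 candidates to return 10, and every restart pays for a full index reload.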

2. Filtering Isn’t a Feature – It’s an Architecture Choice

Initial tests with 10k vectors? Qdrant’s payload filters felt magical:

client.search(  
    collection_name="docs",  
    query_vector=query_embedding,  
    query_filter={  
        "must": [{"key": "document_type", "match": {"value": "contract"}}]  
    }  
)  
Enter fullscreen mode Exit fullscreen mode

At 10M vectors, the same filter increased latency from 15ms to 210ms. Why?

  • Pre-filtering (Weaviate/Qdrant): Applies filters before vector search. Low latency for selective filters but dangerous on high-cardinality fields (e.g., user_id).
  • Post-filtering (Early Milvus): Searches first, then applies filters. Predictable vector search time but risks empty results if filters are restrictive.
  • Hybrid (Modern Milvus/Pinecone): Dynamically switches strategies. Requires optimizer statistics, which themselves cost CPU to maintain.

Lesson learned: Test filtering under your *actual data distribution*, not synthetic datasets.
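One rough way to do that (a sketch assuming qdrant-client against a populated docs collection; sample_queries is a placeholder for vectors replayed from your own production traffic):

import time
import numpy as np
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

def p99_latency_ms(queries, flt=None, k=10):
    """Replay queries and return the 99th-percentile latency in milliseconds."""
    timings = []
    for q in queries:
        start = time.perf_counter()
        client.search(
            collection_name="docs",
            query_vector=q,
            query_filter=flt,
            limit=k,
        )
        timings.append((time.perf_counter() - start) * 1000)
    return float(np.percentile(timings, 99))

contract_filter = models.Filter(
    must=[models.FieldCondition(key="document_type", match=models.MatchValue(value="contract"))]
)

# sample_queries: real query embeddings captured from production logs
print("unfiltered p99:", p99_latency_ms(sample_queries))
print("filtered p99:  ", p99_latency_ms(sample_queries, flt=contract_filter))

The gap between those two numbers, measured on your real payload values, tells you far more than any published benchmark.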

3. Consistency Models: When “Good Enough” Isn’t

We almost shipped Weaviate until a critical bug surfaced: search results showed stale versions of documents updated seconds ago. Why? We’d chosen eventual consistency for throughput.

Different engines define consistency differently:

| Engine | Write Visibility | Best For | Risk |
|--------|------------------|----------|------|
| Annoy | Never (read-only) | Static datasets | Data reindexing nightmares |
| Qdrant | Immediate (per shard) | Medium-scale dynamic data | Staleness during rebalancing |
| Milvus | Session (guaranteed) | High-change environments | Higher write latency (~8-15ms) |

The fix? Switched to session consistency in Milvus:

# Session consistency is set on the collection and enforced at read time
client.create_collection(
    collection_name="docs", schema=docs_schema, consistency_level="Session"
)
client.insert(
    collection_name="docs",
    data=[{"id": "doc1", "vector": v, "version": "2025-04-01"}],
)

Added 12ms to writes but eliminated customer complaints about missing updates.
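To make sure the fix actually held, we kept a read-your-writes smoke test along these lines (a minimal sketch assuming pymilvus’s MilvusClient, the docs collection above, and a stand-in embedding; hit access follows MilvusClient’s dict-style results):

import numpy as np
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local deployment

v = np.random.rand(768).tolist()  # stand-in embedding, just for the test
client.insert(
    collection_name="docs",
    data=[{"id": "doc1", "vector": v, "version": "2025-04-01"}],
)

# With Session consistency, a search from the same client must see its own write
hits = client.search(
    collection_name="docs", data=[v], limit=1, output_fields=["version"]
)
assert hits[0][0]["entity"]["version"] == "2025-04-01"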

4. The Scalability Trap

Faiss with GPU acceleration handled 50 QPS with P99 latency under 100ms. At 500 QPS? P99 spiked to 1.2s. GPUs aren’t magic – they parallelize batch operations, not concurrent requests.
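Here’s what that distinction looks like in practice (an illustrative sketch with random data; assumes the faiss-gpu build): throughput comes from batching query vectors into one search call, not from firing single-vector searches concurrently.

import numpy as np
import faiss

d = 768
index_cpu = faiss.IndexFlatIP(d)
index_cpu.add(np.random.rand(100_000, d).astype("float32"))
index = faiss.index_cpu_to_all_gpus(index_cpu)  # requires the faiss-gpu build

queries = np.random.rand(512, d).astype("float32")

# Good: one batched call; the GPU parallelizes across all 512 queries
D, I = index.search(queries, 10)

# Bad: 512 separate calls (what concurrent API requests effectively become);
# each call under-utilizes the GPU and they serialize on the index
for q in queries:
    index.search(q.reshape(1, -1), 10)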

Scaling options we tested:

  • Vertical Scaling (Faiss): 8x GPU → 4x cost for 2x QPS. Diminishing returns.
  • Sharding (Milvus/Qdrant): Split data by tenant_id. Linear scaling but requires shard-aware queries.
  • Replicas (Weaviate): Read-only copies. Simple but doubles storage costs.

Shard-per-tenant reduced P99 latency by 67% but required application logic:

# Route the query to the tenant-specific shard (exact kwarg depends on the engine)
shard_key = tenant_hash % num_shards
client.search(
    collection_name="docs", query_vector=query_embedding, shard_key=shard_key
)

5. Hidden Deployment Tax

Vespa’s ranking performed brilliantly. Then I tried upgrading:

  • 3 hours to migrate schema across 5 nodes
  • Downtime during index rebalancing
  • YAML configs spanning 800+ lines

Operational burden comparison for 5-node clusters:

| Engine | Config Complexity | Rolling Upgrades | Failure Recovery |
|--------|-------------------|------------------|------------------|
| Vespa | High | Manual | Slow (minutes) |
| Qdrant | Medium | Semi-automatic | Fast (<10s) |
| Milvus | Low | Automatic | Fast (<5s) |

We learned: Throughput benchmarks ignore operational overhead at 3 AM.

Where We Landed

After 23 performance tests and 3 infrastructure migrations, we chose sharded Milvus because:

  • Session consistency matched our “no stale reads” requirement
  • Kubernetes operator recovered from node failures without manual intervention
  • Hybrid filtering behaved predictably at 50M+ vectors

But I’m not evangelical about it. Qdrant could win for simpler schemas; Vespa for complex ranking.

What’s Next?

Two unresolved challenges:

  1. Cold Start Penalty: Loading 1B+ vector indexes still takes 8+ minutes. Testing memory-mapped indexes in Annoy 2.0 (rough sketch after this list).
  2. Multi-modal Workloads: Can one engine handle text + image + structured vectors? Evaluating Chroma’s new multi-embedding API.
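For the cold-start question, the pattern we’re evaluating looks roughly like this (a sketch using Annoy’s current Python API; vectors, query_embedding, and the index path are placeholders): build once offline, then mmap the file at startup instead of reading it into RAM.

from annoy import AnnoyIndex

DIM = 768  # placeholder embedding size

# Offline: build and persist once
builder = AnnoyIndex(DIM, "angular")
for i, vec in enumerate(vectors):  # `vectors` is a placeholder iterable
    builder.add_item(i, vec)
builder.build(50)  # 50 trees; tune recall vs. build time
builder.save("docs.ann")

# At startup: load() memory-maps the file, so the process starts serving
# almost immediately and pages are faulted in on demand
index = AnnoyIndex(DIM, "angular")
index.load("docs.ann")
ids = index.get_nns_by_vector(query_embedding, 10)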

Vector databases are still evolving rapidly. Test against your workloads, not marketing claims. Start simple – but expect to revisit decisions at 10x scale.
