Why Vector Search Isn’t Just Hype

Here’s my hands-on review of open-source vector search engines, distilled from building RAG systems and semantic search prototypes. I’ll cut through the hype and focus on operational realities.

Why Vector Search Isn’t Just Hype

In my last project—a legal document retrieval system—keyword searches failed to link "breach of fiduciary duty" with "trustee negligence." Traditional databases can’t map semantic relationships. Vector embeddings solved this by encoding meaning into 768-dimensional vectors. But choosing the wrong engine tanks performance. Here’s what I learned.

1. Performance Under Load: Beyond Marketing Claims

I benchmarked three engines using the LAION-5B dataset (10M subset, 512-dim vectors):

Engine	Latency (10k QPS)	Recall@10	Index Build Time
Faiss (IVF)	12ms	0.87	22 min
Qdrant	19ms	0.92	41 min
Annoy	8ms	0.78	1.2 hrs

Faiss CPU vs. GPU Pitfall

Enabling GPU acceleration reduced Faiss latency to 4ms—but only with CUDA 11.3. Newer CUDA versions caused kernel crashes. Lesson: Infrastructure constraints dictate choices.

2. Filtering Tradeoffs: When "Hybrid Search" Gets Messy

Qdrant’s filtered search seemed ideal for my e-commerce prototype:

results = qdrant_client.search(
    query_vector=[0.2, -0.1, ...],
    query_filter=Filter(
        must=[
            FieldCondition(key="price", range=Range(gte=100)),
            FieldCondition(key="category", match=Value(text="Electronics"))
        ]
    )
)

Problem: Filtering on high-cardinality fields like user_id increased latency 6x at 50M vectors. Weaviate’s graph filters fared better but required schema restructuring.

3. Consistency Nightmares in RAG Systems

During a ChatGPT-like RAG implementation:

Milvus/Zilliz Cloud offered strong consistency: New document embeddings appeared in searches instantly.
Pgvector with PostgreSQL used eventual consistency. I once saw a 17-second lag during peak writes, causing outdated responses.

Rule of Thumb: Use strong consistency for transactional systems; accept eventual consistency for batch analytics.

4. The Deployment Tax

Deploying Weaviate on Kubernetes seemed straightforward until persistent volume claims choked at 5 TB. Compare resource footprints:

Engine	Memory (1B Vectors)	Cold Start Time
Faiss	512 GB	0
Milvus Lite	64 GB	2.1 sec
Vespa	96 GB	8.5 sec

Vespa’s Hidden Cost: 30% slower ingestion during rolling updates—unacceptable for real-time agents.

5. Error Handling: Where Frameworks Bleed

When testing Annoy’s Python bindings:

try:
    index.build(50)  # 50 trees
except Exception as e:
    # Actual error: "Tree limit exceeded for mmap mode"
    logger.error(f"Build failed: {str(e)}")

Diagnosing failure required tracing C++ core dumps. Milvus and Qdrant provided clearer gRPC status codes (e.g., RESOURCE_EXHAUSTED).

What I’d Use Today

After 300+ hours of testing:

RAG with real-time updates:Milvus/Zilliz Cloud. Consistency won.
Edge deployments: LanceDB. Embedded Python libraries simplified offline use.
Prototyping: Pgvector. SQL joins beat glue code.
Avoid: Annoy for dynamic datasets. Rebuilding indexes weekly wasted cycles.