Andrew Kennon

Evaluating Vector Databases in Real Systems

Over the past couple of years, I’ve integrated vector databases into several autonomous driving and LLM-based perception workflows — think multimodal RAG pipelines, scene retrieval across lidar + camera streams, or sensor signature matching. These workloads aren’t your average chatbot demos; they demand high recall, stable latency, and the ability to filter by time, location, or sensor type.

So I’ve been watching the vector DB landscape pretty closely. Below is a benchmark-grounded evaluation of what’s out there, with some firsthand perspective on what actually works when you’re pushing real data through these systems.

How I Evaluate (And Why It Matters)

I care about five things when selecting a vector DB for production:

  1. Query performance — Low latency matters, especially when you’re feeding LLMs in real time.
  2. Recall — In robotics, wrong retrievals mean wrong plans. I need to trust the top-k.
  3. Insert + index time — If I’m syncing sensor data every few seconds, I don’t want indexing to be the bottleneck.
  4. Scalability — Millions of vectors from real-world scenes pile up fast.
  5. Functionality — Filtering, hybrid queries, stability under load — all must-haves.

I use results from ANN-Benchmarks and VectorDBBench, with embeddings ranging from 960 to 1536 dimensions — roughly what you’d expect from vision transformers or OpenAI’s text embedding models.
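Beyond the published numbers, I like to spot-check recall@k on a sample of my own embeddings against a brute-force baseline. A minimal sketch in plain NumPy — random vectors stand in for real embeddings, and `ann_ids` would be whatever the engine under test returns:

```python
import numpy as np

def recall_at_k(queries, corpus, ann_ids, k=10):
    """ann_ids: one list of returned ids per query, from the engine under test."""
    hits = 0
    for q, returned in zip(queries, ann_ids):
        # Exact top-k by L2 distance -- fine at benchmark sample sizes.
        true_ids = np.argsort(np.linalg.norm(corpus - q, axis=1))[:k]
        hits += len(set(true_ids.tolist()) & set(returned))
    return hits / (len(queries) * k)

# Synthetic stand-ins for real embeddings (960-dim, like the vision-side runs):
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 960)).astype("float32")
queries = rng.standard_normal((50, 960)).astype("float32")
# recall = recall_at_k(queries, corpus, ann_ids)  # ann_ids comes from the DB you're testing
```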


Zilliz Cloud / Milvus

What I’ve Seen

This one consistently performs well on both recall and throughput, especially with disk-based indexing and hybrid search. I’ve used Milvus in a lidar-tagged object retrieval task (think “find similar scenes where the ego vehicle was overtaken”) — and the filtering capabilities really helped.
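To give a sense of what that filtering looks like, here’s a minimal sketch with the pymilvus 2.x Collection API — the collection and field names (`scene_embeddings`, `sensor`, `timestamp`, `clip_id`) are invented for illustration:

```python
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
scenes = Collection("scene_embeddings")   # hypothetical collection
query_vec = [0.0] * 960                   # placeholder for a real scene embedding

hits = scenes.search(
    data=[query_vec],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
    # Scalar filter evaluated alongside the ANN search -- the "hybrid" part:
    expr='timestamp > 1700000000 and sensor == "lidar"',
    output_fields=["clip_id", "timestamp"],
)
```

Being able to push the scalar predicate into the same call, rather than post-filtering the top-k, is what made the scene-retrieval task workable.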

Strengths

  • High QPS and recall in real-world tests (VectorDBBench)
  • Supports hybrid queries (e.g., timestamp > t && similarity > x)
  • Scales well for large workloads
  • Milvus (open-source) for flexibility, Zilliz (cloud) for lower ops

Limitations

  • Self-hosted setup is non-trivial (you’re running Pulsar and etcd)
  • Zilliz Cloud simplifies deployment but limits deep tuning (e.g., IVF params)

Weaviate

What I’ve Seen

I tested Weaviate in a cross-modal search prototype — matching dashcam clips to driving logs using metadata filters. It handled hybrid queries well, though indexing took longer than I expected during rapid ingestion.
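For reference, a filtered vector query with the v3-style Python client looks roughly like this (the v4 client has a different API, and the class and property names here are invented):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")
clip_embedding = [0.0] * 960  # placeholder for a real dashcam-clip embedding

result = (
    client.query
    .get("DashcamClip", ["clip_id", "route", "recorded_at"])
    .with_near_vector({"vector": clip_embedding})
    .with_where({                      # metadata filter against the driving log
        "path": ["route"],
        "operator": "Equal",
        "valueText": "highway_merge",
    })
    .with_limit(10)
    .do()
)
```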

Strengths

  • Good recall/QPS balance in most benchmarks
  • First-class support for metadata filtering and multimodal modules
  • Friendly APIs (GraphQL, REST), quick to start

Limitations

  • Index construction slower on large or dynamic datasets
  • Memory usage can spike during concurrent reads

Pinecone

What I’ve Seen

Pinecone feels like a SaaS-native solution. I tried it in a project where fast prototyping mattered more than custom indexing. It worked — but I hit walls when I wanted to tune performance.
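The prototyping story really is short. A minimal query sketch with the current Python client — index name, metadata field, and dimensionality are all placeholders:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("scene-retrieval")        # hypothetical index

query_embedding = [0.0] * 1536             # placeholder embedding
matches = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"sensor": {"$eq": "camera"}},  # metadata filter
    include_metadata=True,
)
```

What you don’t see anywhere in that snippet is an index-tuning knob, which is exactly the wall I ran into.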

Strengths

  • Easy to deploy, zero ops
  • Solid QPS under moderate load
  • Strong ecosystem integration (LangChain, OpenAI)

Limitations

  • Indexing parameters not exposed — you get what you get
  • Recall underperforms slightly on larger or high-dimensional datasets
  • Cost becomes a concern once you scale past toy projects

Qdrant

What I’ve Seen

This one surprised me. I ran a batch object similarity task on CPU (no GPU) and it held up pretty well. Still rough around the edges on some features, but promising for edge use or CPU-bound systems.
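A minimal sketch of the kind of filtered search I was running, using the stock Python client — collection name and payload field are placeholders:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(host="localhost", port=6333)  # plain CPU box, no GPU
query_embedding = [0.0] * 960                       # placeholder embedding

hits = client.search(
    collection_name="objects",                      # hypothetical collection
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="sensor", match=MatchValue(value="lidar"))]
    ),
    limit=10,
)
```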

Strengths

  • Efficient insert/search on commodity hardware (Rust backend)
  • REST/gRPC APIs, filtering supported
  • Lightweight and open-source

Limitations

  • Limited support for hybrid search with complex schemas
  • Ecosystem not as mature as Milvus or Weaviate (yet)

FAISS

What I’ve Seen

I still use FAISS when I want to test indexing strategies in isolation. But I wouldn’t use it in a full production loop — no filtering, no persistence, no service layer.
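A typical isolation test: build an exact baseline and an approximate index over the same vectors, then compare what each returns. Roughly:

```python
import faiss
import numpy as np

d = 960
xb = np.random.random((100_000, d)).astype("float32")  # corpus vectors
xq = np.random.random((10, d)).astype("float32")       # query vectors

flat = faiss.IndexFlatL2(d)                   # exact baseline
ivf = faiss.index_factory(d, "IVF1024,Flat")  # candidate indexing strategy

ivf.train(xb)
flat.add(xb)
ivf.add(xb)

ivf.nprobe = 32                               # the main recall/latency knob
D_exact, I_exact = flat.search(xq, 10)
D_ann, I_ann = ivf.search(xq, 10)             # compare I_ann against I_exact
```

That’s the whole appeal: you can swap the factory string and sweep `nprobe` without touching anything else. Everything around it (serving, filtering, persistence) you have to build yourself.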

Strengths

  • Excellent for raw ANN algorithm comparisons
  • GPU support, fast brute-force testing
  • Customizable index combinations

Limitations

  • Not a database: no filters, no auth, no scaling
  • Can’t be used standalone in production workflows

Chroma

What I’ve Seen

Nice for LangChain demos. I once used it in a hackathon to build a doc-based LLM assistant. Fast to start, but hit scaling limits quickly.
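For scale, a doc-collection prototype in Chroma is about this much code (in-memory client, made-up collection name and documents):

```python
import chromadb

client = chromadb.Client()                         # in-memory; fine for a demo
notes = client.create_collection("driving-notes")  # hypothetical collection

notes.add(
    ids=["n1", "n2"],
    documents=["lane change on I-80", "hard brake near construction zone"],
)
results = notes.query(query_texts=["sudden braking events"], n_results=2)
```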

Strengths

  • Minimal setup, beginner-friendly
  • Works well for prototypes or one-off use

Limitations

  • Weak on indexing and recall with larger datasets
  • Missing core DB features like hybrid filters and distributed execution

Summary Table

| Vector DB | Recall | QPS | Indexing | Hybrid Search | Deployment |
|---|---|---|---|---|---|
| Milvus / Zilliz | High | High | Fast | Supported | OSS + Cloud |
| Weaviate | High | Medium | Medium | Supported | OSS + Cloud |
| Pinecone | Medium | Medium | Fast | Limited | Cloud only |
| Qdrant | Medium | Medium | Fast | Partial | OSS + Cloud |
| FAISS | High | Varies | Fast | Not supported | Local only |
| Chroma | Low | Low | Simple | Limited | Local / Prototypes |

Final Notes

For robotics, perception, or any RAG-heavy pipeline, my current picks look like this:

  • Milvus/Zilliz if I need indexing performance + hybrid filtering at scale
  • Weaviate if schema flexibility and metadata filtering are key
  • Qdrant if I’m deploying on the edge or working CPU-only
  • Pinecone if I want managed infrastructure and don’t mind tradeoffs

That said, nothing beats testing with your own embeddings and real query patterns. Benchmarks help, but your workload always tells the truth.

Let me know if you want a breakdown by use case — like fraud detection vs. vision search vs. conversational RAG. I’ve tested across a few and the performance shifts depending on the shape of your vectors and latency needs.

Top comments (1)

Philip J

I just stumbled upon this. It's interesting. We're looking for a scalable (billion-scale) VDB solution at our startup. How can I contact you to learn more about your experience?