Andrew Kennon

Evaluating Vector Databases in Real Systems

Over the past couple of years, I’ve integrated vector databases into several autonomous driving and LLM-based perception workflows — think multimodal RAG pipelines, scene retrieval across lidar + camera streams, or sensor signature matching. These workloads aren’t your average chatbot demos; they demand high recall, stable latency, and the ability to filter by time, location, or sensor type.

So I’ve been watching the vector DB landscape pretty closely. Below is a benchmark-grounded evaluation of what’s out there, with some firsthand perspective on what actually works when you’re pushing real data through these systems.

How I Evaluate (And Why It Matters)

I care about five things when selecting a vector DB for production:

  1. Query performance — Low latency matters, especially when you’re feeding LLMs in real time.
  2. Recall — In robotics, wrong retrievals mean wrong plans. I need to trust the top-k.
  3. Insert + index time — If I’m syncing sensor data every few seconds, I don’t want indexing to be the bottleneck.
  4. Scalability — Millions of vectors from real-world scenes pile up fast.
  5. Functionality — Filtering, hybrid queries, stability under load — all must-haves.

I use results from ANN-Benchmarks and VectorDBBench, with embeddings ranging from 960 to 1536 dimensions — roughly what you’d expect from vision transformers or OpenAI’s text embedding models.
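Beyond the published numbers, I like to spot-check recall@k on a sample of my own embeddings against a brute-force baseline. A minimal sketch in plain NumPy — random vectors stand in for real embeddings, and `ann_ids` would be whatever the engine under test returns:

```python
import numpy as np

def recall_at_k(queries, corpus, ann_ids, k=10):
    """ann_ids: one list of returned ids per query, from the engine under test."""
    hits = 0
    for q, returned in zip(queries, ann_ids):
        # Exact top-k by L2 distance -- fine at benchmark sample sizes.
        true_ids = np.argsort(np.linalg.norm(corpus - q, axis=1))[:k]
        hits += len(set(true_ids.tolist()) & set(returned))
    return hits / (len(queries) * k)

# Synthetic stand-ins for real embeddings (960-dim, like the vision-side runs):
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 960)).astype("float32")
queries = rng.standard_normal((50, 960)).astype("float32")
# recall = recall_at_k(queries, corpus, ann_ids)  # ann_ids comes from the DB you're testing
```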


Zilliz Cloud / Milvus

What I’ve Seen

This one consistently performs well on both recall and throughput, especially with disk-based indexing and hybrid search. I’ve used Milvus in a lidar-tagged object retrieval task (think “find similar scenes where the ego vehicle was overtaken”) — and the filtering capabilities really helped.
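To give a sense of what that filtering looks like, here’s a minimal sketch with the pymilvus 2.x Collection API — the collection and field names (`scene_embeddings`, `sensor`, `timestamp`, `clip_id`) are invented for illustration:

```python
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
scenes = Collection("scene_embeddings")   # hypothetical collection
query_vec = [0.0] * 960                   # placeholder for a real scene embedding

hits = scenes.search(
    data=[query_vec],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
    # Scalar filter evaluated alongside the ANN search -- the "hybrid" part:
    expr='timestamp > 1700000000 and sensor == "lidar"',
    output_fields=["clip_id", "timestamp"],
)
```

Being able to push the scalar predicate into the same call, rather than post-filtering the top-k, is what made the scene-retrieval task workable.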

Strengths

  • High QPS and recall in real-world tests (VectorDBBench)
  • Supports hybrid queries (e.g., timestamp > t && similarity > x)
  • Scales well for large workloads
  • Milvus (open-source) for flexibility, Zilliz (cloud) for lower ops

Limitations

  • Self-hosted setup is non-trivial (you’re running Pulsar and etcd)
  • Zilliz Cloud simplifies deployment but limits deep tuning (e.g., IVF params)

Weaviate

What I’ve Seen

I tested Weaviate in a cross-modal search prototype — matching dashcam clips to driving logs using metadata filters. It handled hybrid queries well, though indexing took longer than I expected during rapid ingestion.
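For reference, a filtered vector query with the v3-style Python client looks roughly like this (the v4 client has a different API, and the class and property names here are invented):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")
clip_embedding = [0.0] * 960  # placeholder for a real dashcam-clip embedding

result = (
    client.query
    .get("DashcamClip", ["clip_id", "route", "recorded_at"])
    .with_near_vector({"vector": clip_embedding})
    .with_where({                      # metadata filter against the driving log
        "path": ["route"],
        "operator": "Equal",
        "valueText": "highway_merge",
    })
    .with_limit(10)
    .do()
)
```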

Strengths

  • Good recall/QPS balance in most benchmarks
  • First-class support for metadata filtering and multimodal modules
  • Friendly APIs (GraphQL, REST), quick to start

Limitations

  • Index construction slower on large or dynamic datasets
  • Memory usage can spike during concurrent reads

Pinecone

What I’ve Seen

Pinecone feels like a SaaS-native solution. I tried it in a project where fast prototyping mattered more than custom indexing. It worked — but I hit walls when I wanted to tune performance.
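The prototyping story really is short. A minimal query sketch with the current Python client — index name, metadata field, and dimensionality are all placeholders:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("scene-retrieval")        # hypothetical index

query_embedding = [0.0] * 1536             # placeholder embedding
matches = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"sensor": {"$eq": "camera"}},  # metadata filter
    include_metadata=True,
)
```

What you don’t see anywhere in that snippet is an index-tuning knob, which is exactly the wall I ran into.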

Strengths

  • Easy to deploy, zero ops
  • Solid QPS under moderate load
  • Strong ecosystem integration (LangChain, OpenAI)

Limitations

  • Indexing parameters not exposed — you get what you get
  • Recall underperforms slightly on larger or high-dimensional datasets
  • Cost becomes a concern once you scale past toy projects

Qdrant

What I’ve Seen

This one surprised me. I ran a batch object similarity task on CPU (no GPU) and it held up pretty well. Still rough around the edges on some features, but promising for edge use or CPU-bound systems.
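A minimal sketch of the kind of filtered search I was running, using the stock Python client — collection name and payload field are placeholders:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(host="localhost", port=6333)  # plain CPU box, no GPU
query_embedding = [0.0] * 960                       # placeholder embedding

hits = client.search(
    collection_name="objects",                      # hypothetical collection
    query_vector=query_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="sensor", match=MatchValue(value="lidar"))]
    ),
    limit=10,
)
```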

Strengths

  • Efficient insert/search on commodity hardware (Rust backend)
  • REST/gRPC APIs, filtering supported
  • Lightweight and open-source

Limitations

  • Limited support for hybrid search with complex schemas
  • Ecosystem not as mature as Milvus or Weaviate (yet)

FAISS

What I’ve Seen

I still use FAISS when I want to test indexing strategies in isolation. But I wouldn’t use it in a full production loop — no filtering, no persistence, no service layer.
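A typical isolation test: build an exact baseline and an approximate index over the same vectors, then compare what each returns. Roughly:

```python
import faiss
import numpy as np

d = 960
xb = np.random.random((100_000, d)).astype("float32")  # corpus vectors
xq = np.random.random((10, d)).astype("float32")       # query vectors

flat = faiss.IndexFlatL2(d)                   # exact baseline
ivf = faiss.index_factory(d, "IVF1024,Flat")  # candidate indexing strategy

ivf.train(xb)
flat.add(xb)
ivf.add(xb)

ivf.nprobe = 32                               # the main recall/latency knob
D_exact, I_exact = flat.search(xq, 10)
D_ann, I_ann = ivf.search(xq, 10)             # compare I_ann against I_exact
```

That’s the whole appeal: you can swap the factory string and sweep `nprobe` without touching anything else. Everything around it (serving, filtering, persistence) you have to build yourself.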

Strengths

  • Excellent for raw ANN algorithm comparisons
  • GPU support, fast brute-force testing
  • Customizable index combinations

Limitations

  • Not a database: no filters, no auth, no scaling
  • Can’t be used standalone in production workflows

Chroma

What I’ve Seen

Nice for LangChain demos. I once used it in a hackathon to build a doc-based LLM assistant. Fast to start, but hit scaling limits quickly.
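For scale, a doc-collection prototype in Chroma is about this much code (in-memory client, made-up collection name and documents):

```python
import chromadb

client = chromadb.Client()                         # in-memory; fine for a demo
notes = client.create_collection("driving-notes")  # hypothetical collection

notes.add(
    ids=["n1", "n2"],
    documents=["lane change on I-80", "hard brake near construction zone"],
)
results = notes.query(query_texts=["sudden braking events"], n_results=2)
```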

Strengths

  • Minimal setup, beginner-friendly
  • Works well for prototypes or one-off use

Limitations

  • Weak on indexing and recall with larger datasets
  • Missing core DB features like hybrid filters and distributed execution

Summary Table

| Vector DB | Recall | QPS | Indexing | Hybrid Search | Deployment |
|---|---|---|---|---|---|
| Milvus / Zilliz | High | High | Fast | Supported | OSS + Cloud |
| Weaviate | High | Medium | Medium | Supported | OSS + Cloud |
| Pinecone | Medium | Medium | Fast | Limited | Cloud only |
| Qdrant | Medium | Medium | Fast | Partial | OSS + Cloud |
| FAISS | High | Varies | Fast | Not supported | Local only |
| Chroma | Low | Low | Simple | Limited | Local / Prototypes |

Final Notes

For robotics, perception, or any RAG-heavy pipeline, my current picks look like this:

  • Milvus/Zilliz if I need indexing performance + hybrid filtering at scale
  • Weaviate if schema flexibility and metadata filtering are key
  • Qdrant if I’m deploying on the edge or working CPU-only
  • Pinecone if I want managed infrastructure and don’t mind tradeoffs

That said, nothing beats testing with your own embeddings and real query patterns. Benchmarks help, but your workload always tells the truth.

Let me know if you want a breakdown by use case — like fraud detection vs. vision search vs. conversational RAG. I’ve tested across a few and the performance shifts depending on the shape of your vectors and latency needs.

Top comments (1)

Philip J

I just stumbled upon this. It's interesting. We're looking for a scalable (billion-scale) VDB solution at our startup. How can I contact you to learn more about your experience?