Rhea Kapoor
Benchmark Realities: How Vector Databases Actually Perform in Production

I’ve lost count of how many times I’ve seen engineering teams choose a vector database based on impressive benchmark numbers, only to watch it stumble when handling real-time queries against live data streams.

Last month’s experience was typical: a prototype using Elasticsearch achieved sub-20ms latency during isolated testing but degraded to 800ms P99 latency when filtering against dynamically updated product inventory.

That disconnect between lab results and production behavior isn’t just frustrating – it derails projects.


The Testing Illusion

Most vector database benchmarks suffer from three critical flaws that render their results misleading:

1. Static Datasets

Benchmarks commonly use outdated datasets like SIFT-1M (128D) or GloVe (50–300D).
Real-world embeddings from models like OpenAI’s text-embedding-3-large reach up to 3072 dimensions.

Testing with undersized vectors is like benchmarking a truck’s fuel efficiency by coasting downhill.
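To make the gap concrete: raw float32 storage grows linearly with dimensionality, before any index overhead. A quick back-of-the-envelope sketch (corpus size of 1M vectors is illustrative):

# Raw float32 storage for a 1M-vector corpus at different dimensionalities
# (index overhead and metadata not included)
def raw_size_gb(num_vectors, dims, bytes_per_value=4):
    return num_vectors * dims * bytes_per_value / 1024**3

for name, dims in [("SIFT-style", 128), ("GloVe-300", 300), ("text-embedding-3-large", 3072)]:
    print(f"{name:24s} {dims:5d}D -> {raw_size_gb(1_000_000, dims):6.2f} GB")

A 3072D corpus is roughly 24x larger than the 128D one it supposedly stands in for, which changes cache behavior, index build time, and memory pressure.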

2. Oversimplified Workloads

Many tests measure query performance only after ingesting all data and building indexes offline.

Production systems don’t have that luxury.

When testing Pinecone last quarter, I observed a 40% QPS drop during active ingestion of a 5M vector dataset.

3. Misleading Metrics

Peak QPS and average latency hide critical failures.

Databases with great average latency often show >1s P99 spikes during concurrent filtering operations.


Designing a Production-Valid Benchmark

To address these gaps, I built a test harness simulating real-world conditions.

Key Components

📚 Modern Datasets

| Corpus     | Embedding Model  | Dimensions | Size   |
|------------|------------------|------------|--------|
| Wikipedia  | Cohere V2        | 768        | 1M/10M |
| BioASQ     | Cohere V3        | 1024       | 1M/10M |
| MSMarco V2 | udever-bloom-1b1 | 1536       | 138M   |

🕒 Tail Latency Focus

Measure P95/P99 latency, not just averages.

In a 10M vector dataset test, one system showed 85ms average latency but 420ms P99 – unacceptable for user-facing workloads.
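A running average can't be turned back into percentiles after the fact, so record every latency sample and compute the tail at the end. A minimal sketch (synthetic sample data, numpy assumed):

import numpy as np

# One latency sample per query, in milliseconds (synthetic data for illustration)
latencies_ms = np.random.lognormal(mean=4.0, sigma=0.6, size=100_000)

print(f"avg: {latencies_ms.mean():6.1f} ms")
print(f"p95: {np.percentile(latencies_ms, 95):6.1f} ms")
print(f"p99: {np.percentile(latencies_ms, 99):6.1f} ms")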

🔁 Sustained Throughput Testing

Gradually increase concurrency and observe:

  • serial_latency_p99: Baseline, no contention
  • conc_latency_p99: Under load
  • max_qps: Sustainable throughput

(Insert Figure: QPS and Latency of Milvus at Varying Concurrency Levels)

At 20+ concurrent queries, nominal QPS stayed flat, but latency surged due to CPU saturation.
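The sweep itself is easy to script: fix a query count per concurrency level, derive QPS from wall-clock time, and keep per-query latencies for the tail. A minimal sketch using a thread pool and a stubbed query_once() (not any specific client API):

import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def query_once():
    """Run one search and return its latency in ms (database call stubbed out)."""
    start = time.perf_counter()
    # client.search(...)  # replace with your database's query call
    return (time.perf_counter() - start) * 1000

def sweep(concurrency_levels=(1, 2, 4, 8, 16, 32), queries_per_level=2000):
    for conc in concurrency_levels:
        t0 = time.perf_counter()
        with ThreadPoolExecutor(max_workers=conc) as pool:
            latencies = list(pool.map(lambda _: query_once(), range(queries_per_level)))
        qps = queries_per_level / (time.perf_counter() - t0)
        print(f"concurrency={conc:2d}  qps={qps:8.1f}  p99={np.percentile(latencies, 99):7.1f} ms")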


Critical Real-World Scenarios

1. Filtered Queries

Combining vector search with metadata filters, like “top 5 sci-fi books released after 2020,” impacts performance dramatically.

Filter Selectivity Impact

  • 50% filtered → Low overhead
  • 99.9% filtered → Can improve speed 10x, or crash the system

(Insert Figure: QPS and Recall Across Filter Selectivity Levels)

OpenSearch’s recall dropped erratically above 95% selectivity, complicating capacity planning.
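For reference, this is roughly what a filtered query looks like against a pymilvus-style API; the collection name, field names, and filter expression are illustrative, not taken from the benchmark itself:

from pymilvus import Collection

books = Collection("books")        # illustrative collection
query_embedding = [0.0] * 768      # placeholder query vector

results = books.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"ef": 64}},
    limit=5,
    expr='genre == "sci-fi" and year > 2020',   # metadata filter evaluated with the ANN search
    output_fields=["title", "year"],
)

The more selective that expression is, the more the engine's pre- versus post-filtering strategy dominates the latency you actually observe.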


2. Streaming Data

Testing search-while-inserting reveals architectural bottlenecks.

# Pseudocode: 5 producers insert 100 rows/sec each (~500 rows/sec total);
# after every 10% of the corpus is ingested, fire a 32-way concurrent query burst.
INSERT_RATE = 500          # rows/sec, total across producers
NUM_PRODUCERS = 5
ROWS_PER_PRODUCER = INSERT_RATE // NUM_PRODUCERS   # 100 rows/sec each

while data_remaining():
    for producer in producers:
        producer.insert(next_batch(ROWS_PER_PRODUCER))
    if percent_ingested() % 10 == 0:
        run_queries(concurrency=32)

(Insert Figure: Pinecone vs. Elasticsearch in Streaming Test)

Pinecone started strong, but Elasticsearch overtook it after 3 hours of indexing – an eternity for real-time workloads.


3. Resource Contention

On a 16-core cloud instance with 32 concurrent queries:

  • System X → OOM at 5M vectors
  • System Y → Disk I/O saturation → +300% P99 latency

Practical Deployment Insights

✅ Consistency Levels

  • STRONG: Required for transactional systems (e.g., fraud detection)
  • BOUNDED: Fine for feed ranking
  • EVENTUAL: Risked 8% missing vectors in streaming tests
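These levels map onto what Milvus-style clients expose per request. A sketch of how you might pin the level per query (pymilvus-style API; collection and field names are illustrative):

from pymilvus import Collection

items = Collection("items")    # illustrative collection
query_vec = [0.0] * 768        # placeholder query vector

# Same query under different staleness guarantees
for level in ("Strong", "Bounded", "Eventually"):
    items.search(
        data=[query_vec],
        anns_field="embedding",
        param={"metric_type": "IP", "params": {"ef": 64}},
        limit=10,
        consistency_level=level,   # trade read freshness for latency per request
    )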

⚙️ Indexing Tradeoffs

| Index Type | P99 Latency | Rebuild Time (10M)  | Notes                          |
|------------|-------------|---------------------|--------------------------------|
| HNSW       | 15ms        | 45 min              | Fast queries, slow updates     |
| IVF_SQ8    | 80ms        | 5 min (incremental) | Slower queries, faster updates |
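The two configurations behind those numbers look roughly like this in a pymilvus-style client (parameter values are illustrative, and you would pick one index per field):

from pymilvus import Collection

col = Collection("items")   # illustrative collection

# HNSW: graph index, fast queries, expensive to (re)build
hnsw_params = {"index_type": "HNSW", "metric_type": "IP",
               "params": {"M": 16, "efConstruction": 200}}

# IVF_SQ8: quantized inverted lists, cheaper rebuilds, slower queries
ivf_sq8_params = {"index_type": "IVF_SQ8", "metric_type": "IP",
                  "params": {"nlist": 4096}}

col.create_index(field_name="embedding", index_params=hnsw_params)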

📈 Scaling Patterns

  • Vertical scaling: QPS scales linearly until network IO limits (~50 clients)
  • Horizontal scaling: Requires manual sharding to avoid hotspotting
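"Manual sharding" here just means deterministically routing each vector to a shard so new writes don't pile onto one node. A minimal hash-routing sketch (shard count and client layout are assumptions):

import hashlib

NUM_SHARDS = 8   # illustrative shard count

def shard_for(doc_id):
    """Map a document ID to a shard deterministically, spreading writes evenly."""
    return int(hashlib.md5(doc_id.encode()).hexdigest(), 16) % NUM_SHARDS

# Route by a stable key (document ID), not by insertion time,
# which would send every new write to the same hot shard.
for doc_id, vector in [("doc-1", [0.1] * 768), ("doc-2", [0.2] * 768)]:
    target = shard_for(doc_id)
    # shard_clients[target].insert(doc_id, vector)   # hypothetical per-shard client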

What I’m Exploring Next

  1. Cold Start: How fast can a new node reach steady-state?
  2. Multi-Modal Search: Latency with CLIP or image+text hybrid models
  3. Failover Impact: AZ outages and recovery times
  4. Cost per Query: Budgeting for 100M+ vector clusters

Final Thought

Never trust a benchmark you didn’t run against your own data.

Tools help – but the only valid test is your own production workload.
