Rhea Kapoor
Benchmark Realities: How Vector Databases Actually Perform in Production

I’ve lost count of how many times I’ve seen engineering teams choose a vector database based on impressive benchmark numbers, only to watch it stumble when handling real-time queries against live data streams.

Last month’s experience was typical: a prototype using Elasticsearch achieved sub-20ms latency during isolated testing but degraded to 800ms P99 latency when filtering against dynamically updated product inventory.

That disconnect between lab results and production behavior isn’t just frustrating – it derails projects.


The Testing Illusion

Most vector database benchmarks suffer from three critical flaws that render their results misleading:

1. Static Datasets

Benchmarks commonly use outdated datasets like SIFT-1M (128D) or GloVe (50–300D).
Real-world embeddings from models like OpenAI’s text-embedding-3-large reach up to 3072 dimensions.

Testing with undersized vectors is like benchmarking a truck’s fuel efficiency by coasting downhill.
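To make the gap concrete: raw float32 storage grows linearly with dimensionality, before any index overhead. A quick back-of-the-envelope sketch (corpus size of 1M vectors is illustrative):

# Raw float32 storage for a 1M-vector corpus at different dimensionalities
# (index overhead and metadata not included)
def raw_size_gb(num_vectors, dims, bytes_per_value=4):
    return num_vectors * dims * bytes_per_value / 1024**3

for name, dims in [("SIFT-style", 128), ("GloVe-300", 300), ("text-embedding-3-large", 3072)]:
    print(f"{name:24s} {dims:5d}D -> {raw_size_gb(1_000_000, dims):6.2f} GB")

A 3072D corpus is roughly 24x larger than the 128D one it supposedly stands in for, which changes cache behavior, index build time, and memory pressure.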

2. Oversimplified Workloads

Many tests measure query performance only after ingesting all data and building indexes offline.

Production systems don’t have that luxury.

When testing Pinecone last quarter, I observed a 40% QPS drop during active ingestion of a 5M vector dataset.

3. Misleading Metrics

Peak QPS and average latency hide critical failures.

Databases with great average latency often show >1s P99 spikes during concurrent filtering operations.


Designing a Production-Valid Benchmark

To address these gaps, I built a test harness simulating real-world conditions.

Key Components

📚 Modern Datasets

| Corpus     | Embedding Model  | Dimensions | Size   |
|------------|------------------|------------|--------|
| Wikipedia  | Cohere V2        | 768        | 1M/10M |
| BioASQ     | Cohere V3        | 1024       | 1M/10M |
| MSMarco V2 | udever-bloom-1b1 | 1536       | 138M   |

🕒 Tail Latency Focus

Measure P95/P99 latency, not just averages.

In a 10M vector dataset test, one system showed 85ms average latency but 420ms P99 – unacceptable for user-facing workloads.
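A running average can't be turned back into percentiles after the fact, so record every latency sample and compute the tail at the end. A minimal sketch (synthetic sample data, numpy assumed):

import numpy as np

# One latency sample per query, in milliseconds (synthetic data for illustration)
latencies_ms = np.random.lognormal(mean=4.0, sigma=0.6, size=100_000)

print(f"avg: {latencies_ms.mean():6.1f} ms")
print(f"p95: {np.percentile(latencies_ms, 95):6.1f} ms")
print(f"p99: {np.percentile(latencies_ms, 99):6.1f} ms")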

🔁 Sustained Throughput Testing

Gradually increase concurrency and observe:

  • serial_latency_p99: Baseline, no contention
  • conc_latency_p99: Under load
  • max_qps: Sustainable throughput

(Insert Figure: QPS and Latency of Milvus at Varying Concurrency Levels)

At 20+ concurrent queries, nominal QPS stayed flat, but latency surged due to CPU saturation.
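The sweep itself is easy to script: fix a query count per concurrency level, derive QPS from wall-clock time, and keep per-query latencies for the tail. A minimal sketch using a thread pool and a stubbed query_once() (not any specific client API):

import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def query_once():
    """Run one search and return its latency in ms (database call stubbed out)."""
    start = time.perf_counter()
    # client.search(...)  # replace with your database's query call
    return (time.perf_counter() - start) * 1000

def sweep(concurrency_levels=(1, 2, 4, 8, 16, 32), queries_per_level=2000):
    for conc in concurrency_levels:
        t0 = time.perf_counter()
        with ThreadPoolExecutor(max_workers=conc) as pool:
            latencies = list(pool.map(lambda _: query_once(), range(queries_per_level)))
        qps = queries_per_level / (time.perf_counter() - t0)
        print(f"concurrency={conc:2d}  qps={qps:8.1f}  p99={np.percentile(latencies, 99):7.1f} ms")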


Critical Real-World Scenarios

1. Filtered Queries

Combining vector search with metadata filters, like “top 5 sci-fi books released after 2020,” impacts performance dramatically.

Filter Selectivity Impact

  • 50% filtered → Low overhead
  • 99.9% filtered → Can improve speed 10x, or crash the system

(Insert Figure: QPS and Recall Across Filter Selectivity Levels)

OpenSearch’s recall dropped erratically above 95% selectivity, complicating capacity planning.
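For reference, this is roughly what a filtered query looks like against a pymilvus-style API; the collection name, field names, and filter expression are illustrative, not taken from the benchmark itself:

from pymilvus import Collection

books = Collection("books")        # illustrative collection
query_embedding = [0.0] * 768      # placeholder query vector

results = books.search(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"ef": 64}},
    limit=5,
    expr='genre == "sci-fi" and year > 2020',   # metadata filter evaluated with the ANN search
    output_fields=["title", "year"],
)

The more selective that expression is, the more the engine's pre- versus post-filtering strategy dominates the latency you actually observe.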


2. Streaming Data

Testing search-while-inserting reveals architectural bottlenecks.

# Pseudocode: 5 producers insert 100 rows/sec each (~500 rows/sec total);
# after every 10% of the corpus is ingested, fire a 32-way concurrent query burst.
INSERT_RATE = 500          # rows/sec, total across producers
NUM_PRODUCERS = 5
ROWS_PER_PRODUCER = INSERT_RATE // NUM_PRODUCERS   # 100 rows/sec each

while data_remaining():
    for producer in producers:
        producer.insert(next_batch(ROWS_PER_PRODUCER))
    if percent_ingested() % 10 == 0:
        run_queries(concurrency=32)

(Insert Figure: Pinecone vs. Elasticsearch in Streaming Test)

Pinecone started strong, but Elasticsearch overtook it after 3 hours of indexing – an eternity for real-time workloads.


3. Resource Contention

On a 16-core cloud instance with 32 concurrent queries:

  • System X → OOM at 5M vectors
  • System Y → Disk I/O saturation → +300% P99 latency

Practical Deployment Insights

✅ Consistency Levels

  • STRONG: Required for transactional systems (e.g., fraud detection)
  • BOUNDED: Fine for feed ranking
  • EVENTUAL: Risked 8% missing vectors in streaming tests
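These levels map onto what Milvus-style clients expose per request. A sketch of how you might pin the level per query (pymilvus-style API; collection and field names are illustrative):

from pymilvus import Collection

items = Collection("items")    # illustrative collection
query_vec = [0.0] * 768        # placeholder query vector

# Same query under different staleness guarantees
for level in ("Strong", "Bounded", "Eventually"):
    items.search(
        data=[query_vec],
        anns_field="embedding",
        param={"metric_type": "IP", "params": {"ef": 64}},
        limit=10,
        consistency_level=level,   # trade read freshness for latency per request
    )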

⚙️ Indexing Tradeoffs

| Index Type | P99 Latency | Rebuild Time (10M)  | Notes                          |
|------------|-------------|---------------------|--------------------------------|
| HNSW       | 15ms        | 45 min              | Fast queries, slow updates     |
| IVF_SQ8    | 80ms        | 5 min (incremental) | Slower queries, faster updates |
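The two configurations behind those numbers look roughly like this in a pymilvus-style client (parameter values are illustrative, and you would pick one index per field):

from pymilvus import Collection

col = Collection("items")   # illustrative collection

# HNSW: graph index, fast queries, expensive to (re)build
hnsw_params = {"index_type": "HNSW", "metric_type": "IP",
               "params": {"M": 16, "efConstruction": 200}}

# IVF_SQ8: quantized inverted lists, cheaper rebuilds, slower queries
ivf_sq8_params = {"index_type": "IVF_SQ8", "metric_type": "IP",
                  "params": {"nlist": 4096}}

col.create_index(field_name="embedding", index_params=hnsw_params)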

📈 Scaling Patterns

  • Vertical scaling: QPS scales linearly until network IO limits (~50 clients)
  • Horizontal scaling: Requires manual sharding to avoid hotspotting
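"Manual sharding" here just means deterministically routing each vector to a shard so new writes don't pile onto one node. A minimal hash-routing sketch (shard count and client layout are assumptions):

import hashlib

NUM_SHARDS = 8   # illustrative shard count

def shard_for(doc_id):
    """Map a document ID to a shard deterministically, spreading writes evenly."""
    return int(hashlib.md5(doc_id.encode()).hexdigest(), 16) % NUM_SHARDS

# Route by a stable key (document ID), not by insertion time,
# which would send every new write to the same hot shard.
for doc_id, vector in [("doc-1", [0.1] * 768), ("doc-2", [0.2] * 768)]:
    target = shard_for(doc_id)
    # shard_clients[target].insert(doc_id, vector)   # hypothetical per-shard client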

What I’m Exploring Next

  1. Cold Start: How fast can a new node reach steady-state?
  2. Multi-Modal Search: Latency with CLIP or image+text hybrid models
  3. Failover Impact: AZ outages and recovery times
  4. Cost per Query: Budgeting for 100M+ vector clusters

Final Thought

Never trust a benchmark you didn’t run against your own data.

Tools help – but the only valid test is your own production workload.
