The biggest mistake when choosing a vector database is to link a credit card to the most popular SaaS solution and launch the project straight away. Everything might go smoothly with the first 10,000 vectors, but you may find yourself staring at an $800 monthly bill once your dataset passes the 1 million mark. In vector search, the primary cost driver is not CPU but RAM capacity.
I've been through this process myself, both in a side project I developed and in an e-commerce project where I provided consultancy. Choosing the "fastest" index without optimizing RAM consumption can crash your server with an Out-Of-Memory (OOM) error. In this post, based on tests I conducted with pgvector, Qdrant, and Milvus, I'll explain how to build a vector architecture that won't break the bank.
Calculating Memory and Disk Consumption (HNSW vs IVF)
The two index families you will meet most often are HNSW (Hierarchical Navigable Small World) and IVF (Inverted File). HNSW offers high accuracy (recall) and very low latency, but it performs well only when the entire graph sits in RAM. IVF, on the other hand, clusters the data and scans only a few clusters per query; it can serve those clusters from disk, which lowers memory costs at the price of some accuracy.
Let's assume we have 1 million vectors, each with a dimension of 1536 (the default size for OpenAI's text-embedding-3-small and text-embedding-ada-002). Let's start by calculating the raw data size:
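To make the IVF trade-off concrete, here is a toy sketch in pure Python. It is not how any of the engines in this post implement IVF: the centroids are simply a random sample of the data (a real index would run k-means), and the names `build_ivf`, `ivf_search`, and `nprobe` are illustrative assumptions. It only shows why scanning fewer clusters is cheaper but can miss true neighbors.

```python
import math
import random

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, n_lists, seed=0):
    """Assign every vector to its nearest centroid.
    Assumption: centroids are a random sample, not trained with k-means."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_lists)
    lists = [[] for _ in range(n_lists)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(n_lists), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(idx)
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]

rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(500)]
centroids, lists = build_ivf(data, n_lists=10)
query = [rng.random() for _ in range(8)]

# Brute-force ground truth versus a cheap probe of 2 lists and a full probe of all 10.
exact = sorted(range(len(data)), key=lambda i: l2(query, data[i]))[:5]
approx = ivf_search(query, data, centroids, lists, k=5, nprobe=2)
full = ivf_search(query, data, centroids, lists, k=5, nprobe=10)
recall = len(set(exact) & set(approx)) / len(exact)
print(f"recall@5 with nprobe=2: {recall:.2f}")
```

Probing every list reproduces the exact result; lowering `nprobe` touches less data per query, which is where both the savings and the recall loss come from.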
1,000,000 (number of vectors) * 1536 (dimension) * 4 bytes (float32) = 6,144,000,000 bytes (~5.72 GB)
Just storing the raw vectors in memory requires 5.72 GB of RAM. If you build an HNSW index, each vector incurs additional memory overhead for its graph edges and bookkeeping. With the HNSW parameters M = 16 and ef_construction = 64, the index grows to approximately 1.5 to 2 times the raw data size. Once you add headroom for the database process itself, you need to allocate at least 12-15 GB of RAM for this index alone.
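The same arithmetic can be sketched in a few lines of Python so you can plug in your own vector count and dimension. The 1.5 to 2 multiplier is the rough rule of thumb from this post, not a measured constant, and the function names are hypothetical.

```python
GIB = 1024 ** 3

def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Size of the raw float32 vectors, before any index is built."""
    return num_vectors * dim * bytes_per_value

def hnsw_index_range_gib(num_vectors, dim, overhead=(1.5, 2.0)):
    """Return (raw, low, high) in GiB.
    Assumption: index size is raw size times a rule-of-thumb multiplier."""
    raw_gib = raw_vector_bytes(num_vectors, dim) / GIB
    return raw_gib, raw_gib * overhead[0], raw_gib * overhead[1]

raw, low, high = hnsw_index_range_gib(1_000_000, 1536)
print(f"raw vectors: {raw:.2f} GiB, HNSW index estimate: {low:.1f} to {high:.1f} GiB")
```

Treat the result as a lower bound for capacity planning; the operating system, query buffers, and the database process need room on top of it.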
⚠️ Beware of OOM Risks
If your server has 16 GB of RAM and the operating system and other services also share this memory, the Linux kernel's OOM killer mechanism will mercilessly terminate your database process. To avoid seeing the line "Out of memory: Killed process" in your kernel logs, you should correctly configure your swap space or set soft memory limits.
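A simple pre-flight check can catch this before the kernel does. The sketch below parses `/proc/meminfo`-style text and compares `MemAvailable` against the memory you plan to give the index; the 2 GiB safety margin and the function names are assumptions chosen for illustration.

```python
def mem_available_gib(meminfo_text):
    """Extract MemAvailable (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) / (1024 ** 2)
    raise ValueError("MemAvailable not found")

def has_headroom(required_gib, meminfo_text, safety_margin_gib=2.0):
    """True if the index fits and still leaves the safety margin free."""
    return mem_available_gib(meminfo_text) - required_gib >= safety_margin_gib

# Sample text in the kernel's format; 13107200 kB is 12.5 GiB.
sample = "MemTotal:       16384000 kB\nMemAvailable:   13107200 kB\n"
print(has_headroom(12.0, sample))  # a 12 GiB index leaves too little room
print(has_headroom(8.0, sample))   # an 8 GiB index fits with margin
```

On a real Linux host you would pass in the contents of `/proc/meminfo` instead of the sample string.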
When Does Starting with PostgreSQL pgvector Make Sense?
If you are already using PostgreSQL 15+ in your existing infrastructure and your vector count is below 500,000, adding a new database technology to your stack is an entirely unnecessary operational burden. The PostgreSQL pgvector extension (especially with HNSW support introduced in v0.5.0+) allows you to query your relational data and vector data within the same transaction boundaries.
However, there is a critical setting to be aware of when using pgvector. Building the index consumes a significant amount of RAM. If you don't temporarily raise the maintenance_work_mem parameter, the graph stops fitting in memory partway through the build, and index creation slows down dramatically and can take hours.
-- Increase memory before creating an HNSW index for pgvector
SET maintenance_work_mem = '4GB';
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
This approach starts to collide with connection pool management as your data grows. Each PostgreSQL connection is a separate backend process with its own memory overhead, and that adds up on top of the index. As I mentioned in my previous article on [related: PostgreSQL index strategies], you must keep your connection pool limits tight while pgvector queries are running; otherwise, PostgreSQL can run out of memory in the middle of a query.
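As a rough way to size the pool, you can subtract the fixed allocations from total RAM and divide what is left by a per-connection budget. This is a back-of-the-envelope sketch, not a PostgreSQL formula: the per-connection figure is an assumption you should measure on your own workload, and the example numbers are purely illustrative.

```python
def max_pool_size(total_ram_gib, shared_buffers_gib, maintenance_work_mem_gib,
                  per_connection_mib, os_reserve_gib=2.0):
    """Estimate how many connections fit in the RAM left after fixed allocations.
    Assumption: each connection needs roughly per_connection_mib at peak."""
    budget_gib = (total_ram_gib - shared_buffers_gib
                  - maintenance_work_mem_gib - os_reserve_gib)
    if budget_gib <= 0:
        return 0
    return int(budget_gib * 1024 // per_connection_mib)

# Illustrative numbers: 32 GiB host, 16 GiB shared_buffers, 4 GiB for index builds.
print(max_pool_size(32, 16, 4, per_connection_mib=64))
```

If the function returns a number lower than your current pool limit, either shrink the pool or move the vector workload off the primary database.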
Dedicated Vector Databases: Qdrant and Milvus Comparison
When your data size exceeds 5-10 million vectors, or when you need to sustain thousands of queries per second (QPS), PostgreSQL starts to become insufficient. This is where Qdrant (written in Rust) and Milvus (a mix of Go and C++), both designed specifically for this purpose, come into play.
Qdrant is a heavily optimized engine, especially in terms of memory management and resource consumption. Because Rust has no garbage collector, its memory footprint is predictable, and in my experience it uses very little RAM while idle. Milvus, on the other hand, is designed for massive, horizontally scalable deployments with a microservices architecture, primarily on Kubernetes.
In the table below, you can see the metrics I obtained in my own test environment (8 vCPU, 32 GB RAM bare-metal server):
| Metric (1M Vectors, 1536-dim) | PostgreSQL (pgvector) | Qdrant (HNSW) | Milvus (HNSW) |
|---|---|---|---|
| Index Build Time | 42 minutes | 12 minutes | 18 minutes |
| Search Latency (p95) | 18 ms | 4 ms | 7 ms |
| Memory Consumption (RAM) | ~14 GB | ~8.5 GB | ~11 GB |
| Disk Consumption | ~11 GB | ~7.2 GB | ~9 GB |
The reason Qdrant can keep memory consumption so low is its highly successful implementation of Scalar Quantization (SQ), which compresses vectors in memory. If your budget is limited and you want to achieve maximum performance on a single VPS, you should definitely opt for Qdrant.
Index Configuration Parameters and Performance Trade-offs
There is no magical "best setting" in vector search engines. You have to sacrifice one aspect in the triangle of speed, accuracy, and memory. The three most critical parameters for HNSW indexes are: m, ef_construction, and ef_search.
- m: The maximum number of connections each node in the graph can have. It's generally chosen between 4 and 64. As the value increases, search accuracy improves, but the memory used by graph links grows roughly in proportion, so doubling m about doubles that overhead.
- ef_construction: Determines the number of candidate neighbors to scan during index creation. Higher values produce a higher-quality index but dramatically increase indexing time.
- ef_search: A parameter that can be changed dynamically at search time. Higher values increase accuracy but decrease the queries processed per second (QPS).
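To see how m drives memory, here is a back-of-the-envelope model of the graph links alone. It assumes an hnswlib-style layout (4-byte neighbor ids, up to 2*m links on the base layer, m links on each upper layer, and level probabilities that decay by a factor of m); the engines discussed here add their own per-node overhead on top, so treat the numbers as a lower bound.

```python
def hnsw_link_bytes_per_vector(m, id_bytes=4):
    """Expected bytes of neighbor links per vector.
    Assumptions: 2*m slots on layer 0, m slots per upper layer,
    and P(node reaches level l) = m**-l, so 1/(m-1) upper layers on average."""
    if m < 2:
        raise ValueError("m must be at least 2")
    base_layer_slots = 2 * m
    upper_layer_slots = m / (m - 1)
    return id_bytes * (base_layer_slots + upper_layer_slots)

for m in (8, 16, 32, 64):
    per_vector = hnsw_link_bytes_per_vector(m)
    total_mib = per_vector * 1_000_000 / (1024 ** 2)
    print(f"m={m:>2}: {per_vector:6.1f} bytes/vector, {total_mib:6.1f} MiB per 1M vectors")
```

Under these assumptions the link overhead scales almost linearly with m, which is why raising m is the most expensive of the three knobs in memory terms.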
// Optimized HNSW and Quantization configuration for Qdrant
{
  "hnsw_config": {
    "m": 16,
    "ef_construct": 100,
    "full_scan_threshold": 10000
  },
  "quantization_config": {
    "scalar": {
      "type": "int8",
      "quantile": 0.99,
      "always_ram": true
    }
  }
}
In the Qdrant JSON configuration above, int8 quantization converts the 32-bit float vectors to 8-bit integers. Each dimension shrinks from 4 bytes to 1, so once graph and metadata overhead is counted, RAM consumption drops by almost 70%, at the cost of only about a 1-2% loss in search accuracy. This means a project that would otherwise need 32 GB of RAM could potentially launch with roughly 8-10 GB.
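The idea behind scalar quantization fits in a short pure-Python sketch. This is a simplified illustration, not Qdrant's actual implementation: it clips values outside the central 99% of the distribution (mirroring the `quantile` setting) and maps the remaining range linearly onto 256 integer codes. The function names are hypothetical.

```python
from array import array
import random

def quantize_int8(values, quantile=0.99):
    """Map floats to int8 codes, clipping outliers outside the central quantile mass."""
    ordered = sorted(values)
    tail = (1.0 - quantile) / 2.0
    lo = ordered[int(tail * (len(ordered) - 1))]
    hi = ordered[int((1.0 - tail) * (len(ordered) - 1))]
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for a constant vector
    codes = array("b", (
        int(round((min(max(v, lo), hi) - lo) / scale)) - 128
        for v in values
    ))
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Approximate the original floats from the int8 codes."""
    return [(c + 128) * scale + lo for c in codes]

random.seed(0)
vector = [random.gauss(0.0, 0.05) for _ in range(1536)]
codes, lo, scale = quantize_int8(vector)
restored = dequantize(codes, lo, scale)

float_bytes = len(vector) * array("f").itemsize
int8_bytes = len(codes) * codes.itemsize
print(f"float32: {float_bytes} bytes, int8: {int8_bytes} bytes")
```

Values inside the clipping range are recovered to within half a quantization step, which is why the recall loss stays small while the vector storage shrinks by a factor of four.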
Hybrid Search and Reranking Costs
Relying solely on vector-based semantic search doesn't always yield the most accurate results. Especially when searching for product codes, serial numbers, or specific brand names, vectors can fail. This is why modern architectures use "Hybrid Search" (Sparse + Dense). This combines the results of classic BM25 (text search) with dense vector search.
However, this combination adds extra load to system resources. The two sources produce scores on different scales, so getting a precise final ordering typically means running a Cross-Encoder (Reranker) model over the merged candidates. Reranking adds significant latency, whether it runs on CPU or GPU.
# Set a threshold instead of calling the reranker model on every query
def hybrid_search(query, limit=50):
    # Phase 1: Perform fast vector and text search (Recall stage)
    vector_results = qdrant_client.search(collection_name="products", query_vector=get_embedding(query), limit=limit)
    text_results = postgres_client.search_text(query, limit=limit)
    combined_results = merge_results(vector_results, text_results)
    # Phase 2: Rerank only the top 10 results (Precision stage)
    if len(combined_results) > 10:
        reranked = reranker.compute_score(query, combined_results[:10])
        return reranked
    return combined_results
If you send the entire candidate set (e.g., 100 results) to a Cross-Encoder model, your query time will exceed 200 ms. With the two-stage filtering method I applied in the code block above, I managed to keep the p95 latency at around 35 ms. For detailed information, you can check our [related: RAG architecture] documentation.
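The merge step in the first phase is left undefined in the snippet above. One common option, shown here as an assumption rather than what the original code used, is Reciprocal Rank Fusion (RRF): it ignores raw scores and combines the two lists by rank alone, which sidesteps the problem that BM25 and cosine scores are not comparable. The constant k = 60 is the conventional default, and the document ids are made up.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of ids; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["sku-7", "sku-2", "sku-9"]  # hypothetical dense-search ranking
text_hits = ["sku-2", "sku-4", "sku-7"]    # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([vector_hits, text_hits])
print(fused[:10])  # only this short head would go on to the reranker
```

Documents that appear in both lists rise to the top, so the expensive Cross-Encoder only ever sees a small, already plausible candidate set.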
A Real-World Scenario: Cost Analysis on a 2 Million Document Dataset
In the backend infrastructure of a side project I developed, I needed to store the semantic search index for approximately 2 million documents. Initially, I used the starter package of a popular cloud vector database service. However, as the data grew, the bill ballooned to $350 per month. Since the side project's revenue couldn't yet cover this bill, I quickly moved to an alternative architecture.
I set up Qdrant on my own rented bare-metal server with 64 GB of RAM. By setting limits within Docker Compose, I brought resource consumption under control:
version: '3.8'
services:
  qdrant:
    image: qdrant/qdrant:v1.8.0
    container_name: qdrant_node
    ports:
      - "6333:6333"
    volumes:
      - ./qdrant_storage:/qdrant/storage
    deploy:
      resources:
        limits:
          memory: 16g
        reservations:
          memory: 8g
    restart: always
After this transition, my monthly cost dropped to just the server rental fee of $45. Moreover, by enabling Qdrant's on-disk payload feature, I managed to cap RAM usage at 12 GB.
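For reference, on-disk payload is set when the collection is created. The sketch below builds the request for Qdrant's REST endpoint `PUT /collections/{name}` using only the standard library, reusing the HNSW and quantization settings from earlier in this post. The collection name `documents`, the `Cosine` distance, and the helper name are assumptions for illustration; the request is constructed but not sent.

```python
import json
import urllib.request

def build_create_collection_request(base_url, name, dim):
    """Build (but do not send) a create-collection request with on-disk payload."""
    body = {
        "vectors": {"size": dim, "distance": "Cosine"},  # distance is an assumption
        "hnsw_config": {"m": 16, "ef_construct": 100, "full_scan_threshold": 10000},
        "quantization_config": {
            "scalar": {"type": "int8", "quantile": 0.99, "always_ram": True}
        },
        "on_disk_payload": True,  # keep payloads on disk instead of in RAM
    }
    return urllib.request.Request(
        url=f"{base_url}/collections/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

request = build_create_collection_request("http://localhost:6333", "documents", 1536)
print(request.get_method(), request.full_url)
# To actually create the collection: urllib.request.urlopen(request)
```

The port matches the one exposed in the Docker Compose file above; payload fields you filter on frequently may still deserve a payload index so they stay fast.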
Conclusion: Which Database for Which Scenario?
When choosing a vector database, you need to honestly assess your project's scale. If you are already using PostgreSQL and your dataset is small, there's no need for fancy solutions; stick with pgvector. However, if you aim to build an independent, high-performance, and low-cost search infrastructure, my clear preference is Qdrant.
In my next post, I will examine advanced chunking strategies I use to optimize the context window in Large Language Models (LLMs) and their direct impact on vector search accuracy.