Scaling pgvector: Memory, Quantization, and Index Build Strategies
pgvector handles small-scale vector search effortlessly. A few hundred thousand embeddings with an HNSW index, and similarity queries return in milliseconds. But push past a million vectors and three problems converge -- and if you haven't planned for them, they all hit at once.
Three Walls at Scale
Wall 1: HNSW Index Builds Need Massive Memory
Building an HNSW index is fastest when the entire graph fits in memory during construction. If maintenance_work_mem is too low (the default is 64 MB), PostgreSQL falls back to a disk-based build that runs 10-50x slower. For 5 million vectors at 1536 dimensions, a fully in-memory build needs on the order of 60 GB of working memory (see the estimate in Step 1 below).
Most teams discover this during the build, after waiting hours with no progress indication.
Wall 2: Full-Precision Vectors Eat Storage
Each dimension in a vector column is stored as a 4-byte float32. A 1536-dimension embedding takes ~6 KB per row. At 10 million rows, that's 60 GB for the vector column alone -- before the HNSW index, which can be 1.5-2x the raw data size.
Wall 3: Indexes Outgrow RAM
An HNSW index on 10 million 1536-dimension vectors can reach 80-120 GB. When that exceeds available memory, every graph traversal during a similarity search hits disk. Sub-millisecond queries become multi-second queries. No query tuning fixes a capacity problem.
Detecting the Problem
Start by understanding your vector landscape:
-- Check vector column dimensions and approximate per-value storage
-- (a vector value is ~4 bytes per dimension plus an 8-byte header)
SELECT
attrelid::regclass AS table_name,
attname AS column_name,
atttypmod AS dimensions,
pg_size_pretty((atttypmod * 4 + 8)::bigint) AS approx_value_size
FROM pg_attribute
WHERE atttypid = 'vector'::regtype
AND NOT attisdropped
ORDER BY attrelid;
-- Check maintenance_work_mem (critical for HNSW builds)
SHOW maintenance_work_mem;
-- Check index build progress (run during active builds)
SELECT
phase,
tuples_done,
tuples_total,
round(100.0 * tuples_done / NULLIF(tuples_total, 0), 1) AS pct_done
FROM pg_stat_progress_create_index;
Anything above 768 dimensions at millions of rows deserves attention. If maintenance_work_mem is under 1 GB and you have millions of vectors, you are almost certainly hitting the slow disk-based build path.
Check your index sizes relative to available memory:
SELECT
indexrelname AS index_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan AS scans,
idx_tup_read AS tuples_read
FROM pg_stat_user_indexes
WHERE indexrelname LIKE '%vector%'
OR indexrelname LIKE '%hnsw%'
OR indexrelname LIKE '%ivfflat%'
ORDER BY pg_relation_size(indexrelid) DESC;
If any index exceeds 50-70% of your system's RAM, expect cache pressure during searches.
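As a rough first check, you can compare the combined size of your vector indexes against shared_buffers. A sketch -- keep in mind shared_buffers is only part of the picture, since the OS page cache also holds index pages:

```sql
-- Total vector-index footprint vs the configured buffer cache
SELECT
pg_size_pretty(sum(pg_relation_size(indexrelid))) AS total_vector_index_size,
current_setting('shared_buffers') AS shared_buffers
FROM pg_stat_user_indexes
WHERE indexrelname LIKE '%vector%'
OR indexrelname LIKE '%hnsw%'
OR indexrelname LIKE '%ivfflat%';
```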
Fixing It: Quantization + Memory Tuning
Step 1: Set Memory for Fast Index Builds
-- Session-level setting, resets when connection closes
SET maintenance_work_mem = '4GB';
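If you would rather not carry the setting for the rest of the session, SET LOCAL scopes it to a single transaction. A sketch -- the table and index names are placeholders, and note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so this form uses a plain CREATE INDEX:

```sql
BEGIN;
SET LOCAL maintenance_work_mem = '8GB';  -- reverts automatically at COMMIT
CREATE INDEX idx_documents_hnsw
ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
COMMIT;
```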
Rough memory estimate for HNSW builds: N * D * 4 bytes * 2 (the 2x factor accounts for graph overhead). For 5 million rows of 1536-dimension vectors: ~60 GB.
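The estimate is simple enough to script. A minimal sketch of the rule of thumb above -- the 2x overhead factor is a rough approximation, not an exact figure:

```python
def hnsw_build_memory_gb(n_vectors: int, dimensions: int,
                         bytes_per_dim: int = 4) -> float:
    """Rough peak memory (decimal GB) for an in-memory HNSW build:
    N * D * bytes_per_dim * 2, where 2x covers graph overhead."""
    return n_vectors * dimensions * bytes_per_dim * 2 / 1e9

# 5 million rows of 1536-dimension float32 vectors:
print(round(hnsw_build_memory_gb(5_000_000, 1536), 1))     # ~61.4 GB
# halfvec (float16, 2 bytes per dimension) halves the estimate:
print(round(hnsw_build_memory_gb(5_000_000, 1536, 2), 1))  # ~30.7 GB
```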
Step 2: Use halfvec for 2x Storage Reduction
halfvec uses float16 instead of float32. Storage and index size drop by half with less than 1% recall degradation on normalized embeddings:
ALTER TABLE documents
ADD COLUMN embedding_half halfvec(384);
UPDATE documents
SET embedding_half = embedding::halfvec(384);
CREATE INDEX CONCURRENTLY idx_documents_half_hnsw
ON documents USING hnsw (embedding_half halfvec_cosine_ops)
WITH (m = 16, ef_construction = 200);
For most production use cases, halfvec should be your default. A 60 GB vector column becomes 30 GB. Index builds run twice as fast. Recall stays above 99% for cosine similarity on normalized embeddings.
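One caveat on the backfill: the single UPDATE above rewrites the whole table in one transaction. On tens of millions of rows, a batched variant keeps row locks short and WAL bursts small. A sketch, assuming an integer primary key id:

```sql
-- Run repeatedly (e.g., from a cron job or a client-side loop)
-- until it reports 0 rows updated
UPDATE documents
SET embedding_half = embedding::halfvec(384)
WHERE id IN (
SELECT id FROM documents
WHERE embedding_half IS NULL
LIMIT 10000
);
```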
Step 3: Consider Binary Quantization for Pre-Filtering
For 32x storage reduction, binary quantization reduces each dimension to a single bit:
ALTER TABLE documents
ADD COLUMN embedding_bit bit(384);
Recall degrades significantly for standalone searches, but bit indexes work well as pre-filters: retrieve a candidate set quickly, then re-rank using full-precision vectors.
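A sketch of that two-stage pattern -- binary_quantize, the <~> Hamming-distance operator, and the bit_hamming_ops operator class require a recent pgvector (0.7+), so verify against your installed version; $1 stands in for the full-precision query vector:

```sql
-- Populate and index the bit column
UPDATE documents SET embedding_bit = binary_quantize(embedding)::bit(384);

CREATE INDEX CONCURRENTLY idx_documents_bit_hnsw
ON documents USING hnsw (embedding_bit bit_hamming_ops);

-- Stage 1: cheap Hamming-distance pre-filter pulls ~200 candidates;
-- Stage 2: exact cosine re-rank on the full-precision column.
SELECT id
FROM (
SELECT id, embedding
FROM documents
ORDER BY embedding_bit <~> binary_quantize($1)::bit(384)
LIMIT 200
) AS candidates
ORDER BY embedding <=> $1
LIMIT 10;
```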
Step 4: Tune Index and Search Parameters
HNSW build parameters:
- m = 16-32: Connections per node. Higher improves recall, increases size.
- ef_construction = 100-200: Candidate list size during build. Higher improves recall, increases build time.
Query-time parameter:
SET hnsw.ef_search = 100; -- default is 40, increase for better recall
Profile the trade-off: if going from ef_search = 40 to 100 improves recall by 5% but doubles query time from 2ms to 4ms, decide based on your application's requirements.
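SET LOCAL keeps the experiment from leaking into other queries on the same connection, which is useful while A/B-testing recall. A sketch, with $1 standing in for the query vector:

```sql
BEGIN;
SET LOCAL hnsw.ef_search = 100;  -- applies only inside this transaction
SELECT id FROM documents ORDER BY embedding <=> $1 LIMIT 10;
COMMIT;
```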
Preventing Future Problems
Start with halfvec from day one. Retrofitting quantization onto 50 million full-precision vectors is painful. Starting with reduced precision costs almost nothing.
Track the index-to-RAM ratio. When your vector index reaches 60% of available RAM, start planning the next scaling step.
Plan for index rebuilds. HNSW indexes cannot be updated incrementally in the background. If your data changes frequently, budget time and memory for periodic rebuilds.
The index-size query from "Detecting the Problem" doubles as a monitoring query: run it on a schedule and alert when any vector index crosses your chosen RAM threshold.
Watch for sequential scans on tables that have vector indexes. If the planner is choosing sequential scans over the index, it's usually because the index doesn't fit in memory or the query isn't structured to use the right operator class. Both signal that your scaling strategy needs adjustment.
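EXPLAIN makes the mismatch visible. For the halfvec index created in Step 2, the ORDER BY operator and the cast must line up with the index's operator class. A sketch, with $1 standing in for the query vector:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM documents
ORDER BY embedding_half <=> $1::halfvec(384)
LIMIT 10;
-- An "Index Scan using idx_documents_half_hnsw" node means the index is used;
-- a Seq Scan here points at a memory or operator-class problem.
```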
Originally published at mydba.dev/blog/pgvector-scaling-large-datasets