Philip McClarence

Your pgvector Queries Are Doing Sequential Scans -- Here's Why

pgvector makes it easy to store embeddings in PostgreSQL. Too easy, arguably -- because the thing it does not do is create indexes automatically. You add a vector(1536) column, start inserting embeddings from OpenAI or Cohere, write a similarity search query, and it works. Fast, even. On a few hundred rows, a brute-force sequential scan that compares every vector takes milliseconds.

Then the table grows. At 50,000 rows, the same query takes 800ms. At 500,000, it takes 8 seconds. The query hasn't changed. The schema hasn't changed. The only difference is that PostgreSQL is computing cosine distance against every single row in the table, and that's O(n) with expensive 1536-dimensional floating point math per row.
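The effect is visible in the query plan. A minimal sketch, assuming a hypothetical documents table with a vector(1536) embedding column and an abbreviated vector literal:

```sql
EXPLAIN ANALYZE
SELECT id FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector  -- full 1536-element literal in practice
LIMIT 10;

-- Without an index, the plan is roughly:
--   Limit
--     ->  Sort  (top-N heapsort)
--           ->  Seq Scan on documents
```

The Sort sitting on top of a Seq Scan is the tell: PostgreSQL computes the distance for every row, then sorts all of them just to keep ten.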

Finding Unindexed Vector Columns

pgvector registers a vector type in pg_type. You can query the system catalogs to find every vector column that has no index:

SELECT
    a.attrelid::regclass AS table_name,
    a.attname AS column_name,
    a.atttypmod AS dimensions  -- -1 if declared without dimensions
FROM pg_attribute a
WHERE a.atttypid = 'vector'::regtype
  AND a.attnum > 0
  AND NOT a.attisdropped
  AND NOT EXISTS (
      SELECT 1 FROM pg_index i
      WHERE i.indrelid = a.attrelid
        AND a.attnum = ANY(i.indkey)
  );

Every row returned is a vector column doing brute-force scans on every similarity search. No HNSW, no IVFFlat, nothing.

To see how much damage this is causing, check the scan statistics:

SELECT
    schemaname, relname,
    seq_scan, idx_scan,
    seq_tup_read
FROM pg_stat_user_tables
WHERE seq_scan > 100
  AND relid IN (
      SELECT attrelid
      FROM pg_attribute
      WHERE atttypid = 'vector'::regtype
  );

A table with thousands of sequential scans, zero index scans, and seq_tup_read orders of magnitude above the row count is computing brute-force distance on every query.


Why This Gets Missed

Regular B-tree indexes in PostgreSQL are usually created alongside constraints -- a unique constraint on email implies an index, a foreign key on customer_id gets one during schema review. Vector columns have no such triggers. There's no unique constraint, no foreign key pattern. The column exists, embeddings flow in, and nobody thinks about indexing until latency spikes.

The testing environment makes this worse. Similarity search on 500 rows is always fast, indexed or not. The test suite passes. The PR gets merged. The problem only appears at production data volumes.

The Fix: HNSW Index

HNSW (Hierarchical Navigable Small World) is the recommended default for most pgvector workloads. It builds a multi-layered graph that enables fast approximate nearest-neighbour search:

SET maintenance_work_mem = '2GB';
CREATE INDEX CONCURRENTLY idx_docs_embedding
ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

Three things to get right:

maintenance_work_mem must be increased. The default 64MB is far too low for vector index construction. HNSW needs to hold the graph in memory as it processes each vector. Set it to 1-2GB for the index build session.

The operator class must match your queries. If your queries use <=> (cosine distance), use vector_cosine_ops. For <-> (L2/Euclidean), use vector_l2_ops. For <#> (inner product), use vector_ip_ops. A mismatch means the planner silently falls back to a sequential scan -- your index exists but is never used.
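A sketch of the mismatch, using the hypothetical documents table and an abbreviated vector literal:

```sql
-- Index built for L2 distance
CREATE INDEX idx_docs_l2
ON documents USING hnsw (embedding vector_l2_ops);

-- Cosine query: the L2 index cannot serve it, so the planner
-- falls back to a sequential scan
EXPLAIN SELECT id FROM documents
ORDER BY embedding <=> '[0.1, ...]'::vector LIMIT 10;

-- L2 query: matches the operator class, uses the index
EXPLAIN SELECT id FROM documents
ORDER BY embedding <-> '[0.1, ...]'::vector LIMIT 10;
```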

ef_construction controls recall accuracy. Higher values build a denser graph with better recall but slower build time. The default of 64 is conservative. For production where accuracy matters, 200 is a reasonable starting point. At query time, you can also tune hnsw.ef_search (default 40) to trade latency for recall.
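At query time the trade-off looks like this -- a sketch, scoped with SET LOCAL so the setting doesn't leak past the transaction:

```sql
BEGIN;
-- Default is 40; higher values scan more of the graph:
-- better recall, higher latency
SET LOCAL hnsw.ef_search = 100;

SELECT id FROM documents
ORDER BY embedding <=> '[0.1, ...]'::vector
LIMIT 10;
COMMIT;
```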

After building, verify it's working:

EXPLAIN ANALYZE
SELECT * FROM documents
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;

You should see Index Scan using idx_docs_embedding. If you still see Seq Scan, check the operator class, run ANALYZE documents, and confirm enable_indexscan is on.


Preventing It

The simplest prevention: create the HNSW index in the same migration that adds the vector column. Don't make it a follow-up task. Don't wait until data arrives. The index can be built on an empty table and will be ready when the first embedding is inserted.
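A sketch of such a migration, with hypothetical names:

```sql
-- Same migration: column and index land together
ALTER TABLE documents ADD COLUMN embedding vector(1536);

SET maintenance_work_mem = '1GB';
CREATE INDEX idx_docs_embedding
ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);
```

Note that CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so if your migration tool wraps each migration in a transaction, the plain CREATE INDEX above is the right form -- and on an empty table it completes instantly anyway.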

Beyond that:

  • Decide on the distance function before writing queries. Cosine, L2, and inner product each require a different operator class. Mixing operators on the same column means you either need multiple indexes or accept sequential scans on some queries.
  • Test with realistic vector counts. At minimum, test similarity search against 100,000+ rows. At that scale, the difference between indexed and unindexed is obvious -- milliseconds versus seconds.
  • Monitor for new vector columns. Tables evolve. A vector column added three months after the initial schema might skip the indexing step. Periodic checks against pg_attribute catch these gaps.
  • Set maintenance_work_mem in migration scripts. Don't rely on the server default. Explicitly set it before CREATE INDEX so the build doesn't spill to disk.
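For the realistic-volume testing above, synthetic embeddings are enough to expose the scan behaviour. A sketch with hypothetical names -- random vectors won't reproduce real recall numbers, but plan shape and latency scaling show up the same:

```sql
-- Hypothetical load test: 100,000 random 1536-dimension vectors
CREATE TABLE embed_test (
    id        bigserial PRIMARY KEY,
    embedding vector(1536)
);

-- Referencing g1.i correlates the subquery so each row gets
-- a fresh random array instead of one shared InitPlan result
INSERT INTO embed_test (embedding)
SELECT ARRAY(
    SELECT random()
    FROM generate_series(1, 1536)
    WHERE g1.i IS NOT NULL
)::vector
FROM generate_series(1, 100000) AS g1(i);

ANALYZE embed_test;
```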

The fix is one SQL statement. The challenge is remembering to run it before production data volumes make the problem impossible to ignore.


Originally published at mydba.dev/blog/pgvector-missing-index
