pgvector is the fastest way to add vector search to an existing PostgreSQL database. One extension, a few SQL commands, and you have similarity search running alongside your relational data. No new infrastructure. No new SDK. No vendor lock-in.
That simplicity is also its trap. Most teams add pgvector in a day and spend the next six months debugging performance issues that have nothing to do with the extension itself. The problems are almost always configuration mistakes that tutorials skip over.
Here are four I have seen break RAG pipelines in production, and how to fix each one before your team starts debating a migration to Pinecone.
No HNSW Index Means Full Table Scans
By default, pgvector performs exact nearest neighbor search. That means it scans every single row in the table on every query. For a prototype with 10,000 vectors, this is invisible. At 500,000 vectors, queries start crossing 800 milliseconds. At a million, you are looking at multi-second response times that make your RAG pipeline feel broken.
The fix is a single SQL statement: create an HNSW index on your vector column. HNSW (Hierarchical Navigable Small World) is an approximate nearest neighbor algorithm that trades a tiny amount of accuracy for massive speed improvements. After adding the index, the same 500K-vector query drops to under 50 milliseconds.
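As a sketch, assuming a table named documents with a vector column named embedding (both names are illustrative, not from the original), the index and a matching top-K query look like this. Note that the operator class must match the distance operator you query with — vector_cosine_ops pairs with the <=> operator:

```sql
-- Build an HNSW index for cosine distance on a hypothetical
-- documents(id, content, embedding vector(1536)) table.
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);  -- pgvector's defaults

-- The planner can now serve top-K searches from the index:
SELECT id, content
FROM documents
ORDER BY embedding <=> '[0.12, -0.03, ...]'  -- your query vector here
LIMIT 5;
```

The m and ef_construction values shown are pgvector's defaults; raising them improves recall at the cost of build time and index size.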
The reason this catches teams off guard is that pgvector works perfectly without the index. There is no warning, no error, no degradation signal. It just gets slower as data grows, and most teams blame the embedding model or the LLM before they check the database.
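Since there is no warning, the check has to be manual. One way (table and column names are illustrative) is to ask the planner directly:

```sql
-- Sanity-check whether a similarity query actually uses the index:
EXPLAIN ANALYZE
SELECT id
FROM documents
ORDER BY embedding <=> '[0.12, -0.03, ...]'  -- your query vector here
LIMIT 5;
-- An "Index Scan" on an hnsw index means you are covered;
-- a "Seq Scan" followed by a Sort means every row is being
-- compared on every query.
```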
Dimensionality Is Not Free
OpenAI's ada-002 embedding model outputs vectors with 1,536 dimensions. At four bytes per dimension, each vector consumes roughly 6 kilobytes of storage. Scale that to one million documents and you are looking at 6 gigabytes just for the embeddings column, before accounting for the HNSW index overhead, which can double or triple the total.
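You can verify the footprint directly with PostgreSQL's built-in sizing functions (the table name is illustrative):

```sql
-- Rough sizing rule: vector(1536) ≈ 1536 × 4 bytes ≈ 6 KB per row.
-- Compare the heap against the index overhead on a real table:
SELECT
  pg_size_pretty(pg_table_size('documents'))          AS table_size,
  pg_size_pretty(pg_indexes_size('documents'))        AS index_size,
  pg_size_pretty(pg_total_relation_size('documents')) AS total_size;
```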
This matters because your AWS or cloud bill is not driven by the LLM API calls most teams obsess over. It is driven by the RDS instance size and storage needed to hold and index those vectors. A db.r6g.xlarge running pgvector with a million high-dimensional vectors costs real money every month.
The alternative is to use a smaller embedding model. Cohere's embed-english-light-v3.0 outputs 384 dimensions and performs competitively on most retrieval benchmarks. That cuts storage by 75 percent and proportionally reduces index build time, memory usage, and query latency. Unless your use case specifically requires the nuance of 1,536 dimensions, smaller is almost always the right production choice.
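The dimension is fixed in the column definition, so switching models means declaring a new column (or table) and re-embedding — you cannot cast a 1,536-dimension vector down to 384. A minimal sketch, with illustrative names:

```sql
-- pgvector enforces the declared dimension at insert time,
-- so the column must match the embedding model's output size.
CREATE TABLE documents_small (
  id        bigserial PRIMARY KEY,
  content   text,
  embedding vector(384)   -- matches a 384-dimension model
);
```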
Wrong Distance Function, Wrong Results
Most tutorials use cosine similarity as the default, and most teams never question it. But pgvector supports three distance operators: cosine distance, (negative) inner product, and L2 (Euclidean) distance. Each one measures "similarity" differently, and the choice directly affects which documents appear in your top-K results.
Cosine distance measures the angle between vectors, ignoring magnitude. Inner product considers both direction and magnitude; when your embeddings are already normalized to unit length (as most modern embedding models produce), it ranks results identically to cosine while skipping the normalization step, making it the cheaper choice. L2 distance measures the straight-line distance between vector endpoints, which works best when magnitude carries meaningful information.
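To make the divergence concrete, here is a toy comparison on a hypothetical 3-dimensional table (purely for illustration) showing all three operators side by side:

```sql
-- Toy table: row 2 points the same direction as the query
-- but with twice the magnitude.
CREATE TABLE demo (id int, embedding vector(3));
INSERT INTO demo VALUES
  (1, '[1, 0, 0]'),
  (2, '[2, 0, 0]'),
  (3, '[0, 1, 0]');

SELECT id,
       embedding <->  '[1, 0, 0]' AS l2_distance,        -- Euclidean
       embedding <#>  '[1, 0, 0]' AS neg_inner_product,  -- more negative = more similar
       embedding <=>  '[1, 0, 0]' AS cosine_distance     -- 1 - cosine similarity
FROM demo
ORDER BY cosine_distance;
```

Rows 1 and 2 tie under cosine distance (same direction), but inner product prefers row 2 and L2 prefers row 1 — exactly the kind of divergence that reshuffles a top-K result set.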
The practical impact is real. I have seen cases where switching from cosine to inner product on the same dataset changed three of the top five results. If your RAG pipeline returns mediocre answers and you have already tuned your chunking strategy and prompt, check the distance function before anything else. It is a one-line configuration change that can transform result quality.
Know the Scaling Ceiling
pgvector is not a dedicated vector database. It is an extension that adds vector operations to PostgreSQL, and PostgreSQL was not designed to be a vector search engine at scale. In practice, pgvector handles up to about five million vectors comfortably on a db.r6g.xlarge instance with proper HNSW indexing. Past ten million vectors, expect query performance to degrade under concurrent load, and index build times to become a deployment bottleneck.
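Before concluding you have hit the ceiling, it is worth knowing the one query-time knob HNSW exposes: hnsw.ef_search, the number of candidates examined per query (pgvector's default is 40). Raising it trades latency for recall, which can buy headroom as the table grows:

```sql
-- Per-session: examine more candidates per query for better
-- recall, at the cost of latency (default is 40).
SET hnsw.ef_search = 100;

-- Or scope the change to a single transaction:
BEGIN;
SET LOCAL hnsw.ef_search = 100;
SELECT id FROM documents                      -- illustrative names
ORDER BY embedding <=> '[0.12, -0.03, ...]'   -- your query vector here
LIMIT 5;
COMMIT;
```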
For most teams, this ceiling is not a problem. The majority of production RAG systems index fewer than five million documents. If you are in that range and already running PostgreSQL, adding pgvector is the right call. You avoid the operational complexity of a separate vector database, keep your data in one place, and eliminate an entire category of infrastructure to manage.
If you are genuinely approaching the ten million mark, look at pgvectorscale (a companion extension from Timescale that adds a disk-based StreamingDiskANN index on top of pgvector) or evaluate a dedicated solution like Pinecone or Weaviate. But make that decision based on actual data volume, not on anxiety about future scale.
The Config Is the Bottleneck
The pattern I see repeated is predictable. Week one, a team adds pgvector and it works great. By month two, queries slow down and nobody thinks to check the index. By month four, someone proposes migrating to a managed vector database. By month six, a senior engineer adds one HNSW index and the problem disappears.
pgvector is a genuinely excellent tool for most production RAG systems. The mistakes that break it are not bugs or limitations. They are configuration gaps that tutorials gloss over and documentation buries. Fix the index, right-size the dimensions, pick the correct distance function, and know your scaling ceiling. That is the entire playbook.
What vector store is your team running in production right now?