DEV Community

Krunal Panchal
Building Production RAG Systems with pgvector: What We Learned After 50 Deployments

We've built over 50 RAG (Retrieval-Augmented Generation) systems in production. Here's what the tutorials don't tell you.

The Tutorial Version vs. Reality

Every RAG tutorial looks the same:

  1. Chunk your documents
  2. Embed with OpenAI
  3. Store in a vector database
  4. Retrieve top-K on query
  5. Pass to LLM

This works in a demo. In production, it falls apart in ways that aren't obvious until you're debugging at 2am.

Here's what we actually learned.


Why We Stopped Using Dedicated Vector Databases for Most Projects

For our first 15 RAG systems, we used Pinecone. It's great software. But we kept hitting the same problems: two databases to manage, two billing accounts, two sets of credentials, and data-sync issues whenever the source data changed.

For most applications — up to ~10M vectors, ~1B tokens of context — pgvector on PostgreSQL is sufficient and dramatically simpler.

-- Setup: add vector extension to existing Postgres
CREATE EXTENSION IF NOT EXISTS vector;

-- Create embeddings table
CREATE TABLE document_chunks (
  id BIGSERIAL PRIMARY KEY,
  document_id BIGINT REFERENCES documents(id),
  chunk_index INTEGER,
  content TEXT NOT NULL,
  embedding vector(1536),  -- OpenAI text-embedding-3-small
  metadata JSONB,
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Index for fast similarity search
CREATE INDEX ON document_chunks
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

You already have Postgres. Your app already connects to it. Your backups already cover it. For most RAG use cases, pgvector is the right call.

When to use a dedicated vector DB: 100M+ vectors, multi-tenancy with strict isolation requirements, or if you need features like namespacing and metadata filtering at massive scale.


The Chunking Problem (Where Most Systems Fail)

Chunking strategy has more impact on RAG quality than model choice. Fixed-size chunking (split every N characters) is the default and usually wrong.

What we use instead:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)

# Split on semantic boundaries first, fall back to character count
chunks = splitter.split_text(document_text)

Key decisions we've settled on after 50 deployments:

  • Chunk size: 512 tokens for most use cases. Smaller chunks = more precise retrieval. Larger chunks = more context per result. 512 is the sweet spot for document Q&A.
  • Overlap: 10% of chunk size. Without overlap, answers split across chunk boundaries get missed.
  • Semantic chunking for long documents (papers, legal contracts, manuals). Split on paragraph/section boundaries, not character count.
  • Store metadata: page number, section heading, source document, created_at. You'll need this for citations and debugging.
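To make that last point concrete, here's the shape of a chunk record with metadata attached, ready to insert into the `document_chunks` table above. The function name and field names are illustrative, not from our actual codebase:

```python
def build_chunk_records(doc_id, chunks, source, page_map=None):
    """Attach the metadata we rely on for citations and debugging."""
    records = []
    for i, text in enumerate(chunks):
        records.append({
            "document_id": doc_id,
            "chunk_index": i,
            "content": text,
            "metadata": {
                "source": source,
                # page_map is an optional {chunk_index: page_number} dict
                "page": page_map.get(i) if page_map else None,
            },
        })
    return records
```

Populating metadata at ingest time is cheap; reconstructing it later, after users start asking "where did this answer come from?", is not.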

Retrieval That Actually Works

Don't just use cosine similarity. It's necessary but not sufficient. Our production retrieval pipeline:

1. Hybrid search (vector + keyword)

-- Combine pgvector similarity with full-text search
-- $1 = query embedding (vector), $2 = query text
SELECT
  id, content, metadata,
  (1 - (embedding <=> $1)) AS vector_score,
  ts_rank(to_tsvector('english', content),
          plainto_tsquery('english', $2)) AS text_score,
  -- Weighted combination: 70% vector, 30% keyword
  (0.7 * (1 - (embedding <=> $1)) +
   0.3 * ts_rank(to_tsvector('english', content),
                 plainto_tsquery('english', $2))) AS combined_score
FROM document_chunks
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', $2)
   OR (embedding <=> $1) < 0.3  -- cosine distance threshold
ORDER BY combined_score DESC
LIMIT 10;

Pure semantic search misses exact keyword matches. Pure keyword search misses semantic similarity. Hybrid catches both.
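An alternative to weighted score combination — useful when vector and keyword scores live on incompatible scales — is reciprocal rank fusion (RRF), which merges the two result lists by rank alone. A minimal sketch (not the query we ship, just the merging idea):

```python
def rrf_merge(vector_ids, keyword_ids, k=60, top_n=10):
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents appearing high in BOTH lists float to the top
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The `k=60` constant is the commonly used default; it damps the advantage of being rank 1 versus rank 2 so one list can't dominate.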

2. Reranking

After retrieving top-10, rerank with a cross-encoder before passing to the LLM:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

scores = reranker.predict([
    (query, chunk.content)
    for chunk in retrieved_chunks
])

reranked = sorted(
    zip(retrieved_chunks, scores),
    key=lambda x: x[1],
    reverse=True
)[:5]  # Keep top 5 after reranking

Reranking improves answer quality significantly at low cost — the cross-encoder model runs locally, no API call needed.

3. Contextual compression

Instead of passing the full chunk to the LLM, extract only the relevant sentences:

# Before passing to LLM: extract relevant sentences
def extract_relevant_sentences(chunk_text, query, max_sentences=3):
    sentences = chunk_text.split('. ')
    query_terms = set(query.lower().split())
    # Score each sentence by term overlap with the query
    # (swap in embedding similarity for better results)
    scored = sorted(
        sentences,
        key=lambda s: len(query_terms & set(s.lower().split())),
        reverse=True,
    )
    return scored[:max_sentences]

This cuts context window usage by 40-60% with minimal quality loss.


The Eval Suite You Must Build Before Production

This is the most skipped step. Every RAG system needs an eval suite before launch.

# Minimum viable RAG eval set
from statistics import mean

test_cases = [
    {
        "query": "What is our refund policy?",
        "expected_source": "terms-of-service.pdf",
        "expected_answer_contains": ["30 days", "original payment method"]
    },
    # ... 20-50 more cases
]

def evaluate_rag(rag_system, test_cases):
    results = []
    for case in test_cases:
        response = rag_system.query(case["query"])
        results.append({
            "retrieval_hit": case["expected_source"] in response.sources,
            "answer_quality": all(
                term in response.answer
                for term in case["expected_answer_contains"]
            )
        })
    return {
        "retrieval_accuracy": mean(r["retrieval_hit"] for r in results),
        "answer_accuracy": mean(r["answer_quality"] for r in results)
    }

Without this, you're shipping blind. We've seen retrieval accuracy drop from 87% to 61% after an OpenAI model update — the eval suite caught it before users did.
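To catch a regression like that automatically, wire the eval metrics into a release gate. A minimal sketch — the threshold values are assumptions you'd set from your own baseline, and `gate_release` is a hypothetical name:

```python
def gate_release(metrics, min_retrieval=0.85, min_answer=0.80):
    """Fail the deploy if eval metrics regress below baseline thresholds."""
    return (metrics["retrieval_accuracy"] >= min_retrieval
            and metrics["answer_accuracy"] >= min_answer)
```

Run it in CI against the output of `evaluate_rag` so a model update, prompt change, or re-chunking can't silently ship a worse system.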


Costs at Scale

For a customer support RAG system handling 5,000 queries/day with a 50K document corpus:

| Component | Monthly Cost |
| --- | --- |
| Embeddings (OpenAI text-embedding-3-small) | ~$12/mo |
| pgvector on managed Postgres (2 vCPU / 8 GB) | $80/mo |
| LLM calls (GPT-4o-mini for answers) | ~$45/mo |
| Reranker model (runs locally) | $0 |
| **Total** | **~$137/mo** |

The same system on a dedicated vector DB + GPT-4o for everything: ~$800/mo. Model selection and pgvector are where the cost savings come from.
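If you want to sanity-check numbers like these for your own workload, the LLM line reduces to simple arithmetic. A sketch — the 2,000 tokens/query and $0.15 per 1M tokens are assumptions, so check current provider pricing before relying on them:

```python
def monthly_llm_cost(queries_per_day, tokens_per_query, price_per_1m_tokens):
    """Rough monthly LLM spend: volume x tokens x unit price."""
    monthly_tokens = queries_per_day * 30 * tokens_per_query
    return monthly_tokens * price_per_1m_tokens / 1_000_000
```

At 5,000 queries/day with ~2,000 tokens per call at an assumed $0.15/1M tokens, this lands around $45/mo — the scale of the figure in the table above.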

We go deeper on the architecture and cost breakdown for different RAG scales in our production RAG systems guide.


The Three Mistakes That Kill Production RAG Systems

1. No re-embedding strategy. When your source documents update, your embeddings go stale. Build a change detection + re-embedding pipeline from day one.

2. Ignoring retrieval failures. Log every query that returns zero results or low-confidence results. These are your highest-value improvement opportunities.

3. Skipping the eval suite. You cannot optimize what you cannot measure. Build 30 test cases before launch, run them weekly.
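For mistake 1, the simplest change-detection mechanism we know is a content hash stored alongside each embedding: re-embed only when the hash differs. A minimal sketch (the column holding `stored_hash` is an assumption about your schema):

```python
import hashlib

def needs_reembedding(content, stored_hash):
    """Compare current content against the hash recorded at embed time."""
    current = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return current != stored_hash
```

Run it over changed documents on ingest; chunks whose hash matches skip the embedding API call entirely, which also keeps re-indexing cheap.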


Building a RAG system right now? Happy to answer questions on architecture specifics.
