DEV Community

Cover image for RAG with Postgres pgvector in 2026: the full TypeScript pipeline.
GDS K S
GDS K S

Posted on

RAG with Postgres pgvector in 2026: the full TypeScript pipeline.

RAG with Postgres pgvector in 2026: the full TypeScript pipeline.

I spent a week evaluating dedicated vector databases before deciding to just use the Postgres instance I already had. The pgvector extension handles similarity search well enough for most production workloads, and it collapses three infrastructure components into one. This walkthrough covers everything from schema to answer: chunk your docs, embed them, store in pgvector, retrieve by cosine similarity, and wire the results into an LLM call.

TL;DR

Step Tool Why
Enable vector store pgvector 0.8.x, HNSW index Runs in your existing Postgres, no extra infra
Embed text-embedding-3-small (1,536 dims) $0.02 per million tokens, fast
Query <=> cosine distance, top-k Works with both OpenAI and Voyage models
Augment Claude or GPT-4o with retrieved docs Context window stuffed, hallucination rate drops

1. Why pgvector instead of a dedicated vector database

Pinecone and Weaviate are good products. If you need multi-tenant isolation, sub-millisecond p99 at 100M+ vectors, or native hybrid search with BM25, they earn their place. For most teams, those are future problems.

The cost calculus changes when you consider ops burden. A dedicated vector DB means a new billing line, a new set of credentials to rotate, a new failure mode to track, and a new SDK to keep current in your application. pgvector runs as a Postgres extension: one connection string, one backup strategy, one source of truth. At 10M documents with 1,536-dimensional embeddings, an HNSW index on a reasonably sized Postgres instance returns top-10 results in under 10ms. That covers the overwhelming share of RAG use cases.

pgvector 0.8.0 added iterative HNSW scans. That release made filtered similarity search practical without falling back to sequential scans every time a WHERE clause got specific. The 0.8.0 release was what tipped my team from "maybe later" to "ship it."

2. Schema setup

Enable the extension once per database, then create your table.

-- enable pgvector (run once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- documents table
CREATE TABLE documents (
  id         BIGSERIAL PRIMARY KEY,
  source     TEXT NOT NULL,          -- filename, URL, or ID of source doc
  chunk_idx  INT NOT NULL,           -- chunk number within the source
  content    TEXT NOT NULL,          -- raw text of the chunk
  embedding  vector(1536) NOT NULL,  -- OpenAI text-embedding-3-small
  created_at TIMESTAMPTZ DEFAULT NOW()
);
Enter fullscreen mode Exit fullscreen mode

Choosing between HNSW and IVFFlat

HNSW builds a navigable small-world graph. Queries scan the graph instead of comparing all rows. Build once, query immediately. The tradeoff is that the index takes more memory: roughly 8 bytes per dimension per row for a 1,536-dim column at default settings.

IVFFlat partitions the embedding space into centroid clusters. Faster to build, smaller memory footprint, but you must load rows before building the index or the centroid assignment is useless. If you are starting from zero rows, build HNSW.

-- HNSW index (recommended default)
-- m = connections per layer (default 16), higher = better recall at higher memory cost
-- ef_construction = candidate list during build (default 64), higher = better recall at slower build
CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- IVFFlat alternative (only after loading rows)
-- lists = sqrt(row_count) is a good starting point for large tables
-- CREATE INDEX ON documents USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
Enter fullscreen mode Exit fullscreen mode

Use vector_cosine_ops with the <=> operator when your embedding model normalizes vectors (OpenAI and Voyage both do). Use vector_l2_ops with <-> for raw Euclidean distance when vectors are not normalized. Use vector_ip_ops with <#> for inner product, which equals cosine similarity on normalized vectors and saves one normalization step.

3. Ingest pipeline in TypeScript

The ingest function chunks a document, calls the embedding API, and bulk inserts rows. Use postgres (the npm package, not pg) for its tagged-template SQL and native array support.

import postgres from "postgres";
import OpenAI from "openai";

const sql = postgres(process.env.DATABASE_URL!);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

const CHUNK_SIZE = 512;   // tokens, not characters
const CHUNK_OVERLAP = 64; // tokens of overlap between adjacent chunks

function chunkText(text: string, size: number, overlap: number): string[] {
  // naive word-boundary chunker — swap for tiktoken in production
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  let start = 0;
  while (start < words.length) {
    const end = Math.min(start + size, words.length);
    chunks.push(words.slice(start, end).join(" "));
    start += size - overlap;
  }
  return chunks;
}

async function embedBatch(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return response.data.map((d) => d.embedding);
}

export async function ingestDocument(source: string, text: string): Promise<void> {
  const chunks = chunkText(text, CHUNK_SIZE, CHUNK_OVERLAP);

  // embed in batches of 100 (OpenAI max batch size)
  const BATCH = 100;
  for (let i = 0; i < chunks.length; i += BATCH) {
    const batch = chunks.slice(i, i + BATCH);
    const embeddings = await embedBatch(batch);

    const rows = batch.map((content, j) => ({
      source,
      chunk_idx: i + j,
      content,
      embedding: JSON.stringify(embeddings[j]),
    }));

    await sql`
      INSERT INTO documents (source, chunk_idx, content, embedding)
      SELECT
        r.source,
        r.chunk_idx::int,
        r.content,
        r.embedding::vector
      FROM jsonb_to_recordset(${JSON.stringify(rows)}::jsonb)
        AS r(source text, chunk_idx text, content text, embedding text)
    `;
  }

  console.log(`[ingest] ${source}: ${chunks.length} chunks stored`);
}
Enter fullscreen mode Exit fullscreen mode

A note on chunk size: 512 words is a starting point. The right size depends on your source material. Legal documents with dense paragraphs do better at 256 words. Code files need at least 300 lines or you lose function context. The overlap prevents the embedding from missing a sentence that straddles a chunk boundary.

4. Query pipeline in TypeScript

Embed the user's question, run a top-k cosine similarity search, return the matching chunks.

export async function queryDocuments(
  question: string,
  topK = 5,
): Promise<Array<{ source: string; content: string; distance: number }>> {
  // embed the question with the same model used at ingest time
  const [embedding] = await embedBatch([question]);
  const embeddingStr = JSON.stringify(embedding);

  const rows = await sql<{ source: string; content: string; distance: number }[]>`
    SELECT
      source,
      content,
      (embedding <=> ${embeddingStr}::vector) AS distance
    FROM documents
    ORDER BY embedding <=> ${embeddingStr}::vector
    LIMIT ${topK}
  `;

  return rows;
}
Enter fullscreen mode Exit fullscreen mode

The <=> operator returns cosine distance (0 = identical, 2 = opposite). Lower numbers win. If you add metadata filters, add them in the WHERE clause before ORDER BY so the planner can use the HNSW iterative scan introduced in 0.8.0.

// filtered query example — same model must have returned results for this source
const rows = await sql<{ source: string; content: string; distance: number }[]>`
  SELECT source, content, (embedding <=> ${embeddingStr}::vector) AS distance
  FROM documents
  WHERE source = ${filterSource}
  ORDER BY embedding <=> ${embeddingStr}::vector
  LIMIT ${topK}
`;
Enter fullscreen mode Exit fullscreen mode

5. Wiring retrieved docs into an LLM call

Concatenate the retrieved chunks into a context block, then call your model of choice. Claude 3.5 Sonnet or GPT-4o both handle long contexts well. Keep the context block under 80,000 tokens for cost reasons.

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! });

export async function answerWithRAG(question: string): Promise<string> {
  const docs = await queryDocuments(question, 5);

  if (docs.length === 0) {
    return "No relevant documents found.";
  }

  const context = docs
    .map((d, i) => `[${i + 1}] (${d.source})\n${d.content}`)
    .join("\n\n---\n\n");

  const prompt = `You are a helpful assistant. Answer the question using only the provided context.
If the context does not contain the answer, say so.

Context:
${context}

Question: ${question}`;

  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6-20250929",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });

  const block = response.content[0];
  return block.type === "text" ? block.text : "";
}
Enter fullscreen mode Exit fullscreen mode

The "answer using only the provided context" instruction is load-bearing. Without it, the model mixes retrieval with parametric memory and you cannot tell which is which. If the answer comes from the context, citations work. If it comes from training data, they do not. Force the distinction at the prompt level.

One more thing worth noting: rerank before you send to the LLM. A fast cosine search returns the 5 closest chunks by vector distance, but distance does not always equal usefulness. A cross-encoder reranker (Cohere Rerank costs about $1 per 1,000 queries) takes your top-20 candidates and scores them for actual relevance before you trim to 5. The quality jump is noticeable. Skip the reranker while prototyping, add it before you hit production.

6. Two gotchas that bite everyone

Chunk size drives recall more than index parameters

Most teams spend hours tuning HNSW m and ef_construction and see marginal gains. The actual lever is chunk size and overlap. A chunk that is too short loses context (the model cannot answer a cross-sentence question). A chunk that is too long pulls in noise, dilutes the embedding, and wastes context window in the LLM call. Run a quick eval: take 20 representative questions, retrieve top-5, then manually score whether the answer appeared in the returned chunks. Adjust chunk size in 100-word steps until recall tops 85%. Then tune the index.

Build the index after bulk loading, not before

HNSW indexing at insert time is slow. If you load 500,000 documents and the HNSW index exists, every INSERT pays the graph update cost. The fast path: load all rows with the index dropped, then build it once with CREATE INDEX. On a table of 500,000 rows with 1,536-dim embeddings, a cold HNSW build takes roughly 8 to 12 minutes on 4 vCPUs. That is far cheaper than the cumulative insert overhead.

-- drop the index before bulk load
DROP INDEX IF EXISTS documents_embedding_idx;

-- ... run your ingest pipeline ...

-- rebuild once after load
CREATE INDEX documents_embedding_idx
  ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
Enter fullscreen mode Exit fullscreen mode

The bottom line

The full pipeline is about 120 lines of TypeScript and three SQL statements. pgvector 0.8.x is stable enough for production, HNSW is the right default index for most teams, and the two things that matter most for answer quality are chunk size and staying consistent between embed-at-ingest and embed-at-query time (same model, same preprocessing). Dedicated vector DBs are not wrong, they are just a layer you do not need until your row count passes 50M or your recall requirements get strict enough to warrant a tuning team.

What chunk size worked best for your use case? Drop it in the comments.


GDS K S · thegdsks.com · follow on X @thegdsks

Good retrieval beats a better model every time.

Top comments (0)