DEV Community

Željko Šević

Posted on • Originally published at sevic.dev on

RAG with OpenAI Embeddings, pgvector and LangChain

Retrieval-Augmented Generation (RAG) is a practical pattern: store knowledge as embeddings, retrieve the most relevant chunks with semantic search, then generate an answer grounded in that context.

This guide shows a simple end-to-end flow with OpenAI embeddings, PostgreSQL + pgvector, and LangChain chunking.

Prerequisites

  • OpenAI account
  • Generated API key
  • Enabled billing
  • Node.js version 26
  • PostgreSQL with pgvector extension enabled
  • npm packages: openai, langchain, pg, pgvector

What are embeddings?

Embeddings are numeric vectors that represent the semantic meaning of text. Similar text should produce vectors that are close in vector space.

In practice:

  • Convert document chunks to vectors and store them in pgvector
  • Convert a user question to a vector
  • Run a nearest-neighbor search to find the most relevant chunks
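
To make the nearest-neighbor step concrete, here is a minimal in-memory sketch. It uses made-up three-dimensional vectors instead of real embeddings, and the helper names (cosineDistance, nearest) are illustrative, not part of any library. pgvector does the same kind of ranking inside the database.

```javascript
// Cosine distance = 1 - cosine similarity; 0 means the vectors point the same way.
function cosineDistance(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k: score every stored vector, then sort ascending by distance.
function nearest(queryVector, items, k) {
  return items
    .map((item) => ({ ...item, distance: cosineDistance(queryVector, item.vector) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

// Toy data (assumption): real embeddings have hundreds or thousands of dimensions.
const toyItems = [
  { text: 'about databases', vector: [1, 0, 0] },
  { text: 'about cooking', vector: [0, 1, 0] },
  { text: 'about databases and search', vector: [0.9, 0.1, 0.3] },
];

const toyTop = nearest([1, 0, 0.1], toyItems, 2);
console.log(toyTop.map((item) => item.text));
```

A brute-force scan like this is fine for a handful of vectors; the database takes over once the collection grows.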

OpenAI client setup

import OpenAI from 'openai';

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

Embedding one input element

Use a single string when embedding a user query.

const response = await client.embeddings.create({
  model: 'text-embedding-3-small',
  input: 'How do I connect pgvector to PostgreSQL?',
});

const queryEmbedding = response.data[0].embedding;
console.log(queryEmbedding.length);

Embedding multiple input elements

Use an array to embed multiple chunks in one API call.

const chunks = [
  'pgvector adds vector similarity search to PostgreSQL.',
  'LangChain helps split long documents into retrieval-friendly chunks.',
  'RAG retrieves context first, then asks an LLM to answer.',
];

const response = await client.embeddings.create({
  model: 'text-embedding-3-small',
  input: chunks,
});

const rows = response.data.map((item, index) => ({
  text: chunks[index],
  embedding: item.embedding,
}));

console.log(rows.length); // 3
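
Each item in response.data carries an index field that points back to its position in the input array, so pairing by that field is slightly more defensive than relying on array order. The sketch below also splits a long chunk list into smaller requests. The helper names (toBatches, attachEmbeddings) and the batch size are assumptions for illustration, and a mocked data array stands in for the real API response.

```javascript
// Split an array into consecutive batches of at most `size` items.
function toBatches(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Pair each input chunk with its embedding using the `index` field of each data item.
function attachEmbeddings(inputChunks, data) {
  return data.map((item) => ({
    text: inputChunks[item.index],
    embedding: item.embedding,
  }));
}

// Mocked data (assumption), shaped like the `data` array of an embeddings response.
const mockData = [
  { index: 1, embedding: [0.3, 0.4] },
  { index: 0, embedding: [0.1, 0.2] },
];

const paired = attachEmbeddings(['first chunk', 'second chunk'], mockData);
console.log(paired);
console.log(toBatches([1, 2, 3, 4, 5], 2)); // [[1, 2], [3, 4], [5]]
```

With real data, each batch would go through client.embeddings.create and the results would be concatenated.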

Chunking documents with LangChain

Chunking makes retrieval more precise. Instead of embedding one large document, split it into smaller overlapping parts.
Start with chunkSize: 800 and chunkOverlap: 120 (both measured in characters by default), then adjust based on your document style and answer quality.

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 120,
});

const docs = await splitter.createDocuments([
  `RAG combines retrieval and generation. Store chunks as vectors and fetch similar chunks at query time.`,
]);

console.log(docs.map((doc) => doc.pageContent));
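
To see what overlap does, here is a simplified fixed-window splitter. It is not the algorithm LangChain uses (RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and spaces); the function name splitWithOverlap is hypothetical and only illustrates how chunkSize and chunkOverlap interact.

```javascript
// Slide a window of `chunkSize` characters, advancing by chunkSize - chunkOverlap.
function splitWithOverlap(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new RangeError('chunkOverlap must be smaller than chunkSize');
  }
  const step = chunkSize - chunkOverlap;
  const parts = [];
  for (let start = 0; start < text.length; start += step) {
    parts.push(text.slice(start, start + chunkSize));
    // Stop once the window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return parts;
}

console.log(splitWithOverlap('abcdefghij', 4, 1)); // ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so text that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.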

Store embeddings in pgvector

Create a table with a vector column. text-embedding-3-small outputs 1536 dimensions by default, so the column is declared as VECTOR(1536).

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS rag_chunks (
  id BIGSERIAL PRIMARY KEY,
  content TEXT NOT NULL,
  embedding VECTOR(1536) NOT NULL,
  source TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
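
The database rejects vectors whose length does not match the column definition, so it can help to fail early in application code. The following is a small sketch of such a check; validateEmbedding and EXPECTED_DIMENSIONS are hypothetical names, and the placeholder vectors are not real embeddings.

```javascript
// Expected size (assumption): must match the VECTOR(n) column definition.
const EXPECTED_DIMENSIONS = 1536;

function validateEmbedding(embedding, expected = EXPECTED_DIMENSIONS) {
  if (!Array.isArray(embedding)) {
    throw new TypeError('embedding must be an array of numbers');
  }
  if (embedding.length !== expected) {
    throw new RangeError(`expected ${expected} dimensions, got ${embedding.length}`);
  }
  if (!embedding.every((value) => Number.isFinite(value))) {
    throw new TypeError('embedding contains a non-finite value');
  }
  return embedding;
}

// Usage with placeholder vectors
validateEmbedding(new Array(1536).fill(0)); // passes
try {
  validateEmbedding([0.1, 0.2]);
} catch (error) {
  console.log(error.message); // expected 1536 dimensions, got 2
}
```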

Insert chunk vectors from Node.js:

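import pg from 'pg';
import pgvector from 'pgvector/pg';

const pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

// Register the vector type on every new pooled connection
pool.on('connect', async (client) => {
  await pgvector.registerTypes(client);
});

// Placeholder values: in practice, insert each chunk's text with its own embedding
await pool.query(
  `INSERT INTO rag_chunks (content, embedding, source)
   VALUES ($1, $2, $3)`,
  ['Chunk content', pgvector.toSql(queryEmbedding), 'notes.md']
);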
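
When there are many chunks, one multi-row INSERT saves round trips. The sketch below only builds the query text and parameter list; buildBulkInsert is a hypothetical helper, and JSON.stringify stands in for pgvector.toSql so the example has no dependencies. The two-dimensional vectors are toy values and would not fit a VECTOR(1536) column.

```javascript
// Build one parameterized multi-row INSERT for rag_chunks.
// JSON.stringify on a number array yields "[0.1,0.2]", the text form pgvector accepts.
function buildBulkInsert(rows) {
  const values = [];
  const tuples = rows.map((row, i) => {
    const base = i * 3;
    values.push(row.content, JSON.stringify(row.embedding), row.source ?? null);
    return `($${base + 1}, $${base + 2}::vector, $${base + 3})`;
  });
  return {
    text: `INSERT INTO rag_chunks (content, embedding, source) VALUES ${tuples.join(', ')}`,
    values,
  };
}

const bulkQuery = buildBulkInsert([
  { content: 'first chunk', embedding: [0.1, 0.2], source: 'notes.md' },
  { content: 'second chunk', embedding: [0.3, 0.4] },
]);
console.log(bulkQuery.text);
// Later, with a real pool: await pool.query(bulkQuery.text, bulkQuery.values);
```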

Semantic search in pgvector

Embed the user question, then retrieve the nearest chunks by cosine distance. The <=> operator returns cosine distance (1 minus cosine similarity), so a lower value means a closer semantic match.
Top-k is the number of nearest chunks you return (here k = 4, set by LIMIT 4).
You can also apply a distance threshold (for example, 0.4) to discard weak matches. A range of 0.35 to 0.45 is a reasonable starting point for cosine distance; tune it with real questions from your domain.

const searchResult = await pool.query(
  `SELECT id, content, source, embedding <=> $1::vector AS distance
   FROM rag_chunks
   ORDER BY embedding <=> $1::vector
   LIMIT 4`,
  [pgvector.toSql(queryEmbedding)]
);

const contextChunks = searchResult.rows.map((row) => row.content);

Threshold filtering example:

const DISTANCE_THRESHOLD = 0.4;
const filteredChunks = searchResult.rows
  .filter((row) => Number(row.distance) <= DISTANCE_THRESHOLD)
  .map((row) => row.content);

If no chunks pass the threshold, skip answer generation and return a fallback message:

if (filteredChunks.length === 0) {
  console.log('I do not have enough context to answer this.');
  process.exit(0);
}
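
In a long-running service, exiting the process is usually not an option, so the same filter-and-fallback logic can be wrapped in a function that returns a result instead. This is a sketch under that assumption; selectContext and the mocked rows are hypothetical and only mirror the shape of searchResult.rows.

```javascript
const FALLBACK_MESSAGE = 'I do not have enough context to answer this.';

// Keep rows at or under the distance threshold; report a fallback when none remain.
function selectContext(rows, threshold) {
  const chunks = rows
    .filter((row) => Number(row.distance) <= threshold)
    .map((row) => row.content);
  return chunks.length > 0
    ? { ok: true, chunks }
    : { ok: false, message: FALLBACK_MESSAGE };
}

// Mocked rows (assumption), shaped like searchResult.rows
const mockRows = [
  { content: 'pgvector adds vector similarity search to PostgreSQL.', distance: 0.21 },
  { content: 'Unrelated note.', distance: 0.62 },
];

console.log(selectContext(mockRows, 0.4)); // one chunk passes
console.log(selectContext(mockRows, 0.1)); // nothing passes, fallback message
```

The caller can then decide whether to call the model or return the fallback message directly.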

Generate answer from retrieved context

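Pass the retrieved chunks to the final model call as grounding context. If you apply the threshold filter, build the context from filteredChunks instead of contextChunks.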

const context = contextChunks.join('\n\n---\n\n');

const answer = await client.responses.create({
  model: 'gpt-5.5',
  instructions:
    'Answer only from the provided context. If context is insufficient, respond with: I do not have enough context to answer this.',
  input: `Context:\n${context}\n\nQuestion: How does pgvector semantic search work?`,
});

console.log(answer.output_text);
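
The question above is hard-coded into the input string. A small helper can assemble the same instructions and input from any chunk list and question. This is a sketch; buildPrompt is a hypothetical name, and the chunk strings are placeholders.

```javascript
// Assemble the instructions and input for the final model call.
function buildPrompt(chunks, question) {
  const context = chunks.join('\n\n---\n\n');
  return {
    instructions:
      'Answer only from the provided context. If context is insufficient, respond with: I do not have enough context to answer this.',
    input: `Context:\n${context}\n\nQuestion: ${question}`,
  };
}

const promptParts = buildPrompt(
  ['chunk A', 'chunk B'],
  'How does pgvector semantic search work?'
);
console.log(promptParts.input);
// Later, with a real client: await client.responses.create({ model, ...promptParts });
```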

Demo

Runnable scripts for this post live in the rag-openai-embeddings-pgvector-demo folder in the private demos repository. Get access via code demos.
