Recently, I was working on a Retrieval-Augmented Generation (RAG) chatbot that takes user-uploaded PDFs as well as structured data scraped from websites. The goal was to convert this data into embeddings, store them in a vector database, and then use them to answer user queries.
Step 1: Extract Raw Text
From a PDF:
import { PdfReader } from 'pdfreader';

// `buffer` holds the uploaded PDF as a Node.js Buffer (see the sketch below)
let text = '';
const pdfReader = new PdfReader();

await new Promise<void>((resolve, reject) => {
  pdfReader.parseBuffer(buffer, (err: any, item: any) => {
    if (err) reject(err);                        // parsing failed
    else if (!item) resolve();                   // no more items: end of file
    else if (item.text) text += item.text + ' '; // accumulate text items
  });
});
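For context, how you obtain buffer depends on your framework. Since the imports later use the @/app alias, here is a minimal sketch assuming a Next.js App Router route handler (the path and handler shape are my assumptions, not shown in the original post):

// app/api/upload/route.ts (hypothetical path)
export async function POST(req: Request) {
  const formData = await req.formData();
  const file = formData.get("file") as File;            // the uploaded PDF
  const buffer = Buffer.from(await file.arrayBuffer()); // Node.js Buffer for pdfreader
  // ...run the pdfreader extraction shown above...
  return Response.json({ ok: true });
}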
From scraped website data:
function extractTextFromScrapedData(data: ScrapedData): string {
  const sections: string[] = [];
  if (data.about) sections.push(`About: ${data.about}`);
  // Put a bullet in front of every service, including the first one
  if (data.services?.length) sections.push(`Services:\n• ${data.services.join('\n• ')}`);
  if (data.contact?.phone) sections.push(`Phone: ${data.contact.phone}`);
  return sections.join('\n\n');
}
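The ScrapedData type isn't shown above; a minimal interface covering just the fields this function touches would look like this (your scraper may return more):

interface ScrapedData {
  about?: string;
  services?: string[];
  contact?: {
    phone?: string;
  };
}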
Step 2: Split Text into Chunks
Long documents need to be split into smaller chunks so they fit within the LLM's context window. I used LangChain's RecursiveCharacterTextSplitter:
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

const doc = new Document({ pageContent: text });

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,   // max characters per chunk
  chunkOverlap: 200, // characters shared between consecutive chunks
});

const allSplits = await splitter.splitDocuments([doc]);
console.log("Chunks created:", allSplits.length);
Step 3: Generate Embeddings + Store
This is the most important part:
When you call vectorStore.addDocuments(), LangChain automatically calls your embedding model (e.g. OpenAI, Cohere) and saves the resulting vectors to your configured database (Pinecone, in my case).
import { vectorStore } from "@/app/lib/langchain";
// Store with namespace (client-specific)
await vectorStore.addDocuments(allSplits, { namespace: clientId });
console.log("✅ Documents added to vectorStore");
Step 4: Retrieve Context During Chat
When a user asks a question, we retrieve the top-k similar chunks:
const retrieved = await vectorStore.similaritySearchWithScore(
  userMessage,
  3,                      // top-k chunks to fetch
  { namespace: clientId } // scope the search to this client's data
);

// Drop weak matches; a score of 0.6 worked well as a similarity cutoff here
const relevantMatches = retrieved.filter(([, score]) => score > 0.6);
console.log("Relevant matches:", relevantMatches.length);
Other chunking methods worth knowing (a quick sketch of two of them follows the list):
- Agentic Chunking → LLM-guided, semantic-aware chunking based on meaning and context.
- CharacterTextSplitter → Splits by character count (simple, fast).
- RecursiveCharacterTextSplitter → Splits by hierarchy (paragraph → sentence → word → char). Most commonly used.
- TokenTextSplitter → Splits by tokens, aligns with model tokenization.
- MarkdownTextSplitter → Splits Markdown docs by headers/sections.
- HTMLTextSplitter → Splits HTML while respecting tags.
- CodeTextSplitter → Splits source code by functions, classes, logical blocks.
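Several of these ship with @langchain/textsplitters. A short sketch of two alternatives from the list above (parameters are illustrative, and markdownText stands in for your Markdown source):

import {
  TokenTextSplitter,
  MarkdownTextSplitter,
} from "@langchain/textsplitters";

// Token-based splitting: chunk sizes line up with model tokenization
const tokenSplitter = new TokenTextSplitter({
  chunkSize: 256,  // measured in tokens, not characters
  chunkOverlap: 32,
});
const tokenChunks = await tokenSplitter.splitText(text);

// Markdown-aware splitting: prefers header/section boundaries
const mdSplitter = new MarkdownTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
const mdChunks = await mdSplitter.splitText(markdownText);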
Refer to this doc for more details:
DOC
Happy Coding, Keep Striving.