TL;DR: I built DocMind — a multimodal RAG (Retrieval-Augmented Generation) app that lets you upload any PDF, image, or document and ask questions about it in plain English. No API keys needed — it runs entirely locally using Ollama for the LLM and Xenova Transformers for embeddings. This post walks through the full architecture, the chunking strategy, the vector similarity search, and the code.
The Problem
You have a 200-page PDF — a legal contract, a research paper, your company's internal docs. You need to find specific information buried somewhere in it. Ctrl+F only works if you know the exact words. What you really want is to ask the document questions in plain language:
- "What's the termination clause in this contract?"
- "Summarize the results from section 4"
- "What technologies does this company use?"
This is what RAG does. And unlike fine-tuning an LLM (which costs hundreds of dollars and takes hours), RAG is cheap, fast, and works with any document you throw at it.
I built DocMind to solve this for myself — and to have a real AI project in my portfolio that actually does something useful.
What Is RAG? (The 30-Second Version)
RAG = Retrieval-Augmented Generation. Instead of asking an LLM to answer from memory (which it often gets wrong — hallucinations), you:
- Retrieve the relevant parts of your document
- Augment the LLM's prompt with those chunks
- Generate an answer grounded in actual content
RAG vs PURE LLM
════════════════
Pure LLM (without RAG):

┌────────┐  "What's the            ┌──────────┐
│  User  │──termination clause?──▶ │   LLM    │──▶ "I think it might be..."
└────────┘                         │ (guessing│     (HALLUCINATION ⚠️)
                                   │  from    │
                                   │ training)│
                                   └──────────┘

RAG (with document context):

┌────────┐  "What's the            ┌──────────┐  relevant      ┌──────────┐
│  User  │──termination clause?──▶ │  Vector  │──paragraphs──▶ │   LLM    │
└────────┘                         │  Search  │                │ (answers │
                                   │          │                │  from    │
                                   │ searches │                │  actual  │
                                   │ your PDF │                │  content)│
                                   └──────────┘                └──────────┘
                                                                    │
                                                                    ▼
                                                         "Section 12.3 states
                                                          the termination clause
                                                          requires 30 days..."
                                                          (GROUNDED ✅)
The Full Architecture
Here's how DocMind works end to end:
flowchart TD
A[Upload Document] --> B{File Type?}
B -->|PDF| C[pdf-parse]
B -->|DOCX| D[mammoth]
B -->|Image| E[Tesseract OCR]
B -->|CSV| F[csv-parse]
B -->|TXT/MD| G[Read directly]
C --> H[Raw Text]
D --> H
E --> H
F --> H
G --> H
H --> I[Chunk Text<br/>500 chars, 50 overlap]
I --> J[Generate Embeddings<br/>all-MiniLM-L6-v2]
J --> K[Store in Vector Index]
L[User Question] --> M[Embed Question]
M --> N[Cosine Similarity Search]
K --> N
N --> O[Top 5 Chunks Retrieved]
O --> P[Build Prompt with Context]
P --> Q[Ollama LLM generates answer]
Q --> R[Return Answer + Sources]
style K fill:#3b82f6,color:#fff
style Q fill:#22c55e,color:#fff
style E fill:#f59e0b,color:#fff
Two pipelines: Upload (extract → chunk → embed → store) and Query (embed question → similarity search → build prompt → LLM answer).
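Before building the real thing, the two pipelines can be sketched end to end in a few dependency-free lines. A bag-of-words word count stands in for the neural embedding model here, and every name, sentence, and value is illustrative:

```javascript
// Toy sketch of both pipelines. A bag-of-words count stands in for the
// neural embedding model, so this runs with no dependencies at all.
const VOCAB = ['termination', 'clause', 'notice', 'payment', 'revenue'];

function toyEmbed(text) {
  const words = text.toLowerCase().split(/\W+/);
  return VOCAB.map(v => words.filter(w => w === v).length);
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}

// Upload pipeline: here a "chunk" is just a sentence
const docChunks = [
  'Payment is due within 15 days of invoice.',
  'Either party may terminate with 30 days notice per the termination clause.',
].map(text => ({ text, vector: toyEmbed(text) }));

// Query pipeline: embed the question, rank chunks by similarity
const query = toyEmbed('What is the termination clause?');
const best = docChunks
  .map(c => ({ ...c, similarity: cosine(query, c.vector) }))
  .sort((a, b) => b.similarity - a.similarity)[0];

console.log(best.text);
```

Swap `toyEmbed` for a real embedding model and this structure is exactly what the rest of the post builds out.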
Step-by-Step Build
Step 1: Text Extraction — Getting Text Out of Anything
The first challenge is extracting text from eight different file types. Most need their own parser:
// extract.js — Extract text from any supported file type
const pdfParse = require('pdf-parse');
const mammoth = require('mammoth');
const Tesseract = require('tesseract.js');
const csvParse = require('csv-parse/sync');
const fs = require('fs');
const path = require('path');
async function extractText(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  const buffer = fs.readFileSync(filePath);

  switch (ext) {
    case '.pdf': {
      const pdf = await pdfParse(buffer);
      return pdf.text;
    }
    case '.docx': {
      const result = await mammoth.extractRawText({ buffer });
      return result.value;
    }
    case '.txt':
    case '.md':
      return buffer.toString('utf-8');
    case '.csv': {
      const records = csvParse.parse(buffer, { columns: true });
      return records.map(row => Object.values(row).join(' | ')).join('\n');
    }
    case '.png':
    case '.jpg':
    case '.jpeg':
    case '.webp': {
      const { data } = await Tesseract.recognize(buffer, 'eng');
      return data.text;
    }
    default:
      throw new Error(`Unsupported file type: ${ext}`);
  }
}
Step 2: Chunking — Breaking Text Into Searchable Pieces
You can't embed a whole document as one vector: a single embedding averaged over hundreds of paragraphs is too diluted to match any specific question. Instead, you break the text into small, overlapping chunks. The overlap ensures context isn't lost at chunk boundaries.
CHUNKING STRATEGY
═════════════════
Original text (1000 chars):
┌──────────────────────────────────────────────────────────┐
│ The quick brown fox jumps over the lazy dog. The dog │
│ barked loudly at the mailman. Meanwhile, the cat slept │
│ on the windowsill, completely unbothered by the chaos... │
└──────────────────────────────────────────────────────────┘
Chunk 1 (500 chars):             Chunk 2 (500 chars):
┌──────────────────────────┐     ┌──────────────────────────┐
│ The quick brown fox      │     │ ...barked loudly at the  │
│ jumps over the lazy dog. │     │ mailman. Meanwhile, the  │
│ The dog barked loudly at │     │ cat slept on the window- │
│ the mailman. Meanwhile...│     │ sill, completely...      │
└──────────────────────────┘     └──────────────────────────┘
             ▲                                ▲
             │      OVERLAP (50 chars)        │
             │     ┌─────────────────────┐    │
             └─────│ ...loudly at the    │────┘
                   │ mailman. Meanwhile..│
                   └─────────────────────┘
Why overlap? Without it, a question about "the mailman and the cat"
might miss context because they span a chunk boundary.
// chunk.js — Split text into overlapping chunks
function chunkText(text, chunkSize = 500, overlap = 50) {
  const chunks = [];
  let start = 0;

  while (start < text.length) {
    let end = start + chunkSize;

    // Try to break at a sentence boundary
    if (end < text.length) {
      const lastPeriod = text.lastIndexOf('.', end);
      if (lastPeriod > start + chunkSize * 0.5) {
        end = lastPeriod + 1;
      }
    }

    const chunk = text.slice(start, end).trim();
    if (chunk.length > 0) {
      chunks.push({
        text: chunk,
        startIndex: start,
        endIndex: end,
      });
    }

    // Stop once the whole text is consumed; otherwise the overlap step
    // below would emit a tiny duplicate tail chunk
    if (end >= text.length) break;

    // Move forward, but keep `overlap` chars for context continuity
    start = end - overlap;
  }

  return chunks;
}

// Example:
const chunks = chunkText(documentText, 500, 50);
console.log(`Created ${chunks.length} chunks from document`);
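To see the overlap guarantee concretely, here's a condensed fixed-size chunker (sentence-boundary logic stripped out, sizes shrunk for readability). The tail of each chunk reappears at the head of the next:

```javascript
// Condensed chunker: fixed-size windows, no sentence-boundary logic,
// tiny sizes so the overlap is visible at a glance
function simpleChunk(text, chunkSize = 20, overlap = 5) {
  const out = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    out.push(text.slice(start, start + chunkSize));
  }
  return out;
}

const demoChunks = simpleChunk('abcdefghijklmnopqrstuvwxyz0123456789');
console.log(demoChunks);
// The last 5 chars of each chunk are the first 5 of the next:
console.log(demoChunks[0].slice(-5) === demoChunks[1].slice(0, 5)); // true
```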
Step 3: Embeddings — Turning Text Into Numbers
This is the magic step. An embedding model converts text into a vector (a list of numbers) that captures the meaning of the text. Similar meanings → similar vectors.
HOW EMBEDDINGS WORK
════════════════════

Text:                          Vector (384 dimensions):

"Redis caching improves        [0.23, -0.11, 0.87, 0.45, ..., 0.12]
 database performance"                        ↕
                                     SIMILAR direction as:
"Using cache to speed up       [0.25, -0.09, 0.85, 0.43, ..., 0.14]
 database queries"

                                     DIFFERENT direction from:
"The weather is sunny          [0.91, 0.34, -0.22, 0.05, ..., -0.67]
 in Mumbai today"

Similarity = cosine of the angle between vectors
═══════════════════════════════════════════════

        "Redis caching"  "Speed up DB"
                 ↗        ↗         Cosine similarity = 0.95
                /        /          (very similar!)
               /        /
              ╱────────╱───────────▶ dimension 1
             ╱
            ╱
           ╱  "Weather in Mumbai"
              Cosine similarity = 0.12
              (very different!)
// embeddings.js — Generate embeddings using Xenova Transformers (runs locally!)
const { pipeline } = require('@xenova/transformers');
let embedder = null;
async function getEmbedder() {
  if (!embedder) {
    // Downloads the model on first use (~80MB), then cached
    embedder = await pipeline(
      'feature-extraction',
      'Xenova/all-MiniLM-L6-v2'
    );
  }
  return embedder;
}

async function embedText(text) {
  const model = await getEmbedder();
  const output = await model(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data);
}

async function embedChunks(chunks) {
  const embeddings = [];
  for (const chunk of chunks) {
    const vector = await embedText(chunk.text);
    embeddings.push({ ...chunk, vector });
  }
  return embeddings;
}
// Each vector is 384 dimensions — captures semantic meaning of the text
Step 4: Vector Store — Finding Similar Chunks
When a user asks a question, we embed the question using the same model, then find the chunks with the most similar vectors using cosine similarity.
// vector-store.js — In-memory vector store with cosine similarity search
class VectorStore {
  constructor() {
    this.documents = new Map(); // docId → { name, chunks }
  }

  addDocument(docId, name, embeddedChunks) {
    this.documents.set(docId, { name, chunks: embeddedChunks });
  }

  search(queryVector, topK = 5) {
    const results = [];

    for (const [docId, doc] of this.documents) {
      for (const chunk of doc.chunks) {
        const similarity = cosineSimilarity(queryVector, chunk.vector);
        results.push({
          text: chunk.text,
          document: doc.name,
          similarity,
          startIndex: chunk.startIndex,
        });
      }
    }

    // Sort by similarity (highest first) and return top K
    return results
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, topK);
  }

  removeDocument(docId) {
    this.documents.delete(docId);
  }
}

function cosineSimilarity(vecA, vecB) {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
    normA += vecA[i] * vecA[i];
    normB += vecB[i] * vecB[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
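A quick sanity check on cosineSimilarity (repeated here so the snippet runs standalone) using toy 2-D vectors: parallel vectors score 1, orthogonal vectors 0, opposite vectors -1:

```javascript
// Same cosineSimilarity as above, repeated so this runs standalone
function cosineSimilarity(vecA, vecB) {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
    normA += vecA[i] * vecA[i];
    normB += vecB[i] * vecB[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [2, 0]));  // 1   (same direction)
console.log(cosineSimilarity([1, 0], [0, 3]));  // 0   (unrelated)
console.log(cosineSimilarity([1, 0], [-1, 0])); // -1  (opposite)
```

A nice property in DocMind's case: because the embeddings come back with `normalize: true`, every vector already has length 1, so cosine similarity reduces to a plain dot product.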
VECTOR SEARCH — HOW IT WORKS
═════════════════════════════

User question: "What is the termination clause?"
        │
        ▼
Embed question → [0.34, -0.22, 0.91, ...]
        │
        ▼
Compare against ALL chunk vectors:

┌────────────────────────────────────────────────────┐
│ Chunk 1: "Company overview..."           sim: 0.12 │
│ Chunk 2: "Payment terms..."              sim: 0.28 │
│ Chunk 3: "Either party may               sim: 0.94 │ ◀── TOP MATCH
│           terminate this                           │
│           agreement with 30                        │
│           days written notice..."                  │
│ Chunk 4: "Intellectual property..."      sim: 0.15 │
│ Chunk 5: "In the event of                sim: 0.87 │ ◀── 2nd match
│           breach, termination                      │
│           is effective..."                         │
└────────────────────────────────────────────────────┘
        │
        ▼
Return top 5 chunks sorted by similarity
Step 5: Prompt Building — Giving the LLM Context
Now we take the retrieved chunks and build a prompt that tells the LLM: "Answer this question using ONLY the provided context."
// prompt.js — Build the RAG prompt
function buildPrompt(question, retrievedChunks) {
  const context = retrievedChunks
    .map((chunk, i) => `[Source ${i + 1}] ${chunk.text}`)
    .join('\n\n');

  return `You are a helpful assistant that answers questions based on the provided document context.

IMPORTANT RULES:
- Only answer based on the context below
- If the answer is not in the context, say "I don't have enough information to answer that"
- Be specific and cite which source you're drawing from
- Keep answers concise but complete

CONTEXT:
${context}

QUESTION: ${question}

ANSWER:`;
}
Step 6: LLM Response — Generating the Answer
DocMind uses Ollama to run an LLM locally. No API keys, no cloud costs, complete privacy.
// llm.js — Query Ollama for the answer
async function queryLLM(prompt) {
  try {
    const response = await fetch('http://localhost:11434/api/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'qwen2:0.5b', // Small, fast, runs on CPU
        prompt,
        stream: false,
        options: {
          temperature: 0.3, // Low temp = more focused answers
          num_predict: 500, // Max tokens in response
        },
      }),
    });
    const data = await response.json();
    return data.response;
  } catch (err) {
    // Fallback: answer directly from the retrieved context
    return fallbackExtract(prompt);
  }
}

// Fallback when Ollama is unavailable
function fallbackExtract(prompt) {
  const contextMatch = prompt.match(/CONTEXT:\n([\s\S]*?)\nQUESTION:/);
  if (!contextMatch) return "Unable to process your question.";

  const context = contextMatch[1];
  // Crude but honest: return the first few sentences of the
  // most relevant retrieved context as the answer
  const sentences = context.split(/[.!?]+/).filter(s => s.trim().length > 20);
  return sentences.slice(0, 3).join('. ') + '.';
}
Putting It All Together
// rag-pipeline.js — The complete RAG pipeline
const { extractText } = require('./extract');
const { chunkText } = require('./chunk');
const { embedText, embedChunks } = require('./embeddings');
const { VectorStore } = require('./vector-store');
const { buildPrompt } = require('./prompt');
const { queryLLM } = require('./llm');
const vectorStore = new VectorStore();
// UPLOAD: Process a new document
async function uploadDocument(filePath, docId, docName) {
  console.log(`Processing ${docName}...`);

  // Step 1: Extract text
  const text = await extractText(filePath);
  console.log(`Extracted ${text.length} characters`);

  // Step 2: Chunk text
  const chunks = chunkText(text, 500, 50);
  console.log(`Created ${chunks.length} chunks`);

  // Step 3: Generate embeddings
  const embeddedChunks = await embedChunks(chunks);
  console.log(`Generated ${embeddedChunks.length} embeddings`);

  // Step 4: Store in vector index
  vectorStore.addDocument(docId, docName, embeddedChunks);
  console.log(`Document "${docName}" indexed and ready for queries`);

  return { chunks: chunks.length, characters: text.length };
}

// QUERY: Ask a question about uploaded documents
async function askQuestion(question) {
  // Step 1: Embed the question
  const queryVector = await embedText(question);

  // Step 2: Find similar chunks
  const topChunks = vectorStore.search(queryVector, 5);

  if (topChunks.length === 0 || topChunks[0].similarity < 0.3) {
    return {
      answer: "I don't have enough relevant information to answer that question.",
      sources: [],
    };
  }

  // Step 3: Build prompt with context
  const prompt = buildPrompt(question, topChunks);

  // Step 4: Get LLM answer
  const answer = await queryLLM(prompt);

  return {
    answer,
    sources: topChunks.map(c => ({
      document: c.document,
      text: c.text.slice(0, 100) + '...',
      similarity: (c.similarity * 100).toFixed(1) + '%',
    })),
  };
}
Real-World Example
Here's what it looks like in practice:
USER INTERACTION
════════════════
> Upload: "annual-report-2025.pdf" (85 pages)
✓ Extracted 142,000 characters
✓ Created 284 chunks
✓ Generated 284 embeddings (took 12s)
✓ Document indexed and ready
> Question: "What was the total revenue for Q4 2025?"
Answer: Based on the financial summary in the report,
total revenue for Q4 2025 was ₹1,247 crores, representing
a 23% increase from Q4 2024. The primary drivers were
the expansion of the digital services segment (up 34%)
and increased enterprise contracts.
Sources:
├── [96.2% match] Page 42: "Q4 2025 Financial Highlights..."
├── [91.8% match] Page 43: "Revenue breakdown by segment..."
└── [87.4% match] Page 7: "Executive summary: record revenue..."
Common Mistakes in RAG Apps
1. Chunks Too Large or Too Small
┌────────────┬──────────────────────────────────────┐
│ Chunk size │ Problem                              │
├────────────┼──────────────────────────────────────┤
│ Too small  │ Loses context. "30 days" means       │
│ (< 100)    │ nothing without surrounding text.    │
├────────────┼──────────────────────────────────────┤
│ Too large  │ Embedding captures too many topics.  │
│ (> 2000)   │ Search returns irrelevant content.   │
├────────────┼──────────────────────────────────────┤
│ Sweet spot │ 300-800 characters with 50-100       │
│            │ character overlap.                   │
└────────────┴──────────────────────────────────────┘
2. No Overlap Between Chunks
If a key fact spans a chunk boundary, neither chunk contains it in full, so neither embeds it well and retrieval misses it. Always overlap.
3. Using Different Embedding Models for Chunks and Queries
The question and the chunks MUST be embedded with the same model. Different models produce incompatible vector spaces.
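In practice, a model mismatch often shows up as a dimension mismatch (all-MiniLM-L6-v2 produces 384-dim vectors; all-mpnet-base-v2 produces 768-dim ones). A guard like this (illustrative, not part of DocMind) fails loudly instead of silently computing garbage similarities:

```javascript
// Illustrative guard: refuse to compare vectors of different dimensions,
// which usually means two different embedding models were involved
function assertSameDimensions(queryVector, chunkVector) {
  if (queryVector.length !== chunkVector.length) {
    throw new Error(
      `Embedding dimension mismatch: query=${queryVector.length}, ` +
      `chunk=${chunkVector.length}. Were both embedded with the same model?`
    );
  }
}

assertSameDimensions([0.1, 0.2, 0.3], [0.4, 0.5, 0.6]); // fine
// assertSameDimensions(new Array(384), new Array(768)); // would throw
```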
4. Not Setting a Similarity Threshold
If the top result has 0.3 similarity, the document probably doesn't contain the answer. Return "I don't know" instead of hallucinating.
5. Sending the Entire Document to the LLM
LLMs have context limits. Even with large contexts, more text = slower + more expensive + less focused answers. Retrieve only the relevant chunks.
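One way to enforce this is to cap the retrieved context before building the prompt. This is an illustrative sketch, not code from DocMind, and the character budget is a crude stand-in for a real token budget:

```javascript
// Illustrative: keep only as many top-ranked chunks as fit in a budget
function fitToBudget(chunks, maxChars = 4000) {
  const kept = [];
  let used = 0;
  for (const chunk of chunks) { // assumes chunks arrive sorted best-first
    if (used + chunk.text.length > maxChars) break;
    kept.push(chunk);
    used += chunk.text.length;
  }
  return kept;
}

const trimmed = fitToBudget(
  [{ text: 'a'.repeat(300) }, { text: 'b'.repeat(300) }, { text: 'c'.repeat(300) }],
  650
);
console.log(trimmed.length); // 2: the third chunk would exceed the budget
```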
When to Use RAG / When Not to Use It
USE RAG WHEN:                         DON'T USE WHEN:
══════════════                        ════════════════
✅ You have specific documents        ❌ The question is general knowledge
   to answer questions about             (just use the LLM directly)
✅ Documents change frequently        ❌ You need the LLM to reason
   (RAG = always up to date)             creatively (RAG constrains it)
✅ Accuracy matters more than         ❌ Documents are tiny (< 1 page)
   creativity                            (just put them in the prompt)
✅ You want source attribution        ❌ You need real-time data
   ("this came from page 42")            (RAG works on static docs)
✅ Privacy matters (local RAG         ❌ You need to synthesize across
   = data never leaves your machine)     hundreds of documents (use
                                         a search engine instead)
Key Takeaways
- RAG = Retrieve + Augment + Generate — ground LLM answers in real document content
- Chunking strategy matters — 300-800 chars with 50-100 char overlap is the sweet spot
- Embeddings capture meaning — cosine similarity finds semantically similar text, not keyword matches
- Use the same embedding model for both documents and queries
- Set a similarity threshold — don't answer if the best match is weak
- Local RAG is possible — Ollama + Xenova Transformers, zero API costs
- Fallback gracefully — if the LLM is down, extract answers from the chunks directly
Connect With Me
If you found this useful, drop a reaction and follow — I post deep-dive AI engineering tutorials every week, always with full code and working examples.
- GitHub: github.com/Robins163 — the DocMind source code is in the ai-portfolio repo
- Twitter/X: twitter.com/robinsingh — I post shorter breakdowns there too
- LinkedIn: linkedin.com/in/robinsingh
Next in this series: "WebSocket vs SSE vs Long Polling — The Complete Visual Comparison"
Questions or feedback? Drop a comment below — I respond to every one.