
Robins163

Posted on • Originally published at unknowntoplay.hashnode.dev

I Built a RAG App That Chats With Any PDF — Here's How

TL;DR: I built DocMind — a multimodal RAG (Retrieval-Augmented Generation) app that lets you upload any PDF, image, or document and ask questions about it in plain English. No API keys needed — it runs entirely locally using Ollama for the LLM and Xenova Transformers for embeddings. This post walks through the full architecture, the chunking strategy, the vector similarity search, and the code.


The Problem

You have a 200-page PDF — a legal contract, a research paper, your company's internal docs. You need to find specific information buried somewhere in it. Ctrl+F only works if you know the exact words. What you really want is to ask the document questions in plain language:

  • "What's the termination clause in this contract?"
  • "Summarize the results from section 4"
  • "What technologies does this company use?"

This is what RAG does. And unlike fine-tuning an LLM (which costs hundreds of dollars and takes hours), RAG is cheap, fast, and works with any document you throw at it.

I built DocMind to solve this for myself — and to have a real AI project in my portfolio that actually does something useful.

What Is RAG? (The 30-Second Version)

RAG = Retrieval-Augmented Generation. Instead of asking an LLM to answer from memory (which it often gets wrong — hallucinations), you:

  1. Retrieve the relevant parts of your document
  2. Augment the LLM's prompt with those chunks
  3. Generate an answer grounded in actual content

  RAG vs PURE LLM
  ════════════════

  Pure LLM (without RAG):
  ┌────────┐   "What's the          ┌──────────┐
  │  User  │──termination clause?──▶│   LLM    │──▶ "I think it might be..."
  └────────┘                        │(guessing │    (HALLUCINATION ⚠️)
                                    │ from     │
                                    │ training)│
                                    └──────────┘

  RAG (with document context):
  ┌────────┐   "What's the          ┌──────────┐   relevant    ┌─────────┐
  │  User  │──termination clause?──▶│  Vector  │──paragraphs──▶│   LLM   │
  └────────┘                        │  Search  │               │(answers │
                                    │          │               │ from    │
                                    │ searches │               │ actual  │
                                    │ your PDF │               │ content)│
                                    └──────────┘               └─────────┘
                                                                    │
                                                                    ▼
                                                          "Section 12.3 states
                                                           the termination clause
                                                           requires 30 days..."
                                                           (GROUNDED ✅)

The Full Architecture

Here's how DocMind works end to end:

flowchart TD
    A[Upload Document] --> B{File Type?}
    B -->|PDF| C[pdf-parse]
    B -->|DOCX| D[mammoth]
    B -->|Image| E[Tesseract OCR]
    B -->|CSV| F[csv-parse]
    B -->|TXT/MD| G[Read directly]
    C --> H[Raw Text]
    D --> H
    E --> H
    F --> H
    G --> H
    H --> I[Chunk Text<br/>500 chars, 50 overlap]
    I --> J[Generate Embeddings<br/>all-MiniLM-L6-v2]
    J --> K[Store in Vector Index]

    L[User Question] --> M[Embed Question]
    M --> N[Cosine Similarity Search]
    K --> N
    N --> O[Top 5 Chunks Retrieved]
    O --> P[Build Prompt with Context]
    P --> Q[Ollama LLM generates answer]
    Q --> R[Return Answer + Sources]

    style K fill:#3b82f6,color:#fff
    style Q fill:#22c55e,color:#fff
    style E fill:#f59e0b,color:#fff

Two pipelines: Upload (extract → chunk → embed → store) and Query (embed question → similarity search → build prompt → LLM answer).

Step-by-Step Build

Step 1: Text Extraction — Getting Text Out of Anything

The first challenge is extracting text from 8 different file types. Each needs a different parser:

// extract.js — Extract text from any supported file type
const pdfParse = require('pdf-parse');
const mammoth = require('mammoth');
const Tesseract = require('tesseract.js');
const csvParse = require('csv-parse/sync');
const fs = require('fs');
const path = require('path');

async function extractText(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  const buffer = fs.readFileSync(filePath);

  switch (ext) {
    case '.pdf': {
      const pdf = await pdfParse(buffer);
      return pdf.text;
    }

    case '.docx': {
      const result = await mammoth.extractRawText({ buffer });
      return result.value;
    }

    case '.txt':
    case '.md':
      return buffer.toString('utf-8');

    case '.csv': {
      const records = csvParse.parse(buffer, { columns: true });
      return records.map(row => Object.values(row).join(' | ')).join('\n');
    }

    case '.png':
    case '.jpg':
    case '.jpeg':
    case '.webp': {
      const { data } = await Tesseract.recognize(buffer, 'eng');
      return data.text;
    }

    default:
      throw new Error(`Unsupported file type: ${ext}`);
  }
}

module.exports = { extractText };

Step 2: Chunking — Breaking Text Into Searchable Pieces

You can't usefully embed a whole document as a single vector: one embedding averages together every topic in the document and ends up matching nothing well. Instead, you break the text into small, overlapping chunks. The overlap ensures context isn't lost at chunk boundaries.

  CHUNKING STRATEGY
  ═════════════════

  Original text (1000 chars):
  ┌──────────────────────────────────────────────────────────┐
  │ The quick brown fox jumps over the lazy dog. The dog     │
  │ barked loudly at the mailman. Meanwhile, the cat slept   │
  │ on the windowsill, completely unbothered by the chaos... │
  └──────────────────────────────────────────────────────────┘

  Chunk 1 (500 chars):              Chunk 2 (500 chars):
  ┌──────────────────────────┐      ┌──────────────────────────┐
  │ The quick brown fox      │      │ ...barked loudly at the  │
  │ jumps over the lazy dog. │      │ mailman. Meanwhile, the  │
  │ The dog barked loudly at │      │ cat slept on the window- │
  │ the mailman. Meanwhile...│      │ sill, completely...      │
  └──────────────────────────┘      └──────────────────────────┘
           ▲                                  ▲
           │           OVERLAP (50 chars)     │
           │     ┌─────────────────────┐      │
           └─────│ ...loudly at the    │──────┘
                 │ mailman. Meanwhile..│
                 └─────────────────────┘

  Why overlap? Without it, a question about "the mailman and the cat"
  might miss context because they span a chunk boundary.
// chunk.js — Split text into overlapping chunks
function chunkText(text, chunkSize = 500, overlap = 50) {
  // Guard against an infinite loop: overlap must leave forward progress
  if (overlap >= chunkSize) throw new Error('overlap must be smaller than chunkSize');

  const chunks = [];
  let start = 0;

  while (start < text.length) {
    let end = start + chunkSize;

    // Try to break at a sentence boundary
    if (end < text.length) {
      const lastPeriod = text.lastIndexOf('.', end);
      if (lastPeriod > start + chunkSize * 0.5) {
        end = lastPeriod + 1;
      }
    }

    const chunk = text.slice(start, end).trim();
    if (chunk.length > 0) {
      chunks.push({
        text: chunk,
        startIndex: start,
        endIndex: end,
      });
    }

    // Move forward, but keep `overlap` chars for context continuity
    start = end - overlap;
  }

  return chunks;
}

module.exports = { chunkText };

// Example:
const chunks = chunkText(documentText, 500, 50);
console.log(`Created ${chunks.length} chunks from document`);

Step 3: Embeddings — Turning Text Into Numbers

This is the magic step. An embedding model converts text into a vector (a list of numbers) that captures the meaning of the text. Similar meanings → similar vectors.

  HOW EMBEDDINGS WORK
  ════════════════════

  Text:                              Vector (384 dimensions):
  "Redis caching improves            [0.23, -0.11, 0.87, 0.45, ..., 0.12]
   database performance"                          ↕
                                     SIMILAR direction as:
  "Using cache to speed up           [0.25, -0.09, 0.85, 0.43, ..., 0.14]
   database queries"

                                     DIFFERENT direction from:
  "The weather is sunny              [0.91, 0.34, -0.22, 0.05, ..., -0.67]
   in Mumbai today"


  Similarity = Cosine of angle between vectors
  ═══════════════════════════════════════════════

       "Redis caching"    "Speed up DB"
              ↗                ↗          Cosine similarity = 0.95
            /                /            (very similar!)
           /               /
          ╱───────────────╱──────────────▶ dimension 1
         ╱               ╱
        ╱               ╱
                       ╱
                      ╱  "Weather in Mumbai"
                                              Cosine similarity = 0.12
                                              (very different!)
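You can sanity-check the geometry above with toy numbers. The 2-D vectors below are invented purely for illustration (real MiniLM embeddings have 384 dimensions); only the relative angles matter:

```javascript
// Toy 2-D vectors standing in for real 384-dim embeddings.
// The numbers are made up for illustration, not actual model output.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const redisCaching = [0.9, 0.4];   // "Redis caching improves performance"
const speedUpDb    = [0.8, 0.5];   // "Using cache to speed up queries"
const weather      = [-0.3, 0.95]; // "The weather is sunny in Mumbai"

console.log(cosine(redisCaching, speedUpDb).toFixed(2)); // 0.99, similar meaning
console.log(cosine(redisCaching, weather).toFixed(2));   // 0.11, unrelated
```

Nearby directions score near 1, unrelated directions score near 0; that single number is all the vector search ranks by.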
// embeddings.js — Generate embeddings using Xenova Transformers (runs locally!)
const { pipeline } = require('@xenova/transformers');

let embedder = null;

async function getEmbedder() {
  if (!embedder) {
    // Downloads the model on first use (~80MB), then cached
    embedder = await pipeline(
      'feature-extraction',
      'Xenova/all-MiniLM-L6-v2'
    );
  }
  return embedder;
}

async function embedText(text) {
  const model = await getEmbedder();
  const output = await model(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data);
}

async function embedChunks(chunks) {
  const model = await getEmbedder();
  const embeddings = [];

  for (const chunk of chunks) {
    const vector = await embedText(chunk.text);
    embeddings.push({
      ...chunk,
      vector,
    });
  }

  return embeddings;
}

module.exports = { embedText, embedChunks };

// Each vector is 384 dimensions — captures semantic meaning of the text

Step 4: Vector Store — Finding Similar Chunks

When a user asks a question, we embed the question using the same model, then find the chunks with the most similar vectors using cosine similarity.

// vector-store.js — In-memory vector store with cosine similarity search
class VectorStore {
  constructor() {
    this.documents = new Map(); // docId → { name, chunks }
  }

  addDocument(docId, name, embeddedChunks) {
    this.documents.set(docId, { name, chunks: embeddedChunks });
  }

  search(queryVector, topK = 5) {
    const results = [];

    for (const [docId, doc] of this.documents) {
      for (const chunk of doc.chunks) {
        const similarity = cosineSimilarity(queryVector, chunk.vector);
        results.push({
          text: chunk.text,
          document: doc.name,
          similarity,
          startIndex: chunk.startIndex,
        });
      }
    }

    // Sort by similarity (highest first) and return top K
    return results
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, topK);
  }

  removeDocument(docId) {
    this.documents.delete(docId);
  }
}

function cosineSimilarity(vecA, vecB) {
  let dotProduct = 0;
  let normA = 0;
  let normB = 0;

  for (let i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
    normA += vecA[i] * vecA[i];
    normB += vecB[i] * vecB[i];
  }

  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

module.exports = { VectorStore };
  VECTOR SEARCH — HOW IT WORKS
  ═════════════════════════════

  User question: "What is the termination clause?"
       │
       ▼
  Embed question → [0.34, -0.22, 0.91, ...]
       │
       ▼
  Compare against ALL chunk vectors:
  ┌───────────────────────────────────────────────┐
  │ Chunk 1: "Company overview..."     sim: 0.12  │
  │ Chunk 2: "Payment terms..."        sim: 0.28  │
  │ Chunk 3: "Either party may         sim: 0.94  │ ◀── TOP MATCH
  │           terminate this                       │
  │           agreement with 30                    │
  │           days written notice..."              │
  │ Chunk 4: "Intellectual property..."sim: 0.15  │
  │ Chunk 5: "In the event of          sim: 0.87  │ ◀── 2nd match
  │           breach, termination                  │
  │           is effective..."                     │
  └───────────────────────────────────────────────┘
       │
       ▼
  Return top 5 chunks sorted by similarity

Step 5: Prompt Building — Giving the LLM Context

Now we take the retrieved chunks and build a prompt that tells the LLM: "Answer this question using ONLY the provided context."

// prompt.js — Build the RAG prompt
function buildPrompt(question, retrievedChunks) {
  const context = retrievedChunks
    .map((chunk, i) => `[Source ${i + 1}] ${chunk.text}`)
    .join('\n\n');

  return `You are a helpful assistant that answers questions based on the provided document context.

IMPORTANT RULES:
- Only answer based on the context below
- If the answer is not in the context, say "I don't have enough information to answer that"
- Be specific and cite which source you're drawing from
- Keep answers concise but complete

CONTEXT:
${context}

QUESTION: ${question}

ANSWER:`;
}

module.exports = { buildPrompt };

Step 6: LLM Response — Generating the Answer

DocMind uses Ollama to run an LLM locally. No API keys, no cloud costs, complete privacy.

// llm.js — Query Ollama for the answer
async function queryLLM(prompt) {
  try {
    const response = await fetch('http://localhost:11434/api/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'qwen2:0.5b',   // Small, fast, runs on CPU
        prompt,
        stream: false,
        options: {
          temperature: 0.3,     // Low temp = more focused answers
          num_predict: 500,     // Max tokens in response
        },
      }),
    });

    const data = await response.json();
    return data.response;
  } catch (err) {
    // Fallback: extract answer using regex matching
    return fallbackExtract(prompt);
  }
}

// Fallback when Ollama is unavailable
function fallbackExtract(prompt) {
  const contextMatch = prompt.match(/CONTEXT:\n([\s\S]*?)\nQUESTION:/);
  if (!contextMatch) return "Unable to process your question.";

  const context = contextMatch[1];
  // Return the most relevant chunk as the answer
  const sentences = context.split(/[.!?]+/).filter(s => s.trim().length > 20);
  return sentences.slice(0, 3).join('. ') + '.';
}

module.exports = { queryLLM };

Putting It All Together

// rag-pipeline.js — The complete RAG pipeline
const { extractText } = require('./extract');
const { chunkText } = require('./chunk');
const { embedText, embedChunks } = require('./embeddings');
const { VectorStore } = require('./vector-store');
const { buildPrompt } = require('./prompt');
const { queryLLM } = require('./llm');

const vectorStore = new VectorStore();

// UPLOAD: Process a new document
async function uploadDocument(filePath, docId, docName) {
  console.log(`Processing ${docName}...`);

  // Step 1: Extract text
  const text = await extractText(filePath);
  console.log(`Extracted ${text.length} characters`);

  // Step 2: Chunk text
  const chunks = chunkText(text, 500, 50);
  console.log(`Created ${chunks.length} chunks`);

  // Step 3: Generate embeddings
  const embeddedChunks = await embedChunks(chunks);
  console.log(`Generated ${embeddedChunks.length} embeddings`);

  // Step 4: Store in vector index
  vectorStore.addDocument(docId, docName, embeddedChunks);
  console.log(`Document "${docName}" indexed and ready for queries`);

  return { chunks: chunks.length, characters: text.length };
}

// QUERY: Ask a question about uploaded documents
async function askQuestion(question) {
  // Step 1: Embed the question
  const queryVector = await embedText(question);

  // Step 2: Find similar chunks
  const topChunks = vectorStore.search(queryVector, 5);

  if (topChunks.length === 0 || topChunks[0].similarity < 0.3) {
    return {
      answer: "I don't have enough relevant information to answer that question.",
      sources: [],
    };
  }

  // Step 3: Build prompt with context
  const prompt = buildPrompt(question, topChunks);

  // Step 4: Get LLM answer
  const answer = await queryLLM(prompt);

  return {
    answer,
    sources: topChunks.map(c => ({
      document: c.document,
      text: c.text.slice(0, 100) + '...',
      similarity: (c.similarity * 100).toFixed(1) + '%',
    })),
  };
}

Real-World Example

Here's what it looks like in practice:

  USER INTERACTION
  ════════════════

  > Upload: "annual-report-2025.pdf" (85 pages)
  ✓ Extracted 142,000 characters
  ✓ Created 284 chunks
  ✓ Generated 284 embeddings (took 12s)
  ✓ Document indexed and ready

  > Question: "What was the total revenue for Q4 2025?"

  Answer: Based on the financial summary in the report,
  total revenue for Q4 2025 was ₹1,247 crores, representing
  a 23% increase from Q4 2024. The primary drivers were
  the expansion of the digital services segment (up 34%)
  and increased enterprise contracts.

  Sources:
  ├── [96.2% match] Page 42: "Q4 2025 Financial Highlights..."
  ├── [91.8% match] Page 43: "Revenue breakdown by segment..."
  └── [87.4% match] Page 7: "Executive summary: record revenue..."

Common Mistakes in RAG Apps

1. Chunks Too Large or Too Small

  ┌────────────┬──────────────────────────────────────┐
  │ Chunk size │ Problem                              │
  ├────────────┼──────────────────────────────────────┤
  │ Too small  │ Loses context. "30 days" means       │
  │ (< 100)    │ nothing without surrounding text.    │
  ├────────────┼──────────────────────────────────────┤
  │ Too large  │ Embedding captures too many topics.  │
  │ (> 2000)   │ Search returns irrelevant content.   │
  ├────────────┼──────────────────────────────────────┤
  │ Sweet spot │ 300-800 characters with 50-100       │
  │            │ character overlap.                   │
  └────────────┴──────────────────────────────────────┘

2. No Overlap Between Chunks

If a key fact spans a chunk boundary, neither chunk contains it in full, so neither embedding captures it. Always overlap.
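A toy run makes the failure concrete. This is a stripped-down fixed-size chunker, not the chunkText above, with sizes shrunk so the split is visible:

```javascript
// Minimal fixed-size chunker, just to demonstrate the overlap effect
function naiveChunks(text, size, overlap) {
  const out = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    out.push(text.slice(start, start + size));
  }
  return out;
}

const text = 'The dog barked at the mailman while the cat slept nearby.';
const fact = 'mailman while the cat';

// Without overlap the fact is split across the 30-char boundary;
// with a 10-char overlap, one chunk carries it whole.
console.log(naiveChunks(text, 30, 0).some(c => c.includes(fact)));  // false
console.log(naiveChunks(text, 30, 10).some(c => c.includes(fact))); // true
```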

3. Using Different Embedding Models for Chunks and Queries

The question and the chunks MUST be embedded with the same model. Different models produce incompatible vector spaces.
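One cheap safeguard is to record which model built the index and reject queries embedded with anything else. GuardedStore below is a hypothetical wrapper sketched for this post, not part of DocMind:

```javascript
// Hypothetical sketch: tag the index with its embedding model name so a
// mismatched query fails loudly instead of returning garbage similarities.
class GuardedStore {
  constructor(modelName) {
    this.modelName = modelName; // e.g. 'Xenova/all-MiniLM-L6-v2'
    this.chunks = [];
  }

  add(text, vector, modelName) {
    if (modelName !== this.modelName) {
      throw new Error(`Index uses ${this.modelName}, chunk embedded with ${modelName}`);
    }
    this.chunks.push({ text, vector });
  }

  search(queryVector, queryModelName) {
    if (queryModelName !== this.modelName) {
      throw new Error(`Index uses ${this.modelName}, query embedded with ${queryModelName}`);
    }
    // ...cosine similarity search over this.chunks, as in VectorStore above...
    return this.chunks;
  }
}
```

Same-dimension vectors from different models will happily multiply together, so a name check like this catches a bug that no runtime error ever would.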

4. Not Setting a Similarity Threshold

If the top result has 0.3 similarity, the document probably doesn't contain the answer. Return "I don't know" instead of hallucinating.

5. Sending the Entire Document to the LLM

LLMs have context limits. Even with large contexts, more text = slower + more expensive + less focused answers. Retrieve only the relevant chunks.
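A simple guard is to cap the total context before building the prompt. The 4000-character default below is an arbitrary placeholder; tune it to your model's context window:

```javascript
// Keep only as many top-ranked chunks as fit in a character budget.
// Assumes chunks arrive sorted best-first, as vectorStore.search returns them.
function fitToBudget(chunks, maxChars = 4000) {
  const kept = [];
  let used = 0;
  for (const chunk of chunks) {
    if (used + chunk.text.length > maxChars) break;
    kept.push(chunk);
    used += chunk.text.length;
  }
  return kept;
}

// Three 300-char chunks under a 700-char budget: only the top two fit
const chunks = [1, 2, 3].map(() => ({ text: 'x'.repeat(300) }));
console.log(fitToBudget(chunks, 700).length); // 2
```

Because the input is already sorted by similarity, trimming from the tail always drops the least relevant context first.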

When to Use RAG / When Not to Use It

  USE RAG WHEN:                          DON'T USE WHEN:
  ══════════════                         ════════════════

  ✅ You have specific documents          ❌ The question is general knowledge
     to answer questions about               (just use the LLM directly)
  ✅ Documents change frequently          ❌ You need the LLM to reason
     (RAG = always up to date)               creatively (RAG constrains it)
  ✅ Accuracy matters more than           ❌ Documents are tiny (< 1 page)
     creativity                              (just put them in the prompt)
  ✅ You want source attribution          ❌ You need real-time data
     ("this came from page 42")              (RAG works on static docs)
  ✅ Privacy matters (local RAG           ❌ You need to synthesize across
     = data never leaves your machine)       hundreds of documents (use
                                             a search engine instead)

Key Takeaways

  • RAG = Retrieve + Augment + Generate — ground LLM answers in real document content
  • Chunking strategy matters — 300-800 chars with 50-100 char overlap is the sweet spot
  • Embeddings capture meaning — cosine similarity finds semantically similar text, not keyword matches
  • Use the same embedding model for both documents and queries
  • Set a similarity threshold — don't answer if the best match is weak
  • Local RAG is possible — Ollama + Xenova Transformers, zero API costs
  • Fallback gracefully — if the LLM is down, extract answers from the chunks directly

Connect With Me

If you found this useful, drop a reaction and follow — I post deep-dive AI engineering tutorials every week, always with full code and working examples.

Next in this series: "WebSocket vs SSE vs Long Polling — The Complete Visual Comparison"

Questions or feedback? Drop a comment below — I respond to every one.
