surajrkhonde

Posted on Jun 22

Phase 1: Document Ingestion - The Hidden Complexity Before Embeddings

#llm #rag #dataengineering #ai

The Complete Story: Why Most RAG Systems Fail Before They Start

The Story Begins: Why Your Upload Button Is Just The Beginning

👦 Nephew: Uncle! I finally built my RAG system. User uploads a PDF, system finds answers. Simple, right?

👨‍🦳 Uncle: (smiles knowingly) You uploaded a PDF and got an answer?

👦 Nephew: Yes! It works!

👨‍🦳 Uncle: Did you get the right answer?

👦 Nephew: Well... sometimes. Why?

👨‍🦳 Uncle: Because between "user uploads PDF" and "system creates embeddings", there are 15 critical steps. Skip even one, and your system fails silently. You get wrong answers and don't know why.

👦 Nephew: 15 steps? I just embedded the text!

👨‍🦳 Uncle: Exactly. That's the problem. Come, let me show you what production engineers actually do.

The 15 Steps of Phase 1: Document Ingestion

👨‍🦳 Uncle: Think of it like cooking biryani. You don't just dump rice and meat, right?

👦 Nephew: No, you prepare everything first!

👨‍🦳 Uncle: Exactly. Before you cook, you:

Wash rice
Soak rice
Prepare meat
Marinate meat
Chop onions
... and many more steps

Only THEN you cook.

The same with documents. Before embeddings, you must:

1. Document Upload (user action)
2. File Hashing (did we see this before?)
3. PDF Parsing (extract text)
4. Text Extraction (convert PDF → text)
5. Text Cleaning (remove junk)
6. Metadata Extraction (add context)
7. Chunking (split smartly)
8. Chunk Boundaries (don't break meaning)
9. Chunk Size (1000 tokens, not 100)
10. Overlap (context continuity)
11. Chunk Hashing (detect changes)
12. Deduplication (prevent duplicates)
13. Versioning (handle updates)
14. Incremental Ingestion (avoid re-embedding)
15. Cost Optimization (save money)

ONLY THEN:
   ↓
   Embeddings
   ↓
   Vector DB

👦 Nephew: That's a lot! Where do I start?

👨‍🦳 Uncle: With step 1. Let's go slow. The foundation must be solid.

PHASE 1, STEP 1-2: Document Upload & File Hashing

Why Does File Hashing Matter?

👦 Nephew: Why hash a file? Why not just upload it?

👨‍🦳 Uncle: Because humans are lazy. The HR person uploads the same PDF three times. Your system processes it three times. Three embeddings created. Three times the cost.

👦 Nephew: So hash prevents duplicates?

👨‍🦳 Uncle: Yes. But here's the trick: don't hash the filename.

👦 Nephew: Why not?

👨‍🦳 Uncle: Because someone could change the content and keep the same name. Look:

File: HR_Policy.pdf (Version 1)
Content: "30 days notice required"
Filename Hash: HR_Policy.pdf

File: HR_Policy.pdf (Version 2)
Content: "7 days notice required"
Filename Hash: HR_Policy.pdf (SAME!)

System thinks: Same file!
Reality: Different policies!

That's a disaster.

👦 Nephew: So we hash the content?

👨‍🦳 Uncle: Yes. The actual binary data.

HR_Policy.pdf (Version 1)
↓
[PDF binary bytes: 0xAA 0xBB 0xCC ...]
↓
SHA256
↓
A7B82C1F9D3E...

HR_Policy.pdf (Version 2)
↓
[PDF binary bytes: 0xAA 0xBB 0xDD ...]  ← Different!
↓
SHA256
↓
5E9D47C3B2A1...  ← Different hash!

System detects: New file, process it.

Step 1-2: Node.js Implementation

// src/ingestion/fileHasher.ts

import crypto from 'crypto';
import fs from 'fs';
import path from 'path';
import db from '../config/database';
import logger from '../utils/logger';

interface FileHashResult {
  fileName: string;
  fileHash: string;
  alreadyExists: boolean;
  fileSize: number;
}

/**
 * Step 1-2: Upload file and check if already processed
 * 
 * ⚠️ CRITICAL POINTS:
 * 1. Hash FILE CONTENT, not filename
 * 2. Same content → same hash (deterministic)
 * 3. Different content → different hash (even if same filename)
 * 4. Check database before processing
 */
export async function handleFileUpload(
  filePath: string,
  tenantId: string
): Promise<FileHashResult> {
  try {
    const fileName = path.basename(filePath);
    const fileSize = fs.statSync(filePath).size;

    // Step 1: Read file and create hash from content
    logger.info('File upload started', { fileName, fileSize });

    const fileContent = fs.readFileSync(filePath);
    const fileHash = crypto
      .createHash('sha256')
      .update(fileContent)
      .digest('hex');

    logger.debug('File hashed', { fileName, fileHash, contentBytes: fileContent.length });

    // Step 2: Check if this exact file was already processed
    const existingFile = await db.oneOrNone(
      `
      SELECT id, created_at 
      FROM documents 
      WHERE file_hash = $1 AND tenant_id = $2
      `,
      [fileHash, tenantId]
    );

    if (existingFile) {
      logger.warn('Duplicate file detected', {
        fileName,
        fileHash,
        firstUploadedAt: existingFile.created_at
      });

      return {
        fileName,
        fileHash,
        alreadyExists: true,
        fileSize
      };
    }

    // File is new: register it so later steps can track its status
    const documentRecord = await db.one(
      `
      INSERT INTO documents (tenant_id, file_name, file_hash, file_size, status)
      VALUES ($1, $2, $3, $4, $5)
      RETURNING id, created_at
      `,
      [tenantId, fileName, fileHash, fileSize, 'uploaded']
    );

    logger.info('New file registered', {
      documentId: documentRecord.id,
      fileName,
      fileHash: fileHash.substring(0, 12) + '...'
    });

    return {
      fileName,
      fileHash,
      alreadyExists: false,
      fileSize
    };

  } catch (error: any) {
    logger.error('File upload error', { error: error.message });
    throw error;
  }
}

/**
 * Verify file integrity (optional but recommended)
 * If file is corrupted, don't process it
 */
export function verifyFileIntegrity(filePath: string, expectedHash: string): boolean {
  const fileContent = fs.readFileSync(filePath);
  const actualHash = crypto
    .createHash('sha256')
    .update(fileContent)
    .digest('hex');

  return actualHash === expectedHash;
}

// Database schema for documents table
export const documentsTableSQL = `
CREATE TABLE IF NOT EXISTS documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL REFERENCES tenants(id),
  file_name VARCHAR(500) NOT NULL,
  file_hash VARCHAR(64) NOT NULL,  -- SHA256 produces 64 char hex
  file_size BIGINT NOT NULL,
  status VARCHAR(50) DEFAULT 'uploaded',  -- uploaded, parsing, parsed, chunking, chunked, embedding, complete
  error_message TEXT,
  uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

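  -- Unique constraint: same file can't be uploaded twice for same tenant
  CONSTRAINT unique_file_per_tenant UNIQUE(tenant_id, file_hash)
);

-- PostgreSQL does not accept inline INDEX clauses, so indexes are created separately
CREATE INDEX IF NOT EXISTS idx_file_hash ON documents (file_hash);
CREATE INDEX IF NOT EXISTS idx_tenant_status ON documents (tenant_id, status);
`;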
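
The functions above read the whole file into memory before hashing, which is fine for small PDFs. For large uploads, a minimal sketch of the same content hash computed piece by piece could look like this. The helper name `hashFileInChunks` and the 1 MB buffer size are assumptions for illustration, not part of the original code.

```typescript
import { createHash } from 'node:crypto';
import * as fs from 'node:fs';

/**
 * Hash a file's content in fixed-size pieces so the whole file
 * never has to sit in memory at once.
 * (Hypothetical helper; the buffer size is an arbitrary choice.)
 */
export function hashFileInChunks(filePath: string, bufferSize = 1024 * 1024): string {
  const hash = createHash('sha256');
  const buffer = Buffer.alloc(bufferSize);
  const fd = fs.openSync(filePath, 'r');

  try {
    let bytesRead = 0;
    // readSync returns 0 once the end of the file is reached
    while ((bytesRead = fs.readSync(fd, buffer, 0, bufferSize, null)) > 0) {
      hash.update(buffer.subarray(0, bytesRead));
    }
  } finally {
    fs.closeSync(fd);
  }

  return hash.digest('hex');
}
```

The digest is identical to hashing the full buffer in one call, so it can be compared against `file_hash` values that were stored earlier.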

👦 Nephew: So if someone uploads the same PDF twice, we detect it and skip?

👨‍🦳 Uncle: Exactly. And we save the embedding cost, which is the most expensive step.

PHASE 1, STEP 3: PDF Parsing - Choosing the Right Tool

👦 Nephew: Now we have the PDF. How do we extract text?

👨‍🦳 Uncle: This is where the real decision happens. There are five major tools. Each has tradeoffs.

👦 Nephew: Five?! Which one should I use?

👨‍🦳 Uncle: Depends on your documents. Let me show you.

The PDF Parsing Landscape

Simple Text PDFs          →  pdf-parse (cheap, simple)
                          ↓
Mixed content (text + tables) →  PDFPlumber (better)
                          ↓
Complex documents        →  Unstructured (production-grade)
                          ↓
Advanced documents       →  LlamaParse (state-of-art)
(tables, images, OCR)
                          ↓
Enterprise documents     →  Azure Document Intelligence
(forms, invoices, scans)

Tool Comparison

Tool	Best For	Cost	Speed	Table Support	OCR	Metadata	Production Ready
pdf-parse	Simple text	₹0 (free)	⚡⚡⚡ Fast	✗ No	✗	✗	⚠️ Hobby
PDFPlumber	Text + tables	₹0	⚡⚡ Medium	✓ Basic	✗	⚠️ Limited	⚠️ Small
Unstructured	Normal docs	₹50-200/mo	⚡ Slow	✓ Good	✓ Basic	✓ Good	✓ Yes
LlamaParse	Complex docs	₹100-500/mo	⚡ Slow	✓ Excellent	✓ Advanced	✓ Excellent	✓ Yes
Azure Doc Int.	Enterprise	₹500-2000/mo	⚡ Medium	✓ Perfect	✓ Perfect	✓ Perfect	✓ Enterprise

👨‍🦳 Uncle: Let me explain each.

Tool 1: pdf-parse (Free, Simple)

// Simple approach - good for learning, bad for production

const pdf = require('pdf-parse');
const fs = require('fs');

async function extractTextFromPDF(filePath) {
  const dataBuffer = fs.readFileSync(filePath);

  const data = await pdf(dataBuffer);

  console.log(data.text);  // Raw text
  // Output: "Company Policy Leave Policy Employees receive..."
}

👨‍🦳 Uncle: Notice what we lost?

Original PDF:
═══════════════════════════════════════
COMPANY POLICY
───────────────────────────────────────
Leave Policy (HEADING)

Employees are entitled to 24 paid leaves annually.
(PARAGRAPH)

Leave Types:
- Annual Leave
- Casual Leave
(LIST)
═══════════════════════════════════════

pdf-parse Output:
"COMPANY POLICY Leave Policy Employees are entitled to 24 paid leaves annually. Leave Types: Annual Leave Casual Leave"

LOST:
✗ Heading level
✗ List structure
✗ Paragraph breaks
✗ Section organization
✗ Tables (if any)

👦 Nephew: So we just get text soup?

👨‍🦳 Uncle: Yes. And when you embed soup, you get soup answers.

Tool 2: PDFPlumber (Better, Still Python)

# PDFPlumber - extracts tables better

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages, start=1):  # number pages from 1, like the PDF does
        # Extract text
        text = page.extract_text()

        # Extract tables (if any)
        tables = page.extract_tables()

        print(f"Page {i}:")
        print(f"Text: {text}")
        print(f"Tables: {tables}")

👨‍🦳 Uncle: Better, but still loses structure. And it's Python, not Node.js.

Tool 3: Unstructured (Production Choice)

👨‍🦳 Uncle: This is what most companies use. It preserves structure.

// Using Unstructured via API (Node.js friendly)

import axios from 'axios';
import FormData from 'form-data';
import fs from 'fs';
import path from 'path';
import logger from '../utils/logger';

/**
 * Step 3: Parse PDF using Unstructured
 * 
 * IMPORTANT: Unstructured preserves document structure
 * Returns: Array of structured elements
 */
export async function parsePDFWithUnstructured(filePath: string): Promise<any[]> {
  try {
    // The hosted API takes a multipart/form-data upload (field name "files"),
    // not a base64 string in a JSON body. Check the current API reference
    // before relying on these parameter names.
    const formData = new FormData();
    formData.append('files', fs.createReadStream(filePath), path.basename(filePath));
    formData.append('strategy', 'hi_res');   // High resolution parsing
    formData.append('coordinates', 'true');  // Preserve coordinates (useful for tables)

    const response = await axios.post(
      'https://api.unstructuredapp.io/general/v0/general',
      formData,
      {
        headers: {
          'unstructured-api-key': process.env.UNSTRUCTURED_API_KEY,
          'Accept': 'application/json',
          ...formData.getHeaders()
        }
      }
    );

    // The response body is the element array itself
    const elements: any[] = response.data;

    // Output: Structured elements
    // [
    //   { type: "Title", text: "Leave Policy", metadata: {...} },
    //   { type: "NarrativeText", text: "Employees are entitled...", metadata: {...} },
    //   { type: "ListItem", text: "Manager approval required", metadata: {...} }
    // ]

    logger.info('PDF parsed with structure preserved', {
      elementCount: elements.length,
      types: [...new Set(elements.map((e: any) => e.type))]
    });

    return elements;

  } catch (error: any) {
    logger.error('Unstructured parsing failed', { error: error.message });
    throw error;
  }
}

👦 Nephew: So it preserves structure?

👨‍🦳 Uncle: Yes. Look at the difference:

Unstructured Output:

[
  { 
    type: "Title",
    text: "Leave Policy",
    metadata: {
      page_number: 1,
      section: "Policies"
    }
  },
  { 
    type: "NarrativeText",
    text: "Employees receive 24 leaves",
    metadata: {
      page_number: 1
    }
  },
  { 
    type: "ListItem",
    text: "Annual Leave",
    metadata: {
      page_number: 2
    }
  },
  { 
    type: "ListItem",
    text: "Casual Leave",
    metadata: {
      page_number: 2
    }
  }
]

Now we know:
✓ What is a title
✓ What is body text
✓ What is a list
✓ Page numbers
✓ Sections

Tool 4: LlamaParse (State of the Art)

👨‍🦳 Uncle: For really complex documents, use LlamaParse.

// LlamaParse - best for complex PDFs

import axios from 'axios';
import FormData from 'form-data';
import fs from 'fs';
import logger from '../utils/logger';

/**
 * Parse PDF with LlamaParse (best for:
 * - Multi-column layouts
 * - Tables with merged cells
 * - Images with text
 * - Scanned documents (OCR)
 * - Footnotes and annotations
 */
export async function parsePDFWithLlamaParse(filePath: string): Promise<any> {
  try {
    // Step 1: Upload file
    const formData = new FormData();
    formData.append('file', fs.createReadStream(filePath));
    formData.append('parsing_instruction', `
      Extract all content including:
      - Tables with proper structure
      - Column layouts
      - Images with OCR
      - Headings and sections
      - Metadata like page numbers
    `);

    const uploadResponse = await axios.post(
      'https://api.llamaindex.ai/api/parsing/upload_file',
      formData,
      {
        headers: {
          'Authorization': `Bearer ${process.env.LLAMAPARSE_API_KEY}`,
          ...formData.getHeaders()
        }
      }
    );

    const jobId = uploadResponse.data.id;

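    // Step 2: Poll for results (up to 60 attempts, 2 seconds apart)
    let result = null;
    for (let i = 0; i < 60; i++) {
      const statusResponse = await axios.get(
        `https://api.llamaindex.ai/api/parsing/job/${jobId}/result/markdown`,
        {
          headers: {
            'Authorization': `Bearer ${process.env.LLAMAPARSE_API_KEY}`
          },
          // axios rejects on non-2xx by default, which would abort the loop
          // on the first "not ready" response, so inspect the status ourselves
          validateStatus: () => true
        }
      );

      if (statusResponse.status === 200) {
        result = statusResponse.data;
        break;
      }

      // Wait 2 seconds before retry
      await new Promise(resolve => setTimeout(resolve, 2000));
    }

    // Fail loudly instead of returning an empty result
    if (result === null) {
      throw new Error(`LlamaParse job ${jobId} did not finish within 2 minutes`);
    }

    logger.info('LlamaParse completed', { jobId });

    return {
      markdown: result,
      parsedAt: new Date()
    };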

  } catch (error: any) {
    logger.error('LlamaParse failed', { error: error.message });
    throw error;
  }
}

👦 Nephew: When do I use LlamaParse vs Unstructured?

👨‍🦳 Uncle: Simple rule:

Document type?

  ├─ Simple text policies
  │  └─→ pdf-parse (free)
  │
  ├─ Text + basic tables
  │  └─→ Unstructured (cheap, good)
  │
  ├─ Complex tables, multi-column
  │  └─→ LlamaParse (excellent)
  │
  └─ Enterprise documents, forms, invoices
     └─→ Azure Document Intelligence (best)

Tool 5: Azure Document Intelligence (Enterprise)

// Azure Document Intelligence - for enterprise documents

import { DocumentAnalysisClient, AzureKeyCredential } from "@azure/ai-form-recognizer";
import fs from 'fs';
import logger from '../utils/logger';

/**
 * Parse with Azure (best for:
 * - Invoices
 * - Forms
 * - Bank documents
 * - Scanned PDFs with OCR
 */
export async function parseWithAzureDocumentIntelligence(filePath: string) {
  try {
    // Fail early if credentials are missing (process.env values may be undefined)
    const endpoint = process.env.AZURE_FORM_RECOGNIZER_ENDPOINT;
    const key = process.env.AZURE_FORM_RECOGNIZER_KEY;
    if (!endpoint || !key) {
      throw new Error('Azure Form Recognizer endpoint and key must be set');
    }

    const client = new DocumentAnalysisClient(endpoint, new AzureKeyCredential(key));

    const fileContent = fs.readFileSync(filePath);

    // Choose model based on document type
    const poller = await client.beginAnalyzeDocument(
      "prebuilt-document",  // Or: prebuilt-invoice, prebuilt-receipt, etc.
      fileContent
    );

    const result = await poller.pollUntilDone();

    // Extract structure
    const output = {
      text: result.content,
      tables: result.tables?.map(table => ({
        rows: table.rowCount,
        columns: table.columnCount,
        cells: table.cells
      })),
      // prebuilt-document returns key-value pairs; typed fields (invoice,
      // receipt) and their confidence live on result.documents[n]
      keyValuePairs: result.keyValuePairs,
      forms: result.documents?.[0]?.fields,
      confidence: result.documents?.[0]?.confidence
    };

    logger.info('Azure parsing complete', {
      pages: result.pages?.length,
      tables: result.tables?.length,
      confidence: output.confidence
    });

    return output;

  } catch (error: any) {
    logger.error('Azure parsing failed', { error: error.message });
    throw error;
  }
}

👦 Nephew: These all cost money?

👨‍🦳 Uncle: Yes, except pdf-parse and PDFPlumber. And here's the secret: you should still use pdf-parse for simple documents!

👦 Nephew: Why?

👨‍🦳 Uncle: Cost. If 80% of your documents are simple policies, use pdf-parse for them. Use the expensive parsers only for the 20% that need them.
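
The 80/20 routing described above can be sketched as a small decision function that mirrors the decision tree. Everything in it is an assumption for illustration: the `DocumentProfile` shape, the `chooseParser` name, and the 100-characters-per-page threshold for guessing that a PDF is scanned. The profile would typically come from a cheap first pass, for example running pdf-parse and looking at how much text came back.

```typescript
type Parser =
  | 'pdf-parse'
  | 'unstructured'
  | 'llamaparse'
  | 'azure-document-intelligence';

// Hypothetical summary of a document, built from a cheap first pass
interface DocumentProfile {
  pageCount: number;
  extractedChars: number;        // characters the cheap parser managed to extract
  hasTables: boolean;
  hasMultiColumnLayout: boolean;
  isFormOrInvoice: boolean;
}

export function chooseParser(profile: DocumentProfile): Parser {
  const charsPerPage =
    profile.pageCount > 0 ? profile.extractedChars / profile.pageCount : 0;

  // Very little extractable text usually means a scanned PDF that needs OCR
  // (the 100-character threshold is an assumed value, tune it on real data)
  const looksScanned = charsPerPage < 100;

  if (profile.isFormOrInvoice) return 'azure-document-intelligence';
  if (looksScanned || profile.hasMultiColumnLayout) return 'llamaparse';
  if (profile.hasTables) return 'unstructured';
  return 'pdf-parse';
}

// Usage: a plain 10-page text policy goes to the free parser
const parser = chooseParser({
  pageCount: 10,
  extractedChars: 20000,
  hasTables: false,
  hasMultiColumnLayout: false,
  isFormOrInvoice: false
});
console.log(parser);
```

Checking the most expensive cases first keeps the function short, and logging which branch each document takes makes it easy to see whether the 80/20 split actually holds for a given corpus.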

PHASE 1, STEP 4-5: Text Extraction & Cleaning

The Garbage Problem

👨‍🦳 Uncle: After parsing, you get garbage.

Raw Extracted Text:

═════════════════════════════════════════════════════════════
Company Name                          Page 1
═════════════════════════════════════════════════════════════
[CONFIDENTIAL]

LEAVE POLICY

[Document Version: 2.3]

Employees are entitled to 24 paid leaves annually.
These leaves must be approved by:
1. Direct Manager
2. HR Department

[CONFIDENTIAL]
═════════════════════════════════════════════════════════════
Company Name                          Page 2
═════════════════════════════════════════════════════════════

👦 Nephew: What's wrong with this?

👨‍🦳 Uncle: If every chunk carries "[CONFIDENTIAL]", that word becomes part of every embedding. When someone asks "Is this policy confidential?", every chunk looks relevant and the answer is always YES!

// Step 4-5: Clean the text

import logger from '../utils/logger';

/**
 * Step 4-5: Extract and Clean text
 * 
 * Remove:
 * - Page headers/footers
 * - Repeated watermarks
 * - Extra whitespace
 * - Page numbers
 * - Confidential markings (if generic)
 */
export function cleanExtractedText(rawText: string): string {
  let cleaned = rawText;

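  // Remove decorative rule lines (═══, ───, ===) that frame the headers
  cleaned = cleaned.replace(/^[═─=]{3,}\s*$/gm, '');

  // Remove page headers (appears at top of every page)
  cleaned = cleaned.replace(
    /^Company Name\s+Page \d+\s*$/gm,
    ''
  );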

  // Remove confidential markings (if generic)
  cleaned = cleaned.replace(
    /\[CONFIDENTIAL\]\n?/g,
    ''
  );

  // Remove document version
  cleaned = cleaned.replace(
    /\[Document Version: [\d.]+\]/g,
    ''
  );

  // Remove page numbers alone on a line
  cleaned = cleaned.replace(
    /^\s*Page \d+\s*$/gm,
    ''
  );

  // Normalize whitespace (multiple spaces/newlines → single space)
  cleaned = cleaned.replace(/\s+/g, ' ');

  // Remove leading/trailing whitespace
  cleaned = cleaned.trim();

  logger.debug('Text cleaned', {
    originalLength: rawText.length,
    cleanedLength: cleaned.length,
    removed: rawText.length - cleaned.length
  });

  return cleaned;
}

// Example
const raw = `
═════════════════════
Company Name    Page 1
═════════════════════
[CONFIDENTIAL]

LEAVE POLICY

Employees get 24 leaves.

[CONFIDENTIAL]
═════════════════════
Company Name    Page 2
═════════════════════
`;

const clean = cleanExtractedText(raw);
// Output: "LEAVE POLICY Employees get 24 leaves."
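
The regexes above are hardcoded for one company's header. A more general sketch, assuming the parser gives you the text page by page, is to drop any line that repeats on most pages. The function name `stripRepeatedLines` and the 60% threshold are assumptions for illustration.

```typescript
/**
 * Remove lines that repeat on most pages (headers, footers, watermarks).
 * Digits are masked while counting so "Page 1" and "Page 2" count as
 * the same line. (Hypothetical helper; the threshold is an assumed default.)
 */
export function stripRepeatedLines(pages: string[], minFraction = 0.6): string[] {
  const normalise = (line: string): string => line.trim().replace(/\d+/g, '#');

  // Count on how many pages each normalised line appears (once per page)
  const pageCounts: Record<string, number> = Object.create(null);
  for (const page of pages) {
    const seenOnPage: Record<string, boolean> = Object.create(null);
    for (const line of page.split('\n')) {
      const key = normalise(line);
      if (key.length === 0 || seenOnPage[key]) continue;
      seenOnPage[key] = true;
      pageCounts[key] = (pageCounts[key] || 0) + 1;
    }
  }

  // A line must appear on at least two pages to count as boilerplate
  const threshold = Math.max(2, Math.ceil(pages.length * minFraction));

  return pages.map(page =>
    page
      .split('\n')
      .filter(line => {
        const key = normalise(line);
        return key.length === 0 || (pageCounts[key] || 0) < threshold;
      })
      .join('\n')
  );
}

// Usage
const cleanedPages = stripRepeatedLines([
  'Company Name    Page 1\n[CONFIDENTIAL]\nLEAVE POLICY\nEmployees get 24 leaves.',
  'Company Name    Page 2\n[CONFIDENTIAL]\nManager approval is required.',
  'Company Name    Page 3\n[CONFIDENTIAL]\nCasual leave guidelines.'
]);
console.log(cleanedPages[0]);
```

Because digits are masked, a content line that repeats on most pages with only a number changed would also be removed, so it is worth logging what gets dropped before trusting this on a new document set.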

PHASE 1, STEP 6: Metadata Extraction

👨‍🦳 Uncle: Metadata is information ABOUT the chunk, not IN the chunk.

👦 Nephew: Like what?

👨‍🦳 Uncle: Like:

Document: HR_Policy.pdf
Department: Human Resources
Section: Leave Policy
Page: 3
Version: 2.3
Effective Date: 2025-01-01
Author: HR Manager
Source: Internal Wiki

This metadata helps retrieval.

👦 Nephew: How?

👨‍🦳 Uncle: Suppose someone asks: "What's the HR policy for leaves?"

Without metadata:

System searches all documents
Returns leave policies from EVERY department

With metadata:

System knows to search only HR documents
Returns only HR leave policy

Much better!
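
To make the "with metadata" case concrete, here is a minimal in-memory sketch of filtering chunks by metadata before any similarity search runs. The `StoredChunk` shape and `filterByMetadata` name are assumptions for illustration. In a real system the same filter would usually be a WHERE clause or the vector database's own metadata filter.

```typescript
// Hypothetical shape of a stored chunk with its metadata
interface StoredChunk {
  text: string;
  metadata: { department?: string; section?: string };
}

interface MetadataFilter {
  department?: string;
  section?: string;
}

/** Keep only chunks whose metadata matches every field set in the filter. */
export function filterByMetadata(
  chunks: StoredChunk[],
  filter: MetadataFilter
): StoredChunk[] {
  return chunks.filter(chunk =>
    (filter.department === undefined || chunk.metadata.department === filter.department) &&
    (filter.section === undefined || chunk.metadata.section === filter.section)
  );
}

// Usage: only HR chunks are passed on to the similarity search
const allChunks: StoredChunk[] = [
  { text: 'Employees receive 24 paid leaves annually.', metadata: { department: 'Human Resources', section: 'Leave Policy' } },
  { text: 'Contractors are not eligible for paid leave.', metadata: { department: 'Legal', section: 'Leave Policy' } },
  { text: 'Leave encashment is processed with salary.', metadata: { department: 'Finance', section: 'Salary' } }
];

const hrOnly = filterByMetadata(allChunks, { department: 'Human Resources' });
console.log(hrOnly.map(c => c.text));
```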

// Step 6: Extract metadata

import logger from '../utils/logger';

interface DocumentMetadata {
  source: string;
  department?: string;
  section?: string;
  version?: string;
  effectiveDate?: string;
  author?: string;
}

/**
 * Extract metadata from document structure
 * 
 * Sources:
 * 1. Folder structure: /HR/Policies/Leave_Policy.pdf
 * 2. Filename: leave_policy_v2.pdf
 * 3. Parser output: Unstructured gives headings as metadata
 * 4. LLM classification: Ask AI to classify
 */
export function extractMetadataFromStructure(
  filePath: string,
  parserElements: any[]
): DocumentMetadata {
  // Source 1: Folder structure (split on both / and \ so Windows paths work)
  const pathParts = filePath.split(/[\\/]/);
  const department = pathParts[pathParts.length - 2] === 'HR' ? 'Human Resources' : undefined;

  // Source 2: Filename, e.g. leave_policy_v2.pdf or leave_policy_v2.3.pdf
  // The "v" must follow a separator so names like "nov2024.pdf" don't match
  const fileName = pathParts[pathParts.length - 1];
  const versionMatch = fileName.match(/(?:^|[_\-\s.])v(\d+(?:\.\d+)*)/i);
  const version = versionMatch ? versionMatch[1] : undefined;

  // Source 3: Parser output (if using Unstructured)
  let section: string | undefined;
  if (parserElements && parserElements.length > 0) {
    const heading = parserElements.find((e: any) => e.type === 'Title' || e.type === 'Heading');
    if (heading) {
      section = heading.text;
    }
  }

  const metadata: DocumentMetadata = {
    source: filePath,
    department,
    section,
    version
  };

  logger.info('Metadata extracted', metadata);

  return metadata;
}

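import Anthropic from '@anthropic-ai/sdk';

/**
 * Optional: Use LLM to classify (more powerful but costs money)
 */
export async function classifyWithLLM(text: string) {
  const client = new Anthropic();  // reads ANTHROPIC_API_KEY from the environment

  const response = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 200,
    messages: [{
      role: 'user',
      content: `Classify this document. Return JSON:
{
  "department": "HR|Finance|IT|Legal",
  "topic": "Leave|Salary|Security|Contracts",
  "confidentiality": "public|internal|confidential"
}

Document: ${text.substring(0, 500)}`
    }]
  });

  // Content blocks are a union type; make sure the first one is text
  const block = response.content[0];
  if (!block || block.type !== 'text') {
    throw new Error('Unexpected response format from classifier');
  }

  // JSON.parse throws if the model wrapped the JSON in extra prose
  return JSON.parse(block.text);
}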

PHASE 1, STEP 7-10: Chunking - The Heart of Preparation

Why Chunking Matters

👨‍🦳 Uncle: This is the most important step. Everything after depends on it.

👦 Nephew: More important than embeddings?

👨‍🦳 Uncle: YES. Because bad chunks → bad embeddings → wrong answers.

Step 7: Chunking Strategy

// Step 7: Smart chunking with boundaries

/**
 * Chunking strategies:
 * 
 * 1. Fixed size (simple, dumb)
 *    ├─ Pros: Easy to understand
 *    └─ Cons: Breaks meaning
 * 
 * 2. Sentence boundaries (better)
 *    ├─ Pros: Doesn't break sentences
 *    └─ Cons: Variable sizes
 * 
 * 3. Semantic (best, most complex)
 *    ├─ Pros: Keeps meaning together
 *    └─ Cons: Expensive, slow
 * 
 * 4. Recursive (good hybrid)
 *    ├─ Pros: Smart, flexible
 *    └─ Cons: Complex implementation
 */

function slidingWindowChunk(
  text: string,
  windowTokens: number = 1000,
  overlapTokens: number = 200
): string[] {
  // 1 token ≈ 4 characters (rough estimate for English)
  const charWindow = windowTokens * 4;
  const charOverlap = overlapTokens * 4;
  const step = charWindow - charOverlap;

  const chunks: string[] = [];

  for (let i = 0; i < text.length; i += step) {
    let end = i + charWindow;

    // Find sentence boundary (don't break mid-sentence)
    if (end < text.length) {
      const periodIndex = text.lastIndexOf('.', end);
      const newlineIndex = text.lastIndexOf('\n', end);
      const boundaryIndex = Math.max(periodIndex, newlineIndex);

      if (boundaryIndex > i + (charWindow * 0.8)) {
        end = boundaryIndex + 1;
      }
    }

    const chunk = text.substring(i, end).trim();
    if (chunk.length > 0) {
      chunks.push(chunk);
    }

    if (end >= text.length) break;
  }

  return chunks;
}

// Example
const text = `
Employees receive 24 paid leaves annually. 
These leaves can be taken as:
1. Annual Leave (20 days)
2. Casual Leave (4 days)

Manager approval is required for:
- Leaves longer than 5 days
- Leave during peak season
`;

const chunks = slidingWindowChunk(text, 1000, 200);
console.log(chunks);
// The sample is far shorter than the 4000-character window,
// so it comes back as a single chunk:
// [
//   "Employees receive 24 paid leaves annually. ...(rest of the text)"
// ]
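
The strategy list above mentions recursive chunking as a good hybrid, but only the sliding window is implemented. A minimal sketch of the recursive idea, under the assumption that sizes are measured in characters: split on the coarsest separator first (paragraphs), pack the pieces greedily, and fall back to finer separators only for pieces that are still too large. The name `recursiveSplit` and the separator order are illustrative choices, and overlap is left out to keep it short.

```typescript
/**
 * Recursive chunking sketch: try coarse separators first, finer ones only
 * when a piece is still too big. Every returned chunk is at most maxChars.
 * Note: the separator at a chunk boundary is dropped.
 */
export function recursiveSplit(
  text: string,
  maxChars: number,
  separators: string[] = ['\n\n', '\n', '. ', ' ']
): string[] {
  const trimmed = text.trim();
  if (trimmed.length === 0) return [];
  if (trimmed.length <= maxChars) return [trimmed];

  // No separator left: fall back to a hard cut
  if (separators.length === 0) {
    const pieces: string[] = [];
    for (let i = 0; i < trimmed.length; i += maxChars) {
      pieces.push(trimmed.substring(i, i + maxChars));
    }
    return pieces;
  }

  const separator = separators[0];
  const finer = separators.slice(1);
  const chunks: string[] = [];
  let current = '';

  for (const part of trimmed.split(separator)) {
    const candidate = current === '' ? part : current + separator + part;
    if (candidate.length <= maxChars) {
      current = candidate;  // still fits: keep packing
      continue;
    }

    // Candidate is too big: flush what has been packed so far
    if (current.trim().length > 0) chunks.push(current.trim());

    if (part.length <= maxChars) {
      current = part;
    } else {
      // This part alone is too big: split it with finer separators
      for (const piece of recursiveSplit(part, maxChars, finer)) {
        chunks.push(piece);
      }
      current = '';
    }
  }

  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}

// Usage: two paragraphs that do not fit together in 80 characters
const paragraphs = recursiveSplit(
  'Employees receive 24 paid leaves annually.\n\nManager approval is required for leaves longer than 5 days.',
  80
);
console.log(paragraphs);
```

Compared with the sliding window, the boundaries here follow the document's own structure, so an edit inside one paragraph is less likely to shift every chunk after it.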

Step 8-10: Chunk Boundaries, Size, and Overlap

👦 Nephew: Why does size matter?

👨‍🦳 Uncle: Look at three scenarios:

Scenario 1: Too Small (100 tokens)

Chunk 1: "Employees receive"
Chunk 2: "24 paid leaves"
Chunk 3: "annually"

Result: Lost context. AI confused.

Question: "How many leaves do employees get?"
Chunk 1 alone: "Employees receive" (incomplete)
Chunk 2 alone: "24 paid leaves" (where?)
Chunk 3 alone: "annually" (what annually?)

Wrong answer!

─────────────────────────────────────

Scenario 2: Right Size (1000 tokens)

Chunk: "Employees receive 24 paid leaves annually.
         These leaves can be taken as annual or casual.
         Manager approval required for leaves >5 days."

Result: Full context. Clear meaning.

Question: "How many leaves do employees get?"
Answer: "24 paid leaves"

Correct!

─────────────────────────────────────

Scenario 3: Too Large (5000 tokens)

Chunk: [Entire company handbook]

Result: Too much noise.

Question: "How many leaves?"
System reads 5000 tokens to find answer.
Slow. Confused by other policies.

Wrong section highlighted.

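👨‍🦳 Uncle: Sweet spot for documents like these: 1000-1500 tokens. Treat it as a starting point and tune it against your own retrieval results.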

// Complete chunking service

import logger from '../utils/logger';

interface Chunk {
  text: string;
  index: number;
  tokens: number;
  metadata: {
    startChar: number;
    endChar: number;
  };
}

/**
 * Steps 7-10: Complete chunking strategy
 * 
 * - Sliding window with overlap
 * - Respects sentence boundaries  
 * - Includes overlap for context
 * - Validates chunk size
 */
export function createSmartChunks(
  text: string,
  windowTokens: number = 1000,
  overlapTokens: number = 200
): Chunk[] {
  const charWindow = windowTokens * 4;
  const charOverlap = overlapTokens * 4;
  const step = charWindow - charOverlap;

  const chunks: Chunk[] = [];
  let chunkIndex = 0;

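  for (let i = 0; i < text.length; i += step) {
    let end = Math.min(i + charWindow, text.length);

    // If only a small tail remains, merge it into the previous chunk
    const lastChunk = chunks[chunks.length - 1];
    if (lastChunk && text.length - i < charWindow * 0.3) {
      // Append only what the previous chunk has not already covered
      const tail = text.substring(lastChunk.metadata.endChar).trim();
      if (tail.length > 0) {
        lastChunk.text += ' ' + tail;
        lastChunk.tokens = Math.ceil(lastChunk.text.length / 4);
        lastChunk.metadata.endChar = text.length;
      }
      break;
    }

    // Find sentence boundary, but never cut before the next window starts
    // (i + step); cutting earlier would drop the text in between
    if (end < text.length) {
      const periodIndex = text.lastIndexOf('.', end);
      if (periodIndex >= i + step) {
        end = periodIndex + 1;
      }
    }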

    const chunkText = text.substring(i, end).trim();

    if (chunkText.length === 0) continue;

    // Estimate tokens (rough)
    const estimatedTokens = Math.ceil(chunkText.length / 4);

    // Validate: chunk should be reasonable size
    if (estimatedTokens < 100) {
      logger.warn('Chunk too small', { tokens: estimatedTokens });
    }

    chunks.push({
      text: chunkText,
      index: chunkIndex++,
      tokens: estimatedTokens,
      metadata: {
        startChar: i,
        endChar: end
      }
    });

    if (end >= text.length) break;
  }

  logger.info('Chunking complete', {
    textLength: text.length,
    chunkCount: chunks.length,
    avgChunkTokens: Math.round(
      chunks.reduce((sum, c) => sum + c.tokens, 0) / chunks.length
    )
  });

  return chunks;
}

PHASE 1, STEP 11-13: Deduplication, Hashing & Versioning

Step 11: Chunk Hashing

👨‍🦳 Uncle: After chunking, we hash each chunk.

👦 Nephew: Why hash chunks? We already hashed the file!

👨‍🦳 Uncle: Because files change. When they do, we need to know WHICH chunks changed.

HR_Policy_v1.pdf (Old)
├─ Chunk 1: "Employees get 24 leaves"
├─ Chunk 2: "Manager approval required"
└─ Chunk 3: "Casual leave guidelines"

HR_Policy_v2.pdf (New)
├─ Chunk 1: "Employees get 30 leaves"  ← CHANGED!
├─ Chunk 2: "Manager approval required" ← SAME
└─ Chunk 3: "Casual leave guidelines"   ← SAME

Old system: Re-embed all 3 chunks
Smart system: Re-embed only chunk 1

Cost saving: about 67% (2 of 3 chunks reused)!

// Step 11: Hash each chunk

import crypto from 'crypto';

export function hashChunk(chunkText: string): string {
  return crypto
    .createHash('sha256')
    .update(chunkText)
    .digest('hex');
}

/**
 * Key insight: Hash is deterministic
 * 
 * Same text:
 * hashChunk("Employees get 24 leaves")
 * Always returns: A7B82C1F...
 * 
 * Different text:
 * hashChunk("Employees get 30 leaves")
 * Always returns: 5E9D47C3...
 * 
 * Different hash = Document changed
 */

// Example
const chunk1 = "Employees receive 24 paid leaves annually";
const chunk1Hash = hashChunk(chunk1);
// a7b82c1f9d3e... (illustrative digest; real output is 64 hex characters)

const chunk2 = "Employees receive 24 paid leaves annually";
const chunk2Hash = hashChunk(chunk2);
// a7b82c1f9d3e... (SAME!)

const chunk3 = "Employees receive 30 paid leaves annually";
const chunk3Hash = hashChunk(chunk3);
// 5e9d47c3b2a1... (DIFFERENT!)
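
One practical wrinkle: a hash changes on any byte difference, so a re-exported PDF with an extra space or line break would look like a changed chunk. A minimal sketch of hashing after whitespace normalization is shown below. The names `normalizeForHash` and `hashNormalizedChunk` are assumptions, and whether whitespace should be ignored at all is a design choice that depends on the documents.

```typescript
import { createHash } from 'node:crypto';

/** Collapse runs of whitespace so cosmetic changes do not alter the hash. */
export function normalizeForHash(text: string): string {
  return text.replace(/\s+/g, ' ').trim();
}

/** SHA-256 of the normalized chunk text, as lowercase hex. */
export function hashNormalizedChunk(text: string): string {
  return createHash('sha256')
    .update(normalizeForHash(text), 'utf8')
    .digest('hex');
}

// Usage: whitespace differences hash the same, wording changes do not
const original = hashNormalizedChunk('Employees receive 24 paid leaves annually');
const reflowed = hashNormalizedChunk('Employees  receive 24 paid\nleaves annually ');
const reworded = hashNormalizedChunk('Employees receive 30 paid leaves annually');

console.log(original === reflowed);
console.log(original === reworded);
```

If this variant is used, the same normalization has to be applied everywhere a chunk hash is computed, otherwise stored and fresh hashes will never match.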

Step 12: Deduplication

// Step 12: Detect and skip duplicates

import db from '../config/database';
import logger from '../utils/logger';

/**
 * Deduplication at multiple levels:
 * 
 * Level 1: Document level
 *          Have we processed this exact file?
 *          (Already done with file hash)
 * 
 * Level 2: Chunk level
 *          Have we processed this exact chunk?
 * 
 * Level 3: Semantic level
 *          Is this chunk similar to existing chunk?
 *          (Advanced, uses embeddings)
 */

export async function deduplicateChunks(
  chunks: Chunk[],
  tenantId: string,
  documentId: string
): Promise<{ newChunks: Chunk[]; skipped: number }> {
  const newChunks: Chunk[] = [];
  let skipped = 0;

  for (const chunk of chunks) {
    const chunkHash = hashChunk(chunk.text);

    // Check if chunk exists
    const existingChunk = await db.oneOrNone(
      `
      SELECT id FROM chunks 
      WHERE chunk_hash = $1 AND tenant_id = $2
      `,
      [chunkHash, tenantId]
    );

    if (existingChunk) {
      logger.debug('Duplicate chunk skipped', { chunkHash: chunkHash.substring(0, 8) });
      skipped++;
      continue;
    }

    newChunks.push(chunk);
  }

  logger.info('Deduplication complete', {
    total: chunks.length,
    new: newChunks.length,
    skipped,
    percentage: chunks.length > 0 ? Math.round((skipped / chunks.length) * 100) : 0
  });

  return { newChunks, skipped };
}
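
The database check above only finds chunks that were stored by an earlier upload. Duplicates inside the same upload (a disclaimer repeated in every section, for example) are not in the database yet, so they all pass. A minimal sketch of removing those first, before any database round trip, is shown below. The `TextChunk` shape and `dedupeWithinBatch` name are assumptions for illustration.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical minimal chunk shape for this sketch
interface TextChunk {
  text: string;
  index: number;
}

/** Drop chunks whose text already appeared earlier in the same batch. */
export function dedupeWithinBatch(
  chunks: TextChunk[]
): { unique: TextChunk[]; skipped: number } {
  const seen: Record<string, boolean> = Object.create(null);
  const unique: TextChunk[] = [];
  let skipped = 0;

  for (const chunk of chunks) {
    const hash = createHash('sha256').update(chunk.text).digest('hex');
    if (seen[hash]) {
      skipped++;          // same text seen earlier in this batch
      continue;
    }
    seen[hash] = true;
    unique.push(chunk);   // first occurrence wins
  }

  return { unique, skipped };
}

// Usage
const batch = dedupeWithinBatch([
  { text: 'Manager approval required', index: 0 },
  { text: 'Casual leave guidelines', index: 1 },
  { text: 'Manager approval required', index: 2 }
]);
console.log(batch.unique.length, batch.skipped);
```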

👦 Nephew: What if dedup changes because we change chunk size?

👨‍🦳 Uncle: Good question! That's versioning.

Step 13: Versioning

// Step 13: Handle document versions

import db from '../config/database';
import logger from '../utils/logger';

export async function handleDocumentVersion(
  tenantId: string,
  fileName: string,
  newFileHash: string
) {
  // Check if this document (by name) was processed before.
  // oneOrNone returns null for a first upload; db.one would throw on zero rows.
  const oldVersion = await db.oneOrNone(
    `
    SELECT id, version FROM documents 
    WHERE file_name = $1 AND tenant_id = $2
    ORDER BY version DESC LIMIT 1
    `,
    [fileName, tenantId]
  );

  let version = 1;
  if (oldVersion) {
    // Old version exists
    // Mark it as inactive
    await db.none(
      `
      UPDATE documents 
      SET active = false 
      WHERE id = $1
      `,
      [oldVersion.id]
    );

    version = oldVersion.version + 1;
  }

  // Create new version
  const newDoc = await db.one(
    `
    INSERT INTO documents 
    (tenant_id, file_name, file_hash, version, active)
    VALUES ($1, $2, $3, $4, true)
    RETURNING id
    `,
    [tenantId, fileName, newFileHash, version]
  );

  logger.info('Document versioned', {
    fileName,
    version,
    active: true
  });

  return {
    documentId: newDoc.id,
    version,
    isNewVersion: Boolean(oldVersion)
  };
}

/**
 * Database schema
 */
export const versioningSchema = `
CREATE TABLE documents (
  id UUID PRIMARY KEY,
  tenant_id UUID NOT NULL,
  file_name VARCHAR(500),
  file_hash VARCHAR(64),
  version INTEGER DEFAULT 1,
  active BOOLEAN DEFAULT true,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- PostgreSQL creates indexes with a separate statement
CREATE INDEX idx_active ON documents (tenant_id, active);
`;

👨‍🦳 Uncle: Important: During retrieval, we search ONLY active documents.

-- Search only from active documents
SELECT chunk_text
FROM chunks
WHERE tenant_id = $1
  AND document_id IN (
    SELECT id FROM documents 
    WHERE active = true
  )
ORDER BY embedding <=> $2  -- $2 is the query embedding vector
LIMIT 5;

PHASE 1, STEP 14-15: Incremental Ingestion & Cost Optimization

The Big Picture: Before vs After

👨‍🦳 Uncle: Let me show you the difference between naive and smart systems.

NAIVE SYSTEM
───────────────────────────────────────

File v1 uploaded
  ├─ Parse
  ├─ Chunk (1000 chunks)
  ├─ Embed ALL 1000  ← Cost: ₹50 per 1000 = ₹50
  └─ Store

File v2 uploaded (only 1 chunk changed)
  ├─ Parse
  ├─ Chunk (1000 chunks)
  ├─ Embed ALL 1000  ← Cost: ₹50 again!
  └─ Store

Total cost: ₹100
Total waste: ₹50 (50% wasted)

───────────────────────────────────────

SMART SYSTEM
───────────────────────────────────────

File v1 uploaded
  ├─ Parse
  ├─ Chunk (1000 chunks)
  ├─ Embed ALL 1000  ← Cost: ₹50
  └─ Store with chunk hashes

File v2 uploaded (only 1 chunk changed)
  ├─ Parse
  ├─ Chunk (1000 chunks)
  ├─ Compare hashes → 999 same, 1 changed
  ├─ Embed ONLY 1 chunk  ← Cost: ₹0.05 only!
  └─ Store (update index for the 1 changed chunk)

Total cost: ₹50.05
Total waste: ₹0 (v2 embedding cost cut by 99.9%)
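
The arithmetic in the comparison above can be checked with a few lines. This is only a sketch: the ₹50 per 1000 chunks rate is the illustrative figure from the diagram, and `embeddingCostPaise` and `toRupees` are hypothetical helpers. Working in integer paise avoids floating-point rounding surprises.

```typescript
// Rate from the diagram: ₹50 per 1000 chunks = 5 paise per chunk.
const PAISE_PER_CHUNK = 5;

// Cost of embedding a given number of chunks, in paise.
function embeddingCostPaise(chunksEmbedded: number): number {
  return chunksEmbedded * PAISE_PER_CHUNK;
}

// Format paise as a rupee string with two decimals.
function toRupees(paise: number): string {
  return `₹${(paise / 100).toFixed(2)}`;
}

const totalChunks = 1000;
const changedChunks = 1;

// Naive: embed everything on both uploads.
const naivePaise =
  embeddingCostPaise(totalChunks) + embeddingCostPaise(totalChunks);

// Smart: embed everything once, then only the changed chunk.
const smartPaise =
  embeddingCostPaise(totalChunks) + embeddingCostPaise(changedChunks);

console.log('Naive:', toRupees(naivePaise));
console.log('Smart:', toRupees(smartPaise));
console.log('Saved:', toRupees(naivePaise - smartPaise));
```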

Step 14: Incremental Ingestion

// Step 14: Incremental ingestion (smart re-embedding)

/**
 * The key insight:
 * 
 * Don't re-embed unchanged chunks.
 * Reuse old embeddings.
 * 
 * This is how production systems save money.
 */

export async function incrementalIngestion(
  tenantId: string,
  documentId: string,
  newChunks: Chunk[],
  documentVersion: number
) {
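  // Get the previous version of this document (null on first upload).
  // oneOrNone resolves with null when no row matches; db.one would reject.
  const oldDoc = await db.oneOrNone(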
    `
    SELECT id FROM documents 
    WHERE file_name = (SELECT file_name FROM documents WHERE id = $1)
      AND version = $2 - 1
      AND tenant_id = $3
    `,
    [documentId, documentVersion, tenantId]
  );

  let embeddingsNeeded = newChunks;
  let embeddingsReused = 0;

  if (oldDoc) {
    // Get old chunks
    const oldChunks = await db.manyOrNone(
      `
      SELECT id, chunk_hash, embedding FROM chunks 
      WHERE document_id = $1
      `,
      [oldDoc.id]
    );

    // Index old chunks by hash so each lookup is O(1)
    const oldByHash = new Map(
      oldChunks.map(c => [c.chunk_hash, c] as const)
    );

    // Compare: which chunks are new/changed?
    embeddingsNeeded = newChunks.filter(newChunk => {
      const oldChunk = oldByHash.get(hashChunk(newChunk.text));

      if (oldChunk) {
        // Chunk unchanged: reuse embedding
        embeddingsReused++;

        // Copy old embedding to new chunk
        newChunk.embedding = oldChunk.embedding;

        return false; // Don't need to re-embed
      }

      return true; // Need to embed
    });
  }

  logger.info('Incremental ingestion', {
    totalChunks: newChunks.length,
    newChunks: embeddingsNeeded.length,
    reused: embeddingsReused,
    efficiency: Math.round((embeddingsReused / newChunks.length) * 100) + '%'
  });

  return {
    chunksToEmbed: embeddingsNeeded,
    chunksCached: embeddingsReused,
    costSaving: embeddingsReused * 0.05  // ₹ saved at ₹50 per 1000 chunks
  };
}
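
The hash comparison can also be tried in isolation, without a database. The sketch below is an assumption-laden stand-in: `sha256` plays the role of the `hashChunk` helper, and `StoredChunk` and `planEmbeddings` are hypothetical names introduced only for this example.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical shape of a chunk stored for the previous version.
interface StoredChunk {
  chunkHash: string;
  embedding: number[];
}

// Stand-in for hashChunk: SHA-256 of the chunk text, hex encoded.
function sha256(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// Split the new chunk texts into "must embed" and "can reuse".
function planEmbeddings(oldChunks: StoredChunk[], newTexts: string[]) {
  const oldByHash = new Map<string, StoredChunk>();
  for (const c of oldChunks) oldByHash.set(c.chunkHash, c);

  const toEmbed: string[] = [];
  const reused: { text: string; embedding: number[] }[] = [];

  for (const text of newTexts) {
    const match = oldByHash.get(sha256(text));
    if (match) {
      reused.push({ text, embedding: match.embedding });
    } else {
      toEmbed.push(text);
    }
  }

  return { toEmbed, reused };
}

const v1Texts = [
  'Leave policy: 20 days.',
  'Notice period: 30 days.',
  'Dress code: formal.'
];
// Fake one-number embeddings so reuse is easy to spot.
const stored: StoredChunk[] = v1Texts.map((text, i) => ({
  chunkHash: sha256(text),
  embedding: [i]
}));

const v2Texts = [
  'Leave policy: 25 days.', // the only changed chunk
  'Notice period: 30 days.',
  'Dress code: formal.'
];

const plan = planEmbeddings(stored, v2Texts);
console.log(plan.toEmbed);
```

One caveat worth keeping in mind: reuse only works when chunk text is byte-for-byte identical. If chunks are cut at fixed offsets, inserting a sentence early in the document shifts every later boundary and changes every later hash, which is one more reason to chunk on sentence or paragraph boundaries.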

Step 15: Cost Optimization Strategy

// Step 15: Track costs and optimize

export async function optimizeIngestionCosts() {
  // Strategy 1: Batch embeddings
  // Don't embed one by one. Batch them.
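  // Benefit: 1000 embeddings in 1 API call = fewer round trips and
  // rate-limit hits than 1000 calls (on token-billed APIs the
  // per-token price is usually the same either way)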

  // Strategy 2: Use cheaper models where quality allows
  // (this applies to the optional LLM classification step)
  // Use Gemini for obvious policy text
  // Use Claude only for complex analysis

  // Strategy 3: Cache embeddings aggressively
  // If same chunk text appears in document 1 and 2
  // Don't embed twice. Reuse.

  // Strategy 4: Incremental updates
  // Don't re-embed when document slightly changes
  // Only embed changed chunks

  /**
   * Worked example: HR uploads 3 policies
   * 
   * Policy 1: 100 chunks
   * Policy 2: 100 chunks
   * Policy 3: 100 chunks (99 identical to Policy 1)
   * 
   * Naive: Embed 300 chunks = ₹15
   * Smart: Embed 201 chunks = ₹10.05 (saved ₹4.95)
   * 
   * At scale (1000 policies of 100 chunks each,
   * assuming about 80% of chunks repeat):
   * Naive: ₹5,000/month
   * Smart: ₹1,000/month (saved ₹4,000!)
   */

  const costBreakdown = {
    parsing: 0.50,        // ₹0.50 per document
    embedding: 50,        // ₹50 per 1000 chunks
    vectorStorage: 10,    // ₹10 per GB/month
    llmClassification: 1, // ₹1 per document (optional)
    monthlyInfrastructure: 1000
  };

  logger.info('Cost breakdown', costBreakdown);

  return costBreakdown;
}
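
Strategy 1 (batching) is easy to sketch. In the example below, `EmbedFn` is a hypothetical signature for "one call embeds many texts"; real providers differ in their maximum batch size, so the default of 100 is an assumption to adjust.

```typescript
// Split items into consecutive batches of at most `size` elements.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error('batch size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical embedding client: takes many texts, returns one vector each.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Embed all texts with one request per batch, preserving input order.
async function embedInBatches(
  texts: string[],
  embed: EmbedFn,
  batchSize = 100
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(texts, batchSize)) {
    vectors.push(...(await embed(batch)));
  }
  return vectors;
}

const sampleChunks = Array.from({ length: 250 }, (_, i) => `chunk ${i}`);
const sampleBatches = toBatches(sampleChunks, 100);

// 250 chunks become 3 requests instead of 250.
console.log(sampleBatches.map(b => b.length));
```

Batches are sent one after another here to stay friendly to rate limits; sending them in parallel is possible but trades simplicity for throughput.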

Complete Production Ingestion Flow

👨‍🦳 Uncle: Let me show you the complete flow as code.

// Complete ingestion pipeline

import fs from 'fs'; // used by selectParser below
import db from '../config/database';
import logger from '../utils/logger';

/**
 * Phase 1: Complete Document Ingestion Pipeline
 * 
 * This is what production systems do before embeddings.
 * 
 * Cost tracking included.
 */
export async function completeIngestionPipeline(
  filePath: string,
  tenantId: string
): Promise<{
  documentId: string;
  chunksCreated: number;
  chunksEmbedded: number;
  costEstimate: number;
  duration: number;
}> {
  const startTime = Date.now();
  const metrics = {
    documentId: '',
    chunksCreated: 0,
    chunksEmbedded: 0,
    costEstimate: 0,
    duration: 0
  };

  try {
    // STEP 1-2: Upload & File Hashing
    logger.info('Step 1-2: File upload and hashing...');
    const fileHash = await handleFileUpload(filePath, tenantId);

    if (fileHash.alreadyExists) {
      logger.warn('File already processed. Skipping.');
      return {
        ...metrics,
        duration: Date.now() - startTime
      };
    }

    // STEP 3: Choose parser
    logger.info('Step 3: Parsing PDF...');
    const parserType = selectParser(filePath); // pdf-parse, unstructured, etc.
    const parsedElements = await parsePDF(filePath, parserType);

    // STEP 4-5: Extract & Clean
    logger.info('Step 4-5: Extracting and cleaning text...');
    let cleanText = extractText(parsedElements);
    cleanText = cleanExtractedText(cleanText);

    // STEP 6: Metadata
    logger.info('Step 6: Extracting metadata...');
    const metadata = extractMetadataFromStructure(filePath, parsedElements);

    // STEP 7-10: Chunking
    logger.info('Step 7-10: Creating chunks...');
    const chunks = createSmartChunks(cleanText, 1000, 200);
    metrics.chunksCreated = chunks.length;

    // STEP 11-12: Deduplication
    logger.info('Step 11-12: Deduplication...');
    const { newChunks, skipped } = await deduplicateChunks(
      chunks,
      tenantId,
      fileHash.fileHash
    );

    // STEP 13: Versioning
    logger.info('Step 13: Handling versions...');
    const versionInfo = await handleDocumentVersion(
      tenantId,
      fileHash.fileName,
      fileHash.fileHash
    );
    metrics.documentId = versionInfo.documentId;

    // STEP 14-15: Incremental & Optimize
    logger.info('Step 14-15: Incremental ingestion...');
    const { chunksToEmbed } = await incrementalIngestion(
      tenantId,
      versionInfo.documentId,
      newChunks,
      versionInfo.version
    );
    metrics.chunksEmbedded = chunksToEmbed.length;

    // Calculate cost
    metrics.costEstimate = (chunksToEmbed.length / 1000) * 50; // ₹50 per 1000

    // Save chunks to database. Chunks whose embedding was reused in
    // Step 14 carry it along; the rest stay NULL until they are embedded.
    for (const chunk of newChunks) {
      const chunkHash = hashChunk(chunk.text);

      await db.none(
        `
        INSERT INTO chunks 
        (document_id, tenant_id, chunk_text, chunk_index, chunk_hash, metadata, embedding)
        VALUES ($1, $2, $3, $4, $5, $6, $7)
        `,
        [
          versionInfo.documentId,
          tenantId,
          chunk.text,
          chunk.index,
          chunkHash,
          JSON.stringify(metadata),
          chunk.embedding ?? null
        ]
      );
    }

    logger.info('✓ Phase 1 complete!', metrics);

    // Phase 1 done: chunks are stored and ready for the embedding step
    return {
      ...metrics,
      duration: Date.now() - startTime
    };

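  } catch (error: unknown) {
    // A thrown value is not guaranteed to be an Error instance
    const message = error instanceof Error ? error.message : String(error);
    logger.error('Ingestion pipeline failed', { error: message });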
    throw error;
  }
}

/**
 * Helper: Choose the right parser
 */
function selectParser(filePath: string) {
  const extension = filePath.split('.').pop()?.toLowerCase();

  if (extension === 'pdf') {
    // Use document size to decide
    const size = fs.statSync(filePath).size;

    if (size < 1000000) {
      // < 1MB: simple PDF
      return 'pdf-parse';
    } else if (size < 10000000) {
      // 1-10MB: normal PDF
      return 'unstructured';
    } else {
      // > 10MB: complex PDF
      return 'llamaparse';
    }
  }

  if (extension === 'docx') {
    return 'docx-parse';
  }

  return 'pdf-parse'; // default
}
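
A variant that may be easier to unit-test separates the decision from the filesystem call: pass the size in, get the parser name out. This is a sketch under assumptions: `chooseParser` and `ParserName` are hypothetical names, the thresholds are the ones used above, and rejecting unknown extensions (instead of defaulting to pdf-parse) is a design choice of this example.

```typescript
type ParserName = 'pdf-parse' | 'unstructured' | 'llamaparse' | 'docx-parse';

const MB = 1000000;

// Pure function: no filesystem access, so it is trivial to test.
function chooseParser(fileName: string, sizeBytes: number): ParserName {
  const dot = fileName.lastIndexOf('.');
  const extension = dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase();

  switch (extension) {
    case 'pdf':
      if (sizeBytes < 1 * MB) return 'pdf-parse';     // < 1MB: simple PDF
      if (sizeBytes < 10 * MB) return 'unstructured'; // 1-10MB: normal PDF
      return 'llamaparse';                            // >= 10MB: complex PDF
    case 'docx':
      return 'docx-parse';
    default:
      // Fail loudly rather than feed a non-PDF to a PDF parser.
      throw new Error(`Unsupported file type: "${extension || fileName}"`);
  }
}

console.log(chooseParser('handbook.pdf', 4 * MB));
```

File size is only a rough proxy for complexity: a small scanned PDF can still need a heavier parser, so it is worth checking the choice against your actual documents.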

The Complete Journey: From Upload to Ready-for-Embedding

👨‍🦳 Uncle: Let me show you the entire flow visually.

USER UPLOADS PDF
       ↓
   [STEP 1-2]
   File Hash Check
       ↓
   Already exists? ──YES──→ STOP (save cost)
       │
       NO
       ↓
   [STEP 3]
   Parse PDF ──→ Choose parser
                 ├─ pdf-parse (free)
                 ├─ Unstructured (cheap)
                 └─ LlamaParse (expensive)
       ↓
   [STEP 4-5]
   Extract & Clean ──→ Remove garbage
                       (headers, footers, etc.)
       ↓
   [STEP 6]
   Extract Metadata ──→ Add context
                        (department, section, etc.)
       ↓
   [STEP 7-10]
   Smart Chunking ──→ 1000-1500 tokens
                      with overlap & boundaries
       ↓
   [STEP 11-12]
   Chunk Hashing ──→ Detect duplicates
       ↓
   Skip duplicates? ──YES──→ Don't embed
       │
       NO
       ↓
   [STEP 13]
   Versioning ──→ Mark old version inactive
                  Keep new version active
       ↓
   [STEP 14-15]
   Incremental ──→ Only embed changed chunks
   Ingestion       Reuse cached embeddings
       ↓
   ✓ READY FOR EMBEDDINGS

Interview-Level Summary

👦 Nephew: If someone asks me in an interview about RAG ingestion?

👨‍🦳 Uncle: Say this:

"Phase 1 Document Ingestion has 15 critical steps before embeddings:

File Upload & Hashing - Hash file content (not filename) to detect duplicates
PDF Parsing - Choose tool based on document type (pdf-parse for simple, Unstructured for normal, LlamaParse for complex)
Text Extraction - Get raw text from parser output
Text Cleaning - Remove headers, footers, watermarks, junk
Metadata Extraction - Extract department, section, version from structure or LLM
Chunking - Split into 1000-1500 tokens with smart boundaries (sentence/paragraph)
Chunk Size - Balance between context (large) and precision (small)
Chunk Boundaries - Don't break mid-sentence; respect document structure
Overlap - 200 tokens overlap to maintain context continuity
Chunk Hashing - Hash each chunk to detect changes
Deduplication - Skip chunks whose hash already exists (exact-match duplicates)
Versioning - Handle document updates (v1 inactive, v2 active)
Incremental Ingestion - Only re-embed changed chunks, reuse old embeddings
Cost Optimization - Track costs, batch requests, use cheaper models
Index Update - Store chunks and prepare for vector embedding

The key insight: Avoid embedding unnecessary content. At the ₹50 per 1000 chunks rate used here, avoiding 100K unnecessary embeddings a month saves about ₹5,000 per month."

Practical Checklist: Before You Deploy

Pre-Deployment Questions

const preDeploymentChecklist = {
  // Parsing
  "Have you tested parsing with your actual documents?": false,
  "Do you handle tables, images, multi-column layouts?": false,
  "Did you choose the right parser for your documents?": false,

  // Deduplication
  "Are you hashing file contents, not filenames?": false,
  "Do you check for existing files before parsing?": false,
  "Do you check for duplicate chunks before embedding?": false,

  // Chunking
  "Is your chunk size 800-1500 tokens?": false,
  "Do chunks respect sentence/paragraph boundaries?": false,
  "Do you have 100-200 token overlap?": false,

  // Metadata
  "Do you extract metadata from document structure?": false,
  "Is metadata used during retrieval?": false,

  // Versioning
  "Do you handle document updates (v1, v2, v3)?": false,
  "Do old versions get marked inactive?": false,

  // Cost
  "Are you tracking embedding costs per document?": false,
  "Do you reuse embeddings for unchanged chunks?": false,
  "Are you batching embedding requests?": false,

  // Production
  "Do you have error handling and retries?": false,
  "Are you logging every step?": false,
  "Can you rollback a bad ingestion?": false
};
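
For the "error handling and retries" item, one common pattern is retry with exponential backoff. The sketch below is illustrative only: `withRetry`, `backoffDelayMs` and the default delays are assumptions, not part of the pipeline above.

```typescript
// Delay before the next try for a 0-based attempt number:
// baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

const sleep = (ms: number) =>
  new Promise<void>(resolve => setTimeout(resolve, ms));

// Run `task` up to maxAttempts times, waiting longer after each failure.
// Rethrows the last error if every attempt fails.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3
): Promise<T> {
  let lastError: unknown;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // No point sleeping after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }

  throw lastError;
}

console.log([0, 1, 2, 3, 4].map(n => backoffDelayMs(n)));
```

In the pipeline this could wrap a flaky step, for example `await withRetry(() => parsePDF(filePath, parserType))`. Retrying only makes sense for transient failures such as timeouts; a corrupt file will fail the same way every time.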

Key Takeaways

👨‍🦳 Uncle: Remember:

1. Before embeddings, 15 steps matter
   - Skip even one, quality suffers
2. The right parser saves time
   - pdf-parse: Free, simple
   - Unstructured: Balanced
   - LlamaParse: Best quality
3. Chunking is everything
   - Size: 1000-1500 tokens
   - Overlap: 200 tokens
   - Boundaries: Sentence/paragraph
4. Deduplication saves money
   - Hash file content, not filename
   - Hash chunks, reuse embeddings
   - 50%+ cost savings possible
5. Metadata enables better retrieval
   - Extract from structure
   - Use during search
   - Improve accuracy
6. Production systems are incremental
   - Don't re-embed everything
   - Reuse cached embeddings
   - Only process what changed

👦 Nephew: So this is what "production-grade RAG" really means?

👨‍🦳 Uncle: Exactly. It's not just about embeddings and vector search. It's about doing the hard work BEFORE you search. That's what separates hobby projects from systems that work at scale.

Next Steps

You now understand Phase 1: Document Ingestion.

Next comes Phase 2: Retrieval & Ranking (vector search, reranking, query expansion)

Then Phase 3: Safety & Production (hallucination prevention, monitoring, cost tracking)

Ready to learn Phase 2?

👦 Nephew: Uncle, definitely! But first... can you show me a working example?

👨‍🦳 Uncle: That's next. We'll build a real ingestion pipeline together.

Reference Code Repository

# Complete implementation files:

1. fileHasher.ts        - File hashing, duplicate detection
2. pdfParsers.ts        - pdf-parse, Unstructured, LlamaParse
3. textCleaning.ts      - Remove junk, normalize whitespace
4. metadataExtractor.ts - Extract from structure, LLM classify
5. chunking.ts          - Sliding window, smart boundaries
6. deduplication.ts     - Chunk hashing, duplicate detection
7. versioning.ts        - Handle updates, version tracking
8. incrementalIngestion.ts - Reuse embeddings, smart updates
9. ingestionPipeline.ts - Complete flow orchestration
10. costOptimizer.ts    - Track costs, identify savings

Total: ~2000 lines of production-ready Node.js code

Remember: Less noise, more action. Phase 1 is the foundation.
Build it right.
written by -SurajK

DEV Community