The Complete Story: Why Most RAG Systems Fail Before They Start
The Story Begins: Why Your Upload Button Is Just The Beginning
👦 Nephew: Uncle! I finally built my RAG system. User uploads a PDF, system finds answers. Simple, right?
👨‍🦳 Uncle: (smiles knowingly) You uploaded a PDF and got an answer?
👦 Nephew: Yes! It works!
👨‍🦳 Uncle: Did you get the right answer?
👦 Nephew: Well... sometimes. Why?
👨‍🦳 Uncle: Because between "user uploads PDF" and "system creates embeddings", there are 15 critical steps. Skip even one, and your system fails silently. You get wrong answers and don't know why.
👦 Nephew: 15 steps? I just embedded the text!
👨‍🦳 Uncle: Exactly. That's the problem. Come, let me show you what production engineers actually do.
The 15 Steps of Phase 1: Document Ingestion
👨‍🦳 Uncle: Think of it like cooking biryani. You don't just dump rice and meat, right?
👦 Nephew: No, you prepare everything first!
👨‍🦳 Uncle: Exactly. Before you cook, you:
- Wash rice
- Soak rice
- Prepare meat
- Marinate meat
- Chop onions
- ... and many more steps
Only THEN do you cook.
The same goes for documents. Before embeddings, you must:
1. Document Upload (user action)
2. File Hashing (did we see this before?)
3. PDF Parsing (extract text)
4. Text Extraction (convert PDF → text)
5. Text Cleaning (remove junk)
6. Metadata Extraction (add context)
7. Chunking (split smartly)
8. Chunk Boundaries (don't break meaning)
9. Chunk Size (1000 tokens, not 100)
10. Overlap (context continuity)
11. Chunk Hashing (detect changes)
12. Deduplication (prevent duplicates)
13. Versioning (handle updates)
14. Incremental Ingestion (avoid re-embedding)
15. Cost Optimization (save money)
ONLY THEN:
↓
Embeddings
↓
Vector DB
👦 Nephew: That's a lot! Where do I start?
👨‍🦳 Uncle: With step 1. Let's go slow. The foundation must be solid.
PHASE 1, STEP 1-2: Document Upload & File Hashing
Why Does File Hashing Matter?
👦 Nephew: Why hash a file? Why not just upload it?
👨‍🦳 Uncle: Because humans are lazy. The HR person uploads the same PDF three times. Your system processes it three times. Three sets of embeddings created. Three times the cost.
👦 Nephew: So the hash prevents duplicates?
👨‍🦳 Uncle: Yes. But here's the trick: don't hash the filename.
👦 Nephew: Why not?
👨‍🦳 Uncle: Because someone could change the content and keep the same name. Look:
File: HR_Policy.pdf (Version 1)
Content: "30 days notice required"
Filename Hash: HR_Policy.pdf
File: HR_Policy.pdf (Version 2)
Content: "7 days notice required"
Filename Hash: HR_Policy.pdf (SAME!)
System thinks: Same file!
Reality: Different policies!
That's a disaster.
👦 Nephew: So we hash the content?
👨‍🦳 Uncle: Yes. The actual binary data.
HR_Policy.pdf (Version 1)
↓
[PDF binary bytes: 0xAA 0xBB 0xCC ...]
↓
SHA256
↓
A7B82C1F9D3E...
HR_Policy.pdf (Version 2)
↓
[PDF binary bytes: 0xAA 0xBB 0xDD ...] ← Different!
↓
SHA256
↓
3F9C47E1D2B0... ← Different hash!
System detects: New file, process it.
Step 1-2: Node.js Implementation
// src/ingestion/fileHasher.ts
import crypto from 'crypto';
import fs from 'fs';
import path from 'path';
import db from '../config/database';
import logger from '../utils/logger';
interface FileHashResult {
fileName: string;
fileHash: string;
alreadyExists: boolean;
fileSize: number;
}
/**
* Step 1-2: Upload file and check if already processed
*
* ⚠️ CRITICAL POINTS:
* 1. Hash FILE CONTENT, not filename
* 2. Same content → same hash (deterministic)
* 3. Different content → different hash (even if same filename)
* 4. Check database before processing
*/
export async function handleFileUpload(
filePath: string,
tenantId: string
): Promise<FileHashResult> {
try {
const fileName = path.basename(filePath);
const fileSize = fs.statSync(filePath).size;
// Step 1: Read file and create hash from content
logger.info('File upload started', { fileName, fileSize });
const fileContent = fs.readFileSync(filePath);
const fileHash = crypto
.createHash('sha256')
.update(fileContent)
.digest('hex');
logger.debug('File hashed', { fileName, fileHash, contentBytes: fileContent.length });
// Step 2: Check if this exact file was already processed
const existingFile = await db.oneOrNone(
`
SELECT id, created_at
FROM documents
WHERE file_hash = $1 AND tenant_id = $2
`,
[fileHash, tenantId]
);
if (existingFile) {
logger.warn('Duplicate file detected', {
fileName,
fileHash,
firstUploadedAt: existingFile.created_at
});
return {
fileName,
fileHash,
alreadyExists: true,
fileSize
};
}
// Step 3: File is new, save metadata
const documentRecord = await db.one(
`
INSERT INTO documents (tenant_id, file_name, file_hash, file_size, status)
VALUES ($1, $2, $3, $4, $5)
RETURNING id, created_at
`,
[tenantId, fileName, fileHash, fileSize, 'uploaded']
);
logger.info('New file registered', {
documentId: documentRecord.id,
fileName,
fileHash: fileHash.substring(0, 12) + '...'
});
return {
fileName,
fileHash,
alreadyExists: false,
fileSize
};
} catch (error: any) {
logger.error('File upload error', { error: error.message });
throw error;
}
}
/**
* Verify file integrity (optional but recommended)
* If file is corrupted, don't process it
*/
export function verifyFileIntegrity(filePath: string, expectedHash: string): boolean {
const fileContent = fs.readFileSync(filePath);
const actualHash = crypto
.createHash('sha256')
.update(fileContent)
.digest('hex');
return actualHash === expectedHash;
}
// Database schema for documents table
export const documentsTableSQL = `
CREATE TABLE IF NOT EXISTS documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id),
file_name VARCHAR(500) NOT NULL,
file_hash VARCHAR(64) NOT NULL, -- SHA256 produces 64 char hex
file_size BIGINT NOT NULL,
status VARCHAR(50) DEFAULT 'uploaded', -- uploaded, parsing, parsed, chunking, chunked, embedding, complete
error_message TEXT,
uploaded_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
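-- Unique constraint: same file can't be uploaded twice for same tenant
CONSTRAINT unique_file_per_tenant UNIQUE(tenant_id, file_hash)
);
-- PostgreSQL does not accept inline INDEX clauses; create indexes separately
CREATE INDEX IF NOT EXISTS idx_file_hash ON documents (file_hash);
CREATE INDEX IF NOT EXISTS idx_tenant_status ON documents (tenant_id, status);
`;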
👦 Nephew: So if someone uploads the same PDF twice, we detect it and skip?
👨‍🦳 Uncle: Exactly. And we save the embedding cost, which is the most expensive step.
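The upload handler reads the whole file into memory before hashing, which is fine for small policies but can hurt with very large PDFs. A minimal sketch of a streaming alternative follows, assuming Node's built-in `crypto` and `fs` modules; the function names `hashBytes` and `hashFileStream` are illustrative and not part of the original code.

```typescript
import { createHash } from 'node:crypto';
import { createReadStream } from 'node:fs';

/** SHA-256 of an in-memory buffer or string, as lowercase hex (64 characters). */
export function hashBytes(data: Buffer | string): string {
  return createHash('sha256').update(data).digest('hex');
}

/**
 * SHA-256 of a file's bytes, computed piece by piece so the whole
 * file never has to sit in memory at once.
 */
export function hashFileStream(filePath: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const hash = createHash('sha256');
    createReadStream(filePath)
      .on('data', (piece) => hash.update(piece))
      .on('error', reject)
      .on('end', () => resolve(hash.digest('hex')));
  });
}
```

Both functions produce the same digest for the same bytes, so `await hashFileStream(filePath)` could stand in for the `readFileSync` plus `createHash` pair without changing what is stored in `file_hash`.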
PHASE 1, STEP 3: PDF Parsing - Choosing the Right Tool
👦 Nephew: Now we have the PDF. How do we extract text?
👨‍🦳 Uncle: This is where the real decision happens. There are five major tools. Each has tradeoffs.
👦 Nephew: Five?! Which one should I use?
👨‍🦳 Uncle: Depends on your documents. Let me show you.
The PDF Parsing Landscape
Simple Text PDFs → pdf-parse (cheap, simple)
↓
Mixed content (text + tables) → PDFPlumber (better)
↓
Complex documents → Unstructured (production-grade)
↓
Advanced documents → LlamaParse (state-of-the-art)
(tables, images, OCR)
↓
Enterprise documents → Azure Document Intelligence
(forms, invoices, scans)
Tool Comparison
| Tool | Best For | Cost | Speed | Table Support | OCR | Metadata | Production Ready |
|---|---|---|---|---|---|---|---|
| pdf-parse | Simple text | ₹0 (free) | ⚡⚡⚡ Fast | ❌ No | ❌ | ❌ | ⚠️ Hobby |
| PDFPlumber | Text + tables | ₹0 | ⚡⚡ Medium | ✅ Basic | ❌ | ⚠️ Limited | ⚠️ Small |
| Unstructured | Normal docs | ₹50-200/mo | ⚡ Slow | ✅ Good | ✅ Basic | ✅ Good | ✅ Yes |
| LlamaParse | Complex docs | ₹100-500/mo | ⚡ Slow | ✅ Excellent | ✅ Advanced | ✅ Excellent | ✅ Yes |
| Azure Doc Int. | Enterprise | ₹500-2000/mo | ⚡ Medium | ✅ Perfect | ✅ Perfect | ✅ Perfect | ✅ Enterprise |
👨‍🦳 Uncle: Let me explain each.
Tool 1: pdf-parse (Free, Simple)
// Simple approach - good for learning, bad for production
const pdf = require('pdf-parse');
const fs = require('fs');
async function extractTextFromPDF(filePath) {
const dataBuffer = fs.readFileSync(filePath);
const data = await pdf(dataBuffer);
console.log(data.text); // Raw text
// Output: "Company Policy Leave Policy Employees receive..."
}
👨‍🦳 Uncle: Notice what we lost?
Original PDF:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
COMPANY POLICY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Leave Policy (HEADING)
Employees are entitled to 24 paid leaves annually.
(PARAGRAPH)
Leave Types:
- Annual Leave
- Casual Leave
(LIST)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
pdf-parse Output:
"COMPANY POLICY Leave Policy Employees are entitled to 24 paid leaves annually. Leave Types: Annual Leave Casual Leave"
LOST:
❌ Heading level
❌ List structure
❌ Paragraph breaks
❌ Section organization
❌ Tables (if any)
👦 Nephew: So we just get text soup?
👨‍🦳 Uncle: Yes. And when you embed soup, you get soup answers.
Tool 2: PDFPlumber (Better, but Python-Only)
# PDFPlumber - extracts tables better
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages, start=1):
        # Extract text
        text = page.extract_text()
        # Extract tables (if any)
        tables = page.extract_tables()
        print(f"Page {i}:")
        print(f"Text: {text}")
        print(f"Tables: {tables}")
👨‍🦳 Uncle: Better, but still loses structure. And it's Python, not Node.js.
Tool 3: Unstructured (Production Choice)
👨‍🦳 Uncle: This is what most companies use. It preserves structure.
// Using the hosted Unstructured API (Node.js friendly)
import axios from 'axios';
import FormData from 'form-data';
import fs from 'fs';
import path from 'path';
import logger from '../utils/logger';
/**
* Step 3: Parse PDF using Unstructured
*
* IMPORTANT: Unstructured preserves document structure
* Returns: Array of structured elements
*
* The endpoint takes a multipart/form-data upload (field name "files") and an
* "unstructured-api-key" header. Confirm both against the current API reference.
*/
export async function parsePDFWithUnstructured(filePath: string): Promise<any[]> {
try {
const form = new FormData();
form.append('files', fs.createReadStream(filePath), path.basename(filePath));
form.append('strategy', 'hi_res'); // High resolution parsing
form.append('coordinates', 'true'); // Preserve coordinates (useful for tables)
const response = await axios.post(
'https://api.unstructuredapp.io/general/v0/general',
form,
{
headers: {
'unstructured-api-key': process.env.UNSTRUCTURED_API_KEY ?? '',
accept: 'application/json',
...form.getHeaders()
}
}
);
// The response body is the array of elements itself
const elements: any[] = response.data;
// Output: Structured elements
// [
// { type: "Title", text: "Leave Policy", metadata: {...} },
// { type: "NarrativeText", text: "Employees are entitled...", metadata: {...} },
// { type: "ListItem", text: "Manager approval required", metadata: {...} }
// ]
logger.info('PDF parsed with structure preserved', {
elementCount: elements.length,
types: [...new Set(elements.map((e: any) => e.type))]
});
return elements;
} catch (error: any) {
logger.error('Unstructured parsing failed', { error: error.message });
throw error;
}
}
👦 Nephew: So it preserves structure?
👨‍🦳 Uncle: Yes. Look at the difference:
Unstructured Output (simplified):
[
{
type: "Title",
text: "Leave Policy",
metadata: {
page_number: 1
}
},
{
type: "NarrativeText",
text: "Employees receive 24 leaves",
metadata: {
page_number: 1
}
},
{
type: "ListItem",
text: "Annual Leave",
metadata: {
page_number: 2
}
},
{
type: "ListItem",
text: "Casual Leave",
metadata: {
page_number: 2
}
}
]
Now we know:
✅ What is a title
✅ What is body text
✅ What is a list
✅ Page numbers
✅ Sections (each Title element starts one)
Tool 4: LlamaParse (State of the Art)
👨‍🦳 Uncle: For really complex documents, use LlamaParse.
// LlamaParse - best for complex PDFs
import axios from 'axios';
import FormData from 'form-data';
import fs from 'fs';
import logger from '../utils/logger';
/**
* Parse PDF with LlamaParse. Best for:
* - Multi-column layouts
* - Tables with merged cells
* - Images with text
* - Scanned documents (OCR)
* - Footnotes and annotations
*/
export async function parsePDFWithLlamaParse(filePath: string): Promise<any> {
try {
// Step 1: Upload file
const formData = new FormData();
formData.append('file', fs.createReadStream(filePath));
formData.append('parsing_instruction', `
Extract all content including:
- Tables with proper structure
- Column layouts
- Images with OCR
- Headings and sections
- Metadata like page numbers
`);
const uploadResponse = await axios.post(
'https://api.llamaindex.ai/api/parsing/upload_file',
formData,
{
headers: {
'Authorization': `Bearer ${process.env.LLAMAPARSE_API_KEY}`,
...formData.getHeaders()
}
}
);
const jobId = uploadResponse.data.id;
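// Step 2: Poll for results
let result = null;
for (let i = 0; i < 60; i++) {
const statusResponse = await axios.get(
`https://api.llamaindex.ai/api/parsing/job/${jobId}/result/markdown`,
{
headers: {
'Authorization': `Bearer ${process.env.LLAMAPARSE_API_KEY}`
},
// axios rejects on non-2xx responses by default, which would abort the
// loop while the job is still pending; accept any status and check it below
validateStatus: () => true
}
);
if (statusResponse.status === 200) {
result = statusResponse.data;
break;
}
// Wait 2 seconds before retry
await new Promise(resolve => setTimeout(resolve, 2000));
}
if (result === null) {
// 60 attempts x 2 seconds without a result: fail loudly instead of returning null
throw new Error(`LlamaParse job ${jobId} did not finish within 2 minutes`);
}
logger.info('LlamaParse completed', { jobId });
return {
markdown: result,
parsedAt: new Date()
};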
} catch (error: any) {
logger.error('LlamaParse failed', { error: error.message });
throw error;
}
}
👦 Nephew: When do I use LlamaParse vs Unstructured?
👨‍🦳 Uncle: Simple rule:
Document type?
├─ Simple text policies
│ └─→ pdf-parse (free)
│
├─ Text + basic tables
│ └─→ Unstructured (cheap, good)
│
├─ Complex tables, multi-column
│ └─→ LlamaParse (excellent)
│
└─ Enterprise documents, forms, invoices
└─→ Azure Document Intelligence (best)
Tool 5: Azure Document Intelligence (Enterprise)
// Azure Document Intelligence - for enterprise documents
import { DocumentAnalysisClient, AzureKeyCredential } from "@azure/ai-form-recognizer";
import fs from 'fs';
import logger from '../utils/logger';
/**
* Parse with Azure. Best for:
* - Invoices
* - Forms
* - Bank documents
* - Scanned PDFs with OCR
*/
export async function parseWithAzureDocumentIntelligence(filePath: string) {
try {
const endpoint = process.env.AZURE_FORM_RECOGNIZER_ENDPOINT;
const apiKey = process.env.AZURE_FORM_RECOGNIZER_KEY;
if (!endpoint || !apiKey) {
throw new Error('Azure Document Intelligence endpoint and key must be set');
}
const client = new DocumentAnalysisClient(endpoint, new AzureKeyCredential(apiKey));
const fileContent = fs.readFileSync(filePath);
// Choose model based on document type
const poller = await client.beginAnalyzeDocument(
"prebuilt-document", // Or: prebuilt-invoice, prebuilt-receipt, etc.
fileContent
);
const result = await poller.pollUntilDone();
// Extract structure
const output = {
text: result.content,
tables: result.tables?.map(table => ({
rows: table.rowCount,
columns: table.columnCount,
cells: table.cells
})),
// Key-value pairs found by prebuilt-document (each pair carries its own confidence)
keyValuePairs: result.keyValuePairs,
// Typed fields with per-field confidence, returned by models such as prebuilt-invoice
forms: result.documents?.[0]?.fields
};
logger.info('Azure parsing complete', {
pages: result.pages?.length,
tables: result.tables?.length,
keyValuePairs: output.keyValuePairs?.length
});
return output;
} catch (error: any) {
logger.error('Azure parsing failed', { error: error.message });
throw error;
}
}
👦 Nephew: These all cost money?
👨‍🦳 Uncle: Yes, except pdf-parse. And here's the secret: you should still use pdf-parse for simple documents!
👦 Nephew: Why?
👨‍🦳 Uncle: Cost. If 80% of your documents are simple policies, use pdf-parse. Use expensive parsers only for the 20% that need them.
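The routing rule can be sketched as a small function. This is a minimal sketch, assuming the document's traits are already known (for example from a quick pre-scan or from the upload form); the `DocumentTraits` shape and the name `chooseParser` are hypothetical and only mirror the decision tree above.

```typescript
type ParserName =
  | 'pdf-parse'
  | 'unstructured'
  | 'llamaparse'
  | 'azure-document-intelligence';

// Hypothetical description of a document, filled in before parsing.
interface DocumentTraits {
  hasTables: boolean;        // any tables at all
  hasComplexLayout: boolean; // multi-column pages, merged table cells
  isFormOrInvoice: boolean;  // enterprise forms, invoices
}

/** Pick the cheapest parser that the decision tree considers good enough. */
export function chooseParser(traits: DocumentTraits): ParserName {
  if (traits.isFormOrInvoice) return 'azure-document-intelligence';
  if (traits.hasComplexLayout) return 'llamaparse';
  if (traits.hasTables) return 'unstructured';
  return 'pdf-parse'; // simple text policies stay on the free parser
}
```

Checking the most expensive case first keeps the function short, and any document with no special traits falls through to the free parser, which is the cost argument Uncle makes.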
PHASE 1, STEP 4-5: Text Extraction & Cleaning
The Garbage Problem
👨‍🦳 Uncle: After parsing, you get garbage.
Raw Extracted Text:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Company Name Page 1
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[CONFIDENTIAL]
LEAVE POLICY
[Document Version: 2.3]
Employees are entitled to 24 paid leaves annually.
These leaves must be approved by:
1. Direct Manager
2. HR Department
[CONFIDENTIAL]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Company Name Page 2
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
👦 Nephew: What's wrong with this?
👨‍🦳 Uncle: When "[CONFIDENTIAL]" is embedded with every chunk, every chunk looks confidential to the retriever. When someone asks "Is this policy confidential?", the answer is always YES!
// Step 4-5: Clean the text
import logger from '../utils/logger';
/**
* Step 4-5: Extract and Clean text
*
* Remove:
* - Page headers/footers
* - Repeated watermarks
* - Extra whitespace
* - Page numbers
* - Confidential markings (if generic)
*/
export function cleanExtractedText(rawText: string): string {
let cleaned = rawText;
// Remove page headers (appears at top of every page)
cleaned = cleaned.replace(
/^Company Name\s+Page \d+\s*$/gm,
''
);
// Remove confidential markings (if generic)
cleaned = cleaned.replace(
/\[CONFIDENTIAL\]\n?/g,
''
);
// Remove document version
cleaned = cleaned.replace(
/\[Document Version: [\d.]+\]/g,
''
);
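// Remove page numbers alone on a line
cleaned = cleaned.replace(
/^\s*Page \d+\s*$/gm,
''
);
// Remove separator rules (lines made only of box-drawing, dash, or underscore characters)
cleaned = cleaned.replace(
/^[\u2500-\u257F=_-]{3,}\s*$/gm,
''
);
// Normalize whitespace (multiple spaces/newlines → single space)
cleaned = cleaned.replace(/\s+/g, ' ');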
// Remove leading/trailing whitespace
cleaned = cleaned.trim();
logger.debug('Text cleaned', {
originalLength: rawText.length,
cleanedLength: cleaned.length,
removed: rawText.length - cleaned.length
});
return cleaned;
}
// Example
const raw = `
━━━━━━━━━━━━━━━━━━━━━
Company Name Page 1
━━━━━━━━━━━━━━━━━━━━━
[CONFIDENTIAL]
LEAVE POLICY
Employees get 24 leaves.
[CONFIDENTIAL]
━━━━━━━━━━━━━━━━━━━━━
Company Name Page 2
━━━━━━━━━━━━━━━━━━━━━
`;
const clean = cleanExtractedText(raw);
// Output: "LEAVE POLICY Employees get 24 leaves."
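Hard-coded patterns such as "Company Name Page N" only work for one template. A minimal sketch of a more general approach follows, assuming the parser can return text page by page: lines that repeat on most pages are treated as headers, footers, or watermarks. The function names and the 60% threshold are assumptions for illustration, not part of the original pipeline.

```typescript
// Mask digits so that "Page 1" and "Page 2" count as the same line.
const maskDigits = (line: string): string => line.trim().replace(/\d+/g, '#');

/** Lines (digit-masked) that appear on at least `minShare` of pages, and on 2+ pages. */
export function findRepeatedLines(pages: string[], minShare = 0.6): Set<string> {
  const pageCounts = new Map<string, number>();
  pages.forEach((page) => {
    // Count each distinct line once per page.
    const distinct = new Set(page.split('\n').map(maskDigits).filter((l) => l.length > 0));
    distinct.forEach((line) => pageCounts.set(line, (pageCounts.get(line) ?? 0) + 1));
  });
  const threshold = Math.max(2, Math.ceil(pages.length * minShare));
  const repeated = new Set<string>();
  pageCounts.forEach((count, line) => {
    if (count >= threshold) repeated.add(line);
  });
  return repeated;
}

/** Remove the repeated lines from every page, keeping everything else in order. */
export function stripRepeatedLines(pages: string[], minShare = 0.6): string[] {
  const repeated = findRepeatedLines(pages, minShare);
  return pages.map((page) =>
    page
      .split('\n')
      .filter((line) => !repeated.has(maskDigits(line)))
      .join('\n')
  );
}
```

One trade-off to keep in mind: because digits are masked, a real sentence that repeats on most pages with only a number changed would also be removed, so the threshold deserves a check against sample documents.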
PHASE 1, STEP 6: Metadata Extraction
👨‍🦳 Uncle: Metadata is information ABOUT the chunk, not IN the chunk.
👦 Nephew: Like what?
👨‍🦳 Uncle: Like:
Document: HR_Policy.pdf
Department: Human Resources
Section: Leave Policy
Page: 3
Version: 2.3
Effective Date: 2025-01-01
Author: HR Manager
Source: Internal Wiki
This metadata helps retrieval.
👦 Nephew: How?
👨‍🦳 Uncle: Suppose someone asks: "What's the HR policy for leaves?"
Without metadata:
- System searches all documents
- Returns leave policies from EVERY department
With metadata:
- System knows to search only HR documents
- Returns only HR leave policy
Much better!
// Step 6: Extract metadata
import logger from '../utils/logger';
interface DocumentMetadata {
source: string;
department?: string;
section?: string;
version?: string;
effectiveDate?: string;
author?: string;
}
/**
* Extract metadata from document structure
*
* Sources:
* 1. Folder structure: /HR/Policies/Leave_Policy.pdf
* 2. Filename: leave_policy_v2.pdf
* 3. Parser output: Unstructured gives headings as metadata
* 4. LLM classification: Ask AI to classify
*/
export function extractMetadataFromStructure(
filePath: string,
parserElements: any[]
): DocumentMetadata {
// Source 1: Folder structure
const pathParts = filePath.split('/');
const department = pathParts[pathParts.length - 2] === 'HR' ? 'Human Resources' : undefined;
// Source 2: Filename
const fileName = pathParts[pathParts.length - 1];
const versionMatch = fileName.match(/v(\d+)/i);
const version = versionMatch ? versionMatch[1] : undefined;
// Source 3: Parser output (if using Unstructured)
let section = undefined;
if (parserElements && parserElements.length > 0) {
const heading = parserElements.find((e: any) => e.type === 'Title' || e.type === 'Heading');
if (heading) {
section = heading.text;
}
}
const metadata: DocumentMetadata = {
source: filePath,
department,
section,
version
};
logger.info('Metadata extracted', metadata);
return metadata;
}
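import Anthropic from '@anthropic-ai/sdk';
/**
* Optional: Use LLM to classify (more powerful but costs money)
*/
export async function classifyWithLLM(text: string) {
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const response = await client.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 200,
messages: [{
role: 'user',
content: `Classify this document. Return JSON:
{
"department": "HR|Finance|IT|Legal",
"topic": "Leave|Salary|Security|Contracts",
"confidentiality": "public|internal|confidential"
}
Document: ${text.substring(0, 500)}`
}]
});
// Content blocks are a union type; only text blocks carry a .text field
const block = response.content[0];
if (!block || block.type !== 'text') {
throw new Error('Classifier returned no text block');
}
// JSON.parse throws if the model adds prose around the JSON; callers should handle that
return JSON.parse(block.text);
}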
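How metadata narrows retrieval can be shown with an in-memory sketch. It assumes each stored chunk carries its embedding and a `department` field; the `StoredChunk` shape and the function `searchWithMetadata` are hypothetical, and a real system would push the same filter into the vector database query instead of filtering in application code.

```typescript
interface StoredChunk {
  text: string;
  embedding: number[];
  metadata: { department?: string; section?: string };
}

/** Cosine similarity of two equal-length vectors (0 when either is all zeros). */
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

/** Filter by metadata first, then rank the survivors by similarity. */
export function searchWithMetadata(
  chunks: StoredChunk[],
  queryEmbedding: number[],
  filter: { department?: string } = {},
  topK = 5
): StoredChunk[] {
  return chunks
    .filter((c) => !filter.department || c.metadata.department === filter.department)
    .map((c) => ({ chunk: c, score: cosineSimilarity(c.embedding, queryEmbedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((r) => r.chunk);
}
```

Without the filter, a Finance chunk that happens to sit closer to the query vector would outrank the HR policy; with `{ department: 'Human Resources' }` it never enters the ranking.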
PHASE 1, STEP 7-10: Chunking - The Heart of Preparation
Why Chunking Matters
👨‍🦳 Uncle: This is the most important step. Everything after depends on it.
👦 Nephew: More important than embeddings?
👨‍🦳 Uncle: YES. Because bad chunks → bad embeddings → wrong answers.
Step 7: Chunking Strategy
// Step 7: Smart chunking with boundaries
/**
* Chunking strategies:
*
* 1. Fixed size (simple, dumb)
* ├─ Pros: Easy to understand
* └─ Cons: Breaks meaning
*
* 2. Sentence boundaries (better)
* ├─ Pros: Doesn't break sentences
* └─ Cons: Variable sizes
*
* 3. Semantic (best, most complex)
* ├─ Pros: Keeps meaning together
* └─ Cons: Expensive, slow
*
* 4. Recursive (good hybrid)
* ├─ Pros: Smart, flexible
* └─ Cons: Complex implementation
*/
function slidingWindowChunk(
text: string,
windowTokens: number = 1000,
overlapTokens: number = 200
): string[] {
// 1 token ≈ 4 characters (rough estimate for English)
const charWindow = windowTokens * 4;
const charOverlap = overlapTokens * 4;
const step = charWindow - charOverlap;
const chunks: string[] = [];
for (let i = 0; i < text.length; i += step) {
let end = i + charWindow;
// Find sentence boundary (don't break mid-sentence)
if (end < text.length) {
const periodIndex = text.lastIndexOf('.', end);
const newlineIndex = text.lastIndexOf('\n', end);
const boundaryIndex = Math.max(periodIndex, newlineIndex);
if (boundaryIndex > i + (charWindow * 0.8)) {
end = boundaryIndex + 1;
}
}
const chunk = text.substring(i, end).trim();
if (chunk.length > 0) {
chunks.push(chunk);
}
if (end >= text.length) break;
}
return chunks;
}
// Example
const text = `
Employees receive 24 paid leaves annually.
These leaves can be taken as:
1. Annual Leave (20 days)
2. Casual Leave (4 days)
Manager approval is required for:
- Leaves longer than 5 days
- Leave during peak season
`;
const chunks = slidingWindowChunk(text, 1000, 200);
console.log(chunks);
// The sample is far shorter than one window (1000 tokens ≈ 4000 characters),
// so it comes back as a single chunk:
// [
// "Employees receive 24 paid leaves annually.\nThese leaves can be taken as: ..."
// ]
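Strategy 4 (recursive) is listed in the comment but not implemented. A minimal sketch follows, assuming a character budget instead of a token budget and a fixed separator order (paragraph, line, sentence, word); the function name `recursiveSplit` and its defaults are illustrative choices, not a reference implementation of any library.

```typescript
/**
 * Split text so every chunk is at most `maxChars` long, preferring the
 * coarsest separator that works and only falling back to finer ones
 * for pieces that are still too big.
 */
export function recursiveSplit(
  text: string,
  maxChars: number,
  separators: string[] = ['\n\n', '\n', '. ', ' ']
): string[] {
  if (text.length <= maxChars) return text.trim() ? [text.trim()] : [];
  if (separators.length === 0) {
    // Nothing left to split on: hard cut as a last resort.
    const cut: string[] = [];
    for (let i = 0; i < text.length; i += maxChars) cut.push(text.slice(i, i + maxChars));
    return cut;
  }
  const [sep, ...finer] = separators;
  const pieces = text.split(sep).filter((p) => p.trim().length > 0);
  const chunks: string[] = [];
  let current = '';
  for (const piece of pieces) {
    // Greedily pack neighbouring pieces while they still fit.
    const candidate = current ? current + sep + piece : piece;
    if (candidate.length <= maxChars) {
      current = candidate;
      continue;
    }
    if (current.trim()) chunks.push(current.trim());
    current = '';
    if (piece.length <= maxChars) {
      current = piece;
    } else {
      // This piece alone is too large: retry with the finer separators.
      chunks.push(...recursiveSplit(piece, maxChars, finer));
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

This version has no overlap and drops the separator at a chunk boundary (a chunk split on ". " can lose its final period), so it illustrates the boundary logic only; overlap would still be added on top.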
Step 8-10: Chunk Boundaries, Size, and Overlap
👦 Nephew: Why does size matter?
👨‍🦳 Uncle: Look at three scenarios:
Scenario 1: Too Small (100 tokens)
Chunk 1: "Employees receive"
Chunk 2: "24 paid leaves"
Chunk 3: "annually"
Result: Lost context. AI confused.
Question: "How many leaves do employees get?"
Chunk 1 alone: "Employees receive" (incomplete)
Chunk 2 alone: "24 paid leaves" (where?)
Chunk 3 alone: "annually" (what annually?)
Wrong answer!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Scenario 2: Right Size (1000 tokens)
Chunk: "Employees receive 24 paid leaves annually.
These leaves can be taken as annual or casual.
Manager approval required for leaves >5 days."
Result: Full context. Clear meaning.
Question: "How many leaves do employees get?"
Answer: "24 paid leaves"
Correct!
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Scenario 3: Too Large (5000 tokens)
Chunk: [Entire company handbook]
Result: Too much noise.
Question: "How many leaves?"
System reads 5000 tokens to find answer.
Slow. Confused by other policies.
Wrong section highlighted.
👨‍🦳 Uncle: Sweet spot: 1000-1500 tokens.
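Window size and overlap also decide how many chunks, and therefore how many embeddings, a document produces. A minimal sketch of that arithmetic, assuming a plain sliding window where each new chunk starts `window - overlap` tokens after the previous one; the function name is illustrative.

```typescript
/**
 * Number of sliding-window chunks for a document of `totalTokens` tokens.
 * The first chunk covers `windowTokens`; every later chunk adds
 * (windowTokens - overlapTokens) new tokens.
 */
export function estimateChunkCount(
  totalTokens: number,
  windowTokens = 1000,
  overlapTokens = 200
): number {
  if (overlapTokens >= windowTokens) {
    throw new Error('overlapTokens must be smaller than windowTokens');
  }
  if (totalTokens <= 0) return 0;
  if (totalTokens <= windowTokens) return 1;
  const step = windowTokens - overlapTokens;
  return 1 + Math.ceil((totalTokens - windowTokens) / step);
}
```

For a 10,000-token handbook with the defaults this gives 13 chunks instead of the 10 that zero overlap would need, which is the price paid for context continuity across chunk boundaries.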
// Complete chunking service
import logger from '../utils/logger';
interface Chunk {
text: string;
index: number;
tokens: number;
embedding?: number[]; // filled in later, either reused from an old version or freshly computed
metadata: {
startChar: number;
endChar: number;
};
}
/**
* Steps 7-10: Complete chunking strategy
*
* - Sliding window with overlap
* - Respects sentence boundaries
* - Includes overlap for context
* - Validates chunk size
*/
export function createSmartChunks(
text: string,
windowTokens: number = 1000,
overlapTokens: number = 200
): Chunk[] {
const charWindow = windowTokens * 4;
const charOverlap = overlapTokens * 4;
const step = charWindow - charOverlap;
const chunks: Chunk[] = [];
let chunkIndex = 0;
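for (let i = 0; i < text.length; i += step) {
let end = Math.min(i + charWindow, text.length);
// Remaining text too small: merge the uncovered tail into the previous chunk
if (chunks.length > 0 && text.length - i < charWindow * 0.3) {
const lastChunk = chunks[chunks.length - 1];
const tail = text.substring(lastChunk.metadata.endChar).trim();
if (tail.length > 0) {
lastChunk.text += ' ' + tail;
lastChunk.tokens = Math.ceil(lastChunk.text.length / 4);
}
lastChunk.metadata.endChar = text.length;
break;
}
// Find sentence boundary, but never cut before the next chunk's start
// (i + step), otherwise the text in between would be lost
if (end < text.length) {
const periodIndex = text.lastIndexOf('.', end);
if (periodIndex >= i + step) {
end = periodIndex + 1;
}
}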
const chunkText = text.substring(i, end).trim();
if (chunkText.length === 0) continue;
// Estimate tokens (rough)
const estimatedTokens = Math.ceil(chunkText.length / 4);
// Validate: chunk should be reasonable size
if (estimatedTokens < 100) {
logger.warn('Chunk too small', { tokens: estimatedTokens });
}
chunks.push({
text: chunkText,
index: chunkIndex++,
tokens: estimatedTokens,
metadata: {
startChar: i,
endChar: end
}
});
if (end >= text.length) break;
}
logger.info('Chunking complete', {
textLength: text.length,
chunkCount: chunks.length,
// Guard against division by zero when the text produced no chunks
avgChunkTokens: chunks.length
? Math.round(chunks.reduce((sum, c) => sum + c.tokens, 0) / chunks.length)
: 0
});
return chunks;
}
PHASE 1, STEP 11-13: Deduplication, Hashing & Versioning
Step 11: Chunk Hashing
👨‍🦳 Uncle: After chunking, we hash each chunk.
👦 Nephew: Why hash chunks? We already hashed the file!
👨‍🦳 Uncle: Because files change. When they do, we need to know WHICH chunks changed.
HR_Policy_v1.pdf (Old)
├─ Chunk 1: "Employees get 24 leaves"
├─ Chunk 2: "Manager approval required"
└─ Chunk 3: "Casual leave guidelines"
HR_Policy_v2.pdf (New)
├─ Chunk 1: "Employees get 30 leaves" ← CHANGED!
├─ Chunk 2: "Manager approval required" ← SAME
└─ Chunk 3: "Casual leave guidelines" ← SAME
Old system: Re-embed all 3 chunks
Smart system: Re-embed only chunk 1
Cost saving: about 67% (2 of 3 chunks reused)
// Step 11: Hash each chunk
import crypto from 'crypto';
export function hashChunk(chunkText: string): string {
return crypto
.createHash('sha256')
.update(chunkText)
.digest('hex');
}
/**
* Key insight: Hash is deterministic
*
* Same text:
* hashChunk("Employees get 24 leaves")
* Always returns the same 64-character hex string
*
* Different text:
* hashChunk("Employees get 30 leaves")
* Returns a completely different hex string
*
* Different hash = Document changed
*/
// Example (hash values omitted; each is 64 lowercase hex characters)
const chunk1 = "Employees receive 24 paid leaves annually";
const chunk1Hash = hashChunk(chunk1);
const chunk2 = "Employees receive 24 paid leaves annually";
const chunk2Hash = hashChunk(chunk2);
// chunk2Hash === chunk1Hash (SAME!)
const chunk3 = "Employees receive 30 paid leaves annually";
const chunk3Hash = hashChunk(chunk3);
// chunk3Hash !== chunk1Hash (DIFFERENT!)
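Because a hash changes on any byte difference, a stray double space or a different line wrap would make an unchanged chunk look new. A minimal sketch of hashing a normalized form of the text follows, assuming whitespace and Unicode composition differences should not count as changes; `normalizeForHash` and `hashNormalizedChunk` are illustrative names, and the normalization rules are a design choice to adapt.

```typescript
import { createHash } from 'node:crypto';

/** Collapse whitespace and unify Unicode composition before hashing. */
export function normalizeForHash(text: string): string {
  return text.normalize('NFC').replace(/\s+/g, ' ').trim();
}

/** SHA-256 (lowercase hex) of the normalized chunk text. */
export function hashNormalizedChunk(text: string): string {
  return createHash('sha256').update(normalizeForHash(text), 'utf8').digest('hex');
}
```

The stored chunk text stays untouched; only the value used for comparison is normalized, so re-extraction noise does not trigger re-embedding while a real edit such as 24 becoming 30 still does.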
Step 12: Deduplication
// Step 12: Detect and skip duplicates
/**
* Deduplication at multiple levels:
*
* Level 1: Document level
* Have we processed this exact file?
* (Already done with file hash)
*
* Level 2: Chunk level
* Have we processed this exact chunk?
*
* Level 3: Semantic level
* Is this chunk similar to existing chunk?
* (Advanced, uses embeddings)
*/
export async function deduplicateChunks(
chunks: Chunk[],
tenantId: string,
documentId: string
): Promise<{ newChunks: Chunk[]; skipped: number }> {
const newChunks: Chunk[] = [];
let skipped = 0;
for (const chunk of chunks) {
const chunkHash = hashChunk(chunk.text);
// Check if chunk exists
const existingChunk = await db.oneOrNone(
`
SELECT id FROM chunks
WHERE chunk_hash = $1 AND tenant_id = $2
`,
[chunkHash, tenantId]
);
if (existingChunk) {
logger.debug('Duplicate chunk skipped', { chunkHash: chunkHash.substring(0, 8) });
skipped++;
continue;
}
newChunks.push(chunk);
}
logger.info('Deduplication complete', {
total: chunks.length,
new: newChunks.length,
skipped,
// Guard against division by zero for an empty chunk list
percentage: chunks.length ? Math.round((skipped / chunks.length) * 100) : 0
});
return { newChunks, skipped };
}
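The loop above issues one database query per chunk, which gets slow for documents with thousands of chunks. A minimal sketch of a batched variant follows, assuming the lookup is injected as a function so the logic can be tested without a database; `filterNewChunks`, `sha256Hex`, and the `HashLookup` type are hypothetical names. With pg-promise, the lookup could plausibly be a single query of the form `SELECT chunk_hash FROM chunks WHERE tenant_id = $1 AND chunk_hash = ANY($2)`, but that query is an assumption to verify against your schema.

```typescript
import { createHash } from 'node:crypto';

/** Given candidate hashes, resolve to the subset that is already stored. */
type HashLookup = (hashes: string[]) => Promise<string[]>;

function sha256Hex(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

/**
 * Keep only chunk texts whose hash is neither stored already nor
 * repeated earlier in the same batch. One lookup call for the whole batch.
 */
export async function filterNewChunks(
  texts: string[],
  findExistingHashes: HashLookup
): Promise<{ fresh: string[]; skipped: number }> {
  const hashed = texts.map((text) => ({ text, hash: sha256Hex(text) }));
  const uniqueHashes = Array.from(new Set(hashed.map((h) => h.hash)));
  const existing = new Set(await findExistingHashes(uniqueHashes));
  const seenInBatch = new Set<string>();
  const fresh: string[] = [];
  for (const { text, hash } of hashed) {
    if (existing.has(hash) || seenInBatch.has(hash)) continue;
    seenInBatch.add(hash);
    fresh.push(text);
  }
  return { fresh, skipped: texts.length - fresh.length };
}
```

The in-batch set also catches a chunk that appears twice inside the same upload, a case the per-chunk query would miss until the first copy had been stored.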
👦 Nephew: What if dedup changes because we change chunk size?
👨‍🦳 Uncle: Good question! That's versioning.
Step 13: Versioning
// Step 13: Handle document versions
export async function handleDocumentVersion(
tenantId: string,
fileName: string,
newFileHash: string
) {
// Check if this document (by name) was processed before
// oneOrNone resolves to null on a first upload; db.one would reject when no row exists
const oldVersion = await db.oneOrNone(
`
SELECT id, version FROM documents
WHERE file_name = $1 AND tenant_id = $2
ORDER BY version DESC LIMIT 1
`,
[fileName, tenantId]
);
let version = 1;
if (oldVersion) {
// Old version exists
// Mark it as inactive
await db.none(
`
UPDATE documents
SET active = false
WHERE id = $1
`,
[oldVersion.id]
);
version = oldVersion.version + 1;
}
// Create new version
const newDoc = await db.one(
`
INSERT INTO documents
(tenant_id, file_name, file_hash, version, active)
VALUES ($1, $2, $3, $4, true)
RETURNING id
`,
[tenantId, fileName, newFileHash, version]
);
logger.info('Document versioned', {
fileName,
version,
active: true
});
return {
documentId: newDoc.id,
version,
isNewVersion: oldVersion ? true : false
};
}
/**
* Database schema
*/
export const versioningSchema = `
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL,
file_name VARCHAR(500),
file_hash VARCHAR(64),
version INTEGER DEFAULT 1,
active BOOLEAN DEFAULT true,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- PostgreSQL needs a separate statement for the index
CREATE INDEX idx_active ON documents (tenant_id, active);
`;
👨‍🦳 Uncle: Important: During retrieval, we search ONLY active documents.
-- Search only from active documents
SELECT chunk_text
FROM chunks
WHERE tenant_id = $1
AND document_id IN (
SELECT id FROM documents
WHERE active = true
)
ORDER BY embedding <=> query_embedding
LIMIT 5
PHASE 1, STEP 14-15: Incremental Ingestion & Cost Optimization
The Big Picture: Before vs After
👨‍🦳 Uncle: Let me show you the difference between naive and smart systems.
NAIVE SYSTEM
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
File v1 uploaded
├─ Parse
├─ Chunk (1000 chunks)
├─ Embed ALL 1000 → Cost: ₹50 per 1000 = ₹50
└─ Store
File v2 uploaded (only 1 chunk changed)
├─ Parse
├─ Chunk (1000 chunks)
├─ Embed ALL 1000 → Cost: ₹50 again!
└─ Store
Total cost: ₹100
Total waste: about ₹50 (50% wasted)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SMART SYSTEM
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
File v1 uploaded
├─ Parse
├─ Chunk (1000 chunks)
├─ Embed ALL 1000 → Cost: ₹50
└─ Store with chunk hashes
File v2 uploaded (only 1 chunk changed)
├─ Parse
├─ Chunk (1000 chunks)
├─ Compare hashes → 999 same, 1 changed
├─ Embed ONLY 1 chunk → Cost: ₹0.05 only!
└─ Store (update index for chunk 1)
Total cost: ₹50.05
Total waste: ₹0 (999 of 1000 embeddings reused)
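The comparison above is simple arithmetic, sketched here so the numbers can be replayed with other inputs. The ₹50 per 1000 chunks rate comes from the example; the function names are illustrative.

```typescript
/** Embedding cost for `chunkCount` chunks at a price quoted per 1000 chunks. */
export function embeddingCost(chunkCount: number, costPerThousand: number): number {
  return (chunkCount / 1000) * costPerThousand;
}

/**
 * Total cost of an initial upload plus one update, for a system that
 * re-embeds everything (naive) versus only the changed chunks (incremental).
 */
export function compareUpdateCost(
  totalChunks: number,
  changedChunks: number,
  costPerThousand: number
): { naive: number; incremental: number; saved: number } {
  const initial = embeddingCost(totalChunks, costPerThousand);
  const naive = initial + embeddingCost(totalChunks, costPerThousand);
  const incremental = initial + embeddingCost(changedChunks, costPerThousand);
  return { naive, incremental, saved: naive - incremental };
}
```

Calling `compareUpdateCost(1000, 1, 50)` reproduces the figures in the diagram: ₹100 for the naive system and about ₹50.05 for the incremental one.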
Step 14: Incremental Ingestion
// Step 14: Incremental ingestion (smart re-embedding)
/**
* The key insight:
*
* Don't re-embed unchanged chunks.
* Reuse old embeddings.
*
* This is how production systems save money.
*/
export async function incrementalIngestion(
tenantId: string,
documentId: string,
newChunks: Chunk[],
documentVersion: number
) {
// Get old chunks from previous version
// oneOrNone resolves to null for version 1; db.one would reject when no row exists
const oldDoc = await db.oneOrNone(
`
SELECT id FROM documents
WHERE file_name = (SELECT file_name FROM documents WHERE id = $1)
AND version = $2 - 1
AND tenant_id = $3
`,
[documentId, documentVersion, tenantId]
);
let embeddingsNeeded = newChunks;
let embeddingsReused = 0;
if (oldDoc) {
// Get old chunks
const oldChunks = await db.manyOrNone(
`
SELECT id, chunk_hash, embedding FROM chunks
WHERE document_id = $1
`,
[oldDoc.id]
);
// Compare: which chunks are new/changed?
embeddingsNeeded = newChunks.filter(newChunk => {
const newHash = hashChunk(newChunk.text);
const oldChunk = oldChunks.find(
c => c.chunk_hash === newHash
);
if (oldChunk) {
// Chunk unchanged: reuse embedding
embeddingsReused++;
// Copy old embedding to new chunk
newChunk.embedding = oldChunk.embedding;
return false; // Don't need to re-embed
}
return true; // Need to embed
});
}
logger.info('Incremental ingestion', {
totalChunks: newChunks.length,
newChunks: embeddingsNeeded.length,
reused: embeddingsReused,
efficiency: Math.round((embeddingsReused / newChunks.length) * 100) + '%'
});
return {
chunksToEmbed: embeddingsNeeded,
chunksCached: embeddingsReused,
costSaving: embeddingsReused * 0.05 // ₹ saved, at ₹50 per 1000 chunks
};
}
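The comparison step can be isolated into a pure function, which makes it easy to test and replaces the repeated `find` scan with a map lookup. A minimal sketch, assuming old chunks are available as hash and embedding pairs; `planReembedding`, `hashText`, and the `OldChunk` shape are hypothetical names.

```typescript
import { createHash } from 'node:crypto';

interface OldChunk {
  hash: string;        // SHA-256 hex of the old chunk text
  embedding: number[]; // embedding stored for that chunk
}

function hashText(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

/** Split new chunk texts into those that can reuse an old embedding and those that cannot. */
export function planReembedding(
  newTexts: string[],
  oldChunks: OldChunk[]
): { reused: { text: string; embedding: number[] }[]; toEmbed: string[] } {
  const oldByHash = new Map<string, number[]>();
  oldChunks.forEach((c) => oldByHash.set(c.hash, c.embedding));
  const reused: { text: string; embedding: number[] }[] = [];
  const toEmbed: string[] = [];
  for (const text of newTexts) {
    const embedding = oldByHash.get(hashText(text));
    if (embedding) {
      reused.push({ text, embedding });
    } else {
      toEmbed.push(text);
    }
  }
  return { reused, toEmbed };
}
```

One caveat worth knowing: with fixed-size sliding windows, inserting a single sentence near the top of a document shifts every later chunk boundary, so most hashes change even though little text did. Reuse rates like 999 of 1000 are realistic mainly when chunk boundaries follow document structure such as headings or paragraphs.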
Step 15: Cost Optimization Strategy
// Step 15: Track costs and optimize
export async function optimizeIngestionCosts() {
// Strategy 1: Batch embeddings
// Don't embed one by one. Batch them.
// Cost: 1000 embeddings in 1 API call = cheaper than 1000 calls
// Strategy 2: Use cheaper models for simple chunks
// Use Gemini for obvious policy text
// Use Claude only for complex analysis
// Strategy 3: Cache embeddings aggressively
// If same chunk text appears in document 1 and 2
// Don't embed twice. Reuse.
// Strategy 4: Incremental updates
// Don't re-embed when document slightly changes
// Only embed changed chunks
/**
* Real example: HR uploads 100 policies
*
* Policy 1: 100 chunks
* Policy 2: 100 chunks
* Policy 3: 100 chunks (99 identical to Policy 1)
*
* Naive: Embed 300 chunks = ₹15
* Smart: Embed 201 chunks = ₹10.05 (saved about ₹5)
*
* At scale (1000 policies):
* Naive: ₹500/month
* Smart: ₹100/month (saved ₹400!)
*/
const costBreakdown = {
parsing: 0.50, // ₹0.50 per document
embedding: 50, // ₹50 per 1000 chunks
vectorStorage: 10, // ₹10 per GB/month
llmClassification: 1, // ₹1 per document (optional)
monthlyInfrastructure: 1000
};
logger.info('Cost breakdown', costBreakdown);
return costBreakdown;
}
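Strategy 1 (batching) and the cost arithmetic in the comment above can be made concrete with two small helpers. This is a sketch under stated assumptions: `toBatches` and `estimateEmbeddingCost` are hypothetical names, the ₹50 per 1000 chunks rate is the illustrative figure from `costBreakdown`, and the batch size of 100 is arbitrary, since real embedding APIs each have their own per-request limits.

```typescript
// Assumption: the illustrative rate of ₹50 per 1000 embedded chunks.
const RUPEES_PER_1000_CHUNKS = 50;

// Group items into fixed-size batches so each API call carries many chunks.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new Error('batchSize must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Estimated embedding cost in rupees for a given number of chunks.
function estimateEmbeddingCost(chunkCount: number): number {
  return (chunkCount * RUPEES_PER_1000_CHUNKS) / 1000;
}

// Usage: 250 chunks become 3 requests instead of 250.
const chunkTexts = Array.from({ length: 250 }, (_, i) => `chunk ${i}`);
const batches = toBatches(chunkTexts, 100);
console.log(batches.map(b => b.length)); // [ 100, 100, 50 ]

// The three-policy example: naive vs. deduplicated.
console.log(estimateEmbeddingCost(300)); // 15
console.log(estimateEmbeddingCost(201)); // 10.05
```

Batching reduces the number of requests and the per-request overhead. Whether it also lowers the billed amount depends on the provider's pricing, so treat the cost helper as an estimate only.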
Complete Production Ingestion Flow
👨‍🦳 Uncle: Let me show you the complete flow as code.
// Complete ingestion pipeline
import * as fs from 'fs'; // needed by selectParser (fs.statSync) below
import db from '../config/database';
import logger from '../utils/logger';
/**
* Phase 1: Complete Document Ingestion Pipeline
*
* This is what production systems do before embeddings.
*
* Cost tracking included.
*/
export async function completeIngestionPipeline(
filePath: string,
tenantId: string
): Promise<{
documentId: string;
chunksCreated: number;
chunksEmbedded: number;
costEstimate: number;
duration: number;
}> {
const startTime = Date.now();
const metrics = {
documentId: '',
chunksCreated: 0,
chunksEmbedded: 0,
costEstimate: 0,
duration: 0
};
try {
// STEP 1-2: Upload & File Hashing
logger.info('Step 1-2: File upload and hashing...');
const fileHash = await handleFileUpload(filePath, tenantId);
if (fileHash.alreadyExists) {
logger.warn('File already processed. Skipping.');
return {
...metrics,
duration: Date.now() - startTime
};
}
// STEP 3: Choose parser
logger.info('Step 3: Parsing PDF...');
const parserType = selectParser(filePath); // pdf-parse, unstructured, etc.
const parsedElements = await parsePDF(filePath, parserType);
// STEP 4-5: Extract & Clean
logger.info('Step 4-5: Extracting and cleaning text...');
let cleanText = extractText(parsedElements);
cleanText = cleanExtractedText(cleanText);
// STEP 6: Metadata
logger.info('Step 6: Extracting metadata...');
const metadata = extractMetadataFromStructure(filePath, parsedElements);
// STEP 7-10: Chunking
logger.info('Step 7-10: Creating chunks...');
const chunks = createSmartChunks(cleanText, 1000, 200);
metrics.chunksCreated = chunks.length;
// STEP 11-12: Deduplication
logger.info('Step 11-12: Deduplication...');
const { newChunks, skipped } = await deduplicateChunks(
chunks,
tenantId,
fileHash.fileHash
);
// STEP 13: Versioning
logger.info('Step 13: Handling versions...');
const versionInfo = await handleDocumentVersion(
tenantId,
fileHash.fileName,
fileHash.fileHash
);
metrics.documentId = versionInfo.documentId;
// STEP 14-15: Incremental & Optimize
logger.info('Step 14-15: Incremental ingestion...');
const { chunksToEmbed } = await incrementalIngestion(
tenantId,
versionInfo.documentId,
newChunks,
versionInfo.version
);
metrics.chunksEmbedded = chunksToEmbed.length;
// Calculate cost
metrics.costEstimate = (chunksToEmbed.length / 1000) * 50; // ₹50 per 1000 chunks
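// Save chunks to database, persisting any embedding reused in Step 14.
// Assumes the embedding column is nullable: chunks that still need
// embedding are stored with NULL and filled in later.
for (const chunk of newChunks) {
const chunkHash = hashChunk(chunk.text);
await db.none(
`
INSERT INTO chunks
(document_id, tenant_id, chunk_text, chunk_index, chunk_hash, metadata, embedding)
VALUES ($1, $2, $3, $4, $5, $6, $7)
`,
[
versionInfo.documentId,
tenantId,
chunk.text,
chunk.index,
chunkHash,
JSON.stringify(metadata),
chunk.embedding ?? null
]
);
}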
// Record the duration before logging so the logged metrics are complete
metrics.duration = Date.now() - startTime;
logger.info('Phase 1 complete!', metrics);
// Chunks are stored and ready for the embedding step
return metrics;
} catch (error: any) {
logger.error('Ingestion pipeline failed', { error: error.message });
throw error;
}
}
/**
* Helper: Choose the right parser
*/
function selectParser(filePath: string) {
const extension = filePath.split('.').pop()?.toLowerCase();
if (extension === 'pdf') {
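// Heuristic: use file size as a rough proxy for complexity
// (a small scanned or table-heavy PDF may still need a stronger parser)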
const size = fs.statSync(filePath).size;
if (size < 1000000) {
// < 1MB: simple PDF
return 'pdf-parse';
} else if (size < 10000000) {
// 1-10MB: normal PDF
return 'unstructured';
} else {
// > 10MB: complex PDF
return 'llamaparse';
}
}
if (extension === 'docx') {
return 'docx-parse';
}
return 'pdf-parse'; // default
}
The Complete Journey: From Upload to Ready-for-Embedding
👨‍🦳 Uncle: Let me show you the entire flow visually.
USER UPLOADS PDF
↓
[STEP 1-2]
File Hash Check
↓
Already exists? ──YES──→ STOP (save cost)
↓
NO
↓
[STEP 3]
Parse PDF ←── Choose parser
  ├─ pdf-parse (free)
  ├─ Unstructured (cheap)
  └─ LlamaParse (expensive)
↓
[STEP 4-5]
Extract & Clean ←── Remove garbage
  (headers, footers, etc.)
↓
[STEP 6]
Extract Metadata ←── Add context
  (department, section, etc.)
↓
[STEP 7-10]
Smart Chunking ←── 1000-1500 tokens
  with overlap & boundaries
↓
[STEP 11-12]
Chunk Hashing ←── Detect duplicates
↓
Skip duplicates? ──YES──→ Don't embed
↓
NO
↓
[STEP 13]
Versioning ←── Mark old version inactive
  Keep new version active
↓
[STEP 14-15]
Incremental Ingestion ←── Only embed changed chunks
  Reuse cached embeddings
↓
✓ READY FOR EMBEDDINGS
Interview-Level Summary
👦 Nephew: If someone asks me in an interview about RAG ingestion?
👨‍🦳 Uncle: Say this:
"Phase 1 Document Ingestion has 15 critical steps before embeddings:
- File Upload - Accept the document from the user
- File Hashing - Hash file content (not filename) to detect duplicate uploads
- PDF Parsing - Choose tool based on document type (pdf-parse for simple, Unstructured for normal, LlamaParse for complex)
- Text Extraction - Get raw text from parser output
- Text Cleaning - Remove headers, footers, watermarks, junk
- Metadata Extraction - Extract department, section, version from structure or LLM
- Chunking - Split into 1000-1500 tokens with smart boundaries (sentence/paragraph)
- Chunk Boundaries - Don't break mid-sentence; respect document structure
- Chunk Size - Balance between context (large) and precision (small)
- Overlap - 200 tokens overlap to maintain context continuity
- Chunk Hashing - Hash each chunk to detect changes
- Deduplication - Skip chunks whose hash already exists (exact-match duplicates, not semantic similarity)
- Versioning - Handle document updates (v1 inactive, v2 active)
- Incremental Ingestion - Only re-embed changed chunks, reuse old embeddings
- Cost Optimization - Track costs, batch requests, use cheaper models
After these steps, store the chunks so they are ready for embedding.
The key insight: Avoid embedding unnecessary content. At enterprise scale, avoiding 100K unnecessary embeddings saves thousands of rupees per month."
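The three chunking rules in this summary (a size limit, sentence boundaries, and overlap) can be sketched in a few lines. This is a simplified illustration under stated assumptions, independent of the `createSmartChunks` function the pipeline calls: `chunkBySentence`, `splitSentences`, and `countTokens` are hypothetical names, tokens are approximated by whitespace-separated words, and the sentence splitter is deliberately naive.

```typescript
// Assumption: token counts are approximated by whitespace-separated words.
function countTokens(s: string): number {
  return s.split(/\s+/).filter(Boolean).length;
}

// Naive sentence splitter: breaks after ., ! or ? (it will also split "3.5").
function splitSentences(text: string): string[] {
  const matches: string[] = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) ?? [];
  return matches.map(s => s.trim()).filter(s => s.length > 0);
}

// Pack whole sentences into chunks of at most maxTokens. Trailing sentences
// (up to overlapTokens) are repeated at the start of the next chunk.
// A single sentence longer than maxTokens becomes its own oversized chunk.
function chunkBySentence(text: string, maxTokens: number, overlapTokens: number): string[] {
  const chunks: string[] = [];
  let current: string[] = [];
  let size = 0;
  let fresh = 0; // sentences in `current` that are not carried-over overlap

  for (const sentence of splitSentences(text)) {
    const n = countTokens(sentence);
    if (size + n > maxTokens && fresh > 0) {
      chunks.push(current.join(' '));
      // Collect the overlap from the tail of the chunk just closed.
      const tail: string[] = [];
      let tailSize = 0;
      for (let i = current.length - 1; i >= 0; i--) {
        const t = countTokens(current[i]);
        if (tailSize + t > overlapTokens) break;
        tail.unshift(current[i]);
        tailSize += t;
      }
      // Drop the overlap if it would push the next chunk over the limit.
      const keepTail = tailSize + n <= maxTokens;
      current = keepTail ? tail : [];
      size = keepTail ? tailSize : 0;
      fresh = 0;
    }
    current.push(sentence);
    size += n;
    fresh++;
  }
  if (fresh > 0) chunks.push(current.join(' '));
  return chunks;
}

// Usage with tiny limits so the overlap is visible.
const sampleText =
  'One two three. Four five six. Seven eight nine. Ten eleven twelve.';
const sampleChunks = chunkBySentence(sampleText, 6, 3);
console.log(sampleChunks);
// [ 'One two three. Four five six.',
//   'Four five six. Seven eight nine.',
//   'Seven eight nine. Ten eleven twelve.' ]
```

In a real pipeline the word count would be replaced by the tokenizer of the embedding model, and the limits would be closer to the 1000 and 200 used above.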
Practical Checklist: Before You Deploy
Pre-Deployment Questions
const preDeploymentChecklist = {
// Parsing
"Have you tested parsing with your actual documents?": false,
"Do you handle tables, images, multi-column layouts?": false,
"Did you choose the right parser for your documents?": false,
// Deduplication
"Are you hashing file contents, not filenames?": false,
"Do you check for existing files before parsing?": false,
"Do you check for duplicate chunks before embedding?": false,
// Chunking
"Is your chunk size 800-1500 tokens?": false,
"Do chunks respect sentence/paragraph boundaries?": false,
"Do you have 100-200 token overlap?": false,
// Metadata
"Do you extract metadata from document structure?": false,
"Is metadata used during retrieval?": false,
// Versioning
"Do you handle document updates (v1, v2, v3)?": false,
"Do old versions get marked inactive?": false,
// Cost
"Are you tracking embedding costs per document?": false,
"Do you reuse embeddings for unchanged chunks?": false,
"Are you batching embedding requests?": false,
// Production
"Do you have error handling and retries?": false,
"Are you logging every step?": false,
"Can you rollback a bad ingestion?": false
};
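A checklist object like this can double as a simple deployment gate. The sketch below is one possible way to use it, with assumptions made explicit: `openItems` and the `Checklist` type are hypothetical, and the sample uses only three of the questions for brevity.

```typescript
type Checklist = Record<string, boolean>;

// Return the questions that are still answered "false".
function openItems(checklist: Checklist): string[] {
  return Object.entries(checklist)
    .filter(([, done]) => !done)
    .map(([question]) => question);
}

// Usage: a shortened sample with one item done and two still open.
const sampleChecklist: Checklist = {
  'Are you hashing file contents, not filenames?': true,
  'Are you batching embedding requests?': false,
  'Can you rollback a bad ingestion?': false,
};

const remaining = openItems(sampleChecklist);
if (remaining.length > 0) {
  console.error(`Not ready to deploy. Open items:\n- ${remaining.join('\n- ')}`);
}
```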
Key Takeaways
👨‍🦳 Uncle: Remember:
- Before embeddings, 15 steps matter
  - Skip even one, quality suffers
- The right parser saves time
  - pdf-parse: Free, simple
  - Unstructured: Balanced
  - LlamaParse: Best quality
- Chunking is everything
  - Size: 1000-1500 tokens
  - Overlap: 200 tokens
  - Boundaries: Sentence/paragraph
- Deduplication saves money
  - Hash file content, not filename
  - Hash chunks, reuse embeddings
  - 50%+ cost savings possible
- Metadata enables better retrieval
  - Extract from structure
  - Use during search
  - Improve accuracy
- Production systems are incremental
  - Don't re-embed everything
  - Reuse cached embeddings
  - Only process what changed
👦 Nephew: So this is what "production-grade RAG" really means?
👨‍🦳 Uncle: Exactly. It's not just about embeddings and vector search. It's about doing the hard work BEFORE you search. That's what separates hobby projects from systems that work at scale.
Next Steps
You now understand Phase 1: Document Ingestion.
Next comes Phase 2: Retrieval & Ranking (vector search, reranking, query expansion)
Then Phase 3: Safety & Production (hallucination prevention, monitoring, cost tracking)
Ready to learn Phase 2?
👦 Nephew: Uncle, definitely! But first... can you show me a working example?
👨‍🦳 Uncle: That's next. We'll build a real ingestion pipeline together.
Reference Code Repository
# Complete implementation files:
1. fileHasher.ts - File hashing, duplicate detection
2. pdfParsers.ts - pdf-parse, Unstructured, LlamaParse
3. textCleaning.ts - Remove junk, normalize whitespace
4. metadataExtractor.ts - Extract from structure, LLM classify
5. chunking.ts - Sliding window, smart boundaries
6. deduplication.ts - Chunk hashing, duplicate detection
7. versioning.ts - Handle updates, version tracking
8. incrementalIngestion.ts - Reuse embeddings, smart updates
9. ingestionPipeline.ts - Complete flow orchestration
10. costOptimizer.ts - Track costs, identify savings
Total: ~2000 lines of production-ready Node.js code
Remember: Less noise, more action. Phase 1 is the foundation.
Build it right.
written by -SurajK