surajrkhonde

Posted on Jun 20

RAG Pipeline: Complete Node.js Implementation Guide

#ai #rag #programming #code

Build production RAG systems in Node.js - Know where it breaks, why it works, and when to use it

Introduction: Why Node.js for RAG?

👦 Nephew: Uncle, why would I build RAG in Node.js? I thought this was AI stuff?

👨‍🦳 Uncle: Good question. Node.js is perfect for RAG because:

Non-blocking I/O (API calls to Claude and PostgreSQL queries run concurrently)
Real-time capable (WebSockets for streaming)
Easy deployment (Vercel, Railway, Render)
Async/await makes complex flows clean

Plus, you probably already have Node.js running your backend. Why add Python?

👦 Nephew: So I can build the whole thing in JavaScript?

👨‍🦳 Uncle: Yes. Frontend, backend, RAG - all JavaScript. That's the beauty.

But we need to be honest about limitations. Let's talk about that too.

PART 1: PROJECT SETUP

Initialize Your Project

# Create project
mkdir rag-system
cd rag-system

# Initialize Node.js
npm init -y

# Install dependencies
npm install express dotenv @anthropic-ai/sdk pg pg-promise cors body-parser
npm install --save-dev nodemon typescript @types/node @types/express @types/cors @types/compression

# Optional but recommended for production
npm install winston helmet compression

Project Structure

rag-system/
├── src/
│   ├── config/
│   │   ├── database.ts         # PostgreSQL + pgvector setup
│   │   └── embedding.ts         # Embedding generation
│   ├── services/
│   │   ├── retrieval.ts         # Vector search logic
│   │   ├── reranking.ts         # Two-stage ranking
│   │   ├── queryProcessing.ts   # Query expansion
│   │   └── safety.ts            # Hallucination prevention
│   ├── routes/
│   │   └── rag.ts               # API endpoints
│   ├── utils/
│   │   ├── chunking.ts          # Sliding-window text chunking
│   │   ├── logger.ts            # Logging (critical for debugging)
│   │   └── metrics.ts           # Track recall, precision
│   └── index.ts                 # Main server
├── .env                          # Secrets
├── package.json
└── tsconfig.json

TypeScript Configuration

{
  "compilerOptions": {
    "target": "ES2020",
    "module": "commonjs",
    "lib": ["ES2020"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true
  }
}

PART 2: DATABASE SETUP (PostgreSQL + pgvector)

1. Create PostgreSQL Database with pgvector

👨‍🦳 Uncle: This is your foundation. Get it wrong, everything breaks.

-- Connect to PostgreSQL first (run this from your shell, not inside psql):
--   psql -U postgres

-- Create database
CREATE DATABASE rag_system;

-- Connect to the database
\c rag_system

-- Install pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create tenants table first (multi-tenancy); the tables below reference it
CREATE TABLE tenants (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name VARCHAR(255) NOT NULL,
  api_key VARCHAR(255) UNIQUE NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create resumes table
CREATE TABLE resumes (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL REFERENCES tenants(id),
  candidate_name VARCHAR(255) NOT NULL,
  raw_text TEXT NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

  -- CRITICAL: tenant isolation
  CONSTRAINT tenant_isolation UNIQUE(tenant_id, id)
);

-- Create chunks table (where vectors live)
CREATE TABLE resume_chunks (
  id SERIAL PRIMARY KEY,
  resume_id UUID NOT NULL REFERENCES resumes(id) ON DELETE CASCADE,
  tenant_id UUID NOT NULL,
  chunk_text TEXT NOT NULL,
  chunk_index INTEGER NOT NULL,

  -- The vector: the dimension must match your embedding model's output (1536 here)
  embedding vector(1536) NOT NULL,

  -- Metadata for debugging
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

  -- CRITICAL: Always check tenant
  CONSTRAINT tenant_isolation_chunks 
    FOREIGN KEY (tenant_id) REFERENCES tenants(id)
);

-- Create indexes
-- 1. Vector index for fast search (MOST IMPORTANT)
-- Note: ivfflat picks its cluster centers when the index is built,
-- so build (or rebuild) it after the table has data.
CREATE INDEX idx_resume_chunks_embedding 
ON resume_chunks USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- 2. Tenant index (security)
CREATE INDEX idx_resume_chunks_tenant 
ON resume_chunks(tenant_id, resume_id);

-- 3. Text search index (keyword search)
CREATE INDEX idx_resume_chunks_text 
ON resume_chunks USING GIN (to_tsvector('english', chunk_text));

-- Create query logs (for metrics)
CREATE TABLE query_logs (
  id SERIAL PRIMARY KEY,
  tenant_id UUID NOT NULL REFERENCES tenants(id),
  query TEXT NOT NULL,
  latency_ms INTEGER NOT NULL,
  recall DECIMAL(3,2),
  precision DECIMAL(3,2),
  cost_cents INTEGER,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create index on query logs for analytics
CREATE INDEX idx_query_logs_tenant 
ON query_logs(tenant_id, created_at DESC);
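The `lists = 100` setting on the vector index is a starting point, not a constant. pgvector's documentation suggests roughly rows / 1000 lists for tables up to 1M rows and sqrt(rows) beyond that, with `ivfflat.probes` near sqrt(lists) at query time. A minimal sketch of that rule of thumb follows; the function names are hypothetical and not part of pgvector.

```typescript
// Hypothetical helpers encoding pgvector's suggested starting points.
// Treat the outputs as a first guess and tune against your own recall/latency.
export function suggestIvfflatLists(rowCount: number): number {
  if (rowCount <= 1_000_000) {
    // Up to 1M rows: about one list per 1000 rows (at least 1)
    return Math.max(1, Math.round(rowCount / 1000));
  }
  // Over 1M rows: about sqrt(rows)
  return Math.round(Math.sqrt(rowCount));
}

export function suggestIvfflatProbes(lists: number): number {
  // Query-time probes: about sqrt(lists) (at least 1)
  return Math.max(1, Math.round(Math.sqrt(lists)));
}
```

With about 100,000 chunks this lands on the same `lists = 100` used in the schema, and about 10 probes per query.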

2. Database Connection Service

👨‍🦳 Uncle: This is where your first failure point lives.

// src/config/database.ts

import pgPromise from 'pg-promise';
import dotenv from 'dotenv';
import logger from '../utils/logger';

dotenv.config();

const initOptions = {
  // Detailed error info (critical for debugging)
  error(error: any, context: any) {
    logger.error('Database Error', {
      error: error.message,
      query: context.query,
      params: context.params
    });
  },

  // Connection events
  connect(client: any) {
    logger.info('Database connected');
  },

  disconnect(client: any) {
    logger.info('Database disconnected');
  }
};

const pgp = pgPromise(initOptions);

const db = pgp({
  host: process.env.DB_HOST || 'localhost',
  port: parseInt(process.env.DB_PORT || '5432'),
  database: process.env.DB_NAME || 'rag_system',
  user: process.env.DB_USER || 'postgres',
  password: process.env.DB_PASSWORD,

  // Connection pooling
  max: 20,

  // Timeout after 5 seconds
  connectionTimeoutMillis: 5000,

  // Idle timeout
  idleTimeoutMillis: 30000,
});

// Test connection on startup
export async function initializeDatabase() {
  try {
    await db.one('SELECT 1');
    logger.info('✓ Database connection verified');
  } catch (error) {
    logger.error('✗ Database connection failed', { error });
    process.exit(1);
  }
}

export default db;

3. Environment Variables

# .env
DB_HOST=localhost
DB_PORT=5432
DB_NAME=rag_system
DB_USER=postgres
DB_PASSWORD=your_secure_password

ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022

NODE_ENV=production
PORT=3000

# Security
ADMIN_API_KEY=super-secret-key-change-this
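A missing variable in this file usually surfaces later as a confusing connection or 401 error. One option is to check for required variables at startup and fail fast. The sketch below assumes the variable names from the `.env` above; `missingEnvVars` and `assertEnv` are hypothetical helper names.

```typescript
// Returns the names of required variables that are absent or blank.
export function missingEnvVars(
  env: Record<string, string | undefined>,
  required: string[]
): string[] {
  return required.filter(name => {
    const value = env[name];
    return value === undefined || value.trim() === '';
  });
}

// Throws one error listing every missing variable, so a misconfigured
// deployment fails at startup instead of on the first request.
export function assertEnv(
  env: Record<string, string | undefined>,
  required: string[]
): void {
  const missing = missingEnvVars(env, required);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
}

// Example call at startup (after dotenv.config()):
// assertEnv(process.env, ['DB_PASSWORD', 'ANTHROPIC_API_KEY', 'ADMIN_API_KEY']);
```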

PART 3: EMBEDDING SERVICE

Create Embeddings with Claude

👨‍🦳 Uncle: This is where the first real cost happens. Know what can fail here.

// src/config/embedding.ts

import Anthropic from '@anthropic-ai/sdk';
import logger from '../utils/logger';

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});

interface EmbeddingResult {
  text: string;
  embedding: number[];
}

/**
 * Get embeddings for text chunks.
 * 
 * ⚠️ FAILURE POINTS:
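 * 1. API rate limit (429) - retried after a fixed 5 s wait (no backoff or attempt cap yet)
 * 2. Input too long - each text is truncated to ~4,000 tokens before the call
 * 3. Network timeout - no retry is added here beyond what the SDK does itself; the error is rethrown
 * 4. Cost tracking - logs the estimated input cost per call
 *
 * NOTE: Anthropic's API has no dedicated embeddings endpoint. Prompting a chat
 * model to print vectors, as below, does not yield stable, comparable embeddings,
 * and 1024 output tokens cannot hold even one 1536-dimensional vector. Treat this
 * call as a placeholder: route it to a dedicated embedding model and keep
 * vector(N) in the schema equal to that model's output dimension.
 */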
export async function getEmbeddings(texts: string[]): Promise<EmbeddingResult[]> {
  const startTime = Date.now();

  try {
    // VALIDATION: Prevent token overrun
    // Rough heuristic: ~1 token ≈ 4 characters of English text
    const validTexts = texts.map(text => {
      if (text.length > 16000) {  // ~4000 tokens
        logger.warn('Text truncated for embedding', { 
          originalLength: text.length,
          truncatedTo: 16000
        });
        return text.substring(0, 16000);
      }
      return text;
    });

    // Call Claude API for embeddings
    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      messages: [{
        role: 'user',
        content: `Generate embeddings for the following texts. Return ONLY valid JSON array with "embeddings" key containing array of number arrays.

Texts:
${validTexts.map((t, i) => `${i}: ${t}`).join('\n\n')}

Return format: {"embeddings": [[...], [...], ...]}`
      }]
    });

    // Parse response
    const responseText = response.content[0].type === 'text' 
      ? response.content[0].text 
      : '';

    let embeddings: number[][];
    try {
      const parsed = JSON.parse(responseText);
      embeddings = parsed.embeddings || [];
    } catch (parseError) {
      logger.error('Failed to parse embeddings response', { 
        response: responseText.substring(0, 500) 
      });
      throw new Error('Invalid embeddings response format');
    }

    // Validate embeddings
    if (embeddings.length !== validTexts.length) {
      throw new Error(
        `Embedding count mismatch: got ${embeddings.length}, expected ${validTexts.length}`
      );
    }

    // Calculate cost (Claude 3.5 Sonnet: $3 per 1M input tokens; output tokens are billed separately)
    const inputTokens = response.usage.input_tokens;
    const costCents = (inputTokens / 1_000_000) * 3 * 100;

    const latency = Date.now() - startTime;
    logger.info('Embeddings generated', { 
      count: embeddings.length,
      latency,
      inputTokens,
      costCents: costCents.toFixed(4)
    });

    return validTexts.map((text, i) => ({
      text,
      embedding: embeddings[i]
    }));

  } catch (error: any) {
    logger.error('Embedding API error', {
      error: error.message,
      status: error.status
    });

    // Retry on rate limits.
    // NOTE: fixed 5 s delay and no attempt cap, so a persistent 429 retries forever.
    // A real system should cap attempts and back off exponentially.
    if (error.status === 429) {
      logger.warn('Rate limited. Waiting before retry...');
      await new Promise(resolve => setTimeout(resolve, 5000));
      return getEmbeddings(texts);
    }

    throw error;
  }
}

/**
 * Embed a single text (convenience function)
 */
export async function embedText(text: string): Promise<number[]> {
  const results = await getEmbeddings([text]);
  return results[0].embedding;
}
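The catch block in `getEmbeddings` waits a fixed 5 seconds and retries without limit. A capped exponential backoff with jitter is the usual replacement. The sketch below is one way to write it; `backoffDelayMs` and `withBackoff` are hypothetical helpers, and it assumes errors carry a numeric `status` field as they do in the code above.

```typescript
// Delay ceiling for a given attempt: base * 2^attempt, capped.
export function backoffDelayMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 30_000
): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Runs fn, retrying only on HTTP 429, up to maxAttempts total tries.
export async function withBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts: number = 5,
  baseMs: number = 1000
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      const retryable = error?.status === 429;
      // Give up on non-retryable errors or once attempts are exhausted
      if (!retryable || attempt >= maxAttempts - 1) throw error;
      // Full jitter: wait a random time up to the exponential ceiling
      const delay = Math.random() * backoffDelayMs(attempt, baseMs);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

A caller would wrap the API request itself, for example `withBackoff(() => client.messages.create(params))`, rather than recursing into the whole function.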

Chunking Function

👨‍🦳 Uncle: Remember: 1000-1500 tokens, 200-token overlap.

// src/utils/chunking.ts

import logger from './logger';

interface Chunk {
  text: string;
  index: number;
  tokens: number;
}

/**
 * Break text into chunks with sliding window overlap.
 * 
 * ⚠️ FAILURE POINTS:
 * 1. Overlap larger than chunk size
 * 2. Single chunk can't hold meaningful text
 */
export function chunkText(
  text: string,
  windowTokens: number = 1000,
  overlapTokens: number = 200
): Chunk[] {
  // Simple tokenization (1 token ≈ 4 chars for English)
  const estimatedTokens = Math.ceil(text.length / 4);

  if (estimatedTokens < windowTokens) {
    // Text is smaller than chunk size
    logger.debug('Text smaller than chunk window', { 
      estimatedTokens,
      windowTokens 
    });
    return [{
      text,
      index: 0,
      tokens: estimatedTokens
    }];
  }

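  // Calculate character window (1 token ≈ 4 chars)
  const charWindow = windowTokens * 4;
  const charOverlap = overlapTokens * 4;
  const step = charWindow - charOverlap;

  // Guard failure point 1: a non-positive step would never advance the loop
  if (step <= 0) {
    throw new Error('overlapTokens must be smaller than windowTokens');
  }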

  const chunks: Chunk[] = [];
  let index = 0;

  for (let i = 0; i < text.length; i += step) {
    let end = i + charWindow;

    // Find sentence boundary to avoid splitting mid-sentence
    if (end < text.length) {
      const periodIndex = text.lastIndexOf('.', end);
      const newlineIndex = text.lastIndexOf('\n', end);
      const boundaryIndex = Math.max(periodIndex, newlineIndex);

      if (boundaryIndex > i + (charWindow * 0.8)) {
        // Found good boundary
        end = boundaryIndex + 1;
      }
    } else {
      end = text.length;
    }

    const chunk = text.substring(i, end).trim();

    if (chunk.length > 0) {
      chunks.push({
        text: chunk,
        index,
        tokens: Math.ceil(chunk.length / 4)
      });
      index++;
    }

    // Stop if we've reached the end
    if (end >= text.length) break;
  }

  logger.debug('Text chunked', {
    originalLength: text.length,
    chunkCount: chunks.length,
    avgChunkTokens: Math.round(
      chunks.reduce((sum, c) => sum + c.tokens, 0) / chunks.length
    )
  });

  return chunks;
}
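To sanity-check the chunker, it helps to predict how many chunks a document should produce. With the defaults, the window is 1000 × 4 = 4,000 characters and the step is (1000 − 200) × 4 = 3,200 characters, so a 10,000-character resume yields chunks starting at 0, 3,200 and 6,400: three chunks. The sketch below generalizes that arithmetic; `estimateChunkCount` is a hypothetical helper that assumes non-empty text and ignores whitespace-only tails.

```typescript
// Predicts the chunk count of a sliding window with overlap,
// using the same 1 token ≈ 4 chars estimate as the chunker.
export function estimateChunkCount(
  charLength: number,
  windowTokens: number = 1000,
  overlapTokens: number = 200
): number {
  const charWindow = windowTokens * 4;
  const step = (windowTokens - overlapTokens) * 4;
  if (step <= 0) {
    throw new Error('overlapTokens must be smaller than windowTokens');
  }
  // Text that fits in one window is a single chunk
  if (charLength <= charWindow) return 1;
  // One chunk for the first window, plus one per step needed to cover the rest
  return Math.ceil((charLength - charWindow) / step) + 1;
}
```

Comparing this estimate with `chunks.length` after an upload is a cheap way to catch a mis-set window or overlap.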

PART 4: RETRIEVAL SERVICE

Vector Search with Hybrid (Vector + Keyword)

👨‍🦳 Uncle: This is the heart. Where everything lives or dies.

// src/services/retrieval.ts

import db from '../config/database';
import { embedText } from '../config/embedding';
import logger from '../utils/logger';

interface RetrievalResult {
  chunkText: string;
  chunkIndex: number;
  vectorDistance: number;
  keywordScore: number;
  combinedScore: number;
}

/**
 * Retrieve relevant chunks using hybrid search.
 * 
 * ⚠️ FAILURE POINTS:
 * 1. Missing tenant_id check → DATA BREACH
 * 2. Vector index not built → Slow queries (10s+ instead of 100ms)
 * 3. Query too long → API error
 * 4. No results → Need to handle gracefully
 * 5. Typos in query → Keyword search might fail
 */
export async function hybridSearch(
  tenantId: string,
  resumeId: string,
  query: string,
  topK: number = 5
): Promise<RetrievalResult[]> {
  const startTime = Date.now();

  try {
    // Validate inputs
    if (!tenantId || !resumeId) {
      throw new Error('tenant_id and resume_id are required');
    }

    if (query.length === 0) {
      throw new Error('Query cannot be empty');
    }

    if (query.length > 500) {
      logger.warn('Query truncated', { originalLength: query.length });
      query = query.substring(0, 500);
    }

    // Step 1: Get query embedding
    logger.debug('Embedding query', { query });
    const queryEmbedding = await embedText(query);

    // Step 2: Vector search (fast)
    // Convert embedding to PostgreSQL format: [0.1, 0.2, ...]
    const embeddingString = `[${queryEmbedding.join(',')}]`;

    const vectorResults = await db.manyOrNone(`
      SELECT 
        chunk_text,
        chunk_index,
        embedding <=> $1::vector AS vector_distance
      FROM resume_chunks
      WHERE 
        tenant_id = $2
        AND resume_id = $3
      ORDER BY vector_distance ASC
      LIMIT $4
    `, [embeddingString, tenantId, resumeId, topK * 2]); // Get 2x to filter

    if (vectorResults.length === 0) {
      logger.warn('No vector results found', { query, resumeId });
      return [];
    }

    // Step 3: Keyword filter (precision)
    // Only keep chunks that also match the query
    const keywordResults = await db.manyOrNone(`
      SELECT 
        chunk_text,
        chunk_index,
        ts_rank(
          to_tsvector('english', chunk_text), 
          plainto_tsquery('english', $1)
        ) AS keyword_score
      FROM resume_chunks
      WHERE 
        tenant_id = $2
        AND resume_id = $3
        AND to_tsvector('english', chunk_text) @@ 
            plainto_tsquery('english', $1)
      ORDER BY keyword_score DESC
      LIMIT $4
    `, [query, tenantId, resumeId, topK]);

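    // Step 4: Combine results
    // Chunks that appear in both vector AND keyword search are best.
    // Rows come back with snake_case column names, so map them onto the
    // camelCase RetrievalResult shape that callers read.
    const combined: RetrievalResult[] = vectorResults
      .map(vr => {
        const kr = keywordResults.find(k => k.chunk_index === vr.chunk_index);
        const keywordScore = kr ? kr.keyword_score : 0;
        return {
          chunkText: vr.chunk_text,
          chunkIndex: vr.chunk_index,
          vectorDistance: vr.vector_distance,
          keywordScore,
          // Weighted score: 60% vector, 40% keyword.
          // ts_rank is not normalized to 0-1, so the split is approximate.
          combinedScore: (1 - vr.vector_distance) * 0.6 + keywordScore * 0.4
        };
      })
      .sort((a, b) => b.combinedScore - a.combinedScore)
      .slice(0, topK);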

    const latency = Date.now() - startTime;

    logger.info('Hybrid search complete', {
      query,
      resultsCount: combined.length,
      latency,
      vectorResultsCount: vectorResults.length,
      keywordResultsCount: keywordResults.length
    });

    // Log for metrics
    if (combined.length > 0) {
      await db.none(`
        INSERT INTO query_logs (tenant_id, query, latency_ms)
        VALUES ($1, $2, $3)
      `, [tenantId, query.substring(0, 255), latency]);
    }

    return combined as RetrievalResult[];

  } catch (error: any) {
    logger.error('Retrieval error', {
      error: error.message,
      query,
      resumeId,
      tenantId
    });
    throw error;
  }
}

/**
 * Multi-query retrieval - search with multiple variations.
 * 
 * Better recall, but slower and more expensive.
 */
export async function multiQueryRetrieval(
  tenantId: string,
  resumeId: string,
  queries: string[],
  topK: number = 5
): Promise<RetrievalResult[]> {
  try {
    const allResults: RetrievalResult[] = [];

    for (const query of queries) {
      const results = await hybridSearch(tenantId, resumeId, query, topK * 2);
      allResults.push(...results);
    }

    // Deduplicate by chunk text, keep highest score
    const unique = Array.from(
      allResults
        .reduce((map, item) => {
          const existing = map.get(item.chunkText);
          if (!existing || item.combinedScore > existing.combinedScore) {
            map.set(item.chunkText, item);
          }
          return map;
        }, new Map<string, RetrievalResult>())
        .values()
    );

    return unique
      .sort((a, b) => b.combinedScore - a.combinedScore)
      .slice(0, topK);

  } catch (error) {
    logger.error('Multi-query retrieval error', { error });
    throw error;
  }
}
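The 60/40 weighting in `hybridSearch` mixes cosine similarity with `ts_rank`, and the two are on different scales. Reciprocal rank fusion (RRF) is a common alternative that uses only each result's position in its list, so no score normalization is needed. The sketch below is not what the code above does; it assumes results are identified by `chunk_index` and uses the conventional constant k = 60.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item.
export function reciprocalRankFusion(
  rankedLists: number[][],
  k: number = 60
): Map<number, number> {
  const scores = new Map<number, number>();
  for (const list of rankedLists) {
    list.forEach((id, position) => {
      const rank = position + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}

// Returns ids ordered by fused score, best first.
export function fuseAndSort(rankedLists: number[][], k: number = 60): number[] {
  return Array.from(reciprocalRankFusion(rankedLists, k).entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

For example, `fuseAndSort([[3, 1, 2], [1, 4]])` puts chunk 1 first, because it is the only chunk present in both the vector list and the keyword list.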

PART 5: RERANKING SERVICE

👨‍🦳 Uncle: Two-stage is where quality happens. First stage is fast, second is accurate.

// src/services/reranking.ts

import Anthropic from '@anthropic-ai/sdk';
import logger from '../utils/logger';

interface RerankedResult {
  text: string;
  score: number;
  rank: number;
}

/**
 * Rerank chunks using Claude (more accurate but slower).
 * 
 * ⚠️ FAILURE POINTS:
 * 1. Claude API timeout (fix with timeout wrapper)
 * 2. Chunks too long (truncate before sending)
 * 3. Response parsing fails
 * 4. Cost explosion (reranking costs money - track it)
 */
export async function rerank(
  query: string,
  chunks: string[],
  topK: number = 5
): Promise<RerankedResult[]> {
  const startTime = Date.now();

  try {
    if (chunks.length === 0) {
      return [];
    }

    const client = new Anthropic({
      apiKey: process.env.ANTHROPIC_API_KEY,
      timeout: 30 * 1000, // 30 second timeout
    });

    // Truncate chunks to prevent token overflow
    const truncatedChunks = chunks.map(c => 
      c.length > 2000 ? c.substring(0, 2000) + '...' : c
    );

    // Build reranking prompt
    const chunksFormatted = truncatedChunks
      .map((chunk, i) => `[${i}] ${chunk}`)
      .join('\n\n---\n\n');

    const prompt = `You are a search relevance expert. Rank the following chunks by relevance to the query.

Query: "${query}"

Chunks to rank:
${chunksFormatted}

Return ONLY valid JSON with this format:
{
  "rankings": [
    {"index": 0, "relevance_score": 0.95},
    {"index": 1, "relevance_score": 0.72}
  ]
}

Relevance score: 0.0 (irrelevant) to 1.0 (highly relevant)
Sort by relevance_score descending.`;

    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      messages: [{
        role: 'user',
        content: prompt
      }]
    });

    // Parse response
    const responseText = response.content[0].type === 'text'
      ? response.content[0].text
      : '';

    let rankings: any[];
    try {
      // Extract JSON from response (might be wrapped in markdown)
      const jsonMatch = responseText.match(/\{[\s\S]*\}/);
      const jsonStr = jsonMatch ? jsonMatch[0] : responseText;
      const parsed = JSON.parse(jsonStr);
      rankings = parsed.rankings || [];
    } catch (parseError) {
      logger.error('Failed to parse reranking response', {
        response: responseText.substring(0, 500)
      });
      // Fallback: return original order
      return chunks.slice(0, topK).map((text, i) => ({
        text,
        score: 1.0 - (i * 0.1),
        rank: i + 1
      }));
    }

    // Convert to results
    const results = rankings
      .filter(r => r.index >= 0 && r.index < chunks.length)
      .map((r, rank) => ({
        text: chunks[r.index],
        score: r.relevance_score,
        rank: rank + 1
      }))
      .slice(0, topK);

    const latency = Date.now() - startTime;

    logger.info('Reranking complete', {
      query,
      inputCount: chunks.length,
      outputCount: results.length,
      latency,
      topScore: results[0]?.score
    });

    return results;

  } catch (error: any) {
    logger.error('Reranking error', {
      error: error.message,
      chunksCount: chunks.length,
      query: query.substring(0, 100)
    });

    // Fallback: return original order
    return chunks.slice(0, topK).map((text, i) => ({
      text,
      score: 1.0 - (i * 0.1),
      rank: i + 1
    }));
  }
}
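Failure point 3 (response parsing) is easier to test when the parsing lives in its own pure function. The sketch below pulls the JSON out of a reply, drops out-of-range or duplicate indices, clamps scores to 0-1, and sorts. `parseRankings` and the `Ranking` type are hypothetical names; the field names match the prompt's JSON format.

```typescript
export interface Ranking {
  index: number;
  relevance_score: number;
}

// Extracts and validates rankings from a model reply.
// Returns [] when nothing usable is found, so callers can fall back.
export function parseRankings(responseText: string, chunkCount: number): Ranking[] {
  const jsonMatch = responseText.match(/\{[\s\S]*\}/);
  if (!jsonMatch) return [];

  let parsed: unknown;
  try {
    parsed = JSON.parse(jsonMatch[0]);
  } catch {
    return [];
  }
  if (typeof parsed !== 'object' || parsed === null) return [];

  const raw = (parsed as { rankings?: unknown }).rankings;
  if (!Array.isArray(raw)) return [];

  const seen = new Set<number>();
  const valid: Ranking[] = [];
  for (const item of raw) {
    const index = item?.index;
    const score = item?.relevance_score;
    if (typeof index !== 'number' || !Number.isInteger(index)) continue;
    if (index < 0 || index >= chunkCount || seen.has(index)) continue;
    if (typeof score !== 'number' || !Number.isFinite(score)) continue;
    seen.add(index);
    // Clamp scores into the 0-1 range the prompt asks for
    valid.push({ index, relevance_score: Math.min(1, Math.max(0, score)) });
  }
  return valid.sort((a, b) => b.relevance_score - a.relevance_score);
}
```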

PART 6: QUERY PROCESSING

👨‍🦳 Uncle: Expand the query so you find more relevant chunks.

// src/services/queryProcessing.ts

import Anthropic from '@anthropic-ai/sdk';
import logger from '../utils/logger';

/**
 * Expand a query into related search terms.
 * 
 * ⚠️ FAILURE POINTS:
 * 1. LLM generates irrelevant expansions
 * 2. Original query lost in expansion
 * 3. Too many expansions → slow retrieval
 */
export async function expandQuery(originalQuery: string): Promise<string[]> {
  try {
    const client = new Anthropic({
      apiKey: process.env.ANTHROPIC_API_KEY
    });

    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 200,
      messages: [{
        role: 'user',
        content: `Given this query about a job candidate, generate 2-3 alternative phrasings or related concepts that would help find relevant information.

Original query: "${originalQuery}"

Return ONLY a JSON array of strings:
["alternative1", "alternative2", "alternative3"]

These should help find the same information using different keywords.`
      }]
    });

    const responseText = response.content[0].type === 'text'
      ? response.content[0].text
      : '';

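    let expansions: string[];
    try {
      // Extract the JSON array (the model may wrap it in markdown or prose)
      const arrayMatch = responseText.match(/\[[\s\S]*\]/);
      const parsed = JSON.parse(arrayMatch ? arrayMatch[0] : responseText);
      if (!Array.isArray(parsed)) {
        throw new Error('Expected a JSON array');
      }
      expansions = parsed.filter((q: unknown): q is string => typeof q === 'string');
    } catch (e) {
      logger.warn('Failed to parse query expansion', { response: responseText });
      return [originalQuery];
    }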

    // Always include original query
    const allQueries = [originalQuery, ...expansions].filter(Boolean);

    logger.debug('Query expanded', {
      original: originalQuery,
      expansions: allQueries.length
    });

    return allQueries;

  } catch (error) {
    logger.error('Query expansion error', { error });
    return [originalQuery]; // Fallback
  }
}

/**
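 * Normalize query (lowercase, collapse whitespace, strip punctuation).
 * This does not correct typos, and stripping punctuation also mangles
 * terms such as "C++" or "Node.js", so apply it with care.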
 */
export function normalizeQuery(query: string): string {
  return query
    .toLowerCase()
    .trim()
    // Remove extra spaces
    .replace(/\s+/g, ' ')
    // Remove special characters (keep alphanumeric and spaces)
    .replace(/[^\w\s]/g, '');
}
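Failure point 3 (too many expansions) and near-duplicate phrasings both cost extra embedding calls. One way to contain that is to deduplicate and cap the list before retrieval. This is a sketch; `mergeQueries` and its default cap of 4 are assumptions, not part of the code above.

```typescript
// Merges the original query with expansions: original first,
// case-insensitive dedupe, blank entries dropped, capped at maxQueries.
export function mergeQueries(
  original: string,
  expansions: string[],
  maxQueries: number = 4
): string[] {
  const seen = new Set<string>();
  const merged: string[] = [];

  for (const candidate of [original, ...expansions]) {
    const cleaned = candidate.trim().replace(/\s+/g, ' ');
    const key = cleaned.toLowerCase();
    if (cleaned.length === 0 || seen.has(key)) continue;
    seen.add(key);
    merged.push(cleaned);
    if (merged.length >= maxQueries) break;
  }
  return merged;
}
```

Because the original query is always processed first, it survives the cap even when the model returns a long list.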

PART 7: SAFETY & HALLUCINATION PREVENTION

👨‍🦳 Uncle: This is where you prevent the AI from lying. Critical.

// src/services/safety.ts

import Anthropic from '@anthropic-ai/sdk';
import logger from '../utils/logger';

interface SafeAnswer {
  answer: string;
  confidence: number;
  evidence: string[];
  isSafe: boolean;
  reason?: string;
}

/**
 * Get answer from AI with multiple safety layers.
 * 
 * ⚠️ FAILURE POINTS:
 * 1. AI answer not in JSON format
 * 2. Confidence calculation wrong
 * 3. Evidence doesn't exist in chunks
 * 4. Excessive cost for failed attempts
 */
export async function safeAnswer(
  query: string,
  chunks: string[],
  confidenceThreshold: number = 0.7
): Promise<SafeAnswer> {
  const startTime = Date.now();

  try {
    if (chunks.length === 0) {
      return {
        answer: 'No relevant information found.',
        confidence: 0,
        evidence: [],
        isSafe: false,
        reason: 'No source chunks provided'
      };
    }

    const client = new Anthropic({
      apiKey: process.env.ANTHROPIC_API_KEY
    });

    // Layer 1: Retrieval boundaries
    // Show ONLY the chunks, nothing from training
    const chunksText = chunks
      .map((c, i) => `[Chunk ${i}]\n${c}`)
      .join('\n\n---\n\n');

    const prompt = `You are evaluating a candidate resume based on specific chunks.

INSTRUCTIONS:
1. Answer ONLY based on the provided chunks
2. Do NOT use any knowledge from training data
3. If information is not in chunks, say "Unknown"
4. Always cite which chunk supports your answer
5. Return valid JSON ONLY - no other text

Query: "${query}"

Chunks provided:
${chunksText}

Return JSON in this exact format:
{
  "answer": "your answer here",
  "confidence": 0.0 to 1.0,
  "evidence_chunks": [0, 1, 2],
  "explanation": "why you're confident"
}`;

    // Layer 2: Structured output
    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 500,
      messages: [{
        role: 'user',
        content: prompt
      }]
    });

    const responseText = response.content[0].type === 'text'
      ? response.content[0].text
      : '';

    // Parse response
    let parsed: any;
    try {
      const jsonMatch = responseText.match(/\{[\s\S]*\}/);
      const jsonStr = jsonMatch ? jsonMatch[0] : responseText;
      parsed = JSON.parse(jsonStr);
    } catch (e) {
      logger.error('Failed to parse safety response', {
        response: responseText.substring(0, 300)
      });
      return {
        answer: 'Error processing answer',
        confidence: 0,
        evidence: [],
        isSafe: false,
        reason: 'Invalid response format'
      };
    }

    // Layer 3: Validation
    // Check evidence chunks actually exist
    const validEvidenceIndices = (parsed.evidence_chunks || [])
      .filter((i: number) => i >= 0 && i < chunks.length);

    if (validEvidenceIndices.length === 0 && parsed.answer !== 'Unknown') {
      logger.warn('No valid evidence for answer', { 
        answer: parsed.answer,
        requestedIndices: parsed.evidence_chunks,
        chunksCount: chunks.length
      });
    }

    const evidence = validEvidenceIndices.map((i: number) => chunks[i]);

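    // Layer 4: Confidence gating
    // An answer must clear the threshold AND cite at least one real chunk
    // (an explicit "Unknown" needs no evidence).
    const hasEvidence = evidence.length > 0 || parsed.answer === 'Unknown';
    const isConfident = typeof parsed.confidence === 'number'
      && parsed.confidence >= confidenceThreshold;
    const isSafe = isConfident && hasEvidence;

    if (!isSafe) {
      logger.warn('Answer failed safety gate', {
        answer: parsed.answer,
        confidence: parsed.confidence,
        threshold: confidenceThreshold,
        hasEvidence
      });
    }

    const latency = Date.now() - startTime;

    logger.info('Safe answer generated', {
      query: query.substring(0, 50),
      confidence: parsed.confidence,
      isSafe,
      latency,
      evidenceCount: evidence.length
    });

    return {
      answer: parsed.answer,
      confidence: parsed.confidence,
      evidence,
      isSafe,
      reason: isSafe
        ? 'Confident'
        : (isConfident ? 'No supporting evidence' : 'Low confidence')
    };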

  } catch (error: any) {
    logger.error('Safety check error', { error: error.message });
    return {
      answer: 'Error',
      confidence: 0,
      evidence: [],
      isSafe: false,
      reason: error.message
    };
  }
}

/**
 * Validate that answer is faithful to evidence.
 * Post-check: does answer match the chunks?
 */
export async function validateFaithfulness(
  answer: string,
  evidence: string[],
  threshold: number = 0.8
): Promise<{ isFaithful: boolean; score: number }> {
  try {
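    // Simple check: are key terms from answer in evidence?
    // Only terms longer than 3 characters count, in both the numerator and
    // the denominator, so short words do not drag the score down.
    const answerTerms = answer
      .toLowerCase()
      .split(/\s+/)
      .map(term => term.replace(/[^\w]/g, ''))
      .filter(term => term.length > 3);
    const evidenceText = evidence.join(' ').toLowerCase();

    const matchedTerms = answerTerms.filter(term => evidenceText.includes(term));

    const score = answerTerms.length > 0 
      ? matchedTerms.length / answerTerms.length 
      : 0;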

    return {
      isFaithful: score >= threshold,
      score
    };

  } catch (error) {
    logger.error('Faithfulness validation error', { error });
    return { isFaithful: false, score: 0 };
  }
}
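Failure points 1 to 3 all come down to trusting the shape of the model's JSON. A small validation step between `JSON.parse` and the confidence gate makes that explicit. The sketch below assumes the field names from the prompt (`answer`, `confidence`, `evidence_chunks`); `coerceModelAnswer` and `ModelAnswer` are hypothetical names.

```typescript
export interface ModelAnswer {
  answer: string;
  confidence: number;
  evidenceChunks: number[];
}

// Validates the parsed model output. Returns null when the shape is wrong,
// otherwise a cleaned copy with confidence clamped to 0-1 and only
// evidence indices that point at real chunks.
export function coerceModelAnswer(parsed: unknown, chunkCount: number): ModelAnswer | null {
  if (typeof parsed !== 'object' || parsed === null) return null;
  const record = parsed as Record<string, unknown>;

  const answer = record.answer;
  const confidence = record.confidence;
  if (typeof answer !== 'string') return null;
  if (typeof confidence !== 'number' || !Number.isFinite(confidence)) return null;

  const rawEvidence = record.evidence_chunks;
  const evidenceList: unknown[] = Array.isArray(rawEvidence) ? rawEvidence : [];
  const evidenceChunks = evidenceList.filter(
    (i: unknown): i is number =>
      typeof i === 'number' && Number.isInteger(i) && i >= 0 && i < chunkCount
  );

  return {
    answer,
    confidence: Math.min(1, Math.max(0, confidence)),
    evidenceChunks
  };
}
```

A `null` result maps naturally onto the existing "Invalid response format" branch.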

PART 8: API ROUTES & ENDPOINTS

👨‍🦳 Uncle: This is what the client calls. Make it robust.

// src/routes/rag.ts

import { Router, Request, Response, NextFunction } from 'express';
import db from '../config/database';
import { hybridSearch, multiQueryRetrieval } from '../services/retrieval';
import { rerank } from '../services/reranking';
import { expandQuery } from '../services/queryProcessing';
import { safeAnswer, validateFaithfulness } from '../services/safety';
import { chunkText } from '../utils/chunking';
import { getEmbeddings } from '../config/embedding';
import logger from '../utils/logger';

const router = Router();

// Middleware: Check authentication
function authMiddleware(req: Request, res: Response, next: NextFunction) {
  const apiKey = req.headers['x-api-key'] as string;

  if (!apiKey) {
    return res.status(401).json({ error: 'Missing API key' });
  }

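  // In production, look the key up in tenants.api_key and derive tenantId from it,
  // rather than trusting the tenantId sent in the request body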
  if (apiKey !== process.env.ADMIN_API_KEY) {
    return res.status(401).json({ error: 'Invalid API key' });
  }

  next();
}

router.use(authMiddleware);

/**
 * Upload and process a resume.
 * POST /rag/upload
 */
router.post('/upload', async (req: Request, res: Response) => {
  try {
    const { tenantId, candidateName, resumeText } = req.body;

    if (!tenantId || !candidateName || !resumeText) {
      return res.status(400).json({ 
        error: 'Missing required fields: tenantId, candidateName, resumeText' 
      });
    }

    // Step 1: Save resume
    const resumeResult = await db.one(`
      INSERT INTO resumes (tenant_id, candidate_name, raw_text)
      VALUES ($1, $2, $3)
      RETURNING id
    `, [tenantId, candidateName, resumeText]);

    const resumeId = resumeResult.id;

    // Step 2: Chunk the resume
    const chunks = chunkText(resumeText, 1000, 200);
    logger.info('Resume chunked', { resumeId, chunkCount: chunks.length });

    // Step 3: Get embeddings for all chunks
    const chunkTexts = chunks.map(c => c.text);
    const embeddingResults = await getEmbeddings(chunkTexts);

    // Step 4: Save chunks with embeddings
    // One transaction: either every chunk is stored or none are
    await db.tx(async t => {
      for (let i = 0; i < chunks.length; i++) {
        const chunk = chunks[i];
        const embedding = embeddingResults[i].embedding;
        const embeddingArray = `[${embedding.join(',')}]`;

        await t.none(`
          INSERT INTO resume_chunks 
          (resume_id, tenant_id, chunk_text, chunk_index, embedding)
          VALUES ($1, $2, $3, $4, $5::vector)
        `, [resumeId, tenantId, chunk.text, chunk.index, embeddingArray]);
      }
    });

    logger.info('Resume uploaded successfully', { resumeId, chunkCount: chunks.length });

    res.json({
      success: true,
      resumeId,
      chunkCount: chunks.length,
      message: `Resume for ${candidateName} processed successfully`
    });

  } catch (error: any) {
    logger.error('Upload error', { error: error.message });
    res.status(500).json({ error: error.message });
  }
});

/**
 * Query a resume.
 * POST /rag/query
 */
router.post('/query', async (req: Request, res: Response) => {
  try {
    const { tenantId, resumeId, question, useExpansion = false } = req.body;

    if (!tenantId || !resumeId || !question) {
      return res.status(400).json({
        error: 'Missing required fields: tenantId, resumeId, question'
      });
    }

    const startTime = Date.now();

    // Step 1: Expand query if requested
    let queries = [question];
    if (useExpansion) {
      queries = await expandQuery(question);
      logger.debug('Query expanded', { count: queries.length });
    }

    // Step 2: Retrieve chunks (multi-query if expanded)
    const retrieved = useExpansion
      ? await multiQueryRetrieval(tenantId, resumeId, queries, 10)
      : await hybridSearch(tenantId, resumeId, question, 10);

    if (retrieved.length === 0) {
      return res.json({
        answer: 'No relevant information found in resume.',
        confidence: 0,
        evidence: [],
        isSafe: false,
        latency: Date.now() - startTime
      });
    }

    // Step 3: Rerank for accuracy
    const chunks = retrieved.map(r => r.chunkText);
    const reranked = await rerank(question, chunks, 5);
    const topChunks = reranked.map(r => r.text);

    // Step 4: Get safe answer with evidence
    const safeAns = await safeAnswer(question, topChunks, 0.7);

    // Step 5: Validate faithfulness (optional)
    const faithfulness = await validateFaithfulness(safeAns.answer, safeAns.evidence);

    const latency = Date.now() - startTime;

    res.json({
      success: true,
      answer: safeAns.answer,
      confidence: safeAns.confidence,
      evidence: safeAns.evidence,
      isSafe: safeAns.isSafe,
      faithfulness: faithfulness.score,
      latency,
      chunksRetrieved: retrieved.length,
      chunksReranked: reranked.length
    });

  } catch (error: any) {
    logger.error('Query error', { error: error.message });
    res.status(500).json({ error: error.message });
  }
});

/**
 * Get metrics for a tenant.
 * GET /rag/metrics/:tenantId
 */
router.get('/metrics/:tenantId', async (req: Request, res: Response) => {
  try {
    const { tenantId } = req.params;

    // Query logs aggregation
    const metrics = await db.one(`
      SELECT 
        COUNT(*) as query_count,
        AVG(latency_ms) as avg_latency,
        MAX(latency_ms) as max_latency,
        MIN(latency_ms) as min_latency,
        AVG(recall) as avg_recall,
        AVG(precision) as avg_precision,
        SUM(cost_cents) / 100.0 as total_cost_dollars
      FROM query_logs
      WHERE tenant_id = $1
    `, [tenantId]);

    res.json({
      success: true,
      metrics
    });

  } catch (error: any) {
    logger.error('Metrics error', { error: error.message });
    res.status(500).json({ error: error.message });
  }
});

export default router;
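The metrics endpoint aggregates rows from `query_logs`, but the `/query` handler shown here never writes to that table. A minimal sketch of how a row might be recorded, assuming the column names used in the metrics query; `QueryLogEntry`, `Db`, and `logQuery` are hypothetical names, and `Db` only mirrors the `none(sql, params)` method that pg-promise exposes.

```typescript
// Hypothetical shape of one query_logs row (columns taken from the metrics query).
interface QueryLogEntry {
  tenantId: string;
  latencyMs: number;
  recall: number | null;     // null when the query has no labeled evaluation
  precision: number | null;
  costCents: number;
}

// Minimal subset of a pg-promise database object, so a fake can stand in for tests.
interface Db {
  none(sql: string, params: unknown[]): Promise<unknown>;
}

async function logQuery(db: Db, entry: QueryLogEntry): Promise<void> {
  if (!entry.tenantId) throw new Error('tenantId is required');
  // Parameterized insert: values are never concatenated into the SQL string.
  await db.none(
    `INSERT INTO query_logs (tenant_id, latency_ms, recall, precision, cost_cents)
     VALUES ($1, $2, $3, $4, $5)`,
    [entry.tenantId, entry.latencyMs, entry.recall, entry.precision, entry.costCents]
  );
}
```

In the handler this would probably be called right after `latency` is computed, ideally without blocking the response if logging fails.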

PART 9: MAIN SERVER

// src/index.ts

import express from 'express';
import cors from 'cors';
import compression from 'compression';
import helmet from 'helmet';
import dotenv from 'dotenv';
import ragRoutes from './routes/rag';
import { initializeDatabase } from './config/database';
import logger from './utils/logger';

dotenv.config();

const app = express();
const PORT = process.env.PORT || 3000;

// Middleware
app.use(helmet()); // Security headers
app.use(compression()); // Compress responses
app.use(cors());
app.use(express.json());

// Health check
app.get('/health', (req, res) => {
  res.json({ status: 'ok', timestamp: new Date().toISOString() });
});

// RAG routes
app.use('/rag', ragRoutes);

// Error handler
app.use((err: any, req: express.Request, res: express.Response, next: express.NextFunction) => {
  logger.error('Unhandled error', { error: err.message });
  res.status(500).json({ error: 'Internal server error' });
});

// Start server
async function start() {
  try {
    // Initialize database
    await initializeDatabase();

    app.listen(PORT, () => {
      logger.info(`Server running on port ${PORT}`);
    });
  } catch (error) {
    logger.error('Failed to start server', { error });
    process.exit(1);
  }
}

start();

Logging Service

// src/utils/logger.ts

import winston from 'winston';

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    // Console output (enabled in every environment as configured here)
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.printf(({ timestamp, level, message, ...meta }) => {
          return `${timestamp} [${level}] ${message} ${
            Object.keys(meta).length ? JSON.stringify(meta, null, 2) : ''
          }`;
        })
      )
    }),
    // File output: errors only, then a combined log of everything
    new winston.transports.File({ 
      filename: 'logs/error.log',
      level: 'error'
    }),
    new winston.transports.File({ 
      filename: 'logs/combined.log'
    })
  ]
});

export default logger;

PART 10: FAILURE POINTS & SOLUTIONS

Summary Table

Failure Point	Symptom	Root Cause	Solution
Missing tenant_id	Data leak between companies	No isolation check	Add WHERE tenant_id = X to EVERY query
Vector index missing	Queries take 10+ seconds	Sequential scan of 500K vectors	Create IVFFLAT index on embedding column
Query too long	API error 4096 tokens exceeded	Question >16000 chars	Truncate queries to 500 chars
No results	Empty array returned	Chunks don't exist or embeddings wrong	Check if chunks were saved, verify vector distance threshold
Hallucination	AI invents information	No retrieval boundaries in prompt	Use safety layers (5 layers as described)
Rate limit (429)	API call fails	Too many requests to Claude	Implement exponential backoff, queue requests
Database connection lost	"Cannot connect to server"	Network issue, DB down, wrong credentials	Add retry logic, connection pooling, health checks
Embedding dimension mismatch	"Vector dimension 1536 != 768"	Using different embedding models	Use the same embedding model for indexing and querying; size the vector column to match its output dimension
Memory overload	Node.js crashes	Trying to embed entire 100MB file	Chunk before embedding, process in batches
Cost explosion	Unexpected $10k bill	Each embedding/rerank/answer costs money	Track costs, log them, set spending limits
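For the "Memory overload" row, the fix (chunk first, then process in batches) can be sketched as follows. This is only an illustration: `toBatches` and `embedInBatches` are hypothetical helpers, and `embedBatch` is a placeholder callback for whatever embedding call the project actually uses.

```typescript
// Split an array into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed chunks one batch at a time so only a single batch is in flight.
// `embedBatch` stands in for the real embedding call (an assumption).
async function embedInBatches(
  chunks: string[],
  batchSize: number,
  embedBatch: (texts: string[]) => Promise<number[][]>
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(chunks, batchSize)) {
    const result = await embedBatch(batch);
    if (result.length !== batch.length) {
      throw new Error('embedding count does not match batch size');
    }
    for (const vector of result) vectors.push(vector);
  }
  return vectors;
}
```

Processing batches sequentially trades some throughput for a predictable memory ceiling, and it also keeps request bursts smaller, which helps with the rate-limit row.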

PART 11: PROS AND CONS OF RAG

PROS ✓

1. Accuracy

✓ AI answers from evidence, not training data
✓ Far fewer hallucinations (with proper safety layers)
✓ Can trace every answer to source

Without RAG: "Does John know Docker?" → Guess → "maybe, looks like it"
With RAG: "Does John know Docker?" → Evidence → "Yes. His resume says: 'Docker, Kubernetes, 4 years'"

2. Up-to-Date

✓ Works with information from last month
✓ No retraining needed
✓ Can add 1000 new resumes per day

3. Explainability

✓ Users see why the system answered
✓ Can audit decisions
✓ Legally defensible

4. Cost

✓ Cheaper than retraining models
✓ Only pay for what you use
✓ PostgreSQL is free (open source)

5. Flexibility

✓ Can use with ANY language
✓ Works with specialized domains
✓ Easy to update documents

CONS ✗

1. Complexity

✗ More moving parts: embeddings, database, retrieval, reranking, safety
✗ More things to break
✗ More things to debug

Single LLM: Simple
RAG: Embeddings → Vector DB → Retrieval → Reranking → Safety checks
Each stage can fail independently

2. Retrieval Quality

✗ If wrong chunks retrieved, answer is wrong
✗ Embeddings are fuzzy (not perfect matches)
✗ Rare topics may be poorly represented in the embedding model's training data

Example problem:

Resume: "Worked with distributed ledger technology"
Search: "blockchain"
Result: Miss (DLT ≠ blockchain in embeddings)
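One common mitigation for this kind of miss is to expand the query with domain synonyms before searching. A minimal sketch, assuming a hand-curated synonym table; the table contents and the `expandWithSynonyms` helper are made up for illustration and are separate from the `expandQuery` function used in the route.

```typescript
// Illustrative synonym table; a real one would be curated for the domain.
const SYNONYMS: Record<string, string[]> = {
  blockchain: ['distributed ledger technology', 'DLT'],
  k8s: ['kubernetes'],
};

// Return the original query plus one variant per synonym of each matched term.
function expandWithSynonyms(
  query: string,
  table: Record<string, string[]> = SYNONYMS
): string[] {
  const variants = new Set<string>([query]);
  for (const term of Object.keys(table)) {
    // Whole-word, case-insensitive match: "Blockchain" matches, "blockchains" does not.
    // Terms are assumed to contain no regex metacharacters.
    const pattern = new RegExp(`\\b${term}\\b`, 'i');
    if (!pattern.test(query)) continue;
    for (const alt of table[term]) {
      variants.add(query.replace(pattern, alt));
    }
  }
  return Array.from(variants);
}
```

Each variant can then be searched separately and the results merged, in the same spirit as the multi-query retrieval path.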

3. Cost of Operations

✗ Each embedding: $0.00001
✗ Each rerank: $0.0001
✗ Each LLM answer: $0.001
✗ Adds up fast with scale

100K queries/month ≈ $111/month just for operations
(Not including infrastructure, salaries, etc)

4. Latency

✗ Vector search: 100ms
✗ Reranking: 300-500ms
✗ Safety check: 200ms
✗ Total: 600-800ms

User expects <200ms. RAG adds delay.
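Before optimizing, it helps to know which stage actually eats the time. A small sketch of a per-stage timer; `timeStage` is a hypothetical helper, not part of the code shown earlier.

```typescript
// Run one pipeline stage and record how long it took, in milliseconds.
async function timeStage<T>(
  name: string,
  timings: Record<string, number>,
  stage: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    return await stage();
  } finally {
    // Recorded even when the stage throws, so slow failures show up too.
    timings[name] = Date.now() - start;
  }
}

// Possible use inside the /query handler (names from the route above):
// const timings: Record<string, number> = {};
// const reranked = await timeStage('rerank', timings, () => rerank(question, chunks, 5));
// logger.debug('Stage timings', timings);
```

Logging the `timings` object next to the total latency makes it easier to see whether search, reranking, or the safety check is the bottleneck for a given tenant.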

5. Data Quality Matters

✗ If source documents are bad, answers are bad
✗ Garbage in = garbage out
✗ No amount of engineering fixes bad data

Resume full of typos: "React" → "Rreact" → Embeddings confused
→ System can't find React skills
→ Answer is wrong

6. Scaling is Hard

✗ Vector index maintenance complex
✗ Multi-tenancy requires isolation at every level
✗ Database queries become slow with 10M+ chunks

When to Use RAG

Scenario	Use RAG?	Why
Customer support	YES	Up-to-date, explainable, fewer hallucinations
Medical diagnosis	YES	Safety-critical, needs evidence
Resume screening	YES	Domain-specific, needs accuracy
General chatbot	NO	Training data sufficient, latency matters
Quick facts	NO	Simple lookup is faster
Creative writing	NO	Hallucinations are features, not bugs
Code search	MAYBE	Depends on code freshness
Legal documents	YES	Must cite sources, no mistakes

When NOT to Use RAG

1. Simple lookup → Use database
2. Conversational → Use base LLM
3. Speed critical → Too slow (600ms+)
4. Data quality poor → Garbage in/out
5. Training data sufficient → No value-add
6. Cost-sensitive → Each query costs money

PART 12: COST ANALYSIS

Breakdown of Costs

Per-Query Costs (approximate):

1. Embedding query:
   - 50 tokens @ $0.000003/token = $0.00015

2. Vector search + keyword filter:
   - Database operation ≈ $0 (hosted: ~$0.00001)

3. Reranking (Claude):
   - 500 tokens input + 100 output @ $0.003/$0.015 per 1K tokens = $0.0030

4. Final answer:
   - 500 tokens input + 200 output @ $0.003/$0.015 per 1K tokens = $0.0045

Total per query: ≈ $0.0077 (~0.77 cents)

At scale:
- 1K queries/month ≈ $7.70
- 100K queries/month ≈ $770
- 1M queries/month ≈ $7,700
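Estimates like this are easy to get wrong by hand, so it can be worth recomputing them from the rates. A minimal sketch, assuming the `$0.003/$0.015` figures are dollars per 1,000 input/output tokens and using the token counts listed in the breakdown; the `Rates` type and function names are hypothetical, and the prices are this article's approximations, not current list prices.

```typescript
// Approximate rates from the breakdown (assumption: LLM prices are per 1,000 tokens).
interface Rates {
  embeddingPerToken: number; // dollars per embedded token
  inputPer1K: number;        // dollars per 1,000 LLM input tokens
  outputPer1K: number;       // dollars per 1,000 LLM output tokens
  dbPerQuery: number;        // hosted database cost per query
}

const RATES: Rates = {
  embeddingPerToken: 0.000003,
  inputPer1K: 0.003,
  outputPer1K: 0.015,
  dbPerQuery: 0.00001,
};

// Cost of a single LLM call given its token counts.
function llmCallCost(inputTokens: number, outputTokens: number, r: Rates): number {
  return (inputTokens / 1000) * r.inputPer1K + (outputTokens / 1000) * r.outputPer1K;
}

// One query = embed the question + database search + rerank call + answer call.
function costPerQuery(r: Rates = RATES): number {
  const embedding = 50 * r.embeddingPerToken;
  const rerank = llmCallCost(500, 100, r);
  const answer = llmCallCost(500, 200, r);
  return embedding + r.dbPerQuery + rerank + answer;
}

function monthlyCost(queriesPerMonth: number, r: Rates = RATES): number {
  return queriesPerMonth * costPerQuery(r);
}
```

Keeping the rates in one object also makes it simple to re-run the numbers when a price or a prompt size changes.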

How to Reduce Costs

1. Cache results
   - Same question asked twice? Serve it from a Redis cache
   - Saves: up to 80% when questions repeat often

2. Batch processing
   - Don't embed one by one
   - Embed 100 at a time
   - Saves: 20% (batch discounts)

3. Smart reranking
   - Only rerank if needed
   - Don't rerank for obvious matches
   - Saves: 30-50%

4. Cheaper models
   - Use Claude Haiku for simple tasks
   - Claude Opus only for complex reasoning
   - Saves: up to 80% on those calls
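The caching idea can be sketched without Redis first. The class below is an in-memory stand-in with a time-to-live, meant only to show the shape of the logic; `AnswerCache` and its key format are assumptions, and a multi-instance deployment would need a shared store such as Redis instead.

```typescript
// In-memory stand-in for a Redis cache: same idea, single process only.
class AnswerCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  // `now` is injectable so expiry can be tested without waiting.
  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  // Key on tenant, resume, and a normalized question so trivial
  // differences in case or spacing still hit the cache.
  static key(tenantId: string, resumeId: string, question: string): string {
    const normalized = question.trim().toLowerCase().replace(/\s+/g, ' ');
    return `${tenantId}:${resumeId}:${normalized}`;
  }

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.store.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

Including `tenantId` in the key matters: a cache shared across tenants without it would reintroduce the data leak the rest of the system works to prevent.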

PART 13: DEPLOYMENT & PRODUCTION

Docker Deployment

# Dockerfile

FROM node:18-alpine

WORKDIR /app

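COPY package*.json ./
# Install runtime dependencies only (no devDependencies)
RUN npm ci --omit=dev

# Assumes the TypeScript build has already produced dist/ on the host
COPY dist ./dist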

EXPOSE 3000

CMD ["node", "dist/index.js"]

Docker Compose (Local Dev)

# docker-compose.yml

version: '3.8'
services:
  postgres:
    # pgvector is a PostgreSQL extension, not a separate service;
    # this image ships Postgres 15 with the extension available
    image: pgvector/pgvector:pg15
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: rag_system
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  app:
    build: .
    environment:
      DB_HOST: postgres
      DB_USER: postgres
      DB_PASSWORD: postgres
      DB_NAME: rag_system
      ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY}
    ports:
      - "3000:3000"
    depends_on:
      - postgres

volumes:
  postgres_data:

Deployment Checklist

[ ] Environment variables set correctly
[ ] Database indexes created
[ ] API key securely stored (e.g. AWS Secrets Manager)
[ ] Logging configured (CloudWatch or similar)
[ ] Monitoring set up (CPU, memory, latency)
[ ] Rate limiting enabled
[ ] Database backups enabled
[ ] Error alerts configured
[ ] Cost monitoring enabled
[ ] Security scan passed
[ ] Load test passed (1000 req/sec)
[ ] Graceful shutdown implemented
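The last checklist item is not covered by the server code shown earlier. A minimal sketch of graceful shutdown, assuming the `http.Server` returned by `app.listen` is kept in a variable; `ClosableServer` and `shutdownGracefully` are hypothetical names.

```typescript
// Minimal view of what http.Server (returned by app.listen) offers for shutdown.
interface ClosableServer {
  close(callback: (err?: Error) => void): unknown;
}

// Stop accepting new connections, wait for in-flight requests to finish,
// and give up after timeoutMs so a stuck connection cannot block the exit.
function shutdownGracefully(
  server: ClosableServer,
  timeoutMs: number
): Promise<'closed' | 'timeout'> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve('timeout'), timeoutMs);
    server.close(() => {
      clearTimeout(timer);
      resolve('closed');
    });
  });
}

// Wiring it up (assumes `const server = app.listen(PORT, ...)` in index.ts):
// process.on('SIGTERM', async () => {
//   const outcome = await shutdownGracefully(server, 10000);
//   process.exit(outcome === 'closed' ? 0 : 1);
// });
```

Container platforms typically send SIGTERM before killing a process, so the timeout should stay below the platform's own grace period. The database pool should be closed in the same handler.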

PART 14: EXAMPLE QUERIES

Complete End-to-End Example

# 1. Upload a resume
curl -X POST http://localhost:3000/rag/upload \
  -H "Content-Type: application/json" \
  -H "X-API-Key: super-secret-key" \
  -d '{
    "tenantId": "company-1",
    "candidateName": "John Doe",
    "resumeText": "John has 5 years React experience, built e-commerce platforms with Node.js..."
  }'

# Response:
# {
#   "success": true,
#   "resumeId": "uuid-123",
#   "chunkCount": 8,
#   "message": "Resume for John Doe processed successfully"
# }

# 2. Query the resume
curl -X POST http://localhost:3000/rag/query \
  -H "Content-Type: application/json" \
  -H "X-API-Key: super-secret-key" \
  -d '{
    "tenantId": "company-1",
    "resumeId": "uuid-123",
    "question": "Does John have React experience?",
    "useExpansion": true
  }'

# Response:
# {
#   "success": true,
#   "answer": "Yes, John has 5 years of React experience.",
#   "confidence": 0.95,
#   "evidence": ["John has 5 years React experience, built e-commerce..."],
#   "isSafe": true,
#   "faithfulness": 0.92,
#   "latency": 720,
#   "chunksRetrieved": 10,
#   "chunksReranked": 5
# }

# 3. Get metrics
curl -X GET http://localhost:3000/rag/metrics/company-1 \
  -H "X-API-Key: super-secret-key"

# Response:
# {
#   "success": true,
#   "metrics": {
#     "query_count": 42,
#     "avg_latency": 680,
#     "max_latency": 1200,
#     "avg_recall": 0.91,
#     "avg_precision": 0.88,
#     "total_cost_dollars": 0.32
#   }
# }

FINAL CHECKLIST: BEFORE YOU SHIP

Functionality

[ ] Embeddings working (embedding API call succeeds)
[ ] PostgreSQL storing vectors correctly
[ ] Hybrid search returning results
[ ] Reranking improving quality
[ ] Safety layers preventing hallucinations
[ ] Multi-tenancy isolated (tenant1 can't see tenant2)

Performance

[ ] Vector search <200ms
[ ] Full query <1000ms
[ ] No N+1 queries
[ ] Database queries indexed
[ ] Caching working

Security

[ ] API keys not logged
[ ] Tenant isolation verified (manual test)
[ ] SQL injection prevented (parameterized queries)
[ ] Rate limiting enabled
[ ] HTTPS enforced

Operations

[ ] Logging configured
[ ] Error alerts set up
[ ] Cost tracking enabled
[ ] Database backups tested
[ ] Graceful error handling

Monitoring

[ ] Query latency tracked
[ ] Recall/precision measured
[ ] Cost per query logged
[ ] Error rate tracked
[ ] Uptime monitored

Quick Reference: Common Issues & Fixes

Issue: Queries Take 10+ Seconds

Cause: Missing vector index
Fix:

CREATE INDEX ON resume_chunks 
USING ivfflat (embedding vector_cosine_ops) 
WITH (lists = 100);

Issue: Same Query Returns Different Results

Cause: Stale embeddings or non-deterministic reranking
Fix: Always embed with the same model, and run LLM reranking at temperature 0 (the Anthropic API exposes no random seed, so small variations can remain)

Issue: AI Answers with Wrong Information

Cause: Insufficient safety layers
Fix: Add confidence thresholding, require citations, validate faithfulness
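Two of these fixes, confidence thresholding and required citations, can be checked in plain code before an answer is returned. A rough sketch, assuming the answer step yields a confidence score and a list of quoted evidence; `DraftAnswer` and `gateAnswer` are hypothetical, and the 0.7 default simply mirrors the threshold passed to `safeAnswer` in the route.

```typescript
interface DraftAnswer {
  answer: string;
  confidence: number; // 0..1, as reported by the model or a scoring step
  evidence: string[]; // quotes the answer claims to rely on
}

// Accept an answer only if it is confident enough and every cited quote
// literally appears (ignoring case and spacing) in a retrieved chunk.
function gateAnswer(
  draft: DraftAnswer,
  chunks: string[],
  minConfidence = 0.7
): { ok: boolean; reason: string } {
  if (draft.confidence < minConfidence) return { ok: false, reason: 'low confidence' };
  if (draft.evidence.length === 0) return { ok: false, reason: 'no citations' };

  const normalize = (s: string) => s.toLowerCase().replace(/\s+/g, ' ').trim();
  const haystack = chunks.map(normalize);

  for (const quote of draft.evidence) {
    const needle = normalize(quote);
    const found = haystack.some((chunk) => chunk.indexOf(needle) !== -1);
    if (!found) return { ok: false, reason: 'citation not found in retrieved chunks' };
  }
  return { ok: true, reason: 'ok' };
}
```

A literal substring check is deliberately strict: it rejects paraphrased "quotes", which is usually the safer failure mode for this kind of system.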

Issue: "Tenant1 sees Tenant2 data"

Cause: Missing tenant_id check in WHERE clause
Fix: Add WHERE tenant_id = $X to EVERY query
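"Every query" is easy to say and easy to forget, so some teams route all reads through one helper that refuses SQL without the tenant filter. A crude sketch of that idea; `tenantQuery` is hypothetical, the regex check is only a safety net, and `Db` mirrors just the `any(sql, params)` method of pg-promise.

```typescript
// Minimal subset of a pg-promise database object.
interface Db {
  any(sql: string, params: unknown[]): Promise<unknown[]>;
}

// Crude guard: the tenant id is always bound as $1, and the SQL text
// must contain a `tenant_id = $1` filter or the call is rejected.
async function tenantQuery(
  db: Db,
  tenantId: string,
  sql: string,
  params: unknown[] = []
): Promise<unknown[]> {
  if (!tenantId) throw new Error('tenantId is required');
  if (!/\btenant_id\s*=\s*\$1\b/i.test(sql)) {
    throw new Error('query must filter on tenant_id = $1');
  }
  return db.any(sql, [tenantId, ...params]);
}

// Possible use (the remaining parameters start at $2):
// await tenantQuery(db, tenantId,
//   'SELECT chunk_text FROM resume_chunks WHERE tenant_id = $1 AND resume_id = $2',
//   [resumeId]);
```

A text check like this cannot prove a query is safe. PostgreSQL row-level security is a stronger option when the database itself should enforce isolation.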

Issue: API calls get rate limited (429)

Cause: Too many rapid requests to Claude
Fix: Implement exponential backoff, queue requests, batch operations
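Exponential backoff can be sketched in a few lines. This version assumes the thrown error exposes the HTTP status as `status`; `withBackoff` and its defaults are illustrative, and the SDK in use may already retry some errors on its own, in which case this wrapper would sit on top of that.

```typescript
// Retry `fn` when it fails with HTTP 429, doubling the delay each attempt
// and adding random jitter so parallel callers do not retry in lockstep.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseDelayMs = 500,
  sleep: (ms: number) => Promise<void> =
    (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // Assumption: rate-limit errors carry `status === 429`.
      const retryable = err?.status === 429;
      if (!retryable || attempt >= maxRetries) throw err;
      const delay = baseDelayMs * Math.pow(2, attempt);
      const jitter = Math.random() * delay * 0.5;
      await sleep(delay + jitter);
    }
  }
}

// Possible use: const reranked = await withBackoff(() => rerank(question, chunks, 5));
```

Only 429 is retried here; other errors are rethrown immediately, since retrying a bad request or an auth failure just wastes time and money.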

Issue: Cost explodes unexpectedly

Cause: No cost tracking, inefficient queries, excessive reranking
Fix: Log cost per operation, implement budgets, use cheaper models for easy tasks

The Bottom Line

👨‍🦳 Uncle's Final Word:

RAG is powerful but complex. Every layer serves a purpose:

Embeddings = understand meaning
Vector DB = store and search fast
Retrieval = find relevant information
Reranking = improve quality
Safety = prevent lies
Operations = track costs and metrics

You don't need all of this on day one. Start simple:

Day 1: PostgreSQL + embeddings + basic search
Week 1: Add reranking
Month 1: Add safety layers
Month 3: Add monitoring and optimization

Each layer buys you something. Know what you're buying.

👦 Nephew: When should I NOT use RAG?

👨‍🦳 Uncle: When: