surajrkhonde

Posted on Jun 20

RAG Pipeline: The Uncle-Nephew Complete Learning Guide

#ai #rag #llm #programming

How to Build Systems That Actually Know Your Data (Not Hallucinate About It)

Introduction: The Story Begins

👦 Nephew: Uncle, I keep hearing "RAG this, RAG that" in tech interviews. When I ask what it means, people throw around words like "Retrieval-Augmented Generation" and I just nod like I understand. But honestly? I'm lost.

👨‍🦳 Uncle: (laughing) That's the best honest question I've heard all week. Let me ask you something first. If I gave you a question right now - "What year did India win the World Cup?" - how would you answer?

👦 Nephew: Well... I'd pull up Google, search for it, read the answer, then tell you.

👨‍🦳 Uncle: Exactly. You don't answer from memory alone. You go fetch the information first, then answer based on what you found. That's RAG in real life. And that simple idea - fetch first, answer after - fixes almost every problem we face with AI today.

👦 Nephew: But uncle, AI can remember things from its training. Why does it need to fetch?

👨‍🦳 Uncle: Ah! That's where we land in trouble. Come, sit...

SECTION 1: RAG FUNDAMENTALS - The Core Concept

The Problem We're Actually Solving

👨‍🦳 Uncle: Imagine you're hiring for a tech company. You receive 500 resumes for a Senior React Developer role. Now tell me - how would you actually process them?

👦 Nephew: I'd... probably make a spreadsheet? List all the candidates with key skills?

👨‍🦳 Uncle: Right. But here's the catch - you can't read all 500 resumes deeply. So what do you really do?

👦 Nephew: Skim for keywords like "React", "JavaScript", "5 years"?

👨‍🦳 Uncle: Exactly. You skim and hope you don't miss anyone good. Now, here's the problem: what if a candidate wrote "React.js" instead of "React"? Your eyes would still catch it. But a dumb computer comparing each skill to the exact string "React"? It says "no match".

What if someone wrote "Built real-time user interfaces with JSX, hooks, and Redux"? The candidate clearly knows React, but the word "React" appears nowhere in that sentence. The computer misses them.

This is exactly what happens with AI. When you ask an AI a question, it tries to answer purely from what it memorized during training. And it makes mistakes - sometimes big ones.

👦 Nephew: So RAG stops these mistakes?

👨‍🦳 Uncle: Precisely. RAG is simple: instead of asking the AI to guess, you hand it the document first. You say, "Here's the resume, here's the job description - now answer my question."

The AI stops guessing. It reads the evidence. It answers correctly.

What is RAG Really?

👦 Nephew: Okay, so RAG = give the AI documents, then ask questions?

👨‍🦳 Uncle: Yes, but we need to be precise about how we give it documents. RAG has three steps:

Retrieval - Find the right documents
Augmentation - Add those documents to the AI's prompt
Generation - Let AI answer based on what it read

That's it. R-A-G.
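
The three steps can be sketched in a few lines of Python. This is a toy illustration, not a real system: retrieval here is naive word overlap (the later sections replace it with embeddings), and `fake_llm` is a hypothetical stand-in for a real model call. All function names are made up for this sketch.

```python
import re


def words(text):
    """Lowercase a text and split it into words, dropping punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, documents, top_k=2):
    """Retrieval: rank documents by how many words they share with the question."""
    q_words = words(question)
    ranked = sorted(documents, key=lambda doc: len(q_words & words(doc)), reverse=True)
    return ranked[:top_k]


def augment(question, retrieved):
    """Augmentation: place the retrieved text in the prompt, ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Based only on the following excerpts:\n{context}\n\nQuestion: {question}"


def generate(prompt, llm):
    """Generation: hand the augmented prompt to whatever model you use."""
    return llm(prompt)


def fake_llm(prompt):
    # Hypothetical stand-in so the sketch runs without an API key.
    return "(the model's answer would go here)"


documents = [
    "John: Docker and Kubernetes, 4 years",
    "Priya: Solidity and Ethereum smart contracts",
    "Amit: React and Redux, 6 years",
]
question = "Does John know Docker?"
retrieved = retrieve(question, documents)
prompt = augment(question, retrieved)
answer = generate(prompt, fake_llm)
```

The model only ever sees `prompt`, so whatever `retrieve` misses, the model cannot use. That is why the rest of this guide spends so much time on retrieval.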

👦 Nephew: But if it's that simple, why is everyone talking about it like it's complicated?

👨‍🦳 Uncle: Because "finding the right documents" is the hard part. You have 500 resumes. When someone asks "Does John have Docker experience?", you can't search all 500 linearly. That's slow.

And you can't use simple keyword search either - what if John wrote "containerization" instead of "Docker"? What if he wrote "I work with Kubernetes" - which means he knows Docker too?

So the real question becomes: How do you find the right documents fast, and how do you understand when two different words mean the same thing?

This is where everything else - embeddings, vector databases, chunking - comes in. They're all in service of solving that one problem.

👦 Nephew: Okay, I think I get the big picture. But uncle, why can't we just make the AI smarter?

👨‍🦳 Uncle: Two reasons. First - you train an AI once. After that, it doesn't learn new information. If your company has 100 internal policies created last month, the AI knows nothing about them. It can't learn them instantly.

Second - even if you could retrain it, hallucinations would still happen. AIs are pattern-matching machines. They're brilliant at patterns. But they sometimes see patterns that aren't there. RAG forces the AI to cite its sources, to point at evidence.

It's the difference between:

"I think John knows Docker" (guess)
"John knows Docker because his resume clearly says 'Docker, Kubernetes, 4 years'" (evidence)

The second one is what you want. That's RAG.

SECTION 2: EMBEDDINGS - Understanding Meaning

The Fundamental Problem with Words

👦 Nephew: Uncle, I have a question. How does a computer know that "React" and "React.js" are the same thing?

👨‍🦳 Uncle: That's the right question. Let me explain with a story.

I go to a fruit market. I see apples, mangoes, oranges. How do I know which is which? I look at them - color, shape, smell. My brain recognizes patterns and says "that's a mango".

Now, how does a computer do that? It doesn't have eyes. All it has are numbers.

👦 Nephew: Numbers?

👨‍🦳 Uncle: Yes. Here's the trick: every word - "React", "React.js", "JavaScript", "Python" - is just a number to a computer. Or actually, a list of numbers. We call this list an "embedding".

👦 Nephew: Like a list of... what kind of numbers?

👨‍🦳 Uncle: Imagine I'm describing a person to you. I might say:

Height: 180 cm
Age: 30
Skin tone: medium
Hair color: black

That's 4 numbers describing 1 person. Now imagine I use 1536 numbers instead. Each number describes a different quality - not just physical things, but hidden things like "how technical is this word", "is this related to web development", "how often is this used", and so on.

So "React" becomes a point in a 1536-dimensional space. "React.js" becomes another point in that same space. And because they mean the same thing, those two points are very close to each other.

But "React" and "Python"? Those points are far apart.

👦 Nephew: Okay, so closeness = similarity?

👨‍🦳 Uncle: Exactly. And here's the beautiful part: you don't have to manually create these embeddings. An AI model does it for you. You feed it the word "React" and it says, "This word should be represented as [0.12, -0.45, 0.78, ..., 1536 numbers]".

Different models might represent it differently, but the same model always represents similar concepts closely. That's what matters.
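
"Closeness" is usually measured with cosine similarity. Here is a minimal sketch using made-up 3-number vectors; real embeddings come from a model and have hundreds or thousands of numbers, and the values below are invented purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy "embeddings", not the output of any real model.
toy_embeddings = {
    "React":    [0.90, 0.80, 0.10],
    "React.js": [0.88, 0.82, 0.12],
    "Python":   [0.10, 0.30, 0.95],
}

close = cosine_similarity(toy_embeddings["React"], toy_embeddings["React.js"])
far = cosine_similarity(toy_embeddings["React"], toy_embeddings["Python"])

print(f"React vs React.js: {close:.3f}")
print(f"React vs Python:   {far:.3f}")
```

The first score comes out much higher than the second, which is all a search engine needs in order to rank "React.js" ahead of "Python" for a query about React.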

Why This Is Genius

👦 Nephew: So embeddings let computers understand meaning?

👨‍🦳 Uncle: They let computers represent meaning as a position in space. And when you represent things as positions, you can do math with them.

For example, if I tell you: "Engineer" - "Java" + "React" = ?

👦 Nephew: That's... weird math?

👨‍🦳 Uncle: Right? But with embeddings, you can actually do this. And you know what the answer is?

With a well-trained model, the vector closest to that result tends to be something like "web developer". It doesn't work perfectly every time, but the math captures a surprising amount of meaning.

Now here's why this matters for RAG:

When someone asks "Does John have Docker experience?", we convert that question to an embedding - a point in 1536D space. Then we search for resume chunks that are close to that point. The chunks mentioning Docker - whether they say "Docker", "containerization", "container orchestration", or "I deploy with Docker" - are all close to the question's embedding.

So we find them all. And the AI reads them. And the AI answers correctly.

Without embeddings, simple keyword search would miss half of them.

SECTION 3: VECTOR DATABASES - Storing at Scale

The Storage Problem Nobody Talks About

👦 Nephew: Okay, so we have embeddings. But uncle, if I have a company with 10,000 resumes, and each resume has 50 chunks, that's 500,000 embeddings. Each embedding is 1536 numbers. How do I store and search that?

👨‍🦳 Uncle: This is where most people choose poorly. They think: "I'll use Pinecone! I'll use Milvus! I'll use Weaviate!"

They add three systems. One for storage, one for search, one for cache. Now they have 3 things that could break. 3 things to debug. 3 bills to pay.

There's a better way.

👦 Nephew: What?

👨‍🦳 Uncle: PostgreSQL.

👦 Nephew: The database? Just... PostgreSQL?

👨‍🦳 Uncle: With one addition: the pgvector extension. PostgreSQL is already reliable, it's already running your data, it's already scaling. We just add vector support.

Think about it this way: 500,000 vectors stored in PostgreSQL as data. When you have a question, you convert it to an embedding, and you say:

SELECT * FROM resume_chunks 
ORDER BY embedding <=> question_embedding 
LIMIT 5

The <=> operator computes the cosine distance between two vectors. Ordering by it and taking LIMIT 5 means "find the 5 vectors closest to my question vector". PostgreSQL can find them in milliseconds. You're done.

No new infrastructure. No vendor lock-in. Just one database doing one job well.
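
In application code, `question_embedding` is a query parameter rather than something typed into the SQL. A minimal sketch, assuming a driver that uses `%s` placeholders (psycopg does) and relying on pgvector's text format for vectors, `[x,y,z]`. The helper names are hypothetical.

```python
def to_pgvector_literal(embedding):
    """Format a list of numbers in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(str(float(x)) for x in embedding) + "]"


def build_nearest_chunks_query(question_embedding, limit=5):
    """Return (sql, params) for a parameterized nearest-neighbour query."""
    sql = (
        "SELECT id, chunk_text "
        "FROM resume_chunks "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    return sql, (to_pgvector_literal(question_embedding), limit)


# A 3-number vector keeps the example short; the table above expects 1536.
sql, params = build_nearest_chunks_query([0.12, -0.45, 0.78], limit=5)

# With a database cursor you would then run:
#   cursor.execute(sql, params)
```

Passing the vector as a parameter keeps the SQL text constant and avoids building queries by string concatenation.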

👦 Nephew: But doesn't the <=> operator need an index to be fast?

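👨‍🦳 Uncle: Good catch. Yes. Without an index, pgvector compares your question against every row: exact, but slow at scale. It does not create an index for you, so you add one yourself. pgvector offers two index types, IVFFlat and HNSW; we'll use IVFFlat here. Think of it like this:

When you have 500,000 chunks and need to find the 5 closest to a question, searching all 500,000 linearly is slow. The index divides the 1536D space into regions (like neighborhoods). When you search, it says "The question's embedding is closest to this neighborhood" and only searches that neighborhood. Much faster. The trade-off: the search becomes approximate, so it can occasionally miss a true neighbor sitting just across a neighborhood border.

A linear search of 500K vectors can take seconds. With the index, it is typically tens of milliseconds, depending on your hardware and settings.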

That's the magic of the right index.
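
The neighborhood idea can be shown with a toy version in plain Python. This is a sketch of the concept only, not pgvector's implementation: the 2-D vectors are invented, and the centroids are hand-picked here, whereas a real IVFFlat index learns them with k-means. The `probes` argument mirrors pgvector's `ivfflat.probes` setting (how many neighborhoods to scan).

```python
import math


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_lists(vectors, centroids):
    """Assign every vector to its nearest centroid (its 'neighborhood')."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: distance(vec, centroids[i]))
        lists[nearest].append(vec_id)
    return lists


def search(query, vectors, centroids, lists, top_k=2, probes=1):
    """Scan only the `probes` neighborhoods whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: distance(query, centroids[i]))
    candidates = [vec_id for i in order[:probes] for vec_id in lists[i]]
    candidates.sort(key=lambda vec_id: distance(query, vectors[vec_id]))
    return candidates[:top_k]


# Toy data: three container-related points and two frontend-related points.
vectors = {
    "docker": (0.90, 0.10), "kubernetes": (0.80, 0.20), "containers": (0.85, 0.15),
    "react": (0.10, 0.90), "vue": (0.20, 0.80),
}
centroids = [(0.85, 0.15), (0.15, 0.85)]
lists = build_lists(vectors, centroids)

nearest = search((0.88, 0.12), vectors, centroids, lists, top_k=2)
```

With `probes=1` the search never looks at "react" or "vue" at all. Raising `probes` scans more neighborhoods: slower, but less likely to miss a true neighbor.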

Setting It Up

👦 Nephew: How do I actually create this?

👨‍🦳 Uncle: Simple. PostgreSQL + pgvector. Here's what you do:

-- Create extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table with vector column
CREATE TABLE resume_chunks (
  id SERIAL PRIMARY KEY,
  resume_id UUID,
  chunk_text TEXT,
  embedding vector(1536),
  created_at TIMESTAMP DEFAULT NOW()
);

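-- Create index for fast search. Build it after loading your data.
-- pgvector's docs suggest starting with lists = rows / 1000 (for up to 1M rows).
CREATE INDEX ON resume_chunks USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);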

Now you have a table. You insert chunks with their embeddings. You query with the <=> operator. Done.

👦 Nephew: That's it?

👨‍🦳 Uncle: That's it. No API calls to external services. No paying per query. Just a single, reliable database.

This is the beauty of choosing the right tool.

SECTION 4: BASIC RETRIEVAL - Finding What You Need

How Retrieval Works

👦 Nephew: Uncle, we have embeddings, we have a database. How does actual retrieval happen?

👨‍🦳 Uncle: Let me walk you through a real scenario. A hiring manager asks: "Does Priya have blockchain experience?"

Step 1: Understand the question
We need to find chunks about blockchain skills.

Step 2: Convert question to embedding
"blockchain experience" becomes a vector: [0.34, -0.12, ..., 1536 numbers]

Step 3: Search the database
We query:

SELECT chunk_text,
       1 - (embedding <=> question_embedding) AS similarity
FROM resume_chunks 
WHERE resume_id = 'priya-uuid'
ORDER BY embedding <=> question_embedding
LIMIT 5

Step 4: Get results (might be):

"Ethereum smart contracts development"
"Bitcoin protocol understanding"
"Web3 and DeFi platform expertise"
"Solidity programming language"
"Cryptocurrency platform design"

Step 5: Send to AI
We put these 5 chunks in the prompt:

Based on the following resume excerpts:
[chunk 1]
[chunk 2]
[chunk 3]
[chunk 4]
[chunk 5]

Question: Does Priya have blockchain experience?

Step 6: AI answers
"Yes, Priya has strong blockchain experience. She has worked with Ethereum smart contracts, understands Bitcoin protocol, and has built on Web3/DeFi platforms using Solidity."

With evidence.

Why Basic Retrieval Fails Sometimes

👦 Nephew: This seems perfect. What's the problem?

👨‍🦳 Uncle: The problem is subtle. Suppose a resume has this line:

"I worked on distributed ledger technology and consensus protocols"

This person knows blockchain. But the word "blockchain" is not there.

You search for "blockchain". Embeddings are smart, but not magic. "Distributed ledger technology" and "blockchain" are close but not identical.

Sometimes the vector search returns this chunk. Sometimes it doesn't.

Basic retrieval works maybe 70% of the time. For production, you need 95%+.

That's why we need advanced techniques.

👦 Nephew: So what makes it better?

👨‍🦳 Uncle: Multiple techniques working together. But we'll get there. First, let's talk about preparation.

SECTION 5: CHUNKING STRATEGIES - Smart Breaking

The Chunking Dilemma

👦 Nephew: Uncle, let me ask something. Why do we break resumes into chunks at all? Why not just embed the whole resume?

👨‍🦳 Uncle: Good question. Two reasons:

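Cost and limits - Embedding models accept only a limited number of tokens per input, so a 50-page resume usually can't be embedded in one piece. And embedding is typically priced per token, so chunking doesn't make the embedding itself cheaper. The saving comes later: you send the LLM only the few relevant chunks instead of all 50 pages, so each question costs far less to answer.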
Precision - If you embed the entire resume as one chunk, and you search for "Docker experience", the system returns the whole 50-page resume. The AI has to find the Docker mention itself. But if you break it into chunks, the system returns only the chunks about Docker. The AI reads less noise.

But here's the problem: how small should chunks be?

Size Matters More Than You Think

👦 Nephew: Can't I just pick a number? Like 500 tokens per chunk?

👨‍🦳 Uncle: Let's see what happens with different sizes.

Attempt 1: Too small (a handful of tokens, exaggerated here to make the point)
Chunk 1: "John has"
Chunk 2: "5 years of"
Chunk 3: "React experience"

Problem: Every chunk has lost its context. Read Chunk 2 alone: 5 years of what? Read Chunk 3 alone: whose React experience, and how much? The AI is confused.

Attempt 2: Right size (around 1000 tokens; only one sentence shown here)
"John has 5 years of React experience, built e-commerce platforms, used Redux for state management, and mentored junior developers."

Perfect. Context is clear.

Attempt 3: Too large (5000 tokens)
Entire work history section, with all jobs, all skills, everything mixed together.

Problem: Too much noise. When the AI searches for "React", it gets 5000 tokens to read, but only 2 sentences mention React. It's slow and confusing.

👨‍🦳 Uncle: For resumes, the sweet spot is usually around 1000-1500 tokens per chunk. Treat that as a starting point and tune it against your own test questions.

Overlap is Important

👦 Nephew: What's overlap? Overlapping chunks?

👨‍🦳 Uncle: Yes. Imagine this:

Without overlap:
Chunk 1 ends: "...John used React..."
Chunk 2 starts: "...for building e-commerce..."

When you read Chunk 2 alone, you don't know John used React for this. Context is broken.

With 200-token overlap:
Chunk 1: "...John has React experience. He built e-commerce..."
Chunk 2: "...He built e-commerce with Redux and WebSockets..."

Now each chunk contains enough context. You can read them individually and still understand.

👦 Nephew: So overlap = context continuity?

👨‍🦳 Uncle: Exactly. It's like reading a book. Each page overlaps the previous one slightly. You always have context.

Chunking Strategies

👦 Nephew: Are there different ways to chunk?

👨‍🦳 Uncle: Yes. Here are the main ones:

Strategy	Chunk Size	Overlap	Best For	Trade-off
Naive/Fixed	512 tokens	None	Simple code	Loses context
Sliding Window	1000-1500	200	Resumes, documents	Balanced
Semantic	Variable	Variable	Technical docs	Complex to implement
Recursive	1000 first, then smaller	200	Large documents	More processing

For resumes? Sliding window, 1000 tokens, 200-token overlap. It's what works best.

Here's example code:

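def sliding_window_chunk(text, window_size=1000, overlap=200):
    """
    Break text into chunks with overlap.
    window_size = tokens per chunk
    overlap = tokens of overlap between chunks
    """
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")

    # Simplified tokenization: whitespace-separated words stand in for model tokens
    tokens = text.split()
    chunks = []
    step = window_size - overlap  # how far the window advances each time

    for i in range(0, len(tokens), step):
        chunk = tokens[i:i + window_size]
        chunks.append(" ".join(chunk))

        # Stop once this window has reached the end of the text
        if i + window_size >= len(tokens):
            break

    return chunks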

# Example
resume_text = "John has 5 years React... [long text]"
chunks = sliding_window_chunk(resume_text, 1000, 200)
# chunks[0] = "John has 5 years React... [1000 tokens]"
# chunks[1] = "[last 200 tokens of chunk 0] ... [next 800 tokens]"

Simple. Effective. Production systems use the same idea, with a real tokenizer in place of split().
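
The "Recursive" row in the table can be sketched too. This is a minimal illustration under stated assumptions: word count stands in for tokens, blank lines mark paragraphs, sentence endings are detected with a simple regex, and overlap is left out to keep it short. The function name is hypothetical.

```python
import re


def recursive_chunk(text, max_words=1000):
    """Split by paragraph first; split oversized paragraphs by sentence, then by words."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph.split()) <= max_words:
            chunks.append(paragraph)  # small enough: keep the paragraph whole
            continue
        # Paragraph too long: pack whole sentences into chunks of up to max_words.
        current = []
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            sentence_words = sentence.split()
            if current and len(current) + len(sentence_words) > max_words:
                chunks.append(" ".join(current))
                current = []
            # A single sentence longer than max_words is cut by word count.
            while len(sentence_words) > max_words:
                chunks.append(" ".join(sentence_words[:max_words]))
                sentence_words = sentence_words[max_words:]
            current.extend(sentence_words)
        if current:
            chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight nine.\n\nShort para."
small_chunks = recursive_chunk(sample, max_words=4)
```

Compared with the sliding window, this keeps paragraph and sentence boundaries intact, at the price of uneven chunk sizes.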

SECTION 6: HYBRID SEARCH - Vector + Keywords

The Limitation of Pure Vector Search

👦 Nephew: Uncle, embeddings are smart. But are they perfect?

👨‍🦳 Uncle: No. Here's a real problem:

Resume has: "Kubernetes 1.25"
You search for: "Kubernetes 1.26"

Vector search thinks:

"Kubernetes 1.26" is a version
"Kubernetes 1.25" is also a version
They're similar technologies
Similarity: 0.85 (similar!)

FALSE MATCH! The versions are different, but vectors say they're similar.

👦 Nephew: So vectors miss exact matches?

👨‍🦳 Uncle: Not miss - they're fuzzy. They blur differences. Sometimes that's good (finding "React" when you search "React.js"). Sometimes it's bad (confusing version numbers).

This is why you need both: vectors AND keywords.

Keyword Search: The Simple Part

👦 Nephew: How does keyword search work?

👨‍🦳 Uncle: Simple. Exact string matching.

Resume: "Kubernetes 1.26"
Search: "1.26"
Result: EXACT MATCH. Yes.

Resume: "Kubernetes 1.25"
Search: "1.26"
Result: NO MATCH. No.

It's binary. No fuzzy. Just right or wrong.

Hybrid = Best of Both

👦 Nephew: So we use both?

👨‍🦳 Uncle: Exactly. Here's how:

Search "Kubernetes 1.26"

Step 1: Vector search on all chunks

Returns top 20 chunks that mention versions/container orchestration

Step 2: Keyword filter

On those 20 chunks, filter by exact keyword match: "1.26"
Returns maybe 3 chunks with exact version

Step 3: Combine and rerank

Return the 3 high-precision results

Result: High precision. You get both semantic understanding AND exact matching.

PostgreSQL can do this:

SELECT chunk_text 
FROM resume_chunks 
WHERE resume_id = 'john-uuid'
  AND to_tsvector(chunk_text) @@ plainto_tsquery('Kubernetes 1.26')
  AND (embedding <=> question_embedding) < 0.2
ORDER BY embedding <=> question_embedding
LIMIT 5

This query says:

Find chunks where text contains "Kubernetes" AND "1.26" (keyword search)
AND the embedding is close to the question (vector search)
Order by vector similarity

Both working together.
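
The same two-step logic (vector shortlist, then exact keyword filter) can be sketched in memory. This is an illustration only: the 2-D embeddings are invented, the chunk layout with `text` and `embedding` keys is an assumption, and plain substring matching is cruder than PostgreSQL's full-text search.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, so smaller means closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def hybrid_search(query_vec, required_terms, chunks, candidates=20, top_k=5):
    """Step 1: shortlist the nearest chunks. Step 2: keep those containing every term."""
    by_distance = sorted(chunks, key=lambda c: cosine_distance(query_vec, c["embedding"]))
    shortlist = by_distance[:candidates]
    exact = [
        c for c in shortlist
        if all(term.lower() in c["text"].lower() for term in required_terms)
    ]
    return exact[:top_k]


# Invented toy data: two near-identical Kubernetes chunks and one unrelated chunk.
chunks = [
    {"text": "Ran Kubernetes 1.26 clusters in production", "embedding": [0.90, 0.10]},
    {"text": "Upgraded Kubernetes 1.25 nodes", "embedding": [0.89, 0.11]},
    {"text": "Built React dashboards", "embedding": [0.10, 0.90]},
]
results = hybrid_search([0.90, 0.10], ["Kubernetes", "1.26"], chunks)
```

The vector step alone would rank the 1.25 chunk almost as high as the 1.26 chunk. The keyword step is what removes it.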

When Each Excels

👦 Nephew: When do I use pure vector?

👨‍🦳 Uncle: When you're searching for concepts:

"Tell me about this candidate's experience with modern development"
"What's their background in cloud systems?"

Vectors are great here. They find related concepts even with different words.

👦 Nephew: And pure keyword?

👨‍🦳 Uncle: For exact facts:

"What certifications does John have?"
"What's their GPA?"
"When did they graduate?"

Keywords are perfect. You don't need fuzzy matching.

👦 Nephew: And hybrid?

👨‍🦳 Uncle: For most real scenarios. Maximum accuracy. That's what production uses.

SECTION 7: RERANKING - Ranking Results Better

The Quality Problem

👦 Nephew: Uncle, we retrieve 5 chunks. But are they the best 5?

👨‍🦳 Uncle: That's a great question. Let me show you a problem:

Search: "Docker experience"

Vector search returns:

"I have 5 years Docker production experience" (similarity: 0.92)
"Container orchestration with Docker expertise" (similarity: 0.91)
"My Docker projects all failed miserably" (similarity: 0.90) ← PROBLEM!
"Docker mentioned in course material" (similarity: 0.85)
"Web dev includes Docker basics" (similarity: 0.82)

Items 3-5 shouldn't be in the top 5! Item 3 is negative, items 4-5 are weak.

Basic vector search ranks by similarity score. But similarity doesn't capture quality.

👦 Nephew: So we need a better ranker?

👨‍🦳 Uncle: Yes. And here's the trick: we use two models.

Two-Stage Retrieval

👦 Nephew: Two models? That sounds expensive.

👨‍🦳 Uncle: Exactly. So we use them smartly.

Stage 1: Fast retrieval (milliseconds)
Vector search returns 20 candidates. Speed matters here, so we use a fast model.

Time: ~100ms
Quality: ~70%

Stage 2: Accurate ranking (a few hundred milliseconds)
We rerank those 20 with a better, slower model.

Time: 300-500ms
Quality: ~95%

Total time: roughly 400-600ms. Still fast. But now the top 5 are good.

The key insight: We don't apply the accurate model to all 500,000 chunks. That's too slow. We apply it only to 20. That's feasible.

Available Rerankers

👦 Nephew: What models should I use?

👨‍🦳 Uncle: A few options:

Reranker	Source	Quality	Cost
BGE-Reranker-Base	Open source	Very Good	Free
Cohere Rerank	API	Excellent	$$
Claude/GPT-4	API	Best	$$$

For production? Use BGE. It's free, open-source, and performs almost as well as expensive options.

Here's how:

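from sentence_transformers import CrossEncoder

# Load open-source reranker (the model is downloaded on first use)
reranker = CrossEncoder('BAAI/bge-reranker-base')

def rerank(query, chunks, top_k=5):
    """Score each (query, chunk) pair with the cross-encoder and keep the best top_k."""
    # Rerank them
    scores = reranker.predict([[query, chunk] for chunk in chunks])

    # Sort by score
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)

    # Return top_k as (chunk, score) pairs
    return ranked[:top_k]

# Usage: top_20_chunks is the list of 20 chunk texts from vector search
# top_5 = rerank("Docker experience", top_20_chunks)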

That's it. It adds a few hundred milliseconds and makes a huge difference in quality.

SECTION 8: QUERY PROCESSING - Smart Questions

Queries Are Ambiguous

👦 Nephew: Uncle, when someone asks "Does John have frontend skills?", does the AI understand that means React, Vue, JavaScript, HTML?

👨‍🦳 Uncle: Not automatically. Here's the problem:

Resume has: "React, JavaScript, HTML, CSS expertise"
Search for: "frontend skills"

Vector search looks for chunks whose embeddings sit close to the embedding of "frontend skills". But the resume doesn't use those words. It lists the components, and a bare list of technologies doesn't always land close enough.

A human reads "React, JavaScript, HTML, CSS" and immediately thinks "frontend". A computer doesn't reliably make that leap.

👦 Nephew: So how do we fix it?

👨‍🦳 Uncle: Query rewriting. You expand the query.

Query Expansion

👦 Nephew: Expand it how?

👨‍🦳 Uncle: Instead of searching for "frontend skills", you search for:

(frontend OR React OR Vue OR Angular OR JavaScript OR HTML OR CSS)

Now you find any mention of these. Much better.

Here's code:

def expand_query(query):
    """
    Expand a query into related terms.
    """
    # Simple approach using a dict
    expansions = {
        'frontend': ['React', 'Vue', 'Angular', 'JavaScript', 'HTML', 'CSS'],
        'backend': ['Node.js', 'Python', 'Java', 'databases', 'APIs'],
        'devops': ['Docker', 'Kubernetes', 'AWS', 'CI/CD', 'deployment'],
    }

    for key, synonyms in expansions.items():
        if key in query.lower():
            return query + " OR " + " OR ".join(synonyms)

    return query

# Example
original = "frontend skills"
expanded = expand_query(original)
# Result: "frontend skills OR React OR Vue OR Angular OR JavaScript OR HTML OR CSS"

Multi-Query Retrieval

👦 Nephew: Is there a smarter way?

👨‍🦳 Uncle: Yes! Instead of manually expanding, let an LLM generate variations.

User asks: "Real-time system experience?"

LLM thinks: "That means WebSockets, Socket.IO, real-time updates, pub/sub, message queues"

LLM generates multiple queries:

"WebSockets and real-time communication"
"Socket.IO implementation"
"Real-time data streaming"
"Redis pub/sub usage"
"Message queue systems"

You search all 5 queries. You get chunks from all angles. You combine results.

Coverage: far better than a single query.

from anthropic import Anthropic

def generate_queries(original_query):
    """
    Generate multiple search queries from one.
    """
    client = Anthropic()

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""
Given this question, generate 3-5 related search queries 
that would find relevant information. Return only the queries, 
one per line.

Original question: {original_query}

Queries:
"""
        }]
    )

    queries = response.content[0].text.strip().split('\n')
    return [q.strip() for q in queries if q.strip()]

# Example
original = "Real-time system experience?"
queries = generate_queries(original)
# Returns:
# ["WebSockets real-time", "Socket.IO", "Redis pub/sub", "Message queues", "Real-time streaming"]

# Now search each (vector_search is your own retrieval function)
all_results = []
for q in queries:
    results = vector_search(q)
    all_results.extend(results)

# Deduplicate by chunk id and keep the first 5 (first-seen order, not re-ranked)
unique_results = list({r['id']: r for r in all_results}.values())[:5]

This is elegant. The LLM understands what "real-time" means and generates good search queries.
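
One detail in "combine results": the same chunk can come back from several queries with different scores. A minimal sketch of one way to merge, assuming each result is a dict with an `id` and a `distance` (smaller is closer). Both field names and the sample numbers are assumptions for illustration.

```python
def merge_results(result_lists, top_k=5):
    """Deduplicate by chunk id, keep each chunk's best (smallest) distance, then rank."""
    best = {}
    for results in result_lists:
        for r in results:
            seen = best.get(r["id"])
            if seen is None or r["distance"] < seen["distance"]:
                best[r["id"]] = r
    return sorted(best.values(), key=lambda r: r["distance"])[:top_k]


# Hypothetical output of two of the generated queries.
websocket_hits = [{"id": 7, "distance": 0.21}, {"id": 3, "distance": 0.35}]
pubsub_hits = [{"id": 3, "distance": 0.18}, {"id": 9, "distance": 0.40}]

merged = merge_results([websocket_hits, pubsub_hits])
```

Chunk 3 appears in both lists and is kept once, with its better score. A reranker (Section 7) can then reorder the merged list against the original question.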

SECTION 9: EVALUATION & METRICS - Measuring Success

The Measurement Problem

👦 Nephew: Uncle, I built a RAG system. It works. But how do I know if it's good?

👨‍🦳 Uncle: This is where most teams fail. They build, ship, and hope. No metrics.

But you can measure. Here are the important metrics:

Key Metrics

👦 Nephew: What should I measure?

👨‍🦳 Uncle: Four main things:

Metric	Measures	Target	Impact
Recall	Of all relevant docs, how many found?	>90%	Missing information
Precision	Of what found, how relevant?	>85%	Wrong context
Latency	Response time	<2 seconds	User experience
Faithfulness	AI stays factual	>95%	Hallucinations

Let me explain each.

Recall vs Precision

👦 Nephew: What's the difference?

👨‍🦳 Uncle: Let me give you an example. Say there are 10 Docker-related chunks in John's resume.

Recall: Of those 10, how many did you find?

You find 9? Recall = 9/10 = 90%
You find 6? Recall = 6/10 = 60%

High recall means: "I found most of the relevant information"

Precision: Of what you returned, how many were relevant?

You return 15 chunks. 10 are about Docker. Precision = 10/15 = 67%
You return 9 chunks. 8 are about Docker. Precision = 8/9 = 89%

High precision means: "Most of what I returned was useful"

Usually there's a tradeoff:

Return everything (high recall, low precision)
Return only the most confident (low recall, high precision)

You want both high.

Measuring

👦 Nephew: How do I actually measure this?

👨‍🦳 Uncle: You need labeled data. A test set.

Take 10 real questions
For each, have a human list all relevant chunks
Run your system
Compare: did it find all relevant chunks? (Recall)
Were all returned chunks relevant? (Precision)

Here's code:

def calculate_recall_precision(retrieved, relevant):
    """
    retrieved: chunks your system found
    relevant: chunks a human labeled as relevant
    """
    retrieved_ids = set(r['id'] for r in retrieved)
    relevant_ids = set(r['id'] for r in relevant)

    # Recall: of all relevant, how many found?
    if len(relevant_ids) == 0:
        recall = 1.0
    else:
        recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)

    # Precision: of what found, how many relevant?
    if len(retrieved_ids) == 0:
        precision = 0.0
    else:
        precision = len(retrieved_ids & relevant_ids) / len(retrieved_ids)

    return recall, precision

# Example
retrieved = [
    {'id': '1', 'text': 'Docker experience'},
    {'id': '2', 'text': 'Kubernetes'},
    {'id': '3', 'text': 'Machine learning'},
]

relevant = [
    {'id': '1', 'text': 'Docker experience'},
    {'id': '2', 'text': 'Kubernetes'},
    {'id': '4', 'text': 'Container orchestration'},
]

recall, precision = calculate_recall_precision(retrieved, relevant)
# Recall = 2/3 = 67% (found 2 of 3 relevant)
# Precision = 2/3 = 67% (2 of 3 returned were relevant)
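
One question tells you little, so the numbers are normally averaged over the whole test set. A minimal sketch of that loop, assuming a `retrieve` function that returns a set of chunk ids; the test-set layout, the ids, and the canned retriever below are all hypothetical.

```python
def recall_precision(retrieved_ids, relevant_ids):
    """Recall and precision for one question, using sets of chunk ids."""
    hits = len(retrieved_ids & relevant_ids)
    recall = hits / len(relevant_ids) if relevant_ids else 1.0
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    return recall, precision


def evaluate(test_set, retrieve):
    """Average recall and precision over a labeled test set.

    test_set: list of {"question": str, "relevant_ids": set}
    retrieve: function(question) -> set of retrieved chunk ids
    """
    recalls, precisions = [], []
    for case in test_set:
        r, p = recall_precision(retrieve(case["question"]), case["relevant_ids"])
        recalls.append(r)
        precisions.append(p)
    n = len(test_set)
    return sum(recalls) / n, sum(precisions) / n


# Hypothetical human labels and a canned retriever, just to exercise the loop.
test_set = [
    {"question": "Docker experience?", "relevant_ids": {"1", "2", "4"}},
    {"question": "React experience?", "relevant_ids": {"5"}},
]
canned = {"Docker experience?": {"1", "2", "3"}, "React experience?": {"5"}}

avg_recall, avg_precision = evaluate(test_set, lambda q: canned[q])
print(f"Recall: {avg_recall:.0%}, Precision: {avg_precision:.0%}")
```

Run this after every change to chunking, search, or reranking. If the averages drop, the change made things worse, however good it looked on one example.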

Latency

👦 Nephew: How do I measure speed?

👨‍🦳 Uncle: Simple:

import time

start = time.perf_counter()  # monotonic clock, meant for measuring intervals
results = rag_system.query("Does John have Docker experience?")
latency = (time.perf_counter() - start) * 1000  # Convert to ms

print(f"Latency: {latency:.0f}ms")

Target: <2000ms (2 seconds). Anything faster is bonus.

Cost

👦 Nephew: What about cost?

👨‍🦳 Uncle: Important for production:

# Cost per query (illustrative prices; check your providers' current rates)
# Embedding: $0.00001 per 1K tokens
# Reranking: $0.0001 per query (if using paid service)
# LLM answer: $0.001 per query (if using Claude)
# Total: ~$0.0011 per query

# If you do 1M queries/month:
monthly_cost = 1_000_000 * 0.0011  # $1100/month

This is what you need to know for business decisions.
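
The same arithmetic as a small function, so you can plug in your own numbers. The prices are the illustrative figures from the estimate above, not real quotes, and the assumption of about 1K embedded tokens per query is mine.

```python
def cost_per_query(embed_tokens, embed_price_per_1k, rerank_price, llm_price):
    """Sum the three per-query costs (all in dollars)."""
    embedding_cost = (embed_tokens / 1000) * embed_price_per_1k
    return embedding_cost + rerank_price + llm_price


per_query = cost_per_query(
    embed_tokens=1000,           # assumption: about 1K tokens embedded per query
    embed_price_per_1k=0.00001,  # illustrative
    rerank_price=0.0001,         # illustrative, only if using a paid reranker
    llm_price=0.001,             # illustrative
)
monthly = per_query * 1_000_000  # 1M queries per month

print(f"Per query: ${per_query:.5f}")
print(f"Monthly at 1M queries: ${monthly:,.0f}")
```

The LLM call dominates the total, so caching repeated questions and sending fewer chunks per prompt are usually the first places to look for savings.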

SECTION 10: HALLUCINATION PREVENTION - Keeping AI Honest

Hallucinations: The Biggest Problem

👦 Nephew: Uncle, what's a hallucination?

👨‍🦳 Uncle: Here's a real example:

Resume says: "Worked with Docker for 2 years"
Hiring manager asks: "Experience with Kubernetes?"
AI responds: "Yes, John has Kubernetes experience with container orchestration."

WRONG! The resume never mentions Kubernetes.

The AI invented an answer. It saw "Docker", thought "Kubernetes is related to containers", and hallucinated.

This is a hallucination. The AI is lying, but confidently.

Why Hallucinations Happen

👦 Nephew: But why does this happen?

👨‍🦳 Uncle: AIs are pattern-matching machines. They learned during training: "Docker" often appears near "Kubernetes". So the AI's brain has connected them.

Even if you show the AI only a resume mentioning Docker, its training whispers: "And Kubernetes is usually near Docker!"

So the AI thinks: "Probably Kubernetes too."

It's an artifact of how AIs learn. Not a bug. A feature used wrongly.

The Solution: Force the AI to Use Only Given Information

👦 Nephew: How do you stop it?

👨‍🦳 Uncle: Five layers of protection:

Layer 1: Retrieval Boundaries
Show the AI ONLY the retrieved chunks, and tell it to answer from them alone, not from what it remembers from training.

In your prompt:

Based ONLY on the following information:

[Retrieved chunks]

Answer: Does John have Kubernetes experience?

The word "ONLY" is critical.

Layer 2: Structured Output
Force JSON format. Constrain the answer shape.

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,  # required by the Messages API
    messages=[...],
    system="""
You MUST respond in JSON format:
{
  "answer": "yes" or "no",
  "confidence": 0.0 to 1.0,
  "evidence": ["chunk 1", "chunk 2"]
}
""",
)

This pushes the AI to cite evidence with every answer. The next layer checks that the evidence is real.

Layer 3: Validation
Check the output against the original chunks:

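import json

def validate_answer(response_text, chunks):
    answer = json.loads(response_text)

    # Check: does each piece of evidence appear verbatim in a retrieved chunk?
    for evidence in answer['evidence']:
        if not any(evidence in chunk for chunk in chunks):
            # Hallucination detected!
            answer['confidence'] = 0  # Reject

    return answer

answer = validate_answer(response.content[0].text, chunks)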

Layer 4: Confidence Gating
Low confidence? Escalate to human:

def gate_by_confidence(answer, threshold=0.7):
    if answer['confidence'] < threshold:
        # Hand off to human
        return {
            'answer': 'Unknown - escalated to human',
            'reason': 'Low confidence'
        }
    return answer

Layer 5: Required Citations
Every claim must point to a chunk:

Question: Does John know Kubernetes?
Retrieved chunks:
[1] "Docker and container experience"
[2] "Kubernetes cluster management"

Answer: Yes, John knows Kubernetes (from chunk 2).

The AI must cite. No citation? No answer.
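
"No citation? No answer" can be enforced in code. A minimal sketch, assuming the model is asked to cite chunks as bracketed numbers like [2]; the prompt wording and both function names are hypothetical.

```python
import re


def build_cited_prompt(question, chunks):
    """Number each chunk so the model can refer to it as [1], [2], ..."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using ONLY the numbered excerpts below. "
        "Cite the excerpt number in square brackets after each claim.\n\n"
        f"{numbered}\n\nQuestion: {question}"
    )


def has_valid_citations(answer_text, chunk_count):
    """True if the answer cites at least one chunk and every cited number exists."""
    cited = [int(n) for n in re.findall(r"\[(\d+)\]", answer_text)]
    return bool(cited) and all(1 <= n <= chunk_count for n in cited)


chunks = ["Docker and container experience", "Kubernetes cluster management"]
prompt = build_cited_prompt("Does John know Kubernetes?", chunks)

ok = has_valid_citations("Yes, John knows Kubernetes [2].", len(chunks))
missing = has_valid_citations("Yes, John knows Kubernetes.", len(chunks))
bogus = has_valid_citations("Yes, see [5].", len(chunks))
```

This only checks that citations exist and point at real chunks. Whether the cited chunk actually supports the claim is a job for the validation layer or a human reviewer.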

Together

👦 Nephew: Do you use all five?

👨‍🦳 Uncle: In production? All five. At different stages.

Retrieval boundaries + structured output stop most hallucinations.

Validation + confidence gating catch the rest.

Citations let users verify.

Together? Hallucinations become rare, and the ones that slip through are much easier to catch.

SECTION 11: PRODUCTION ARCHITECTURE - Building for Real Users

Prototype vs Production

👦 Nephew: Uncle, my RAG system works on my laptop. But production is different?

👨‍🦳 Uncle: Completely different.

Prototype version:

Works for me
Might crash
Takes 5 seconds to respond
Processes 1 user at a time

Production version:

Works for everyone
99.9% uptime
Responds in <2 seconds
Handles 1000 concurrent users
Secure (no data leaks)
Monitored (alerts when something breaks)
Compliant (legal requirements)

These are not the same system.

Core Architecture for Production

👦 Nephew: How do I design it?

👨‍🦳 Uncle: Here's a proven architecture:

                    Users
                      ↓
              Load Balancer (Nginx)
             /          |          \
         API-1      API-2      API-3  (scale horizontally)
             \          |          /
                      ↓
                 Cache Layer (Redis)
                      ↓
           PostgreSQL + pgvector (vector index)
                      ↓
              External APIs (Claude, etc.)

Load Balancer

Distributes user traffic
No single API gets overwhelmed
One dies? Traffic routes to others

Multiple APIs

Handle load together
Fault tolerant
Can update one without stopping others

Cache

Redis stores recent results
Same question asked twice? Answer from cache
Cache hits skip retrieval and the LLM call, so they return far faster

Database

Single source of truth
Persistent
Indexed for speed

External APIs

Claude for answers
Anthropic handles scaling there

Each layer has a job. Each layer is replaceable.
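
The cache layer's logic is small enough to sketch. Here an in-memory dict stands in for Redis so the example runs anywhere; the class, the 5-minute default expiry, and the `slow_pipeline` function are all hypothetical.

```python
import time


class QueryCache:
    """Tiny in-memory cache with expiry; a stand-in for Redis in this sketch."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}

    def get_or_compute(self, question, compute):
        key = " ".join(question.lower().split())  # normalize case and spacing
        entry = self.store.get(key)
        now = self.clock()
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]          # cache hit: skip retrieval and the LLM call
        answer = compute(question)   # cache miss: run the full pipeline
        self.store[key] = (now, answer)
        return answer


calls = []


def slow_pipeline(question):
    # Hypothetical stand-in for retrieval + reranking + LLM.
    calls.append(question)
    return f"answer to: {question}"


cache = QueryCache(ttl_seconds=60)
first = cache.get_or_compute("Does John know Docker?", slow_pipeline)
second = cache.get_or_compute("does john know   docker?", slow_pipeline)
```

The second call is served from the cache even though its casing and spacing differ. In a multi-tenant system the cache key must also include the tenant, or one company could receive another company's cached answer.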

Tenant Isolation - Critical

👦 Nephew: Wait, what if I'm hosting this for multiple companies?

👨‍🦳 Uncle: Then you have a SERIOUS security issue if you don't handle isolation.

Company A's data must NEVER be visible to Company B.

Every single query must include:

SELECT * FROM resume_chunks
WHERE resume_id = 'john-123'
  AND tenant_id = 'company-a';  -- THIS IS CRITICAL

Without the tenant_id check, a hacker or bug could leak data.

This is not optional. This is "you'll be sued" level of important.
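One way to make the tenant check hard to forget is to route every read through a single helper that refuses to run without a tenant. The sketch below uses Python's built-in sqlite3 only so that it is runnable as-is; the `fetch_chunks` helper is hypothetical, and with PostgreSQL the same idea applies with your driver's own placeholder syntax.

```python
import sqlite3

def fetch_chunks(conn, tenant_id, resume_id):
    """Return chunk texts for one resume, always scoped to one tenant."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every query")
    # Parameterized query: values are never pasted into the SQL string
    rows = conn.execute(
        "SELECT chunk_text FROM resume_chunks "
        "WHERE resume_id = ? AND tenant_id = ?",
        (resume_id, tenant_id),
    ).fetchall()
    return [row[0] for row in rows]

# Demo data: the same resume_id exists under two different tenants
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resume_chunks (resume_id TEXT, tenant_id TEXT, chunk_text TEXT)"
)
conn.executemany(
    "INSERT INTO resume_chunks VALUES (?, ?, ?)",
    [
        ("john-123", "company-a", "6 years of React"),
        ("john-123", "company-b", "Confidential notes from company B"),
    ],
)

print(fetch_chunks(conn, "company-a", "john-123"))
```

Company A only ever sees its own row, even though both rows share the same resume_id.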

SECTION 12: ATS DEEP DIVE - Resume Scoring

What is ATS (Applicant Tracking System)?

👦 Nephew: Uncle, I've heard of ATS. What is it?

👨‍🦳 Uncle: It's the software recruiters use to collect, filter, and rank job applications. Old ATS systems just searched for keywords. "Does the resume have the word React?" Yes/No.

Modern ATS - the kind you're building with RAG - is intelligent. It understands:

Does the candidate have the skill?
How many years?
At what level?
Do they have relevant projects?

Each resume gets a score. Top scorers get interviewed. Low scorers get rejected.

The difference between old and new? Old ATS rejects good candidates because they phrased things differently. New ATS finds them anyway.

The Scoring Algorithm

👦 Nephew: How do you score a resume?

👨‍🦳 Uncle: With components. Job requirements × candidate abilities.

Job description requires:

React (5+ years)
JavaScript (required)
Node.js (preferred)
AWS (preferred)

Let's say this resume has:

React (6 years) ← exceeds requirement
JavaScript (7 years) ← exceeds requirement
Node.js (4 years) ← has it
AWS (not mentioned)

Here's the scoring:

def score_resume(candidate, job_requirements):
    """
    Score a resume against job requirements.
    """

    # Required skills: 60% weight
    required_score = 0
    required_count = 0
    for skill, years_required in job_requirements['required'].items():
        required_count += 1
        candidate_years = candidate.get(skill, 0)

        if candidate_years >= years_required:
            required_score += 100  # Full points
        elif candidate_years > 0:
            # Partial credit
            required_score += (candidate_years / years_required) * 100
        # else: 0 points

    required_avg = required_score / required_count if required_count > 0 else 0

    # Preferred skills: 25% weight
    preferred_score = 0
    preferred_count = 0
    for skill in job_requirements['preferred']:
        preferred_count += 1
        if skill in candidate:
            preferred_score += 50  # Half credit for preferred

    preferred_avg = preferred_score / preferred_count if preferred_count > 0 else 0

    # Projects: 15% weight
    project_score = len(candidate.get('projects', [])) * 25
    project_avg = min(project_score, 100)  # Cap at 100

    # Final score
    final_score = (
        required_avg * 0.60 +
        preferred_avg * 0.25 +
        project_avg * 0.15
    )

    return final_score

# Example
candidate = {
    'React': 6,
    'JavaScript': 7,
    'Node.js': 4,
    'projects': ['E-commerce app', 'Real-time chat']
}

requirements = {
    'required': {'React': 5, 'JavaScript': 3},
    'preferred': ['Node.js', 'AWS']
}

score = score_resume(candidate, requirements)
# React: (6 >= 5) = 100 pts
# JavaScript: (7 >= 3) = 100 pts
# Required avg: 100
# Node.js: has it = 50 pts
# Preferred avg: 25
# Projects: 2 × 25 = 50 pts
# Final: (100 × 0.60) + (25 × 0.25) + (50 × 0.15) = 60 + 6.25 + 7.5 = 73.75/100

That's scoring. But wait - how do you extract the data?

Skill Normalization

👦 Nephew: Resume might say "React" but another might say "React.js". How do you normalize?

👨‍🦳 Uncle: Build a dictionary.

skill_aliases = {
    'React': ['React', 'React.js', 'ReactJS', 'react'],
    'Node.js': ['Node.js', 'node.js', 'nodejs', 'Node'],
    'Docker': ['Docker', 'docker', 'containers'],
}

def normalize_skill(mentioned_skill):
    """Find the canonical skill name."""
    mentioned_lower = mentioned_skill.lower()

    for canonical, aliases in skill_aliases.items():
        if mentioned_lower in [a.lower() for a in aliases]:
            return canonical

    return mentioned_skill  # Unknown skill

# Example
print(normalize_skill("React.js"))  # Output: React
print(normalize_skill("nodejs"))    # Output: Node.js

Now "React" and "React.js" are treated as the same skill.
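The dictionary also helps with the earlier question of how to pull skills out of resume text in the first place. The sketch below builds a reverse lookup once (alias to canonical name) and scans free text for whole-token matches. The alias list is a trimmed, illustrative copy, and `extract_skills` is a hypothetical helper.

```python
import re

SKILL_ALIASES = {
    "React": ["React", "React.js", "ReactJS"],
    "Node.js": ["Node.js", "nodejs", "Node"],
    "Docker": ["Docker"],
}

# Build the reverse lookup once: lowercase alias -> canonical name
ALIAS_TO_CANONICAL = {
    alias.lower(): canonical
    for canonical, aliases in SKILL_ALIASES.items()
    for alias in aliases
}

def extract_skills(text):
    """Return the set of canonical skills mentioned in free text."""
    found = set()
    lowered = text.lower()
    for alias, canonical in ALIAS_TO_CANONICAL.items():
        # Match the alias only as a whole token, so "node" does not match "nodes"
        pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(canonical)
    return found

print(extract_skills("Built dashboards with ReactJS and nodejs."))
```

This still only finds skills that are in the dictionary. It says nothing about years of experience, which need separate extraction.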

For Unknown Skills: Use Embeddings

👦 Nephew: But what if the resume mentions a skill not in your dictionary?

👨‍🦳 Uncle: Use embeddings to find similar skills.

Resume mentions: "Built trading platform with real-time updates"
Job requires: "WebSockets experience"

These aren't in your dictionary. But you can check similarity:

def find_skill_match(mentioned_skill, required_skill, min_similarity=0.75):
    """
    Use embeddings to find matches for unknown skills.
    get_embedding and cosine_distance are helpers you supply.
    """
    # Get embeddings
    mentioned_embedding = get_embedding(mentioned_skill)
    required_embedding = get_embedding(required_skill)

    # Cosine distance is 0 for identical meaning, so similarity = 1 - distance
    similarity = 1 - cosine_distance(mentioned_embedding, required_embedding)

    # If similar enough, count as a match
    return similarity >= min_similarity

# Example
if find_skill_match("real-time data updates", "WebSockets"):
    # Candidate probably has WebSockets experience
    score += 100

This is powerful. It catches variations and similar technologies.
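One detail that is easy to get backwards: cosine distance and cosine similarity point in opposite directions. A high similarity means a match; a high distance means the opposite. The sketch below shows both on toy 3-dimensional vectors. The numbers are invented for illustration and are not real embeddings, and the 0.75 threshold is just a starting point that needs tuning on your own data.

```python
import math

def cosine_similarity(a, b):
    """1.0 = same direction, 0.0 = unrelated (orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """0.0 = same direction. Smaller distance means MORE similar."""
    return 1.0 - cosine_similarity(a, b)

def is_match(a, b, min_similarity=0.75):
    """Count as a match when similarity is at or above the threshold."""
    return cosine_similarity(a, b) >= min_similarity

# Toy vectors (made up): two related phrases and one unrelated phrase
realtime_updates = [0.9, 0.8, 0.1]
websockets = [0.8, 0.9, 0.2]
cooking = [0.1, 0.0, 0.9]

print(is_match(realtime_updates, websockets))  # related pair
print(is_match(realtime_updates, cooking))     # unrelated pair
```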

SECTION 13: ADVANCED PATTERNS - Agents & Future

What Happens After Basic RAG

👦 Nephew: Uncle, we have RAG working. What comes next?

👨‍🦳 Uncle: Basic RAG answers questions. Advanced RAG (agents) decides what to do.

Difference:

Basic RAG: "Does John know React?" → Find resume → Answer

Agent RAG: "Is John qualified for Senior Developer role?" →

Find resume
Find job description
Compare skills
Check experience level
Look for projects
Verify certifications
Reason through all evidence
Answer with reasoning

The agent is intelligent about what to search.

Agentic RAG with Claude

👦 Nephew: How do I build an agent?

👨‍🦳 Uncle: Using Claude's tool use (function calling):

import anthropic

client = anthropic.Anthropic()

def search_resume(resume_id, query):
    """Search candidate's resume."""
    # Your RAG search logic
    return rag_system.query(resume_id, query)

def search_job_description(job_id, query):
    """Search job description."""
    return job_db.query(job_id, query)

def analyze_candidate(resume_id, job_id):
    """
    Use Claude as an agent to analyze candidate fit.
    Claude decides what to search, what to compare.
    """
    tools = [
        {
            "name": "search_resume",
            "description": "Search a candidate's resume for information",
            "input_schema": {
                "type": "object",
                "properties": {
                    "resume_id": {"type": "string"},
                    "query": {"type": "string"}
                },
                "required": ["resume_id", "query"]
            }
        },
        {
            "name": "search_job",
            "description": "Search job requirements",
            "input_schema": {
                "type": "object",
                "properties": {
                    "job_id": {"type": "string"},
                    "query": {"type": "string"}
                },
                "required": ["job_id", "query"]
            }
        }
    ]

    messages = [{
        "role": "user",
        "content": f"""
Analyze if candidate {resume_id} is qualified for job {job_id}.

Use the available tools to:
1. Find required skills in the job description
2. Check if the candidate has those skills
3. Compare years of experience
4. Look for relevant projects
5. Make a recommendation

Provide detailed reasoning.
"""
    }]

    # Agent loop
    while True:
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1000,
            tools=tools,
            messages=messages
        )

        # Check if Claude wants to use tools
        if response.stop_reason == "tool_use":
            # Process tool calls
            tool_results = []
            for content in response.content:
                if content.type == "tool_use":
                    if content.name == "search_resume":
                        result = search_resume(
                            content.input["resume_id"],
                            content.input["query"]
                        )
                    else:  # search_job
                        result = search_job_description(
                            content.input["job_id"],
                            content.input["query"]
                        )

                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": content.id,
                        "content": str(result)
                    })

            # Add Claude's response and tool results to messages
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})

        else:
            # Claude is done - join all text blocks into the final answer
            final_answer = "".join(
                content.text for content in response.content
                if content.type == "text"
            )

            return final_answer

# Use it
result = analyze_candidate("john-resume", "senior-dev-job")
print(result)

This is elegant. Claude is the brain. It decides what to search, in what order, and how to reason about it.
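Two things worth hardening in the loop above: it treats any tool name other than search_resume as a job search, and a failing tool would crash the whole analysis. A small dispatch table handles both. This is a sketch under assumptions: `dispatch_tool_call` and the fake search function are hypothetical, and the real handlers would be your own search functions.

```python
def dispatch_tool_call(name, tool_input, registry):
    """Run one tool call and return (result_text, is_error)."""
    handler = registry.get(name)
    if handler is None:
        # Unknown tool name: report it instead of guessing which tool was meant
        return f"Unknown tool: {name}", True
    try:
        return str(handler(**tool_input)), False
    except Exception as exc:
        # Bad arguments or a failing search: send the error text back to the model
        return f"Tool {name} failed: {exc}", True

def fake_search_resume(resume_id, query):
    """Stand-in for a real RAG search, used only for this demo."""
    return f"{resume_id}: results for {query}"

registry = {"search_resume": fake_search_resume}

print(dispatch_tool_call("search_resume", {"resume_id": "john-123", "query": "React"}, registry))
print(dispatch_tool_call("delete_everything", {}, registry))
```

The error flag can be passed back to Claude by setting "is_error" on the tool_result block, so the model can retry or explain the failure. It is also worth capping the number of loop iterations so a confused agent cannot run forever.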

The Future of RAG

👦 Nephew: What's next? After agents?

👨‍🦳 Uncle: A few directions:

1. Adaptive RAG
System learns what works. If a question fails, it tries different search strategies automatically.

2. Multi-modal RAG
Right now we handle text. Soon: images, tables, videos. A resume with a photo of their projects. A video walkthrough.

3. Real-time RAG
System continuously updates knowledge. Candidate updates LinkedIn → RAG system knows instantly.

4. Collaborative RAG
Multiple agents reasoning together. One searches, one evaluates, one questions.

5. Explainable RAG
System shows its work. "I rejected John because he doesn't have 5 years React, only 3 years, and React 5+ was required."

These are coming. RAG is only beginning.
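Explainable RAG does not have to wait for the future. For the required-skills part of a score, a first version can be plain code. The sketch below is a hypothetical helper that reuses the candidate and requirements dictionary shapes from the scoring section and turns each comparison into a readable line.

```python
def explain_required_skills(candidate, required):
    """Return one human-readable line per required skill."""
    lines = []
    for skill, years_required in required.items():
        years = candidate.get(skill, 0)
        if years >= years_required:
            lines.append(
                f"{skill}: meets requirement ({years} years, {years_required}+ required)"
            )
        elif years > 0:
            lines.append(
                f"{skill}: below requirement ({years} years, {years_required}+ required)"
            )
        else:
            lines.append(
                f"{skill}: not found on resume ({years_required}+ years required)"
            )
    return lines

for line in explain_required_skills({"React": 3}, {"React": 5, "JavaScript": 3}):
    print(line)
```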

CONCLUSION: You're a RAG Architect Now

👨‍🦳 Uncle: Let me summarize what you've learned:

RAG Fundamentals - Fetch documents, then answer. Don't guess.
Embeddings - Represent meaning as positions in space.
Vector Databases - Store embeddings fast. PostgreSQL + pgvector.
Retrieval - Find relevant documents using vector search.
Chunking - Break documents smartly (1000-1500 tokens, 200 overlap).
Hybrid Search - Combine vectors and keywords for accuracy.
Reranking - Use two models: fast retrieval, accurate ranking.
Query Processing - Expand queries to find more relevant docs.
Metrics - Measure recall, precision, latency, faithfulness.
Hallucination Prevention - Five layers: boundaries, structure, validation, confidence, citations.
Production - Load balancing, caching, security, monitoring.
ATS Scoring - Intelligent resume evaluation.
Agents - AI decides what to search and how.

👦 Nephew: That's a lot, uncle. But I think I understand. RAG isn't magic - it's just engineering.

👨‍🦳 Uncle: Exactly. RAG is:

Retrieval (fast lookup)
+ Augmentation (add context)
+ Generation (let AI answer)
= Reliable AI

Every layer serves one purpose: make AI useful for real people.
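The three steps fit in a few lines once the heavy parts are stubbed out. In this toy sketch, retrieval is a simple word-overlap ranking standing in for vector search, and `generate` is any callable that takes a prompt and returns text (in a real system, the LLM call). All function names here are illustrative.

```python
def retrieve(question, documents, top_k=2):
    """Rank documents by shared words with the question (toy stand-in for vector search)."""
    question_words = set(question.lower().split())
    scored = []
    for doc in documents:
        overlap = len(question_words & set(doc.lower().split()))
        scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only documents that share at least one word with the question
    return [doc for overlap, doc in scored[:top_k] if overlap > 0]

def augment(question, chunks):
    """Put the retrieved chunks in front of the question."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def rag_answer(question, documents, generate):
    """Retrieval + Augmentation + Generation, with a refusal when nothing is found."""
    chunks = retrieve(question, documents)
    if not chunks:
        return "I could not find this in the documents."
    return generate(augment(question, chunks))

documents = [
    "John has 6 years of React experience",
    "Priya builds data pipelines in Python",
    "The office is closed on Sunday",
]

# Echo the prompt instead of calling an LLM, to show what the model would see
print(rag_answer("How many years of React does John have", documents, lambda p: p))
```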

👦 Nephew: And production?

👨‍🦳 Uncle: Remember: Reliable for 10 users >> Perfect for nobody.

Start simple. Build on PostgreSQL. Measure everything. Iterate fast.

You don't need fancy systems. You need good engineering.

👦 Nephew: Uncle, one last question.

👨‍🦳 Uncle: Go ahead.

👦 Nephew: How do I actually start? Like, code?

👨‍🦳 Uncle: That's next lesson, beta. First, you understand the system. Then you build it.

But I'll give you one gift: a simple starter stack:

Frontend: Next.js + React
Backend: Node.js + Express
Database: PostgreSQL + pgvector
Embedding: A dedicated embedding model (e.g. Voyage AI or sentence-transformers; the Claude API has no embeddings endpoint)
LLM: Claude API (messages endpoint)
Hosting: AWS Lightsail or Render

Simple. Proven. Scales.

Now go build something amazing.

Quick Reference: Key Code Snippets

Embedding and Storage

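from anthropic import Anthropic
from sentence_transformers import SentenceTransformer

client = Anthropic()  # Used by the generation snippets further down

# The Claude API has no embeddings endpoint, and asking a chat model to
# "return embeddings" only produces made-up numbers. This sketch swaps in an
# open-source embedding model (an assumption - use whichever model you chose).
# all-MiniLM-L6-v2 produces 384-dimensional vectors.
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

def create_embeddings(texts):
    """Return one embedding vector (list of floats) per input text."""
    vectors = embedding_model.encode(texts, normalize_embeddings=True)
    return vectors.tolist()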

PostgreSQL Hybrid Search

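-- query_embedding is a placeholder for the query vector you pass in
SELECT chunk_text,
       embedding <=> query_embedding AS vector_distance,
       ts_rank(to_tsvector(chunk_text),
               plainto_tsquery('search_term')) AS keyword_rank
FROM resume_chunks
WHERE resume_id = 'candidate-uuid'
  AND tenant_id = 'company-a'  -- tenant isolation on every query
  AND to_tsvector(chunk_text) @@ plainto_tsquery('search_term')
-- keyword_rank only breaks exact ties in vector_distance here
ORDER BY vector_distance ASC, keyword_rank DESC
LIMIT 5;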

Scoring Algorithm

def score_candidate(candidate_skills, required_skills):
    """Simple scoring: % of required skills matched."""
    required = set(required_skills)
    if not required:
        return 0.0  # Nothing required, nothing to score
    matched = len(set(candidate_skills) & required)
    return (matched / len(required)) * 100

Hallucination Prevention

import json

def safe_answer(query, chunks, confidence_threshold=0.7):
    """Answer only if confident."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        system="""
You MUST:
1. Answer ONLY using the provided chunks
2. Cite evidence
3. Return JSON with answer, confidence, evidence
""",
        messages=[{
            "role": "user",
            "content": f"""
Chunks: {chunks}
Question: {query}
"""
        }]
    )

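    try:
        answer = json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # The model did not return valid JSON - treat it as low confidence
        return "Uncertain - escalated to human"

    if answer.get('confidence', 0) < confidence_threshold: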
        return "Uncertain - escalated to human"

    return answer['answer']

Final Thought

RAG isn't magic. It's engineering:

Clear data structures
Proven patterns
Careful measurement
Honest assessment

You've learned the foundations. Now use them.

Build something reliable.
Build something honest.
Build something useful.

You've got this.

Created for developers who want to understand how RAG actually works, not just use it as a black box.

Remember: Less noise, more action.

Appendix: Useful Resources

Tools We Discussed

PostgreSQL with pgvector - Local vector database
Claude API - LLM (paired with a separate embedding model)
Python - Implementation examples
Redis - Caching
Nginx - Load balancing

Testing Checklist

[ ] Embedding quality (check cosine distances)
[ ] Retrieval recall >90%
[ ] Retrieval precision >85%
[ ] Latency <2 seconds
[ ] Hallucination rate <5%
[ ] Cost tracking in place
[ ] Monitoring alerts configured
[ ] Data isolation verified (multi-tenant)
[ ] Cache hit rate >50%
[ ] User feedback collected
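For the recall and precision items, you need a small labeled set: for each test question, the chunk IDs a human marked as relevant. Given that, the two numbers are a few lines of code. The sketch below uses hypothetical chunk IDs and a hypothetical helper name.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision and recall over the top-k retrieved chunk IDs.

    precision = relevant chunks in the top k / chunks in the top k
    recall    = relevant chunks in the top k / all relevant chunks
    """
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 5 chunks retrieved, 4 chunks labeled relevant by a human
retrieved = ["c1", "c7", "c3", "c9", "c4"]
relevant = ["c1", "c3", "c4", "c8"]

precision, recall = precision_recall_at_k(retrieved, relevant, k=5)
print(f"precision@5={precision:.2f} recall@5={recall:.2f}")
```

Average these over all test questions before comparing against the targets in the checklist.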

Architecture Checklist

[ ] Load balancer in front
[ ] Multiple API instances
[ ] Redis caching layer
[ ] PostgreSQL with indexes
[ ] Tenant isolation on every query
[ ] Error logging
[ ] Uptime monitoring
[ ] Graceful degradation
[ ] Security scanning
[ ] Backup strategy

Go build. Good luck. 🚀