DEV Community

Akhilesh

98. RAG: Give Your AI Access to Your Documents

You ask ChatGPT about your company's internal policies. It makes something up. It sounds confident. It's wrong.

That's the hallucination problem. LLMs generate text based on what they learned during training. If the answer wasn't in the training data, they fabricate one that sounds plausible.

RAG (Retrieval-Augmented Generation) addresses this. Before generating, the system retrieves relevant documents from your own knowledge base. The LLM reads those documents and generates an answer grounded in real content.

Your documents. Your data. Accurate answers.


What You'll Learn Here

  • Why RAG beats fine-tuning for knowledge-heavy tasks
  • The complete RAG pipeline: chunk, embed, retrieve, generate
  • Chunking strategies that actually work
  • Building RAG from scratch with sentence-transformers and a local LLM
  • Building RAG with LangChain for real projects
  • Evaluating RAG: what good looks like and what breaks it
  • Common failure modes and how to fix them

RAG vs Fine-Tuning: When to Use Which

Both give LLMs access to new knowledge. They're solving different problems.

Fine-tuning:
  - Best for: teaching style, format, behavior
  - Updates model weights
  - Needs retraining when data changes
  - Can't cite sources easily
  - Expensive to update frequently

RAG:
  - Best for: factual knowledge, documents, databases
  - No weight updates
  - Update knowledge base anytime, instantly
  - Can cite exact source passages
  - Perfect for private or frequently changing data

Rule of thumb:
  Behavior/style change → fine-tune
  Knowledge/facts/documents → RAG
  Both → fine-tune + RAG

The Complete RAG Pipeline

1. INDEXING (done offline; re-run when documents change)
   Load documents
   → Split into chunks
   → Embed each chunk
   → Store in vector database

2. RETRIEVAL (done at query time)
   User sends question
   → Embed the question
   → Find top-k similar chunks
   → Return chunks as context

3. GENERATION (done at query time)
   Build prompt: question + retrieved chunks
   → Send to LLM
   → LLM generates answer grounded in chunks
   → Return answer to user
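
The three stages above can be sketched end to end in plain Python. This is a toy illustration, not the implementation built later in the post: the word-count `embed` function is a stand-in for a real embedding model, and the names `toy_chunks`, `toy_index`, `toy_retrieve`, and `toy_prompt` are made up for this example.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (assumption: stands in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. INDEXING: embed every chunk once and store (chunk, vector) pairs
toy_chunks = [
    "Overfitting means good training accuracy but poor test accuracy.",
    "K-fold cross-validation averages scores across k train/test splits.",
    "Random forests average many decision trees.",
]
toy_index: List[Tuple[str, Counter]] = [(c, embed(c)) for c in toy_chunks]

# 2. RETRIEVAL: embed the question and rank chunks by similarity
def toy_retrieve(question: str, k: int = 2) -> List[str]:
    q = embed(question)
    ranked = sorted(toy_index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 3. GENERATION: build the prompt that would be sent to the LLM
def toy_prompt(question: str, k: int = 2) -> str:
    context = "\n".join(f"- {c}" for c in toy_retrieve(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(toy_prompt("What is overfitting?", k=1))
```

Swapping `embed` for a sentence-transformers model and the list for a vector database gives the real pipeline; the control flow stays the same.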

Step 1: Chunking Documents

The most underrated step. How you split documents dramatically affects retrieval quality.

import re
from typing import List

# Strategy 1: Fixed-size chunking
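def chunk_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    # overlap must be smaller than chunk_size, otherwise start never advances
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start  = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap   # overlap to preserve context at boundaries
    return chunks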

# Strategy 2: Sentence-aware chunking (better)
def chunk_by_sentences(text: str, max_chunk_size: int = 500) -> List[str]:
    # Split on sentence boundaries
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks    = []
    current   = ""

    for sentence in sentences:
        if len(current) + len(sentence) + 1 <= max_chunk_size:   # +1 for the joining space
            current += " " + sentence if current else sentence
        else:
            if current:
                chunks.append(current.strip())
            current = sentence

    if current:
        chunks.append(current.strip())

    return chunks

# Strategy 3: Paragraph-aware chunking (often best for structured docs)
def chunk_by_paragraphs(text: str, max_chunk_size: int = 800) -> List[str]:
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks     = []
    current    = ""

    for para in paragraphs:
        if len(current) + len(para) + 2 <= max_chunk_size:
            current += "\n\n" + para if current else para
        else:
            if current:
                chunks.append(current.strip())
            current = para

    if current:
        chunks.append(current.strip())

    return chunks

# Test on sample text
sample_text = """
Machine learning is a branch of artificial intelligence that enables computers to learn from data. 
It has three main types: supervised, unsupervised, and reinforcement learning.

Supervised learning uses labeled examples where the correct answers are known. 
The model learns to map inputs to outputs by minimizing error on training data. 
Common algorithms include linear regression, decision trees, and neural networks.

Unsupervised learning finds patterns in data without labels. 
Clustering algorithms group similar examples together. 
Dimensionality reduction simplifies data while preserving structure.

Reinforcement learning trains an agent to take actions in an environment to maximize reward.
It learns through trial and error, receiving feedback from the environment.
Applications include game playing, robotics, and recommendation systems.
"""

chunks_fixed = chunk_fixed(sample_text, chunk_size=200, overlap=30)
chunks_sents = chunk_by_sentences(sample_text, max_chunk_size=300)
chunks_paras = chunk_by_paragraphs(sample_text, max_chunk_size=400)

print(f"Fixed chunks:     {len(chunks_fixed)}")
print(f"Sentence chunks:  {len(chunks_sents)}")
print(f"Paragraph chunks: {len(chunks_paras)}")

print(f"\nParagraph chunk 1:\n'{chunks_paras[0]}'")
print(f"\nParagraph chunk 2:\n'{chunks_paras[1]}'")

Output:

Fixed chunks:     7
Sentence chunks:  4
Paragraph chunks: 4

Paragraph chunk 1:
'Machine learning is a branch of artificial intelligence that enables computers to learn from data. 
It has three main types: supervised, unsupervised, and reinforcement learning.'

Paragraph chunk 2:
'Supervised learning uses labeled examples where the correct answers are known. 
The model learns to map inputs to outputs by minimizing error on training data. 
Common algorithms include linear regression, decision trees, and neural networks.'

Chunking guidelines:

  • Chunk sizes of 300-600 characters are a reasonable starting point for most use cases
  • Add overlap (50-100 chars) between neighbouring chunks so context isn't lost at boundaries
  • Paragraph chunking preserves semantic units better than fixed-size
  • Smaller chunks: better precision (more specific retrieval)
  • Larger chunks: better recall (more context per chunk)
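
The sentence-aware chunker above does not add overlap, so here is one possible way to combine the two guidelines. This is a sketch under assumptions: the function name `chunk_sentences_with_overlap` and the `overlap_sentences` parameter are invented for this example, and overlap is measured in whole sentences rather than characters.

```python
import re
from typing import List

def chunk_sentences_with_overlap(text: str, max_chunk_size: int = 500,
                                 overlap_sentences: int = 1) -> List[str]:
    """Sentence-aware chunking that repeats the last sentence(s) in the next chunk."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []

    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > max_chunk_size:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop the carried sentences if they would not fit with the new one
            if current and len(" ".join(current + [sentence])) > max_chunk_size:
                current = []
        current.append(sentence)

    if current:
        chunks.append(" ".join(current))
    return chunks

demo = chunk_sentences_with_overlap("A one. B two. C three. D four.",
                                    max_chunk_size=16, overlap_sentences=1)
print(demo)
# ['A one. B two.', 'B two. C three.', 'C three. D four.']
```

A single sentence longer than `max_chunk_size` still becomes its own oversized chunk; splitting inside a sentence is left out to keep the sketch short.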

Step 2: Building the Index

from typing import List

from sentence_transformers import SentenceTransformer
import chromadb

# Knowledge base: a collection of ML documentation
knowledge_base = {
    'doc1.txt': """
        Linear regression predicts a continuous output variable from input features.
        It fits a straight line (or hyperplane in multiple dimensions) through the data.
        The model minimizes the mean squared error between predictions and true values.
        The learned equation is: y = w1*x1 + w2*x2 + ... + b
        Used for: house price prediction, sales forecasting, temperature prediction.
    """,
    'doc2.txt': """
        Logistic regression is used for binary classification despite its name.
        It applies a sigmoid function to the linear combination of features.
        Output is a probability between 0 and 1.
        The decision boundary is where the probability equals 0.5.
        Used for: spam detection, disease diagnosis, fraud detection.
    """,
    'doc3.txt': """
        Random forests combine many decision trees to reduce overfitting.
        Each tree is trained on a random subset of data (bagging).
        Each split considers a random subset of features.
        Final prediction is the majority vote (classification) or average (regression).
        Feature importance can be extracted from the forest.
    """,
    'doc4.txt': """
        XGBoost builds trees sequentially, each one correcting errors from the previous.
        It uses gradient boosting with regularization to prevent overfitting.
        Learning rate controls how much each tree contributes.
        Early stopping prevents overtraining.
        Dominates Kaggle competitions on tabular data.
    """,
    'doc5.txt': """
        Cross-validation gives a reliable estimate of model performance.
        K-fold CV splits data into k equal parts, trains on k-1, tests on 1.
        This is repeated k times with different test sets.
        Average score across folds is the final estimate.
        Prevents optimistic bias from a single train/test split.
    """,
    'doc6.txt': """
        The confusion matrix shows all four prediction outcomes.
        True positives: correctly predicted positive.
        True negatives: correctly predicted negative.
        False positives: incorrectly predicted positive (Type I error).
        False negatives: incorrectly predicted negative (Type II error).
        Precision = TP / (TP + FP). Recall = TP / (TP + FN).
    """,
    'doc7.txt': """
        Overfitting occurs when a model performs well on training data but poorly on test data.
        Signs: large gap between train and validation accuracy.
        Causes: model too complex, too little data, training too long.
        Fixes: regularization, dropout, more data, early stopping, simpler model.
        The bias-variance tradeoff describes the fundamental tension.
    """,
    'doc8.txt': """
        Transformers use self-attention to process sequences in parallel.
        Self-attention computes relationships between all token pairs simultaneously.
        Multi-head attention runs several attention operations in parallel.
        Positional encoding adds position information to token embeddings.
        Layer normalization and residual connections stabilize training.
    """,
}

class RAGIndexer:
    def __init__(self, model_name='sentence-transformers/all-MiniLM-L6-v2'):
        self.model       = SentenceTransformer(model_name)
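        self.chroma      = chromadb.Client()   # in-memory client; data is lost when the process exits
        # get_or_create_collection avoids an error when this code is re-run in the same session
        self.collection  = self.chroma.get_or_create_collection(
            name='rag_knowledge_base',
            metadata={'hnsw:space': 'cosine'}
        )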

    def index_documents(self, documents: dict, chunk_size: int = 400):
        all_chunks = []
        all_ids    = []
        all_meta   = []

        for doc_name, content in documents.items():
            chunks = chunk_by_sentences(content, max_chunk_size=chunk_size)
            for i, chunk in enumerate(chunks):
                if len(chunk.strip()) < 30:   # skip tiny chunks
                    continue
                chunk_id = f"{doc_name}_chunk{i}"
                all_chunks.append(chunk.strip())
                all_ids.append(chunk_id)
                all_meta.append({'source': doc_name, 'chunk_idx': i})

        if not all_chunks:
            return

        # Encode all chunks
        print(f"Encoding {len(all_chunks)} chunks...")
        embeddings = self.model.encode(all_chunks, show_progress_bar=False)

        # Add to ChromaDB
        self.collection.add(
            ids        = all_ids,
            documents  = all_chunks,
            embeddings = [e.tolist() for e in embeddings],
            metadatas  = all_meta
        )
        print(f"Indexed {len(all_chunks)} chunks from {len(documents)} documents")

    def retrieve(self, query: str, top_k: int = 3) -> List[dict]:
        query_embedding = self.model.encode([query])[0].tolist()

        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )

        retrieved = []
        for doc, meta, dist in zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        ):
            retrieved.append({
                'text':       doc,
                'source':     meta['source'],
                'similarity': 1 - dist   # cosine distance = 1 - cosine similarity
            })

        return retrieved

# Build the index
indexer = RAGIndexer()
indexer.index_documents(knowledge_base)

# Test retrieval
query   = "How do I prevent a model from overfitting?"
results = indexer.retrieve(query, top_k=3)

print(f"\nQuery: '{query}'")
print("-" * 60)
for i, r in enumerate(results):
    print(f"\n{i+1}. [{r['similarity']:.3f}] From: {r['source']}")
    print(f"   {r['text'][:150]}...")

Output:

Indexed 16 chunks from 8 documents

Query: 'How do I prevent a model from overfitting?'
------------------------------------------------------------

1. [0.712] From: doc7.txt
   Overfitting occurs when a model performs well on training data but poorly on test data...

2. [0.531] From: doc4.txt
   XGBoost builds trees sequentially, each one correcting errors from the previous...

3. [0.489] From: doc3.txt
   Random forests combine many decision trees to reduce overfitting...
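
The scores in brackets come from the `1 - dist` conversion in `retrieve`. With the collection set to cosine space, the distance is one minus the cosine similarity of the two vectors. A small sketch of that relationship in plain Python (the helper names `cosine_similarity` and `cosine_distance` are just for this example, not ChromaDB functions):

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot    = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance form used for ranking: smaller means more similar."""
    return 1.0 - cosine_similarity(a, b)

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))    # same direction  -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))    # orthogonal      -> 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))   # opposite        -> 2.0
```

Vector length does not matter, only direction, which is why a short query can still match a longer chunk.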

Step 3: Generation With Retrieved Context

# Using a local model via HuggingFace Transformers
from transformers import pipeline

# flan-t5-base is small enough to run locally on CPU; for a real project, use a larger model or a hosted LLM API
generator = pipeline(
    'text2text-generation',
    model='google/flan-t5-base',
    max_new_tokens=200
)

def build_rag_prompt(question: str, context_chunks: List[dict]) -> str:
    context = "\n\n".join([
        f"[Source: {c['source']}]\n{c['text']}"
        for c in context_chunks
    ])

    prompt = f"""Answer the question based only on the provided context.
If the context doesn't contain enough information, say "I don't have enough information to answer this."

Context:
{context}

Question: {question}

Answer:"""

    return prompt

class RAGPipeline:
    def __init__(self, indexer: RAGIndexer, generator_pipeline):
        self.indexer   = indexer
        self.generator = generator_pipeline

    def answer(self, question: str, top_k: int = 3, verbose: bool = False) -> dict:
        # Step 1: Retrieve relevant chunks
        chunks = self.indexer.retrieve(question, top_k=top_k)

        # Step 2: Build prompt
        prompt = build_rag_prompt(question, chunks)

        if verbose:
            print("=== RETRIEVED CONTEXT ===")
            for c in chunks:
                print(f"[{c['source']}] sim={c['similarity']:.3f}: {c['text'][:100]}...")
            print("\n=== PROMPT ===")
            print(prompt[:500] + "...")

        # Step 3: Generate answer
        result = self.generator(prompt)[0]['generated_text']

        return {
            'question': question,
            'answer':   result.strip(),
            'sources':  [c['source'] for c in chunks],
            'chunks':   chunks
        }

# Build the RAG pipeline
rag = RAGPipeline(indexer, generator)

# Ask questions
questions = [
    "What causes overfitting and how do I fix it?",
    "How is precision different from recall?",
    "What makes XGBoost good for competitions?",
    "How do transformers process sequences?",
]

for question in questions:
    result = rag.answer(question)
    print(f"\nQ: {question}")
    print(f"A: {result['answer']}")
    print(f"Sources: {result['sources']}")
    print("-" * 60)
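
`build_rag_prompt` puts every retrieved chunk into the prompt. Small local models have short input windows, so it can help to cap how much context goes in. The sketch below is one simple way to do that; `pack_context` and the `max_chars` budget are hypothetical names and values, and character count is only a rough proxy for tokens.

```python
from typing import Dict, List

def pack_context(chunks: List[Dict], max_chars: int = 1200) -> List[Dict]:
    """Keep chunks in rank order until the character budget is used up."""
    packed: List[Dict] = []
    used = 0
    for chunk in chunks:
        size = len(chunk['text'])
        if used + size > max_chars:
            break   # stop at the first chunk that does not fit, preserving rank order
        packed.append(chunk)
        used += size
    return packed

# Made-up chunks in the same shape that retrieve() returns
example_chunks = [
    {'text': 'a' * 500, 'source': 'doc7.txt', 'similarity': 0.71},
    {'text': 'b' * 500, 'source': 'doc4.txt', 'similarity': 0.53},
    {'text': 'c' * 500, 'source': 'doc3.txt', 'similarity': 0.49},
]
print([c['source'] for c in pack_context(example_chunks, max_chars=1200)])
# ['doc7.txt', 'doc4.txt']
```

It would be called between retrieval and prompt building, for example `build_rag_prompt(question, pack_context(chunks))`.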

Using the OpenAI API for Better Generation

For production quality, use a stronger hosted LLM through an API. The retrieval stays the same. Only the generation step changes.

# Replace the generator with OpenAI API
# pip install openai

import openai

def generate_with_openai(prompt: str, model: str = 'gpt-3.5-turbo') -> str:
    client = openai.OpenAI()   # reads OPENAI_API_KEY from environment

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                'role': 'system',
                'content': 'You are a helpful assistant. Answer questions based only on the provided context. If the context does not contain enough information, say so clearly.'
            },
            {
                'role': 'user',
                'content': prompt
            }
        ],
        temperature=0.1,   # low temperature for factual answers
        max_tokens=300
    )

    return response.choices[0].message.content

# Integrate into RAG pipeline
class RAGWithOpenAI:
    def __init__(self, indexer: RAGIndexer):
        self.indexer = indexer

    def answer(self, question: str, top_k: int = 3) -> dict:
        chunks = self.indexer.retrieve(question, top_k=top_k)
        prompt = build_rag_prompt(question, chunks)
        answer = generate_with_openai(prompt)

        return {
            'question': question,
            'answer':   answer,
            'sources':  list(dict.fromkeys(c['source'] for c in chunks))   # de-duplicate, keep rank order
        }

# rag_openai = RAGWithOpenAI(indexer)
# result = rag_openai.answer("What causes overfitting?")
print("OpenAI RAG pipeline ready (requires OPENAI_API_KEY)")

LangChain: RAG in 30 Lines

LangChain abstracts the entire RAG pipeline into composable components.

pip install langchain langchain-community langchain-chroma

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain.chains import RetrievalQA
from langchain_chroma import Chroma   # provided by the langchain-chroma package installed above
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline as hf_pipeline

# 1. Load documents

docs = [
    Document(
        page_content=content,
        metadata={'source': name}
    )
    for name, content in knowledge_base.items()
]

# 2. Split
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
    separators=['\n\n', '\n', '. ', ' ', '']
)
chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks")

# 3. Embed and store
embeddings = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2'
)
vectorstore = Chroma.from_documents(chunks, embeddings)
retriever   = vectorstore.as_retriever(search_kwargs={'k': 3})

# 4. Generation model
gen_pipe = hf_pipeline('text2text-generation', model='google/flan-t5-base', max_new_tokens=200)
llm      = HuggingFacePipeline(pipeline=gen_pipe)

# 5. Chain it together
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type='stuff',            # stuff all chunks into one prompt
    return_source_documents=True
)

# 6. Ask questions
result = qa_chain.invoke({'query': 'What causes overfitting?'})
print(f"Answer: {result['result']}")
print(f"Sources: {[d.metadata['source'] for d in result['source_documents']]}")
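
The `separators` list passed to the splitter is an ordered fallback: try paragraph breaks first, then line breaks, then sentences, then words, then raw characters. The function below is a simplified approximation of that idea, written for illustration only. It is not LangChain's implementation: it has no overlap, drops separators at chunk boundaries, and the name `recursive_split` is made up.

```python
from typing import List

def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split on the coarsest separator first; use finer ones only for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces: List[str] = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            pieces.extend(recursive_split(part, chunk_size, finer))
        elif part.strip():
            pieces.append(part)

    # Merge neighbouring pieces back together while they still fit.
    # Pieces are re-joined with the current separator, so whitespace may differ from the input.
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

print(recursive_split("aaaa bbbb\n\ncccc dddd eeee", 10, ["\n\n", " ", ""]))
# ['aaaa bbbb', 'cccc dddd', 'eeee']
```

The first paragraph fits as is, so it is never split on spaces; only the second, oversized paragraph falls through to the finer separator.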

Common RAG Failure Modes and Fixes

failures = {
    "Retrieval finds wrong chunks": {
        "symptoms": "Answer is off-topic or doesn't address the question",
        "causes":   ["Chunk too large (contains many topics)", "Poor embedding model for domain"],
        "fixes":    ["Smaller chunks (200-400 chars)", "Domain-specific embedding model",
                     "Hybrid search (keyword + semantic)"]
    },
    "Chunks miss key information": {
        "symptoms": "Model says 'I don't know' but answer is in the documents",
        "causes":   ["Chunk boundary cut the relevant sentence",
                     "top_k too small", "Query and document phrasing too different"],
        "fixes":    ["Add overlap between chunks", "Increase top_k to 5-7",
                     "Query expansion (rephrase query multiple ways and merge results)"]
    },
    "Model ignores retrieved context": {
        "symptoms": "Answer doesn't match the retrieved chunks at all",
        "causes":   ["LLM is too small", "Prompt not clear about using only context"],
        "fixes":    ["Use larger/better LLM", "Stronger prompt instructions",
                     "Lower temperature"]
    },
    "Too much irrelevant context": {
        "symptoms": "Model is confused, answer is vague",
        "causes":   ["top_k too high", "All chunks have low similarity scores"],
        "fixes":    ["Filter chunks below similarity threshold",
                     "Reduce top_k to 2-3", "Check if query is answerable"]
    },
    "Hallucination despite retrieval": {
        "symptoms": "Model generates facts not in the retrieved context",
        "causes":   ["Model overrides context with training knowledge",
                     "Prompt not clear enough"],
        "fixes":    ["Explicit 'only use context' instruction in system prompt",
                     "Ask model to quote from context",
                     "Use a model that follows grounding instructions more reliably"]
    }
}

for failure, info in failures.items():
    print(f"\n{failure}")
    print(f"  Symptoms: {info['symptoms']}")
    print("  Causes:")
    for cause in info['causes']:
        print(f"    - {cause}")
    print("  Fixes:")
    for fix in info['fixes']:
        print(f"    - {fix}")
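
Two of the fixes above, hybrid search and query expansion, both end with several ranked lists that need to be merged into one. The dictionary does not say how to merge them; one common option is reciprocal rank fusion (RRF), sketched here. The function name, the chunk ids, and the example rankings are made up, and `k=60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of chunk ids; ids ranked high in many lists come first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

# Hypothetical results for three rephrasings of the same question
rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
print(reciprocal_rank_fusion(rankings))
# ['b', 'a', 'c', 'd']
```

RRF only uses rank positions, so it can combine keyword search and embedding search even though their raw scores are on different scales.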

Evaluating RAG Quality

# Simple evaluation: does the answer contain key information?
def evaluate_rag_answer(answer: str, expected_keywords: List[str]) -> dict:
    answer_lower     = answer.lower()
    found_keywords   = [k for k in expected_keywords if k.lower() in answer_lower]
    missing_keywords = [k for k in expected_keywords if k not in found_keywords]
    # Guard against an empty keyword list to avoid dividing by zero
    coverage = len(found_keywords) / len(expected_keywords) if expected_keywords else 0.0

    return {
        'coverage':         coverage,
        'found_keywords':   found_keywords,
        'missing_keywords': missing_keywords
    }

# Test cases
test_cases = [
    {
        'question': "What causes overfitting?",
        'keywords': ['complex', 'training', 'gap', 'regularization']
    },
    {
        'question': "How does cross-validation work?",
        'keywords': ['k-fold', 'split', 'average', 'estimate']
    },
]

print("RAG Evaluation Results:")
print("-" * 60)
for test in test_cases:
    result = rag.answer(test['question'])
    eval_  = evaluate_rag_answer(result['answer'], test['keywords'])

    print(f"\nQ: {test['question']}")
    print(f"A: {result['answer'][:150]}...")
    print(f"Keyword coverage: {eval_['coverage']:.1%}")
    print(f"Missing: {eval_['missing_keywords']}")

For production, use RAGAS (Retrieval Augmented Generation Assessment), which scores faithfulness, answer relevancy, and context precision automatically.
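
Keyword coverage scores the generated answer. Retrieval can also be checked on its own, which helps tell a retrieval failure apart from a generation failure. The sketch below computes two standard retrieval metrics, hit rate at k and mean reciprocal rank, from lists of retrieved source names. The function names and the example data are invented for illustration; they are not measured results from the pipeline above.

```python
from typing import List

def hit_rate_at_k(retrieved: List[List[str]], expected: List[str], k: int = 3) -> float:
    """Fraction of questions whose expected source appears in the top-k results."""
    if not expected:
        return 0.0
    hits = sum(1 for got, want in zip(retrieved, expected) if want in got[:k])
    return hits / len(expected)

def mean_reciprocal_rank(retrieved: List[List[str]], expected: List[str]) -> float:
    """Average of 1/rank of the expected source (0 when it was not retrieved)."""
    if not expected:
        return 0.0
    total = 0.0
    for got, want in zip(retrieved, expected):
        if want in got:
            total += 1.0 / (got.index(want) + 1)
    return total / len(expected)

# Made-up example: sources returned for three questions, and the source each should hit
retrieved_sources = [
    ["doc7.txt", "doc4.txt", "doc3.txt"],   # expected source at rank 1
    ["doc2.txt", "doc5.txt", "doc1.txt"],   # expected source at rank 2
    ["doc1.txt", "doc2.txt", "doc3.txt"],   # expected source missing
]
expected_sources = ["doc7.txt", "doc5.txt", "doc6.txt"]

print(f"Hit rate@3: {hit_rate_at_k(retrieved_sources, expected_sources, k=3):.2f}")   # 0.67
print(f"MRR:        {mean_reciprocal_rank(retrieved_sources, expected_sources):.2f}")  # 0.50
```

With the `RAGIndexer` above, the retrieved lists would come from `[r['source'] for r in indexer.retrieve(question)]` for each test question.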


Quick Cheat Sheet

Step       | What it does              | Key decision
-----------|---------------------------|-------------------------------------
Chunking   | Split docs into pieces    | Size 300-600 chars, overlap 50-100
Embedding  | Convert chunks to vectors | all-MiniLM-L6-v2 to start
Indexing   | Store in vector DB        | ChromaDB for dev, Pinecone for prod
Retrieval  | Find top-k similar chunks | k=3 to 5 usually works
Generation | Build prompt + call LLM   | Include retrieved context explicitly

Problem                | Quick fix
-----------------------|---------------------------------------------
Wrong chunks retrieved | Smaller chunks, better embedding model
Answer not in chunks   | Add overlap, increase top-k
Model ignores context  | Stronger prompt, lower temperature
Too slow               | Smaller embedding model, FAISS ANN index
Hallucinations         | Explicit "only use context" in system prompt

Practice Challenges

Level 1:
Pick any 10 Wikipedia articles on a topic you know. Chunk them, embed them, and store in ChromaDB. Ask 5 questions where you already know the answer. Did RAG get them right?

Level 2:
Compare three chunking strategies (fixed-size, sentence-aware, paragraph-aware) on the same document set. For each strategy, retrieve the top-3 chunks for 5 queries. Which strategy retrieves more relevant chunks by eye?

Level 3:
Build a complete RAG pipeline with source citations. For each answer, show which document chunk it came from and highlight the specific sentence that grounded the answer. Add a similarity threshold: if the top-k chunks all score below 0.4, return "I don't have information about this" instead of guessing.


Next up, Post 99: Build a Chatbot With Memory. Conversation history, context management, multi-turn dialogue. We build a chatbot that actually remembers what you said earlier in the conversation.
