You're staring at a complex codebase, desperately searching for answers buried in thousands of files. Traditional search fails you. Documentation is scattered. Your AI assistant gives generic responses because it doesn't know your specific context.
This is exactly why Retrieval-Augmented Generation (RAG) has become the backbone of modern AI agents. RAG bridges the gap between vast knowledge bases and contextual AI responses, letting you build agents that actually understand your data.
In this hands-on Python RAG tutorial, I'll walk you through building a production-ready RAG system from scratch. No theoretical fluff — just practical code you can deploy today.

Table of Contents
- Understanding RAG: Beyond the Hype
- Setting Up Your Python RAG Environment
- Building the Document Processing Pipeline
- Implementing Vector Storage and Retrieval
- Creating the RAG Agent
- Advanced RAG Patterns for Production
- Frequently Asked Questions
Understanding RAG: Beyond the Hype
RAG isn't just another AI buzzword. It's a fundamental shift in how we build intelligent systems that need to work with specific, up-to-date information.
The core problem RAG solves is simple: Large Language Models (LLMs) have a knowledge cutoff. They can't access your internal documents, recent updates, or domain-specific data. RAG fixes this by retrieving relevant information and injecting it into the model's context.
Here's how it works in practice:
- Document Ingestion: Your documents get chunked and embedded into vectors
- Query Processing: User queries are converted to the same vector space
- Retrieval: Similar document chunks are found using vector similarity
- Generation: The LLM generates responses using retrieved context
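The retrieval step above boils down to nearest-neighbor search in embedding space. Here's a toy sketch using hand-made 3-dimensional vectors in place of real model embeddings, just to make the mechanics concrete:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, chunk_vecs, k=2):
    # Rank chunk indices by similarity to the query vector
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy "embeddings" standing in for a real embedding model's output
chunks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
print(retrieve([1.0, 0.0, 0.0], chunks, k=2))  # → [0, 2]
```

Real systems replace the toy vectors with model embeddings and the linear scan with an approximate nearest-neighbor index, but the ranking logic is the same.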
This architecture has transformed how we build AI agents in 2026. Instead of fine-tuning models on every dataset, we can dynamically pull relevant information at inference time.
Setting Up Your Python RAG Environment
Let's get our hands dirty with code. I'll show you how to set up a RAG system using LangChain and ChromaDB — two of the most reliable tools in the RAG ecosystem.
First, install the essential dependencies:
```bash
pip install langchain chromadb openai tiktoken pypdf
```
Here's our foundational RAG setup:
```python
import os
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA


class RAGSystem:
    def __init__(self, openai_api_key):
        os.environ["OPENAI_API_KEY"] = openai_api_key

        # Initialize embeddings and LLM
        self.embeddings = OpenAIEmbeddings()
        self.llm = OpenAI(temperature=0.2)

        # Initialize vector store
        self.vector_store = None
        self.retriever = None

    def load_documents(self, pdf_paths):
        """Load and process documents into the vector store"""
        documents = []

        # Load PDFs
        for pdf_path in pdf_paths:
            loader = PyPDFLoader(pdf_path)
            docs = loader.load()
            documents.extend(docs)

        # Split documents into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", " ", ""]
        )
        chunks = text_splitter.split_documents(documents)

        # Create vector store
        self.vector_store = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory="./chroma_db"
        )

        # Set up retriever
        self.retriever = self.vector_store.as_retriever(
            search_kwargs={"k": 4}
        )

        return len(chunks)

    def query(self, question):
        """Query the RAG system"""
        if not self.retriever:
            return "No documents loaded. Please load documents first."

        # Create QA chain
        qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",
            retriever=self.retriever,
            return_source_documents=True
        )

        result = qa_chain({"query": question})
        return result["result"]
```
This foundation gives you a working RAG system in under 70 lines of code. But production systems need more sophistication.
Building the Document Processing Pipeline
The quality of your RAG system depends heavily on how you process documents. Poor chunking leads to irrelevant retrievals. Poor embeddings lead to missed context.
The key to robust document processing is semantic chunking over naive character splitting. Instead of blindly splitting at 1000 characters, consider document structure:
- Split on headers and sections first
- Preserve code blocks intact
- Maintain table structures
- Keep related paragraphs together
This approach dramatically improves retrieval accuracy.
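As a minimal illustration of the "split on headers first" idea, here's a sketch that breaks a markdown document at header boundaries, so each chunk keeps its heading attached to the body that follows it (a real pipeline would then apply size limits within each section):

```python
import re

def split_on_headers(markdown_text):
    # Split at lines starting with "# " or "## ", using a zero-width
    # lookahead so the header stays attached to its own section.
    parts = re.split(r"(?m)^(?=#{1,2} )", markdown_text)
    return [p.strip() for p in parts if p.strip()]

doc = """# Setup
Install the package.

## Usage
Call the API.

# FAQ
Common questions."""

sections = split_on_headers(doc)
print(len(sections))  # → 3
```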
Implementing Vector Storage and Retrieval
Vector storage is where RAG systems often break down in production. You need fast similarity search, metadata filtering, and persistent storage.
ChromaDB works well for prototypes, but consider these alternatives for production:
- Pinecone: Managed vector database with excellent performance
- Weaviate: Open-source with built-in vectorization
- Qdrant: Fast and feature-rich vector search engine
The retrieval strategy matters just as much as storage. Basic similarity search often returns redundant results. Instead, implement hybrid retrieval:
- Semantic similarity for conceptual matches
- Keyword matching for exact terms
- Metadata filtering for domain constraints
- Re-ranking to improve relevance
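The simplest way to combine the semantic and keyword signals above is weighted score fusion. This sketch assumes you already have normalized scores from both retrievers (the document names and score values below are made up for illustration):

```python
def hybrid_rank(docs, semantic_scores, keyword_scores, alpha=0.7):
    # Blend the two score lists; alpha weights the semantic side
    fused = [alpha * s + (1 - alpha) * k
             for s, k in zip(semantic_scores, keyword_scores)]
    ranked = sorted(zip(docs, fused), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked]

docs = ["auth guide", "billing FAQ", "api keys"]
semantic = [0.82, 0.30, 0.78]   # e.g. cosine similarities
keyword  = [0.40, 0.10, 0.95]   # e.g. normalized BM25 scores
print(hybrid_rank(docs, semantic, keyword))
# → ['api keys', 'auth guide', 'billing FAQ']
```

Tuning `alpha` per domain is common: technical corpora with lots of exact identifiers usually benefit from a heavier keyword weight.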
Creating the RAG Agent
Now let's build an actual agent that can reason about retrieved information. This goes beyond simple Q&A to include memory, tool use, and multi-step reasoning.
```python
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
from langchain.memory import ConversationBufferMemory


class RAGAgent:
    def __init__(self, rag_system):
        self.rag_system = rag_system
        self.memory = ConversationBufferMemory(
            memory_key="chat_history",
            return_messages=True
        )

        # Define tools
        tools = [
            Tool(
                name="Knowledge Base",
                func=self.rag_system.query,
                description="Search the knowledge base for information about documents and code"
            )
        ]

        # Initialize agent
        self.agent = initialize_agent(
            tools=tools,
            llm=self.rag_system.llm,
            agent=AgentType.CONVERSATIONAL_REACT_DESCRIPTION,
            memory=self.memory,
            verbose=True
        )

    def chat(self, message):
        """Chat with the RAG agent"""
        return self.agent.run(input=message)

    def reset_memory(self):
        """Clear conversation history"""
        self.memory.clear()


# Usage example
rag_system = RAGSystem("your-openai-key")
rag_system.load_documents(["docs/manual.pdf", "docs/api.pdf"])

agent = RAGAgent(rag_system)
response = agent.chat("How do I authenticate with the API?")
print(response)
```
This agent can maintain conversation context while pulling from your knowledge base. It's the foundation for more complex agentic workflows.
Advanced RAG Patterns for Production
Building production RAG systems taught me several hard lessons. Here are the patterns that actually work:
Query Routing
Not every query needs RAG. Simple factual questions might be better served by the base LLM. Complex domain questions need document retrieval. Implement query classification:
```python
def route_query(query):
    # is_factual_query / is_domain_specific are placeholders for your
    # own classifiers (keyword rules, a small model, or an LLM call)
    if is_factual_query(query):
        return "direct_llm"
    elif is_domain_specific(query):
        return "rag_pipeline"
    else:
        return "hybrid_approach"
```
Multi-Step Retrieval
Single-shot retrieval often misses complex questions. Instead, break queries into sub-questions:
- Analyze the user's intent
- Generate sub-queries
- Retrieve for each sub-query
- Synthesize results
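The retrieve-and-merge part of that loop is straightforward to sketch. This version assumes you already have sub-queries (step 2, typically generated by an LLM) and a retrieval function; the stub corpus below is made up for illustration:

```python
def multi_step_retrieve(sub_queries, retrieve_fn, k_per_query=2):
    # Retrieve for each sub-query, then merge results while
    # dropping duplicates and preserving first-seen order
    seen, merged = set(), []
    for sub_query in sub_queries:
        for doc in retrieve_fn(sub_query)[:k_per_query]:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Stub retriever standing in for a real vector search
corpus = {
    "how to authenticate": ["auth.md", "api.md"],
    "rate limits": ["limits.md", "api.md"],
}
result = multi_step_retrieve(list(corpus), corpus.get)
print(result)  # → ['auth.md', 'api.md', 'limits.md']
```

The deduplicated set then goes into a single synthesis prompt (step 4), which tends to outperform answering each sub-query separately.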
Confidence Scoring
Always return confidence scores with RAG responses. This helps users understand reliability and enables fallback strategies.
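A crude but useful starting point is to treat the best retrieval similarity as the confidence signal and threshold it. The 0.75 cutoff below is an assumed example value; calibrate it against your own corpus:

```python
def confidence(similarity_scores, threshold=0.75):
    # Use the best retrieval score as a rough confidence signal;
    # below the threshold, the caller should fall back or abstain
    if not similarity_scores:
        return 0.0, "no_context"
    best = max(similarity_scores)
    label = "high" if best >= threshold else "low"
    return best, label

score, label = confidence([0.62, 0.81, 0.55])
print(score, label)  # → 0.81 high
```

A "low" or "no_context" label is exactly where the fallback strategies mentioned above kick in: answer from the base LLM, ask a clarifying question, or decline.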
The combination of these patterns creates robust, production-ready RAG systems that developers can trust.
Frequently Asked Questions
Q: What's the optimal chunk size for RAG in Python?
Chunk size depends on your content type and use case. For technical documentation, 800-1200 characters work well with 200 character overlap. Code files need semantic chunking by function or class boundaries rather than fixed sizes.
Q: How do I handle multiple document types in my RAG pipeline?
Use document-type-specific loaders (PyPDFLoader for PDFs, UnstructuredMarkdownLoader for Markdown) and maintain document metadata. This lets you filter retrievals by document type and apply type-specific processing rules.
Q: Why is my RAG system returning irrelevant results?
Common causes include poor chunking strategy, inadequate embedding model, or insufficient context in queries. Try semantic chunking, experiment with different embedding models (OpenAI, Sentence-Transformers, Cohere), and implement query expansion techniques.
Q: How do I scale RAG beyond a few documents?
For production scale, migrate from ChromaDB to dedicated vector databases like Pinecone or Qdrant. Implement async document processing, use batch embeddings, and consider distributed architectures for very large document collections.
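Batch embedding is the easiest of those wins to show. This sketch assumes an `embed_fn` that takes a list of texts and returns one vector per text (the stub embedder below is a stand-in, not a real model):

```python
def batched(items, batch_size):
    # Yield fixed-size batches so embedding calls stay under API limits
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_all(chunks, embed_fn, batch_size=64):
    # embed_fn is assumed to map a list of texts to a list of vectors
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

# Stub embedder for illustration: vector = [length of text]
fake_embed = lambda texts: [[float(len(t))] for t in texts]
vecs = embed_all(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vecs)  # → [[1.0], [2.0], [3.0]]
```

Swapping the loop for `asyncio.gather` over async embedding calls is the natural next step once batching alone stops being enough.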
Resources I Recommend
If you're serious about building production RAG systems, these RAG and vector database books provide the theoretical foundation you need to debug complex retrieval issues and optimize performance.
You Might Also Like
- LlamaIndex Tutorial: Build AI Agents with RAG
- Build Chatbot with RAG: Beyond Basic Q&A in 2026
- How to Build AI iOS Apps: Complete CoreML Guide
Ready to build your next AI agent? This RAG foundation will serve you well, whether you're processing internal docs, building customer support bots, or creating domain-specific AI assistants.
📘 Go Deeper: Building AI Agents: A Practical Developer's Guide
185 pages covering autonomous systems, RAG, multi-agent workflows, and production deployment — with complete code examples.
Enjoyed this article?
I write daily about iOS development, AI, and modern tech — practical tips you can use right away.
- Follow me on Dev.to for daily articles
- Follow me on Hashnode for in-depth tutorials
- Follow me on Medium for more stories
- Connect on Twitter/X for quick tips
If this helped you, drop a like and share it with a fellow developer!