Akhilesh Pothuri

Build a RAG Pipeline from Scratch in Python: A Step-by-Step Guide

Turn any folder of documents into an AI that actually knows what it's talking about — no hallucinations, no expensive services, just Python and your own data.
Your chatbot just told a customer that your company offers a 90-day return policy. You don't. You never have.

This is the hallucination problem in action — and it's why businesses are terrified of deploying AI on anything that actually matters. Large language models don't know things; they predict what sounds right based on patterns they've seen. They'll cite fake court cases, invent product features, and reference policies that exist only in the statistical space between their training tokens.

Retrieval-Augmented Generation (RAG) fixes this by giving your AI something it desperately needs: a cheat sheet. Instead of guessing, the model retrieves actual documents from your data — your policies, your docs, your knowledge base — and answers based on what it finds. No more confident fabrications. Just grounded responses backed by sources you control.

By the end of this guide, you'll have a working RAG pipeline that turns any folder of documents into an AI assistant that actually knows what it's talking about.

Why Your AI Keeps Making Things Up (And How RAG Fixes It)

You've probably noticed something weird about ChatGPT: it'll confidently tell you that a made-up research paper exists, complete with fake authors and a plausible-sounding title. Ask it about your company's Q3 sales figures, and it'll happily invent numbers that sound reasonable but are completely wrong.

This isn't a bug—it's how these systems fundamentally work. Large Language Models are sophisticated pattern-completion machines. They've learned that when someone asks "Who wrote The Great Gatsby?", the pattern typically ends with "F. Scott Fitzgerald." But when you ask about something not in their training data, they don't say "I don't know." They complete the pattern with whatever sounds plausible. They're professional bullshitters with perfect confidence.

Think of it like this: a closed-book exam forces you to answer from memory alone. You'll fill in gaps with educated guesses—sometimes embarrassingly wrong ones. An open-book exam lets you flip to the relevant page before answering. You're still doing the thinking, but now it's grounded in actual source material.

That's exactly what RAG does. Instead of asking your LLM to conjure answers from its training weights, you first retrieve relevant documents from your own knowledge base, then augment the prompt with that context before generation. The AI gets a cheat sheet.

You need RAG when:

  • Your data is private (internal docs, customer records, proprietary research)
  • The information is recent (anything after the model's training cutoff)
  • Accuracy matters more than creativity (legal, medical, financial contexts)

Basically: if your AI needs to cite its sources, you need RAG.

The Three Building Blocks of Every RAG System

Think of a RAG system like a well-organized research assistant. Before answering your question, they do three things: organize the reference materials into manageable pieces, understand what those pieces are actually about, and know which ones to grab when you ask something specific.

Document Processing: The Art of Good Chunking

Raw documents are messy—PDFs with weird formatting, long articles, nested headers. You can't feed a 50-page document to an LLM and expect precision. So we split documents into chunks: smaller, digestible pieces.

Here's what most tutorials don't tell you: chunk size is a surprisingly high-stakes decision. Too small (50 tokens), and you lose context—imagine trying to understand a paragraph by reading one sentence at a time. Too large (2000+ tokens), and your retrieval becomes imprecise, like searching a library that only has "Science" as a category. Most production systems land between 256-512 tokens, with some overlap between chunks so ideas don't get sliced mid-thought.

Vector Embeddings: Teaching Computers That "Dog" and "Puppy" Are Related

Traditional search matches keywords. Type "automobile" and it won't find documents about "cars." Embedding models solve this by converting text into numerical vectors—long lists of numbers that capture meaning. Similar concepts cluster together in this mathematical space. "Happy," "joyful," and "elated" all land near each other, even though they share no letters.
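To see what "nearby in this mathematical space" means, here's a toy sketch. The 4-dimensional vectors below are invented for illustration (real models like all-MiniLM-L6-v2 produce 384 dimensions), but the cosine-similarity math is exactly what vector databases compute:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented toy "embeddings" -- real models emit hundreds of dimensions
vectors = {
    "dog":   np.array([0.9, 0.8, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.9, 0.2, 0.1]),
    "car":   np.array([0.1, 0.0, 0.9, 0.8]),
}

print(cosine_similarity(vectors["dog"], vectors["puppy"]))  # high: related concepts
print(cosine_similarity(vectors["dog"], vectors["car"]))    # low: unrelated concepts
```

Swap the toy vectors for real `model.encode(...)` output and the same function ranks actual sentences by meaning.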

The Retrieval-Generation Loop

When a question arrives: embed it, find the closest-matching chunks in your vector database, stuff those chunks into the prompt, then ask the LLM to answer using only that provided context. The model becomes a reasoning engine over your curated evidence—not a guesser.
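Stripped to its essentials, that loop is a handful of lines. In this sketch, `embed`, `search`, and `llm` are placeholder callables standing in for whatever embedder, vector store, and model you choose; the concrete versions are built out in the sections that follow:

```python
def rag_answer(question: str, embed, search, llm, top_k: int = 4) -> str:
    """One pass through the retrieve-augment-generate loop.

    embed:  text -> vector
    search: (vector, k) -> list of chunk strings
    llm:    prompt -> answer string
    """
    chunks = search(embed(question), top_k)        # retrieve
    context = "\n---\n".join(chunks)               # augment
    prompt = (
        "Answer using ONLY the context below.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}\nANSWER:"
    )
    return llm(prompt)                             # generate
```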

Setting Up Your Python Environment

Before writing any code, let's get your workspace ready. Think of this like prepping ingredients before cooking—five minutes of setup saves hours of frustration later.

Install the Core Libraries

Open your terminal and run:

pip install sentence-transformers chromadb openai python-dotenv langchain

Here's what each does:

  • sentence-transformers: Converts text into those numerical vectors we discussed. Runs entirely on your machine—no API calls needed.
  • chromadb: Our vector database. Stores embeddings and handles similarity search.
  • openai: Talks to GPT models for the generation step. (Want to stay fully local? Swap this for ollama and run Llama or Mistral on your hardware.)
  • python-dotenv: Keeps API keys out of your code.
  • langchain: Used here only for its RecursiveCharacterTextSplitter in the chunking step.

Why ChromaDB Instead of Pinecone?

Pinecone is excellent for production, but it requires account setup, API keys, and cloud infrastructure. ChromaDB runs as a local file—zero configuration, same vector search concepts. Once you understand the patterns here, migrating to Pinecone (or Weaviate, or Qdrant) takes maybe 20 lines of code changes. Learn the concepts first; optimize infrastructure later.

Project Structure

Create this folder layout:

rag-pipeline/
├── data/
│   └── documents/      # Your source files go here
├── src/
│   ├── chunker.py      # Text splitting logic
│   ├── embedder.py     # Vector generation
│   ├── retriever.py    # Search functionality
│   └── generator.py    # LLM integration
├── chroma_db/          # Auto-created by ChromaDB
├── .env                # Your API keys
└── main.py             # Orchestrates everything

This separation keeps each RAG component testable and swappable. Let's build the chunker first.

Building the Indexing Pipeline: From Documents to Vectors

This stage turns raw files into something searchable. You can't just throw a whole cookbook into a blender and expect good results—you need to prep your documents into bite-sized pieces, translate them into a language computers understand (vectors), and organize them so they're easy to find later.

Step 1: Loading and Chunking Your Documents

Raw documents are too long for LLMs to process efficiently. We split them into "chunks"—smaller passages that capture complete thoughts. Here's where the 256-token sweet spot comes from: it's large enough to preserve context, small enough to fit multiple relevant chunks into an LLM's context window.

# src/chunker.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pathlib import Path

def load_and_chunk_markdown(directory: str, chunk_size: int = 1024, overlap: int = 128):
    """
    Load markdown files and split into overlapping chunks.
    ~256 tokens ≈ 1024 characters for English text.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,  # Prevents cutting sentences mid-thought
        separators=["\n## ", "\n### ", "\n\n", "\n", " "]  # Respects markdown structure
    )

    chunks = []
    for file_path in Path(directory).glob("**/*.md"):
        content = file_path.read_text(encoding="utf-8")
        file_chunks = splitter.split_text(content)

        for i, chunk in enumerate(file_chunks):
            chunks.append({
                "text": chunk,
                "source": str(file_path),
                "chunk_index": i
            })

    return chunks

The overlap parameter is crucial—without it, you'd slice sentences in half, losing meaning at chunk boundaries.

Building the Query Pipeline: From Question to Answer

Now comes the part where your pipeline actually thinks. You've got chunks sitting in a vector database, but a user just typed "How do I handle authentication?" How does that question find the right chunks?

Step 1: Embed the Question

The magic trick is simple: convert the user's question into the exact same vector space as your document chunks. Same model, same dimensions, same mathematical universe.

def query_to_vector(question: str, model) -> list[float]:
    """Transform user question into searchable vector."""
    return model.encode(question).tolist()

When both questions and documents live in the same 384-dimensional space, "similar meaning" becomes "nearby points." A question about "authentication" lands close to chunks discussing "login," "credentials," and "OAuth"—even if those exact words never appear in the question.

Step 2: Semantic Search

Vector databases excel at one thing: finding the k-nearest neighbors blazingly fast. You're typically retrieving 3-5 chunks—enough context to be useful, not so much that you overwhelm the LLM.

def retrieve_relevant_chunks(question: str, collection, model, top_k: int = 4):
    """Find the chunks most semantically similar to the question."""
    query_vector = query_to_vector(question, model)

    results = collection.query(
        query_embeddings=[query_vector],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )

    return results["documents"][0], results["metadatas"][0]

Step 3: Crafting the Prompt

Here's where developers often stumble. You can't just dump retrieved chunks into a prompt—LLMs get confused when context appears without explanation. The fix: explicit framing.

def build_rag_prompt(question: str, chunks: list[str]) -> str:
    context = "\n---\n".join(chunks)

    return f"""Answer the question using ONLY the context below. 
If the context doesn't contain the answer, say "I don't have that information."

CONTEXT:
{context}

QUESTION: {question}

ANSWER:"""

That instruction—"ONLY the context below"—prevents hallucination. The separator lines help the LLM distinguish between different source chunks.

What RAG Won't Do (And How to Make It Better)

Let's address the elephant in the room: RAG isn't magic, and it won't solve every problem you throw at it.

The Hallucination Myth

That prompt instruction telling the LLM to use "ONLY the context below"? The model can still ignore it. LLMs are probabilistic—they generate statistically likely continuations, not logically constrained outputs. If your retrieved context says "revenue was $4.2 million" but the model's training data suggests tech companies typically report in billions, it might "helpfully" adjust the number. RAG reduces hallucinations by giving the model relevant information. It doesn't eliminate them.

When Vector Search Fails You

Semantic search excels at finding conceptually similar content, but it struggles with:

  • Exact matches: "What's the policy for PTO-2024-Rev3?" won't find that specific document code
  • Numbers and dates: "Sales figures from Q3 2023" might return Q2 or Q4 results
  • Proper names: Searching "John Smith's project" could surface any project discussion

This is why hybrid search exists—combining vector similarity with keyword matching (BM25). Most production systems use both.
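To make the idea concrete, here's a deliberately simplified sketch where a plain keyword-overlap score stands in for BM25 (a real system would use a library like rank_bm25, and the 50/50 weighting is an arbitrary illustration):

```python
def keyword_score(query: str, doc: str) -> float:
    """Toy stand-in for BM25: fraction of query terms that appear in the doc."""
    terms = query.lower().split()
    return sum(term in doc.lower() for term in terms) / len(terms)

def hybrid_rank(query: str, docs: list[str], vector_scores: list[float],
                alpha: float = 0.5) -> list[str]:
    """Blend vector similarity with keyword overlap; alpha weights the vector side."""
    scored = [
        (alpha * vec + (1 - alpha) * keyword_score(query, doc), doc)
        for doc, vec in zip(docs, vector_scores)
    ]
    return [doc for _, doc in sorted(scored, reverse=True)]

docs = ["PTO-2024-Rev3 policy update", "Vacation guidelines overview"]
vector_scores = [0.55, 0.60]  # pure semantic search slightly prefers the wrong doc
print(hybrid_rank("PTO-2024-Rev3", docs, vector_scores)[0])
```

The exact-match term rescues the document code that semantic similarity alone would have ranked second.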

Quick Wins That Actually Work

  • Re-ranking: Retrieve 20 chunks, then use a cross-encoder model to re-score and keep the top 5. Dramatically improves relevance.
  • Query caching: Repeated questions don't need fresh embedding calls. A simple dictionary cache cuts latency and API costs.
  • Chunk size tuning: Legal documents need larger chunks (1000+ tokens) to preserve clause relationships. FAQs work better with smaller chunks (200-300 tokens). There's no universal "right" size—test with your actual data.
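The caching win, for example, is a few lines: wrap whatever embedding function you use in an in-memory dictionary keyed on the query text (`embed_fn` below is a placeholder for your embedder):

```python
def make_cached_embedder(embed_fn):
    """Wrap an embedding function so repeated queries skip the model call."""
    cache: dict[str, list[float]] = {}

    def cached(text: str) -> list[float]:
        if text not in cache:
            cache[text] = embed_fn(text)  # only pay for a cache miss
        return cache[text]

    cached.cache = cache  # exposed so you can inspect hit rates
    return cached
```

For plain functions of hashable arguments, Python's `functools.lru_cache` gives you the same behavior plus an eviction policy.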

Running the Complete Pipeline and Next Steps

Now let's put everything together. Here's a complete script that indexes a folder of markdown notes and lets you ask questions:

# main.py
import chromadb
from openai import OpenAI
from sentence_transformers import SentenceTransformer

from src.chunker import load_and_chunk_markdown
from src.retriever import retrieve_relevant_chunks
from src.generator import build_rag_prompt  # the prompt builder from Step 3

model = SentenceTransformer("all-MiniLM-L6-v2")
chroma = chromadb.PersistentClient(path="./chroma_db")
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def index_notes_folder(folder_path: str):
    """Chunk, embed, and store every markdown file in a folder."""
    chunks = load_and_chunk_markdown(folder_path)
    collection = chroma.get_or_create_collection("notes")
    collection.add(
        ids=[f"{c['source']}-{c['chunk_index']}" for c in chunks],
        documents=[c["text"] for c in chunks],
        metadatas=[{"source": c["source"]} for c in chunks],
        embeddings=[model.encode(c["text"]).tolist() for c in chunks],
    )
    return collection

def ask_with_sources(question: str, collection, k: int = 3):
    """Ask a question and show which chunks informed the answer."""
    docs, metas = retrieve_relevant_chunks(question, collection, model, top_k=k)
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": build_rag_prompt(question, docs)}],
    )
    answer = response.choices[0].message.content

    # Show the receipts
    print(f"\n📝 Answer: {answer}\n")
    print("📚 Sources used:")
    for i, (doc, meta) in enumerate(zip(docs, metas), 1):
        print(f"  {i}. {meta['source']}")
        print(f"     Preview: {doc[:100]}...")

    return answer, docs

# Usage
collection = index_notes_folder("./my_notes")
ask_with_sources("What did I write about project deadlines?", collection)

Where to Go From Here

You've built a working RAG pipeline—but production systems need more. Three areas to explore:

  1. Evaluation metrics: RAGAS and TruLens measure retrieval precision and answer faithfulness. Without metrics, you're tuning blind.

  2. Production databases: ChromaDB is great on a single machine. For real applications at scale, consider Pinecone, Weaviate, or pgvector (if you're already on Postgres).

  3. Hybrid retrieval: Combine vector search with BM25 keyword matching. Libraries like rank_bm25 integrate easily and handle exact-match queries that pure semantic search misses.


Full working code: GitHub →



RAG isn't magic—it's retrieval plus generation, stitched together with embeddings. The pipeline you've built here handles 80% of real-world use cases: chunk your documents, embed them, find relevant pieces, and let the LLM synthesize an answer with actual sources. Start with this foundation, measure what breaks, then add complexity only where the metrics demand it.

Key Takeaways

  • RAG = search engine + LLM: You retrieve relevant chunks via vector similarity, then pass them as context to a language model—giving it knowledge it never saw during training.
  • Chunking strategy matters more than model choice: Overlapping chunks (this guide used ~256 tokens with a small overlap) preserve context across boundaries; poor chunking breaks even the best embeddings.
  • Always return sources: The top_k results aren't just for the LLM—showing users where answers came from builds trust and lets them verify (or correct) the output.

What's your biggest RAG challenge—chunking strategy, retrieval quality, or something else entirely? Drop it in the comments.
