Sakshi Srivastava

Behind the Scenes of RAG

Introduction

In my last article, "RAG to Riches", we explored why Retrieval-Augmented Generation (RAG) is transforming the way AI models deliver accurate, up-to-date answers by blending retrieval and generation. As promised, it’s time to roll up our sleeves and see RAG in action—step by step, from setting up your knowledge base to watching your AI answer real questions with live data.

What You’ll Learn

  • How to prepare and structure your knowledge base
  • The basics of embeddings and vector databases
  • How to connect a retriever and a language model
  • How to run real-time Q&A with your own data
  • Tools and code snippets to get you started

Step 1: Define Your Use Case and Gather Data

Start by identifying your use case—customer support, internal documentation, research assistant, etc. Gather all relevant documents, FAQs, manuals, or datasets that your AI will reference.

Step 2: Preprocess and Chunk Your Data

Large documents need to be broken into smaller, meaningful chunks. This makes retrieval more precise and efficient. For example, split a 100-page manual into sections or paragraphs, each focused on a specific topic.
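As a starting point, here is a minimal character-based chunker with overlap. It is a sketch only: the sizes are illustrative, and real pipelines often split on sentence or paragraph boundaries instead.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks so context isn't lost at boundaries.

    Assumes overlap < chunk_size.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks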

Step 3: Create Embeddings and Store Vectors

Transform each chunk into a vector (embedding) that captures its semantic meaning. Use models like Sentence Transformers, OpenAI, or even local tools via Ollama. Store these vectors in a vector database such as ChromaDB, Pinecone, or FAISS for fast similarity search.
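Here is a minimal sketch using Sentence Transformers with ChromaDB; the model name, collection name, and sample chunks are just examples:

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
client = chromadb.Client()  # in-memory; use a persistent client for real data
collection = client.create_collection("manual_chunks")  # example name

chunks = ["Section 1: Getting started ...", "Section 2: Troubleshooting ..."]
embeddings = model.encode(chunks).tolist()

# Store each chunk alongside its embedding for later similarity search
collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)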

Step 4: Build the Retriever

When a user asks a question, convert it to an embedding and search your vector database for the most relevant chunks. This is your retriever at work—bringing back the best context for your model to use.
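Continuing the sketch above (reusing model and collection from Step 3), retrieval is just a similarity search over the stored embeddings:

def retrieve(question: str, top_k: int = 5) -> list[str]:
    """Embed the question and return the most similar stored chunks."""
    query_embedding = model.encode([question]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=top_k)
    return results["documents"][0]  # results come back grouped per query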

Step 5: Connect to a Language Model

Feed the retrieved context and the user’s question to a language model (like GPT, Llama, or Mistral). The model generates a response grounded in the retrieved information, reducing hallucinations and improving accuracy.
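As one possibility, here is a sketch using the Ollama Python client. It assumes a local Ollama server with a pulled model; the model name and sample question are placeholders:

import ollama  # assumes a local Ollama server is running

def generate_answer(question: str, context: str) -> str:
    """Ask the model to answer strictly from the retrieved context."""
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = ollama.chat(
        model="llama3",  # example; any locally pulled model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"]

question = "How do I reset the device?"  # placeholder question
print(generate_answer(question, "\n\n".join(retrieve(question))))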

Step 6: Run, Test, and Iterate

Now, you’re ready to test! Ask real questions and see your AI respond with answers sourced from your own knowledge base. Analyze the results, tweak chunk sizes, retriever settings, or the prompt to improve performance.
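If you expose your pipeline behind the /query endpoint shown in the code example below, a quick smoke test can be as simple as this (the URL and questions are placeholders):

import requests

test_questions = [
    "What is the warranty period?",
    "How do I reset the device?",
]

for question in test_questions:
    resp = requests.post(
        "http://localhost:8000/query",
        json={"query": question, "top_k": 5},
    )
    print(question, "->", resp.json()["answer"])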

*Tools & Frameworks to Make It Easier
*

  • FastAPI: Build REST endpoints for document ingestion and query handling.
  • PyMuPDF: PDF text extraction without external parsers (see the sketch after this list).
  • Ollama: Run LLMs and embedding models locally—no internet required.
  • ChromaDB/FAISS: Popular open-source vector databases.
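
For instance, pulling the text out of a PDF with PyMuPDF takes only a few lines (the file name below is a placeholder):

import fitz  # PyMuPDF

def extract_pdf_text(path: str) -> str:
    """Concatenate the text of every page in the PDF."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

text = extract_pdf_text("manual.pdf")  # placeholder file name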

Code Example: PDF Document Analysis

from fastapi import APIRouter, UploadFile, File, HTTPException
import uuid

router = APIRouter()
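
# Note: text_processor and rag_system are helper objects defined elsewhere
# in the project (see the full code on GitHub)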

@app.post("/upload")
async def upload_documents(file: UploadFile = File(...)):
    """
    Uploads and processes documents for RAG systems by:
    1. Reading file content
    2. Extracting and validating text
    3. Chunking content
    4. Storing in vector database

    Returns document metadata with processing details
    """

    # 1. Read file content asynchronously
    content = await file.read()

    # 2. Decode and validate content
    decoded_text = text_processor.decode_content(content)
    if not decoded_text or len(decoded_text.strip()) < 10:
        raise HTTPException(
            status_code=400,
            detail="File is empty or contains no readable text"
        )

    # 3. Generate unique document ID
    doc_id = str(uuid.uuid4())

    # 4. Chunk text into processable segments
    chunks = text_processor.chunk_text(decoded_text)
    if not chunks:
        raise HTTPException(
            status_code=400,
            detail="Failed to extract meaningful text chunks"
        )

    # 5. Prepare chunks with metadata
    doc_chunks = [
        {
            'id': f"{doc_id}_{index}",
            'text': chunk,
            'source': file.filename,
            'doc_id': doc_id
        }
        for index, chunk in enumerate(chunks)
    ]

    # 6. Store chunks in vector database
    try:
        await rag_system.add_document_chunks(
            chunks=doc_chunks,
            doc_id=doc_id,
            file=file,
            encoding="utf-8"
        )
    except Exception as error:
        print(f"Database error: {error}")
        raise HTTPException(
            status_code=500,
            detail=f"Failed to store document: {str(error)}"
        )

    # 7. Return processing metadata
    return {
        'document_id': doc_id,
        'filename': file.filename,
        'file_size': len(content),
        'chunks_generated': len(chunks),
        'message': 'Document processed successfully'
    } 

# --- Query endpoint ---
from datetime import datetime
from pydantic import BaseModel

# Define structured request/response models
class QueryRequest(BaseModel):
    query: str
    top_k: int = 5  # Default value for optional parameter

class QueryResponse(BaseModel):
    answer: str
    query_time: float

@app.post("/query")
async def query_document(request: QueryRequest):
    """Processes user queries through RAG pipeline"""
    start_time = datetime.now()

    try:
        # 1. Retrieve relevant document chunks
        relevant_chunks = rag_system.search_chunks(
            query_text=request.query,
            result_count=request.top_k
        )

        # 2. Build context from retrieved chunks
        context = "\n\n".join(chunk["text"] for chunk in relevant_chunks)

        # 3. Generate AI-powered response
        answer = rag_system.generate_answer(
            user_query=request.query,
            context=context
        )

        # 4. Calculate processing time
        query_time = (datetime.now() - start_time).total_seconds()

        return QueryResponse(
            answer=answer,
            query_time=query_time
        )

    except Exception as error:
        # Handle specific error types
        error_msg = f"Query processing failed: {str(error)}"
        print(f"ERROR: {error_msg}")
        raise HTTPException(
            status_code=500,
            detail=error_msg
        ) 

This is only part of the code; you can find the full implementation on my GitHub.

Tips for Success

  • Start small: Test with a limited dataset before scaling up.
  • Use open-source tools for flexibility and privacy.
  • Always evaluate your system with real user queries and iterate.

Let’s make AI smarter—one chunk at a time!
