Aayush Mishra


Setting up RAG Locally with Ollama: A Beginner-Friendly Guide

Introduction

Retrieval-Augmented Generation (RAG) is one of the most powerful ways to make LLMs more useful by grounding them in your own data. Instead of relying only on a model's pretraining, RAG lets you ask questions over PDFs, docs, or databases and get precise, context-aware answers.

In this post, we'll set up a local RAG pipeline using Ollama, so you can run everything privately on your machine without cloud costs.


What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that enhances large language models (LLMs) by enabling them to access and utilize external knowledge sources during response generation.

Think of it like this:
Instead of relying only on what the LLM was trained on, RAG lets it search a knowledge base (like your PDFs, notes, or datastores) and combine that retrieved information with its reasoning capabilities.

Quick example:
Imagine you need to understand your company's 120-page policy manual. Instead of manually searching through it, you can just ask: "What's the travel reimbursement policy?" A RAG system will fetch the relevant paragraph from the PDF, and the LLM will generate a clear answer based on that specific content.

At its core, RAG works in two key phases:

1. Document Retrieval Phase

Documents are converted into numerical vectors (embeddings) using specialized models and stored in vector databases optimized for similarity search (e.g., FAISS, ChromaDB).

When you ask a question, the system performs semantic search to find the most relevant chunks of text based on cosine similarity or other distance metrics.

2. Response Generation Phase

The retrieved chunks are passed as context into the LLM prompt template.

The model then generates an answer grounded in your specific documents, combining retrieved facts with natural language generation.
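
To make the retrieval phase concrete, here is a tiny, self-contained sketch of semantic search: every chunk and the question become vectors, and cosine similarity decides which chunks are most relevant. The vectors and chunk names below are made up for illustration only; in the real pipeline the embedding model and the vector database do this work for you.

import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real models produce hundreds of dimensions)
chunks = {
    "travel policy paragraph": [0.9, 0.1, 0.2],
    "office dress code":       [0.1, 0.8, 0.3],
}
question_vector = [0.85, 0.15, 0.25]

# Rank chunks by similarity to the question and keep the best match
best = max(chunks, key=lambda name: cosine_similarity(question_vector, chunks[name]))
print(best)  # -> "travel policy paragraph"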


Why Local RAG with Ollama?

Running RAG locally provides several compelling advantages:

Privacy and Data Security — Your documents never leave your local machine. No risk of sending sensitive data to third-party APIs or cloud services.

Cost Efficiency — Zero API call costs. Perfect for experimentation, high-volume usage, or continuous operation without budget constraints.

Model Experimentation — You can easily test multiple models (Llama 3.1, Mistral, CodeLlama) and compare their performance for your specific use case.

Offline Capability — Complete independence from internet connectivity once models are downloaded.

Ollama makes this possible by providing a simple interface to run optimized open-source models locally with minimal setup overhead.


Implementation Setup

Download and Configure Ollama

Download Ollama from: Ollama Download Page

After downloading, run Ollama and verify the installation:

ollama --version

Pull Required Models

Before running any code, you need to download both the embedding model and the language model:

# Pull the embedding model (essential for document vectorization)
ollama pull nomic-embed-text

# Pull the language model for text generation
ollama pull llama3.1:latest

Verify your models are available:

ollama list

Start Ollama Service

Make Ollama available for API calls:

ollama serve
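Once the service is running, you can optionally confirm that the local API is reachable before moving on. The snippet below is a minimal sketch; it assumes Ollama's default port 11434 and uses the /api/tags endpoint, which lists the models you have pulled.

# Minimal reachability check (assumes Ollama's default port 11434)
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    # The JSON response should mention nomic-embed-text and llama3.1
    print(resp.read().decode()[:300])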

Python Dependencies

Ensure you have Python 3.9+ installed, then add the required libraries to a requirements.txt file:

llama-index-core
llama-index-embeddings-ollama
llama-index-llms-ollama
llama-index-vector-stores-chroma
chromadb
pypdf

Install using:

pip install -r requirements.txt

Library Overview:

  • llama-index-core → Core framework for document loading, indexing, and query processing
  • llama-index-embeddings-ollama → Integrates Ollama for generating embeddings
  • llama-index-llms-ollama → Integrates Ollama as the language model
  • llama-index-vector-stores-chroma → ChromaDB connector for vector storage
  • chromadb → Vector database backend for similarity search
  • pypdf → PDF parsing and text extraction

Complete RAG Implementation

Create your project structure:

project/
├── data/           # Place your PDF files here
├── test_rag.py     # Main implementation
└── requirements.txt

test_rag.py

from pathlib import Path
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama


# Initialize the embedding model
embed_model = OllamaEmbedding(
    model_name="nomic-embed-text",
    request_timeout=300.0,  # Increased timeout for large documents
)

# Initialize the LLM with optimized settings
llm = Ollama(
    model="llama3.1:latest",  # Confirm with `ollama list`
    request_timeout=300.0,
    temperature=0.1,          # Lower temperature for more factual responses
)

# Set global configurations
Settings.embed_model = embed_model
Settings.llm = llm

def load_and_index_documents(data_dir="data"):
    """Load documents and create vector index"""

    # Check if data directory exists
    if not Path(data_dir).exists():
        raise FileNotFoundError(f"Data directory '{data_dir}' not found. Please create it and add your PDF files.")

    # Load documents from the data folder
    docs = SimpleDirectoryReader(data_dir).load_data()

    if not docs:
        raise ValueError(f"No documents found in {data_dir}")


    # Build vector index from documents
    index = VectorStoreIndex.from_documents(docs, embed_model=embed_model)

    return index

def create_query_engine(index, similarity_top_k=3):
    """Create query engine with specified retrieval parameters"""

    query_engine = index.as_query_engine(
        llm=llm,
        similarity_top_k=similarity_top_k,  # Number of relevant chunks to retrieve
        response_mode="compact"             # Compact response generation
    )

    return query_engine

def test_rag_system():
    """Test the RAG system with sample queries"""

    try:
        # Load documents and create index
        index = load_and_index_documents()

        # Create query engine
        query_engine = create_query_engine(index)

        # Sample test queries
        test_queries = [
            "Summarize this document in 3 lines",
            "What are the main topics covered in these documents?",
        ]

        print("RAG System Test Results")
        print("=" * 50)

        for i, query in enumerate(test_queries, 1):
            print(f"\nTest {i}: {query}")
            print("-" * 40)

            try:
                response = query_engine.query(query)
                print(f"Response: {response}")
                print(f"Status: SUCCESS")
            except Exception as e:
                print(f"Error: {str(e)}")
                print(f"Status: FAILED")

            print("-" * 40)

        return True

    except Exception as e:
        print(f"System Error: {str(e)}")
        return False

# Main execution
if __name__ == "__main__":

    print("Starting RAG Pipeline Test...")

    # Test the complete system
    success = test_rag_system()

    if success:
        print("\nRAG system is working correctly!")
        print("You can now use the query_engine to ask questions about your documents.")
    else:
        print("\nRAG system test failed. Check the error messages above.")

Usage Instructions

  1. Prepare your documents: Place PDF files in a data/ folder
  2. Run the test: Execute python test_rag.py to verify everything works
  3. Interactive usage: After successful testing, you can reuse the functions individually (see the sketch below)
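
For example, a minimal interactive loop that reuses the functions defined above might look like this (a sketch only; it assumes the file is named test_rag.py and that the data/ folder is populated):

# interactive_query.py: hypothetical helper script reusing test_rag.py
from test_rag import load_and_index_documents, create_query_engine

index = load_and_index_documents("data")    # build the vector index once
query_engine = create_query_engine(index)   # defaults: top 3 chunks, compact mode

while True:
    question = input("\nAsk a question (or type 'exit'): ").strip()
    if question.lower() == "exit":
        break
    print(query_engine.query(question))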

Testing Your Setup

The code includes a comprehensive testing function that will:

  • Verify document loading works correctly
  • Test vector index creation
  • Run sample queries to ensure end-to-end functionality
  • Provide clear success/failure feedback

Advanced Configuration Options

Chunk Size Optimization: Adjust chunk sizes based on your document types:

Settings.chunk_size = 1024    # Default: good for most documents
Settings.chunk_overlap = 200  # Maintains context between chunks

Retrieval Tuning: Modify similarity search parameters:

query_engine = index.as_query_engine(
    similarity_top_k=5,        # Retrieve more chunks for complex queries
    response_mode="tree_summarize"  # Better for longer documents
)

Production Considerations

For production deployment, consider:

  • Performance: Use appropriate chunk sizes and similarity thresholds based on your document types
  • Monitoring: Implement logging to track query patterns and response quality
  • Storage: ChromaDB provides good performance for most use cases (a persistent ChromaDB sketch follows below); consider FAISS for larger datasets
  • Model Selection: Test different models to find the best balance of speed and accuracy for your specific use case
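
Since the requirements already include chromadb and llama-index-vector-stores-chroma, one natural next step is persisting the index in a local ChromaDB collection instead of rebuilding it in memory on every run. The snippet below is a sketch of that common integration pattern, not a definitive implementation; exact class and argument names can vary across llama-index versions, and it assumes Settings.embed_model and Settings.llm are configured as in test_rag.py.

# Persisting the index in a local ChromaDB collection (sketch; verify against your llama-index version)
import chromadb
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

db = chromadb.PersistentClient(path="./chroma_db")           # on-disk storage
collection = db.get_or_create_collection("rag_docs")
vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

docs = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)
query_engine = index.as_query_engine(similarity_top_k=3)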

This implementation provides a solid foundation for document-based question answering with complete privacy and no ongoing costs. The modular structure makes it easy to customize for specific requirements while maintaining reliability and performance.


For questions or improvements to this setup, feel free to reach out.

This is a basic setup; in future posts, we will explore more advanced, production-grade RAG architectures with better similarity search. Meanwhile, if you want to read more about similarity scoring, the following links can help:
Similarity Score
cosine similarity
