Build a RAG System with Python and a Local LLM (No API Costs)

RAG (Retrieval-Augmented Generation) is the most in-demand LLM skill in 2026. Every company wants to point an AI at their docs, their codebase, their knowledge base — and get useful answers back.

The typical stack involves OpenAI embeddings + GPT-4 + a vector DB. The typical bill involves a credit card.

Here's how to build the same thing entirely on local hardware: Python + Ollama + ChromaDB. No API keys. No per-token costs. Runs on a laptop or a home server.


What We're Building

A RAG pipeline that:

  1. Ingests documents (text files, markdown, PDFs)
  2. Embeds them using a local model
  3. Stores vectors in ChromaDB (local, in-memory or persistent)
  4. Retrieves relevant chunks on query
  5. Generates an answer using a local LLM via Ollama

Total cloud cost: $0.


Prerequisites

  • Python 3.10+
  • Ollama installed with at least one model pulled
  • 8 GB RAM minimum (16 GB recommended for 14B models)
# Install dependencies
pip install chromadb ollama requests

# Pull models — one for embeddings, one for generation
ollama pull nomic-embed-text   # Fast, purpose-built embedding model
ollama pull qwen2.5:14b        # Generation model
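
Before building anything, it's worth a quick sanity check that Ollama is running and both models are available. A minimal sketch, assuming Ollama is on its default port (11434); the check_ollama helper here is just for illustration:

import requests

def check_ollama(url: str = "http://localhost:11434") -> None:
    """Confirm the Ollama server is reachable and the required models are pulled."""
    resp = requests.get(f"{url}/api/tags", timeout=5)
    resp.raise_for_status()
    # Model names include their tag, e.g. "nomic-embed-text:latest"
    available = {m["name"] for m in resp.json().get("models", [])}
    for needed in ("nomic-embed-text:latest", "qwen2.5:14b"):
        status = "ok" if needed in available else "missing (run ollama pull)"
        print(f"  {needed}: {status}")

check_ollama()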

Step 1: Document Ingestion

import os
import glob
from pathlib import Path

def load_documents(docs_dir: str) -> list[dict]:
    """
    Load text documents from a directory.
    Returns list of {content, source, chunk_id} dicts.
    """
    documents = []

    # Supported formats
    patterns = ['**/*.txt', '**/*.md', '**/*.py', '**/*.rst']

    for pattern in patterns:
        for filepath in glob.glob(os.path.join(docs_dir, pattern), recursive=True):
            try:
                with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
                    content = f.read()

                if len(content.strip()) < 50:
                    continue  # Skip tiny files

                # Chunk the document
                chunks = chunk_text(content, chunk_size=500, overlap=50)

                for i, chunk in enumerate(chunks):
                    documents.append({
                        'content': chunk,
                        'source': filepath,
                        'chunk_id': f"{Path(filepath).stem}_{i}"
                    })

            except Exception as e:
                print(f"[warn] Skipping {filepath}: {e}")

    print(f"[ingest] Loaded {len(documents)} chunks from {docs_dir}")
    return documents


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks by word count."""
    words = text.split()
    chunks = []

    i = 0
    while i < len(words):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
        i += chunk_size - overlap  # Slide with overlap

    return chunks
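
A quick check of the chunker on a toy string makes the overlap behavior concrete (tiny sizes here just for illustration):

# 12 words, chunk_size=5, overlap=2 -> windows start at words 0, 3, 6, 9
sample = "one two three four five six seven eight nine ten eleven twelve"
for chunk in chunk_text(sample, chunk_size=5, overlap=2):
    print(chunk)
# one two three four five
# four five six seven eight
# seven eight nine ten eleven
# ten eleven twelve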

Step 2: Local Embeddings with Ollama

nomic-embed-text is a purpose-built embedding model — fast, small (137M parameters), and genuinely good at semantic similarity.

import ollama

def embed_texts(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    """
    Generate embeddings for a list of texts using Ollama.
    Returns list of embedding vectors.
    """
    embeddings = []

    for i, text in enumerate(texts):
        if i % 50 == 0:
            print(f"[embed] Processing chunk {i}/{len(texts)}...")

        response = ollama.embeddings(model=model, prompt=text)
        embeddings.append(response['embedding'])

    return embeddings
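
The loop above makes one request per chunk, which is the simplest thing that works. If your Ollama server and Python client are recent enough to expose the batch embedding call (ollama.embed, backed by /api/embed), you can cut the number of round trips. A sketch under that assumption:

def embed_texts_batched(
    texts: list[str],
    model: str = "nomic-embed-text",
    batch_size: int = 64
) -> list[list[float]]:
    """Embed texts in batches via Ollama's batch embedding endpoint."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = ollama.embed(model=model, input=batch)
        embeddings.extend(response['embeddings'])
        print(f"[embed] {min(i + batch_size, len(texts))}/{len(texts)} chunks done")
    return embeddings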

Step 3: Vector Storage with ChromaDB

import chromadb

def build_vector_store(
    documents: list[dict],
    embeddings: list[list[float]],
    collection_name: str = "local_rag",
    persist_dir: str = "./chroma_db"
) -> chromadb.Collection:
    """
    Store document chunks and their embeddings in ChromaDB.
    """
    client = chromadb.PersistentClient(path=persist_dir)

    # Delete existing collection if rebuilding
    try:
        client.delete_collection(collection_name)
    except Exception:
        pass

    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}  # Cosine similarity
    )

    # Batch insert
    batch_size = 100
    for i in range(0, len(documents), batch_size):
        batch_docs = documents[i:i + batch_size]
        batch_embeddings = embeddings[i:i + batch_size]

        collection.add(
            ids=[doc['chunk_id'] for doc in batch_docs],
            embeddings=batch_embeddings,
            documents=[doc['content'] for doc in batch_docs],
            metadatas=[{'source': doc['source']} for doc in batch_docs]
        )

    print(f"[store] Indexed {len(documents)} chunks into ChromaDB")
    return collection
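
build_vector_store wipes and rebuilds the collection on every run, which keeps the example simple. If you want to add documents later without re-indexing everything, ChromaDB's get_or_create_collection plus upsert covers it. A sketch (add_documents is a hypothetical helper, not part of the pipeline above):

def add_documents(
    documents: list[dict],
    embeddings: list[list[float]],
    collection_name: str = "local_rag",
    persist_dir: str = "./chroma_db"
) -> chromadb.Collection:
    """Add or update chunks in an existing collection without wiping it."""
    client = chromadb.PersistentClient(path=persist_dir)
    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}
    )
    # upsert overwrites chunks with matching ids and inserts new ones
    collection.upsert(
        ids=[doc['chunk_id'] for doc in documents],
        embeddings=embeddings,
        documents=[doc['content'] for doc in documents],
        metadatas=[{'source': doc['source']} for doc in documents]
    )
    return collection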

Step 4: Retrieval

def retrieve_context(
    query: str,
    collection: chromadb.Collection,
    embed_model: str = "nomic-embed-text",
    n_results: int = 5
) -> list[dict]:
    """
    Find the most relevant document chunks for a query.
    """
    # Embed the query using the same model
    query_embedding = ollama.embeddings(model=embed_model, prompt=query)['embedding']

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        include=['documents', 'metadatas', 'distances']
    )

    context_chunks = []
    for doc, meta, dist in zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    ):
        context_chunks.append({
            'content': doc,
            'source': meta.get('source', 'unknown'),
            'relevance': round(1 - dist, 3)  # Convert distance to similarity
        })

    return context_chunks
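
Retrieval is worth testing on its own before wiring up generation. A quick spot check, assuming collection is the object returned by build_vector_store (the sources and scores shown are illustrative):

chunks = retrieve_context("How do I configure logging?", collection, n_results=3)
for c in chunks:
    print(f"[{c['relevance']}] {c['source']}")
# Example output (illustrative):
# [0.87] ./docs/logging.md
# [0.74] ./docs/configuration.md
# [0.69] ./docs/troubleshooting.md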

Step 5: Generation

import requests
import json

def generate_answer(
    query: str,
    context_chunks: list[dict],
    model: str = "qwen2.5:14b",
    ollama_url: str = "http://localhost:11434"
) -> str:
    """
    Generate an answer using retrieved context and a local LLM.
    """
    # Build context block
    context_text = "\n\n---\n\n".join([
        f"Source: {chunk['source']}\n{chunk['content']}"
        for chunk in context_chunks
    ])

    prompt = f"""You are a helpful assistant. Answer the question using ONLY the provided context.
If the answer isn't in the context, say so clearly. Do not make up information.

CONTEXT:
{context_text}

QUESTION: {query}

ANSWER:"""

    response = requests.post(
        f"{ollama_url}/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": 0.1}  # Low temp for factual Q&A
        },
        timeout=120
    )
    response.raise_for_status()
    return response.json()['response'].strip()
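
Generation is the slow step, so for interactive use it often feels better to stream tokens as they arrive instead of waiting for the full answer. A sketch of a streaming variant that reuses the requests and json imports above and takes an already-built prompt string:

def generate_streaming(
    prompt: str,
    model: str = "qwen2.5:14b",
    ollama_url: str = "http://localhost:11434"
) -> str:
    """Stream tokens from Ollama's /api/generate and print them as they arrive."""
    pieces = []
    with requests.post(
        f"{ollama_url}/api/generate",
        json={"model": model, "prompt": prompt, "stream": True,
              "options": {"temperature": 0.1}},
        stream=True,
        timeout=120
    ) as response:
        response.raise_for_status()
        # Streaming responses arrive as newline-delimited JSON objects
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
            pieces.append(chunk.get("response", ""))
            if chunk.get("done"):
                break
    print()
    return "".join(pieces)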

Step 6: Putting It All Together

class LocalRAG:
    """Full local RAG pipeline — zero cloud dependencies."""

    def __init__(
        self,
        docs_dir: str,
        persist_dir: str = "./chroma_db",
        embed_model: str = "nomic-embed-text",
        gen_model: str = "qwen2.5:14b",
        collection_name: str = "local_rag"
    ):
        self.embed_model = embed_model
        self.gen_model = gen_model
        self.collection_name = collection_name
        self.persist_dir = persist_dir

        print(f"[rag] Initializing with docs from: {docs_dir}")

        # Load and chunk documents
        documents = load_documents(docs_dir)

        # Generate embeddings
        print(f"[rag] Embedding {len(documents)} chunks...")
        texts = [doc['content'] for doc in documents]
        embeddings = embed_texts(texts, model=embed_model)

        # Store in ChromaDB
        self.collection = build_vector_store(
            documents, embeddings,
            collection_name=collection_name,
            persist_dir=persist_dir
        )

        print("[rag] Ready.")

    def query(self, question: str, n_context: int = 5, verbose: bool = False) -> str:
        """Answer a question using local retrieval + generation."""

        # Retrieve relevant chunks
        context = retrieve_context(
            question, self.collection,
            embed_model=self.embed_model,
            n_results=n_context
        )

        if verbose:
            print(f"\n[retrieve] Top {len(context)} chunks:")
            for c in context:
                print(f"  [{c['relevance']:.2f}] {c['source']}: {c['content'][:80]}...")

        # Generate answer
        return generate_answer(question, context, model=self.gen_model)


# --- Usage ---
if __name__ == "__main__":
    import sys

    docs_dir = sys.argv[1] if len(sys.argv) > 1 else "./docs"

    rag = LocalRAG(docs_dir=docs_dir)

    print("\nLocal RAG ready. Type your questions (Ctrl+C to exit):\n")
    while True:
        try:
            question = input("Q: ").strip()
            if not question:
                continue
            answer = rag.query(question, verbose=True)
            print(f"\nA: {answer}\n")
        except KeyboardInterrupt:
            print("\nDone.")
            break

Running It

# Index your documents
python rag.py ./my_docs

# Output:
# [ingest] Loaded 342 chunks from ./my_docs
# [rag] Embedding 342 chunks...
# [embed] Processing chunk 0/342...
# [embed] Processing chunk 50/342...
# [store] Indexed 342 chunks into ChromaDB
# [rag] Ready.
#
# Local RAG ready. Type your questions:
#
# Q: What does the authentication module do?
# [retrieve] Top 5 chunks:
#   [0.94] ./my_docs/auth.md: The authentication module handles...
# A: The authentication module handles JWT token validation and...

Performance on Local Hardware

Tested on an Intel tower, Ubuntu 24.04, 32 GB RAM, no GPU:

| Operation | Time | Notes |
| --- | --- | --- |
| Embed 100 chunks | ~8s | nomic-embed-text, CPU |
| Embed 1000 chunks | ~75s | One-time indexing cost |
| Retrieval query | <100ms | ChromaDB is fast |
| Generation (14B) | 10-20s | Depends on answer length |
| Total Q&A latency | ~15-25s | Perfectly fine for async use |

For real-time applications, run the indexing once and keep the collection persistent. Retrieval is nearly instant — only generation adds latency.
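
Reopening the persisted index on a later run is just a matter of pointing ChromaDB at the same directory, so nothing gets re-embedded. A sketch, assuming the index was built earlier with the defaults used above:

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("local_rag")

question = "How does authentication work?"
context = retrieve_context(question, collection)
print(generate_answer(question, context))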


Drop-In OpenAI Replacement

If you have existing code using OpenAI's embedding API, swap it out:

# Before (OpenAI)
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(input=text, model="text-embedding-3-small")
embedding = response.data[0].embedding

# After (Local Ollama — same call pattern, zero cost)
import ollama
response = ollama.embeddings(model="nomic-embed-text", prompt=text)
embedding = response['embedding']

Note that the two models don't share a vector space, so you'll need to re-embed your corpus after switching. That's a one-time local job, and the per-token API cost drops to zero.


What to Build With This

| Use case | Index target | Value |
| --- | --- | --- |
| Codebase Q&A | Your repo | Dev productivity |
| Docs chatbot | Product docs | Customer support |
| Research assistant | PDF papers | Knowledge work |
| Log analysis | Server logs | Ops tooling |
| Personal knowledge base | Notes/Obsidian | Second brain |

All of these are client deliverables. All run on a $600 desktop. All cost $0/month in API fees.


Full Stack Summary

Documents → chunk_text() → embed_texts() → ChromaDB
                                                ↓
Query → embed_texts() → ChromaDB.query() → top-k chunks
                                                ↓
                                    generate_answer() → Ollama → Response

No cloud. No vendor lock-in. No surprise bills.

If you want to pair this with a persistent API server, check out my guide on running a local AI coding agent with Ollama — the setup is identical, just point the generation step at the same Ollama instance.

Drop a comment with what you're indexing — always curious what people are pointing RAG at.
