I got tired of explaining my own codebase to an AI every single session.
"Here's the architecture. Here's the README. Here's what I tried last time." Every. Single. Time.
So I built a local RAG (Retrieval-Augmented Generation) system that knows my projects, my notes, and my docs — permanently. No cloud. No API costs. No context window resets.
Here's exactly how it works.
## The Problem with Context Windows
LLMs don't remember. You paste the same 200 lines of context every session, hit the token limit, and start over. It's fine for one-off questions. It's exhausting for ongoing projects.
The standard solution is RAG: instead of stuffing everything into the prompt, you store docs in a vector database and retrieve only the relevant chunks when you ask a question. The model sees 3-5 paragraphs of targeted context instead of your entire repo.
Result: faster, cheaper, and the AI actually answers the right question.
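Under the hood, retrieval is just nearest-neighbor search over embedding vectors. Here's a toy sketch of the idea with hand-picked 3-dimensional vectors (real embeddings from `nomic-embed-text` have 768 dimensions; the numbers below are made up for illustration):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "embeddings" — in the real pipeline these come from the embedding model
chunks = {
    "deploy notes": [0.9, 0.1, 0.0],
    "api limits":   [0.1, 0.8, 0.2],
    "watch face":   [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of "how do I deploy?"

# Rank chunks by similarity to the query and keep the best match
best = max(chunks, key=lambda name: cosine(query, chunks[name]))
print(best)  # → deploy notes
```

Swap the hand-picked vectors for real model output and `max` for a top-k sort, and that's the whole retrieval step.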
## The Architecture

```
Your Documents (markdown, code, PDFs, notes)
  → Chunked + embedded (Ollama nomic-embed-text)
  → Stored in Chroma (local vector DB)

Query
  → Embedded (same model)
  → Top-5 relevant chunks retrieved
  → Stuffed into Ollama prompt (Qwen 3.5 9B)
  → Answer
```
Zero cloud. Zero API keys. Runs on a Mac Mini or any machine with 8GB RAM.
## What I Index
Everything that would normally eat my context window:
- Project READMEs and architecture docs
- My personal notes (Obsidian vault)
- Code snippets and past solutions
- API documentation I use regularly
- Stack Overflow answers I bookmarked (because I always forget them again)
- Config files and deployment notes
Total indexed: ~4,800 chunks. Query time: under 2 seconds.
## Step 1: Install the Stack (15 minutes)

```bash
# Ollama (already installed? skip)
curl -fsSL https://ollama.com/install.sh | sh

# Pull models
ollama pull qwen3.5:9b         # LLM for answers
ollama pull nomic-embed-text   # Embedding model

# Python dependencies
pip install chromadb langchain langchain-community ollama pypdf markdown
```
That's the entire stack. No Docker required (though Chroma has a Docker option if you want a persistent server).
## Step 2: Index Your Documents

```python
from pathlib import Path

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Load your docs folder (expand ~ explicitly — DirectoryLoader won't)
loader = DirectoryLoader(
    str(Path("~/projects").expanduser()),
    glob="**/*.md",
    recursive=True,
    loader_cls=TextLoader,  # plain-text loader; avoids the extra `unstructured` dependency
)
docs = loader.load()

# Split into chunks (400 characters, 50 overlap)
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
chunks = splitter.split_documents(docs)
print(f"Indexed {len(chunks)} chunks from {len(docs)} documents")

# Embed and store locally
embeddings = OllamaEmbeddings(model="nomic-embed-text")
db = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory="./my-knowledge-base",
)
db.persist()
```
Run once. Done. Your docs are now searchable by meaning, not just keywords.
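If you're wondering what `chunk_size` and `chunk_overlap` actually control, here's a simplified fixed-window splitter. LangChain's `RecursiveCharacterTextSplitter` is smarter (it prefers to break on paragraphs, then sentences, then words), but the sliding-window idea is the same:

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Slide a fixed window across the text; each step advances by
    # (chunk_size - chunk_overlap), so consecutive chunks share context
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("a" * 1000, chunk_size=400, chunk_overlap=50)
print(len(chunks), [len(c) for c in chunks])  # → 3 [400, 400, 300]
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from either side.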
## Step 3: Query It

```python
import ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Load existing DB
embeddings = OllamaEmbeddings(model="nomic-embed-text")
db = Chroma(persist_directory="./my-knowledge-base", embedding_function=embeddings)

def ask(question: str) -> str:
    # Retrieve top 5 relevant chunks
    results = db.similarity_search(question, k=5)
    context = "\n\n".join(r.page_content for r in results)

    # Query local LLM with context
    response = ollama.chat(
        model="qwen3.5:9b",
        messages=[{
            "role": "user",
            "content": f"Based on this context:\n\n{context}\n\nAnswer: {question}",
        }],
    )
    return response["message"]["content"]

# Example
print(ask("How does my Garmin watch face fetch stock data?"))
print(ask("What's the API rate limit for the crypto bot?"))
print(ask("How do I deploy the Telegram bot to the VPS?"))
```
Real answers from your own documentation. No hallucinations about your specific setup.
## The Killer Feature: Incremental Updates
Don't re-index everything when one file changes. Just update what's new:
```python
import hashlib
import json
from pathlib import Path

def get_file_hash(path):
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def update_index(docs_dir, db, index_cache="./index-cache.json"):
    cache = json.loads(Path(index_cache).read_text()) if Path(index_cache).exists() else {}
    changed_files = []
    for f in Path(docs_dir).rglob("*.md"):
        h = get_file_hash(f)
        if cache.get(str(f)) != h:
            changed_files.append(str(f))
            cache[str(f)] = h
    if changed_files:
        print(f"Re-indexing {len(changed_files)} changed files...")
        # [load, chunk, embed, upsert only changed files]
    Path(index_cache).write_text(json.dumps(cache))
```
Run this as a cron job every hour. Your knowledge base stays current automatically.
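A crontab entry for that could look like this (the script name and directory are illustrative; adjust to wherever you saved the update code):

```
0 * * * * cd ~/rag && python3 update_index.py >> index.log 2>&1
```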
## What Actually Changed for Me
Before RAG:
- "Explain the background service memory limit in my Garmin project" → paste 200 lines → wait → answer
- Every new chat session: context reset, start explaining again
After RAG:
- `ask("Garmin background service memory limit")` → "64KB sandbox, pass data via `Background.exit(dictionary)`" — in 1.8 seconds
My LLM now answers questions about projects I haven't touched in 6 months. No context management. No pasting. Just ask.
## Hardware Requirements
| Setup | RAM | Embedding Speed | Query Speed |
|---|---|---|---|
| Mac Mini M4 8GB | 8GB | ~500 docs/min | ~2s |
| RTX 3060 12GB | 12GB VRAM | ~3000 docs/min | ~0.5s |
| Old laptop 8GB | 8GB | ~100 docs/min | ~5-8s |
The embedding step (indexing) is the slow part — run it once, then it's instant.
## Tips from Running This for 3 Months
- Chunk size matters — 400 tokens works well for prose and docs. For code, try 200 with more overlap.
- Metadata is your friend — store `filename` and `section` in chunk metadata. When the AI says "see the deployment notes," you know exactly where to look.
- Re-rank when accuracy matters — if top-5 chunks aren't enough, add a re-ranker step (Cohere has a free API, or use a local cross-encoder).
- Watch your embed model — `nomic-embed-text` beats most larger models for RAG. Don't use your chat LLM for embeddings.
- Hybrid search — combine vector search with BM25 keyword search for better results on technical queries with specific names/functions.
## The Bigger Picture
This is step one of something bigger: a personal AI that grows with your projects instead of resetting every session.
Next phase I'm building: automatic indexing from Git commits (index diffs in real-time as you code) + a simple web UI for non-terminal queries.
Total current cost of this setup: $0/month. It runs on the same Mac Mini I already had.
## Want This for Your Business?
RAG systems are one of the highest-value AI implementations you can build. A local knowledge base trained on your company's docs, SOPs, and code can:
- Answer customer support questions without hallucinating
- Help your team find answers instantly across years of internal docs
- Run 24/7 with zero ongoing API costs
→ Custom RAG setup on Fiverr
→ Follow us on Telegram
Built with 🤖 by CelebiBots — AI that runs on your terms.