I got tired of explaining my own codebase to an AI every single session.
"Here's the architecture. Here's the README. Here's what I tried last time." Every. Single. Time.
So I built a local RAG (Retrieval-Augmented Generation) system that knows my projects, my notes, and my docs — permanently. No cloud. No API costs. No context window resets.
Here's exactly how it works.
## The Problem with Context Windows
LLMs don't remember. You paste the same 200 lines of context every session, hit the token limit, and start over. It's fine for one-off questions. It's exhausting for ongoing projects.
The standard solution is RAG: instead of stuffing everything into the prompt, you store docs in a vector database and retrieve only the relevant chunks when you ask a question. The model sees 3-5 paragraphs of targeted context instead of your entire repo.
Result: faster, cheaper, and the AI actually answers the right question.
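Under the hood, retrieval is just nearest-neighbor search over embedding vectors. Here's a toy sketch of the idea with hand-picked 3-dimensional vectors (real embeddings from `nomic-embed-text` have 768 dimensions; the numbers below are made up for illustration):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "embeddings" — in the real pipeline these come from the embedding model
chunks = {
    "deploy notes": [0.9, 0.1, 0.0],
    "api limits":   [0.1, 0.8, 0.2],
    "watch face":   [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of "how do I deploy?"

# Rank chunks by similarity to the query and keep the best match
best = max(chunks, key=lambda name: cosine(query, chunks[name]))
print(best)  # → deploy notes
```

Swap the hand-picked vectors for real model output and `max` for a top-k sort, and that's the whole retrieval step.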
## The Architecture

```
Your Documents (markdown, code, PDFs, notes)
  → Chunked + embedded (Ollama nomic-embed-text)
  → Stored in Chroma (local vector DB)

Query
  → Embedded (same model)
  → Top-5 relevant chunks retrieved
  → Stuffed into Ollama prompt (Qwen 3.5 9B)
  → Answer
```
Zero cloud. Zero API keys. Runs on a Mac Mini or any machine with 8GB RAM.
## What I Index
Everything that would normally eat my context window:
- Project READMEs and architecture docs
- My personal notes (Obsidian vault)
- Code snippets and past solutions
- API documentation I use regularly
- Stack Overflow answers I bookmarked (because I always forget them again)
- Config files and deployment notes
Total indexed: ~4,800 chunks. Query time: under 2 seconds.
## Step 1: Install the Stack (15 minutes)

```bash
# Ollama (already installed? skip)
curl -fsSL https://ollama.com/install.sh | sh

# Pull models
ollama pull qwen3.5:9b         # LLM for answers
ollama pull nomic-embed-text   # Embedding model

# Python dependencies
pip install chromadb langchain langchain-community ollama pypdf markdown
```
That's the entire stack. No Docker required (though Chroma has a Docker option if you want a persistent server).
## Step 2: Index Your Documents

```python
from pathlib import Path

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Load your docs folder (expand ~ explicitly — DirectoryLoader won't)
loader = DirectoryLoader(
    str(Path("~/projects").expanduser()),
    glob="**/*.md",
    recursive=True,
    loader_cls=TextLoader,  # plain-text loader; avoids the extra `unstructured` dependency
)
docs = loader.load()

# Split into chunks (400 characters, 50 overlap)
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
chunks = splitter.split_documents(docs)
print(f"Indexed {len(chunks)} chunks from {len(docs)} documents")

# Embed and store locally
embeddings = OllamaEmbeddings(model="nomic-embed-text")
db = Chroma.from_documents(
    chunks,
    embeddings,
    persist_directory="./my-knowledge-base",
)
db.persist()
```
Run once. Done. Your docs are now searchable by meaning, not just keywords.
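If you're wondering what `chunk_size` and `chunk_overlap` actually control, here's a simplified fixed-window splitter. LangChain's `RecursiveCharacterTextSplitter` is smarter (it prefers to break on paragraphs, then sentences, then words), but the sliding-window idea is the same:

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Slide a fixed window across the text; each step advances by
    # (chunk_size - chunk_overlap), so consecutive chunks share context
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("a" * 1000, chunk_size=400, chunk_overlap=50)
print(len(chunks), [len(c) for c in chunks])  # → 3 [400, 400, 300]
```

The overlap is what keeps a sentence that straddles a chunk boundary retrievable from either side.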
## Step 3: Query It

```python
import ollama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Load existing DB
embeddings = OllamaEmbeddings(model="nomic-embed-text")
db = Chroma(persist_directory="./my-knowledge-base", embedding_function=embeddings)

def ask(question: str) -> str:
    # Retrieve top 5 relevant chunks
    results = db.similarity_search(question, k=5)
    context = "\n\n".join(r.page_content for r in results)

    # Query local LLM with context
    response = ollama.chat(
        model="qwen3.5:9b",
        messages=[{
            "role": "user",
            "content": f"Based on this context:\n\n{context}\n\nAnswer: {question}",
        }],
    )
    return response["message"]["content"]

# Example
print(ask("How does my Garmin watch face fetch stock data?"))
print(ask("What's the API rate limit for the crypto bot?"))
print(ask("How do I deploy the Telegram bot to the VPS?"))
```
Real answers from your own documentation. No hallucinations about your specific setup.
## The Killer Feature: Incremental Updates
Don't re-index everything when one file changes. Just update what's new:
```python
import hashlib
import json
from pathlib import Path

def get_file_hash(path):
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def update_index(docs_dir, db, index_cache="./index-cache.json"):
    cache = json.loads(Path(index_cache).read_text()) if Path(index_cache).exists() else {}
    changed_files = []
    for f in Path(docs_dir).rglob("*.md"):
        h = get_file_hash(f)
        if cache.get(str(f)) != h:
            changed_files.append(str(f))
            cache[str(f)] = h
    if changed_files:
        print(f"Re-indexing {len(changed_files)} changed files...")
        # [load, chunk, embed, upsert only changed files]
    Path(index_cache).write_text(json.dumps(cache))
```
Run this as a cron job every hour. Your knowledge base stays current automatically.
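A crontab entry for that could look like this (the script name and directory are illustrative; adjust to wherever you saved the update code):

```
0 * * * * cd ~/rag && python3 update_index.py >> index.log 2>&1
```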
## What Actually Changed for Me
Before RAG:
- "Explain the background service memory limit in my Garmin project" → paste 200 lines → wait → answer
- Every new chat session: context reset, start explaining again
After RAG:
- `ask("Garmin background service memory limit")` → "64KB sandbox, pass data via `Background.exit(dictionary)`" — in 1.8 seconds
My LLM now answers questions about projects I haven't touched in 6 months. No context management. No pasting. Just ask.
## Hardware Requirements
| Setup | RAM | Embedding Speed | Query Speed |
|---|---|---|---|
| Mac Mini M4 8GB | 8GB | ~500 docs/min | ~2s |
| RTX 3060 12GB | 12GB VRAM | ~3000 docs/min | ~0.5s |
| Old laptop 8GB | 8GB | ~100 docs/min | ~5-8s |
The embedding step (indexing) is the slow part — run it once, then it's instant.
## Tips from Running This for 3 Months
- Chunk size matters — 400 tokens works well for prose and docs. For code, try 200 with more overlap.
- Metadata is your friend — store `filename` and `section` in chunk metadata. When the AI says "see the deployment notes," you know exactly where to look.
- Re-rank when accuracy matters — if top-5 chunks aren't enough, add a re-ranker step (Cohere has a free API, or use a local cross-encoder).
- Watch your embed model — `nomic-embed-text` beats most larger models for RAG. Don't use your chat LLM for embeddings.
- Hybrid search — combine vector search with BM25 keyword search for better results on technical queries with specific names/functions.
## The Bigger Picture
This is step one of something bigger: a personal AI that grows with your projects instead of resetting every session.
Next phase I'm building: automatic indexing from Git commits (index diffs in real-time as you code) + a simple web UI for non-terminal queries.
Total current cost of this setup: $0/month. It runs on the same Mac Mini I already had.
## Want This for Your Business?
RAG systems are one of the highest-value AI implementations you can build. A local knowledge base trained on your company's docs, SOPs, and code can:
- Answer customer support questions without hallucinating
- Help your team find answers instantly across years of internal docs
- Run 24/7 with zero ongoing API costs
→ Custom RAG setup on Fiverr
→ Follow us on Telegram
Built with 🤖 by CelebiBots — AI that runs on your terms.