Karpathy showed us how to build LLM-powered knowledge bases. But what happens when your wiki gets too big for the context window? Here's the missing piece.
In a recent post, Andrej Karpathy described a workflow that resonated with thousands of developers: use LLMs to build and maintain personal knowledge bases as markdown wikis. Raw documents go in, the LLM compiles them into structured articles, and you query the wiki like a research assistant.
He also noted something important:
"I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries... at this ~small scale."
The key phrase is "at this small scale." His wiki is ~100 articles and ~400K words. That fits in a large context window. But what happens when you hit 500 articles? 1,000? 2 million words?
The context window runs out. Your LLM can't read everything anymore. This is where RAG comes in — and it's simpler than you think.
What is RAG?
RAG (Retrieval Augmented Generation) is a three-step pattern:
- Retrieve — Find the most relevant documents for a given question
- Augment — Attach those documents to the prompt
- Generate — LLM answers using only the relevant context
Think of it as an open-book exam. The LLM doesn't memorize your entire wiki — it looks up the right pages before answering.
You: "How does attention differ from convolution?"
↓
1. Search vector DB → top 5 relevant articles found
2. Attach articles to prompt
3. LLM reads 5 articles (not 500) → generates answer
↓
LLM: "Based on your wiki articles on attention mechanisms
and CNN architectures, the key differences are..."
Without RAG, you'd need to feed all 500 articles into the context window. With RAG, you feed only 5. Same quality, 100x fewer tokens.
How It Works Under the Hood

RAG relies on vector embeddings — turning text into numbers that capture meaning.
Step 1: Index your wiki
Every article gets converted into a vector (a list of numbers) by an embedding model:
"Attention mechanism" → [0.42, 0.68, 0.35, -0.12, ...]
"CNN architecture" → [0.39, 0.71, 0.30, -0.15, ...] ← similar topic, close vectors
"Cooking recipes" → [0.85, 0.10, 0.92, 0.44, ...] ← different topic, far apart
These vectors are stored in a vector database — a specialized database that finds similar vectors fast.
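"Close" here is usually measured with cosine similarity. A minimal sketch in plain Python, using the illustrative four-number vectors above (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of vector lengths:
    # near 1.0 = same direction (similar meaning), near 0 = unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

attention = [0.42, 0.68, 0.35, -0.12]
cnn       = [0.39, 0.71, 0.30, -0.15]
cooking   = [0.85, 0.10, 0.92, 0.44]

print(cosine_similarity(attention, cnn))      # high: related topics
print(cosine_similarity(attention, cooking))  # lower: unrelated topic
```

A vector database does essentially this comparison, just with indexing tricks that keep it fast over millions of vectors.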
Step 2: Query
When you ask a question, the same embedding model converts your question to a vector, then the vector DB finds the closest matches:
"How does self-attention work?"
→ vector → search → top 5 closest articles
→ attention-mechanism.md, transformer-architecture.md, ...
Step 3: Generate
Those articles are injected into the LLM prompt:
System: Answer based on the following context:
[article 1 content]
[article 2 content]
[article 3 content]
User: How does self-attention work?
The LLM now has the right context and generates an accurate, grounded answer.
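The injection step is plain string assembly. A sketch (the function name and prompt wording are illustrative, not from any particular tool):

```python
def build_prompt(question, articles):
    # Join retrieved articles into one context block, separated
    # so the LLM can tell individual documents apart
    context = "\n\n---\n\n".join(articles)
    return (
        "Answer based on the following context:\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "How does self-attention work?",
    ["[attention-mechanism.md] Self-attention lets each token...",
     "[transformer-architecture.md] The transformer stacks..."],
)
```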
The Landscape: Existing Tools
Since Karpathy's post, several tools have emerged. Here's a comparison of the most notable ones:
| Tool | Stack | Best For |
|---|---|---|
| ObsidianRAG | ChromaDB + Ollama + GraphRAG | Full-featured local RAG with wikilink-aware search |
| obsidian-notes-rag | SQLite-vec + MCP server | Claude Code / AI agent integration |
| llmwiki | Web UI + Claude | Non-technical users who want a GUI |
| obsidian-note-taking-assistant | DuckDB + Web app | Combined note-taking + RAG |
| obsidianRAGsody | CLI + URL clipper | CLI-first workflow with web scraping |
Which one should you use?
- Want everything local + privacy? → ObsidianRAG (Ollama + ChromaDB)
- Using Claude Code as your agent? → obsidian-notes-rag (MCP server)
- Just want to try RAG quickly? → obsidianRAGsody (simple CLI)
What Makes a Good RAG Pipeline?
A naive RAG (embed → search → generate) works, but production-quality tools like ObsidianRAG go further:
1. Hybrid Search (Vector + Keyword)
Vector search finds semantically similar content ("How do transformers work?" → finds articles about attention). But it can miss exact terms. BM25 keyword search catches those. The best systems combine both — ObsidianRAG uses a 60/40 vector/keyword split.
2. Reranking
Initial retrieval returns ~20 candidates. A CrossEncoder reranker (like bge-reranker-v2-m3) then scores each candidate against the original query more carefully, keeping only the top 5. This dramatically improves precision.
3. Graph-Aware Expansion
If article A is retrieved and it contains [[article B]] wikilinks, a smart system also pulls in article B. This follows the knowledge graph your LLM already built — exactly how Obsidian's backlinks work.
4. Multilingual Embeddings
If your wiki has mixed-language content, use paraphrase-multilingual-mpnet-base-v2 instead of English-only models. It covers 50+ languages.
Simple RAG: Query → Vector Search → Top 5 → LLM
Better RAG: Query → Hybrid Search → Top 20 → Rerank → Top 5 → Expand Links → LLM
Build It Yourself: Minimal RAG in 50 Lines
If you want to understand the core concept, here's a minimal implementation. For production use, consider the tools listed above.
Prerequisites
pip install chromadb sentence-transformers ollama
The Code
import os
import glob
import chromadb
from chromadb.utils import embedding_functions
import ollama

# 1. Setup: local embedding model, no API keys.
# Swap in "paraphrase-multilingual-mpnet-base-v2" for multilingual wikis.
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
chroma = chromadb.PersistentClient(path="./wiki_vectors")
collection = chroma.get_or_create_collection("wiki", embedding_function=embed_fn)

# 2. Index your wiki
def index_wiki(wiki_path):
    wiki_path = os.path.expanduser(wiki_path)  # expand "~" before globbing
    md_files = glob.glob(os.path.join(wiki_path, "**/*.md"), recursive=True)
    for filepath in md_files:
        with open(filepath) as f:
            content = f.read()
        doc_id = os.path.relpath(filepath, wiki_path)
        # Chunk long articles (simple split by ## sections)
        chunks = [c for c in content.split("\n## ") if c.strip()]
        for i, chunk in enumerate(chunks):
            collection.upsert(
                ids=[f"{doc_id}::chunk_{i}"],
                documents=[chunk],
                metadatas=[{"source": doc_id, "chunk": i}],
            )
    print(f"Indexed {len(md_files)} files")

# 3. Search
def search(query, n_results=5):
    results = collection.query(query_texts=[query], n_results=n_results)
    return results["documents"][0], results["metadatas"][0]

# 4. Ask with RAG
def ask(question):
    docs, metas = search(question)
    context = "\n\n---\n\n".join(
        f"[Source: {m['source']}]\n{doc}" for doc, m in zip(docs, metas)
    )
    prompt = f"""Answer the question based on the following context from my wiki.
Cite your sources.

Context:
{context}

Question: {question}"""
    # Ollama runs the LLM locally
    response = ollama.chat(model="llama3.2", messages=[
        {"role": "user", "content": prompt}
    ])
    return response["message"]["content"]

# Usage
index_wiki("~/knowledge-base/wiki")
answer = ask("What are the key differences between GPT and BERT?")
print(answer)
That's it. ~50 lines. Fully local. No API keys. No cloud.
When to Use RAG vs. Direct Context
Not everything needs RAG. Here's a simple decision guide:
| Wiki Size | Approach | Why |
|---|---|---|
| < 50 articles | Direct context | Fits in most context windows |
| 50-200 articles | Index file + direct | Karpathy's approach — LLM reads index, then relevant files |
| 200-1000 articles | RAG | Too big for context, but RAG handles it easily |
| 1000+ articles | RAG + hybrid search | Add keyword search alongside vector search for precision |
The sweet spot for adding RAG is when you notice your LLM starting to miss information that's definitely in your wiki, or when token costs become significant.
Tips for Better RAG
1. Chunk wisely
Don't index entire articles as single vectors. Split by sections (## headings). A 5,000-word article as one chunk loses precision — the vector becomes a blur of all topics in that article. Smaller chunks = more precise retrieval.
2. Keep metadata
Store the source file path, section title, and date with each chunk. This lets you filter results ("only search articles from the last month") and cite sources in answers.
3. Use hybrid search
Vector search finds semantically similar content. Keyword search finds exact matches. Combine both:
- Vector: "How do transformers handle long sequences?" → finds articles about attention, context windows
- Keyword: "RoPE" → finds the exact article mentioning Rotary Position Embeddings
4. Re-index incrementally
Don't rebuild the entire index when you add one article. Use upsert to add/update only the changed files. Most vector DBs support this natively.
5. Let the LLM maintain the wiki; let RAG handle retrieval
Keep Karpathy's workflow intact — the LLM still writes and organizes the wiki. RAG is just the lookup layer. Don't let RAG complexity infect your clean wiki structure.
What's Next: The Compounding Knowledge Loop
The real power emerges when you combine Karpathy's wiki pattern with RAG in a feedback loop:
Raw Sources → LLM compiles wiki → RAG indexes wiki
     ↑                                   ↓
     └──────── You ask questions ────────┘
               Answers filed back
               into the wiki
Every question you ask, every answer you file back — they compound. The wiki grows smarter. The RAG index gets richer. Six months in, you have a personal research assistant that knows your domain better than any general-purpose LLM ever could.
And the best part? It all runs on your laptop.
Credit: The LLM knowledge base concept was originally described by Andrej Karpathy. This post explores the RAG extension for scaling beyond context window limits.
If you're new to Karpathy's approach, check out my previous post on building the wiki itself.
Further Reading:
- Karpathy's original LLM Wiki gist
- ObsidianRAG — Full-featured local Obsidian RAG
- obsidian-notes-rag — MCP server for AI agents
- ChromaDB docs — Getting started with vector databases

