Karpathy showed us how to build LLM-powered knowledge bases. But what happens when your wiki gets too big for the context window? Here's the missing piece.
In a recent post, Andrej Karpathy described a workflow that resonated with thousands of developers: use LLMs to build and maintain personal knowledge bases as markdown wikis. Raw documents go in, the LLM compiles them into structured articles, and you query the wiki like a research assistant.
He also noted something important:
"I thought I had to reach for fancy RAG, but the LLM has been pretty good about auto-maintaining index files and brief summaries... at this ~small scale."
The key phrase is "at this small scale." His wiki is ~100 articles and ~400K words. That fits in a large context window. But what happens when you hit 500 articles? 1,000? 2 million words?
The context window runs out. Your LLM can't read everything anymore. This is where RAG comes in — and it's simpler than you think.
What is RAG?
RAG (Retrieval Augmented Generation) is a three-step pattern:
- Retrieve — Find the most relevant documents for a given question
- Augment — Attach those documents to the prompt
- Generate — LLM answers using only the relevant context
Think of it as an open-book exam. The LLM doesn't memorize your entire wiki — it looks up the right pages before answering.
You: "How does attention differ from convolution?"
↓
1. Search vector DB → top 5 relevant articles found
2. Attach articles to prompt
3. LLM reads 5 articles (not 500) → generates answer
↓
LLM: "Based on your wiki articles on attention mechanisms
and CNN architectures, the key differences are..."
Without RAG, you'd need to feed all 500 articles into the context window. With RAG, you feed only 5. Same quality, 100x fewer tokens.
How It Works Under the Hood

RAG relies on vector embeddings — turning text into numbers that capture meaning.
Step 1: Index your wiki
Every article gets converted into a vector (a list of numbers) by an embedding model:
"Attention mechanism" → [0.42, 0.68, 0.35, -0.12, ...]
"CNN architecture" → [0.39, 0.71, 0.30, -0.15, ...] ← similar topic, close vectors
"Cooking recipes" → [0.85, 0.10, 0.92, 0.44, ...] ← different topic, far apart
These vectors are stored in a vector database — a specialized database that finds similar vectors fast.
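"Close" here is usually measured with cosine similarity. A minimal sketch in plain Python, using the illustrative four-number vectors above (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of vector lengths:
    # near 1.0 = same direction (similar meaning), near 0 = unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

attention = [0.42, 0.68, 0.35, -0.12]
cnn       = [0.39, 0.71, 0.30, -0.15]
cooking   = [0.85, 0.10, 0.92, 0.44]

print(cosine_similarity(attention, cnn))      # high: related topics
print(cosine_similarity(attention, cooking))  # lower: unrelated topic
```

A vector database does essentially this comparison, just with indexing tricks that keep it fast over millions of vectors.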
Step 2: Query
When you ask a question, the same embedding model converts your question to a vector, then the vector DB finds the closest matches:
"How does self-attention work?"
→ vector → search → top 5 closest articles
→ attention-mechanism.md, transformer-architecture.md, ...
Step 3: Generate
Those articles are injected into the LLM prompt:
System: Answer based on the following context:
[article 1 content]
[article 2 content]
[article 3 content]
User: How does self-attention work?
The LLM now has the right context and generates an accurate, grounded answer.
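The injection step is plain string assembly. A sketch (the function name and prompt wording are illustrative, not from any particular tool):

```python
def build_prompt(question, articles):
    # Join retrieved articles into one context block, separated
    # so the LLM can tell individual documents apart
    context = "\n\n---\n\n".join(articles)
    return (
        "Answer based on the following context:\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "How does self-attention work?",
    ["[attention-mechanism.md] Self-attention lets each token...",
     "[transformer-architecture.md] The transformer stacks..."],
)
```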
The Landscape: Existing Tools
Since Karpathy's post, several tools have emerged. Here's a comparison of the most notable ones:
| Tool | Stack | Best For |
|---|---|---|
| ObsidianRAG | ChromaDB + Ollama + GraphRAG | Full-featured local RAG with wikilink-aware search |
| obsidian-notes-rag | SQLite-vec + MCP server | Claude Code / AI agent integration |
| llmwiki | Web UI + Claude | Non-technical users who want a GUI |
| obsidian-note-taking-assistant | DuckDB + Web app | Combined note-taking + RAG |
| obsidianRAGsody | CLI + URL clipper | CLI-first workflow with web scraping |
Which one should you use?
- Want everything local + privacy? → ObsidianRAG (Ollama + ChromaDB)
- Using Claude Code as your agent? → obsidian-notes-rag (MCP server)
- Just want to try RAG quickly? → obsidianRAGsody (simple CLI)
What Makes a Good RAG Pipeline?
A naive RAG (embed → search → generate) works, but production-quality tools like ObsidianRAG go further:
1. Hybrid Search (Vector + Keyword)
Vector search finds semantically similar content ("How do transformers work?" → finds articles about attention). But it can miss exact terms. BM25 keyword search catches those. The best systems combine both — ObsidianRAG uses a 60/40 vector/keyword split.
2. Reranking
Initial retrieval returns ~20 candidates. A CrossEncoder reranker (like bge-reranker-v2-m3) then scores each candidate against the original query more carefully, keeping only the top 5. This dramatically improves precision.
3. Graph-Aware Expansion
If article A is retrieved and it contains [[article B]] wikilinks, a smart system also pulls in article B. This follows the knowledge graph your LLM already built — exactly how Obsidian's backlinks work.
4. Multilingual Embeddings
If your wiki has mixed-language content, use paraphrase-multilingual-mpnet-base-v2 instead of English-only models. It covers 50+ languages.
Simple RAG: Query → Vector Search → Top 5 → LLM
Better RAG: Query → Hybrid Search → Top 20 → Rerank → Top 5 → Expand Links → LLM
Build It Yourself: Minimal RAG in 50 Lines
If you want to understand the core concept, here's a minimal implementation. For production use, consider the tools listed above.
Prerequisites
pip install chromadb sentence-transformers ollama
The Code
import os
import glob
import chromadb
from chromadb.utils import embedding_functions
import ollama

# 1. Setup: local embedding model, no API keys.
# Swap in "paraphrase-multilingual-mpnet-base-v2" for multilingual wikis.
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
chroma = chromadb.PersistentClient(path="./wiki_vectors")
collection = chroma.get_or_create_collection("wiki", embedding_function=embed_fn)

# 2. Index your wiki
def index_wiki(wiki_path):
    wiki_path = os.path.expanduser(wiki_path)  # expand "~" before globbing
    md_files = glob.glob(os.path.join(wiki_path, "**/*.md"), recursive=True)
    for filepath in md_files:
        with open(filepath) as f:
            content = f.read()
        doc_id = os.path.relpath(filepath, wiki_path)
        # Chunk long articles (simple split by ## sections)
        chunks = [c for c in content.split("\n## ") if c.strip()]
        for i, chunk in enumerate(chunks):
            collection.upsert(
                ids=[f"{doc_id}::chunk_{i}"],
                documents=[chunk],
                metadatas=[{"source": doc_id, "chunk": i}],
            )
    print(f"Indexed {len(md_files)} files")

# 3. Search
def search(query, n_results=5):
    results = collection.query(query_texts=[query], n_results=n_results)
    return results["documents"][0], results["metadatas"][0]

# 4. Ask with RAG
def ask(question):
    docs, metas = search(question)
    context = "\n\n---\n\n".join(
        f"[Source: {m['source']}]\n{doc}" for doc, m in zip(docs, metas)
    )
    prompt = f"""Answer the question based on the following context from my wiki.
Cite your sources.

Context:
{context}

Question: {question}"""
    # Ollama runs the LLM locally
    response = ollama.chat(model="llama3.2", messages=[
        {"role": "user", "content": prompt}
    ])
    return response["message"]["content"]

# Usage
index_wiki("~/knowledge-base/wiki")
answer = ask("What are the key differences between GPT and BERT?")
print(answer)
That's it. ~50 lines. Fully local. No API keys. No cloud.
When to Use RAG vs. Direct Context
Not everything needs RAG. Here's a simple decision guide:
| Wiki Size | Approach | Why |
|---|---|---|
| < 50 articles | Direct context | Fits in most context windows |
| 50-200 articles | Index file + direct | Karpathy's approach — LLM reads index, then relevant files |
| 200-1000 articles | RAG | Too big for context, but RAG handles it easily |
| 1000+ articles | RAG + hybrid search | Add keyword search alongside vector search for precision |
The sweet spot for adding RAG is when you notice your LLM starting to miss information that's definitely in your wiki, or when token costs become significant.
Tips for Better RAG
1. Chunk wisely
Don't index entire articles as single vectors. Split by sections (## headings). A 5,000-word article as one chunk loses precision — the vector becomes a blur of all topics in that article. Smaller chunks = more precise retrieval.
2. Keep metadata
Store the source file path, section title, and date with each chunk. This lets you filter results ("only search articles from the last month") and cite sources in answers.
3. Use hybrid search
Vector search finds semantically similar content. Keyword search finds exact matches. Combine both:
- Vector: "How do transformers handle long sequences?" → finds articles about attention, context windows
- Keyword: "RoPE" → finds the exact article mentioning Rotary Position Embeddings
4. Re-index incrementally
Don't rebuild the entire index when you add one article. Use upsert to add/update only the changed files. Most vector DBs support this natively.
5. Let the LLM maintain the wiki; let RAG handle retrieval
Keep Karpathy's workflow intact — the LLM still writes and organizes the wiki. RAG is just the lookup layer. Don't let RAG complexity infect your clean wiki structure.
What's Next: The Compounding Knowledge Loop
The real power emerges when you combine Karpathy's wiki pattern with RAG in a feedback loop:
Raw Sources → LLM compiles wiki → RAG indexes wiki
     ↑                                   ↓
     └──────── You ask questions ────────┘
               Answers filed back
               into the wiki
Every question you ask, every answer you file back — they compound. The wiki grows smarter. The RAG index gets richer. Six months in, you have a personal research assistant that knows your domain better than any general-purpose LLM ever could.
And the best part? It all runs on your laptop.
Credit: The LLM knowledge base concept was originally described by Andrej Karpathy. This post explores the RAG extension for scaling beyond context window limits.
If you're new to Karpathy's approach, check out my previous post on building the wiki itself.
Further Reading:
- Karpathy's original LLM Wiki gist
- ObsidianRAG — Full-featured local Obsidian RAG
- obsidian-notes-rag — MCP server for AI agents
- ChromaDB docs — Getting started with vector databases

