vishalmysore
RAG vs. Agent Memory vs. LLM Wiki: A Practical Comparison

You build a RAG pipeline. It works. Sort of. Your LLM retrieves the right chunks, scores look great, but the answers feel generic — like a stranger who read your documents once and forgot who they were talking to. You add memory. Better, but now the agent remembers the user and still cannot synthesize knowledge across sessions. You consider a knowledge graph. Now you have three systems to maintain and the complexity is killing your velocity.

This is the knowledge retrieval problem in 2026: powerful tools exist but no clear framework for choosing between them. This article maps three main approaches — RAG, Agent Memory, and LLM Wiki — honestly, including where each one breaks.

The deeper question underlying all three is not which tool to pick. It is: where does the heavy reasoning work happen — and what are the consequences of that choice?

|  | RAG | Agent Memory | LLM Wiki |
| --- | --- | --- | --- |
| Reasoning concentrated at | Query time | Split: extraction at write time, retrieval at query time | Ingest time |
| Default statefulness | Stateless (but can be engineered otherwise) | Stateful by design | Stateful by design |
| Write-back behavior | Not by default — requires deliberate engineering | Core to the pattern | Recommended design — implementations vary |

At a Glance

|  | RAG | Agent Memory | LLM Wiki |
| --- | --- | --- | --- |
| What it answers | "What does the document say?" | "What has this user told me?" | "What do I know about this topic?" |
| Persistence | None by default | Cross-session | Compounding wiki |
| Infrastructure | Vector DB + embedding pipeline | Memory store + retrieval | Markdown files + index (often with retrieval layer too) |
| Scales to | Millions of docs | Per-user state | Bounded, curated sources |
| Blind spot | Synthesis quality degrades at scale | Knows user, not domain | Error amplification + continuous knowledge engineering |

1. RAG — The Default Everyone Reaches For

RAG (Retrieval-Augmented Generation) is the entry point for most AI developers. The pipeline is well understood: chunk your documents, embed them into a vector store at ingest time, retrieve the top-K semantically similar chunks at query time, and inject them into the LLM's context window for synthesis.
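The pipeline can be sketched in a few lines. This is a toy illustration, not a production implementation: the bag-of-words "embedding" stands in for a real embedding model, and the sample document and chunk size are made up for the example.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words term counts.
    # A real pipeline would call an embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(doc: str, size: int) -> list[str]:
    words = doc.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Ingest time: chunk and embed once.
doc = ("Employees with one year of tenure are eligible for parental leave. "
       "Requests must be submitted four weeks in advance via the HR portal.")
index = [(c, embed(c)) for c in chunk(doc, size=10)]

# Query time: retrieve top-k chunks, then hand them to the LLM for synthesis.
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]

print(retrieve("Who is eligible for parental leave?", k=1))
```

Note where the work falls: embedding is done once at ingest, but `retrieve` plus LLM synthesis runs on every query.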

It is important to be precise about where work happens in RAG. Embedding generation happens at ingest time. But the heavy reasoning — synthesis, answer generation, multi-hop inference — happens at query time, on every single call, with no memory of having done it before.

Where it works well: Large, dynamic document corpora. Single-turn factual queries. Cases where the knowledge base changes frequently. Enterprise search across thousands of documents where breadth matters more than depth.

Where it quietly fails: Naive RAG is stateless by default — every query starts from zero, and synthesis quality degrades as questions become more complex. The chunking process also destroys document structure: relationships between entities, contradictions across sources, and synthesized insights all disappear when you shred a document into 512-token pieces.

Production RAG systems partially mitigate this through query rewriting, feedback loops, cached responses, hybrid search (BM25 + vector), re-ranking models, and GraphRAG-style knowledge graph layers. You can architect RAG to write back — storing successful query-answer pairs, updating retrieval rankings from user feedback, or feeding query patterns back into the index. Naive RAG struggles with multi-hop synthesis; advanced systems mitigate this at higher engineering complexity.

The key point: RAG is stateless by default, but statefulness can be engineered in. Every step toward statefulness requires deliberate work on top of the base pattern. This is the fundamental difference from Agent Memory and LLM Wiki, where statefulness is the design intent, not the exception.

Best for: High-volume document retrieval, frequently updated knowledge bases, enterprise Q&A systems, any corpus too large to pre-compile.


2. Agent Memory — A Two-Phase System

Agent memory solves a different problem: continuity across sessions. Where RAG answers "what does the document say?", memory answers "what does this user need?" A memory system extracts facts from conversations — preferences, history, constraints — stores them externally, and retrieves them on demand.

Unlike RAG, Agent Memory is not a query-time-only system. It has two distinct phases:

  • Write phase (at conversation time): The system extracts facts from what the user says and writes them to the memory store. This extraction and storage is itself a reasoning operation — deciding what is worth keeping and how to store it.
  • Read phase (at query time): Stored context is retrieved and injected alongside the query to personalize the response.
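The two phases can be sketched as follows. The extraction heuristic is a deliberately crude stand-in: a production system would use an LLM call to decide what is worth keeping.

```python
class MemoryStore:
    """Minimal sketch of the two-phase memory pattern."""

    def __init__(self) -> None:
        self.facts: dict[str, list[str]] = {}

    def write(self, user_id: str, utterance: str) -> None:
        # Write phase (conversation time): decide what to keep and store it.
        # Crude keyword heuristic standing in for LLM-based fact extraction.
        if "i prefer" in utterance.lower() or "my " in utterance.lower():
            self.facts.setdefault(user_id, []).append(utterance)

    def read(self, user_id: str, query: str) -> str:
        # Read phase (query time): inject stored context alongside the query.
        context = "; ".join(self.facts.get(user_id, [])) or "nothing yet"
        return f"Known about user: {context}\nQuery: {query}"

mem = MemoryStore()
mem.write("u1", "I prefer Python over Java")
mem.write("u1", "The weather is nice today")  # not worth keeping; dropped
prompt = mem.read("u1", "How do I parse JSON?")
```

The important part is that `write` runs during the conversation, not at query time: the reasoning about what to remember has already happened before the next question arrives.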

This two-phase nature is what makes memory genuinely different from RAG — it actively writes knowledge about the user over time, not just retrieves at query time.

Modern memory systems go further — summarizing memory across sessions, clustering related facts, deriving preferences, and building structured user models. The write phase becomes increasingly sophisticated as the system matures.

Where it works well: Personalization, user-specific agents, customer support bots that need to remember past interactions, long-running agentic workflows where the same user returns repeatedly.

Where it quietly fails: Memory is sparse and noisy — it only knows what the user has explicitly said, which is rarely the full picture. More importantly, memory knows the user but is blind to domain knowledge unless paired with RAG or a structured knowledge layer. An agent that remembers a user prefers Python but has no access to your documentation is still useless for technical support. The memory and domain knowledge problems are orthogonal and require separate solutions.

Best for: User-facing agents with returning users, personalization layers, session continuity in long-running tasks, any situation where user-specific context matters as much as document content.


3. LLM Wiki — An Idea, Not a Spec

On April 4, 2026, Andrej Karpathy published a GitHub Gist describing a pattern for building personal knowledge bases with LLMs. It is important to read it for what it actually is. Karpathy opens with: "This is an idea file... Its goal is to communicate the high level idea, but your agent will build out the specifics in collaboration with you." And closes with: "This document is intentionally abstract. It describes the idea, not a specific implementation. Everything mentioned above is optional and modular — pick what's useful, ignore what isn't."

This matters because a lot of the discussion around LLM Wiki — including formal ingest/lint/query operations, strict architectural boundaries, and governance layers — comes from community implementations and blog elaborations, not from the original idea itself. The Gist is a starting point, not a specification.

The core idea is straightforward: instead of re-deriving knowledge from raw documents on every query, use the LLM to compile knowledge into a persistent, interlinked set of markdown pages — and then query that compiled artifact. Raw sources stay immutable. The LLM writes and maintains the wiki layer. You read it.

In practice, implementations vary enormously:

  • Some use pure markdown with a flat index file, no retrieval layer
  • Many add embeddings and hybrid search on top of the wiki pages
  • Some integrate with tools like Obsidian for navigation and graph views
  • Some use MCP servers to give agents direct wiki access
  • Some add formal lint passes; others do it ad hoc

Karpathy himself suggests using a local search engine with "hybrid BM25/vector search" for larger wikis. LLM Wiki is not a replacement for retrieval — it is an alternative organizing layer that can sit alongside or on top of retrieval systems.

What tends to happen at ingest time in most implementations: the LLM reads a new source, extracts key information, writes or updates wiki pages, and cross-references existing content. This is the expensive, high-reasoning operation — and doing it upfront means queries can draw on pre-compiled synthesis rather than raw text.
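A file-level sketch of such an ingest pass, assuming the flat-markdown-plus-index variant. The `extract_summary` stub stands in for the expensive LLM synthesis call, and the `wiki/` directory layout is an assumption, not part of any spec:

```python
from pathlib import Path

WIKI = Path("wiki")

def extract_summary(source_text: str) -> str:
    # Stand-in for the LLM synthesis call that happens at ingest time.
    return source_text[:200]

def ingest(topic: str, source_text: str) -> None:
    WIKI.mkdir(exist_ok=True)
    page = WIKI / f"{topic}.md"
    existing = page.read_text() if page.exists() else f"# {topic}\n"
    # Append compiled knowledge; a real pass would merge, deduplicate,
    # and cross-reference against existing pages.
    page.write_text(existing + "\n" + extract_summary(source_text) + "\n")
    # Maintain a flat index so query time can locate pages cheaply.
    index = WIKI / "index.md"
    entries = set(index.read_text().splitlines()) if index.exists() else set()
    entries.add(f"- [[{topic}]]")
    index.write_text("\n".join(sorted(entries)) + "\n")

ingest("parental-leave",
       "Employees with one year of tenure are eligible for parental leave.")
```

Raw sources stay untouched; only the wiki layer is written, and the index grows as topics accumulate.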

What tends to happen at query time: the LLM reads an index, finds relevant wiki pages, and synthesizes an answer. This is generally lighter than RAG synthesis over raw documents — but it is not reasoning-free. The LLM still synthesizes across wiki pages at query time. The difference is that it is working with structured, pre-compiled knowledge rather than raw chunked text.

The compounding effect is the key advantage — when it works. A wiki that has ingested 50 papers on a topic can answer questions with greater depth than RAG over the same 50 papers, because relationships, contradictions, and synthesis are already compiled. But this holds only when ingest quality is high. Poorly generated wiki pages, missed edge cases, or hallucinated synthesis baked in during ingest can all reverse this advantage — and unlike RAG, which re-reads the original source on every query, LLM Wiki has baked the LLM's interpretation into the knowledge base. Errors compound rather than stay isolated.

The real limitation is not context window size — that is model-dependent and changing rapidly. It is what is better described as continuous knowledge engineering:

  • Keeping pages consistent and contradiction-free as new sources arrive
  • Preventing schema drift as the domain evolves
  • Catching silent quality degradation from LLM edits
  • Validating that ingest errors have not propagated across linked pages
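Some of this maintenance can be automated. A toy example of a lint pass, checking only one property from the list above (no `[[wiki link]]` points at a missing page); the `[[...]]` link syntax and directory layout are assumptions borrowed from common implementations:

```python
import re
import tempfile
from pathlib import Path

def lint_links(wiki: Path) -> list[str]:
    # Flag [[wiki links]] whose target page does not exist in the wiki dir.
    pages = {p.stem for p in wiki.glob("*.md")}
    broken = []
    for page in sorted(wiki.glob("*.md")):
        for target in re.findall(r"\[\[([^\]]+)\]\]", page.read_text()):
            if target not in pages:
                broken.append(f"{page.name}: missing [[{target}]]")
    return broken

# Demo on a throwaway wiki with one good link and one broken one.
tmp = Path(tempfile.mkdtemp())
(tmp / "home.md").write_text("See [[missing-page]] and [[notes]].")
(tmp / "notes.md").write_text("# notes")
print(lint_links(tmp))
```

Contradiction detection and schema-drift checks are much harder than link checking and typically need LLM involvement themselves, which is why the burden is continuous rather than one-off.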

There is also a structural gap no amount of maintenance resolves: the wiki knows the domain but has no awareness of who is reading or why. The same page reads identically for a surgeon and a patient. The wiki is a great library. It has no librarian who knows why you walked in.

Best for: Research compilation, personal knowledge bases, bounded domain expertise, cases where synthesis across sources matters more than retrieval at scale.


How They Fit Together

These three approaches are not competitors on the same spectrum. They address different dimensions of the knowledge problem:

```
LLM Wiki        ← Domain knowledge, compiled at ingest time
Agent Memory    ← User knowledge, written at conversation time, read at query time
RAG             ← Document retrieval, stateless by default, stateful only by deliberate engineering
```

In practice, production AI systems increasingly combine all three: RAG for long-tail retrieval over large corpora, memory for user personalization, and LLM Wiki for compiled domain expertise. The governance layer underneath all of them — data quality, freshness, access control — is what most teams underinvest in. Stale or ungoverned inputs degrade all three simultaneously.
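How the three layers combine is mostly prompt assembly. A sketch with all three systems stubbed out (every function body here is a placeholder for the real component described above):

```python
def wiki_lookup(topic: str) -> str:
    # Stub for the compiled domain layer (ingest-time reasoning).
    return "Compiled page: eligibility, timeline, submission steps."

def memory_read(user_id: str) -> str:
    # Stub for the user layer (write-time extraction, read at query time).
    return "User is in their second year of employment."

def rag_retrieve(query: str) -> str:
    # Stub for long-tail retrieval over the raw corpus (query-time reasoning).
    return "Policy chunk: requests go through the HR portal."

def build_prompt(user_id: str, topic: str, query: str) -> str:
    # Each layer contributes a different slice of context to the final prompt.
    return "\n".join([
        f"[wiki] {wiki_lookup(topic)}",
        f"[memory] {memory_read(user_id)}",
        f"[rag] {rag_retrieve(query)}",
        f"[query] {query}",
    ])

print(build_prompt("u1", "parental-leave", "What do I need to do and when?"))
```

The ordering and tagging scheme here is arbitrary; the point is that the layers compose at the prompt level, so a weakness in one (say, a stale wiki page) degrades the whole answer regardless of how good the other two are.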


A Concrete Example: The Same Query, Three Different Systems

Consider a parental leave policy document. An employee asks: "I just found out I'm pregnant. What do I need to do and when?"

RAG retrieves the eligibility chunk and the submission deadline paragraph — the two most semantically similar pieces. The answer is fragmented: "Employees must have 1 year of tenure. Requests must be submitted 4 weeks in advance." Technically accurate. No synthesis, no sequence, no sense of what to do first or what the full timeline looks like.

Agent Memory recalls that this user is in their second year of employment and previously asked about benefits. It personalizes the opening — "Based on your tenure, you are eligible" — but memory alone has no knowledge of the policy content. Without a document layer alongside it, the personalization wraps around a hollow answer. With RAG or a wiki underneath, the answer becomes both personal and complete.

LLM Wiki draws on a pre-compiled page that already synthesizes eligibility criteria, the 12-month window, primary vs. secondary caregiver differences, and the HR portal submission process into a structured, sequenced summary. The answer reads like it was written by someone who understood the whole policy — because during ingest, it was compiled that way. The tradeoff: if that ingest pass misread the policy, the mistake is now embedded in every answer drawn from that page, and the user has no way to know.


What Most Teams Get Wrong

Most teams default to RAG for everything because it is the lowest-friction starting point. That is a reasonable instinct early on. The mistake is never questioning it.

RAG works until your users start asking questions that require synthesis, continuity, or depth — and then it fails quietly, producing answers that are technically grounded but practically useless. The failure is invisible because retrieval metrics still look fine.

The more precise mistake is treating "where does reasoning happen?" as a technical detail rather than an architectural decision. It determines your maintenance burden, your failure modes, your scaling ceiling, and your ability to personalize — all at once.

The teams building the most capable systems are not debating which approach is best in the abstract. They are asking: what kind of knowledge does this system need, how often does it change, who is asking for it, and how much engineering can we sustain? The answers to those questions determine the architecture — not the other way around.



Top comments (1)

Chen Zhang

solid breakdown but imo the line between rag and agent memory blurs fast once you add rerankers or long-term summaries on top. the wiki angle is underrated though, most teams skip curation until the first bad hallucination ships to prod