Most AI knowledge systems have a hidden failure mode: they keep rediscovering what they already know.
That is fine for lightweight lookup. It becomes a problem in long-horizon, multi-source research. Once you move beyond a handful of documents, the usual pattern of “retrieve some chunks, ask the model, hope it synthesizes correctly” starts to break down. Answers get fuzzier. Citations blur together. Contradictions slip through. And every new document increases cost without necessarily increasing understanding.
I ran into this while working with large research corpora: papers, transcripts, articles, and project-specific source material that needed to stay usable over time. RAG solved one problem — fitting large corpora into a model workflow — but it did not solve the deeper one. It gave me search, not cumulative knowledge.
So I built something different: a compounding markdown wiki inspired by Recursive Language Model (RLM) principles, where ingestion happens once, structure accumulates over time, and the 100th document makes the first 99 more useful. This is not a literal implementation of the RLM paper; it is an adaptation of the core idea that useful state should live outside the model context window and be updated incrementally.
In this post, I’ll break down the failure mode I was seeing, why standard chunk-retrieval RAG was not enough for this kind of work, and how this architecture produces more stable, cited, human-readable research outputs at scale.
The Real Problem: Context Rot
If you’ve used AI agents on research material, you’ve probably seen some version of this:
- You feed one paper into ChatGPT. It answers your question well.
- You feed a second paper. It still works.
- You ask about both papers together. The synthesis gets weaker.
- You add a tenth paper. Hallucinations start creeping in.
- You add a fiftieth. Earlier sources start getting dropped without warning.
This is context rot.
The issue is not that the model “forgets” in a clean, binary way. It is that as the context window fills up, attention gets diluted across more and more tokens. Earlier material does not vanish; it just becomes harder for the model to use precisely. The degradation is gradual and messy:
- Answers become more generic
- Citations get mixed up
- Contradictions go undetected
- Earlier evidence stops influencing later responses reliably
That matters in research workflows, where the hard part usually is not finding a sentence that mentions a term. It is maintaining a consistent understanding across many sources.
What context rot looks like in practice
A typical agent loop looks something like this:
1. Build system prompt (persona, instructions, tool schemas)
2. Add conversation history (growing every turn)
3. Add retrieved context (RAG chunks, file contents)
4. Send to model
5. Get response
6. Append response to history
7. Repeat
By turn 10, you may already be around 30K tokens. By turn 30, you can be well past 100K. Even if your model technically supports a large context window, it does not attend equally well to everything inside it.
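To make that growth concrete, here is a back-of-the-envelope sketch. The numbers are illustrative assumptions, not measurements: a fixed system prompt of 2,000 tokens and roughly 3,400 tokens appended per turn (user message, retrieved context, and reply combined).

```python
def prompt_tokens(turn: int, system: int = 2_000, per_turn: int = 3_400) -> int:
    """Approximate prompt size at a given turn.

    Assumes a fixed system prompt plus a constant number of tokens appended
    per turn. Both defaults are made-up illustrative values.
    """
    return system + per_turn * turn


for turn in (1, 10, 30):
    print(f"turn {turn:>2}: ~{prompt_tokens(turn):,} tokens")
```

Growth is linear in the number of turns under these assumptions, and nothing in the loop ever removes tokens, which is why long sessions drift into the range where attention quality drops.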
The symptoms are familiar:
- The answer shifts from specific claims to vague summaries
- The model contradicts something it said a few turns earlier
- It attributes one source’s claim to the wrong person
- It refuses to answer because it detects inconsistency but cannot resolve it
This is the hidden tax on long-running AI research sessions. And it is what led me to rethink the default architecture.
Why Standard RAG Solves Scale but Not Cumulative Research
The standard answer to context rot is Retrieval-Augmented Generation.
Instead of stuffing everything into the prompt, you embed documents, retrieve the most relevant chunks for the current question, and pass only those chunks to the model. That absolutely helps with scale. It keeps prompts smaller and avoids dumping entire corpora into context.
But basic top-K chunk-retrieval RAG pipelines have a different limitation: they often treat each query as a mostly fresh synthesis problem.
For a query like “What did Smith say about the merger?”, a standard RAG pipeline typically does this:
- Embed the question
- Search for similar chunks across all documents
- Return the top-K matches
- Ask the model to synthesize an answer
That works well for retrieval. It works much less well for cumulative understanding.
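The four steps above can be sketched in a few lines. This is a toy stand-in, not a production pipeline: `embed` here is a bag-of-words counter rather than a real embedding model, and the chunk texts are invented for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


chunks = [
    "Smith said the merger would face regulatory friction.",
    "Jones discussed quarterly revenue growth.",
    "The weather in Brussels was mild.",
]
hits = top_k("What did Smith say about the merger?", chunks, k=1)
# The model would then be asked to synthesize from this prompt.
prompt = "Answer using only these excerpts:\n" + "\n".join(hits)
```

Notice what is missing: nothing computed for this question is stored. The next question about Smith starts from the same raw chunks.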
Production systems can add memory layers, rerankers, metadata filters, cached summaries, or knowledge graphs. But once you add those pieces, you are already moving beyond plain chunk retrieval and toward persistent state — which is the shift this post is about.
Where standard RAG falls short for research
No persistent cross-source memory by default
If Document A mentions Smith’s stance on regulation, and Document B records Jones pushing the opposite position, basic RAG does not persist that relationship. It may retrieve both chunks if your query is good enough, but it does not maintain an explicit “Smith vs. Jones” structure over time.
That means comparisons are synthesized from scratch on demand, not built into the knowledge system.
No compounding
Adding Document C does not automatically make the system smarter about Documents A and B. The vector index just gets larger.
In practice, that often means more possible matches, more noise, and more opportunity for irrelevant chunks to crowd out the useful ones. As the corpus grows, retrieval quality can degrade unless you keep tuning chunking, ranking, metadata filters, and prompts.
Weak provenance
RAG can usually tell you which file a chunk came from. It is much worse at maintaining fine-grained provenance over time:
- Which paragraph supported this claim?
- Is the claim directly stated or inferred?
- Do two sources disagree on the same point?
- Was this information updated later by a more reliable source?
Those questions matter in research, and standard chunk retrieval does not answer them by itself.
Query-time cost keeps accumulating
As more documents enter the system, you are not only searching more material. You are also often retrieving more candidates, performing more ranking, and building prompts that grow in complexity. Latency and token costs rise with corpus size, and the burden shifts to query time.
That is acceptable for search. It is inefficient for ongoing research programs with hundreds of documents per project.
The Shift: Knowledge Should Compound
What I wanted was not a better search engine. I wanted a knowledge base.
More specifically, I wanted a system where:
- The 50th document is processed in the context of what the first 49 already taught the system
- Query cost depends more on the selected wiki pages than on the full raw corpus
- Every claim traces back to a specific source paragraph
- Contradictions are flagged for review instead of being silently overwritten
- The resulting knowledge is human-readable markdown, not an opaque index
That is the real distinction here:
A search engine finds relevant documents. A knowledge base preserves relationships between them.
The RLM Pattern Behind the Design
The deeper design pattern I built on is Recursive Language Models (arXiv:2512.24601). Again, I am using the paper as inspiration rather than claiming a direct implementation.
The core idea I borrowed is simple:
- keep persistent external state outside the model’s context window
- use the model for bounded operations on that state
- avoid making the model reconstruct everything from raw text on every query
In my case, that persistent state is a markdown wiki.
Instead of asking the LLM to rediscover structure from raw documents over and over, I let it:
- extract entities from a chunk
- update a specific page
- merge structured facts
- synthesize from pre-resolved references
That shifts the expensive reasoning from “every query forever” to “once at ingestion, then maintained incrementally.”
The Architecture: A Compounding Markdown Wiki
The system is organized as a markdown wiki with three layers:
wiki/
├── SCHEMA.md # Layer 3: Domain conventions, tag taxonomy
├── index.md # Content catalog
├── log.md # Audit trail
├── raw/ # Layer 1: Immutable source files
│ ├── articles/
│ ├── papers/
│ └── transcripts/
├── entities/ # Layer 2: People, orgs, products
├── concepts/ # Layer 2: Topics, ideas
├── comparisons/ # Layer 2: Side-by-side analyses
└── queries/ # Layer 2: Preserved Q&A
This structure is deliberately plain. No proprietary store. No hidden schema behind an API. The wiki itself is the database.
Layer 1: Raw sources
The raw/ directory contains immutable source material. The system can read these files, but it does not rewrite them.
Each source gets a SHA-256 hash, which makes drift detection straightforward. If I ingest the same source again and the content hash has not changed, the pipeline can safely skip expensive work.
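A minimal sketch of that skip-if-unchanged check, using Python's standard `hashlib`. The manifest file and the `needs_ingest` helper are hypothetical names for illustration; the actual project may store hashes differently.

```python
import hashlib
import json
from pathlib import Path


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the normalized source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_ingest(source_id: str, text: str, manifest_path: Path) -> bool:
    """Return True if the source is new or changed, and record its hash.

    The manifest is a hypothetical JSON file mapping source_id -> digest.
    """
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    digest = content_hash(text)
    if manifest.get(source_id) == digest:
        return False  # unchanged: skip chunking and extraction
    manifest[source_id] = digest
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return True
```

Hashing the normalized markdown rather than the original bytes means cosmetic changes in the fetched HTML or PDF do not trigger a re-ingest, as long as normalization is deterministic.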
Layer 2: The wiki
This is the agent-owned knowledge layer.
Every major entity gets its own page:
- people
- organizations
- products
Every major concept gets its own page:
- topics
- definitions
- theories
- techniques
Pages are linked using [[wikilinks]], which makes relationships explicit and navigable. This is where the compounding happens: as new sources arrive, they enrich existing pages rather than living as isolated chunks.
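Because links are plain text, the relationship graph can be recovered with a regular expression. The sketch below is an assumption about how one might do it, not the project's code; it handles `[[Page]]`, `[[Page#anchor]]`, and the Obsidian-style `[[Page|label]]` form.

```python
import re
from collections import defaultdict

# Captures the page name; an optional #anchor and |label are matched but ignored.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def outgoing_links(markdown: str) -> set[str]:
    """Page names referenced from a page body."""
    return {m.group(1).strip() for m in WIKILINK.finditer(markdown)}


def backlinks(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each link target to the set of pages that link to it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, body in pages.items():
        for target in outgoing_links(body):
            index[target].add(name)
    return dict(index)
```

A backlink index like this is one way to answer "which pages mention this entity?" without any embedding lookup.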
Layer 3: The schema
SCHEMA.md defines domain conventions, taxonomy, and page structure. This matters because consistency is what allows incremental updates to stay clean over time. Without a schema, you do not get a knowledge base — you get a pile of markdown.
Why markdown matters
One of the most practical design decisions here is also one of the least flashy: everything is stored as files.
That gives you several advantages immediately:
- You can inspect the knowledge base without custom tooling
- You can open it directly in Obsidian
- You can version it with Git
- You can edit pages by hand when needed
- You are not locked into a vector database or vendor-specific format
That last point matters for long-lived research systems. If your knowledge base outlives your current model stack, it should still be readable.
How Ingestion Works
The ingestion pipeline is where the system earns its keep. Instead of postponing structure until query time, it resolves as much as possible up front.
When I ingest a source — whether it is a URL, PDF, or transcript — the pipeline does the following:
- Fetch and normalize the content into clean markdown
- Write to raw/ with metadata and a SHA-256 hash
- Check drift and skip unchanged sources
- Chunk semantically instead of by arbitrary token count alone
- Extract per chunk using a smaller, cheaper model call
- Aggregate and deduplicate the extracted structures
- Cross-reference against existing wiki pages
- Generate or update pages via LLM-assisted editing
- Update the index and append to the audit log
The key difference from RAG is this:
Extraction happens once at ingest time, not repeatedly at query time.
What “semantic chunking” means here
In this system, chunking is heuristic rather than magical. I usually split by the strongest available content boundary first:
- document headings and subheadings for articles or papers
- paragraph groups for prose-heavy text
- speaker turns for transcripts
- sentence windows only as a fallback when no stronger boundary exists
The goal is to keep each chunk as a coherent unit of thought. That improves:
- entity resolution
- fact extraction
- citation accuracy
- contradiction checks
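The boundary-first heuristic can be sketched as follows. This version covers only two of the cases listed earlier (markdown headings, then paragraph groups); speaker turns and sentence windows are left out, and the `max_paragraphs` default is an arbitrary assumption.

```python
import re


def chunk_markdown(text: str, max_paragraphs: int = 4) -> list[str]:
    """Split on markdown headings first, then on paragraph groups."""
    # Zero-width split: each heading line starts a new section and stays
    # attached to the text that follows it.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks: list[str] = []
    for section in sections:
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", section) if p.strip()]
        for i in range(0, len(paragraphs), max_paragraphs):
            chunks.append("\n\n".join(paragraphs[i : i + max_paragraphs]))
    return chunks
```

A chunk never crosses a heading, so a claim and the section title that scopes it stay together, which is the property that matters for citation accuracy.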
What gets extracted per chunk
Each chunk is processed into structured JSON containing:
- Entities: people, organizations, products
- Concepts: topics and definitions
- Facts: claims with certainty markers and provenance
Here is the kind of structure this stage is designed to produce:
{
  "entities": [
    {
      "name": "John Smith",
      "type": "person",
      "aliases": ["J. Smith"]
    }
  ],
  "concepts": [
    {
      "name": "merger regulation",
      "definition": "Regulatory constraints affecting merger approval."
    }
  ],
  "facts": [
    {
      "claim": "Smith argued that the merger would face regulatory friction.",
      "certainty": "stated",
      "source_paragraph": "p12",
      "source_id": "q3-call-2026-04"
    }
  ]
}
The exact schema can vary by domain, but the point is the same: ingestion produces structured evidence, not just searchable text.
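Since each chunk is extracted independently, the same person can show up several times under different spellings. Here is one possible sketch of the aggregate-and-deduplicate step for entities, assuming the JSON shape shown above. The `normalize` rule (lowercase, drop periods, collapse whitespace) is a simplifying assumption; real entity resolution needs more than string matching.

```python
def normalize(name: str) -> str:
    """Crude matching key: lowercase, no periods, single spaces."""
    return " ".join(name.lower().replace(".", "").split())


def merge_entities(extractions: list[dict]) -> list[dict]:
    """Merge entities across chunk outputs when a name or alias matches."""
    merged: list[dict] = []
    index: dict[str, dict] = {}  # normalized name or alias -> merged entity
    for extraction in extractions:
        for entity in extraction.get("entities", []):
            names = [entity["name"], *entity.get("aliases", [])]
            keys = [normalize(n) for n in names]
            target = next((index[k] for k in keys if k in index), None)
            if target is None:
                target = {"name": entity["name"], "type": entity["type"], "aliases": []}
                merged.append(target)
            for n in names:
                if n != target["name"] and n not in target["aliases"]:
                    target["aliases"].append(n)
            for k in keys:
                index[k] = target
    return merged
```

The first spelling seen becomes the canonical name and later variants are kept as aliases, so nothing extracted is thrown away.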
A Compact End-to-End Example
This is the shortest example I have found that shows the full idea.
1) Source excerpt
[Source: raw/transcripts/q3-call-2026-04.md]
Paragraph p12:
"John Smith said the merger would likely face regulatory friction in the EU,
but still described the deal as strategically necessary."
2) Extraction output
{
  "entities": [
    { "name": "John Smith", "type": "person" },
    { "name": "EU", "type": "organization" }
  ],
  "concepts": [
    {
      "name": "merger regulation",
      "definition": "Regulatory constraints that may affect merger approval."
    }
  ],
  "facts": [
    {
      "claim": "John Smith said the merger would likely face regulatory friction in the EU.",
      "certainty": "stated",
      "source_id": "q3-call-2026-04",
      "source_paragraph": "p12"
    },
    {
      "claim": "John Smith described the deal as strategically necessary.",
      "certainty": "stated",
      "source_id": "q3-call-2026-04",
      "source_paragraph": "p12"
    }
  ]
}
3) Wiki page update
# John Smith
tags: [person, executive]
## Positions
- Smith said the merger would likely face regulatory friction in the EU.
Source: [[q3-call-2026-04#p12]]
- Smith described the deal as strategically necessary.
Source: [[q3-call-2026-04#p12]]
## Related
- [[merger regulation]]
- [[EU]]
4) Query response
If I later ask:
What did Smith say about the merger in the Q3 call?
The system can answer from the wiki first:
John Smith said the merger would likely face regulatory friction in the EU,
while still describing the deal as strategically necessary.
Citations:
- [[John Smith]]
- [[q3-call-2026-04#p12]]
That is the compounding behavior in miniature: source text becomes structured evidence, evidence updates a persistent page, and later questions hit the page instead of forcing the model to rediscover the same relationship from raw text.
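Step 3 of the example, turning extracted facts into a page update, can be done deterministically once the facts exist. The sketch below is a simplified illustration: the real pipeline described in this post uses LLM-assisted editing, whereas these two hypothetical helpers only append rendered bullets to a named section.

```python
def render_claims(facts: list[dict]) -> str:
    """Render extracted facts as markdown bullets with source anchors."""
    lines = []
    for fact in facts:
        anchor = f"[[{fact['source_id']}#{fact['source_paragraph']}]]"
        lines.append(f"- {fact['claim']}\n  Source: {anchor}")
    return "\n".join(lines)


def append_to_section(page: str, heading: str, block: str) -> str:
    """Append block at the end of the '## heading' section, creating it if absent."""
    lines = page.rstrip("\n").split("\n")
    marker = f"## {heading}"
    if marker not in lines:
        return "\n".join(lines + ["", marker, block]) + "\n"
    start = lines.index(marker)
    # The section ends at the next level-2 heading, or at end of page.
    end = next(
        (i for i in range(start + 1, len(lines)) if lines[i].startswith("## ")),
        len(lines),
    )
    return "\n".join(lines[:end] + [block] + lines[end:]) + "\n"
```

Keeping the source anchor on the line directly under each claim means provenance survives manual edits and Git diffs stay readable.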
How Contradiction Detection Works
This is the part I had to be careful not to overstate.
I do not treat contradiction detection as perfect. In my implementation, it is a review-oriented mechanism, not a guarantee of truth.
At ingestion time, new facts are compared against existing page facts with a mix of:
- entity matching
- claim-type matching from the schema
- provenance checks
- an LLM judgment step that labels a new fact as supports, updates, conflicts with, or unclear
When the system sees a likely conflict, it does not auto-resolve it into one canonical statement. It adds both claims to the relevant page and marks the conflict for review.
A simplified page fragment might look like this:
## Open conflicts
- Smith said the merger was strategically necessary.
Source: [[q3-call-2026-04#p12]]
- Smith said the merger should be delayed until regulatory conditions improve.
Source: [[investor-interview-2026-05#p4]]
Status: needs review
So when I say contradictions are “flagged,” I mean candidate contradictions are surfaced and preserved instead of being silently smoothed over.
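The comparison flow described in this section can be sketched like this. Several things here are assumptions for illustration: the `entity` and `claim_type` fields on each fact, the short label `conflicts`, and the `judge` callable, which stands in for the LLM judgment step so the example stays runnable without a model.

```python
from typing import Callable

LABELS = {"supports", "updates", "conflicts", "unclear"}


def review_fact(
    new_fact: dict,
    existing: list[dict],
    judge: Callable[[str, str], str],
) -> list[tuple[dict, str]]:
    """Compare a new fact against existing facts on the same entity and claim type."""
    results = []
    for old in existing:
        same_subject = old["entity"] == new_fact["entity"]
        same_type = old["claim_type"] == new_fact["claim_type"]
        if not (same_subject and same_type):
            continue  # cheap filters run before the expensive judgment call
        label = judge(old["claim"], new_fact["claim"])
        if label not in LABELS:
            label = "unclear"  # never trust an off-schema label
        results.append((old, label))
    return results


def open_conflicts(new_fact: dict, reviewed: list[tuple[dict, str]]) -> list[dict]:
    """Keep both sides of every conflict; nothing is auto-resolved."""
    return [
        {"claims": [old, new_fact], "status": "needs review"}
        for old, label in reviewed
        if label == "conflicts"
    ]
```

Treating any unexpected label as `unclear` keeps a misbehaving judgment step from silently marking a conflict as resolved.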
How Querying Works
Once the wiki exists, queries become much lighter.
Instead of searching raw chunks across the entire corpus, the system searches the structured wiki first. In many cases, that means keyword, title, and tag-based search, with no embeddings required for the first pass.
The query flow looks like this:
- Search the wiki
- Select a strategy using a fast model call
- Execute the strategy with the appropriate model
- Return an answer with [[Page Name]] or source-anchor citations
The strategy selection matters because not every question needs the same level of work.
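For the first step, an embedding-free search over the wiki might look like the sketch below. The in-memory `pages` shape and the 3/2/1 weights for title, tag, and body hits are assumptions chosen for illustration, not values from the project.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_wiki(query: str, pages: dict[str, dict], limit: int = 5) -> list[str]:
    """Rank pages by keyword overlap, weighting title and tag hits above body hits.

    `pages` maps a title to {"tags": [...], "body": "..."}.
    """
    q = tokens(query)
    scored = []
    for title, page in pages.items():
        score = (
            3 * len(q & tokens(title))
            + 2 * len(q & tokens(" ".join(page.get("tags", []))))
            + 1 * len(q & tokens(page.get("body", "")))
        )
        if score:
            scored.append((score, title))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in scored[:limit]]
```

This works as a first pass only because the pages are already organized around entities and concepts, so a name in the query tends to match a title directly.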
Query strategies
Lookup
For a single fact from one page.
Synthesis
For compare/contrast tasks across multiple pages or sources.
Deep-dive
For more extensive analysis across a small set of selected pages plus raw-source verification if needed.
In practice, query cost is much closer to wiki size and selected pages than to total raw document count. It is not literally constant, but it scales more gently than repeatedly searching and prompting over the full corpus.
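One way to wire up the three strategies is sketched below. The `classify` callable is a placeholder for the fast model call, and the page-count fallback is my own assumption about a reasonable default when that call returns something unexpected.

```python
from enum import Enum
from typing import Callable


class Strategy(Enum):
    LOOKUP = "lookup"
    SYNTHESIS = "synthesis"
    DEEP_DIVE = "deep-dive"


def choose_strategy(
    question: str,
    page_count: int,
    classify: Callable[[str], str],
) -> Strategy:
    """Pick a query strategy from the classifier's label, with a safe fallback."""
    label = classify(question)
    try:
        return Strategy(label)
    except ValueError:
        # Off-schema label: one matching page suggests a lookup, several a synthesis.
        return Strategy.LOOKUP if page_count <= 1 else Strategy.SYNTHESIS
```

The fallback never escalates to a deep-dive on its own, so a bad label cannot trigger the most expensive path.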
Where raw retrieval still fits
This is not an anti-RAG purity test. I still use raw-document retrieval in edge cases:
- cold-start questions before enough wiki structure exists
- verification against the underlying source text
- cases where a page summary looks incomplete or stale
- exploratory search across new material before ingestion rules are tuned
For me, the best framing is not “replace RAG everywhere.” It is “stop making raw retrieval do all the memory work.”
Why This Beats Plain RAG for Research
Here is the comparison that mattered in practice:
| Dimension | Plain chunk-retrieval RAG | Compounding Wiki |
|---|---|---|
| Knowledge growth | Re-synthesizes per query | Accumulates across sources |
| Cross-referencing | Implicit (similar chunks) | Explicit (entity pages + wikilinks) |
| Query cost trend | Usually grows with retrieval/ranking complexity | Tied more to selected pages than total corpus |
| Provenance | Often file/chunk-level | Page and paragraph-anchor level |
| Contradictions | Usually handled inside answer generation | Surfaced as reviewable conflicts |
| Human-readable | Often opaque to humans | Plain markdown |
| Edit maintenance | Often requires re-indexing | Often localized to page updates |
A few of these deserve emphasis.
Provenance becomes operational, not cosmetic
In many RAG systems, “citation” means attaching a filename or chunk excerpt to an answer. That is useful, but it is not enough for research-quality traceability.
In the wiki model, claims can point back to specific paragraphs, and disagreements can remain visible instead of being collapsed into one model-generated summary.
Contradictions stop disappearing
One of the worst behaviors in AI research pipelines is silent reconciliation. If two sources disagree, the model often smooths them into a single answer.
By maintaining persistent pages and structured fact updates, likely conflicts can be surfaced and flagged instead of overwritten. That changes the system from “helpful summarizer” into something closer to a research assistant.
Edits become cheaper
In a vector-centric workflow, changing a source often means re-embedding and re-indexing. In this setup, a targeted edit can mean updating one source, recomputing only the affected structures, and revising the relevant pages.
That is not free, but it is usually more localized.
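Because citations are plain-text anchors, finding the pages affected by a changed source is a file scan. The sketch below assumes the directory layout shown earlier, with sources cited as `[[source_id#anchor]]`; `pages_citing` is a hypothetical helper name.

```python
import re
from pathlib import Path


def pages_citing(source_id: str, wiki_root: Path) -> list[Path]:
    """Find wiki pages that cite [[source_id]] or [[source_id#anchor]]."""
    pattern = re.compile(
        r"\[\[" + re.escape(source_id) + r"(#[^\]|]*)?(\|[^\]]*)?\]\]"
    )
    hits = []
    for path in sorted(wiki_root.rglob("*.md")):
        if "raw" in path.relative_to(wiki_root).parts:
            continue  # raw/ holds immutable sources, not agent-owned pages
        if pattern.search(path.read_text(encoding="utf-8")):
            hits.append(path)
    return hits
```

Only the returned pages need to be revisited after a source changes; everything else in the wiki is untouched.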
Practical Performance Notes
The original draft of this post was too absolute here, so let me be precise.
The exact latency and cost depend on:
- the model provider
- whether extraction uses a small local model or an API model
- corpus size and document shape
- how aggressively you verify against raw sources
- the size of the wiki pages being synthesized at query time
In my own testing, the main gain was not “zero scaling cost.” It was moving a large share of the work to ingestion and keeping many queries small because they hit pre-structured pages instead of raw corpora.
So when I say the system is more stable at scale, I mean:
- qualitatively more consistent on repeated research questions
- operationally easier to trace and debug
- economically better when the same corpus gets queried many times
If you publish hard benchmark numbers, include your corpus, models, prompt strategy, and how you define the workload being measured. Without that context, qualitative claims are more honest.
What This Enables
The compounding wiki model changes what kinds of workflows are practical.
Multi-source research
You can ask questions across large document sets and get answers tied back to the exact paragraph that supports them.
If you ask:
What did Smith say about the merger in the Q3 call?
the goal is not to retrieve a vaguely relevant transcript chunk. The goal is to resolve “Smith,” locate the relevant page or source anchor, and return the supporting paragraph.
Project-specific memory for teams and writers
Each new interview, paper, or article enriches what the system already knows.
If the third transcript creates a [[John Smith]] page, the fiftieth transcript should update that page, not create one more disconnected chunk.
Better structure as the corpus matures
As the graph gets denser, the wiki tends to become more useful:
- more entity resolution
- better cross-referencing
- richer concept pages
- stronger comparison pages
That is the compounding effect plain RAG does not naturally provide on its own.
Practical Packaging and Tooling
The system is open source here:
https://github.com/pksw4u/rlm-wiki
At the time of writing, I would describe it as a practical, evolving project rather than a finished universal framework.
It works in several modes:
- As a Hermes Agent skill by dropping SKILL.md into your skills directory
- As a standalone Python library via pip install rlm-wiki
- As a native Obsidian vault because the wiki is plain markdown
- Alongside llm-wiki, using the same format so they can coexist
If you are not familiar with those names:
- Hermes Agent is the agent runtime this project was designed to plug into.
- llm-wiki is a related markdown-wiki workflow that shares a compatible file format.
Because the wiki is just files, the data is not trapped inside a database that only one runtime understands.
Minimal usage example
pip install rlm-wiki
rlm-wiki ingest ./docs/q3-call.pdf --project ./wiki
rlm-wiki ask "What did Smith say about the merger?" --project ./wiki
Example entity page format
# John Smith
tags: [person]
aliases: [J. Smith]
## Summary
Executive involved in merger discussions.
## Claims
- Smith said the merger would likely face regulatory friction in the EU.
Source: [[q3-call-2026-04#p12]]
## Related
- [[merger regulation]]
- [[EU]]
Here is the operational point I care about most:
The wiki is just markdown files. No required database, no vendor lock-in, and no mandatory embedding index to keep the core state readable.
That means your knowledge base remains usable even if your model stack changes later.
The Deeper Point: Search Is Not Understanding
A lot of AI tooling still treats every question as a fresh event:
- search
- retrieve
- answer
- forget
That is a useful pattern for document access. It is a poor pattern for research.
Research is cumulative. If you are trying to understand Smith’s position, the answer usually does not live in one chunk. It emerges from multiple interviews, repeated claims, revisions over time, and contradictions with other people’s statements.
A system that compounds can do things a query-time retrieval stack struggles with:
- resolve some relationships once instead of rediscovering them repeatedly
- preserve explicit links across sources
- keep provenance attached to claims
- improve existing pages as new evidence appears
That is not just a better implementation detail. It is a different idea of what an AI knowledge system should be.
Key Takeaways
- RAG is good for retrieval, but plain chunk-retrieval RAG is weak for cumulative research because it treats many questions as fresh synthesis tasks.
- Context rot is a real scaling problem in long-running AI research workflows, not just a context-window marketing issue.
- A markdown wiki can serve as persistent external state, replacing vector-only memory with readable, editable files.
- Ingestion-time extraction changes the economics by doing the hard structuring work once instead of on every query.
- RLM-style bounded operations keep the working set small even as document collections grow.
- Hybrid systems are often the practical answer: persistent wiki first, raw retrieval when needed.
Conclusion
What I ended up building was not “RAG, but better.” It was a different architecture with a different goal.
RAG helps models find relevant text. A compounding wiki helps a system build and preserve relationships, provenance, and working understanding over time. That distinction starts to matter once your corpus gets large, your questions become comparative, and your tolerance for vague synthesis drops.
If you work with research-heavy AI workflows, this is the assumption I would revisit: maybe the model should not be responsible for reconstructing knowledge from scratch on every question. Maybe the system should already know how your sources relate before the question is even asked.
That is the promise of a compounding knowledge base. The more you feed it, the more useful its prior structure becomes — not because the model got bigger, but because the system got better at remembering.