Pramoda Sahu

Posted on Jun 25

How I Built a Compounding Markdown Wiki for Research: When Standard RAG Stops Being Enough

Most AI knowledge systems have a hidden failure mode: they keep rediscovering what they already know.

That is fine for lightweight lookup. It becomes a problem in long-horizon, multi-source research. Once you move beyond a handful of documents, the usual pattern of “retrieve some chunks, ask the model, hope it synthesizes correctly” starts to break down. Answers get fuzzier. Citations blur together. Contradictions slip through. And every new document increases cost without necessarily increasing understanding.

I ran into this while working with large research corpora: papers, transcripts, articles, and project-specific source material that needed to stay usable over time. RAG solved one problem — fitting large corpora into a model workflow — but it did not solve the deeper one. It gave me search, not cumulative knowledge.

So I built something different: a compounding markdown wiki inspired by Recursive Language Model (RLM) principles, where ingestion happens once, structure accumulates over time, and the 100th document makes the first 99 more useful. This is not a literal implementation of the RLM paper; it is an adaptation of the core idea that useful state should live outside the model context window and be updated incrementally.

In this post, I’ll break down the failure mode I was seeing, why standard chunk-retrieval RAG was not enough for this kind of work, and how this architecture produces more stable, cited, human-readable research outputs at scale.

The Real Problem: Context Rot

If you’ve used AI agents on research material, you’ve probably seen some version of this:

You feed one paper into ChatGPT. It answers your question well.
You feed a second paper. It still works.
You ask about both papers together. The synthesis gets weaker.
You add a tenth paper. Hallucinations start creeping in.
You add a fiftieth. Earlier sources start getting dropped without warning.

This is context rot.

The issue is not that the model “forgets” in a clean, binary way. It is that as the context window fills up, attention gets diluted across more and more tokens. Earlier material does not vanish; it just becomes harder for the model to use precisely. The degradation is gradual and messy:

Answers become more generic
Citations get mixed up
Contradictions go undetected
Earlier evidence stops influencing later responses reliably

That matters in research workflows, where the hard part usually is not finding a sentence that mentions a term. It is maintaining a consistent understanding across many sources.

What context rot looks like in practice

A typical agent loop looks something like this:

1. Build system prompt (persona, instructions, tool schemas)
2. Add conversation history (growing every turn)
3. Add retrieved context (RAG chunks, file contents)
4. Send to model
5. Get response
6. Append response to history
7. Repeat

By turn 10, you may already be around 30K tokens. By turn 30, you can be well past 100K. Even if your model technically supports a large context window, it does not attend equally well to everything inside it.

The symptoms are familiar:

The answer shifts from specific claims to vague summaries
The model contradicts something it said a few turns earlier
It attributes one source’s claim to the wrong person
It refuses to answer because it detects inconsistency but cannot resolve it

This is the hidden tax on long-running AI research sessions. And it is what led me to rethink the default architecture.

Why Standard RAG Solves Scale but Not Cumulative Research

The standard answer to context rot is Retrieval-Augmented Generation.

Instead of stuffing everything into the prompt, you embed documents, retrieve the most relevant chunks for the current question, and pass only those chunks to the model. That absolutely helps with scale. It keeps prompts smaller and avoids dumping entire corpora into context.

But basic top-K chunk-retrieval RAG pipelines have a different limitation: they often treat each query as a mostly fresh synthesis problem.

For a query like “What did Smith say about the merger?”, a standard RAG pipeline typically does this:

Embed the question
Search for similar chunks across all documents
Return the top-K matches
Ask the model to synthesize an answer
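
The four steps above can be sketched without any external dependencies. This is a minimal illustration, assuming a toy bag-of-words count in place of a real embedding model; `embed`, `cosine`, and `top_k` are hypothetical names, not part of any particular RAG library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the question and keep the best k."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Smith said the merger would face regulatory friction in the EU.",
    "Jones discussed quarterly revenue growth.",
    "The merger timeline was revised after the Q3 call.",
]
context = top_k("What did Smith say about the merger?", chunks, k=2)
# The prompt is rebuilt from raw chunks on every question; nothing is remembered.
prompt = "Answer using only this context:\n" + "\n".join(context)
```

Note that nothing in this loop persists between calls: the next question starts again from the raw chunks.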

That works well for retrieval. It works much less well for cumulative understanding.

Production systems can add memory layers, rerankers, metadata filters, cached summaries, or knowledge graphs. But once you add those pieces, you are already moving beyond plain chunk retrieval and toward persistent state — which is the shift this post is about.

Where standard RAG falls short for research

No persistent cross-source memory by default

If Document A mentions Smith’s stance on regulation, and Document B records Jones pushing the opposite position, basic RAG does not persist that relationship. It may retrieve both chunks if your query is good enough, but it does not maintain an explicit “Smith vs. Jones” structure over time.

That means comparisons are synthesized from scratch on demand, not built into the knowledge system.

No compounding

Adding Document C does not automatically make the system smarter about Documents A and B. The vector index just gets larger.

In practice, that often means more possible matches, more noise, and more opportunity for irrelevant chunks to crowd out the useful ones. As the corpus grows, retrieval quality can degrade unless you keep tuning chunking, ranking, metadata filters, and prompts.

Weak provenance

RAG can usually tell you which file a chunk came from. It is much worse at maintaining fine-grained provenance over time:

Which paragraph supported this claim?
Is the claim directly stated or inferred?
Do two sources disagree on the same point?
Was this information updated later by a more reliable source?

Those questions matter in research, and standard chunk retrieval does not answer them by itself.

Query-time cost keeps accumulating

As more documents enter the system, you are not only searching more material. You are often also retrieving more candidates, doing more ranking, and assembling more complex prompts. Latency and token costs tend to rise with corpus size, and that burden lands at query time.

That is acceptable for search. It is inefficient for ongoing research programs with hundreds of documents per project.

The Shift: Knowledge Should Compound

What I wanted was not a better search engine. I wanted a knowledge base.

More specifically, I wanted a system where:

The 50th document is processed in the context of what the first 49 already taught the system
Query cost depends more on the selected wiki pages than on the full raw corpus
Every claim traces back to a specific source paragraph
Contradictions are flagged for review instead of being silently overwritten
The resulting knowledge is human-readable markdown, not an opaque index

That is the real distinction here:

A search engine finds relevant documents. A knowledge base preserves relationships between them.

The RLM Pattern Behind the Design

The deeper design pattern I built on is Recursive Language Models (arXiv:2512.24601). Again, I am using the paper as inspiration rather than claiming a direct implementation.

The core idea I borrowed is simple:

keep persistent external state outside the model’s context window
use the model for bounded operations on that state
avoid making the model reconstruct everything from raw text on every query

In my case, that persistent state is a markdown wiki.

Instead of asking the LLM to rediscover structure from raw documents over and over, I let it:

extract entities from a chunk
update a specific page
merge structured facts
synthesize from pre-resolved references

That shifts the expensive reasoning from “every query forever” to “once at ingestion, then maintained incrementally.”

The Architecture: A Compounding Markdown Wiki

The system is organized as a markdown wiki with three layers:

wiki/
├── SCHEMA.md           # Layer 3: Domain conventions, tag taxonomy
├── index.md            # Content catalog
├── log.md              # Audit trail
├── raw/                # Layer 1: Immutable source files
│   ├── articles/
│   ├── papers/
│   └── transcripts/
├── entities/           # Layer 2: People, orgs, products
├── concepts/           # Layer 2: Topics, ideas
├── comparisons/        # Layer 2: Side-by-side analyses
└── queries/            # Layer 2: Preserved Q&A

This structure is deliberately plain. No proprietary store. No hidden schema behind an API. The wiki itself is the database.

Layer 1: Raw sources

The raw/ directory contains immutable source material. The system can read these files, but it does not rewrite them.

Each source gets a SHA-256 hash, which makes drift detection straightforward. If I ingest the same source again and the content hash has not changed, the pipeline can safely skip expensive work.
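
A minimal sketch of that skip check, using only the standard library. The in-memory `known_hashes` dictionary and the function names are assumptions for illustration; the project may store hashes differently (for example in file metadata).

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of the normalized source text, as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_ingest(source_id: str, text: str, known_hashes: dict[str, str]) -> bool:
    """Return True when the source is new or its content has changed."""
    digest = content_hash(text)
    if known_hashes.get(source_id) == digest:
        return False  # unchanged: skip extraction and page updates
    known_hashes[source_id] = digest
    return True


hashes: dict[str, str] = {}
first = needs_ingest("q3-call-2026-04", "Smith said ...", hashes)
repeat = needs_ingest("q3-call-2026-04", "Smith said ...", hashes)
edited = needs_ingest("q3-call-2026-04", "Smith said ... (corrected)", hashes)
```

Hashing the normalized markdown rather than the original bytes means cosmetic changes in the fetched file (for example different HTML wrapping) do not trigger a re-ingest, as long as normalization is deterministic.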

Layer 2: The wiki

This is the agent-owned knowledge layer.

Every major entity gets its own page:

people
organizations
products

Every major concept gets its own page:

topics
definitions
theories
techniques

Pages are linked using [[wikilinks]], which makes relationships explicit and navigable. This is where the compounding happens: as new sources arrive, they enrich existing pages rather than living as isolated chunks.
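
To make "explicit and navigable" concrete, here is a small sketch that parses wikilinks and builds a backlink map. The regular expression assumes Obsidian-style syntax (`[[Target]]`, `[[Target#anchor]]`, `[[Target|label]]`), and `extract_links` and `build_backlinks` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Matches [[Target]], [[Target#anchor]], and [[Target|label]]; captures only Target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def extract_links(markdown: str) -> list[str]:
    """Return link targets in order of appearance, without anchors or labels."""
    return [m.group(1).strip() for m in WIKILINK.finditer(markdown)]


def build_backlinks(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each link target to the set of page titles that point at it."""
    backlinks: dict[str, set[str]] = defaultdict(set)
    for title, body in pages.items():
        for target in extract_links(body):
            backlinks[target].add(title)
    return dict(backlinks)


pages = {
    "John Smith": "Source: [[q3-call-2026-04#p12]]\n- [[merger regulation]]\n- [[EU]]",
    "merger regulation": "See [[John Smith]] and [[EU]].",
}
backlinks = build_backlinks(pages)
```

Here `backlinks["EU"]` contains both page titles, which is the kind of relationship a plain chunk index does not record.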

Layer 3: The schema

SCHEMA.md defines domain conventions, taxonomy, and page structure. This matters because consistency is what allows incremental updates to stay clean over time. Without a schema, you do not get a knowledge base — you get a pile of markdown.

Why markdown matters

One of the most practical design decisions here is also one of the least flashy: everything is stored as files.

That gives you several advantages immediately:

You can inspect the knowledge base without custom tooling
You can open it directly in Obsidian
You can version it with Git
You can edit pages by hand when needed
You are not locked into a vector database or vendor-specific format

That last point matters for long-lived research systems. If your knowledge base outlives your current model stack, it should still be readable.

How Ingestion Works

The ingestion pipeline is where the system earns its keep. Instead of postponing structure until query time, it resolves as much as possible up front.

When I ingest a source — whether it is a URL, PDF, or transcript — the pipeline does the following:

Fetch and normalize the content into clean markdown
Write to raw/ with metadata and a SHA-256 hash
Check drift and skip unchanged sources
Chunk semantically instead of by arbitrary token count alone
Extract per chunk using a smaller, cheaper model call
Aggregate and deduplicate the extracted structures
Cross-reference against existing wiki pages
Generate or update pages via LLM-assisted editing
Update the index and append to the audit log

The key difference from RAG is this:

Extraction happens once at ingest time, not repeatedly at query time.

What “semantic chunking” means here

In this system, chunking is heuristic rather than magical. I usually split by the strongest available content boundary first:

document headings and subheadings for articles or papers
paragraph groups for prose-heavy text
speaker turns for transcripts
sentence windows only as a fallback when no stronger boundary exists

The goal is to keep each chunk as a coherent unit of thought. That improves:

entity resolution
fact extraction
citation accuracy
contradiction checks
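
The boundary preference described above can be sketched as a simple cascade. This is a heuristic illustration, not the project's chunker: the regular expressions, the "at least two speaker turns" rule, and the three-sentence fallback window are all assumptions, and "paragraph groups" is simplified to blank-line-separated paragraphs.

```python
import re

HEADING = re.compile(r"^#{1,6}\s", re.MULTILINE)
# A line that starts like "Smith: " or "Dr. Jones: " is treated as a speaker turn.
SPEAKER_TURN = re.compile(r"^[A-Z][\w .'-]{0,40}:\s", re.MULTILINE)


def split_before(pattern: re.Pattern, text: str) -> list[str]:
    """Split text so that each piece starts at a match of pattern."""
    starts = [m.start() for m in pattern.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    pieces = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [p for p in pieces if p]


def chunk(text: str) -> list[str]:
    """Use the strongest boundary present: headings, speaker turns, then paragraphs."""
    if HEADING.search(text):
        return split_before(HEADING, text)
    if len(SPEAKER_TURN.findall(text)) >= 2:
        return split_before(SPEAKER_TURN, text)
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if len(paragraphs) > 1:
        return paragraphs
    # Fallback: windows of three sentences when no stronger boundary exists.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(sentences[i:i + 3]) for i in range(0, len(sentences), 3) if sentences[i]]
```

For example, `chunk("Smith: We expect friction.\nJones: I disagree.")` yields one chunk per speaker turn, so each extracted claim stays attached to the person who made it.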

What gets extracted per chunk

Each chunk is processed into structured JSON containing:

Entities: people, organizations, products
Concepts: topics and definitions
Facts: claims with certainty markers and provenance

Here is the kind of structure this stage is designed to produce:

{
  "entities": [
    {
      "name": "John Smith",
      "type": "person",
      "aliases": ["J. Smith"]
    }
  ],
  "concepts": [
    {
      "name": "merger regulation",
      "definition": "Regulatory constraints affecting merger approval."
    }
  ],
  "facts": [
    {
      "claim": "Smith argued that the merger would face regulatory friction.",
      "certainty": "stated",
      "source_paragraph": "p12",
      "source_id": "q3-call-2026-04"
    }
  ]
}

The exact schema can vary by domain, but the point is the same: ingestion produces structured evidence, not just searchable text.
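
One way to keep that evidence trustworthy is to validate the extractor's JSON before it touches the wiki. The sketch below mirrors the `facts` fields shown above; the `Fact` class, the allowed certainty values (`stated` and `inferred`), and exact-match deduplication are assumptions for illustration.

```python
import json
from dataclasses import dataclass

ALLOWED_CERTAINTY = {"stated", "inferred"}  # assumed vocabulary
REQUIRED = ("claim", "certainty", "source_id", "source_paragraph")


@dataclass(frozen=True)
class Fact:
    claim: str
    certainty: str
    source_id: str
    source_paragraph: str


def parse_facts(raw_json: str) -> list[Fact]:
    """Parse extractor output; raises KeyError on a missing field."""
    payload = json.loads(raw_json)
    facts = []
    for item in payload.get("facts", []):
        fact = Fact(**{key: item[key] for key in REQUIRED})
        if fact.certainty not in ALLOWED_CERTAINTY:
            raise ValueError(f"unknown certainty: {fact.certainty!r}")
        facts.append(fact)
    return facts


def dedupe(facts: list[Fact]) -> list[Fact]:
    """Drop exact repeats (same claim and provenance), keeping first-seen order."""
    return list(dict.fromkeys(facts))
```

Because `Fact` is a frozen dataclass it is hashable, which is what lets `dict.fromkeys` remove duplicates while preserving order. Near-duplicate claims with different wording would need a fuzzier comparison than this.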

A Compact End-to-End Example

This is the shortest example I have found that shows the full idea.

1) Source excerpt

[Source: raw/transcripts/q3-call-2026-04.md]

Paragraph p12:
"John Smith said the merger would likely face regulatory friction in the EU,
but still described the deal as strategically necessary."

2) Extraction output

{
  "entities": [
    { "name": "John Smith", "type": "person" },
    { "name": "EU", "type": "organization" }
  ],
  "concepts": [
    {
      "name": "merger regulation",
      "definition": "Regulatory constraints that may affect merger approval."
    }
  ],
  "facts": [
    {
      "claim": "John Smith said the merger would likely face regulatory friction in the EU.",
      "certainty": "stated",
      "source_id": "q3-call-2026-04",
      "source_paragraph": "p12"
    },
    {
      "claim": "John Smith described the deal as strategically necessary.",
      "certainty": "stated",
      "source_id": "q3-call-2026-04",
      "source_paragraph": "p12"
    }
  ]
}

3) Wiki page update

# John Smith

tags: [person, executive]

## Positions

- Smith said the merger would likely face regulatory friction in the EU.  
  Source: [[q3-call-2026-04#p12]]

- Smith described the deal as strategically necessary.  
  Source: [[q3-call-2026-04#p12]]

## Related

- [[merger regulation]]
- [[EU]]

4) Query response

If I later ask:

What did Smith say about the merger in the Q3 call?

The system can answer from the wiki first:

John Smith said the merger would likely face regulatory friction in the EU,
while still describing the deal as strategically necessary.

Citations:
- [[John Smith]]
- [[q3-call-2026-04#p12]]

That is the compounding behavior in miniature: source text becomes structured evidence, evidence updates a persistent page, and later questions hit the page instead of forcing the model to rediscover the same relationship from raw text.
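
Step 3 is where persistence happens, so here is a rough sketch of a page update that is safe to repeat. `EntityPage` is a hypothetical in-memory model, and the rendered layout simply copies the example page above; the real project edits pages with LLM assistance rather than a fixed template.

```python
from dataclasses import dataclass, field


@dataclass
class EntityPage:
    title: str
    tags: list[str]
    positions: list[tuple[str, str]] = field(default_factory=list)  # (claim, anchor)
    related: list[str] = field(default_factory=list)

    def add_position(self, claim: str, source_id: str, paragraph: str) -> bool:
        """Record a cited claim; return False if the same entry already exists."""
        anchor = f"{source_id}#{paragraph}"
        if (claim, anchor) in self.positions:
            return False
        self.positions.append((claim, anchor))
        return True

    def render(self) -> str:
        """Serialize the page to markdown in the layout used in this post."""
        lines = [f"# {self.title}", "", f"tags: [{', '.join(self.tags)}]", "", "## Positions", ""]
        for claim, anchor in self.positions:
            lines += [f"- {claim}  ", f"  Source: [[{anchor}]]", ""]
        lines += ["## Related", ""]
        lines += [f"- [[{name}]]" for name in self.related]
        return "\n".join(lines) + "\n"


page = EntityPage("John Smith", ["person", "executive"], related=["merger regulation", "EU"])
added = page.add_position("Smith described the deal as strategically necessary.", "q3-call-2026-04", "p12")
again = page.add_position("Smith described the deal as strategically necessary.", "q3-call-2026-04", "p12")
markdown = page.render()
```

The second call returns `False`, so re-ingesting the same transcript does not duplicate the claim on the page.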

How Contradiction Detection Works

This is the part I had to be careful not to overstate.

I do not treat contradiction detection as perfect. In my implementation, it is a review-oriented mechanism, not a guarantee of truth.

At ingestion time, new facts are compared against existing page facts with a mix of:

entity matching
claim-type matching from the schema
provenance checks
an LLM judgment step that labels a new fact as supports, updates, conflicts with, or unclear

When the system sees a likely conflict, it does not auto-resolve it into one canonical statement. It adds both claims to the relevant page and marks the conflict for review.

A simplified page fragment might look like this:

## Open conflicts

- Smith said the merger was strategically necessary.  
  Source: [[q3-call-2026-04#p12]]

- Smith said the merger should be delayed until regulatory conditions improve.  
  Source: [[investor-interview-2026-05#p4]]

Status: needs review

So when I say contradictions are “flagged,” I mean candidate contradictions are surfaced and preserved instead of being silently smoothed over.
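
The flow above can be sketched as cheap filters followed by a pluggable judgment step. In this sketch the `judge` callable stands in for the LLM call and is expected to return one of `supports`, `updates`, `conflicts`, or `unclear`; the `Claim` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Claim:
    entity: str
    claim_type: str  # e.g. "position"; assumed to come from SCHEMA.md
    text: str
    anchor: str      # e.g. "q3-call-2026-04#p12"


def find_conflicts(
    new: Claim, existing: list[Claim], judge: Callable[[str, str], str]
) -> list[tuple[Claim, Claim]]:
    """Return (existing, new) pairs that the judge labels as conflicting."""
    flagged = []
    for old in existing:
        # Cheap filters first: same entity, same claim type, different passage.
        if old.entity != new.entity or old.claim_type != new.claim_type:
            continue
        if old.anchor == new.anchor:
            continue
        if judge(old.text, new.text) == "conflicts":
            flagged.append((old, new))
    return flagged


def render_conflict(old: Claim, new: Claim) -> str:
    """Keep both claims on the page and leave resolution to a human reviewer."""
    lines = ["## Open conflicts", ""]
    for claim in (old, new):
        lines += [f"- {claim.text}  ", f"  Source: [[{claim.anchor}]]", ""]
    lines.append("Status: needs review")
    return "\n".join(lines)
```

Filtering by entity and claim type before calling the judge keeps the number of model calls proportional to the facts on one page, not to the whole wiki.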

How Querying Works

Once the wiki exists, queries become much lighter.

Instead of searching raw chunks across the entire corpus, the system searches the structured wiki first. In many cases, that means keyword, title, and tag-based search, with no embeddings required for the first pass.
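
As an illustration of an embedding-free first pass, the sketch below scores pages by token overlap with the query. The `Page` class and the 3/2/1 weights for title, tag, and body hits are arbitrary assumptions, not values taken from the project.

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    tags: list[str]
    body: str


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(page: Page, query: str) -> int:
    """Weight title hits above tag hits above body hits."""
    q = tokens(query)
    return (
        3 * len(q & tokens(page.title))
        + 2 * len(q & tokens(" ".join(page.tags)))
        + len(q & tokens(page.body))
    )


def search(pages: list[Page], query: str, limit: int = 3) -> list[Page]:
    """Return up to `limit` pages with a non-zero score, best first."""
    scored = [(score(p, query), p) for p in pages]
    hits = [sp for sp in scored if sp[0] > 0]
    hits.sort(key=lambda sp: sp[0], reverse=True)
    return [p for _, p in hits[:limit]]
```

This version does no stopword filtering, so common words such as "the" inflate body scores; a real implementation would likely drop them or use a proper ranking function.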

The query flow looks like this:

Search the wiki
Select a strategy using a fast model call
Execute the strategy with the appropriate model
Return an answer with [[Page Name]] or source-anchor citations

The strategy selection matters because not every question needs the same level of work.

Query strategies

Lookup

For a single fact from one page.

Synthesis

For compare/contrast tasks across multiple pages or sources.

Deep-dive

For more extensive analysis across a small set of selected pages plus raw-source verification if needed.

In practice, query cost tracks wiki size and the selected pages much more closely than the total raw document count. It is not literally constant, but it scales more gently than repeatedly searching and prompting over the full corpus.
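
Strategy selection can be sketched as a model label with a rule-based fallback. Everything here is an assumption for illustration: the `classify` callable stands in for the fast model call, and the keyword lists are placeholders rather than the project's routing rules.

```python
from enum import Enum
from typing import Callable, Optional


class Strategy(Enum):
    LOOKUP = "lookup"
    SYNTHESIS = "synthesis"
    DEEP_DIVE = "deep-dive"


def heuristic_strategy(question: str, n_pages: int) -> Strategy:
    """Rule-based fallback used when no usable model label is available."""
    q = question.lower()
    if any(w in q for w in ("compare", "contrast", "versus", " vs ", "differ")):
        return Strategy.SYNTHESIS
    if any(w in q for w in ("analyze", "analyse", "in depth")):
        return Strategy.DEEP_DIVE
    return Strategy.LOOKUP if n_pages <= 1 else Strategy.SYNTHESIS


def select_strategy(
    question: str, n_pages: int, classify: Optional[Callable[[str], str]] = None
) -> Strategy:
    """Prefer the model's label; fall back to the heuristic on unknown output."""
    if classify is not None:
        try:
            return Strategy(classify(question).strip().lower())
        except ValueError:
            pass  # the model returned something outside the three strategies
    return heuristic_strategy(question, n_pages)
```

Constraining the model's answer to a small enum keeps a malformed label from silently selecting an expensive path.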

Where raw retrieval still fits

This is not an anti-RAG purity test. I still use raw-document retrieval in edge cases:

cold-start questions before enough wiki structure exists
verification against the underlying source text
cases where a page summary looks incomplete or stale
exploratory search across new material before ingestion rules are tuned

For me, the best framing is not “replace RAG everywhere.” It is “stop making raw retrieval do all the memory work.”

Why This Beats Plain RAG for Research

Here is the comparison that mattered in practice:

| Dimension | Plain chunk-retrieval RAG | Compounding Wiki |
| --- | --- | --- |
| Knowledge growth | Re-synthesizes per query | Accumulates across sources |
| Cross-referencing | Implicit (similar chunks) | Explicit (entity pages + wikilinks) |
| Query cost trend | Usually grows with retrieval/ranking complexity | Tied more to selected pages than total corpus |
| Provenance | Often file/chunk-level | Page and paragraph-anchor level |
| Contradictions | Usually handled inside answer generation | Surfaced as reviewable conflicts |
| Human-readable | Often opaque to humans | Plain markdown |
| Edit maintenance | Often requires re-indexing | Often localized to page updates |

A few of these deserve emphasis.

Provenance becomes operational, not cosmetic

In many RAG systems, “citation” means attaching a filename or chunk excerpt to an answer. That is useful, but it is not enough for research-quality traceability.

In the wiki model, claims can point back to specific paragraphs, and disagreements can remain visible instead of being collapsed into one model-generated summary.

Contradictions stop disappearing

One of the worst behaviors in AI research pipelines is silent reconciliation. If two sources disagree, the model often smooths them into a single answer.

By maintaining persistent pages and structured fact updates, likely conflicts can be surfaced and flagged instead of overwritten. That changes the system from “helpful summarizer” into something closer to a research assistant.

Edits become cheaper

In a vector-centric workflow, changing a source often means re-embedding and re-indexing. In this setup, a targeted edit can mean updating one source, recomputing only the affected structures, and revising the relevant pages.

That is not free, but it is usually more localized.
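
Because citations are plain wikilinks, finding the pages affected by a changed source can be a text scan. A minimal sketch, assuming pages are held as a title-to-markdown dictionary; `pages_citing` is a hypothetical helper name.

```python
import re


def pages_citing(pages: dict[str, str], source_id: str) -> list[str]:
    """Titles of pages that cite the source, with or without a paragraph anchor."""
    pattern = re.compile(
        r"\[\[" + re.escape(source_id) + r"(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]"
    )
    return sorted(title for title, body in pages.items() if pattern.search(body))


pages = {
    "John Smith": "Source: [[q3-call-2026-04#p12]]",
    "EU": "Source: [[q3-call-2026-04]]",
    "Jane Jones": "Source: [[investor-interview-2026-05#p4]]",
}
affected = pages_citing(pages, "q3-call-2026-04")
```

Only the pages in `affected` need to be revised after the Q3 call transcript changes; the rest of the wiki is left alone.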

Practical Performance Notes

The original draft of this post was too absolute here, so let me be precise.

The exact latency and cost depend on:

the model provider
whether extraction uses a small local model or an API model
corpus size and document shape
how aggressively you verify against raw sources
the size of the wiki pages being synthesized at query time

In my own testing, the main gain was not “zero scaling cost.” It was moving a large share of the work to ingestion and keeping many queries small because they hit pre-structured pages instead of raw corpora.

So when I say the system is more stable at scale, I mean:

qualitatively more consistent on repeated research questions
operationally easier to trace and debug
economically better when the same corpus gets queried many times

If you publish hard benchmark numbers, include your corpus, models, prompt strategy, and the query workload you measured. Without that context, qualitative claims are more honest.

What This Enables

The compounding wiki model changes what kinds of workflows are practical.

Multi-source research

You can ask questions across large document sets and get answers tied back to the exact paragraph that supports them.

If you ask:

What did Smith say about the merger in the Q3 call?

the goal is not to retrieve a vaguely relevant transcript chunk. The goal is to resolve “Smith,” locate the relevant page or source anchor, and return the supporting paragraph.

Project-specific memory for teams and writers

Each new interview, paper, or article enriches what the system already knows.

If the third transcript creates a [[John Smith]] page, the fiftieth transcript should update that page, not create one more disconnected chunk.

Better structure as the corpus matures

As the graph gets denser, the wiki tends to become more useful:

more entity resolution
better cross-referencing
richer concept pages
stronger comparison pages

That is the compounding effect plain RAG does not naturally provide on its own.

Practical Packaging and Tooling

The system is open source here:

https://github.com/pksw4u/rlm-wiki

At the time of writing, I would describe it as a practical, evolving project rather than a finished universal framework.

It works in several modes:

As a Hermes Agent skill by dropping SKILL.md into your skills directory
As a standalone Python library via pip install rlm-wiki
As a native Obsidian vault because the wiki is plain markdown
Alongside llm-wiki, using the same format so they can coexist

If you are not familiar with those names:

Hermes Agent is the agent runtime this project was designed to plug into.
llm-wiki is a related markdown-wiki workflow that shares a compatible file format.

Because the wiki is just files, the data is not trapped inside a database that only one runtime understands.

Minimal usage example

pip install rlm-wiki
rlm-wiki ingest ./docs/q3-call.pdf --project ./wiki
rlm-wiki ask "What did Smith say about the merger?" --project ./wiki

Example entity page format

# John Smith

tags: [person]
aliases: [J. Smith]

## Summary
Executive involved in merger discussions.

## Claims
- Smith argued the merger would face regulatory friction in the EU.  
  Source: [[q3-call-2026-04#p12]]

## Related
- [[merger regulation]]
- [[EU]]

Here is the operational point I care about most:

The wiki is just markdown files. No required database, no vendor lock-in, and no mandatory embedding index to keep the core state readable.

That means your knowledge base remains usable even if your model stack changes later.

The Deeper Point: Search Is Not Understanding

A lot of AI tooling still treats every question as a fresh event:

search
retrieve
answer
forget

That is a useful pattern for document access. It is a poor pattern for research.

Research is cumulative. If you are trying to understand Smith’s position, the answer usually does not live in one chunk. It emerges from multiple interviews, repeated claims, revisions over time, and contradictions with other people’s statements.

A system that compounds can do things a query-time retrieval stack struggles with:

resolve some relationships once instead of rediscovering them repeatedly
preserve explicit links across sources
keep provenance attached to claims
improve existing pages as new evidence appears

That is not just a better implementation detail. It is a different idea of what an AI knowledge system should be.

Key Takeaways

RAG is good for retrieval, but plain chunk-retrieval RAG is weak for cumulative research because it treats many questions as fresh synthesis tasks.
Context rot is a real scaling problem in long-running AI research workflows, not just a context-window marketing issue.
A markdown wiki can serve as persistent external state, replacing vector-only memory with readable, editable files.
Ingestion-time extraction changes the economics by doing the hard structuring work once instead of on every query.
RLM-style bounded operations keep the working set small even as document collections grow.
Hybrid systems are often the practical answer: persistent wiki first, raw retrieval when needed.

Conclusion

What I ended up building was not “RAG, but better.” It was a different architecture with a different goal.

RAG helps models find relevant text. A compounding wiki helps a system build and preserve relationships, provenance, and working understanding over time. That distinction starts to matter once your corpus gets large, your questions become comparative, and your tolerance for vague synthesis drops.

If you work with research-heavy AI workflows, this is the assumption I would revisit: maybe the model should not be responsible for reconstructing knowledge from scratch on every question. Maybe the system should already know how your sources relate before the question is even asked.

That is the promise of a compounding knowledge base. The more you feed it, the more useful its prior structure becomes — not because the model got bigger, but because the system got better at remembering.