Abuzar Gore

Posted on May 22

LLM-Wiki: Multi-Agent Memory Without RAG

#ai #agents #llm #llmwiki

How three AI agents can collaborate on a complex task by sharing a folder of markdown files — and nothing else.

Three agents at work. They never call each other. The wiki — eight markdown files visible in the center panel — is their only memory.

I just built a small project that demonstrates a powerful pattern: a shared markdown wiki can be the only memory channel between multiple LLM agents. No vector database. No embeddings. No JSON state-dump in every prompt. Just files.

The pattern — dubbed "LLM-Wiki" by Andrej Karpathy in an April 2026 gist — is small enough to explain in a blog post and big enough to replace most agentic-memory infrastructure people are bolting onto LLM apps today. Here's what it is and what makes it actually work.

What is LLM-Wiki?

The pattern at a glance: three-layer architecture, three core operations, wiki structure, agent flow, and the tools every agent uses.

Karpathy's pitch in one line: memory should be synthesis, not retrieval.

Instead of giving an agent a vector index and re-searching raw documents on every query, you have the LLM distill what it learns into structured markdown — once, at write time. From then on, agents read the distilled markdown directly. The wiki is the memory.

The original pattern has three layers:

Sources — raw inputs (immutable)
Wiki — LLM-maintained markdown files, cross-linked, evolving
Schema — a file that tells agents how to read and write the wiki

The wiki layer is what matters for multi-agent work. When several specialized agents share one wiki, the wiki becomes the team's institutional memory and the workspace they pass work through. They never message each other. They edit a shared document. The next agent reads it.

That's it. That's the pattern.

The problem this replaces

There are two usual approaches to multi-agent memory:

1. Stuff everything in the prompt. State carries forward as a JSON blob that grows each turn. Token bloat. Contradictions hide inside it. Humans can't read it. Old context never decays.

2. Bolt on RAG. Embed every document chunk-by-chunk. Query-time similarity search. The same fact retrieved three different ways. The same chunk weighted three times. No synthesis until the agent assembles it on the fly, every query. And every update requires re-embedding.

Both approaches work, and both fail in ways that are built into their design. The wiki pattern moves the work upstream: synthesis happens once, when an agent decides to write a section. From then on, every read is a cheap markdown fetch.

The pattern

The full architecture. User submits a task. Pipeline runs three specialist agents in order. Each agent reads and writes a shared wiki of 8 markdown files. Human can read or edit the same files.

Three layers in the project:

Wiki — eight markdown files on disk. Each has a YAML frontmatter (tag, version, updated_by, updated_at) and a body. Snapshot sections (full rewrite on update): identity, vision, glossary, architecture, file-index. Append-only logs: decisions, open-questions, handoffs.

Tools — five functions the agent can call:

list_sections(): get the catalog (metadata only, cheap)
read_section(tag): fetch one section's body
update_section(tag, body): full rewrite (snapshot sections only)
append_to_section(tag, line): add a log entry (append-only sections only)
handoff_to(next_agent, summary): end the turn

Pipeline — hardcoded sequence: PM → Architect → Backend → done. Forward-only, capped.
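To make the file layout concrete, here is a minimal sketch of how one wiki file with the frontmatter fields listed above (tag, version, updated_by, updated_at) might be written and read back. The `SectionFile` class and the `render`/`parse` helpers are hypothetical names for illustration, and the parser only handles flat `key: value` lines rather than full YAML; the project's actual serializer is not shown in this post.

```python
from dataclasses import dataclass


@dataclass
class SectionFile:
    """Hypothetical in-memory form of one wiki file."""
    tag: str
    version: int
    updated_by: str
    updated_at: str  # ISO-8601 timestamp kept as plain text
    body: str


def render(section: SectionFile) -> str:
    """Serialize a section as frontmatter followed by the markdown body."""
    return (
        "---\n"
        f"tag: {section.tag}\n"
        f"version: {section.version}\n"
        f"updated_by: {section.updated_by}\n"
        f"updated_at: {section.updated_at}\n"
        "---\n"
        f"{section.body}"
    )


def parse(text: str) -> SectionFile:
    """Parse the flat `key: value` frontmatter produced by render()."""
    if not text.startswith("---\n"):
        raise ValueError("missing frontmatter")
    header, sep, body = text[4:].partition("\n---\n")
    if not sep:
        raise ValueError("unterminated frontmatter")
    fields = {}
    for line in header.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return SectionFile(
        tag=fields["tag"],
        version=int(fields["version"]),
        updated_by=fields["updated_by"],
        updated_at=fields["updated_at"],
        body=body,
    )
```

Because the format is plain text, a human can open the same file, edit the body, and save it without any tooling.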

The whole storage layer hides behind a single Protocol so the project's filesystem adapter can be swapped for SQL later without touching agent code:

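from typing import Protocol

# Section and SectionMeta are the project's own data types (not shown here).
class WikiPort(Protocol):
    # Catalog only: metadata for every section, no bodies.
    async def list_sections(self) -> list[SectionMeta]: ...
    async def read_section(self, tag: str) -> Section: ...
    # Full rewrite; expected_version is the version the caller last read.
    async def update_section(self, tag: str, content: str,
                             expected_version: int, agent: str) -> Section: ...
    async def append_to_section(self, tag: str, line: str, agent: str) -> Section: ...
    async def snapshot(self) -> dict[str, Section]: ...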

That's the entire contract between the engine and storage. Five methods.
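As an illustration of why the port is swappable, here is a rough sketch of a dict-backed adapter with the same five methods. It is not the project's filesystem adapter: the `Section` and `SectionMeta` shapes, the `VersionConflict` exception, and the `APPEND_ONLY` set are assumptions made for this example.

```python
import asyncio
from dataclasses import dataclass

# Log sections named in the post; everything else is treated as a snapshot.
APPEND_ONLY = {"decisions", "open-questions", "handoffs"}


@dataclass(frozen=True)
class Section:
    """Hypothetical section record; the project's real type is not shown."""
    tag: str
    body: str
    version: int
    updated_by: str


@dataclass(frozen=True)
class SectionMeta:
    """Hypothetical catalog entry: metadata only, no body."""
    tag: str
    version: int
    updated_by: str


class VersionConflict(Exception):
    """Raised when a writer's expected_version is stale."""


class InMemoryWiki:
    """Dict-backed adapter with the same five methods as WikiPort."""

    def __init__(self, tags: list[str]) -> None:
        self._sections = {t: Section(t, "", 0, "system") for t in tags}

    async def list_sections(self) -> list[SectionMeta]:
        return [SectionMeta(s.tag, s.version, s.updated_by)
                for s in self._sections.values()]

    async def read_section(self, tag: str) -> Section:
        return self._sections[tag]

    async def update_section(self, tag: str, content: str,
                             expected_version: int, agent: str) -> Section:
        if tag in APPEND_ONLY:
            raise ValueError(f"{tag!r} is append-only")
        current = self._sections[tag]
        if current.version != expected_version:
            raise VersionConflict(f"{tag!r} is at version {current.version}")
        updated = Section(tag, content, current.version + 1, agent)
        self._sections[tag] = updated
        return updated

    async def append_to_section(self, tag: str, line: str, agent: str) -> Section:
        if tag not in APPEND_ONLY:
            raise ValueError(f"{tag!r} is a snapshot section")
        current = self._sections[tag]
        body = current.body + line.rstrip("\n") + "\n"
        updated = Section(tag, body, current.version + 1, agent)
        self._sections[tag] = updated
        return updated

    async def snapshot(self) -> dict[str, Section]:
        return dict(self._sections)


async def demo() -> dict[str, Section]:
    wiki = InMemoryWiki(["vision", "decisions"])
    vision = await wiki.read_section("vision")
    await wiki.update_section("vision", "Ship a small wiki.", vision.version, "pm")
    await wiki.append_to_section("decisions", "Use markdown files.", "pm")
    return await wiki.snapshot()
```

An adapter like this is also handy in tests: agent code that only talks to the port cannot tell it apart from the filesystem version.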

What an agent's turn looks like

Each agent box turns amber when running, green when done. The five dots underneath each agent are the per-turn loop: orient, gather, think, write, hand off.

Every agent runs the same five-step loop:

Orient. Call list_sections() — returns tags, titles, versions, last-author, word count. No bodies. Just the catalog.
Gather. Pick the 2–4 sections this turn actually needs. Call read_section(tag) for each. Don't read everything.
Think. The model now has the user task + the previous agent's handoff summary + the relevant wiki content. Reason.
Write. Call update_section for full rewrites (e.g. vision). Call append_to_section for log entries (e.g. decisions). One write or eight, the agent decides.
Hand off. Call handoff_to(next_agent, summary). Summary is mandatory and 1–3 sentences. It biases what the next agent reads first.
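The Orient step works because the catalog is small. A sketch of how such a catalog could be built from stored sections follows; the `CatalogEntry` type, the dict layout of `sections`, and the rule of taking the title from the first `# ` heading are all assumptions for illustration, not the project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog row: everything except the body."""
    tag: str
    title: str
    version: int
    last_author: str
    word_count: int


def first_heading(body: str, fallback: str) -> str:
    """Use the first level-one markdown heading as the title, if any."""
    for line in body.splitlines():
        if line.startswith("# "):
            return line[2:].strip()
    return fallback


def build_catalog(sections: dict[str, dict]) -> list[CatalogEntry]:
    """Summarize each section without returning any body text."""
    catalog = []
    for tag, data in sorted(sections.items()):
        body = data["body"]
        catalog.append(
            CatalogEntry(
                tag=tag,
                title=first_heading(body, fallback=tag),
                version=data["version"],
                last_author=data["updated_by"],
                # Whitespace-separated tokens; markdown markers count too.
                word_count=len(body.split()),
            )
        )
    return catalog
```

The word count gives the model a rough idea of what each read will cost before it commits to the Gather step.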

The agent's output is not prose. Its text reply is a status line shown in the UI ("Wrote vision and glossary, recorded 3 decisions"). The real output lives in the wiki. This inversion is what makes the pattern work: every agent's deliverable is a diff against shared state, not a chat message.

Three things that make it actually work

These are the load-bearing parts. Without them, the pattern degrades into chaos.

1. Per-role permissions in the tool wrapper

PM should not be able to overwrite architecture. Architect should not be able to overwrite vision. The tool wrapper enforces this — the LLM sees a structured error and adapts:

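async def update_section(ctx: RunContext[RunDeps], *, tag: str, new_content: str):
    # Look up the calling agent's role; an unknown role is denied instead of crashing.
    role = ROLE_CONFIGS.get(ctx.deps.agent_role)
    if role is None:
        return {"status": "denied", "error": f"unknown agent role {ctx.deps.agent_role!r}"}
    # Snapshot sections can only be rewritten by the roles allowed to own them.
    if tag not in role.snapshot_writable:
        return {
            "status": "denied",
            "error": (
                f"agent {role.name!r} cannot update section {tag!r}. "
                f"allowed: {sorted(role.snapshot_writable)!r}"
            ),
        }
    # ... read current version, write, return ...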

The LLM reads the error in plain English, picks a different action, retries. No exception crashes the agent. Permissions become a guardrail instead of a wall.
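For readers who want to try the permission check in isolation, here is a self-contained sketch. The `RoleConfig` shape and the specific section assignments in `ROLE_CONFIGS` are assumptions for this example; the post only states that PM cannot overwrite architecture and Architect cannot overwrite vision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleConfig:
    """Hypothetical role record: which snapshot sections a role may rewrite."""
    name: str
    snapshot_writable: frozenset


# Illustrative assignments only; the project's real table is not shown.
ROLE_CONFIGS = {
    "pm": RoleConfig("pm", frozenset({"identity", "vision", "glossary"})),
    "architect": RoleConfig("architect", frozenset({"architecture", "file-index"})),
}


def check_snapshot_write(agent_role: str, tag: str) -> dict:
    """Return a structured result the model can read, never an exception."""
    role = ROLE_CONFIGS.get(agent_role)
    if role is None:
        return {"status": "denied", "error": f"unknown agent role {agent_role!r}"}
    if tag not in role.snapshot_writable:
        return {
            "status": "denied",
            "error": (
                f"agent {role.name!r} cannot update section {tag!r}. "
                f"allowed: {sorted(role.snapshot_writable)!r}"
            ),
        }
    return {"status": "ok"}
```

Listing the allowed sections in the error text is the useful part: it tells the model what it can do next, not only what it cannot.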

2. Optimistic locking on writes

Reads return a version number. Writes must declare the version they read. On a mismatch, the tool returns the latest content and asks the agent to re-apply its intent. A human can therefore edit vision.md in a text editor mid-run and save; the agent's next write will either notice and merge or fail safely instead of clobbering the human's edit.

For a sequential pipeline this rarely fires. The cost of having it is two lines of code. The cost of not having it once you scale to parallel agents is a corrupted wiki.
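The version check itself is small. Below is a minimal sketch over a plain dict, assuming each stored section is a dict with `body`, `version`, and `updated_by` keys; the function name `write_if_unchanged` and the shape of the conflict response are hypothetical.

```python
def write_if_unchanged(store: dict, tag: str, new_body: str,
                       expected_version: int, agent: str) -> dict:
    """Write only if the section is still at the version the caller read."""
    current = store[tag]
    if current["version"] != expected_version:
        # Someone else wrote in between: hand back the latest state
        # so the caller can re-apply its change on top of it.
        return {
            "status": "conflict",
            "latest_version": current["version"],
            "latest_body": current["body"],
            "hint": "re-read the latest body, re-apply your change, and retry",
        }
    store[tag] = {
        "body": new_body,
        "version": expected_version + 1,
        "updated_by": agent,
    }
    return {"status": "ok", "version": expected_version + 1}
```

A stale write leaves the store untouched, which is the property that protects a human's mid-run edit.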

3. Forward-only handoff with a max-step cap

Without this, agents loop. We saw it during development: Architect handed back to PM, PM ran again, handed forward to Architect, Architect handed back. Forever. Token budget exhausted, no useful output.

The fix is unglamorous:

HANDOFF_SENTINEL = "handoff recorded, you are done. Do not call more tools."

# In the runtime loop:
if chosen_next in PIPELINE_ORDER:
    target_idx = PIPELINE_ORDER.index(chosen_next)
    if target_idx <= current_idx:
        # Backward / self handoff: force one step forward
        forced = current_idx + 1
        next_agent = PIPELINE_ORDER[forced] if forced < len(PIPELINE_ORDER) else "done"
        current_idx = forced  # advance here too, so a later self-handoff is still caught
    else:
        next_agent = chosen_next
        current_idx = target_idx

Plus a max_steps = len(PIPELINE_ORDER) + 2 cap as the final safety net. Bounded cost no matter what the model decides to do.
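Put together, the forward-only rule and the step cap can be sketched as a small runnable loop. The `resolve_next` and `run_pipeline` names, the lowercase agent names, and the `choose` callback (a stand-in for whatever the model asked for in `handoff_to`) are hypothetical. This sketch also assumes that any target outside the pipeline is forced one step forward, which the snippet above does not specify.

```python
PIPELINE_ORDER = ["pm", "architect", "backend"]
MAX_STEPS = len(PIPELINE_ORDER) + 2  # final safety net on total turns


def resolve_next(current_idx: int, chosen_next: str) -> tuple:
    """Return (next_agent, new_index); the index never moves backward."""
    if chosen_next in PIPELINE_ORDER:
        target_idx = PIPELINE_ORDER.index(chosen_next)
        if target_idx > current_idx:
            return chosen_next, target_idx
    # Backward, self, or unknown target: force exactly one step forward.
    forced = current_idx + 1
    if forced < len(PIPELINE_ORDER):
        return PIPELINE_ORDER[forced], forced
    return "done", forced


def run_pipeline(choose) -> list:
    """Run agents in order; choose(agent) is the handoff target it requested."""
    visited = []
    idx = 0
    for _ in range(MAX_STEPS):
        agent = PIPELINE_ORDER[idx]
        visited.append(agent)
        next_agent, idx = resolve_next(idx, choose(agent))
        if next_agent == "done":
            break
    return visited
```

Even if every agent asks to hand back to the PM, the loop visits each agent once and stops, so the cost stays bounded.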

What this pattern doesn't solve

Scale past ~50 sections. The catalog itself becomes a big read. Add per-section 1-line summaries first, then hierarchy (wiki/architecture/services.md), then eventually a search tool layered on top.
Parallel agents. Single shared wiki + optimistic locking handles light concurrency. For dozens of agents writing simultaneously, you need real conflict resolution.
Agents that over-read. If a model decides to read all 8 sections every turn, your token cost balloons. Cap reads in the tool wrapper, sharpen the prompt, or both.
Dynamic schema. Agents can't create new sections at runtime in this project — the 8 sections are fixed. Intentional: stops the wiki from devolving into chaos. Adding create_section(tag, title, mode) is a 20-line change once you actually need it.
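Capping reads in the tool wrapper, as suggested for agents that over-read, might look like the sketch below. The `ReadBudget` class and its default of four distinct sections per turn (taken from the "2–4 sections" guidance earlier) are assumptions, and the wrapper follows the same structured-error style as the permission check.

```python
class ReadBudget:
    """Wraps section reads and refuses once too many distinct sections are read."""

    def __init__(self, sections: dict, max_reads: int = 4) -> None:
        self._sections = sections
        self._max_reads = max_reads
        self._read_tags = set()

    def read_section(self, tag: str) -> dict:
        if tag not in self._sections:
            return {"status": "error", "error": f"unknown section {tag!r}"}
        # Re-reading a section already fetched this turn does not cost budget.
        over_budget = (
            tag not in self._read_tags
            and len(self._read_tags) >= self._max_reads
        )
        if over_budget:
            return {
                "status": "denied",
                "error": (
                    f"read budget of {self._max_reads} sections used this turn; "
                    f"already read: {sorted(self._read_tags)!r}"
                ),
            }
        self._read_tags.add(tag)
        return {"status": "ok", "tag": tag, "body": self._sections[tag]}
```

Creating a fresh `ReadBudget` at the start of each agent turn resets the count.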

The pattern is strongest for projects with structured, repeatable work that benefits from being human-readable. Software design. Research investigations. Long-running analyses. Anything where you'd want to look at the doc later or have a teammate edit it.

Wrap

The whole memory system here is a folder of markdown files, five tool functions, and three plain-English system prompts. No vector store. No graph database. No four-tier memory consolidation. The discipline is in the interfaces — what each tool returns, who can call what, how a handoff is structured. The infrastructure is what software engineers have used for forty years: text files and grep.

The same engine package will drop into a much larger project I'm working on next, unchanged, because none of it depends on the web API, the UI, or the storage backend. Files + tools + agents. That's the whole thing.
