Beever AI

LLM Wiki vs RAG: a different approach to team-chat memory

Retrieval-augmented generation became the reflex answer for "my LLM needs to know about my data." Chunk it, embed it, retrieve top-k on query, stuff it in the prompt. It works so well on documents that engineers now apply it to everything — PDFs, wikis, code, support tickets, and, increasingly, team chat.

RAG is the wrong tool for chat.

At Beever AI we spent the last few months building an alternative and just released it as open source. Beever Atlas is an LLM Wiki — not a RAG system — that ingests Slack, Discord, Microsoft Teams, Mattermost, and Telegram conversations into a structured, browsable, auto-maintained wiki and knowledge graph. This post explains what an LLM Wiki is, what it can answer that RAG cannot, and the design decisions behind our implementation.

What is an LLM Wiki?

An LLM Wiki is a structured, LLM-maintained knowledge artefact derived from a conversational corpus. It is not a retrieval preprocessing step. It is the artefact itself — browsable by humans, queryable by agents, versioned, and cited back to its sources.

Seven differences worth naming:

  • Primary output. RAG surfaces prompt-time retrieval results. An LLM Wiki produces a standing artefact (the wiki and the knowledge graph).
  • What breaks when retrieval breaks. RAG hallucinates. An LLM Wiki is still readable as-is.
  • Consumed by. RAG: LLMs, mostly. LLM Wiki: humans and LLMs both.
  • Freshness cost. RAG re-indexes on every write. An LLM Wiki re-consolidates affected topics only.
  • Dedup. RAG does it per-query, ranking-dependent. An LLM Wiki does it at extraction time, structurally.
  • Citations. RAG reconciles sources post-hoc and can drift. An LLM Wiki carries citations forward from extraction.
  • Multi-hop questions. RAG is poor at them. An LLM Wiki makes them first-class via the graph.

Karpathy outlined the idea in a widely-shared thread earlier this year. We took it seriously and built a production implementation.

Why RAG falls down on conversations

Chat isn't documents. Four specific failure modes:

Same fact, said dozens of times. Someone announces "we're moving to Postgres on March 15" in #engineering. Twelve people ack. Four people quote it in later threads. Three retrospectives cite it. The same fact now exists as 20+ near-duplicate chunks. Chunk-based retrieval surfaces one somewhat arbitrarily, or several near-duplicates that crowd out unrelated relevant context.
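The crowding effect is easy to reproduce. Below is a toy illustration (not the Atlas codebase): bag-of-words cosine stands in for a real embedding model, and the corpus mixes restatements of one fact with one unrelated-but-relevant message.

```python
# Toy demo: near-duplicate chat messages crowd out a top-k retrieval.
# Bag-of-words cosine is a stand-in for a real embedding model.
import re
from collections import Counter
from math import sqrt

def tokens(s: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9']+", s.lower()))

def cosine(a: str, b: str) -> float:
    va, vb = tokens(a), tokens(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = [
    "we're moving to Postgres on March 15",                # announcement
    "ack, moving to Postgres March 15",                    # near-duplicate ack
    "confirmed: Postgres migration on March 15",           # near-duplicate quote
    "retro: the Postgres move on March 15 went well",      # near-duplicate cite
    "the billing service is blocked on the auth rewrite",  # unrelated but relevant
]

query = "when are we moving to Postgres?"
top3 = sorted(corpus, key=lambda doc: cosine(query, doc), reverse=True)[:3]
# All three top-k slots go to restatements of one fact;
# the billing message never surfaces.
print(top3)
```

With real embeddings the rankings shift, but the shape of the failure — one fact monopolising the context window — is the same.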

Time-ordered, not topic-grouped. In a document, related content sits adjacent. In chat, a feature decision, a bug report, and a lunch plan can be interleaved across five minutes.

Structure lives outside the text. Meaningful metadata is not in the message body: author, thread parent, reactions, platform, mentions, attachments. Embedding only the message body throws away half the signal.

Answer instability. Ask the same question a week apart. Different chunks rank highest depending on recent message volume and minor embedding perturbations. You get subtly different answers — corrosive to user trust.

What an LLM Wiki can answer that RAG can't

This is the part of the design that most people miss when they first see the architecture. A good vector retriever can handle "what did we decide about onboarding?" just fine. The capabilities that separate an LLM Wiki from RAG show up when the question is relational, temporal, or multi-hop:

  • "Who decided we're moving to Postgres, and did anything supersede that decision?"
  • "Which projects does the billing service block, and who owns them?"
  • "Show me every decision Alan announced in Q1 that was later reversed."
  • "Which technologies are members of team A working with that are still flagged experimental?"
  • "What got decided in threads that mention the SOC2 audit?"

Each of these requires walking a relationship — decision supersession, team membership, thread mentions, status attributes — that chunk-based retrieval fundamentally can't express. In Beever Atlas the knowledge graph in Neo4j holds those relationships. The Cypher traversal used in the codebase for decision-chain queries looks like this:

MATCH (d:Entity {type: 'Decision'})
WHERE d.channel_id = $channel_id
   OR EXISTS {
        MATCH (d)-[:MENTIONED_IN]->(ev:Event)
        WHERE ev.channel_id = $channel_id
      }
OPTIONAL MATCH (person:Entity)-[:DECIDED]->(d)
OPTIONAL MATCH (d)-[:SUPERSEDES]->(old:Entity)
OPTIONAL MATCH (newer:Entity)-[:SUPERSEDES]->(d)
WITH d,
     collect(DISTINCT person.name) AS decided_by,
     collect(DISTINCT old.name)    AS supersedes,
     collect(DISTINCT newer.name)  AS superseded_by
RETURN d, decided_by, supersedes, superseded_by
LIMIT $limit

This is the kind of query a Q&A agent's router hands off to the graph path when a question's shape demands it.
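To make the Cypher's shape concrete, here is the same traversal in plain Python over a toy in-memory edge list (illustrative only; the real system runs the query above against Neo4j):

```python
# Toy in-memory version of the decision-chain traversal.
# Edges are (source, rel_type, target) triples.
edges = [
    ("alan", "DECIDED", "postgres-migration"),
    ("postgres-migration", "SUPERSEDES", "mysql-sharding-plan"),
    ("managed-cloud-sql", "SUPERSEDES", "postgres-migration"),
]

def decision_chain(decision: str) -> dict:
    # Mirrors the Cypher's decided_by / supersedes / superseded_by collections.
    return {
        "decided_by":    sorted({s for s, r, t in edges if r == "DECIDED" and t == decision}),
        "supersedes":    sorted({t for s, r, t in edges if r == "SUPERSEDES" and s == decision}),
        "superseded_by": sorted({s for s, r, t in edges if r == "SUPERSEDES" and t == decision}),
    }

print(decision_chain("postgres-migration"))
```

A vector store can return the text of any one of these facts; it cannot return the chain.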

The shape of a fact and an entity

Before showing the pipeline, the data shapes that hold everything together:

Atomic fact — a single self-contained claim with a source message, author, timestamp, and thread context:

{
  "text": "The team is migrating from MySQL to Postgres on March 15",
  "source_message_id": "slack:C08TX:1712500000001100",
  "author": "alan5543",
  "timestamp": "2026-03-01T14:22:00Z",
  "thread_parent": null,
  "entities": [
    {"name": "MySQL", "type": "Technology"},
    {"name": "Postgres", "type": "Technology"}
  ],
  "relationships": [
    {
      "source": "Team",
      "target": "Postgres",
      "type": "MIGRATING_TO",
      "confidence": 0.92,
      "valid_from": "2026-03-01T14:22:00Z",
      "valid_until": null,
      "source_message_id": "slack:C08TX:1712500000001100"
    }
  ]
}

Two details to flag:

  • Every relationship has temporal props: confidence, valid_from, valid_until, source_message_id. This is how we answer "as of when" questions and how later decisions supersede earlier ones without destroying history.
  • Entity scope — some entity types (Person, Technology, Project, Team) merge globally across channels. Others (Decision, Meeting, Artifact) merge only within a channel scope. Scope-aware MERGE prevents two different channels' "Q1 planning decisions" from collapsing into each other, while still letting "Alan" mean the same person everywhere.

String-level dedup is done with Jaro-Winkler similarity (via APOC) inside the scope; semantic dedup is done with embedding cosine on Entity.name_vector.
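A minimal sketch of the scope-aware dedup logic (assumptions: the real system does this with Jaro-Winkler via APOC inside Neo4j; here `difflib`'s ratio stands in as the string-similarity function, and the threshold is illustrative):

```python
# Sketch of scope-aware entity dedup. difflib stands in for Jaro-Winkler.
from difflib import SequenceMatcher

GLOBAL_TYPES = {"Person", "Technology", "Project", "Team"}  # merge across channels

def merge_key(entity_type: str, channel_id: str) -> str:
    # Channel-scoped types carry the channel in their merge key, so two
    # channels' "Q1 planning" decisions never collapse into each other.
    return entity_type if entity_type in GLOBAL_TYPES else f"{entity_type}:{channel_id}"

def dedup(entities: list, threshold: float = 0.9) -> list:
    kept = {}
    for e in entities:
        bucket = kept.setdefault(merge_key(e["type"], e["channel_id"]), [])
        if not any(
            SequenceMatcher(None, e["name"].lower(), k["name"].lower()).ratio() >= threshold
            for k in bucket
        ):
            bucket.append(e)
    return [e for bucket in kept.values() for e in bucket]

entities = [
    {"name": "Postgres",    "type": "Technology", "channel_id": "C1"},
    {"name": "postgres",    "type": "Technology", "channel_id": "C2"},  # merged: global scope
    {"name": "Q1 planning", "type": "Decision",   "channel_id": "C1"},
    {"name": "Q1 planning", "type": "Decision",   "channel_id": "C2"},  # kept: channel scope
]
print(len(dedup(entities)))
```

The two identically-named decisions survive because their merge keys differ; the two spellings of Postgres collapse because Technology is globally scoped.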

How the wiki gets built — six ADK stages

Ingestion runs as a pipeline of Google ADK LlmAgent steps, using Gemini 2.5 Flash for extraction. Six stages:

  1. Preprocessor — normalises platform-specific message shapes into a single NormalizedMessage with author, timestamp, thread context, attachments, platform metadata.
  2. Fact Extractor — pulls atomic facts with channel + author + timestamp attached. Bounded output via MAX_FACTS_PER_MESSAGE.
  3. Entity Extractor — LLM-driven with a flexible type schema (Person / Decision / Project / Technology / Team / Meeting / Artifact / …). Type vocabulary isn't frozen; new types can be added without a migration.
  4. Cross-batch Validator — dedupes entities and relationships across batch boundaries. This is where scope-aware MERGE and Jaro-Winkler similarity live.
  5. Relationship Graph — materialises bidirectional relationships (DECIDED ↔ DECIDED_BY, BLOCKED_BY ↔ BLOCKS, and so on) with the temporal props above.
  6. Persister — transactional-outbox write to Weaviate + Neo4j + MongoDB, with media attribution preserved. The outbox pattern means a partial failure doesn't leave the stores inconsistent.
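The NormalizedMessage the Preprocessor emits might look roughly like this (field names are guesses from the description above, not the actual Atlas schema):

```python
# Hypothetical shape of the Preprocessor's output. Field names are
# assumptions; the real schema lives in the Atlas codebase.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class NormalizedMessage:
    message_id: str                 # e.g. "slack:C08TX:1712500000001100"
    platform: str                   # "slack" | "discord" | "teams" | "mattermost" | "telegram"
    channel_id: str
    author: str
    timestamp: str                  # ISO-8601
    text: str
    thread_parent: Optional[str] = None
    attachments: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)  # reactions, mentions, etc.

msg = NormalizedMessage(
    message_id="slack:C08TX:1712500000001100",
    platform="slack",
    channel_id="C08TX",
    author="alan5543",
    timestamp="2026-03-01T14:22:00Z",
    text="we're moving to Postgres on March 15",
)
```

Everything downstream — fact extraction, entity extraction, persistence — operates on this one shape, so adding a platform means writing one adapter, not touching five stages.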

Periodically, a separate consolidation agent clusters related facts and synthesises them into topic pages — the "wiki" users browse in the dashboard. Consolidation is idempotent and checkpointed: you can re-run it on a subset without rebuilding from scratch.

Dual memory: why Weaviate and Neo4j

Facts live in two stores because no single store wins both query shapes:

  • Weaviate — 3-tier semantic memory (channel-level summaries, topic-level synopses, fact-level detail). Hybrid BM25 + vector retrieval. Good for "what did we decide about X?" — any question whose answer is a semantic lookup over fact text.
  • Neo4j — entities, decisions, episodic links, media attributions. Good for "who decided this and why?" — any question that requires walking a relationship.

A query router picks semantic, graph, or both per question. The router is a small LLM classifier, not a rule engine. Rule engines fail on novel phrasings; a classifier trades a few milliseconds of routing latency for resilience to queries we haven't seen before.

flowchart LR
    Q[Question] --> R[Query Router<br/>LLM classifier]
    R -->|semantic| W[Weaviate<br/>channel / topic / fact]
    R -->|graph| N[Neo4j<br/>entity + rel traversal]
    R -->|both| W
    R -->|both| N
    W --> M[Merge + dedup by fact_id]
    N --> M
    M --> A[Answer agent<br/>ADK SkillToolset]
    A --> S[SSE stream<br/>response + citations]
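The dispatch-and-merge half of that diagram is mechanical once the label exists. A sketch with a stubbed classifier (the real router is an LLM call; the store functions here are placeholders):

```python
# Routing sketch. `classify` stands in for the LLM classifier; the store
# search functions are placeholders for Weaviate and Neo4j queries.
from typing import Callable

def route(question: str, classify: Callable,
          semantic_search: Callable, graph_search: Callable) -> list:
    label = classify(question)             # "semantic" | "graph" | "both"
    hits = []
    if label in ("semantic", "both"):
        hits += semantic_search(question)  # Weaviate hybrid BM25 + vector
    if label in ("graph", "both"):
        hits += graph_search(question)     # Neo4j traversal
    seen, merged = set(), []
    for h in hits:                         # merge + dedup by fact_id
        if h["fact_id"] not in seen:
            seen.add(h["fact_id"])
            merged.append(h)
    return merged

facts = route(
    "who decided we're moving to Postgres?",
    classify=lambda q: "both",
    semantic_search=lambda q: [{"fact_id": "f1", "text": "migration announced"}],
    graph_search=lambda q: [{"fact_id": "f1", "text": "migration announced"},
                            {"fact_id": "f2", "text": "decided by Alan"}],
)
print([f["fact_id"] for f in facts])
```

The fact_id dedup matters because both stores index the same facts; without it, a "both" route would double-cite.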

MongoDB holds the wiki page cache, ingestion state, and the transactional outbox. Redis holds sessions. Nothing interesting happens in either — they're operational plumbing.
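One piece of that plumbing is worth a sketch: the transactional outbox mentioned in the Persister stage. A minimal in-memory version (illustrative, not Atlas's code — SQLite stands in for MongoDB) shows why a crash between store writes can't leave them inconsistent:

```python
# Minimal transactional-outbox sketch. The fact and its pending store-writes
# commit in one transaction; a relay drains the outbox afterwards, so a crash
# mid-write leaves pending rows to retry rather than inconsistent stores.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE facts (id TEXT PRIMARY KEY, body TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, target TEXT, payload TEXT, done INTEGER DEFAULT 0)")

def persist_fact(fact_id: str, body: dict) -> None:
    with db:  # one transaction: fact row + outbox rows commit or roll back together
        db.execute("INSERT INTO facts VALUES (?, ?)", (fact_id, json.dumps(body)))
        for target in ("weaviate", "neo4j"):
            db.execute("INSERT INTO outbox (target, payload) VALUES (?, ?)",
                       (target, json.dumps({"fact_id": fact_id})))

def drain_outbox(writers: dict) -> None:
    rows = db.execute("SELECT id, target, payload FROM outbox WHERE done = 0").fetchall()
    for row_id, target, payload in rows:
        writers[target](json.loads(payload))   # store write must be idempotent
        db.execute("UPDATE outbox SET done = 1 WHERE id = ?", (row_id,))
    db.commit()

written = []
persist_fact("f1", {"text": "moving to Postgres"})
drain_outbox({"weaviate": written.append, "neo4j": written.append})
print(len(written))
```

The price of the pattern is that store writes must be idempotent, since the relay may retry a row it already delivered.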

The MCP surface: same memory, two audiences

The dashboard is one surface. The MCP (Model Context Protocol) server is the other. Sixteen tools are exposed to external agents like Claude Code, Cursor, or any MCP-capable client:

  • search_by_topic, search_channel_facts
  • get_decision_timeline, find_supersessions
  • graph_traverse, resolve_entity, find_mentions
  • list_channels, read_wiki_section
  • …and more for ingestion status, media lookup, and policy.

The meaningful shift here is that the team memory becomes reusable context, not a siloed app. An IDE-resident coding agent can ask "what did the team decide about the auth library?" with the same guarantees the dashboard gives a human — citations back to source messages, scope-aware resolution, permission-aware access (see the roadmap below).

What this costs

Honest accounting: ingestion is slower and more expensive than RAG. You pay the LLM twice — once for extraction, once for consolidation. Using Gemini Flash keeps both bounded. Consolidation runs sparsely (once per topic cluster) and amortises across many messages.

The tradeoff is explicit: you pay more ingestion cost to get a standing artefact instead of ephemeral retrieval results. For team-chat corpora with heavy repetition, high conversational noise, and relational/temporal queries, an LLM Wiki reliably wins on answer quality, citation fidelity, browsability, and the types of question it can even entertain. For already-structured documents, classic RAG is still the right tool.

A concrete query path

Someone asks "who decided we're moving to Postgres, and has anything superseded that decision?":

  1. The router classifies this as both — it needs fact retrieval and graph traversal, since there's a decider and a supersession chain.
  2. Weaviate retrieves the top facts about Postgres migration from the fact tier.
  3. Neo4j runs the decision-chain traversal shown earlier, returning decided_by, supersedes, and superseded_by collections.
  4. Results merge, dedup by fact_id, and pass to the answer agent.
  5. The agent streams back via SSE:

Alan announced the migration from MySQL to Postgres on March 1, 2026 [1], with cutover scheduled for March 15 [1]. The decision was ratified in the March 3 engineering sync [2] and has not been superseded since.

[1] alan5543 in #engineering, 2026-03-01 14:22
[2] alan5543 in #engineering-sync, 2026-03-03 10:00

The citations come straight from the fact records surfaced in steps 2 and 3. There's no separate "find the source" step that can drift from what the agent actually retrieved.
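In other words, the citation block is a pure function of the retrieved fact records. A sketch of that property (illustrative, not the Atlas implementation):

```python
# Citations rendered directly from retrieved fact records: no second
# "find the source" lookup that could disagree with what was retrieved.
def render_citations(facts: list) -> str:
    return "\n".join(
        f'[{i}] {f["author"]} in {f["channel"]}, {f["timestamp"]}'
        for i, f in enumerate(facts, start=1)
    )

facts = [
    {"author": "alan5543", "channel": "#engineering",      "timestamp": "2026-03-01 14:22"},
    {"author": "alan5543", "channel": "#engineering-sync", "timestamp": "2026-03-03 10:00"},
]
print(render_citations(facts))
```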

Where we sit, and what's defensible

Most team-memory products are vector-only — Glean, Notion AI, various OSS "LLM wiki" clones built on a single embedding store. Beever Atlas is the first OSS product we're aware of that puts a real knowledge graph under team-chat memory.

The graph isn't cosmetic. It's load-bearing for a specific class of query — multi-hop, temporal, scope-aware — that vector retrieval fundamentally can't serve. It's also the substrate on which a permission spine can be built: mirroring Slack Enterprise Grid ACLs as graph-level access rules means permissions get enforced at query time, not papered over with app-tier filters. That's on our roadmap and we think it's the right shape for enterprise deployment.

The rest of the stack

The full system runs locally via docker compose:

  • Python backend — FastAPI + Google ADK agents
  • TypeScript bot bridge — Slack / Discord / Teams / Mattermost / Telegram webhooks with platform-specific signature verification
  • React dashboard — wiki view, interactive Cytoscape knowledge graph, streaming Q&A with live citations
  • Weaviate, Neo4j, MongoDB, Redis

Platform credentials (bot tokens) are encrypted at rest with AES-256-GCM. All data stays in databases you control. The app sends no telemetry anywhere — LLM calls go through API keys you configure in your own .env.
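For reference, AES-256-GCM token encryption takes only a few lines. A sketch using the `cryptography` package (the library choice is our assumption; the post only states the algorithm):

```python
# Sketch of bot-token encryption at rest with AES-256-GCM.
# Library choice (`cryptography`) is an assumption, not Atlas's actual code.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # kept in your own secret store
aesgcm = AESGCM(key)

def encrypt_token(token: str) -> bytes:
    nonce = os.urandom(12)                 # 96-bit nonce, unique per encryption
    return nonce + aesgcm.encrypt(nonce, token.encode(), None)

def decrypt_token(blob: bytes) -> str:
    nonce, ciphertext = blob[:12], blob[12:]
    return aesgcm.decrypt(nonce, ciphertext, None).decode()

blob = encrypt_token("xoxb-example-bot-token")
print(decrypt_token(blob))
```

GCM authenticates as well as encrypts, so a tampered ciphertext raises on decrypt rather than yielding a corrupted token.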

Try it

A make demo target brings up the full stack pre-loaded with a public Wikipedia corpus (Ada Lovelace + Python history, CC-BY-SA 3.0). Pre-computed fixtures ship in the repo, so seeding runs without any API keys. Asking questions via the Q&A agent needs a free-tier Gemini API key — Ollama support for fully-local inference is on the roadmap.

git clone https://github.com/Beever-AI/beever-atlas
cd beever-atlas
make demo

Then:

curl -X POST http://localhost:8000/api/channels/demo-wikipedia/ask \
  -H "Authorization: Bearer dev-key-change-me" \
  -H "Content-Type: application/json" \
  -d '{"question":"Who was Ada Lovelace?"}'

You'll see a streaming SSE response with six citations linking back to the source Wikipedia articles.


Links

Beever Atlas is developed by Beever AI Limited in Toronto, Ontario, and released under the Apache 2.0 license. The LLM Wiki concept was inspired by Andrej Karpathy's early-2026 thread on LLM Knowledge Bases; we took the idea seriously and built the implementation.

Contributions welcome. Issues especially — we want to hear the edge cases where a vector-only system would have worked and our graph overhead turned out not to pay for itself.
