Andrej Karpathy's LLM Wiki pattern went viral this month. 5,000+ stars, 3,700 forks, dozens of implementations. The core insight is right: stop re-...
All three gaps map to problems I've hit in production running an autonomous agent across 1,100+ sessions with a Neo4j-backed knowledge graph.
On typed relationships — I went through exactly this progression. Started with flat markdown files. Links were implicit ('this file mentions that file'). After ~500 sessions, retrieval started returning noise because 'related' could mean anything. The fix was typed predicates on graph edges:
supersedes, blocked_by, attempted_and_failed, reactivates_when. The type isn't just metadata — it changes how the agent reasons about the connection. 'Supersedes' means ignore the old node. 'Attempted and failed' means don't retry without new information.

On automated relationship discovery — this is the hardest part. I built a tiered search (recent + reference for fast queries, full archive for deep dives), but the discovery of new relationships still fires most reliably as a side effect of doing real work, not from a dedicated linking pass. When the agent solves a problem, it naturally surfaces which prior knowledge was wrong or outdated. Autonomous linking in isolation tends to hallucinate connections that look plausible but aren't load-bearing.
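Here's a minimal sketch of what those typed predicates can look like on a Neo4j backend (illustrative Note label, id/topic properties, and connection details, not an exact schema):

```python
# Minimal sketch: typed edges in a Neo4j-backed agent memory.
# Labels, property names, and the predicate set are illustrative assumptions.
from neo4j import GraphDatabase

PREDICATES = {"SUPERSEDES", "BLOCKED_BY", "ATTEMPTED_AND_FAILED", "REACTIVATES_WHEN"}

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def link(src_id: str, dst_id: str, predicate: str) -> None:
    """Create a typed edge between two knowledge nodes."""
    if predicate not in PREDICATES:
        raise ValueError(f"unknown predicate: {predicate}")
    # Relationship types can't be parameterized in Cypher, so the predicate is
    # interpolated only after the whitelist check above.
    query = (
        "MATCH (a:Note {id: $src}), (b:Note {id: $dst}) "
        f"MERGE (a)-[:{predicate}]->(b)"
    )
    with driver.session() as session:
        session.run(query, src=src_id, dst=dst_id)

def active_notes(topic: str) -> list[str]:
    """Retrieve notes on a topic, skipping anything that has been superseded."""
    query = (
        "MATCH (n:Note {topic: $topic}) "
        "WHERE NOT ()-[:SUPERSEDES]->(n) "
        "RETURN n.id AS id"
    )
    with driver.session() as session:
        return [r["id"] for r in session.run(query, topic=topic)]
```

The whitelist check doubles as the only place new edge types get introduced, which keeps the predicate vocabulary from sprawling back into "related means anything."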
On persistent graphs across sessions — the jump from flat index files to a queryable graph was the single highest-ROI infrastructure investment. Before: every session was a cold start. After: the agent queries its own history, finds prior decisions, avoids re-running failed experiments. The graph isn't a nice-to-have — it's what makes long-running agent work possible at all.
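Same idea for the cold-start problem, sketched as a "check prior failed attempts before retrying" query. The Session/Approach labels and the reason property on the failure edge are assumptions for illustration:

```python
# Sketch: warm start instead of cold start. Before the agent retries an approach,
# it asks the graph whether a prior session already attempted it and why it failed.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def prior_failures(approach: str) -> list[dict]:
    query = (
        "MATCH (s:Session)-[f:ATTEMPTED_AND_FAILED]->(a:Approach {name: $approach}) "
        "RETURN s.id AS session, f.reason AS reason, f.at AS at "
        "ORDER BY f.at DESC"
    )
    with driver.session() as session:
        return [r.data() for r in session.run(query, approach=approach)]

# Only retry if something has actually changed since the last failure.
failures = prior_failures("switch-embedding-model")
if failures:
    print(f"Attempted before ({len(failures)}x); latest failure: {failures[0]['reason']}")
```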
Curious about the Vault Linker's verification phase — how do you detect false-positive relationships that look semantically valid but aren't actually meaningful?
Thanks for the detailed breakdown. The tiered search approach makes sense, and it mirrors what we've seen too: relationship discovery works better as a byproduct of real work than as a dedicated pass.
On your question about detecting false-positive relationships: honestly, you can't catch them all at verification time. An automated linking pass will always produce some plausible-looking mistakes.
Our approach with Penfield is to treat the graph as a living thing. Once the vault is imported, agents have tools to remove bad connections, add new ones, update memories, and flag contradictions. The graph gets more accurate the more you use it, not less. Verification is an ongoing relationship between the agent and its own memory.
Trying to get the graph perfect on import is a fool's errand. Getting it 90% of the way there and then giving agents the tools to actively maintain it is the way forward.
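To make "tools to remove bad connections, add new ones, flag contradictions" concrete, here's a toy in-memory sketch of that kind of tool surface. It's not our actual API, just the shape of the operations the model gets to call, with every change logged so corrections stay auditable:

```python
# Toy stand-in for an agent-maintained graph. The storage is a plain set of triples;
# the point is the operations exposed to the model (via MCP, function calling, etc.).
from datetime import datetime, timezone

class LivingGraph:
    def __init__(self):
        self.edges: set[tuple[str, str, str]] = set()  # (src, predicate, dst)
        self.audit: list[dict] = []                    # every change is logged

    def _log(self, action: str, **detail) -> None:
        self.audit.append({"action": action, "at": datetime.now(timezone.utc), **detail})

    def connect(self, src: str, predicate: str, dst: str, evidence: str) -> None:
        """Add a relationship discovered as a side effect of real work."""
        self.edges.add((src, predicate, dst))
        self._log("connect", src=src, predicate=predicate, dst=dst, evidence=evidence)

    def disconnect(self, src: str, predicate: str, dst: str, reason: str) -> None:
        """Remove a relationship the agent judged to be a false positive."""
        self.edges.discard((src, predicate, dst))
        self._log("disconnect", src=src, predicate=predicate, dst=dst, reason=reason)

    def flag_contradiction(self, node_a: str, node_b: str, note: str) -> None:
        """Mark two memories as conflicting so a later pass (or the user) can resolve it."""
        self.edges.add((node_a, "CONTRADICTS", node_b))
        self._log("flag_contradiction", a=node_a, b=node_b, note=note)
```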
That 90% import + living graph model makes a lot of sense. Trying to get every relationship right at parse time is fighting the wrong battle — the context that clarifies ambiguous connections only shows up during actual use. The agent catching a contradiction mid-task is a much higher-signal correction than any static verification pass. Does Penfield surface those correction moments back to the user, or does the graph just silently update?
It depends on what LLM you're using, how you prompt it, and what platform you're on. In most setups you can see the MCP tool calls directly. You'll see the agent call disconnect on a relationship it's removing and connect when it creates a new one. If the model supports visible reasoning, it'll usually explain what it's doing there. It may or may not surface that in chat unprompted.
You can also control this. In your system prompt or Penfield custom instructions you can explicitly tell the agent to explain any changes it makes, or even ask permission before modifying the graph. You can also configure the MCP tools themselves: require approval on connect and disconnect rather than always-allow, so nothing changes without you signing off. It's your graph, your rules.
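As a rough illustration of the approval idea at the application layer (not any particular MCP client's permission settings): wrap the mutating tools so nothing runs without an explicit yes. The stdin prompt here is just a stand-in for whatever approval UI your client provides:

```python
# Hypothetical approval gate around graph-mutating tools.
def requires_approval(tool):
    """Wrap a tool so no change happens without an explicit yes."""
    def wrapper(*args, **kwargs):
        print(f"Agent wants to call {tool.__name__} with args={args} kwargs={kwargs}")
        if input("Allow? [y/N] ").strip().lower() != "y":
            return "denied: user rejected the change"
        return tool(*args, **kwargs)
    return wrapper

@requires_approval
def disconnect(src: str, predicate: str, dst: str) -> str:
    # ...remove the edge from your graph backend here...
    return f"removed {src} -[{predicate}]-> {dst}"
```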
We're building a new bulk import layer now. It will get you 90% of the way there, and then the living graph can clean itself up through actual use.
The "living graph cleans itself through use" framing matches what we see — nodes that get traversed stay accurate, the ones nobody queries drift. On the bulk import layer: curious how you handle deduplication when pulling from multiple sources. We ingest from several different data streams and the hardest part is merging entities that appear under different identifiers across sources. Is that something the import layer addresses, or is that left to the agent to resolve during use?
Honest answer: there's no one-size-fits-all.
For concepts and ideas, near-duplicates across sources aren't really a problem. The agent connects "ML" and "machine learning" naturally when you ask about either one. For people and orgs across different identifier schemes, that's a genuinely difficult problem, and we'd rather have two nodes the agent can disambiguate than incorrect merges that corrupt the data.
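To sketch what "two nodes the agent can disambiguate" can look like at import time: propose candidate pairs across sources and record them as tentative links instead of merging. The threshold and the possibly-same-as idea are illustrative assumptions; semantic aliases like "ML" vs "machine learning" still get left to the agent:

```python
# Sketch: link candidates across sources rather than hard-merging at import time.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def propose_links(entities: list[dict], threshold: float = 0.85) -> list[tuple[str, str]]:
    """Pair up entities from different sources that look like the same thing.

    Nothing is merged: each pair just gets a POSSIBLY_SAME_AS edge so the agent
    (or the user) can resolve it later, with full context, during actual use.
    """
    proposals = []
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            if a["source"] == b["source"]:
                continue  # duplicates within one source are a separate problem
            if similarity(a["name"], b["name"]) >= threshold:
                proposals.append((a["id"], b["id"]))
    return proposals

# Example: the same org appearing under different identifiers in two streams.
entities = [
    {"id": "crm:8841", "name": "Acme Corp.", "source": "crm"},
    {"id": "billing:acme-corp", "name": "Acme Corp", "source": "billing"},
]
print(propose_links(entities))  # [('crm:8841', 'billing:acme-corp')]
```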
Curious what entity types are giving you the most trouble across your data streams?
The typed relationships gap is the one that resonated most with me. I maintain a knowledge base for my automation consulting clients using Obsidian, and the flat wikilink graph becomes nearly useless past ~200 notes — everything connects to everything with no hierarchy of importance.
The supersedes type is especially valuable for technical documentation that evolves. When a client's API changes or a workflow gets deprecated, knowing that note B supersedes note A is critical context that plain wikilinks lose completely. Right now I handle this manually with a "status: deprecated" frontmatter field, but that doesn't capture what replaced it.
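A sketch of one way to bridge that gap: if deprecated notes also record what replaced them in frontmatter, a small pass over the vault can emit explicit supersedes pairs instead of just a flag. The superseded_by key is a hypothetical assumption, not something in my current setup:

```python
# Sketch: turn "status: deprecated" plus an assumed superseded_by field
# into explicit (new_note, old_note) supersedes pairs.
import re
from pathlib import Path

FRONTMATTER = re.compile(r"^---\n(.*?)\n---", re.DOTALL)

def read_frontmatter(path: Path) -> dict[str, str]:
    """Very small parser: one 'key: value' per line, no nesting."""
    found = FRONTMATTER.match(path.read_text(encoding="utf-8"))
    if not found:
        return {}
    fields = {}
    for line in found.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

def supersedes_edges(vault: Path) -> list[tuple[str, str]]:
    """Collect (new_note, old_note) pairs from deprecated notes that name a successor."""
    edges = []
    for note in vault.rglob("*.md"):
        meta = read_frontmatter(note)
        if meta.get("status") == "deprecated" and "superseded_by" in meta:
            edges.append((meta["superseded_by"], note.stem))
    return edges
```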
The persistence gap you describe is also the biggest pain point with MCP-based workflows. Every new Claude Code session starts cold, and re-parsing even a well-structured CLAUDE.md takes tokens that could go toward actual work. A persistent graph backend that survives sessions would be a game-changer for long-running projects.
This looks truly excellent. I've wanted a way to add relations to links since Obsidian appeared. I will check it out. Thank you.
I see it requires use of a cloud service. I'm looking for a local, offline-capable solution with no 'call home'. Any chance you will offer that in the future?
I would also consider self-hosted with zero reach-back to other cloud services.
Interesting take.
LLM Wiki patterns explain the structure well, but they don’t really address how systems behave over time. Once you start updating, merging, and reusing entries, consistency and drift become real issues.
Feels like the missing piece isn’t just better structure, but better control over how knowledge evolves.
Looks truly excellent
Really useful breakdown of the gaps. The persistence problem resonates — every new session the agent re-discovers what it already knew. The observability side of this is equally broken: even if you have persistent memory, when something goes wrong mid-chain you still can't tell which tool call caused it or what the agent's state was at that point. That's the gap TraceHawk addresses — MCP-native tracing so you see exactly what happened, not just what was stored. Curious how Penfield handles session replay when debugging a failed chain.
My take is that adding relationship types alone won't solve it when connections and types explode in volume.
What I am doing is contextualizing notes through frontmatter properties and using graph filters and groups to isolate meaningful clusters.
I think plugins like Breadcrumbs (typed directional links) and Dataview (query notes as a database) take it further.
You're right, typed relationships alone don't solve it at scale. That's the point: it's only one step in the process. Linking your vault gets you from flat files to typed connections, but the real work starts when you import that into Penfield. Once it's there, agents have tools to query, prune bad connections, add new ones, and actively maintain the graph as you work. No amount of frontmatter and Dataview queries replaces allowing your agents to reason over and modify the knowledge base.
the underlying problem here is staleness, not structure. wikis rot faster than code. setups that survive long-term have auto-expiry on entries rather than maintenance workflows.
Staleness is a symptom, not the cause. Knowledge doesn't expire on a timer: a conversation from six months ago might be irrelevant today or might be the key context you need tomorrow. Auto-expiry would throw both out equally. The fix is agents that can evaluate whether something is still relevant in context, not a cron job deleting anything older than X days or months.
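A minimal sketch of that in code, with ask_model as a placeholder for whatever LLM call you use; the prompt and return contract are illustrative, not anyone's shipped implementation:

```python
# Sketch: relevance judged against the task at hand, not against the calendar.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Memory:
    id: str
    text: str
    created: datetime

def ask_model(prompt: str) -> str:
    """Stand-in for your LLM call (MCP tool, API client, local model...)."""
    raise NotImplementedError

def still_relevant(memory: Memory, current_task: str) -> bool:
    """Ask the model whether a stored memory still matters for the current task."""
    answer = ask_model(
        "Current task:\n" + current_task + "\n\n"
        "Stored memory (from " + memory.created.date().isoformat() + "):\n"
        + memory.text + "\n\n"
        "Is this memory still relevant and accurate for the task? "
        "Answer YES or NO, then one sentence of justification."
    )
    return answer.strip().upper().startswith("YES")

def retrieval_pass(memories: list[Memory], current_task: str) -> list[Memory]:
    # Return only the memories worth surfacing for this task;
    # the rest stay in the archive untouched, nothing is deleted on age alone.
    return [m for m in memories if still_relevant(m, current_task)]
```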
fair - auto-expiry is blunt. time alone can't judge relevance. evaluation agents make more sense but harder to trust initially.