Jim L
Karpathy's LLM Knowledge Base × SEO: I applied the pattern for 12 months and here's what I learned

On April 3, 2026, Andrej Karpathy posted a short but influential note about using LLMs to build personal knowledge bases. The premise: instead of RAG pipelines and vector databases, you manually clip raw sources into a raw/ folder, let an LLM distill them into structured wiki pages, and query the graph later with your LLM CLI of choice.

No SaaS lock-in. No embeddings. No subscription. Just markdown and an LLM that knows the schema.

I'd been drowning in scattered SEO research for a year — running openaitoolshub.org, an AI tools directory that's gone from DR 0 to DR 30 in 12 months, 126 articles, 130+ earned backlinks. My notes were spread across Notion, Kagi Assistant, local markdown files, a neglected Readwise Reader queue, and a thousand unread tabs. Karpathy's pattern gave me the discipline to consolidate everything into a single Obsidian vault that an LLM could maintain.

This article walks through what I built, the key design decisions, and the one contradiction-preservation trick that changed how I think about personal knowledge bases entirely.

The five-step pattern

Karpathy's original framing was simple:

  1. Set up raw/ — every source you encounter, unedited
  2. Set up wiki/ — structured concept pages the LLM maintains
  3. Distill with an LLM — run a pass where Claude/Codex/etc reads raw sources and updates wiki pages
  4. Cross-link with [[wikilinks]] — let the LLM suggest relationships between concepts
  5. Query the graph with your CLI — ask questions months later, get synthesized answers from the vault

The genius is in step 3 — the LLM does the hard work of synthesis, contradiction detection, and cross-referencing. You do the reading and judgment calls.
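Step 3 is easy to drive by hand, but a small script helps you see what's still waiting to be distilled. Here's a minimal sketch, assuming the `raw/` and `wiki/` layout above and the convention that wiki pages cite raw sources by relative path (the function name and matching logic are mine, not Karpathy's):

```python
from pathlib import Path

def undistilled_sources(vault: Path) -> list[Path]:
    """Return raw/ markdown files not yet cited by any wiki page."""
    # Concatenate every wiki page; a raw source counts as distilled
    # once its vault-relative path appears anywhere in the wiki.
    wiki_text = "\n".join(
        p.read_text(encoding="utf-8") for p in (vault / "wiki").rglob("*.md")
    )
    return [
        src
        for src in sorted((vault / "raw").rglob("*.md"))
        if src.relative_to(vault).as_posix() not in wiki_text
    ]
```

Running this before a distillation pass gives you the queue of sources to hand to the LLM.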

How I adapted it for SEO

SEO is a moving target. What worked in Q4 2024 is wrong by Q2 2025. Google's March 2026 Core Update just rewrote half the playbook. I needed a system that could absorb new evidence and propagate updates without me manually re-reading every page.

My vault structure:

```
seo-obsidian/
├── Home.md                    # glassmorphism dashboard
├── CLAUDE.md                  # LLM operations guide
├── wiki/
│   ├── schema.md              # the concept-page template rulebook
│   ├── concepts/              # 12 SEO concept pages
│   ├── tools/                 # 3 tool profiles
│   ├── people/                # 1 person profile (Karpathy)
│   └── indexes/               # alphabetical catalogs
├── raw/
│   ├── README.md              # explains the three-layer architecture
│   ├── articles/              # long-form sources
│   └── practitioner-notes/    # curated short-form observations
└── maps/
    └── SEO-Domain-Map.canvas  # 21-node mind map
```

Every concept page follows a strict schema: ## TLDR, ## Key Points, ## Details, ## Applied Example, ## Related Concepts, ## Sources. The rigidity felt annoying at first, but it pays off at query time because Claude knows exactly where to look for each piece.

Three design decisions worth discussing

1. Preserve contradictions instead of resolving them

On April 10, Zhang Kai published a 602-prompt study claiming structured content (H2s/bullets/tables) correlates with AI citation. On April 11, Japanese SEO practitioner Suzuki published experiments claiming structured data does NOT help AI understanding.

In a traditional wiki I'd have to pick one. In the Karpathy pattern, both claims live in the vault. The Zhang Kai finding is in the main section of geo-generative-engine-optimization.md. The Suzuki counter-evidence is in a ⚠️ Counter-Evidence callout right below it. When I query the vault with Claude, I get both cited.

This is the single most important insight I took from applying the pattern: honest knowledge > confident answers. The vault is a snapshot of the field's current state of confusion, not an attempt to pretend the confusion doesn't exist.

2. The ripple effect as the compounding mechanism

When I add a new raw source, I don't manually update related concept pages. I tell Claude:

```
$ claude
> Ingest raw/practitioner-notes/zhang-kai-602-prompt-geo-study.md
> following wiki/schema.md. Update all related concepts with the new
> evidence and flag any contradictions.
```

Claude then:

  1. Reads the new source
  2. Decides which of the 12 existing concept pages it affects
  3. Updates each one with the new evidence
  4. Flags contradictions against existing claims
  5. Updates the concept index
  6. Writes a log entry

One source → 5-15 pages updated → all in 45 seconds.

This is what makes it compound. Most note-taking systems are linear (you add, you rarely re-read). This one is multiplicative — every new source makes the whole wiki incrementally smarter.
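The fan-out in step 2 is essentially a backlink lookup over `[[wikilinks]]`. If you want to sanity-check which pages Claude *should* have touched, a rough sketch (the regex and function name are my own approximations, not part of the vault's tooling):

```python
import re
from pathlib import Path

# Capture the link target in [[target]], [[target|alias]], or [[target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def affected_pages(wiki_dir: Path, concept: str) -> set[str]:
    """Stems of wiki pages that link to `concept` and would need a ripple update."""
    hits = set()
    for page in wiki_dir.rglob("*.md"):
        links = {m.strip() for m in WIKILINK.findall(page.read_text(encoding="utf-8"))}
        if concept in links:
            hits.add(page.stem)
    return hits
```

Diffing this set against the LLM's log entry is a cheap way to catch pages the ripple pass missed.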

3. Strict concept-page schema > flexible notes

I experimented with both. Flexible concept pages were easier to write but hell to query. Strict ones were slightly annoying to fill out but let Claude parse them reliably.

The schema:

```markdown
---
aliases: []
tags: []
sources: []
cssclasses: [seo-brain-concept]
---

# Concept Title

## TLDR
One paragraph, 200-250 words. This is what AI engines cite.

## Key Points
5-8 bullet points.

## Details
The main content, 800-1500 words. Can have sub-sections.

## Applied Example
A concrete worked scenario.

## Related Concepts
- [[concept-a]] — why it's related
- [[concept-b]] — why it's related

## Sources
- External URLs
- raw/... paths
```

Every single concept page follows this. It's like a database schema — restrictive, but queryable.
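And like a database schema, it can be linted mechanically. A tiny validator sketch, using the required H2 list straight from the schema above (the code itself is mine):

```python
REQUIRED_SECTIONS = [
    "## TLDR", "## Key Points", "## Details",
    "## Applied Example", "## Related Concepts", "## Sources",
]

def missing_sections(page_text: str) -> list[str]:
    """Return the required H2 headings a concept page lacks, in schema order."""
    headings = {line.strip() for line in page_text.splitlines()}
    return [h for h in REQUIRED_SECTIONS if h not in headings]
```

Run it over `wiki/concepts/*.md` before a query session and any page that drifted from the schema shows up immediately.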

Three concrete SEO insights that came out of the exercise

Insight 1 — Mean AI-cited content length is 1,375 characters

Zhang Kai's study measured the length of every fragment cited by ChatGPT, Perplexity, and Google AI Overview across 602 prompts. The mean was 1,375 characters — roughly 200-250 words, or about 10 sentences.

Practical implication: write TL;DR blocks of 200-250 words near the top of every article. Break the body into H2-bounded sections of 1,000-1,500 characters. That's the GEO sweet spot for citation.
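Those targets are easy to check automatically. A sketch that measures each H2-bounded section and flags anything outside the window (thresholds from the study as reported above; the splitting logic is my assumption about how you'd segment the markdown):

```python
import re

def section_lengths(markdown: str) -> dict[str, int]:
    """Character count of each H2-bounded section body."""
    # re.split with a capture group alternates: [preamble, title1, body1, title2, body2, ...]
    parts = re.split(r"^## +(.+)$", markdown, flags=re.M)
    return {parts[i].strip(): len(parts[i + 1].strip()) for i in range(1, len(parts), 2)}

def off_target_sections(markdown: str, lo: int = 1000, hi: int = 1500) -> list[str]:
    """Section titles whose body falls outside the reported citation sweet spot."""
    return [name for name, n in section_lengths(markdown).items() if not lo <= n <= hi]
```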

Insight 2 — Google's March 2026 Core Update targets 7 specific AI writing patterns

Kill these and your content survives:

  1. "Not just X, but Y" constructions
  2. Em-dash overuse
  3. Triad lists ("powerful, elegant, and fast")
  4. Formulaic openers ("In today's fast-paced world...")
  5. Breathless enthusiasm ("game-changing")
  6. False-authority hedging ("It's worth noting that...")
  7. Broad-to-narrow openings

I went through every article on openaitoolshub.org and stripped these patterns. Traffic stabilized. Articles that failed the update all shared these tells.
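Most of these tells can be pre-screened with plain regexes before a manual pass. The seven pattern names come from the list above; the regexes themselves are my own loose approximations and will miss paraphrases, so treat hits as prompts for review, not verdicts:

```python
import re

AI_TELLS = {
    "not-just-x-but-y": re.compile(r"\bnot just\b.+?\bbut\b", re.I | re.S),
    "formulaic opener": re.compile(r"in today[’']?s fast-paced world", re.I),
    "breathless enthusiasm": re.compile(r"\bgame-chang(?:ing|er)\b", re.I),
    "false-authority hedge": re.compile(r"\bit[’']?s worth noting that\b", re.I),
    "triad list": re.compile(r"\b\w+, \w+, and \w+\b"),
}

def flag_ai_tells(text: str) -> list[str]:
    """Names of the AI-writing tells detected in a draft."""
    return [name for name, rx in AI_TELLS.items() if rx.search(text)]
```

Em-dash overuse and broad-to-narrow openings are structural rather than lexical, so I still check those by eye.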

Insight 3 — Free dofollow directories above DR 55 exist

Conventional wisdom says free directories are DR 0-10 and useless. In practice, I found at least 12 free dofollow directories above DR 55. A field study in early April showed that adding 50 such backlinks moved a DR 46 site to DR 50 in one week.

The misconception comes from the early 2010s when directory submission was spammed to death. Post-2024, curated directories (Navs Site, Acid Tools, Ben's Bites, ShowMySites, NextGen Tools) are legitimate editorial sources.

What tools I used (and didn't use)

Used:

  • Obsidian (free) for the vault UI
  • Claude Code for the distillation + query layer
  • Ahrefs (~$99/month, but sem.3ue.com mirror for specific lookups)
  • Google Search Console (free) — the most important SEO tool for indie devs

Explicitly NOT used:

  • No SEO course (they go stale)
  • No paid link-building service (PBNs are a penalty landmine)
  • No vector database (the whole point of the Karpathy pattern is avoiding this)
  • No subscription SaaS tools beyond Ahrefs

The goal was to keep the tool budget under $100/month and replace expensive tools with LLM-assisted workflows. Mostly worked.

What's next

I packaged the vault as "SEO Brain" for other indie devs. Free 5-concept starter kit is at openaitoolshub.org/en/seo-brain (canonical source, no Medium paywall). Full 12-concept Starter Edition is on Gumroad, $19 launch week, $29 regular.

More importantly — if you're doing personal research in any domain, I think Karpathy's LLM KB pattern is the right structure for 2026. Try it with your own domain (investing research, game dev, climate science, whatever) and let me know what you learn.

The compounding is real. The contradictions-preserved discipline is the trick.

About the author

Jim runs openaitoolshub.org (DR 30, 126 articles, solo) and four sister sites covering trading, SaaS, AI tools, and game directories. He writes about applying indie dev patterns to SEO at his main site.

This article's canonical version lives at openaitoolshub.org/en/seo-brain. Dev.to is a syndication copy.
