Nick Yeo

Posted on May 20

Think with your second brain: a proper Claude Code harness for Obsidian

#claude #obsidian #ai #productivity

An agentic doc harness is a set of Claude Code skills that turn an Obsidian vault into a structured workspace, letting an LLM walk wikilinks the way a coding agent walks imports, without relying on vector-RAG retrieval. On a 99-note evaluation vault, its best configuration (Hybrid Mode, t=0.65) beat a vector-RAG baseline by +0.27 on faithfulness, +0.80 on grounding, +1.00 on insight novelty and +0.40 on answer relevancy (Claude-as-judge, 0–3 scale).

Repo: github.com/nickyeolk/agentic_doc_harness

Why I built it

I used to use OneNote as my main note-taking app, supplemented by Google Keep for quick notes. As my notes grew and OneNote's Android app withered, I ran into two problems: OneNote was far too slow, and there was no easy way for an AI agent to plug into it organically.
The idea was to take a coding agent's ability to understand code and point it at notes instead.
I consolidated years of notes from OneNote into Obsidian. It worked OK-ish at first, but I quickly started hitting limits in the way Claude grepped and grokked its way through my 'notebase'. Coding agents rely on code structure to jump from object to object, and that structure did not exist in my disjointed notes.

Vector RAG vs vault harness

Same input, same model, different navigation.

Vector RAG embeds the vault, retrieves k chunks by cosine similarity, hands them to the LLM. Chunks arrive as sentence-level fragments with no provenance.
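For contrast, here is a minimal Python sketch of the vector-RAG retrieval step, assuming the chunk embeddings are already computed. It is not LlamaIndex's API. The function names and the toy three-dimensional vectors are illustrative only.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the text of the k chunks most similar to the query.

    Only bare text comes back: no note path, no surrounding headings.
    That is the missing provenance described above.
    """
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, chunk[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy chunks with made-up 3-D embeddings.
chunks = [
    ("Midterm covers segmentation.", [0.9, 0.1, 0.0]),
    ("Client kickoff moved to Friday.", [0.1, 0.9, 0.0]),
    ("Pick up groceries.", [0.0, 0.2, 0.9]),
]
print(top_k_chunks([1.0, 0.0, 0.0], chunks, k=2))
# ['Midterm covers segmentation.', 'Client kickoff moved to Friday.']
```

The second result is a Clients fragment that happened to rank second. The model receives it with nothing to say where it came from.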

The harness reads VAULT_INDEX.md (a generated map of the vault), routes to an entry note, walks outbound wikilinks and surfaces a few topically similar but unlinked notes. Notes arrive whole, in their original structure.
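The navigation loop can be sketched as a bounded breadth-first walk over wikilinks. This is a hypothetical illustration, not the harness's code. `traverse`, the dict-of-strings vault and the regex are assumptions. The 2-hop limit mirrors Depth Mode, and the 12-note cap mirrors the MAX_NOTES_PER_QUERY default mentioned in the FAQ.

```python
import re
from collections import deque

# Captures the target of [[Note]], [[Note|alias]] and [[Note#Heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def outbound_links(text: str) -> list[str]:
    """Targets of every wikilink in a note, in order of appearance."""
    return [target.strip() for target in WIKILINK.findall(text)]


def traverse(vault: dict[str, str], entry: str, max_hops: int = 2, max_notes: int = 12) -> list[str]:
    """Breadth-first walk of outbound wikilinks starting at `entry`.

    Returns note names in visit order; the caller then loads each note whole.
    Stops expanding at `max_hops` and stops loading at `max_notes`.
    Links to notes that do not exist in the vault are skipped.
    """
    loaded = [entry]
    seen = {entry}
    queue = deque([(entry, 0)])
    while queue and len(loaded) < max_notes:
        name, depth = queue.popleft()
        if depth >= max_hops:
            continue  # hop limit reached; do not follow this note's links
        for target in outbound_links(vault[name]):
            if target in vault and target not in seen:
                seen.add(target)
                loaded.append(target)
                queue.append((target, depth + 1))
                if len(loaded) >= max_notes:
                    break
    return loaded


vault = {
    "Studies": "Term plan. See [[Consumer Behavior]] and [[Capstone|my capstone]].",
    "Consumer Behavior": "Lecture notes. Related: [[Survey Design#Sampling]].",
    "Capstone": "Proposal draft.",
    "Survey Design": "Sampling methods.",
    "Groceries": "Milk, eggs.",
}
print(traverse(vault, "Studies"))
# ['Studies', 'Consumer Behavior', 'Capstone', 'Survey Design']
```

Unlike the chunk list, every result is a whole note reached by a link the user wrote. "Groceries" never appears because nothing links to it.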

Harness structure

Four Claude Code skills plus a small set of generated files.

/harness-init (one-time). Walks the vault, classifies sections, detects hub-candidate notes by inbound mention count, asks 5 clarifying questions, generates VAULT_INDEX.md (the map), root and per-section CLAUDE.md files (orientation), and a small config.

/vault-discover (graph builder). Four modes. Mode 1 ranks notes by inbound mention frequency to surface hub candidates. Mode 2 finds every unlinked mention of a hub and proposes adding [[wikilinks]]. Mode 3 detects orphan notes and classifies them. Mode 4 groups by shared vocabulary to find clusters that need a MOC.
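Modes 1 and 2 come down to text matching on note titles. The following is a rough sketch that assumes a note counts as "mentioned" when its title appears as a whole word (case-insensitive) in another note. The harness's actual matching rules may differ, and the function names are hypothetical.

```python
import re


def mention_pattern(title: str) -> re.Pattern:
    """Whole-word, case-insensitive match on a note title (an assumed rule)."""
    return re.compile(r"\b" + re.escape(title) + r"\b", re.IGNORECASE)


def rank_hub_candidates(vault: dict[str, str], top: int = 5) -> list[tuple[str, int]]:
    """Mode 1 sketch: how many *other* notes mention each note's title."""
    counts = {}
    for title in vault:
        pattern = mention_pattern(title)
        counts[title] = sum(
            1 for name, text in vault.items() if name != title and pattern.search(text)
        )
    # Most-mentioned first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top]


def propose_links(vault: dict[str, str], hub: str) -> list[str]:
    """Mode 2 sketch: notes that mention `hub` in plain text without linking it."""
    mention = mention_pattern(hub)
    # Existing [[hub]], [[hub|alias]] or [[hub#heading]] links.
    existing = re.compile(r"\[\[" + re.escape(hub) + r"(?:[|#][^\]]*)?\]\]", re.IGNORECASE)
    proposals = []
    for name, text in vault.items():
        if name == hub:
            continue
        unlinked_text = existing.sub("", text)  # ignore mentions that are already links
        if mention.search(unlinked_text):
            proposals.append(name)
    return proposals


vault = {
    "Acme Corp": "Client since 2021.",
    "Kickoff": "Met Acme Corp about Q3 scope.",
    "Invoice": "Billed [[Acme Corp]] for March.",
    "Capstone": "Case study on acme corp branding.",
}
print(rank_hub_candidates(vault, top=2))  # [('Acme Corp', 3), ('Capstone', 0)]
print(propose_links(vault, "Acme Corp"))  # ['Kickoff', 'Capstone']
```

Mode 1 counts the existing link in "Invoice" as a mention. Mode 2 skips that note, because the mention is already wired in.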

vault-context (runtime navigator). Used every session. Depth Mode for directed tasks (1–2 hops). Synthesis Mode for cross-domain queries (multi-hub traversal). Hybrid Mode (opt-in) layers in embedding-aware filtering when Smart Connections is installed.

/obsidian-tooling (optional). Installs the Smart Connections plugin and pre-configures it. The harness then reads .smart-env/multi/*.ajson directly to use embeddings without any Python ML dependency.
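Reading the cache needs only `json` and `pathlib`. The sketch below assumes a file layout that is not confirmed here for every Smart Connections version. Specifically, it assumes that each `.ajson` file is a run of `"key": {...},` entries without enclosing braces, that note-level keys start with `smart_sources:`, and that vectors sit under `embeddings.<model>.vec`. Treat it as a hypothetical starting point, not the harness's parser.

```python
import json
import tempfile
from pathlib import Path


def parse_ajson(text: str) -> dict:
    """Parse one .ajson file.

    ASSUMPTION: the file is a run of `"key": {...},` entries with no enclosing
    braces. Wrapping it in braces makes it plain JSON; if a key repeats, the
    later entry wins (json.loads keeps the last duplicate).
    """
    body = text.strip().rstrip(",")
    return json.loads("{" + body + "}") if body else {}


def load_note_embeddings(env_dir: str) -> dict[str, list[float]]:
    """Map note path -> embedding vector from a Smart Connections cache directory.

    ASSUMPTIONS: note-level entries are keyed "smart_sources:<path>" and keep
    their vector at entry["embeddings"][<model name>]["vec"]. Block-level
    entries, null entries and entries without a vector are skipped.
    """
    vectors: dict[str, list[float]] = {}
    for file in sorted(Path(env_dir, "multi").glob("*.ajson")):
        for key, entry in parse_ajson(file.read_text(encoding="utf-8")).items():
            if not key.startswith("smart_sources:") or not isinstance(entry, dict):
                continue
            for model in (entry.get("embeddings") or {}).values():
                vec = model.get("vec") if isinstance(model, dict) else None
                if vec:
                    vectors[key.split(":", 1)[1]] = vec
                    break  # one vector per note is enough
    return vectors


# Usage against a throwaway directory with one hand-written cache file.
with tempfile.TemporaryDirectory() as tmp:
    multi = Path(tmp, "multi")
    multi.mkdir()
    (multi / "Studies.ajson").write_text(
        '"smart_sources:Studies.md": {"path": "Studies.md", '
        '"embeddings": {"some-model": {"vec": [0.1, 0.2]}}},\n'
        '"smart_blocks:Studies.md#Week 1": '
        '{"embeddings": {"some-model": {"vec": [0.3, 0.4]}}},\n',
        encoding="utf-8",
    )
    print(load_note_embeddings(tmp))  # {'Studies.md': [0.1, 0.2]}
```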

The harness instructs Claude about the vault. It never prescribes what's in the vault.

Eval

Synthetic vault: 99 notes representing a fictional marketing consultant with a family who is pursuing an MS in Marketing. Five folders. The generator emits zero wikilinks, reproducing the flat state of a freshly imported vault.

Baseline: LlamaIndex vector RAG with nomic-embed-text.

Harness: pure wikilink traversal first, Hybrid Mode after one iteration.

15 synthesis tasks (single-domain, cross-domain, trap queries). Each response scored by a Claude Sonnet judge on faithfulness, grounding, insight novelty, answer relevancy (0–3 each).

First pass, pure wikilink traversal:

Dimension	Baseline	Harness (pure)
Faithfulness	2.067	2.000
Grounding	2.133	2.533
Insight novelty	1.533	2.333
Answer relevancy	2.067	2.400

Three wins, one loss. Faithfulness regressed below baseline. Diagnosis: wikilink traversal is query-agnostic. From a Studies entry note, the agent followed a link to a Clients note even when the query was strictly about coursework. Cross-domain contamination.

Hybrid Mode

Embedding the query at runtime would have required a Python ML stack. I wanted to avoid that.

Smart Connections (an Obsidian plugin) already keeps an embedding cache up to date, re-embedding notes as they are saved. The harness reads that cache.

Move one: filter wikilinks by anchor similarity. For each outbound link, compute cosine(entry_note_embedding, candidate_embedding). Drop links below threshold. A Studies-to-Studies link scores 0.78–0.80. Studies-to-Clients scores 0.68. Studies-to-Family scores 0.57. Threshold becomes a tunable filter.
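Here is move one in stdlib Python, as a sketch: plain cosine similarity against the entry note's vector, followed by a threshold. The function names are hypothetical. The toy 2-D vectors are chosen only so the scores land near the 0.80 / 0.68 / 0.57 figures above; they are not real embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_links(entry_vec: list[float], link_vecs: dict[str, list[float]], threshold: float = 0.65) -> list[str]:
    """Keep outbound links whose embedding scores >= threshold against the
    entry note's embedding (the anchor), most similar first."""
    scored = {name: cosine(entry_vec, vec) for name, vec in link_vecs.items()}
    kept = [name for name, score in scored.items() if score >= threshold]
    return sorted(kept, key=lambda name: scored[name], reverse=True)


# Toy 2-D vectors picked so the scores land near 0.80, 0.68 and 0.57.
entry = [1.0, 0.0]
links = {
    "Studies/Week 3": [0.80, 0.60],
    "Clients/Acme": [0.68, 0.7332],
    "Family/Trip": [0.57, 0.8216],
}
print(filter_links(entry, links, threshold=0.65))  # ['Studies/Week 3', 'Clients/Acme']
print(filter_links(entry, links, threshold=0.70))  # ['Studies/Week 3']
```

With these scores, raising the threshold from 0.65 to 0.70 is exactly what separates "keep the Clients link" from "coursework only".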

Move two: orphan surfacing. After traversal, take top-k notes vault-wide by similarity to the entry note. Drop anything already loaded. Surface up to 5. These are notes the wikilink graph never reaches but the embedding flags as topical.
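Move two can be sketched the same way. Following the description above, the sketch takes the top-k by similarity first and only then drops already-loaded notes, so it can return fewer than k. The names and vectors are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def surface_orphans(entry_vec: list[float], note_vecs: dict[str, list[float]], loaded: set[str], k: int = 5) -> list[str]:
    """Top-k notes vault-wide by similarity to the entry note, minus loaded ones.

    Ranking happens first and filtering second, so at most k notes come back
    (fewer when some of the top k were already loaded by traversal).
    """
    ranked = sorted(note_vecs, key=lambda name: cosine(entry_vec, note_vecs[name]), reverse=True)
    return [name for name in ranked[:k] if name not in loaded]


note_vecs = {
    "Studies": [1.0, 0.0],
    "Week 3": [0.9, 0.1],
    "Reading list": [0.8, 0.3],
    "Acme": [0.2, 0.9],
    "Trip": [0.0, 1.0],
}
loaded = {"Studies", "Week 3"}  # already reached by wikilink traversal
print(surface_orphans(note_vecs["Studies"], note_vecs, loaded, k=3))  # ['Reading list']
```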

Both moves use Python stdlib only. About 150 lines to walk .smart-env/'s .ajson files and compute cosine similarity over embeddings that already exist.

Final results

Threshold sweep at {0.55, 0.60, 0.65, 0.70, 0.75}. Orphan surfacing on. Pareto-optimal at t=0.65, orphan-k=5.

Variant	Faith	Ground	Novel	Relev	Notes loaded
Baseline (vector RAG)	2.067	2.133	1.533	2.067	n/a
Pure traversal	2.000	2.533	2.333	2.400	8.9
Hybrid t=0.65 +orph5	2.333	2.933	2.533	2.467	11.5
Hybrid t=0.75 +orph5	2.333	2.800	2.667	2.533	9.3

Every Hybrid variant beats the baseline AND pure traversal on every dimension. Per-query latency is ~8 seconds, the same as pure traversal. At t=0.65, notes loaded rise about 30% (8.9 → 11.5), still within the token budget.

Unexpected finding: stricter wikilink filtering plus orphan surfacing beats permissive filtering on insight novelty and answer relevancy, at a small cost in grounding. At t=0.75 most wikilinks get cut, the surfaced orphans fill the gap, and insight novelty peaks. Orphan surfacing is doing more work than the wikilink filter.

This may change over time as your links build up, or if your notes are already highly structured.

Where this fits in the bigger picture

The approach is not entirely novel. It sits at the confluence of a number of trends.

Anthropic dropped vector RAG from Claude Code in favor of agentic search. Boris Cherny, Claude Code lead: "Early Claude Code used RAG + a local vector DB, but in the end, we found agentic search to be overwhelmingly better" (Pragmatic Engineer interview, HN confirmation). Claude Code now uses Glob, Grep, Read to navigate the way a developer does. The harness extends this pattern from code to prose.

Karpathy's LLM Wiki pattern (gist) describes keeping knowledge as markdown and skipping retrieval infrastructure entirely. Multiple open-source implementations exist: LLM Wiki Compiler, obsidian-llm-wiki-local, nashsu/llm_wiki, Ar9av/obsidian-wiki. The harness shares the anti-RAG stance and traverses progressively rather than dumping the whole corpus into context.

Microsoft's GraphRAG (blog, arXiv 2404.16130) builds a knowledge graph from a text corpus, then uses the graph for sensemaking queries. The harness uses the graph the user already built.

Smart Connections (repo) is the dominant Obsidian-AI plugin and does RAG over the vault. The harness uses Smart Connections' embedding cache as a secondary signal in Hybrid Mode, not as the primary retrieval mechanism.

The contribution here is synthesis: applying the agentic-search pattern to personal knowledge already structured by the user, with the wikilink graph as the first-class navigation primitive, validated against a vector-RAG baseline with concrete numbers.

Limitations

Eval is on a synthetic vault. A real personal vault will surface failure modes the synthetic one does not.

Move one's anchor scoring uses the entry note's embedding, not the query's. When the entry note is broad and the query is narrow, the anchor does not pick up query intent. A future iteration may add a small local query embedder.

Entry-point selection runs on routing heuristics generated from the vault. Fragile on first contact with a new vault.

FAQ

Q: Why not just use Smart Connections?
A: Smart Connections is RAG over the vault — embed, retrieve k chunks, chat. Loses the structure. The harness uses the structure first, embeddings second.

Q: Does this work without Smart Connections?
A: Yes. Depth Mode and Synthesis Mode work on wikilinks alone. Hybrid Mode is the opt-in layer that adds Smart Connections.

Q: Does the harness modify my notes?
A: /vault-discover Mode 2 adds [[wikilinks]] to your notes, with every change shown before writing. The runtime navigator never writes.

Q: What if my vault has no wikilinks yet?
A: The harness handles that case. Designed for the OneNote / Notion / Evernote migration scenario. Mode 1 ranks hub candidates by inbound mention frequency. Mode 2 wires them in.

Q: How does this differ from Karpathy's LLM Wiki?
A: LLM Wiki dumps the full knowledge base into the model's context and trusts long context. The harness traverses progressively from an entry point. For vaults larger than the context window, this matters; for small vaults the two converge.

Q: Why not embed the query at runtime?
A: Would require a Python ML stack. The harness is stdlib-only by design. Hybrid Mode achieves most of the win using anchor-based scoring (entry note as the anchor) instead.

Q: What if I want to use this on an existing vault with thousands of notes?
A: It should work — vault-context is bounded by MAX_NOTES_PER_QUERY (default 12). The harness scales with the entry-point routing, not with vault size. The eval is on 99 notes; larger vaults are untested.

DEV Community