DEV Community: Nick Yeo

Is That an LLM in Your Harness, or Are You Just Glad to See Me?

Nick Yeo — Sat, 18 Jul 2026 08:52:08 +0000

Benchmarking the Claude Code harness (and three rivals) on agentic knowledge management

Harness engineering is having a moment. Mitchell Hashimoto kicked it off in February with a post arguing that when an agent makes a mistake, you should engineer the environment so it can never make that mistake again. A few weeks later, OpenAI and Anthropic had published their own harness engineering guides, Martin Fowler weighed in, and the term acquired a Wikipedia page.

Not everyone agrees about the impact of the harness. The LangChain team moved their coding agent from 52.8% to 66.5% on Terminal Bench 2.0 without touching the model, purely by reworking the harness. Meanwhile Boris Cherny, the creator of Claude Code, says the opposite: the harness is "the thinnest possible wrapper over the model", and the secret sauce is all in the model. The two claims seem contradictory: "the harness is the moat" and "the harness is a napkin". Sounds like something worth testing.

I took the four major agent harness SDKs, Anthropic's Claude Agent SDK (which runs the actual Claude Code binary), OpenAI's Agents SDK, Google's ADK, and Microsoft's Agent Framework, and gave them all the same evaluation task: answering 133 questions buried in 85 years of U.S. Treasury Bulletins. No retrieval pipeline nor vector database. Just an agent, a pile of documents, and three tools to find things with. It is a standard question-answering test over a document archive, a knowledge management problem every company has.

To keep the fight fair, everything is pinned: the same question set, the exact same three tools backed by a single shared implementation, the same prompt, the same grader. Only two things vary. The harness SDK, and the model inside it, each vendor's big model and its smaller, cheaper sibling. If the harness is the moat, that should show up in the numbers. If it is all model, that shows up too. As it turns out, both matter, they just matter for different things, and the small models produce the most expensive surprise in the article.

TL;DR

Agentic navigation over a raw document corpus works: the best stacks answer 64% of a hard grounded-reasoning benchmark at 1% numeric tolerance, with no index, no chunking, and no embeddings.
The model is the biggest variable by far. Every vendor's frontier model beats its small sibling by a landslide; GPT-5.5 scores 16x GPT-5.4-mini on the same harness.
The GPT minis are false economy. Per correct answer in actual billed cost, GPT-5.4-mini costs about twice as much as GPT-5.5 on the same harness, because it thrashes: twice the tool calls, a tenth of the accuracy. Claude's Haiku dodges this on unit cost thanks to aggressive caching, but only answers a quarter of the questions.
The harness is a real variable too. The same GPT-5.5, same tools, same prompt scored a statistical tie on OpenAI's Agents SDK and Microsoft's Agent Framework, but Microsoft's run finished with zero errors, roughly 17% fewer tokens, and a roughly 19% smaller per-question bill.
Every harness fails in its own accent: turn-limit exhaustion here, hallucinated tool names there. If you are choosing the foundation for a knowledge assistant, the error log matters as much as the accuracy column.

Knowledge management is becoming an agent problem

Every organization sits on a pile of documents that people need answers from: filings, reports, policies, contracts, decades of archived PDFs. The default architecture for the last few years has been retrieval-augmented generation: split the documents into chunks, embed them, store vectors, retrieve the most similar passages, and hope the answer is inside the top-k.

That architecture has known failure modes, and they cluster where enterprise archives are hardest:

Near-duplicate documents defeat similarity search. An archive of monthly bulletins contains hundreds of documents with nearly identical structure and vocabulary. An embedding can tell "budget table" from "poetry", but not the January 1941 issue from the February 1941 issue.
Tables do not chunk. The answer to a numeric question is one cell whose meaning depends on the row label, the column header, the units note above the table, and sometimes a footnote below it. Chunking scatters those across fragments.
Retrieval is one-shot. If the right passage is not in the top-k, a classic pipeline has no way to notice and try a different query.

The agentic alternative skips the pipeline entirely. Give the model a small set of read-only tools over the raw corpus, the same primitives a human analyst (or a coding agent like Claude Code) uses: list files, search their contents, open and page through them. Let the model decide what to look for, follow leads, recover from dead ends, and stop when it has the answer. No index to build or keep fresh, no chunking decisions, and citations that point at an exact file and line.

The question is whether that actually works, and what it costs. That requires a benchmark built for it.

OfficeQA: a benchmark for navigating documents, not retrieving them

OfficeQA (Databricks) is an end-to-end grounded-reasoning benchmark: 133 questions (the "Pro" subset) that can only be answered from dense financial tables scattered across 85 years of U.S. Treasury Bulletins, a corpus of roughly 460 MB of parsed text.

It is almost adversarial to embedding retrieval, in precisely the ways enterprise archives are: hundreds of structurally near-identical monthly issues, answers living in single table cells, questions that hinge on dates, units, and fiscal-versus-calendar distinctions. There is no shortcut. An agent has to work out which bulletin is relevant, open it, find the right table, and produce a number graded at 1% relative-error tolerance.

That makes it a good proxy for the agentic knowledge management task in general: if a harness-plus-model stack can navigate this archive with three file tools, it can plausibly navigate yours.

The experiment

One toolset, one prompt, one grader. Every harness got the same three read-only tools, backed by a single shared implementation; only the SDK-specific registration layer differs:

glob(pattern): list corpus filenames matching a pattern
grep(pattern, ...): regex search across files
read_file(filename, offset, limit): read a file with line paging

This is the canonical minimal set used by coding agents, applied to a document archive. Every run used the same system prompt, the same <FINAL_ANSWER> output contract, a 30-turn cap, and OfficeQA's official reward.py scorer at four tolerances (exact, 0.1%, 1%, 5%).

Four vendor stacks, two model tiers each.

Vendor stack	Frontier	Small
Anthropic, Claude Agent SDK	Claude Opus 4.7	Claude Haiku 4.5
OpenAI, Agents SDK	GPT-5.5	GPT-5.4-mini
Microsoft, Agent Framework	GPT-5.5	GPT-5.4-mini
Google, ADK (via LiteLLM)	Gemini 3.1 Pro	Gemini 3.5 Flash

The design measures each vendor's stack as a unit (its SDK running its own models), plus the frontier-versus-small gap within each stack. Microsoft's Agent Framework has no first-party frontier model, so it runs OpenAI's, which conveniently puts the same model on two harnesses: an experiment of pure harness comparison.

All model traffic was routed through OpenRouter so provider-side serving is as comparable as we could make it. (The Claude Agent SDK required a small local proxy that stubs Anthropic-API endpoints OpenRouter does not implement.) Tool calls were counted at the shared tool layer, identically across SDKs. Runs were performed in late May 2026.

How to read the accuracy numbers. All accuracy figures in this article count harness errors (turn-limit exhaustion, crashes) as wrong answers, so every stack is graded on the same 133 questions. The Errors column shows how many of each run's misses were harness failures rather than wrong answers.

Results

Vendor stack	Model	Errors	Acc@0%	Acc@1%	Acc@5%	Latency med (s)	Tool calls/q	Prompt tok/q	Billed cost/q	$/correct answer
Claude Agent SDK	Opus 4.7	4	.489	.639	.692	79	9.0	n/a*	$0.69	$1.08
Claude Agent SDK	Haiku 4.5	15	.128	.271	.308	88	16.2	n/a*	$0.28	$1.04
OpenAI Agents SDK	GPT-5.5	4	.481	.617	.662	103	9.9	312.5K	$0.61	$0.99
OpenAI Agents SDK	GPT-5.4-mini	4	.008	.038	.090	29	18.8	399.6K	$0.08	$2.14
MS Agent Framework	GPT-5.5	0	.519	.639	.707	94	9.2	261.4K	$0.50	$0.78
MS Agent Framework	GPT-5.4-mini	0	.030	.045	.113	24	18.2	329.7K	$0.07	$1.61
Google ADK	Gemini 3.1 Pro	4	.248	.308	.316	182	35.2	2,836K	$2.50	$8.11
Google ADK	Gemini 3.5 Flash	2	.150	.195	.195	194	44.2	1,771.8K	$1.13	$5.80

* Why the Claude token cells are blank. The Anthropic API reports cached input separately from uncached input, and our harness recorded only the uncached slice (8 to 111 prompt tokens per multi-turn run, obviously not the full picture), so the token columns would be misleading. The costs are unaffected: they come straight from the billing export. The billing also quietly confirms the caching story, since Claude's actual charges are about 4x what its recorded uncached tokens alone would cost, with cache reads at a tenth of the base rate making up the difference.

The headline for knowledge management: agentic navigation works, with no retrieval pipeline at all. The best stacks answer nearly two-thirds of a deliberately hard archive benchmark to within 1%, using nothing but list, search, and read over raw text files. The rest of the findings are about what separates the stacks.

Finding 1: the model is the biggest lever, and it is not close

Within each vendor's own stack, dropping from the frontier model to the small one is catastrophic on this task: GPT-5.5 to GPT-5.4-mini falls from 63.9% to 4.5% (Microsoft) and 61.7% to 3.8% (OpenAI). Opus 4.7 to Haiku 4.5 falls from 63.9% to 27.1%. Gemini 3.1 Pro to 3.5 Flash falls from 30.8% to 19.5%.

No harness choice in this experiment moves accuracy anywhere near that much. For a knowledge assistant, the model tier is the first decision and the main budget line: the navigation loop (form a hypothesis about where the answer lives, search, read, revise) is exactly the multi-step reasoning that separates frontier models from their small siblings.

The degradation profile also differs by family: Haiku keeps 42% of Opus's accuracy and Flash keeps 63% of Gemini Pro's, but GPT-5.4-mini keeps only about 7% of GPT-5.5's. "Small" means very different things across vendors on this kind of task.

Finding 2: small models are false economy per correct answer

The small GPT models look absurdly cheap per question ($0.07 to $0.08 in actual charges, versus $0.50 to $0.61 for GPT-5.5). But a knowledge assistant is bought to produce answers, not attempts, and per correct answer the ordering flips:

Stack	Correct (of 133)	$/correct answer (billed)
MS Agent Framework + GPT-5.5	85	$0.78
OpenAI SDK + GPT-5.5	82	$0.99
Claude SDK + Haiku 4.5	36	$1.04
Claude SDK + Opus 4.7	85	$1.08
MS Agent Framework + GPT-5.4-mini	6	$1.61
OpenAI SDK + GPT-5.4-mini	5	$2.14
Google ADK + Gemini 3.5 Flash	26	$5.80
Google ADK + Gemini 3.1 Pro	41	$8.11

The GPT minis cost about twice as much per correct answer as GPT-5.5 on the same harness (2.1x on both). One honesty note: with only 5 or 6 correct answers, the mini estimates are noisy, and at the luckiest edge of a 95% confidence interval the gap narrows toward parity. What is not noisy is what you get for the money: a stack that answers 4% of questions is not a knowledge assistant at any price.

The Claude rows tell a different small-model story. Haiku lands at cost parity with Opus per correct answer ($1.04 vs $1.08), because the Claude harness's aggressive caching makes even Haiku's thrashing cheap. It still answers only 27% of the questions, so the false economy shows up as coverage rather than unit cost. And Google's stack is expensive at both tiers: Flash comes in about 28% cheaper per correct answer than Gemini 3.1 Pro, but both cost 6x to 10x more than the GPT-5.5 stacks. The mechanism behind the minis' false economy is visible in the telemetry:

The small models do not fail quickly, they thrash: roughly double the tool calls per question (18 to 19, versus about 9), a similar or larger prompt bill (context still accumulates on every one of those turns), and a tenth of the accuracy. Capability shows up not just in answer quality but in search efficiency. A strong model finds the right bulletin in a few probes; a weak model wanders the archive at your expense. In a deployed knowledge assistant, that waste compounds with a hidden cost this benchmark can grade but users cannot: a wrong number delivered confidently.

The Google stack's numbers deserve their own note: 35 to 44 tool calls and 0.7M to 2.8M prompt tokens per question, an order of magnitude above the other stacks, making it the most expensive way to get an answer in this experiment ($2.50 per question in actual charges for Gemini 3.1 Pro, 4x to 5x the GPT-5.5 stacks, for half the accuracy). In this round the Google stack is measured as a unit (ADK plus Gemini); how much of the bill is the harness's context-accumulation strategy versus the model's search style is exactly what the sequel's crossed grid will separate.

Finding 3: "close" is only a consolation prize for frontier models

Loosening the grader from exact match to 5% relative error moves the frontier stacks by 15 to 20 points (Opus: 48.9% exact, 69.2% at 5%). These models usually find the right table and occasionally fumble precision: units, rounding, fiscal versus calendar year. The GPT minis barely move (0.8% to 9%): when they are wrong, they are not almost right, they are lost.

For knowledge management this cuts two ways. If your use case tolerates approximate figures, a frontier stack gets a meaningful boost for free. And the frontier failure mode (right document, wrong precision) is the kind a citation-checking or unit-verification step can catch, whereas "never found the right page" is not.

Finding 4: same model, different harness: the accuracy ties, the bill does not

The one deliberately crossed cell in this experiment: GPT-5.5 with identical tools and prompts on OpenAI's Agents SDK and Microsoft's Agent Framework. Accuracy is a statistical tie (61.7% versus 63.9%, a 3-question gap on 133). There is clear harness-driven distinction in every other metric:

Reliability: Microsoft's harness completed all 133 questions without a single hard error, in both the frontier and mini runs. OpenAI's SDK dropped 4 questions to turn-limit exhaustion in each.
Tokens: roughly 17% fewer prompt tokens and 44% fewer completion tokens on Microsoft's harness, for the same model, tools, and questions.
Money: $0.50 versus $0.61 per question in actual charges, about 19% cheaper for the identical job (on the cache-aware split of the shared billing bucket).

The harness did not change how smart the model was. It changed what the intelligence cost and whether the run finished, which is exactly the operational layer a knowledge management deployment lives or dies on, and where a reliable harness can shine.

Finding 5: every harness fails in its own accent

The error logs are practically a fingerprint of each SDK's design philosophy:

Claude Agent SDK: all failures were turn-budget exhaustion ("Reached max turns"), 4 for Opus and 15 for Haiku. The harness never crashed; the agent just ran out of runway.
OpenAI Agents SDK: MaxTurnsExceeded, plus one JSON-decode error deep in a tool-call payload.
Microsoft Agent Framework: zero hard errors across all 266 questions it ran.
Google ADK: the most revealing failures. The model repeatedly hallucinated tool names (Tool 'grey' not found, Tool 'view_esf1_sept_2022' not found) and ADK surfaced them as run-fatal errors rather than feeding a correction back to the model. The other SDKs' stricter tool-schema enforcement never let an invalid tool name through.

Outside of accuracy, this is the next most important metric, and a critical part of AI Engineering harnesses for knowledge management.

Limitations

One corpus, one domain. OfficeQA is numeric, table-heavy, and English. It stresses the navigation loop hard, but agentic knowledge management over contracts or policies may rank the stacks differently. Additionally, some of the questions required access to a sandbox for calculations and analysis. No compute or sandbox was made available to the harnesses.

What this means for agentic knowledge management

Three practical takeaways, pending the sequel's crossed grid:

You may not need a retrieval pipeline. Three file tools and a frontier model answered 64% of a benchmark built to punish retrieval, with exact file-and-line provenance for every answer. For archives full of near-duplicate, table-heavy documents, agentic navigation is a serious architecture, not a demo.
Spend on the model, not on attempts. The GPT minis are cheaper per question and roughly twice as expensive per correct answer. Budget on cost per correct answer and the frontier tier usually wins; the exception (Haiku at unit-cost parity with Opus, thanks to caching) still only answers a quarter of the questions.
Pick the harness for operations. Accuracy ties across harnesses running the same model, but errors, tokens, latency, and caching behaviour do not. The harness is an infrastructure choice; evaluate it like one.

Benchmark: OfficeQA (Databricks), Pro subset, N=133, scored with the official reward.py at 1% relative-error tolerance unless noted. Harness code, prompts, and per-question results available in the accompanying repo. Pricing: OpenRouter model pages for GPT-5.5, GPT-5.4-mini, Claude Opus 4.7, Claude Haiku 4.5, Gemini 3.1 Pro, Gemini 3.5 Flash, retrieved July 2026.

Think with your second brain: a proper Claude Code harness for Obsidian

Nick Yeo — Wed, 20 May 2026 13:56:29 +0000

An agentic doc harness is a set of Claude Code skills that turn an Obsidian vault into a structured workspace, letting an LLM walk wikilinks the way a coding agent walks imports — without vector-RAG retrieval. On a 99-note evaluation vault it beat a vector-RAG baseline on faithfulness +0.27, grounding +0.80, insight novelty +1.00, answer relevancy +0.40 (Claude-as-judge, 0–3 scale).

Repo: github.com/nickyeolk/agentic_doc_harness

Why I built it

I used to have OneNote as my main note keeping app, supplemented by Google Keep for quick notes. As my notes grew, and OneNote's android app withered, I experienced two problems: OneNote was way too slow, and there was no easy way for an AI agent to organically plug into it.
The idea was to make use of a coding agent's capability to understand code, and use it to understand notes instead.
I consolidated years of notes from OneNote into Obsidian. It worked ok-ish at first, but then I quickly started to encounter limits to the way claude grepped and grokked its way through my 'notebase'. Coding agents depend on code structure to jump from object to object, this did not exist in my disjointed notes.

Vector RAG vs vault harness

Same input, same model, different navigation.

Vector RAG embeds the vault, retrieves k chunks by cosine similarity, hands them to the LLM. Chunks arrive as sentence-level fragments with no provenance.

The harness reads VAULT_INDEX.md (a generated map of the vault), routes to an entry note, walks outbound wikilinks, surfaces a few topically-similar but unlinked notes. Notes arrive whole, in their original structure.

Harness structure

Four Claude Code skills plus a small set of generated files.

/harness-init (one-time). Walks the vault, classifies sections, detects hub-candidate notes by inbound mention count, asks 5 clarifying questions, generates VAULT_INDEX.md (the map), root and per-section CLAUDE.md files (orientation), and a small config.

/vault-discover (graph builder). Four modes. Mode 1 ranks notes by inbound mention frequency to surface hub candidates. Mode 2 finds every unlinked mention of a hub and proposes adding [[wikilinks]]. Mode 3 detects orphan notes and classifies them. Mode 4 groups by shared vocabulary to find clusters that need a MOC.

vault-context (runtime navigator). Used every session. Depth Mode for directed tasks (1–2 hops). Synthesis Mode for cross-domain queries (multi-hub traversal). Hybrid Mode (opt-in) layers in embedding-aware filtering when Smart Connections is installed.

/obsidian-tooling (optional). Installs the Smart Connections plugin and pre-configures it. The harness then reads .smart-env/multi/*.ajson directly to use embeddings without any Python ML dependency.

The harness instructs Claude about the vault. It never prescribes what's in the vault.

Eval

Synthetic vault: 99 notes representing a fictional marketing consultant pursuing an MS Marketing degree with a family. Five folders. Generator emits zero wikilinks — flat import state.

Baseline: LlamaIndex vector RAG with nomic-embed-text.

Harness: pure wikilink traversal first, Hybrid Mode after one iteration.

15 synthesis tasks (single-domain, cross-domain, trap queries). Each response scored by a Claude Sonnet judge on faithfulness, grounding, insight novelty, answer relevancy (0–3 each).

First pass, pure wikilink traversal:

Dimension	Baseline	Harness (pure)
Faithfulness	2.067	2.000
Grounding	2.133	2.533
Insight novelty	1.533	2.333
Answer relevancy	2.067	2.400

Three wins, one loss. Faithfulness regressed below baseline. Diagnosis: wikilink traversal is query-agnostic. From a Studies entry note, the agent followed a link to a Clients note even when the query was strictly about coursework. Cross-domain contamination.

Hybrid Mode

Embedding the query at runtime would have required a Python ML stack. I wanted to avoid that.

Smart Connections (Obsidian plugin) already maintains an embedding cache on every note save. The harness reads it.

Move one: filter wikilinks by anchor similarity. For each outbound link, compute cosine(entry_note_embedding, candidate_embedding). Drop links below threshold. A Studies-to-Studies link scores 0.78–0.80. Studies-to-Clients scores 0.68. Studies-to-Family scores 0.57. Threshold becomes a tunable filter.

Move two: orphan surfacing. After traversal, take top-k notes vault-wide by similarity to the entry note. Drop anything already loaded. Surface up to 5. These are notes the wikilink graph never reaches but the embedding flags as topical.

Both moves use Python stdlib only. About 150 lines to walk .smart-env/'s .ajson files and compute cosine similarity over embeddings that already exist.

Final results

Threshold sweep at {0.55, 0.60, 0.65, 0.70, 0.75}. Orphan surfacing on. Pareto-optimal at t=0.65, orphan-k=5.

Variant	Faith	Ground	Novel	Relev	Notes loaded
Baseline (vector RAG)	2.067	2.133	1.533	2.067	n/a
Pure traversal	2.000	2.533	2.333	2.400	8.9
Hybrid t=0.65 +orph5	2.333	2.933	2.533	2.467	11.5
Hybrid t=0.75 +orph5	2.333	2.800	2.667	2.533	9.3

Every Hybrid variant beats baseline AND pure-traversal on every dimension. Per-query latency ~8 seconds (same as pure traversal). Notes loaded up 30%, still within token budget.

Unexpected finding: stricter wikilink filtering plus orphan surfacing beats permissive filtering. At t=0.75 most wikilinks get cut, the surfaced orphans fill the gap, insight novelty peaks. Orphan surfacing is doing more work than the wikilink filter.

This might change over time as your links build up, or if you have an extremely structured note structure.

Where this fits in the bigger picture

The approach is not entirely novel. It sits inside the confluence of a numbr of trends.

Anthropic dropped vector RAG from Claude Code in favor of agentic search. Boris Cherny, Claude Code lead: "Early Claude Code used RAG + a local vector DB, but in the end, we found agentic search to be overwhelmingly better" (Pragmatic Engineer interview, HN confirmation). Claude Code now uses Glob, Grep, Read to navigate the way a developer does. The harness extends this pattern from code to prose.

Karpathy's LLM Wiki pattern (gist) describes keeping knowledge as markdown and skipping retrieval infrastructure entirely. Multiple open-source implementations exist: LLM Wiki Compiler, obsidian-llm-wiki-local, nashsu/llm_wiki, Ar9av/obsidian-wiki. The harness shares the anti-RAG stance and traverses progressively rather than dumping the whole corpus into context.

Microsoft's GraphRAG (blog, arXiv 2404.16130) builds a knowledge graph from a text corpus, then uses the graph for sensemaking queries. The harness uses the graph the user already built.

Smart Connections (repo) is the dominant Obsidian-AI plugin and does RAG over the vault. The harness uses Smart Connections' embedding cache as a secondary signal in Hybrid Mode, not as the primary retrieval mechanism.

The contribution here is synthesis: applying the agentic-search pattern to personal knowledge already structured by the user, with the wikilink graph as the first-class navigation primitive, validated against a vector-RAG baseline with concrete numbers.

Limitations

Eval is on a synthetic vault. A real personal vault will surface failure modes the synthetic one does not.

Path 1 anchor scoring uses the entry note's embedding, not the query's. When the entry note is broad and the query is narrow, the anchor does not pick up query intent. A future iteration may add a small local query embedder.

Entry-point selection runs on routing heuristics generated from the vault. Fragile on first contact with a new vault.

FAQ

Q: Why not just use Smart Connections?
A: Smart Connections is RAG over the vault — embed, retrieve k chunks, chat. Loses the structure. The harness uses the structure first, embeddings second.

Q: Does this work without Smart Connections?
A: Yes. Depth Mode and Synthesis Mode work on wikilinks alone. Hybrid Mode is the opt-in layer that adds Smart Connections.

Q: Does the harness modify my notes?
A: /vault-discover Mode 2 adds [[wikilinks]] to your notes, with every change shown before writing. The runtime navigator never writes.

Q: What if my vault has no wikilinks yet?
A: The harness handles that case. Designed for the OneNote / Notion / Evernote migration scenario. Mode 1 ranks hub candidates by inbound mention frequency. Mode 2 wires them in.

Q: How does this differ from Karpathy's LLM Wiki?
A: LLM Wiki dumps the full knowledge base into the model's context and trusts long context. The harness traverses progressively from an entry point. For vaults larger than the context window, this matters; for small vaults the two converge.

Q: Why not embed the query at runtime?
A: Would require a Python ML stack. The harness is stdlib-only by design. Hybrid Mode achieves most of the win using anchor-based scoring (entry note as the anchor) instead.

Q: What if I want to use this on an existing vault with thousands of notes?
A: It should work — vault-context is bounded by MAX_NOTES_PER_QUERY (default 12). The harness scales with the entry-point routing, not with vault size. The eval is on 99 notes; larger vaults are untested.

DEV Community: Nick Yeo

Is That an LLM in Your Harness, or Are You Just Glad to See Me?

Benchmarking the Claude Code harness (and three rivals) on agentic knowledge management

Knowledge management is becoming an agent problem

OfficeQA: a benchmark for navigating documents, not retrieving them

The experiment

Results

Finding 1: the model is the biggest lever, and it is not close

Finding 2: small models are false economy per correct answer

Finding 3: "close" is only a consolation prize for frontier models

Finding 4: same model, different harness: the accuracy ties, the bill does not

Finding 5: every harness fails in its own accent

Limitations

What this means for agentic knowledge management

Think with your second brain: a proper Claude Code harness for Obsidian

Why I built it

Vector RAG vs vault harness

Harness structure

Eval

Hybrid Mode

Final results

Where this fits in the bigger picture

Limitations

FAQ

Links