Shreyash

Posted on May 28

AI Coding Agents Search Like It's 2009. Provenant Cuts Tokens by 65x.

#ai #llm #rag #showdev

Here's what happens every time you ask an AI coding agent a question:

It greps your codebase
It returns 15 files
It stuffs ~69,000 tokens of raw source code into your context window
It answers your question using maybe 3 of those files
You pay for all 69,000 tokens anyway

In effect, this is keyword search on raw source code: grep, or at best BM25 ranking. It's the same lexical matching that search engines relied on in 2009. And it's still the shape of most coding-agent retrieval systems: keyword search, grep, file search, context stuffing.

I spent the last few months building something better. Here's what I found.

The Vocabulary Gap Nobody Talks About

When you ask "how does Flask handle URL routing?", you're writing in English. The answer lives in scaffold.py, app.py, and wrappers.py — files full of Python syntax, decorator patterns, and Werkzeug internals.

BM25 tries to match your words against those files. It mostly fails.

The word "routing" appears 4 times in Flask's source. "URL" appears 31 times — mostly in docstrings and variable names scattered across 70+ files. BM25 retrieves 15 of them and hopes for the best.

The agent doesn't just have a retrieval problem. It has a vocabulary problem.

Natural language queries describe behavior. Source code implements syntax. These are different vocabularies, and no amount of BM25 tuning bridges that gap.

What If You Searched a Wiki Instead?

Generate a human-readable wiki page for every file and module, then search the wiki.

A wiki page for flask/sansio/scaffold.py reads like this:

Scaffold is the shared base class for Flask and Blueprint. @route() calls add_url_rule(), which creates a Werkzeug Rule and inserts it into url_map. View callables are stored in view_functions keyed by endpoint name.

Search that for "how does Flask handle URL routing?" — the query and the document speak the same language. No vocabulary gap.
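To see the effect at toy scale, here is a minimal pure-Python sketch of standard Okapi BM25 scoring the same question against a raw-code fragment and an excerpt of the wiki sentence. The three documents, the tokenizer, and the `k1`/`b` defaults are illustrative assumptions, not Provenant's code or settings.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical stand-ins for a raw source fragment, a wiki page, and an unrelated page.
raw_code = ("def add_url_rule(self, rule, endpoint=None, view_func=None, **options): "
            "raise NotImplementedError")
wiki_page = ("Scaffold is the shared base class for Flask and Blueprint. @route() calls "
             "add_url_rule(), which creates a Werkzeug Rule and inserts it into url_map.")
unrelated = "Config loads settings from environment variables and Python files."

query = "how does Flask handle URL routing?"
code_score, wiki_score, other_score = bm25_scores(query, [raw_code, wiki_page, unrelated])
print(f"raw code: {code_score:.3f}  wiki: {wiki_score:.3f}  unrelated: {other_score:.3f}")
```

In this toy corpus the wiki excerpt outscores the code fragment because it shares more of the question's vocabulary. Three documents prove nothing on their own; the benchmark numbers are the real evidence.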

That's Provenant. Index once, search a wiki forever.

The Numbers (Benchmarked on SWE-bench Verified)

I ran this against SWE-bench Verified — 500 real GitHub issues across 12 major Python repos. The metric is Coverage@5: does the correct file appear in the top 5 retrieved results?
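As a minimal sketch of the metric, assuming a task counts as covered when any of its gold files appears in the top k (the post does not say how multi-file tasks are scored), Coverage@k can be computed like this. The task data is hypothetical.

```python
def covered_at_k(retrieved, gold_files, k=5):
    """True if at least one gold file is among the top-k retrieved paths."""
    return any(path in gold_files for path in retrieved[:k])

def coverage_at_k(tasks, k=5):
    """Fraction of tasks whose gold file shows up in the top-k results."""
    return sum(covered_at_k(retrieved, gold, k) for retrieved, gold in tasks) / len(tasks)

# Hypothetical tasks: (ranked retrieval results, set of files the real fix touched).
tasks = [
    (["a.py", "b.py", "c.py", "d.py", "e.py", "f.py"], {"c.py"}),  # hit at rank 3
    (["a.py", "b.py", "c.py", "d.py", "e.py", "f.py"], {"f.py"}),  # rank 6, so a miss
]
score = coverage_at_k(tasks, k=5)
print(f"Coverage@5 = {score:.0%}")
```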

Method	Coverage@5	Tokens/query	Δ Coverage@5 vs. baseline
Raw BM25 (baseline)	~40%	~65,000	—
Provenant (wiki + BM25)	63.8%	~1,030	+24pp
Provenant + HyDE	66.2%	~1,030	+26pp

+24 percentage points. From 40% to 63.8%. On 500 tasks. Across 12 repos.

And the token numbers aren't rounding errors:

Repo	Naive tokens	Provenant tokens	Reduction
Flask (30 queries)	69,044	1,070	64.5×
Django (20 queries)	59,634	994	60.0×
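The Reduction column is just the ratio of the two token counts. A quick check using the figures from the table:

```python
# (naive tokens, Provenant tokens) per repo, copied from the table above.
rows = {"Flask": (69_044, 1_070), "Django": (59_634, 994)}

reductions = {repo: round(naive / prov, 1) for repo, (naive, prov) in rows.items()}
for repo, factor in reductions.items():
    print(f"{repo}: {factor}x fewer tokens")
```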

Answer quality delta: −0.15 on a 5-point blind-judge scale. In this sample, that was not a meaningful drop. The model answers nearly as well with ~1k tokens as it does with ~69k; it wasn't using most of the other 68k anyway.

How It Works (The 60-Second Version)

Step 1: Index your repo once.

provenant init /path/to/your/repo

Provenant parses every file with tree-sitter, generates a wiki page per module via LLM, and stores everything in SQLite/FTS5 + LanceDB. The benchmark index holds 6,122 pages across 12 repos. Done in minutes.
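For a feel of what BM25 over wiki pages in SQLite/FTS5 can look like, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample pages are assumptions for illustration, not Provenant's actual schema. The LanceDB side is omitted, and the sketch needs an SQLite build with FTS5 enabled.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the wiki body is indexed; the path is stored as-is.
conn.execute("CREATE VIRTUAL TABLE wiki USING fts5(path UNINDEXED, body)")
conn.executemany(
    "INSERT INTO wiki (path, body) VALUES (?, ?)",
    [
        ("flask/sansio/scaffold.py",
         "Scaffold is the shared base class for Flask and Blueprint. @route() calls "
         "add_url_rule(), which creates a Werkzeug Rule and inserts it into url_map."),
        ("flask/config.py",
         "Config loads settings from environment variables and Python files."),
        ("flask/blueprints.py",
         "Blueprint groups related views and registers them on an application."),
    ],
)

def search(conn, question, k=3):
    """Return up to k page paths ranked by FTS5's built-in bm25() function."""
    # Quote each word so punctuation such as "?" cannot break FTS5 query syntax.
    terms = re.findall(r"\w+", question.lower())
    match_expr = " OR ".join(f'"{term}"' for term in terms)
    rows = conn.execute(
        "SELECT path FROM wiki WHERE wiki MATCH ? ORDER BY bm25(wiki) LIMIT ?",
        (match_expr, k),
    ).fetchall()
    return [row[0] for row in rows]

results = search(conn, "how does Flask handle URL routing?")
print(results)
```

FTS5's bm25() returns lower values for better matches, which is why the query sorts in ascending order.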

Step 2: Start the MCP server.

provenant serve --repo /path/to/your/repo

That's it. Provenant is now a local MCP server exposing tools your agent can call natively.

Step 3: Just use Claude. No special commands.

Add it to your claude_desktop_config.json:

{
  "mcpServers": {
    "provenant": {
      "command": "provenant",
      "args": ["serve", "--repo", "/path/to/your/repo"]
    }
  }
}

Now when you ask Claude "how does authentication work?", it doesn't grep your codebase. It calls provenant_ask, gets 3 wiki pages (~1k tokens), and answers. You never change how you work. The retrieval layer is just better.

You ask Claude a question
         ↓
Claude calls provenant_ask (MCP tool)
         ↓
Provenant: BM25 over wiki pages → top-k results
         ↓
Claude synthesizes answer from ~1,030 tokens
         ↓
Attribution confidence logged → weak pages auto-repaired

What Claude Actually Said

I pointed Claude at a fresh repo, a Java Android music player it had never seen, and asked "How does this app play music?" Here's the actual response after calling provenant_ask:

Screenshot: Claude's unedited response after Provenant retrieved 3 wiki pages (~1k tokens). Discovery phase: ~30 seconds.

"Provenant compressed the discovery phase from ~5–10 minutes of grepping/reading to ~30 seconds. It's like having an experienced teammate say 'here's the 3 files you need and what they do' before you dive in."
— Claude, unprompted

That's on a Java codebase. Provenant indexes Python — but the wiki pages are plain English, and Claude reads English just fine.

The Part That Surprised Me: Attribution Confidence

Nobody measures when a retrieval index is wrong. BM25 returns 5 results and acts confident. The model uses 2. The other 3 were noise. The index degrades silently as your codebase changes.

I built a metric for this:

attribution confidence = pages actually cited / pages retrieved

Zero extra LLM calls. It's derived from the citation structure already in the answer. In my 20-question Django eval it correlates with answer quality (r = 0.415 against a blind LLM judge): high-confidence retrievals score 5.0/5 on average; low-confidence ones score 4.5.
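A minimal sketch of the ratio, assuming answers cite pages with a marker like `[wiki:<path>]`. That marker format and the page names are hypothetical; the post does not show how Provenant encodes citations.

```python
import re

def cited_pages(answer):
    """Extract cited page paths, assuming a hypothetical [wiki:<path>] marker."""
    return set(re.findall(r"\[wiki:([^\]]+)\]", answer))

def attribution_confidence(retrieved, answer):
    """Pages actually cited divided by pages retrieved (0.0 if nothing was retrieved)."""
    if not retrieved:
        return 0.0
    cited = cited_pages(answer)
    return sum(page in cited for page in retrieved) / len(retrieved)

retrieved = ["scaffold.md", "app.md", "wrappers.md", "config.md", "cli.md"]
answer = ("Routes are registered in Scaffold [wiki:scaffold.md] "
          "and dispatched by the app [wiki:app.md].")
confidence = attribution_confidence(retrieved, answer)
print(confidence)
```

Five pages retrieved and two cited gives 0.4, which matches the "returns 5, uses 2" case described earlier.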

When a query's attribution confidence drops below 0.35, Provenant queues a background repair of the pages it retrieved but did not cite:

# Fires silently after low-confidence answers
asyncio.create_task(_background_repair(uncited_pages))

75% of low-confidence queries improved after one repair cycle. Cost: ~$0.02. Touches only 0.7% of pages.

The index improves the more you use it. Without you doing anything.

Per-Repo Breakdown

Some repos benefit more than others. The pattern: small, well-documented repos see the biggest gains. Large monoliths still improve, just from a harder baseline.

Repo	Coverage@5	Improvement (vs. ~40% baseline)	Wiki pages
requests	78%	+38pp	58
pytest	72%	+32pp	186
seaborn	71%	+31pp	94
flask	69%	+29pp	74
xarray	66%	+26pp	218
sphinx	63%	+23pp	412
django	61%	+21pp	1,393
scikit-learn	57%	+17pp	1,124
matplotlib	55%	+15pp	634

requests at 78% makes sense — it's a small, well-structured library with clean module boundaries. Each file does one thing. The wiki pages are precise. The retrieval is nearly perfect.

Django at 61% is still a +21pp improvement on a 1,393-page codebase. That's not nothing.

One More Thing: HyDE

For the ~3% of queries where even wiki vocabulary doesn't match, Provenant generates a hypothetical wiki snippet that would answer the question, then searches against that. Merged with BM25 via Reciprocal Rank Fusion.

+2.4pp Coverage@5. One extra LLM call. Not the headline — but it's there when it helps. The fact that it only fires 3% of the time is the point: the wiki handles the rest.
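For reference, Reciprocal Rank Fusion scores each document as the sum of 1/(k + rank) over the rankings it appears in. A minimal sketch follows; k = 60 is the constant commonly used in the RRF literature, and whether Provenant uses that value is an assumption. The page names are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

bm25_ranking = ["scaffold.md", "app.md", "wrappers.md"]   # plain BM25 over wiki pages
hyde_ranking = ["wrappers.md", "scaffold.md", "cli.md"]   # search on the hypothetical snippet

fused = reciprocal_rank_fusion([bm25_ranking, hyde_ranking])
print(fused)
```

Pages that rank well in both lists rise to the top, while a page found by only one retriever still gets a chance to appear.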

Honest About What Didn't Work

Speculative prefetching — I built a hook that pre-fetches wiki context whenever your agent greps a file, warming the cache. Median speedup: 1.0×. The DB reads were already fast enough. Keeping the code, not claiming a win.

Compression/pruning — removing low-attribution pages before synthesis. Firing rate on test set: 0%. The threshold was too conservative. Needs tuning before it's useful.

Self-healing at scale — the repair loop is only evaluated on Django (20 questions). I can't claim it generalises yet. It's early evidence, not a proven result.

Try It in 2 Minutes

pip install provenant

# Index
provenant init /path/to/your/repo

# Serve (MCP)
provenant serve --repo /path/to/your/repo

Works with Claude Code, Cursor, or anything MCP-compatible. Your agent gets provenant_ask, provenant_search, provenant_context, and provenant_risk as native tools. It stops grepping. It starts reading the wiki.

⭐ GitHub: github.com/shreyashsharma/provenant

What's Next

Scale self-healing across all 12 repos (not just Django)
SWE-bench end-to-end: patch generation, not just retrieval
Figure out when HyDE helps vs hurts on different repo types
Paper: "Provenant: Attribution-Guided Wiki Indexing for Repository-Level AI Coding Agents" — submitted to IEEE ICAITPR 2026, under review

The retrieval problem in AI coding tools is real and under-measured. BM25 on raw source code is the floor, not the ceiling.

If you try Provenant on your repo, I'm especially interested in two numbers:

How many tokens your agent was reading before — run with a token counter on your current setup, then compare
Whether the retrieved wiki pages match the files you would have opened manually — that's the real test, independent of benchmarks

Those two data points are more honest than any eval I can run on my own repos. Happy to compare notes.

Benchmarked with DeepSeek-V3.2 · nomic-embed-text-v1.5 · SWE-bench Verified (500 tasks) · 12 Python OSS repos

Top comments (2)

Harjot Singh • May 31

The 69,000-tokens-to-use-3-files breakdown is the quiet waste tax on every coding agent, and you're right that it's the same BM25-on-raw-source shape from 2009 wearing an AI hat. Two costs hide in it: the obvious one (you pay for 66k tokens of noise) and the sneaky one (that noise actively degrades the answer, because the model now has to find the signal in a haystack you handed it, and it sometimes anchors on the wrong file). So it's not just cheaper to retrieve precisely; it's better. The vocabulary-gap point is the real root cause: you ask in intent (how does auth work) and grep matches on tokens (the literal string auth), so keyword search structurally can't bridge intent-to-implementation. That's exactly where semantic + structural retrieval earns its 65x. The principle I keep is this: the retrieval layer should do the filtering so only the relevant slice ever reaches the model, and raw bulk stays out of the window. That context-discipline is core to how I build in Moonshift. Is Provenant leaning on embeddings, AST/structural understanding, or both to close the vocabulary gap?

Shreyash • Jun 2

Exactly. The hidden cost is not just wasted tokens. Noise can make the model anchor on the wrong file and still produce a plausible answer.

That is why I track attribution confidence: cited pages / retrieved pages. In my Django eval, it correlated with answer quality at r = 0.415 across 20 questions.

Provenant closes most of the vocabulary gap at index time:

AST → prose wiki → BM25

tree-sitter extracts symbols, imports, and call relationships. An LLM turns that into natural-language wiki pages, and BM25 searches those instead of raw code.

Embeddings are only a fallback for the small set of queries where wiki vocabulary still misses intent.

How does Moonshift handle it: pre-LLM filtering or query-time ranking?