Here's what happens every time you ask an AI coding agent a question:
- It greps your codebase
- It returns 15 files
- It stuffs ~69,000 tokens of raw source code into your context window
- It answers your question using maybe 3 of those files
- You pay for all 69,000 tokens anyway
This is keyword search on raw source code, typically grep or BM25. BM25 is a ranking function that dates back to the 1990s. And it's still the shape of most coding-agent retrieval systems: keyword search, grep, file search, context stuffing.
I spent the last few months building something better. Here's what I found.
The Vocabulary Gap Nobody Talks About
When you ask "how does Flask handle URL routing?", you're writing in English. The answer lives in scaffold.py, app.py, and wrappers.py — files full of Python syntax, decorator patterns, and Werkzeug internals.
BM25 tries to match your words against those files. It mostly fails.
The word "routing" appears 4 times in Flask's source. "URL" appears 31 times — mostly in docstrings and variable names scattered across 70+ files. BM25 retrieves 15 of them and hopes for the best.
The agent doesn't just have a retrieval problem. It has a vocabulary problem.
Natural language queries describe behavior. Source code expresses that behavior in identifiers and syntax. These are different vocabularies, and no amount of BM25 tuning bridges that gap.
What If You Searched a Wiki Instead?
Generate a human-readable wiki page for every file and module, then search the wiki.
A wiki page for flask/sansio/scaffold.py reads like this:
Scaffold is the shared base class for Flask and Blueprint.
@route() calls add_url_rule(), which creates a Werkzeug Rule and inserts it into url_map. View callables are stored in view_functions keyed by endpoint name.
Search that for "how does Flask handle URL routing?" — the query and the document speak the same language. No vocabulary gap.
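To make the gap concrete, here is a minimal sketch that scores one question against two texts with a hand-rolled Okapi BM25. The `raw_source` string is an illustrative stand-in rather than real Flask code, the tokenizer is deliberately naive, and the `k1` and `b` values are common defaults, not Provenant's settings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Illustrative stand-in for raw code (not copied from Flask).
raw_source = (
    "def add_url_rule(self, rule, endpoint=None, view_func=None, **options): "
    "self.url_map.add(Rule(rule, endpoint=endpoint))"
)
# The wiki text quoted above.
wiki_page = (
    "Scaffold is the shared base class for Flask and Blueprint. "
    "@route() calls add_url_rule(), which creates a Werkzeug Rule and inserts "
    "it into url_map. View callables are stored in view_functions keyed by "
    "endpoint name."
)

query = "how does Flask handle URL routing?"
code_score, wiki_score = bm25_scores(query, [raw_source, wiki_page])
print(f"raw source: {code_score:.3f}  wiki page: {wiki_score:.3f}")
```

With these two texts, the wiki page scores higher because it shares both "Flask" and "URL" with the question, while the code only matches on the `url` fragment inside identifiers.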
That's Provenant. Index once, search a wiki forever.
The Numbers (Benchmarked on SWE-bench Verified)
I ran this against SWE-bench Verified — 500 real GitHub issues across 12 major Python repos. The metric is Coverage@5: does the correct file appear in the top 5 retrieved results?
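For readers who want to reproduce the metric, here is a minimal sketch of Coverage@k. It assumes a task counts as a hit when any of its gold files appears in the top k results; the task IDs and file names are made up for illustration.

```python
def coverage_at_k(retrieved, gold, k=5):
    """Fraction of tasks whose top-k retrieved files include a gold file.

    retrieved: dict mapping task id -> ranked list of file paths
    gold:      dict mapping task id -> list of files touched by the real fix
    """
    if not retrieved:
        return 0.0
    hits = 0
    for task_id, ranked_files in retrieved.items():
        if set(ranked_files[:k]) & set(gold[task_id]):
            hits += 1
    return hits / len(retrieved)


# Hypothetical example data: two tasks, one hit within the top 5.
retrieved = {
    "task-1": ["a.py", "b.py", "c.py", "d.py", "e.py", "target.py"],
    "task-2": ["x.py", "target.py", "y.py"],
}
gold = {"task-1": ["target.py"], "task-2": ["target.py"]}

score = coverage_at_k(retrieved, gold, k=5)
print(score)
```

In `task-1` the gold file sits at rank 6, so it does not count at k=5.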
| Method | Coverage@5 | Tokens/query | Delta |
|---|---|---|---|
| Raw BM25 (baseline) | ~40% | ~65,000 | — |
| Provenant (wiki + BM25) | 63.8% | ~1,030 | +24pp |
| Provenant + HyDE | 66.2% | ~1,030 | +26pp |
+24 percentage points. From 40% to 63.8%. On 500 tasks. Across 12 repos.
And the token numbers aren't rounding errors:
| Repo | Naive tokens | Provenant tokens | Reduction |
|---|---|---|---|
| Flask (30 queries) | 69,044 | 1,070 | 64.5× |
| Django (20 queries) | 59,634 | 994 | 60.0× |
Answer quality delta: −0.15 on a 5-point blind-judge scale. In this sample, that was not a meaningful drop. The model answers nearly as well with ~1k tokens as it does with ~69k; it wasn't using most of the other 68k anyway.
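The Reduction column is just the ratio of the two token counts. A quick check using the figures from the table above:

```python
# Token counts copied from the table above.
naive_tokens = {"Flask": 69_044, "Django": 59_634}
provenant_tokens = {"Flask": 1_070, "Django": 994}

# Reduction factor = naive tokens / Provenant tokens, rounded to one decimal.
reduction = {
    repo: round(naive_tokens[repo] / provenant_tokens[repo], 1)
    for repo in naive_tokens
}

for repo, factor in reduction.items():
    print(f"{repo}: {factor}x fewer tokens")
```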
How It Works (The 60-Second Version)
Step 1: Index your repo once.
provenant init /path/to/your/repo
Provenant parses every file with tree-sitter, generates a wiki page per module via LLM, and stores everything in SQLite/FTS5 + LanceDB. For the benchmark, that came to 6,122 pages across the 12 repos. Done in minutes.
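Here is a minimal sketch of the storage-and-search half of that step, using SQLite's FTS5 extension through Python's standard library. The table name `wiki`, the column names, and the two short placeholder pages are assumptions for illustration, not Provenant's actual schema. The tree-sitter parsing, LLM generation, and LanceDB vector store are left out. It also assumes your Python's SQLite build includes FTS5, which is typical but not guaranteed.

```python
import re
import sqlite3


def build_index(pages):
    """Create an in-memory FTS5 index from a {path: wiki_text} mapping."""
    conn = sqlite3.connect(":memory:")
    # `path` is stored but not searched; only `body` is full-text indexed.
    conn.execute("CREATE VIRTUAL TABLE wiki USING fts5(path UNINDEXED, body)")
    conn.executemany("INSERT INTO wiki (path, body) VALUES (?, ?)", pages.items())
    return conn


def search(conn, question, k=3):
    """Return up to k page paths ranked by FTS5's built-in bm25()."""
    terms = re.findall(r"\w+", question.lower())
    if not terms:
        return []
    # Quote each term so punctuation in the question cannot break FTS5 syntax.
    match_expr = " OR ".join(f'"{t}"' for t in terms)
    rows = conn.execute(
        "SELECT path FROM wiki WHERE wiki MATCH ? ORDER BY bm25(wiki) LIMIT ?",
        (match_expr, k),
    ).fetchall()
    return [path for (path,) in rows]


pages = {
    "flask/sansio/scaffold.py": (
        "Scaffold is the shared base class for Flask and Blueprint. "
        "@route() calls add_url_rule(), which creates a Werkzeug Rule and "
        "inserts it into url_map."
    ),
    # Placeholder page texts, written for this example only.
    "flask/wrappers.py": "Request and Response wrappers around Werkzeug objects.",
    "flask/cli.py": "Command line interface built on Click.",
}

conn = build_index(pages)
results = search(conn, "how does Flask handle URL routing?")
print(results)
```

FTS5's `bm25()` returns lower values for better matches, which is why the query orders by it ascending.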
Step 2: Start the MCP server.
provenant serve --repo /path/to/your/repo
That's it. Provenant is now a local MCP server exposing tools your agent can call natively.
Step 3: Just use Claude. No special commands.
Add it to your claude_desktop_config.json:
{
  "mcpServers": {
    "provenant": {
      "command": "provenant",
      "args": ["serve", "--repo", "/path/to/your/repo"]
    }
  }
}
Now when you ask Claude "how does authentication work?" — it doesn't grep your codebase. It calls provenant_ask, gets 3 wiki pages (~1k tokens), and answers. You never change how you work. The retrieval layer is just better.
You ask Claude a question
↓
Claude calls provenant_ask (MCP tool)
↓
Provenant: BM25 over wiki pages → top-k results
↓
Claude synthesizes answer from ~1,030 tokens
↓
Attribution confidence logged → weak pages auto-repaired
What Claude Actually Said
I pointed Claude at a fresh repo, a Java Android music player it had never seen, and asked "How does this app play music?" Here's the actual response after it called provenant_ask:
Screenshot: Claude's unedited response after Provenant retrieved 3 wiki pages (~1k tokens). Discovery phase: ~30 seconds.
"Provenant compressed the discovery phase from ~5–10 minutes of grepping/reading to ~30 seconds. It's like having an experienced teammate say 'here's the 3 files you need and what they do' before you dive in."
— Claude, unprompted
That's on a Java codebase. Provenant indexes Python — but the wiki pages are plain English, and Claude reads English just fine.
The Part That Surprised Me: Attribution Confidence
Nobody measures when a retrieval index is wrong. BM25 returns 5 results and acts confident. The model uses 2. The other 3 were noise. The index degrades silently as your codebase changes.
I built a metric for this:
attribution confidence = pages actually cited / pages retrieved
Zero extra LLM calls. Derived from the citation structure already in the answer. It correlates with answer quality (r = 0.415 against a blind LLM judge) — high-confidence retrievals score 5.0/5 on average; low-confidence score 4.5.
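A minimal sketch of how that ratio could be derived from an answer. The `[[path]]` citation syntax is an assumption made for this example; the real citation format may differ, and the sample answer is invented.

```python
import re


def attribution_confidence(answer, retrieved_pages):
    """Return (confidence, uncited_pages) for one answer.

    confidence = pages actually cited / pages retrieved
    Citations are assumed to look like [[path/to/page]] (hypothetical format).
    """
    if not retrieved_pages:
        return 0.0, []
    cited = set(re.findall(r"\[\[(.+?)\]\]", answer))
    uncited = [page for page in retrieved_pages if page not in cited]
    confidence = (len(retrieved_pages) - len(uncited)) / len(retrieved_pages)
    return confidence, uncited


retrieved_pages = ["flask/sansio/scaffold.py", "flask/app.py", "flask/wrappers.py"]
answer = (
    "Routing starts in Scaffold [[flask/sansio/scaffold.py]] and is wired up "
    "by the app object [[flask/app.py]]."
)

confidence, uncited = attribution_confidence(answer, retrieved_pages)
print(round(confidence, 2), uncited)
```

Because the function only parses text the model already produced, it needs no extra LLM call.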
When a query's attribution confidence drops below 0.35, Provenant queues a background repair of the pages that went uncited:
# Fires silently after low-confidence answers
asyncio.create_task(_background_repair(uncited_pages))
75% of low-confidence queries improved after one repair cycle. Cost: ~$0.02. Touches only 0.7% of pages.
The index improves the more you use it. Without you doing anything.
Per-Repo Breakdown
Some repos benefit more than others. The pattern: small, well-documented repos see the biggest gains. Large monoliths still improve, just from a harder baseline.
| Repo | Coverage@5 | Improvement | Wiki pages |
|---|---|---|---|
| requests | 78% | +38pp | 58 |
| pytest | 72% | +32pp | 186 |
| seaborn | 71% | +31pp | 94 |
| flask | 69% | +29pp | 74 |
| xarray | 66% | +26pp | 218 |
| sphinx | 63% | +23pp | 412 |
| django | 61% | +21pp | 1,393 |
| scikit-learn | 57% | +17pp | 1,124 |
| matplotlib | 55% | +15pp | 634 |
requests at 78% makes sense — it's a small, well-structured library with clean module boundaries. Each file does one thing. The wiki pages are precise. The retrieval is nearly perfect.
Django at 61% is still a +21pp improvement on a 1,393-page codebase. That's not nothing.
One More Thing: HyDE
For the ~3% of queries where even wiki vocabulary doesn't match, Provenant generates a hypothetical wiki snippet that would answer the question, then searches against that. Merged with BM25 via Reciprocal Rank Fusion.
+2.4pp Coverage@5. One extra LLM call. Not the headline — but it's there when it helps. The fact that it only fires 3% of the time is the point: the wiki handles the rest.
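For reference, Reciprocal Rank Fusion is only a few lines. This is a generic sketch of the technique, not Provenant's code: the constant `k=60` is the value commonly used in the literature, and the two input rankings are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken alphabetically.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical rankings: plain BM25 on the question, and BM25 on the HyDE snippet.
bm25_ranking = ["scaffold", "app", "wrappers"]
hyde_ranking = ["app", "blueprints", "scaffold"]

fused = reciprocal_rank_fusion([bm25_ranking, hyde_ranking])
print(fused)
```

Pages that appear in both lists rise to the top, while pages found by only one retriever still make the merged list.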
Honest About What Didn't Work
Speculative prefetching — I built a hook that pre-fetches wiki context whenever your agent greps a file, warming the cache. Median speedup: 1.0×. The DB reads were already fast enough. Keeping the code, not claiming a win.
Compression/pruning — removing low-attribution pages before synthesis. Firing rate on test set: 0%. The threshold was too conservative. Needs tuning before it's useful.
Self-healing at scale — the repair loop is only evaluated on Django (20 questions). I can't claim it generalises yet. It's early evidence, not a proven result.
Try It in 2 Minutes
pip install provenant
# Index
provenant init /path/to/your/repo
# Serve (MCP)
provenant serve --repo /path/to/your/repo
Works with Claude Code, Cursor, or anything MCP-compatible. Your agent gets provenant_ask, provenant_search, provenant_context, and provenant_risk as native tools. It stops grepping. It starts reading the wiki.
⭐ GitHub: github.com/shreyashsharma/provenant
What's Next
- Scale self-healing across all 12 repos (not just Django)
- SWE-bench end-to-end: patch generation, not just retrieval
- Figure out when HyDE helps vs hurts on different repo types
- Paper: "Provenant: Attribution-Guided Wiki Indexing for Repository-Level AI Coding Agents" — submitted to IEEE ICAITPR 2026, under review
The retrieval problem in AI coding tools is real and under-measured. BM25 on raw source code is the floor, not the ceiling.
If you try Provenant on your repo, I'm especially interested in two numbers:
- How many tokens your agent was reading before — run with a token counter on your current setup, then compare
- Whether the retrieved wiki pages match the files you would have opened manually — that's the real test, independent of benchmarks
Those two data points are more honest than any eval I can run on my own repos. Happy to compare notes.
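If it helps, here is a rough sketch for collecting both numbers. The four-characters-per-token rule is a crude heuristic, not a real tokenizer, so use your model's tokenizer if you have one. The function names and sample file lists are made up for illustration.

```python
def approx_tokens(text):
    """Crude token estimate: roughly 4 characters per token for English-like text."""
    return len(text) // 4


def retrieval_overlap(retrieved, manual):
    """Compare retrieved pages with the files you would have opened by hand.

    Returns (precision, recall):
      precision = share of retrieved pages you would actually have opened
      recall    = share of your manual picks that were retrieved
    """
    retrieved, manual = set(retrieved), set(manual)
    hits = retrieved & manual
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(manual) if manual else 0.0
    return precision, recall


# Hypothetical example: 3 retrieved pages vs 4 files opened manually.
retrieved = ["scaffold.py", "app.py", "cli.py"]
manual = ["scaffold.py", "app.py", "wrappers.py", "blueprints.py"]

precision, recall = retrieval_overlap(retrieved, manual)
print(approx_tokens("x" * 4000), round(precision, 2), recall)
```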
Benchmarked with DeepSeek-V3.2 · nomic-embed-text-v1.5 · SWE-bench Verified (500 tasks) · 12 Python OSS repos

