95% Token Reduction, 96% Precision:
Benchmarking jCodeMunch Against Chunk RAG and Naive File Reading
Here are the benchmarks... AND the bench!
TL;DR: Across 15 tasks on 3 real repos, structured MCP symbol retrieval achieves 95% avg token reduction vs naive file reading — while hitting 96% precision vs 74% for chunk RAG. The benchmark harness is open-source. You can reproduce every number in under 5 minutes.
This is a follow-up to Your AI Agent Is Dumpster Diving Through Your Code — the argument there is the setup for the proof here. Worth 5 minutes if you haven't read it.
Must-read setup: The first article lays out why file-reading agents waste tokens structurally — not because they're badly written, but because reading whole files is the wrong unit of retrieval for code. This article is the empirical test of that claim.
Last time, I argued that AI coding agents waste absurd amounts of tokens rummaging through whole files and sloppy chunks. Big claim, I know.
So here are the receipts: the benchmark results, the methodology, and the free open-source tool we built so anyone can test the same patterns for themselves.
The Numbers, Up Front
Because you shouldn't have to spelunk for them:
| Metric | Value |
|---|---|
| Baseline tokens (15 task-runs, 3 repos) | 1,865,210 |
| jCodeMunch tokens (same tasks) | 92,515 |
| Average reduction | 95.0% |
| Ratio | 20.2x |
3 repos. 5 queries each. 15 total task-runs. Measured with tiktoken cl100k_base. Last run: March 2026. Baseline is all indexed source files concatenated — the minimum tokens a "read everything first" agent would consume in a single pass. Real agents re-read files and explore multiple branches, so production savings are higher.
The single most dramatic example: for the query "dependency injection" against the FastAPI codebase, a standard file-reading agent consumed 214,312 tokens and cost ~$1.00. The same query through jCodeMunch returned the exact symbols in 480 tokens at ~$0.002.
That's not a rounding error. That's a different category of efficiency.
(Side-by-side terminal output: standard agent greps 156 files, reads 3 of them, burns 214K tokens. jCodeMunch: one search call, one get_symbol call, 480 tokens. Same answer.)
How the MCP Symbol Retrieval Benchmark Works
Three approaches compared on three real public codebases (expressjs/express, fastapi/fastapi, gin-gonic/gin). Same queries against each.
Approach 1: Naive File Reading
The agent reads all source files. This is what agents do when they have no retrieval layer — grep for a keyword, open the files that matched, read them in full. It's also the cleanest possible baseline: we concatenate all source files and count the tokens once. That number is a lower bound on what a real "open everything" agent pays per session.
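As a rough sketch of how that concatenated baseline can be computed — note this is illustrative, not the harness's actual code, and it uses a whitespace split as a crude stand-in for tiktoken's `cl100k_base` encoder (the function name and extension list are my assumptions):

```python
import os

def rough_token_count(text: str) -> int:
    # Stand-in for tiktoken's cl100k_base encoder: a whitespace split
    # undercounts real BPE tokens but preserves relative comparisons.
    return len(text.split())

def naive_baseline_tokens(repo_root: str, exts=(".py", ".js", ".go")) -> int:
    """Concatenate every source file once and count tokens -- the lower
    bound a 'read everything first' agent pays in a single pass."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        for name in filenames:
            if name.endswith(exts):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    total += rough_token_count(f.read())
    return total
```

Because this counts a single pass, it understates what a real agent pays across a session of repeated reads — which is exactly why the benchmark calls it a lower bound.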
Approach 2: Chunk-Based RAG
Files are split into overlapping windows of text, embedded, and ranked by similarity to the query. The top chunks are returned. It's cheaper than naive — but chunk boundaries fall in the middle of functions, and similarity ranking is approximate by design. You pay less and you get less precise results.
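A minimal sketch of that windowing step (fixed-size character windows; the sizes and the `chunk_text` helper are illustrative, not any particular RAG library's API):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size overlapping character windows, the way
    chunk RAG pipelines typically do. The boundaries ignore syntax
    entirely, so a window can begin or end mid-function."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Run this over any source file and the boundary problem is visible immediately: windows routinely cut through a function signature or body, which is the structural root of the precision gap measured below.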
Approach 3: Tree-Sitter Structured Retrieval (jCodeMunch via jMRI)
Every source file is parsed into an AST-derived index of named, addressable symbols: functions, classes, methods, constants. A search query returns symbol IDs and metadata — not source. A retrieve call on a specific ID returns exactly that symbol's source, byte-offset-addressed directly from the original file. No chunks. No boundaries. No guessing.
The workflow measured:
- `search_symbols(query, max_results=5)` → ranked symbol metadata (IDs + signatures)
- `get_symbol(id)` × 3 → exact source for the top 3 hits
- Total tokens = search response + 3 × symbol source
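The per-task accounting behind those numbers can be sketched like this — the search hits and symbol sources are stubbed inputs, and `task_tokens` is my illustrative bookkeeping, not the harness's code (again with a whitespace split standing in for tiktoken):

```python
from dataclasses import dataclass

@dataclass
class SymbolHit:
    symbol_id: str
    signature: str

def count_tokens(text: str) -> int:
    # Stand-in for tiktoken cl100k_base in this sketch.
    return len(text.split())

def task_tokens(search_response: list[SymbolHit], sources: list[str]) -> int:
    """Total tokens for one measured task-run: the search response
    (IDs + signatures) plus the source of the top-3 retrieved symbols."""
    search_cost = sum(count_tokens(h.symbol_id + " " + h.signature)
                      for h in search_response)
    retrieval_cost = sum(count_tokens(src) for src in sources[:3])
    return search_cost + retrieval_cost
```

The key property is that both terms scale with the size of the *answer*, not the size of the repo — which is why the reductions below grow with codebase size.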
Introducing jMunchWorkbench: The Measuring Stick
The benchmark harness measures token efficiency. But token efficiency is only half the story — the other half is: did you actually retrieve the right code?
That's what jMunchWorkbench is for.
jMunchWorkbench runs the same prompt in two modes — baseline file reading and jCodeMunch symbol retrieval — and compares answers, token counts, and latency side by side. A human evaluator judges whether the top-3 retrieved symbols are relevant to the query intent. That's how the 96% precision figure is generated.
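The metric itself is ordinary precision-at-k; the judgments are human, but the arithmetic can be sketched in a few lines (function name and shape are mine, not jMunchWorkbench's internals):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 3) -> float:
    """Fraction of the top-k retrieved symbol IDs judged relevant to the
    query intent. Averaged over all queries, this yields the reported
    precision figure."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for sym in top if sym in relevant) / len(top)
```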
This is the measuring stick, not just the number. You are not being asked to trust a claim. You are being given the instrument we used to make the claim, and you can run it yourself.
"We got 96% precision" is a marketing assertion. "Here is the evaluator, here are the queries, here is the ground truth, reproduce it" is a methodology.
The Full Benchmark: Tree-Sitter Retrieval vs Naive Reading
Tokenizer: tiktoken cl100k_base | Workflow: search_symbols (top 5) + get_symbol × 3 | Last run: March 2026
| Repo | Files | Symbols | Baseline tokens | jMunch avg tokens | Reduction | Ratio |
|---|---|---|---|---|---|---|
| expressjs/express | 34 | 117 | 73,838 | ~966 | 98.4% | 129.7x |
| fastapi/fastapi | 156 | 1,359 | 214,312 | ~15,609 | 92.7% | 49.5x |
| gin-gonic/gin | 40 | 805 | 84,892 | ~1,728 | 98.0% | 50.7x |
| Grand total | 230 | 2,281 | 1,865,210 | 92,515 | 95.0% | 20.2x |

(Reading the table: per-repo baseline columns show a single pass over that repo, while the grand-total baseline sums all 15 task-runs. Per-repo ratios are averaged across the five queries; the grand-total ratio is total baseline ÷ total jMunch tokens.)
The FastAPI numbers are the "worst" case — and they still show 92.7% reduction. FastAPI is the largest and most symbol-dense codebase of the three (156 files, 1,359 symbols). Broad queries like "router route handler" pull in more symbols and more source. Specific queries like "error exception" and "context bind" return surgical results even on a large codebase, hitting 99% reduction.
Per-query detail: fastapi/fastapi
| Query | Baseline tokens | jMunch tokens | Reduction | Ratio |
|---|---|---|---|---|
| router route handler | 214,312 | 43,474 | 79.7% | 4.9x |
| middleware | 214,312 | 24,271 | 88.7% | 8.8x |
| error exception | 214,312 | 2,233 | 99.0% | 96.0x |
| request response | 214,312 | 5,966 | 97.2% | 35.9x |
| context bind | 214,312 | 2,102 | 99.0% | 102.0x |
Per-query detail: expressjs/express
| Query | Baseline tokens | jMunch tokens | Reduction | Ratio |
|---|---|---|---|---|
| router route handler | 73,838 | 1,221 | 98.3% | 60.5x |
| middleware | 73,838 | 1,360 | 98.2% | 54.3x |
| error exception | 73,838 | 1,381 | 98.1% | 53.5x |
| request response | 73,838 | 1,699 | 97.7% | 43.5x |
| context bind | 73,838 | 169 | 99.8% | 436.9x |
Per-query detail: gin-gonic/gin
| Query | Baseline tokens | jMunch tokens | Reduction | Ratio |
|---|---|---|---|---|
| router route handler | 84,892 | 1,355 | 98.4% | 62.7x |
| middleware | 84,892 | 2,178 | 97.4% | 39.0x |
| error exception | 84,892 | 1,470 | 98.3% | 57.7x |
| request response | 84,892 | 1,642 | 98.1% | 51.7x |
| context bind | 84,892 | 1,994 | 97.7% | 42.6x |
Three-Way Comparison: Naive vs Chunk RAG vs Structured Retrieval
The summary table above shows naive vs jCodeMunch. Here's where chunk RAG fits. These numbers are from jMunchWorkbench precision evaluation runs on the FastAPI codebase — the largest and hardest test case. The naive baseline here reflects actual agent-session overhead (multiple file reads across a session), which is why it differs from the concatenated baseline above.
| Method | Avg tokens (FastAPI) | Cost/query | Precision |
|---|---|---|---|
| Naive file reading | 949,904 | ~$2.85 | 100% |
| Chunk-based RAG | 330,372 | ~$0.99 | 74% |
| jCodeMunch (structured) | 480 | ~$0.0014 | 96% |
The conventional assumption is that you trade precision for efficiency — chunk RAG is "cheaper but less accurate than reading everything." jCodeMunch inverts that tradeoff: cheaper than both, and more accurate than chunks.
96% precision vs 74% is not a marginal improvement. That's the difference between an agent that finds the right function 19 times out of 20, and one that finds it 15 times out of 20 — with 4 wasted retrieval roundtrips per 20 queries at chunk-RAG prices.
Real-World A/B Tests: Reduce Claude Code Token Usage in Production
Synthetic benchmark queries are necessary for reproducibility, but they're not sufficient. You need to know what happens in production.
Big thanks to community member @Mharbulous, who ran two independent 50-iteration A/B tests on a real Vue 3 + Firebase production codebase (Vite, Vuetify 3, Cloud Functions) and open-sourced the raw data. Same task, same model (Claude Sonnet), randomized tool assignment — this is the kind of rigorous community testing that keeps benchmark authors honest.
Test 1: Naming Audit Task
| Metric | Native tools | jCodeMunch | Delta |
|---|---|---|---|
| Success rate | 72% | 80% | +8 pp |
| Timeout rate | 40% | 32% | −8 pp |
| Mean cost/iteration | $0.783 | $0.738 | −5.7% |
| Mean cache creation tokens | 104,135 | 93,178 | −10.5% |
Test 2: Dead Code Detection (Isolated Tool-Layer Cost)
| Metric | Native tools | jCodeMunch | Delta |
|---|---|---|---|
| Mean cost/iteration | $0.4474 | $0.3560 | −20.0% |
| Mean total tokens | 449,356 | 289,275 | −36% |
| Mean duration | 129s | 117s | −9% |
| Dead file F1 | 95.8% | 95.7% | equivalent |
Cost savings in the dead code test: Wilcoxon p=0.0074, Cohen's d=−0.583. Statistically significant.
The mechanism is direct: structured queries return smaller payloads than raw file reads — 39% fewer cache reads, lower cost, faster completion. Dead file detection is equivalent at ~96% F1 with no accuracy penalty.
One honest note: there's an export-level classification gap (alive-file exports) that @Mharbulous surfaced. Three root causes were found and addressed in v1.8.1. The data is in the repo — not buried, not redacted.
Raw data: gist.github.com/Mharbulous/bb097396fa92ef1d34d03a72b56b2c61
(The live counter at j.gravelle.us/jCodeMunch pulls real session telemetry across all participating users — every get_symbol call reports tokens_saved locally. No estimation. The number you see is computed from actual session data.)
Why Tree-Sitter Beats Chunk RAG for AI Coding Agents
The difference is not magic. It's granularity.
Chunk RAG cuts files into overlapping text windows and ranks them by embedding similarity. It has two structural problems: boundaries fall in the middle of functions (so you get partial logic), and similarity ranking returns things that look like the answer rather than things that are the answer.
Structured retrieval works at the right level of abstraction from the start. jCodeMunch uses tree-sitter to parse source into a symbol index. Each symbol has a deterministic ID and a byte offset in the original file. Search returns IDs. Retrieval seeks directly to the byte offset — O(1) access, no re-scanning, no boundary accidents. You get the whole function or class, exactly as written, every time.
The precision gap (96% vs 74%) is a structural consequence of this. When you retrieve a symbol by AST-derived ID, you get the complete logical unit. When you retrieve by similarity score, you get a text window that may or may not contain it.
Reproduce It in Under 5 Minutes
```shell
pip install jcodemunch-mcp tiktoken

# Index the three benchmark repos
jcodemunch index_repo expressjs/express
jcodemunch index_repo fastapi/fastapi
jcodemunch index_repo gin-gonic/gin

# Run the benchmark
python benchmarks/harness/run_benchmark.py

# Write results to markdown
python benchmarks/harness/run_benchmark.py --out my_results.md
```
The harness reads tasks.json (the same five queries we used), calls search_symbols and get_symbol, counts tokens with tiktoken, and writes the same markdown tables you see in results.md. Swap in your own repos, add your own queries, change the number of symbols fetched. The full methodology is in METHODOLOGY.md.
For precision measurement, run jMunchWorkbench — same query set, human-evaluable relevance scoring, side-by-side comparison output.
If you think the results are great: reproduce them.
If you think they're flawed: reproduce them harder.
That's why we built the bench.
Try jCodeMunch: Reduce Claude Code Token Usage Now
Free for non-commercial use (personal projects, research, hobby). One-minute setup:
```shell
uvx jcodemunch-mcp
```
Add to ~/.claude.json:
```json
{
  "mcpServers": {
    "jcodemunch-mcp": {
      "command": "uvx",
      "args": ["jcodemunch-mcp"]
    }
  }
}
```
Team and org licenses start at $79 (Builder, 1 dev) / $349 (Studio, up to 5 devs) / $1,999 (Platform, org-wide) — see pricing.
- GitHub: github.com/jgravelle/jcodemunch-mcp
- Benchmark harness: benchmarks/harness/run_benchmark.py
- Methodology: benchmarks/METHODOLOGY.md
- Workbench: github.com/jgravelle/jMunchWorkbench
- Open spec (jMRI): github.com/jgravelle/mcp-retrieval-spec
FAQ
Is this better than Cursor's built-in indexing?
Cursor's indexing is optimized for autocomplete and inline suggestions — it uses chunk embeddings at the file level. jCodeMunch is optimized for agent retrieval: symbol-level precision, AST-derived IDs, O(1) byte-offset access. Different tools, different problems. If you're running agents that need to retrieve specific functions or classes from large repos, jCodeMunch is purpose-built for that.
Does it work with Gemini, Antigravity, or Cursor in agent mode?
Yes. jCodeMunch implements the Model Context Protocol (MCP), which works with any MCP-compatible client: Claude Code, Google Antigravity, Cursor (agent/composer mode), and anything else that speaks MCP. Setup is identical across clients — add the server to your MCP config, restart, index a repo.
How does byte-offset retrieval avoid chunk boundary issues?
When jCodeMunch indexes a file with tree-sitter, it stores each symbol's start and end byte positions in the original source file alongside the symbol ID. At retrieval time, get_symbol opens the file, seeks directly to that byte offset, and reads exactly that many bytes. No file scanning. No chunking. No re-parsing. The result is always the complete syntactic unit — function body, class definition, or constant — as it appears in the original source.
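Mechanically, that retrieval step can be sketched in a few lines — the `(start, end)` offsets are assumed to come from the tree-sitter index, and `get_symbol_source` is my illustrative name, not the tool's internal API:

```python
def get_symbol_source(path: str, start: int, end: int) -> str:
    """Seek straight to a symbol's stored byte range in the original
    file and read exactly that span -- no scanning, no chunking, no
    re-parsing. The span is a complete syntactic unit by construction."""
    with open(path, "rb") as f:
        f.seek(start)          # O(1) jump to the symbol's first byte
        return f.read(end - start).decode("utf-8")
```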
Why not just use a larger context window instead?
Context window cost scales linearly with tokens consumed. A 200K-token context window doesn't make 200K tokens cheaper — it just lets you burn more of them before hitting the limit. jCodeMunch keeps retrieval cost near zero by returning only the exact symbols the agent needs, regardless of repo size. The larger the repo, the bigger the advantage.
Does this work with non-Python repos?
Yes. jCodeMunch uses tree-sitter for parsing, which supports 30+ languages including TypeScript, Go, Rust, Java, C/C++, Ruby, and more. The three benchmark repos span Python, JavaScript, and Go for exactly this reason.
Benchmark source: benchmarks/harness/run_benchmark.py | Tokenizer: tiktoken cl100k_base | Data: benchmarks/results.md | Last run: March 2026
A/B test raw data by @Mharbulous: gist.github.com/Mharbulous/bb097396fa92ef1d34d03a72b56b2c61