Two months after publishing the headline, here are the receipts.
Two months ago I published "Your AI Agent Is Dumpster Diving Through Your Code." The most common reply was some flavor of: "Cool numbers, but how did you actually measure them?"
Fair question. Here's the answer.
## What we measured
The jCodeMunch benchmark measures retrieval token efficiency — how many LLM input tokens a code-exploration tool consumes compared to reading all source files. It does not measure answer quality, latency, or end-to-end task completion. Those are separate axes (we measure precision separately in jMunchWorkbench, but that's a different post).
Three repos, five queries, run on 2026-03-28:
| Repository | Files | Symbols | Baseline tokens |
|---|---|---|---|
| expressjs/express | 165 | 181 | 137,978 |
| fastapi/fastapi | 951 | 5,325 | 699,425 |
| gin-gonic/gin | 98 | 1,489 | 187,018 |
The five queries cover the most common code-exploration intents I see in the wild: `router route handler`, `middleware`, `error exception`, `request response`, `context bind`. Three repos × five queries = 15 task-runs; a sketch of how the harness might enumerate them follows.
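For concreteness, here's one way to enumerate those 15 task-runs in Python. This is an illustrative literal only; the real corpus lives in `benchmarks/tasks.json`, and its actual field names may differ.

```python
# Illustrative only: the published corpus is benchmarks/tasks.json.
# The field names ("repo", "query") are assumptions, not the real schema.
TASKS = [
    {"repo": repo, "query": query}
    for repo in ("expressjs/express", "fastapi/fastapi", "gin-gonic/gin")
    for query in (
        "router route handler",
        "middleware",
        "error exception",
        "request response",
        "context bind",
    )
]
assert len(TASKS) == 15  # three repos × five queries
```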
- Total baseline cost across all 15: 5,122,105 tokens.
- Total jcodemunch cost across all 15: 19,406 tokens.
- Reduction: 99.6% average.
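The headline number is nothing more exotic than a token ratio. Checking it from the totals above:

```python
baseline_tokens = 5_122_105   # "read everything" cost, summed over 15 task-runs
jcodemunch_tokens = 19_406    # search + top-3 symbol fetches, summed over 15 task-runs

reduction = 1 - jcodemunch_tokens / baseline_tokens
factor = baseline_tokens / jcodemunch_tokens

print(f"reduction: {reduction:.1%}")  # reduction: 99.6%
print(f"factor:    {factor:.0f}x")    # factor:    264x
```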
## How the runs work
For each query, the harness measures two costs and compares them.
**Baseline:** concatenate every file in the repo, tokenize, count. This is the "read everything first" agent — the minimum cost to put the entire codebase in context.
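A minimal sketch of the baseline side, assuming tiktoken's `cl100k_base` (per the tokenizer note below) and a few illustrative skip patterns. The real implementation is `benchmarks/harness/run_benchmark.py`; this is not its code.

```python
from pathlib import Path
import tiktoken

SKIP_DIRS = {".git", "node_modules", "__pycache__"}  # illustrative skip patterns

def baseline_tokens(repo_root: str) -> int:
    """Cost of the 'read everything first' agent: every file, concatenated and tokenized."""
    enc = tiktoken.get_encoding("cl100k_base")
    total = 0
    for path in Path(repo_root).rglob("*"):
        if not path.is_file() or SKIP_DIRS & set(path.parts):
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        total += len(enc.encode(text, disallowed_special=()))
    return total
```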
**jcodemunch:** call `search_symbols(query, max_results=5)`, then `get_symbol_source()` on the top 3 matching symbol IDs. Total tokens = search response tokens + 3 × symbol source tokens.
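The tool side in the same sketch style. `search_symbols` and `get_symbol_source` are the calls named above; everything about the `client` object and the response shape here is an assumption, and tokens are counted over the serialized JSON per the note two paragraphs down.

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def response_tokens(response: dict) -> int:
    # Count the serialized JSON string, field names and all: deterministic,
    # and slightly pessimistic for jcodemunch.
    return len(enc.encode(json.dumps(response)))

def jcodemunch_tokens(client, query: str) -> int:
    """Search cost plus the top-3 symbol-source fetches."""
    search = client.search_symbols(query, max_results=5)       # tool call from the post
    symbol_ids = [hit["id"] for hit in search["results"]][:3]  # response shape assumed
    return response_tokens(search) + sum(
        response_tokens(client.get_symbol_source(sid)) for sid in symbol_ids
    )
```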
AI summaries are disabled during benchmarking (signature-only fallback). Without that, jcodemunch numbers would look even better, but it would conflate retrieval efficiency with summarization efficiency, and those are separable concerns.
Token counts come from the serialized JSON response strings, not raw source bytes — JSON field names and structure overhead are included. This slightly understates the savings, but the count is deterministic and reproducible.
Tokenizer: tiktoken with the cl100k_base encoding. It's the encoding GPT-4 uses, and its counts land within roughly 5% of Claude's estimates.
The "common misreadings"
The methodology doc has a section called Common Misreadings. It addresses the four pushbacks I get every time I publish a number.
"The claim is up to 99%." No — the primary claim is 99.6% average across all 15 task-runs. Individual queries reach 99.9% on large repos with tight symbol matches (error exception on fastapi/fastapi: 99.9%, 801× reduction). The 99.6% aggregate is the honest headline across the current index state.
"I tested it on a different repo and got 80%." Results vary by repo structure. Flat script collections — repositories of hundreds of unrelated standalone scripts — produce lower savings, because the symbol index can't distinguish which script is relevant to a given query, and the agent still has to scan broadly. The benchmark repos (Express, FastAPI, Gin) are structured application codebases where symbol-based navigation is most effective. Testing a flat script collection and comparing to our benchmark is apples-to-oranges. I say so out loud because pretending otherwise would be dishonest.
"The benchmark is cherry-picked." The three repos were chosen to represent common backend frameworks across different languages — JavaScript, Python, Go. No file filtering beyond standard skip patterns. The harness (benchmarks/harness/run_benchmark.py) and query corpus (benchmarks/tasks.json) are in the repo. Run them yourself. Publish your numbers. If they're better than mine, I'll cite you.
"The baseline is unrealistic." It is — but it's intentionally a lower bound. Real agents re-read files, branch across sessions, and load documentation. Actual production baseline costs are higher than what I report. That makes the 99.6% number a conservative floor, not a ceiling.
## What others have measured independently
Two reviewers ran their own numbers and published them.
Artur Skowroński (VirtusLab GitHub All-Stars #15): "roughly 80% fewer tokens, or 5× more efficient — index once, query cheaply forever."
Julian Horsey (Geeky Gadgets): "3,850 tokens reduced to just 700 — a 5.5× improvement."
These don't contradict the 99.6%. They're testing different workloads at different scales. Smaller per-query savings multiplied across hundreds of queries per session produce the same compounding effect at scale. Multiple methodologies converging in the same direction is exactly what you want from a benchmark you'd actually rely on.
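To make the compounding concrete, take Horsey's per-query figures and a hypothetical session length. The 300 queries below is an assumption for illustration, not a measured number:

```python
baseline_per_query = 3_850    # Horsey's measured baseline
jcodemunch_per_query = 700    # Horsey's measured jcodemunch cost
queries_per_session = 300     # hypothetical: a busy agent session

saved = (baseline_per_query - jcodemunch_per_query) * queries_per_session
print(f"{saved:,} tokens saved per session")  # 945,000 tokens saved per session
```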
## How to run it yourself
```bash
pip install jcodemunch-mcp tiktoken
jcodemunch index_repo expressjs/express
jcodemunch index_repo fastapi/fastapi
jcodemunch index_repo gin-gonic/gin
python benchmarks/harness/run_benchmark.py
```
Every release in v1.76.0+ runs the benchmark in CI with a regression gate at 0.02 — if any aggregate metric drops by more than 2% versus the saved baseline, the build fails. I don't ship without the receipts.
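The gate itself is simple enough to sketch. This illustrates the 2% threshold described above; the file paths and metric keys are assumptions, not the actual CI script:

```python
import json
import sys

GATE = 0.02  # fail the build if any aggregate metric drops more than 2%

saved = json.load(open("benchmarks/baseline_metrics.json"))    # assumed path
current = json.load(open("benchmarks/current_metrics.json"))   # assumed path

regressions = [
    name for name, old in saved.items()
    if (old - current[name]) / old > GATE
]
if regressions:
    sys.exit(f"regression gate tripped: {', '.join(regressions)}")
```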
## What this means for you
If you're building agents on Claude Max, hitting context-window pain, or paying API bills that scale with token count: don't take my word for it. Run the calculator on your own stack.
It tells you what you're spending today, what jcodemunch saves, and what that maps to annually. Takes 60 seconds. Costs nothing.
The benchmark is a number. Your own stack is the number that matters.