<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: J. Gravelle</title>
    <description>The latest articles on DEV Community by J. Gravelle (@jgravelle).</description>
    <link>https://dev.to/jgravelle</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3815404%2F5c973ed1-5fef-4135-b089-56d7dbd03d4d.jpg</url>
      <title>DEV Community: J. Gravelle</title>
      <link>https://dev.to/jgravelle</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jgravelle"/>
    <language>en</language>
    <item>
      <title>A Radical Diet for Karpathy’s Token-Eating LLM Wiki</title>
      <dc:creator>J. Gravelle</dc:creator>
      <pubDate>Sun, 12 Apr 2026 12:57:32 +0000</pubDate>
      <link>https://dev.to/jgravelle/a-radical-diet-for-karpathys-token-eating-llm-wiki-59ng</link>
      <guid>https://dev.to/jgravelle/a-radical-diet-for-karpathys-token-eating-llm-wiki-59ng</guid>
      <description>&lt;h2&gt;
  
  
  By Friday, Your Token Bill Looks Like a Phone Number
&lt;/h2&gt;

&lt;p&gt;You did everything right.&lt;/p&gt;

&lt;p&gt;You read Karpathy’s post. It clicked immediately. Not because it was simple, but because you’ve lived the pain: spending the first 20 minutes of every session re-teaching a model what you already taught it yesterday.&lt;/p&gt;

&lt;p&gt;The LLM Wiki idea felt like a jailbreak.&lt;/p&gt;

&lt;p&gt;Compiled knowledge.&lt;br&gt;
Persistent artifacts.&lt;br&gt;
A second brain that compounds instead of resets.&lt;/p&gt;

&lt;p&gt;So you built it.&lt;/p&gt;

&lt;p&gt;You stood up the structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;raw/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;wiki/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;index.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;log.md&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You wrote a schema. You defined ingest. You defined lint. You fed it sources.&lt;/p&gt;

&lt;p&gt;And it worked.&lt;/p&gt;

&lt;p&gt;The model moved through your wiki like it had memory.&lt;br&gt;
Cross-references held. Synthesis stuck.&lt;br&gt;
For the first time in a while, you weren’t rebuilding context—you were building on top of it.&lt;/p&gt;

&lt;p&gt;For a moment, you were ahead.&lt;/p&gt;

&lt;p&gt;Then the wiki grew.&lt;/p&gt;

&lt;p&gt;No alarms. No failure message. Just drag.&lt;/p&gt;

&lt;p&gt;Queries got heavier.&lt;br&gt;
Answers got softer.&lt;br&gt;
The model started missing things you &lt;em&gt;knew&lt;/em&gt; were there.&lt;/p&gt;

&lt;p&gt;You linted it. Structurally clean.&lt;/p&gt;

&lt;p&gt;But something had shifted.&lt;/p&gt;

&lt;p&gt;The context window was getting crowded, and &lt;code&gt;index.md&lt;/code&gt; was quietly becoming the bottleneck.&lt;/p&gt;

&lt;p&gt;Here’s the part nobody says out loud in the “RAG is dead” takes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The LLM Wiki doesn’t eliminate token cost. It moves it.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You traded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;per-query retrieval cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;per-session compilation cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s a fantastic trade…&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;until the wiki outgrows the window.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Shape of the Problem (Before You Even See the Numbers)
&lt;/h2&gt;

&lt;p&gt;You don’t need a benchmark to understand what’s happening.&lt;/p&gt;

&lt;p&gt;You just need to see the curve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvyuuhdk13skwxbud43z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvyuuhdk13skwxbud43z.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One approach scales with size.&lt;br&gt;
The other scales with the answer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s the entire story.&lt;/p&gt;


&lt;h2&gt;
  
  
  What Karpathy Actually Proposed (And Why It Hit So Hard)
&lt;/h2&gt;

&lt;p&gt;Most takes flattened this into “RAG killer.”&lt;/p&gt;

&lt;p&gt;That’s not what it is.&lt;/p&gt;

&lt;p&gt;RAG is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stateless&lt;/li&gt;
&lt;li&gt;query-time&lt;/li&gt;
&lt;li&gt;recompute everything, every time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM Wiki flips that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;do the expensive thinking once (ingest)&lt;/li&gt;
&lt;li&gt;resolve contradictions early&lt;/li&gt;
&lt;li&gt;build cross-references up front&lt;/li&gt;
&lt;li&gt;store the result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then query the artifact.&lt;/p&gt;

&lt;p&gt;That’s not retrieval.&lt;/p&gt;

&lt;p&gt;That’s compilation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The wiki is a &lt;strong&gt;persistent, compounding artifact&lt;/strong&gt;. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And that idea landed because it mirrors how developers already think.&lt;/p&gt;

&lt;p&gt;You don’t recompile your entire codebase on every function call.&lt;/p&gt;

&lt;p&gt;You compile once.&lt;br&gt;
You run cheap.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Architecture (Why This Works at All)
&lt;/h2&gt;

&lt;p&gt;Three layers:&lt;/p&gt;
&lt;h3&gt;
  
  
  Raw Sources
&lt;/h3&gt;

&lt;p&gt;Immutable truth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PDFs&lt;/li&gt;
&lt;li&gt;repos&lt;/li&gt;
&lt;li&gt;transcripts&lt;/li&gt;
&lt;li&gt;research&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Wiki
&lt;/h3&gt;

&lt;p&gt;LLM-owned markdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;synthesized&lt;/li&gt;
&lt;li&gt;structured&lt;/li&gt;
&lt;li&gt;cross-linked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what you paid to build.&lt;/p&gt;
&lt;h3&gt;
  
  
  Schema
&lt;/h3&gt;

&lt;p&gt;The discipline layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingest rules&lt;/li&gt;
&lt;li&gt;page structure&lt;/li&gt;
&lt;li&gt;lint behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without it, you don’t have a system.&lt;br&gt;
You have a pile of files.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Three Operations That Matter
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Ingest
&lt;/h3&gt;

&lt;p&gt;New source → updates multiple pages (often 10–15)&lt;/p&gt;

&lt;p&gt;Expensive? Yes.&lt;br&gt;
Correct? Also yes.&lt;/p&gt;

&lt;p&gt;You’re building connections up front instead of rediscovering them forever.&lt;/p&gt;


&lt;h3&gt;
  
  
  Query
&lt;/h3&gt;

&lt;p&gt;Ask → route through wiki → optionally write back&lt;/p&gt;

&lt;p&gt;Every query improves the system.&lt;/p&gt;


&lt;h3&gt;
  
  
  Lint
&lt;/h3&gt;

&lt;p&gt;Detect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;orphan pages&lt;/li&gt;
&lt;li&gt;stale knowledge&lt;/li&gt;
&lt;li&gt;contradictions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Karpathy’s observation holds:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The tedious part is the bookkeeping.” &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;LLMs are extremely good at bookkeeping.&lt;/p&gt;
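Orphan detection, at least, is cheap enough to mechanize outside the model entirely. A minimal sketch, assuming pages are flat `.md` files cross-referenced with standard markdown links (the function name and layout here are illustrative, not part of Karpathy's proposal or any particular tool):

```python
import re
from pathlib import Path

def find_orphans(wiki_dir):
    """Report wiki pages that no other page links to.

    Assumes pages are .md files in one directory and that
    cross-references use markdown links like [title](page.md).
    """
    pages = {p.name for p in Path(wiki_dir).glob("*.md")}
    linked = set()
    for page in Path(wiki_dir).glob("*.md"):
        text = page.read_text(encoding="utf-8")
        for target in re.findall(r"\]\(([^)]+\.md)\)", text):
            linked.add(Path(target).name)
    # index.md is the entry point, so it is never an orphan
    return sorted(pages - linked - {"index.md"})
```

Stale-knowledge and contradiction checks still need a model; link-graph bookkeeping doesn't.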


&lt;h2&gt;
  
  
  Where It Starts to Crack
&lt;/h2&gt;

&lt;p&gt;The failure mode isn’t theoretical. It’s structural.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. &lt;code&gt;index.md&lt;/code&gt; Becomes a Liability
&lt;/h3&gt;

&lt;p&gt;Navigation assumes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;load the index → navigate&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That works until it doesn’t.&lt;/p&gt;

&lt;p&gt;In practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~50K–100K tokens → starts breaking&lt;/li&gt;
&lt;li&gt;beyond that → unreliable navigation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;truncate it (lose coverage), or&lt;/li&gt;
&lt;li&gt;load it (lose quality)&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  2. Long Context Isn’t Actually Long
&lt;/h3&gt;

&lt;p&gt;Marketing says million-token windows.&lt;/p&gt;

&lt;p&gt;Reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~200K–300K → quality degradation begins&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;missed links&lt;/li&gt;
&lt;li&gt;weaker synthesis&lt;/li&gt;
&lt;li&gt;subtle drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing crashes. It just gets worse.&lt;/p&gt;


&lt;h3&gt;
  
  
  3. Maintenance Cost Compounds Too
&lt;/h3&gt;

&lt;p&gt;Each ingest touches multiple pages.&lt;/p&gt;

&lt;p&gt;That’s correct behavior.&lt;/p&gt;

&lt;p&gt;It also means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more tokens&lt;/li&gt;
&lt;li&gt;more updates&lt;/li&gt;
&lt;li&gt;more cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You built a compounding asset.&lt;/p&gt;

&lt;p&gt;You also built a compounding bill.&lt;/p&gt;


&lt;h3&gt;
  
  
  4. The Irony
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;You didn’t eliminate retrieval.&lt;br&gt;
You postponed it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At scale, your wiki becomes the corpus.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Numbers (This Is Where It Gets Real)
&lt;/h2&gt;

&lt;p&gt;From the jDocMunch Wiki benchmark:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Corpus: 7 pages, 7,449 tokens &lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Results -- Full Wiki Baseline (Realistic)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;jDocMunch&lt;/th&gt;
&lt;th&gt;Saved&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cross repository dependency tracking&lt;/td&gt;
&lt;td&gt;7,449&lt;/td&gt;
&lt;td&gt;599&lt;/td&gt;
&lt;td&gt;6,850&lt;/td&gt;
&lt;td&gt;92.0%&lt;/td&gt;
&lt;td&gt;12.4x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;benchmark token reduction measurement&lt;/td&gt;
&lt;td&gt;7,449&lt;/td&gt;
&lt;td&gt;314&lt;/td&gt;
&lt;td&gt;7,135&lt;/td&gt;
&lt;td&gt;95.8%&lt;/td&gt;
&lt;td&gt;23.7x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;search scoring ranking debug&lt;/td&gt;
&lt;td&gt;7,449&lt;/td&gt;
&lt;td&gt;344&lt;/td&gt;
&lt;td&gt;7,105&lt;/td&gt;
&lt;td&gt;95.4%&lt;/td&gt;
&lt;td&gt;21.7x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;incremental indexing blob SHA performance&lt;/td&gt;
&lt;td&gt;7,449&lt;/td&gt;
&lt;td&gt;313&lt;/td&gt;
&lt;td&gt;7,136&lt;/td&gt;
&lt;td&gt;95.8%&lt;/td&gt;
&lt;td&gt;23.8x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;context bundle symbol imports&lt;/td&gt;
&lt;td&gt;7,449&lt;/td&gt;
&lt;td&gt;304&lt;/td&gt;
&lt;td&gt;7,145&lt;/td&gt;
&lt;td&gt;95.9%&lt;/td&gt;
&lt;td&gt;24.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total (5 queries)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;37,245&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,874&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;35,371&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;95.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;19.9x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  Results -- Single File Baseline (Conservative)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;jDocMunch&lt;/th&gt;
&lt;th&gt;Saved&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;cross repository dependency tracking&lt;/td&gt;
&lt;td&gt;1,700&lt;/td&gt;
&lt;td&gt;599&lt;/td&gt;
&lt;td&gt;1,101&lt;/td&gt;
&lt;td&gt;64.8%&lt;/td&gt;
&lt;td&gt;2.8x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;benchmark token reduction measurement&lt;/td&gt;
&lt;td&gt;1,022&lt;/td&gt;
&lt;td&gt;314&lt;/td&gt;
&lt;td&gt;708&lt;/td&gt;
&lt;td&gt;69.3%&lt;/td&gt;
&lt;td&gt;3.3x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;search scoring ranking debug&lt;/td&gt;
&lt;td&gt;899&lt;/td&gt;
&lt;td&gt;344&lt;/td&gt;
&lt;td&gt;555&lt;/td&gt;
&lt;td&gt;61.7%&lt;/td&gt;
&lt;td&gt;2.6x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;incremental indexing blob SHA performance&lt;/td&gt;
&lt;td&gt;812&lt;/td&gt;
&lt;td&gt;313&lt;/td&gt;
&lt;td&gt;499&lt;/td&gt;
&lt;td&gt;61.5%&lt;/td&gt;
&lt;td&gt;2.6x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;context bundle symbol imports&lt;/td&gt;
&lt;td&gt;914&lt;/td&gt;
&lt;td&gt;304&lt;/td&gt;
&lt;td&gt;610&lt;/td&gt;
&lt;td&gt;66.7%&lt;/td&gt;
&lt;td&gt;3.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total (5 queries)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5,347&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,874&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3,473&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;65.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.9x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  Visual Reality
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Full Wiki: █████████████████████████████████████
jDocMunch: ██
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Same questions.&lt;br&gt;
Same corpus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;~95% less context.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Money Shot
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FULL WIKI:     37,245 tokens
jDocMunch:      1,874 tokens

SAVED:         35,371 tokens
REDUCTION:     95.0%
EFFICIENCY:    19.9×
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That’s not optimization.&lt;/p&gt;

&lt;p&gt;That’s a different class of system.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why This Happens (Not Magic, Just Mechanics)
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Baseline A (full wiki):    37,245 tokens across 5 queries
jDocMunch workflow:         1,874 tokens across 5 queries
                           ─────────────────────────────
Saved:                     35,371 tokens (95.0%)
Average ratio:             19.9x

Baseline B (target file):   5,347 tokens across 5 queries
jDocMunch workflow:         1,874 tokens across 5 queries
                           ─────────────────────────────
Saved:                      3,473 tokens (65.0%)
Average ratio:              2.9x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You’re not compressing data.&lt;/p&gt;

&lt;p&gt;You’re avoiding loading irrelevant data.&lt;/p&gt;

&lt;p&gt;That’s it.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Structural Shift
&lt;/h2&gt;

&lt;p&gt;Old world:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;cost ∝ wiki size&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;New world:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;cost ∝ answer complexity&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s the entire advantage.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Fix: Stop Loading the Wiki
&lt;/h2&gt;

&lt;p&gt;The mistake isn’t your structure.&lt;/p&gt;

&lt;p&gt;It’s your access pattern.&lt;/p&gt;

&lt;p&gt;You’re treating the wiki like a document.&lt;/p&gt;

&lt;p&gt;It’s not.&lt;/p&gt;

&lt;p&gt;It’s a dataset.&lt;/p&gt;


&lt;h2&gt;
  
  
  Enter jDocMunch (Right Where You Need It)
&lt;/h2&gt;

&lt;p&gt;jDocMunch doesn’t replace the wiki.&lt;/p&gt;

&lt;p&gt;It fixes the exact place it breaks.&lt;/p&gt;


&lt;h2&gt;
  
  
  What It Actually Does
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Parses docs into &lt;strong&gt;sections&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Stores &lt;strong&gt;byte offsets&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;search_sections&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;get_section&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two calls.&lt;/p&gt;

&lt;p&gt;No full loads.&lt;/p&gt;
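The mechanism is simple enough to sketch. This is not jDocMunch's actual implementation, just an illustration of the byte-offset idea with invented function names: index the headings once, then seek directly to the one section you need.

```python
import re

def index_sections(path):
    """Map each markdown heading to the byte range of its section."""
    data = open(path, "rb").read()
    heads = [(m.start(), m.group(1).decode("utf-8"))
             for m in re.finditer(rb"^#+ (.+)$", data, re.MULTILINE)]
    sections = {}
    for i, (start, title) in enumerate(heads):
        end = heads[i + 1][0] if i + 1 < len(heads) else len(data)
        sections[title] = (start, end)
    return sections

def get_section(path, sections, title):
    """Read exactly one section from disk -- never the whole file."""
    start, end = sections[title]
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).decode("utf-8")
```

The index is built once per document; every subsequent read costs only the bytes of the answer.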


&lt;h2&gt;
  
  
  Why It Works
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Old&lt;/th&gt;
&lt;th&gt;New&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Load index into context&lt;/td&gt;
&lt;td&gt;Search index on disk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Navigate in prompt&lt;/td&gt;
&lt;td&gt;Retrieve exact section&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost scales with size&lt;/td&gt;
&lt;td&gt;Cost scales with answer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  Even the “Fair” Comparison Still Wins
&lt;/h2&gt;

&lt;p&gt;Conservative baseline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Single File:   5,347 tokens
jDocMunch:     1,874 tokens

Reduction:     65%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even when you handicap the comparison…&lt;/p&gt;

&lt;p&gt;It still wins.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three-Minute Retrofit
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;jdocmunch-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MCP config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"jdocmunch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"jdocmunch-mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Only Behavior Change That Matters
&lt;/h2&gt;

&lt;p&gt;Stop doing this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Read index.md”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Start doing this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;search_sections&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;get_section&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The One-Line Mental Model
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Your wiki is not a document.&lt;br&gt;
It’s a database.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Stop loading it.&lt;br&gt;
Start querying it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Karpathy gave us something real.&lt;/p&gt;

&lt;p&gt;Not a tool.&lt;br&gt;
A pattern.&lt;/p&gt;

&lt;p&gt;And it’s a good one.&lt;/p&gt;

&lt;p&gt;But it has a ceiling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;context doesn’t scale&lt;/li&gt;
&lt;li&gt;tokens don’t forgive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don’t add structured retrieval:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;your second brain becomes your biggest expense&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you do:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;you keep the compounding upside&lt;br&gt;
and kill the runaway token curve&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s the difference between a clever idea…&lt;/p&gt;

&lt;p&gt;and a system that actually survives production.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>wiki</category>
      <category>karpathy</category>
      <category>tokens</category>
    </item>
    <item>
      <title>Symbols Not Chunks: 3.9x Less Tokens</title>
      <dc:creator>J. Gravelle</dc:creator>
      <pubDate>Sat, 28 Mar 2026 14:34:27 +0000</pubDate>
      <link>https://dev.to/jgravelle/symbols-not-chunks-39x-less-tokens-f6e</link>
      <guid>https://dev.to/jgravelle/symbols-not-chunks-39x-less-tokens-f6e</guid>
      <description>&lt;h3&gt;
  
  
  AST-Based Retrieval Cuts LLM Code Context 1.6–3.9x vs. LangChain RAG on Real Codebases
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;J. Gravelle&lt;br&gt;
March 2026&lt;/em&gt;&lt;/p&gt;


&lt;h4&gt;
  
  
  Abstract
&lt;/h4&gt;

&lt;p&gt;Large language models (LLMs) consume tokens proportionally to the context they receive. When applied to code understanding tasks, the dominant retrieval strategy --- chunk-based Retrieval-Augmented Generation (RAG) using vector embeddings --- injects substantial irrelevant context, wastes tokens, and frequently delivers fragments that split functions mid-definition. This paper presents an alternative: AST-based symbol retrieval, which uses tree-sitter parsing to extract complete syntactic units (functions, classes, methods) and serves them via deterministic lookup. We benchmark both approaches on three open-source web frameworks (Express.js, FastAPI, Gin) totaling 1,214 files and 1,024,421 baseline tokens. In head-to-head comparison against a naive fixed-chunk RAG pipeline (LangChain + FAISS + MiniLM-L6-v2), AST retrieval uses &lt;strong&gt;1.6--3.9x fewer tokens per query&lt;/strong&gt; on every tested repository. Against a "read all files" lower-bound baseline, the reduction is 99.6% (263.9x). Two controlled A/B tests on a production Vue 3 codebase confirm 20% cost savings in end-to-end agentic workflows (p=0.0074), though with accuracy tradeoffs on fine-grained classification tasks. The structural advantage --- complete code units with no chunk boundary artifacts --- is orthogonal to the search mechanism and would apply equally to RAG pipelines that adopt symbol-level chunking. We argue that for code-specific retrieval, the retrieval unit should be the symbol, not the chunk.&lt;/p&gt;


&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;The integration of LLMs into software engineering workflows --- code generation, review, debugging, refactoring --- has accelerated rapidly. A common pattern has emerged: the agent needs to understand an unfamiliar codebase, so it retrieves relevant code and injects it into the model's context window.&lt;/p&gt;

&lt;p&gt;The standard approach borrows from document retrieval: split source files into overlapping text chunks, embed them with a dense model, store the vectors in an index (typically FAISS or Chroma), and retrieve the top-k most similar chunks at query time. This is the RAG pattern, popularized by frameworks like LangChain, LlamaIndex, and Haystack.&lt;/p&gt;

&lt;p&gt;RAG works reasonably well for prose documents, where any contiguous passage may contain relevant information. Code, however, has structure that prose does not. A function is a complete unit of meaning. Half a function is noise. The question this paper investigates is straightforward: &lt;em&gt;what happens when we replace arbitrary text chunks with complete syntactic units as the retrieval granularity?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer, across three repositories in three languages, is that token consumption drops by 1.6--3.9x compared to a naive fixed-chunk RAG pipeline, with no embedding model, no vector store, and no chunk boundary artifacts. We use "naive" deliberately: the RAG baseline tested here uses a general-purpose embedding model and fixed-size chunking, not code-specific embeddings or AST-aware splitting. The comparison is against a common starting point, not a fully optimized pipeline.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Problem Statement
&lt;/h2&gt;
&lt;h3&gt;
  
  
  2.1 The Token Cost Problem
&lt;/h3&gt;

&lt;p&gt;LLM API pricing is per-token. Context window size is finite. Both constraints create pressure to minimize irrelevant context. Yet the standard RAG pipeline is structurally biased toward over-retrieval:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fixed chunk size forces a precision/recall tradeoff.&lt;/strong&gt; Small chunks (512 tokens) reduce per-result noise but split functions mid-definition. Large chunks (2048 tokens) preserve more structure but include unrelated code from adjacent definitions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Top-k retrieval returns a fixed number of results regardless of query specificity.&lt;/strong&gt; A query matching one function still returns k chunks, most of which are noise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The search-then-fetch pattern double-counts tokens.&lt;/strong&gt; A typical workflow retrieves k results for inspection, then "fetches" the top n for the LLM. The top n appear in both the search response and the fetch response, inflating the effective token count.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  2.2 The Chunk Boundary Problem
&lt;/h3&gt;

&lt;p&gt;Consider a Python file with three functions, each ~400 tokens. A 512-token chunker produces chunks that look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk 1:  [end of function A] [start of function B ... truncated]
Chunk 2:  [... middle of function B ...] [start of function C]
Chunk 3:  [... end of function C] [module-level code]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An LLM receiving Chunk 1 gets the tail of one function and the head of another. It has no reliable way to determine where one definition ends and another begins. This is not a theoretical concern --- our measurements show that &lt;strong&gt;53% of retrieved RAG-512 chunks for FastAPI are split mid-function&lt;/strong&gt; (Section 7.3).&lt;/p&gt;
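The effect is easy to reproduce. A toy fixed-size chunker (character-based here for simplicity; real pipelines count tokens, and the handler names are invented) shows how often window edges land inside a definition:

```python
def fixed_chunks(text, size, overlap=0):
    """Split text into fixed-size windows, as a naive RAG splitter does."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Five small, identically shaped functions in one "file".
source = "\n\n".join(
    f"def handler_{n}(req):\n" + "    process(req)\n" * 6
    for n in range(5)
)

chunks = fixed_chunks(source, size=120)
# Count chunks that open mid-function rather than at a definition boundary.
split = sum(1 for c in chunks if not c.lstrip().startswith("def "))
print(f"{split}/{len(chunks)} chunks start mid-function")
```

Because the window size and the function size never agree, most chunks begin somewhere inside a body, which is exactly the artifact the measurement in Section 7.3 quantifies.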

&lt;h3&gt;
  
  
  2.3 Scaling Behavior
&lt;/h3&gt;

&lt;p&gt;As codebases grow, the problem compounds. A 951-file repository like FastAPI produces 2,256 chunks at 512-token granularity. The embedding step alone takes 47 seconds. Query latency, while acceptable (12--36 ms), is orders of magnitude slower than an in-process BM25 lookup (&amp;lt;5 ms). The vector index occupies 7.5 MB on disk --- modest in absolute terms, but unnecessary if the retrieval unit can be derived from the source structure directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 RAG for Code
&lt;/h3&gt;

&lt;p&gt;RAG (Retrieval-Augmented Generation) augments an LLM's fixed training knowledge with dynamically retrieved context. For code, the standard pipeline is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chunking.&lt;/strong&gt; Source files are split into fixed-size token windows (typically 256--2048 tokens) with overlap (5--15%) to mitigate boundary effects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding.&lt;/strong&gt; Each chunk is passed through a dense embedding model (e.g., &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;, &lt;code&gt;text-embedding-3-small&lt;/code&gt;) to produce a vector representation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexing.&lt;/strong&gt; Vectors are stored in an approximate nearest-neighbor index (FAISS, Chroma, Pinecone).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval.&lt;/strong&gt; At query time, the query is embedded and the top-k nearest chunks are returned.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This pipeline was designed for prose documents and adapted for code. The adaptation is imperfect: code has syntactic structure (functions, classes, modules) that prose does not, and that structure is semantically meaningful.&lt;/p&gt;
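For readers who have not assembled this pipeline themselves, a toy version of steps 1–4 fits in a few lines. A bag-of-words cosine similarity stands in for the dense embedding model and a linear scan stands in for FAISS; the shape of the pipeline is the same even though the components are deliberately simplistic:

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a dense embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(text, size=40, overlap=5):
    """Step 1: fixed-size character windows with overlap."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def build_index(docs):
    """Steps 2-3: embed every chunk and store the vectors."""
    return [(c, embed(c)) for d in docs for c in chunk(d)]

def retrieve(index, query, k=2):
    """Step 4: return the top-k nearest chunks to the query."""
    qv = embed(query)
    return sorted(index, key=lambda entry: -cosine(qv, entry[1]))[:k]
```

Note that `retrieve` always returns k results, however specific the query — the over-retrieval bias described in Section 2.1 is built into the interface.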

&lt;p&gt;&lt;strong&gt;Semantic and AST-aware chunking.&lt;/strong&gt; The RAG ecosystem has recognized the fixed-chunk limitation. LangChain, LlamaIndex, and other frameworks offer &lt;em&gt;semantic chunking&lt;/em&gt; (split at natural breakpoints detected by embedding similarity shifts) and &lt;em&gt;AST-aware chunking&lt;/em&gt; (split at function or class boundaries using a parser). AST-aware chunking in particular eliminates the chunk boundary problem described in Section 2.2. We did not benchmark these strategies --- doing so would require choosing among multiple implementations with different heuristics, and the comparison would conflate chunking strategy with embedding model quality. We note, however, that AST-aware chunking and AST symbol retrieval share the same core insight: the retrieval unit for code should align with syntactic boundaries. The remaining difference is the search mechanism (embedding similarity vs. BM25) and the retrieval interface (opaque chunks vs. structured symbol metadata). Section 8 discusses this distinction further.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Context Window Constraints
&lt;/h3&gt;

&lt;p&gt;Modern LLMs offer context windows ranging from 128K to 2M tokens. A naive approach --- load the entire codebase --- is feasible for small projects but fails quickly. A 951-file Python framework tokenizes to ~700K tokens. A production monorepo can easily exceed 10M tokens. Even where the window is large enough, longer contexts degrade attention quality, increase latency, and cost proportionally more.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Tree-Sitter and AST Parsing
&lt;/h3&gt;

&lt;p&gt;Tree-sitter is an incremental parsing framework that produces concrete syntax trees for source code in ~40 languages. Unlike regex-based heuristics, tree-sitter parsing is grammar-driven: it identifies functions, classes, methods, type definitions, and other syntactic constructs with the same precision as the language's own compiler front-end. Parse time is typically sub-second for single files and under 15 seconds for a 951-file repository.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Approach: AST Symbol Retrieval
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Core Idea
&lt;/h3&gt;

&lt;p&gt;Instead of chunking source files into arbitrary token windows, parse them into their natural syntactic units: functions, classes, methods, type definitions. Index these &lt;strong&gt;symbols&lt;/strong&gt; by name, qualified name, and file path. At query time, search the symbol index (not a vector index) and return the complete source code of matched symbols.&lt;/p&gt;

&lt;p&gt;The retrieval unit is no longer a 512-token fragment of unknown provenance. It is a complete, self-contained definition --- the exact code the developer would navigate to in an IDE.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Indexing Pipeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source files  →  tree-sitter parse  →  symbol extraction  →  BM25 index + SQLite store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each file:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detect language from file extension.&lt;/li&gt;
&lt;li&gt;Parse with the appropriate tree-sitter grammar.&lt;/li&gt;
&lt;li&gt;Walk the AST to extract top-level and nested symbols (functions, classes, methods, type aliases, constants).&lt;/li&gt;
&lt;li&gt;Store each symbol's metadata (name, qualified name, kind, file path, line range) and full source text in a SQLite database.&lt;/li&gt;
&lt;li&gt;Build a BM25 inverted index over symbol names and qualified names.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The entire pipeline is deterministic. No embedding model is involved. No GPU is required. Index build time scales linearly with file count: &amp;lt;1 second for 98 files (Gin), ~5--15 seconds for 951 files (FastAPI).&lt;/p&gt;
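The extraction step can be sketched with Python's built-in `ast` module standing in for tree-sitter (tree-sitter adds multi-language support and error tolerance, but the walk has the same shape); qualified-name derivation and the BM25 index are omitted to keep the sketch self-contained:

```python
import ast

def extract_symbols(source, path):
    """Walk the AST and record each function/class as a complete unit."""
    symbols = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            symbols.append({
                "name": node.name,
                "kind": type(node).__name__,
                "path": path,
                "lines": (node.lineno, node.end_lineno),
                # Full source text of the definition -- never a fragment.
                "source": ast.get_source_segment(source, node),
            })
    return symbols
```

Every extracted `source` field starts at a `def` or `class` keyword and ends at the definition's final line, which is the structural guarantee the chunker cannot make.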

&lt;h3&gt;
  
  
  4.3 Retrieval Workflow
&lt;/h3&gt;

&lt;p&gt;The retrieval workflow mirrors the discover/search/retrieve pattern common in code exploration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query: "middleware"
  ↓
Step 1: search_symbols("middleware", max_results=5)
  → Returns ranked symbol metadata: name, kind, file, line range, score
  → Token cost: ~370 tokens (metadata only, not full source)
  ↓
Step 2: get_symbol_source(top_3_symbol_ids)
  → Returns complete source code of the 3 best-matching symbols
  → Token cost: ~640 tokens (3 complete function bodies)
  ↓
Total: ~1,010 tokens
Baseline (all files): 137,978 tokens
Reduction: 99.3%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
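&lt;p&gt;The totals in the trace are plain arithmetic; a quick sanity check of the reduction figure, using the token counts from the trace:&lt;/p&gt;

```python
# Verify the reduction arithmetic from the trace above.
search_tokens = 370    # metadata for 5 results
fetch_tokens = 640     # full source for the top 3 symbols
baseline = 137_978     # all Express files concatenated

total = search_tokens + fetch_tokens          # 1,010 tokens
reduction = 100 * (1 - total / baseline)      # percent of baseline avoided

print(f"total={total}, reduction={reduction:.1f}%")
# prints: total=1010, reduction=99.3%
```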



&lt;p&gt;Three properties distinguish this from RAG retrieval:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The search step returns metadata, not source.&lt;/strong&gt; The LLM (or agent) can inspect symbol names, kinds, and file locations before deciding which symbols to retrieve in full. This is analogous to scanning a table of contents before reading chapters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The retrieve step returns complete syntactic units.&lt;/strong&gt; Every result starts at a definition boundary and ends at the matching closing brace or dedent. There are no mid-function fragments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Result count is adaptive.&lt;/strong&gt; If a query matches one symbol strongly, the agent retrieves one symbol. RAG always returns k chunks regardless of query specificity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
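&lt;p&gt;A toy in-memory version makes the two-step shape concrete. The tool names mirror the workflow above; the ranking here is a trivial substring match standing in for BM25, and the symbol data is invented for illustration:&lt;/p&gt;

```python
# Toy in-memory version of the two-step workflow. The real system backs
# these calls with SQLite and BM25; this substring ranking and the symbol
# data below are illustrative stand-ins.
SYMBOLS = {
    "abc123": {"name": "route", "kind": "function",
               "file": "lib/router/index.js", "lines": (45, 92),
               "source": "Router.prototype.route = function route(path) {...}"},
    "def456": {"name": "use_middleware", "kind": "function",
               "file": "lib/application.js", "lines": (10, 30),
               "source": "app.use = function use(fn) {...}"},
}

def search_symbols(query, max_results=5):
    # Step 1: return ranked METADATA only; no source text is transmitted.
    hits = [(sid, meta) for sid, meta in SYMBOLS.items()
            if any(term in meta["name"] for term in query.split())]
    return [{"id": sid, **{k: v for k, v in meta.items() if k != "source"}}
            for sid, meta in hits[:max_results]]

def get_symbol_source(symbol_id):
    # Step 2: fetch the complete definition for one chosen symbol.
    return SYMBOLS[symbol_id]["source"]

hits = search_symbols("route handler")
print(hits[0]["name"], hits[0]["lines"])      # metadata first...
print(get_symbol_source(hits[0]["id"]))       # ...then full source on demand
```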

&lt;h3&gt;
  
  
  4.4 Stable Symbol Identifiers
&lt;/h3&gt;

&lt;p&gt;Each symbol receives a deterministic identifier derived from its repository, file path, and qualified name. This ID is stable across reindexing (unless the symbol is renamed or moved). Stable IDs enable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Caching.&lt;/strong&gt; A previously retrieved symbol can be recognized without re-fetching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-reference.&lt;/strong&gt; Import graphs, call hierarchies, and blast radius analysis can reference symbols by ID.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental updates.&lt;/strong&gt; When a file changes, only its symbols are re-extracted. The rest of the index is untouched.&lt;/li&gt;
&lt;/ul&gt;
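&lt;p&gt;One way to obtain such IDs is to hash the (repo, file path, qualified name) triple. The sketch below uses SHA-256 and is an illustration of the property, not the implementation's exact derivation:&lt;/p&gt;

```python
# Deterministic symbol IDs: a hash over (repo, file path, qualified name).
# The real implementation's derivation may differ; the property that
# matters is determinism, so reindexing an unchanged symbol yields the
# same ID.
import hashlib

def symbol_id(repo: str, file_path: str, qualified_name: str) -> str:
    key = "\x00".join((repo, file_path, qualified_name))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]

a = symbol_id("expressjs/express", "lib/router/index.js", "Router.route")
b = symbol_id("expressjs/express", "lib/router/index.js", "Router.route")
c = symbol_id("expressjs/express", "lib/router/index.js", "Router.use")

assert a == b        # stable across reindexing
assert a != c        # distinct symbols get distinct IDs
print(a)
```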




&lt;h2&gt;
  
  
  5. Implementation Overview
&lt;/h2&gt;

&lt;p&gt;The implementation described here uses tree-sitter grammars for 25+ languages, a SQLite-backed symbol store, and BM25 for text search. The system runs as an MCP (Model Context Protocol) server, exposing tools that LLM agents call directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.1 Language Support
&lt;/h3&gt;

&lt;p&gt;Symbol extraction is grammar-driven. Each supported language has a tree-sitter grammar and an extraction spec that maps AST node types to symbol kinds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language family&lt;/th&gt;
&lt;th&gt;Languages&lt;/th&gt;
&lt;th&gt;Symbol kinds extracted&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C-like&lt;/td&gt;
&lt;td&gt;C, C++, C#, Java, Go, Rust, Swift, Kotlin&lt;/td&gt;
&lt;td&gt;functions, methods, classes, structs, interfaces, enums, type aliases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dynamic&lt;/td&gt;
&lt;td&gt;Python, JavaScript, TypeScript, Ruby, PHP, Lua&lt;/td&gt;
&lt;td&gt;functions, methods, classes, decorators, module-level assignments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Functional&lt;/td&gt;
&lt;td&gt;Haskell, Scala, Erlang, R, Julia&lt;/td&gt;
&lt;td&gt;functions, type classes, data types, modules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markup/Config&lt;/td&gt;
&lt;td&gt;SQL, TOML, CSS, Bash&lt;/td&gt;
&lt;td&gt;definitions, sections, rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Specialized&lt;/td&gt;
&lt;td&gt;Vue SFC, Razor (&lt;code&gt;.cshtml&lt;/code&gt;), Assembly&lt;/td&gt;
&lt;td&gt;component APIs, code blocks, labels/macros&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Custom extractors exist for languages where tree-sitter grammars lack clean named fields (Erlang: multi-clause function merging by arity; Fortran: module-qualified names; SQL: dbt Jinja preprocessing).&lt;/p&gt;
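&lt;p&gt;An extraction spec can be as small as a per-language mapping from AST node types to symbol kinds. The node type names below match the public tree-sitter grammars for Python and Go; the spec structure itself is a hypothetical sketch, not the project's actual format:&lt;/p&gt;

```python
# A minimal extraction-spec shape: map tree-sitter node types to symbol
# kinds per language. Node type names follow the public tree-sitter
# grammars for Python and Go; the spec structure is hypothetical.
EXTRACTION_SPECS = {
    "python": {
        "function_definition": "function",
        "class_definition": "class",
    },
    "go": {
        "function_declaration": "function",
        "method_declaration": "method",
        "type_declaration": "type",
    },
}

def kind_for(language: str, node_type: str):
    """Return the symbol kind for an AST node, or None if not extracted."""
    return EXTRACTION_SPECS.get(language, {}).get(node_type)

print(kind_for("go", "method_declaration"))    # method
print(kind_for("python", "import_statement"))  # None (not a symbol)
```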

&lt;h3&gt;
  
  
  5.2 Storage and Index Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/.code-index/
  &amp;lt;repo-hash&amp;gt;/
    index.db          # SQLite: symbols table (name, kind, file, lines, source)
                      #         files table (path, hash, size_bytes)
                      #         imports table (file, specifier, resolved_path)
    content/          # Raw source files (for full-file retrieval)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQLite schema supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;O(1) symbol lookup&lt;/strong&gt; by ID (hash index built in &lt;code&gt;__post_init__&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BM25 search&lt;/strong&gt; over symbol names with optional language and file filters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Import graph queries&lt;/strong&gt; for cross-reference tools (find_importers, blast_radius, dead_code).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental updates&lt;/strong&gt; via content hashing --- only changed files are re-parsed.&lt;/li&gt;
&lt;/ul&gt;
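&lt;p&gt;A minimal sketch of that layout, with column names inferred from the schema summary above (the project's exact DDL may differ), including the content-hash check that drives incremental updates:&lt;/p&gt;

```python
# Sketch of the SQLite layout described above. Column names are inferred
# from the schema summary, not the project's exact DDL.
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE files (
    path TEXT PRIMARY KEY,
    hash TEXT NOT NULL,
    size_bytes INTEGER NOT NULL
);
CREATE TABLE symbols (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    kind TEXT NOT NULL,
    file TEXT NOT NULL REFERENCES files(path),
    start_line INTEGER,
    end_line INTEGER,
    source TEXT NOT NULL
);
CREATE INDEX idx_symbols_name ON symbols(name);
""")

def needs_reindex(path: str, content: str) -> bool:
    # Incremental updates: re-parse a file only when its content hash
    # differs from the stored one.
    digest = hashlib.sha256(content.encode()).hexdigest()
    row = db.execute("SELECT hash FROM files WHERE path = ?", (path,)).fetchone()
    return row is None or row[0] != digest

print(needs_reindex("lib/router/index.js", "function route() {}"))  # True
```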

&lt;h3&gt;
  
  
  5.3 Integration with LLM Workflows
&lt;/h3&gt;

&lt;p&gt;The retrieval tools are exposed via MCP (Model Context Protocol), the open standard for LLM tool integration. An agent's interaction looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;search_symbols&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;router route handler&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;max_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;symbols&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;abc123&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;route&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;function&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;lib/router/index.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;]},...]}&lt;/span&gt;

&lt;span class="nl"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;get_symbol_source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;abc123&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Router.prototype.route = function route(path) {&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;  ...full body...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;};&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;route&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;function&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;92&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent receives exactly the code it needs --- a complete function definition --- without reading the entire file or receiving adjacent, unrelated code.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Benchmark Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6.1 Repositories Under Test
&lt;/h3&gt;

&lt;p&gt;Three public web frameworks spanning three languages, chosen for structural diversity:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;Files indexed&lt;/th&gt;
&lt;th&gt;Symbols extracted&lt;/th&gt;
&lt;th&gt;Baseline tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;expressjs/express&lt;/td&gt;
&lt;td&gt;JavaScript&lt;/td&gt;
&lt;td&gt;165&lt;/td&gt;
&lt;td&gt;181&lt;/td&gt;
&lt;td&gt;137,978&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fastapi/fastapi&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;951&lt;/td&gt;
&lt;td&gt;5,325&lt;/td&gt;
&lt;td&gt;699,425&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gin-gonic/gin&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;td&gt;1,489&lt;/td&gt;
&lt;td&gt;187,018&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Baseline tokens&lt;/strong&gt; = all indexed source files concatenated and tokenized with &lt;code&gt;tiktoken&lt;/code&gt; &lt;code&gt;cl100k_base&lt;/code&gt;. This is the minimum cost for an agent that reads every file once. Real agents typically read files multiple times, making this a conservative baseline.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.2 Query Corpus
&lt;/h3&gt;

&lt;p&gt;Five queries representing common code exploration intents, defined in a public &lt;code&gt;tasks.json&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Intent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;router route handler&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Core route registration / dispatch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;middleware&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Middleware chaining and execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;error exception&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Error handling and exception propagation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;request response&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Request/response object definitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;context bind&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Context creation and parameter binding&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each query is run against each repository, producing 15 task-runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.3 RAG Configuration
&lt;/h3&gt;

&lt;p&gt;The RAG baseline uses a naive LangChain pipeline --- deliberately unoptimized, representing a common starting point rather than a production-tuned system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embeddings:&lt;/strong&gt; &lt;code&gt;sentence-transformers/all-MiniLM-L6-v2&lt;/code&gt; (384-dim, local inference)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector store:&lt;/strong&gt; FAISS (&lt;code&gt;faiss-cpu&lt;/code&gt;, in-memory)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Splitter:&lt;/strong&gt; &lt;code&gt;RecursiveCharacterTextSplitter.from_tiktoken_encoder&lt;/code&gt; (true token-based chunks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk sizes:&lt;/strong&gt; 512, 1024, 2048 tokens with ~10% overlap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval:&lt;/strong&gt; &lt;code&gt;similarity_search(query, k=5)&lt;/code&gt;, top 3 used as "fetched"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Token counting: &lt;code&gt;search_tokens&lt;/code&gt; (all 5 retrieved chunks serialized) + &lt;code&gt;fetch_tokens&lt;/code&gt; (top 3 chunks serialized). This mirrors the AST workflow's &lt;code&gt;search_symbols&lt;/code&gt; + &lt;code&gt;get_symbol_source&lt;/code&gt; two-step pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is not tested.&lt;/strong&gt; The RAG baseline does not use code-specific embedding models (CodeBERT, Voyage Code, StarEncoder), re-ranking passes (Cohere Rerank, cross-encoder), hybrid search (BM25 + dense), or AST-aware chunking. Any of these would likely improve RAG's token efficiency. The results in Section 7 should be read as "AST retrieval vs. naive RAG," not "AST retrieval vs. best-possible RAG."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Double-counting note.&lt;/strong&gt; The two-step token accounting (search 5 + fetch 3) means the top 3 chunks are counted in both passes. A simpler RAG workflow that calls &lt;code&gt;similarity_search(k=3)&lt;/code&gt; and uses the results directly would avoid this overhead. We chose the two-step structure to mirror the AST workflow's metadata-then-source pattern, making the comparison structurally parallel. This inflates RAG's token count by roughly 30--40% relative to a single-pass retrieval. The 1.6--3.9x margin would narrow under single-pass accounting, though AST retrieval would still be more efficient due to the metadata-vs-source asymmetry in the search step.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.4 AST Configuration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parser:&lt;/strong&gt; tree-sitter (language-specific grammars)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search:&lt;/strong&gt; BM25 over symbol names, &lt;code&gt;max_results=5&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fetch:&lt;/strong&gt; &lt;code&gt;get_symbol_source&lt;/code&gt; on top 3 symbol IDs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token counting:&lt;/strong&gt; search response tokens + 3 x symbol source tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI summaries were disabled during benchmarking (signature-only fallback).&lt;/p&gt;
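&lt;p&gt;BM25 over short symbol-name documents is compact enough to sketch in full. Below is a minimal stdlib scorer with the standard k1/b parameters; the production index also searches qualified names and applies language/file filters:&lt;/p&gt;

```python
# Minimal BM25 scorer over symbol names, stdlib only. Standard parameters
# k1=1.5, b=0.75; a simplified sketch, not the production implementation.
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().replace(".", " ").split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(tokenized)
    scores = [0.0] * n
    for term in query.lower().split():
        df = sum(1 for t in tokenized if term in t)
        if df == 0:
            continue  # term appears in no document
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for i, t in enumerate(tokenized):
            tf = t.count(term)
            denom = tf + k1 * (1 - b + b * len(t) / avgdl)
            scores[i] += idf * tf * (k1 + 1) / denom
    return scores

names = ["Router.route", "Router.use", "handle_request", "ErrorMiddleware"]
scores = bm25_scores("route", names)
best = max(range(len(names)), key=scores.__getitem__)
print(names[best])  # Router.route
```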

&lt;h3&gt;
  
  
  6.5 Reproducibility
&lt;/h3&gt;

&lt;p&gt;Both harnesses read file content from the same &lt;code&gt;IndexStore&lt;/code&gt; instance (&lt;code&gt;IndexStore.load_index() → index.source_files&lt;/code&gt;). Baselines are identical by construction. The harness scripts (&lt;code&gt;run_benchmark.py&lt;/code&gt;, &lt;code&gt;run_rag_baseline.py&lt;/code&gt;), query corpus (&lt;code&gt;tasks.json&lt;/code&gt;), and raw results (&lt;code&gt;rag_baseline_results.json&lt;/code&gt;) are open source.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7.1 Token Efficiency: AST Retrieval
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Baseline tokens&lt;/th&gt;
&lt;th&gt;AST avg/query&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;expressjs/express&lt;/td&gt;
&lt;td&gt;137,978&lt;/td&gt;
&lt;td&gt;924&lt;/td&gt;
&lt;td&gt;99.4%&lt;/td&gt;
&lt;td&gt;150.1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fastapi/fastapi&lt;/td&gt;
&lt;td&gt;699,425&lt;/td&gt;
&lt;td&gt;1,834&lt;/td&gt;
&lt;td&gt;99.8%&lt;/td&gt;
&lt;td&gt;531.2x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gin-gonic/gin&lt;/td&gt;
&lt;td&gt;187,018&lt;/td&gt;
&lt;td&gt;1,124&lt;/td&gt;
&lt;td&gt;99.4%&lt;/td&gt;
&lt;td&gt;171.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grand total (15 runs)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5,122,105&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;19,406&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;99.6%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;263.9x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Per-query detail (Express.js):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;AST tokens&lt;/th&gt;
&lt;th&gt;Search&lt;/th&gt;
&lt;th&gt;Fetch&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;router route handler&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;137,978&lt;/td&gt;
&lt;td&gt;886&lt;/td&gt;
&lt;td&gt;381&lt;/td&gt;
&lt;td&gt;505&lt;/td&gt;
&lt;td&gt;99.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;middleware&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;137,978&lt;/td&gt;
&lt;td&gt;1,008&lt;/td&gt;
&lt;td&gt;370&lt;/td&gt;
&lt;td&gt;638&lt;/td&gt;
&lt;td&gt;99.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;error exception&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;137,978&lt;/td&gt;
&lt;td&gt;859&lt;/td&gt;
&lt;td&gt;362&lt;/td&gt;
&lt;td&gt;497&lt;/td&gt;
&lt;td&gt;99.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;request response&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;137,978&lt;/td&gt;
&lt;td&gt;872&lt;/td&gt;
&lt;td&gt;372&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;99.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;context bind&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;137,978&lt;/td&gt;
&lt;td&gt;993&lt;/td&gt;
&lt;td&gt;372&lt;/td&gt;
&lt;td&gt;621&lt;/td&gt;
&lt;td&gt;99.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Per-query detail (FastAPI):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Baseline&lt;/th&gt;
&lt;th&gt;AST tokens&lt;/th&gt;
&lt;th&gt;Search&lt;/th&gt;
&lt;th&gt;Fetch&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;router route handler&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;699,425&lt;/td&gt;
&lt;td&gt;1,199&lt;/td&gt;
&lt;td&gt;464&lt;/td&gt;
&lt;td&gt;735&lt;/td&gt;
&lt;td&gt;99.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;middleware&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;699,425&lt;/td&gt;
&lt;td&gt;1,643&lt;/td&gt;
&lt;td&gt;460&lt;/td&gt;
&lt;td&gt;1,183&lt;/td&gt;
&lt;td&gt;99.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;error exception&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;699,425&lt;/td&gt;
&lt;td&gt;873&lt;/td&gt;
&lt;td&gt;383&lt;/td&gt;
&lt;td&gt;490&lt;/td&gt;
&lt;td&gt;99.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;request response&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;699,425&lt;/td&gt;
&lt;td&gt;4,439&lt;/td&gt;
&lt;td&gt;430&lt;/td&gt;
&lt;td&gt;4,009&lt;/td&gt;
&lt;td&gt;99.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;context bind&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;699,425&lt;/td&gt;
&lt;td&gt;1,016&lt;/td&gt;
&lt;td&gt;402&lt;/td&gt;
&lt;td&gt;614&lt;/td&gt;
&lt;td&gt;99.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  7.2 Token Efficiency: RAG Baseline
&lt;/h3&gt;

&lt;p&gt;Best-performing RAG configuration per repo (RAG-512 in all cases):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Baseline tokens&lt;/th&gt;
&lt;th&gt;RAG-512 avg/query&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;expressjs/express&lt;/td&gt;
&lt;td&gt;137,978&lt;/td&gt;
&lt;td&gt;2,887&lt;/td&gt;
&lt;td&gt;97.9%&lt;/td&gt;
&lt;td&gt;56.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fastapi/fastapi&lt;/td&gt;
&lt;td&gt;699,425&lt;/td&gt;
&lt;td&gt;2,850&lt;/td&gt;
&lt;td&gt;99.6%&lt;/td&gt;
&lt;td&gt;248.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gin-gonic/gin&lt;/td&gt;
&lt;td&gt;187,018&lt;/td&gt;
&lt;td&gt;4,352&lt;/td&gt;
&lt;td&gt;97.7%&lt;/td&gt;
&lt;td&gt;43.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;RAG token consumption increases with chunk size:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;RAG-512 avg&lt;/th&gt;
&lt;th&gt;RAG-1024 avg&lt;/th&gt;
&lt;th&gt;RAG-2048 avg&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;expressjs/express&lt;/td&gt;
&lt;td&gt;2,887&lt;/td&gt;
&lt;td&gt;6,023&lt;/td&gt;
&lt;td&gt;7,057&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fastapi/fastapi&lt;/td&gt;
&lt;td&gt;2,850&lt;/td&gt;
&lt;td&gt;4,279&lt;/td&gt;
&lt;td&gt;5,512&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gin-gonic/gin&lt;/td&gt;
&lt;td&gt;4,352&lt;/td&gt;
&lt;td&gt;7,539&lt;/td&gt;
&lt;td&gt;12,850&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  7.3 Head-to-Head Comparison
&lt;/h3&gt;

&lt;p&gt;Both harnesses ran back-to-back on 2026-03-28 against the same index state.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;Best RAG avg/query&lt;/th&gt;
&lt;th&gt;AST avg/query&lt;/th&gt;
&lt;th&gt;AST advantage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;expressjs/express&lt;/td&gt;
&lt;td&gt;2,887 (RAG-512)&lt;/td&gt;
&lt;td&gt;924&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.1x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fastapi/fastapi&lt;/td&gt;
&lt;td&gt;2,850 (RAG-512)&lt;/td&gt;
&lt;td&gt;1,834&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.6x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gin-gonic/gin&lt;/td&gt;
&lt;td&gt;4,352 (RAG-512)&lt;/td&gt;
&lt;td&gt;1,124&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.9x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;AST retrieval uses fewer tokens on every tested repository. The margin ranges from 1.6x (FastAPI) to 3.9x (Gin). The FastAPI result is notable: this is the largest repo (951 files, 5,325 symbols), where dense embedding retrieval might be expected to have an advantage. It does not --- BM25 over symbol names plus selective source retrieval still outperforms vector similarity over text chunks.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.4 Chunk Integrity
&lt;/h3&gt;

&lt;p&gt;The "complete chunk" rate measures how often a retrieved RAG chunk starts at a definition boundary and has balanced braces/indentation. The "split" rate measures how often a chunk is cut mid-function.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repository&lt;/th&gt;
&lt;th&gt;RAG-512 complete&lt;/th&gt;
&lt;th&gt;RAG-512 split&lt;/th&gt;
&lt;th&gt;RAG-1024 split&lt;/th&gt;
&lt;th&gt;RAG-2048 split&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;expressjs/express&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fastapi/fastapi&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;53%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;40%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;33%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gin-gonic/gin&lt;/td&gt;
&lt;td&gt;13%&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;td&gt;7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;FastAPI's high split rate at 512-token chunks is a direct consequence of its code structure: many functions exceed 512 tokens, so the chunker cuts them. Increasing chunk size reduces splits but does not eliminate them, and increases token cost per retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AST retrieval produces zero split results by construction.&lt;/strong&gt; Every returned symbol is a complete AST node --- a function, class, or method with full source from definition to closing delimiter.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.5 Infrastructure Overhead
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;RAG&lt;/th&gt;
&lt;th&gt;AST&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embedding model download&lt;/td&gt;
&lt;td&gt;~90 MB (one-time)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime dependencies&lt;/td&gt;
&lt;td&gt;LangChain + FAISS + sentence-transformers + torch (~1 GB)&lt;/td&gt;
&lt;td&gt;tiktoken only (for benchmarking)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index build (FastAPI, 951 files)&lt;/td&gt;
&lt;td&gt;23--49s (embedding-dominated)&lt;/td&gt;
&lt;td&gt;5--15s (tree-sitter parse)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index build (Express, 165 files)&lt;/td&gt;
&lt;td&gt;6s&lt;/td&gt;
&lt;td&gt;&amp;lt;1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FAISS index size (FastAPI, 512)&lt;/td&gt;
&lt;td&gt;7,556 KB&lt;/td&gt;
&lt;td&gt;a few hundred KB (SQLite)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query latency&lt;/td&gt;
&lt;td&gt;12--36 ms&lt;/td&gt;
&lt;td&gt;&amp;lt;5 ms (BM25 in-process)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The embedding step is the dominant cost in the RAG pipeline. For a 951-file repository, building the 512-token FAISS index requires ~47 seconds of CPU embedding time. The AST pipeline parses the same files in 5--15 seconds with no model inference.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.6 End-to-End A/B Tests
&lt;/h3&gt;

&lt;p&gt;Two controlled experiments were conducted on a production Vue 3 + Firebase codebase to measure real-world impact beyond synthetic benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test 1: Naming audit (50 iterations, Claude Sonnet 4.6).&lt;/strong&gt; Each iteration scanned source files for misleading names, then applied fixes via three-subagent consensus.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Native tools (Grep/Glob/Read)&lt;/th&gt;
&lt;th&gt;AST retrieval&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Success rate&lt;/td&gt;
&lt;td&gt;72%&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;+8 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timeout rate&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;32%&lt;/td&gt;
&lt;td&gt;-8 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean cost/iteration&lt;/td&gt;
&lt;td&gt;$0.783&lt;/td&gt;
&lt;td&gt;$0.738&lt;/td&gt;
&lt;td&gt;-5.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean cache creation tokens&lt;/td&gt;
&lt;td&gt;104,135&lt;/td&gt;
&lt;td&gt;93,178&lt;/td&gt;
&lt;td&gt;-10.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Isolated tool-layer savings (controlling for fixed subagent overhead): &lt;strong&gt;15--25%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test 2: Dead code detection (50 iterations, Claude Sonnet 4.6).&lt;/strong&gt; Pure tool-layer cost measurement with no subagent overhead.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Native tools&lt;/th&gt;
&lt;th&gt;AST retrieval&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Success rate&lt;/td&gt;
&lt;td&gt;96%&lt;/td&gt;
&lt;td&gt;92%&lt;/td&gt;
&lt;td&gt;-4 pp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean cost/iteration&lt;/td&gt;
&lt;td&gt;$0.4474&lt;/td&gt;
&lt;td&gt;$0.3560&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-20.0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean total tokens&lt;/td&gt;
&lt;td&gt;449,356&lt;/td&gt;
&lt;td&gt;289,275&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-36%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 20% cost reduction is statistically significant (Wilcoxon p=0.0074, Cohen's d=-0.583).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accuracy tradeoff.&lt;/strong&gt; The cost savings came with measurable accuracy degradation on fine-grained tasks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;F1 metric&lt;/th&gt;
&lt;th&gt;Native tools&lt;/th&gt;
&lt;th&gt;AST retrieval&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dead files (all exports unused)&lt;/td&gt;
&lt;td&gt;95.8%&lt;/td&gt;
&lt;td&gt;95.7%&lt;/td&gt;
&lt;td&gt;equivalent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alive files (with some dead exports)&lt;/td&gt;
&lt;td&gt;100.0%&lt;/td&gt;
&lt;td&gt;69.6%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-30.4 pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Export-level (individual export liveness)&lt;/td&gt;
&lt;td&gt;93.3%&lt;/td&gt;
&lt;td&gt;64.1%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-29.2 pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Dead-file detection --- the coarsest classification --- was equivalent. But alive-file classification and individual export liveness were significantly worse with AST retrieval. Root cause analysis (detailed in the full report) identified three factors: (1) the JS import extractor missed dynamic &lt;code&gt;import()&lt;/code&gt; calls (fixed in v1.8.1), (2) the agent's strategy stopped at file-level liveness without verifying individual exports, and (3) neither variant followed transitive dead-code chains (fixed in v1.8.3). Two of the three gaps were tool bugs subsequently fixed; the third was a task-framing issue.&lt;/p&gt;

&lt;p&gt;The honest summary: AST retrieval is cheaper but not uniformly better. For tasks requiring file-level "is this dead?" decisions, accuracy is equivalent at 20% lower cost. For tasks requiring export-level granularity, the agent's retrieval strategy must be more deliberate --- the tool provides the capability (&lt;code&gt;find_references&lt;/code&gt; returns zero results for unused exports), but the agent did not use it consistently.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  8.1 Why AST Retrieval Uses Fewer Tokens
&lt;/h3&gt;

&lt;p&gt;Three mechanisms contribute:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No irrelevant context per result.&lt;/strong&gt; A RAG chunk at any size includes code before and after the relevant definition. A symbol result includes only the definition itself. The average AST fetch returns 200--600 tokens of source per symbol; RAG-512 returns ~500 tokens per chunk, but 3--5 of the 5 chunks typically contain irrelevant code that happens to share embedding-space proximity with the query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The search step is cheaper.&lt;/strong&gt; AST search returns symbol metadata (~370 tokens for 5 results): name, kind, file, line range. RAG search returns the full text of 5 chunks (~1,800--2,900 tokens for 5 results). The metadata-first approach lets the agent make retrieval decisions before paying the full-source cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metadata/source separation.&lt;/strong&gt; In the AST workflow, the search step returns compact metadata (~370 tokens for 5 results) and the fetch step returns full source. No content is transmitted twice. In the RAG workflow as measured here, the top 3 chunks appear in both the search response (5 chunks) and the fetch response (3 chunks). This is a measurement artifact of our two-step accounting, not an inherent RAG limitation --- a single-pass &lt;code&gt;similarity_search(k=3)&lt;/code&gt; pipeline would avoid it. We discuss the impact of this accounting choice in Section 6.3.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
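&lt;p&gt;The two accounting models above can be sketched in a few lines. This is an illustrative toy, not the benchmark harness: whitespace word counts stand in for tokenizer counts, and the metadata field names are hypothetical.&lt;/p&gt;

```python
# Toy accounting for the two retrieval workflows described above.
# Word counts approximate token counts; field names are illustrative.

def ast_workflow_tokens(search_results, fetched_sources):
    # Search step returns compact metadata only (name, kind, file, lines).
    meta = " ".join(
        f"{r['name']} {r['kind']} {r['file']} {r['lines']}" for r in search_results
    )
    search_cost = len(meta.split())
    # Fetch step returns full source for the selected symbols, exactly once.
    fetch_cost = sum(len(src.split()) for src in fetched_sources)
    return search_cost + fetch_cost

def rag_two_step_tokens(chunks, k_fetch=3):
    # Search step returns the full text of all retrieved chunks...
    search_cost = sum(len(c.split()) for c in chunks)
    # ...and the fetch step re-transmits the top k_fetch of them.
    fetch_cost = sum(len(c.split()) for c in chunks[:k_fetch])
    return search_cost + fetch_cost
```

&lt;p&gt;The structural point survives the toy tokenizer: the AST search step scales with metadata size, while the two-step RAG pipeline pays for chunk text twice.&lt;/p&gt;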

&lt;h3&gt;
  
  
  8.2 Confounded Variables: Unit vs. Search Mechanism
&lt;/h3&gt;

&lt;p&gt;This benchmark varies two things simultaneously: the &lt;strong&gt;retrieval unit&lt;/strong&gt; (chunk vs. symbol) and the &lt;strong&gt;search mechanism&lt;/strong&gt; (embedding similarity vs. BM25). We attribute the advantage primarily to the retrieval unit, but we have not isolated the two variables. Two unrun experiments would help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embedding search over AST symbols.&lt;/strong&gt; Use the same symbol-level retrieval units, but search by embedding similarity instead of BM25. If results are comparable, the retrieval unit is the dominant factor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BM25 search over fixed-size chunks.&lt;/strong&gt; Use the same chunk-based retrieval, but search by BM25 instead of embedding similarity. If BM25-over-chunks approaches AST retrieval's efficiency, the search mechanism is the dominant factor.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We suspect the retrieval unit is the larger contributor --- the metadata-vs-source asymmetry in the search step and the absence of irrelevant context per result are structural properties of symbol-level retrieval, independent of how symbols are ranked. But without these controls, we cannot claim this definitively.&lt;/p&gt;

&lt;p&gt;Additionally, the query corpus (Section 6.2) consists of short keyword queries that lexically match symbol names. This is the scenario where BM25 has maximum advantage over dense embeddings. Queries requiring semantic inference (e.g., "what runs before the handler on each request" to find middleware) would likely favor embedding search. The results should be read with this bias in mind.&lt;/p&gt;
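&lt;p&gt;For readers unfamiliar with the mechanism, BM25 over symbol-name tokens is small enough to sketch inline. The scoring below is standard Okapi BM25; the symbol corpus is invented for illustration and is not from the benchmark.&lt;/p&gt;

```python
import math

# Minimal Okapi BM25 ranker over pre-tokenized symbol names.
# Corpus and query are illustrative examples, not benchmark data.

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank docs (lists of tokens) against query tokens; returns (index, score) pairs."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = []
    for i, doc in enumerate(docs):
        dl = len(doc)
        score = 0.0
        for term in query:
            df = sum(1 for d in docs if term in d)   # document frequency
            tf = doc.count(term)                      # term frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            denom = tf + k1 * (1.0 - b + b * dl / avgdl)
            score += idf * tf * (k1 + 1.0) / denom
        scores.append((i, score))
    return sorted(scores, key=lambda p: p[1], reverse=True)

# Symbols tokenized from camelCase/snake_case names (hypothetical examples):
symbols = [
    ["create", "router", "lib", "router"],
    ["handle", "request", "lib", "application"],
    ["verify", "credentials", "security", "utils"],
]
top = bm25_rank(["router"], symbols)
```

&lt;p&gt;Because the scoring is purely lexical, a query like &lt;code&gt;bm25_rank(["authentication"], symbols)&lt;/code&gt; scores every document zero here, which is exactly the failure mode that favors dense embeddings.&lt;/p&gt;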

&lt;h3&gt;
  
  
  8.3 Why FastAPI's Margin Is Narrower
&lt;/h3&gt;

&lt;p&gt;AST retrieval still wins on FastAPI (1.6x advantage), but the margin is smaller than on Express (3.1x) or Gin (3.9x). Two factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;FastAPI has high symbol density.&lt;/strong&gt; With 5,325 symbols across 951 files, BM25 over symbol names produces more candidates, and the top-3 fetched symbols are sometimes larger (e.g., &lt;code&gt;request response&lt;/code&gt; on FastAPI fetches 4,009 source tokens due to large Request/Response class definitions).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RAG-512 performs relatively well on large, well-structured Python files.&lt;/strong&gt; FastAPI's code style produces chunks that, while often split (53%), still contain semantically relevant code due to the framework's dense annotation style.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  8.4 Where RAG Still Makes Sense
&lt;/h3&gt;

&lt;p&gt;AST symbol retrieval is not a universal replacement for RAG:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Natural language documentation.&lt;/strong&gt; Docstrings, README files, API descriptions, and inline comments are not syntactic symbols. RAG over prose documents remains appropriate for these artifacts. (A companion tool for section-level document retrieval handles this case separately.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Semantic similarity across naming conventions.&lt;/strong&gt; BM25 search requires lexical overlap between the query and symbol names. A query like "authentication" will not match a function named &lt;code&gt;verify_credentials&lt;/code&gt; unless the surrounding qualified name or file path contains relevant terms. Dense embedding models capture this semantic proximity. For codebases with inconsistent naming, RAG may surface relevant code that BM25 misses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Codebases without parseable structure.&lt;/strong&gt; Configuration files, data pipelines, template languages, and heavily metaprogrammed code may not produce meaningful AST symbols. RAG handles these as opaque text, which is at least something.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  8.5 Failure Modes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AST retrieval fails when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The query intent maps to code spread across many small utility functions with generic names.&lt;/li&gt;
&lt;li&gt;The symbol index is stale (file changed since last parse). Staleness detection mitigates this.&lt;/li&gt;
&lt;li&gt;The language lacks a tree-sitter grammar. Coverage is broad (25+ languages) but not complete.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;RAG fails when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The relevant code is smaller than the chunk size (over-retrieval).&lt;/li&gt;
&lt;li&gt;The relevant code is larger than the chunk size (under-retrieval, split across chunks).&lt;/li&gt;
&lt;li&gt;The query is specific but the embedding model generalizes too aggressively, returning topically related but functionally irrelevant chunks.&lt;/li&gt;
&lt;/ul&gt;
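&lt;p&gt;The first two RAG failure modes are easy to demonstrate. The snippet below, with source text and offsets invented for illustration, splits a function across fixed windows, while a symbol index returns it whole via recorded offsets.&lt;/p&gt;

```python
# Toy demonstration: fixed windows split code units; offset-addressed
# symbol retrieval does not. Source text and offsets are made up.

source = (
    "def helper():\n    return 1\n\n"
    "def verify_credentials(user, password):\n"
    "    record = load_user(user)\n"
    "    return check_hash(record, password)\n"
)

def fixed_chunks(text, size=60):
    # Character-based windows, the moral equivalent of token-window chunking.
    return [text[i : i + size] for i in range(0, len(text), size)]

# A symbol index stores offsets for each definition, so retrieval
# returns the complete unit regardless of its size.
symbol_index = {"verify_credentials": (28, len(source))}

chunks = fixed_chunks(source)
start, end = symbol_index["verify_credentials"]
whole_symbol = source[start:end]
```

&lt;p&gt;With a 60-character window the second function is spread across all three chunks; the offset-addressed slice returns it intact.&lt;/p&gt;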




&lt;h2&gt;
  
  
  9. Discussion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  9.1 Implications for Developer Tooling
&lt;/h3&gt;

&lt;p&gt;The results suggest that code retrieval tools should match their retrieval unit to the structure of the data. Code has natural units --- functions, classes, methods --- that are well-defined, complete, and independently meaningful. Using these as retrieval units eliminates an entire class of problems (chunk boundaries, irrelevant context, double-counting) without adding complexity.&lt;/p&gt;

&lt;p&gt;This is not a new insight. IDEs and editors have navigated code by symbols for decades (ctags, IntelliSense, the Language Server Protocol). What is new is that LLM agents can use the same granularity, and the token economics make it worth doing.&lt;/p&gt;

&lt;h3&gt;
  
  
  9.2 Toward a Retrieval Interface Standard
&lt;/h3&gt;

&lt;p&gt;The retrieval workflow tested here follows a three-step pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Discover:&lt;/strong&gt; enumerate available repositories, files, or outlines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search:&lt;/strong&gt; find relevant symbols by name, kind, or text query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieve:&lt;/strong&gt; fetch complete source for selected symbols.&lt;/li&gt;
&lt;/ol&gt;
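&lt;p&gt;As a sketch, the three-step contract can be written down as an interface. The names below are illustrative, not the identifiers any specification mandates.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class SymbolRef:
    # Stable identifier: enables caching and cross-referencing between calls.
    id: str
    name: str
    kind: str          # "function", "class", "method", ...
    file: str
    line_range: tuple

class RetrievalInterface(Protocol):
    def discover(self) -> list:
        """Enumerate available repositories, files, or outlines."""

    def search(self, query: str, max_results: int = 5) -> list:
        """Return SymbolRef metadata only, never source text."""

    def retrieve(self, symbol_id: str) -> str:
        """Return the complete source of one semantic unit."""
```

&lt;p&gt;The important constraint lives in &lt;code&gt;search&lt;/code&gt;: it returns references, not content, so the agent decides what to pay for before fetching.&lt;/p&gt;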

&lt;p&gt;This pattern is general enough to standardize. One such effort is the jMunch Retrieval Interface (jMRI) [9], an open specification for token-efficient context retrieval in MCP servers. jMRI formalizes the discover/search/retrieve contract, requires that retrieved content represent complete semantic units (functions, classes, documentation sections), mandates stable identifiers for caching and cross-reference, and includes per-response token savings metadata so agents can measure efficiency per query. The specification defines two compliance tiers (Basic and Full), allowing implementations to adopt the interface incrementally regardless of their underlying search mechanism (BM25, embedding, hybrid).&lt;/p&gt;

&lt;p&gt;The key insight behind jMRI --- and the one supported by this paper's results --- is that the retrieval &lt;em&gt;interface&lt;/em&gt; should constrain the retrieval &lt;em&gt;unit&lt;/em&gt;. An interface that guarantees complete syntactic units eliminates chunk boundary artifacts at the contract level, not as an implementation detail that individual tools may or may not get right.&lt;/p&gt;

&lt;h3&gt;
  
  
  9.3 Cost at Scale
&lt;/h3&gt;

&lt;p&gt;At current LLM API pricing ($3--15 per million input tokens for frontier models), the difference between ~1,000 and ~3,000 tokens per query is small in absolute terms. At scale, it compounds. An agentic workflow that makes 50 retrieval queries per task, run across 100 tasks per day:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;RAG-512 tokens/day&lt;/th&gt;
&lt;th&gt;AST tokens/day&lt;/th&gt;
&lt;th&gt;Daily savings at $10/M&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Best case (Express-like, 3.1x margin)&lt;/td&gt;
&lt;td&gt;14,435,000&lt;/td&gt;
&lt;td&gt;4,620,000&lt;/td&gt;
&lt;td&gt;$98.15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Worst case (FastAPI-like, 1.6x margin)&lt;/td&gt;
&lt;td&gt;14,250,000&lt;/td&gt;
&lt;td&gt;9,170,000&lt;/td&gt;
&lt;td&gt;$50.80&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
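&lt;p&gt;The arithmetic is easy to verify. Per-query token figures below come from dividing the daily totals by the 5,000 queries per day; note that the dollar figures correspond to a single day of queries, so a month of continuous use multiplies them by roughly 30.&lt;/p&gt;

```python
# Back-of-envelope check of the scaling table: 50 queries/task, 100 tasks/day,
# $10 per million input tokens. Per-query counts are daily totals / 5,000.

QUERIES_PER_DAY = 50 * 100
PRICE_PER_TOKEN = 10 / 1_000_000

def dollars_saved(rag_per_query, ast_per_query, days=1):
    saved_tokens = (rag_per_query - ast_per_query) * QUERIES_PER_DAY * days
    return round(saved_tokens * PRICE_PER_TOKEN, 2)

best_daily = dollars_saved(2887, 924)       # Express-like, 3.1x margin
worst_daily = dollars_saved(2850, 1834)     # FastAPI-like, 1.6x margin
best_monthly = dollars_saved(2887, 924, days=30)
```

&lt;p&gt;At that rate the best case is roughly $2,900 per month and the worst case roughly $1,500 per month for a single 100-task workload.&lt;/p&gt;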

&lt;p&gt;The range matters. On a tightly scoped codebase with well-named symbols, the savings are substantial. On a large, symbol-dense repository, the margin is real but more modest. For teams running agentic CI/CD, code review bots, or continuous refactoring agents across multiple repositories, even the worst-case savings are material over months.&lt;/p&gt;

&lt;h3&gt;
  
  
  9.4 MCP Ecosystem Fit
&lt;/h3&gt;

&lt;p&gt;The Model Context Protocol (MCP) provides a standardized interface for LLM tools. AST symbol retrieval fits naturally into MCP's tool-call model: &lt;code&gt;search_symbols&lt;/code&gt; and &lt;code&gt;get_symbol_source&lt;/code&gt; are stateless, cacheable operations that return structured JSON. The agent controls retrieval depth --- it can fetch one symbol or ten, based on the search results. This is the opposite of RAG's "always return k chunks" model, and it gives agents fine-grained control over their token budget.&lt;/p&gt;
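&lt;p&gt;For concreteness, a server exposing these operations would declare something like the following in its tool list. The field layout follows the general MCP &lt;code&gt;tools/list&lt;/code&gt; shape; the exact schemas of any particular server may differ.&lt;/p&gt;

```python
# Illustrative MCP-style tool declarations for the two operations named above.
# Schema details are assumptions, not jcodemunch-mcp's actual definitions.

TOOLS = [
    {
        "name": "search_symbols",
        "description": "Rank symbols by query; returns metadata only.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_results": {"type": "integer", "default": 5},
            },
            "required": ["query"],
        },
    },
    {
        "name": "get_symbol_source",
        "description": "Fetch complete source for one symbol id.",
        "inputSchema": {
            "type": "object",
            "properties": {"id": {"type": "string"}},
            "required": ["id"],
        },
    },
]
```

&lt;p&gt;Both calls are stateless, so responses are cacheable by symbol id, and the agent chooses how many &lt;code&gt;get_symbol_source&lt;/code&gt; calls to spend per query.&lt;/p&gt;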




&lt;h2&gt;
  
  
  10. Limitations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  10.1 Language Coverage
&lt;/h3&gt;

&lt;p&gt;Tree-sitter grammars exist for ~40 languages, and the implementation tested here supports 25+. Languages without grammars (niche DSLs, proprietary languages) require custom extractors or fall back to file-level retrieval. Adding a new language requires mapping AST node types to symbol kinds --- typically a few hours of work, but non-trivial.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.2 Indexing Overhead
&lt;/h3&gt;

&lt;p&gt;The AST index must be built before queries can be served. For a 951-file repository, this takes 5--15 seconds. For monorepos with tens of thousands of files, indexing may take minutes. Incremental indexing (re-parse only changed files) mitigates this for iterative workflows, but the initial build cost is unavoidable.&lt;/p&gt;
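&lt;p&gt;The incremental half is straightforward; a content-hash check, sketched below with a hypothetical index layout, is enough to skip unchanged files on re-index.&lt;/p&gt;

```python
import hashlib

# Sketch of incremental re-indexing: re-parse a file only when its content
# hash changes. The index layout (path -&gt; digest) is hypothetical.

def file_digest(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def files_to_reparse(paths, index):
    """index maps each path to the digest recorded at its last parse."""
    stale = []
    for path in paths:
        digest = file_digest(path)
        if index.get(path) != digest:
            stale.append(path)
            index[path] = digest
    return stale
```

&lt;p&gt;On a second run over an unchanged tree, &lt;code&gt;files_to_reparse&lt;/code&gt; returns an empty list and the parser never runs.&lt;/p&gt;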

&lt;h3&gt;
  
  
  10.3 Query Corpus and Repository Diversity
&lt;/h3&gt;

&lt;p&gt;Five queries across three repositories are enough to demonstrate the structural advantage on web framework codebases, but we do not claim coverage of all code exploration patterns or architectures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository bias.&lt;/strong&gt; All three repositories are HTTP request-routing frameworks. They share a common conceptual vocabulary (router, middleware, handler, request, response, context), and the query corpus maps directly to this vocabulary. Codebases with different structures --- compilers, ML training loops, game engines, infrastructure-as-code, heavily metaprogrammed or macro-heavy code --- may produce different results. We have not tested these.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query bias.&lt;/strong&gt; All five queries are short keyword phrases that lexically match symbol names. Queries requiring semantic inference, natural language phrasing, or cross-file tracing may favor embedding-based retrieval. The results generalize most confidently to keyword-style code navigation queries on well-structured application codebases.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.4 Non-Code Use Cases
&lt;/h3&gt;

&lt;p&gt;AST symbol retrieval is specific to source code. Documentation, configuration files, data files, and prose artifacts require different retrieval strategies. The benchmarks in this paper measure code retrieval only.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.5 Retrieval Precision
&lt;/h3&gt;

&lt;p&gt;The benchmark measures token efficiency, not retrieval precision. Whether the top-3 retrieved symbols are the &lt;em&gt;correct&lt;/em&gt; symbols for a given query is a separate question. Independent evaluation (jMunchWorkbench) reports 96% precision on the same query corpus, but this metric is not the focus of this paper.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.6 Single Tokenizer
&lt;/h3&gt;

&lt;p&gt;All token counts use &lt;code&gt;tiktoken&lt;/code&gt; with &lt;code&gt;cl100k_base&lt;/code&gt;. Claude and GPT tokenizers produce slightly different counts for the same input. We use &lt;code&gt;cl100k_base&lt;/code&gt; as a common reference point; relative ratios (AST vs. RAG) are stable across tokenizer choices.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Threats to Validity
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Internal validity.&lt;/strong&gt; The two-step token accounting (search 5 + fetch 3) inflates RAG's token count relative to a single-pass pipeline. We estimate this adds 30--40% to RAG's measured tokens. Even after adjusting, AST retrieval remains more efficient, but the margin narrows --- particularly on FastAPI, where the adjusted comparison would approach 1.1--1.2x.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Construct validity.&lt;/strong&gt; Token count is a proxy for cost and context window pressure, not a direct measure of retrieval quality. A system that uses fewer tokens but returns irrelevant code is worse. We do not measure retrieval precision comparatively in this benchmark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;External validity.&lt;/strong&gt; Three web frameworks from one architectural pattern, tested with keyword queries, do not represent all codebases or query types. Generalization to monorepos, DSLs, metaprogrammed code, or natural-language queries is unvalidated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Experimenter bias.&lt;/strong&gt; The AST retrieval system under test was developed by the author. The RAG baseline was implemented specifically for this comparison and was not optimized. A third-party replication using a production RAG pipeline would strengthen the findings.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Related Work
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CodeSearchNet&lt;/strong&gt; (Husain et al., 2019) established benchmarks for code search using natural language queries over function-level documentation. The retrieval unit is the function --- consistent with our approach --- but the search mechanism is embedding-based, not BM25 over symbol names.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RepoMap&lt;/strong&gt; (Gauthier, 2023) uses tree-sitter to build repository outlines for LLM context, compressing file structure into tag-based summaries. This addresses the "what's in this repo" question but does not provide full source retrieval for individual symbols.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Aider&lt;/strong&gt; (Gauthier, 2023) integrates repository maps with LLM code editing. Its &lt;code&gt;--map-tokens&lt;/code&gt; budget controls how much structural context the LLM receives. This is complementary to symbol retrieval: the map provides orientation, symbol retrieval provides depth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SWE-agent&lt;/strong&gt; (Yang et al., 2024) and &lt;strong&gt;SWE-bench&lt;/strong&gt; (Jimenez et al., 2024) evaluate LLM agents on real GitHub issues. These agents use file-level tools (open, scroll, search) that operate at a coarser granularity than symbol retrieval. Integrating symbol-level tools into SWE-agent's action space is a natural extension.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GraphCodeBERT&lt;/strong&gt; (Guo et al., 2021) and &lt;strong&gt;UniXcoder&lt;/strong&gt; (Guo et al., 2022) use data flow and AST structure during pre-training to improve code understanding. These models could serve as embedding backends for a hybrid approach: AST-structured retrieval with semantic re-ranking.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. Conclusion
&lt;/h2&gt;

&lt;p&gt;The standard approach to code retrieval for LLM agents --- chunking source files into fixed-size text windows and retrieving by vector similarity --- is structurally mismatched to code. Code has natural boundaries (functions, classes, methods) that chunking ignores. The result is wasted tokens, fragmented context, and unnecessary infrastructure. The RAG ecosystem's own movement toward AST-aware chunking implicitly acknowledges this mismatch.&lt;/p&gt;

&lt;p&gt;AST-based symbol retrieval takes the idea to its logical conclusion: the retrieval unit is the symbol, not the chunk. The results on three web framework codebases are concrete: 1.6--3.9x fewer tokens per query than a naive fixed-chunk RAG pipeline, zero chunk-boundary artifacts, no embedding model, and sub-5ms query latency. End-to-end A/B tests on a production codebase confirm 20% cost savings in real agentic workflows, though with accuracy tradeoffs on fine-grained classification tasks that warrant further investigation.&lt;/p&gt;

&lt;p&gt;These results have clear scope limitations: three repos from one architectural niche, keyword queries that favor BM25, and a RAG baseline that does not represent production-grade retrieval. The structural argument --- that code retrieval should respect syntactic boundaries --- is stronger than the specific numbers, and holds regardless of whether the search mechanism is BM25, dense embeddings, or a hybrid.&lt;/p&gt;

&lt;p&gt;The issue is not the model. It is how we feed it.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Husain, H., Wu, H.-H., Gazit, T., Allamanis, M., &amp;amp; Brockschmidt, M. (2019). CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. &lt;em&gt;arXiv:1909.09436&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Gauthier, P. (2023). Aider: AI pair programming in your terminal. &lt;a href="https://aider.chat" rel="noopener noreferrer"&gt;https://aider.chat&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Yang, J., Jimenez, C. E., Wettig, A., et al. (2024). SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. &lt;em&gt;arXiv:2405.15793&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Jimenez, C. E., Yang, J., Wettig, A., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? &lt;em&gt;ICLR 2024&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Guo, D., Ren, S., Lu, S., et al. (2021). GraphCodeBERT: Pre-training Code Representations with Data Flow. &lt;em&gt;ICLR 2021&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Guo, D., Lu, S., Duan, N., et al. (2022). UniXcoder: Unified Cross-Modal Pre-training for Code Representation. &lt;em&gt;ACL 2022&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Brunsfeld, M., et al. (2024). Tree-sitter: An incremental parsing system for programming tools. &lt;a href="https://tree-sitter.github.io" rel="noopener noreferrer"&gt;https://tree-sitter.github.io&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. &lt;em&gt;NeurIPS 2020&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Gravelle, J. (2026). jMunch Retrieval Interface (jMRI) Specification. &lt;a href="https://github.com/jgravelle/mcp-retrieval-spec" rel="noopener noreferrer"&gt;https://github.com/jgravelle/mcp-retrieval-spec&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Appendix A: Reproduction Instructions
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;jcodemunch-mcp tiktoken

&lt;span class="c"&gt;# Index the three canonical repos&lt;/span&gt;
jcodemunch index_repo expressjs/express
jcodemunch index_repo fastapi/fastapi
jcodemunch index_repo gin-gonic/gin

&lt;span class="c"&gt;# Run AST benchmark (prints markdown table + grand summary)&lt;/span&gt;
python benchmarks/harness/run_benchmark.py

&lt;span class="c"&gt;# Run RAG baseline (requires additional deps)&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; benchmarks/requirements-rag-bench.txt
python benchmarks/harness/run_rag_baseline.py

&lt;span class="c"&gt;# Write results to files&lt;/span&gt;
python benchmarks/harness/run_benchmark.py &lt;span class="nt"&gt;--out&lt;/span&gt; benchmarks/results.md
python benchmarks/harness/run_rag_baseline.py &lt;span class="nt"&gt;--out&lt;/span&gt; benchmarks/rag_baseline_results.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both harnesses read from the same &lt;code&gt;IndexStore&lt;/code&gt;, guaranteeing identical file sets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Appendix B: Raw Data Availability
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AST benchmark results: &lt;code&gt;benchmarks/results.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;RAG baseline results: &lt;code&gt;benchmarks/rag_baseline_results.md&lt;/code&gt; and &lt;code&gt;rag_baseline_results.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Task corpus: &lt;code&gt;benchmarks/tasks.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;A/B test reports: &lt;code&gt;benchmarks/ab-test-naming-audit-2026-03-18.md&lt;/code&gt;, &lt;code&gt;benchmarks/ab-test-dead-code-2026-03-18.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;A/B test raw data: &lt;a href="https://gist.github.com/Mharbulous/bb097396fa92ef1d34d03a72b56b2c61" rel="noopener noreferrer"&gt;https://gist.github.com/Mharbulous/bb097396fa92ef1d34d03a72b56b2c61&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Source code: &lt;a href="https://github.com/jgravelle/jcodemunch-mcp" rel="noopener noreferrer"&gt;https://github.com/jgravelle/jcodemunch-mcp&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>astcoderetrieval</category>
      <category>tokenefficientrag</category>
      <category>langchainragvsast</category>
      <category>mcpsymbolretrieval</category>
    </item>
    <item>
      <title>Auto-Generate &amp; Sync Searchable Code Docs in Notion from Any Repo – Token-Efficient with Claude &amp; MCP</title>
      <dc:creator>J. Gravelle</dc:creator>
      <pubDate>Sun, 22 Mar 2026 15:58:23 +0000</pubDate>
      <link>https://dev.to/jgravelle/notioncodemirror-2ekp</link>
      <guid>https://dev.to/jgravelle/notioncodemirror-2ekp</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/notion-2026-03-04"&gt;Notion MCP Challenge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3k2zsn6j5o8p88b24r8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3k2zsn6j5o8p88b24r8n.png" alt="NotionCodeMirror logo" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;NotionCodeMirror — a CLI that auto-generates a living code documentation workspace in Notion from any GitHub repo, and keeps it in sync as the code evolves.&lt;/p&gt;

&lt;p&gt;Point it at a repo, and within minutes you get a fully structured Notion workspace:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An Overview with language breakdown and symbol inventory&lt;/li&gt;
&lt;li&gt;An Architecture page written in real prose by Claude&lt;/li&gt;
&lt;li&gt;A searchable API Reference database populated with every function and class&lt;/li&gt;
&lt;li&gt;A module page for each top-level directory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run it again after a PR merges, and only the changed pages update.&lt;/p&gt;

&lt;p&gt;The core idea is multi-MCP orchestration.&lt;/p&gt;

&lt;p&gt;Phase 1 uses jcodemunch-mcp to analyze the codebase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracts symbols&lt;/li&gt;
&lt;li&gt;Ranks them by import-graph centrality&lt;/li&gt;
&lt;li&gt;Traces dependency edges&lt;/li&gt;
&lt;li&gt;Builds class hierarchies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this happens without involving Claude.&lt;/p&gt;

&lt;p&gt;Phase 2 hands Claude a compact structured digest (about 8–12K tokens) instead of raw source files, so it can focus entirely on synthesis and writing.&lt;/p&gt;

&lt;p&gt;The API Reference database is batch-populated directly via HTTP, bypassing the agent loop for large inserts (100+ rows at a time).&lt;/p&gt;
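&lt;p&gt;A sketch of what those batch inserts might look like. The property names (&lt;code&gt;Name&lt;/code&gt;, &lt;code&gt;Kind&lt;/code&gt;, &lt;code&gt;File&lt;/code&gt;) are hypothetical; the envelope follows the shape of the public Notion create-page API.&lt;/p&gt;

```python
# Hypothetical sketch of Phase 2's batch-insert payloads. Property names are
# invented; the parent/properties envelope follows the public Notion API.

def symbol_row_payload(database_id, symbol):
    return {
        "parent": {"database_id": database_id},
        "properties": {
            "Name": {"title": [{"text": {"content": symbol["name"]}}]},
            "Kind": {"select": {"name": symbol["kind"]}},
            "File": {"rich_text": [{"text": {"content": symbol["file"]}}]},
        },
    }

rows = [
    symbol_row_payload("db-123", s)
    for s in [{"name": "create_app", "kind": "function", "file": "app.py"}]
]
# Each payload would be POSTed to the Notion pages endpoint with an
# Authorization header and a Notion-Version header.
```

&lt;p&gt;Building payloads in Python and looping over plain HTTP keeps the 100+ row inserts out of Claude's context entirely.&lt;/p&gt;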

&lt;p&gt;A full run on a medium-sized repo costs roughly 10–15K Claude tokens — closer to a single conversation than a traditional indexing job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Video Demo
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=C99oAE69Og0" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=C99oAE69Og0&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Show us the code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/jgravelle/notion-code-mirror" rel="noopener noreferrer"&gt;https://github.com/jgravelle/notion-code-mirror&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ...and the results:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.notion.so/jcodemunch-mcp-CodeMirror-32b752802a7f812da7bee46c5460beb1" rel="noopener noreferrer"&gt;https://www.notion.so/jcodemunch-mcp-CodeMirror-32b752802a7f812da7bee46c5460beb1&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Notion MCP
&lt;/h2&gt;

&lt;p&gt;Notion MCP is the write layer of the pipeline.&lt;/p&gt;

&lt;p&gt;After Claude analyzes the gathered repo data, it calls four lightweight Python tools: &lt;code&gt;notion_create_page&lt;/code&gt;, &lt;code&gt;notion_create_database&lt;/code&gt;, &lt;code&gt;notion_update_page&lt;/code&gt;, and &lt;code&gt;done&lt;/code&gt;. Each of those dispatches to the Notion MCP server via an async stdio session — the same MCP transport pattern used on the code-analysis side with jcodemunch-mcp. Both MCP servers stay open concurrently for the duration of the run.&lt;/p&gt;


&lt;p&gt;The API Reference rows are the one exception: Claude creates the database shell and signals completion via done, then Python batch-inserts every symbol row through the same Notion MCP session — keeping Claude out of a loop that would otherwise cost 100+ tool-call round-trips.&lt;/p&gt;

&lt;p&gt;What Notion MCP specifically unlocks is direct writing into a structured workspace instead of dumping out a Markdown file you then have to paste somewhere. Claude doesn’t just generate text. It decides the page hierarchy, picks emoji icons, chooses what belongs in a database versus a prose page, and places everything under the right parent. That produces a workspace that is immediately navigable and shareable rather than a lonely text artifact drifting around your desktop.&lt;/p&gt;

&lt;p&gt;The incremental sync story also depends on MCP making page IDs first-class. On the first run, every created page and database ID is saved to a local state file. On --sync, those IDs go back into Claude’s context, and it calls notion_update_page on the existing objects instead of duplicating them.&lt;/p&gt;

&lt;p&gt;Without MCP as the integration layer, you’d need a custom Notion client plus a fair amount of bookkeeping to get the same result...&lt;/p&gt;

&lt;p&gt;-jgravelle&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>notionchallenge</category>
      <category>mcp</category>
      <category>ai</category>
    </item>
    <item>
      <title>Bringing The Receipts - 95% AI LLM Token Savings</title>
      <dc:creator>J. Gravelle</dc:creator>
      <pubDate>Thu, 19 Mar 2026 18:42:25 +0000</pubDate>
      <link>https://dev.to/jgravelle/bringing-the-receipts-95-ai-llm-token-savings-1eni</link>
      <guid>https://dev.to/jgravelle/bringing-the-receipts-95-ai-llm-token-savings-1eni</guid>
      <description>&lt;h1&gt;
  
  
  95% Token Reduction, 96% Precision:
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Benchmarking jCodeMunch Against Chunk RAG and Naive File Reading
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Here are the benchmarks... AND the bench!
&lt;/h3&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Across 15 tasks on 3 real repos, structured MCP symbol retrieval achieves &lt;strong&gt;95% avg token reduction&lt;/strong&gt; vs naive file reading — while hitting &lt;strong&gt;96% precision&lt;/strong&gt; vs &lt;strong&gt;74% for chunk RAG&lt;/strong&gt;. The benchmark harness is open-source. You can reproduce every number in under 5 minutes.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This is a follow-up to &lt;a href="https://dev.to/jgravelle/your-ai-agent-is-dumpster-diving-through-your-code-326f"&gt;Your AI Agent Is Dumpster Diving Through Your Code&lt;/a&gt; — the argument there is the setup for the proof here. Worth 5 minutes if you haven't read it.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Must-read setup:&lt;/strong&gt; The first article lays out &lt;em&gt;why&lt;/em&gt; file-reading agents waste tokens structurally — not because they're badly written, but because reading whole files is the wrong unit of retrieval for code. This article is the empirical test of that claim.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Last time, I argued that AI coding agents waste absurd amounts of tokens rummaging through whole files and sloppy chunks. Fair enough. Big claim.&lt;/p&gt;

&lt;p&gt;So here are the receipts: the benchmark results, the methodology, and the free open-source tool we built so anyone can test the same patterns for themselves.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers, Up Front
&lt;/h2&gt;

&lt;p&gt;Because you shouldn't have to spelunk for them:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Baseline tokens (15 task-runs, 3 repos)&lt;/td&gt;
&lt;td&gt;1,865,210&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;jCodeMunch tokens (same tasks)&lt;/td&gt;
&lt;td&gt;92,515&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average reduction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;95.0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ratio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;20.2x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;3 repos. 5 queries each. 15 total task-runs. Measured with tiktoken &lt;code&gt;cl100k_base&lt;/code&gt;. Last run: March 2026.&lt;/strong&gt; Baseline is all indexed source files concatenated — the minimum tokens a "read everything first" agent would consume in a single pass. Real agents re-read files and explore multiple branches, so production savings are higher.&lt;/p&gt;

&lt;p&gt;The single most dramatic example: for the query &lt;code&gt;"dependency injection"&lt;/code&gt; against the FastAPI codebase, a standard file-reading agent consumed &lt;strong&gt;214,312 tokens&lt;/strong&gt; and cost ~$1.00. The same query through jCodeMunch returned the exact symbols in &lt;strong&gt;480 tokens&lt;/strong&gt; at ~$0.002.&lt;/p&gt;

&lt;p&gt;That's not a rounding error. That's a different category of efficiency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs68inbcvw48yn0hoqclv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs68inbcvw48yn0hoqclv.png" alt="jMunchWorkbench side-by-side output: query results, token counts, relevance scores, latency comparison — baseline vs jCodeMunch" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Side-by-side terminal output: standard agent greps 156 files, reads 3 of them, burns 214K tokens. jCodeMunch: one search call, one get_symbol call, 480 tokens. Same answer.)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How the MCP Symbol Retrieval Benchmark Works
&lt;/h2&gt;

&lt;p&gt;Three approaches compared on three real public codebases (expressjs/express, fastapi/fastapi, gin-gonic/gin). Same queries against each.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 1: Naive File Reading
&lt;/h3&gt;

&lt;p&gt;The agent reads all source files. This is what agents do when they have no retrieval layer — grep for a keyword, open the files that matched, read them in full. It's also the cleanest possible baseline: we concatenate all source files and count the tokens once. That number is a &lt;em&gt;lower bound&lt;/em&gt; on what a real "open everything" agent pays per session.&lt;/p&gt;
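&lt;p&gt;Computing that lower bound takes only a few lines. A sketch on a throwaway two-file repo, with a whitespace split standing in for the tiktoken &lt;code&gt;cl100k_base&lt;/code&gt; encoder the real harness uses:&lt;/p&gt;

```python
import pathlib
import tempfile

def baseline_tokens(repo_dir, count_tokens):
    # Lower bound: every indexed source file, read and counted once.
    files = sorted(pathlib.Path(repo_dir).rglob("*.py"))
    return sum(count_tokens(p.read_text()) for p in files)

# Demo on a throwaway two-file "repo"; a whitespace split stands in
# for the tiktoken cl100k_base encoder used in the real harness.
with tempfile.TemporaryDirectory() as repo:
    pathlib.Path(repo, "a.py").write_text("def a(): return 1")
    pathlib.Path(repo, "b.py").write_text("def b(): return 2")
    print(baseline_tokens(repo, lambda text: len(text.split())))
```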

&lt;h3&gt;
  
  
  Approach 2: Chunk-Based RAG
&lt;/h3&gt;

&lt;p&gt;Files are split into overlapping windows of text, embedded, and ranked by similarity to the query. The top chunks are returned. It's cheaper than naive — but chunk boundaries fall in the middle of functions, and similarity ranking is approximate by design. You pay less &lt;em&gt;and&lt;/em&gt; you get less precise results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 3: Tree-Sitter Structured Retrieval (jCodeMunch via jMRI)
&lt;/h3&gt;

&lt;p&gt;Every source file is parsed into an AST-derived index of named, addressable symbols: functions, classes, methods, constants. A search query returns symbol IDs and metadata — not source. A retrieve call on a specific ID returns exactly that symbol's source, byte-offset-addressed directly from the original file. No chunks. No boundaries. No guessing.&lt;/p&gt;

&lt;p&gt;The workflow measured:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;search_symbols(query, max_results=5)&lt;/code&gt; → ranked symbol metadata (IDs + signatures)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;get_symbol(id)&lt;/code&gt; × 3 → exact source for the top 3 hits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total tokens&lt;/strong&gt; = search response + 3 × symbol source&lt;/li&gt;
&lt;/ol&gt;
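&lt;p&gt;The accounting itself is a one-liner. A sketch where &lt;code&gt;count_tokens&lt;/code&gt; stands in for tiktoken's &lt;code&gt;cl100k_base&lt;/code&gt; encoder and the strings stand in for real MCP responses:&lt;/p&gt;

```python
def count_tokens(text):
    # Whitespace stand-in for tiktoken's cl100k_base encoder;
    # enough to show the accounting, not the real token counts.
    return len(text.split())

def workflow_cost(search_response, symbol_sources):
    # Total tokens = search response plus each retrieved symbol body.
    return count_tokens(search_response) + sum(count_tokens(s) for s in symbol_sources)

# Illustrative stand-ins for real MCP responses.
search = "id=oauth2.py::OAuth2PasswordBearer score=0.91"
symbols = ["def a(): pass", "def b(): pass", "class C: pass"]
print(workflow_cost(search, symbols))
```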




&lt;h2&gt;
  
  
  Introducing jMunchWorkbench: The Measuring Stick
&lt;/h2&gt;

&lt;p&gt;The benchmark harness measures token efficiency. But token efficiency is only half the story — the other half is: &lt;em&gt;did you actually retrieve the right code?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's what &lt;strong&gt;&lt;a href="https://github.com/jgravelle/jMunchWorkbench" rel="noopener noreferrer"&gt;jMunchWorkbench&lt;/a&gt;&lt;/strong&gt; is for.&lt;/p&gt;

&lt;p&gt;jMunchWorkbench runs the same prompt in two modes — baseline file reading and jCodeMunch symbol retrieval — and compares answers, token counts, and latency side by side. A human evaluator judges whether the top-3 retrieved symbols are relevant to the query intent. That's how the 96% precision figure is generated.&lt;/p&gt;
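&lt;p&gt;The precision figure itself is just relevant judgments over total judgments; a sketch with made-up judgments:&lt;/p&gt;

```python
def precision(judgments):
    # Fraction of judged retrievals marked relevant (1) vs not (0).
    return sum(judgments) / len(judgments)

# Illustrative: 24 of 25 judged top-3 retrievals marked relevant.
print(precision([1] * 24 + [0]))  # 0.96
```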

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flp2ymkw3z4s2n7uqrzgw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flp2ymkw3z4s2n7uqrzgw.png" alt="jMunchWorkbench side-by-side output: query results, token counts, relevance scores, latency comparison — baseline vs jCodeMunch" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the measuring stick, not just the number. You are not being asked to trust a claim. You are being given the instrument we used to make the claim, and you can run it yourself.&lt;/p&gt;

&lt;p&gt;"We got 96% precision" is a marketing assertion. "Here is the evaluator, here are the queries, here is the ground truth, reproduce it" is a methodology.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Full Benchmark: Tree-Sitter Retrieval vs Naive Reading
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tokenizer:&lt;/strong&gt; &lt;code&gt;tiktoken cl100k_base&lt;/code&gt; | &lt;strong&gt;Workflow:&lt;/strong&gt; &lt;code&gt;search_symbols&lt;/code&gt; (top 5) + &lt;code&gt;get_symbol&lt;/code&gt; × 3 | &lt;strong&gt;Last run:&lt;/strong&gt; March 2026. Per-repo Ratio is the mean of the five per-query ratios; the grand-total row divides total baseline tokens by total jMunch tokens, which is why it reads lower.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Repo&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Symbols&lt;/th&gt;
&lt;th&gt;Baseline tokens&lt;/th&gt;
&lt;th&gt;jMunch avg tokens&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;expressjs/express&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;117&lt;/td&gt;
&lt;td&gt;73,838&lt;/td&gt;
&lt;td&gt;~1,166&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98.4%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;129.7x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fastapi/fastapi&lt;/td&gt;
&lt;td&gt;156&lt;/td&gt;
&lt;td&gt;1,359&lt;/td&gt;
&lt;td&gt;214,312&lt;/td&gt;
&lt;td&gt;~15,609&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;49.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gin-gonic/gin&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;805&lt;/td&gt;
&lt;td&gt;84,892&lt;/td&gt;
&lt;td&gt;~1,728&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50.7x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Grand total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;230&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2,281&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,865,210&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92,515&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;95.0%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;20.2x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The FastAPI numbers are the "worst" case — and they still show 92.7% reduction. FastAPI is the largest and most symbol-dense codebase of the three (156 files, 1,359 symbols). Broad queries like &lt;code&gt;"router route handler"&lt;/code&gt; pull in more symbols and more source. Specific queries like &lt;code&gt;"error exception"&lt;/code&gt; and &lt;code&gt;"context bind"&lt;/code&gt; return surgical results even on a large codebase, hitting 99% reduction.&lt;/p&gt;

&lt;p&gt;Per-query detail: fastapi/fastapi&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Baseline tokens&lt;/th&gt;
&lt;th&gt;jMunch tokens&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;router route handler&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;214,312&lt;/td&gt;
&lt;td&gt;43,474&lt;/td&gt;
&lt;td&gt;79.7%&lt;/td&gt;
&lt;td&gt;4.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;middleware&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;214,312&lt;/td&gt;
&lt;td&gt;24,271&lt;/td&gt;
&lt;td&gt;88.7%&lt;/td&gt;
&lt;td&gt;8.8x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;error exception&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;214,312&lt;/td&gt;
&lt;td&gt;2,233&lt;/td&gt;
&lt;td&gt;99.0%&lt;/td&gt;
&lt;td&gt;96.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;request response&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;214,312&lt;/td&gt;
&lt;td&gt;5,966&lt;/td&gt;
&lt;td&gt;97.2%&lt;/td&gt;
&lt;td&gt;35.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;context bind&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;214,312&lt;/td&gt;
&lt;td&gt;2,102&lt;/td&gt;
&lt;td&gt;99.0%&lt;/td&gt;
&lt;td&gt;102.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Per-query detail: expressjs/express&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Baseline tokens&lt;/th&gt;
&lt;th&gt;jMunch tokens&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;router route handler&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;73,838&lt;/td&gt;
&lt;td&gt;1,221&lt;/td&gt;
&lt;td&gt;98.3%&lt;/td&gt;
&lt;td&gt;60.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;middleware&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;73,838&lt;/td&gt;
&lt;td&gt;1,360&lt;/td&gt;
&lt;td&gt;98.2%&lt;/td&gt;
&lt;td&gt;54.3x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;error exception&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;73,838&lt;/td&gt;
&lt;td&gt;1,381&lt;/td&gt;
&lt;td&gt;98.1%&lt;/td&gt;
&lt;td&gt;53.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;request response&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;73,838&lt;/td&gt;
&lt;td&gt;1,699&lt;/td&gt;
&lt;td&gt;97.7%&lt;/td&gt;
&lt;td&gt;43.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;context bind&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;73,838&lt;/td&gt;
&lt;td&gt;169&lt;/td&gt;
&lt;td&gt;99.8%&lt;/td&gt;
&lt;td&gt;436.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Per-query detail: gin-gonic/gin&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query&lt;/th&gt;
&lt;th&gt;Baseline tokens&lt;/th&gt;
&lt;th&gt;jMunch tokens&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;router route handler&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;84,892&lt;/td&gt;
&lt;td&gt;1,355&lt;/td&gt;
&lt;td&gt;98.4%&lt;/td&gt;
&lt;td&gt;62.7x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;middleware&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;84,892&lt;/td&gt;
&lt;td&gt;2,178&lt;/td&gt;
&lt;td&gt;97.4%&lt;/td&gt;
&lt;td&gt;39.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;error exception&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;84,892&lt;/td&gt;
&lt;td&gt;1,470&lt;/td&gt;
&lt;td&gt;98.3%&lt;/td&gt;
&lt;td&gt;57.7x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;request response&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;84,892&lt;/td&gt;
&lt;td&gt;1,642&lt;/td&gt;
&lt;td&gt;98.1%&lt;/td&gt;
&lt;td&gt;51.7x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;context bind&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;84,892&lt;/td&gt;
&lt;td&gt;1,994&lt;/td&gt;
&lt;td&gt;97.7%&lt;/td&gt;
&lt;td&gt;42.6x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
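&lt;p&gt;The Reduction and Ratio columns in these tables are plain arithmetic; a quick check against the FastAPI &lt;code&gt;error exception&lt;/code&gt; row:&lt;/p&gt;

```python
def reduction_pct(baseline, retrieved):
    # Percent of baseline tokens avoided.
    return round(100 * (1 - retrieved / baseline), 1)

def ratio(baseline, retrieved):
    # How many times cheaper the retrieval path is.
    return round(baseline / retrieved, 1)

# FastAPI "error exception" row from the per-query table above.
print(reduction_pct(214_312, 2_233))  # 99.0
print(ratio(214_312, 2_233))          # 96.0
```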




&lt;h2&gt;
  
  
  Three-Way Comparison: Naive vs Chunk RAG vs Structured Retrieval
&lt;/h2&gt;

&lt;p&gt;The summary table above shows naive vs jCodeMunch. Here's where chunk RAG fits. These numbers are from jMunchWorkbench precision evaluation runs on the FastAPI codebase — the largest and hardest test case. The naive baseline here reflects actual agent-session overhead (multiple file reads across a session), which is why it differs from the concatenated baseline above.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Avg tokens (FastAPI)&lt;/th&gt;
&lt;th&gt;Cost/query&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Naive file reading&lt;/td&gt;
&lt;td&gt;949,904&lt;/td&gt;
&lt;td&gt;~$2.85&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunk-based RAG&lt;/td&gt;
&lt;td&gt;330,372&lt;/td&gt;
&lt;td&gt;~$0.99&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;jCodeMunch (structured)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;480&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.0014&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;96%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The conventional assumption is that you trade precision for efficiency — chunk RAG is "cheaper but less accurate than reading everything." &lt;strong&gt;jCodeMunch inverts that tradeoff: cheaper than both, and more accurate than chunks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;96% precision vs 74% is not a marginal improvement. That's the difference between an agent that finds the right function 19 times out of 20, and one that finds it 15 times out of 20 — with 4 wasted retrieval roundtrips per 20 queries at chunk-RAG prices.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real-World A/B Tests: Reduce Claude Code Token Usage in Production
&lt;/h2&gt;

&lt;p&gt;Synthetic benchmark queries are necessary for reproducibility, but they're not sufficient. You need to know what happens in production.&lt;/p&gt;

&lt;p&gt;Big thanks to community member &lt;strong&gt;&lt;a href="https://github.com/Mharbulous" rel="noopener noreferrer"&gt;@Mharbulous&lt;/a&gt;&lt;/strong&gt;, who ran two independent 50-iteration A/B tests on a real Vue 3 + Firebase production codebase (Vite, Vuetify 3, Cloud Functions) and open-sourced the raw data. Same task, same model (Claude Sonnet), randomized tool assignment — this is the kind of rigorous community testing that keeps benchmark authors honest.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test 1: Naming Audit Task
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Native tools&lt;/th&gt;
&lt;th&gt;jCodeMunch&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Success rate&lt;/td&gt;
&lt;td&gt;72%&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+8 pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timeout rate&lt;/td&gt;
&lt;td&gt;40%&lt;/td&gt;
&lt;td&gt;32%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−8 pp&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean cost/iteration&lt;/td&gt;
&lt;td&gt;$0.783&lt;/td&gt;
&lt;td&gt;$0.738&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−5.7%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean cache creation tokens&lt;/td&gt;
&lt;td&gt;104,135&lt;/td&gt;
&lt;td&gt;93,178&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−10.5%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Test 2: Dead Code Detection (Isolated Tool-Layer Cost)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Native tools&lt;/th&gt;
&lt;th&gt;jCodeMunch&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mean cost/iteration&lt;/td&gt;
&lt;td&gt;$0.4474&lt;/td&gt;
&lt;td&gt;$0.3560&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−20.0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean total tokens&lt;/td&gt;
&lt;td&gt;449,356&lt;/td&gt;
&lt;td&gt;289,275&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−36%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean duration&lt;/td&gt;
&lt;td&gt;129s&lt;/td&gt;
&lt;td&gt;117s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−9%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dead file F1&lt;/td&gt;
&lt;td&gt;95.8%&lt;/td&gt;
&lt;td&gt;95.7%&lt;/td&gt;
&lt;td&gt;equivalent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Cost savings in the dead code test: Wilcoxon p=0.0074, Cohen's d=−0.583.&lt;/strong&gt; Statistically significant.&lt;/p&gt;
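&lt;p&gt;For readers who want to sanity-check the effect size: one common paired-samples formulation of Cohen's d is the mean per-iteration difference divided by its standard deviation. A minimal sketch with made-up numbers, not the A/B data:&lt;/p&gt;

```python
import statistics

def cohens_d_paired(xs, ys):
    # Paired-samples effect size: mean(diff) / stdev(diff).
    diffs = [y - x for x, y in zip(xs, ys)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

# Illustrative per-iteration costs (native vs jCodeMunch), NOT the real data.
native = [0.45, 0.44, 0.47, 0.43, 0.46]
jmunch = [0.36, 0.35, 0.37, 0.34, 0.38]
print(cohens_d_paired(native, jmunch))
```

&lt;p&gt;A negative d means the jCodeMunch arm cost less per iteration; the Wilcoxon signed-rank p-value in the study is computed over the same paired differences.&lt;/p&gt;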

&lt;p&gt;The mechanism is direct: structured queries return smaller payloads than raw file reads — 39% fewer cache reads, lower cost, faster completion. Dead file detection is equivalent at ~96% F1 with no accuracy penalty.&lt;/p&gt;

&lt;p&gt;One honest note: there's an export-level classification gap (alive-file exports) that @Mharbulous surfaced. Three root causes were found and addressed in v1.8.1. The data is in the repo — not buried, not redacted.&lt;/p&gt;

&lt;p&gt;Raw data: &lt;a href="https://gist.github.com/Mharbulous/bb097396fa92ef1d34d03a72b56b2c61" rel="noopener noreferrer"&gt;gist.github.com/Mharbulous/bb097396fa92ef1d34d03a72b56b2c61&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cw3sso491wzh8jbdves.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0cw3sso491wzh8jbdves.png" alt="Live telemetry counter showing community-wide token savings — billions saved across participating jCodeMunch sessions" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(The live counter at j.gravelle.us/jCodeMunch pulls real session telemetry across all participating users — every &lt;code&gt;get_symbol&lt;/code&gt; call reports &lt;code&gt;tokens_saved&lt;/code&gt; locally. No estimation. The number you see is computed from actual session data.)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Tree-Sitter Beats Chunk RAG for AI Coding Agents
&lt;/h2&gt;

&lt;p&gt;The difference is not magic. It's granularity.&lt;/p&gt;

&lt;p&gt;Chunk RAG cuts files into overlapping text windows and ranks them by embedding similarity. It has two structural problems: boundaries fall in the middle of functions (so you get partial logic), and similarity ranking returns things that &lt;em&gt;look like&lt;/em&gt; the answer rather than things that &lt;em&gt;are&lt;/em&gt; the answer.&lt;/p&gt;

&lt;p&gt;Structured retrieval works at the right level of abstraction from the start. jCodeMunch uses tree-sitter to parse source into a symbol index. Each symbol has a deterministic ID and a byte offset in the original file. Search returns IDs. Retrieval seeks directly to the byte offset — O(1) access, no re-scanning, no boundary accidents. You get the whole function or class, exactly as written, every time.&lt;/p&gt;
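&lt;p&gt;A sketch of the idea, assuming an index that already maps symbol IDs to byte spans; the names here are illustrative, not jCodeMunch's actual internals:&lt;/p&gt;

```python
# Build a tiny source file and an index entry for one symbol.
source = "# header comment\ndef create_app():\n    return 'app'\n"
with open("demo.py", "wb") as f:
    f.write(source.encode("utf-8"))

# At index time, the parser would supply these byte offsets.
start = source.index("def create_app")
index = {"demo.py::create_app#function": ("demo.py", start, len(source))}

def get_symbol(symbol_id):
    path, s, e = index[symbol_id]
    with open(path, "rb") as f:
        f.seek(s)                # jump straight to the symbol's first byte
        return f.read(e - s).decode("utf-8")  # read exactly the symbol's span

print(get_symbol("demo.py::create_app#function"))
```

&lt;p&gt;No scan, no re-parse: one seek, one read, one complete syntactic unit.&lt;/p&gt;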

&lt;p&gt;The precision gap (96% vs 74%) is a structural consequence of this. When you retrieve a symbol by AST-derived ID, you get the complete logical unit. When you retrieve by similarity score, you get a text window that may or may not contain it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reproduce It in Under 5 Minutes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;jcodemunch-mcp tiktoken

&lt;span class="c"&gt;# Index the three benchmark repos&lt;/span&gt;
jcodemunch index_repo expressjs/express
jcodemunch index_repo fastapi/fastapi
jcodemunch index_repo gin-gonic/gin

&lt;span class="c"&gt;# Run the benchmark&lt;/span&gt;
python benchmarks/harness/run_benchmark.py

&lt;span class="c"&gt;# Write results to markdown&lt;/span&gt;
python benchmarks/harness/run_benchmark.py &lt;span class="nt"&gt;--out&lt;/span&gt; my_results.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The harness reads &lt;code&gt;tasks.json&lt;/code&gt; (the same five queries we used), calls &lt;code&gt;search_symbols&lt;/code&gt; and &lt;code&gt;get_symbol&lt;/code&gt;, counts tokens with tiktoken, and writes the same markdown tables you see in &lt;code&gt;results.md&lt;/code&gt;. Swap in your own repos, add your own queries, change the number of symbols fetched. The full methodology is in &lt;code&gt;METHODOLOGY.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For precision measurement, run jMunchWorkbench — same query set, human-evaluable relevance scoring, side-by-side comparison output.&lt;/p&gt;

&lt;p&gt;If you think the results are great: reproduce them.&lt;br&gt;
If you think they're flawed: reproduce them harder.&lt;/p&gt;

&lt;p&gt;That's why we built the bench.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try jCodeMunch: Reduce Claude Code Token Usage Now
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Free for non-commercial use&lt;/strong&gt; (personal projects, research, hobby). One-minute setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvx jcodemunch-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add to &lt;code&gt;~/.claude.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"jcodemunch-mcp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"jcodemunch-mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Free forever for personal/hobby use. Team and org licenses start at &lt;strong&gt;$79 (Builder, 1 dev) / $349 (Studio, up to 5 devs) / $1,999 (Platform, org-wide)&lt;/strong&gt; — &lt;a href="https://j.gravelle.us/jCodeMunch/" rel="noopener noreferrer"&gt;see pricing&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/jgravelle/jcodemunch-mcp" rel="noopener noreferrer"&gt;github.com/jgravelle/jcodemunch-mcp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark harness&lt;/strong&gt;: &lt;code&gt;benchmarks/harness/run_benchmark.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Methodology&lt;/strong&gt;: &lt;code&gt;benchmarks/METHODOLOGY.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workbench&lt;/strong&gt;: &lt;a href="https://github.com/jgravelle/jMunchWorkbench" rel="noopener noreferrer"&gt;github.com/jgravelle/jMunchWorkbench&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open spec (jMRI)&lt;/strong&gt;: &lt;a href="https://github.com/jgravelle/mcp-retrieval-spec" rel="noopener noreferrer"&gt;github.com/jgravelle/mcp-retrieval-spec&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Is this better than Cursor's built-in indexing?&lt;/strong&gt;&lt;br&gt;
Cursor's indexing is optimized for autocomplete and inline suggestions — it uses chunk embeddings at the file level. jCodeMunch is optimized for agent retrieval: symbol-level precision, AST-derived IDs, O(1) byte-offset access. Different tools, different problems. If you're running agents that need to retrieve specific functions or classes from large repos, jCodeMunch is purpose-built for that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does it work with Gemini, Antigravity, or Cursor in agent mode?&lt;/strong&gt;&lt;br&gt;
Yes. jCodeMunch implements the Model Context Protocol (MCP), which works with any MCP-compatible client: Claude Code, Google Antigravity, Cursor (agent/composer mode), and anything else that speaks MCP. Setup is identical across clients — add the server to your MCP config, restart, index a repo.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does byte-offset retrieval avoid chunk boundary issues?&lt;/strong&gt;&lt;br&gt;
When jCodeMunch indexes a file with tree-sitter, it stores each symbol's start and end byte positions in the original source file alongside the symbol ID. At retrieval time, &lt;code&gt;get_symbol&lt;/code&gt; opens the file, seeks directly to that byte offset, and reads exactly that many bytes. No file scanning. No chunking. No re-parsing. The result is always the complete syntactic unit — function body, class definition, or constant — as it appears in the original source.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not just use a larger context window instead?&lt;/strong&gt;&lt;br&gt;
Context window cost scales linearly with tokens consumed. A 200K-token context window doesn't make 200K tokens cheaper — it just lets you burn more of them before hitting the limit. jCodeMunch keeps retrieval cost near zero by returning only the exact symbols the agent needs, regardless of repo size. The larger the repo, the bigger the advantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this work with non-Python repos?&lt;/strong&gt;&lt;br&gt;
Yes. jCodeMunch uses tree-sitter for parsing, which supports 30+ languages including TypeScript, Go, Rust, Java, C/C++, Ruby, and more. The three benchmark repos span Python, JavaScript, and Go for exactly this reason.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Benchmark source: &lt;code&gt;benchmarks/harness/run_benchmark.py&lt;/code&gt; | Tokenizer: tiktoken &lt;code&gt;cl100k_base&lt;/code&gt; | Data: &lt;code&gt;benchmarks/results.md&lt;/code&gt; | Last run: March 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A/B test raw data by @Mharbulous: &lt;a href="https://gist.github.com/Mharbulous/bb097396fa92ef1d34d03a72b56b2c61" rel="noopener noreferrer"&gt;gist.github.com/Mharbulous/bb097396fa92ef1d34d03a72b56b2c61&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>mcp</category>
      <category>performance</category>
      <category>rag</category>
    </item>
    <item>
      <title>Your AI Agent Is Dumpster Diving Through Your Code...</title>
      <dc:creator>J. Gravelle</dc:creator>
      <pubDate>Mon, 09 Mar 2026 19:54:10 +0000</pubDate>
      <link>https://dev.to/jgravelle/your-ai-agent-is-dumpster-diving-through-your-code-326f</link>
      <guid>https://dev.to/jgravelle/your-ai-agent-is-dumpster-diving-through-your-code-326f</guid>
      <description>&lt;p&gt;&lt;em&gt;...and we built something to stop it.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;There's a pattern every developer using AI agents eventually notices. You ask the agent to find where authentication is handled. It opens a file. Skims 2,000 lines. Opens another file. Skims that. Opens a third. By the time it answers, it's consumed 40,000 tokens — most of them irrelevant — and your context window is half-gone before the real work starts.&lt;/p&gt;

&lt;p&gt;We call this dumpster diving. The agent isn't reading strategically. It's digging through everything looking for something edible.&lt;/p&gt;

&lt;p&gt;We've been watching this happen across millions of sessions with &lt;a href="https://j.gravelle.us/jCodeMunch/" rel="noopener noreferrer"&gt;jCodeMunch&lt;/a&gt; and &lt;a href="https://j.gravelle.us/jDocMunch/" rel="noopener noreferrer"&gt;jDocMunch&lt;/a&gt;. And we built something to fix it: &lt;strong&gt;jMRI&lt;/strong&gt; — the jMunch Retrieval Interface.&lt;/p&gt;

&lt;p&gt;Today we're publishing the spec, the benchmark, and an open-source SDK. All Apache 2.0.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;We ran the benchmark against two real codebases: FastAPI and Flask. Three methods compared: naive file reading, chunk RAG, and jMRI retrieval via jCodeMunch. Ten queries per repo. Here's what came out:&lt;/p&gt;

&lt;h3&gt;
  
  
  FastAPI (~950K source tokens)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Avg Tokens&lt;/th&gt;
&lt;th&gt;Cost/Query&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Naive (read all files)&lt;/td&gt;
&lt;td&gt;949,904&lt;/td&gt;
&lt;td&gt;$2.85&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunk RAG&lt;/td&gt;
&lt;td&gt;330,372&lt;/td&gt;
&lt;td&gt;$0.99&lt;/td&gt;
&lt;td&gt;74%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;jMRI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;480&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0014&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;96%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Flask (~148K source tokens)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Avg Tokens&lt;/th&gt;
&lt;th&gt;Cost/Query&lt;/th&gt;
&lt;th&gt;Precision&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Naive (read all files)&lt;/td&gt;
&lt;td&gt;147,854&lt;/td&gt;
&lt;td&gt;$0.44&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunk RAG&lt;/td&gt;
&lt;td&gt;55,251&lt;/td&gt;
&lt;td&gt;$0.17&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;jMRI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;480&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.0014&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;96%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;jMRI uses &lt;strong&gt;1,979x fewer tokens than naive&lt;/strong&gt; on FastAPI. It also &lt;strong&gt;beats chunk RAG on precision&lt;/strong&gt; — 96% vs 74%.&lt;/p&gt;

&lt;p&gt;That last point matters. The usual assumption is that precision is the tradeoff you make for efficiency. Chunk RAG is cheaper than naive but misses more. jMRI is cheaper than both and misses less. That's not a coincidence — it's a consequence of using structure instead of text similarity.&lt;/p&gt;

&lt;p&gt;Reproduce it yourself in under 5 minutes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/jgravelle/mcp-retrieval-spec
&lt;span class="nb"&gt;cd &lt;/span&gt;mcp-retrieval-spec/benchmark/munch-benchmark
python benchmark.py &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why Chunk RAG Loses on Precision
&lt;/h2&gt;

&lt;p&gt;Chunk RAG splits files into overlapping windows of text and ranks them by keyword overlap or embedding similarity. A chunk boundary might fall in the middle of a function. The top-ranked chunk might contain the right words but not the right code. The retrieval is approximate by design.&lt;/p&gt;

&lt;p&gt;jMRI retrieval is structurally exact. jCodeMunch parses source files into an AST-derived index: every function, class, and method is a named, addressable symbol with a stable ID. When you search for &lt;code&gt;"OAuth2 password bearer authentication"&lt;/code&gt;, you get back IDs like &lt;code&gt;fastapi/security/oauth2.py::OAuth2PasswordBearer#class&lt;/code&gt;. When you retrieve that ID, you get exactly the class — no more, no less. No boundary accidents. No half-functions.&lt;/p&gt;
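&lt;p&gt;Those IDs follow a simple &lt;code&gt;path::name#kind&lt;/code&gt; shape, so constructing and parsing them is trivial (the helper names below are illustrative, not the SDK's API):&lt;/p&gt;

```python
def make_symbol_id(path, name, kind):
    # jMRI-style ID: path::name#kind
    return path + "::" + name + "#" + kind

def parse_symbol_id(symbol_id):
    path, rest = symbol_id.split("::")
    name, kind = rest.split("#")
    return path, name, kind

sid = make_symbol_id("fastapi/security/oauth2.py", "OAuth2PasswordBearer", "class")
print(sid)
```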

&lt;p&gt;The 96% precision figure reflects cases where the top search result was the correct symbol for the query. The 4% where it wasn't were genuinely ambiguous queries — where even a human would have debated the right answer.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is jMRI?
&lt;/h2&gt;

&lt;p&gt;jMRI (jMunch Retrieval Interface) is an open specification for MCP servers that do retrieval right.&lt;/p&gt;

&lt;p&gt;Four operations. One response envelope. Two compliance levels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent
  │
  ├─ discover()    → What knowledge sources are available?
  ├─ search(query) → Which symbols/sections are relevant? (IDs + summaries only)
  ├─ retrieve(id)  → Give me the exact source for this ID.
  └─ metadata(id?) → What would naive reading have cost?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
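&lt;p&gt;A minimal sketch of that call sequence against a stub server. The method names mirror the four operations above; the stub itself is illustrative, not the real SDK:&lt;/p&gt;

```python
class FakeServer:
    # In-memory stand-in for a jMRI-compliant MCP server.
    def discover(self):
        return [{"source": "demo-repo", "symbols": 1}]

    def search(self, query):
        # IDs and summaries only; no source bodies at this stage.
        return [{"id": "db.py::get_db#function", "summary": "session factory"}]

    def retrieve(self, symbol_id):
        return {"source": "def get_db(): ...", "_meta": {"tokens_saved": 42318}}

def answer(server, query):
    # discover, then search, then retrieve: the jMRI loop.
    server.discover()
    top = server.search(query)[0]["id"]
    result = server.retrieve(top)
    return result["source"], result["_meta"]["tokens_saved"]

print(answer(FakeServer(), "database session dependency"))
```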



&lt;p&gt;Every response includes a &lt;code&gt;_meta&lt;/code&gt; block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"def get_db():&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;    db = SessionLocal()&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;    try:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;        yield db&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;    finally:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;        db.close()&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"_meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tokens_saved"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42318&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_tokens_saved"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1284950&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cost_avoided"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"claude-sonnet-4-6"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.127&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"timing_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent doesn't have to guess whether it's being efficient. It knows, on every call.&lt;/p&gt;
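&lt;p&gt;Assembling the envelope is mechanical once the token delta is known. A sketch, assuming the caller supplies a price table in USD per million input tokens; the $3/MTok value below is an assumption for illustration, chosen because it reproduces the &lt;code&gt;0.127&lt;/code&gt; figure in the example envelope:&lt;/p&gt;

```python
def meta_block(tokens_saved, total, timing_ms, price_per_mtok):
    """Assemble a jMRI _meta envelope. price_per_mtok maps model name
    to USD per million input tokens (illustrative values)."""
    return {
        "tokens_saved": tokens_saved,
        "total_tokens_saved": total,
        "cost_avoided": {
            model: round(tokens_saved * p / 1_000_000, 3)
            for model, p in price_per_mtok.items()
        },
        "timing_ms": timing_ms,
    }

print(meta_block(42_318, 1_284_950, 12, {"claude-sonnet-4-6": 3.0}))
```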

&lt;p&gt;The spec is deliberately minimal. We're not trying to build a platform. We're trying to name a pattern that already works at scale and make it easy for others to implement.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Implementations
&lt;/h2&gt;

&lt;p&gt;The spec is open, and the flagship implementations are on GitHub.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Stars&lt;/th&gt;
&lt;th&gt;Install&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/jgravelle/jcodemunch-mcp" rel="noopener noreferrer"&gt;jCodeMunch&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Code (30+ languages)&lt;/td&gt;
&lt;td&gt;900+&lt;/td&gt;
&lt;td&gt;&lt;code&gt;uvx jcodemunch-mcp&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/jgravelle/jdocmunch-mcp" rel="noopener noreferrer"&gt;jDocMunch&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Docs (MD, RST, HTML, notebooks)&lt;/td&gt;
&lt;td&gt;45+&lt;/td&gt;
&lt;td&gt;&lt;code&gt;uvx jdocmunch-mcp&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both implement &lt;strong&gt;jMRI-Full&lt;/strong&gt; — the complete spec including batch retrieval, hash-based drift detection, byte-offset addressing, and the full &lt;code&gt;_meta&lt;/code&gt; envelope.&lt;/p&gt;
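&lt;p&gt;Of those jMRI-Full features, hash-based drift detection is the easiest to picture: the index stores a content hash alongside each symbol at parse time, and a mismatch at retrieval means the file changed underneath the index, so the entry must be re-parsed. A minimal sketch; the algorithm and hash width here are assumptions, not taken from the spec:&lt;/p&gt;

```python
import hashlib

def symbol_hash(source):
    """Content hash stored at index time and compared again at retrieval.
    SHA-256 truncated to 16 hex chars is an arbitrary illustrative choice."""
    return hashlib.sha256(source.encode()).hexdigest()[:16]

indexed_hash = symbol_hash("def get_db():\n    ...")

# Unchanged file: hashes match, the cached entry is still valid.
assert symbol_hash("def get_db():\n    ...") == indexed_hash

# Edited file: hashes differ, so the index entry is stale (drift detected).
assert symbol_hash("def get_db(session):\n    ...") != indexed_hash
```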

&lt;p&gt;The two servers have collectively saved over &lt;strong&gt;18 billion tokens&lt;/strong&gt; across user sessions in the first week of March 2026. That number is computed on-device from real session telemetry: every participating response reports &lt;code&gt;tokens_saved&lt;/code&gt;, measured from actual file sizes (&lt;code&gt;os.stat&lt;/code&gt;) rather than estimated.&lt;/p&gt;
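&lt;p&gt;The measurement itself can be that simple. A hedged sketch of how a per-call &lt;code&gt;tokens_saved&lt;/code&gt; figure could be derived from file size alone; the 4-chars-per-token divisor is a rough heuristic, not the servers' actual tokenizer:&lt;/p&gt;

```python
import os
import tempfile

def tokens_saved(path, retrieved, chars_per_token=4):
    """Tokens avoided by retrieving one symbol instead of the whole file.
    File size comes straight from os.stat; chars_per_token is a heuristic."""
    whole_file = os.stat(path).st_size // chars_per_token
    just_the_symbol = len(retrieved) // chars_per_token
    return max(whole_file - just_the_symbol, 0)

# Stand-in for a large source file (60,000 bytes, no newlines to translate).
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("x=1;" * 15_000)
    path = f.name

print(tokens_saved(path, "def get_db():\n    ..."))  # 14995 with these sizes
os.remove(path)
```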




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Claude Code
&lt;/h3&gt;

&lt;p&gt;Add to &lt;code&gt;~/.claude.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"jcodemunch-mcp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"jcodemunch-mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"jdocmunch-mcp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"uvx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"jdocmunch-mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python SDK
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;jmri-sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;jmri.client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MRIClient&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MRIClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# What's indexed?
&lt;/span&gt;&lt;span class="n"&gt;sources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;discover&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Find it
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database session dependency injection&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fastapi/fastapi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get exactly that
&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fastapi/fastapi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tokens saved this call: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_meta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tokens_saved&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  TypeScript SDK
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;MRIClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mri-client&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MRIClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OAuth2 bearer auth&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fastapi/fastapi&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;symbol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fastapi/fastapi&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Open Spec
&lt;/h2&gt;

&lt;p&gt;Everything is at &lt;a href="https://github.com/jgravelle/mcp-retrieval-spec" rel="noopener noreferrer"&gt;github.com/jgravelle/mcp-retrieval-spec&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SPEC.md&lt;/code&gt; — the full jMRI v1.0 specification (Apache 2.0)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sdk/python/&lt;/code&gt; — Python client helper&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sdk/typescript/&lt;/code&gt; — TypeScript client&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reference/server.py&lt;/code&gt; — minimal jMRI-compliant MCP server&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;examples/&lt;/code&gt; — Claude Code, Cursor, and generic agent integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The spec is intentionally minimal. PRs that improve examples or add language SDKs are welcome. PRs that extend the core interface need a strong argument.&lt;/p&gt;

&lt;p&gt;If you're building a retrieval MCP server, implement jMRI-Core. Your users' agents will thank you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;— J. Gravelle, March 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Benchmark source: &lt;a href="https://github.com/jgravelle/mcp-retrieval-spec/tree/main/benchmark" rel="noopener noreferrer"&gt;github.com/jgravelle/mcp-retrieval-spec/benchmark&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;SDK: &lt;code&gt;pip install jmri-sdk&lt;/code&gt; | &lt;code&gt;npm install mri-client&lt;/code&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Spec: &lt;a href="https://github.com/jgravelle/mcp-retrieval-spec" rel="noopener noreferrer"&gt;github.com/jgravelle/mcp-retrieval-spec&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
