An open-source CLI that respects your Claude Pro auth, retrieves only what it needs, and stays inside the lines Anthropic drew in their TOS.

---
Spawn claude -p from a Python subprocess without the right precautions and you'll silently bill your Anthropic API account instead of using the Claude Pro subscription you already pay for. For a 799,000-token query, that's the difference between $0.00 and $11.99. Two environment variable strips later, my CLI does the right thing by default. And that's the smallest piece of what makes this work.
This is a write-up of jragmunch-cli, an open-source tool I built (Apache 2.0, on PyPI) that wraps the official claude -p binary with a few opinions: respect the user's auth, retrieve only what's needed, and stay inside the lines Anthropic actually drew in their own legal docs. The interesting parts are the ones that aren't obvious from the README.
Why your subscription quietly turns into an API bill
Anthropic's claude CLI binary (the one you install with npm install -g @anthropic-ai/claude-code) is auth-flexible by design. It will use whatever credentials it finds, in priority order:
1. `ANTHROPIC_API_KEY`, if set
2. `ANTHROPIC_AUTH_TOKEN`, if set
3. Your Claude Pro / Max OAuth login otherwise
That's reasonable behavior for the binary as a primitive. The footgun is what happens when you spawn it as a subprocess from Python.
By default, subprocess.Popen (and friends) pass the parent process's full environment to the child. If you have ANTHROPIC_API_KEY exported in your shell (because, say, you also use the API from other scripts), every claude -p invocation your tool makes will silently pick that up and bill your API account. You won't see an error. You won't see a warning. You'll just see a bill at the end of the month and a perfectly preserved subscription quota you never touched.
The fix is mechanically tiny:
```python
import os
import subprocess

# Strip API credentials so the child falls back to the Claude Pro OAuth login.
env = os.environ.copy()
env.pop("ANTHROPIC_API_KEY", None)
env.pop("ANTHROPIC_AUTH_TOKEN", None)

subprocess.run(
    ["claude", "-p", prompt, "--output-format", "stream-json"],
    env=env,
    check=True,
)
```
Three lines. But the discipline behind it matters: any tool that spawns claude -p from a parent process should default to subscription mode and require an opt-in flag to switch to API. jRAGmunch-CLI's --use-api flag is exactly that, and jragmunch doctor will tell you which mode you're in before you run anything expensive.
If you're wrapping claude -p in your own scripts, copy the pattern. It'll save someone a surprise bill.
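The default-plus-opt-in shape is easy to replicate in your own wrappers. Here's a minimal sketch of the pattern; the `build_env` helper and the flag wiring are illustrative, not jRAGmunch-CLI's actual source:

```python
import argparse
import os

def build_env(use_api: bool) -> dict:
    """Build the environment for a `claude -p` subprocess.

    Subscription mode (the default) strips API credentials so the child
    falls back to the Claude Pro OAuth login; --use-api keeps them.
    """
    env = os.environ.copy()
    if not use_api:
        env.pop("ANTHROPIC_API_KEY", None)
        env.pop("ANTHROPIC_AUTH_TOKEN", None)
    return env

parser = argparse.ArgumentParser()
parser.add_argument("--use-api", action="store_true",
                    help="bill the Anthropic API instead of your subscription")
args = parser.parse_args([])  # empty argv here: the default is subscription mode

env = build_env(args.use_api)  # pass this as env= to subprocess.run
```

The key property: forgetting the flag costs you nothing, while opting into API billing requires a deliberate keystroke.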
The bigger problem: dumping your repo at the model
Even with auth handled correctly, the default pattern for "ask Claude about my repo" still wastes obscene amounts of tokens. The naive pattern looks like this:
- Walk the repo
- Concatenate every relevant file into one giant string
- Stuff it into the prompt
- Hope the model finds what it needs
This is what most "chat with your repo" wrappers do, and it's what burns through Claude Pro session limits in fifteen minutes flat. The 2.5GB Node.js source tree I demo with would need around 21 million tokens to fit in one prompt. Even if that fit (it doesn't), you'd be paying for the model to read 100% of the code to answer a question that touches 0.1% of it.
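To see why the dump can't fit, a back-of-the-envelope estimator is enough. This sketch walks a tree and estimates prompt tokens at roughly four characters each, a common rule of thumb rather than an exact tokenizer:

```python
import os

def naive_dump_tokens(root: str, exts=(".js", ".ts", ".json", ".md")) -> int:
    """Estimate the token cost of the 'concatenate everything' pattern.

    Sums the sizes of matching files under root and divides by ~4 bytes
    per token. Rough, but enough to show the order of magnitude.
    """
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(exts):
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return total_bytes // 4
```

Run it on any sizable repo and the number it prints is the bill the naive pattern asks you to pay on every single question.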
Here's a real run from AskClaude.py, the side-by-side demo script in the repo:
```
In its raw form, your request may have used as many as 799,037 tokens,
at a cost of $11.99.
Using jRAGmunch-CLI, our call to Opus 4.7 only used 24,771 tokens.
By using your subscription WITHIN THE TERMS OF ANTHROPIC'S TOS, you paid
$0.00 and used a nearly imperceptible fractional percentage of your quota.
```
799K tokens versus 24K. Same question, same answer quality. The difference isn't the model. The difference is what gets sent to it.
Slice, don't dump
The retrieval layer is where the real engineering happens. jRAGmunch-CLI delegates retrieval to jcodemunch-mcp, a separate MCP server I maintain that does AST-level symbol extraction across 70+ languages via tree-sitter.
Here's the conceptual difference between this and traditional RAG.
Traditional RAG. Chop the codebase into arbitrary text chunks. Embed each chunk. When a query comes in, embed the query, find the chunks with the highest cosine similarity, send those to the model. The retrieval is statistical and approximate. It can miss things. It can include things that look related but aren't. It treats your code as if it were prose.
Slice-level retrieval. Parse the codebase into an AST. When a query references a symbol (function name, class, identifier), look up that exact symbol in the index. Return the actual function body. Trace the actual import graph. The retrieval is structural and exact. If you ask for AuthMiddleware.verify, you get AuthMiddleware.verify, not the seven chunks that happened to contain the word "auth."
Surgical, not statistical.
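To make the contrast concrete, here's a toy version of symbol-level lookup using Python's `ast` module. jcodemunch itself uses tree-sitter across 70+ languages; this stand-in only handles Python methods, but it shows the exact-retrieval idea:

```python
import ast

def index_symbols(source: str) -> dict:
    """Toy symbol index: qualified name -> exact source text.

    Parses the module, then records each method body verbatim under
    a 'ClassName.method_name' key, so lookup is structural, not fuzzy.
    """
    tree = ast.parse(source)
    index = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    key = f"{node.name}.{item.name}"
                    index[key] = ast.get_source_segment(source, item)
    return index

source = (
    "class AuthMiddleware:\n"
    "    def verify(self, token):\n"
    "        return token == 'ok'\n"
)
# Exact lookup: ask for AuthMiddleware.verify, get AuthMiddleware.verify.
print(index_symbols(source)["AuthMiddleware.verify"])
```

No embeddings, no cosine similarity, no near-misses: the key either exists in the index or it doesn't.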
The result is what shows up in jRAGmunch-CLI's _meta output on every call:
```
[tokens in=24 out=1273 cost actual=$0.0000 (notional=$0.5334, auth=subscription) time=27549ms]
```
actual is what you really paid (zero, in subscription mode). notional is what the same work would have cost via the API at Opus 4.7's input rate. auth is which credential path the subprocess used. Every verb returns this. You always know what you actually spent and what you would have spent.
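The notional figure is straightforward arithmetic over the token counts. Here's a sketch of that rendering, with placeholder per-million-token rates rather than Anthropic's actual pricing:

```python
def meta_line(tokens_in: int, tokens_out: int, subscription: bool,
              in_rate: float = 15.0, out_rate: float = 75.0) -> str:
    """Render a _meta-style cost line.

    in_rate/out_rate are dollars per million tokens (placeholder values).
    Actual cost is $0 in subscription mode; notional is what the same
    call would have billed via the API.
    """
    notional = tokens_in / 1e6 * in_rate + tokens_out / 1e6 * out_rate
    actual = 0.0 if subscription else notional
    auth = "subscription" if subscription else "api"
    return (f"[tokens in={tokens_in} out={tokens_out} "
            f"cost actual=${actual:.4f} (notional=${notional:.4f}, auth={auth})]")

print(meta_line(24, 1273, subscription=True))
```

Trivial code, but printing it on every call is the part that changes behavior.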
That transparency matters more than it sounds. Most LLM tooling hides cost behind the abstraction. jRAGmunch-CLI makes you look at it on every call. After a week of that, your intuition for "what's a reasonable token budget for this question" sharpens dramatically.
It's not just ask
The verb most people see first is jragmunch ask, because that's the obvious "chat with your repo" use case. But the more interesting verbs are downstream of that:
- `jragmunch index` indexes a repo via jcodemunch (one-time, then incremental on subsequent calls).
- `jragmunch review` does diff-aware PR review against a git range.
- `jragmunch changelog` summarizes changes since a tag.
- `jragmunch refactor` fans out batch refactors across matched symbols.
- `jragmunch tests` generates tests for symbols that don't have them.
- `jragmunch sweep` does pattern-driven cleanup across the repo.
- `jragmunch run` is a power-user passthrough for direct prompts.
- `jragmunch doctor` verifies your CLI + MCP wiring before you spend tokens.
The review and refactor verbs are where this stops looking like a Q&A wrapper and starts looking like an agentic CLI toolkit. review reads your diff, retrieves the surrounding symbol context that the diff actually touches (not the whole file, not the whole repo, just the symbols affected), and runs a structured review pass. refactor does fan-out work across multiple call sites in parallel, with each subprocess getting only the slice it needs.
That fan-out pattern is also where the TOS line gets interesting.
When subscription mode is the right answer (and when it isn't)
Anthropic's Claude Code Legal and Compliance docs draw a bright line that most wrappers ignore. Paraphrased:
- Individual ordinary use of Claude Code on your own machine, with your own subscription, is permitted.
- Business, always-on, multi-contributor, or high-throughput use should run against the API with an API key.
jRAGmunch-CLI's defaults are tuned to that line. Subscription mode by default for solo interactive work; explicit --use-api for anything that crosses into the second bucket. The README ships a decision table covering the typical cases:
| You are… | Recommended mode |
|---|---|
| A solo developer running verbs interactively on your own machine | subscription (default) |
| A solo developer running `jragmunch review` in your own personal repo's CI | subscription (default), with `CLAUDE_CODE_OAUTH_TOKEN` |
| A team running CI bots on a shared / commercial repo | `--use-api` |
| Multi-developer or commercial automation | `--use-api` |
| Heavy parallel fan-out (`refactor --parallel 16`, etc.) | `--use-api` |
This isn't a workaround. It isn't a loophole. Anthropic explicitly permits the first column and explicitly directs the second column to the API. jRAGmunch-CLI just makes the right default for each case the easy default.
The recent wave of "I got rate-limited on Claude Pro after two days" complaints comes mostly from tools that don't respect this line. They run on a personal subscription, fan out twenty parallel subprocesses doing CI-grade work, then act surprised when the throttle drops. If you respect the line Anthropic drew, your subscription stays healthy. If you don't, it doesn't. jRAGmunch-CLI is opinionated about which side of the line you're on.
Try it
If you have the claude CLI on your PATH and jCodemunch-MCP registered as an MCP server, getting started is two commands:
```
pip install jragmunch
jragmunch doctor
```
doctor will tell you whether your auth resolves to subscription or API, whether the MCP server is reachable, and whether anything is misconfigured before you spend tokens. From there:
```
jragmunch index --repo .
jragmunch ask "how does auth work in this repo"
jragmunch review --since main
```
If you want the side-by-side cost comparison I quoted earlier, clone the repo and run python AskClaude.py. It prompts for a repo path and a question, then prints the answer plus the token math. Use it as a sanity check on your own codebases or as a template for embedding jRAGmunch-CLI into other tools.
One last thing
The repo is brand new. Star it if it's useful. File issues if it isn't. Send a PR if you've got opinions about which verb should ship next.
The 2.5GB Node.js demo, the live cost math, and a fuller walkthrough are in the AI Tips With J video premiering today. Links below.
Repo: github.com/jgravelle/jragmunch-cli
Video: https://www.youtube.com/watch?v=ZP0OPSq0jcQ
Comparison page (vs. RAG, vs. raw file reads): j.gravelle.us/jCodeMunch/versus.php
Slice, don't dump.
- jjg