An open-source CLI that respects your Claude Pro auth, retrieves only what it needs, and stays inside the lines Anthropic drew in their TOS.

---
Spawn claude -p from a Python subprocess without the right precautions and you'll silently bill your Anthropic API account instead of using the Claude Pro subscription you already pay for. For a 799,000-token query, that's the difference between $0.00 and $11.99. Two environment variable strips later, my CLI does the right thing by default. And that's the smallest piece of what makes this work.
This is a write-up of jragmunch-cli, an open-source tool I built (Apache 2.0, on PyPI) that wraps the official claude -p binary with a few opinions: respect the user's auth, retrieve only what's needed, and stay inside the lines Anthropic actually drew in their own legal docs. The interesting parts are the ones that aren't obvious from the README.
Why your subscription quietly turns into an API bill
Anthropic's claude CLI binary (the one you install with npm install -g @anthropic-ai/claude-code) is auth-flexible by design. It will use whatever credentials it finds, in priority order:
1. `ANTHROPIC_API_KEY`, if set
2. `ANTHROPIC_AUTH_TOKEN`, if set
3. Your Claude Pro / Max OAuth login otherwise
That's reasonable behavior for the binary as a primitive. The footgun is what happens when you spawn it as a subprocess from Python.
By default, subprocess.Popen (and friends) pass the parent process's full environment to the child. If you have ANTHROPIC_API_KEY exported in your shell (because, say, you also use the API from other scripts), every claude -p invocation your tool makes will silently pick that up and bill your API account. You won't see an error. You won't see a warning. You'll just see a bill at the end of the month and a perfectly preserved subscription quota you never touched.
The fix is mechanically tiny:
```python
import os
import subprocess

# Strip API credentials so the child falls back to the Claude Pro OAuth login.
env = os.environ.copy()
env.pop("ANTHROPIC_API_KEY", None)
env.pop("ANTHROPIC_AUTH_TOKEN", None)

subprocess.run(
    ["claude", "-p", prompt, "--output-format", "stream-json"],
    env=env,
    check=True,
)
```
Three lines. But the discipline behind it matters: any tool that spawns claude -p from a parent process should default to subscription mode and require an opt-in flag to switch to API. jRAGmunch-CLI's --use-api flag is exactly that, and jragmunch doctor will tell you which mode you're in before you run anything expensive.
If you're wrapping claude -p in your own scripts, copy the pattern. It'll save someone a surprise bill.
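The default-plus-opt-in shape is easy to replicate in your own wrappers. Here's a minimal sketch of the pattern; the `build_env` helper and the flag wiring are illustrative, not jRAGmunch-CLI's actual source:

```python
import argparse
import os

def build_env(use_api: bool) -> dict:
    """Build the environment for a `claude -p` subprocess.

    Subscription mode (the default) strips API credentials so the child
    falls back to the Claude Pro OAuth login; --use-api keeps them.
    """
    env = os.environ.copy()
    if not use_api:
        env.pop("ANTHROPIC_API_KEY", None)
        env.pop("ANTHROPIC_AUTH_TOKEN", None)
    return env

parser = argparse.ArgumentParser()
parser.add_argument("--use-api", action="store_true",
                    help="bill the Anthropic API instead of your subscription")
args = parser.parse_args([])  # empty argv here: the default is subscription mode

env = build_env(args.use_api)  # pass this as env= to subprocess.run
```

The key property: forgetting the flag costs you nothing, while opting into API billing requires a deliberate keystroke.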
The bigger problem: dumping your repo at the model
Even with auth handled correctly, the default pattern for "ask Claude about my repo" still wastes obscene amounts of tokens. The naive pattern looks like this:
- Walk the repo
- Concatenate every relevant file into one giant string
- Stuff it into the prompt
- Hope the model finds what it needs
This is what most "chat with your repo" wrappers do, and it's what burns through Claude Pro session limits in fifteen minutes flat. The 2.5GB Node.js source tree I demo with would need around 21 million tokens to fit in one prompt. Even if that fit (it doesn't), you'd be paying for the model to read 100% of the code to answer a question that touches 0.1% of it.
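To see why the dump can't fit, a back-of-the-envelope estimator is enough. This sketch walks a tree and estimates prompt tokens at roughly four characters each, a common rule of thumb rather than an exact tokenizer:

```python
import os

def naive_dump_tokens(root: str, exts=(".js", ".ts", ".json", ".md")) -> int:
    """Estimate the token cost of the 'concatenate everything' pattern.

    Sums the sizes of matching files under root and divides by ~4 bytes
    per token. Rough, but enough to show the order of magnitude.
    """
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(exts):
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return total_bytes // 4
```

Run it on any sizable repo and the number it prints is the bill the naive pattern asks you to pay on every single question.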
Here's a real run from AskClaude.py, the side-by-side demo script in the repo:
```
In its raw form, your request may have used as many as 799,037 tokens,
at a cost of $11.99.
Using jRAGmunch-CLI, our call to Opus 4.7 only used 24,771 tokens.
By using your subscription WITHIN THE TERMS OF ANTHROPIC'S TOS, you paid
$0.00 and used a nearly imperceptible fractional percentage of your quota.
```
799K tokens versus 24K. Same question, same answer quality. The difference isn't the model. The difference is what gets sent to it.
Slice, don't dump
The retrieval layer is where the real engineering happens. jRAGmunch-CLI delegates retrieval to jcodemunch-mcp, a separate MCP server I maintain that does AST-level symbol extraction across 70+ languages via tree-sitter.
Here's the conceptual difference between this and traditional RAG.
Traditional RAG. Chop the codebase into arbitrary text chunks. Embed each chunk. When a query comes in, embed the query, find the chunks with the highest cosine similarity, send those to the model. The retrieval is statistical and approximate. It can miss things. It can include things that look related but aren't. It treats your code as if it were prose.
Slice-level retrieval. Parse the codebase into an AST. When a query references a symbol (function name, class, identifier), look up that exact symbol in the index. Return the actual function body. Trace the actual import graph. The retrieval is structural and exact. If you ask for AuthMiddleware.verify, you get AuthMiddleware.verify, not the seven chunks that happened to contain the word "auth."
Surgical, not statistical.
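To make the contrast concrete, here's a toy version of symbol-level lookup using Python's `ast` module. jcodemunch itself uses tree-sitter across 70+ languages; this stand-in only handles Python methods, but it shows the exact-retrieval idea:

```python
import ast

def index_symbols(source: str) -> dict:
    """Toy symbol index: qualified name -> exact source text.

    Parses the module, then records each method body verbatim under
    a 'ClassName.method_name' key, so lookup is structural, not fuzzy.
    """
    tree = ast.parse(source)
    index = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    key = f"{node.name}.{item.name}"
                    index[key] = ast.get_source_segment(source, item)
    return index

source = (
    "class AuthMiddleware:\n"
    "    def verify(self, token):\n"
    "        return token == 'ok'\n"
)
# Exact lookup: ask for AuthMiddleware.verify, get AuthMiddleware.verify.
print(index_symbols(source)["AuthMiddleware.verify"])
```

No embeddings, no cosine similarity, no near-misses: the key either exists in the index or it doesn't.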
The result is what shows up in jRAGmunch-CLI's _meta output on every call:
```
[tokens in=24 out=1273 cost actual=$0.0000 (notional=$0.5334, auth=subscription) time=27549ms]
```
actual is what you really paid (zero, in subscription mode). notional is what the same work would have cost via the API at Opus 4.7's input rate. auth is which credential path the subprocess used. Every verb returns this. You always know what you actually spent and what you would have spent.
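The notional figure is straightforward arithmetic over the token counts. Here's a sketch of that rendering, with placeholder per-million-token rates rather than Anthropic's actual pricing:

```python
def meta_line(tokens_in: int, tokens_out: int, subscription: bool,
              in_rate: float = 15.0, out_rate: float = 75.0) -> str:
    """Render a _meta-style cost line.

    in_rate/out_rate are dollars per million tokens (placeholder values).
    Actual cost is $0 in subscription mode; notional is what the same
    call would have billed via the API.
    """
    notional = tokens_in / 1e6 * in_rate + tokens_out / 1e6 * out_rate
    actual = 0.0 if subscription else notional
    auth = "subscription" if subscription else "api"
    return (f"[tokens in={tokens_in} out={tokens_out} "
            f"cost actual=${actual:.4f} (notional=${notional:.4f}, auth={auth})]")

print(meta_line(24, 1273, subscription=True))
```

Trivial code, but printing it on every call is the part that changes behavior.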
That transparency matters more than it sounds. Most LLM tooling hides cost behind the abstraction. jRAGmunch-CLI makes you look at it on every call. After a week of that, your intuition for "what's a reasonable token budget for this question" sharpens dramatically.
It's not just ask
The verb most people see first is jragmunch ask, because that's the obvious "chat with your repo" use case. But the more interesting verbs are downstream of that:
- `jragmunch index` indexes a repo via jcodemunch (one-time, then incremental on subsequent calls).
- `jragmunch review` does diff-aware PR review against a git range.
- `jragmunch changelog` summarizes changes since a tag.
- `jragmunch refactor` fans out batch refactors across matched symbols.
- `jragmunch tests` generates tests for symbols that don't have them.
- `jragmunch sweep` does pattern-driven cleanup across the repo.
- `jragmunch run` is a power-user passthrough for direct prompts.
- `jragmunch doctor` verifies your CLI + MCP wiring before you spend tokens.
The review and refactor verbs are where this stops looking like a Q&A wrapper and starts looking like an agentic CLI toolkit. review reads your diff, retrieves the surrounding symbol context that the diff actually touches (not the whole file, not the whole repo, just the symbols affected), and runs a structured review pass. refactor does fan-out work across multiple call sites in parallel, with each subprocess getting only the slice it needs.
That fan-out pattern is also where the TOS line gets interesting.
When subscription mode is the right answer (and when it isn't)
Anthropic's Claude Code Legal and Compliance docs draw a bright line that most wrappers ignore. Paraphrased:
- Individual ordinary use of Claude Code on your own machine, with your own subscription, is permitted.
- Business, always-on, multi-contributor, or high-throughput use should run against the API with an API key.
jRAGmunch-CLI's defaults are tuned to that line. Subscription mode by default for solo interactive work; explicit --use-api for anything that crosses into the second bucket. The README ships a decision table covering the typical cases:
| You are… | Recommended mode |
|---|---|
| A solo developer running verbs interactively on your own machine | subscription (default) |
| A solo developer running `jragmunch review` in your own personal repo's CI | subscription (default), with `CLAUDE_CODE_OAUTH_TOKEN` |
| A team running CI bots on a shared / commercial repo | `--use-api` |
| Multi-developer or commercial automation | `--use-api` |
| Heavy parallel fan-out (`refactor --parallel 16`, etc.) | `--use-api` |
This isn't a workaround. It isn't a loophole. Anthropic explicitly permits the first column and explicitly directs the second column to the API. jRAGmunch-CLI just makes the right default for each case the easy default.
The recent wave of "I got rate-limited on Claude Pro after two days" complaints comes mostly from tools that don't respect this line. They run on a personal subscription, fan out twenty parallel subprocesses doing CI-grade work, then act surprised when the throttle drops. If you respect the line Anthropic drew, your subscription stays healthy. If you don't, it doesn't. jRAGmunch-CLI is opinionated about which side of the line you're on.
Try it
If you have the claude CLI on your PATH and jCodemunch-MCP registered as an MCP server, getting started is two commands:
```
pip install jragmunch
jragmunch doctor
```
doctor will tell you whether your auth resolves to subscription or API, whether the MCP server is reachable, and whether anything is misconfigured before you spend tokens. From there:
```
jragmunch index --repo .
jragmunch ask "how does auth work in this repo"
jragmunch review --since main
```
If you want the side-by-side cost comparison I quoted earlier, clone the repo and run python AskClaude.py. It prompts for a repo path and a question, then prints the answer plus the token math. Use it as a sanity check on your own codebases or as a template for embedding jRAGmunch-CLI into other tools.
One last thing
The repo is brand new. Star it if it's useful. File issues if it isn't. Send a PR if you've got opinions about which verb should ship next.
The 2.5GB Node.js demo, the live cost math, and a fuller walkthrough are in the AI Tips With J video premiering today. Links below.
Repo: github.com/jgravelle/jragmunch-cli
Video: https://www.youtube.com/watch?v=ZP0OPSq0jcQ
Comparison page (vs. RAG, vs. raw file reads): j.gravelle.us/jCodeMunch/versus.php
Slice, don't dump.
- jjg