Horilla

How we cut our Claude Code token costs by 80% — and the open-source tool we built to do it (v1.1.4, 13 commands)

Our Claude Code bill was three times what it should have been. For a 9-developer team, that difference was significant enough to make us actually debug it. What we found was embarrassingly simple — and almost certainly affecting your setup too.

The problem: CLAUDE.md loads on every request

Claude Code reads your CLAUDE.md file on every single request. That's by design — it's how the tool loads your project instructions, coding conventions, and context. But here's the part that sneaks up on you: the file is read in full, every time, regardless of what task you're working on.

Our CLAUDE.md had grown to about 10,000 tokens over six months. It contained:

  • Architecture documentation for 40+ Django apps
  • Coding standards and patterns for two separate codebases
  • API references and import paths
  • Session notes and debugging tips we'd accumulated

Every time a developer asked Claude to fix a typo in a README, the full 10,000-token file was injected into the context. At roughly 60 requests per developer per day, across 9 developers, that's 5.4 million tokens of CLAUDE.md context per day — before writing a single line of code.
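The overhead compounds fast. A quick back-of-envelope check:

```python
# Back-of-envelope: CLAUDE.md context injected per working day.
claude_md_tokens = 10_000          # size our CLAUDE.md had grown to
requests_per_dev_per_day = 60      # rough average per developer
developers = 9

daily_overhead = claude_md_tokens * requests_per_dev_per_day * developers
print(f"{daily_overhead:,} tokens/day")  # 5,400,000 tokens/day
```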

Why prompt caching wasn't saving us

Anthropic caches prompts above 1,024 tokens, and it works well — when the prompt is identical between requests. One character of difference and you pay full price.

Our CLAUDE.md had dynamic content: session notes with timestamps, environment-specific paths, and other content that changed between requests. Every request was a cache miss. Every request billed at full input token price.

This is another waste pattern claudectx analyze flags: dynamic content that breaks caching.
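To see why a single changed character defeats the cache, compare fingerprints of two nearly identical prompts. This is a toy illustration (a hash standing in for the cache key), not Anthropic's actual caching mechanism:

```python
import hashlib

def prefix_key(prompt: str) -> str:
    # Toy stand-in for a cache key: prompt caching matches on an exact
    # prefix, so any changed byte produces a different key (a cache miss).
    return hashlib.sha256(prompt.encode()).hexdigest()[:12]

static = "## Conventions\nUse ruff for linting.\n"
key_a = prefix_key(static + "Last session: 2024-06-01 09:14\n")
key_b = prefix_key(static + "Last session: 2024-06-01 09:15\n")
print(key_a == key_b)  # False: one changed minute, full-price request
```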

What we built: claudectx v1.1.4

We built claudectx — a CLI that audits and optimizes what Claude Code loads per session. It's now at v1.1.4 with 13 commands across four categories.

npx claudectx analyze

Run this in your project directory right now. No install needed. It outputs:

claudectx analyze — token breakdown
════════════════════════════════════════
CLAUDE.md             7,841 tokens  ████████████████████ 63.6%
Open files            2,840 tokens  ███████              23.0%
Conversation history  1,630 tokens  ████                 13.2%
MCP tool results         14 tokens  ░                     0.1%
────────────────────────────────────────
Total                12,325 tokens

Waste patterns detected (3):
  ⚠  CLAUDE.md: 7,841 tokens — 292% over the 2,000 token recommendation
  ⚠  No .claudeignore file found
  ⚠  CLAUDE.md contains dynamic timestamp — breaks prompt caching
  → Run `claudectx optimize --apply` to fix all 3 issues

The core fix: optimize

claudectx optimize --apply

This does three things automatically:

  1. Splits CLAUDE.md into a lean core + demand-loaded @file sections. Core stays inline (under 2K tokens); large reference sections load only when relevant files are open.
  2. Generates .claudeignore with Node.js, Python, and common binary patterns to stop loading lock files, build artifacts, and assets.
  3. Strips cache-busting content — removes dynamic timestamps and session notes that prevent Anthropic's prompt cache from activating.
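For reference, a generated .claudeignore looks roughly like this. This is an illustrative sketch assuming the syntax mirrors .gitignore; the exact patterns claudectx emits may differ:

```
# Node.js
node_modules/
dist/
package-lock.json
*.lock
# Python
__pycache__/
*.pyc
.venv/
# Binary assets
*.png
*.jpg
*.pdf
```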

Safety: Every file claudectx touches is automatically backed up to ~/.claudectx/backups/. If anything looks wrong, claudectx revert --list shows all backups and claudectx revert --id <id> restores any of them.

On our setup this cut tokens from 18,432 to 3,740 per request (79.7% reduction).

The MCP proxy: symbol-level reads

The less obvious win is claudectx mcp — a local MCP server proxy that intercepts file-read requests and returns symbol-level slices instead of whole files.

When Claude reads a file to find a class definition, it gets the entire file. On a large Django app, that's often 12,000+ tokens for one model. Read at symbol level (just the class), the same information is typically 800 tokens.
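The core idea is easy to sketch with Python's standard ast module. This illustrates symbol-level reads in general, not claudectx's actual implementation:

```python
import ast

def read_symbol(source: str, name: str) -> str:
    """Return only one top-level class or function from a module's source.
    A sketch of symbol-level reads; claudectx's own indexer may differ."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.name == name:
                return ast.get_source_segment(source, node)
    raise KeyError(f"symbol {name!r} not found")

models_py = (
    "from django.db import models\n\n"
    "class Invoice(models.Model):\n"
    "    total = models.DecimalField(max_digits=10, decimal_places=2)\n\n"
    "class Payment(models.Model):\n"
    "    amount = models.DecimalField(max_digits=10, decimal_places=2)\n"
)
print(read_symbol(models_py, "Invoice"))  # just the Invoice class, not the file
```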

claudectx mcp

Configure Claude Code to use the local MCP server and you get:

  • smart_read — reads a symbol by name (class, function, method)
  • search_symbols — finds symbols across the codebase
  • index_project — builds a local symbol index

Analytics commands

claudectx watch — a live terminal dashboard (Ink/React) showing token burn, cache hit rate, and most-read files as you work.

claudectx compress — distills your session JSONL into a MEMORY.md entry. An 8,000-token session typically compresses to 150–200 tokens. Next session starts lean without losing context.

claudectx report — 7/30-day analytics:

Sessions:           23          Requests:      847
Input tokens:       2,341,200   Cache hits:    51%
Total cost (est.):  $4.87       Avg/session:   $0.21
Top waste file:     CLAUDE.md   (12,400 tokens, 847 reads)
→ Run `claudectx drift` to clean up stale sections

claudectx budget "**/*.py" — estimate token cost before running a task. Shows per-file token counts, cache hit likelihood, total cost, and .claudeignore recommendations. Like git status for your context window.
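Pre-run budgeting can be approximated with a crude chars-per-token heuristic (roughly 4 characters per token for English and code). This is an assumption for illustration; the real command presumably uses a proper tokenizer:

```python
from pathlib import Path

def estimate_tokens(text: str) -> int:
    # Crude estimate: ~4 characters per token. Good enough for budgeting,
    # not for billing; a real tokenizer is more accurate.
    return max(1, len(text) // 4)

total = sum(
    estimate_tokens(p.read_text(errors="ignore"))
    for p in Path(".").rglob("*.py")
)
print(f"~{total:,} tokens across .py files")
```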

claudectx drift — scans CLAUDE.md for dead @file references, git-deleted file mentions, and sections with zero reads in the last 30 days. Real cost: you're loading documentation for files that no longer exist.

Teams and multi-assistant support

claudectx teams — per-developer cost attribution for multi-dev teams:

claudectx teams export          # → ~/.claudectx/team-export-{date}.json
claudectx teams aggregate --dir ./reports/  # merge all exports
claudectx teams aggregate --anonymize       # Dev 1, Dev 2...

Each developer exports an anonymized summary of their session data. The lead aggregates them without seeing session content. Know where the budget is going across the team.
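The anonymization step is simple to reason about: stable pseudonyms assigned in first-seen order. A sketch of the idea, not claudectx's actual code:

```python
def anonymize(dev_ids):
    # Map each real identifier to a stable "Dev N" pseudonym,
    # so aggregated reports stay comparable across exports.
    labels = {}
    for dev in dev_ids:
        labels.setdefault(dev, f"Dev {len(labels) + 1}")
    return [labels[d] for d in dev_ids]

print(anonymize(["alice", "bob", "alice", "carol"]))
# ['Dev 1', 'Dev 2', 'Dev 1', 'Dev 3']
```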

claudectx convert --to cursor|copilot|windsurf — exports your CLAUDE.md to other AI assistant formats. Splits sections into .cursor/rules/*.mdc files for Cursor, or .github/copilot-instructions.md for Copilot. One source, every assistant.

claudectx warmup — sends a priming request to Anthropic so your first working request gets a cache hit instead of a full miss. --cron "0 9 * * 1-5" installs as a morning cron job.

claudectx hooks list|add|remove — named hook marketplace. Four built-ins: auto-compress (triggers on file reads), daily-budget (budget check before tool use), slack-digest (session summary to Slack webhook), session-warmup (cache pre-warm on read events).

Real results

After running optimize --apply on our setup:

Metric                  Before    After
Tokens per request      18,432    3,740
Cache hit rate          12%       74%
Monthly cost estimate   $87       $17

The 80% figure is real but came from our specific setup — an unusually large CLAUDE.md and a completely unconfigured .claudeignore. If your config is already lean, expect 20–40%. The analyze command will tell you in 30 seconds what your actual baseline is.

Try it

# No install — try immediately
npx claudectx analyze

# Install globally
npm install -g claudectx
# or via Homebrew:
brew tap Horilla/claudectx && brew install claudectx

Website: claudectx.horilla.com
Source (MIT): github.com/Horilla/claudectx

We're the team behind Horilla — an open-source Django HRMS with 40+ apps. This tool came from real pain running Claude Code across a multi-repo, multi-app codebase on a 9-developer team. If you're in a similar situation, we'd love your feedback.

Issues and PRs welcome. If analyze shows something surprising, share it in the comments.
