
Raj

Posted on • Originally published at elara-labs.github.io

I Cut My Claude Code Token Usage by 94% With This Open Source Tool

If you use Claude Code, Cursor, or any AI coding tool, you're probably burning tokens on the same files over and over. Every session, the AI re-reads your codebase from scratch.

I built Code Context Engine (CCE) to fix this. It indexes your code locally and lets the AI search instead of reading entire files. The result: 94% fewer input tokens, benchmarked on FastAPI with 20 real coding queries.

The Problem

Input tokens are 85-95% of your Claude Code bill. Every time you ask Claude about your payment flow, it reads payments.py, shipping.py, and whatever else it thinks might be relevant. That's 45,000 tokens for a question that needs 800 tokens of context.

Without CCE:    Claude reads payments.py + shipping.py   = 45,000 tokens
With CCE:       context_search "payment flow"            =    800 tokens

How It Works

CCE runs as a local MCP server. Three lines to set up:

uv tool install code-context-engine
cd /path/to/your/project
cce init

That's it. No cloud. No config. cce init auto-detects your editor (Claude Code, VS Code, Cursor, Gemini CLI, Codex, OpenCode) and writes the right config.

Under the hood:

  1. Tree-sitter parses your code into semantic chunks (functions, classes, modules)
  2. Hybrid retrieval combines vector similarity with BM25 keyword matching
  3. Graph expansion walks CALLS/IMPORTS edges to pull in related code
  4. Compression reduces chunks to signatures and docstrings
  5. Memory persists decisions and code areas across sessions
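Steps 2 and 3 are where most of the savings come from. Here's a toy sketch of the hybrid-ranking idea in Python — a hand-rolled BM25 blended with made-up vector-similarity scores. This is purely illustrative (the corpus, scores, and `alpha` weight are all invented), not CCE's actual implementation:

```python
import math
from collections import Counter

# Toy "chunks" standing in for the output of step 1 (tree-sitter parsing).
chunks = [
    "def charge_card(amount, token): process payment via gateway",
    "def ship_order(order_id): schedule shipping for order",
    "class PaymentGateway: handles payment authorization and capture",
]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Minimal BM25 keyword scoring (illustrative only)."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue  # term appears nowhere; contributes nothing
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            f = tf[term]
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(score)
    return scores

def hybrid_rank(query, docs, vec_scores, alpha=0.5):
    """Blend normalized BM25 with vector-similarity scores, best first."""
    bm = bm25_scores(query, docs)
    top = max(bm) or 1.0
    combined = [alpha * (s / top) + (1 - alpha) * v for s, v in zip(bm, vec_scores)]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)

# Pretend embedding similarities for the query "payment flow".
vec = [0.9, 0.1, 0.8]
ranking = hybrid_rank("payment flow", chunks, vec)
print(ranking)  # payment-related chunks rank ahead of the shipping one
```

The payoff of hybrid retrieval is that keyword matching catches exact identifiers (`charge_card`) while vector similarity catches paraphrases ("payment flow" never appears verbatim in the code).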

Re-indexing after edits takes under 1 second (96% embedding cache hit rate). Git hooks keep the index current automatically.
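The cache idea behind that hit rate is simple: hash each chunk's text and only re-embed chunks whose content actually changed. A minimal sketch — `embed_fn` here is a stand-in, not a real model, and this is not CCE's actual code:

```python
import hashlib

class EmbeddingCache:
    """Content-hash cache: unchanged chunks skip the embedding model."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, chunk_text):
        key = hashlib.sha256(chunk_text.encode()).hexdigest()
        if key in self.store:
            self.hits += 1        # same content as before: free
        else:
            self.misses += 1      # new or edited chunk: embed it
            self.store[key] = self.embed_fn(chunk_text)
        return self.store[key]

fake_embed = lambda text: [float(len(text))]  # placeholder for a real model
cache = EmbeddingCache(fake_embed)
for chunk in ["def a(): ...", "def b(): ...", "def a(): ..."]:
    cache.get(chunk)
print(cache.hits, cache.misses)  # 1 2
```

Because an edit touches only a few chunks, almost every re-index request is a cache hit, which is why re-indexing stays under a second.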

The Benchmark

We benchmarked against FastAPI (53 source files, 180K tokens) with 20 real coding questions. No cherry-picking.

| Metric                   | Result                              |
| ------------------------ | ----------------------------------- |
| Retrieval savings        | 94% (83,681 → 4,927 tokens/query)   |
| Compression (additional) | 89%                                 |
| Recall@10                | 0.90                                |
| Latency (p50)            | 0.4 ms                              |
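You can sanity-check the headline number from the table yourself:

```python
# Per-query token counts from the benchmark table.
baseline = 83_681   # full-file reads
retrieved = 4_927   # CCE retrieval
savings = 1 - retrieved / baseline
print(f"{savings:.1%}")  # 94.1%
```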

Important: The 94% is measured against full-file reads, not against Claude Code's built-in exploration. We use full-file as the baseline because it's reproducible and deterministic. Full methodology here.

You can reproduce it yourself:

pip install code-context-engine
python benchmarks/run_benchmark.py --repo https://github.com/fastapi/fastapi.git --source-dir fastapi

What You Get

9 MCP tools that Claude uses automatically:

  • context_search for hybrid vector + BM25 search
  • session_recall and record_decision for cross-session memory
  • related_context for code graph traversal
  • set_output_compression for controlling response verbosity
  • Plus expand_chunk, record_code_area, index_status, reindex
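Under the hood these are ordinary MCP tool calls, so any MCP client can drive them. A `context_search` invocation would follow the standard MCP `tools/call` request shape — note the `query` argument name here is my assumption for illustration, not confirmed from CCE's docs:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "context_search",
    "arguments": { "query": "payment flow" }
  }
}
```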

A live dashboard with token savings, donut charts, and session history:

cce dashboard

And dollar estimates fetched from live Anthropic pricing:

cce savings --all

Why Not Just Use Cursor's Built-in Indexing?

CCE is editor-agnostic: one index works across Claude Code, VS Code, Cursor, Gemini CLI, and Codex. Your code never leaves your machine. And you get measurable savings reported in dollars, based on live Anthropic pricing.

Languages Supported

AST-aware chunking for Python, JavaScript, TypeScript, PHP, Go, Rust, and Java. Language-aware fallback for 40+ more (C, C++, Swift, Kotlin, Ruby, Haskell, and others). All text files are indexed.
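To see what "AST-aware chunking" plus the signature/docstring compression step buy you, here's a toy version using Python's stdlib `ast` module as a stand-in for tree-sitter (again, illustrative, not CCE's implementation):

```python
import ast
import textwrap

source = textwrap.dedent('''
    class PaymentGateway:
        """Authorize and capture card payments."""
        def charge(self, amount: int, token: str) -> bool:
            """Charge a card; returns True on success."""
            result = self._call_gateway(amount, token)
            return result.ok
''')

def compress(src):
    """Reduce code to signatures + docstrings, dropping function bodies."""
    out = []
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            header = src.splitlines()[node.lineno - 1].strip()
            doc = ast.get_docstring(node) or ""
            out.append(f"{header}  # {doc}")
    return "\n".join(out)

print(compress(source))
```

The compressed form keeps everything the model needs to answer "how does the payment flow work?" while discarding the body tokens — that's where the additional 89% comes from.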

Try It

uv tool install code-context-engine
cd your-project
cce init

Three lines. See your savings in 60 seconds.

GitHub | Docs | Benchmark


CCE is MIT licensed, free, and open source. Built by Elara Labs.
