Sahil Kathpal

Posted on • Originally published at codeongrass.com

Cut Claude Code Token Usage 98% with Purpose-Built MCPs

Running Claude Code against a large codebase or a corpus of financial documents will drain your token budget fast — not because the tasks are conceptually hard, but because Claude's default behavior is to read entire files into context. Two recently published open-source MCPs fix this at the tool layer: Semble for semantic code search (98% token reduction, 250ms index build, 1.5ms query latency) and a SEC filing MCP for nav-map document chunking that stops 80K-token 10-Ks from overflowing context. This tutorial walks through installing both, wiring them into Claude Code, and confirming they're actually intercepting the full-file reads.


TL;DR

Claude Code burns tokens because it calls read_file on whole files when it should be making targeted retrieval calls. The fix is an MCP retrieval layer: Semble gives Claude a semantic search_code tool for code (98% fewer tokens per query, NDCG@10 relevance score of 0.854) and the SEC MCP gives it get_filing_section for large documents (single-section retrieval from filings that would otherwise overflow an entire context window). Both are open-source, free, and wired via a standard .mcp.json config.


Why Full-File Reads Blow Your Token Budget

When Claude Code tries to answer "find all places we handle auth tokens," it scans candidate files completely — not just the relevant functions. On a codebase with a few hundred files averaging a few hundred lines each, a single cross-cutting search can pull tens of thousands of tokens into context before the agent writes a single line of output.

The problem is structurally worse for document-heavy workflows. A single SEC 10-K filing can run 80,000+ tokens. The developer who built the SEC MCP described the original failure mode plainly: loading one filing caused context blowout before any analysis started. Full document ingestion isn't a prompt engineering problem — it's an architecture problem.

The correct fix is a retrieval layer between Claude and your files. Instead of read_file, Claude calls search_code or get_filing_section — tools that return only the relevant chunk. MCP (Model Context Protocol) is the right abstraction for this: it extends Claude Code's tool set without changing your prompts, your project structure, or how you think about tasks. For a broader map of what's available in the MCP ecosystem today, The MCP Server Ecosystem in 2026 covers the discovery landscape and a build-vs-find decision matrix worth reading before you build anything custom.
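The token math behind the retrieval layer is easy to sketch. The snippet below is illustrative only — the helper functions and the 4-characters-per-token heuristic are assumptions for the example, not Semble's actual implementation — but it shows why returning matched chunks instead of whole files changes the cost per query:

```python
# Illustrative sketch: full-file reads vs. a chunk-returning retrieval layer.
# The helpers and the ~4-chars-per-token heuristic are hypothetical.

def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English/code."""
    return len(text) // 4

def read_file(files: dict, path: str) -> str:
    """What Claude does by default: pull the whole file into context."""
    return files[path]

def search_code(files: dict, query: str, context_lines: int = 2) -> list:
    """What a retrieval MCP does: return only matching lines plus context."""
    hits = []
    for path, text in files.items():
        lines = text.splitlines()
        for i, line in enumerate(lines):
            if query in line:
                lo, hi = max(0, i - context_lines), i + context_lines + 1
                hits.append(f"{path}:{i + 1}\n" + "\n".join(lines[lo:hi]))
    return hits

# One 601-line file with a single relevant line buried in the middle.
files = {"billing.py": ("# unrelated setup\n" * 300)
                       + "stripe.charge(amount)\n"
                       + ("# teardown\n" * 300)}

full_cost = estimate_tokens(read_file(files, "billing.py"))
chunk_cost = sum(estimate_tokens(h) for h in search_code(files, "stripe.charge"))
print(full_cost, chunk_cost)  # the chunk costs a small fraction of the full file
```

Scale that gap across the 30–50 files a cross-cutting search touches and the aggregate reduction follows.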


Prerequisites

Required:

  • Claude Code installed and authenticated (claude --version should return a version string)
  • Node.js 18+ (for Semble MCP)
  • Python 3.9+ (for SEC MCP)
  • Git

Optional (strongly recommended for persistent remote runs):

  • An always-on cloud VM (such as Grass, covered below) where the MCP server processes can run continuously

Step 1: Install and Configure Semble MCP for Semantic Code Search

Semble is a local semantic code search MCP built specifically to solve the full-file-read problem. The published benchmark numbers from the r/ClaudeAI announcement thread:

| Metric | Value |
| --- | --- |
| Token reduction vs. full-file baseline | 98% |
| Index build time | 250ms |
| Query latency | 1.5ms |
| Relevance quality (NDCG@10) | 0.854 |
| Speed vs. transformer hybrid approach | 200x faster |

NDCG@10 of 0.854 means the most relevant code chunks consistently rank at the top — critical for ensuring Claude gets the code it actually needs rather than a noisy result set.
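For readers unfamiliar with the metric, NDCG@10 compares the ranking a search system actually produced against the ideal ordering of the same results, discounting relevance by rank position. A minimal sketch of the computation (not Semble's evaluation code):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: relevance discounted by log2 of rank position."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """Normalize DCG by the DCG of the ideal (descending) ordering of the same scores."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg else 0.0

# Perfectly ordered results score exactly 1.0; a good-but-imperfect
# ranking (relevant chunks mostly near the top) scores close to it.
print(ndcg_at_k([3, 2, 1]))                            # 1.0
print(ndcg_at_k([3, 2, 3, 0, 1, 2, 0, 0, 1, 0]))       # close to 1.0
```

A score of 0.854 therefore means the ranking is consistently near the ideal ordering, not that 85.4% of results are relevant.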

Install and Index

Find the repository link in the r/ClaudeAI announcement thread referenced above. Installation follows the standard Node.js MCP server pattern:

# Clone from the repository linked in the announcement post
git clone <semble-repo-url>
cd semble-mcp
npm install
npm run build

# Build the search index from your project root
npx semble index --path ./src --output .semble-index

# Expected output:
# Indexing 847 files...
# Index built in 208ms
# Saved to .semble-index/

At 250ms average index build time, this is fast enough to re-index on every session start if your codebase changes frequently.

Start the MCP Server

npx semble serve --index .semble-index

Leave this process running before starting any Claude Code session. Claude Code connects to MCP servers at startup — if the server isn't live when Claude launches, the search_code tool won't appear in Claude's available tool list for that session.


Step 2: Add the SEC Filing MCP for Large Document Chunking

The SEC MCP provides nav-map chunking for EDGAR filings. Instead of loading a full 10-K into context, Claude calls get_filing_section with a section name (Risk Factors, MD&A, Financial Statements) and receives only that section, with a citation back to the EDGAR HTML. It covers 6,000+ publicly registered companies, is model-agnostic, and is free.

The retrieval pattern handles the context math: a Risk Factors section runs roughly 3,000–6,000 tokens; the same document loaded whole runs 80,000+. Nav-map chunking is the difference between an analysis that fits in one session and one that overflows context before it starts.

Install

The repository and exact install command are linked from the r/ClaudeAI thread. The pattern follows standard Python MCP server setup:

# Install from the repository linked in the announcement post
pip install <sec-mcp-package>

# Or from source
git clone <sec-mcp-repo-url>
cd sec-mcp
pip install -e .

Verify the Chunking Works

# Test section retrieval — should return only the Risk Factors section, not the full document
sec-mcp query --company AAPL --section "Risk Factors" --year 2024

If you get back the full document instead of a section, the nav-map index hasn't built correctly. Check the repository README for the --rebuild-index flag.

Start the MCP Server

sec-mcp serve

Step 3: Wire Both MCPs into Claude Code

Claude Code reads MCP configuration from .mcp.json in your project root, or from your global config at ~/.claude/settings.json. For a thorough walkthrough of local versus remote MCP server tradeoffs, eesel's Claude Code MCP integration guide covers the setup complexity honestly.

MCP servers speak a standardized protocol — the .mcp.json structure below works the same way regardless of which MCP servers you're wiring in.

Project-Level .mcp.json

{
  "mcpServers": {
    "semble": {
      "command": "npx",
      "args": ["semble", "serve", "--index", ".semble-index"],
      "env": {}
    },
    "sec-filings": {
      "command": "sec-mcp",
      "args": ["serve"],
      "env": {}
    }
  }
}

Alternatively, use the CLI:

claude mcp add semble -- npx semble serve --index .semble-index
claude mcp add sec-filings -- sec-mcp serve

Verify MCP Servers Are Visible

claude mcp list

Expected output:

semble        (running)   npx semble serve --index .semble-index
sec-filings   (running)   sec-mcp serve

If a server shows (stopped) or is missing, the underlying process wasn't live when Claude Code started. Start the process, then relaunch Claude Code — MCP connections are established at session init, not on-demand.


Step 4: Validate Token Reduction in a Real Session

The fastest validation is a direct cost comparison on the same query.

Baseline (without Semble): Open a Claude Code session without the MCP and ask:

Find all places this codebase calls stripe.charge() or stripe.PaymentIntent.create()

Watch Claude call read_file on multiple files. The result event shown at session end includes API cost and token count — note both.

With Semble active: Start a fresh session with the MCP running. Ask the same query. Claude should now call semble_search("stripe.charge OR stripe.PaymentIntent.create") and receive back only the matching lines with file context — not full files.

The Claude Code power user tips documentation covers how to read the tool call stream in a session, which makes it straightforward to confirm Semble is being called instead of read_file.

Check the result cost. A 98% reduction means what previously cost $0.08–$0.20 on a medium codebase now costs under $0.005. If you're still seeing high costs, see troubleshooting below.


Step 5: Lock In Tool Preference with CLAUDE.md

Claude Code doesn't always prefer the semantically correct tool when multiple tools could satisfy a query. On sessions with many tool calls, it can drift back toward direct file reads even when Semble is available — a documented behavior covered in Why Your Claude Agent Ignores Rules Past ~15 Tool Calls.

The most durable fix is an explicit rule in your project's CLAUDE.md:

## Tool preferences

- For code search: call `semble_search` before calling `read_file`. Only use `read_file`
  if semble_search returned no relevant results.
- For SEC filings: call `get_filing_section` with the specific section name. Never load
  a full filing document unless explicitly asked to.

This architectural constraint survives deep into long sessions in a way that prompt-level instructions don't.


Troubleshooting

Semble not being called — Claude still reads full files

Almost always caused by the MCP server not running at session start. Claude Code connects to all configured MCP servers when it launches; if a server is down at that moment, the tool simply isn't registered for the session. Fix: ensure npx semble serve is running, then run claude mcp list to confirm the server shows (running) before starting work.

Query relevance looks low — Claude gets unhelpful code chunks

Try re-indexing with a larger chunk size. Default chunk sizes work well for typical function lengths, but very long functions get cut mid-logic:

npx semble index --path ./src --chunk-size 150 --output .semble-index

Test a handful of queries you know the answer to — if they return wrong results, the chunk size is the first variable to adjust.
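Why chunk size matters is easy to see with a toy fixed-size chunker. This is illustrative only — it is not Semble's actual chunking logic — but it shows how a long function gets severed from its signature when chunks are smaller than the function:

```python
# Illustrative only: naive fixed-size line chunking, not Semble's algorithm.

def chunk_by_lines(source: str, chunk_size: int) -> list:
    """Split source into consecutive blocks of at most chunk_size lines."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + chunk_size])
            for i in range(0, len(lines), chunk_size)]

# A 120-line function: with a 50-line chunk size, only the first chunk
# carries the signature — the rest are orphaned body fragments.
long_function = "def process_payment():\n" + "\n".join(
    f"    step_{i}()" for i in range(119))

small = chunk_by_lines(long_function, 50)   # 3 chunks, function cut mid-logic
large = chunk_by_lines(long_function, 150)  # 1 chunk, function stays whole
print(len(small), len(large))  # 3 1
```

A query matching `step_80()` against the small chunks retrieves a fragment with no signature and no surrounding control flow — exactly the "unhelpful chunk" symptom described above.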

SEC MCP returns full documents instead of sections

The nav-map index may not have built for the specific company or year. Run the query with --rebuild-nav-map to force a fresh section map from EDGAR HTML. EDGAR rate limits can cause partial index builds on first run.

Token usage is still high after MCP installation

Check whether Claude is actually calling semble_search or falling back to read_file. If you see read_file in the tool call stream, the CLAUDE.md rule isn't in place yet (Step 5 above). Add it and start a fresh session — tool selection rules in CLAUDE.md are evaluated at the start of each session.

MCP servers restart and lose state

Semble rebuilds from the .semble-index/ directory on startup — persistent state across restarts. The SEC MCP is stateless (fetches from EDGAR on demand), so restarts are safe. The only state you need to protect is the .semble-index/ directory in your project.


How Grass Makes This Workflow Better

The token-efficiency pattern above works on any machine. The operational problem is keeping it working: Semble's MCP server and the SEC MCP server need to be live before Claude Code starts, stay alive through long sessions, and survive your laptop sleeping or closing. On a local machine, that means extra terminal windows, caffeinate flags on macOS, and losing all MCP connections every time your machine reboots or your SSH session drops.

Grass solves this with an always-on cloud VM where MCP server processes run continuously alongside Claude Code. The practical difference shows up in three places:

Persistent MCP services, not terminal babysitting

On a Grass cloud VM, you register Semble and the SEC MCP as persistent services once. They start automatically on VM boot and are live for every Claude Code session — no manual process management, no checking whether the right terminals are open before starting work.

# On your Grass VM — one-time setup
sudo systemctl enable semble-mcp sec-mcp
sudo systemctl start semble-mcp sec-mcp

From that point, every Claude Code session on the VM inherits both MCP connections without a startup checklist.
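The systemctl commands above assume unit files exist for both servers. Neither project ships one, so here is a minimal sketch of what a semble-mcp.service could look like — the unit name, user, and working directory are assumptions you'd adapt to your VM:

```ini
# /etc/systemd/system/semble-mcp.service — illustrative sketch, not a shipped file
[Unit]
Description=Semble semantic code search MCP server
After=network.target

[Service]
WorkingDirectory=/workspace/myproject
ExecStart=/usr/bin/npx semble serve --index .semble-index
Restart=on-failure
User=dev

[Install]
WantedBy=multi-user.target
```

The sec-mcp.service unit follows the same shape with `ExecStart=` pointing at `sec-mcp serve`. `Restart=on-failure` is what gives you the "no terminal babysitting" property: a crashed server comes back without intervention.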

Fire off large indexing jobs and forget them

Semble's 250ms indexing benchmark holds for mid-size codebases. Re-indexing a large monorepo takes longer and ties up the process while it runs. On Grass, you schedule Semble to re-index nightly via cron while you're not working, and dispatch Claude Code tasks during the day against a warm, pre-built index:

# Nightly cron on Grass VM
0 2 * * * cd /workspace/myproject && npx semble index --path ./src --output .semble-index

No indexing latency in your working sessions.

Mobile approval forwarding for permission-gated operations

When Claude Code needs to write a file or run a bash command — even via an MCP tool — it can pause for permission. On a remote session without mobile access, that prompt sits unanswered until you're back at your desk. With Grass, permission requests forward to your phone as native modals: tap Allow or Deny from anywhere.

For long-running batch tasks (pulling financials from 50+ companies via the SEC MCP overnight, for example), a single stalled permission prompt can block an entire run for hours. Mobile permission forwarding removes that bottleneck.

To try the persistent setup, get started with Grass in 5 minutes — the free tier includes 10 hours of cloud VM time with no credit card required.


FAQ

How much does Semble actually reduce token usage in practice?

The published benchmark shows 98% reduction specifically on code search tasks — queries where Claude would otherwise read multiple full files. For tasks that don't involve searching across the codebase (e.g., editing a specific file you've already identified by path), token usage is unchanged. The largest gains come from exploration-type queries: "find all X," "where does Y get called," "which files handle Z." Those queries are the common case in production agentic workflows.

Does the SEC MCP work for private documents or internal wikis?

No. The SEC MCP is built on EDGAR, which covers publicly registered companies with SEC reporting obligations. It supports 6,000+ companies but nothing private. For internal document chunking, you'd build a custom MCP server using the same nav-map pattern against your own document store — the MCP protocol is standardized, so the server structure is reusable.
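The nav-map pattern itself is straightforward to reuse for your own documents: build a map from section headings to character offsets once, then serve only the requested slice. A minimal illustration against markdown-style headings (hypothetical helper names, not the SEC MCP's code):

```python
import re

def build_nav_map(document: str) -> dict:
    """Map each '## Heading' to the (start, end) span of its section body."""
    matches = list(re.finditer(r"^## (.+)$", document, flags=re.MULTILINE))
    nav = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(document)
        nav[m.group(1).strip()] = (start, end)
    return nav

def get_section(document: str, nav: dict, name: str) -> str:
    """Return only the requested section, never the whole document."""
    start, end = nav[name]
    return document[start:end].strip()

doc = "## Risk Factors\nCompetition is intense.\n## MD&A\nRevenue grew 12%.\n"
nav = build_nav_map(doc)
print(get_section(doc, nav, "Risk Factors"))  # -> Competition is intense.
```

The index is built once per document; every subsequent retrieval is a cheap slice, which is the same shape as caching a nav map per EDGAR filing.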

Can I use these MCPs with OpenCode or Codex, not just Claude Code?

Yes. MCP is model-agnostic by design. Any client that implements the protocol can call these tools. OpenCode supports MCP servers natively. The .mcp.json config key names may differ per client, but the underlying server processes and tool schemas are identical.

Why does Claude Code sometimes ignore my MCP tools and fall back to read_file?

Tool selection reliability degrades as session context depth grows. Claude Code may prefer a tool it used successfully in recent turns over a less-familiar MCP tool, even when the MCP tool is semantically correct. Adding explicit tool preference rules to CLAUDE.md (Step 5 above) anchors the behavior architecturally rather than relying on model judgment. The underlying mechanism is explained in detail in Why Your Claude Agent Ignores Rules Past ~15 Tool Calls.

What's a realistic expectation for token savings on a 500-file TypeScript codebase?

Based on the Semble benchmark numbers, a cross-codebase search query that previously read 30–50 files completely (15,000–25,000 tokens of input) should drop to a few hundred tokens per query after Semble intercepts it. The 98% figure is the aggregate reduction across a representative query set — individual results vary by how many files Claude would have opened without the MCP.


Next Steps

Install both MCPs and run a cost comparison on a real task from your own codebase — the delta shows up in the first session. Add the CLAUDE.md tool preference rules immediately; they're what sustains the token savings across deep sessions.

If you're running large-scale document analysis, overnight indexing tasks, or multi-repo code searches, the persistent MCP setup on a cloud VM removes the process management overhead entirely. Start with the free tier at codeongrass.com — 10 hours, no credit card, MCP-ready from first boot.


This post is published by Grass — a machine built for AI coding agents that gives every agent a dedicated always-on cloud VM, controllable from your laptop, phone, or automation. Works with Claude Code, Codex, and OpenCode.


