Hari Venkata Krishna Kotha

How I cut Claude Code's token overhead by 44% and stopped hitting usage limits mid-session.

I'm on a paid Claude Code plan. A few weeks ago, I noticed I was hitting my usage limits way faster than expected. I wasn't doing anything unusual, just regular development work. But Claude kept running out of context mid-conversation, forgetting things I'd said 10 messages ago, and compacting earlier than it should. (Compaction is when Claude Code summarizes earlier messages to free up context space. When it happens too early, you lose nuance and detail from earlier in the conversation.)

I went looking for answers. LinkedIn, Dev.to, Instagram, Reddit. Most articles said the same things, and honestly, half of them were copies of each other. Token reduction tips, useful skills lists, prompt tricks. I decided to stop bookmarking and start testing. Tried every method I came across, measured the results, and kept what actually worked.

Here's what I found.

The 50,000 Token Problem You Don't Know You Have

When you install skills in Claude Code, their metadata loads into your context window on every single message. And when a skill's trigger matches your prompt, the full content loads too. The more skills you have installed, the more metadata overhead you carry per turn, and the more likely full skill content gets pulled in during a busy session.

I came across the Everything Claude Code repository and was honestly amazed. Skills, agents, commands, rules, all packaged together. So I did what most people would do: installed everything globally.

That was a mistake.

Here's what my setup looked like before I realized the problem:

Component          Size       Estimated Tokens
Skills (global)    196KB      ~50,000
Agent definitions  58KB       ~15,000
Command files      142KB      ~36,000
Rule files         9KB        ~2,000
TOTAL              405KB      ~103,000 tokens

(Rough estimate: 1KB of text ≈ 250 tokens. Not all of this loads on every turn because skills use progressive disclosure, loading only metadata first and full content when triggered. But the potential overhead is still massive, and in practice, a busy session triggers many of them.)

Over 100,000 tokens of potential overhead sitting in my setup. That's a significant chunk of Claude's context window spent on instructions, most of which weren't relevant to what I was doing at that moment.

No wonder my conversations were getting compacted early. No wonder Claude was "forgetting" things. There wasn't enough room left for the actual work.

How to Check Your Own Overhead

Before you do anything else, run this in your terminal (Windows users: use Git Bash, not PowerShell):

du -sh ~/.claude/skills/ ~/.claude/agents/ ~/.claude/commands/ ~/.claude/rules/

Reading your results:

Each line shows the size of a directory. Add them up for your total overhead.

Example output:

144K    /Users/you/.claude/skills/
76K     /Users/you/.claude/agents/
172K    /Users/you/.claude/commands/
9K      /Users/you/.claude/rules/

That's 401KB total. To estimate tokens, multiply your total KB by 250 (1KB ≈ 250 tokens). So 401KB ≈ 100,000 tokens of potential overhead. Not all of it loads every turn (skills use progressive disclosure), but the more skills you have, the more likely multiple will trigger and load fully during a session.
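If you'd rather not do the arithmetic by hand, a small shell function can sum the directories and apply the 250-tokens-per-KB estimate. This is a sketch of my own, not part of Claude Code; it uses du -sk (sizes in 1KB blocks) and silently skips directories that don't exist:

```shell
# Sketch: sum the on-disk size (KB) of the given directories and
# estimate tokens at ~250 tokens per KB. Missing dirs are skipped.
estimate_tokens() {
  local total_kb=0 dir
  for dir in "$@"; do
    [ -d "$dir" ] && total_kb=$((total_kb + $(du -sk "$dir" | cut -f1)))
  done
  echo "${total_kb}KB ≈ $((total_kb * 250)) tokens"
}

estimate_tokens ~/.claude/skills ~/.claude/agents ~/.claude/commands ~/.claude/rules
```

Run it before and after cleanup and you get a single comparable number per setup.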

If your skills directory alone is over 100KB, you're almost certainly carrying skills you don't use in most projects.

For context, my setup was 405KB before I touched anything. After moving domain-specific skills to project level and cleaning up unused agents, it dropped to 232KB. Same capabilities, 44% less overhead.

The Fix: 44% Reduction in One Afternoon

The principle is simple: only keep things globally that you use in 80%+ of your projects. Everything else goes to project level, where it only loads when you're working in that specific project.

I went from 20 global skills down to 6. The other 14 moved to the projects that actually needed them.

Component          Before     After      Saved
Skills (global)    196KB      51KB       145KB (74% reduction)
Agent definitions  58KB       52KB       6KB
Command files      142KB      120KB      22KB
Rule files         9KB        9KB        0KB (modified, not reduced)
TOTAL              405KB      232KB      173KB (~44% reduction)

What I kept globally (the skills I use in every project):

  • Coding standards (applies to every language)
  • Security review (should check this everywhere)
  • TDD workflow (I practice TDD daily)
  • Verification loop (prevents claiming things are done before checking)
  • Strategic compaction (suggests when to compact context manually)
  • Continuous learning (tracks patterns across sessions)

What I moved to project level:
Docker patterns, Python patterns, React patterns, e2e testing, eval harness, iterative retrieval, full-stack patterns, and several others. These are useful but only in specific projects. Loading Docker patterns while I'm writing documentation is pure waste.
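Mechanically, "moving to project level" is just moving a directory. Here's a sketch of a helper I'd use for it; the function name demote_skill is mine, and it assumes the standard ~/.claude/skills layout for global skills and .claude/skills inside the project:

```shell
# Sketch: move one globally installed skill into a project's
# .claude/skills directory so it only loads for that project.
# "demote_skill" is a made-up helper, not an official command.
demote_skill() {
  local skill="$1" project="${2:-.}"
  mkdir -p "$project/.claude/skills"
  mv ~/.claude/skills/"$skill" "$project/.claude/skills/"
}

# e.g. from a project root:
#   demote_skill docker-patterns
```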

The difference was immediate. Conversations lasted longer before compaction. Claude held context from earlier in the session. Fewer "I don't have context on that" moments.

The Tool Output Problem Nobody Talks About

Most optimization advice focuses on what's loaded at the start of a conversation: skills, rules, CLAUDE.md. But there's another source of token waste that's just as big, and almost nobody mentions it.

Every time Claude runs a CLI command (git status, npm test, a build command), the raw output gets dumped into the context window. And here's the thing most people miss: that output gets re-read on every subsequent turn. It doesn't disappear.

Think about it this way. You ask Claude to run your test suite. The output is 5,000 tokens. 4,950 of those tokens are passing tests. 50 tokens are the actual failures you care about. But all 5,000 tokens sit in context and get re-read on turn 2, turn 3, turn 4, and every turn after.

Over a 20-turn session with 50 tool calls, you can easily accumulate 100,000+ tokens of tool output. Most of it noise.

RTK: The Token Saver That Actually Made a Difference

RTK (Rust Token Killer) is an open-source tool that filters CLI output before it enters Claude's context window. It applies four optimization passes: smart filtering (removes noise), grouping (aggregates similar items like errors by type), truncation (keeps relevant context, cuts redundancy), and deduplication (collapses repeated log lines with counts).

Real savings from my sessions:

Command Category  Example Commands               Token Savings
Build output      cargo build, tsc, next build   80-90%
Test output       vitest, pytest, playwright     90-99%
Git operations    git status, git diff, git log  59-80%
File listings     ls, find, grep                 60-75%

The way I explain it to people: imagine you ask a librarian to check something. Without RTK, the librarian carries back the entire bookshelf, drops it on your desk, and says "the answer is on page 47." With RTK, the librarian comes back with just page 47, highlighted. Same answer. But your desk isn't buried anymore.

Installing RTK

# macOS/Linux (recommended)
brew install rtk

# Or via Cargo (IMPORTANT: do NOT run "cargo install rtk" without
# the git URL — that installs "Rust Type Kit", a completely
# different package. If "rtk gain" fails, you have the wrong one.)
cargo install --git https://github.com/rtk-ai/rtk

# Or via quick-install script
curl -fsSL https://raw.githubusercontent.com/rtk-ai/rtk/refs/heads/master/install.sh | sh

# Then add to Claude Code globally
rtk init -g

On Unix (macOS/Linux), RTK installs as a PostToolUse hook. It works transparently. Claude doesn't even know it's there. Zero token overhead.
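If you want to confirm the hook actually landed, global Claude Code settings live in ~/.claude/settings.json (the default location; adjust the path if yours differs), so a quick grep shows what rtk init -g wrote:

```shell
# Show any rtk-related entries in the global Claude Code settings,
# or a friendly message if none are found.
grep -n "rtk" ~/.claude/settings.json || echo "no rtk hook found"
```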

On Windows, it works through Git Bash. The hook and RTK.md get installed the same way. If you're using Claude Code with Git Bash as your shell (which most Windows developers do), the experience is identical to macOS/Linux. The RTK.md file that gets created adds about 1,200 tokens of instructions, but a single filtered git diff saves more than that. Net positive after your first tool call.

Windows-specific tips:

  • Download the pre-built binary from the releases page (rtk-x86_64-pc-windows-msvc.zip), or install via cargo install --git https://github.com/rtk-ai/rtk in Git Bash
  • Make sure the binary path is in your system PATH
  • Run rtk init -g the same as on Unix
  • Run from Git Bash, not native PowerShell (some shell integrations assume bash)

Measuring Your Savings

RTK has built-in analytics:

# See your cumulative savings
rtk gain

# See savings per command type
rtk gain --history

# Find commands you ran WITHOUT rtk that could have been optimized
rtk discover

The rtk discover command is the most useful one when you're starting out. It scans your Claude Code session logs and shows you exactly which commands you could have filtered but didn't.

The Memory System That Stops Claude From Asking the Same Questions

The last piece that made a real difference wasn't about reducing tokens. It was about making Claude smarter across sessions.

Claude Code has a file-based memory system at ~/.claude/projects/<project>/memory/. You create markdown files with frontmatter and Claude reads them at the start of every session.

I use four types:

User memories: Who I am, my tech stack, my preferences. Instead of explaining my setup every session, Claude already knows.

Feedback memories: Every time I correct Claude, the correction gets saved. "Use plain text in forms, not bullets." "Don't suggest tools I haven't used." Claude stops repeating the same mistakes.

Project memories: Current state of work. Deadlines, decisions, context that would otherwise be lost between sessions.

Reference memories: Where to find things in external systems. "Bug tracking is in Linear project X." Saves the "where is that tracked?" conversation every time.
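To make this concrete, here's a sketch of seeding a user-profile memory file. The directory follows the path above, but the project name, frontmatter fields, and file contents are illustrative, not a fixed schema:

```shell
# Sketch: create a user-profile memory file for a hypothetical project.
# Frontmatter fields are illustrative -- use whatever schema you settle on.
mkdir -p ~/.claude/projects/my-project/memory
cat > ~/.claude/projects/my-project/memory/user.md <<'EOF'
---
type: user
updated: 2026-03-15
---
- Stack: TypeScript + Node, Postgres
- Prefers plain text in forms, not bullets
- Don't suggest tools I haven't used
EOF
```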

lessons.md: One File That Changes Everything

This is the simplest thing I did and possibly the most impactful. I keep a lessons.md file in every project's .claude/ directory. Every time I correct Claude on something, it writes a rule:

## 2026-03-15 - Don't add error handling for impossible cases

**Rule:** Only add try-catch blocks at system boundaries (user input,
API calls, file I/O). Don't wrap internal function calls that can't
realistically fail.
**Why:** Added defensive error handling around a pure math function.
User said "this function takes two integers and adds them, it can't
throw. You're adding complexity for nothing."
**Applies when:** Writing or reviewing error handling in any codebase.

Claude reads this file at the start of every session. The correction sticks permanently. Over a few weeks, the file becomes a precise set of rules that make Claude work exactly the way you need.

The principle is simple: never correct the same mistake twice. The first correction is a lesson. The second one means the system failed.

The Priority Order

If you're starting from scratch, here's what I'd do in order:

Priority  What                                                             Effort                                      Impact
1         Install RTK                                                      30 seconds                                  60-90% tool output savings
2         Audit global skills, move domain-specific to project level       15 minutes                                  Free up context window
3         Set up basic memory files (user profile + 2-3 feedback entries)  10 minutes                                  Smarter responses, fewer repeated mistakes
4         Start a lessons.md file                                          30 seconds, then 30 seconds per correction  Permanent mistake prevention
5         Set MAX_THINKING_TOKENS env variable                             10 seconds                                  Cap runaway thinking, save tokens on over-analysis
6         Add model routing rules for subagents                            10 minutes                                  Route exploration/search subagents to cheaper models
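For items 5 and 6 the setup is small. MAX_THINKING_TOKENS is read from the environment (the cap value below is just an example), and a subagent can be pinned to a cheaper model via the model field in its frontmatter; the agent name and description here are made up for illustration:

```shell
# Item 5: cap extended thinking via environment variable.
# 12000 is an example value -- tune it to your workload.
export MAX_THINKING_TOKENS=12000

# Item 6: route a subagent to a cheaper model with the "model"
# frontmatter field. Agent name and description are hypothetical.
mkdir -p ~/.claude/agents
cat > ~/.claude/agents/code-searcher.md <<'EOF'
---
name: code-searcher
description: Fast read-only exploration and search across the repo
model: haiku
---
Search and summarize code. Do not edit files.
EOF
```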

None of this is complicated. Most of it takes less than 15 minutes. But the compound effect of doing all six is significant: longer sessions, better context retention, fewer repeated mistakes, and lower token bills.

The tools are there. Most people just don't know they exist, or don't realize how much overhead they're carrying.


This is Part 1 of a series on getting more out of Claude Code. Part 2 covers RTK in depth, including Windows setup, configuration, subagent behavior, and community tools that complement it.
