A CLI coding agent feels cheap until the invoice arrives. You point Claude Code or Codex at a repo, ask it to refactor a module, and ten minutes later it has read forty files, run the test suite three times, and burned six figures of tokens on context it did not need. Multiply that by a team running agents all day and token spend stops being a rounding error. Most coding-agent token waste is fixable from the command line without changing models or accepting worse output.
TL;DR
Cut agent token costs before context reaches the model:
- Scope the working set.
- Keep memory files short.
- Compact or clear long sessions.
- Enable prompt caching for stable prefixes.
- Route cheap subtasks to smaller models.
- Cap tool output.
- Measure cost per run.
Introduction
CLI agents are token-hungry by default. They read whole files when they need ten lines, replay the entire conversation on every turn, dump raw command output back into context, and re-send the same system prompt and repo map many times per day.
A refactor that needs to reason about 2,000 tokens of code should not require 180,000 tokens of context. The gap between those numbers is your savings.
This guide shows where tokens go in a CLI agent run and how to reduce each bucket with practical tactics:
- Context hygiene and memory files
- Prompt caching
- Model routing
- Tool-output trimming
- Retrieval limits
- Per-run cost measurement
The examples use Claude Code and Codex-style workflows, but the same mechanics apply to any token-billed coding agent.
One adjacent cost is debugging. If an agent calls a flaky internal API, it may retry, read error bodies, re-read docs, and loop. Every iteration costs tokens.
💡 If your agents touch APIs, having those APIs designed, mocked, and tested in Apidog before you point an agent at them removes a whole category of expensive trial-and-error. The agent works against a stable contract instead of a live endpoint that surprises it.
Where tokens actually go in a CLI agent run
A single agent turn has two billable parts:
- Input payload sent to the model
- Output payload returned by the model
You usually pay for both. Output tokens are often several times more expensive than input tokens, but input volume is what grows fastest.
Typical input payloads include:
- System prompt and tool definitions: agent instructions plus tool schemas. This can be 5,000–15,000 tokens and is re-sent every turn.
- Memory and project files: files such as CLAUDE.md, repo conventions, and persistent instructions. These are often loaded every turn.
- Conversation history: previous user messages, model responses, tool calls, and tool results. This grows throughout the session.
- Retrieved file content: files the agent reads. A single Read on a 1,200-line file can be roughly 12,000–18,000 tokens.
- Tool output: test logs, install logs, stack traces, git diff output, and generated lockfile diffs.
The key thing to remember: conversation history is replayed every turn.
A 30-turn session does not cost 30 times one turn. Later turns carry everything that happened before them. That is why long, meandering sessions get expensive quickly.
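To make the replay effect concrete, here is a minimal Python sketch of how input tokens accumulate over a session. The prefix size and per-turn growth are assumed figures, and the model ignores prompt caching and output tokens, so treat the result as an order-of-magnitude illustration rather than a billing formula.

```python
def cumulative_input_tokens(prefix: int, per_turn: int, turns: int) -> int:
    """Total input tokens billed across a session.

    prefix:   tokens re-sent every turn (system prompt, tools, memory file)
    per_turn: tokens each turn adds to the history (messages, tool results)
    """
    total = 0
    history = 0
    for _ in range(turns):
        history += per_turn        # this turn's new content joins the transcript
        total += prefix + history  # the whole transcript is sent again
    return total


one_turn = cumulative_input_tokens(prefix=10_000, per_turn=3_000, turns=1)
thirty_turns = cumulative_input_tokens(prefix=10_000, per_turn=3_000, turns=30)
print(one_turn, thirty_turns, round(thirty_turns / one_turn, 1))
```

Under these assumptions the 30-turn session sends about 130 times the input of a single turn, not 30 times, because the history term grows with every turn.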
For more detail on how session accounting can surprise you, see how the Claude Code token window resets.
1. Scope the working set before starting
The cheapest token is the one you never send.
Do not start an agent at the repo root with a vague task:
claude "refactor the billing logic"
That invites the agent to crawl the repo.
Instead, name the files and boundaries:
claude "refactor the retry logic so it uses exponential backoff,
only in src/payments/retry.ts and src/payments/retry.test.ts"
If the agent needs to explore, point it at a directory instead of the entire repository:
claude "inspect only src/payments and identify where retry behavior is implemented"
A practical prompt pattern:
Task:
Refactor retry logic to use exponential backoff.
Scope:
- src/payments/retry.ts
- src/payments/retry.test.ts
Do not inspect unrelated directories unless these files reference code you cannot understand without reading it.
This reduces exploratory reads and keeps the context focused.
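If you launch agents from scripts, the Task/Scope pattern can be generated instead of typed by hand. The helper below is a hypothetical sketch (the function name and the refusal to run without a scope are assumptions, not part of any agent CLI); it only builds the prompt text.

```python
def build_scoped_prompt(task: str, files: list[str]) -> str:
    """Render a task prompt that names an explicit file scope."""
    if not files:
        raise ValueError("name at least one file or directory to scope the run")
    scope = "\n".join(f"- {path}" for path in files)
    return (
        f"Task:\n{task}\n"
        f"Scope:\n{scope}\n"
        "Do not inspect unrelated directories unless these files reference "
        "code you cannot understand without reading it."
    )


prompt = build_scoped_prompt(
    "Refactor retry logic to use exponential backoff.",
    ["src/payments/retry.ts", "src/payments/retry.test.ts"],
)
print(prompt)
```

Failing fast on an empty scope is a deliberate choice here: it makes an unscoped run an explicit decision instead of a default.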
2. Keep memory files short and stable
A CLAUDE.md or equivalent memory file is loaded into context repeatedly. If it becomes a mini-wiki, you pay for that wiki on every turn.
Check its approximate token size:
wc -c CLAUDE.md | awk '{print "≈", int($1/4), "tokens per turn"}'
A good memory file should include:
- Build command
- Test command
- Lint command
- Strict repo conventions
- Important architectural constraints
- Pointers to deeper docs
It should not include:
- Full onboarding docs
- Long architecture essays
- Rarely used process notes
- Historical explanations
- Large examples that are only needed occasionally
Example lean CLAUDE.md:
# Project instructions
## Commands
- Install: `npm ci`
- Test: `npm test --silent`
- Lint: `npm run lint`
- Typecheck: `npm run typecheck`
## Conventions
- Use TypeScript strict mode.
- Do not introduce new runtime dependencies without asking.
- Keep API handlers thin; business logic belongs in `src/services`.
- Add or update tests for behavior changes.
## Docs
- API design notes: `docs/api.md`
- Payment flow details: `docs/payments.md`
Move detailed docs out of always-loaded memory and let the agent read them only when needed.
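The per-turn overhead of a memory file is easy to project over a day of use. This sketch reuses the same rough 4-characters-per-token heuristic as the shell one-liner; the file size and the number of turns per day are assumed values.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token for English text and code."""
    return len(text) // 4


def memory_overhead(text: str, turns_per_day: int) -> int:
    """Tokens spent per day just re-sending the memory file."""
    return estimate_tokens(text) * turns_per_day


sample = "x" * 8_000  # stand-in for a memory file of roughly 8 KB
print(estimate_tokens(sample))       # 2000
print(memory_overhead(sample, 200))  # 400000
```

Prompt caching can lower the price of those repeated tokens, but a shorter file reduces the count itself.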
3. Compact or clear long sessions
When a session switches tasks, do not keep typing into the same context. Every new turn carries the old transcript.
In Claude Code:
/compact
Use this when the current task is mostly complete but you want to preserve a short summary.
For a clean break:
/clear
Use this when you are starting an unrelated task.
A simple workflow:
One logical task = one session.
After task completion:
- Use /compact if follow-up work depends on the current state.
- Use /clear if the next task is unrelated.
Compacting can replace tens of thousands of raw transcript tokens with a short digest.
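A rough way to see what a mid-session compaction buys is to simulate the transcript with and without it. The sketch below assumes a fixed prefix, constant growth per turn, and a 2,000-token summary; it also ignores the cost of the compaction request itself and any caching.

```python
def session_input_tokens(prefix, per_turn, turns, compact_at=None, summary=2_000):
    """Total input tokens for a session, optionally compacting once.

    compact_at: turn number after which history is replaced by a summary
    summary:    assumed size of that summary in tokens
    """
    total = 0
    history = 0
    for turn in range(1, turns + 1):
        history += per_turn
        total += prefix + history
        if compact_at is not None and turn == compact_at:
            history = summary  # the transcript so far collapses to a digest
    return total


baseline = session_input_tokens(10_000, 3_000, turns=30)
compacted = session_input_tokens(10_000, 3_000, turns=30, compact_at=15)
print(baseline, compacted, f"{1 - compacted / baseline:.0%} fewer input tokens")
```

With these assumed numbers, one compaction at the halfway point cuts total input by a little over a third.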
The same scoping habit appears in Claude Code workflows.
4. Ignore generated and irrelevant files
Keep generated artifacts, dependencies, build output, and large lockfile diffs away from the agent.
At minimum, configure your repo ignore rules so agents avoid:
node_modules/
dist/
build/
coverage/
.next/
.cache/
tmp/
*.log
If your agent supports its own ignore file, add one too. The exact filename depends on the tool, but the goal is the same: prevent the agent from reading or diffing files that do not matter.
This is especially important for:
- Generated SDKs
- Minified bundles
- Large snapshots
- Lockfiles
- Vendored dependencies
- Test fixtures with huge payloads
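If you assemble context yourself (for example, in a wrapper that picks files to send), the same ignore rules can be applied in code. This is a simplified sketch, not a full gitignore implementation: the directory set mirrors the list above, while the `*.min.js` and `package-lock.json` patterns are illustrative additions.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

IGNORED_DIRS = {"node_modules", "dist", "build", "coverage", ".next", ".cache", "tmp"}
IGNORED_GLOBS = ("*.log", "*.min.js", "package-lock.json")  # illustrative patterns


def is_ignored(path: str) -> bool:
    """True if a file path sits under an ignored directory or its name matches a glob."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    if any(part in IGNORED_DIRS for part in parts[:-1]):
        return True
    return any(fnmatch(parts[-1], pattern) for pattern in IGNORED_GLOBS)


paths = [
    "src/payments/retry.ts",
    "node_modules/react/index.js",
    "dist/app.min.js",
    "logs/server.log",
    "package-lock.json",
]
print([p for p in paths if not is_ignored(p)])  # ['src/payments/retry.ts']
```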
5. Use prompt caching for stable prefixes
Prompt caching lets the provider store a stable prefix of your request so repeated requests can reuse it at a discount.
The stable prefix usually includes:
- Tool definitions
- System prompt
- Repo conventions
- Long-lived instructions
The volatile part should come after the cache boundary:
- User task
- Current file snippets
- Timestamps
- Fresh command output
The structural order is:
tools → system → messages
Changing content before the cache boundary can invalidate the cached prefix.
If you call the model from your own wrapper, cache the stable part explicitly:
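import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholders: substitute your real prompt, conventions, and task.
# Note: very short prefixes fall below the provider's minimum cacheable
# length and are not cached, so a real prefix needs to be much longer.
SYSTEM_PROMPT = "You are a coding agent working in this repository.\n"
REPO_CONVENTIONS = "Use TypeScript strict mode. Keep API handlers thin.\n"
user_task = "Refactor retry logic to use exponential backoff."

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            # Stable prefix: must be byte-identical across requests to hit the cache.
            "text": SYSTEM_PROMPT + REPO_CONVENTIONS,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Volatile content goes after the cache boundary.
    messages=[
        {
            "role": "user",
            "content": user_task,
        }
    ],
)

# Usage fields show whether the prefix was written to or read from the cache.
u = response.usage
print("cache write:", u.cache_creation_input_tokens)
print("cache read :", u.cache_read_input_tokens)
print("fresh input:", u.input_tokens)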
Operational rules:
- Keep cached prefixes byte-stable.
- Do not insert timestamps into cached content.
- Batch related runs close together to hit a warm cache.
- Inspect usage fields to verify cache reads are happening.
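For repeated agent runs with the same system prompt and repo conventions, cache reads are billed at a fraction of the normal input rate (one tenth in the illustrative rates used in the measurement section), while the first request that writes the cache costs slightly more than uncached input.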
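Whether caching pays off depends on how often the prefix is reused while the cache is still warm. The sketch below compares the two cases using the illustrative per-token rates from the measurement section of this guide; it assumes every run after the first is a cache hit, which will not hold if runs are spaced far enough apart for the cache to expire.

```python
INPUT_RATE = 3.00 / 1_000_000   # uncached input, dollars per token (illustrative)
CACHE_WRITE = 3.75 / 1_000_000  # first request, which writes the prefix
CACHE_READ = 0.30 / 1_000_000   # later requests that hit the cache


def prefix_cost(prefix_tokens: int, runs: int, cached: bool) -> float:
    """Cost of sending the same prefix `runs` times.

    Assumes every run after the first hits a warm cache.
    """
    if runs <= 0:
        return 0.0
    if not cached:
        return prefix_tokens * runs * INPUT_RATE
    return prefix_tokens * (CACHE_WRITE + (runs - 1) * CACHE_READ)


for runs in (1, 2, 20):
    plain = prefix_cost(12_000, runs, cached=False)
    warm = prefix_cost(12_000, runs, cached=True)
    print(f"{runs:>2} runs: uncached ${plain:.4f}  cached ${warm:.4f}")
```

At these rates a single run is slightly more expensive with caching because of the write premium, the second run already breaks even, and by twenty runs the cached prefix costs roughly a sixth of the uncached one.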
OpenAI's API applies similar cached-input discounts automatically on supported models. The knobs differ, but the principle is the same.
For another angle on reducing model cost, see running GPT-5.5 free through Codex.
6. Route cheap work to cheaper models
Not every task needs the strongest model.
Good candidates for a smaller model:
- Commit messages
- Changelog entries
- Diff summaries
- Boilerplate tests
- Simple renames
- Lint explanations
- Search result summarization
Reserve the stronger model for:
- Architecture decisions
- Complex refactors
- Multi-file reasoning
- Debugging subtle failures
- Security-sensitive changes
Example CLI routing:
# Cheap model for low-risk text generation
claude --model haiku "write a conventional-commit message for the staged diff"
# Stronger model for architecture or complex reasoning
claude --model sonnet "redesign the caching layer for the payments service"
A better team default is:
Default model: cheaper model
Escalation model: stronger model when explicitly needed
Many teams do the opposite and run the flagship model for everything "to be safe." That is expensive when the task is just summarizing a diff.
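The cheap-by-default policy can be written down as a small routing function so it is applied consistently in scripts. The task-kind labels below are hypothetical names chosen for this sketch; the model aliases are the ones used in the CLI example above.

```python
# Hypothetical task labels that justify escalating to the stronger model.
STRONG_TASKS = {"architecture", "complex_refactor", "debugging", "security"}


def pick_model(task_kind: str, cheap: str = "haiku", strong: str = "sonnet") -> str:
    """Default to the cheap model; escalate only for task kinds known to need it."""
    if task_kind in STRONG_TASKS:
        return strong
    return cheap


print(pick_model("commit_message"))  # haiku
print(pick_model("security"))        # sonnet
```

Unknown task kinds fall through to the cheap model, which matches the default described above; flip that fallback if a wrong cheap answer costs more than the extra tokens.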
If your framework supports sub-agents, route narrow subtasks to cheap models with small context windows. The parent agent should receive a short distilled result instead of doing all grunt work with the expensive model.
The delegation style in the goal command across Codex and Claude Code is useful for this pattern.
If you are on a capped plan, routing also stretches your allowance. The Claude Code weekly limit increase helps, but routing is still what keeps premium-model usage available for hard work.
7. Make tool output quiet
Tool output is easy to ignore because it feels like "just logs." But every line returned to the agent becomes context and may be replayed in later turns.
Prefer quiet commands.
Instead of:
npm test
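Use a quieter form (the reporter flag depends on your test runner; --reporter=dot suits Mocha and Vitest):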
npm test --silent -- --reporter=dot
Instead of:
npm install
Use:
npm install --silent --no-audit --no-fund
For Python tests:
pytest -q
For noisy test failures, return only the tail:
pytest -q 2>&1 | tail -n 30
For diffs, avoid dumping huge patches unless needed:
git diff --stat
Then inspect a specific file:
git diff -- src/payments/retry.ts
For logs, grep the signal:
npm test 2>&1 | grep -E "(FAIL|✕|Error)" | head -n 20
This gives the agent enough signal to act without stuffing the transcript with thousands of irrelevant tokens.
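When you control the tool layer, the grep-and-tail idea can be combined in one filter that runs before output reaches the model. This is a minimal sketch; the failure markers in the regex and the default limits are assumptions to tune for your test runner.

```python
import re

# Assumed failure markers; adjust for your test runner's output.
SIGNAL = re.compile(r"FAIL|Error|Traceback")


def trim_output(raw: str, tail: int = 30, max_signal: int = 20) -> str:
    """Keep lines that look like failures plus the last `tail` lines."""
    lines = raw.splitlines()
    if len(lines) <= tail:
        return raw  # already short enough to pass through untouched
    cut = len(lines) - tail
    head, last = lines[:cut], lines[cut:]
    signal = [line for line in head if SIGNAL.search(line)][:max_signal]
    omitted = len(head) - len(signal)
    return "\n".join(signal + [f"[... {omitted} lines omitted ...]"] + last)


raw_log = "\n".join(["ok"] * 4 + ["FAIL test_retry"] + ["ok"] * 95)
print(len(raw_log.splitlines()), "->", len(trim_output(raw_log).splitlines()))
```

The explicit omission marker tells the agent that output was cut, so it can ask for more instead of assuming the log was complete.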
8. Prefer targeted reads over whole-file reads
A common waste pattern:
Agent reads a 1,500-line file to modify one function.
Better prompt:
Find the function that handles payment retries.
Read only that function and nearby helper functions.
Do not read the entire file unless necessary.
If you know the symbol, give it directly:
claude "Update the calculateBackoffDelay function in src/payments/retry.ts.
Read only that function, its direct helpers, and its tests."
Useful shell commands for manual scoping:
grep -Rn "function calculateBackoffDelay" src
grep -Rn "calculateBackoffDelay" src test
Then pass the relevant files or line ranges to the agent.
The difference can be large: a whole large file may cost tens of thousands of tokens, while a focused function window may be under a thousand.
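A focused window can be produced mechanically once you know the symbol. The sketch below uses a plain text match and a fixed number of context lines, which is an assumption of convenience: it does not parse the language, so the window may cut a long function short or include a neighbor.

```python
def symbol_window(source: str, symbol: str, before: int = 2, after: int = 25) -> str:
    """Return numbered lines around the first line that mentions `symbol`."""
    lines = source.splitlines()
    for index, line in enumerate(lines):
        if symbol in line:
            start = max(0, index - before)
            end = min(len(lines), index + after + 1)
            return "\n".join(f"{n + 1}: {lines[n]}" for n in range(start, end))
    raise LookupError(f"{symbol!r} not found")


# Stand-in for a large file: 1,500 filler lines with one function header.
demo = [f"// line {i}" for i in range(1, 1501)]
demo[699] = "function calculateBackoffDelay(attempt) {"
window = symbol_window("\n".join(demo), "calculateBackoffDelay")
print(len(demo), "lines in file ->", len(window.splitlines()), "lines sent")
```

The line numbers in the output let the agent refer back to exact locations without having seen the rest of the file.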
9. Constrain retrieval and RAG scope
If your agent searches docs or code with retrieval, cap both chunk count and chunk size.
Bad default:
Return 50 chunks of 800 tokens each.
Better default:
Return the top 8–10 chunks.
Limit each chunk to around 200–300 tokens.
Prefer exact symbol matches over broad semantic matches.
Practical retrieval rules:
- Retrieve fewer chunks first.
- Ask for more only if needed.
- Prefer exact paths and symbols.
- Avoid returning entire documents.
- Summarize long docs before passing them to the expensive model.
You pay for every retrieved token whether or not the model uses it.
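Both caps can be enforced in a few lines wherever retrieval results are assembled. The sketch assumes results arrive as (score, text) pairs and uses the rough 4-characters-per-token estimate for truncation; a real pipeline would use the model's tokenizer and cut on sentence or symbol boundaries.

```python
def cap_chunks(chunks, max_chunks=8, max_tokens=300):
    """chunks: list of (score, text) pairs. Keep the best few and truncate each one."""
    best = sorted(chunks, key=lambda pair: pair[0], reverse=True)[:max_chunks]
    max_chars = max_tokens * 4  # rough 4-characters-per-token estimate
    return [text[:max_chars] for _, text in best]


# 50 chunks of roughly 800 tokens each, as in the "bad default" above.
results = [(i / 100, "x" * 3_200) for i in range(50)]
kept = cap_chunks(results)
before = sum(len(text) for _, text in results) // 4
after = sum(len(text) for text in kept) // 4
print(before, "->", after, "estimated tokens")
```

In this example the retrieval payload drops from about 40,000 estimated tokens to about 2,400.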
10. Measure cost per run
You cannot optimize what you do not measure.
If you call the API directly, capture usage from every response:
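# `response` is the object returned by client.messages.create(...)
u = response.usage

# Illustrative dollar rates per token; substitute live rates for your model.
INPUT_RATE = 3.00 / 1_000_000
OUTPUT_RATE = 15.00 / 1_000_000
CACHE_READ = 0.30 / 1_000_000
CACHE_WRITE = 3.75 / 1_000_000

# Cache fields can be None when caching is not in use, so default them to 0.
cache_read = u.cache_read_input_tokens or 0
cache_write = u.cache_creation_input_tokens or 0

cost = (
    u.input_tokens * INPUT_RATE
    + u.output_tokens * OUTPUT_RATE
    + cache_read * CACHE_READ
    + cache_write * CACHE_WRITE
)
print(
    f"run cost ≈ ${cost:.4f} "
    f"(in={u.input_tokens} "
    f"out={u.output_tokens} "
    f"cache_read={cache_read})"
)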
Use live provider rates for your model. The numbers above are illustrative.
If you use an agent CLI, use one of these approaches:
# Inside an interactive session, check session cost if the CLI supports it
/cost
Or isolate spend:
Create one API key per agent, project, or team.
Track spend per key in the provider dashboard.
Or wrap invocations:
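#!/usr/bin/env bash
# Usage: <this-script> <task-label> <claude args...>
set -uo pipefail
TASK_LABEL="$1"
shift
START="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# Stream output to the terminal and keep a copy for later inspection.
claude "$@" | tee /tmp/agent-output.txt
STATUS="${PIPESTATUS[0]}"
END="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# Append one CSV row per run: start, end, label, exit status.
echo "$START,$END,$TASK_LABEL,$STATUS" >> agent-runs.csv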
Track representative tasks:
- Cost per daily refactor
- Cost per PR review
- Cost per test-generation run
- Cost per debugging session
When you enable caching, trim memory, or route subtasks to cheaper models, those numbers should move. If they do not, the tactic is not affecting your actual bottleneck.
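Once each run has a label and a cost, a small aggregation shows whether a tactic moved the numbers. This sketch assumes you already have (label, cost) pairs, for example from the usage-based calculation for API calls; the labels and dollar amounts are made-up sample data.

```python
from collections import defaultdict
from statistics import mean


def cost_by_label(runs):
    """runs: iterable of (label, cost_in_dollars). Returns {label: (count, mean, total)}."""
    grouped = defaultdict(list)
    for label, cost in runs:
        grouped[label].append(cost)
    return {
        label: (len(costs), mean(costs), sum(costs))
        for label, costs in grouped.items()
    }


sample_runs = [("pr-review", 0.40), ("pr-review", 0.60), ("refactor", 1.25)]
for label, (count, average, total) in cost_by_label(sample_runs).items():
    print(f"{label}: {count} runs, mean ${average:.2f}, total ${total:.2f}")
```

Comparing the mean per label before and after a change is usually more informative than the monthly total, which also moves with usage volume.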
Tactic comparison
| Tactic | Typical token savings | Effort |
|---|---|---|
| Scope the working set | 30–60% on input per run | Low |
| Short, stable memory file | 5–15% per turn | Low |
| /compact or /clear between tasks | 40–80% on long sessions | Low |
| Prompt caching on stable prefix | ~90% on the cached prefix | Medium |
| Model routing | 50–80% on routed subtasks | Medium |
| Quiet or filtered tool output | 20–50% on tool-heavy runs | Low |
| Targeted reads | 70–95% on large-file edits | Low |
| Constrained retrieval scope | 30–60% on RAG-heavy agents | Medium |
| Per-run cost measurement | 0% directly; enables optimization | Low |
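Savings ranges are illustrative and stack multiplicatively rather than additively: each tactic trims what the others leave, so two 50% savings combine to 75%, not 100%. Your actual gain depends on where your baseline waste is.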
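The multiplicative stacking can be checked with a few lines of arithmetic. This sketch treats every tactic as acting on the same pool of tokens, which is a simplification: in practice each tactic targets a different bucket, so the true combined figure depends on how your spend is split.

```python
from functools import reduce


def combined_saving(*savings: float) -> float:
    """Overall fraction saved when each tactic trims what the previous ones left."""
    remaining = reduce(lambda left, s: left * (1.0 - s), savings, 1.0)
    return 1.0 - remaining


print(f"{combined_saving(0.40, 0.30):.2f}")  # 0.58, not 0.70
```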
Practical checklist
Before starting an agent run:
[ ] Did I name the exact files or directory scope?
[ ] Is the task one logical unit?
[ ] Is my memory file short?
[ ] Are generated files ignored?
[ ] Am I using the cheapest model that can do this task?
During the run:
[ ] Ask for targeted reads, not whole-file reads.
[ ] Use quiet test and install commands.
[ ] Return only relevant log tails.
[ ] Avoid dumping huge diffs unless needed.
After the run:
[ ] Check token usage or session cost.
[ ] Compact or clear before switching tasks.
[ ] Record cost for representative workflows.
Conclusion
Agent token costs are mostly caused by avoidable context: files the model did not need, logs nobody reads, long transcripts replayed every turn, and expensive models used for cheap tasks.
Start with the low-effort fixes:
- Scope the task.
- Keep memory files lean.
- Use quiet commands.
- Ignore generated files.
- Clear or compact between tasks.
Then add caching, model routing, and per-run measurement. Those changes reduce spend without reducing the quality of the actual coding work.