Hassann

Posted on May 20 • Originally published at apidog.com

How to Reduce Agent Token Costs From the CLI (2026 Guide)

#agents #cli #llm #productivity

A CLI coding agent feels cheap until the invoice arrives. You point Claude Code or Codex at a repo, ask it to refactor a module, and ten minutes later it has read forty files, run the test suite three times, and burned six figures of tokens on context it did not need. Multiply that by a team running agents all day and token spend stops being a rounding error. Most coding-agent token waste is fixable from the command line without changing models or accepting worse output.


TL;DR

Cut agent token costs before context reaches the model:

Scope the working set.
Keep memory files short.
Compact or clear long sessions.
Enable prompt caching for stable prefixes.
Route cheap subtasks to smaller models.
Cap tool output.
Measure cost per run.

Introduction

CLI agents are token-hungry by default. They read whole files when they need ten lines, replay the entire conversation on every turn, dump raw command output back into context, and re-send the same system prompt and repo map many times per day.

A refactor that needs to reason about 2,000 tokens of code should not require 180,000 tokens of context. The gap between those numbers is your savings.

This guide shows where tokens go in a CLI agent run and how to reduce each bucket with practical tactics:

Context hygiene and memory files
Prompt caching
Model routing
Tool-output trimming
Retrieval limits
Per-run cost measurement

The examples use Claude Code and Codex-style workflows, but the same mechanics apply to any token-billed coding agent.

One adjacent cost is debugging. If an agent calls a flaky internal API, it may retry, read error bodies, re-read docs, and loop. Every iteration costs tokens.

💡 If your agents touch APIs, having those APIs designed, mocked, and tested in Apidog before you point an agent at them removes a whole category of expensive trial-and-error. The agent works against a stable contract instead of a live endpoint that surprises it.

Where tokens actually go in a CLI agent run

A single agent turn has two billable parts:

Input payload sent to the model
Output payload returned by the model

You usually pay for both. Output tokens are often several times more expensive than input tokens, but input volume is what grows fastest.

Typical input payloads include:

System prompt and tool definitions: agent instructions plus tool schemas. This can be 5,000–15,000 tokens and is re-sent every turn.
Memory and project files: files such as CLAUDE.md, repo conventions, and persistent instructions. These are often loaded every turn.
Conversation history: previous user messages, model responses, tool calls, and tool results. This grows throughout the session.
Retrieved file content: files the agent reads. A single Read on a 1,200-line file can be roughly 12,000–18,000 tokens.
Tool output: test logs, install logs, stack traces, git diff output, and generated lockfile diffs.

The key thing to remember: conversation history is replayed every turn.

A 30-turn session does not cost 30 times one turn. Later turns carry everything that happened before them. That is why long, meandering sessions get expensive quickly.
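To make the replay effect concrete, here is a back-of-the-envelope sketch. The numbers (a 10,000-token fixed prefix and 3,000 tokens of new history per turn) are assumptions chosen for illustration, not measurements from any particular agent, and the model ignores caching.

```python
def session_input_tokens(turns: int, fixed_prefix: int, added_per_turn: int) -> int:
    """Total input tokens sent across a session when history is replayed.

    Turn k (1-based) sends the fixed prefix plus the history produced by
    the k - 1 turns that came before it.
    """
    total = 0
    for k in range(1, turns + 1):
        history = (k - 1) * added_per_turn
        total += fixed_prefix + history
    return total


# Illustrative numbers only.
one_turn = session_input_tokens(1, fixed_prefix=10_000, added_per_turn=3_000)
thirty_turns = session_input_tokens(30, fixed_prefix=10_000, added_per_turn=3_000)

print(one_turn)                 # 10000
print(thirty_turns)             # 1605000
print(thirty_turns / one_turn)  # 160.5
```

Under these assumptions, the 30-turn session sends roughly 160 times the input of a single turn, not 30 times.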

For more detail on how session accounting can surprise you, see how the Claude Code token window resets.

1. Scope the working set before starting

The cheapest token is the one you never send.

Do not start an agent at the repo root with a vague task:

claude "refactor the billing logic"

That invites the agent to crawl the repo.

Instead, name the files and boundaries:

claude "refactor the retry logic so it uses exponential backoff,
only in src/payments/retry.ts and src/payments/retry.test.ts"

If the agent needs to explore, point it at a directory instead of the entire repository:

claude "inspect only src/payments and identify where retry behavior is implemented"

A practical prompt pattern:

Task:
Refactor retry logic to use exponential backoff.

Scope:
- src/payments/retry.ts
- src/payments/retry.test.ts

Do not inspect unrelated directories unless these files reference code you cannot understand without reading.

This reduces exploratory reads and keeps the context focused.
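If you want a rough sense of what a scope costs before you start, a sketch like the following compares the named files against the wider directory. It assumes about four bytes per token, which is only a heuristic, and the function names are hypothetical helpers rather than part of any agent CLI.

```python
import os

BYTES_PER_TOKEN = 4  # rough heuristic; real tokenizers vary with content


def estimate_tokens(path: str) -> int:
    """Approximate the tokens needed to read a file, or every file under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path) // BYTES_PER_TOKEN
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes // BYTES_PER_TOKEN


def compare_scopes(scoped_paths: list[str], wide_path: str) -> tuple[int, int]:
    """Return (tokens for the named files, tokens for the wider directory)."""
    scoped = sum(estimate_tokens(p) for p in scoped_paths)
    return scoped, estimate_tokens(wide_path)
```

For example, `compare_scopes(["src/payments/retry.ts", "src/payments/retry.test.ts"], "src")` returns both estimates so you can see how much an unscoped crawl could add.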

2. Keep memory files short and stable

A CLAUDE.md or equivalent memory file is loaded into context repeatedly. If it becomes a mini-wiki, you pay for that wiki on every turn.

Check its approximate token size, using the rough heuristic of four characters per token:

wc -c CLAUDE.md | awk '{print "≈", int($1/4), "tokens per turn"}'

A good memory file should include:

Build command
Test command
Lint command
Strict repo conventions
Important architectural constraints
Pointers to deeper docs

It should not include:

Full onboarding docs
Long architecture essays
Rarely used process notes
Historical explanations
Large examples that are only needed occasionally

Example lean CLAUDE.md:

# Project instructions

## Commands

- Install: `npm ci`
- Test: `npm test --silent`
- Lint: `npm run lint`
- Typecheck: `npm run typecheck`

## Conventions

- Use TypeScript strict mode.
- Do not introduce new runtime dependencies without asking.
- Keep API handlers thin; business logic belongs in `src/services`.
- Add or update tests for behavior changes.

## Docs

- API design notes: `docs/api.md`
- Payment flow details: `docs/payments.md`

Move detailed docs out of always-loaded memory and let the agent read them only when needed.

3. Compact or clear long sessions

When a session switches tasks, do not keep typing into the same context. Every new turn carries the old transcript.

In Claude Code:

/compact

Use this when the current task is mostly complete but you want to preserve a short summary.

For a clean break:

/clear

Use this when you are starting an unrelated task.

A simple workflow:

One logical task = one session.

After task completion:
- Use /compact if follow-up work depends on the current state.
- Use /clear if the next task is unrelated.

This can replace tens of thousands of raw transcript tokens with a short digest.
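A small simulation shows why. It is a sketch under stated assumptions: every turn adds the same number of tokens to the transcript, the digest size is a guess, and the cost of the model call that produces the digest is ignored.

```python
from typing import Optional


def replayed_tokens(turn_sizes: list[int],
                    compact_after: Optional[int] = None,
                    digest_size: int = 500) -> int:
    """Sum the history tokens replayed across a session.

    turn_sizes[i] is the number of tokens turn i adds to the transcript.
    If compact_after is set, the transcript is replaced by a digest of
    digest_size tokens once that many turns have completed.
    """
    history = 0
    total = 0
    for i, size in enumerate(turn_sizes):
        if compact_after is not None and i == compact_after:
            history = digest_size  # transcript collapsed into a summary
        total += history           # this turn replays the current history
        history += size            # then adds its own content to it
    return total


turns = [3_000] * 20  # hypothetical: 20 turns of 3,000 tokens each

print(replayed_tokens(turns))                    # 570000
print(replayed_tokens(turns, compact_after=10))  # 275000
```

In this toy model, compacting once at the halfway point roughly halves the replayed history.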

The same scoping habit appears in Claude Code workflows.

4. Ignore generated and irrelevant files

Keep generated artifacts, dependencies, build output, and large lockfile diffs away from the agent.

At minimum, configure your repo ignore rules so agents avoid:

node_modules/
dist/
build/
coverage/
.next/
.cache/
tmp/
*.log

If your agent supports its own ignore file, add one too. The exact filename depends on the tool, but the goal is the same: prevent the agent from reading or diffing files that do not matter.

This is especially important for:

Generated SDKs
Minified bundles
Large snapshots
Lockfiles
Vendored dependencies
Test fixtures with huge payloads
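If you build the file list yourself in a wrapper script, a filter along these lines keeps those paths out. This is a simplified sketch, not full gitignore semantics: it skips any path that passes through one of the listed directories, and the extra glob patterns (`*.min.js`, `package-lock.json`) are assumptions standing in for minified bundles and lockfiles.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Directory names from the ignore list above.
IGNORE_DIRS = {"node_modules", "dist", "build", "coverage", ".next", ".cache", "tmp"}
# File-name patterns; the last two are illustrative additions.
IGNORE_GLOBS = ["*.log", "*.min.js", "package-lock.json"]


def is_ignored(path: str) -> bool:
    """Return True if a relative POSIX-style path should be hidden from the agent."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    # Any ignored directory anywhere above the file excludes it.
    if any(part in IGNORE_DIRS for part in parts[:-1]):
        return True
    return any(fnmatch(parts[-1], pattern) for pattern in IGNORE_GLOBS)


def visible_files(paths: list[str]) -> list[str]:
    """Filter a candidate file list down to paths worth sending."""
    return [p for p in paths if not is_ignored(p)]
```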

5. Use prompt caching for stable prefixes

Prompt caching lets the provider store a stable prefix of your request so repeated requests can reuse it at a discount.

The stable prefix usually includes:

Tool definitions
System prompt
Repo conventions
Long-lived instructions

The volatile part should come after the cache boundary:

User task
Current file snippets
Timestamps
Fresh command output

The structural order is:

tools → system → messages

Changing content before the cache boundary can invalidate the cached prefix.
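One way to catch accidental drift is to fingerprint the prefix locally and compare it between runs. The sketch below is only a local sanity check, not the provider's actual cache key, and the tool definition and prompts are hypothetical.

```python
import hashlib
import json


def prefix_fingerprint(tools: list, system: str) -> str:
    """Hash the parts of a request that sit before the cache boundary."""
    # Key order is preserved, so reordering fields also changes the hash.
    canonical = json.dumps({"tools": tools, "system": system})
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


tools = [{"name": "read_file", "description": "Read a file"}]  # hypothetical tool

stable = prefix_fingerprint(tools, "You are a coding agent.")
same = prefix_fingerprint(tools, "You are a coding agent.")
drifted = prefix_fingerprint(tools, "You are a coding agent. Now: 2026-05-20T10:00")

print(stable == same)     # True
print(stable == drifted)  # False
```

Logging this fingerprint next to each run makes it easy to spot the change that started producing cache misses.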

If you call the model from your own wrapper, cache the stable part explicitly:

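import anthropic

# Placeholder values for illustration; substitute your own prompt text.
# A real prefix must meet the provider's minimum cacheable length,
# so strings this short would not actually be cached.
SYSTEM_PROMPT = "You are a coding agent for this repository.\n"
REPO_CONVENTIONS = "Use TypeScript strict mode. Keep API handlers thin.\n"
user_task = "Refactor src/payments/retry.ts to use exponential backoff."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT + REPO_CONVENTIONS,
            # Everything up to and including this block is the cached prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": user_task,  # volatile content stays after the boundary
        }
    ],
)

u = response.usage

# The cache counters can be None on some responses, so default them to 0.
print("cache write:", u.cache_creation_input_tokens or 0)
print("cache read :", u.cache_read_input_tokens or 0)
print("fresh input:", u.input_tokens)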

Operational rules:

Keep cached prefixes byte-stable.
Do not insert timestamps into cached content.
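Batch related runs close together so later runs hit a warm cache; cached prefixes expire after a short idle period (about five minutes by default on Anthropic's API).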
Inspect usage fields to verify cache reads are happening.

For repeated agent runs with the same system prompt and repo conventions, caching can substantially reduce the cost of the repeated prefix.
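As a rough worked example, the sketch below prices a repeated prefix with and without caching. It reuses the illustrative per-token rates from the cost-measurement section of this guide and assumes every run after the first hits a warm cache, which will not always hold in practice.

```python
# Illustrative USD rates per token; check your provider's current pricing.
INPUT_RATE = 3.00 / 1_000_000
CACHE_READ_RATE = 0.30 / 1_000_000
CACHE_WRITE_RATE = 3.75 / 1_000_000


def prefix_cost(prefix_tokens: int, runs: int, cached: bool) -> float:
    """Cost of sending the same prefix `runs` times."""
    if runs <= 0:
        return 0.0
    if not cached:
        return prefix_tokens * runs * INPUT_RATE
    # First run writes the cache; the remaining runs read from it.
    return prefix_tokens * (CACHE_WRITE_RATE + (runs - 1) * CACHE_READ_RATE)


uncached = prefix_cost(10_000, runs=20, cached=False)
cached = prefix_cost(10_000, runs=20, cached=True)

print(round(uncached, 4))  # 0.6
print(round(cached, 4))    # 0.0945
```

With these numbers, a 10,000-token prefix sent 20 times drops from about $0.60 to about $0.09. A single run costs more with caching than without, because the cache write is priced above normal input.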

OpenAI’s API applies similar cached-input discounts automatically on supported models. The knobs differ, but the principle is the same.

For another angle on reducing model cost, see running GPT-5.5 free through Codex.

6. Route cheap work to cheaper models

Not every task needs the strongest model.

Good candidates for a smaller model:

Commit messages
Changelog entries
Diff summaries
Boilerplate tests
Simple renames
Lint explanations
Search result summarization

Reserve the stronger model for:

Architecture decisions
Complex refactors
Multi-file reasoning
Debugging subtle failures
Security-sensitive changes

Example CLI routing:

# Cheap model for low-risk text generation
claude --model haiku "write a conventional-commit message for the staged diff"

# Stronger model for architecture or complex reasoning
claude --model sonnet "redesign the caching layer for the payments service"

A better team default is:

Default model: cheaper model
Escalation model: stronger model when explicitly needed

Many teams do the opposite and run the flagship model for everything “to be safe.” That is expensive when the task is just summarizing a diff.
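A cheap-by-default policy can be as small as a lookup. The following is a minimal sketch: the task labels are hypothetical names you would define yourself, and the model aliases are the ones used in the CLI examples above.

```python
CHEAP_MODEL = "haiku"
STRONG_MODEL = "sonnet"

# Hypothetical labels for work that justifies the stronger model.
STRONG_TASKS = {
    "architecture",
    "complex-refactor",
    "multi-file-reasoning",
    "subtle-debugging",
    "security",
}


def pick_model(task_kind: str, escalate: bool = False) -> str:
    """Default to the cheap model; escalate only for listed or flagged tasks."""
    if escalate or task_kind in STRONG_TASKS:
        return STRONG_MODEL
    return CHEAP_MODEL


def build_command(task_kind: str, prompt: str, escalate: bool = False) -> list[str]:
    """Build the argv for a `claude --model ...` invocation."""
    return ["claude", "--model", pick_model(task_kind, escalate), prompt]


print(build_command("commit-message", "write a conventional-commit message"))
# ['claude', '--model', 'haiku', 'write a conventional-commit message']
```

Unknown task kinds fall through to the cheap model on purpose, so using the expensive model is always an explicit decision.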

If your framework supports sub-agents, route narrow subtasks to cheap models with small context windows. The parent agent should receive a short distilled result instead of doing all grunt work with the expensive model.

The delegation style in the goal command across Codex and Claude Code is useful for this pattern.

If you are on a capped plan, routing also stretches your allowance. The Claude Code weekly limit increase helps, but routing is still what keeps premium-model usage available for hard work.

7. Make tool output quiet

Tool output is easy to ignore because it feels like “just logs.” But every line returned to the agent becomes context and may be replayed in later turns.

Prefer quiet commands.

Instead of:

npm test

Use (the --reporter=dot flag assumes a test runner that accepts it, such as Mocha or Vitest):

npm test --silent -- --reporter=dot

Instead of:

npm install

Use:

npm install --silent --no-audit --no-fund

For Python tests:

pytest -q

For noisy test failures, return only the tail:

pytest -q 2>&1 | tail -n 30

For diffs, avoid dumping huge patches unless needed:

git diff --stat

Then inspect a specific file:

git diff -- src/payments/retry.ts

For logs, grep the signal:

npm test 2>&1 | grep -E "(FAIL|✗|Error)" | head -n 20

This gives the agent enough signal to act without stuffing the transcript with thousands of irrelevant tokens.
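If your agent runs commands through a wrapper you control, the same idea can be enforced in code. This sketch keeps the first and last lines of a command's output and replaces the middle with a marker; the function names and default limits are assumptions, not part of any agent CLI.

```python
import subprocess


def trim_output(text: str, head: int = 10, tail: int = 30) -> str:
    """Keep the first `head` and last `tail` lines, noting what was dropped."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text
    omitted = len(lines) - head - tail
    marker = f"... [{omitted} lines omitted] ..."
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])


def run_trimmed(cmd: list[str]) -> str:
    """Run a command and return its combined output, trimmed for the agent."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return trim_output(proc.stdout + proc.stderr)
```

For example, `run_trimmed(["pytest", "-q"])` returns at most 41 lines with the defaults, however noisy the test run was.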

8. Prefer targeted reads over whole-file reads

A common waste pattern:

Agent reads a 1,500-line file to modify one function.

Better prompt:

Find the function that handles payment retries.
Read only that function and nearby helper functions.
Do not read the entire file unless necessary.

If you know the symbol, give it directly:

claude "Update the calculateBackoffDelay function in src/payments/retry.ts.
Read only that function, its direct helpers, and its tests."

Useful shell commands for manual scoping:

grep -R "function calculateBackoffDelay" -n src
grep -R "calculateBackoffDelay" -n src test

Then pass the relevant files or line ranges to the agent.

The difference can be large: a whole large file may cost tens of thousands of tokens, while a focused function window may be under a thousand.
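When you prepare context by hand, a helper like this extracts a numbered window around the first match instead of the whole file. It is a simple line-based sketch that knows nothing about syntax, so the `before` and `after` sizes are guesses you would tune per codebase.

```python
import re


def read_window(path: str, pattern: str, before: int = 3, after: int = 40) -> str:
    """Return numbered lines around the first line matching `pattern`.

    Returns an empty string when nothing matches.
    """
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()
    regex = re.compile(pattern)
    for index, line in enumerate(lines):
        if regex.search(line):
            start = max(0, index - before)
            end = min(len(lines), index + after + 1)
            return "\n".join(f"{n + 1}: {lines[n]}" for n in range(start, end))
    return ""
```

Calling `read_window("src/payments/retry.ts", r"function calculateBackoffDelay")` yields at most 44 lines with the defaults, with line numbers the agent can cite when proposing an edit.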

9. Constrain retrieval and RAG scope

If your agent searches docs or code with retrieval, cap both chunk count and chunk size.

Bad default:

Return 50 chunks of 800 tokens each.

Better default:

Return the top 8–10 chunks.
Limit each chunk to around 200–300 tokens.
Prefer exact symbol matches over broad semantic matches.

Practical retrieval rules:

Retrieve fewer chunks first.
Ask for more only if needed.
Prefer exact paths and symbols.
Avoid returning entire documents.
Summarize long docs before passing them to the expensive model.

You pay for every retrieved token whether or not the model uses it.
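The caps above can be applied as a post-processing step on whatever your retriever returns. This is a minimal sketch: the `Chunk` shape is hypothetical, and chunk length is limited by characters using the rough four-characters-per-token heuristic rather than a real tokenizer.

```python
from dataclasses import dataclass

CHARS_PER_TOKEN = 4  # rough heuristic, not a tokenizer


@dataclass
class Chunk:
    path: str
    text: str
    score: float  # higher means more relevant


def cap_chunks(chunks: list[Chunk], max_chunks: int = 8,
               max_tokens_per_chunk: int = 300) -> list[Chunk]:
    """Keep the highest-scoring chunks and truncate each to a token budget."""
    top = sorted(chunks, key=lambda c: c.score, reverse=True)[:max_chunks]
    limit = max_tokens_per_chunk * CHARS_PER_TOKEN
    return [Chunk(c.path, c.text[:limit], c.score) for c in top]


def estimated_tokens(chunks: list[Chunk]) -> int:
    """Approximate total tokens across a list of chunks."""
    return sum(len(c.text) for c in chunks) // CHARS_PER_TOKEN
```

With the defaults, the worst case is 8 × 300 = 2,400 tokens of retrieved context, compared with 50 × 800 = 40,000 tokens for the bad default.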

10. Measure cost per run

You cannot optimize what you do not measure.

If you call the API directly, capture usage from every response:

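# Assumes `response` is the object returned by client.messages.create(...).
u = response.usage

# Illustrative USD rates per token; replace with your provider's live prices.
INPUT_RATE = 3.00 / 1_000_000
OUTPUT_RATE = 15.00 / 1_000_000
CACHE_READ = 0.30 / 1_000_000
CACHE_WRITE = 3.75 / 1_000_000

# Cache counters can be None when caching is not in use.
cache_read = u.cache_read_input_tokens or 0
cache_write = u.cache_creation_input_tokens or 0

cost = (
    u.input_tokens * INPUT_RATE
    + u.output_tokens * OUTPUT_RATE
    + cache_read * CACHE_READ
    + cache_write * CACHE_WRITE
)

print(
    f"run cost ≈ ${cost:.4f} "
    f"(in={u.input_tokens} "
    f"out={u.output_tokens} "
    f"cache_read={cache_read} "
    f"cache_write={cache_write})"
)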

Use live provider rates for your model. The numbers above are illustrative.

If you use an agent CLI, use one of these approaches:

# Inside an interactive Claude Code session, type the slash command:
/cost

Or isolate spend:

Create one API key per agent, project, or team.
Track spend per key in the provider dashboard.

Or wrap invocations:

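#!/usr/bin/env bash
# Usage: <this script> <task-label> <claude arguments...>
# Appends start time, end time, task label, and exit status to agent-runs.csv.
# This records timing only; match rows against your provider dashboard for spend.

TASK_LABEL="$1"
shift

START="$(date -Iseconds)"

# Include -p in the arguments for a non-interactive run whose output can be piped.
claude "$@" | tee /tmp/agent-output.txt
STATUS="${PIPESTATUS[0]}"  # exit status of claude, not of tee

END="$(date -Iseconds)"

echo "$START,$END,$TASK_LABEL,$STATUS" >> agent-runs.csv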

Track representative tasks:

Cost per daily refactor
Cost per PR review
Cost per test-generation run
Cost per debugging session

When you enable caching, trim memory, or route subtasks to cheaper models, those numbers should move. If they do not, the tactic is not affecting your actual bottleneck.

Tactic comparison

Tactic	Typical token savings	Effort
Scope the working set	30–60% on input per run	Low
Short, stable memory file	5–15% per turn	Low
`/compact` or `/clear` between tasks	40–80% on long sessions	Low
Prompt caching on stable prefix	~90% on the cached prefix	Medium
Model routing	50–80% on routed subtasks	Medium
Quiet or filtered tool output	20–50% on tool-heavy runs	Low
Targeted reads	70–95% on large-file edits	Low
Constrained retrieval scope	30–60% on RAG-heavy agents	Medium
Per-run cost measurement	0% directly; enables optimization	Low

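Savings ranges are illustrative. They also overlap rather than add up: each tactic shrinks what is left after the others, so two 50% cuts applied to the same tokens leave 25% of the original, not zero. Your actual gain depends on where your baseline waste is.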

Practical checklist

Before starting an agent run:

[ ] Did I name the exact files or directory scope?
[ ] Is the task one logical unit?
[ ] Is my memory file short?
[ ] Are generated files ignored?
[ ] Am I using the cheapest model that can do this task?

During the run:

[ ] Ask for targeted reads, not whole-file reads.
[ ] Use quiet test and install commands.
[ ] Return only relevant log tails.
[ ] Avoid dumping huge diffs unless needed.

After the run:

[ ] Check token usage or session cost.
[ ] Compact or clear before switching tasks.
[ ] Record cost for representative workflows.

Conclusion

Agent token costs are mostly caused by avoidable context: files the model did not need, logs nobody reads, long transcripts replayed every turn, and expensive models used for cheap tasks.

Start with the low-effort fixes:

Scope the task.
Keep memory files lean.
Use quiet commands.
Ignore generated files.
Clear or compact between tasks.

Then add caching, model routing, and per-run measurement. Those changes reduce spend without reducing the quality of the actual coding work.
