Hassann

Posted on • Originally published at apidog.com

How to Reduce Agent Token Costs From the CLI (2026 Guide)

A CLI coding agent feels cheap until the invoice arrives. You point Claude Code or Codex at a repo, ask it to refactor a module, and ten minutes later it has read forty files, run the test suite three times, and burned six figures of tokens on context it did not need. Multiply that by a team running agents all day and token spend stops being a rounding error. Most coding-agent token waste is fixable from the command line without changing models or accepting worse output.

TL;DR

Cut agent token costs before context reaches the model:

  • Scope the working set.
  • Keep memory files short.
  • Compact or clear long sessions.
  • Enable prompt caching for stable prefixes.
  • Route cheap subtasks to smaller models.
  • Cap tool output.
  • Measure cost per run.

Introduction

CLI agents are token-hungry by default. They read whole files when they need ten lines, replay the entire conversation on every turn, dump raw command output back into context, and re-send the same system prompt and repo map many times per day.

A refactor that needs to reason about 2,000 tokens of code should not require 180,000 tokens of context. The gap between those numbers is your savings.

This guide shows where tokens go in a CLI agent run and how to reduce each bucket with practical tactics:

  • Context hygiene and memory files
  • Prompt caching
  • Model routing
  • Tool-output trimming
  • Retrieval limits
  • Per-run cost measurement

The examples use Claude Code and Codex-style workflows, but the same mechanics apply to any token-billed coding agent.

One adjacent cost is debugging. If an agent calls a flaky internal API, it may retry, read error bodies, re-read docs, and loop. Every iteration costs tokens.

💡 If your agents touch APIs, having those APIs designed, mocked, and tested in Apidog before you point an agent at them removes a whole category of expensive trial-and-error. The agent works against a stable contract instead of a live endpoint that surprises it.

Where tokens actually go in a CLI agent run

A single agent turn has two billable parts:

  1. Input payload sent to the model
  2. Output payload returned by the model

You usually pay for both. Output tokens are often several times more expensive than input tokens, but input volume is what grows fastest.

Typical input payloads include:

  • System prompt and tool definitions

    Agent instructions plus tool schemas. This can be 5,000–15,000 tokens and is re-sent every turn.

  • Memory and project files

    Files such as CLAUDE.md, repo conventions, and persistent instructions. These are often loaded every turn.

  • Conversation history

    Previous user messages, model responses, tool calls, and tool results. This grows throughout the session.

  • Retrieved file content

    Files the agent reads. A single Read on a 1,200-line file can be roughly 12,000–18,000 tokens.

  • Tool output

    Test logs, install logs, stack traces, git diff output, and generated lockfile diffs.

The key thing to remember: conversation history is replayed every turn.

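A 30-turn session costs far more than 30 times its first turn, because each later turn carries everything that happened before it. If every turn adds a similar amount of history, total input grows roughly with the square of the turn count. That is why long, meandering sessions get expensive quickly.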
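To put rough numbers on the replay effect, here is a minimal sketch of the arithmetic before any caching discount. The 10,000-token fixed prefix and the 2,000 tokens of new history per turn are assumed figures for illustration, not measurements.

```python
def cumulative_input_tokens(prefix: int, added_per_turn: int, turns: int) -> int:
    """Total input tokens billed over a session when history is replayed.

    Turn k (1-based) sends the fixed prefix plus everything added by the
    k - 1 turns before it.
    """
    return sum(prefix + added_per_turn * (k - 1) for k in range(1, turns + 1))


# Assumed figures: 10,000-token prefix, 2,000 tokens of new history per turn.
single_turn = cumulative_input_tokens(10_000, 2_000, 1)
session = cumulative_input_tokens(10_000, 2_000, 30)

print(single_turn)                   # 10000
print(session)                       # 1170000
print(session / (30 * single_turn))  # 3.9
```

Under these assumptions the 30-turn session sends about 3.9 times the input of 30 independent single turns.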

For more detail on how session accounting can surprise you, see how the Claude Code token window resets.

1. Scope the working set before starting

The cheapest token is the one you never send.

Do not start an agent at the repo root with a vague task:

claude "refactor the billing logic"

That invites the agent to crawl the repo.

Instead, name the files and boundaries:

claude "refactor the retry logic so it uses exponential backoff,
only in src/payments/retry.ts and src/payments/retry.test.ts"

If the agent needs to explore, point it at a directory instead of the entire repository:

claude "inspect only src/payments and identify where retry behavior is implemented"

A practical prompt pattern:

Task:
Refactor retry logic to use exponential backoff.

Scope:
- src/payments/retry.ts
- src/payments/retry.test.ts

Do not inspect unrelated directories unless these files reference code you cannot understand without reading.

This reduces exploratory reads and keeps the context focused.
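If you launch agents from scripts, the prompt pattern can be generated rather than retyped. A minimal sketch, assuming a hypothetical helper named `build_scoped_prompt`; it only assembles text and does not call any agent CLI.

```python
def build_scoped_prompt(task: str, files: list[str]) -> str:
    """Render a task plus an explicit file scope as a single prompt string."""
    if not files:
        raise ValueError("name at least one file or directory to scope the run")
    scope = "\n".join(f"- {path}" for path in files)
    return (
        f"Task:\n{task}\n\n"
        f"Scope:\n{scope}\n\n"
        "Do not inspect unrelated directories unless these files reference "
        "code you cannot understand without reading."
    )


prompt = build_scoped_prompt(
    "Refactor retry logic to use exponential backoff.",
    ["src/payments/retry.ts", "src/payments/retry.test.ts"],
)
print(prompt)
```

Refusing an empty file list is a deliberate choice here: it makes an unscoped run an explicit decision rather than an accident.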

2. Keep memory files short and stable

A CLAUDE.md or equivalent memory file is loaded into context repeatedly. If it becomes a mini-wiki, you pay for that wiki on every turn.

Check its approximate token size:

wc -c CLAUDE.md | awk '{print "≈", int($1/4), "tokens per turn"}'

A good memory file should include:

  • Build command
  • Test command
  • Lint command
  • Strict repo conventions
  • Important architectural constraints
  • Pointers to deeper docs

It should not include:

  • Full onboarding docs
  • Long architecture essays
  • Rarely used process notes
  • Historical explanations
  • Large examples that are only needed occasionally

Example lean CLAUDE.md:

# Project instructions

## Commands

- Install: `npm ci`
- Test: `npm test --silent`
- Lint: `npm run lint`
- Typecheck: `npm run typecheck`

## Conventions

- Use TypeScript strict mode.
- Do not introduce new runtime dependencies without asking.
- Keep API handlers thin; business logic belongs in `src/services`.
- Add or update tests for behavior changes.

## Docs

- API design notes: `docs/api.md`
- Payment flow details: `docs/payments.md`

Move detailed docs out of always-loaded memory and let the agent read them only when needed.
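To keep the memory file from growing back, the size check can run as a script, for example in CI or a pre-commit hook. A minimal sketch using the same rough 4-characters-per-token heuristic; the 500-token budget and the helper names are assumptions, so pick a limit that suits your project.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN


def check_memory_file(path: str, max_tokens: int = 500) -> tuple[bool, int]:
    """Return (within_budget, estimated_tokens) for a memory file."""
    tokens = estimate_tokens(Path(path).read_text(encoding="utf-8"))
    return tokens <= max_tokens, tokens


if __name__ == "__main__":
    memory_file = Path("CLAUDE.md")
    if memory_file.exists():
        ok, tokens = check_memory_file(str(memory_file))
        print(f"CLAUDE.md is about {tokens} tokens: {'ok' if ok else 'over budget'}")
```

The estimate is deliberately crude. It is good enough to catch a memory file that has doubled in size, which is the failure this check is meant to flag.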

3. Compact or clear long sessions

When a session switches tasks, do not keep typing into the same context. Every new turn carries the old transcript.

In Claude Code:

/compact

Use this when the current task is mostly complete but you want to preserve a short summary.

For a clean break:

/clear

Use this when you are starting an unrelated task.

A simple workflow:

One logical task = one session.

After task completion:
- Use /compact if follow-up work depends on the current state.
- Use /clear if the next task is unrelated.

This can replace tens of thousands of raw transcript tokens with a short digest.

The same scoping habit appears in Claude Code workflows.

4. Ignore generated and irrelevant files

Keep generated artifacts, dependencies, build output, and large lockfile diffs away from the agent.

At minimum, configure your repo ignore rules so agents avoid:

node_modules/
dist/
build/
coverage/
.next/
.cache/
tmp/
*.log

If your agent supports its own ignore file, add one too. The exact filename depends on the tool, but the goal is the same: prevent the agent from reading or diffing files that do not matter.

This is especially important for:

  • Generated SDKs
  • Minified bundles
  • Large snapshots
  • Lockfiles
  • Vendored dependencies
  • Test fixtures with huge payloads
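One way to find candidates for ignore rules is to list the largest files an agent could still reach. A minimal sketch, assuming the directory names from the ignore list above plus `.git` (added here as an assumption); `largest_files` is a hypothetical helper, not part of any agent CLI.

```python
import os

# Directories an agent should not need to read; adjust for your repo.
IGNORED_DIRS = {"node_modules", "dist", "build", "coverage", ".next", ".cache", "tmp", ".git"}


def largest_files(root: str, top_n: int = 10) -> list[tuple[int, str]]:
    """Return (size_in_bytes, path) for the largest files outside ignored dirs."""
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d not in IGNORED_DIRS]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # broken symlink or file removed mid-walk
    return sorted(sizes, reverse=True)[:top_n]


if __name__ == "__main__":
    for size, path in largest_files("."):
        print(f"{size:>12,d}  {path}")
```

Anything near the top of that list that is generated, vendored, or a fixture is a good candidate for the ignore file.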

5. Use prompt caching for stable prefixes

Prompt caching lets the provider store a stable prefix of your request so repeated requests can reuse it at a discount.

The stable prefix usually includes:

  • Tool definitions
  • System prompt
  • Repo conventions
  • Long-lived instructions

The volatile part should come after the cache boundary:

  • User task
  • Current file snippets
  • Timestamps
  • Fresh command output

The structural order is:

tools → system → messages

Changing content before the cache boundary can invalidate the cached prefix.

If you call the model from your own wrapper, cache the stable part explicitly:

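import anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# SYSTEM_PROMPT, REPO_CONVENTIONS, and user_task are strings you define elsewhere.
# Keep the first two byte-stable between runs so the cached prefix can be reused.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT + REPO_CONVENTIONS,
            # Marks the end of the cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": user_task,
        }
    ],
)

u = response.usage

# A non-zero cache read on the second and later runs confirms the prefix is being reused.
print("cache write:", u.cache_creation_input_tokens)
print("cache read :", u.cache_read_input_tokens)
print("fresh input:", u.input_tokens)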

Operational rules:

  • Keep cached prefixes byte-stable.
  • Do not insert timestamps into cached content.
  • Batch related runs close together to hit a warm cache.
  • Inspect usage fields to verify cache reads are happening.
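A cheap way to enforce the byte-stability rule in a wrapper is to fingerprint the prefix locally and log it with each run. A minimal sketch: the hash below is only a local consistency check, not the provider's cache key, and the tool definition is hypothetical.

```python
import hashlib
import json


def prefix_fingerprint(tools: list[dict], system_prompt: str) -> str:
    """Hash the parts of a request that are meant to sit before the cache boundary."""
    # sort_keys keeps the serialization stable when dict key order differs.
    payload = json.dumps({"tools": tools, "system": system_prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


tools = [{"name": "read_file", "description": "Read a file from the repo"}]  # hypothetical tool

stable = prefix_fingerprint(tools, "You are a coding agent.")
same = prefix_fingerprint(tools, "You are a coding agent.")
with_timestamp = prefix_fingerprint(tools, "You are a coding agent. Now: 2026-01-01T09:00:00Z")

print(stable == same)            # True
print(stable == with_timestamp)  # False
```

If the fingerprint changes between runs that should share a prefix, something volatile has leaked into the cached part.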

For repeated agent runs with the same system prompt and repo conventions, caching can substantially reduce the cost of the repeated prefix.
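To get a feel for the size of that reduction, here is a rough worked example. The per-token rates are illustrative, and the 12,000-token prefix, the 20 runs, and the assumption that every run after the first hits a warm cache are all made up for the sketch.

```python
INPUT_RATE = 3.00 / 1_000_000        # illustrative $ per fresh input token
CACHE_WRITE_RATE = 3.75 / 1_000_000  # illustrative $ per token written to the cache
CACHE_READ_RATE = 0.30 / 1_000_000   # illustrative $ per token read from the cache


def prefix_cost(prefix_tokens: int, runs: int, cached: bool) -> float:
    """Cost of sending the same prefix `runs` times, with or without caching."""
    if not cached or runs == 0:
        return prefix_tokens * runs * INPUT_RATE
    # The first run writes the cache; later runs are assumed to hit it while warm.
    return prefix_tokens * (CACHE_WRITE_RATE + (runs - 1) * CACHE_READ_RATE)


uncached = prefix_cost(12_000, 20, cached=False)
cached = prefix_cost(12_000, 20, cached=True)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}, saved {1 - cached / uncached:.0%}")
```

With these numbers the prefix costs about $0.72 uncached and about $0.11 cached. The saving shrinks if runs are spaced far enough apart that the cache expires and has to be rewritten.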

OpenAI’s API applies similar cached-input discounts automatically on supported models. The knobs differ, but the principle is the same.

For another angle on reducing model cost, see running GPT-5.5 free through Codex.

6. Route cheap work to cheaper models

Not every task needs the strongest model.

Good candidates for a smaller model:

  • Commit messages
  • Changelog entries
  • Diff summaries
  • Boilerplate tests
  • Simple renames
  • Lint explanations
  • Search result summarization

Reserve the stronger model for:

  • Architecture decisions
  • Complex refactors
  • Multi-file reasoning
  • Debugging subtle failures
  • Security-sensitive changes

Example CLI routing:

# Cheap model for low-risk text generation
claude --model haiku "write a conventional-commit message for the staged diff"

# Stronger model for architecture or complex reasoning
claude --model sonnet "redesign the caching layer for the payments service"

A better team default is:

Default model: cheaper model
Escalation model: stronger model when explicitly needed

Many teams do the opposite and run the flagship model for everything “to be safe.” That is expensive when the task is just summarizing a diff.
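The cheap-by-default rule is easy to encode in a wrapper. A minimal sketch, assuming hypothetical task labels and the `haiku` and `sonnet` model aliases used in the CLI examples; the set of tasks that justify escalation is yours to define.

```python
# Task kinds that justify the stronger model; everything else stays cheap.
ESCALATE = {
    "architecture",
    "complex_refactor",
    "multi_file_reasoning",
    "subtle_debugging",
    "security_change",
}


def pick_model(task_kind: str, default: str = "haiku", escalation: str = "sonnet") -> str:
    """Return the cheap default unless the task is explicitly marked for escalation."""
    return escalation if task_kind in ESCALATE else default


print(pick_model("commit_message"))  # haiku
print(pick_model("architecture"))    # sonnet
```

Keeping the escalation list explicit means premium usage is a visible decision that can be reviewed, not a habit.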

If your framework supports sub-agents, route narrow subtasks to cheap models with small context windows. The parent agent should receive a short distilled result instead of doing all grunt work with the expensive model.

The delegation style in the goal command across Codex and Claude Code is useful for this pattern.

If you are on a capped plan, routing also stretches your allowance. The Claude Code weekly limit increase helps, but routing is still what keeps premium-model usage available for hard work.

7. Make tool output quiet

Tool output is easy to ignore because it feels like “just logs.” But every line returned to the agent becomes context and may be replayed in later turns.

Prefer quiet commands.

Instead of:

npm test

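Use a quieter reporter (this form assumes a test runner that accepts --reporter=dot, such as Mocha or Vitest):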

npm test --silent -- --reporter=dot

Instead of:

npm install

Use:

npm install --silent --no-audit --no-fund

For Python tests:

pytest -q

For noisy test failures, return only the tail:

pytest -q 2>&1 | tail -n 30

For diffs, avoid dumping huge patches unless needed:

git diff --stat

Then inspect a specific file:

git diff -- src/payments/retry.ts

For logs, grep the signal:

npm test 2>&1 | grep -E "(FAIL|✗|Error)" | head -n 20

This gives the agent enough signal to act without stuffing the transcript with thousands of irrelevant tokens.
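If you control the tool layer, the same filtering can be applied to every command before its output reaches the model. A minimal sketch, assuming a hypothetical `trim_output` helper that combines the grep and tail ideas; the default pattern and limits are placeholders.

```python
import re


def trim_output(text: str, pattern: str = r"FAIL|Error", max_matches: int = 20, tail: int = 30) -> str:
    """Keep lines matching `pattern` (capped) plus the last `tail` lines."""
    if tail < 1:
        raise ValueError("tail must be at least 1")
    lines = text.splitlines()
    if len(lines) <= tail:
        return "\n".join(lines)  # short output passes through untouched
    head, last = lines[:-tail], lines[-tail:]
    regex = re.compile(pattern)
    matches = [line for line in head if regex.search(line)][:max_matches]
    omitted = len(head) - len(matches)
    # The marker tells the agent that output was cut, so it can ask for more.
    return "\n".join(matches + [f"[{omitted} lines omitted]"] + last)


noisy = "\n".join(["ok"] * 200 + ["FAIL test_retry"] + ["ok"] * 200)
print(trim_output(noisy, tail=5))
```

The omission marker matters: without it the agent cannot tell a quiet run from a truncated one.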

8. Prefer targeted reads over whole-file reads

A common waste pattern:

Agent reads a 1,500-line file to modify one function.

Better prompt:

Find the function that handles payment retries.
Read only that function and nearby helper functions.
Do not read the entire file unless necessary.

If you know the symbol, give it directly:

claude "Update the calculateBackoffDelay function in src/payments/retry.ts.
Read only that function, its direct helpers, and its tests."

Useful shell commands for manual scoping:

grep -R "function calculateBackoffDelay" -n src
grep -R "calculateBackoffDelay" -n src test

Then pass the relevant files or line ranges to the agent.

The difference can be large: a whole large file may cost tens of thousands of tokens, while a focused function window may be under a thousand.
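When you prepare context yourself, a focused window can be cut out of a file before anything is sent. A minimal sketch, assuming a hypothetical `read_window` helper; it uses a fixed number of lines around the first match and does not parse function boundaries.

```python
def read_window(lines: list[str], symbol: str, before: int = 2, after: int = 20) -> list[str]:
    """Return numbered lines around the first line that mentions `symbol`."""
    for index, line in enumerate(lines):
        if symbol in line:
            start = max(0, index - before)
            end = min(len(lines), index + after + 1)
            # Keep 1-based line numbers so the agent can refer back to the file.
            return [f"{n + 1}: {lines[n]}" for n in range(start, end)]
    return []  # symbol not found


source = ["// helpers", "", "function calculateBackoffDelay(attempt) {", "  return 2 ** attempt * 100;", "}"]
for row in read_window(source, "calculateBackoffDelay", before=1, after=2):
    print(row)
```

For real files, pass `Path(path).read_text().splitlines()` as `lines`. A language-aware tool would be more precise, but even a fixed window avoids sending the whole file.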

9. Constrain retrieval and RAG scope

If your agent searches docs or code with retrieval, cap both chunk count and chunk size.

Bad default:

Return 50 chunks of 800 tokens each.

Better default:

Return the top 8–10 chunks.
Limit each chunk to around 200–300 tokens.
Prefer exact symbol matches over broad semantic matches.

Practical retrieval rules:

  • Retrieve fewer chunks first.
  • Ask for more only if needed.
  • Prefer exact paths and symbols.
  • Avoid returning entire documents.
  • Summarize long docs before passing them to the expensive model.

You pay for every retrieved token whether or not the model uses it.
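The caps can be enforced in code between the retriever and the model. A minimal sketch, assuming a hypothetical `Chunk` record with a relevance score and the rough 4-characters-per-token heuristic; real retrievers expose their own limits, which are usually the better place to set this.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str
    text: str
    score: float  # higher means more relevant


def cap_chunks(chunks: list[Chunk], max_chunks: int = 8, max_tokens: int = 300) -> list[Chunk]:
    """Keep the top-scoring chunks and truncate each to a rough token budget."""
    max_chars = max_tokens * 4  # ~4 characters per token heuristic
    best = sorted(chunks, key=lambda c: c.score, reverse=True)[:max_chunks]
    return [Chunk(c.path, c.text[:max_chars], c.score) for c in best]


retrieved = [Chunk(f"docs/page{i}.md", "x" * 3200, score=i / 50) for i in range(50)]
capped = cap_chunks(retrieved)
print(len(capped), max(len(c.text) for c in capped))  # 8 1200
```

Going from 50 chunks of 800 tokens to 8 chunks of 300 tokens cuts the retrieved context from roughly 40,000 tokens to roughly 2,400.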

10. Measure cost per run

You cannot optimize what you do not measure.

If you call the API directly, capture usage from every response:

u = response.usage

# Illustrative per-token rates in USD; replace with your provider's live pricing.
INPUT_RATE = 3.00 / 1_000_000
OUTPUT_RATE = 15.00 / 1_000_000
CACHE_READ = 0.30 / 1_000_000
CACHE_WRITE = 3.75 / 1_000_000

# The cache fields may be None on some responses, so default them to 0.
cache_read = u.cache_read_input_tokens or 0
cache_write = u.cache_creation_input_tokens or 0

cost = (
    u.input_tokens * INPUT_RATE
    + u.output_tokens * OUTPUT_RATE
    + cache_read * CACHE_READ
    + cache_write * CACHE_WRITE
)

print(
    f"run cost ≈ ${cost:.4f} "
    f"(in={u.input_tokens} "
    f"out={u.output_tokens} "
    f"cache_read={cache_read})"
)

Use live provider rates for your model. The numbers above are illustrative.

If you use an agent CLI, use one of these approaches:

# Inside an interactive Claude Code session, check session cost if your version supports it
/cost

Or isolate spend:

Create one API key per agent, project, or team.
Track spend per key in the provider dashboard.

Or wrap invocations:

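#!/usr/bin/env bash
# Usage: <this script> <task-label> <claude args...>
# pipefail makes the pipeline report claude's exit status instead of tee's.
set -uo pipefail

TASK_LABEL="$1"
shift

# Explicit UTC format; date -Iseconds is not supported by every date implementation.
START="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

claude "$@" | tee /tmp/agent-output.txt
STATUS=$?

END="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# One row per run: start, end, label, exit status. Join with billing data by time range.
echo "$START,$END,$TASK_LABEL,$STATUS" >> agent-runs.csv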

Track representative tasks:

  • Cost per daily refactor
  • Cost per PR review
  • Cost per test-generation run
  • Cost per debugging session

When you enable caching, trim memory, or route subtasks to cheaper models, those numbers should move. If they do not, the tactic is not affecting your actual bottleneck.
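Once each run has a label and a cost, a per-task average makes it easy to see whether a tactic moved the number. A minimal sketch, assuming a hypothetical record shape with `label` and `cost` keys; where the cost value comes from (API usage fields, session cost, or a dashboard export) depends on your setup.

```python
from collections import defaultdict


def cost_per_label(runs: list[dict]) -> dict[str, float]:
    """Average cost per run for each task label."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for run in runs:
        totals[run["label"]] += run["cost"]
        counts[run["label"]] += 1
    return {label: totals[label] / counts[label] for label in totals}


runs = [
    {"label": "pr_review", "cost": 0.5},
    {"label": "pr_review", "cost": 1.5},
    {"label": "refactor", "cost": 2.0},
]
print(cost_per_label(runs))  # {'pr_review': 1.0, 'refactor': 2.0}
```

Compare the averages for the same label before and after a change, not across labels, since a debugging session and a commit message are not comparable workloads.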

Tactic comparison

| Tactic | Typical token savings | Effort |
| --- | --- | --- |
| Scope the working set | 30–60% on input per run | Low |
| Short, stable memory file | 5–15% per turn | Low |
| /compact or /clear between tasks | 40–80% on long sessions | Low |
| Prompt caching on stable prefix | ~90% on the cached prefix | Medium |
| Model routing | 50–80% on routed subtasks | Medium |
| Quiet or filtered tool output | 20–50% on tool-heavy runs | Low |
| Targeted reads | 70–95% on large-file edits | Low |
| Constrained retrieval scope | 30–60% on RAG-heavy agents | Medium |
| Per-run cost measurement | 0% directly; enables optimization | Low |

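Savings ranges are illustrative. They stack multiplicatively rather than additively: each tactic shrinks what is left after the others, so the percentages cannot simply be summed, and each one only helps the bucket it targets. Your actual gain depends on where your baseline waste is.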
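As a rough illustration of multiplicative stacking, the sketch below combines fractional savings by multiplying what remains after each one. It assumes every tactic applies to the same pool of tokens, which overstates the result when tactics target different buckets; the 30% and 20% inputs are made-up examples.

```python
def combined_saving(savings: list[float]) -> float:
    """Combine fractional savings that each apply to what the previous ones left."""
    remaining = 1.0
    for saving in savings:
        if not 0.0 <= saving <= 1.0:
            raise ValueError("each saving must be a fraction between 0 and 1")
        remaining *= 1.0 - saving
    return 1.0 - remaining


# A 30% saving followed by a 20% saving leaves 0.7 * 0.8 = 0.56 of the original.
print(f"{combined_saving([0.30, 0.20]):.0%}")  # 44%
```

The combined figure is 44%, not 50%, which is why adding the percentages overstates the gain.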

Practical checklist

Before starting an agent run:

[ ] Did I name the exact files or directory scope?
[ ] Is the task one logical unit?
[ ] Is my memory file short?
[ ] Are generated files ignored?
[ ] Am I using the cheapest model that can do this task?

During the run:

[ ] Ask for targeted reads, not whole-file reads.
[ ] Use quiet test and install commands.
[ ] Return only relevant log tails.
[ ] Avoid dumping huge diffs unless needed.

After the run:

[ ] Check token usage or session cost.
[ ] Compact or clear before switching tasks.
[ ] Record cost for representative workflows.

Conclusion

Agent token costs are mostly caused by avoidable context: files the model did not need, logs nobody reads, long transcripts replayed every turn, and expensive models used for cheap tasks.

Start with the low-effort fixes:

  • Scope the task.
  • Keep memory files lean.
  • Use quiet commands.
  • Ignore generated files.
  • Clear or compact between tasks.

Then add caching, model routing, and per-run measurement. Those changes reduce spend without reducing the quality of the actual coding work.
