I expected prompt caching to be one of the biggest cost optimizations for coding agents.
After all, every request carries system prompts, tool definitions, and instructions that rarely change. Caching those felt like free money.
I also expected a second lever: unused retrieved context. Agents constantly read files, fetch logs, inspect directories, and explore codebases. Surely a meaningful fraction of that context never actually influences the final output.
So I built a small tool, context-audit, and ran it across 27 real coding-agent sessions representing $753.24 of input spend.
Both assumptions didn't survive the benchmark.
What I Measured
Across 27 sessions:
| Metric | Value |
|---|---|
| Context Reuse | 94.5% |
| Novel Context | 5.5% |
| Prompt Cache Savings | 1.0% |
| Unused Context Cost | 0.4% |
Prompt caching accounted for only 1.0% of potential savings.
Unused retrieved context accounted for just 0.4%.
Meanwhile, 94.5% of all context tokens were repeated content.
The dominant cost wasn't unused retrieval.
It wasn't static prompts.
It was accumulated conversation history.
The same information kept appearing again and again as sessions grew longer.
The Bigger the Session, the More Repetitive It Became
Context reuse increased sharply with session size:
| Final Context Size | Avg Reuse |
|---|---|
| < 5k tokens | 66.3% |
| 5k–20k tokens | 92.5% |
| 20k–50k tokens | 96.8% |
| > 50k tokens | 99.2% |
The longer a session ran, the more repetitive it became.
That was not what I expected to find.
My optimization instincts were pointing at the wrong bottleneck.
The Insight: Coding Agents Have Two Memory Systems
While digging through transcripts, I started thinking about coding agents differently from chatbots.
They appear to operate with two distinct memory systems.
Workspace Memory
Examples:
- Files
- Logs
- Terminal outputs
- Build artifacts
- Compiler errors
- Directory structures
This information often exists on disk.
The agent can frequently reconstruct it by reading the workspace again.
Conversational Memory
Examples:
- User preferences
- Design decisions
- Rejected ideas
- Constraints
- Trade-offs
- Architectural rationale
This information exists only inside the conversation.
Once it's removed, it may be gone for good.
That distinction changed how I think about context management.
Not all context is equally disposable.
A Pruning Failure Mode
Picture a long coding session.
Early in the conversation, the user decides:
- No embeddings
- No LLM-as-a-judge
- No HTML dashboards
The team discusses alternatives and agrees on a simpler approach.
Fifty turns later, those decisions may no longer exist anywhere except the conversation history.
The workspace still contains the code.
But it doesn't contain every rejected path.
A naive pruning strategy removes what appears to be old conversation noise.
Unfortunately, it may also remove the reasoning behind the project.
The result is an agent that suddenly starts recommending ideas the user explicitly rejected earlier.
Concretely, a summarizer that compresses primarily by recency may preserve:
- the latest file tree,
- recent command outputs,
- recent compiler logs,
while dropping a critical design decision made dozens of turns earlier.
The expensive-looking context survives.
The cheap-looking context that actually mattered disappears.
Why This Matters
If you're manually pruning, summarizing, or compressing long-running coding-agent sessions, you may be deleting the rationale behind decisions rather than the expensive parts of the context.
Workspace state is often reconstructable.
Conversational decisions frequently are not.
That doesn't mean pruning is wrong.
It means pruning needs to distinguish between:
- Technical execution history that can be recovered from the workspace.
- Alignment history that only exists in the conversation.
Treating both categories the same can produce subtle regressions.
What This Doesn't Prove
This isn't a universal law.
Twenty-seven sessions are enough for an interesting observation, not enough to claim every coding agent behaves this way.
The benchmark covers coding-agent workflows with disk-backed state.
It does not cover:
- RAG systems
- General chatbots
- Research agents
- Customer support agents
- Other non-coding workflows
The findings should be interpreted within that scope.
But they were enough to overturn my expectations.
I started this project expecting prompt caching and retrieval waste to dominate.
In this dataset, they barely moved the needle.
Try It Yourself
# Audit a single transcript
context-audit run transcript.jsonl
# Benchmark an entire directory
context-audit benchmark ~/.claude/projects
The tool reports:
- Context reuse ratios
- Estimated costs
- Repeated blocks
- Context growth patterns
- Potential caching savings
Repository
GitHub: context-audit
The benchmark changed how I think about coding-agent memory.
I started this project looking for waste in static prompts and retrieval.
Instead, I found a system spending most of its context budget carrying forward its own history.
If you're running long Claude Code, Cursor, Aider, or similar coding-agent workflows, I'd love to know whether you're seeing the same pattern.

Top comments (0)