Ishwar

Posted on Jun 19

I Audited $753 of Coding-Agent Usage. I Found 94.5% Context Reuse.

#agents #ai #opensource #showdev

I expected prompt caching to be one of the biggest cost optimizations for coding agents.

After all, every request carries system prompts, tool definitions, and instructions that rarely change. Caching those felt like free money.

I also expected a second lever: unused retrieved context. Agents constantly read files, fetch logs, inspect directories, and explore codebases. Surely a meaningful fraction of that context never actually influences the final output.

So I built a small tool, context-audit, and ran it across 27 real coding-agent sessions representing $753.24 of input spend.

Neither assumption survived the benchmark.

What I Measured

Across 27 sessions:

Metric	Value
Context Reuse	94.5%
Novel Context	5.5%
Prompt Cache Savings	1.0%
Unused Context Cost	0.4%

Prompt caching offered potential savings of only 1.0%.

Unused retrieved context accounted for just 0.4% of cost.

Meanwhile, 94.5% of all context tokens were repeated content.

The dominant cost wasn't unused retrieval.

It wasn't static prompts.

It was accumulated conversation history.

The same information kept appearing again and again as sessions grew longer.
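To make "context reuse" concrete, here is a minimal sketch of one way such a ratio could be computed. It is not context-audit's actual implementation. The transcript shape (a list of requests, each a list of text blocks), the block-level hashing, and the whitespace token counter are all assumptions made for illustration.

```python
import hashlib

def reuse_ratio(requests, count_tokens=lambda text: len(text.split())):
    """Fraction of input tokens belonging to blocks already sent in an earlier request.

    `requests` is a list of requests; each request is a list of text blocks
    (system prompt, messages, tool results) sent as input on that turn.
    The default token counter is a crude whitespace split (an assumption).
    """
    seen = set()              # hashes of blocks sent in earlier requests
    repeated = total = 0
    for blocks in requests:
        current = set()
        for block in blocks:
            digest = hashlib.sha256(block.encode("utf-8")).hexdigest()
            tokens = count_tokens(block)
            total += tokens
            if digest in seen:
                repeated += tokens
            current.add(digest)
        seen |= current       # a block counts as "seen" only for later requests
    return repeated / total if total else 0.0

# Hypothetical three-turn session in which every request resends the history.
requests = [
    ["system prompt", "fix the failing test"],
    ["system prompt", "fix the failing test", "tool output: 3 tests failed"],
    ["system prompt", "fix the failing test", "tool output: 3 tests failed",
     "patched utils.py"],
]
print(f"{reuse_ratio(requests):.1%}")  # prints 56.7%
```

Even in this tiny example, more than half of the input tokens are repeats by the third request, because each turn carries the whole history forward.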

The Bigger the Session, the More Repetitive It Became

Context reuse increased sharply with session size:

Final Context Size	Avg Reuse
< 5k tokens	66.3%
5k–20k tokens	92.5%
20k–50k tokens	96.8%
> 50k tokens	99.2%

The longer a session ran, the more repetitive it became.
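A toy model suggests why this trend is close to inevitable. Assume, purely for illustration, that every turn adds the same amount of novel context and that each request resends the full history. Then cumulative input grows roughly quadratically with the number of turns while novel content grows linearly. The model below is not fitted to the table above; it only shows the shape of the curve.

```python
def modeled_reuse(turns: int, novel_per_turn: int = 1) -> float:
    """Reuse fraction when each request resends all earlier turns (toy model)."""
    # Request t carries t turns' worth of content, so the total is 1 + 2 + ... + turns.
    total = sum(novel_per_turn * t for t in range(1, turns + 1))
    novel = novel_per_turn * turns
    return 1 - novel / total

for turns in (2, 10, 50, 200):
    print(turns, f"{modeled_reuse(turns):.1%}")
```

Under these assumptions the reuse fraction simplifies to 1 - 2 / (turns + 1), so it climbs toward 100% as a session gets longer, regardless of how much novel context each turn adds.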

That was not what I expected to find.

My optimization instincts were pointing at the wrong bottleneck.

The Insight: Coding Agents Have Two Memory Systems

While digging through transcripts, I started thinking about coding agents differently from chatbots.

They appear to operate with two distinct memory systems.

Workspace Memory

Examples:

Files
Logs
Terminal outputs
Build artifacts
Compiler errors
Directory structures

This information often exists on disk.

The agent can frequently reconstruct it by reading the workspace again.

Conversational Memory

Examples:

User preferences
Design decisions
Rejected ideas
Constraints
Trade-offs
Architectural rationale

This information exists only inside the conversation.

Once it's removed, it may be gone for good.

That distinction changed how I think about context management.

Not all context is equally disposable.

A Pruning Failure Mode

Picture a long coding session.

Early in the conversation, the user decides:

No embeddings
No LLM-as-a-judge
No HTML dashboards

The user and the agent discuss alternatives and settle on a simpler approach.

Fifty turns later, those decisions may no longer exist anywhere except the conversation history.

The workspace still contains the code.

But it doesn't contain every rejected path.

A naive pruning strategy removes what appears to be old conversation noise.

Unfortunately, it may also remove the reasoning behind the project.

The result is an agent that suddenly starts recommending ideas the user explicitly rejected earlier.

Concretely, a summarizer that compresses primarily by recency may preserve:

the latest file tree,
recent command outputs,
recent compiler logs,

while dropping a critical design decision made dozens of turns earlier.

The expensive-looking context survives.

The cheap-looking context that actually mattered disappears.
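The failure mode is easy to reproduce in a few lines. This is a deliberately naive sketch, not how any particular agent or summarizer works. The `Turn` type, its `kind` labels, and `prune_by_recency` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    index: int
    kind: str   # hypothetical labels: "decision" or "tool_output"
    text: str

def prune_by_recency(history, keep_last):
    """Keep only the most recent `keep_last` turns, whatever they contain."""
    return history[-keep_last:] if keep_last > 0 else []

# One early design decision followed by fifty turns of tool output.
history = [Turn(0, "decision", "No embeddings, no LLM-as-a-judge, no HTML dashboards")]
history += [Turn(i, "tool_output", f"command output #{i}") for i in range(1, 51)]

kept = prune_by_recency(history, keep_last=10)
print(any(turn.kind == "decision" for turn in kept))  # prints False
```

The ten surviving turns are all tool output. The single turn that recorded what the user ruled out is gone.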

Why This Matters

If you're manually pruning, summarizing, or compressing long-running coding-agent sessions, you may be deleting the rationale behind decisions rather than the expensive parts of the context.

Workspace state is often reconstructable.

Conversational decisions frequently are not.

That doesn't mean pruning is wrong.

It means pruning needs to distinguish between:

Technical execution history that can be recovered from the workspace.
Alignment history that only exists in the conversation.

Treating both categories the same can produce subtle regressions.
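One way to act on that distinction is to apply the recency budget only to turns that can be recovered from the workspace and to pin everything else. The sketch below assumes each turn already carries a `recoverable` flag. Producing that flag reliably is the hard part and is not shown. `Turn` and `prune` are hypothetical names, not part of context-audit.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    index: int
    text: str
    recoverable: bool  # True if the content can be re-read from the workspace

def prune(history, keep_recent_recoverable):
    """Keep every conversation-only turn; keep only the newest recoverable ones."""
    recoverable = [t for t in history if t.recoverable]
    # Guard against a non-positive budget: list[-0:] would return the whole list.
    newest = recoverable[-keep_recent_recoverable:] if keep_recent_recoverable > 0 else []
    newest_ids = {t.index for t in newest}
    return [t for t in history if not t.recoverable or t.index in newest_ids]

history = [Turn(0, "No embeddings, no LLM-as-a-judge, no HTML dashboards", False)]
history += [Turn(i, f"command output #{i}", True) for i in range(1, 51)]

kept = prune(history, keep_recent_recoverable=10)
print(len(kept), kept[0].text)
```

Here the early decision survives alongside the ten most recent tool outputs, and the forty older outputs, which could be regenerated by re-running commands or re-reading files, are dropped.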

What This Doesn't Prove

This isn't a universal law.

Twenty-seven sessions are enough for an interesting observation, not enough to claim every coding agent behaves this way.

The benchmark covers coding-agent workflows with disk-backed state.

It does not cover:

RAG systems
General chatbots
Research agents
Customer support agents
Other non-coding workflows

The findings should be interpreted within that scope.

But they were enough to overturn my expectations.

I started this project expecting prompt caching and retrieval waste to dominate.

In this dataset, they barely moved the needle.

Try It Yourself

# Audit a single transcript
context-audit run transcript.jsonl

# Benchmark an entire directory
context-audit benchmark ~/.claude/projects

The tool reports:

Context reuse ratios
Estimated costs
Repeated blocks
Context growth patterns
Potential caching savings

Repository

GitHub: context-audit

The benchmark changed how I think about coding-agent memory.

I started this project looking for waste in static prompts and retrieval.

Instead, I found a system spending most of its context budget carrying forward its own history.

If you're running long sessions in Claude Code, Cursor, Aider, or similar coding agents, I'd love to know whether you're seeing the same pattern.

DEV Community