Serhii Panchyshyn

Originally published at animanovalabs.com
Why Your AI Agent Costs 7x What It Should

Most AI agents are loops. Call the model. Read the response. Run a tool. Feed the result back. Call again.

That loop is also a billing loop. Every iteration re-sends the entire prompt. And unless you've thought about it carefully, every iteration is paying full price for tokens the provider already saw three calls ago.

I learned this the hard way. I had an agent that called the model five to seven times per user interaction. Screenshots were involved. A single medium-resolution image runs about two thousand tokens. Multiply that by seven passes and you're billing fourteen thousand tokens for an image the model only needed to see once.

The fix took about thirty minutes. It cut my input costs by roughly 80%.

The problem is the loop, not the model

When people optimize LLM costs, they usually start with the model. Can I use a smaller one? Can I cut the system prompt? Can I reduce the context window?

Those are fine. But they're linear improvements. You shave off 20% here, 30% there.

The loop is a multiplier. If your agent runs five iterations and you're re-billing the same stable content on every pass, you're not overpaying by a percentage. You're overpaying by a multiple. Five iterations means 5x. Seven means 7x. That's the gap.

What's actually happening under the hood

Every major LLM provider now offers prompt caching. You pay full price the first time you send a prompt. On subsequent calls that share the same beginning, you pay a fraction of the input cost for the cached portion. OpenAI discounts cached tokens by up to 90%. Anthropic's cache reads run about 10% of normal input cost, with a small one-time premium to write the cache.

The key word is "beginning." These caches are prefix caches. They match your prompt from byte zero forward. The moment the bytes diverge from what's stored, the match ends. Everything after that point is a miss.

```
Call 1:  [A][B][C][D][E]   → writes prefix to cache
Call 2:  [A][B][C][D][F]   → cache hit through [D], full price for [F]
Call 3:  [A][X][C][D][E]   → cache MISS at position 1, full price for everything
```

In call 3, the tokens [C], [D], and [E] are identical to call 1. Doesn't matter. The chain broke at position 1. The cache is left-anchored and unforgiving.
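The left-anchored matching can be sketched in a few lines. The token letters mirror the diagram above:

```python
def cached_prefix_len(stored: list[str], incoming: list[str]) -> int:
    """Length of the longest shared prefix: the only part a prefix cache can reuse."""
    n = 0
    for a, b in zip(stored, incoming):
        if a != b:
            break  # the chain breaks here; everything after is a miss
        n += 1
    return n

call_1 = ["A", "B", "C", "D", "E"]
call_2 = ["A", "B", "C", "D", "F"]
call_3 = ["A", "X", "C", "D", "E"]

print(cached_prefix_len(call_1, call_2))  # 4: hit through [D]
print(cached_prefix_len(call_1, call_3))  # 1: miss at position 1, even though [C][D][E] match later
```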

This isn't theoretical. It's happening in production right now.

I was looking at LightRAG, a popular open-source RAG framework. Their entity extraction pipeline embeds variable content directly inside the system prompt:

```
System prompt:
  ---Role---         (static, ~100 tokens)
  ---Instructions--- (static, ~400 tokens)
  ---Examples---     (static, ~800 tokens)
  ---Input Text---
  {input_text}       ← CHANGES FOR EVERY CHUNK
```

Every chunk produces a completely different system prompt string. There's no shared prefix across chunks because the variable content is baked into the same message as the static instructions. Nothing gets cached.

For a typical indexing run of 8,000 chunks, that's roughly 11.6 million prompt tokens all counted as new. If the static prefix (~1,300 tokens) were separated from the variable input, roughly 10.4 million of those tokens would hit the cache. That's a 45% cost reduction just from moving one variable out of the system message and into the user message.
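A back-of-the-envelope check. The chunk size and a 50% cached-token discount are stated assumptions here (discounts vary by provider and model), chosen to match the run described above:

```python
# Assumptions, not LightRAG internals: 8,000 chunks, ~1,450 prompt tokens each,
# ~1,300 of which are static instructions; cached tokens billed at 50%.
chunks = 8_000
tokens_per_prompt = 1_450
static_prefix = 1_300
cached_discount = 0.50

total = chunks * tokens_per_prompt        # ~11.6M tokens, all billed as new today
cacheable = (chunks - 1) * static_prefix  # the prefix is written once, then read
effective = total - cacheable * cached_discount
print(f"reduction: {1 - effective / total:.0%}")  # prints: reduction: 45%
```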

The fix is three lines of code. Split the template. Put static content in the system message. Put {input_text} in the user message. Done.

```
System message (cached):
  ---Role---
  ---Instructions---
  ---Examples---
  ---Entity Types---

User message (variable):
  ---Input Text---
  {input_text}
```
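In message-array form, the split might look like this sketch (the prompt text is illustrative, not LightRAG's actual template):

```python
# Static content lives in one byte-identical system message; only the user
# message varies per chunk, so the provider's prefix cache can reuse the rest.
STATIC_SYSTEM = (
    "---Role---\n...\n"
    "---Instructions---\n...\n"
    "---Examples---\n...\n"
    "---Entity Types---\n..."
)

def build_messages(input_text: str) -> list[dict]:
    return [
        {"role": "system", "content": STATIC_SYSTEM},
        {"role": "user", "content": f"---Input Text---\n{input_text}"},
    ]

m1 = build_messages("chunk one")
m2 = build_messages("chunk two")
assert m1[0] == m2[0]  # stable prefix across chunks
```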

This pattern shows up everywhere. If you're building any pipeline that processes documents in chunks, your prompt is probably structured like LightRAG's. And you're probably paying for it.

The layout that actually works

Once you understand prefix caching, prompt layout stops being cosmetic and starts being economic. The shape you want is:

```
[  STABLE PREFIX  ][ cache breakpoint ][  GROWING TAIL  ]
```

Everything that stays the same between calls goes to the left. Everything that changes goes to the right. The breakpoint sits between them.

The Claude Code team at Anthropic shared their exact ordering and it's a good template for any agent:

```
1. Static system prompt + tool definitions  (globally cached)
2. Project-level context                    (cached within a project)
3. Session context                          (cached within a session)
4. Conversation messages                    (the growing tail)
```

Each layer is stable relative to the layer below it. System prompts change less than project context. Project context changes less than session context. Session context changes less than conversation messages. The cache hits cascade.
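A prompt builder that follows this layering might look like the sketch below. The layer names and the multi-system-message shape are illustrative, not Claude Code's actual internals:

```python
def assemble_prompt(system_prompt: str, tool_defs: str, project_ctx: str,
                    session_ctx: str, messages: list[dict]) -> list[dict]:
    """Order context from most to least stable so cache hits cascade."""
    return [
        {"role": "system", "content": f"{system_prompt}\n\n{tool_defs}"},  # layer 1: global
        {"role": "system", "content": project_ctx},                        # layer 2: per project
        {"role": "system", "content": session_ctx},                        # layer 3: per session
        *messages,                                                         # layer 4: the growing tail
    ]
```

Two calls in the same session differ only in the final entries, so the first three layers stay byte-identical and cacheable.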

The counterintuitive part

In a single-shot call, the user's message naturally goes at the end. That's correct. But in a loop, the user's message is not the tail. The loop's output is.

Think about it. The user's message doesn't change between iterations. It's the same question on pass one as it is on pass five. What changes is the assistant's responses and tool results that accumulate with each iteration.

So the user's content belongs in the prefix:

```
[ system prompt ][ user message ][ breakpoint ][ loop state → grows each iteration ]
```

This looks wrong. The user's message isn't at the end. But the cache doesn't care about narrative order. It cares about byte stability. The user's message is frozen across iterations. The loop output is what moves. Frozen things go left. Moving things go right.
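A cache-friendly loop can be sketched as follows. `call_model` and `run_tool` are hypothetical stand-ins for your provider and tool plumbing; real APIs attach tool results with extra fields like a tool-call ID:

```python
def run_agent(call_model, run_tool, system_prompt: str, user_message: str,
              max_iters: int = 7) -> str:
    """The system prompt and user message are frozen up front; only the
    tail grows as assistant turns and tool results accumulate."""
    messages = [
        {"role": "system", "content": system_prompt},  # stable prefix
        {"role": "user", "content": user_message},     # also frozen across iterations
    ]
    for _ in range(max_iters):
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply["content"]})  # tail grows
        if reply.get("tool") is None:
            return reply["content"]
        result = run_tool(reply["tool"])
        messages.append({"role": "tool", "content": result})  # append-only: never mutate history
    return messages[-1]["content"]
```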

The math on images

Text tokens are cheap enough that sloppy caching is survivable. Images are not.

| Image tokens | Passes | Without caching | With caching |
| ------------ | ------ | --------------- | ------------ |
| 2,000        | 1      | 2,000           | 2,000        |
| 2,000        | 5      | 10,000          | ~2,400       |
| 2,000        | 7      | 14,000          | ~2,600       |

The cached version writes the image once and reads it at a fraction of full price on every subsequent pass. That's the difference between an agent that's economically viable and one that burns through your API budget in a week.
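The arithmetic generalizes to a small helper. The 90% read discount and zero write premium used as defaults here are illustrative, not any provider's exact pricing, which is why the output differs slightly from the table above:

```python
def image_token_cost(tokens: int, passes: int, read_discount: float = 0.9,
                     write_premium: float = 0.0) -> tuple[int, int]:
    """Effective billed tokens for an image re-sent on every pass of a loop.
    read_discount and write_premium vary by provider."""
    uncached = tokens * passes
    # First pass writes the cache (possibly at a premium); the rest are cheap reads.
    cached = tokens * (1 + write_premium) + tokens * (1 - read_discount) * (passes - 1)
    return uncached, round(cached)

print(image_token_cost(2_000, 7))  # (14000, 3200)
```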

If your agent processes screenshots, documents, or any visual input inside a loop, this is probably the single highest-leverage optimization available to you.

The silent cache killers

Even with the right layout, caching breaks in quiet ways. Every one of these has bitten me.

Timestamps in the system prompt. "The current time is 2025-04-22 14:23:07." Changes every call. One line and your entire prefix is invalidated. The fix is to pass time updates in the next user message instead. Claude Code does exactly this. They append a <system-reminder> tag in the next turn rather than touching the system prompt.
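A minimal sketch of that pattern. The tag mirrors the `<system-reminder>` idea, but the helper itself is hypothetical:

```python
import datetime

def with_time_reminder(user_text: str) -> str:
    """Append the current time to the next user turn instead of stamping it
    into the system prompt, which would invalidate the prefix on every call."""
    now = datetime.datetime.now(datetime.timezone.utc).isoformat(timespec="seconds")
    return f"{user_text}\n<system-reminder>Current time: {now}</system-reminder>"
```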

Adding or removing tools mid-conversation. This is probably the most common mistake I see. It seems logical to only give the model tools it needs right now. But tool definitions are part of the cached prefix. Adding or removing a tool invalidates the cache for the entire conversation history.

The Claude Code team learned this the hard way. Their plan mode initially swapped out tools for read-only versions. Cache broke every time. The fix: keep all tools in the request always. Make plan mode a tool itself (EnterPlanMode, ExitPlanMode). The tool definitions never change. The model calls a tool to change its own behavior instead of you changing the toolset.

If you have many tools and loading all of them is expensive, send lightweight stubs with just the tool name and let the model discover full schemas through a search tool when needed. The stubs are stable. The prefix stays intact.
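A hedged sketch of the stub pattern; the tool names and schema shapes are made up for illustration:

```python
# Full schemas live outside the prompt; only stable name stubs ship every call.
FULL_SCHEMAS = {
    "resize_image": {"description": "Resize an image",
                     "parameters": {"width": "int", "height": "int"}},
    "crop_image": {"description": "Crop an image",
                   "parameters": {"box": "list[int]"}},
}

TOOL_STUBS = sorted(FULL_SCHEMAS)  # stable, sorted name list: part of the cached prefix

def search_tools(query: str) -> dict:
    """The one full tool the model always has; it returns schemas on demand."""
    return {name: FULL_SCHEMAS[name] for name in TOOL_STUBS if query in name}
```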

Switching models mid-session. Prompt caches are model-specific. If you're 100k tokens into a conversation with a large model and want to hand off an easy subtask to a smaller one, you'd have to rebuild the entire cache for the new model. That rebuild often costs more than just letting the original model answer.

If you need multi-model workflows, use subagents. The primary model prepares a focused handoff message for the secondary model. The secondary model works with a short, fresh context. Neither model's cache gets broken.

Unordered data structures. If you build context from a set or unordered dict, iteration order can drift between calls. Sort before serializing.
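In Python, canonicalizing before serialization is one line each for sets and dicts:

```python
import json

# Sets and dicts can serialize in different orders between runs or processes;
# canonicalize before the text goes into the prompt.
entities = {"server", "database", "cache"}
config = {"temperature": 0.2, "model": "gpt-4o"}

stable_entities = json.dumps(sorted(entities))
stable_config = json.dumps(config, sort_keys=True)
print(stable_config)  # {"model": "gpt-4o", "temperature": 0.2}
```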

Whitespace drift. One version of your template has a trailing newline, another doesn't. The bytes don't match.

In-place edits to history. The moment you mutate a past message, every byte after it shifts. Your cache for that whole conversation is gone.

The unifying principle: content that looks identical to a human is not necessarily byte-identical to a hash function. The cache only speaks bytes.

How to verify it's working

Don't trust your layout. Measure it.

Every major provider returns cache metrics in the response. OpenAI includes cached_tokens in usage.prompt_tokens_details. Anthropic returns cache_creation_input_tokens and cache_read_input_tokens.

On the first call, cached tokens should be zero. On every subsequent call, they should climb to match your stable prefix length. If they don't, your prefix isn't stable.
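A small helper can normalize the two providers' usage payloads into one hit-rate number. The field names follow OpenAI's (`prompt_tokens_details.cached_tokens`) and Anthropic's (`cache_read_input_tokens`, `cache_creation_input_tokens`) documented response shapes, but verify the exact paths against your SDK version:

```python
def cached_fraction(usage: dict) -> float:
    """Fraction of prompt tokens served from cache, from a raw usage payload."""
    if "prompt_tokens_details" in usage:  # OpenAI-style
        cached = usage["prompt_tokens_details"].get("cached_tokens", 0)
        total = usage.get("prompt_tokens", 0)
    else:  # Anthropic-style: input_tokens excludes cache reads and writes
        cached = usage.get("cache_read_input_tokens", 0)
        total = (cached + usage.get("cache_creation_input_tokens", 0)
                 + usage.get("input_tokens", 0))
    return cached / total if total else 0.0

print(cached_fraction({"prompt_tokens": 10_000,
                       "prompt_tokens_details": {"cached_tokens": 9_000}}))  # 0.9
```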

The best debugging step: dump the raw prompt from two consecutive calls and diff them. You'll find the drift immediately.

A habit that's saved me hours: write a test that runs your prompt builder twice with equivalent inputs and asserts the first N bytes are byte-equal. Humans can't eyeball byte stability. Hash functions can.
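Such a test might look like this sketch, where `build_prompt` is a stand-in for your real prompt builder:

```python
import hashlib

def build_prompt(query: str) -> str:
    # Stand-in for your real prompt builder.
    return "SYSTEM: you are helpful\nTOOLS: search, read\nUSER: " + query

def test_prefix_is_byte_stable():
    """Build twice with equivalent inputs and hash the first N bytes.
    Humans can't eyeball whitespace drift; a hash comparison can."""
    n = 40  # length of the region that must stay cached
    a = build_prompt("first question").encode()[:n]
    b = build_prompt("second question").encode()[:n]
    assert hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest()

test_prefix_is_byte_stable()
```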

And if caching is a meaningful part of your cost structure, monitor it like you'd monitor uptime. The Claude Code team runs alerts on their cache hit rate and treats drops as incidents. A few percentage points of cache miss can dramatically change unit economics. It deserves a dashboard, not a gut check.

The reframe

I used to think of prompts as messages. Now I think of them as data structures with cache semantics. Some regions are stable. Some regions grow. The breakpoint is the contract between them.

Every piece of content gets the same triage: does this change between calls? If yes, it goes to the tail. If no, it goes to the prefix. If it's expensive and it belongs to the user's turn, I figure out how to keep it in the prefix anyway. Even if it means putting things somewhere that looks weird.

This reframe changes how I design features. Instead of asking "what tools does the model need right now?" I ask "how do I model this state change without breaking the prefix?" Instead of editing the system prompt to update context, I pass updates through messages. Instead of switching to a cheaper model mid-conversation, I fork a subagent with a clean context.

The model is the expensive part of your system. The shape of what you send it is the part you actually control.


I help engineering teams ship AI features that work in production, not just in demos. If your agents are burning through API budgets or your LLM infrastructure needs a cost audit, let's talk.
