Coding agents such as Cortex Code, Claude Code, Codex, and Cursor rely on large language models (LLMs) behind the scenes. A common question from users is: “Why does my first turn consume so many input tokens when I only typed a short prompt?” This post explains how prompt caching works in these systems, why the first turn often looks expensive, and why cache hit rates usually improve as a session continues.
Key point: Coding agents like Cortex Code benefit from the same general prompt-caching principles described by Anthropic and OpenAI. Understanding those mechanics helps you interpret token usage more accurately.
1. Why the First Turn Can Look Expensive
1-1. Why users notice high input token usage on the first turn
When you start a new session in a coding agent and type something simple like “fix the typo in line 3,” the API usage may show thousands of input tokens — far more than your short message. This is because the total prompt sent to the LLM usually includes much more than your message:
| Component | Typical size (example) | Changes between turns? |
|---|---|---|
| System prompt | 2,000–10,000+ tokens | Usually no |
| Tool definitions (file read, write, search, bash, etc.) | 3,000–8,000+ tokens | Usually no |
| Additional instructions and rules | 1,000–5,000+ tokens | Usually no |
| Your message | 10–200 tokens | Yes |
Even a 5-word user prompt can result in a 10,000+ token API request because the system prompt, tool definitions, and other static instructions are prepended to every turn.
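To make that concrete, here is a minimal sketch of how a coding agent's first request might be assembled, using the Anthropic Python SDK. The system prompt, tool definitions, and model name are placeholders, not any particular agent's actual prompt:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# Placeholder static context that a coding agent might prepend to every turn.
SYSTEM_PROMPT = "You are a coding assistant. ... (several thousand tokens of rules and guidance)"
TOOLS = [
    {
        "name": "read_file",
        "description": "Read a file from the workspace. ... (long description)",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    # ... more tools: write_file, search, bash, etc.
]

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    system=SYSTEM_PROMPT,       # resent on every turn
    tools=TOOLS,                # resent on every turn
    messages=[{"role": "user", "content": "fix the typo in line 3"}],  # the only truly new content
)
print(response.usage)  # input token counts reflect the whole prompt, not just the short message
```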
1-2. Why a short user prompt does not mean a small total prompt
The model needs the full relevant context on every turn. In a standard stateless API pattern, that means the system prompt, tools, instructions, and conversation history are sent again with each request. On the first turn, there is typically no reusable cache entry yet, so the request must be processed from scratch. Anthropic’s Messages API follows this explicit multi-turn pattern, where developers construct each turn and manage conversation state themselves.
This is not a bug or wasted spend. It is how the model preserves context across separate API calls. The good news is that much of this content is repeated across turns, which is exactly what prompt caching is designed to exploit. Anthropic documents prompt caching for repeated prefixes, and OpenAI likewise describes automatic reuse of previously computed prompt prefixes on supported models.
2. What Gets Reused Across Turns
2-1. What makes up the reusable prefix: system prompt, tools, instructions, and history
Prompt caching works by reusing a previously computed prompt prefix.
A coding-agent request often looks roughly like this:
[Tools] → [System prompt] → [Other static instructions] → [Message history] → [New user message]
←—————————————— reusable prefix candidate ——————————————→ ← new content →
With Anthropic, prompt caching applies to the prompt prefix across cacheable blocks such as tools, system, and messages, depending on where the cache breakpoint is placed. Anthropic currently supports both automatic caching and explicit cache breakpoints. With automatic caching, the system manages the breakpoint for you and moves it forward as the conversation grows.
With OpenAI, prompt caching works automatically on supported models for prompts of 1,024 tokens or more. The API reuses the longest previously computed prompt prefix, starting at 1,024 tokens and increasing in 128-token increments.
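For the explicit Anthropic mode, the breakpoint is a `cache_control` field on a content block. Below is a minimal sketch; the model name and instruction text are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

LONG_STATIC_INSTRUCTIONS = "..."  # placeholder: thousands of tokens of stable rules

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STATIC_INSTRUCTIONS,        # stable across turns
            "cache_control": {"type": "ephemeral"},  # explicit cache breakpoint
        }
    ],
    messages=[{"role": "user", "content": "fix the typo in line 3"}],
)
usage = response.usage
print(usage.cache_creation_input_tokens, usage.cache_read_input_tokens, usage.input_tokens)
```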
2-2. The difference between cacheable context, cache writes, and cache hits
There are three practical categories to think about:
| Category | What it means | Anthropic | OpenAI |
|---|---|---|---|
| Cache write / cache creation | The prefix is being cached for future reuse | Billed separately from normal input; pricing depends on the cache setting | Automatic behavior; no separate cache-write fee |
| Cache read / cache hit | A previously cached prefix is reused | Discounted relative to uncached input | Discounted relative to uncached input |
| Uncached input | Tokens after the reusable prefix, or tokens not served from cache | Standard input pricing | Standard input pricing |
On the first turn, there is usually no prior cache entry, so most or all of the request is effectively new. On later turns, repeated prefix content may be served from cache, depending on factors such as model support, prompt length, retention window, routing, and whether the prefix remains unchanged. Anthropic exposes cache_creation_input_tokens and cache_read_input_tokens in usage reporting, while OpenAI exposes cached prompt usage through prompt_tokens_details.cached_tokens.
For Anthropic, the usage formula is documented as:
total_input_tokens = cache_read_input_tokens + cache_creation_input_tokens + input_tokens
Anthropic also notes that minimum cacheable prompt lengths differ by model family, rather than being a single universal threshold.
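Expressed as code, the decomposition is straightforward; the numbers below are illustrative only, not real billing data:

```python
def total_input_tokens(usage: dict) -> int:
    # Anthropic's documented decomposition: cached reads + cache writes + uncached input.
    return (
        usage["cache_read_input_tokens"]
        + usage["cache_creation_input_tokens"]
        + usage["input_tokens"]
    )

# Illustrative numbers only:
example = {"cache_read_input_tokens": 11_800, "cache_creation_input_tokens": 650, "input_tokens": 50}
print(total_input_tokens(example))  # 12500
```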
3. Why Cache Hit Rates Improve Over Time
3-1. Why cache hit rates usually improve after the first few turns
Here is what typically happens in a multi-turn coding-agent session:
| Turn | What happens | Likely cache behavior |
|---|---|---|
| Turn 1 | System + Tools + User(1) | Everything is new |
| Turn 2 | System + Tools + User(1) + Asst(1) + User(2) | Earlier shared prefix may be reused; newer content is new |
| Turn 3 | System + Tools + User(1) + Asst(1) + User(2) + Asst(2) + User(3) | A longer shared prefix may be reused |
| Turn N | Full history | A large fraction of the repeated prefix may be cached |
In many healthy multi-turn sessions, the cache hit rate improves after the first turn because the repeated prefix gets larger and more stable. That is why judging costs based only on the first turn can be misleading. Anthropic explicitly describes multi-turn caching that moves forward with the conversation, and OpenAI describes reuse of the longest previously computed prefix.
That said, the exact hit rate will vary. It depends on the model, the prompt length, whether the repeated prefix is identical, how long the session has been idle, and provider-specific routing behavior. So percentages like “60–80% on turn 2” or “80–95% on turn 4+” should be treated as common patterns, not guarantees.
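A schematic sketch of that loop, assuming the Anthropic Messages API and a placeholder system prompt (a real agent would also resend tool definitions and handle tool-use turns), looks like this:

```python
import anthropic

client = anthropic.Anthropic()
SYSTEM_PROMPT = "You are a coding assistant. ..."  # placeholder stable prefix
messages = []  # full conversation history, resent on every turn

for user_input in ["fix the typo in line 3", "now run the tests", "commit the change"]:
    messages.append({"role": "user", "content": user_input})
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        system=SYSTEM_PROMPT,       # stable prefix; tool definitions would also go here
        messages=messages,
    )
    # Everything before the newest user message is a candidate for cache reuse on the next turn.
    messages.append({"role": "assistant", "content": response.content})
    u = response.usage
    print(u.cache_read_input_tokens, u.cache_creation_input_tokens, u.input_tokens)
```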
4. How to Improve Cache Reuse in Practice
4-1. How to improve cache hit rates without changing developer behavior too much
Based on how caching works, here are some practical tips:
- **Keep sessions alive when possible.** Starting a brand-new session often means rebuilding the reusable prefix from scratch. Longer, continuous sessions generally create more opportunities for cache reuse.
- **Keep the repeated prefix stable.** Prompt caching depends on prefix reuse. If your system prompt, static instructions, or tool definitions change frequently, cache reuse will usually drop.
- **Avoid unnecessary prompt reordering.** Put the most stable content first and the most dynamic content later. OpenAI's guidance explicitly recommends placing static content at the beginning and variable content at the end to improve cache effectiveness (see the sketch after this list).
- **Use provider features that preserve cacheability.** Anthropic documents that, in some tool-search workflows, additional tools can be introduced without breaking the cached prefix because the prefix itself remains unchanged.
- **Let the platform handle cache placement when supported.** In many cases, prompt caching is automatic. Anthropic supports automatic caching via a top-level `cache_control` setting, and OpenAI applies prompt caching automatically on supported models without additional integration changes.
- **Stick with one model per session when possible.** Prompt caches are provider- and model-specific. If you switch to a different model in the middle of a session, previously reusable prefixes may no longer apply to the new model, and the next request may need to be processed largely from scratch for that model. As a rule of thumb, if you want to maximize cache reuse, it is usually better to stay with one model throughout a session. This follows from the provider documentation being model-specific and from extended retention options being available only for certain OpenAI models.
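As promised above, here is a minimal sketch of the "static first, dynamic last" ordering, using the OpenAI Python SDK. The model name, instruction text, and helper function are placeholder assumptions, not official guidance code:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

STATIC_INSTRUCTIONS = "You are a coding assistant. ..."  # placeholder: stable, unchanged between turns

def build_messages(history, new_user_message, per_turn_context):
    # Stable content first so the cached prefix keeps matching; variable content last.
    return (
        [{"role": "system", "content": STATIC_INSTRUCTIONS}]
        + history
        + [{"role": "user", "content": f"{new_user_message}\n\n[context]\n{per_turn_context}"}]
    )

messages = build_messages([], "fix the typo in line 3", "open file: src/app.py")
response = client.chat.completions.create(model="gpt-4o", messages=messages)  # placeholder model name
details = response.usage.prompt_tokens_details
print(details.cached_tokens if details else 0)  # likely 0 on the very first request
```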
4-2. Common misunderstandings about “high token usage on the first prompt”
| Misunderstanding | Reality |
|---|---|
| “The first turn used 15K tokens — that’s wasteful” | Those tokens often include the system prompt, tools, and static instructions that can later be reused through caching |
| “A bigger system prompt is always a bad idea” | A larger stable prefix may have an upfront cost, but repeated reuse can make it much cheaper over a session |
| “Cache writes are always an extra penalty” | Anthropic and OpenAI expose caching differently; you should interpret usage using each provider’s own pricing and usage fields |
| “My cache hit rate is low” | The first turn often has little or no reuse. It is more useful to evaluate cache behavior across the full session |
5. How to Calculate and Monitor Cache Hit Rates
5-1. How cache hit rate is calculated
A simple way to think about cache hit rate is:
cache_hit_rate = cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens + input_tokens)
This formula maps cleanly onto Anthropic’s usage fields. For OpenAI, you typically estimate cache reuse using prompt_tokens_details.cached_tokens relative to total prompt tokens. Anthropic and OpenAI expose different fields, so the exact calculation is provider-specific.
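A minimal sketch of both calculations, using the usage fields described in this post (the numbers are illustrative only):

```python
def anthropic_hit_rate(usage: dict) -> float:
    # Cache reads over all input tokens (reads + writes + uncached).
    read = usage.get("cache_read_input_tokens", 0)
    total = read + usage.get("cache_creation_input_tokens", 0) + usage.get("input_tokens", 0)
    return read / total if total else 0.0

def openai_hit_rate(usage: dict) -> float:
    # Cached prompt tokens over total prompt tokens.
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    total = usage.get("prompt_tokens", 0)
    return cached / total if total else 0.0

print(anthropic_hit_rate({"cache_read_input_tokens": 11_800,
                          "cache_creation_input_tokens": 650,
                          "input_tokens": 50}))  # ~0.94
print(openai_hit_rate({"prompt_tokens": 2006,
                       "prompt_tokens_details": {"cached_tokens": 1920}}))  # ~0.96
```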
In a healthy multi-turn coding session, you will often see this pattern:
- Turn 1: little or no cache reuse
- Turn 2–3: noticeable reuse if the prefix is stable
- Later turns: higher reuse if the conversation remains active and the repeated prefix continues to match
5-2. How to monitor cache hit rates in practice
The Anthropic API returns usage fields like:

```json
{
  "cache_creation_input_tokens": 12500,
  "cache_read_input_tokens": 0,
  "input_tokens": 50,
  "output_tokens": 300
}
```
On an early turn, cache_read_input_tokens may be 0 while cache_creation_input_tokens is high. On later turns, cache_read_input_tokens may grow as more of the repeated prefix is reused.
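If you want to watch this evolve during a session, a small helper like the following sketch (assuming the Anthropic Python SDK's `response.usage` object) can log per-turn cache behavior:

```python
def log_anthropic_usage(turn: int, usage) -> None:
    # usage is the response.usage object from the Anthropic Python SDK.
    read = usage.cache_read_input_tokens or 0
    write = usage.cache_creation_input_tokens or 0
    total = read + write + usage.input_tokens
    hit_rate = read / total if total else 0.0
    print(f"turn {turn}: total_input={total} cache_read={read} "
          f"cache_write={write} hit_rate={hit_rate:.0%}")
```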
The OpenAI API returns fields like:

```json
{
  "prompt_tokens": 2006,
  "prompt_tokens_details": {
    "cached_tokens": 1920
  },
  "completion_tokens": 300
}
```
The cached_tokens field indicates how many prompt tokens were served from cache. OpenAI documents automatic prompt caching on supported models for prompts of 1,024 tokens or more.
Cache retention also matters.
- Anthropic: default 5-minute cache lifetime, with a 1-hour option available. Anthropic also notes that cache behavior is managed either automatically or via explicit cache breakpoints.
- OpenAI: in-memory cached prefixes are typically retained for a short inactive window, and extended prompt cache retention of up to 24 hours is available for supported models.
If your session sits idle long enough for the retention window to expire, the next request may behave more like a fresh cache write or a cache miss. In interactive coding sessions, though, users often stay well within these windows.
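If you want to reason about this in client code, a small tracker like the hypothetical sketch below can flag when a session has probably gone cold. The 5-minute default is an assumption taken from the Anthropic lifetime mentioned above, not a value reported by either API:

```python
import time
from typing import Optional

class CacheFreshnessTracker:
    """Hypothetical client-side heuristic, not a provider API: guesses whether the
    cached prefix from the previous request is likely still within its retention window."""

    def __init__(self, retention_seconds: float = 300.0):  # assumption: 5-minute lifetime
        self.retention_seconds = retention_seconds
        self.last_request_at: Optional[float] = None

    def mark_request(self) -> None:
        self.last_request_at = time.monotonic()

    def likely_warm(self) -> bool:
        if self.last_request_at is None:
            return False
        return (time.monotonic() - self.last_request_at) < self.retention_seconds
```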
6. Final Takeaway
If your coding agent shows high input token usage on the first turn, that does not necessarily mean the system is inefficient. In many cases, you are seeing the cost of sending and processing the full reusable context: tools, system instructions, static guidance, and conversation history. Prompt caching exists precisely to make those repeated prefixes cheaper and faster on later turns. Anthropic and OpenAI both document prefix-based caching that rewards stable repeated context over multi-turn sessions.
So when evaluating token usage in a coding agent, do not judge the session by the first turn alone. Look at the full conversation. The first turn often establishes the reusable prefix; later turns are where prompt caching usually starts to pay off.