The 1M Context Window vs Prompt Caching: When to Use Which

#ai #productivity #claudecode #automation

1M context costs full price on every query, caching cuts repeated tokens to 1/10
Use 1M for one-shot deep dives, caching for repeated calls against fixed docs
Hybrid: cache the stable 80%, stream the dynamic 20% fresh
Real numbers from my own pipeline show caching paying off after 3 calls

I spent two weeks running the same workload two different ways. One used the full 1M token context window on every call. The other cached a fixed system prompt and only paid full price for the new bits. The cached version cost roughly a tenth as much once I made more than three calls against the same documents. Here is how to decide which one you actually need.

What the 1M context window actually charges you for

The 1M context window is a flat deal. You load up to a million tokens of text, code, or documentation into a single request, and Claude reads all of it before answering. The appeal is obvious. You drop an entire codebase, a long PDF, or a year of chat logs into one prompt and ask one question. No chunking, no vector search, no retrieval pipeline to maintain.

The catch is that you pay for every token on every single request. If your context is 400,000 tokens and you ask ten questions, you pay for 4,000,000 input tokens total. The model does not remember the previous request. Each call is a fresh read of the whole thing.

I tested this with a 280,000 token document set. One question cost me a specific amount. Ten questions cost ten times that. The numbers add up fast when you are iterating, and iterating is what real work looks like. You rarely ask one perfect question. You ask, read, refine, ask again.

So the 1M window is a convenience tax. You trade money for not having to build infrastructure. For a one-time job that is a great trade. I once needed to audit a legal-style contract bundle that ran to 190,000 tokens. I asked four questions, got my answers, and closed the tab. Building a retrieval system for that would have taken longer than the task. The flat per-call cost was worth it.

Where it stops being worth it is repetition. The moment you find yourself asking the same model the same set of background documents over and over, you are paying full freight for tokens that never change. That is the exact situation prompt caching was designed to fix. If you want the deeper context on loading large documents in one shot, see 1M Context Window Workflows.

The mental model: 1M context is a taxi. Great for a single trip. Terrible if you ride it to the same office every day.

How prompt caching changes the math

Prompt caching lets you mark a chunk of your prompt as reusable. The first time Claude reads it, you pay a small write premium. Every call after that, you pay roughly a tenth of the normal input price for those cached tokens. The cache lives for a short window, and there is a longer cache option that holds for an hour if you keep it warm.

Here is the structure that matters. A typical request has two parts. There is the stable part (your system prompt, your reference documents, your instructions, your examples) and the dynamic part (the actual user question or the new data for this specific call). The stable part is usually the big one. In my pipelines the fixed context is often 85 to 90 percent of the total token count.

So you cache the stable 85 percent. The first call writes it. Every call after reads it at a tenth of the price. The dynamic question, which is small, gets charged at full rate but it is tiny so who cares.

Let me put numbers on it. Say my fixed context is 200,000 tokens and my question is 2,000 tokens. With the 1M window, every call charges all 202,000 tokens at full price. With caching, the first call pays a write premium on the 200,000 plus full price on the 2,000. Every subsequent call pays one tenth on the 200,000 plus full price on the 2,000. By the third call the caching approach is already ahead. By the tenth call it is not close.

The break-even point in my testing landed at around three calls. Below three calls, the cache write premium means you barely save anything, and the 1M flat read is simpler. Above three calls, caching wins and keeps winning. The more you reuse, the bigger the gap.

I covered the timing details and the one-hour cache window in more depth at Claude Code 1-Hour Caching Update. The hour-long option matters because the default cache expires in minutes. If your calls are spread out across a work session, the longer window keeps the cache alive between them so you do not pay the write premium twice.

Caching is the monthly transit pass. Pay once, ride cheap all day.

The hybrid pattern that runs my actual pipeline

In practice I do not pick one. I run a hybrid. The trick is splitting your prompt cleanly into the part that never changes and the part that always does, then handling each correctly.

My content pipeline is a good example. I generate descriptions, social posts, and metadata for products in my Shopify store. The stable part is enormous and fixed: brand voice guidelines, formatting rules, a long list of example outputs I want the model to match, and the rules about what words I can never use. That block is around 40,000 tokens and it is identical for every single product.

So I cache it. Once. Then for each product I send the small dynamic chunk: the product name, the key features, the angle I want. That is maybe 800 tokens. I process 60 products in a session. With caching, I pay the 40,000 token write premium one time, then 60 reads at a tenth of the price plus 60 small dynamic questions.

If I had used the 1M window flat, I would have paid 40,000 plus 800 tokens at full price, 60 times over. That is 2.4 million input tokens versus the cached path that costs a fraction of it. The savings on a single batch run were dramatic, and I run that batch every week.

The ordering rule is strict. Cached content must come first in the prompt and stay byte-for-byte identical across calls. If you change one character in the cached block, the cache misses and you pay the write premium again. So I keep my system prompt in a separate file and never edit it mid-batch. The dynamic content always goes at the end.

Where the hybrid breaks down is when your "stable" part is not actually stable. If you are constantly tweaking instructions, you keep busting the cache and paying write premiums over and over. In that case the simple 1M flat read is honestly less hassle. I learned this the hard way during a week when I kept editing my brand rules between every run. The cache never warmed up and I paid the premium maybe 30 times.

One more pattern: cache the documents, stream the conversation. For a long research session against a fixed report, cache the report and let the back-and-forth questions ride cheap on top. That single setup saved me the most.

A decision checklist you can apply in five seconds

I made this simple because I was tired of overthinking it. Here is the actual flow I run in my head before every job.

First question: am I going to call the model more than three times against the same background material? If no, use the 1M window flat. The convenience beats the marginal savings, and you skip the engineering. A one-off contract review, a single long document summary, a one-time codebase question. Just load it and ask.

Second question: if yes, is the background material truly fixed? If I will not edit it during the session, cache it. Brand guidelines, reference docs, a stable codebase snapshot, a fixed dataset. These are perfect cache candidates because they never change between calls.

Third question: is my session spread out over time? If my calls happen within a few minutes, the default cache is fine. If they are spread across an hour of work, I use the longer cache window so it does not expire and force a re-write. This one catches people. They cache correctly, then take a 20 minute coffee break, come back, and pay the write premium again because the cache died.

Fourth question: is the dynamic part small relative to the stable part? Caching only pays off when the fixed block dwarfs the changing block. If your question is as big as your context, caching saves you less because you are still paying full price on a big dynamic chunk. The sweet spot is a huge fixed context and a tiny dynamic question.

When I am scheduling the social posts that come out of this pipeline, I push them through Buffer so the timing is handled separately from the generation cost. Keeping those concerns split makes the cost picture clearer. I know exactly what each batch of generation costs and exactly what scheduling costs, and neither hides inside the other.

The honest summary: 1M context is for exploration and one-shots. Caching is for production loops. Hybrid is for everything serious that runs more than once. Do not cache something you only touch once, and do not flat-read something you touch fifty times. That single rule covers 90 percent of real decisions.

Bottom Line

The choice is not about which feature is better. It is about how many times you call. One call, the 1M window wins on simplicity. Many calls against fixed material, caching wins on cost by roughly ten to one once you pass three calls. And most real pipelines want the hybrid: cache the stable 85 percent, stream the dynamic 15 percent fresh.

I run this every week and the difference between getting it right and getting it wrong is the difference between a cheap batch and an expensive one. The setup takes a few minutes once you understand the ordering rule and the cache lifetime. After that it just runs.

If you want the full system I use to build these AI pipelines end to end, the Claude Blueprint walks through the whole setup, including how I structure cached prompts and split fixed from dynamic content. Start there if you are building anything that calls the model more than a handful of times. Get the split right once and you stop thinking about it.

This article contains affiliate links. If you sign up through them, I may earn a small commission at no extra cost to you. (Ad)