How many of your API calls are re-processing the same instructions?
I couldn't answer that question either. I was shipping a 4,000-token context block to Claude Opus 4.7 with every request, about a cent per call. I kept moving on because it worked. Then one day I did the math: $300 a month, most of it spent re-processing instructions the model had already seen thousands of times.
I started looking at what was actually changing per request and what stayed the same. System prompt? Same. Instruction block? Same. Company style guide? Same. User question? Different.
That's when prompt caching clicked. But here's the thing: I didn't really understand why I was doing it until I had to explain it to someone else.
The Question You Can't Answer Without Implementing It
Prompt caching seems simple: mark some blocks of your request as "cache this," and Claude reads them from cache on the next request. Cache reads bill at roughly 10% of the normal input rate, a 90% cost reduction on cached tokens.
But it only works if you truly understand what in your request is static and what's dynamic. A lot of developers think they know the difference. They don't. Not until they try to cache.
I'll show you the code in a second. But first: the hard part isn't the code. It's the thinking.
Why Caching Matters
The obvious reason: cost. A 2,000-token system prompt that you send 10,000 times a month is 20 million identical input tokens. Caching cuts that bill to nearly nothing.
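Quick back-of-envelope math. The $15-per-million-token rate below is an assumption for an Opus-class model; check the current rate card for yours:

const PROMPT_TOKENS = 2000;      // the static system prompt
const CALLS_PER_MONTH = 10000;
const PRICE_PER_MTOK = 15;       // assumed USD per million input tokens
const CACHE_READ_RATE = 0.1;     // cache reads bill at ~10% of base

const uncached = (PROMPT_TOKENS * CALLS_PER_MONTH / 1_000_000) * PRICE_PER_MTOK;
const cached = uncached * CACHE_READ_RATE;

console.log(`uncached: $${uncached.toFixed(2)}/mo`); // uncached: $300.00/mo
console.log(`cached:   ~$${cached.toFixed(2)}/mo`);  // cached:   ~$30.00/mo
// (ignores the one-time cache-write premium on the first call)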
The actual reason you should care: caching forces you to author your prompts consciously. Either your instructions are truly reusable, in which case you cache them, or they're not, in which case you've got a mess in your code.
Here's what happens when you try to cache without thinking: you'll put a user ID in the system prompt. Or today's date. Or a flag that changes per request. Then you'll wonder why the cache never hits. You'll add logging. You'll go three layers deep into debugging. And then you'll realize: you never actually separated "stable instruction" from "user input."
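Here's a concrete version of that mistake, with hypothetical userId and question variables standing in for your per-request data:

// BAD: dynamic values baked into the "static" block. The bytes differ
// on every request, so the cache never hits.
const badSystem = `You are a support agent.
Today's date: ${new Date().toISOString()}
Current user: ${userId}
Follow the style guide below...`;

// GOOD: the cached block contains only what never changes. The date and
// user ID travel in the (uncached) user message instead.
const goodSystem = `You are a support agent. Follow the style guide below...`;
const userMessage = `[date: ${new Date().toISOString()}] [user: ${userId}]\n${question}`;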
That moment is worth the whole exercise, regardless of the cost savings.
The Code
const Anthropic = require("@anthropic-ai/sdk");

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

const STYLE_GUIDE = `
# Code Style Standards

## Naming Conventions
- Variables: camelCase
- Classes: PascalCase
- Constants: UPPER_SNAKE_CASE
- Private methods: _leading underscore

## Comments
- Comment the why, not the what
- Every public function gets a JSDoc block

...
[imagine 3,000 tokens of actual guidelines]
`;

async function reviewCode(userSubmittedCode) {
  const response = await client.messages.create({
    model: "claude-opus-4-7",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: "You are a code reviewer. Apply the style guide strictly. Be direct and specific.",
      },
      {
        // cache_control marks the end of the cached prefix: everything
        // up to and including this block gets cached
        type: "text",
        text: STYLE_GUIDE,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [
      {
        role: "user",
        // only this part changes per request, so it stays out of the cache
        content: `Review this code:\n\n${userSubmittedCode}`,
      },
    ],
  });

  return response.content[0].text;
}
That cache_control: { type: "ephemeral" } tells Claude to cache everything in the prefix up to and including the STYLE_GUIDE block. The first call pays to process it, plus a small cache-write premium over the normal input rate. Every call within the next 5 minutes reads it from cache, and each hit resets the 5-minute clock.
You can verify the hit in the response:
const response = await client.messages.create({...});
console.log(response.usage.cache_creation_input_tokens); // first request
console.log(response.usage.cache_read_input_tokens); // subsequent requests
If cache_read_input_tokens > 0, the cache worked.
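If you want that check on every call instead of in ad-hoc console.logs, a small wrapper does it. (logCacheStats is my own helper, not an SDK feature.)

function logCacheStats(response) {
  const {
    cache_creation_input_tokens: written = 0,
    cache_read_input_tokens: read = 0,
    input_tokens: fresh,
  } = response.usage;

  if (read > 0) {
    console.log(`cache HIT: ${read} tokens read from cache`);
  } else if (written > 0) {
    console.log(`cache WRITE: ${written} tokens stored for next time`);
  } else {
    console.log(`cache MISS: ${fresh} tokens processed from scratch`);
  }
  return response;
}

// usage: const response = logCacheStats(await client.messages.create({...}));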
The Gotcha That Catches Everyone
Cache hits depend on exact byte-for-byte matching. Exact.
// Cache HIT: byte-identical prefix on consecutive requests
const guide = `You are a code reviewer...`;  // request 1
const guide = `You are a code reviewer...`;  // request 2, same bytes

// Cache MISS: whitespace counts
const guide = `You are a code reviewer...`;  // request 1
const guide = `You are a code reviewer... `; // request 2, trailing space

// Cache MISS: different model
// switching from claude-opus-4-7 to claude-sonnet-4-6 invalidates the cache

// Cache MISS: anything before the cached block changed
// The cache key covers the entire prefix up to the cache_control marker,
// so varying any content earlier in the request breaks the match.
This is where understanding matters. You can't just scatter cache_control through your code and expect it to work. You have to commit to treating those blocks as constants.
A lot of developers find this frustrating. I found it clarifying. If I'm shipping code where the instructions change per request, the cache can't help. So either: rewrite the instructions to not change, or remove the cache. Either way, I'm forced to make a choice.
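The pattern that keeps me honest: assemble the cached prefix in exactly one place, so nobody can quietly sneak a dynamic value into it. (buildSystem is my own convention, not an SDK requirement.)

const REVIEWER_PERSONA =
  "You are a code reviewer. Apply the style guide strictly. Be direct and specific.";

// One source of truth for the cached prefix. Anything dynamic is
// forbidden here; it goes in the user message instead.
function buildSystem() {
  return [
    { type: "text", text: REVIEWER_PERSONA },
    { type: "text", text: STYLE_GUIDE, cache_control: { type: "ephemeral" } },
  ];
}

// Every call site uses buildSystem(), so "just adding the date" to the
// prompt becomes an obvious, reviewable change instead of a silent cache-killer.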
Where Caching Actually Wins
- Batch processing with consistent context: running 100 files through the same analysis rules.
- Customer support: load the knowledge base once; every user question reuses it.
- Long-form analysis: load a document, then ask 20 follow-up questions against it. Only the first question pays for the document (sketch after this list).
- Code generation with a stable system persona and reference library.
- Anything where you're making multiple requests with the same expensive-to-process input.
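Here's roughly what that long-form analysis case looks like: the document block carries the cache_control marker, and only the question changes between calls. A sketch, assuming documentText is large enough to cache (models have a minimum cacheable prompt length):

async function askAboutDocument(documentText, question) {
  const response = await client.messages.create({
    model: "claude-opus-4-7",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: documentText,
            cache_control: { type: "ephemeral" }, // cache the document itself
          },
          {
            type: "text",
            text: question, // comes after the cached block, free to vary
          },
        ],
      },
    ],
  });
  return response.content[0].text;
}

// Question 1 pays to process the document; questions 2 through 20
// (within the cache window) read it from cache.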
The Thing Nobody Talks About
I started caching because I wanted to cut costs. I kept caching because it made my code clearer.
When you have to decide "is this instruction static or dynamic?" you start thinking about what your prompts actually are. You stop cargo-culting patterns. You own the code because you can explain it.
That matters more than the 85% cost reduction.