The first time a client of mine got their Anthropic invoice, they paid for it in panic. The automation was working — Claude was triaging support tickets against a 30-page knowledge base, accurately, around the clock — and the monthly bill had a comma in it that hadn't been there before. Most of that bill was a tax I didn't have to pay. I just hadn't turned on prompt caching yet.
I run n8n in production for clients whose workflows lean on Claude for the reasoning step — lead enrichment, document extraction, customer-support triage. Across that book of work, turning on Anthropic prompt caching properly cut the input-token bill by roughly 70% and shaved noticeable latency off every warm execution. This post is the conversation I wish I'd had before I shipped the first version.
TL;DR
- Prompt caching lets Anthropic store a static prefix of your prompt (system, tools, RAG context, few-shot examples) and replay it on subsequent calls.
- Cache reads cost ~10% of normal input tokens. Cache writes cost 1.25× for the 5-minute TTL and 2× for the 1-hour TTL.
- For a workflow that re-uses the same ~10K-token prefix across 5,000 monthly executions, that's the difference between ~$150/mo and ~$27/mo on the input side.
- It pays off only if the prefix is reused enough times within the TTL window. Two calls (one write, one read) to break even at the 5-minute TTL; three at the 1-hour.
- In n8n, the trick is making the messages array byte-identical across executions and putting `cache_control` on the last stable block.
What Prompt Caching Actually Is
Anthropic prompt caching is a prefix match. You mark a point in your prompt with a cache_control breakpoint, and Anthropic stores the bytes up to that point. The next request that sends the exact same prefix skips the work of re-reading those tokens — you pay roughly 10% of the normal input price for them, and the model gets to your fresh content faster.
The render order is fixed: tools → system → messages. A breakpoint on the last system block caches your tool definitions and your system prompt together. A breakpoint on the most recent message extends the cache through the conversation history.
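To make that ordering concrete, here's a sketch of a request body with one tool (the `route_ticket` tool and its schema are placeholders, not from any real workflow). Because tools render before system, a single breakpoint on the last system block caches both:

```json
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "tools": [
    {
      "name": "route_ticket",
      "description": "Route a support ticket to a named queue.",
      "input_schema": {
        "type": "object",
        "properties": { "queue": { "type": "string" } },
        "required": ["queue"]
      }
    }
  ],
  "system": [
    {
      "type": "text",
      "text": "<instructions + context: the stable prefix>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "<fresh content, never cached>" }
  ]
}
```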
The cache is keyed off the literal bytes. One floating timestamp in your system prompt, one unsorted JSON dump, one user ID interpolated into the wrong place, and the prefix mismatches — you pay full price every time and never see a single cache read.
How the Pricing Actually Breaks Down
This is the part that surprised me. The cache isn't free — writing to it costs more than a normal call. The economics only work if you read the same cached prefix back enough times to amortize that write.
| Token type | Multiplier vs. base input price |
|---|---|
| Normal input token | 1.0× |
| Cache read | 0.1× (~90% discount) |
| 5-minute cache write | 1.25× |
| 1-hour cache write | 2.0× |
Two implications:
- Five-minute TTL pays off after one cache read. Write at 1.25× plus one read at 0.1× equals 1.35× — already cheaper than two uncached calls at 2.0×.
- One-hour TTL needs at least three calls within the hour to break even. Write at 2.0× plus two reads at 0.1× each equals 2.2× — cheaper than three uncached calls at 3.0×.
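The same arithmetic as a sanity check you can run yourself. A minimal TypeScript sketch; the multipliers are the published ones from the table above, the call counts are hypothetical:

```typescript
// Cost of N calls that share one cached prefix, in units of
// "one uncached pass over the prefix" (1.0 = base input price).
function cachedCost(calls: number, writeMultiplier: number): number {
  const READ = 0.1; // cache read: ~10% of the base input price
  return writeMultiplier + (calls - 1) * READ; // one write, then reads
}

const uncached = (calls: number): number => calls * 1.0;

// 5-minute TTL (write at 1.25x): cheaper from the second call on.
console.log(cachedCost(2, 1.25), uncached(2)); // 1.35 vs 2.0
// 1-hour TTL (write at 2.0x): the second call isn't enough...
console.log(cachedCost(2, 2.0), uncached(2)); // 2.1 vs 2.0
// ...but a third call inside the hour tips it.
console.log(cachedCost(3, 2.0), uncached(3)); // 2.2 vs 3.0
```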
The 1-hour TTL exists for bursty workflows where you don't want the cache to expire between runs. If your support triage fires every 30 seconds, the default 5-minute TTL is fine. If it runs in a daily batch, the 1-hour TTL prevents every batch from cold-starting the cache.
What to Cache vs. What Not to Cache
This is the single biggest decision and the one I see people get wrong. The rule:
Cache the stable prefix. Send the volatile part after it.
| Put in the cached block | Keep out of the cached block |
|---|---|
| System prompt (instructions, persona) | The user's actual question |
| Tool definitions (sorted, deterministic) | Per-user IDs, session IDs, request IDs |
| Knowledge base / RAG context | Today's date or `Date.now()` |
| Few-shot examples | Per-customer variables (name, account, plan) |
| Style guides, format specs | Anything that changes between runs |
If you interpolate current_date into your system prompt header, the prefix changes every day and your cache invalidates daily for no reason. If you stuff a user's name into the system prompt to make Claude "personable," every user gets their own private cache and you lose the cross-customer reuse that makes caching valuable.
The fix is structural: keep the system prompt frozen, and pass anything dynamic as a user-turn message after the cached prefix.
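Here's a sketch of that structure as an n8n HTTP Request body. The `customer_name` and `customer_message` fields are hypothetical names for your own data; `$now` is n8n's built-in Luxon date object:

```json
{
  "system": [
    {
      "type": "text",
      "text": "<frozen instructions + style guide: no dates, no user names>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "Today's date: {{ $now.toISODate() }}\nCustomer: {{ $json.customer_name }}\n\n{{ $json.customer_message }}"
    }
  ]
}
```

The date and the customer name still reach Claude; they just ride in the cheap, uncached user turn instead of invalidating the expensive prefix.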
The Minimum Cacheable Size (and Why Short Prompts Silently Don't Cache)
Caching only kicks in once you cross a model-specific threshold. Below it, the API still accepts the cache_control marker, but cache_creation_input_tokens comes back zero — silently. No error.
| Model | Minimum cacheable prefix |
|---|---|
| Claude Opus 4.7, Opus 4.6 | 4,096 tokens |
| Claude Sonnet 4.6 | 1,024 tokens |
| Claude Haiku 4.5 | 4,096 tokens |
This caught me out once. I'd built a Sonnet-4.5-era workflow that cached fine on a 2,500-token prefix, then upgraded to Opus 4.7 and watched my cache hit rate go to zero — the same prefix was now under the higher minimum. If your prefix is short, either pad it with stable context (more retrieved docs, a longer style guide) or skip caching entirely.
You also get up to four cache_control breakpoints per request. In practice you almost never need more than one, but multiple breakpoints are useful if part of your prefix changes per session and another part changes per day — you cache each segment with its own TTL.
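A sketch of what a two-breakpoint system array can look like, assuming per-block TTLs via the `ttl` field (on some API versions the 1-hour TTL needs an extended-TTL beta header, so treat this as a shape, not gospel). Longer-TTL blocks have to come before shorter-TTL ones:

```json
"system": [
  {
    "type": "text",
    "text": "<knowledge base: refreshed roughly daily>",
    "cache_control": { "type": "ephemeral", "ttl": "1h" }
  },
  {
    "type": "text",
    "text": "<session context: changes every few minutes>",
    "cache_control": { "type": "ephemeral", "ttl": "5m" }
  }
]
```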
A Worked Example: Customer-Support Triage Through n8n
This is a representative shape from the workflows I run. A customer support inbox forwards each new message to an n8n webhook. The workflow loads a 30-page product knowledge base (~8,000 tokens), passes it to Claude Sonnet 4.6 along with five few-shot examples (~1,500 tokens) and a system prompt (~500 tokens), then routes the response based on Claude's classification.
Total stable prefix: ~10,000 input tokens per run.
The workflow runs ~5,000 times a month.
Without caching, on Sonnet 4.6 at $3 per million input tokens:
- 5,000 runs × 10,000 tokens = 50M input tokens
- 50M × $3/M = $150/month on input alone
With 5-minute caching, assuming the support volume keeps the cache warm during business hours and the cache rebuilds maybe 12 times a day (cold starts after lulls):
- ~360 cache writes/month × 10,000 tokens × 1.25× = ~$13.50
- ~4,640 cache reads/month × 10,000 tokens × 0.1× = ~$13.92
- Total: ~$27/month on input — about 82% lower than uncached.
That's the structural pattern. The exact numbers depend on traffic shape — a workflow that fires once an hour will see worse cache reuse than one that fires every 30 seconds. Output tokens cost the same either way; this is purely an input-side optimization.
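If you want to plug in your own traffic shape, here's the worked example as a small TypeScript model. Prices and volumes are the assumptions above, not live Anthropic pricing:

```typescript
// Monthly input-side cost for one workflow, cached vs. uncached.
const PRICE_PER_MTOK = 3.0; // Sonnet input price, $/million tokens
const PREFIX_TOKENS = 10_000; // stable prefix per run
const RUNS = 5_000; // executions per month
const COLD_STARTS = 360; // cache rebuilds per month (~12/day)

const mtok = (runs: number): number => (runs * PREFIX_TOKENS) / 1_000_000;

const uncached = mtok(RUNS) * PRICE_PER_MTOK;
const writes = mtok(COLD_STARTS) * PRICE_PER_MTOK * 1.25;
const reads = mtok(RUNS - COLD_STARTS) * PRICE_PER_MTOK * 0.1;

console.log(uncached.toFixed(2)); // 150.00
console.log((writes + reads).toFixed(2)); // 27.42 (13.50 + 13.92)
```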
How This Maps to n8n's HTTP Request Node
n8n calls Anthropic via the HTTP Request node (or the dedicated Anthropic node, but the HTTP node gives you the most control). The request body is what matters.
Three things have to be true on every execution:
- The `tools` array, system prompt, and the cached message blocks are byte-identical across runs. Sort tool definitions by name (one way to do that is sketched below). Don't include timestamps. Don't interpolate workflow variables into the system prompt unless they're truly stable.
- The dynamic content goes in a separate user-turn message after the cached prefix. Use n8n's expression syntax (`{{ $json.user_message }}`) only in the dynamic part, never in the cached part.
- A `cache_control: {"type": "ephemeral"}` block sits on the last stable content block. That's the breakpoint. Everything before it gets cached.
A minimal body shape looks like this (simplified):
```json
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": [
    {
      "type": "text",
      "text": "You are a support triage agent. Knowledge base follows.\n\n<KB CONTENT — ~8K tokens, identical every run>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "{{ $json.customer_message }}" }
  ]
}
```
The cache_control marker on the last system block caches both the system prompt and anything in the tools array (which renders before system). The user message is the only thing that changes per run.
To verify caching is working, inspect the response — Anthropic returns usage.cache_creation_input_tokens (tokens written this call) and usage.cache_read_input_tokens (tokens served from cache this call). If reads are zero across repeated calls, you have a silent invalidator somewhere in the prefix. The most common culprits are non-deterministic JSON serialization, a date in the system prompt, or the workflow accidentally rebuilding the tool list each run.
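In n8n, a small Code node right after the HTTP Request node makes a cheap tripwire. The usage field names are Anthropic's documented ones; the wiring is a sketch:

```typescript
// n8n Code node: surface cache behavior from the Anthropic response.
const usage = $input.first().json.usage as {
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
};

const written = usage.cache_creation_input_tokens ?? 0;
const read = usage.cache_read_input_tokens ?? 0;

if (written === 0 && read === 0) {
  // Prefix under the minimum, or a silent invalidator upstream.
  console.log('Prompt cache inactive: check prefix size and determinism');
}

return [{ json: { cacheWritten: written, cacheRead: read } }];
```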
Warming the Cache on Workflow Start
For workflows that don't fire frequently enough to keep the cache warm, the trick I use is to fire one no-op call at the start of the workflow with the full cached prefix and a trivial user message — something like "Acknowledge with OK.". That call pays the cache write cost. Every subsequent call in that execution path reads from the cache at the discounted rate.
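The warm call is just the production body with the user turn swapped out and `max_tokens` pinned low, so the throwaway output costs next to nothing. A sketch:

```json
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 8,
  "system": [
    {
      "type": "text",
      "text": "<the exact same stable prefix the real calls send, byte for byte>",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [
    { "role": "user", "content": "Acknowledge with OK." }
  ]
}
```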
This is especially worth it for batch jobs that hit Claude many times in quick succession on the same prefix — nightly enrichment runs, bulk document extractions, cron-triggered reports.
Common Mistakes I See
- Putting `cache_control` on a prefix that's under the minimum. No error, no cache. Always check `cache_creation_input_tokens` on the first call.
- Interpolating dynamic data into the system prompt. A `current_time` variable, a session ID, a feature-flag toggle — any of these in the system prompt invalidates the cache for every other request that doesn't share that exact value.
- Reordering or rebuilding tools per request. If your workflow constructs the `tools` array dynamically and the order isn't deterministic, you'll see zero cache hits. Sort by tool name and freeze it.
- Forgetting to update `cache_control` placement when you grow the conversation. In a multi-turn agent, you want the breakpoint on the last stable block — usually the most recent assistant turn (see the sketch after this list). Leave it on an old block and you stop caching anything new.
- Caching a prefix that's only used once. The write costs more than a normal call. If you're calling Claude once and never again with that prefix, you've made it 25% more expensive at the 5-minute TTL (and double at 1-hour), not cheaper.
- Switching models mid-conversation. Caches are model-scoped — moving from Sonnet to Opus invalidates everything. Pick a model and stay on it for the workflow.
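For the multi-turn case, the shape that keeps the cache growing with the conversation looks like this sketch: the marker sits on the latest stable turn and moves forward on every request, while the fresh question stays outside it.

```json
"messages": [
  { "role": "user", "content": "<turn 1>" },
  {
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "<turn 1 reply>",
        "cache_control": { "type": "ephemeral" }
      }
    ]
  },
  { "role": "user", "content": "<fresh turn 2 question>" }
]
```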
Latency, Not Just Cost
The cost story is the headline, but caching also makes warm calls noticeably faster. The tokens in the cached prefix don't have to be re-processed by the model on every call. For a workflow with a 10K-token prefix and a 200-token user message, the time-to-first-token on warm hits drops by roughly the ratio of prefix to total input — sometimes by half a second or more.
For a customer-facing automation (a chatbot, a real-time triage system, a voice agent), that latency improvement is worth real money on its own.
How Smart AI Workspace Approaches This
When I take over a Claude-driven n8n workflow that's running too expensive, the first audit is always the same: pull the request bodies, check the prefix size, look at how often the same prefix repeats, and compare that against cache_read_input_tokens in the response usage data. The fixes are almost always structural — moving a date out of the system prompt, sorting the tool array, separating the dynamic user message from the static context.
This pattern compounds with the architecture choices I wrote about in the n8n vs Zapier vs Make pillar — the same workflow that costs you per-task on Zapier costs you per-execution on n8n, and now also costs you 10% of the input price on the cached portion. The cost gap between "Zapier with no caching" and "n8n with prompt caching" on a high-volume, AI-heavy workflow is the difference between an expensive line item and a rounding error.
If you're running Claude through automation — n8n, Make, custom code, anything — and your monthly bill keeps drifting up, prompt caching is almost always the first lever to pull. Anthropic's official prompt caching docs cover the API surface; the structural decisions above are what determine whether it actually works in production.
Cut Your Claude Bill For Me Instead
If you've read this far and the answer you actually want is "just audit my Claude workflows, fix the caching, and hand me back a 70%-lower invoice," that's exactly what I do. I'll trace the request bodies, find the silent invalidators, restructure the prompt prefix, and ship the change on infrastructure you own.
Talk to Smart AI Workspace about your Claude costs →
More from Smart AI Workspace
- 🌐 Website: www.smartaiworkspace.tech
- 📧 Email: info@smartaiworkspace.tech
- ▶️ YouTube: @SmartAIWorkspace
Sources: Anthropic Prompt Caching Docs · Anthropic API Pricing 2026 · Anthropic API Pricing — Finout 2026 · Claude Prompt Caching Cost Study (Du'An Lightfoot) · Claude Prompt Caching TTL Analysis 2026