Ravi Patel

Posted on • Originally published at ssimplifi.com

OpenAI prompt caching, explained: automatic, free to enable, 90% off cached input tokens

OpenAI's prompt caching is the easiest LLM cost-reduction technique to deploy because there's nothing to deploy. The cache engages automatically on any prompt of 1,024 tokens or more; cached portions of the prompt are billed at 10% of the normal input price (a 90% discount); and the savings show up in the cached_tokens field of the response's usage block. No markers to attach, no SDK upgrade required, no caller-side configuration. If your application has a system prompt of at least 1,024 tokens that's stable across requests (true of most production applications), the discount is either already engaging or will engage the moment you stabilise the leading content. This post walks through the mechanics, the math, the gotchas, and the production patterns that maximise cache hit rate. It pairs with the Anthropic prompt caching deep dive: same underlying concept, similar discount, different implementation.

The parent guide AI API caching covers the broader caching strategy; this article is the OpenAI-specific deep dive.

What it caches and why

Like Anthropic's, OpenAI's prompt cache is provider-side prefix-attention caching. When a request arrives with a prompt prefix the provider has seen recently, OpenAI serves the cached attention state rather than recomputing it from scratch. The response still gets generated token-by-token; what gets discounted is the input-token billing on the cached portion.

The mechanism is conceptually simple: the model has to encode the input prompt into its internal representation before generating a response. For long stable system prompts (often thousands of tokens of instructions, retrieved context, tool definitions), this encoding step is non-trivial compute. If the same prefix shows up repeatedly, the provider can reuse the cached representation. OpenAI passes the savings on as a 90% input-token discount on the cached portion.

The catch with all provider-side caching: it's opaque. You can't directly inspect what's cached; you can only observe its effects via the cached_tokens field returned in the response's usage block. The provider decides what to cache and for how long; you control whether your prompts are cacheable by keeping the prefix stable.

The pricing math

The mechanics:

| Token category | Price multiplier (vs base input price) | Notes |
|---|---|---|
| Normal input (uncached) | 1.0x | Standard input pricing |
| Cached input | 0.1x | The 90% discount; applies automatically on prompts ≥1,024 tokens |
| Output | Normal output pricing | Unchanged |

Concretely:

  • GPT-5.5: $5.00/M input → $0.50/M cached (a $4.50/M saving on every cached token)
  • GPT-5.4: $2.50/M input → $0.25/M cached
  • GPT-5.4 Mini: $0.75/M input → $0.075/M cached

No write premium. Unlike Anthropic's 25%-or-100% write premium on first writes, OpenAI doesn't charge extra for cache writes. The first request pays normal input price; subsequent cache hits pay 0.1x. Break-even is immediate — every cache hit is pure saving.

Worked savings on a typical workload:

Assume a customer support chatbot built on GPT-5.4:

  • 50,000 requests/day
  • Average prompt: 1,500 tokens (1,400-token stable system prompt + 100-token user message)
  • Average output: 200 tokens

Without caching: 50,000 × (1,500 × $2.50 + 200 × $15) / 1M = $337.50/day

With caching (assume 90% of the stable system-prompt tokens hit the cache after warm-up):

  • Cached input: 50,000 × 1,400 × 0.9 × $0.25 / 1M = $15.75/day
  • Uncached input: 50,000 × (1,400 × 0.1 + 100) × $2.50 / 1M = $30.00/day
  • Output: 50,000 × 200 × $15 / 1M = $150.00/day
  • Total: $195.75/day

Net saving: ~42% on the total bill, or ~76% on the input-token portion (input spend falls from $187.50/day to $45.75/day). Workloads with longer outputs see a smaller total-bill reduction because output isn't discounted; workloads with longer inputs see bigger total savings.
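To make the arithmetic easy to rerun with your own numbers, here is a minimal sketch of the same calculation. The function name daily_cost and its parameters are illustrative, not part of any SDK; the prices and the 90% hit rate are the assumptions stated above.

```python
def daily_cost(requests, stable_tokens, variable_tokens, output_tokens,
               input_price, output_price, hit_rate=0.0, cached_multiplier=0.1):
    """Daily cost breakdown in dollars; prices are dollars per million tokens."""
    # Only the stable prefix can hit the cache; hit_rate is the share that does.
    cached_in = requests * stable_tokens * hit_rate
    uncached_in = requests * (stable_tokens * (1 - hit_rate) + variable_tokens)
    out = requests * output_tokens
    return {
        "cached_input": cached_in * input_price * cached_multiplier / 1_000_000,
        "uncached_input": uncached_in * input_price / 1_000_000,
        "output": out * output_price / 1_000_000,
    }

# The support-chatbot workload from the example above.
workload = dict(requests=50_000, stable_tokens=1_400, variable_tokens=100,
                output_tokens=200, input_price=2.50, output_price=15.00)

baseline = sum(daily_cost(**workload).values())
with_cache = sum(daily_cost(**workload, hit_rate=0.9).values())

print(f"without caching: ${baseline:.2f}/day")             # $337.50/day
print(f"with caching:    ${with_cache:.2f}/day")           # $195.75/day
print(f"total saving:    {1 - with_cache / baseline:.0%}")  # 42%
```

Changing output_tokens or hit_rate shows quickly how sensitive the total saving is to output length and to how warm the cache stays.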

The 1,024-token minimum + 128-token boundary

Two structural rules that determine whether caching engages:

Minimum prompt length: 1,024 tokens. Prompts shorter than this aren't cached. Most production applications have system prompts that comfortably cross this threshold; toy examples and short tool-call workflows often don't.

128-token boundary for additional caching. Beyond the 1,024-token base, OpenAI caches additional content in 128-token chunks. The practical implication: if your prompt is 2,200 tokens, OpenAI may cache around 2,176 tokens (the closest 128-token boundary below the prompt length) and treat the remaining ~24 tokens as uncached.
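The two rules can be expressed as a small estimator. This is a sketch of the rule as described here, not OpenAI's actual implementation; the function and constant names are made up for illustration, and the input is the number of leading tokens a request shares with a recently seen prompt.

```python
MIN_CACHEABLE_TOKENS = 1_024   # minimum prompt length described above
CACHE_INCREMENT = 128          # chunk size beyond the minimum

def estimate_cached_tokens(shared_prefix_tokens: int) -> int:
    """Estimate how many leading tokens could be served from the cache.

    Returns 0 below the 1,024-token minimum; otherwise rounds the shared
    prefix down to the nearest 128-token boundary.
    """
    if shared_prefix_tokens < MIN_CACHEABLE_TOKENS:
        return 0
    return (shared_prefix_tokens // CACHE_INCREMENT) * CACHE_INCREMENT

print(estimate_cached_tokens(2_200))  # 2176
print(estimate_cached_tokens(1_000))  # 0
```

Because 1,024 is itself a multiple of 128, rounding down to a multiple of 128 is equivalent to "1,024 plus whole 128-token chunks".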

The strategic implication: structure your prompt with stable content first, variable content last. The cache key is the leading portion of the prompt; everything before the variable content has a chance to hit the cache.

GOOD STRUCTURE (stable content first):
┌─────────────────────────────────────┐
│ System prompt (1,200 tokens)        │ ← cached after first hit
│ Tool definitions (400 tokens)       │ ← cached after first hit
│ Retrieved context (variable, 600t)  │ ← cacheable if stable across users
│ User message (variable, 50 tokens)  │ ← not cached
└─────────────────────────────────────┘

BAD STRUCTURE (variable content first):
┌─────────────────────────────────────┐
│ User message (50 tokens)            │ ← cache key starts here
│ System prompt (1,200 tokens)        │ ← invalidated by user message variation
│ Tool definitions (400 tokens)       │ ← invalidated
└─────────────────────────────────────┘

If your application has the bad structure, the fix is a one-time refactor that pays for itself within hours of deployment on any meaningful traffic.
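One way to enforce the good structure is to build every request through a single helper that fixes the ordering. The sketch below is an assumption about how such a helper might look (build_messages is a hypothetical name); it places retrieved context in its own message ahead of the user's question so the stable content always leads.

```python
def build_messages(system_prompt: str, retrieved_context: str, user_message: str) -> list[dict]:
    """Order content from most stable to most variable."""
    return [
        # Stable across all requests: forms the cacheable prefix.
        {"role": "system", "content": system_prompt},
        # Semi-stable: extends the cached prefix when the same context repeats.
        {"role": "user", "content": f"Context:\n{retrieved_context}"},
        # Variable: always last, so it never disturbs what comes before it.
        {"role": "user", "content": user_message},
    ]

first = build_messages("...(long stable system prompt)...", "Password policy v3", "How do I reset my password?")
second = build_messages("...(long stable system prompt)...", "Password policy v3", "Can I reuse an old password?")

# The two requests differ only in the final message.
print(first[:-1] == second[:-1])  # True
```

Tool definitions are passed through the separate tools parameter rather than in messages, so the same discipline applies there: keep the list and its ordering identical between requests.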

Reading cache hits from the response

OpenAI returns the cached-tokens count in the response's usage block:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5-4",
    messages=[
        {"role": "system", "content": "...(long stable system prompt)..."},
        {"role": "user", "content": "How do I reset my password?"}
    ]
)

print(response.usage)
# CompletionUsage(
#     prompt_tokens=1532,
#     completion_tokens=87,
#     total_tokens=1619,
#     prompt_tokens_details=PromptTokensDetails(
#         cached_tokens=1408,        # 1,408 tokens hit the cache
#         audio_tokens=0
#     )
# )

The cached_tokens field is the count of input tokens served from the cache (billed at 0.1x). Total cost calculation:

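def calculate_openai_cost(usage, input_price_per_million, output_price_per_million):
    """Return the request cost in dollars, billing cached input tokens at 0.1x."""
    # prompt_tokens_details may be absent or None; treat that as zero cache hits.
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    uncached = usage.prompt_tokens - cached

    cost = (
        uncached * input_price_per_million / 1_000_000
        + cached * input_price_per_million * 0.1 / 1_000_000   # 90% discount
        + usage.completion_tokens * output_price_per_million / 1_000_000
    )
    return cost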

The first thing to check when deploying prompt caching: is cached_tokens non-zero on the second and subsequent requests? If yes, caching is working. If zero, something is wrong — either the prefix is shorter than 1,024 tokens or it's drifting per request.
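A simple way to run that check continuously is to log a per-request hit ratio. The helper below is a sketch: cache_hit_ratio is a hypothetical name, and the SimpleNamespace object stands in for a real response.usage so the example runs without an API call.

```python
from types import SimpleNamespace

def cache_hit_ratio(usage) -> float:
    """Fraction of prompt tokens served from the cache (0.0 if none or unknown)."""
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    if not usage.prompt_tokens:
        return 0.0
    return cached / usage.prompt_tokens

# Stand-in for response.usage, using the numbers from the example output above.
usage = SimpleNamespace(
    prompt_tokens=1532,
    completion_tokens=87,
    prompt_tokens_details=SimpleNamespace(cached_tokens=1408),
)
print(f"{cache_hit_ratio(usage):.1%}")  # 91.9%
```

A ratio stuck at 0% after the first request points at one of the two causes above: a prefix under 1,024 tokens, or a prefix that changes per request.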

TTL — when the cache expires

OpenAI doesn't commit to a precise TTL. Its prompt caching guide describes cached prefixes as typically staying active for about 5-10 minutes of inactivity, which matches what workloads see in practice. Active workloads with consistent traffic see continuous cache hits because each request resets the warming window, and workloads with hits every few minutes still see consistent caching. Workloads with hits every hour or more typically see the cache expire between requests and pay full input price each time.

Production implications:

  • Continuous traffic → cache stays warm continuously. Best case.
  • Bursty traffic (e.g. concentrated during business hours) → caches expire overnight. Each morning's first requests pay full price; the cache warms within a few requests; subsequent traffic hits the warm cache. Acceptable.
  • Sparse traffic (e.g. one request every 30 minutes) → cache expires between requests. Caching effectively never engages. Other techniques (response-level caching, model-tier routing) carry more weight on these workloads.
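The three traffic shapes can be compared with a toy model. This is a sketch under a loud assumption: a fixed 300-second inactivity window (the low end of the 5-10 minute range) that every request resets. The real expiry behaviour is up to the provider, and count_cache_hits is an illustrative name.

```python
def count_cache_hits(request_times, ttl_seconds=300):
    """Count requests that arrive while the prefix is assumed to still be warm.

    request_times: sorted arrival times in seconds.
    ttl_seconds: assumed inactivity window, not a published guarantee.
    """
    hits = 0
    last_seen = None
    for t in request_times:
        if last_seen is not None and t - last_seen <= ttl_seconds:
            hits += 1
        last_seen = t  # every request, hit or miss, restarts the window
    return hits

steady = [i * 60 for i in range(10)]     # one request per minute
sparse = [i * 1800 for i in range(10)]   # one request every 30 minutes

print(count_cache_hits(steady))  # 9
print(count_cache_hits(sparse))  # 0
```

Feeding real request timestamps from your logs into a model like this gives a rough upper bound on the hit rate you can expect before touching the prompt itself.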

The lack of explicit TTL control is the one structural difference vs Anthropic. Anthropic offers an explicit 1-hour extended-TTL option (with a higher write premium); OpenAI doesn't expose TTL as a caller-side dial.

What invalidates the cache

The cache match requires byte-exact match of the leading prompt content. Things that invalidate:

  • Any change to the leading content. Different system prompt, different tool definitions, different leading messages. The fingerprint changes; cache misses.
  • Different model. Cache entries are per-model; a GPT-5.4 cache doesn't serve a GPT-5.4 Mini request.
  • Variable content at the start of the prompt. Timestamps, user IDs, session IDs injected into the system prompt invalidate the cache per request. The most common cause of caching not engaging.
  • Cache TTL elapsed. ~5-10 minutes of inactivity to the same prefix.

Things that don't invalidate:

  • Variable user messages at the end. The cache key is the leading content; the user message is the variable suffix and doesn't affect caching.
  • Different sampling parameters (temperature, top_p, max_tokens). Affect generation, not cache match.
  • Different request IDs, metadata, headers. Not part of the cache key.

The discipline matches the broader prompt cache fingerprinting discipline — keep your leading content stable, and the cache hits.
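Since the provider's cache key isn't visible, a caller-side fingerprint of the leading content is a cheap way to catch drift before it costs money. The sketch below is a diagnostic only: prefix_fingerprint is a hypothetical helper, and its hash is not the key OpenAI uses internally.

```python
import hashlib
import json

def prefix_fingerprint(model, system_prompt, tools=None):
    """Hash the parts of a request that are supposed to stay stable."""
    payload = json.dumps(
        {"model": model, "system": system_prompt, "tools": tools or []},
        sort_keys=True,        # makes the hash independent of dict key order
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

stable_a = prefix_fingerprint("gpt-5-4", "You are a support assistant.")
stable_b = prefix_fingerprint("gpt-5-4", "You are a support assistant.")
drifting = prefix_fingerprint("gpt-5-4", "You are responding at 09:31. You are a support assistant.")

print(stable_a == stable_b)   # True
print(stable_a == drifting)   # False
```

Logging the fingerprint next to cached_tokens makes the usual failure obvious: if the fingerprint changes on every request, the prefix is drifting and the cache can't match it.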

Production patterns that maximise hit rate

The shapes that work in production:

Stable system prompt + retrieved context + user message. The canonical pattern. System prompt and tool definitions go at the very start of the prompt (stable); retrieved context follows (semi-stable, often cached after warm-up); user message at the end (variable, never cached). Almost every production LLM workload looks like this.

Prompt-template versioning. When you update the system prompt, the cache invalidates wholesale. Plan for it: deploy prompt updates during low-traffic windows so the re-warming pain is bounded. The 5-10 minute TTL means caches re-populate quickly once new requests start flowing.

Co-location of variable content. If your application has multiple variable elements (e.g. user's session history + current message), put them together at the end of the prompt rather than scattered through. Reduces accidental invalidations from interleaving variable content into otherwise-stable sections.

Cache-warming for predictable workloads. If your traffic pattern is predictable (e.g. business-hours support chatbot that ramps up at 9 AM), fire a synthetic warm-up request at the start of the active window to populate the cache. The first real user request hits the warmed cache instead of paying full input price.
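A warm-up request can be as small as the stable prefix plus a throwaway user message. The sketch below assumes the standard chat.completions.create call from the openai Python SDK; warm_cache, the "ping" message, and the 16-token output cap are illustrative choices, and whether a single warm-up covers all later traffic depends on provider-side routing you don't control.

```python
def warm_cache(client, model, system_prompt):
    """Send one cheap synthetic request so the stable prefix is processed
    before real traffic arrives. Returns the usage block for logging."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},   # must match production byte for byte
            {"role": "user", "content": "ping"},            # placeholder; content is irrelevant
        ],
        max_completion_tokens=16,   # keep the output cost of the warm-up small
    )
    return response.usage

# Example (requires the openai package and an API key):
# from openai import OpenAI
# usage = warm_cache(OpenAI(), "gpt-5-4", SYSTEM_PROMPT)
```

The warm-up only helps if it sends exactly the prefix production sends, including tool definitions if you use them, so it is safest to build it with the same code path as real requests.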

The anti-patterns

Three patterns that defeat OpenAI's prompt cache:

Timestamps in the system prompt. "You are responding at [timestamp]. [Instructions...]" The cache fingerprint changes per request. Caching never engages. Strip the timestamp; if you need it, put it in the user message or as a metadata field.

Per-user customisation injected into the system prompt. "You are an assistant for user [user_id]. [Generic instructions...]" Same problem: the system prompt varies per user, so every user gets a distinct prefix and requests from different users never share a cache entry. Move per-user customisation to the user message itself, or keep the system prompt generic and place user-specific details after the stable content.

Short system prompts (sub-1024 tokens). The minimum threshold means short prompts don't cache at all. If your system prompt is only 500 tokens, you're not benefiting from prompt caching. Either pad with useful content (additional instructions, examples) until you cross 1,024 tokens, or rely on different cost-reduction techniques.
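The first two anti-patterns are easy to catch mechanically before a prompt ships. Here is a rough sketch of a lint check; the patterns and the find_volatile_content name are assumptions for illustration and will not catch every kind of per-request value.

```python
import re

# Illustrative patterns for values that tend to change on every request.
VOLATILE_PATTERNS = {
    "iso_timestamp": re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
}

def find_volatile_content(system_prompt):
    """Return the names of volatile patterns found in a rendered system prompt."""
    return [name for name, pattern in VOLATILE_PATTERNS.items()
            if pattern.search(system_prompt)]

print(find_volatile_content("You are responding at 2026-05-01T09:30:00Z. Be concise."))
# ['iso_timestamp']
print(find_volatile_content("You are a support assistant. Be concise."))
# []
```

Running a check like this against the rendered prompt (after template substitution) in CI is one way to stop a timestamp or session ID from silently creeping back into the prefix.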

OpenAI vs Anthropic — the surprising near-tie

For most of the prompt-caching era, conventional wisdom was "Anthropic for max savings (90%), OpenAI for simplicity (50%)." That conventional wisdom is now wrong. As of mid-2026, both providers offer the same 90% discount on cached input tokens. The difference is now structural, not magnitude:

| Feature | OpenAI | Anthropic |
|---|---|---|
| Discount on cached input | 90% off (0.1x) | 90% off (0.1x) |
| Write premium | None | 5-min cache: +25% (1.25x base); 1-hour cache: +100% (2x base) |
| Default TTL | ~5-10 min empirical | 5 min |
| Custom TTL | Not exposed | 1-hour extended TTL option (with higher write premium) |
| Caller-side config | None (automatic) | Explicit cache_control marker required |
| Minimum prompt length | 1,024 tokens | 1,024 tokens or more, depending on model (with marker) |

OpenAI now wins on operational simplicity AND matches on savings. No marker discipline, no write premium, no SDK changes — and the 90% discount that was previously Anthropic-exclusive. The right default for teams that want the discount without engineering investment.

Anthropic's only remaining structural advantage: the explicit 1-hour TTL option. For predictable but spaced-out workloads (e.g. one request every 20-30 minutes against the same prompt), Anthropic's 1-hour cache + 2x write premium can beat OpenAI's ~5-10 minute auto-cache that may expire between requests. For typical continuous-traffic workloads the difference is invisible.
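The spaced-out case is easy to quantify with the multipliers from the table. The sketch below makes three simplifying assumptions: both providers charge the same base input price, every request on the automatic cache arrives after it has expired (so each pays 1.0x), and every request on the 1-hour cache arrives within the hour and keeps the entry alive. relative_input_cost is an illustrative name.

```python
def relative_input_cost(n_requests, write_multiplier, read_multiplier):
    """Input cost of n requests against one prefix, in units of one
    uncached request: one cache write followed by n - 1 cache reads."""
    if n_requests <= 0:
        return 0.0
    return write_multiplier + read_multiplier * (n_requests - 1)

n = 10  # e.g. one request every 25 minutes over roughly four hours

always_miss = relative_input_cost(n, write_multiplier=1.0, read_multiplier=1.0)
one_hour_cache = relative_input_cost(n, write_multiplier=2.0, read_multiplier=0.1)

print(f"{always_miss:.1f}")     # 10.0
print(f"{one_hour_cache:.1f}")  # 2.9
```

Under these assumptions the 2x write premium is recovered on the second request (2.1x versus 2.0x is nearly level) and clearly by the third, which is why the 1-hour option matters mainly for this traffic shape.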

Deployments that run both providers can capture both: the automatic discount on OpenAI traffic and the marker-driven discount on Anthropic traffic, each at 90% off. The deeper comparison: provider-native caching glossary.

How Prism handles OpenAI prompt caching

Prism's request handler is fully transparent to OpenAI's automatic caching:

  • Pass-through preservation. Requests forwarded to OpenAI carry the same prompt structure the customer sent. No prompt-modification, no marker injection (which OpenAI doesn't use anyway).
  • cached_tokens read from upstream response. Prism reads prompt_tokens_details.cached_tokens from the OpenAI response usage block and uses it in billing calculation.
  • Discount pass-through. The customer's bill applies the 90% discount on cached tokens directly — Prism doesn't absorb the savings as gateway margin. The X-Prism-Native-Cache-Saved-Cents response header surfaces the actual saving per request.
  • Surfaced in usage logs. The cached_tokens count lands in usage_logs.provider_native_cache_read_tokens for downstream observability. Dashboards aggregate the savings into the public live counter on the landing page.

For broader prompt-caching context including the Anthropic equivalent: prompt caching glossary and provider-native caching glossary.

Decision framework

If you're standing up OpenAI prompt caching on a production workload:

  1. Verify your system prompt is ≥1,024 tokens. Below this, caching doesn't engage. Add content if needed; rely on other techniques if not feasible.
  2. Structure the prompt: stable first, variable last. System prompt + tool definitions at the start; user message at the end. Move per-user customisation out of the system prompt.
  3. Verify hits in the response. Check response.usage.prompt_tokens_details.cached_tokens > 0 on second and subsequent requests. If zero, the prefix isn't stable.
  4. No code change needed for the discount. OpenAI applies it automatically. Just keep the prefix stable and watch the savings appear.
  5. Layer with response-level caching for full coverage. Prompt caching discounts the calls that go through; response caching avoids many of them entirely. See AI API caching.
  6. Consider Anthropic specifically when 1-hour TTL matters for your traffic pattern. Otherwise both providers deliver the same 90% discount with OpenAI being simpler to implement.

The mechanic is simple once the structure is right. The wedge is large (90% off the input-token portion on workloads with stable prefixes — which is most workloads). The most common failure mode is the prompt structure: variable content at the start, stable content at the end. Fix that, and the discount lands without further work.

Where to go next

For the Anthropic counterpart: Anthropic prompt caching explained. For the parent OpenAI-specific cost optimization pillar: OpenAI cost optimization. For the broader provider-native caching glossary: provider-native caching and prompt caching.

For modelling OpenAI-cached cost on your workload: savings calculator. For comparing per-model costs across providers including the OpenAI tier: cost comparison by model.


FAQ

Do I need to do anything to enable OpenAI's prompt caching?

No. Caching is automatic on prompts ≥1,024 tokens. The discount appears as cached_tokens in the response's usage block, billed at 10% of normal input price (a 90% discount). The only requirement is that your prompt prefix is stable across requests — even minor variations like timestamps invalidate the cache.

What if my system prompt is shorter than 1,024 tokens?

Caching won't engage. Either pad your system prompt with useful content until you cross the threshold (more instructions, examples, formatting guidance), or rely on different cost-reduction techniques (response-level caching, model-tier routing). Short prompts also have less to save from caching anyway — the absolute dollar impact is small.

Wait, I read somewhere OpenAI's cached input was only 50% off?

That was true historically — OpenAI's original prompt-caching discount was 50%. Current pricing (as of mid-2026) is 90% off, matching Anthropic. The "50% vs 90%" framing in older comparison posts and tutorials is outdated. Verify against the current openai.com/api/pricing page (which shows the cached input rate per-model alongside the standard rate).

Does prompt caching work with streaming responses?

Yes. The cached_tokens count appears in the final usage chunk of the stream when the request sets stream_options={"include_usage": True}. Streaming and prompt caching are independent: the discount applies regardless of whether you stream the response or buffer it.
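Reading the count from a stream means waiting for the chunk that carries usage. The helper below is a sketch (cached_tokens_from_stream is a hypothetical name); the SimpleNamespace objects imitate stream chunks so the example runs offline, with content chunks carrying usage=None and the last chunk carrying the usage block.

```python
from types import SimpleNamespace

def cached_tokens_from_stream(chunks):
    """Return cached_tokens from the chunk that carries usage,
    or None if the stream never included a usage block."""
    for chunk in chunks:
        usage = getattr(chunk, "usage", None)
        if usage is not None:
            details = getattr(usage, "prompt_tokens_details", None)
            return getattr(details, "cached_tokens", 0) or 0
    return None

# Simulated stream: two content chunks, then the final usage chunk.
simulated_stream = [
    SimpleNamespace(usage=None),
    SimpleNamespace(usage=None),
    SimpleNamespace(usage=SimpleNamespace(
        prompt_tokens=1532,
        prompt_tokens_details=SimpleNamespace(cached_tokens=1408),
    )),
]
print(cached_tokens_from_stream(simulated_stream))  # 1408

# With the real SDK the chunks would come from something like:
# client.chat.completions.create(..., stream=True,
#                                stream_options={"include_usage": True})
```

A return value of None is a hint that the request was sent without the include_usage option, which is different from a genuine zero-hit result.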

Can I see what's cached?

Indirectly. You can't inspect OpenAI's cache state directly, but cached_tokens tells you how many input tokens hit the cache on each request. By comparing prompt structure variations and watching the cached_tokens field, you can infer what's being cached.

What happens if I change the model from gpt-5-4 to gpt-5-4-mini?

The cache is per-model. Switching models invalidates the cache — the new model has its own cache state. Either accept the warming cost (first few requests on the new model pay full price) or pre-warm the new model's cache before switching.

Can structured outputs (JSON mode) be cached?

Yes. The response_format parameter doesn't affect cache match. If your prompt is otherwise stable, the cache engages whether the response is JSON-mode or free-form text.

What about the OpenAI Batch API and prompt caching together?

They stack. Batch API gives 50% off chat completions; prompt caching gives 90% off cached input. On batch-eligible workloads with stable prefixes, both apply simultaneously. If the two discounts multiply, cached input tokens in a batch job cost 0.5 × 0.1 = 5% of the list input price; see batch API vs real-time OpenAI for the stacking math.


OpenAI's prompt cache is the easiest LLM cost-reduction technique to deploy because there's nothing to deploy. Layer it with the rest of the AI API caching stack and the broader LLM cost reduction playbook for the full cost-engineering wedge.
