## TL;DR
Caching in AI gateways is not one feature. It's two:
- L1 — Result cache skips the upstream model entirely. 100% savings per hit.
- L2 — Prompt cache (vendor-native) reduces cached input token cost 50-90%, but still calls the model.
Most teams on OpenRouter, Portkey, or similar gateways get only L2. Adding L1 (Helicone or self-hosted Redis) compounds the savings. Real production math: a typical 10M request/month workload saves 39% with L2 alone, 54% with L1 + L2 stacked.
Full analysis with pricing tables and architecture patterns: tokenmix.ai/blog/ai-gateway-caching-l1-l2-guide-2026
## The Misconception Everyone Has
When developers say "my gateway has caching," they usually mean one of:
- Semantic cache (Helicone style)
- Vendor prompt caching (Claude / OpenAI / DeepSeek native)
- "I set a Redis in front of my API calls"
These are three different things with different savings and different stale-risk profiles. Conflating them leads to architectural bugs: either you pay for duplicate caching, or you think you're caching when you're not.
Let's separate them cleanly.
## L1: Result Cache — Skip the Model Entirely

The gateway remembers past responses and returns them for matching new requests.

```
Client → Gateway (L1 cache check) ─┬─ HIT  → return cached response (100% saved)
                                   └─ MISS → forward to model → cache + return
```
Two matching strategies:
- Exact match: hash of (model + messages + params). Byte-identical requests hit.
- Semantic match: vector similarity. "What is photosynthesis?" matches "Explain photosynthesis."
Who ships L1 today:
- Helicone — 1-line proxy swap. Reports 20-30% savings typical, up to 95% for highly repetitive workloads (Helicone docs).
- Self-hosted Redis — 1-2 engineer-weeks to build.
- OpenRouter / Portkey — do not ship L1 by default. They're pass-through gateways.
When L1 is dangerous: dynamic content (news, stock prices, user-specific data), where a stale response gets served from cache after the source has changed.
When L1 wins: temperature=0 paths, documentation QA, fixed-corpus RAG, code completion. Enable with TTL that matches your content refresh cadence.
### L1 example — Helicone drop-in

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # ← just the base_url change
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        "Helicone-Cache-Enabled": "true",
        "Helicone-Cache-Bucket-Max-Size": "10",
    },
)

# Same request sent twice: second call hits L1 and skips OpenAI entirely
resp = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
```
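The self-hosted Redis option is mostly this key-hashing logic. A minimal exact-match sketch, with an in-process dict standing in for Redis (a production version would store via `SETEX` so TTL enforcement is server-side); every name here is illustrative, not a real library API:

```python
import hashlib
import json
import time


def cache_key(model: str, messages: list, **params) -> str:
    # Exact-match key: hash of (model + messages + params), as described above.
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params}, sort_keys=True
    )
    return "l1:" + hashlib.sha256(payload.encode()).hexdigest()


class L1Cache:
    """Minimal exact-match result cache. Swap the dict for Redis in production."""

    def __init__(self, ttl_seconds: int = 300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, response)

    def get(self, key):
        hit = self._store.get(key)
        if hit and hit[0] > time.time():
            return hit[1]  # HIT: 100% saved, model never called
        return None

    def set(self, key, response):
        self._store[key] = (time.time() + self.ttl, response)


def cached_completion(cache, call_model, model, messages, **params):
    key = cache_key(model, messages, **params)
    cached = cache.get(key)
    if cached is not None:
        return cached
    response = call_model(model=model, messages=messages, **params)  # MISS: go upstream
    cache.set(key, response)
    return response
```

`call_model` takes whatever function actually hits the API, e.g. a thin wrapper around `client.chat.completions.create`; a semantic variant would replace `cache_key` with an embedding lookup.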
## L2: Prompt Cache — The Model Still Runs, But Cheaper
L2 is what OpenAI, Anthropic, Google, and DeepSeek all ship under different names. Mechanism:
- You send a long prompt with a stable prefix (system prompt, tools, documents).
- Provider computes KV state for the prefix, stores in hot cache.
- Subsequent calls with same prefix skip prefix computation, pay 50-90% less on cached input tokens.
- The model still generates output every call — this is not L1.
### L2 pricing across the four majors (April 2026)

| Provider | Base input | Cache read | Cache write | Auto? |
|---|---|---|---|---|
| Claude Sonnet 4.6 | $3/M | $0.30/M (90% off) | $3.75/M (25% premium, 5-min TTL) | Explicit — `cache_control` |
| Claude Opus 4.6 | $5/M | $0.50/M (90% off) | $6.25/M (5-min) | Explicit |
| DeepSeek V3.2 | $0.28/M | $0.028/M (90% off) | Same as base | Automatic |
| OpenAI GPT-5.4 | $2.50/M | $0.25/M (90% off) | Same as base | Automatic ≥1024 tokens |
| Gemini 3.1 Pro | $2/M | ~25% off | Storage $4.50/M per hour | Explicit — `cachedContents.create` |
Sources: Anthropic pricing, OpenAI prompt caching, DeepSeek context caching, Anthropic prompt caching docs.
### Claude break-even math
Claude's cache write is 25% more expensive than base input (5-min TTL) or 2x more (1-hour TTL). So:
- 1 cache read pays off the 5-min write premium. Every hit after is pure savings.
- 2 cache reads pay off the 1-hour write premium. Anything more is profit.
RAG system answering multi-turn questions on the same document? Cache pays for itself instantly.
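Worked out with the Sonnet 4.6 prices from the table above (assuming the 1-hour write tier is 2x base, i.e. $6/M):

```python
import math

BASE = 3.00        # $/M input tokens, Claude Sonnet 4.6
READ = 0.30        # cache read, 90% off
WRITE_5MIN = 3.75  # 25% write premium, 5-min TTL
WRITE_1HR = 6.00   # 2x base, 1-hour TTL

saving_per_read = BASE - READ     # $2.70 saved per M tokens read from cache
premium_5min = WRITE_5MIN - BASE  # $0.75 extra per M tokens written
premium_1hr = WRITE_1HR - BASE    # $3.00 extra per M tokens written

# Reads needed before the write premium is recovered
print(math.ceil(premium_5min / saving_per_read))  # 1
print(math.ceil(premium_1hr / saving_per_read))   # 2
```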
### L2 example — Claude with explicit cache_control

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support AI...",  # short instruction, not cached
        },
        {
            "type": "text",
            "text": LONG_DOCUMENT_CONTEXT,  # 50K tokens of product docs
            "cache_control": {"type": "ephemeral"},  # ← cache this 50K chunk
        },
    ],
    messages=[{"role": "user", "content": "How do I enable 2FA?"}],
)
# Next call with the same system[] → 90% cheaper on those 50K input tokens
```
### L2 example — DeepSeek (zero config)

```python
import os

from openai import OpenAI  # DeepSeek's API is OpenAI-compatible

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)

# Cache fires automatically on the second call if the prefix matches
resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},  # stable prefix, auto-cached
        {"role": "user", "content": user_input},
    ],
)

# Inspect cache hits in the usage metadata
print(resp.usage.prompt_cache_hit_tokens)   # tokens served from cache
print(resp.usage.prompt_cache_miss_tokens)  # tokens computed fresh
```
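Those two usage fields are enough to track your effective L2 hit rate across a batch of responses. A small sketch, with plain dicts standing in for the SDK's usage objects:

```python
def cache_hit_rate(usages) -> float:
    """Fraction of input tokens served from L2 cache across a batch of responses."""
    hit = sum(u["prompt_cache_hit_tokens"] for u in usages)
    miss = sum(u["prompt_cache_miss_tokens"] for u in usages)
    return hit / (hit + miss) if (hit + miss) else 0.0


# e.g. three responses: the first call misses the prefix, later calls hit it
usages = [
    {"prompt_cache_hit_tokens": 0, "prompt_cache_miss_tokens": 4000},
    {"prompt_cache_hit_tokens": 3500, "prompt_cache_miss_tokens": 500},
    {"prompt_cache_hit_tokens": 3500, "prompt_cache_miss_tokens": 500},
]
print(round(cache_hit_rate(usages), 3))  # 0.583
```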
## Real Cost Math — 10M Requests/Month
Assumptions: 4,000 input tokens avg (3,500 stable prefix + 500 unique), 500 output tokens avg, Claude Sonnet 4.6.
### No caching (baseline)
- Input: 40B tokens × $3/M = $120,000
- Output: 5B tokens × $15/M = $75,000
- Total: $195,000/month
### L2 only (80% prefix cache hit rate)
- Cached input: 28B × $0.30/M = $8,400
- Uncached input: 12B × $3/M = $36,000
- Output: $75,000
- Cache write overhead: $300
- Total: $119,700/month (−39%)
### L1 + L2 stacked (25% L1 hit rate, remaining 75% via L2)
- L1-served: 2.5M requests → $0 LLM cost (+ $500 infra)
- L2-eligible: 7.5M requests
- Cached input: 21B × $0.30/M = $6,300
- Uncached input: 9B × $3/M = $27,000
- Output: $56,250
- Total: ~$90,050/month (−54%)
L1 + L2 savings don't merely add; they compound. Every request L1 absorbs costs nothing at all, and the remaining traffic still gets the full L2 discount, so neither layer dilutes the other.
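The scenario math above fits in a few lines. A sketch under the same assumptions (10M requests/month, 3,500-token stable prefix out of 4,000 input tokens, 500 output tokens, Sonnet 4.6 prices):

```python
REQS = 10_000_000
IN_TOK, PREFIX, OUT_TOK = 4_000, 3_500, 500
IN_PRICE, CACHED_PRICE, OUT_PRICE = 3.00, 0.30, 15.00  # $/M tokens


def cost(requests, l2_hit_rate=0.0):
    """Monthly LLM cost in $ at a given L2 prefix hit rate (token counts in M)."""
    cached = requests * PREFIX * l2_hit_rate / 1e6
    uncached = (requests * IN_TOK / 1e6) - cached
    output = requests * OUT_TOK / 1e6
    return cached * CACHED_PRICE + uncached * IN_PRICE + output * OUT_PRICE


baseline = cost(REQS)                               # $195,000
l2_only = cost(REQS, l2_hit_rate=0.8) + 300         # + cache write overhead
stacked = cost(REQS * 0.75, l2_hit_rate=0.8) + 500  # 25% absorbed by L1, + infra
print(round(baseline), round(l2_only), round(stacked))  # 195000 119700 90050
```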
## Architecture Patterns
### Pattern 1: Helicone only (single vendor)
App → Helicone (L1) → Vendor (L2)
Simplest multi-layer setup. Both caches fire with one proxy hop.
### Pattern 2: Gateway + Helicone (multi-model)
App → TokenMix.ai / OpenRouter (routing + L2 passthrough) → Helicone (L1) → Vendors
Gateway handles model routing, failover, billing. Helicone adds L1.
### Pattern 3: Self-hosted L1 + Gateway
App → Own Redis L1 → Gateway → Vendors
Fine control over TTL and invalidation. More ops work.
### Pattern 4: Vendor direct (no gateway, no L1)
App → Vendor
Simplest. L2 auto-fires on OpenAI/DeepSeek, explicit on Claude/Gemini. No multi-model routing, no L1.
## Common Gotchas
Prefix instability kills L2. If your gateway (or middleware) rewrites system prompts inconsistently, the cache key hash changes every call. Check actual cached-token count in provider response metadata to verify caching fires.
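One way to make that check routine is to assert on the cached-token count in a canary path after every deploy. A sketch assuming Anthropic-style `usage.cache_read_input_tokens` (OpenAI reports the equivalent under `usage.prompt_tokens_details.cached_tokens`); the helper name and threshold are illustrative:

```python
def assert_l2_firing(usage, expected_prefix_tokens, tolerance=0.1):
    """Fail loudly when the provider reports far fewer cached tokens
    than the prefix you believe is stable."""
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    if cached < expected_prefix_tokens * (1 - tolerance):
        raise RuntimeError(
            f"Prompt cache not firing: {cached} cached tokens vs "
            f"~{expected_prefix_tokens} expected. Check prefix stability."
        )
```

Note that the first call after a TTL expiry legitimately reports a cache write rather than a read, so sample several consecutive calls before concluding the prefix is unstable.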
Dynamic content + L1 = stale responses. News, prices, user-specific data — do not L1-cache these. Use conditional caching based on path or prompt content.
Semantic cache false positives. Cosine similarity threshold too loose returns wrong answers. Start at 0.95+ and tune.
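To see why the threshold matters, here is the lookup reduced to its core, with toy 2-D vectors in place of real embeddings (illustrative, not any particular cache's implementation):

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def semantic_lookup(query_vec, cache, threshold=0.95):
    """Return the best cached response only above a strict similarity threshold."""
    best_score, best_resp = 0.0, None
    for vec, resp in cache:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_resp = score, resp
    return best_resp if best_score >= threshold else None  # None → treat as MISS
```

At 0.95 a nearly-parallel query hits while a 45°-apart one falls through to the model; loosening the threshold to 0.7 would return a cached answer for that unrelated question.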
Claude 5-min TTL surprise. If your workload has gaps >5 min between cache reads, the cache expires and you pay the 25% write premium again. Use 1-hour TTL for bursty patterns with longer gaps.
Forgetting to measure. No observability = running blind. Helicone, Langfuse, or provider response metadata at minimum.
## Decision Matrix
| Your situation | Recommended setup |
|---|---|
| Single vendor, simple app | Pattern 4 (direct) |
| Single vendor, want L1 savings | Pattern 1 (Helicone only) |
| Multi-vendor with routing | Pattern 2 (Gateway + Helicone) |
| Strict compliance / data residency | Pattern 3 (self-hosted L1) |
| High-repetition workloads (support, FAQ) | Any pattern + aggressive L1 |
| Dynamic content (news, personalized) | L2 only, skip L1 |
## TL;DR (repeated for scrollers)
- L1 result cache = skip model entirely, 100% saved per hit, stale-risk on dynamic content
- L2 prompt cache = vendor-native, 50-90% off cached input tokens, model still runs
- OpenRouter / Portkey = L2 passthrough only. No L1.
- Real savings: L2 alone ≈ 39% on realistic production. L1 + L2 stacked ≈ 54%.
- Always enable L2 (it's free money on Claude/OpenAI/DeepSeek). Add L1 when repetition is real and staleness is tolerable.
Full article with 8-question FAQ and deeper architectural analysis: Read the full version on TokenMix.ai →
Originally published on TokenMix.ai — a unified AI API gateway providing OpenAI-compatible access to 150+ LLMs. TokenMix Research Lab publishes data-driven analysis of LLM pricing, benchmarks, and cost optimization strategies across every major model provider.