## TL;DR
Caching in AI gateways is not one feature. It's two:
- L1 — Result cache skips the upstream model entirely. 100% savings per hit.
- L2 — Prompt cache (vendor-native) reduces cached input token cost 50-90%, but still calls the model.
Most teams on OpenRouter, Portkey, or similar gateways get only L2. Adding L1 (Helicone or self-hosted Redis) compounds the savings. Real production math: a typical 10M request/month workload saves 39% with L2 alone, 54% with L1 + L2 stacked.
Full analysis with pricing tables and architecture patterns: tokenmix.ai/blog/ai-gateway-caching-l1-l2-guide-2026
## The Misconception Everyone Has
When developers say "my gateway has caching," they usually mean one of:
- Semantic cache (Helicone style)
- Vendor prompt caching (Claude / OpenAI / DeepSeek native)
- "I set a Redis in front of my API calls"
These are three different things with different savings and different stale-risk profiles. Conflating them leads to architectural bugs: either you pay for duplicate caching, or you think you're caching when you're not.
Let's separate them cleanly.
## L1: Result Cache — Skip the Model Entirely

The gateway remembers past responses and returns them for matching new requests.

```
Client → Gateway (L1 cache check) ─┬─ HIT  → return cached response (100% saved)
                                   └─ MISS → forward to model → cache + return
```
Two matching strategies:
- Exact match: hash of (model + messages + params). Byte-identical requests hit.
- Semantic match: vector similarity. "What is photosynthesis?" matches "Explain photosynthesis."
Who ships L1 today:
- Helicone — 1-line proxy swap. Reports 20-30% savings typical, up to 95% for highly repetitive workloads (Helicone docs).
- Self-hosted Redis — 1-2 engineer-weeks to build.
- OpenRouter / Portkey — do not ship L1 by default. They're pass-through gateways.
When L1 is dangerous: dynamic content (news, stock prices, user-specific data), where a stale response gets served from cache after the source has changed.
When L1 wins: temperature=0 paths, documentation QA, fixed-corpus RAG, code completion. Enable with TTL that matches your content refresh cadence.
### L1 example — Helicone drop-in

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # ← just the base_url change
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        "Helicone-Cache-Enabled": "true",
        "Helicone-Cache-Bucket-Max-Size": "10",
    },
)

# Same request sent twice: second call hits L1 and skips OpenAI entirely
resp = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
```
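The self-hosted Redis option is mostly this key-hashing logic. A minimal exact-match sketch, with an in-process dict standing in for Redis (a production version would store via `SETEX` so TTL enforcement is server-side); every name here is illustrative, not a real library API:

```python
import hashlib
import json
import time


def cache_key(model: str, messages: list, **params) -> str:
    # Exact-match key: hash of (model + messages + params), as described above.
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params}, sort_keys=True
    )
    return "l1:" + hashlib.sha256(payload.encode()).hexdigest()


class L1Cache:
    """Minimal exact-match result cache. Swap the dict for Redis in production."""

    def __init__(self, ttl_seconds: int = 300):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, response)

    def get(self, key):
        hit = self._store.get(key)
        if hit and hit[0] > time.time():
            return hit[1]  # HIT: 100% saved, model never called
        return None

    def set(self, key, response):
        self._store[key] = (time.time() + self.ttl, response)


def cached_completion(cache, call_model, model, messages, **params):
    key = cache_key(model, messages, **params)
    cached = cache.get(key)
    if cached is not None:
        return cached
    response = call_model(model=model, messages=messages, **params)  # MISS: go upstream
    cache.set(key, response)
    return response
```

`call_model` takes whatever function actually hits the API, e.g. a thin wrapper around `client.chat.completions.create`; a semantic variant would replace `cache_key` with an embedding lookup.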
## L2: Prompt Cache — The Model Still Runs, But Cheaper
L2 is what OpenAI, Anthropic, Google, and DeepSeek all ship under different names. Mechanism:
- You send a long prompt with a stable prefix (system prompt, tools, documents).
- Provider computes KV state for the prefix, stores in hot cache.
- Subsequent calls with same prefix skip prefix computation, pay 50-90% less on cached input tokens.
- The model still generates output every call — this is not L1.
### L2 pricing across the four majors (April 2026)

| Provider | Base input | Cache read | Cache write | Auto? |
|---|---|---|---|---|
| Claude Sonnet 4.6 | $3/M | $0.30/M (90% off) | $3.75/M (25% premium, 5-min TTL) | Explicit — `cache_control` |
| Claude Opus 4.6 | $5/M | $0.50/M (90% off) | $6.25/M (5-min) | Explicit |
| DeepSeek V3.2 | $0.28/M | $0.028/M (90% off) | Same as base | Automatic |
| OpenAI GPT-5.4 | $2.50/M | $0.25/M (90% off) | Same as base | Automatic ≥1024 tokens |
| Gemini 3.1 Pro | $2/M | ~25% off | Storage $4.50/M per hour | Explicit — `cachedContents.create` |
Sources: Anthropic pricing, OpenAI prompt caching, DeepSeek context caching, Anthropic prompt caching docs.
### Claude break-even math
Claude's cache write is 25% more expensive than base input (5-min TTL) or 2x more (1-hour TTL). So:
- 1 cache read pays off the 5-min write premium. Every hit after is pure savings.
- 2 cache reads pay off the 1-hour write premium. Anything more is profit.
RAG system answering multi-turn questions on the same document? Cache pays for itself instantly.
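Worked out with the Sonnet 4.6 prices from the table above (assuming the 1-hour write tier is 2x base, i.e. $6/M):

```python
import math

BASE = 3.00        # $/M input tokens, Claude Sonnet 4.6
READ = 0.30        # cache read, 90% off
WRITE_5MIN = 3.75  # 25% write premium, 5-min TTL
WRITE_1HR = 6.00   # 2x base, 1-hour TTL

saving_per_read = BASE - READ     # $2.70 saved per M tokens read from cache
premium_5min = WRITE_5MIN - BASE  # $0.75 extra per M tokens written
premium_1hr = WRITE_1HR - BASE    # $3.00 extra per M tokens written

# Reads needed before the write premium is recovered
print(math.ceil(premium_5min / saving_per_read))  # 1
print(math.ceil(premium_1hr / saving_per_read))   # 2
```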
### L2 example — Claude with explicit cache_control

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support AI...",  # short instruction, not cached
        },
        {
            "type": "text",
            "text": LONG_DOCUMENT_CONTEXT,  # 50K tokens of product docs
            "cache_control": {"type": "ephemeral"},  # ← cache this 50K chunk
        },
    ],
    messages=[{"role": "user", "content": "How do I enable 2FA?"}],
)
# Next call with the same system[] → 90% cheaper on those 50K input tokens
```
### L2 example — DeepSeek (zero config)

```python
import os

from openai import OpenAI  # DeepSeek's API is OpenAI-compatible

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/v1",
)

# Cache fires automatically on the second call if the prefix matches
resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},  # stable prefix, auto-cached
        {"role": "user", "content": user_input},
    ],
)

# Inspect cache hits in the usage metadata
print(resp.usage.prompt_cache_hit_tokens)   # tokens served from cache
print(resp.usage.prompt_cache_miss_tokens)  # tokens computed fresh
```
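Those two usage fields are enough to track your effective L2 hit rate across a batch of responses. A small sketch, with plain dicts standing in for the SDK's usage objects:

```python
def cache_hit_rate(usages) -> float:
    """Fraction of input tokens served from L2 cache across a batch of responses."""
    hit = sum(u["prompt_cache_hit_tokens"] for u in usages)
    miss = sum(u["prompt_cache_miss_tokens"] for u in usages)
    return hit / (hit + miss) if (hit + miss) else 0.0


# e.g. three responses: the first call misses the prefix, later calls hit it
usages = [
    {"prompt_cache_hit_tokens": 0, "prompt_cache_miss_tokens": 4000},
    {"prompt_cache_hit_tokens": 3500, "prompt_cache_miss_tokens": 500},
    {"prompt_cache_hit_tokens": 3500, "prompt_cache_miss_tokens": 500},
]
print(round(cache_hit_rate(usages), 3))  # 0.583
```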
## Real Cost Math — 10M Requests/Month
Assumptions: 4,000 input tokens avg (3,500 stable prefix + 500 unique), 500 output tokens avg, Claude Sonnet 4.6.
### No caching (baseline)
- Input: 40B tokens × $3/M = $120,000
- Output: 5B tokens × $15/M = $75,000
- Total: $195,000/month
### L2 only (80% prefix cache hit rate)
- Cached input: 28B × $0.30/M = $8,400
- Uncached input: 12B × $3/M = $36,000
- Output: $75,000
- Cache write overhead: $300
- Total: $119,700/month (−39%)
### L1 + L2 stacked (25% L1 hit rate, remaining 75% via L2)
- L1-served: 2.5M requests → $0 LLM cost (+ $500 infra)
- L2-eligible: 7.5M requests
- Cached input: 21B × $0.30/M = $6,300
- Uncached input: 9B × $3/M = $27,000
- Output: $56,250
- Total: ~$90,050/month (−54%)
L1 + L2 savings don't merely add; they compound. Every request L1 absorbs costs nothing at all, and the remaining traffic still gets the full L2 discount, so neither layer dilutes the other.
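The scenario math above fits in a few lines. A sketch under the same assumptions (10M requests/month, 3,500-token stable prefix out of 4,000 input tokens, 500 output tokens, Sonnet 4.6 prices):

```python
REQS = 10_000_000
IN_TOK, PREFIX, OUT_TOK = 4_000, 3_500, 500
IN_PRICE, CACHED_PRICE, OUT_PRICE = 3.00, 0.30, 15.00  # $/M tokens


def cost(requests, l2_hit_rate=0.0):
    """Monthly LLM cost in $ at a given L2 prefix hit rate (token counts in M)."""
    cached = requests * PREFIX * l2_hit_rate / 1e6
    uncached = (requests * IN_TOK / 1e6) - cached
    output = requests * OUT_TOK / 1e6
    return cached * CACHED_PRICE + uncached * IN_PRICE + output * OUT_PRICE


baseline = cost(REQS)                               # $195,000
l2_only = cost(REQS, l2_hit_rate=0.8) + 300         # + cache write overhead
stacked = cost(REQS * 0.75, l2_hit_rate=0.8) + 500  # 25% absorbed by L1, + infra
print(round(baseline), round(l2_only), round(stacked))  # 195000 119700 90050
```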
## Architecture Patterns
### Pattern 1: Helicone only (single vendor)
App → Helicone (L1) → Vendor (L2)
Simplest multi-layer setup. Both caches fire with one proxy hop.
### Pattern 2: Gateway + Helicone (multi-model)
App → TokenMix.ai / OpenRouter (routing + L2 passthrough) → Helicone (L1) → Vendors
Gateway handles model routing, failover, billing. Helicone adds L1.
### Pattern 3: Self-hosted L1 + Gateway
App → Own Redis L1 → Gateway → Vendors
Fine control over TTL and invalidation. More ops work.
### Pattern 4: Vendor direct (no gateway, no L1)
App → Vendor
Simplest. L2 auto-fires on OpenAI/DeepSeek, explicit on Claude/Gemini. No multi-model routing, no L1.
## Common Gotchas
Prefix instability kills L2. If your gateway (or middleware) rewrites system prompts inconsistently, the cache key hash changes every call. Check actual cached-token count in provider response metadata to verify caching fires.
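One way to make that check routine is to assert on the cached-token count in a canary path after every deploy. A sketch assuming Anthropic-style `usage.cache_read_input_tokens` (OpenAI reports the equivalent under `usage.prompt_tokens_details.cached_tokens`); the helper name and threshold are illustrative:

```python
def assert_l2_firing(usage, expected_prefix_tokens, tolerance=0.1):
    """Fail loudly when the provider reports far fewer cached tokens
    than the prefix you believe is stable."""
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    if cached < expected_prefix_tokens * (1 - tolerance):
        raise RuntimeError(
            f"Prompt cache not firing: {cached} cached tokens vs "
            f"~{expected_prefix_tokens} expected. Check prefix stability."
        )
```

Note that the first call after a TTL expiry legitimately reports a cache write rather than a read, so sample several consecutive calls before concluding the prefix is unstable.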
Dynamic content + L1 = stale responses. News, prices, user-specific data — do not L1-cache these. Use conditional caching based on path or prompt content.
Semantic cache false positives. Cosine similarity threshold too loose returns wrong answers. Start at 0.95+ and tune.
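To see why the threshold matters, here is the lookup reduced to its core, with toy 2-D vectors in place of real embeddings (illustrative, not any particular cache's implementation):

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def semantic_lookup(query_vec, cache, threshold=0.95):
    """Return the best cached response only above a strict similarity threshold."""
    best_score, best_resp = 0.0, None
    for vec, resp in cache:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_resp = score, resp
    return best_resp if best_score >= threshold else None  # None → treat as MISS
```

At 0.95 a nearly-parallel query hits while a 45°-apart one falls through to the model; loosening the threshold to 0.7 would return a cached answer for that unrelated question.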
Claude 5-min TTL surprise. If your workload has gaps >5 min between cache reads, the cache expires and you pay the 25% write premium again. Use 1-hour TTL for bursty patterns with longer gaps.
Forgetting to measure. No observability = running blind. Helicone, Langfuse, or provider response metadata at minimum.
## Decision Matrix
| Your situation | Recommended setup |
|---|---|
| Single vendor, simple app | Pattern 4 (direct) |
| Single vendor, want L1 savings | Pattern 1 (Helicone only) |
| Multi-vendor with routing | Pattern 2 (Gateway + Helicone) |
| Strict compliance / data residency | Pattern 3 (self-hosted L1) |
| High-repetition workloads (support, FAQ) | Any pattern + aggressive L1 |
| Dynamic content (news, personalized) | L2 only, skip L1 |
## TL;DR (repeated for scrollers)
- L1 result cache = skip model entirely, 100% saved per hit, stale-risk on dynamic content
- L2 prompt cache = vendor-native, 50-90% off cached input tokens, model still runs
- OpenRouter / Portkey = L2 passthrough only. No L1.
- Real savings: L2 alone ≈ 39% on realistic production. L1 + L2 stacked ≈ 54%.
- Always enable L2 (it's free money on Claude/OpenAI/DeepSeek). Add L1 when repetition is real and staleness is tolerable.
Full article with 8-question FAQ and deeper architectural analysis: Read the full version on TokenMix.ai →
Originally published on TokenMix.ai — a unified AI API gateway providing OpenAI-compatible access to 150+ LLMs. TokenMix Research Lab publishes data-driven analysis of LLM pricing, benchmarks, and cost optimization strategies across every major model provider.