Posted on • Originally published at tokenmix.ai

AI Gateway Caching Explained — Why L1 + L2 Cache Layers Cut 90% of Your LLM Bill

TL;DR

Caching in AI gateways is not one feature. It's two:

  1. L1 — Result cache skips the upstream model entirely. 100% savings per hit.
  2. L2 — Prompt cache (vendor-native) reduces cached input token cost 50-90%, but still calls the model.

Most teams on OpenRouter, Portkey, or similar gateways get only L2. Adding L1 (Helicone or self-hosted Redis) compounds the savings. Real production math: a typical 10M request/month workload saves 39% with L2 alone, 54% with L1 + L2 stacked.

Full analysis with pricing tables and architecture patterns: tokenmix.ai/blog/ai-gateway-caching-l1-l2-guide-2026


The Misconception Everyone Has

When developers say "my gateway has caching," they usually mean one of:

  • Semantic cache (Helicone style)
  • Vendor prompt caching (Claude / OpenAI / DeepSeek native)
  • "I set a Redis in front of my API calls"

These are three different things with different savings and different stale-risk profiles. Conflating them leads to architectural bugs: either you pay for duplicate caching, or you think you're caching when you're not.

Let's separate them cleanly.


L1: Result Cache — Skip the Model Entirely

The gateway remembers past responses and returns them for matching new requests.

Client → Gateway (L1 cache check) → ┬─ HIT  → return cached response (100% saved)
                                    └─ MISS → forward to model → cache + return

Two matching strategies:

  • Exact match: hash of (model + messages + params). Byte-identical requests hit.
  • Semantic match: vector similarity. "What is photosynthesis?" matches "Explain photosynthesis."
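The exact-match strategy is little more than a stable hash over the request. A minimal sketch, assuming the cache key covers every field that can change the response (model, messages, sampling params):

```python
import hashlib
import json

def l1_cache_key(model: str, messages: list, **params) -> str:
    """Deterministic key: byte-identical requests map to the same hash."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,  # stable ordering so dict key order can't change the hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = l1_cache_key("gpt-5.4", [{"role": "user", "content": "What is photosynthesis?"}], temperature=0)
k2 = l1_cache_key("gpt-5.4", [{"role": "user", "content": "What is photosynthesis?"}], temperature=0)
k3 = l1_cache_key("gpt-5.4", [{"role": "user", "content": "Explain photosynthesis."}], temperature=0)

assert k1 == k2  # identical request → same key → L1 hit
assert k1 != k3  # one character differs → miss; this is the gap semantic matching closes
```

Note how the paraphrase misses under exact match: that gap is exactly what semantic matching trades accuracy risk to close.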

Who ships L1 today:

  • Helicone — 1-line proxy swap. Reports 20-30% savings typical, up to 95% for highly repetitive workloads (Helicone docs).
  • Self-hosted Redis — 1-2 engineer-weeks to build.
  • OpenRouter / Portkey — do not ship L1 by default. They're pass-through gateways.

When L1 is dangerous: dynamic content (news, stock prices, user-specific data). A stale cached response gets served after the source has changed.

When L1 wins: temperature=0 paths, documentation QA, fixed-corpus RAG, code completion. Enable with TTL that matches your content refresh cadence.

L1 example — Helicone drop-in

from openai import OpenAI

client = OpenAI(
    api_key="your-key",
    base_url="https://oai.helicone.ai/v1",  # ← just the base_url change
    default_headers={
        "Helicone-Auth": f"Bearer {HELICONE_KEY}",
        "Helicone-Cache-Enabled": "true",
        "Helicone-Cache-Bucket-Max-Size": "10",
    },
)

# Same request sent twice: second call hits L1, skips OpenAI entirely
resp = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)

L2: Prompt Cache — The Model Still Runs, But Cheaper

L2 is what OpenAI, Anthropic, Google, and DeepSeek all ship under different names. Mechanism:

  1. You send a long prompt with a stable prefix (system prompt, tools, documents).
  2. Provider computes KV state for the prefix, stores in hot cache.
  3. Subsequent calls with same prefix skip prefix computation, pay 50-90% less on cached input tokens.
  4. The model still generates output every call — this is not L1.

L2 pricing across the four majors (April 2026)

| Provider | Base input | Cache read | Cache write | Auto? |
|---|---|---|---|---|
| Claude Sonnet 4.6 | $3/M | $0.30/M (90% off) | $3.75/M (25% premium, 5-min TTL) | Explicit — `cache_control` |
| Claude Opus 4.6 | $5/M | $0.50/M (90% off) | $6.25/M (5-min TTL) | Explicit |
| DeepSeek V3.2 | $0.28/M | $0.028/M (90% off) | Same as base | Automatic |
| OpenAI GPT-5.4 | $2.50/M | $0.25/M (90% off) | Same as base | Automatic ≥1024 tokens |
| Gemini 3.1 Pro | $2/M | ~25% off | Storage $4.50/M per hour | Explicit — `cachedContents.create` |

Sources: Anthropic pricing, OpenAI prompt caching, DeepSeek context caching, Anthropic prompt caching docs.

Claude break-even math

Claude's cache write costs 25% more than base input (5-min TTL) or 2x base (1-hour TTL). So:

  • 1 cache read pays off the 5-min write premium. Every hit after is pure savings.
  • 2 cache reads pay off the 1-hour write premium. Anything more is profit.

RAG system answering multi-turn questions on the same document? Cache pays for itself instantly.
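The break-even claim is easy to verify from the Sonnet prices above. A quick arithmetic sanity check, not official figures:

```python
BASE = 3.00        # $/M input tokens, Claude Sonnet 4.6
READ = 0.30        # $/M cached input tokens (90% off)
WRITE_5MIN = 3.75  # $/M, 25% premium over base
WRITE_1HR = 6.00   # $/M, 2x base for the 1-hour TTL

def breakeven_reads(write_price: float) -> float:
    """Cache reads needed before the write premium is paid off."""
    premium = write_price - BASE      # extra paid up front to write the cache
    saving_per_read = BASE - READ     # saved each time the prefix is read back
    return premium / saving_per_read

print(breakeven_reads(WRITE_5MIN))  # ~0.28 → the first read already pays it off
print(breakeven_reads(WRITE_1HR))   # ~1.11 → the second read puts you in profit
```

This matches the rule of thumb above: one read covers the 5-min premium, two reads cover the 1-hour premium.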

L2 example — Claude with explicit cache_control

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",  # hyphenated model ID, per Anthropic's naming convention
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support AI..."  # Short instruction, not cached
        },
        {
            "type": "text",
            "text": LONG_DOCUMENT_CONTEXT,  # 50K tokens of product docs
            "cache_control": {"type": "ephemeral"}  # ← cache this 50K chunk
        }
    ],
    messages=[{"role": "user", "content": "How do I enable 2FA?"}]
)

# Next call with same system[] → 90% cheaper on those 50K input tokens

L2 example — DeepSeek (zero config)

from openai import OpenAI  # DeepSeek is OpenAI-compatible

client = OpenAI(
    api_key=DEEPSEEK_KEY,
    base_url="https://api.deepseek.com/v1",
)

# Cache fires automatically on the second call if prefix matches
resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},  # stable prefix, auto-cached
        {"role": "user", "content": user_input},
    ],
)

# Inspect cache hits in usage metadata
print(resp.usage.prompt_cache_hit_tokens)  # tokens served from cache
print(resp.usage.prompt_cache_miss_tokens)  # tokens computed fresh

Real Cost Math — 10M Requests/Month

Assumptions: 4,000 input tokens avg (3,500 stable prefix + 500 unique), 500 output tokens avg, Claude Sonnet 4.6.

No caching (baseline)

  • Input: 40B tokens × $3/M = $120,000
  • Output: 5B tokens × $15/M = $75,000
  • Total: $195,000/month

L2 only (80% prefix cache hit rate)

  • Cached input: 28B × $0.30/M = $8,400
  • Uncached input: 12B × $3/M = $36,000
  • Output: $75,000
  • Cache write overhead: $300
  • Total: $119,700/month (−39%)

L1 + L2 stacked (25% L1 hit rate, remaining 75% via L2)

  • L1-served: 2.5M requests → $0 LLM cost (+ $500 infra)
  • L2-eligible: 7.5M requests
    • Cached input: 21B × $0.30/M = $6,300
    • Uncached input: 9B × $3/M = $27,000
    • Output: $56,250
  • Total: ~$90,050/month (−54%)

L1 + L2 is not additive — it's compound in the right direction. The requests L1 absorbs don't dilute the L2 savings.
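The three scenarios above reduce to a few lines of arithmetic. Reproducing them makes the assumptions easy to swap for your own traffic profile:

```python
REQS = 10_000_000
PREFIX_TOK, UNIQUE_TOK, OUT_TOK = 3_500, 500, 500       # tokens per request
IN_PRICE, CACHED_PRICE, OUT_PRICE = 3.00, 0.30, 15.00   # $/M, Claude Sonnet 4.6

def monthly_cost(requests, l2_hit=0.0, write_overhead=0.0, infra=0.0):
    prefix = requests * PREFIX_TOK / 1e6   # prefix tokens, in millions
    unique = requests * UNIQUE_TOK / 1e6
    out = requests * OUT_TOK / 1e6
    cached = prefix * l2_hit               # prefix tokens served at cache-read price
    uncached = prefix * (1 - l2_hit) + unique
    return cached * CACHED_PRICE + uncached * IN_PRICE + out * OUT_PRICE + write_overhead + infra

baseline = monthly_cost(REQS)                                        # 195,000
l2_only = monthly_cost(REQS, l2_hit=0.80, write_overhead=300)        # 119,700
stacked = monthly_cost(int(REQS * 0.75), l2_hit=0.80, infra=500)     # 90,050 (25% absorbed by L1)

print(f"${baseline:,.0f}  ${l2_only:,.0f}  ${stacked:,.0f}")
```

The $300 write overhead and $500 L1 infra line items are the article's assumptions carried through unchanged.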


Architecture Patterns

Pattern 1: Helicone only (single vendor)

App → Helicone (L1) → Vendor (L2)

Simplest multi-layer setup. Both caches fire with one proxy hop.

Pattern 2: Gateway + Helicone (multi-model)

App → TokenMix.ai / OpenRouter (routing + L2 passthrough) → Helicone (L1) → Vendors

Gateway handles model routing, failover, billing. Helicone adds L1.

Pattern 3: Self-hosted L1 + Gateway

App → Own Redis L1 → Gateway → Vendors

Fine control over TTL and invalidation. More ops work.
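A self-hosted L1 reduces to a keyed get/set with a TTL. A minimal sketch of the wrapper logic, with a plain dict standing in for Redis so it runs anywhere (in production you'd swap in `redis.Redis` and `SETEX` for the same effect):

```python
import hashlib
import json
import time

store = {}  # stand-in for Redis; maps cache key -> (expires_at, response)

def cached_completion(call_model, model, messages, ttl=3600, **params):
    """Return a cached response on exact match; otherwise call the model and cache it."""
    key = hashlib.sha256(
        json.dumps({"m": model, "msgs": messages, "p": params}, sort_keys=True).encode()
    ).hexdigest()
    hit = store.get(key)
    if hit and hit[0] > time.time():        # fresh entry → skip the model entirely
        return hit[1]
    resp = call_model(model=model, messages=messages, **params)
    store[key] = (time.time() + ttl, resp)  # TTL should match content refresh cadence
    return resp

# Demo with a fake model callable (hypothetical, just to show the hit path)
calls = {"n": 0}
def fake_model(**kw):
    calls["n"] += 1
    return "Paris"

msgs = [{"role": "user", "content": "Capital of France?"}]
cached_completion(fake_model, "gpt-5.4", msgs)
cached_completion(fake_model, "gpt-5.4", msgs)
assert calls["n"] == 1  # second request never reached the model
```

The ops work the pattern buys you lives in that `ttl` argument and in deciding when to delete keys, which is exactly the invalidation control the hosted options don't expose.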

Pattern 4: Vendor direct (no gateway, no L1)

App → Vendor

Simplest. L2 auto-fires on OpenAI/DeepSeek, explicit on Claude/Gemini. No multi-model routing, no L1.


Common Gotchas

  1. Prefix instability kills L2. If your gateway (or middleware) rewrites system prompts inconsistently, the cache key hash changes every call. Check actual cached-token count in provider response metadata to verify caching fires.

  2. Dynamic content + L1 = stale responses. News, prices, user-specific data — do not L1-cache these. Use conditional caching based on path or prompt content.

  3. Semantic cache false positives. Cosine similarity threshold too loose returns wrong answers. Start at 0.95+ and tune.

  4. Claude 5-min TTL surprise. If your workload has gaps >5 min between cache reads, the cache expires and you pay the 25% write premium again. Use 1-hour TTL for bursty patterns with longer gaps.

  5. Forgetting to measure. No observability = running blind. Helicone, Langfuse, or provider response metadata at minimum.
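Gotcha 3 is worth making concrete: semantic matching is a cosine-similarity threshold over embeddings, and the threshold is the whole game. A toy illustration with hand-rolled vectors (a real system would use an embedding model, not literals):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

THRESHOLD = 0.95  # start strict; loosen only after measuring false positives

# Toy embeddings: the paraphrase sits close to the cached query, the unrelated one doesn't
cached_q = [0.9, 0.1, 0.2]
paraphrase = [0.88, 0.12, 0.22]
unrelated = [0.1, 0.9, 0.3]

assert cosine(cached_q, paraphrase) >= THRESHOLD  # serve from cache
assert cosine(cached_q, unrelated) < THRESHOLD    # miss → call the model
```

Drop the threshold to 0.80 and queries that merely share vocabulary start colliding, which is how a cache silently returns wrong answers.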


Decision Matrix

| Your situation | Recommended setup |
|---|---|
| Single vendor, simple app | Pattern 4 (direct) |
| Single vendor, want L1 savings | Pattern 1 (Helicone only) |
| Multi-vendor with routing | Pattern 2 (Gateway + Helicone) |
| Strict compliance / data residency | Pattern 3 (self-hosted L1) |
| High-repetition workloads (support, FAQ) | Any pattern + aggressive L1 |
| Dynamic content (news, personalized) | L2 only, skip L1 |

TL;DR (repeated for scrollers)

  • L1 result cache = skip model entirely, 100% saved per hit, stale-risk on dynamic content
  • L2 prompt cache = vendor-native, 50-90% off cached input tokens, model still runs
  • OpenRouter / Portkey = L2 passthrough only. No L1.
  • Real savings: L2 alone ≈ 39% on realistic production. L1 + L2 stacked ≈ 54%.
  • Always enable L2 (it's free money on Claude/OpenAI/DeepSeek). Add L1 when repetition is real and staleness is tolerable.

Full article with 8-question FAQ and deeper architectural analysis: Read the full version on TokenMix.ai →


Originally published on TokenMix.ai — a unified AI API gateway providing OpenAI-compatible access to 150+ LLMs. TokenMix Research Lab publishes data-driven analysis of LLM pricing, benchmarks, and cost optimization strategies across every major model provider.
