sm1ck

Posted on • Originally published at honeychat.bot

We Measured LLM Prompt Caching in Production — Same Prompt, 0% to 91% Hit Rates

We run an AI companion bot. Every chat turn, the model sees the same ~5K-token prefix — character persona, content-tier rules, formatting guardrails, a memory blob — plus one new user line. Without caching, we pay for those 5K input tokens every single turn. So we turned on prompt caching across the providers we route through, measured it, and the spread was bigger than any of the marketing pages prepared us for.

Here's the table that survived four weeks in production, plus the one gotcha that ate two weeks before we figured it out.

The hit-rate table

| Provider / model | Hit rate | Latency Δ | Notes |
|---|---|---|---|
| Cydonia (via OpenRouter) | 91 % | −43 % | Just works, no marker needed |
| Gemini 3.1 Flash Lite | 75 % | −49 % | Requires cache_control marker |
| Grok (xAI) | 51 % | −40 % | "Sticky", best on active sessions |
| Same code, 600-token test prompt | 0 % | 0 % | Methodology bug, see below |

Same exact 5K-token system prefix across the three provider rows. Same 10 follow-up turns. Wildly different cache behaviour. (The last row is the 600-token probe, explained below.)

The marker that "didn't matter" (until it did)

Most OpenAI-compat examples skip any cache hint and assume the provider figures it out from prefix repetition. Some do. Anthropic-style routes — and anything going through OpenRouter that supports cache_control — don't:

messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,          # the long, stable prefix
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": user_msg},      # the only volatile part
]

Cydonia caches without it. Grok caches without it.

Gemini 3.1 Flash Lite caches at exactly 0 % without it. The same model jumps to 75 % with one extra field on the last cacheable content block.

We had Gemini 3.1 routed in production for a week showing zero cache reads in usage. We concluded the model "just didn't support caching." It does; we were calling the API the way every other model wanted to be called. Cost of including the marker on providers that ignore it: zero. Cost of skipping it on a provider that needs it: the entire caching discount on that route.
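Since sending the marker everywhere costs nothing, one option is to attach it in a single place before every request. The helper below is a minimal sketch of that idea, not our production code: `with_cache_marker` is a hypothetical name, and it assumes the system message is the only stable prefix and that the marker belongs on its last content block.

```python
import copy


def with_cache_marker(messages):
    """Return a copy of `messages` with an ephemeral cache_control marker
    on the last content block of the first system message."""
    out = copy.deepcopy(messages)  # leave the caller's list untouched
    for msg in out:
        if msg.get("role") != "system":
            continue
        content = msg["content"]
        # Plain-string content has nowhere to hang the marker, so convert
        # it to the block-list form first.
        if isinstance(content, str):
            content = [{"type": "text", "text": content}]
            msg["content"] = content
        if content:
            content[-1]["cache_control"] = {"type": "ephemeral"}
        break  # assumption: only the first system message is the prefix
    return out


messages = with_cache_marker([
    {"role": "system", "content": "long, stable prefix goes here"},
    {"role": "user", "content": "hello"},
])
```

If the cacheable prefix spans more than the system message (for example a memory blob sent as a separate block), the marker would need to move to whichever block ends the stable part.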

Why our first "no, it doesn't cache" test was wrong

Before we caught the marker thing, we'd already wrongly concluded a couple of models "don't cache" — because we'd tested with the wrong prompt.

The first probe was a ~600-token prompt repeated 10 times. Cache reads: zero, across every provider. Conclusion: this provider doesn't cache.

Conclusion: wrong. Most providers have a minimum prefix length before caching kicks in (≥ 1K tokens for some routes, closer to ≥ 4K for others). Below that floor, you pay full price even though the prompt repeats verbatim. The cache simply doesn't engage.

The corrected probe:

  • Prefix ≥ 5K tokens, shaped like real production (system prompt + persona + retrieved memory).
  • 10 identical follow-up turns, fresh request each time.
  • For Anthropic-style providers, include the cache_control marker on the last cacheable content block.
  • Read usage.cache_creation_input_tokens and usage.cache_read_input_tokens (or the provider's equivalent) back — don't trust round-trip latency alone.

Once we did that, every "broken" provider started reporting cache reads.
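The corrected probe can be sketched roughly as follows. Everything here is illustrative: `send` is a hypothetical callable that posts the messages and returns the response's usage dict, and `cached_tokens` and `run_probe` are names made up for this sketch. The first pair of field names is the Anthropic-style one quoted above; the `prompt_tokens_details.cached_tokens` fallback is the OpenAI-style equivalent and may not match every route. The hit rate is counted per request here, which is an assumption, since there is more than one reasonable way to define it.

```python
def cached_tokens(usage):
    """Return (cache_read, cache_write) token counts from a usage dict."""
    # Anthropic-style field names.
    read = usage.get("cache_read_input_tokens")
    write = usage.get("cache_creation_input_tokens", 0)
    if read is None:
        # OpenAI-style equivalent: usage.prompt_tokens_details.cached_tokens
        details = usage.get("prompt_tokens_details") or {}
        read = details.get("cached_tokens", 0)
    return read or 0, write or 0


def run_probe(send, messages, turns=10):
    """Send the same request `turns` times and summarize cache reads.

    `send` is a hypothetical callable: messages -> usage dict.
    """
    reads = []
    for _ in range(turns):
        usage = send(messages)  # fresh request each time
        read, _write = cached_tokens(usage)
        reads.append(read)
    hits = sum(1 for r in reads if r > 0)
    return {"turns": turns, "hits": hits, "hit_rate": hits / turns, "reads": reads}
```

Note that the first request can only write the cache, so under this per-request definition a 10-turn probe tops out at 9 hits out of 10.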

What "sticky" caching looks like (Grok)

Grok was the weird one. Hit rate 51 % — lower than Cydonia and Gemini — but the cache survived longer between calls. Other providers behaved like a ~5-minute ephemeral cache; Grok looked more like a hot-window-then-slow-decay curve. Practical consequence: Grok did better than its hit rate suggested when the same user kept chatting actively, and worse when they came back hours later.

Lesson — a single hit-rate number per provider lies a little. The shape (how it decays, how it warms) matters as much as the headline percentage when your traffic is bursty.
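One way to see the shape rather than a single number is to bucket requests by how long the user was idle before the call and compute a hit rate per bucket. This is a sketch under assumptions: the bucket boundaries are arbitrary, the names are hypothetical, and each sample is taken to be a `(gap_seconds, cache_read_tokens)` pair pulled from your own request logs.

```python
from collections import defaultdict

# Bucket edges in seconds; the boundaries are illustrative, not measured.
BUCKETS = [(60, "<1 min"), (300, "1-5 min"), (3600, "5-60 min")]


def bucket_for(gap_seconds):
    """Map an idle gap to a coarse bucket label."""
    for limit, label in BUCKETS:
        if gap_seconds < limit:
            return label
    return ">=60 min"


def hit_rate_by_gap(samples):
    """samples: iterable of (gap_seconds, cache_read_tokens).

    Returns {bucket_label: fraction of requests with a cache read}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for gap, read in samples:
        label = bucket_for(gap)
        totals[label] += 1
        if read > 0:
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}
```

A provider with a hard expiry should show a cliff between two buckets, while a "sticky" one like the Grok behaviour described above would show a more gradual slope.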

What it actually saved

We route turns through different model tiers depending on the user's plan. After caching landed and the marker was wired in everywhere it was needed:

  • Cached input tokens are billed at roughly 10 % of normal price (provider-dependent, sometimes lower).
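  • Cost per turn on the heavy-tier routes dropped about 40–45 %. That is below the raw hit rates above, since the discount applies only to cached input tokens and everything else on the turn is billed at full price.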
  • End-to-end latency dropped 40–49 %, which users actually notice — the typing-dots animation snapping back faster feels like a different product.
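As a back-of-the-envelope check on the cost side, the saving on input tokens alone can be estimated from the hit rate and the cached-token price. The sketch below assumes the hit rate is the fraction of input tokens served from cache, uses the rough 10 % cached price from the list above, and ignores any cache-write surcharge and all output tokens. `input_cost_saving` is a hypothetical helper, not a provider API.

```python
def input_cost_saving(hit_rate, cached_price_ratio=0.10):
    """Fraction of input-token cost saved when `hit_rate` of input tokens
    are billed at `cached_price_ratio` of the normal price."""
    return hit_rate * (1.0 - cached_price_ratio)


for name, h in [("Cydonia", 0.91), ("Gemini 3.1 Flash Lite", 0.75), ("Grok", 0.51)]:
    print(f"{name}: {input_cost_saving(h):.1%} of input cost saved")
# Cydonia: 81.9% of input cost saved
# Gemini 3.1 Flash Lite: 67.5% of input cost saved
# Grok: 45.9% of input cost saved
```

These are upper bounds on the per-turn saving under the stated assumptions, because output tokens get no caching discount.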

The pleasant surprise was that latency mattered to retention more than cost mattered to the P&L. Cheaper turns are nice; faster replies are felt.

Lessons we'd pin to the wall

  1. Test with a production-shaped prompt. Short toy prompts will tell you caching doesn't work on providers where it works fine. The minimum-prefix floor is real and silent.
  2. Read provider-specific cache hints. Anthropic-style cache_control is required on some routes (Gemini 3.1 line via OpenRouter, in our case) and ignored by others. Always send it.
  3. Verify with usage fields, not vibes. cache_read_input_tokens doesn't lie. End-to-end latency does: TTFB swings add enough noise to hide a real cache hit.
  4. One hit-rate per provider lies a little. The decay curve matters more than the headline number for bursty vs. steady chat patterns.
  5. Re-probe quarterly. Providers ship cache changes silently. The 75 % on Gemini 3.1 Flash Lite is a 2026 number; the same model gave us 0 % earlier this year, before the marker was wired in.

If you're running an AI app where the system prompt dwarfs the user input — companion bots, RAG with chunky retrieved context, agentic loops — you almost certainly leave 40 % of your bill and half a second of latency on the table by trusting the defaults. The marker is one line. The corrected methodology is one afternoon.


If you've got hit-rate numbers from a different routing setup (Bedrock, Fireworks, Together, direct Anthropic), drop them in the comments — curious how the marker situation compares outside the OpenRouter ecosystem.

This write-up is from production work at HoneyChat — a Telegram-native AI companion where the system prompt is the load-bearing wall (persona + content tier + memory blob = the whole 5K). The canonical version of this post lives at honeychat.bot/en/blog/llm-prompt-caching-in-production.

HoneyChat Engineering
