- Book: RAG Pocket Guide
- Also by me: LLM Observability Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
A team running a customer-support copilot put a Redis cache in front of their gpt-4o calls. Key was a SHA-256 of the prompt, value was the completion. Six weeks later they checked the hit rate. It was roughly 4% (the 4.1% figure is illustrative of what teams in this shape report, not a single published benchmark).
The reason is what every team building an LLM cache eventually figures out: humans do not ask the same question twice. They ask "how do I cancel my subscription," then "where do I go to stop billing," then "I want to end my plan." Three different prompt hashes, one identical answer. An exact-match key only fires when somebody hits the same wording, and the long tail of phrasing destroys hit rate.
The fix is to stop keying the cache on the prompt and start keying on the meaning. Embed the query, search for prior queries within a cosine-similarity threshold, and return the cached response when one is close enough. GPTCache, Zilliz's open-source library, has been doing exactly this since 2023. The 2024 GPT Semantic Cache paper reports cache hit rates between 61.6% and 68.8% across query categories.
Semantic caching beats exact-match by an order of magnitude. The interesting question is why the economics work even when you have to embed every query that comes in, including the misses.
The cost shape that makes this trivial
Look at OpenAI's pricing page (figures below are as of April 2026; check the page for current values). Per 1M tokens:
- text-embedding-3-small: $0.02 input, no output tokens.
- text-embedding-3-large: $0.13 input, no output tokens.
- gpt-4o (or whatever your generation model is in the same class): roughly $2.50 input, $10 output at standard tier.
The per-query dollar figures that follow are illustrative estimates derived from these inputs; rerun the math against current pricing before quoting them in a budget.
A typical chat query is 200 input tokens. The embedding for it costs you 200 × $0.02 / 1,000,000 = $0.000004. The completion for it, generating maybe 400 output tokens with another 800 of system prompt and retrieved context as input, costs you roughly $0.0065. The completion is about 1,600× more expensive than the embedding.
That asymmetry is the reason semantic caching pays. You can embed every query, including the misses, including queries that turn out to be completely novel, and the embedding cost is a rounding error compared to the completion you would otherwise have to call. Say your cache hit rate is 50% (well below the published GPTCache numbers). You are paying:
- Hit path: 1 embedding ($0.000004) + cache lookup (~$0).
- Miss path: 1 embedding ($0.000004) + 1 completion ($0.0065) + cache write.
Average per query: $0.000004 + 0.5 × $0.0065 ≈ $0.0033. Without the cache: $0.0065. You halved your inference bill, and the embeddings cost you roughly 0.1% of what you saved.
This is what people mean when they say embeddings are "100x cheaper" than completions per token; on input prices alone, $2.50 vs $0.02 is a 125x gap. Include output tokens and it stretches further: on a 200-in / 400-out query, the embedding costs 200 × $0.02 / 1M = $0.000004, while the output tokens alone cost 400 × $10 / 1M = $0.004, a 1,000x ratio before you even count the completion's input side. That gap is what makes "embed every query, even the ones that miss" a strictly winning strategy. CloudZero's 2026 OpenAI pricing breakdown and Tiger Data's pgvector guide both spell out the gap.
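If you want to sanity-check that arithmetic against your own traffic, here is the same back-of-the-envelope math as a few lines of Python. The prices and token counts are the illustrative figures from above, not values to hard-code.

```python
# Illustrative prices (USD per 1M tokens) and token counts from the example above.
EMBED_PRICE_IN = 0.02    # text-embedding-3-small
CHAT_PRICE_IN = 2.50     # gpt-4o-class input
CHAT_PRICE_OUT = 10.00   # gpt-4o-class output

QUERY_TOKENS = 200       # user query
CONTEXT_TOKENS = 800     # system prompt + retrieved chunks
OUTPUT_TOKENS = 400      # generated answer

embed_cost = QUERY_TOKENS * EMBED_PRICE_IN / 1_000_000
completion_cost = (
    (QUERY_TOKENS + CONTEXT_TOKENS) * CHAT_PRICE_IN
    + OUTPUT_TOKENS * CHAT_PRICE_OUT
) / 1_000_000

hit_rate = 0.50
avg_with_cache = embed_cost + (1 - hit_rate) * completion_cost

print(f"embedding per query:  ${embed_cost:.6f}")      # ~$0.000004
print(f"completion per query: ${completion_cost:.4f}")  # ~$0.0065
print(f"avg with cache:       ${avg_with_cache:.4f}")   # ~$0.0033 at 50% hits
```

Swap in current pricing and your own token profile before putting any of these numbers in a budget.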
What the cache actually stores
Three things, per entry:
- The query embedding (a 1536-dimensional vector for text-embedding-3-small).
- The original query text, for debugging and logging, never used in lookup.
- The completion the LLM produced, plus a timestamp.
You do not store the retrieved context that went into the prompt. That is a separate decision. Some teams cache the full (query, retrieved_chunks, response) triple and only count it as a hit when both query and retrieval match. That collapses the hit rate back toward 4%. The bet of semantic caching is that for a given query meaning, the retrieved chunks will be stable enough that the same response is still good. If not, you detect drift another way: TTL, manual invalidation on doc updates, a freshness eval.
A 70-line implementation
Python, using numpy for the cosine math and openai for both embedding and chat. No vector DB — for prototypes the in-memory list is fine, and swapping to pgvector or FAISS once it grows is mechanical.
```python
import time
from dataclasses import dataclass

import numpy as np
from openai import OpenAI

client = OpenAI()

EMBED_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4o-2024-11-20"
SIM_THRESHOLD = 0.93        # minimum cosine similarity to count as a hit
TTL_SECONDS = 60 * 60 * 24  # entries older than a day are discarded


@dataclass
class CacheEntry:
    embedding: np.ndarray  # unit-normalized query embedding
    query: str             # original text, kept for logging only
    response: str
    created_at: float


class SemanticCache:
    def __init__(self) -> None:
        self.entries: list[CacheEntry] = []

    def _embed(self, text: str) -> np.ndarray:
        # Normalize to unit length so lookup can use a plain dot product.
        r = client.embeddings.create(model=EMBED_MODEL, input=text)
        v = np.array(r.data[0].embedding, dtype=np.float32)
        return v / np.linalg.norm(v)

    def _gc(self) -> None:
        # Lazy TTL enforcement: drop expired entries before searching.
        now = time.time()
        self.entries = [
            e for e in self.entries
            if now - e.created_at < TTL_SECONDS
        ]

    def lookup(self, embedding: np.ndarray) -> CacheEntry | None:
        self._gc()
        if not self.entries:
            return None
        # Similarity against every entry in one matrix-vector multiply.
        matrix = np.stack([e.embedding for e in self.entries])
        sims = matrix @ embedding
        idx = int(np.argmax(sims))
        if float(sims[idx]) >= SIM_THRESHOLD:
            return self.entries[idx]
        return None

    def store(
        self, embedding: np.ndarray, query: str, response: str
    ) -> None:
        self.entries.append(CacheEntry(
            embedding=embedding,
            query=query,
            response=response,
            created_at=time.time(),
        ))


cache = SemanticCache()


def ask(query: str, system_prompt: str = "You are helpful.") -> str:
    embedding = cache._embed(query)
    hit = cache.lookup(embedding)
    if hit is not None:
        return hit.response
    completion = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
    )
    response = completion.choices[0].message.content or ""
    cache.store(embedding, query, response)
    return response
```
A walkthrough of what each block does, because the details are where the cache leaks if you're not careful.
Normalization at embed time. Every embedding gets normalized to unit length before storage. That turns the cosine-similarity calculation into a dot product, which is a single matrix-vector multiply at lookup. Skipping this is the single most common performance pitfall — without normalization you're doing a per-entry division at every query.
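The equivalence is easy to verify: for unit-length vectors, cosine similarity and the dot product are the same number. A quick numpy check, independent of the cache code:

```python
import numpy as np

a = np.random.default_rng(0).normal(size=1536)
b = np.random.default_rng(1).normal(size=1536)

# Full cosine similarity: dot product divided by both norms.
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Normalize once at write time; lookup then needs only a dot product.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot = float(a_unit @ b_unit)

assert abs(cosine - dot) < 1e-9
```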
Threshold of 0.93. This is the dial that determines hit rate vs answer correctness. Too low (0.85) and you start serving "where is my refund" results to "where is my receipt" queries — the embeddings are close, the meanings diverge. Too high (0.98) and your hit rate collapses back toward exact-match territory. The right number depends on the embedding model and your domain. The GPT Semantic Cache paper lands on 0.91-0.95 as the operating range for text-embedding-ada-002-class models. Calibrate on your own eval set.
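Calibration does not need heavy machinery. A minimal sketch, assuming you have a small hand-labeled set of query pairs marked as "same answer" or "different answer" (the `labeled_pairs` data and the reporting format are mine, not from the paper):

```python
def sweep_thresholds(pairs, thresholds):
    """pairs: hand-labeled (query_a, query_b, same_answer) triples."""
    # Embed each pair once and reuse the similarity across every threshold.
    sims = [
        (float(cache._embed(qa) @ cache._embed(qb)), same_answer)
        for qa, qb, same_answer in pairs
    ]
    for t in thresholds:
        fired = [(s, ok) for s, ok in sims if s >= t]
        fire_rate = len(fired) / len(sims)
        wrong = sum(1 for _, ok in fired if not ok) / len(fired) if fired else 0.0
        print(f"threshold={t:.2f}  would-fire={fire_rate:.0%}  wrong-when-fired={wrong:.0%}")

sweep_thresholds(labeled_pairs, thresholds=[0.85, 0.89, 0.91, 0.93, 0.95, 0.97])
```

Pick the highest threshold whose wrong-when-fired rate you can live with, then watch it in production rather than trusting the offline number forever.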
TTL. Stale answers are how semantic caching destroys trust. Pricing changed. Policy changed. The product page got renamed. A 24-hour TTL is the conservative default; tune down to 1 hour for fast-moving content, up to 7 days for stable reference material. Pair this with explicit invalidation on document updates if you're caching on top of a RAG pipeline.
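Explicit invalidation can be one method on the cache. A sketch only, which assumes you extend CacheEntry with a `source_doc_ids` field populated at store time from the retrieval step (not part of the implementation above):

```python
class InvalidatingSemanticCache(SemanticCache):
    def invalidate_for_docs(self, updated_doc_ids: set[str]) -> int:
        """Drop cached responses built from any document that just changed."""
        before = len(self.entries)
        self.entries = [
            e for e in self.entries
            if not (getattr(e, "source_doc_ids", set()) & updated_doc_ids)
        ]
        return before - len(self.entries)

# Called from whatever pipeline re-ingests documents, e.g.:
# cache.invalidate_for_docs({"billing-policy-v3"})
```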
The _gc call inside lookup. Lazy garbage collection: clean up expired entries when you're already iterating. For the in-memory toy version this is fine. At scale, do it on a background job and use a TTL-aware store (Redis with EXPIRE, or a created_at index in pgvector with a periodic delete).
The store after a miss. Note we store the raw query alongside the embedding. The query is never used for lookup. It's there for the logs. When a hit happens with sims = 0.94 between two queries, you want to be able to read both texts during incident review and confirm the semantic match was good.
The class of bug this introduces
Exact-match caches have one failure mode: they miss too often. Semantic caches have a different one: they hit when they shouldn't.
Two queries that look like paraphrases:
- "How do I cancel my Pro subscription"
- "How do I cancel a Pro subscription"
Cosine similarity on text-embedding-3-small: about 0.97. They will collide. If user A is asking about their own account and user B is asking generically, returning the same response is fine. If your system has personalization in the response ("Your subscription ends June 14" is a constructed example, not a real incident), you've just leaked one user's data into another user's chat.
The fix is structural. Never put a semantic cache in front of personalized output. The cache lives upstream of personalization, on the unbiased portion of the response (the docs lookup, the policy explanation), and the personalization layer happens after. Or, key the cache by (user_id, embedding) so semantic matching is scoped to one user. Hit rates drop, but never to zero, because users genuinely re-ask their own questions.
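The per-user variant is a thin wrapper: one cache per user_id, so a match can only come from that user's own history. A sketch (the class name and shape are mine):

```python
from collections import defaultdict

class PerUserSemanticCache:
    """Scopes semantic matching to a single user, trading hit rate for isolation."""

    def __init__(self) -> None:
        self._by_user: dict[str, SemanticCache] = defaultdict(SemanticCache)

    def lookup(self, user_id: str, embedding: np.ndarray) -> CacheEntry | None:
        return self._by_user[user_id].lookup(embedding)

    def store(self, user_id: str, embedding: np.ndarray,
              query: str, response: str) -> None:
        self._by_user[user_id].store(embedding, query, response)
```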
The other failure: queries with negation. "How do I disable two-factor auth" and "How do I enable two-factor auth" are 0.95+ similar in most embedding models because the verbs and the noun phrase dominate. A semantic cache will happily return one for the other. The literature review on GPT Semantic Cache flags this exact category as the dominant false-positive class. Your defenses are higher thresholds for short queries, a small "negation guard" classifier, or an LLM-judge pass on the cached candidate before serving.
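The judge pass is cheap relative to a wrong answer: before serving a cached candidate, ask a small model whether the cached query and the new query actually want the same answer. A minimal sketch (the prompt wording and the gpt-4o-mini choice are my assumptions):

```python
def same_intent(new_query: str, cached_query: str) -> bool:
    """Ask a small model to veto near-miss hits (negation, entity swaps)."""
    judge = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Do these two questions ask for the same answer? "
                "Reply with only YES or NO.\n"
                f"1. {new_query}\n2. {cached_query}"
            ),
        }],
    )
    return (judge.choices[0].message.content or "").strip().upper().startswith("YES")

# In ask(): serve the hit only if the judge agrees.
# if hit is not None and same_intent(query, hit.query):
#     return hit.response
```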
Observability is non-optional
A semantic cache that you cannot inspect is a liability. The minimum bar:
- Log every lookup with (query, top_match_query, similarity, hit/miss). This is your audit log when somebody reports a wrong answer; a minimal logging sketch follows this list.
- A daily eval that samples 50 cache hits and asks an LLM judge whether the cached response would have been the right answer for the new query. Hit-rate numbers without this eval are vanity metrics.
- A dashboard panel for cache size, eviction rate, and avg similarity at hit time. A drift in avg similarity is the early signal that your threshold needs tuning or the embedding model has been swapped under you.
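The lookup log can be a thin wrapper over the cache if you emit structured events. A minimal sketch using the standard logging module (the event field names are mine; swap in your span or tracing library as needed):

```python
import json
import logging

log = logging.getLogger("semantic_cache")

class LoggingSemanticCache(SemanticCache):
    def lookup(self, embedding: np.ndarray, query: str = "") -> CacheEntry | None:
        entry = super().lookup(embedding)
        # One structured event per lookup: the audit trail for "why did it answer that?"
        log.info(json.dumps({
            "event": "cache_lookup",
            "query": query,
            "hit": entry is not None,
            "top_match_query": entry.query if entry else None,
            "similarity": float(entry.embedding @ embedding) if entry else None,
            "cache_size": len(self.entries),
        }))
        return entry
```

This logs the matched similarity on hits only; extend lookup to return the best score if you also want it on misses.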
This is the part that shows up in chapter 6 of the LLM Observability Pocket Guide. Caches without span instrumentation are how teams discover, three months in, that their hit rate was actually 12% and the lookup was returning wrong answers the rest of the time. You cannot eyeball this; you have to instrument it.
When not to do this
A short list of cases where exact-match is the right call and semantic caching is overkill or actively wrong:
- Tool-call outputs where the response is a structured action ("call function X with arg Y"). The cost of a wrong tool call is enormous; semantic similarity is too lossy.
- Code generation. "Write a Python function that sorts a list" and "Write a Python function that sorts an array" are 0.96 similar and might want different code.
- Anything where the response is short enough that the completion cost barely exceeds the embedding cost. A tiny model serving 50-token completions has a small enough cost gap that the embedding overhead eats your savings.
For chat, support, search-style RAG, and FAQ-shaped traffic, semantic caching is the boring high-leverage move that pays for itself in the first week. The implementation is sixty-something lines. The hit rate is what 4% caches were always supposed to be.
If this was useful
The full retrieval pipeline is what the RAG Pocket Guide covers end-to-end: embedding choice, similarity thresholds, threshold calibration, the eval rig that tells you when a cache is silently degrading. The instrumentation half (cache spans, hit-quality dashboards, the LLM-judge eval loop you actually run on a schedule) is in the LLM Observability Pocket Guide. Both are short.