
vaibhav ahluwalia

Originally published at github.com

Caching Strategies for LLM Systems: Exact-Match & Semantic Caching

LLM calls are expensive in latency, tokens, and compute. Caching is one of the most effective levers to reduce cost and speed up responses. This post explains two foundational caching techniques you can implement today: exact-match (key-value) caching and semantic (embedding) caching. We cover how each works, typical implementations, pros and cons, and common pitfalls.


Why caching matters for LLM systems

Every LLM call carries three primary costs:

  • Network latency — round-trip time to the API or inference cluster.
  • Token cost — many APIs charge for both input and output tokens.
  • Compute overhead — CPU/GPU time spent running the model.

In production applications many queries repeat (exactly or semantically). A cache allows the system to return prior results without re-running the model, producing immediate wins in latency, throughput, and cost.

Key benefits:

  • Lower response time for end users.
  • Reduced API bills and compute consumption.
  • Higher throughput and better user experience at scale.

A thoughtful caching layer is often one of the highest-ROI engineering efforts for LLM products.


Exact-match (Key-Value) caching

How it works

Exact-match caching stores an LLM response under a deterministic key derived from the prompt (and any contextual state). When the same key is seen again, the cache returns the stored response.

Input prompt → Normalization → Hash/key → Lookup in KV store → Return stored response

Implementation notes

  • Normalization (optional but recommended): trim whitespace, canonicalize newlines, remove ephemeral metadata, and ensure consistent parameter ordering.
  • Key generation: use a stable hashing function (SHA-256) over the normalized prompt plus any relevant metadata (system prompt, temperature, model name, conversation id, schema version).
  • Storage: simple in-memory dict for prototypes; Redis/KeyDB for production; or a persistent object store for large responses.
  • Validation: store metadata with the response — model version, temperature, timestamp, source prompt — so you can safely decide whether a cached result is still valid or should be invalidated.

Simple Python example (conceptual)

import hashlib
import json

def make_key(prompt: str, system_prompt: str = "", model: str = "gpt-x", schema_version: str = "v1") -> str:
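    # Normalize: strip per-line whitespace so cosmetic formatting differences hash to the same key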
    normalized = "\n".join(line.strip() for line in prompt.strip().splitlines())
    payload = json.dumps({
        "system": system_prompt,
        "prompt": normalized,
        "model": model,
        "schema": schema_version,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Example usage:
# key = make_key(user_prompt, system_prompt, model_name)
# if key in kv_store: return kv_store[key]
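
To round out the usage comments above, here is a minimal get-or-call wrapper built on make_key. It is a sketch, not a reference implementation: the in-memory dict and the call_llm callable stand in for whatever store (e.g., Redis) and LLM client you actually use.

from typing import Callable, Dict

kv_store: Dict[str, str] = {}  # swap for Redis/KeyDB in production

def cached_completion(prompt: str,
                      call_llm: Callable[[str], str],
                      system_prompt: str = "",
                      model: str = "gpt-x") -> str:
    key = make_key(prompt, system_prompt, model)
    if key in kv_store:            # exact hit: return the stored response
        return kv_store[key]
    response = call_llm(prompt)    # miss: run the model once
    kv_store[key] = response       # cache for identical future prompts
    return response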

When to use exact caching

  • Deterministic workflows (e.g., agent step outputs).
  • Repeated system prompts and templates.
  • Where correctness requires exact reuse (no hallucination risk from mismatched context).

Advantages: simple, deterministic, zero false-positive risk.

Limitations: low hit rate for free-form natural language; brittle to minor prompt changes.


Semantic caching

How it works

Semantic caching stores an embedding for each prompt together with the response. For a new prompt, compute its embedding, perform a nearest-neighbor search among cached vectors, and reuse the cached response if similarity exceeds a threshold.

Prompt → Embedding → Similarity search in vector store → If max_sim ≥ threshold → reuse response

Implementation notes

  • Embeddings: choose a consistent embedding model. Store normalized prompt text, the embedding vector, response, and metadata (model, generation parameters, timestamp, schema version).
  • Vector store: FAISS, Milvus, Pinecone, Weaviate, or Redis Vector are common options depending on scale and latency needs.
  • Similarity metric: cosine similarity is standard for text embeddings. Use the same metric in indexing and querying.
  • Thresholding: set a threshold that balances reuse vs. safety. Typical cosine thresholds vary by embedding model — tune on your dataset (often starting conservatively around 0.85–0.9).

Conceptual example (pseudo-Python)

# compute embedding for new prompt
q_vec = embed(prompt)

# nearest neighbor search -> returns (id, sim_score)
nearest_id, sim = vector_store.search(q_vec, k=1)

if sim >= SIM_THRESHOLD:
    response = cache_lookup(nearest_id)
else:
    response = call_llm(prompt)
    store_embedding_and_response(q_vec, prompt, response)
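
For a more concrete version of the same flow, here is a minimal sketch using FAISS, with cosine similarity computed as inner product over L2-normalized vectors. The embed and call_llm callables, the dimension, and the threshold are placeholders to adapt to your embedding model.

import numpy as np
import faiss  # pip install faiss-cpu

DIM = 384                # embedding dimension (model-dependent)
SIM_THRESHOLD = 0.9      # tune on your own paraphrase data

index = faiss.IndexFlatIP(DIM)   # inner product == cosine on unit vectors
responses = []                   # responses[i] pairs with vector i in the index

def _normalize(vec):
    vec = np.asarray(vec, dtype="float32")
    return vec / (np.linalg.norm(vec) + 1e-12)

def semantic_cached_completion(prompt, embed, call_llm):
    q = _normalize(embed(prompt)).reshape(1, -1)
    if index.ntotal > 0:
        sims, ids = index.search(q, 1)
        if sims[0][0] >= SIM_THRESHOLD:
            return responses[ids[0][0]]   # semantic hit: reuse cached response
    response = call_llm(prompt)           # miss: call the model
    index.add(q)                          # store the new embedding...
    responses.append(response)            # ...and its response
    return response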

Tuning similarity and safety

  • Calibration: evaluate the similarity threshold on a held-out set of paraphrases and unrelated prompts to estimate false-positive reuse (a short calibration sketch follows this list).
  • Hybrid checks: for high-risk outputs, combine semantic match with lightweight heuristics (e.g., entity overlap, output-shape checks) or a fast reranker before returning cached content.
  • Metadata gating: ensure model version, schema version, or prompt-template changes invalidate or block reuse.
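
As a rough illustration of the calibration bullet above, the sketch below sweeps a few candidate thresholds over labeled prompt pairs (paraphrase vs. unrelated) and counts how often an unrelated pair would be wrongly reused. The pair data and embed callable are placeholders.

import numpy as np

def cosine(a, b):
    a = np.asarray(a, dtype="float32")
    b = np.asarray(b, dtype="float32")
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def calibrate(pairs, embed, thresholds=(0.80, 0.85, 0.90, 0.95)):
    # pairs: iterable of (prompt_a, prompt_b, is_paraphrase) labeled examples
    sims = [(cosine(embed(a), embed(b)), is_para) for a, b, is_para in pairs]
    for t in thresholds:
        reused = [is_para for s, is_para in sims if s >= t]
        false_reuse = sum(1 for is_para in reused if not is_para)
        print(f"threshold={t:.2f}  reused={len(reused)}  false_reuse={false_reuse}")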

Advantages: handles paraphrases; higher effective cache hit rate for conversational queries.

Limitations: requires embeddings, vector storage, and careful tuning to avoid incorrect reuse.


Choosing between exact and semantic caching

  • Use exact-match caching when correctness and determinism matter and prompts are highly templated.
  • Use semantic caching when queries are natural language, paraphrases are common, and some approximation is acceptable in exchange for higher hit rates.
  • Hybrid approach: an effective production design usually combines both. Try exact-match first; if it misses, fall back to semantic search (see the sketch after this list). Store both kinds of keys and de-duplicate on insertion.
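
A minimal sketch of that hybrid flow, reusing make_key, kv_store, and the FAISS index from the earlier examples (all illustrative names, not part of any particular library):

def hybrid_cached_completion(prompt, embed, call_llm,
                             system_prompt="", model="gpt-x"):
    key = make_key(prompt, system_prompt, model)
    if key in kv_store:                       # 1) exact-match lookup
        return kv_store[key]

    q = _normalize(embed(prompt)).reshape(1, -1)
    if index.ntotal > 0:                      # 2) semantic fallback
        sims, ids = index.search(q, 1)
        if sims[0][0] >= SIM_THRESHOLD:
            return responses[ids[0][0]]

    response = call_llm(prompt)               # 3) miss: call the model
    kv_store[key] = response                  # store under the exact key...
    index.add(q)                              # ...and in the semantic index
    responses.append(response)
    return response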

Metrics, monitoring, and operational concerns

Track these key metrics (a minimal tracking sketch follows the list):

  • Cache hit rate (exact / semantic)
  • End-to-end latency for cache hits vs misses
  • Cost saved (tokens/compute avoided)
  • False reuse incidents (semantic false positives) and user impact
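
A lightweight way to start tracking the first and third of these is an in-process counter like the sketch below; in practice, tokens saved would come from the usage numbers recorded with each cached response.

from dataclasses import dataclass

@dataclass
class CacheMetrics:
    exact_hits: int = 0
    semantic_hits: int = 0
    misses: int = 0
    tokens_saved: int = 0   # add the cached response's recorded token count on each hit

    @property
    def hit_rate(self) -> float:
        total = self.exact_hits + self.semantic_hits + self.misses
        return (self.exact_hits + self.semantic_hits) / total if total else 0.0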

Operational concerns:

  • Eviction policy & TTL — balance storage costs and freshness (see the sketch after this list).
  • Model upgrades — invalidate or tag cache entries produced by older model versions (or bump schema version).
  • Privacy & sensitivity — avoid caching PII or sensitive outputs unless encrypted and access-controlled.
  • Auditability — log when responses were served from cache and the matched key/score.
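
For the TTL and model-upgrade bullets, a small sketch using redis-py: the entry carries its metadata, the TTL handles freshness, and because the model name is part of the key (as in make_key above), a model upgrade naturally produces cache misses rather than stale reuse. The connection settings and TTL are illustrative.

import json
import time
import redis  # pip install redis

r = redis.Redis()               # illustrative: default localhost connection
TTL_SECONDS = 24 * 3600         # illustrative freshness window

def cache_put(key: str, response: str, model: str) -> None:
    entry = json.dumps({
        "response": response,
        "model": model,          # stored for auditing / bulk invalidation
        "cached_at": time.time(),
    })
    r.set(key, entry, ex=TTL_SECONDS)   # expiry doubles as a simple eviction policy

def cache_get(key: str):
    raw = r.get(key)
    return json.loads(raw)["response"] if raw else None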

Implementation & Code

Want to see working examples? Check out the full implementation:

VaibhavAhluwalia / llm-caching-systems

Practical implementations and experiments for building fast, scalable, and cost-efficient Large Language Model (LLM) applications using caching techniques.

The repository includes:

  • Interactive notebooks demonstrating both caching strategies
  • Requirements file for easy setup

Conclusion and what's next (Part 2)

Exact-match and semantic caching are foundational. Together they allow LLM systems to be faster and cheaper while retaining the benefits of large models.

In Part 2 of this series, we'll cover additional techniques.


What caching strategies have worked best in your LLM projects? Share your experiences in the comments below!


