Derrick Pedranti

Semantic Caching for LLMs: Faster Responses, Lower Costs

If you're building AI applications with LLMs, you've probably noticed a pattern:

  • The same (or very similar) questions keep coming in
  • Each one triggers a full LLM call
  • Latency adds up, and token costs quietly grow in the background

What makes this especially frustrating is that many of these requests aren't truly unique. They're slightly reworded versions of things you've already answered.

For example:

  • "What is the capital of France?"
  • "What's France's capital?"
  • "Can you tell me the capital city of France?"

From an LLM's perspective, these are three separate requests. From a user's perspective, they're the same question. Without caching, you pay for each one.

Semantic caching solves this. Instead of treating every request as new, your system recognizes when a query is similar enough to a previous one and reuses the existing response.

In real-world systems, this single optimization can reduce LLM calls by 30–70%, drop latency from seconds to milliseconds, and significantly lower your token costs. It's one of the highest-leverage improvements you can make early in your architecture.


How It Works

Traditional caching relies on exact string matches. Change a single character and the cache misses.

Semantic caching takes a different approach: instead of comparing raw text, it compares meaning using embeddings.

Here's the flow:

User Query
    ↓
Generate embedding
    ↓
Search cache for similar embeddings
    ↓
Match found? → Return cached response
    ↓
No match? → Call LLM → Store result in cache → Return response

The key insight: you avoid calling the LLM unless you have to. Before every request, you ask "Have I already answered something similar enough?" If yes, you skip the most expensive part of your system entirely.

Under the hood, this works by converting queries into vectors and measuring how close they are in vector space. If the distance is below a threshold, the system considers them a match.
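As a toy illustration of "close in vector space" (with hand-picked 3-D vectors, not real embeddings), the distance check might look like:

```python
import math

def l2(a, b):
    # Euclidean distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical embeddings: the two paraphrases land near each other,
# the unrelated query lands far away.
capital_1 = [0.90, 0.10, 0.05]   # "What is the capital of France?"
capital_2 = [0.88, 0.12, 0.07]   # "What's France's capital?"
weather   = [0.05, 0.95, 0.20]   # "What's the weather in Paris?"

print(l2(capital_1, capital_2))  # small -> below threshold, treat as a match
print(l2(capital_1, weather))    # large -> above threshold, cache miss
```

Real embedding models produce hundreds of dimensions, but the principle is identical: paraphrases cluster, unrelated queries don't.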


Implementation

Let's build a working semantic cache. We'll use FAISS for vector search and sentence-transformers for embeddings, which keeps everything local and dependency-light.

Install dependencies

pip install faiss-cpu sentence-transformers numpy

Note on Dependencies

Depending on your environment (especially Python 3.12 on macOS), you may need to pin a few dependencies due to PyTorch compatibility.

For example:

pip install "sentence-transformers<4" "transformers<5" "numpy<2"

The versions in this article are intentionally left unpinned to keep things simple, but if you run into installation issues, try the pinned versions above.

Define a cache interface

from abc import ABC, abstractmethod

class ResponseCache(ABC):
    @abstractmethod
    def get(self, query: str, context: dict) -> str | None:
        """Look up a cached response. Returns None on miss."""
        ...

    @abstractmethod
    def put(self, query: str, context: dict, response: str) -> None:
        """Store a response for future reuse."""
        ...

Defining an interface keeps your application decoupled from the caching backend. You might start with FAISS locally, then move to Redis or Qdrant in production. Your LLM logic shouldn't need to change.

Implement a no-op cache

class NoCache(ResponseCache):
    def get(self, query, context):
        return None

    def put(self, query, context, response):
        pass

This gives you a safe default for environments where caching isn't available, and a clean baseline for benchmarking.
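For comparison, here's what a traditional exact-match cache looks like behind the same interface (a sketch, not part of the article's code). It makes the limitation from the intro concrete: reword the query and you miss.

```python
import hashlib
import json

class ExactMatchCache:
    """Baseline cache keyed on the literal query string plus context.
    Implements the same get/put shape as ResponseCache."""
    def __init__(self):
        self._store: dict[str, str] = {}

    def _key(self, query: str, context: dict) -> str:
        stable = json.dumps({"q": query, "ctx": context}, sort_keys=True)
        return hashlib.sha256(stable.encode()).hexdigest()

    def get(self, query, context):
        return self._store.get(self._key(query, context))

    def put(self, query, context, response):
        self._store[self._key(query, context)] = response

cache = ExactMatchCache()
ctx = {"model": "claude-sonnet-4-20250514", "temperature": 0.0}
cache.put("What is the capital of France?", ctx, "Paris.")

print(cache.get("What is the capital of France?", ctx))  # hit
print(cache.get("What's France's capital?", ctx))        # miss: None
```

Because it satisfies the same interface, you can benchmark it head-to-head against the semantic cache and measure exactly how many extra hits semantic matching buys you.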

Implement the semantic cache

import json
import time
import hashlib
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer


class SemanticCache(ResponseCache):
    def __init__(
        self,
        model_name: str = "all-MiniLM-L6-v2",
        distance_threshold: float = 0.25,
        ttl_seconds: int = 3600,
    ):
        self.encoder = SentenceTransformer(model_name)
        self.dimension = self.encoder.get_sentence_embedding_dimension()
        self.distance_threshold = distance_threshold
        self.ttl_seconds = ttl_seconds

        # FAISS index for fast similarity search (L2 distance)
        self.index = faiss.IndexFlatL2(self.dimension)

        # Parallel store: maps index position → cached entry
        self.entries: list[dict] = []

    def _context_key(self, context: dict) -> str:
        """Create a deterministic key from context so we only match
        responses generated under the same conditions."""
        stable = json.dumps(context, sort_keys=True)
        return hashlib.sha256(stable.encode()).hexdigest()

    def _embed(self, text: str) -> np.ndarray:
        # Normalize embeddings so squared-L2 distance maps cleanly onto
        # cosine distance (d^2 = 2 * (1 - cosine similarity)), which keeps
        # the distance threshold meaningful across embedding models.
        vector = self.encoder.encode([text], normalize_embeddings=True)
        return np.array(vector, dtype="float32")

    def get(self, query: str, context: dict) -> str | None:
        if self.index.ntotal == 0:
            return None

        query_vector = self._embed(query)
        distances, indices = self.index.search(query_vector, k=5)

        ctx_key = self._context_key(context)
        now = time.time()

        for dist, idx in zip(distances[0], indices[0]):
            if idx == -1:
                continue
            if dist > self.distance_threshold:
                continue

            entry = self.entries[idx]

            # Context must match (model, temperature, user, etc.)
            if entry["context_key"] != ctx_key:
                continue

            # Respect TTL
            if now - entry["created_at"] > self.ttl_seconds:
                continue

            return entry["response"]

        return None

    def put(self, query: str, context: dict, response: str) -> None:
        query_vector = self._embed(query)
        self.index.add(query_vector)
        self.entries.append({
            "query": query,
            "response": response,
            "context_key": self._context_key(context),
            "created_at": time.time(),
        })

A few things to note:

  • all-MiniLM-L6-v2 is a lightweight embedding model (~80MB) that's fast enough for real-time use. For higher accuracy on domain-specific queries, consider all-mpnet-base-v2 or a fine-tuned model.
  • FAISS IndexFlatL2 does exact nearest-neighbor search and returns squared L2 (Euclidean) distances. For millions of entries, switch to IndexIVFFlat or IndexHNSWFlat for approximate search.
  • The context key ensures we never return a cached response generated under different conditions (different model, temperature, user, etc.).
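Conceptually, IndexFlatL2 is just a brute-force scan. Here's a NumPy sketch of the same exact search (illustrative only, not the FAISS internals) — note that, like FAISS, it reports squared distances:

```python
import numpy as np

def search_flat_l2(index_vectors: np.ndarray, query: np.ndarray, k: int):
    """Exact nearest-neighbor search: compute the squared L2 distance to
    every stored vector, then take the k smallest."""
    diffs = index_vectors - query          # broadcast over all rows
    dists = (diffs ** 2).sum(axis=1)       # squared Euclidean distances
    order = np.argsort(dists)[:k]
    return dists[order], order

stored = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]], dtype="float32")
dists, idx = search_flat_l2(stored, np.array([1.0, 0.0], dtype="float32"), k=2)
print(idx)    # indices of the two nearest stored vectors
print(dists)  # their squared distances
```

This linear scan is why exact search gets expensive at scale — approximate indexes like HNSW trade a little recall for sublinear lookups.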

Tie it all together

def handle_request(query: str, context: dict, cache: ResponseCache, llm) -> str:
    # 1. Check cache
    cached = cache.get(query, context)
    if cached is not None:
        print("[cache hit]")
        return cached

    # 2. Cache miss — call the LLM
    print("[cache miss → calling LLM]")
    response = llm.generate(query)

    # 3. Store for future reuse
    cache.put(query, context, response)

    return response

That's it. Three steps: check, call, store.

Try it out

# Swap in your actual LLM client here
class MockLLM:
    def generate(self, query):
        return "The capital of France is Paris."

cache = SemanticCache(distance_threshold=0.25, ttl_seconds=3600)
llm = MockLLM()
context = {"model": "claude-sonnet-4-20250514", "temperature": 0.0, "user": "user_123"}

# First call — cache miss, calls the LLM
r1 = handle_request("What is the capital of France?", context, cache, llm)

# Second call — semantically similar, cache hit
r2 = handle_request("What's France's capital city?", context, cache, llm)

# Different context — cache miss even though query is similar
other_context = {"model": "gpt-4o", "temperature": 0.7, "user": "user_123"}
r3 = handle_request("Tell me the capital of France", other_context, cache, llm)

Tuning the Distance Threshold

The distance threshold is the most important tuning parameter in your system. It controls the tradeoff between precision (returning only correct matches) and recall (catching more cache hits).

Lower values → stricter matching, fewer false positives, lower hit rate
Higher values → more matches, higher hit rate, risk of returning wrong responses

The right value depends on your embedding model and distance metric:

  • L2 (Euclidean): typical range 0.15 – 0.40 (used in our FAISS example above)
  • Cosine distance: typical range 0.05 – 0.15 (1 - cosine_similarity; common in Redis, Qdrant)

Start around 0.25 for L2 or 0.10 for cosine, then adjust based on real traffic.
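The two metrics are directly related when embeddings are normalized to unit length: squared L2 distance equals 2 · (1 − cosine similarity). A quick check with plain math (no embedding model involved):

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two unit-length vectors separated by a small angle
a = [1.0, 0.0]
b = [math.cos(0.3), math.sin(0.3)]

lhs = squared_l2(a, b)
rhs = 2 * (1 - cosine_sim(a, b))
print(lhs, rhs)  # equal for unit vectors
```

This identity is handy when porting a threshold from one backend to another: an L2 threshold of 0.25 on normalized vectors corresponds to a cosine distance of 0.125.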

How to calibrate

Don't guess. Log your near-misses and spot-check them:

# Inside SemanticCache.get(), after computing dist:
# during development, log borderline matches for review
if dist <= self.distance_threshold * 1.5:
    print(f"Near match: dist={dist:.4f} | query='{query}' | cached='{entry['query']}'")

Review these logs periodically. If you see incorrect matches slipping through, tighten the threshold. If you see obvious matches being missed, loosen it.
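Once you've hand-labeled a batch of those logged matches as correct or incorrect, you can sweep candidate thresholds over them (a sketch with made-up labels):

```python
# Each logged pair: (distance, whether the cached answer was actually correct)
labeled = [
    (0.08, True), (0.12, True), (0.18, True), (0.22, True),
    (0.24, False), (0.31, True), (0.35, False), (0.42, False),
]

for threshold in (0.15, 0.25, 0.35):
    hits = [ok for dist, ok in labeled if dist <= threshold]
    precision = sum(hits) / len(hits)
    print(f"threshold={threshold}: {len(hits)} hits, precision={precision:.2f}")
```

Pick the largest threshold whose precision you can live with — that maximizes hit rate without serving wrong answers.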


Context Filters: Correctness and Security

Semantic similarity alone isn't enough. Two queries can be nearly identical in meaning but require different responses based on context.

Correctness concerns:

  • Different models produce different outputs
  • Temperature affects randomness — a cached temperature=0 response shouldn't serve a temperature=1 request
  • System prompts or attached documents change the answer

Security concerns:

In multi-tenant systems, responses that contain user-specific data (account details, personalized recommendations, user-scoped RAG results) must include the user identifier in the context key. Without it, User A could receive User B's cached response. Treat this as a security boundary, not just a correctness optimization.

import hashlib

context = {
    "model": "claude-sonnet-4-20250514",
    "temperature": 0.0,
    "user": user_id,
    # Use a stable hash — Python's built-in hash() is randomized per
    # process, which would silently invalidate the cache across restarts
    "system_prompt_hash": hashlib.sha256(system_prompt.encode()).hexdigest(),
}

A note on cache hit rate: Including the user in the context key means every user builds separate cache entries, even for identical answers. For applications where responses don't depend on who's asking — general knowledge, shared documentation, public FAQs — consider omitting the user from the context key so all users share the same cache entries. This can dramatically improve your hit rate. The right approach depends on your application; the important thing is to make the decision intentionally rather than including or excluding the user by default across the board.
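One way to make that decision explicit rather than implicit (a sketch; the `shared` flag and helper name are illustrative, not part of the article's code):

```python
def build_context_fields(model: str, temperature: float,
                         user_id: str, shared: bool) -> dict:
    """Return the fields that feed the cache's context key.
    For shared (user-independent) answers, omit the user so every user
    hits the same cache entries; otherwise scope entries per user."""
    fields = {"model": model, "temperature": temperature}
    if not shared:
        fields["user"] = user_id  # security boundary for personalized data
    return fields

public = build_context_fields("claude-sonnet-4-20250514", 0.0, "user_123", shared=True)
private = build_context_fields("claude-sonnet-4-20250514", 0.0, "user_123", shared=False)
print("user" in public, "user" in private)
```

Routing each query category through a helper like this forces the shared-vs-scoped decision to be made per feature, not by accident.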


Cache Invalidation

TTL handles the simple case: responses expire after a set period. But in practice, you'll also need to think about:

  • Knowledge updates. If the underlying data your LLM references changes (e.g., you update your RAG corpus), cached responses built on the old data become stale. Consider including a version identifier in your context key so that corpus updates automatically invalidate old entries.

  • System prompt changes. If you modify your system prompt, cached responses from the previous version may no longer be appropriate. Hashing the system prompt into your context key (as shown above) handles this automatically.

  • Selective invalidation. Sometimes you need to invalidate specific entries rather than waiting for TTL. Adding a purge(context) method to your cache gives you this escape hatch.

def purge(self, context: dict) -> int:
    """Remove all entries matching a given context. Returns count removed."""
    ctx_key = self._context_key(context)
    removed = 0
    for entry in self.entries:
        if entry["context_key"] == ctx_key:
            entry["created_at"] = 0  # Force expiry on the next lookup
            removed += 1
    return removed

For production systems with frequent knowledge updates, you'll likely want a more sophisticated approach — e.g., tagging cache entries with a corpus version and bulk-invalidating when the version changes.
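A sketch of that versioned approach (the `corpus_version` field name is illustrative): because the version participates in the context key, bumping it makes every old entry stop matching, with no explicit purge.

```python
import hashlib
import json

def context_key(context: dict) -> str:
    # Same deterministic hashing as SemanticCache._context_key
    stable = json.dumps(context, sort_keys=True)
    return hashlib.sha256(stable.encode()).hexdigest()

ctx_v1 = {"model": "claude-sonnet-4-20250514", "corpus_version": "2024-06-01"}
ctx_v2 = {"model": "claude-sonnet-4-20250514", "corpus_version": "2024-07-15"}

# Bumping the corpus version changes the key, so entries cached under
# the old corpus silently stop matching
print(context_key(ctx_v1) == context_key(ctx_v2))
```

The stale entries still occupy memory until TTL or a cleanup pass removes them, so pair this with periodic compaction in long-running systems.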


Common Pitfalls

Caching everything. Some responses should never be cached: real-time data (stock prices, weather), sensitive PII-containing responses, or anything where staleness causes harm. Maintain an explicit skip list.

No TTL. Without expiration, your cache will silently return outdated responses. Always set a TTL, even if it's generous (e.g., 24 hours).

Ignoring context. If you cache without filtering on model, temperature, and user, you will eventually serve wrong or leaked responses. This is the most dangerous pitfall because it often doesn't surface in testing — only in production with real multi-user traffic.

Poor serialization. If you're caching structured LLM responses (tool calls, JSON, streaming chunks), make sure your serialization round-trips correctly. A subtle bug here can produce responses that look right but are subtly broken.

Embedding model mismatch. If you change your embedding model, your existing cache becomes invalid — the vector spaces are incompatible. Either clear the cache on model change or version your cache keys.


When to Use It (and When Not To)

Good fit:

  • FAQ-style or search-style applications with high query overlap
  • Customer support bots where similar questions recur frequently
  • RAG systems where the same retrievals happen repeatedly
  • Internal tools with a bounded set of common queries

Poor fit:

  • Highly personalized responses that differ per user even for the same query
  • Real-time data applications where freshness is critical
  • Creative applications where variety is the point (e.g., brainstorming tools)

Semantic caching works best when there's meaningful reuse across queries. If every request is genuinely unique, the cache overhead adds cost without benefit.


Going to Production

The FAISS implementation above is great for prototyping and single-process applications. When you're ready to scale, here's what changes:

  • Vector store. Move to a dedicated vector database: Redis with vector search, Qdrant, Pinecone, or Weaviate. These give you persistence, replication, and filtering built in.
  • Embedding service. Consider moving embedding generation to an API (OpenAI, Cohere, or a self-hosted model) so your application server stays lightweight.
  • Monitoring. Track your cache hit rate, average distance of matches, and latency savings. These metrics tell you if your threshold is calibrated correctly and how much value the cache is providing.
  • Warm-up. Pre-populate the cache with common queries from your logs to maximize hit rate from day one.
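A minimal way to start tracking those metrics is to wrap any cache in a counting layer (a sketch, not a production metrics pipeline — in practice you'd export these to your monitoring system):

```python
class InstrumentedCache:
    """Wraps any object with get/put and counts hits and misses."""
    def __init__(self, inner):
        self.inner = inner
        self.hits = 0
        self.misses = 0

    def get(self, query, context):
        result = self.inner.get(query, context)
        if result is None:
            self.misses += 1
        else:
            self.hits += 1
        return result

    def put(self, query, context, response):
        self.inner.put(query, context, response)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Demonstration with a trivial dict-backed cache
class DictCache:
    def __init__(self):
        self.store = {}
    def get(self, query, context):
        return self.store.get(query)
    def put(self, query, context, response):
        self.store[query] = response

cache = InstrumentedCache(DictCache())
cache.get("q1", {})             # miss
cache.put("q1", {}, "answer")
cache.get("q1", {})             # hit
print(f"hit rate: {cache.hit_rate:.0%}")
```

Because it honors the same get/put interface, the wrapper drops in front of the semantic cache without touching handle_request.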

Wrapping Up

Semantic caching doesn't require retraining models or changing your core LLM logic. It adds a decision layer in front of your existing system: Have I already answered something similar enough? If yes, skip the LLM call.

That single decision can make your system significantly faster, cheaper, and more scalable. If you're running LLMs in production, it's worth building in early — the ROI only grows as your traffic does.

The full working code from this article is available as a single file you can drop into your project and start experimenting with immediately.
