This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.
LLM Caching: Semantic Cache, Exact Match, TTL, Invalidation Strategies
Introduction
LLM API calls are expensive, both in cost and latency. Caching previously generated responses can reduce costs by 20-80% depending on the application. Unlike traditional HTTP caching where exact URL matching suffices, LLM caching must handle semantically equivalent but textually different queries. This article covers caching strategies from simple exact match to sophisticated semantic caching.
Exact Match Cache
The simplest cache: identical inputs produce identical outputs:
import hashlib
import json
from functools import lru_cache
class ExactMatchCache:
def __init__(self, ttl_seconds: int = 3600):
self.cache = {}
self.ttl = ttl_seconds
def _make_key(self, messages: list[dict], model: str, params: dict) -> str:
canonical = json.dumps({
"messages": messages,
"model": model,
"temperature": params.get("temperature", 0),
"max_tokens": params.get("max_tokens"),
}, sort_keys=True)
return hashlib.sha256(canonical.encode()).hexdigest()
def get(self, messages: list[dict], model: str, params: dict) -> str | None:
key = self._make_key(messages, model, params)
if key in self.cache:
entry = self.cache[key]
if time.time() - entry["timestamp"] < self.ttl:
return entry["response"]
else:
del self.cache[key]
return None
def set(self, messages: list[dict], model: str, params: dict, response: str):
key = self._make_key(messages, model, params)
self.cache[key] = {"response": response, "timestamp": time.time()}
Exact match works well when identical questions recur: FAQs, repeated classification tasks, or template-based prompts where only parameters change.
Semantic Cache
Semantic caching returns cached responses for semantically equivalent questions:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
class SemanticCache:
def __init__(self, embedding_fn, similarity_threshold: float = 0.92):
self.embedding = embedding_fn
self.threshold = similarity_threshold
self.cache_entries: list[dict] = []
def get(self, query: str) -> str | None:
query_emb = self.embedding(query)
for entry in self.cache_entries:
similarity = cosine_similarity([query_emb], [entry["embedding"]])[0][0]
if similarity >= self.threshold:
entry["access_count"] += 1
entry["last_accessed"] = time.time()
return entry["response"]
return None
def set(self, query: str, response: str):
entry = {
"query": query,
"embedding": self.embedding(query),
"response": response,
"created_at": time.time(),
"access_count": 1,
"last_accessed": time.time(),
}
self.cache_entries.append(entry)
def evict_lru(self, max_entries: int = 10000):
if len(self.cache_entries) > max_entries:
self.cache_entries.sort(key=lambda e: e["last_accessed"])
self.cache_entries = self.cache_entries[-max_entries:]
The similarity threshold controls the precision-recall tradeoff. A threshold of 0.95 is safe but rarely matches. A threshold of 0.85 catches more queries but risks returning irrelevant cached responses. Test on your specific query distribution.
Two-Level Cache
Combine both strategies for maximum coverage:
class TwoLevelLLMCache:
def __init__(self, embedding_fn):
self.exact = ExactMatchCache(ttl_seconds=7200)
self.semantic = SemanticCache(embedding_fn, similarity_threshold=0.92)
def get(self, messages: list[dict], model: str, params: dict) -> str | None:
# Try exact match first (fast, no embedding computation)
exact_result = self.exact.get(messages, model, params)
if exact_result:
return exact_result
# Try semantic match for the last user message
last_user_msg = self._get_last_user_message(messages)
if last_user_msg:
semantic_result = self.semantic.get(last_user_msg)
if semantic_result:
return semantic_result
return None
def set(self, messages: list[dict], model: str, params: dict, response: str):
self.exact.set(messages, model, params, response)
last_user_msg = self._get_last_user_message(messages)
if last_user_msg:
self.semantic.set(last_user_msg, resp
Read the full article on AI Study Room for complete code examples, comparison tables, and related resources.
Found this useful? Check out more developer guides and tool comparisons on AI Study Room.
Top comments (0)