LLM API costs add up fast. If your application calls a language model API for every user request, you are paying for a lot of duplicate work. In many production systems, 30–50% of incoming queries are either exact repeats or semantically near-identical to something you have already answered. A caching layer captures those hits before they reach the API.
This tutorial builds a two-tier cache: exact-match via Redis (SHA-256 key) and semantic near-duplicate detection via cosine similarity over stored embeddings.
The problem in numbers
Say you are running a customer support assistant that handles 100,000 queries per day. Your LLM costs $0.01 per 1,000 tokens, and the average query + response is 500 tokens.
- Daily cost without caching: 100,000 queries × 0.5 (thousand tokens each) × $0.01 = $500/day
- With a 40% cache hit rate: 60,000 API calls × 0.5 × $0.01 = $300/day
- Monthly saving: ($500 − $300) × 30 days = $6,000
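That 40% is a middle-of-the-road assumption. Support bots with highly repetitive traffic can reach 60% or more once the cache is warm, so measure the hit rate on your own traffic before budgeting around it.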
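The arithmetic above is easy to wrap in a small helper so you can plug in your own volumes and prices. This is a minimal sketch: the function name `daily_cost` and the 30-day month are assumptions made for illustration, and the inputs are the figures from this section.

```python
def daily_cost(queries_per_day: int, tokens_per_query: int,
               price_per_1k_tokens: float, hit_rate: float = 0.0) -> float:
    """Estimated daily LLM spend in dollars for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Only cache misses reach the paid API.
    api_calls = queries_per_day * (1.0 - hit_rate)
    return api_calls * (tokens_per_query / 1000) * price_per_1k_tokens


baseline = daily_cost(100_000, 500, 0.01)
cached = daily_cost(100_000, 500, 0.01, hit_rate=0.4)
monthly_saving = (baseline - cached) * 30  # assumes a 30-day month
print(f"${baseline:.0f}/day -> ${cached:.0f}/day, saving ${monthly_saving:.0f}/month")
```

The model ignores the cost of running the cache itself (Redis hosting and embedding calls), so treat the result as an upper bound on the saving.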
Architecture
User query
    │
    ▼
[SHA-256 exact match] ──hit──▶ return cached response
    │ miss
    ▼
[Embedding cosine similarity] ──hit (≥0.92)──▶ return cached response
    │ miss
    ▼
[LLM API call]
    │
    ▼
[Store in both caches]
    │
    ▼
Return response
Dependencies
pip install redis openai numpy python-dotenv
You also need a running Redis instance:
docker run -d -p 6379:6379 redis:7-alpine
The exact cache (Redis + SHA-256)
Exact caching is trivial but catches a surprising volume of traffic — especially bots, retries, and repeated UI actions.
import hashlib
import json

import redis


class ExactCache:
    def __init__(self, host="localhost", port=6379, db=0, ttl=86400):
        self.client = redis.Redis(host=host, port=port, db=db, decode_responses=True)
        self.ttl = ttl  # 24 hours default

    def _key(self, prompt: str, model: str) -> str:
        # Include the model so the same prompt sent to two models never collides.
        payload = f"{model}::{prompt}"
        return "llm:exact:" + hashlib.sha256(payload.encode()).hexdigest()

    def get(self, prompt: str, model: str) -> dict | None:
        raw = self.client.get(self._key(prompt, model))
        if raw:
            return json.loads(raw)
        return None

    def set(self, prompt: str, model: str, response: dict) -> None:
        # SETEX stores the value and its expiry in a single call.
        self.client.setex(
            self._key(prompt, model),
            self.ttl,
            json.dumps(response)
        )
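Because the key is a hash of the raw prompt, a trailing space or a different capitalisation produces a different key and a miss. One optional refinement is to normalise the prompt before hashing. The sketch below is an assumption layered on top of the tutorial, not part of `ExactCache`: the helpers `normalize_prompt` and `exact_key` are hypothetical names, and the key layout simply mirrors `_key` above.

```python
import hashlib
import unicodedata


def normalize_prompt(prompt: str) -> str:
    """Canonicalise a prompt so trivially different inputs share one key."""
    text = unicodedata.normalize("NFKC", prompt)
    # Collapse every run of whitespace (including newlines) to a single space
    # and drop leading/trailing whitespace.
    text = " ".join(text.split())
    return text.casefold()


def exact_key(prompt: str, model: str) -> str:
    """Same key layout as ExactCache._key, but over the normalised prompt."""
    payload = f"{model}::{normalize_prompt(prompt)}"
    return "llm:exact:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = exact_key("  What is\n SQL   injection? ", "gpt-4o-mini")
key_b = exact_key("what is sql injection?", "gpt-4o-mini")
```

Case folding is only safe when case carries no meaning. If prompts contain code, identifiers, or passwords, keep the whitespace step and drop `casefold()`.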
The semantic cache (embeddings + cosine similarity)
Near-duplicate detection requires converting queries to embeddings and comparing them in vector space. We store embedding vectors in Redis alongside their cached responses.
import hashlib
import json

import numpy as np
import redis
from openai import OpenAI


class SemanticCache:
    def __init__(self, redis_client: redis.Redis, threshold=0.92, ttl=3600):
        self.redis = redis_client
        self.threshold = threshold
        self.ttl = ttl
        self.embed_client = OpenAI()  # or any OpenAI-compatible client

    def _embed(self, text: str) -> np.ndarray:
        response = self.embed_client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return np.array(response.data[0].embedding, dtype=np.float32)

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom == 0:
            return 0.0  # a zero-length vector matches nothing
        return float(np.dot(a, b) / denom)

    def _all_entries(self):
        # SCAN iterates incrementally; KEYS would block Redis on a large keyspace.
        for key in self.redis.scan_iter(match="llm:semantic:*"):
            raw = self.redis.get(key)
            if raw:  # the key may have expired between SCAN and GET
                yield key, json.loads(raw)

    def get(self, prompt: str) -> dict | None:
        query_vec = self._embed(prompt)
        best_score = 0.0
        best_entry = None
        # Linear scan: keep the stored entry most similar to the query.
        for key, entry in self._all_entries():
            stored_vec = np.array(entry["embedding"], dtype=np.float32)
            score = self._cosine_similarity(query_vec, stored_vec)
            if score > best_score:
                best_score = score
                best_entry = entry
        if best_score >= self.threshold and best_entry:
            return best_entry["response"]
        return None

    def set(self, prompt: str, response: dict) -> None:
        vec = self._embed(prompt)
        key = "llm:semantic:" + hashlib.sha256(prompt.encode()).hexdigest()
        entry = {
            "prompt": prompt,
            "embedding": vec.tolist(),
            "response": response
        }
        self.redis.setex(key, self.ttl, json.dumps(entry))
Note on scale: scanning all keys works fine up to a few thousand entries. Beyond that, swap the linear scan for a vector database like Qdrant or pgvector.
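To experiment with the matching logic without Redis or an API key, the same linear scan can be reproduced over an in-memory list. This is a standard-library sketch under stated assumptions: `cosine_similarity` and `best_match` are hypothetical helper names, and the 3-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


def best_match(query_vec, entries, threshold=0.92):
    """Linear scan over (vector, response) pairs.

    Returns (response, score) if the closest entry reaches the threshold,
    otherwise (None, best score seen).
    """
    best_score, best_response = 0.0, None
    for stored_vec, response in entries:
        score = cosine_similarity(query_vec, stored_vec)
        if score > best_score:
            best_score, best_response = score, response
    if best_score >= threshold:
        return best_response, best_score
    return None, best_score


# Toy vectors standing in for real embeddings (hypothetical values).
entries = [
    ([1.0, 0.0, 0.0], "answer about buffer overflows"),
    ([0.0, 1.0, 0.0], "answer about SQL injection"),
]
hit, hit_score = best_match([0.9, 0.1, 0.0], entries)    # close to the first entry
miss, miss_score = best_match([0.6, 0.6, 0.5], entries)  # close to neither
```

Every lookup touches every stored vector, which is why the cost grows linearly with cache size and why a vector index becomes worthwhile at larger scale.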
The unified cache manager
import time
from dataclasses import dataclass


@dataclass
class CacheMetrics:
    exact_hits: int = 0
    semantic_hits: int = 0
    misses: int = 0
    total_requests: int = 0
    latency_saved_ms: float = 0.0

    @property
    def hit_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return (self.exact_hits + self.semantic_hits) / self.total_requests
class LLMCacheManager:
    def __init__(self):
        r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)
        self.exact = ExactCache(ttl=86400)
        self.semantic = SemanticCache(r, threshold=0.92, ttl=3600)
        self.llm_client = OpenAI()
        self.metrics = CacheMetrics()
        self._api_ms_total = 0.0  # summed latency of real API calls

    def _credit_hit(self, t0: float) -> None:
        # A hit saves roughly one average API call, minus the lookup time.
        # Nothing is credited until at least one real call has been timed.
        if self.metrics.misses:
            avg_api_ms = self._api_ms_total / self.metrics.misses
            lookup_ms = (time.perf_counter() - t0) * 1000
            self.metrics.latency_saved_ms += max(0.0, avg_api_ms - lookup_ms)

    def query(self, prompt: str, model: str = "gpt-4o-mini") -> dict:
        self.metrics.total_requests += 1
        t0 = time.perf_counter()

        # Tier 1: exact match
        cached = self.exact.get(prompt, model)
        if cached is not None:
            self.metrics.exact_hits += 1
            self._credit_hit(t0)
            return {**cached, "_cache": "exact"}

        # Tier 2: semantic match
        cached = self.semantic.get(prompt)
        if cached is not None:
            self.metrics.semantic_hits += 1
            self._credit_hit(t0)
            return {**cached, "_cache": "semantic"}

        # Cache miss: call the API and time the call
        t_api = time.perf_counter()
        response = self.llm_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        self.metrics.misses += 1
        self._api_ms_total += (time.perf_counter() - t_api) * 1000
        result = {
            "content": response.choices[0].message.content,
            "model": model,
            "usage": response.usage.model_dump(),
        }

        # Store in both caches
        self.exact.set(prompt, model, result)
        self.semantic.set(prompt, result)
        return {**result, "_cache": "miss"}

    def print_metrics(self):
        m = self.metrics
        print(f"Total requests : {m.total_requests}")
        print(f"Exact hits : {m.exact_hits}")
        print(f"Semantic hits : {m.semantic_hits}")
        print(f"Misses : {m.misses}")
        print(f"Hit rate : {m.hit_rate:.1%}")
        print(f"Latency saved : {m.latency_saved_ms:.0f} ms")
Putting it together
if __name__ == "__main__":
    cache = LLMCacheManager()
    queries = [
        "What is a buffer overflow vulnerability?",
        "Explain buffer overflow attacks",           # semantic near-duplicate
        "What is a buffer overflow vulnerability?",  # exact duplicate
        "How does SQL injection work?",
        "What is SQL injection?",                    # semantic near-duplicate
    ]
    for q in queries:
        result = cache.query(q)
        print(f"[{result['_cache']:8s}] {q[:60]}")
    print()
    cache.print_metrics()
Illustrative output from a first run against an empty cache. The latency figure, and whether each paraphrase clears the 0.92 threshold, will vary with your embedding model and network:
[miss    ] What is a buffer overflow vulnerability?
[semantic] Explain buffer overflow attacks
[exact   ] What is a buffer overflow vulnerability?
[miss    ] How does SQL injection work?
[semantic] What is SQL injection?

Total requests : 5
Exact hits : 1
Semantic hits : 2
Misses : 2
Hit rate : 60.0%
Latency saved : 1847 ms
Tuning the similarity threshold
The 0.92 threshold is a good starting point but you should calibrate it on your own data:
def evaluate_threshold(cache: SemanticCache, pairs: list[tuple[str, str, bool]]):
    """
    pairs: list of (query_a, query_b, should_match)
    """
    for q_a, q_b, expected in pairs:
        vec_a = cache._embed(q_a)
        vec_b = cache._embed(q_b)
        score = cache._cosine_similarity(vec_a, vec_b)
        match = score >= cache.threshold
        status = "OK" if match == expected else "WRONG"
        print(f"[{status}] {score:.3f} '{q_a[:40]}' vs '{q_b[:40]}'")
Run this with a labelled sample of 50–100 query pairs from your actual traffic to find the right threshold for your domain.
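Once you have similarity scores for your labelled pairs, it can help to compare several candidate thresholds side by side rather than checking one at a time. The following is a sketch, assuming the scores have already been computed (for example with `cache._embed` and `cache._cosine_similarity` as above). The function `sweep_thresholds` and the sample scores are hypothetical.

```python
def sweep_thresholds(scored_pairs, thresholds):
    """For each threshold, compute precision and recall over (score, should_match) pairs.

    Convention used here: precision or recall is reported as 1.0 when its
    denominator is zero (no predicted matches, or no expected matches).
    """
    rows = []
    for t in thresholds:
        tp = sum(1 for s, y in scored_pairs if s >= t and y)      # correct hits
        fp = sum(1 for s, y in scored_pairs if s >= t and not y)  # wrong answer served
        fn = sum(1 for s, y in scored_pairs if s < t and y)       # missed saving
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 1.0
        rows.append({"threshold": t, "precision": precision,
                     "recall": recall, "false_hits": fp})
    return rows


# Hypothetical scores for five labelled pairs.
scored_pairs = [(0.97, True), (0.93, True), (0.90, True), (0.91, False), (0.75, False)]
rows = sweep_thresholds(scored_pairs, [0.85, 0.90, 0.92, 0.95])
for row in rows:
    print(f"{row['threshold']:.2f}  precision={row['precision']:.2f}  "
          f"recall={row['recall']:.2f}  false_hits={row['false_hits']}")
```

A false hit serves a user the answer to a different question, while a missed match only costs one API call, so it is usually reasonable to pick the lowest threshold that keeps `false_hits` at or near zero.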
Cost accounting
Add a simple cost tracker to quantify savings in real time:
COST_PER_1K_TOKENS = 0.01  # adjust for your model


def estimated_savings(metrics: CacheMetrics, avg_tokens: int = 500) -> float:
    hits = metrics.exact_hits + metrics.semantic_hits
    saved_tokens = hits * avg_tokens
    return (saved_tokens / 1000) * COST_PER_1K_TOKENS
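The estimate above counts only what the cache saves. The semantic tier also spends money: `SemanticCache.get` embeds every prompt that misses the exact tier, and `SemanticCache.set` embeds the prompt again when a miss is stored. A rough net figure can be sketched as follows. The function `net_savings`, the 50-token average prompt length, and the embedding price are all placeholder assumptions, so substitute your provider's actual rates.

```python
def net_savings(exact_hits: int, semantic_hits: int, misses: int,
                avg_llm_tokens: int = 500, avg_prompt_tokens: int = 50,
                llm_price_per_1k: float = 0.01,
                embed_price_per_1k: float = 0.0001) -> float:
    """Dollars saved on LLM calls minus dollars spent on embedding calls.

    The default prices and token counts are placeholders, not real rates.
    """
    llm_saved = (exact_hits + semantic_hits) * avg_llm_tokens / 1000 * llm_price_per_1k
    # One embedding per semantic hit (lookup only); two per miss (lookup + store).
    # Exact hits never reach the embedding step.
    embed_calls = semantic_hits + 2 * misses
    embed_cost = embed_calls * avg_prompt_tokens / 1000 * embed_price_per_1k
    return llm_saved - embed_cost


# Counts from the five-query demo: 1 exact hit, 2 semantic hits, 2 misses.
demo_net = net_savings(exact_hits=1, semantic_hits=2, misses=2)
```

Reusing the vector computed during lookup when storing a miss would halve the embedding calls on misses, at the price of passing the vector between `get` and `set`.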
For teams building AI-powered security tooling — where queries about CVEs, attack patterns, and compliance requirements tend to repeat heavily — caching can eliminate a majority of API spend. We cover patterns like this at AYI NEDJIMI Consultants when auditing AI pipelines for production readiness.
What to do next
- Replace the linear embedding scan with Qdrant, Weaviate, or pgvector for production scale
- Add cache warming: pre-embed your top-100 FAQ at startup
- Track per-prompt hit rates to find eviction candidates
- Add cache invalidation hooks when underlying knowledge changes
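For the invalidation item, one lightweight alternative to explicit hooks is key versioning: include a knowledge-base version string in the hashed payload, so bumping the version makes every old entry unreachable and leaves it to expire through its TTL. This is a different technique from hooks, shown as a sketch only. `versioned_key` and `kb_version` are hypothetical names, and the key layout mirrors `ExactCache._key`.

```python
import hashlib


def versioned_key(prompt: str, model: str, kb_version: str) -> str:
    """Exact-cache key that changes whenever the knowledge-base version changes."""
    payload = f"{kb_version}::{model}::{prompt}"
    return "llm:exact:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Bumping the version (for example after a documentation update) yields a new key,
# so the stale answer stored under the old key is never looked up again.
old_key = versioned_key("How do I reset my password?", "gpt-4o-mini", "kb-v1")
new_key = versioned_key("How do I reset my password?", "gpt-4o-mini", "kb-v2")
```

The trade-off is that a version bump discards the whole cache at once, including answers that were unaffected by the change.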
The full source is intentionally minimal — around 150 lines — so you can drop it into any existing codebase without pulling in a heavy framework.