Ayi NEDJIMI

Posted on May 23

Building a cost-efficient LLM caching layer in Python

#python #ai #llm #performance

LLM API costs add up fast. If your application calls a language model API for every user request, you are paying for a lot of duplicate work. In many production systems, 30–50% of incoming queries are either exact repeats or semantically near-identical to something you have already answered. A caching layer captures those hits before they reach the API.

This tutorial builds a two-tier cache: exact-match via Redis (SHA-256 key) and semantic near-duplicate detection via cosine similarity over stored embeddings.

The problem in numbers

Say you are running a customer support assistant that handles 100,000 queries per day. Your LLM costs $0.01 per 1,000 tokens, and the average query + response is 500 tokens.

Daily cost without caching: 100,000 queries × 0.5K tokens × $0.01 per 1K tokens = $500/day
With a 40% cache hit rate: 60,000 API calls × 0.5K tokens × $0.01 per 1K tokens = $300/day
Monthly saving: ($500 − $300) × 30 days = $6,000

That 40% is meant as a conservative estimate. Support bots with highly repetitive traffic can reach 60% or more once the cache is warm, though the actual rate depends on your query mix.
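
The arithmetic above can be wrapped in a small helper so you can plug in your own traffic and pricing. This is a minimal sketch: the name daily_llm_cost and its parameters are illustrative, and it ignores the cost of the embedding calls made by the semantic tier.

```python
def daily_llm_cost(queries_per_day: int,
                   tokens_per_query: int,
                   price_per_1k_tokens: float,
                   hit_rate: float = 0.0) -> float:
    """Estimated daily API spend when `hit_rate` of queries are served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Only cache misses reach the paid API.
    api_calls = queries_per_day * (1.0 - hit_rate)
    return api_calls * (tokens_per_query / 1000) * price_per_1k_tokens


baseline = daily_llm_cost(100_000, 500, 0.01)
cached = daily_llm_cost(100_000, 500, 0.01, hit_rate=0.4)
monthly_saving = (baseline - cached) * 30

print(f"Baseline: ${baseline:,.0f}/day")
print(f"Cached:   ${cached:,.0f}/day")
print(f"Saving:   ${monthly_saving:,.0f}/month")
```

Rerunning it with hit rates of 0.3 and 0.5 gives a quick sense of how sensitive the saving is to the assumption.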

Architecture

User query
    │
    ▼
[SHA-256 exact match] ──hit──▶ return cached response
    │ miss
    ▼
[Embedding cosine similarity] ──hit (≥0.92)──▶ return cached response
    │ miss
    ▼
[LLM API call]
    │
    ▼
[Store in both caches]
    │
    ▼
Return response

Dependencies

pip install redis openai numpy python-dotenv

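The type hints below use the dict | None syntax, so you need Python 3.10 or later. The OpenAI client reads its key from the OPENAI_API_KEY environment variable. You also need a running Redis instance: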

docker run -d -p 6379:6379 redis:7-alpine

The exact cache (Redis + SHA-256)

Exact caching is trivial but catches a surprising volume of traffic — especially bots, retries, and repeated UI actions.

import hashlib
import json
import redis

class ExactCache:
    def __init__(self, host="localhost", port=6379, db=0, ttl=86400):
        self.client = redis.Redis(host=host, port=port, db=db, decode_responses=True)
        self.ttl = ttl  # 24 hours default

    def _key(self, prompt: str, model: str) -> str:
        payload = f"{model}::{prompt}"
        return "llm:exact:" + hashlib.sha256(payload.encode()).hexdigest()

    def get(self, prompt: str, model: str) -> dict | None:
        raw = self.client.get(self._key(prompt, model))
        if raw:
            return json.loads(raw)
        return None

    def set(self, prompt: str, model: str, response: dict) -> None:
        self.client.setex(
            self._key(prompt, model),
            self.ttl,
            json.dumps(response)
        )
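
One optional refinement, which is not part of the class above: normalizing the prompt before hashing lets prompts that differ only in case or whitespace share a key. The sketch below assumes that such differences never change the intended answer, which may not hold for prompts containing code or other case-sensitive text. The helper names normalize_prompt and exact_key, and the model name "another-model", are hypothetical.

```python
import hashlib
import re


def normalize_prompt(prompt: str) -> str:
    """Collapse runs of whitespace, trim the ends, and lowercase the text."""
    return re.sub(r"\s+", " ", prompt).strip().lower()


def exact_key(prompt: str, model: str) -> str:
    """Same key layout as ExactCache._key, computed on the normalized prompt."""
    payload = f"{model}::{normalize_prompt(prompt)}"
    return "llm:exact:" + hashlib.sha256(payload.encode()).hexdigest()


a = exact_key("What is a buffer overflow?", "gpt-4o-mini")
b = exact_key("  what is a   buffer overflow?\n", "gpt-4o-mini")
c = exact_key("What is a buffer overflow?", "another-model")

print(a == b)  # True: case and whitespace differences are normalized away
print(a == c)  # False: the model name is part of the hashed payload
```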

The semantic cache (embeddings + cosine similarity)

Near-duplicate detection requires converting queries to embeddings and comparing them in vector space. We store embedding vectors in Redis alongside their cached responses.

import numpy as np
from openai import OpenAI

class SemanticCache:
    def __init__(self, redis_client: redis.Redis, threshold=0.92, ttl=3600):
        self.redis = redis_client
        self.threshold = threshold
        self.ttl = ttl
        self.embed_client = OpenAI()  # or any OpenAI-compatible client

    def _embed(self, text: str) -> np.ndarray:
        response = self.embed_client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return np.array(response.data[0].embedding, dtype=np.float32)

    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

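    def _all_entries(self):
        # SCAN iterates incrementally; KEYS would block Redis while it walks the whole keyspace.
        for key in self.redis.scan_iter(match="llm:semantic:*", count=500):
            raw = self.redis.get(key)
            if raw:  # the key may have expired between SCAN and GET
                yield key, json.loads(raw)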

    def get(self, prompt: str) -> dict | None:
        query_vec = self._embed(prompt)
        best_score = 0.0
        best_entry = None

        for key, entry in self._all_entries():
            stored_vec = np.array(entry["embedding"], dtype=np.float32)
            score = self._cosine_similarity(query_vec, stored_vec)
            if score > best_score:
                best_score = score
                best_entry = entry

        if best_score >= self.threshold and best_entry:
            return best_entry["response"]
        return None

    def set(self, prompt: str, response: dict) -> None:
        vec = self._embed(prompt)
        key = "llm:semantic:" + hashlib.sha256(prompt.encode()).hexdigest()
        entry = {
            "prompt": prompt,
            "embedding": vec.tolist(),
            "response": response
        }
        self.redis.setex(key, self.ttl, json.dumps(entry))

Note on scale: scanning all keys works fine up to a few thousand entries. Beyond that, swap the linear scan for a vector database like Qdrant or pgvector.
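
Before reaching for a vector database, an intermediate step is to keep the stored embeddings in one NumPy matrix and score them all with a single matrix-vector product instead of a Python loop. The following is a sketch under that assumption; best_match is a hypothetical helper, and the three-dimensional vectors are toy data standing in for real embeddings.

```python
import numpy as np


def best_match(query_vec: np.ndarray, stored: np.ndarray) -> tuple[int, float]:
    """Return (row index, cosine similarity) of the stored vector closest to query_vec.

    stored has shape (n_entries, dim); query_vec has shape (dim,).
    Returns (-1, 0.0) when there are no stored vectors.
    """
    if stored.shape[0] == 0:
        return -1, 0.0
    # Normalize both sides once so a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    m = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    scores = m @ q
    idx = int(np.argmax(scores))
    return idx, float(scores[idx])


stored = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.6, 0.8, 0.0]], dtype=np.float32)
query = np.array([0.5, 0.9, 0.0], dtype=np.float32)

idx, score = best_match(query, stored)
print(idx, round(score, 3))  # row 2 is the closest, with a similarity of about 0.991
```

The caller would still compare the returned score against the threshold, exactly as SemanticCache.get does.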

The unified cache manager

import time
from dataclasses import dataclass

@dataclass
class CacheMetrics:
    exact_hits: int = 0
    semantic_hits: int = 0
    misses: int = 0
    total_requests: int = 0
    latency_saved_ms: float = 0.0

    @property
    def hit_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return (self.exact_hits + self.semantic_hits) / self.total_requests

class LLMCacheManager:
    def __init__(self):
        r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)
        self.exact = ExactCache(ttl=86400)
        self.semantic = SemanticCache(r, threshold=0.92, ttl=3600)
        self.llm_client = OpenAI()
        self.metrics = CacheMetrics()

    def query(self, prompt: str, model: str = "gpt-4o-mini") -> dict:
        self.metrics.total_requests += 1
        t0 = time.perf_counter()

        # Tier 1: exact match
        cached = self.exact.get(prompt, model)
        if cached:
            self.metrics.exact_hits += 1
            self._record_latency_saved(cached, t0)
            return {**cached, "_cache": "exact"}

        # Tier 2: semantic match (note: this tier is not keyed by model)
        cached = self.semantic.get(prompt)
        if cached:
            self.metrics.semantic_hits += 1
            self._record_latency_saved(cached, t0)
            return {**cached, "_cache": "semantic"}

        # Cache miss: call the API and time it, so later hits know what they saved
        self.metrics.misses += 1
        t_api = time.perf_counter()
        response = self.llm_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        api_latency_ms = (time.perf_counter() - t_api) * 1000
        result = {
            "content": response.choices[0].message.content,
            "model": model,
            "usage": response.usage.model_dump(),
            "api_latency_ms": api_latency_ms,
            "_cache": "miss"
        }

        # Store in both caches
        self.exact.set(prompt, model, result)
        self.semantic.set(prompt, result)

        return result

    def _record_latency_saved(self, cached: dict, t0: float) -> None:
        # Saved latency = duration of the original API call minus this lookup.
        lookup_ms = (time.perf_counter() - t0) * 1000
        saved_ms = cached.get("api_latency_ms", 0.0) - lookup_ms
        self.metrics.latency_saved_ms += max(saved_ms, 0.0)

    def print_metrics(self):
        m = self.metrics
        print(f"Total requests : {m.total_requests}")
        print(f"Exact hits     : {m.exact_hits}")
        print(f"Semantic hits  : {m.semantic_hits}")
        print(f"Misses         : {m.misses}")
        print(f"Hit rate       : {m.hit_rate:.1%}")
        print(f"Latency saved  : {m.latency_saved_ms:.0f} ms")

Putting it together

if __name__ == "__main__":
    cache = LLMCacheManager()

    queries = [
        "What is a buffer overflow vulnerability?",
        "Explain buffer overflow attacks",          # semantic near-duplicate
        "What is a buffer overflow vulnerability?", # exact duplicate
        "How does SQL injection work?",
        "What is SQL injection?",                   # semantic near-duplicate
    ]

    for q in queries:
        result = cache.query(q)
        print(f"[{result['_cache']:8s}] {q[:60]}")

    print()
    cache.print_metrics()

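Illustrative output from a first run against an empty cache. The latency figure will vary, and whether the paraphrased queries register as semantic hits depends on how your embedding model scores them against the 0.92 threshold: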

[miss    ] What is a buffer overflow vulnerability?
[semantic] Explain buffer overflow attacks
[exact   ] What is a buffer overflow vulnerability?
[miss    ] How does SQL injection work?
[semantic] What is SQL injection?

Total requests : 5
Exact hits     : 1
Semantic hits  : 2
Misses         : 2
Hit rate       : 60.0%
Latency saved  : 1847 ms

Tuning the similarity threshold

The 0.92 threshold is a good starting point but you should calibrate it on your own data:

def evaluate_threshold(cache: SemanticCache, pairs: list[tuple[str, str, bool]]):
    """
    pairs: list of (query_a, query_b, should_match)
    """
    for q_a, q_b, expected in pairs:
        vec_a = cache._embed(q_a)
        vec_b = cache._embed(q_b)
        score = cache._cosine_similarity(vec_a, vec_b)
        match = score >= cache.threshold
        status = "OK" if match == expected else "WRONG"
        print(f"[{status}] {score:.3f}  '{q_a[:40]}' vs '{q_b[:40]}'")

Run this with a labelled sample of 50–100 query pairs from your actual traffic to find the right threshold for your domain.
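
Once you have scores for your labelled pairs, you can sweep several candidate thresholds and compare precision and recall instead of checking one value at a time. This is a sketch, not part of the cache itself: sweep_thresholds is a hypothetical helper, and the scores below are made-up numbers standing in for the output of cache._cosine_similarity.

```python
def sweep_thresholds(scored_pairs, thresholds):
    """scored_pairs: list of (cosine_score, should_match).

    Returns a list of (threshold, precision, recall) tuples.
    """
    results = []
    for t in thresholds:
        tp = sum(1 for s, y in scored_pairs if s >= t and y)
        fp = sum(1 for s, y in scored_pairs if s >= t and not y)
        fn = sum(1 for s, y in scored_pairs if s < t and y)
        # With no predictions (or no positives) the ratio is undefined; report 1.0.
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 1.0
        results.append((t, precision, recall))
    return results


# Made-up example scores: (similarity, should_match)
scored = [(0.97, True), (0.94, True), (0.91, True),
          (0.93, False), (0.85, False), (0.80, False)]

for t, p, r in sweep_thresholds(scored, [0.90, 0.92, 0.95]):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

A false positive here means serving a cached answer to a question it does not fit, so for most applications it is reasonable to favor precision over recall when picking the final value.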

Cost accounting

Add a simple cost tracker to quantify savings in real time:

COST_PER_1K_TOKENS = 0.01  # adjust for your model

def estimated_savings(metrics: CacheMetrics, avg_tokens: int = 500) -> float:
    hits = metrics.exact_hits + metrics.semantic_hits
    saved_tokens = hits * avg_tokens
    return (saved_tokens / 1000) * COST_PER_1K_TOKENS
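
The estimate above counts only what the LLM calls would have cost. In the code as written, the semantic tier also pays for embeddings: one call for every request that misses the exact cache, plus a second call when a miss is stored. A rough net figure can be sketched as follows. The function net_savings, the avg_prompt_tokens parameter, and both default prices are placeholders for illustration, not real pricing.

```python
def net_savings(exact_hits: int, semantic_hits: int, misses: int,
                avg_tokens: int = 500,
                avg_prompt_tokens: int = 50,
                llm_cost_per_1k: float = 0.01,
                embed_cost_per_1k: float = 0.0001) -> float:
    """Gross LLM savings minus what the semantic tier spends on embeddings."""
    saved = (exact_hits + semantic_hits) * avg_tokens / 1000 * llm_cost_per_1k
    # One embedding per semantic lookup (semantic hits and misses), plus a
    # second embedding when a miss is written via SemanticCache.set.
    embed_calls = semantic_hits + 2 * misses
    overhead = embed_calls * avg_prompt_tokens / 1000 * embed_cost_per_1k
    return saved - overhead


# Counts taken from the five-query demo: 1 exact hit, 2 semantic hits, 2 misses.
print(f"${net_savings(1, 2, 2):.5f}")
```

Passing the embedding computed during lookup through to the store step would remove the second call per miss.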

For teams building AI-powered security tooling — where queries about CVEs, attack patterns, and compliance requirements tend to repeat heavily — caching can eliminate a majority of API spend. We cover patterns like this at AYI NEDJIMI Consultants when auditing AI pipelines for production readiness.

What to do next

Replace the linear embedding scan with Qdrant, Weaviate, or pgvector for production scale
Add cache warming: pre-embed your top-100 FAQ at startup
Track per-prompt hit rates to find eviction candidates
Add cache invalidation hooks when underlying knowledge changes
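
As a starting point for the hit-rate tracking item, here is a minimal in-memory sketch. HitTracker and its methods are hypothetical names. A production version would more likely keep these counters in Redis so they survive restarts and are shared across workers.

```python
from collections import Counter


class HitTracker:
    """Counts how often each cached prompt is served (hypothetical helper)."""

    def __init__(self) -> None:
        self.stored: set[str] = set()
        self.hits: Counter[str] = Counter()

    def record_store(self, prompt: str) -> None:
        """Call when a response for `prompt` is written to the cache."""
        self.stored.add(prompt)

    def record_hit(self, prompt: str) -> None:
        """Call when a cached response for `prompt` is served."""
        self.hits[prompt] += 1

    def eviction_candidates(self, min_hits: int = 1) -> list[str]:
        """Stored prompts served fewer than min_hits times, sorted for stable output."""
        return sorted(p for p in self.stored if self.hits[p] < min_hits)


tracker = HitTracker()
tracker.record_store("What is SQL injection?")
tracker.record_store("What is a buffer overflow vulnerability?")
tracker.record_hit("What is SQL injection?")
tracker.record_hit("What is SQL injection?")

print(tracker.eviction_candidates())
```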

The full source is intentionally minimal — around 150 lines — so you can drop it into any existing codebase without pulling in a heavy framework.
