LLM Caching: Semantic Cache, Exact Match, TTL, Invalidation Strategies

#ai #machinelearning #llm

This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.

LLM Caching: Semantic Cache, Exact Match, TTL, Invalidation Strategies

Introduction

LLM API calls are expensive, both in cost and latency. Caching previously generated responses can reduce costs by 20-80% depending on the application. Unlike traditional HTTP caching where exact URL matching suffices, LLM caching must handle semantically equivalent but textually different queries. This article covers caching strategies from simple exact match to sophisticated semantic caching.

Exact Match Cache

The simplest cache: identical inputs produce identical outputs:

import hashlib

import json

from functools import lru_cache

class ExactMatchCache:

    def __init__(self, ttl_seconds: int = 3600):

        self.cache = {}

        self.ttl = ttl_seconds

    def _make_key(self, messages: list[dict], model: str, params: dict) -> str:

        canonical = json.dumps({

            "messages": messages,

            "model": model,

            "temperature": params.get("temperature", 0),

            "max_tokens": params.get("max_tokens"),

        }, sort_keys=True)

        return hashlib.sha256(canonical.encode()).hexdigest()

    def get(self, messages: list[dict], model: str, params: dict) -> str | None:

        key = self._make_key(messages, model, params)

        if key in self.cache:

            entry = self.cache[key]

            if time.time() - entry["timestamp"] < self.ttl:

                return entry["response"]

            else:

                del self.cache[key]

        return None

    def set(self, messages: list[dict], model: str, params: dict, response: str):

        key = self._make_key(messages, model, params)

        self.cache[key] = {"response": response, "timestamp": time.time()}

Exact match works well when identical questions recur: FAQs, repeated classification tasks, or template-based prompts where only parameters change.

Semantic Cache

Semantic caching returns cached responses for semantically equivalent questions:

import numpy as np

from sklearn.metrics.pairwise import cosine_similarity

class SemanticCache:

    def __init__(self, embedding_fn, similarity_threshold: float = 0.92):

        self.embedding = embedding_fn

        self.threshold = similarity_threshold

        self.cache_entries: list[dict] = []

    def get(self, query: str) -> str | None:

        query_emb = self.embedding(query)

        for entry in self.cache_entries:

            similarity = cosine_similarity([query_emb], [entry["embedding"]])[0][0]

            if similarity >= self.threshold:

                entry["access_count"] += 1

                entry["last_accessed"] = time.time()

                return entry["response"]

        return None

    def set(self, query: str, response: str):

        entry = {

            "query": query,

            "embedding": self.embedding(query),

            "response": response,

            "created_at": time.time(),

            "access_count": 1,

            "last_accessed": time.time(),

        }

        self.cache_entries.append(entry)

    def evict_lru(self, max_entries: int = 10000):

        if len(self.cache_entries) > max_entries:

            self.cache_entries.sort(key=lambda e: e["last_accessed"])

            self.cache_entries = self.cache_entries[-max_entries:]

The similarity threshold controls the precision-recall tradeoff. A threshold of 0.95 is safe but rarely matches. A threshold of 0.85 catches more queries but risks returning irrelevant cached responses. Test on your specific query distribution.

Two-Level Cache

Combine both strategies for maximum coverage:

class TwoLevelLLMCache:

    def __init__(self, embedding_fn):

        self.exact = ExactMatchCache(ttl_seconds=7200)

        self.semantic = SemanticCache(embedding_fn, similarity_threshold=0.92)

    def get(self, messages: list[dict], model: str, params: dict) -> str | None:

        # Try exact match first (fast, no embedding computation)

        exact_result = self.exact.get(messages, model, params)

        if exact_result:

            return exact_result

        # Try semantic match for the last user message

        last_user_msg = self._get_last_user_message(messages)

        if last_user_msg:

            semantic_result = self.semantic.get(last_user_msg)

            if semantic_result:

                return semantic_result

        return None

    def set(self, messages: list[dict], model: str, params: dict, response: str):

        self.exact.set(messages, model, params, response)

        last_user_msg = self._get_last_user_message(messages)

        if last_user_msg:

            self.semantic.set(last_user_msg, resp

Read the full article on AI Study Room for complete code examples, comparison tables, and related resources.

Found this useful? Check out more developer guides and tool comparisons on AI Study Room.