Your agent asks the same question every time it runs. "What is the company's refund policy?" The answer never changes. But every time a user talks to the agent, you make a fresh LLM call with the same prompt context and pay full price for the same response.
Or your agent evaluates documents. Ten users submit the same document this hour. Ten identical LLM calls. Ten identical bills.
llm-cache-mem is an in-process LRU cache for LLM calls. Same input, same output, zero API cost.
The Shape of the Fix
from llm_cache_mem import LLMCache
cache = LLMCache(max_size=500, ttl_seconds=3600)
def call_llm_cached(messages: list[dict], model: str = "claude-sonnet-4-6") -> dict:
result = cache.get(messages, model=model)
if result is not None:
return result
response = anthropic_client.messages.create(
model=model,
max_tokens=1024,
messages=messages,
)
result = {"content": response.content, "usage": response.usage}
cache.put(messages, result, model=model)
return result
Cache hit: returns the stored response with zero API cost. Cache miss: makes the API call, stores the result, returns it. The TTL means results expire after one hour by default so stale answers do not persist indefinitely.
What It Does NOT Do
llm-cache-mem does not share the cache across processes. It is in-process only. If you run five worker processes, each has its own cache. For a shared cache, use Redis or Memcached and compute the cache key the same way.
It does not handle streaming responses. The cache stores complete response objects. If your LLM client uses streaming, collect the full response first and then cache the assembled result.
It does not cache errors. If the API call fails, nothing is stored. The next identical call will retry the API. This is intentional: caching errors would mean a transient provider outage permanently blocks a valid request.
Inside the Library
The cache key is a SHA-256 hash of the canonical JSON of the input:
class LLMCache:
def __init__(self, max_size: int = 256, ttl_seconds: float | None = 3600):
self._cache: OrderedDict[str, CacheEntry] = OrderedDict()
self._max_size = max_size
self._ttl = ttl_seconds
self._lock = threading.Lock()
self._hits = 0
self._misses = 0
def _key(self, messages: list[dict], **kwargs) -> str:
payload = {"messages": messages, **kwargs}
canonical = json.dumps(payload, sort_keys=True)
return hashlib.sha256(canonical.encode()).hexdigest()
def get(self, messages: list[dict], **kwargs) -> dict | None:
key = self._key(messages, **kwargs)
with self._lock:
if key not in self._cache:
self._misses += 1
return None
entry = self._cache[key]
if self._ttl and time.time() - entry.stored_at > self._ttl:
del self._cache[key]
self._misses += 1
return None
# Move to end (LRU: most recently used)
self._cache.move_to_end(key)
self._hits += 1
return entry.value
def put(self, messages: list[dict], value: dict, **kwargs) -> None:
key = self._key(messages, **kwargs)
with self._lock:
if key in self._cache:
self._cache.move_to_end(key)
self._cache[key] = CacheEntry(value=value, stored_at=time.time())
# Evict oldest if over capacity
while len(self._cache) > self._max_size:
self._cache.popitem(last=False)
def stats(self) -> CacheStats:
with self._lock:
total = self._hits + self._misses
return CacheStats(
hits=self._hits,
misses=self._misses,
hit_rate=self._hits / total if total > 0 else 0.0,
size=len(self._cache),
)
OrderedDict provides O(1) LRU operations: move_to_end(key) on access, popitem(last=False) to evict the least recently used entry. The TTL check happens on read so expired entries are evicted lazily rather than requiring a background cleanup thread.
When to Use It
Use it for agents that answer questions from a fixed knowledge base. The system prompt changes slowly; the user question is often similar across sessions; the answer is deterministic. A cache hit rate of 20% on a high-volume agent pays for the library many times over.
Use it for evaluation pipelines. If you are running the same model evaluation across a batch of test cases and some test cases are identical, a cache stops you from paying for duplicates.
Use it for development. During development, you run the same agent against the same test input dozens of times. A cache makes those iterations faster and cheaper without changing the code.
Skip it for conversations that must reflect current state. If the LLM response depends on real-time data (stock prices, current inventory, live status), caching will return stale results. Gate the cache behind a flag or exclude those call sites.
Install
pip install git+https://github.com/MukundaKatta/llm-cache-mem
# Or from PyPI
pip install llm-cache-mem
from llm_cache_mem import LLMCache
cache = LLMCache(max_size=1000, ttl_seconds=1800)
class CachedAnthropicClient:
def __init__(self, client):
self._client = client
self._cache = LLMCache(max_size=1000, ttl_seconds=1800)
def messages_create(self, model: str, messages: list[dict], **kwargs) -> dict:
result = self._cache.get(messages, model=model, **kwargs)
if result is not None:
return result
response = self._client.messages.create(
model=model,
messages=messages,
**kwargs,
)
result = response
self._cache.put(messages, result, model=model, **kwargs)
return result
def cache_stats(self):
stats = self._cache.stats()
print(f"Hit rate: {stats.hit_rate:.1%} ({stats.hits}/{stats.hits + stats.misses})")
print(f"Cache size: {stats.size}")
Sibling Libraries
| Library | What it solves |
|---|---|
llm-batch-coalesce |
Single-flight dedup for concurrent identical in-flight calls |
prompt-cache-warmer |
Pre-warm Anthropic's server-side prompt cache |
cachebench |
Benchmark client-side vs server-side cache performance |
tool-result-cache |
LRU+TTL cache for tool call results (not LLM responses) |
llm-fixture-replay |
VCR-style record/replay for test fixtures |
The caching stack: llm-cache-mem for identical-request dedup in production, prompt-cache-warmer for Anthropic prefix caching, llm-batch-coalesce for concurrent duplicate requests, llm-fixture-replay for test replay.
What's Next
Persistent cache backend: LLMCache(backend="sqlite://./cache.db") so the cache survives process restarts. Useful for development where you restart the agent repeatedly against the same test inputs.
Partial key matching: sometimes you want to cache on a subset of the messages (the system prompt and the last user message, not the full history). A key_fn parameter would let callers provide a custom cache key function.
Cost tracking: cache.stats() already returns hit rate. Adding per-model cost estimates would let you compute dollars saved by the cache automatically, which is useful for justifying the complexity.
Built as part of the agent-stack family: composable Python primitives for production LLM agents.
Top comments (0)