Gaurav Vij
Stop Paying for the Same Answer Twice: A Deep Dive into llm-cache

Every AI engineer has been there. You open the billing dashboard, squint at the number, and do a quiet double-take. You know the product is working, traffic is healthy, users are happy. But somewhere in that invoice is a dirty secret: you are paying for the same computation over and over again.

Someone asks your support bot "How do I reset my password?" Fifty other users ask "What are the steps to reset my password?" Twenty more ask "Can you help me change my password?" The LLM doesn't know it has answered this question a hundred times today. It just runs the full forward pass every single time, burns your tokens, and charges you accordingly.

This is not a fringe problem. It is the default state of almost every production LLM deployment.

llm-cache is a Python middleware library that fixes this. It caches LLM responses not by exact string match, but by semantic similarity. The project was built fully autonomously by NEO, an AI coding agent, and the code is clean, thoughtful, and surprisingly production-ready for a library with just two commits. Let's get into how it works, what it costs you, and what you can do with it.


The Core Insight: Meaning, Not Characters

Most developers' first instinct when building a cache is a hash map. Take the prompt string, hash it, store the result. This works if your users send byte-for-byte identical queries. They never do.

Users paraphrase. They make typos. They use formal phrasing in one context and casual phrasing in another. A naive cache misses all of these. llm-cache approaches the problem differently: instead of comparing strings, it compares meaning.

The pipeline is elegant:

  1. Incoming prompt is converted into a 384-dimensional embedding vector using all-MiniLM-L6-v2, a sentence-transformers model that runs entirely locally.
  2. The vector is L2-normalized so that inner product becomes equivalent to cosine similarity.
  3. A FAISS IndexFlatIP index does exact nearest-neighbor search over all previously cached vectors.
  4. If the closest match clears a configurable similarity threshold (default 0.95), the cached response is returned immediately, no API call made.
  5. On a miss, the real LLM API is called, the response is stored, and future similar queries will hit the cache.

The result is that "What is the capital of France?" and "Tell me the capital city of France" return the same cached response. One API call served two users.


The Architecture Under the Hood

The project is organized into four tight modules plus two SDK wrappers.

embedder.py wraps sentence-transformers with an LRU cache so repeated embeddings of the same text do not trigger redundant model inference. It normalizes vectors before returning them so the FAISS layer can do cosine comparisons via inner product without any extra math.
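The embedder pattern is easy to picture: memoize the embedding function with functools.lru_cache and normalize on the way out. A minimal sketch, again with a toy hash standing in for the real SentenceTransformer.encode call:

```python
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=4096)
def embed(text: str) -> np.ndarray:
    # Toy stand-in for all-MiniLM-L6-v2; the real embedder.py would call
    # SentenceTransformer("all-MiniLM-L6-v2").encode(text) here.
    vec = np.zeros(8)
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch)
    # L2-normalize before returning, so the FAISS layer can treat
    # inner product as cosine similarity with no extra math.
    return vec / np.linalg.norm(vec)
```

Because of the lru_cache decorator, calling embed twice with the same string returns the cached vector without re-running the model; embed.cache_info() reports the hit count.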

store.py is where the actual cache lives. It holds a FAISS index and a parallel Python dict of metadata keyed by integer IDs. Thread safety is handled with a threading.RLock, which means you can use this in multi-threaded FastAPI or Django setups without adding your own locking. Persistence works by periodically flushing both the FAISS index (via faiss.write_index) and the metadata dict (via pickle) to ~/.llm_cache/. By default this happens every 10 writes, and the interval is configurable if you want more or less frequent saves.

cache.py is the high-level interface that ties embedder and store together. It exposes get, set, lookup_or_call, get_similar, delete, clear, save, and stats. The lookup_or_call method is particularly useful: it takes a prompt and a callable, checks the cache first, and only invokes the callable on a miss.
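The lookup_or_call semantics can be shown in a few lines. This sketch uses a plain dict keyed by exact prompt rather than the library's semantic store, purely to keep the example self-contained; the control flow is the point.

```python
class MiniCache:
    # Illustrative stand-in for the library's cache, not its real class.
    def __init__(self):
        self.store = {}
        self.calls = 0  # how many times the expensive callable actually ran

    def lookup_or_call(self, prompt, fn):
        if prompt in self.store:
            return self.store[prompt]  # hit: skip the API call entirely
        self.calls += 1
        response = fn(prompt)          # miss: pay for the call once
        self.store[prompt] = response
        return response

mini = MiniCache()
first = mini.lookup_or_call("capital of France?", lambda p: "Paris")
second = mini.lookup_or_call("capital of France?", lambda p: "Paris")
```

Two lookups, one invocation of the callable: that is the entire value proposition, and it is why lookup_or_call is the natural integration point for custom LLM clients the wrappers do not cover.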

wrappers/openai_wrapper.py and wrappers/anthropic_wrapper.py are where the ergonomic magic happens. CachedOpenAI and CachedAnthropic subclass the official SDKs and intercept the chat.completions.create and messages.create methods respectively. From the outside, they are drop-in replacements. You change one import line and one constructor call. Everything else in your codebase stays identical.

```python
# Before
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# After
from llm_cache import CachedOpenAI
client = CachedOpenAI(api_key="sk-...", threshold=0.90)
```

That is the entire migration. Your existing call sites do not change.


How Much Money Does This Actually Save?

Let's put concrete numbers on this. The library's own README claims 40 to 60 percent cost reduction on repetitive workloads. That tracks with how LLM usage actually distributes in production.

Consider a customer support application running on GPT-4o. GPT-4o is priced at $2.50 per million input tokens and $10.00 per million output tokens at the time of writing. A typical support query might be 150 input tokens and 300 output tokens: that is $0.000375 on input plus $0.003000 on output, coming to roughly $0.003375 per call. If you handle 100,000 queries a day, that is $337.50 a day, around $10,125 a month, and over $123,000 a year.
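The arithmetic above is easy to verify. A quick sanity check of the numbers, using the GPT-4o prices quoted in the paragraph:

```python
# GPT-4o pricing at the time of writing, expressed per token.
INPUT_PRICE = 2.50 / 1_000_000    # $ per input token
OUTPUT_PRICE = 10.00 / 1_000_000  # $ per output token

# A typical support query: 150 input tokens, 300 output tokens.
cost_per_call = 150 * INPUT_PRICE + 300 * OUTPUT_PRICE  # ~$0.003375

daily = cost_per_call * 100_000   # 100k queries/day -> ~$337.50
monthly = daily * 30              # ~$10,125
yearly = daily * 365              # ~$123,187
```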

Now think about the actual query distribution. Support traffic is highly repetitive. Password resets, billing questions, shipping status, cancellation flows. If even 40 percent of queries are semantically similar to something already cached, you are looking at potential savings in the range of $4,000 a month from API costs alone. On Claude Sonnet or GPT-4 Turbo, where token prices differ, the math shifts accordingly.

Disclaimer

That said, these are illustrative numbers based on a simplified model. Real production costs vary considerably depending on prompt length distribution, how diverse your user base actually is, which threshold you settle on, and how much of your traffic is genuinely repetitive versus novel. The 40 to 60 percent savings figure from the library's README is a reasonable ballpark for repetitive workloads, but your actual hit rate depends entirely on your specific use case. Treat these numbers as a directional estimate, not a guarantee.

For batch processing workloads the economics can be even more compelling. If you are enriching a product catalog, generating descriptions for SKUs, or running the same analysis prompts across thousands of documents with overlapping content, cache hit rates can push above 70 percent. On a $10,000 monthly LLM bill, that kind of hit rate could represent thousands of dollars in avoided API calls, though again the actual figure depends on how much genuine repetition exists in your data.

There is also a latency dimension that is easy to overlook. A cache hit returns in milliseconds. A real API call takes 500ms to 3 seconds depending on model and load. In user-facing applications, this latency improvement translates directly to perceived product quality, which is harder to put a dollar figure on but is real.

The configurable threshold gives you a dial between savings and correctness. At 0.95 you are catching clear paraphrases while being conservative about false positives. At 0.88 to 0.91 you are being more aggressive, which works well for batch workloads where the cost of an occasional semantically-mismatched cache hit is low. At 0.85 and below you risk serving stale or wrong responses for queries that are topically related but not actually equivalent.


What to Watch Out For

The library is honest about its limitations, which is a good sign.

Streaming responses are not cached. If you use stream=True, the call passes through unchanged. This is a real gap for chat applications where streaming UX is expected. The architecture would need changes to buffer the streamed response and store it post-completion, which is doable but adds complexity.

Tool and function calls are not cached either. If your agents rely on tool use, those responses pass through uncached. This matters less than it sounds for cost savings, because tool call responses are usually dynamic by nature, but it is worth knowing.

The cache is model-agnostic. The key is the semantic content of the prompt, not the model name. If you ask the same question to GPT-4o and Claude Sonnet, they will share a cache entry by default. This is fine if you want that behavior, but if you need model-specific caches, use different cache_name values per model.
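If you do want per-model isolation, the concept is simple: namespace the store by model name so entries never cross-pollinate. With llm-cache itself you would pass a distinct cache_name per client; the sketch below shows the same idea with plain dicts, purely for illustration.

```python
from collections import defaultdict

# One cache bucket per model name. With llm-cache you would instead create
# one wrapper client per model, each with its own cache_name.
caches = defaultdict(dict)  # model name -> {prompt: response}

def cached_call(model, prompt, fn):
    bucket = caches[model]
    if prompt not in bucket:
        bucket[prompt] = fn(prompt)  # miss: call the model, store per-model
    return bucket[prompt]

cached_call("gpt-4o", "capital of France?", lambda p: "Paris (per gpt-4o)")
cached_call("claude", "capital of France?", lambda p: "Paris (per claude)")
```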

The cache is also not context-aware. If the same question means different things depending on prior conversation turns, the cache will incorrectly serve a response from a different context. This matters for multi-turn chat where the embedding of the final user message does not capture the full conversational state.


Built Fully by NEO: Your AI Engineering Agent

The llm-cache repository was built autonomously by NEO, a fully autonomous AI engineering agent capable of fine-tuning, evaluating, and experimenting with AI models, and of building and deploying AI pipelines such as RAG systems, classical ML experiments, and much more.

This project is definitely not a toy. The codebase has a proper package structure with separated concerns across embedder, store, cache, and wrappers. It has a test suite covering the cache, the OpenAI wrapper, and the Anthropic wrapper. It has working examples that run without API keys using mock responses. It has configuration documentation, a thresholds reference table, and SVG architecture diagrams. It has async support with AsyncCachedOpenAI and AsyncCachedAnthropic. The FAISS persistence strategy, the RLock threading model, the LRU cache on the embedder, the L2 normalization before FAISS inner product: these are not random choices. They are informed engineering decisions.

NEO runs in your VS Code IDE (via its extension) or in Cursor and works as an autonomous AI engineering agent that can take a high-level goal, plan the implementation, write the code, run tests, and iterate until the project is complete. This library was produced from a single prompt.

What that unlocks is interesting. The tool itself is useful. But the meta-point is that an engineer with a clear idea and access to NEO can build a production-ready Python library in the time it used to take to write a design doc. The feedback loop between idea and artifact has collapsed.


How to Build on This with NEO

The library works well as-is, but there are several directions where it could go further. If you want to extend it, NEO is the fastest way to do that.

Multi-turn context caching is the most valuable near-term addition. Right now the cache key is the embedding of a single message. A more robust implementation would embed the last N turns of conversation concatenated together, so that "what about France?" in the context of a geography discussion produces a different cache key than the same phrase in a cooking discussion. You could prompt NEO like:

```text
Clone the llm-cache repo: https://github.com/dakshjain-1616/llm-cache and extend llm-cache to support multi-turn context-aware caching by embedding the last 3 messages as a single context string
```

and it would handle the implementation.
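The core of the change is small. A sketch of the context-key idea in plain Python, with an illustrative join format that is not the library's actual implementation:

```python
def context_key(messages, n=3):
    # Concatenate the last n turns into one string before embedding, so the
    # same final message in different conversations keys differently.
    recent = messages[-n:]
    return " | ".join(f"{m['role']}: {m['content']}" for m in recent)

geography = [
    {"role": "user", "content": "Tell me about European capitals."},
    {"role": "assistant", "content": "Sure, which country?"},
    {"role": "user", "content": "what about France?"},
]
cooking = [
    {"role": "user", "content": "Best sauces for steak?"},
    {"role": "assistant", "content": "Peppercorn or bearnaise."},
    {"role": "user", "content": "what about France?"},
]
```

Embedding context_key(...) instead of the final message alone means the two "what about France?" queries land far apart in vector space, which is exactly the disambiguation the single-message cache cannot do.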

A Redis-backed store would make the cache shareable across multiple instances of your application. Right now each process has its own FAISS index on disk. A distributed cache requires a shared vector store. Qdrant, Pinecone, or Redis with the RediSearch module are all viable backends. NEO could scaffold the new store.py backend and the adapter pattern to keep the existing API unchanged.

Cache warming is another high-value addition for predictable workloads. If you know your users will ask about a known set of topics, you can pre-populate the cache before the first real user query arrives, guaranteeing zero-latency responses for those cases. A simple CLI command, llm-cache warm --questions questions.txt, would do this.
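A warming pass is just a loop over known questions before launch. The sketch below uses a stand-in dict cache and a stub answer function; the get/set method names mirror the library's interface, but the helper itself is hypothetical.

```python
def warm(cache, questions, answer):
    # Pre-populate the cache so the first real user gets a zero-latency hit.
    for q in questions:
        if cache.get(q) is None:   # skip entries that are already warm
            cache.set(q, answer(q))

class DictCache:
    # Minimal stand-in for the real semantic cache, for illustration only.
    def __init__(self): self.d = {}
    def get(self, k): return self.d.get(k)
    def set(self, k, v): self.d[k] = v

warm_cache = DictCache()
warm(warm_cache,
     ["How do I reset my password?", "Where is my order?"],
     lambda q: f"Canned answer for: {q}")
```

In production the answer function would be one real LLM call per question, paid once at deploy time instead of once per user.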

Analytics and observability around the cache would help you tune the threshold intelligently. Right now get_stats() returns hits, misses, and hit rate. A richer implementation would log the similarity score of every hit, let you visualize the distribution, and suggest threshold adjustments. Integrating with OpenTelemetry or Datadog would make this production-observable.

Streaming support is the most technically involved gap. You would need to buffer streamed response chunks, detect completion, reconstruct the full response, serialize it, and store it. The tricky part is that Python generators are not directly picklable. A design that stores the reconstructed text and then re-streams it from the cache on subsequent hits would give users the same UX without paying the API cost.
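The buffer-and-replay design can be prototyped with generators. This is a sketch of the approach described above, not the library's code: on a miss, chunks are teed to the caller while being accumulated; on a hit, the stored text is re-streamed in chunks so the caller sees the same generator interface either way.

```python
def stream_with_cache(cache, prompt, stream_fn, chunk_size=5):
    cached = cache.get(prompt)
    if cached is not None:
        # Hit: re-stream the stored response so the caller's code that
        # iterates over chunks keeps working unchanged.
        for i in range(0, len(cached), chunk_size):
            yield cached[i:i + chunk_size]
        return
    buffer = []
    for chunk in stream_fn(prompt):  # miss: the real streamed API call
        buffer.append(chunk)
        yield chunk                  # pass each chunk through to the caller
    cache.set(prompt, "".join(buffer))  # store the reconstructed text

class DictStore:
    # Minimal stand-in for the real store, for illustration only.
    def __init__(self): self.d = {}
    def get(self, k): return self.d.get(k)
    def set(self, k, v): self.d[k] = v

store = DictStore()
fake_stream = lambda p: iter(["Hel", "lo ", "world"])
first = "".join(stream_with_cache(store, "hi", fake_stream))
second = "".join(stream_with_cache(store, "hi", fake_stream))
```

The caller never pickles a generator; only the reconstructed string is persisted, which sidesteps the picklability problem entirely.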

To build any of these with NEO, install the VS Code extension, clone the llm-cache repo and open the directory, and describe what you want in NEO's new chat prompt. NEO reads the existing codebase, understands the architecture, and writes code that fits the existing patterns rather than starting from scratch.


Getting Started with llm-cache

Install the dependencies and run a demo that shows the cache working without any API key:

```shell
pip install faiss-cpu sentence-transformers openai anthropic
pip install -e .
python examples/openai_example.py
```

You will see [CACHE HIT] and [CACHE MISS] labels with a final stats block showing the hit rate. The sentence-transformers model downloads automatically on first run and is about 90 MB.


For a real application, the migration is two lines:

```python
from llm_cache import CachedOpenAI
client = CachedOpenAI(api_key="sk-...", threshold=0.90)
```

Start with threshold=0.90. Watch client.get_stats() for a few days in staging. If you are seeing false positives, move to 0.93 or 0.95. If your hit rate is low and your workload is genuinely repetitive, try 0.88. The right number is workload-specific.


Final Thought

The unsexy reality of running LLMs in production is that most of the cost is not in the interesting, novel queries. It is in the mundane, repeated ones that look slightly different on the surface but mean exactly the same thing. llm-cache is a focused solution to that specific problem, and it is well-engineered enough to drop into a production system with confidence.

The fact that it was built autonomously by NEO in a single session is, honestly, the most interesting detail in the whole story. Not because it diminishes the quality of the code, but because it demonstrates what becomes possible when the cost of building a tool drops to nearly zero. You stop asking "is this worth building?" and start asking "why haven't I built this yet?"

The code is at github.com/dakshjain-1616/llm-cache. Go look at it. Your API bill will thank you.
