Kushal

Posted on Mar 16

What's semantic caching?

#llm #ai #performance #architecture

As generative AI finds its way into more applications, its shortcomings become more apparent. One big problem with LLMs is how expensive each query is. Take Gemini: Gemini 2.5 Pro charges $1.25 per million input tokens and $10 per million output tokens, and the flagship Gemini 3.1 Pro raises that to $2 and $12 per million tokens respectively. Even a moderately active app can rack up thousands of dollars a month pretty quickly. Imagine a small customer support bot with just 500 daily users: by month two, the API bill has quietly crossed $2,000. That's not an edge case, that's just what happens when you're not caching. For a business (or a personal user), saving costs and speeding things up wherever possible is a big factor in how well your product does. One way to cut both cost and latency is a simple 'semantic cache'.

What it is

A semantic cache is not too different from a traditional cache; the idea behind it is the same. A traditional cache stores the results of previous queries, usually evicting old entries with a policy such as LRU (Least Recently Used) or LFU (Least Frequently Used), so that when the same query comes in again it can return the stored result instead of computing it again.

You can't apply exactly the same pipeline to RAG or genAI products, though, because the input is rarely the same twice: users phrase the same question in different ways, so an exact-match key almost never repeats (and the model's output isn't deterministic either). Take these examples:

What is the situation regarding AI in professional workplaces?

How are AI tools affecting workplaces?

Semantically these are close enough that we can tell they mean roughly the same thing, but a normal cache does not understand that. It treats them as two different queries because the strings are not exactly the same.

That's where semantic caching comes in. Rather than compare the strings directly, it compares the meaning behind them, recognises that they're asking roughly the same thing, and we get a cache hit! The similarity between two queries is usually measured as the cosine similarity of their embeddings.
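To make that concrete, here's a minimal sketch of an exact-match cache in Python. The names (`exact_cache`, `normalise`, `exact_lookup`, `exact_store`) are made up for illustration and not from any library. Even with a bit of normalisation, the paraphrased question misses.

```python
# Exact-match cache: the key is the (lightly normalised) query string itself.
exact_cache = {}

def normalise(query: str) -> str:
    # Lowercase and collapse whitespace; this is as "fuzzy" as an exact cache gets.
    return " ".join(query.lower().split())

def exact_lookup(query: str):
    return exact_cache.get(normalise(query))

def exact_store(query: str, value: str) -> None:
    exact_cache[normalise(query)] = value

exact_store(
    "What is the situation regarding AI in professional workplaces?",
    "cached answer about AI at work",
)

# Same words, different casing and spacing: still a hit.
same = exact_lookup("what is the situation regarding AI in professional  workplaces?")
# Same meaning, different words: a miss.
paraphrase = exact_lookup("How are AI tools affecting workplaces?")

print(same)        # cached answer about AI at work
print(paraphrase)  # None
```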

How it works

This is a typical pipeline for RAG systems that use semantic caching.

First the documents are split into chunks and converted to embeddings (vectors). You store these in a vector store like Chroma or FAISS, or whatever suits your use case. When the user sends a query, we don't go straight to the db. We first check the semantic cache, which embeds the query and looks for a previously cached query that is similar enough.

Two things can happen from here:

Cache hit: The query is similar enough to a cached one (above the threshold) → cached context is pulled and handed to the LLM → response is generated. Fast and cheap, no db lookup needed.

Cache miss: Nothing similar in the cache → normal vector db retrieval happens → relevant chunks are fetched, response is generated, and the new query gets cached for next time. Normal speed, but the cache is now warmer.
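The hit/miss flow above can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model (it only sees word overlap, not meaning), `retrieve_from_vector_db` is a placeholder for your actual retrieval call, and the linear scan over entries would be a vector index in a real system.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    # HYPOTHETICAL stand-in for a real embedding model: a bag-of-words count vector.
    # It only captures word overlap, so true paraphrases will NOT match here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # list of (query embedding, cached context)

    def lookup(self, query):
        q = toy_embed(query)
        best_score, best_context = 0.0, None
        for emb, context in self.entries:  # linear scan; fine for a sketch
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_context = score, context
        return best_context if best_score >= self.threshold else None

    def store(self, query, context):
        self.entries.append((toy_embed(query), context))

def retrieve_from_vector_db(query):
    # HYPOTHETICAL placeholder for the real vector db retrieval.
    return [f"chunk retrieved for: {query}"]

def get_context(query, cache):
    cached = cache.lookup(query)
    if cached is not None:
        return cached, "hit"              # cache hit: skip the db lookup
    context = retrieve_from_vector_db(query)
    cache.store(query, context)           # cache miss: retrieve, then warm the cache
    return context, "miss"

cache = SemanticCache(threshold=0.85)
_, first = get_context("What is the impact of AI on jobs?", cache)
_, second = get_context("What is the impact of AI on jobs today?", cache)
_, third = get_context("How do I bake sourdough bread?", cache)
print(first, second, third)  # miss hit miss
```

The second query is a near-duplicate rather than a real paraphrase because the toy embedding can't understand meaning. With a proper embedding model plugged in, the same structure should handle rephrased questions too. Whatever context comes back still goes to the LLM to generate the response.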

Embeddings are compared using cosine similarity:

cos(θ) = (A · B) / (||A|| × ||B||)

It's a fast and simple way to measure the angle between two vectors. If two texts are similar, their vectors point in similar directions, i.e. the angle between them is small and the cosine of that angle is close to 1. In general the value ranges from -1 to 1: 1 means the vectors point in exactly the same direction, 0 means they are unrelated (orthogonal), and negative values mean they point in opposite directions. For text embeddings, the scores you see in practice usually fall between 0 and 1.
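As a quick sanity check, here's the formula written out in plain Python on tiny hand-made vectors (not real embeddings). Note that the raw value can drop below 0 when vectors point in opposite directions, even though that's rare for text embeddings.

```python
import math

def cosine_similarity(a, b):
    # (A · B) / (||A|| × ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # approximately 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0.0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # approximately -1.0

print(same_direction, orthogonal, opposite)
```

Length doesn't matter, only direction: `[2.0, 4.0]` is twice as long as `[1.0, 2.0]` but still scores about 1.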

For example:

"What is the impact of AI on jobs?" vs "How is AI changing employment?" → score of ~0.91 → cache hit
"What is the impact of AI on jobs?" vs "How do I bake sourdough bread?" → score of ~0.08 → cache miss

The first pair is clearly the same question in spirit, and the score reflects that.

Why use it

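Significant cost savings. Every cache hit is one less query sent to the vector db, which cuts retrieval charges. If you also cache the final response, a hit skips the LLM call too, and that is where the per-token charges come from.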
Faster response time. If you already have the cached content, you don't need to retrieve it again. This allows the system to be a whole lot faster in production.
Better use of resources. Since you aren't redoing similar queries, the system is free to do more tasks, allowing you to scale better or handle more complex features.

Compared to other approaches in RAG

Approach	Handles Semantic Similarity	Cost Savings	Speed Boost	Setup Complexity	Works for Unique Queries	Best For
Traditional Cache	No (exact match only)	High (when hits)	Very High	Low	No	High-volume apps with repetitive, exact queries
Semantic Cache	Yes	High	High	Medium	No	Apps with overlapping but varied query patterns
Query Rewriting	Partially	Low	Low (adds a step)	Medium	Yes	Improving retrieval on ambiguous or poorly phrased queries
Re-ranking	No	Low	No (adds latency)	Medium	Yes	Boosting relevance when retrieval is decent but ordering is off
Hybrid Search	Partially	Low	Moderate	High	Yes	Complex domains needing both keyword and semantic retrieval
Chunking Optimisation	No	Moderate	Moderate	Low–Medium	Yes	Improving retrieval quality at the source

As you can see, semantic caching isn't a silver bullet. It shines when there's a decent overlap in the kinds of queries your users send. For more diverse or unique query patterns, approaches like re-ranking or hybrid search may be better suited.

The cons

More complex to build than a traditional cache system.
Higher chance of serving chunks that are semantically similar but not actually relevant to the query. Think of it like asking a librarian for "books about space travel" and getting recommendations cached from a previous "books about space exploration" query: close enough on the surface. But when you follow up with "books about the health risks of space travel", the cache might still serve those same exploration books because the queries look similar, even though what you actually need is quite different.
The threshold needs balancing. Set it too high and almost nothing counts as a match, so the cache rarely hits and you lose the savings. Set it too low and loosely related queries match, so irrelevant cached chunks get served. Either extreme hurts the system, so it's important to find the right balance.
A cold (empty) cache gives no benefit: every query misses, so you pay for the cache lookup on top of the normal retrieval latency until it warms up.
Not suitable when every user query is unique.
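The threshold trade-off is easier to reason about with numbers. Below is a rough sketch that sweeps a few thresholds over a small labelled set. The pairs and scores are invented for illustration; in practice you'd label real query pairs from your own logs. A "false hit" is a pair the cache would match even though the answers should differ, and a "missed hit" is a pair it would fail to match even though one cached answer would have served both.

```python
# HYPOTHETICAL evaluation set: (similarity score, should the two queries share a cache entry?)
labelled_pairs = [
    (0.95, True),
    (0.91, True),
    (0.88, True),
    (0.86, False),  # similar wording, different intent
    (0.80, True),
    (0.72, False),
    (0.40, False),
    (0.08, False),
]

def evaluate(threshold, pairs):
    # False hit: cache would match, but it shouldn't have.
    false_hits = sum(1 for score, same in pairs if score >= threshold and not same)
    # Missed hit: cache would not match, but it could have.
    missed_hits = sum(1 for score, same in pairs if score < threshold and same)
    return false_hits, missed_hits

results = {t: evaluate(t, labelled_pairs) for t in (0.70, 0.85, 0.90, 0.97)}
for t, (false_hits, missed_hits) in results.items():
    print(f"threshold={t:.2f} false_hits={false_hits} missed_hits={missed_hits}")
# threshold=0.70 false_hits=2 missed_hits=0
# threshold=0.85 false_hits=1 missed_hits=1
# threshold=0.90 false_hits=0 missed_hits=2
# threshold=0.97 false_hits=0 missed_hits=4
```

Lower thresholds trade wrong answers for more hits, higher thresholds do the opposite. Which side you lean towards depends on how costly a wrong cached answer is for your app.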

When not to use it

Semantic caching isn't always the right tool. Skip it if:

Every query your users send is unique. Think code generation, legal research, or anything highly personalised — the cache will almost never hit and you're just adding overhead.
Your app is low traffic. If you're getting a handful of queries a day, there's no real benefit.
Your knowledge base changes constantly. If documents are being updated all the time, you'll spend more time invalidating the cache than benefiting from it.
Accuracy is non-negotiable. Cached context can be slightly off. For use cases where being slightly wrong is worse than being slow, don't cache.

How to best utilise it

Calibrate your threshold carefully. A good starting point is somewhere between 0.85 and 0.90. From there, tune it based on your specific use case and monitor quality. There's no universal right answer here.
Use TTL (Time To Live) values. Cached entries should expire, especially when your underlying data changes or when topics are time-sensitive. Stale cache is worse than no cache.
Warm up your cache. Pre-populate it with common or anticipated queries so you're not starting completely cold in production. A cold cache gives you none of the benefits.
Invalidate when your knowledge base updates. If the documents in your vector db change, cached responses based on old chunks can quietly degrade your output quality without you noticing.
Monitor your hit rate. As a rough rule of thumb, a healthy semantic cache sees hit rates of around 30–60%, though this depends heavily on the workload. If the rate is very low, your threshold might be too strict; if it is suspiciously high and answer quality is dropping, the threshold is too loose.
Think about scope — global vs user-level caching. A global cache saves the most but can serve mismatched cached results across very different user contexts. For personalised applications, a user-scoped cache might make more sense even if it's less efficient.
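Several of these practices (TTL, invalidation on knowledge base updates, hit rate monitoring, and scoping) can live in one small wrapper. This is only a sketch under assumptions: `ManagedSemanticCache` and all its parameters are hypothetical names, the embeddings are tiny unit vectors compared with a plain dot product, and the clock is injectable so the example can fake the passage of time.

```python
import time

class ManagedSemanticCache:
    def __init__(self, similarity, threshold=0.85, ttl_seconds=3600, clock=time.monotonic):
        self.similarity = similarity
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.kb_version = 0   # bumped whenever the knowledge base changes
        self.entries = {}     # scope -> list of (embedding, context, stored_at, kb_version)
        self.hits = 0
        self.misses = 0

    def invalidate(self):
        # Call this when documents in the vector db change; older entries stop matching.
        self.kb_version += 1

    def store(self, embedding, context, scope="global"):
        # Also usable for warm-up: store anticipated queries before going live.
        entry = (embedding, context, self.clock(), self.kb_version)
        self.entries.setdefault(scope, []).append(entry)

    def lookup(self, embedding, scope="global"):
        now = self.clock()
        # Keep only entries that are within their TTL and built from the current knowledge base.
        fresh = [e for e in self.entries.get(scope, [])
                 if now - e[2] < self.ttl_seconds and e[3] == self.kb_version]
        self.entries[scope] = fresh
        best = max(fresh, key=lambda e: self.similarity(embedding, e[0]), default=None)
        if best is not None and self.similarity(embedding, best[0]) >= self.threshold:
            self.hits += 1
            return best[1]
        self.misses += 1
        return None

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

def dot(a, b):
    # Equals cosine similarity when both vectors are unit length (assumed here).
    return sum(x * y for x, y in zip(a, b))

now = [0.0]  # fake clock so the example doesn't have to sleep
cache = ManagedSemanticCache(similarity=dot, ttl_seconds=60, clock=lambda: now[0])

cache.store((1.0, 0.0), "context A")
r1 = cache.lookup((1.0, 0.0))                   # hit
r2 = cache.lookup((1.0, 0.0), scope="user:42")  # miss: different scope
now[0] = 61.0
r3 = cache.lookup((1.0, 0.0))                   # miss: entry expired
cache.store((1.0, 0.0), "context B")
cache.invalidate()
r4 = cache.lookup((1.0, 0.0))                   # miss: knowledge base changed

print(r1, r2, r3, r4, cache.hit_rate())  # context A None None None 0.25
```

Using a version counter for invalidation is just one simple option. Tracking which documents each cached entry came from would let you invalidate more selectively, at the cost of more bookkeeping.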

Tools that already do this

You don't have to build it from scratch. A few libraries have semantic caching built in or easily pluggable:

GPTCache — an open source library built specifically for caching LLM responses. Pretty flexible and worth looking at if you're rolling your own pipeline.
LangChain — has caching layers that plug into existing chains without too much effort. Good starting point if you're already using it.
Redis — with vector similarity extensions, Redis can act as a fast semantic cache layer, especially if you're already using it in your stack.

Worth knowing these exist before you reinvent the wheel.
