<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kushal </title>
    <description>The latest articles on DEV Community by Kushal  (@kushal0532).</description>
    <link>https://dev.to/kushal0532</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3334761%2F7cf3b58a-8573-42b0-9be0-37970e0be24e.png</url>
      <title>DEV Community: Kushal </title>
      <link>https://dev.to/kushal0532</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kushal0532"/>
    <language>en</language>
    <item>
      <title>What's semantic caching?</title>
      <dc:creator>Kushal </dc:creator>
      <pubDate>Mon, 16 Mar 2026 15:34:31 +0000</pubDate>
      <link>https://dev.to/kushal0532/whats-semantic-caching-4hon</link>
      <guid>https://dev.to/kushal0532/whats-semantic-caching-4hon</guid>
      <description>&lt;p&gt;As more applications for generative AI come, its shortcomings become more apparent. One huge problem with LLMs is how expensive each query is, for example take Gemini — Gemini 2.5 Pro charges $1.25 per million input tokens and $10 per million output tokens. Their flagship Gemini 3.1 Pro doubles that to $2 and $12 per million tokens respectively. Even a moderately active app can rack up thousands of dollars a month pretty quickly. Imagine a small customer support bot with just 500 daily users — by month two, the API bill has quietly crossed $2,000. That's not an edge case, that's just what happens when you're not caching. As a business (or a personal user) saving costs where possible and speeding up operations is a huge important factor that decides how well your product does. One way to speed up and minimise costs is to use a simple 'semantic cache'.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it is
&lt;/h2&gt;

&lt;p&gt;A semantic cache is not too different from a traditional cache; the idea behind it is the same. A traditional cache stores recent results and evicts old entries using a policy like LRU (Least Recently Used) or LFU (Least Frequently Used), so that when the same query comes in again, it can just fetch the stored result rather than compute it all over again.&lt;/p&gt;

&lt;p&gt;You cannot, however, apply the exact same pipeline to RAG or genAI products, simply because the queries are not 'deterministic', i.e., users rarely type the exact same string twice. Take these examples:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;What is the situation regarding AI in professional workplaces?&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;How are AI tools affecting workplaces?&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Semantically these are similar enough that we can tell they mean pretty much the same thing, but a normal cache does not understand that. It treats them as two different queries because the strings are not exactly the same.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qlhkrsnlnvwubed2jc1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qlhkrsnlnvwubed2jc1.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's where semantic caching comes in. Rather than compare them directly, it compares the semantic meaning behind them and understands that it's kinda the same and thus we get a cache hit! We normally check how similar two documents are based on cosine similarity.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;This is a typical pipeline for RAG systems that use semantic caching.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xp6e6lfli9rkotyujnb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xp6e6lfli9rkotyujnb.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First the documents are chunked and converted to embeddings (vectors). Ofc you store them in a vector db like &lt;a href="https://www.trychroma.com/" rel="noopener noreferrer"&gt;Chroma&lt;/a&gt;, &lt;a href="https://faiss.ai/" rel="noopener noreferrer"&gt;FAISS&lt;/a&gt; or whatever suits your use case. When the user sends a query, we don't go straight to the db. Instead we first check the semantic cache, which looks at whether the query is similar to any query it has already cached.&lt;/p&gt;

&lt;p&gt;Two things can happen from here:&lt;/p&gt;

&lt;p&gt;Cache hit: The query is similar enough to a cached one (above the threshold) → cached context is pulled and handed to the LLM → response is generated. Fast and cheap, no db lookup needed.&lt;/p&gt;

&lt;p&gt;Cache miss: Nothing similar in the cache → normal vector db retrieval happens → relevant chunks are fetched, response is generated, and the new query gets cached for next time. Normal speed, but the cache is now warmer.&lt;/p&gt;
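<p></p>
&lt;p&gt;Here is a rough sketch of that hit/miss decision. The &lt;code&gt;embed&lt;/code&gt; function is a stand-in for whatever embedding model you use, and the 0.85 threshold is just an illustrative default, not a recommendation from any particular library:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def cosine_similarity(a, b):
    # cos(θ) = (A · B) / (||A|| × ||B||), explained below
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticCache:
    def __init__(self, embed, threshold=0.85):
        self.embed = embed         # function mapping text to a vector
        self.threshold = threshold
        self.entries = []          # list of (embedding, cached_context) pairs

    def lookup(self, query):
        q = self.embed(query)
        best_score, best_context = 0.0, None
        for emb, context in self.entries:
            score = cosine_similarity(q, emb)
            if score &gt; best_score:
                best_score, best_context = score, context
        if best_score &gt;= self.threshold:
            return best_context    # cache hit: skip the vector db lookup
        return None                # cache miss: do normal retrieval

    def add(self, query, context):
        self.entries.append((self.embed(query), context))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On a miss you'd run the normal vector db retrieval and then call &lt;code&gt;add()&lt;/code&gt;, so the next similar query becomes a hit.&lt;/p&gt;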

&lt;p&gt;Embeddings are compared using cosine similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cosine(θ) = (A · B) / (||A|| × ||B||)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's a very fast and simple way to compare the directions of two vectors. If the texts are similar, their embeddings point in a similar direction, i.e., the angle between them is small, i.e., the cosine of that angle is close to 1. Mathematically the score ranges from -1 to 1, but for typical text embeddings it mostly lands between 0 and 1, where values near 0 mean not similar at all and 1 means the meaning is essentially identical.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"What is the impact of AI on jobs?"&lt;/code&gt; vs &lt;code&gt;"How is AI changing employment?"&lt;/code&gt; → score of ~0.91 → cache hit&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"What is the impact of AI on jobs?"&lt;/code&gt; vs &lt;code&gt;"How do I bake sourdough bread?"&lt;/code&gt; → score of ~0.08 → cache miss&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those first two are clearly the same question in spirit, and the score reflects that.&lt;/p&gt;
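<p></p>
&lt;p&gt;If you want to sanity-check scores like that yourself, the formula is only a couple of lines of numpy. The vectors below are made-up toy embeddings, not output from a real model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

a = np.array([0.8, 0.1, 0.5])    # toy embedding for query A
b = np.array([0.7, 0.2, 0.6])    # toy embedding for a similar query B
c = np.array([-0.2, 0.9, -0.1])  # toy embedding for an unrelated query C

def cos_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cos_sim(a, b))  # ~0.98, close to 1, so this would be a cache hit
print(cos_sim(a, c))  # ~-0.14, nowhere near the threshold, so a miss
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;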

&lt;h2&gt;
  
  
  Why use it
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Significant cost savings. By cutting down the queries that hit the vector db (and, if you also cache full responses, the LLM calls themselves), you avoid a big chunk of the charges you would otherwise rack up.&lt;/li&gt;
&lt;li&gt;Faster response time. If you already have the cached content, you don't need to retrieve it again. This allows the system to be a whole lot faster in production.&lt;/li&gt;
&lt;li&gt;Better use of resources. Since you aren't redoing similar queries, the system is free to do more tasks, allowing you to scale better or handle more complex features.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Compared to other approaches in RAG
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Handles Semantic Similarity&lt;/th&gt;
&lt;th&gt;Cost Savings&lt;/th&gt;
&lt;th&gt;Speed Boost&lt;/th&gt;
&lt;th&gt;Setup Complexity&lt;/th&gt;
&lt;th&gt;Works for Unique Queries&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traditional Cache&lt;/td&gt;
&lt;td&gt;No (exact match only)&lt;/td&gt;
&lt;td&gt;High (when hits)&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;High-volume apps with repetitive, exact queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic Cache&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Apps with overlapping but varied query patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query Rewriting&lt;/td&gt;
&lt;td&gt;Partially&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Low (adds a step)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Improving retrieval on ambiguous or poorly phrased queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Re-ranking&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;No (adds latency)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Boosting relevance when retrieval is decent but ordering is off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid Search&lt;/td&gt;
&lt;td&gt;Partially&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Complex domains needing both keyword and semantic retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunking Optimisation&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Low–Medium&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Improving retrieval quality at the source&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As you can see, semantic caching isn't a silver bullet. It shines when there's a decent overlap in the kinds of queries your users send. For more diverse or unique query patterns, approaches like re-ranking or hybrid search may be better suited.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;More complex to build than a traditional cache system.&lt;/li&gt;
&lt;li&gt;Higher chances of getting semantically similar chunks that may not be relevant or useful for answering the query. Think of it like asking a librarian for "books about space travel" and getting recommendations cached from a previous "books about space exploration" query — close enough on the surface. But when you follow up with "books about the health risks of space travel", the cache might still serve those same exploration books because the queries look similar, even though what you actually need is quite different.&lt;/li&gt;
&lt;li&gt;Need to balance the threshold. Set it too high and near-duplicate queries stop matching, so the cache rarely hits; set it too low and queries that are not actually similar get served someone else's chunks. Both degrade the system, so finding the right balance matters.&lt;/li&gt;
&lt;li&gt;A cold (empty) cache gives you none of the speedup, so early queries still pay full retrieval latency plus the overhead of the similarity check.&lt;/li&gt;
&lt;li&gt;Not suitable when every user query is unique.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  When not to use it
&lt;/h2&gt;

&lt;p&gt;Semantic caching isn't always the right tool. Skip it if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every query your users send is unique. Think code generation, legal research, or anything highly personalised — the cache will almost never hit and you're just adding overhead.&lt;/li&gt;
&lt;li&gt;Your app is low traffic. If you're getting a handful of queries a day, there's no real benefit.&lt;/li&gt;
&lt;li&gt;Your knowledge base changes constantly. If documents are being updated all the time, you'll spend more time invalidating the cache than benefiting from it.&lt;/li&gt;
&lt;li&gt;Accuracy is non-negotiable. Cached context can be slightly off. For use cases where being slightly wrong is worse than being slow, don't cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to best utilise it
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Calibrate your threshold carefully. A good starting point is somewhere between 0.85 and 0.90. From there, tune it based on your specific use case and monitor quality. There's no universal right answer here.&lt;/li&gt;
&lt;li&gt;Use TTL (Time To Live) values. Cached entries should expire, especially when your underlying data changes or when topics are time-sensitive. Stale cache is worse than no cache (there's a small sketch of this after the list).&lt;/li&gt;
&lt;li&gt;Warm up your cache. Pre-populate it with common or anticipated queries so you're not starting completely cold in production. A cold cache gives you none of the benefits.&lt;/li&gt;
&lt;li&gt;Invalidate when your knowledge base updates. If the documents in your vector db change, cached responses based on old chunks can quietly degrade your output quality without you noticing.&lt;/li&gt;
&lt;li&gt;Monitor your hit rate. A healthy semantic cache typically sees hit rates somewhere around 30–60%. If yours is much lower, your threshold might be too strict; if it's suspiciously high while answer quality is dropping, it's too loose.&lt;/li&gt;
&lt;li&gt;Think about scope — global vs user-level caching. A global cache saves the most but can serve mismatched cached results across very different user contexts. For personalised applications, a user-scoped cache might make more sense even if it's less efficient.&lt;/li&gt;
&lt;/ol&gt;
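<p></p>
&lt;p&gt;To illustrate points 2 and 5, here's a minimal sketch of TTL expiry and hit-rate tracking. The 24-hour TTL is an arbitrary placeholder, not a recommendation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

TTL_SECONDS = 24 * 60 * 60  # arbitrary: drop cached entries after a day

def evict_stale(entries):
    # entries is a list of dicts: {"embedding": ..., "context": ..., "created_at": ...}
    now = time.time()
    return [e for e in entries if now - e["created_at"] &lt; TTL_SECONDS]

class HitRateMonitor:
    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run &lt;code&gt;evict_stale&lt;/code&gt; before each lookup (or on a schedule), and check the monitor's rate against that rough 30–60% band.&lt;/p&gt;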

&lt;h2&gt;
  
  
  Tools that already do this
&lt;/h2&gt;

&lt;p&gt;You don't have to build it from scratch. A few libraries have semantic caching built in or easily pluggable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/zilliztech/GPTCache" rel="noopener noreferrer"&gt;GPTCache&lt;/a&gt; — an open source library built specifically for caching LLM responses. Pretty flexible and worth looking at if you're rolling your own pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://python.langchain.com/docs/how_to/llm_caching/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; — has caching layers that plug into existing chains without too much effort. Good starting point if you're already using it.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://redis.io/blog/what-is-vector-similarity-search/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; — with vector similarity extensions, Redis can act as a fast semantic cache layer, especially if you're already using it in your stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Worth knowing these exist before you reinvent the wheel.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>RAG+ is kinda cool</title>
      <dc:creator>Kushal </dc:creator>
      <pubDate>Tue, 08 Jul 2025 10:34:11 +0000</pubDate>
      <link>https://dev.to/kushal0532/rag-is-kinda-cool-25f1</link>
      <guid>https://dev.to/kushal0532/rag-is-kinda-cool-25f1</guid>
      <description>&lt;p&gt;So you probably already know what &lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; is. It goes and grabs relevant info and feeds it to a language model so it can answer better and yeah, that works pretty well… until it doesn’t.&lt;/p&gt;

&lt;p&gt;The catch? &lt;strong&gt;RAG is good at retrieving stuff but not really great when it comes to reasoning&lt;/strong&gt;. Like it knows things but doesn’t &lt;em&gt;apply&lt;/em&gt; them all that well. That’s exactly the issue &lt;strong&gt;RAG+&lt;/strong&gt; is built to solve and it does it in a surprisingly clean way.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Different in RAG+?
&lt;/h2&gt;

&lt;p&gt;RAG+ adds a second brain to the operation. Instead of just retrieving knowledge chunks, it builds &lt;strong&gt;two corpora&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Knowledge base&lt;/strong&gt; –  where we get our information from (textbooks, docs, etc)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Application base&lt;/strong&gt; – actual examples that &lt;em&gt;show&lt;/em&gt; how to use that knowledge (examples showing how to use math formulas)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So when it’s inference time, the model doesn’t just get “&lt;strong&gt;Formula for mean median and mode&lt;/strong&gt;”, it gets both the &lt;strong&gt;definition&lt;/strong&gt; &lt;em&gt;and&lt;/em&gt; a &lt;strong&gt;walkthrough of solving a question about it&lt;/strong&gt;. That combo lets it reason better, especially for stuff like math, law, and medicine.&lt;/p&gt;

&lt;p&gt;These application examples can be written manually or generated using another model. Either way, the model gets both the &lt;strong&gt;concept&lt;/strong&gt; and &lt;strong&gt;how to use it&lt;/strong&gt;.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
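<p></p>
&lt;p&gt;To make that concrete, here’s a minimal sketch of what dual retrieval could look like at inference time. The retriever objects and their &lt;code&gt;search&lt;/code&gt; method are hypothetical stand-ins, not the paper’s actual implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def rag_plus_answer(query, knowledge_retriever, application_retriever, llm, top_k=3):
    # 1. Retrieve conceptual knowledge (definitions, facts), as in plain RAG.
    knowledge_chunks = knowledge_retriever.search(query, k=top_k)

    # 2. Retrieve the application examples paired with that knowledge
    #    (worked solutions, case applications).
    application_chunks = application_retriever.search(query, k=top_k)

    # 3. Give the model both the concept and a demonstration of using it.
    prompt = (
        "Knowledge:\n" + "\n".join(knowledge_chunks) +
        "\n\nWorked applications:\n" + "\n".join(application_chunks) +
        "\n\nQuestion: " + query
    )
    return llm(prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;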

&lt;h2&gt;
  
  
  Modularity
&lt;/h2&gt;

&lt;p&gt;Another win: RAG+ is &lt;strong&gt;modular&lt;/strong&gt;, so you don’t need to rebuild your current setup. Just wire in the application corpus, do some indexing work, and it slides into place.&lt;/p&gt;

&lt;p&gt;In some domains, it’s basically a &lt;strong&gt;free 3% performance uplift&lt;/strong&gt;. You know, the kind of performance boost you'd usually need a lot more compute for? Yeah, you get it just from smarter retrieval.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Types of Knowledge: Conceptual vs Procedural
&lt;/h2&gt;

&lt;p&gt;If you really think about it there are two kinds of things LLMs need to know:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Conceptual&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Procedural&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Facts and definitions&lt;/td&gt;
&lt;td&gt;How to use the facts to solve problems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conceptual&lt;/strong&gt;: “This is Heron's formula”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Procedural&lt;/strong&gt;: “Using Heron's formula, solve this”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RAG+ leans into this by &lt;strong&gt;pairing both&lt;/strong&gt; types of knowledge when it retrieves stuff. That way, the model isn’t just reading text; it’s also seeing worked-out examples, the way a student would learn in class.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hqwu26wu0ep45z9x6xp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hqwu26wu0ep45z9x6xp.png" alt="Pic 1" width="662" height="1048"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image 1&lt;/strong&gt;: A side-by-side of conceptual vs procedural reasoning. Conceptual chunks give you facts whereas procedural chunks show you how to use them.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Application Matching (really cool)
&lt;/h2&gt;

&lt;p&gt;Instead of just slapping random examples into the prompt, RAG+ does &lt;strong&gt;application matching&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It first &lt;strong&gt;categorizes&lt;/strong&gt; both knowledge and examples (into themes/domains).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then an LLM does &lt;strong&gt;many-to-many linking&lt;/strong&gt;, matching examples to the concepts they help explain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If a good match can't be found then it just &lt;strong&gt;generates new knowledge pieces&lt;/strong&gt; on the fly to fill the gap.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This kind of thing turns the model into more of a &lt;strong&gt;polymath&lt;/strong&gt;, able to connect distant concepts like someone who can link thermodynamics to economics or something like that.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
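<p></p>
&lt;p&gt;Roughly, the offline matching step could look like the sketch below. The &lt;code&gt;tag_topic&lt;/code&gt;, &lt;code&gt;link_examples&lt;/code&gt; and &lt;code&gt;generate_example&lt;/code&gt; helpers are hypothetical LLM-backed functions; this just shows the shape of the many-to-many mapping and the fallback generation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def build_application_index(knowledge_items, example_items,
                            tag_topic, link_examples, generate_example):
    # 1. Categorise both corpora into themes/domains.
    knowledge_by_topic, examples_by_topic = {}, {}
    for item in knowledge_items:
        knowledge_by_topic.setdefault(tag_topic(item), []).append(item)
    for ex in example_items:
        examples_by_topic.setdefault(tag_topic(ex), []).append(ex)

    # 2. Many-to-many linking: each knowledge item maps to the examples that
    #    demonstrate it, and one example can serve several knowledge items.
    index = {}
    for topic, items in knowledge_by_topic.items():
        candidates = examples_by_topic.get(topic, [])
        for item in items:
            matched = link_examples(item, candidates)
            # 3. No good match? Generate a fresh application example instead.
            index[item] = matched or [generate_example(item)]
    return index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;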

&lt;h2&gt;
  
  
  How They Ran the Experiments (and What the RAG Variants Are)
&lt;/h2&gt;

&lt;p&gt;They compared this application-augmented method on &lt;strong&gt;4 variants of RAG&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;RAG (Standard)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Retrieves relevant info from a corpus and feeds it into the LLM during inference. The OG method. Simple and widely used.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;GraphRAG&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It’s like RAG but it builds &lt;strong&gt;entity-relation graphs&lt;/strong&gt; between corpus chunks. Helps capture similar ideas across multiple sources. Good for multi-hop reasoning but pretty heavy to run.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Re-rank RAG&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Retrieves &lt;em&gt;k&lt;/em&gt; results, re-ranks them based on how relevant they are to the question (sometimes using another LLM), and uses the top few as context. It's like pre-filtering the prompt.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;AFRAG (Answer-First RAG)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM first gives a &lt;strong&gt;generic blind response&lt;/strong&gt; to the question. Based on that guess, relevant context is fetched. Then the model uses both the initial guess and new context to generate the final answer. It’s like searching for a toy blindfolded… but this time, you kinda remember how it feels. So the search becomes smarter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup Summary:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tested on &lt;strong&gt;maths&lt;/strong&gt;, &lt;strong&gt;legal&lt;/strong&gt;, and &lt;strong&gt;medicine&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Across &lt;strong&gt;various model sizes&lt;/strong&gt;, from ~7B to 70B&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compared &lt;strong&gt;with and without&lt;/strong&gt; application augmentation&lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  My Observation: Model Size Makes a Big Difference for Re-rank RAG
&lt;/h2&gt;

&lt;p&gt;Now here's something I noticed from the results table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;With &lt;strong&gt;small models&lt;/strong&gt; like GLM4-9B and Qwen2.5-7B, &lt;strong&gt;regular RAG+ actually beats Re-rank RAG&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;But when you throw in the big bois like Qwen2.5-72B and LLaMA3-70B, &lt;strong&gt;Re-rank RAG starts dominating&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why? Well, the paper says small models kinda suck at following reranking instructions. Instead of reranking, they just start answering immediately which defeats the whole point.&lt;/p&gt;

&lt;p&gt;So yeah, &lt;strong&gt;reranking isn’t reliable on smaller models&lt;/strong&gt;, and it’s not even a prompt engineering issue. It’s just that those models don’t have the capacity for task separation like rerank-then-generate.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where They Tested This: Math, Legal, Medicine
&lt;/h2&gt;

&lt;p&gt;The authors tested RAG+ on three very different domains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Math&lt;/strong&gt; – where reasoning is very procedural and examples are structured.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Legal&lt;/strong&gt; – complex language, long docs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Medical&lt;/strong&gt; – domain-specific, technical, and high-stakes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For law and medicine, they went with &lt;strong&gt;automatic generation&lt;/strong&gt; of application examples. Manually writing those would be extremely difficult and time-consuming.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: Does RAG+ Work?
&lt;/h2&gt;

&lt;p&gt;Yep! And sometimes it outperforms the alternatives by a &lt;strong&gt;lot&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frne3ln7i1fr3tncq7pj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frne3ln7i1fr3tncq7pj4.png" alt="Pic 2" width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image 2&lt;/strong&gt;: A performance chart comparing three setups — examples only, standard RAG, and RAG+. RAG+ clearly comes out ahead, especially when it gets both the knowledge and its application during inference. (This is for the legal tasks dataset)&lt;/p&gt;

&lt;p&gt;Also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;GraphRAG and AFRAG&lt;/strong&gt; don’t work great with small models.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AFRAG&lt;/strong&gt; relies on the model’s initial answer. If that’s off, the follow-up steps flop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GraphRAG&lt;/strong&gt; needs complex reasoning to understand document connections, something small models aren’t good at.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;GraphRAG was skipped entirely for medicine&lt;/strong&gt;, because it’s so compute-hungry.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;But &lt;strong&gt;adding application examples&lt;/strong&gt; still helped GraphRAG improve on math tasks!&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Across the board, &lt;strong&gt;Re-rank RAG&lt;/strong&gt; was top-tier when models were big enough.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Also worth noting: &lt;strong&gt;the bigger the model, the more benefit it gets from the application examples&lt;/strong&gt;. Makes sense since big models have the horsepower to actually use them properly.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What If You Only Use a Big Model for Reranking?
&lt;/h2&gt;

&lt;p&gt;Now here's a move I liked: instead of letting a big model do everything, they had it just handle the &lt;strong&gt;reranking&lt;/strong&gt;, and let a &lt;strong&gt;smaller model do the actual answering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At first that sounds pointless, right? Like if you’re already using a big model, why not let it finish the job? What makes it any different from just using a bigger model from start to end?&lt;/p&gt;

&lt;p&gt;But reranking is &lt;strong&gt;way cheaper&lt;/strong&gt; than generating. The big model’s just doing short scoring tasks, and when it picks better inputs, the small model can do a decent job answering.&lt;/p&gt;

&lt;p&gt;It’s a great &lt;strong&gt;cost/performance tradeoff&lt;/strong&gt;. Cheap inference, better results.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
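<p></p>
&lt;p&gt;A sketch of that split, with &lt;code&gt;score_with_big_model&lt;/code&gt; and &lt;code&gt;answer_with_small_model&lt;/code&gt; as hypothetical stand-ins for the two model calls:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def rerank_then_answer(query, retrieved_chunks,
                       score_with_big_model, answer_with_small_model, keep=3):
    # The big model only scores relevance: short, cheap calls per chunk.
    scored = [(score_with_big_model(query, chunk), chunk) for chunk in retrieved_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_chunks = [chunk for _, chunk in scored[:keep]]

    # The small model does the expensive part (generation) on better inputs.
    context = "\n\n".join(top_chunks)
    return answer_with_small_model(query, context)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;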

&lt;h2&gt;
  
  
  TL;DR: What You Should Take Away
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;RAG+ improves reasoning by retrieving &lt;strong&gt;both knowledge and its application&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It’s modular, so you don’t have to rebuild your pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Re-rank RAG only works well on &lt;strong&gt;big models&lt;/strong&gt; probably because small ones don’t follow instructions well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GraphRAG and AFRAG&lt;/strong&gt; are cool but either too costly or too fragile on small models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Letting big models do reranking only&lt;/strong&gt; is a smart hack: better results, cheaper cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most importantly: &lt;strong&gt;don’t just throw examples at your model&lt;/strong&gt;, you need to give it both the facts and how to use them.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Limitations? Yup they exist
&lt;/h2&gt;

&lt;p&gt;Unfortunately nothing is perfect; these are the problems that persist.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Building the application dataset is expensive&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Especially in low-resource domains. And automatic generation via LLMs? Still error-prone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bad retrieval = bad matches&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If the system pulls the wrong knowledge, it might link it to the wrong example, which breaks the whole reasoning chain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RAG+ doesn't fix bad retrieval&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Continuing from the previous point: RAG+ depends entirely on what’s retrieved. So if your retriever’s garbage, your answer will still be garbage. It's just more logically structured garbage.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;RAG+ isn't a new revolution; it's just an improvement on existing tech. Yet what it does seems so obvious that I don't understand why we didn't do it before. It helps the model answer properly, look at examples, and think kinda like us. It can also be integrated into existing stacks, so there's no real harm in trying it out (provided you spend the time configuring the application-generation part).&lt;/p&gt;

&lt;p&gt;I'd love to see how it's helping you guys out!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All of this came from a paper I read recently, I highly recommend you guys checking it out for the full details:&lt;/em&gt; &lt;a href="https://arxiv.org/abs/2506.11555" rel="noopener noreferrer"&gt;RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>nlp</category>
    </item>
  </channel>
</rss>
