<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kushal </title>
    <description>The latest articles on DEV Community by Kushal  (@kushal0532).</description>
    <link>https://dev.to/kushal0532</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3334761%2F7cf3b58a-8573-42b0-9be0-37970e0be24e.png</url>
      <title>DEV Community: Kushal </title>
      <link>https://dev.to/kushal0532</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kushal0532"/>
    <language>en</language>
    <item>
      <title>What's semantic caching?</title>
      <dc:creator>Kushal </dc:creator>
      <pubDate>Mon, 16 Mar 2026 15:34:31 +0000</pubDate>
      <link>https://dev.to/kushal0532/whats-semantic-caching-4hon</link>
      <guid>https://dev.to/kushal0532/whats-semantic-caching-4hon</guid>
      <description>&lt;p&gt;As more applications for generative AI come, its shortcomings become more apparent. One huge problem with LLMs is how expensive each query is, for example take Gemini — Gemini 2.5 Pro charges $1.25 per million input tokens and $10 per million output tokens. Their flagship Gemini 3.1 Pro doubles that to $2 and $12 per million tokens respectively. Even a moderately active app can rack up thousands of dollars a month pretty quickly. Imagine a small customer support bot with just 500 daily users — by month two, the API bill has quietly crossed $2,000. That's not an edge case, that's just what happens when you're not caching. As a business (or a personal user) saving costs where possible and speeding up operations is a huge important factor that decides how well your product does. One way to speed up and minimise costs is to use a simple 'semantic cache'.&lt;/p&gt;

&lt;h2&gt;
  
  
  What it is
&lt;/h2&gt;

&lt;p&gt;A semantic cache is not too different from a traditional cache; the idea behind it is the same. A traditional cache stores recent results and evicts old entries using a policy like LRU (Least Recently Used) or LFU (Least Frequently Used), so that when the same query comes in again, it can just fetch the stored result rather than compute it all over again.&lt;/p&gt;

&lt;p&gt;You cannot, however, apply the exact same pipeline to RAG or genAI products, simply because the queries are not 'deterministic', i.e., users rarely type the exact same string twice. Take these examples:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;What is the situation regarding AI in professional workplaces?&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;How are AI tools affecting workplaces?&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Semantically these are similar enough that we can tell they mean pretty much the same thing, but a normal cache does not understand that. It treats them as two different queries because the strings are not exactly the same.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qlhkrsnlnvwubed2jc1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qlhkrsnlnvwubed2jc1.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's where semantic caching comes in. Rather than compare them directly, it compares the semantic meaning behind them and understands that it's kinda the same and thus we get a cache hit! We normally check how similar two documents are based on cosine similarity.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;This is a typical pipeline for RAG systems that use semantic caching.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xp6e6lfli9rkotyujnb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xp6e6lfli9rkotyujnb.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First the documents are chunked and converted to embeddings (vectors). Ofc you store them in a vector db like &lt;a href="https://www.trychroma.com/" rel="noopener noreferrer"&gt;Chroma&lt;/a&gt;, &lt;a href="https://faiss.ai/" rel="noopener noreferrer"&gt;FAISS&lt;/a&gt; or whatever suits your use case. When the user sends a query, we don't go straight to the db. Instead we first check the semantic cache, which looks at whether the query is similar to any query it has already cached.&lt;/p&gt;

&lt;p&gt;Two things can happen from here:&lt;/p&gt;

&lt;p&gt;Cache hit: The query is similar enough to a cached one (above the threshold) → cached context is pulled and handed to the LLM → response is generated. Fast and cheap, no db lookup needed.&lt;/p&gt;

&lt;p&gt;Cache miss: Nothing similar in the cache → normal vector db retrieval happens → relevant chunks are fetched, response is generated, and the new query gets cached for next time. Normal speed, but the cache is now warmer.&lt;/p&gt;
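<p></p>
&lt;p&gt;Here is a rough sketch of that hit/miss decision. The &lt;code&gt;embed&lt;/code&gt; function is a stand-in for whatever embedding model you use, and the 0.85 threshold is just an illustrative default, not a recommendation from any particular library:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def cosine_similarity(a, b):
    # cos(θ) = (A · B) / (||A|| × ||B||), explained below
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticCache:
    def __init__(self, embed, threshold=0.85):
        self.embed = embed         # function mapping text to a vector
        self.threshold = threshold
        self.entries = []          # list of (embedding, cached_context) pairs

    def lookup(self, query):
        q = self.embed(query)
        best_score, best_context = 0.0, None
        for emb, context in self.entries:
            score = cosine_similarity(q, emb)
            if score &gt; best_score:
                best_score, best_context = score, context
        if best_score &gt;= self.threshold:
            return best_context    # cache hit: skip the vector db lookup
        return None                # cache miss: do normal retrieval

    def add(self, query, context):
        self.entries.append((self.embed(query), context))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On a miss you'd run the normal vector db retrieval and then call &lt;code&gt;add()&lt;/code&gt;, so the next similar query becomes a hit.&lt;/p&gt;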

&lt;p&gt;Embeddings are compared using cosine similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cosine(θ) = (A · B) / (||A|| × ||B||)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's a very fast and simple way to compare the directions of two vectors. If the texts are similar, their embeddings point in a similar direction, i.e., the angle between them is small, i.e., the cosine of that angle is close to 1. Mathematically the score ranges from -1 to 1, but for typical text embeddings it mostly lands between 0 and 1, where values near 0 mean not similar at all and 1 means the meaning is essentially identical.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"What is the impact of AI on jobs?"&lt;/code&gt; vs &lt;code&gt;"How is AI changing employment?"&lt;/code&gt; → score of ~0.91 → cache hit&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"What is the impact of AI on jobs?"&lt;/code&gt; vs &lt;code&gt;"How do I bake sourdough bread?"&lt;/code&gt; → score of ~0.08 → cache miss&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those first two are clearly the same question in spirit, and the score reflects that.&lt;/p&gt;
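<p></p>
&lt;p&gt;If you want to sanity-check scores like that yourself, the formula is only a couple of lines of numpy. The vectors below are made-up toy embeddings, not output from a real model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

a = np.array([0.8, 0.1, 0.5])    # toy embedding for query A
b = np.array([0.7, 0.2, 0.6])    # toy embedding for a similar query B
c = np.array([-0.2, 0.9, -0.1])  # toy embedding for an unrelated query C

def cos_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cos_sim(a, b))  # ~0.98, close to 1, so this would be a cache hit
print(cos_sim(a, c))  # ~-0.14, nowhere near the threshold, so a miss
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;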

&lt;h2&gt;
  
  
  Why use it
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Significant cost savings. By cutting down the queries that hit the vector db (and, if you also cache full responses, the LLM calls themselves), you avoid a big chunk of the charges you would otherwise rack up.&lt;/li&gt;
&lt;li&gt;Faster response time. If you already have the cached content, you don't need to retrieve it again. This allows the system to be a whole lot faster in production.&lt;/li&gt;
&lt;li&gt;Better use of resources. Since you aren't redoing similar queries, the system is free to do more tasks, allowing you to scale better or handle more complex features.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Compared to other approaches in RAG
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Handles Semantic Similarity&lt;/th&gt;
&lt;th&gt;Cost Savings&lt;/th&gt;
&lt;th&gt;Speed Boost&lt;/th&gt;
&lt;th&gt;Setup Complexity&lt;/th&gt;
&lt;th&gt;Works for Unique Queries&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traditional Cache&lt;/td&gt;
&lt;td&gt;No (exact match only)&lt;/td&gt;
&lt;td&gt;High (when hits)&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;High-volume apps with repetitive, exact queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic Cache&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Apps with overlapping but varied query patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query Rewriting&lt;/td&gt;
&lt;td&gt;Partially&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Low (adds a step)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Improving retrieval on ambiguous or poorly phrased queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Re-ranking&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;No (adds latency)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Boosting relevance when retrieval is decent but ordering is off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hybrid Search&lt;/td&gt;
&lt;td&gt;Partially&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Complex domains needing both keyword and semantic retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunking Optimisation&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Low–Medium&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Improving retrieval quality at the source&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As you can see, semantic caching isn't a silver bullet. It shines when there's a decent overlap in the kinds of queries your users send. For more diverse or unique query patterns, approaches like re-ranking or hybrid search may be better suited.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cons
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;More complex to build than a traditional cache system.&lt;/li&gt;
&lt;li&gt;Higher chances of getting semantically similar chunks that may not be relevant or useful for answering the query. Think of it like asking a librarian for "books about space travel" and getting recommendations cached from a previous "books about space exploration" query — close enough on the surface. But when you follow up with "books about the health risks of space travel", the cache might still serve those same exploration books because the queries look similar, even though what you actually need is quite different.&lt;/li&gt;
&lt;li&gt;Need to balance the threshold. Set it too high and near-duplicate queries stop matching, so the cache rarely hits; set it too low and queries that are not actually similar get served someone else's chunks. Both degrade the system, so finding the right balance matters.&lt;/li&gt;
&lt;li&gt;A cold (empty) cache gives you none of the speedup, so early queries still pay full retrieval latency plus the overhead of the similarity check.&lt;/li&gt;
&lt;li&gt;Not suitable when every user query is unique.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  When not to use it
&lt;/h2&gt;

&lt;p&gt;Semantic caching isn't always the right tool. Skip it if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every query your users send is unique. Think code generation, legal research, or anything highly personalised — the cache will almost never hit and you're just adding overhead.&lt;/li&gt;
&lt;li&gt;Your app is low traffic. If you're getting a handful of queries a day, there's no real benefit.&lt;/li&gt;
&lt;li&gt;Your knowledge base changes constantly. If documents are being updated all the time, you'll spend more time invalidating the cache than benefiting from it.&lt;/li&gt;
&lt;li&gt;Accuracy is non-negotiable. Cached context can be slightly off. For use cases where being slightly wrong is worse than being slow, don't cache.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to best utilise it
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Calibrate your threshold carefully. A good starting point is somewhere between 0.85 and 0.90. From there, tune it based on your specific use case and monitor quality. There's no universal right answer here.&lt;/li&gt;
&lt;li&gt;Use TTL (Time To Live) values. Cached entries should expire, especially when your underlying data changes or when topics are time-sensitive. Stale cache is worse than no cache (there's a small sketch of this after the list).&lt;/li&gt;
&lt;li&gt;Warm up your cache. Pre-populate it with common or anticipated queries so you're not starting completely cold in production. A cold cache gives you none of the benefits.&lt;/li&gt;
&lt;li&gt;Invalidate when your knowledge base updates. If the documents in your vector db change, cached responses based on old chunks can quietly degrade your output quality without you noticing.&lt;/li&gt;
&lt;li&gt;Monitor your hit rate. A healthy semantic cache typically sees hit rates somewhere around 30–60%. If yours is much lower, your threshold might be too strict; if it's suspiciously high while answer quality is dropping, it's too loose.&lt;/li&gt;
&lt;li&gt;Think about scope — global vs user-level caching. A global cache saves the most but can serve mismatched cached results across very different user contexts. For personalised applications, a user-scoped cache might make more sense even if it's less efficient.&lt;/li&gt;
&lt;/ol&gt;
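<p></p>
&lt;p&gt;To illustrate points 2 and 5, here's a minimal sketch of TTL expiry and hit-rate tracking. The 24-hour TTL is an arbitrary placeholder, not a recommendation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

TTL_SECONDS = 24 * 60 * 60  # arbitrary: drop cached entries after a day

def evict_stale(entries):
    # entries is a list of dicts: {"embedding": ..., "context": ..., "created_at": ...}
    now = time.time()
    return [e for e in entries if now - e["created_at"] &lt; TTL_SECONDS]

class HitRateMonitor:
    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run &lt;code&gt;evict_stale&lt;/code&gt; before each lookup (or on a schedule), and check the monitor's rate against that rough 30–60% band.&lt;/p&gt;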

&lt;h2&gt;
  
  
  Tools that already do this
&lt;/h2&gt;

&lt;p&gt;You don't have to build it from scratch. A few libraries have semantic caching built in or easily pluggable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/zilliztech/GPTCache" rel="noopener noreferrer"&gt;GPTCache&lt;/a&gt; — an open source library built specifically for caching LLM responses. Pretty flexible and worth looking at if you're rolling your own pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://python.langchain.com/docs/how_to/llm_caching/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; — has caching layers that plug into existing chains without too much effort. Good starting point if you're already using it.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://redis.io/blog/what-is-vector-similarity-search/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; — with vector similarity extensions, Redis can act as a fast semantic cache layer, especially if you're already using it in your stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Worth knowing these exist before you reinvent the wheel.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>performance</category>
    </item>
    <item>
      <title>RAG+ is kinda cool</title>
      <dc:creator>Kushal </dc:creator>
      <pubDate>Tue, 08 Jul 2025 10:34:11 +0000</pubDate>
      <link>https://dev.to/kushal0532/rag-is-kinda-cool-25f1</link>
      <guid>https://dev.to/kushal0532/rag-is-kinda-cool-25f1</guid>
      <description>&lt;p&gt;So you probably already know what &lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; is. It goes and grabs relevant info and feeds it to a language model so it can answer better and yeah, that works pretty well… until it doesn’t.&lt;/p&gt;

&lt;p&gt;The catch? &lt;strong&gt;RAG is good at retrieving stuff but not really great when it comes to reasoning&lt;/strong&gt;. Like it knows things but doesn’t &lt;em&gt;apply&lt;/em&gt; them all that well. That’s exactly the issue &lt;strong&gt;RAG+&lt;/strong&gt; is built to solve and it does it in a surprisingly clean way.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Different in RAG+?
&lt;/h2&gt;

&lt;p&gt;RAG+ adds a second brain to the operation. Instead of just retrieving knowledge chunks, it builds &lt;strong&gt;two corpora&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Knowledge base&lt;/strong&gt; –  where we get our information from (textbooks, docs, etc)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Application base&lt;/strong&gt; – actual examples that &lt;em&gt;show&lt;/em&gt; how to use that knowledge (examples showing how to use math formulas)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So when it’s inference time, the model doesn’t just get “&lt;strong&gt;Formula for mean median and mode&lt;/strong&gt;”, it gets both the &lt;strong&gt;definition&lt;/strong&gt; &lt;em&gt;and&lt;/em&gt; a &lt;strong&gt;walkthrough of solving a question about it&lt;/strong&gt;. That combo lets it reason better, especially for stuff like math, law, and medicine.&lt;/p&gt;

&lt;p&gt;These application examples can be written manually or generated using another model. Either way, the model gets both the &lt;strong&gt;concept&lt;/strong&gt; and &lt;strong&gt;how to use it&lt;/strong&gt;.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
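<p></p>
&lt;p&gt;To make that concrete, here’s a minimal sketch of what dual retrieval could look like at inference time. The retriever objects and their &lt;code&gt;search&lt;/code&gt; method are hypothetical stand-ins, not the paper’s actual implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def rag_plus_answer(query, knowledge_retriever, application_retriever, llm, top_k=3):
    # 1. Retrieve conceptual knowledge (definitions, facts), as in plain RAG.
    knowledge_chunks = knowledge_retriever.search(query, k=top_k)

    # 2. Retrieve the application examples paired with that knowledge
    #    (worked solutions, case applications).
    application_chunks = application_retriever.search(query, k=top_k)

    # 3. Give the model both the concept and a demonstration of using it.
    prompt = (
        "Knowledge:\n" + "\n".join(knowledge_chunks) +
        "\n\nWorked applications:\n" + "\n".join(application_chunks) +
        "\n\nQuestion: " + query
    )
    return llm(prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;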

&lt;h2&gt;
  
  
  Modularity
&lt;/h2&gt;

&lt;p&gt;Another win: RAG+ is &lt;strong&gt;modular&lt;/strong&gt;, so you don’t need to rebuild your current setup. Just wire in the application corpus, do some indexing work, and it slides into place.&lt;/p&gt;

&lt;p&gt;In some domains, it’s basically a &lt;strong&gt;free 3% performance uplift&lt;/strong&gt;. You know, the kind of performance boost you'd usually need a lot more compute for? Yeah, you get it just from smarter retrieval.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Types of Knowledge: Conceptual vs Procedural
&lt;/h2&gt;

&lt;p&gt;If you really think about it there are two kinds of things LLMs need to know:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Conceptual&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Procedural&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Facts and definitions&lt;/td&gt;
&lt;td&gt;How to use the facts to solve problems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conceptual&lt;/strong&gt;: “This is Heron's formula”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Procedural&lt;/strong&gt;: “Using Heron's formula, solve this”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RAG+ leans into this by &lt;strong&gt;pairing both&lt;/strong&gt; types of knowledge when it retrieves stuff. That way, the model isn’t just reading text; it’s also seeing worked-out examples, the way a student would learn in class.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hqwu26wu0ep45z9x6xp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hqwu26wu0ep45z9x6xp.png" alt="Pic 1" width="662" height="1048"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image 1&lt;/strong&gt;: A side-by-side of conceptual vs procedural reasoning. Conceptual chunks give you facts whereas procedural chunks show you how to use them.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Application Matching (really cool)
&lt;/h2&gt;

&lt;p&gt;Instead of just slapping random examples into the prompt, RAG+ does &lt;strong&gt;application matching&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It first &lt;strong&gt;categorizes&lt;/strong&gt; both knowledge and examples (into themes/domains).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then an LLM does &lt;strong&gt;many-to-many linking&lt;/strong&gt;, matching examples to the concepts they help explain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If a good match can't be found then it just &lt;strong&gt;generates new knowledge pieces&lt;/strong&gt; on the fly to fill the gap.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This kind of thing turns the model into more of a &lt;strong&gt;polymath&lt;/strong&gt;, able to connect distant concepts like someone who can link thermodynamics to economics or something like that.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
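<p></p>
&lt;p&gt;Roughly, the offline matching step could look like the sketch below. The &lt;code&gt;tag_topic&lt;/code&gt;, &lt;code&gt;link_examples&lt;/code&gt; and &lt;code&gt;generate_example&lt;/code&gt; helpers are hypothetical LLM-backed functions; this just shows the shape of the many-to-many mapping and the fallback generation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def build_application_index(knowledge_items, example_items,
                            tag_topic, link_examples, generate_example):
    # 1. Categorise both corpora into themes/domains.
    knowledge_by_topic, examples_by_topic = {}, {}
    for item in knowledge_items:
        knowledge_by_topic.setdefault(tag_topic(item), []).append(item)
    for ex in example_items:
        examples_by_topic.setdefault(tag_topic(ex), []).append(ex)

    # 2. Many-to-many linking: each knowledge item maps to the examples that
    #    demonstrate it, and one example can serve several knowledge items.
    index = {}
    for topic, items in knowledge_by_topic.items():
        candidates = examples_by_topic.get(topic, [])
        for item in items:
            matched = link_examples(item, candidates)
            # 3. No good match? Generate a fresh application example instead.
            index[item] = matched or [generate_example(item)]
    return index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;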

&lt;h2&gt;
  
  
  How They Ran the Experiments (and What the RAG Variants Are)
&lt;/h2&gt;

&lt;p&gt;They compared this application-augmented method on &lt;strong&gt;4 variants of RAG&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;RAG (Standard)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Retrieves relevant info from a corpus and feeds it into the LLM during inference. The OG method. Simple and widely used.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;GraphRAG&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It’s like RAG but it builds &lt;strong&gt;entity-relation graphs&lt;/strong&gt; between corpus chunks. Helps capture similar ideas across multiple sources. Good for multi-hop reasoning but pretty heavy to run.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Re-rank RAG&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Retrieves &lt;em&gt;k&lt;/em&gt; results, re-ranks them based on how relevant they are to the question (sometimes using another LLM), and uses the top few as context. It's like pre-filtering the prompt.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;AFRAG (Answer-First RAG)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM first gives a &lt;strong&gt;generic blind response&lt;/strong&gt; to the question. Based on that guess, relevant context is fetched. Then the model uses both the initial guess and new context to generate the final answer. It’s like searching for a toy blindfolded… but this time, you kinda remember how it feels. So the search becomes smarter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup Summary:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tested on &lt;strong&gt;maths&lt;/strong&gt;, &lt;strong&gt;legal&lt;/strong&gt;, and &lt;strong&gt;medicine&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Across &lt;strong&gt;various model sizes&lt;/strong&gt;, from ~7B to 70B&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compared &lt;strong&gt;with and without&lt;/strong&gt; application augmentation&lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  My Observation: Model Size Makes a Big Difference for Re-rank RAG
&lt;/h2&gt;

&lt;p&gt;Now here's something I noticed from the results table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;With &lt;strong&gt;small models&lt;/strong&gt; like GLM4-9B and Qwen2.5-7B, &lt;strong&gt;regular RAG+ actually beats Re-rank RAG&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;But when you throw in the big bois like Qwen2.5-72B and LLaMA3-70B, &lt;strong&gt;Re-rank RAG starts dominating&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why? Well, the paper says small models kinda suck at following reranking instructions. Instead of reranking, they just start answering immediately which defeats the whole point.&lt;/p&gt;

&lt;p&gt;So yeah, &lt;strong&gt;reranking isn’t reliable on smaller models&lt;/strong&gt;, and it’s not even a prompt engineering issue. It’s just that those models don’t have the capacity for task separation like rerank-then-generate.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where They Tested This: Math, Legal, Medicine
&lt;/h2&gt;

&lt;p&gt;The authors tested RAG+ on three very different domains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Math&lt;/strong&gt; – where reasoning is very procedural and examples are structured.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Legal&lt;/strong&gt; – complex language, long docs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Medical&lt;/strong&gt; – domain-specific, technical, and high-stakes.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For law and medicine, they went with &lt;strong&gt;automatic generation&lt;/strong&gt; of application examples. Manually writing those would be extremely difficult and time-consuming.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: Does RAG+ Work?
&lt;/h2&gt;

&lt;p&gt;Yep! And sometimes it outperforms the alternatives by a &lt;strong&gt;lot&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frne3ln7i1fr3tncq7pj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frne3ln7i1fr3tncq7pj4.png" alt="Pic 2" width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image 2&lt;/strong&gt;: A performance chart comparing three setups — examples only, standard RAG, and RAG+. RAG+ clearly comes out ahead, especially when it gets both the knowledge and its application during inference. (This is for the legal tasks dataset)&lt;/p&gt;

&lt;p&gt;Also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;GraphRAG and AFRAG&lt;/strong&gt; don’t work great with small models.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AFRAG&lt;/strong&gt; relies on the model’s initial answer. If that’s off, the follow-up steps flop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GraphRAG&lt;/strong&gt; needs complex reasoning to understand document connections, something small models aren’t good at.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;GraphRAG was skipped entirely for medicine&lt;/strong&gt;, because it’s so compute-hungry.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;But &lt;strong&gt;adding application examples&lt;/strong&gt; still helped GraphRAG improve on math tasks!&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Across the board, &lt;strong&gt;Re-rank RAG&lt;/strong&gt; was top-tier when models were big enough.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Also worth noting: &lt;strong&gt;the bigger the model, the more benefit it gets from the application examples&lt;/strong&gt;. Makes sense since big models have the horsepower to actually use them properly.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What If You Only Use a Big Model for Reranking?
&lt;/h2&gt;

&lt;p&gt;Now here's a move I liked: instead of letting a big model do everything, they had it just handle the &lt;strong&gt;reranking&lt;/strong&gt;, and let a &lt;strong&gt;smaller model do the actual answering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At first that sounds pointless, right? Like if you’re already using a big model, why not let it finish the job? What makes it any different from just using a bigger model from start to end?&lt;/p&gt;

&lt;p&gt;But reranking is &lt;strong&gt;way cheaper&lt;/strong&gt; than generating. The big model’s just doing short scoring tasks, and when it picks better inputs, the small model can do a decent job answering.&lt;/p&gt;

&lt;p&gt;It’s a great &lt;strong&gt;cost/performance tradeoff&lt;/strong&gt;. Cheap inference, better results.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
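<p></p>
&lt;p&gt;A sketch of that split, with &lt;code&gt;score_with_big_model&lt;/code&gt; and &lt;code&gt;answer_with_small_model&lt;/code&gt; as hypothetical stand-ins for the two model calls:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def rerank_then_answer(query, retrieved_chunks,
                       score_with_big_model, answer_with_small_model, keep=3):
    # The big model only scores relevance: short, cheap calls per chunk.
    scored = [(score_with_big_model(query, chunk), chunk) for chunk in retrieved_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_chunks = [chunk for _, chunk in scored[:keep]]

    # The small model does the expensive part (generation) on better inputs.
    context = "\n\n".join(top_chunks)
    return answer_with_small_model(query, context)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;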

&lt;h2&gt;
  
  
  TL;DR: What You Should Take Away
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;RAG+ improves reasoning by retrieving &lt;strong&gt;both knowledge and its application&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It’s modular, so you don’t have to rebuild your pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Re-rank RAG only works well on &lt;strong&gt;big models&lt;/strong&gt; probably because small ones don’t follow instructions well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GraphRAG and AFRAG&lt;/strong&gt; are cool but either too costly or too fragile on small models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Letting big models do reranking only&lt;/strong&gt; is a smart hack: better results, cheaper cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most importantly: &lt;strong&gt;don’t just throw examples at your model&lt;/strong&gt;, you need to give it both the facts and how to use them.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Limitations? Yup they exist
&lt;/h2&gt;

&lt;p&gt;Unfortunately nothing is perfect; these are the problems that persist.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Building the application dataset is expensive&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Especially in low-resource domains. And automatic generation via LLMs? Still error-prone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bad retrieval = bad matches&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If the system pulls the wrong knowledge, it might link it to the wrong example, which breaks the whole reasoning chain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RAG+ doesn't fix bad retrieval&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Continuing from the previous point: RAG+ depends entirely on what’s retrieved. So if your retriever’s garbage, your answer will still be garbage. It's just more logically structured garbage.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;RAG+ isn't a new revolution; it's just an improvement on existing tech. Yet what it does seems so obvious that I don't understand why we didn't do it before. It helps the model answer properly, look at examples, and think kinda like us. It can also be integrated into existing stacks, so there's no real harm in trying it out (provided you spend the time configuring the application-generation part).&lt;/p&gt;

&lt;p&gt;I'd love to see how it's helping you guys out!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All of this came from a paper I read recently, I highly recommend you guys checking it out for the full details:&lt;/em&gt; &lt;a href="https://arxiv.org/abs/2506.11555" rel="noopener noreferrer"&gt;RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>nlp</category>
    </item>
  </channel>
</rss>
