Kushal

RAG+ is kinda cool

So you probably already know what RAG (Retrieval-Augmented Generation) is. It grabs relevant info and feeds it to a language model so it can answer better. And yeah, that works pretty well… until it doesn’t.

The catch? RAG is good at retrieving stuff but not really great when it comes to reasoning. Like it knows things but doesn’t apply them all that well. That’s exactly the issue RAG+ is built to solve and it does it in a surprisingly clean way.

What’s Different in RAG+?

RAG+ adds a second brain to the operation. Instead of just retrieving knowledge chunks, it builds two corpora:

  1. Knowledge base – where we get our information from (textbooks, docs, etc)

  2. Application base – actual examples that show how to use that knowledge (examples showing how to use math formulas)

So when it’s inference time, the model doesn’t just get “Formula for mean median and mode”, it gets both the definition and a walkthrough of solving a question about it. That combo lets it reason better, especially for stuff like math, law, and medicine.

These application examples can be written manually or generated using another model. Either way, the model gets both the concept and how to use it.
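
To make that concrete, here's a rough sketch of what the dual retrieval could look like at inference time. This isn't the paper's code: `knowledge_index`, `application_lookup`, and the `.search()` call are just placeholder names for whatever retriever and index you already have.

```python
# Minimal sketch of RAG+-style dual retrieval, not the paper's implementation.
# Assumes an embedding-based retriever for the knowledge base plus a dict that
# links knowledge chunk IDs to their application examples.

def retrieve_with_applications(query, knowledge_index, application_lookup, k=3):
    """Fetch knowledge chunks plus the worked examples linked to them."""
    knowledge_chunks = knowledge_index.search(query, top_k=k)  # hypothetical retriever API

    pairs = []
    for chunk in knowledge_chunks:
        # Each knowledge chunk may be linked to zero or more application examples.
        examples = application_lookup.get(chunk["id"], [])
        pairs.append({"knowledge": chunk["text"], "applications": examples})
    return pairs


def build_prompt(question, pairs):
    """Give the model both the concept and a walkthrough of using it."""
    sections = []
    for pair in pairs:
        sections.append("Knowledge:\n" + pair["knowledge"])
        for example in pair["applications"]:
            sections.append("Worked example:\n" + example)
    return "\n\n".join(sections) + f"\n\nQuestion: {question}\nAnswer:"
```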

Modularity

Another win: RAG+ is modular, so you don’t need to rebuild your current setup. Just wire in the application corpus, do some indexing work, and it slides into place.

In some domains, it’s basically a free 3% performance uplift. You know, the kind of performance boost you'd usually need a lot more compute for? Yeah, you get it just from smarter retrieval.

Two Types of Knowledge: Conceptual vs Procedural

If you really think about it there are two kinds of things LLMs need to know:

  • Conceptual – facts and definitions

  • Procedural – how to use the facts to solve problems

Like:

  • Conceptual: “This is Heron's formula”

  • Procedural: “Using Heron's formula, solve this”

RAG+ leans into this by pairing both types of knowledge when it retrieves stuff. That way, the model isn’t just reading text; it’s also seeing worked-out examples, the way a student would learn in class.
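
To make the split extra concrete, here's a toy illustration (mine, not from the paper's corpora) of what a conceptual chunk vs a procedural chunk might look like for Heron's formula:

```python
import math

# Conceptual chunk: the fact itself.
conceptual = (
    "Heron's formula: a triangle with sides a, b, c and semi-perimeter "
    "s = (a + b + c) / 2 has area sqrt(s * (s - a) * (s - b) * (s - c))."
)

# Procedural chunk: a worked application of that fact.
def heron_area(a: float, b: float, c: float) -> float:
    s = (a + b + c) / 2
    return math.sqrt(s * (s - a) * (s - b) * (s - c))

procedural = (
    "Example: for sides 3, 4, 5 the semi-perimeter is 6, so the area is "
    f"sqrt(6 * 3 * 2 * 1) = {heron_area(3, 4, 5):.1f}."
)
```

The first chunk tells the model what the formula is; the second shows it being used, and that pairing is exactly what RAG+ retrieves.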

Image 1: A side-by-side of conceptual vs procedural reasoning. Conceptual chunks give you facts whereas procedural chunks show you how to use them.

Application Matching (really cool)

Instead of just slapping random examples into the prompt, RAG+ does application matching:

  • It first categorizes both knowledge and examples (into themes/domains).

  • Then an LLM does many-to-many linking, matching examples to the concepts they help explain.

  • If a good match can't be found then it just generates new knowledge pieces on the fly to fill the gap.

This kind of thing turns the model into more of a polymath, able to connect distant concepts, like someone who can link thermodynamics to economics.
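
Here's roughly how that matching step could be wired up. This is my reading of the idea, not the paper's code: `tag_with_llm` and `generate_application` are hypothetical LLM helpers (the first returns a set of theme tags, the second writes a fresh worked example for a chunk that has none).

```python
# Sketch of application matching: categorize, link many-to-many, fill gaps.

def build_application_links(knowledge_chunks, application_examples):
    """Link each knowledge chunk to the examples that help explain it."""
    # 1. Categorize both corpora into themes/domains.
    for item in knowledge_chunks + application_examples:
        item["tags"] = tag_with_llm(item["text"])  # e.g. {"geometry", "algebra"}

    links = {}
    for chunk in knowledge_chunks:
        # 2. Many-to-many matching: any example sharing a theme is a candidate.
        matches = [ex["id"] for ex in application_examples
                   if chunk["tags"] & ex["tags"]]

        # 3. No good match? Generate a fresh piece on the fly to fill the gap.
        if not matches:
            new_example = generate_application(chunk["text"])
            new_example["tags"] = set(chunk["tags"])  # inherit the chunk's themes
            application_examples.append(new_example)
            matches = [new_example["id"]]

        links[chunk["id"]] = matches
    return links
```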

How They Ran the Experiments (and What the RAG Variants Are)

They compared this application-augmented method on 4 variants of RAG:

  1. RAG (Standard)

Retrieves relevant info from a corpus and feeds it into the LLM during inference. The OG method. Simple and widely used.

  2. GraphRAG

It’s like RAG but it builds entity-relation graphs between corpus chunks. Helps capture similar ideas across multiple sources. Good for multi-hop reasoning but pretty heavy to run.

  3. Re-rank RAG

Retrieves k results, re-ranks them based on how relevant they are to the question (sometimes using another LLM), and uses the top few as context. It's like pre-filtering the prompt.

  4. AFRAG (Answer-First RAG)

The LLM first gives a generic blind response to the question. Based on that guess, relevant context is fetched. Then the model uses both the initial guess and new context to generate the final answer. It’s like searching for a toy blindfolded… but this time, you kinda remember how it feels. So the search becomes smarter.
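
Here's a rough sketch of that answer-first loop, just to make the flow clearer. `llm_generate` and `retriever.search` are stand-ins for your own model and index, nothing from the paper.

```python
# Sketch of the AFRAG flow: blind draft -> retrieve with it -> revise.

def afrag_answer(question, llm_generate, retriever, k=5):
    # 1. Blind first pass: answer without any retrieved context.
    draft = llm_generate(f"Answer briefly: {question}")

    # 2. Use the draft (plus the question) as the retrieval query.
    chunks = retriever.search(f"{question}\n{draft}", top_k=k)
    context = "\n\n".join(chunk["text"] for chunk in chunks)

    # 3. Final pass: question + initial guess + retrieved context.
    prompt = (
        f"Context:\n{context}\n\n"
        f"Initial guess:\n{draft}\n\n"
        f"Question: {question}\n"
        "Revise the initial guess using the context and give the final answer."
    )
    return llm_generate(prompt)
```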

Setup Summary:

  • Tested on math, legal, and medical domains

  • Across various model sizes, from ~7B to 70B

  • Compared with and without application augmentation

My Observation: Model Size Makes a Big Difference for Re-rank RAG

Now here's something I noticed from the results table:

  • With small models like GLM4-9B and Qwen2.5-7B, regular RAG+ actually beats Re-rank RAG.

  • But when you throw in the big bois like Qwen2.5-72B and LLaMA3-70B, Re-rank RAG starts dominating.

Why? Well, the paper says small models kinda suck at following reranking instructions. Instead of reranking, they just start answering immediately, which defeats the whole point.

So yeah, reranking isn’t reliable on smaller models, and it’s not even a prompt engineering issue. It’s just that those models don’t have the capacity for task separation like rerank-then-generate.

Where They Tested This: Math, Legal, Medicine

The authors tested RAG+ on three very different domains:

  • Math – where reasoning is very procedural and examples are structured.

  • Legal – complex language, long docs.

  • Medical – domain-specific, technical, and high-stakes.

For law and medicine, they went with automatic generation of application examples. Manually writing those would be extremely difficult and time-consuming.

Results: Does RAG+ Work?

Yep! And it sometimes outperforms the other setups by a lot.

Image 2: A performance chart comparing three setups: examples only, standard RAG, and RAG+. RAG+ clearly comes out ahead, especially when it gets both the knowledge and its application during inference. (This is for the legal tasks dataset)

Also:

  • GraphRAG and AFRAG don’t work great with small models.

    • AFRAG relies on the model’s initial answer. If that’s off, the follow-up steps flop.
    • GraphRAG needs complex reasoning to understand document connections, something small models aren’t good at.
  • GraphRAG was skipped entirely for medicine, because it’s so compute-hungry.

  • But adding application examples still helped GraphRAG improve on math tasks!

  • Across the board, Re-rank RAG was top-tier when models were big enough.

Also worth noting: the bigger the model, the more benefit it gets from the application examples. Makes sense since big models have the horsepower to actually use them properly.

What If You Only Use a Big Model for Reranking?

Now here's a move I liked: instead of letting a big model do everything, they had it just handle the reranking and let a smaller model do the actual answering.

At first that sounds pointless, right? Like if you’re already using a big model, why not let it finish the job? What makes it any different from just using a bigger model from start to end?

But reranking is way cheaper than generating. The big model's just doing short scoring tasks, and when it picks better inputs, the small model can do a decent job answering.
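
A minimal sketch of that split, assuming hypothetical `big_llm_score` and `small_llm_generate` helpers (again, my wiring, not the paper's):

```python
# Big model only scores relevance; small model writes the (token-heavy) answer.

def rerank_then_answer(question, retriever, big_llm_score, small_llm_generate,
                       k=20, keep=3):
    # Retrieve broadly first.
    candidates = retriever.search(question, top_k=k)

    # Big model does cheap scoring: "how relevant is this chunk to the question?"
    scored = [(big_llm_score(question, c["text"]), c) for c in candidates]
    top = [c for _, c in sorted(scored, key=lambda pair: pair[0], reverse=True)[:keep]]

    # Small model handles generation with the better-picked context.
    context = "\n\n".join(c["text"] for c in top)
    return small_llm_generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```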

It’s a great cost/performance tradeoff. Cheap inference, better results.

TL;DR: What You Should Take Away

  • RAG+ improves reasoning by retrieving both knowledge and its application.

  • It’s modular, so you don’t have to rebuild your pipeline.

  • Re-rank RAG only works well on big models probably because small ones don’t follow instructions well.

  • GraphRAG and AFRAG are cool but either too costly or too fragile on small models.

  • Letting big models do reranking only is a smart hack: better results, cheaper cost.

  • Most importantly: don’t just throw examples at your model, you need to give it both the facts and how to use them.

Limitations? Yup, they exist

Unfortunately, nothing is perfect. These are the problems that persist:

  1. Building the application dataset is expensive

    Especially in low-resource domains. And automatic generation via LLMs? Still error-prone.

  2. Bad retrieval = bad matches

    If the system pulls the wrong knowledge, it might link it to the wrong example, which breaks the whole reasoning chain.

  3. RAG+ doesn't fix bad retrieval

    Continuing from the previous point: it depends entirely on what’s retrieved. So if your retriever’s garbage, your answer will still be garbage. It's just more logically structured garbage.

Final thoughts

RAG+ isn't a new revolution; it's just improving on existing tech. Yet what it does seems so obvious that I don't understand why we didn't do it before. It helps the model answer properly, look at examples, and think kinda like us. It can also be integrated into existing stacks, so there's no real harm in trying it out (provided you spend the time configuring the application-generation part).

I'd love to see how it's helping you guys out!

All of this came from a paper I read recently; I highly recommend checking it out for the full details: RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning
