<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manav Sutar</title>
    <description>The latest articles on DEV Community by Manav Sutar (@manav_sutar_d86f7312465e6).</description>
    <link>https://dev.to/manav_sutar_d86f7312465e6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3230897%2Fa2a33634-e193-46d7-973c-a8e534318ce3.jpg</url>
      <title>DEV Community: Manav Sutar</title>
      <link>https://dev.to/manav_sutar_d86f7312465e6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manav_sutar_d86f7312465e6"/>
    <language>en</language>
    <item>
      <title>RAG Explained: How AI Systems Got Smarter by Learning to Look Things Up</title>
      <dc:creator>Manav Sutar</dc:creator>
      <pubDate>Wed, 29 Oct 2025 04:49:24 +0000</pubDate>
      <link>https://dev.to/manav_sutar_d86f7312465e6/rag-explained-how-ai-systems-got-smarter-by-learning-to-look-things-up-65k</link>
      <guid>https://dev.to/manav_sutar_d86f7312465e6/rag-explained-how-ai-systems-got-smarter-by-learning-to-look-things-up-65k</guid>
      <description>&lt;h2&gt;
  
  
  A practical breakdown of the research paper that changed how AI handles knowledge
&lt;/h2&gt;

&lt;p&gt;The Problem: AI's Memory Dilemma&lt;br&gt;
Imagine trying to answer questions about current events using only what you memorized in school years ago. That's essentially what traditional AI language models do—they rely entirely on knowledge baked into their parameters during training.&lt;/p&gt;

&lt;p&gt;This creates three major problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Outdated Information: Once trained, the model's knowledge is frozen in time&lt;/li&gt;
&lt;li&gt;Hallucinations: Models confidently generate false information when they don't know something&lt;/li&gt;
&lt;li&gt;No Citations: You can't verify where the information came from&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 2020 paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduced a solution that's now powering many modern AI systems: RAG (Retrieval-Augmented Generation).&lt;br&gt;
The Big Idea: Combining Two Types of Memory&lt;/p&gt;

&lt;p&gt;Think of RAG like a student taking an open-book exam instead of a closed-book one. The system combines:&lt;br&gt;
Parametric Memory (The Brain)&lt;/p&gt;

&lt;p&gt;A pre-trained language model (like BART)&lt;br&gt;
Stores general patterns and language understanding&lt;br&gt;
400M+ parameters of learned knowledge&lt;/p&gt;

&lt;p&gt;Non-Parametric Memory (The Library)&lt;/p&gt;

&lt;p&gt;A searchable database (like Wikipedia)&lt;br&gt;
21 million document chunks&lt;br&gt;
Can be updated without retraining&lt;/p&gt;

&lt;p&gt;The magic happens when these work together: the model can retrieve relevant information and use it to generate better, more factual responses.&lt;/p&gt;

&lt;p&gt;How RAG Actually Works: A Technical Walkthrough&lt;/p&gt;

&lt;p&gt;Step 1: Query Processing&lt;br&gt;
When you ask a question like "Who is the president of Peru?":&lt;br&gt;
User Query → Query Encoder (BERT) → Dense Vector Representation&lt;br&gt;
The query encoder transforms your question into a mathematical representation (a dense vector) that captures its semantic meaning.&lt;/p&gt;

&lt;p&gt;Step 2: Document Retrieval&lt;br&gt;
Query Vector → MIPS Search → Top-K Documents Retrieved&lt;br&gt;
The system uses Maximum Inner Product Search (MIPS) to find the most relevant documents by comparing the query vector with pre-computed document vectors. This happens blazingly fast—searching through 21 million documents in milliseconds.&lt;br&gt;
Key Innovation: Instead of keyword matching (like traditional search), this uses semantic similarity. "Who leads Peru?" and "President of Peru" would retrieve similar documents even with different words.&lt;/p&gt;
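&lt;p&gt;The retrieval step can be sketched in a few lines. This is a toy illustration with made-up 4-dimensional vectors, not the paper's BERT encoders or FAISS index:&lt;/p&gt;

```python
import numpy as np

# Toy stand-in for MIPS: score every document vector against the query
# vector by inner product, then keep the top-k. In the real system the
# vectors come from BERT encoders and the search runs over a FAISS index.
doc_vectors = np.array([
    [0.9, 0.1, 0.0, 0.0],   # passage about Peru's government
    [0.0, 0.8, 0.2, 0.0],   # passage about Peruvian cuisine
    [0.85, 0.2, 0.1, 0.0],  # passage about South American presidents
])
query_vector = np.array([1.0, 0.0, 0.1, 0.0])  # encoded "Who is the president of Peru?"

scores = doc_vectors @ query_vector      # one inner product per document
top_k = np.argsort(scores)[::-1][:2]     # indices of the 2 best passages
```

Libraries like FAISS implement the same operation with approximate search structures so it scales to millions of vectors.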

&lt;p&gt;Step 3: Generation with Context&lt;br&gt;
Two different approaches:&lt;br&gt;
RAG-Sequence: Uses the same retrieved documents for the entire answer&lt;br&gt;
P(answer|question) = Σ_doc P(doc|question) × P(answer|question, doc)&lt;br&gt;
RAG-Token: Can use different documents for different parts of the answer&lt;br&gt;
P(answer|question) = Π_token Σ_doc P(doc|question) × P(token|question, doc, previous_tokens)&lt;br&gt;
Think of RAG-Token like citing different sources for different claims in an essay, while RAG-Sequence is like writing an entire paragraph based on one source.&lt;/p&gt;
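&lt;p&gt;To make the two formulas concrete, here is a small numeric sketch; all the probabilities are invented for illustration:&lt;/p&gt;

```python
import numpy as np

# p_doc[z]             = P(doc z | question), from the retriever
# p_answer_given_doc[z] = P(full answer | question, doc z)      (RAG-Sequence)
# p_token_given_doc[z,i] = P(token i | question, doc z, prefix) (RAG-Token)
p_doc = np.array([0.7, 0.3])

# RAG-Sequence: marginalize over documents once, for the whole answer.
p_answer_given_doc = np.array([0.10, 0.40])
p_rag_sequence = np.sum(p_doc * p_answer_given_doc)  # Σ_doc P(doc|q) P(answer|q,doc)

# RAG-Token: marginalize over documents at every token, then multiply.
p_token_given_doc = np.array([
    [0.5, 0.2],   # doc 0's probability for token 1, token 2
    [0.1, 0.9],   # doc 1's probability for token 1, token 2
])
per_token = p_doc @ p_token_given_doc  # Σ_doc P(doc|q) P(token_i|q,doc,...)
p_rag_token = np.prod(per_token)       # Π_token Σ_doc ...
```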

&lt;p&gt;The Results: Where RAG Shines&lt;br&gt;
Open-Domain Question Answering&lt;br&gt;
RAG set new state-of-the-art results on multiple benchmarks:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Task&lt;/th&gt;&lt;th&gt;Previous Best&lt;/th&gt;&lt;th&gt;RAG-Sequence&lt;/th&gt;&lt;th&gt;Improvement&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Natural Questions&lt;/td&gt;&lt;td&gt;40.4%&lt;/td&gt;&lt;td&gt;44.5%&lt;/td&gt;&lt;td&gt;+10%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;TriviaQA&lt;/td&gt;&lt;td&gt;57.9%&lt;/td&gt;&lt;td&gt;68.0%&lt;/td&gt;&lt;td&gt;+17%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;WebQuestions&lt;/td&gt;&lt;td&gt;41.1%&lt;/td&gt;&lt;td&gt;45.2%&lt;/td&gt;&lt;td&gt;+10%&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Why this matters: RAG outperformed both pure retrieval systems and pure generation systems, showing the power of the hybrid approach.&lt;br&gt;
More Factual, Less Hallucination&lt;br&gt;
In human evaluations for Jeopardy question generation:&lt;/p&gt;

&lt;p&gt;42.7% of cases: RAG was more factual than BART&lt;br&gt;
7.1% of cases: BART was more factual than RAG&lt;br&gt;
RAG also produced more specific answers (37.4% vs. 16.8%)&lt;/p&gt;

&lt;p&gt;Example comparison:&lt;br&gt;
Question: Generate a Jeopardy clue for "The Divine Comedy"&lt;br&gt;
BART (wrong): "This epic poem by Dante is divided into 3 parts: the Inferno, the Purgatorio &amp;amp; the Purgatorio."&lt;br&gt;
RAG (correct): "This 14th-century work is divided into 3 sections: 'Inferno', 'Purgatorio' &amp;amp; 'Paradiso'"&lt;/p&gt;

&lt;p&gt;Key Technical Insights&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;End-to-End Training
The retriever and generator are trained jointly, but with a clever shortcut:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The document encoder stays frozen (computationally cheaper)&lt;br&gt;
Only the query encoder and generator get updated&lt;br&gt;
Training signal flows through the marginalization of retrieved documents&lt;/p&gt;
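&lt;p&gt;A minimal numeric sketch of that training signal, with illustrative numbers and numpy standing in for both encoders:&lt;/p&gt;

```python
import numpy as np

# Document vectors are precomputed and FROZEN; only the query encoder
# (which produces q) and the generator (which produces p_y_given_doc) train.
doc_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # frozen
q = np.array([0.9, 0.1])                                       # query encoder output

# Retrieval distribution P(doc | question) via softmax over inner products.
logits = doc_vectors @ q
p_doc = np.exp(logits) / np.exp(logits).sum()

# Generator likelihood of the gold answer under each retrieved document.
p_y_given_doc = np.array([0.30, 0.05, 0.20])

# The training signal: negative log of the MARGINAL likelihood. Because
# documents are summed out, gradients flow back through p_doc into the
# query encoder without any "which doc was right" supervision.
loss = -np.log(np.sum(p_doc * p_y_given_doc))
```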

&lt;ol start="2"&gt;
&lt;li&gt;Retrieval Quality Matters
Ablation studies showed:

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Retrieval Method&lt;/th&gt;&lt;th&gt;NQ Score&lt;/th&gt;&lt;th&gt;Difference&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;BM25 (keyword)&lt;/td&gt;&lt;td&gt;31.8%&lt;/td&gt;&lt;td&gt;Baseline&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Dense Retrieval (frozen)&lt;/td&gt;&lt;td&gt;41.2%&lt;/td&gt;&lt;td&gt;+30%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Dense Retrieval (trained)&lt;/td&gt;&lt;td&gt;44.0%&lt;/td&gt;&lt;td&gt;+39%&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

Insight: Learning to retrieve the right documents for your task is crucial. Generic retrieval doesn't cut it.&lt;/li&gt;
&lt;li&gt;Hot-Swapping Knowledge
One of RAG's coolest features: you can update its knowledge by swapping the document index.
Experiment: Testing world leaders who changed between 2016 and 2018&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;2016 index + 2016 leaders: 70% accuracy&lt;br&gt;
2018 index + 2018 leaders: 68% accuracy&lt;br&gt;
Mismatched: 4-12% accuracy&lt;/p&gt;

&lt;p&gt;This means you can update RAG's knowledge without expensive retraining!&lt;/p&gt;
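&lt;p&gt;The idea can be sketched with a toy lookup standing in for retrieve-then-generate; only the index object changes, never the model code:&lt;/p&gt;

```python
# Hot-swapping the non-parametric memory: the "model" stays identical and
# only the index it retrieves from is replaced. (Toy lookup for illustration.)
index_2016 = {"president of the USA": "Barack Obama"}
index_2018 = {"president of the USA": "Donald Trump"}

def answer(question, index):
    # Stand-in for retrieve-then-generate: retrieve the fact, return it.
    return index.get(question, "unknown")

q = "president of the USA"
old_answer = answer(q, index_2016)  # "Barack Obama"
new_answer = answer(q, index_2018)  # "Donald Trump", same model, fresh knowledge
```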

&lt;p&gt;Practical Implementation Guide&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Architecture Stack
Input Question
    ↓
[Query Encoder: BERT-base]
    ↓
[FAISS Index: 21M documents]
    ↓
[Top-K Retrieval: typically 5-10 docs]
    ↓
[Generator: BART-large]
    ↓
Output Answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Key Design Decisions&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Document Chunking&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Split Wikipedia into 100-word chunks&lt;br&gt;
Creates 21M searchable passages&lt;br&gt;
Balance between context and precision&lt;/p&gt;
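&lt;p&gt;A minimal chunker in the spirit of that preprocessing, using fixed word windows as the paper did for its 100-word Wikipedia passages:&lt;/p&gt;

```python
def chunk_words(text, chunk_size=100):
    # Split text into consecutive windows of at most chunk_size words.
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

passages = chunk_words("word " * 250)  # 250 words: chunks of 100, 100, 50
```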

&lt;ol start="2"&gt;
&lt;li&gt;Number of Retrieved Documents (K)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Training: K = 5 or 10&lt;br&gt;
More documents = better recall but slower&lt;br&gt;
RAG-Sequence benefits from more docs; RAG-Token peaks around K=10&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Decoding Strategy&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;RAG-Token: Standard beam search works&lt;br&gt;
RAG-Sequence: Needs special "thorough decoding" that marginalizes over documents&lt;/p&gt;

&lt;p&gt;Training Details&lt;/p&gt;

&lt;p&gt;Mixed precision (FP16) for efficiency&lt;br&gt;
8x NVIDIA V100 GPUs (32GB each)&lt;br&gt;
Document index: ~100GB CPU memory (compressed to 36GB)&lt;br&gt;
Framework: Originally Fairseq, now HuggingFace Transformers&lt;/p&gt;

&lt;p&gt;When RAG Works Best (and When It Doesn't)&lt;br&gt;
Excellent Performance:&lt;br&gt;
✅ Fact-based Q&amp;amp;A: Where there's a clear knowledge need&lt;br&gt;
✅ Verifiable claims: FEVER fact-checking within 4.3% of SOTA&lt;br&gt;
✅ Specific knowledge: Better than 11B parameter models with 15x fewer parameters&lt;/p&gt;

&lt;p&gt;Limitations Found:&lt;br&gt;
❌ Creative tasks: On story generation, retrieval sometimes "collapsed"&lt;br&gt;
❌ Implicit knowledge: Tasks not clearly requiring factual lookup&lt;br&gt;
❌ Long-form generation: Less informative gradients for retriever training&lt;/p&gt;

&lt;p&gt;The Bigger Picture: Why RAG Matters&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Efficiency Revolution&lt;br&gt;
RAG achieves better results than T5-11B (11 billion parameters) using only 626M trainable parameters. The secret? Offload knowledge storage to a retrievable index.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interpretability Win&lt;br&gt;
Unlike pure neural models, you can inspect which documents influenced the answer. This is huge for:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Debugging model behavior&lt;br&gt;
Building trust in AI systems&lt;br&gt;
Meeting regulatory requirements&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Knowledge Updatability
No need to retrain when facts change. Just update the document index. This is essential for:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;News and current events&lt;br&gt;
Medical information&lt;br&gt;
Legal databases&lt;/p&gt;

&lt;p&gt;Evolution and Modern Variants&lt;br&gt;
Since this paper, RAG has evolved significantly:&lt;br&gt;
Improvements:&lt;/p&gt;

&lt;p&gt;Better retrievers (ColBERT, ANCE)&lt;br&gt;
Hybrid search (dense + sparse)&lt;br&gt;
Multi-hop reasoning&lt;br&gt;
Query rewriting for better retrieval&lt;/p&gt;

&lt;p&gt;Modern Applications:&lt;/p&gt;

&lt;p&gt;ChatGPT plugins and web browsing&lt;br&gt;
Enterprise knowledge bases&lt;br&gt;
Customer support systems&lt;br&gt;
Legal and medical AI assistants&lt;/p&gt;

&lt;p&gt;Building Your Own RAG System: Quick Start&lt;br&gt;
Minimal Implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# Load pre-trained RAG (the dummy index keeps the download small for a quick test)
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

# Ask a question
input_text = "Who won the Nobel Prize in Physics in 2020?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For Production:&lt;/p&gt;



&lt;ol&gt;
&lt;li&gt;Build or obtain a document corpus&lt;/li&gt;
&lt;li&gt;Encode documents with a dense encoder&lt;/li&gt;
&lt;li&gt;Create a FAISS index for fast retrieval&lt;/li&gt;
&lt;li&gt;Fine-tune on your domain-specific data&lt;/li&gt;
&lt;li&gt;Monitor retrieval quality and update the index regularly&lt;/li&gt;
&lt;/ol&gt;
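&lt;p&gt;The encode-and-index steps can be sketched with numpy standing in for a trained encoder and a FAISS index. The encode() function here is a hypothetical bag-of-words placeholder, not a real dense encoder:&lt;/p&gt;

```python
import numpy as np

# Tiny fixed vocabulary so the toy encoder is deterministic.
VOCAB = ["president", "peru", "recipes", "ceviche", "history", "lima"]

def encode(text):
    # Hypothetical stand-in for a dense encoder: one-hot over VOCAB, normalized.
    words = text.lower().replace("'s", "").split()
    vec = np.array([float(w in words) for w in VOCAB])
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

corpus = ["the president of Peru", "recipes for ceviche", "history of Lima"]
index = np.stack([encode(doc) for doc in corpus])  # encode + index the corpus

def search(query, k=1):
    scores = index @ encode(query)                 # inner-product retrieval
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

results = search("who is Peru's president")
```

In production you would swap encode() for a trained bi-encoder (e.g. DPR) and the numpy matrix for a FAISS index, but the retrieval logic is the same.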




&lt;p&gt;Key Takeaways&lt;/p&gt;

&lt;p&gt;Hybrid is better: Combining parametric and non-parametric memory beats either approach alone&lt;br&gt;
Retrieval quality is critical: Learning task-specific retrieval significantly outperforms generic search&lt;br&gt;
Marginalization matters: Treating retrieval as a latent variable and marginalizing over documents enables end-to-end training&lt;br&gt;
Updatable knowledge: Hot-swapping document indices solves the "frozen knowledge" problem&lt;br&gt;
Efficient and interpretable: Achieves SOTA with fewer parameters while providing traceable sources&lt;/p&gt;

&lt;p&gt;Further Reading&lt;/p&gt;

&lt;p&gt;Original Paper: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020)&lt;br&gt;
Code: Available in HuggingFace Transformers&lt;br&gt;
Demo: Try it at huggingface.co/rag&lt;br&gt;
Related Work: REALM, DPR, ColBERT for retrieval improvements&lt;/p&gt;

&lt;p&gt;Have you implemented RAG in your projects? What challenges did you face? Share your experiences in the comments!&lt;/p&gt;

&lt;p&gt;About the Research: This paper from Facebook AI Research (now Meta AI) and University College London introduced a foundational technique now used across the industry. The authors include Patrick Lewis, Ethan Perez, and teams from leading AI institutions.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
