Why Should You Care About RAG?
Imagine you work in the HR department of a company that has a 100-page PDF filled with employee policies. One day, an intern walks up to your desk and asks:
“How many work-from-home days are allowed for interns?”
You open the document, press Ctrl+F, and type “work-from-home.” But the PDF uses the term “remote work flexibility.” You scroll endlessly, read random sections, and still can’t find a clear answer. It’s frustrating.
Now imagine a smart chatbot that can read the entire PDF, understand the meaning, and say in 3 seconds:
“Interns are eligible for 5 remote working days per month.”
That’s the power of RAG: Retrieval-Augmented Generation. It gives real answers from real documents — not guesses.
Why Large Language Models Alone Aren’t Enough
LLMs like GPT-4 are powerful but have key limitations:
- They sometimes hallucinate — they make up answers that sound real but aren’t true.
- Their knowledge is frozen. For example, GPT-4's training data has a cutoff around 2023, so anything after that is unknown to it.
- They can’t access private documents like your PDFs or internal policies.
- They don’t search; they just generate responses from memory.
The Better Way: Use RAG
RAG (Retrieval-Augmented Generation) connects a smart language model to external documents. Instead of guessing, it retrieves the correct information and generates accurate responses.
So if the intern asks the same question again, the system will search the HR policy and respond:
“Interns are eligible for 5 remote working days per month.”
It’s fast, trustworthy, and grounded in real content.
Step 1: Prepare the Data (Ingestion)
Chunking
Your document is like a big cake. Chunking is slicing it into small parts (often around 256 or 512 tokens each) so it’s easier to search.
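Here is a minimal chunking sketch in Python. Word counts stand in for tokens, and the file name is just a placeholder:

```python
# Minimal chunking sketch: split the text into fixed-size pieces with a
# small overlap, so a sentence cut at a boundary still shows up whole in
# the next chunk. Word count stands in for token count here.
def chunk_text(text, size=256, overlap=32):
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

policy_text = open("hr_policy.txt").read()   # placeholder file name
chunks = chunk_text(policy_text)
print(len(chunks), "chunks created")
```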
Embedding
Each chunk is turned into a vector (a list of numbers) using models like:
- OpenAI Embeddings
- BERT (Bidirectional Encoder Representations from Transformers)
Why? Because computers understand numbers, not words. Embeddings help the machine capture the meaning behind the text.
Example: “holiday leave” and “paid vacation” are different phrases but mean the same thing. Embeddings can tell they’re related.
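Continuing the sketch above, here is how the chunks could be embedded with the sentence-transformers library. The model name is just one small, commonly used choice; OpenAI or BERT-based embeddings play the same role:

```python
# Turn each chunk into a dense vector. "all-MiniLM-L6-v2" is just one
# small, widely used embedding model; any embedding model works here.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(chunks)      # one 384-dimensional vector per chunk

print(vectors.shape)
```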
Storing in a Vector Database
These vectors go into special databases built for fast search:
- Chroma – beginner friendly and local
- Pinecone – cloud-based and scalable
- FAISS – open-source tool by Facebook for high-speed search
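As a sketch of the storage step with FAISS, continuing from the vectors above:

```python
# Store the chunk vectors in a FAISS index for fast nearest-neighbour search.
import faiss
import numpy as np

vectors = np.asarray(vectors, dtype="float32")   # FAISS expects float32
index = faiss.IndexFlatL2(vectors.shape[1])      # exact L2-distance index
index.add(vectors)

print(index.ntotal, "vectors stored")
```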
What Are Embeddings and Why Do They Matter?
Embeddings convert text into vectors so we can search and compare meaning — not just words.
Sparse Embeddings
- Tools: TF-IDF (Term Frequency–Inverse Document Frequency), BM25
- Fast, matches exact terms
- Doesn’t understand deeper meaning
Example: If you ask about “holiday” and the doc says “vacation,” sparse embeddings will miss it.
Dense Embeddings
- Tools: BERT, Sentence-BERT, OpenAI Embeddings
- Understands context and meaning
- Better match, even if exact words differ
Example: Ask about “vacation policy,” and the doc says “30 days paid leave.” Dense embeddings will match it.
Dense embeddings are ideal for RAG because meaning matters more than words.
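A tiny, self-contained comparison of the two. The exact dense score depends on the model, but because the two phrases share no words, the sparse score is guaranteed to be zero:

```python
# "holiday leave" and "paid vacation" share no words, so TF-IDF (sparse)
# scores them as completely unrelated, while a dense model still sees
# the connection in meaning.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

a, b = "holiday leave", "paid vacation"

tfidf = TfidfVectorizer().fit_transform([a, b])
print("sparse:", cosine_similarity(tfidf[0], tfidf[1])[0][0])   # 0.0

model = SentenceTransformer("all-MiniLM-L6-v2")
va, vb = model.encode([a, b])
print("dense:", cosine_similarity([va], [vb])[0][0])            # clearly above 0
```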
Step 2: Retrieval (Find the Right Chunks)
When a user asks something:
- The question is converted into a vector
- It is compared with all the document vectors
- The closest matches are selected
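Continuing the earlier sketch, retrieval at query time looks roughly like this:

```python
# Embed the question with the same model used during ingestion, then ask
# the FAISS index for the closest chunks.
question = "How many work-from-home days are allowed for interns?"
q_vec = embedder.encode([question]).astype("float32")

k = 3
distances, ids = index.search(q_vec, k)          # top-k nearest chunks
top_chunks = [chunks[int(i)] for i in ids[0]]
```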
Similarity Techniques
- Cosine Similarity: Measures the angle between vectors. Smaller angle = more similar.
- Euclidean Distance: Measures the distance between points. Closer = more similar.
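Both measures are a one-liner on plain NumPy vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means same direction (very similar); 0.0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # 0.0 means identical vectors; larger means less similar
    return np.linalg.norm(a - b)

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.5, 2.5])
print(cosine_similarity(a, b), euclidean_distance(a, b))
```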
Retrieval Methods
Standard Retrieval
Just picks the top-matching chunk and sends it to the model. Fast, but it might lack context.
Sentence-Window Retrieval
Picks the match and adds the surrounding sentences, so the model understands the context better.
Ensemble Retrieval
Tries multiple chunk sizes (128, 256, 512), combines the best chunks, and sorts them with a re-ranker.
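As a rough sketch, sentence-window retrieval can be as simple as including the neighbours of the best-matching chunk (building on the FAISS search above):

```python
# After finding the best-matching chunk, also include its neighbours so
# the LLM sees the surrounding context.
def with_window(chunks, best_index, window=1):
    start = max(0, best_index - window)
    end = min(len(chunks), best_index + window + 1)
    return chunks[start:end]

best_index = int(ids[0][0])                  # top hit from the search above
context = with_window(chunks, best_index)    # best chunk plus its neighbours
```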
Step 3: Re-Ranking
You might get 10 chunks back, but you only want to pass the best few (say 3-5) to the LLM. So we sort them by importance first.
Types of Re-Ranking:
- Lexical: Based on keywords (TF-IDF, BM25)
- Semantic: Based on meaning (BERT, Cohere)
- LTR (Learning to Rank): ML model trained to choose best
- Hybrid: Combines keyword and meaning-based methods
Think of re-ranking like a judge picking the best answers to pass to the LLM.
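Here is a semantic re-ranking sketch using a cross-encoder from sentence-transformers, building on the retrieval sketch above. The model name is one common public choice, not the only option:

```python
# A cross-encoder scores each (question, chunk) pair jointly, which is
# slower than comparing embeddings but usually more accurate.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(question, c) for c in top_chunks])

# Keep only the best few chunks for the LLM prompt.
ranked = sorted(zip(scores, top_chunks), key=lambda p: p[0], reverse=True)
best_chunks = [c for _, c in ranked[:3]]
```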
Problems You Might Face
Lost in the Middle
LLMs focus more on the start and end of the input — often skipping the middle.
Fix: Move key content to start/end, limit total chunks, and use better re-ranking.
Example: If the answer is in paragraph 3 of 5, reorder the chunk or split it to push the key info up.
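One simple way to do that reordering (a sketch, not the only strategy): alternate the ranked chunks between the front and the back of the prompt so the weakest ones land in the middle.

```python
# Place the highest-ranked chunks at the start and end of the prompt,
# pushing the weakest ones into the middle where they matter least.
def reorder_for_llm(ranked_chunks):
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):        # ranked best -> worst
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

prompt_chunks = reorder_for_llm(best_chunks)
```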
Wrong Retrieval
Sometimes irrelevant chunks get retrieved — leading to wrong answers.
Fix:
- Improve chunking (e.g., avoid breaking sentences)
- Use better embeddings (dense vs sparse)
- Add filters to improve search accuracy
Example: A policy question brings in finance data? You likely need to refine your vector store or chunk size.
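Here is a filtering sketch. The "department" metadata field is hypothetical; real vector databases like Chroma and Pinecone let you attach and filter on metadata directly:

```python
# Tag each chunk with metadata at ingestion time, then only keep search
# hits whose metadata matches the question's topic.
chunk_meta = [{"department": "HR"} for _ in chunks]   # set per chunk in practice

def filtered_search(question, department, k=3):
    q_vec = embedder.encode([question]).astype("float32")
    _, ids = index.search(q_vec, index.ntotal)        # rank every chunk
    hits = [int(i) for i in ids[0]
            if chunk_meta[int(i)]["department"] == department]
    return [chunks[i] for i in hits[:k]]

hr_chunks = filtered_search("remote work policy for interns", "HR")
```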
Fine-Tuning vs RAG
Fine-Tuning
You retrain the LLM to speak in a specific tone or style.
- Great for personalization or branding
- Expensive, slow, needs lots of data
Example: You fine-tune a model to sound like Shakespeare.
RAG
You don’t touch the model. Just add your documents, and the model uses them for answering.
- Easy to update
- No retraining needed
- Works out-of-the-box
Example: You upload HR policies. Now the chatbot answers HR questions instantly.
Start with RAG — fine-tune only if your use case demands personality or tone changes.
Common Tools and Full Forms
- RAG: Retrieval-Augmented Generation
- LLM: Large Language Model
- BERT: Bidirectional Encoder Representations from Transformers
- FAISS: Facebook AI Similarity Search
- TF-IDF: Term Frequency-Inverse Document Frequency
- LTR: Learning to Rank
- NLTK: Natural Language Toolkit
Final Thoughts
RAG is a game changer. It connects LLMs to real, updated knowledge — making AI assistants smarter and more trustworthy.
I’m currently preparing for software engineering interviews, and AI is everywhere. I thought, if I’m learning this, why not help others too?
That’s why I wrote this post: to make RAG simple and useful for anyone interested in AI.
I'll be posting more content around AI, tools, and interview prep. Stay connected!
5 Must-Know RAG Interview Questions
- What is Retrieval-Augmented Generation (RAG), and how is it different from traditional LLMs?
- What is the difference between sparse and dense embeddings? When should you use each?
- Explain the “Lost in the Middle” problem and how to handle it.
- How do cosine similarity and Euclidean distance help in finding relevant document chunks?
- When should you choose fine-tuning over RAG, and what trade-offs come with it?
🖊️ Written by Shaik Salma Aga