# From guessing to grounded answers
We've been asking AI the wrong questions.
Or rather, we've been asking the right questions to the wrong system.
Large Language Models are incredible at reasoning, connecting ideas, and explaining complex concepts. But there's something they're fundamentally not designed to do: remember specific facts with perfect accuracy.
For years, we tried to make LLMs better at memorization. Bigger models, more training data, longer context windows.
Then someone had a different idea: What if we stopped trying to make them remember everything and instead taught them how to look things up?
That shift from memory to retrieval is what RAG is all about.
## LLMs Don't Know. They Predict.
Here's the fundamental thing about Large Language Models: they don't look things up. They predict.
When you type a message, it gets broken into tokens (pieces of words), which flow through a neural network asking one question billions of times: "What comes next?"
This is called autoregressive generation - each new word makes sense because of all the previous ones. It's like autocomplete on steroids, predicting one token at a time based on patterns learned during training.
Prompt: "The weather today is"

What the model does: predict the most likely next word.

The weather today is → "sunny"
Run the same prompt twice? You may get a different answer. The model is making probabilistic choices, not retrieving stored facts.
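To make that concrete, here is a toy sketch of probabilistic next-token choice. The probabilities are invented for illustration; a real model computes them with a neural network over a vocabulary of tens of thousands of tokens.

```python
import random

# Invented probabilities, purely for illustration; a real LLM computes these
# with a neural network over its entire vocabulary, one token at a time.
next_token_probs = {
    "sunny": 0.40,
    "cloudy": 0.25,
    "rainy": 0.20,
    "cold": 0.15,
}

def sample_next_token(probs: dict) -> str:
    """Pick one candidate at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

prompt = "The weather today is"
for _ in range(3):
    print(prompt, sample_next_token(next_token_probs))

# Each run can print a different continuation -- the same reason the same
# prompt can produce different answers from a real model.
```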
## The Models You're Actually Using
Different LLMs have different strengths:
- ChatGPT (GPT-4): Best for conversations, coding, and following complex instructions
- Claude: Excels at long documents and detailed analysis
- Gemini: Strong with multimodal tasks (text, images, video)
- Llama: Open-source, customizable for specialized applications
They all work the same way: predicting tokens based on context. And they all share the same core limitation.
## The Limitation That Was Always There
LLMs can't look things up. When you ask about a specific product, the model doesn't search any files. It generates an answer from patterns in its training data, which is months out of date and has never seen your internal documents.
If it doesn't know something, it doesn't say "I don't know." It predicts what a reasonable answer might sound like.
The problem:
- Training data becomes outdated immediately
- No access to private or recent information
- No verification mechanism
- Confident responses regardless of accuracy
This works fine for creative tasks or explaining concepts. But for anything requiring factual accuracy? It's a dealbreaker.
## The Breakthrough: Stop Memorizing, Start Retrieving
The solution came from rethinking the problem: What if LLMs don't need to remember everything?
That's RAG - Retrieval-Augmented Generation.
The process is simple:
- User asks a question
- System searches a knowledge base
- Retrieved information gets added to the prompt
- LLM reads and responds
RAG = LLM + Retrieved Context
Just as people writing a report don't rely purely on memory (they gather sources first), RAG lets the AI read before responding. The LLM handles the reasoning; retrieval handles the facts.
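In code, that equation is little more than string assembly. A minimal sketch, where `retrieve` and `llm` are placeholders for whatever retriever and model you plug in:

```python
def rag_answer(question, retrieve, llm):
    """RAG in one line of logic: answer = llm(retrieved context + question)."""
    chunks = retrieve(question)                  # retrieval handles the facts
    context = "\n".join(f"- {c}" for c in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)                           # the LLM handles the reasoning
```

Everything in the next section is about what those two placeholders do in practice.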
## How RAG Works: The Complete Flow
Setup Phase (happens once):
- Ingestion: Load your documents (PDFs, web pages, databases)
Gather 600 help articles from a travel booking app, 150 PDF guides about cancellations and refunds, and fare rules from the internal system.
- Chunking: Break them into smaller pieces (usually a few paragraphs)
A long document explaining ticket cancellations is split into smaller sections—one for refunds, one for rescheduling, one for penalties.
- Embedding: Convert each chunk into a mathematical representation
A line like "Tickets cancelled within 24 hours are fully refundable" is turned into a format that captures its meaning.
- Storage: Save in a vector database optimized for similarity search
All sections are stored with tags like category="refunds", travel_type="flight", and last_updated="2024".
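Here is what the setup phase might look like in code. This sketch assumes the `chromadb` package purely as an example; any vector database with similar add-and-query calls follows the same pattern, and the chunks below are shortened stand-ins for the real help articles.

```python
# Setup phase sketch, assuming the `chromadb` package as one example
# of a vector database.
import chromadb

client = chromadb.Client()                       # in-memory store for the demo
collection = client.create_collection(name="travel_help")

# Chunking: in a real pipeline these come from splitting the 600 help
# articles and 150 PDF guides into paragraph-sized pieces.
chunks = [
    "Tickets cancelled within 24 hours are fully refundable.",
    "Bookings are auto-cancelled if airline confirmation is not received.",
    "Refunds for failed bookings are processed within 5-7 days.",
]
metadatas = [
    {"category": "refunds",  "travel_type": "flight", "last_updated": "2024"},
    {"category": "bookings", "travel_type": "flight", "last_updated": "2024"},
    {"category": "refunds",  "travel_type": "flight", "last_updated": "2024"},
]

# Embedding + storage: chromadb embeds each chunk with its default embedding
# model and indexes the vectors for similarity search.
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    metadatas=metadatas,
)
```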
Query Phase (every time someone asks):
User question: "My flight booking disappeared even though I paid for it"
- Convert the question into a mathematical form (an embedding), which roughly captures:
[booking_issue, payment_done, booking_missing, flight]
- Search for chunks with similar meanings
"booking not confirmed after payment"
"payment successful but ticket not issued"
"flight booking disappeared"
- Retrieve the top matches (usually 3-10 chunks)
Chunk 1:
"If payment is successful but ticket is not issued within 15 minutes,
the booking may remain in pending state."
Chunk 2:
"Bookings are auto-cancelled if airline confirmation is not received."
Chunk 3:
"Some banks show payment success even when airline rejects the transaction."
Chunk 4:
"Pending bookings disappear from user dashboard after 30 minutes."
Chunk 5:
"Refunds for failed bookings are processed within 5–7 days."
- Insert these chunks into the prompt sent to the LLM
Context:
- Payment success does not always mean ticket issued
- Pending bookings may disappear after 30 minutes
- Airline confirmation failure causes auto-cancellation
- Refund timeline: 5–7 days
Answer the user clearly.
- LLM reads and generates an answer based on retrieved context
Final answer:
“Your booking disappeared because the airline did not confirm the ticket after payment. The system auto-cancelled it, and your payment will be refunded within 5–7 days.”
The LLM never sees your entire document collection — just the most relevant pieces for each specific question.
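Continuing the `chromadb` sketch from the setup phase above, the query phase might look like this; the `llm(...)` call at the end is a placeholder for whichever model or API you use.

```python
# Query phase sketch, continuing the chromadb setup above.
question = "My flight booking disappeared even though I paid for it"

# Embed the question and pull back only the closest chunks (top 3 here).
results = collection.query(query_texts=[question], n_results=3)
top_chunks = results["documents"][0]

# Insert the retrieved chunks into the prompt sent to the LLM.
context = "\n".join(f"- {c}" for c in top_chunks)
prompt = (
    f"Context:\n{context}\n\n"
    f"Question: {question}\n"
    "Answer the user clearly, using only the context above."
)

# answer = llm(prompt)   # placeholder for whichever model or API you call
print(prompt)
```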
## Three Ways to Find Information
Retrieval isn't just "search" — there are three fundamentally different approaches:
### 1. Keyword Search: Exact Matching
Finds documents with specific words or phrases. Fast and predictable, but misses synonyms.
If the words don’t match, it pretends the information doesn’t exist.
Example:
Search: “pizza menu”
Finds: “Pizza Menu – Downtown Branch”
Misses: “Our Italian Dishes” or “What We Serve”
### 2. Semantic Search: Understanding Meaning
Uses mathematical representations (embeddings) to find conceptually similar content, even with different wording.
Search: “I’m hungry but don’t want to cook”
Finds: “Nearby restaurants,” “Food delivery options,” “Quick meals”
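Under the hood, "similar meaning" is usually cosine similarity between embedding vectors. A small sketch, assuming the `sentence-transformers` package and one of its common models:

```python
# Semantic search sketch. Assumes the `sentence-transformers` package;
# any embedding model works the same way.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # one common embedding model

query = "I'm hungry but don't want to cook"
docs = ["Nearby restaurants", "Food delivery options", "Pizza Menu - Downtown Branch"]

query_vec = model.encode(query, convert_to_tensor=True)
doc_vecs = model.encode(docs, convert_to_tensor=True)

# Cosine similarity: higher means closer in meaning, even when the
# documents share no words with the query.
scores = util.cos_sim(query_vec, doc_vecs)[0]
for doc, score in zip(docs, scores):
    print(f"{float(score):.2f}  {doc}")
```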
### 3. Metadata Filtering: Structured Criteria
Applies hard rules: metadata filtering narrows results using explicit labels such as category, date, or status.
Search: “Show my assignments”
Filters: subject = “AI” AND status = “pending” AND due_date = “this week”
## Why Use All Three (Hybrid Search)
Modern RAG systems combine these approaches:
- Semantic search finds conceptually relevant content
- Keyword search ensures exact matches aren't missed
- Metadata filtering applies necessary constraints
Each catches what the others miss.
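Here is a toy sketch of how the three can be blended. The scoring functions and the 40/60 weighting are illustrative choices, and `semantic_score` is a simple word-overlap stand-in for real embedding similarity (see the earlier sketch).

```python
# Toy hybrid-search sketch; the weights and scoring functions are illustrative.
def keyword_score(query: str, text: str) -> float:
    """Exact matching: fraction of query words that appear verbatim."""
    words = query.lower().split()
    return sum(w in text.lower() for w in words) / len(words)

def semantic_score(query: str, text: str) -> float:
    """Word-overlap stand-in for embedding cosine similarity (see above)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0

def hybrid_search(query: str, docs: list, filters: dict, top_k: int = 5) -> list:
    scored = []
    for doc in docs:
        # Metadata filtering: hard constraints are applied first.
        if any(doc["meta"].get(k) != v for k, v in filters.items()):
            continue
        # Blend keyword and semantic relevance (the weights are a tuning choice).
        score = (0.4 * keyword_score(query, doc["text"])
                 + 0.6 * semantic_score(query, doc["text"]))
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

docs = [
    {"text": "Pizza Menu - Downtown Branch",
     "meta": {"category": "menu", "status": "current"}},
    {"text": "Our Italian Dishes and what we serve",
     "meta": {"category": "menu", "status": "current"}},
    {"text": "Outdated menu from 2021",
     "meta": {"category": "menu", "status": "archived"}},
]
print(hybrid_search("pizza menu", docs, filters={"status": "current"}))
```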
## Real-World Example
Scenario: Food Ordering App Help Chatbot
Setup:
Knowledge base:
- 300 FAQs (orders, payments, delivery)
- 150 issue guides (refunds, late delivery, app bugs)
Process:
- Break articles into small pieces
- Turn them into searchable representations
- Store them so the system can quickly look them up
When a user asks: "Why does my food order get cancelled every time I add desserts?"
The system:
- Converts the question into a form it can compare with stored information
- Looks for related content using:
  - Meaning (order issues, add-ons, desserts)
  - Keywords (“cancelled,” “desserts,” “order”)
- Applies filters:
  - category = “order problems”
  - status = “current”
- Picks the top 5 most relevant explanations and adds them to the prompt
Result: LLM response is based on actual app rules, not guesswork.
Before RAG: "Please try restarting the app or placing the order again." (generic)
With RAG: "Orders with desserts get cancelled if the restaurant is marked ‘no cold storage’. Try removing desserts or choosing a different outlet." (specific, accurate)
## Why This Changes Everything
RAG enables use cases that weren't reliably possible before:
- Internal company AI: Answers grounded in your specific docs, policies, and codebase
- Medical assistance: Reference current treatment guidelines and research
- Legal research: Search case law with specific citations
- Customer support: Know exact product features and current policies
- Research tools: Find and connect recent publications
All of these need one thing: answers grounded in specific, current, trusted information.
## The Core Insight
Large Language Models are reasoning engines that happen to be trained on data. We mistook the knowledge baked into that training data for a feature when it was really a limitation.
RAG separates reasoning from knowledge:
- LLM: Understanding context, connecting ideas, generating responses
- Retrieval: Storing information, keeping it current, finding relevant pieces
This division of labor makes AI systems reliable.
The future isn't about models that know everything. It's about systems that know how to find what they need, when they need it, from trusted sources.
That's what RAG represents. And it's just the beginning.
