Shravya K

How RAG Changed the Way We Use Large Language Models

From guessing to grounded answers

We've been asking AI the wrong questions.

Or rather, we've been asking the right questions to the wrong system.

Large Language Models are incredible at reasoning, connecting ideas, and explaining complex concepts. But there's something they're fundamentally not designed to do: remember specific facts with perfect accuracy.

For years, we tried to make LLMs better at memorization. Bigger models, more training data, longer context windows.

Then someone had a different idea: What if we stopped trying to make them remember everything and instead taught them how to look things up?

That shift from memory to retrieval is what RAG is all about.

LLMs Don't Know. They Predict.

Here's the fundamental thing about Large Language Models: they don't look things up. They predict.

When you type a message, it gets broken into tokens (pieces of words), which flow through a neural network asking one question billions of times: "What comes next?"

This is called autoregressive generation - each new word makes sense because of all the previous ones. It's like autocomplete on steroids, predicting one token at a time based on patterns learned during training.

Prompt:

The weather today is

What the model does: it predicts the most likely next word.

The weather today is → "sunny"

Run the same prompt twice? You'll get different answers. The model is making probabilistic choices, not retrieving stored facts.
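
To make that concrete, here is a toy sketch in Python of sampling the next token. The candidate words and their probabilities are invented for illustration; a real model scores its entire vocabulary at every step.

```python
import random

# Toy next-token distribution; the numbers are made up for illustration.
# A real model computes probabilities over tens of thousands of tokens
# at every step.
next_token_probs = {
    "sunny": 0.45,
    "cloudy": 0.25,
    "cold": 0.15,
    "beautiful": 0.10,
    "unpredictable": 0.05,
}

def predict_next(probs: dict) -> str:
    """Sample one token according to its probability (not a lookup)."""
    tokens = list(probs)
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

prompt = "The weather today is"
for _ in range(3):
    print(prompt, predict_next(next_token_probs))
# Different runs can print different continuations: the model is
# sampling from a probability distribution, not retrieving a stored fact.
```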

The Models You're Actually Using

Different LLMs have different strengths:

  • ChatGPT (GPT-4): Best for conversations, coding, and following complex instructions
  • Claude: Excels at long documents and detailed analysis
  • Gemini: Strong with multimodal tasks (text, images, video)
  • Llama: Open-source, customizable for specialized applications

They all work the same way: predicting tokens based on context. And they all share the same core limitation.

The Limitation That Was Always There

LLMs can't look things up. When you ask about a specific product, the model doesn't search any files. It generates an answer from patterns in its training data, and that data is months old and has never seen your internal documents.

If it doesn't know something, it doesn't say "I don't know." It predicts what a reasonable answer might sound like.

The problem:

  • Training data becomes outdated immediately
  • No access to private or recent information
  • No verification mechanism
  • Confident responses regardless of accuracy

This works fine for creative tasks or explaining concepts. But for anything requiring factual accuracy? It's a dealbreaker.

The Breakthrough: Stop Memorizing, Start Retrieving

The solution came from rethinking the problem: What if LLMs don't need to remember everything?

That's RAG - Retrieval-Augmented Generation.

The process is simple:

  1. User asks a question
  2. System searches a knowledge base
  3. Retrieved information gets added to the prompt
  4. LLM reads and responds

RAG = LLM + Retrieved Context

Just as humans don't rely purely on memory when writing reports (we gather sources first), RAG lets the AI read before responding. The LLM handles reasoning; retrieval handles facts.
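
In code, that division of labor is really just a prompt-assembly pattern. Here's a minimal sketch; the retrieve and call_llm callables are placeholders for whatever search backend and model API you plug in, not any particular library's interface.

```python
from typing import Callable

def answer_with_rag(
    question: str,
    retrieve: Callable[[str, int], list],
    call_llm: Callable[[str], str],
    top_k: int = 5,
) -> str:
    """RAG = LLM + retrieved context, expressed as prompt assembly."""
    # 1. Search a knowledge base for passages related to the question.
    passages = retrieve(question, top_k)

    # 2. Add the retrieved text to the prompt.
    context = "\n".join(f"- {p}" for p in passages)
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above. "
        "If the context does not contain the answer, say so."
    )

    # 3. Let the LLM read and respond.
    return call_llm(prompt)
```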

How RAG Works: The Complete Flow

Setup Phase (happens once):

  1. Ingestion: Load your documents (PDFs, web pages, databases).
     Example: gather 600 help articles from a travel booking app, 150 PDF guides about cancellations and refunds, and fare rules from the internal system.

  2. Chunking: Break them into smaller pieces (usually a few paragraphs).
     Example: a long document explaining ticket cancellations is split into smaller sections: one for refunds, one for rescheduling, one for penalties.

  3. Embedding: Convert each chunk into a mathematical representation.
     Example: a line like "Tickets cancelled within 24 hours are fully refundable" is turned into a numeric vector that captures its meaning.

  4. Storage: Save in a vector database optimized for similarity search.
     Example: all sections are stored with tags like category="refunds", travel_type="flight", and last_updated="2024".
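
Here's one way those four setup steps can look in Python. This is a minimal, dependency-light sketch: chunk_text(), the hash-based embed(), and the in-memory vector_store are stand-ins invented for illustration; a real system would use an embedding model and a proper vector database.

```python
import hashlib
import numpy as np

def chunk_text(text: str, max_words: int = 120) -> list:
    """Chunking: split a long document into smaller pieces."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text: str, dims: int = 256) -> np.ndarray:
    """Embedding (toy stand-in): map text to a fixed-size vector.
    A real system would call an embedding model instead of hashing words."""
    vec = np.zeros(dims)
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dims
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Storage: an in-memory list of records stands in for a vector database.
vector_store: list = []

def ingest(doc_text: str, metadata: dict) -> None:
    """Ingestion -> chunking -> embedding -> storage, for one document."""
    for chunk in chunk_text(doc_text):
        vector_store.append({
            "text": chunk,
            "embedding": embed(chunk),
            "metadata": metadata,
        })

# Hypothetical document from the travel-app scenario above:
ingest(
    "Tickets cancelled within 24 hours are fully refundable. "
    "Refunds for failed bookings are processed within 5-7 days.",
    {"category": "refunds", "travel_type": "flight", "last_updated": "2024"},
)
print(len(vector_store), "chunks stored")
```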

Query Phase (every time someone asks):

User question: "My flight booking disappeared even though I paid for it"

  1. Convert the question into a mathematical form
     [booking_issue, payment_done, booking_missing, flight]

  2. Search for chunks with similar meanings
     "booking not confirmed after payment"
     "payment successful but ticket not issued"
     "flight booking disappeared"

  3. Retrieve the top matches (usually 3-10 chunks)
     Chunk 1: "If payment is successful but ticket is not issued within 15 minutes, the booking may remain in pending state."
     Chunk 2: "Bookings are auto-cancelled if airline confirmation is not received."
     Chunk 3: "Some banks show payment success even when airline rejects the transaction."
     Chunk 4: "Pending bookings disappear from user dashboard after 30 minutes."
     Chunk 5: "Refunds for failed bookings are processed within 5–7 days."

  4. Insert these chunks into the prompt sent to the LLM
     Context:
     - Payment success does not always mean ticket issued
     - Pending bookings may disappear after 30 minutes
     - Airline confirmation failure causes auto-cancellation
     - Refund timeline: 5–7 days

     Answer the user clearly.

  5. LLM reads and generates an answer based on the retrieved context
     Final answer: "Your booking disappeared because the airline did not confirm the ticket after payment. The system auto-cancelled it, and your payment will be refunded within 5–7 days."

The LLM never sees your entire document collection — just the most relevant pieces for each specific question.
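
Continuing the same toy store from the setup sketch, the query phase might look like this. Again, retrieve() and build_prompt() are illustrative names, not a specific library's API.

```python
import numpy as np

def retrieve(question: str, top_k: int = 5) -> list:
    """Embed the question and return the most similar stored chunks."""
    q_vec = embed(question)  # same toy embed() as in the setup sketch
    scored = [
        (float(np.dot(q_vec, record["embedding"])), record)
        for record in vector_store
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:top_k]]

def build_prompt(question: str, chunks: list) -> str:
    """Insert the retrieved chunks into the prompt sent to the LLM."""
    context = "\n".join(f"- {c['text']}" for c in chunks)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer the user clearly, using only the context above."
    )

question = "My flight booking disappeared even though I paid for it"
prompt = build_prompt(question, retrieve(question))
# The prompt now carries only the handful of relevant chunks,
# never the whole document collection.
```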

Three Ways to Find Information

Retrieval isn't just "search" — there are three fundamentally different approaches:

1. Keyword Search: Exact Matching

Finds documents with specific words or phrases. Fast and predictable, but misses synonyms.
If the words don’t match, it pretends the information doesn’t exist.

Example:
Search: “pizza menu”
 Finds: “Pizza Menu – Downtown Branch”
 Misses: “Our Italian Dishes” or “What We Serve”

2. Semantic Search: Understanding Meaning

Uses mathematical representations (embeddings) to find conceptually similar content, even with different wording.

Search: “I’m hungry but don’t want to cook”
 Finds: “Nearby restaurants,” “Food delivery options,” “Quick meals”


3. Metadata Filtering: Structured Criteria

Applies hard rules: narrows results using explicit labels and constraints (category, status, dates) rather than text similarity.

Search: “Show my assignments”
 Filters: subject = “AI” AND status = “pending” AND due_date = “this week”

Why Use All Three (Hybrid Search)

Modern RAG systems combine these approaches:

  • Semantic search finds conceptually relevant content
  • Keyword search ensures exact matches aren't missed
  • Metadata filtering applies necessary constraints

Each catches what the others miss.
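
Sticking with the toy store from the earlier sketches, a hybrid retriever can be approximated by applying metadata filters first, then blending a keyword-overlap score with a semantic similarity score. The 0.5/0.5 weighting below is arbitrary, purely for illustration.

```python
import numpy as np

def keyword_score(query: str, text: str) -> float:
    """Keyword search: fraction of query words found verbatim in the text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0

def hybrid_search(query: str, filters: dict, top_k: int = 5) -> list:
    q_vec = embed(query)  # same toy embed() as in the setup sketch
    results = []
    for record in vector_store:
        # Metadata filtering: hard constraints are applied first.
        if any(record["metadata"].get(key) != value for key, value in filters.items()):
            continue
        # Semantic search: similarity between normalized embeddings.
        semantic = float(np.dot(q_vec, record["embedding"]))
        # Keyword search: exact word overlap with the query.
        keyword = keyword_score(query, record["text"])
        # Blend the two scores; the weights are arbitrary for illustration.
        results.append((0.5 * semantic + 0.5 * keyword, record))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in results[:top_k]]

matches = hybrid_search("refund for cancelled flight", filters={"category": "refunds"})
```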

Real-World Example

Scenario: Food Ordering App Help Chatbot

Setup:

Knowledge base:

  • 300 FAQs (orders, payments, delivery)
  • 150 issue guides (refunds, late delivery, app bugs)

Process:

  • Break articles into small pieces
  • Turn them into searchable representations
  • Store them so the system can quickly look them up

When user asks: "Why does my food order get cancelled every time I add desserts?"

The system converts the question into a form it can compare with stored information.

It then looks for related content using:

  • Meaning (order issues, add-ons, desserts)
  • Keywords (“cancelled,” “desserts,” “order”)

Applies filters:

  • category = “order problems”
  • status = “current”

Picks the top 5 most relevant explanations and adds them to the prompt

Result: LLM response is based on actual app rules, not guesswork.

Before RAG: "Please try restarting the app or placing the order again." (generic)

With RAG: "Orders with desserts get cancelled if the restaurant is marked ‘no cold storage’. Try removing desserts or choosing a different outlet." (specific, accurate)

Why This Changes Everything

RAG enables use cases that weren't reliably possible before:

  • Internal company AI: Train on your specific docs, policies, codebase
  • Medical assistance: Reference current treatment guidelines and research
  • Legal research: Search case law with specific citations
  • Customer support: Know exact product features and current policies
  • Research tools: Find and connect recent publications

All of these need one thing: answers grounded in specific, current, trusted information.

The Core Insight

Large Language Models are reasoning engines that happen to be trained on data. We mistook that training data for a feature when it was really a limitation.

RAG separates reasoning from knowledge:

  • LLM: Understanding context, connecting ideas, generating responses
  • Retrieval: Storing information, keeping it current, finding relevant pieces

This division of labor is what makes RAG-based AI systems reliable.

The future isn't about models that know everything. It's about systems that know how to find what they need, when they need it, from trusted sources.

That's what RAG represents. And it's just the beginning.
