Vaishali

Posted on Apr 8

How AI Apps Actually Use LLMs: Introducing RAG

#ai #webdev #rag #llm

If you’ve been exploring AI applications, you’ve probably come across the term RAG.

It appears everywhere: in chatbots, AI assistants, internal knowledge tools, and documentation search.

But before understanding how it works, it helps to understand why it exists in the first place.

Large language models are powerful. However, when used on their own, they have a few fundamental limitations.

⚠️ Problems With LLMs On Their Own

LLMs are impressive — until they start failing in real-world scenarios.

1. Outdated Knowledge

Every model has a training cutoff date.

If asked about something that happened after that point, the model may:

say it doesn't know
generate an answer that sounds plausible but is incorrect

2. Hallucinations

LLMs do not know things in the traditional sense.

They generate text by predicting what is most likely to come next based on patterns in training data.

When the correct information is missing, the model may still produce a confident-sounding but incorrect answer.

That behavior is known as a hallucination.

3. No Access to Private Data

Most models are trained largely on publicly available data.

That means internal information such as:

company documentation
product knowledge bases
internal policies
customer data

is completely unknown to the model.

It is possible to paste documents into the prompt, but this approach has clear limitations:

context window limits
increasing token cost
poor scalability

These constraints make it difficult to build reliable AI systems using only an LLM.
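To make the context window limit concrete, here is a rough sketch that checks whether a set of documents would fit into a single prompt. The 4-characters-per-token ratio, the 128,000-token window, and the document sizes are all assumptions chosen for illustration; a real tokenizer and a real model will give different numbers.

```python
# Rough heuristic (assumption): about 4 characters per token for English text.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; a real tokenizer will give different numbers."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def fits_in_context(documents: list[str], question: str, context_window: int) -> bool:
    """Return True if all documents plus the question fit in the window."""
    total = estimate_tokens(question) + sum(estimate_tokens(d) for d in documents)
    return total <= context_window


# Hypothetical numbers: 50 documents of 40,000 characters each,
# and an assumed context window of 128,000 tokens.
documents = ["x" * 40_000 for _ in range(50)]
question = "How do I reset my account password?"

print(fits_in_context(documents, question, context_window=128_000))
```

With these made-up numbers the documents alone come to roughly 500,000 estimated tokens, so pasting everything into the prompt is not an option.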

That is where RAG comes in.

🧩 What RAG Actually Is

RAG stands for Retrieval-Augmented Generation.

It is an architectural approach where relevant information is retrieved first and then provided to the model before it generates a response.

Instead of relying only on what the model remembers from training, the system fetches external knowledge at runtime.

No retraining is required.
No fine-tuning is necessary.

The model simply receives the right context at the right moment.

The goal is to ground the model’s response in data that is relevant and comes from a source you trust.

⚙️ The Basic Components of a RAG System

Although production systems can become complex, the core pipeline is relatively simple.

Most RAG systems include these stages:

Data Intake: Documents or knowledge sources are collected.
Chunking: Large documents are broken into smaller, manageable pieces.
Embeddings: Each chunk is converted into a vector representation.
Vector Database: These vectors are stored in a database designed for similarity search.
Retrieval: Relevant chunks are retrieved based on the user’s query.
Generation: The retrieved context is sent to the LLM to generate the final response.
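The first four stages (intake, chunking, embeddings, storage) can be sketched in a few lines. This is a minimal illustration, not a production setup: `toy_embed` is a hypothetical stand-in that hashes words into a fixed-size vector, so it only captures word overlap rather than meaning, and a plain Python list plays the role of the vector database. The chunk size and overlap values are arbitrary.

```python
import hashlib
import math


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Stand-in for an embedding model: hashed bag of words, L2-normalized.

    A real system would call an embedding model here instead.
    """
    vector = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vector[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def build_index(documents: dict[str, str]) -> list[dict]:
    """Chunk and embed every document; the returned list plays the vector DB."""
    index = []
    for doc_id, text in documents.items():
        for i, chunk in enumerate(chunk_text(text)):
            index.append({"doc_id": doc_id, "chunk_id": i,
                          "text": chunk, "vector": toy_embed(chunk)})
    return index


# Data intake: a single made-up 100-word document.
docs = {"handbook": " ".join(f"word{i}" for i in range(100))}
index = build_index(docs)
print(len(index), "chunks indexed")
```

Overlapping chunks are a common choice because a sentence that straddles a chunk boundary still appears whole in at least one chunk.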

🔄 How RAG Actually Flows

The diagram below illustrates the typical RAG pipeline.

The process typically works as follows.

1. User Query: A user asks a question.

2. Query Embedding: The query is converted into a vector representation using an embedding model. This vector represents the semantic meaning of the query.

3. Vector Search: The vector is sent to a vector database that stores embeddings of all document chunks.
The database finds the chunks that are most similar in meaning to the query.

4. Retrieval: Only the most relevant pieces of text are returned, usually the top few chunks ranked by similarity score. The rest of the document is left out.

5. Augmentation: The retrieved text is added to the prompt. The prompt now contains:

the user’s question
the retrieved context

6. Generation: The augmented prompt is sent to the LLM.
The model generates a response based on the retrieved information, not just its training data.
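The six steps above can be sketched as follows. The vectors are hand-written 3-dimensional stand-ins for real embeddings, the stored texts are invented, and the function names (`top_k`, `build_prompt`) are hypothetical; the final LLM call is left out because it depends on the provider you use.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], store: list[dict], k: int = 2) -> list[dict]:
    """Return the k stored chunks most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Augmentation: combine the retrieved context with the user's question."""
    context = "\n\n".join(f"[{i + 1}] {c['text']}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Invented chunks with hand-written vectors standing in for real embeddings.
store = [
    {"text": "To reset your password, open Settings and choose Reset Password.",
     "vector": [0.9, 0.1, 0.0]},
    {"text": "Invoices are emailed on the first day of each month.",
     "vector": [0.0, 0.2, 0.9]},
    {"text": "Locked accounts can be unlocked by an administrator.",
     "vector": [0.7, 0.6, 0.1]},
]

question = "How do I reset my account password?"  # step 1
query_vector = [1.0, 0.2, 0.0]                    # step 2: pretend embedding output
retrieved = top_k(query_vector, store, k=2)       # steps 3 and 4
prompt = build_prompt(question, retrieved)        # step 5
print(prompt)                                     # step 6 would send this to the LLM
```

A real vector database avoids scoring every stored vector on each query by using an approximate nearest-neighbor index, but the ranking idea is the same.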

📚 A Simple Example

Consider a chatbot built for company documentation.

Without RAG:

User asks:

"How do I reset my account password?"

The model might generate a generic answer based only on training data.

With RAG:

The system searches the documentation
The section describing password reset is retrieved
That section is added to the prompt
The model generates an answer grounded in the documentation

The response becomes more accurate and reliable.
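The difference between the two cases shows up in the prompt the model receives. The sketch below uses invented documentation sections and, to stay short, picks the section by simple word overlap instead of embeddings; all function names are hypothetical.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_section(question: str, sections: dict[str, str]) -> str:
    """Return the title of the section sharing the most words with the question.

    Word overlap is a simplification; a RAG system would normally use
    embedding similarity here.
    """
    q = words(question)
    return max(sections, key=lambda title: len(q & words(title + " " + sections[title])))


def prompt_without_rag(question: str) -> str:
    """The model sees only the question and must rely on training data."""
    return f"Question: {question}"


def prompt_with_rag(question: str, sections: dict[str, str]) -> str:
    """The model sees the question plus the retrieved documentation section."""
    title = retrieve_section(question, sections)
    return (
        "Use the documentation excerpt to answer.\n\n"
        f"Documentation ({title}):\n{sections[title]}\n\n"
        f"Question: {question}"
    )


# Invented documentation, for illustration only.
sections = {
    "Password reset": "Go to the login page, click 'Forgot password', and follow "
                      "the emailed link to reset your account password.",
    "Billing": "Invoices are generated monthly and sent to the account owner.",
    "API keys": "Create API keys from the developer dashboard.",
}
question = "How do I reset my account password?"

print(prompt_without_rag(question))
print("---")
print(prompt_with_rag(question, sections))
```

Only the second prompt contains the product-specific steps, which is what lets the model answer from the documentation rather than from general patterns.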

📈 Advantages of RAG

RAG solves several practical challenges when building AI systems.

Reduced Hallucinations: Because the model receives real supporting information, it is less likely to hallucinate, although retrieval does not eliminate the problem entirely.
Better Retrieval in Large Documents: Finding one relevant paragraph inside a 2000-page document can be difficult for a model working alone.
RAG retrieves only the relevant chunks, reducing noise and improving accuracy.
Efficient Use of Data: Pasting large datasets into every prompt is expensive, because the same tokens are sent again on each request.
RAG processes documents once during indexing, and only the relevant pieces are retrieved when needed.
Each prompt stays small, so the token cost per query no longer grows with the size of the knowledge base.
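A back-of-the-envelope comparison illustrates the efficiency point. All numbers below are hypothetical, and the sketch simply counts tokens processed; it ignores that embedding and generation are usually priced differently, and that the query itself is also embedded.

```python
def tokens_paste_everything(corpus_tokens: int, question_tokens: int, queries: int) -> int:
    """Every query resends the whole corpus in the prompt."""
    return queries * (corpus_tokens + question_tokens)


def tokens_with_rag(corpus_tokens: int, question_tokens: int, queries: int,
                    k: int, chunk_tokens: int) -> int:
    """The corpus is embedded once; each query sends only k retrieved chunks."""
    indexing = corpus_tokens  # one-time pass through the embedding model
    per_query = question_tokens + k * chunk_tokens
    return indexing + queries * per_query


# Hypothetical workload: a 500,000-token knowledge base and 1,000 questions,
# retrieving 4 chunks of about 300 tokens for each question.
paste = tokens_paste_everything(corpus_tokens=500_000, question_tokens=20, queries=1_000)
rag = tokens_with_rag(corpus_tokens=500_000, question_tokens=20, queries=1_000,
                      k=4, chunk_tokens=300)

print(f"paste everything: {paste:,} tokens")  # 500,020,000
print(f"with RAG:         {rag:,} tokens")    # 1,720,000
```

Under these assumptions the indexing pass is paid once, while the per-query prompt depends only on `k` and the chunk size.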