DEV Community

Gokul Kannan

Hello World of RAG - Day 1

As a beginner in understanding LLMs, when I heard the term RAG (Retrieval Augmented Generation), I assumed it was a technique used within LLMs. However, from this session I learned that RAG is all about using our own custom or private data along with an LLM to generate more relevant responses.

Before understanding RAG, we need clarity on what exactly these LLMs are.

What does a Model mean?

A model is, at its core, an equation. Let's take this one:
y = mx + c
This is the equation of a straight line.
If values of x and y are provided, the system tweaks the values of m and c to find the best fit.
Say x = 1 and y = 2. I could have m = 1 and c = 1, or m = 0 and c = 2, and so on. By seeing many such examples, the model picks up patterns. This process is called learning.
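To make the idea concrete, here is a minimal sketch of "learning" m and c for the straight-line model. The two data points are invented for illustration; real models use far more data and an iterative optimizer rather than solving directly.

```python
# Fit y = m*x + c by solving the two equations from two observed points.
# With only two points, the "best fit" is exact.
def fit_line(p1, p2):
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)  # slope learned from the data
    c = y1 - m * x1            # intercept learned from the data
    return m, c

m, c = fit_line((1, 2), (2, 4))
print(m, c)  # 2.0 0.0, so the learned model is y = 2x
```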

Parameters and Weights

Similarly, in an AI model the equation is far larger, with billions of parameters. The more complex the equation, the more patterns the model can learn, and so the relevance and accuracy improve. A model's predictions also depend on the data it was trained on.
This is why bigger models often perform better, and why AI companies like OpenAI (ChatGPT), Google (Gemini) and Anthropic (Claude) build models with billions of parameters that can learn complex relationships.

Along with these parameters, we have something called weights.
For example: y = m1x^2 + m2x^3
Here m1 and m2 are called weights. These come from the data: they are the values learned during training, and they act as the deciding factors.
For Example :
When a model learns about animals,
"Cat" gets one weight
"Dog" gets another
"Lion" gets another

Based on the weights, the relevance changes, and the model can prioritize the importance of one over the other.
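A toy sketch of that prioritization: the weight values below are invented for illustration; in a real model each word corresponds to many learned numbers, not a single one.

```python
# Made-up "weights" for a few animal words. In this toy view, a higher
# weight means the model treats that word as more relevant.
weights = {"cat": 0.8, "dog": 0.7, "lion": 0.2}

# The model can rank words by their learned weight.
ranked = sorted(weights, key=weights.get, reverse=True)
print(ranked)  # ['cat', 'dog', 'lion']
```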

What does an LLM do?

Now that we understand what a model is, what does an LLM actually do?
The answer: it just predicts the next word.

If you ask an AI a question, it does not understand it the way we do. Instead, it takes the question (the prompt) as input, predicts the next word, then feeds that predicted word back in as input to predict the one after. This repeats until the complete response is generated. That is why responses stream word by word instead of appearing all at once.
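The predict-and-feed-back loop can be sketched like this. `predict_next` here is a stand-in lookup table invented for illustration; a real LLM scores every word in its vocabulary using the whole context, not just the last word.

```python
# Sketch of autoregressive generation: predict one word, append it to the
# input, and repeat until there is nothing left to predict.
def predict_next(tokens):
    table = {"the": "cat", "cat": "sat", "sat": "down"}  # toy "model"
    return table.get(tokens[-1])

def generate(prompt, max_tokens=10):
    tokens = prompt.split()
    for _ in range(max_tokens):
        nxt = predict_next(tokens)
        if nxt is None:      # nothing more to predict: stop
            break
        tokens.append(nxt)   # the prediction becomes part of the next input
    return " ".join(tokens)

print(generate("the"))  # the cat sat down
```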

But how does it predict the next word?

It uses the weights the pretrained model learned from all the data it was trained on. What if we ask about something the model was never trained on? Will it say "I don't know"?
No, it just hallucinates.

For example, the model is trained only on:
Cats
Dogs
and if we ask about:
Lions

The model was never exposed to data related to "Lions".
Instead of saying:
“I don’t know”
the model confidently gives a wrong answer. This is called hallucination.

This is why RAG becomes necessary. We give the model context by providing our private data, so that when we ask anything related to that data, it uses it to generate the response instead of hallucinating from its pretrained knowledge alone.

Temperature

Temperature controls the creativity of the model.
It usually ranges from 0 to 1:

Low Temperature (0.1)

  • More factual
  • More stable
  • Less creative

Medium Temperature (0.5)

  • Balanced output

High Temperature (0.9)

  • More creative
  • More imaginative
  • Higher chance of hallucination

Temperature does not directly control truth. It controls randomness.
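Under the hood, temperature rescales the model's raw scores (logits) before the next word is sampled. Here is a minimal sketch with invented logits: low temperature sharpens the distribution toward the top word, high temperature spreads probability across more words.

```python
import math

# Convert raw scores into probabilities, scaled by temperature.
def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate words
print(softmax_with_temperature(logits, 0.1))  # nearly all mass on word 1
print(softmax_with_temperature(logits, 0.9))  # probabilities spread out
```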

LLM and SLM

We don't always need a bigger model that knows everything when we actually only need it for specific use cases. In that situation, a specialized model may be enough. This is where SLMs help.

SLM - Small Language Model
An SLM helps with specific use cases, for example: chatbots, domain-specific tasks, and voice assistants.
These models may have millions of parameters instead of billions.

An SLM is much cheaper and smaller than an LLM.

LLM - Large Language Model
It is a generalized model with knowledge from many different domains and billions of parameters. Examples: Claude, Gemini and ChatGPT.

Why do we need RAG?

All these LLMs share a few major limitations:

  1. Outdated knowledge - They may not know about recent events; they only know the data they were trained on.
  2. Hallucination - A consequence of the first limitation: when we ask about something they don't know, they hallucinate.
  3. No access to private data - They have no knowledge of data they cannot access, e.g. private business data, HR documents, finance documents, project reports, project management tool data, etc.

This is where RAG comes into the picture.

RAG - Retrieval Augmented Generation

The name is largely self-explanatory.

RAG typically involves three main steps:

  1. Retrieve – Relevant data is fetched from external sources like PDFs, databases, internal files, knowledge bases or documents based on the user’s query.

  2. Augment – The retrieved data is added to the prompt/context that is sent to the pre-trained LLM.
    (Important: we are not modifying or retraining the model itself, just giving it extra context.)

  3. Generate – The LLM uses both its pre-trained knowledge and the retrieved context to generate a more accurate and relevant response.

So instead of relying only on its pretrained data, the model looks up the retrieved private data and then generates the response. This way, RAG helps the LLM overcome the limitations mentioned above.
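The three steps above can be sketched end to end. Everything here is a stand-in invented for illustration: `documents` plays the role of a document store, `retrieve` does naive keyword matching instead of vector search, and `call_llm` is a placeholder for a real LLM API.

```python
# Minimal retrieve-augment-generate sketch (all pieces are stand-ins).
documents = [
    "Our HR policy allows 20 days of paid leave per year.",
    "Project Phoenix ships in Q3.",
]

def retrieve(query):
    # Naive keyword retrieval; real systems use vector similarity search.
    words = query.lower().split()
    return [d for d in documents if any(w in d.lower() for w in words)]

def augment(query, context):
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}"

def call_llm(prompt):
    # Placeholder for a real LLM API call.
    return "(the LLM would answer here, using the augmented prompt)"

query = "paid leave policy"
context = retrieve(query)         # 1. Retrieve
prompt = augment(query, context)  # 2. Augment
answer = call_llm(prompt)         # 3. Generate
print(prompt)
```

Note that the model itself is untouched; only the prompt changes.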

Where does this private data get stored?

The private data is stored inside a database known as a vector database. (A vector database is a concept, and there are many implementations of it.)

For example: private data like Azure DevOps board content, HR policy documents, Jira content, and internal business docs.
None of these are fed directly to the LLM. Instead, they are converted and stored intelligently.

How are these documents stored?
Documents are broken into smaller parts called chunks.
These chunks are always
sentence groups or paragraph chunks,
not individual words.
This is because meaning comes from context, not from isolated words.
These contextual chunks help RAG give more relevant responses.
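A minimal paragraph-level chunker might look like this. Real pipelines add refinements such as overlapping chunks and size limits, but the core idea is splitting on paragraph boundaries rather than words.

```python
# Split a document into paragraph chunks, so each chunk keeps enough
# context to be meaningful on its own.
def chunk_by_paragraph(text):
    return [p.strip() for p in text.split("\n\n") if p.strip()]

doc = """Cats are small domesticated felines.

Dogs are loyal companions kept worldwide."""

chunks = chunk_by_paragraph(doc)
print(len(chunks))  # 2
```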

What is a Vector?

A vector has Magnitude and Direction.

Each chunk is converted into a numerical vector.
Example:
Paragraphs about lions become:
P1 = [...700 dimensions]
P2 = [...700 dimensions]
P3 = [...700 dimensions]
Here P1, P2 and P3 are points in a graph, each defined by 700 dimensions.
For intuition: in a 2D graph we represent a point with an x and a y value, so a point P1 can be written as (x, y). In the same way, a point can be defined with any number of dimensions.
The system then measures the distance between the vectors to find the relevant information.

It checks which points are closest to P1, and finds P2 and P3 by measuring the distance.

Example :

  • Apple
  • Orange
  • Pear
  • Lemon
  • Doctor

Here all the fruit-related words sit close together, while the word "Doctor" sits farther away.
This is how relevance works.
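We can sketch this closeness with toy 3-dimensional vectors (real embeddings have hundreds of dimensions, and the numbers below are invented so that the fruit words cluster together):

```python
import math

# Made-up embeddings: fruit words point in a similar direction,
# "doctor" points elsewhere.
vectors = {
    "apple":  [0.9, 0.8, 0.1],
    "orange": [0.8, 0.9, 0.1],
    "doctor": [0.1, 0.1, 0.9],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(vectors["apple"], vectors["orange"]))  # small distance
print(euclidean(vectors["apple"], vectors["doctor"]))  # large distance
```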

How are relevant chunks found?

Measuring the distance between vectors involves different kinds of algorithms.
Examples:
ANN - Approximate Nearest Neighbors
KNN - K-Nearest Neighbors
These help quickly find the most relevant chunks.
The same idea is used in:

Spotify, Netflix recommendations
Amazon suggestions
YouTube feed
Social media recommendations
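The nearest-neighbor lookup above can be sketched as brute-force KNN over toy chunk embeddings. The chunk names and 2-dimensional vectors are invented; real vector databases use ANN indexes precisely to avoid comparing the query against every stored vector like this.

```python
import math

# Brute-force k-nearest-neighbors: compare the query against every
# stored vector and keep the k closest.
def knn(query_vec, stored, k=2):
    def dist(v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query_vec, v)))
    return sorted(stored, key=lambda item: dist(item[1]))[:k]

stored = [
    ("chunk about cats", [0.9, 0.1]),
    ("chunk about dogs", [0.8, 0.2]),
    ("chunk about finance", [0.1, 0.9]),
]

top = knn([0.85, 0.15], stored, k=2)
print([name for name, _ in top])  # the two animal chunks come back
```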

Summary

The flow of a RAG system is:

User asks a query (the prompt)

The system searches the vector DB and retrieves the related chunks of data

The retrieved chunks are added to the prompt as context and sent to the LLM

The LLM generates the answer using the retrieved relevant chunks

The user receives a better, context-aware response

LLM - predicts
Vector DB - stores your private data as vectors
RAG - provides the context by searching the vector DB for relevant chunks and sending those to the LLM
Result - more contextual responses are generated

RAG is a method where an LLM retrieves relevant information (often from pre-indexed data in a vector database) and uses it to generate a more accurate answer.

One simple analogy to make this clear:
If you join a new project, you may need a senior person's help to understand a particular application.
RAG can be that senior person, in the following way. Here the data means your application's documentation files and Jira/ADO data; this is private data.

LLM <--> YOUR DATA
|________|
↓ (combining these two)
RAG (this now acts as that senior person you can interact with)
