DEV Community: Indumathi R

Prompting styles - Advanced

Indumathi R — Mon, 08 Jun 2026 16:21:28 +0000

In the previous post, we saw some basic prompting styles. In this post, we shall see some advanced prompting styles

1.Chain of thoughts(COT)
To enable this, we can provide prompts like below

Think step by step

Explain the intermediary steps

Break complex part into simpler terms

LLM will think step by step for the user query and provide output. During this step by step thinking process, output of previous step will be fed as input to the next step.

Lets see a prompt without applying COT and compare it with result after applying COT

Without COT

With COT

As you can see instead of just getting only the answer, prompt with COT provides intermediatory steps for arriving the answer. In essence, COT enables provides more reasoning and transparency.

2.React Framework (Reason + Action)
Let's say a model is trained on 2025 data. If we ask current gold price, it will provide answer based on 2025. It won't know the current price. How can we make LLM to provide up to date information ? For this, we can connect tools with LLM. Tools are nothing but a function or api or mcp.

To fetch the current gold price, LLM will search if it has any tool for this. If an appropriate tool is found, LLM will call the particular tool will turn call an api and a json response will be returned. LLM will use this json and provide the current gold price to the user.

This methdology is called ReAct.It reasons the user query and based on that it will perform some action to give the desired output.

Can't the LLM itself make actions on its own ? Why we need to link tools? LLM at its core is a just mathematical equation. It is not a software. To provide more enhanced capabilities to the LLM, tools are linked to it. Consider this analogy: LLM is like a phonebook. Phonebook by itself can't make calls. We need mobile to make calls. Tools are like mobiles. Tools can be anything ranging from function, modules, api, etc...

3.Self Consistency
For the given prompt, it will generate n results. It then chooses and returns the result which appear consistently across n results.

4.Prompt chaining
Output of one prompt is fed as input to the next prompt and its output as input to next prompt and so on.

5.Tree of thoughts
For the given prompt, instead of thinking in one direction, we will make LLM to think in multiple different paths(ways) and it will compare the results of each and returns the best one. Best one is chosen based on the context provided.

6.CRISP Framework
C - Context [Background information]
R - Role [Persona to take]
I - Instruction [Task]
S - Style [output format]
P - Purpose [Final goal]

If our prompt contains all the above elements, it will yield a good result.

7.RICE Framework
R - Role
I - Input
C - Constraints
E - Expectations

Which prompting style to use?
Now that we have seen several prompting styles, a question comes naturally to our mind. Which style to use ? There is no one size fits all approach. It varies based on our usecase. In a trial and error way, we need to find the style that suits our usecase.

For normal queries, we can prefer one of basic prompting styles like few shot prompting, system prompting etc. For program related queries we can prefer advanced prompting styles like chain of thoughts, tree of thoughts etc. There are also few suggestions in this:

We can choose chain of thoughts when we need to think step by step.
We can choose ReAct framework when we need to think + act repeatedly(tool calling)
We can choose tree of thoughts when we need to explore several possibilities
We can choose CRISP framework when we need to get structured response. Ex: Content creation, Document generation, Code generation and such.
We can choose RICE framework to control output. Ex: Planning, requirement gathering and such.

It is not necessary to use a single prompting style, we can combine several styles together as well.

Major applications of prompting styles

Quering with LLM interface like chatgpt,claude etc
In RAG pipeline.

Prompting styles - Basic

Indumathi R — Tue, 02 Jun 2026 02:22:19 +0000

Query which we ask the LLM is referred to as prompt. The way in which we provide prompt to LLM makes a difference and there are different ways to to provide a prompt. This is referred to as prompting styles or prompt engineering. Now lets see some of the commonly used styles :

1. Zero shot prompting
This is a no brainer, We will simply just give a query to llm. i.e A task will alone be provided to llm. However this style is not that good.

2. Few shot prompting
Along with the basic prompt, Few examples will be provided. i.e inputs and its respective output of how it should be are included. LLM will generate output to our query based on the provided examples.

If one example is provided, then it is referred to as one shot prompting. If two or more examples is provided then it is referred to as few shot prompting.

3. System prompting
We will provide some governing laws(instructions). i.e setting some constraints, boundaries etc for the given prompt to the LLM.

4. Role based prompting
We will make the LLM to adopt a specific persona.

5. Contextual prompting

If someone asks me a question like "Have you ate ?" and if i am replying it as sun rises in the east. provided answer is not a lie but it is not relevant to the question asked. provided answer is out of context.
Context means Background information or extra information. Feeding more and more context to prompt will yield better result.

Day 9 - Sparse embedding continued - RAG

Indumathi R — Thu, 28 May 2026 02:52:45 +0000

In the previous post, we saw some basic methodologies under sparse embeddings. In that, term frequency(TF) had a fallback when same words are repeated too often. To overcome the shortcomings of TF, next method was introduced. We shall see them in detail:

Inverse document frequency(IDF)
It determines how less frequent a word occurs in the input documents. It calculates how the rare the word is. Rare word is of high priority. i.e If the word occurs less frequent, then the value will be high and if if the word occurs more frequently, then the value will be low.

If i ask query about frequently occurring words (for which IDF score is low), results will not be that good. On the other hand, if i ask query about rarest word(IDF score is high), results will be comparatively good.

Drawbacks of IDF
If i ask query about kubernetes and if the word is occurring only in one document , that particular document will be returned. There will be chances where the doc will have mention of kubernetes once but does not describe in detail about it. In such cases, returned doc is not that useful

TF-IDF
This combines both TF and IDF. i.e For a word its TF score will be multiplied with IDF.

BM-25(Best match-25)
Next improved version of TF-IDF is BM-25 algorithm. 25 refers to top 25 matching words. This may yield better result when compared to TF-IDF.

As sparse embedding(s.e) does keyword search, We cannot use s.e alone in a RAG pipeline. To make the best of both worlds, we need to combine dense embeddings (semantic similarity) and sparse embedding(keyword search). This is called hybrid search For dense embedding we can use sentence transformer and for sparse embedding we can use BM-25 algorithm.

Day 8 - Sparse embedding - RAG

Indumathi R — Tue, 26 May 2026 18:02:15 +0000

What is a sparse embedding ?
The word sparse means thinly scattered or occurs in a small amount over a large area. Sparse embedding(shortly as S.E) will have a vocabulary(dictionary of words). Words will be stored in a ordered list format.

Basic S.E methdology
Lets assume that vocabulary has 10,000 words. For the given chunks, it will first start to tokenize each of the chunk.

Ex: Chunk 1 -> Redis is a inmemory database
Tokenization of chunk1 -> ["Redis", "is", "a", "inmemory", "database"].

It will take the first token i.e Redis, if this word is found in its vocabulary, on the index where the redis occurs, 1 will be marked. rest of them will be zero. [0,0,0,1,...] i.e vocabulary list will either be 1 or 0.
1 means token is found in vocabulary and 0 means not found. As we are using vocabulary list, each token embedding will be list of 10k words. Embedding will match with the vocabulary size

Where S.E can be used?
S.E will be used in places where we need to do a exact word match. To give some context behind the S.E, consider the below:
We are having the male and female words and we are trying to build a ML model. How does underlying system know whether the word is male or female ? It does not know about strings. We can use binary classification. i.e we can give 0 for male and 1 for female, viceversa.

There is a small problem with this approach, indirectly we are forming a bias that female is higher than male and viceversa(Since in value wise 1 > 0). To remove this, we can use 2 column feature:

This is the basic concept of S.E. Unlike dense embeddings, S.E won't have continuous values. It is based on occurence and frequency of words.

Shortcomings with this basic approach
It does not consider the frequency of words in a chunk. It will yield the same vector even for words that are repeated.

Term frequency
Next variation of S.E is term frequency. Chunks will be converted to tokens. For each token respective frequency will be calculated. Frequency of the token will then be divided with total numbers of tokens in the chunk. This value will be considered as term frequency of token. This process will be repeated for each token in the chunk.

Shortcomings with this approach
If a word is spammed or occurs too many times its respective chunk will be prioritized over other, Even if the user query is unrelated to it.

Introduction to Generative AI

Indumathi R — Sun, 24 May 2026 03:11:43 +0000

What is Generative AI ?
For the given user input(user query), output like text,image, video etc will be generated. This is called generative AI.
How it generates content?
A model will be used to generate output. i.e model will receive input and based on that, it will generate a output.
What is a model?
At its core, model is nothing but a mathematical equation.It will be multidimensional. vast amount of multimodal data (text,audio,video, image etc) would be subjected to training to get the required mathematical equation. To get the desired state, backpropagation will be carried out.

120b model means, equation has, 120 billion parameters. One of the commonly used model type in generative ai is LLM.

What is LLM ?
LLM stand for large language model. LLM basically predict the next word. If i provide the input query as hello to gpt model, based on the data it was trained, it will predict and returns the next word. In my case i got Hello,How can I help you today?

Response will not be generated and sent all at once. It will be generated one by one and sent in a streamed manner(by means of SSE event).

How it predicts the next word?
In the above example, when i gave hello as input, why "Hi, how can i help you today was returned" ? not hi or world etc .
For the given input, model provides some of possibility words like
hi, world, howdy, how may i help you etc. For each possible word, it gives a score(most occuring probability). Word which is having highest score will be returned as output. If the scores are hi (0.2), world(0.4), howdy(0.1), how may i help you(0.7), highest score is 0.7, so "how may i help you is returned".

Can we tweak the model to control how output should be?

This can be achieved by tweaking the following parameters
1. Temperature
2. Top- K
3. Top - P

Temperature
Temperature controls whether the output generated be factual or imaginative. Temperature value lies between 0 - 1. If it is closer to zero, then it more of a factual and the value is closer to 1, then it is more of a imaginative.

Example prompt for low temperature

Example prompt for high temperature

2.Top -K
K denotes the number of tokens to be returned. For the prompt, The cat sat on the ---- following words are predicted for the varying values of k.

3.Top - P
Threshold percentage will be set. From the set of predicted words, those words will be taken whose cumulative probability score approximates to threshold percentage.
For the prompt, The cat and top_p = 0.7

Day 7 - Dense Embedding - RAG

Indumathi R — Thu, 21 May 2026 03:52:11 +0000

Dense embedding have continuous numeric values. i.e after decimal point values will be present. Chunk will be converted to embeddings, each embedding point will have number like [0.3455566 ,0.6777779, ...]. Generated vectors will be plotted in a space called latent space. Discrete values like 0 won't be present.

Sparse embedding will mostly have discrete values like 0,1 etc. Rather than semantic meaning, it considers frequency or importance of words in a text.
Ex: one hot encoding

Models for Dense embedding
1. LLM

Embed only LLMs are also available. Sole purpose of these LLMs is to generate embedding. Ex: Nomic embed, BGE.
We can also give a prompt to general purpose LLM to generate embedding. But this is costly operation.

2. Transformers (encoder)
Ex: Minilm, nomic transformers

These models are available in hugging face, ollama.It also hosts other models as well.

How can we evaluate the performance of RAG system ?
For a given user query, RAG system will return some set of matching documents. If the returned documents matches with our expectations, we can say it is yielding good results. Say, if our expectation from RAG is to return a, b, c, d, e documents for a user query and in reality it returns a, b, d docs alone. Out of 5, 3 is returned. It is meeting expectation to half right ? Like how we write unit test cases for a software code, we need to write test cases for user query in evaluating the RAG systems.

Should the same embedding model be used throughout the RAG pipeline?
Yes. If we use nomic embed text for document vectorisation then the same model should be used for query vectorisation as well. Suppose if we use different models(one for document vectorisation and other for query vectorisation), then there is a chance that the documents vectors will be plotted in one cluster space and query vector will be plotted in another cluster space. To avoid this, we need to use the same embedding throughout the pipeline.

Day 6 - Embedding - RAG

Indumathi R — Tue, 19 May 2026 17:43:37 +0000

In the previous post, we saw what chunking is and the various methdologies of chunking. In this post, we are going to see the next stage of the RAG pipeline - Embedding.

What is Embedding ?
For each chunk, a vector will be generated. Vector is nothing but a list of numbers. Vector denotes a point in three dimensional space. This process is called embedding.

Why we need to generate a list of numbers in the first place ?
The whole idea of RAG is to enable semantic search.
Lets consider the following word pairs
1.Feline & cat
2.King & Queen
Although words in each pair are different, meaning wise, words of the respective pairs are related to each other.
Now let's consider another term, similarity. It means how close two items are in nature. Combining semantic and similarity we get semantic similarity. It refers to how close two items are related to each other in terms of intent, meaning and context. So in RAG,words which are semantic in nature(meaning is similar) occurs closer in multi dimensional space as vectors.

Vectors are generated for each chunk and stored in vectorDB. User query will also be converted to vector. To return a relevant answer for the query, vector points which are of at close proximity to the query vector will be chosen. among them top n close points will be returned.By means of vectorisation, we can find and return the relevant information. This answers our earlier question, why vectors.

*How close proximity vector points are determined for the user query vector ?
* There are several metrics to determine this:
1. Cosine similarity
2. Euclidean distance
Most commonly used is cosine similarity. Now you may get another question, why cosine ? not Sin or Tan ?

We basically need to find the points that are closer to each other i.e distance between them should be less. If the angle between is small, obviously distance between them will also be less. Cosine helps to achieve identify this notion.

If the angle is almost 0 deg then the cos(0) is 1. This means vectors are nearer to each other and are highly related to each other If the angle is 90, then cos(90) is 0, vectors are not situated nearer. If the angle is 180 deg, cos(180) is -1. They are situated at opposite ends, not related to each other at all. Should not be taken into consideration.

When seeing sine, it does not provide clear distinction. For 0 degree, it returns 0 and for 90 deg also returns zero. We cannot distinguish whether the points are near or far as it returns same 0 value. Tan provides unpredictable values like infinity. Because of this, cosine is preferred.

So in essence, vector is list of numbers that denotes a point in a n- dimensional space. Dimension can be of 256,..., 3000 +. i.e single point is list of 256 values or more.

For the query vector, we can either find the distance between each vector and query - this is called KNN algorithm. Suppose if the data is really huge and if we can't afford to find the distance between each of the query, we can choose approximate number of points. This is called ANN. This is all about the need for vectorisation.

Now, lets see how we can choose a embedding model
Some common categories to choose a embedding models are:

1. By query type
a. Symmetric model
search query is identical to the provided documents.
Example: If i ask to return other news article similar to the one that i provide, then we can use this model. Return the news article similar to one where PM asks not to buy gold.
Ex: Nomic-embed-text, qwen-3

b. Asymmetric model
Shorter query for longer documents.
Ex: HR documents are stored. If we ask a query like, how many leaves are allowed ? we can go with this model type
Ex: Gemini

2. By Retrieval type
a. Dense embedding
To have more semantic understanding, we can go with this model.
Ex: cohere embed models, chatgpt oss 120b

b. Sparse embedding
Does a exact keyword search. Won't have semantic understanding at all.
Ex: BM- 25. This is based on term frequency (TF) and inverse document frequency (IDF)
Term frequency: Frequency of a word in a text. This can fail, if someone spams same word over and over.
Inverse term frequency : It considers How important a word is in the given text. It ignores the frequency of word.
Ex: is, and will be repeated but not much important.

We can also use transformers to generate embeddings.
Transformers are made up of encoders and decoders. From transformers LLMs are built.

Sometimes, if the document data is large, many vectors may be situated to the query point. Due to this, accuracy of the result generated might be reduced. Many vector points will be returned. While designing documents, need to keep track of this.

Day 5 - Chunking continued - RAG

Indumathi R — Fri, 15 May 2026 17:13:11 +0000

Sliding window chunking
To understand this method, we need to know about two parameters, window size and step size. Let's now see how with the help of these two parameters, sliding window chunking works.

Consider the following :

Sample text:
Redis is an open-source, in-memory data store that is primarily used as a cache, database, and message broker. Unlike traditional databases that store data on disk, Redis keeps data in memory (RAM), which makes data access extremely fast. It is commonly used in applications where high performance and low latency are critical, such as caching frequently accessed data, managing user sessions, real-time analytics, task queues, and messaging systems.
Window size =15
Step size =5

Window position is at the first character. It takes the first 15 characters and stores them in chunk1.
Redis is an op.
Now the window moves, how farther it is gonna move will be based on step size. Since we are considering it as 5, window moves 5 characters. from that new moved point, it takes next 15 characters and store them in chunk 2
s is an open-so

Roughly, sliding window chunking looks like this.
[Redis [is an [open-source], in-memory d[ata store] that is primarily used as a cache, database], and message broker. Unlike traditional].

Sliding window is more of a overlapping chunking. Unlike normal overlapping chunking, where we take 1/4th of previous sentence, we are doing a more extensive overlapping in this kind of sliding window chunking.

In overlapping chunking, there is a limitation, if the text contains two unrelated ideas, by means of overlapping chunking, we are bringing them close together. We are forcefully making relationship. This can provide absurd results. Sliding window also carries this limitation. Token consumption will be more. As more number of chunks will be generated, equivalent number of token should also be generated. (tokens will be produced by embedding model)

Another disadvantage with this approach is that, point(generated from query), redundant results will be returned. (as there are several repetitions among several chunks).

Where sliding window chunking can be used ?
When the data in a text are not that related to each other and we need to explicitly establish a relationship between them, sliding window chunking can be used. In essence, to link less related items together.

Token based chunking
Input text is converted to tokens
Single word or character can be considered as token Each token will be assigned a number (like oneshot encoding). These numbers will be sent to embedding model for generating vector points.

When can token based chunking be used?
When there is ratelimiting in the embedding model, we can choose this method, to give a set of tokens(say 100/200 etc). This is not much used.

TOON (Token object oriented notation)
to send json in a more compact manner to a LLM, notation was employed. But this is not much effective.

Some of the commonly used chunking methdologies are shared in this and previous post. There isn't one size fits all chunking method. It varies based our usecase and dataset.

Converting Documents to chunks
Tools for converting documents to a text format so that it can be converted into proper chunks. pdfs cannot be processed as such.

1.Pypdfloader from langchain
2.Pypdf
3.Mupdf etc...
4.Tessaract (for document containing scanned files)

Here also there is no one best tool/package for processing pdfs. It varies based on document data. For special elements like tables in a documents there are few tools, that handles them. First we detect tables(means of regular expression like space before and after. Entire table will be converted into one chunk). We can also use tools like camelot to processing tabular data. Sometimes there can be also images in a document. But in vector DB, it is quite difficult to link images and textual data together. This is all about chunking methdologies.

Day 4 - Chunking continued - RAG

Indumathi R — Tue, 12 May 2026 02:39:09 +0000

Semantic Chunking
Lets Consider two paragraphs A and B, focussing on strings in python. para A focus on typecasting and para B focus on accessing characters. These two paragraphs are not that related to each other but if i do overlapping, these two points will be closer to each other. We do not want to forcefully bring the two paragraphs together. To solve this problem, semantic chunking can be used.

It will continue to add sentence to a chunk until the relevancy is present. i.e It will take first sentence, since there is nothing to compare it will add it to a chunk. Next it will the take the second sentence and compare it with the previous sentence, if the relevancy factor is > 0.75 , second sentence will be added to chunk. Next sentence will be taken and compared with the previous sentence. If the relevancy factor is < 0.75, it won't be added to chunk otherwise it will be added. Semantic chunking can be achieved by means of nltk package.

Embedding Chunking
To find relationship between previous and current sentence, LLM will be used. i.e LLM calculates and produces a number that determines how much are the two sentences related with each other.

There is no one best method to choose the chunking methodology. It varies based upon the dataset. We can do trial and error to determine the methdology suitable for us.

Day 3 - Chunking - RAG

Indumathi R — Sun, 10 May 2026 09:24:48 +0000

What is chunking ?
It is one of the step in RAG pipeline. Dividing a large document into several small parts. Each small part is called chunk. Chunking means dividing.Let's consider this following passage:

Redis is a high-speed, in-memory data structure store that functions as a database, cache, message broker, and streaming engine. It is widely used for real-time applications because it keeps data in RAM rather than on disk, enabling sub-millisecond response times. Unlike traditional databases (like MySQL or PostgreSQL) that read from a hard drive, Redis operates in the computer's main memory, which is significantly faster.

We are going to give the whole passage to the embedding model. It will generate a point (let's consider it as P1)and it is stored in vector DB. There is a small problem with this approach. If i ask a query like , "How redis functions ? " intended answer for this question will be "database, cache, message broker, and streaming engine". However, since the entire passage is stored as single point, it wont retrieve the specific part, it will return the entire passage. To get only the specific part and leave out irrelevant parts as an answer to the query, chunking is very important.

Chunking can be performed in two ways:

Discrete chunking
Semantic chunking

How Small a chunk should be or what should be the size of a chunk ?
If i ask a question "How are you ? " to LLM, if it answers as "sun rises in the east", it is irrelevant but the stmt provided is not wrong. It is just irrelevant to the question provided. LLM wont just say, i dont know, it tries to make up some answer. By means of chunking, we are going to tweak the way in which LLM provides answer.

Discrete chunking
Fixed logic to generate chunk; Let's see some types in discrete chunking :

Fixed Chunking
If i say size as 25 characters, each chunk will contain only 25 characters. In a paragraph, first 25 characters will be in chunk1 , next 25 characters will be in chunk2 etc... In the redis passage, if i start to split into 25 characters, first chunk would be Redis is a high-speed i second chunk would be n memory data structure etc. When we see these chunks, we can see that, meaning of the words is lost due to splitting. What can we infer from this chunk Redis is a high-speed i meaning is lost right ?

How can we better do chunking in this ?

Besides taking 25 characters, we can take till sentence get completed i.e 25 characters and till fullstop. In this case, chunk 1 would be Redis is a high-speed in-memory data structure store that functions as a database, cache, message broker, and streaming engine

Overlapping chunks
Taking from the heading, words between the chunks would be overlapped. i.e Consider the first sentence as Redis is a high-speed in-memory data structure store that functions as a database, cache, message broker, and streaming engine and second sentence as It is widely used for real-time applications because it keeps data in RAM rather than on disk, enabling sub-millisecond response times. If overlapping chunking is applied,few words from the last sentence would be added to starting of next sentence. i.e
Chunk 1 would be Redis is a high-speed in-memory data structure store that functions as a database, cache, message broker, and streaming engine and Chunk 2 would be database, cache, message broker, and streaming engine. It is widely used for real-time applications because it keeps data in RAM rather than on disk, enabling sub-millisecond response times .

Sometimes there are chances for the points to be plotted farther from each other although the texts are closely related to each other. overlapping chunking will reduce this event to some extent.

Day 2 - RAG - What is Vector DB ?

Indumathi R — Fri, 08 May 2026 02:13:27 +0000

To recall, Integrating our private documents with LLM is called RAG.

Lets assume that, we have some pdfs containing our data. That data in the pdf will be broken down into chunks based on some criteria. That chunk will be fed as input to the model. More specifically embedding model. This model will generate a point. How the point is generated ?

Lets take a simple example:

Today is Wednesday
Tomorrow is Thursday
I am travelling today
Wednesday is a nice series

Lets construct a sentence now containing only unique words from the above set of sentences:
Today, is, Wednesday, Tomorrow, Thursday, I, am, travelling, a, nice, series

We are now going to construct each of the 4 sentences into a number format. We will compare unique constructed sentence with each of the input sentence. If the input sentence contains a word from unique construct sentence, number 1 will be assigned to uniquely constructed sentence otherwise 0.

1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0
1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0
0,1, 1, 0, 0, 0, 0, 0, 1, 1, 1

This method of conversion is called a one shot encoding

Now coming to RAG, based on the context of the model, it will generate a point. Generated point will be multidimensional (x,y,z,a ...). Generated points will enable semantic search. What is semantic search ? It will help us to know, how two points are closely related to each other. Meaning based search is called semantic search. For each chunk, a point will be generated. Then model based on its context plots it. Related points appear together.

Vector DB provides a place to store related points together and when quering on the data, it provides the related data.

*How do we say that two points are closer to each other ?
*
When distance is less we say that the two points are closer to each other. Just because there are two points, we can't always say that they are nearer to each other. We need to bring in another point.(for comparison). To find distance between points, there are several algorithms: Euclidean, Cosine Similarity, Manhattan distance.

Lets take Cosine similarity and see how it works:
There are three points(p1,p2,p3) plotted in a graph. From origin, a straight line will be drawn to each of the points. The lines forming an angle with point3 will be considered and its angle will be noted. Cosine of the angle will be taken. smallest cosine angle will be the shortest point.

There are 100 points. if i want to find the nearest points for a point named x, i need to calculate distance between x to all other remaining points. Then only i can arrive the nearest points. But this approach is time consuming.

So a pipeline for RAG is, data will be given to a embedding model(nomic-embeed text), it will a generate a point (mathematical representation of the data). This point will be stored in a vector DB. Some examples of vector DB are chromaDB(general purpose), pinecone, FAISS(high similarity), Quadrant(images) etc.

If i ask any query, it will be sent to embedding model and generate a point and store it in the vector DB and returns the points(say like 5) that are nearer to the query point. This is all about Vector DB

Day 1 - RAG

Indumathi R — Mon, 04 May 2026 04:11:00 +0000

RAG stands for Retrieval Augmented Generation. Why do we even need RAG?? To answer this lets take a look at What LLMs and SLMs are.

LLM(Large Language Model). Data on several categories(generalized) will be given as input. From that, a model would be created. What is a model ? To understand this, lets take mathematical equation of a straight line

y = mx +c

Lets take x values to be 1, 2, 3, ... and y values to be 2, 4, 6, 8, 10. We can use whatever values for m and c to get our desired y value(like 2, 4 etc). Instead of a simple linear equation, we can also consider double, cubic or equations(order of the variables like x^2, x^3 etc...). When we say a model is os of 4b parametrs, 120b parameters and all , it refers to a big equation. Using the input data, a mathematical equation is being created. Larger the equation, more better the result will be. i.e if model is exposed and trained on several amount of data, results generated will also be more relevant and good.

LLMs predict the next word. If we give hello, it may give hello world. We can control how the output should be generated by LLM. like factual or imaginative type. This is determined by a factor used in LLM called Temperature. Higher the temperature, more factual it will be. Lower the temperature, output will be more imaginative.

Temperature is meant for a single query

SLM(Small Language model)
Instead of training the data on vast amount of data across all categories, training a model on the data of specific domain to solve a set of tasks from that domain (like speech to text generation) is referred to as small language model.

Think of it like this, LLMs are generic and SLMs are specific

If we ask a question to LLM based on the data it was trained, we will be getting a good result. But, if we ask a question which is out of the scope of trained data, it will try to answer it i.e makes up answer on its own. This is called hallucination. (wont say like i dont know it, unless we explicitly prompt it).

Analogy: Lets take GPT-OSS model (released at around 2025). If we ask the model now about the Iran-Isreal war, it wont know about it. As the war did not happen at 2025.

In the sameway think about this, In our company, we have some set of data stored in doc, wikis etc. Models out there (gemini, claude) wont know about it. Somehow, if were able to link the LLMs with our private data, we can use that LLM for our internal usage in our company/personal use. This is called RAG. i.e Linking LLM with our data and asking LLM some questions about our data is what RAG is.

One of the approach to achieve LLM to answer our queries on private data is to train the LLMs with the private data. This is one way but not the only way.

Another way is, uploading documents into a vector DB. Before getting into deep in this. Lets first What is vector ? one that has direction and magnitude. For our case, we wont be dealing with direction only dealing with magnitude.

We will be breaking the document into several chunks and convert it into points and plot it in a graph. Lets just plot apple, orange, pear, doctor as points in a graph. Which two are points are releveant here? apple and doctor(apple a day keeps a doctor away), how more relevant ? How to find this. Two points are said to be closer, if the distance between the two points are less. (This is with respect to 2d). It can go upto 700D.

Why did we put doctor closer to apple ? Normally a sentence will be broken into chunks. These chunks wil be given to LLM and it gives points. Based on the context it was trained, it generates points. The closer points will be related to each other.

In essence, our private document will be broken down into several chunk. For each chunk, a point will be generated and plotted in vector DB.

Analogy: ANN(Approximate nearest neighbour) is one of the algorithms used in spotify like platform to find relevancy between items and suggest relevant items