Day 6 - Embedding - RAG

#ai #nlp #rag #tutorial

In the previous post, we saw what chunking is and the various methodologies of chunking. In this post, we are going to see the next stage of the RAG pipeline - Embedding.

What is Embedding?
For each chunk, a vector is generated. A vector is nothing but a list of numbers, and it denotes a point in an n-dimensional space. The process of turning a chunk into such a vector is called embedding.

Why do we need to generate a list of numbers in the first place?
The whole idea of RAG is to enable semantic search.
Let's consider the following word pairs:
1. Feline & cat
2. King & Queen
Although the words in each pair are different, they are related to each other in meaning. That is the "semantic" part.
Now let's consider another term, similarity. It means how close two items are in nature. Combining semantic and similarity we get semantic similarity. It refers to how closely two items are related in terms of intent, meaning and context. So in RAG, chunks that are semantically similar (close in meaning) end up closer to each other as vectors in the multi-dimensional space.

Vectors are generated for each chunk and stored in a vector DB. The user query is also converted to a vector. To return a relevant answer for the query, the stored vectors that lie closest to the query vector are looked up, and the top n closest among them are returned. By means of vectorisation, we can find and return the relevant information. This answers our earlier question: why vectors.
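The lookup described above can be sketched in a few lines of Python. This is only an illustration: the three-dimensional vectors are made up by hand (a real embedding model would produce them), the dictionary stands in for a vector DB, and `top_n` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors for illustration only; a real model would generate these.
chunk_vectors = {
    "Employees get 20 days of paid leave per year.": [0.9, 0.1, 0.0],
    "The cafeteria opens at 8 am.": [0.0, 0.2, 0.9],
    "Unused leave can be carried over to next year.": [0.8, 0.3, 0.1],
}

def top_n(query_vector, store, n=2):
    """Score every stored chunk against the query and keep the n best."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store.items()]
    scored.sort(reverse=True)
    return [text for _, text in scored[:n]]

# Stands in for the embedded query "How many leaves are allowed?"
query_vector = [1.0, 0.2, 0.0]
results = top_n(query_vector, chunk_vectors, n=2)
for text in results:
    print(text)
```

With these toy numbers, the two leave-related chunks come back and the cafeteria chunk is left out, because its vector points in a very different direction from the query.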

How are the closest vector points determined for the user query vector?
There are several metrics to determine this:
1. Cosine similarity
2. Euclidean distance
The most commonly used one is cosine similarity. Now you may have another question: why cosine? Why not sine or tan?

We basically need to find the points that are closer to each other, i.e. the distance between them should be small. For vectors of comparable length, if the angle between them is small, the distance between them is also small. Cosine captures this notion by turning the angle into a single score.

If the angle is almost 0 degrees, the cosine is close to 1 (cos(0) is 1). This means the vectors point in nearly the same direction and are highly related to each other. If the angle is 90 degrees, cos(90) is 0: the vectors are not near each other and are treated as unrelated. If the angle is 180 degrees, cos(180) is -1: the vectors point in opposite directions, which is the lowest possible score, so they should not be taken into consideration.
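To make the three cases concrete, here is a small sketch that computes both metrics from the list above for vectors at 0, 90 and 180 degrees to a reference vector. The vectors and variable names are chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

reference = [1.0, 0.0]
same_direction = [3.0, 0.0]   # 0 degrees, but three times longer
perpendicular = [0.0, 1.0]    # 90 degrees
opposite = [-1.0, 0.0]        # 180 degrees

for name, vec in [("same direction", same_direction),
                  ("perpendicular", perpendicular),
                  ("opposite", opposite)]:
    cos = cosine_similarity(reference, vec)
    dist = euclidean_distance(reference, vec)
    print(f"{name:15s} cosine={cos:+.3f}  euclidean={dist:.3f}")
# same direction  cosine=+1.000  euclidean=2.000
# perpendicular   cosine=+0.000  euclidean=1.414
# opposite        cosine=-1.000  euclidean=2.000
```

Notice that Euclidean distance gives the same value (2.0) for the longer vector in the same direction and for the opposite vector, while cosine separates them cleanly because it looks only at direction and ignores length.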

Sine, on the other hand, does not provide a clear distinction. For 0 degrees it returns 0, and for 180 degrees it also returns 0. We cannot distinguish whether the vectors point the same way or in opposite directions, as both cases give the same value. Tan is unbounded: it blows up towards infinity near 90 degrees, where it is undefined. Because of this, cosine is preferred.
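The comparison can be checked quickly with Python's standard `math` module. This is just a sanity check of the values discussed above; the tiny non-zero results where you would expect exactly 0 come from floating-point rounding.

```python
import math

values = {}
for degrees in (0, 90, 180):
    theta = math.radians(degrees)
    values[degrees] = (math.sin(theta), math.cos(theta), math.tan(theta))
    sin_v, cos_v, tan_v = values[degrees]
    print(f"{degrees:>3} deg  sin={sin_v:+.3f}  cos={cos_v:+.3f}  tan={tan_v:.3g}")
```

Sine comes out as (practically) 0 at both 0 and 180 degrees, cosine moves from +1 to -1, and tan at 90 degrees is an enormous number rather than a usable score.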

So in essence, a vector is a list of numbers that denotes a point in an n-dimensional space. The dimension can be 256, ..., 3000+, i.e. a single point is a list of 256 values or more.

For the query vector, we can compute the distance between the query and every stored vector and keep the k closest ones. This is the exact k-nearest neighbours (KNN) search. If the data is really huge and we can't afford to compare the query against every vector, we can search only a subset of likely candidates and accept a small loss in accuracy. This is called approximate nearest neighbour (ANN) search. This is all about the need for vectorisation.
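Here is a rough sketch of the difference. The exact search compares the query with every vector. The approximate search uses a toy index based on random hyperplanes (one simple form of locality-sensitive hashing) and only compares the query with vectors that fall in the same bucket. The class name, parameters and random data are all assumptions for illustration; production vector databases typically use more sophisticated indexes such as HNSW.

```python
import math
import random

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def exact_knn(query, vectors, k):
    """Brute force: score the query against every stored vector."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_similarity(query, vectors[i]),
                    reverse=True)
    return ranked[:k]

class RandomHyperplaneIndex:
    """Toy ANN index: vectors on the same side of each random hyperplane share a bucket."""

    def __init__(self, vectors, num_planes=4, seed=0):
        rng = random.Random(seed)
        dim = len(vectors[0])
        self.vectors = vectors
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
        self.buckets = {}
        for i, vec in enumerate(vectors):
            self.buckets.setdefault(self._signature(vec), []).append(i)

    def _signature(self, vec):
        # One boolean per hyperplane: which side of the plane the vector lies on.
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in self.planes)

    def search(self, query, k):
        # Only vectors in the query's bucket are scored, so true neighbours can be missed.
        candidates = self.buckets.get(self._signature(query), [])
        ranked = sorted(candidates,
                        key=lambda i: cosine_similarity(query, self.vectors[i]),
                        reverse=True)
        return ranked[:k]

rng = random.Random(42)
vectors = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(1000)]
query = vectors[0]

index = RandomHyperplaneIndex(vectors)
exact = exact_knn(query, vectors, k=5)
approx = index.search(query, k=5)
candidate_count = len(index.buckets[index._signature(query)])
print("exact :", exact)
print("approx:", approx)
print("vectors scored by the approximate search:", candidate_count, "of", len(vectors))
```

The approximate search scores far fewer vectors, at the cost of possibly missing some neighbours that landed in a different bucket. That trade-off between speed and recall is the essence of ANN.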

Now, let's see how we can choose an embedding model.
Some common ways to categorise embedding models are:

1. By query type
a. Symmetric model
The search query is similar in length and form to the stored documents.
Example: If I provide a news article and ask for other articles similar to it, we can use this model type. For instance: return news articles similar to the one where the PM asks people not to buy gold.
Ex: Nomic-embed-text, qwen-3

b. Asymmetric model
A short query is matched against longer documents.
Example: HR documents are stored. If we ask a query like "How many leaves are allowed?", we can go with this model type.
Ex: Gemini

2. By Retrieval type
a. Dense embedding
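When we need more semantic understanding, we can go with this model type. Every chunk becomes a vector in which most values are non-zero, and closeness reflects meaning rather than exact words.
Ex: Cohere Embed models, OpenAI text-embedding models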

b. Sparse embedding
Does an exact keyword search. It does not capture semantic meaning at all.
Ex: BM25. This is based on term frequency (TF) and inverse document frequency (IDF).
Term frequency: How often a word occurs in a text. On its own this can fail if someone spams the same word over and over.
Inverse document frequency: How rare a word is across all the documents, which indicates how important it is. A word that appears in almost every document gets a low weight, no matter how often it is repeated.
Ex: "is" and "and" are repeated a lot but are not very important.
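To see TF and IDF working together, here is a minimal BM25 sketch in plain Python. It uses one common formulation of BM25 with the conventional parameters `k1=1.5` and `b=0.75`; the documents, the whitespace tokenisation and the function names are assumptions made for this example, and real search engines add many refinements.

```python
import math
from collections import Counter

documents = [
    "the cat is on the mat",
    "the dog is in the garden",
    "the feline is a small cat",
]
tokenized = [doc.split() for doc in documents]
N = len(tokenized)
avg_len = sum(len(doc) for doc in tokenized) / N

def idf(term):
    """Rare terms get a high weight, terms found in every document a low one."""
    n = sum(1 for doc in tokenized if term in doc)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

def bm25_score(query, doc_tokens, k1=1.5, b=0.75):
    """Sum over query terms of IDF times a saturating, length-normalised TF."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query.split():
        f = tf[term]
        denom = f + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf(term) * f * (k1 + 1) / denom
    return score

print("idf('is')  =", round(idf("is"), 3))
print("idf('dog') =", round(idf("dog"), 3))
for doc, tokens in zip(documents, tokenized):
    print(round(bm25_score("feline", tokens), 3), "|", doc)
```

The query "feline" scores 0 against "the cat is on the mat" even though the two mean nearly the same thing. That is the lack of semantic understanding mentioned above, and it is why dense embeddings are used when meaning matters.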

We can also use transformers to generate embeddings.
The original transformer architecture is made up of an encoder and a decoder. LLMs are built from transformers.

Sometimes, if the document data is large, many vectors may be situated close to the query point. Due to this, many vector points will be returned and the accuracy of the generated result might be reduced. We need to keep this in mind while designing documents.
