To recall, integrating our private documents with an LLM is called RAG (Retrieval-Augmented Generation).
Let's assume we have some PDFs containing our data. The data in each PDF will be broken down into chunks based on some criteria. Each chunk will be fed as input to a model, more specifically an embedding model. This model will generate a point. How is the point generated?
Let's take a simple example:
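The chunking step can be sketched in a few lines. This is a minimal illustration using fixed-size character chunks with overlap; the sizes are illustrative, and real pipelines often chunk by sentences, paragraphs, or tokens instead.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character chunks."""
    chunks = []
    step = chunk_size - overlap  # advance by less than chunk_size so chunks overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

doc = "a" * 450  # stand-in for text extracted from a PDF
chunks = chunk_text(doc)
print([len(c) for c in chunks])
```

Each of these chunks is what gets handed to the embedding model in the next step.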
- Today is Wednesday
- Tomorrow is Thursday
- I am travelling today
- Wednesday is a nice series
Let's now construct a list containing only the unique words from the above set of sentences:
Today, is, Wednesday, Tomorrow, Thursday, I, am, travelling, a, nice, series
We are now going to convert each of the 4 sentences into a number format. We compare the unique word list with each input sentence: if the sentence contains the word at a given position in the list, we assign 1 at that position, otherwise 0.
1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0
0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0
1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0
0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1
This method of conversion is called one-hot encoding (since a whole sentence sets several positions to 1, this variant is also known as a multi-hot or bag-of-words vector).
Now coming to RAG: based on the context it was trained on, the embedding model generates a point for each chunk. The generated point is multidimensional (x, y, z, a, ...). These points enable semantic search. What is semantic search? Search based on meaning: it tells us how closely two points (and hence the texts behind them) are related. For each chunk a point is generated, the model plots it based on its context, and related points appear close together.
A vector DB provides a place to store related points together, and when querying the data, it returns the related data.
*How do we say that two points are closer to each other?*
When the distance is less, we say that the two points are closer to each other. Just because there are two points, we can't always say they are near each other; we need to bring in another point for comparison. To find the distance between points, there are several measures: Euclidean distance, cosine similarity, and Manhattan distance.
Let's take cosine similarity and see how it works:
Suppose three points (p1, p2, p3) are plotted on a graph. From the origin, a straight line is drawn to each point. The angle between the lines to any two points is noted, and the cosine of that angle is taken. A smaller angle gives a larger cosine, so the pair of points with the largest cosine value is the closest in meaning.
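This can be sketched with the standard cosine formula: the dot product of the two vectors divided by the product of their lengths. The three points here are made-up examples.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

p1 = [1, 1, 1, 0]
p2 = [1, 1, 0, 0]  # points in nearly the same direction as p1
p3 = [0, 0, 0, 1]  # points in a very different direction

print(cosine_similarity(p1, p2))  # larger value: smaller angle, closer in meaning
print(cosine_similarity(p1, p3))  # smaller value: larger angle, less related
```

Note that a larger cosine value means a smaller angle, so with cosine similarity "closer" means a value nearer to 1, not a smaller number.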
Suppose there are 100 points. If I want to find the points nearest to a point named x, I need to calculate the distance from x to every other remaining point; only then can I arrive at the nearest points. But this brute-force approach is time consuming as the number of points grows.
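The brute-force search just described can be sketched as a linear scan; the 100 random 2-D points are made up for illustration. One distance computation is needed per stored point, which is why large collections use approximate nearest-neighbour indexes instead.

```python
import math
import random

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

random.seed(0)
points = [(random.random(), random.random()) for _ in range(100)]
x = (0.5, 0.5)

# Brute force: compute the distance from x to every point, then sort.
nearest = sorted(points, key=lambda p: euclidean(x, p))[:5]
print(nearest)
```

This does O(n) work per query; vector DBs avoid this cost with indexing structures.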
So the pipeline for RAG is: data is given to an embedding model (e.g. nomic-embed-text), which generates a point (a mathematical representation of the data). This point is stored in a vector DB. Some examples of vector DBs are ChromaDB (general purpose), Pinecone, FAISS (fast similarity search), Qdrant, etc.
If I ask any query, it is sent to the embedding model, which generates a point for it; the vector DB then compares this query point with the stored points and returns the ones (say, 5) that are nearest to it. This is all about vector DBs.
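The whole flow can be sketched with a tiny in-memory stand-in for a vector DB. The `TinyVectorDB` class and the `embed` function here are hypothetical illustrations; a real system would call an actual embedding model and a real vector DB such as ChromaDB or Qdrant.

```python
import math

class TinyVectorDB:
    """Toy in-memory vector store: holds (text, vector) pairs, ranks by cosine."""

    def __init__(self):
        self.items = []

    def add(self, text, vector):
        self.items.append((text, vector))

    def query(self, vector, k=5):
        """Return the k stored texts whose vectors are closest to the query."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0
        ranked = sorted(self.items, key=lambda it: cosine(vector, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

# Hypothetical stand-in for an embedding model, using the bag-of-words
# encoding from earlier rather than a learned embedding.
vocab = ["today", "is", "wednesday", "tomorrow", "thursday"]

def embed(text):
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

db = TinyVectorDB()
for s in ["Today is Wednesday", "Tomorrow is Thursday"]:
    db.add(s, embed(s))

print(db.query(embed("what day is today"), k=1))  # → ['Today is Wednesday']
```

The query text is embedded the same way as the stored chunks, and the store returns the nearest matches; that retrieved text is what gets passed to the LLM as context in a real RAG pipeline.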