Introduction
This blog will discuss two main components of Retrieval Augmented Generation (RAG): the ingestion of data into a vector database and the retrieval of relevant chunks of data using cosine similarity.
Brief About RAG
Before going further, as a prerequisite, here is a brief explanation of Retrieval Augmented Generation for those who are not familiar with this concept. Please feel free to skip to the next sections if you already know RAG.
This technique was designed to provide context to the LLM when it generates responses to domain-specific questions. LLMs are trained on a vast amount of general data available on the internet. Hence, when a user asks a domain-specific question, for example, a medical or legal one, LLMs tend to hallucinate. RAG was introduced to resolve this issue.
The way the RAG technique works is that, first, the domain-specific data is split into chunks. For example, if the data is in the form of multiple paragraphs, it can be split so that each paragraph becomes one chunk.
Note: There are various methods of data splitting, such as by the number of characters, paragraphs, etc.
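To make the splitting step concrete, here is a minimal sketch of paragraph-based chunking, just one of the strategies mentioned in the note above; the sample text is a placeholder.

```python
# A minimal sketch of paragraph-based chunking: split the raw text on blank
# lines so that each paragraph becomes one chunk. The text below is a placeholder.
raw_text = """First paragraph of the domain-specific data.

Second paragraph of the domain-specific data.

Third paragraph of the domain-specific data."""

chunks = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
print(len(chunks), "chunks")  # 3 chunks
```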
Once the data is split into chunks, each chunk is now converted into an array of numbers. The reason for converting text into numbers is that computers can only understand numbers and not words. Further, this array of numbers, along with its associated data chunk, is stored in a vector database.
Now, when a user asks the LLM a domain-specific question, the text of this question is also converted into an array of numbers using the same method that was used to convert the data. This array of numbers (for the question’s text) is then passed to the vector database. The vector database uses something known as cosine similarity to search for the most relevant chunk/s of data that can help answer the user’s question and returns those chunk/s. At this point, we have the user’s question and the most relevant chunk/s of data (that can answer it) from the vector database. Subsequently, we can pass the user’s question to the LLM along with those chunk/s of data as context, which ensures that the LLM’s response is grounded in the actual stored data, making the answer more accurate, up-to-date, and trustworthy. If you want to know more about RAG, please check out the RAG Explained series.
As you would have noticed, RAG has two core functionalities: converting text into an array of numbers and retrieving the relevant chunk/s of data using cosine similarity. In the next sections, we are going to learn how these are done and build a good overall understanding.
Text into Numbers
First, we will discuss how the text is converted into an array of numbers. But instead of directly discussing the actual technique used, let me also discuss the alternatives. This will give you more perspective on why we use what we use.
When someone mentions converting text into numbers, the first method that comes to mind is to create a big vocabulary table that stores all the words along with an index. Now, we can refer to this table and easily convert sentences/paragraphs into an array of numbers.
Consider the example of the following vocabulary table, alphabetically ordered.
| Index | Word |
|---|---|
| 1 | best |
| 2 | client |
| 3 | for |
| 4 | high-quality |
| 5 | i |
| 6 | is |
| 7 | know |
| 8 | provide |
| 9 | service |
| 10 | software |
| 11 | the |
| 12 | think |
| 13 | tribalscale |
| … | … |
| 100,000 | zod |
Let’s use the above table to convert the following sentence: “I think TribalScale provides the best services.” Treating “provides” and “services” as their base forms “provide” and “service”, the array of numbers becomes [5, 12, 13, 8, 11, 1, 9].
Taking one more example: “TribalScale is known for providing high-quality service.” The array of numbers for this sentence becomes [13, 6, 7, 3, 8, 4, 9].
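As a rough sketch of this lookup, here is a small example using the toy table above; the hand-written base-form mapping is purely illustrative, not part of any library.

```python
# A minimal sketch of the vocabulary-table approach, using the toy table above.
# Words are lowercased and reduced to base forms ("provides" -> "provide") so
# that they match the table; real systems use proper tokenizers/stemmers.
vocabulary = {
    "best": 1, "client": 2, "for": 3, "high-quality": 4, "i": 5,
    "is": 6, "know": 7, "provide": 8, "service": 9, "software": 10,
    "the": 11, "think": 12, "tribalscale": 13,
}

# Hypothetical, hand-written normalisation just for this example.
base_forms = {"provides": "provide", "services": "service",
              "known": "know", "providing": "provide"}

def to_indices(sentence):
    words = sentence.lower().replace(".", "").split()
    words = [base_forms.get(w, w) for w in words]
    return [vocabulary[w] for w in words if w in vocabulary]

print(to_indices("I think TribalScale provides the best services."))
# [5, 12, 13, 8, 11, 1, 9]
print(to_indices("TribalScale is known for providing high-quality service."))
# [13, 6, 7, 3, 8, 4, 9]
```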
Great. This sounds like a straightforward approach. However, for our use case, there are a couple of issues with this method. To start with, without the vocabulary table, these numbers are meaningless. Moreover, we need something that can capture the relationship between two chunks, so that we can search for the most relevant chunk of data. Look at the arrays of numbers for both the above sentences without their text:
Sentence 1: [5, 12, 13, 8, 11, 1, 9]
Sentence 2: [13, 6, 7, 3, 8, 4, 9]
It's hard to tell whether these two sentences are related to each other. Plus, as the number of words in the vocabulary table grows, we might have indices in the millions. Hence, this would not be an ideal approach, given the compute required and the cost associated with it.
Therefore, this calls for another approach, one that can tell us whether two sentences are related to each other. This is where ‘Word Embeddings’ come in.
“Word Embeddings” is a technique that allows converting text into an array of numbers while also capturing the relationship between all chunks. In this method, instead of using a vocabulary table, we have a list of parameters or, in simple words, flags. For example, the following list.
| Parameters/Flags |
|---|
| Is Company? |
| Is Opinion? |
| Is Positive? |
| Is a service? |
| Is location? |
| … |
Now, based on the sentence, we will answer each flag with a value anywhere between 0 and 1, decimals included. If the answer is yes, we use 1; if the answer is no, we use 0; and if the answer is somewhere in between, we use a decimal value in that range. Consider both the above sentences, along with a new one:
Sentence 1: “I think TribalScale provides the best services.”
Sentence 2: “TribalScale is known for providing high-quality service.”
Sentence 3: “I live in Toronto”
| Parameters/Flags | Sentence 1 | Sentence 2 | Sentence 3 |
|---|---|---|---|
| Is Company? | 1 | 1 | 0.1 |
| Is Opinion? | 1 | 0.8 | 0 |
| Is Positive? | 0.9 | 0.7 | 0.1 |
| Is a service? | 0.9 | 0.9 | 0.2 |
| Is location? | 0.1 | 0 | 1 |
| Is city? | 0 | 0 | 1 |
| … | … | … | … |
Note: The values are approximate and used for explanation only.
This array of numbers is called an embedding. Thus, the embeddings for all three sentences become:
Sentence 1: [ 1, 1, 0.9, 0.9, 0.1, 0, …]
Sentence 2: [ 1, 0.8, 0.7, 0.9, 0, 0, …]
Sentence 3: [ 0.1, 0, 0.1, 0.2, 1, 1, …]
This time, when you see the embeddings for all the sentences together, it is evident that, at each index, the values for sentence 1 and sentence 2 are very close to each other, while the values for sentence 3 are far off. This indicates that sentences 1 and 2 are more similar to each other than to sentence 3.
Note:
- In real life, there are hundreds or even thousands of such parameters/flags, which increases the accuracy of embeddings.
- The model comes up with these parameters by itself when it is being trained on a huge amount of text.
- For simplicity, we took the range of 0 to 1 to answer the parameters/flags, but it can vary based on the model used for generating the embeddings. It can be from -1 to 1, -3 to 3, but the fundamental purpose remains the same.
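For a sense of what this looks like in practice, here is a minimal sketch using the sentence-transformers library; this is just one common choice among many embedding models, and the model name is only an example.

```python
# A minimal sketch, assuming the sentence-transformers package is installed.
# "all-MiniLM-L6-v2" is just one common choice of embedding model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I think TribalScale provides the best services.",
    "TribalScale is known for providing high-quality service.",
    "I live in Toronto",
]

embeddings = model.encode(sentences)  # one array of numbers per sentence
print(embeddings.shape)  # (3, 384): 3 sentences, 384 learned "parameters" each
```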
To revise what we have seen so far: we first had data (paragraphs), which we split into chunks (one per paragraph), then generated embeddings for each chunk, and lastly stored these embeddings, along with their text, in a vector database.
Cosine Similarity to search for relevant chunk/s
Moving to the next section: since we now have all the data stored in the vector database, it is time to see how the vector database searches for the most relevant data based on the user's question. To understand this, we need to plot some graphs.
Getting back the embeddings of all the sentences,
Sentence 1: [ 1, 1, 0.9, 0.9, 0.1, 0, …]
Sentence 2: [ 1, 0.8, 0.7, 0.9, 0, 0, …]
Sentence 3: [ 0.1, 0, 0.1, 0.2, 1, 1, …]
Each index in these embeddings represents a coordinate on the graph. So {1, 1, 0.1} (the first index value of the three sentences) becomes the x-axis, {1, 0.8, 0} becomes the y-axis, and so on. The embeddings shown above have six values, but since it is easiest for us to observe a 3D graph, for explanation purposes we will only consider the first three indices of each sentence. Hence, the following:
Sentence 1: [ 1, 1, 0.9]
Sentence 2: [ 1, 0.8, 0.7]
Sentence 3: [ 0.1, 0, 0.1]
Upon plotting these values on a graph, we get,
The graph clearly shows that sentences 1 and 2 point in the same direction, while sentence 3 points in another direction.
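If you would like to reproduce a plot like this yourself, here is a minimal sketch, assuming matplotlib is installed, that draws each truncated embedding as a line from the origin:

```python
# A minimal sketch (assuming matplotlib is installed) that draws each
# truncated embedding as a line from the origin in 3D space.
import matplotlib.pyplot as plt

vectors = {
    "Sentence 1": (1, 1, 0.9),
    "Sentence 2": (1, 0.8, 0.7),
    "Sentence 3": (0.1, 0, 0.1),
}

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # 3D axes for the first three flags
for label, (x, y, z) in vectors.items():
    ax.plot([0, x], [0, y], [0, z], label=label)  # line from the origin to the point
ax.set_xlabel("Is Company?")
ax.set_ylabel("Is Opinion?")
ax.set_zlabel("Is Positive?")
ax.legend()
plt.show()
```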
Now, let's take a sample user’s question, “Which company provides the best services?”. To find the relevant chunk of data to answer this question, we need to use the same strategy to convert this text into embeddings.
| Parameters/Flags | User’s Question |
|---|---|
| Is Company? | 1 |
| Is Opinion? | 0.8 |
| Is Positive? | 0.8 |
| Is a service? | 1 |
| Is location? | 0.1 |
| Is city? | 0.1 |
| … | … |
The embedding for the user’s question becomes [1, 0.8, 0.8, 1, 0.1, 0.1, …], and to plot it on a graph, we again consider only the first three values, [1, 0.8, 0.8]. Upon plotting this user’s question on the same graph as above, we get the following:
Here, you can see that the line for the user’s question and the lines for sentences 1 and 2 go in the same direction, whereas the line for sentence 3 goes in a different direction. Visually, it is evident that, since the lines of sentences 1 and 2 go in the same direction as the user’s question, they are the most relevant chunk/s of data to answer it. But we need numbers to prove this. This is where cosine similarity comes in.
Before going into calculations, let me first explain the reason for using cosine similarity. In simple words, cosine similarity means finding the value of cos θ, where θ is the angle between two lines. The value of cos θ can be anywhere between -1 and 1. If the value of cos θ is:
- 1, it indicates that both lines are perfectly aligned with each other; the angle between them is 0°.
- 0, it indicates that both lines are perpendicular to each other (a 90° angle).
- -1, it indicates that both lines point in opposite directions, with a 180° angle between them.
With this in mind, if we find the angle between the line of the user’s question and the line of each of the three sentences and calculate cos θ for those angles, we get a number that tells us how aligned each sentence’s line is with the user’s question’s line. This is the reason for using cosine similarity.
The following are the calculated angles between the line of the user’s question and the line of each sentence, along with the cos θ of each angle:
- Angle between the user’s question and sentence 1: 5.38°, cos 5.38° = 0.9955
- Angle between the user’s question and sentence 2: 3.33°, cos 3.33° = 0.9983
- Angle between the user’s question and sentence 3: 32.55°, cos 32.55° = 0.8429
Since the values of cos θ for sentences 1 and 2 are very close to 1, it indicates that they are more relevant to the user’s question.
Note: The value of cos θ between sentence 3 and the user’s question is also fairly close to 1 because, for the calculations, we only used the first 3 parameters/flags to plot the graph. But in real life, hundreds or thousands of parameters/flags are considered, which increases the accuracy of these calculations.
Alright, before going further, let me do a quick recap of what we learnt so far.
- First, we converted the text into numbers, then we plotted a line on the graph using those numbers.
- Second, we converted the text of the user’s question into numbers and plotted those numbers on a graph.
- Lastly, we calculated the angle between the data’s line and the user’s question’s line and used that angle to calculate the value of cos θ, which specifies how aligned each data chunk is to the user’s question.
Perfect. Now, the issue is that for every question the user asks, we cannot go ahead and plot all these lines on a graph, measure the angles, and calculate cos θ to find the most relevant chunk of data. Therefore, we need a formula that lets us calculate the value of cos θ directly from the embeddings of the sentences. For this purpose, we use the dot product.
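For two embedding vectors A and B, this is the standard cosine similarity formula:

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\;\sqrt{\sum_i b_i^2}}$$

where A · B is the dot product and ‖A‖, ‖B‖ are the lengths (magnitudes) of the vectors.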
Applying this formula to sentence 2 and the user’s question:
Sentence 2: [ 1, 0.8, 0.7]
User’s Question: [ 1, 0.8, 0.8]
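Plugging the three values of each vector into the formula:

$$\cos\theta = \frac{(1)(1) + (0.8)(0.8) + (0.7)(0.8)}{\sqrt{1^2 + 0.8^2 + 0.7^2}\;\sqrt{1^2 + 0.8^2 + 0.8^2}} = \frac{2.2}{\sqrt{2.13}\,\sqrt{2.28}} \approx 0.998$$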
The value of cos θ for sentence 2 and the user’s question, obtained from this calculation, is very close to what we found by plotting the lines on the graph. Using this formula, we can get the value of cos θ directly from the embeddings, without plotting anything.
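To check all three sentences at once, here is a minimal sketch, assuming NumPy is installed, that applies the same formula to the truncated 3-value embeddings; the tiny difference from the 0.9955 above comes from rounding the angle to 5.38°.

```python
# A minimal sketch (assuming NumPy is installed) that applies the cosine
# similarity formula to the truncated 3-value embeddings of all three sentences.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

question = [1, 0.8, 0.8]
sentences = {
    "Sentence 1": [1, 1, 0.9],
    "Sentence 2": [1, 0.8, 0.7],
    "Sentence 3": [0.1, 0, 0.1],
}

for label, embedding in sentences.items():
    print(label, round(cosine_similarity(question, embedding), 4))
# Sentence 1 0.9956
# Sentence 2 0.9983
# Sentence 3 0.8429
```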
Conclusion
Ah, alright, guys. The main purpose of this blog was to understand how these two core functionalities of a vector database work. The good news is that we don't need to do all these calculations ourselves to build a RAG-powered application. There are existing frameworks, such as LangChain, that handle all of this, and all you need to do is call the appropriate functions as required.
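For the curious, here is a toy end-to-end sketch (plain Python + NumPy, not LangChain) that ties the two ideas together; `embed()` and `ask_llm()` are hypothetical placeholders for whichever embedding model and LLM you use.

```python
# A toy end-to-end sketch tying both ideas together: embed the chunks, embed the
# question, rank chunks by cosine similarity, and hand the best chunks to an LLM
# as context. embed() and ask_llm() are hypothetical placeholders.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(question_embedding, store, top_k=2):
    # store: list of (chunk_text, embedding) pairs - a stand-in for a vector database
    ranked = sorted(store,
                    key=lambda item: cosine_similarity(question_embedding, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

# Usage (pseudocode around the hypothetical embed/ask_llm calls):
# store = [(chunk, embed(chunk)) for chunk in chunks]
# context = "\n".join(retrieve(embed(question), store))
# answer = ask_llm(f"Context:\n{context}\n\nQuestion: {question}")
```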
In case you want to try a RAG-powered application, here is the documentation.