In Day 1, we went through an overview of a RAG system: what its components are and how it helps the LLM generate more accurate, contextual responses. Now, let's look at how the data is stored using Vector Databases.
Vector Database
Let's assume we have a PDF with us, and this is our private data. Now I want my LLM to have context about this PDF, so that I can ask any query related to it and get a response.
So we need to store this PDF data in a format from which the LLM can fetch the data and give us relevant responses.
Here, a Vector Database helps us store the PDF data in a numerical format, which can then be used to fetch the relevant data for the LLM.
A vector database stores data in the form of vectors (arrays of numbers).
A vector database is a specialized database designed to store and search vector embeddings (numerical representations of data). Unlike traditional RDBMS systems that use exact matching (like SQL queries), vector databases are optimized for similarity search. Examples include ChromaDB, Pinecone, FAISS, and Qdrant.
Example:
Text => converted into numbers using embeddings
Image => converted into numbers
Audio => converted into numbers
These numbers capture meaning, not just raw data.
What do we mean by Numerical Format?
It means that any kind of data (text, image, or audio) is converted into a numerical format using some encoding algorithm and saved into the DB.
Here, we would first break the PDF data down into chunks, where each chunk has n characters. Then each chunk is converted into a vector point with some number of dimensions. To get a clear understanding, let's see the example below:
- Today is Wednesday
- Tomorrow is Thursday
- I am travelling today
- Wednesday is a nice series
Now I need this data converted into some sort of numerical format, as vectors. Let's consider simple One-Hot Encoding here.
First, let's find all the unique words and list them as an array.
[Today, is, Wednesday, Tomorrow, Thursday, I, am, Travelling, a, nice, series]
Now let's assign the value 1 or 0 for each word in the same array order: 1 if the word occurs in the sentence, 0 if not.
- 1 1 1 0 0 0 0 0 0 0 0
- 0 1 0 1 1 0 0 0 0 0 0
- 1 0 0 0 0 1 1 1 0 0 0
- 0 1 1 0 0 0 0 0 1 1 1
As you can see, we have now converted each sentence into a numerical format. This is a very basic encoding algorithm: it captures which words occur, but not their meaning. Real embedding models go further and capture semantic meaning.
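The table above can be reproduced with a few lines of Python; here is a sketch of one-hot encoding over this small vocabulary:

```python
sentences = [
    "Today is Wednesday",
    "Tomorrow is Thursday",
    "I am travelling today",
    "Wednesday is a nice series",
]

# Vocabulary in the same order as the array above (lowercased)
vocab = ["today", "is", "wednesday", "tomorrow", "thursday",
         "i", "am", "travelling", "a", "nice", "series"]

def one_hot(sentence):
    # 1 if the vocab word occurs in the sentence, 0 otherwise
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocab]

for s in sentences:
    print(one_hot(s))  # first row: [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```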
How do we do this conversion?
Now we understand why we need to convert the data into a numerical format, and that we use different kinds of encoding algorithms for that conversion.
So, how do we convert our data into a format which can be stored in a vector database? The answer is Embedding Models.
What is this Embedding Model?
An Embedding Model helps us convert our data into vectors, which can then be stored in a vector DB.
There are different embedding models available. One such model is nomic-embed, which produces 768-dimensional vectors; this means each chunk of data is represented as a vector with 768 dimensions.
Data
↓
[nomic-embed -Embedding Model]
↓
768D[] vectors
↓
VectorDB
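The flow above can be sketched in Python. A real system would call an actual embedding model (e.g. nomic-embed); the `toy_embed` function below is only a hash-based stand-in that produces deterministic 768-dimensional vectors with the right shape, not vectors that capture meaning.

```python
import hashlib
import struct

def toy_embed(text, dims=768):
    # Toy stand-in for a real embedding model (e.g. nomic-embed).
    # Real models learn semantic vectors; this only derives
    # deterministic numbers from a hash so the shape matches.
    vec = []
    counter = 0
    while len(vec) < dims:
        digest = hashlib.sha256(f"{text}:{counter}".encode()).digest()
        for i in range(0, len(digest), 4):
            n = struct.unpack(">I", digest[i:i + 4])[0]
            vec.append(n / 2**32)  # scale each value into [0, 1)
            if len(vec) == dims:
                break
        counter += 1
    return vec

chunk = "Today is Wednesday"
vector = toy_embed(chunk)   # Data -> [embedding model] -> 768D vector
print(len(vector))          # 768; this vector is what gets stored in the Vector DB
```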
Why do we need to save as Numerical Format?
We may wonder why we can't just store the same text data in the DB and do a normal text search. In that case, we would only be storing isolated words, and we can't really extract the context or semantic meaning out of them.
A Vector DB helps us find data with similar meaning by doing a Similarity (Semantic) Search.
Now let's understand the whole flow where this Vector DB gets used:
- Your data (PDF, docs, DB) → converted into embeddings
- Stored in a vector DB
- When user asks a question → it is also converted into a vector
- Vector DB finds similar content
- That content is sent to the LLM for answer generation
Let's understand with an example.
Imagine a company has:
1000 PDFs (policies, FAQs, manuals)
They want a chatbot to answer questions based on these documents
Step 1: Convert documents into vectors
- Each paragraph is converted into numbers (embeddings) using an Embedding Model
Step 2: Store in Vector Database
Step 3: User asks a question
Step 4: Convert question into vector
Step 5: Similarity Search
Vector DB compares:
- Question vector
- Stored document vectors
It finds the closest matches (similar meaning)
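The five steps above can be sketched end-to-end in plain Python. The documents and question here are made up for illustration, and a simple bag-of-words count stands in for a real embedding model; a production system would use learned embeddings and a real vector DB such as ChromaDB or Pinecone.

```python
import math
from collections import Counter

# Hypothetical company documents (one "paragraph" each)
docs = [
    "Employees get 20 days of paid leave per year",
    "The office cafeteria serves lunch from noon",
    "Passwords must be rotated every 90 days",
]

vocab = sorted({w for d in docs for w in d.lower().split()})

def embed(text):
    # Step 1: bag-of-words counts as a stand-in for a learned embedding
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Step 2: "store" each document with its vector
store = [(d, embed(d)) for d in docs]

# Steps 3-4: the user's question is also converted into a vector
query = embed("how many days of leave do employees get")

# Step 5: similarity search returns the closest document
best = max(store, key=lambda item: cosine(query, item[1]))
print(best[0])  # the leave-policy sentence
```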
How does a vector DB find similar meaning?
A Vector Database is a type of database designed to store data as numerical vectors (embeddings) and efficiently retrieve similar data by performing similarity searches using metrics like cosine similarity.
Let’s imagine we reduce everything to 2D (real systems use 100s–1000s of dimensions).
We take 5 words:
Cat
Dog
Tiger
Car
Bus
Now imagine they are plotted like this:
↑ Y-axis
|
| Tiger .(0.8, 0.9)
|
| Cat .(0.6, 0.7)
| Dog .(0.65, 0.6)
|
|
|
| Car .(0.1, 0.2)
| Bus .(0.15, 0.25)
|
+--------------------------------→ X-axis
Cat, Dog, Tiger are close → similar meaning
Car, Bus are close → similar meaning
Animals are far from vehicles → very different
Step 1: User query
- Let’s say user searches: "Lion"
- We convert "Lion" into a vector: Lion → (0.75, 0.85)
Step 2: Compare with existing points
- Now the Vector DB calculates similarity using something like:
- Cosine similarity - This measures the angle between two vectors, not just distance
- OR Euclidean distance
Step 3: Find nearest neighbors
Finally,
- Vector DB returns: Tiger, Cat (top matches)
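We can check this with the 2D coordinates from the sketch above, computing cosine similarity in plain Python:

```python
import math

# The 2D points from the plot above
points = {
    "Tiger": (0.8, 0.9),
    "Cat":   (0.6, 0.7),
    "Dog":   (0.65, 0.6),
    "Car":   (0.1, 0.2),
    "Bus":   (0.15, 0.25),
}
lion = (0.75, 0.85)  # the query vector for "Lion"

def cosine_similarity(a, b):
    # Measures the angle between two vectors, not just distance
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(*a) * math.hypot(*b))

# Rank all stored words by similarity to the query vector
ranked = sorted(points, key=lambda w: cosine_similarity(lion, points[w]),
                reverse=True)
print(ranked[:2])  # ['Tiger', 'Cat'] - the top matches
```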
In reality:
The embedding models would not use 2D; instead they use 768D, 1536D, etc.
It uses search algorithms like:
- KNN (K-Nearest Neighbors) - exact, brute-force comparison against every vector
- ANN (Approximate Nearest Neighbors) - faster, approximate search used at scale
Is Vector DB mandatory for RAG?
RAG (Retrieval-Augmented Generation) is an approach/architecture. In one common approach, we use a Vector DB to retrieve the relevant data.
Here, the Vector DB is used in the retriever layer to perform semantic search.
Instead of Vector DB, RAG can also use:
- Keyword search (like SQL LIKE)
- APIs or databases
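As a contrast, here is a minimal keyword-based retriever using SQLite's LIKE (the documents are made up for illustration). It matches only literal substrings, which is exactly why it misses synonyms and paraphrases that semantic search would catch.

```python
import sqlite3

# Keyword retrieval layer for RAG: no vector DB involved
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (content TEXT)")
conn.executemany("INSERT INTO docs VALUES (?)", [
    ("Employees get 20 days of paid leave per year",),
    ("Passwords must be rotated every 90 days",),
])

# SQL LIKE finds only the literal keyword "leave" -
# a query phrased as "vacation" or "time off" would return nothing
rows = conn.execute(
    "SELECT content FROM docs WHERE content LIKE ?", ("%leave%",)
).fetchall()
print(rows[0][0])
```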
So, to have a clear understanding:
RAG is not just a LLM + Vector DB
Instead,
RAG is LLM + Retrieval (Vector DB is one way to do retrieval)
So, RAG is an approach where an LLM retrieves relevant external data (often using a vector database) and uses it to generate more accurate, context-aware responses.
Summary
A Vector Database performs similarity search by representing data (such as text, images, or audio, split into chunks) as high-dimensional vectors. If we consider a multi-dimensional space, each item is stored as a point in this space.
When a query is given, it is also converted into a vector, and the database uses similarity metrics such as cosine similarity to measure how close the query vector is to other vectors.
Based on this, it retrieves the most relevant results by selecting the vectors that are closest in terms of semantic meaning.