Ramya Perumal

Posted on May 17

RAG: Understanding Embeddings

#rag #ai #beginners #python

What is Embedding?

After the text is split into chunks, the next step is embedding. In this step, each chunk is converted into a vector (a point in a multi-dimensional vector space). Vector-based RAG systems store these vectors so that semantic search can be performed efficiently.

Why Do We Need to Convert Chunks into Vectors?

The main goal of embedding in a RAG application is to enable semantic search.

Semantic
Example
The word “feline” refers to the cat family. Even though “feline” and “cat” are different words, they are related in meaning. Recognizing this kind of relationship is called semantic understanding.

Similarity
When a user submits a query, semantically related chunks are returned even if they do not contain the exact words used in the query.

Semantic Similarity
Semantic similarity combines:

Intent
Context
Meaning

The purpose is to establish relationships between the user query and the documents stored in the RAG system. This allows the system to retrieve relevant information from the database and provide it to the LLM for further processing.

Text with related meanings is mapped to vectors that lie close together in the multi-dimensional vector space.

Cosine Similarity
To measure how close two vectors are, cosine similarity is commonly used. It compares the direction of the vectors rather than their length.
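For two n-dimensional vectors A and B, cosine similarity is the dot product divided by the product of the vector lengths. This is the standard definition; the symbols A, B, and n are introduced here for illustration:

```latex
\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert \mathbf{A} \rVert \, \lVert \mathbf{B} \rVert}
             = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
```

For example, with A = (1, 0) and B = (1, 1), the dot product is 1 and the lengths are 1 and √2, so cos(θ) = 1/√2 ≈ 0.707, which corresponds to a 45-degree angle.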

When a user query arrives:

The query is converted into a vector
Cosine similarity is calculated between the query vector and stored vectors
The closest vectors are retrieved
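The three steps can be sketched in plain Python. The chunk texts and their 3-dimensional vectors are made up for illustration (real embedding models output hundreds of dimensions), and step 1 is assumed to have already happened: `query_vector` is a hypothetical stand-in for the embedding of a query about cats.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings for three stored chunks.
stored_chunks = {
    "Cats are small domestic felines.": [0.9, 0.1, 0.0],
    "Lions are large wild felines.": [0.6, 0.6, 0.2],
    "Cars need regular oil changes.": [0.0, 0.2, 0.9],
}

def retrieve(query_vector, chunks, top_k=2):
    # Step 2: compute cosine similarity between the query and every stored vector.
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in chunks.items()]
    # Step 3: keep the highest-scoring (closest) chunks.
    scored.sort(reverse=True)
    return scored[:top_k]

# Step 1 (assumed done): the user's query has already been embedded.
query_vector = [0.85, 0.2, 0.05]
for score, text in retrieve(query_vector, stored_chunks):
    print(f"{score:.3f}  {text}")
```

The chunk about cars scores lowest because its vector points in a very different direction from the query vector.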

Retrieval Methodologies

Two major retrieval methodologies are used:

1. KNN (K-Nearest Neighbors)
KNN compares the query vector with all stored vectors one by one to find the nearest neighbors.

Advantage
Exact results: the true nearest neighbors are always found

Disadvantage
Slow for very large datasets, because search time grows linearly with the number of stored vectors
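Exact KNN can be sketched as a linear scan over every stored vector. This version assumes vectors are normalized to unit length when they are stored, so each cosine similarity reduces to a dot product. The 2,000 random 64-dimensional vectors are hypothetical stand-ins for real embeddings.

```python
import heapq
import math
import random

def normalize(v):
    """Scale v to unit length so that cosine similarity equals the dot product."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def knn_search(query, vectors, k):
    """Exact KNN: score the query against every stored vector (O(N * d) per query)."""
    q = normalize(query)
    scores = ((sum(a * b for a, b in zip(q, v)), i) for i, v in enumerate(vectors))
    # Keep the k highest (score, index) pairs.
    return heapq.nlargest(k, scores)

# Hypothetical collection, normalized once at indexing time.
random.seed(0)
stored = [normalize([random.gauss(0, 1) for _ in range(64)]) for _ in range(2_000)]

query = stored[42]  # a query identical to a stored vector
top = knn_search(query, stored, k=3)
print(top[0][1])  # expected: 42
```

Every query touches all N stored vectors, so doubling the collection roughly doubles the search time. This is the cost that ANN methods trade a little accuracy to avoid.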

2. ANN (Approximate Nearest Neighbors)
ANN uses an index structure to find vectors that are very likely to be the nearest, without comparing the query against every stored point.

This method is mainly used when:

The document volume is huge
Faster retrieval is required
Time constraints exist

ANN improves retrieval speed while sacrificing a small amount of accuracy.
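One simple ANN technique that suits cosine similarity is random-hyperplane locality-sensitive hashing (LSH). This is an illustrative sketch, not what any particular vector database does: each vector gets a bit signature based on which side of 8 random hyperplanes it falls on, and a query is scored only against vectors with the same signature. The dimension, hyperplane count, and random data are arbitrary assumptions. Production systems more often use graph-based or cluster-based indexes such as HNSW or IVF.

```python
import math
import random

random.seed(1)
DIM, N_PLANES = 32, 8  # hypothetical parameters: up to 2**8 buckets

planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_PLANES)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def lsh_key(v):
    # One bit per hyperplane: which side of it the vector falls on.
    # Vectors separated by a small angle tend to share most bits.
    return tuple(dot(p, v) >= 0 for p in planes)

def build_index(vectors):
    buckets = {}
    for i, v in enumerate(vectors):
        buckets.setdefault(lsh_key(v), []).append(i)
    return buckets

def ann_search(query, vectors, buckets, k):
    # Score only the candidates in the query's bucket, not the whole collection.
    candidates = buckets.get(lsh_key(query), [])
    scored = sorted(((cosine(query, vectors[i]), i) for i in candidates), reverse=True)
    return scored[:k], len(candidates)

vectors = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5_000)]
buckets = build_index(vectors)
results, n_scored = ann_search(vectors[7], vectors, buckets, k=3)
print(results[0][1], n_scored)  # index 7, and how many candidates were scored
```

True neighbors that land in a different bucket are missed, which is the small accuracy loss described above; real LSH setups reduce that risk by using several hash tables.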

Why Cosine Similarity Instead of Sine or Tangent?

Cosine similarity works well because its value follows the angle between two vectors. If two vectors point in almost the same direction and are highly related, the cosine similarity approaches 1. As the angle between them increases, the value decreases, meaning the vectors are less related.

Why Not Sine or Tangent?
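Sine and tangent do not work well as similarity scores:

Sine is not monotonic between 0 and 180 degrees, so sin(30°) = sin(150°): a nearly aligned pair and a nearly opposite pair get the same score
Tangent grows without bound as the angle approaches 90 degrees and is undefined at exactly 90 degrees

Cosine, in contrast, decreases steadily from 1 (same direction) through 0 (perpendicular) to -1 (opposite direction), and it can be computed directly from the dot product and the vector lengths. This makes cosine similarity a more reliable way to measure semantic closeness between vectors.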
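Evaluating the three functions at a few angles between two vectors makes the difference concrete. This small sketch only uses Python's standard `math` module.

```python
import math

def angle_scores(degrees):
    """Return (cos, sin, tan) for an angle between two vectors, given in degrees."""
    rad = math.radians(degrees)
    return math.cos(rad), math.sin(rad), math.tan(rad)

for d in (0, 30, 60, 89, 90, 120, 150, 180):
    c, s, t = angle_scores(d)
    # At 90 degrees math.tan returns a huge finite number, because pi/2 is not exact in floating point.
    print(f"{d:>3} deg  cos={c:+.3f}  sin={s:+.3f}  tan={t:+.3g}")
```

Only the cosine column moves in one direction as the angle grows, so a larger cosine always means "more similar".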

Embedding Dimensions
Embedding models generate vectors whose dimensions commonly range from a few hundred (for example 256, 384, or 768) to 3,072 or more.

The dimension size depends on the embedding model and the amount of contextual information it captures.

Generally:

Higher dimensions capture richer semantic information
Lower dimensions are faster and cheaper but may lose context
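One concrete side of this trade-off is raw storage: a flat index holds one number per dimension per chunk. This sketch assumes float32 values (4 bytes each) and a hypothetical collection of one million chunks, and it ignores index overhead and metadata.

```python
def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for a flat vector index, assuming float32 (4 bytes per value)."""
    return num_vectors * dimensions * bytes_per_value

for dims in (256, 768, 1536, 3072):
    gb = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{dims:>4} dimensions x 1M chunks = {gb:.2f} GB")
```

Storage, and the work done in each similarity comparison, grow linearly with the dimension, so a 3,072-dimensional model needs four times the space of a 768-dimensional one.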

Types of Embedding Models
The right embedding model depends on the application scenario.

1. Based on Query Type

Symmetric Models
Symmetric embedding models are used when the query and the documents are similar in structure and length.

Examples
nomic-embed-text
Qwen embeddings

These are commonly used in semantic search systems.

Asymmetric Models
Asymmetric embedding models are used when:

Queries are short
Documents are long

Example
Google Gemini embedding models

These models are optimized for retrieving long documents from short queries.

2. Based on Retrieval Type

Dense Embeddings
Dense embeddings mainly focus on semantic meaning.

Dense embeddings are vectors in which nearly every dimension holds a nonzero value, so the meaning is spread across all dimensions.

Examples
Cohere embedding models
OpenAI text-embedding-3 models (text-embedding-3-small, text-embedding-3-large)

Advantage
Better semantic understanding

Sparse Embeddings
Sparse embeddings mainly focus on exact keyword matching.

They commonly use the BM25 (Best Match 25) algorithm, which is based on:

TF (Term Frequency)
IDF (Inverse Document Frequency)
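For reference, a standard form of the BM25 score for a document D and a query Q is shown here. f(q, D) is the frequency of term q in D, |D| is the document length in words, avgdl is the average document length, N is the number of documents, and n(q) is the number of documents containing q. k_1 (typically 1.2 to 2.0) and b (typically 0.75) are tuning parameters. The IDF shown is the smoothed variant used by Lucene; other implementations use slightly different IDF formulas.

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(q) = \ln\!\left(\frac{N - n(q) + 0.5}{n(q) + 0.5} + 1\right)
```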

TF-IDF Concepts

TF (Term Frequency)
Measures how many times a word appears in a document.

IDF (Inverse Document Frequency)
Measures how important a word is across the entire document collection. Words that appear in many documents are considered less important.
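A minimal pure-Python sketch of BM25 scoring shows where TF and IDF enter. It assumes lowercase whitespace tokenization and the common defaults k1 = 1.5 and b = 0.75; real search libraries add stemming, stopword handling, and inverted indexes. The three documents are made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: how many documents contain each term (feeds IDF).
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = 1 - b + b * len(tokens) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

docs = ["the cat sat on the mat", "dogs chase cats", "the feline slept"]
print(bm25_scores("cat mat", docs))
```

Only the first document scores above zero: "cats" and "feline" do not exactly match the query term "cat", which is the keyword-matching behavior that dense embeddings complement.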

Transformer Architecture

The transformer architecture was a major breakthrough for LLMs.
The original transformer architecture contains two main parts:

Encoder
Decoder

Encoder
The encoder converts text into embeddings (vectors).

Decoder
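The decoder generates human-readable output text, one token at a time, from these representations.

This architecture enables modern LLMs to understand and generate natural language effectively. Many embedding models (such as BERT-style models) use only the encoder part, while most modern generative LLMs (such as GPT-style models) use only the decoder part.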
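In practice, an encoder outputs one vector per token, not one per chunk. A common way for sentence-embedding models to produce a single chunk embedding is mean pooling: averaging the token vectors while skipping padding positions (some models use the first token's vector instead). This is a minimal pure-Python illustration with hypothetical 2-dimensional token vectors; real encoders output hundreds of dimensions and would use a tensor library.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the encoder's per-token vectors into one chunk vector, skipping padding."""
    dims = len(token_vectors[0])
    total = [0.0] * dims
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:  # mask value 1 = real token, 0 = padding
            total = [t + x for t, x in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# Hypothetical encoder output for 4 token positions; the last one is padding.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]
print(mean_pool(token_vectors, attention_mask))
```

The padding vector [9.0, 9.0] is ignored, so the pooled vector reflects only the real tokens.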

Choosing a Vector Database

Chroma
Open source
Easy to set up
Suitable for basic and small-scale applications

FAISS (a similarity-search library from Meta, not a full database)
Better for large document collections
Optimized for high-performance semantic search
Commonly used in production-scale retrieval systems

DEV Community