Ramya Perumal

Posted on May 27

RAG - Sparse Embedding

#ai #beginners #rag

Sparse means thinly spread, scattered, or not dense.

In sparse embeddings, a chunk is split into tokens, and the chunk is represented as a vector with one position for each entry in the vocabulary dictionary.

If a vocabulary token appears in the chunk, its position is set to 1; otherwise, it is set to 0.

Example

[0,0,0,1,0,0,1,0,...]

If the vocabulary dictionary contains 10,000 words, the vector representation will also contain 10,000 dimensions.

For a particular chunk:

Only a few positions may contain values like 1
Most other positions will contain 0

Unlike dense embeddings, where almost every dimension holds a non-zero learned value, sparse embeddings are mostly zeros. Their non-zero values depend mainly on token occurrence and frequency.

Why Do We Use Sparse Embeddings?

Sparse embeddings are mainly used for direct text matching and keyword-based retrieval.

They are useful when:

Exact keyword matching is important
Semantic understanding is not the primary requirement
Traditional search behavior is needed

Basic Sparse Representation

In the basic sparse approach:

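The tokens of the chunk are compared with the vocabulary dictionary
If a vocabulary token appears in the chunk, its value becomes 1
Otherwise, its value becomes 0

This is similar to one-hot encoding, except that several positions can be 1 at the same time (often called multi-hot or binary bag-of-words encoding).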
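The comparison above can be sketched in a few lines of Python. This is only a minimal illustration: the `binary_vector` function, the five-word vocabulary, and the whitespace tokenization are assumptions made for the example, not part of any particular library.

```python
def binary_vector(chunk, vocabulary):
    """Return a 0/1 vector with one position per vocabulary token."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokens = set(chunk.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


# Hypothetical toy vocabulary; a real one may hold thousands of entries.
vocabulary = ["cat", "database", "index", "query", "vector"]
vector = binary_vector("The database stores the vector index", vocabulary)
print(vector)  # [0, 1, 1, 0, 1]
```

Words that are not in the vocabulary, such as "stores", simply have no position in the vector.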

Drawback of Basic Sparse Representation

The main drawback is that it does not consider how many times a word appears in the document.

For example:

If the word “database” appears 20 times and another word appears only once, both still receive the same value of 1.

To solve this problem, the concept of token weighting was introduced.

Term Frequency (TF)

TF stands for Term Frequency.

It measures how frequently a term appears in a document.

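A common definition is:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)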

TF gives higher importance to terms that appear more frequently in a document.
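As a rough sketch, term frequency can be computed as below, assuming TF is a term's count divided by the total number of tokens in the document (other variants, such as raw counts, also exist). The `term_frequency` name and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter


def term_frequency(document):
    """Map each term to its count divided by the document's token count."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


tf = term_frequency("database index database query")
print(tf["database"])  # 0.5
print(tf["index"])     # 0.25
```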

Issue with TF

The problem with TF is that commonly occurring words may receive very high importance even if they are not meaningful.

For example:

“the”
“is”
“and”

These words appear frequently in most documents but do not provide strong contextual meaning.

To solve this issue, IDF was introduced.

Inverse Document Frequency (IDF)

IDF stands for Inverse Document Frequency.

It measures how rare a word is across the entire document collection, on the assumption that rarer words are more informative.

Common words receive lower scores
Rare and meaningful words receive higher scores

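A common definition is:

IDF(t) = log(N / DF(t))

Here N is the total number of documents in the collection and DF(t) is the number of documents that contain term t.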

Issue with IDF

IDF alone does not determine how relevant a document is to the user query.

It only measures the rarity of terms across documents.
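A small sketch makes this visible: IDF produces one score per term for the whole collection, regardless of which document or query is involved. The example assumes the unsmoothed definition log(N / DF), where N is the number of documents and DF is the number of documents containing the term. The function name and the toy documents are hypothetical.

```python
import math


def inverse_document_frequency(documents):
    """Map each term to log(N / DF) over the whole collection."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # set() ensures a term is counted once per document.
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


documents = ["the database is fast", "the index is small", "the query is slow"]
idf = inverse_document_frequency(documents)
print(idf["the"])       # 0.0, because "the" appears in every document
print(idf["database"])  # log(3), roughly 1.0986
```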

To improve retrieval quality, TF and IDF are combined.

TF-IDF

TF-IDF combines:

Term Frequency (TF)
Inverse Document Frequency (IDF)

The formula is:

TF-IDF(t, d) = TF(t, d) × IDF(t)

TF-IDF works well for many traditional search systems because it balances:

Word frequency within the document
Word importance across all documents

However, TF-IDF still does not fully capture semantic meaning.
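The combination can be sketched as follows, assuming TF is count divided by document length and IDF is log(N / DF). Libraries typically apply smoothing and normalization on top of this, so their numbers will differ. The `tf_idf` function and the sample documents are made up for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


documents = ["the database is fast", "the index is small", "the query is slow"]
weights = tf_idf(documents)
print(weights[0]["the"])       # 0.0, common everywhere so it carries no weight
print(weights[0]["database"])  # 0.25 * log(3), roughly 0.2747
```

Only the terms that actually occur in a document are stored, which is what makes the representation sparse in practice.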

BM25 (Best Match 25)

BM25 is an advanced ranking algorithm used in sparse retrieval systems.

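It improves upon TF-IDF by considering:

Term frequency saturation, so repeated occurrences of a term add progressively less to the score
Document length, so long documents are not favored simply because they contain more words
Query terms, so the score of a document is summed over the terms in the user query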

BM25 is one of the most commonly used algorithms in traditional search engines and sparse retrieval systems.
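A minimal sketch of BM25 scoring is shown below. It assumes the widely used Okapi form with the smoothed IDF log((N - n + 0.5) / (n + 0.5) + 1) and the typical parameter values k1 = 1.5 and b = 0.75. The function name, the documents, and the whitespace tokenization are illustrative assumptions, and production search engines may differ in the details.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in counts:
                continue  # a missing term contributes nothing
            n = doc_freq[term]
            idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
            tf = counts[term]
            # Longer-than-average documents are penalized through b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


documents = [
    "the database stores vectors",
    "the database index speeds up database queries",
    "cats sleep all day",
]
scores = bm25_scores("database index", documents)
print(scores)
```

In this toy collection the second document scores highest because it contains both query terms, and the third scores 0.0 because it contains neither.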

Limitation of Sparse Embeddings

Sparse embeddings alone are usually not enough to retrieve highly relevant documents in modern RAG systems because they mainly focus on exact keyword matching rather than semantic meaning.

For example:

“car” and “automobile” may not match
“feline” and “cat” may not match

This happens even though the meanings are similar.
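The mismatch is easy to reproduce with plain keyword overlap. The snippet is a simplified stand-in for sparse matching, with a hypothetical `keyword_overlap` helper and whitespace tokenization.

```python
def keyword_overlap(query, document):
    """Return the set of tokens shared by the query and the document."""
    return set(query.lower().split()) & set(document.lower().split())


print(keyword_overlap("car repair", "automobile repair shop"))  # {'repair'}
print(keyword_overlap("car", "automobile for sale"))            # set()
```

The second call returns an empty set, so a purely keyword-based retriever would give that document no credit for the query "car".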

Hybrid Search

To improve retrieval quality, modern systems combine:

Dense embeddings
Sparse embeddings

This approach is called hybrid search.

Typical Combination

Dense retrieval → Sentence transformers or embedding models
Sparse retrieval → BM25

Dense embeddings help with semantic understanding, while sparse embeddings help with exact keyword matching.

Together, they provide better retrieval performance in RAG applications.
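One common way to merge the two result lists is reciprocal rank fusion (RRF), sketched below. This is an assumption chosen for illustration, since the combination method is not specified here. Weighted score sums are another option. The document IDs, the function name, and the constant k = 60 (a frequently used default) are all illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; higher fused score ranks first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from a dense retriever and a BM25 retriever.
dense_ranking = ["doc_b", "doc_a", "doc_c"]
sparse_ranking = ["doc_a", "doc_d", "doc_b"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # ['doc_a', 'doc_b', 'doc_d', 'doc_c']
```

Documents that both retrievers rank highly rise to the top, while documents found by only one retriever still remain in the fused list.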
