Sparse means thinly spread, scattered, or not dense.
In sparse embeddings, a chunk is split into tokens, and the chunk is represented by a vector with one position for each word in the vocabulary dictionary.
If a vocabulary word appears in the chunk, its position is assigned 1; otherwise, it is assigned 0.
Example
[0,0,0,1,0,0,1,0,...]
If the vocabulary dictionary contains 10,000 words, the vector representation will also contain 10,000 dimensions.
For a particular chunk:
- Only a few positions contain a nonzero value such as 1
- Most other positions contain 0
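Unlike dense embeddings, where nearly every dimension holds a learned continuous value, each dimension of a sparse embedding corresponds to a specific vocabulary token, and its value depends on that token's occurrence and frequency.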
Why Do We Use Sparse Embeddings?
Sparse embeddings are mainly used for direct text matching and keyword-based retrieval.
They are useful when:
- Exact keyword matching is important
- Semantic understanding is not the primary requirement
- Traditional search behavior is needed
Basic Sparse Representation
In the basic sparse approach:
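- Each word in the vocabulary dictionary is checked against the chunk's tokens
- If the word appears in the chunk, the value at its position becomes 1
- Otherwise, the value becomes 0
This is similar to one-hot encoding, except that several positions can be 1 at once (a multi-hot, or binary bag-of-words, vector).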
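The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function name `binary_vector`, the five-word vocabulary, and the whitespace tokenization are all assumptions made for the example.

```python
def binary_vector(tokens, vocabulary):
    """Return a 0/1 vector with one position per vocabulary entry."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical toy vocabulary; a real one would have thousands of entries.
vocabulary = ["cat", "database", "index", "query", "vector"]
chunk = "the database answers the query using the database index"
tokens = chunk.lower().split()

vector = binary_vector(tokens, vocabulary)
print(vector)  # [0, 1, 1, 1, 0]
```

Note that "database" occurs twice in the chunk but still contributes a single 1.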
Drawback of Basic Sparse Representation
The main drawback is that it does not consider how many times a word appears in the document.
For example:
If the word “database” appears 20 times and another word appears only once, both still receive the same value of 1.
To solve this problem, the concept of token weighting was introduced.
Term Frequency (TF)
TF stands for Term Frequency.
It measures how frequently a term appears in a document.
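A common form of the formula is:
TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)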
TF gives higher importance to terms that appear more frequently in a document.
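As a rough sketch, term frequency can be computed by dividing each term's count by the document length. The function name `term_frequency` and the sample sentence are assumptions for illustration, and other TF variants (raw counts, log-scaled counts) are also used in practice.

```python
from collections import Counter

def term_frequency(tokens):
    """Map each term to count(term) / total number of tokens in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

tokens = "the database stores the data in the database".split()
tf = term_frequency(tokens)
print(tf["database"])  # 0.25
print(tf["the"])       # 0.375
```

Here "the" scores higher than "database" simply because it occurs more often.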
Issue with TF
The problem with TF is that commonly occurring words may receive very high importance even if they are not meaningful.
For example:
- “the”
- “is”
- “and”
These words appear frequently in most documents but do not provide strong contextual meaning.
To solve this issue, IDF was introduced.
Inverse Document Frequency (IDF)
IDF stands for Inverse Document Frequency.
It measures how rare or important a word is across the entire document collection.
- Common words receive lower scores
- Rare and meaningful words receive higher scores
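A common form of the formula is:
IDF(t) = log(N / df(t)), where N is the total number of documents and df(t) is the number of documents that contain term t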
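A minimal sketch of IDF, assuming the unsmoothed form log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name and the three toy documents are hypothetical; real libraries often add smoothing terms to this formula.

```python
import math

def inverse_document_frequency(documents):
    """Map each term to log(N / df), where df counts documents containing it."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

documents = [
    "the database stores the data".split(),
    "the index speeds up the query".split(),
    "the cat sat on the mat".split(),
]
idf = inverse_document_frequency(documents)
print(idf["the"])                 # 0.0
print(round(idf["database"], 3))  # 1.099
```

A word that appears in every document, such as "the", gets a score of 0, while a word found in only one document gets the highest score.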
Issue with IDF
IDF alone does not determine how relevant a document is to the user query.
It only measures the rarity of terms across documents.
To improve retrieval quality, TF and IDF are combined.
TF-IDF
TF-IDF combines:
- Term Frequency (TF)
- Inverse Document Frequency (IDF)
The formula is:
TF-IDF(t, d) = TF(t, d) × IDF(t)
TF-IDF works well for many traditional search systems because it balances:
- Word frequency within the document
- Word importance across all documents
However, TF-IDF still does not fully capture semantic meaning.
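The combination can be sketched as a single function that multiplies each term's frequency in a document by its inverse document frequency. This assumes length-normalized TF and unsmoothed IDF, log(N / df); the function name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, with weight = TF * IDF."""
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

documents = [
    "the database stores the data".split(),
    "the index speeds up the query".split(),
    "the cat sat on the mat".split(),
]
weights = tf_idf(documents)
print(weights[0]["the"])                 # 0.0
print(round(weights[0]["database"], 3))  # 0.22
```

The frequent but uninformative word "the" drops to 0, while "database" keeps a positive weight in the first document.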
BM25 (Best Match 25)
BM25 is an advanced ranking algorithm used in sparse retrieval systems.
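It improves upon TF-IDF by considering:
- Term frequency, with saturation so that repeated occurrences add less and less to the score
- Document length, normalized against the average document length in the collection
- The rarity of each query term, through an IDF-style weight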
BM25 is one of the most commonly used algorithms in traditional search engines and sparse retrieval systems.
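The following is a minimal sketch of a BM25 scoring function, not the implementation used by any particular search engine. It assumes the widely used Okapi form with parameters `k1=1.5` and `b=0.75` and an IDF variant of log(1 + (N - df + 0.5) / (df + 0.5)); these values, the function name `bm25_scores`, and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, documents, k1=1.5, b=0.75):
    """Score every document against the query with an Okapi-style BM25 formula."""
    n_docs = len(documents)
    avg_len = sum(len(tokens) for tokens in documents) / n_docs
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        # Longer-than-average documents are penalized; b controls how strongly.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query_tokens:
            freq = counts[term]
            if freq == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # The fraction saturates as freq grows; k1 controls how quickly.
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        scores.append(score)
    return scores

documents = [
    "the database stores the data".split(),
    "the index speeds up the query".split(),
    "the cat sat on the mat".split(),
]
scores = bm25_scores("database query".split(), documents)
best = max(range(len(documents)), key=lambda i: scores[i])
print(best)  # 0
```

The first two documents each contain one query term, but the first is shorter than average and so ranks slightly higher. The third contains no query term and scores 0.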
Limitation of Sparse Embeddings
Sparse embeddings alone are usually not enough to retrieve highly relevant documents in modern RAG systems because they mainly focus on exact keyword matching rather than semantic meaning.
For example, these pairs may not match even though their meanings are similar:
- “car” and “automobile”
- “feline” and “cat”
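A small sketch makes the mismatch concrete: keyword matching only sees shared tokens, so a synonym contributes nothing. The helper `shared_terms` and the sample strings are hypothetical.

```python
def shared_terms(query, document):
    """Return the tokens that the query and the document have in common."""
    return set(query.lower().split()) & set(document.lower().split())

print(sorted(shared_terms("used car prices", "used car prices in 2024")))
# ['car', 'prices', 'used']
print(sorted(shared_terms("used car prices", "second-hand automobile costs")))
# []
```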
Hybrid Search
To improve retrieval quality, modern systems combine:
- Dense embeddings
- Sparse embeddings
This approach is called hybrid search.
Typical Combination
- Dense retrieval → Sentence transformers or embedding models
- Sparse retrieval → BM25
Dense embeddings help with semantic understanding, while sparse embeddings help with exact keyword matching.
Together, they provide better retrieval performance in RAG applications.
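One common way to merge the two result lists is reciprocal rank fusion (RRF). The article does not say which fusion method to use, so this is only one possible sketch: the function name, the constant `k=60`, and the document IDs are assumptions, and the two input rankings stand in for the output of a dense retriever and a BM25 retriever.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical results from a dense retriever and a BM25 retriever.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]

print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that both retrievers return rise to the top, and RRF needs only ranks, so the dense and sparse scores do not have to be on the same scale.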


