Day 8 - Sparse embedding - RAG

#ai #nlp #rag #tutorial

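What is a sparse embedding?
The word sparse means thinly scattered, or occurring in small amounts over a large area. A sparse embedding (S.E for short) relies on a vocabulary (a dictionary of words), stored as an ordered list. Each embedding is a vector with one position per vocabulary word, and most of those positions are zero, which is where the name comes from.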

Basic S.E methodology
Let's assume the vocabulary has 10,000 words. Each of the given chunks is first tokenized.

Ex: Chunk 1 -> Redis is an in-memory database
Tokenization of chunk 1 -> ["Redis", "is", "an", "in-memory", "database"].

The first token, Redis, is looked up in the vocabulary. If the word is found, a 1 is marked at the index where it occurs in the vocabulary, and the same is done for every remaining token. All other positions stay 0, e.g. [0,0,0,1,...], so every entry of the vector is either 1 or 0.
1 means the vocabulary word at that index occurs in the chunk, and 0 means it does not. Because the vector is indexed by the vocabulary list, each embedding is a list of 10k numbers: the embedding length always matches the vocabulary size.
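
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the eight-word VOCABULARY stands in for the 10,000-word list, and the lowercase whitespace tokenizer is an assumption made for the example.

```python
# Toy vocabulary: an ordered list of words (assumed for illustration;
# a real vocabulary might hold around 10,000 entries).
VOCABULARY = ["an", "cache", "database", "in-memory", "is", "key", "redis", "store"]
WORD_TO_INDEX = {word: index for index, word in enumerate(VOCABULARY)}


def tokenize(chunk):
    # Assumption: lowercase the text and split on whitespace.
    return chunk.lower().split()


def binary_embedding(chunk):
    # Start with all zeros, one slot per vocabulary word.
    vector = [0] * len(VOCABULARY)
    for token in tokenize(chunk):
        index = WORD_TO_INDEX.get(token)
        if index is not None:
            # Mark 1 at the position of every token found in the vocabulary.
            vector[index] = 1
    return vector


embedding = binary_embedding("Redis is an in-memory database")
print(embedding)       # [1, 0, 1, 1, 1, 0, 1, 0]
print(len(embedding))  # 8, same as the vocabulary size
```

Tokens that are missing from the vocabulary are simply skipped here; how unknown words are handled is a design choice that the text above does not specify.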

Where can S.E be used?
S.E is used where we need an exact word match. To give some context behind S.E, consider the following:
Suppose we have the words male and female and we are trying to build an ML model. How does the underlying system know whether a word is male or female? It does not understand strings. We can use a binary encoding, i.e. assign 0 to male and 1 to female, or vice versa.

There is a small problem with this approach: we indirectly introduce a bias that female ranks higher than male, or vice versa (since numerically 1 > 0). To remove this, we can use a two-column feature, one column per category: male becomes [1, 0] and female becomes [0, 1].

This is the basic concept of S.E. Unlike dense embeddings, S.E does not have continuous values. It is based on the occurrence and frequency of words.
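
As a small sketch of the two-column idea, each category gets its own column, so neither value is numerically larger than the other. The CATEGORIES list and the one_hot helper are names chosen for this example only.

```python
# One column per category; the order of the columns is arbitrary.
CATEGORIES = ["male", "female"]


def one_hot(value):
    # Put 1 in the column that matches the value and 0 everywhere else.
    return [1 if value == category else 0 for category in CATEGORIES]


male_vector = one_hot("male")
female_vector = one_hot("female")
print(male_vector)    # [1, 0]
print(female_vector)  # [0, 1]
```

A sparse embedding applies the same trick with one column per vocabulary word instead of one column per category.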

Shortcomings of this basic approach
It does not consider how often a word occurs in a chunk. A word that is repeated many times yields the same vector as a word that appears once.
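
To see this limitation concretely, here is a short sketch under the same assumptions as before (a tiny made-up vocabulary and whitespace tokenization). Two chunks that differ only in how often a word repeats get identical binary vectors.

```python
# Tiny vocabulary, assumed for illustration.
VOCABULARY = ["cache", "database", "redis"]


def binary_embedding(chunk):
    # A set discards repeats, which is exactly what the 0/1 scheme does.
    tokens = set(chunk.lower().split())
    return [1 if word in tokens else 0 for word in VOCABULARY]


once = binary_embedding("redis cache")
repeated = binary_embedding("redis redis redis cache")
print(once)              # [1, 0, 1]
print(repeated)          # [1, 0, 1]
print(once == repeated)  # True
```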

Term frequency
The next variation of S.E is term frequency. Chunks are converted to tokens, and the frequency of each token is counted. That count is then divided by the total number of tokens in the chunk, and the result is the term frequency of the token. This process is repeated for each token in the chunk.
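
The calculation described above can be sketched as follows. The term_frequency function name and the whitespace tokenizer are assumptions for this example; the formula itself is the count of a token divided by the total number of tokens in the chunk.

```python
from collections import Counter


def term_frequency(chunk):
    # Assumption: lowercase the text and split on whitespace.
    tokens = chunk.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    # Term frequency = occurrences of the token / total tokens in the chunk.
    return {token: count / total for token, count in counts.items()}


tf = term_frequency("redis cache redis database")
print(tf)  # {'redis': 0.5, 'cache': 0.25, 'database': 0.25}
```

To turn this into a sparse vector, each value would be written at the token's index in the vocabulary instead of the plain 1 used in the basic approach.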

Shortcomings of this approach
If a word is spammed or occurs too many times, its chunk will be prioritized over the others, even if the user query is unrelated to it.
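
A rough sketch of the problem, assuming a hypothetical scoring rule that is not taken from this post: a chunk's score is the sum of the term frequencies of the query's tokens. Under that assumption, a chunk that only repeats one word outranks a chunk that actually describes it.

```python
from collections import Counter


def term_frequency(chunk):
    tokens = chunk.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}


def score(query, chunk):
    # Hypothetical scoring rule: add up the term frequency of every
    # query token that appears in the chunk.
    tf = term_frequency(chunk)
    return sum(tf.get(token, 0.0) for token in query.lower().split())


query = "what is redis"
spam_chunk = "redis redis redis redis"
useful_chunk = "redis is an in-memory database"

spam_score = score(query, spam_chunk)      # 1.0, from "redis" alone
useful_score = score(query, useful_chunk)  # 0.2 for "is" + 0.2 for "redis"
print(spam_score > useful_score)           # True
```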
