Matryoshka Embeddings: The new kind of efficient embeddings

In this article, we will dive into Matryoshka embeddings: what they are, how they compare to a regular embedding model, and how they can be trained with Sentence Transformers. These embeddings have gained significant popularity recently after OpenAI released a new set of embedding models built on the same idea, letting you shorten the embeddings they generate to reduce storage while preserving most of the meaning captured in the full-length vectors.

Note: This article assumes at least a basic understanding of neural networks and NLP in general.

What are Embeddings?


Before we dive deep into the Matryoshka embeddings model, we first need to understand what an embedding is. An embedding is a mapping from words (or other inputs) to 1D vectors/arrays of numbers which, when fed into a neural network, let it understand what the input refers to.

Why do we need this conversion? ML models work with numbers far better than with raw words, so we need a way to tell our neural network what words, and even images, mean. This is where embeddings come in.

Why do we need the 1D array representation, though? If we assign only a single number to each word or image, the neural network can easily misunderstand it. Consider the word "great". If every word is mapped to one number, the model can still process it, but when the word is used in a different context, say a negative or sarcastic one such as "great, the weather sucks today", a single number cannot capture that connotation, and the model would miss the intended meaning.

This is why we use 1D vectors: they group words that appear in similar contexts, which lets the model train efficiently without needing many more numbers and lets it learn about multiple words with similar meanings at the same time. You might wonder about the math behind generating these 1D vectors from a sentence. We do this with the help of a fairly simple neural network.

Conversion of Words to 1D / 2D Embeddings


There are multiple approaches to converting words into embeddings, but the most effective is to use a neural network. A neural network can capture the "semantic" meaning shared by words across different sentences and find relationships between them. The core idea is that words with similar meanings should end up with similar vector representations.

This process involves training a neural network that learns associations between words (GloVe is one example) and represents them as dense vectors. An embedding layer in the network maps discrete categories, such as words, to vectors arranged so that they reflect the relationships between the categories: similar categories end up closer together and dissimilar ones further apart, enabling the network to leverage the geometric properties of the embedding space for accurate predictions.

By performing mathematical operations on word vectors, relationships between words can be leveraged directly, as in the classic example "king - man + woman ≈ queen". This transformation of categorical data into numerical representations through word embeddings has revolutionized how neural networks handle textual data, enhancing accuracy in discrimination and generation tasks like language models. A toy illustration of this vector arithmetic is shown below.
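
To make that arithmetic concrete, here is a tiny, self-contained sketch using hand-picked 4-dimensional vectors. Real word vectors are learned and have hundreds of dimensions; the numbers below are purely illustrative.

import numpy as np

# Toy 4-dimensional "word vectors", hand-picked purely for illustration.
vectors = {
    "king":  np.array([0.8, 0.7, 0.1, 0.9]),
    "man":   np.array([0.6, 0.1, 0.1, 0.8]),
    "woman": np.array([0.6, 0.1, 0.9, 0.8]),
    "queen": np.array([0.8, 0.7, 0.9, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: closer to 1.0 means more similar directions.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "king - man + woman" should land closest to "queen" in a well-trained space.
target = vectors["king"] - vectors["man"] + vectors["woman"]
closest = max(vectors, key=lambda word: cosine(vectors[word], target))
print(closest)  # queen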

Matryoshka Embeddings


Now we will look at how embeddings can be computed and used much more efficiently by shrinking long vectors while retaining their usefulness.

Matryoshka embeddings come from embedding models trained so that their high-dimensional output vectors can be truncated into smaller embeddings while preserving a significant amount of the original information. This is particularly useful for applications that require fast retrieval and efficient storage, such as recommendation engines, search engines, and similarity search.

Why do we need this? The vectors produced by traditional embedding models are very long, which makes them costly to store and search; even when generation is fast with popular libraries, downstream operations on full-length vectors add up. Matryoshka embedding models are trained so that small, truncated versions of their embeddings remain useful while performance stays largely unaffected. In short, a Matryoshka embedding model can produce useful embeddings at various dimensions.

What does the cryptic word Matryoshka mean? The name "Matryoshka" references Russian nesting dolls, where smaller dolls fit neatly inside larger ones. In the same spirit, Matryoshka embeddings cleverly pack the most important information toward the front of a large embedding vector, allowing you to "truncate" it while keeping the essential data. These smaller, truncated embeddings give a performance boost and reduce memory costs in computationally intensive tasks such as text search over a huge document collection.

These embeddings are trained using dedicated loss functions and can be loaded and run for inference with Sentence Transformers, which lets you compute the similarity between different text inputs or truncate previously generated embeddings to the size a specific task needs, as sketched below.
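
As a rough sketch of what that looks like in practice (the model name here is one example of a publicly available Matryoshka-trained model; substitute whichever model you use):

import numpy as np
from sentence_transformers import SentenceTransformer

# Example Matryoshka-trained model; any model trained with MatryoshkaLoss works.
model = SentenceTransformer("tomaarsen/mpnet-base-nli-matryoshka")

sentences = ["The weather is lovely today.", "It is sunny outside."]
embeddings = model.encode(sentences)  # full 768-dimensional embeddings

# Truncate to the first 64 dimensions, then re-normalize before comparing.
truncated = embeddings[:, :64]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

similarity = truncated @ truncated.T  # cosine similarities between the inputs
print(similarity)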

Matryoshka Representation Learning

This is a novel approach that allows for the encoding of information at various levels of granularity. This method is designed to offer a flexible representation that can adapt to multiple downstream tasks with varying computational resources using a single embedding. It minimizes modifications to existing representation pipelines and imposes no additional computational cost during inference and deployment.

Matryoshka Representation Learning (MRL) is a concept in machine learning that makes it easier to understand and work with data by encoding it in different levels of detail. Think of it like having a set of Russian nesting dolls, where each doll can be opened to reveal a smaller doll inside, and so on. In this model, each "doll" is a simplified version of the data, and by opening the dolls in the correct order, you can get a full picture of the data at any level of detail you need.

Imagine you're looking for a specific type of picture in a huge library. Without this model, you might have to search through every single book one by one. But with MRL, you can start by looking at a simplified, general version of all the pictures (the outermost doll), and then as you narrow down your search, you can open the dolls to see more detailed pictures (the inner dolls). This way, you can quickly find what you're looking for without having to look at every single picture.

MRL is used in machine learning models to make them more efficient and flexible. For example, when a model is trying to understand pictures or text, MRL helps it encode this information in different levels of detail. This means that the model can quickly and effectively understand complex data without needing a lot of computational power.

MRL has been shown to work well in various tasks, like classifying images or understanding text. It's been used with popular models like ResNet and BERT, which are designed to understand images and text, respectively. By using MRL, these models can work faster and more accurately on a wide range of tasks, making them more useful for real-world applications.

In summary, MRL is a clever way to encode data at different levels of detail, making it easier for machine learning models to understand and work with complex data. It's like having a set of Russian nesting dolls that help you quickly find what you're looking for without having to sift through everything.

Process of Generating Matryoshka Embeddings

The process of the generation of these embeddings requires several steps:

  1. Training the Model: Matryoshka embeddings are trained using specific loss functions that are designed to preserve the semantic meaning of the embeddings. For example, the MultipleNegativesRankingLoss combined with MatryoshkaLoss [both of these are loss functions] is used for training models on Natural Language Inference data. This loss function encourages the model to generate embeddings that are semantically similar for positive pairs (e.g., sentences with the same meaning) and semantically dissimilar for negative pairs (e.g., sentences with different meanings).

  2. Inference: After training, the model generates embeddings for new input texts via the SentenceTransformer.encode method. These embeddings are high-dimensional and, in general, are truncated to a smaller size using the MRL technique, producing Matryoshka embeddings. This truncation step is crucial for efficiency and storage savings, and the truncated embeddings are finally "normalized" so they have a consistent scale.

  3. Use Cases: Matryoshka embeddings can be used in a variety of applications, such as recommendation systems, search engines, and similarity search. One interesting pattern is to first process the input data with the smaller, truncated vectors (shortlisting) and then process only the remaining candidates with the full-size vectors (reranking). This two-step process allows an embedding solution to scale according to the desired storage cost, processing speed, and performance requirements.

  4. Results and Benefits: Despite the reduction in dimensionality, Matryoshka embeddings have been shown to preserve a significant amount of the original embedding's information. For example, even at 8.3% of the embedding size, Matryoshka models can preserve 98.37% of the performance. This indicates that Matryoshka embeddings can significantly speed up downstream tasks and save on storage space without a notable hit in performance.

In summary, Matryoshka embeddings are generated by training a model with a loss function that encourages semantic similarity and dissimilarity at multiple embedding sizes, and then using that model to generate high-dimensional embeddings for new input texts. These embeddings are truncated and normalized to create the final Matryoshka embeddings: smaller, fixed-size representations of the input texts that preserve a significant amount of their semantic information. A minimal training sketch follows below.
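
The sketch uses Sentence Transformers; the handful of positive pairs, the base model name, and the hyperparameters are placeholders for illustration, not a tuned recipe.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# A few illustrative (anchor, positive) pairs; real training uses NLI-scale data.
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a meal."]),
    InputExample(texts=["The weather is great today.", "It is sunny outside."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

model = SentenceTransformer("microsoft/mpnet-base")  # placeholder base model

# The base loss pushes positive pairs together and in-batch negatives apart.
base_loss = losses.MultipleNegativesRankingLoss(model)
# MatryoshkaLoss applies that base loss at several truncated embedding sizes.
train_loss = losses.MatryoshkaLoss(model, base_loss,
                                   matryoshka_dims=[768, 512, 256, 128, 64])

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)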

Loss Functions and Inference in this model

Loss Functions

Now, in case these topics weren't clear, let's take a step back and analyze what they mean.

Loss functions in machine learning define the objective of training, especially in neural networks, and they are central to how the MRL model is trained. They quantify, as a number, how well the model's predictions match the expected outcomes. In the context of Matryoshka embeddings, loss functions guide the training so that the model produces embeddings that are not only semantically meaningful but also efficient in terms of storage and retrieval.

Matryoshka embeddings use a specific approach called Matryoshka Representation Learning (MRL) to train the model. MRL involves applying a loss function not only to the full-size embeddings but also to truncated versions of the embeddings at various dimensions. For instance, if the original embedding dimension is 768, MRL can train on embeddings of dimensions 768, 512, 256, 128, and 64. Each of these losses is added together, and optionally, weights can be assigned to each loss to balance their contributions to the overall loss.
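
Conceptually, the combined objective simply sums the base loss evaluated at each truncation, with optional per-dimension weights. Here is a hand-rolled sketch of that idea (names and shapes are illustrative; this is not the actual MatryoshkaLoss implementation, which wraps a loss object rather than raw tensors):

import torch.nn.functional as F

def matryoshka_total_loss(emb_a, emb_b, base_loss_fn,
                          dims=(768, 512, 256, 128, 64), weights=None):
    # Sum a base loss over several truncated embedding sizes (illustrative).
    weights = weights or [1.0] * len(dims)
    total = 0.0
    for dim, weight in zip(dims, weights):
        # Truncate both embedding batches to their first `dim` dimensions.
        a = F.normalize(emb_a[:, :dim], dim=-1)
        b = F.normalize(emb_b[:, :dim], dim=-1)
        total = total + weight * base_loss_fn(a, b)
    return total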

This approach is beneficial for several reasons:

  1. Preservation of Important Information: By applying the loss function to embeddings of various dimensions, the model is incentivized to retain the most important information at the start of the embedding. This ensures that even when the embedding is truncated, the most crucial information is still preserved.

  2. Efficiency in Training: Training with MatryoshkaLoss does not significantly increase the training time, making it practical for large-scale applications.

  3. Performance and Storage Efficiency: The model can preserve a significant amount of the original embedding's information even when the embedding size is reduced. For example, at 8.3% of the original embedding size, Matryoshka models can retain 98.37% of the performance. This indicates that the model is not only efficient in terms of storage but also maintains high performance for downstream tasks.

Examples of loss functions used in Matryoshka embeddings include the CoSENTLoss and MultipleNegativesRankingLoss.

In summary, loss functions in Matryoshka embeddings are crucial for ensuring that the model learns to produce embeddings that are both semantically meaningful and efficient. By applying loss functions to embeddings of various dimensions, the model is guided to retain the most important information, leading to efficient storage and retrieval while maintaining high performance.

Inference

Now, let's talk about the word "inference". Inference in machine learning refers to using a trained model to make predictions or decisions on new, unseen data. This step is crucial both for evaluating the model's performance and for applying it to real-world tasks, where it is integrated into larger systems and applications.

In the context of the MRL algorithm, inference is particularly interesting because of its multi-scale representation learning approach. The algorithm encodes information at different levels of granularity, allowing a single embedding to adapt to the computational limits of various tasks without adding overhead during inference and deployment.

The inference process in MRL involves using the trained model to generate embeddings for new input data. These embeddings are then truncated to various dimensions, depending on the computational resources available or the specific requirements of the search task. For example, if a search requires high accuracy, the model might use full-dimensional embeddings; conversely, under tight computational constraints, it might truncate the embeddings to a lower dimension. This approach allows MRL to adapt to different computational budgets while maintaining the efficiency and effectiveness of the embeddings.

In summary, the inference process in MRL involves generating embeddings for new input data using a trained model and then adapting these embeddings to various dimensions based on the computational resources available or the specific requirements of the search or problem to be solved. This multi-scale representation learning approach allows MRL to efficiently encode information at different granularities, making it adaptable to various computational constraints and suitable for a wide range of applications.

Why use the MRL Model?

Creating embeddings with varying sizes [dimensions] is useful for many embedding-specific use cases. Note: These are mainly used in the context of retrieval-augmented generation (RAG) pipelines.

  1. Shortlisting: Rather than searching over the full embeddings, you can use truncated, smaller embeddings to "shortlist" candidates and reduce computational cost. Shortlisting breaks a large list of items down into a smaller subset that is more manageable for further processing. In the context of this algorithm, the shorter embeddings are used to quickly identify the most relevant items, which is much faster than working with the traditional long embeddings, especially for large datasets.

  2. Reranking: After shortlisting, the remaining items are re-ranked using their full-dimensional embeddings. This step ensures that the final results are refined and accurate, and because it is performed on the much smaller shortlisted set, it is more efficient than reranking the entire original list of embeddings. A small sketch of this two-step search is shown below.
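
Here is a minimal sketch of this shortlist-then-rerank pattern using plain NumPy. The arrays stand in for embeddings produced by a Matryoshka model, and the dimension and cutoff values are arbitrary examples.

import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def search(query, doc_embeddings, shortlist_dim=64, shortlist_size=100, top_k=10):
    # query: (768,) vector; doc_embeddings: (N, 768) matrix from a Matryoshka model.
    # 1) Shortlist: score every document with cheap, truncated embeddings.
    q_small = normalize(query[:shortlist_dim])
    docs_small = normalize(doc_embeddings[:, :shortlist_dim])
    shortlist = np.argsort(docs_small @ q_small)[::-1][:shortlist_size]

    # 2) Rerank: score only the shortlisted documents with full-size embeddings.
    q_full = normalize(query)
    docs_full = normalize(doc_embeddings[shortlist])
    order = np.argsort(docs_full @ q_full)[::-1][:top_k]
    return shortlist[order]

# Example with random placeholder data.
rng = np.random.default_rng(0)
print(search(rng.normal(size=768), rng.normal(size=(10_000, 768))))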

Benefits

  1. Efficiency: By using Matryoshka embeddings for shortlisting, you can significantly speed up the retrieval process. The truncated embeddings allow for quick initial filtering, which reduces the computational load and speeds up the overall process.

  2. Accuracy: Despite the reduction in dimensionality, Matryoshka embeddings retain a high level of performance. Studies have shown that even at significantly reduced sizes, these embeddings can preserve a substantial amount of their original performance. This means that after shortlisting, the remaining items can be reranked with minimal loss in accuracy.

  3. Flexibility: The ability to adapt the size of the embeddings allows for flexibility in handling different computational constraints and storage capacities. This is particularly useful in applications where resources are limited but high accuracy is still required.

What can MRL Do?

Now, let's talk about how this is being used exactly. You may be wondering what we can do with these embeddings. They're obviously used to train ML Models and perform semantic searches but there are a few more use cases.

  1. Image Classification: MRL can significantly reduce the embedding size for ImageNet-1K classification while maintaining the same level of accuracy. This makes it more efficient for image classification tasks, where understanding and categorizing images is crucial.

  2. Large-Scale Retrieval: MRL offers real-world speed-ups for large-scale retrieval on ImageNet-1K and ImageNet-4K. This is particularly useful in applications where you need to search through a large database of images or information quickly and accurately.

  3. Few-Shot Classification: MRL can improve accuracy for long-tail few-shot classification. Few-shot learning is a challenging area in machine learning where models are trained to recognize new classes with very few examples. MRL's ability to adapt to various computational constraints makes it a powerful tool for this task.

  4. Web-Scale Datasets: MRL extends seamlessly to web-scale datasets like ImageNet and JFT across various modalities, including vision (using models like ViT and ResNet), vision + language (using ALIGN), and language (using BERT). This flexibility allows MRL to be applied to a wide range of data types and tasks, from understanding images and text to combining both in a single model.

  5. Robustness: Despite its flexibility and efficiency, MRL maintains the robustness of the original representations. This means that models trained with MRL can still perform well on a variety of tasks without losing their accuracy or reliability.

In summary, the applications of MRL are vast, ranging from image classification and large-scale retrieval to few-shot learning and web-scale dataset analysis. Its ability to adapt to different computational constraints and maintain high accuracy makes it a versatile tool for many machine-learning tasks.

Example Request for Generating Matryoshka-Style Embeddings with OpenAI

OpenAI introduced its text-embedding-3-small and text-embedding-3-large embedding models, which, even when shortened, can outperform the full-size version of the previously popular text-embedding-ada-002 model, retaining only the required details. Below is a sample request for generating embeddings with the text-embedding-3-small model.

curl https://api.openai.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "input": "Your text string goes here",
    "model": "text-embedding-3-small"
  }'
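
The text-embedding-3 models also accept a dimensions parameter, which asks the API to return a shortened, Matryoshka-style embedding directly. Below is a minimal sketch using the official OpenAI Python SDK; it assumes OPENAI_API_KEY is set in the environment, and the choice of 256 dimensions is just an example.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Your text string goes here",
    dimensions=256,  # request a shortened, Matryoshka-style embedding
)

embedding = response.data[0].embedding
print(len(embedding))  # 256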

Conclusion

To conclude, Matryoshka embeddings represent a significant advancement in the field of machine learning, particularly in the realm of data representation and retrieval. Their hierarchical structure, inspired by the Russian Matryoshka dolls, allows for efficient storage and retrieval of information at various levels of granularity. This innovative approach not only enhances the performance of downstream tasks by preserving a high percentage of performance even at significantly reduced embedding sizes but also offers a scalable solution for practitioners to balance storage cost, processing speed, and performance needs.

The empirical evidence, as demonstrated by the experiment comparing Matryoshka models to regular embedding models, shows that Matryoshka embeddings can maintain up to 98.37% of performance even when truncated to 8.3% of the original embedding size. This remarkable capability to retain information while reducing the size of embeddings is a testament to the effectiveness of the Matryoshka Representation Learning (MRL) technique.

Looking ahead, the potential applications of Matryoshka embeddings are vast, from improving search functionality in digital platforms to enhancing the efficiency of machine learning models across various domains. The ease of training and application of Matryoshka embeddings using frameworks like Sentence Transformers further underscores their practicality and versatility.

As the field continues to evolve, we can expect to see more research and development efforts focused on optimizing and expanding the capabilities of Matryoshka embeddings. This will likely lead to new insights and innovative applications that further leverage the power of hierarchical data representation, paving the way for more efficient and effective machine-learning systems in the future.
