DEV Community

Thomas van Dongen

Model2Vec: Making Sentence Transformers 500x faster on CPU, and 15x smaller

Hey everyone! I wanted to share a project we (Stephan and Thomas) have been working on for the past couple of months. Many researchers and companies default to (very) large embedding models, but in our day-to-day jobs as ML engineers we still see plenty of value in more traditional methods like static (word2vec) embeddings, which are much more eco-friendly.

We've been experimenting with ways to use those same large embedding models to create much smaller ones, and found something quite interesting. By simply taking the output embeddings of a Sentence Transformer, reducing their dimensionality with PCA, and applying Zipf weighting, we created a very small static embedding model (30 MB on disk) that outperforms other static embedding models on all tasks in MTEB. No training data is needed either: all you need is a vocabulary, and "distilling" a model takes about a minute on a regular CPU because you only have to forward-pass the tokens in the vocabulary. The final model is (much) faster and smaller than (for example) GloVe while being more performant. This makes it great for use cases such as text classification, similarity search, clustering, and RAG.
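The recipe above can be sketched in plain NumPy. This is a toy illustration, not the actual Model2Vec implementation: the vocabulary size, dimensions, and the exact Zipf weighting scheme (here, weighting each token by the log of its frequency rank) are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a Sentence Transformer's output embeddings:
# one 768-dim vector per token in a 10k-token vocabulary.
vocab_size, dim, pca_dim = 10_000, 768, 256
token_embeddings = rng.normal(size=(vocab_size, dim))

# 1. Reduce dimensionality with PCA (via SVD on the centered matrix).
centered = token_embeddings - token_embeddings.mean(axis=0)
_, _, components = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ components[:pca_dim].T

# 2. Apply Zipf weighting: assuming the vocabulary is sorted by
# frequency, the token at rank r gets weight log(1 + r), which
# down-weights very frequent (and less informative) tokens.
ranks = np.arange(1, vocab_size + 1)
static_embeddings = reduced * np.log1p(ranks)[:, None]

print(static_embeddings.shape)  # (10000, 256)
```

The resulting matrix is the whole model: a lookup table of one vector per vocabulary token, which is why inference is so fast on CPU.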

The following plot shows how the models we implemented (Model2Vec base output and Model2Vec glove vocab) compare to various popular embedding models.

Want to try it yourself? We've implemented a basic interface for our models. For example, after installing the package with pip install model2vec, you can create embeddings like this:

from model2vec import StaticModel

# Load a model from the HuggingFace hub (in this case the M2V_base_output model)
model_name = "minishlab/M2V_base_output"
model = StaticModel.from_pretrained(model_name)

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
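The vectors returned by encode are plain arrays, so downstream tasks like similarity search reduce to simple vector math. A minimal cosine-similarity sketch with NumPy (the query and document vectors here are made up for illustration; in practice they would come from model.encode):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embeddings; in practice these come from model.encode(...).
query = np.array([0.2, 0.9, 0.1])
docs = np.array([[0.1, 0.8, 0.2],   # similar to the query
                 [0.9, 0.1, 0.0]])  # dissimilar

scores = [cosine_similarity(query, d) for d in docs]
best = int(np.argmax(scores))  # index of the most similar document
```

Because the model is static, encoding is a lookup rather than a forward pass, so this kind of search scales to large corpora on CPU.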

I'm curious to hear your thoughts! We've documented all our experiments and results in the Model2Vec repository: https://github.com/MinishLab/model2vec.



