DEV Community

Asen Mitrev


Did you know you can combine Text and Image/Video embeddings in the same latent space?

Multimodal embeddings are great, until you realise that they are only as good as the model's training data. Have some videos of yourself that you want to find by your name? Either manually add metadata or suffer.

Unless you're a famous person, the model will not associate your face with your name.

There is an easy fix, though. Multimodal embedding models usually let you embed text, images, or both. The embeddings share the same latent space, so they can be used not only for cross-modal retrieval (e.g. text-to-image search in a RAG pipeline) but can also be combined via a weighted sum.

What does this mean? You can add vectors and get enriched meaning. Take a weighted sum of the embedding of your name and the embedding of a photo, say 10% your name and 90% a photo of you eating a banana, and the resulting vector will be searchable by your name.

Meaning is not lost; the two are combined in the proportions given by the weights.
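Here is a minimal sketch of that weighted sum in plain Python. The vectors are toy stand-ins: in practice both would come from the same multimodal embedding model, and the function names (`combine_embeddings`, `cosine_similarity`, `l2_normalize`) are hypothetical helpers, not part of any library. Normalizing before and after mixing is my assumption, made so that the weights reflect proportions rather than raw vector magnitudes.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def combine_embeddings(
    text_vec: Sequence[float],
    image_vec: Sequence[float],
    text_weight: float = 0.1,
) -> List[float]:
    """Weighted sum of a text and an image embedding from the same latent space.

    Assumption: both inputs are unit-normalized first, and the result is
    normalized again so it can be compared with cosine similarity.
    """
    if len(text_vec) != len(image_vec):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be between 0 and 1")
    t = l2_normalize(text_vec)
    i = l2_normalize(image_vec)
    mixed = [text_weight * a + (1.0 - text_weight) * b for a, b in zip(t, i)]
    return l2_normalize(mixed)


# Toy 3-dimensional stand-ins for real model output (hypothetical values).
name_vec = [1.0, 0.0, 0.0]   # embedding of your name
photo_vec = [0.0, 1.0, 0.0]  # embedding of the banana photo

# 10% name, 90% photo, as in the example above.
combined = combine_embeddings(name_vec, photo_vec, text_weight=0.1)

print(cosine_similarity(photo_vec, name_vec))   # photo alone vs. name query
print(cosine_similarity(combined, name_vec))    # combined vector vs. name query
print(cosine_similarity(combined, photo_vec))   # combined vector vs. the photo
```

With these toy vectors the photo alone has no similarity to the name query, while the combined vector picks some up and still stays very close to the original photo. You would store `combined` in your vector index in place of the raw photo embedding. The right weight depends on your model and data, so treat 10% as a starting point to tune.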
