The latest Sentence Transformers release quietly changes something fundamental. Version 5.4 adds native multimodal support—same API, same patterns, but now you can encode and compare text, images, audio, and video in a shared embedding space.
This isn't a wrapper. It's a direct extension of the embedding workflow that most RAG pipelines already use.
## The Shift
Traditional embedding models convert text into fixed-size vectors. You encode a query, encode your documents, compute cosine similarity. Works great until someone wants to search for "that screenshot with the error message" or "the slide deck about Q3 projections."
Multimodal embedding models solve this by mapping inputs from different modalities into the same embedding space. A text query and an image document now share a coordinate system. Same similarity functions. Same retrieval logic. Different modalities.
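Because everything lands in one coordinate system, cross-modal comparison reduces to the same cosine similarity already used for text. A minimal sketch with toy stand-in vectors (in practice these would come from a model's `encode` call; the numbers here are made up for illustration):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim "embeddings": a text query plus two documents
# (one text chunk, one screenshot) in the same shared space.
query_emb = np.array([0.9, 0.1, 0.0, 0.1])
text_doc_emb = np.array([0.8, 0.2, 0.1, 0.0])   # text chunk
image_doc_emb = np.array([0.1, 0.0, 0.9, 0.2])  # screenshot

# One similarity function works regardless of source modality.
print(cosine_sim(query_emb, text_doc_emb) > cosine_sim(query_emb, image_doc_emb))
```

The retrieval logic never needs to know which modality produced a vector; ranking is just a sort over these scores.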
## What's Actually New
Sentence Transformers 5.4 adds:
- Multimodal encoding: `model.encode()` now handles images, audio, and video alongside text
- Cross-modal reranking: score relevance between mixed-modality pairs
- Unified API: no new abstractions—load a model, encode inputs, compute similarity
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

# Encode different modalities
text_emb = model.encode("quarterly revenue report")
img_emb = model.encode("path/to/screenshot.png")

# Same similarity function you already use
similarity = model.similarity(text_emb, img_emb)
```
The reranker models extend similarly—you can score pairs where one or both elements are images.
## Why This Matters for RAG
Most production RAG systems are text-only. When users ask about visual content, you either:
- Run OCR on everything (slow, lossy)
- Ignore it entirely (incomplete)
- Build a parallel image search system (complex, disconnected)
Multimodal embeddings collapse these into one pipeline. Your retrieval step can surface relevant images alongside text chunks without OCR preprocessing or separate indices.
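"One pipeline" concretely means one index holding text and image embeddings side by side, searched with a single top-k pass. A sketch with stand-in vectors (file names and embeddings are hypothetical; in a real system each entry's vector would come from the model's encode step):

```python
import numpy as np

def top_k(query_emb, index, k=2):
    """Rank every entry in a mixed-modality index by cosine similarity."""
    scored = []
    for entry in index:
        v = entry["embedding"]
        sim = float(np.dot(query_emb, v) / (np.linalg.norm(query_emb) * np.linalg.norm(v)))
        scored.append((sim, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]

# One index, mixed modalities; embeddings are toy stand-ins.
index = [
    {"modality": "text",  "source": "q3_report.md",     "embedding": np.array([0.9, 0.1, 0.1])},
    {"modality": "image", "source": "error_screen.png", "embedding": np.array([0.1, 0.9, 0.1])},
    {"modality": "image", "source": "q3_chart.png",     "embedding": np.array([0.8, 0.2, 0.2])},
]

query_emb = np.array([1.0, 0.1, 0.1])  # e.g. the encoding of "Q3 projections"
hits = top_k(query_emb, index, k=2)
print([h["source"] for h in hits])  # text and image results interleaved in one ranking
```

The point is that the image entries need no OCR text and no separate store; they compete in the same ranking as the text chunks.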
The reranking layer matters here too. Cross-encoder rerankers have been essential for text RAG because they score query-document pairs more accurately than embedding similarity alone. Multimodal rerankers extend that to visual documents.
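The two-stage pattern carries over unchanged: embedding similarity produces cheap candidates, and a reranker rescores only that shortlist. The scorers below are deliberately trivial stubs (keyword overlap) standing in for an embedding model and a cross-encoder; only the control flow is the point:

```python
def retrieve(query, corpus, fast_score, k=2):
    """Stage 1: cheap scoring over the whole corpus, keep top-k."""
    return sorted(corpus, key=lambda doc: fast_score(query, doc), reverse=True)[:k]

def rerank(query, candidates, slow_score):
    """Stage 2: expensive, more accurate scoring on the shortlist only."""
    return sorted(candidates, key=lambda doc: slow_score(query, doc), reverse=True)

# Stub scorers: raw keyword overlap (fast) vs. overlap normalized by length (slow).
fast = lambda q, d: len(set(q.split()) & set(d.split()))
slow = lambda q, d: len(set(q.split()) & set(d.split())) / len(d.split())

corpus = [
    "q3 revenue chart screenshot",
    "q3 projections slide deck",
    "error message screenshot from the login page",
    "meeting notes",
]
candidates = retrieve("q3 projections", corpus, fast, k=2)
best = rerank("q3 projections", candidates, slow)[0]
print(best)
```

Swapping the stubs for a multimodal embedder and reranker changes the scorers, not the pipeline shape.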
## The Hardware Reality
There's a catch. VLM-based models like Qwen3-VL-2B need ~8GB VRAM. The 8B variants need ~20GB. CPU inference is "extremely slow" per the docs—CLIP and text-only models are better suited there.
For production systems with GPU infrastructure, this is manageable. For edge deployments, you'll want smaller models or cloud inference.
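Those VRAM figures follow roughly from parameter count: fp16 weights take two bytes per parameter, with activations, caches, and runtime state on top. A back-of-the-envelope check (the fixed 4 GB allowance is a rough assumption chosen to line up with the quoted figures, not a measured number):

```python
def rough_vram_gb(params_billions, bytes_per_param=2, overhead_gb=4):
    """fp16 weights plus a rough fixed allowance for activations and runtime state."""
    return params_billions * bytes_per_param + overhead_gb

print(rough_vram_gb(2))  # 2B params -> 4 GB of weights, landing near the ~8 GB figure
print(rough_vram_gb(8))  # 8B params -> 16 GB of weights, near the ~20 GB figure
```

Quantized weights (int8 or 4-bit) would shrink the first term accordingly, which is the usual lever when a model almost fits.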
## The Practical Impact
This changes what you can retrieve:
- Visual document RAG: Search PDFs with embedded charts, screenshots, and diagrams
- Cross-modal search: Find video clips from text descriptions
- Multimodal deduplication: Identify near-duplicates across modalities
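The deduplication case falls out of the shared space directly: two items are near-duplicates when their embeddings exceed a similarity threshold, regardless of which modality each came from. A sketch with toy vectors (the threshold and the embeddings are illustrative assumptions):

```python
import numpy as np

def near_duplicates(items, threshold=0.95):
    """Return pairs of items whose embeddings are nearly identical."""
    pairs = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i]["embedding"], items[j]["embedding"]
            sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            if sim >= threshold:
                pairs.append((items[i]["source"], items[j]["source"]))
    return pairs

# A chart image and the text chunk describing it land close together.
items = [
    {"source": "q3_chart.png",   "embedding": np.array([0.70, 0.69, 0.10])},
    {"source": "q3_summary.txt", "embedding": np.array([0.69, 0.70, 0.11])},
    {"source": "cat_photo.jpg",  "embedding": np.array([0.05, 0.10, 0.99])},
]
print(near_duplicates(items))
```

The pairwise loop is quadratic and fine for a sketch; at corpus scale this becomes an approximate nearest-neighbor query against the same index used for retrieval.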
The API stays familiar. The infrastructure requirements shift. The use cases expand.
## What We're Still Missing
The release handles encoding and reranking well, but production multimodal RAG needs more:
- Index efficiency: FAISS and similar indices weren't designed for mixed-modality queries
- Chunking strategies: How do you chunk a video? What about image grids?
- Evaluation frameworks: BEIR and MTEB are text-only; multimodal benchmarks are sparse
These will get solved. The embedding layer is now in place.
The gap between text RAG and multimodal RAG just got smaller. The question is whether your retrieval pipeline can handle what's now possible.