The latest Sentence Transformers release quietly changes something fundamental. Version 5.4 adds native multimodal support—same API, same patterns, but now you can encode and compare text, images, audio, and video in a shared embedding space.
This isn't a wrapper. It's a direct extension of the embedding workflow that most RAG pipelines already use.
## The Shift
Traditional embedding models convert text into fixed-size vectors. You encode a query, encode your documents, compute cosine similarity. Works great until someone wants to search for "that screenshot with the error message" or "the slide deck about Q3 projections."
Multimodal embedding models solve this by mapping inputs from different modalities into the same embedding space. A text query and an image document now share a coordinate system. Same similarity functions. Same retrieval logic. Different modalities.
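Because everything lands in one coordinate system, cross-modal comparison reduces to the same cosine similarity already used for text. A minimal sketch with toy stand-in vectors (in practice these would come from a model's `encode` call; the numbers here are made up for illustration):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dim "embeddings": a text query plus two documents
# (one text chunk, one screenshot) in the same shared space.
query_emb = np.array([0.9, 0.1, 0.0, 0.1])
text_doc_emb = np.array([0.8, 0.2, 0.1, 0.0])   # text chunk
image_doc_emb = np.array([0.1, 0.0, 0.9, 0.2])  # screenshot

# One similarity function works regardless of source modality.
print(cosine_sim(query_emb, text_doc_emb) > cosine_sim(query_emb, image_doc_emb))
```

The retrieval logic never needs to know which modality produced a vector; ranking is just a sort over these scores.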
## What's Actually New
Sentence Transformers 5.4 adds:
- Multimodal encoding: `model.encode()` now handles images, audio, and video alongside text
- Cross-modal reranking: score relevance between mixed-modality pairs
- Unified API: no new abstractions—load a model, encode inputs, compute similarity
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

# Encode different modalities
text_emb = model.encode("quarterly revenue report")
img_emb = model.encode("path/to/screenshot.png")

# Same similarity function you already use
similarity = model.similarity(text_emb, img_emb)
```
The reranker models extend similarly—you can score pairs where one or both elements are images.
## Why This Matters for RAG
Most production RAG systems are text-only. When users ask about visual content, you either:
- Run OCR on everything (slow, lossy)
- Ignore it entirely (incomplete)
- Build a parallel image search system (complex, disconnected)
Multimodal embeddings collapse these into one pipeline. Your retrieval step can surface relevant images alongside text chunks without OCR preprocessing or separate indices.
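"One pipeline" concretely means one index holding text and image embeddings side by side, searched with a single top-k pass. A sketch with stand-in vectors (file names and embeddings are hypothetical; in a real system each entry's vector would come from the model's encode step):

```python
import numpy as np

def top_k(query_emb, index, k=2):
    """Rank every entry in a mixed-modality index by cosine similarity."""
    scored = []
    for entry in index:
        v = entry["embedding"]
        sim = float(np.dot(query_emb, v) / (np.linalg.norm(query_emb) * np.linalg.norm(v)))
        scored.append((sim, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]

# One index, mixed modalities; embeddings are toy stand-ins.
index = [
    {"modality": "text",  "source": "q3_report.md",     "embedding": np.array([0.9, 0.1, 0.1])},
    {"modality": "image", "source": "error_screen.png", "embedding": np.array([0.1, 0.9, 0.1])},
    {"modality": "image", "source": "q3_chart.png",     "embedding": np.array([0.8, 0.2, 0.2])},
]

query_emb = np.array([1.0, 0.1, 0.1])  # e.g. the encoding of "Q3 projections"
hits = top_k(query_emb, index, k=2)
print([h["source"] for h in hits])  # text and image results interleaved in one ranking
```

The point is that the image entries need no OCR text and no separate store; they compete in the same ranking as the text chunks.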
The reranking layer matters here too. Cross-encoder rerankers have been essential for text RAG because they score query-document pairs more accurately than embedding similarity alone. Multimodal rerankers extend that to visual documents.
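The two-stage pattern carries over unchanged: embedding similarity produces cheap candidates, and a reranker rescores only that shortlist. The scorers below are deliberately trivial stubs (keyword overlap) standing in for an embedding model and a cross-encoder; only the control flow is the point:

```python
def retrieve(query, corpus, fast_score, k=2):
    """Stage 1: cheap scoring over the whole corpus, keep top-k."""
    return sorted(corpus, key=lambda doc: fast_score(query, doc), reverse=True)[:k]

def rerank(query, candidates, slow_score):
    """Stage 2: expensive, more accurate scoring on the shortlist only."""
    return sorted(candidates, key=lambda doc: slow_score(query, doc), reverse=True)

# Stub scorers: raw keyword overlap (fast) vs. overlap normalized by length (slow).
fast = lambda q, d: len(set(q.split()) & set(d.split()))
slow = lambda q, d: len(set(q.split()) & set(d.split())) / len(d.split())

corpus = [
    "q3 revenue chart screenshot",
    "q3 projections slide deck",
    "error message screenshot from the login page",
    "meeting notes",
]
candidates = retrieve("q3 projections", corpus, fast, k=2)
best = rerank("q3 projections", candidates, slow)[0]
print(best)
```

Swapping the stubs for a multimodal embedder and reranker changes the scorers, not the pipeline shape.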
## The Hardware Reality
There's a catch. VLM-based models like Qwen3-VL-2B need ~8GB VRAM. The 8B variants need ~20GB. CPU inference is "extremely slow" per the docs—CLIP and text-only models are better suited there.
For production systems with GPU infrastructure, this is manageable. For edge deployments, you'll want smaller models or cloud inference.
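Those VRAM figures follow roughly from parameter count: fp16 weights take two bytes per parameter, with activations, caches, and runtime state on top. A back-of-the-envelope check (the fixed 4 GB allowance is a rough assumption chosen to line up with the quoted figures, not a measured number):

```python
def rough_vram_gb(params_billions, bytes_per_param=2, overhead_gb=4):
    """fp16 weights plus a rough fixed allowance for activations and runtime state."""
    return params_billions * bytes_per_param + overhead_gb

print(rough_vram_gb(2))  # 2B params -> 4 GB of weights, landing near the ~8 GB figure
print(rough_vram_gb(8))  # 8B params -> 16 GB of weights, near the ~20 GB figure
```

Quantized weights (int8 or 4-bit) would shrink the first term accordingly, which is the usual lever when a model almost fits.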
## The Practical Impact
This changes what you can retrieve:
- Visual document RAG: Search PDFs with embedded charts, screenshots, and diagrams
- Cross-modal search: Find video clips from text descriptions
- Multimodal deduplication: Identify near-duplicates across modalities
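The deduplication case falls out of the shared space directly: two items are near-duplicates when their embeddings exceed a similarity threshold, regardless of which modality each came from. A sketch with toy vectors (the threshold and the embeddings are illustrative assumptions):

```python
import numpy as np

def near_duplicates(items, threshold=0.95):
    """Return pairs of items whose embeddings are nearly identical."""
    pairs = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            a, b = items[i]["embedding"], items[j]["embedding"]
            sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            if sim >= threshold:
                pairs.append((items[i]["source"], items[j]["source"]))
    return pairs

# A chart image and the text chunk describing it land close together.
items = [
    {"source": "q3_chart.png",   "embedding": np.array([0.70, 0.69, 0.10])},
    {"source": "q3_summary.txt", "embedding": np.array([0.69, 0.70, 0.11])},
    {"source": "cat_photo.jpg",  "embedding": np.array([0.05, 0.10, 0.99])},
]
print(near_duplicates(items))
```

The pairwise loop is quadratic and fine for a sketch; at corpus scale this becomes an approximate nearest-neighbor query against the same index used for retrieval.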
The API stays familiar. The infrastructure requirements shift. The use cases expand.
## What We're Still Missing
The release handles encoding and reranking well, but production multimodal RAG needs more:
- Index efficiency: FAISS and similar indices weren't designed for mixed-modality queries
- Chunking strategies: How do you chunk a video? What about image grids?
- Evaluation frameworks: BEIR and MTEB are text-only; multimodal benchmarks are sparse
These will get solved. The embedding layer is now in place.
The gap between text RAG and multimodal RAG just got smaller. The question is whether your retrieval pipeline can handle what's now possible.