Google’s Gemini Embedding 2 is a powerful tool for developers working with text, images, video, audio, and documents. It unifies all these content types in a single embedding space, streamlining the process of building multimodal AI applications. Released in March 2026, Gemini Embedding 2 is Google’s first model to natively process multiple media types without separate pipelines.
If you're implementing semantic search, RAG systems, or testing APIs that handle different types of media, this model can simplify your stack and boost both coverage and accuracy.
What Makes Gemini Embedding 2 Different?
Traditional embedding models are siloed—text embeddings for text, image embeddings for images, etc. Gemini Embedding 2 breaks this pattern by mapping these formats into one embedding space.
Supported input types per request:
- Text: Up to 8,192 tokens
- Images: Up to 6 images
- Video: Up to 128 seconds
- Audio: Up to 80 seconds
- PDF Documents: Up to 6 pages
This means you can search across all these formats with a single query—ask a question in text and retrieve the most relevant videos, images, or documents.
Key Features You Need to Know
1. Interleaved Multimodal Input
Mix content types in a single request. For example, send an image and text together, or combine video and audio. The model understands how these elements relate, so if your data is inherently multimodal (like product listings with images, descriptions, and demos), you get a unified embedding that captures all relationships.
2. Matryoshka Representation Learning (MRL)
Gemini Embedding 2 outputs 3,072-dimensional vectors by default, but you can truncate down to as low as 768 dimensions with minimal accuracy loss. This is efficient for storage and retrieval:
- Full (3,072): Maximum quality
- Medium (1,536): Balance
- Compact (768): ~75% less storage, near-peak quality
Use higher dimensions during development or for critical tasks, then drop to 768 for production to optimize storage and costs.
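Truncation under MRL is just slicing off the leading dimensions and re-normalizing. A minimal sketch in plain Python (the 3,072-dim vector below is random stand-in data, not real model output):

```python
import math
import random

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize the result.
    MRL-trained models pack the most important information into the
    leading dimensions, so truncation loses little quality."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Stand-in for a full 3,072-dim embedding returned by the API.
full = [random.gauss(0, 1) for _ in range(3072)]

compact = truncate_embedding(full, 768)
print(len(compact))  # 768
```

Cosine similarity still works on the truncated vectors as long as both sides of a comparison are cut to the same length.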
3. Custom Task Instructions
Specify your task with parameters:
- `RETRIEVAL_QUERY` – for search queries
- `RETRIEVAL_DOCUMENT` – for indexing documents
- `SEMANTIC_SIMILARITY` – compare content
- `CLASSIFICATION` – for categorization
The model adjusts embeddings based on your use case, improving results without retraining.
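To keep task types from drifting into typos, it helps to validate them before a request goes out. The request-builder below is purely illustrative: the field names in the returned dict are assumptions for the sketch, not the documented Gemini API wire format.

```python
# Hypothetical request builder -- the payload field names are
# assumptions for illustration, not the official API schema.
VALID_TASK_TYPES = {
    "RETRIEVAL_QUERY",      # embedding a search query
    "RETRIEVAL_DOCUMENT",   # embedding documents for an index
    "SEMANTIC_SIMILARITY",  # comparing two pieces of content
    "CLASSIFICATION",       # embeddings used as classifier features
}

def build_embed_request(content, task_type, output_dim=768):
    if task_type not in VALID_TASK_TYPES:
        raise ValueError(f"unknown task_type: {task_type}")
    return {
        "model": "gemini-embedding-2-preview",
        "content": content,
        "task_type": task_type,
        "output_dimensionality": output_dim,
    }

req = build_embed_request("how to fix a leaky faucet", "RETRIEVAL_QUERY")
```

Queries and documents get different task types on purpose: the model embeds a short question and a long indexed passage differently so that they land near each other when they match.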
4. Native Audio Processing
Gemini Embedding 2 processes audio directly, capturing tone and context that transcription-based models miss.
Technical Specifications
Text
- 8,192 tokens per request
- 100+ languages
- Handles code and long documents
Images
- Up to 6 per request
- PNG, JPEG formats
Video
- Up to 128 seconds
- MP4, MOV (H.264, H.265, AV1, VP9)
Audio
- Up to 80 seconds
- MP3, WAV
- No transcription required
PDF Documents
- Up to 6 pages per request
- Handles both text and visuals
- Built-in OCR
Real-World Use Cases
Semantic Search Across Media Types
Build search engines that return relevant content in any format. Example: A query for “how to fix a leaky faucet” returns:
- Tutorial videos
- Step-by-step text articles
- Diagram images
- Audio instructions
All ranked for relevance in a single query.
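Because every item lives in the same space, cross-format ranking reduces to nearest-neighbor search by cosine similarity. A toy sketch with hand-made 3-dim vectors standing in for real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy index: one vector per item, any media type, same space.
index = {
    "faucet-repair.mp4":  [0.9, 0.1, 0.0],
    "faucet-article.txt": [0.8, 0.2, 0.1],
    "cat-photo.png":      [0.0, 0.1, 0.9],
}

query = [1.0, 0.0, 0.0]  # stand-in for an embedded text query
ranked = sorted(index, key=lambda k: cosine(query, index[k]), reverse=True)
print(ranked[0])  # faucet-repair.mp4
```

In production you would hand this off to a vector database's ANN index rather than a linear scan, but the ranking logic is the same.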
RAG Systems with Multimodal Context
Augment your LLM with context from diverse sources:
- Product descriptions (text)
- User manual pages (PDF)
- Demo videos
- Customer review audio
Embeddings enable retrieval of the most relevant context, regardless of format.
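After retrieval, the chunks (whatever their original modality) get assembled into the LLM prompt with their sources labeled. A minimal sketch, with placeholder chunks:

```python
def assemble_context(chunks):
    """Format retrieved chunks into a context block for the LLM
    prompt, labeling each with its source so the model can cite it."""
    lines = [f"[{c['source']}] {c['text']}" for c in chunks]
    return "Context:\n" + "\n".join(lines)

# Placeholder results from a top-k retrieval step.
retrieved = [
    {"source": "manual.pdf p.3", "text": "Turn off the water supply valve."},
    {"source": "demo.mp4 @0:42", "text": "Unscrew the packing nut."},
]
prompt = assemble_context(retrieved) + "\n\nQuestion: How do I start the repair?"
```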
API Testing with Semantic Similarity
With Apidog, use Gemini embeddings to semantically test API responses. Instead of string matching, compare embeddings for meaning—catching cases where the response wording changes but the intent is preserved. Useful for LLM-powered or natural language APIs.
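The semantic assertion itself boils down to comparing embedding similarity against a threshold instead of checking string equality. A sketch with toy vectors (real ones would come from the embedding API, and the threshold needs tuning per use case):

```python
import math

def semantically_equivalent(emb_a, emb_b, threshold=0.85):
    """True if two embeddings are closer than `threshold` in cosine
    similarity -- a looser check than exact string matching."""
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm = (math.sqrt(sum(x * x for x in emb_a))
            * math.sqrt(sum(x * x for x in emb_b)))
    return dot / norm >= threshold

# Toy embeddings standing in for two API responses that differ in
# wording ("Order placed successfully" vs "Your order was placed").
expected = [0.7, 0.7, 0.1]
actual = [0.68, 0.72, 0.12]
print(semantically_equivalent(expected, actual))  # True
```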
You can also enhance API documentation search—let developers find endpoints by describing what they want, not by memorizing parameter names.
Content Clustering and Organization
Automatically group related content across formats. Product photos, descriptions, and videos cluster by category.
Sentiment Analysis Across Channels
Aggregate feedback from:
- Text reviews
- Video testimonials
- Audio support calls
- Social media images
Get unified sentiment insights across all formats.
Performance and Benchmarks
Gemini Embedding 2 outperforms leading models in text, image, and video benchmarks, with strong speech capabilities and advanced multimodal relationship handling. It sets a new standard for depth and flexibility in embedding use cases.
Pricing
- Text embeddings: $0.20 per million tokens (50% off with batch API)
- Image, audio, video: Standard Gemini API media token rates
For most RAG or search systems, embedding thousands of documents costs just a few dollars.
Gemini Embedding 2 vs. Competitors
| Feature | Gemini Embedding 2 | OpenAI text-embedding-3 | Cohere Embed v3 |
|---|---|---|---|
| Modalities | Text, image, video, audio, PDF | Text only | Text only |
| Max Input | 8,192 tokens (text) | 8,191 tokens | 512 tokens |
| Dimensions | 128-3,072 (flexible) | 256-3,072 | 1,024 |
| Languages | 100+ | 100+ | 100+ |
| Task Instructions | Yes | No | Yes |
| Pricing | $0.20/M tokens | $0.13/M tokens | $0.10/M tokens |
| Best For | Multimodal apps | Text-only apps | Text classification |
The main differentiator is multimodal support. If you need embeddings for more than text, Gemini is the only unified solution.
Integration and Availability
Gemini Embedding 2 (gemini-embedding-2-preview) is available via:
- Gemini API
- Vertex AI
- LangChain
- LlamaIndex
- Haystack
- Weaviate
- Qdrant
- ChromaDB
- Vector Search
Most vector DBs and AI frameworks already support it. Note: The API is in public preview—expect possible changes before general release.
Important Migration Note
The embedding spaces of gemini-embedding-001 and Gemini Embedding 2 are incompatible. Mixing old and new embeddings in the same database won’t work. If you migrate, re-embed your entire dataset.
Output Dimensions: What to Choose
- 3,072: Highest quality, largest storage
- 1,536: Good balance
- 768: Production sweet spot (near-peak quality, 75% smaller)
Most apps should use 768 dimensions to balance quality and storage.
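The storage numbers are easy to sanity-check: at 4 bytes per float32 dimension, one million vectors cost:

```python
def storage_gb(num_vectors, dim, bytes_per_float=4):
    """Raw vector storage in GB, ignoring index overhead."""
    return num_vectors * dim * bytes_per_float / 1e9

for dim in (3072, 1536, 768):
    print(dim, round(storage_gb(1_000_000, dim), 2), "GB")
# 3072 -> 12.29 GB, 1536 -> 6.14 GB, 768 -> 3.07 GB
```

768 dimensions use exactly a quarter of the space of 3,072, which is where the 75% savings figure comes from; real vector databases add index overhead on top of these raw numbers.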
When to Use Gemini Embedding 2
Choose Gemini Embedding 2 if:
- You have multimodal data (text, images, video, audio)
- You need semantic search across formats
- You're building RAG with diverse sources
- You're clustering or classifying mixed-media content
- You want embeddings that capture relationships between modalities
Stick with text-only models if:
- Your data is only text
- You need the absolute best text-only performance
- You can’t re-generate existing embeddings
What This Means for Developers
Gemini Embedding 2 makes multimodal AI simpler:
- One model for all content types
- One embedding space, one vector DB
- Streamlined search and retrieval logic
Matryoshka means you can tune embedding size to your needs. Task instructions let you adapt embeddings without custom training.
Getting Started
- Get a Gemini API key from Google AI Studio.
- Install the Google Generative AI SDK.
- Call the embedding endpoint with your content.
- Store embeddings in your vector database.
- Use for search, RAG, or classification.
Example (Python, using the `google-generativeai` SDK; the model name follows the preview announcement, and the exact call surface may change before general release):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

result = genai.embed_content(
    model="models/gemini-embedding-2-preview",
    content="Here is my text.",
    task_type="RETRIEVAL_QUERY",
    output_dimensionality=768,
)
embedding = result["embedding"]  # list of 768 floats
```

Adjust `task_type` and `output_dimensionality` as needed.
The Bottom Line
Gemini Embedding 2 unifies multimodal AI development—covering text, images, video, audio, and documents in one space. Matryoshka dimensions give you flexibility; task instructions increase task-specific accuracy. Native audio support preserves context that other models lose.
If you're building apps that span multiple content types, this model is worth testing. Public preview is live via Gemini API and Vertex AI.
For semantic search, RAG, or content understanding, Gemini Embedding 2 reduces code complexity and increases coverage. And if you're testing APIs with Apidog, use these embeddings for validating semantic similarity—especially for LLM-powered endpoints.

