Patrick Loeber for Google AI

Multimodal RAG with the Gemini API File Search Tool: A Developer Guide

The File Search tool in the Gemini API now supports multimodal retrieval via the new gemini-embedding-2 model. This update allows images, such as charts, product photos, and diagrams, to be natively indexed and searched in the same store as your text-based documents.

This post covers how to use the File Search tool end-to-end: creating a store, uploading documents and images, querying with grounded generation, and retrieving image citations.

What is File Search?

File Search is the Gemini API's built-in RAG tool. When you upload your documents, the API takes care of the heavy lifting: chunking, embedding, indexing, and retrieval. At query time, pass a file_search tool alongside your prompt, and the model automatically retrieves relevant chunks from your data to generate a grounded response.

Compared to rolling your own RAG pipeline, File Search offers:

  • Fully managed: No vector databases to provision or embedding pipeline to maintain.
  • Cost-effective: Storage and query-time embeddings are free. You only pay for the initial indexing embeddings and the standard Gemini input/output tokens.
  • Built-in citations: Every response includes grounding metadata that links the answer to specific documents and pages. For multimodal stores, citations also include downloadable image references.
  • Native image search: With the gemini-embedding-2 model, images are embedded directly rather than relying on OCR, enabling true visual retrieval.

Try It in AI Studio

Want to see multimodal File Search in action before writing any code? We built an example app in AI Studio that lets you chat with your documents and image library. Upload PDFs and images, then ask questions. The app retrieves relevant text and visuals in real time, complete with citations and page numbers so you can trace every answer back to its source.

Getting Started

Step 1: Create a File Search Store

A File Search Store is a persistent container for your document embeddings. Think of it as a managed vector database scoped to a project.

To enable multimodal search over images, specify gemini-embedding-2 as the embedding model. This parameter is optional; if omitted, the store defaults to gemini-embedding-001, which is cost-optimized for text-only workloads. Note that the embedding model is fixed at store creation and cannot be changed later.

To use the new features, make sure to install the latest Python SDK: pip install -U google-genai.

from google import genai
from google.genai import types

client = genai.Client()

# Create a multimodal store with gemini-embedding-2
# Omit embedding_model to use the default text-only model (gemini-embedding-001)
file_search_store = client.file_search_stores.create(
    config={
        "display_name": "product-catalog",
        "embedding_model": "models/gemini-embedding-2"
    }
)
print(f"Created store: {file_search_store.name}")
| Embedding Model | Best For |
| --- | --- |
| gemini-embedding-001 (default) | Text-heavy workloads, cost-optimized |
| gemini-embedding-2 | Multimodal retrieval (documents and images) |

Step 2: Upload Documents and Images

The simplest path is the upload_to_file_search_store method, which uploads and indexes a file in one step. With gemini-embedding-2, this works for both documents and images:

Note: Audio and video formats are currently not supported.

import time

# Upload a PDF document
operation = client.file_search_stores.upload_to_file_search_store(
    file_search_store_name=file_search_store.name,
    file="product_catalog.pdf",
    config={"display_name": "Product Catalog"}
)

# Wait for ingestion to complete
while not operation.done:
    time.sleep(5)
    operation = client.operations.get(operation)

# Upload product images directly
for image_file in ["sneaker_red.png", "sneaker_blue.jpeg", "sneaker_white.png"]:
    op = client.file_search_stores.upload_to_file_search_store(
        file_search_store_name=file_search_store.name,
        file=image_file,
        config={"display_name": image_file}
    )
    while not op.done:
        time.sleep(5)
        op = client.operations.get(op)

print("All files indexed!")

Behind the scenes, the API chunks documents, generates embeddings, and indexes the content. When using gemini-embedding-2, images within PDFs are also natively embedded alongside the text.
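Since every upload returns a long-running operation that must be polled until `done`, the repeated loops above can be factored into a small helper. This is just a sketch; the `get_fn` parameter is injected so the same function works for any operation-style object:

```python
import time

def wait_for(operation, get_fn, poll_secs=5):
    """Poll a long-running operation until it reports done, then return it."""
    while not operation.done:
        time.sleep(poll_secs)
        operation = get_fn(operation)
    return operation

# Usage with the Gemini client:
# op = client.file_search_stores.upload_to_file_search_store(...)
# op = wait_for(op, client.operations.get)
```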

You can also import existing files from the Files API into a store.

Step 3: Query with File Search

Query your data by passing the file_search tool to generate_content:

response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Which sneakers come in red?",
    config={
        "tools": [{
            "file_search": {
                "file_search_store_names": [file_search_store.name]
            }
        }]
    }
)

print(response.text)

The system performs a file search to find the most relevant chunks from the File Search store, and uses them to generate a grounded response.

Step 4: Inspect Citations and Retrieve Images

Every File Search response includes grounding metadata — essentially, a bibliography for the model's answer. It captures page numbers for the indexed information, allowing applications to point users directly to the right spot in a document. This is especially useful for rigorous fact-checking over large PDFs.

With multimodal stores, citations can include a media_id for referenced images, which can be downloaded directly:

grounding = response.candidates[0].grounding_metadata

for chunk in grounding.grounding_chunks:
    ctx = chunk.retrieved_context
    if ctx.media_id:
        # This is an image citation — download it
        print(f"Cited image: {ctx.title}")
        print(f"   Media ID: {ctx.media_id}")

        blob = client.file_search_stores.download_media(
            media_id=ctx.media_id
        )
        with open(f"cited_{ctx.title}.png", "wb") as f:
            f.write(blob)
    else:
        # Text citation with exact page number
        print(f"Cited text: {ctx.title}")
        if ctx.page_number:
            print(f"   Page: {ctx.page_number}")
        print(f"   {ctx.text[:200]}...")

# See which parts of the response are grounded in which sources
for support in grounding.grounding_supports:
    print(f"Claim: '{support.segment.text}'")
    print(f"  Grounded in chunks: {support.grounding_chunk_indices}")

This is powerful for building user-facing applications. It's now possible to show users the actual images the model used in its reasoning, not just a text description.
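For instance, the grounding supports can be rendered as inline footnote markers in the answer. Here is a minimal sketch using plain dicts in place of the SDK's response objects (the field shapes mirror grounding_supports and grounding_chunks above, but this helper is illustrative, not part of the SDK):

```python
def annotate(answer, supports, chunk_titles):
    """Append [n] citation markers to each grounded segment of the answer.

    supports: list of {"text": segment_text, "chunks": [chunk indices]}
    chunk_titles: list of source titles, indexed by chunk index.
    """
    for s in supports:
        marks = "".join(f"[{i + 1}]" for i in s["chunks"])
        answer = answer.replace(s["text"], s["text"] + marks)
    refs = "\n".join(f"[{i + 1}] {t}" for i, t in enumerate(chunk_titles))
    return answer + "\n\n" + refs
```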

Managing Stores

Here's a quick reference for managing stores and documents:

# List all stores
for store in client.file_search_stores.list():
    print(f"{store.name}: {store.display_name}")

# List documents in a store
for doc in client.file_search_stores.documents.list(parent=file_search_store.name):
    print(f"  {doc.name}")

# Delete a specific document
client.file_search_stores.documents.delete(
    name="fileSearchStores/my-store/documents/old_doc"
)

# Delete an entire store (force=True also deletes all contained documents)
client.file_search_stores.delete(
    name=file_search_store.name,
    config={"force": True}
)

Power Features

Custom Metadata and Filtering

You can attach metadata to documents at upload time and use it to filter at query time. This is essential when a store contains diverse documents and searches need to be scoped:

# Upload with metadata
op = client.file_search_stores.upload_to_file_search_store(
    file_search_store_name=file_search_store.name,
    file="shoes_collection.pdf",
    config={
        "display_name": "Spring 2026 Shoes",
        "custom_metadata": [
            {"key": "category", "string_value": "footwear"},
            {"key": "season", "string_value": "spring-2026"},
            {"key": "price_tier", "numeric_value": 2}
        ]
    }
)

# Query with a metadata filter
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Do you have blue spring shoes?",
    config={
        "tools": [{
            "file_search": {
                "file_search_store_names": [file_search_store.name],
                "metadata_filter": 'category="footwear" AND season="spring-2026"',
            }
        }]
    }
)
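The filter string uses a simple key="value" grammar joined with AND. As a convenience, a small helper (an illustration, not part of the SDK) can build it from a dict of string-valued criteria:

```python
def build_filter(criteria):
    """Build a metadata filter string from string-valued criteria.

    {"category": "footwear", "season": "spring-2026"}
      -> 'category="footwear" AND season="spring-2026"'
    """
    return " AND ".join(f'{k}="{v}"' for k, v in criteria.items())
```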

Structured Output

Starting with Gemini 3 models, File Search can be combined with structured output. This is perfect for extracting structured data from grounded responses:

from pydantic import BaseModel, Field

class ProductMatch(BaseModel):
    name: str = Field(description="Product name")
    description: str = Field(description="Brief product description")
    confidence: str = Field(description="How confident the match is")

response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Find products similar to a red running shoe",
    config={
        "tools": [{
            "file_search": {
                "file_search_store_names": [file_search_store.name]
            }
        }],
        "response_mime_type": "application/json",
        "response_schema": ProductMatch.model_json_schema()
    }
)

Chunking Configuration

For more control over how documents are split, the chunking strategy can be configured:

operation = client.file_search_stores.upload_to_file_search_store(
    file_search_store_name=file_search_store.name,
    file="long_document.pdf",
    config={
        "display_name": "Technical Manual",
        "chunking_config": {
            "white_space_config": {
                "max_tokens_per_chunk": 200,
                "max_overlap_tokens": 20
            }
        }
    }
)
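To build intuition for what max_tokens_per_chunk and max_overlap_tokens control, here is a toy whitespace chunker. This is not the service's actual algorithm; it simply treats each whitespace-separated word as one token:

```python
def chunk_words(text, max_tokens=200, overlap=20):
    """Split text into chunks of up to max_tokens words, each chunk
    sharing its first `overlap` words with the end of the previous one."""
    words = text.split()
    step = max_tokens - overlap
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), step)
    ]
```

Larger overlap preserves more context across chunk boundaries at the cost of some duplicated storage and indexing.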

Use Cases

With multimodal retrieval, File Search opens up scenarios that text-only RAG can't handle:

  • Visual product search: Index catalogs with images and spec sheets, then search by visual similarity or natural language descriptions.
  • Research and technical documentation: Retrieve specific charts, architecture diagrams, or data visualizations from papers and reports.
  • Insurance and claims processing: Combine structured forms with damage photos for unified document and visual assessment.
  • Design systems: Make component libraries searchable by visual appearance, not just naming conventions.
  • Real estate and property listings: Match properties based on floor plans, interior photos, and visual preferences.

Pricing

File Search is designed to be cost-effective:

  • Indexing: You pay for embeddings at indexing time (embeddings pricing).
  • Storage: Free.
  • Query-time embeddings: Free.
  • Retrieved tokens: Charged as regular context tokens.
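As a back-of-the-envelope sketch, one-time indexing cost scales with token count alone; the rate used below is a placeholder for illustration, not the published price (check the embeddings pricing page for the actual rate):

```python
def indexing_cost(total_tokens, usd_per_million_tokens):
    """Estimate one-time indexing cost in USD.

    Storage and query-time embeddings are free, so this is the only
    File Search-specific charge beyond normal context tokens.
    """
    return total_tokens / 1_000_000 * usd_per_million_tokens

# e.g. 10M tokens at a hypothetical $0.15 per million tokens:
# indexing_cost(10_000_000, 0.15)  # about $1.50, once
```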

Get Started

Here's everything needed to get started: create your store with gemini-embedding-2, upload some documents and images, and start building multimodal RAG applications.
