Gao Dalie (Ilyass)

Posted on May 17

I Built a 10x Cheaper Enterprise Gemini Multimodal RAG File Search

#machinelearning #ai #programming #datascience

This week in the AI industry is one you can’t afford to miss. In just seven days, four players — Anthropic, OpenAI, Google, and the US government — made major moves at the same time.

Google has announced a powerful update that further expands the functionality of the Gemini API. The highlights of this update are enhanced support for “File Search” and “Multimodal RAG”.

The Gemini API’s File Search has finally evolved into a RAG tool that can search images as well.

The point isn’t that Gemini can now understand images; it could do that before. What’s interesting is that the managed File Search store for RAG now handles image embedding, search, metadata filtering, and citations.

RAG stands for “Retrieval-Augmented Generation,” a system where AI generates answers by referencing internal company documents, manuals, and databases. Previously, File Search only supported text, but with this update it can also search documents containing image information, such as PDF diagrams, UI screenshots, product images, and graphs, and understand their meaning.

Actual business documents are not “just text.” Reports with charts, manuals with screenshots, and design documents with diagrams — these have finally become “true searchable items” for AI.

So, let me give you a quick demo of the live chatbot to show you how everything works.

Link to Demo: https://www.youtube.com/watch?v=a2qjEB14Rc8

I searched Google for a financial PDF that contains a complex layout with tables, charts, and graphs.

I ran Python code and uploaded the PDF. The app supports uploading multiple documents. After uploading, I clicked “Upload & index.”

When this happens, the agent first checks whether the filename already exists in the store to avoid duplicates. If it is new, the file is sent to a thread pool inside the agent environment.
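The duplicate check and thread-pool hand-off could look roughly like this. It is a minimal sketch, not the app's actual code: submit_new_files, existing_names, and upload_fn are hypothetical names, and upload_fn stands in for whatever performs the real upload.

```python
from concurrent.futures import ThreadPoolExecutor


def submit_new_files(files, existing_names, upload_fn, max_workers=4):
    """Skip files whose name is already in the store; upload the rest in parallel.

    files: dict mapping filename -> raw bytes (hypothetical shape).
    existing_names: set of filenames already indexed in the store.
    upload_fn: callable(filename, data) -> result; a stand-in for the real upload.
    Returns (results, skipped): results maps filename -> upload_fn's return value.
    """
    skipped = [name for name in files if name in existing_names]
    to_upload = {name: data for name, data in files.items() if name not in existing_names}

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(upload_fn, name, data) for name, data in to_upload.items()}
        for name, future in futures.items():
            results[name] = future.result()  # re-raises any exception from the worker
    return results, skipped


# Usage with a fake uploader that only reports the payload size.
results, skipped = submit_new_files(
    {"q1.pdf": b"abc", "q2.pdf": b"defgh"},
    existing_names={"q1.pdf"},
    upload_fn=lambda name, data: len(data),
)
```

Matching on filename alone is the simplest possible check; two different files with the same name would be treated as duplicates.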

The uploaded file (which exists only in memory) is written to a temporary file on disk because the Gemini SDK requires a real file path. The agent then detects the file’s MIME type from its extension and attempts to upload it using upload_to_file_search_store().
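Here is a small standard-library sketch of that staging step. The helper name stage_upload and the octet-stream fallback are my own assumptions for illustration; the actual upload call is left out.

```python
import mimetypes
import os
import tempfile


def stage_upload(filename, data):
    """Write in-memory bytes to a temp file and guess the MIME type from the name.

    Returns (path, mime_type). The caller is responsible for deleting the file.
    """
    suffix = os.path.splitext(filename)[1]  # keep the extension, e.g. ".pdf"
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
        path = tmp.name
    mime_type, _ = mimetypes.guess_type(filename)
    # Fall back to a generic binary type when the extension is unknown.
    return path, mime_type or "application/octet-stream"


path, mime_type = stage_upload("report.pdf", b"%PDF-1.7 demo bytes")
try:
    with open(path, "rb") as fh:
        staged = fh.read()
finally:
    os.remove(path)  # clean up the temporary file once it is no longer needed
```

Using delete=False keeps the file on disk after the with block closes it, so a path-based upload can read it; the trade-off is that cleanup becomes the caller's job.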

If the direct upload fails (for example, if the API does not support that file type in that path), the agent switches to a fallback process.
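The try-then-fallback pattern can be sketched generically. Both upload callables below are hypothetical placeholders, since what the fallback path actually does is not spelled out here.

```python
def upload_with_fallback(path, direct_upload, fallback_upload):
    """Try the direct upload first; on failure, retry through the fallback path.

    Both callables take a file path and return an operation-like object.
    Returns (operation, route) where route is 'direct' or 'fallback'.
    """
    try:
        return direct_upload(path), "direct"
    except Exception as exc:  # broad on purpose: any direct failure triggers the fallback
        print(f"Direct upload failed ({exc}); trying fallback")
        return fallback_upload(path), "fallback"


# Fake uploaders: the direct one always fails, the fallback echoes the path.
def failing_direct(path):
    raise ValueError("unsupported file type")


operation, route = upload_with_fallback(
    "chart.png", failing_direct, lambda p: {"file": p}
)
```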

In both cases, the result is an Operation object, which represents an asynchronous backend process. The agent then enters a polling loop: it checks operation.done, starting with a short delay (2 seconds) and gradually increasing the wait time to avoid excessive API calls.

Once the process finishes, the agent checks for operation.error. If something goes wrong, it displays the error code and message to the user. If successful, it fetches the indexed document and verifies that its state is ACTIVE, ensuring it is fully searchable.
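The polling-with-backoff idea, plus the error check, might look like the sketch below. The 1.5x growth factor and 30-second cap are assumptions (only the 2-second starting delay comes from the description above), and refresh is a hypothetical callable that would wrap the SDK's operation lookup in a real app.

```python
import time
from types import SimpleNamespace


def wait_for_operation(operation, refresh, sleep=time.sleep,
                       initial_delay=2.0, factor=1.5, max_delay=30.0):
    """Poll until operation.done is true, backing off between checks.

    refresh: callable(operation) -> updated operation (hypothetical hook).
    Raises RuntimeError if the finished operation carries an error.
    """
    delay = initial_delay
    while not operation.done:
        sleep(delay)
        operation = refresh(operation)
        delay = min(delay * factor, max_delay)  # grow the wait, but cap it
    error = getattr(operation, "error", None)
    if error:
        raise RuntimeError(f"Indexing failed: {error}")
    return operation


# Fake operation that reports done on the third refresh, so no network is needed.
state = {"checks": 0}


def fake_refresh(op):
    state["checks"] += 1
    return SimpleNamespace(done=state["checks"] >= 3, error=None)


waits = []  # records the delays instead of actually sleeping
finished = wait_for_operation(
    SimpleNamespace(done=False, error=None), fake_refresh, sleep=waits.append
)
```

Passing sleep as a parameter makes the loop easy to test without waiting; in production the default time.sleep is used.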

When the user asks a question, the agent wraps it in a types.Content(role="user") object and calls generate_content() with the File Search tool enabled and connected to the store.

Then, the agent retrieves the most relevant chunks from the indexed documents, injects them into the prompt context, generates a grounded response, and checks it for safety issues or empty outputs.
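The safety and empty-output check can be written as a small duck-typed helper. This is a sketch under assumptions: it expects a response object exposing candidates, finish_reason, and text attributes, and the error messages are invented for the example.

```python
from types import SimpleNamespace


def extract_answer(response):
    """Return (text, problem); exactly one of the two is None."""
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return None, "no candidates returned"
    reason = getattr(candidates[0], "finish_reason", None)
    # Enum members expose .name; plain strings are used as-is.
    reason_name = getattr(reason, "name", reason)
    if reason_name == "SAFETY":
        return None, "response blocked by safety filters"
    text = (getattr(response, "text", None) or "").strip()
    if not text:
        return None, "model returned an empty answer"
    return text, None


# Stand-in response object; a real one would come from generate_content().
demo = SimpleNamespace(
    candidates=[SimpleNamespace(finish_reason="STOP")],
    text="  Revenue rose in Q3.  ",
)
answer, problem = extract_answer(demo)
```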

If citation_metadata is included, the agent loops through the citations and displays them as clickable references back to the original source documents.

What is File Search?

For those of you wondering what File Search is, it’s a RAG tool built into the Gemini API. It creates a repository (fileSearchStores) where you can put your documents, and when you call generateContent, it automatically “picks up only the relevant parts and passes them to the model.” Its biggest selling point is that you don’t need to manage your own vector database.

Three key features of the new function

According to the official blog, the core of this upgrade consists of three things:

True Native Multimodal File Search

In the past, File Search was a pure text search; images could only be entered into the store by being converted into text using OCR.

“File Search now processes images and text together. Powered by the Gemini Embedding 2 model, the tool understands native image data.”

Now you can directly upload images to the File Search Store and have them indexed along with text.

The system runs on Gemini Embedding 2, where text, images, videos, audio, and files all share the same vector space.

This means you can search across different types of content without manually connecting them.

For example, you can find text using an image, find an image using text, or even find similar images using another image.
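To make the shared-vector-space idea concrete, here is a toy sketch of cross-modal retrieval with cosine similarity. The three-dimensional vectors are invented for illustration and are not real Gemini embeddings; File Search does all of this for you on the server side.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy 3-dimensional vectors invented for illustration; real embeddings come
# from the model and have far more dimensions.
index = {
    ("text", "Q3 revenue summary"): [0.9, 0.1, 0.0],
    ("image", "revenue_bar_chart.png"): [0.8, 0.2, 0.1],
    ("image", "office_photo.jpg"): [0.0, 0.1, 0.9],
}


def search(query_vector, top_k=2):
    """Rank every item, regardless of modality, by similarity to the query."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query_vector, kv[1]), reverse=True)
    return [key for key, _ in ranked[:top_k]]


# A text-like query vector retrieves both the text chunk and the related image.
hits = search([1.0, 0.0, 0.0])
```

Because text and images live in one index, a single ranking pass covers both; no second store or merge step is needed.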

For us product developers, this means:

Text-and-image mixed search is no longer a research topic but an API call.

There is no need to maintain two stores (one for text chunks and one for CLIP-style image embeddings).

Scientific charts, UI screenshots, reports, photo albums… things that previously lost most of their semantic meaning after OCR can now be retrieved while retaining their original visual information.

Custom Metadata and Server-side Filtering

Every file you put into the store can now be tagged with key-value pairs:

{"key": "user_id", "string_value": "U1234abcd..."}
{"key": "department", "string_value": "Legal"}
{"key": "status", "string_value": "Final"}

When querying, use the google.aip.dev/160 filter syntax (the same format as most GCP list APIs):

metadata_filter='department="Legal" AND status="Final"'

The filtering is done on Google’s end first, instead of scooping up a bunch of data and then discarding it. With less noise, both speed and accuracy should improve, which is a lifesaver for multi-tenant SaaS: a single store with a metadata filter can serve multiple tenants without maintaining a separate store for each.

My WhatsApp bot uses this method to isolate per-user data: each file is uploaded with the WhatsApp user_id as metadata, and a filter on that user_id is applied during queries, so user A never sees user B’s information in the Q&A.
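Two tiny helpers show how the tagging and filtering fit together for per-user isolation. Both function names are hypothetical, and rejecting values that contain quotes is my own conservative choice rather than a rule from the filter syntax.

```python
def to_custom_metadata(tags):
    """Convert {'department': 'Legal'} into the key/string_value list form."""
    return [{"key": k, "string_value": v} for k, v in tags.items()]


def build_metadata_filter(conditions):
    """Join equality conditions with AND, e.g. department="Legal" AND status="Final"."""
    parts = []
    for key, value in conditions.items():
        if '"' in value or "\\" in value:
            # Refuse rather than guess at escaping rules; this also blocks
            # attempts to smuggle extra conditions into the filter string.
            raise ValueError(f"unsupported character in value for {key!r}")
        parts.append(f'{key}="{value}"')
    return " AND ".join(parts)


# At upload time: tag the file with its owner. At query time: filter on the same key.
upload_tags = to_custom_metadata({"user_id": "U1234", "status": "Final"})
query_filter = build_metadata_filter({"user_id": "U1234"})
```

Building the filter from the authenticated user's ID on the server, never from text the user typed, is what keeps the tenants apart.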

Page-level citations

Each quoted excerpt in the response will now be page-numbered.

“captures the page number for every piece of indexed information.”

This is crucial for enterprise clients. Compare “AI tells me Y is mentioned on page X of the contract” with “AI tells me Y is mentioned in the contract.”

The first answer is much easier for legal and auditing teams to trust because they can quickly verify the source.

The second still needs people to manually search through documents, which takes time and effort.

Page numbers close the last mile of the “LLM answers cannot be traced back to their source” problem.
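Once page numbers are available, turning them into reader-friendly references is straightforward. The sketch below assumes each citation has already been flattened into a plain dict with title and page keys; that shape is hypothetical and chosen only for this example.

```python
def format_citations(citations):
    """Turn citation records into unique, sorted 'title, p. N' strings.

    citations: iterable of dicts with 'title' and optional 'page' keys
    (a hypothetical shape chosen for this example).
    """
    seen = set()
    for c in citations:
        seen.add((c["title"], c.get("page")))
    lines = []
    # Sort by title, then page; entries without a page sort first within a title.
    for title, page in sorted(seen, key=lambda tp: (tp[0], tp[1] or 0)):
        lines.append(f"{title}, p. {page}" if page is not None else title)
    return lines


refs = format_citations([
    {"title": "contract.pdf", "page": 12},
    {"title": "contract.pdf", "page": 3},
    {"title": "contract.pdf", "page": 12},  # duplicate excerpt from the same page
    {"title": "notes.txt"},
])
```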

What’s so great about it?

This is the most important part. I’ll summarise what will change in a table.

What’s particularly amazing is that you can now embed images directly. Until now, it was common to build your own preprocessing pipeline like “PDF charts → OCR → text embedding,” but that’s no longer necessary. Information that’s difficult to convert to text, such as slides, diagrams, photos, and UI captures, can now be included in search results.

Furthermore, the billing model is also user-friendly; storage and embedding during queries are free, and you are only charged for “embedding during indexing” and “context tokens consumed by retrieved documents.”

This means that, apart from the context tokens of retrieved documents, the main cost is incurred when you load documents. The biggest expense in RAG operations is usually the cost of embedding, so a design that confines it to a “one-time initial charge” is quite considerate.
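The billing structure can be expressed as a short calculation. The prices are parameters supplied by the caller, and the numbers in the usage example are placeholders picked to make the arithmetic easy, not real Gemini rates.

```python
def estimate_cost(indexed_tokens, queries, retrieved_tokens_per_query,
                  index_price_per_million, input_price_per_million):
    """Split cost into a one-time indexing charge and a per-query context charge.

    Storage and query-time embedding are treated as free, matching the billing
    model described in this post. Prices are caller-supplied; no real rates
    are assumed here.
    """
    indexing = indexed_tokens / 1_000_000 * index_price_per_million
    context = queries * retrieved_tokens_per_query / 1_000_000 * input_price_per_million
    return {"indexing": indexing, "context": context, "total": indexing + context}


# Placeholder prices (1.0 and 2.0 per million tokens), not actual rates.
costs = estimate_cost(
    indexed_tokens=4_000_000,
    queries=1_000,
    retrieved_tokens_per_query=2_000,
    index_price_per_million=1.0,
    input_price_per_million=2.0,
)
```

The indexing term is paid once per document, while the context term scales with query volume, so it is worth keeping an eye on how many retrieved tokens each query pulls in.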

It supports a wide range of models, including Gemini 3.1 Pro Preview, Gemini 3.1 Flash-Lite, Gemini 3 Flash Preview, and Gemini 2.5 Pro/Flash-Lite. The ability to share the same storage between the Pro and Flash-Lite versions is very convenient.
