<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Apoorva Joshi</title>
    <description>The latest articles on DEV Community by Apoorva Joshi (@joshaaayyyy).</description>
    <link>https://dev.to/joshaaayyyy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2902043%2F69d25f8d-a584-4c7a-bfc7-d387d53e811b.jpg</url>
      <title>DEV Community: Apoorva Joshi</title>
      <link>https://dev.to/joshaaayyyy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joshaaayyyy"/>
    <language>en</language>
    <item>
      <title>Building Multimodal AI Applications With MongoDB, Voyage AI, and Gemini</title>
      <dc:creator>Apoorva Joshi</dc:creator>
      <pubDate>Tue, 29 Apr 2025 14:00:00 +0000</pubDate>
      <link>https://dev.to/mongodb/building-multimodal-ai-applications-with-mongodb-voyage-ai-and-gemini-49g3</link>
      <guid>https://dev.to/mongodb/building-multimodal-ai-applications-with-mongodb-voyage-ai-and-gemini-49g3</guid>
      <description>&lt;p&gt;The internet has content spanning multiple mediums, formats, and modalities. However, most LLM-based AI applications in the past few years have primarily dealt with text data. This is because embedding models and LLMs, until recently, were focused mainly on text. But this is changing— LLMs like Gemini and GPT-4o can now process and generate images, audio, and video. Vendors like Voyage AI and Cohere have multimodal embedding models that can embed both images and text. This means we are no longer limited to using just text in our LLM applications.&lt;/p&gt;

&lt;p&gt;In this tutorial, you will learn how to build a RAG application using documents consisting of interleaved text, images, and tables as their source of knowledge. Specifically, we will cover the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is multimodality?&lt;/li&gt;
&lt;li&gt;Processing multimodal data for retrieval&lt;/li&gt;
&lt;li&gt;Evaluating the performance of different multimodal embedding models&lt;/li&gt;
&lt;li&gt;Building a RAG application using multimodal embedding models and LLMs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is multimodality?
&lt;/h2&gt;

&lt;p&gt;Multimodality in the context of AI is the &lt;strong&gt;ability of machine learning models to process, understand, and sometimes generate different types of data, including text, images, audio, video, etc&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this tutorial, you will come across multimodal embedding models as well as LLMs. Multimodal embedding models can take different types of data or modalities as input and map them to the same high-dimensional vector space. Multimodal LLMs can not only accept multiple data formats as input but also generate different modalities as output.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When choosing multimodal embedding models and LLMs for your application, keep in mind that not all models support all modalities. Always check the model specifications to determine which data types are supported as inputs and outputs for any given model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The challenge of multimodality
&lt;/h2&gt;

&lt;p&gt;A typical RAG pipeline for text data looks as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1abcdd6drww431rcxrc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1abcdd6drww431rcxrc.png" alt="RAG pipeline for text data" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above setup, large documents are first broken down into smaller segments or chunks. The text in each chunk is then embedded using a text embedding model, and the raw text of the chunk, along with its embedding, is stored in a vector database. Given a user query, it is embedded using the same embedding model, and relevant chunks are retrieved using vector search or other retrieval techniques. The retrieved documents, along with the user query and any additional prompts, are sent to the LLM to generate an answer.&lt;/p&gt;

&lt;p&gt;This setup works well for documents that contain mainly text data. However, it doesn’t quite work for documents with mixed modalities since traditional text-based chunking can't and shouldn’t be applied directly to images, tables, and other non-text elements. Additionally, effectively retrieving these heterogeneous data types requires specialized embedding models to maintain the contextual relationships between different modalities.&lt;/p&gt;

&lt;h2&gt;
  
  
  CLIP vs. VLM-based multimodal embedding models
&lt;/h2&gt;

&lt;p&gt;Until recently, the architecture of several multimodal embedding models in the market was based on OpenAI’s Contrastive Language-Image Pre-Training (CLIP) embedding model. In this architecture, text and images are processed via independent networks to generate embeddings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qe36bfgp07j7rxx26u6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qe36bfgp07j7rxx26u6.png" alt="CLIP architecture" width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CLIP's separate treatment of images and text necessitates extracting these elements from documents, thus creating complex document parsing pipelines. Additionally, this architecture results in a phenomenon called “&lt;a href="https://arxiv.org/abs/2203.02053" rel="noopener noreferrer"&gt;modality gap&lt;/a&gt;,” wherein unrelated concepts within the same modality often appear closer in vector space than related concepts across different modalities. This gap occurs because the vision and text encoders have fundamentally different architectures, learning representations through different transformations despite mapping to the same vector space.&lt;/p&gt;

&lt;p&gt;This is where a newer embedding model architecture based on Vision Language Models (VLMs) can help.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivamsx42wpzh9uw08o2o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fivamsx42wpzh9uw08o2o.png" alt="VLM-based embedding model architecture" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These models use a single transformer encoder for both images and text, thus creating unified representations that preserve contextual relationships between modalities. This helps minimize the modality gap and eliminates the need for complex document parsing, allowing direct processing of interleaved texts and images, document screenshots, PDFs with complex layouts, annotated images, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a multimodal RAG application
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we will build a multimodal RAG application using MongoDB as the vector store, &lt;code&gt;voyage-multimodal-3&lt;/code&gt;, a multimodal embedding model for images and text from Voyage AI, and &lt;code&gt;Gemini 2.0 Flash&lt;/code&gt;, a multimodal LLM from Google that can interpret and generate text, images, voice, and video. We will also evaluate the performance of &lt;code&gt;voyage-multimodal-3&lt;/code&gt; against OpenAI’s &lt;code&gt;CLIP&lt;/code&gt; on a large, multi-page document consisting of interleaved text and images.&lt;/p&gt;

&lt;p&gt;Here’s an architecture diagram of the system we are building:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvlhqpqyeb5fmcgymrrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvlhqpqyeb5fmcgymrrh.png" alt=" Multimodal RAG architecture" width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given a multi-page PDF document, we will screenshot the PDF pages and write the raw images to blob storage. We will also embed the screenshots using a multimodal embedding model and store the embeddings, along with some metadata, including the references to the raw image files into MongoDB. Given a user query, we first embed it using the same embedding model we used to embed the screenshots and perform a vector search to identify the PDF pages that are most relevant to the query based on their embedded screenshots. The screenshots of these pages are then retrieved from blob storage and passed to a multimodal LLM, along with the user query and other prompts to generate an answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where’s the code?
&lt;/h2&gt;

&lt;p&gt;The Jupyter Notebook for this tutorial can be found on GitHub &lt;a href="https://github.com/mongodb-developer/GenAI-Showcase/blob/main/notebooks/rag/multimodal_rag_mongodb_voyage_ai.ipynb" rel="noopener noreferrer"&gt;in our Gen AI Showcase repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Install required libraries
&lt;/h2&gt;

&lt;p&gt;We will require the following libraries for this tutorial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;pymongo&lt;/strong&gt;: Python driver for MongoDB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;voyageai&lt;/strong&gt;: Python client for Voyage AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;google-genai&lt;/strong&gt;: Python library to access Google's embedding models and LLMs via Google AI Studio&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;google-cloud-storage&lt;/strong&gt;: Python client for Google Cloud Storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sentence-transformers&lt;/strong&gt;: Python library to use open-source ML models from Hugging Face&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PyMuPDF&lt;/strong&gt;: Python library for analyzing and manipulating PDFs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pillow&lt;/strong&gt;: A Python imaging library&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tqdm&lt;/strong&gt;: Show progress bars for loops in Python&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tenacity&lt;/strong&gt;: Python library for adding retries to functions
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;qU&lt;/span&gt; &lt;span class="n"&gt;pymongo&lt;/span&gt; &lt;span class="n"&gt;voyageai&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;cloud&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;storage&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="n"&gt;PyMuPDF&lt;/span&gt; &lt;span class="n"&gt;Pillow&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt; &lt;span class="n"&gt;tenacity&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Set up prerequisites
&lt;/h2&gt;

&lt;p&gt;We will use MongoDB as the vector database for RAG. But first, you will need a MongoDB Atlas account with a database cluster. Once you do that, you will need to get the connection string to connect to your cluster. Follow these steps to get set up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Register for a &lt;a href="https://www.mongodb.com/cloud/atlas/register?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=multimodal_rag_tutorial&amp;amp;utm_term=apoorva.joshi" rel="noopener noreferrer"&gt;free MongoDB Atlas account&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.mongodb.com/docs/guides/atlas/cluster/?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=multimodal_rag_tutorial&amp;amp;utm_term=apoorva.joshi" rel="noopener noreferrer"&gt;Create a new database cluster&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.mongodb.com/docs/guides/atlas/connection-string/?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=multimodal_rag_tutorial&amp;amp;utm_term=apoorva.joshi" rel="noopener noreferrer"&gt;Obtain the connection string&lt;/a&gt; for your database cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you have the connection string, set it in your code as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;getpass&lt;/span&gt;

&lt;span class="n"&gt;MONGODB_URI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getpass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getpass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter your MongoDB connection string: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To use Voyage AI’s embedding models, you will need to &lt;a href="https://docs.voyageai.com/docs/api-key-and-installation#authentication-with-api-keys" rel="noopener noreferrer"&gt;obtain a Voyage AI API key&lt;/a&gt; and set it as an environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VOYAGE_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getpass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getpass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter your Voyage AI API key: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will also need to &lt;a href="https://aistudio.google.com/app/apikey" rel="noopener noreferrer"&gt;obtain a Gemini API key&lt;/a&gt; to use Gemini models via Google AI Studio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;GEMINI_API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getpass&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getpass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enter your Gemini API key: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, you will need to set up Application Default Credentials (ADC) to communicate with Google Cloud Storage using the Python client. &lt;a href="https://cloud.google.com/docs/authentication/set-up-adc-local-dev-environment#google-idp" rel="noopener noreferrer"&gt;Follow the instructions&lt;/a&gt; to provide credentials to ADC.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We are configuring ADC with a Google account for this tutorial, but there are &lt;a href="https://cloud.google.com/docs/authentication/provide-credentials-adc" rel="noopener noreferrer"&gt;other options&lt;/a&gt; that might be better suited for the environment where you are running your RAG application.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 3: Read a PDF from a URL
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we will use the &lt;a href="https://arxiv.org/pdf/2501.12948" rel="noopener noreferrer"&gt;Deepseek-R1 paper&lt;/a&gt; as the source of knowledge for our RAG application. The paper is substantially long and consists of text interleaved with figures and tables. This is a good representation of datasets we see in the real world for RAG, such as financial reports, technical manuals, business proposals, product catalogs, etc.&lt;/p&gt;

&lt;p&gt;Let’s download the Deepseek paper as a PDF document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BytesIO&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pymupdf&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="c1"&gt;# Download the DeepSeek paper
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://arxiv.org/pdf/2501.12948&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to download PDF. Status code: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Load the response as an in-memory file-like object
&lt;/span&gt;&lt;span class="n"&gt;pdf_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Open the object as a PDF document
&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pymupdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pdf_stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filetype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code makes an HTTP GET request to get the content of the Deepseek paper from Arxiv, loads the content as an in-memory file-like object using &lt;code&gt;BytesIO&lt;/code&gt;, and opens the file-like object as a PDF using PyMuPDF.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Store PDF images in GCS and extract metadata for MongoDB
&lt;/h2&gt;

&lt;p&gt;Typically, for RAG applications that work with text-only data, you store the raw chunk text alongside their embeddings in the vector store. However, images, audio, and videos can get large, so we suggest against storing them directly in a database (MongoDB or otherwise) to avoid performance bottlenecks. Instead, store the media files in blob storage like Amazon S3, GCS, etc., and store references to the stored objects in MongoDB documents since these solutions are specifically optimized for storing and retrieving large binary objects at scale at significantly lower storage costs per gigabyte.&lt;/p&gt;

&lt;p&gt;In this tutorial, we want to test the ability of multimodal embedding models and LLMs to process content-rich images consisting of text interspersed with figures and tables. So we will first convert the PDF we downloaded in Step 3 to a set of screenshots. We will store the raw screenshots in GCS and store references to the GCS objects in MongoDB documents.&lt;/p&gt;

&lt;p&gt;Let’s first define a helper function to upload objects to GCS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt;

&lt;span class="c1"&gt;# Set GCS project and bucket
&lt;/span&gt;&lt;span class="n"&gt;GCS_PROJECT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;GCS_BUCKET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tutorials&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiate the GCS client and bucket
&lt;/span&gt;&lt;span class="n"&gt;gcs_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GCS_PROJECT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;gcs_bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gcs_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;upload_image_to_gcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Upload image to GCS.

    Args:
        key (str): Unique identifier for the image in the bucket.
        data (bytes): Image bytes to upload.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gcs_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upload_from_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instantiates a GCS client that connects to the &lt;code&gt;genai&lt;/code&gt; project and a bucket called &lt;code&gt;tutorials&lt;/code&gt; within that project.&lt;/li&gt;
&lt;li&gt;Creates a function called &lt;code&gt;upload_image_to_gcs&lt;/code&gt; that:

&lt;ul&gt;
&lt;li&gt;Takes an object name (&lt;code&gt;key&lt;/code&gt;) and object content (&lt;code&gt;data&lt;/code&gt;) as input.&lt;/li&gt;
&lt;li&gt;Uses the &lt;code&gt;key&lt;/code&gt; to create a GCS object, a.k.a. &lt;code&gt;blob&lt;/code&gt;, in the &lt;code&gt;GCS_BUCKET&lt;/code&gt; bucket, and uploads the object content as bytes to the blob. The content type is set to &lt;code&gt;image/png&lt;/code&gt; to allow browsers to render the image content correctly.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Now, let’s iterate through the pages of the PDF we downloaded in Step 3, convert each page into an image, write the raw images to GCS, and extract the metadata to write to MongoDB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="n"&gt;zoom&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;
&lt;span class="n"&gt;mat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pymupdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;zoom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;zoom&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Iterate through the pages of the PDF
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_count&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;temp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="c1"&gt;# Render the PDf as an image
&lt;/span&gt;    &lt;span class="n"&gt;pix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get_pixmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Convert the image to in-memory bytes in PNG format
&lt;/span&gt;    &lt;span class="n"&gt;img_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pix&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tobytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;gcs_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multimodal-rag/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# Upload the image bytes to GCS
&lt;/span&gt;    &lt;span class="nf"&gt;upload_image_to_gcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gcs_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;img_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Extract some image metadata
&lt;/span&gt;    &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;img_bytes&lt;/span&gt;
    &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gcs_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gcs_key&lt;/span&gt;
    &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;width&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pix&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;
    &lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;height&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pix&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt;
    &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code performs the following actions on each page of the PDF downloaded in Step 3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Renders it as an image using the &lt;code&gt;get_pixmap&lt;/code&gt; method in the PyMuPDF library.&lt;/li&gt;
&lt;li&gt;Converts the image to an in-memory PNG-formatted bytes object using the &lt;code&gt;tobytes&lt;/code&gt; method.&lt;/li&gt;
&lt;li&gt;Generates a unique identifier (&lt;code&gt;gcs_key&lt;/code&gt;) for the image object to be stored in GCS. This follows the format &lt;code&gt;multimodal-rag/(n+1).png&lt;/code&gt; where n is the page number.&lt;/li&gt;
&lt;li&gt;Uploads the image bytes to a GCS using the &lt;code&gt;upload_image_to_gcs&lt;/code&gt; helper function.&lt;/li&gt;
&lt;li&gt;Creates the document to be written to MongoDB, consisting of the GCS key (&lt;code&gt;gcs_key&lt;/code&gt;), image width (&lt;code&gt;width&lt;/code&gt;), height (&lt;code&gt;height&lt;/code&gt;), and temporarily, the image bytes (&lt;code&gt;image&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Appends the document to a list (&lt;code&gt;docs&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 5: Add embeddings to the MongoDB documents
&lt;/h2&gt;

&lt;p&gt;To enable vector search on the PDF screenshots, we need to embed them and add the embeddings to the MongoDB documents. Since we are evaluating two multimodal embedding models in this tutorial, we will add embeddings from both models to the documents.&lt;/p&gt;

&lt;p&gt;To do this, let’s create helper functions to generate embeddings using Voyage AI’s &lt;code&gt;voyage-multimodal-3&lt;/code&gt; and OpenAI’s CLIP models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;voyageai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;voyageai_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_voyage_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Union&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;input_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Get Voyage AI embeddings for images and text.

    Args:
        data (Union[Image, str]): An image or text to embed.
        input_type (str): Input type, either &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.

    Returns:
        List: Embeddings as a list.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;voyageai_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;multimodal_embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voyage-multimodal-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;input_type&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates a Voyage AI API client. This uses the  &lt;code&gt;VOYAGE_API_KEY&lt;/code&gt; environment variable you set in Step 1 to authenticate requests to the Voyage AI API.&lt;/li&gt;
&lt;li&gt;Uses the &lt;code&gt;multimodal_embed&lt;/code&gt; method from the Voyage AI API with the following arguments to generate embeddings:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;inputs&lt;/code&gt;: List of text strings and/or PIL image objects to embed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;model&lt;/code&gt;: The model to use for embedding. We use the &lt;code&gt;voyage-multimodal-3&lt;/code&gt;, which is Voyage AI’s latest multimodal embedding model.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;input_type&lt;/code&gt;: Should be set to either &lt;code&gt;query&lt;/code&gt; or &lt;code&gt;document.&lt;/code&gt; Set it to &lt;code&gt;query&lt;/code&gt; when embedding the user query and &lt;code&gt;document&lt;/code&gt; when embedding the dataset that you will use as the knowledge base for your RAG application.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentence_transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SentenceTransformer&lt;/span&gt;

&lt;span class="n"&gt;clip_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SentenceTransformer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clip-ViT-B-32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_clip_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Union&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Get CLIP embeddings for images and text.

    Args:
        data (Union[Image, str]): An image or text to embed.

    Returns:
        List: Embeddings as a list.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clip_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downloads the latest version of OpenAI’s CLIP model, &lt;code&gt;clip-ViT-B-32&lt;/code&gt;, from Hugging Face using the Sentence Transformer library.&lt;/li&gt;
&lt;li&gt;Uses the &lt;code&gt;encode&lt;/code&gt; method to generate embeddings for an image or a piece of text and converts the resulting embedding Tensor to a list using &lt;code&gt;tolist()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, let’s iterate over the MongoDB documents created in Step 4 and add embeddings from the Voyage AI and CLIP models to them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;embedded_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Open the image from the in-memory bytes
&lt;/span&gt;    &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="c1"&gt;# Add the Voyage AI and CLIP embeddings to the document
&lt;/span&gt;    &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voyage_embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_voyage_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clip_embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_clip_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Remove the image bytes from the document
&lt;/span&gt;    &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;embedded_docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code performs the following actions on each document contained in &lt;code&gt;docs&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Opens the PDF screenshot from binary image data stored in the &lt;code&gt;image&lt;/code&gt; field of the document.&lt;/li&gt;
&lt;li&gt;Adds a field called &lt;code&gt;voyage_embedding&lt;/code&gt; consisting of the embedding of the screenshot generated by the Voyage AI model. Notice that the &lt;code&gt;input_type&lt;/code&gt; argument here is set to &lt;code&gt;document.&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Adds a field called &lt;code&gt;clip_embedding&lt;/code&gt; consisting of the embedding generated by the CLIP model. &lt;/li&gt;
&lt;li&gt;Removes the &lt;code&gt;image&lt;/code&gt; field from the document since we don’t want to store the raw image in MongoDB. &lt;/li&gt;
&lt;li&gt;Appends the modified document with embeddings to a list called &lt;code&gt;embedded_docs.&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;You can roll up the embedding generation into Step 4 to avoid iterating over the PDF pages twice. We have kept it separate in the tutorial for clarity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 6: Ingest documents into MongoDB
&lt;/h2&gt;

&lt;p&gt;The next step is to ingest the documents with embeddings into MongoDB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pymongo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MongoClient&lt;/span&gt;

&lt;span class="c1"&gt;# Create a MongoDB client
&lt;/span&gt;&lt;span class="n"&gt;mongodb_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;MONGODB_URI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;appname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;devrel.showcase.multimodal_rag_mongodb_voyage_ai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Check connection to the cluster
&lt;/span&gt;&lt;span class="n"&gt;mongodb_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;admin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ping&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Names of the MongoDB database, collection and vector search index
&lt;/span&gt;&lt;span class="n"&gt;DB_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multimodal_rag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Connect to the MongoDB collection
&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mongodb_client&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DB_NAME&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;COLLECTION_NAME&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Delete existing documents from the collection
&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_many&lt;/span&gt;&lt;span class="p"&gt;({})&lt;/span&gt;

&lt;span class="c1"&gt;# Insert the embedded documents into the collection
&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_many&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedded_docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates a MongoDB client to connect to your database cluster.&lt;/li&gt;
&lt;li&gt;Connects to a collection named &lt;code&gt;multimodal_rag&lt;/code&gt; in a database named &lt;code&gt;mongodb&lt;/code&gt; in your cluster. &lt;/li&gt;
&lt;li&gt;Deletes any existing documents from the collection.&lt;/li&gt;
&lt;li&gt;Bulk ingests the documents in the &lt;code&gt;embedded_docs&lt;/code&gt; list into the collection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A sample document inserted into MongoDB looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"67eeb8f06ae434784f5d0440"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"gcs_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"multimodal-rag/1.png"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"width"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1786&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"height"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2526&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"voyage_embedding"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="mf"&gt;0.0059509277343751&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="mf"&gt;0.0187988281252&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="mf"&gt;0.040283203125&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"clip_embedding"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="mf"&gt;-0.2702203392982483&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="mf"&gt;0.36099976301193243&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="mf"&gt;-0.022063829004764557&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 7: Create a vector search index
&lt;/h2&gt;

&lt;p&gt;Another requirement to enable efficient vector search on the data is to create a vector search index. The index specifies what fields to index and make searchable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;VS_INDEX_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;VS_INDEX_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vectorSearch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;definition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fields&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voyage_embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numDimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cosine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clip_embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numDimensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cosine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_search_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates a vector search index definition to index the Voyage AI and CLIP embeddings since we want to evaluate both the models. The index definition contains the following configurations:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;path&lt;/code&gt;: Represents the path to the embedding fields in the documents.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;numDimensions&lt;/code&gt;: Number of dimensions in the embedding vector. Notice that the &lt;code&gt;voyage-multimodal-3&lt;/code&gt; model produces embeddings containing &lt;strong&gt;1024&lt;/strong&gt; dimensions, and the CLIP model produces embeddings with &lt;strong&gt;512&lt;/strong&gt; dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;similarity&lt;/code&gt;: Similarity metric to measure distance in vector space. &lt;code&gt;cosine&lt;/code&gt; works well for most embedding models, unless specified otherwise in the model spec.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Uses the index definition to create a vector search index against the &lt;code&gt;multimodal_rag&lt;/code&gt; collection.&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 8: Retrieve documents using vector search
&lt;/h2&gt;

&lt;p&gt;With embeddings added to the documents and a vector search index created, we are ready to perform vector search on the data. Retrieving the screenshots is a two-step process.  Given a user query, it is embedded using the same multimodal embedding model that was used to embed the PDF screenshots. Using vector search, we identify the screenshots that are closest to the user query in vector space. However, the vector search, in this case, only returns references to the screenshots. We then use these references to retrieve the screenshots from GCS.&lt;/p&gt;

&lt;p&gt;So, let’s define two functions, one to perform vector search and another to download the image objects from GCS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_image_from_gcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Get image bytes from GCS.

    Args:
        key (str): Identifier for the image in the bucket.

    Returns:
        bytes: Image bytes.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;blob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gcs_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;image_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download_as_bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code defines a helper function that accepts the name of an object in GCS (&lt;code&gt;key&lt;/code&gt;) as input and returns the object content as bytes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;vector_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_images&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Perform vector search and display images, and return the GCS keys.

    Args:
        user_query (str): User query string.
        model (str): Embedding model to use, either &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voyage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.
        display_images (bool, optional): Whether or not to display images. Defaults to True.

    Returns:
        List[str]: List of GCS keys.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Get query embedding based on the model specified
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voyage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_voyage_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;query_embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_clip_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Define the VS aggregation pipeline
&lt;/span&gt;    &lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$vectorSearch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;VS_INDEX_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queryVector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;_embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numCandidates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$project&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gcs_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;width&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;height&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$meta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vectorSearchScore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Execute the aggregation pipeline
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;gcs_keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="c1"&gt;# For each VS result
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;gcs_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gcs_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# If display_images is True, display the image
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;display_images&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;get_image_from_gcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gcs_key&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Append the GCS key of the image to a list
&lt;/span&gt;        &lt;span class="n"&gt;gcs_keys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gcs_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;gcs_keys&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Given a user query (&lt;code&gt;query&lt;/code&gt;), generates an embedding for it using the specified embedding model (&lt;code&gt;model&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Defines a MongoDB aggregation pipeline that has two stages:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;$vectorSearch&lt;/code&gt;: Performs vector search. The &lt;code&gt;queryVector&lt;/code&gt; field in this stage contains the embedded user query, the &lt;code&gt;path&lt;/code&gt; refers to the path of the embedding field in the MongoDB documents depending on the &lt;code&gt;model&lt;/code&gt; specified, &lt;code&gt;numCandidates&lt;/code&gt; denotes the number of nearest neighbors to consider in the vector space when performing vector search, and finally &lt;code&gt;limit&lt;/code&gt; indicates the number of documents that will be returned from the vector search.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;$project&lt;/code&gt;: Includes only certain fields (set to the value &lt;strong&gt;1&lt;/strong&gt;) in the final results returned by the aggregation pipeline. In this case, only the GCS key, the image width and height, and the vector search score are included in the results.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Runs the aggregation pipeline against the &lt;code&gt;multimodal_rag&lt;/code&gt; collection.&lt;/li&gt;

&lt;li&gt;Gathers and returns the GCS keys from the vector search results.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;You can run the vector search using the Voyage AI and CLIP embedding models and notice the difference in the results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;vector_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the Pass@1 accuracy of Deepseek R1 against other models.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voyage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;display_images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the test query seen above, the vector search using &lt;code&gt;voyage-multimodal-3&lt;/code&gt; returns the following GCS keys and scores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multimodal-rag/1.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.759&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multimodal-rag/13.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.748&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multimodal-rag/14.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.740&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multimodal-rag/7.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.711&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multimodal-rag/4.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.70&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the same query, &lt;code&gt;clip-ViT-B-32&lt;/code&gt; returns the following GCS keys and scores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multimodal-rag/1.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.634&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multimodal-rag/7.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.632&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multimodal-rag/14.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.631&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multimodal-rag/8.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.627&lt;/span&gt;
&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;multimodal-rag/5.png&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.626&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the models result in different PDF pages being retrieved. Also, note the broader distribution of similarity scores from the Voyage AI model compared to CLIP. This potentially indicates that the VLM-based model is more effective at capturing distinguishing features of each page and mapping text-based queries to mixed-modality content.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 9: Create a multimodal RAG app
&lt;/h2&gt;

&lt;p&gt;Finally, let’s pass the retrieved screenshots as context to &lt;code&gt;Gemini 2.0 Flash&lt;/code&gt; to generate answers to user queries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;gemini_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;LLM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Generate answer to the user question using a Gemini multimodal LLM.

    Args:
        user_query (str): User query string.
        model (str): Embedding model to use, either &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;voyage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; or &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.

    Returns:
        str: LLM generated answer.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Get the GCS keys of the images returned by the vector search
&lt;/span&gt;    &lt;span class="n"&gt;gcs_keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;vector_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display_images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Get the images from GCS and open them
&lt;/span&gt;    &lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;get_image_from_gcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;gcs_keys&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer the question based only on the provided context. If the context is empty, say I DON&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;T KNOW&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Question:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Context:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# Prompt to the LLM consisting of the system prompt and the images
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;
    &lt;span class="c1"&gt;# Get a response from the LLM
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gemini_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instantiates a Google Gen AI SDK client to use Gemini models via the Gemini API.&lt;/li&gt;
&lt;li&gt;Specifies the LLM to use. We are using &lt;code&gt;gemini-2.0-flash&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Defines a &lt;code&gt;generate_answer&lt;/code&gt; function that:

&lt;ul&gt;
&lt;li&gt;Uses the &lt;code&gt;vector_search&lt;/code&gt; function defined in Step 8 to get the GCS keys of the PDF pages that are most relevant to the user query (&lt;code&gt;user_query&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Uses the &lt;code&gt;get_image_from_gcs&lt;/code&gt; function defined in Step 8 to retrieve the image bytes from GCS and then converts them into PIL image objects (&lt;code&gt;images&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Creates the inputs for the LLM, consisting of a system prompt (&lt;code&gt;prompt&lt;/code&gt;) and the PDF screenshots (&lt;code&gt;images&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Uses the &lt;code&gt;generate_content&lt;/code&gt; method of the Gemini API to generate an answer using the &lt;code&gt;gemini-2.0-flash&lt;/code&gt; model. The method takes the model name and inputs as arguments, and any additional configuration options for the output generation, such as &lt;code&gt;temperature&lt;/code&gt; via the &lt;code&gt;config&lt;/code&gt; parameter. We set the &lt;code&gt;temperature&lt;/code&gt; to &lt;strong&gt;0.0&lt;/strong&gt; to maximize the probability of deterministic outputs.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 10: Evaluating retrieval and generation
&lt;/h2&gt;

&lt;p&gt;To wrap up, let's evaluate how using the VLM-based &lt;code&gt;voyage-multimodal-3&lt;/code&gt; versus &lt;code&gt;CLIP&lt;/code&gt; embeddings affects both retrieval and generation quality in our RAG application. For this, we create a small evaluation dataset consisting of a set of questions derived from the Deepseek paper, their ground truth answers, and PDF page numbers (in decreasing order of relevance) that contain the context to answer the questions. A sample entry from the evaluation dataset looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"question"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"How does DeepSeek-R1-Zero's accuracy on the AIME 2024 benchmark improve during RL training?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The accuracy measured by the average pass@1 score increases from 15.6% to 71.0% pass@1 as a result of RL training."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="nl"&gt;"page_numbers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use the following metrics to evaluate retrieval quality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mean Reciprocal Rank (MRR)&lt;/strong&gt;: Average of the reciprocal ranks (1 / position of the first relevant result) across all queries in the evaluation dataset. Measures how well the system ranks the first relevant item in the list of retrieved items. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mean recall at k&lt;/strong&gt;: Percentage of all relevant items that were retrieved in the top k results, averaged across all queries in the evaluation dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;For our evaluation dataset, we have identified the five most relevant pages for each question, but there might be more that are loosely relevant to the question. So, we are measuring partial recall at k on “known” relevant items.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Additionally, we measure generation quality using a method called “LLM-as-a-judge,” where you use strong LLMs for evaluation by providing them with evaluation rubrics and guidelines via prompts. Here, we use &lt;code&gt;Gemini 2.0 Flash&lt;/code&gt; itself to evaluate how aligned its answer is with the human-generated ground truth answer on a scale of 1 to 5, 1 being completely unaligned and 5 being fully aligned.&lt;/p&gt;

&lt;p&gt;Refer to the steps under the title &lt;strong&gt;Step 10: Evaluating retrieval and generation&lt;/strong&gt; in the &lt;a href="https://github.com/mongodb-developer/GenAI-Showcase/blob/main/notebooks/rag/multimodal_rag_mongodb_voyage_ai.ipynb" rel="noopener noreferrer"&gt;Jupyter Notebook&lt;/a&gt; accompanying this tutorial to see the code that defines the metrics above and does the evaluation, but let’s discuss the results here.&lt;/p&gt;

&lt;p&gt;In our evaluation dataset, we have separate question sets that can be answered from figures, including tables and text. This allows for a more granular evaluation to understand how well VLM-based and CLIP-based models handle different modalities. The performance of the &lt;code&gt;voyage-multimodal-3&lt;/code&gt; and &lt;code&gt;clip-ViT-B-32&lt;/code&gt; models is summarized in the table below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3fvs6culbiv8rnwppbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd3fvs6culbiv8rnwppbq.png" alt="Performance of VLM-based and CLIP-based models for multimodal data retrieval and generation" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our evaluation shows that the VLM-based &lt;code&gt;voyage-multimodal-3&lt;/code&gt; outperforms &lt;code&gt;CLIP&lt;/code&gt; on both retrieval and generation metrics for our dataset. The model demonstrates superior text comprehension within mixed-modal contexts while maintaining strong visual interpretation capabilities, as is evident by its performance on text and figure-based questions, respectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Multimodality in the context of AI is the ability of machine learning models to process, understand, and sometimes generate different types of data, including text, images, audio, video, etc. Multimodal RAG applications require different data processing strategies compared to text-only data and also specialized embedding models and LLMs that can handle different modalities. &lt;/p&gt;

&lt;p&gt;In this tutorial, we saw how to build a multimodal RAG application using documents consisting of interleaved text, images, and tables as their source of knowledge, using MongoDB Atlas Vector Search and Voyage AI. We also discussed two different architectures, CLIP and VLMs, for multimodal embedding models, and evaluated how they impact the retrieval and generation quality in our RAG application.&lt;/p&gt;

&lt;p&gt;If you enjoyed reading this tutorial, you can explore more such content on our &lt;a href="https://www.mongodb.com/resources/use-cases/artificial-intelligence/?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=multimodal_rag_tutorial&amp;amp;utm_term=apoorva.joshi" rel="noopener noreferrer"&gt;AI Learning Hub&lt;/a&gt;. If you want to go straight to code, we have several more examples of how to build RAG, agentic applications, evals, etc., in our &lt;a href="https://github.com/mongodb-developer/GenAI-Showcase" rel="noopener noreferrer"&gt;Gen AI Showcase GitHub repository&lt;/a&gt;. As always, if you have further questions as you build your AI applications, please reach out to us in our &lt;a href="https://www.mongodb.com/community/forums/c/generative-ai/162/?utm_campaign=devrel&amp;amp;utm_source=third-party-content&amp;amp;utm_medium=cta&amp;amp;utm_content=multimodal_rag_tutorial&amp;amp;utm_term=apoorva.joshi" rel="noopener noreferrer"&gt;Generative AI community forums&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
