In this blog post, I explore the enhanced inputs in the Gemini API and how they simplify my Colab demo. I will outline the recent updates to the Gemini API, walk through the legacy implementation, and finally demonstrate how to replace a complex utility function with just two lines of code.
Introduction
As of January 2026, the Gemini API has made data ingestion significantly more flexible. The maximum size of inline data has increased from 20 MB to 100 MB per file. While that increase is significant, my demo focuses on adopting external public URLs instead of local files.
Moreover, the Gemini API supports external public HTTPS URLs and signed URLs, such as S3 presigned URLs and Azure Blob Storage shared access signatures (SAS).
If the files are ready in Google Cloud Storage (GCS), their URIs can be registered and used across multiple requests without re-uploading.
The primary advantage of public URLs is eliminating the client-side upload step. The model fetches data directly from the source, removing the need to manage local byte streams. While re-uploading works fine for prototyping, it becomes a bottleneck in production where data persistence is key.
In summary, these are the changes:
- The inline data limit increased from 20 MB to 100 MB per file.
- External URLs are supported (public HTTPS or signed).
- Google Cloud Storage URIs can be registered and reused across requests.
The supported file types are listed below:
| Type | Supported File Types |
|---|---|
| Text file types | text/html, text/css, text/plain, text/xml, text/csv, text/rtf, text/javascript |
| Application file types | application/json, application/pdf |
| Image file types | image/bmp, image/jpeg, image/png, image/webp |
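Here is a minimal sketch of the three ingestion options using the google-genai SDK. The file names and URLs are illustrative, and the public-URL form follows the usage demonstrated later in this post:

from google.genai import types

# Option 1: inline bytes, now up to 100 MB per file
with open("photo.jpg", "rb") as f:
    inline_part = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")

# Option 2: a public HTTPS or signed URL, fetched directly by the API
url_part = types.Part.from_uri(file_uri="https://example.com/images/photo.jpg")

# Option 3: a Google Cloud Storage URI, reusable across requests
gcs_part = types.Part.from_uri(file_uri="gs://my-bucket/photo.jpg", mime_type="image/jpeg")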
Prerequisites
To run the provided Colab notebook, you will need:
- Vertex AI in Express Mode: Due to regional availability (I am based in Hong Kong), I access Gemini via Vertex AI; however, these features apply to both the Gemini Developer API and Vertex AI. All my demos use Gemini in Vertex AI. To use it for free, sign up for Vertex AI in express mode with your Gmail account.
- Google Cloud API Key: Properly configured within your environment.
- Google Colab VS Code Extension: Install it from the Visual Studio Marketplace to run the demo in the VS Code environment.
- Python: Python 3.12+ with the google-genai SDK installed.
Demo Overview
The demo performs image understanding by comparing pictures of two people to identify their facial similarities and differences. The result is a structured output containing a similarity score, a list of similarities, and a list of differences.
Initially, the images were local to my demo, and I wrote a utility function to open an image file, read the raw bytes, and wrap them in a Gemini Part object. The text prompt and the two inline data parts are then passed to the Gemini API to generate the structured output.
After the announcement of the enhancements, I moved the images to a public GitHub repo and refactored the code to accept public image URLs instead of local files. This mimics a production environment where assets are hosted remotely (e.g., S3, GCS, a remote server) rather than bundled locally.
Note: This demo is for educational purposes to demonstrate multimodal capabilities. Always review the AI provider's Acceptable Use Policy regarding biometric categorization and facial recognition before deploying such features in production.
Architecture
The Gemini API ingests two public image URLs and passes them, along with the text prompt, to the Gemini 3 Flash Preview model. The model compares the images to generate a similarity score, a list of similarities, and a list of differences.
Environment Variable
Copy env.example to .env and replace the GOOGLE_CLOUD_API_KEY placeholder with your API key.
GOOGLE_CLOUD_API_KEY=<GOOGLE CLOUD API KEY>
The create_vertexai_client function constructs and returns a Gemini client. I use the client to call the Gemini 3 Flash Preview model to analyze images and generate structured output.
import os

from dotenv import load_dotenv
from google import genai
from google.genai import types
from matplotlib import pyplot as plt
from pydantic import BaseModel, Field

# Load the variables defined in the .env file into the environment
load_dotenv()

def create_vertexai_client():
    cloud_api_key = os.getenv("GOOGLE_CLOUD_API_KEY")
    if not cloud_api_key:
        raise ValueError("GOOGLE_CLOUD_API_KEY not found in .env file")

    # Configure the client with your API key
    client = genai.Client(
        vertexai=True,
        api_key=cloud_api_key,
    )
    return client
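Creating the client is then a one-liner, and the rest of the notebook reuses the returned instance:

client = create_vertexai_client()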
Installation
The demo uses the newer Google Gen AI SDK, so I install the google-genai library.
%pip install google-genai
%pip install matplotlib
%pip install python-dotenv
%pip install pydantic
Note that os is part of the Python standard library, so it does not need to be installed, and the dotenv module is provided by the python-dotenv package.
The Legacy Approach: Local File Uploads
1. Image Understanding with Thinking Level and Media Resolution (Facial Analysis)
This utility function opens an image file, reads the image bytes, and returns an inline data part.
def get_inline_data_part(image_path: str):
    import mimetypes

    try:
        mime_type, _ = mimetypes.guess_type(image_path)
        if mime_type is None:
            mime_type = "application/octet-stream"
            print(f"Warning: Could not determine MIME type for {image_path}. Defaulting to {mime_type}.")

        with open(image_path, "rb") as image_file:
            file_bytes = image_file.read()

        return types.Part(
            inline_data=types.Blob(
                mime_type=mime_type,
                data=file_bytes
            ),
            media_resolution={"level": "media_resolution_high"}
        )
    except FileNotFoundError:
        print(f"Error: The file was not found at {image_path}")
        raise
    except Exception as e:
        print(f"An error occurred: {e}")
        raise
The structured output of the prompt adheres to the schema of the FacialAnalysis model.
class FacialAnalysis(BaseModel):
    percentage: int = Field(description="Integer similarity score.")
    similarities: list[str] = Field(description="Key similarities.")
    differences: list[str] = Field(description="Key differences.")
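For instance, a JSON payload that matches this schema validates into a typed object (the values below are illustrative):

sample = FacialAnalysis.model_validate_json(
    '{"percentage": 88, "similarities": ["Similar eye shape"], "differences": ["Different chin shape"]}'
)
print(sample.percentage)  # 88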
With response_mime_type='application/json', the model reliably returns a JSON object, which is available in response.parsed. If response.parsed is empty, the fallback is to parse response.text into an instance of the FacialAnalysis model.
The Gemini model often wraps the response.text in Markdown code blocks, so the clean_json_string function removes the code blocks to ensure clean parsing.
def clean_json_string(raw_string):
    # Remove the markdown code fences
    clean_str = raw_string.strip()
    if clean_str.startswith("```json"):
        clean_str = clean_str[7:]
    if clean_str.endswith("```"):
        clean_str = clean_str[:-3]
    return clean_str.strip()
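For example, a fenced response reduces to bare JSON (the input string is illustrative):

raw = '```json\n{"percentage": 88}\n```'
print(clean_json_string(raw))  # {"percentage": 88}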
Here, I use the Gemini 3 Flash Preview model to compare two images, applying a high thinking level and high media resolution to ensure a detailed analytical response.
config = types.GenerateContentConfig(
    response_mime_type='application/json',
    response_json_schema=FacialAnalysis.model_json_schema(),
    media_resolution=types.MediaResolution.MEDIA_RESOLUTION_HIGH,
    thinking_config=types.ThinkingConfig(
        include_thoughts=True,
        thinking_level=types.ThinkingLevel.HIGH
    )
)

response = client.models.generate_content(
    model='gemini-3-flash-preview',
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part(text=prompt),
                get_inline_data_part(person_a_image),
                get_inline_data_part(person_b_image),
            ],
        ),
    ],
    config=config
)
The complete implementation of analyze_facial_images is below:
def analyze_facial_images(person_a_image: str, person_b_image: str):
    prompt = """
Role: You are an Expert Biometric Analyst and Facial Recognition Specialist. Your expertise lies in anthropometric comparison, analyzing facial landmarks, bone structure, and morphological traits.
Task: Analyze the attached image(s). Compare the physical appearance of the two individuals shown. If two separate images are provided, treat the first as Person A and the second as Person B. If one image containing two people is provided, treat the person on the left as Person A and the person on the right as Person B.
Analysis Criteria:
Focus strictly on biological and physical traits (facial geometry, feature shapes, and proportions). Ignore clothing, lighting, camera angles, or accessories (glasses, hats) unless they obscure features.
The Rule of Similarity (Visual Scoring Rubric):
You must assign a final similarity percentage based strictly on this scale:
0 - 10 (Dissimilar): Polar opposites. No shared facial geometry or features; different phenotypes.
11 - 20 (Faint Connection): Very weak link; perhaps a shared broad head shape, but all specific features differ.
21 - 30 (Vague Resemblance): Superficial similarities only (e.g., similar eye color or hair texture only), but the faces look unrelated.
31 - 40 (Partial Overlap): Noticeable but minor similarities. Maybe the nose or mouth is similar, but the overall bone structure is different.
41 - 50 (Moderate Association): Reminiscent. One major anatomical zone aligns (e.g., eyes/brows), but the rest of the face is distinct.
51 - 60 (Balanced Similarity): Comparable. Distinct individuals, but there is enough overlap in bone structure to suggest a relation.
61 - 70 (Strong Resemblance): "Cousin" status. Strong resemblance in the "map" of the face; differences are only in the details/nuance.
71 - 80 (Kindred Spirits): High resemblance. They share facial ratios that suggest a blood relation (e.g., parent/child or first cousins), but are clearly distinguishable as different people.
81 - 90 (Biological Sibling / Doppelgänger): Extremely high correlation. This tier represents biological sisters/brothers who look very similar, or unrelated high-grade lookalikes. They share almost all facial features, with only minor structural variances.
91 - 100 (Identical Match): The same person or identical twins. Facial geometry and bone structure are exact matches. Any visible differences are strictly superficial (e.g., age, styling, or weight) and not structural.
Output Format:
Please provide the result strictly in the following format:
Similarity Score: [Insert Integer]%
Key Similarities:
- [Detail 1]
- [Detail 2]
- [Detail 3]
- [Detail 4]
- [Detail 5]
Key Differences:
- [Detail 1]
- [Detail 2]
- [Detail 3]
- [Detail 4]
- [Detail 5]
"""
    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=[
            types.Content(
                role="user",
                parts=[
                    types.Part(text=prompt),
                    get_inline_data_part(person_a_image),
                    get_inline_data_part(person_b_image),
                ]
            )
        ],
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_json_schema=FacialAnalysis.model_json_schema(),
            media_resolution=types.MediaResolution.MEDIA_RESOLUTION_HIGH,
            thinking_config=types.ThinkingConfig(
                thinking_level=types.ThinkingLevel.HIGH
            )
        )
    )

    if response.parsed:
        result = FacialAnalysis.model_validate(response.parsed)
    else:
        result = FacialAnalysis.model_validate_json(
            clean_json_string(response.text)
        )
    return result
2. Prompt Gemini 3 and Print Results
The print_result function invokes analyze_facial_images to generate the response. It prints the similarity score as a percentage, then iterates over the similarities and differences lists to display each item.
def print_result(person_a_image: str, person_b_image: str):
    result = analyze_facial_images(person_a_image=person_a_image, person_b_image=person_b_image)
    print("Percentage:", result.percentage)
    print("Similarities:")
    for s in result.similarities:
        print("-", s)
    print("Differences:")
    for d in result.differences:
        print("-", d)
    print("========================================================================")

def print_test_cases(heading: str, cases: list[list[str]]):
    print(heading)
    for case in cases:
        print_result(person_a_image=case[0], person_b_image=case[1])
3. Prepare a Test Case
I ask Gemini 3 to find the similarity score between the Olsen twins (Mary Kate and Ashley).
identical_cases = [
    ['./images/Ashley_Olsen.jpg', './images/Mary_Kate_Olsen.jpg'],
]

print_test_cases(heading="Identical twin cases", cases=identical_cases)
The output is as follows:
Percentage: 88
Similarities:
- Identical large, almond-shaped green eye morphology and orbital positioning.
- Consistent thick, arched eyebrow structure and density.
- Nearly identical nasal bridge width and rounded tip shape.
- Similar lip proportions, specifically the defined cupid's bow and lower lip fullness.
- Matching heart-shaped facial geometry and prominent cheekbone structure.
Differences:
- Slightly different buccal fat distribution, with Person A showing more defined facial hollowing.
- Minor variance in chin projection and tapering at the mandibular symphysis.
- Subtle differences in the vertical height of the forehead and hairline positioning.
- Person A displays slightly more pronounced nasolabial folds in the mid-face region.
- Slight variations in the shape and protrusion of the ears relative to the jawline.
========================================================================
That completes the walkthrough of the legacy implementation.
The Latest Approach: Direct URL Ingestion
1. Move the images to a public GitHub Repository
I created a new repository, added an images folder, and uploaded the images to it.
2. Replace inline data with public URLs
The analyze_facial_images function now uses types.Part.from_uri to create a part from a public image URL. The mime_type argument is optional: when it is omitted, the API infers the MIME type from the URL, which works well for raw GitHub URLs that end in a file extension.
However, if you are using signed URLs that do not end in a file extension, you should explicitly pass the mime_type parameter to from_uri to ensure the model processes it correctly.
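For example, with a hypothetical Azure SAS URL that carries no file extension:

part = types.Part.from_uri(
    # Hypothetical signed URL; there is no file extension to infer the MIME type from
    file_uri="https://myaccount.blob.core.windows.net/photos/person-a?sp=r&sig=abc123",
    mime_type="image/jpeg",  # passed explicitly so the API knows how to decode the payload
)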
The function now expects URLs, so the parameters are renamed from person_a_image and person_b_image to person_a_image_url and person_b_image_url. The configuration used to generate content is unchanged. The client, types, and the FacialAnalysis model come from the previous steps.
def analyze_facial_images(person_a_image_url: str, person_b_image_url: str):
    prompt = """
    <original prompt>
    """

    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=[
            types.Content(
                role="user",
                parts=[
                    types.Part(text=prompt),
                    types.Part.from_uri(file_uri=person_a_image_url),
                    types.Part.from_uri(file_uri=person_b_image_url),
                ]
            )
        ],
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_json_schema=FacialAnalysis.model_json_schema(),
            media_resolution=types.MediaResolution.MEDIA_RESOLUTION_HIGH,
            thinking_config=types.ThinkingConfig(
                thinking_level=types.ThinkingLevel.HIGH
            )
        )
    )

    if response.parsed:
        result = FacialAnalysis.model_validate(response.parsed)
    else:
        result = FacialAnalysis.model_validate_json(
            clean_json_string(response.text)
        )
    return result
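Side by side, the refactor boils down to swapping the utility-function calls for the two from_uri lines promised in the introduction:

# Before: read a local file and wrap the bytes in an inline data part
get_inline_data_part(person_a_image),
get_inline_data_part(person_b_image),

# After: reference the public URLs directly
types.Part.from_uri(file_uri=person_a_image_url),
types.Part.from_uri(file_uri=person_b_image_url),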
3. URL Construction in print_result
The print_result function constructs the full image URLs from a base URL and invokes analyze_facial_images.
In the demo, the base URL is https://raw.githubusercontent.com/railsstudent/colab_images/refs/heads/main/images. GitHub.com URLs return an HTML page whereas raw.githubusercontent.com returns the actual image bytes, which is what the model needs.
The print logic in print_result is unchanged, and the print_test_cases function stays the same.
def print_result(person_a_file_name: str, person_b_file_name: str):
    base_url = "https://raw.githubusercontent.com/railsstudent/colab_images/refs/heads/main/images"
    person_a_image_url = f"{base_url}/{person_a_file_name}"
    person_b_image_url = f"{base_url}/{person_b_file_name}"
    result = analyze_facial_images(
        person_a_image_url=person_a_image_url, person_b_image_url=person_b_image_url
    )
    # ... print the score, a list of similarities, and a list of differences ...
4. Delete the unused function
Delete get_inline_data_part because it is now redundant.
5. Fix the test case
Modify the test case because the images no longer live in a local images folder.
identical_cases = [
    ["Ashley_Olsen.jpg", "Mary_Kate_Olsen.jpg"],
]

print_test_cases(heading="Identical twin cases", cases=identical_cases)
Rerun the code, and it should produce the same result.
The solution works because the GitHub repo is public, which makes the images in the images folder publicly accessible.
Conclusion
The Gemini API streamlines data ingestion by raising the inline data limit to 100 MB and accepting public HTTPS URLs, signed URLs, and registered Google Cloud Storage URIs. Developers can point a Gemini 3 model directly at URLs or GCS URIs to generate text or structured output. This update removes the need for boilerplate utility functions that convert images into inline data parts before passing them to the Gemini API.
I hope you enjoyed the content and appreciate the improved ease of use of the Gemini API.
Thank you.
Resources
- GitHub Example: Look alike Colab.
- Use VertexAI for free: Vertex AI in express mode.
- Run the demo in VS Code: Google Colab VS Code Extension
