Lucas Ribeiro

Engineering Manual for Fine-Tuning Gemini 2.5 Pro on Vertex AI: Architecture, Implementation, and Operationalization at Scale

1. Introduction: The New Era of Multimodal Generative Model Specialization

Generative artificial intelligence crossed a critical threshold with the introduction of the Gemini 2.5 model family by Google. This iteration represents not just an incremental increase in parameter count or pre-training data diversity, but a fundamental shift in the cognitive architecture of Large Language Models (LLMs). Gemini 2.5 Pro, positioned as the "workhorse" model for complex enterprise applications, introduces native capabilities for adaptive thinking and multimodal reasoning that redefine the state of the art.1

However, for solution architects and machine learning engineers operating in mission-critical environments, the base model—however sophisticated—is rarely the final product. Strict adherence to output formats, domain-specific terminology, regulatory compliance, and complex agent behaviors all demand a refinement process known as Supervised Fine-Tuning (SFT).4

This technical report constitutes an exhaustive analysis and a step-by-step methodology for performing fine-tuning on the Gemini 2.5 Pro model using the Google Cloud Vertex AI platform. Unlike superficial documentation, this document delves into architectural nuances, necessary data engineering, production-grade code implementation, and the MLOps (Machine Learning Operations) strategies required to host and consume these models at a global scale.

The complexity of fine-tuning Gemini 2.5 Pro is exacerbated by its nature as a "thinking model." Technical documentation and release notes suggest a subtle interaction: during SFT, the model learns to mimic the desired output, which often allows dispensing with the extensive thinking process that consumes tokens and adds latency. This creates a scenario where supervised training effectively "short-circuits" explicit reasoning in favor of standardized efficiency.5 Understanding this dynamic is vital for optimizing the cost-benefit ratio and latency in production.


2. Theoretical and Architectural Foundation

Before manipulating code, it is imperative to understand the theoretical substrate upon which Gemini 2.5 fine-tuning operates. Vertex AI abstracts the physical infrastructure, but engineering decisions depend on understanding what happens behind the scenes.

2.1. The Gemini 2.5 Pro Model: Specifications and Capabilities

Gemini 2.5 Pro was released as a stable version in June 2025.7 It stands out for significant improvements in coding, mathematical reasoning, and image understanding, along with a massive context window.

| Specification | Technical Detail | Implication for Fine-Tuning |
| :---- | :---- | :---- |
| Context Window | ~1M tokens (input) | While it supports ~1M in inference, fine-tuning on Vertex AI currently limits training examples to 131,072 tokens.5 Larger examples are truncated. |
| Knowledge Cutoff | January 2025 4 | The model is unaware of events post-Jan/2025. SFT is not the ideal method for inserting new factual knowledge (use RAG for this); SFT should focus on style, format, and behavior. |
| Thinking Mode | Dynamic/Adaptive 2 | The model decides when to "think." In SFT, it is recommended to disable or minimize this budget to avoid conflict between latent reasoning and adjusted weights.5 |
| Modalities | Text, Image, Audio, Video | Current SFT supports multimodal inputs, but this report focuses on textual and logical tuning, the basis of most enterprise applications.5 |

2.2. The Mechanics of PEFT and LoRA on Vertex AI

The "fine-tuning" process available on Vertex AI is not a traditional Full Fine-Tuning where all billions of model weights are updated. Instead, it utilizes Parameter-Efficient Fine-Tuning (PEFT), specifically the Low-Rank Adaptation (LoRA) technique.4

In LoRA, the original pre-trained model weights ($W_0$) are frozen. Training injects pairs of low-rank decomposition matrices ($A$ and $B$) into the transformer layers. Weight updates are represented as $\Delta W = B \times A$. During inference, the result is $W_{new} = W_0 + \Delta W$.

Why does this matter for the engineer?

  1. Storage Efficiency: We do not save an entire copy of Gemini 2.5 Pro. We save only the "adapters" (a few megabytes or gigabytes).  

  2. Multitenancy: A single base model can serve multiple dynamically swapped adapters per request, reducing infrastructure costs.  

  3. The Adapter Size hyperparameter: This parameter, configurable in Vertex AI (values 1, 2, 4, 8 for Pro), defines the rank ($r$) of the matrices. A larger $r$ allows learning more complex patterns but increases the risk of overfitting on small datasets.5 The sketch below gives a rough sense of the parameter counts involved.
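To make the storage-efficiency argument concrete, the back-of-the-envelope sketch below compares the trainable parameters of a full update of one weight matrix against its LoRA decomposition $\Delta W = B \times A$. The layer dimension and rank are hypothetical values chosen purely for illustration; Google does not publish Gemini 2.5 Pro's internal dimensions.

```python
# Hypothetical transformer projection layer of size d_model x d_model.
# Dimensions are illustrative only; Gemini's true sizes are not public.
d_model = 8192
rank = 4  # corresponds to adapter_size=4 in Vertex AI

full_update_params = d_model * d_model          # updating W directly
lora_params = rank * d_model + d_model * rank   # A (r x d) plus B (d x r)

print(f"Full update:   {full_update_params:,} trainable parameters")
print(f"LoRA (r={rank}):  {lora_params:,} trainable parameters")
print(f"Reduction:     ~{full_update_params / lora_params:,.0f}x fewer per layer")
```

The same logic explains why the resulting adapters are only megabytes in size and why several of them can share one frozen base model.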

2.3. Vertex AI Platform vs. Google AI Studio

It is crucial to distinguish between Google AI Studio and Vertex AI for fine-tuning purposes. Historically, AI Studio offered a simplified interface. However, Google has deprecated fine-tuning support for newer models (like Gemini 1.5 Flash/Pro and 2.5 series) directly via the Gemini API in AI Studio, migrating it exclusively to Vertex AI.8

Vertex AI offers a managed infrastructure that provides granular control over:

  • Data Sovereignty: ensuring training data and the adapted model remain in specific geographic regions (e.g., us-central1, europe-west4).6  

  • MLOps Pipeline: Integration with Vertex AI Experiments for metric tracking and model versioning.


3. Environment Preparation and Google Cloud Infrastructure

Success in a fine-tuning job depends on a solid infrastructure foundation. Permission errors or quota misconfigurations are the most common causes of failure before training even begins.

3.1. Project and API Configuration

It is recommended to isolate the fine-tuning environment in a dedicated GCP project to facilitate cost control and access auditing.

Step 1: Activate APIs  

The following APIs are mandatory:

  • aiplatform.googleapis.com (Vertex AI API): The core of the operation.  

  • storage.googleapis.com (Google Cloud Storage): For storing datasets and artifacts.  

  • iam.googleapis.com: For identity management.

Step 2: Region Configuration  

Region choice is non-trivial. Gemini 2.5 Pro and the accelerators required for its tuning are not available in all Google Cloud regions. Supported regions for tuning typically include us-central1 and europe-west4.6 Attempting to start a job in an unsupported region will result in a resource unavailability error.

3.2. Identity and Access Management (IAM)

The Service Account (SA) executing the training pipeline needs specific permissions.10

| IAM Role | Technical Justification |
| :---- | :---- |
| roles/aiplatform.user | Allows creating training jobs, models, and endpoints in Vertex AI. |
| roles/storage.objectAdmin | Allows reading the JSONL dataset and writing logs/artifacts to the staging bucket. |
| roles/serviceusage.serviceUsageConsumer | Allows the account to consume project API quota. |

3.3. Quota Verification

Fine-tuning consumes accelerator resources that are in high demand. Even though the service is managed, there is a project-level quota called Global concurrent tuning jobs.

  • Verification: Access "IAM & Admin" -> "Quotas" and filter by "Vertex AI" and "Tuning".  

  • Default: New projects often have this quota set to 0 or 1 concurrent job.  

  • Action: Request a quota increase in advance if planning multiple parallel experiments.4

3.4. Python SDK Installation

The environment must have the latest version of the SDK to support Gemini 2.5 classes.




```bash
# Critical update for Gemini 2.5 support and SFT features
pip install --upgrade google-cloud-aiplatform google-auth google-cloud-storage
```

Python Environment Initialization:

```python
import vertexai
from google.cloud import aiplatform

# Project Constants
PROJECT_ID = "your-gcp-project-id"
REGION = "us-central1"  # Supported region for Gemini 2.5 tuning availability [6]
STAGING_BUCKET = "gs://your-staging-bucket-logs"

# SDK Initialization
vertexai.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=STAGING_BUCKET
)

print(f"Vertex AI SDK version {aiplatform.__version__} initialized.")
```

4. Data Engineering: The Heart of Fine-Tuning

Data quality, consistency, and formatting are the single most important determinants of fine-tuning success. A noisy dataset will result in a model that hallucinates, regardless of the training epochs.

4.1. JSONL Format and Message Structure

Vertex AI strictly requires the dataset to be provided in JSON Lines (.jsonl) format. Each line is a valid, independent JSON object representing a complete training example (a full conversation), following the chat "messages" pattern.5

Required Canonical Structure:

```json
{
  "messages": [
    {"role": "system", "content": "You are a finance expert..."},
    {"role": "user", "content": "Summarize the attached quarterly report."},
    {"role": "model", "content": "Q3 revenue grew 12% year over year, driven by..."}
  ]
}
```

Common Formatting Errors:

  1. Inconsistent System Prompt: If you use a system prompt in training ("You are a finance expert..."), you must use exactly the same system prompt during inference.  

  2. Multi-turn vs. Single-turn: Gemini supports multi-turn chat. If training a chatbot that maintains context, your JSONL examples should contain the conversation history (User -> Model -> User -> Model), as illustrated in the sketch below.
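The snippet below sketches how such a multi-turn example could be serialized to JSONL using only the Python standard library. The conversation content and the output filename are placeholders invented for illustration, not part of the official samples.

```python
import json

# Hypothetical multi-turn training example: system prompt plus two user/model turns.
example = {
    "messages": [
        {"role": "system", "content": "You are a finance expert..."},
        {"role": "user", "content": "What was the EBITDA margin in Q2?"},
        {"role": "model", "content": "The Q2 EBITDA margin was 18.4%."},
        {"role": "user", "content": "And how does that compare to Q1?"},
        {"role": "model", "content": "It improved by 1.2 percentage points versus Q1 (17.2%)."},
    ]
}

# Append one JSON object per line, which is the JSONL contract Vertex AI expects.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```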

4.2. Data Quality and Volume Strategy

Vertex AI documentation and market practice suggest clear guidelines for data volume:

| Dataset Size | Expectation |
| :---- | :---- |
| 1 - 50 examples | Insufficient for SFT. Better to use Few-Shot Prompting. SFT here risks rapid overfitting. |
| 100 - 500 examples | The "Sweet Spot" for most style and format adaptation tasks.5 The model generalizes the pattern without memorizing content. |
| > 1,000 examples | Necessary for teaching new languages (e.g., DSLs), complex reasoning tasks, or very specific knowledge domains. |
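Whatever the volume, it is good practice to reserve a slice of the data for the validation_dataset used later in Section 5. The sketch below splits a cleaned JSONL file into training and validation sets; the 90/10 ratio and the filenames are arbitrary choices for illustration.

```python
import random

def split_dataset(source_path, train_path, val_path, val_ratio=0.1, seed=42):
    """Shuffle a JSONL file and write train/validation splits."""
    with open(source_path, "r", encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    random.Random(seed).shuffle(lines)
    cut = int(len(lines) * (1 - val_ratio))
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:cut])
    with open(val_path, "w", encoding="utf-8") as f:
        f.writelines(lines[cut:])
    print(f"{cut} training examples, {len(lines) - cut} validation examples.")

# Usage
# split_dataset("dataset_full.jsonl", "train.jsonl", "validation.jsonl")
```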

4.3. Data Validation Script

Before uploading to Cloud Storage, it is vital to validate the dataset locally.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)

def validate_jsonl(file_path):
    errors = []
    valid_count = 0
    with open(file_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            try:
                data = json.loads(line)
                # Check 1: 'messages' key
                if 'messages' not in data:
                    errors.append(f"Line {line_num}: Missing 'messages' key.")
                    continue
                messages = data['messages']
                # Check 2: Roles
                roles = [m.get('role') for m in messages]
                if 'user' not in roles or 'model' not in roles:
                    errors.append(f"Line {line_num}: Must contain at least one 'user' and one 'model' message.")
                    continue
                # Check 3: Non-empty content
                if any(not m.get('content') for m in messages):
                    errors.append(f"Line {line_num}: Empty content detected.")
                    continue
                valid_count += 1
            except json.JSONDecodeError:
                errors.append(f"Line {line_num}: Invalid JSON.")
    if errors:
        logging.error(f"Found {len(errors)} errors in dataset:")
        for err in errors[:10]:
            logging.error(err)
        return False
    logging.info(f"Validation successful. {valid_count} valid examples.")
    return True

# Usage
# validate_jsonl("my_train_dataset.jsonl")
```

5. Executing Fine-Tuning: Code and Hyperparameters

We utilize the vertexai.tuning.sft module, which is the standard programmatic interface for this task.6

5.1. Defining the Base Model

Use the correct version tag.

  • Target Model: gemini-2.5-pro-001 (or the latest versioned tag).  

  • Note: Avoid generic aliases if strict reproducibility is required.

5.2. Training Code (SFT Pipeline)

```python
import time
from vertexai.tuning import sft

# Job Configuration
BASE_MODEL = "gemini-2.5-pro-001"
TRAIN_DATASET_URI = "gs://your-bucket-ml/gemini-tuning/v1/train.jsonl"
VALIDATION_DATASET_URI = "gs://your-bucket-ml/gemini-tuning/v1/validation.jsonl"
TUNED_MODEL_DISPLAY_NAME = "gemini-2.5-pro-finance-v1"

# Hyperparameter Configuration
EPOCHS = 4
ADAPTER_SIZE = 4  # Supported values for Pro: 1, 2, 4, 8
LEARNING_RATE_MULTIPLIER = 1.0

def run_fine_tuning_job():
    print(f"Starting SFT job for model {BASE_MODEL}...")
    # Create and submit the job
    # sft.train initiates the managed pipeline on Vertex AI
    sft_tuning_job = sft.train(
        source_model=BASE_MODEL,
        train_dataset=TRAIN_DATASET_URI,
        validation_dataset=VALIDATION_DATASET_URI,
        epochs=EPOCHS,
        adapter_size=ADAPTER_SIZE,
        learning_rate_multiplier=LEARNING_RATE_MULTIPLIER,
        tuned_model_display_name=TUNED_MODEL_DISPLAY_NAME,
        # Region is inferred from vertexai.init
    )
    return sft_tuning_job

# Execute
# tuning_job = run_fine_tuning_job()
```

5.3. Deep Dive into Hyperparameters

| Hyperparameter | Technical Impact and Recommendations |
| :---- | :---- |
| Epochs | Defines how many times the model sees the dataset. • Few (<3): Underfitting. • Many (>10): Overfitting. • Recommendation: Start with 3-5. |
| Adapter Size (LoRA Rank) | Defines the dimensionality of trainable matrices. • Size 1 or 4: Ideal for simple tasks (formatting, tone). • Size 8: Necessary for complex tasks requiring reasoning. • Note: Pro supports 1, 2, 4, 8.5 |
| Learning Rate Multiplier | Scales the default optimizer rate. • 1.0: Safe default. • <1.0: Use if the base model is already performing well and only needs slight adjustment. |
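When the concurrent tuning job quota allows it, a small sweep over adapter sizes can be launched with the same sft.train call from Section 5.2. The loop below is a simple sketch, not an official tuning recipe; the candidate values and the display-name suffix are arbitrary, and it reuses the constants defined in Section 5.2.

```python
from vertexai.tuning import sft

# BASE_MODEL, TRAIN_DATASET_URI, VALIDATION_DATASET_URI, EPOCHS,
# LEARNING_RATE_MULTIPLIER, TUNED_MODEL_DISPLAY_NAME come from Section 5.2.
candidate_adapter_sizes = [1, 4, 8]  # hypothetical sweep over LoRA ranks
sweep_jobs = []

for adapter_size in candidate_adapter_sizes:
    job = sft.train(
        source_model=BASE_MODEL,
        train_dataset=TRAIN_DATASET_URI,
        validation_dataset=VALIDATION_DATASET_URI,
        epochs=EPOCHS,
        adapter_size=adapter_size,
        learning_rate_multiplier=LEARNING_RATE_MULTIPLIER,
        tuned_model_display_name=f"{TUNED_MODEL_DISPLAY_NAME}-r{adapter_size}",
    )
    sweep_jobs.append(job)
    print(f"Submitted job with adapter_size={adapter_size}: {job.resource_name}")
```

Each submitted job counts against the Global concurrent tuning jobs quota discussed in Section 3.3.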

5.4. Monitoring and Polling

The script should monitor the state to ensure the process completes successfully.11

```python
def monitor_tuning_job(job):
    while not job.has_ended:
        time.sleep(60)
        job.refresh()
        print(f"Status: {job.state.name}")
    # Terminal states follow the JobState enum (e.g., JOB_STATE_SUCCEEDED)
    if "SUCCEEDED" in job.state.name:
        print("Training completed successfully!")
        print(f"Model Resource Name: {job.tuned_model_name}")
        print(f"Endpoint (Auto-Deploy): {job.tuned_model_endpoint_name}")
        return job.tuned_model_endpoint_name
    else:
        print(f"Job FAILED. Error: {job.error}")
        return None
```

6. Hosting, Deployment, and Inference Optimization

Where is the model after the job SUCCEEDED? How is it served?

6.1. The Vertex AI Endpoint Concept

In Vertex AI, you do not "download" the tuned Gemini 2.5 Pro model. The base model is proprietary and massive. Instead, your LoRA adapters are saved in the Model Registry.  

When you deploy (which the SFT job often does automatically), Vertex AI provisions an Endpoint. An Endpoint is a managed service URL pointing to compute infrastructure that loads Gemini 2.5 Pro + Your Adapters.
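To locate these artifacts later, the registered tuned model and its serving endpoint can be listed with the lower-level aiplatform client. This is a minimal sketch; the display-name filter assumes the name used in Section 5, and PROJECT_ID and REGION come from Section 3.4.

```python
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

# Tuned models registered by the SFT pipeline (adapter weights, not full model copies)
for model in aiplatform.Model.list(filter='display_name="gemini-2.5-pro-finance-v1"'):
    print("Model:", model.resource_name)

# Endpoints currently serving deployed models in this project/region
for endpoint in aiplatform.Endpoint.list():
    deployed = [dm.display_name for dm in endpoint.list_models()]
    print("Endpoint:", endpoint.resource_name, "->", deployed)
```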

6.2. Consuming the Model via Python SDK

To consume the model, instantiate the GenerativeModel class pointing to the Endpoint Resource Name.6

Endpoint Resource Name Format:  

projects/{PROJECT_NUMBER}/locations/{REGION}/endpoints/{ENDPOINT_ID}

```python
from vertexai.generative_models import GenerativeModel, GenerationConfig

# Replace with the value returned by monitor_tuning_job or taken from the Console
TUNED_MODEL_ENDPOINT_RESOURCE = "projects/123456789012/locations/us-central1/endpoints/11223344556677"

def predict_with_tuned_model(prompt_text):
    print(f"Sending prompt to: {TUNED_MODEL_ENDPOINT_RESOURCE}")
    # Instantiate the model pointing to the tuned endpoint
    # The SDK routes this to your adapter
    model = GenerativeModel(TUNED_MODEL_ENDPOINT_RESOURCE)
    # Generation Config: The Thinking Budget Paradox
    # For SFT models, the documentation [5] recommends disabling thinking
    # or setting it to the minimum, as SFT teaches the direct answer.
    generation_config = GenerationConfig(
        temperature=0.2,
        max_output_tokens=1024,
        # If supported by the specific SDK version for the model:
        # thinking_config={"include_thoughts": False}
    )
    try:
        response = model.generate_content(
            prompt_text,
            generation_config=generation_config
        )
        return response.text
    except Exception as e:
        print(f"Inference Error: {e}")
        return None

# Real Test
prompt = "Summarize the following financial report focusing on EBITDA:"
result = predict_with_tuned_model(prompt)
print("---------------- RESPONSE ----------------")
print(result)
```

6.3. The "Thinking Budget" Paradox in SFT Models

A critical finding for this report is the behavior of Gemini 2.5 Pro regarding its "thinking budget" when subjected to supervised fine-tuning.

Gemini 2.5 Pro is a "thinking" model. However, SFT trains the model to map Input -> Desired Output directly. If you keep "thinking mode" enabled with a high token budget, the model tries to "reason" its way to a response it has already learned to produce through training. This can cause:

  1. Increased Latency and Cost: Paying for useless thinking tokens.  

  2. Quality Degradation: The model may "overthink" and diverge from the strict format you taught it.

Therefore, best engineering practice is to zero out or minimize the thinking budget for SFT endpoints.5
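One way to control the budget explicitly is through the separate google-genai SDK (pip install google-genai), which exposes a ThinkingConfig. The sketch below assumes that SDK is pointed at Vertex AI and reuses PROJECT_ID, REGION, and the endpoint resource name from Sections 3.4 and 6.2; note that 2.5 Pro may enforce a minimum budget rather than allowing thinking to be switched off entirely, so verify the accepted values in the current documentation.

```python
from google import genai
from google.genai import types

# google-genai client routed to Vertex AI (assumes ADC credentials are configured)
client = genai.Client(vertexai=True, project=PROJECT_ID, location=REGION)

response = client.models.generate_content(
    # Endpoint resource name of the tuned model (Section 6.2); verify that your
    # SDK version accepts endpoint names here, otherwise pass the model resource name.
    model=TUNED_MODEL_ENDPOINT_RESOURCE,
    contents="Summarize the following financial report focusing on EBITDA:",
    config=types.GenerateContentConfig(
        temperature=0.2,
        max_output_tokens=1024,
        # Keep the thinking budget at the minimum the model accepts; 2.5 Pro may
        # reject 0, in which case use its lowest allowed value (e.g., 128).
        thinking_config=types.ThinkingConfig(thinking_budget=128),
    ),
)
print(response.text)
```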


7. Evaluation and Quality Assurance (QA)

7.1. Manual AB Testing (Qualitative)

Create a "Side-by-Side" evaluation script sending the same prompt to both the base model and the tuned model.

| Test Prompt | Base Model Response (Gemini 2.5 Pro) | Tuned Model Response | Engineer Analysis |
| :---- | :---- | :---- | :---- |
| "Analyze contract X." | Generic response, academic tone. | Technical response, cites specific local laws, senior legal tone. | Success: Adoption of persona and domain knowledge. |

7.2. Automatic Evaluation with Gen AI Evaluation Service

Vertex AI offers the Gen AI Evaluation service. You can use an LLM as a "Judge" to evaluate your tuned model's responses.6

Metrics:

  • Coherence: Does the answer make logical sense?  

  • Instruction Following: Did it follow format constraints (JSON, XML)?  

  • Safety: Did it generate toxic content?
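As a lightweight stand-in for the managed Gen AI Evaluation Service API (whose client surface is not reproduced here), the sketch below hand-rolls the LLM-as-judge idea with the same GenerativeModel class used earlier: a strong model scores each tuned response against the metrics listed above. The rubric wording and the 1-5 scale are illustrative assumptions.

```python
import json
from vertexai.generative_models import GenerativeModel, GenerationConfig

judge = GenerativeModel("gemini-2.5-pro-001")  # judge model; any strong model works

JUDGE_TEMPLATE = """Rate the RESPONSE to the PROMPT on a 1-5 scale for:
coherence, instruction_following, safety.
Return only JSON like {{"coherence": 5, "instruction_following": 4, "safety": 5}}.

PROMPT: {prompt}
RESPONSE: {response}"""

def judge_response(prompt, response):
    verdict = judge.generate_content(
        JUDGE_TEMPLATE.format(prompt=prompt, response=response),
        generation_config=GenerationConfig(temperature=0.0),
    )
    return json.loads(verdict.text)  # may need cleanup if the judge adds markdown

# Usage
# scores = judge_response("Analyze contract X.", predict_with_tuned_model("Analyze contract X."))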

8. MLOps and Production Considerations

8.1. Troubleshooting Common Errors

  • ResourceExhausted Error: You hit the concurrent job quota. Cancel old jobs or request a quota increase.4  

  • InvalidArgument in Dataset: Usually means an example exceeds the 131k token limit or the JSONL is malformed.5  

  • Safety Filters: Fine-tuning does not remove native safety filters. If your domain is sensitive (medical/legal), you may need to relax per-category harm thresholds via safety settings passed alongside the request, as sketched below.
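A hedged sketch of per-category thresholds with the vertexai SDK follows; the category, threshold, and prompt are examples only, and any relaxation must respect your organization's responsible-AI policy.

```python
from vertexai.generative_models import (
    GenerativeModel,
    HarmBlockThreshold,
    HarmCategory,
    SafetySetting,
)

# TUNED_MODEL_ENDPOINT_RESOURCE as defined in Section 6.2
model = GenerativeModel(TUNED_MODEL_ENDPOINT_RESOURCE)

# Example: loosen only the category that triggers false positives in a clinical domain.
safety_settings = [
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=HarmBlockThreshold.BLOCK_ONLY_HIGH,
    ),
]

response = model.generate_content(
    "Describe contraindications for drug X in a clinical summary.",
    safety_settings=safety_settings,
)
print(response.text)
```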

8.2. Conclusion

Fine-tuning Gemini 2.5 Pro on Vertex AI is a powerful tool for transforming a generalist model into a domain specialist. The secret lies not in the Python code—which is relatively simple thanks to the SDK—but in rigorous Data-Centric AI engineering and the correct management of hyperparameters and inference budgets. By following this guide, engineers can deploy generative AI solutions that are not only impressive but robust, auditable, and ready for the enterprise environment.

References

  1. Gemini 2.5 Pro – Vertex AI - Google Cloud Console, accessed December 8, 2025, https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-2.5-pro

  2. Gemini thinking | Gemini API - Google AI for Developers, accessed December 8, 2025, https://ai.google.dev/gemini-api/docs/thinking

  3. Gemini 2.5 on Vertex AI: Pro, Flash & Model Optimizer Live | Google Cloud Blog, accessed December 8, 2025, https://cloud.google.com/blog/products/ai-machine-learning/gemini-2-5-pro-flash-on-vertex-ai

  4. Gemini 2.5 Pro | Generative AI on Vertex AI - Google Cloud Documentation, accessed December 8, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro

  5. About supervised fine-tuning for Gemini models | Generative AI on Vertex AI - Google Cloud Documentation, accessed December 8, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-supervised-tuning

  6. Tune Gemini models by using supervised fine-tuning | Generative AI on Vertex AI, accessed December 8, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning

  7. Release notes | Gemini API - Google AI for Developers, accessed December 8, 2025, https://ai.google.dev/gemini-api/docs/changelog

  8. Fine-tuning with the Gemini API - Google AI for Developers, accessed December 8, 2025, https://ai.google.dev/gemini-api/docs/model-tuning

  9. Tuning API | Generative AI on Vertex AI - Google Cloud Documentation, accessed December 8, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/model-reference/tuning

  10. googleapis/python-aiplatform: A Python SDK for Vertex AI, a fully managed, end-to-end platform for data science and machine learning - GitHub, accessed December 8, 2025, https://github.com/googleapis/python-aiplatform

  11. Fine-tune Generative AI models with Vertex AI Supervised Fine-tuning, accessed December 8, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/samples/generativeaionvertexai-tuning-basic

  12. How to use Google Vertex AI fine tuned model via Node.js - Stack Overflow, accessed December 8, 2025, https://stackoverflow.com/questions/78738829/how-to-use-google-vertex-ai-fine-tuned-model-via-node-js
