1. Introduction: The New Era of Multimodal Generative Model Specialization
Generative artificial intelligence crossed a critical threshold with the introduction of the Gemini 2.5 model family by Google. This iteration represents not just an incremental increase in parameter count or pre-training data diversity, but a fundamental shift in the cognitive architecture of Large Language Models (LLMs). Gemini 2.5 Pro, positioned as the "workhorse" model for complex enterprise applications, introduces native capabilities for adaptive thinking and multimodal reasoning that redefine the state of the art.1
However, for solution architects and machine learning engineers operating in mission-critical environments, the base model—however sophisticated—is rarely the final product. The need for strict adherence to formats, specific domain terminology, regulatory compliance, and complex agent behaviors necessitates a refinement process known as Supervised Fine-Tuning (SFT).4
This technical report constitutes an exhaustive analysis and a step-by-step methodology for performing fine-tuning on the Gemini 2.5 Pro model using the Google Cloud Vertex AI platform. Unlike superficial documentation, this document delves into architectural nuances, necessary data engineering, production-grade code implementation, and the MLOps (Machine Learning Operations) strategies required to host and consume these models at a global scale.
The complexity of fine-tuning Gemini 2.5 Pro is compounded by its nature as a "thinking model." Technical documentation and release notes point to a subtle interaction: during SFT, the model learns to mimic the desired output directly, which often makes the extended thinking process, and the tokens and latency it consumes, unnecessary. In effect, supervised training "short-circuits" explicit reasoning in favor of standardized efficiency.5 Understanding this dynamic is vital for optimizing the cost-benefit ratio and latency in production.
2. Theoretical and Architectural Foundation
Before manipulating code, it is imperative to understand the theoretical substrate upon which Gemini 2.5 fine-tuning operates. Vertex AI abstracts the physical infrastructure, but engineering decisions depend on understanding what happens behind the scenes.
2.1. The Gemini 2.5 Pro Model: Specifications and Capabilities
Gemini 2.5 Pro was released as a stable version in June 2025.7 It stands out for significant improvements in coding, mathematical reasoning, and image understanding, along with a massive context window.
| Specification | Technical Detail | Implication for Fine-Tuning |
| :---- | :---- | :---- |
| Context Window | ~1M tokens (input) | While it supports ~1M in inference, fine-tuning on Vertex AI currently limits training examples to 131,072 tokens.5 Larger examples are truncated. |
| Knowledge Cutoff | January 2025 4 | The model is unaware of events post-Jan/2025. SFT is not the ideal method for inserting new factual knowledge (use RAG for this); SFT should focus on style, format, and behavior. |
| Thinking Mode | Dynamic/Adaptive 2 | The model decides when to "think." In SFT, it is recommended to disable or minimize this budget to avoid conflict between latent reasoning and adjusted weights.5 |
| Modalities | Text, Image, Audio, Video | Current SFT supports multimodal inputs, but this report focuses on textual and logical tuning, the basis of most enterprise applications.5 |
2.2. The Mechanics of PEFT and LoRA on Vertex AI
The "fine-tuning" process available on Vertex AI is not a traditional Full Fine-Tuning where all billions of model weights are updated. Instead, it utilizes Parameter-Efficient Fine-Tuning (PEFT), specifically the Low-Rank Adaptation (LoRA) technique.4
In LoRA, the original pre-trained model weights ($W_0$) are frozen. Training injects pairs of low-rank decomposition matrices ($A$ and $B$) into the transformer layers. Weight updates are represented as $\Delta W = B \times A$. During inference, the result is $W_{new} = W_0 + \Delta W$.
Why does this matter for the engineer?
Storage Efficiency: We do not save an entire copy of Gemini 2.5 Pro. We save only the "adapters" (a few megabytes or gigabytes).
Multitenancy: A single base model can serve multiple dynamically swapped adapters per request, reducing infrastructure costs.
Hyperparameter Adapter Size: This parameter, configurable in Vertex AI (values 1, 2, 4, 8 for Pro), defines the rank ($r$) of the matrices. A larger $r$ allows learning more complex patterns but increases the risk of overfitting on small datasets.5
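To make the arithmetic concrete, the following minimal NumPy sketch uses purely illustrative dimensions (Gemini's real layer shapes are not public) to show how a rank-$r$ update adds only a small number of trainable parameters on top of a frozen weight matrix:
Python
import numpy as np

# Illustrative dimensions only; Gemini's actual layer shapes are not public.
d_model, r = 4096, 4                    # hidden size and LoRA rank ("adapter size")

W0 = np.random.randn(d_model, d_model)  # frozen pre-trained weights
A = np.random.randn(r, d_model) * 0.01  # trainable low-rank factor
B = np.zeros((d_model, r))              # trainable factor, initialized to zero

delta_W = B @ A                         # rank-r update: ΔW = B × A
W_new = W0 + delta_W                    # effective weights at inference

frozen = W0.size
trainable = A.size + B.size
print(f"Frozen params: {frozen:,} | Trainable params: {trainable:,} "
      f"({100 * trainable / frozen:.2f}% of the original matrix)")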
2.3. Vertex AI Platform vs. Google AI Studio
It is crucial to distinguish between Google AI Studio and Vertex AI for fine-tuning purposes. Historically, AI Studio offered a simplified interface. However, Google has deprecated fine-tuning support for newer models (like Gemini 1.5 Flash/Pro and 2.5 series) directly via the Gemini API in AI Studio, migrating it exclusively to Vertex AI.8
Vertex AI offers a managed infrastructure that provides granular control over:
Data Sovereignty: ensuring training data and the adapted model remain in specific geographic regions (e.g., us-central1, europe-west4).6
MLOps Pipeline: Integration with Vertex AI Experiments for metric tracking and model versioning.
3. Environment Preparation and Google Cloud Infrastructure
Success in a fine-tuning job depends on a solid infrastructure foundation. Permission errors or quota misconfigurations are the most common causes of failure before training even begins.
3.1. Project and API Configuration
It is recommended to isolate the fine-tuning environment in a dedicated GCP project to facilitate cost control and access auditing.
Step 1: Activate APIs
The following APIs are mandatory:
aiplatform.googleapis.com (Vertex AI API): The core of the operation.
storage.googleapis.com (Google Cloud Storage): For storing datasets and artifacts.
iam.googleapis.com: For identity management.
Step 2: Region Configuration
Region choice is non-trivial. Gemini 2.5 Pro and the accelerators required for its tuning are not available in all Google Cloud regions. Supported regions for tuning typically include us-central1 and europe-west4.6 Attempting to start a job in an unsupported region will result in a resource unavailability error.
3.2. Identity and Access Management (IAM)
The Service Account (SA) executing the training pipeline needs specific permissions.10
| IAM Role | Technical Justification |
| :---- | :---- |
| roles/aiplatform.user | Allows creating training jobs, models, and endpoints in Vertex AI. |
| roles/storage.objectAdmin | Allows reading the JSONL dataset and writing logs/artifacts to the staging bucket. |
| roles/serviceusage.serviceUsageConsumer | Allows the account to consume project API quota. |
3.3. Quota Verification
Fine-tuning consumes highly contended accelerator resources. Even though the service is managed, there is a project-level quota called Global concurrent tuning jobs.
Verification: Access "IAM & Admin" -> "Quotas" and filter by "Vertex AI" and "Tuning".
Default: New projects often have this quota set to 0 or 1 concurrent job.
Action: Request a quota increase in advance if planning multiple parallel experiments.4
3.4. Python SDK Installation
The environment must have the latest version of the SDK to support Gemini 2.5 classes.
# Critical update for Gemini 2.5 support and SFT features
pip install --upgrade google-cloud-aiplatform google-auth google-cloud-storage
Python Environment Initialization:
Python
import vertexai
from google.cloud import aiplatform

# Project Constants
PROJECT_ID = "your-gcp-project-id"
REGION = "us-central1"  # Supported region for Gemini 2.5 tuning availability [6]
STAGING_BUCKET = "gs://your-staging-bucket-logs"

# SDK Initialization
vertexai.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=STAGING_BUCKET
)

print(f"Vertex AI SDK version {aiplatform.__version__} initialized.")
4. Data Engineering: The Heart of Fine-Tuning
Data quality, consistency, and formatting are the single most important determinants of fine-tuning success. A noisy dataset will result in a model that hallucinates, regardless of the training epochs.
4.1. JSONL Format and Message Structure
Vertex AI strictly requires the dataset to be provided in JSON Lines (.jsonl) format. Each line is a valid, independent JSON object representing a full training session, following the chat "messages" pattern.5
Required Canonical Structure:
{"messages": [
  {"role": "system", "content": "You are a senior financial analyst."},
  {"role": "user", "content": "Summarize the Q3 report focusing on EBITDA."},
  {"role": "model", "content": "EBITDA grew 12% quarter over quarter, driven by..."}
]}
Common Formatting Errors:
Inconsistent System Prompt: If you use a system prompt in training ("You are a finance expert..."), you must use exactly the same system prompt during inference.
Multi-turn vs. Single-turn: Gemini supports multi-turn chat. If training a chatbot that maintains context, your JSONL examples should contain the conversation history (User -> Model -> User -> Model).
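To illustrate both cases, here is a minimal sketch that writes one single-turn and one multi-turn example to a JSONL file, using the messages/role/content pattern described above (the example content is hypothetical):
Python
import json

# Keep the system prompt identical to the one used at inference time.
SYSTEM_PROMPT = "You are a senior financial analyst."

examples = [
    {   # Single-turn example
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "Summarize the Q3 report focusing on EBITDA."},
            {"role": "model", "content": "EBITDA grew 12% QoQ, driven by lower logistics costs..."},
        ]
    },
    {   # Multi-turn example: the full conversation history is part of the training signal
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "What was the net margin in Q3?"},
            {"role": "model", "content": "Net margin was 8.4%, up from 7.1% in Q2."},
            {"role": "user", "content": "And excluding one-off items?"},
            {"role": "model", "content": "Excluding one-off items, adjusted net margin was 7.8%."},
        ]
    },
]

with open("my_train_dataset.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")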
4.2. Data Quality and Volume Strategy
Vertex AI documentation and market practice suggest clear guidelines for data volume:
| Dataset Size | Expectation |
| :---- | :---- |
| 1 - 50 examples | Insufficient for SFT. Better to use Few-Shot Prompting. SFT here risks rapid overfitting. |
| 100 - 500 examples | The "Sweet Spot" for most style and format adaptation tasks.5 The model generalizes the pattern without memorizing content. |
| > 1,000 examples | Necessary for teaching new languages (e.g., DSLs), complex reasoning tasks, or very specific knowledge domains. |
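Since the training job in Section 5 accepts both a training and a validation URI, it is convenient to carve a held-out split from the collected examples. A minimal sketch, assuming a single combined JSONL file and a 90/10 split:
Python
import random

def split_dataset(input_path, train_path, val_path, val_fraction=0.1, seed=42):
    """Shuffle a JSONL dataset and write train/validation splits."""
    with open(input_path, "r", encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]

    random.Random(seed).shuffle(lines)
    split_idx = int(len(lines) * (1 - val_fraction))

    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(lines[:split_idx])
    with open(val_path, "w", encoding="utf-8") as f:
        f.writelines(lines[split_idx:])

    print(f"Train: {split_idx} examples | Validation: {len(lines) - split_idx} examples")

# Usage
# split_dataset("my_dataset.jsonl", "train.jsonl", "validation.jsonl")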
4.3. Data Validation Script
Before uploading to Cloud Storage, it is vital to validate the dataset locally.
Python
import json
import logging
logging.basicConfig(level=logging.INFO)
def validate_jsonl(file_path):
    errors = []
    valid_count = 0
    with open(file_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            try:
                data = json.loads(line)
                # Check 1: 'messages' key
                if 'messages' not in data:
                    errors.append(f"Line {line_num}: Missing 'messages' key.")
                    continue
                messages = data['messages']
                # Check 2: Roles
                roles = [m.get('role') for m in messages]
                if 'user' not in roles or 'model' not in roles:
                    errors.append(f"Line {line_num}: Must contain at least one 'user' and one 'model' message.")
                    continue
                # Check 3: Non-empty content
                if any(not m.get('content') for m in messages):
                    errors.append(f"Line {line_num}: Empty content detected.")
                    continue
                valid_count += 1
            except json.JSONDecodeError:
                errors.append(f"Line {line_num}: Invalid JSON.")

    if errors:
        logging.error(f"Found {len(errors)} errors in dataset:")
        for err in errors[:10]:
            logging.error(err)
        return False

    logging.info(f"Validation successful. {valid_count} valid examples.")
    return True

# Usage
# validate_jsonl("my_train_dataset.jsonl")
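Once validation passes, the files must land in the Cloud Storage bucket referenced by the training job. A short sketch using the google-cloud-storage client installed earlier (bucket and object names are placeholders):
Python
from google.cloud import storage

def upload_dataset(local_path, bucket_name, destination_blob):
    """Upload a validated JSONL file to Cloud Storage and return its gs:// URI."""
    client = storage.Client(project=PROJECT_ID)
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(destination_blob)
    blob.upload_from_filename(local_path)
    uri = f"gs://{bucket_name}/{destination_blob}"
    print(f"Uploaded {local_path} to {uri}")
    return uri

# Usage
# train_uri = upload_dataset("train.jsonl", "your-bucket-ml", "gemini-tuning/v1/train.jsonl")
# val_uri = upload_dataset("validation.jsonl", "your-bucket-ml", "gemini-tuning/v1/validation.jsonl")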
5. Executing Fine-Tuning: Code and Hyperparameters
We utilize the vertexai.tuning.sft module, which is the standard programmatic interface for this task.6
5.1. Defining the Base Model
Use the correct version tag.
Target Model: gemini-2.5-pro-001 (or the latest versioned tag).
Note: Avoid generic aliases if strict reproducibility is required.
5.2. Training Code (SFT Pipeline)
Python
import time
from vertexai.tuning import sft
# Job Configuration
BASE_MODEL = "gemini-2.5-pro-001"
TRAIN_DATASET_URI = "gs://your-bucket-ml/gemini-tuning/v1/train.jsonl"
VALIDATION_DATASET_URI = "gs://your-bucket-ml/gemini-tuning/v1/validation.jsonl"
TUNED_MODEL_DISPLAY_NAME = "gemini-2.5-pro-finance-v1"

# Hyperparameter Configuration
EPOCHS = 4
ADAPTER_SIZE = 4  # Supported values for Pro: 1, 2, 4, 8
LEARNING_RATE_MULTIPLIER = 1.0

def run_fine_tuning_job():
    print(f"Starting SFT job for model {BASE_MODEL}...")
    # Create and submit the Job
    # sft.train initiates the managed pipeline on Vertex AI
    sft_tuning_job = sft.train(
        source_model=BASE_MODEL,
        train_dataset=TRAIN_DATASET_URI,
        validation_dataset=VALIDATION_DATASET_URI,
        epochs=EPOCHS,
        adapter_size=ADAPTER_SIZE,
        learning_rate_multiplier=LEARNING_RATE_MULTIPLIER,
        tuned_model_display_name=TUNED_MODEL_DISPLAY_NAME,
        # Region is inferred from vertexai.init
    )
    return sft_tuning_job

# Execute
# tuning_job = run_fine_tuning_job()
5.3. Deep Dive into Hyperparameters
| Hyperparameter | Technical Impact and Recommendations |
| :---- | :---- |
| Epochs | Defines how many times the model sees the dataset. • Few (<3): Underfitting. • Many (>10): Overfitting. • Recommendation: Start with 3-5. |
| Adapter Size (LoRA Rank) | Defines the dimensionality of trainable matrices. • Size 1 or 4: Ideal for simple tasks (formatting, tone). • Size 8: Necessary for complex tasks requiring reasoning. • Note: Pro supports 1, 2, 4, 8.5 |
| Learning Rate Multiplier | Scales the default optimizer rate. • 1.0: Safe default. • <1.0: Use if the base model is already performing well and only needs slight adjustment. |
5.4. Monitoring and Polling
The script should monitor the state to ensure the process completes successfully.11
Python
def monitor_tuning_job(job):
    while not job.has_ended:
        time.sleep(60)
        job.refresh()
        print(f"Status: {job.state.name}")

    # JobState enum names follow the JOB_STATE_* pattern
    if job.state.name == "JOB_STATE_SUCCEEDED":
        print("Training completed successfully!")
        print(f"Model Resource Name: {job.tuned_model_name}")
        print(f"Endpoint (Auto-Deploy): {job.tuned_model_endpoint_name}")
        return job.tuned_model_endpoint_name
    else:
        print(f"Job FAILED. Error: {job.error}")
        return None
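Tuning jobs can run for hours, and the monitoring loop may be interrupted. In recent SDK versions the job can be re-attached by its resource name (visible in the Console under Vertex AI > Tuning); the sketch below assumes that constructor behavior and uses a placeholder job ID:
Python
from vertexai.tuning import sft

# Resource name format: projects/{PROJECT_NUMBER}/locations/{REGION}/tuningJobs/{JOB_ID}
EXISTING_JOB_NAME = "projects/123456789012/locations/us-central1/tuningJobs/998877665544"

# Re-attach to the running (or finished) job and resume polling
existing_job = sft.SupervisedTuningJob(EXISTING_JOB_NAME)
endpoint = monitor_tuning_job(existing_job)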
6. Hosting, Deployment, and Inference Optimization
Where is the model after the job SUCCEEDED? How is it served?
6.1. The Vertex AI Endpoint Concept
In Vertex AI, you do not "download" the tuned Gemini 2.5 Pro model. The base model is proprietary and massive. Instead, your LoRA adapters are saved in the Model Registry.
When you deploy (which the SFT job often does automatically), Vertex AI provisions an Endpoint. An Endpoint is a managed service URL pointing to compute infrastructure that loads Gemini 2.5 Pro + Your Adapters.
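If auto-deployment occurred, the resulting endpoint can be inspected programmatically. A brief sketch (the endpoint ID is a placeholder) that lists what is deployed behind it:
Python
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

# Placeholder endpoint ID; take the real one from the tuning job output or the Console
endpoint = aiplatform.Endpoint(
    "projects/123456789012/locations/us-central1/endpoints/11223344556677"
)

print(f"Endpoint: {endpoint.display_name}")
for deployed in endpoint.list_models():
    print(f"  Deployed model: {deployed.model} (id: {deployed.id})")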
6.2. Consuming the Model via Python SDK
To consume the model, instantiate the GenerativeModel class pointing to the Endpoint Resource Name.6
Endpoint Resource Name Format:
projects/{PROJECT_NUMBER}/locations/{REGION}/endpoints/{ENDPOINT_ID}
Python
from vertexai.generative_models import GenerativeModel, GenerationConfig

# Replace with the value returned by monitor_tuning_job or from the Console
TUNED_MODEL_ENDPOINT_RESOURCE = "projects/123456789012/locations/us-central1/endpoints/11223344556677"

def predict_with_tuned_model(prompt_text):
    print(f"Sending prompt to: {TUNED_MODEL_ENDPOINT_RESOURCE}")

    # Instantiate the model pointing to the tuned endpoint
    # The SDK routes the request to the base model plus your adapter
    model = GenerativeModel(TUNED_MODEL_ENDPOINT_RESOURCE)

    # Generation Config: The Thinking Budget Paradox
    # For SFT models, documentation recommends disabling thinking
    # or setting it to the minimum, as SFT teaches the direct answer.
    generation_config = GenerationConfig(
        temperature=0.2,
        max_output_tokens=1024,
        # If supported by the specific SDK version for the model:
        # thinking_config={"include_thoughts": False}
    )

    try:
        response = model.generate_content(
            prompt_text,
            generation_config=generation_config
        )
        return response.text
    except Exception as e:
        print(f"Inference Error: {e}")
        return None

# Real Test
prompt = "Summarize the following financial report focusing on EBITDA:"
result = predict_with_tuned_model(prompt)
print("---------------- RESPONSE ----------------")
print(result)
6.3. The "Thinking Budget" Paradox in SFT Models
A critical finding for this report is the behavior of Gemini 2.5 Pro regarding its "thinking budget" when subjected to supervised fine-tuning.
Gemini 2.5 Pro is a "thinking" model. However, SFT trains the model to map inputs directly to the desired output. If you keep "thinking mode" enabled with a high token budget, the model tries to "reason" its way to a response it has already internalized through training. This can cause:
Increased Latency and Cost: Paying for useless thinking tokens.
Quality Degradation: The model may "overthink" and diverge from the strict format you taught it.
Therefore, best engineering practice is to zero out or minimize the thinking budget for SFT endpoints.5
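How the budget is constrained depends on the SDK in use. With the newer google-genai SDK, and assuming the tuned endpoint resource name is accepted as the model parameter (recent versions support this for Vertex endpoints), a sketch looks like the following; note that 2.5 Pro currently enforces a minimum thinking budget rather than allowing zero, so check the current floor in the documentation:
Python
from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project=PROJECT_ID, location=REGION)

response = client.models.generate_content(
    model=TUNED_MODEL_ENDPOINT_RESOURCE,   # tuned endpoint resource name
    contents="Summarize the following financial report focusing on EBITDA:",
    config=types.GenerateContentConfig(
        temperature=0.2,
        max_output_tokens=1024,
        # Minimize the thinking budget; 2.5 Pro may not accept 0 (check current limits)
        thinking_config=types.ThinkingConfig(thinking_budget=128),
    ),
)
print(response.text)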
7. Evaluation and Quality Assurance (QA)
7.1. Manual AB Testing (Qualitative)
Create a "Side-by-Side" evaluation script sending the same prompt to both the base model and the tuned model.
| Test Prompt | Base Model Response (Gemini 2.5 Pro) | Tuned Model Response | Engineer Analysis |
| :---- | :---- | :---- | :---- |
| "Analyze contract X." | Generic response, academic tone. | Technical response, cites specific local laws, senior legal tone. | Success: Adoption of persona and domain knowledge. |
7.2. Automatic Evaluation with Gen AI Evaluation Service
Vertex AI offers the Gen AI Evaluation service. You can use an LLM as a "Judge" to evaluate your tuned model's responses.6
Metrics:
Coherence: Does the answer make logical sense?
Instruction Following: Did it follow format constraints (JSON, XML)?
Safety: Did it generate toxic content?
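A hedged sketch using the evaluation SDK, assuming the vertexai.evaluation module and its built-in pointwise metric templates are available in your SDK version (prompts and experiment name are illustrative):
Python
import pandas as pd
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
from vertexai.generative_models import GenerativeModel

eval_dataset = pd.DataFrame({
    "prompt": [
        "Analyze contract X and list the termination clauses.",
        "Summarize the Q3 report focusing on EBITDA.",
    ]
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.COHERENCE,
        MetricPromptTemplateExamples.Pointwise.INSTRUCTION_FOLLOWING,
        MetricPromptTemplateExamples.Pointwise.SAFETY,
    ],
    experiment="gemini-25-pro-finance-eval",
)

# The tuned endpoint generates a response for each prompt, which the judge model scores
result = eval_task.evaluate(model=GenerativeModel(TUNED_MODEL_ENDPOINT_RESOURCE))
print(result.summary_metrics)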
8. MLOps and Production Considerations
8.1. Troubleshooting Common Errors
ResourceExhausted Error: You hit the concurrent job quota. Cancel old jobs or request a quota increase.4
InvalidArgument in Dataset: Usually means an example exceeds the 131k token limit or the JSONL is malformed.5
Safety Filters: Fine-tuning does not remove the native safety filters. If your domain is sensitive (medical/legal), you may need to relax the blocking thresholds by passing explicit safety_settings (HarmCategory / HarmBlockThreshold) with the request, as sketched below.
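A brief sketch of relaxing thresholds on the tuned endpoint (categories and thresholds are illustrative; review your compliance requirements before changing them):
Python
from vertexai.generative_models import (
    GenerativeModel,
    SafetySetting,
    HarmCategory,
    HarmBlockThreshold,
)

# Illustrative thresholds only
safety_settings = [
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
        threshold=HarmBlockThreshold.BLOCK_ONLY_HIGH,
    ),
    SafetySetting(
        category=HarmCategory.HARM_CATEGORY_HARASSMENT,
        threshold=HarmBlockThreshold.BLOCK_ONLY_HIGH,
    ),
]

model = GenerativeModel(TUNED_MODEL_ENDPOINT_RESOURCE)
response = model.generate_content(
    "Explain the contraindications of drug X for a clinical audience.",
    safety_settings=safety_settings,
)
print(response.text)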
8.2. Conclusion
Fine-tuning Gemini 2.5 Pro on Vertex AI is a powerful tool for transforming a generalist model into a domain specialist. The secret lies not in the Python code—which is relatively simple thanks to the SDK—but in rigorous Data-Centric AI engineering and the correct management of hyperparameters and inference budgets. By following this guide, engineers can deploy generative AI solutions that are not only impressive but robust, auditable, and ready for the enterprise environment.
References
1. Gemini 2.5 Pro – Vertex AI, Google Cloud Console, accessed December 8, 2025, https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-2.5-pro
2. Gemini thinking | Gemini API, Google AI for Developers, accessed December 8, 2025, https://ai.google.dev/gemini-api/docs/thinking
3. Gemini 2.5 on Vertex AI: Pro, Flash & Model Optimizer Live, Google Cloud Blog, accessed December 8, 2025, https://cloud.google.com/blog/products/ai-machine-learning/gemini-2-5-pro-flash-on-vertex-ai
4. Gemini 2.5 Pro | Generative AI on Vertex AI, Google Cloud Documentation, accessed December 8, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-pro
5. About supervised fine-tuning for Gemini models | Generative AI on Vertex AI, Google Cloud Documentation, accessed December 8, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-supervised-tuning
6. Tune Gemini models by using supervised fine-tuning | Generative AI on Vertex AI, Google Cloud Documentation, accessed December 8, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning
7. Release notes | Gemini API, Google AI for Developers, accessed December 8, 2025, https://ai.google.dev/gemini-api/docs/changelog
8. Fine-tuning with the Gemini API, Google AI for Developers, accessed December 8, 2025, https://ai.google.dev/gemini-api/docs/model-tuning
9. Tuning API | Generative AI on Vertex AI, Google Cloud Documentation, accessed December 8, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/model-reference/tuning
10. googleapis/python-aiplatform: A Python SDK for Vertex AI, a fully managed, end-to-end platform for data science and machine learning, GitHub, accessed December 8, 2025, https://github.com/googleapis/python-aiplatform
11. Fine-tune Generative AI models with Vertex AI Supervised Fine-tuning, Google Cloud Documentation, accessed December 8, 2025, https://docs.cloud.google.com/vertex-ai/generative-ai/docs/samples/generativeaionvertexai-tuning-basic
12. How to use Google Vertex AI fine tuned model via Node.js, Stack Overflow, accessed December 8, 2025, https://stackoverflow.com/questions/78738829/how-to-use-google-vertex-ai-fine-tuned-model-via-node-js