Deploy Vertex AI with Terraform: Upgrade to the Next Gemini Model with One Variable Change 😊
Vertex AI gives you Gemini 3, 2.5, Llama, and Claude on GCP. Here's how to deploy it with Terraform so swapping to the next model is a one-variable change — not a rewrite.
Google ships new Gemini models fast. Gemini 1.5 → 2.0 Flash → 2.5 Pro → 3 Flash — each one faster, smarter, cheaper.
Here's the problem: most teams hardcode gemini-2.0-flash throughout their infrastructure and application code. When Gemini 3 Flash lands, it's a multi-file search-and-replace across Terraform configs, Cloud Functions, and env vars.
What if upgrading to the next Gemini model was a one-variable change?
That's what we're building — a Vertex AI setup with Terraform where models are variables, not hardcoded strings. When the next model drops, you change one .tfvars file and run terraform apply. Done. 🎯
🤔 Vertex AI vs Bedrock vs Azure AI Foundry
| | Vertex AI (GCP) | Bedrock (AWS) | Azure AI Foundry |
|---|---|---|---|
| What it is | Full AI platform (inference + training + MLOps) | API access to pre-trained models | Managed OpenAI + multi-model platform |
| Models | Gemini 3, 2.5, Llama, Claude, Mistral | Claude, Llama, Titan | GPT-4.1, o4-mini, GPT-5 series |
| Unique strength | Native BigQuery + GCS + Model Optimizer | Serverless Knowledge Bases & Agents | OpenAI API compatibility + Azure AD |
| Key advantage | Most complete ML platform (inference → training → MLOps) | Broadest model selection | Drop-in OpenAI replacement |
Vertex AI's edge: It's the most complete — combining what AWS splits between Bedrock and SageMaker into one platform. Plus, Gemini is deeply integrated with BigQuery, Cloud Storage, and the whole GCP ecosystem. 🎯
💰 Vertex AI Model Landscape (As of Early 2026)
| Model | Best For | Input $/1M tokens | Output $/1M tokens | Context Window | Speed |
|---|---|---|---|---|---|
| Gemini 3 Pro | Frontier reasoning, agentic | $2.00 | $12.00 | 1M | Moderate |
| Gemini 3 Flash | Pro-grade at Flash speed | $0.50 | $3.00 | 1M | Fast |
| Gemini 2.5 Pro | Complex reasoning, coding | $1.25 | $10.00 | 1M | Moderate |
| Gemini 2.5 Flash | Balanced price/performance | $0.15 | $0.60 | 1M | Fast |
| Gemini 2.0 Flash | Ultra-cheap, high volume | $0.10 | $0.40 | 1M | Fastest |
| Gemini 2.0 Flash-Lite | Cheapest (deprecated Mar 2026) | $0.025 | $0.10 | 1M | Fastest |
| text-embedding-005 | Embeddings for RAG | $0.00001/char | — | 3K | Fastest |
Key insight: Gemini 2.0 Flash ($0.10/1M) is 20x cheaper than Gemini 3 Pro ($2.00/1M). Use 2.0 Flash for dev, 2.5 Flash or 3 Flash for prod. When the next gen drops, just update the variable. 💸
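That 20x ratio is easy to sanity-check. A quick sketch using the rates from the pricing table above — the traffic volume is made up for illustration, so verify the rates against current Vertex AI pricing before budgeting:

```python
# Rough monthly cost estimate from the table's $/1M-token rates.
# Prices are assumptions copied from the table above — check current pricing.
PRICES = {  # model: (input $/1M tokens, output $/1M tokens)
    "gemini-3-pro":     (2.00, 12.00),
    "gemini-3-flash":   (0.50, 3.00),
    "gemini-2.5-flash": (0.15, 0.60),
    "gemini-2.0-flash": (0.10, 0.40),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost for a month's traffic at the table's published rates."""
    in_rate, out_rate = PRICES[model]
    return in_rate * input_tokens / 1e6 + out_rate * output_tokens / 1e6

# Hypothetical workload: 100M input + 20M output tokens per month
print(round(monthly_cost("gemini-2.0-flash", 100_000_000, 20_000_000), 2))  # 18.0
print(round(monthly_cost("gemini-3-pro", 100_000_000, 20_000_000), 2))      # 440.0
```

At that volume, the same traffic costs $18/month on 2.0 Flash versus $440/month on 3 Pro — which is why routing dev and simple tasks to the economy model pays off.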
🏗️ Step 1: Model Configuration as Variables
This is the core pattern — every model detail is a variable:
```hcl
# vertex-ai/variables.tf
variable "project_id" {
  type        = string
  description = "GCP project ID"
}

variable "region" {
  type    = string
  default = "us-central1"
}

variable "environment" {
  type    = string
  default = "dev"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Must be: dev, staging, or prod."
  }
}

# ─── MODEL CONFIGURATION (Change ONLY these to upgrade) ───
variable "primary_model" {
  description = "Primary model for chat/completion. Change this when a new Gemini releases."
  type = object({
    id      = string # Vertex AI model ID (e.g. "gemini-2.5-flash")
    display = string # Human-readable name for tags/logs
  })
}

variable "economy_model" {
  description = "Cheap model for dev/testing/simple tasks."
  type = object({
    id      = string
    display = string
  })
}

variable "embedding_model" {
  description = "Embedding model for RAG/vector search."
  type = object({
    id      = string
    display = string
  })
}
```
Now create per-environment .tfvars files:
```hcl
# environments/dev.tfvars
# ─── Dev: Cheapest models ───
environment = "dev"

primary_model = {
  id      = "gemini-2.0-flash" # Cheapest current model
  display = "Gemini 2.0 Flash"
}

economy_model = {
  id      = "gemini-2.0-flash" # Same as primary in dev
  display = "Gemini 2.0 Flash"
}

embedding_model = {
  id      = "text-embedding-005"
  display = "Text Embedding 005"
}
```

```hcl
# environments/prod.tfvars
# ─── Prod: Latest flagship models ───
environment = "prod"

primary_model = {
  id      = "gemini-2.5-flash" # Latest stable flagship
  display = "Gemini 2.5 Flash"
}

economy_model = {
  id      = "gemini-2.0-flash" # Cheap fallback for simple tasks
  display = "Gemini 2.0 Flash"
}

embedding_model = {
  id      = "text-embedding-005"
  display = "Text Embedding 005"
}
```
🚀 When Gemini 3 Flash goes GA: Update `prod.tfvars` with `id = "gemini-3-flash"`. Run `terraform apply`. That's it. No code changes. No Docker rebuilds. One PR, one deploy.
🏗️ Step 2: Enable APIs & Create Base Infrastructure
```hcl
# vertex-ai/main.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 5.40.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

# Enable all required APIs
resource "google_project_service" "apis" {
  for_each = toset([
    "aiplatform.googleapis.com",
    "cloudfunctions.googleapis.com",
    "cloudbuild.googleapis.com",
    "run.googleapis.com",
    "iam.googleapis.com",
    "logging.googleapis.com",
    "monitoring.googleapis.com",
  ])

  project            = var.project_id
  service            = each.value
  disable_on_destroy = false
}
```
⚠️ GCP gotcha: Unlike AWS, where services are available by default, GCP requires explicit API enablement. Forgetting `aiplatform.googleapis.com` gives you cryptic 403 errors. Terraform handles this cleanly.
🔐 Step 3: Service Account with Least-Privilege IAM
```hcl
# vertex-ai/iam.tf
resource "google_service_account" "vertex_invoker" {
  account_id   = "${var.environment}-vertex-invoker"
  display_name = "Vertex AI Invoker (${var.environment})"
  description  = "Service account for invoking Vertex AI models"
  project      = var.project_id
}

# Custom role — only allow model invocation, nothing else
resource "google_project_iam_custom_role" "vertex_invoke_only" {
  role_id     = "${replace(var.environment, "-", "_")}_vertex_invoke"
  title       = "Vertex AI Invoke Only (${var.environment})"
  description = "Can only invoke Vertex AI models, no admin"
  project     = var.project_id

  permissions = [
    "aiplatform.endpoints.predict",
    "aiplatform.endpoints.generateContent",
    "aiplatform.endpoints.streamGenerateContent",
    "aiplatform.endpoints.list",
    "aiplatform.endpoints.get",
    "aiplatform.models.list",
    "aiplatform.models.get",
  ]
}

resource "google_project_iam_member" "vertex_invoke" {
  project = var.project_id
  role    = google_project_iam_custom_role.vertex_invoke_only.id
  member  = "serviceAccount:${google_service_account.vertex_invoker.email}"
}
```
Why custom roles? The built-in `roles/aiplatform.user` lets you create endpoints, deploy models, and more. A custom role with just `predict` and `generateContent` is how enterprises lock it down. 🔒
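To see the reduction concretely, here's a toy diff of permission sets. The "built-in" list below is abbreviated and hand-picked purely for illustration — dump the real one with `gcloud iam roles describe roles/aiplatform.user`:

```python
# Illustrative only: what an invoke-only custom role drops relative to a
# broader built-in role. The BUILTIN_ROLE set is an abbreviated example,
# NOT the actual roles/aiplatform.user permission list.
BUILTIN_ROLE = {
    "aiplatform.endpoints.predict",
    "aiplatform.endpoints.create",  # create new endpoints
    "aiplatform.endpoints.deploy",  # deploy models to endpoints
    "aiplatform.models.upload",     # upload new models
}
CUSTOM_INVOKE_ONLY = {
    "aiplatform.endpoints.predict",  # the invoke path from iam.tf
}

# Attack surface removed by the custom role:
removed = sorted(BUILTIN_ROLE - CUSTOM_INVOKE_ONLY)
print(removed)
# ['aiplatform.endpoints.create', 'aiplatform.endpoints.deploy', 'aiplatform.models.upload']
```

Anything in `removed` is something a leaked service-account key can no longer do.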
⚡ Step 4: Model-Agnostic Cloud Function
The function reads the model ID from env vars — zero hardcoded model names:
```hcl
# vertex-ai/function.tf
resource "google_storage_bucket" "function_source" {
  name                        = "${var.project_id}-${var.environment}-ai-functions"
  location                    = var.region
  uniform_bucket_level_access = true
  force_destroy               = true
}

data "archive_file" "ai_function" {
  type        = "zip"
  output_path = "${path.module}/ai_function.zip"

  source {
    content  = <<-PYTHON
      import json
      import os

      import functions_framework
      import vertexai
      from vertexai.generative_models import GenerativeModel

      vertexai.init(
          project=os.environ.get('GCP_PROJECT'),
          location=os.environ.get('GCP_REGION', 'us-central1')
      )

      @functions_framework.http
      def handler(request):
          try:
              request_json = request.get_json(silent=True) or {}
              prompt = request_json.get('prompt', 'Say hello!')
              max_tokens = request_json.get('max_tokens', 500)
              temperature = request_json.get('temperature', 0.7)

              # Model comes from env var — NEVER hardcoded
              model_id = os.environ.get('PRIMARY_MODEL_ID')
              model = GenerativeModel(model_id)

              response = model.generate_content(
                  prompt,
                  generation_config={
                      "max_output_tokens": max_tokens,
                      "temperature": temperature,
                  }
              )

              return json.dumps({
                  "response": response.text,
                  "model": model_id,
                  "usage": {
                      "prompt_tokens": response.usage_metadata.prompt_token_count,
                      "response_tokens": response.usage_metadata.candidates_token_count,
                      "total_tokens": response.usage_metadata.total_token_count
                  }
              }), 200, {'Content-Type': 'application/json'}
          except Exception as e:
              return json.dumps({"error": str(e)}), 500, {'Content-Type': 'application/json'}
    PYTHON
    filename = "main.py"
  }

  source {
    content  = <<-REQUIREMENTS
      functions-framework==3.*
      google-cloud-aiplatform>=1.60.0
      vertexai>=1.60.0
    REQUIREMENTS
    filename = "requirements.txt"
  }
}

resource "google_storage_bucket_object" "function_source" {
  name   = "ai-function-${data.archive_file.ai_function.output_md5}.zip"
  bucket = google_storage_bucket.function_source.name
  source = data.archive_file.ai_function.output_path
}

resource "google_cloudfunctions2_function" "ai_endpoint" {
  name     = "${var.environment}-vertex-ai-endpoint"
  location = var.region

  build_config {
    runtime     = "python312"
    entry_point = "handler"
    source {
      storage_source {
        bucket = google_storage_bucket.function_source.name
        object = google_storage_bucket_object.function_source.name
      }
    }
  }

  service_config {
    max_instance_count    = 10
    min_instance_count    = 0
    available_memory      = "512Mi"
    timeout_seconds       = 60
    service_account_email = google_service_account.vertex_invoker.email

    environment_variables = {
      GCP_PROJECT      = var.project_id
      GCP_REGION       = var.region
      PRIMARY_MODEL_ID = var.primary_model.id # ← From .tfvars
    }
  }

  # depends_on must reference the whole resource — Terraform rejects
  # indexed for_each instances here
  depends_on = [google_project_service.apis]

  labels = {
    environment = var.environment
    purpose     = "vertex-ai"
    managed-by  = "terraform"
  }
}

# Allow unauthenticated access for dev (remove for prod!)
resource "google_cloud_run_v2_service_iam_member" "allow_unauthenticated" {
  count    = var.environment == "dev" ? 1 : 0
  project  = var.project_id
  location = var.region
  # Gen2 function name matches the underlying Cloud Run service name
  name     = google_cloudfunctions2_function.ai_endpoint.name
  role     = "roles/run.invoker"
  member   = "allUsers"
}

output "ai_endpoint_url" {
  value       = google_cloudfunctions2_function.ai_endpoint.url
  description = "URL to invoke your AI endpoint"
}

output "active_model" {
  value       = var.primary_model.id
  description = "Currently deployed model"
}
```
Notice: `PRIMARY_MODEL_ID = var.primary_model.id` — the model flows from `.tfvars` → Terraform → env var → Python. The function code never changes. ✅
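One refinement worth considering: the handler above passes whatever `os.environ.get('PRIMARY_MODEL_ID')` returns straight to `GenerativeModel`, so a deploy that forgot the env var fails with a confusing error. A small fail-fast guard makes the misconfiguration obvious — a sketch, with `resolve_model_id` being a hypothetical helper, not part of the function above:

```python
import os

def resolve_model_id(env=None) -> str:
    """Same pattern as the Cloud Function — the model ID comes only from
    the environment — plus a fail-fast check so a deploy missing
    PRIMARY_MODEL_ID is caught immediately with a clear message."""
    env = os.environ if env is None else env
    model_id = env.get("PRIMARY_MODEL_ID")
    if not model_id:
        raise RuntimeError(
            "PRIMARY_MODEL_ID is not set — check the "
            "environment_variables block in function.tf"
        )
    return model_id

print(resolve_model_id({"PRIMARY_MODEL_ID": "gemini-2.5-flash"}))  # gemini-2.5-flash
```

Call it once at the top of `handler` instead of the bare `os.environ.get` and a bad deploy surfaces as a readable 500 on the first request.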
🧪 Step 5: Deploy & Test
```bash
# Deploy dev (uses gemini-2.0-flash — cheapest)
terraform apply -var-file=environments/dev.tfvars

# Test it
curl -X POST $(terraform output -raw ai_endpoint_url) \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain Kubernetes in 2 sentences.", "max_tokens": 200}'

# Deploy prod (uses gemini-2.5-flash — latest stable)
terraform apply -var-file=environments/prod.tfvars
```
Response:
```json
{
  "response": "Kubernetes is an open-source container orchestration platform...",
  "model": "gemini-2.5-flash",
  "usage": {
    "prompt_tokens": 12,
    "response_tokens": 48,
    "total_tokens": 60
  }
}
```
🔄 The Upgrade Workflow
When Gemini 3 Flash goes GA:
```diff
 # environments/prod.tfvars
 primary_model = {
-  id      = "gemini-2.5-flash"
-  display = "Gemini 2.5 Flash"
+  id      = "gemini-3-flash"
+  display = "Gemini 3 Flash"
 }
```

```bash
terraform plan -var-file=environments/prod.tfvars   # Review
terraform apply -var-file=environments/prod.tfvars  # Deploy
```
No application code changes. No Docker rebuilds. No sprint tickets. One .tfvars diff → PR → review → merge → done. 🎯
And cascade the old prod model down to staging:
```diff
 # environments/staging.tfvars
 primary_model = {
-  id      = "gemini-2.0-flash"
-  display = "Gemini 2.0 Flash"
+  id      = "gemini-2.5-flash"
+  display = "Gemini 2.5 Flash"
 }
```
Model cascade pattern: Latest → Prod, previous flagship → Staging, cheapest → Dev. Each upgrade just shifts models down. ♻️
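The cascade is mechanical enough to sketch in a few lines — each environment inherits the model from the tier above it, and prod gets the new flagship. A toy illustration (environment names match the `.tfvars` files; model names are the ones from this post):

```python
# Sketch of the model cascade: when a new flagship ships, shift each
# environment's model down one tier and give prod the new model.
TIERS = ["dev", "staging", "prod"]  # cheapest → latest

def cascade(current: dict, new_flagship: str) -> dict:
    """Return the post-upgrade model assignment per environment."""
    updated = dict(current)
    for lower, upper in zip(TIERS, TIERS[1:]):
        updated[lower] = current[upper]  # dev ← old staging, staging ← old prod
    updated["prod"] = new_flagship
    return updated

before = {
    "dev":     "gemini-2.0-flash",
    "staging": "gemini-2.0-flash",
    "prod":    "gemini-2.5-flash",
}
print(cascade(before, "gemini-3-flash"))
# {'dev': 'gemini-2.0-flash', 'staging': 'gemini-2.5-flash', 'prod': 'gemini-3-flash'}
```

In practice the "function" is just three `.tfvars` diffs, but the invariant is the same: models only ever move down the tiers, so prod never runs anything staging hasn't already exercised.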
🎯 What You Just Built
```
┌──────────────────────────────────────────────────┐
│                 .tfvars files                    │
│  dev: gemini-2.0-flash   prod: gemini-2.5-flash  │
└──────────────┬───────────────────────────────────┘
               │ terraform apply -var-file=...
               ▼
┌──────────────────────────┐
│   Cloud Function (v2)    │
│   env: PRIMARY_MODEL_ID  │
│   (model-agnostic code)  │
│   SA: vertex-invoker     │
└──────────────┬───────────┘
               │ Service Account auth
               ▼
┌──────────────────────────┐
│        Vertex AI         │
│   Gemini (whichever      │
│   version .tfvars says)  │
└──────────────────────────┘
```
Config flows one direction: .tfvars → infrastructure → application. The app never knows or cares which model version it's calling. 🚀
⏭️ What's Next
This is Post 1 of the AI Infra on GCP with Terraform series. Coming up:
- Post 2: Vertex AI Safety Filters — Content moderation with Terraform
- Post 3: Audit Logging — Track every AI call with Cloud Logging
- Post 4: RAG with Vertex AI Search — Connect your docs to Gemini
Google ships new Gemini models quarterly. Build your infra so upgrading is a config change, not a project. 🧠
Found this helpful? Follow for the full AI Infra on GCP with Terraform series! 💬