Suhas Mallesh

Your GCP Account is AI-Ready: Deploy your first AI endpoint with Terraform in 10 minutes⚡

Deploy Vertex AI with Terraform: Upgrade to the Next Gemini Model with One Variable Change 😊

Vertex AI gives you Gemini 3, 2.5, Llama, and Claude on GCP. Here's how to deploy it with Terraform so swapping to the next model is a one-variable change — not a rewrite.

Google ships new Gemini models fast. Gemini 1.5 → 2.0 Flash → 2.5 Pro → 3 Flash — each one faster, smarter, cheaper.

Here's the problem: most teams hardcode gemini-2.0-flash throughout their infrastructure and application code. When Gemini 3 Flash lands, it's a multi-file search-and-replace across Terraform configs, Cloud Functions, and env vars.

What if upgrading to the next Gemini model was a one-variable change?

That's what we're building — a Vertex AI setup with Terraform where models are variables, not hardcoded strings. When the next model drops, you change one .tfvars file and run terraform apply. Done. 🎯

🤔 Vertex AI vs Bedrock vs Azure AI Foundry

| | Vertex AI (GCP) | Bedrock (AWS) | Azure AI Foundry |
|---|---|---|---|
| What it is | Full AI platform (inference + training + MLOps) | API access to pre-trained models | Managed OpenAI + multi-model platform |
| Models | Gemini 3, 2.5, Llama, Claude, Mistral | Claude, Llama, Titan | GPT-4.1, o4-mini, GPT-5 series |
| Unique strength | Native BigQuery + GCS + Model Optimizer | Serverless Knowledge Bases & Agents | OpenAI API compatibility + Azure AD |
| Key advantage | Most complete ML platform (inference → training → MLOps) | Broadest model selection | Drop-in OpenAI replacement |

Vertex AI's edge: It's the most complete — combining what AWS splits between Bedrock and SageMaker into one platform. Plus, Gemini is deeply integrated with BigQuery, Cloud Storage, and the whole GCP ecosystem. 🎯

💰 Vertex AI Model Landscape (As of Early 2026)

| Model | Best For | Input $/1M tokens | Output $/1M tokens | Context Window | Speed |
|---|---|---|---|---|---|
| Gemini 3 Pro | Frontier reasoning, agentic | $2.00 | $12.00 | 1M | Moderate |
| Gemini 3 Flash | Pro-grade at Flash speed | $0.50 | $3.00 | 1M | Fast |
| Gemini 2.5 Pro | Complex reasoning, coding | $1.25 | $10.00 | 1M | Moderate |
| Gemini 2.5 Flash | Balanced price/performance | $0.15 | $0.60 | 1M | Fast |
| Gemini 2.0 Flash | Ultra-cheap, high volume | $0.10 | $0.40 | 1M | Fastest |
| Gemini 2.0 Flash-Lite | Cheapest (deprecated Mar 2026) | $0.025 | $0.10 | 1M | Fastest |
| text-embedding-005 | Embeddings for RAG | $0.00001/char | — | 3K | Fastest |

Key insight: Gemini 2.0 Flash ($0.10/1M input tokens) is 20x cheaper than Gemini 3 Pro ($2.00/1M input tokens). Use 2.0 Flash for dev, and 2.5 Flash or 3 Flash for prod. When the next generation drops, just update the variable. 💸
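To make that 20x gap concrete, here's a back-of-the-envelope cost calculator using the list prices from the table above. Treat the numbers as illustrative only — pricing changes often, so check the Vertex AI pricing page before budgeting.

```python
# Rough per-request cost estimate from the table's list prices ($ per 1M tokens).
# Prices shift frequently -- illustrative, not authoritative.
PRICES = {
    "gemini-3-pro":     {"input": 2.00, "output": 12.00},
    "gemini-2.5-flash": {"input": 0.15, "output": 0.60},
    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for a single request."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A 1,000-token prompt with a 500-token answer:
for m in PRICES:
    print(f"{m}: ${request_cost(m, 1_000, 500):.6f}")
```

At dev-scale traffic the difference is pennies either way; at millions of requests per month it's the difference between a rounding error and a budget line.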

🏗️ Step 1: Model Configuration as Variables

This is the core pattern — every model detail is a variable:

# vertex-ai/variables.tf

variable "project_id" {
  type        = string
  description = "GCP project ID"
}

variable "region" {
  type    = string
  default = "us-central1"
}

variable "environment" {
  type    = string
  default = "dev"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Must be: dev, staging, or prod."
  }
}

# ─── MODEL CONFIGURATION (Change ONLY these to upgrade) ───

variable "primary_model" {
  description = "Primary model for chat/completion. Change this when a new Gemini releases."
  type = object({
    id      = string  # Vertex AI model ID (e.g. "gemini-2.5-flash")
    display = string  # Human-readable name for tags/logs
  })
}

variable "economy_model" {
  description = "Cheap model for dev/testing/simple tasks."
  type = object({
    id      = string
    display = string
  })
}

variable "embedding_model" {
  description = "Embedding model for RAG/vector search."
  type = object({
    id      = string
    display = string
  })
}

Now create per-environment .tfvars files:

# environments/dev.tfvars
# ─── Dev: Cheapest models ───

environment = "dev"

primary_model = {
  id      = "gemini-2.0-flash"   # Cheapest current model
  display = "Gemini 2.0 Flash"
}

economy_model = {
  id      = "gemini-2.0-flash"   # Same as primary in dev
  display = "Gemini 2.0 Flash"
}

embedding_model = {
  id      = "text-embedding-005"
  display = "Text Embedding 005"
}
# environments/prod.tfvars
# ─── Prod: Latest flagship models ───

environment = "prod"

primary_model = {
  id      = "gemini-2.5-flash"   # Latest stable flagship
  display = "Gemini 2.5 Flash"
}

economy_model = {
  id      = "gemini-2.0-flash"   # Cheap fallback for simple tasks
  display = "Gemini 2.0 Flash"
}

embedding_model = {
  id      = "text-embedding-005"
  display = "Text Embedding 005"
}

🚀 When Gemini 3 Flash goes GA: Update prod.tfvars with id = "gemini-3-flash". Run terraform apply. That's it. No code changes. No Docker rebuilds. One PR, one deploy.
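Since model IDs now live only in .tfvars, a tiny CI check can catch typos before anyone runs terraform apply. Here's a sketch — the regex is a simple extraction, not a full HCL parser, and the approved-models list is an example you'd maintain yourself:

```python
import re

# CI guard sketch: every model id referenced in a .tfvars file must be on an
# approved list. The list and the inline sample are examples only.
APPROVED = {"gemini-3-flash", "gemini-2.5-flash", "gemini-2.0-flash", "text-embedding-005"}

def model_ids(tfvars_text: str) -> set[str]:
    """Extract id = "..." values from tfvars content (regex, not full HCL)."""
    return set(re.findall(r'\bid\s*=\s*"([^"]+)"', tfvars_text))

sample = '''
primary_model = {
  id      = "gemini-2.5-flash"
  display = "Gemini 2.5 Flash"
}
'''
unknown = model_ids(sample) - APPROVED
assert not unknown, f"unapproved models: {unknown}"
```

Wire this into the same pipeline that runs terraform plan and a fat-fingered model ID fails the PR instead of failing at apply time.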

🏗️ Step 2: Enable APIs & Create Base Infrastructure

# vertex-ai/main.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 5.40.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

# Enable all required APIs
resource "google_project_service" "apis" {
  for_each = toset([
    "aiplatform.googleapis.com",
    "cloudfunctions.googleapis.com",
    "cloudbuild.googleapis.com",
    "run.googleapis.com",
    "iam.googleapis.com",
    "logging.googleapis.com",
    "monitoring.googleapis.com",
  ])

  project = var.project_id
  service = each.value

  disable_on_destroy = false
}

⚠️ GCP gotcha: Unlike AWS where services are available by default, GCP requires explicit API enablement. Forgetting aiplatform.googleapis.com gives you cryptic 403 errors. Terraform handles this cleanly.

🔐 Step 3: Service Account with Least-Privilege IAM

# vertex-ai/iam.tf

resource "google_service_account" "vertex_invoker" {
  account_id   = "${var.environment}-vertex-invoker"
  display_name = "Vertex AI Invoker (${var.environment})"
  description  = "Service account for invoking Vertex AI models"
  project      = var.project_id
}

# Custom role — only allow model invocation, nothing else
resource "google_project_iam_custom_role" "vertex_invoke_only" {
  role_id     = "${replace(var.environment, "-", "_")}_vertex_invoke"
  title       = "Vertex AI Invoke Only (${var.environment})"
  description = "Can only invoke Vertex AI models, no admin"
  project     = var.project_id

  permissions = [
    "aiplatform.endpoints.predict",
    "aiplatform.endpoints.generateContent",
    "aiplatform.endpoints.streamGenerateContent",
    "aiplatform.endpoints.list",
    "aiplatform.endpoints.get",
    "aiplatform.models.list",
    "aiplatform.models.get",
  ]
}

resource "google_project_iam_member" "vertex_invoke" {
  project = var.project_id
  role    = google_project_iam_custom_role.vertex_invoke_only.id
  member  = "serviceAccount:${google_service_account.vertex_invoker.email}"
}

Why custom roles? The built-in roles/aiplatform.user lets you create endpoints, deploy models, and more. A custom role with just predict and generateContent is how enterprises lock it down. 🔒
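To keep the role minimal over time, you can add a guardrail to CI that fails whenever someone widens it. A sketch — the allow-list mirrors iam.tf above, and the extra permission in the demo is just an example of something that should trip the check:

```python
# CI guardrail sketch: fail if the custom role ever grows beyond
# invoke/read permissions. The allow-list mirrors iam.tf above.
ALLOWED = {
    "aiplatform.endpoints.predict",
    "aiplatform.endpoints.generateContent",
    "aiplatform.endpoints.streamGenerateContent",
    "aiplatform.endpoints.list",
    "aiplatform.endpoints.get",
    "aiplatform.models.list",
    "aiplatform.models.get",
}

def check_role(permissions: set[str]) -> set[str]:
    """Return any permissions that exceed the invoke-only allow-list."""
    return permissions - ALLOWED

# Example: someone sneaks a deploy permission into the role.
extra = check_role(ALLOWED | {"aiplatform.endpoints.deploy"})
print(extra)  # {'aiplatform.endpoints.deploy'}
```

Feed it the permissions list parsed from iam.tf (or from terraform plan output) and any scope creep becomes a failed build instead of a silent grant.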

⚡ Step 4: Model-Agnostic Cloud Function

The function reads the model ID from env vars — zero hardcoded model names:

# vertex-ai/function.tf

resource "google_storage_bucket" "function_source" {
  name                        = "${var.project_id}-${var.environment}-ai-functions"
  location                    = var.region
  uniform_bucket_level_access = true
  force_destroy               = true
}

data "archive_file" "ai_function" {
  type        = "zip"
  output_path = "${path.module}/ai_function.zip"

  source {
    content  = <<-PYTHON
import functions_framework
import vertexai
from vertexai.generative_models import GenerativeModel
import json
import os

vertexai.init(
    project=os.environ.get('GCP_PROJECT'),
    location=os.environ.get('GCP_REGION', 'us-central1')
)

@functions_framework.http
def handler(request):
    try:
        request_json = request.get_json(silent=True) or {}
        prompt = request_json.get('prompt', 'Say hello!')
        max_tokens = request_json.get('max_tokens', 500)
        temperature = request_json.get('temperature', 0.7)

        # Model comes from env var — NEVER hardcoded
        model_id = os.environ.get('PRIMARY_MODEL_ID')
        model = GenerativeModel(model_id)

        response = model.generate_content(
            prompt,
            generation_config={
                "max_output_tokens": max_tokens,
                "temperature": temperature,
            }
        )

        return json.dumps({
            "response": response.text,
            "model": model_id,
            "usage": {
                "prompt_tokens": response.usage_metadata.prompt_token_count,
                "response_tokens": response.usage_metadata.candidates_token_count,
                "total_tokens": response.usage_metadata.total_token_count
            }
        }), 200, {'Content-Type': 'application/json'}

    except Exception as e:
        return json.dumps({"error": str(e)}), 500, {'Content-Type': 'application/json'}
    PYTHON
    filename = "main.py"
  }

  source {
    content  = <<-REQUIREMENTS
functions-framework==3.*
google-cloud-aiplatform>=1.60.0  # ships the vertexai SDK module
    REQUIREMENTS
    filename = "requirements.txt"
  }
}

resource "google_storage_bucket_object" "function_source" {
  name   = "ai-function-${data.archive_file.ai_function.output_md5}.zip"
  bucket = google_storage_bucket.function_source.name
  source = data.archive_file.ai_function.output_path
}

resource "google_cloudfunctions2_function" "ai_endpoint" {
  name     = "${var.environment}-vertex-ai-endpoint"
  location = var.region

  build_config {
    runtime     = "python312"
    entry_point = "handler"

    source {
      storage_source {
        bucket = google_storage_bucket.function_source.name
        object = google_storage_bucket_object.function_source.name
      }
    }
  }

  service_config {
    max_instance_count    = 10
    min_instance_count    = 0
    available_memory      = "512Mi"
    timeout_seconds       = 60
    service_account_email = google_service_account.vertex_invoker.email

    environment_variables = {
      GCP_PROJECT      = var.project_id
      GCP_REGION       = var.region
      PRIMARY_MODEL_ID = var.primary_model.id  # ← From .tfvars
    }
  }

  # depends_on references must name whole resources, not individual
  # for_each instances, so depend on the full API set.
  depends_on = [google_project_service.apis]

  labels = {
    environment = var.environment
    purpose     = "vertex-ai"
    managed-by  = "terraform"
  }
}

# Allow unauthenticated access for dev (remove for prod!)
resource "google_cloud_run_v2_service_iam_member" "allow_unauthenticated" {
  count = var.environment == "dev" ? 1 : 0

  project  = var.project_id
  location = var.region
  name     = google_cloudfunctions2_function.ai_endpoint.service_config[0].service
  role     = "roles/run.invoker"
  member   = "allUsers"
}

output "ai_endpoint_url" {
  value       = google_cloudfunctions2_function.ai_endpoint.url
  description = "URL to invoke your AI endpoint"
}

output "active_model" {
  value       = var.primary_model.id
  description = "Currently deployed model"
}

Notice: PRIMARY_MODEL_ID = var.primary_model.id — the model flows from .tfvars → Terraform → env var → Python. The function code never changes. ✅

🧪 Step 5: Deploy & Test

# Deploy dev (uses gemini-2.0-flash — cheapest)
terraform apply -var-file=environments/dev.tfvars

# Test it
curl -X POST $(terraform output -raw ai_endpoint_url) \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain Kubernetes in 2 sentences.", "max_tokens": 200}'

# Deploy prod (uses gemini-2.5-flash — latest stable)
terraform apply -var-file=environments/prod.tfvars

Response:

{
  "response": "Kubernetes is an open-source container orchestration platform...",
  "model": "gemini-2.5-flash",
  "usage": {
    "prompt_tokens": 12,
    "response_tokens": 48,
    "total_tokens": 60
  }
}
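The same call works from Python with only the standard library. A sketch — the endpoint URL comes from terraform output -raw ai_endpoint_url, and the response shape matches the function above:

```python
import json
from urllib import request

def ask(endpoint_url: str, prompt: str, max_tokens: int = 200) -> dict:
    """POST a prompt to the deployed Cloud Function and return the parsed JSON."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    req = request.Request(endpoint_url, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

def total_tokens(payload: dict) -> int:
    """Sum prompt + response tokens from the function's usage block."""
    u = payload["usage"]
    return u["prompt_tokens"] + u["response_tokens"]
```

Notice the client never names a model: the "model" field in the response is informational, reported back by whatever the current .tfvars says.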

🔄 The Upgrade Workflow

When Gemini 3 Flash goes GA:

# environments/prod.tfvars

 primary_model = {
-  id      = "gemini-2.5-flash"
-  display = "Gemini 2.5 Flash"
+  id      = "gemini-3-flash"
+  display = "Gemini 3 Flash"
   }
terraform plan -var-file=environments/prod.tfvars   # Review
terraform apply -var-file=environments/prod.tfvars   # Deploy

No application code changes. No Docker rebuilds. No sprint tickets. One .tfvars diff → PR → review → merge → done. 🎯

And cascade the old prod model down to staging:

# environments/staging.tfvars

 primary_model = {
-  id      = "gemini-2.0-flash"
-  display = "Gemini 2.0 Flash"
+  id      = "gemini-2.5-flash"
+  display = "Gemini 2.5 Flash"
 }

Model cascade pattern: Latest → Prod, previous flagship → Staging, cheapest → Dev. Each upgrade just shifts models down. ♻️
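The cascade is simple enough to express as a tiny function — purely illustrative, and the model names are examples:

```python
# Sketch of the cascade: when a new flagship ships, each environment
# inherits the model from the tier above it.
def cascade(envs: dict[str, str], new_flagship: str) -> dict[str, str]:
    """prod gets the new flagship; staging gets old prod; dev gets old staging."""
    return {
        "prod": new_flagship,
        "staging": envs["prod"],
        "dev": envs["staging"],
    }

current = {"prod": "gemini-2.5-flash", "staging": "gemini-2.0-flash", "dev": "gemini-2.0-flash"}
print(cascade(current, "gemini-3-flash"))
# {'prod': 'gemini-3-flash', 'staging': 'gemini-2.5-flash', 'dev': 'gemini-2.0-flash'}
```

In practice this "function" is three .tfvars diffs, but thinking of it as a pure transformation is what makes each upgrade boring — in the good way.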

🎯 What You Just Built

┌──────────────────────────────────────────────────┐
│                  .tfvars files                   │
│  dev: gemini-2.0-flash  prod: gemini-2.5-flash   │
└──────────────┬───────────────────────────────────┘
               │ terraform apply -var-file=...
               ▼
┌──────────────────────────┐
│  Cloud Function (v2)     │
│  env: PRIMARY_MODEL_ID   │
│  (model-agnostic code)   │
│  SA: vertex-invoker      │
└──────────────┬───────────┘
               │ Service Account auth
               ▼
┌──────────────────────────┐
│  Vertex AI               │
│  Gemini (whichever       │
│  version .tfvars says)   │
└──────────────────────────┘

Config flows one direction: .tfvars → infrastructure → application. The app never knows or cares which model version it's calling. 🚀

⏭️ What's Next

This is Post 1 of the AI Infra on GCP with Terraform series. Coming up:

  • Post 2: Vertex AI Safety Filters — Content moderation with Terraform
  • Post 3: Audit Logging — Track every AI call with Cloud Logging
  • Post 4: RAG with Vertex AI Search — Connect your docs to Gemini

Google ships new Gemini models quarterly. Build your infra so upgrading is a config change, not a project. 🧠

Found this helpful? Follow for the full AI Infra on GCP with Terraform series! 💬
