Suhas Mallesh

Vertex AI Endpoints: Deploy Your Model to Production with Terraform 🚀

Your model is trained. Now deploy it to a scalable endpoint with autoscaling, traffic splitting for canary rollouts, and request-response logging. Here's how to deploy Vertex AI endpoints with Terraform.

In the previous post, we set up the Vertex AI Workbench - the workspace where your team trains models. Now comes the production side: taking a trained model and deploying it to an endpoint that your applications can call for real-time predictions.

Vertex AI uses a three-step deployment: upload a Model to the Model Registry, create an Endpoint (the HTTPS URL), then deploy the model to the endpoint with compute resources. Terraform provisions the endpoint and configures autoscaling, traffic splitting, and logging. The SDK handles model upload and deployment. 🎯

πŸ—οΈ The Deployment Architecture

Model Registry (uploaded model + container)
    ↓
Endpoint (HTTPS URL, traffic routing)
    ↓
Deployed Model (machine type, replicas, autoscaling)
Component        What It Defines
---------------  -----------------------------------------
Model            Container image + model artifacts in GCS
Endpoint         Stable HTTPS prediction URL
Deployed Model   Machine type, replica count, autoscaling
Traffic Split    Percentage routing between model versions

The endpoint URL stays stable across model versions. You update the deployed model behind it without changing your application code.

🔧 Terraform: Create the Endpoint

Endpoint Resource

# endpoint/main.tf

resource "google_vertex_ai_endpoint" "this" {
  name         = "${var.environment}-${var.model_name}-endpoint"
  display_name = "${var.environment}-${var.model_name}"
  description  = "Production endpoint for ${var.model_name}"
  location     = var.region
  project      = var.project_id

  labels = {
    environment = var.environment
    model       = var.model_name
    managed_by  = "terraform"
  }
}

Endpoint with Private Network (Production)

For production, deploy with a private endpoint inside your VPC:

resource "google_vertex_ai_endpoint" "private" {
  name         = "${var.environment}-${var.model_name}-endpoint"
  display_name = "${var.environment}-${var.model_name}"
  location     = var.region
  project      = var.project_id
  network      = "projects/${data.google_project.this.number}/global/networks/${var.vpc_network_name}"

  depends_on = [google_service_networking_connection.vertex_vpc]
}

resource "google_service_networking_connection" "vertex_vpc" {
  network                 = var.vpc_network_id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.vertex_range.name]
}

resource "google_compute_global_address" "vertex_range" {
  name          = "${var.environment}-vertex-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = var.vpc_network_id
  project       = var.project_id
}

Prediction Logging to BigQuery

resource "google_bigquery_dataset" "predictions" {
  dataset_id = "${var.environment}_prediction_logs"
  location   = var.region
  project    = var.project_id
}

resource "google_vertex_ai_endpoint" "logged" {
  name         = "${var.environment}-${var.model_name}-endpoint"
  display_name = "${var.environment}-${var.model_name}"
  location     = var.region
  project      = var.project_id

  predict_request_response_logging_config {
    enabled       = var.enable_prediction_logging
    sampling_rate = var.logging_sample_rate

    bigquery_destination {
      output_uri = "bq://${var.project_id}.${google_bigquery_dataset.predictions.dataset_id}.request_response"
    }
  }
}

Log a sample of prediction requests and responses to BigQuery for model monitoring, drift detection, and debugging.
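As a hedged sketch, here is one way to build a query over that log table from Python. The dataset name follows the Terraform config above, and the `logging_time` timestamp column is an assumption about the logging schema; check the generated table before relying on it.

```python
# Sketch: build a SQL query over the Vertex AI request-response log table.
# Assumptions: dataset/table names follow the Terraform config above, and
# the table exposes a `logging_time` timestamp column -- verify against
# your generated schema before use.

def prediction_log_query(project_id: str, dataset_id: str, days: int = 7) -> str:
    """Return SQL counting logged prediction requests per day for the last N days."""
    table = f"`{project_id}.{dataset_id}.request_response`"
    return (
        f"SELECT DATE(logging_time) AS day, COUNT(*) AS requests\n"
        f"FROM {table}\n"
        f"WHERE logging_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {days} DAY)\n"
        f"GROUP BY day\n"
        f"ORDER BY day"
    )

query = prediction_log_query("my-project", "prod_prediction_logs")
```

Run the resulting string with whichever BigQuery client you already use, e.g. `google.cloud.bigquery.Client().query(query)`.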

🐍 Model Upload and Deployment (SDK)

Terraform creates the endpoint. The Vertex AI SDK uploads the model and deploys it with compute resources:

# deploy.py

from google.cloud import aiplatform
import json

with open("deploy_config.json") as f:
    config = json.load(f)

aiplatform.init(
    project=config["project_id"],
    location=config["region"],
)

# Upload model to Model Registry
model = aiplatform.Model.upload(
    display_name=config["model_name"],
    artifact_uri=config["model_artifact_uri"],  # gs://bucket/model/
    serving_container_image_uri=config["serving_image"],
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",
)

# Get the Terraform-created endpoint
endpoint = aiplatform.Endpoint(config["endpoint_resource_name"])

# Deploy model to endpoint
model.deploy(
    endpoint=endpoint,
    machine_type=config["machine_type"],
    min_replica_count=config["min_replicas"],
    max_replica_count=config["max_replicas"],
    traffic_percentage=100,
    deploy_request_timeout=1800,
)

print(f"Model deployed to: {endpoint.resource_name}")

Terraform Config Output for SDK

# endpoint/config.tf

resource "local_file" "deploy_config" {
  filename = "${path.module}/deploy_config.json"
  content = jsonencode({
    project_id             = var.project_id
    region                 = var.region
    model_name             = "${var.environment}-${var.model_name}"
    model_artifact_uri     = var.model_artifact_uri
    serving_image          = var.serving_container_image
    endpoint_resource_name = google_vertex_ai_endpoint.this.id # full resource path, which the SDK expects
    machine_type           = var.machine_type
    min_replicas           = var.min_replicas
    max_replicas           = var.max_replicas
  })
}
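Before `deploy.py` consumes the generated file, a small sanity check can fail fast on a stale or incomplete Terraform output. This is a sketch; the required keys mirror the `jsonencode()` block above, and the error messages are illustrative.

```python
# Sketch: sanity-check the Terraform-generated deploy_config.json before
# deploy.py consumes it. Required keys mirror the jsonencode() block above.
import json

REQUIRED_KEYS = {
    "project_id", "region", "model_name", "model_artifact_uri",
    "serving_image", "endpoint_resource_name",
    "machine_type", "min_replicas", "max_replicas",
}

def validate_deploy_config(path: str) -> dict:
    """Load the config and raise ValueError on missing keys or bad replica bounds."""
    with open(path) as f:
        config = json.load(f)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"deploy_config.json is missing keys: {sorted(missing)}")
    if config["min_replicas"] > config["max_replicas"]:
        raise ValueError("min_replicas cannot exceed max_replicas")
    return config
```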

πŸ“ Traffic Splitting for Canary Deployments

Deploy a new model version alongside the existing one and gradually shift traffic:

# Canary: deploy new version with 10% traffic
new_model = aiplatform.Model.upload(
    display_name="fraud-detector-v3",
    artifact_uri="gs://ml-models/fraud-detector/v3/",
    serving_container_image_uri=config["serving_image"],
)

new_model.deploy(
    endpoint=endpoint,
    machine_type=config["machine_type"],
    min_replica_count=1,
    max_replica_count=4,
    traffic_percentage=10,  # 10% to new model
)

# After validation, shift 100% of traffic to the new model.
# traffic_split keys are deployed-model IDs, not Model resource IDs.
deployed_model_id = endpoint.gca_resource.deployed_models[-1].id
endpoint.update(traffic_split={deployed_model_id: 100})

The endpoint URL doesn't change. Your application sends predictions to the same endpoint while you shift traffic from the old model to the new one.
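A staged rollout (10% → 50% → 100%) is just a sequence of `traffic_split` dicts. Here is a small sketch that computes them; the deployed-model IDs (`"old123"`, `"new456"`) are placeholders, and in practice you would read them from `endpoint.gca_resource.deployed_models`.

```python
# Sketch: compute traffic_split payloads for a staged canary rollout.
# The IDs below are placeholders for real deployed-model IDs.

def canary_steps(old_id: str, new_id: str, stages=(10, 50, 100)):
    """Yield traffic_split dicts that gradually shift load to the new model."""
    for pct in stages:
        split = {new_id: pct}
        if pct < 100:
            split[old_id] = 100 - pct  # remainder stays on the old model
        yield split

# Each step would be passed to endpoint.update(traffic_split=step)
# after validating metrics at the previous stage.
steps = list(canary_steps("old123", "new456"))
```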

πŸ“ Model Garden Deployment (Terraform-Only)

For open models from Model Garden (Gemma, Llama, PaliGemma), use the newer Terraform resource that handles everything in one step:

resource "google_vertex_ai_endpoint_with_model_garden_deployment" "gemma" {
  publisher_model_name = "publishers/google/models/gemma3@gemma-3-1b-it"
  location             = var.region

  model_config {
    accept_eula = true
  }

  deploy_config {
    dedicated_resources {
      machine_spec {
        machine_type      = "g2-standard-12"
        accelerator_type  = "NVIDIA_L4"
        accelerator_count = 1
      }
      min_replica_count = 1
    }
  }
}

One resource creates the endpoint, uploads the model, and deploys it. This works for Model Garden models only - custom models still need the endpoint + SDK pattern.

πŸ“ Environment Configuration

# environments/dev.tfvars
model_name              = "fraud-detector"
machine_type            = "n1-standard-4"
min_replicas            = 1
max_replicas            = 2
enable_prediction_logging = false
logging_sample_rate     = 0.0

# environments/prod.tfvars
model_name              = "fraud-detector"
machine_type            = "n1-standard-8"
min_replicas            = 2
max_replicas            = 10
enable_prediction_logging = true
logging_sample_rate     = 0.1   # Log 10% of requests

🧪 Invoke the Endpoint

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint("ENDPOINT_ID")

prediction = endpoint.predict(
    instances=[{"features": [0.5, 1.2, 3.4, 0.8]}]
)

print(prediction.predictions)
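Right after a deployment, the first requests can hit transient errors while replicas warm up. A simple retry wrapper smooths this over; this is a sketch, and since the exception types depend on your client version, bare `Exception` is used here for illustration.

```python
# Sketch: retry wrapper for endpoint.predict() calls. Transient failures
# during replica warm-up are common right after deployment. Catching bare
# Exception is illustrative; narrow it to your client's error types.
import time

def predict_with_retry(call, retries: int = 3, base_delay: float = 1.0):
    """Invoke call() with exponential backoff; re-raise after the final attempt."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Usage (assuming an `endpoint` object as above):
# prediction = predict_with_retry(
#     lambda: endpoint.predict(instances=[{"features": [0.5, 1.2, 3.4, 0.8]}])
# )
```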

⚠️ Gotchas and Tips

Endpoint creation is fast, deployment is slow. Creating an endpoint takes seconds. Deploying a model (provisioning VMs, loading containers, health checks) takes 10-20 minutes. Plan for this in your CI/CD pipeline.

Autoscaling uses replica count. Set min_replica_count to your baseline and max_replica_count to your peak. By default, Vertex AI scales between them based on CPU utilization (or accelerator duty cycle on GPU machines).

GPU quota must be requested. NVIDIA L4, T4, A100 accelerators require quota increases. Request early in your project setup.

Model Registry keeps versions. Every Model.upload creates a new version in the registry. Old versions remain available for rollback. Use the traffic_split to redirect traffic back to a previous version if needed.

Prediction logging costs. BigQuery logging at 10% sampling rate is manageable. At 100%, costs scale with request volume. Use sampling for high-traffic endpoints.
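To put rough numbers on that, here is a back-of-envelope sketch. The 2 KB average row size is an assumption; measure your own request/response payloads.

```python
# Sketch: back-of-envelope estimate of prediction-logging volume.
# The 2 KB average row size is an assumption -- measure your own payloads.

def estimate_logging_volume(requests_per_day: int, sampling_rate: float,
                            avg_row_kb: float = 2.0):
    """Return (logged rows per day, approx GB stored per 30 days)."""
    rows = int(requests_per_day * sampling_rate)
    gb_per_month = rows * avg_row_kb * 30 / 1024 / 1024
    return rows, round(gb_per_month, 2)

# 1M requests/day at the prod 10% sampling rate:
rows, gb = estimate_logging_volume(1_000_000, 0.1)
# rows -> 100000, gb -> 5.72
```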

⏭️ What's Next

This is Post 2 of the GCP ML Pipelines & MLOps with Terraform series.

  • Post 1: Vertex AI Workbench 🔬
  • Post 2: Vertex AI Endpoints - Deploy to Prod (you are here) 🚀
  • Post 3: Vertex AI Feature Store
  • Post 4: Vertex AI Pipelines + Cloud Build

Your model is in production. Stable endpoint URL, autoscaling replicas, traffic splitting for canary rollouts, and prediction logging to BigQuery. From training to production, all in Terraform and Python. 🚀

Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series! 💬
