Your model is trained. Now deploy it to a scalable endpoint with autoscaling, traffic splitting for canary rollouts, and request-response logging. Here's how to deploy Vertex AI endpoints with Terraform.
In the previous post, we set up the Vertex AI Workbench - the workspace where your team trains models. Now comes the production side: taking a trained model and deploying it to an endpoint that your applications can call for real-time predictions.
Vertex AI uses a three-step deployment: upload a Model to the Model Registry, create an Endpoint (the HTTPS URL), then deploy the model to the endpoint with compute resources. Terraform provisions the endpoint and configures autoscaling, traffic splitting, and logging. The SDK handles model upload and deployment. 🎯
🏗️ The Deployment Architecture
```
Model Registry (uploaded model + container)
            ↓
Endpoint (HTTPS URL, traffic routing)
            ↓
Deployed Model (machine type, replicas, autoscaling)
```
| Component | What It Defines |
|---|---|
| Model | Container image + model artifacts in GCS |
| Endpoint | Stable HTTPS prediction URL |
| Deployed Model | Machine type, replica count, autoscaling |
| Traffic Split | Percentage routing between model versions |
The endpoint URL stays stable across model versions. You update the deployed model behind it without changing your application code.
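That stability comes from the endpoint's resource name, which has a fixed shape regardless of which model versions sit behind it. A minimal sketch (the project, location, and ID values here are placeholders):

```python
def endpoint_resource_name(project: str, location: str, endpoint_id: str) -> str:
    """Build the stable Vertex AI endpoint resource name your app calls."""
    return f"projects/{project}/locations/{location}/endpoints/{endpoint_id}"

# The same name works before and after you swap model versions behind it.
name = endpoint_resource_name("my-project", "us-central1", "1234567890")
```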
🔧 Terraform: Create the Endpoint
Endpoint Resource
```hcl
# endpoint/main.tf
resource "google_vertex_ai_endpoint" "this" {
  name         = "${var.environment}-${var.model_name}-endpoint"
  display_name = "${var.environment}-${var.model_name}"
  description  = "Production endpoint for ${var.model_name}"
  location     = var.region
  project      = var.project_id

  labels = {
    environment = var.environment
    model       = var.model_name
    managed_by  = "terraform"
  }
}
```
Endpoint with Private Network (Production)
For production, deploy with a private endpoint inside your VPC:
```hcl
# The project number (not the project ID) is needed for the network name.
data "google_project" "this" {
  project_id = var.project_id
}

resource "google_vertex_ai_endpoint" "private" {
  name         = "${var.environment}-${var.model_name}-endpoint"
  display_name = "${var.environment}-${var.model_name}"
  location     = var.region
  project      = var.project_id
  network      = "projects/${data.google_project.this.number}/global/networks/${var.vpc_network_name}"

  depends_on = [google_service_networking_connection.vertex_vpc]
}

resource "google_service_networking_connection" "vertex_vpc" {
  network                 = var.vpc_network_id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.vertex_range.name]
}

resource "google_compute_global_address" "vertex_range" {
  name          = "${var.environment}-vertex-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = var.vpc_network_id
  project       = var.project_id
}
```
Prediction Logging to BigQuery
```hcl
resource "google_bigquery_dataset" "predictions" {
  dataset_id = "${var.environment}_prediction_logs"
  location   = var.region
  project    = var.project_id
}

resource "google_vertex_ai_endpoint" "logged" {
  name         = "${var.environment}-${var.model_name}-endpoint"
  display_name = "${var.environment}-${var.model_name}"
  location     = var.region
  project      = var.project_id

  predict_request_response_logging_config {
    enabled       = var.enable_prediction_logging
    sampling_rate = var.logging_sample_rate

    bigquery_destination {
      output_uri = "bq://${var.project_id}.${google_bigquery_dataset.predictions.dataset_id}.request_response"
    }
  }
}
```
Log a sample of prediction requests and responses to BigQuery for model monitoring, drift detection, and debugging.
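Once logging is enabled, sampled requests land in that BigQuery table. A hedged sketch of checking daily request volume — the project and dataset names assume the prod values from this post, and the column name follows the documented request-response log schema, so verify both against your own table (the client call is commented out since it needs credentials):

```python
# Daily prediction volume from the request-response log table.
DAILY_VOLUME_SQL = """
SELECT DATE(logging_time) AS day, COUNT(*) AS requests
FROM `my-project.prod_prediction_logs.request_response`
GROUP BY day
ORDER BY day DESC
"""

# from google.cloud import bigquery
# rows = bigquery.Client().query(DAILY_VOLUME_SQL).result()
```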
🚀 Model Upload and Deployment (SDK)
Terraform creates the endpoint. The Vertex AI SDK uploads the model and deploys it with compute resources:
```python
# deploy.py
import json

from google.cloud import aiplatform

with open("deploy_config.json") as f:
    config = json.load(f)

aiplatform.init(
    project=config["project_id"],
    location=config["region"],
)

# Upload model to the Model Registry
model = aiplatform.Model.upload(
    display_name=config["model_name"],
    artifact_uri=config["model_artifact_uri"],  # gs://bucket/model/
    serving_container_image_uri=config["serving_image"],
    serving_container_predict_route="/predict",
    serving_container_health_route="/health",
)

# Get the Terraform-created endpoint
endpoint = aiplatform.Endpoint(config["endpoint_resource_name"])

# Deploy the model to the endpoint
model.deploy(
    endpoint=endpoint,
    machine_type=config["machine_type"],
    min_replica_count=config["min_replicas"],
    max_replica_count=config["max_replicas"],
    traffic_percentage=100,
    deploy_request_timeout=1800,
)

print(f"Model deployed to: {endpoint.resource_name}")
```
Terraform Config Output for SDK
```hcl
# endpoint/config.tf
resource "local_file" "deploy_config" {
  filename = "${path.module}/deploy_config.json"
  content = jsonencode({
    project_id         = var.project_id
    region             = var.region
    model_name         = "${var.environment}-${var.model_name}"
    model_artifact_uri = var.model_artifact_uri
    serving_image      = var.serving_container_image
    # Pass the full resource name (projects/.../locations/.../endpoints/...)
    # so the SDK resolves the endpoint unambiguously.
    endpoint_resource_name = google_vertex_ai_endpoint.this.id
    machine_type           = var.machine_type
    min_replicas           = var.min_replicas
    max_replicas           = var.max_replicas
  })
}
```
🔀 Traffic Splitting for Canary Deployments
Deploy a new model version alongside the existing one and gradually shift traffic:
```python
# Canary: deploy new version with 10% traffic
new_model = aiplatform.Model.upload(
    display_name="fraud-detector-v3",
    artifact_uri="gs://ml-models/fraud-detector/v3/",
    serving_container_image_uri=config["serving_image"],
)

new_model.deploy(
    endpoint=endpoint,
    machine_type=config["machine_type"],
    min_replica_count=1,
    max_replica_count=4,
    traffic_percentage=10,  # 10% to new model, 90% stays on the old one
)

# After validation, shift 100% to the new model.
# Note: traffic_split keys are *deployed model* IDs, not Model Registry IDs,
# so look up the deployment's ID on the endpoint first.
new_deployed_id = next(
    dm.id for dm in endpoint.list_models()
    if dm.model == new_model.resource_name
)
endpoint.update(traffic_split={new_deployed_id: 100})
```
The endpoint URL doesn't change. Your application sends predictions to the same endpoint while you shift traffic from the old model to the new one.
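A tiny helper makes the ramp explicit. Given the two deployed-model IDs (the values here are hypothetical), it builds the traffic_split dict that endpoint.update() accepts:

```python
def canary_split(stable_id: str, canary_id: str, canary_pct: int) -> dict:
    """traffic_split mapping deployed-model IDs to percentages summing to 100."""
    if not 0 <= canary_pct <= 100:
        raise ValueError("canary_pct must be between 0 and 100")
    return {canary_id: canary_pct, stable_id: 100 - canary_pct}

# Ramp: 10% -> 50% -> 100%
split = canary_split("111", "222", 10)  # {'222': 10, '111': 90}
```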
🌱 Model Garden Deployment (Terraform-Only)
For open models from Model Garden (Gemma, Llama, PaliGemma), use the newer Terraform resource that handles everything in one step:
```hcl
resource "google_vertex_ai_endpoint_with_model_garden_deployment" "gemma" {
  publisher_model_name = "publishers/google/models/gemma3@gemma-3-1b-it"
  location             = var.region

  model_config {
    accept_eula = true
  }

  deploy_config {
    dedicated_resources {
      machine_spec {
        machine_type      = "g2-standard-12"
        accelerator_type  = "NVIDIA_L4"
        accelerator_count = 1
      }
      min_replica_count = 1
    }
  }
}
```
One resource creates the endpoint, uploads the model, and deploys it. This works for Model Garden models only - custom models still need the endpoint + SDK pattern.
🌍 Environment Configuration
```hcl
# environments/dev.tfvars
model_name                = "fraud-detector"
machine_type              = "n1-standard-4"
min_replicas              = 1
max_replicas              = 2
enable_prediction_logging = false
logging_sample_rate       = 0.0
```

```hcl
# environments/prod.tfvars
model_name                = "fraud-detector"
machine_type              = "n1-standard-8"
min_replicas              = 2
max_replicas              = 10
enable_prediction_logging = true
logging_sample_rate       = 0.1 # Log 10% of requests
```
🧪 Invoke the Endpoint
```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint("ENDPOINT_ID")
prediction = endpoint.predict(
    instances=[{"features": [0.5, 1.2, 3.4, 0.8]}]
)
print(prediction.predictions)
```
⚠️ Gotchas and Tips
Endpoint creation is fast, deployment is slow. Creating an endpoint takes seconds. Deploying a model (provisioning VMs, loading containers, health checks) takes 10-20 minutes. Plan for this in your CI/CD pipeline.
Autoscaling uses replica count. Set min_replica_count to your baseline and max_replica_count to your peak. By default Vertex AI scales on CPU utilization (60% target) for CPU machines and on accelerator duty cycle for GPU-backed deployments; the target is tunable via the autoscaling_target_cpu_utilization parameter of Model.deploy.
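One way to keep those scaling knobs in one place is a small kwargs builder, so the deploy call becomes model.deploy(endpoint=endpoint, **kwargs). A sketch (the helper name is ours; autoscaling_target_cpu_utilization is a real Model.deploy parameter):

```python
def scaling_kwargs(machine_type: str, min_replicas: int, max_replicas: int,
                   target_cpu: int = 60) -> dict:
    """Compute kwargs for Model.deploy with an explicit autoscaling target."""
    return {
        "machine_type": machine_type,
        "min_replica_count": min_replicas,
        "max_replica_count": max_replicas,
        "autoscaling_target_cpu_utilization": target_cpu,
    }

kwargs = scaling_kwargs("n1-standard-8", 2, 10, target_cpu=70)
# model.deploy(endpoint=endpoint, **kwargs)
```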
GPU quota must be requested. NVIDIA L4, T4, A100 accelerators require quota increases. Request early in your project setup.
Model Registry keeps versions. Every Model.upload creates a new version in the registry. Old versions remain available for rollback. Use the traffic_split to redirect traffic back to a previous version if needed.
Prediction logging costs. BigQuery logging at 10% sampling rate is manageable. At 100%, costs scale with request volume. Use sampling for high-traffic endpoints.
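A quick back-of-the-envelope for sizing that cost: logged rows scale linearly with traffic and sampling rate.

```python
def logged_rows_per_month(avg_qps: float, sampling_rate: float) -> int:
    """Approximate BigQuery rows written per 30-day month."""
    return int(avg_qps * 60 * 60 * 24 * 30 * sampling_rate)

# 50 QPS at the prod 10% sampling rate: ~13M rows/month.
rows = logged_rows_per_month(50, 0.10)
```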
⏭️ What's Next
This is Post 2 of the GCP ML Pipelines & MLOps with Terraform series.
- Post 1: Vertex AI Workbench 🔬
- Post 2: Vertex AI Endpoints - Deploy to Prod (you are here) 🚀
- Post 3: Vertex AI Feature Store
- Post 4: Vertex AI Pipelines + Cloud Build
Your model is in production. Stable endpoint URL, autoscaling replicas, traffic splitting for canary rollouts, and prediction logging to BigQuery. From training to production, all in Terraform and Python. 🚀
Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series! 🔬