Vertex AI Workbench is the managed JupyterLab IDE for ML on GCP - pre-installed ML frameworks, Vertex AI integration, and GPU support out of the box. Here's how to provision instances with Terraform, including networking, IAM, auto-shutdown, and custom containers.
In Series 1-3, we worked with managed AI services - Vertex AI for models, RAG Engine for retrieval, ADK for agents. Series 5 shifts to custom ML - training your own models, deploying endpoints, managing features, and building ML pipelines.
It starts with a development environment. Vertex AI Workbench provides JupyterLab instances backed by Compute Engine VMs, pre-loaded with ML frameworks (TensorFlow, PyTorch, JAX, scikit-learn), the Vertex AI SDK, and direct integration with GCS, BigQuery, and Vertex AI services. Each instance is a full ML workspace, and Terraform provisions them consistently across your team.
## Workbench Architecture
| Component | What It Does |
|---|---|
| Workbench Instance | JupyterLab on a Compute Engine VM |
| VM Image | Pre-built with ML frameworks and Vertex SDK |
| Custom Container | Your own Docker image with specific dependencies |
| Data Disk | Persistent storage for notebooks and datasets |
| Service Account | Controls access to GCS, BigQuery, Vertex AI |
| Network | VPC, subnet, firewall rules for isolation |
Unlike SageMaker's domain-based approach, Workbench instances are standalone VMs. Each data scientist gets their own instance with full control over machine type, GPU, and installed packages. Terraform standardizes the configuration across the team.
## Terraform: The Full Workbench Setup
### APIs and Networking
```hcl
# workbench/apis.tf
resource "google_project_service" "required" {
  for_each = toset([
    "notebooks.googleapis.com",
    "aiplatform.googleapis.com",
    "compute.googleapis.com",
    "storage.googleapis.com",
    "bigquery.googleapis.com",
  ])

  project = var.project_id
  service = each.value
}

# VPC for Workbench instances
resource "google_compute_network" "ml" {
  name                    = "${var.environment}-ml-network"
  auto_create_subnetworks = false
  project                 = var.project_id
}

resource "google_compute_subnetwork" "ml" {
  name                     = "${var.environment}-ml-subnet"
  ip_cidr_range            = var.subnet_cidr
  region                   = var.region
  network                  = google_compute_network.ml.id
  private_ip_google_access = true # Access Google APIs without a public IP
  project                  = var.project_id
}

# Allow internal traffic for the JupyterLab proxy
resource "google_compute_firewall" "internal" {
  name    = "${var.environment}-ml-internal"
  network = google_compute_network.ml.id
  project = var.project_id

  allow {
    protocol = "tcp"
    ports    = ["8080", "443"]
  }

  source_ranges = [var.subnet_cidr]
}
```
`private_ip_google_access = true` is essential. It lets Workbench instances access Google APIs (GCS, BigQuery, Vertex AI) without a public IP address, keeping your ML environment network-isolated.
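One caveat: Private Google Access only covers Google APIs. Outbound traffic to the public internet - `pip install` from PyPI, `git clone` from GitHub - still needs a route. A minimal Cloud Router + Cloud NAT sketch covers that, reusing the network and variables from above (the `nat.tf` filename is my own addition):

```hcl
# workbench/nat.tf - Private Google Access only reaches Google APIs,
# so pip/git traffic to the public internet needs Cloud NAT
resource "google_compute_router" "ml" {
  name    = "${var.environment}-ml-router"
  region  = var.region
  network = google_compute_network.ml.id
  project = var.project_id
}

resource "google_compute_router_nat" "ml" {
  name                               = "${var.environment}-ml-nat"
  router                             = google_compute_router.ml.name
  region                             = var.region
  project                            = var.project_id
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}
```

Skip this if your post-startup scripts and notebooks only ever touch Google services.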
### Service Account
```hcl
# workbench/iam.tf
resource "google_service_account" "workbench" {
  account_id   = "${var.environment}-workbench-sa"
  display_name = "Workbench ML Service Account"
  project      = var.project_id
}

# Access to training data in GCS
resource "google_project_iam_member" "storage_user" {
  project = var.project_id
  role    = "roles/storage.objectUser"
  member  = "serviceAccount:${google_service_account.workbench.email}"
}

# Access to Vertex AI for training and prediction
resource "google_project_iam_member" "vertex_ai_user" {
  project = var.project_id
  role    = "roles/aiplatform.user"
  member  = "serviceAccount:${google_service_account.workbench.email}"
}

# Access to BigQuery for data exploration
resource "google_project_iam_member" "bigquery_user" {
  project = var.project_id
  role    = "roles/bigquery.user"
  member  = "serviceAccount:${google_service_account.workbench.email}"
}

# Read Artifact Registry for custom containers
resource "google_project_iam_member" "artifact_reader" {
  project = var.project_id
  role    = "roles/artifactregistry.reader"
  member  = "serviceAccount:${google_service_account.workbench.email}"
}
```
### Workbench Instance
```hcl
# workbench/instance.tf
resource "google_workbench_instance" "this" {
  for_each = var.workbench_instances

  name     = "${var.environment}-${each.key}"
  location = var.zone
  project  = var.project_id

  gce_setup {
    machine_type = each.value.machine_type

    vm_image {
      project = "cloud-notebooks-managed"
      family  = "workbench-instances"
    }

    # Optional GPU
    dynamic "accelerator_configs" {
      for_each = each.value.gpu_type != null ? [1] : []
      content {
        type       = each.value.gpu_type
        core_count = each.value.gpu_count
      }
    }

    boot_disk {
      disk_type    = "PD_SSD"
      disk_size_gb = each.value.boot_disk_gb
    }

    data_disks {
      disk_type    = "PD_SSD"
      disk_size_gb = each.value.data_disk_gb
    }

    network_interfaces {
      network  = google_compute_network.ml.id
      subnet   = google_compute_subnetwork.ml.id
      nic_type = "GVNIC"
    }

    service_accounts {
      email = google_service_account.workbench.email
    }

    # merge() keeps null out of the metadata map when no
    # post-startup script is configured for an instance
    metadata = merge(
      {
        idle-timeout-seconds  = each.value.idle_timeout
        install-nvidia-driver = each.value.gpu_type != null ? "true" : "false"
      },
      each.value.post_startup_script != null ? { post-startup-script = each.value.post_startup_script } : {}
    )

    enable_ip_forwarding = false
  }

  disable_proxy_access = false # Enable JupyterLab proxy access

  labels = {
    environment = var.environment
    team        = each.value.team
    managed_by  = "terraform"
  }
}
```
Key configuration points:

- `vm_image.family = "workbench-instances"` uses Google's pre-built image with ML frameworks
- `data_disks` provides persistent storage separate from the boot disk
- `idle-timeout-seconds` in metadata auto-stops the instance after inactivity
- `install-nvidia-driver` auto-installs GPU drivers when a GPU is attached
- `disable_proxy_access = false` enables browser access to JupyterLab via the managed proxy
### Custom Container Alternative
For teams that need specific package versions or private packages, use a custom container:
```hcl
resource "google_workbench_instance" "custom" {
  name     = "${var.environment}-custom-ml"
  location = var.zone
  project  = var.project_id

  gce_setup {
    machine_type = "n1-standard-8"

    container_image {
      repository = "${var.region}-docker.pkg.dev/${var.project_id}/ml-images/custom-workbench"
      tag        = "latest"
    }

    boot_disk {
      disk_type    = "PD_SSD"
      disk_size_gb = 150
    }

    network_interfaces {
      network = google_compute_network.ml.id
      subnet  = google_compute_subnetwork.ml.id
    }

    service_accounts {
      email = google_service_account.workbench.email
    }
  }
}
```
Build your container from Google's base image, add your packages, push to Artifact Registry, and reference it in Terraform. Updates are a new tag and a `terraform apply`.
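The `ml-images` repository referenced in `container_image.repository` has to exist before the first push. A sketch of creating it in the same configuration (the `registry.tf` filename and `ml_images` resource name are my own):

```hcl
# workbench/registry.tf - Docker repository backing the
# container_image.repository path used above
resource "google_artifact_registry_repository" "ml_images" {
  repository_id = "ml-images"
  format        = "DOCKER"
  location      = var.region
  project       = var.project_id
}
```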
## Environment Configuration
```hcl
# environments/dev.tfvars
environment = "dev"
zone        = "us-central1-a"

workbench_instances = {
  "data-scientist-1" = {
    machine_type        = "n1-standard-4"
    gpu_type            = null # No GPU for dev
    gpu_count           = 0
    boot_disk_gb        = 100
    data_disk_gb        = 200
    idle_timeout        = "3600" # 1 hour auto-stop
    team                = "ml-team"
    post_startup_script = null
  }
}
```

```hcl
# environments/prod.tfvars
environment = "prod"
zone        = "us-central1-a"

workbench_instances = {
  "ml-lead" = {
    machine_type        = "n1-standard-8"
    gpu_type            = "NVIDIA_TESLA_T4"
    gpu_count           = 1
    boot_disk_gb        = 150
    data_disk_gb        = 500
    idle_timeout        = "7200" # 2 hour auto-stop
    team                = "ml-team"
    post_startup_script = "gs://my-bucket/scripts/setup.sh"
  }
  "ml-engineer-1" = {
    machine_type        = "n1-standard-8"
    gpu_type            = "NVIDIA_TESLA_T4"
    gpu_count           = 1
    boot_disk_gb        = 150
    data_disk_gb        = 500
    idle_timeout        = "7200"
    team                = "ml-team"
    post_startup_script = "gs://my-bucket/scripts/setup.sh"
  }
}
```
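These tfvars imply a variable shape the post doesn't show. One possible `variables.tf` definition for the instance map, using `optional()` defaults (requires Terraform >= 1.3; the attribute names simply mirror the tfvars above):

```hcl
# workbench/variables.tf - one possible shape for the instance map
# consumed by the tfvars above
variable "workbench_instances" {
  type = map(object({
    machine_type        = string
    gpu_type            = optional(string)
    gpu_count           = optional(number, 0)
    boot_disk_gb        = number
    data_disk_gb        = number
    idle_timeout        = string
    team                = string
    post_startup_script = optional(string)
  }))
  default = {}
}
```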
Cost control through `idle_timeout`. GPU instances cost $1-3/hour. A T4 instance left running overnight costs $24-72 in wasted spend. The `idle-timeout-seconds` metadata auto-stops the VM when JupyterLab kernels are idle.
## Post-Startup Script for Team Standardization
Standardize packages and configs across all instances:
```bash
#!/bin/bash
# scripts/setup.sh - uploaded to GCS

# Install team-standard packages
pip install --upgrade \
  google-cloud-aiplatform \
  google-cloud-bigquery \
  pandas \
  scikit-learn \
  xgboost

# Clone team shared repos
git clone https://github.com/your-org/ml-utils.git /home/jupyter/ml-utils

# Set environment variables
echo 'export PROJECT_ID=your-project' >> /home/jupyter/.bashrc
echo 'export REGION=us-central1' >> /home/jupyter/.bashrc
```
Reference it in the instance metadata via `post_startup_script`. It runs once, when the instance starts for the first time.
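To hand each team member their JupyterLab URL after `terraform apply`, the instance's proxy endpoint can be exported - assuming your provider version exposes the `proxy_uri` attribute on `google_workbench_instance` (recent versions do; the `outputs.tf` filename and output name are my own):

```hcl
# workbench/outputs.tf - JupyterLab URL per provisioned instance
output "jupyterlab_urls" {
  description = "Proxy URL for each Workbench instance"
  value = {
    for name, inst in google_workbench_instance.this : name => inst.proxy_uri
  }
}
```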
## Gotchas and Tips
**User-managed notebooks are deprecated.** As of January 2025, only Workbench Instances (`google_workbench_instance`) are supported. Don't use `google_notebooks_instance` - it's legacy.

**GPU quota must be requested.** T4, V100, and A100 GPUs require quota increases in your GCP project. Request early - approval can take 1-2 business days.

**Instance stops don't lose data.** Stopping a Workbench instance preserves the boot and data disks. You only pay for disk storage when stopped. Starting it again restores your full environment.

**Use data disks for large datasets.** Boot disks are limited and get recreated on image updates. Store notebooks and datasets on the data disk (`/home/jupyter`), which persists independently.

**Private Google Access vs public IP.** With `private_ip_google_access = true` on the subnet, instances can reach GCS, BigQuery, and Vertex AI without a public IP. Combine with IAP for secure browser access to JupyterLab.
## What's Next
This is Post 1 of the GCP ML Pipelines & MLOps with Terraform series.
- Post 1: Vertex AI Workbench (you are here)
- Post 2: Vertex AI Endpoints - Deploy to Prod
- Post 3: Vertex AI Feature Store
- Post 4: Vertex AI Pipelines + Cloud Build
Your ML workspace is provisioned. Network-isolated, GPU-ready, auto-shutdown enabled, with standardized packages across your team. The foundation for model training, deployment, and production ML pipelines.

Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series!