Suhas Mallesh

Vertex AI Workbench with Terraform: Your ML Workspace on GCP πŸ”¬

Vertex AI Workbench is the JupyterLab IDE for ML on GCP - pre-installed ML frameworks, Vertex AI integration, and GPU support out of the box. Here's how to provision instances with Terraform, including networking, IAM, auto-shutdown, and custom containers.

In Series 1-3, we worked with managed AI services - Vertex AI for models, RAG Engine for retrieval, ADK for agents. Series 5 shifts to custom ML - training your own models, deploying endpoints, managing features, and building ML pipelines.

It starts with a development environment. Vertex AI Workbench provides JupyterLab instances backed by Compute Engine VMs, pre-loaded with ML frameworks (TensorFlow, PyTorch, JAX, scikit-learn), the Vertex AI SDK, and direct integration with GCS, BigQuery, and Vertex AI services. Each instance is a full ML workspace. Terraform provisions them consistently across your team. 🎯

πŸ—οΈ Workbench Architecture

| Component | What It Does |
| --- | --- |
| Workbench Instance | JupyterLab on a Compute Engine VM |
| VM Image | Pre-built with ML frameworks and the Vertex AI SDK |
| Custom Container | Your own Docker image with specific dependencies |
| Data Disk | Persistent storage for notebooks and datasets |
| Service Account | Controls access to GCS, BigQuery, Vertex AI |
| Network | VPC, subnet, firewall rules for isolation |

Unlike SageMaker's domain-based approach, Workbench instances are standalone VMs. Each data scientist gets their own instance with full control over machine type, GPU, and installed packages. Terraform standardizes the configuration across the team.

πŸ”§ Terraform: The Full Workbench Setup

APIs and Networking

# workbench/apis.tf

resource "google_project_service" "required" {
  for_each = toset([
    "notebooks.googleapis.com",
    "aiplatform.googleapis.com",
    "compute.googleapis.com",
    "storage.googleapis.com",
    "bigquery.googleapis.com",
  ])
  project = var.project_id
  service = each.value
}

# VPC for Workbench instances
resource "google_compute_network" "ml" {
  name                    = "${var.environment}-ml-network"
  auto_create_subnetworks = false
  project                 = var.project_id
}

resource "google_compute_subnetwork" "ml" {
  name                     = "${var.environment}-ml-subnet"
  ip_cidr_range            = var.subnet_cidr
  region                   = var.region
  network                  = google_compute_network.ml.id
  private_ip_google_access = true  # Access Google APIs without public IP
  project                  = var.project_id
}

# Allow internal traffic for JupyterLab proxy
resource "google_compute_firewall" "internal" {
  name    = "${var.environment}-ml-internal"
  network = google_compute_network.ml.id
  project = var.project_id

  allow {
    protocol = "tcp"
    ports    = ["8080", "443"]
  }

  source_ranges = [var.subnet_cidr]
}

private_ip_google_access = true is essential. It lets Workbench instances access Google APIs (GCS, BigQuery, Vertex AI) without a public IP address, keeping your ML environment network-isolated.
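
The networking resources above reference several variables that aren't declared anywhere in the post. A minimal `variables.tf` sketch - the names match the resources, but the defaults are illustrative assumptions:

```hcl
# workbench/variables.tf -- sketch; defaults are assumptions, adjust to your project
variable "project_id" {
  type = string
}

variable "environment" {
  type    = string
  default = "dev"
}

variable "region" {
  type    = string
  default = "us-central1"
}

variable "zone" {
  type    = string
  default = "us-central1-a"
}

variable "subnet_cidr" {
  type    = string
  default = "10.10.0.0/24" # assumption; pick a range that fits your VPC plan
}
```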

Service Account

# workbench/iam.tf

resource "google_service_account" "workbench" {
  account_id   = "${var.environment}-workbench-sa"
  display_name = "Workbench ML Service Account"
  project      = var.project_id
}

# Access to training data in GCS
resource "google_project_iam_member" "storage_user" {
  project = var.project_id
  role    = "roles/storage.objectUser"
  member  = "serviceAccount:${google_service_account.workbench.email}"
}

# Access to Vertex AI for training and prediction
resource "google_project_iam_member" "vertex_ai_user" {
  project = var.project_id
  role    = "roles/aiplatform.user"
  member  = "serviceAccount:${google_service_account.workbench.email}"
}

# Access to BigQuery for data exploration
resource "google_project_iam_member" "bigquery_user" {
  project = var.project_id
  role    = "roles/bigquery.user"
  member  = "serviceAccount:${google_service_account.workbench.email}"
}

# Read Artifact Registry for custom containers
resource "google_project_iam_member" "artifact_reader" {
  project = var.project_id
  role    = "roles/artifactregistry.reader"
  member  = "serviceAccount:${google_service_account.workbench.email}"
}

Workbench Instance

# workbench/instance.tf

resource "google_workbench_instance" "this" {
  for_each = var.workbench_instances

  name     = "${var.environment}-${each.key}"
  location = var.zone
  project  = var.project_id

  gce_setup {
    machine_type = each.value.machine_type

    vm_image {
      project = "cloud-notebooks-managed"
      family  = "workbench-instances"
    }

    # Optional GPU
    dynamic "accelerator_configs" {
      for_each = each.value.gpu_type != null ? [1] : []
      content {
        type       = each.value.gpu_type
        core_count = each.value.gpu_count
      }
    }

    boot_disk {
      disk_type    = "PD_SSD"
      disk_size_gb = each.value.boot_disk_gb
    }

    data_disks {
      disk_type    = "PD_SSD"
      disk_size_gb = each.value.data_disk_gb
    }

    network_interfaces {
      network  = google_compute_network.ml.id
      subnet   = google_compute_subnetwork.ml.id
      nic_type = "GVNIC"
    }

    service_accounts {
      email = google_service_account.workbench.email
    }

    metadata = {
      idle-timeout-seconds   = each.value.idle_timeout
      install-nvidia-driver  = each.value.gpu_type != null ? "true" : "false"
      post-startup-script    = each.value.post_startup_script
    }

    enable_ip_forwarding = false
  }

  disable_proxy_access = false  # Enable JupyterLab proxy access

  labels = {
    environment = var.environment
    team        = each.value.team
    managed_by  = "terraform"
  }
}

Key configuration points:

  • vm_image.family = "workbench-instances" uses Google's pre-built image with ML frameworks
  • data_disks provides persistent storage separate from the boot disk
  • idle-timeout-seconds in metadata auto-stops the instance after inactivity
  • install-nvidia-driver auto-installs GPU drivers when GPU is attached
  • disable_proxy_access = false enables browser access to JupyterLab via IAP
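
The for_each over var.workbench_instances also needs a matching type declaration. A sketch consistent with the tfvars later in the post (optional() attributes need Terraform 1.3+):

```hcl
# workbench/variables.tf -- sketch; field names match the tfvars in this post
variable "workbench_instances" {
  type = map(object({
    machine_type        = string
    gpu_type            = optional(string)    # e.g. "NVIDIA_TESLA_T4"; null = no GPU
    gpu_count           = optional(number, 0)
    boot_disk_gb        = number
    data_disk_gb        = number
    idle_timeout        = string              # seconds, passed to idle-timeout-seconds
    team                = string
    post_startup_script = optional(string)    # gs:// path or null
  }))
  default = {}
}
```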

Custom Container Alternative

For teams that need specific package versions or private packages, use a custom container:

resource "google_workbench_instance" "custom" {
  name     = "${var.environment}-custom-ml"
  location = var.zone
  project  = var.project_id

  gce_setup {
    machine_type = "n1-standard-8"

    container_image {
      repository = "${var.region}-docker.pkg.dev/${var.project_id}/ml-images/custom-workbench"
      tag        = "latest"
    }

    boot_disk {
      disk_type    = "PD_SSD"
      disk_size_gb = 150
    }

    network_interfaces {
      network  = google_compute_network.ml.id
      subnet   = google_compute_subnetwork.ml.id
    }

    service_accounts {
      email = google_service_account.workbench.email
    }
  }
}

Build your container from Google's base image, add your packages, push to Artifact Registry, and reference it in Terraform. Updates are a new tag and terraform apply.
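
A minimal Dockerfile sketch for that flow. The base image name and package pins here are assumptions - check Google's Workbench custom-container documentation for the currently recommended base:

```dockerfile
# Dockerfile -- sketch; base image is an assumption, verify against current docs
FROM us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-container:latest

# Example pins: lock team-specific package versions on top of the base image
RUN pip install --no-cache-dir \
    xgboost==2.0.3 \
    lightgbm==4.3.0
```

Build and push it to the Artifact Registry path referenced in the Terraform above (`docker build`, `docker push`), then bump the `tag` and run `terraform apply`.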

πŸ“ Environment Configuration

# environments/dev.tfvars
environment = "dev"
zone        = "us-central1-a"

workbench_instances = {
  "data-scientist-1" = {
    machine_type        = "n1-standard-4"
    gpu_type            = null     # No GPU for dev
    gpu_count           = 0
    boot_disk_gb        = 100
    data_disk_gb        = 200
    idle_timeout        = "3600"   # 1 hour auto-stop
    team                = "ml-team"
    post_startup_script = null
  }
}

# environments/prod.tfvars
environment = "prod"
zone        = "us-central1-a"

workbench_instances = {
  "ml-lead" = {
    machine_type        = "n1-standard-8"
    gpu_type            = "NVIDIA_TESLA_T4"
    gpu_count           = 1
    boot_disk_gb        = 150
    data_disk_gb        = 500
    idle_timeout        = "7200"   # 2 hour auto-stop
    team                = "ml-team"
    post_startup_script = "gs://my-bucket/scripts/setup.sh"
  }
  "ml-engineer-1" = {
    machine_type        = "n1-standard-8"
    gpu_type            = "NVIDIA_TESLA_T4"
    gpu_count           = 1
    boot_disk_gb        = 150
    data_disk_gb        = 500
    idle_timeout        = "7200"
    team                = "ml-team"
    post_startup_script = "gs://my-bucket/scripts/setup.sh"
  }
}

Cost control through idle_timeout. GPU instances cost $1-3/hour, so a T4 instance left running for a full day wastes $24-72. The idle-timeout-seconds metadata auto-stops the VM when JupyterLab kernels are idle.

πŸ”§ Post-Startup Script for Team Standardization

Standardize packages and configs across all instances:

#!/bin/bash
# scripts/setup.sh - uploaded to GCS

# Install team-standard packages
pip install --upgrade \
  google-cloud-aiplatform \
  google-cloud-bigquery \
  pandas \
  scikit-learn \
  xgboost

# Clone team shared repos
git clone https://github.com/your-org/ml-utils.git /home/jupyter/ml-utils

# Set environment variables
echo 'export PROJECT_ID=your-project' >> /home/jupyter/.bashrc
echo 'export REGION=us-central1' >> /home/jupyter/.bashrc

Upload the script to GCS and reference its gs:// path via post_startup_script in your tfvars; the instance resource passes it through as the post-startup-script metadata key. It runs once, when the instance boots for the first time.

⚠️ Gotchas and Tips

User-managed notebooks are deprecated. As of January 2025, only Workbench Instances (google_workbench_instance) are supported. Don't use google_notebooks_instance - it's legacy.

GPU quota must be requested. T4, V100, A100 GPUs require quota increases in your GCP project. Request early - approval can take 1-2 business days.

Instance stops don't lose data. Stopping a Workbench instance preserves the boot and data disks. You only pay for disk storage when stopped. Starting it again restores your full environment.

Use data disks for large datasets. Boot disks are limited and get recreated on image updates. Store notebooks and datasets on the data disk (/home/jupyter) which persists independently.

Private Google Access vs public IP. With private_ip_google_access = true on the subnet, instances can reach GCS, BigQuery, and Vertex AI without a public IP. Combine with IAP for secure browser access to JupyterLab.
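
If you lock the network down this far, IAP needs its own ingress rule to reach the instances. A sketch - 35.235.240.0/20 is Google's documented source range for IAP TCP forwarding:

```hcl
# Allow IAP TCP forwarding into the ML network (sketch)
resource "google_compute_firewall" "iap" {
  name    = "${var.environment}-ml-allow-iap"
  network = google_compute_network.ml.id
  project = var.project_id

  allow {
    protocol = "tcp"
    ports    = ["22", "443"]
  }

  # Google's published range for IAP TCP forwarding
  source_ranges = ["35.235.240.0/20"]
}
```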

⏭️ What's Next

This is Post 1 of the GCP ML Pipelines & MLOps with Terraform series.

  • Post 1: Vertex AI Workbench (you are here) πŸ”¬
  • Post 2: Vertex AI Endpoints - Deploy to Prod
  • Post 3: Vertex AI Feature Store
  • Post 4: Vertex AI Pipelines + Cloud Build

Your ML workspace is provisioned. Network-isolated, GPU-ready, auto-shutdown enabled, with standardized packages across your team. The foundation for model training, deployment, and production ML pipelines. πŸ”¬

Found this helpful? Follow for the full ML Pipelines & MLOps with Terraform series! πŸ’¬
