Suhas Mallesh
Vertex AI Pipelines + Cloud Build: CI/CD for ML on GCP with Terraform πŸ”

Manual ML retraining doesn't scale. Vertex AI Pipelines orchestrates your ML DAG while Cloud Build automates testing, compiling, and deploying updated pipelines on every code push. Here's how to wire it all together with Terraform.

So far in Series 5, we've built the Workbench, deployed endpoints, and set up the Feature Store. The final piece is automation. Right now, retraining means a data scientist manually running a notebook, checking metrics, and updating the endpoint. That's both a bottleneck and a reliability risk.

GCP's ML CI/CD stack uses two services together: Vertex AI Pipelines orchestrates the ML workflow (preprocessing, training, evaluation, registration) as a managed DAG. Cloud Build provides the CI/CD layer that tests your pipeline code, compiles it, uploads it to GCS, and runs it on a schedule or trigger. Terraform provisions the infrastructure for both. 🎯

πŸ—οΈ The CI/CD Architecture

Code push to GitHub/Cloud Source Repos
    ↓
Cloud Build trigger fires
    ↓
Cloud Build: run tests β†’ compile KFP pipeline β†’ upload to GCS
    ↓
Cloud Scheduler: daily trigger β†’ run Vertex AI Pipeline
    ↓
Pipeline DAG: preprocess β†’ train β†’ evaluate β†’ condition
    ↓
Pass: register model β†’ approve β†’ deploy to endpoint
Fail: pipeline exits with error
Vertex AI Pipelines: managed KFP pipeline execution (the ML DAG)
Cloud Build: CI/CD (test, compile, and upload the pipeline on code push)
Cloud Scheduler: triggers the pipeline on a cron schedule
GCS: stores compiled pipeline specs (.json)
Vertex AI Model Registry: versions and approves trained models

πŸ”§ Terraform: CI/CD Infrastructure

APIs and Service Account

# pipelines/apis.tf

resource "google_project_service" "required" {
  for_each = toset([
    "aiplatform.googleapis.com",
    "cloudbuild.googleapis.com",
    "cloudscheduler.googleapis.com",
    "storage.googleapis.com",
    "artifactregistry.googleapis.com",
  ])
  project = var.project_id
  service = each.value
}

resource "google_service_account" "pipeline_runner" {
  account_id   = "${var.environment}-pipeline-runner"
  display_name = "Vertex AI Pipeline Runner"
  project      = var.project_id
}

resource "google_project_iam_member" "pipeline_roles" {
  for_each = toset([
    "roles/aiplatform.user",
    "roles/storage.objectAdmin",
    "roles/bigquery.dataEditor",
    "roles/bigquery.jobUser",
  ])
  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.pipeline_runner.email}"
}
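
The configs in this post reference several input variables. For completeness, here's a minimal variables.tf sketch; the names match how the variables are used here, but the defaults are assumptions you should align with your own module:

```hcl
# pipelines/variables.tf -- hypothetical sketch; align with your module

variable "project_id" {
  type = string
}

variable "environment" {
  type = string # "dev" or "prod"
}

variable "region" {
  type    = string
  default = "us-central1"
}

variable "github_owner" {
  type = string
}

variable "github_repo" {
  type = string
}

variable "deploy_branch" {
  type = string
}

variable "pipeline_schedule" {
  type    = string
  default = "0 2 * * *"
}

variable "training_data_uri" {
  type = string
}

variable "model_name" {
  type = string
}
```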

GCS Bucket for Pipeline Artifacts

# pipelines/storage.tf

resource "google_storage_bucket" "pipeline_root" {
  name          = "${var.project_id}-${var.environment}-pipeline-root"
  location      = var.region
  force_destroy = var.environment != "prod"

  versioning {
    enabled = true
  }

  labels = {
    environment = var.environment
    managed_by  = "terraform"
  }
}

resource "google_storage_bucket" "pipeline_specs" {
  name          = "${var.project_id}-${var.environment}-pipeline-specs"
  location      = var.region
  force_destroy = var.environment != "prod"
}
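
Pipeline runs accumulate artifacts quickly. If you want GCS to prune old run outputs automatically, you can add a lifecycle rule to the pipeline_root bucket; the 90-day age below is an arbitrary example, not a recommendation:

```hcl
# Optional: add inside google_storage_bucket.pipeline_root to auto-delete
# old run artifacts. Tune the age to your retention needs.
lifecycle_rule {
  action {
    type = "Delete"
  }
  condition {
    age = 90 # days
  }
}
```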

Cloud Build Trigger

# pipelines/cloudbuild.tf

resource "google_cloudbuild_trigger" "pipeline_deploy" {
  name     = "${var.environment}-ml-pipeline-deploy"
  project  = var.project_id
  location = var.region

  github {
    owner = var.github_owner
    name  = var.github_repo
    push {
      branch = var.deploy_branch  # e.g. "main" for prod, "develop" for dev
    }
  }

  filename = "cloudbuild/pipeline-deploy.yaml"

  substitutions = {
    _ENVIRONMENT        = var.environment
    _PIPELINE_ROOT      = "gs://${google_storage_bucket.pipeline_root.name}"
    _PIPELINE_SPECS_GCS = "gs://${google_storage_bucket.pipeline_specs.name}/specs/"
    _REGION             = var.region
    _PROJECT_ID         = var.project_id
    _SA_EMAIL           = google_service_account.pipeline_runner.email
  }

  service_account = google_service_account.cloudbuild_sa.id # dedicated Cloud Build SA (see Gotchas)
}
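
The trigger above references google_service_account.cloudbuild_sa, which isn't defined in this post. A minimal sketch of that dedicated service account; the role list is an assumption, so trim it to what your build actually needs:

```hcl
# pipelines/cloudbuild_sa.tf -- dedicated SA for the Cloud Build trigger

resource "google_service_account" "cloudbuild_sa" {
  account_id   = "${var.environment}-cloudbuild-ml"
  display_name = "Cloud Build ML Pipeline Deployer"
  project      = var.project_id
}

resource "google_project_iam_member" "cloudbuild_roles" {
  for_each = toset([
    "roles/aiplatform.user",
    "roles/storage.objectAdmin",
    "roles/logging.logWriter", # needed when a build runs as a custom SA
  ])
  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.cloudbuild_sa.email}"
}
```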

Cloud Scheduler: Run on Schedule

# pipelines/scheduler.tf

resource "google_cloud_scheduler_job" "pipeline_schedule" {
  name     = "${var.environment}-training-pipeline"
  region   = var.region
  project  = var.project_id
  schedule = var.pipeline_schedule   # e.g. "0 2 * * *"
  time_zone = "UTC"

  http_target {
    uri         = "https://${var.region}-aiplatform.googleapis.com/v1/projects/${var.project_id}/locations/${var.region}/pipelineJobs"
    http_method = "POST"

    body = base64encode(jsonencode({
      displayName = "${var.environment}-training-run"
      templateUri = "gs://${google_storage_bucket.pipeline_specs.name}/specs/training-pipeline.json"
      runtimeConfig = {
        gcsOutputDirectory = "gs://${google_storage_bucket.pipeline_root.name}/runs/"
        parameterValues = {
          project_id   = var.project_id
          region       = var.region
          data_gcs_uri = var.training_data_uri
          model_name   = var.model_name
        }
      }
      serviceAccount = google_service_account.pipeline_runner.email
    }))

    oauth_token {
      service_account_email = google_service_account.pipeline_runner.email
    }
  }
}
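
The base64-encoded JSON body above is easy to get wrong. Here's a small stdlib-only Python helper for building and eyeballing the same pipelineJobs request body before baking it into Terraform; the field names mirror the Terraform values above, and the sample inputs are placeholders:

```python
import base64
import json

def build_pipeline_job_body(env: str, specs_bucket: str, root_bucket: str,
                            project_id: str, region: str,
                            data_uri: str, model_name: str, sa_email: str) -> str:
    """Return the base64-encoded JSON body Cloud Scheduler POSTs to pipelineJobs."""
    body = {
        "displayName": f"{env}-training-run",
        "templateUri": f"gs://{specs_bucket}/specs/training-pipeline.json",
        "runtimeConfig": {
            "gcsOutputDirectory": f"gs://{root_bucket}/runs/",
            "parameterValues": {
                "project_id": project_id,
                "region": region,
                "data_gcs_uri": data_uri,
                "model_name": model_name,
            },
        },
        "serviceAccount": sa_email,
    }
    return base64.b64encode(json.dumps(body).encode()).decode()

# Placeholder values -- substitute your own project, buckets, and SA.
encoded = build_pipeline_job_body(
    "dev", "my-specs", "my-root", "my-project", "us-central1",
    "gs://data/train.parquet", "my-model-dev",
    "runner@my-project.iam.gserviceaccount.com",
)
decoded = json.loads(base64.b64decode(encoded))
print(decoded["templateUri"])  # gs://my-specs/specs/training-pipeline.json
```

Decode the `body` value from `terraform plan` output the same way to confirm both sides agree.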

πŸ”§ Cloud Build Config (cloudbuild/pipeline-deploy.yaml)

This file lives in your repo and runs on every push to the deploy branch:

# cloudbuild/pipeline-deploy.yaml

steps:
  # Step 1: Install dependencies (--user installs land in the home volume,
  # which Cloud Build persists across steps along with /workspace)
  - name: "python:3.11"
    entrypoint: pip
    args: ["install", "-r", "requirements.txt", "--user"]

  # Step 2: Run unit tests on pipeline components
  - name: "python:3.11"
    entrypoint: python
    args: ["-m", "pytest", "pipelines/tests/", "-v"]

  # Step 3: Compile the Vertex AI Pipeline
  - name: "python:3.11"
    entrypoint: python
    args: ["pipelines/compile.py", "--output", "/workspace/training-pipeline.json"]
    env:
      - "PROJECT_ID=$PROJECT_ID"
      - "REGION=$_REGION"

  # Step 4: Upload compiled pipeline spec to GCS
  - name: "gcr.io/cloud-builders/gsutil"
    args: ["cp", "/workspace/training-pipeline.json", "${_PIPELINE_SPECS_GCS}training-pipeline.json"]

  # Step 5: (Optional) Run a quick end-to-end test on dev
  - name: "python:3.11"
    entrypoint: python
    args: ["pipelines/run.py", "--pipeline-spec", "${_PIPELINE_SPECS_GCS}training-pipeline.json"]
    env:
      - "ENVIRONMENT=$_ENVIRONMENT"
    id: "e2e-test"

substitutions:
  _ENVIRONMENT: dev
  _PIPELINE_SPECS_GCS: gs://my-bucket/specs/
  _REGION: us-central1

🐍 KFP Pipeline Definition

# pipelines/compile.py

import argparse

from kfp import dsl, compiler
from kfp.dsl import component

@component(base_image="python:3.11", packages_to_install=["scikit-learn", "pandas", "pyarrow"])
def preprocess(data_uri: str, output_uri: str) -> str:
    import pandas as pd
    df = pd.read_parquet(data_uri)
    # ... preprocessing logic ...
    df.to_parquet(output_uri)
    return output_uri  # return the path so downstream tasks can consume it as .output

@component(base_image="python:3.11", packages_to_install=["scikit-learn"])
def train(data_uri: str, model_uri: str) -> float:
    # ... training logic ...
    accuracy = 0.0  # placeholder: compute and return the evaluation accuracy
    return accuracy

@component(base_image="python:3.11", packages_to_install=["google-cloud-aiplatform"])
def register_model(model_uri: str, accuracy: float, project: str, region: str, model_name: str) -> None:
    # Import inside the component: each component runs in its own container.
    from google.cloud import aiplatform
    aiplatform.init(project=project, location=region)
    aiplatform.Model.upload(
        display_name=model_name,
        artifact_uri=model_uri,
        serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest",
    )

@dsl.pipeline(name="training-pipeline")
def training_pipeline(
    project_id: str,
    region: str,
    data_gcs_uri: str,
    model_name: str,
    accuracy_threshold: float = 0.85,
):
    # "pipeline-root" is a placeholder bucket; in practice, parameterize it.
    preprocess_task = preprocess(data_uri=data_gcs_uri, output_uri="gs://pipeline-root/processed/")
    train_task = train(data_uri=preprocess_task.output, model_uri="gs://pipeline-root/model/")

    with dsl.If(train_task.output >= accuracy_threshold, name="AccuracyGate"):
        register_model(
            model_uri="gs://pipeline-root/model/",
            accuracy=train_task.output,
            project=project_id,
            region=region,
            model_name=model_name,
        )

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", default="training-pipeline.json")
    args = parser.parse_args()
    compiler.Compiler().compile(training_pipeline, args.output)
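
Before uploading, it can be worth sanity-checking the compiled output. KFP v2 compiles to a PipelineSpec JSON with a few well-known top-level keys; a quick stdlib-only check might look like the following (the key list reflects the KFP v2 IR as I understand it, so adjust for your KFP version):

```python
import json

REQUIRED_KEYS = {"pipelineInfo", "root", "schemaVersion"}  # core KFP v2 IR fields

def check_pipeline_spec(spec: dict) -> list:
    """Return a sorted list of missing top-level keys (empty means it looks sane)."""
    return sorted(REQUIRED_KEYS - spec.keys())

# Toy example standing in for json.load(open("training-pipeline.json")):
toy_spec = {
    "pipelineInfo": {"name": "training-pipeline"},
    "root": {},
    "schemaVersion": "2.1.0",
    "components": {},
}
print(check_pipeline_spec(toy_spec))       # []
print(check_pipeline_spec({"root": {}}))   # ['pipelineInfo', 'schemaVersion']
```

Wiring this into the Cloud Build config as a step between compile and upload fails the build before a broken spec ever reaches GCS.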

πŸ“ Environment Configuration

# environments/dev.tfvars
environment        = "dev"
deploy_branch      = "develop"
pipeline_schedule  = "0 6 * * *"    # Daily at 6am UTC
model_name         = "my-model-dev"

# environments/prod.tfvars
environment        = "prod"
deploy_branch      = "main"
pipeline_schedule  = "0 2 * * *"    # Daily at 2am UTC
model_name         = "my-model"

⚠️ Gotchas and Tips

Two separate pipelines. Cloud Build is the CI/CD pipeline for your code. Vertex AI Pipelines is the ML orchestration DAG. They serve different purposes and run independently.

Compile on every push. The Cloud Build step compiles the KFP pipeline from Python code to JSON on every merge. This catches pipeline definition errors early and ensures GCS always has the latest spec.

Pipeline spec versioning. Upload compiled specs with a version suffix (commit hash or timestamp), e.g. training-pipeline-abc123.json, alongside the latest copy. That makes rolling back to any previous pipeline version a one-line change.
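
Cloud Build exposes the commit's short hash as the built-in $SHORT_SHA substitution on repo-triggered builds, so the versioned copy is one extra step; a sketch following the paths in the config above:

```yaml
  # Upload a commit-pinned copy alongside the latest spec for easy rollback.
  - name: "gcr.io/cloud-builders/gsutil"
    args: ["cp", "/workspace/training-pipeline.json",
           "${_PIPELINE_SPECS_GCS}training-pipeline-${SHORT_SHA}.json"]
```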

Cloud Scheduler vs Eventarc. Cloud Scheduler runs pipelines on a fixed cron. For event-driven triggers (new data in GCS), use Eventarc to trigger a Cloud Function that submits the pipeline job.

Service account for Cloud Build. Give the Cloud Build trigger a dedicated service account with roles/aiplatform.user and roles/storage.objectAdmin. Avoid the default Cloud Build SA, which has overly broad permissions.

⏭️ Series 5 Complete!

This is Post 4 of the GCP ML Pipelines & MLOps with Terraform series.


Your ML workflow is automated. Cloud Build tests and compiles your pipeline on every code push. Cloud Scheduler runs it on a cron. Vertex AI Pipelines executes the DAG. Models that pass evaluation register automatically. All provisioned with Terraform. πŸ”

Found this helpful? Follow for the next series! πŸ’¬
