Manual ML retraining doesn't scale. Vertex AI Pipelines orchestrates your ML DAG while Cloud Build automates testing, compiling, and deploying updated pipelines on every code push. Here's how to wire it all together with Terraform.
Through Series 5, we've built the Workbench, deployed endpoints, and set up the Feature Store. The final piece is automation. Right now, retraining means a data scientist manually runs a notebook, checks metrics, and updates the endpoint. That's a bottleneck and a reliability risk.
GCP's ML CI/CD stack uses two services together: Vertex AI Pipelines orchestrates the ML workflow (preprocessing, training, evaluation, registration) as a managed DAG. Cloud Build provides the CI/CD layer that tests your pipeline code, compiles it, uploads it to GCS, and runs it on a schedule or trigger. Terraform provisions the infrastructure for both. 🎯
## 🏗️ The CI/CD Architecture
```
Code push to GitHub/Cloud Source Repos
        ↓
Cloud Build trigger fires
        ↓
Cloud Build: run tests → compile KFP pipeline → upload to GCS
        ↓
Cloud Scheduler: daily trigger → run Vertex AI Pipeline
        ↓
Pipeline DAG: preprocess → train → evaluate → condition
        ↓
Pass: register model → approve → deploy to endpoint
Fail: pipeline exits with error
```
| Component | Role |
|---|---|
| Vertex AI Pipelines | Managed KFP pipeline execution (the ML DAG) |
| Cloud Build | CI/CD: test, compile, upload pipeline on code push |
| Cloud Scheduler | Trigger pipeline on a cron schedule |
| GCS | Store compiled pipeline specs (.json) |
| Vertex AI Model Registry | Version and approve trained models |
## 🔧 Terraform: CI/CD Infrastructure

### APIs and Service Account
```hcl
# pipelines/apis.tf
resource "google_project_service" "required" {
  for_each = toset([
    "aiplatform.googleapis.com",
    "cloudbuild.googleapis.com",
    "cloudscheduler.googleapis.com",
    "storage.googleapis.com",
    "artifactregistry.googleapis.com",
  ])
  project = var.project_id
  service = each.value
}

resource "google_service_account" "pipeline_runner" {
  account_id   = "${var.environment}-pipeline-runner"
  display_name = "Vertex AI Pipeline Runner"
  project      = var.project_id
}

resource "google_project_iam_member" "pipeline_roles" {
  for_each = toset([
    "roles/aiplatform.user",
    "roles/storage.objectAdmin",
    "roles/bigquery.dataEditor",
    "roles/bigquery.jobUser",
  ])
  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.pipeline_runner.email}"
}
```
### GCS Buckets for Pipeline Artifacts
```hcl
# pipelines/storage.tf
resource "google_storage_bucket" "pipeline_root" {
  name          = "${var.project_id}-${var.environment}-pipeline-root"
  location      = var.region
  force_destroy = var.environment != "prod"

  versioning {
    enabled = true
  }

  labels = {
    environment = var.environment
    managed_by  = "terraform"
  }
}

resource "google_storage_bucket" "pipeline_specs" {
  name          = "${var.project_id}-${var.environment}-pipeline-specs"
  location      = var.region
  force_destroy = var.environment != "prod"
}
```
### Cloud Build Trigger
```hcl
# pipelines/cloudbuild.tf

# Dedicated SA for the trigger: the default Cloud Build SA is overly broad
resource "google_service_account" "cloudbuild_sa" {
  account_id   = "${var.environment}-cloudbuild-deployer"
  display_name = "Cloud Build Pipeline Deployer"
  project      = var.project_id
}

resource "google_project_iam_member" "cloudbuild_roles" {
  for_each = toset([
    "roles/aiplatform.user",
    "roles/storage.objectAdmin",
    "roles/logging.logWriter",
  ])
  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.cloudbuild_sa.email}"
}

resource "google_cloudbuild_trigger" "pipeline_deploy" {
  name     = "${var.environment}-ml-pipeline-deploy"
  project  = var.project_id
  location = var.region

  github {
    owner = var.github_owner
    name  = var.github_repo
    push {
      branch = var.deploy_branch # e.g. "main" for prod, "develop" for dev
    }
  }

  filename = "cloudbuild/pipeline-deploy.yaml"

  substitutions = {
    _ENVIRONMENT        = var.environment
    _PIPELINE_ROOT      = "gs://${google_storage_bucket.pipeline_root.name}"
    _PIPELINE_SPECS_GCS = "gs://${google_storage_bucket.pipeline_specs.name}/specs/"
    _REGION             = var.region
    _PROJECT_ID         = var.project_id
    _SA_EMAIL           = google_service_account.pipeline_runner.email
  }

  service_account = google_service_account.cloudbuild_sa.id
}
```
### Cloud Scheduler: Run on a Schedule
```hcl
# pipelines/scheduler.tf
resource "google_cloud_scheduler_job" "pipeline_schedule" {
  name      = "${var.environment}-training-pipeline"
  region    = var.region
  project   = var.project_id
  schedule  = var.pipeline_schedule # e.g. "0 2 * * *"
  time_zone = "UTC"

  http_target {
    uri         = "https://${var.region}-aiplatform.googleapis.com/v1/projects/${var.project_id}/locations/${var.region}/pipelineJobs"
    http_method = "POST"

    # templateUri points at the compiled spec, so no inline pipelineSpec is needed
    body = base64encode(jsonencode({
      displayName = "${var.environment}-training-run"
      templateUri = "gs://${google_storage_bucket.pipeline_specs.name}/specs/training-pipeline.json"
      runtimeConfig = {
        gcsOutputDirectory = "gs://${google_storage_bucket.pipeline_root.name}/runs/"
        parameterValues = {
          project_id   = var.project_id
          region       = var.region
          data_gcs_uri = var.training_data_uri
          model_name   = var.model_name
        }
      }
      serviceAccount = google_service_account.pipeline_runner.email
    }))

    oauth_token {
      service_account_email = google_service_account.pipeline_runner.email
    }
  }
}
```
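To sanity-check the scheduler payload without deploying anything, you can rebuild the same request body in Python. This is a sketch: the function name is illustrative, and the bucket names and parameter values are placeholders matching the Terraform above.

```python
def build_pipeline_job_body(
    environment: str,
    project_id: str,
    region: str,
    specs_bucket: str,
    root_bucket: str,
    data_gcs_uri: str,
    model_name: str,
) -> dict:
    """Assemble the pipelineJobs.create request body that Cloud Scheduler sends."""
    return {
        "displayName": f"{environment}-training-run",
        "templateUri": f"gs://{specs_bucket}/specs/training-pipeline.json",
        "runtimeConfig": {
            "gcsOutputDirectory": f"gs://{root_bucket}/runs/",
            "parameterValues": {
                "project_id": project_id,
                "region": region,
                "data_gcs_uri": data_gcs_uri,
                "model_name": model_name,
            },
        },
    }


if __name__ == "__main__":
    import json

    body = build_pipeline_job_body(
        "dev", "my-project", "us-central1",
        "my-project-dev-pipeline-specs", "my-project-dev-pipeline-root",
        "gs://my-data/train.parquet", "my-model-dev",
    )
    print(json.dumps(body, indent=2))
```

POSTing this (base64-encoded, as in the Terraform) to the `pipelineJobs` endpoint with an OAuth token is exactly what the scheduler job does.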
## 🔧 Cloud Build Config (`cloudbuild/pipeline-deploy.yaml`)

This file lives in your repo and runs on every push to the deploy branch:
```yaml
# cloudbuild/pipeline-deploy.yaml
steps:
  # Step 1: Install dependencies (persisted across steps via /builder/home)
  - name: "python:3.11"
    entrypoint: pip
    args: ["install", "-r", "requirements.txt", "--user"]

  # Step 2: Run unit tests on pipeline components
  - name: "python:3.11"
    entrypoint: python
    args: ["-m", "pytest", "pipelines/tests/", "-v"]

  # Step 3: Compile the Vertex AI Pipeline
  - name: "python:3.11"
    entrypoint: python
    args: ["pipelines/compile.py", "--output", "/workspace/training-pipeline.json"]
    env:
      - "PROJECT_ID=$PROJECT_ID"
      - "REGION=$_REGION"

  # Step 4: Upload compiled pipeline spec to GCS
  - name: "gcr.io/cloud-builders/gsutil"
    args: ["cp", "/workspace/training-pipeline.json", "${_PIPELINE_SPECS_GCS}training-pipeline.json"]

  # Step 5: (Optional) Run a quick end-to-end test on dev
  - name: "python:3.11"
    entrypoint: python
    args: ["pipelines/run.py", "--pipeline-spec", "${_PIPELINE_SPECS_GCS}training-pipeline.json"]
    env:
      - "ENVIRONMENT=$_ENVIRONMENT"
    id: "e2e-test"

substitutions:
  _ENVIRONMENT: dev
  _PIPELINE_SPECS_GCS: gs://my-bucket/specs/
  _REGION: us-central1
```
## 📝 KFP Pipeline Definition
```python
# pipelines/compile.py
import argparse

from kfp import compiler, dsl
from kfp.dsl import component


@component(base_image="python:3.11", packages_to_install=["scikit-learn", "pandas", "pyarrow"])
def preprocess(data_uri: str, output_uri: str) -> str:
    import pandas as pd

    df = pd.read_parquet(data_uri)
    # ... preprocessing logic ...
    df.to_parquet(output_uri)
    return output_uri  # pass the processed-data URI downstream


@component(base_image="python:3.11", packages_to_install=["scikit-learn"])
def train(data_uri: str, model_uri: str) -> float:
    # ... training logic: fit on data_uri, save the model to model_uri ...
    accuracy = 0.0  # replace with the evaluated accuracy
    return accuracy


@component(base_image="python:3.11", packages_to_install=["google-cloud-aiplatform"])
def register_model(model_uri: str, accuracy: float, project: str, region: str, model_name: str) -> None:
    # Imports must live inside the component: each runs in its own container
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=region)
    aiplatform.Model.upload(
        display_name=model_name,
        artifact_uri=model_uri,
        serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest",
    )


@dsl.pipeline(name="training-pipeline")
def training_pipeline(
    project_id: str,
    region: str,
    data_gcs_uri: str,
    model_name: str,
    accuracy_threshold: float = 0.85,
):
    # Replace the placeholder URIs below with paths under your pipeline-root bucket
    preprocess_task = preprocess(data_uri=data_gcs_uri, output_uri="gs://pipeline-root/processed/")
    train_task = train(data_uri=preprocess_task.output, model_uri="gs://pipeline-root/model/")

    # dsl.If requires kfp >= 2.2; on older versions use dsl.Condition
    with dsl.If(train_task.output >= accuracy_threshold, name="AccuracyGate"):
        register_model(
            model_uri="gs://pipeline-root/model/",
            accuracy=train_task.output,
            project=project_id,
            region=region,
            model_name=model_name,
        )


if __name__ == "__main__":
    # Honor the --output flag that the Cloud Build compile step passes
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", default="training-pipeline.json")
    args = parser.parse_args()
    compiler.Compiler().compile(training_pipeline, args.output)
```
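The pytest step in the Cloud Build config assumes there is something in `pipelines/tests/` to run. One pattern that keeps tests off GCP is mirroring the pipeline's gate condition in a plain Python helper. A hypothetical sketch (the file name and helper are illustrative, not part of the pipeline above):

```python
# pipelines/tests/test_gate.py (hypothetical): the kind of unit test
# Cloud Build's pytest step can run without touching any GCP service.

def passes_accuracy_gate(accuracy: float, threshold: float = 0.85) -> bool:
    """Plain-Python mirror of the pipeline's dsl.If condition."""
    return accuracy >= threshold


def test_gate_passes_at_threshold():
    assert passes_accuracy_gate(0.85)


def test_gate_blocks_below_threshold():
    assert not passes_accuracy_gate(0.80)
```

Keeping the threshold logic testable in isolation means a broken gate fails in Step 2, before the pipeline ever compiles or runs.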
## 📁 Environment Configuration
```hcl
# environments/dev.tfvars
environment       = "dev"
deploy_branch     = "develop"
pipeline_schedule = "0 6 * * *" # Daily at 6am UTC
model_name        = "my-model-dev"

# environments/prod.tfvars
environment       = "prod"
deploy_branch     = "main"
pipeline_schedule = "0 2 * * *" # Daily at 2am UTC
model_name        = "my-model"
```
## ⚠️ Gotchas and Tips
**Two separate pipelines.** Cloud Build is the CI/CD pipeline for your code; Vertex AI Pipelines is the ML orchestration DAG. They serve different purposes and run independently.
**Compile on every push.** The Cloud Build step compiles the KFP pipeline from Python code to JSON on every merge. This catches pipeline definition errors early and ensures GCS always has the latest spec.
**Pipeline spec versioning.** Upload compiled specs with a version suffix (commit hash or timestamp), e.g. `training-pipeline-abc123.json`, alongside a `latest` copy. This lets you roll back to any previous pipeline version.
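A small sketch of that naming scheme (the helper name and defaults are illustrative); in Cloud Build you would pass the built-in `$SHORT_SHA` substitution as the hash:

```python
def spec_object_names(base: str = "training-pipeline", sha: str = "") -> list:
    """Return the GCS object names to upload: the 'latest' copy plus a pinned one."""
    names = [f"{base}.json"]  # the copy the scheduler's templateUri points at
    if sha:
        names.append(f"{base}-{sha[:7]}.json")  # pinned for rollback
    return names
```

Rolling back is then just pointing the scheduler (or a manual run) at the pinned object instead of the `latest` one.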
**Cloud Scheduler vs. Eventarc.** Cloud Scheduler runs pipelines on a fixed cron. For event-driven triggers (new data in GCS), use Eventarc to trigger a Cloud Function that submits the pipeline job.
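A minimal sketch of such a Cloud Function, assuming a GCS object-finalized event payload. The bucket paths are placeholders; `aiplatform.PipelineJob` and `submit()` come from the `google-cloud-aiplatform` SDK:

```python
def params_from_gcs_event(event: dict) -> dict:
    """Turn a GCS finalize event payload into pipeline parameter values."""
    return {
        "data_gcs_uri": f"gs://{event['bucket']}/{event['name']}",
    }


def handle_new_data(event: dict, project: str, region: str) -> None:
    """Submit a pipeline run for the newly landed data file."""
    from google.cloud import aiplatform  # deferred: only needed at runtime

    aiplatform.init(project=project, location=region)
    job = aiplatform.PipelineJob(
        display_name="event-triggered-training",
        template_path="gs://my-specs-bucket/specs/training-pipeline.json",
        pipeline_root="gs://my-root-bucket/runs/",
        parameter_values=params_from_gcs_event(event),
    )
    job.submit()  # async submit; the function returns immediately
```

Using `submit()` rather than `run()` keeps the function fast and avoids timing out while the pipeline executes.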
**Service account for Cloud Build.** Give the Cloud Build trigger a dedicated service account with `roles/aiplatform.user` and `roles/storage.objectAdmin`. Avoid the default Cloud Build SA, which has overly broad permissions.
## ✔️ Series 5 Complete!

This is Post 4 of the GCP ML Pipelines & MLOps with Terraform series.
- Post 1: Vertex AI Workbench 🔬
- Post 2: Vertex AI Endpoints 🚀
- Post 3: Vertex AI Feature Store 🗄️
- Post 4: Vertex AI Pipelines + Cloud Build (you are here) 📍
Your ML workflow is automated. Cloud Build tests and compiles your pipeline on every code push. Cloud Scheduler runs it on a cron. Vertex AI Pipelines executes the DAG. Models that pass evaluation register automatically. All provisioned with Terraform. 🎉
Found this helpful? Follow for the next series! 💬