Manual ML retraining is a reliability risk. Azure ML Pipelines orchestrates the ML workflow while Azure DevOps automates testing, validation, and deployment on every code push. Here's how to build the full CI/CD stack with Terraform.
Through Series 5, we've built the workspace, deployed endpoints, and set up the feature store. The final piece is automation. Right now, retraining means a data scientist manually submits a job, checks the accuracy, and updates the endpoint. That's a bottleneck.
Azure ML Pipelines (SDK v2) orchestrates the ML workflow as reusable components connected into a DAG: preprocessing, training, evaluation, conditional registration. Azure DevOps provides the CI/CD layer: unit tests, pipeline submission, and gated deployment on every code merge. Terraform provisions everything. 🎯
🏗️ The CI/CD Architecture

```
Code push to Azure Repos / GitHub
        ↓
Azure DevOps Pipeline trigger fires
        ↓
Stage 1 (CI): lint → unit tests → validate components
        ↓
Stage 2 (CD): submit Azure ML Pipeline job
        ↓
Azure ML Pipeline: preprocess → train → evaluate → condition
        ↓
Pass: register model → manual approval gate → deploy endpoint
Fail: pipeline exits with error notification
```
| Component | Role |
|---|---|
| Azure ML Pipeline | Reusable component DAG (the ML workflow) |
| Azure DevOps | CI/CD: test, validate, submit pipeline on code push |
| Schedule (SDK v2) | Recurring pipeline runs for continuous retraining |
| Model Registry | Version and approve trained models |
| Approval Gate | Human review before production deployment |
🔧 Terraform: CI/CD Infrastructure
Service Principal for DevOps
```hcl
# devops/service_principal.tf
data "azuread_client_config" "current" {}

resource "azuread_application" "devops" {
  display_name = "${var.environment}-ml-devops-sp"
}

resource "azuread_service_principal" "devops" {
  client_id = azuread_application.devops.client_id
}

resource "azuread_service_principal_password" "devops" {
  service_principal_id = azuread_service_principal.devops.id
}

# DevOps SP needs Contributor on the ML workspace
resource "azurerm_role_assignment" "devops_ml" {
  scope                = azurerm_machine_learning_workspace.this.id
  role_definition_name = "Contributor"
  principal_id         = azuread_service_principal.devops.object_id
}
```
Storage for Pipeline Artifacts
```hcl
# devops/pipeline_storage.tf
resource "azurerm_storage_container" "pipeline_artifacts" {
  name                  = "pipeline-artifacts"
  storage_account_id    = azurerm_storage_account.ml.id
  container_access_type = "private"
}
```
Azure DevOps Project and Service Connection (via azuredevops provider)
```hcl
# devops/azuredevops.tf
terraform {
  required_providers {
    azuredevops = {
      source  = "microsoft/azuredevops"
      version = "~> 1.0"
    }
  }
}

resource "azuredevops_project" "ml" {
  name               = "${var.environment}-ml-platform"
  visibility         = "private"
  version_control    = "Git"
  work_item_template = "Agile"
}

resource "azuredevops_serviceendpoint_azurerm" "ml_workspace" {
  project_id            = azuredevops_project.ml.id
  service_endpoint_name = "azure-ml-connection"

  credentials {
    serviceprincipalid  = azuread_application.devops.client_id
    serviceprincipalkey = azuread_service_principal_password.devops.value
  }

  azurerm_spn_tenantid      = data.azuread_client_config.current.tenant_id
  azurerm_subscription_id   = data.azurerm_client_config.current.subscription_id
  azurerm_subscription_name = data.azurerm_subscription.current.display_name
  environment               = "AzureCloud"
  resource_group            = azurerm_resource_group.ml.name # scope the connection to the ML resource group
}
```
```hcl
resource "azuredevops_build_definition" "ml_pipeline" {
  project_id = azuredevops_project.ml.id
  name       = "${var.environment}-ml-pipeline"

  ci_trigger {
    use_yaml = true
  }

  repository {
    repo_type   = "GitHub"
    repo_id     = "${var.github_owner}/${var.github_repo}"
    branch_name = var.deploy_branch
    yml_path    = "azure-devops/ml-pipeline.yml"
    # GitHub-backed repos also need a service connection, e.g.:
    # service_connection_id = azuredevops_serviceendpoint_github.github.id
  }

  variable {
    name  = "ENVIRONMENT"
    value = var.environment
  }

  variable {
    name      = "WORKSPACE_NAME"
    value     = azurerm_machine_learning_workspace.this.name
    is_secret = false
  }
}
```
🔧 Azure DevOps Pipeline YAML
This file lives in your repo and runs on every push to the deploy branch:
```yaml
# azure-devops/ml-pipeline.yml
trigger:
  branches:
    include:
      - main

variables:
  # subscriptionId, resourceGroup, workspaceName, and environment are
  # set as pipeline variables (or a variable group) in the DevOps UI
  SUBSCRIPTION_ID: $(subscriptionId)
  RESOURCE_GROUP: $(resourceGroup)
  WORKSPACE_NAME: $(workspaceName)
  ENVIRONMENT: $(environment)

stages:
  - stage: CI
    displayName: "Test and Validate"
    jobs:
      - job: Test
        pool:
          vmImage: "ubuntu-latest"
        steps:
          - task: UsePythonVersion@0
            inputs:
              versionSpec: "3.11"
          - script: pip install -r requirements.txt
            displayName: "Install dependencies"
          - script: python -m pytest pipelines/tests/ -v
            displayName: "Run unit tests"
          - script: python pipelines/validate_components.py
            displayName: "Validate component definitions"

  - stage: CD
    displayName: "Submit ML Pipeline"
    dependsOn: CI
    condition: succeeded()
    jobs:
      - job: SubmitPipeline
        pool:
          vmImage: "ubuntu-latest"
        steps:
          - task: AzureCLI@2
            displayName: "Submit Azure ML Pipeline"
            inputs:
              azureSubscription: "azure-ml-connection"
              scriptType: "bash"
              scriptLocation: "inlineScript"
              inlineScript: |
                az ml job create \
                  --file pipelines/training-pipeline.yml \
                  --workspace-name $(WORKSPACE_NAME) \
                  --resource-group $(RESOURCE_GROUP) \
                  --subscription $(SUBSCRIPTION_ID) \
                  --stream

  - stage: Approval
    displayName: "Manual Approval Gate"
    dependsOn: CD
    condition: succeeded()
    jobs:
      - deployment: ApproveDeployment
        environment: "$(ENVIRONMENT)-ml-approval"
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureCLI@2
                  displayName: "Deploy approved model to endpoint"
                  inputs:
                    azureSubscription: "azure-ml-connection"
                    scriptType: "bash"
                    scriptLocation: "inlineScript"
                    inlineScript: |
                      python scripts/deploy_approved_model.py \
                        --workspace $(WORKSPACE_NAME) \
                        --resource-group $(RESOURCE_GROUP) \
                        --endpoint-name $(ENVIRONMENT)-my-endpoint
```
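The CI stage's `validate_components.py` script isn't shown in this post. As a stdlib-only sketch (the checks and error strings here are my own, not an Azure ML API), it might verify that every `${{inputs.x}}` / `${{outputs.x}}` placeholder in a component's command matches a declared input or output:

```python
# pipelines/validate_components.py (illustrative sketch, stdlib only)
import re

REQUIRED_KEYS = {"name", "command", "inputs", "outputs"}

def validate_component(spec: dict) -> list:
    """Return a list of problems with a component spec (empty list = valid)."""
    errors = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - spec.keys())]
    if errors:
        return errors
    # Every ${{inputs.x}} / ${{outputs.x}} placeholder in the command must
    # refer to a declared input/output, and every declared one must be used.
    for kind in ("inputs", "outputs"):
        referenced = set(re.findall(r"\$\{\{%s\.(\w+)\}\}" % kind, spec["command"]))
        declared = set(spec[kind])
        for name in sorted(referenced - declared):
            errors.append(f"command references undeclared {kind[:-1]}: {name}")
        for name in sorted(declared - referenced):
            errors.append(f"declared {kind[:-1]} never used in command: {name}")
    return errors
```

In CI this would run over every component definition in the repo and exit non-zero when any list is non-empty, failing the build before a pipeline job is ever submitted.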
🐍 Azure ML Pipeline Definition (SDK v2)
```python
# pipelines/training_pipeline.py
from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import CommandComponent
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="...",
    resource_group_name="...",
    workspace_name="...",
)

# Define components
preprocess = CommandComponent(
    name="preprocess",
    command="python preprocess.py --input ${{inputs.raw_data}} --output ${{outputs.processed_data}}",
    environment="azureml:sklearn-env:1",
    inputs={"raw_data": {"type": "uri_folder"}},
    outputs={"processed_data": {"type": "uri_folder"}},
)

train = CommandComponent(
    name="train",
    command=(
        "python train.py --data ${{inputs.data}} "
        "--model-output ${{outputs.model}} --accuracy-output ${{outputs.accuracy}}"
    ),
    environment="azureml:sklearn-env:1",
    inputs={"data": {"type": "uri_folder"}},
    outputs={"model": {"type": "uri_folder"}, "accuracy": {"type": "uri_file"}},
)

@pipeline(name="training-pipeline", default_compute="cpu-cluster")
def training_pipeline(raw_data: Input(type="uri_folder")):
    preprocess_step = preprocess(raw_data=raw_data)
    train_step = train(data=preprocess_step.outputs.processed_data)
    return {"model": train_step.outputs.model}

# Submit the pipeline job
pipeline_job = training_pipeline(
    raw_data=Input(path="azureml://datastores/workspaceblobstore/paths/data/")
)
ml_client.jobs.create_or_update(pipeline_job, experiment_name="training-runs")
```
Scheduled Recurring Training (SDK v2)
```python
from azure.ai.ml.entities import JobSchedule, RecurrenceTrigger

environment = "dev"  # matches var.environment in the Terraform config

schedule = JobSchedule(
    name=f"{environment}-daily-training",
    trigger=RecurrenceTrigger(frequency="day", interval=1, start_time="2026-01-01T02:00:00"),
    create_job=pipeline_job,
)
ml_client.schedules.begin_create_or_update(schedule).result()
```
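A typo in the recurrence settings only surfaces when the schedule is created in the workspace, so a small pre-flight check can fail fast in CI instead. This helper is illustrative (not an SDK API); the frequency values mirror the ones `RecurrenceTrigger` accepts:

```python
# Pre-flight check for schedule settings (illustrative helper, not an SDK API).
VALID_FREQUENCIES = {"minute", "hour", "day", "week", "month"}

def recurrence_args(frequency: str, interval: int) -> dict:
    """Validate recurrence settings and package them as RecurrenceTrigger kwargs."""
    if frequency not in VALID_FREQUENCIES:
        raise ValueError(f"frequency must be one of {sorted(VALID_FREQUENCIES)}")
    if not isinstance(interval, int) or interval < 1:
        raise ValueError("interval must be a positive integer")
    return {"frequency": frequency, "interval": interval}
```

Used as `RecurrenceTrigger(**recurrence_args("day", 1), start_time=...)`, bad values raise in a unit test rather than at schedule-creation time.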
🌍 Environment Configuration

```hcl
# environments/dev.tfvars
environment   = "dev"
deploy_branch = "develop"
```

```hcl
# environments/prod.tfvars
environment   = "prod"
deploy_branch = "main"
```
Approval gates in Azure DevOps are configured per environment in the UI: Pipelines → Environments → prod-ml-approval → Approvals and checks. Add team members as required approvers before any production deployment proceeds.
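The `scripts/deploy_approved_model.py` script referenced by the Approval stage isn't listed in this post. One part of it is pure logic worth pinning down: choosing which registered model version to deploy. This sketch assumes the registry's auto-incremented numeric version strings; the function name is my own:

```python
# One testable piece of a deploy-approved-model script: pick the registered
# version to deploy, assuming numeric version strings ("1", "2", "10", ...).
def latest_version(versions: list) -> str:
    """Return the highest numeric version, comparing as integers, not strings."""
    if not versions:
        raise ValueError("no registered versions to deploy")
    return max(versions, key=int)
```

The rest of the script would list versions with `MLClient` and update the online deployment, as covered in Post 2. Note the `key=int`: a plain string `max()` would rank `"9"` above `"10"`.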
⚠️ Gotchas and Tips
Use SDK v2 only. SDK v1 reached end-of-support in March 2025 and will fully stop working in June 2026. All pipelines should use azure-ai-ml (SDK v2) and CLI v2.
Service principal secrets need rotation. The azuread_service_principal_password expires. Use federated identity (OIDC) in Azure DevOps for a secretless authentication alternative that doesn't require rotation.
AzureML Job Wait task for long-running jobs. Training jobs can take hours. Use the AzureML Job Wait task in Azure DevOps to hold the pipeline until the ML job completes before proceeding to the approval stage.
Component versioning. Register components in the Azure ML registry with versions. This ensures pipeline runs are reproducible - you know exactly which version of each component ran for any historical job.
Schedules live in the workspace, not Terraform. Azure ML job schedules are created via SDK v2 or CLI v2 and live in the workspace. They're not managed by Terraform directly. Include schedule creation in your DevOps pipeline's deploy stage.
✔️ Series 5 Complete!
This is Post 4 of the Azure ML Pipelines & MLOps with Terraform series, and the final post of Series 5.
- Post 1: Azure ML Workspace 🔬
- Post 2: Azure ML Online Endpoints 🚀
- Post 3: Azure ML Feature Store 🗄️
- Post 4: Azure ML Pipelines + Azure DevOps (you are here) 🔁
Your ML workflow is automated. Azure DevOps tests and validates on every push. Azure ML Pipelines runs the DAG. Models that pass evaluation register automatically. Manual approval gates protect production. All provisioned with Terraform. 🚀
Thanks for following the full Series 5! Series 6 coming soon. 💬