Manual ML retraining is a reliability risk. Azure ML Pipelines orchestrates the ML workflow while Azure DevOps automates testing, validation, and deployment on every code push. Here's how to build the full CI/CD stack with Terraform.
Through Series 5, we've built the workspace, deployed endpoints, and set up the feature store. The final piece is automation. Right now, retraining means a data scientist manually submits a job, checks the accuracy, and updates the endpoint. That's a bottleneck.
Azure ML Pipelines (SDK v2) orchestrates the ML workflow as reusable components connected into a DAG: preprocessing, training, evaluation, conditional registration. Azure DevOps provides the CI/CD layer: unit tests, pipeline submission, and gated deployment on every code merge. Terraform provisions everything. 🎯
🏗️ The CI/CD Architecture

```
Code push to Azure Repos / GitHub
        ↓
Azure DevOps Pipeline trigger fires
        ↓
Stage 1 (CI): lint → unit tests → validate components
        ↓
Stage 2 (CD): submit Azure ML Pipeline job
        ↓
Azure ML Pipeline: preprocess → train → evaluate → condition
        ↓
Pass: register model → manual approval gate → deploy endpoint
Fail: pipeline exits with error notification
```
| Component | Role |
|---|---|
| Azure ML Pipeline | Reusable component DAG (the ML workflow) |
| Azure DevOps | CI/CD: test, validate, submit pipeline on code push |
| Schedule (SDK v2) | Recurring pipeline runs for continuous retraining |
| Model Registry | Version and approve trained models |
| Approval Gate | Human review before production deployment |
🔧 Terraform: CI/CD Infrastructure
Service Principal for DevOps
```hcl
# devops/service_principal.tf
data "azuread_client_config" "current" {}

resource "azuread_application" "devops" {
  display_name = "${var.environment}-ml-devops-sp"
}

resource "azuread_service_principal" "devops" {
  client_id = azuread_application.devops.client_id
}

resource "azuread_service_principal_password" "devops" {
  service_principal_id = azuread_service_principal.devops.id
}

# DevOps SP needs Contributor on the ML workspace
resource "azurerm_role_assignment" "devops_ml" {
  scope                = azurerm_machine_learning_workspace.this.id
  role_definition_name = "Contributor"
  principal_id         = azuread_service_principal.devops.object_id
}
```
Storage for Pipeline Artifacts
```hcl
# devops/pipeline_storage.tf
resource "azurerm_storage_container" "pipeline_artifacts" {
  name                  = "pipeline-artifacts"
  storage_account_id    = azurerm_storage_account.ml.id
  container_access_type = "private"
}
```
Azure DevOps Project and Service Connection (via azuredevops provider)
```hcl
# devops/azuredevops.tf
terraform {
  required_providers {
    azuredevops = {
      source  = "microsoft/azuredevops"
      version = "~> 1.0"
    }
  }
}

resource "azuredevops_project" "ml" {
  name               = "${var.environment}-ml-platform"
  visibility         = "private"
  version_control    = "Git"
  work_item_template = "Agile"
}

resource "azuredevops_serviceendpoint_azurerm" "ml_workspace" {
  project_id            = azuredevops_project.ml.id
  service_endpoint_name = "azure-ml-connection"

  credentials {
    serviceprincipalid  = azuread_application.devops.client_id
    serviceprincipalkey = azuread_service_principal_password.devops.value
  }

  azurerm_spn_tenantid      = data.azuread_client_config.current.tenant_id
  azurerm_subscription_id   = data.azurerm_client_config.current.subscription_id
  azurerm_subscription_name = data.azurerm_subscription.current.display_name
  environment               = "AzureCloud"
  resource_group            = azurerm_resource_group.ml.name # scope the connection to the ML resource group
}
```
```hcl
resource "azuredevops_build_definition" "ml_pipeline" {
  project_id = azuredevops_project.ml.id
  name       = "${var.environment}-ml-pipeline"

  ci_trigger {
    use_yaml = true
  }

  repository {
    repo_type   = "GitHub"
    repo_id     = "${var.github_owner}/${var.github_repo}"
    branch_name = var.deploy_branch
    yml_path    = "azure-devops/ml-pipeline.yml"
    # GitHub-backed repos also need a service connection, e.g.:
    # service_connection_id = azuredevops_serviceendpoint_github.github.id
  }

  variable {
    name  = "ENVIRONMENT"
    value = var.environment
  }

  variable {
    name      = "WORKSPACE_NAME"
    value     = azurerm_machine_learning_workspace.this.name
    is_secret = false
  }
}
```
🔧 Azure DevOps Pipeline YAML
This file lives in your repo and runs on every push to the deploy branch:
```yaml
# azure-devops/ml-pipeline.yml
trigger:
  branches:
    include:
      - main

variables:
  # subscriptionId, resourceGroup, workspaceName, and environment are
  # set as pipeline variables (or a variable group) in the DevOps UI
  SUBSCRIPTION_ID: $(subscriptionId)
  RESOURCE_GROUP: $(resourceGroup)
  WORKSPACE_NAME: $(workspaceName)
  ENVIRONMENT: $(environment)

stages:
  - stage: CI
    displayName: "Test and Validate"
    jobs:
      - job: Test
        pool:
          vmImage: "ubuntu-latest"
        steps:
          - task: UsePythonVersion@0
            inputs:
              versionSpec: "3.11"
          - script: pip install -r requirements.txt
            displayName: "Install dependencies"
          - script: python -m pytest pipelines/tests/ -v
            displayName: "Run unit tests"
          - script: python pipelines/validate_components.py
            displayName: "Validate component definitions"

  - stage: CD
    displayName: "Submit ML Pipeline"
    dependsOn: CI
    condition: succeeded()
    jobs:
      - job: SubmitPipeline
        pool:
          vmImage: "ubuntu-latest"
        steps:
          - task: AzureCLI@2
            displayName: "Submit Azure ML Pipeline"
            inputs:
              azureSubscription: "azure-ml-connection"
              scriptType: "bash"
              scriptLocation: "inlineScript"
              inlineScript: |
                az ml job create \
                  --file pipelines/training-pipeline.yml \
                  --workspace-name $(WORKSPACE_NAME) \
                  --resource-group $(RESOURCE_GROUP) \
                  --subscription $(SUBSCRIPTION_ID) \
                  --stream

  - stage: Approval
    displayName: "Manual Approval Gate"
    dependsOn: CD
    condition: succeeded()
    jobs:
      - deployment: ApproveDeployment
        environment: "$(ENVIRONMENT)-ml-approval"
        strategy:
          runOnce:
            deploy:
              steps:
                - task: AzureCLI@2
                  displayName: "Deploy approved model to endpoint"
                  inputs:
                    azureSubscription: "azure-ml-connection"
                    scriptType: "bash"
                    scriptLocation: "inlineScript"
                    inlineScript: |
                      python scripts/deploy_approved_model.py \
                        --workspace $(WORKSPACE_NAME) \
                        --resource-group $(RESOURCE_GROUP) \
                        --endpoint-name $(ENVIRONMENT)-my-endpoint
```
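The CI stage's `validate_components.py` script isn't shown in this post. As a stdlib-only sketch (the checks and error strings here are my own, not an Azure ML API), it might verify that every `${{inputs.x}}` / `${{outputs.x}}` placeholder in a component's command matches a declared input or output:

```python
# pipelines/validate_components.py (illustrative sketch, stdlib only)
import re

REQUIRED_KEYS = {"name", "command", "inputs", "outputs"}

def validate_component(spec: dict) -> list:
    """Return a list of problems with a component spec (empty list = valid)."""
    errors = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - spec.keys())]
    if errors:
        return errors
    # Every ${{inputs.x}} / ${{outputs.x}} placeholder in the command must
    # refer to a declared input/output, and every declared one must be used.
    for kind in ("inputs", "outputs"):
        referenced = set(re.findall(r"\$\{\{%s\.(\w+)\}\}" % kind, spec["command"]))
        declared = set(spec[kind])
        for name in sorted(referenced - declared):
            errors.append(f"command references undeclared {kind[:-1]}: {name}")
        for name in sorted(declared - referenced):
            errors.append(f"declared {kind[:-1]} never used in command: {name}")
    return errors
```

In CI this would run over every component definition in the repo and exit non-zero when any list is non-empty, failing the build before a pipeline job is ever submitted.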
🐍 Azure ML Pipeline Definition (SDK v2)
```python
# pipelines/training_pipeline.py
from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.ai.ml.entities import CommandComponent
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="...",
    resource_group_name="...",
    workspace_name="...",
)

# Define components
preprocess = CommandComponent(
    name="preprocess",
    command="python preprocess.py --input ${{inputs.raw_data}} --output ${{outputs.processed_data}}",
    environment="azureml:sklearn-env:1",
    inputs={"raw_data": {"type": "uri_folder"}},
    outputs={"processed_data": {"type": "uri_folder"}},
)

train = CommandComponent(
    name="train",
    command=(
        "python train.py --data ${{inputs.data}} "
        "--model-output ${{outputs.model}} --accuracy-output ${{outputs.accuracy}}"
    ),
    environment="azureml:sklearn-env:1",
    inputs={"data": {"type": "uri_folder"}},
    outputs={"model": {"type": "uri_folder"}, "accuracy": {"type": "uri_file"}},
)

@pipeline(name="training-pipeline", default_compute="cpu-cluster")
def training_pipeline(raw_data: Input(type="uri_folder")):
    preprocess_step = preprocess(raw_data=raw_data)
    train_step = train(data=preprocess_step.outputs.processed_data)
    return {"model": train_step.outputs.model}

# Submit the pipeline job
pipeline_job = training_pipeline(
    raw_data=Input(path="azureml://datastores/workspaceblobstore/paths/data/")
)
ml_client.jobs.create_or_update(pipeline_job, experiment_name="training-runs")
```
Scheduled Recurring Training (SDK v2)
```python
from azure.ai.ml.entities import JobSchedule, RecurrenceTrigger

environment = "dev"  # matches var.environment in the Terraform config

schedule = JobSchedule(
    name=f"{environment}-daily-training",
    trigger=RecurrenceTrigger(frequency="day", interval=1, start_time="2026-01-01T02:00:00"),
    create_job=pipeline_job,
)
ml_client.schedules.begin_create_or_update(schedule).result()
```
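A typo in the recurrence settings only surfaces when the schedule is created in the workspace, so a small pre-flight check can fail fast in CI instead. This helper is illustrative (not an SDK API); the frequency values mirror the ones `RecurrenceTrigger` accepts:

```python
# Pre-flight check for schedule settings (illustrative helper, not an SDK API).
VALID_FREQUENCIES = {"minute", "hour", "day", "week", "month"}

def recurrence_args(frequency: str, interval: int) -> dict:
    """Validate recurrence settings and package them as RecurrenceTrigger kwargs."""
    if frequency not in VALID_FREQUENCIES:
        raise ValueError(f"frequency must be one of {sorted(VALID_FREQUENCIES)}")
    if not isinstance(interval, int) or interval < 1:
        raise ValueError("interval must be a positive integer")
    return {"frequency": frequency, "interval": interval}
```

Used as `RecurrenceTrigger(**recurrence_args("day", 1), start_time=...)`, bad values raise in a unit test rather than at schedule-creation time.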
🌍 Environment Configuration

```hcl
# environments/dev.tfvars
environment   = "dev"
deploy_branch = "develop"
```

```hcl
# environments/prod.tfvars
environment   = "prod"
deploy_branch = "main"
```
Approval gates in Azure DevOps are configured per environment in the UI: Pipelines → Environments → prod-ml-approval → Approvals and checks. Add team members as required approvers before any production deployment proceeds.
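The `scripts/deploy_approved_model.py` script referenced by the Approval stage isn't listed in this post. One part of it is pure logic worth pinning down: choosing which registered model version to deploy. This sketch assumes the registry's auto-incremented numeric version strings; the function name is my own:

```python
# One testable piece of a deploy-approved-model script: pick the registered
# version to deploy, assuming numeric version strings ("1", "2", "10", ...).
def latest_version(versions: list) -> str:
    """Return the highest numeric version, comparing as integers, not strings."""
    if not versions:
        raise ValueError("no registered versions to deploy")
    return max(versions, key=int)
```

The rest of the script would list versions with `MLClient` and update the online deployment, as covered in Post 2. Note the `key=int`: a plain string `max()` would rank `"9"` above `"10"`.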
⚠️ Gotchas and Tips
Use SDK v2 only. SDK v1 reached end-of-support in March 2025 and will fully stop working in June 2026. All pipelines should use azure-ai-ml (SDK v2) and CLI v2.
Service principal secrets need rotation. The azuread_service_principal_password expires. Use federated identity (OIDC) in Azure DevOps for a secretless authentication alternative that doesn't require rotation.
AzureML Job Wait task for long-running jobs. Training jobs can take hours. Use the AzureML Job Wait task in Azure DevOps to hold the pipeline until the ML job completes before proceeding to the approval stage.
Component versioning. Register components in the Azure ML registry with versions. This ensures pipeline runs are reproducible - you know exactly which version of each component ran for any historical job.
Schedules live in the workspace, not Terraform. Azure ML job schedules are created via SDK v2 or CLI v2 and live in the workspace. They're not managed by Terraform directly. Include schedule creation in your DevOps pipeline's deploy stage.
✔️ Series 5 Complete!
This is Post 4 of the Azure ML Pipelines & MLOps with Terraform series, and the final post of Series 5.
- Post 1: Azure ML Workspace 🔬
- Post 2: Azure ML Online Endpoints 🚀
- Post 3: Azure ML Feature Store 🗄️
- Post 4: Azure ML Pipelines + Azure DevOps (you are here) 🔁
Your ML workflow is automated. Azure DevOps tests and validates on every push. Azure ML Pipelines runs the DAG. Models that pass evaluation register automatically. Manual approval gates protect production. All provisioned with Terraform. 🚀
Thanks for following the full Series 5! Series 6 coming soon. 💬