Jubin Soni

Posted on Jun 28

MLOps: Building a CI/CD Pipeline for ML Models on Azure Databricks

#azure #databricks #mlops #cicd

Most ML teams are great at training models. Very few are great at shipping them. The gap between a notebook that works and a model that reliably serves production traffic is where most ML projects stall.

In this tutorial I'll walk through building a proper CI/CD pipeline for ML models on Azure Databricks using:

MLflow for experiment tracking, model versioning, and the model registry
Databricks Asset Bundles (DABs) for infrastructure-as-code and job deployment
Azure DevOps / GitHub Actions as the CI/CD orchestrator
Delta Lake as the feature and validation data store
Databricks Model Serving for the production endpoint

The use case is the same churn prediction model from the previous post, but this time we're focusing entirely on how it gets from a notebook to a production endpoint reliably and repeatably.

Architecture Overview

CI/CD Flow

Pipeline Stage Breakdown

Stage	Tool	What happens	Gate to next stage
CI	GitHub Actions	Lint, unit tests, bundle validate	All tests green
Train	Databricks Job	Full training run, MLflow logging	Job exits 0
Register	MLflow Registry	New model version registered from the training run	Auto on train success
Validate	Databricks Job	Metric thresholds checked on holdout set	ROC-AUC >= 0.80
Promote	MLflow Registry	Production alias pointed at the validated version	Validation passes
Deploy	Model Serving	Endpoint updated to new model version	Promotion complete
Monitor	Databricks Lakehouse Monitoring	Drift and accuracy tracked post-deploy	Ongoing
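
The gating behaviour in the table can be sketched as a small runner that executes stages in order and stops at the first failed gate. This is only an illustration of the control flow, not how Databricks Jobs or GitHub Actions implement it; the `Stage` class, `run_pipeline`, and the lambda gates are hypothetical stand-ins, and the ROC-AUC value is made up.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    """One pipeline stage plus the gate that must pass before the next stage runs."""
    name: str
    gate: Callable[[], bool]  # hypothetical check, e.g. "tests green" or "ROC-AUC >= 0.80"


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and return the names of the stages whose gate passed.

    Execution stops at the first failing gate, so later stages
    (for example Promote and Deploy) never run after a failed Validate.
    """
    completed = []
    for stage in stages:
        if not stage.gate():
            break
        completed.append(stage.name)
    return completed


# Illustrative value only: a candidate that misses the 0.80 ROC-AUC threshold.
candidate_roc_auc = 0.78
stages = [
    Stage("CI", lambda: True),
    Stage("Train", lambda: True),
    Stage("Register", lambda: True),
    Stage("Validate", lambda: candidate_roc_auc >= 0.80),
    Stage("Promote", lambda: True),
    Stage("Deploy", lambda: True),
]
completed_stages = run_pipeline(stages)
print(completed_stages)
```

In the real pipeline the same effect comes from job exit codes: a failed validation job makes the CI step exit non-zero, so the promote and deploy steps are never reached.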

Step 1 — Project Structure with Databricks Asset Bundles

Databricks Asset Bundles (DABs) let you define your jobs, clusters, and pipelines as code and deploy them via CLI. This is your IaC layer.

# databricks.yml
bundle:
  name: churn-prediction

variables:
  git_sha:
    description: Commit SHA of the code being deployed (set by CI)
    default: unknown

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-xxxx.azuredatabricks.net

  staging:
    workspace:
      host: https://adb-xxxx.azuredatabricks.net

  production:
    mode: production
    workspace:
      host: https://adb-xxxx.azuredatabricks.net

resources:
  jobs:
    training_job:
      # ${bundle.target} resolves to dev, staging, or production at deploy time
      name: churn-training-${bundle.target}
      tasks:
        - task_key: feature_engineering
          notebook_task:
            notebook_path: ./notebooks/01_feature_engineering.py
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 4

        - task_key: train_and_register
          depends_on:
            - task_key: feature_engineering
          notebook_task:
            notebook_path: ./notebooks/02_train_and_register.py
            # Exposed to the notebook as the git_sha widget
            base_parameters:
              git_sha: ${var.git_sha}
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 2

    validation_job:
      name: churn-validation-${bundle.target}
      tasks:
        - task_key: validate_model
          notebook_task:
            notebook_path: ./notebooks/03_validate_and_promote.py
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 2

Step 2 — Training Job with MLflow Autologging

Keep the training notebook clean and focused. Let MLflow autologging handle the heavy lifting on metric and param capture.

# notebooks/02_train_and_register.py
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score
from delta.tables import DeltaTable
import pandas as pd

mlflow.set_experiment('/churn-prediction/ci-cd-pipeline')
mlflow.sklearn.autolog(log_input_examples=True, log_model_signatures=True)

# Read Gold features and capture Delta version for reproducibility
gold_table    = DeltaTable.forName(spark, 'churn.gold.features')
delta_version = gold_table.history(1).select('version').collect()[0][0]

features_pdf = spark.table('churn.gold.features') \
    .filter("feature_date = current_date()") \
    .toPandas()

FEATURE_COLS = [
    'total_events', 'total_sessions', 'distinct_products',
    'events_last_30d', 'events_last_90d', 'days_since_last_event',
    'customer_tenure_days', 'avg_events_per_day',
    'recency_tier', 'engagement_score',
]
TARGET = 'churned'

X = features_pdf[FEATURE_COLS]
y = features_pdf[TARGET]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

with mlflow.start_run() as run:
    params = {
        'n_estimators':  200,
        'max_depth':     5,
        'learning_rate': 0.05,
        'subsample':     0.8,
    }
    model = GradientBoostingClassifier(**params, random_state=42)
    model.fit(X_train, y_train)

    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = model.predict(X_test)

    mlflow.log_param('delta_feature_version', delta_version)
    mlflow.log_param('git_sha', dbutils.widgets.get('git_sha'))  # passed in by CI
    mlflow.log_metric('roc_auc',   roc_auc_score(y_test, y_prob))
    mlflow.log_metric('f1_score',  f1_score(y_test, y_pred))
    mlflow.log_metric('precision', precision_score(y_test, y_pred))
    mlflow.log_metric('recall',    recall_score(y_test, y_pred))

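    # Infer the signature from matching inputs and outputs (both from the training split)
    signature = infer_signature(X_train, model.predict(X_train))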
    mlflow.sklearn.log_model(
        model,
        artifact_path='churn-gbm',
        signature=signature,
        registered_model_name='churn-prediction-gbm',
        await_registration_for=300,
    )

    # Pass run ID to downstream tasks via job output
    dbutils.jobs.taskValues.set(key='run_id', value=run.info.run_id)
    print(f"Training complete. Run ID: {run.info.run_id}")
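
Logging `delta_feature_version` only pays off if a later run can reload that exact snapshot. Below is a minimal sketch of building a Delta time-travel query for a logged version. `snapshot_query` is a hypothetical helper and version 42 is illustrative; on Databricks the resulting SQL would be passed to `spark.sql(...)`.

```python
def snapshot_query(table_name: str, version: int) -> str:
    """Build a Delta time-travel query that reads a table at a fixed version."""
    if isinstance(version, bool) or not isinstance(version, int) or version < 0:
        raise ValueError(f"Delta version must be a non-negative integer, got {version!r}")
    return f"SELECT * FROM {table_name} VERSION AS OF {version}"


# Example: reload the features a past run trained on (version 42 is illustrative).
query = snapshot_query("churn.gold.features", 42)
print(query)
```

Reading the training data through the same pinned version that gets logged would also close the small window in which the table could change between capturing the version and loading the features.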

Step 3 — Validation Job: Gate on Metrics Before Promoting

Never auto-promote based on a successful training run alone. Always validate on a held-out or recent dataset and check against a threshold before touching the Production alias.

# notebooks/03_validate_and_promote.py
import mlflow
from mlflow import MlflowClient
from sklearn.metrics import roc_auc_score
import pandas as pd

client    = MlflowClient()
MODEL_NAME = 'churn-prediction-gbm'

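# Pick up run_id from the upstream training task. Task values are scoped to a
# single job run, so when validation runs as its own job (as validation_job
# does here) fall back to the newest registered version of the model.
try:
    run_id = dbutils.jobs.taskValues.get(
        taskKey='train_and_register', key='run_id'
    )
except Exception:
    latest = max(
        client.search_model_versions(f"name='{MODEL_NAME}'"),
        key=lambda v: int(v.version),
    )
    run_id = latest.run_id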

# Thresholds — fail the job if any metric is below these
THRESHOLDS = {
    'roc_auc':   0.80,
    'f1_score':  0.72,
    'precision': 0.70,
}

run     = client.get_run(run_id)
metrics = run.data.metrics

print("Validating metrics against thresholds...")
failures = []
for metric, threshold in THRESHOLDS.items():
    actual = metrics.get(metric, 0)
    status = 'PASS' if actual >= threshold else 'FAIL'
    print(f"  {metric}: {actual:.4f} (threshold: {threshold}) -> {status}")
    if actual < threshold:
        failures.append(f"{metric}={actual:.4f} below threshold {threshold}")

if failures:
    raise Exception(f"Validation failed: {', '.join(failures)}")

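# Promote to Production alias if all thresholds pass
versions = client.search_model_versions(
    filter_string=f"name='{MODEL_NAME}' and run_id='{run_id}'"
)
if not versions:
    raise Exception(f"No registered version of {MODEL_NAME} found for run {run_id}")
model_version = versions[0].version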

client.set_registered_model_alias(
    name=MODEL_NAME,
    alias='Production',
    version=model_version,
)

print(f"Model version {model_version} promoted to Production alias.")
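
The threshold loop is easier to cover in the CI unit tests if it lives in a plain function with no Databricks or MLflow dependency. Here is a minimal sketch using the same metric names; `check_thresholds` is a hypothetical helper, not part of MLflow, and the metric values are made up.

```python
from typing import Dict, List


def check_thresholds(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one failure message for every metric that is missing or below its threshold.

    A metric absent from `metrics` is reported explicitly rather than being
    treated as 0, so a logging bug is not mistaken for a bad model.
    """
    failures = []
    for name, threshold in thresholds.items():
        if name not in metrics:
            failures.append(f"{name} missing from run metrics")
        elif metrics[name] < threshold:
            failures.append(f"{name}={metrics[name]:.4f} below threshold {threshold}")
    return failures


thresholds = {"roc_auc": 0.80, "f1_score": 0.72, "precision": 0.70}
failures = check_thresholds({"roc_auc": 0.85, "f1_score": 0.70}, thresholds)
print(failures)
```

The notebook would then reduce to calling the helper with `run.data.metrics` and raising if the returned list is non-empty.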

Step 4 — GitHub Actions CI/CD Workflow

This is the glue that ties everything together. The workflow has two jobs: one handles PR validation, and the other handles training and deployment on merge to main.

# .github/workflows/mlops.yml
name: MLOps CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  DATABRICKS_HOST:  ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

jobs:
  ci:
    name: Lint and Test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

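      # The bundle commands need the current Databricks CLI; the legacy
      # databricks-cli pip package does not include them.
      - name: Install Databricks CLI
        uses: databricks/setup-cli@main

      - name: Install dependencies
        run: pip install databricks-sdk pytest flake8 mlflow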

      - name: Lint
        run: flake8 notebooks/ src/ --max-line-length=120

      - name: Unit tests
        run: pytest tests/unit/ -v

      - name: Validate DAB bundle
        run: databricks bundle validate --target staging

  train-and-deploy:
    name: Train, Validate, Deploy
    runs-on: ubuntu-latest
    needs: ci
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install Databricks CLI
        uses: databricks/setup-cli@main

      # Needed by scripts/update_serving_endpoint.py
      - name: Install Python dependencies
        run: pip install databricks-sdk mlflow

      # Bundle variables are resolved at deploy time, so pass git_sha here
      - name: Deploy bundle to staging
        run: |
          databricks bundle deploy --target staging \
            --var "git_sha=${{ github.sha }}"

      - name: Run training job
        run: databricks bundle run training_job --target staging

      - name: Run validation job
        run: |
          databricks bundle run validation_job --target staging
          echo "Validation passed. Promoting to production."

      - name: Deploy bundle to production
        run: databricks bundle deploy --target production

      - name: Update Model Serving endpoint
        run: |
          python scripts/update_serving_endpoint.py \
            --model-name churn-prediction-gbm \
            --alias Production

Step 5 — Update the Serving Endpoint

The final step in the pipeline updates the Databricks Model Serving endpoint to point at the newly promoted Production model version.

# scripts/update_serving_endpoint.py
import argparse
from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound
from databricks.sdk.service.serving import ServedModelInput, EndpointCoreConfigInput
from mlflow import MlflowClient

def update_serving_endpoint(model_name: str, alias: str):
    w      = WorkspaceClient()
    # Outside a Databricks notebook MLflow defaults to a local store, so point
    # the client at the workspace (authenticates via DATABRICKS_HOST / DATABRICKS_TOKEN).
    client = MlflowClient(tracking_uri='databricks', registry_uri='databricks')

    # Resolve alias to concrete version
    model_version = client.get_model_version_by_alias(model_name, alias).version
    print(f"Deploying {model_name} version {model_version} (alias: {alias})")

    endpoint_name = 'churn-prediction-endpoint'
    served_model  = ServedModelInput(
        model_name=model_name,
        model_version=model_version,
        workload_size='Small',
        scale_to_zero_enabled=True,
    )

    # Check for the endpoint explicitly so that a failed update (permissions,
    # bad config) surfaces as an error instead of silently triggering a create.
    try:
        w.serving_endpoints.get(name=endpoint_name)
        endpoint_exists = True
    except NotFound:
        endpoint_exists = False

    if endpoint_exists:
        w.serving_endpoints.update_config(
            name=endpoint_name,
            served_models=[served_model],
        )
        print(f"Endpoint '{endpoint_name}' updated to version {model_version}.")
    else:
        w.serving_endpoints.create(
            name=endpoint_name,
            config=EndpointCoreConfigInput(served_models=[served_model]),
        )
        print(f"Endpoint '{endpoint_name}' created with version {model_version}.")


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-name', required=True)
    parser.add_argument('--alias',      required=True)
    args = parser.parse_args()
    update_serving_endpoint(args.model_name, args.alias)
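
Once the endpoint is live, a smoke test after deployment can confirm that it answers. Below is a minimal sketch that builds the scoring request with the standard library only. The host and token are placeholders, `build_scoring_request` is a hypothetical helper, and the record is truncated to two features for brevity (a real request needs every column in the model signature). The request is constructed but not sent.

```python
import json
import urllib.request
from typing import Dict, List


def build_scoring_request(host: str, endpoint_name: str, token: str,
                          records: List[Dict[str, float]]) -> urllib.request.Request:
    """Build a POST request for a Databricks Model Serving endpoint.

    Uses the `dataframe_records` input format: one JSON object per row,
    keyed by feature name.
    """
    url = f"{host.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder host, token, and feature values.
request = build_scoring_request(
    host="https://adb-xxxx.azuredatabricks.net",
    endpoint_name="churn-prediction-endpoint",
    token="<token>",
    records=[{"total_events": 120, "total_sessions": 14}],
)
print(request.full_url)
```

Sending it is then a call to `urllib.request.urlopen(request)`; a non-200 response right after deployment is a reasonable signal to fail the pipeline.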

Tool Comparison

Tool	Role in pipeline	Why not the alternative
Databricks Asset Bundles	IaC for jobs and clusters	Terraform Databricks provider (more verbose, no native notebook support), manual UI (not reproducible)
MLflow Registry aliases	Production / Staging promotion	Stage-based promotion (deprecated since MLflow 2.9), manual version tracking (error prone)
GitHub Actions	CI/CD orchestrator	Azure DevOps (works equally well, swap yml syntax), Jenkins (more ops overhead)
Metric gate in validation job	Automated quality check	Manual review (blocks velocity), no gate at all (risky)
Databricks Model Serving	Managed REST endpoint	AKS deployment (more control, much more ops), Azure ML endpoints (extra service dependency)
dbutils.jobs.taskValues	Pass run_id between tasks	Environment variables (not available cross-task in DABs), hardcoded run lookup (fragile)

Things to Watch in Production

Pin your Databricks Runtime version. Setting spark_version to 14.3.x-scala2.12 in your bundle keeps every training run on the same runtime release, with the same Spark version and bundled libraries. A floating version (latest) lets libraries drift silently between runs, which breaks reproducibility.
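
One way to enforce pinning is a unit test in the CI stage that rejects unpinned runtime strings. Here is a minimal sketch, assuming pinned runtimes follow the `14.3.x-scala2.12` naming pattern (optionally with a variant such as `-cpu-ml` in the middle). `is_pinned_runtime`, `unpinned_tasks`, and the regular expression are illustrative, and the task dictionaries stand in for the parsed `databricks.yml`.

```python
import re
from typing import Dict, List

# Matches e.g. "14.3.x-scala2.12" or "14.3.x-cpu-ml-scala2.12".
# Assumption: runtime strings follow this naming pattern.
PINNED_RUNTIME = re.compile(r"^\d+\.\d+\.x(-[a-z0-9]+)*-scala\d+\.\d+$")


def is_pinned_runtime(spark_version: str) -> bool:
    """Return True when the runtime string names a specific major.minor release."""
    return PINNED_RUNTIME.match(spark_version) is not None


def unpinned_tasks(tasks: List[Dict]) -> List[str]:
    """Return the task keys whose cluster does not use a pinned runtime."""
    return [
        task["task_key"]
        for task in tasks
        if not is_pinned_runtime(task["new_cluster"]["spark_version"])
    ]


# Illustrative task definitions, shaped like the tasks in databricks.yml.
tasks = [
    {"task_key": "feature_engineering", "new_cluster": {"spark_version": "14.3.x-scala2.12"}},
    {"task_key": "train_and_register", "new_cluster": {"spark_version": "latest"}},
]
offenders = unpinned_tasks(tasks)
print(offenders)
```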

Store secrets in Azure Key Vault, not GitHub Secrets alone. GitHub Secrets work fine for CI tokens but for long-lived service principal credentials that Databricks jobs use at runtime, back them with Azure Key Vault and reference them via Databricks secret scopes.

Set a metric baseline from your current production model. Your thresholds (ROC-AUC >= 0.80) should be relative to what the current Production model achieves on the same holdout set, not an arbitrary number. Add a step in the validation job that fetches the current Production model's metrics and gates the new model against those.
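
A relative gate of this kind could look like the sketch below. It assumes both metric dictionaries were computed on the same holdout set; `gate_against_baseline`, the `tolerance` parameter, and all values are hypothetical. In the validation notebook the baseline could come from the run behind the current Production alias.

```python
from typing import Dict, List


def gate_against_baseline(candidate: Dict[str, float], baseline: Dict[str, float],
                          tolerance: float = 0.0) -> List[str]:
    """Return the metrics on which the candidate regresses against the baseline.

    A metric counts as a regression when the candidate is missing it or scores
    more than `tolerance` below the baseline value.
    """
    regressions = []
    for name, baseline_value in baseline.items():
        candidate_value = candidate.get(name)
        if candidate_value is None or candidate_value < baseline_value - tolerance:
            regressions.append(name)
    return regressions


# Illustrative numbers only.
production_metrics = {"roc_auc": 0.84, "f1_score": 0.75}
candidate_metrics = {"roc_auc": 0.86, "f1_score": 0.74}
regressions = gate_against_baseline(candidate_metrics, production_metrics, tolerance=0.005)
print(regressions)
```

A small tolerance keeps noise on the holdout set from blocking a model that is effectively equivalent; how large it should be depends on the size of that set.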

Tag every MLflow run with git SHA. Logging git_sha as a param in every training run means you can always trace a model artifact back to the exact code version that produced it. Critical for incident response.
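
On the CI side, the SHA has to come from somewhere before it can be passed to the job. Here is a minimal sketch that prefers the `GITHUB_SHA` variable GitHub Actions sets and falls back to asking git directly; `resolve_git_sha` is a hypothetical helper and the `"unknown"` default is an arbitrary choice.

```python
import os
import subprocess


def resolve_git_sha(default: str = "unknown") -> str:
    """Return the commit SHA for the current build.

    Prefers the GITHUB_SHA environment variable, then `git rev-parse HEAD`,
    then `default` when neither is available.
    """
    sha = os.environ.get("GITHUB_SHA")
    if sha:
        return sha
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or the working directory is not a repository
        return default


print(resolve_git_sha())
```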

Scale to zero on serving endpoints. For non-latency-critical models, set scale_to_zero_enabled=True on your serving endpoint. Endpoints that don't receive traffic 24/7 then stop consuming compute while idle, at the price of a cold-start delay on the first request after scaling down.

Wrapping Up

The pattern here is straightforward: code change triggers CI, CI triggers a training job, training job registers a model, a validation job gates on metrics, and only then does the model get promoted and deployed. Nothing manual, nothing skipped.

What makes this production-grade rather than just automated is the combination of Delta versioning for feature reproducibility, MLflow aliases for clean promotion semantics, and metric-gated promotion so a worse model can never silently replace a better one.
