Jubin Soni

MLOps: Building a CI/CD Pipeline for ML Models on Azure Databricks

Most ML teams are great at training models. Very few are great at shipping them. The gap between a notebook that works and a model that reliably serves production traffic is where most ML projects stall.

In this tutorial I'll walk through building a proper CI/CD pipeline for ML models on Azure Databricks using:

  • MLflow for experiment tracking, model versioning, and the model registry
  • Databricks Asset Bundles (DABs) for infrastructure-as-code and job deployment
  • Azure DevOps / GitHub Actions as the CI/CD orchestrator
  • Delta Lake as the feature and validation data store
  • Databricks Model Serving for the production endpoint

The use case is the same churn prediction model from the previous post, but this time we're focusing entirely on how it gets from a notebook to a production endpoint reliably and repeatably.


Architecture Overview

[Figure: architecture overview diagram]


CI/CD Flow

[Figure: CI/CD flow diagram]


Pipeline Stage Breakdown

| Stage | Tool | What happens | Gate to next stage |
| --- | --- | --- | --- |
| CI | GitHub Actions | Lint, unit tests, bundle validate | All tests green |
| Train | Databricks Job | Full training run, MLflow logging | Job exits 0 |
| Register | MLflow Registry | Model versioned and moved to Staging | Auto on train success |
| Validate | Databricks Job | Metric thresholds checked on holdout set | ROC-AUC >= 0.80 |
| Promote | MLflow Registry | Model moved to Production | Validation passes |
| Deploy | Model Serving | Endpoint updated to new model version | Promotion complete |
| Monitor | Databricks Lakehouse Monitoring | Drift and accuracy tracked post-deploy | Ongoing |

Step 1 — Project Structure with Databricks Asset Bundles

Databricks Asset Bundles (DABs) let you define your jobs, clusters, and pipelines as code and deploy them via CLI. This is your IaC layer.

# databricks.yml
bundle:
  name: churn-prediction

variables:
  env:
    description: Suffix for job names, overridden per target
    default: dev
  git_sha:
    description: Commit SHA supplied by CI and logged with each training run
    default: local

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-xxxx.azuredatabricks.net

  staging:
    variables:
      env: staging
    workspace:
      host: https://adb-xxxx.azuredatabricks.net

  production:
    mode: production
    variables:
      env: production
    workspace:
      host: https://adb-xxxx.azuredatabricks.net

resources:
  jobs:
    training_job:
      name: churn-training-${var.env}
      tasks:
        - task_key: feature_engineering
          notebook_task:
            notebook_path: ./notebooks/01_feature_engineering.py
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 4

        - task_key: train_and_register
          depends_on:
            - task_key: feature_engineering
          notebook_task:
            notebook_path: ./notebooks/02_train_and_register.py
            # Exposed to the notebook as a widget, read with dbutils.widgets.get('git_sha')
            base_parameters:
              git_sha: ${var.git_sha}
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 2

    validation_job:
      name: churn-validation-${var.env}
      tasks:
        - task_key: validate_model
          notebook_task:
            notebook_path: ./notebooks/03_validate_and_promote.py
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 2

Step 2 — Training Job with MLflow Autologging

Keep the training notebook clean and focused. Let MLflow autologging handle the heavy lifting on metric and param capture.

# notebooks/02_train_and_register.py
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score
from delta.tables import DeltaTable
import pandas as pd

mlflow.set_experiment('/churn-prediction/ci-cd-pipeline')
# The model is logged and registered explicitly below, so skip autolog's own copy
mlflow.sklearn.autolog(log_models=False)

# Read Gold features and capture Delta version for reproducibility
gold_table    = DeltaTable.forName(spark, 'churn.gold.features')
delta_version = gold_table.history(1).select('version').collect()[0][0]

features_pdf = spark.table('churn.gold.features') \
    .filter("feature_date = current_date()") \
    .toPandas()

FEATURE_COLS = [
    'total_events', 'total_sessions', 'distinct_products',
    'events_last_30d', 'events_last_90d', 'days_since_last_event',
    'customer_tenure_days', 'avg_events_per_day',
    'recency_tier', 'engagement_score',
]
TARGET = 'churned'

X = features_pdf[FEATURE_COLS]
y = features_pdf[TARGET]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

with mlflow.start_run() as run:
    params = {
        'n_estimators':  200,
        'max_depth':     5,
        'learning_rate': 0.05,
        'subsample':     0.8,
    }
    model = GradientBoostingClassifier(**params, random_state=42)
    model.fit(X_train, y_train)

    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = model.predict(X_test)

    mlflow.log_param('delta_feature_version', delta_version)
    mlflow.log_param('git_sha', dbutils.widgets.get('git_sha'))  # passed in by CI
    mlflow.log_metric('roc_auc',   roc_auc_score(y_test, y_prob))
    mlflow.log_metric('f1_score',  f1_score(y_test, y_pred))
    mlflow.log_metric('precision', precision_score(y_test, y_pred))
    mlflow.log_metric('recall',    recall_score(y_test, y_pred))

    # Infer the signature from inputs and the predictions made on those same inputs
    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(
        model,
        artifact_path='churn-gbm',
        signature=signature,
        registered_model_name='churn-prediction-gbm',
        await_registration_for=300,
    )

    # Pass run ID to downstream tasks via job output
    dbutils.jobs.taskValues.set(key='run_id', value=run.info.run_id)
    print(f"Training complete. Run ID: {run.info.run_id}")

Step 3 — Validation Job: Gate on Metrics Before Promoting

Never auto-promote based on a successful training run alone. Always validate on a held-out or recent dataset and check against a threshold before touching the Production alias.

# notebooks/03_validate_and_promote.py
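import mlflow
from mlflow import MlflowClient

client     = MlflowClient()
MODEL_NAME = 'churn-prediction-gbm'

# Task values are scoped to a single job run, so this lookup only resolves when
# this notebook runs as a task in the same job as train_and_register. When it runs
# as the separate validation_job, fall back to the newest registered version's run.
try:
    run_id = dbutils.jobs.taskValues.get(
        taskKey='train_and_register', key='run_id'
    )
except Exception:
    newest = max(
        client.search_model_versions(f"name='{MODEL_NAME}'"),
        key=lambda mv: int(mv.version),
    )
    run_id = newest.run_id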

# Thresholds — fail the job if any metric is below these
THRESHOLDS = {
    'roc_auc':   0.80,
    'f1_score':  0.72,
    'precision': 0.70,
}

run     = client.get_run(run_id)
metrics = run.data.metrics

print("Validating metrics against thresholds...")
failures = []
for metric, threshold in THRESHOLDS.items():
    actual = metrics.get(metric, 0)
    status = 'PASS' if actual >= threshold else 'FAIL'
    print(f"  {metric}: {actual:.4f} (threshold: {threshold}) -> {status}")
    if actual < threshold:
        failures.append(f"{metric}={actual:.4f} below threshold {threshold}")

if failures:
    raise Exception(f"Validation failed: {', '.join(failures)}")

# Promote to Production alias if all thresholds pass
model_version = client.search_model_versions(
    filter_string=f"run_id='{run_id}'"
)[0].version

client.set_registered_model_alias(
    name=MODEL_NAME,
    alias='Production',
    version=model_version,
)

print(f"Model version {model_version} promoted to Production alias.")
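The threshold loop above lives in a notebook, which makes it awkward to cover in the unit-test stage of the pipeline. One option is to move the comparison into a plain function that the notebook imports. This is a minimal sketch rather than part of the original pipeline: `check_thresholds` and the `src/validation.py` location are assumed names, and unlike the notebook it treats a missing metric as its own failure instead of defaulting it to 0.

```python
# src/validation.py (hypothetical module; name and location are assumptions)
from typing import Dict, List


def check_thresholds(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        actual = metrics.get(name)
        if actual is None:
            failures.append(f"{name} was not logged")
        elif actual < minimum:
            failures.append(f"{name}={actual:.4f} below threshold {minimum}")
    return failures


# Same thresholds as the validation notebook
THRESHOLDS = {'roc_auc': 0.80, 'f1_score': 0.72, 'precision': 0.70}

good = check_thresholds({'roc_auc': 0.86, 'f1_score': 0.75, 'precision': 0.74}, THRESHOLDS)
bad = check_thresholds({'roc_auc': 0.78, 'f1_score': 0.75}, THRESHOLDS)
print(good)
print(bad)
```

In the notebook, the loop would then shrink to a call to `check_thresholds(run.data.metrics, THRESHOLDS)` followed by a raise if the returned list is non-empty.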

Step 4 — GitHub Actions CI/CD Workflow

This is the glue that ties everything together. A single workflow defines two jobs: one handles PR validation; the other handles training, validation, and deployment on merge to main.

# .github/workflows/mlops.yml
name: MLOps CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  DATABRICKS_HOST:  ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

jobs:
  ci:
    name: Lint and Test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

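      - name: Install Python dependencies
        run: pip install databricks-sdk pytest flake8 mlflow

      # `databricks bundle` ships with the current Databricks CLI, not the legacy
      # `databricks-cli` pip package, so install it with the official setup action.
      - name: Install Databricks CLI
        uses: databricks/setup-cli@main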

      - name: Lint
        run: flake8 notebooks/ src/ --max-line-length=120

      - name: Unit tests
        run: pytest tests/unit/ -v

      - name: Validate DAB bundle
        run: databricks bundle validate --target staging

  train-and-deploy:
    name: Train, Validate, Deploy
    runs-on: ubuntu-latest
    needs: ci
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

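      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      # The legacy `databricks-cli` pip package has no `bundle` command.
      - name: Install Databricks CLI
        uses: databricks/setup-cli@main

      # update_serving_endpoint.py imports both the SDK and MLflow.
      - name: Install Python dependencies
        run: pip install databricks-sdk mlflow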

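      # Bundle variables are substituted into the job definition at deploy time, so the
      # commit SHA is passed here too. This assumes git_sha is declared as a bundle variable.
      - name: Deploy bundle to staging
        run: databricks bundle deploy --target staging --var "git_sha=${{ github.sha }}"

      - name: Run training job
        run: |
          JOB_RUN_ID=$(databricks bundle run training_job \
            --target staging \
            --var "git_sha=${{ github.sha }}" \
            --output json | jq -r '.run_id')
          echo "TRAINING_RUN_ID=$JOB_RUN_ID" >> $GITHUB_ENV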

      - name: Run validation job
        run: |
          databricks bundle run validation_job --target staging
          echo "Validation passed. Promoting to production."

      - name: Deploy bundle to production
        run: databricks bundle deploy --target production

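      - name: Update Model Serving endpoint
        env:
          # Resolve the alias against the Databricks-hosted registry, not a local store
          MLFLOW_TRACKING_URI: databricks
        run: |
          python scripts/update_serving_endpoint.py \
            --model-name churn-prediction-gbm \
            --alias Production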

Step 5 — Update the Serving Endpoint

The final step in the pipeline updates the Databricks Model Serving endpoint to point at the newly promoted Production model version.

# scripts/update_serving_endpoint.py
import argparse
from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound
from databricks.sdk.service.serving import ServedModelInput, EndpointCoreConfigInput
from mlflow import MlflowClient

def update_serving_endpoint(model_name: str, alias: str):
    w      = WorkspaceClient()
    client = MlflowClient()

    # Resolve alias to concrete version
    model_version = client.get_model_version_by_alias(model_name, alias).version
    print(f"Deploying {model_name} version {model_version} (alias: {alias})")

    endpoint_name = 'churn-prediction-endpoint'
    served_model  = ServedModelInput(
        model_name=model_name,
        model_version=model_version,
        workload_size='Small',
        scale_to_zero_enabled=True,
    )

    # Check for the endpoint explicitly so that a failed update (bad config,
    # missing permissions) surfaces as an error instead of triggering a create.
    try:
        w.serving_endpoints.get(name=endpoint_name)
        endpoint_exists = True
    except NotFound:
        endpoint_exists = False

    if endpoint_exists:
        # Update existing endpoint
        w.serving_endpoints.update_config(
            name=endpoint_name,
            served_models=[served_model],
        )
        print(f"Endpoint '{endpoint_name}' updated to version {model_version}.")
    else:
        # Create if it doesn't exist yet
        w.serving_endpoints.create(
            name=endpoint_name,
            config=EndpointCoreConfigInput(served_models=[served_model]),
        )
        print(f"Endpoint '{endpoint_name}' created with version {model_version}.")


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-name', required=True)
    parser.add_argument('--alias',      required=True)
    args = parser.parse_args()
    update_serving_endpoint(args.model_name, args.alias)
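Once the endpoint points at the new version, a quick smoke test helps confirm it actually answers. The sketch below only builds the scoring request; it is not part of the original pipeline. `build_invocation_request` is a hypothetical helper, the token is a placeholder, and the `/serving-endpoints/<name>/invocations` path with a `dataframe_split` payload reflects my understanding of the Model Serving scoring API, so check it against the current Databricks docs before relying on it.

```python
import json
import urllib.request
from typing import Dict, List


def build_invocation_request(host: str, token: str, endpoint_name: str,
                             rows: List[Dict[str, float]]) -> urllib.request.Request:
    """Build (but do not send) a scoring request for a Model Serving endpoint."""
    columns = list(rows[0].keys())
    payload = {
        'dataframe_split': {
            'columns': columns,
            'data': [[row[c] for c in columns] for row in rows],
        }
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations",
        data=json.dumps(payload).encode('utf-8'),
        headers={'Authorization': f'Bearer {token}', 'Content-Type': 'application/json'},
        method='POST',
    )


# Placeholder host and token; two of the feature columns are shown for brevity
req = build_invocation_request(
    host='https://adb-xxxx.azuredatabricks.net/',
    token='dummy-token',
    endpoint_name='churn-prediction-endpoint',
    rows=[{'total_events': 120, 'days_since_last_event': 3}],
)
body = json.loads(req.data)
print(req.full_url)
print(body)
```

Sending it is a matter of `urllib.request.urlopen(req)` and checking for a 200 response. A real request would need all ten feature columns the model was trained on.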

Tool Comparison

| Tool | Role in pipeline | Why not the alternative |
| --- | --- | --- |
| Databricks Asset Bundles | IaC for jobs and clusters | Terraform Databricks provider (more verbose, no native notebook support), manual UI (not reproducible) |
| MLflow Registry aliases | Production / Staging promotion | Stage-based promotion (deprecated in MLflow 2.x), manual version tracking (error prone) |
| GitHub Actions | CI/CD orchestrator | Azure DevOps (works equally well, swap yml syntax), Jenkins (more ops overhead) |
| Metric gate in validation job | Automated quality check | Manual review (blocks velocity), no gate at all (risky) |
| Databricks Model Serving | Managed REST endpoint | AKS deployment (more control, much more ops), Azure ML endpoints (extra service dependency) |
| dbutils.jobs.taskValues | Pass run_id between tasks | Environment variables (not available cross-task in DABs), hardcoded run lookup (fragile) |

Things to Watch in Production

Pin your Databricks Runtime version. Using 14.3.x-scala2.12 in your bundle ensures every training run uses the same Spark and library versions. Floating versions (latest) cause silent library drift that breaks reproducibility.
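If you want CI to enforce the pin, a small text check over `databricks.yml` is enough. This is a rough sketch under stated assumptions: `unpinned_runtimes` is a hypothetical helper, it uses a regex rather than a YAML parser, and `auto:latest-lts` appears in the sample only as an example of a value that is not an explicit version.

```python
import re
from typing import List

# Explicit runtime strings look like 14.3.x-scala2.12
PINNED = re.compile(r'^\d+\.\d+\.x-[\w.\-]+$')
SPARK_VERSION_LINE = re.compile(r'^\s*spark_version:\s*(\S+)', re.MULTILINE)


def unpinned_runtimes(bundle_yaml: str) -> List[str]:
    """Return every spark_version value that is not an explicit runtime string."""
    values = [v.strip('"\'') for v in SPARK_VERSION_LINE.findall(bundle_yaml)]
    return [v for v in values if not PINNED.match(v)]


sample = """
resources:
  jobs:
    training_job:
      tasks:
        - task_key: train
          new_cluster:
            spark_version: 14.3.x-scala2.12
        - task_key: validate
          new_cluster:
            spark_version: auto:latest-lts
"""

flagged = unpinned_runtimes(sample)
print(flagged)
```

A unit test can read the real bundle file and assert that the returned list is empty. Note that a value supplied through a `${var...}` substitution would also be flagged by this check.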

Store secrets in Azure Key Vault, not GitHub Secrets alone. GitHub Secrets work fine for CI tokens, but the long-lived service principal credentials that Databricks jobs use at runtime should live in Azure Key Vault and be referenced through Databricks secret scopes.

Set a metric baseline from your current production model. Your thresholds (ROC-AUC >= 0.80) should be relative to what the current Production model achieves on the same holdout set, not an arbitrary number. Add a step in the validation job that fetches the current Production model's metrics and gates the new model against those.
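Here is one way that baseline step could look. It is a sketch, not the original validation job: `metrics_for_alias`, `regressions`, and the 0.01 tolerance are assumptions, and `FakeClient` stands in for `MlflowClient` so the example runs without a tracking server. It compares metrics logged on each run, which is only a fair comparison when both models were scored on the same holdout set; re-scoring the current Production model on today's holdout is the stricter variant.

```python
from types import SimpleNamespace
from typing import Dict, List


def metrics_for_alias(client, model_name: str, alias: str) -> Dict[str, float]:
    """Look up the metrics logged by the run behind a registry alias."""
    version = client.get_model_version_by_alias(model_name, alias)
    return dict(client.get_run(version.run_id).data.metrics)


def regressions(candidate: Dict[str, float], baseline: Dict[str, float],
                tolerance: float = 0.01) -> List[str]:
    """List metrics where the candidate trails the baseline by more than `tolerance`."""
    return [
        f"{name}: {candidate.get(name, 0.0):.4f} vs baseline {value:.4f}"
        for name, value in baseline.items()
        if candidate.get(name, 0.0) < value - tolerance
    ]


class FakeClient:
    """Stand-in exposing the two MlflowClient methods the sketch relies on."""

    def get_model_version_by_alias(self, name, alias):
        return SimpleNamespace(run_id='prod-run')

    def get_run(self, run_id):
        metrics = {'roc_auc': 0.84, 'f1_score': 0.74}
        return SimpleNamespace(data=SimpleNamespace(metrics=metrics))


baseline = metrics_for_alias(FakeClient(), 'churn-prediction-gbm', 'Production')
ok = regressions({'roc_auc': 0.85, 'f1_score': 0.74}, baseline)
worse = regressions({'roc_auc': 0.81, 'f1_score': 0.74}, baseline)
print(ok)
print(worse)
```

In the validation notebook you would pass the real `MlflowClient()` and fail the job when the list is non-empty. On the very first deployment there is no Production alias yet, so the alias lookup will raise and needs a fallback to the absolute thresholds.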

Tag every MLflow run with the git SHA. Logging git_sha as a param in every training run means you can always trace a model artifact back to the exact code version that produced it, which is critical for incident response.

Scale to zero on serving endpoints. For non-latency-critical models, enable scale_to_zero_enabled=True on your serving endpoint. It cuts cost dramatically for endpoints that don't receive traffic 24/7.


Wrapping Up

The pattern here is straightforward: code change triggers CI, CI triggers a training job, training job registers a model, a validation job gates on metrics, and only then does the model get promoted and deployed. Nothing manual, nothing skipped.

What makes this production-grade rather than just automated is the combination of Delta versioning for feature reproducibility, MLflow aliases for clean promotion semantics, and metric-gated promotion so a worse model can never silently replace a better one.

