Most ML teams are great at training models. Very few are great at shipping them. The gap between a notebook that works and a model that reliably serves production traffic is where most ML projects stall.
In this tutorial I'll walk through building a proper CI/CD pipeline for ML models on Azure Databricks using:
- MLflow for experiment tracking, model versioning, and the model registry
- Databricks Asset Bundles (DABs) for infrastructure-as-code and job deployment
- Azure DevOps / GitHub Actions as the CI/CD orchestrator
- Delta Lake as the feature and validation data store
- Databricks Model Serving for the production endpoint
The use case is the same churn prediction model from the previous post, but this time we're focusing entirely on how it gets from a notebook to a production endpoint reliably and repeatably.
Architecture Overview
CI/CD Flow
Pipeline Stage Breakdown
| Stage | Tool | What happens | Gate to next stage |
|---|---|---|---|
| CI | GitHub Actions | Lint, unit tests, bundle validate | All tests green |
| Train | Databricks Job | Full training run, MLflow logging | Job exits 0 |
| Register | MLflow Registry | New model version registered | Auto on train success |
| Validate | Databricks Job | Metric thresholds checked on holdout set | ROC-AUC >= 0.80 |
| Promote | MLflow Registry | Production alias set on the validated version | Validation passes |
| Deploy | Model Serving | Endpoint updated to new model version | Promotion complete |
| Monitor | Databricks Lakehouse Monitoring | Drift and accuracy tracked post-deploy | Ongoing |
Step 1 — Project Structure with Databricks Asset Bundles
Databricks Asset Bundles (DABs) let you define your jobs, clusters, and pipelines as code and deploy them via CLI. This is your IaC layer.
# databricks.yml
bundle:
  name: churn-prediction

variables:
  git_sha:
    description: Commit SHA supplied by CI and logged with each training run
    default: unknown

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-xxxx.azuredatabricks.net
  staging:
    workspace:
      host: https://adb-xxxx.azuredatabricks.net
  production:
    mode: production
    workspace:
      host: https://adb-xxxx.azuredatabricks.net

resources:
  jobs:
    training_job:
      # ${bundle.target} resolves to dev, staging, or production at deploy time
      name: churn-training-${bundle.target}
      tasks:
        - task_key: feature_engineering
          notebook_task:
            notebook_path: ./notebooks/01_feature_engineering.py
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 4
        - task_key: train_and_register
          depends_on:
            - task_key: feature_engineering
          notebook_task:
            notebook_path: ./notebooks/02_train_and_register.py
            # Exposed to the notebook as the git_sha widget
            base_parameters:
              git_sha: ${var.git_sha}
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 2
    validation_job:
      name: churn-validation-${bundle.target}
      tasks:
        - task_key: validate_model
          notebook_task:
            notebook_path: ./notebooks/03_validate_and_promote.py
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 2
Step 2 — Training Job with MLflow Autologging
Keep the training notebook clean and focused. Let MLflow autologging handle the heavy lifting on metric and param capture.
# notebooks/02_train_and_register.py
# `spark` and `dbutils` are provided by the Databricks notebook runtime.
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score
from delta.tables import DeltaTable

mlflow.set_experiment('/churn-prediction/ci-cd-pipeline')
mlflow.sklearn.autolog(log_input_examples=True, log_model_signatures=True)

# Read Gold features and capture Delta version for reproducibility
gold_table = DeltaTable.forName(spark, 'churn.gold.features')
delta_version = gold_table.history(1).select('version').collect()[0][0]

features_pdf = spark.table('churn.gold.features') \
    .filter("feature_date = current_date()") \
    .toPandas()

FEATURE_COLS = [
    'total_events', 'total_sessions', 'distinct_products',
    'events_last_30d', 'events_last_90d', 'days_since_last_event',
    'customer_tenure_days', 'avg_events_per_day',
    'recency_tier', 'engagement_score',
]
TARGET = 'churned'

X = features_pdf[FEATURE_COLS]
y = features_pdf[TARGET]

# Stratify so the churn rate is the same in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

with mlflow.start_run() as run:
    params = {
        'n_estimators': 200,
        'max_depth': 5,
        'learning_rate': 0.05,
        'subsample': 0.8,
    }
    model = GradientBoostingClassifier(**params, random_state=42)
    model.fit(X_train, y_train)

    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = model.predict(X_test)

    # Lineage: which data version and which commit produced this model
    mlflow.log_param('delta_feature_version', delta_version)
    mlflow.log_param('git_sha', dbutils.widgets.get('git_sha'))  # passed in by CI

    # Test-split metrics that the validation job gates on
    mlflow.log_metric('roc_auc', roc_auc_score(y_test, y_prob))
    mlflow.log_metric('f1_score', f1_score(y_test, y_pred))
    mlflow.log_metric('precision', precision_score(y_test, y_pred))
    mlflow.log_metric('recall', recall_score(y_test, y_pred))

    signature = infer_signature(X_train, y_pred)
    mlflow.sklearn.log_model(
        model,
        artifact_path='churn-gbm',
        signature=signature,
        registered_model_name='churn-prediction-gbm',
        await_registration_for=300,
    )

# Pass run ID to downstream tasks via job output
dbutils.jobs.taskValues.set(key='run_id', value=run.info.run_id)
print(f"Training complete. Run ID: {run.info.run_id}")
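The notebook records the latest Delta version and then reads the table in a separate call, so a write landing between the two could leave the logged version out of step with the data the model actually saw. A minimal sketch of one way to close that gap, assuming Delta's `VERSION AS OF` SQL syntax, is to read the table at the recorded version. The helper name `features_query` is hypothetical and not part of the original notebook.

```python
def features_query(
    table: str,
    version: int,
    row_filter: str = "feature_date = current_date()",
) -> str:
    """Build a query that reads `table` pinned to one Delta version.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if version < 0:
        raise ValueError("Delta table versions are non-negative integers")
    return (
        f"SELECT * FROM {table} VERSION AS OF {int(version)} "
        f"WHERE {row_filter}"
    )


# 42 stands in for the delta_version captured from the table history.
query = features_query("churn.gold.features", 42)
print(query)
```

In the notebook this would be used as `spark.sql(features_query('churn.gold.features', delta_version)).toPandas()`, so the version logged to MLflow is the version that was trained on.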
Step 3 — Validation Job: Gate on Metrics Before Promoting
Never auto-promote based on a successful training run alone. Always validate on a held-out or recent dataset and check against a threshold before touching the Production alias.
# notebooks/03_validate_and_promote.py
# `dbutils` is provided by the Databricks notebook runtime.
from mlflow import MlflowClient

client = MlflowClient()
MODEL_NAME = 'churn-prediction-gbm'

# Pick up run_id from the upstream training task.
# Task values are scoped to a single job run, so this lookup assumes
# validate_model runs in the same job run as train_and_register.
run_id = dbutils.jobs.taskValues.get(
    taskKey='train_and_register', key='run_id'
)

# Thresholds: fail the job if any metric is below these
THRESHOLDS = {
    'roc_auc': 0.80,
    'f1_score': 0.72,
    'precision': 0.70,
}

run = client.get_run(run_id)
metrics = run.data.metrics

print("Validating metrics against thresholds...")
failures = []
for metric, threshold in THRESHOLDS.items():
    # A metric that was never logged counts as 0 and therefore fails
    actual = metrics.get(metric, 0)
    status = 'PASS' if actual >= threshold else 'FAIL'
    print(f"  {metric}: {actual:.4f} (threshold: {threshold}) -> {status}")
    if actual < threshold:
        failures.append(f"{metric}={actual:.4f} below threshold {threshold}")

if failures:
    # Raising fails the task, which stops the pipeline before promotion
    raise Exception(f"Validation failed: {', '.join(failures)}")

# Promote to Production alias if all thresholds pass
model_version = client.search_model_versions(
    filter_string=f"run_id='{run_id}'"
)[0].version

client.set_registered_model_alias(
    name=MODEL_NAME,
    alias='Production',
    version=model_version,
)
print(f"Model version {model_version} promoted to Production alias.")
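Because the gate decides what reaches production, it helps to have the comparison logic covered by the unit tests that run in CI. The sketch below pulls the threshold check into a pure function with no MLflow or Databricks dependency. The function name `find_threshold_failures` and the metric values are made up for illustration, and unlike the notebook it reports a missing metric explicitly instead of treating it as 0.

```python
from typing import Dict, List


def find_threshold_failures(
    metrics: Dict[str, float],
    thresholds: Dict[str, float],
) -> List[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in thresholds.items():
        actual = metrics.get(name)
        if actual is None:
            failures.append(f"{name} missing from run metrics")
        elif actual < minimum:
            failures.append(f"{name}={actual:.4f} below threshold {minimum}")
    return failures


thresholds = {"roc_auc": 0.80, "f1_score": 0.72, "precision": 0.70}

# Illustrative values only, not results from a real run.
passing_run = {"roc_auc": 0.86, "f1_score": 0.75, "precision": 0.78}
failing_run = {"roc_auc": 0.79, "f1_score": 0.75}

print(find_threshold_failures(passing_run, thresholds))
print(find_threshold_failures(failing_run, thresholds))
```

The notebook could then call this function on `run.data.metrics` and raise if the returned list is non-empty, while `tests/unit/` exercises the same function with hand-written dictionaries.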
Step 4 — GitHub Actions CI/CD Workflow
This is the glue that ties everything together. One job handles PR validation; the other handles training and deployment on merge to main.
# .github/workflows/mlops.yml
name: MLOps CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

jobs:
  ci:
    name: Lint and Test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install databricks-sdk pytest flake8 mlflow
      # Bundle commands need the current Databricks CLI; the legacy
      # databricks-cli pip package does not provide them.
      - name: Install Databricks CLI
        uses: databricks/setup-cli@main
      - name: Lint
        run: flake8 notebooks/ src/ --max-line-length=120
      - name: Unit tests
        run: pytest tests/unit/ -v
      - name: Validate DAB bundle
        run: databricks bundle validate --target staging

  train-and-deploy:
    name: Train, Validate, Deploy
    runs-on: ubuntu-latest
    needs: ci
    # Skipped on pull requests; runs only for pushes to main
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Install Databricks CLI
        uses: databricks/setup-cli@main
      - name: Install Python dependencies
        run: pip install databricks-sdk mlflow
      - name: Deploy bundle to staging
        # Bundle variables are resolved at deploy time, so the SHA is set here
        run: databricks bundle deploy --target staging --var "git_sha=${{ github.sha }}"
      - name: Run training job
        run: |
          JOB_RUN_ID=$(databricks bundle run training_job \
            --target staging \
            --var "git_sha=${{ github.sha }}" \
            --output json | jq -r '.run_id')
          echo "TRAINING_RUN_ID=$JOB_RUN_ID" >> "$GITHUB_ENV"
      - name: Run validation job
        run: |
          databricks bundle run validation_job --target staging
          echo "Validation passed. Promoting to production."
      - name: Deploy bundle to production
        run: databricks bundle deploy --target production
      - name: Update Model Serving endpoint
        run: |
          python scripts/update_serving_endpoint.py \
            --model-name churn-prediction-gbm \
            --alias Production
Step 5 — Update the Serving Endpoint
The final step in the pipeline updates the Databricks Model Serving endpoint to point at the newly promoted Production model version.
# scripts/update_serving_endpoint.py
import argparse

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import NotFound
from databricks.sdk.service.serving import ServedModelInput, EndpointCoreConfigInput
from mlflow import MlflowClient


def update_serving_endpoint(model_name: str, alias: str):
    # Both clients read DATABRICKS_HOST / DATABRICKS_TOKEN from the environment
    w = WorkspaceClient()
    client = MlflowClient()

    # Resolve alias to concrete version
    model_version = client.get_model_version_by_alias(model_name, alias).version
    print(f"Deploying {model_name} version {model_version} (alias: {alias})")

    endpoint_name = 'churn-prediction-endpoint'
    served_model = ServedModelInput(
        model_name=model_name,
        model_version=model_version,
        workload_size='Small',
        scale_to_zero_enabled=True,
    )

    try:
        # Update existing endpoint
        w.serving_endpoints.update_config(
            name=endpoint_name,
            served_models=[served_model],
        )
        print(f"Endpoint '{endpoint_name}' updated to version {model_version}.")
    except NotFound:
        # Create only when the endpoint is missing; other errors
        # (permissions, invalid config) should fail the pipeline.
        w.serving_endpoints.create(
            name=endpoint_name,
            config=EndpointCoreConfigInput(served_models=[served_model]),
        )
        print(f"Endpoint '{endpoint_name}' created with version {model_version}.")


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-name', required=True)
    parser.add_argument('--alias', required=True)
    args = parser.parse_args()
    update_serving_endpoint(args.model_name, args.alias)
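Once the endpoint points at the new version, a quick smoke test confirms it actually scores requests. The sketch below builds a request body in the `dataframe_split` format and posts it to the endpoint's `/invocations` path. The helper names, the feature values, and the choice of `urllib` are assumptions for illustration; the column list mirrors `FEATURE_COLS` from the training notebook.

```python
import json
import urllib.request
from typing import Any, Dict, Sequence

FEATURE_COLS = [
    "total_events", "total_sessions", "distinct_products",
    "events_last_30d", "events_last_90d", "days_since_last_event",
    "customer_tenure_days", "avg_events_per_day",
    "recency_tier", "engagement_score",
]


def build_scoring_payload(
    rows: Sequence[Dict[str, Any]],
    columns: Sequence[str],
) -> Dict[str, Any]:
    """Convert row dictionaries into a dataframe_split request body."""
    missing = sorted({c for row in rows for c in columns if c not in row})
    if missing:
        raise KeyError(f"rows are missing feature columns: {missing}")
    return {
        "dataframe_split": {
            "columns": list(columns),
            "data": [[row[c] for c in columns] for row in rows],
        }
    }


def score(host: str, token: str, endpoint_name: str, payload: Dict[str, Any]) -> Any:
    """POST the payload to the serving endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        url=f"{host}/serving-endpoints/{endpoint_name}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())


# Made-up feature values, for illustration only.
example_row = dict(zip(FEATURE_COLS, [120, 14, 6, 9, 31, 4, 410, 0.29, 2, 0.63]))
payload = build_scoring_payload([example_row], FEATURE_COLS)
print(json.dumps(payload, indent=2))
```

A pipeline step could call `score(...)` with the same host and token the workflow already exports and fail the run if the request errors, which catches a broken deployment before real traffic does.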
Tool Comparison
| Tool | Role in pipeline | Why not the alternative |
|---|---|---|
| Databricks Asset Bundles | IaC for jobs and clusters | Terraform Databricks provider (more verbose, no native notebook support), manual UI (not reproducible) |
| MLflow Registry aliases | Production / Staging promotion | Stage-based promotion (deprecated since MLflow 2.9), manual version tracking (error prone) |
| GitHub Actions | CI/CD orchestrator | Azure DevOps (works equally well, swap yml syntax), Jenkins (more ops overhead) |
| Metric gate in validation job | Automated quality check | Manual review (blocks velocity), no gate at all (risky) |
| Databricks Model Serving | Managed REST endpoint | AKS deployment (more control, much more ops), Azure ML endpoints (extra service dependency) |
| dbutils.jobs.taskValues | Pass run_id between tasks | Environment variables (not available cross-task in DABs), hardcoded run lookup (fragile) |
Things to Watch in Production
Pin your Databricks Runtime version. Using 14.3.x-scala2.12 in your bundle ensures every training run uses the same Spark and library versions. Floating versions (latest) cause silent library drift that breaks reproducibility.
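A pinning rule like this is easy to enforce in the CI unit tests. The sketch below walks a bundle definition that has already been parsed into a dictionary (for example with a YAML parser) and flags any `spark_version` that does not look like a pinned runtime string. The regular expression and the helper name `unpinned_runtimes` are assumptions, and `auto:latest-lts` is used only as an example of a floating value.

```python
import re
from typing import Any, Dict, List

# Matches strings such as 14.3.x-scala2.12 or 14.3.x-cpu-ml-scala2.12.
# Assumed pattern for illustration; adjust it to the runtimes you allow.
PINNED_RUNTIME = re.compile(r"^\d+\.\d+\.x-(?:[a-z]+-)*scala\d+\.\d+$")


def unpinned_runtimes(bundle: Dict[str, Any]) -> List[str]:
    """List job tasks whose new_cluster.spark_version is not pinned."""
    problems = []
    jobs = bundle.get("resources", {}).get("jobs", {})
    for job_key, job in jobs.items():
        for task in job.get("tasks", []):
            version = task.get("new_cluster", {}).get("spark_version")
            if version is not None and not PINNED_RUNTIME.match(version):
                problems.append(f"{job_key}.{task['task_key']}: {version}")
    return problems


# Trimmed-down stand-in for a parsed databricks.yml.
bundle = {
    "resources": {
        "jobs": {
            "training_job": {
                "tasks": [
                    {"task_key": "feature_engineering",
                     "new_cluster": {"spark_version": "14.3.x-scala2.12"}},
                    {"task_key": "train_and_register",
                     "new_cluster": {"spark_version": "auto:latest-lts"}},
                ]
            }
        }
    }
}

print(unpinned_runtimes(bundle))
```

A unit test can assert that the returned list is empty for the real bundle, so a floating version fails the pull request instead of surfacing later as library drift.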
Store secrets in Azure Key Vault, not GitHub Secrets alone. GitHub Secrets work fine for CI tokens, but long-lived service principal credentials that Databricks jobs use at runtime should live in Azure Key Vault and be referenced through Databricks secret scopes.
Set a metric baseline from your current production model. Your thresholds (ROC-AUC >= 0.80) should be relative to what the current Production model achieves on the same holdout set, not an arbitrary number. Add a step in the validation job that fetches the current Production model's metrics and gates the new model against those.
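A minimal sketch of such a baseline gate follows, assuming the current Production model's metrics can be read from the MLflow run behind the `Production` alias. The function names and the `tolerance` parameter are hypothetical. Note that comparing logged metrics only approximates the idea: each run scored its own test split, so rescoring both models on one shared holdout set is the stricter version.

```python
from typing import Dict, List


def regressions_vs_baseline(
    candidate: Dict[str, float],
    baseline: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return one message per metric where the candidate trails the baseline.

    `tolerance` is the drop that is still accepted, in metric units.
    """
    regressions = []
    for name, base_value in baseline.items():
        value = candidate.get(name)
        if value is None:
            regressions.append(f"{name} missing from candidate metrics")
        elif value < base_value - tolerance:
            regressions.append(
                f"{name}: candidate {value:.4f} < production {base_value:.4f}"
            )
    return regressions


def fetch_alias_metrics(model_name: str, alias: str = "Production") -> Dict[str, float]:
    """Read the metrics logged by the run behind a registry alias."""
    # Imported lazily so the comparison above has no MLflow dependency.
    from mlflow import MlflowClient

    client = MlflowClient()
    version = client.get_model_version_by_alias(model_name, alias)
    return dict(client.get_run(version.run_id).data.metrics)


# Illustrative numbers only, not measurements.
production = {"roc_auc": 0.85, "f1_score": 0.74}
candidate = {"roc_auc": 0.84, "f1_score": 0.76}

print(regressions_vs_baseline(candidate, production))
print(regressions_vs_baseline(candidate, production, tolerance=0.02))
```

In the validation notebook, `fetch_alias_metrics('churn-prediction-gbm')` would supply the baseline, with the fixed thresholds kept as an absolute floor for the first deployment, when no Production alias exists yet.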
Tag every MLflow run with git SHA. Logging git_sha as a param in every training run means you can always trace a model artifact back to the exact code version that produced it. Critical for incident response.
Scale to zero on serving endpoints. For non-latency-critical models, enable scale_to_zero_enabled=True on your serving endpoint. It cuts cost dramatically for endpoints that don't receive traffic 24/7.
Wrapping Up
The pattern here is straightforward: code change triggers CI, CI triggers a training job, training job registers a model, a validation job gates on metrics, and only then does the model get promoted and deployed. Nothing manual, nothing skipped.
What makes this production-grade rather than just automated is the combination of Delta versioning for feature reproducibility, MLflow aliases for clean promotion semantics, and metric-gated promotion so a worse model can never silently replace a better one.

