ADF-to-Databricks Integration Patterns
Datanest Digital — datanest.dev
Overview
This guide covers production patterns for orchestrating Databricks notebooks from Azure Data Factory. It addresses cluster strategy, parameter passing, error handling, medallion architecture orchestration, and operational considerations.
1. Linked Service Authentication Patterns
Managed Identity (Recommended)
Managed identity is the most secure approach — no tokens or credentials to manage.
{
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://adb-<id>.azuredatabricks.net",
"authentication": "MSI",
"workspaceResourceId": "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<ws>"
}
}
Requirements:
- ADF managed identity must have Contributor role on the Databricks workspace
- Inside the workspace, the managed identity also needs the usual data-plane permissions (cluster access, notebook paths), granted through Unity Catalog or legacy workspace ACLs
Access Token via Key Vault
For environments where MSI is not supported (e.g., cross-tenant):
{
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://adb-<id>.azuredatabricks.net",
"accessToken": {
"type": "AzureKeyVaultSecret",
"store": { "referenceName": "LS_AzureKeyVault" },
"secretName": "databricks-access-token"
}
}
}
2. Cluster Strategy
Existing Interactive Cluster
Best for development and small-to-medium workloads with predictable scheduling.
{
"existingClusterId": "@{pipeline().parameters.p_cluster_id}"
}
Pros: Fast startup, cost-effective for frequent short runs
Cons: Shared resource contention, single point of failure
New Job Cluster (Recommended for Production)
Spins up a dedicated cluster per pipeline run — full isolation.
{
"newClusterVersion": "14.3.x-scala2.12",
"newClusterNumOfWorker": "2",
"newClusterSparkConf": {
"spark.databricks.delta.preview.enabled": "true",
"spark.sql.adaptive.enabled": "true"
},
"newClusterNodeType": "Standard_DS3_v2",
"newClusterInitScripts": [],
"newClusterDriverNodeType": "Standard_DS3_v2"
}
Sizing guidance:
| Workload Type | Worker Count | Node Type | Rationale |
|---|---|---|---|
| Light ETL (< 10 GB) | 1-2 | Standard_DS3_v2 | Cost-efficient, sufficient memory |
| Medium ETL (10-100 GB) | 2-4 | Standard_DS4_v2 | Balanced compute and memory |
| Heavy ETL (100+ GB) | 4-8 | Standard_E8s_v3 | Memory-optimized for large joins |
| ML Training | 2-8 | Standard_NC6s_v3 | GPU-enabled for model training |
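The sizing table can be encoded as a small helper when generating cluster configs from pipeline metadata. This is a sketch only: `suggest_cluster` is a hypothetical name, and the thresholds and node types are this guide's suggestions, not Databricks defaults.

```python
# Illustrative sizing helper encoding the table above. Thresholds and
# node types are this guide's recommendations, not platform defaults.
def suggest_cluster(data_gb: float, workload: str = "etl") -> dict:
    """Return a suggested (min, max) worker range and node type."""
    if workload == "ml":
        return {"workers": (2, 8), "node_type": "Standard_NC6s_v3"}
    if data_gb < 10:
        return {"workers": (1, 2), "node_type": "Standard_DS3_v2"}
    if data_gb < 100:
        return {"workers": (2, 4), "node_type": "Standard_DS4_v2"}
    return {"workers": (4, 8), "node_type": "Standard_E8s_v3"}

print(suggest_cluster(5))    # light ETL
print(suggest_cluster(250))  # heavy ETL
```

A helper like this keeps sizing decisions in one place instead of scattered across pipeline JSON.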
Instance Pools
Pre-provision VMs to reduce cluster startup time from ~5 minutes to ~30 seconds.
Configure instance pools in Databricks, then reference via:
{
"newClusterVersion": "14.3.x-scala2.12",
"instancePoolId": "<pool-id>",
"newClusterNumOfWorker": "2"
}
3. Parameter Passing Patterns
Direct Parameters
Pass scalar values directly from ADF to notebooks:
{
"type": "DatabricksNotebook",
"typeProperties": {
"notebookPath": "/Shared/pipelines/bronze/ingest",
"baseParameters": {
"source_path": "@pipeline().parameters.p_source_path",
"load_date": "@formatDateTime(utcNow(), 'yyyy-MM-dd')",
"pipeline_run_id": "@pipeline().RunId"
}
}
}
In the Databricks notebook:
source_path = dbutils.widgets.get("source_path")
load_date = dbutils.widgets.get("load_date")
pipeline_run_id = dbutils.widgets.get("pipeline_run_id")
Passing Complex Objects
ADF base parameters only support strings. For complex objects, serialize to JSON:
{
"config_json": "@string(pipeline().parameters.p_config)"
}
Here p_config is an object-type pipeline parameter, e.g. {"tables": ["orders", "customers"], "mode": "incremental"}; string() serializes the object into the JSON text the notebook receives.
In the notebook:
import json
config = json.loads(dbutils.widgets.get("config_json"))
tables = config["tables"]
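The round trip can be exercised outside ADF with a standalone sketch; here `config_json` stands in for the string that `dbutils.widgets.get("config_json")` would return in a notebook.

```python
import json

# config_json stands in for the string delivered via the ADF base parameter;
# in a notebook it would come from dbutils.widgets.get("config_json").
config_json = '{"tables": ["orders", "customers"], "mode": "incremental"}'

config = json.loads(config_json)
for table in config["tables"]:
    print(f"Ingesting {table} in {config['mode']} mode")
```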
Returning Values from Notebooks
Use dbutils.notebook.exit() to return values to ADF:
import json
result = {"status": "SUCCESS", "rows_processed": row_count, "output_path": output_path}
dbutils.notebook.exit(json.dumps(result))
Access in subsequent ADF activities:
@activity('RunBronzeNotebook').output.runOutput.status
@activity('RunBronzeNotebook').output.runOutput.rows_processed
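The serialization step matters because dbutils.notebook.exit() takes a single string, which ADF then parses back into runOutput fields. A standalone sketch of that round trip (the row count and path are illustrative values); keep the payload small and return references such as paths rather than data, since ADF caps activity output size.

```python
import json

# Illustrative values; in a notebook these come from the processing step.
row_count = 1250
output_path = "abfss://silver@lake.dfs.core.windows.net/orders"

# dbutils.notebook.exit() accepts one string, so the structured result is
# serialized with json.dumps; ADF deserializes it into output.runOutput.
result = {"status": "SUCCESS", "rows_processed": row_count, "output_path": output_path}
payload = json.dumps(result)

# The round trip, i.e. what ADF exposes as runOutput fields:
print(json.loads(payload)["status"])  # SUCCESS
```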
4. Medallion Architecture Orchestration
Sequential Bronze → Silver → Gold
The included databricks_notebook_activity.json pipeline implements this pattern:
[Bronze Notebook] → [Check Output] → [Silver Notebook] → [Check Output] → [Gold Notebook]
Each stage validates the output of the previous stage before proceeding. This prevents cascading failures where corrupted bronze data propagates through silver and gold layers.
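The gate each validation step applies can be sketched in plain Python, assuming the notebooks return the {"status", "rows_processed"} payload described in section 3. In ADF the same check lives in an If Condition expression on runOutput; `stage_succeeded` is a hypothetical helper name.

```python
import json

def stage_succeeded(run_output: str, min_rows: int = 1) -> bool:
    """True only if the previous stage reported SUCCESS with enough rows."""
    result = json.loads(run_output)
    return (result.get("status") == "SUCCESS"
            and result.get("rows_processed", 0) >= min_rows)

print(stage_succeeded('{"status": "SUCCESS", "rows_processed": 42}'))  # True
print(stage_succeeded('{"status": "FAILED", "error": "schema drift"}'))  # False
```

Requiring a minimum row count as well as a SUCCESS status catches the common case where a stage "succeeds" but silently processed nothing.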
Parallel Fan-Out Pattern
For independent tables that can be processed simultaneously:
                    ┌─ [Bronze: Orders] ───┐
[Get Table List] ─→ ├─ [Bronze: Customers] ├─→ [Silver: Join & Transform]
                    └─ [Bronze: Products] ─┘
Implementation using ForEach with isSequential: false:
{
"type": "ForEach",
"typeProperties": {
"items": "@pipeline().parameters.p_table_list",
"isSequential": false,
"batchCount": 5,
"activities": [
{
"type": "DatabricksNotebook",
"typeProperties": {
"notebookPath": "/Shared/pipelines/bronze/generic_ingest",
"baseParameters": {
"table_name": "@item().table_name",
"source_schema": "@item().source_schema"
}
}
}
]
}
}
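For the @item() references above to resolve, p_table_list is assumed to be an array of objects whose keys match those references. A standalone sketch of that shape:

```python
import json

# The assumed shape of p_table_list: an array of objects whose keys match
# the @item() references in the ForEach activity.
p_table_list = json.loads("""
[
  {"table_name": "orders",    "source_schema": "sales"},
  {"table_name": "customers", "source_schema": "crm"},
  {"table_name": "products",  "source_schema": "catalog"}
]
""")

# ForEach with isSequential: false runs up to batchCount iterations at once;
# each iteration receives one element as @item().
for item in p_table_list:
    print(item["table_name"], item["source_schema"])
```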
Dependency-Aware DAG
For complex dependency graphs, use nested pipelines:
[Master Pipeline]
├─ Execute Pipeline: bronze_ingestion (parallel ForEach)
├─ Execute Pipeline: silver_transforms (sequential, waits for bronze)
└─ Execute Pipeline: gold_aggregations (sequential, waits for silver)
5. Error Handling
Retry Strategy
Configure retries at the activity level:
{
"policy": {
"timeout": "0.02:00:00",
"retry": 2,
"retryIntervalInSeconds": 120
}
}
Recommended settings by scenario:
| Scenario | Retry Count | Interval | Timeout |
|---|---|---|---|
| Cluster startup issues | 2 | 120s | 2h |
| Transient network errors | 3 | 60s | 1h |
| Long-running transforms | 1 | 300s | 4h |
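When choosing these values, it helps to know the worst-case wall-clock time an activity can occupy before ADF finally fails it. A back-of-envelope sketch, assuming every attempt runs to the full timeout (`worst_case_seconds` is a hypothetical helper name):

```python
# Worst case: (retries + 1) full-timeout attempts, with the retry interval
# between attempts. Assumes every attempt runs all the way to the timeout.
def worst_case_seconds(timeout_s: int, retries: int, interval_s: int) -> int:
    attempts = retries + 1
    return attempts * timeout_s + retries * interval_s

# "Cluster startup issues" row: 2 retries, 120 s interval, 2 h timeout
print(worst_case_seconds(2 * 3600, 2, 120))  # 21840 seconds, about 6.1 h
```

Downstream SLAs and trigger schedules should budget for this worst case, not the happy-path duration.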
Notebook-Level Error Handling
Structure notebooks to return structured error information:
import json

try:
    # Processing logic
    result = {"status": "SUCCESS", "rows_processed": count}
except Exception as e:
    result = {"status": "FAILED", "error": str(e), "error_type": type(e).__name__}
finally:
    dbutils.notebook.exit(json.dumps(result))
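The same pattern can be exercised outside Databricks by factoring it into a plain function; here `run_with_structured_result` is a hypothetical wrapper, `stage_fn` stands in for the notebook body, and the returned string stands in for the dbutils.notebook.exit() payload.

```python
import json

def run_with_structured_result(stage_fn) -> str:
    """Run a stage and return a JSON string describing success or failure."""
    try:
        count = stage_fn()
        result = {"status": "SUCCESS", "rows_processed": count}
    except Exception as e:
        # Catch everything so the caller always gets a structured payload;
        # the FAILED status is what downstream branching keys on.
        result = {"status": "FAILED", "error": str(e), "error_type": type(e).__name__}
    return json.dumps(result)

print(run_with_structured_result(lambda: 100))
print(run_with_structured_result(lambda: 1 / 0))
```

Note that swallowing the exception means the ADF activity itself reports success; the pipeline must branch on the returned status, as shown in the If Condition below.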
ADF Conditional Branching on Notebook Output
{
"type": "IfCondition",
"typeProperties": {
"expression": {
"value": "@equals(activity('RunNotebook').output.runOutput.status, 'SUCCESS')",
"type": "Expression"
},
"ifTrueActivities": [...],
"ifFalseActivities": [...]
}
}
6. Cost Optimization
- Use job clusters in production — they auto-terminate after the notebook completes
- Set newClusterSparkConf["spark.databricks.cluster.profile"] to "singleNode" (together with spark.master set to local[*]) for light workloads
- Use spot instances for non-time-critical batch jobs (up to 80% cost reduction)
- Configure instance pools to reduce startup latency without paying for idle clusters
- Set auto-termination on interactive clusters (minimum 10 minutes)
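A back-of-envelope cost model makes these trade-offs concrete. Every rate below is a placeholder assumption (`run_cost` is a hypothetical helper); look up current Azure Databricks and VM pricing for real numbers.

```python
# Rough job-run cost: DBU charges plus VM charges for driver + workers.
# All rates are placeholder assumptions, not current Azure pricing.
def run_cost(hours: float, workers: int,
             dbu_per_node_hour: float = 0.75,
             dbu_price: float = 0.30,
             vm_price: float = 0.23) -> float:
    nodes = workers + 1  # workers plus the driver
    dbu_cost = hours * nodes * dbu_per_node_hour * dbu_price
    vm_cost = hours * nodes * vm_price
    return dbu_cost + vm_cost

# A 30-minute run on 2 workers under these assumed rates:
print(f"${run_cost(hours=0.5, workers=2):.2f}")
```

Plugging in spot VM rates for vm_price shows why spot instances dominate the savings for batch workloads.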
7. Monitoring & Observability
Pipeline-Level Monitoring
ADF provides built-in monitoring for Databricks activities:
- Cluster startup duration
- Notebook execution duration
- Output size
Cross-Platform Correlation
Pass pipeline().RunId to every notebook as a parameter. Use this to correlate:
- ADF pipeline runs (Azure Monitor)
- Databricks job runs (Databricks UI / API)
- Application Insights traces
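Inside the notebook, one way to make that correlation usable is to stamp every log record with the run ID. A sketch, assuming pipeline_run_id arrived as a widget parameter (the GUID below is an illustrative placeholder, and `log_event` is a hypothetical helper):

```python
import json
import logging

logging.basicConfig(level=logging.INFO)

# Placeholder; in a notebook this would be
# dbutils.widgets.get("pipeline_run_id").
pipeline_run_id = "00000000-0000-0000-0000-000000000000"

def log_event(stage: str, message: str) -> str:
    """Emit a JSON log line carrying the ADF run ID as the join key."""
    record = {"adf_run_id": pipeline_run_id, "stage": stage, "message": message}
    line = json.dumps(record)
    logging.getLogger("pipeline").info(line)
    return line

print(log_event("bronze", "ingest complete"))
```

With the run ID in every record, the same key joins ADF monitoring, Databricks job logs, and Application Insights traces.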
Log Analytics KQL Queries
ADFPipelineRun
| where PipelineName contains "Databricks"
| where Status == "Failed"
| project TimeGenerated, PipelineName, RunId, Parameters
| order by TimeGenerated desc
Pipeline-level records do not carry error text; for failure details, query the ADFActivityRun table and project its Error column.
© 2026 Datanest Digital. All rights reserved.
This is 1 of 20 resources in the Datanest Platform Pro toolkit. Get the complete [Azure Data Factory Integration Templates] with all files, templates, and documentation for $49.
Or grab the entire Datanest Platform Pro bundle (20 products) for $199 — save 30%.