ADF-to-Databricks Integration Patterns
Datanest Digital — datanest.dev
Overview
This guide covers production patterns for orchestrating Databricks notebooks from Azure Data Factory. It addresses cluster strategy, parameter passing, error handling, medallion architecture orchestration, and operational considerations.
1. Linked Service Authentication Patterns
Managed Identity (Recommended)
Managed identity is the most secure approach — no tokens or credentials to manage.
{
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://adb-<id>.azuredatabricks.net",
"authentication": "MSI",
"workspaceResourceId": "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<ws>"
}
}
Requirements:
- ADF managed identity must have Contributor role on the Databricks workspace
- Inside the workspace, the managed identity also needs the usual data-plane permissions (cluster access, notebook paths), granted through Unity Catalog or legacy workspace ACLs
Access Token via Key Vault
For environments where MSI is not supported (e.g., cross-tenant):
{
"type": "AzureDatabricks",
"typeProperties": {
"domain": "https://adb-<id>.azuredatabricks.net",
"accessToken": {
"type": "AzureKeyVaultSecret",
"store": { "referenceName": "LS_AzureKeyVault" },
"secretName": "databricks-access-token"
}
}
}
2. Cluster Strategy
Existing Interactive Cluster
Best for development and small-to-medium workloads with predictable scheduling.
{
"existingClusterId": "@{pipeline().parameters.p_cluster_id}"
}
Pros: Fast startup, cost-effective for frequent short runs
Cons: Shared resource contention, single point of failure
New Job Cluster (Recommended for Production)
Spins up a dedicated cluster per pipeline run — full isolation.
{
"newClusterVersion": "14.3.x-scala2.12",
"newClusterNumOfWorker": "2",
"newClusterSparkConf": {
"spark.databricks.delta.preview.enabled": "true",
"spark.sql.adaptive.enabled": "true"
},
"newClusterNodeType": "Standard_DS3_v2",
"newClusterInitScripts": [],
"newClusterDriverNodeType": "Standard_DS3_v2"
}
Sizing guidance:
| Workload Type | Worker Count | Node Type | Rationale |
|---|---|---|---|
| Light ETL (< 10 GB) | 1-2 | Standard_DS3_v2 | Cost-efficient, sufficient memory |
| Medium ETL (10-100 GB) | 2-4 | Standard_DS4_v2 | Balanced compute and memory |
| Heavy ETL (100+ GB) | 4-8 | Standard_E8s_v3 | Memory-optimized for large joins |
| ML Training | 2-8 | Standard_NC6s_v3 | GPU-enabled for model training |
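The sizing table can be encoded as a small helper when generating cluster configs from pipeline metadata. This is a sketch only: `suggest_cluster` is a hypothetical name, and the thresholds and node types are this guide's suggestions, not Databricks defaults.

```python
# Illustrative sizing helper encoding the table above. Thresholds and
# node types are this guide's recommendations, not platform defaults.
def suggest_cluster(data_gb: float, workload: str = "etl") -> dict:
    """Return a suggested (min, max) worker range and node type."""
    if workload == "ml":
        return {"workers": (2, 8), "node_type": "Standard_NC6s_v3"}
    if data_gb < 10:
        return {"workers": (1, 2), "node_type": "Standard_DS3_v2"}
    if data_gb < 100:
        return {"workers": (2, 4), "node_type": "Standard_DS4_v2"}
    return {"workers": (4, 8), "node_type": "Standard_E8s_v3"}

print(suggest_cluster(5))    # light ETL
print(suggest_cluster(250))  # heavy ETL
```

A helper like this keeps sizing decisions in one place instead of scattered across pipeline JSON.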
Instance Pools
Pre-provision VMs to reduce cluster startup time from ~5 minutes to ~30 seconds.
Configure instance pools in Databricks, then reference via:
{
"newClusterVersion": "14.3.x-scala2.12",
"instancePoolId": "<pool-id>",
"newClusterNumOfWorker": "2"
}
3. Parameter Passing Patterns
Direct Parameters
Pass scalar values directly from ADF to notebooks:
{
"type": "DatabricksNotebook",
"typeProperties": {
"notebookPath": "/Shared/pipelines/bronze/ingest",
"baseParameters": {
"source_path": "@pipeline().parameters.p_source_path",
"load_date": "@formatDateTime(utcNow(), 'yyyy-MM-dd')",
"pipeline_run_id": "@pipeline().RunId"
}
}
}
In the Databricks notebook:
source_path = dbutils.widgets.get("source_path")
load_date = dbutils.widgets.get("load_date")
pipeline_run_id = dbutils.widgets.get("pipeline_run_id")
Passing Complex Objects
ADF base parameters only support strings. For complex objects, serialize to JSON:
{
"config_json": "@string(pipeline().parameters.p_config)"
}
Here p_config is an object-type pipeline parameter, e.g. {"tables": ["orders", "customers"], "mode": "incremental"}; string() serializes the object into the JSON text the notebook receives.
In the notebook:
import json
config = json.loads(dbutils.widgets.get("config_json"))
tables = config["tables"]
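The round trip can be exercised outside ADF with a standalone sketch; here `config_json` stands in for the string that `dbutils.widgets.get("config_json")` would return in a notebook.

```python
import json

# config_json stands in for the string delivered via the ADF base parameter;
# in a notebook it would come from dbutils.widgets.get("config_json").
config_json = '{"tables": ["orders", "customers"], "mode": "incremental"}'

config = json.loads(config_json)
for table in config["tables"]:
    print(f"Ingesting {table} in {config['mode']} mode")
```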
Returning Values from Notebooks
Use dbutils.notebook.exit() to return values to ADF:
import json
result = {"status": "SUCCESS", "rows_processed": row_count, "output_path": output_path}
dbutils.notebook.exit(json.dumps(result))
Access in subsequent ADF activities:
@activity('RunBronzeNotebook').output.runOutput.status
@activity('RunBronzeNotebook').output.runOutput.rows_processed
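The serialization step matters because dbutils.notebook.exit() takes a single string, which ADF then parses back into runOutput fields. A standalone sketch of that round trip (the row count and path are illustrative values); keep the payload small and return references such as paths rather than data, since ADF caps activity output size.

```python
import json

# Illustrative values; in a notebook these come from the processing step.
row_count = 1250
output_path = "abfss://silver@lake.dfs.core.windows.net/orders"

# dbutils.notebook.exit() accepts one string, so the structured result is
# serialized with json.dumps; ADF deserializes it into output.runOutput.
result = {"status": "SUCCESS", "rows_processed": row_count, "output_path": output_path}
payload = json.dumps(result)

# The round trip, i.e. what ADF exposes as runOutput fields:
print(json.loads(payload)["status"])  # SUCCESS
```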
4. Medallion Architecture Orchestration
Sequential Bronze → Silver → Gold
The included databricks_notebook_activity.json pipeline implements this pattern:
[Bronze Notebook] → [Check Output] → [Silver Notebook] → [Check Output] → [Gold Notebook]
Each stage validates the output of the previous stage before proceeding. This prevents cascading failures where corrupted bronze data propagates through silver and gold layers.
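The gate each validation step applies can be sketched in plain Python, assuming the notebooks return the {"status", "rows_processed"} payload described in section 3. In ADF the same check lives in an If Condition expression on runOutput; `stage_succeeded` is a hypothetical helper name.

```python
import json

def stage_succeeded(run_output: str, min_rows: int = 1) -> bool:
    """True only if the previous stage reported SUCCESS with enough rows."""
    result = json.loads(run_output)
    return (result.get("status") == "SUCCESS"
            and result.get("rows_processed", 0) >= min_rows)

print(stage_succeeded('{"status": "SUCCESS", "rows_processed": 42}'))  # True
print(stage_succeeded('{"status": "FAILED", "error": "schema drift"}'))  # False
```

Requiring a minimum row count as well as a SUCCESS status catches the common case where a stage "succeeds" but silently processed nothing.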
Parallel Fan-Out Pattern
For independent tables that can be processed simultaneously:
                    ┌─ [Bronze: Orders] ───┐
[Get Table List] ─→ ├─ [Bronze: Customers] ├─→ [Silver: Join & Transform]
                    └─ [Bronze: Products] ─┘
Implementation using ForEach with isSequential: false:
{
"type": "ForEach",
"typeProperties": {
"items": "@pipeline().parameters.p_table_list",
"isSequential": false,
"batchCount": 5,
"activities": [
{
"type": "DatabricksNotebook",
"typeProperties": {
"notebookPath": "/Shared/pipelines/bronze/generic_ingest",
"baseParameters": {
"table_name": "@item().table_name",
"source_schema": "@item().source_schema"
}
}
}
]
}
}
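For the @item() references above to resolve, p_table_list is assumed to be an array of objects whose keys match those references. A standalone sketch of that shape:

```python
import json

# The assumed shape of p_table_list: an array of objects whose keys match
# the @item() references in the ForEach activity.
p_table_list = json.loads("""
[
  {"table_name": "orders",    "source_schema": "sales"},
  {"table_name": "customers", "source_schema": "crm"},
  {"table_name": "products",  "source_schema": "catalog"}
]
""")

# ForEach with isSequential: false runs up to batchCount iterations at once;
# each iteration receives one element as @item().
for item in p_table_list:
    print(item["table_name"], item["source_schema"])
```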
Dependency-Aware DAG
For complex dependency graphs, use nested pipelines:
[Master Pipeline]
├─ Execute Pipeline: bronze_ingestion (parallel ForEach)
├─ Execute Pipeline: silver_transforms (sequential, waits for bronze)
└─ Execute Pipeline: gold_aggregations (sequential, waits for silver)
5. Error Handling
Retry Strategy
Configure retries at the activity level:
{
"policy": {
"timeout": "0.02:00:00",
"retry": 2,
"retryIntervalInSeconds": 120
}
}
Recommended settings by scenario:
| Scenario | Retry Count | Interval | Timeout |
|---|---|---|---|
| Cluster startup issues | 2 | 120s | 2h |
| Transient network errors | 3 | 60s | 1h |
| Long-running transforms | 1 | 300s | 4h |
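When choosing these values, it helps to know the worst-case wall-clock time an activity can occupy before ADF finally fails it. A back-of-envelope sketch, assuming every attempt runs to the full timeout (`worst_case_seconds` is a hypothetical helper name):

```python
# Worst case: (retries + 1) full-timeout attempts, with the retry interval
# between attempts. Assumes every attempt runs all the way to the timeout.
def worst_case_seconds(timeout_s: int, retries: int, interval_s: int) -> int:
    attempts = retries + 1
    return attempts * timeout_s + retries * interval_s

# "Cluster startup issues" row: 2 retries, 120 s interval, 2 h timeout
print(worst_case_seconds(2 * 3600, 2, 120))  # 21840 seconds, about 6.1 h
```

Downstream SLAs and trigger schedules should budget for this worst case, not the happy-path duration.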
Notebook-Level Error Handling
Structure notebooks to return structured error information:
import json

try:
    # Processing logic
    result = {"status": "SUCCESS", "rows_processed": count}
except Exception as e:
    result = {"status": "FAILED", "error": str(e), "error_type": type(e).__name__}
finally:
    dbutils.notebook.exit(json.dumps(result))
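The same pattern can be exercised outside Databricks by factoring it into a plain function; here `run_with_structured_result` is a hypothetical wrapper, `stage_fn` stands in for the notebook body, and the returned string stands in for the dbutils.notebook.exit() payload.

```python
import json

def run_with_structured_result(stage_fn) -> str:
    """Run a stage and return a JSON string describing success or failure."""
    try:
        count = stage_fn()
        result = {"status": "SUCCESS", "rows_processed": count}
    except Exception as e:
        # Catch everything so the caller always gets a structured payload;
        # the FAILED status is what downstream branching keys on.
        result = {"status": "FAILED", "error": str(e), "error_type": type(e).__name__}
    return json.dumps(result)

print(run_with_structured_result(lambda: 100))
print(run_with_structured_result(lambda: 1 / 0))
```

Note that swallowing the exception means the ADF activity itself reports success; the pipeline must branch on the returned status, as shown in the If Condition below.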
ADF Conditional Branching on Notebook Output
{
"type": "IfCondition",
"typeProperties": {
"expression": {
"value": "@equals(activity('RunNotebook').output.runOutput.status, 'SUCCESS')",
"type": "Expression"
},
"ifTrueActivities": [...],
"ifFalseActivities": [...]
}
}
6. Cost Optimization
- Use job clusters in production — they auto-terminate after the notebook completes
- Set newClusterSparkConf["spark.databricks.cluster.profile"] to "singleNode" (together with spark.master set to local[*]) for light workloads
- Use spot instances for non-time-critical batch jobs (up to 80% cost reduction)
- Configure instance pools to reduce startup latency without paying for idle clusters
- Set auto-termination on interactive clusters (minimum 10 minutes)
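A back-of-envelope cost model makes these trade-offs concrete. Every rate below is a placeholder assumption (`run_cost` is a hypothetical helper); look up current Azure Databricks and VM pricing for real numbers.

```python
# Rough job-run cost: DBU charges plus VM charges for driver + workers.
# All rates are placeholder assumptions, not current Azure pricing.
def run_cost(hours: float, workers: int,
             dbu_per_node_hour: float = 0.75,
             dbu_price: float = 0.30,
             vm_price: float = 0.23) -> float:
    nodes = workers + 1  # workers plus the driver
    dbu_cost = hours * nodes * dbu_per_node_hour * dbu_price
    vm_cost = hours * nodes * vm_price
    return dbu_cost + vm_cost

# A 30-minute run on 2 workers under these assumed rates:
print(f"${run_cost(hours=0.5, workers=2):.2f}")
```

Plugging in spot VM rates for vm_price shows why spot instances dominate the savings for batch workloads.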
7. Monitoring & Observability
Pipeline-Level Monitoring
ADF provides built-in monitoring for Databricks activities:
- Cluster startup duration
- Notebook execution duration
- Output size
Cross-Platform Correlation
Pass pipeline().RunId to every notebook as a parameter. Use this to correlate:
- ADF pipeline runs (Azure Monitor)
- Databricks job runs (Databricks UI / API)
- Application Insights traces
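Inside the notebook, one way to make that correlation usable is to stamp every log record with the run ID. A sketch, assuming pipeline_run_id arrived as a widget parameter (the GUID below is an illustrative placeholder, and `log_event` is a hypothetical helper):

```python
import json
import logging

logging.basicConfig(level=logging.INFO)

# Placeholder; in a notebook this would be
# dbutils.widgets.get("pipeline_run_id").
pipeline_run_id = "00000000-0000-0000-0000-000000000000"

def log_event(stage: str, message: str) -> str:
    """Emit a JSON log line carrying the ADF run ID as the join key."""
    record = {"adf_run_id": pipeline_run_id, "stage": stage, "message": message}
    line = json.dumps(record)
    logging.getLogger("pipeline").info(line)
    return line

print(log_event("bronze", "ingest complete"))
```

With the run ID in every record, the same key joins ADF monitoring, Databricks job logs, and Application Insights traces.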
Log Analytics KQL Queries
ADFPipelineRun
| where PipelineName contains "Databricks"
| where Status == "Failed"
| project TimeGenerated, PipelineName, RunId, Parameters
| order by TimeGenerated desc
Pipeline-level records do not carry error text; for failure details, query the ADFActivityRun table and project its Error column.
© 2026 Datanest Digital. All rights reserved.
This is 1 of 20 resources in the Datanest Platform Pro toolkit. Get the complete [Azure Data Factory Integration Templates] with all files, templates, and documentation for $49.
Or grab the entire Datanest Platform Pro bundle (20 products) for $199 — save 30%.