Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Azure Data Factory Integration Templates: ADF-to-Databricks Integration Patterns

Datanest Digital — datanest.dev


Overview

This guide covers production patterns for orchestrating Databricks notebooks from Azure Data Factory. It addresses cluster strategy, parameter passing, error handling, medallion architecture orchestration, and operational considerations.


1. Linked Service Authentication Patterns

Managed Identity (Recommended)

Managed identity is the most secure approach — no tokens or credentials to manage.

{
  "type": "AzureDatabricks",
  "typeProperties": {
    "domain": "https://adb-<id>.azuredatabricks.net",
    "authentication": "MSI",
    "workspaceResourceId": "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<ws>"
  }
}

Requirements:

  • The ADF managed identity must have the Contributor role on the Databricks workspace resource
  • The managed identity must also be granted permissions inside the workspace (notebook and cluster access), whether through Unity Catalog or legacy workspace ACLs

Access Token via Key Vault

For environments where MSI is not supported (e.g., cross-tenant):

{
  "type": "AzureDatabricks",
  "typeProperties": {
    "domain": "https://adb-<id>.azuredatabricks.net",
    "accessToken": {
      "type": "AzureKeyVaultSecret",
      "store": { "referenceName": "LS_AzureKeyVault" },
      "secretName": "databricks-access-token"
    }
  }
}

2. Cluster Strategy

Existing Interactive Cluster

Best for development and small-to-medium workloads with predictable scheduling.

{
  "existingClusterId": "@{pipeline().parameters.p_cluster_id}"
}

Pros: Fast startup, cost-effective for frequent short runs
Cons: Shared resource contention, single point of failure

New Job Cluster (Recommended for Production)

Spins up a dedicated cluster per pipeline run — full isolation.

{
  "newClusterVersion": "14.3.x-scala2.12",
  "newClusterNumOfWorker": "2",
  "newClusterSparkConf": {
    "spark.databricks.delta.preview.enabled": "true",
    "spark.sql.adaptive.enabled": "true"
  },
  "newClusterNodeType": "Standard_DS3_v2",
  "newClusterInitScripts": [],
  "newClusterDriverNodeType": "Standard_DS3_v2"
}

Sizing guidance:

Workload Type            Worker Count   Node Type          Rationale
Light ETL (< 10 GB)      1-2            Standard_DS3_v2    Cost-efficient, sufficient memory
Medium ETL (10-100 GB)   2-4            Standard_DS4_v2    Balanced compute and memory
Heavy ETL (100+ GB)      4-8            Standard_E8s_v3    Memory-optimized for large joins
ML Training              2-8            Standard_NC6s_v3   GPU-enabled for model training

Instance Pools

Pre-provision VMs to reduce cluster startup time from ~5 minutes to ~30 seconds.

Configure instance pools in Databricks, then reference via:

{
  "newClusterVersion": "14.3.x-scala2.12",
  "instancePoolId": "<pool-id>",
  "newClusterNumOfWorker": "2"
}

3. Parameter Passing Patterns

Direct Parameters

Pass scalar values directly from ADF to notebooks:

{
  "type": "DatabricksNotebook",
  "typeProperties": {
    "notebookPath": "/Shared/pipelines/bronze/ingest",
    "baseParameters": {
      "source_path": "@pipeline().parameters.p_source_path",
      "load_date": "@formatDateTime(utcNow(), 'yyyy-MM-dd')",
      "pipeline_run_id": "@pipeline().RunId"
    }
  }
}

In the Databricks notebook:

source_path = dbutils.widgets.get("source_path")
load_date = dbutils.widgets.get("load_date")
pipeline_run_id = dbutils.widgets.get("pipeline_run_id")

Passing Complex Objects

ADF base parameters only support strings. For complex objects, serialize to JSON:

{
  "config_json": "@string(json('{\"tables\": [\"orders\", \"customers\"], \"mode\": \"incremental\"}'))"
}

In the notebook:

import json
config = json.loads(dbutils.widgets.get("config_json"))
tables = config["tables"]

Returning Values from Notebooks

Use dbutils.notebook.exit() to return values to ADF:

import json
result = {"status": "SUCCESS", "rows_processed": row_count, "output_path": output_path}
dbutils.notebook.exit(json.dumps(result))

Access in subsequent ADF activities:

@activity('RunBronzeNotebook').output.runOutput.status
@activity('RunBronzeNotebook').output.runOutput.rows_processed

4. Medallion Architecture Orchestration

Sequential Bronze → Silver → Gold

The included databricks_notebook_activity.json pipeline implements this pattern:

[Bronze Notebook] → [Check Output] → [Silver Notebook] → [Check Output] → [Gold Notebook]

Each stage validates the output of the previous stage before proceeding. This prevents cascading failures where corrupted bronze data propagates through silver and gold layers.
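That validation step can be sketched as a small gate on the exit JSON of the previous stage. A minimal sketch, assuming the upstream notebook returns the `{"status": ..., "rows_processed": ...}` shape shown later in this guide (the helper name is illustrative):

```python
import json

def stage_succeeded(run_output: str, min_rows: int = 1) -> bool:
    """Return True only if the upstream notebook reported SUCCESS
    and processed at least min_rows rows."""
    try:
        result = json.loads(run_output)
    except (TypeError, ValueError):
        return False  # missing or malformed output fails the gate
    return result.get("status") == "SUCCESS" and result.get("rows_processed", 0) >= min_rows

# run_output is what the bronze notebook returned via dbutils.notebook.exit()
print(stage_succeeded('{"status": "SUCCESS", "rows_processed": 1250}'))  # True
print(stage_succeeded('{"status": "FAILED", "error": "schema drift"}'))  # False
```

The same check can run either inside the silver notebook (reading the value as a widget) or in ADF itself via an If Condition on `runOutput`.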

Parallel Fan-Out Pattern

For independent tables that can be processed simultaneously:

                    ┌─ [Bronze: Orders]    ─┐
[Get Table List] ─→ ├─ [Bronze: Customers]  ├─→ [Silver: Join & Transform]
                    └─ [Bronze: Products]   ─┘

Implementation using ForEach with isSequential: false:

{
  "type": "ForEach",
  "typeProperties": {
    "items": "@pipeline().parameters.p_table_list",
    "isSequential": false,
    "batchCount": 5,
    "activities": [
      {
        "type": "DatabricksNotebook",
        "typeProperties": {
          "notebookPath": "/Shared/pipelines/bronze/generic_ingest",
          "baseParameters": {
            "table_name": "@item().table_name",
            "source_schema": "@item().source_schema"
          }
        }
      }
    ]
  }
}

Dependency-Aware DAG

For complex dependency graphs, use nested pipelines:

[Master Pipeline]
  ├─ Execute Pipeline: bronze_ingestion (parallel ForEach)
  ├─ Execute Pipeline: silver_transforms (sequential, waits for bronze)
  └─ Execute Pipeline: gold_aggregations (sequential, waits for silver)
Enter fullscreen mode Exit fullscreen mode
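One edge of that master pipeline can be expressed with an Execute Pipeline activity and a dependency condition. A minimal sketch, assuming child pipelines and activity names like the ones in the diagram (names are illustrative):

```json
{
  "name": "RunSilverTransforms",
  "type": "ExecutePipeline",
  "dependsOn": [
    { "activity": "RunBronzeIngestion", "dependencyConditions": ["Succeeded"] }
  ],
  "typeProperties": {
    "pipeline": { "referenceName": "silver_transforms", "type": "PipelineReference" },
    "waitOnCompletion": true
  }
}
```

Setting "waitOnCompletion": true is what makes the silver stage actually block on bronze; without it, the Execute Pipeline activity succeeds as soon as the child run is triggered.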

5. Error Handling

Retry Strategy

Configure retries at the activity level:

{
  "policy": {
    "timeout": "0.02:00:00",
    "retry": 2,
    "retryIntervalInSeconds": 120
  }
}

Recommended settings by scenario:

Scenario                   Retry Count   Interval   Timeout
Cluster startup issues     2             120s       2h
Transient network errors   3             60s        1h
Long-running transforms    1             300s       4h

Notebook-Level Error Handling

Structure notebooks to return structured error information:

import json

try:
    # Processing logic
    result = {"status": "SUCCESS", "rows_processed": count}
except Exception as e:
    result = {"status": "FAILED", "error": str(e), "error_type": type(e).__name__}
finally:
    # exit() runs on both paths, so ADF always receives structured output
    dbutils.notebook.exit(json.dumps(result))

ADF Conditional Branching on Notebook Output

{
  "type": "IfCondition",
  "typeProperties": {
    "expression": {
      "value": "@equals(activity('RunNotebook').output.runOutput.status, 'SUCCESS')",
      "type": "Expression"
    },
    "ifTrueActivities": [...],
    "ifFalseActivities": [...]
  }
}

6. Cost Optimization

  • Use job clusters in production — they auto-terminate after the notebook completes
  • Set newClusterSparkConf["spark.databricks.cluster.profile"] to "singleNode" for light workloads (single-node clusters also require zero workers)
  • Use spot instances for non-time-critical batch jobs (up to 80% cost reduction)
  • Configure instance pools to reduce startup latency without paying for idle clusters
  • Set auto-termination on interactive clusters (minimum 10 minutes)
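ADF's newCluster* properties do not expose spot settings directly, so spot usage is typically configured through a Databricks cluster policy or by submitting the job via the Databricks Jobs API instead. As a sketch of the latter, a Jobs API cluster spec might look like this (field names from the Jobs API; values illustrative):

```json
{
  "new_cluster": {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "azure_attributes": {
      "availability": "SPOT_WITH_FALLBACK_AZURE",
      "first_on_demand": 1
    }
  }
}
```

"first_on_demand": 1 keeps the driver on an on-demand VM so a spot eviction cannot take down the whole job, while "SPOT_WITH_FALLBACK_AZURE" falls back to on-demand when spot capacity is unavailable.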

7. Monitoring & Observability

Pipeline-Level Monitoring

ADF provides built-in monitoring for Databricks activities:

  • Cluster startup duration
  • Notebook execution duration
  • Output size

Cross-Platform Correlation

Pass pipeline().RunId to every notebook as a parameter. Use this to correlate:

  • ADF pipeline runs (Azure Monitor)
  • Databricks job runs (Databricks UI / API)
  • Application Insights traces
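A sketch of what a correlated log line might look like inside a notebook. This is plain Python with an illustrative helper name; in Databricks the run id would come from dbutils.widgets.get("pipeline_run_id") as shown earlier:

```python
import json
from datetime import datetime, timezone

def correlation_record(pipeline_run_id: str, notebook_path: str, event: str, **extra) -> str:
    """Build one structured log line keyed on the ADF pipeline run id so that
    Azure Monitor, Databricks, and Application Insights entries can be joined."""
    record = {
        "pipeline_run_id": pipeline_run_id,
        "notebook": notebook_path,
        "event": event,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **extra,
    }
    return json.dumps(record)

# Inside a notebook, pipeline_run_id would come from the ADF base parameter:
print(correlation_record("run-abc-123", "/Shared/pipelines/bronze/ingest", "start"))
```

Emitting these as single-line JSON makes them trivially parseable with KQL's parse_json() once they land in Log Analytics.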

Log Analytics KQL Queries

ADFPipelineRun
| where PipelineName contains "Databricks"
| where Status == "Failed"
| project TimeGenerated, PipelineName, RunId, Parameters
| order by TimeGenerated desc

ADFPipelineRun does not carry the error text itself; for the actual failure details, query the corresponding activity runs in ADFActivityRun.

© 2026 Datanest Digital. All rights reserved.


This is 1 of 20 resources in the Datanest Platform Pro toolkit. Get the complete [Azure Data Factory Integration Templates] with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire Datanest Platform Pro bundle (20 products) for $199 — save 30%.

Get the Complete Bundle →

