
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Azure Synapse-Databricks Integration Kit: Architecture Decision Guide: Synapse vs Databricks vs Both


Datanest Digital | datanest.dev


Purpose

This guide provides a structured decision framework for determining which Azure analytics
service to use for each workload in your data platform. Most mature Azure data platforms
benefit from running both Synapse and Databricks. The question is not "which one" but
"which workload goes where."


Executive Summary

| Capability | Synapse Wins | Databricks Wins |
|---|---|---|
| T-SQL analytics | Yes | No |
| Serverless ad-hoc queries | Yes | No |
| Large-scale Spark ETL | No | Yes |
| ML model training | No | Yes |
| Delta Lake management | No | Yes |
| Power BI DirectQuery (DWH) | Yes | Comparable |
| Real-time streaming | No | Yes |
| Dedicated SQL warehousing | Yes | Comparable |
| Cost at low scale | Yes | No |
| Cost at high scale | No | Yes |
| Governance (Unity Catalog) | No | Yes |
| T-SQL stored procedures | Yes | No |

Decision Tree

Use this tree top-down. Start with your workload type and follow the branches.

Level 1: What is the primary workload?

```
START HERE
    │
    ├── T-SQL analytics / reporting ──────────> SYNAPSE (dedicated or serverless pool)
    │
    ├── Ad-hoc data exploration ──────────────> See Level 2A
    │
    ├── ETL / Data transformation ────────────> See Level 2B
    │
    ├── Machine Learning ─────────────────────> DATABRICKS
    │
    ├── Real-time streaming ──────────────────> DATABRICKS (Structured Streaming)
    │
    ├── Data warehousing (star schema) ───────> See Level 2C
    │
    └── Data sharing / governance ────────────> See Level 2D
```
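When triaging a long workload inventory, the Level 1 branches above can be encoded as a simple lookup. A minimal sketch; the workload keys and return labels are illustrative, not part of any Azure API:

```python
# Hypothetical encoding of the Level 1 decision tree.
# Keys and labels are illustrative, not real service identifiers.
LEVEL_1_ROUTES = {
    "tsql_analytics": "SYNAPSE",        # dedicated or serverless pool
    "adhoc_exploration": "LEVEL_2A",    # needs further branching
    "etl": "LEVEL_2B",
    "machine_learning": "DATABRICKS",
    "streaming": "DATABRICKS",          # Structured Streaming
    "data_warehousing": "LEVEL_2C",
    "governance": "LEVEL_2D",
}

def route_workload(workload_type: str) -> str:
    """Return the target service, or the next decision level to consult."""
    try:
        return LEVEL_1_ROUTES[workload_type]
    except KeyError:
        raise ValueError(f"Unknown workload type: {workload_type}")
```

Running each row of the workload mapping worksheet (below) through a function like this gives a first-pass target service column automatically.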

Level 2A: Ad-Hoc Data Exploration

```
Ad-Hoc Data Exploration
    │
    ├── Data is in Parquet/CSV/JSON on ADLS?
    │   └── YES ──> SYNAPSE SERVERLESS
    │       - OPENROWSET queries, no cluster spin-up
    │       - Pay per TB scanned (~$5/TB)
    │       - Ideal for infrequent, schema-discovery queries
    │
    ├── Data is in Delta Lake?
    │   ├── Need T-SQL interface?
    │   │   └── YES ──> SYNAPSE SERVERLESS (with Delta support)
    │   └── Need full Spark/Python interface?
    │       └── YES ──> DATABRICKS SQL or notebooks
    │
    └── Data volume per query > 1 TB regularly?
        └── YES ──> DATABRICKS SQL WAREHOUSE
            - More cost-effective at high scan volumes
            - Better query optimization for Delta
```
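The 1 TB-per-query threshold follows from simple arithmetic on the ~$5/TB serverless rate. A hedged sketch of the break-even calculation; the warehouse hourly rate is an assumed input, not a published price, so check current Azure and Databricks pricing for your region:

```python
# Approximate Synapse serverless pay-per-query rate; verify for your region.
SYNAPSE_SERVERLESS_USD_PER_TB = 5.0

def serverless_query_cost(tb_scanned: float) -> float:
    """Synapse serverless: you pay only for data scanned."""
    return tb_scanned * SYNAPSE_SERVERLESS_USD_PER_TB

def warehouse_query_cost(runtime_hours: float, usd_per_hour: float) -> float:
    """Databricks SQL warehouse: you pay for compute time, regardless of bytes scanned."""
    return runtime_hours * usd_per_hour

def break_even_tb(runtime_hours: float, usd_per_hour: float) -> float:
    """TB scanned per query above which the warehouse becomes cheaper."""
    return warehouse_query_cost(runtime_hours, usd_per_hour) / SYNAPSE_SERVERLESS_USD_PER_TB
```

For example, if a query runs an hour on an (assumed) $10/hour warehouse, serverless stops being cheaper once it scans more than 2 TB, which is roughly where the tree's "> 1 TB regularly" branch points you to Databricks.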

Level 2B: ETL / Data Transformation

```
ETL / Data Transformation
    │
    ├── Transformation language?
    │   ├── T-SQL only (CTAS, views, stored procs)
    │   │   └── SYNAPSE DEDICATED POOL or SERVERLESS
    │   │
    │   ├── PySpark / Scala Spark
    │   │   ├── Workload < 2 hours/day, < 500 GB?
    │   │   │   └── EITHER (cost-compare with calculator tool)
    │   │   └── Workload > 2 hours/day or > 500 GB?
    │   │       └── DATABRICKS (Photon engine, better autoscaling)
    │   │
    │   └── Mixed T-SQL + Spark
    │       └── BOTH: Databricks for Spark, Synapse for T-SQL layers
    │
    ├── Need Delta Lake MERGE / UPSERT?
    │   └── YES ──> DATABRICKS
    │       - Native Delta engine with optimized MERGE
    │       - Synapse Spark has Delta support but less optimized
    │
    ├── Need orchestration?
    │   ├── Simple pipelines ──> Synapse Pipelines (built-in ADF)
    │   ├── Complex DAGs ──> Databricks Workflows or external orchestrator
    │   └── Enterprise orchestration ──> Azure Data Factory (works with both)
    │
    └── Need incremental processing?
        ├── Change Data Feed (Delta) ──> DATABRICKS
        └── Watermark-based ──> EITHER
```
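The language and size branches above can be sketched as a routing function for bulk-classifying existing pipelines. Illustrative only; the behavior at exactly 2 hours or exactly 500 GB is a judgment call the tree leaves open, and here it conservatively routes to Databricks:

```python
def choose_etl_engine(language: str, hours_per_day: float, gb_per_run: float) -> str:
    """Hypothetical encoding of the Level 2B branches; thresholds from the tree above."""
    if language == "tsql":
        return "SYNAPSE"          # dedicated pool or serverless
    if language == "mixed":
        return "BOTH"             # Databricks for Spark, Synapse for T-SQL layers
    if language in ("pyspark", "scala"):
        if hours_per_day < 2 and gb_per_run < 500:
            return "EITHER"       # small workload: cost-compare before committing
        return "DATABRICKS"       # Photon engine, better autoscaling
    raise ValueError(f"Unknown transformation language: {language}")
```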

Level 2C: Data Warehousing

```
Data Warehousing
    │
    ├── Existing T-SQL codebase to maintain?
    │   └── YES ──> SYNAPSE DEDICATED POOL
    │       - MPP engine compatible with SQL Server patterns
    │       - Distribution, indexing, materialized views
    │
    ├── Building new from scratch?
    │   ├── Team has strong SQL Server / T-SQL skills?
    │   │   └── YES ──> SYNAPSE DEDICATED POOL
    │   ├── Team prefers Spark + SQL hybrid?
    │   │   └── YES ──> DATABRICKS SQL WAREHOUSE
    │   └── Need lakehouse architecture (no data copies)?
    │       └── YES ──> DATABRICKS SQL WAREHOUSE over Delta Lake
    │
    ├── Concurrency requirements?
    │   ├── < 30 concurrent queries ──> EITHER
    │   └── > 30 concurrent queries ──> DATABRICKS SQL WAREHOUSE
    │       - Better concurrency scaling with serverless SQL warehouses
    │
    └── Data freshness requirements?
        ├── Near real-time (< 5 min) ──> DATABRICKS (streaming into Delta)
        └── Batch (hourly+) ──> EITHER
```

Level 2D: Data Sharing & Governance

```
Data Sharing & Governance
    │
    ├── Cross-workspace data sharing?
    │   └── DATABRICKS UNITY CATALOG
    │       - Centralized metastore across workspaces
    │       - Row/column-level security
    │       - Data lineage tracking
    │
    ├── Need to share data with T-SQL consumers?
    │   └── BOTH: Unity Catalog for governance + Synapse serverless for T-SQL access
    │
    ├── External data sharing (outside organization)?
    │   ├── Delta Sharing protocol ──> DATABRICKS
    │   └── Via SQL endpoint ──> SYNAPSE or DATABRICKS SQL
    │
    └── Audit and compliance?
        ├── Unity Catalog audit logs ──> DATABRICKS
        └── Synapse + Purview integration ──> SYNAPSE
```

Combined Architecture Patterns

Pattern 1: Databricks for Processing, Synapse for Serving

Best for: Organizations with strong T-SQL reporting teams and large-scale Spark ETL.

```
Ingestion ──> Databricks (Bronze/Silver/Gold) ──> ADLS Gen2 (Delta) ──> Synapse Serverless ──> Power BI
```

When to use:

  • ETL is primarily PySpark
  • Reporting team uses T-SQL
  • Power BI connects via Synapse serverless endpoints
  • Cost-sensitive on the serving layer (pay-per-query)

When to avoid:

  • Low data volumes (< 100 GB) -- Synapse serverless alone may suffice
  • No T-SQL requirements

Pattern 2: Synapse for DWH, Databricks for ML

Best for: Organizations with an existing Synapse dedicated pool running star-schema workloads.

```
Sources ──> ADF/Synapse Pipelines ──> Synapse Dedicated Pool (DWH)
                                  ──> ADLS Gen2 ──> Databricks (ML training)
```

When to use:

  • Established Synapse DWH with complex T-SQL logic
  • ML team needs Spark for feature engineering and training
  • Reporting runs against the dedicated pool

When to avoid:

  • DWH workload is shrinking or migrating to lakehouse

Pattern 3: Full Lakehouse with Synapse SQL Access

Best for: New builds adopting the lakehouse architecture.

```
Ingestion ──> Databricks (medallion architecture) ──> Delta Lake on ADLS Gen2
                                                        ├──> Databricks SQL (primary)
                                                        └──> Synapse Serverless (T-SQL access)
```

When to use:

  • Greenfield deployment
  • Want single source of truth in Delta Lake
  • Mixed consumer base (SQL + Python + BI)

When to avoid:

  • Regulatory requirement for a traditional DWH appliance

Pattern 4: Migration Transition State

Best for: Organizations migrating from Synapse Spark pools to Databricks.

```
Phase 1: Synapse Spark + Synapse SQL (current)
Phase 2: Databricks Spark + Synapse SQL (transition -- use migration guide)
Phase 3: Databricks Spark + Databricks SQL + Synapse Serverless (target)
```

Cost Decision Matrix

Use the included tools/cost_comparison_calculator.py for detailed estimates. Here are
general guidelines:

Synapse is Cheaper When:

| Scenario | Reason |
|---|---|
| < 10 TB scanned/month (serverless) | Pay-per-query at $5/TB is hard to beat |
| DWH workload < DW1000c | Fixed-cost dedicated pool |
| Sporadic query patterns | Serverless has zero idle cost |
| Existing EA/CSP discounts on Synapse | Negotiated pricing |

Databricks is Cheaper When:

| Scenario | Reason |
|---|---|
| > 4 hours/day Spark compute | Photon engine throughput advantage |
| High-concurrency SQL workload | Serverless SQL warehouse autoscaling |
| ML training workloads | GPU cluster support, MLflow included |
| Data volumes > 10 TB per pipeline run | Better Spark optimization |

Both Together Optimizes Cost When:

| Scenario | Approach |
|---|---|
| Heavy Spark ETL + light SQL reporting | Databricks for ETL, Synapse serverless for reporting |
| Mixed T-SQL DWH + ML | Synapse dedicated for DWH, Databricks for ML |
| Large Delta Lake + diverse consumers | Databricks manages Delta, Synapse provides T-SQL access |
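The guidelines above reduce to comparing a fixed-capacity cost against a usage-based one. A minimal sketch of that core arithmetic; this is not the included calculator, and every rate here is a placeholder input you should replace with current regional pricing:

```python
def dedicated_pool_monthly_cost(usd_per_hour: float, hours_provisioned: float) -> float:
    """Synapse dedicated pool: fixed capacity, so cost depends only on hours
    the pool is running -- which is why pause/resume automation matters."""
    return usd_per_hour * hours_provisioned

def databricks_job_monthly_cost(dbu_per_hour: float, usd_per_dbu: float,
                                vm_usd_per_hour: float, hours_run: float) -> float:
    """Databricks job cluster: DBU charge plus underlying VM charge,
    billed only while the job actually runs."""
    return (dbu_per_hour * usd_per_dbu + vm_usd_per_hour) * hours_run
```

Plugging both into a spreadsheet (or the included `tools/cost_comparison_calculator.py`) per workload makes the "cheaper when" rows above concrete for your own volumes.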

Workload Mapping Worksheet

Use this worksheet to map your current workloads. Fill in each row and use the
decision tree above to determine the target service.

| # | Workload Name | Type | Current Service | Data Volume | Frequency | Target Service | Rationale |
|---|---|---|---|---|---|---|---|
| 1 | | | | | | | |
| 2 | | | | | | | |
| 3 | | | | | | | |
| 4 | | | | | | | |
| 5 | | | | | | | |
| 6 | | | | | | | |
| 7 | | | | | | | |
| 8 | | | | | | | |

Anti-Patterns to Avoid

1. Running Synapse Spark Pools for Large-Scale ETL

Synapse Spark pools lack Photon, autoscale more slowly, and offer limited cluster
configuration options. If your Spark workloads exceed 2 hours/day, Databricks is typically
30-50% more cost-effective thanks to Photon and better cluster management.

2. Using Databricks for Simple T-SQL Reporting

If your reporting team writes T-SQL and data volumes are small (< 10 TB/month scanned),
Synapse serverless SQL is simpler and cheaper. There is no need to spin up a Databricks
SQL warehouse for 20 queries per day.

3. Duplicating Data Between Services

The worst outcome is copying data into both Synapse dedicated pools AND Delta Lake.
Choose ONE storage layer (Delta Lake on ADLS Gen2 is recommended) and have both services
read from it. Synapse serverless can query Delta directly.

4. Running Dedicated Pools 24/7 at High DWU

If your dedicated pool runs at DW2000c+ but utilization is low outside business hours,
either implement pause/resume automation or consider moving to Databricks SQL serverless
which scales to zero.
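The pause/resume decision itself is trivial to automate: a scheduled function checks the clock and drives the Azure CLI's `az synapse sql pool pause` / `resume` commands for dedicated pools. A naive sketch with an assumed 07:00-19:00 UTC business window; production automation should also check for running queries before pausing:

```python
def should_pool_run(hour_utc: int, business_hours: range = range(7, 19)) -> bool:
    """Return True if the dedicated pool should be running at this UTC hour.
    Assumed window 07:00-18:59 UTC; adjust for your reporting schedule.
    The caller would invoke `az synapse sql pool resume`/`pause` accordingly."""
    return hour_utc in business_hours
```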

5. Ignoring Unity Catalog for Governance

If you run Databricks, enable Unity Catalog from day one. Retrofitting governance is
expensive. Synapse consumers can still access Unity Catalog-managed tables via Synapse
serverless views over Delta Lake.


Migration Decision

If you are currently on Synapse Spark pools and considering Databricks, see the
companion guide: guides/migration_synapse_to_databricks.md.

Key migration triggers:

  • Spark workload costs exceeding budget by > 20%
  • Need for Photon engine acceleration
  • Unity Catalog governance requirements
  • ML/MLflow workflow needs
  • Structured Streaming requirements
  • Cluster startup time complaints (Synapse Spark: ~5 min, Databricks: ~2-3 min)

Decision Checklist

Before finalizing your architecture, verify:

  • [ ] Each workload is mapped to a specific service with a documented rationale
  • [ ] ADLS Gen2 is designated as the shared storage layer
  • [ ] Cross-service authentication is planned (see guides/security_cross_service.md)
  • [ ] Cost estimates are calculated for the combined architecture
  • [ ] Power BI connectivity path is determined (Synapse endpoint vs Databricks SQL)
  • [ ] Governance model is defined (Unity Catalog, Purview, or both)
  • [ ] Disaster recovery covers both services
  • [ ] Network architecture supports private endpoints for both services
  • [ ] Team skill gaps are identified and training is planned

Further Reading

  • guides/security_cross_service.md -- Authentication patterns
  • guides/migration_synapse_to_databricks.md -- Migration playbook
  • sql/serverless_views_over_delta.sql -- Synapse-to-Delta integration SQL
  • tools/cost_comparison_calculator.py -- Automated cost comparison



This is 1 of 20 resources in the Datanest Platform Pro toolkit. Get the complete [Azure Synapse-Databricks Integration Kit] with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire Datanest Platform Pro bundle (20 products) for $199 — save 30%.

Get the Complete Bundle →

