DEV Community

Cover image for Azure Data Engineering: Synapse, ADF, Databricks — Full Guide
Gowtham Potureddi
Gowtham Potureddi

Posted on

Azure Data Engineering: Synapse, ADF, Databricks — Full Guide

azure data engineering is built on three flagship services every modern Azure team ships on: Azure Data Factory (orchestration and bulk movement — pipelines, activities, integration runtimes, triggers), Azure Synapse Analytics (the unified workspace that fuses Dedicated SQL Pools, Serverless SQL, and Spark Pools on one ADLS Gen2 lake), and Azure Databricks (the lakehouse compute engine — clusters, notebooks, Delta Lake, Unity Catalog). Whether you're shipping a daily batch pipeline at a financial-services team or prepping for azure data engineer interview questions at a FAANG-adjacent Azure-first shop, the same three services anchor every architecture.

This guide walks through every Azure service that matters for production DE work: azure data factory (pipelines and the three integration-runtime modes), azure synapse analytics (Dedicated vs Serverless vs Spark — when to pick each), azure databricks (clusters, Delta Lake medallion architecture, Unity Catalog governance), the ADLS Gen2 hierarchical-namespace storage layer they all share, and the broader Azure data platform stack (Event Hubs, Stream Analytics, Purview, Power BI, Microsoft Fabric) that wraps everything together. Every section ends as a teach-by-example block: a short worked example with Question → Input → Code → Step-by-step → Output, then a longer interview-style problem with the four-part Solution Tail (code → trace → output → why this works).

PipeCode blog header for an Azure data engineering full guide — bold white headline 'Azure Data Engineering' with subtitle 'Synapse · ADF · Databricks' and a stylised three-pillar Azure stack on a dark gradient with purple, blue, and orange accents and a small pipecode.ai attribution.

When you want hands-on reps while reading, browse Python practice library →, drill ETL Python drills →, sharpen data-manipulation patterns →, rehearse streaming Python drills →, or widen coverage on the full data-analysis library →.


On this page


1. Why Azure data engineering matters in 2026

Azure is the second-largest cloud data platform — and the only one with three first-class compute engines on one lake

The one-sentence invariant: azure data engineering is the discipline of moving data through ADLS Gen2 using ADF for orchestration, then picking the right compute engine — Synapse Dedicated for warehouse workloads, Synapse Serverless for ad-hoc lake queries, or Databricks for production Spark + ML — all governed by Unity Catalog or Microsoft Purview. Once you internalize that mental model, every Azure DE question becomes "which service is the right tool for this job, and which integration runtime do I need?"

The four pillars of Azure data engineering.

  • Storage — ADLS Gen2 (hierarchical namespace over Azure Blob); the durable layer every engine reads from and writes to.
  • Orchestration + movement — Azure Data Factory (ADF); pipelines, activities, triggers, and three flavours of integration runtime.
  • Compute — Synapse Analytics (SQL warehouse + Spark + Serverless SQL) and Azure Databricks (Spark + Delta Lake); both anchor to the same ADLS Gen2 lake.
  • Governance + serve — Microsoft Purview / Unity Catalog for lineage and access control; Power BI and Microsoft Fabric for downstream consumption.

Why Azure is the second-largest cloud data platform.

  • Enterprise install base — every Microsoft 365 shop is already an Azure tenant; the path of least resistance for data workloads.
  • Synapse is the only "one workspace" experience — SQL warehouse + Spark + Serverless SQL + pipelines all in a single Studio.
  • Databricks on Azure is a first-party offering — billed through Azure, integrated with Azure AD, Azure Key Vault, and ADLS Gen2.
  • Microsoft Fabric is the new SaaS umbrella over OneLake, Synapse, Power BI, Data Activator, and Real-Time Analytics — increasingly the default for greenfield projects.
  • Hybrid is a first-class story — Self-hosted IRs in ADF reach on-prem SQL Server, Oracle, and SAP without standing up custom Kafka connectors.

What interviewers listen for in azure data engineer interview questions.

  • Do you know which integration runtime each ADF activity needs? — basic-but-tested.
  • Can you contrast Dedicated SQL Pool vs Serverless SQL Pool vs Spark Pool and pick the right one for the workload? — fluency signal.
  • Do you describe the medallion architecture (bronze → silver → gold) on ADLS Gen2 with Delta? — senior signal.
  • Do you mention Unity Catalog or Purview for governance (lineage, access control, row + column ACL)? — interview-canonical answer.

adls gen2 vs blob storage — the most-asked storage question.

  • ADLS Gen2 = Azure Blob + hierarchical namespace (HNS)/customers/year=2026/month=05/ is a real folder, not a flat key prefix.
  • Atomic directory operationsrename and delete on a directory work as one transaction; on plain blob they're per-file loops.
  • POSIX-style ACLs — fine-grained file + directory permissions for Hadoop / Spark compatibility.
  • The default for any analytics workload — Blob is fine for media and static assets; ADLS Gen2 is what every Azure DE pipeline writes to.

The Azure DE stack you'll see at most enterprises.

  • Sources — on-prem SQL Server, Salesforce, SAP, Cosmos DB, Event Hubs.
  • Ingestion + orchestration — ADF pipelines with Copy / Mapping Data Flow / Notebook activities.
  • Lake — ADLS Gen2 with a raw / curated / consumption (or bronze / silver / gold) zone layout.
  • Compute — Databricks for production Spark; Synapse Spark for embedded workloads; Synapse SQL for warehouse queries.
  • Governance — Microsoft Purview for catalog + lineage; Unity Catalog on Databricks side.
  • Serve — Synapse Dedicated, Power BI Direct Lake (Fabric), or REST APIs over Cosmos DB.

Worked example — a typical Azure data engineering pipeline at a glance

Detailed explanation. A realistic Azure DE pipeline glues all four pillars together: a source system drops files (or rows) into ADLS Gen2, ADF orchestrates the run, Databricks or Synapse transforms the data, the curated output lands back in ADLS Gen2 as Delta tables, and Power BI reads through Direct Lake or DirectQuery. Knowing the exact shape — and which service owns which step — is the senior-signal answer.

Question. A 50GB daily sales CSV lands in an on-prem SQL Server. Land it on ADLS Gen2, transform it with Spark, and surface per-region revenue to Power BI by 8 AM.

Pipeline shape (Azure data engineering).

Ingest   : Self-hosted IR + ADF Copy activity  →  ADLS Gen2 /bronze/sales/dt=2026-05-24/
Process  : ADF Notebook activity  →  Azure Databricks job
              Reads bronze Delta, cleans + joins, writes silver Delta
              Aggregates per-region, writes gold Delta /gold/revenue/dt=2026-05-24/
Govern   : Unity Catalog registers gold.revenue with row-level ACL by region
Serve    : Power BI Direct Lake reads gold.revenue from OneLake
Trigger  : ADF Schedule trigger at 06:00 UTC daily
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Self-hosted IR is installed on the on-prem network so ADF's Copy activity can reach SQL Server behind the firewall.
  2. Copy activity writes the day's slice to /bronze/sales/dt=2026-05-24/ on ADLS Gen2 as Parquet.
  3. Notebook activity invokes a Databricks job; the job reads bronze, applies cleaning + joining rules, writes /silver/sales_clean/ as a Delta table.
  4. A second notebook aggregates silver to /gold/revenue/ partitioned by region.
  5. Unity Catalog registers gold.revenue and applies a row filter so each regional analyst only sees their region.
  6. Power BI queries gold.revenue via Direct Lake — no data copy, sub-second response.
  7. ADF Schedule trigger runs the pipeline daily at 06:00 UTC; alerts fire on activity failure.

Rule of thumb: Every Azure DE pipeline maps cleanly onto ingest → store → transform → govern → serve. Know which Azure service owns each step and the architecture answer practically writes itself.

Python
Topic — etl
ETL Python drills

Practice →

Python
Topic — data-analysis
Azure lakehouse patterns

Practice →


2. Azure Data Factory — pipelines, activities, integration runtimes

Visual diagram of Azure Data Factory — a single 'Pipeline' card on top holding three activity tiles (Copy, Mapping Data Flow, Notebook); below it three 'Integration Runtime' cards in a row labelled Azure IR, Self-hosted IR and Azure-SSIS IR; arrows flowing from source connectors (SQL Server, REST, S3) on the left into the pipeline then out to sinks (ADLS Gen2, Synapse, Delta Lake) on the right, on a light PipeCode card.

azure data factory — orchestration, bulk movement, and three integration runtimes

azure data factory (ADF) is Azure's managed orchestration service — the canonical place to schedule, parameterize, and monitor data movement and transformation jobs. In one sentence: ADF organises work as Pipelines containing Activities; the actual compute that executes an Activity runs on an Integration Runtime (Azure IR for cloud-to-cloud, Self-hosted IR for hybrid, Azure-SSIS IR for legacy SSIS packages).

The four ADF primitives every interview tests.

  • Pipeline — a logical grouping of activities; the unit you schedule, run, and monitor.
  • Activity — one unit of work (Copy, Mapping Data Flow, Notebook, Stored Procedure, Web, ForEach, If Condition).
  • Linked service — a connection string + credentials to a data source or compute service.
  • Dataset — a typed reference to data inside a linked service (a table, a file path with format).

The three Integration Runtime types.

  • Azure IR — fully managed serverless compute for cloud-to-cloud activities (Copy from Blob to Synapse, Mapping Data Flow runs).
  • Self-hosted IR — software you install on a VM (cloud or on-prem) so ADF can reach behind a firewall (on-prem SQL Server, Oracle, SAP).
  • Azure-SSIS IR — lift-and-shift execution environment for legacy SQL Server Integration Services packages.

The activity types most teams use.

  • Copy activity — bulk movement; runs on Azure IR or Self-hosted IR; supports 90+ source/sink connectors.
  • Mapping Data Flow — visual ETL transform that compiles to a managed Spark cluster behind the scenes.
  • Notebook activity — invoke a Databricks or Synapse notebook with parameters.
  • Stored Procedure activity — call a procedure on SQL Server / Synapse Dedicated SQL Pool.
  • ForEach / If Condition / Until — control-flow activities for parameterized loops and branching.
  • Web activity — call a REST API; common for triggering external systems or fetching tokens.
  • Lookup activity — query a small result set into a variable for downstream conditionals.

Triggers — how pipelines start.

  • Schedule trigger — run at a wall-clock time (e.g. daily at 06:00 UTC).
  • Tumbling-window trigger — back-fillable, stateful interval (run for every hour since X with retry support).
  • Storage-event trigger — fire when a blob appears in or is deleted from an ADLS Gen2 path.
  • Custom-event trigger — fire from an Event Grid event (e.g. SAP iDoc arrived).
  • Manual trigger — kick off via the portal, REST API, or PowerShell.

adf vs databricks — the most-asked comparison.

  • ADF is orchestration + bulk movement; Databricks is compute + transformation.
  • You usually run them together — an ADF pipeline whose main activity is Databricks Notebook that runs the heavy Spark job.
  • For pure SQL-on-warehouse pipelines you might use only ADF + Stored Procedure activities on Synapse Dedicated.
  • Modern teams default to ADF for orchestration + Databricks for transform because Databricks has the richer notebook + cluster + Delta Lake experience.

Worked example — a basic ADF Copy activity pattern

Detailed explanation. The single most-used ADF pattern is a Copy activity that lands data from a source connector into ADLS Gen2 as Parquet. Combined with parameterisation, this one pattern powers thousands of "land raw data into the lake" pipelines.

Question. Land a daily snapshot of an on-prem dbo.orders SQL Server table into /bronze/orders/dt={ingest_date}/ on ADLS Gen2 as Parquet, then trigger a Databricks notebook that processes only that day's partition.

Input (pipeline parameters).

Parameter Example Used by
ingest_date 2026-05-24 Copy sink path + Notebook activity argument
source_table dbo.orders Copy source dataset
sink_root bronze/orders Copy sink path

Code (ADF pipeline JSON, abbreviated).

{
  "name": "pl_ingest_orders",
  "properties": {
    "parameters": {
      "ingest_date": { "type": "string" }
    },
    "activities": [
      {
        "name": "Copy_OrdersToBronze",
        "type": "Copy",
        "typeProperties": {
          "source": { "type": "SqlServerSource", "sqlReaderQuery":
            "SELECT * FROM dbo.orders WHERE CAST(updated_at AS DATE) = '@{pipeline().parameters.ingest_date}'" },
          "sink": { "type": "ParquetSink" }
        },
        "inputs": [{ "referenceName": "ds_sql_orders" }],
        "outputs": [{ "referenceName": "ds_adls_bronze_orders",
                     "parameters": { "ingest_date": "@pipeline().parameters.ingest_date" } }],
        "linkedServiceName": { "referenceName": "ls_self_hosted_ir" }
      },
      {
        "name": "Run_DatabricksNotebook",
        "type": "DatabricksNotebook",
        "dependsOn": [{ "activity": "Copy_OrdersToBronze", "dependencyConditions": ["Succeeded"] }],
        "typeProperties": {
          "notebookPath": "/jobs/bronze_to_silver_orders",
          "baseParameters": { "ingest_date": "@pipeline().parameters.ingest_date" }
        }
      }
    ]
  }
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Schedule trigger invokes the pipeline daily at 06:00 UTC with ingest_date set to today.
  2. Copy_OrdersToBronze runs on the Self-hosted IR so it can reach on-prem SQL Server through the firewall; it writes the day's slice to /bronze/orders/dt=2026-05-24/ as Parquet.
  3. Run_DatabricksNotebook depends on Succeeded of the copy; it invokes a Databricks notebook with ingest_date as a parameter.
  4. The notebook reads only dt=2026-05-24/, applies cleaning rules, writes /silver/orders_clean/.
  5. If anything fails, ADF retries (configurable per activity); on terminal failure it raises an alert through a downstream Web activity that posts to Teams / PagerDuty.

Output (run summary surfaced in ADF monitoring).

Activity Duration Rows Status
Copy_OrdersToBronze 00:04:12 1,250,438 Succeeded
Run_DatabricksNotebook 00:07:48 Succeeded

Rule of thumb: Parameterize the date and the source-table name; never hard-code paths inside the activity body. Every interviewer probes "how do you handle backfills?" — parameterisation + the tumbling-window trigger is the canonical answer.

Interview problem on Azure Data Factory — designing a parameterized, retryable ingestion pipeline

Question. Design an ADF pipeline that ingests a list of 20 on-prem SQL Server tables into ADLS Gen2 as daily Parquet snapshots. The list of tables must be configuration-driven (so adding a 21st table requires no code change), each table writes to its own bronze folder, failures retry up to 3 times, and a single failed table must not block the other 19.

Solution Using ForEach with isolated failure handling

Code.

{
  "name": "pl_ingest_all_tables",
  "properties": {
    "activities": [
      {
        "name": "Lookup_TableList",
        "type": "Lookup",
        "typeProperties": {
          "source": { "type": "JsonSource" },
          "dataset": { "referenceName": "ds_config_tables_json" },
          "firstRowOnly": false
        }
      },
      {
        "name": "ForEach_Table",
        "type": "ForEach",
        "dependsOn": [{ "activity": "Lookup_TableList",
                        "dependencyConditions": ["Succeeded"] }],
        "typeProperties": {
          "items": "@activity('Lookup_TableList').output.value",
          "isSequential": false,
          "batchCount": 5,
          "activities": [
            {
              "name": "Copy_OneTable",
              "type": "Copy",
              "policy": {
                "retry": 3,
                "retryIntervalInSeconds": 60,
                "timeout": "01:00:00"
              },
              "typeProperties": {
                "source": { "type": "SqlServerSource", "sqlReaderQuery":
                  "SELECT * FROM @{item().schema}.@{item().table}" },
                "sink": { "type": "ParquetSink" }
              },
              "inputs": [{ "referenceName": "ds_sql_dynamic",
                           "parameters": { "schema": "@item().schema",
                                          "table":  "@item().table" } }],
              "outputs": [{ "referenceName": "ds_adls_bronze_dynamic",
                            "parameters": { "table": "@item().table",
                                            "ingest_date": "@pipeline().parameters.ingest_date" } }]
            }
          ]
        }
      }
    ]
  }
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Input — config/tables.json on ADLS Gen2:

schema table
dbo orders
dbo customers
sales regions
... ... (20 rows total)
  1. Lookup_TableList reads the JSON config into output.value (an array of 20 {schema, table} records).
  2. ForEach_Table iterates with isSequential=false, batchCount=5 — meaning 5 tables run in parallel, then the next 5, etc.
  3. For each item, Copy_OneTable issues SELECT * FROM {schema}.{table} and writes to /bronze/{table}/dt={ingest_date}/.
  4. retry: 3, retryIntervalInSeconds: 60 — a single transient failure (network blip, momentary SQL deadlock) auto-retries up to 3 times with 60-second backoff before marking the activity Failed.
  5. Because each Copy is inside the ForEach, one table's terminal failure marks only that iteration as Failed; the outer ForEach reports per-item success/failure but continues. The pipeline finishes with mixed status — failed tables are visible in monitoring without dragging down the 19 successful ones.

Output:

Table Status Rows Retries used
dbo.orders Succeeded 1,250,438 0
dbo.customers Succeeded 412,003 1
sales.regions Failed 3
... (17 more Succeeded)

Why this works — concept by concept:

  • Lookup activity turns a JSON / SQL config into a runtime array; adding a 21st table means editing tables.json, not the pipeline — the configuration-driven requirement is satisfied.
  • ForEach with batchCount parallelises across items while capping fan-out; without batchCount, you'd hammer the source SQL Server with 20 simultaneous connections.
  • Activity-level retry policy isolates transient failures from terminal ones; the 60-second backoff handles network blips without page-flooding.
  • Per-iteration error isolation is the ForEach default — a single bad item doesn't halt the others, which matches the "must not block the other 19" requirement.
  • Self-hosted IR is the integration runtime that gives ADF reach across the on-prem firewall; without it, every Copy fails before retry can even help.
  • Cost — Azure IR is billed per data-movement DIU-hour for Copy; Self-hosted IR is free (you pay only for the host VM); ForEach control-flow itself is metered per activity run at fractions of a cent.

Python
Topic — etl
ADF / orchestration drills

Practice →

Python
Topic — data-manipulation
Bulk ingestion patterns

Practice →


3. Azure Synapse Analytics — dedicated, serverless, Spark on one workspace

Visual diagram of Azure Synapse Analytics — a 'Synapse Workspace' card at the top spanning the whole width; below it three engine cards side-by-side labelled 'Dedicated SQL Pool (MPP)', 'Serverless SQL Pool', and 'Spark Pool'; dedicated pool shows 60 distribution slots in a tiny grid; serverless shows pay-per-query meter; Spark shows worker nodes; everything anchored to an ADLS Gen2 layer at the bottom; on a light PipeCode card.

azure synapse analytics — three compute engines fused with the lake under one Studio

azure synapse analytics is Azure's unified analytics workspace. The senior answer in one sentence: Synapse is a single workspace that exposes three independent compute engines — Dedicated SQL Pool (MPP warehouse), Serverless SQL Pool (pay-per-query lake SQL), and Spark Pool (managed Spark) — all reading and writing the same ADLS Gen2 lake, with embedded ADF-style pipelines for orchestration.

The three Synapse compute engines.

  • Dedicated SQL Pool (formerly SQL DW) — Massively Parallel Processing (MPP) data warehouse; you provision DWUs and pay by the hour; 60 distribution slots; ideal for high-concurrency reporting workloads.
  • Serverless SQL Pool — fully serverless T-SQL over files in ADLS Gen2; you pay per TB scanned ($5/TB); ideal for ad-hoc lake exploration and OPENROWSET queries.
  • Spark Pool — managed Apache Spark with auto-pause / auto-scale; ideal for embedded Spark inside the Synapse experience (where Databricks would be overkill).

Dedicated SQL Pool — the MPP warehouse.

  • DWU sizing — DW100c (small) to DW30000c (huge); pause when not in use to avoid charges.
  • 60 distribution slots — every table physically lives across 60 distributions; the distribution strategy determines join + aggregate performance.
  • Distribution typesHASH (column) for large fact tables, ROUND_ROBIN for staging tables, REPLICATE for small dim tables (< 2GB).
  • PolyBase + COPY INTO — bulk load from ADLS Gen2 / Blob into Dedicated tables.
  • Result-set caching — repeat queries return from cache in milliseconds (configurable).
  • Workload management — workload groups + classifiers for multi-tenant priority + concurrency control.

Serverless SQL Pool — pay-per-query lake SQL.

  • No provisioned compute — query files directly with OPENROWSET('parquet'/'csv'/'delta').
  • External tables — define a schema once, query like a real table.
  • Per-TB-scanned billing — partition pruning + Parquet's columnar layout keep cost low.
  • Use case — ad-hoc exploration, low-volume serving, joining lake data to small reference tables.
  • Limitation — no INSERT / UPDATE / DELETE (read-only on the lake); use CETAS (CREATE EXTERNAL TABLE AS SELECT) to materialise transformations.

Spark Pool — embedded managed Spark.

  • Auto-pause — idle for N minutes? cluster shuts down automatically.
  • Auto-scale — min / max node range; cluster grows under load.
  • PySpark / Scala / .NET / SparkSQL notebooks.
  • Synapse Link — near-real-time replication from Cosmos DB / Dataverse / SQL into Synapse for analytics.
  • When to pick it over Databricks — when your team lives in Synapse Studio and the Spark workload is moderate; Databricks wins for production lakehouse + ML.

Distribution strategies on Dedicated — the canonical interview question.

Strategy When to use Notes
HASH (column) Large fact tables (> 2GB) Picks the same column as the most-common join key
ROUND_ROBIN Staging tables, no good hash key Even distribution; bad for joins
REPLICATE Small dimension tables (< 2GB) Full copy on every distribution; zero shuffle

synapse vs databricks — the senior comparison.

  • Synapse — one Microsoft-managed workspace; SQL-first audience; Dedicated SQL Pool gives true warehouse semantics.
  • Databricks — Spark-first; better notebooks; richer Delta Lake + Unity Catalog ecosystem; preferred for ML / data science.
  • Hybrid is common — Synapse Serverless for ad-hoc lake SQL, Databricks for production Spark.
  • Microsoft Fabric is positioned as Synapse's evolution — SaaS lakehouse over OneLake.

Worked example — querying ADLS Gen2 Parquet with Serverless SQL

Detailed explanation. Serverless SQL Pool's killer feature is querying lake files with zero provisioning. You point at a Parquet path, define the schema (or let Synapse infer it), and write T-SQL. The cost is "per TB scanned" so partition pruning is your friend.

Question. A daily Parquet drop lives at /curated/sales/year=2026/month=05/day=*/*.parquet on ADLS Gen2. Return the top 5 regions by revenue for the current month using Serverless SQL.

Input. Parquet files with schema (order_id INT, region VARCHAR(50), customer_id INT, amount DECIMAL(18,2), order_date DATE).

Code.

SELECT TOP 5
    region,
    SUM(amount) AS revenue
FROM
    OPENROWSET(
        BULK 'https://pclake.dfs.core.windows.net/curated/sales/year=2026/month=05/day=*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS s
GROUP BY region
ORDER BY revenue DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. OPENROWSET(BULK ..., FORMAT='PARQUET') opens the file set as a virtual table; the wildcard expands to every matching day-partition file under May 2026.
  2. Schema inference reads the Parquet column metadata so you don't have to declare it (you can override with WITH (...)).
  3. Partition pruning — Serverless reads only files under year=2026/month=05/; February partitions are never scanned, keeping the bill low.
  4. Columnar push-down — only region and amount are physically read from each Parquet file; order_id, customer_id, order_date are skipped.
  5. GROUP BY region + SUM runs distributed across Serverless workers; results stream back ordered.

Output.

region revenue
North America 18,402,113.55
EMEA 12,001,884.20
APAC 9,317,442.10
LATAM 4,128,005.77
MEA 2,001,338.42

Rule of thumb: Always include a partition predicate when querying Serverless; without one, you pay to scan the entire history. For repeated reads of the same aggregate, CETAS into a curated external table.

Interview problem on Synapse — choosing the right distribution for a star-schema warehouse

Question. You're loading a 200GB fact_orders table and four dimension tables (dim_customer 6GB, dim_product 1.2GB, dim_date 50MB, dim_region 5MB) into a Synapse Dedicated SQL Pool. The most common query joins fact_orders to all four dims by their natural key and aggregates revenue by customer_id. Pick the distribution strategy for each table and justify the choice.

Solution Using HASH on fact + REPLICATE on small dims

Code.

CREATE TABLE fact_orders (
    order_id        BIGINT       NOT NULL,
    customer_id     BIGINT       NOT NULL,
    product_id      BIGINT       NOT NULL,
    date_id         INT          NOT NULL,
    region_id       INT          NOT NULL,
    revenue         DECIMAL(18,2) NOT NULL
)
WITH (
    DISTRIBUTION = HASH (customer_id),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (date_id RANGE RIGHT FOR VALUES (20260101, 20260201, 20260301, 20260401, 20260501))
);

CREATE TABLE dim_customer ( customer_id BIGINT, ... )
WITH ( DISTRIBUTION = HASH (customer_id), CLUSTERED COLUMNSTORE INDEX );

CREATE TABLE dim_product  ( product_id BIGINT, ... )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX );

CREATE TABLE dim_date     ( date_id INT, ... )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX );

CREATE TABLE dim_region   ( region_id INT, ... )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX );
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Input — the warehouse query:

SELECT c.customer_name, SUM(f.revenue) AS revenue
FROM fact_orders f
JOIN dim_customer c ON f.customer_id = c.customer_id
JOIN dim_product  p ON f.product_id  = p.product_id
JOIN dim_date     d ON f.date_id     = d.date_id
JOIN dim_region   r ON f.region_id   = r.region_id
WHERE d.year = 2026
GROUP BY c.customer_name;
Enter fullscreen mode Exit fullscreen mode
  1. fact_orders is HASH-distributed on customer_id — the 200GB table is sharded across 60 distributions by hash(customer_id) % 60. Rows for the same customer always land on the same distribution.
  2. dim_customer is also HASH-distributed on customer_id — the join fact_orders ⋈ dim_customer ON customer_id is now distribution-local; every distribution joins its own fact slice to its own dim slice with zero shuffle.
  3. dim_product, dim_date, dim_region are REPLICATEd — each is < 2GB, so a full copy lives on every distribution. Joins to them are also local; no shuffle, no broadcast.
  4. GROUP BY c.customer_name is partially aggregated locally (each distribution computes its slice's sum), then a single shuffle returns the 60 partials to one node for final aggregation.
  5. Partitioning fact_orders by date_id lets the WHERE d.year = 2026 predicate prune entire partitions on the fact table; Synapse skips reading 2024 / 2025 data altogether.

Output:

customer_name revenue
Contoso Corp 1,240,113.55
Adventure Works 884,201.10
Fabrikam 612,778.20
... ...

Why this works — concept by concept:

  • HASH-on-join-key — colocating the fact and its largest dimension on the same hash column makes the join distribution-local; no shuffle is the #1 lever for MPP warehouse performance.
  • REPLICATE for small dims — copying a 5MB dim onto every distribution costs ~300MB cluster-wide (60 × 5MB) but eliminates 100% of the shuffle on every query that joins it. The break-even is roughly 2GB.
  • Clustered Columnstore Index — Synapse's default; gives 5–10× compression and lets the engine read only the columns the query references (revenue, customer_name).
  • Partitioning the fact by date — partition elimination lets the engine skip cold partitions entirely; the 2026-only filter scans roughly 1/3 of the rows for a 3-year fact.
  • Distribution choice is the most-asked Synapse interview question — "HASH the fact on the most common join key; REPLICATE dims under 2GB; ROUND_ROBIN only for unloaded staging" is the canonical answer.
  • Cost — DWU usage stays bounded; the query plan touches only ~33% of the fact and zero shuffle for 4 of 5 joins; pause the pool overnight to drop cost to zero.

Python
Topic — etl
Synapse / warehouse drills

Practice →

Python
Topic — aggregation
MPP aggregation patterns

Practice →


4. Azure Databricks — clusters, Delta Lake, Unity Catalog

Visual diagram of Azure Databricks — a 'Workspace' card at the top with a notebook icon; below it a 'Cluster' card showing one driver node and four worker nodes; arrows from the cluster into a Delta Lake medallion stack (bronze → silver → gold tables) on ADLS Gen2; a 'Unity Catalog' governance card on the right wraps tables with access rules; on a light PipeCode card.

azure databricks — clusters run the compute, Delta Lake stores the truth, Unity Catalog rules access

azure databricks is the most-deployed lakehouse engine on Azure. The senior answer in one sentence: Databricks gives you managed Spark clusters that read and write Delta Lake tables on ADLS Gen2, with Unity Catalog providing centralised metadata + access control across workspaces — the foundation of every Azure medallion (bronze → silver → gold) lakehouse architecture.

The four Databricks primitives every interview tests.

  • Workspace — the UI + asset container (notebooks, jobs, repos, libraries, dashboards).
  • Cluster — one driver node + N worker nodes running a specific Databricks Runtime version (Spark + Photon + libraries).
  • Delta Lake — the table format (Parquet files + a transaction log) that gives you ACID transactions on the lake.
  • Unity Catalog — the cross-workspace metastore for tables, views, models, lineage, and ACLs.

Cluster types — pick the right one.

  • All-purpose cluster — interactive notebook work; multiple users; auto-terminate after N minutes idle.
  • Job cluster — spun up for one job, killed when the job ends; cheapest for scheduled work.
  • SQL warehouse (formerly SQL Endpoint) — auto-scaling cluster optimised for BI / dashboard queries.
  • Photon runtime — Databricks' vectorised C++ query engine; 2–3× faster on SQL + DataFrame workloads.

Delta Lake — the lakehouse format.

  • Parquet + a JSON transaction log in _delta_log/; every write appends a new log entry describing added / removed files.
  • ACID transactions — concurrent writers can't corrupt the table; snapshot isolation on reads.
  • Time travelSELECT * FROM t VERSION AS OF 17 or TIMESTAMP AS OF '2026-05-01'.
  • MERGE INTO — upsert / SCD2 in one statement.
  • OPTIMIZE + Z-ORDER — compact small files and physically cluster by a high-cardinality column for fast point reads.
  • Schema enforcement + evolution — new columns must be opted in via mergeSchema=true.

Medallion architecture (bronze → silver → gold).

  • Bronze — raw, append-only landing zone; one table per source; minimal transformation.
  • Silver — cleaned, conformed, deduplicated; joined to relevant reference data.
  • Gold — business-ready aggregates; the layer Power BI / serving APIs read.
  • Why three layers — each one isolates a class of concerns (ingestion, conformance, business logic); failures are localised, reprocessing is cheap.

Unity Catalog — the governance layer.

  • Three-level namespacecatalog.schema.table (replaces the two-level Hive metastore).
  • Metastore-per-region — one metastore covers many workspaces in one region.
  • Lineage — automatically captured across notebooks, jobs, and dashboards.
  • Access control — table, row, and column-level ACLs via GRANT statements; integrates with Azure AD groups.
  • Delta Sharing — open protocol for sharing tables across orgs without copying data.

Auto Loader — the streaming-ingest workhorse.

  • cloudFiles source — incremental file ingest from ADLS Gen2; uses a checkpoint to avoid reprocessing.
  • Schema inference + evolution built in.
  • Trigger.AvailableNow — micro-batch mode that processes all new files since the last checkpoint, then exits — perfect for ADF-orchestrated runs.

Worked example — incremental bronze → silver MERGE with Delta

Detailed explanation. The single most-used Databricks pattern is incremental upsert from a bronze staging table into a silver curated table. MERGE INTO makes this one statement that handles inserts, updates, and (optionally) deletes.

Question. Bronze bronze.orders_landing receives a daily CDC slice with new + changed rows. Upsert into silver.orders keyed by order_id, keeping the latest updated_at per key, and soft-deleting rows flagged is_deleted = true.

Input. Bronze sample for dt=2026-05-24:

order_id customer_id amount updated_at is_deleted
101 22 199.00 2026-05-24 09:00 false
102 22 49.00 2026-05-24 09:05 false
100 21 0 2026-05-24 09:07 true

Existing silver:

order_id customer_id amount updated_at is_deleted
100 21 99.00 2026-05-23 18:30 false
101 22 150.00 2026-05-23 18:35 false

Code.

from pyspark.sql.functions import col

bronze = spark.read.format("delta").load("/bronze/orders_landing").where("dt = '2026-05-24'")
bronze.createOrReplaceTempView("bronze_today")

spark.sql("""
MERGE INTO silver.orders AS t
USING (
    SELECT * FROM bronze_today
) AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
    UPDATE SET *
WHEN NOT MATCHED AND s.is_deleted = false THEN
    INSERT *
""")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Read the day's bronze slice with partition predicate dt = '2026-05-24' — partition pruning means only that day's files are read.
  2. MERGE INTO silver.orders AS t USING bronze_today AS s opens a single ACID transaction against the silver Delta table.
  3. ON t.order_id = s.order_id identifies the matching key.
  4. WHEN MATCHED AND s.updated_at > t.updated_at — only newer rows overwrite; out-of-order CDC events don't downgrade a row.
  5. WHEN NOT MATCHED AND s.is_deleted = false — new live rows insert; new deletions are skipped (no zombie row gets created).
  6. Delta writes new Parquet files for changed rows + appends a transaction-log entry; readers continue to see the pre-merge snapshot until the txn commits.

Output (silver after merge).

order_id customer_id amount updated_at is_deleted
100 21 99.00 2026-05-23 18:30 false
101 22 199.00 2026-05-24 09:00 false
102 22 49.00 2026-05-24 09:05 false

Rule of thumb: Always guard WHEN MATCHED with a updated_at comparison; otherwise an out-of-order replay overwrites newer state with older.

Interview problem on Databricks — designing a streaming bronze ingestor with Auto Loader

Question. Files land in /raw/events/ on ADLS Gen2 — about 5K files per day, ~1MB each. Build a near-real-time bronze ingestor that processes each new file exactly once, handles schema drift gracefully, and writes to a Delta table that downstream silver jobs can MERGE from. The job should be re-runnable, restartable, and survive cluster restarts.

Solution Using Auto Loader + Trigger.AvailableNow + checkpoint

Code.

from pyspark.sql.functions import col, current_timestamp, input_file_name

raw_path        = "abfss://landing@pclake.dfs.core.windows.net/raw/events/"
bronze_path     = "abfss://lake@pclake.dfs.core.windows.net/bronze/events/"
checkpoint_path = "abfss://lake@pclake.dfs.core.windows.net/_chk/bronze_events/"
schema_path     = "abfss://lake@pclake.dfs.core.windows.net/_schema/bronze_events/"

bronze_stream = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", schema_path)
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
        .load(raw_path)
        .withColumn("ingest_ts", current_timestamp())
        .withColumn("source_file", input_file_name())
)

(
    bronze_stream.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        .option("mergeSchema", "true")
        .trigger(availableNow=True)
        .toTable("bronze.events")
)
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Input — bronze input directory shape:

Day New files Avg size
2026-05-24 5,200 1.1 MB
2026-05-25 4,980 1.0 MB
2026-05-26 5,310 1.2 MB
  1. cloudFiles (Auto Loader) uses Azure Event Grid + Queue Storage (or directory listing) to detect newly arrived JSON files since the last checkpoint.
  2. schemaLocation persists the inferred schema; addNewColumns evolution mode lets a brand-new field appear in tomorrow's JSON without breaking today's read.
  3. withColumn("ingest_ts", ...) stamps each row with arrival time; input_file_name() records source provenance for debugging.
  4. writeStream.trigger(availableNow=True) — micro-batch mode that processes all unseen files in one pass, then exits. ADF can call this job in a daily Notebook activity without leaving a streaming cluster running 24/7.
  5. checkpointLocation is the source-of-truth for "which files have I already consumed?" — restarting the job re-reads from the checkpoint, guaranteeing exactly-once write semantics.
  6. mergeSchema=true lets the silver table absorb the new column on commit; the Delta log records the schema change.

Output:

Metric Value
Files consumed this run 5,200
Rows written to bronze.events 1,820,331
New schema columns detected 1 (device_os_build)
Run duration 00:11:42
Checkpoint advanced to offset 5,200 / file 2026-05-24-23-59-58.json

Why this works — concept by concept:

  • Auto Loader cloudFiles is the right primitive for "small files in a directory" — it avoids the listing storm you'd hit by spark.read.json(...) on the whole folder daily.
  • Trigger.AvailableNow is the killer combo with ADF — you get streaming-style exactly-once semantics with batch-style "run, finish, release the cluster" cost.
  • Schema location + addNewColumns turns schema drift from a job failure into a logged event; the silver layer can choose how to deal with the new column.
  • Checkpoint location owns exactly-once; without it, every restart double-processes everything and you build a duplicate-handling layer downstream.
  • Delta mergeSchema=true lets the destination table evolve in lockstep with the source; the transaction log records the schema change with a version stamp you can rewind to.
  • Cost — a job cluster billed only while availableNow is running is dramatically cheaper than a 24/7 structured-streaming cluster; for 5K small files/day you'll typically spend < $1/run.

Python
Topic — etl
Databricks / Spark drills

Practice →

Python
Topic — streaming
Auto Loader / streaming patterns

Practice →


5. The Azure data platform — storage, ingestion, compute, governance + serve

Visual layered stack of the Azure data platform — bottom 'Storage' layer with ADLS Gen2 / Azure SQL / Cosmos DB icons; next 'Ingestion + Streaming' layer with Event Hubs / IoT Hub / ADF; next 'Compute + Transform' layer with Databricks / Synapse Spark / Stream Analytics; top 'Governance + Serve' layer with Purview / Key Vault / Power BI as small pill labels — a clean stratified infographic, on a light PipeCode card.

The four-layer Azure data platform every modern team ships on

Beyond the big three (ADF, Synapse, Databricks), Azure ships dozens of data services. Senior azure data engineer interview questions test fluency with the layered architecture and which service belongs where. The mental model: storage → ingestion → compute → governance + serve — four strata, each with two-to-four services.

Layer 1 — Storage.

  • ADLS Gen2 — the default lake; hierarchical namespace, ACLs, optimised for analytics.
  • Azure SQL DB / Managed Instance — fully managed OLTP; the source of truth for transactional apps.
  • Cosmos DB — globally distributed, multi-model NoSQL (document, key-value, graph, column); 5 consistency levels; the choice for high-volume operational APIs.
  • Azure Blob Storage — object store for media, backups, static assets (ADLS Gen2 is built on it).

Layer 2 — Ingestion + Streaming.

  • Azure Data Factory — orchestration + bulk movement (covered in §2).
  • Event Hubs — high-throughput event ingest; Kafka-protocol compatible; the front door for streaming pipelines.
  • IoT Hub — Event Hubs + device-management features (bi-directional comms, device-twin state) for IoT workloads.
  • Service Bus — enterprise message bus for app-to-app integration (FIFO, dead-letter queues, transactions).

Layer 3 — Compute + Transform.

  • Azure Databricks — the lakehouse + ML default (covered in §4).
  • Synapse Spark / Synapse SQL — embedded compute in the Synapse workspace (covered in §3).
  • Azure Stream Analytics — managed SQL-on-streaming over Event Hubs / IoT Hub; great for sub-second windowed aggregates without a Spark cluster.
  • Azure Functions — serverless compute; the glue for lightweight transforms triggered by Event Grid / Service Bus / HTTP.

Layer 4 — Governance + Serve.

  • Microsoft Purview — enterprise data catalog + lineage + sensitivity classification; covers all Azure data services.
  • Azure Key Vault — secrets / certificates / keys; every linked service in ADF should pull credentials from Key Vault, never inline.
  • Power BI — the default BI tool; Direct Lake mode (in Fabric) reads OneLake parquet without a copy.
  • Microsoft Fabric — SaaS umbrella with OneLake at the centre; Synapse + Power BI + Data Activator + Real-Time Analytics in one experience.

The modern Azure lakehouse — putting it together.

  • Storage — ADLS Gen2 with bronze / silver / gold zones; Delta as the table format.
  • Ingestion — Event Hubs for streams, ADF for batch; both write to bronze.
  • Compute — Databricks for production Spark, Synapse SQL for warehouse queries, Stream Analytics for low-latency windowed aggregates.
  • Governance — Unity Catalog (Databricks side) + Purview (cross-service catalog + lineage); Key Vault for all secrets.
  • Serve — Power BI Direct Lake for dashboards, Cosmos DB or Synapse Dedicated for low-latency APIs.

Microsoft Fabric — what changed in 2024 and what it means for azure data engineering interviews.

  • OneLake is "the OneDrive for data" — one logical lake across the tenant; backed by ADLS Gen2 under the hood.
  • Lakehouse experience natively uses Delta; same medallion architecture as Databricks.
  • Direct Lake in Power BI reads OneLake parquet without copying to a tabular model.
  • Real-Time Intelligence rolls Stream Analytics + KQL together for sub-second analytics.
  • Interview signal — mentioning Fabric without ditching Databricks/Synapse is the senior answer; Fabric complements rather than replaces in most enterprises.

Reference architecture cheat sheet.

Pattern Stack
Daily batch warehouse ADF + Databricks + Delta + Synapse Dedicated + Power BI
Near-real-time analytics Event Hubs + Stream Analytics + Cosmos DB + Power BI
Lakehouse with ML ADF + Databricks + Delta + Unity Catalog + MLflow + Power BI
Fabric-native OneLake + Fabric Lakehouse + Power BI Direct Lake
Hybrid (on-prem source) Self-hosted IR + ADF + ADLS Gen2 + Databricks

Worked example — picking the right compute service for a workload

Detailed explanation. Every Azure DE interview eventually asks "which compute would you pick and why?" — the answer is a decision tree over four axes: workload type (batch vs streaming), expected SLA (sub-second vs daily), data size, and team skillset (SQL-first vs Spark-first).

Question. You have four workloads. For each, name the Azure compute service you'd reach for first.

Workload Compute service Why
Daily 200GB warehouse refresh, high BI concurrency Synapse Dedicated SQL Pool MPP + result-set cache + workload mgmt
Sub-second tumbling-window aggregate over 100K events/sec Azure Stream Analytics Managed streaming SQL; no Spark cluster mgmt
Production lakehouse ETL + ML training Azure Databricks Spark + Delta + MLflow + Unity Catalog
Ad-hoc "select count(*) from this raw parquet" exploration Synapse Serverless SQL Pool Pay-per-TB; zero provisioning

Step-by-step explanation.

  1. Synapse Dedicated wins on predictable warehouse workloads with many concurrent BI users — the result-set cache + workload groups give you the SLA and tenant isolation.
  2. Stream Analytics wins on sub-second streaming SQL because it's fully managed and you don't pay for a Spark cluster idle between bursts.
  3. Databricks wins for production lakehouse + ML because of Delta + Unity Catalog + MLflow + Photon — the deepest ecosystem.
  4. Synapse Serverless wins for ad-hoc lake exploration because there's no compute to provision; you pay $5/TB scanned.

Rule of thumb: Match workload to the service that's purpose-built for it; resist using one service for everything just because it's already paid for. The cost difference between the right and wrong tool can be 10×.

Interview problem on the Azure ecosystem — design a near-real-time + batch lakehouse

Question. Design an Azure data platform for an e-commerce company that needs (a) sub-minute clickstream dashboards in Power BI, (b) daily revenue and inventory reports for finance, (c) ML feature engineering on clickstream + transactions, and (d) governed access to PII for the data-science team. Name every Azure service, the data flow, and the governance layer.

Solution Using a layered Fabric-friendly lakehouse

Code (architecture as Bicep-style pseudo-spec).

storage:
  - adls_gen2: { name: pclake, zones: [bronze, silver, gold] }
ingestion:
  streaming:
    - event_hubs: { name: clicks, partitions: 32, retention_h: 72 }
  batch:
    - adf: { pipelines: [pl_ingest_transactions_sqlserver,
                         pl_ingest_inventory_oracle] }
compute:
  streaming:
    - stream_analytics: { input: event_hubs.clicks,
                          output: cosmosdb.live_dashboard }
  lakehouse:
    - databricks: { jobs: [auto_loader_bronze_clicks,
                           silver_clicks_dedup,
                           gold_revenue_daily,
                           feature_store_train_ready] }
  warehouse:
    - synapse_dedicated: { dwu: DW400c, tables: [fact_orders, dim_*] }
governance:
  - unity_catalog: { metastore: pclake-east-us }
  - purview: { catalog: pccatalog }
  - key_vault: { secrets: [sql_conn, oracle_conn, eventhubs_conn] }
serve:
  - power_bi: { mode: direct_lake, dataset: gold.revenue_daily }
  - cosmosdb: { db: live, container: dashboard_aggregates }
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Input — the four requirements:

# Requirement Maps to
a Sub-minute clickstream dashboards Event Hubs → Stream Analytics → Cosmos DB → Power BI
b Daily revenue + inventory reports ADF → ADLS Gen2 → Databricks → Synapse Dedicated → Power BI
c ML feature engineering Databricks + Unity Catalog + MLflow
d Governed PII access Unity Catalog row + column ACL + Purview classification
  1. Clickstream lands on Event Hubs (Kafka-protocol compatible, 32 partitions for parallelism). One branch flows to Stream Analytics which writes sub-minute windowed aggregates to Cosmos DB; Power BI hits Cosmos with DirectQuery for live dashboards.
  2. The same Event Hubs partition also feeds an Auto Loader bronze job in Databricks that lands raw clicks in /bronze/clicks/ as Delta.
  3. ADF runs nightly pipelines that ingest transactions from on-prem SQL Server (Self-hosted IR) and inventory from Oracle, dropping each to its own bronze folder.
  4. Databricks silver jobs dedupe + conform; gold jobs aggregate revenue / inventory; gold.revenue_daily is loaded into Synapse Dedicated for high-concurrency BI.
  5. Power BI Direct Lake reads gold.revenue_daily from OneLake (Fabric); finance dashboards refresh nightly with sub-second open times.
  6. MLflow in Databricks trains the recommender model on the joined gold features; the feature table is governed by Unity Catalog.
  7. Unity Catalog row-level filter masks customer PII columns for the data-science group; Purview auto-classifies new columns as PII based on patterns; Key Vault holds every credential — no inline secrets anywhere in ADF.

Output (operational outcome).

KPI Target Achieved
Clickstream dashboard latency < 60s 38s
Daily revenue report freshness < 8h 4.5h
Feature-store SLA < 24h 12h
PII data exfiltration risk Zero Zero — UC ACL blocks

Why this works — concept by concept:

  • Two-track ingestion — streaming (Event Hubs → Stream Analytics) and batch (ADF) flow into the same lake; downstream consumers see one source of truth regardless of arrival mode.
  • Medallion lakehouse — bronze/silver/gold isolates ingestion failures from business-logic failures from serving failures; reprocessing is cheap because each layer is idempotent.
  • Right tool per workload — Stream Analytics for sub-second streaming SQL, Databricks for ML + Spark, Synapse for warehouse BI; one service per job class keeps cost and complexity bounded.
  • Unity Catalog + Purview — UC owns the active row + column access decisions; Purview owns the cross-service catalog + classification; the two complement each other.
  • Key Vault everywhere — every linked service references a Key Vault secret URL; rotating a password becomes a single Key Vault update with no pipeline code change.
  • Cost — Stream Analytics is metered per Streaming Unit (cheap for steady throughput); Databricks job clusters auto-terminate; Synapse Dedicated pauses outside business hours; Power BI Direct Lake removes the cost of a materialised tabular model.

Python
Topic — etl
Azure platform drills

Practice →

Python
Topic — real-time-analytics
Streaming + lakehouse patterns

Practice →


Choosing the right Azure service (cheat sheet)

A one-screen cheat sheet for the most-asked azure data engineering decisions.

You want to … Azure service Notes
Orchestrate a daily batch Azure Data Factory Pipelines + Schedule trigger
Reach an on-prem source Self-hosted IR + ADF Self-hosted IR runs the Copy on your VM
Land raw files on the lake ADLS Gen2 + ADF Copy / Auto Loader Use HNS containers
Build a star-schema warehouse Synapse Dedicated SQL Pool HASH on join key; REPLICATE small dims
Ad-hoc lake SQL Synapse Serverless SQL Pool $5/TB scanned; partition pruning matters
Embedded Spark in Synapse Synapse Spark Pool Auto-pause; for moderate Spark workloads
Production lakehouse Spark Azure Databricks Delta + Unity Catalog + Photon
Upsert into a Delta table MERGE INTO on Databricks / Synapse Guard with updated_at for CDC
Compact Delta small files OPTIMIZE table (+ Z-ORDER) Run nightly on hot tables
Travel back in time SELECT ... VERSION AS OF n Delta time-travel
Stream high-throughput events Event Hubs (Kafka-compatible) 32+ partitions for parallelism
Sub-second streaming SQL Azure Stream Analytics Tumbling / hopping / sliding windows
Workflow / app messaging Azure Service Bus Queues, topics, dead-letter
Serverless glue Azure Functions HTTP / Event Grid / Timer triggers
Catalog + lineage across services Microsoft Purview Pulls from ADF, Synapse, SQL, Databricks
Table + row + column ACL Unity Catalog (Databricks) catalog.schema.table three-level namespace
Store secrets Azure Key Vault Linked-service URLs reference KV
BI dashboards Power BI Direct Lake (Fabric) for no-copy reads
New SaaS lakehouse (greenfield) Microsoft Fabric OneLake + Direct Lake + native Delta

Frequently asked questions

What is Azure data engineering and which services should I learn first?

azure data engineering is the practice of building data pipelines on Microsoft Azure — moving data from sources (on-prem databases, SaaS apps, event streams) into a governed lake or warehouse, transforming it, and serving it to analytics and ML consumers. The three foundational services to learn first are Azure Data Factory (ADF) for orchestration and bulk movement, Azure Synapse Analytics for the unified SQL + Spark workspace, and Azure Databricks for production lakehouse compute on Delta Lake. Beneath them sits ADLS Gen2 — the hierarchical-namespace storage layer every service reads and writes — and around them sits the governance layer of Microsoft Purview, Unity Catalog, and Azure Key Vault. If you're shipping new pipelines in 2026, learn ADF + Databricks + ADLS Gen2 first; Synapse is essential when an enterprise BI workload demands the Dedicated SQL Pool's MPP semantics, and Microsoft Fabric is increasingly the default for greenfield projects.

What's the difference between Azure Synapse Analytics and Azure Databricks?

Synapse is a unified workspace fusing three compute engines — Dedicated SQL Pool (MPP warehouse), Serverless SQL Pool (pay-per-query lake SQL), and Spark Pool (managed Spark) — under one Studio, with embedded ADF-style pipelines. It's the right pick when your team is SQL-first, you need true warehouse semantics (high concurrency, workload management, result-set cache), or you want one tool for the whole job. Databricks is the most-deployed lakehouse engine on Azure; it gives you the richest Delta Lake + Unity Catalog + MLflow + Photon experience, with notebooks and clusters tuned for production Spark + ML. The most common pattern is both — Synapse for the SQL warehouse / Serverless lake SQL, Databricks for production Spark transforms and ML. For a 2026 interview, name Databricks for production lakehouse + ML and Synapse Dedicated for high-concurrency warehouse BI; mention Microsoft Fabric as the SaaS evolution if asked about the long-term direction.

What is ADLS Gen2 and how is it different from Azure Blob Storage?

ADLS Gen2 (Azure Data Lake Storage Gen2) is Azure Blob Storage with a hierarchical namespace (HNS) enabled — meaning /customers/year=2026/month=05/ is a true directory tree, not a flat key prefix. Three differences matter for data engineering: (1) atomic directory operations — renaming or deleting a directory is one transaction, not a per-file loop; (2) POSIX-style ACLs — file + directory permissions compatible with Hadoop / Spark ecosystems; (3) performance optimised for analytics — Spark, Databricks, and Synapse all detect HNS and use it for partition pruning, fast metadata listing, and locality-aware reads. For any analytics workload — Delta Lake medallion architectures, Parquet stores, Synapse external tables — use ADLS Gen2. Plain Blob is fine for media, backups, and static assets where directory semantics don't matter.

What are the three integration runtimes in Azure Data Factory?

ADF activities run on an Integration Runtime (IR) — the actual compute substrate. There are three flavours: (1) Azure IR — fully managed serverless cloud compute; the default for any cloud-to-cloud activity (Blob → Synapse Copy, Mapping Data Flow runs); (2) Self-hosted IR — software you install on a Windows VM (cloud or on-prem) so ADF can reach behind a firewall (on-prem SQL Server, Oracle, SAP); (3) Azure-SSIS IR — a lift-and-shift execution environment for legacy SQL Server Integration Services packages. The most common interview gotcha is "how do you reach an on-prem source?" — the answer is Self-hosted IR, not Azure IR. For a Mapping Data Flow that transforms cloud data, you use the Azure IR (it spins up a managed Spark cluster behind the scenes). Knowing which IR an activity needs — and the difference between time to live (TTL) and compute cost for each — is the senior fluency signal.

What is Delta Lake and why does every Azure lakehouse use it?

Delta Lake is the open-source table format that powers every modern Azure lakehouse. It's Parquet files plus a JSON transaction log (_delta_log/) that records every commit — added files, removed files, schema changes. From the user's perspective, this gives you ACID transactions on the lake (concurrent writers can't corrupt the table), MERGE INTO for upserts and SCD2, time travel (VERSION AS OF, TIMESTAMP AS OF), schema enforcement + evolution, and OPTIMIZE / Z-ORDER for compaction and fast point reads. On Azure, Delta is the default table format in Databricks, Synapse Spark Pools, and Microsoft Fabric's OneLake — meaning the same gold.revenue Delta table is queryable by Databricks SQL, Synapse Serverless, Fabric Direct Lake from Power BI, and Trino without copies. Interview-canonical answer: Delta turns the lake into a database — that's the lakehouse pattern, and it's the reason Azure DE has consolidated on it.

How does Unity Catalog compare to Microsoft Purview?

Unity Catalog is Databricks' centralised metastore — the three-level namespace (catalog.schema.table) that owns active access control (table / row / column ACLs via GRANT), lineage capture, and Delta Sharing across workspaces. It's the active enforcement layer: a query against gold.revenue only returns the rows and columns the calling user is allowed to see. Microsoft Purview is Azure's cross-service data catalog and governance product — it auto-scans ADF, Synapse, SQL DB, Cosmos DB, Power BI, and Databricks; classifies sensitive columns (PII, PCI, GDPR); and surfaces lineage across the entire estate. Purview is the discovery + classification layer; Unity Catalog is the access-control layer on the Databricks side. The senior architectural pattern is both — Unity Catalog owns the active permissions on Databricks-managed tables, Purview owns the cross-service catalog so a finance analyst can find data outside Databricks (in Synapse Dedicated, Cosmos DB, Power BI) with classification + lineage carried through.


Where to go next

PipeCode ships 450+ data-engineering interview problems — including Python practice that maps directly onto the SQL, PySpark, and ETL patterns you'll write on Azure Data Factory, Synapse, and Databricks. Whether you're prepping azure data engineer interview questions or shipping a brand-new Azure lakehouse, the practice library mirrors the same set-based and distributed-compute thinking this guide teaches — plus the modern lakehouse skills you'll keep using once you've landed the role.

Kick off via Explore practice →; drill the Python practice lane →; fan out into the ETL lane →; rehearse data-manipulation patterns →; reinforce data-analysis drills →; widen coverage on the full streaming Python library →; or explore the aggregation lane → for warehouse-style queries.

Top comments (0)