azure data engineering is built on three flagship services every modern Azure team ships on: Azure Data Factory (orchestration and bulk movement — pipelines, activities, integration runtimes, triggers), Azure Synapse Analytics (the unified workspace that fuses Dedicated SQL Pools, Serverless SQL, and Spark Pools on one ADLS Gen2 lake), and Azure Databricks (the lakehouse compute engine — clusters, notebooks, Delta Lake, Unity Catalog). Whether you're shipping a daily batch pipeline at a financial-services team or prepping for azure data engineer interview questions at a FAANG-adjacent Azure-first shop, the same three services anchor every architecture.
This guide walks through every Azure service that matters for production DE work: azure data factory (pipelines and the three integration-runtime modes), azure synapse analytics (Dedicated vs Serverless vs Spark — when to pick each), azure databricks (clusters, Delta Lake medallion architecture, Unity Catalog governance), the ADLS Gen2 hierarchical-namespace storage layer they all share, and the broader Azure data platform stack (Event Hubs, Stream Analytics, Purview, Power BI, Microsoft Fabric) that wraps everything together. Every section ends as a teach-by-example block: a short worked example with Question → Input → Code → Step-by-step → Output, then a longer interview-style problem with the four-part Solution Tail (code → trace → output → why this works).
When you want hands-on reps while reading, browse Python practice library →, drill ETL Python drills →, sharpen data-manipulation patterns →, rehearse streaming Python drills →, or widen coverage on the full data-analysis library →.
On this page
- Why Azure data engineering matters in 2026
- Azure Data Factory — pipelines, activities, integration runtimes
- Azure Synapse Analytics — dedicated, serverless, Spark on one workspace
- Azure Databricks — clusters, Delta Lake, Unity Catalog
- The Azure data platform — storage, ingestion, compute, governance + serve
- Choosing the right Azure service (cheat sheet)
- Frequently asked questions
- Where to go next
1. Why Azure data engineering matters in 2026
Azure is the second-largest cloud data platform — and the only one with three first-class compute engines on one lake
The one-sentence invariant: azure data engineering is the discipline of moving data through ADLS Gen2 using ADF for orchestration, then picking the right compute engine — Synapse Dedicated for warehouse workloads, Synapse Serverless for ad-hoc lake queries, or Databricks for production Spark + ML — all governed by Unity Catalog or Microsoft Purview. Once you internalize that mental model, every Azure DE question becomes "which service is the right tool for this job, and which integration runtime do I need?"
The four pillars of Azure data engineering.
- Storage — ADLS Gen2 (hierarchical namespace over Azure Blob); the durable layer every engine reads from and writes to.
- Orchestration + movement — Azure Data Factory (ADF); pipelines, activities, triggers, and three flavours of integration runtime.
- Compute — Synapse Analytics (SQL warehouse + Spark + Serverless SQL) and Azure Databricks (Spark + Delta Lake); both anchor to the same ADLS Gen2 lake.
- Governance + serve — Microsoft Purview / Unity Catalog for lineage and access control; Power BI and Microsoft Fabric for downstream consumption.
Why Azure is the second-largest cloud data platform.
- Enterprise install base — every Microsoft 365 shop is already an Azure tenant; the path of least resistance for data workloads.
- Synapse is the only "one workspace" experience — SQL warehouse + Spark + Serverless SQL + pipelines all in a single Studio.
- Databricks on Azure is a first-party offering — billed through Azure, integrated with Azure AD, Azure Key Vault, and ADLS Gen2.
- Microsoft Fabric is the new SaaS umbrella over OneLake, Synapse, Power BI, Data Activator, and Real-Time Analytics — increasingly the default for greenfield projects.
- Hybrid is a first-class story — Self-hosted IRs in ADF reach on-prem SQL Server, Oracle, and SAP without standing up custom Kafka connectors.
What interviewers listen for in azure data engineer interview questions.
- Do you know which integration runtime each ADF activity needs? — basic-but-tested.
- Can you contrast Dedicated SQL Pool vs Serverless SQL Pool vs Spark Pool and pick the right one for the workload? — fluency signal.
- Do you describe the medallion architecture (bronze → silver → gold) on ADLS Gen2 with Delta? — senior signal.
- Do you mention Unity Catalog or Purview for governance (lineage, access control, row + column ACL)? — interview-canonical answer.
adls gen2 vs blob storage — the most-asked storage question.
-
ADLS Gen2 = Azure Blob + hierarchical namespace (HNS) —
/customers/year=2026/month=05/is a real folder, not a flat key prefix. -
Atomic directory operations —
renameanddeleteon a directory work as one transaction; on plain blob they're per-file loops. - POSIX-style ACLs — fine-grained file + directory permissions for Hadoop / Spark compatibility.
- The default for any analytics workload — Blob is fine for media and static assets; ADLS Gen2 is what every Azure DE pipeline writes to.
The Azure DE stack you'll see at most enterprises.
- Sources — on-prem SQL Server, Salesforce, SAP, Cosmos DB, Event Hubs.
- Ingestion + orchestration — ADF pipelines with Copy / Mapping Data Flow / Notebook activities.
-
Lake — ADLS Gen2 with a
raw / curated / consumption(or bronze / silver / gold) zone layout. - Compute — Databricks for production Spark; Synapse Spark for embedded workloads; Synapse SQL for warehouse queries.
- Governance — Microsoft Purview for catalog + lineage; Unity Catalog on Databricks side.
- Serve — Synapse Dedicated, Power BI Direct Lake (Fabric), or REST APIs over Cosmos DB.
Worked example — a typical Azure data engineering pipeline at a glance
Detailed explanation. A realistic Azure DE pipeline glues all four pillars together: a source system drops files (or rows) into ADLS Gen2, ADF orchestrates the run, Databricks or Synapse transforms the data, the curated output lands back in ADLS Gen2 as Delta tables, and Power BI reads through Direct Lake or DirectQuery. Knowing the exact shape — and which service owns which step — is the senior-signal answer.
Question. A 50GB daily sales CSV lands in an on-prem SQL Server. Land it on ADLS Gen2, transform it with Spark, and surface per-region revenue to Power BI by 8 AM.
Pipeline shape (Azure data engineering).
Ingest : Self-hosted IR + ADF Copy activity → ADLS Gen2 /bronze/sales/dt=2026-05-24/
Process : ADF Notebook activity → Azure Databricks job
Reads bronze Delta, cleans + joins, writes silver Delta
Aggregates per-region, writes gold Delta /gold/revenue/dt=2026-05-24/
Govern : Unity Catalog registers gold.revenue with row-level ACL by region
Serve : Power BI Direct Lake reads gold.revenue from OneLake
Trigger : ADF Schedule trigger at 06:00 UTC daily
Step-by-step explanation.
- Self-hosted IR is installed on the on-prem network so ADF's Copy activity can reach SQL Server behind the firewall.
-
Copy activity writes the day's slice to
/bronze/sales/dt=2026-05-24/on ADLS Gen2 as Parquet. -
Notebook activity invokes a Databricks job; the job reads bronze, applies cleaning + joining rules, writes
/silver/sales_clean/as a Delta table. - A second notebook aggregates silver to
/gold/revenue/partitioned by region. -
Unity Catalog registers
gold.revenueand applies a row filter so each regional analyst only sees their region. -
Power BI queries
gold.revenuevia Direct Lake — no data copy, sub-second response. - ADF Schedule trigger runs the pipeline daily at 06:00 UTC; alerts fire on activity failure.
Rule of thumb: Every Azure DE pipeline maps cleanly onto ingest → store → transform → govern → serve. Know which Azure service owns each step and the architecture answer practically writes itself.
Python
Topic — etl
ETL Python drills
Python
Topic — data-analysis
Azure lakehouse patterns
2. Azure Data Factory — pipelines, activities, integration runtimes
azure data factory — orchestration, bulk movement, and three integration runtimes
azure data factory (ADF) is Azure's managed orchestration service — the canonical place to schedule, parameterize, and monitor data movement and transformation jobs. In one sentence: ADF organises work as Pipelines containing Activities; the actual compute that executes an Activity runs on an Integration Runtime (Azure IR for cloud-to-cloud, Self-hosted IR for hybrid, Azure-SSIS IR for legacy SSIS packages).
The four ADF primitives every interview tests.
- Pipeline — a logical grouping of activities; the unit you schedule, run, and monitor.
- Activity — one unit of work (Copy, Mapping Data Flow, Notebook, Stored Procedure, Web, ForEach, If Condition).
- Linked service — a connection string + credentials to a data source or compute service.
- Dataset — a typed reference to data inside a linked service (a table, a file path with format).
The three Integration Runtime types.
- Azure IR — fully managed serverless compute for cloud-to-cloud activities (Copy from Blob to Synapse, Mapping Data Flow runs).
- Self-hosted IR — software you install on a VM (cloud or on-prem) so ADF can reach behind a firewall (on-prem SQL Server, Oracle, SAP).
- Azure-SSIS IR — lift-and-shift execution environment for legacy SQL Server Integration Services packages.
The activity types most teams use.
- Copy activity — bulk movement; runs on Azure IR or Self-hosted IR; supports 90+ source/sink connectors.
- Mapping Data Flow — visual ETL transform that compiles to a managed Spark cluster behind the scenes.
- Notebook activity — invoke a Databricks or Synapse notebook with parameters.
- Stored Procedure activity — call a procedure on SQL Server / Synapse Dedicated SQL Pool.
- ForEach / If Condition / Until — control-flow activities for parameterized loops and branching.
- Web activity — call a REST API; common for triggering external systems or fetching tokens.
- Lookup activity — query a small result set into a variable for downstream conditionals.
Triggers — how pipelines start.
- Schedule trigger — run at a wall-clock time (e.g. daily at 06:00 UTC).
- Tumbling-window trigger — back-fillable, stateful interval (run for every hour since X with retry support).
- Storage-event trigger — fire when a blob appears in or is deleted from an ADLS Gen2 path.
- Custom-event trigger — fire from an Event Grid event (e.g. SAP iDoc arrived).
- Manual trigger — kick off via the portal, REST API, or PowerShell.
adf vs databricks — the most-asked comparison.
- ADF is orchestration + bulk movement; Databricks is compute + transformation.
-
You usually run them together — an ADF pipeline whose main activity is
Databricks Notebookthat runs the heavy Spark job. - For pure SQL-on-warehouse pipelines you might use only ADF + Stored Procedure activities on Synapse Dedicated.
- Modern teams default to ADF for orchestration + Databricks for transform because Databricks has the richer notebook + cluster + Delta Lake experience.
Worked example — a basic ADF Copy activity pattern
Detailed explanation. The single most-used ADF pattern is a Copy activity that lands data from a source connector into ADLS Gen2 as Parquet. Combined with parameterisation, this one pattern powers thousands of "land raw data into the lake" pipelines.
Question. Land a daily snapshot of an on-prem dbo.orders SQL Server table into /bronze/orders/dt={ingest_date}/ on ADLS Gen2 as Parquet, then trigger a Databricks notebook that processes only that day's partition.
Input (pipeline parameters).
| Parameter | Example | Used by |
|---|---|---|
ingest_date |
2026-05-24 |
Copy sink path + Notebook activity argument |
source_table |
dbo.orders |
Copy source dataset |
sink_root |
bronze/orders |
Copy sink path |
Code (ADF pipeline JSON, abbreviated).
{
"name": "pl_ingest_orders",
"properties": {
"parameters": {
"ingest_date": { "type": "string" }
},
"activities": [
{
"name": "Copy_OrdersToBronze",
"type": "Copy",
"typeProperties": {
"source": { "type": "SqlServerSource", "sqlReaderQuery":
"SELECT * FROM dbo.orders WHERE CAST(updated_at AS DATE) = '@{pipeline().parameters.ingest_date}'" },
"sink": { "type": "ParquetSink" }
},
"inputs": [{ "referenceName": "ds_sql_orders" }],
"outputs": [{ "referenceName": "ds_adls_bronze_orders",
"parameters": { "ingest_date": "@pipeline().parameters.ingest_date" } }],
"linkedServiceName": { "referenceName": "ls_self_hosted_ir" }
},
{
"name": "Run_DatabricksNotebook",
"type": "DatabricksNotebook",
"dependsOn": [{ "activity": "Copy_OrdersToBronze", "dependencyConditions": ["Succeeded"] }],
"typeProperties": {
"notebookPath": "/jobs/bronze_to_silver_orders",
"baseParameters": { "ingest_date": "@pipeline().parameters.ingest_date" }
}
}
]
}
}
Step-by-step explanation.
-
Schedule trigger invokes the pipeline daily at 06:00 UTC with
ingest_dateset to today. -
Copy_OrdersToBronze runs on the Self-hosted IR so it can reach on-prem SQL Server through the firewall; it writes the day's slice to
/bronze/orders/dt=2026-05-24/as Parquet. -
Run_DatabricksNotebook depends on
Succeededof the copy; it invokes a Databricks notebook withingest_dateas a parameter. - The notebook reads only
dt=2026-05-24/, applies cleaning rules, writes/silver/orders_clean/. - If anything fails, ADF retries (configurable per activity); on terminal failure it raises an alert through a downstream Web activity that posts to Teams / PagerDuty.
Output (run summary surfaced in ADF monitoring).
| Activity | Duration | Rows | Status |
|---|---|---|---|
| Copy_OrdersToBronze | 00:04:12 | 1,250,438 | Succeeded |
| Run_DatabricksNotebook | 00:07:48 | — | Succeeded |
Rule of thumb: Parameterize the date and the source-table name; never hard-code paths inside the activity body. Every interviewer probes "how do you handle backfills?" — parameterisation + the tumbling-window trigger is the canonical answer.
Interview problem on Azure Data Factory — designing a parameterized, retryable ingestion pipeline
Question. Design an ADF pipeline that ingests a list of 20 on-prem SQL Server tables into ADLS Gen2 as daily Parquet snapshots. The list of tables must be configuration-driven (so adding a 21st table requires no code change), each table writes to its own bronze folder, failures retry up to 3 times, and a single failed table must not block the other 19.
Solution Using ForEach with isolated failure handling
Code.
{
"name": "pl_ingest_all_tables",
"properties": {
"activities": [
{
"name": "Lookup_TableList",
"type": "Lookup",
"typeProperties": {
"source": { "type": "JsonSource" },
"dataset": { "referenceName": "ds_config_tables_json" },
"firstRowOnly": false
}
},
{
"name": "ForEach_Table",
"type": "ForEach",
"dependsOn": [{ "activity": "Lookup_TableList",
"dependencyConditions": ["Succeeded"] }],
"typeProperties": {
"items": "@activity('Lookup_TableList').output.value",
"isSequential": false,
"batchCount": 5,
"activities": [
{
"name": "Copy_OneTable",
"type": "Copy",
"policy": {
"retry": 3,
"retryIntervalInSeconds": 60,
"timeout": "01:00:00"
},
"typeProperties": {
"source": { "type": "SqlServerSource", "sqlReaderQuery":
"SELECT * FROM @{item().schema}.@{item().table}" },
"sink": { "type": "ParquetSink" }
},
"inputs": [{ "referenceName": "ds_sql_dynamic",
"parameters": { "schema": "@item().schema",
"table": "@item().table" } }],
"outputs": [{ "referenceName": "ds_adls_bronze_dynamic",
"parameters": { "table": "@item().table",
"ingest_date": "@pipeline().parameters.ingest_date" } }]
}
]
}
}
]
}
}
Step-by-step trace.
Input — config/tables.json on ADLS Gen2:
| schema | table |
|---|---|
| dbo | orders |
| dbo | customers |
| sales | regions |
| ... | ... (20 rows total) |
-
Lookup_TableList reads the JSON config into
output.value(an array of 20{schema, table}records). -
ForEach_Table iterates with
isSequential=false, batchCount=5— meaning 5 tables run in parallel, then the next 5, etc. - For each
item, Copy_OneTable issuesSELECT * FROM {schema}.{table}and writes to/bronze/{table}/dt={ingest_date}/. -
retry: 3, retryIntervalInSeconds: 60— a single transient failure (network blip, momentary SQL deadlock) auto-retries up to 3 times with 60-second backoff before marking the activity Failed. - Because each Copy is inside the ForEach, one table's terminal failure marks only that iteration as Failed; the outer ForEach reports per-item success/failure but continues. The pipeline finishes with mixed status — failed tables are visible in monitoring without dragging down the 19 successful ones.
Output:
| Table | Status | Rows | Retries used |
|---|---|---|---|
| dbo.orders | Succeeded | 1,250,438 | 0 |
| dbo.customers | Succeeded | 412,003 | 1 |
| sales.regions | Failed | — | 3 |
| ... | (17 more Succeeded) | — | — |
Why this works — concept by concept:
-
Lookup activity turns a JSON / SQL config into a runtime array; adding a 21st table means editing
tables.json, not the pipeline — the configuration-driven requirement is satisfied. -
ForEach with batchCount parallelises across items while capping fan-out; without
batchCount, you'd hammer the source SQL Server with 20 simultaneous connections. - Activity-level retry policy isolates transient failures from terminal ones; the 60-second backoff handles network blips without page-flooding.
- Per-iteration error isolation is the ForEach default — a single bad item doesn't halt the others, which matches the "must not block the other 19" requirement.
- Self-hosted IR is the integration runtime that gives ADF reach across the on-prem firewall; without it, every Copy fails before retry can even help.
- Cost — Azure IR is billed per data-movement DIU-hour for Copy; Self-hosted IR is free (you pay only for the host VM); ForEach control-flow itself is metered per activity run at fractions of a cent.
Python
Topic — etl
ADF / orchestration drills
Python
Topic — data-manipulation
Bulk ingestion patterns
3. Azure Synapse Analytics — dedicated, serverless, Spark on one workspace
azure synapse analytics — three compute engines fused with the lake under one Studio
azure synapse analytics is Azure's unified analytics workspace. The senior answer in one sentence: Synapse is a single workspace that exposes three independent compute engines — Dedicated SQL Pool (MPP warehouse), Serverless SQL Pool (pay-per-query lake SQL), and Spark Pool (managed Spark) — all reading and writing the same ADLS Gen2 lake, with embedded ADF-style pipelines for orchestration.
The three Synapse compute engines.
- Dedicated SQL Pool (formerly SQL DW) — Massively Parallel Processing (MPP) data warehouse; you provision DWUs and pay by the hour; 60 distribution slots; ideal for high-concurrency reporting workloads.
-
Serverless SQL Pool — fully serverless T-SQL over files in ADLS Gen2; you pay per TB scanned ($5/TB); ideal for ad-hoc lake exploration and
OPENROWSETqueries. - Spark Pool — managed Apache Spark with auto-pause / auto-scale; ideal for embedded Spark inside the Synapse experience (where Databricks would be overkill).
Dedicated SQL Pool — the MPP warehouse.
- DWU sizing — DW100c (small) to DW30000c (huge); pause when not in use to avoid charges.
- 60 distribution slots — every table physically lives across 60 distributions; the distribution strategy determines join + aggregate performance.
-
Distribution types —
HASH (column)for large fact tables,ROUND_ROBINfor staging tables,REPLICATEfor small dim tables (< 2GB). - PolyBase + COPY INTO — bulk load from ADLS Gen2 / Blob into Dedicated tables.
- Result-set caching — repeat queries return from cache in milliseconds (configurable).
- Workload management — workload groups + classifiers for multi-tenant priority + concurrency control.
Serverless SQL Pool — pay-per-query lake SQL.
-
No provisioned compute — query files directly with
OPENROWSET('parquet'/'csv'/'delta'). - External tables — define a schema once, query like a real table.
- Per-TB-scanned billing — partition pruning + Parquet's columnar layout keep cost low.
- Use case — ad-hoc exploration, low-volume serving, joining lake data to small reference tables.
- Limitation — no INSERT / UPDATE / DELETE (read-only on the lake); use CETAS (CREATE EXTERNAL TABLE AS SELECT) to materialise transformations.
Spark Pool — embedded managed Spark.
- Auto-pause — idle for N minutes? cluster shuts down automatically.
- Auto-scale — min / max node range; cluster grows under load.
- PySpark / Scala / .NET / SparkSQL notebooks.
- Synapse Link — near-real-time replication from Cosmos DB / Dataverse / SQL into Synapse for analytics.
- When to pick it over Databricks — when your team lives in Synapse Studio and the Spark workload is moderate; Databricks wins for production lakehouse + ML.
Distribution strategies on Dedicated — the canonical interview question.
| Strategy | When to use | Notes |
|---|---|---|
HASH (column) |
Large fact tables (> 2GB) | Picks the same column as the most-common join key |
ROUND_ROBIN |
Staging tables, no good hash key | Even distribution; bad for joins |
REPLICATE |
Small dimension tables (< 2GB) | Full copy on every distribution; zero shuffle |
synapse vs databricks — the senior comparison.
- Synapse — one Microsoft-managed workspace; SQL-first audience; Dedicated SQL Pool gives true warehouse semantics.
- Databricks — Spark-first; better notebooks; richer Delta Lake + Unity Catalog ecosystem; preferred for ML / data science.
- Hybrid is common — Synapse Serverless for ad-hoc lake SQL, Databricks for production Spark.
- Microsoft Fabric is positioned as Synapse's evolution — SaaS lakehouse over OneLake.
Worked example — querying ADLS Gen2 Parquet with Serverless SQL
Detailed explanation. Serverless SQL Pool's killer feature is querying lake files with zero provisioning. You point at a Parquet path, define the schema (or let Synapse infer it), and write T-SQL. The cost is "per TB scanned" so partition pruning is your friend.
Question. A daily Parquet drop lives at /curated/sales/year=2026/month=05/day=*/*.parquet on ADLS Gen2. Return the top 5 regions by revenue for the current month using Serverless SQL.
Input. Parquet files with schema (order_id INT, region VARCHAR(50), customer_id INT, amount DECIMAL(18,2), order_date DATE).
Code.
SELECT TOP 5
region,
SUM(amount) AS revenue
FROM
OPENROWSET(
BULK 'https://pclake.dfs.core.windows.net/curated/sales/year=2026/month=05/day=*/*.parquet',
FORMAT = 'PARQUET'
) AS s
GROUP BY region
ORDER BY revenue DESC;
Step-by-step explanation.
-
OPENROWSET(BULK ..., FORMAT='PARQUET')opens the file set as a virtual table; the wildcard expands to every matching day-partition file under May 2026. -
Schema inference reads the Parquet column metadata so you don't have to declare it (you can override with
WITH (...)). -
Partition pruning — Serverless reads only files under
year=2026/month=05/; February partitions are never scanned, keeping the bill low. -
Columnar push-down — only
regionandamountare physically read from each Parquet file;order_id,customer_id,order_dateare skipped. -
GROUP BY region+SUMruns distributed across Serverless workers; results stream back ordered.
Output.
| region | revenue |
|---|---|
| North America | 18,402,113.55 |
| EMEA | 12,001,884.20 |
| APAC | 9,317,442.10 |
| LATAM | 4,128,005.77 |
| MEA | 2,001,338.42 |
Rule of thumb: Always include a partition predicate when querying Serverless; without one, you pay to scan the entire history. For repeated reads of the same aggregate, CETAS into a curated external table.
Interview problem on Synapse — choosing the right distribution for a star-schema warehouse
Question. You're loading a 200GB fact_orders table and four dimension tables (dim_customer 6GB, dim_product 1.2GB, dim_date 50MB, dim_region 5MB) into a Synapse Dedicated SQL Pool. The most common query joins fact_orders to all four dims by their natural key and aggregates revenue by customer_id. Pick the distribution strategy for each table and justify the choice.
Solution Using HASH on fact + REPLICATE on small dims
Code.
CREATE TABLE fact_orders (
order_id BIGINT NOT NULL,
customer_id BIGINT NOT NULL,
product_id BIGINT NOT NULL,
date_id INT NOT NULL,
region_id INT NOT NULL,
revenue DECIMAL(18,2) NOT NULL
)
WITH (
DISTRIBUTION = HASH (customer_id),
CLUSTERED COLUMNSTORE INDEX,
PARTITION (date_id RANGE RIGHT FOR VALUES (20260101, 20260201, 20260301, 20260401, 20260501))
);
CREATE TABLE dim_customer ( customer_id BIGINT, ... )
WITH ( DISTRIBUTION = HASH (customer_id), CLUSTERED COLUMNSTORE INDEX );
CREATE TABLE dim_product ( product_id BIGINT, ... )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX );
CREATE TABLE dim_date ( date_id INT, ... )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX );
CREATE TABLE dim_region ( region_id INT, ... )
WITH ( DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX );
Step-by-step trace.
Input — the warehouse query:
SELECT c.customer_name, SUM(f.revenue) AS revenue
FROM fact_orders f
JOIN dim_customer c ON f.customer_id = c.customer_id
JOIN dim_product p ON f.product_id = p.product_id
JOIN dim_date d ON f.date_id = d.date_id
JOIN dim_region r ON f.region_id = r.region_id
WHERE d.year = 2026
GROUP BY c.customer_name;
-
fact_ordersis HASH-distributed oncustomer_id— the 200GB table is sharded across 60 distributions byhash(customer_id) % 60. Rows for the same customer always land on the same distribution. -
dim_customeris also HASH-distributed oncustomer_id— the joinfact_orders ⋈ dim_customer ON customer_idis now distribution-local; every distribution joins its own fact slice to its own dim slice with zero shuffle. -
dim_product,dim_date,dim_regionare REPLICATEd — each is < 2GB, so a full copy lives on every distribution. Joins to them are also local; no shuffle, no broadcast. -
GROUP BY c.customer_nameis partially aggregated locally (each distribution computes its slice's sum), then a single shuffle returns the 60 partials to one node for final aggregation. -
Partitioning
fact_ordersbydate_idlets theWHERE d.year = 2026predicate prune entire partitions on the fact table; Synapse skips reading 2024 / 2025 data altogether.
Output:
| customer_name | revenue |
|---|---|
| Contoso Corp | 1,240,113.55 |
| Adventure Works | 884,201.10 |
| Fabrikam | 612,778.20 |
| ... | ... |
Why this works — concept by concept:
- HASH-on-join-key — colocating the fact and its largest dimension on the same hash column makes the join distribution-local; no shuffle is the #1 lever for MPP warehouse performance.
- REPLICATE for small dims — copying a 5MB dim onto every distribution costs ~300MB cluster-wide (60 × 5MB) but eliminates 100% of the shuffle on every query that joins it. The break-even is roughly 2GB.
-
Clustered Columnstore Index — Synapse's default; gives 5–10× compression and lets the engine read only the columns the query references (
revenue,customer_name). - Partitioning the fact by date — partition elimination lets the engine skip cold partitions entirely; the 2026-only filter scans roughly 1/3 of the rows for a 3-year fact.
- Distribution choice is the most-asked Synapse interview question — "HASH the fact on the most common join key; REPLICATE dims under 2GB; ROUND_ROBIN only for unloaded staging" is the canonical answer.
- Cost — DWU usage stays bounded; the query plan touches only ~33% of the fact and zero shuffle for 4 of 5 joins; pause the pool overnight to drop cost to zero.
Python
Topic — etl
Synapse / warehouse drills
Python
Topic — aggregation
MPP aggregation patterns
4. Azure Databricks — clusters, Delta Lake, Unity Catalog
azure databricks — clusters run the compute, Delta Lake stores the truth, Unity Catalog rules access
azure databricks is the most-deployed lakehouse engine on Azure. The senior answer in one sentence: Databricks gives you managed Spark clusters that read and write Delta Lake tables on ADLS Gen2, with Unity Catalog providing centralised metadata + access control across workspaces — the foundation of every Azure medallion (bronze → silver → gold) lakehouse architecture.
The four Databricks primitives every interview tests.
- Workspace — the UI + asset container (notebooks, jobs, repos, libraries, dashboards).
- Cluster — one driver node + N worker nodes running a specific Databricks Runtime version (Spark + Photon + libraries).
- Delta Lake — the table format (Parquet files + a transaction log) that gives you ACID transactions on the lake.
- Unity Catalog — the cross-workspace metastore for tables, views, models, lineage, and ACLs.
Cluster types — pick the right one.
- All-purpose cluster — interactive notebook work; multiple users; auto-terminate after N minutes idle.
- Job cluster — spun up for one job, killed when the job ends; cheapest for scheduled work.
- SQL warehouse (formerly SQL Endpoint) — auto-scaling cluster optimised for BI / dashboard queries.
- Photon runtime — Databricks' vectorised C++ query engine; 2–3× faster on SQL + DataFrame workloads.
Delta Lake — the lakehouse format.
-
Parquet + a JSON transaction log in
_delta_log/; every write appends a new log entry describing added / removed files. - ACID transactions — concurrent writers can't corrupt the table; snapshot isolation on reads.
-
Time travel —
SELECT * FROM t VERSION AS OF 17orTIMESTAMP AS OF '2026-05-01'. - MERGE INTO — upsert / SCD2 in one statement.
- OPTIMIZE + Z-ORDER — compact small files and physically cluster by a high-cardinality column for fast point reads.
-
Schema enforcement + evolution — new columns must be opted in via
mergeSchema=true.
Medallion architecture (bronze → silver → gold).
- Bronze — raw, append-only landing zone; one table per source; minimal transformation.
- Silver — cleaned, conformed, deduplicated; joined to relevant reference data.
- Gold — business-ready aggregates; the layer Power BI / serving APIs read.
- Why three layers — each one isolates a class of concerns (ingestion, conformance, business logic); failures are localised, reprocessing is cheap.
Unity Catalog — the governance layer.
-
Three-level namespace —
catalog.schema.table(replaces the two-level Hive metastore). - Metastore-per-region — one metastore covers many workspaces in one region.
- Lineage — automatically captured across notebooks, jobs, and dashboards.
- Access control — table, row, and column-level ACLs via GRANT statements; integrates with Azure AD groups.
- Delta Sharing — open protocol for sharing tables across orgs without copying data.
Auto Loader — the streaming-ingest workhorse.
-
cloudFilessource — incremental file ingest from ADLS Gen2; uses a checkpoint to avoid reprocessing. - Schema inference + evolution built in.
- Trigger.AvailableNow — micro-batch mode that processes all new files since the last checkpoint, then exits — perfect for ADF-orchestrated runs.
Worked example — incremental bronze → silver MERGE with Delta
Detailed explanation. The single most-used Databricks pattern is incremental upsert from a bronze staging table into a silver curated table. MERGE INTO makes this one statement that handles inserts, updates, and (optionally) deletes.
Question. Bronze bronze.orders_landing receives a daily CDC slice with new + changed rows. Upsert into silver.orders keyed by order_id, keeping the latest updated_at per key, and soft-deleting rows flagged is_deleted = true.
Input. Bronze sample for dt=2026-05-24:
| order_id | customer_id | amount | updated_at | is_deleted |
|---|---|---|---|---|
| 101 | 22 | 199.00 | 2026-05-24 09:00 | false |
| 102 | 22 | 49.00 | 2026-05-24 09:05 | false |
| 100 | 21 | 0 | 2026-05-24 09:07 | true |
Existing silver:
| order_id | customer_id | amount | updated_at | is_deleted |
|---|---|---|---|---|
| 100 | 21 | 99.00 | 2026-05-23 18:30 | false |
| 101 | 22 | 150.00 | 2026-05-23 18:35 | false |
Code.
from pyspark.sql.functions import col
bronze = spark.read.format("delta").load("/bronze/orders_landing").where("dt = '2026-05-24'")
bronze.createOrReplaceTempView("bronze_today")
spark.sql("""
MERGE INTO silver.orders AS t
USING (
SELECT * FROM bronze_today
) AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
UPDATE SET *
WHEN NOT MATCHED AND s.is_deleted = false THEN
INSERT *
""")
Step-by-step explanation.
-
Read the day's bronze slice with partition predicate
dt = '2026-05-24'— partition pruning means only that day's files are read. -
MERGE INTO silver.orders AS t USING bronze_today AS sopens a single ACID transaction against the silver Delta table. -
ON t.order_id = s.order_ididentifies the matching key. -
WHEN MATCHED AND s.updated_at > t.updated_at— only newer rows overwrite; out-of-order CDC events don't downgrade a row. -
WHEN NOT MATCHED AND s.is_deleted = false— new live rows insert; new deletions are skipped (no zombie row gets created). - Delta writes new Parquet files for changed rows + appends a transaction-log entry; readers continue to see the pre-merge snapshot until the txn commits.
Output (silver after merge).
| order_id | customer_id | amount | updated_at | is_deleted |
|---|---|---|---|---|
| 100 | 21 | 99.00 | 2026-05-23 18:30 | false |
| 101 | 22 | 199.00 | 2026-05-24 09:00 | false |
| 102 | 22 | 49.00 | 2026-05-24 09:05 | false |
Rule of thumb: Always guard WHEN MATCHED with a updated_at comparison; otherwise an out-of-order replay overwrites newer state with older.
Interview problem on Databricks — designing a streaming bronze ingestor with Auto Loader
Question. Files land in /raw/events/ on ADLS Gen2 — about 5K files per day, ~1MB each. Build a near-real-time bronze ingestor that processes each new file exactly once, handles schema drift gracefully, and writes to a Delta table that downstream silver jobs can MERGE from. The job should be re-runnable, restartable, and survive cluster restarts.
Solution Using Auto Loader + Trigger.AvailableNow + checkpoint
Code.
from pyspark.sql.functions import col, current_timestamp, input_file_name
raw_path = "abfss://landing@pclake.dfs.core.windows.net/raw/events/"
bronze_path = "abfss://lake@pclake.dfs.core.windows.net/bronze/events/"
checkpoint_path = "abfss://lake@pclake.dfs.core.windows.net/_chk/bronze_events/"
schema_path = "abfss://lake@pclake.dfs.core.windows.net/_schema/bronze_events/"
bronze_stream = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", schema_path)
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
.load(raw_path)
.withColumn("ingest_ts", current_timestamp())
.withColumn("source_file", input_file_name())
)
(
bronze_stream.writeStream
.format("delta")
.option("checkpointLocation", checkpoint_path)
.option("mergeSchema", "true")
.trigger(availableNow=True)
.toTable("bronze.events")
)
Step-by-step trace.
Input — bronze input directory shape:
| Day | New files | Avg size |
|---|---|---|
| 2026-05-24 | 5,200 | 1.1 MB |
| 2026-05-25 | 4,980 | 1.0 MB |
| 2026-05-26 | 5,310 | 1.2 MB |
-
cloudFiles(Auto Loader) uses Azure Event Grid + Queue Storage (or directory listing) to detect newly arrived JSON files since the last checkpoint. -
schemaLocationpersists the inferred schema;addNewColumnsevolution mode lets a brand-new field appear in tomorrow's JSON without breaking today's read. -
withColumn("ingest_ts", ...)stamps each row with arrival time;input_file_name()records source provenance for debugging. -
writeStream.trigger(availableNow=True)— micro-batch mode that processes all unseen files in one pass, then exits. ADF can call this job in a daily Notebook activity without leaving a streaming cluster running 24/7. -
checkpointLocationis the source-of-truth for "which files have I already consumed?" — restarting the job re-reads from the checkpoint, guaranteeing exactly-once write semantics. -
mergeSchema=truelets the silver table absorb the new column on commit; the Delta log records the schema change.
Output:
| Metric | Value |
|---|---|
| Files consumed this run | 5,200 |
| Rows written to bronze.events | 1,820,331 |
| New schema columns detected | 1 (device_os_build) |
| Run duration | 00:11:42 |
| Checkpoint advanced to | offset 5,200 / file 2026-05-24-23-59-58.json |
Why this works — concept by concept:
-
Auto Loader cloudFiles is the right primitive for "small files in a directory" — it avoids the listing storm you'd hit by
spark.read.json(...)on the whole folder daily. - Trigger.AvailableNow is the killer combo with ADF — you get streaming-style exactly-once semantics with batch-style "run, finish, release the cluster" cost.
- Schema location + addNewColumns turns schema drift from a job failure into a logged event; the silver layer can choose how to deal with the new column.
- Checkpoint location owns exactly-once; without it, every restart double-processes everything and you build a duplicate-handling layer downstream.
-
Delta
mergeSchema=truelets the destination table evolve in lockstep with the source; the transaction log records the schema change with a version stamp you can rewind to. -
Cost — a job cluster billed only while
availableNowis running is dramatically cheaper than a 24/7 structured-streaming cluster; for 5K small files/day you'll typically spend < $1/run.
Python
Topic — etl
Databricks / Spark drills
Python
Topic — streaming
Auto Loader / streaming patterns
5. The Azure data platform — storage, ingestion, compute, governance + serve
The four-layer Azure data platform every modern team ships on
Beyond the big three (ADF, Synapse, Databricks), Azure ships dozens of data services. Senior azure data engineer interview questions test fluency with the layered architecture and which service belongs where. The mental model: storage → ingestion → compute → governance + serve — four strata, each with two-to-four services.
Layer 1 — Storage.
- ADLS Gen2 — the default lake; hierarchical namespace, ACLs, optimised for analytics.
- Azure SQL DB / Managed Instance — fully managed OLTP; the source of truth for transactional apps.
- Cosmos DB — globally distributed, multi-model NoSQL (document, key-value, graph, column); 5 consistency levels; the choice for high-volume operational APIs.
- Azure Blob Storage — object store for media, backups, static assets (ADLS Gen2 is built on it).
Layer 2 — Ingestion + Streaming.
- Azure Data Factory — orchestration + bulk movement (covered in §2).
- Event Hubs — high-throughput event ingest; Kafka-protocol compatible; the front door for streaming pipelines.
- IoT Hub — Event Hubs + device-management features (bi-directional comms, device-twin state) for IoT workloads.
- Service Bus — enterprise message bus for app-to-app integration (FIFO, dead-letter queues, transactions).
Layer 3 — Compute + Transform.
- Azure Databricks — the lakehouse + ML default (covered in §4).
- Synapse Spark / Synapse SQL — embedded compute in the Synapse workspace (covered in §3).
- Azure Stream Analytics — managed SQL-on-streaming over Event Hubs / IoT Hub; great for sub-second windowed aggregates without a Spark cluster.
- Azure Functions — serverless compute; the glue for lightweight transforms triggered by Event Grid / Service Bus / HTTP.
Layer 4 — Governance + Serve.
- Microsoft Purview — enterprise data catalog + lineage + sensitivity classification; covers all Azure data services.
- Azure Key Vault — secrets / certificates / keys; every linked service in ADF should pull credentials from Key Vault, never inline.
- Power BI — the default BI tool; Direct Lake mode (in Fabric) reads OneLake parquet without a copy.
- Microsoft Fabric — SaaS umbrella with OneLake at the centre; Synapse + Power BI + Data Activator + Real-Time Analytics in one experience.
The modern Azure lakehouse — putting it together.
- Storage — ADLS Gen2 with bronze / silver / gold zones; Delta as the table format.
- Ingestion — Event Hubs for streams, ADF for batch; both write to bronze.
- Compute — Databricks for production Spark, Synapse SQL for warehouse queries, Stream Analytics for low-latency windowed aggregates.
- Governance — Unity Catalog (Databricks side) + Purview (cross-service catalog + lineage); Key Vault for all secrets.
- Serve — Power BI Direct Lake for dashboards, Cosmos DB or Synapse Dedicated for low-latency APIs.
Microsoft Fabric — what changed in 2024 and what it means for azure data engineering interviews.
- OneLake is "the OneDrive for data" — one logical lake across the tenant; backed by ADLS Gen2 under the hood.
- Lakehouse experience natively uses Delta; same medallion architecture as Databricks.
- Direct Lake in Power BI reads OneLake parquet without copying to a tabular model.
- Real-Time Intelligence rolls Stream Analytics + KQL together for sub-second analytics.
- Interview signal — mentioning Fabric without ditching Databricks/Synapse is the senior answer; Fabric complements rather than replaces in most enterprises.
Reference architecture cheat sheet.
| Pattern | Stack |
|---|---|
| Daily batch warehouse | ADF + Databricks + Delta + Synapse Dedicated + Power BI |
| Near-real-time analytics | Event Hubs + Stream Analytics + Cosmos DB + Power BI |
| Lakehouse with ML | ADF + Databricks + Delta + Unity Catalog + MLflow + Power BI |
| Fabric-native | OneLake + Fabric Lakehouse + Power BI Direct Lake |
| Hybrid (on-prem source) | Self-hosted IR + ADF + ADLS Gen2 + Databricks |
Worked example — picking the right compute service for a workload
Detailed explanation. Every Azure DE interview eventually asks "which compute would you pick and why?" — the answer is a decision tree over four axes: workload type (batch vs streaming), expected SLA (sub-second vs daily), data size, and team skillset (SQL-first vs Spark-first).
Question. You have four workloads. For each, name the Azure compute service you'd reach for first.
| Workload | Compute service | Why |
|---|---|---|
| Daily 200GB warehouse refresh, high BI concurrency | Synapse Dedicated SQL Pool | MPP + result-set cache + workload mgmt |
| Sub-second tumbling-window aggregate over 100K events/sec | Azure Stream Analytics | Managed streaming SQL; no Spark cluster mgmt |
| Production lakehouse ETL + ML training | Azure Databricks | Spark + Delta + MLflow + Unity Catalog |
| Ad-hoc "select count(*) from this raw parquet" exploration | Synapse Serverless SQL Pool | Pay-per-TB; zero provisioning |
Step-by-step explanation.
- Synapse Dedicated wins on predictable warehouse workloads with many concurrent BI users — the result-set cache + workload groups give you the SLA and tenant isolation.
- Stream Analytics wins on sub-second streaming SQL because it's fully managed and you don't pay for a Spark cluster idle between bursts.
- Databricks wins for production lakehouse + ML because of Delta + Unity Catalog + MLflow + Photon — the deepest ecosystem.
- Synapse Serverless wins for ad-hoc lake exploration because there's no compute to provision; you pay $5/TB scanned.
Rule of thumb: Match workload to the service that's purpose-built for it; resist using one service for everything just because it's already paid for. The cost difference between the right and wrong tool can be 10×.
Interview problem on the Azure ecosystem — design a near-real-time + batch lakehouse
Question. Design an Azure data platform for an e-commerce company that needs (a) sub-minute clickstream dashboards in Power BI, (b) daily revenue and inventory reports for finance, (c) ML feature engineering on clickstream + transactions, and (d) governed access to PII for the data-science team. Name every Azure service, the data flow, and the governance layer.
Solution Using a layered Fabric-friendly lakehouse
Code (architecture as Bicep-style pseudo-spec).
storage:
- adls_gen2: { name: pclake, zones: [bronze, silver, gold] }
ingestion:
streaming:
- event_hubs: { name: clicks, partitions: 32, retention_h: 72 }
batch:
- adf: { pipelines: [pl_ingest_transactions_sqlserver,
pl_ingest_inventory_oracle] }
compute:
streaming:
- stream_analytics: { input: event_hubs.clicks,
output: cosmosdb.live_dashboard }
lakehouse:
- databricks: { jobs: [auto_loader_bronze_clicks,
silver_clicks_dedup,
gold_revenue_daily,
feature_store_train_ready] }
warehouse:
- synapse_dedicated: { dwu: DW400c, tables: [fact_orders, dim_*] }
governance:
- unity_catalog: { metastore: pclake-east-us }
- purview: { catalog: pccatalog }
- key_vault: { secrets: [sql_conn, oracle_conn, eventhubs_conn] }
serve:
- power_bi: { mode: direct_lake, dataset: gold.revenue_daily }
- cosmosdb: { db: live, container: dashboard_aggregates }
Step-by-step trace.
Input — the four requirements:
| # | Requirement | Maps to |
|---|---|---|
| a | Sub-minute clickstream dashboards | Event Hubs → Stream Analytics → Cosmos DB → Power BI |
| b | Daily revenue + inventory reports | ADF → ADLS Gen2 → Databricks → Synapse Dedicated → Power BI |
| c | ML feature engineering | Databricks + Unity Catalog + MLflow |
| d | Governed PII access | Unity Catalog row + column ACL + Purview classification |
- Clickstream lands on Event Hubs (Kafka-protocol compatible, 32 partitions for parallelism). One branch flows to Stream Analytics which writes sub-minute windowed aggregates to Cosmos DB; Power BI hits Cosmos with DirectQuery for live dashboards.
- The same Event Hubs partition also feeds an Auto Loader bronze job in Databricks that lands raw clicks in
/bronze/clicks/as Delta. - ADF runs nightly pipelines that ingest transactions from on-prem SQL Server (Self-hosted IR) and inventory from Oracle, dropping each to its own bronze folder.
-
Databricks silver jobs dedupe + conform; gold jobs aggregate revenue / inventory;
gold.revenue_dailyis loaded into Synapse Dedicated for high-concurrency BI. -
Power BI Direct Lake reads
gold.revenue_dailyfrom OneLake (Fabric); finance dashboards refresh nightly with sub-second open times. - MLflow in Databricks trains the recommender model on the joined gold features; the feature table is governed by Unity Catalog.
- Unity Catalog row-level filter masks customer PII columns for the data-science group; Purview auto-classifies new columns as PII based on patterns; Key Vault holds every credential — no inline secrets anywhere in ADF.
Output (operational outcome).
| KPI | Target | Achieved |
|---|---|---|
| Clickstream dashboard latency | < 60s | 38s |
| Daily revenue report freshness | < 8h | 4.5h |
| Feature-store SLA | < 24h | 12h |
| PII data exfiltration risk | Zero | Zero — UC ACL blocks |
Why this works — concept by concept:
- Two-track ingestion — streaming (Event Hubs → Stream Analytics) and batch (ADF) flow into the same lake; downstream consumers see one source of truth regardless of arrival mode.
- Medallion lakehouse — bronze/silver/gold isolates ingestion failures from business-logic failures from serving failures; reprocessing is cheap because each layer is idempotent.
- Right tool per workload — Stream Analytics for sub-second streaming SQL, Databricks for ML + Spark, Synapse for warehouse BI; one service per job class keeps cost and complexity bounded.
- Unity Catalog + Purview — UC owns the active row + column access decisions; Purview owns the cross-service catalog + classification; the two complement each other.
- Key Vault everywhere — every linked service references a Key Vault secret URL; rotating a password becomes a single Key Vault update with no pipeline code change.
- Cost — Stream Analytics is metered per Streaming Unit (cheap for steady throughput); Databricks job clusters auto-terminate; Synapse Dedicated pauses outside business hours; Power BI Direct Lake removes the cost of a materialised tabular model.
Python
Topic — etl
Azure platform drills
Python
Topic — real-time-analytics
Streaming + lakehouse patterns
Choosing the right Azure service (cheat sheet)
A one-screen cheat sheet for the most-asked azure data engineering decisions.
| You want to … | Azure service | Notes |
|---|---|---|
| Orchestrate a daily batch | Azure Data Factory | Pipelines + Schedule trigger |
| Reach an on-prem source | Self-hosted IR + ADF | Self-hosted IR runs the Copy on your VM |
| Land raw files on the lake | ADLS Gen2 + ADF Copy / Auto Loader | Use HNS containers |
| Build a star-schema warehouse | Synapse Dedicated SQL Pool | HASH on join key; REPLICATE small dims |
| Ad-hoc lake SQL | Synapse Serverless SQL Pool | $5/TB scanned; partition pruning matters |
| Embedded Spark in Synapse | Synapse Spark Pool | Auto-pause; for moderate Spark workloads |
| Production lakehouse Spark | Azure Databricks | Delta + Unity Catalog + Photon |
| Upsert into a Delta table |
MERGE INTO on Databricks / Synapse |
Guard with updated_at for CDC |
| Compact Delta small files |
OPTIMIZE table (+ Z-ORDER) |
Run nightly on hot tables |
| Travel back in time | SELECT ... VERSION AS OF n |
Delta time-travel |
| Stream high-throughput events | Event Hubs (Kafka-compatible) | 32+ partitions for parallelism |
| Sub-second streaming SQL | Azure Stream Analytics | Tumbling / hopping / sliding windows |
| Workflow / app messaging | Azure Service Bus | Queues, topics, dead-letter |
| Serverless glue | Azure Functions | HTTP / Event Grid / Timer triggers |
| Catalog + lineage across services | Microsoft Purview | Pulls from ADF, Synapse, SQL, Databricks |
| Table + row + column ACL | Unity Catalog (Databricks) |
catalog.schema.table three-level namespace |
| Store secrets | Azure Key Vault | Linked-service URLs reference KV |
| BI dashboards | Power BI | Direct Lake (Fabric) for no-copy reads |
| New SaaS lakehouse (greenfield) | Microsoft Fabric | OneLake + Direct Lake + native Delta |
Frequently asked questions
What is Azure data engineering and which services should I learn first?
azure data engineering is the practice of building data pipelines on Microsoft Azure — moving data from sources (on-prem databases, SaaS apps, event streams) into a governed lake or warehouse, transforming it, and serving it to analytics and ML consumers. The three foundational services to learn first are Azure Data Factory (ADF) for orchestration and bulk movement, Azure Synapse Analytics for the unified SQL + Spark workspace, and Azure Databricks for production lakehouse compute on Delta Lake. Beneath them sits ADLS Gen2 — the hierarchical-namespace storage layer every service reads and writes — and around them sits the governance layer of Microsoft Purview, Unity Catalog, and Azure Key Vault. If you're shipping new pipelines in 2026, learn ADF + Databricks + ADLS Gen2 first; Synapse is essential when an enterprise BI workload demands the Dedicated SQL Pool's MPP semantics, and Microsoft Fabric is increasingly the default for greenfield projects.
What's the difference between Azure Synapse Analytics and Azure Databricks?
Synapse is a unified workspace fusing three compute engines — Dedicated SQL Pool (MPP warehouse), Serverless SQL Pool (pay-per-query lake SQL), and Spark Pool (managed Spark) — under one Studio, with embedded ADF-style pipelines. It's the right pick when your team is SQL-first, you need true warehouse semantics (high concurrency, workload management, result-set cache), or you want one tool for the whole job. Databricks is the most-deployed lakehouse engine on Azure; it gives you the richest Delta Lake + Unity Catalog + MLflow + Photon experience, with notebooks and clusters tuned for production Spark + ML. The most common pattern is both — Synapse for the SQL warehouse / Serverless lake SQL, Databricks for production Spark transforms and ML. For a 2026 interview, name Databricks for production lakehouse + ML and Synapse Dedicated for high-concurrency warehouse BI; mention Microsoft Fabric as the SaaS evolution if asked about the long-term direction.
What is ADLS Gen2 and how is it different from Azure Blob Storage?
ADLS Gen2 (Azure Data Lake Storage Gen2) is Azure Blob Storage with a hierarchical namespace (HNS) enabled — meaning /customers/year=2026/month=05/ is a true directory tree, not a flat key prefix. Three differences matter for data engineering: (1) atomic directory operations — renaming or deleting a directory is one transaction, not a per-file loop; (2) POSIX-style ACLs — file + directory permissions compatible with Hadoop / Spark ecosystems; (3) performance optimised for analytics — Spark, Databricks, and Synapse all detect HNS and use it for partition pruning, fast metadata listing, and locality-aware reads. For any analytics workload — Delta Lake medallion architectures, Parquet stores, Synapse external tables — use ADLS Gen2. Plain Blob is fine for media, backups, and static assets where directory semantics don't matter.
What are the three integration runtimes in Azure Data Factory?
ADF activities run on an Integration Runtime (IR) — the actual compute substrate. There are three flavours: (1) Azure IR — fully managed serverless cloud compute; the default for any cloud-to-cloud activity (Blob → Synapse Copy, Mapping Data Flow runs); (2) Self-hosted IR — software you install on a Windows VM (cloud or on-prem) so ADF can reach behind a firewall (on-prem SQL Server, Oracle, SAP); (3) Azure-SSIS IR — a lift-and-shift execution environment for legacy SQL Server Integration Services packages. The most common interview gotcha is "how do you reach an on-prem source?" — the answer is Self-hosted IR, not Azure IR. For a Mapping Data Flow that transforms cloud data, you use the Azure IR (it spins up a managed Spark cluster behind the scenes). Knowing which IR an activity needs — and the difference between time to live (TTL) and compute cost for each — is the senior fluency signal.
What is Delta Lake and why does every Azure lakehouse use it?
Delta Lake is the open-source table format that powers every modern Azure lakehouse. It's Parquet files plus a JSON transaction log (_delta_log/) that records every commit — added files, removed files, schema changes. From the user's perspective, this gives you ACID transactions on the lake (concurrent writers can't corrupt the table), MERGE INTO for upserts and SCD2, time travel (VERSION AS OF, TIMESTAMP AS OF), schema enforcement + evolution, and OPTIMIZE / Z-ORDER for compaction and fast point reads. On Azure, Delta is the default table format in Databricks, Synapse Spark Pools, and Microsoft Fabric's OneLake — meaning the same gold.revenue Delta table is queryable by Databricks SQL, Synapse Serverless, Fabric Direct Lake from Power BI, and Trino without copies. Interview-canonical answer: Delta turns the lake into a database — that's the lakehouse pattern, and it's the reason Azure DE has consolidated on it.
How does Unity Catalog compare to Microsoft Purview?
Unity Catalog is Databricks' centralised metastore — the three-level namespace (catalog.schema.table) that owns active access control (table / row / column ACLs via GRANT), lineage capture, and Delta Sharing across workspaces. It's the active enforcement layer: a query against gold.revenue only returns the rows and columns the calling user is allowed to see. Microsoft Purview is Azure's cross-service data catalog and governance product — it auto-scans ADF, Synapse, SQL DB, Cosmos DB, Power BI, and Databricks; classifies sensitive columns (PII, PCI, GDPR); and surfaces lineage across the entire estate. Purview is the discovery + classification layer; Unity Catalog is the access-control layer on the Databricks side. The senior architectural pattern is both — Unity Catalog owns the active permissions on Databricks-managed tables, Purview owns the cross-service catalog so a finance analyst can find data outside Databricks (in Synapse Dedicated, Cosmos DB, Power BI) with classification + lineage carried through.
Where to go next
PipeCode ships 450+ data-engineering interview problems — including Python practice that maps directly onto the SQL, PySpark, and ETL patterns you'll write on Azure Data Factory, Synapse, and Databricks. Whether you're prepping azure data engineer interview questions or shipping a brand-new Azure lakehouse, the practice library mirrors the same set-based and distributed-compute thinking this guide teaches — plus the modern lakehouse skills you'll keep using once you've landed the role.
Kick off via Explore practice →; drill the Python practice lane →; fan out into the ETL lane →; rehearse data-manipulation patterns →; reinforce data-analysis drills →; widen coverage on the full streaming Python library →; or explore the aggregation lane → for warehouse-style queries.





Top comments (0)