azure data engineer interview questions are won and lost on four rounds that most candidates underprepare for — and they have very little to do with reciting Azure Data Factory activity types. The high-signal rounds are ADLS Gen2 lake design (Hierarchical Namespace, medallion, partition keys, POSIX ACLs vs RBAC, lifecycle tiering), Event Hubs streaming (partitions, consumer groups, Capture, Stream Analytics vs Spark Structured Streaming), scenario rounds (incremental load with watermark control tables, CDC with Debezium / Azure DMS, SCD Type 2 with a single Delta MERGE INTO), and senior security + governance + cost-optimization (Managed Identity, Private Endpoint, Key Vault, Purview lineage, FinOps levers across Synapse Serverless and Databricks job-cluster + spot). Once you understand the rubric, every prompt becomes "which of the four scoring rounds is the reviewer running?"
This guide pivots away from generic service-recital trivia and walks through the azure data engineer interview questions that actually decide offers: how to defend WHERE event_date = ? partitioning on a 5-TB events table, how to argue partition key choice on an Event Hubs customer_id ordering question, how to write a watermark predicate and the exact SCD2 MERGE INTO interviewers expect on a whiteboard, and how to reason about a Managed-Identity-to-ADLS-Gen2 flow through a Private Endpoint with zero embedded secrets. Each ## section ends as azure interview questions with answers: a runnable solution, a step-by-step trace, a real output table, and a concept-by-concept why this works breakdown.
When you want hands-on reps immediately after reading, browse Python practice library →, drill ETL Python drills →, sharpen data-manipulation patterns →, rehearse streaming Python drills →, or widen coverage on the full data-analysis library →.
On this page
- Why these Azure interview questions trip people up
- ADLS Gen2 data lake design questions
- Event Hubs + streaming interview questions
- Scenario-design rounds — incremental load · CDC · SCD2
- Security, governance + cost optimization questions
- Azure interview cheat sheet
- Frequently asked questions
- Where to go next
1. Why these Azure interview questions trip people up
Azure loops use ADF / Synapse / Databricks as the backdrop — but the scoring rounds are about lake design, streaming correctness, scenarios, and FinOps
The honest framing: most candidates over-prepare service trivia ("name the three integration runtimes") and under-prepare the rounds that actually score the offer. Senior Azure DE loops keep the three big services (Azure Data Factory, Azure Synapse Analytics, Azure Databricks) as the setting, but the questions that decide pass/fail sit on top of that setting — they're about how you'd shape a lake, partition a stream, design a scenario end-to-end, and defend cost / security trade-offs to an architect.
Where candidates leak points.
-
Lake-design questions — they recite "medallion = bronze / silver / gold" but freeze on "pick a partition key for this 5-TB events table queried by
WHERE event_date = ?". -
Streaming questions — they name Event Hubs but cannot say why ordering only holds inside one partition, or pick the right partition key for
customer_id-ordered events. -
Scenario rounds — they describe ADF pipelines but cannot write the
WHERE updated_at > @watermarkpredicate or the exact SCD2MERGE INTO. - Security + cost rounds — they say "use Managed Identity" but cannot draw the end-to-end identity flow from Databricks through a Private Endpoint to ADLS Gen2 with zero embedded secrets.
What interviewers actually grade.
- Concrete design choices, not vocabulary — which partition key, which tier, which window function, which identity.
- Trade-off articulation — "I'd pick X over Y because of Z" with a real Z, not a slogan.
- Whiteboard-ability — can you write the SQL / Spark / KQL in 25 minutes under pressure?
- Cost awareness — at senior level, "I'd pause the Dedicated SQL Pool on a schedule and use spot for Databricks job clusters" beats "we have unlimited budget".
azure data engineer interview questions — the four rounds this guide drills.
- Round 1 — Lake design — ADLS Gen2 Hierarchical Namespace, partition keys, ACL vs RBAC, lifecycle.
- Round 2 — Streaming — Event Hubs partitioning, consumer groups, Capture, windowing.
- Round 3 — Scenarios — incremental load + CDC + SCD2 end-to-end.
- Round 4 — Security + Cost — Managed Identity, Private Endpoint, Key Vault, Purview, FinOps levers.
Cross-cutting worked example — picking a partition key for ADLS Gen2
Detailed explanation. The most common cross-cutting "warm-up" question on Azure DE loops is partition-key selection. Interviewers want to see that you partition by the query predicate (the column that will appear in 90% of WHERE clauses), not by ingest cadence or by a high-cardinality column.
Question. A 5-TB events table on ADLS Gen2 stores 18 months of clickstream events. 95% of downstream queries are WHERE event_date = ? for a single day. Pick a partition key and defend it.
Input.
| Column | Cardinality | Used in WHERE clause |
|---|---|---|
event_id |
~5 billion | 0% (surrogate) |
customer_id |
~50 million | 5% (rare) |
event_date |
~540 days | 95% |
country_code |
~250 | 30% |
Code.
# Spark write — partition by the predicate, not by ingest day
(events_df
.write
.format("delta")
.partitionBy("event_date") # query predicate, not surrogate
.mode("append")
.save("abfss://silver@acct.dfs.core.windows.net/events/")
)
# Reader benefits from partition pruning
spark.read.format("delta").load(path).where("event_date = '2026-05-24'").count()
Step-by-step explanation.
- Reader's
WHERE event_date = '2026-05-24'matches a partition directoryevent_date=2026-05-24/. - Spark prunes every other partition before file listing — it never opens 539 other days of Parquet files.
- Without partitioning, Spark would scan 5 TB; with this partitioning, ~10 GB.
-
customer_idwould have been a bad choice (50 million partitions = small-files explosion);event_idis even worse.
Output.
| Strategy | Files scanned | Estimated bytes read |
|---|---|---|
| No partitioning | all (~50,000) | ~5 TB |
partitionBy(event_date) |
~100 | ~10 GB |
partitionBy(customer_id) |
~50 million metadata entries | tiny scan but lake corrupt |
Rule of thumb: partition by the filter column with low-to-medium cardinality. Date is almost always the right answer for event data; high-cardinality keys belong in Z-ORDER, not partitionBy.
Python
Topic — etl
ETL Python drills (partition + write patterns)
Python
Topic — data-analysis
Distributed-systems design patterns
2. ADLS Gen2 data lake design questions
adls gen2 interview questions — Hierarchical Namespace, medallion, partition keys, ACL vs RBAC, lifecycle
adls gen2 interview questions is where the lake-design round opens. The senior answer in one sentence: ADLS Gen2 is Blob Storage with Hierarchical Namespace (HNS) turned on; HNS gives you atomic directory rename, POSIX-style ACLs, and the DFS abfss:// endpoint, which together unlock the medallion lake pattern and per-folder permissioning that analytics workloads need.
Hierarchical Namespace — non-negotiable for analytics.
- HNS on vs off — without HNS, ADLS Gen2 is a flat blob namespace; with HNS, you get real directories.
-
Atomic directory rename — Spark's
_temporary→ final commit relies on this; flat Blob fakes it by copying every file. -
POSIX-style ACLs — per-folder + per-file
r-x/rwxmasks, distinct from RBAC. -
DFS endpoint —
abfss://container@account.dfs.core.windows.net/path(thedfsis the HNS endpoint;blobis the flat endpoint). - Interview signal — saying "I'd enable HNS" within the first sentence is the basic-but-tested signal that you've actually built on ADLS Gen2.
Medallion container layout.
-
bronze/— raw landed data, append-only, schema-on-read; partitioned by ingest date (year=YYYY/month=MM/day=DD). -
silver/— cleansed and conformed; deduped; typed; partitioned by the natural business key + date. -
gold/— business-ready facts and dimensions; aggregations; partitioned by query predicate. - Why three containers, not three folders? — container-level ACLs and lifecycle policies make security + cost rules much cleaner; senior interviewers ask for this.
Partition key selection — date vs categorical vs hash.
-
Date partitioning (
event_date=YYYY-MM-DD) — the default; works for 90% of fact tables where queries filter by time. -
Categorical (
country=US,region=EU) — works when query predicate is the category AND cardinality is low (≤ ~1,000). -
Hash partitioning (
bucket=0..N) — niche; used when you need balanced parallelism on a join key with high skew; bucketing in Spark / Delta. -
Anti-pattern — partitioning by
customer_id(millions of partitions = millions of metadata entries = NameNode-equivalent meltdown).
ACL vs RBAC — when each wins.
- RBAC — Azure-native, applies at subscription / resource group / storage account / container scope; coarse-grained; integrates with Entra ID.
-
ACL — POSIX
r-x/rwxon individual folders + files; fine-grained; needed when you want "team A readsgold/finance/, team B readsgold/marketing/". -
Effective permission — a principal needs RBAC
Storage Blob Data Readerat the storage account level plus an ACL that grants execute on every parent folder to reach the leaf. - Default ACL — set on a folder so newly created children inherit it; otherwise every new file is a permission ticket.
Lifecycle policies — Hot → Cool → Archive.
- Hot tier — high $/GB but free reads; everything within the last ~30 days.
- Cool tier — lower $/GB but per-read charges; auditable history 30–180 days.
- Archive tier — cheapest $/GB but rehydration is hours-long; regulatory retention only.
-
Policy expression — JSON rule that transitions blobs based on
daysSinceModificationordaysSinceCreation; one rule across all containers, scoped by prefix.
File format — Parquet vs Delta vs Iceberg.
- Parquet — column-store, splittable, compressed; the lowest-common-denominator format every Azure compute engine reads.
-
Delta Lake — Parquet + transaction log (
_delta_log/); ACID, time travel,MERGE INTO, schema evolution. - Iceberg — Parquet + a different manifest layout; gaining traction on Azure via OneLake; not as deeply integrated as Delta on Azure Databricks.
- Interview default — pick Delta unless you have a reason not to; it's the default in Azure Databricks and Microsoft Fabric.
Small-files problem.
- Definition — many sub-MB files in a partition; metadata operations balloon, listings get slow, Spark job startup gets bottlenecked on the driver.
- Cause — streaming sinks committing every micro-batch; over-partitioned writes.
-
Fix on Delta —
OPTIMIZE table ZORDER BY (cols)compacts and clusters; on plain Parquet, scheduled compaction jobs. - Prevention — coalesce / repartition before writing; raise streaming trigger interval; partition by lower-cardinality keys.
Worked example 1 — pick a partition key on a 5-TB events table
Topic explanation.
- The scoring lens — interviewers want partition by the dominant query predicate, with cardinality in the low hundreds-to-thousands range per directory.
-
The trap — candidates often pick a high-cardinality column like
customer_idbecause "it's selective" — but that creates one folder per customer.
Question. A 5-TB events table on ADLS Gen2 needs a partition key. 95% of queries filter by event_date. Pick one.
Input. Same table as §1 — event_date (~540 days), country_code (~250), customer_id (~50 M).
Code.
# Delta write
(events_df
.write
.format("delta")
.partitionBy("event_date")
.mode("append")
.save("abfss://silver@acct.dfs.core.windows.net/events/")
)
Step-by-step explanation.
- We pick
event_datebecause 95% of queries filter by it — partition pruning is maximised. - We do not add a second partition column (e.g.
event_date, country_code) because that would 250× the partition count to ~135,000 — small-files territory. - We leave
country_codeandcustomer_idforZORDER BYon the silver table to get file-level skipping without partition explosion.
Output.
| Partition strategy | Partition count | Avg file size | Query bytes read |
|---|---|---|---|
partitionBy(event_date) |
~540 | ~1 GB | ~10 GB |
partitionBy(event_date, country) |
~135,000 | ~4 MB | ~10 GB but slow listing |
partitionBy(customer_id) |
~50 M | KB | lake unusable |
Rule of thumb: one partition column, hundreds-to-low-thousands of partitions, ~256 MB to ~1 GB per file.
Worked example 2 — calculate ACL inheritance for a new file
Topic explanation.
- The scoring lens — interviewers test whether you know that ACLs on a parent do not retroactively apply to existing children, and that default ACLs are how you make new children inherit.
- The trap — candidates assume "I set ACL on the folder, all new files inherit" — true only if it was set as a default ACL.
Question. A silver/ folder has Default ACL group:data-eng:rwx. A user in data-eng writes a new file under silver/orders/2026-05-24/. What ACL does the new file have?
Input.
| Path | ACL set on folder |
|---|---|
silver/ |
Default group:data-eng:rwx
|
silver/orders/ |
(inherits from silver/) |
silver/orders/2026-05-24/ |
(inherits from silver/orders/) |
Code.
# Azure CLI to inspect
az storage fs access show \
--file-system silver \
--path "orders/2026-05-24/new_file.parquet" \
--account-name acct
Step-by-step explanation.
- At write time, the new file inherits the access ACL from the parent's default ACL.
-
default:group:data-eng:rwxbecomes the file'sgroup:data-eng:rwxaccess entry. - To read the file, the principal also needs execute (
--x) on every parent folder up to the container root — this is the "X all the way down" rule. - If the principal doesn't have RBAC
Storage Blob Data Readeron the storage account, ACLs alone are not enough.
Output. The new file's effective ACL is group:data-eng:rwx (plus the owner / mask defaults); any data-eng member can read it provided they have RBAC at the account level.
Rule of thumb: Default ACL on the parent + execute on every ancestor + RBAC at the account = three things every Azure storage permission round expects.
Solution Using a partitioned + ACL-secured medallion layout
# === Solution: bronze → silver → gold with HNS + ACLs + lifecycle ===
account = "abfss://{container}@acct.dfs.core.windows.net"
# 1) Bronze — append-only raw, partitioned by ingest day
(raw_events_df
.write
.format("delta")
.mode("append")
.partitionBy("ingest_date")
.save(account.format(container="bronze") + "/events/")
)
# 2) Silver — cleansed + deduped, partitioned by event_date
(spark.read.format("delta").load(account.format(container="bronze") + "/events/")
.dropDuplicates(["event_id"])
.withColumn("event_date", to_date("event_ts"))
.write
.format("delta")
.mode("append")
.partitionBy("event_date")
.save(account.format(container="silver") + "/events/")
)
# 3) Gold — daily agg by country, OPTIMIZE + ZORDER for file skipping
(spark.read.format("delta").load(account.format(container="silver") + "/events/")
.groupBy("event_date", "country_code")
.agg(count("*").alias("event_count"))
.write
.format("delta")
.mode("overwrite")
.partitionBy("event_date")
.save(account.format(container="gold") + "/events_daily/")
)
spark.sql("OPTIMIZE delta.`{path}` ZORDER BY (country_code)".format(
path=account.format(container="gold") + "/events_daily/"
))
Step-by-step trace.
| Step | Container | Operation | Partition col | Notes |
|---|---|---|---|---|
| 1 | bronze | append raw events | ingest_date |
append-only, schema-on-read |
| 2 | silver | dedup by event_id
|
event_date |
cleansed, business-time |
| 3 | gold | daily agg by country | event_date |
ZORDER on country_code
|
| 4 | gold |
OPTIMIZE compact + Z |
- | small-file fix |
-
Bronze accepts a daily append of raw events under
ingest_date=YYYY-MM-DD/; ACLgroup:ingestion:rwxon the container. -
Silver deduplicates by
event_idand switches to business time (event_date); ACLgroup:data-eng:rwx. -
Gold rolls daily aggregates and runs
OPTIMIZE ... ZORDER BY (country_code)so the BI layer scans MB, not GB; ACLgroup:bi:r-x. -
Lifecycle policy at the storage-account level transitions
bronze/to Cool after 30 days and Archive after 180.
Output:
| Container | Partition pattern | Reader RBAC + ACL |
|---|---|---|
| bronze | events/ingest_date=2026-05-24/ |
RBAC Storage Blob Data Reader + --x chain |
| silver | events/event_date=2026-05-24/ |
RBAC + ACL group:data-eng:r-x
|
| gold | events_daily/event_date=2026-05-24/ |
RBAC + ACL group:bi:r-x
|
Why this works — concept by concept:
-
Hierarchical Namespace — gives us POSIX ACLs + atomic dir rename + the
abfss://DFS endpoint that Spark expects. -
Medallion containers — separate ACLs and lifecycle scopes; ingestion never has write to
gold/. - Partition by event_date — matches the dominant query predicate so partition pruning kicks in.
- ZORDER BY country_code — adds file-level skipping for the secondary predicate without exploding the partition count.
- OPTIMIZE — compacts small files written by streaming or micro-batch sinks into ~1 GB files.
- Default ACL + RBAC layering — gives sub-team isolation while a single RBAC role manages account-wide access.
- Lifecycle policy — cuts storage cost by tiering cold history without touching the read path.
-
Cost —
O(partitions_scanned × avg_file_size)for the read path; lifecycle policies make the long-tail storageO($cool_or_archive_rate).
Python
Topic — etl
Lake-design + medallion drills
Python
Topic — data-manipulation
Partitioning + Z-ORDER drills
3. Event Hubs + streaming interview questions
azure event hubs interview questions — partitions, consumer groups, Capture, windowing
azure event hubs interview questions is the streaming round. Event Hubs sits where Kafka sits in AWS / on-prem stacks — it's the durable, partitioned log that producers publish to and consumers read from with their own cursors. The senior answer in one sentence: Event Hubs is a partitioned, append-only event log where partition count caps parallelism, ordering holds inside a partition, and consumer groups give independent readers; Capture auto-archives to ADLS Gen2, the Kafka API makes Kafka producers Just Work, and downstream you choose Stream Analytics or Databricks Structured Streaming based on language and team skill.
Partitions — the parallelism + ordering knob.
- Partition count — set at hub creation; basic / standard tiers allow up to 32; Premium / Dedicated go higher.
- Parallelism cap — at most one consumer per partition per consumer group reads in parallel; partition count = max parallelism.
- Ordering — guaranteed only inside a single partition; across partitions you see interleaving.
-
Partition key — events sharing a key always land on the same partition (via
hash(key) mod num_partitions); this is how you preserve per-key ordering. -
Interview signal — "I'd pick
customer_idas the partition key if downstream needs per-customer ordering" is the canonical answer.
Consumer groups.
- Definition — a named subscription on the hub with an independent cursor per partition.
-
Use case — fan-out:
cg-analyticspowers Stream Analytics,cg-databrickspowers Structured Streaming, and both see the same events independently. - Limit — Standard tier caps at 20 consumer groups per hub; Premium higher.
- Anti-pattern — sharing one consumer group across two unrelated apps causes cursor races and missed messages.
Capture — auto-archive to ADLS Gen2.
- What it does — writes every event into ADLS Gen2 (or Blob) in Avro on a configurable cadence (minutes / MB-rolled).
- Why — gives you a free, replayable bronze layer without writing a consumer.
-
Path template —
{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}— partition-by-partition. - Downstream — point Databricks Auto Loader at the Capture path and your bronze layer maintains itself.
Throughput units, Premium, and Dedicated.
- Throughput Unit (TU) — Standard tier; 1 TU = 1 MB/s ingress + 2 MB/s egress + 1000 events/s ingress per TU.
- Premium — more reliable performance, more partitions per hub, geo-DR; better fit for production.
- Dedicated — single-tenant clusters; price-per-CU, useful only at very high throughput.
- Interview signal — "I'd start on Premium" beats "Standard, then scale TUs" once volume > a few hundred MB/s.
Kafka API compatibility.
-
Event Hubs speaks Kafka 1.0+ — a Kafka producer / consumer points at the Event Hubs
kafka.serversendpoint without code changes. - Why it matters — lets teams migrate from on-prem Kafka without rewriting producers.
- Caveat — some Kafka admin APIs aren't supported (topic creation via Kafka API is restricted; create hubs via ARM / portal).
At-least-once + dedup.
- Event Hubs delivers at-least-once — duplicates are possible across producer retries and consumer crash-restart.
-
Dedup strategies — idempotent producer (Kafka-style), consumer-side dedup keyed by
event_id, DeltaMERGE INTOupserts on the consumer. - Exactly-once is a chain — only true when every link (producer, broker, consumer, sink) is exactly-once.
Windowing — Stream Analytics vs Structured Streaming.
-
Tumbling — non-overlapping fixed-size windows (
SELECT ... FROM Input TIMESTAMP BY ts GROUP BY TumblingWindow(minute, 5)). - Hopping — fixed size with a shorter hop; same event appears in multiple windows (rolling 5-min stats every 1 min).
- Sliding — windows close on every event; expensive but precise.
- Session — windows close after a gap of inactivity; user-session analytics.
- Stream Analytics — declarative SQL on Event Hubs / IoT Hub; great for analyst-style streaming.
- Databricks Structured Streaming — full Spark; supports checkpointing, watermark, custom triggers; better for complex transformations and Delta sinks.
When Event Hubs vs Kafka on HDInsight vs Confluent.
- Event Hubs — fully managed, Azure-native, billing in the same subscription, free Capture; default for greenfield Azure.
- Kafka on HDInsight — managed Kafka cluster; legacy, more knobs, more ops cost.
- Confluent Cloud on Azure — full Kafka ecosystem (Connect, Schema Registry, ksqlDB); pick when you need the Confluent connectors and tooling.
Worked example 1 — partition key for customer_id ordering
Topic explanation.
-
The scoring lens — interviewers want you to use the ordering guarantee (events for the same
customer_idland on the same partition, so they arrive in order) rather than rely on the consumer to sort.
Question. Orders for the same customer_id must be processed in arrival order. Pick a partition key and explain.
Input.
| Event | Producer time | customer_id |
|---|---|---|
| e1 | 10:00:00 | C1 |
| e2 | 10:00:01 | C2 |
| e3 | 10:00:02 | C1 |
Code.
from azure.eventhub import EventHubProducerClient, EventData
producer = EventHubProducerClient.from_connection_string(conn_str, eventhub_name="orders")
batch = producer.create_batch(partition_key="C1") # all C1 events → same partition
batch.add(EventData(b'{"customer_id":"C1","order":"e1"}'))
batch.add(EventData(b'{"customer_id":"C1","order":"e3"}'))
producer.send_batch(batch)
Step-by-step explanation.
- Producer sets
partition_key="C1"so e1 and e3 hash to the same partition. - Inside that partition, Event Hubs preserves arrival order — e1 is offset N, e3 is offset N+1.
- Consumer reading that one partition sees e1 before e3 — guaranteed.
- e2 lands on a different partition (hashed from
C2) — no ordering relationship with C1's events.
Output. Per-customer ordering is preserved without any consumer-side sorting.
Rule of thumb: the partition key is the grain at which ordering matters — customer_id, device_id, account_id, etc.
Worked example 2 — Hopping window aggregation
Topic explanation.
- The scoring lens — interviewers expect you to know that a 5-minute Hopping window with a 1-minute hop emits one row per minute summarising the last 5 minutes — i.e. rolling stats with overlap.
Question. From an orders Event Hub, emit (window_end, orders_count, revenue) every 1 minute over a 5-minute rolling window.
Input. A stream of order events with ts and amount.
Code (Stream Analytics SQL).
SELECT
System.Timestamp() AS window_end,
COUNT(*) AS orders_count,
SUM(amount) AS revenue
INTO [orders-1m-rollup]
FROM [orders] TIMESTAMP BY ts
GROUP BY HoppingWindow(minute, 5, 1)
Step-by-step explanation.
-
TIMESTAMP BY tstells ASA to use the event's own timestamp, not the arrival time. -
HoppingWindow(minute, 5, 1)defines a 5-minute window that hops every 1 minute. - Every minute, ASA emits one row summarising the previous 5 minutes — so each event participates in 5 different windows.
-
System.Timestamp()is the window end.
Output.
| window_end | orders_count | revenue |
|---|---|---|
| 2026-05-24 10:01:00 | 482 | 18,402 |
| 2026-05-24 10:02:00 | 511 | 19,920 |
| 2026-05-24 10:03:00 | 533 | 20,801 |
Rule of thumb: tumbling for non-overlapping reporting buckets, hopping for rolling stats, sliding for per-event triggers, session for inactivity-bounded analysis.
Solution Using a partitioned hub + dedicated consumer group + Auto Loader
# === Solution: Event Hubs → Databricks Structured Streaming → Delta silver ===
from pyspark.sql.functions import col, from_json, to_timestamp
eh_conf = {
"eventhubs.connectionString": dbutils.secrets.get("kv-eh", "orders-conn"),
"eventhubs.consumerGroup": "cg-databricks",
}
raw = (spark.readStream
.format("eventhubs")
.options(**eh_conf)
.load()) # 8 partitions → 8 Spark tasks
schema = "customer_id STRING, order_id STRING, amount DOUBLE, ts STRING"
parsed = (raw
.select(from_json(col("body").cast("string"), schema).alias("d"))
.select("d.*")
.withColumn("ts", to_timestamp("ts")))
(parsed.writeStream
.format("delta")
.option("checkpointLocation",
"abfss://silver@acct.dfs.core.windows.net/_chk/orders/")
.option("mergeSchema", "true")
.partitionBy("customer_id")
.trigger(processingTime="60 seconds")
.start("abfss://silver@acct.dfs.core.windows.net/orders/"))
Step-by-step trace.
| Step | Component | Action |
|---|---|---|
| 1 | Producer | publishes orders keyed by customer_id
|
| 2 | Event Hub orders
|
8 partitions; per-customer order preserved within partition |
| 3 | Consumer group cg-databricks
|
independent cursor per partition |
| 4 | Spark readStream
|
one task per partition reads from offsets |
| 5 | Spark transform | JSON parse + cast ts
|
| 6 | Delta writeStream | append to silver with checkpoint + 60s trigger |
| 7 | Capture (separate) | auto-archives raw Avro to bronze |
- Producer sets
partition_key=customer_idso per-customer arrival order is preserved. - Event Hubs distributes by
hash(customer_id) % 8; one Spark task reads each of the 8 partitions. - The Databricks consumer group has its own cursor per partition; it doesn't interfere with
cg-analytics(used by Stream Analytics). - Structured Streaming uses
checkpointLocationto persist the cursor — exactly-once-with-idempotent-sink semantics. - The 60s trigger batches micro-batches so we don't write a 4-KB Parquet file every 50 ms.
- Capture independently archives raw events to bronze as a free durable replay log.
Output:
| Sink | Format | Frequency | Purpose |
|---|---|---|---|
silver/orders/ |
Delta | every 60 s | downstream analytics + MERGE |
bronze/eh-capture/orders/{PartitionId}/.../ |
Avro | per Capture window | durable replay |
Stream Analytics → orders-1m-rollup (separate cg-analytics) |
SQL | every 60 s | BI rollups |
Why this works — concept by concept:
- Partition key = customer_id — pins per-customer ordering at the broker.
-
Consumer group isolation —
cg-databricksandcg-analyticsread the same hub with independent cursors. - Checkpoint location — Structured Streaming durably persists offsets, so a crash resumes where it left off.
- Trigger interval — 60 s avoids the small-files trap of per-millisecond writes.
- Capture — gives a free bronze raw archive without writing a consumer.
- Kafka API — same Event Hub can be re-fronted to a Kafka producer with zero protocol change.
-
At-least-once + idempotent sink — Delta
mergeSchema+ upserts make duplicates safe. -
Cost —
O(partitions × TU_throughput)for ingress; reads scale with consumer group count, not extra ingress.
Python
Topic — streaming
Streaming Python drills
Python
Topic — sliding-window
Sliding-window drills
4. Scenario-design rounds — incremental load · CDC · SCD2
azure scenario based interview questions — incremental load · CDC · SCD2
azure scenario based interview questions is the round where the interviewer hands you a vague business prompt ("ingest a 50-million-row table daily, only the changes; track customer history over time") and watches you decompose it into a concrete Azure design. The three classics are incremental load with a watermark, CDC streaming via Debezium / Azure DMS, and SCD Type 2 with a Delta MERGE INTO.
Incremental load — the watermark pattern.
- Goal — only pick up rows that have changed since the last successful run; no full table scans on a 500-million-row source.
-
Control table — a small Azure SQL DB table
watermark_control(pipeline_name, last_loaded_ts)updated atomically at the end of each successful run. -
ADF predicate —
WHERE updated_at > @watermarkinjected via parameter binding;@watermarkread from the control table at the start of the run. - Idempotency — read-then-update is atomic; if the load fails, the watermark is not advanced.
CDC streaming — Debezium / Azure DMS.
- Goal — stream every INSERT / UPDATE / DELETE from an OLTP SQL Server into the lake without polling.
- Mechanism — Debezium (or Azure Data Migration Service) reads the SQL Server transaction log; emits change events to Event Hubs.
-
Schema — each event is
{op: c|u|d, before: {...}, after: {...}, ts_ms: ...}. -
Downstream — Databricks Auto Loader (or Structured Streaming) reads the Event Hub; applies the change to silver Delta with
MERGE INTO. -
Why over polling — log-based CDC catches deletes; polling on
updated_atmisses them.
SCD Type 2 — MERGE INTO with __start_dt__ / __end_dt__ / __is_current__.
-
Goal — preserve history of slowly changing dimension rows (e.g.
dim_customer.addresschanges over time, but reports for last quarter must show the old address). -
Schema — natural key columns +
__start_dt__+__end_dt__+__is_current__. -
Update rule — when a row's tracked attributes change: close the old row (
__end_dt__ = today,__is_current__ = false) and insert a new row (__start_dt__ = today,__end_dt__ = '9999-12-31',__is_current__ = true). -
Single transaction — Delta
MERGE INTOwith twoWHEN MATCHED/WHEN NOT MATCHEDclauses does both halves atomically.
Idempotency + retry.
-
Idempotent design — re-running the same incremental load with the same
@watermarkproduces the same result. -
ADF retry policy — set on each activity (
retry: 3, retryIntervalInSeconds: 60); set policy toOnFailureto update an audit row. -
Dead-letter — bad rows go to a
errors/container with the row, the error message, and the timestamp.
Worked example 1 — write the watermark predicate for ADF
Topic explanation.
- The scoring lens — interviewers want the read-then-update flow, with the watermark stored in a transactional store (Azure SQL DB) and the update happening only on success.
Question. Write the source query and the control-table update for an ADF incremental load of orders from Azure SQL DB.
Input.
| Table | Columns |
|---|---|
orders (source) |
order_id, customer_id, amount, updated_at
|
watermark_control |
pipeline_name, last_loaded_ts
|
Code.
-- Lookup activity: read current watermark
SELECT last_loaded_ts
FROM watermark_control
WHERE pipeline_name = 'orders_incremental';
-- Copy activity: source query (parameterised)
SELECT order_id, customer_id, amount, updated_at
FROM orders
WHERE updated_at > @{activity('LookupWatermark').output.firstRow.last_loaded_ts}
AND updated_at <= @{pipeline().parameters.runStartTs};
-- Stored procedure activity: update watermark on success
UPDATE watermark_control
SET last_loaded_ts = @runStartTs
WHERE pipeline_name = 'orders_incremental';
Step-by-step explanation.
- ADF runs a Lookup that reads the current
last_loaded_tsfromwatermark_control. - ADF runs Copy with a parameterised source query that filters
WHERE updated_at > @watermark AND <= @runStartTs— the upper bound makes the window deterministic across retries. -
Only on success does ADF run the Stored Procedure activity that advances the watermark to
@runStartTs. - If Copy fails, the watermark is not advanced — the next run re-reads the same window safely.
Output. Each successful run advances last_loaded_ts by exactly the run window; failed runs do not advance it.
Rule of thumb: always pair the watermark with an explicit upper bound (runStartTs) so retries are deterministic.
Worked example 2 — write the SCD2 MERGE INTO
Topic explanation.
-
The scoring lens — interviewers want the single-MERGE that closes the old row and inserts the new row in one transaction, with
__is_current__as the natural-key uniqueness guard.
Question. Write the Delta MERGE INTO for SCD Type 2 on dim_customer where address changes track history.
Input.
| Side | Table | Columns |
|---|---|---|
| target | dim_customer |
customer_id, name, address, __start_dt__, __end_dt__, __is_current__
|
| source | stg_customer |
customer_id, name, address, effective_dt
|
Code.
MERGE INTO gold.dim_customer t
USING (
-- 1) the "close the old row" half — same customer, attributes changed
SELECT s.customer_id AS merge_key,
s.customer_id, s.name, s.address, s.effective_dt
FROM silver.stg_customer s
JOIN gold.dim_customer g
ON g.customer_id = s.customer_id AND g.__is_current__
WHERE g.address <> s.address
UNION ALL
-- 2) the "insert the new row" half — uses a NULL merge_key so MERGE inserts
SELECT CAST(NULL AS STRING) AS merge_key,
s.customer_id, s.name, s.address, s.effective_dt
FROM silver.stg_customer s
) src
ON t.customer_id = src.merge_key AND t.__is_current__
WHEN MATCHED THEN UPDATE SET
t.__end_dt__ = src.effective_dt,
t.__is_current__ = false
WHEN NOT MATCHED THEN INSERT
(customer_id, name, address, __start_dt__, __end_dt__, __is_current__)
VALUES
(src.customer_id, src.name, src.address, src.effective_dt, DATE '9999-12-31', true);
Step-by-step explanation.
- The
USINGclause builds a two-row-per-change source viaUNION ALL— one row to close the current version, one row to insert the new version. - The first half carries the real
customer_idasmerge_key— this row triggersWHEN MATCHEDand closes the old row. - The second half carries
merge_key = NULL— this row never matches and triggersWHEN NOT MATCHEDto insert the new version. - Both halves run in the same
MERGE INTOtransaction — Delta's optimistic concurrency makes this atomic.
Output.
customer_id |
name |
address |
__start_dt__ |
__end_dt__ |
__is_current__ |
|---|---|---|---|---|---|
| C1 | Alice | 1 Old St | 2025-01-01 | 2026-05-24 | false |
| C1 | Alice | 99 New Ave | 2026-05-24 | 9999-12-31 | true |
Rule of thumb: the SCD2 idiom is two rows per change in the USING clause, joined to the current row via a real key for the close half and a NULL key for the insert half.
Solution Using watermark + CDC + SCD2 end-to-end
# === Solution: end-to-end Azure scenario stack ===
# 1) Incremental load (ADF)
Lookup(watermark_control) → Copy(source: SELECT * WHERE updated_at > @wm) → SP(update watermark)
# 2) CDC streaming (always-on)
SQL Server (CDC ON) → Debezium / Azure DMS → Event Hubs cdc.orders
↓
Databricks Auto Loader → silver.orders (Delta)
# 3) SCD2 nightly (Databricks job)
silver.stg_customer → MERGE INTO gold.dim_customer (close + insert)
Step-by-step trace.
| Stage | Component | Cadence | Idempotency mechanism |
|---|---|---|---|
| 1 | ADF Copy | hourly | watermark + upper bound |
| 2 | Debezium → EH | always-on | log-position cursor |
| 3 | Auto Loader | continuous | checkpointLocation |
| 4 | MERGE INTO SCD2 | nightly |
__is_current__ uniqueness guard |
- ADF hourly loads the incremental delta into
bronze.ordersusing the watermark predicate. - Debezium reads the SQL Server transaction log and emits
cdc.ordersevents to Event Hubs. - Databricks Auto Loader consumes
cdc.orders, applies the c / u / d operation, and lands silver. - A nightly Databricks job runs the SCD2
MERGE INTOagainstgold.dim_customer.
Output:
| Layer | Table | What it represents |
|---|---|---|
| bronze | bronze.orders |
raw hourly snapshot deltas (ADF) |
| bronze | bronze.cdc_orders |
raw CDC events (Auto Loader from Event Hubs) |
| silver | silver.orders |
conformed, deduped, partitioned by event date |
| gold | gold.dim_customer |
SCD2 history with __start_dt__ / __end_dt__
|
Why this works — concept by concept:
-
Watermark + upper bound — guarantees deterministic retries; the same
@runStartTsalways defines the same window. - Log-based CDC — catches deletes that timestamp-based polling misses.
- Debezium / Azure DMS — Microsoft-blessed CDC connectors that emit a stable change-event schema.
- Auto Loader — manages file listing + checkpointing for streaming sinks; cheaper than custom Spark Structured Streaming code.
- MERGE INTO — single-transaction close + insert; Delta's optimistic concurrency makes it atomic.
-
`is_current_` flag_ — fast
WHERE __is_current__queries and the natural uniqueness guard for the MERGE join. - Idempotency chain — every link (Copy, CDC, MERGE) is independently idempotent, so re-runs don't corrupt downstream.
-
Cost —
O(delta_rows_per_window)for incremental +O(changed_rows)for SCD2; full-snapshot scans are avoided.
SQL
Topic — etl
CDC + watermark + MERGE drills
SQL
Topic — real-time-analytics
Real-time analytics drills
5. Security, governance + cost optimization questions
azure security interview questions and azure cost optimization interview questions — the rounds that decide senior offers
The senior-grade Azure DE round splits into two halves: identity + network + governance and FinOps. The one-sentence invariant: Managed Identity over SAS tokens, Private Endpoint over public networking, Key Vault over inline secrets, Purview over scattered docs — and on cost, pause + spot + partition prune over throwing DWUs at the problem.
Managed Identity vs Service Principal vs SAS token.
- Managed Identity (system or user-assigned) — an Entra-managed identity bound to an Azure resource (Databricks cluster, ADF, VM); the resource calls Azure APIs with no secret to rotate.
- Service Principal — an Entra app registration with a client secret / certificate; requires rotation; necessary for federated identities or cross-tenant access.
- SAS token — a time-bounded URL with embedded permissions; emergency-use only; cannot be revoked before expiry.
- Interview default — Managed Identity for anything in-Azure; Service Principal only when MI is impossible; SAS only for limited external sharing.
Private Endpoint vs Service Endpoint vs public network.
- Public network — your storage account is reachable from anywhere on the internet; only protected by RBAC + ACL.
- Service Endpoint — VNet-to-PaaS shortcut; storage account is still publicly addressable, but traffic from your VNet stays on the Azure backbone.
- Private Endpoint — a private NIC inside your VNet for the storage account; the storage account is no longer publicly reachable; DNS resolves to the private IP.
- Compliance default — Private Endpoint + storage firewall set to "deny by default".
Key Vault — secrets + keys + certificates.
- What it stores — connection strings, SAS tokens, API keys, certificates, customer-managed encryption keys.
- Access — every consumer (ADF Linked Service, Databricks secret scope, App Service) authenticates via Managed Identity and pulls secrets at runtime.
- Rotation — Key Vault supports auto-rotation for certain resource types (Storage, SQL); for others, rotate manually + update Linked Services to read the new version.
- Audit — every read is logged via diagnostic settings; SOC2 / ISO27001 require this.
RBAC scopes.
- Subscription — broadest; rarely the right scope for data plane.
- Resource group — grouping of related resources; reasonable for app-level access.
- Resource — most specific; the recommended scope for storage-data-plane roles.
-
Data-plane roles —
Storage Blob Data Reader/Contributor/Ownerare separate from control-plane roles likeStorage Account Contributor; this distinction is a senior-interview filter.
Purview — catalog, lineage, classification.
- Catalog — automatic + manual asset registration; metadata search across Azure SQL, ADLS Gen2, Synapse, Power BI.
- Lineage — column-level lineage harvested from ADF, Databricks, Synapse pipelines.
- Classification — automatic PII / GDPR-sensitive labelling via regex + ML classifiers.
- Data masking — column-level masking surfaced via policies on the source store (Azure SQL Dynamic Data Masking, Databricks row/column filters).
Cost lever 1 — Dedicated SQL Pool pause/resume.
- Mechanism — Dedicated SQL Pool DWUs are billed by the hour; pause it when idle.
- Automation — a Logic App or ADF schedule pauses at 7 PM and resumes at 7 AM; weekend pauses save another 28%.
- Savings — ~60–70% off list price for a Mon-Fri 7am–7pm workload.
Cost lever 2 — Databricks job-cluster + spot + auto-terminate.
- Job clusters vs interactive clusters — interactive bills idle time; job clusters spin up for the run and terminate at the end.
- Spot instances — up to 80% off VM list; tolerate eviction; perfect for retryable batch jobs.
- Auto-terminate — set to 10–30 minutes for interactive clusters so nobody forgets a notebook and burns a weekend.
Cost lever 3 — Synapse Serverless ($/TB scanned) + partition pruning.
- Pricing model — pay per terabyte scanned, not per second of compute.
- Lever — partition by the dominant predicate, store in Parquet / Delta, and Serverless reads only the relevant partitions.
-
Real impact — a 5-TB table scanned with
WHERE event_date = ?reads ~10 GB after pruning, saving ~99.8% of cost.
Reserved capacity vs pay-as-you-go.
- Reserved — 1-year or 3-year commitment for fixed capacity; ~30–60% off list.
- Pay-as-you-go — full flexibility; right for unpredictable workloads.
- Rule — reserve the baseline (the floor you always run); pay-as-you-go for the burst.
Azure FinOps reporting.
- Cost Management + Billing — daily-resolution cost analysis, budget alerts, tagged-resource grouping.
- Azure Advisor — automated cost recommendations (e.g. "right-size this Dedicated SQL Pool").
-
Tag discipline — every resource tagged with
cost-center,env,pipeline-name; chargeback works only when tags do.
Worked example 1 — Managed Identity flow from Databricks to ADLS Gen2
Topic explanation.
- The scoring lens — interviewers want the no-embedded-secrets path: Databricks cluster's Managed Identity → RBAC on storage account → Private Endpoint → file read.
Question. Design an identity flow from a Databricks job cluster to ADLS Gen2 silver container with zero credentials in code.
Input.
| Resource | Identity |
|---|---|
| Databricks cluster | system-assigned Managed Identity dbx-mi
|
| Storage account |
acct with HNS on |
| Container | silver |
| Network | Private Endpoint + storage firewall = deny |
Code.
# No connection strings, no SAS, no client secrets
df = spark.read.format("delta").load(
"abfss://silver@acct.dfs.core.windows.net/orders/"
)
df.show()
Step-by-step explanation.
- Grant
dbx-mithe RBAC roleStorage Blob Data Readerscoped to the storage account. - Grant
dbx-miACL--xon every parent folder up tosilver/orders/(the "execute all the way" rule). - Cluster boots, fetches a token from the Instance Metadata Service for
dbx-mi. - Storage account is firewall-deny by default — only the Databricks subnet's Private Endpoint can reach it.
-
abfss://...request is authenticated by the token + authorised by RBAC + ACL; data flows over the Private Endpoint.
Output. Read succeeds with zero secrets in code, zero public network exposure, and a full Entra audit trail.
Rule of thumb: if your Spark code contains a connection string, the design is wrong.
Worked example 2 — daily Synapse Serverless cost on a partition layout
Topic explanation.
- The scoring lens — interviewers want you to multiply scanned bytes by the published rate and show that partition pruning is the dominant cost lever on Serverless.
Question. A 5-TB Parquet events table on ADLS Gen2 is partitioned by event_date. BI dashboards run 1,000 queries/day, each with WHERE event_date = ? for a single day. Synapse Serverless is priced at $5/TB scanned. What's the daily cost?
Input. 5 TB across 540 days ≈ 10 GB / day on disk; 1,000 single-day queries / day; rate = $5 / TB.
Code.
bytes_per_query = 10 GB = 0.01 TB
queries_per_day = 1,000
scanned_per_day = 10 TB
daily_cost = 10 TB × $5 / TB = $50
Step-by-step explanation.
- Partition pruning lets Serverless read only the requested day's directory — ~10 GB instead of 5 TB.
- 1,000 queries × 10 GB = 10 TB scanned per day.
- 10 TB × $5/TB = $50 / day = ~$1,500 / month.
- Without partitioning, each query scans the full 5 TB → 5,000 TB / day → $25,000 / day. Partitioning cuts cost by 500×.
Output. $50/day with partition pruning; $25,000/day without. Same Serverless engine, same data — partition layout is the FinOps lever.
Rule of thumb: on Serverless engines (Synapse Serverless, BigQuery, Athena), partition + columnar format = the entire cost story.
Solution Using Managed Identity + Private Endpoint + Key Vault + FinOps levers
# === Solution: secure + cost-optimal Azure stack ===
# Identity
Databricks cluster (system-assigned MI 'dbx-mi') ──┐
ADF (managed identity 'adf-mi') ├──→ RBAC: Storage Blob Data Contributor
│ (scoped to container)
# Network
VNet ─→ Private Endpoint ─→ Storage account (firewall: deny public)
# Secrets
Key Vault: kv-prod
├─ secret/eh-orders-conn (rotated quarterly)
├─ cert/synapse-link-cert
└─ key/cmek-storage (customer-managed encryption key)
ADF Linked Service → references AKV secret by URI (never inline)
Databricks secret scope = AKV-backed; dbutils.secrets.get("kv-prod", "...")
# Governance
Microsoft Purview scans:
- ADLS Gen2 (all containers)
- Azure SQL DB
- Synapse + Databricks
→ builds column-level lineage + PII classification
# FinOps
Synapse Dedicated SQL Pool → Logic App pause 7pm–7am + weekends (~65% saved)
Databricks → job clusters + spot + auto-terminate (~50–80% saved)
Synapse Serverless → Parquet + partition by event_date (~99% saved on scans)
Reserved capacity → covers the baseline DWU + storage floor (~30–60% off list)
Cost Management → tag-grouped daily reports + budget alerts
Step-by-step trace.
| Pillar | Lever | Mechanism |
|---|---|---|
| Identity | Managed Identity | no secrets in code |
| Network | Private Endpoint + firewall deny | no public surface |
| Secrets | Key Vault + AKV-backed secret scope | secrets fetched at runtime, rotatable |
| Governance | Microsoft Purview scans + lineage | column-level lineage, PII classification |
| Cost | Dedicated SQL pause/resume | ~65% off the on-hours bill |
| Cost | Databricks job-cluster + spot | ~50–80% off interactive cluster |
| Cost | Synapse Serverless partition pruning | ~99% off naive SELECT * scans |
| Cost | Reserved capacity for the floor | ~30–60% off list for steady-state |
| Cost | Tag discipline + Cost Management | chargeback + budget alerts |
- Identity flows from Managed Identity → RBAC → ACL → file, zero secrets in code.
- Network locks the storage account to a single Private Endpoint inside the data VNet.
- Key Vault holds every connection string, certificate, and customer-managed encryption key; ADF and Databricks read at runtime.
- Purview scans every store nightly and surfaces lineage + classification in one catalog.
- The four FinOps levers (pause, job-cluster + spot, partition pruning, reserved capacity) typically cut a naive Azure bill by 60–80% with no SLO change.
Output:
| Workload | Naive monthly bill | Optimised monthly bill | Lever applied |
|---|---|---|---|
| Dedicated SQL Pool DW2000c | $9,000 | $3,150 | pause/resume + reserved |
| Databricks all-purpose | $12,000 | $3,600 | job clusters + spot + auto-terminate |
| Synapse Serverless BI | $25,000 | $1,500 | partition by event_date + Parquet |
Why this works — concept by concept:
- Managed Identity — eliminates the rotate-or-leak risk of stored credentials.
- Private Endpoint — the storage account isn't on the public internet, full stop.
- Key Vault — every secret is rotatable + auditable + RBAC-controlled.
- RBAC scope = resource — least-privilege by default; storage data-plane separated from control-plane.
- ACL inheritance — sub-team isolation without minting new RBAC roles per folder.
- Purview lineage + classification — answers "where did this column come from" + "is this PII" in one place.
- Pause/resume + job-cluster + spot — the three compute levers that move the bill the most.
- Partition pruning on Serverless — converts $/TB pricing into a near-free read path.
- Reserved capacity — pays for the predictable floor; PAYG carries the spikes.
-
Cost —
O($managed_compute + $TB_scanned + $TB_stored); every term has a dedicated lever above.
Python
Topic — etl
Identity + governance drills
Python
Topic — data-analysis
FinOps + cost-model drills
Azure interview cheat sheet
A one-screen cheat sheet for the 15 highest-value primitives across the four rounds — drill these and you've covered ~80% of the azure data engineer interview questions surface.
| Primitive | Round | The single-sentence answer interviewers want |
|---|---|---|
| Hierarchical Namespace | Lake design | Must be ON for analytics — gives atomic dir rename + POSIX ACLs + abfss:// endpoint. |
| Medallion containers (bronze / silver / gold) | Lake design | Container-level ACL + lifecycle scopes; ingestion never writes to gold. |
| Partition key | Lake design | Partition by the dominant WHERE predicate; date wins ~90% of the time. |
| POSIX ACL vs RBAC | Lake design | RBAC at account, ACL per folder; default ACL on parent for inheritance + --x on ancestors. |
| Lifecycle Hot → Cool → Archive | Lake design | Days-since-modification rule; cuts bronze cost ~80% without touching reads. |
| Event Hubs partition count | Streaming | Cap on max parallelism; ordering only inside one partition. |
| Partition key on the producer | Streaming | The grain at which per-key ordering matters (customer_id, device_id). |
| Consumer groups | Streaming | Independent cursor per partition; fan-out without races. |
| Event Hubs Capture | Streaming | Free auto-archive to ADLS Gen2 in Avro; perfect bronze layer. |
| Windowing — Tumbling / Hopping / Session | Streaming | Non-overlap / rolling-with-hop / inactivity-bounded — pick by the report shape. |
| Watermark control table | Scenario | Read → Copy with WHERE updated_at > @watermark AND <= @runStartTs → update on success. |
| CDC via Debezium / Azure DMS | Scenario | Log-based, catches deletes; emits `c |
SCD2 {% raw %}MERGE INTO
|
Scenario | Two-rows-per-change USING + WHEN MATCHED close + WHEN NOT MATCHED insert, one txn. |
| Managed Identity + Private Endpoint | Security | Zero secrets in code + zero public surface = the only senior-grade auth pattern. |
| Pause + job-cluster + partition pruning | Cost | The three levers that cut a naive bill by 60–80% with no SLO change. |
Frequently asked questions
What's the average Azure data engineer salary in 2026 and which rounds determine band?
Azure data engineer compensation in 2026 sits roughly between $120K and $230K total in US metros (and £75K–£135K in the UK), with senior + staff roles routinely clearing $250K-plus at Microsoft, Databricks, and Fortune 100 banks. The bands aren't decided by ADF activity-type trivia — they're decided by the four rounds this guide drills: lake design, streaming correctness, scenario design, and security + cost. Candidates who recite service vocabulary but cannot defend a partition key, write a watermark predicate, or draw a Managed-Identity-to-ADLS-Gen2 flow over a Private Endpoint plateau at mid-band. Candidates who can do all four pass senior loops cleanly and negotiate from the top of the range.
What do the Azure data engineer interview rounds actually look like?
A typical Azure DE loop is 5 rounds: (1) a behavioural / motivation round; (2) a SQL + Python live-coding round (window functions, joins, aggregation under time pressure); (3) a system-design round where you whiteboard a real Azure pipeline end-to-end (this is where lake design + streaming + scenarios + security all collide); (4) a focused deep-dive on one Azure service the team uses heavily (Databricks tuning, Synapse MPP internals, or ADF parameterisation); (5) a "design + cost" round where a senior architect grills you on FinOps levers. The third and fifth rounds are where most offers are won or lost — and they're not service-recital rounds, they're real design rounds.
How does an Azure data engineer compare to AWS or GCP data engineer?
The azure vs aws data engineer and azure vs gcp data engineer comparisons are pure ecosystem swaps — the underlying skills transfer almost 1-to-1. ADLS Gen2 ↔ S3 ↔ GCS, Event Hubs ↔ Kinesis ↔ Pub/Sub, Synapse ↔ Redshift ↔ BigQuery, Databricks on Azure ↔ Databricks on AWS / GCP, ADF ↔ Glue ↔ Dataflow / Cloud Composer, Managed Identity ↔ IAM Roles ↔ Workload Identity, Purview ↔ DataZone ↔ Dataplex. The framework is identical: data lake on object storage with hierarchical layout, partition-keyed event log for streaming, scenario design centred on incremental loads + CDC + SCD2, and a security + cost round on top. Interviewers expect you to know the Azure names — but they're really testing the underlying architecture, which is portable across clouds.
How do I crack an Azure data engineer interview if I'm coming from a traditional SQL / ETL background?
The three highest-leverage moves are: (1) spend two weeks designing real ADLS Gen2 medallion lakes with HNS + ACLs + lifecycle so the lake-design round becomes muscle memory; (2) build one Event Hubs streaming pipeline end-to-end (producer → partition key → consumer group → Capture → Structured Streaming → Delta silver) so you can defend partition + windowing trade-offs from experience, not from a blog; (3) memorise the SCD2 MERGE INTO template and the watermark predicate so you can write them on a whiteboard in under five minutes. SQL fluency is a prerequisite, not the differentiator — every other candidate already has it.
Does the Azure Data Engineer DP-203 / DP-700 certification still matter in 2026?
The DP-203 (Azure Data Engineer Associate) and the newer DP-700 (Fabric Data Engineer) certifications help at the resume-screen stage — they're a credible signal that you've touched the platform end-to-end and they appear on most Azure DE job descriptions. Inside the actual loop, certifications are almost never the deciding factor; what you can design and defend live is. Treat the cert as a fast way to fill knowledge gaps (Purview, Private Endpoints, Synapse Serverless pricing nuances) and a passport through automated screens, then prepare the four scoring rounds in this guide as your real interview prep.
What tools should I know for an Azure data engineer interview in 2026?
The minimum kit is: ADLS Gen2 (HNS, ACLs, lifecycle); Azure Data Factory (pipelines, parameterisation, watermark loads); Azure Synapse (Dedicated SQL Pool MPP + Serverless $/TB scanned); Azure Databricks + Delta Lake (medallion, Auto Loader, MERGE INTO, OPTIMIZE + ZORDER); Event Hubs (partitions, consumer groups, Capture); Stream Analytics and / or Structured Streaming (Tumbling / Hopping / Session windows); Azure SQL DB (CDC + watermark control table); Key Vault (Managed Identity, secret rotation); Microsoft Purview (lineage + classification); and Cost Management + Advisor for FinOps. Bonus signals: Microsoft Fabric / OneLake (the unified SaaS layer on top of Synapse + Delta), Power BI integration, and a basic Terraform / Bicep vocabulary for IaC questions.





Top comments (0)