DEV Community

Cover image for Azure Data Engineering Interview Questions: Lake Design, Streaming, Scenarios, Security + Cost
Gowtham Potureddi
Gowtham Potureddi

Posted on

Azure Data Engineering Interview Questions: Lake Design, Streaming, Scenarios, Security + Cost

azure data engineer interview questions are won and lost on four rounds that most candidates underprepare for — and they have very little to do with reciting Azure Data Factory activity types. The high-signal rounds are ADLS Gen2 lake design (Hierarchical Namespace, medallion, partition keys, POSIX ACLs vs RBAC, lifecycle tiering), Event Hubs streaming (partitions, consumer groups, Capture, Stream Analytics vs Spark Structured Streaming), scenario rounds (incremental load with watermark control tables, CDC with Debezium / Azure DMS, SCD Type 2 with a single Delta MERGE INTO), and senior security + governance + cost-optimization (Managed Identity, Private Endpoint, Key Vault, Purview lineage, FinOps levers across Synapse Serverless and Databricks job-cluster + spot). Once you understand the rubric, every prompt becomes "which of the four scoring rounds is the reviewer running?"

This guide pivots away from generic service-recital trivia and walks through the azure data engineer interview questions that actually decide offers: how to defend WHERE event_date = ? partitioning on a 5-TB events table, how to argue partition key choice on an Event Hubs customer_id ordering question, how to write a watermark predicate and the exact SCD2 MERGE INTO interviewers expect on a whiteboard, and how to reason about a Managed-Identity-to-ADLS-Gen2 flow through a Private Endpoint with zero embedded secrets. Each ## section ends as azure interview questions with answers: a runnable solution, a step-by-step trace, a real output table, and a concept-by-concept why this works breakdown.

PipeCode blog header for an Azure data engineering interview tutorial — bold white headline 'Azure DE · Interview Questions' with subtitle 'Lake design · Streaming · Scenarios · Security + cost' and a stylised four-quadrant infographic on a dark gradient with purple, blue, green, and orange accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, browse Python practice library →, drill ETL Python drills →, sharpen data-manipulation patterns →, rehearse streaming Python drills →, or widen coverage on the full data-analysis library →.


On this page


1. Why these Azure interview questions trip people up

Azure loops use ADF / Synapse / Databricks as the backdrop — but the scoring rounds are about lake design, streaming correctness, scenarios, and FinOps

The honest framing: most candidates over-prepare service trivia ("name the three integration runtimes") and under-prepare the rounds that actually score the offer. Senior Azure DE loops keep the three big services (Azure Data Factory, Azure Synapse Analytics, Azure Databricks) as the setting, but the questions that decide pass/fail sit on top of that setting — they're about how you'd shape a lake, partition a stream, design a scenario end-to-end, and defend cost / security trade-offs to an architect.

Where candidates leak points.

  • Lake-design questions — they recite "medallion = bronze / silver / gold" but freeze on "pick a partition key for this 5-TB events table queried by WHERE event_date = ?".
  • Streaming questions — they name Event Hubs but cannot say why ordering only holds inside one partition, or pick the right partition key for customer_id-ordered events.
  • Scenario rounds — they describe ADF pipelines but cannot write the WHERE updated_at > @watermark predicate or the exact SCD2 MERGE INTO.
  • Security + cost rounds — they say "use Managed Identity" but cannot draw the end-to-end identity flow from Databricks through a Private Endpoint to ADLS Gen2 with zero embedded secrets.

What interviewers actually grade.

  • Concrete design choices, not vocabulary — which partition key, which tier, which window function, which identity.
  • Trade-off articulation — "I'd pick X over Y because of Z" with a real Z, not a slogan.
  • Whiteboard-ability — can you write the SQL / Spark / KQL in 25 minutes under pressure?
  • Cost awareness — at senior level, "I'd pause the Dedicated SQL Pool on a schedule and use spot for Databricks job clusters" beats "we have unlimited budget".

azure data engineer interview questions — the four rounds this guide drills.

  • Round 1 — Lake design — ADLS Gen2 Hierarchical Namespace, partition keys, ACL vs RBAC, lifecycle.
  • Round 2 — Streaming — Event Hubs partitioning, consumer groups, Capture, windowing.
  • Round 3 — Scenarios — incremental load + CDC + SCD2 end-to-end.
  • Round 4 — Security + Cost — Managed Identity, Private Endpoint, Key Vault, Purview, FinOps levers.

Cross-cutting worked example — picking a partition key for ADLS Gen2

Detailed explanation. The most common cross-cutting "warm-up" question on Azure DE loops is partition-key selection. Interviewers want to see that you partition by the query predicate (the column that will appear in 90% of WHERE clauses), not by ingest cadence or by a high-cardinality column.

Question. A 5-TB events table on ADLS Gen2 stores 18 months of clickstream events. 95% of downstream queries are WHERE event_date = ? for a single day. Pick a partition key and defend it.

Input.

Column Cardinality Used in WHERE clause
event_id ~5 billion 0% (surrogate)
customer_id ~50 million 5% (rare)
event_date ~540 days 95%
country_code ~250 30%

Code.

# Spark write — partition by the predicate, not by ingest day
(events_df
    .write
    .format("delta")
    .partitionBy("event_date")          # query predicate, not surrogate
    .mode("append")
    .save("abfss://silver@acct.dfs.core.windows.net/events/")
)

# Reader benefits from partition pruning
spark.read.format("delta").load(path).where("event_date = '2026-05-24'").count()
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Reader's WHERE event_date = '2026-05-24' matches a partition directory event_date=2026-05-24/.
  2. Spark prunes every other partition before file listing — it never opens 539 other days of Parquet files.
  3. Without partitioning, Spark would scan 5 TB; with this partitioning, ~10 GB.
  4. customer_id would have been a bad choice (50 million partitions = small-files explosion); event_id is even worse.

Output.

Strategy Files scanned Estimated bytes read
No partitioning all (~50,000) ~5 TB
partitionBy(event_date) ~100 ~10 GB
partitionBy(customer_id) ~50 million metadata entries tiny scan but lake corrupt

Rule of thumb: partition by the filter column with low-to-medium cardinality. Date is almost always the right answer for event data; high-cardinality keys belong in Z-ORDER, not partitionBy.

Python
Topic — etl
ETL Python drills (partition + write patterns)

Practice →

Python
Topic — data-analysis
Distributed-systems design patterns

Practice →


2. ADLS Gen2 data lake design questions

Visual diagram of an ADLS Gen2 medallion data lake — a single storage account card on top with hierarchical namespace enabled; below it three container cards labelled bronze / silver / gold each holding a folder-tree showing year=YYYY/month=MM/day=DD partition structure; on the right a small ACL + RBAC governance card with POSIX ACL rows; a thin lifecycle policy ribbon underneath showing hot → cool → archive tier transitions; on a light PipeCode card.

adls gen2 interview questions — Hierarchical Namespace, medallion, partition keys, ACL vs RBAC, lifecycle

adls gen2 interview questions is where the lake-design round opens. The senior answer in one sentence: ADLS Gen2 is Blob Storage with Hierarchical Namespace (HNS) turned on; HNS gives you atomic directory rename, POSIX-style ACLs, and the DFS abfss:// endpoint, which together unlock the medallion lake pattern and per-folder permissioning that analytics workloads need.

Hierarchical Namespace — non-negotiable for analytics.

  • HNS on vs off — without HNS, ADLS Gen2 is a flat blob namespace; with HNS, you get real directories.
  • Atomic directory rename — Spark's _temporary → final commit relies on this; flat Blob fakes it by copying every file.
  • POSIX-style ACLs — per-folder + per-file r-x / rwx masks, distinct from RBAC.
  • DFS endpointabfss://container@account.dfs.core.windows.net/path (the dfs is the HNS endpoint; blob is the flat endpoint).
  • Interview signal — saying "I'd enable HNS" within the first sentence is the basic-but-tested signal that you've actually built on ADLS Gen2.

Medallion container layout.

  • bronze/ — raw landed data, append-only, schema-on-read; partitioned by ingest date (year=YYYY/month=MM/day=DD).
  • silver/ — cleansed and conformed; deduped; typed; partitioned by the natural business key + date.
  • gold/ — business-ready facts and dimensions; aggregations; partitioned by query predicate.
  • Why three containers, not three folders? — container-level ACLs and lifecycle policies make security + cost rules much cleaner; senior interviewers ask for this.

Partition key selection — date vs categorical vs hash.

  • Date partitioning (event_date=YYYY-MM-DD) — the default; works for 90% of fact tables where queries filter by time.
  • Categorical (country=US, region=EU) — works when query predicate is the category AND cardinality is low (≤ ~1,000).
  • Hash partitioning (bucket=0..N) — niche; used when you need balanced parallelism on a join key with high skew; bucketing in Spark / Delta.
  • Anti-pattern — partitioning by customer_id (millions of partitions = millions of metadata entries = NameNode-equivalent meltdown).

ACL vs RBAC — when each wins.

  • RBAC — Azure-native, applies at subscription / resource group / storage account / container scope; coarse-grained; integrates with Entra ID.
  • ACL — POSIX r-x / rwx on individual folders + files; fine-grained; needed when you want "team A reads gold/finance/, team B reads gold/marketing/".
  • Effective permission — a principal needs RBAC Storage Blob Data Reader at the storage account level plus an ACL that grants execute on every parent folder to reach the leaf.
  • Default ACL — set on a folder so newly created children inherit it; otherwise every new file is a permission ticket.

Lifecycle policies — Hot → Cool → Archive.

  • Hot tier — high $/GB but free reads; everything within the last ~30 days.
  • Cool tier — lower $/GB but per-read charges; auditable history 30–180 days.
  • Archive tier — cheapest $/GB but rehydration is hours-long; regulatory retention only.
  • Policy expression — JSON rule that transitions blobs based on daysSinceModification or daysSinceCreation; one rule across all containers, scoped by prefix.

File format — Parquet vs Delta vs Iceberg.

  • Parquet — column-store, splittable, compressed; the lowest-common-denominator format every Azure compute engine reads.
  • Delta Lake — Parquet + transaction log (_delta_log/); ACID, time travel, MERGE INTO, schema evolution.
  • Iceberg — Parquet + a different manifest layout; gaining traction on Azure via OneLake; not as deeply integrated as Delta on Azure Databricks.
  • Interview default — pick Delta unless you have a reason not to; it's the default in Azure Databricks and Microsoft Fabric.

Small-files problem.

  • Definition — many sub-MB files in a partition; metadata operations balloon, listings get slow, Spark job startup gets bottlenecked on the driver.
  • Cause — streaming sinks committing every micro-batch; over-partitioned writes.
  • Fix on DeltaOPTIMIZE table ZORDER BY (cols) compacts and clusters; on plain Parquet, scheduled compaction jobs.
  • Prevention — coalesce / repartition before writing; raise streaming trigger interval; partition by lower-cardinality keys.

Worked example 1 — pick a partition key on a 5-TB events table

Topic explanation.

  • The scoring lens — interviewers want partition by the dominant query predicate, with cardinality in the low hundreds-to-thousands range per directory.
  • The trap — candidates often pick a high-cardinality column like customer_id because "it's selective" — but that creates one folder per customer.

Question. A 5-TB events table on ADLS Gen2 needs a partition key. 95% of queries filter by event_date. Pick one.

Input. Same table as §1 — event_date (~540 days), country_code (~250), customer_id (~50 M).

Code.

# Delta write
(events_df
    .write
    .format("delta")
    .partitionBy("event_date")
    .mode("append")
    .save("abfss://silver@acct.dfs.core.windows.net/events/")
)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. We pick event_date because 95% of queries filter by it — partition pruning is maximised.
  2. We do not add a second partition column (e.g. event_date, country_code) because that would 250× the partition count to ~135,000 — small-files territory.
  3. We leave country_code and customer_id for ZORDER BY on the silver table to get file-level skipping without partition explosion.

Output.

Partition strategy Partition count Avg file size Query bytes read
partitionBy(event_date) ~540 ~1 GB ~10 GB
partitionBy(event_date, country) ~135,000 ~4 MB ~10 GB but slow listing
partitionBy(customer_id) ~50 M KB lake unusable

Rule of thumb: one partition column, hundreds-to-low-thousands of partitions, ~256 MB to ~1 GB per file.

Worked example 2 — calculate ACL inheritance for a new file

Topic explanation.

  • The scoring lens — interviewers test whether you know that ACLs on a parent do not retroactively apply to existing children, and that default ACLs are how you make new children inherit.
  • The trap — candidates assume "I set ACL on the folder, all new files inherit" — true only if it was set as a default ACL.

Question. A silver/ folder has Default ACL group:data-eng:rwx. A user in data-eng writes a new file under silver/orders/2026-05-24/. What ACL does the new file have?

Input.

Path ACL set on folder
silver/ Default group:data-eng:rwx
silver/orders/ (inherits from silver/)
silver/orders/2026-05-24/ (inherits from silver/orders/)

Code.

# Azure CLI to inspect
az storage fs access show \
    --file-system silver \
    --path "orders/2026-05-24/new_file.parquet" \
    --account-name acct
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. At write time, the new file inherits the access ACL from the parent's default ACL.
  2. default:group:data-eng:rwx becomes the file's group:data-eng:rwx access entry.
  3. To read the file, the principal also needs execute (--x) on every parent folder up to the container root — this is the "X all the way down" rule.
  4. If the principal doesn't have RBAC Storage Blob Data Reader on the storage account, ACLs alone are not enough.

Output. The new file's effective ACL is group:data-eng:rwx (plus the owner / mask defaults); any data-eng member can read it provided they have RBAC at the account level.

Rule of thumb: Default ACL on the parent + execute on every ancestor + RBAC at the account = three things every Azure storage permission round expects.

Solution Using a partitioned + ACL-secured medallion layout

# === Solution: bronze → silver → gold with HNS + ACLs + lifecycle ===
account = "abfss://{container}@acct.dfs.core.windows.net"

# 1) Bronze — append-only raw, partitioned by ingest day
(raw_events_df
    .write
    .format("delta")
    .mode("append")
    .partitionBy("ingest_date")
    .save(account.format(container="bronze") + "/events/")
)

# 2) Silver — cleansed + deduped, partitioned by event_date
(spark.read.format("delta").load(account.format(container="bronze") + "/events/")
    .dropDuplicates(["event_id"])
    .withColumn("event_date", to_date("event_ts"))
    .write
    .format("delta")
    .mode("append")
    .partitionBy("event_date")
    .save(account.format(container="silver") + "/events/")
)

# 3) Gold — daily agg by country, OPTIMIZE + ZORDER for file skipping
(spark.read.format("delta").load(account.format(container="silver") + "/events/")
    .groupBy("event_date", "country_code")
    .agg(count("*").alias("event_count"))
    .write
    .format("delta")
    .mode("overwrite")
    .partitionBy("event_date")
    .save(account.format(container="gold") + "/events_daily/")
)
spark.sql("OPTIMIZE delta.`{path}` ZORDER BY (country_code)".format(
    path=account.format(container="gold") + "/events_daily/"
))
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Container Operation Partition col Notes
1 bronze append raw events ingest_date append-only, schema-on-read
2 silver dedup by event_id event_date cleansed, business-time
3 gold daily agg by country event_date ZORDER on country_code
4 gold OPTIMIZE compact + Z - small-file fix
  1. Bronze accepts a daily append of raw events under ingest_date=YYYY-MM-DD/; ACL group:ingestion:rwx on the container.
  2. Silver deduplicates by event_id and switches to business time (event_date); ACL group:data-eng:rwx.
  3. Gold rolls daily aggregates and runs OPTIMIZE ... ZORDER BY (country_code) so the BI layer scans MB, not GB; ACL group:bi:r-x.
  4. Lifecycle policy at the storage-account level transitions bronze/ to Cool after 30 days and Archive after 180.

Output:

Container Partition pattern Reader RBAC + ACL
bronze events/ingest_date=2026-05-24/ RBAC Storage Blob Data Reader + --x chain
silver events/event_date=2026-05-24/ RBAC + ACL group:data-eng:r-x
gold events_daily/event_date=2026-05-24/ RBAC + ACL group:bi:r-x

Why this works — concept by concept:

  • Hierarchical Namespace — gives us POSIX ACLs + atomic dir rename + the abfss:// DFS endpoint that Spark expects.
  • Medallion containers — separate ACLs and lifecycle scopes; ingestion never has write to gold/.
  • Partition by event_date — matches the dominant query predicate so partition pruning kicks in.
  • ZORDER BY country_code — adds file-level skipping for the secondary predicate without exploding the partition count.
  • OPTIMIZE — compacts small files written by streaming or micro-batch sinks into ~1 GB files.
  • Default ACL + RBAC layering — gives sub-team isolation while a single RBAC role manages account-wide access.
  • Lifecycle policy — cuts storage cost by tiering cold history without touching the read path.
  • CostO(partitions_scanned × avg_file_size) for the read path; lifecycle policies make the long-tail storage O($cool_or_archive_rate).

Python
Topic — etl
Lake-design + medallion drills

Practice →

Python
Topic — data-manipulation
Partitioning + Z-ORDER drills

Practice →


3. Event Hubs + streaming interview questions

Visual diagram of an Event Hubs streaming pipeline — producers on the left publish to an Event Hub namespace with 8 partitions; a Schema Registry chip is attached; consumer groups on the right fan out to two downstream paths (Stream Analytics windowed aggregation, Databricks Structured Streaming with checkpoints); a Capture sink lands raw records to ADLS Gen2; a tiny watermark + late-data chip floats; on a light PipeCode card.

azure event hubs interview questions — partitions, consumer groups, Capture, windowing

azure event hubs interview questions is the streaming round. Event Hubs sits where Kafka sits in AWS / on-prem stacks — it's the durable, partitioned log that producers publish to and consumers read from with their own cursors. The senior answer in one sentence: Event Hubs is a partitioned, append-only event log where partition count caps parallelism, ordering holds inside a partition, and consumer groups give independent readers; Capture auto-archives to ADLS Gen2, the Kafka API makes Kafka producers Just Work, and downstream you choose Stream Analytics or Databricks Structured Streaming based on language and team skill.

Partitions — the parallelism + ordering knob.

  • Partition count — set at hub creation; basic / standard tiers allow up to 32; Premium / Dedicated go higher.
  • Parallelism cap — at most one consumer per partition per consumer group reads in parallel; partition count = max parallelism.
  • Ordering — guaranteed only inside a single partition; across partitions you see interleaving.
  • Partition key — events sharing a key always land on the same partition (via hash(key) mod num_partitions); this is how you preserve per-key ordering.
  • Interview signal — "I'd pick customer_id as the partition key if downstream needs per-customer ordering" is the canonical answer.

Consumer groups.

  • Definition — a named subscription on the hub with an independent cursor per partition.
  • Use case — fan-out: cg-analytics powers Stream Analytics, cg-databricks powers Structured Streaming, and both see the same events independently.
  • Limit — Standard tier caps at 20 consumer groups per hub; Premium higher.
  • Anti-pattern — sharing one consumer group across two unrelated apps causes cursor races and missed messages.

Capture — auto-archive to ADLS Gen2.

  • What it does — writes every event into ADLS Gen2 (or Blob) in Avro on a configurable cadence (minutes / MB-rolled).
  • Why — gives you a free, replayable bronze layer without writing a consumer.
  • Path template{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second} — partition-by-partition.
  • Downstream — point Databricks Auto Loader at the Capture path and your bronze layer maintains itself.

Throughput units, Premium, and Dedicated.

  • Throughput Unit (TU) — Standard tier; 1 TU = 1 MB/s ingress + 2 MB/s egress + 1000 events/s ingress per TU.
  • Premium — more reliable performance, more partitions per hub, geo-DR; better fit for production.
  • Dedicated — single-tenant clusters; price-per-CU, useful only at very high throughput.
  • Interview signal — "I'd start on Premium" beats "Standard, then scale TUs" once volume > a few hundred MB/s.

Kafka API compatibility.

  • Event Hubs speaks Kafka 1.0+ — a Kafka producer / consumer points at the Event Hubs kafka.servers endpoint without code changes.
  • Why it matters — lets teams migrate from on-prem Kafka without rewriting producers.
  • Caveat — some Kafka admin APIs aren't supported (topic creation via Kafka API is restricted; create hubs via ARM / portal).

At-least-once + dedup.

  • Event Hubs delivers at-least-once — duplicates are possible across producer retries and consumer crash-restart.
  • Dedup strategies — idempotent producer (Kafka-style), consumer-side dedup keyed by event_id, Delta MERGE INTO upserts on the consumer.
  • Exactly-once is a chain — only true when every link (producer, broker, consumer, sink) is exactly-once.

Windowing — Stream Analytics vs Structured Streaming.

  • Tumbling — non-overlapping fixed-size windows (SELECT ... FROM Input TIMESTAMP BY ts GROUP BY TumblingWindow(minute, 5)).
  • Hopping — fixed size with a shorter hop; same event appears in multiple windows (rolling 5-min stats every 1 min).
  • Sliding — windows close on every event; expensive but precise.
  • Session — windows close after a gap of inactivity; user-session analytics.
  • Stream Analytics — declarative SQL on Event Hubs / IoT Hub; great for analyst-style streaming.
  • Databricks Structured Streaming — full Spark; supports checkpointing, watermark, custom triggers; better for complex transformations and Delta sinks.

When Event Hubs vs Kafka on HDInsight vs Confluent.

  • Event Hubs — fully managed, Azure-native, billing in the same subscription, free Capture; default for greenfield Azure.
  • Kafka on HDInsight — managed Kafka cluster; legacy, more knobs, more ops cost.
  • Confluent Cloud on Azure — full Kafka ecosystem (Connect, Schema Registry, ksqlDB); pick when you need the Confluent connectors and tooling.

Worked example 1 — partition key for customer_id ordering

Topic explanation.

  • The scoring lens — interviewers want you to use the ordering guarantee (events for the same customer_id land on the same partition, so they arrive in order) rather than rely on the consumer to sort.

Question. Orders for the same customer_id must be processed in arrival order. Pick a partition key and explain.

Input.

Event Producer time customer_id
e1 10:00:00 C1
e2 10:00:01 C2
e3 10:00:02 C1

Code.

from azure.eventhub import EventHubProducerClient, EventData
producer = EventHubProducerClient.from_connection_string(conn_str, eventhub_name="orders")
batch = producer.create_batch(partition_key="C1")    # all C1 events → same partition
batch.add(EventData(b'{"customer_id":"C1","order":"e1"}'))
batch.add(EventData(b'{"customer_id":"C1","order":"e3"}'))
producer.send_batch(batch)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Producer sets partition_key="C1" so e1 and e3 hash to the same partition.
  2. Inside that partition, Event Hubs preserves arrival order — e1 is offset N, e3 is offset N+1.
  3. Consumer reading that one partition sees e1 before e3 — guaranteed.
  4. e2 lands on a different partition (hashed from C2) — no ordering relationship with C1's events.

Output. Per-customer ordering is preserved without any consumer-side sorting.

Rule of thumb: the partition key is the grain at which ordering matterscustomer_id, device_id, account_id, etc.

Worked example 2 — Hopping window aggregation

Topic explanation.

  • The scoring lens — interviewers expect you to know that a 5-minute Hopping window with a 1-minute hop emits one row per minute summarising the last 5 minutes — i.e. rolling stats with overlap.

Question. From an orders Event Hub, emit (window_end, orders_count, revenue) every 1 minute over a 5-minute rolling window.

Input. A stream of order events with ts and amount.

Code (Stream Analytics SQL).

SELECT
    System.Timestamp()         AS window_end,
    COUNT(*)                   AS orders_count,
    SUM(amount)                AS revenue
INTO   [orders-1m-rollup]
FROM   [orders] TIMESTAMP BY ts
GROUP BY HoppingWindow(minute, 5, 1)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. TIMESTAMP BY ts tells ASA to use the event's own timestamp, not the arrival time.
  2. HoppingWindow(minute, 5, 1) defines a 5-minute window that hops every 1 minute.
  3. Every minute, ASA emits one row summarising the previous 5 minutes — so each event participates in 5 different windows.
  4. System.Timestamp() is the window end.

Output.

window_end orders_count revenue
2026-05-24 10:01:00 482 18,402
2026-05-24 10:02:00 511 19,920
2026-05-24 10:03:00 533 20,801

Rule of thumb: tumbling for non-overlapping reporting buckets, hopping for rolling stats, sliding for per-event triggers, session for inactivity-bounded analysis.

Solution Using a partitioned hub + dedicated consumer group + Auto Loader

# === Solution: Event Hubs → Databricks Structured Streaming → Delta silver ===
from pyspark.sql.functions import col, from_json, to_timestamp

eh_conf = {
    "eventhubs.connectionString": dbutils.secrets.get("kv-eh", "orders-conn"),
    "eventhubs.consumerGroup":    "cg-databricks",
}

raw = (spark.readStream
    .format("eventhubs")
    .options(**eh_conf)
    .load())                              # 8 partitions → 8 Spark tasks

schema = "customer_id STRING, order_id STRING, amount DOUBLE, ts STRING"
parsed = (raw
    .select(from_json(col("body").cast("string"), schema).alias("d"))
    .select("d.*")
    .withColumn("ts", to_timestamp("ts")))

(parsed.writeStream
    .format("delta")
    .option("checkpointLocation",
            "abfss://silver@acct.dfs.core.windows.net/_chk/orders/")
    .option("mergeSchema", "true")
    .partitionBy("customer_id")
    .trigger(processingTime="60 seconds")
    .start("abfss://silver@acct.dfs.core.windows.net/orders/"))
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Component Action
1 Producer publishes orders keyed by customer_id
2 Event Hub orders 8 partitions; per-customer order preserved within partition
3 Consumer group cg-databricks independent cursor per partition
4 Spark readStream one task per partition reads from offsets
5 Spark transform JSON parse + cast ts
6 Delta writeStream append to silver with checkpoint + 60s trigger
7 Capture (separate) auto-archives raw Avro to bronze
  1. Producer sets partition_key=customer_id so per-customer arrival order is preserved.
  2. Event Hubs distributes by hash(customer_id) % 8; one Spark task reads each of the 8 partitions.
  3. The Databricks consumer group has its own cursor per partition; it doesn't interfere with cg-analytics (used by Stream Analytics).
  4. Structured Streaming uses checkpointLocation to persist the cursor — exactly-once-with-idempotent-sink semantics.
  5. The 60s trigger batches micro-batches so we don't write a 4-KB Parquet file every 50 ms.
  6. Capture independently archives raw events to bronze as a free durable replay log.

Output:

Sink Format Frequency Purpose
silver/orders/ Delta every 60 s downstream analytics + MERGE
bronze/eh-capture/orders/{PartitionId}/.../ Avro per Capture window durable replay
Stream Analytics → orders-1m-rollup (separate cg-analytics) SQL every 60 s BI rollups

Why this works — concept by concept:

  • Partition key = customer_id — pins per-customer ordering at the broker.
  • Consumer group isolationcg-databricks and cg-analytics read the same hub with independent cursors.
  • Checkpoint location — Structured Streaming durably persists offsets, so a crash resumes where it left off.
  • Trigger interval — 60 s avoids the small-files trap of per-millisecond writes.
  • Capture — gives a free bronze raw archive without writing a consumer.
  • Kafka API — same Event Hub can be re-fronted to a Kafka producer with zero protocol change.
  • At-least-once + idempotent sink — Delta mergeSchema + upserts make duplicates safe.
  • CostO(partitions × TU_throughput) for ingress; reads scale with consumer group count, not extra ingress.

Python
Topic — streaming
Streaming Python drills

Practice →

Python
Topic — sliding-window
Sliding-window drills

Practice →


4. Scenario-design rounds — incremental load · CDC · SCD2

Visual scenario-design diagram — three side-by-side mini-architectures on one card: Panel 1 (Incremental load) shows ADF watermark control table → ADF Copy with high-watermark predicate → ADLS Gen2 bronze. Panel 2 (CDC with Debezium / Azure DMS) shows OLTP SQL Server → Debezium → Event Hub → Databricks Auto Loader. Panel 3 (SCD Type 2) shows a small Delta table with __start_dt__, __end_dt__, __is_current__ columns and a MERGE INTO chip; on a light PipeCode card.

azure scenario based interview questions — incremental load · CDC · SCD2

azure scenario based interview questions is the round where the interviewer hands you a vague business prompt ("ingest a 50-million-row table daily, only the changes; track customer history over time") and watches you decompose it into a concrete Azure design. The three classics are incremental load with a watermark, CDC streaming via Debezium / Azure DMS, and SCD Type 2 with a Delta MERGE INTO.

Incremental load — the watermark pattern.

  • Goal — only pick up rows that have changed since the last successful run; no full table scans on a 500-million-row source.
  • Control table — a small Azure SQL DB table watermark_control(pipeline_name, last_loaded_ts) updated atomically at the end of each successful run.
  • ADF predicateWHERE updated_at > @watermark injected via parameter binding; @watermark read from the control table at the start of the run.
  • Idempotency — read-then-update is atomic; if the load fails, the watermark is not advanced.

CDC streaming — Debezium / Azure DMS.

  • Goal — stream every INSERT / UPDATE / DELETE from an OLTP SQL Server into the lake without polling.
  • Mechanism — Debezium (or Azure Data Migration Service) reads the SQL Server transaction log; emits change events to Event Hubs.
  • Schema — each event is {op: c|u|d, before: {...}, after: {...}, ts_ms: ...}.
  • Downstream — Databricks Auto Loader (or Structured Streaming) reads the Event Hub; applies the change to silver Delta with MERGE INTO.
  • Why over polling — log-based CDC catches deletes; polling on updated_at misses them.

SCD Type 2 — MERGE INTO with __start_dt__ / __end_dt__ / __is_current__.

  • Goal — preserve history of slowly changing dimension rows (e.g. dim_customer.address changes over time, but reports for last quarter must show the old address).
  • Schema — natural key columns + __start_dt__ + __end_dt__ + __is_current__.
  • Update rule — when a row's tracked attributes change: close the old row (__end_dt__ = today, __is_current__ = false) and insert a new row (__start_dt__ = today, __end_dt__ = '9999-12-31', __is_current__ = true).
  • Single transaction — Delta MERGE INTO with two WHEN MATCHED / WHEN NOT MATCHED clauses does both halves atomically.

Idempotency + retry.

  • Idempotent design — re-running the same incremental load with the same @watermark produces the same result.
  • ADF retry policy — set on each activity (retry: 3, retryIntervalInSeconds: 60); set policy to OnFailure to update an audit row.
  • Dead-letter — bad rows go to a errors/ container with the row, the error message, and the timestamp.

Worked example 1 — write the watermark predicate for ADF

Topic explanation.

  • The scoring lens — interviewers want the read-then-update flow, with the watermark stored in a transactional store (Azure SQL DB) and the update happening only on success.

Question. Write the source query and the control-table update for an ADF incremental load of orders from Azure SQL DB.

Input.

Table Columns
orders (source) order_id, customer_id, amount, updated_at
watermark_control pipeline_name, last_loaded_ts

Code.

-- Lookup activity: read current watermark
SELECT last_loaded_ts
FROM   watermark_control
WHERE  pipeline_name = 'orders_incremental';

-- Copy activity: source query (parameterised)
SELECT order_id, customer_id, amount, updated_at
FROM   orders
WHERE  updated_at > @{activity('LookupWatermark').output.firstRow.last_loaded_ts}
  AND  updated_at <= @{pipeline().parameters.runStartTs};

-- Stored procedure activity: update watermark on success
UPDATE watermark_control
SET    last_loaded_ts = @runStartTs
WHERE  pipeline_name = 'orders_incremental';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. ADF runs a Lookup that reads the current last_loaded_ts from watermark_control.
  2. ADF runs Copy with a parameterised source query that filters WHERE updated_at > @watermark AND <= @runStartTs — the upper bound makes the window deterministic across retries.
  3. Only on success does ADF run the Stored Procedure activity that advances the watermark to @runStartTs.
  4. If Copy fails, the watermark is not advanced — the next run re-reads the same window safely.

Output. Each successful run advances last_loaded_ts by exactly the run window; failed runs do not advance it.

Rule of thumb: always pair the watermark with an explicit upper bound (runStartTs) so retries are deterministic.

Worked example 2 — write the SCD2 MERGE INTO

Topic explanation.

  • The scoring lens — interviewers want the single-MERGE that closes the old row and inserts the new row in one transaction, with __is_current__ as the natural-key uniqueness guard.

Question. Write the Delta MERGE INTO for SCD Type 2 on dim_customer where address changes track history.

Input.

Side Table Columns
target dim_customer customer_id, name, address, __start_dt__, __end_dt__, __is_current__
source stg_customer customer_id, name, address, effective_dt

Code.

MERGE INTO gold.dim_customer t
USING (
    -- 1) the "close the old row" half — same customer, attributes changed
    SELECT s.customer_id AS merge_key,
           s.customer_id, s.name, s.address, s.effective_dt
    FROM   silver.stg_customer s
    JOIN   gold.dim_customer    g
           ON g.customer_id = s.customer_id AND g.__is_current__
    WHERE  g.address <> s.address

    UNION ALL

    -- 2) the "insert the new row" half — uses a NULL merge_key so MERGE inserts
    SELECT CAST(NULL AS STRING) AS merge_key,
           s.customer_id, s.name, s.address, s.effective_dt
    FROM   silver.stg_customer s
) src
ON  t.customer_id = src.merge_key AND t.__is_current__

WHEN MATCHED THEN UPDATE SET
    t.__end_dt__     = src.effective_dt,
    t.__is_current__ = false

WHEN NOT MATCHED THEN INSERT
    (customer_id, name, address, __start_dt__, __end_dt__, __is_current__)
VALUES
    (src.customer_id, src.name, src.address, src.effective_dt, DATE '9999-12-31', true);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The USING clause builds a two-row-per-change source via UNION ALL — one row to close the current version, one row to insert the new version.
  2. The first half carries the real customer_id as merge_key — this row triggers WHEN MATCHED and closes the old row.
  3. The second half carries merge_key = NULL — this row never matches and triggers WHEN NOT MATCHED to insert the new version.
  4. Both halves run in the same MERGE INTO transaction — Delta's optimistic concurrency makes this atomic.

Output.

customer_id name address __start_dt__ __end_dt__ __is_current__
C1 Alice 1 Old St 2025-01-01 2026-05-24 false
C1 Alice 99 New Ave 2026-05-24 9999-12-31 true

Rule of thumb: the SCD2 idiom is two rows per change in the USING clause, joined to the current row via a real key for the close half and a NULL key for the insert half.

Solution Using watermark + CDC + SCD2 end-to-end

# === Solution: end-to-end Azure scenario stack ===

# 1) Incremental load (ADF)
Lookup(watermark_control) → Copy(source: SELECT * WHERE updated_at > @wm) → SP(update watermark)

# 2) CDC streaming (always-on)
SQL Server (CDC ON)  →  Debezium / Azure DMS  →  Event Hubs cdc.orders
                                                ↓
                            Databricks Auto Loader → silver.orders (Delta)

# 3) SCD2 nightly (Databricks job)
silver.stg_customer  →  MERGE INTO gold.dim_customer (close + insert)
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Stage Component Cadence Idempotency mechanism
1 ADF Copy hourly watermark + upper bound
2 Debezium → EH always-on log-position cursor
3 Auto Loader continuous checkpointLocation
4 MERGE INTO SCD2 nightly __is_current__ uniqueness guard
  1. ADF hourly loads the incremental delta into bronze.orders using the watermark predicate.
  2. Debezium reads the SQL Server transaction log and emits cdc.orders events to Event Hubs.
  3. Databricks Auto Loader consumes cdc.orders, applies the c / u / d operation, and lands silver.
  4. A nightly Databricks job runs the SCD2 MERGE INTO against gold.dim_customer.

Output:

Layer Table What it represents
bronze bronze.orders raw hourly snapshot deltas (ADF)
bronze bronze.cdc_orders raw CDC events (Auto Loader from Event Hubs)
silver silver.orders conformed, deduped, partitioned by event date
gold gold.dim_customer SCD2 history with __start_dt__ / __end_dt__

Why this works — concept by concept:

  • Watermark + upper bound — guarantees deterministic retries; the same @runStartTs always defines the same window.
  • Log-based CDC — catches deletes that timestamp-based polling misses.
  • Debezium / Azure DMS — Microsoft-blessed CDC connectors that emit a stable change-event schema.
  • Auto Loader — manages file listing + checkpointing for streaming sinks; cheaper than custom Spark Structured Streaming code.
  • MERGE INTO — single-transaction close + insert; Delta's optimistic concurrency makes it atomic.
  • `is_current_` flag_ — fast WHERE __is_current__ queries and the natural uniqueness guard for the MERGE join.
  • Idempotency chain — every link (Copy, CDC, MERGE) is independently idempotent, so re-runs don't corrupt downstream.
  • CostO(delta_rows_per_window) for incremental + O(changed_rows) for SCD2; full-snapshot scans are avoided.

SQL
Topic — etl
CDC + watermark + MERGE drills

Practice →

SQL
Topic — real-time-analytics
Real-time analytics drills

Practice →


5. Security, governance + cost optimization questions

Visual security + cost diagram — top half shows a Managed Identity chip authenticating from a Databricks cluster to ADLS Gen2 through a Private Endpoint, with Key Vault holding secrets, RBAC scopes (Storage Blob Data Contributor) and Purview lineage card; bottom half shows a cost-FinOps dial card with three levers (Dedicated SQL Pool pause/resume, Databricks job-cluster + spot, Synapse Serverless TB scanned); on a light PipeCode card.

azure security interview questions and azure cost optimization interview questions — the rounds that decide senior offers

The senior-grade Azure DE round splits into two halves: identity + network + governance and FinOps. The one-sentence invariant: Managed Identity over SAS tokens, Private Endpoint over public networking, Key Vault over inline secrets, Purview over scattered docs — and on cost, pause + spot + partition prune over throwing DWUs at the problem.

Managed Identity vs Service Principal vs SAS token.

  • Managed Identity (system or user-assigned) — an Entra-managed identity bound to an Azure resource (Databricks cluster, ADF, VM); the resource calls Azure APIs with no secret to rotate.
  • Service Principal — an Entra app registration with a client secret / certificate; requires rotation; necessary for federated identities or cross-tenant access.
  • SAS token — a time-bounded URL with embedded permissions; emergency-use only; cannot be revoked before expiry.
  • Interview default — Managed Identity for anything in-Azure; Service Principal only when MI is impossible; SAS only for limited external sharing.

Private Endpoint vs Service Endpoint vs public network.

  • Public network — your storage account is reachable from anywhere on the internet; only protected by RBAC + ACL.
  • Service Endpoint — VNet-to-PaaS shortcut; storage account is still publicly addressable, but traffic from your VNet stays on the Azure backbone.
  • Private Endpoint — a private NIC inside your VNet for the storage account; the storage account is no longer publicly reachable; DNS resolves to the private IP.
  • Compliance default — Private Endpoint + storage firewall set to "deny by default".

Key Vault — secrets + keys + certificates.

  • What it stores — connection strings, SAS tokens, API keys, certificates, customer-managed encryption keys.
  • Access — every consumer (ADF Linked Service, Databricks secret scope, App Service) authenticates via Managed Identity and pulls secrets at runtime.
  • Rotation — Key Vault supports auto-rotation for certain resource types (Storage, SQL); for others, rotate manually + update Linked Services to read the new version.
  • Audit — every read is logged via diagnostic settings; SOC2 / ISO27001 require this.

RBAC scopes.

  • Subscription — broadest; rarely the right scope for data plane.
  • Resource group — grouping of related resources; reasonable for app-level access.
  • Resource — most specific; the recommended scope for storage-data-plane roles.
  • Data-plane rolesStorage Blob Data Reader / Contributor / Owner are separate from control-plane roles like Storage Account Contributor; this distinction is a senior-interview filter.

Purview — catalog, lineage, classification.

  • Catalog — automatic + manual asset registration; metadata search across Azure SQL, ADLS Gen2, Synapse, Power BI.
  • Lineage — column-level lineage harvested from ADF, Databricks, Synapse pipelines.
  • Classification — automatic PII / GDPR-sensitive labelling via regex + ML classifiers.
  • Data masking — column-level masking surfaced via policies on the source store (Azure SQL Dynamic Data Masking, Databricks row/column filters).

Cost lever 1 — Dedicated SQL Pool pause/resume.

  • Mechanism — Dedicated SQL Pool DWUs are billed by the hour; pause it when idle.
  • Automation — a Logic App or ADF schedule pauses at 7 PM and resumes at 7 AM; weekend pauses save another 28%.
  • Savings — ~60–70% off list price for a Mon-Fri 7am–7pm workload.

Cost lever 2 — Databricks job-cluster + spot + auto-terminate.

  • Job clusters vs interactive clusters — interactive bills idle time; job clusters spin up for the run and terminate at the end.
  • Spot instances — up to 80% off VM list; tolerate eviction; perfect for retryable batch jobs.
  • Auto-terminate — set to 10–30 minutes for interactive clusters so nobody forgets a notebook and burns a weekend.

Cost lever 3 — Synapse Serverless ($/TB scanned) + partition pruning.

  • Pricing model — pay per terabyte scanned, not per second of compute.
  • Lever — partition by the dominant predicate, store in Parquet / Delta, and Serverless reads only the relevant partitions.
  • Real impact — a 5-TB table scanned with WHERE event_date = ? reads ~10 GB after pruning, saving ~99.8% of cost.

Reserved capacity vs pay-as-you-go.

  • Reserved — 1-year or 3-year commitment for fixed capacity; ~30–60% off list.
  • Pay-as-you-go — full flexibility; right for unpredictable workloads.
  • Rule — reserve the baseline (the floor you always run); pay-as-you-go for the burst.

Azure FinOps reporting.

  • Cost Management + Billing — daily-resolution cost analysis, budget alerts, tagged-resource grouping.
  • Azure Advisor — automated cost recommendations (e.g. "right-size this Dedicated SQL Pool").
  • Tag discipline — every resource tagged with cost-center, env, pipeline-name; chargeback works only when tags do.

Worked example 1 — Managed Identity flow from Databricks to ADLS Gen2

Topic explanation.

  • The scoring lens — interviewers want the no-embedded-secrets path: Databricks cluster's Managed Identity → RBAC on storage account → Private Endpoint → file read.

Question. Design an identity flow from a Databricks job cluster to ADLS Gen2 silver container with zero credentials in code.

Input.

Resource Identity
Databricks cluster system-assigned Managed Identity dbx-mi
Storage account acct with HNS on
Container silver
Network Private Endpoint + storage firewall = deny

Code.

# No connection strings, no SAS, no client secrets
df = spark.read.format("delta").load(
    "abfss://silver@acct.dfs.core.windows.net/orders/"
)
df.show()
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Grant dbx-mi the RBAC role Storage Blob Data Reader scoped to the storage account.
  2. Grant dbx-mi ACL --x on every parent folder up to silver/orders/ (the "execute all the way" rule).
  3. Cluster boots, fetches a token from the Instance Metadata Service for dbx-mi.
  4. Storage account is firewall-deny by default — only the Databricks subnet's Private Endpoint can reach it.
  5. abfss://... request is authenticated by the token + authorised by RBAC + ACL; data flows over the Private Endpoint.

Output. Read succeeds with zero secrets in code, zero public network exposure, and a full Entra audit trail.

Rule of thumb: if your Spark code contains a connection string, the design is wrong.

Worked example 2 — daily Synapse Serverless cost on a partition layout

Topic explanation.

  • The scoring lens — interviewers want you to multiply scanned bytes by the published rate and show that partition pruning is the dominant cost lever on Serverless.

Question. A 5-TB Parquet events table on ADLS Gen2 is partitioned by event_date. BI dashboards run 1,000 queries/day, each with WHERE event_date = ? for a single day. Synapse Serverless is priced at $5/TB scanned. What's the daily cost?

Input. 5 TB across 540 days ≈ 10 GB / day on disk; 1,000 single-day queries / day; rate = $5 / TB.

Code.

bytes_per_query = 10 GB                = 0.01 TB
queries_per_day = 1,000
scanned_per_day = 10 TB
daily_cost      = 10 TB × $5 / TB      = $50
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Partition pruning lets Serverless read only the requested day's directory — ~10 GB instead of 5 TB.
  2. 1,000 queries × 10 GB = 10 TB scanned per day.
  3. 10 TB × $5/TB = $50 / day = ~$1,500 / month.
  4. Without partitioning, each query scans the full 5 TB → 5,000 TB / day → $25,000 / day. Partitioning cuts cost by 500×.

Output. $50/day with partition pruning; $25,000/day without. Same Serverless engine, same data — partition layout is the FinOps lever.

Rule of thumb: on Serverless engines (Synapse Serverless, BigQuery, Athena), partition + columnar format = the entire cost story.

Solution Using Managed Identity + Private Endpoint + Key Vault + FinOps levers

# === Solution: secure + cost-optimal Azure stack ===

# Identity
Databricks cluster (system-assigned MI 'dbx-mi') ──┐
ADF (managed identity 'adf-mi')                    ├──→  RBAC: Storage Blob Data Contributor
                                                   │     (scoped to container)
# Network
VNet ─→ Private Endpoint ─→ Storage account (firewall: deny public)

# Secrets
Key Vault: kv-prod
  ├─ secret/eh-orders-conn       (rotated quarterly)
  ├─ cert/synapse-link-cert
  └─ key/cmek-storage            (customer-managed encryption key)
ADF Linked Service → references AKV secret by URI (never inline)
Databricks secret scope = AKV-backed; dbutils.secrets.get("kv-prod", "...")

# Governance
Microsoft Purview scans:
  - ADLS Gen2 (all containers)
  - Azure SQL DB
  - Synapse + Databricks
  → builds column-level lineage + PII classification

# FinOps
Synapse Dedicated SQL Pool → Logic App pause 7pm–7am + weekends     (~65% saved)
Databricks                → job clusters + spot + auto-terminate    (~50–80% saved)
Synapse Serverless        → Parquet + partition by event_date       (~99% saved on scans)
Reserved capacity         → covers the baseline DWU + storage floor (~30–60% off list)
Cost Management           → tag-grouped daily reports + budget alerts
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Pillar Lever Mechanism
Identity Managed Identity no secrets in code
Network Private Endpoint + firewall deny no public surface
Secrets Key Vault + AKV-backed secret scope secrets fetched at runtime, rotatable
Governance Microsoft Purview scans + lineage column-level lineage, PII classification
Cost Dedicated SQL pause/resume ~65% off the on-hours bill
Cost Databricks job-cluster + spot ~50–80% off interactive cluster
Cost Synapse Serverless partition pruning ~99% off naive SELECT * scans
Cost Reserved capacity for the floor ~30–60% off list for steady-state
Cost Tag discipline + Cost Management chargeback + budget alerts
  1. Identity flows from Managed Identity → RBAC → ACL → file, zero secrets in code.
  2. Network locks the storage account to a single Private Endpoint inside the data VNet.
  3. Key Vault holds every connection string, certificate, and customer-managed encryption key; ADF and Databricks read at runtime.
  4. Purview scans every store nightly and surfaces lineage + classification in one catalog.
  5. The four FinOps levers (pause, job-cluster + spot, partition pruning, reserved capacity) typically cut a naive Azure bill by 60–80% with no SLO change.

Output:

Workload Naive monthly bill Optimised monthly bill Lever applied
Dedicated SQL Pool DW2000c $9,000 $3,150 pause/resume + reserved
Databricks all-purpose $12,000 $3,600 job clusters + spot + auto-terminate
Synapse Serverless BI $25,000 $1,500 partition by event_date + Parquet

Why this works — concept by concept:

  • Managed Identity — eliminates the rotate-or-leak risk of stored credentials.
  • Private Endpoint — the storage account isn't on the public internet, full stop.
  • Key Vault — every secret is rotatable + auditable + RBAC-controlled.
  • RBAC scope = resource — least-privilege by default; storage data-plane separated from control-plane.
  • ACL inheritance — sub-team isolation without minting new RBAC roles per folder.
  • Purview lineage + classification — answers "where did this column come from" + "is this PII" in one place.
  • Pause/resume + job-cluster + spot — the three compute levers that move the bill the most.
  • Partition pruning on Serverless — converts $/TB pricing into a near-free read path.
  • Reserved capacity — pays for the predictable floor; PAYG carries the spikes.
  • CostO($managed_compute + $TB_scanned + $TB_stored); every term has a dedicated lever above.

Python
Topic — etl
Identity + governance drills

Practice →

Python
Topic — data-analysis
FinOps + cost-model drills

Practice →


Azure interview cheat sheet

A one-screen cheat sheet for the 15 highest-value primitives across the four rounds — drill these and you've covered ~80% of the azure data engineer interview questions surface.

Primitive Round The single-sentence answer interviewers want
Hierarchical Namespace Lake design Must be ON for analytics — gives atomic dir rename + POSIX ACLs + abfss:// endpoint.
Medallion containers (bronze / silver / gold) Lake design Container-level ACL + lifecycle scopes; ingestion never writes to gold.
Partition key Lake design Partition by the dominant WHERE predicate; date wins ~90% of the time.
POSIX ACL vs RBAC Lake design RBAC at account, ACL per folder; default ACL on parent for inheritance + --x on ancestors.
Lifecycle Hot → Cool → Archive Lake design Days-since-modification rule; cuts bronze cost ~80% without touching reads.
Event Hubs partition count Streaming Cap on max parallelism; ordering only inside one partition.
Partition key on the producer Streaming The grain at which per-key ordering matters (customer_id, device_id).
Consumer groups Streaming Independent cursor per partition; fan-out without races.
Event Hubs Capture Streaming Free auto-archive to ADLS Gen2 in Avro; perfect bronze layer.
Windowing — Tumbling / Hopping / Session Streaming Non-overlap / rolling-with-hop / inactivity-bounded — pick by the report shape.
Watermark control table Scenario Read → Copy with WHERE updated_at > @watermark AND <= @runStartTs → update on success.
CDC via Debezium / Azure DMS Scenario Log-based, catches deletes; emits `c
SCD2 {% raw %}MERGE INTO Scenario Two-rows-per-change USING + WHEN MATCHED close + WHEN NOT MATCHED insert, one txn.
Managed Identity + Private Endpoint Security Zero secrets in code + zero public surface = the only senior-grade auth pattern.
Pause + job-cluster + partition pruning Cost The three levers that cut a naive bill by 60–80% with no SLO change.

Frequently asked questions

What's the average Azure data engineer salary in 2026 and which rounds determine band?

Azure data engineer compensation in 2026 sits roughly between $120K and $230K total in US metros (and £75K–£135K in the UK), with senior + staff roles routinely clearing $250K-plus at Microsoft, Databricks, and Fortune 100 banks. The bands aren't decided by ADF activity-type trivia — they're decided by the four rounds this guide drills: lake design, streaming correctness, scenario design, and security + cost. Candidates who recite service vocabulary but cannot defend a partition key, write a watermark predicate, or draw a Managed-Identity-to-ADLS-Gen2 flow over a Private Endpoint plateau at mid-band. Candidates who can do all four pass senior loops cleanly and negotiate from the top of the range.

What do the Azure data engineer interview rounds actually look like?

A typical Azure DE loop is 5 rounds: (1) a behavioural / motivation round; (2) a SQL + Python live-coding round (window functions, joins, aggregation under time pressure); (3) a system-design round where you whiteboard a real Azure pipeline end-to-end (this is where lake design + streaming + scenarios + security all collide); (4) a focused deep-dive on one Azure service the team uses heavily (Databricks tuning, Synapse MPP internals, or ADF parameterisation); (5) a "design + cost" round where a senior architect grills you on FinOps levers. The third and fifth rounds are where most offers are won or lost — and they're not service-recital rounds, they're real design rounds.

How does an Azure data engineer compare to AWS or GCP data engineer?

The azure vs aws data engineer and azure vs gcp data engineer comparisons are pure ecosystem swaps — the underlying skills transfer almost 1-to-1. ADLS Gen2 ↔ S3 ↔ GCS, Event Hubs ↔ Kinesis ↔ Pub/Sub, Synapse ↔ Redshift ↔ BigQuery, Databricks on Azure ↔ Databricks on AWS / GCP, ADF ↔ Glue ↔ Dataflow / Cloud Composer, Managed Identity ↔ IAM Roles ↔ Workload Identity, Purview ↔ DataZone ↔ Dataplex. The framework is identical: data lake on object storage with hierarchical layout, partition-keyed event log for streaming, scenario design centred on incremental loads + CDC + SCD2, and a security + cost round on top. Interviewers expect you to know the Azure names — but they're really testing the underlying architecture, which is portable across clouds.

How do I crack an Azure data engineer interview if I'm coming from a traditional SQL / ETL background?

The three highest-leverage moves are: (1) spend two weeks designing real ADLS Gen2 medallion lakes with HNS + ACLs + lifecycle so the lake-design round becomes muscle memory; (2) build one Event Hubs streaming pipeline end-to-end (producer → partition key → consumer group → Capture → Structured Streaming → Delta silver) so you can defend partition + windowing trade-offs from experience, not from a blog; (3) memorise the SCD2 MERGE INTO template and the watermark predicate so you can write them on a whiteboard in under five minutes. SQL fluency is a prerequisite, not the differentiator — every other candidate already has it.

Does the Azure Data Engineer DP-203 / DP-700 certification still matter in 2026?

The DP-203 (Azure Data Engineer Associate) and the newer DP-700 (Fabric Data Engineer) certifications help at the resume-screen stage — they're a credible signal that you've touched the platform end-to-end and they appear on most Azure DE job descriptions. Inside the actual loop, certifications are almost never the deciding factor; what you can design and defend live is. Treat the cert as a fast way to fill knowledge gaps (Purview, Private Endpoints, Synapse Serverless pricing nuances) and a passport through automated screens, then prepare the four scoring rounds in this guide as your real interview prep.

What tools should I know for an Azure data engineer interview in 2026?

The minimum kit is: ADLS Gen2 (HNS, ACLs, lifecycle); Azure Data Factory (pipelines, parameterisation, watermark loads); Azure Synapse (Dedicated SQL Pool MPP + Serverless $/TB scanned); Azure Databricks + Delta Lake (medallion, Auto Loader, MERGE INTO, OPTIMIZE + ZORDER); Event Hubs (partitions, consumer groups, Capture); Stream Analytics and / or Structured Streaming (Tumbling / Hopping / Session windows); Azure SQL DB (CDC + watermark control table); Key Vault (Managed Identity, secret rotation); Microsoft Purview (lineage + classification); and Cost Management + Advisor for FinOps. Bonus signals: Microsoft Fabric / OneLake (the unified SaaS layer on top of Synapse + Delta), Power BI integration, and a basic Terraform / Bicep vocabulary for IaC questions.


Where to go next

Top comments (0)