
Gowtham Potureddi


Microsoft Data Engineering Interview Questions & Prep

Microsoft data engineering interview questions usually combine SQL (joins, aggregations, window functions, and date logic) with Python (arrays, hash maps, heaps, strings, and sometimes ETL-style file handling with validation and error handling). Deeper conversations often include dimensional modeling and data pipeline or warehouse design at a whiteboard or conversational level. For data engineering interview preparation, treat SQL interview depth and Python data engineering problem-solving as the default loop—then layer verbal data modeling and pipeline reliability.

Everything below uses original teaching examples only—not proprietary items. The guide is aligned with an 18-problem Microsoft-tagged practice set and the topic mix described on that page. Throughout, say the grain, your assumptions, and NULL behavior out loud: interviewers at large tech firms reward clear reasoning over a silent perfect query.

Microsoft data engineering interview prep—header image with SQL, Python, and cloud data themes on a dark gradient.


Top Microsoft data engineering interview topics

From the Microsoft data engineering practice set and the areas those problems emphasize, the numbered sections below follow this topic map (one row per H2):

| # | Top Topics | Why it shows up (pattern) |
| --- | --- | --- |
| 1 | SQL: GROUP BY, aggregates, subqueries, HAVING | “Above average per dept,” per-group filters, HAVING, correlated subqueries vs pre-aggregated joins. |
| 2 | SQL: joins, cardinality, per-entity / order-style metrics | “Customers with enough orders,” fan-out from 1:N joins, grain before SUM, dedup with COUNT(DISTINCT …) when joins multiply rows. |
| 3 | SQL: window functions | YoY same-month revenue, running totals, ranking, PARTITION BY / ORDER BY / frames. |
| 4 | SQL: date & fiscal logic | Fiscal year labels, EXTRACT / CASE, half-open date ranges, calendar vs rolling windows. |
| 5 | Python: arrays, hash maps, heaps — includes strings | Two-sum / hash map, first unique character (string scan + Counter), merge-K lists with heap—typical O(n) or O(n log k) tasks. |
| 6 | Python: fault-tolerant ETL-style loading | Pipe/CSV lines, per-row validation, try/except, capped error samples, quarantine / logging (production-minded prompts). |
| 7 | Modeling: star schema, ETL, warehouse (verbal) | Fact grain, conformed dimensions, ETL vs ELT, idempotent loads, SCD1/2, whiteboard-level design. |

SQL evaluation order (mental model): FROM / joins → WHERE (row filter) → GROUP BY → aggregates → HAVING (group filter) → windows → SELECT / ORDER BY. If you filter a sum or count of a group, that belongs in HAVING, not WHERE, unless the condition is a per-row predicate before grouping.


1. Aggregations, subqueries, and HAVING in SQL

Aggregation and GROUP BY in SQL for data engineering

Aggregation maps many input rows to fewer result rows: metrics per group. GROUP BY defines the grain (for example one row per dept_id). HAVING filters groups after aggregates are computed—unlike WHERE, which filters rows before grouping (with the usual SQL evaluation order in mind).

Report grain (say in the interview): After the query runs, one result row means one ___. If the prompt is “per employee,” your result should still have one row per employee, not one per department unless they asked for a department report.

SQL evaluation order (simplified): FROM / joins → WHERE (row filter) → GROUP BY → aggregates → HAVING (group filter) → window functions → SELECT / ORDER BY. If the condition uses SUM or COUNT of a group, it almost always belongs in HAVING, not WHERE.

Diagram comparing WHERE (row-level filter) and HAVING (group-level filter after GROUP BY) for Microsoft data engineering SQL interview prep on PipeCode.

SUM, AVG, and NULL

  • SUM(col) — Sums non-null numerics in each group. NULL cells are skipped (not treated as 0). If every value in the group is NULL, SUM is typically NULL, not 0.
  • AVG(col) — Sum of non-null values ÷ count of non-null values in the group.
  • Worked example: In one group, salaries 80, NULL, 120 → SUM(salary) = 200, AVG(salary) = 100 (two values averaged).

COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)

  • COUNT(*) — Counts rows in the group, even if some columns are NULL.
  • COUNT(col) — Counts rows where col is not NULL.
  • COUNT(DISTINCT col) — Counts unique non-null values; use when a join duplicates keys and you need “how many users,” not “how many joined rows.”

Worked example — one group, three rows, third salary NULL: COUNT(*) = 3, COUNT(salary) = 2.

WHERE vs HAVING with a concrete story

| Step | Question you are answering |
| --- | --- |
| WHERE salary > 50000 | “Ignore low salaries before we form groups.” |
| HAVING SUM(salary) > 200000 | “After we sum per group, keep only groups whose total passes the bar.” |

Worked example — two departments:

| dept_id | emp_id | salary |
| --- | --- | --- |
| A | 1 | 100 |
| A | 2 | 200 |
| B | 3 | 50 |

  • WHERE salary > 75 on the base table removes only employee 3 if applied before grouping.
  • GROUP BY dept_id then HAVING SUM(salary) > 200 returns only dept A (sum 300); dept B (sum 50) drops out.

Correlated subquery vs JOIN to a grouped aggregate

To answer “list employees earning more than their department average,” you need each employee row next to that department’s single average. Two common shapes:

  • Subquery in FROM: precompute AVG(salary) per dept_id, then JOIN to employees.
  • Correlated subquery in WHERE: ... WHERE salary > (SELECT AVG(salary) FROM employees e2 WHERE e2.dept_id = e.dept_id).

Worked example (logic): Dept A average = 100. Salaries 80, 100, 120 → only 120 is strictly greater than 100.
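
As a sketch, the correlated form from the bullet above written out in full (illustrative; the join form appears in the full solution later in this section):

SELECT e.emp_id, e.dept_id, e.salary
FROM employees e
WHERE e.salary > (
  SELECT AVG(e2.salary)
  FROM employees e2
  WHERE e2.dept_id = e.dept_id   -- correlation: average of this employee's own department
);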

Worked example — one dataset, several aggregates (PostgreSQL-style)

orders(order_id, user_id, amount):

| order_id | user_id | amount |
| --- | --- | --- |
| 101 | u1 | 20 |
| 102 | u1 | NULL |
| 103 | u1 | 40 |
| 104 | u2 | 100 |

SELECT user_id,
       SUM(amount)   AS sum_amt,
       AVG(amount)   AS avg_amt,
       COUNT(*)      AS n_rows,
       COUNT(amount) AS n_nonnull_amt
FROM orders
GROUP BY user_id
ORDER BY user_id;

Result (trace by hand): For u1: sum_amt = 60, avg_amt = 30 (i.e. (20+40)/2), n_rows = 3, n_nonnull_amt = 2. For u2: one row, sum_amt = 100, avg_amt = 100, counts 1 and 1.

Common beginner mistakes

  • Filtering on AVG(...) or SUM(...) in WHERE at the same level as FROM employees without a subquery or GROUP BY (invalid shape in standard SQL for that intent).
  • Double-counting by joining a 1:N table, then SUMing a dimension field that repeats per fact row.
  • Using COUNT(*) when the business asked for distinct people—use COUNT(DISTINCT ...).

SQL interview question on employees above department average

Tables: employees(emp_id, dept_id, salary) (one row per employee; salary is NOT NULL). Return emp_id, dept_id, and salary for every employee whose salary is strictly greater than the average salary in that employee’s dept_id. Order by dept_id, then emp_id.

Solution using a grouped subquery and join

SELECT e.emp_id, e.dept_id, e.salary
FROM employees e
INNER JOIN (
  SELECT dept_id, AVG(salary) AS avg_sal
  FROM employees
  GROUP BY dept_id
) a ON a.dept_id = e.dept_id
WHERE e.salary > a.avg_sal
ORDER BY e.dept_id, e.emp_id;

Why this works: The inner select is one row per dept_id with the true average. Joining back gives per-employee rows again; WHERE compares each salary to the one number for that department. You could instead use a correlated WHERE on AVG—this join form is easy to read on a whiteboard.

SQL interview question on departments with enough people and a large payroll

Tables: employees(emp_id, dept_id, salary) (same as above, salary not null). Return dept_id, headcount, and total payroll SUM(salary) for departments that have at least two employees and total payroll strictly greater than 250. Order by dept_id.

Solution using HAVING on count and sum

SELECT dept_id,
       COUNT(*)        AS headcount,
       SUM(salary)     AS total_payroll
FROM employees
GROUP BY dept_id
HAVING COUNT(*) >= 2
   AND SUM(salary) > 250
ORDER BY dept_id;

Why this works: HAVING is the right place for conditions on COUNT(*) and SUM(salary) after GROUP BY. The grain of the result is one row per qualifying dept_id.

Practice

For SQL filtered to Microsoft, use the Microsoft SQL hub; warm up with easy or medium on the same company page.


2. Joins, cardinality, and order-style metrics in SQL

Joins and fan-out in SQL for data engineering

A join links rows from two (or more) relations when a key predicate is true. In interviews, the failure mode is almost always cardinality: a 1:N (or M:N) join multiplies rows. If you then SUM a column that is not at the same grain as the fact you think you are summing, you double- or over-count.

Fact vs dimension (how to talk about it): A fact table (or CTE) holds events or line-level measures at a clear grain. A dimension enriches those rows with attributes (user tier, product category). When you join a fact to a dimension that is not unique on the key, you can blow up the fact rows before any SUM.

Pro tip: Before you write SELECT, say: “This join is 1:? on the key, so each left row becomes n result rows. My SUM is over … which grain?”

Diagram of one-to-many join row multiplication versus aggregating at line grain before rolling up for Microsoft data engineering SQL interview prep on PipeCode.

Inner, left, and semi-joins (high level)

  • INNER JOIN — Keep only key matches on both sides. “Orders that have at least one line” often starts here.
  • LEFT [OUTER] JOIN — Keep all left rows; right side NULL when missing. “All customers and their order total if any” is a LEFT JOIN to an aggregated orders subquery so you do not pre-explode line rows.
  • Semi-join (EXISTS / IN) — “Accounts that have at least one ‘Closed Won’ opportunity” is often WHERE EXISTS (SELECT 1 FROM opps ...) without pulling all opps to the SELECT list—stays at one row per account in the outer query.

Worked example (narrative): customers INNER JOIN orders on cust_id drops customers with zero orders. For retention lists you may need a left join from customers to a per-customer order aggregate subquery instead of raw orders.

Fan-out: when SUM explodes (numeric toy)

orders(order_id, cust_id) and order_lines(order_id, line_no, line_amt):

| order_id | line_no | line_amt |
| --- | --- | --- |
| 1 | 1 | 10 |
| 1 | 2 | 20 |

If the orders table also carried a header field order_total = 30 and you (wrongly) JOIN lines to orders and SUM(orders.order_total), you add 30 + 30 = 60 for one order. The fix is to sum only line_amt at line grain, or to use order_total exactly once (e.g. from a subquery grouped by order_id with MAX(order_total) if it is constant per order—but that is a smell; the truth is usually the sum of lines).
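
A minimal sketch of the safe shape, assuming the same toy tables: roll lines up to order grain first, then join, so no header or dimension value is repeated before a sum.

SELECT o.order_id, o.cust_id, l.order_amt
FROM orders o
INNER JOIN (
  SELECT order_id, SUM(line_amt) AS order_amt   -- one row per order before joining
  FROM order_lines
  GROUP BY order_id
) l ON l.order_id = o.order_id;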

COUNT(*) vs COUNT(DISTINCT ...) after a 1:N join

  • COUNT(*) after a join to order lines counts line-level result rows, not “orders” unless the join is 1:1 at the grain of the count you want.
  • COUNT(DISTINCT order_id) answers “how many distinct orders” even when the join has multiple line rows per order.

Common beginner mistakes

  • Summing a repeated header value across a 1:N join.
  • Pushing a filter on the right table from ON to WHERE in a LEFT JOIN in a way that turns the join into an inner join accidentally.
  • Using COUNT(*) when the question asked for unique user_id after complex joins.

SQL interview question on customers with enough orders

Tables: customers(cust_id, name) and orders(order_id, cust_id, order_date) (one row per order; cust_id is NOT NULL). Return cust_id and order count for customers who have at least 3 orders in 2024 (use a half-open year filter: order_date in [2024-01-01, 2025-01-01)). Order by order count descending.

Solution using GROUP BY and HAVING

SELECT o.cust_id, COUNT(*) AS order_cnt
FROM orders o
WHERE o.order_date >= DATE '2024-01-01'
  AND o.order_date < DATE '2025-01-01'
GROUP BY o.cust_id
HAVING COUNT(*) >= 3
ORDER BY order_cnt DESC;

Why this works: The population of rows is one per order; GROUP BY cust_id + HAVING filters groups of orders. customers is not needed in the FROM clause unless the prompt requires name in the result.

SQL interview question on 2024 line revenue by region (joins + GROUP BY)

Tables: orders(order_id, cust_id, ship_region, order_date) and order_lines(order_id, line_id, line_amt) (numeric, not null). order_id is a key in orders and a FK in order_lines. Return ship_region and sum of line_amt for all lines that belong to orders with order_date in the 2024 calendar year (same half-open range as above). Order by ship_region.

Solution with join at line grain, then GROUP BY region

SELECT o.ship_region,
       SUM(l.line_amt) AS line_revenue_2024
FROM orders o
INNER JOIN order_lines l
  ON l.order_id = o.order_id
WHERE o.order_date >= DATE '2024-01-01'
  AND o.order_date < DATE '2025-01-01'
GROUP BY o.ship_region
ORDER BY o.ship_region;

Why this works: Revenue is naturally a line-level sum; the join to orders only brings filter and region columns. If orders also stored a pre-rolled order_total, you would not add SUM(orders.order_total) over multiple line rows without first de-duplicating to order grain—narrate that if the interviewer plants a duplicate column in the table.



3. Window functions, YoY, and running totals in SQL

PARTITION BY, ORDER BY, and frames in SQL

Window (analytic) functions compute a value per input row using a neighborhood of other rows, without collapsing the result to one row per group. That is the main semantic difference from GROUP BY + aggregates: the windowed result keeps one output row per input row.

  • PARTITION BY — Like a “local GROUP BY” for the function: the window resets on each new partition (for example, each salesperson_id or region_id).
  • ORDER BY inside OVER (...) — Defines the order in which the engine walks rows within the partition (required for ROW_NUMBER, LAG/LEAD, and running sums).
  • Frame (ROWS / RANGE between preceding and current row) — For a cumulative SUM or a trailing 7-day average, the frame specifies how many neighbor rows the function sees. Check dialect defaults: with an ORDER BY and no explicit frame, most engines default to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW—confirm against the question’s spec.

Mental check: GROUP BY = one output row per group (unless you use extra tricks). A windowed SUM = one output row per input row, plus a repeated or cumulative column value.

Diagram of PARTITION BY, ORDER BY, and cumulative sum within a window for Microsoft data engineering SQL interview prep on PipeCode.

ROW_NUMBER, RANK, and DENSE_RANK

  • ROW_NUMBER() — 1, 2, 3, … in the window order; never produces ties (two rows in the same partition never share a number, though which of two equal-ORDER BY rows gets which number is arbitrary unless you add a tie-breaking column).
  • RANK() — Ties get the same rank; the next rank skips (1, 1, 3).
  • DENSE_RANK() — Ties, but no gap: 1, 1, 2.

Typical interview prompt: “First order per customer in chronological order” → ROW_NUMBER() ... PARTITION BY cust_id ORDER BY order_ts, then filter to row number = 1 in an outer query or a CTE.

LAG and LEAD for year-over-year

For one row per (region, month, year), a self-join to (y-1, m) is very readable. If you have a dense monthly series and ORDER BY time, LAG(revenue, 12) is a one-line alternative: “revenue 12 months ago in this partition.” If months can be missing, LAG(..., 12) can line up the wrong months—say that you assume a contiguous series, or pad the series in a pre-step.

Worked example (numbers): Region R1, m=3, 2025 revenue 120; same region/m in 2024 = 100 → YoY growth 20% = (120-100) / 100.0 when the definition is (this − last) / last.
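
A minimal LAG-based sketch, assuming a dense monthly series per region and the monthly_revenue(region_id, y, m, revenue) table used in the question below:

SELECT region_id, y, m,
       revenue AS rev_ty,
       LAG(revenue, 12) OVER (
         PARTITION BY region_id
         ORDER BY y, m              -- 12 rows back = same month last year only if no months are missing
       ) AS rev_ly
FROM monthly_revenue;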

Running total with a frame (toy + SQL + trace)

daily_sales(sales_id, sp_id, sale_date, amt) → cumulative sum of amt per sp_id in chronological order:

| sp_id | sale_date | amt | running (expected) |
| --- | --- | --- | --- |
| S1 | 2025-01-01 | 10 | 10 |
| S1 | 2025-01-03 | 20 | 30 |
| S1 | 2025-01-05 | 5 | 35 |

Common beginner mistakes

  • Omitting PARTITION BY so a company-wide SUM(…) OVER (ORDER BY …) runs a cumulative total that is not per rep or per store.
  • Using LAG with the wrong offset when the data is not a fully dense calendar at the stated grain.

SQL interview question on year-over-year monthly revenue (same month)

Table: monthly_revenue(region_id, y, m, revenue) with primary key (region_id, y, m); m is 1..12. Return one row per (region_id, y, m) for y = 2025 and m in (1,2,3) with columns region_id, y, m, revenue as rev_ty, and rev_ly (revenue for the same region_id and m in y - 1), or NULL if there is no such row. Order by region_id, m.

Solution using a left self-join on region and calendar month

SELECT
  t.region_id, t.y, t.m, t.revenue AS rev_ty,
  ly.revenue AS rev_ly
FROM monthly_revenue t
LEFT JOIN monthly_revenue ly
  ON ly.region_id = t.region_id
 AND ly.m = t.m
 AND ly.y = t.y - 1
WHERE t.y = 2025
  AND t.m IN (1, 2, 3)
ORDER BY t.region_id, t.m;

Why this works: You align (region, month slot) in consecutive calendar years; LEFT JOIN keeps 2025 rows even when 2024 was absent (rev_ly is NULL).

SQL interview question on running total of sales by rep (window)

Table: sales(sales_id, sp_id, sale_date, amt); amt is NOT NULL. Return all columns of sales plus column running_total, defined as the cumulative sum of amt for each sp_id in sale_date order (ties: any deterministic order is fine if the interviewer allows—otherwise add sales_id to ORDER BY).

Solution using SUM(...) OVER with a row frame (PostgreSQL-style)

SELECT
  sales_id, sp_id, sale_date, amt,
  SUM(amt) OVER (
    PARTITION BY sp_id
    ORDER BY sale_date, sales_id
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM sales;

Why this works: The frame from partition start to current row is exactly “running total in time order within rep.” Ties on sale_date are broken by sales_id for reproducibility.



4. Date functions and fiscal period logic in SQL

Calendars, fiscal year, and CASE in SQL

Many orgs use a fiscal calendar (e.g. year starts in July or April). The pattern: (1) map every row’s event_date to a fiscal key (often fy_id), then (2) GROUP BY that key and aggregate.

Stating assumptions scores points: name half-open date ranges, and say whether timestamps are UTC or local before you write WHERE.

EXTRACT / date_trunc vs fiscal CASE

  • date_trunc('month', d) (PostgreSQL) or similar: snap to a month for MTD-style buckets or to join a date dimension.
  • Fiscal labels are often a CASE on month and year (see the table below), because Q3 in a fiscal year is not the same as Q3 in a calendar year.

Building a July-start fy_id (toy + trace)

Rule in this article (same as the solutions): if EXTRACT(MONTH FROM event_date) >= 7 then fy_id = EXTRACT(YEAR FROM event_date) else fy_id = EXTRACT(YEAR FROM event_date) - 1.

| event_date | month | fy_id (this rule) |
| --- | --- | --- |
| 2024-03-10 | 3 | 2023 |
| 2024-08-20 | 8 | 2024 |

Calendar Q1 with a half-open range (teaching)

WHERE event_date >= DATE '2024-01-01' AND event_date < DATE '2024-04-01' selects Q1 2024 for date-only keys: inclusive of Jan 1, exclusive of Apr 1.

Half-open ranges on timestamp

WHERE event_ts >= TIMESTAMP '2024-01-01' AND event_ts < TIMESTAMP '2025-01-01' in the correct timezone avoids edge-time issues from BETWEEN on high-resolution timestamps.

Common beginner mistakes

  • Changing the CASE branches that compute fy_id to relabel dashboard output without updating the stakeholder definition of the fiscal calendar.
  • Mixing UTC and PST without a single rule.

SQL interview question on labeling fiscal year (July start, July–June labels)

Table: events(event_id, event_date DATE, region_id); fiscal year starts July 1; fiscal year id is the calendar year of the July that starts the fiscal year (e.g. dates in 2024-07-01 .. 2025-06-30 map to fy_id = 2024 as the “July 2024” start—adjust if your org uses +1 naming; be explicit in the answer). Return event_id, event_date, and fy_id for each row using: if month of event_date is >= 7 then EXTRACT(YEAR FROM event_date) else EXTRACT(YEAR FROM event_date) - 1.

Solution using EXTRACT and CASE (illustrative)

SELECT
  event_id,
  event_date,
  CASE
    WHEN EXTRACT(MONTH FROM event_date) >= 7
      THEN EXTRACT(YEAR FROM event_date)::int
    ELSE EXTRACT(YEAR FROM event_date)::int - 1
  END AS fy_id
FROM events;

Why this works: The CASE encodes a fixed July-split. If the org names fiscal years with a +1 name in slides, the id in the table can still be this int; align in narration.

SQL interview question on counting events in calendar Q1 2024

Table: events(event_id, event_date DATE, region_id). Return region_id and event count for rows with event_date in the first calendar quarter of 2024 (use a half-open range: [2024-01-01, 2024-04-01) on date-only). Order by region_id.

Solution using GROUP BY on a half-open WHERE

SELECT region_id, COUNT(*) AS n_events
FROM events
WHERE event_date >= DATE '2024-01-01'
  AND event_date < DATE '2024-04-01'
GROUP BY region_id
ORDER BY region_id;

Why this works: Half-open ranges on date-only data line up with Q boundaries and play well with orchestrated backfills that reprocess by date partition.



5. Arrays, hash maps, and heaps in Python

Data structures in Python for data engineering coding rounds

Python for data engineering at Microsoft-style screens (see Array, Hash table, Heap, and String on the Microsoft practice set) usually means: O(n) or O(n log n) on iterables, hash-based maps for counts and index-by-key, and heaps when you must process the smallest next item among many ordered streams (e.g. time-ordered logs from K shards).

In an interview, name time and space; hand-trace a small example on paper before you golf the code.

dict and Counter for frequency and O(1) lookup

collections.Counter(iterable) counts how many times each key appears. A plain dict with get(k, 0) does the same with a few more lines. After one pass, average O(1) lookups support “first unique,” “top-k frequent,” and deduplication at a key.
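
A minimal sketch of both shapes (character frequency over a toy string; names here are illustrative):

from collections import Counter

s = "engineering"

counts = Counter(s)                     # one pass; O(1) average lookups afterwards

counts_manual: dict[str, int] = {}      # same result with a plain dict and get(k, 0)
for ch in s:
    counts_manual[ch] = counts_manual.get(ch, 0) + 1

assert dict(counts) == counts_manual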

Hash map: two-sum (classic O(n))

Idea: Scan the array once; keep a dict of value → index (or, equivalently, check for the complement target - nums[i]). When target - nums[i] is already a key, return the pair of indices (read the prompt for 0- vs 1-based).

Worked example: nums = [2, 7, 11, 15], target = 9. At index 0, value 2, complement 7 is missing; store 2 -> 0. At index 1, value 7, complement 2 is in the map at index 0 → return (0, 1).

Min-heap: merge K sorted iterables (skeleton + trace)

heapq in Python is a min-heap. For two streams you can merge with two pointers; for K streams you keep (next_value, stream_id, iterator) in a heap of size at most K; each pop emits the min; push the next from that stream. Time is O(N log K) for N total elements across K lists.

Worked example (A = [1,4], B = [2,3]): the merge order is 1, 2, 3, 4. The first heap min is 1 (from A) vs 2 (from B), then advance A; repeat until both streams are exhausted.
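
A minimal merge-K sketch under those assumptions (heapq is a min-heap; heapq.merge exists in the standard library, but interviewers usually want the manual loop—the stream id in the tuple breaks ties so the heap never compares iterators):

import heapq
from typing import Iterable, Iterator

def merge_k(streams: list[Iterable[int]]) -> Iterator[int]:
    # Seed the heap with the first element of each non-empty stream.
    heap: list[tuple[int, int, Iterator[int]]] = []
    for sid, s in enumerate(streams):
        it = iter(s)
        first = next(it, None)
        if first is not None:
            heap.append((first, sid, it))
    heapq.heapify(heap)
    while heap:
        value, sid, it = heapq.heappop(heap)   # smallest next item across all streams
        yield value
        nxt = next(it, None)                   # advance only the stream we just consumed
        if nxt is not None:
            heapq.heappush(heap, (nxt, sid, it))

# list(merge_k([[1, 4], [2, 3]])) -> [1, 2, 3, 4]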

Common beginner mistakes

  • Using is for string or int equality (use == for value).
  • Off-by-one on 0- vs 1-based indices; confirm the return format in the prompt.
  • Pushing None onto the heap or exhausting an iterator without checking; handle StopIteration (or use a next(it, default) sentinel).

Python interview question on first unique character index

Given a string s of ASCII letters, return the 0-based index of the first character that does not repeat anywhere else in s. If none, return -1.

Solution using two passes (count then scan)

def first_unique(s: str) -> int:
    from collections import Counter
    c = Counter(s)
    for i, ch in enumerate(s):
        if c[ch] == 1:
            return i
    return -1

Why this works: O(n) time, O(σ) extra space (σ = alphabet size). A one-pass + ordered structure is optional; clarity first.

Python interview question on two sum (return indices)

Given a list nums of integers and an integer target, return 0-based indices i and j (with i < j) such that nums[i] + nums[j] == target. You may assume exactly one such pair exists (simpler version; if not, the interviewer will clarify return value rules).

Solution using a hash map of value → index (complement lookup)

def two_sum(nums: list[int], target: int) -> tuple[int, int]:
    need: dict[int, int] = {}
    for j, v in enumerate(nums):
        t = target - v
        if t in need:
            return (need[t], j)
        need[v] = j
    raise ValueError("no pair (should not happen under exercise assumptions)")

Why this works: For each j, we ask if complement target - v was seen; if so we return the stored index and j. O(n) time, O(n) space.

Practice

Microsoft Python (all) and Microsoft Python — medium are good spines for coding volume.


6. Fault-tolerant loading, validation, and ETL-style Python

Resilient batch ingestion in Python for data engineering

Production pipelines quarantine bad rows, log (line number + reason + optional payload hash), and re-run idempotent partitions when upstream fixes land. A “fault-tolerant loader” in interviews often means: read lines or CSV records, validate against a schema, and return counts and capped error samples—without aborting the whole file on one bad line if the spec says to continue.

Design narration (say out loud): UTF-8 vs legacy encoding; header row present or not; delimiter; what to do with empty lines; max errors to log; where to write a dead-letter file; idempotent merge to the sink.

Validate, then coerce types (per field)

Idea: Split string fields first; for each field rule, cast in try/except. Never let a malformed int row take down the job if the spec wants per-row errors.

Worked example — toy pipe-delimited rows (no header):

  • line 1: `A|2` → ok
  • line 2: `B` → invalid (wrong field count)
  • line 3: `C|not_int` → invalid (age is not an integer)

A line-by-line loop should return ok=1, bad=2 with up-to-5 sample messages (line # + snipped error).

Common beginner mistakes

  • Catching a broad Exception in prod and hiding real bugs (interviews: mention you would narrow in a real codebase).
  • Dropping blank lines without counting them—match the spec (sometimes empty = ok skip; sometimes error).
  • Loading the entire huge file into memory; interviews may add a stream-or-batch follow-up.

Python interview question on parsed pipe file with integer second field

Prompt (original teaching example): Each non-empty line is id|age where id is a non-empty str and age is a non-negative integer ≤ 120. Return a dict { "ok": n_ok, "bad": n_bad, "error_samples": [ ... at most 5 ] } as you read a text file line-by-line from path. Empty lines are ignored and not counted in ok or bad.

Solution with parse, count, and capped samples

def load_pipe_rows(path: str) -> dict:
    ok = bad = 0
    err_samples: list[str] = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                parts = line.split("|", 1)
                if len(parts) != 2 or not parts[0]:
                    raise ValueError("expected id|age")
                age = int(parts[1])
                if age < 0 or age > 120:
                    raise ValueError("age out of range")
            except (ValueError, TypeError) as e:
                bad += 1
                if len(err_samples) < 5:
                    err_samples.append(f"line {i}: {e}")
                continue
            ok += 1
    return {"ok": ok, "bad": bad, "error_samples": err_samples}

Why this works: Control-flow is per line; ok is only for valid rows; errors are typed; cap on error_samples is O(1) extra memory for the message list.

Python interview question on counting valid rows (generic sketch)

Prompt: A file has one record per line; some lines are invalid. Count valid vs invalid and return ok, bad, and up-to-5 error samples (skeleton; fill a real parse function in the interview).

Solution using try/except and a line-level loop (skeleton)

def summarize(path: str) -> dict:
    ok = bad = 0
    err_samples: list[str] = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            try:
                _ = line.strip()
                if not _:
                    continue
                # parse_and_validate(_)
                ok += 1
            except Exception as e:  # noqa: BLE001 — narrow in production
                bad += 1
                if len(err_samples) < 5:
                    err_samples.append(f"line {i}: {e!s}")
    return {"ok": ok, "bad": bad, "error_samples": err_samples}

Why this works: The skeleton separates I/O from parsing; you swap the body of try for the exercise-specific rules.

Practice

The Microsoft company hub also lists problems that combine Python with error-handling-style tags; filter by language and difficulty to match your plan.


7. Star schema, ETL, and data warehouse design (verbal)

Dimensional facts and conformed dimensions in data engineering

A star schema puts a fact (measurements) at a well-posed grain in the center and connects it to dimension tables (who / what / when / where). A conformed dimension reuses the same product_id (or date_id) across multiple fact tables so rollups line up. In a verbal round, you win by naming grain and which table holds which attribute as it changes over time.

Star schema and ETL flow diagram for Microsoft data engineering modeling and pipeline interview prep on PipeCode.

Fact table vs dimension table (tight)

  • Fact — Many rows, often one per line or per event; numeric measures and keys to dims.
  • Dimension — Fewer, descriptive rows; ideally one row (or a Type-2 history of rows) per entity, until reality breaks that with duplicates and interview time goes to lineage and SCDs.

Worked example (verbal): If you need daily revenue and refund lines at different grains, you might use a revenue fact at line-level and a separate refund fact, both with conformed date_id and product_idnot one jumbled fact that tries to be two grains at once.
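
As a whiteboard-level sketch of that split (table and column names are illustrative, not from any prompt):

-- Two facts at different grains sharing conformed dimension keys.
CREATE TABLE fact_revenue_line (
  order_id   BIGINT,
  line_id    BIGINT,
  date_id    INT,             -- conformed: same dim_date as the refund fact
  product_id INT,             -- conformed: same dim_product as the refund fact
  line_amt   NUMERIC(12, 2)
);

CREATE TABLE fact_refund (
  refund_id  BIGINT,
  date_id    INT,
  product_id INT,
  refund_amt NUMERIC(12, 2)
);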

ETL vs ELT (and where compute sits)

  • ETLTransform (clean, type, business rules) in a compute service, then load narrow tables in the warehouse (classic on-prem style).
  • ELTLand raw in the lake; use the warehouse’s SQL engine for heavy transforms (common with separation of storage and compute).
  • In Microsoft-adjacent stacks (e.g. Azure-style), be ready to name (at a whiteboard level) ingestion (events → ADLS), orchestration (scheduled jobs), and serving (Synapse / SQL / BI)—without fabricating org-specific details you have not been told about in the prompt.

SCD Type 1 vs Type 2 (verbal, high signal)

  • Type 1Overwrite the row; no history (good for typos or if the org only cares about “current city”).
  • Type 2 — Add a new row (or version) with valid_from / valid_to (or is_current) so revenue in 2023 can be tied to geography as of that period.

Common beginner mistakes

  • One fact table serving two incompatible grains with no bridge and no plan for reconciliation.
  • SCD1-ing a field the compliance team wants as-of-date for revenue reporting.

Data engineering interview question on idempotent daily loads

Prompt: A nightly job reloads yesterday’s partition in object storage. How do you make re-runs safe without double-counting in the downstream table?

Solution (verbal)

Answer outline: Replace the day partition in the curated layer (or TRUNCATE + load the one-day slice into a staging table, then an atomic swap); or MERGE on a natural key that includes the day; version the ETL run in a log; reconcile row counts and key uniqueness with the source before promotion.

Why this works: Reprocessing is expected; idempotent writes plus a defined “what is d-1 in the curated zone” invariant make re-runs safe.
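
A minimal sketch of the MERGE flavor (schema and table names are illustrative; many warehouses instead support an atomic overwrite of the day partition):

-- Re-runnable upsert of one day's slice, keyed by (natural_key, event_date).
MERGE INTO curated.daily_metrics AS tgt
USING staging.daily_metrics_d1 AS src
  ON  tgt.natural_key = src.natural_key
  AND tgt.event_date  = src.event_date
WHEN MATCHED THEN
  UPDATE SET metric_value = src.metric_value
WHEN NOT MATCHED THEN
  INSERT (natural_key, event_date, metric_value)
  VALUES (src.natural_key, src.event_date, src.metric_value);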

Data engineering interview question on SCD2 for moves in territory

Prompt: dim_account has one row per account for “current” territory; revenue facts are at month grain. A key account moved from West to East mid-year. How do you model the attribute if Q1 revenue must tie to West and H2 to East as of that month?

Solution (verbal)

Answer outline: SCD2 (or a separate account-territory bridge with valid dates); fact rows join the territory version valid for the fact month; or a conformed key in the fact that resolves through the versioned dim. Narrate the as-of join in one sentence—that is the goal in a whiteboard round.
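
A minimal as-of join sketch, assuming a versioned dim_account_territory(account_id, territory, valid_from, valid_to) with half-open validity windows and a month-grain fact keyed by month_start (names illustrative):

SELECT f.account_id,
       f.month_start,
       f.revenue,
       d.territory
FROM fact_revenue_month f
JOIN dim_account_territory d
  ON  d.account_id = f.account_id
  AND f.month_start >= d.valid_from
  AND f.month_start <  d.valid_to;   -- territory version valid for the fact month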

Practice

For courses that reinforce modeling and ETL system design, browse PipeCode courses and pair with problems from the table above.


Tips to crack Microsoft data engineering interviews

Data engineering interview preparation for a Microsoft-style loop means reps in SQL (especially joins and windows), Python for clarity and correct complexity, and a credible story on reliable pipelines and warehouse design.

  1. Solve a broad set from the Microsoft company practice page: 7 easy, 7 medium, 4 hard in the curated 18-problem pack—balance SQL and Python using the SQL and Python filters on the hub.
  2. For SQL, state grain in one sentence before writing FROM. Name join cardinality and NULL behavior when you use OUTER joins.
  3. For windows, always name the PARTITION, ORDER, and frame (if relevant), even if dialect syntax differs on paper; reasoning counts more than trivia.
  4. For Python, default to readable solutions first; optimize only if asked; test on a tiny case you wrote down.
  5. For reliability and ingestion, list logging, quarantine path, and idempotence in any file-or-stream question.

Across the platform, 450+ problems with tests and AI-style feedback are available; use topic and company filters to stay on the lanes in the table at the top of this guide.


Frequently asked questions

What is the Microsoft data engineering interview process?

Rounds often include a phone screen, one or more technical coding sessions focused on SQL and Python, a system or pipeline design discussion, and behavioral interviews. Exact order and number of rounds vary by level and team—treat the Microsoft practice list as pattern prep, not a public guarantee of on-site structure.

What topics should I study for a Microsoft data engineering interview?

SQL: joins, aggregations and HAVING, subqueries, window functions, and date / fiscal rules. Python: arrays and hash maps, heaps and merging sorted data, strings, and robust file or stream handling where the spec asks for resilience. Modeling / system level: star schema, fact grain, ETL vs ELT, idempotent loads, warehouse boundaries—as in the section table at the start of this article.

How difficult are Microsoft data engineering interview questions?

The curated 18-problem set on PipeCode is split into 7 easy, 7 medium, and 4 hard—use the easy track to validate basics and the hard track to stress-test windows and non-trivial Python under time pressure.

How should I prepare for a Microsoft data engineering interview?

Work the Microsoft-tagged set and global topic pages (e.g. aggregation, joins, windows) with deliberate post-mortems on misses—especially grain and join shape. Aim for dozens of solved problems with verbal explanations before an onsite-style day; add courses if you need structured review in modeling or ETL system design.

What programming languages does Microsoft use in data engineering interviews?

SQL and Python are the default filters in many data engineering loops; some rounds are conversational for high-level Azure services and data architecture. Practice SQL on Microsoft SQL and Python on Microsoft Python to match the language mix you will face.

Where can I practice Microsoft-style data engineering questions with tests and feedback?

PipeCode is built for data engineering interview prep with 450+ problems and tests; the dedicated hub is Microsoft practice. You can also browse all company hubs to compare volume and topics across employers.

Start practicing Microsoft data engineering problems


Top comments (0)