Microsoft data engineering interview questions usually combine SQL (joins, aggregations, window functions, and date logic) with Python (arrays, hash maps, heaps, strings, and sometimes ETL-style file handling with validation and error handling). Deeper conversations often include dimensional modeling and data pipeline or warehouse design at a whiteboard or conversational level. For data engineering interview preparation, treat SQL interview depth and Python data engineering problem-solving as the default loop—then layer verbal data modeling and pipeline reliability.
Everything below uses original teaching examples only—not proprietary items. The guide is aligned with an 18-problem Microsoft-tagged practice set and the topic mix described on that page. Say grain, assumptions, and NULL behavior out loud: interviewers at large tech firms reward clear reasoning over a silent perfect query.
Top Microsoft data engineering interview topics
From the Microsoft data engineering practice set and the areas those problems emphasize, the numbered sections below follow this topic map (one row per H2):
| # | Top Topics | Why it shows up (pattern) |
|---|---|---|
| 1 | SQL: GROUP BY, aggregates, subqueries, HAVING | “Above average per dept,” per-group filters, HAVING, correlated subqueries vs pre-aggregated joins. |
| 2 | SQL: joins, cardinality, per-entity / order-style metrics | “Customers with enough orders,” fan-out from 1:N joins, grain before SUM, dedup with COUNT(DISTINCT …) when joins multiply rows. |
| 3 | SQL: window functions | YoY same-month revenue, running totals, ranking, PARTITION BY / ORDER BY / frames. |
| 4 | SQL: date & fiscal logic | Fiscal year labels, EXTRACT / CASE, half-open date ranges, calendar vs rolling windows. |
| 5 | Python: arrays, hash maps, heaps — includes strings | Two-sum / hash map, first unique character (string scan + Counter), merge-K lists with heap—typical O(n) or O(n log k) tasks. |
| 6 | Python: fault-tolerant ETL-style loading | Pipe/CSV lines, per-row validation, try/except, capped error samples, quarantine / logging (production-minded prompts). |
| 7 | Modeling: star schema, ETL, warehouse (verbal) | Fact grain, conformed dimensions, ETL vs ELT, idempotent loads, SCD1/2 and whiteboard-level design. |
SQL evaluation order (mental model): `FROM`/joins → `WHERE` (row filter) → `GROUP BY` → aggregates → `HAVING` (group filter) → windows → `SELECT`/`ORDER BY`. If you filter a sum or count of a group, that belongs in `HAVING`, not `WHERE`, unless the condition is a per-row predicate before grouping.
1. Aggregations, subqueries, and HAVING in SQL
Aggregation and GROUP BY in SQL for data engineering
Aggregation maps many input rows to fewer result rows: metrics per group. GROUP BY defines the grain (for example one row per dept_id). HAVING filters groups after aggregates are computed—unlike WHERE, which filters rows before grouping (with the usual SQL evaluation order in mind).
Report grain (say in the interview): After the query runs, one result row means one ___. If the prompt is “per employee,” your result should still have one row per employee, not one per department unless they asked for a department report.
SUM, AVG, and NULL
- `SUM(col)` — sums non-null numerics in each group. `NULL` cells are skipped (not treated as 0). If every value in the group is `NULL`, `SUM` is typically `NULL`, not 0.
- `AVG(col)` — sum of non-null values ÷ count of non-null values in the group.

Worked example: in one group, salaries 80, NULL, 120 → `SUM(salary) = 200`, `AVG(salary) = 100` (two values averaged).
COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)
- `COUNT(*)` — counts rows in the group, even if some columns are `NULL`.
- `COUNT(col)` — counts rows where `col` is not `NULL`.
- `COUNT(DISTINCT col)` — counts unique non-null values; use when a join duplicates keys and you need “how many users,” not “how many joined rows.”
Worked example — one group, three rows, third salary NULL: COUNT(*) = 3, COUNT(salary) = 2.
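The NULL rules above are quick to verify in any SQL engine. Here is a minimal sketch using Python's built-in `sqlite3` (the table and column names are illustrative, not from a real schema):

```python
import sqlite3

# One group of three salaries: 80, NULL, 120
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (dept_id TEXT, salary INTEGER)")
con.executemany("INSERT INTO emp VALUES (?, ?)",
                [("A", 80), ("A", None), ("A", 120)])

row = con.execute("""
    SELECT SUM(salary), AVG(salary), COUNT(*), COUNT(salary)
    FROM emp
    GROUP BY dept_id
""").fetchone()

# NULL is skipped by SUM/AVG/COUNT(col) but still counted by COUNT(*)
print(row)  # (200, 100.0, 3, 2)
```

The same script is a handy sanity check before an interview: change one value to `NULL` and predict all four aggregates before re-running.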
WHERE vs HAVING with a concrete story
| Step | Question you are answering |
|---|---|
| `WHERE salary > 50000` | “Ignore low salaries before we form groups.” |
| `HAVING SUM(salary) > 200000` | “After we sum per group, keep only groups whose total passes the bar.” |
Worked example — two departments:
| dept_id | emp_id | salary |
|---|---|---|
| A | 1 | 100 |
| A | 2 | 200 |
| B | 3 | 50 |
- `WHERE salary > 75` on the base table removes only employee 3 if applied before grouping.
- `GROUP BY dept_id` then `HAVING SUM(salary) > 200` returns only dept A (sum 300); dept B (sum 50) drops out.
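The two-department example can be traced end to end with `sqlite3`; this is a teaching sketch with the same toy data as the table above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (dept_id TEXT, emp_id INTEGER, salary INTEGER)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [("A", 1, 100), ("A", 2, 200), ("B", 3, 50)])

# WHERE filters rows before grouping: only employee 3 is dropped
kept = con.execute("SELECT emp_id FROM employees WHERE salary > 75").fetchall()

# HAVING filters groups after aggregation: only dept A (sum 300) survives
depts = con.execute("""
    SELECT dept_id, SUM(salary)
    FROM employees
    GROUP BY dept_id
    HAVING SUM(salary) > 200
""").fetchall()

print(kept)   # [(1,), (2,)]
print(depts)  # [('A', 300)]
```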
Correlated subquery vs JOIN to a grouped aggregate
To answer “list employees earning more than their department average,” you need each employee row next to that department’s single average. Two common shapes:
- Subquery in `FROM`: precompute `AVG(salary)` per `dept_id`, then `JOIN` to `employees`.
- Correlated subquery in `WHERE`: `... WHERE salary > (SELECT AVG(salary) FROM employees e2 WHERE e2.dept_id = e.dept_id)`.
Worked example (logic): Dept A average = 100. Salaries 80, 100, 120 → only 120 is strictly greater than 100.
Worked example — one dataset, several aggregates (PostgreSQL-style)
orders(order_id, user_id, amount):
| order_id | user_id | amount |
|---|---|---|
| 101 | u1 | 20 |
| 102 | u1 | NULL |
| 103 | u1 | 40 |
| 104 | u2 | 100 |
```sql
SELECT user_id,
       SUM(amount)   AS sum_amt,
       AVG(amount)   AS avg_amt,
       COUNT(*)      AS n_rows,
       COUNT(amount) AS n_nonnull_amt
FROM orders
GROUP BY user_id
ORDER BY user_id;
```
Result (trace by hand): For u1: sum_amt = 60, avg_amt = 30 (i.e. (20+40)/2), n_rows = 3, n_nonnull_amt = 2. For u2: one row, sum_amt = 100, avg_amt = 100, counts 1 and 1.
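You can confirm the hand trace by running the same query against the same four rows in `sqlite3` (a sketch; column types are assumptions matching the toy table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, user_id TEXT, amount INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(101, "u1", 20), (102, "u1", None),
                 (103, "u1", 40), (104, "u2", 100)])

rows = con.execute("""
    SELECT user_id, SUM(amount), AVG(amount), COUNT(*), COUNT(amount)
    FROM orders
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

# Matches the trace: NULL is excluded from SUM/AVG/COUNT(amount)
print(rows)  # [('u1', 60, 30.0, 3, 2), ('u2', 100, 100.0, 1, 1)]
```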
Common beginner mistakes
- Filtering on `AVG(...)` or `SUM(...)` in `WHERE` at the same level as `FROM employees` without a subquery or `GROUP BY` (invalid shape in standard SQL for that intent).
- Double-counting by joining a 1:N table, then `SUM`ming a dimension field that repeats per fact row.
- Using `COUNT(*)` when the business asked for distinct people—use `COUNT(DISTINCT ...)`.
SQL interview question on employees above department average
Tables: employees(emp_id, dept_id, salary) (one row per employee; salary is NOT NULL). Return emp_id, dept_id, and salary for every employee whose salary is strictly greater than the average salary in that employee’s dept_id. Order by dept_id, then emp_id.
Solution using a grouped subquery and join
```sql
SELECT e.emp_id, e.dept_id, e.salary
FROM employees e
INNER JOIN (
    SELECT dept_id, AVG(salary) AS avg_sal
    FROM employees
    GROUP BY dept_id
) a ON a.dept_id = e.dept_id
WHERE e.salary > a.avg_sal
ORDER BY e.dept_id, e.emp_id;
```
Why this works: The inner select is one row per dept_id with the true average. Joining back gives per-employee rows again; WHERE compares each salary to the one number for that department. You could instead use a correlated WHERE on AVG—this join form is easy to read on a whiteboard.
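To convince yourself the join shape is right, here is the same query run on the earlier dept-A salaries (80, 100, 120) via `sqlite3` — a sketch with made-up rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (emp_id INTEGER, dept_id TEXT, salary INTEGER)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(1, "A", 80), (2, "A", 100), (3, "A", 120)])

rows = con.execute("""
    SELECT e.emp_id, e.dept_id, e.salary
    FROM employees e
    INNER JOIN (SELECT dept_id, AVG(salary) AS avg_sal
                FROM employees GROUP BY dept_id) a
      ON a.dept_id = e.dept_id
    WHERE e.salary > a.avg_sal
    ORDER BY e.dept_id, e.emp_id
""").fetchall()

# Dept average is 100; only 120 is strictly greater
print(rows)  # [(3, 'A', 120)]
```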
SQL interview question on departments with enough people and a large payroll
Tables: employees(emp_id, dept_id, salary) (same as above, salary not null). Return dept_id, headcount, and total payroll SUM(salary) for departments that have at least two employees and total payroll strictly greater than 250. Order by dept_id.
Solution using HAVING on count and sum
```sql
SELECT dept_id,
       COUNT(*)    AS headcount,
       SUM(salary) AS total_payroll
FROM employees
GROUP BY dept_id
HAVING COUNT(*) >= 2
   AND SUM(salary) > 250
ORDER BY dept_id;
```
Why this works: HAVING is the right place for conditions on COUNT(*) and SUM(salary) after GROUP BY. The grain of the result is one row per qualifying dept_id.
Practice
- SQL · Topic — Aggregation (all companies)
- COMPANY · Microsoft — Microsoft-tagged window functions
For SQL filtered to Microsoft, use the Microsoft SQL hub; warm up with easy or medium on the same company page.
2. Joins, cardinality, and order-style metrics in SQL
Joins and fan-out in SQL for data engineering
A join links rows from two (or more) relations when a key predicate is true. In interviews, the failure mode is almost always cardinality: a 1:N (or M:N) join multiplies rows. If you then SUM a column that is not at the same grain as the fact you think you are summing, you double- or over-count.
Fact vs dimension (how to talk about it): A fact table (or CTE) holds events or line-level measures at a clear grain. A dimension enriches those rows with attributes (user tier, product category). When you join a fact to a dimension that is not unique on the key, you can blow up the fact rows before any SUM.
Pro tip: Before you write `SELECT`, say: “This join is 1:? on the key, so each left row becomes n result rows. My `SUM` is over … which grain?”
Inner, left, and semi-joins (high level)
- `INNER JOIN` — keep only key matches on both sides. “Orders that have at least one line” often starts here.
- `LEFT [OUTER] JOIN` — keep all left rows; right side `NULL` when missing. “All customers and their order total if any” is a `LEFT JOIN` to an aggregated `orders` subquery so you do not pre-explode line rows.
- Semi-join (`EXISTS`/`IN`) — “Accounts that have at least one ‘Closed Won’ opportunity” is often `WHERE EXISTS (SELECT 1 FROM opps ...)` without pulling all opps to the `SELECT` list—stays at one row per account in the outer query.
Worked example (narrative): an inner join of customers to orders on cust_id drops customers with zero orders. For retention lists you may need a left join from customers to a per-customer order aggregate subquery instead of raw orders.
Fan-out: when SUM explodes (numeric toy)
orders(order_id, cust_id) and order_lines(order_id, line_no, line_amt):
| order_id | line_no | line_amt |
|---|---|---|
| 1 | 1 | 10 |
| 1 | 2 | 20 |
If the order table also carried header field order_total = 30 and you (wrongly) JOIN lines to orders and SUM(orders.order_total), you add 30 + 30 = 60 for one order. The fix is to sum only line_amt at line grain, or use order_total once (e.g. from a subquery grouped by order_id with MAX(order_total) if it is constant on the order—but that is a smell; the truth is usually sum of lines).
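The double-count is easy to reproduce. This `sqlite3` sketch (with the toy order above, header total 30 and two lines) shows the wrong and right sums side by side:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, order_total INTEGER)")
con.execute("CREATE TABLE order_lines (order_id INTEGER, line_no INTEGER, line_amt INTEGER)")
con.execute("INSERT INTO orders VALUES (1, 30)")
con.executemany("INSERT INTO order_lines VALUES (?, ?, ?)", [(1, 1, 10), (1, 2, 20)])

wrong_total, right_total = con.execute("""
    SELECT SUM(o.order_total), SUM(l.line_amt)
    FROM orders o
    JOIN order_lines l ON l.order_id = o.order_id
""").fetchone()

# The 1:N join repeats the header value once per line row
print(wrong_total, right_total)  # 60 30
```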
COUNT(*) vs COUNT(DISTINCT ...) after a 1:N join
- `COUNT(*)` after a join to order lines counts line-level result rows, not “orders,” unless the join is 1:1 at the grain of the count you want.
- `COUNT(DISTINCT order_id)` answers “how many distinct orders” even when the join has multiple line rows per order.
Common beginner mistakes
- Summing a repeated header value across a 1:N join.
- Pushing a filter on the right table from `ON` to `WHERE` in a `LEFT JOIN` in a way that accidentally turns the join into an inner join.
- Using `COUNT(*)` when the question asked for unique `user_id` after complex joins.
SQL interview question on customers with enough orders
Tables: customers(cust_id, name) and orders(order_id, cust_id, order_date) (one row per order; cust_id is NOT NULL). Return cust_id and order count for customers who have at least 3 orders in 2024 (use a half-open year filter: order_date in [2024-01-01, 2025-01-01)). Order by order count descending.
Solution using GROUP BY and HAVING
```sql
SELECT o.cust_id, COUNT(*) AS order_cnt
FROM orders o
WHERE o.order_date >= DATE '2024-01-01'
  AND o.order_date <  DATE '2025-01-01'
GROUP BY o.cust_id
HAVING COUNT(*) >= 3
ORDER BY order_cnt DESC;
```
Why this works: The population of rows is one per order; GROUP BY cust_id + HAVING filters groups of orders. customers is not needed in the FROM clause unless the prompt requires name in the result.
SQL interview question on 2024 line revenue by region (joins + GROUP BY)
Tables: orders(order_id, cust_id, ship_region, order_date) and order_lines(order_id, line_id, line_amt) (numeric, not null). order_id is a key in orders and a FK in order_lines. Return ship_region and sum of line_amt for all lines that belong to orders with order_date in the 2024 calendar year (same half-open range as above). Order by ship_region.
Solution with join at line grain, then GROUP BY region
```sql
SELECT o.ship_region,
       SUM(l.line_amt) AS line_revenue_2024
FROM orders o
INNER JOIN order_lines l
        ON l.order_id = o.order_id
WHERE o.order_date >= DATE '2024-01-01'
  AND o.order_date <  DATE '2025-01-01'
GROUP BY o.ship_region
ORDER BY o.ship_region;
```
Why this works: Revenue is naturally a line-level sum; the join to orders only brings filter and region columns. If orders also stored a pre-rolled order_total, you would not add SUM(orders.order_total) over multiple line rows without first de-duplicating to order grain—narrate that if the interviewer plants a duplicate column in the table.
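A small `sqlite3` sketch of the same shape (SQLite has no `DATE` literal, but half-open comparisons work on ISO-formatted text dates; all rows are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, cust_id TEXT, ship_region TEXT, order_date TEXT)")
con.execute("CREATE TABLE order_lines (order_id INTEGER, line_id INTEGER, line_amt INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                [(1, "c1", "West", "2024-03-01"),
                 (2, "c2", "East", "2023-12-31")])
con.executemany("INSERT INTO order_lines VALUES (?, ?, ?)",
                [(1, 1, 10), (1, 2, 20), (2, 1, 99)])

rows = con.execute("""
    SELECT o.ship_region, SUM(l.line_amt)
    FROM orders o
    JOIN order_lines l ON l.order_id = o.order_id
    WHERE o.order_date >= '2024-01-01' AND o.order_date < '2025-01-01'
    GROUP BY o.ship_region
    ORDER BY o.ship_region
""").fetchall()

# The 2023-12-31 order (and its 99) is excluded by the half-open range
print(rows)  # [('West', 30)]
```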
Practice
- SQL · Topic — Joins (all companies)
- COMPANY · Microsoft — Microsoft-tagged SQL
3. Window functions, YoY, and running totals in SQL
PARTITION BY, ORDER BY, and frames in SQL
Window (analytic) functions compute a value per input row using a neighborhood of other rows, without collapsing the result to one row per group. That is the main semantic difference from GROUP BY + aggregates: the result set is still as many rows as the selected rows (with rare exceptions in DBs that support GROUP BY with frame-like extensions).
- `PARTITION BY` — like a “local `GROUP BY`” for the function: the window resets on each new partition (for example, each `salesperson_id` or `region_id`).
- `ORDER BY` inside `OVER (...)` — defines the order in which the engine walks rows within the partition (required for `ROW_NUMBER`, `LAG`/`LEAD`, and running sums).
- Frame (`ROWS`/`RANGE` between preceding and current row) — for a cumulative `SUM` or a trailing 7-day average, you are specifying how many neighbor rows the function sees. Check dialect defaults: for many engines, a plain `ORDER BY` with `SUM(...) OVER` sets a default frame—confirm the interview question’s spec.

Mental check: `GROUP BY` = one output row per group (unless you use extra tricks). A windowed `SUM` = one output row per input row, plus a repeated or cumulative column value.
ROW_NUMBER, RANK, and DENSE_RANK
- `ROW_NUMBER()` — 1, 2, 3, … in the window order; no ties (two rows in the same partition never share the same number unless a bug or an unstable sort without tie-breaking).
- `RANK()` — ties get the same rank; the next rank skips (1, 1, 3).
- `DENSE_RANK()` — ties, but no gap: 1, 1, 2.
Typical interview prompt: “First order per customer in chron order” → ROW_NUMBER() ... PARTITION BY cust_id ORDER BY order_ts then filter = 1 in an outer query or a CTE.
LAG and LEAD for year-over-year
For one row per (region, month, year), a self-join to (y-1, m) is very readable. If you have a dense monthly series with ORDER BY time, LAG(revenue, 12) is one row: “revenue 12 months ago in this partition.” If months can be missing, LAG(12) can line up the wrong months—say you assume a contiguous series or you pad the series in a pre-step.
Worked example (numbers): Region R1, m=3, 2025 revenue 120; same region/m in 2024 = 100 → YoY growth 20% = (120-100) / 100.0 when the definition is (this − last) / last.
Running total with a frame (toy + SQL + trace)
daily_sales(sales_id, sp_id, sale_date, amt) — cumulative sum of amt per sp_id in chron order:
| sp_id | sale_date | amt | running (expected) |
|---|---|---|---|
| S1 | 2025-01-01 | 10 | 10 |
| S1 | 2025-01-03 | 20 | 30 |
| S1 | 2025-01-05 | 5 | 35 |
Common beginner mistakes
- Omitting `PARTITION BY` so a company-wide `SUM(…) OVER (ORDER BY …)` runs a cumulative total that is not per rep or per store.
- Using `LAG` with the wrong offset when the data is not a fully dense calendar at the stated grain.
SQL interview question on year-over-year monthly revenue (same month)
Table: monthly_revenue(region_id, y, m, revenue) with primary key (region_id, y, m); m is 1..12. Return one row per (region_id, y, m) for y = 2025 and m in (1,2,3) with columns region_id, y, m, revenue as rev_ty, and rev_ly (revenue for the same region_id and m in y - 1), or NULL if there is no such row. Order by region_id, m.
Solution using a left self-join on region and calendar month
```sql
SELECT
    t.region_id, t.y, t.m, t.revenue AS rev_ty,
    ly.revenue AS rev_ly
FROM monthly_revenue t
LEFT JOIN monthly_revenue ly
       ON ly.region_id = t.region_id
      AND ly.m = t.m
      AND ly.y = t.y - 1
WHERE t.y = 2025
  AND t.m IN (1, 2, 3)
ORDER BY t.region_id, t.m;
```
Why this works: You align (region, month slot) in consecutive calendar years; LEFT JOIN keeps 2025 rows even when 2024 was absent (rev_ly is NULL).
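A `sqlite3` sketch of the self-join with three invented rows, including one month where the prior year is missing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE monthly_revenue (region_id TEXT, y INTEGER, m INTEGER, revenue INTEGER)")
con.executemany("INSERT INTO monthly_revenue VALUES (?, ?, ?, ?)",
                [("R1", 2024, 3, 100),   # same month last year exists only for m=3
                 ("R1", 2025, 3, 120),
                 ("R1", 2025, 1, 50)])

rows = con.execute("""
    SELECT t.region_id, t.m, t.revenue AS rev_ty, ly.revenue AS rev_ly
    FROM monthly_revenue t
    LEFT JOIN monthly_revenue ly
           ON ly.region_id = t.region_id AND ly.m = t.m AND ly.y = t.y - 1
    WHERE t.y = 2025 AND t.m IN (1, 2, 3)
    ORDER BY t.region_id, t.m
""").fetchall()

# The missing 2024-01 row surfaces as NULL (None), not a dropped 2025 row
print(rows)  # [('R1', 1, 50, None), ('R1', 3, 120, 100)]
```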
SQL interview question on running total of sales by rep (window)
Table: sales(sales_id, sp_id, sale_date, amt); amt is NOT NULL. Return all columns of sales plus column running_total, defined as the cumulative sum of amt for each sp_id in sale_date order (ties: any deterministic order is fine if the interviewer allows—otherwise add sales_id to ORDER BY).
Solution using SUM(...) OVER with a row frame (PostgreSQL-style)
```sql
SELECT
    sales_id, sp_id, sale_date, amt,
    SUM(amt) OVER (
        PARTITION BY sp_id
        ORDER BY sale_date, sales_id
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM sales;
```
Why this works: The frame from partition start to current row is exactly “running total in time order within rep.” Ties on sale_date are broken by sales_id for reproducibility.
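The frame can be checked against the toy table above with `sqlite3` (SQLite supports window functions from version 3.25; the rows here match the S1 trace):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (sales_id INTEGER, sp_id TEXT, sale_date TEXT, amt INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                [(1, "S1", "2025-01-01", 10),
                 (2, "S1", "2025-01-03", 20),
                 (3, "S1", "2025-01-05", 5)])

rows = con.execute("""
    SELECT sales_id, SUM(amt) OVER (
        PARTITION BY sp_id
        ORDER BY sale_date, sales_id
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
    FROM sales
""").fetchall()

# Cumulative per rep, in date order: 10, 30, 35
print(rows)  # [(1, 10), (2, 30), (3, 35)]
```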
Practice
- SQL · Topic — Window functions (all companies)
- COMPANY · Microsoft — Microsoft-tagged window functions
4. Date functions and fiscal period logic in SQL
Calendars, fiscal year, and CASE in SQL
Many orgs use a fiscal calendar (e.g. year starts in July or April). The pattern: (1) map every row’s event_date to a fiscal key (often fy_id), then (2) GROUP BY that key and aggregate.
Stating assumptions scores points: name half-open date ranges, and say whether timestamps are UTC or local before you write `WHERE`.
EXTRACT / date_trunc vs fiscal CASE
- `date_trunc('month', d)` (PostgreSQL) or similar: snap to a month for MTD-style buckets or to join a date dimension.
- Fiscal labels are often a `CASE` on month and year (see the table below), because Q3 in a fiscal year is not the same as Q3 in a calendar year.
Building a July-start fy_id (toy + trace)
Rule in this article (same as the solutions): if EXTRACT(MONTH FROM event_date) >= 7 then fy_id = EXTRACT(YEAR FROM event_date) else fy_id = EXTRACT(YEAR FROM event_date) - 1.
| event_date | month | fy_id (this rule) |
|---|---|---|
| 2024-03-10 | 3 | 2023 |
| 2024-08-20 | 8 | 2024 |
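The same July-start rule is a one-liner in Python, which is handy for generating test fixtures for the SQL version (the function name is illustrative):

```python
from datetime import date

def fy_id(d: date) -> int:
    """July-start fiscal year id: the calendar year of the July that opens the FY."""
    return d.year if d.month >= 7 else d.year - 1

print(fy_id(date(2024, 3, 10)))  # 2023 (March belongs to the FY that started July 2023)
print(fy_id(date(2024, 8, 20)))  # 2024
```

Note the boundary: July 1 itself starts the new fiscal year, which is exactly the `>= 7` test in the SQL `CASE` below.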
Calendar Q1 with a half-open range (teaching)
WHERE event_date >= DATE '2024-01-01' AND event_date < DATE '2024-04-01' selects Q1 2024 for date-only keys: inclusive of Jan 1, exclusive of Apr 1.
Half-open ranges on timestamp
WHERE event_ts >= TIMESTAMP '2024-01-01' AND event_ts < TIMESTAMP '2025-01-01' in the correct timezone avoids edge-time issues from BETWEEN on high-resolution timestamps.
Common beginner mistakes
- Changing `CASE` branches for `fy_id` while relabeling output for dashboards without updating the stakeholder definition of the fiscal calendar.
- Mixing UTC and PST without a single rule.
SQL interview question on labeling fiscal year (July start, July–June labels)
Table: events(event_id, event_date DATE, region_id); fiscal year starts July 1; fiscal year id is the calendar year of the July that starts the fiscal year (e.g. dates in 2024-07-01 .. 2025-06-30 map to fy_id = 2024 as the “July 2024” start—adjust if your org uses +1 naming; be explicit in the answer). Return event_id, event_date, and fy_id for each row using: if month of event_date is >= 7 then EXTRACT(YEAR FROM event_date) else EXTRACT(YEAR FROM event_date) - 1.
Solution using EXTRACT and CASE (illustrative)
```sql
SELECT
    event_id,
    event_date,
    CASE
        WHEN EXTRACT(MONTH FROM event_date) >= 7
            THEN EXTRACT(YEAR FROM event_date)::int
        ELSE EXTRACT(YEAR FROM event_date)::int - 1
    END AS fy_id
FROM events;
```
Why this works: The CASE encodes a fixed July-split. If the org names fiscal years with a +1 name in slides, the id in the table can still be this int; align in narration.
SQL interview question on counting events in calendar Q1 2024
Table: events(event_id, event_date DATE, region_id). Return region_id and event count for rows with event_date in the first calendar quarter of 2024 (use a half-open range: [2024-01-01, 2024-04-01) on date-only). Order by region_id.
Solution using GROUP BY on a half-open WHERE
```sql
SELECT region_id, COUNT(*) AS n_events
FROM events
WHERE event_date >= DATE '2024-01-01'
  AND event_date <  DATE '2024-04-01'
GROUP BY region_id
ORDER BY region_id;
```
Why this works: Half-open ranges on date-only data line up with Q boundaries and play well with orchestrated backfills that reprocess by date partition.
Practice
- SQL · Topic — Date functions / topic hubs on PipeCode (filter SQL + date-related)
- COMPANY · Microsoft Microsoft practice hub
5. Arrays, hash maps, and heaps in Python
Data structures in Python for data engineering coding rounds
Python for data engineering at Microsoft-style screens (see Array, Hash table, Heap, and String on the Microsoft practice set) usually means: O(n) or O(n log n) on iterables, hash-based maps for counts and index-by-key, and heaps when you must process the smallest next item among many ordered streams (e.g. time-ordered logs from K shards).
In an interview, name time and space; hand-trace a small example on paper before you golf the code.
dict and Counter for frequency and O(1) lookup
collections.Counter(iterable) counts how many times each key appears. A plain dict with get(k, 0) does the same with a few more lines. After one pass, average O(1) lookups support “first unique,” “top-k frequent,” and deduplication at a key.
Hash map: two-sum (classic O(n))
Idea: Scan the array once; keep a dict of value → index (or, equivalently, check for complement target - nums[i]). When target - nums[i] is already a key, return the pair of indices (read the prompt for 0- vs 1-based).
Worked example: nums = [2, 7, 11, 15], target = 9. At index 0, value 2, complement 7 is missing; store 2 -> 0. At index 1, value 7, complement 2 is in the map at index 0 → return (0, 1).
Min-heap: merge K sorted iterables (skeleton + trace)
heapq in Python is a min-heap. For two streams you can merge with two pointers; for K streams you keep (next_value, stream_id, iterator) in a heap of size at most K; each pop emits the min; push the next from that stream. Time is O(N log K) for N total elements across K lists.
Worked example (A = [1,4], B = [2,3]): the merge order is 1, 2, 3, 4. The first heap min is 1 (from A) vs 2 (from B), then advance A; repeat until both streams are exhausted.
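The K-way merge sketch above can be written against `heapq` directly; this is an illustrative implementation (it assumes the streams contain no `None` values, since `None` is used as the exhaustion sentinel):

```python
import heapq

def merge_k(streams):
    """Merge K sorted iterables into one sorted list with a size-K min-heap."""
    heap = []
    for sid, it in enumerate(map(iter, streams)):
        first = next(it, None)          # None default handles StopIteration
        if first is not None:
            # sid breaks ties so tuples never fall through to comparing iterators
            heapq.heappush(heap, (first, sid, it))
    out = []
    while heap:
        value, sid, it = heapq.heappop(heap)  # smallest head across all streams
        out.append(value)
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, sid, it))
    return out

print(merge_k([[1, 4], [2, 3]]))  # [1, 2, 3, 4]
```

Each element is pushed and popped once, so the total work is O(N log K) for N elements across K streams, matching the complexity above. (In production you would likely reach for `heapq.merge`, which returns a lazy iterator.)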
Common beginner mistakes
- Using `is` for string or int equality (use `==` for value).
- Off-by-one on 0- vs 1-based indices; confirm the return format in the prompt.
- Pushing `None` or exhausting an iterator without checking—handle `StopIteration`.
Python interview question on first unique character index
Given a string s of ASCII letters, return the 0-based index of the first character that does not repeat anywhere else in s. If none, return -1.
Solution using two passes (count then scan)
```python
from collections import Counter

def first_unique(s: str) -> int:
    c = Counter(s)
    for i, ch in enumerate(s):
        if c[ch] == 1:
            return i
    return -1
```
Why this works: O(n) time, O(σ) extra space (σ = alphabet size). A one-pass + ordered structure is optional; clarity first.
Python interview question on two sum (return indices)
Given a list nums of integers and an integer target, return 0-based indices i and j (with i < j) such that nums[i] + nums[j] == target. You may assume exactly one such pair exists (simpler version; if not, the interviewer will clarify return value rules).
Solution using a hash map of complement → index
```python
def two_sum(nums: list[int], target: int) -> tuple[int, int]:
    need: dict[int, int] = {}
    for j, v in enumerate(nums):
        t = target - v
        if t in need:
            return (need[t], j)
        need[v] = j
    raise ValueError("no pair (should not happen under exercise assumptions)")
```
Why this works: For each j, we ask if complement target - v was seen; if so we return the stored index and j. O(n) time, O(n) space.
Practice
- PYTHON · Topic — Array (Python)
- COMPANY · Microsoft — Microsoft-tagged hash table
Microsoft Python (all) and Microsoft Python — medium are good spines for coding volume.
6. Fault-tolerant loading, validation, and ETL-style Python
Resilient batch ingestion in Python for data engineering
Production pipelines quarantine bad rows, log (line number + reason + optional payload hash), and re-run idempotent partitions when upstream fixes land. A “fault-tolerant loader” in interviews often is: read lines or CSV records, validate a schema, return counts and capped error samples—without aborting the whole file on one bad line if the spec says to continue.
Design narration (say out loud): UTF-8 vs legacy encoding; header row present or not; delimiter; what to do with empty lines; max errors to log; where to write a dead-letter file; idempotent merge to the sink.
Validate, then coerce types (per field)
Idea: Split string fields first; for each field rule, cast in try/except. Never let a malformed int row take down the job if the spec wants per-row errors.
Worked example — toy pipe-delimited rows (no header):

| line # | raw content | outcome |
|---|---|---|
| 1 | `A\|2` | valid |
| 2 | `B` | invalid — wrong field count |
| 3 | `C\|not_int` | invalid — second field is not an integer |

A line-by-line loop should return ok=1, bad=2 with up-to-5 sample messages (line # + snipped error).
Common beginner mistakes
- Catching a broad `Exception` in prod and hiding real bugs (interviews: mention you would narrow in a real codebase).
- Dropping blank lines without counting them—match the spec (sometimes empty = ok skip; sometimes error).
- Loading the entire huge file into memory; interviews may add a stream-or-batch follow-up.
Python interview question on parsed pipe file with integer second field
Prompt (original teaching example): Each non-empty line is id|age where id is a non-empty str and age is a non-negative integer ≤ 120. Return a dict { "ok": n_ok, "bad": n_bad, "error_samples": [ ... at most 5 ] } as you read a text file line-by-line from path. Empty lines are ignored and not counted in ok or bad.
Solution with parse, count, and capped samples
```python
def load_pipe_rows(path: str) -> dict:
    ok = bad = 0
    err_samples: list[str] = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                parts = line.split("|", 1)
                if len(parts) != 2 or not parts[0]:
                    raise ValueError("expected id|age")
                age = int(parts[1])
                if age < 0 or age > 120:
                    raise ValueError("age out of range")
            except (ValueError, TypeError) as e:
                bad += 1
                if len(err_samples) < 5:
                    err_samples.append(f"line {i}: {e}")
                continue
            ok += 1
    return {"ok": ok, "bad": bad, "error_samples": err_samples}
```
Why this works: Control-flow is per line; ok is only for valid rows; errors are typed; cap on error_samples is O(1) extra memory for the message list.
Python interview question on counting valid rows (generic sketch)
Prompt: A file has one record per line; some lines are invalid. Count valid vs invalid and return ok, bad, and up-to-5 error samples (skeleton; fill a real parse function in the interview).
Solution using try/except and a line-level loop (skeleton)
```python
def summarize(path: str) -> dict:
    ok = bad = 0
    err_samples: list[str] = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            try:
                _ = line.strip()
                if not _:
                    continue
                # parse_and_validate(_)
                ok += 1
            except Exception as e:  # noqa: BLE001 — narrow in production
                bad += 1
                if len(err_samples) < 5:
                    err_samples.append(f"line {i}: {e!s}")
    return {"ok": ok, "bad": bad, "error_samples": err_samples}
```
Why this works: The skeleton separates I/O from parsing; you swap the body of try for the exercise-specific rules.
Practice
- PYTHON · Topic — CSV / parsing topics
- SQL · Topic — ETL (complements verbal design of load steps)
The Microsoft company hub also lists problems that combine Python with error-handling-style tags; filter by language and difficulty to match your plan.
7. Star schema, ETL, and data warehouse design (verbal)
Dimensional facts and conformed dimensions in data engineering
A star schema puts a fact (measurements) at a well-posed grain in the center and connects it to dimension tables (who / what / when / where). A conformed dimension reuses the same product_id (or date_id) across multiple fact tables so rollups line up. In a verbal round, you win by naming grain and which table holds which attribute as it changes over time.
Fact table vs dimension table (tight)
- Fact — Many rows, often one per line or per event; numeric measures and keys to dims.
- Dimension — Fewer, descriptive rows; ideally one row (or a Type-2 history of rows) per entity, until reality breaks that with duplicates and interview time shifts to lineage and SCDs.
Worked example (verbal): If you need daily revenue and refund lines at different grains, you might use a revenue fact at line-level and a separate refund fact, both with conformed date_id and product_id—not one jumbled fact that tries to be two grains at once.
ETL vs ELT (and where compute sits)
- ETL — Transform (clean, type, business rules) in a compute service, then load narrow tables in the warehouse (classic on-prem style).
- ELT — Land raw in the lake; use the warehouse’s SQL engine for heavy transforms (common with separation of storage and compute).
- In Microsoft-adjacent stacks (e.g. Azure-style), be ready to name (at a whiteboard level) ingestion (events → ADLS), orchestration (scheduled jobs), and serving (Synapse / SQL / BI)—without fabricating org-specific details you have not been told about in the prompt.
SCD Type 1 vs Type 2 (verbal, high signal)
- Type 1 — Overwrite the row; no history (good for typos or if the org only cares about “current city”).
- Type 2 — Add a new row (or version) with `valid_from`/`valid_to` (or `is_current`) so revenue in 2023 can be tied to geography as of that period.
Common beginner mistakes
- One fact table serving two incompatible grains with no bridge and no plan for reconciliation.
- SCD1-ing a field the compliance team wants as-of-date for revenue reporting.
Data engineering interview question on idempotent daily loads
Prompt: A nightly job reloads yesterday’s partition in object storage. How do you make re-runs safe without double-counting in the downstream table?
Solution (verbal)
Answer outline: Replace the day partition in the curated layer (or TRUNCATE+load the one-day slice in a staging then atomic swap); or MERGE on a natural key including the day; version the ETL run in a log; reconcile row counts and key-uniqueness with the source before promotion.
Why this works: Reprocessing is expected; idempotent writes + a defined “what is d-1 in the curated zone” invariant makes re-runs safe.
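The delete-then-insert partition replace can be sketched in a few lines; this toy uses `sqlite3` and a single-column "partition" key, purely to show the idempotence property (table and function names are invented):

```python
import sqlite3

def reload_day(con: sqlite3.Connection, day: str, rows: list[int]) -> None:
    """Replace one day's partition atomically: delete then insert in one transaction."""
    with con:  # connection as context manager: commit on success, rollback on error
        con.execute("DELETE FROM curated WHERE day = ?", (day,))
        con.executemany("INSERT INTO curated VALUES (?, ?)",
                        [(day, r) for r in rows])

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE curated (day TEXT, amt INTEGER)")

reload_day(con, "2025-01-01", [10, 20])
reload_day(con, "2025-01-01", [10, 20])  # re-run: same end state, no double-counting

total = con.execute("SELECT SUM(amt) FROM curated").fetchone()[0]
print(total)  # 30, not 60
```

A warehouse `MERGE` or an atomic partition swap achieves the same invariant at scale; the point is that the write for a given day is a function of the inputs, not of how many times the job ran.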
Data engineering interview question on SCD2 for moves in territory
Prompt: dim_account has one row per account for “current” territory; revenue facts are at month grain. A key account moved from West to East mid-year. How do you model the attribute if Q1 revenue must tie to West and H2 to East as of that month?
Solution (verbal)
Answer outline: SCD2 (or a separate account-territory bridge with valid dates); fact rows join the territory version valid for the fact month; or a conformed key in the fact that resolves through the versioned dim. Narrate the as-of join in one sentence—that is the goal in a whiteboard round.
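The "as-of join" in the outline is one range predicate. Here is a minimal `sqlite3` sketch with an invented SCD2 dimension (half-open `valid_from`/`valid_to` windows) and two fact months straddling the move:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SCD2 dimension: two versions of one account, each with a validity window
con.execute("""CREATE TABLE dim_account
               (account_id TEXT, territory TEXT, valid_from TEXT, valid_to TEXT)""")
con.executemany("INSERT INTO dim_account VALUES (?, ?, ?, ?)",
                [("A1", "West", "2025-01-01", "2025-07-01"),
                 ("A1", "East", "2025-07-01", "9999-12-31")])
con.execute("CREATE TABLE fact_revenue (account_id TEXT, month_start TEXT, revenue INTEGER)")
con.executemany("INSERT INTO fact_revenue VALUES (?, ?, ?)",
                [("A1", "2025-02-01", 100), ("A1", "2025-08-01", 200)])

rows = con.execute("""
    SELECT f.month_start, d.territory, f.revenue
    FROM fact_revenue f
    JOIN dim_account d
      ON d.account_id = f.account_id
     AND f.month_start >= d.valid_from   -- as-of: pick the version
     AND f.month_start <  d.valid_to     -- valid for the fact month
    ORDER BY f.month_start
""").fetchall()

# February revenue ties to West; August revenue ties to East
print(rows)  # [('2025-02-01', 'West', 100), ('2025-08-01', 'East', 200)]
```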
Practice
- SQL · Topic — Dimensional modeling
- COMPANY · Microsoft All Microsoft problems
For courses that reinforce modeling and ETL system design, browse PipeCode courses and pair with problems from the table above.
Tips to crack Microsoft data engineering interviews
Data engineering interview preparation for a Microsoft-style loop means reps in SQL (especially joins and windows), Python for clarity and correct complexity, and a credible story on reliable pipelines and warehouse design.
- Solve a broad set from the Microsoft company practice page: 7 easy, 7 medium, 4 hard in the curated 18-problem pack—balance SQL and Python using the SQL and Python filters on the hub.
- For SQL, state grain in one sentence before writing `FROM`. Name join cardinality and NULL behavior when you use `OUTER` joins.
- For windows, always name the `PARTITION`, `ORDER`, and frame (if relevant), even if dialect syntax differs on paper—reasoning counts more than trivia.
- For Python, default to readable solutions first; optimize only if asked; test on a tiny case you wrote down.
- For reliability and ingestion, list logging, quarantine path, and idempotence in any file-or-stream question.
| Lane | Microsoft-focused path |
|---|---|
| Company hub | /explore/practice/company/microsoft |
| SQL (all) | /explore/practice/company/microsoft/sql |
| Python (all) | /explore/practice/company/microsoft/python |
| By difficulty (easy) | /explore/practice/company/microsoft/difficulty/easy |
| By difficulty (hard) | /explore/practice/company/microsoft/difficulty/hard |
Across the platform, 450+ problems with tests and AI-style feedback are available; use topic and company filters to stay on the lanes in the table at the top of this guide.
Frequently asked questions
What is the Microsoft data engineering interview process?
Rounds often include a phone screen, one or more technical coding sessions focused on SQL and Python, a system or pipeline design discussion, and behavioral interviews. Exact order and number of rounds vary by level and team—treat the Microsoft practice list as pattern prep, not a public guarantee of on-site structure.
What topics should I study for a Microsoft data engineering interview?
SQL: joins, aggregations and HAVING, subqueries, window functions, and date / fiscal rules. Python: arrays and hash maps, heaps and merging sorted data, strings, and robust file or stream handling where the spec asks for resilience. Modeling / system level: star schema, fact grain, ETL vs ELT, idempotent loads, warehouse boundaries—as in the section table at the start of this article.
How difficult are Microsoft data engineering interview questions?
The curated 18-problem set on PipeCode is split into 7 easy, 7 medium, and 4 hard—use the easy track to validate basics and the hard track to stress-test windows and non-trivial Python under time pressure.
How should I prepare for a Microsoft data engineering interview?
Work the Microsoft-tagged set and global topic pages (e.g. aggregation, joins, windows) with deliberate post-mortems on misses—especially grain and join shape. Aim for dozens of solved problems with verbal explanations before an onsite-style day; add courses if you need structured review in modeling or ETL system design.
What programming languages does Microsoft use in data engineering interviews?
SQL and Python are the default filters in many data engineering loops; some rounds are conversational for high-level Azure services and data architecture. Practice SQL on Microsoft SQL and Python on Microsoft Python to match the language mix you will face.
Where can I practice Microsoft-style data engineering questions with tests and feedback?
PipeCode is built for data engineering interview prep with 450+ problems and tests; the dedicated hub is Microsoft practice. You can also browse all company hubs to compare volume and topics across employers.
Start practicing Microsoft data engineering problems