Meta data engineering interviews (and similar big-tech loops) usually lean heavily on SQL—often PostgreSQL-style—and on Python for problems involving hash maps, counting, arrays, strings, and streaming or interval-style logic. Interviewers care that you can write correct queries, pick the right grain, and explain trade-offs—not just memorize syntax.
This guide introduces the topics that show up most often in that kind of prep: aggregations, joins, windows, dates, null-safe reporting, set logic, and representative Python patterns. The sample questions here are original teaching examples.
Top Meta Data Engineering Interview Topics
| # | Topic | Why it matters on Meta-style sets |
|---|---|---|
| 1 | Aggregation & GROUP BY / HAVING | Summaries by brand, seller, day, etc.; filter groups after summing. |
| 2 | Filtering (WHERE) | Keep the right rows before you aggregate. |
| 3 | Joins & deduplication | Combine tables without inflating row counts. |
| 4 | Window functions & ranking | Running totals, top-N per group, ties. |
| 5 | Subqueries & CTEs | Multi-step logic readable in one script. |
| 6 | Dates & time-series | Daily revenue, latency, cohort-style buckets. |
| 7 | NULL & safe percentages | Correct numerators/denominators; avoid silent wrong rates. |
| 8 | Set-style logic (overlap / both / except) | Customers in both stores, A-not-B, etc. |
| 9 | Python: hash maps & counting | Dicts, Counter, frequency and “most common.” |
| 10 | Python: streaming & intervals | Update state as events arrive; merge intervals. |
If you are new to SQL: In most databases the engine processes a query roughly in this order: `FROM`/joins → `WHERE` (filter rows) → `GROUP BY` → compute aggregates → `HAVING` (filter groups) → window functions → `SELECT`/`ORDER BY`. When in doubt, ask: “Am I filtering one row at a time (WHERE) or a whole group after summing (HAVING)?”
1. Aggregation and GROUP BY Concepts in Data Engineering
Aggregation and GROUP BY in SQL for Data Engineering
Picture a table with many detail rows—for example one row per order. Aggregation means: “turn lots of rows into one summary value (or a few values) per bucket.” The bucket is whatever the question cares about: per user, per day, per campaign, and so on.
- `GROUP BY` defines the bucket: “Put all rows with the same `user_id` together,” or “the same `(store_id, day)` together.” Every distinct combination of the `GROUP BY` columns is one group; the database runs your aggregate functions separately inside each group.
SUM(col)
- What it does: Adds all numeric values of `col` in the bucket.
- `NULL` behavior: `NULL` cells are ignored (they are not treated as 0). If every value is `NULL`, `SUM` is usually `NULL`, not 0—say that in an interview if the edge case matters.
- Typical use: Revenue totals, quantities, scores summed per seller or per day.
Worked example: In one group, amount values 10, NULL, 30 → SUM(amount) = 40 (only 10 + 30).
AVG(col)
- What it does: Average of non-`NULL` values: sum of non-null values divided by count of non-null values.
- `NULL` behavior: Rows where `col` is `NULL` do not enter the numerator or the denominator. If you need “average where missing means 0,” use `AVG(COALESCE(col, 0))` (only if the business defines it that way).
- Typical use: Average order value per user, average latency per region.
Worked example: Same three rows 10, NULL, 30 → AVG(amount) = 20 because (10 + 30) / 2; the NULL row is not counted in the average.
COUNT(*)
- What it does: Counts how many rows are in the bucket—every row counts, even if some columns are `NULL`.
- Typical use: “How many orders per customer?”, “how many events in this hour?”—when each row is one event.
Worked example: Same three rows → COUNT(*) = 3 (the NULL row still counts as a row).
COUNT(col)
- What it does: Counts rows where `col` is not `NULL`. Differs from `COUNT(*)` as soon as `col` has nulls.
- Example intuition: `COUNT(user_id)` might count rows with a known user; `COUNT(*)` counts all rows in the group.
- Related: `COUNT(DISTINCT col)` counts unique non-null values in the bucket—essential after joins when you must count people, not multiplied rows (see section 3).
Worked example: Same three rows → COUNT(amount) = 2. If the third row were 50 instead of NULL, COUNT(DISTINCT amount) with values 10, 30, 50 would be 3; with 10, 30, 10 it would be 2.
MIN(col) and MAX(col)
- What they do: Return the smallest or largest value of `col` in the bucket. Works on orderable types: numbers, dates/timestamps, strings (lexicographic order).
- `NULL` behavior: `NULL`s are skipped. If all values are `NULL`, the result is `NULL`.
- Typical use: Latest (`MAX(ts)`), earliest (`MIN(day)`), cheapest product in a category (`MIN(price)`).
Worked example: Amounts 10, NULL, 30 → MIN(amount) = 10, MAX(amount) = 30. For strings 'apple', 'banana' in one group → MIN = 'apple' (lexicographic).
Worked example — one dataset, several aggregates
Suppose orders looks like this:
| order_id | user_id | amount |
|---|---|---|
| 101 | u1 | 20.00 |
| 102 | u1 | NULL |
| 103 | u1 | 40.00 |
| 104 | u2 | 100.00 |
Run:
```sql
SELECT user_id,
       SUM(amount)   AS sum_amt,
       AVG(amount)   AS avg_amt,
       COUNT(*)      AS n_rows,
       COUNT(amount) AS n_known_amt,
       MIN(amount)   AS min_amt,
       MAX(amount)   AS max_amt
FROM orders
GROUP BY user_id
ORDER BY user_id;
```
You should get:
| user_id | sum_amt | avg_amt | n_rows | n_known_amt | min_amt | max_amt |
|---|---|---|---|---|---|---|
| u1 | 60.00 | 30.0000… | 3 | 2 | 20.00 | 40.00 |
| u2 | 100.00 | 100.0000… | 1 | 1 | 100.00 | 100.00 |
For u1, the row with amount NULL still counts in COUNT(*) (3 rows) but not in SUM / AVG / COUNT(amount) (only the 20 and 40 matter). That single picture is how most interviews test whether you understand NULL with aggregates.
COUNT(DISTINCT) mini-example: if clicks has two rows for the same user_id (double click), COUNT(*) is 2 but COUNT(DISTINCT user_id) is 1 for that user’s bucket.
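The NULL-vs-aggregate behavior above is easy to verify yourself. This is a minimal sketch that replays the orders table through Python's built-in sqlite3 module (standard SQL semantics here match PostgreSQL for these aggregates):

```python
# Replay the orders example: how SUM / AVG / COUNT(*) / COUNT(col) treat NULL.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id TEXT, user_id TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("101", "u1", 20.0), ("102", "u1", None),
     ("103", "u1", 40.0), ("104", "u2", 100.0)],
)

rows = con.execute("""
    SELECT user_id,
           SUM(amount),            -- NULL ignored: 20 + 40
           AVG(amount),            -- (20 + 40) / 2, not / 3
           COUNT(*),               -- every row counts, NULL or not
           COUNT(amount)           -- only rows with a known amount
    FROM orders
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

print(rows)  # [('u1', 60.0, 30.0, 3, 2), ('u2', 100.0, 100.0, 1, 1)]
```

The u1 row shows the whole story at once: 3 rows exist, 2 have known amounts, and the average divides by 2.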
Conditional aggregation (CASE inside aggregates)
- Idea: Count or sum only some rows in the group without splitting into multiple queries—put the condition inside the aggregate.
- Patterns: `SUM(CASE WHEN condition THEN col ELSE 0 END)`, `COUNT(CASE WHEN … THEN 1 END)` (or `SUM(CASE WHEN … THEN 1 ELSE 0 END)`), `AVG(CASE WHEN … THEN col END)` (averages only the non-null branches).
Worked example — events per user by type
| user_id | event_type |
|---|---|
| u1 | view |
| u1 | purchase |
| u1 | view |
| u2 | view |
```sql
SELECT user_id,
       COUNT(*) AS total_events,
       SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases,
       SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) AS views
FROM events
GROUP BY user_id;
```
Result
| user_id | total_events | purchases | views |
|---|---|---|---|
| u1 | 3 | 1 | 2 |
| u2 | 1 | 0 | 1 |
This is the same “one GROUP BY, many metrics” style as `COUNT(*) FILTER (WHERE …)` in PostgreSQL—`CASE` is the portable form that works across most warehouses.
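The conditional-aggregation pattern can be checked directly; a minimal sketch running the events example through Python's built-in sqlite3:

```python
# One GROUP BY, several metrics via CASE inside the aggregates.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id TEXT, event_type TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", "view"), ("u1", "purchase"), ("u1", "view"), ("u2", "view")],
)

rows = con.execute("""
    SELECT user_id,
           COUNT(*) AS total_events,
           SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases,
           SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) AS views
    FROM events
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

print(rows)  # [('u1', 3, 1, 2), ('u2', 1, 0, 1)]
```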
GROUP BY and HAVING (how they fit together)
- `HAVING` is `WHERE` for buckets: it runs after grouping, so you can filter on `AVG(amount)`, `COUNT(*)`, etc. `WHERE` runs before grouping and only sees raw row columns—so `WHERE AVG(amount) > 50` is invalid: the average does not exist until after `GROUP BY`.
Worked example — WHERE vs HAVING
| order_id | brand_id | amount |
|---|---|---|
| 1 | A | 40 |
| 2 | A | 70 |
| 3 | A | 20 |
| 4 | B | 200 |
- `WHERE amount > 30` drops row 3 before grouping. Then `GROUP BY brand_id` with `SUM(amount)` gives A = 110 (40 + 70), B = 200.
- `HAVING SUM(amount) > 100` runs after grouping on the unfiltered table: A’s sum is 130, B’s is 200—both pass; if you needed “brands with at least 3 orders and sum > 100,” you would use `HAVING COUNT(*) >= 3 AND SUM(amount) > 100`.
Worked example — same orders as above, show HAVING output
```sql
SELECT brand_id, SUM(amount) AS total_amt, COUNT(*) AS n_orders
FROM orders
GROUP BY brand_id
HAVING SUM(amount) > 100;
```
Result
| brand_id | total_amt | n_orders |
|---|---|---|
| A | 130 | 3 |
| B | 200 | 1 |
If you add AND COUNT(*) >= 2, only A remains (B has a single order).
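The WHERE-before-grouping vs HAVING-after-grouping difference shows up clearly if you run both queries against the same brand table; a minimal sketch with Python's built-in sqlite3:

```python
# WHERE filters rows before SUM; HAVING filters whole groups after SUM.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INT, brand_id TEXT, amount INT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "A", 40), (2, "A", 70), (3, "A", 20), (4, "B", 200)],
)

# Row 3 (A, 20) is dropped before grouping, so A sums to 110.
where_first = con.execute("""
    SELECT brand_id, SUM(amount) FROM orders
    WHERE amount > 30
    GROUP BY brand_id ORDER BY brand_id
""").fetchall()
print(where_first)   # [('A', 110), ('B', 200)]

# All rows reach SUM; the group filter runs on the totals (A = 130).
having_after = con.execute("""
    SELECT brand_id, SUM(amount) FROM orders
    GROUP BY brand_id
    HAVING SUM(amount) > 100
    ORDER BY brand_id
""").fetchall()
print(having_after)  # [('A', 130), ('B', 200)]
```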
Rule of thumb: If the condition uses `SUM`/`COUNT`/`AVG`/… of the group, use HAVING. If it only uses this row’s columns, use WHERE (and put it first—it usually makes the query faster too).
If a column appears in SELECT and is not inside an aggregate, it must appear in GROUP BY (in strict SQL). Otherwise the database does not know which row’s value to show for that column inside a bucket.
Common beginner mistakes
- Putting `AVG(amount) > 50` in `WHERE` → use `HAVING` after `GROUP BY`.
- Forgetting a column in `GROUP BY` when it appears in `SELECT` without an aggregate → invalid query in strict SQL.
- Answering at the wrong grain (e.g. one row per `ad_id` when the question asked per campaign).
SQL Interview Question on Aggregation and GROUP BY
Table orders(order_id, user_id, amount) lists purchases. Return each user_id whose average amount is greater than 50 and who has at least 3 orders.
Solution Using GROUP BY and HAVING
```sql
SELECT user_id,
       AVG(amount) AS avg_amount,
       COUNT(*) AS order_cnt
FROM orders
GROUP BY user_id
HAVING COUNT(*) >= 3
   AND AVG(amount) > 50;
```
Why this works: We group all orders by user_id, compute average and count per user, then keep only groups that pass both conditions. Those conditions depend on aggregates, so they belong in HAVING.
Practice
- SQL · Topic — Aggregation problems (all companies)
- COMPANY · Meta — Meta-tagged aggregation
2. Data Filtering and WHERE Clause in SQL
Filtering Data Using the WHERE Clause in SQL
WHERE is the row filter: after FROM and joins, each row is tested once. If the condition is true, the row stays and can be grouped, counted, or shown in the result; if false, the row is dropped and never enters aggregates. Rows removed here are invisible to GROUP BY and to COUNT(*) on the remaining set—so push the filters that narrow the problem early.
Comparisons (=, <>, <, >, <=, >=)
- Compare a column to a literal or another column: `amount > 100`, `ts >= TIMESTAMP '2026-01-01'`.
- Strings compare in the database’s collation order unless you use explicit functions.
- Watch types: comparing a `DATE` to a `TIMESTAMP` may require casting so you do not accidentally exclude boundary instants.
Worked example
| id | amount | status |
|---|---|---|
| 1 | 60 | paid |
| 2 | 40 | refunded |
| 3 | 90 | paid |
WHERE amount > 50 AND status <> 'refunded' keeps rows 1 and 3 only.
Range and membership
- `BETWEEN a AND b` is inclusive on both ends. For timestamps, `>= start AND < end` (half-open) is often safer so you do not double-count midnight.
- `IN ('US', 'CA', 'UK')` is shorthand for multiple `OR`s; `NOT IN (...)` can surprise you if the list or a subquery contains `NULL`—`NOT EXISTS` is sometimes safer (see section 8).
Worked example
| user_id | country |
|---|---|
| a | US |
| b | IN |
| c | CA |
WHERE country IN ('US', 'CA') returns users a and c.
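The `NOT IN` + `NULL` surprise mentioned above is worth seeing once. A minimal sketch with Python's built-in sqlite3 (same three-valued logic as PostgreSQL):

```python
# NOT IN over a set containing NULL silently matches nothing.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_id TEXT, country TEXT)")
con.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("a", "US"), ("b", "IN"), ("c", "CA")],
)

# country <> NULL is unknown, so every row's NOT IN test is unknown -> dropped.
empty = con.execute(
    "SELECT user_id FROM users WHERE country NOT IN ('US', NULL)"
).fetchall()
print(empty)  # []

# NOT EXISTS sidesteps the trap: only the real 'US' match is excluded.
safe = con.execute("""
    SELECT user_id FROM users u
    WHERE NOT EXISTS (SELECT 1 FROM (SELECT 'US' AS c) t WHERE t.c = u.country)
    ORDER BY user_id
""").fetchall()
print(safe)  # [('b',), ('c',)]
```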
LIKE and text patterns
- `LIKE` uses `%` (any substring) and `_` (single character): `email LIKE '%@company.com'`.
- Case sensitivity depends on the database/column collation; use `ILIKE` (PostgreSQL) or `LOWER(col)` if the problem requires a case-insensitive match.
Worked example
| id | email |
|---|---|
| 1 | ada@work.com |
| 2 | bob@gmail.com |
WHERE email LIKE '%@work.com' matches ada only.
IS NULL / IS NOT NULL
- `NULL` means unknown, not “empty string” or 0. `WHERE col = NULL` is wrong in SQL; use `WHERE col IS NULL` or `IS NOT NULL`.
Worked example
| id | phone |
|---|---|
| 1 | 555-0100 |
| 2 | NULL |
WHERE phone IS NOT NULL returns row 1; WHERE phone IS NULL returns row 2.
Combining conditions
- `AND` — all must hold. `OR` — at least one holds; use parentheses when mixing with `AND` so intent is unambiguous: `WHERE (a OR b) AND c`.
Worked example
| id | region | is_active |
|---|---|---|
| 1 | EU | TRUE |
| 2 | UK | FALSE |
| 3 | US | TRUE |
- `WHERE (region = 'EU' OR region = 'UK') AND is_active = TRUE` → row 1 only (the UK row is inactive).
- Without parentheses, `AND` binds tighter than `OR`, so `WHERE region = 'EU' OR region = 'UK' AND is_active = TRUE` is read as `region = 'EU' OR (region = 'UK' AND is_active = TRUE)`—not the same as “(EU or UK) and active.” Always parenthesize `OR` groups when you mix them with `AND`.
Because WHERE runs before GROUP BY, it cannot reference per-group totals. Those belong in HAVING.
Common beginner mistakes
- Using `=` instead of `IS NULL`.
- Mixing `DATE` and `TIMESTAMP` or time zones without casting.
- Putting aggregate conditions in `WHERE` (use `HAVING` instead).
Data Engineering Interview Question on WHERE and Filtering
From events(user_id, event_type, ts), count how many purchase events happened after 2026-01-01 for each user_id.
Solution Using WHERE and GROUP BY
```sql
SELECT user_id,
       COUNT(*) AS purchase_cnt
FROM events
WHERE event_type = 'purchase'
  AND ts > TIMESTAMP '2026-01-01 00:00:00'
GROUP BY user_id;
```
Why this works: WHERE keeps only purchase rows after the cutoff; then we count per user.
Practice
- SQL · Topic — Filtering & WHERE clause
- COMPANY · Meta — Meta-tagged filtering
3. Joins and Data Deduplication Techniques
Joins and Deduplication in SQL for Data Engineering
A join answers: “For each row in table A, which rows in table B belong with it?” You declare that with ON a.key = b.key. Think of it as enriching events (orders, clicks) with reference data (users, ads).
Fact vs dimension and grain (how DEs talk about joins)
- Fact table: Many rows, often one row per event or transaction (order line, click, impression)—the thing you `SUM` or `COUNT`.
- Dimension table: Fewer rows, one row per entity you describe (user, ad, product)—the thing you `JOIN` to get names, categories, brand ids.
- Grain: “One result row per what?” (per order, per user per day, per campaign). Wrong grain after a join usually means fan-out (see below): you aggregated at order level but summed a user attribute.
Worked example: order_lines (fact: order_id, sku, qty) joins products (dimension: sku, list_price, name). SUM(qty) by sku is correct at line grain. If products accidentally had duplicate rows per sku, one join could inflate SUM(list_price)—fix the dimension to one row per sku before summing price.
Inner join (JOIN / INNER JOIN)
- Result: Only rows where the `ON` condition succeeds on both sides.
- Use when: You only care about “matched” pairs—e.g. clicks that have a known ad row.
- `NULL` keys: In most join semantics, `NULL = NULL` is not true, so inner joins drop rows with a null join key unless you `COALESCE` or filter explicitly.
Worked example
clicks
| click_id | user_id | ad_id |
|---|---|---|
| c1 | u1 | 10 |
| c2 | u1 | 99 |
ads
| ad_id | brand_id |
|---|---|
| 10 | 7 |
| 11 | 7 |
FROM clicks c JOIN ads a ON c.ad_id = a.ad_id returns one row: (c1, u1, 10, 7). Click c2 has ad_id = 99 with no ad row, so it disappears from an inner join.
Left outer join (LEFT JOIN)
- Result: Every row from the left table appears once per matching right row; if there is no match, right-side columns are `NULL`.
- Use when: You must keep all left entities even if they have zero events—e.g. all users including those with no orders.
Worked example
users
| id | name |
|---|---|
| 1 | Ann |
| 2 | Bo |
orders
| id | user_id | amount |
|---|---|---|
| o1 | 1 | 50 |
FROM users u LEFT JOIN orders o ON u.id = o.user_id yields two rows: Ann + o1, and Bo + NULL (no order—right side null-padded).
Right outer join (RIGHT JOIN)
- Result: Mirror of `LEFT JOIN`: keep all rows from the right table; left columns are `NULL` when unmatched.
- In practice: Rarely needed—swap the tables and use `LEFT JOIN` so “keep all rows from the table I care about” always reads the same way.
Worked example: Same users and orders as in the left join. FROM orders o RIGHT JOIN users u ON u.id = o.user_id returns the same two logical rows as FROM users u LEFT JOIN orders o … (Ann + o1, Bo + nulls)—only the preferred style differs.
Full outer join (FULL OUTER JOIN)
- Result: Keeps rows that match plus unmatched rows from either side (the missing side filled with `NULL`).
- Use when: You need a symmetric “everything from A and B” merge; less common in basic interview questions but shows up in reconciliation-style problems.
Worked example: If legacy has id 1,2 and new has id 2,3, a full outer join on id returns three logical keys 1, 2, 3 with NULLs where one side is missing.
Cross join (CROSS JOIN)
- Result: Cartesian product: every row of A paired with every row of B.
- Use when: You intentionally need all combinations (e.g. users × dates grid)—easy to explode row count; use sparingly.
Worked example: stores has 2 rows, quarters has 4 rows → CROSS JOIN produces 8 rows (every store × every quarter).
Semi-join vs anti-join (names, not syntax)
- Semi-join: “Keep rows from A where some match exists in B”—typically `WHERE EXISTS (SELECT 1 FROM B WHERE …)` or `WHERE key IN (SELECT key FROM B)`. You do not add B’s columns to the result; you only filter A.
- Anti-join: “Keep rows from A with no match in B”—`LEFT JOIN B … WHERE B.key IS NULL` or `WHERE NOT EXISTS (SELECT 1 FROM B WHERE …)`. Same idea: filter A, not a full merge for reporting.
Worked example
customers id: 1, 2, 3 — table vip customer_id: 2 only.
- Semi-join: `SELECT id FROM customers c WHERE EXISTS (SELECT 1 FROM vip v WHERE v.customer_id = c.id)` → 2.
- Anti-join: `SELECT c.id FROM customers c LEFT JOIN vip v ON v.customer_id = c.id WHERE v.customer_id IS NULL` → 1 and 3 (never appeared in `vip`).
Fan-out (row multiplication)
- If the join key is not unique on one side (one user, many orders), the “one” side’s columns repeat on every matching row. That is correct for order-grain analysis.
- Bug pattern: Summing a user-level numeric column after the join counts that user’s balance multiple times. Fixes: aggregate first; `SUM(DISTINCT …)` rarely fixes the business logic—prefer `COUNT(DISTINCT user_id)` or a subquery that reduces the dimension to one row per key before joining.
Worked example
users: (id=1, credit_balance=500)
orders: two rows for user_id=1.
After users JOIN orders, credit_balance appears twice (500, 500). SUM(u.credit_balance) would add 1000, which is wrong for “total credit in the system.” Fix: SUM order amounts, or MAX(u.credit_balance) per user if you only need the balance once.
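The fan-out bug above is easy to reproduce; a minimal sketch with Python's built-in sqlite3, giving the user two orders (the second order amount is an assumption for illustration):

```python
# Fan-out: the one-side row repeats per matching order, double-counting its balance.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INT, credit_balance INT)")
con.execute("INSERT INTO users VALUES (1, 500)")
con.execute("CREATE TABLE orders (id TEXT, user_id INT, amount INT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("o1", 1, 50), ("o2", 1, 70)],
)

# Bug: the join repeats credit_balance once per order row.
wrong = con.execute("""
    SELECT SUM(u.credit_balance)
    FROM users u JOIN orders o ON o.user_id = u.id
""").fetchone()[0]
print(wrong)  # 1000 -- double-counted

# Fix: collapse back to one balance per user before summing.
right = con.execute("""
    SELECT SUM(bal) FROM (
        SELECT u.id, MAX(u.credit_balance) AS bal
        FROM users u JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    )
""").fetchone()[0]
print(right)  # 500
```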
Deduplication
- `SELECT DISTINCT col` (or `DISTINCT` on several columns) removes duplicate rows—fine when any duplicate row is interchangeable.
- Latest row per id: Prefer `ROW_NUMBER() OVER (PARTITION BY id ORDER BY time DESC) AS rn` in a CTE, then `WHERE rn = 1`, so you control ties and ordering. `DISTINCT ON` (PostgreSQL) is a compact alternative when you know the dialect.
Worked example — DISTINCT
| visit_id | user_id |
|---|---|
| v1 | u1 |
| v2 | u1 |
| v3 | u2 |
SELECT DISTINCT user_id FROM visits → u1, u2 (one row per user, duplicates collapsed).
Worked example — latest event per user
| user_id | ts | page |
|---|---|---|
| u1 | t2 | /home |
| u1 | t1 | /cart |
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) gives 1 for (u1,t2,/home) and 2 for (u1,t1,/cart); filter rn = 1 keeps only the latest row.
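The ROW_NUMBER dedup pattern runs as-is in Python's built-in sqlite3 (window functions need SQLite ≥ 3.25, bundled with recent Python builds); the second user's row is an assumption added for a second partition:

```python
# Latest row per user via ROW_NUMBER in a CTE, then filter rn = 1.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id TEXT, ts INT, page TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("u1", 2, "/home"), ("u1", 1, "/cart"), ("u2", 5, "/search")],
)

latest = con.execute("""
    WITH ranked AS (
        SELECT user_id, ts, page,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
        FROM events
    )
    SELECT user_id, page FROM ranked WHERE rn = 1 ORDER BY user_id
""").fetchall()
print(latest)  # [('u1', '/home'), ('u2', '/search')]
```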
Worked example — DISTINCT ON (PostgreSQL only)
Same table as above. This picks one row per user_id—the first row after sorting (here, the latest ts):
```sql
SELECT DISTINCT ON (user_id) user_id, ts, page
FROM events
ORDER BY user_id, ts DESC;
```
DISTINCT ON (user_id) requires ORDER BY to start with the same columns; the first row per user_id in that order is kept. For tie-breaking and clarity, ROW_NUMBER() in a CTE is often easier in interviews unless the prompt says PostgreSQL.
Common beginner mistakes
- Joining on the wrong key or ignoring `NULL` keys (they do not match in an inner join).
- Assuming one row per user after a join when the join multiplied rows.
- Using `COUNT(*)` when the question wanted distinct people—use `COUNT(DISTINCT user_id)`.
Joins-Based Interview Question on Distinct Users and Brands
clicks(user_id, ad_id) and ads(ad_id, brand_id). How many distinct users clicked any ad for brand_id = 7?
Solution Using INNER JOIN and COUNT DISTINCT
```sql
SELECT COUNT(DISTINCT c.user_id) AS distinct_users
FROM clicks c
JOIN ads a ON a.ad_id = c.ad_id
WHERE a.brand_id = 7;
```
Why this works: Join finds clicks whose ad belongs to brand 7; COUNT(DISTINCT user_id) counts people, not click rows.
Practice
- SQL · Topic — Joins & deduplication
- COMPANY · Meta — Meta-tagged joins
4. Window Functions and Ranking Methods in SQL
Window Functions and Ranking in SQL for Data Analysis
GROUP BY shrinks the table: many rows → one row per bucket. Window functions do the opposite idea: every row stays, but you add a new column computed from a window of rows around it. Syntax: some_function(...) OVER (...).
PARTITION BY (optional but common)
- Meaning: “Start a new window for each distinct value of these columns”—like running a separate mini-table per `store_id` or `user_id`.
- Without it: The window is usually the whole result set (after `WHERE`), which is rare for rankings but valid for some global running totals.
Worked example (one store, running sum over days)
| store_id | day | revenue |
|---|---|---|
| S1 | 2026-01-01 | 10 |
| S1 | 2026-01-02 | 30 |
SUM(revenue) OVER (PARTITION BY store_id ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) adds 10 on the first row and 40 on the second—cumulative revenue for that store.
ORDER BY inside OVER
- Meaning: Defines which row is “first” inside each partition—required for `RANK`, `ROW_NUMBER`, `LAG`/`LEAD`, and for ordered frames (running sums).
- Direction: `DESC` vs `ASC` changes “who is rank 1” (e.g. highest revenue first).
Worked example
| store_id | day | revenue |
|---|---|---|
| S1 | D1 | 200 |
| S1 | D2 | 150 |
RANK() OVER (PARTITION BY store_id ORDER BY revenue DESC) gives 1 on D1 and 2 on D2—highest revenue is “first” in the window because of DESC.
ROW_NUMBER()
- Assigns 1, 2, 3, … with no ties—the database breaks ties arbitrarily unless you add more `ORDER BY` keys (e.g. `ORDER BY revenue DESC, day ASC`).
- Use when: You need exactly one row per partition (pick-one tie-break) or a unique sequence for deduplication.
Worked example (sales — one row per store per day)
| store_id | day | revenue |
|---|---|---|
| S1 | D1 | 200 |
| S1 | D2 | 150 |
| S1 | D3 | 150 |
ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY revenue DESC, day ASC) assigns 1 to D1, 2 to D2, 3 to D3—the extra day ASC key makes the tie-break deterministic (earlier day wins).
RANK()
- Ties get the same rank; the next rank skips after a tie: 1, 2, 2, 4.
- Use when: “Top tier” should include everyone tied for first place, and skipped ranks are OK.
Worked example: Scores 100, 90, 90, 80 → RANK() → 1, 2, 2, 4.
DENSE_RANK()
- Ties get the same rank, but no gaps: 1, 2, 2, 3.
- Use when: You care about “how many distinct tiers,” not Olympic-style gaps.
Worked example: Same scores → DENSE_RANK() → 1, 2, 2, 3.
NTILE(n)
- Splits each partition’s ordered rows into `n` roughly equal-sized buckets (quartiles if `n = 4`, deciles if `n = 10`).
- Use when: “Top 10% of users by spend” style bucketing (check whether ties straddle bucket boundaries—business rules vary).
Worked example
| user_id | spend |
|---|---|
| u1 | 10 |
| u2 | 20 |
| u3 | 30 |
| u4 | 40 |
NTILE(2) OVER (ORDER BY spend ASC) assigns bucket 1 to u1 and u2, bucket 2 to u3 and u4 (two rows per bucket when n = 2 and n divides evenly).
Window aggregates and frames (ROWS BETWEEN …)
- `SUM(col) OVER (PARTITION BY store ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)` — 7-day trailing sum per store, one output row per day (unlike collapsing the whole table with `GROUP BY`).
- `AVG`, `MIN`, `MAX` work similarly over a frame.
Worked example (trailing 2-day window, illustration)
| store_id | day | rev |
|---|---|---|
| S1 | 1 | 10 |
| S1 | 2 | 20 |
| S1 | 3 | 5 |
With ROWS BETWEEN 1 PRECEDING AND CURRENT ROW, the 2-day trailing sums on rows 1–3 are 10, 30, 25.
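The trailing-frame arithmetic above can be replayed in Python's built-in sqlite3 (window functions need SQLite ≥ 3.25):

```python
# 2-row trailing sum per store: ROWS BETWEEN 1 PRECEDING AND CURRENT ROW.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (store_id TEXT, day INT, rev INT)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("S1", 1, 10), ("S1", 2, 20), ("S1", 3, 5)],
)

rows = con.execute("""
    SELECT day,
           SUM(rev) OVER (PARTITION BY store_id ORDER BY day
                          ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS trailing_2d
    FROM sales ORDER BY day
""").fetchall()
print(rows)  # [(1, 10), (2, 30), (3, 25)]
```

Every input row stays in the output; only the new column summarizes the frame around it.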
LAG / LEAD
- `LAG(col, 1)` = previous row’s value in partition order; `LEAD` = next row’s—used for day-over-day deltas, session gaps, and streak logic.
Worked example
| store_id | day | revenue |
|---|---|---|
| S1 | 1 | 100 |
| S1 | 2 | 130 |
LAG(revenue) OVER (PARTITION BY store_id ORDER BY day) on the second row is 100; revenue - LAG(revenue) gives +30 day-over-day.
LEAD — look ahead
Same partition order; LEAD(revenue, 1) on the first row is 130 (next day’s revenue). Use it for “how much does tomorrow differ?” or to pair each row with the next event time when building gaps.
FIRST_VALUE / LAST_VALUE
- `FIRST_VALUE(col)` with `ORDER BY` in `OVER` is the value in the first row of the window (often the first row of the partition if the frame defaults to “start of partition through current row”).
- `LAST_VALUE(col)` is the value in the last row of the frame—with the default frame, that is often the current row, not the last row of the whole partition. To get the true last value of the partition, you may need an explicit frame such as `ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` (dialect-dependent), or use `ORDER BY … DESC` with `FIRST_VALUE` instead of fighting `LAST_VALUE` defaults.
Worked example
| emp_id | dept | salary |
|---|---|---|
| 1 | A | 70 |
| 2 | A | 90 |
| 3 | A | 80 |
With the usual frame “from start of partition through current row” and ORDER BY salary DESC, the first row in that ordering is always the highest salary (90), so FIRST_VALUE(salary) OVER (PARTITION BY dept ORDER BY salary DESC) repeats 90 on each row in dept A—a quick way to annotate every line with the dept max without a grouped subquery. If you need the minimum instead, use FIRST_VALUE with ORDER BY salary ASC in the same pattern.
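The LAST_VALUE default-frame gotcha is worth one concrete run. A minimal sketch with Python's built-in sqlite3 (window functions need SQLite ≥ 3.25), using the emp table above:

```python
# LAST_VALUE with the default frame vs an explicit UNBOUNDED FOLLOWING frame.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (emp_id INT, dept TEXT, salary INT)")
con.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(1, "A", 70), (2, "A", 90), (3, "A", 80)],
)

rows = con.execute("""
    SELECT emp_id,
           -- default frame ends at the current row, so this is just the row's own salary
           LAST_VALUE(salary) OVER (PARTITION BY dept ORDER BY salary) AS lv_default,
           -- explicit frame reaches the true end of the partition
           LAST_VALUE(salary) OVER (PARTITION BY dept ORDER BY salary
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS lv_full
    FROM emp ORDER BY salary
""").fetchall()
print(rows)  # [(1, 70, 90), (3, 80, 90), (2, 90, 90)]
```

The default-frame column never reaches past the current row; only the explicit frame returns the partition-wide 90 everywhere.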
Filtering window results: You cannot put WHERE RANK() = 1 in the same SELECT that defines the rank in most databases. Compute the window in an inner query / CTE, then WHERE rnk = 1 on the outside—exactly like the sample solution below.
Common beginner mistakes
- Omitting `ORDER BY` inside `OVER` when order defines the answer.
- Using `GROUP BY` when the question still needs one output row per input row.
- Using `ROW_NUMBER()` when the problem says tied rows should share the same rank—use `RANK` or `DENSE_RANK`.
SQL Interview Question on Window Functions and Ranking
Table sales(store_id, day, revenue) has one row per store per day. For each store, find the single best day by highest revenue (if two days tie for first, returning both is fine).
Solution Using RANK and a CTE
```sql
WITH ranked AS (
    SELECT store_id,
           day,
           revenue,
           RANK() OVER (PARTITION BY store_id ORDER BY revenue DESC) AS rnk
    FROM sales
)
SELECT store_id, day, revenue
FROM ranked
WHERE rnk = 1;
```
Why this works: RANK() orders days inside each store by revenue; rnk = 1 keeps all top days, including ties.
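A minimal sketch of the solution in Python's built-in sqlite3 (window functions need SQLite ≥ 3.25). The sample data is an assumption chosen so S1 has a revenue tie, showing that RANK keeps both top days:

```python
# Best day per store; RANK keeps every row tied for first place.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (store_id TEXT, day TEXT, revenue INT)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("S1", "D1", 200), ("S1", "D2", 200), ("S1", "D3", 150), ("S2", "D1", 90)],
)

best = con.execute("""
    WITH ranked AS (
        SELECT store_id, day, revenue,
               RANK() OVER (PARTITION BY store_id ORDER BY revenue DESC) AS rnk
        FROM sales
    )
    SELECT store_id, day FROM ranked WHERE rnk = 1
    ORDER BY store_id, day
""").fetchall()
print(best)  # [('S1', 'D1'), ('S1', 'D2'), ('S2', 'D1')]
```

Note the rank is computed in the CTE and filtered outside it, since `WHERE rnk = 1` cannot reference a window function defined in the same SELECT.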
Practice
- SQL · Topic — Window functions & ranking
- COMPANY · Meta — Meta-tagged window functions
- SQL · Deep dives — SQL deep dives library
5. Subqueries and Common Table Expressions (CTEs)
Subqueries and CTEs in SQL for Data Engineering
Many problems are multi-step: first compute something intermediate (averages per department, revenue per day, ranked rows), then join or filter using that result. Two ways to package that:
- Subquery: a `SELECT` inside parentheses in `FROM`, `WHERE`, or `SELECT`.
- CTE (`WITH`): the same logic with a name at the top: `WITH step1 AS (...) SELECT ... FROM step1 ...`.
Subquery in FROM (“inline view”)
- Pattern: `FROM (SELECT …) AS sub` — treat the inner query like a temporary table you join or filter.
- Use when: You need a shaped intermediate dataset (deduped ranks, daily sums) before the next step.
Worked example
orders: (user_id, amount) → inner query SELECT user_id, SUM(amount) AS s FROM orders GROUP BY user_id yields (u1, 90), (u2, 40). Outer query FROM (…) AS t JOIN profiles p ON p.id = t.user_id attaches profile columns to each user’s total.
Subquery in WHERE
- `WHERE col IN (SELECT …)` — the row is kept if `col` matches any value from the subquery (watch for `NULL` in the subquery with `NOT IN`).
- `WHERE EXISTS (SELECT 1 FROM … WHERE … correlated …)` — true if any matching row exists; often clearer than `IN` for existence checks.
- Scalar subquery: `WHERE amount > (SELECT AVG(amount) FROM …)` — the inner query must return at most one row and one column.
Worked example
orders: (id, dept_id, amount) — WHERE amount > (SELECT AVG(amount) FROM orders o2 WHERE o2.dept_id = orders.dept_id) keeps rows above that row’s department average (correlated scalar subquery).
Worked example — IN (membership)
products sku: A, B, C — discontinued sku: B. SELECT sku FROM products WHERE sku IN (SELECT sku FROM discontinued) → B only.
Worked example — EXISTS (existence, no need to return columns from inner query)
Same tables. SELECT sku FROM products p WHERE EXISTS (SELECT 1 FROM discontinued d WHERE d.sku = p.sku) → B—same result as IN here; EXISTS often reads better when the inner query is large or correlated.
Subquery in SELECT list
- `SELECT id, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_cnt` — runs per outer row; fine in interviews; on big data it is often rewritten as a join for performance.
Worked example
users: Ann (id=1), Bo (id=2). orders: two rows for Ann, none for Bo. SELECT u.id, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) returns (1, 2) and (2, 0).
Correlated subqueries
- The inner query references the outer row (e.g. `WHERE o.dept_id = e.dept_id` with `e` from outside). Think: “for each outer row, evaluate this.”
Worked example
employees (id, dept_id, salary): (1, A, 80k), (2, A, 60k), (3, B, 90k). For row 1, AVG(salary) WHERE dept_id = 'A' is 70k; WHERE salary > that avg keeps row 1 only among dept A.
CTEs (WITH)
- Readability: Steps read top to bottom like a pipeline: clean → aggregate → join.
- Reuse: The same CTE name can appear multiple times in the final query; subqueries in `FROM` must be duplicated or wrapped.
- Chaining: `WITH a AS (…), b AS (SELECT … FROM a …) SELECT … FROM b` — `b` can use `a`.
- Recursive CTEs (`WITH RECURSIVE`): for trees/org charts; specialty syntax—learn it when you hit that problem class.
Worked example
```sql
WITH per_day AS (
    SELECT day, SUM(amount) AS revenue
    FROM orders
    GROUP BY day
)
SELECT day, revenue
FROM per_day
WHERE revenue > 100;
```
Step 1 builds a daily revenue table; step 2 filters it—same logic as an inline subquery, but easier to read.
Subquery vs CTE (quick compare):
| | Subquery | CTE (WITH) |
|---|---|---|
| Readability | Fine for one small nest; deep nesting gets hard to read | Often easier to read—steps read top to bottom |
| Reuse | Repeat the whole nest if you need it twice | Same CTE name can be referenced multiple times in one query |
| Style | “Inline” | “Named pipeline” |
Common beginner mistakes
- Expecting a subquery to return one value when it returns many rows—use `IN`, `EXISTS`, or a join.
- Nesting many subqueries when `WITH` would make the steps obvious.
SQL Interview Question on Subqueries and Department Averages
From employees(id, dept_id, salary), list employees who earn more than their department’s average salary.
Solution Using a CTE and JOIN
```sql
WITH dept_avg AS (
    SELECT dept_id, AVG(salary) AS avg_sal
    FROM employees
    GROUP BY dept_id
)
SELECT e.id, e.dept_id, e.salary
FROM employees e
JOIN dept_avg d ON d.dept_id = e.dept_id
WHERE e.salary > d.avg_sal;
```
Why this works: dept_avg is a small table of one row per department; we join each employee to their department’s average and filter.
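A minimal sketch of the same CTE-plus-join solution in Python's built-in sqlite3, reusing the employees data from the correlated-subquery example above (salaries scaled to 80/60/90):

```python
# Employees above their department's average, via a CTE joined back in.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (id INT, dept_id TEXT, salary INT)")
con.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "A", 80), (2, "A", 60), (3, "B", 90)],
)

above = con.execute("""
    WITH dept_avg AS (
        SELECT dept_id, AVG(salary) AS avg_sal
        FROM employees GROUP BY dept_id
    )
    SELECT e.id FROM employees e
    JOIN dept_avg d ON d.dept_id = e.dept_id
    WHERE e.salary > d.avg_sal
    ORDER BY e.id
""").fetchall()
print(above)  # [(1,)] -- dept A avg is 70; dept B's only row equals its own avg
```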
Practice
- SQL · Topic — Subqueries & CTEs
- COMPANY · Meta — Meta-tagged CTEs
6. Date Handling and Time-Series Data Concepts
Date and Time-Series Handling in SQL
Time-series questions almost always mean: “Put each event in a time bucket (day, hour, week), then aggregate.”
Bucketing with date_trunc (PostgreSQL-style)
- `date_trunc('day', ts)` snaps every timestamp in that calendar day to the same instant (midnight at the start of the day). Common grains: `hour`, `week`, `month`.
- After truncating, `GROUP BY` the truncated value (or cast it to `date`) and apply `SUM`, `COUNT`, etc.
Worked example
| created_at (UTC) | amount |
|---|---|
| 2026-01-01 22:00 | 10 |
| 2026-01-02 03:00 | 20 |
The two rows fall on different calendar days in UTC, so GROUP BY date_trunc('day', created_at)::date gives two buckets: 2026-01-01 → 10, 2026-01-02 → 20.
INTERVAL and relative windows
- `CURRENT_DATE - INTERVAL '7 days'` or `NOW() - INTERVAL '1 hour'` expresses sliding windows without hard-coding calendar dates.
- Pair with `WHERE ts >= …` (and usually `< end` for half-open ranges).
Worked example: “Last 24 hours” from NOW(): WHERE event_ts >= NOW() - INTERVAL '24 hours'.
EXTRACT / date_part
- Pull hour of day, day of week (`dow`), month, etc. for “volume by hour” or seasonality slices: `EXTRACT(HOUR FROM ts)`.
Worked example
| ts |
|---|
| 2026-01-01 08:30 |
| 2026-01-01 09:15 |
GROUP BY EXTRACT(HOUR FROM ts) groups both into hours 8 and 9.
Difference between timestamps
- `end_ts - start_ts` (PostgreSQL `interval`) or `DATEDIFF`-style functions in other engines—useful for durations and “time between events.”
Worked example
| ordered_at | shipped_at |
|---|---|
| 2026-01-01 10:00 | 2026-01-01 16:00 |
shipped_at - ordered_at is a 6-hour interval (cast to minutes/seconds if the question asks for a number).
Rolling metrics (window frames)
- `SUM(amount) OVER (PARTITION BY store_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)` = 7-day trailing sum per store, one row per day still present.
- Change `ROWS` to `RANGE` only when you understand your SQL dialect’s semantics for gaps in dates.
Worked example (7-day trailing sum, one row per day stays)
| store_id | day | amount |
|---|---|---|
| S1 | 2026-01-01 | 10 |
| S1 | 2026-01-02 | 5 |
| S1 | 2026-01-03 | 20 |
| S1 | 2026-01-04 | 0 |
| S1 | 2026-01-05 | 15 |
| S1 | 2026-01-06 | 10 |
| S1 | 2026-01-07 | 5 |
SUM(amount) OVER (PARTITION BY store_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) on 2026-01-07 is 65 (sum of all seven days)—same pattern as section 4’s shorter 2-day illustration, stretched to a week of daily facts.
Boundaries: half-open vs BETWEEN
- Prefer `ts >= start_ts AND ts < end_ts` so an event exactly on `end_ts` is not counted twice across adjacent windows.
- `BETWEEN a AND b` includes both endpoints—fine for inclusive business dates; easy to off-by-one with timestamps.
Worked example: “All of January 2026” → `WHERE ts >= TIMESTAMP '2026-01-01' AND ts < TIMESTAMP '2026-02-01'` (midnight on Feb 1 is excluded).
Time zones
- Store and compare in UTC internally when possible; convert to local before `date_trunc('day', …)` if “calendar day” means a specific region.
Worked example: Event at 2026-01-02 02:00 UTC might be Jan 1 evening in New York—bucket by local day if the metric is “sales per US calendar day.”
Gaps, sessions, and neighbors
- Sort by time, then `LAG(ts)` / `LEAD(ts)` to compute gaps between consecutive events—foundation for sessionization and streak problems.
Worked example
| user_id | ts |
|---|---|
| u1 | t0 |
| u1 | t0 + 5 min |
| u1 | t0 + 2 hours |
Gaps are 5 minutes then 115 minutes; a 30-minute session timeout would break a new session after the second event.
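If it helps to see the gap logic outside SQL, here is a minimal Python sketch of the same 30-minute timeout rule (the `session_count` helper is hypothetical, invented for this example):

```python
from datetime import datetime, timedelta

def session_count(timestamps, timeout_minutes=30):
    """Count sessions in one user's time-sorted timestamps:
    a gap longer than the timeout starts a new session."""
    if not timestamps:
        return 0
    timeout = timedelta(minutes=timeout_minutes)
    sessions = 1
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > timeout:  # LAG-style gap check
            sessions += 1
    return sessions

t0 = datetime(2026, 1, 1, 12, 0)
events = [t0, t0 + timedelta(minutes=5), t0 + timedelta(hours=2)]
print(session_count(events))  # 2: the 115-minute gap opens a new session
```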
Common beginner mistakes
- Mixing `DATE` and `TIMESTAMP` without casting (e.g. `::date`).
- Using `BETWEEN` on timestamps when a half-open range is safer.
- Forgetting UTC vs local when filtering “today.”
SQL Interview Question on Dates and Daily Totals
orders(order_id, created_at, amount) in UTC. Return total amount per calendar day for the last 7 days (relative to CURRENT_DATE).
Solution Using date_trunc and GROUP BY
```sql
SELECT date_trunc('day', created_at)::date AS order_day,
       SUM(amount) AS daily_total
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY 1
ORDER BY 1;
```
Why this works: We filter to the rolling window, truncate each created_at to midnight, and sum per day.
Practice
- SQL · Topic — Date functions & time buckets
- COMPANY · Meta — dates Meta-tagged date functions
7. Handling NULL Values and Safe Calculations
NULL Handling and Safe Calculations in SQL
NULL means “we don’t know the value”—not zero, not “false,” not an empty string.
In WHERE and expressions
- `WHERE col = NULL` is invalid / wrong; use `IS NULL` or `IS NOT NULL`.
- `COALESCE(a, b)` returns the first non-null argument—use when the business says “treat unknown as default X.”
Worked example
| id | discount_pct |
|---|---|
| 1 | NULL |
| 2 | 10 |
SELECT COALESCE(discount_pct, 0) → 0 and 10. WHERE discount_pct IS NULL returns row 1 only.
How aggregates treat NULL (per bucket)
- `SUM` / `AVG` / `MIN` / `MAX`: `NULL` inputs are skipped; if there is nothing left to aggregate, the result is often `NULL`.
- `COUNT(*)`: counts rows, regardless of nulls in individual columns.
- `COUNT(col)`: counts rows where `col` is not `NULL`.
- `COUNT(DISTINCT col)`: counts distinct non-null values.
Worked example (one group, three rows)
| amount |
|---|
| 10 |
| NULL |
| 30 |
SUM = 40, AVG = 20, COUNT(*) = 3, COUNT(amount) = 2.
Percentages and rates
- Write the definition explicitly: numerator = rows (or sum) matching success; denominator = all rows in scope (or a filtered population). “Completion rate” changes if the denominator is “all tasks” vs “tasks that started.”
- `COUNT(*) FILTER (WHERE condition)` (PostgreSQL) builds numerators/denominators in one grouped query without subqueries.
- `SUM(CASE WHEN condition THEN 1 ELSE 0 END) * 1.0 / COUNT(*)` is the portable analog.
Worked example
| task_id | status |
|---|---|
| 1 | done |
| 2 | open |
| 3 | done |
Done fraction = 2 / 3, i.e. about 0.667. Query: COUNT(*) FILTER (WHERE status = 'done')::numeric / NULLIF(COUNT(*), 0) → 0.666…
Divide-by-zero
- If the denominator can be 0, use `NULLIF(denominator, 0)` so the division yields `NULL` instead of an error—then handle in the app or outer query.
Worked example
| player_id | hits | at_bats |
|---|---|---|
| 1 | 0 | 0 |
hits::numeric / NULLIF(at_bats, 0) → NULL, not a runtime error.
Empty input
- No rows after `WHERE`: aggregates like `COUNT(*)` → 0, `SUM` → `NULL` (typical)—state assumptions out loud in an interview.
Worked example: SELECT SUM(amount) FROM orders WHERE FALSE — no rows match → SUM is NULL in PostgreSQL; COUNT(*) would be 0.
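The three NULL-flavored edge cases above (rate definition, zero denominator, empty input) can be mirrored in plain Python—`done_fraction` is a hypothetical helper whose `None` plays the role of SQL’s `NULL`:

```python
def done_fraction(statuses):
    """Fraction of tasks marked 'done'; None when there are no tasks,
    mirroring NULLIF(COUNT(*), 0) guarding the division."""
    total = len(statuses)
    if total == 0:
        return None               # empty input: no meaningful rate
    done = sum(1 for s in statuses if s == "done")
    return done / total

print(done_fraction(["done", "open", "done"]))  # about 0.667
print(done_fraction([]))                        # None, not a crash
```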
Common beginner mistakes
- Treating `NULL` like 0 in business logic.
- Using the wrong denominator for a rate.
- Dividing without guarding zero with `NULLIF`.
SQL Interview Question on NULL-Safe Rates and Fractions
tasks(task_id, status) where status is 'done' or 'open'. What fraction of tasks are done?
Solution Using FILTER and NULLIF
```sql
SELECT COUNT(*) FILTER (WHERE status = 'done')::numeric
       / NULLIF(COUNT(*), 0) AS done_fraction
FROM tasks;
```
Why this works: Numerator = done rows; denominator = all tasks; NULLIF prevents divide-by-zero.
Practice
- SQL · Topic — NULL handling & safe rates
- COMPANY · Meta — hub Meta SQL & Python (includes null-edge drills)
8. Set Operations and Data Comparison Techniques
Set Operations in SQL (UNION, INTERSECT, EXCEPT)
Set problems sound like: “users in both A and B,” “customers who bought X but not Y,” “combine two lists of ids.” You are doing intersection, difference, or union on keys (usually user_id).
Set operators require both branches to return the same number of columns with compatible types.
INTERSECT
- Result: only rows that appear in both `SELECT`s (duplicate handling depends on dialect; often distinct rows).
- Interview mapping: “users who did both action A and action B” when each branch returns the same key column.
Worked example
mobile: user_ids u1, u2 — web: u2, u3 — INTERSECT → u2 only.
EXCEPT (some engines: MINUS)
- Result: Rows in the first query not present in the second.
- Interview mapping: “Signed up but never purchased,” “In feed A but not in feed B.”
Worked example
signups: u1, u2 — buyers: u2 — EXCEPT → u1 (“signed up, never bought” if buyers is “ever purchased”).
UNION
- Result: stack the two result sets and deduplicate rows (can sort + dedupe—often more expensive than `UNION ALL`).
Worked example: SELECT id FROM a yields 1,1,2; SELECT id FROM b yields 2,3 — UNION → 1, 2, 3 (unique).
UNION ALL
- Result: Stack results keeping all duplicates—preferred when you know duplicates are impossible or when you want repeated rows (e.g. concatenating event streams).
Worked example: Same as above → UNION ALL → 1,1,2,2,3 (five rows).
Join and EXISTS equivalents
- Intersection on keys: `INNER JOIN` on `user_id` from two deduped subqueries—or `WHERE EXISTS` for semi-join style.
- Difference (A not B): anti-join: `LEFT JOIN B ON … WHERE B.key IS NULL`.
- `NOT EXISTS (SELECT 1 FROM B WHERE …)` is often safer than `NOT IN` when `B` can produce `NULL` keys.
Worked example (semi-join with IN)
customers 1, 2, 3 — vip has customer_id 2 only. SELECT id FROM customers WHERE id IN (SELECT customer_id FROM vip) → 2.
Worked example (anti-join)
users 1,2 — buyers only user 2. FROM users u LEFT JOIN buyers b ON u.id = b.user_id WHERE b.user_id IS NULL → user 1 (not in buyers).
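Python’s set operators make the same key logic easy to eyeball—the values below are taken from the worked examples above:

```python
# Keys from the worked examples above
mobile = {"u1", "u2"}
web = {"u2", "u3"}
signups = {"u1", "u2"}
buyers = {"u2"}

both = mobile & web              # INTERSECT -> {'u2'}
never_bought = signups - buyers  # EXCEPT    -> {'u1'}
all_users = mobile | web         # UNION     -> {'u1', 'u2', 'u3'}

print(both, never_bought, sorted(all_users))
```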
GROUP BY + HAVING for “both” conditions
- `HAVING COUNT(DISTINCT CASE WHEN … THEN tag END) = 2` (or two boolean conditions) can express “user did both activities” in one fact table without set operators—useful when set SQL is awkward or slow.
Worked example
purchases
| user_id | product_code |
|---|---|
| u1 | BOOK |
| u1 | PEN |
| u2 | BOOK |
GROUP BY user_id HAVING MAX(CASE WHEN product_code = 'BOOK' THEN 1 ELSE 0 END) = 1 AND MAX(CASE WHEN product_code = 'PEN' THEN 1 ELSE 0 END) = 1 → u1 only (has both).
Common beginner mistakes
- Using `OR` when the problem needs both conditions on the same user (not “either event”).
- `NOT IN (subquery)` when the subquery can return `NULL`—use `NOT EXISTS` instead.
SQL Interview Question on INTERSECT and Multiple Purchases
purchases(user_id, product_code). Find user_ids who bought BOOK and PEN (possibly on different rows).
Solution Using INTERSECT
```sql
SELECT user_id
FROM purchases
WHERE product_code = 'BOOK'
INTERSECT
SELECT user_id
FROM purchases
WHERE product_code = 'PEN';
```
Why this works: First query = set of users with BOOK; second = set with PEN; intersect = users in both sets.
Practice
- SQL · Topic — Set operations (INTERSECT, EXCEPT, …)
- COMPANY · Meta — sets Meta-tagged set logic
9. Hash Maps and Counting Techniques in Python
Hash Maps and Counting in Python for Data Processing
A dict (hash map) maps keys to values. Lookup, insert, and update are O(1) on average (amortized)—much faster than rescanning a whole list for every key.
Plain dict and .get
- Frequency: `counts[k] = counts.get(k, 0) + 1` avoids `KeyError` on first sighting of `k`.
- Grouping into lists: `d.setdefault(k, []).append(item)`, or use `defaultdict` below.
Worked example
words = ["cat", "dog", "cat"] — after the loop, freq is {"cat": 2, "dog": 1}.
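The loop that produces that result is one line of state per word—a minimal sketch:

```python
words = ["cat", "dog", "cat"]

freq = {}
for w in words:
    freq[w] = freq.get(w, 0) + 1  # 0 default on the first sighting

print(freq)  # {'cat': 2, 'dog': 1}
```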
collections.Counter
- Built for frequency counts: `Counter(iterable)`, `.most_common(k)` for top-k.
- Good when the problem is “how many of each label?” with minimal code.
Worked example
Counter(["err", "ok", "err"]).most_common(1) # [('err', 2)]
defaultdict from collections
- `defaultdict(int)`—same ergonomics as counting with a 0 default.
- `defaultdict(list)`—`d[user_id].append(event)` mirrors `GROUP BY user_id` “collect all rows in a bucket.”
- `defaultdict(set)`—handy for unique neighbors per key.
Worked example
Rows ("u1", "click"), ("u1", "view") — dd["u1"] becomes ["click", "view"] with defaultdict(list).
defaultdict(set): Edges ("u1","u2"), ("u1","u3"), ("u1","u2") — g["u1"].add(...) yields {"u2", "u3"} (duplicates collapsed).
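Both worked examples above can be reproduced with a few lines of `defaultdict`:

```python
from collections import defaultdict

rows = [("u1", "click"), ("u1", "view")]
by_user = defaultdict(list)          # GROUP BY user_id: collect events
for user, event in rows:
    by_user[user].append(event)

edges = [("u1", "u2"), ("u1", "u3"), ("u1", "u2")]
neighbors = defaultdict(set)         # duplicates collapse in a set
for a, b in edges:
    neighbors[a].add(b)

print(by_user["u1"])                    # ['click', 'view']
print(neighbors["u1"] == {"u2", "u3"})  # True
```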
Multi-column GROUP BY
- Use a tuple key: `key = (country, day)` as the dict key when the bucket is more than one dimension.
Worked example
```python
totals = {}
totals[("US", "2026-01-01")] = totals.get(("US", "2026-01-01"), 0) + 50
```
→ one bucket per (country, day) pair.
Simulating SQL aggregates
| SQL idea | Python sketch |
|---|---|
| `COUNT(*)` per key | `counts[k] += 1` or `Counter` |
| `SUM(amount)` per key | `sums[k] = sums.get(k, 0) + amount` |
| `AVG(amount)` per key | Store `(sum, count)` per key, divide at the end |
| `COUNT(DISTINCT x)` per key | `d[k].add(x)` with a set per key |
Worked example
rows = [{"user": "a", "amt": 10}, {"user": "a", "amt": 20}, {"user": "b", "amt": 5}]
One pass: sums["a"]=30, sums["b"]=5.
Worked example — distinct per key
rows = [{"user": "a", "sku": "X"}, {"user": "a", "sku": "Y"}, {"user": "a", "sku": "X"}] — after uniq["a"].add(sku) with defaultdict(set), len(uniq["a"]) == 2 (matches COUNT(DISTINCT sku) GROUP BY user).
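A single pass can maintain several aggregates at once—this sketch combines the `SUM` and `COUNT(DISTINCT …)` examples above (the `sku` values are merged into the same rows for illustration):

```python
from collections import defaultdict

rows = [{"user": "a", "amt": 10, "sku": "X"},
        {"user": "a", "amt": 20, "sku": "Y"},
        {"user": "b", "amt": 5,  "sku": "X"}]

sums = defaultdict(int)   # SUM(amt) per user
skus = defaultdict(set)   # COUNT(DISTINCT sku) per user
for r in rows:
    sums[r["user"]] += r["amt"]
    skus[r["user"]].add(r["sku"])

print(dict(sums))                            # {'a': 30, 'b': 5}
print({u: len(s) for u, s in skus.items()})  # {'a': 2, 'b': 1}
```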
Why avoid nested scans?
- “For each distinct key, loop the entire list” is O(n²). One pass with a dict is typically O(n) time and O(distinct keys) space—what interviewers expect.
Worked example: With about 10,000 rows and 1,000 distinct keys, rescanning all rows per key is on the order of ten million operations; one dict pass stays on the order of ten thousand.
heapq for top-k (when k is small)
-
heapq.nlargest(k, iterable)/nsmallestreturn the k best items without sorting the whole list (O(n log k)). - For “top k keys by count,” you can also push (-count, key) into a min-heap of size k while streaming counts—useful when memory must stay O(k).
Worked example
```python
import heapq

counts = {"a": 5, "b": 2, "c": 9, "d": 1}
top2 = heapq.nlargest(2, counts.items(), key=lambda x: x[1])  # [('c', 9), ('a', 5)]
```
Common beginner mistakes
- Re-scanning the whole list inside a loop over unique keys.
- Forgetting tie-breaking (e.g. lexicographically smallest among max frequency).
- Not handling missing keys—use `.get(key, 0)` or `Counter`.
Python Interview Question on Frequency and Tie-Breaking
Given a list of words, return the word with the highest count (break ties by lexicographically smallest word).
Solution Using Counter and min
```python
from collections import Counter

def top_word(words: list[str]) -> str:
    cnt = Counter(words)
    max_freq = max(cnt.values())
    candidates = [w for w, c in cnt.items() if c == max_freq]
    return min(candidates)

# Example: top_word(["apple", "banana", "apple"]) -> "apple"
```
Why this works: Counter gets frequencies; we keep all words at the max count; min picks the tie-breaker.
Practice
- PYTHON · Topic — Hash tables & counting
- COMPANY · Meta — hash table Meta-tagged hash tables
10. Streaming Data and Interval Processing in Python
Streaming Data Processing in Python
Streaming means: events arrive one after another (or in time order). You keep a small piece of state—last timestamp seen, counts in the current window, “open” orders in a dict—and update it when the next event arrives. You should not rescan the entire history for each new line if you want an efficient solution.
Sorting event streams
- Most simulations assume `events.sort()` by timestamp (and stable tie-breaking: e.g. process end before start at the same time if “touching” is not overlap—depends on problem statement).
Worked example
events = [(10, "start"), (5, "end"), (7, "start")] — sort by time → (5,end), (7,start), (10,start) so processing order matches the real timeline.
Half-open intervals [start, end)
- Start included, end excluded—standard for “busy from second `start` up to but not including `end`.”
- Overlap test for `[a1, a2)` and `[b1, b2)`: `a1 < b2` and `a2 > b1`. (Closed intervals use different inequalities—match the prompt.)
Worked example: [1, 5) and [5, 8) — touch at 5 but no overlap (half-open).
Merge overlapping intervals
- Sort by start, then sweep: if the next interval overlaps the current merged block, extend the block’s end with `max(end1, end2)`; else start a new block. Used for “total covered time” after merging.
Worked example: [1,4) and [3,6) merge to [1,6) (covered length 5).
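A sketch of that sweep, assuming numeric half-open intervals (touching blocks stay separate):

```python
def merge_intervals(intervals):
    """Merge overlapping half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:         # overlaps current block
            merged[-1][1] = max(merged[-1][1], end)  # extend the block's end
        else:
            merged.append([start, end])              # start a new block
    return [tuple(b) for b in merged]

blocks = merge_intervals([(1, 4), (3, 6)])
print(blocks)                         # [(1, 6)]
print(sum(e - s for s, e in blocks))  # total covered time: 5
```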
Sweep line (concurrency / max overlap)
- Expand each interval to `(start, +1)` and `(end, -1)`; sort all points; walk while tracking a running balance of active intervals; `max` of the balance is the peak concurrency (see sample below).
Worked example: Intervals [0,3), [2,5) — at time 2 both are active → peak concurrency 2 (see full solution below).
Stateful dict (“stage machine”)
- `state[order_id] = (stage, last_ts)` (or similar): on each log line, update the entity’s state and maybe accumulate duration `now - last_ts` for the previous stage—pattern for marketplace order timelines.
Worked example
Log lines: order_A placed t=0, order_A shipped t=10. After second line, time-in-placed = 10 − 0; update state["order_A"] = ("shipped", 10).
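A minimal version of that state machine—the log-line shape `(order_id, stage, ts)` and the `time_in_stage` helper are assumptions for this sketch:

```python
def time_in_stage(log_lines):
    """Accumulate time spent per (order, stage); log_lines arrive in time order."""
    state = {}       # order_id -> (current stage, ts it was entered)
    durations = {}   # (order_id, stage) -> total time spent in that stage
    for order_id, stage, ts in log_lines:
        if order_id in state:
            prev_stage, last_ts = state[order_id]
            key = (order_id, prev_stage)
            durations[key] = durations.get(key, 0) + (ts - last_ts)
        state[order_id] = (stage, ts)
    return durations

d = time_in_stage([("order_A", "placed", 0), ("order_A", "shipped", 10)])
print(d)  # {('order_A', 'placed'): 10}
```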
Sliding windows in memory
- Keep a deque or index of events within the last N seconds; drop expired items as time advances—pattern for “active users in last 5 minutes” style prompts.
Worked example: Times 100, 250, 400 (seconds), window 300s. When processing 400, drop timestamps < 100 → deque holds [250, 400] (two events in window).
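That eviction loop in a few lines, treating the window as half-open so an event exactly 300 seconds old is dropped (matching the worked example):

```python
from collections import deque

WINDOW = 300          # seconds
window = deque()      # timestamps currently inside the window

def observe(ts):
    """Record an event at ts, evict expired timestamps, return the live count."""
    window.append(ts)
    while window and window[0] <= ts - WINDOW:  # half-open: exactly-300s-old goes
        window.popleft()
    return len(window)

for t in (100, 250, 400):
    live = observe(t)

print(list(window), live)  # [250, 400] 2
```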
Out-of-order events (real pipelines)
- Networks or collectors may deliver late rows. If the problem says “events can arrive out of order,” you may need to buffer by key until a watermark (e.g. “no more events with `ts < T` will arrive”), or re-sort a bounded buffer before updating state.
- Interview simulations usually assume sorted input or single-threaded append—if unsure, ask whether sorting first is allowed.
Worked example: Stream order ts = 5, 2, 8 for the same key. If you must emit “sum so far in time order,” hold rows in a min-heap by ts until the watermark passes 2 (e.g. you have seen ts ≥ 3), then release 2 before 5—or sort the batch if the problem allows.
Common beginner mistakes
- Not sorting by time before simulating.
- Closed intervals on both ends → accidental double counting at shared boundaries.
- Forgetting tie-breaking in `events.sort()` when start and end times coincide.
Python Interview Question on Intervals and Maximum Overlap
Events are (start, end) tuples when a server is busy. Use half-open intervals [start, end). Return the maximum number of overlapping busy intervals at any time.
Optimized Solution Using Sweep Line Algorithm
```python
def max_overlap(intervals: list[tuple[int, int]]) -> int:
    events = []
    for s, e in intervals:
        events.append((s, 1))   # start: +1 concurrent
        events.append((e, -1))  # end: -1 concurrent
    events.sort()
    cur = best = 0
    for _, delta in events:
        cur += delta
        best = max(best, cur)
    return best
```
Why this works: At any time, cur is how many intervals are active; we record the peak.
Practice
- PYTHON · Topic — Streaming & interval-style problems
- COMPANY · Meta — streaming Meta-tagged streaming
Tips to Crack Meta Data Engineering Interviews
These tips focus on the technical and problem-solving skills interviewers actually score—not textbook definitions. Solid data engineering interview preparation blends SQL drills with typed Python practice; cracking a loop at this level comes from repetition with feedback, not passive reading.
The Meta data engineer interview tips below are practical: habits, patterns, and where to drill on PipeCode. They do not re-explain sections 1–10—they tell you what to do with that material.
Quick checklist (prep habits):
- Drill SQL daily: joins, `GROUP BY`, window functions, dates, nulls, and set-style logic.
- Rehearse data pipeline and ETL thinking: sources, transforms, delivery, and what breaks at scale.
- Be ready to discuss data-oriented system design: schemas, data flow, and reliability—not just a single query.
- Strengthen Python for data processing: dictionaries, counting, streaming, and interval-style problems.
- Work real problems on the Meta company practice hub with tests, not theory alone.
SQL Preparation Tips
Strong SQL is non-negotiable. SQL preparation for interviews should emphasize correctness first, then clarity: name the grain (what one row means), say WHERE vs HAVING, and sanity-check join fan-out before you optimize. Timed reps beat rereading notes—use Meta · aggregation, Meta · filtering, Meta · joins, Meta · window functions, Meta · subqueries, and Meta · CTE for company-scoped drills; add Topic · null handling when you practice NULL-safe reporting.
Data Pipeline & ETL Tips
Many data engineering interview loops expect you to reason about pipelines: ingestion, transformation, and serving—often with ETL-style trade-offs (batch vs incremental, idempotency, late data). You do not need a slide deck—you need vocabulary: what is upstream, what is idempotent, what schema does the consumer need? On PipeCode, warm up with Meta · ETL, Meta · dimensional modeling, Meta · event modeling, and the broader topic · ETL hub.
System Design Preparation Tips
System design for data engineering is usually data-centric: sketch components (ingest, store, process, serve), data flow, failure modes, and scale (partitioning, backfill, duplicates). If you only prepare SQL in isolation, practice one whiteboard-style walkthrough per week: inputs, outputs, and where quality is enforced. Deep dives and Explore courses can complement problem reps when you want structured depth.
Coding & Python Tips
Python screens favor clear code over clever tricks. Prioritize Meta · hash table and topic · hash table for frequency and dict patterns; use Meta · streaming with topic · sliding window and topic · intervals for window and sweep-line style tasks. State time and space complexity when the interviewer signals they care.
Understand Common Interview Patterns
To crack data engineering interview problems faster, recognize the pattern before you code—aggregation, filtering, joins, windows, CTEs, dates, sets, hash maps, streaming. When a prompt feels new, map it to one of those shapes (see sections 1–10), then pick the smallest example and trace it by hand.
Where to practice on PipeCode
Primary loop: Meta company practice hub. Browse all practice topics and company hubs; full library: Explore practice. Commitment: Subscribe when you want full access.
| Skill lane | Where to practice on PipeCode |
|---|---|
| Aggregations & GROUP BY / HAVING | Meta · aggregation, Meta · grouping, Meta · having clause |
| Filtering | Meta · filtering |
| Joins & dedupe | Meta · joins, Meta · join |
| Windows & ranking | Meta · window functions, Meta · ranking |
| Subqueries & CTEs | Meta · subqueries, Meta · CTE |
| Dates & time-series | Meta · date functions, Meta · time-series |
| Set-style logic | Meta · set |
Interview habit: say WHERE vs HAVING and join grain out loud before you type.
Communication under time pressure
State assumptions (grain, nulls, ties), sketch a tiny test case, then code—interviewers reward data engineering judgment, not just syntax.
Frequently asked questions
What questions are asked in Meta data engineering interviews?
Loops often stress SQL (aggregations, joins, windows, dates, nulls, sets) and Python (hash maps, counting, streaming-style intervals). Exact questions vary by team and level; this guide teaches those topic types with original examples aligned to PipeCode’s Meta skill tags.
How to prepare for Meta data engineering interview questions?
Read the first teaching subheading under each topic (the SQL or Python explainer block), type the sample solutions, then practice on PipeCode’s Meta hub. Usually SQL first, then Python; 450+ problems are available for reps.
Are coding questions asked in Meta data engineering interviews?
Yes. Many candidates see SQL and Python exercises. This guide matches that mix with worked examples in sections 1–10—not a copy of any single live question.
What SQL questions are asked in Meta data engineering interviews?
Expect shapes like aggregation and GROUP BY, joins, window functions, dates, NULL-safe rates, and set-style logic. See sections 1–8; examples are PostgreSQL-style unless noted.
What are common data engineering interview problems at Meta?
Common patterns include summaries by grain, join fan-out, ranking and windows, time buckets, safe percentages, and Python frequency or interval sweeps. The topic table near the top lists them.
What skills are required to crack Meta data engineering interviews?
Strong SQL (aggregations, joins, windows, dates), solid Python for dicts, counts, and streaming-style logic, plus clear communication under time pressure. PipeCode offers Meta-tagged practice across these topics.
Start practicing Meta-style data engineering problems
PipeCode pairs company-tagged Meta problems with tests and feedback so you move from reading solutions to typing your own.
Pipecode.ai is Leetcode for Data Engineering