Meta data engineering interviews (and similar big-tech loops) usually lean heavily on SQL—often PostgreSQL-style—and on Python for problems involving hash maps, counting, arrays, strings, and streaming or interval-style logic. Interviewers care that you can write correct queries, pick the right grain, and explain trade-offs—not just memorize syntax.
This guide introduces the topics that show up most often in that kind of prep: aggregations, joins, windows, dates, null-safe reporting, set logic, and representative Python patterns. The sample questions here are original teaching examples.
Top Meta Data Engineering Interview Topics
| # | Topic | Why it matters on Meta-style sets |
|---|---|---|
| 1 | Aggregation & GROUP BY / HAVING | Summaries by brand, seller, day, etc.; filter groups after summing. |
| 2 | Filtering (WHERE) | Keep the right rows before you aggregate. |
| 3 | Joins & deduplication | Combine tables without inflating row counts. |
| 4 | Window functions & ranking | Running totals, top-N per group, ties. |
| 5 | Subqueries & CTEs | Multi-step logic readable in one script. |
| 6 | Dates & time-series | Daily revenue, latency, cohort-style buckets. |
| 7 | NULL & safe percentages | Correct numerators/denominators; avoid silent wrong rates. |
| 8 | Set-style logic (overlap / both / except) | Customers in both stores, A-not-B, etc. |
| 9 | Python: hash maps & counting | Dicts, Counter, frequency and “most common.” |
| 10 | Python: streaming & intervals | Update state as events arrive; merge intervals. |
If you are new to SQL: In most databases the engine processes a query roughly in this order: `FROM`/joins → `WHERE` (filter rows) → `GROUP BY` → compute aggregates → `HAVING` (filter groups) → window functions → `SELECT`/`ORDER BY`. When in doubt, ask: “Am I filtering one row at a time (WHERE) or a whole group after summing (HAVING)?”
1. Aggregation and GROUP BY Concepts in Data Engineering
Aggregation and GROUP BY in SQL for Data Engineering
Picture a table with many detail rows—for example one row per order. Aggregation means: “turn lots of rows into one summary value (or a few values) per bucket.” The bucket is whatever the question cares about: per user, per day, per campaign, and so on.
- `GROUP BY` defines the bucket: “Put all rows with the same `user_id` together,” or “the same `(store_id, day)` together.” Every distinct combination of the `GROUP BY` columns is one group; the database runs your aggregate functions separately inside each group.
SUM(col)
- What it does: Adds all numeric values of `col` in the bucket.
- `NULL` behavior: `NULL` cells are ignored (they are not treated as 0). If every value is `NULL`, `SUM` is usually `NULL`, not 0—say that in an interview if the edge case matters.
- Typical use: Revenue totals, quantities, scores summed per seller or per day.
Worked example: In one group, amount values 10, NULL, 30 → SUM(amount) = 40 (only 10 + 30).
AVG(col)
- What it does: Average of non-`NULL` values: sum of non-null values divided by count of non-null values.
- `NULL` behavior: Rows where `col` is `NULL` do not enter the numerator or the denominator. If you need “average where missing means 0,” use `AVG(COALESCE(col, 0))` (only if the business defines it that way).
- Typical use: Average order value per user, average latency per region.
Worked example: Same three rows 10, NULL, 30 → AVG(amount) = 20 because (10 + 30) / 2; the NULL row is not counted in the average.
COUNT(*)
- What it does: Counts how many rows are in the bucket—every row counts, even if some columns are `NULL`.
- Typical use: “How many orders per customer?”, “how many events in this hour?”—when each row is one event.
Worked example: Same three rows → COUNT(*) = 3 (the NULL row still counts as a row).
COUNT(col)
- What it does: Counts rows where `col` is not `NULL`. Differs from `COUNT(*)` as soon as `col` has nulls.
- Example intuition: `COUNT(user_id)` might count rows with a known user; `COUNT(*)` counts all rows in the group.
- Related: `COUNT(DISTINCT col)` counts unique non-null values in the bucket—essential after joins when you must count people, not multiplied rows (see section 3).
Worked example: Same three rows → COUNT(amount) = 2. If the third row were 50 instead of NULL, COUNT(DISTINCT amount) with values 10, 30, 50 would be 3; with 10, 30, 10 it would be 2.
MIN(col) and MAX(col)
- What they do: Return the smallest or largest value of `col` in the bucket. Works on orderable types: numbers, dates/timestamps, strings (lexicographic order).
- `NULL` behavior: `NULL`s are skipped. If all values are `NULL`, the result is `NULL`.
- Typical use: Latest (`MAX(ts)`), earliest (`MIN(day)`), cheapest product in a category (`MIN(price)`).
Worked example: Amounts 10, NULL, 30 → MIN(amount) = 10, MAX(amount) = 30. For strings 'apple', 'banana' in one group → MIN = 'apple' (lexicographic).
Worked example — one dataset, several aggregates
Suppose orders looks like this:
| order_id | user_id | amount |
|---|---|---|
| 101 | u1 | 20.00 |
| 102 | u1 | NULL |
| 103 | u1 | 40.00 |
| 104 | u2 | 100.00 |
Run:
```sql
SELECT user_id,
       SUM(amount)   AS sum_amt,
       AVG(amount)   AS avg_amt,
       COUNT(*)      AS n_rows,
       COUNT(amount) AS n_known_amt,
       MIN(amount)   AS min_amt,
       MAX(amount)   AS max_amt
FROM orders
GROUP BY user_id
ORDER BY user_id;
```
You should get:
| user_id | sum_amt | avg_amt | n_rows | n_known_amt | min_amt | max_amt |
|---|---|---|---|---|---|---|
| u1 | 60.00 | 30.0000… | 3 | 2 | 20.00 | 40.00 |
| u2 | 100.00 | 100.0000… | 1 | 1 | 100.00 | 100.00 |
For u1, the row with amount NULL still counts in COUNT(*) (3 rows) but not in SUM / AVG / COUNT(amount) (only the 20 and 40 matter). That single picture is how most interviews test whether you understand NULL with aggregates.
COUNT(DISTINCT) mini-example: if clicks has two rows for the same user_id (double click), COUNT(*) is 2 but COUNT(DISTINCT user_id) is 1 for that user’s bucket.
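The NULL-vs-aggregate behavior above is easy to verify yourself. This is a minimal sketch that replays the orders table through Python's built-in sqlite3 module (standard SQL semantics here match PostgreSQL for these aggregates):

```python
# Replay the orders example: how SUM / AVG / COUNT(*) / COUNT(col) treat NULL.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id TEXT, user_id TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("101", "u1", 20.0), ("102", "u1", None),
     ("103", "u1", 40.0), ("104", "u2", 100.0)],
)

rows = con.execute("""
    SELECT user_id,
           SUM(amount),            -- NULL ignored: 20 + 40
           AVG(amount),            -- (20 + 40) / 2, not / 3
           COUNT(*),               -- every row counts, NULL or not
           COUNT(amount)           -- only rows with a known amount
    FROM orders
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

print(rows)  # [('u1', 60.0, 30.0, 3, 2), ('u2', 100.0, 100.0, 1, 1)]
```

The u1 row shows the whole story at once: 3 rows exist, 2 have known amounts, and the average divides by 2.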
Conditional aggregation (CASE inside aggregates)
- Idea: Count or sum only some rows in the group without splitting into multiple queries—put the condition inside the aggregate.
- Patterns: `SUM(CASE WHEN condition THEN col ELSE 0 END)`, `COUNT(CASE WHEN … THEN 1 END)` (or `SUM(CASE WHEN … THEN 1 ELSE 0 END)`), `AVG(CASE WHEN … THEN col END)` (averages only the non-null branches).
Worked example — events per user by type
| user_id | event_type |
|---|---|
| u1 | view |
| u1 | purchase |
| u1 | view |
| u2 | view |
```sql
SELECT user_id,
       COUNT(*) AS total_events,
       SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases,
       SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) AS views
FROM events
GROUP BY user_id;
```
Result
| user_id | total_events | purchases | views |
|---|---|---|---|
| u1 | 3 | 1 | 2 |
| u2 | 1 | 0 | 1 |
This is the same “one GROUP BY, many metrics” style as `COUNT(*) FILTER (WHERE …)` in PostgreSQL—`CASE` is the portable form that works across most warehouses.
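The conditional-aggregation pattern can be checked directly; a minimal sketch running the events example through Python's built-in sqlite3:

```python
# One GROUP BY, several metrics via CASE inside the aggregates.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id TEXT, event_type TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", "view"), ("u1", "purchase"), ("u1", "view"), ("u2", "view")],
)

rows = con.execute("""
    SELECT user_id,
           COUNT(*) AS total_events,
           SUM(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) AS purchases,
           SUM(CASE WHEN event_type = 'view' THEN 1 ELSE 0 END) AS views
    FROM events
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

print(rows)  # [('u1', 3, 1, 2), ('u2', 1, 0, 1)]
```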
GROUP BY and HAVING (how they fit together)
- `HAVING` is `WHERE` for buckets: it runs after grouping, so you can filter on `AVG(amount)`, `COUNT(*)`, etc. `WHERE` runs before grouping and only sees raw row columns—so `WHERE AVG(amount) > 50` is invalid: the average does not exist until after `GROUP BY`.
Worked example — WHERE vs HAVING
| order_id | brand_id | amount |
|---|---|---|
| 1 | A | 40 |
| 2 | A | 70 |
| 3 | A | 20 |
| 4 | B | 200 |
- `WHERE amount > 30` drops row 3 before grouping. Then `GROUP BY brand_id` with `SUM(amount)` gives A = 110 (40 + 70), B = 200.
- `HAVING SUM(amount) > 100` runs after grouping on the unfiltered table: A’s sum is 130, B’s is 200—both pass; if you needed “brands with at least 3 orders and sum > 100,” you would use `HAVING COUNT(*) >= 3 AND SUM(amount) > 100`.
Worked example — same orders as above, show HAVING output
```sql
SELECT brand_id, SUM(amount) AS total_amt, COUNT(*) AS n_orders
FROM orders
GROUP BY brand_id
HAVING SUM(amount) > 100;
```
Result
| brand_id | total_amt | n_orders |
|---|---|---|
| A | 130 | 3 |
| B | 200 | 1 |
If you add AND COUNT(*) >= 2, only A remains (B has a single order).
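The WHERE-before-grouping vs HAVING-after-grouping difference shows up clearly if you run both queries against the same brand table; a minimal sketch with Python's built-in sqlite3:

```python
# WHERE filters rows before SUM; HAVING filters whole groups after SUM.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INT, brand_id TEXT, amount INT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "A", 40), (2, "A", 70), (3, "A", 20), (4, "B", 200)],
)

# Row 3 (A, 20) is dropped before grouping, so A sums to 110.
where_first = con.execute("""
    SELECT brand_id, SUM(amount) FROM orders
    WHERE amount > 30
    GROUP BY brand_id ORDER BY brand_id
""").fetchall()
print(where_first)   # [('A', 110), ('B', 200)]

# All rows reach SUM; the group filter runs on the totals (A = 130).
having_after = con.execute("""
    SELECT brand_id, SUM(amount) FROM orders
    GROUP BY brand_id
    HAVING SUM(amount) > 100
    ORDER BY brand_id
""").fetchall()
print(having_after)  # [('A', 130), ('B', 200)]
```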
Rule of thumb: If the condition uses `SUM`/`COUNT`/`AVG`/… of the group, use HAVING. If it only uses this row’s columns, use WHERE (and put it first—it usually makes the query faster too).
If a column appears in SELECT and is not inside an aggregate, it must appear in GROUP BY (in strict SQL). Otherwise the database does not know which row’s value to show for that column inside a bucket.
Common beginner mistakes
- Putting `AVG(amount) > 50` in `WHERE` → use `HAVING` after `GROUP BY`.
- Forgetting a column in `GROUP BY` when it appears in `SELECT` without an aggregate → invalid query in strict SQL.
- Answering at the wrong grain (e.g. one row per `ad_id` when the question asked per campaign).
SQL Interview Question on Aggregation and GROUP BY
Table orders(order_id, user_id, amount) lists purchases. Return each user_id whose average amount is greater than 50 and who has at least 3 orders.
Solution Using GROUP BY and HAVING
```sql
SELECT user_id,
       AVG(amount) AS avg_amount,
       COUNT(*) AS order_cnt
FROM orders
GROUP BY user_id
HAVING COUNT(*) >= 3
   AND AVG(amount) > 50;
```
Why this works: We group all orders by user_id, compute average and count per user, then keep only groups that pass both conditions. Those conditions depend on aggregates, so they belong in HAVING.
Practice
- SQL · Topic — Aggregation problems (all companies)
- COMPANY · Meta — Meta-tagged aggregation
2. Data Filtering and WHERE Clause in SQL
Filtering Data Using the WHERE Clause in SQL
WHERE is the row filter: after FROM and joins, each row is tested once. If the condition is true, the row stays and can be grouped, counted, or shown in the result; if false, the row is dropped and never enters aggregates. Rows removed here are invisible to GROUP BY and to COUNT(*) on the remaining set—so push the filters that narrow the problem early.
Comparisons (=, <>, <, >, <=, >=)
- Compare a column to a literal or another column: `amount > 100`, `ts >= TIMESTAMP '2026-01-01'`.
- Strings compare in the database’s collation order unless you use explicit functions.
- Watch types: comparing a `DATE` to a `TIMESTAMP` may require casting so you do not accidentally exclude boundary instants.
Worked example
| id | amount | status |
|---|---|---|
| 1 | 60 | paid |
| 2 | 40 | refunded |
| 3 | 90 | paid |
WHERE amount > 50 AND status <> 'refunded' keeps rows 1 and 3 only.
Range and membership
- `BETWEEN a AND b` is inclusive on both ends. For timestamps, `>= start AND < end` (half-open) is often safer so you do not double-count midnight.
- `IN ('US', 'CA', 'UK')` is shorthand for multiple `OR`s; `NOT IN (...)` can surprise you if the list or a subquery contains `NULL`—`NOT EXISTS` is sometimes safer (see section 8).
Worked example
| user_id | country |
|---|---|
| a | US |
| b | IN |
| c | CA |
WHERE country IN ('US', 'CA') returns users a and c.
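The `NOT IN` + `NULL` surprise mentioned above is worth seeing once. A minimal sketch with Python's built-in sqlite3 (same three-valued logic as PostgreSQL):

```python
# NOT IN over a set containing NULL silently matches nothing.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_id TEXT, country TEXT)")
con.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("a", "US"), ("b", "IN"), ("c", "CA")],
)

# country <> NULL is unknown, so every row's NOT IN test is unknown -> dropped.
empty = con.execute(
    "SELECT user_id FROM users WHERE country NOT IN ('US', NULL)"
).fetchall()
print(empty)  # []

# NOT EXISTS sidesteps the trap: only the real 'US' match is excluded.
safe = con.execute("""
    SELECT user_id FROM users u
    WHERE NOT EXISTS (SELECT 1 FROM (SELECT 'US' AS c) t WHERE t.c = u.country)
    ORDER BY user_id
""").fetchall()
print(safe)  # [('b',), ('c',)]
```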
LIKE and text patterns
- `LIKE` uses `%` (any substring) and `_` (single character): `email LIKE '%@company.com'`.
- Case sensitivity depends on the database/column collation; use `ILIKE` (PostgreSQL) or `LOWER(col)` if the problem requires a case-insensitive match.
Worked example
| id | email |
|---|---|
| 1 | ada@work.com |
| 2 | bob@gmail.com |
WHERE email LIKE '%@work.com' matches ada only.
IS NULL / IS NOT NULL
- `NULL` means unknown, not “empty string” or 0. `WHERE col = NULL` is wrong in SQL; use `WHERE col IS NULL` or `IS NOT NULL`.
Worked example
| id | phone |
|---|---|
| 1 | 555-0100 |
| 2 | NULL |
WHERE phone IS NOT NULL returns row 1; WHERE phone IS NULL returns row 2.
Combining conditions
- `AND` — all must hold. `OR` — at least one holds; use parentheses when mixing with `AND` so intent is unambiguous: `WHERE (a OR b) AND c`.
Worked example
| id | region | is_active |
|---|---|---|
| 1 | EU | TRUE |
| 2 | UK | FALSE |
| 3 | US | TRUE |
- `WHERE (region = 'EU' OR region = 'UK') AND is_active = TRUE` → row 1 only (the UK row is inactive).
- Without parentheses, `AND` binds tighter than `OR`, so `WHERE region = 'EU' OR region = 'UK' AND is_active = TRUE` is read as `region = 'EU' OR (region = 'UK' AND is_active = TRUE)`—not the same as “(EU or UK) and active.” Always parenthesize `OR` groups when you mix them with `AND`.
Because WHERE runs before GROUP BY, it cannot reference per-group totals. Those belong in HAVING.
Common beginner mistakes
- Using `=` instead of `IS NULL`.
- Mixing `DATE` and `TIMESTAMP` or time zones without casting.
- Putting aggregate conditions in `WHERE` (use `HAVING` instead).
Data Engineering Interview Question on WHERE and Filtering
From events(user_id, event_type, ts), count how many purchase events happened after 2026-01-01 for each user_id.
Solution Using WHERE and GROUP BY
```sql
SELECT user_id,
       COUNT(*) AS purchase_cnt
FROM events
WHERE event_type = 'purchase'
  AND ts > TIMESTAMP '2026-01-01 00:00:00'
GROUP BY user_id;
```
Why this works: WHERE keeps only purchase rows after the cutoff; then we count per user.
Practice
- SQL · Topic — Filtering & WHERE clause
- COMPANY · Meta — Meta-tagged filtering
3. Joins and Data Deduplication Techniques
Joins and Deduplication in SQL for Data Engineering
A join answers: “For each row in table A, which rows in table B belong with it?” You declare that with ON a.key = b.key. Think of it as enriching events (orders, clicks) with reference data (users, ads).
Fact vs dimension and grain (how DEs talk about joins)
- Fact table: Many rows, often one row per event or transaction (order line, click, impression)—the thing you `SUM` or `COUNT`.
- Dimension table: Fewer rows, one row per entity you describe (user, ad, product)—the thing you `JOIN` to get names, categories, brand ids.
- Grain: “One result row per what?” (per order, per user per day, per campaign). Wrong grain after a join usually means fan-out (see below): you aggregated at order level but summed a user attribute.
Worked example: order_lines (fact: order_id, sku, qty) joins products (dimension: sku, list_price, name). SUM(qty) by sku is correct at line grain. If products accidentally had duplicate rows per sku, one join could inflate SUM(list_price)—fix the dimension to one row per sku before summing price.
Inner join (JOIN / INNER JOIN)
- Result: Only rows where the `ON` condition succeeds on both sides.
- Use when: You only care about “matched” pairs—e.g. clicks that have a known ad row.
- `NULL` keys: In most join semantics, `NULL = NULL` is not true, so inner joins drop rows with a null join key unless you `COALESCE` or filter explicitly.
Worked example
clicks
| click_id | user_id | ad_id |
|---|---|---|
| c1 | u1 | 10 |
| c2 | u1 | 99 |
ads
| ad_id | brand_id |
|---|---|
| 10 | 7 |
| 11 | 7 |
FROM clicks c JOIN ads a ON c.ad_id = a.ad_id returns one row: (c1, u1, 10, 7). Click c2 has ad_id = 99 with no ad row, so it disappears from an inner join.
Left outer join (LEFT JOIN)
- Result: Every row from the left table appears once per matching right row; if there is no match, right-side columns are `NULL`.
- Use when: You must keep all left entities even if they have zero events—e.g. all users including those with no orders.
Worked example
users
| id | name |
|---|---|
| 1 | Ann |
| 2 | Bo |
orders
| id | user_id | amount |
|---|---|---|
| o1 | 1 | 50 |
FROM users u LEFT JOIN orders o ON u.id = o.user_id yields two rows: Ann + o1, and Bo + NULL (no order—right side null-padded).
Right outer join (RIGHT JOIN)
- Result: Mirror of `LEFT JOIN`: keep all rows from the right table; left columns are `NULL` when unmatched.
- In practice: Rarely needed—swap the tables and use `LEFT JOIN` so “keep all rows from the table I care about” always reads the same way.
Worked example: Same users and orders as in the left join. FROM orders o RIGHT JOIN users u ON u.id = o.user_id returns the same two logical rows as FROM users u LEFT JOIN orders o … (Ann + o1, Bo + nulls)—only the preferred style differs.
Full outer join (FULL OUTER JOIN)
- Result: Keeps rows that match plus unmatched rows from either side (the missing side filled with `NULL`).
- Use when: You need a symmetric “everything from A and B” merge; less common in basic interview questions but shows up in reconciliation-style problems.
Worked example: If legacy has id 1,2 and new has id 2,3, a full outer join on id returns three logical keys 1, 2, 3 with NULLs where one side is missing.
Cross join (CROSS JOIN)
- Result: Cartesian product: every row of A paired with every row of B.
- Use when: You intentionally need all combinations (e.g. users × dates grid)—easy to explode row count; use sparingly.
Worked example: stores has 2 rows, quarters has 4 rows → CROSS JOIN produces 8 rows (every store × every quarter).
Semi-join vs anti-join (names, not syntax)
- Semi-join: “Keep rows from A where some match exists in B”—typically `WHERE EXISTS (SELECT 1 FROM B WHERE …)` or `WHERE key IN (SELECT key FROM B)`. You do not add B’s columns to the result; you only filter A.
- Anti-join: “Keep rows from A with no match in B”—`LEFT JOIN B … WHERE B.key IS NULL` or `WHERE NOT EXISTS (SELECT 1 FROM B WHERE …)`. Same idea: filter A, not a full merge for reporting.
Worked example
customers id: 1, 2, 3 — table vip customer_id: 2 only.
- Semi-join: `SELECT id FROM customers c WHERE EXISTS (SELECT 1 FROM vip v WHERE v.customer_id = c.id)` → 2.
- Anti-join: `SELECT c.id FROM customers c LEFT JOIN vip v ON v.customer_id = c.id WHERE v.customer_id IS NULL` → 1 and 3 (never appeared in `vip`).
Fan-out (row multiplication)
- If the join key is not unique on one side (one user, many orders), the “one” side’s columns repeat on every matching row. That is correct for order-grain analysis.
- Bug pattern: Summing a user-level numeric column after the join counts that user’s balance multiple times. Fixes: aggregate first; `SUM(DISTINCT …)` rarely fixes the business logic—prefer `COUNT(DISTINCT user_id)` or a subquery that reduces the dimension to one row per key before joining.
Worked example
users: (id=1, credit_balance=500)
orders: two rows for user_id=1.
After users JOIN orders, credit_balance appears twice (500, 500). SUM(u.credit_balance) would add 1000, which is wrong for “total credit in the system.” Fix: SUM order amounts, or MAX(u.credit_balance) per user if you only need the balance once.
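The fan-out bug above is easy to reproduce; a minimal sketch with Python's built-in sqlite3, giving the user two orders (the second order amount is an assumption for illustration):

```python
# Fan-out: the one-side row repeats per matching order, double-counting its balance.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INT, credit_balance INT)")
con.execute("INSERT INTO users VALUES (1, 500)")
con.execute("CREATE TABLE orders (id TEXT, user_id INT, amount INT)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("o1", 1, 50), ("o2", 1, 70)],
)

# Bug: the join repeats credit_balance once per order row.
wrong = con.execute("""
    SELECT SUM(u.credit_balance)
    FROM users u JOIN orders o ON o.user_id = u.id
""").fetchone()[0]
print(wrong)  # 1000 -- double-counted

# Fix: collapse back to one balance per user before summing.
right = con.execute("""
    SELECT SUM(bal) FROM (
        SELECT u.id, MAX(u.credit_balance) AS bal
        FROM users u JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    )
""").fetchone()[0]
print(right)  # 500
```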
Deduplication
- `SELECT DISTINCT col` (or `DISTINCT` on several columns) removes duplicate rows—fine when any duplicate row is interchangeable.
- Latest row per id: Prefer `ROW_NUMBER() OVER (PARTITION BY id ORDER BY time DESC) AS rn` in a CTE, then `WHERE rn = 1`, so you control ties and ordering. `DISTINCT ON` (PostgreSQL) is a compact alternative when you know the dialect.
Worked example — DISTINCT
| visit_id | user_id |
|---|---|
| v1 | u1 |
| v2 | u1 |
| v3 | u2 |
SELECT DISTINCT user_id FROM visits → u1, u2 (one row per user, duplicates collapsed).
Worked example — latest event per user
| user_id | ts | page |
|---|---|---|
| u1 | t2 | /home |
| u1 | t1 | /cart |
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) gives 1 for (u1,t2,/home) and 2 for (u1,t1,/cart); filter rn = 1 keeps only the latest row.
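The ROW_NUMBER dedup pattern runs as-is in Python's built-in sqlite3 (window functions need SQLite ≥ 3.25, bundled with recent Python builds); the second user's row is an assumption added for a second partition:

```python
# Latest row per user via ROW_NUMBER in a CTE, then filter rn = 1.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id TEXT, ts INT, page TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("u1", 2, "/home"), ("u1", 1, "/cart"), ("u2", 5, "/search")],
)

latest = con.execute("""
    WITH ranked AS (
        SELECT user_id, ts, page,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
        FROM events
    )
    SELECT user_id, page FROM ranked WHERE rn = 1 ORDER BY user_id
""").fetchall()
print(latest)  # [('u1', '/home'), ('u2', '/search')]
```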
Worked example — DISTINCT ON (PostgreSQL only)
Same table as above. This picks one row per user_id—the first row after sorting (here, the latest ts):
```sql
SELECT DISTINCT ON (user_id) user_id, ts, page
FROM events
ORDER BY user_id, ts DESC;
```
DISTINCT ON (user_id) requires ORDER BY to start with the same columns; the first row per user_id in that order is kept. For tie-breaking and clarity, ROW_NUMBER() in a CTE is often easier in interviews unless the prompt says PostgreSQL.
Common beginner mistakes
- Joining on the wrong key or ignoring `NULL` keys (they do not match in an inner join).
- Assuming one row per user after a join when the join multiplied rows.
- Using `COUNT(*)` when the question wanted distinct people—use `COUNT(DISTINCT user_id)`.
Joins-Based Interview Question on Distinct Users and Brands
clicks(user_id, ad_id) and ads(ad_id, brand_id). How many distinct users clicked any ad for brand_id = 7?
Solution Using INNER JOIN and COUNT DISTINCT
```sql
SELECT COUNT(DISTINCT c.user_id) AS distinct_users
FROM clicks c
JOIN ads a ON a.ad_id = c.ad_id
WHERE a.brand_id = 7;
```
Why this works: Join finds clicks whose ad belongs to brand 7; COUNT(DISTINCT user_id) counts people, not click rows.
Practice
- SQL · Topic — Joins & deduplication
- COMPANY · Meta — Meta-tagged joins
4. Window Functions and Ranking Methods in SQL
Window Functions and Ranking in SQL for Data Analysis
GROUP BY shrinks the table: many rows → one row per bucket. Window functions do the opposite idea: every row stays, but you add a new column computed from a window of rows around it. Syntax: some_function(...) OVER (...).
PARTITION BY (optional but common)
- Meaning: “Start a new window for each distinct value of these columns”—like running a separate mini-table per `store_id` or `user_id`.
- Without it: The window is usually the whole result set (after `WHERE`), which is rare for rankings but valid for some global running totals.
Worked example (one store, running sum over days)
| store_id | day | revenue |
|---|---|---|
| S1 | 2026-01-01 | 10 |
| S1 | 2026-01-02 | 30 |
SUM(revenue) OVER (PARTITION BY store_id ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) adds 10 on the first row and 40 on the second—cumulative revenue for that store.
ORDER BY inside OVER
- Meaning: Defines which row is “first” inside each partition—required for `RANK`, `ROW_NUMBER`, `LAG`/`LEAD`, and for ordered frames (running sums).
- Direction: `DESC` vs `ASC` changes “who is rank 1” (e.g. highest revenue first).
Worked example
| store_id | day | revenue |
|---|---|---|
| S1 | D1 | 200 |
| S1 | D2 | 150 |
RANK() OVER (PARTITION BY store_id ORDER BY revenue DESC) gives 1 on D1 and 2 on D2—highest revenue is “first” in the window because of DESC.
ROW_NUMBER()
- Assigns 1, 2, 3, … with no ties—the database breaks ties arbitrarily unless you add more `ORDER BY` keys (e.g. `ORDER BY revenue DESC, day ASC`).
- Use when: You need exactly one row per partition (pick-one tie-break) or a unique sequence for deduplication.
Worked example (sales — one row per store per day)
| store_id | day | revenue |
|---|---|---|
| S1 | D1 | 200 |
| S1 | D2 | 150 |
| S1 | D3 | 150 |
ROW_NUMBER() OVER (PARTITION BY store_id ORDER BY revenue DESC, day ASC) assigns 1 to D1, 2 to D2, 3 to D3—the extra day ASC key makes the tie-break deterministic (earlier day wins).
RANK()
- Ties get the same rank; the next rank skips after a tie: 1, 2, 2, 4.
- Use when: “Top tier” should include everyone tied for first place, and skipped ranks are OK.
Worked example: Scores 100, 90, 90, 80 → RANK() → 1, 2, 2, 4.
DENSE_RANK()
- Ties get the same rank, but no gaps: 1, 2, 2, 3.
- Use when: You care about “how many distinct tiers,” not Olympic-style gaps.
Worked example: Same scores → DENSE_RANK() → 1, 2, 2, 3.
NTILE(n)
- Splits each partition’s ordered rows into `n` roughly equal-sized buckets (quartiles if `n = 4`, deciles if `n = 10`).
- Use when: “Top 10% of users by spend” style bucketing (check whether ties straddle bucket boundaries—business rules vary).
Worked example
| user_id | spend |
|---|---|
| u1 | 10 |
| u2 | 20 |
| u3 | 30 |
| u4 | 40 |
NTILE(2) OVER (ORDER BY spend ASC) assigns bucket 1 to u1 and u2, bucket 2 to u3 and u4 (two rows per bucket when n = 2 and n divides evenly).
Window aggregates and frames (ROWS BETWEEN …)
- `SUM(col) OVER (PARTITION BY store ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)` — 7-day trailing sum per store, one output row per day (unlike collapsing the whole table with `GROUP BY`).
- `AVG`, `MIN`, `MAX` work similarly over a frame.
Worked example (trailing 2-day window, illustration)
| store_id | day | rev |
|---|---|---|
| S1 | 1 | 10 |
| S1 | 2 | 20 |
| S1 | 3 | 5 |
With ROWS BETWEEN 1 PRECEDING AND CURRENT ROW, the 2-day trailing sums on rows 1–3 are 10, 30, 25.
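The trailing-frame arithmetic above can be replayed in Python's built-in sqlite3 (window functions need SQLite ≥ 3.25):

```python
# 2-row trailing sum per store: ROWS BETWEEN 1 PRECEDING AND CURRENT ROW.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (store_id TEXT, day INT, rev INT)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("S1", 1, 10), ("S1", 2, 20), ("S1", 3, 5)],
)

rows = con.execute("""
    SELECT day,
           SUM(rev) OVER (PARTITION BY store_id ORDER BY day
                          ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS trailing_2d
    FROM sales ORDER BY day
""").fetchall()
print(rows)  # [(1, 10), (2, 30), (3, 25)]
```

Every input row stays in the output; only the new column summarizes the frame around it.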
LAG / LEAD
- `LAG(col, 1)` = previous row’s value in partition order; `LEAD` = next row’s—used for day-over-day deltas, session gaps, and streak logic.
Worked example
| store_id | day | revenue |
|---|---|---|
| S1 | 1 | 100 |
| S1 | 2 | 130 |
LAG(revenue) OVER (PARTITION BY store_id ORDER BY day) on the second row is 100; revenue - LAG(revenue) gives +30 day-over-day.
LEAD — look ahead
Same partition order; LEAD(revenue, 1) on the first row is 130 (next day’s revenue). Use it for “how much does tomorrow differ?” or to pair each row with the next event time when building gaps.
FIRST_VALUE / LAST_VALUE
- `FIRST_VALUE(col)` with `ORDER BY` in `OVER` is the value in the first row of the window (often the first row of the partition if the frame defaults to “start of partition through current row”).
- `LAST_VALUE(col)` is the value in the last row of the frame—with the default frame, that is often the current row, not the last row of the whole partition. To get the true last value of the partition, you may need an explicit frame such as `ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` (dialect-dependent), or use `ORDER BY … DESC` with `FIRST_VALUE` instead of fighting `LAST_VALUE` defaults.
Worked example
| emp_id | dept | salary |
|---|---|---|
| 1 | A | 70 |
| 2 | A | 90 |
| 3 | A | 80 |
With the usual frame “from start of partition through current row” and ORDER BY salary DESC, the first row in that ordering is always the highest salary (90), so FIRST_VALUE(salary) OVER (PARTITION BY dept ORDER BY salary DESC) repeats 90 on each row in dept A—a quick way to annotate every line with the dept max without a grouped subquery. If you need the minimum instead, use FIRST_VALUE with ORDER BY salary ASC in the same pattern.
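The LAST_VALUE default-frame gotcha is worth one concrete run. A minimal sketch with Python's built-in sqlite3 (window functions need SQLite ≥ 3.25), using the emp table above:

```python
# LAST_VALUE with the default frame vs an explicit UNBOUNDED FOLLOWING frame.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (emp_id INT, dept TEXT, salary INT)")
con.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(1, "A", 70), (2, "A", 90), (3, "A", 80)],
)

rows = con.execute("""
    SELECT emp_id,
           -- default frame ends at the current row, so this is just the row's own salary
           LAST_VALUE(salary) OVER (PARTITION BY dept ORDER BY salary) AS lv_default,
           -- explicit frame reaches the true end of the partition
           LAST_VALUE(salary) OVER (PARTITION BY dept ORDER BY salary
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS lv_full
    FROM emp ORDER BY salary
""").fetchall()
print(rows)  # [(1, 70, 90), (3, 80, 90), (2, 90, 90)]
```

The default-frame column never reaches past the current row; only the explicit frame returns the partition-wide 90 everywhere.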
Filtering window results: You cannot put WHERE RANK() = 1 in the same SELECT that defines the rank in most databases. Compute the window in an inner query / CTE, then WHERE rnk = 1 on the outside—exactly like the sample solution below.
Common beginner mistakes
- Omitting `ORDER BY` inside `OVER` when order defines the answer.
- Using `GROUP BY` when the question still needs one output row per input row.
- Using `ROW_NUMBER()` when the problem says tied rows should share the same rank—use `RANK` or `DENSE_RANK`.
SQL Interview Question on Window Functions and Ranking
Table sales(store_id, day, revenue) has one row per store per day. For each store, find the single best day by highest revenue (if two days tie for first, returning both is fine).
Solution Using RANK and a CTE
```sql
WITH ranked AS (
    SELECT store_id,
           day,
           revenue,
           RANK() OVER (PARTITION BY store_id ORDER BY revenue DESC) AS rnk
    FROM sales
)
SELECT store_id, day, revenue
FROM ranked
WHERE rnk = 1;
```
Why this works: RANK() orders days inside each store by revenue; rnk = 1 keeps all top days, including ties.
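A minimal sketch of the solution in Python's built-in sqlite3 (window functions need SQLite ≥ 3.25). The sample data is an assumption chosen so S1 has a revenue tie, showing that RANK keeps both top days:

```python
# Best day per store; RANK keeps every row tied for first place.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (store_id TEXT, day TEXT, revenue INT)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("S1", "D1", 200), ("S1", "D2", 200), ("S1", "D3", 150), ("S2", "D1", 90)],
)

best = con.execute("""
    WITH ranked AS (
        SELECT store_id, day, revenue,
               RANK() OVER (PARTITION BY store_id ORDER BY revenue DESC) AS rnk
        FROM sales
    )
    SELECT store_id, day FROM ranked WHERE rnk = 1
    ORDER BY store_id, day
""").fetchall()
print(best)  # [('S1', 'D1'), ('S1', 'D2'), ('S2', 'D1')]
```

Note the rank is computed in the CTE and filtered outside it, since `WHERE rnk = 1` cannot reference a window function defined in the same SELECT.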
Practice
- SQL · Topic — Window functions & ranking
- COMPANY · Meta — Meta-tagged window functions
- SQL · Deep dives — SQL deep dives library
5. Subqueries and Common Table Expressions (CTEs)
Subqueries and CTEs in SQL for Data Engineering
Many problems are multi-step: first compute something intermediate (averages per department, revenue per day, ranked rows), then join or filter using that result. Two ways to package that:
- Subquery: a `SELECT` inside parentheses in `FROM`, `WHERE`, or `SELECT`.
- CTE (`WITH`): the same logic with a name at the top: `WITH step1 AS (...) SELECT ... FROM step1 ...`.
Subquery in FROM (“inline view”)
- Pattern: `FROM (SELECT …) AS sub` — treat the inner query like a temporary table you join or filter.
- Use when: You need a shaped intermediate dataset (deduped ranks, daily sums) before the next step.
Worked example
orders: (user_id, amount) → inner query SELECT user_id, SUM(amount) AS s FROM orders GROUP BY user_id yields (u1, 90), (u2, 40). Outer query FROM (…) AS t JOIN profiles p ON p.id = t.user_id attaches profile columns to each user’s total.
Subquery in WHERE
- `WHERE col IN (SELECT …)` — the row is kept if `col` matches any value from the subquery (watch for `NULL` in the subquery with `NOT IN`).
- `WHERE EXISTS (SELECT 1 FROM … WHERE … correlated …)` — true if any matching row exists; often clearer than `IN` for existence checks.
- Scalar subquery: `WHERE amount > (SELECT AVG(amount) FROM …)` — the inner query must return at most one row and one column.
Worked example
orders: (id, dept_id, amount) — WHERE amount > (SELECT AVG(amount) FROM orders o2 WHERE o2.dept_id = orders.dept_id) keeps rows above that row’s department average (correlated scalar subquery).
Worked example — IN (membership)
products sku: A, B, C — discontinued sku: B. SELECT sku FROM products WHERE sku IN (SELECT sku FROM discontinued) → B only.
Worked example — EXISTS (existence, no need to return columns from inner query)
Same tables. SELECT sku FROM products p WHERE EXISTS (SELECT 1 FROM discontinued d WHERE d.sku = p.sku) → B—same result as IN here; EXISTS often reads better when the inner query is large or correlated.
Subquery in SELECT list
- `SELECT id, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_cnt` — runs per outer row; fine in interviews; on big data it is often rewritten as a join for performance.
Worked example
users: Ann (id=1), Bo (id=2). orders: two rows for Ann, none for Bo. SELECT u.id, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) returns (1, 2) and (2, 0).
Correlated subqueries
- The inner query references the outer row (e.g. `WHERE o.dept_id = e.dept_id` with `e` from outside). Think: “for each outer row, evaluate this.”
Worked example
employees (id, dept_id, salary): (1, A, 80k), (2, A, 60k), (3, B, 90k). For row 1, AVG(salary) WHERE dept_id = 'A' is 70k; WHERE salary > that avg keeps row 1 only among dept A.
CTEs (WITH)
- Readability: Steps read top to bottom like a pipeline: clean → aggregate → join.
- Reuse: The same CTE name can appear multiple times in the final query; subqueries in `FROM` must be duplicated or wrapped.
- Chaining: `WITH a AS (…), b AS (SELECT … FROM a …) SELECT … FROM b` — `b` can use `a`.
- Recursive CTEs (`WITH RECURSIVE`): for trees/org charts; specialty syntax—learn it when you hit that problem class.
Worked example
```sql
WITH per_day AS (
    SELECT day, SUM(amount) AS revenue
    FROM orders
    GROUP BY day
)
SELECT day, revenue
FROM per_day
WHERE revenue > 100;
```
Step 1 builds a daily revenue table; step 2 filters it—same logic as an inline subquery, but easier to read.
Subquery vs CTE (quick compare):
| | Subquery | CTE (WITH) |
|---|---|---|
| Readability | Fine for one small nest; deep nesting gets hard to read | Often easier to read—steps read top to bottom |
| Reuse | Repeat the whole nest if you need it twice | Same CTE name can be referenced multiple times in one query |
| Style | “Inline” | “Named pipeline” |
Common beginner mistakes
- Expecting a subquery to return one value when it returns many rows—use `IN`, `EXISTS`, or a join.
- Nesting many subqueries when `WITH` would make the steps obvious.
SQL Interview Question on Subqueries and Department Averages
From employees(id, dept_id, salary), list employees who earn more than their department’s average salary.
Solution Using a CTE and JOIN
```sql
WITH dept_avg AS (
    SELECT dept_id, AVG(salary) AS avg_sal
    FROM employees
    GROUP BY dept_id
)
SELECT e.id, e.dept_id, e.salary
FROM employees e
JOIN dept_avg d ON d.dept_id = e.dept_id
WHERE e.salary > d.avg_sal;
```
Why this works: dept_avg is a small table of one row per department; we join each employee to their department’s average and filter.
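A minimal sketch of the same CTE-plus-join solution in Python's built-in sqlite3, reusing the employees data from the correlated-subquery example above (salaries scaled to 80/60/90):

```python
# Employees above their department's average, via a CTE joined back in.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (id INT, dept_id TEXT, salary INT)")
con.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "A", 80), (2, "A", 60), (3, "B", 90)],
)

above = con.execute("""
    WITH dept_avg AS (
        SELECT dept_id, AVG(salary) AS avg_sal
        FROM employees GROUP BY dept_id
    )
    SELECT e.id FROM employees e
    JOIN dept_avg d ON d.dept_id = e.dept_id
    WHERE e.salary > d.avg_sal
    ORDER BY e.id
""").fetchall()
print(above)  # [(1,)] -- dept A avg is 70; dept B's only row equals its own avg
```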
Practice
- SQL · Topic — Subqueries & CTEs
- COMPANY · Meta — Meta-tagged CTEs
6. Date Handling and Time-Series Data Concepts
Date and Time-Series Handling in SQL
Time-series questions almost always mean: “Put each event in a time bucket (day, hour, week), then aggregate.”
Bucketing with date_trunc (PostgreSQL-style)
- `date_trunc('day', ts)` snaps every timestamp in that calendar day to the same instant (midnight at the start of the day). Common grains: `hour`, `week`, `month`.
- After truncating, `GROUP BY` the truncated value (or cast it to `date`) and apply `SUM`, `COUNT`, etc.
Worked example
| created_at (UTC) | amount |
|---|---|
| 2026-01-01 22:00 | 10 |
| 2026-01-02 03:00 | 20 |
The two rows fall on different calendar days in UTC, so GROUP BY date_trunc('day', created_at)::date gives two buckets: 2026-01-01 → 10, 2026-01-02 → 20.
INTERVAL and relative windows
- `CURRENT_DATE - INTERVAL '7 days'` or `NOW() - INTERVAL '1 hour'` expresses sliding windows without hard-coding calendar dates.
- Pair with `WHERE ts >= …` (and usually `< end` for half-open ranges).
Worked example: “Last 24 hours” from NOW(): WHERE event_ts >= NOW() - INTERVAL '24 hours'.
EXTRACT / date_part
- Pull hour of day, day of week (`dow`), month, etc. for “volume by hour” or seasonality slices: `EXTRACT(HOUR FROM ts)`.
Worked example
| ts |
|---|
| 2026-01-01 08:30 |
| 2026-01-01 09:15 |
GROUP BY EXTRACT(HOUR FROM ts) groups both into hours 8 and 9.
Difference between timestamps
- `end_ts - start_ts` (PostgreSQL `interval`) or `DATEDIFF`-style functions in other engines—useful for durations and “time between events.”
Worked example
| ordered_at | shipped_at |
|---|---|
| 2026-01-01 10:00 | 2026-01-01 16:00 |
shipped_at - ordered_at is a 6-hour interval (cast to minutes/seconds if the question asks for a number).
Rolling metrics (window frames)
- `SUM(amount) OVER (PARTITION BY store_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)` = 7-day trailing sum per store, one row per day still present.
- Change `ROWS` to `RANGE` only when you understand your SQL dialect’s semantics for gaps in dates.
Worked example (7-day trailing sum, one row per day stays)
| store_id | day | amount |
|---|---|---|
| S1 | 2026-01-01 | 10 |
| S1 | 2026-01-02 | 5 |
| S1 | 2026-01-03 | 20 |
| S1 | 2026-01-04 | 0 |
| S1 | 2026-01-05 | 15 |
| S1 | 2026-01-06 | 10 |
| S1 | 2026-01-07 | 5 |
SUM(amount) OVER (PARTITION BY store_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) on 2026-01-07 is 65 (sum of all seven days)—same pattern as section 4’s shorter 2-day illustration, stretched to a week of daily facts.
Boundaries: half-open vs BETWEEN
- Prefer `ts >= start_ts AND ts < end_ts` so an event exactly on `end_ts` is not counted twice across adjacent windows.
- `BETWEEN a AND b` includes both endpoints—fine for inclusive business dates; easy to off-by-one with timestamps.
Worked example: “All of January 2026” → `WHERE ts >= TIMESTAMP '2026-01-01' AND ts < TIMESTAMP '2026-02-01'` (midnight on Feb 1 is excluded).
Time zones
- Store and compare in UTC internally when possible; convert to local before `date_trunc('day', …)` if “calendar day” means a specific region.
Worked example: Event at 2026-01-02 02:00 UTC might be Jan 1 evening in New York—bucket by local day if the metric is “sales per US calendar day.”
Gaps, sessions, and neighbors
- Sort by time, then `LAG(ts)` / `LEAD(ts)` to compute gaps between consecutive events—foundation for sessionization and streak problems.
Worked example
| user_id | ts |
|---|---|
| u1 | t0 |
| u1 | t0 + 5 min |
| u1 | t0 + 2 hours |
Gaps are 5 minutes then 115 minutes; a 30-minute session timeout would break a new session after the second event.
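If it helps to see the gap logic outside SQL, here is a minimal Python sketch of the same 30-minute timeout rule (the `session_count` helper is hypothetical, invented for this example):

```python
from datetime import datetime, timedelta

def session_count(timestamps, timeout_minutes=30):
    """Count sessions in one user's time-sorted timestamps:
    a gap longer than the timeout starts a new session."""
    if not timestamps:
        return 0
    timeout = timedelta(minutes=timeout_minutes)
    sessions = 1
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > timeout:  # LAG-style gap check
            sessions += 1
    return sessions

t0 = datetime(2026, 1, 1, 12, 0)
events = [t0, t0 + timedelta(minutes=5), t0 + timedelta(hours=2)]
print(session_count(events))  # 2: the 115-minute gap opens a new session
```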
Common beginner mistakes
- Mixing `DATE` and `TIMESTAMP` without casting (e.g. `::date`).
- Using `BETWEEN` on timestamps when a half-open range is safer.
- Forgetting UTC vs local when filtering “today.”
SQL Interview Question on Dates and Daily Totals
orders(order_id, created_at, amount) in UTC. Return total amount per calendar day for the last 7 days (relative to CURRENT_DATE).
Solution Using date_trunc and GROUP BY
```sql
SELECT date_trunc('day', created_at)::date AS order_day,
       SUM(amount) AS daily_total
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY 1
ORDER BY 1;
```
Why this works: We filter to the rolling window, truncate each created_at to midnight, and sum per day.
Practice
- SQL · Topic — Date functions & time buckets
- COMPANY · Meta — dates Meta-tagged date functions
7. Handling NULL Values and Safe Calculations
NULL Handling and Safe Calculations in SQL
NULL means “we don’t know the value”—not zero, not “false,” not an empty string.
In WHERE and expressions
- `WHERE col = NULL` is invalid / wrong; use `IS NULL` or `IS NOT NULL`.
- `COALESCE(a, b)` returns the first non-null argument—use when the business says “treat unknown as default X.”
Worked example
| id | discount_pct |
|---|---|
| 1 | NULL |
| 2 | 10 |
SELECT COALESCE(discount_pct, 0) → 0 and 10. WHERE discount_pct IS NULL returns row 1 only.
How aggregates treat NULL (per bucket)
- `SUM` / `AVG` / `MIN` / `MAX`: `NULL` inputs are skipped; if there is nothing left to aggregate, the result is often `NULL`.
- `COUNT(*)`: counts rows, regardless of nulls in individual columns.
- `COUNT(col)`: counts rows where `col` is not `NULL`.
- `COUNT(DISTINCT col)`: counts distinct non-null values.
Worked example (one group, three rows)
| amount |
|---|
| 10 |
| NULL |
| 30 |
SUM = 40, AVG = 20, COUNT(*) = 3, COUNT(amount) = 2.
Percentages and rates
- Write the definition explicitly: numerator = rows (or sum) matching success; denominator = all rows in scope (or a filtered population). “Completion rate” changes if the denominator is “all tasks” vs “tasks that started.”
- `COUNT(*) FILTER (WHERE condition)` (PostgreSQL) builds numerators/denominators in one grouped query without subqueries.
- `SUM(CASE WHEN condition THEN 1 ELSE 0 END) * 1.0 / COUNT(*)` is the portable analog.
Worked example
| task_id | status |
|---|---|
| 1 | done |
| 2 | open |
| 3 | done |
Done fraction = 2 / 3, i.e. about 0.667. Query: COUNT(*) FILTER (WHERE status = 'done')::numeric / NULLIF(COUNT(*), 0) → 0.666…
Divide-by-zero
- If the denominator can be 0, use `NULLIF(denominator, 0)` so the division yields `NULL` instead of an error—then handle in the app or outer query.
Worked example
| player_id | hits | at_bats |
|---|---|---|
| 1 | 0 | 0 |
hits::numeric / NULLIF(at_bats, 0) → NULL, not a runtime error.
Empty input
- No rows after `WHERE`: aggregates like `COUNT(*)` → 0, `SUM` → `NULL` (typical)—state assumptions out loud in an interview.
Worked example: SELECT SUM(amount) FROM orders WHERE FALSE — no rows match → SUM is NULL in PostgreSQL; COUNT(*) would be 0.
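The three NULL-flavored edge cases above (rate definition, zero denominator, empty input) can be mirrored in plain Python—`done_fraction` is a hypothetical helper whose `None` plays the role of SQL’s `NULL`:

```python
def done_fraction(statuses):
    """Fraction of tasks marked 'done'; None when there are no tasks,
    mirroring NULLIF(COUNT(*), 0) guarding the division."""
    total = len(statuses)
    if total == 0:
        return None               # empty input: no meaningful rate
    done = sum(1 for s in statuses if s == "done")
    return done / total

print(done_fraction(["done", "open", "done"]))  # about 0.667
print(done_fraction([]))                        # None, not a crash
```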
Common beginner mistakes
- Treating `NULL` like 0 in business logic.
- Using the wrong denominator for a rate.
- Dividing without guarding zero with `NULLIF`.
SQL Interview Question on NULL-Safe Rates and Fractions
tasks(task_id, status) where status is 'done' or 'open'. What fraction of tasks are done?
Solution Using FILTER and NULLIF
```sql
SELECT COUNT(*) FILTER (WHERE status = 'done')::numeric
       / NULLIF(COUNT(*), 0) AS done_fraction
FROM tasks;
```
Why this works: Numerator = done rows; denominator = all tasks; NULLIF prevents divide-by-zero.
Practice
- SQL · Topic — NULL handling & safe rates
- COMPANY · Meta — hub Meta SQL & Python (includes null-edge drills)
8. Set Operations and Data Comparison Techniques
Set Operations in SQL (UNION, INTERSECT, EXCEPT)
Set problems sound like: “users in both A and B,” “customers who bought X but not Y,” “combine two lists of ids.” You are doing intersection, difference, or union on keys (usually user_id).
Set operators require both branches to return the same number of columns with compatible types.
INTERSECT
- Result: only rows that appear in both `SELECT`s (duplicate handling depends on dialect; often distinct rows).
- Interview mapping: “users who did both action A and action B” when each branch returns the same key column.
Worked example
mobile: user_ids u1, u2 — web: u2, u3 — INTERSECT → u2 only.
EXCEPT (some engines: MINUS)
- Result: Rows in the first query not present in the second.
- Interview mapping: “Signed up but never purchased,” “In feed A but not in feed B.”
Worked example
signups: u1, u2 — buyers: u2 — EXCEPT → u1 (“signed up, never bought” if buyers is “ever purchased”).
UNION
- Result: stack the two result sets and deduplicate rows (can sort + dedupe—often more expensive than `UNION ALL`).
Worked example: SELECT id FROM a yields 1,1,2; SELECT id FROM b yields 2,3 — UNION → 1, 2, 3 (unique).
UNION ALL
- Result: Stack results keeping all duplicates—preferred when you know duplicates are impossible or when you want repeated rows (e.g. concatenating event streams).
Worked example: Same as above → UNION ALL → 1,1,2,2,3 (five rows).
Join and EXISTS equivalents
- Intersection on keys: `INNER JOIN` on `user_id` from two deduped subqueries—or `WHERE EXISTS` for semi-join style.
- Difference (A not B): anti-join: `LEFT JOIN B ON … WHERE B.key IS NULL`.
- `NOT EXISTS (SELECT 1 FROM B WHERE …)` is often safer than `NOT IN` when `B` can produce `NULL` keys.
Worked example (semi-join with IN)
customers 1, 2, 3 — vip has customer_id 2 only. SELECT id FROM customers WHERE id IN (SELECT customer_id FROM vip) → 2.
Worked example (anti-join)
users 1,2 — buyers only user 2. FROM users u LEFT JOIN buyers b ON u.id = b.user_id WHERE b.user_id IS NULL → user 1 (not in buyers).
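Python’s set operators make the same key logic easy to eyeball—the values below are taken from the worked examples above:

```python
# Keys from the worked examples above
mobile = {"u1", "u2"}
web = {"u2", "u3"}
signups = {"u1", "u2"}
buyers = {"u2"}

both = mobile & web              # INTERSECT -> {'u2'}
never_bought = signups - buyers  # EXCEPT    -> {'u1'}
all_users = mobile | web         # UNION     -> {'u1', 'u2', 'u3'}

print(both, never_bought, sorted(all_users))
```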
GROUP BY + HAVING for “both” conditions
- `HAVING COUNT(DISTINCT CASE WHEN … THEN tag END) = 2` (or two boolean conditions) can express “user did both activities” in one fact table without set operators—useful when set SQL is awkward or slow.
Worked example
purchases
| user_id | product_code |
|---|---|
| u1 | BOOK |
| u1 | PEN |
| u2 | BOOK |
GROUP BY user_id HAVING MAX(CASE WHEN product_code = 'BOOK' THEN 1 ELSE 0 END) = 1 AND MAX(CASE WHEN product_code = 'PEN' THEN 1 ELSE 0 END) = 1 → u1 only (has both).
Common beginner mistakes
- Using `OR` when the problem needs both conditions on the same user (not “either event”).
- `NOT IN (subquery)` when the subquery can return `NULL`—use `NOT EXISTS` instead.
SQL Interview Question on INTERSECT and Multiple Purchases
purchases(user_id, product_code). Find user_ids who bought BOOK and PEN (possibly on different rows).
Solution Using INTERSECT
```sql
SELECT user_id
FROM purchases
WHERE product_code = 'BOOK'
INTERSECT
SELECT user_id
FROM purchases
WHERE product_code = 'PEN';
```
Why this works: First query = set of users with BOOK; second = set with PEN; intersect = users in both sets.
Practice
- SQL · Topic — Set operations (INTERSECT, EXCEPT, …)
- COMPANY · Meta — sets Meta-tagged set logic
9. Hash Maps and Counting Techniques in Python
Hash Maps and Counting in Python for Data Processing
A dict (hash map) maps keys to values. Lookup, insert, and update are O(1) on average (amortized)—much faster than rescanning a whole list for every key.
Plain dict and .get
- Frequency: `counts[k] = counts.get(k, 0) + 1` avoids `KeyError` on first sighting of `k`.
- Grouping into lists: `d.setdefault(k, []).append(item)`, or use `defaultdict` below.
Worked example
words = ["cat", "dog", "cat"] — after the loop, freq is {"cat": 2, "dog": 1}.
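The loop that produces that result is one line of state per word—a minimal sketch:

```python
words = ["cat", "dog", "cat"]

freq = {}
for w in words:
    freq[w] = freq.get(w, 0) + 1  # 0 default on the first sighting

print(freq)  # {'cat': 2, 'dog': 1}
```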
collections.Counter
- Built for frequency counts: `Counter(iterable)`, `.most_common(k)` for top-k.
- Good when the problem is “how many of each label?” with minimal code.
Worked example
Counter(["err", "ok", "err"]).most_common(1) # [('err', 2)]
defaultdict from collections
- `defaultdict(int)`—same ergonomics as counting with a 0 default.
- `defaultdict(list)`—`d[user_id].append(event)` mirrors `GROUP BY user_id` “collect all rows in a bucket.”
- `defaultdict(set)`—handy for unique neighbors per key.
Worked example
Rows ("u1", "click"), ("u1", "view") — dd["u1"] becomes ["click", "view"] with defaultdict(list).
defaultdict(set): Edges ("u1","u2"), ("u1","u3"), ("u1","u2") — g["u1"].add(...) yields {"u2", "u3"} (duplicates collapsed).
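Both worked examples above can be reproduced with a few lines of `defaultdict`:

```python
from collections import defaultdict

rows = [("u1", "click"), ("u1", "view")]
by_user = defaultdict(list)          # GROUP BY user_id: collect events
for user, event in rows:
    by_user[user].append(event)

edges = [("u1", "u2"), ("u1", "u3"), ("u1", "u2")]
neighbors = defaultdict(set)         # duplicates collapse in a set
for a, b in edges:
    neighbors[a].add(b)

print(by_user["u1"])                    # ['click', 'view']
print(neighbors["u1"] == {"u2", "u3"})  # True
```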
Multi-column GROUP BY
- Use a tuple key: `key = (country, day)` as the dict key when the bucket is more than one dimension.
Worked example
```python
totals = {}
totals[("US", "2026-01-01")] = totals.get(("US", "2026-01-01"), 0) + 50
```
→ one bucket per (country, day) pair.
Simulating SQL aggregates
| SQL idea | Python sketch |
|---|---|
| `COUNT(*)` per key | `counts[k] += 1` or `Counter` |
| `SUM(amount)` per key | `sums[k] = sums.get(k, 0) + amount` |
| `AVG(amount)` per key | Store `(sum, count)` per key, divide at the end |
| `COUNT(DISTINCT x)` per key | `d[k].add(x)` with a set per key |
Worked example
rows = [{"user": "a", "amt": 10}, {"user": "a", "amt": 20}, {"user": "b", "amt": 5}]
One pass: sums["a"]=30, sums["b"]=5.
Worked example — distinct per key
rows = [{"user": "a", "sku": "X"}, {"user": "a", "sku": "Y"}, {"user": "a", "sku": "X"}] — after uniq["a"].add(sku) with defaultdict(set), len(uniq["a"]) == 2 (matches COUNT(DISTINCT sku) GROUP BY user).
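A single pass can maintain several aggregates at once—this sketch combines the `SUM` and `COUNT(DISTINCT …)` examples above (the `sku` values are merged into the same rows for illustration):

```python
from collections import defaultdict

rows = [{"user": "a", "amt": 10, "sku": "X"},
        {"user": "a", "amt": 20, "sku": "Y"},
        {"user": "b", "amt": 5,  "sku": "X"}]

sums = defaultdict(int)   # SUM(amt) per user
skus = defaultdict(set)   # COUNT(DISTINCT sku) per user
for r in rows:
    sums[r["user"]] += r["amt"]
    skus[r["user"]].add(r["sku"])

print(dict(sums))                            # {'a': 30, 'b': 5}
print({u: len(s) for u, s in skus.items()})  # {'a': 2, 'b': 1}
```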
Why avoid nested scans?
- “For each distinct key, loop the entire list” is O(n²). One pass with a dict is typically O(n) time and O(distinct keys) space—what interviewers expect.
Worked example: With about 10,000 rows and 1,000 distinct keys, rescanning all rows per key is on the order of ten million operations; one dict pass stays on the order of ten thousand.
heapq for top-k (when k is small)
-
heapq.nlargest(k, iterable)/nsmallestreturn the k best items without sorting the whole list (O(n log k)). - For “top k keys by count,” you can also push (-count, key) into a min-heap of size k while streaming counts—useful when memory must stay O(k).
Worked example
```python
import heapq

counts = {"a": 5, "b": 2, "c": 9, "d": 1}
top2 = heapq.nlargest(2, counts.items(), key=lambda x: x[1])  # [('c', 9), ('a', 5)]
```
Common beginner mistakes
- Re-scanning the whole list inside a loop over unique keys.
- Forgetting tie-breaking (e.g. lexicographically smallest among max frequency).
- Not handling missing keys—use `.get(key, 0)` or `Counter`.
Python Interview Question on Frequency and Tie-Breaking
Given a list of words, return the word with the highest count (break ties by lexicographically smallest word).
Solution Using Counter and min
```python
from collections import Counter

def top_word(words: list[str]) -> str:
    cnt = Counter(words)
    max_freq = max(cnt.values())
    candidates = [w for w, c in cnt.items() if c == max_freq]
    return min(candidates)

# Example: top_word(["apple", "banana", "apple"]) -> "apple"
```
Why this works: Counter gets frequencies; we keep all words at the max count; min picks the tie-breaker.
Practice
- PYTHON · Topic — Hash tables & counting
- COMPANY · Meta — hash table Meta-tagged hash tables
10. Streaming Data and Interval Processing in Python
Streaming Data Processing in Python
Streaming means: events arrive one after another (or in time order). You keep a small piece of state—last timestamp seen, counts in the current window, “open” orders in a dict—and update it when the next event arrives. You should not rescan the entire history for each new line if you want an efficient solution.
Sorting event streams
- Most simulations assume `events.sort()` by timestamp (and stable tie-breaking: e.g. process end before start at the same time if “touching” is not overlap—depends on problem statement).
Worked example
events = [(10, "start"), (5, "end"), (7, "start")] — sort by time → (5,end), (7,start), (10,start) so processing order matches the real timeline.
Half-open intervals [start, end)
- Start included, end excluded—standard for “busy from second `start` up to but not including `end`.”
- Overlap test for `[a1, a2)` and `[b1, b2)`: `a1 < b2` and `a2 > b1`. (Closed intervals use different inequalities—match the prompt.)
Worked example: [1, 5) and [5, 8) — touch at 5 but no overlap (half-open).
Merge overlapping intervals
- Sort by start, then sweep: if the next interval overlaps the current merged block, extend the block’s end with `max(end1, end2)`; else start a new block. Used for “total covered time” after merging.
Worked example: [1,4) and [3,6) merge to [1,6) (covered length 5).
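A sketch of that sweep, assuming numeric half-open intervals (touching blocks stay separate):

```python
def merge_intervals(intervals):
    """Merge overlapping half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:         # overlaps current block
            merged[-1][1] = max(merged[-1][1], end)  # extend the block's end
        else:
            merged.append([start, end])              # start a new block
    return [tuple(b) for b in merged]

blocks = merge_intervals([(1, 4), (3, 6)])
print(blocks)                         # [(1, 6)]
print(sum(e - s for s, e in blocks))  # total covered time: 5
```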
Sweep line (concurrency / max overlap)
- Expand each interval to `(start, +1)` and `(end, -1)`; sort all points; walk while tracking a running balance of active intervals; `max` of the balance is the peak concurrency (see sample below).
Worked example: Intervals [0,3), [2,5) — at time 2 both are active → peak concurrency 2 (see full solution below).
Stateful dict (“stage machine”)
- `state[order_id] = (stage, last_ts)` (or similar): on each log line, update the entity’s state and maybe accumulate duration `now - last_ts` for the previous stage—pattern for marketplace order timelines.
Worked example
Log lines: order_A placed t=0, order_A shipped t=10. After second line, time-in-placed = 10 − 0; update state["order_A"] = ("shipped", 10).
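A minimal version of that state machine—the log-line shape `(order_id, stage, ts)` and the `time_in_stage` helper are assumptions for this sketch:

```python
def time_in_stage(log_lines):
    """Accumulate time spent per (order, stage); log_lines arrive in time order."""
    state = {}       # order_id -> (current stage, ts it was entered)
    durations = {}   # (order_id, stage) -> total time spent in that stage
    for order_id, stage, ts in log_lines:
        if order_id in state:
            prev_stage, last_ts = state[order_id]
            key = (order_id, prev_stage)
            durations[key] = durations.get(key, 0) + (ts - last_ts)
        state[order_id] = (stage, ts)
    return durations

d = time_in_stage([("order_A", "placed", 0), ("order_A", "shipped", 10)])
print(d)  # {('order_A', 'placed'): 10}
```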
Sliding windows in memory
- Keep a deque or index of events within the last N seconds; drop expired items as time advances—pattern for “active users in last 5 minutes” style prompts.
Worked example: Times 100, 250, 400 (seconds), window 300s. When processing 400, drop timestamps < 100 → deque holds [250, 400] (two events in window).
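That eviction loop in a few lines, treating the window as half-open so an event exactly 300 seconds old is dropped (matching the worked example):

```python
from collections import deque

WINDOW = 300          # seconds
window = deque()      # timestamps currently inside the window

def observe(ts):
    """Record an event at ts, evict expired timestamps, return the live count."""
    window.append(ts)
    while window and window[0] <= ts - WINDOW:  # half-open: exactly-300s-old goes
        window.popleft()
    return len(window)

for t in (100, 250, 400):
    live = observe(t)

print(list(window), live)  # [250, 400] 2
```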
Out-of-order events (real pipelines)
- Networks or collectors may deliver late rows. If the problem says “events can arrive out of order,” you may need to buffer by key until a watermark (e.g. “no more events with `ts < T` will arrive”), or re-sort a bounded buffer before updating state.
- Interview simulations usually assume sorted input or single-threaded append—if unsure, ask whether sorting first is allowed.
Worked example: Stream order ts = 5, 2, 8 for the same key. If you must emit “sum so far in time order,” hold rows in a min-heap by ts until the watermark passes 2 (e.g. you have seen ts ≥ 3), then release 2 before 5—or sort the batch if the problem allows.
Common beginner mistakes
- Not sorting by time before simulating.
- Closed intervals on both ends → accidental double counting at shared boundaries.
- Forgetting tie-breaking in `events.sort()` when start and end times coincide.
Python Interview Question on Intervals and Maximum Overlap
Events are (start, end) tuples when a server is busy. Use half-open intervals [start, end). Return the maximum number of overlapping busy intervals at any time.
Optimized Solution Using Sweep Line Algorithm
```python
def max_overlap(intervals: list[tuple[int, int]]) -> int:
    events = []
    for s, e in intervals:
        events.append((s, 1))   # start: +1 concurrent
        events.append((e, -1))  # end: -1 concurrent
    events.sort()
    cur = best = 0
    for _, delta in events:
        cur += delta
        best = max(best, cur)
    return best
```
Why this works: At any time, cur is how many intervals are active; we record the peak.
Practice
- PYTHON · Topic — Streaming & interval-style problems
- COMPANY · Meta — streaming Meta-tagged streaming
Tips to Crack Meta Data Engineering Interviews
These tips focus on the technical and problem-solving skills interviewers actually score—not textbook definitions. Solid data engineering interview preparation blends SQL drills with typed Python practice; cracking a loop at this level comes from repetition with feedback, not passive reading.
The Meta data engineer interview tips below are practical: habits, patterns, and where to drill on PipeCode. They do not re-explain sections 1–10—they tell you what to do with that material.
Quick checklist (prep habits):
- Drill SQL daily: joins, `GROUP BY`, window functions, dates, nulls, and set-style logic.
- Rehearse data pipeline and ETL thinking: sources, transforms, delivery, and what breaks at scale.
- Be ready to discuss data-oriented system design: schemas, data flow, and reliability—not just a single query.
- Strengthen Python for data processing: dictionaries, counting, streaming, and interval-style problems.
- Work real problems on the Meta company practice hub with tests, not theory alone.
SQL Preparation Tips
Strong SQL is non-negotiable. SQL preparation for interviews should emphasize correctness first, then clarity: name the grain (what one row means), say WHERE vs HAVING, and sanity-check join fan-out before you optimize. Timed reps beat rereading notes—use Meta · aggregation, Meta · filtering, Meta · joins, Meta · window functions, Meta · subqueries, and Meta · CTE for company-scoped drills; add Topic · null handling when you practice NULL-safe reporting.
Data Pipeline & ETL Tips
Many data engineering interview loops expect you to reason about pipelines: ingestion, transformation, and serving—often with ETL-style trade-offs (batch vs incremental, idempotency, late data). You do not need a slide deck—you need vocabulary: what is upstream, what is idempotent, what schema does the consumer need? On PipeCode, warm up with Meta · ETL, Meta · dimensional modeling, Meta · event modeling, and the broader topic · ETL hub.
System Design Preparation Tips
System design for data engineering is usually data-centric: sketch components (ingest, store, process, serve), data flow, failure modes, and scale (partitioning, backfill, duplicates). If you only prepare SQL in isolation, practice one whiteboard-style walkthrough per week: inputs, outputs, and where quality is enforced. Deep dives and Explore courses can complement problem reps when you want structured depth.
Coding & Python Tips
Python screens favor clear code over clever tricks. Prioritize Meta · hash table and topic · hash table for frequency and dict patterns; use Meta · streaming with topic · sliding window and topic · intervals for window and sweep-line style tasks. State time and space complexity when the interviewer signals they care.
Understand Common Interview Patterns
To crack data engineering interview problems faster, recognize the pattern before you code—aggregation, filtering, joins, windows, CTEs, dates, sets, hash maps, streaming. When a prompt feels new, map it to one of those shapes (see sections 1–10), then pick the smallest example and trace it by hand.
Where to practice on PipeCode
Primary loop: Meta company practice hub. Browse all practice topics and company hubs; full library: Explore practice. Commitment: Subscribe when you want full access.
| Skill lane | Where to practice on PipeCode |
|---|---|
| Aggregations & GROUP BY / HAVING | Meta · aggregation, Meta · grouping, Meta · having clause |
| Filtering | Meta · filtering |
| Joins & dedupe | Meta · joins, Meta · join |
| Windows & ranking | Meta · window functions, Meta · ranking |
| Subqueries & CTEs | Meta · subqueries, Meta · CTE |
| Dates & time-series | Meta · date functions, Meta · time-series |
| Set-style logic | Meta · set |
Interview habit: say WHERE vs HAVING and join grain out loud before you type.
Communication under time pressure
State assumptions (grain, nulls, ties), sketch a tiny test case, then code—interviewers reward data engineering judgment, not just syntax.
Frequently asked questions
What questions are asked in Meta data engineering interviews?
Loops often stress SQL (aggregations, joins, windows, dates, nulls, sets) and Python (hash maps, counting, streaming-style intervals). Exact questions vary by team and level; this guide teaches those topic types with original examples aligned to PipeCode’s Meta skill tags.
How to prepare for Meta data engineering interview questions?
Read the first teaching subheading under each topic (the SQL or Python explainer block), type the sample solutions, then practice on PipeCode’s Meta hub. Usually SQL first, then Python; 450+ problems are available for reps.
Are coding questions asked in Meta data engineering interviews?
Yes. Many candidates see SQL and Python exercises. This guide matches that mix with worked examples in sections 1–10—not a copy of any single live question.
What SQL questions are asked in Meta data engineering interviews?
Expect shapes like aggregation and GROUP BY, joins, window functions, dates, NULL-safe rates, and set-style logic. See sections 1–8; examples are PostgreSQL-style unless noted.
What are common data engineering interview problems at Meta?
Common patterns include summaries by grain, join fan-out, ranking and windows, time buckets, safe percentages, and Python frequency or interval sweeps. The topic table near the top lists them.
What skills are required to crack Meta data engineering interviews?
Strong SQL (aggregations, joins, windows, dates), solid Python for dicts, counts, and streaming-style logic, plus clear communication under time pressure. PipeCode offers Meta-tagged practice across these topics.
Start practicing Meta-style data engineering problems
PipeCode pairs company-tagged Meta problems with tests and feedback so you move from reading solutions to typing your own.
Pipecode.ai is Leetcode for Data Engineering