Gowtham Potureddi

Netflix Data Engineering Interview Questions

Netflix data engineering interview questions lean harder on production-DE patterns than the typical FAANG SQL/algorithm loop. Expect SQL problems built around window functions (LAG, LEAD, PARTITION BY), anti-joins, set operations, relational division, and time-series date logic—and Python problems that read like real pipeline code: monotonic stacks for span queries, sliding windows for streak detection, deque rolling buffers for fixed-memory streams, and resumable ETL with checkpoint files.

This guide walks through the eight topic clusters Netflix actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 13-problem Netflix interview set (1 easy, 11 medium, 1 hard) so you spend prep time on what shows up, not what doesn't.



Top Netflix data engineering interview topics

From the Netflix data engineering practice set, the eight numbered sections below follow this topic map (one row per H2):

# | Topic (sections 1–8) | Why it shows up at Netflix
1 | Hash maps and data analysis in Python | Two-title sequence analysis, popular-movie next-watch, user-engagement patterns.
2 | Monotonic stacks for span queries in Python | Consecutive left-span counts—classic stack-pattern interview question.
3 | SQL window functions, aggregation and joins | "Most-used promotion in California"—window inside a CTE, joined to a filter.
4 | SQL anti-join, set operations, relational division | "Products never purchased" (anti-join) and "products sold in all stores" (relational division).
5 | SQL window functions for time-series and date logic | Product price increases over time—LAG deltas + half-open date windows.
6 | Sliding window patterns in Python | Watch-time completion streaks (Hard) and sliding-window distinct counts.
7 | Streaming, deques, fixed-memory buffers | Fixed-memory stream buffer with collections.deque(maxlen=N).
8 | Resumable ETL with checkpoint files | Fault-tolerant pipelines that resume from the last good offset.

SQL evaluation order (mental model): FROM / joins → WHERE (row filter) → GROUP BY → aggregates → HAVING (group filter) → windows → SELECT → ORDER BY. Window functions run after WHERE but before SELECT—so you cannot reference a window column in WHERE. Wrap it in a CTE and filter outside.


1. Hash Maps and Data Analysis Patterns in Python

Hash-map fluency for data analysis in Python for data engineering

Hash maps are the foundational primitive of data engineering in Python. Almost every Netflix-style data-analysis problem—two-title sequence counts, popular next-watch, per-user engagement—reduces to "build a dict in one pass, query it in O(1)." The hard part is shape selection: a plain dict, a defaultdict, or a Counter each fit different access patterns.

Pro tip: Reach for Counter when the prompt says "count" or "frequency"; reach for defaultdict(lambda: defaultdict(int)) for 2-D maps like co-occurrence; use a plain dict for heterogeneous values.

dict vs defaultdict vs Counter

Three hash-map flavors, each optimized for a different access pattern. The shape choice doesn't change correctness, but it does change how you write the loop body—typically 3–5 lines per call site.

  • dict — Most general; missing keys raise KeyError. Wrap reads in .get(k, default) to avoid the exception.
  • defaultdict(factory) — Auto-initializes missing keys via factory(). defaultdict(int) for counts, defaultdict(list) for grouping, defaultdict(lambda: defaultdict(int)) for 2-D tables.
  • Counter(iterable) — dict subclass for integer frequencies. Gives you .most_common(k) and arithmetic (c1 + c2, c1 - c2, c1 & c2).
  • Footgun: reading a missing key on a defaultdict inserts it. if k not in d after a read is always False.
  • Tie-break footgun: Counter.most_common preserves insertion order on ties—not alphabetical. Sort manually with (-count, key) if the prompt requires deterministic tie-breaks.
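
Both footguns in a few lines (a minimal sketch; the keys and titles are invented for illustration):

from collections import defaultdict, Counter

counts = defaultdict(int)
_ = counts["missing"]        # the read itself inserts "missing" with value 0
print("missing" in counts)   # True: the membership check can no longer catch the miss

c = Counter(["witcher", "crown"])                            # tied at count 1
print(c.most_common(1))                                      # [('witcher', 1)]: insertion order wins the tie
print(sorted(c.items(), key=lambda x: (-x[1], x[0]))[0][0])  # 'crown': explicit alphabetical tie-break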

Worked example: Counting titles watched per user across two days.

user | titles_day_1           | titles_day_2
A    | ["Stranger", "Crown"]  | ["Crown"]
B    | ["Stranger"]           | ["Stranger", "Witcher"]

Goal: per-user title counter → {'A': {'Stranger': 1, 'Crown': 2}, 'B': {'Stranger': 2, 'Witcher': 1}}.

from collections import defaultdict, Counter

def per_user_counts(rows: list[tuple[str, list[str]]]) -> dict[str, dict[str, int]]:
    out: dict[str, Counter] = defaultdict(Counter)
    for user, titles in rows:
        out[user].update(titles)
    return {u: dict(c) for u, c in out.items()}

Co-occurrence counting with nested defaultdict

Co-occurrence answers "given a sequence of events, how often does A appear immediately before B?" The right shape is defaultdict(lambda: defaultdict(int))—nested keys make "all successors of X" an O(1) jump instead of an O(N) scan over flat tuple keys.

  • Idiom: zip(history, history[1:]) yields consecutive pairs. A history of length N produces N − 1 pairs.
  • Edge cases: length < 2 yields zero pairs (no defensive guard needed); repeats like ["A","A","A"] produce (A, A): 2 correctly.
  • Worked example: History ["A", "B", "C", "B", "A"] → pairs (A,B), (B,C), (C,B), (B,A) → {A: {B: 1}, B: {C: 1, A: 1}, C: {B: 1}}.
from collections import defaultdict

def co_occurrence(history: list[str]) -> dict[str, dict[str, int]]:
    pairs: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for a, b in zip(history, history[1:]):
        pairs[a][b] += 1
    return {a: dict(b) for a, b in pairs.items()}

Most-frequent next-watch with tie-breaks

Once you have a 2-D co-occurrence table, "most frequent successor of title X" is a sort by composite key. The naive max(pairs[X], key=pairs[X].get) returns one of the tied keys nondeterministically—real prompts always specify tie-break behavior.

  • (-count, name) — The canonical Python tie-break key: primary descending (negate to flip), secondary ascending (lexicographic). Generalizes to any number of tie-break rules.
  • Stable sort — Python's sort is stable, so equal composite keys preserve insertion order. Sorting explicitly is safer than relying on it.
  • Worked example: Co-occurrence above + query "next watch for B?" → {C: 1, A: 1} → tied at 1 → alphabetical A wins.
def most_frequent_next(pairs: dict[str, dict[str, int]], title: str) -> str | None:
    if title not in pairs or not pairs[title]:
        return None
    return sorted(pairs[title].items(), key=lambda x: (-x[1], x[0]))[0][0]

Common beginner mistakes

  • Using a plain dict and forgetting to handle missing keys (raises KeyError).
  • Trusting Counter.most_common to break ties alphabetically (it does not—uses insertion order).
  • Building flat counters when nested counters are needed (and vice versa)—mismatched shape fails downstream.
  • Forgetting that zip(history, history[1:]) yields zero pairs when history has fewer than 2 items.
  • Calling if k not in d after a read on a defaultdict—the read inserted k, so the check is always False.

Python interview question on hash maps and data analysis

Given a list of user watch sessions where each session is a list of titles in viewing order, build a structure that, for any title, can return the most common next title with deterministic tie-breaks (alphabetical).

Solution using nested defaultdict and a sort key

from collections import defaultdict

def build_next_watch(sessions: list[list[str]]) -> dict[str, str]:
    pairs: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for history in sessions:
        for a, b in zip(history, history[1:]):
            pairs[a][b] += 1
    out: dict[str, str] = {}
    for title, counter in pairs.items():
        best = sorted(counter.items(), key=lambda x: (-x[1], x[0]))[0][0]
        out[title] = best
    return out

Why this works: A single pass per session counts every consecutive pair in O(L) where L is the session length—total O(total_titles). The nested defaultdict builds the 2-D counter in one line, no KeyError risk. Sorting (-count, name) resolves ties alphabetically. The output is a flat title → next_title map, so downstream lookups are O(1)—you've shifted cost from query time to index time, which is the right trade-off when you expect many queries against one fixed dataset (the typical Netflix recommender setup).
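
A quick usage check on illustrative session data (invented for this example):

sessions = [
    ["Stranger", "Crown", "Witcher"],
    ["Crown", "Witcher"],
    ["Crown", "Stranger"],
]
print(build_next_watch(sessions))
# {'Stranger': 'Crown', 'Crown': 'Witcher'}
# Crown -> Witcher (count 2) beats Crown -> Stranger (count 1); Witcher has no successors, so it gets no entry.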

PYTHON
Topic — hash table
Hash table problems

Practice →

COMPANY
Netflix — hash table
Netflix-tagged hash table

Practice →


2. Monotonic Stacks for Span and Range Queries in Python

Monotonic stacks in Python for data engineering

Monotonic stacks are the canonical optimization for problems shaped like "for each element, find the nearest left/right neighbor satisfying a comparison." The brute-force is O(n²); the monotonic-stack rewrite is amortized O(n) because every element is pushed and popped at most once. Netflix's consecutive-left-span problem (#128) is the textbook stock-span variant.

Pro tip: The mental model: the stack holds "open questions"—each stacked element is asking "what is the next element greater than me?" When a larger element arrives, that question is answered, the element is popped, and the running aggregate updates. Elements smaller than the new arrival can never be the boundary for anything later, which is why the stack stays bounded.

Stack basics: LIFO and amortized O(n)

A stack is a LIFO container with push and pop. In Python, a plain list works as a stack: list.append and list.pop are both O(1) amortized. Avoid collections.deque for stack-only use—it has slightly higher constant factors.

  • Amortized argument — Total work is n + total_pops; since every element is pushed once and popped at most once, total_pops ≤ n, so total work is O(2n) = O(n). Memorize this sentence; interviewers occasionally ask you to defend it.
  • Worked example: Push 3, 1, 4 then pop twice → 4, 1 come off, stack becomes [3].
def stack_demo() -> list[int]:
    stack: list[int] = []
    for x in [3, 1, 4]:
        stack.append(x)
    stack.pop()  # 4
    stack.pop()  # 1
    return stack  # [3]

Monotonic decreasing stack for "next greater" / "previous greater"

A monotonic decreasing stack keeps elements in non-increasing order from bottom to top. As you scan left-to-right, when the new element is larger than the stack top, pop until the top is larger or the stack is empty. Popped elements just had their next-greater answered; whatever's left is the previous-greater of the new push.

  • Decreasing vs increasing — Decreasing stack answers "next/previous greater"; increasing stack answers "next/previous smaller." Pick by the predicate.
  • Indices vs values — Store indices if you need distances (i - stack[-1]); store values if you only need values.
  • Strict vs non-strict — "Strictly greater" pops while top ≤ x; "greater or equal" pops while top < x. Off-by-one mistakes here are silent bugs.
  • Worked example: Previous-greater for [6, 3, 4, 2, 5]:
i | x | stack before | pop while top ≤ x | prev-greater | stack after
0 | 6 | []           |                   | none         | [6]
1 | 3 | [6]          |                   | 6            | [6, 3]
2 | 4 | [6, 3]       | pop 3             | 6            | [6, 4]
3 | 2 | [6, 4]       |                   | 4            | [6, 4, 2]
4 | 5 | [6, 4, 2]    | pop 2, pop 4      | 6            | [6, 5]

Result: [None, 6, 6, 4, 6].

def previous_greater(arr: list[int]) -> list[int | None]:
    stack: list[int] = []
    out: list[int | None] = []
    for x in arr:
        while stack and stack[-1] <= x:
            stack.pop()
        out.append(stack[-1] if stack else None)
        stack.append(x)
    return out

Span counting (consecutive left span)

The span of element i is the count of consecutive prior elements (including i) that are ≤ arr[i]. The naive walk-leftward solution is O(n²); the monotonic-stack rewrite stores (value, span) tuples and absorbs popped spans into the new element's span when it dominates them.

  • Why tuples? When popping (v, s) because arr[i] ≥ v, the new element's span absorbs s—every element subsumed by the popped one is also subsumed by arr[i] (transitivity). No rescan needed.
  • Worked example: arr = [100, 80, 60, 70, 60, 75, 85]:
i | x   | running span logic                                 | output
0 | 100 | start, span = 1                                    | 1
1 | 80  | 80 < 100, span = 1                                 | 1
2 | 60  | 60 < 80, span = 1                                  | 1
3 | 70  | 70 ≥ 60 (pop, +1), 70 < 80, span = 2               | 2
4 | 60  | 60 < 70, span = 1                                  | 1
5 | 75  | 75 ≥ 60 (+1), 75 ≥ 70 (+2), 75 < 80, span = 4      | 4
6 | 85  | 85 ≥ 75 (+4), 85 ≥ 80 (+1), stack empty, span = 6  | 6

Result: [1, 1, 1, 2, 1, 4, 6].

def left_spans(arr: list[int]) -> list[int]:
    stack: list[tuple[int, int]] = []  # (value, span)
    out: list[int] = []
    for x in arr:
        span = 1
        while stack and stack[-1][0] <= x:
            span += stack.pop()[1]
        out.append(span)
        stack.append((x, span))
    return out

Common beginner mistakes

  • Reaching for an O(n²) nested loop and missing the linear-time monotonic-stack rewrite.
  • Storing only values instead of (value, span)—you cannot recover the contribution of popped elements without rescanning.
  • Using a strict comparison (<) when the prompt requires a non-strict one (≤), or vice versa—off-by-one on the boundary.
  • Forgetting that the bottom of the stack may be empty after popping—handle the "no boundary" case.
  • Using collections.deque when a list is sufficient (deque has constant-factor overhead; list is faster for stack-only use).

Python interview question on monotonic stacks

Given an array of daily prices, return for each day the count of consecutive prior days (including today) where the price was less than or equal to today's price. Use O(n) time.

Solution using a monotonic decreasing stack of (price, span)

def consecutive_left_span(prices: list[int]) -> list[int]:
    stack: list[tuple[int, int]] = []
    out: list[int] = []
    for p in prices:
        span = 1
        while stack and stack[-1][0] <= p:
            span += stack.pop()[1]
        out.append(span)
        stack.append((p, span))
    return out

Why this works: Every price is pushed once and popped at most once, so total work is O(n) amortized—much faster than the O(n²) brute force on long inputs. The stack holds only "still-relevant" boundaries: values smaller than the current price contributed to the current span and are no longer needed; the boundary that stops the span stays on the stack for later queries. The (value, span) tuple lets us recover the contribution of popped elements in O(1) without rescanning—the absorption step is the whole point of the rewrite.

PYTHON
Topic — stack
Stack problems

Practice →

COMPANY
Netflix — array
Netflix-tagged array

Practice →


3. SQL Window Functions, Aggregation, and Joins

Window functions and aggregation in SQL for data engineering

Window functions are SQL's answer to "per-row analytics that depend on other rows." Where pre-window SQL needed self-joins or correlated subqueries for "running total alongside each order," windows do it as one expression: SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date). The simplified evaluation order: FROM → WHERE → GROUP BY → aggregates → HAVING → windows → SELECT → ORDER BY. Windows run after WHERE, so you cannot reference a window column in WHERE—wrap in a CTE.

Pro tip: Netflix problems test the layering: pre-filter, group, window, filter window output. Walk through it verbally before typing—interviewers grade explanation as much as code.

Diagram showing a per-row date and price table on the left, a LAG window column in the middle, and a consecutive-increases output card on the right.

Window vs aggregate: SUM() OVER vs SUM() GROUP BY

The single most common window bug is confusing these two constructs. Aggregates collapse grain; windows preserve it.

  • SUM(col) GROUP BY key — Input N rows → output K rows (one per group). Grain changes.
  • SUM(col) OVER (PARTITION BY key) — Input N rows → output N rows. Each row gets the partition total alongside its own data.
  • Empty OVER () — No partition, no order: aggregate over the entire result set, broadcast to every row. Useful for "percentage of grand total"; bug if accidental.
  • Rule of thumb: "for each row, also show…" → window. "for each X, return…" → aggregate.
  • Worked example: Three orders, two customers [(1, 10), (1, 20), (2, 5)] → GROUP BY returns 2 rows (1, 30), (2, 5); OVER (PARTITION BY customer_id) returns 3 rows with customer_total 30, 30, 5.
SELECT
  customer_id,
  amount,
  SUM(amount) OVER (PARTITION BY customer_id) AS customer_total
FROM orders;

PARTITION BY and ORDER BY inside OVER()

OVER() accepts three nested clauses: PARTITION BY, ORDER BY, and the frame. Together they define which rows form this row's window and in what order.

  • PARTITION BY — Buckets rows into independent groups; window resets at each boundary. Without it, the window spans the entire result set.
  • ORDER BY — Required for LAG, LEAD, ROW_NUMBER, RANK, running totals. Without it, "the prior row" is undefined.
  • Frame — Default with ORDER BY is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. RANGE compares values (duplicates included together); ROWS compares physical positions. Specify explicitly when correctness matters.
  • Worked example: Running total per customer, ordered by date.
SELECT
  customer_id, order_date, amount,
  SUM(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM orders;

RANK / DENSE_RANK / ROW_NUMBER

Three ranking functions, three different tie semantics. Picking the wrong one is the second-most-common window bug after the WHERE-on-window confusion.

  • ROW_NUMBER() — Unique sequential integers; ties broken arbitrarily by storage order. Add a tie-break column to ORDER BY for determinism.
  • RANK() — Olympic-medal model: tied rows share a rank; next rank skips the tied positions. [95, 90, 90, 85] → 1, 2, 2, 4.
  • DENSE_RANK() — Tied rows share a rank; next rank does not skip. [95, 90, 90, 85] → 1, 2, 2, 3.
  • Worked example: Scores [95, 90, 90, 85]:
score | row_number | rank | dense_rank
95    | 1          | 1    | 1
90    | 2          | 2    | 2
90    | 3          | 2    | 2
85    | 4          | 4    | 3
SELECT
  score,
  ROW_NUMBER() OVER (ORDER BY score DESC) AS rn,
  RANK()       OVER (ORDER BY score DESC) AS rnk,
  DENSE_RANK() OVER (ORDER BY score DESC) AS drk
FROM scores;

Filtering on a window column via CTE

WHERE runs before windows, so WHERE rnk = 1 directly on a window column is invalid. The fix is structural: wrap the window in a CTE, then filter the CTE in an outer SELECT.

  • Alternatives — QUALIFY (Snowflake / BigQuery / Teradata) is a dedicated clause for window filters but isn't portable to PostgreSQL. Default to the CTE wrapper in Postgres-style interviews.
  • Muscle memory: WITH ranked AS (SELECT …, RANK() OVER … AS rnk FROM …) SELECT … FROM ranked WHERE rnk = ….
  • Worked example: Rank promotions by use-count and keep the top.
WITH promo_counts AS (
  SELECT promo_id, COUNT(*) AS uses
  FROM orders
  GROUP BY promo_id
),
ranked AS (
  SELECT promo_id, uses, RANK() OVER (ORDER BY uses DESC) AS rnk
  FROM promo_counts
)
SELECT promo_id, uses FROM ranked WHERE rnk = 1;

Common beginner mistakes

  • Writing WHERE rnk = 1 directly on a window column—evaluation order makes it invalid.
  • Using RANK when the prompt requires no gaps (DENSE_RANK) or arbitrary tie-break (ROW_NUMBER).
  • Forgetting PARTITION BY and computing a global running total instead of a per-customer one.
  • Writing SUM(amount) OVER () (empty OVER) and getting a single value broadcast to every row—occasionally the goal, often a bug.
  • Defaulting to RANGE when ROWS is needed (or vice versa)—silent wrong answer when the ORDER BY column has duplicates.

SQL interview question on window functions

Table orders(order_id, customer_id, promo_id, amount, region). For region 'CA', return the promo_id with the most uses (ties on use-count broken alphabetically by promo_id).

Solution using RANK inside a CTE

WITH ca_counts AS (
  SELECT promo_id, COUNT(*) AS uses
  FROM orders
  WHERE region = 'CA'
  GROUP BY promo_id
),
ranked AS (
  SELECT promo_id, uses,
         RANK() OVER (ORDER BY uses DESC, promo_id ASC) AS rnk
  FROM ca_counts
)
SELECT promo_id, uses FROM ranked WHERE rnk = 1;

Why this works: The WHERE region = 'CA' filters rows before grouping (correct evaluation stage), so the count is per-region. The first CTE collapses rows to one per promo_id. The second CTE adds the rank with a tie-break on promo_id ASC—the same (-count, name) pattern in SQL form, where uses DESC flips ascending into descending and promo_id ASC resolves ties alphabetically. The outer WHERE rnk = 1 is legal because rnk is a CTE column, not a live window. The query plan: scan-with-filter on orders → hash aggregate by promo_id → window sort → filter—four logical stages, one query.

Drill SQL window function patterns →

SQL
Topic — window functions
Window function problems

Practice →

COMPANY
Netflix — window functions
Netflix-tagged window functions

Practice →


4. SQL Anti-Joins, Set Operations, and Relational Division

Anti-joins and set algebra in SQL for data engineering

SQL has two ways to express negation ("rows in A with no match in B") and universal quantification ("rows that match in all of B"). Netflix tests this twice: products-never-purchased (#192, anti-join) and products-sold-in-all-stores (#193, relational division). Each problem has multiple equivalent formulations; the interview signal is whether you know which one a reviewer can read in five seconds.

Pro tip: For relational division, lead with the count-match HAVING form (single pass, most readable) and have double-NOT EXISTS ready as the relational-algebra-formal follow-up.

Anti-join: LEFT JOIN … WHERE rhs IS NULL vs NOT EXISTS

Three SQL shapes for "rows in A with no matching row in B," each with a different mechanical and readability profile.

  • LEFT JOIN … WHERE rhs.id IS NULL — Outer join, then filter to unmatched rows. Planners usually rewrite to a hash-anti-join. Use the right table's non-nullable PK in the IS NULL check.
  • NOT EXISTS (SELECT 1 FROM B WHERE B.x = A.x) — More declarative; planner short-circuits on first match. Handles multi-column predicates cleanly. Preferred for readability and correctness.
  • NOT IN (subquery) — The NULL trap: a single NULL row in the subquery makes x NOT IN (NULL, ...) evaluate to UNKNOWN for every x, filtering out everything. Avoid; use NOT EXISTS.
  • Worked example: products = [p1, p2, p3], purchases(p_id, region) = [(p1, WA), (p2, CA)]. "Products never purchased in WA" → p2, p3.
-- LEFT JOIN form
SELECT p.p_id
FROM products p
LEFT JOIN purchases pu
  ON pu.p_id = p.p_id AND pu.region = 'WA'
WHERE pu.p_id IS NULL;

-- NOT EXISTS form (equivalent)
SELECT p.p_id
FROM products p
WHERE NOT EXISTS (
  SELECT 1 FROM purchases pu
  WHERE pu.p_id = p.p_id AND pu.region = 'WA'
);

Set operations: INTERSECT / EXCEPT / UNION

Three relational-algebra primitives. They take two SELECTs with compatible column lists and combine the result sets row-by-row.

  • UNION — Set union with deduplication (sort/hash overhead). UNION ALL is the bag variant—much faster, use whenever dedup isn't needed.
  • INTERSECT — Rows in both sides. INTERSECT ALL is the multiset variant.
  • EXCEPT (MINUS in Oracle) — Rows in left but not right. EXCEPT ALL is count-aware.
  • NULL semantics — Set ops treat NULLs as equal (different from = comparison where NULL = NULL is UNKNOWN).
  • Worked example: s1 carries [p1, p2], s2 carries [p2, p3]. "In both" → INTERSECT → p2. "In s1 not s2" → EXCEPT → p1.
SELECT product_id FROM store_products WHERE store_id = 's1'
INTERSECT
SELECT product_id FROM store_products WHERE store_id = 's2';

Relational division: "products in all stores"

Relational division answers "for all" questions: "products that appear in every store." No single keyword—you build it from primitives. Three idioms:

  • Double NOT EXISTS — Textbook formal-algebra shape: "no store exists where this product does not exist." Correct, terse, slow on large tables.
  • Count-match HAVINGCOUNT(DISTINCT store_id) = (SELECT COUNT(*) FROM stores). Single pass, most readable; usually fastest.
  • EXCEPT-based — Cross-product products × stores minus actual pairs, then exclude products with leftovers. Verbose; rarely best.
  • Correctness: use COUNT(DISTINCT store_id) not COUNT(*)—duplicate inventory rows would otherwise inflate the count.
  • Worked example: Stores [s1, s2, s3]. p2 in all 3 → passes; p1 in 2 of 3 → fails.
SELECT product_id
FROM store_products
GROUP BY product_id
HAVING COUNT(DISTINCT store_id) = (SELECT COUNT(*) FROM stores);

Common beginner mistakes

  • Using INNER JOIN when the prompt asks for "rows missing from B"—you need an anti-join, not an inner.
  • Confusing EXCEPT with NOT IN on a column that contains NULL—NOT IN (NULL) is UNKNOWN, drops everything.
  • Writing COUNT(*) instead of COUNT(DISTINCT store_id) for relational division—duplicates inflate the count.
  • Forgetting UNION ALL vs UNION—UNION deduplicates and is slower.
  • Putting unrelated columns in the EXCEPT projection—set operators compare whole rows, so extra columns change the semantics.

SQL interview question on anti-join and relational division

Tables products(product_id), stores(store_id), inventory(product_id, store_id). Return all product_ids carried in every store.

Solution using GROUP BY and a count-match HAVING

SELECT product_id
FROM inventory
GROUP BY product_id
HAVING COUNT(DISTINCT store_id) = (SELECT COUNT(*) FROM stores);

Why this works: The GROUP BY collapses inventory to one row per product with the count of distinct stores carrying it. The scalar subquery (SELECT COUNT(*) FROM stores) is the universal-quantifier denominator—it's evaluated once and treated as a constant by the planner. Equality means the product appears in every store. COUNT(DISTINCT store_id) defends against duplicate inventory rows (same product appears twice in the same store)—a subtle but common bug. Total cost: one scan of inventory plus one scan of stores, both linear in their respective sizes.

SQL
Topic — anti-join
Anti-join problems

Practice →

SQL
Topic — set operations
Set operation problems

Practice →


5. SQL Window Functions for Time-Series and Date Logic

Time-series patterns in SQL for data engineering

Time-series problems combine LAG/LEAD for prior-row deltas, half-open date filters to avoid boundary bugs, and gaps-and-islands patterns to label or count consecutive runs. Netflix's product-price-increases problem (#194) tests all three at once: compute the prior-month price, count strict month-over-month increases, bound the analysis to a date range.

Pro tip: The gaps-and-islands shape ("longest streak," "label every island") combines LAG to detect transitions with running sums or ROW_NUMBER arithmetic to assign island ids. Memorize the shape; it scales without changing structure.

LAG and LEAD for prior-row deltas

LAG(col, n, default) returns col n rows before the current row in window order; LEAD(col, n, default) looks ahead. Both are computed inside OVER (PARTITION BY … ORDER BY …). Boundary rows return NULL unless a default is supplied.

  • First/last row — LAG(col, 1) is NULL on the first row per partition; LEAD is NULL on the last. Provide an explicit default if NULL would propagate badly through downstream arithmetic.
  • Deterministic ORDER BY — Most common bug: ties on the order key make "the prior row" non-deterministic. Add a tie-break column.
  • Frame — LAG and LEAD ignore the frame; it matters for FIRST_VALUE, LAST_VALUE, NTH_VALUE.
  • Worked example:
product_id | month   | price | prev_price (LAG)
1          | 2026-01 | 10    | NULL
1          | 2026-02 | 12    | 10
1          | 2026-03 | 12    | 12
1          | 2026-04 | 15    | 12
SELECT
  product_id, month, price,
  LAG(price) OVER (PARTITION BY product_id ORDER BY month) AS prev_price
FROM product_prices;

Half-open date windows (>= start AND < end)

"Find rows in March 2026" sounds trivial but is the most common SQL date bug. BETWEEN '2026-03-01' AND '2026-03-31' is inclusive on both sides and either over- or under-includes timestamp boundaries depending on dialect rounding.

  • Half-open convention — >= start AND < end is the only timezone- and precision-safe shape: lower inclusive, upper exclusive. Works for dates, timestamps, and timestamptz.
  • Chains cleanly — Q1 2026: >= '2026-01-01' AND < '2026-04-01'. No off-by-one math.
  • Timezones — timestamptz literals coerce to the session timezone; be explicit (AT TIME ZONE 'UTC') when production matters.
  • Worked example: Prices recorded at any timestamp in March 2026.
SELECT *
FROM product_prices
WHERE recorded_at >= '2026-03-01'
  AND recorded_at <  '2026-04-01';

Counting consecutive increases via window arithmetic

"How many strict month-over-month price increases per product?" reduces to counting rows where price > LAG(price). The clean shape combines counting and filtering with SUM(CASE WHEN … THEN 1 ELSE 0 END)—no nested WHERE.

  • Why it works — SQL aggregates skip NULL. The first row's LAG is NULL, so price > prev_price evaluates to NULL; without an ELSE the CASE returns NULL and SUM skips it, and with ELSE 0 the row contributes zero—either way the first row never counts as an increase, which is correct since there's no prior row to compare to.
  • ELSE 0 vs no ELSESUM(CASE WHEN p > prev THEN 1 END) (no else) also works—non-matches return NULL and are skipped. ELSE 0 is more explicit.
  • Worked example: [10, 12, 12, 15] → strict increases at month 2 (10→12) and month 4 (12→15) = 2.
WITH with_lag AS (
  SELECT
    product_id, month, price,
    LAG(price) OVER (PARTITION BY product_id ORDER BY month) AS prev_price
  FROM product_prices
)
SELECT
  product_id,
  SUM(CASE WHEN price > prev_price THEN 1 ELSE 0 END) AS increases
FROM with_lag
GROUP BY product_id;

Common beginner mistakes

  • Using BETWEEN for date ranges and over-including the upper boundary on timestamp columns.
  • Forgetting PARTITION BY product_id—the LAG then crosses product boundaries and produces nonsense.
  • Counting "increases or equals" (>=) when the prompt says strict increase (>).
  • Sorting by an ambiguous column (e.g. created_at rounded to the day with multiple rows per day)—non-deterministic order.
  • Forgetting that LAG returns NULL for the first row and that subsequent NULL-aware logic must handle this case.

SQL interview question on time-series windows

Table product_prices(product_id, month, price). Return per-product the count of strict month-over-month price increases. Use LAG.

Solution using LAG inside a CTE and a SUM(CASE) aggregate

WITH with_lag AS (
  SELECT
    product_id, month, price,
    LAG(price) OVER (PARTITION BY product_id ORDER BY month) AS prev_price
  FROM product_prices
)
SELECT
  product_id,
  COALESCE(SUM(CASE WHEN price > prev_price THEN 1 ELSE 0 END), 0) AS increases
FROM with_lag
GROUP BY product_id
ORDER BY product_id;

Why this works: LAG produces the prior-month price within each product partition. The first row's LAG is NULL, so price > NULL is NULL and the row falls through to ELSE 0—it contributes nothing to the count, which is correct since there is no prior month. SUM(CASE WHEN …) is the canonical SQL replacement for counting boolean predicates. COALESCE(... , 0) is a belt-and-braces guard: it only matters if you drop the ELSE 0, in which case a single-row product's SUM would be NULL instead of 0. The query plan: scan product_prices → window sort by (product_id, month) → window function adds the prev_price column → hash group-by on product_id → output. Single pass over the input table after the sort.

SQL
Topic — time series
Time-series problems

Practice →

COMPANY
Netflix — window functions
Netflix-tagged window functions

Practice →


6. Sliding Window Patterns in Python

Sliding windows in Python for data engineering

Sliding window is a family of patterns sharing one mechanic: maintain a window over a sequence, advance it one step at a time, update state in O(1) per step instead of recomputing from scratch. Netflix's only Hard problem (#300, watch-time completion streaks) and #551 (sliding-window distinct counts) both reduce to this template.

Pro tip: Every sliding-window problem has two pointers and a window state. Right always moves forward; left moves forward when the window violates an invariant. State the template before writing code—it scaffolds the solution and prevents whiteboard-flailing.

Fixed-size vs variable-size sliding window

Two flavors, one decision rule based on the prompt's verb.

  • Fixed-size — Window has exactly k elements. Slide adds the new right element and removes the old left one. Use cases: rolling average, rolling max, count-distinct-in-last-k. Total time O(n).
  • Variable-size — Window expands and shrinks based on a problem-specific invariant. right advances to expand; left advances to shrink until invariant holds. Each pointer moves at most n times → O(n) amortized. (A sketch follows the fixed-size example below.)
  • Decision rule — "of size k" / "every k consecutive" → fixed. "longest" / "smallest" / "at most" / "at least" → variable.
  • Worked example: Fixed-size 3 over [1, 2, 3, 4, 5] → windows [1,2,3], [2,3,4], [3,4,5].
def fixed_window_sums(arr: list[int], k: int) -> list[int]:
    out: list[int] = []
    s = sum(arr[:k])
    out.append(s)
    for i in range(k, len(arr)):
        s += arr[i] - arr[i - k]
        out.append(s)
    return out
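
The variable-size flavor follows the same two-pointer argument; here is a minimal sketch for one assumed invariant, "longest window containing at most k distinct values" (the function name is illustrative, not from the problem set):

from collections import defaultdict

def longest_at_most_k_distinct(events: list[str], k: int) -> int:
    counts: dict[str, int] = defaultdict(int)
    best = 0
    left = 0
    for right, ev in enumerate(events):
        counts[ev] += 1                      # expand: right always moves forward
        while len(counts) > k:               # shrink: the invariant is violated
            counts[events[left]] -= 1
            if counts[events[left]] == 0:
                del counts[events[left]]
            left += 1
        best = max(best, right - left + 1)   # window [left, right] now satisfies the invariant
    return best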

Sliding-window distinct count (Counter + window step)

"How many distinct values in the current window of size k?" Naive rebuild-set-per-window is O(n·k); the incremental Counter is O(n).

  • Mechanic — On each slide: increment the new right element's count; decrement the leaving left element's. Read len(counter) for distinct count.
  • Critical detail: when a count drops to zero, delete the key. Without the del, zero-count keys accumulate and len(counter) drifts upward.
  • Off-by-one — First emission at i >= k - 1 (the first complete window emits when you've seen k elements, at index k - 1).
  • Worked example: events = ['a','b','a','c','b'], k = 3:
step | window  | counter          | distinct
1    | [a,b,a] | {a:2, b:1}       | 2
2    | [b,a,c] | {b:1, a:1, c:1}  | 3
3    | [a,c,b] | {a:1, c:1, b:1}  | 3
from collections import Counter

def sliding_distinct(events: list[str], k: int) -> list[int]:
    counter: Counter = Counter()
    out: list[int] = []
    for i, ev in enumerate(events):
        counter[ev] += 1
        if i >= k:
            old = events[i - k]
            counter[old] -= 1
            if counter[old] == 0:
                del counter[old]
        if i >= k - 1:
            out.append(len(counter))
    return out

Streak detection (consecutive completions)

"Longest run of True values" or "longest stretch of completed days" is a degenerate variable-window: the window grows on a match and resets to zero on a non-match.

  • State — Two integers: current (length of run ending here) and best (max seen). On match, current += 1; on non-match, current = 0. Always update best = max(best, current).
  • Why update best every step? If you only check at end-of-run, you miss the streak that extends to the very last element.
  • Generalization — Replace the boolean check with any predicate; the template is unchanged. Combine with id += 1 on reset for gaps-and-islands labeling (sketched after the example below).
  • Worked example: [True, True, False, True, True, True, False, True] → runs 2, 3, 1 → max 3.
def longest_streak(flags: list[bool]) -> int:
    best = current = 0
    for f in flags:
        current = current + 1 if f else 0
        if current > best:
            best = current
    return best
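
The gaps-and-islands labeling mentioned above is the same loop plus an island counter; a minimal sketch (the function name is illustrative):

def label_islands(flags: list[bool]) -> list[int | None]:
    labels: list[int | None] = []
    island = 0
    prev = False
    for f in flags:
        if f and not prev:
            island += 1                       # a new island starts here
        labels.append(island if f else None)  # gaps get no label
        prev = f
    return labels

# label_islands([True, True, False, True]) -> [1, 1, None, 2]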

Common beginner mistakes

  • Rebuilding a set from scratch per window (O(n·k) instead of O(n)).
  • Forgetting to del counter[old] when the count hits zero—len(counter) then over-counts.
  • Using i >= k instead of i >= k - 1 for the first emission—off-by-one (the first complete window emits when you've seen k elements, which is at index k - 1).
  • Not resetting the streak counter on a non-match (treating False as a no-op).
  • Updating best only at the end of a run—misses the streak that extends to the last element.

Python interview question on sliding windows

Given a stream of events and a window size k, return the count of distinct events in every window of size k. Use O(n) total time.

Solution using a Counter with explicit zero-delete

from collections import Counter

def sliding_distinct_counts(events: list[str], k: int) -> list[int]:
    counter: Counter = Counter()
    out: list[int] = []
    for i, ev in enumerate(events):
        counter[ev] += 1
        if i >= k:
            old = events[i - k]
            counter[old] -= 1
            if counter[old] == 0:
                del counter[old]
        if i >= k - 1:
            out.append(len(counter))
    return out

Why this works: The window slides by one event per iteration—incrementing the new element and decrementing the leaving one—so each step is O(1). Deleting the key when its count hits zero keeps len(counter) equal to the number of currently-present distinct values; without the delete, removed events would linger as zero-count keys and inflate the answer. Total time is O(n), memory is O(min(n, alphabet_size)). The i >= k - 1 emission guard ensures the first output is for the first complete window (not for partial windows during fill-up).

Practice Netflix-style sliding-window problems →

PYTHON
Topic — sliding window
Sliding window problems

Practice →

COMPANY
Netflix — sliding window
Netflix-tagged sliding window

Practice →


7. Streaming, Deques, and Fixed-Memory Buffers in Python

Streams and deques in Python for data engineering

Streaming is where memory becomes a hard constraint—stream length is unknown, so you can't load it into a Pandas dataframe. The right shape is collections.deque(maxlen=N): a doubly-ended queue with fixed capacity that auto-evicts the opposite end on overflow. O(1) per push, O(N) memory regardless of stream length. Netflix's #603 (fixed-memory stream buffer) is the canonical formulation.

Pro tip: Any time you see "process the stream," ask three things out loud—how big is the stream, do you need the last N items or a summary, what happens on crash. That changes the round from "wrote a function" to "thinks like a pipeline engineer."

Three-step diagram showing a streaming tape feeding a deque rolling buffer, then an arrow to a checkpoint.json file card.

deque vs list: O(1) appends and pops from both ends

Python's list is a dynamic array—appends and right-pops are O(1) amortized, but list.insert(0, x) and list.pop(0) are O(n) because every element shifts. collections.deque is a doubly-linked list of memory blocks with O(1) at both ends.

  • Decision rule — Either-end access → deque. Random access by index → list (deque indexing is O(n)). Stack-only or queue-only at one end → list (lower constant factors).
  • Practical breakpoint — For very small N, both perform similarly; deque's advantage shows above N ≈ 100.
  • Worked example:
op                        | list | deque
append right              | O(1) | O(1)
pop right                 | O(1) | O(1)
append left (insert at 0) | O(n) | O(1)
pop left (pop(0))         | O(n) | O(1)
from collections import deque

dq = deque([1, 2, 3])
dq.appendleft(0)   # deque([0, 1, 2, 3]) — O(1)
dq.popleft()       # returns 0  — O(1)

deque(maxlen=N) for fixed-memory rolling buffer

Pass maxlen=N and the deque becomes a bounded ring buffer: pushes that exceed N automatically evict the opposite end in O(1).

  • The deque enforces the invariant — No if len(dq) > N: dq.popleft() guard needed. The constructor handles it silently.
  • Capacity is fixed — Once set, maxlen cannot change. Use a regular deque + manual guard if dynamic capacity is required.
  • Infinite streams work identically — Memory bounded by N, not stream length. Say this explicitly when the prompt mentions "unbounded stream."
  • Worked example: dq = deque(maxlen=3) push 1, 2, 3, 4:
step | push | deque after
1    | 1    | [1]
2    | 2    | [1, 2]
3    | 3    | [1, 2, 3]
4    | 4    | [2, 3, 4]

(The 1 was evicted automatically.)

from collections import deque

def rolling_buffer(stream, capacity: int):
    buf = deque(maxlen=capacity)
    for item in stream:
        buf.append(item)
    return list(buf)

Stream consumption: generator + deque

Real streams are generators—they produce values lazily, one at a time, without holding the full sequence. Combining deque(maxlen=N) with for item in stream gives constant-memory rolling windows over arbitrarily long input.

  • Production sources — Kafka, Kinesis, Pulsar, file iterators (for line in open(huge_file)), socket reads. The for item in stream idiom makes laziness automatic.
  • Lazy emissionyield list(buf) after each push lets callers iterate as far as they want without materializing the full snapshot sequence (O(emissions) memory instead of O(M·N)).
  • Worked example: Consume an event generator, keep last n events.
from collections import deque
from typing import Iterator

def last_n(stream: Iterator[str], n: int) -> list[str]:
    buf = deque(maxlen=n)
    for ev in stream:
        buf.append(ev)
    return list(buf)

Common beginner mistakes

  • Using a list and calling list.pop(0) in a hot loop—silently O(n) per step.
  • Manually managing maxlen with if len(dq) > N: dq.popleft()—correct but redundant when you can just pass maxlen=N.
  • Storing the entire stream just to keep the last N items—wastes memory.
  • Forgetting deque is iterable but not random-access in O(1)dq[i] is O(i).
  • Trying to change maxlen after construction—not supported; build a new deque.

Python interview question on streaming and deques

You receive an event stream of unknown length. Return the last k events at any moment, with constant memory regardless of how long the stream is.

Solution using deque(maxlen=k)

from collections import deque
from typing import Iterator, Iterable

def stream_last_k(stream: Iterable[str], k: int) -> Iterator[list[str]]:
    buf = deque(maxlen=k)
    for ev in stream:
        buf.append(ev)
        yield list(buf)

Why this works: deque(maxlen=k) enforces the invariant "at most k items" automatically; pushing the (k+1)-th event evicts the oldest in O(1). Memory is bounded by k regardless of how many events the stream produces. The generator yield makes the function lazy—callers can iterate as far as they want without materializing the whole sequence of snapshots, which is critical for production streams that may run for days. Total time is O(L) where L is the number of events the caller actually consumes.
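
A quick usage check with a small illustrative stream:

for snapshot in stream_last_k(iter(["a", "b", "c", "d"]), k=3):
    print(snapshot)
# ['a']
# ['a', 'b']
# ['a', 'b', 'c']
# ['b', 'c', 'd']   <- 'a' is evicted once the buffer is full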

PYTHON
Topic — streaming
Streaming problems

Practice →

COMPANY
Netflix — streaming
Netflix-tagged streaming

Practice →


8. Resumable ETL with Checkpoint Files and Fault Tolerance in Python

Resumable ETL in Python for data engineering

Production pipelines crash. The question that separates a junior DE from a senior one is what happens after the crash. A pipeline that resumes from the last good offset—via a checkpoint file—loses minutes at most and double-processes nothing if its consumer is idempotent. Netflix's #607 is this exact scenario.

Pro tip: Three failure modes to defend against: (a) corrupt checkpoint from a half-written file → os.replace for atomic swap; (b) non-idempotent consumer → upsert by stable PK; (c) checkpoint written before work is durable → always process, confirm durable, then checkpoint.

See streaming and ETL problems →

Checkpoint file: serialize last-good offset

A checkpoint is a tiny durable file (JSON beats pickle for ops debugging) containing just enough state to resume—usually {"offset": <last_processed_index>}. Keep it small; a 100-field state blob is a debugging nightmare.

  • Atomic on update — Write to path + ".tmp", then os.replace(tmp, path). On modern filesystems (ext4, XFS, APFS, NTFS), os.replace is atomic—either the old or new checkpoint is intact, never half-written.
  • Order matters — Process record N → confirm durable (DB commit returned, fsync succeeded) → write checkpoint. Reversing this loses records on crash.
  • Worked example: Pipeline reads 0–99, writes {"offset": 100}. Crash at line 150 → next run skips to 100, replays 100–150. Idempotent consumer makes the replay safe.
import json
import os

def write_checkpoint(path: str, offset: int) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, path)  # atomic on most filesystems

Resumable read: skip to last checkpoint, replay from there

On startup, read the checkpoint (default to 0 if missing—first run). Skip the source to that offset. Process records, batching checkpoint writes (every N records or T seconds) so disk I/O doesn't dominate.

  • Skip mechanic varies by source — Files: for i, line: if i < start: continue is O(start); f.seek(byte_offset) is O(1) if you store byte offsets (sketched after the worked example below). Kafka: seek(offset) directly. Databases: WHERE id > last_id ORDER BY id.
  • Batch trade-off — Larger batches → less I/O but more replay on crash. ~100–1000 records is typical; reason about it explicitly in the interview.
  • Edge cases — Empty input → skip the final write if last == start. Source shrunk → pipeline does nothing (correctly).
  • Worked example: Resume a CSV pipeline.
import json
import os

def read_checkpoint(path: str) -> int:
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f).get("offset", 0)

def run_pipeline(input_path: str, ckpt_path: str, batch: int = 100) -> None:
    start = read_checkpoint(ckpt_path)
    last = start
    with open(input_path) as f:
        for i, line in enumerate(f):
            if i < start:
                continue
            process(line)  # idempotent
            last = i + 1
            if last % batch == 0:
                write_checkpoint(ckpt_path, last)
    if last > start:
        write_checkpoint(ckpt_path, last)
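
If re-reading skipped lines bothers you, the same pattern works with file positions instead of line counts. A minimal sketch of a hypothetical run_pipeline_by_bytes variant, assuming the checkpoint now stores the value of f.tell() and reusing the read_checkpoint / write_checkpoint helpers above:

def run_pipeline_by_bytes(input_path: str, ckpt_path: str, batch: int = 100) -> None:
    start = read_checkpoint(ckpt_path)      # stored file position, not a line count
    processed = 0
    with open(input_path) as f:
        f.seek(start)                       # O(1) skip straight past already-processed work
        while True:
            line = f.readline()             # readline (not iteration) keeps f.tell() usable in text mode
            if not line:
                break
            process(line)                   # idempotent, as before
            processed += 1
            if processed % batch == 0:
                write_checkpoint(ckpt_path, f.tell())   # position just after the last processed line
        if processed % batch != 0:
            write_checkpoint(ckpt_path, f.tell())       # flush the partial batch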

Idempotency and fault-tolerant writes

Resumability is only safe if process(record) is idempotent: replaying a record doesn't change system state more than once. Without it, every replay duplicates work.

  • Upsert by stable PKINSERT … ON CONFLICT (pk) DO UPDATE (Postgres) / ON DUPLICATE KEY UPDATE (MySQL). The PK must be deterministic from the source (event id, row hash, (user_id, ts))—auto-increment surrogates break idempotency.
  • Dedupe at the sink — Maintain a "seen events" table; check membership before processing. Works when the source can't produce a stable PK.
  • External calls — Payment gateways, email sends: use the idempotency-key pattern (every external call carries a unique key derived from the record). Stripe and most modern APIs support this.
  • Worked example: Inserting events into Postgres with idempotent upsert.
def process(line: str) -> None:
    event = parse(line)
    db.execute(
        "INSERT INTO events (id, payload) VALUES (%s, %s) "
        "ON CONFLICT (id) DO NOTHING",
        (event.id, event.payload),
    )

Common beginner mistakes

  • Writing the checkpoint before processing—lose the just-skipped records on crash.
  • Non-atomic checkpoint writes (write directly to the target file)—a crash mid-write leaves a corrupt JSON; next startup raises.
  • Per-record checkpoint writes—correct but slow; batch instead.
  • Non-idempotent process() that duplicates inserts on replay (no upsert, no dedup table).
  • Auto-incrementing surrogate keys as the idempotency key—they change on replay, breaking the upsert guarantee.

Python interview question on resumable ETL

Build a function run_pipeline(input_path, ckpt_path) that processes lines from a large CSV. On startup it reads ckpt_path (a JSON file with {"offset": int}); after every 100 records it writes the current offset back. The pipeline must be safely killable at any moment without losing or duplicating records. process(line) is provided and is idempotent.

Solution using atomic checkpoint writes and batched commits

import json
import os

def read_checkpoint(path: str) -> int:
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f).get("offset", 0)

def write_checkpoint(path: str, offset: int) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, path)

def run_pipeline(input_path: str, ckpt_path: str, batch: int = 100) -> None:
    start = read_checkpoint(ckpt_path)
    last = start
    with open(input_path) as f:
        for i, line in enumerate(f):
            if i < start:
                continue
            process(line)  # idempotent
            last = i + 1
            if last % batch == 0:
                write_checkpoint(ckpt_path, last)
    if last > start:
        write_checkpoint(ckpt_path, last)

Why this works: The atomic os.replace swap means a crash during write_checkpoint either leaves the old good checkpoint intact or atomically swaps in the new good one—never a half-written file. Batching the writes (every 100 records) keeps disk overhead negligible without sacrificing the safety budget: on crash, replay is at most 99 records, all idempotent. The final write outside the loop catches the partial-batch case (e.g. 250 records → checkpoint at 100, 200, then 250 at the end). The whole pipeline is safe under arbitrary crashes because (a) the checkpoint is durable after os.replace, (b) process is idempotent so replay is safe, and (c) process runs before the checkpoint write so we never claim done on un-done work.

PYTHON
Topic — ETL
ETL problems

Practice →

COMPANY
Netflix — streaming
Netflix-tagged streaming

Practice →


Tips to crack Netflix data engineering interviews

These are habits that move the needle in real Netflix loops—not a re-statement of the topics above.

SQL preparation: drill window + anti-join + set ops

Spend most of your SQL prep on window functions inside CTEs (SUM / RANK / LAG), anti-joins (LEFT JOIN ... IS NULL and NOT EXISTS), set operations (INTERSECT, EXCEPT), and relational division (count-match HAVING). Type the queries; do not just read them. The window functions, anti-join, and set operations topic pages cover these directly.

Python preparation: stack, sliding window, deque

Netflix's Python rounds reward stdlib fluency over framework knowledge. Memorize: monotonic stacks for span queries, sliding-window distinct counts with Counter, collections.deque(maxlen=N) for fixed-memory streams, and the resumable-ETL pattern. Drill problems from the sliding window and streaming topic pages.

Pipeline thinking: streaming, checkpointing, fault tolerance

Netflix interviewers care about production realism. When the prompt says "process a stream," ask: how big is the stream? What happens on crash? Is process idempotent? Where does the checkpoint live? Saying these four sentences out loud changes how the round is scored. The ETL topic page has problems that match the resumable-pipeline pattern.

Where to practice on PipeCode

Skill lane                   | Practice path
Curated Netflix practice set | /explore/practice/company/netflix
SQL window functions         | /explore/practice/topic/window-functions
SQL anti-join                | /explore/practice/topic/anti-join
SQL set operations           | /explore/practice/topic/set-operations
Sliding window               | /explore/practice/topic/sliding-window
Streaming + deque            | /explore/practice/topic/streaming
ETL pipelines                | /explore/practice/topic/etl
Hash table                   | /explore/practice/topic/hash-table
All practice topics          | /explore/practice/topics
Interview courses            | /explore/courses

Communication under time pressure

State assumptions before typing: "I'll assume the stream is too large to fit in memory and that process is idempotent." State grain: "One row per product per month." State edge cases: "If the recursion has a cycle, my depth guard caps at 50." Interviewers grade clear reasoning above silent-and-perfect.


Frequently Asked Questions

What is the Netflix data engineering interview process like?

The Netflix data engineering interview typically includes a phone screen (Python warm-up around hash maps or arrays), one or two coding rounds focused on SQL window functions and Python streaming patterns, a system-design conversation around pipelines and warehousing, and behavioral interviews. The curated 13-problem Netflix practice set on PipeCode mirrors what you will see on the technical rounds.

What SQL topics does Netflix test for data engineers?

Netflix emphasizes window functions (SUM/RANK/LAG/LEAD with PARTITION BY), anti-joins, set operations (INTERSECT, EXCEPT), relational division ("for all" patterns), and time-series date logic. Drill these on the window functions, anti-join, and set operations topic pages before your screen.

How important is Python for a Netflix data engineering interview?

Python is roughly half the interview at Netflix—and the patterns are pipeline-realistic, not algorithm puzzles. Expect monotonic stacks, sliding-window distinct counts with Counter, collections.deque(maxlen=N) for fixed-memory streams, and resumable-ETL with checkpoint files. Stdlib fluency (collections, functools, itertools) matters more than knowing the latest framework.

Are sliding window questions common in the Netflix data engineering interview?

Yes—Netflix's only Hard problem in the curated set is sliding window (watch-time completion streaks), and a separate Medium covers sliding-window distinct counts in event streams. Memorize the Counter-based "distinct count in window of size k" pattern and the variable-window left/right pointer pattern. The sliding window topic page has problems matching this shape.

How should I prepare for Netflix's streaming and ETL questions?

Build the mental model first: a stream is unknown-length, so memory must be bounded; a pipeline crashes, so it must resume from a checkpoint; replays happen, so process must be idempotent. Practice writing deque(maxlen=N) rolling buffers and os.replace-based atomic checkpoint files from memory. The streaming and ETL topic pages cover both directly.

How many Netflix practice problems should I solve before the interview?

Aim for 30–50 problems spanning all eight topic clusters above—not 100 of the same kind. Solve every problem in the Netflix-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide.


Start practicing Netflix data engineering problems

Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Netflix problems with tests, AI feedback, and a coding environment so you can drill the exact SQL and Python patterns Netflix asks—without the noise of generic algorithm prep.

Pipecode.ai is Leetcode for Data Engineering.

Browse Netflix practice →
View all practice →
