Netflix data engineering interview questions lean harder on production-DE patterns than the typical FAANG SQL/algorithm loop. Expect SQL problems built around window functions (LAG, LEAD, PARTITION BY), anti-joins, set operations, relational division, and time-series date logic—and Python problems that read like real pipeline code: monotonic stacks for span queries, sliding windows for streak detection, deque rolling buffers for fixed-memory streams, and resumable ETL with checkpoint files.
This guide walks through the eight topic clusters Netflix actually tests. Each comes with a topic overview, sub-topic explanations with worked examples, and an interview-style problem whose full solution explains why it works. The mix matches the curated 13-problem Netflix interview set (1 easy, 11 medium, 1 hard), so you spend prep time on what shows up, not on what doesn't.
Top Netflix data engineering interview topics
From the Netflix data engineering practice set, the eight numbered sections below follow this topic map (one row per section):
| # | Topic (sections 1–8) | Why it shows up at Netflix |
|---|---|---|
| 1 | Hash maps and data analysis in Python | Two-title sequence analysis, popular-movie next-watch, user-engagement patterns. |
| 2 | Monotonic stacks for span queries in Python | Consecutive left-span counts—classic stack-pattern interview question. |
| 3 | SQL window functions, aggregation and joins | "Most-used promotion in California"—window inside a CTE, joined to a filter. |
| 4 | SQL anti-join, set operations, relational division | "Products never purchased" (anti-join) and "products sold in all stores" (relational division). |
| 5 | SQL window functions for time-series and date logic | Product price increases over time—LAG deltas + half-open date windows. |
| 6 | Sliding window patterns in Python | Watch-time completion streaks (Hard) and sliding-window distinct counts. |
| 7 | Streaming, deques, fixed-memory buffers | Fixed-memory stream buffer with collections.deque(maxlen=N). |
| 8 | Resumable ETL with checkpoint files | Fault-tolerant pipelines that resume from the last good offset. |
SQL evaluation order (mental model): `FROM`/joins → `WHERE` (row filter) → `GROUP BY` → aggregates → `HAVING` (group filter) → windows → `SELECT`/`ORDER BY`. Window functions run after `WHERE` but before `SELECT`, so you cannot reference a window column in `WHERE`. Wrap it in a CTE and filter outside.
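This ordering is easy to verify directly. A minimal sqlite3 sketch (the `scores` table and its data are invented for illustration; window functions require SQLite ≥ 3.25):

```python
import sqlite3

# Hypothetical table for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (score INTEGER);
    INSERT INTO scores VALUES (95), (90), (90), (85);
""")

# WHERE is evaluated before window functions, so filtering on the
# window column directly is rejected by the engine.
try:
    conn.execute(
        "SELECT score, RANK() OVER (ORDER BY score DESC) AS rnk "
        "FROM scores WHERE rnk = 1"
    ).fetchall()
except sqlite3.OperationalError:
    pass  # expected: the window alias is not usable in WHERE

# Wrapping the window in a CTE turns rnk into an ordinary column.
rows = conn.execute("""
    WITH ranked AS (
        SELECT score, RANK() OVER (ORDER BY score DESC) AS rnk FROM scores
    )
    SELECT score FROM ranked WHERE rnk = 1
""").fetchall()
print(rows)  # [(95,)]
```

The same two-step shape (window inside a CTE, filter outside) recurs in every SQL section below.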
1. Hash Maps and Data Analysis Patterns in Python
Hash-map fluency for data analysis in Python for data engineering
Hash maps are the foundational primitive of data engineering in Python. Almost every Netflix-style data-analysis problem—two-title sequence counts, popular next-watch, per-user engagement—reduces to "build a dict in one pass, query it in O(1)." The hard part is shape selection: a plain dict, a defaultdict, or a Counter each fit different access patterns.
Pro tip: Reach for `Counter` when the prompt says "count" or "frequency"; reach for `defaultdict(lambda: defaultdict(int))` for 2-D maps like co-occurrence; use a plain `dict` for heterogeneous values.
dict vs defaultdict vs Counter
Three hash-map flavors, each optimized for a different access pattern. The shape choice changes how you write the body of the loop—not correctness, but 3–5 lines per call site.
- `dict` — Most general; missing keys raise `KeyError`. Wrap reads in `.get(k, default)` to avoid the exception.
- `defaultdict(factory)` — Auto-initializes missing keys via `factory()`. `defaultdict(int)` for counts, `defaultdict(list)` for grouping, `defaultdict(lambda: defaultdict(int))` for 2-D tables.
- `Counter(iterable)` — `dict` subclass for integer frequencies. Gives you `.most_common(k)` and arithmetic (`c1 + c2`, `c1 - c2`, `c1 & c2`).
- Footgun: reading a missing key on a `defaultdict` inserts it. `if k not in d` after a read is always `False`.
- Tie-break footgun: `Counter.most_common` preserves insertion order on ties—not alphabetical. Sort manually with `(-count, key)` if the prompt requires deterministic tie-breaks.
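Both footguns from the list above take four lines each to demonstrate (the values are invented):

```python
from collections import Counter, defaultdict

# Footgun 1: a read on a missing defaultdict key silently inserts it.
d = defaultdict(int)
_ = d["missing"]                      # the read creates the key with value 0
present_after_read = "missing" in d   # True — the membership test is now useless

# Footgun 2: most_common breaks ties by insertion order, not alphabetically.
c = Counter(["zebra", "apple", "zebra", "apple"])        # both counted twice
by_insertion = [name for name, _ in c.most_common()]     # ['zebra', 'apple']
by_key = sorted(c.items(), key=lambda x: (-x[1], x[0]))  # deterministic tie-break
alphabetical_first = by_key[0][0]                        # 'apple'
```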
Worked example: Counting titles watched per user across two days.
| user | titles_day_1 | titles_day_2 |
|---|---|---|
| A | `["Stranger", "Crown"]` | `["Crown"]` |
| B | `["Stranger"]` | `["Stranger", "Witcher"]` |
Goal: per-user title counter → `{'A': {'Stranger': 1, 'Crown': 2}, 'B': {'Stranger': 2, 'Witcher': 1}}`.
```python
from collections import defaultdict, Counter

def per_user_counts(rows: list[tuple[str, list[str]]]) -> dict[str, dict[str, int]]:
    out: dict[str, Counter] = defaultdict(Counter)
    for user, titles in rows:
        out[user].update(titles)
    return {u: dict(c) for u, c in out.items()}
```
Co-occurrence counting with nested defaultdict
Co-occurrence answers "given a sequence of events, how often does A appear immediately before B?" The right shape is `defaultdict(lambda: defaultdict(int))`—nested keys make "all successors of X" an O(1) jump instead of an O(N) scan over flat tuple keys.
- Idiom: `zip(history, history[1:])` yields consecutive pairs. A history of length N produces N − 1 pairs.
- Edge cases: length < 2 yields zero pairs (no defensive guard needed); repeats like `["A","A","A"]` produce `(A, A): 2` correctly.
- Worked example: History `["A", "B", "C", "B", "A"]` → pairs `(A,B), (B,C), (C,B), (B,A)` → `{A: {B: 1}, B: {C: 1, A: 1}, C: {B: 1}}`.
```python
from collections import defaultdict

def co_occurrence(history: list[str]) -> dict[str, dict[str, int]]:
    pairs: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for a, b in zip(history, history[1:]):
        pairs[a][b] += 1
    return {a: dict(b) for a, b in pairs.items()}
```
Most-frequent next-watch with tie-breaks
Once you have a 2-D co-occurrence table, "most frequent successor of title X" is a sort by composite key. The naive `max(pairs[X], key=pairs[X].get)` returns whichever tied key was inserted first—effectively arbitrary—and real prompts always specify tie-break behavior.
- `(-count, name)` — The canonical Python tie-break key: primary descending (negate to flip), secondary ascending (lexicographic). Generalizes to any number of tie-break rules.
- Stable sort — Python's sort is stable, so equal composite keys preserve insertion order. Sorting explicitly with a composite key is safer than relying on it.
- Worked example: Co-occurrence above + query "next watch for `B`?" → `{C: 1, A: 1}` → tied at 1 → alphabetical `A` wins.
```python
def most_frequent_next(pairs: dict[str, dict[str, int]], title: str) -> str | None:
    if title not in pairs or not pairs[title]:
        return None
    return sorted(pairs[title].items(), key=lambda x: (-x[1], x[0]))[0][0]
```
Common beginner mistakes
- Using a plain `dict` and forgetting to handle missing keys (raises `KeyError`).
- Trusting `Counter.most_common` to break ties alphabetically (it does not—it uses insertion order).
- Building flat counters when nested counters are needed (and vice versa)—a mismatched shape fails downstream.
- Forgetting that `zip(history, history[1:])` yields zero pairs when history has fewer than 2 items.
- Calling `if k not in d` after a read on a `defaultdict`—the read inserted `k`, so the check is always `False`.
Python interview question on hash maps and data analysis
Given a list of user watch sessions where each session is a list of titles in viewing order, build a structure that, for any title, can return the most common next title with deterministic tie-breaks (alphabetical).
Solution using nested defaultdict and a sort key
```python
from collections import defaultdict

def build_next_watch(sessions: list[list[str]]) -> dict[str, str]:
    pairs: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for history in sessions:
        for a, b in zip(history, history[1:]):
            pairs[a][b] += 1
    out: dict[str, str] = {}
    for title, counter in pairs.items():
        best = sorted(counter.items(), key=lambda x: (-x[1], x[0]))[0][0]
        out[title] = best
    return out
```
Why this works: A single pass per session counts every consecutive pair in O(L) where L is the session length—total O(total_titles). The nested defaultdict builds the 2-D counter in one line, no KeyError risk. Sorting (-count, name) resolves ties alphabetically. The output is a flat title → next_title map, so downstream lookups are O(1)—you've shifted cost from query time to index time, which is the right trade-off when you expect many queries against one fixed dataset (the typical Netflix recommender setup).
2. Monotonic Stacks for Span and Range Queries in Python
Monotonic stacks in Python for data engineering
Monotonic stacks are the canonical optimization for problems shaped like "for each element, find the nearest left/right neighbor satisfying a comparison." The brute-force is O(n²); the monotonic-stack rewrite is amortized O(n) because every element is pushed and popped at most once. Netflix's consecutive-left-span problem (#128) is the textbook stock-span variant.
Pro tip: The mental model: the stack holds "open questions"—each element is asking "what is the next element greater than me?" When that answer arrives, the element is popped and the running aggregate updates. Anything passed over by a larger element will never be answered later, so it gets popped and the stack stays bounded.
Stack basics: LIFO and amortized O(n)
A stack is a LIFO container with push and pop. In Python, a plain list works as a stack: list.append and list.pop are both O(1) amortized. Avoid collections.deque for stack-only use—it has slightly higher constant factors.
- Amortized argument — Total work is `n + total_pops`; since every element is pushed once and popped at most once, `total_pops ≤ n`, so total work is `O(2n) = O(n)`. Memorize this sentence; interviewers occasionally ask you to defend it.
- Worked example: Push `3, 1, 4` then pop twice → `4, 1` come off, stack becomes `[3]`.
```python
def stack_demo() -> list[int]:
    stack: list[int] = []
    for x in [3, 1, 4]:
        stack.append(x)
    stack.pop()  # 4
    stack.pop()  # 1
    return stack  # [3]
```
Monotonic decreasing stack for "next greater" / "previous greater"
A monotonic decreasing stack keeps elements in non-increasing order from bottom to top. As you scan left-to-right, when the new element is larger than the stack top, pop until the top is larger or the stack is empty. Popped elements just had their next-greater answered; whatever's left is the previous-greater of the new push.
- Decreasing vs increasing — A decreasing stack answers "next/previous greater"; an increasing stack answers "next/previous smaller." Pick by the predicate.
- Indices vs values — Store indices if you need distances (`i - stack[-1]`); store values if you only need values.
- Strict vs non-strict — "Strictly greater" pops while `top ≤ x`; "greater or equal" pops while `top < x`. Off-by-one mistakes here are silent bugs.
- Worked example: Previous-greater for `[6, 3, 4, 2, 5]`:

| i | x | stack before | pop while top ≤ x | prev-greater | stack after |
|---|---|---|---|---|---|
| 0 | 6 | `[]` | — | none | `[6]` |
| 1 | 3 | `[6]` | — | 6 | `[6, 3]` |
| 2 | 4 | `[6, 3]` | pop 3 | 6 | `[6, 4]` |
| 3 | 2 | `[6, 4]` | — | 4 | `[6, 4, 2]` |
| 4 | 5 | `[6, 4, 2]` | pop 2, pop 4 | 6 | `[6, 5]` |

Result: `[None, 6, 6, 4, 6]`.
```python
def previous_greater(arr: list[int]) -> list[int | None]:
    stack: list[int] = []
    out: list[int | None] = []
    for x in arr:
        while stack and stack[-1] <= x:
            stack.pop()
        out.append(stack[-1] if stack else None)
        stack.append(x)
    return out
```
Span counting (consecutive left span)
The span of element `i` is the count of consecutive prior elements (including `i` itself) that are `≤ arr[i]`. The naive walk-leftward solution is O(n²); the monotonic-stack rewrite stores `(value, span)` tuples and absorbs popped spans into the new element's span when it dominates them.
- Why tuples? When popping `(v, s)` because `arr[i] ≥ v`, the new element's span absorbs `s`—every element subsumed by the popped one is also subsumed by `arr[i]` (transitivity). No rescan needed.
- Worked example: `arr = [100, 80, 60, 70, 60, 75, 85]`:
| i | x | running span logic | output |
|---|---|---|---|
| 0 | 100 | start, span = 1 | 1 |
| 1 | 80 | 80 < 100, span = 1 | 1 |
| 2 | 60 | 60 < 80, span = 1 | 1 |
| 3 | 70 | 70 ≥ 60 (pop, +1), 70 < 80, span = 2 | 2 |
| 4 | 60 | 60 < 70, span = 1 | 1 |
| 5 | 75 | 75 ≥ 60 (+1), 75 ≥ 70 (+2), 75 < 80, span = 4 | 4 |
| 6 | 85 | 85 ≥ 75 (+4), 85 ≥ 80 (+1), 100 > 85 stops, span = 6 | 6 |

Result: `[1, 1, 1, 2, 1, 4, 6]`.
```python
def left_spans(arr: list[int]) -> list[int]:
    stack: list[tuple[int, int]] = []  # (value, span)
    out: list[int] = []
    for x in arr:
        span = 1
        while stack and stack[-1][0] <= x:
            span += stack.pop()[1]
        out.append(span)
        stack.append((x, span))
    return out
```
Common beginner mistakes
- Reaching for an O(n²) nested loop and missing the linear-time monotonic-stack rewrite.
- Storing only values instead of `(value, span)`—you cannot recover the contribution of popped elements without rescanning.
- Comparing strictly `<` when the prompt requires `≤` (or vice versa)—off-by-one on the boundary.
- Forgetting that the stack may be empty after popping—handle the "no boundary" case.
- Using `collections.deque` when a `list` is sufficient (deque has constant-factor overhead; list is faster for stack-only use).
Python interview question on monotonic stacks
Given an array of daily prices, return for each day the count of consecutive prior days (including today) where the price was less than or equal to today's price. Use O(n) time.
Solution using a monotonic decreasing stack of (price, span)
```python
def consecutive_left_span(prices: list[int]) -> list[int]:
    stack: list[tuple[int, int]] = []  # (price, span)
    out: list[int] = []
    for p in prices:
        span = 1
        while stack and stack[-1][0] <= p:
            span += stack.pop()[1]
        out.append(span)
        stack.append((p, span))
    return out
```
Why this works: Every price is pushed once and popped at most once, so total work is O(n) amortized—much faster than the O(n²) brute force on long inputs. The stack holds only "still-relevant" boundaries: values smaller than the current price contributed to the current span and are no longer needed; the boundary that stops the span stays on the stack for later queries. The (value, span) tuple lets us recover the contribution of popped elements in O(1) without rescanning—the absorption step is the whole point of the rewrite.
3. SQL Window Functions, Aggregation, and Joins
Window functions and aggregation in SQL for data engineering
Window functions are SQL's answer to "per-row analytics that depend on other rows." Where pre-window SQL needed self-joins or correlated subqueries for "running total alongside each order," windows do it as one expression: SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date). The simplified evaluation order: FROM → WHERE → GROUP BY → aggregates → HAVING → windows → SELECT → ORDER BY. Windows run after WHERE, so you cannot reference a window column in WHERE—wrap in a CTE.
Pro tip: Netflix problems test the layering: pre-filter, group, window, filter window output. Walk through it verbally before typing—interviewers grade explanation as much as code.
Window vs aggregate: SUM() OVER vs SUM() GROUP BY
The single most common window bug is confusing these two constructs. Aggregates collapse grain; windows preserve it.
- `SUM(col) … GROUP BY key` — Input N rows → output K rows (one per group). Grain changes.
- `SUM(col) OVER (PARTITION BY key)` — Input N rows → output N rows. Each row gets the partition total alongside its own data.
- Empty `OVER ()` — No partition, no order: aggregate over the entire result set, broadcast to every row. Useful for "percentage of grand total"; a bug if accidental.
- Rule of thumb: "for each row, also show…" → window. "for each X, return…" → aggregate.
- Worked example: Three orders, two customers `[(1, 10), (1, 20), (2, 5)]` → `GROUP BY` returns 2 rows `(1, 30), (2, 5)`; `OVER (PARTITION BY customer_id)` returns 3 rows with `customer_total` 30, 30, 5.
```sql
SELECT
  customer_id,
  amount,
  SUM(amount) OVER (PARTITION BY customer_id) AS customer_total
FROM orders;
```
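The grain difference is checkable in a few lines with sqlite3. A sketch whose three-row `orders` table mirrors the worked example above (SQLite ≥ 3.25 for the window form):

```python
import sqlite3

# Illustrative schema matching the worked example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 10), (1, 20), (2, 5);
""")

# Aggregate: collapses to one row per customer — the grain changes.
grouped = conn.execute(
    "SELECT customer_id, SUM(amount) FROM orders "
    "GROUP BY customer_id ORDER BY customer_id"
).fetchall()   # [(1, 30), (2, 5)]

# Window: all three input rows survive, each carrying its partition total.
windowed = conn.execute(
    "SELECT customer_id, amount, SUM(amount) OVER (PARTITION BY customer_id) "
    "FROM orders ORDER BY customer_id, amount"
).fetchall()   # [(1, 10, 30), (1, 20, 30), (2, 5, 5)]
```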
PARTITION BY and ORDER BY inside OVER()
OVER() accepts three nested clauses: PARTITION BY, ORDER BY, and the frame. Together they define which rows form this row's window and in what order.
- `PARTITION BY` — Buckets rows into independent groups; the window resets at each boundary. Without it, the window spans the entire result set.
- `ORDER BY` — Required for `LAG`, `LEAD`, `ROW_NUMBER`, `RANK`, and running totals. Without it, "the prior row" is undefined.
- Frame — The default with `ORDER BY` is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`. `RANGE` compares values (duplicates included together); `ROWS` compares physical positions. Specify explicitly when correctness matters.
- Worked example: Running total per customer, ordered by date.
```sql
SELECT
  customer_id, order_date, amount,
  SUM(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM orders;
```
RANK / DENSE_RANK / ROW_NUMBER
Three ranking functions, three different tie semantics. Picking the wrong one is the second-most-common window bug after the WHERE-on-window confusion.
- `ROW_NUMBER()` — Unique sequential integers; ties broken arbitrarily by storage order. Add a tie-break column to `ORDER BY` for determinism.
- `RANK()` — Olympic-medal model: tied rows share a rank; the next rank skips the tied positions. `[95, 90, 90, 85]` → `1, 2, 2, 4`.
- `DENSE_RANK()` — Tied rows share a rank; the next rank does not skip. `[95, 90, 90, 85]` → `1, 2, 2, 3`.
- Worked example: Scores `[95, 90, 90, 85]`:
| score | row_number | rank | dense_rank |
|---|---|---|---|
| 95 | 1 | 1 | 1 |
| 90 | 2 | 2 | 2 |
| 90 | 3 | 2 | 2 |
| 85 | 4 | 4 | 3 |
```sql
SELECT
  score,
  ROW_NUMBER() OVER (ORDER BY score DESC) AS rn,
  RANK()       OVER (ORDER BY score DESC) AS rnk,
  DENSE_RANK() OVER (ORDER BY score DESC) AS drk
FROM scores;
```
Filtering on a window column via CTE
WHERE runs before windows, so WHERE rnk = 1 directly on a window column is invalid. The fix is structural: wrap the window in a CTE, then filter the CTE in an outer SELECT.
- Alternatives — `QUALIFY` (Snowflake / BigQuery / Teradata) is a dedicated clause for window filters but isn't portable to PostgreSQL. Default to the CTE wrapper in Postgres-style interviews.
- Muscle memory: `WITH ranked AS (SELECT …, RANK() OVER … AS rnk FROM …) SELECT … FROM ranked WHERE rnk = …`.
- Worked example: Rank promotions by use-count and keep the top.
```sql
WITH promo_counts AS (
  SELECT promo_id, COUNT(*) AS uses
  FROM orders
  GROUP BY promo_id
),
ranked AS (
  SELECT promo_id, uses, RANK() OVER (ORDER BY uses DESC) AS rnk
  FROM promo_counts
)
SELECT promo_id, uses FROM ranked WHERE rnk = 1;
```
Common beginner mistakes
- Writing `WHERE rnk = 1` directly on a window column—evaluation order makes it invalid.
- Using `RANK` when the prompt requires no gaps (`DENSE_RANK`) or an arbitrary tie-break (`ROW_NUMBER`).
- Forgetting `PARTITION BY` and computing a global running total instead of a per-customer one.
- Writing `SUM(amount) OVER ()` (empty `OVER`) and getting a single value broadcast to every row—occasionally the goal, often a bug.
- Defaulting to `RANGE` when `ROWS` is needed (or vice versa)—a silent wrong answer when the `ORDER BY` column has duplicates.
SQL interview question on window functions
Table orders(order_id, customer_id, promo_id, amount, region). For region 'CA', return the promo_id with the most uses (ties on use-count broken alphabetically by promo_id).
Solution using RANK inside a CTE
```sql
WITH ca_counts AS (
  SELECT promo_id, COUNT(*) AS uses
  FROM orders
  WHERE region = 'CA'
  GROUP BY promo_id
),
ranked AS (
  SELECT promo_id, uses,
         RANK() OVER (ORDER BY uses DESC, promo_id ASC) AS rnk
  FROM ca_counts
)
SELECT promo_id, uses FROM ranked WHERE rnk = 1;
```
Why this works: The WHERE region = 'CA' filters rows before grouping (correct evaluation stage), so the count is per-region. The first CTE collapses rows to one per promo_id. The second CTE adds the rank with a tie-break on promo_id ASC—the same (-count, name) pattern in SQL form, where uses DESC flips ascending into descending and promo_id ASC resolves ties alphabetically. The outer WHERE rnk = 1 is legal because rnk is a CTE column, not a live window. The query plan: scan-with-filter on orders → hash aggregate by promo_id → window sort → filter—four logical stages, one query.
4. SQL Anti-Joins, Set Operations, and Relational Division
Anti-joins and set algebra in SQL for data engineering
SQL has two ways to express negation ("rows in A with no match in B") and universal quantification ("rows that match in all of B"). Netflix tests this twice: products-never-purchased (#192, anti-join) and products-sold-in-all-stores (#193, relational division). Each problem has multiple equivalent formulations; the interview signal is whether you know which one a reviewer can read in five seconds.
Pro tip: For relational division, lead with the count-match `HAVING` form (single pass, most readable) and have double-`NOT EXISTS` ready as the formal relational-algebra follow-up.
Anti-join: LEFT JOIN … WHERE rhs IS NULL vs NOT EXISTS
Three SQL shapes for "rows in A with no matching row in B," each with a different mechanical and readability profile.
- `LEFT JOIN … WHERE rhs.id IS NULL` — Outer join, then filter to unmatched rows. Planners usually rewrite this to a hash anti-join. Use the right table's non-nullable PK in the `IS NULL` check.
- `NOT EXISTS (SELECT 1 FROM B WHERE B.x = A.x)` — More declarative; the planner short-circuits on the first match. Handles multi-column predicates cleanly. Preferred for readability and correctness.
- `NOT IN (subquery)` — The NULL trap: a single NULL row in the subquery makes `x NOT IN (NULL, ...)` evaluate to `UNKNOWN` for every `x`, filtering out everything. Avoid; use `NOT EXISTS`.
- Worked example: `products = [p1, p2, p3]`, `purchases(p_id, region) = [(p1, WA), (p2, CA)]`. "Products never purchased in WA" → `p2, p3`.
```sql
-- LEFT JOIN form
SELECT p.p_id
FROM products p
LEFT JOIN purchases pu
  ON pu.p_id = p.p_id AND pu.region = 'WA'
WHERE pu.p_id IS NULL;

-- NOT EXISTS form (equivalent)
SELECT p.p_id
FROM products p
WHERE NOT EXISTS (
  SELECT 1 FROM purchases pu
  WHERE pu.p_id = p.p_id AND pu.region = 'WA'
);
```
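The `NOT IN` NULL trap is worth watching fail once. A sqlite3 sketch (tables invented to match the worked example, with one NULL purchase row added):

```python
import sqlite3

# Illustrative schema: three products, one purchase plus a NULL row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (p_id TEXT);
    INSERT INTO products VALUES ('p1'), ('p2'), ('p3');
    CREATE TABLE purchases (p_id TEXT);
    INSERT INTO purchases VALUES ('p1'), (NULL);  -- one NULL poisons NOT IN
""")

# Every comparison against NULL is UNKNOWN, so NOT IN filters out every row.
not_in = conn.execute(
    "SELECT p_id FROM products "
    "WHERE p_id NOT IN (SELECT p_id FROM purchases)"
).fetchall()   # [] — silently wrong

# NOT EXISTS handles the NULL row correctly.
not_exists = conn.execute(
    "SELECT p_id FROM products p WHERE NOT EXISTS "
    "(SELECT 1 FROM purchases pu WHERE pu.p_id = p.p_id) ORDER BY p_id"
).fetchall()   # [('p2',), ('p3',)]
```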
Set operations: INTERSECT / EXCEPT / UNION
Three relational-algebra primitives. They take two SELECTs with compatible column lists and combine the result sets row-by-row.
- `UNION` — Set union with deduplication (sort/hash overhead). `UNION ALL` is the bag variant—much faster; use it whenever dedup isn't needed.
- `INTERSECT` — Rows in both sides. `INTERSECT ALL` is the multiset variant.
- `EXCEPT` (`MINUS` in Oracle) — Rows in the left side but not the right. `EXCEPT ALL` is count-aware.
- NULL semantics — Set ops treat NULLs as equal (unlike the `=` comparison, where `NULL = NULL` is `UNKNOWN`).
- Worked example: `s1` carries `[p1, p2]`, `s2` carries `[p2, p3]`. "In both" → `INTERSECT` → `p2`. "In s1 not s2" → `EXCEPT` → `p1`.
```sql
SELECT product_id FROM store_products WHERE store_id = 's1'
INTERSECT
SELECT product_id FROM store_products WHERE store_id = 's2';
```
Relational division: "products in all stores"
Relational division answers "for all" questions: "products that appear in every store." No single keyword—you build it from primitives. Three idioms:
- Double `NOT EXISTS` — The textbook formal-algebra shape: "no store exists where this product does not exist." Correct, terse, slow on large tables.
- Count-match `HAVING` — `COUNT(DISTINCT store_id) = (SELECT COUNT(*) FROM stores)`. Single pass, most readable; usually fastest.
- `EXCEPT`-based — Cross-product `products × stores` minus actual pairs, then exclude products with leftovers. Verbose; rarely best.
- Correctness: use `COUNT(DISTINCT store_id)`, not `COUNT(*)`—duplicate inventory rows would otherwise inflate the count.
- Worked example: Stores `[s1, s2, s3]`. `p2` in all 3 → passes; `p1` in 2 of 3 → fails.
```sql
SELECT product_id
FROM store_products
GROUP BY product_id
HAVING COUNT(DISTINCT store_id) = (SELECT COUNT(*) FROM stores);
```
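A sqlite3 sketch contrasting the count-match form with double-`NOT EXISTS`, with a deliberate duplicate inventory row to show why `COUNT(DISTINCT …)` matters (schema invented to match the worked example):

```python
import sqlite3

# p2 is stocked in all three stores (once with a duplicate row); p1 is not.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (store_id TEXT);
    INSERT INTO stores VALUES ('s1'), ('s2'), ('s3');
    CREATE TABLE inventory (product_id TEXT, store_id TEXT);
    INSERT INTO inventory VALUES
        ('p1','s1'), ('p1','s2'),
        ('p2','s1'), ('p2','s2'), ('p2','s3'),
        ('p2','s3');  -- duplicate row: COUNT(*) for p2 is 4, not 3
""")

count_match = conn.execute("""
    SELECT product_id FROM inventory
    GROUP BY product_id
    HAVING COUNT(DISTINCT store_id) = (SELECT COUNT(*) FROM stores)
""").fetchall()   # [('p2',)]

# Same query with COUNT(*): the duplicate row breaks the equality for p2.
count_star = conn.execute("""
    SELECT product_id FROM inventory
    GROUP BY product_id
    HAVING COUNT(*) = (SELECT COUNT(*) FROM stores)
""").fetchall()   # [] — wrong, p2 is missed

# Double NOT EXISTS: "no store exists where this product does not exist."
double_not_exists = conn.execute("""
    SELECT DISTINCT product_id FROM inventory i
    WHERE NOT EXISTS (
        SELECT 1 FROM stores s
        WHERE NOT EXISTS (
            SELECT 1 FROM inventory i2
            WHERE i2.product_id = i.product_id AND i2.store_id = s.store_id
        )
    )
""").fetchall()   # [('p2',)]
```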
Common beginner mistakes
- Using `INNER JOIN` when the prompt asks for "rows missing from B"—you need an anti-join, not an inner join.
- Confusing `EXCEPT` with `NOT IN` on a column that contains `NULL`—`NOT IN (NULL)` is `UNKNOWN` and drops everything.
- Writing `COUNT(*)` instead of `COUNT(DISTINCT store_id)` for relational division—duplicates inflate the count.
- Forgetting `UNION ALL` vs `UNION`—`UNION` deduplicates and is slower.
- Putting unrelated columns in the `EXCEPT` projection—set operators compare whole rows, so extra columns change the semantics.
SQL interview question on anti-join and relational division
Tables products(product_id), stores(store_id), inventory(product_id, store_id). Return all product_ids carried in every store.
Solution using GROUP BY and a count-match HAVING
```sql
SELECT product_id
FROM inventory
GROUP BY product_id
HAVING COUNT(DISTINCT store_id) = (SELECT COUNT(*) FROM stores);
```
Why this works: The GROUP BY collapses inventory to one row per product with the count of distinct stores carrying it. The scalar subquery (SELECT COUNT(*) FROM stores) is the universal-quantifier denominator—it's evaluated once and treated as a constant by the planner. Equality means the product appears in every store. COUNT(DISTINCT store_id) defends against duplicate inventory rows (same product appears twice in the same store)—a subtle but common bug. Total cost: one scan of inventory plus one scan of stores, both linear in their respective sizes.
5. SQL Window Functions for Time-Series and Date Logic
Time-series patterns in SQL for data engineering
Time-series problems combine LAG/LEAD for prior-row deltas, half-open date filters to avoid boundary bugs, and gaps-and-islands patterns to label or count consecutive runs. Netflix's product-price-increases problem (#194) tests all three at once: compute the prior-month price, count strict month-over-month increases, bound the analysis to a date range.
Pro tip: The gaps-and-islands shape ("longest streak," "label every island") combines `LAG` to detect transitions with running sums or `ROW_NUMBER` arithmetic to assign island ids. Memorize the shape; it scales without changing structure.
LAG and LEAD for prior-row deltas
LAG(col, n, default) returns col n rows before the current row in window order; LEAD(col, n, default) looks ahead. Both are computed inside OVER (PARTITION BY … ORDER BY …). Boundary rows return NULL unless a default is supplied.
- First/last row — `LAG(col, 1)` is NULL on the first row per partition; `LEAD` is NULL on the last. Provide an explicit default if NULL would propagate badly through downstream arithmetic.
- Deterministic `ORDER BY` — The most common bug: ties on the order key make "the prior row" non-deterministic. Add a tie-break column.
- Frame — `LAG` and `LEAD` ignore the frame; it matters for `FIRST_VALUE`, `LAST_VALUE`, `NTH_VALUE`.
- Worked example:
| product_id | month | price | prev_price (LAG) |
|---|---|---|---|
| 1 | 2026-01 | 10 | NULL |
| 1 | 2026-02 | 12 | 10 |
| 1 | 2026-03 | 12 | 12 |
| 1 | 2026-04 | 15 | 12 |
```sql
SELECT
  product_id, month, price,
  LAG(price) OVER (PARTITION BY product_id ORDER BY month) AS prev_price
FROM product_prices;
```
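The table above can be reproduced with sqlite3 (data copied from the worked example; SQLite ≥ 3.25 for `LAG`):

```python
import sqlite3

# Illustrative table matching the worked example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_prices (product_id INTEGER, month TEXT, price INTEGER);
    INSERT INTO product_prices VALUES
        (1, '2026-01', 10), (1, '2026-02', 12),
        (1, '2026-03', 12), (1, '2026-04', 15);
""")

rows = conn.execute("""
    SELECT month, price,
           LAG(price) OVER (PARTITION BY product_id ORDER BY month) AS prev_price
    FROM product_prices ORDER BY month
""").fetchall()
# First row has no predecessor, so prev_price is NULL (None in Python):
# [('2026-01', 10, None), ('2026-02', 12, 10),
#  ('2026-03', 12, 12),   ('2026-04', 15, 12)]
```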
Half-open date windows (>= start AND < end)
"Find rows in March 2026" sounds trivial but is the most common SQL date bug. BETWEEN '2026-03-01' AND '2026-03-31' is inclusive on both sides and either over- or under-includes timestamp boundaries depending on dialect rounding.
- Half-open convention — `>= start AND < end` is the only timezone- and precision-safe shape: lower bound inclusive, upper bound exclusive. Works for dates, timestamps, and timestamptz.
- Chains cleanly — Q1 2026: `>= '2026-01-01' AND < '2026-04-01'`. No off-by-one math.
- Timezones — `timestamptz` literals coerce to the session timezone; be explicit (`AT TIME ZONE 'UTC'`) when production matters.
- Worked example: Prices recorded at any timestamp in March 2026.
```sql
SELECT *
FROM product_prices
WHERE recorded_at >= '2026-03-01'
  AND recorded_at <  '2026-04-01';
```
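The boundary bug is concrete with one noon timestamp on the last day of the month. A sqlite3 sketch (timestamps stored as ISO-8601 text, which SQLite compares lexicographically):

```python
import sqlite3

# One row at each edge: start of March, noon on March 31, start of April.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_prices (recorded_at TEXT);
    INSERT INTO product_prices VALUES
        ('2026-03-01 00:00:00'), ('2026-03-31 12:00:00'), ('2026-04-01 00:00:00');
""")

# BETWEEN's upper bound '2026-03-31' sorts BEFORE '2026-03-31 12:00:00',
# so the noon row is silently dropped.
between = conn.execute(
    "SELECT COUNT(*) FROM product_prices "
    "WHERE recorded_at BETWEEN '2026-03-01' AND '2026-03-31'"
).fetchone()[0]   # 1

# Half-open window: both March rows in, April excluded.
half_open = conn.execute(
    "SELECT COUNT(*) FROM product_prices "
    "WHERE recorded_at >= '2026-03-01' AND recorded_at < '2026-04-01'"
).fetchone()[0]   # 2
```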
Counting consecutive increases via window arithmetic
"How many strict month-over-month price increases per product?" reduces to counting rows where price > LAG(price). The clean shape combines counting and filtering with SUM(CASE WHEN … THEN 1 ELSE 0 END)—no nested WHERE.
- Why it works — `price > NULL` evaluates to NULL, which is not true, so the first row's `CASE` falls through to `ELSE 0` (or yields NULL with no `ELSE`, which `SUM` skips). Either way the boundary row contributes nothing—exactly correct, since there's no prior row to compare to.
- `ELSE 0` vs no `ELSE` — `SUM(CASE WHEN p > prev THEN 1 END)` (no else) also works—non-matches return NULL and are skipped by `SUM`. `ELSE 0` is more explicit.
- Worked example: `[10, 12, 12, 15]` → strict increases at month 2 (10→12) and month 4 (12→15) = 2.
```sql
WITH with_lag AS (
  SELECT
    product_id, month, price,
    LAG(price) OVER (PARTITION BY product_id ORDER BY month) AS prev_price
  FROM product_prices
)
SELECT
  product_id,
  SUM(CASE WHEN price > prev_price THEN 1 ELSE 0 END) AS increases
FROM with_lag
GROUP BY product_id;
```
Common beginner mistakes
- Using `BETWEEN` for date ranges and mishandling the upper boundary on timestamp columns.
- Forgetting `PARTITION BY product_id`—the `LAG` then crosses product boundaries and produces nonsense.
- Counting "increases or equals" (`>=`) when the prompt says strict increase (`>`).
- Sorting by an ambiguous column (e.g. `created_at` rounded to the day with multiple rows per day)—non-deterministic order.
- Forgetting that `LAG` returns NULL for the first row, so downstream NULL-aware logic must handle this case.
SQL interview question on time-series windows
Table product_prices(product_id, month, price). Return per-product the count of strict month-over-month price increases. Use LAG.
Solution using LAG inside a CTE and a SUM(CASE) aggregate
```sql
WITH with_lag AS (
  SELECT
    product_id, month, price,
    LAG(price) OVER (PARTITION BY product_id ORDER BY month) AS prev_price
  FROM product_prices
)
SELECT
  product_id,
  COALESCE(SUM(CASE WHEN price > prev_price THEN 1 ELSE 0 END), 0) AS increases
FROM with_lag
GROUP BY product_id
ORDER BY product_id;
```
Why this works: LAG produces the prior-month price within each product partition. The first row's LAG is NULL, so price > NULL is not true and its CASE contributes 0—correctly excluded from the count. SUM(CASE WHEN …) is the canonical SQL replacement for counting boolean predicates. COALESCE(…, 0) is defensive: with ELSE 0 present every row yields 0 or 1 and the SUM is never NULL, but if the ELSE were dropped, a product whose every CASE returned NULL (e.g. a single-row product) would get a NULL sum. The query plan: scan product_prices → window sort by (product_id, month) → window function adds the prev_price column → hash group-by on product_id → output. A single pass over the input table after the sort.
6. Sliding Window Patterns in Python
Sliding windows in Python for data engineering
Sliding window is a family of patterns sharing one mechanic: maintain a window over a sequence, advance it one step at a time, update state in O(1) per step instead of recomputing from scratch. Netflix's only Hard problem (#300, watch-time completion streaks) and #551 (sliding-window distinct counts) both reduce to this template.
Pro tip: Every sliding-window problem has two pointers and a window state. Right always moves forward; left moves forward when the window violates an invariant. State the template before writing code—it scaffolds the solution and prevents whiteboard-flailing.
Fixed-size vs variable-size sliding window
Two flavors, one decision rule based on the prompt's verb.
- Fixed-size — The window has exactly `k` elements. Each slide adds the new right element and removes the old left one. Use cases: rolling average, rolling max, count-distinct-in-last-k. Total time O(n).
- Variable-size — The window expands and shrinks based on a problem-specific invariant. `right` advances to expand; `left` advances to shrink until the invariant holds. Each pointer moves at most n times → O(n) amortized.
- Decision rule — "of size k" / "every k consecutive" → fixed. "longest" / "smallest" / "at most" / "at least" → variable.
- Worked example: Fixed-size 3 over `[1, 2, 3, 4, 5]` → windows `[1,2,3]`, `[2,3,4]`, `[3,4,5]`.
```python
def fixed_window_sums(arr: list[int], k: int) -> list[int]:
    # Assumes 1 <= k <= len(arr).
    out: list[int] = []
    s = sum(arr[:k])
    out.append(s)
    for i in range(k, len(arr)):
        s += arr[i] - arr[i - k]
        out.append(s)
    return out
```
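The fixed-size code above has a variable-size sibling. A sketch of the expand/shrink loop for "longest window with at most k distinct values" (the function name and task are illustrative, not one of the tagged Netflix problems):

```python
from collections import defaultdict

def longest_at_most_k_distinct(events: list[str], k: int) -> int:
    """Length of the longest window containing at most k distinct values."""
    counts: dict[str, int] = defaultdict(int)
    left = best = 0
    for right, ev in enumerate(events):
        counts[ev] += 1                 # expand: right always moves forward
        while len(counts) > k:          # shrink until the invariant holds again
            out = events[left]
            counts[out] -= 1
            if counts[out] == 0:
                del counts[out]         # drop zero-count keys so len() is honest
            left += 1
        best = max(best, right - left + 1)
    return best
```

Each pointer advances at most n times, so the loop is O(n) amortized even though it contains an inner `while`.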
Sliding-window distinct count (Counter + window step)
"How many distinct values in the current window of size k?" Naive rebuild-set-per-window is O(n·k); the incremental Counter is O(n).
- Mechanic — On each slide: increment the new right element's count; decrement the leaving left element's. Read `len(counter)` for the distinct count.
- Critical detail — When a count drops to zero, delete the key. Without the `del`, zero-count keys accumulate and `len(counter)` drifts upward.
- Off-by-one — First emission at `i >= k - 1` (the first complete window emits once you've seen `k` elements, at index `k - 1`).
- Worked example: `events = ['a','b','a','c','b']`, `k = 3`:
| step | window | counter | distinct |
|---|---|---|---|
| 1 | [a,b,a] | {a:2, b:1} | 2 |
| 2 | [b,a,c] | {b:1, a:1, c:1} | 3 |
| 3 | [a,c,b] | {a:1, c:1, b:1} | 3 |
```python
from collections import Counter

def sliding_distinct(events: list[str], k: int) -> list[int]:
    counter: Counter = Counter()
    out: list[int] = []
    for i, ev in enumerate(events):
        counter[ev] += 1
        if i >= k:
            old = events[i - k]
            counter[old] -= 1
            if counter[old] == 0:
                del counter[old]
        if i >= k - 1:
            out.append(len(counter))
    return out
```
Streak detection (consecutive completions)
"Longest run of True values" or "longest stretch of completed days" is a degenerate variable-window: the window grows on a match and resets to zero on a non-match.
- State — Two integers: `current` (length of the run ending here) and `best` (max seen). On a match, `current += 1`; on a non-match, `current = 0`. Always update `best = max(best, current)`.
- Why update `best` every step? If you only check at end-of-run, you miss the streak that extends to the very last element.
- Generalization — Replace the boolean check with any predicate; the template is unchanged. Combine with an `id += 1` on reset for gaps-and-islands labeling.
- Worked example: `[True, True, False, True, True, True, False, True]` → runs `2, 3, 1` → max `3`.
```python
def longest_streak(flags: list[bool]) -> int:
    best = current = 0
    for f in flags:
        current = current + 1 if f else 0
        if current > best:
            best = current
    return best
```
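The gaps-and-islands generalization mentioned above follows from bumping a run id on each reset. A sketch with a hypothetical labeling helper (not part of the problem set):

```python
def label_islands(flags: list[bool]) -> list[int]:
    """Label each True element with the id of the run (island) it
    belongs to; False elements (gaps) get -1."""
    labels: list[int] = []
    run_id = 0
    in_run = False
    for f in flags:
        if f:
            if not in_run:       # a new island starts here
                run_id += 1
                in_run = True
            labels.append(run_id)
        else:
            in_run = False       # close the current island
            labels.append(-1)
    return labels
```

For example, `label_islands([True, True, False, True])` returns `[1, 1, -1, 2]`, the same row-numbering trick SQL gaps-and-islands queries produce with window functions.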
Common beginner mistakes
- Rebuilding a `set` from scratch per window (O(n·k) instead of O(n)).
- Forgetting to `del counter[old]` when the count hits zero—`len(counter)` then over-counts.
- Using `i >= k` instead of `i >= k - 1` for the first emission—off-by-one (the first complete window emits once you've seen `k` elements, at index `k - 1`).
- Not resetting the streak counter on a non-match (treating `False` as a no-op).
- Updating `best` only at the end of a run—misses the streak that extends to the last element.
Python interview question on sliding windows
Given a stream of events and a window size k, return the count of distinct events in every window of size k. Use O(n) total time.
Solution using a Counter with explicit zero-delete
```python
from collections import Counter

def sliding_distinct_counts(events: list[str], k: int) -> list[int]:
    counter: Counter = Counter()
    out: list[int] = []
    for i, ev in enumerate(events):
        counter[ev] += 1
        if i >= k:
            old = events[i - k]
            counter[old] -= 1
            if counter[old] == 0:
                del counter[old]
        if i >= k - 1:
            out.append(len(counter))
    return out
```
Why this works: The window slides by one event per iteration—incrementing the new element and decrementing the leaving one—so each step is O(1). Deleting the key when its count hits zero keeps len(counter) equal to the number of currently-present distinct values; without the delete, removed events would linger as zero-count keys and inflate the answer. Total time is O(n), memory is O(min(n, alphabet_size)). The i >= k - 1 emission guard ensures the first output is for the first complete window (not for partial windows during fill-up).
Practice Netflix-style sliding-window problems →
7. Streaming, Deques, and Fixed-Memory Buffers in Python
Streams and deques in Python for data engineering
Streaming is where memory becomes a hard constraint—stream length is unknown, so you can't load it into a Pandas dataframe. The right shape is collections.deque(maxlen=N): a doubly-ended queue with fixed capacity that auto-evicts the opposite end on overflow. O(1) per push, O(N) memory regardless of stream length. Netflix's #603 (fixed-memory stream buffer) is the canonical formulation.
Pro tip: Any time you see "process the stream," ask three things out loud—how big is the stream, do you need the last N items or a summary, what happens on crash. That changes the round from "wrote a function" to "thinks like a pipeline engineer."
deque vs list: O(1) appends and pops from both ends
Python's list is a dynamic array—appends and right-pops are O(1) amortized, but list.insert(0, x) and list.pop(0) are O(n) because every element shifts. collections.deque is a doubly-linked list of memory blocks with O(1) at both ends.
- Decision rule — Either-end access → `deque`. Random access by index → `list` (deque indexing is O(n)). Stack-only or queue-only at one end → `list` (lower constant factors).
- Practical breakpoint — For very small N, both perform similarly; deque's advantage shows above N ≈ 100.
- Worked example:

| op | `list` | `deque` |
|---|---|---|
| append right | O(1) | O(1) |
| pop right | O(1) | O(1) |
| append left (`insert(0, x)`) | O(n) | O(1) |
| pop left (`pop(0)`) | O(n) | O(1) |
```python
from collections import deque

dq = deque([1, 2, 3])
dq.appendleft(0)  # deque([0, 1, 2, 3]) — O(1)
dq.popleft()      # returns 0 — O(1)
```
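The asymptotic gap is easy to confirm with a rough `timeit` comparison. A sketch (absolute numbers vary by machine; only the ratio matters):

```python
from collections import deque
import timeit

def drain_list(n: int) -> None:
    xs = list(range(n))
    while xs:
        xs.pop(0)        # O(n) per pop: every remaining element shifts left

def drain_deque(n: int) -> None:
    dq = deque(range(n))
    while dq:
        dq.popleft()     # O(1) per pop

list_time = timeit.timeit(lambda: drain_list(10_000), number=3)
deque_time = timeit.timeit(lambda: drain_deque(10_000), number=3)
# Expect deque_time to be far smaller than list_time: draining the list
# is quadratic overall, draining the deque is linear.
```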
deque(maxlen=N) for fixed-memory rolling buffer
Pass maxlen=N and the deque becomes a bounded ring buffer: pushes that exceed N automatically evict the opposite end in O(1).
- The deque enforces the invariant — No `if len(dq) > N: dq.popleft()` guard needed. The constructor handles it silently.
- Capacity is fixed — Once set, `maxlen` cannot change. Use a regular deque + manual guard if dynamic capacity is required.
- Infinite streams work identically — Memory is bounded by `N`, not stream length. Say this explicitly when the prompt mentions "unbounded stream."
- Worked example: `dq = deque(maxlen=3)`, push `1, 2, 3, 4`:

| step | push | deque after |
|---|---|---|
| 1 | 1 | [1] |
| 2 | 2 | [1, 2] |
| 3 | 3 | [1, 2, 3] |
| 4 | 4 | [2, 3, 4] |
(The 1 was evicted automatically.)
```python
from collections import deque

def rolling_buffer(stream, capacity: int):
    buf = deque(maxlen=capacity)
    for item in stream:
        buf.append(item)
    return list(buf)
```
Stream consumption: generator + deque
Real streams are generators—they produce values lazily, one at a time, without holding the full sequence. Combining deque(maxlen=N) with for item in stream gives constant-memory rolling windows over arbitrarily long input.
- Production sources — Kafka, Kinesis, Pulsar, file iterators (`for line in open(huge_file)`), socket reads. The `for item in stream` idiom makes laziness automatic.
- Lazy emission — `yield list(buf)` after each push lets callers iterate as far as they want without materializing the full snapshot sequence (`O(emissions)` memory instead of `O(M·N)`).
- Worked example: consume an event generator, keep the last `n` events.
```python
from collections import deque
from typing import Iterator

def last_n(stream: Iterator[str], n: int) -> list[str]:
    buf = deque(maxlen=n)
    for ev in stream:
        buf.append(ev)
    return list(buf)
```
Common beginner mistakes
- Using a `list` and calling `list.pop(0)` in a hot loop—silently O(n) per step.
- Manually managing `maxlen` with `if len(dq) > N: dq.popleft()`—correct but redundant when you can just pass `maxlen=N`.
- Storing the entire stream just to keep the last `N` items—wastes memory.
- Forgetting that `deque` is iterable but not random-access in O(1)—`dq[i]` is O(i).
- Trying to change `maxlen` after construction—not supported; build a new deque.
Python interview question on streaming and deques
You receive an event stream of unknown length. Return the last k events at any moment, with constant memory regardless of how long the stream is.
Solution using deque(maxlen=k)
```python
from collections import deque
from typing import Iterable, Iterator

def stream_last_k(stream: Iterable[str], k: int) -> Iterator[list[str]]:
    buf = deque(maxlen=k)
    for ev in stream:
        buf.append(ev)
        yield list(buf)
```
Why this works: deque(maxlen=k) enforces the invariant "at most k items" automatically; pushing the (k+1)-th event evicts the oldest in O(1). Memory is bounded by k regardless of how many events the stream produces. The generator yield makes the function lazy—callers can iterate as far as they want without materializing the whole sequence of snapshots, which is critical for production streams that may run for days. Total time is O(L) where L is the number of events the caller actually consumes.
8. Resumable ETL with Checkpoint Files and Fault Tolerance in Python
Resumable ETL in Python for data engineering
Production pipelines crash. The question that separates a junior DE from a senior one is what happens after the crash. A pipeline that resumes from the last good offset—via a checkpoint file—loses minutes at most and double-processes nothing if its consumer is idempotent. Netflix's #607 is this exact scenario.
Pro tip: Three failure modes to defend against: (a) corrupt checkpoint from a half-written file → `os.replace` for atomic swap; (b) non-idempotent consumer → upsert by stable PK; (c) checkpoint written before work is durable → always process, confirm durable, then checkpoint.
See streaming and ETL problems →
Checkpoint file: serialize last-good offset
A checkpoint is a tiny durable file (JSON beats pickle for ops debugging) containing just enough state to resume—usually {"offset": <last_processed_index>}. Keep it small; a 100-field state blob is a debugging nightmare.
- Atomic on update — Write to `path + ".tmp"`, then `os.replace(tmp, path)`. On modern filesystems (ext4, XFS, APFS, NTFS), `os.replace` is atomic—either the old or the new checkpoint is intact, never a half-written one.
- Order matters — Process record N → confirm durable (DB commit returned, fsync succeeded) → write checkpoint. Reversing this loses records on crash.
- Worked example: the pipeline reads records 0–99 and writes `{"offset": 100}`. A crash at line 150 → the next run skips to 100 and replays 100–150. An idempotent consumer makes the replay safe.
```python
import json
import os

def write_checkpoint(path: str, offset: int) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, path)  # atomic on most filesystems
```
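A quick round-trip in a temp directory shows the invariant: after the swap the checkpoint is always complete, parseable JSON, and no `.tmp` leftover remains. A sketch using the same write pattern:

```python
import json
import os
import tempfile

def write_checkpoint(path: str, offset: int) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, path)  # atomic swap

with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "ckpt.json")
    write_checkpoint(ckpt, 100)
    write_checkpoint(ckpt, 200)           # overwriting an existing checkpoint is atomic too
    with open(ckpt) as f:
        state = json.load(f)              # state == {"offset": 200}
    leftover = os.path.exists(ckpt + ".tmp")  # False: os.replace consumed the tmp file
```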
Resumable read: skip to last checkpoint, replay from there
On startup, read the checkpoint (default to 0 if missing—first run). Skip the source to that offset. Process records, batching checkpoint writes (every N records or T seconds) so disk I/O doesn't dominate.
- Skip mechanic varies by source — Files: `for i, line in enumerate(f): if i < start: continue` is O(start); `f.seek(byte_offset)` is O(1) if you store byte offsets. Kafka: `seek(offset)` directly. Databases: `WHERE id > last_id ORDER BY id`.
- Batch trade-off — Larger batches → less I/O but more replay on crash. ~100–1000 records is typical; reason about it explicitly in the interview.
- Edge cases — Empty input → skip the final write if `last == start`. Source shrunk → the pipeline does nothing (correctly).
- Worked example: resume a CSV pipeline.
```python
import json
import os

def read_checkpoint(path: str) -> int:
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f).get("offset", 0)

def run_pipeline(input_path: str, ckpt_path: str, batch: int = 100) -> None:
    start = read_checkpoint(ckpt_path)
    last = start
    with open(input_path) as f:
        for i, line in enumerate(f):
            if i < start:
                continue
            process(line)  # idempotent
            last = i + 1
            if last % batch == 0:
                write_checkpoint(ckpt_path, last)
    if last > start:
        write_checkpoint(ckpt_path, last)
```
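When resumes must be O(1) instead of O(start), checkpoint the byte offset from `f.tell()` and `seek` to it on startup. A hedged sketch under those assumptions (binary mode so `tell()` is a true byte offset; the helper names are illustrative):

```python
import json
import os

def _read_ckpt(path: str) -> int:
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f).get("offset", 0)

def _write_ckpt(path: str, offset: int) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, path)  # atomic swap

def run_pipeline_seek(input_path: str, ckpt_path: str, process, batch: int = 100) -> None:
    start = _read_ckpt(ckpt_path)          # byte offset, not a line count
    done = 0
    with open(input_path, "rb") as f:
        f.seek(start)                      # O(1) resume: no skip loop
        for raw in f:
            process(raw.decode())          # idempotent consumer
            done += 1
            if done % batch == 0:
                _write_ckpt(ckpt_path, f.tell())
        if done:                           # final partial-batch checkpoint
            _write_ckpt(ckpt_path, f.tell())
```

Running it twice over the same input processes every line exactly once: the second run seeks straight to end-of-file and does nothing.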
Idempotency and fault-tolerant writes
Resumability is only safe if process(record) is idempotent: replaying a record doesn't change system state more than once. Without it, every replay duplicates work.
- Upsert by stable PK — `INSERT … ON CONFLICT (pk) DO UPDATE` (Postgres) / `ON DUPLICATE KEY UPDATE` (MySQL). The PK must be deterministic from the source (event id, row hash, `(user_id, ts)`)—auto-increment surrogates break idempotency.
- Dedupe at the sink — Maintain a "seen events" table; check membership before processing. Works when the source can't produce a stable PK.
- External calls — Payment gateways, email sends: use the idempotency-key pattern (every external call carries a unique key derived from the record). Stripe and most modern APIs support this.
- Worked example: inserting events into Postgres with an idempotent upsert.
```python
def process(line: str) -> None:
    event = parse(line)
    db.execute(
        "INSERT INTO events (id, payload) VALUES (%s, %s) "
        "ON CONFLICT (id) DO NOTHING",
        (event.id, event.payload),
    )
```
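The same upsert shape can be exercised end-to-end with stdlib `sqlite3`, which supports `ON CONFLICT … DO NOTHING` (SQLite 3.24+). The schema and event names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id TEXT PRIMARY KEY, payload TEXT)")

def process_event(event_id: str, payload: str) -> None:
    # The stable primary key makes the insert replay-safe.
    conn.execute(
        "INSERT INTO events (id, payload) VALUES (?, ?) "
        "ON CONFLICT (id) DO NOTHING",
        (event_id, payload),
    )

process_event("evt-1", "play")
process_event("evt-1", "play")   # replay: no duplicate row
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
# row_count == 1
```

Replaying the same event twice leaves exactly one row, which is the property that makes checkpoint-based replay safe.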
Common beginner mistakes
- Writing the checkpoint before processing—loses the just-skipped records on crash.
- Non-atomic checkpoint writes (writing directly to the target file)—a crash mid-write leaves corrupt JSON; the next startup raises.
- Per-record checkpoint writes—correct but slow; batch instead.
- Non-idempotent `process()` that duplicates inserts on replay (no upsert, no dedup table).
- Auto-incrementing surrogate keys as the idempotency key—they change on replay, breaking the upsert guarantee.
Python interview question on resumable ETL
Build a function run_pipeline(input_path, ckpt_path) that processes lines from a large CSV. On startup it reads ckpt_path (a JSON file with {"offset": int}); after every 100 records it writes the current offset back. The pipeline must be safely killable at any moment without losing or duplicating records. process(line) is provided and is idempotent.
Solution using atomic checkpoint writes and batched commits
```python
import json
import os

def read_checkpoint(path: str) -> int:
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f).get("offset", 0)

def write_checkpoint(path: str, offset: int) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp, path)

def run_pipeline(input_path: str, ckpt_path: str, batch: int = 100) -> None:
    start = read_checkpoint(ckpt_path)
    last = start
    with open(input_path) as f:
        for i, line in enumerate(f):
            if i < start:
                continue
            process(line)  # idempotent
            last = i + 1
            if last % batch == 0:
                write_checkpoint(ckpt_path, last)
    if last > start:
        write_checkpoint(ckpt_path, last)
```
Why this works: The atomic os.replace swap means a crash during write_checkpoint either leaves the old good checkpoint intact or atomically swaps in the new good one—never a half-written file. Batching the writes (every 100 records) keeps disk overhead negligible without sacrificing the safety budget: on crash, replay is at most 99 records, all idempotent. The final write outside the loop catches the partial-batch case (e.g. 250 records → checkpoint at 100, 200, then 250 at the end). The whole pipeline is safe under arbitrary crashes because (a) the checkpoint is durable after os.replace, (b) process is idempotent so replay is safe, and (c) process runs before the checkpoint write so we never claim done on un-done work.
Tips to crack Netflix data engineering interviews
These are habits that move the needle in real Netflix loops—not a re-statement of the topics above.
SQL preparation: drill window + anti-join + set ops
Spend most of your SQL prep on window functions inside CTEs (SUM / RANK / LAG), anti-joins (LEFT JOIN ... IS NULL and NOT EXISTS), set operations (INTERSECT, EXCEPT), and relational division (count-match HAVING). Type the queries; do not just read them. The window functions, anti-join, and set operations topic pages cover these directly.
Python preparation: stack, sliding window, deque
Netflix's Python rounds reward stdlib fluency over framework knowledge. Memorize: monotonic stacks for span queries, sliding-window distinct counts with Counter, collections.deque(maxlen=N) for fixed-memory streams, and the resumable-ETL pattern. Drill problems from the sliding window and streaming topic pages.
Pipeline thinking: streaming, checkpointing, fault tolerance
Netflix interviewers care about production realism. When the prompt says "process a stream," ask: how big is the stream? What happens on crash? Is process idempotent? Where does the checkpoint live? Saying these four sentences out loud changes how the round is scored. The ETL topic page has problems that match the resumable-pipeline pattern.
Where to practice on PipeCode
| Skill lane | Practice path |
|---|---|
| Curated Netflix practice set | /explore/practice/company/netflix |
| SQL window functions | /explore/practice/topic/window-functions |
| SQL anti-join | /explore/practice/topic/anti-join |
| SQL set operations | /explore/practice/topic/set-operations |
| Sliding window | /explore/practice/topic/sliding-window |
| Streaming + deque | /explore/practice/topic/streaming |
| ETL pipelines | /explore/practice/topic/etl |
| Hash table | /explore/practice/topic/hash-table |
| All practice topics | /explore/practice/topics |
| Interview courses | /explore/courses |
Communication under time pressure
State assumptions before typing: "I'll assume the stream is too large to fit in memory and that process is idempotent." State grain: "One row per product per month." State edge cases: "If the recursion has a cycle, my depth guard caps at 50." Interviewers grade clear reasoning above silent-and-perfect.
Frequently Asked Questions
What is the Netflix data engineering interview process like?
The Netflix data engineering interview typically includes a phone screen (Python warm-up around hash maps or arrays), one or two coding rounds focused on SQL window functions and Python streaming patterns, a system-design conversation around pipelines and warehousing, and behavioral interviews. The curated 13-problem Netflix practice set on PipeCode mirrors what you will see on the technical rounds.
What SQL topics does Netflix test for data engineers?
Netflix emphasizes window functions (SUM/RANK/LAG/LEAD with PARTITION BY), anti-joins, set operations (INTERSECT, EXCEPT), relational division ("for all" patterns), and time-series date logic. Drill these on the window functions, anti-join, and set operations topic pages before your screen.
How important is Python for a Netflix data engineering interview?
Python is roughly half the interview at Netflix—and the patterns are pipeline-realistic, not algorithm puzzles. Expect monotonic stacks, sliding-window distinct counts with Counter, collections.deque(maxlen=N) for fixed-memory streams, and resumable-ETL with checkpoint files. Stdlib fluency (collections, functools, itertools) matters more than knowing the latest framework.
Are sliding window questions common in the Netflix data engineering interview?
Yes—Netflix's only Hard problem in the curated set is sliding window (watch-time completion streaks), and a separate Medium covers sliding-window distinct counts in event streams. Memorize the Counter-based "distinct count in window of size k" pattern and the variable-window left/right pointer pattern. The sliding window topic page has problems matching this shape.
How should I prepare for Netflix's streaming and ETL questions?
Build the mental model first: a stream is unknown-length, so memory must be bounded; a pipeline crashes, so it must resume from a checkpoint; replays happen, so process must be idempotent. Practice writing deque(maxlen=N) rolling buffers and os.replace-based atomic checkpoint files from memory. The streaming and ETL topic pages cover both directly.
How many Netflix practice problems should I solve before the interview?
Aim for 30–50 problems spanning all eight topic clusters above—not 100 of the same kind. Solve every problem in the Netflix-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide.
Start practicing Netflix data engineering problems
Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Netflix problems with tests, AI feedback, and a coding environment so you can drill the exact SQL and Python patterns Netflix asks—without the noise of generic algorithm prep.
Pipecode.ai is Leetcode for Data Engineering.