Facebook data engineering interview questions are bilingual, product-analytics-flavored, and PostgreSQL-grounded on the SQL side. Facebook Inc. rebranded to Meta Platforms Inc. in October 2021, but the data-engineering interview shape — and the question patterns that show up in the live phone screen and onsite — has not moved. The standard technical phone screen is 5 minutes intro + 30 minutes SQL + 30 minutes Python + 5 minutes Q&A, with the candidate choosing whether to start with SQL or Python. Four primitives carry the loop: n*(n+1)/2 - sum(arr) arithmetic-series sum-formula (and its XOR self-cancellation alternative) for the missing-number array problem, character-by-character tokenization plus two-pass evaluation for arithmetic formula parsing, correlated EXISTS subqueries with EXTRACT(MONTH FROM ts - INTERVAL '1 month') for month-over-month MAU retention, and CTE composition plus self-joins for post-hiatus aggregation and friend-recommendation queries.
This guide walks four topic clusters end-to-end, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches a curated 2-problem Facebook set (1 EASY Python array + 1 MEDIUM Python array+math+bit+string parser) plus two adjacent SQL primitives — EXISTS month-over-month retention and CTE + self-join aggregation — that show up on every Meta SQL question list and at every product-analytics onsite. The interview is bilingual SQL + Python; candidates who prep only one language stutter on the half they avoided.
Top Facebook data engineering interview topics
From the Facebook data engineering practice set, the four numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–4) | Why it shows up at Facebook |
|---|---|---|
| 1 | Python array missing number — sum formula and XOR | Missing Number (EASY) — n*(n+1)/2 - sum(arr) arithmetic-series identity and its XOR self-cancellation alternative, the classic array primitive Meta phone-screens with. |
| 2 | Python arithmetic formula evaluator — array, math, bit, string parsing | Arithmetic Formula Evaluator (MEDIUM) — character-by-character tokenization plus two-pass evaluation (handle * and / first, then + and -), the parser primitive that powers any "given a string formula, return the value" question. |
| 3 | SQL window functions and EXISTS for monthly active user retention |
Active User Retention — correlated EXISTS subquery with EXTRACT(MONTH FROM ts - INTERVAL '1 month') for month-over-month MAU, the SQL primitive that drives Meta product-analytics retention dashboards. |
| 4 | SQL CTE and self-join for post hiatus and friend recommendations | Average Post Hiatus + Friend Recommendations — MIN/MAX(post_date) per user with HAVING COUNT > 1 and CTE-driven self-joins for pair-wise friend-rec queries, the SQL primitive that drives Meta social-graph analytics. |
Bilingual phone-screen framing rule: Facebook / Meta data engineering phone screens run a strict 5-30-30-5 format: 5 minutes intro, 30 minutes SQL, 30 minutes Python, 5 minutes Q&A — and the candidate chooses whether to open with SQL or Python. Drill both halves equally; over-indexing on one side sets up a stuttering second half. State your preferred opener at the start so the interviewer can plan accordingly.
1. Python Array Missing Number — Sum Formula and XOR
Sum-formula and XOR self-cancellation in Python for data engineering
"Given an array of n distinct integers in the range [0, n] with exactly one missing, return the missing number" is Facebook's signature EASY Python prompt (Missing Number). The mental model: the arithmetic-series identity 0 + 1 + … + n = n*(n+1)/2 gives the expected sum; subtracting the actual sum reveals the missing number in O(n) time and O(1) space. The mirror primitive is XOR self-cancellation — a ^ a == 0 and a ^ 0 == a — which gives the same answer using bit operations and never overflows.
Pro tip: State both approaches out loud before writing code. Interviewers grade the candidate's awareness that the sum-formula can overflow on huge
n(Python ints are arbitrary precision so it's fine in practice, but the answer in Java or C++ requires the XOR variant). Naming both demonstrates breadth.
Arithmetic-series sum formula: n*(n+1)/2
The sum-formula invariant: the sum of integers from 0 to n inclusive is n*(n+1)/2. Subtracting sum(arr) from this expected total yields the missing element when exactly one integer is absent from the contiguous range.
-
n*(n+1)//2— integer division in Python; produces the expected total. -
sum(nums)—O(n)linear scan; produces the actual total. -
expected - actual— the difference is the missing value. - Constant space — no auxiliary structure required.
Worked example. nums = [0, 1, 3, 4], n = 4.
| step | value |
|---|---|
n |
4 |
n*(n+1)//2 |
10 |
sum(nums) |
8 |
missing |
2 |
Worked-example solution.
def missing_number(nums: list[int]) -> int:
n = len(nums)
return n * (n + 1) // 2 - sum(nums)
Rule of thumb: the sum-formula is the cleanest one-liner in Python; reach for it whenever the prompt guarantees a contiguous integer range.
XOR self-cancellation: a ^ a == 0
The XOR invariant: a ^ a == 0 and a ^ 0 == a; XOR is commutative and associative so the order does not matter. XORing every element of nums with every element of range(n + 1) cancels every value that appears in both, leaving only the one missing from nums.
-
a ^ a == 0— self-cancellation property. -
a ^ 0 == a— identity property. - Commutative + associative — order-independent.
- No overflow — bit operations never overflow in any language.
Worked example. Same nums = [0, 1, 3, 4], n = 4.
| step | XOR chain |
|---|---|
| nums |
0 ^ 1 ^ 3 ^ 4 = 6 |
| range |
0 ^ 1 ^ 2 ^ 3 ^ 4 = 4 |
| combined |
6 ^ 4 = 2 |
Worked-example solution.
from functools import reduce
from operator import xor
def missing_number_xor(nums: list[int]) -> int:
n = len(nums)
return reduce(xor, nums + list(range(n + 1)))
Rule of thumb: the XOR variant is the right answer when the prompt asks about overflow safety or when you want to demonstrate bit-manipulation fluency.
Set-difference fallback for non-contiguous ranges
The set-difference invariant: when the input range is not contiguous (e.g., "find the missing element from [10, 20, 30, 40, 60] knowing the full set should be [10, 20, 30, 40, 50, 60]"), the sum and XOR shortcuts no longer apply; reach for set(expected) - set(actual).
-
set(expected) - set(actual)—O(n)time,O(n)space. -
(set(expected) - set(actual)).pop()— return the single missing element. - Multiple missing elements — the same set-difference returns all of them.
-
Slower than sum / XOR — uses
O(n)extra space; pick this only when the range is non-contiguous.
Worked example. actual = [10, 20, 30, 40, 60], expected = [10, 20, 30, 40, 50, 60].
| step | output |
|---|---|
set(expected) - set(actual) |
{50} |
Worked-example solution.
def missing_from_set(actual: list[int], expected: list[int]) -> int:
return (set(expected) - set(actual)).pop()
Rule of thumb: contiguous range → sum-formula or XOR (O(1) space); non-contiguous → set-difference (O(n) space).
Common beginner mistakes
- Using
n * (n + 1) / 2(float division) instead ofn * (n + 1) // 2(integer division) — produces a float result that fails int-typed tests. - Computing
nasmax(nums)instead oflen(nums)— wrong by one when the missing element is the max itself. - Forgetting that
range(n)excludesn— userange(n + 1)for inclusive[0, n]. - Sorting
numsfirst —O(n log n)instead ofO(n); signals algorithmic weakness. - Returning the difference set instead of a single value — read the contract; the missing-number problem returns one int.
Python Interview Question on Missing Number
Given an array nums containing n distinct integers in the range [0, n], return the single missing number in O(n) time and O(1) space.
def missing_number(nums: list[int]) -> int:
pass # your code here
Solution Using n*(n+1)//2 - sum(nums)
def missing_number(nums: list[int]) -> int:
n = len(nums)
return n * (n + 1) // 2 - sum(nums)
Why this works: len(nums) gives n because the array contains exactly one fewer element than the full range [0, n]; the arithmetic-series identity 0 + 1 + … + n = n*(n+1)/2 produces the expected total; subtracting sum(nums) (the actual total) yields the missing value; the integer-division // keeps the result an int. O(n) time for the single linear sum, O(1) extra space — no auxiliary structure.
Step-by-step trace for nums = [3, 0, 1]:
| step | value |
|---|---|
n = len(nums) |
3 |
expected = 3 * 4 // 2
|
6 |
actual = sum([3, 0, 1])
|
4 |
missing = 6 - 4
|
2 |
Output:
| input | expected | actual | missing |
|---|---|---|---|
[3, 0, 1] |
6 | 4 | 2 |
Why this works — concept by concept:
-
Arithmetic-series identity —
0 + 1 + ... + n = n*(n+1)/2is a closed-form expression; computing it isO(1)regardless ofn. -
Linear
sum(nums)—O(n)scan of the array; the only data-touching cost in the algorithm. -
Integer division
//— keeps the expected total as a Pythonint;n*(n+1)is always even so the division has no remainder. -
Subtraction reveals the gap —
expected - actualcancels every shared value and leaves the missing one. -
O(n)time /O(1)space — single pass overnums, constant extra storage, no recursion or auxiliary collection.
Inline CTA: Drill the Facebook Python practice page for the curated EASY array problem and the Facebook array practice page for the company_topic surface.
PYTHON
Company — Facebook (Python)
Facebook Python practice problems
PYTHON
Company — Facebook / array
Facebook array problems
PYTHON
Topic — array
Python array problems
2. Python Arithmetic Formula Evaluator — Array, Math, Bit, String Parsing
Tokenization and two-pass evaluation in Python for data engineering
"Given a string like '3 + 5 * 2', parse and evaluate it with standard precedence" is Facebook's signature MEDIUM Python prompt (Arithmetic Formula Evaluator). The mental model: a length-1 character-by-character scan tokenizes the string into a flat list of integers and operator characters; a first pass collapses every * and / left-to-right; a second pass collapses the remaining + and - left-to-right; the final list has one element — the result. Same primitive powers any "expression evaluator" pipeline — calculator apps, formula columns in spreadsheets, simple DSL interpreters.
Pro tip: Avoid
eval(s)even when the test prompt allows it. Production code never trusts arbitrary strings, and interviewers grade the candidate who writes a real parser. State the no-evalrule out loud before coding; the senior signal is unmistakable.
Tokenization: split a string into numbers and operators
The tokenization invariant: scan the string character-by-character; accumulate digits into an integer buffer, flush the buffer when an operator is hit, append the operator as its own token; whitespace is skipped. The output is a flat list alternating between integers and operator chars.
-
Digit accumulation —
if c.isdigit(): num = num * 10 + int(c). -
Operator flush —
elif c in '+-*/': tokens.append(num); tokens.append(c); num = 0. -
Final flush — append the trailing
numafter the loop. -
Whitespace — skip with
if c.isspace(): continue.
Worked example. s = '3 + 5 * 2'.
| step | tokens |
|---|---|
| start | [] |
3 |
[3] |
+ |
[3, '+'] |
5 |
[3, '+', 5] |
* |
[3, '+', 5, '*'] |
2 |
[3, '+', 5, '*', 2] |
Worked-example solution.
def tokenize(s: str) -> list:
tokens, num, in_num = [], 0, False
for c in s:
if c.isspace():
continue
if c.isdigit():
num = num * 10 + int(c)
in_num = True
else:
if in_num:
tokens.append(num)
num, in_num = 0, False
tokens.append(c)
if in_num:
tokens.append(num)
return tokens
Rule of thumb: always handle multi-digit numbers explicitly — '12 + 3' is two operands, not three; the digit accumulator is what makes that work.
Two-pass evaluation: handle * and / first, then + and -
The precedence invariant: the first pass walks the token list, finds every * and /, and replaces the triple (left, op, right) with (left op right); the second pass does the same for + and -. After the first pass, the token list contains only + and - operators between integers — left-to-right evaluation produces the result.
-
First pass — handle
*and/(higher precedence). -
Second pass — handle
+and-(lower precedence). - Left-to-right within precedence — preserves the standard expression semantics.
-
Integer division —
//for/if the contract is integer arithmetic;int(a / b)for truncate-toward-zero.
Worked example. Continuing tokens = [3, '+', 5, '*', 2].
| pass | tokens |
|---|---|
| input | [3, '+', 5, '*', 2] |
pass 1 (*) |
[3, '+', 10] |
pass 2 (+) |
[13] |
| result | 13 |
Worked-example solution.
def collapse(tokens: list, ops: set) -> list:
result = [tokens[0]]
i = 1
while i < len(tokens):
op, right = tokens[i], tokens[i + 1]
if op in ops:
left = result.pop()
if op == '*':
result.append(left * right)
elif op == '/':
result.append(left // right)
elif op == '+':
result.append(left + right)
else: # '-'
result.append(left - right)
else:
result.append(op)
result.append(right)
i += 2
return result
Rule of thumb: two passes is the simplest correct approach; reach for the stack-based shunting-yard variant only when parens or unary minus enter the spec.
Stack-based evaluator for general expressions
The stack invariant: maintain a stack of partial results; when scanning a token, the operator decides whether to update the top of the stack (* or /) or to push a new term (+ or -). The final answer is sum(stack).
-
+— push the next operand onto the stack. -
-— push the negated next operand. -
*— multiply the top of stack by the next operand in place. -
/— integer-divide the top of stack by the next operand in place.
Worked example. s = '3+5*2', single-pass with stack.
| token | stack |
|---|---|
3 |
[3] |
+ 5 |
[3, 5] |
* 2 |
[3, 10] |
| end | sum = 13 |
Worked-example solution.
def calculate(s: str) -> int:
s = s.replace(' ', '')
stack, num, op = [], 0, '+'
for i, c in enumerate(s):
if c.isdigit():
num = num * 10 + int(c)
if c in '+-*/' or i == len(s) - 1:
if op == '+': stack.append(num)
elif op == '-': stack.append(-num)
elif op == '*': stack.append(stack.pop() * num)
else: stack.append(int(stack.pop() / num))
op, num = c, 0
return sum(stack)
Rule of thumb: the stack-based single-pass evaluator is the production-grade answer; the two-pass variant is the easier explanation but allocates more.
Common beginner mistakes
- Using
eval(s)— works on the test cases but is graded as a fail; production code neverevals untrusted input. - Forgetting multi-digit numbers —
'12 + 3'becomes[1, 2, '+', 3]instead of[12, '+', 3]. - Doing left-to-right without precedence —
3 + 5 * 2becomes(3 + 5) * 2 = 16instead of3 + (5 * 2) = 13. - Using
int(a / b)when the contract says//— produces wrong rounding for negative integers. - Skipping the trailing-number flush — the last operand never enters the token list.
Python Interview Question on Arithmetic Formula Evaluator
Given a string s containing integer operands and the operators + - * / (no parentheses, integer arithmetic, standard precedence), return the integer result.
def calculate(s: str) -> int:
pass # your code here
Solution Using single-pass stack-based evaluator
def calculate(s: str) -> int:
s = s.replace(' ', '')
stack, num, op = [], 0, '+'
for i, c in enumerate(s):
if c.isdigit():
num = num * 10 + int(c)
if c in '+-*/' or i == len(s) - 1:
if op == '+':
stack.append(num)
elif op == '-':
stack.append(-num)
elif op == '*':
stack.append(stack.pop() * num)
else: # '/'
stack.append(int(stack.pop() / num))
op, num = c, 0
return sum(stack)
Why this works: the single pass tokenizes and evaluates simultaneously; op tracks the operator that applies to the just-finished number; on +/- we push a new term onto the stack; on */// we update the top of stack in place — which is exactly the precedence-respecting behavior we need; the trailing flush (i == len(s) - 1) handles the last number; sum(stack) collapses the additive sub-results.
Step-by-step trace for s = '3+5*2':
| i | c | num | op_pending | stack |
|---|---|---|---|---|
| 0 | 3 |
3 | + |
[] |
| 1 | + |
0 | + |
[3] |
| 2 | 5 |
5 | + |
[3] |
| 3 | * |
0 | * |
[3, 5] |
| 4 | 2 |
2 | * |
[3, 5] |
| end | — | 0 | — | [3, 10] |
sum(stack) = 3 + 10 = 13.
Output:
| input | result |
|---|---|
'3+5*2' |
13 |
'12-3*2' |
6 |
'10/3' |
3 |
Why this works — concept by concept:
- Single-pass tokenize-and-evaluate — one scan of the string drives both tokenization and stack updates; no separate token list materialized.
-
optracks the previous operator — applies to the number that just finished accumulating, which is the key trick for the stack-update timing. -
Stack push for
+and-— additive operations push new terms; the finalsumcollapses them. -
In-place stack update for
*and/— multiplicative operations modify the top of stack so they bind tighter than additive ones; this is what gives precedence without a second pass. -
Trailing flush via
i == len(s) - 1— the last operand has no following operator to trigger a flush; the index check forces the final stack update. -
O(n)time /O(n)space — one pass over the string; the stack holds at most one element per+or-token, which isO(n)worst case.
Inline CTA: More SQL string problems and array problems for breadth.
PYTHON
Company — Facebook (Python)
Facebook Python practice problems
PYTHON
Topic — string
Python string problems
PYTHON
Topic — array
Python array problems
3. SQL Window Functions and EXISTS for Monthly Active User Retention
Correlated EXISTS subquery for month-over-month MAU retention in SQL for data engineering
"Return the number of monthly active users in July 2022 — users who were active in both July AND June" is Meta's signature SQL retention prompt (DataLemur Q4). The mental model: a user is a July-MAU iff they have at least one event in July AND at least one event in June; a correlated EXISTS subquery checks the second condition row-by-row against the same user_actions table; EXTRACT(MONTH FROM curr_month.event_date - INTERVAL '1 month') shifts the comparison window. Same primitive powers any "active in current period AND in previous period" retention metric — week-over-week active users, day-over-day session retention, cohort-N-day return.
Pro tip: Correlated
EXISTSsubqueries are graded as the right answer over self-joins for retention queries. Self-joins explode the row count when a user has many events;EXISTSshort-circuits on the first match per outer row. State this performance distinction to the interviewer.
EXTRACT(MONTH FROM ts) and INTERVAL '1 month' arithmetic
The date-arithmetic invariant: EXTRACT(MONTH FROM ts) returns the month component of ts as an integer 1-12; ts - INTERVAL '1 month' shifts ts back exactly one calendar month respecting end-of-month edges. Combining both produces "month of one month ago" — the comparison key for retention.
-
EXTRACT(MONTH FROM ts)— month number 1-12. -
EXTRACT(YEAR FROM ts)— year number; combine with month for unique periods. -
ts - INTERVAL '1 month'— calendar shift, handles 31-day → 30-day automatically. -
DATE_TRUNC('month', ts)— alternative; returns first day of the month as aDATE.
Worked example. event_date = '2022-07-15'.
| expression | output |
|---|---|
EXTRACT(MONTH FROM event_date) |
7 |
event_date - INTERVAL '1 month' |
2022-06-15 |
EXTRACT(MONTH FROM event_date - INTERVAL '1 month') |
6 |
Worked-example solution.
SELECT EXTRACT(MONTH FROM event_date) AS curr_month,
EXTRACT(MONTH FROM event_date - INTERVAL '1 month') AS prev_month
FROM user_actions
WHERE event_date = '2022-07-15';
Rule of thumb: combine EXTRACT(MONTH FROM ...) and EXTRACT(YEAR FROM ...) when retention spans calendar years; month-only is wrong for Dec→Jan transitions.
Correlated EXISTS subquery for previous-month presence
The correlated-subquery invariant: WHERE EXISTS (SELECT 1 FROM ... WHERE inner.col = outer.col) returns TRUE for outer rows whose col value has at least one matching row in the inner query; the inner query references the outer alias and re-evaluates per outer row. Short-circuits on the first match.
-
EXISTS (SELECT 1 FROM ... WHERE ...)— the canonical pattern. - Correlation — inner WHERE clause references outer alias.
- Short-circuit — stops scanning inner rows on first match.
-
NOT EXISTS— mirror image; "no match found in inner."
Worked example. Two users; user 445 has June + July rows, user 742 has only July.
| user_id | event_date | EXISTS June row? |
|---|---|---|
| 445 | 2022-06-30 | (no — this IS June) |
| 445 | 2022-07-05 | ✓ (matches user 445's June row) |
| 742 | 2022-07-03 | ✗ |
Worked-example solution.
SELECT user_id
FROM user_actions AS curr_month
WHERE EXTRACT(MONTH FROM event_date) = 7
AND EXTRACT(YEAR FROM event_date) = 2022
AND EXISTS (
SELECT 1
FROM user_actions AS last_month
WHERE last_month.user_id = curr_month.user_id
AND EXTRACT(MONTH FROM last_month.event_date) = 6
AND EXTRACT(YEAR FROM last_month.event_date) = 2022
);
Rule of thumb: EXISTS beats self-join for retention because it short-circuits per outer row and never explodes the cardinality.
Alternative: window function with self-join over month buckets
The window-alternative invariant: a self-join on user_id between two pre-bucketed tables (or LAG/LEAD over month-truncated dates inside a partition) achieves the same retention answer with different performance trade-offs. Useful when the prompt asks for a continuous N-month-streak metric.
-
DATE_TRUNC('month', event_date)— bucket per month. -
LAG(month_bucket) OVER (PARTITION BY user_id ORDER BY month_bucket)— previous month bucket per user. -
Self-join —
JOIN user_actions u2 ON u2.user_id = u1.user_id AND u2.month_bucket = u1.month_bucket - INTERVAL '1 month'. - Window approach scales better — single sort, no nested scan.
Worked example. Same data; window approach.
| user_id | curr_month | prev_month_bucket | retained? |
|---|---|---|---|
| 445 | 2022-07 | 2022-06 | ✓ |
| 742 | 2022-07 | NULL | ✗ |
Worked-example solution.
WITH per_user_months AS (
SELECT DISTINCT user_id, DATE_TRUNC('month', event_date) AS month_bucket
FROM user_actions
)
SELECT user_id
FROM per_user_months curr
JOIN per_user_months prev
ON prev.user_id = curr.user_id
AND prev.month_bucket = curr.month_bucket - INTERVAL '1 month'
WHERE curr.month_bucket = '2022-07-01';
Rule of thumb: one-month retention → EXISTS; multi-month streak → window functions over DISTINCT user_id, month_bucket rows.
Common beginner mistakes
- Comparing
EXTRACT(MONTH FROM ts) = 6without checking the year — June 2021 silently counts as June 2022. - Using a self-join when
EXISTSis cleaner — explodes the cardinality. - Forgetting
WHERE EXTRACT(MONTH FROM event_date) = 7on the outer query — counts all-time MAU, not July-specific. - Comparing
event_date::datedirectly withoutINTERVAL '1 month'— fails on calendar boundaries (30 vs 31 day months, leap years). - Returning all rows instead of distinct user counts —
COUNT(DISTINCT user_id)is the metric.
SQL Interview Question on Active User Retention
Given user_actions(user_id, event_id, event_type, event_date), return the count of monthly active users for July 2022 — users who had at least one event in July AND at least one event in June. Output month (numeric) and monthly_active_users.
Solution Using correlated EXISTS subquery
SELECT
EXTRACT(MONTH FROM curr_month.event_date) AS mth,
COUNT(DISTINCT curr_month.user_id) AS monthly_active_users
FROM user_actions AS curr_month
WHERE EXTRACT(MONTH FROM curr_month.event_date) = 7
AND EXTRACT(YEAR FROM curr_month.event_date) = 2022
AND EXISTS (
SELECT 1
FROM user_actions AS last_month
WHERE last_month.user_id = curr_month.user_id
AND EXTRACT(MONTH FROM last_month.event_date)
= EXTRACT(MONTH FROM curr_month.event_date - INTERVAL '1 month')
AND EXTRACT(YEAR FROM last_month.event_date)
= EXTRACT(YEAR FROM curr_month.event_date - INTERVAL '1 month')
)
GROUP BY EXTRACT(MONTH FROM curr_month.event_date);
Why this works: the outer query restricts to July 2022 events; the correlated EXISTS short-circuits on the first June-2022 event for the same user_id; INTERVAL '1 month' makes the comparison work across year boundaries (Jan retention against December of prior year); COUNT(DISTINCT user_id) collapses multiple July events per user to a single MAU contribution; the EXTRACT(MONTH ...) GROUP BY is mechanically required by the SELECT alias.
Step-by-step trace for the DataLemur sample:
| user_id | event_date | event_type | curr_month? | EXISTS prev? | counted? |
|---|---|---|---|---|---|
| 445 | 2022-06-30 | sign-in | ✗ (June) | — | — |
| 742 | 2022-07-03 | sign-in | ✓ | ✗ | — |
| 445 | 2022-07-05 | like | ✓ | ✓ | ✓ |
| 742 | 2022-07-05 | comment | ✓ | ✗ | — |
| 648 | 2022-07-18 | like | ✓ | ✗ | — |
Only user 445 satisfies both conditions → MAU = 1.
Output:
| mth | monthly_active_users |
|---|---|
| 7 | 1 |
Why this works — concept by concept:
-
EXTRACT(MONTH FROM ts)+EXTRACT(YEAR FROM ts)— joint period-key avoids cross-year false positives. -
Correlated
EXISTS— short-circuits on first match per outer row;O(N · M)worst case but typically far less in practice. -
INTERVAL '1 month'shift — calendar-aware month subtraction; handles end-of-month edges and year transitions automatically. -
COUNT(DISTINCT user_id)— collapses multiple July events per user; the contract demands "users", not "events." -
GROUP BY EXTRACT(MONTH FROM ...)— required because the SELECT references a non-aggregate; produces one output row per month. -
O(|user_actions| × log|user_actions|)time — index lookup per outer row inside the EXISTS; with a(user_id, event_date)index this is near-linear.
Inline CTA: More SQL window-function problems and aggregation problems on PipeCode.
SQL
Company — Facebook
Facebook SQL practice problems
SQL
Topic — window functions
SQL window-function problems
SQL
Topic — aggregation
SQL aggregation problems
4. SQL CTE and Self-Join for Post Hiatus and Friend Recommendations
CTE composition with MIN/MAX aggregates and self-joins in SQL for data engineering
"For each user who posted at least twice in 2024, return the days between their first and last post" is Meta's signature post-hiatus prompt (DataLemur Q1). The mental model: MAX(post_date) - MIN(post_date) per user gives the hiatus; WHERE EXTRACT(YEAR FROM post_date) = 2024 filters to the year; HAVING COUNT(post_id) > 1 ensures the user actually posted multiple times. The same CTE-and-self-join skeleton scales up to the friend-recommendation pattern (Q6) — a CTE captures private_events and a self-join produces non-friend pairs who attended the same events.
Pro tip: Always alias the result of date subtraction (
AS days_betweenfor post hiatus,AS user_pairfor friend-rec). PostgreSQL returns anINTERVALfrom date subtraction by default — explicit casting (::DATE - ::DATE) returns a plainint. State the cast in the SELECT.
MIN / MAX aggregates per user with date subtraction
The aggregate-date invariant: MAX(post_date) - MIN(post_date) over a GROUP BY user_id window returns the per-user span of activity; casting to ::DATE first ensures the subtraction returns a plain integer-day count, not an INTERVAL. Filter to the year first via WHERE, group second.
-
MIN(post_date::DATE)— earliest post date per group. -
MAX(post_date::DATE)— latest post date per group. -
MAX - MIN—INTERVALif both are timestamps,intif both areDATE. -
EXTRACT(DAY FROM ...)cast — alternative if working with timestamps.
Worked example. Two users, six posts in 2024.
| user_id | post_dates | min | max | days_between |
|---|---|---|---|---|
| 151652 | 07/10, 07/12 | 07/10 | 07/12 | 2 |
| 661093 | 07/08, 07/29 | 07/08 | 07/29 | 21 |
Worked-example solution.
SELECT user_id,
MAX(post_date::DATE) - MIN(post_date::DATE) AS days_between
FROM posts
WHERE EXTRACT(YEAR FROM post_date) = 2024
GROUP BY user_id;
Rule of thumb: when the prompt says "days between first and last", MAX(::DATE) - MIN(::DATE) is one line; never compute it via window functions when GROUP BY suffices.
HAVING COUNT() > 1 for multi-event users
The filter-on-aggregate invariant: HAVING COUNT(post_id) > 1 filters group rows after the GROUP BY to keep only users with at least 2 posts; WHERE cannot reference aggregates and would parse-error. The two clauses are not interchangeable.
-
WHERE— row-level filter (year predicate). -
HAVING— group-level filter (count predicate). -
HAVING COUNT(post_id) > 1— strictly more than 1. -
HAVING COUNT(*) >= 2— equivalent here sincepost_idis non-null.
Worked example. Three users; one posted only once in 2024.
| user_id | post_count_2024 | survives? |
|---|---|---|
| 151652 | 2 | ✓ |
| 661093 | 2 | ✓ |
| 004239 | 1 | ✗ (filtered) |
Worked-example solution.
SELECT user_id,
MAX(post_date::DATE) - MIN(post_date::DATE) AS days_between
FROM posts
WHERE EXTRACT(YEAR FROM post_date) = 2024
GROUP BY user_id
HAVING COUNT(post_id) > 1;
Rule of thumb: a single-row hiatus is 0 days, which is misleading; the HAVING COUNT > 1 filter is non-negotiable for the post-hiatus prompt.
CTE composition for multi-step logic + self-join for friend-rec / pair queries
The CTE-and-self-join invariant: WITH cte AS (SELECT ...) names an intermediate result; subsequent SELECTs reference it like a table; for pair queries (e.g., friend recommendations), self-join the same CTE on a not-equal-id condition to generate every ordered pair. Combine with friendship-status filtering to surface non-friend pairs.
-
CTE definition —
WITH private_events AS (SELECT user_id, event_id FROM event_rsvp WHERE attendance_status IN ('going', 'maybe') AND event_type = 'private'). -
Self-join —
JOIN private_events e2 ON e1.event_id = e2.event_id AND e1.user_id != e2.user_id. -
Friendship-status filter —
JOIN friendship_status fs ... WHERE fs.status = 'not_friends'. -
Pair count —
HAVING COUNT(*) >= 2for the "two-or-more shared events" requirement.
Worked example. Three users (111, 222, 333) attended event 234; only (222, 333) are not friends.
| pair | shared_events | friend? | recommend? |
|---|---|---|---|
| (111, 222) | 1 | friends | ✗ |
| (111, 333) | 1 | not_friends | ✗ (only 1 shared) |
| (222, 333) | 2 | not_friends | ✓ |
Worked-example solution.
WITH private_events AS (
SELECT user_id, event_id
FROM event_rsvp
WHERE attendance_status IN ('going', 'maybe')
AND event_type = 'private'
)
SELECT fs.user_a_id, fs.user_b_id
FROM private_events e1
JOIN private_events e2
ON e1.event_id = e2.event_id
AND e1.user_id != e2.user_id
JOIN friendship_status fs
ON fs.user_a_id = e1.user_id
AND fs.user_b_id = e2.user_id
WHERE fs.status = 'not_friends'
GROUP BY fs.user_a_id, fs.user_b_id
HAVING COUNT(*) >= 2;
Rule of thumb: CTEs make multi-step logic readable; self-joins on the same CTE are the canonical shape for pair / recommendation queries.
Common beginner mistakes
- Computing
MAX - MINwithout::DATEcast — returns anINTERVAL, not days. - Using
WHERE COUNT(post_id) > 1— parse error;WHEREcannot reference aggregates. - Forgetting the year filter — counts all-time hiatus instead of 2024-only.
- Self-joining without
e1.user_id != e2.user_id— generates(user, user)pairs that pollute the answer. - Returning a single direction
(user_a, user_b)when the prompt says recommendations are bidirectional — add the mirror byUNIONor by including both orderings in the GROUP BY.
SQL Interview Question on Average Post Hiatus
Given posts(user_id, post_id, post_date), for each user who posted at least twice in 2024, return the number of days between the user's first and last post in 2024. Output user_id and days_between.
Solution Using MAX - MIN per user with HAVING COUNT > 1
SELECT
user_id,
MAX(post_date::DATE) - MIN(post_date::DATE) AS days_between
FROM posts
WHERE EXTRACT(YEAR FROM post_date) = 2024
GROUP BY user_id
HAVING COUNT(post_id) > 1;
Why this works: the WHERE clause restricts the row stream to 2024 posts before grouping; GROUP BY user_id collapses to one row per user; MAX(post_date::DATE) - MIN(post_date::DATE) produces the hiatus as an integer day count thanks to the ::DATE cast; HAVING COUNT(post_id) > 1 strips out users who posted only once. Single-pass aggregation; no self-join needed.
Step-by-step trace for the DataLemur sample:
| user_id | post_id | post_date |
|---|---|---|
| 151652 | 599415 | 2024-07-10 |
| 661093 | 624356 | 2024-07-29 |
| 004239 | 784254 | 2024-07-04 |
| 661093 | 442560 | 2024-07-08 |
| 151652 | 111766 | 2024-07-12 |
- WHERE filter — all 5 rows are in 2024; nothing dropped.
- GROUP BY user_id — three groups: 151652 (2 posts), 661093 (2 posts), 004239 (1 post).
- MIN / MAX per group — 151652 → (07-10, 07-12); 661093 → (07-08, 07-29); 004239 → (07-04, 07-04).
- MAX - MIN — 2 days, 21 days, 0 days.
- HAVING COUNT > 1 — strips 004239 (only 1 post).
Output:
| user_id | days_between |
|---|---|
| 151652 | 2 |
| 661093 | 21 |
Why this works — concept by concept:
-
WHERE EXTRACT(YEAR FROM post_date) = 2024— row-level filter; runs before grouping for correctness and performance. -
GROUP BY user_id— collapses to one row per user. -
MAX(post_date::DATE) - MIN(post_date::DATE)— date subtraction returns an integer day count after the::DATEcast;INTERVALwould be the wrong shape. -
HAVING COUNT(post_id) > 1— group-level filter; aggregate predicate must live inHAVING, notWHERE. -
O(|posts| + G log G)time — single scan + sort ofGgroups; no self-join.
Inline CTA: More SQL CTE problems and SQL join problems on PipeCode.
SQL
Company — Facebook
Facebook SQL practice problems
SQL
Topic — CTE
SQL CTE problems
SQL
Topic — joins
SQL join problems
Tips to crack Facebook data engineering interviews
Facebook = Meta — the rebrand survived but the SQL bar didn't move
Facebook Inc. rebranded to Meta Platforms Inc. in October 2021. The legal entity is Meta; the consumer products (Facebook, Instagram, WhatsApp, Messenger, Threads) keep their original brands. The data-engineering interview loop, the SQL bar, and the question shapes did not change with the rebrand. Search for both "Facebook data engineer interview" and "Meta data engineer interview" — every external article you find under one name applies under the other.
30+30 phone-screen format — pick your opener
The standard Meta data-engineering technical phone screen is 5 minutes intro + 30 minutes SQL + 30 minutes Python + 5 minutes Q&A, with the candidate choosing whether to open with SQL or Python. Pick whichever you'd rather attack first when you're freshest; the second half always feels harder under fatigue, so save your stronger language for the closer if you can. State your preferred opener at the start so the interviewer can plan.
Two role variants — pick the one that matches your loop
Meta has two distinct DE loops: the standard "Data Engineer" role (algo-heavy Python plus SQL) and the "Data Engineer — Product Analytics" role (5 SQL questions plus 5 algo coding questions, more product-sense flavor). Confirm which loop you're in during the recruiter call; the prep mix shifts — Product Analytics candidates drill more SQL retention / cohort patterns; standard DE candidates drill more Python algorithms.
Drill the four primitives
The four primitives in this guide map directly to the curated 2 PipeCode Python problems plus the two adjacent SQL primitives every Meta SQL list rotates through: n*(n+1)/2 - sum(arr) and XOR self-cancellation for the missing-number array problem (Python EASY, #83), two-pass tokenize-and-evaluate or single-pass stack-based evaluator for arithmetic formula parsing (Python MEDIUM, #273), correlated EXISTS with INTERVAL '1 month' for month-over-month MAU retention (DataLemur Q4), and CTE composition + self-joins with MIN/MAX aggregates and HAVING COUNT > 1 for post hiatus and friend recommendations (DataLemur Q1 / Q6).
Product-analytics SQL emphasis
Meta's data-engineering SQL questions are heavily product-analytics-flavored — MAU retention, post hiatus, power users, click-through rates, friend recommendations. Drill the Meta-style SQL practice surface with a focus on EXISTS subqueries, CTE composition, MIN/MAX per-user aggregates, and CASE WHEN + ROUND for percentage metrics. Avoid generic "joins and group-by" prep; Meta's bar is higher.
IC4-IC6 leveling and behavioral expectations
Meta levels DE roles from IC4 (Senior, ~5+ YoE) through IC6 (Staff, ~10+ YoE) and IC7 (Senior Staff). Behavioral rounds at IC5+ probe ownership ("tell me about a time you owned a critical pipeline migration"), navigating ambiguity, and cross-functional collaboration with product / DS / ML teams. Have STAR-format stories ready for each. Compensation per Levels.fyi: IC4 ~$340K-$430K total, IC5 ~$430K-$580K, IC6 $600K+.
Where to practice on PipeCode
Start with the Facebook practice page and the language-scoped Facebook Python practice page for the curated 2-problem set. Hit the company_topic Facebook — array page for the only Facebook-tagged topic surface available. After that, drill the matching topic pages: array, string, window functions, aggregation, CTE, joins, date functions, filtering. The interview courses page bundles structured curricula. For broader coverage, browse by topic, or pivot to peer guides — the Airbnb DE interview guide, the top DE interview questions 2026 blog, and the SQL data types Postgres guide.
Communication and approach under time pressure
Talk through the invariant first ("this is a missing-number problem with sum-formula and XOR alternatives"), the brute force second ("a sort-then-scan would also work but is O(n log n)"), and the optimal third ("but the sum-formula gives O(n) time and O(1) space"). Interviewers grade process as much as the final answer. Leave 5 minutes for an edge-case sweep: empty input, single-element array, zero-only array, year-boundary date arithmetic, NULL in a user_id partition. The most common "almost passed" failure mode is correct happy-path code that crashes on edge cases — a 30-second sweep prevents it.
Frequently Asked Questions
What is the Facebook (Meta) data engineering interview process?
The Meta data engineering interview opens with a recruiter screen (15-30 min), then a technical phone screen with the 5 minute intro + 30 minute SQL + 30 minute Python + 5 minute Q&A format (candidate-choice on opener), then a 4-5 round virtual onsite covering 2 SQL rounds, 1 Python algorithm round, 1 system design round, and 1 behavioral round. Senior roles (IC5+) add a data-architecture round. End-to-end the loop runs three to four weeks.
Is Facebook the same as Meta? Which name should I search for?
Yes — Facebook Inc. rebranded to Meta Platforms Inc. in October 2021. The legal entity is Meta; the consumer brands (Facebook, Instagram, WhatsApp, Messenger, Threads) keep their original names. Search for both "Facebook data engineer interview" and "Meta data engineer interview" — every external article you find under one name applies under the other. The DataLemur SQL guide is titled "Facebook/Meta SQL Interview Questions" because both names index the same content.
What programming languages does Facebook test in data engineering interviews?
Meta tests Python and SQL in the technical phone screen and onsite — bilingual by design. The phone screen is exactly 30 minutes of each, with the candidate choosing the opener. Python emphasizes algorithms (array missing-number, string parsing, hash-table aggregation, sliding-window patterns). SQL emphasizes product-analytics queries (MAU retention, post hiatus, power users, CTR, friend recommendations) on PostgreSQL syntax. Python questions for the Meta data engineer interview lean medium-difficulty algorithm style, not data-pipeline scripting.
What is the difference between the standard DE role and the Data Engineer — Product Analytics variant?
Standard Data Engineer at Meta = algo-heavier Python loops with bilingual SQL coverage (the curated PipeCode set's algorithm focus). Data Engineer — Product Analytics = a product-sense-flavored variant where the technical phone screen is 5 SQL questions + 5 algo coding questions in one session, more weighted toward retention / cohort / funnel analytics SQL than pure algorithm fluency. The recruiter call confirms which loop you're in; ask explicitly if it isn't stated.
What SQL questions does Meta ask data engineers?
Meta SQL interview questions concentrate on six product-analytics shapes: (1) post hiatus / first-and-last-event aggregations via MIN/MAX(::DATE) with HAVING COUNT > 1; (2) power-user identification via JOIN + 2-condition HAVING; (3) MAU retention via correlated EXISTS subquery with INTERVAL '1 month' shifts; (4) friend recommendations via CTE + self-join over event_rsvp with friendship_status filtering; (5) average-shares-per-post via LEFT JOIN + COALESCE; (6) ad click-through rate via CASE WHEN + ROUND and 2022-year filters. Drill all six against the PostgreSQL dialect; CoderPad is the live coding environment.
What is the Meta data engineer salary range?
Meta data engineer total compensation per Levels.fyi: IC4 (Senior, ~5+ YoE) $340K-$430K total comp ($180K-$220K base + RSUs + bonus); IC5 (Staff, ~10+ YoE) $430K-$580K total ($210K-$260K base); IC6 (Senior Staff, ~12+ YoE) $600K+ total. RSU refreshers are annual; equity vests on a 4-year schedule with a typical 25% cliff. Negotiation success rates run 10-20% with competing offers. The "meta data engineer salary" search keyword (vol 260) reflects strong candidate interest in this number.




Top comments (0)