Gowtham Potureddi

Posted on May 3

Facebook Data Engineering Interview Questions & Prep Guide

#sql #python #interview #dataengineering

Facebook data engineering interview questions are bilingual, product-analytics-flavored, and PostgreSQL-grounded on the SQL side. Facebook Inc. rebranded to Meta Platforms Inc. in October 2021, but the data-engineering interview shape — and the question patterns that show up in the live phone screen and onsite — has not moved. The standard technical phone screen is 5 minutes intro + 30 minutes SQL + 30 minutes Python + 5 minutes Q&A, with the candidate choosing whether to start with SQL or Python. Four primitives carry the loop: n*(n+1)/2 - sum(arr) arithmetic-series sum-formula (and its XOR self-cancellation alternative) for the missing-number array problem, character-by-character tokenization plus two-pass evaluation for arithmetic formula parsing, correlated EXISTS subqueries with EXTRACT(MONTH FROM ts - INTERVAL '1 month') for month-over-month MAU retention, and CTE composition plus self-joins for post-hiatus aggregation and friend-recommendation queries.

This guide walks four topic clusters end-to-end, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches a curated 2-problem Facebook set (1 EASY Python array + 1 MEDIUM Python array+math+bit+string parser) plus two adjacent SQL primitives — EXISTS month-over-month retention and CTE + self-join aggregation — that show up on every Meta SQL question list and at every product-analytics onsite. The interview is bilingual SQL + Python; candidates who prep only one language stutter on the half they avoided.

#	Topic (sections 1–4)	Why it shows up at Facebook
1	Python array missing number — sum formula and XOR	Missing Number (EASY) — `n*(n+1)/2 - sum(arr)` arithmetic-series identity and its XOR self-cancellation alternative, the classic array primitive Meta phone-screens with.
2	Python arithmetic formula evaluator — array, math, bit, string parsing	Arithmetic Formula Evaluator (MEDIUM) — character-by-character tokenization plus two-pass evaluation (handle `*` and `/` first, then `+` and `-`), the parser primitive that powers any "given a string formula, return the value" question.
3	SQL window functions and `EXISTS` for monthly active user retention	Active User Retention — correlated `EXISTS` subquery with `EXTRACT(MONTH FROM ts - INTERVAL '1 month')` for month-over-month MAU, the SQL primitive that drives Meta product-analytics retention dashboards.
4	SQL CTE and self-join for post hiatus and friend recommendations	Average Post Hiatus + Friend Recommendations — `MIN/MAX(post_date)` per user with `HAVING COUNT > 1` and CTE-driven self-joins for pair-wise friend-rec queries, the SQL primitive that drives Meta social-graph analytics.

1. Python Array Missing Number — Sum Formula and XOR

Sum-formula and XOR self-cancellation in Python for data engineering

"Given an array of n distinct integers in the range [0, n] with exactly one missing, return the missing number" is Facebook's signature EASY Python prompt (Missing Number). The mental model: the arithmetic-series identity 0 + 1 + … + n = n*(n+1)/2 gives the expected sum; subtracting the actual sum reveals the missing number in O(n) time and O(1) space. The mirror primitive is XOR self-cancellation — a ^ a == 0 and a ^ 0 == a — which gives the same answer using bit operations and never overflows.

Pro tip: State both approaches out loud before writing code. Interviewers grade the candidate's awareness that the sum-formula can overflow on huge n (Python ints are arbitrary precision so it's fine in practice, but the answer in Java or C++ requires the XOR variant). Naming both demonstrates breadth.

Arithmetic-series sum formula: `n*(n+1)/2`

The sum-formula invariant: the sum of integers from 0 to n inclusive is n*(n+1)/2. Subtracting sum(arr) from this expected total yields the missing element when exactly one integer is absent from the contiguous range.

n*(n+1)//2 — integer division in Python; produces the expected total.
sum(nums) — O(n) linear scan; produces the actual total.
expected - actual — the difference is the missing value.
Constant space — no auxiliary structure required.

Worked example. nums = [0, 1, 3, 4], n = 4.

step	value
`n`	4
`n*(n+1)//2`	10
`sum(nums)`	8
`missing`	2

Worked-example solution.

def missing_number(nums: list[int]) -> int:
    n = len(nums)
    return n * (n + 1) // 2 - sum(nums)

Rule of thumb: the sum-formula is the cleanest one-liner in Python; reach for it whenever the prompt guarantees a contiguous integer range.

XOR self-cancellation: `a ^ a == 0`

The XOR invariant: a ^ a == 0 and a ^ 0 == a; XOR is commutative and associative so the order does not matter. XORing every element of nums with every element of range(n + 1) cancels every value that appears in both, leaving only the one missing from nums.

a ^ a == 0 — self-cancellation property.
a ^ 0 == a — identity property.
Commutative + associative — order-independent.
No overflow — bit operations never overflow in any language.

Worked example. Same nums = [0, 1, 3, 4], n = 4.

step	XOR chain
nums	`0 ^ 1 ^ 3 ^ 4` = 6
range	`0 ^ 1 ^ 2 ^ 3 ^ 4` = 4
combined	`6 ^ 4` = 2

Worked-example solution.

from functools import reduce
from operator import xor

def missing_number_xor(nums: list[int]) -> int:
    n = len(nums)
    return reduce(xor, nums + list(range(n + 1)))

Rule of thumb: the XOR variant is the right answer when the prompt asks about overflow safety or when you want to demonstrate bit-manipulation fluency.

Set-difference fallback for non-contiguous ranges

The set-difference invariant: when the input range is not contiguous (e.g., "find the missing element from [10, 20, 30, 40, 60] knowing the full set should be [10, 20, 30, 40, 50, 60]"), the sum and XOR shortcuts no longer apply; reach for set(expected) - set(actual).

set(expected) - set(actual) — O(n) time, O(n) space.
(set(expected) - set(actual)).pop() — return the single missing element.
Multiple missing elements — the same set-difference returns all of them.
Slower than sum / XOR — uses O(n) extra space; pick this only when the range is non-contiguous.

Worked example. actual = [10, 20, 30, 40, 60], expected = [10, 20, 30, 40, 50, 60].

step	output
`set(expected) - set(actual)`	`{50}`

Worked-example solution.

def missing_from_set(actual: list[int], expected: list[int]) -> int:
    return (set(expected) - set(actual)).pop()

Rule of thumb: contiguous range → sum-formula or XOR (O(1) space); non-contiguous → set-difference (O(n) space).

Common beginner mistakes

Using n * (n + 1) / 2 (float division) instead of n * (n + 1) // 2 (integer division) — produces a float result that fails int-typed tests.
Computing n as max(nums) instead of len(nums) — wrong by one when the missing element is the max itself.
Forgetting that range(n) excludes n — use range(n + 1) for inclusive [0, n].
Sorting nums first — O(n log n) instead of O(n); signals algorithmic weakness.
Returning the difference set instead of a single value — read the contract; the missing-number problem returns one int.

Python Interview Question on Missing Number

Given an array nums containing n distinct integers in the range [0, n], return the single missing number in O(n) time and O(1) space.

def missing_number(nums: list[int]) -> int:
    pass  # your code here

Solution Using `n*(n+1)//2 - sum(nums)`

def missing_number(nums: list[int]) -> int:
    n = len(nums)
    return n * (n + 1) // 2 - sum(nums)

Why this works: len(nums) gives n because the array contains exactly one fewer element than the full range [0, n]; the arithmetic-series identity 0 + 1 + … + n = n*(n+1)/2 produces the expected total; subtracting sum(nums) (the actual total) yields the missing value; the integer-division // keeps the result an int. O(n) time for the single linear sum, O(1) extra space — no auxiliary structure.

Step-by-step trace for nums = [3, 0, 1]:

step	value
`n = len(nums)`	3
expected = `3 * 4 // 2`	6
actual = `sum([3, 0, 1])`	4
missing = `6 - 4`	2

Output:

input	expected	actual	missing
`[3, 0, 1]`	6	4	2

Why this works — concept by concept:

Arithmetic-series identity — 0 + 1 + ... + n = n*(n+1)/2 is a closed-form expression; computing it is O(1) regardless of n.
Linear sum(nums) — O(n) scan of the array; the only data-touching cost in the algorithm.
Integer division // — keeps the expected total as a Python int; n*(n+1) is always even so the division has no remainder.
Subtraction reveals the gap — expected - actual cancels every shared value and leaves the missing one.
O(n) time / O(1) space — single pass over nums, constant extra storage, no recursion or auxiliary collection.

Inline CTA: Drill the Facebook Python practice page for the curated EASY array problem and the Facebook array practice page for the company_topic surface.

PYTHON
Company — Facebook (Python)
Facebook Python practice problems

Practice →

PYTHON
Company — Facebook / array
Facebook array problems

Practice →

PYTHON
Topic — array
Python array problems

Practice →

2. Python Arithmetic Formula Evaluator — Array, Math, Bit, String Parsing

Tokenization and two-pass evaluation in Python for data engineering

"Given a string like '3 + 5 * 2', parse and evaluate it with standard precedence" is Facebook's signature MEDIUM Python prompt (Arithmetic Formula Evaluator). The mental model: a length-1 character-by-character scan tokenizes the string into a flat list of integers and operator characters; a first pass collapses every * and / left-to-right; a second pass collapses the remaining + and - left-to-right; the final list has one element — the result. Same primitive powers any "expression evaluator" pipeline — calculator apps, formula columns in spreadsheets, simple DSL interpreters.

Pro tip: Avoid eval(s) even when the test prompt allows it. Production code never trusts arbitrary strings, and interviewers grade the candidate who writes a real parser. State the no-eval rule out loud before coding; the senior signal is unmistakable.

Tokenization: split a string into numbers and operators

The tokenization invariant: scan the string character-by-character; accumulate digits into an integer buffer, flush the buffer when an operator is hit, append the operator as its own token; whitespace is skipped. The output is a flat list alternating between integers and operator chars.

Digit accumulation — if c.isdigit(): num = num * 10 + int(c).
Operator flush — elif c in '+-*/': tokens.append(num); tokens.append(c); num = 0.
Final flush — append the trailing num after the loop.
Whitespace — skip with if c.isspace(): continue.

Worked example. s = '3 + 5 * 2'.

step	tokens
start	`[]`
`3`	`[3]`
`+`	`[3, '+']`
`5`	`[3, '+', 5]`
`*`	`[3, '+', 5, '*']`
`2`	`[3, '+', 5, '*', 2]`

Worked-example solution.

def tokenize(s: str) -> list:
    tokens, num, in_num = [], 0, False
    for c in s:
        if c.isspace():
            continue
        if c.isdigit():
            num = num * 10 + int(c)
            in_num = True
        else:
            if in_num:
                tokens.append(num)
                num, in_num = 0, False
            tokens.append(c)
    if in_num:
        tokens.append(num)
    return tokens

Rule of thumb: always handle multi-digit numbers explicitly — '12 + 3' is two operands, not three; the digit accumulator is what makes that work.

Two-pass evaluation: handle `*` and `/` first, then `+` and `-`

The precedence invariant: the first pass walks the token list, finds every * and /, and replaces the triple (left, op, right) with (left op right); the second pass does the same for + and -. After the first pass, the token list contains only + and - operators between integers — left-to-right evaluation produces the result.

First pass — handle * and / (higher precedence).
Second pass — handle + and - (lower precedence).
Left-to-right within precedence — preserves the standard expression semantics.
Integer division — // for / if the contract is integer arithmetic; int(a / b) for truncate-toward-zero.

Worked example. Continuing tokens = [3, '+', 5, '*', 2].

pass	tokens
input	`[3, '+', 5, '*', 2]`
pass 1 (`*`)	`[3, '+', 10]`
pass 2 (`+`)	`[13]`
result	`13`

Worked-example solution.

def collapse(tokens: list, ops: set) -> list:
    result = [tokens[0]]
    i = 1
    while i < len(tokens):
        op, right = tokens[i], tokens[i + 1]
        if op in ops:
            left = result.pop()
            if op == '*':
                result.append(left * right)
            elif op == '/':
                result.append(left // right)
            elif op == '+':
                result.append(left + right)
            else:  # '-'
                result.append(left - right)
        else:
            result.append(op)
            result.append(right)
        i += 2
    return result

Rule of thumb: two passes is the simplest correct approach; reach for the stack-based shunting-yard variant only when parens or unary minus enter the spec.

Stack-based evaluator for general expressions

The stack invariant: maintain a stack of partial results; when scanning a token, the operator decides whether to update the top of the stack (* or /) or to push a new term (+ or -). The final answer is sum(stack).

+ — push the next operand onto the stack.
- — push the negated next operand.
* — multiply the top of stack by the next operand in place.
/ — integer-divide the top of stack by the next operand in place.

Worked example. s = '3+5*2', single-pass with stack.

token	stack
`3`	`[3]`
`+ 5`	`[3, 5]`
`* 2`	`[3, 10]`
end	sum = 13

Worked-example solution.

def calculate(s: str) -> int:
    s = s.replace(' ', '')
    stack, num, op = [], 0, '+'
    for i, c in enumerate(s):
        if c.isdigit():
            num = num * 10 + int(c)
        if c in '+-*/' or i == len(s) - 1:
            if op == '+': stack.append(num)
            elif op == '-': stack.append(-num)
            elif op == '*': stack.append(stack.pop() * num)
            else: stack.append(int(stack.pop() / num))
            op, num = c, 0
    return sum(stack)

Rule of thumb: the stack-based single-pass evaluator is the production-grade answer; the two-pass variant is the easier explanation but allocates more.

Common beginner mistakes

Using eval(s) — works on the test cases but is graded as a fail; production code never evals untrusted input.
Forgetting multi-digit numbers — '12 + 3' becomes [1, 2, '+', 3] instead of [12, '+', 3].
Doing left-to-right without precedence — 3 + 5 * 2 becomes (3 + 5) * 2 = 16 instead of 3 + (5 * 2) = 13.
Using int(a / b) when the contract says // — produces wrong rounding for negative integers.
Skipping the trailing-number flush — the last operand never enters the token list.

Python Interview Question on Arithmetic Formula Evaluator

Given a string s containing integer operands and the operators + - * / (no parentheses, integer arithmetic, standard precedence), return the integer result.

def calculate(s: str) -> int:
    pass  # your code here

Solution Using single-pass stack-based evaluator

def calculate(s: str) -> int:
    s = s.replace(' ', '')
    stack, num, op = [], 0, '+'
    for i, c in enumerate(s):
        if c.isdigit():
            num = num * 10 + int(c)
        if c in '+-*/' or i == len(s) - 1:
            if op == '+':
                stack.append(num)
            elif op == '-':
                stack.append(-num)
            elif op == '*':
                stack.append(stack.pop() * num)
            else:  # '/'
                stack.append(int(stack.pop() / num))
            op, num = c, 0
    return sum(stack)

Why this works: the single pass tokenizes and evaluates simultaneously; op tracks the operator that applies to the just-finished number; on +/- we push a new term onto the stack; on */// we update the top of stack in place — which is exactly the precedence-respecting behavior we need; the trailing flush (i == len(s) - 1) handles the last number; sum(stack) collapses the additive sub-results.

Step-by-step trace for s = '3+5*2':

i	c	num	op_pending	stack
0	`3`	3	`+`	`[]`
1	`+`	0	`+`	`[3]`
2	`5`	5	`+`	`[3]`
3	`*`	0	`*`	`[3, 5]`
4	`2`	2	`*`	`[3, 5]`
end	—	0	—	`[3, 10]`

sum(stack) = 3 + 10 = 13.

Output:

input	result
`'3+5*2'`	13
`'12-3*2'`	6
`'10/3'`	3

Why this works — concept by concept:

Single-pass tokenize-and-evaluate — one scan of the string drives both tokenization and stack updates; no separate token list materialized.
op tracks the previous operator — applies to the number that just finished accumulating, which is the key trick for the stack-update timing.
Stack push for + and - — additive operations push new terms; the final sum collapses them.
In-place stack update for * and / — multiplicative operations modify the top of stack so they bind tighter than additive ones; this is what gives precedence without a second pass.
Trailing flush via i == len(s) - 1 — the last operand has no following operator to trigger a flush; the index check forces the final stack update.
O(n) time / O(n) space — one pass over the string; the stack holds at most one element per + or - token, which is O(n) worst case.

Inline CTA: More SQL string problems and array problems for breadth.

PYTHON
Company — Facebook (Python)
Facebook Python practice problems

Practice →

PYTHON
Topic — string
Python string problems

Practice →

PYTHON
Topic — array
Python array problems

Practice →

3. SQL Window Functions and `EXISTS` for Monthly Active User Retention

Correlated EXISTS subquery for month-over-month MAU retention in SQL for data engineering

"Return the number of monthly active users in July 2022 — users who were active in both July AND June" is Meta's signature SQL retention prompt (DataLemur Q4). The mental model: a user is a July-MAU iff they have at least one event in July AND at least one event in June; a correlated EXISTS subquery checks the second condition row-by-row against the same user_actions table; EXTRACT(MONTH FROM curr_month.event_date - INTERVAL '1 month') shifts the comparison window. Same primitive powers any "active in current period AND in previous period" retention metric — week-over-week active users, day-over-day session retention, cohort-N-day return.

Pro tip: Correlated EXISTS subqueries are graded as the right answer over self-joins for retention queries. Self-joins explode the row count when a user has many events; EXISTS short-circuits on the first match per outer row. State this performance distinction to the interviewer.

`EXTRACT(MONTH FROM ts)` and `INTERVAL '1 month'` arithmetic

The date-arithmetic invariant: EXTRACT(MONTH FROM ts) returns the month component of ts as an integer 1-12; ts - INTERVAL '1 month' shifts ts back exactly one calendar month respecting end-of-month edges. Combining both produces "month of one month ago" — the comparison key for retention.

EXTRACT(MONTH FROM ts) — month number 1-12.
EXTRACT(YEAR FROM ts) — year number; combine with month for unique periods.
ts - INTERVAL '1 month' — calendar shift, handles 31-day → 30-day automatically.
DATE_TRUNC('month', ts) — alternative; returns first day of the month as a DATE.

Worked example. event_date = '2022-07-15'.

expression	output
`EXTRACT(MONTH FROM event_date)`	7
`event_date - INTERVAL '1 month'`	2022-06-15
`EXTRACT(MONTH FROM event_date - INTERVAL '1 month')`	6

Worked-example solution.

SELECT EXTRACT(MONTH FROM event_date)                            AS curr_month,
       EXTRACT(MONTH FROM event_date - INTERVAL '1 month')       AS prev_month
FROM user_actions
WHERE event_date = '2022-07-15';

Rule of thumb: combine EXTRACT(MONTH FROM ...) and EXTRACT(YEAR FROM ...) when retention spans calendar years; month-only is wrong for Dec→Jan transitions.

Correlated `EXISTS` subquery for previous-month presence

The correlated-subquery invariant: WHERE EXISTS (SELECT 1 FROM ... WHERE inner.col = outer.col) returns TRUE for outer rows whose col value has at least one matching row in the inner query; the inner query references the outer alias and re-evaluates per outer row. Short-circuits on the first match.

EXISTS (SELECT 1 FROM ... WHERE ...) — the canonical pattern.
Correlation — inner WHERE clause references outer alias.
Short-circuit — stops scanning inner rows on first match.
NOT EXISTS — mirror image; "no match found in inner."

Worked example. Two users; user 445 has June + July rows, user 742 has only July.

user_id	event_date	EXISTS June row?
445	2022-06-30	(no — this IS June)
445	2022-07-05	✓ (matches user 445's June row)
742	2022-07-03	✗

Worked-example solution.

SELECT user_id
FROM user_actions AS curr_month
WHERE EXTRACT(MONTH FROM event_date) = 7
  AND EXTRACT(YEAR  FROM event_date) = 2022
  AND EXISTS (
      SELECT 1
      FROM user_actions AS last_month
      WHERE last_month.user_id = curr_month.user_id
        AND EXTRACT(MONTH FROM last_month.event_date) = 6
        AND EXTRACT(YEAR  FROM last_month.event_date) = 2022
  );

Rule of thumb: EXISTS beats self-join for retention because it short-circuits per outer row and never explodes the cardinality.

Alternative: window function with self-join over month buckets

The window-alternative invariant: a self-join on user_id between two pre-bucketed tables (or LAG/LEAD over month-truncated dates inside a partition) achieves the same retention answer with different performance trade-offs. Useful when the prompt asks for a continuous N-month-streak metric.

DATE_TRUNC('month', event_date) — bucket per month.
LAG(month_bucket) OVER (PARTITION BY user_id ORDER BY month_bucket) — previous month bucket per user.
Self-join — JOIN user_actions u2 ON u2.user_id = u1.user_id AND u2.month_bucket = u1.month_bucket - INTERVAL '1 month'.
Window approach scales better — single sort, no nested scan.

Worked example. Same data; window approach.

user_id	curr_month	prev_month_bucket	retained?
445	2022-07	2022-06	✓
742	2022-07	NULL	✗

Worked-example solution.

WITH per_user_months AS (
  SELECT DISTINCT user_id, DATE_TRUNC('month', event_date) AS month_bucket
  FROM user_actions
)
SELECT user_id
FROM per_user_months curr
JOIN per_user_months prev
  ON prev.user_id    = curr.user_id
 AND prev.month_bucket = curr.month_bucket - INTERVAL '1 month'
WHERE curr.month_bucket = '2022-07-01';

Rule of thumb: one-month retention → EXISTS; multi-month streak → window functions over DISTINCT user_id, month_bucket rows.

Common beginner mistakes

Comparing EXTRACT(MONTH FROM ts) = 6 without checking the year — June 2021 silently counts as June 2022.
Using a self-join when EXISTS is cleaner — explodes the cardinality.
Forgetting WHERE EXTRACT(MONTH FROM event_date) = 7 on the outer query — counts all-time MAU, not July-specific.
Comparing event_date::date directly without INTERVAL '1 month' — fails on calendar boundaries (30 vs 31 day months, leap years).
Returning all rows instead of distinct user counts — COUNT(DISTINCT user_id) is the metric.

SQL Interview Question on Active User Retention

Given user_actions(user_id, event_id, event_type, event_date), return the count of monthly active users for July 2022 — users who had at least one event in July AND at least one event in June. Output month (numeric) and monthly_active_users.

Solution Using correlated `EXISTS` subquery

SELECT
    EXTRACT(MONTH FROM curr_month.event_date) AS mth,
    COUNT(DISTINCT curr_month.user_id)        AS monthly_active_users
FROM user_actions AS curr_month
WHERE EXTRACT(MONTH FROM curr_month.event_date) = 7
  AND EXTRACT(YEAR  FROM curr_month.event_date) = 2022
  AND EXISTS (
      SELECT 1
      FROM user_actions AS last_month
      WHERE last_month.user_id = curr_month.user_id
        AND EXTRACT(MONTH FROM last_month.event_date)
            = EXTRACT(MONTH FROM curr_month.event_date - INTERVAL '1 month')
        AND EXTRACT(YEAR FROM last_month.event_date)
            = EXTRACT(YEAR FROM curr_month.event_date - INTERVAL '1 month')
  )
GROUP BY EXTRACT(MONTH FROM curr_month.event_date);

Why this works: the outer query restricts to July 2022 events; the correlated EXISTS short-circuits on the first June-2022 event for the same user_id; INTERVAL '1 month' makes the comparison work across year boundaries (Jan retention against December of prior year); COUNT(DISTINCT user_id) collapses multiple July events per user to a single MAU contribution; the EXTRACT(MONTH ...) GROUP BY is mechanically required by the SELECT alias.

Step-by-step trace for the DataLemur sample:

user_id	event_date	event_type	curr_month?	EXISTS prev?	counted?
445	2022-06-30	sign-in	✗ (June)	—	—
742	2022-07-03	sign-in	✓	✗	—
445	2022-07-05	like	✓	✓	✓
742	2022-07-05	comment	✓	✗	—
648	2022-07-18	like	✓	✗	—

Only user 445 satisfies both conditions → MAU = 1.

Output:

mth	monthly_active_users
7	1

Why this works — concept by concept:

EXTRACT(MONTH FROM ts) + EXTRACT(YEAR FROM ts) — joint period-key avoids cross-year false positives.
Correlated EXISTS — short-circuits on first match per outer row; O(N · M) worst case but typically far less in practice.
INTERVAL '1 month' shift — calendar-aware month subtraction; handles end-of-month edges and year transitions automatically.
COUNT(DISTINCT user_id) — collapses multiple July events per user; the contract demands "users", not "events."
GROUP BY EXTRACT(MONTH FROM ...) — required because the SELECT references a non-aggregate; produces one output row per month.
O(|user_actions| × log|user_actions|) time — index lookup per outer row inside the EXISTS; with a (user_id, event_date) index this is near-linear.

Inline CTA: More SQL window-function problems and aggregation problems on PipeCode.

SQL
Company — Facebook
Facebook SQL practice problems

Practice →

SQL
Topic — window functions
SQL window-function problems

Practice →

SQL
Topic — aggregation
SQL aggregation problems

Practice →

4. SQL CTE and Self-Join for Post Hiatus and Friend Recommendations

CTE composition with MIN/MAX aggregates and self-joins in SQL for data engineering

"For each user who posted at least twice in 2024, return the days between their first and last post" is Meta's signature post-hiatus prompt (DataLemur Q1). The mental model: MAX(post_date) - MIN(post_date) per user gives the hiatus; WHERE EXTRACT(YEAR FROM post_date) = 2024 filters to the year; HAVING COUNT(post_id) > 1 ensures the user actually posted multiple times. The same CTE-and-self-join skeleton scales up to the friend-recommendation pattern (Q6) — a CTE captures private_events and a self-join produces non-friend pairs who attended the same events.

Pro tip: Always alias the result of date subtraction (AS days_between for post hiatus, AS user_pair for friend-rec). PostgreSQL returns an INTERVAL from date subtraction by default — explicit casting (::DATE - ::DATE) returns a plain int. State the cast in the SELECT.

`MIN` / `MAX` aggregates per user with date subtraction

The aggregate-date invariant: MAX(post_date) - MIN(post_date) over a GROUP BY user_id window returns the per-user span of activity; casting to ::DATE first ensures the subtraction returns a plain integer-day count, not an INTERVAL. Filter to the year first via WHERE, group second.

MIN(post_date::DATE) — earliest post date per group.
MAX(post_date::DATE) — latest post date per group.
MAX - MIN — INTERVAL if both are timestamps, int if both are DATE.
EXTRACT(DAY FROM ...) cast — alternative if working with timestamps.

Worked example. Two users, six posts in 2024.

user_id	post_dates	min	max	days_between
151652	07/10, 07/12	07/10	07/12	2
661093	07/08, 07/29	07/08	07/29	21

Worked-example solution.

SELECT user_id,
       MAX(post_date::DATE) - MIN(post_date::DATE) AS days_between
FROM posts
WHERE EXTRACT(YEAR FROM post_date) = 2024
GROUP BY user_id;

Rule of thumb: when the prompt says "days between first and last", MAX(::DATE) - MIN(::DATE) is one line; never compute it via window functions when GROUP BY suffices.

`HAVING COUNT() > 1` for multi-event users

The filter-on-aggregate invariant: HAVING COUNT(post_id) > 1 filters group rows after the GROUP BY to keep only users with at least 2 posts; WHERE cannot reference aggregates and would parse-error. The two clauses are not interchangeable.

WHERE — row-level filter (year predicate).
HAVING — group-level filter (count predicate).
HAVING COUNT(post_id) > 1 — strictly more than 1.
HAVING COUNT(*) >= 2 — equivalent here since post_id is non-null.

Worked example. Three users; one posted only once in 2024.

user_id	post_count_2024	survives?
151652	2	✓
661093	2	✓
004239	1	✗ (filtered)

Worked-example solution.

SELECT user_id,
       MAX(post_date::DATE) - MIN(post_date::DATE) AS days_between
FROM posts
WHERE EXTRACT(YEAR FROM post_date) = 2024
GROUP BY user_id
HAVING COUNT(post_id) > 1;

Rule of thumb: a single-row hiatus is 0 days, which is misleading; the HAVING COUNT > 1 filter is non-negotiable for the post-hiatus prompt.

CTE composition for multi-step logic + self-join for friend-rec / pair queries

The CTE-and-self-join invariant: WITH cte AS (SELECT ...) names an intermediate result; subsequent SELECTs reference it like a table; for pair queries (e.g., friend recommendations), self-join the same CTE on a not-equal-id condition to generate every ordered pair. Combine with friendship-status filtering to surface non-friend pairs.

CTE definition — WITH private_events AS (SELECT user_id, event_id FROM event_rsvp WHERE attendance_status IN ('going', 'maybe') AND event_type = 'private').
Self-join — JOIN private_events e2 ON e1.event_id = e2.event_id AND e1.user_id != e2.user_id.
Friendship-status filter — JOIN friendship_status fs ... WHERE fs.status = 'not_friends'.
Pair count — HAVING COUNT(*) >= 2 for the "two-or-more shared events" requirement.

Worked example. Three users (111, 222, 333) attended event 234; only (222, 333) are not friends.

pair	shared_events	friend?	recommend?
(111, 222)	1	friends	✗
(111, 333)	1	not_friends	✗ (only 1 shared)
(222, 333)	2	not_friends	✓

Worked-example solution.

WITH private_events AS (
  SELECT user_id, event_id
  FROM event_rsvp
  WHERE attendance_status IN ('going', 'maybe')
    AND event_type = 'private'
)
SELECT fs.user_a_id, fs.user_b_id
FROM private_events e1
JOIN private_events e2
  ON e1.event_id = e2.event_id
 AND e1.user_id != e2.user_id
JOIN friendship_status fs
  ON fs.user_a_id = e1.user_id
 AND fs.user_b_id = e2.user_id
WHERE fs.status = 'not_friends'
GROUP BY fs.user_a_id, fs.user_b_id
HAVING COUNT(*) >= 2;

Rule of thumb: CTEs make multi-step logic readable; self-joins on the same CTE are the canonical shape for pair / recommendation queries.

Common beginner mistakes

Computing MAX - MIN without ::DATE cast — returns an INTERVAL, not days.
Using WHERE COUNT(post_id) > 1 — parse error; WHERE cannot reference aggregates.
Forgetting the year filter — counts all-time hiatus instead of 2024-only.
Self-joining without e1.user_id != e2.user_id — generates (user, user) pairs that pollute the answer.
Returning a single direction (user_a, user_b) when the prompt says recommendations are bidirectional — add the mirror by UNION or by including both orderings in the GROUP BY.

SQL Interview Question on Average Post Hiatus

Given posts(user_id, post_id, post_date), for each user who posted at least twice in 2024, return the number of days between the user's first and last post in 2024. Output user_id and days_between.

Solution Using `MAX - MIN` per user with `HAVING COUNT > 1`

SELECT
    user_id,
    MAX(post_date::DATE) - MIN(post_date::DATE) AS days_between
FROM posts
WHERE EXTRACT(YEAR FROM post_date) = 2024
GROUP BY user_id
HAVING COUNT(post_id) > 1;

Why this works: the WHERE clause restricts the row stream to 2024 posts before grouping; GROUP BY user_id collapses to one row per user; MAX(post_date::DATE) - MIN(post_date::DATE) produces the hiatus as an integer day count thanks to the ::DATE cast; HAVING COUNT(post_id) > 1 strips out users who posted only once. Single-pass aggregation; no self-join needed.

Step-by-step trace for the DataLemur sample:

user_id	post_id	post_date
151652	599415	2024-07-10
661093	624356	2024-07-29
004239	784254	2024-07-04
661093	442560	2024-07-08
151652	111766	2024-07-12

WHERE filter — all 5 rows are in 2024; nothing dropped.
GROUP BY user_id — three groups: 151652 (2 posts), 661093 (2 posts), 004239 (1 post).
MIN / MAX per group — 151652 → (07-10, 07-12); 661093 → (07-08, 07-29); 004239 → (07-04, 07-04).
MAX - MIN — 2 days, 21 days, 0 days.
HAVING COUNT > 1 — strips 004239 (only 1 post).

Output:

user_id	days_between
151652	2
661093	21

Why this works — concept by concept:

WHERE EXTRACT(YEAR FROM post_date) = 2024 — row-level filter; runs before grouping for correctness and performance.
GROUP BY user_id — collapses to one row per user.
MAX(post_date::DATE) - MIN(post_date::DATE) — date subtraction returns an integer day count after the ::DATE cast; INTERVAL would be the wrong shape.
HAVING COUNT(post_id) > 1 — group-level filter; aggregate predicate must live in HAVING, not WHERE.
O(|posts| + G log G) time — single scan + sort of G groups; no self-join.

Inline CTA: More SQL CTE problems and SQL join problems on PipeCode.

SQL
Company — Facebook
Facebook SQL practice problems

Practice →

SQL
Topic — CTE
SQL CTE problems

Practice →

SQL
Topic — joins
SQL join problems

Practice →

Tips to crack Facebook data engineering interviews

Facebook = Meta — the rebrand survived but the SQL bar didn't move

Facebook Inc. rebranded to Meta Platforms Inc. in October 2021. The legal entity is Meta; the consumer products (Facebook, Instagram, WhatsApp, Messenger, Threads) keep their original brands. The data-engineering interview loop, the SQL bar, and the question shapes did not change with the rebrand. Search for both "Facebook data engineer interview" and "Meta data engineer interview" — every external article you find under one name applies under the other.

30+30 phone-screen format — pick your opener

The standard Meta data-engineering technical phone screen is 5 minutes intro + 30 minutes SQL + 30 minutes Python + 5 minutes Q&A, with the candidate choosing whether to open with SQL or Python. Pick whichever you'd rather attack first when you're freshest; the second half always feels harder under fatigue, so save your stronger language for the closer if you can. State your preferred opener at the start so the interviewer can plan.

Two role variants — pick the one that matches your loop

Meta has two distinct DE loops: the standard "Data Engineer" role (algo-heavy Python plus SQL) and the "Data Engineer — Product Analytics" role (5 SQL questions plus 5 algo coding questions, more product-sense flavor). Confirm which loop you're in during the recruiter call; the prep mix shifts — Product Analytics candidates drill more SQL retention / cohort patterns; standard DE candidates drill more Python algorithms.

Drill the four primitives

The four primitives in this guide map directly to the curated 2 PipeCode Python problems plus the two adjacent SQL primitives every Meta SQL list rotates through: n*(n+1)/2 - sum(arr) and XOR self-cancellation for the missing-number array problem (Python EASY, #83), two-pass tokenize-and-evaluate or single-pass stack-based evaluator for arithmetic formula parsing (Python MEDIUM, #273), correlated EXISTS with INTERVAL '1 month' for month-over-month MAU retention (DataLemur Q4), and CTE composition + self-joins with MIN/MAX aggregates and HAVING COUNT > 1 for post hiatus and friend recommendations (DataLemur Q1 / Q6).

Product-analytics SQL emphasis

Meta's data-engineering SQL questions are heavily product-analytics-flavored — MAU retention, post hiatus, power users, click-through rates, friend recommendations. Drill the Meta-style SQL practice surface with a focus on EXISTS subqueries, CTE composition, MIN/MAX per-user aggregates, and CASE WHEN + ROUND for percentage metrics. Avoid generic "joins and group-by" prep; Meta's bar is higher.

IC4-IC6 leveling and behavioral expectations

Meta levels DE roles from IC4 (Senior, ~5+ YoE) through IC6 (Staff, ~10+ YoE) and IC7 (Senior Staff). Behavioral rounds at IC5+ probe ownership ("tell me about a time you owned a critical pipeline migration"), navigating ambiguity, and cross-functional collaboration with product / DS / ML teams. Have STAR-format stories ready for each. Compensation per Levels.fyi: IC4 ~$340K-$430K total, IC5 ~$430K-$580K, IC6 $600K+.

Where to practice on PipeCode

Start with the Facebook practice page and the language-scoped Facebook Python practice page for the curated 2-problem set. Hit the company_topic Facebook — array page for the only Facebook-tagged topic surface available. After that, drill the matching topic pages: array, string, window functions, aggregation, CTE, joins, date functions, filtering. The interview courses page bundles structured curricula. For broader coverage, browse by topic, or pivot to peer guides — the Airbnb DE interview guide, the top DE interview questions 2026 blog, and the SQL data types Postgres guide.

Communication and approach under time pressure

Talk through the invariant first ("this is a missing-number problem with sum-formula and XOR alternatives"), the brute force second ("a sort-then-scan would also work but is O(n log n)"), and the optimal third ("but the sum-formula gives O(n) time and O(1) space"). Interviewers grade process as much as the final answer. Leave 5 minutes for an edge-case sweep: empty input, single-element array, zero-only array, year-boundary date arithmetic, NULL in a user_id partition. The most common "almost passed" failure mode is correct happy-path code that crashes on edge cases — a 30-second sweep prevents it.

Frequently Asked Questions

What is the Facebook (Meta) data engineering interview process?

The Meta data engineering interview opens with a recruiter screen (15-30 min), then a technical phone screen with the 5 minute intro + 30 minute SQL + 30 minute Python + 5 minute Q&A format (candidate-choice on opener), then a 4-5 round virtual onsite covering 2 SQL rounds, 1 Python algorithm round, 1 system design round, and 1 behavioral round. Senior roles (IC5+) add a data-architecture round. End-to-end the loop runs three to four weeks.

Is Facebook the same as Meta? Which name should I search for?

Yes — Facebook Inc. rebranded to Meta Platforms Inc. in October 2021. The legal entity is Meta; the consumer brands (Facebook, Instagram, WhatsApp, Messenger, Threads) keep their original names. Search for both "Facebook data engineer interview" and "Meta data engineer interview" — every external article you find under one name applies under the other. The DataLemur SQL guide is titled "Facebook/Meta SQL Interview Questions" because both names index the same content.

What programming languages does Facebook test in data engineering interviews?

Meta tests Python and SQL in the technical phone screen and onsite — bilingual by design. The phone screen is exactly 30 minutes of each, with the candidate choosing the opener. Python emphasizes algorithms (array missing-number, string parsing, hash-table aggregation, sliding-window patterns). SQL emphasizes product-analytics queries (MAU retention, post hiatus, power users, CTR, friend recommendations) on PostgreSQL syntax. Python questions for the Meta data engineer interview lean medium-difficulty algorithm style, not data-pipeline scripting.

What is the difference between the standard DE role and the Data Engineer — Product Analytics variant?

Standard Data Engineer at Meta = algo-heavier Python loops with bilingual SQL coverage (the curated PipeCode set's algorithm focus). Data Engineer — Product Analytics = a product-sense-flavored variant where the technical phone screen is 5 SQL questions + 5 algo coding questions in one session, more weighted toward retention / cohort / funnel analytics SQL than pure algorithm fluency. The recruiter call confirms which loop you're in; ask explicitly if it isn't stated.

What SQL questions does Meta ask data engineers?

Meta SQL interview questions concentrate on six product-analytics shapes: (1) post hiatus / first-and-last-event aggregations via MIN/MAX(::DATE) with HAVING COUNT > 1; (2) power-user identification via JOIN + 2-condition HAVING; (3) MAU retention via correlated EXISTS subquery with INTERVAL '1 month' shifts; (4) friend recommendations via CTE + self-join over event_rsvp with friendship_status filtering; (5) average-shares-per-post via LEFT JOIN + COALESCE; (6) ad click-through rate via CASE WHEN + ROUND and 2022-year filters. Drill all six against the PostgreSQL dialect; CoderPad is the live coding environment.

What is the Meta data engineer salary range?

Meta data engineer total compensation per Levels.fyi: IC4 (Senior, ~5+ YoE) $340K-$430K total comp ($180K-$220K base + RSUs + bonus); IC5 (Staff, ~10+ YoE) $430K-$580K total ($210K-$260K base); IC6 (Senior Staff, ~12+ YoE) $600K+ total. RSU refreshers are annual; equity vests on a 4-year schedule with a typical 25% cliff. Negotiation success rates run 10-20% with competing offers. The "meta data engineer salary" search keyword (vol 260) reflects strong candidate interest in this number.

Top Facebook data engineering interview topics

1. Python Array Missing Number — Sum Formula and XOR

Sum-formula and XOR self-cancellation in Python for data engineering

Arithmetic-series sum formula: n*(n+1)/2

XOR self-cancellation: a ^ a == 0

Set-difference fallback for non-contiguous ranges

Common beginner mistakes

Python Interview Question on Missing Number

Solution Using n*(n+1)//2 - sum(nums)

2. Python Arithmetic Formula Evaluator — Array, Math, Bit, String Parsing

Tokenization and two-pass evaluation in Python for data engineering

Tokenization: split a string into numbers and operators

Two-pass evaluation: handle * and / first, then + and -

Stack-based evaluator for general expressions

Common beginner mistakes

Python Interview Question on Arithmetic Formula Evaluator

Solution Using single-pass stack-based evaluator

3. SQL Window Functions and EXISTS for Monthly Active User Retention

Correlated EXISTS subquery for month-over-month MAU retention in SQL for data engineering

EXTRACT(MONTH FROM ts) and INTERVAL '1 month' arithmetic

Correlated EXISTS subquery for previous-month presence

Alternative: window function with self-join over month buckets

Common beginner mistakes

SQL Interview Question on Active User Retention

Solution Using correlated EXISTS subquery

4. SQL CTE and Self-Join for Post Hiatus and Friend Recommendations

CTE composition with MIN/MAX aggregates and self-joins in SQL for data engineering

MIN / MAX aggregates per user with date subtraction

HAVING COUNT() > 1 for multi-event users

CTE composition for multi-step logic + self-join for friend-rec / pair queries

Common beginner mistakes

SQL Interview Question on Average Post Hiatus

Solution Using MAX - MIN per user with HAVING COUNT > 1

Tips to crack Facebook data engineering interviews

Facebook = Meta — the rebrand survived but the SQL bar didn't move

30+30 phone-screen format — pick your opener

Two role variants — pick the one that matches your loop

Drill the four primitives

Product-analytics SQL emphasis

IC4-IC6 leveling and behavioral expectations

Where to practice on PipeCode

Communication and approach under time pressure

Frequently Asked Questions

What is the Facebook (Meta) data engineering interview process?

Is Facebook the same as Meta? Which name should I search for?

What programming languages does Facebook test in data engineering interviews?

What is the difference between the standard DE role and the Data Engineer — Product Analytics variant?

What SQL questions does Meta ask data engineers?

What is the Meta data engineer salary range?

Start practicing Facebook data engineering problems

Arithmetic-series sum formula: `n*(n+1)/2`

XOR self-cancellation: `a ^ a == 0`

Solution Using `n*(n+1)//2 - sum(nums)`

Two-pass evaluation: handle `*` and `/` first, then `+` and `-`

3. SQL Window Functions and `EXISTS` for Monthly Active User Retention

`EXTRACT(MONTH FROM ts)` and `INTERVAL '1 month'` arithmetic

Correlated `EXISTS` subquery for previous-month presence

Solution Using correlated `EXISTS` subquery

`MIN` / `MAX` aggregates per user with date subtraction

`HAVING COUNT() > 1` for multi-event users

Solution Using `MAX - MIN` per user with `HAVING COUNT > 1`