Gowtham Potureddi

Robinhood Data Engineering Interview Questions & Prep Guide

Robinhood data engineering interview questions are bilingual — SQL and Python in roughly equal measure — with a fintech-correctness edge that most generic interview-prep posts miss. Four primitives carry the loop: dict.get(s, 0) + 1 hash-table counters that aggregate stock-purchase events by symbol, INNER JOIN trades + users + GROUP BY + ORDER BY count DESC LIMIT N for top-N city / member-transfer rankings, LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date) for day-over-day volume or balance change, and GROUP BY user_id HAVING SUM(notional) > limit for end-of-day threshold and notional-cap checks. The framings are everyday brokerage data engineering — count purchases per ticker, surface the top cities by completed trades, compute a daily volume percentage change, flag any account whose option exposure crosses a regulatory limit.

This guide walks through the four topic clusters Robinhood actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches a curated 2-problem Robinhood set (1 EASY Python hash-table + 1 MEDIUM SQL joins) plus the two SQL primitives — window LAG and HAVING-aggregate — that show up on every Robinhood SQL question list and at every L4/L5 onsite. Robinhood pipelines also demand penny-perfect correctness, idempotency, and audit-aware thinking; brokerage candidates who frame their answers in those terms separate themselves from the generic-DE pile.



Top Robinhood data engineering interview topics

From the Robinhood data engineering practice set, the four numbered sections below follow this topic map (one row per H2):

| # | Topic (sections 1–4) | Why it shows up at Robinhood |
| --- | --- | --- |
| 1 | Python hash tables and dict counters for stock-purchase aggregation | Stock Purchases Count (EASY) — dict.get(symbol, 0) + 1 and dict-of-sets, the Python primitive for counting events and tracking distinct buyers. |
| 2 | SQL inner join and GROUP BY for top-N trade aggregations | Member Transfer Records (MEDIUM) — INNER JOIN trades + users + WHERE status='Completed' + GROUP BY city + ORDER BY COUNT(*) DESC + LIMIT N, the SQL primitive for top-N rankings on completed trades. |
| 3 | SQL window functions and LAG for daily volume change | Volume change percentage — LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date) for day-over-day deltas on partitioned time series. |
| 4 | SQL aggregation and HAVING for threshold and notional-limit checks | Notional-limit check — GROUP BY user_id HAVING SUM(contract_count * strike_price * 100) > limit, the SQL primitive for "WHERE on aggregates" that flags risk and compliance breaches. |

Bilingual framing rule: Robinhood's prompts are everyday brokerage data engineering — count buyers per ticker, rank cities by completed orders, measure daily volume change, flag breaches of a notional limit. The interviewer is grading whether you map each business framing to the right primitive: count events → hash-table dict counter; rank by attribute → INNER JOIN + GROUP BY + ORDER BY DESC + LIMIT; day-over-day delta → LAG window function; flag aggregates over a threshold → GROUP BY + HAVING. State the mapping out loud and the correctness will follow.


1. Python Hash Tables and Dict Counters for Stock-Purchase Aggregation

Hash-table dict counting for stock-purchase events in Python for data engineering

"Given a list of stock-purchase events, count purchases per symbol and surface the set of distinct buyers per symbol" is Robinhood's signature EASY Python prompt (Stock Purchases Count). The mental model: a hash-table counter is a dict keyed on the grouping attribute with a count value, updated in a single pass with counts[k] = counts.get(k, 0) + 1; a parallel dict[str, set] keyed the same way tracks distinct members per group. Same primitive powers any "count by category" or "unique-elements per bucket" pipeline — count clicks per user, distinct sessions per page, purchases per ticker.

Diagram of stock-purchase events flowing into a Python dict counter that maps each symbol to a purchase count and a parallel dict-of-sets that maps each symbol to its set of distinct buyers.

Pro tip: The dict.get(k, 0) + 1 idiom is the workhorse — but collections.Counter is the idiomatic shortcut and collections.defaultdict(int) is the common middle ground. Pick the one your interviewer can read fastest: Counter for "count occurrences," defaultdict(set) for "track unique members per group," vanilla dict.get when you want zero imports. State which you chose and why.

Dict counter idiom: dict.get(k, 0) + 1 vs collections.Counter

The dict-counter invariant: a dict keyed on the grouping attribute holds the running count for that group; on each event you read the current count (or 0 if absent), add one, and write it back. dict.get(k, default) returns default when k is missing, which avoids the KeyError that bare dict[k] would raise.

  • counts[k] = counts.get(k, 0) + 1 — vanilla dict, no imports, the universal pattern.
  • counts = Counter(events) — one-liner if events is the iterable of keys themselves (no extra fields).
  • counts = defaultdict(int); counts[k] += 1 — middle ground, handles missing keys via the factory.
  • Avoid try/except KeyError — works but is slower and harder to read than get or defaultdict.

Worked example. Count three purchase events for AAPL, TSLA, AAPL using the dict.get idiom.

| step | counts |
| --- | --- |
| start | {} |
| AAPL | {'AAPL': 1} |
| TSLA | {'AAPL': 1, 'TSLA': 1} |
| AAPL | {'AAPL': 2, 'TSLA': 1} |

Worked-example solution.

counts = {}
for symbol in ['AAPL', 'TSLA', 'AAPL']:
    counts[symbol] = counts.get(symbol, 0) + 1
# counts == {'AAPL': 2, 'TSLA': 1}

Rule of thumb: when the count is the only signal, Counter(iter) is shortest; when you also need other per-group state (a set of distinct buyers, a running max, a list of timestamps), reach for defaultdict(set) or vanilla dict.get with explicit logic.
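
Counter in action is a one-liner. A minimal sketch, assuming the events have already been reduced to an iterable of symbols:

from collections import Counter

# When each event is just the symbol itself, Counter replaces the explicit dict.get loop.
symbols = ['AAPL', 'TSLA', 'AAPL']
counts = Counter(symbols)        # Counter({'AAPL': 2, 'TSLA': 1})
print(counts.most_common(1))     # [('AAPL', 2)]; most_common gives top-N for free, mirroring ORDER BY ... LIMIT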

Set-of-buyers per stock: dict-of-sets pattern

The dict-of-sets invariant: a dict keyed on the grouping attribute holds a set of distinct member ids; each event inserts the member id, and set semantics dedupe automatically. Inserting the same (symbol, user) pair twice is a no-op — no special handling required, no if member in s guard.

  • buyers.setdefault(symbol, set()).add(user) — one-liner that initializes the set on first touch.
  • buyers = defaultdict(set); buyers[symbol].add(user) — same outcome, slightly cleaner.
  • len(buyers[symbol]) — distinct-buyer count for any symbol.
  • Set membership — O(1) average; never use a list here, which would be O(N) for if x in lst.

Worked example. Track distinct buyers for AAPL across three events: (u1, AAPL), (u2, AAPL), (u1, AAPL).

| step | buyers['AAPL'] |
| --- | --- |
| start | set() |
| u1 | {'u1'} |
| u2 | {'u1', 'u2'} |
| u1 (dup) | {'u1', 'u2'} |

Worked-example solution.

from collections import defaultdict
buyers = defaultdict(set)
for user, symbol in [('u1', 'AAPL'), ('u2', 'AAPL'), ('u1', 'AAPL')]:
    buyers[symbol].add(user)
# buyers == {'AAPL': {'u1', 'u2'}}

Rule of thumb: defaultdict(set) collapses the "first-time init" branch — never write if k not in d: d[k] = set() when defaultdict exists.

One-pass aggregation in a single for-loop

The single-pass invariant: two parallel dicts updated in the same for loop produce both aggregates in O(N) total time and O(K) space, where K is the number of distinct symbols. Iterating twice — once for counts, once for buyer-sets — works but doubles the work; one pass is idiomatic and what interviewers grade.

  • One loop, two updates — counts[s] = counts.get(s, 0) + 1 and buyers.setdefault(s, set()).add(u) per event.
  • Avoid two passes — O(2N) == O(N) asymptotically but reads as wasteful.
  • Tuple unpack the event — for user, symbol, qty in events: is cleaner than indexing event[0], event[1].
  • Skip filters first — if the prompt says "only Completed trades," guard with if status != 'Completed': continue at the top.

Worked example. Aggregate the four-event stream into both dicts in one pass.

| event | counts | buyers |
| --- | --- | --- |
| (u1, AAPL) | {'AAPL': 1} | {'AAPL': {'u1'}} |
| (u2, TSLA) | {'AAPL': 1, 'TSLA': 1} | {'AAPL': {'u1'}, 'TSLA': {'u2'}} |
| (u1, AAPL) | {'AAPL': 2, 'TSLA': 1} | {'AAPL': {'u1'}, 'TSLA': {'u2'}} |
| (u3, AAPL) | {'AAPL': 3, 'TSLA': 1} | {'AAPL': {'u1', 'u3'}, 'TSLA': {'u2'}} |

Worked-example solution.

from collections import defaultdict
counts: dict[str, int] = {}
buyers: dict[str, set] = defaultdict(set)
for user, symbol in [('u1', 'AAPL'), ('u2', 'TSLA'), ('u1', 'AAPL'), ('u3', 'AAPL')]:
    counts[symbol] = counts.get(symbol, 0) + 1
    buyers[symbol].add(user)

Rule of thumb: one loop, two parallel structures, zero if k in d branches — reach for defaultdict and dict.get first.

Common beginner mistakes

  • Using counts[k] += 1 on a missing key — raises KeyError. Use dict.get(k, 0) + 1 or defaultdict(int).
  • Using a list instead of a set for distinct buyers — O(N) membership checks blow up the asymptotic cost.
  • Iterating twice instead of once — wastes work and signals the candidate didn't think about cost.
  • Forgetting to skip filtered events at the top of the loop — counts include cancelled / failed trades and produce the wrong answer.
  • Returning dict.values() directly when the test expects a plain list — wrap with list(...) if the contract demands it.
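
The last two fixes look like this in practice. A minimal sketch, assuming a hypothetical event shape that carries a status field:

counts = {}
# Hypothetical events with a status field; only 'Completed' trades should be counted.
events = [('u1', 'AAPL', 'Completed'), ('u2', 'AAPL', 'Cancelled')]
for user, symbol, status in events:
    if status != 'Completed':    # skip filtered events at the top of the loop
        continue
    counts[symbol] = counts.get(symbol, 0) + 1
result = list(counts.values())   # wrap in list(...) if the contract expects a plain list
# counts == {'AAPL': 1}, result == [1]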

Python Interview Question on Stock-Purchase Aggregation

Given a list of completed stock-purchase events (user_id, symbol, quantity), return a dict mapping each symbol to a tuple of (a) the total purchase count and (b) the set of distinct user_id values that bought it.

def stock_purchases(events):
    pass  # your code here

Solution Using dict.get and defaultdict(set) in one pass

from collections import defaultdict

def stock_purchases(events):
    counts: dict[str, int] = {}
    buyers: dict[str, set] = defaultdict(set)
    for user_id, symbol, _qty in events:
        counts[symbol] = counts.get(symbol, 0) + 1
        buyers[symbol].add(user_id)
    return {s: (counts[s], buyers[s]) for s in counts}

Why this works: the dict.get(symbol, 0) + 1 idiom keeps the count loop branch-free; defaultdict(set) keeps the buyer-set loop branch-free; one pass over events builds both structures in O(N) time. The final dict-comprehension stitches them together so each symbol maps to (count, set_of_buyers) in a single return value.

Step-by-step trace for events = [('u1', 'AAPL', 50), ('u2', 'TSLA', 10), ('u1', 'AAPL', 25), ('u3', 'AAPL', 5), ('u2', 'GOOG', 8)]:

| step | event | counts | buyers |
| --- | --- | --- | --- |
| 0 | start | {} | defaultdict(set) |
| 1 | (u1, AAPL, 50) | {'AAPL': 1} | {'AAPL': {'u1'}} |
| 2 | (u2, TSLA, 10) | {'AAPL': 1, 'TSLA': 1} | {'AAPL': {'u1'}, 'TSLA': {'u2'}} |
| 3 | (u1, AAPL, 25) | {'AAPL': 2, 'TSLA': 1} | {'AAPL': {'u1'}, 'TSLA': {'u2'}} |
| 4 | (u3, AAPL, 5) | {'AAPL': 3, 'TSLA': 1} | {'AAPL': {'u1', 'u3'}, 'TSLA': {'u2'}} |
| 5 | (u2, GOOG, 8) | {'AAPL': 3, 'TSLA': 1, 'GOOG': 1} | {'AAPL': {'u1', 'u3'}, 'TSLA': {'u2'}, 'GOOG': {'u2'}} |

After the loop, the dict-comprehension {s: (counts[s], buyers[s]) for s in counts} produces the merged output.

Output:

| symbol | count | distinct_buyers |
| --- | --- | --- |
| AAPL | 3 | {'u1', 'u3'} |
| TSLA | 1 | {'u2'} |
| GOOG | 1 | {'u2'} |

Why this works — concept by concept:

  • Hash-table dict counter — counts[symbol] = counts.get(symbol, 0) + 1 updates per group in O(1) average; the dict.get default of 0 keeps the new-key branch implicit and unbranched.
  • defaultdict(set) factory — buyers[symbol].add(user_id) initializes a fresh set on first touch; subsequent .add calls dedupe at insertion time, so distinct counting is free.
  • Single-pass loop — two parallel updates inside one for keep total work at O(N); iterating events twice would double the work for the same answer.
  • Dict-comprehension merge — {s: (counts[s], buyers[s]) for s in counts} stitches the two parallel structures into the contract-shape return without an explicit accumulator.
  • O(N) time / O(K) space — N events, each touching two hash maps; K distinct symbols hold the aggregate state. No sorting, no nested scans.

Inline CTA: Drill the Robinhood Python practice page for the curated hash-table problem and the hash-table Python practice page for breadth.



2. SQL Inner Join and GROUP BY for Top-N Trade Aggregations

Inner-join + group-by + top-N rankings in SQL for Robinhood data engineering

"Given a trades table and a users (or members) table, return the top three cities by completed trade orders" is Robinhood's signature MEDIUM SQL prompt — the same shape as PipeCode's #247 Member Transfer Records and DataLemur's #1 Cities With Completed Trades. The mental model: INNER JOIN the fact table to the dim table on the shared id, filter rows with WHERE status = 'Completed', group by the dim attribute, count or sum, then ORDER BY count DESC LIMIT N for the top-N. The same primitive powers any "rank entities by event volume" pipeline — top countries by signups, top sectors by trade flow, top symbols by buy-side volume.

Pro tip: Filter early with WHERE (per-row), not late with HAVING (per-group). WHERE status = 'Completed' removes cancelled trades before the GROUP BY, so the COUNT(*) is correct. Doing it after the group with HAVING requires an aggregate predicate and is slower; doing it not at all silently inflates the count with cancellations.

INNER JOIN cardinality: trades 1-to-many users

The join-cardinality invariant: trades INNER JOIN users ON trades.user_id = users.user_id produces one output row per matching (trade, user) pair. Since each trade has exactly one user but a user can have many trades, the join expands the users table by trade count — the result has one row per trade. INNER JOIN drops trades whose user_id has no match in users (orphans).

  • INNER JOIN — only matching pairs survive.
  • LEFT JOIN — keeps trades with no user match (NULL city); use when you must report orphans.
  • FULL OUTER JOIN — also keeps users with zero trades; useful for reconciliation, not for top-N.
  • Pick the smallest table on the right — modern planners are smart, but explicit LEFT/INNER choice signals intent.

Worked example. Two-row trades, two-row users; INNER JOIN on user_id.

| trade | user | city |
| --- | --- | --- |
| 100101 (u=111) | u=111 | San Francisco |
| 100259 (u=148) | u=148 | Boston |

Worked-example solution.

SELECT t.order_id, u.city
FROM trades t
INNER JOIN users u ON t.user_id = u.user_id;

Rule of thumb: INNER JOIN is the default for "this trade definitely has a user"; reach for LEFT JOIN only when orphans are part of the answer.

WHERE before GROUP BY: filtering rows vs filtering groups

The filter-order invariant: WHERE runs per-row before grouping; HAVING runs per-group after. WHERE status = 'Completed' strips out cancelled and pending trades so the subsequent GROUP BY city only counts what the question asks about. Putting the same predicate in HAVING requires HAVING SUM(CASE WHEN status='Completed' THEN 1 ELSE 0 END) > 0 — verbose, slower, and a "bad signal" in the round.

  • WHERE clause — row-level filter; runs first; uses =, <, IN, LIKE, IS NULL.
  • HAVING clause — group-level filter; runs after GROUP BY; uses SUM, COUNT, AVG, MAX.
  • Order in the query — SELECT ... FROM ... JOIN ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... LIMIT.
  • Performance — WHERE-then-group reads less data from disk because the grouper sees fewer rows.

Worked example. Filter to status = 'Completed' before grouping by city.

| input rows | after WHERE | grouped city → count |
| --- | --- | --- |
| 7 | 5 (2 dropped) | SF: 3, Boston: 2 |

Worked-example solution.

SELECT u.city, COUNT(*) AS total
FROM trades t
INNER JOIN users u ON t.user_id = u.user_id
WHERE t.status = 'Completed'
GROUP BY u.city;

Rule of thumb: push every row-level predicate into WHERE; reserve HAVING strictly for predicates that need an aggregate.

ORDER BY count DESC + LIMIT for top-N rankings

The top-N invariant: ORDER BY <metric> DESC sorts groups by the metric in descending order; LIMIT N returns only the first N. For ties at the cut, dialects differ — Postgres returns whichever order the planner prefers; Redshift / Snowflake similar. If ties matter, add a deterministic tiebreaker like , u.city ASC.

  • ORDER BY total_orders DESC — sorts groups by the aliased column.
  • LIMIT 3 — first 3 groups in sort order; standard across MySQL, Postgres, Snowflake, and BigQuery.
  • OFFSET — pagination; LIMIT 3 OFFSET 3 returns ranks 4–6.
  • Tiebreakers — appending , u.city ASC makes ties deterministic and stable across runs.

Worked example. Sort 4 cities by count and take the top 3.

| city | total | rank |
| --- | --- | --- |
| SF | 3 | 1 |
| Boston | 2 | 2 |
| Denver | 1 | 3 |
| Austin | 1 | (cut) |

Worked-example solution.

SELECT u.city, COUNT(*) AS total
FROM trades t
INNER JOIN users u ON t.user_id = u.user_id
WHERE t.status = 'Completed'
GROUP BY u.city
ORDER BY total DESC
LIMIT 3;

Rule of thumb: always alias the count column (AS total) and use the alias in ORDER BY; mixing literal COUNT(*) and the alias is style noise.
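
If ties at the LIMIT cut matter, the same query with a deterministic tiebreaker is a one-line change. A sketch:

SELECT u.city, COUNT(*) AS total
FROM trades t
INNER JOIN users u ON t.user_id = u.user_id
WHERE t.status = 'Completed'
GROUP BY u.city
ORDER BY total DESC, u.city ASC   -- tiebreaker keeps equal counts stable across runs
LIMIT 3;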

Common beginner mistakes

  • Using COUNT(*) when only completed trades should count — forgetting WHERE status = 'Completed' silently inflates totals.
  • GROUP BY user_id instead of GROUP BY city — confuses the dim level with the fact level; the question asks for cities.
  • LEFT JOIN when the question asks "completed trades" — orphan trades (no matching user) silently survive with NULL city and pollute the rank.
  • Forgetting the alias in ORDER BY — ORDER BY COUNT(trades.order_id) DESC works but reads worse than ORDER BY total_orders DESC.
  • Returning more than N rows by skipping LIMIT — graded as a wrong answer even when the top N are correct.

SQL Interview Question on Top Cities by Completed Trades

Given the tables trades(order_id, user_id, quantity, status, date, price) and users(user_id, city, email, signup_date), write a query that returns the top 3 cities by number of completed trade orders, in descending order. Output two columns: city and total_orders.

Solution Using INNER JOIN + WHERE + GROUP BY + ORDER BY DESC + LIMIT

SELECT
    u.city,
    COUNT(t.order_id) AS total_orders
FROM trades t
INNER JOIN users u
    ON t.user_id = u.user_id
WHERE t.status = 'Completed'
GROUP BY u.city
ORDER BY total_orders DESC
LIMIT 3;

Why this works: the INNER JOIN pairs each trade with its user (and drops orphan trades with no user match); WHERE status = 'Completed' removes cancelled trades before grouping; GROUP BY city collapses rows to one-per-city with COUNT(t.order_id) as the metric; ORDER BY total_orders DESC ranks cities; LIMIT 3 returns only the top three.

Step-by-step trace for the Robinhood-style sample data:

| order_id | user_id | status | city |
| --- | --- | --- | --- |
| 100101 | 111 | Cancelled | San Francisco |
| 100102 | 111 | Completed | San Francisco |
| 100259 | 148 | Completed | Boston |
| 100264 | 148 | Completed | Boston |
| 100305 | 300 | Completed | San Francisco |
| 100400 | 178 | Completed | San Francisco |
| 100565 | 265 | Completed | Denver |

  1. Inner-join — every trade's user_id matches a row in users; the join produces 7 rows with (order_id, status, city) paired up.
  2. WHERE filter — drop 100101 (Cancelled). 6 rows remain: 3 SF + 2 Boston + 1 Denver.
  3. Group by city — collapse to three groups: SF=3, Boston=2, Denver=1.
  4. Order by count desc — SF (3) > Boston (2) > Denver (1).
  5. Limit 3 — returns all three (no cut needed).

Output:

| city | total_orders |
| --- | --- |
| San Francisco | 3 |
| Boston | 2 |
| Denver | 1 |

Why this works — concept by concept:

  • INNER JOIN cardinality — each trade pairs with exactly one user; orphan trades (none in this dataset) would be dropped, which is the right semantic when the question asks about completed-trade-by-city.
  • WHERE before GROUP BY — the row-level filter eliminates cancelled trades before grouping, so COUNT(*) reflects only completed orders; doing it after with HAVING would still work but reads worse and runs slower.
  • GROUP BY u.city — collapses the result to one row per city; the only non-aggregate column in SELECT must appear here (or be functionally dependent on it).
  • COUNT(t.order_id) — counts non-null order_id values per group; equivalent to COUNT(*) here because order_id is the trades primary key (never null).
  • ORDER BY total_orders DESC LIMIT 3 — sorts groups by the metric and slices the top three; the alias total_orders keeps the ORDER BY clean.
  • O((|trades| + |users|) + G log G) time — |trades| rows scanned for the join, G groups sorted (G ≤ |cities|); O(G) space for the group hash table.

Inline CTA: More SQL join problems and the Robinhood SQL practice page for the curated MEDIUM problem.



3. SQL Window Functions and LAG for Daily Volume Change

Window LAG for day-over-day deltas in SQL for Robinhood data engineering

"For each day and each stock, compute the percentage change in trading volume vs the previous day for the same stock" is a Robinhood SQL staple — the same shape that scales up to "reconstruct an account balance from trade events" at the L4/L5 onsite. The mental model: LAG(value) OVER (PARTITION BY group ORDER BY ts) looks one row back inside the same partition and returns the prior row's value; partition keeps each stock's series independent, ORDER BY fixes the row order, and the day-over-day delta is (value − LAG(value)) / LAG(value). The same window primitive powers any "delta vs previous" query — running balance, running total via SUM, rank within a partition via ROW_NUMBER.

Diagram showing the LAG window function reaching one row back inside a per-stock_symbol partition over trade_date to compute daily percentage change of trading volume.

Pro tip: The first row in each partition has no predecessor, so LAG returns NULL. Decide upfront: filter out the NULL row with WHERE prev IS NOT NULL, or pass through with COALESCE(prev, value) to mark "no change." Stating this explicitly is the senior signal interviewers grade.

Window basics: PARTITION BY group + ORDER BY ordering

The window-function invariant: OVER (PARTITION BY <group> ORDER BY <ordering>) declares an independent ordered subset for every value of the group expression; the function (LAG, LEAD, SUM, ROW_NUMBER, …) is evaluated within that subset only. PARTITION BY stock_symbol builds one ordered series per ticker; ORDER BY trade_date fixes the row order inside each.

  • PARTITION BY is optional — omitted, the whole table is one window; useful for global running totals.
  • ORDER BY is required for offset functionsLAG, LEAD, ROW_NUMBER, RANK all need a defined order.
  • FramesROWS UNBOUNDED PRECEDING / BETWEEN ... AND ... further bound the visible rows; default is fine for LAG.
  • Multiple windows in one query — different functions can use different OVER (...) clauses on the same row.

Worked example. Two (date, symbol, volume) rows for AAPL; LAG(volume) OVER (PARTITION BY symbol ORDER BY date) returns the previous day's volume.

| date | symbol | volume | lag(volume) |
| --- | --- | --- | --- |
| 2022-07-01 | AAPL | 1000000 | NULL |
| 2022-07-02 | AAPL | 1500000 | 1000000 |

Worked-example solution.

SELECT date, symbol, volume,
       LAG(volume) OVER (PARTITION BY symbol ORDER BY date) AS prev_volume
FROM trading_volume;

Rule of thumb: always set PARTITION BY to the entity that owns the time series (stock_symbol, account_id, user_id); without it, day-1 of one symbol leaks across to compute deltas with a different symbol's day-N.

LAG vs LEAD: row-N-before vs row-N-after

The offset-function invariant: LAG(expr, n) returns expr from the row n positions before the current row inside the partition; LEAD(expr, n) returns from n positions after. Default n = 1. Both accept a third default argument that replaces NULL at partition boundaries.

  • LAG(volume) — previous row's volume (1 back).
  • LAG(volume, 7) — 7 rows back; useful for week-over-week deltas on daily data.
  • LEAD(volume) — next row's volume (1 ahead).
  • LAG(volume, 1, volume) — falls back to the current row's volume on the first row, so the % change is 0 instead of NULL.

Worked example. LAG and LEAD across a three-day volume series.

| trade_date | volume | LAG(volume) | LEAD(volume) |
| --- | --- | --- | --- |
| 2022-07-01 | 1000000 | NULL | 1500000 |
| 2022-07-02 | 1500000 | 1000000 | 1800000 |
| 2022-07-03 | 1800000 | 1500000 | NULL |

Worked-example solution.

SELECT trade_date,
       volume,
       LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date) AS prev_volume,
       LEAD(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date) AS next_volume
FROM trading_volume;

Rule of thumb: "delta vs yesterday" → LAG; "delta vs tomorrow" or "next-event distance" → LEAD. They are mirror images of the same primitive.

NULL on the first row: handling the edge with COALESCE or filtering

The boundary invariant: the first row of each partition has no preceding row, so LAG(expr) returns NULL; arithmetic on NULL propagates to NULL. The two clean strategies are: (a) filter out the NULL row (WHERE prev IS NOT NULL) so the result starts on day 2; (b) replace NULL upstream with LAG(expr, 1, default) or COALESCE(LAG(expr), 0) so day 1 has a defined % change (often 0 or NULL-as-string).

  • WHERE prev IS NOT NULL — drops the first-day row entirely.
  • LAG(volume, 1, volume) — first-day prev_volume becomes the same row's volume; % change becomes 0.
  • COALESCE(LAG(volume), 0) — first-day prev_volume becomes 0, which makes the % change blow up (/ 0); usually wrong.
  • NULLIF(LAG(volume), 0) — guards against zero-volume divisions; complementary to the boundary fix.

Worked example. Drop the first-day NULL on a 3-row series.

| trade_date | volume | volume_change_pct |
| --- | --- | --- |
| 2022-07-02 | 1500000 | 50.00 |
| 2022-07-03 | 1800000 | 20.00 |

Worked-example solution.

SELECT *
FROM (
    SELECT trade_date,
           stock_symbol,
           volume,
           LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date) AS prev_volume,
           (volume - LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date))
             / NULLIF(LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date), 0) * 100
             AS volume_change_pct
    FROM trading_volume
) sub
WHERE prev_volume IS NOT NULL;

Rule of thumb: state your boundary policy out loud — "I'll drop day 1" or "I'll show day 1 as NULL" — before writing the query; interviewers grade the explicit choice.

Common beginner mistakes

  • Forgetting PARTITION BY stock_symbol — LAG reaches into the previous symbol's series and computes a meaningless delta.
  • Forgetting ORDER BY trade_date — LAG returns whichever row the planner happens to pick; the answer is non-deterministic.
  • Dividing by LAG(volume) when the prior volume could be 0 — runtime divide-by-zero; use NULLIF(LAG(volume), 0).
  • Using LEAD when the prompt says "vs yesterday" — answers the wrong direction.
  • Hardcoding the offset (LAG(volume, 1)) when the spec is "vs 7 days ago" — read the spec and use LAG(volume, 7).

SQL Interview Question on Daily Volume Percentage Change

Given a trading_volume(trade_date, stock_symbol, volume) table with one row per (date, symbol), write a query that returns, for each row from day 2 onwards, the daily percentage change in volume vs the previous day for the same symbol. Output trade_date, stock_symbol, and volume_change_pct (rounded to 2 decimals).

Solution Using LAG window function

SELECT trade_date,
       stock_symbol,
       ROUND(
         (volume - LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date))
         / NULLIF(LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date), 0) * 100,
         2
       ) AS volume_change_pct
FROM trading_volume
QUALIFY LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date) IS NOT NULL;

(In Postgres, replace QUALIFY with a wrapping subquery and WHERE prev_volume IS NOT NULL — Snowflake / BigQuery / Databricks support QUALIFY directly.)
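
For reference, a sketch of that Postgres-compatible rewrite, computing the window column once in a subquery and filtering on it outside:

-- Postgres has no QUALIFY: materialize the LAG column in a subquery, then filter.
-- Multiplying by 100.0 keeps the division out of integer arithmetic.
SELECT trade_date,
       stock_symbol,
       ROUND((volume - prev_volume) * 100.0 / NULLIF(prev_volume, 0), 2) AS volume_change_pct
FROM (
    SELECT trade_date,
           stock_symbol,
           volume,
           LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date) AS prev_volume
    FROM trading_volume
) sub
WHERE prev_volume IS NOT NULL;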

Why this works: PARTITION BY stock_symbol keeps each ticker's daily series independent; ORDER BY trade_date fixes ordering inside the partition; LAG(volume) reaches one row back to fetch yesterday's volume; the formula (today − yesterday) / yesterday × 100 produces the % change; NULLIF(..., 0) guards against divide-by-zero on zero-volume days; QUALIFY ... IS NOT NULL filters the first-day boundary so output starts on day 2.

Step-by-step trace for AAPL across 5 days:

| trade_date | stock_symbol | volume |
| --- | --- | --- |
| 2022-07-01 | AAPL | 1000000 |
| 2022-07-02 | AAPL | 1500000 |
| 2022-07-03 | AAPL | 1800000 |
| 2022-07-04 | AAPL | 1750000 |
| 2022-07-05 | AAPL | 1800000 |

  1. Partition — single partition for AAPL; one ordered series of 5 rows by trade_date ascending.
  2. LAG(volume) — produces [NULL, 1000000, 1500000, 1800000, 1750000] for the 5 rows.
  3. Delta and division — row 2: (1500000 - 1000000) / 1000000 = 0.50; row 3: (1800000 - 1500000) / 1500000 = 0.20; row 4: (1750000 - 1800000) / 1800000 = -0.0278; row 5: (1800000 - 1750000) / 1750000 = 0.0286.
  4. × 100 — convert to percent.
  5. ROUND(..., 2) — round to 2 decimals.
  6. QUALIFY LAG IS NOT NULL — drops row 1 (first day, no predecessor).

Output:

| trade_date | stock_symbol | volume_change_pct |
| --- | --- | --- |
| 2022-07-02 | AAPL | 50.00 |
| 2022-07-03 | AAPL | 20.00 |
| 2022-07-04 | AAPL | -2.78 |
| 2022-07-05 | AAPL | 2.86 |

Why this works — concept by concept:

  • PARTITION BY stock_symbol — declares one independent window per ticker; without it, AAPL's day-1 would compute a delta against TSLA's day-N.
  • ORDER BY trade_date — fixes the row order inside each window; offset functions need an explicit ordering or the result is non-deterministic.
  • LAG(volume) — retrieves the previous row's volume inside the partition; NULL on the first row by design.
  • NULLIF(LAG(volume), 0) — guards against divide-by-zero on zero-volume days; NULLIF(x, 0) returns NULL when x = 0, propagating cleanly through arithmetic.
  • QUALIFY LAG IS NOT NULL — drops the first-day row of each partition so the output starts on day 2 with a defined % change; the Postgres equivalent is a subquery WHERE prev_volume IS NOT NULL.
  • O(N log N) time — the planner sorts each partition once; O(N) space for the running window state. For Robinhood-scale daily volumes this is cheap.

Inline CTA: More window-function problems and SQL aggregation problems for the breadth tier.



4. SQL Aggregation and HAVING for Threshold and Notional-Limit Checks

GROUP BY + HAVING for risk-threshold checks in SQL for Robinhood data engineering

"Find every user whose end-of-day option position exceeds the notional limit" is the canonical Robinhood risk / compliance SQL prompt. The mental model: aggregate per user with SUM(contract_count × strike_price × multiplier) to compute notional exposure, then filter the resulting groups with HAVING SUM(...) > limit. HAVING is "WHERE on aggregates" — it filters group rows, not source rows. Same primitive powers any "flag entities whose aggregate metric crosses a policy line" pipeline — flag accounts whose daily trade count exceeds a velocity cap, sectors whose monthly volume exceeds an exposure ceiling, users whose 24-hour withdrawals breach AML thresholds.

Diagram of per-user notional-position aggregation showing each user's total notional as a horizontal bar with a threshold line, where bars exceeding the threshold are tinted red and flagged in the output table.

Pro tip: Brokerage data is decimal-precision-sensitive. Never store currency or notional as FLOAT / DOUBLE — the rounding errors compound and a 1¢ drift becomes a regulatory event. Use NUMERIC(18, 4) or DECIMAL(18, 4) for prices, quantities, and notionals. State this constraint when you write the schema; interviewers grade it.
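
A minimal schema sketch of that constraint, using generic ANSI types for the positions table this section queries:

-- Prices and notionals as DECIMAL, never FLOAT: exact through the multiply chain.
CREATE TABLE positions (
    user_id        VARCHAR(32)    NOT NULL,
    contract_count INTEGER        NOT NULL,
    strike_price   DECIMAL(18, 4) NOT NULL,
    status         VARCHAR(16)    NOT NULL   -- 'Open' or 'Closed'
);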

WHERE vs HAVING: rows vs groups

The filter-stage invariant: WHERE filters source rows before grouping; HAVING filters group rows after grouping. WHERE cannot reference aggregates (SUM, COUNT, AVG); HAVING can. The execution order is FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT.

  • WHERE status = 'Open' — row predicate; filters before grouping.
  • HAVING SUM(notional) > 100000 — group predicate; filters after grouping.
  • Both in one query — WHERE strips closed positions, HAVING keeps groups whose remaining sum breaches the limit.
  • Avoid duplicating logic — if a predicate doesn't need an aggregate, put it in WHERE for performance; if it does, HAVING is the only correct place.

Worked example. Filter open positions row-wise, then keep groups whose notional sum > $100K.

| step | rows | groups |
| --- | --- | --- |
| after WHERE | 12 (open) | |
| after GROUP BY | | 5 users |
| after HAVING > 100K | | 2 users |

Worked-example solution.

SELECT user_id, SUM(contract_count * strike_price * 100) AS notional
FROM positions
WHERE status = 'Open'
GROUP BY user_id
HAVING SUM(contract_count * strike_price * 100) > 100000;

Rule of thumb: aggregate predicates always belong in HAVING; row predicates always belong in WHERE. Mixing them up is graded as a conceptual error.

Notional aggregation: SUM(contract_count × strike_price × multiplier)

The notional invariant: for an equity option, notional = contract_count × strike_price × 100 (one US equity option contract represents 100 shares). For futures, the multiplier varies by product (ES = 50, NQ = 20). Always express the multiplier explicitly in the SQL — never bake "× 100" silently into an upstream view.

  • Equity options — contract_count × strike_price × 100.
  • Index futures — multiplier per contract spec; ES = 50.
  • Equities (long) — qty × price.
  • Use NUMERIC columns — preserve cents through the multiply chain.

Worked example. u2 holds 200 open contracts at an $80 strike.

| user_id | contract_count | strike | notional |
| --- | --- | --- | --- |
| u2 | 200 | 80 | 200 × 80 × 100 = 1,600,000 |

Worked-example solution.

SELECT user_id, SUM(contract_count * strike_price * 100) AS notional
FROM positions
WHERE status = 'Open'
GROUP BY user_id;

Rule of thumb: spell out the multiplier every time and use a NUMERIC column for strike_price and contract_count; floats here will quietly lose pennies and the audit log will catch you.

Combining HAVING with COUNT and SUM thresholds

The composition invariant: HAVING accepts any boolean expression over aggregates and grouping columns. You can combine SUM, COUNT, AVG, MAX, and the grouping keys with AND / OR. Common compound predicates: "more than 10 trades AND notional > $100K", "MAX position older than 7 days", "average ticket size > $5K and at least 5 trades."

  • HAVING COUNT(*) > 10 — at least 11 trades in the group.
  • HAVING SUM(notional) > 100000 AND COUNT(*) >= 5 — both conditions must hold.
  • HAVING AVG(price) > 100 OR MAX(price) > 500 — either condition.
  • HAVING MAX(trade_date) < CURRENT_DATE - INTERVAL '7' DAY — group has been quiet for a week.

Worked example. Flag users with ≥5 trades AND notional > $100K.

| user | trades | notional | flagged |
| --- | --- | --- | --- |
| u1 | 7 | 60,000 | no (notional fail) |
| u2 | 12 | 1,600,000 | yes |
| u3 | 3 | 200,000 | no (count fail) |

Worked-example solution.

SELECT user_id,
       COUNT(*) AS trades,
       SUM(contract_count * strike_price * 100) AS notional
FROM positions
WHERE status = 'Open'
GROUP BY user_id
HAVING COUNT(*) >= 5
   AND SUM(contract_count * strike_price * 100) > 100000;

Rule of thumb: compose the HAVING predicate exactly as the policy or rule reads; don't pre-aggregate into a CTE just to put WHERE on top — HAVING is the right tool.

Common beginner mistakes

  • Using WHERE SUM(notional) > 100000 — WHERE cannot reference aggregates; the parser rejects it. Use HAVING.
  • Computing notional in a subquery and then re-aggregating outside — works but is the "long way around"; one GROUP BY + HAVING is cleaner.
  • Storing strike_price as FLOAT — accumulates rounding error; use NUMERIC(18, 4).
  • Forgetting the × 100 multiplier on equity options — the notional comes out 100× too small.
  • Including closed positions — forgetting WHERE status = 'Open' inflates notional and produces false positives.

SQL Interview Question on Notional-Limit Threshold Check

Given a positions(user_id, contract_count, strike_price, status) table where status is 'Open' or 'Closed' and each row is one option position (equity-option multiplier = 100), write a query that returns every user_id whose total open notional exceeds $100,000. Output user_id and notional, ordered by notional descending.

Solution Using GROUP BY + HAVING SUM(...) > limit

SELECT
    user_id,
    SUM(contract_count * strike_price * 100) AS notional
FROM positions
WHERE status = 'Open'
GROUP BY user_id
HAVING SUM(contract_count * strike_price * 100) > 100000
ORDER BY notional DESC;

Why this works: WHERE status = 'Open' strips closed positions before grouping so the aggregate only reflects current exposure; GROUP BY user_id collapses to one row per user; the equity-option multiplier × 100 produces real-dollar notional inside SUM; HAVING filters groups whose summed notional crosses the policy limit; ORDER BY notional DESC ranks the breaches by severity.

Step-by-step trace for the sample:

| user_id | contract_count | strike_price | status |
| --- | --- | --- | --- |
| u1 | 50 | 120 | Open |
| u2 | 200 | 80 | Open |
| u3 | 30 | 150 | Open |
| u4 | 80 | 90 | Open |
| u5 | 10 | 100 | Open |
| u4 | 25 | 50 | Closed |

  1. WHERE filter — drops the closed u4 row. 5 open rows remain.
  2. Per-row notional — u1: 50×120×100 = 600,000, u2: 200×80×100 = 1,600,000, u3: 30×150×100 = 450,000, u4: 80×90×100 = 720,000, u5: 10×100×100 = 100,000.
  3. Group by user_id — already one open row per user; the SUM collapses each one to itself.
  4. HAVING > 100,000 — strict greater-than; u5 at exactly 100,000 does not pass; u1 (600K), u2 (1.6M), u3 (450K), u4 (720K) do.
  5. Order by notional descu2 > u4 > u1 > u3.

Output:

| user_id | notional |
| --- | --- |
| u2 | 1,600,000 |
| u4 | 720,000 |
| u1 | 600,000 |
| u3 | 450,000 |

Why this works — concept by concept:

  • WHERE row filter — strips closed positions before grouping; HAVING cannot do this work because the predicate is per-row, not per-group.
  • SUM(contract_count * strike_price * 100) — computes equity-option notional with the × 100 multiplier baked into the aggregate; spelled out so the reviewer can read the units.
  • GROUP BY user_id — collapses each user's positions to one row with summed notional; the only non-aggregate column in SELECT matches the GROUP BY.
  • HAVING SUM(...) > 100000 — filters group rows; aggregate predicates have to live here. Strict-greater means 100000 itself does not breach.
  • ORDER BY notional DESC — ranks breaches by severity so the on-call engineer can triage worst-first.
  • O(|positions| + G log G) time — one scan for the join-free GROUP BY, then O(G log G) to sort the group output where G is the number of users with open positions.

Inline CTA: More SQL HAVING-clause problems and SQL aggregation problems for breadth.



Tips to crack Robinhood data engineering interviews

Bilingual profile — SQL and Python both, equally weighted

Unlike Cisco's Python-only loop, Robinhood splits the coding rounds between SQL and Python. The curated 2-problem Robinhood practice set is 1 EASY Python hash-table + 1 MEDIUM SQL joins — bilingual by design. Spending all your prep time on Python means losing the SQL round; spending it all on SQL means stuttering on the dict-counter Python prompt. Allocate your time roughly 50/50 across the SQL practice surface and the Python practice surface.

Drill the four primitives

The four primitives in this guide map directly to the curated Robinhood set plus the two adjacent SQL patterns every Robinhood SQL question list rotates through: hash-table dict counter (Python, EASY — Stock Purchases Count), INNER JOIN + GROUP BY + ORDER DESC + LIMIT (SQL, MEDIUM — Member Transfer Records / cities-completed-trades), LAG window function (SQL — daily volume change), GROUP BY + HAVING (SQL — notional-limit checks). Each maps to a specific module: vanilla dict and collections.defaultdict for the Python primitive, INNER JOIN and GROUP BY for the joins primitive, OVER (PARTITION BY ... ORDER BY ...) for window LAG, HAVING for the aggregate-threshold primitive.

Penny-perfect correctness is the bar

Robinhood is a regulated brokerage; every cent of every account balance is auditable. Float-typed currency, rounding-tolerant joins, and "approximately right" pipelines are downgrade signals in the round. Use NUMERIC(18, 4) for prices and notionals; spell out the equity-option × 100 multiplier; volunteer "I would store this as decimal, never float" when you write the schema. State idempotency requirements when designing pipelines — every job can be re-run without double-counting.
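
The same rule on the Python side of a pipeline, as a minimal sketch with decimal.Decimal standing in for float:

from decimal import Decimal

# Construct Decimals from strings, not floats, so no binary rounding sneaks in.
strike = Decimal('82.5500')
contracts = 200
notional = contracts * strike * 100      # Decimal('1651000.0000'), exact to the cent
assert notional == Decimal('1651000.0000')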

Event-sourced thinking — positions are folded from immutable trade events

Robinhood's distinctive architecture pattern is event sourcing: account balances and positions are not mutable rows you UPDATE; they are derived by folding an immutable log of trade and cash events. CRUD-style designs in interview answers underperform. Frame state as "I would write trades to an append-only Kafka log; the position table is a materialized fold of those events; corrections arrive as new events, not as UPDATE statements." This single framing flip unlocks the senior signal interviewers grade.
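
A toy sketch of that framing, assuming a hypothetical event shape: the position is a fold over an append-only event list, and a correction is just one more event.

from collections import defaultdict

# Hypothetical immutable event log: (account, symbol, signed_quantity).
events = [
    ('acct1', 'AAPL', 50),
    ('acct1', 'AAPL', -20),
    ('acct1', 'AAPL', 5),      # a correction arrives as a new event, never an UPDATE
]

positions = defaultdict(int)   # materialized view, re-derivable from the log at any time
for account, symbol, qty in events:
    positions[(account, symbol)] += qty
# positions == {('acct1', 'AAPL'): 35}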

T+1 settlement and brokerage glossary as table stakes

Knowing the basics — settlement (T+1 since May 2024 in the US), corporate actions (splits, dividends, mergers), options assignment / exercise / expiration, FIFO / LIFO accounting, wash sales (same-security repurchase within 30 days disallows the loss) — gives a major edge. You don't need a finance degree; you do need to recognize the terms when the interviewer drops them. Spend a week with a brokerage glossary if your background is non-finance.

Easy-Medium discipline matters

The curated Robinhood set is 1 EASY + 1 MEDIUM. Easy at Robinhood doesn't mean trivial — it means the interviewer expects zero hesitation, idiomatic code, and an articulated invariant. A correct EASY answer with stuttering or a missing edge case is graded worse than a correct MEDIUM answer with the same flaw. Drill the easy practice page and the medium practice page until the canonical EASY-tier code rolls off your fingers in under three minutes.

Where to practice on PipeCode

Start with the Robinhood practice page for the curated 2-problem set. After that, drill the matching topic pages: hash table, dictionary, joins, group by, aggregation, window functions, having clause, filtering. The interview courses page bundles structured curricula. For a broader set, browse by topic or pivot to peer fintech with the Airbnb DE interview guide and the top DE interview questions 2026 blog.

Communication and approach under time pressure

Talk through the invariant first ("this is a LAG-on-partitioned-time-series problem"), the brute force second ("a self-join on date - 1 would also work"), and the optimal third ("but LAG is the idiomatic and faster move"). Interviewers grade process as much as the final answer. Leave 5 minutes for an edge-case sweep: empty input, single-row partitions, duplicate trade events, divide-by-zero on prior-day volume, decimal precision on the notional sum. The most common "almost passed" failure mode is correct happy-path code that crashes on edge cases — a 30-second sweep prevents it.


Frequently Asked Questions

What is the Robinhood data engineering interview process like?

The Robinhood data engineering interview opens with a 30-minute recruiter screen, then a 60-minute technical phone screen with one live SQL or Python coding problem, then a 4-round virtual onsite: a system-design round (commonly trade reconciliation, regulatory reporting, or position event-sourcing), a live coding round in the language you didn't see in the phone screen, a data-modeling discussion (often Type 2 SCD on financial dimensions), and a behavioral round. Robinhood interviewers grade integrity and compliance-mindedness heavily; bring a postmortem-style story about a hard call between speed and correctness. End-to-end the loop runs three to four weeks.

What programming languages does Robinhood test in data engineering interviews?

Robinhood data engineering interviews are bilingual — SQL and Python in roughly equal measure across the loop. SQL questions concentrate on joins, aggregations with GROUP BY and HAVING, window functions (LAG, LEAD, SUM OVER), and financial-precision patterns (account balance reconstruction, FIFO P&L, trade reconciliation). Python questions concentrate on hash-table counters, dict-of-sets aggregation, event-stream dedup with TTL, and event sourcing folds. Go and Scala appear at backend-leaning DE roles but are not expected in the coding rounds.

How difficult are Robinhood data engineering interview questions?

The curated Robinhood practice set on PipeCode is 1 easy and 1 medium, no hard. The EASY is a Python hash-table dict-counter problem (Stock Purchases Count); the MEDIUM is a SQL joins + filtering problem (Member Transfer Records / top-3 cities by completed trades). At the onsite, system-design and modeling questions reach L4-L5 level — trade reconciliation pipelines, regulatory CAT reporting, position event-sourcing — but the live coding rounds stay in the EASY-MEDIUM zone for IC2-IC3 hires. Stuttering on the EASY is a stronger negative signal than struggling with the MEDIUM.

How should I prepare for a Robinhood data engineer interview?

Solve the 2 problems on the Robinhood practice page end-to-end — untimed first, then timed at 25 minutes per problem — and broaden to 30 to 50 additional problems across the matching topic pages: hash table and dictionary on the Python side, joins, group-by, aggregation, window functions, and having-clause on the SQL side. Read a brokerage glossary for a week (settlement, corporate actions, options assignment, FIFO accounting, wash sales). Practice articulating idempotency and audit-log requirements when discussing pipeline design — those framings are graded heavily at Robinhood.

What is the Robinhood data engineer salary range?

Robinhood data engineer total compensation runs from roughly $170K (IC2, 2-4 years experience) to $620K (IC5, multi-org technical leadership). Senior Data Engineer (IC3) is the most common external hire at $240K–$370K total comp. Staff (IC4) sits at $330K–$500K. Average base salary across all levels lands around $163K with median $150K; total compensation averages around $221K when RSU refreshers and bonus are included. Negotiation success rates run 10–25% with competing offers per verified levels.fyi data.

What financial-domain knowledge do Robinhood interviewers expect?

Robinhood interviewers expect candidates to recognize and use brokerage-domain terms without flinching: settlement (T+1), corporate actions (splits, dividends, mergers — they require historical-position re-multiplication), options (assignment, exercise, expiration, the × 100 equity-option multiplier), accounting methods (FIFO vs LIFO vs average cost — FIFO is the default for tax lots), wash sales (loss disallowed if same security repurchased within 30 days), regulatory pipelines (CAT — Consolidated Audit Trail — is FINRA-mandated). You don't need a finance background to land the role, but you do need to converse fluently in these terms; a one-week brokerage-glossary sprint is enough.


Start practicing Robinhood data engineering problems
