Robinhood data engineering interview questions are bilingual — SQL and Python in roughly equal measure — with a fintech-correctness edge that most generic interview-prep posts miss. Four primitives carry the loop: dict.get(s, 0) + 1 hash-table counters that aggregate stock-purchase events by symbol, INNER JOIN trades + users + GROUP BY + ORDER BY count DESC LIMIT N for top-N city / member-transfer rankings, LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date) for day-over-day volume or balance change, and GROUP BY user_id HAVING SUM(notional) > limit for end-of-day threshold and notional-cap checks. The framings are everyday brokerage data engineering — count purchases per ticker, surface the top cities by completed trades, compute a daily volume percentage change, flag any account whose option exposure crosses a regulatory limit.
This guide walks through the four topic clusters Robinhood actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches a curated 2-problem Robinhood set (1 EASY Python hash-table + 1 MEDIUM SQL joins) plus the two SQL primitives — window LAG and HAVING-aggregate — that show up on every Robinhood SQL question list and at every L4/L5 onsite. Robinhood pipelines also demand penny-perfect correctness, idempotency, and audit-aware thinking; brokerage candidates who frame their answers in those terms separate themselves from the generic-DE pile.
Top Robinhood data engineering interview topics
From the Robinhood data engineering practice set, the four numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–4) | Why it shows up at Robinhood |
|---|---|---|
| 1 | Python hash tables and dict counters for stock-purchase aggregation | Stock Purchases Count (EASY) — dict.get(symbol, 0) + 1 and dict-of-sets, the Python primitive for counting events and tracking distinct buyers. |
| 2 | SQL inner join and GROUP BY for top-N trade aggregations | Member Transfer Records (MEDIUM) — INNER JOIN trades + users + WHERE status='Completed' + GROUP BY city + ORDER BY COUNT(*) DESC + LIMIT N, the SQL primitive for top-N rankings on completed trades. |
| 3 | SQL window functions and LAG for daily volume change | Volume change percentage — LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date) for day-over-day deltas on partitioned time series. |
| 4 | SQL aggregation and HAVING for threshold and notional-limit checks | Notional-limit check — GROUP BY user_id HAVING SUM(contract_count * strike_price * 100) > limit, the SQL primitive for "WHERE on aggregates" that flags risk and compliance breaches. |
Bilingual framing rule: Robinhood's prompts are everyday brokerage data engineering — count buyers per ticker, rank cities by completed orders, measure daily volume change, flag breaches of a notional limit. The interviewer is grading whether you map each business framing to the right primitive: count events → hash-table dict counter; rank by attribute → `INNER JOIN + GROUP BY + ORDER BY DESC + LIMIT`; day-over-day delta → `LAG` window function; flag aggregates over a threshold → `GROUP BY + HAVING`. State the mapping out loud and the correctness will follow.
1. Python Hash Tables and Dict Counters for Stock-Purchase Aggregation
Hash-table dict counting for stock-purchase events in Python for data engineering
"Given a list of stock-purchase events, count purchases per symbol and surface the set of distinct buyers per symbol" is Robinhood's signature EASY Python prompt (Stock Purchases Count). The mental model: a hash-table counter is a dict keyed on the grouping attribute with a count value, updated in a single pass with counts[k] = counts.get(k, 0) + 1; a parallel dict[str, set] keyed the same way tracks distinct members per group. Same primitive powers any "count by category" or "unique-elements per bucket" pipeline — count clicks per user, distinct sessions per page, purchases per ticker.
Pro tip: The `dict.get(k, 0) + 1` idiom is the workhorse — but `collections.Counter` is the idiomatic shortcut and `collections.defaultdict(int)` is the common middle ground. Pick the one your interviewer can read fastest: `Counter` for "count occurrences," `defaultdict(set)` for "track unique members per group," vanilla `dict.get` when you want zero imports. State which you chose and why.
Dict counter idiom: dict.get(k, 0) + 1 vs collections.Counter
The dict-counter invariant: a dict keyed on the grouping attribute holds the running count for that group; on each event you read the current count (or 0 if absent), add one, and write it back. dict.get(k, default) returns default when k is missing, which avoids the KeyError that bare dict[k] would raise.
- `counts[k] = counts.get(k, 0) + 1` — vanilla dict, no imports, the universal pattern.
- `counts = Counter(events)` — one-liner if `events` is the iterable of keys themselves (no extra fields).
- `counts = defaultdict(int); counts[k] += 1` — middle ground, handles missing keys via the factory.
- Avoid `try/except KeyError` — works but is slower and harder to read than `get` or `defaultdict`.
Worked example. Count three purchase events for AAPL, TSLA, AAPL using the dict.get idiom.
| step | counts |
|---|---|
| start | `{}` |
| AAPL | `{'AAPL': 1}` |
| TSLA | `{'AAPL': 1, 'TSLA': 1}` |
| AAPL | `{'AAPL': 2, 'TSLA': 1}` |
Worked-example solution.
```python
counts = {}
for symbol in ['AAPL', 'TSLA', 'AAPL']:
    counts[symbol] = counts.get(symbol, 0) + 1
# counts == {'AAPL': 2, 'TSLA': 1}
```
Rule of thumb: when the count is the only signal, Counter(iter) is shortest; when you also need other per-group state (a set of distinct buyers, a running max, a list of timestamps), reach for defaultdict(set) or vanilla dict.get with explicit logic.
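To make the trade-off concrete, here is a small sketch (the event list is invented for illustration) showing that the three counting idioms above produce the same mapping on the same stream:

```python
from collections import Counter, defaultdict

events = ['AAPL', 'TSLA', 'AAPL']

# 1. Vanilla dict with dict.get — zero imports, branch-free.
counts_get = {}
for symbol in events:
    counts_get[symbol] = counts_get.get(symbol, 0) + 1

# 2. Counter — shortest when the iterable is the keys themselves.
counts_counter = Counter(events)

# 3. defaultdict(int) — the int() factory supplies the missing-key zero.
counts_dd = defaultdict(int)
for symbol in events:
    counts_dd[symbol] += 1

# All three hold the same mapping.
assert counts_get == counts_counter == counts_dd == {'AAPL': 2, 'TSLA': 1}
```

Because `Counter` and `defaultdict` both subclass `dict`, equality comparisons across the three work directly — pick whichever reads best for the prompt at hand.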
Set-of-buyers per stock: dict-of-sets pattern
The dict-of-sets invariant: a dict keyed on the grouping attribute holds a set of distinct member ids; each event inserts the member id, and set semantics dedupe automatically. Inserting the same (symbol, user) pair twice is a no-op — no special handling required, no if member in s guard.
- `buyers.setdefault(symbol, set()).add(user)` — one-liner that initializes the set on first touch.
- `buyers = defaultdict(set); buyers[symbol].add(user)` — same outcome, slightly cleaner.
- `len(buyers[symbol])` — distinct-buyer count for any symbol.
- Set membership — `O(1)` average; never use a `list` here, which would be `O(N)` for `if x in lst`.
Worked example. Track distinct buyers for AAPL across three events: (u1, AAPL), (u2, AAPL), (u1, AAPL).
| step | buyers['AAPL'] |
|---|---|
| start | `set()` |
| u1 | `{'u1'}` |
| u2 | `{'u1', 'u2'}` |
| u1 (dup) | `{'u1', 'u2'}` |
Worked-example solution.
```python
from collections import defaultdict

buyers = defaultdict(set)
for user, symbol in [('u1', 'AAPL'), ('u2', 'AAPL'), ('u1', 'AAPL')]:
    buyers[symbol].add(user)
# buyers == {'AAPL': {'u1', 'u2'}}
```
Rule of thumb: defaultdict(set) collapses the "first-time init" branch — never write if k not in d: d[k] = set() when defaultdict exists.
One-pass aggregation in a single for-loop
The single-pass invariant: two parallel dicts updated in the same for loop produce both aggregates in O(N) total time and O(K) space, where K is the number of distinct symbols. Iterating twice — once for counts, once for buyer-sets — works but doubles the work; one pass is idiomatic and what interviewers grade.
- One loop, two updates — `counts[s] = counts.get(s, 0) + 1` and `buyers.setdefault(s, set()).add(u)` per event.
- Avoid two passes — `O(2N) == O(N)` asymptotically, but it reads as wasteful.
- Tuple-unpack the event — `for user, symbol, qty in events:` is cleaner than indexing `event[0]`, `event[1]`.
- Skip filtered events first — if the prompt says "only Completed trades," guard with `if status != 'Completed': continue` at the top of the loop.
Worked example. Aggregate the four-event stream into both dicts in one pass.
| event | counts | buyers |
|---|---|---|
| (u1, AAPL) | `{'AAPL': 1}` | `{'AAPL': {'u1'}}` |
| (u2, TSLA) | `{'AAPL': 1, 'TSLA': 1}` | `{'AAPL': {'u1'}, 'TSLA': {'u2'}}` |
| (u1, AAPL) | `{'AAPL': 2, 'TSLA': 1}` | `{'AAPL': {'u1'}, 'TSLA': {'u2'}}` |
| (u3, AAPL) | `{'AAPL': 3, 'TSLA': 1}` | `{'AAPL': {'u1', 'u3'}, 'TSLA': {'u2'}}` |
Worked-example solution.
```python
from collections import defaultdict

counts: dict[str, int] = {}
buyers: dict[str, set] = defaultdict(set)
for user, symbol in [('u1', 'AAPL'), ('u2', 'TSLA'), ('u1', 'AAPL'), ('u3', 'AAPL')]:
    counts[symbol] = counts.get(symbol, 0) + 1
    buyers[symbol].add(user)
```
Rule of thumb: one loop, two parallel structures, zero if k in d branches — reach for defaultdict and dict.get first.
Common beginner mistakes
- Using `counts[k] += 1` on a missing key — raises `KeyError`. Use `dict.get(k, 0) + 1` or `defaultdict(int)`.
- Using a `list` instead of a `set` for distinct buyers — `O(N)` membership checks blow up the asymptotic cost.
- Iterating twice instead of once — wastes work and signals the candidate didn't think about cost.
- Forgetting to skip filtered events at the top of the loop — counts include cancelled / failed trades and produce the wrong answer.
- Returning `dict.values()` directly when the test expects a plain list — wrap with `list(...)` if the contract demands it.
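The first mistake above is worth seeing once in a REPL. A minimal sketch (key names invented) showing that bare `+= 1` on a missing key raises, while `dict.get` and `defaultdict` stay branch-free:

```python
from collections import defaultdict

counts = {}
try:
    counts['AAPL'] += 1                # KeyError: 'AAPL' was never initialized
except KeyError:
    missing_key_raised = True

counts['AAPL'] = counts.get('AAPL', 0) + 1   # safe: get() defaults to 0

safe = defaultdict(int)
safe['AAPL'] += 1                            # safe: the int() factory supplies 0

assert missing_key_raised and counts['AAPL'] == 1 and safe['AAPL'] == 1
```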
Python Interview Question on Stock-Purchase Aggregation
Given a list of completed stock-purchase events (user_id, symbol, quantity), return a dict mapping each symbol to a tuple of (a) the total purchase count and (b) the set of distinct user_id values that bought it.
```python
def stock_purchases(events):
    pass  # your code here
```
Solution Using dict.get and defaultdict(set) in one pass
```python
from collections import defaultdict

def stock_purchases(events):
    counts: dict[str, int] = {}
    buyers: dict[str, set] = defaultdict(set)
    for user_id, symbol, _qty in events:
        counts[symbol] = counts.get(symbol, 0) + 1
        buyers[symbol].add(user_id)
    return {s: (counts[s], buyers[s]) for s in counts}
```
Why this works: the dict.get(symbol, 0) + 1 idiom keeps the count loop branch-free; defaultdict(set) keeps the buyer-set loop branch-free; one pass over events builds both structures in O(N) time. The final dict-comprehension stitches them together so each symbol maps to (count, set_of_buyers) in a single return value.
Step-by-step trace for events = [('u1', 'AAPL', 50), ('u2', 'TSLA', 10), ('u1', 'AAPL', 25), ('u3', 'AAPL', 5), ('u2', 'GOOG', 8)]:
| step | event | counts | buyers |
|---|---|---|---|
| 0 | start | `{}` | `defaultdict(set)` |
| 1 | (u1, AAPL, 50) | `{'AAPL': 1}` | `{'AAPL': {'u1'}}` |
| 2 | (u2, TSLA, 10) | `{'AAPL': 1, 'TSLA': 1}` | `{'AAPL': {'u1'}, 'TSLA': {'u2'}}` |
| 3 | (u1, AAPL, 25) | `{'AAPL': 2, 'TSLA': 1}` | `{'AAPL': {'u1'}, 'TSLA': {'u2'}}` |
| 4 | (u3, AAPL, 5) | `{'AAPL': 3, 'TSLA': 1}` | `{'AAPL': {'u1', 'u3'}, 'TSLA': {'u2'}}` |
| 5 | (u2, GOOG, 8) | `{'AAPL': 3, 'TSLA': 1, 'GOOG': 1}` | `{'AAPL': {'u1', 'u3'}, 'TSLA': {'u2'}, 'GOOG': {'u2'}}` |
After the loop, the dict-comprehension {s: (counts[s], buyers[s]) for s in counts} produces the merged output.
Output:
| symbol | count | distinct_buyers |
|---|---|---|
| AAPL | 3 | {'u1', 'u3'} |
| TSLA | 1 | {'u2'} |
| GOOG | 1 | {'u2'} |
Why this works — concept by concept:
- Hash-table dict counter — `counts[symbol] = counts.get(symbol, 0) + 1` updates per group in `O(1)` average; the `dict.get` default of `0` keeps the new-key branch implicit.
- `defaultdict(set)` factory — `buyers[symbol].add(user_id)` initializes a fresh `set` on first touch; subsequent `.add` calls dedupe at insertion time, so distinct counting is free.
- Single-pass loop — two parallel updates inside one `for` keep total work at `O(N)`; iterating `events` twice would double the work for the same answer.
- Dict-comprehension merge — `{s: (counts[s], buyers[s]) for s in counts}` stitches the two parallel structures into the contract-shape return without an explicit accumulator.
- `O(N)` time / `O(K)` space — `N` events, each touching two hash maps; `K` distinct symbols hold the aggregate state. No sorting, no nested scans.
Inline CTA: Drill the Robinhood Python practice page for the curated hash-table problem and the hash-table Python practice page for breadth.
2. SQL Inner Join and GROUP BY for Top-N Trade Aggregations
Inner-join + group-by + top-N rankings in SQL for Robinhood data engineering
"Given a trades table and a users (or members) table, return the top three cities by completed trade orders" is Robinhood's signature MEDIUM SQL prompt — the same shape as PipeCode's #247 Member Transfer Records and DataLemur's #1 Cities With Completed Trades. The mental model: INNER JOIN the fact table to the dim table on the shared id, filter rows with WHERE status = 'Completed', group by the dim attribute, count or sum, then ORDER BY count DESC LIMIT N for the top-N. The same primitive powers any "rank entities by event volume" pipeline — top countries by signups, top sectors by trade flow, top symbols by buy-side volume.
Pro tip: Filter early with `WHERE` (per-row), not late with `HAVING` (per-group). `WHERE status = 'Completed'` removes cancelled trades before the `GROUP BY`, so the `COUNT(*)` is correct. Doing it after the group with `HAVING` requires an aggregate predicate and is slower; doing it not at all silently inflates the count with cancellations.
INNER JOIN cardinality: trades 1-to-many users
The join-cardinality invariant: INNER JOIN trades ON trades.user_id = users.user_id produces one output row per matching (trade, user) pair. Since each trade has exactly one user but a user can have many trades, the join expands the user table by trade count — the result has one row per trade. INNER JOIN drops trades whose user_id has no match in users (orphans).
- `INNER JOIN` — only matching pairs survive.
- `LEFT JOIN` — keeps trades with no user match (`NULL` city); use when you must report orphans.
- `FULL OUTER JOIN` — also keeps users with zero trades; useful for reconciliation, not for top-N.
- Pick the smallest table on the right — modern planners are smart, but an explicit `LEFT`/`INNER` choice signals intent.
Worked example. Two-row trades, two-row users; INNER JOIN on user_id.
| trade | user | city |
|---|---|---|
| 100101 (u=111) | u=111 | San Francisco |
| 100259 (u=148) | u=148 | Boston |
Worked-example solution.
SELECT t.order_id, u.city
FROM trades t
INNER JOIN users u ON t.user_id = u.user_id;
Rule of thumb: INNER JOIN is the default for "this trade definitely has a user"; reach for LEFT JOIN only when orphans are part of the answer.
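The orphan-dropping behavior is easy to verify in an in-memory SQLite database. A hedged sketch (the orphan trade `100999` and user `555` are invented for illustration; table names mirror the worked example):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE users  (user_id INTEGER, city TEXT);
    CREATE TABLE trades (order_id INTEGER, user_id INTEGER);
    INSERT INTO users  VALUES (111, 'San Francisco'), (148, 'Boston');
    INSERT INTO trades VALUES (100101, 111), (100259, 148),
                              (100999, 555);  -- orphan: no user 555
""")

inner = con.execute("""
    SELECT t.order_id, u.city FROM trades t
    INNER JOIN users u ON t.user_id = u.user_id
""").fetchall()

left = con.execute("""
    SELECT t.order_id, u.city FROM trades t
    LEFT JOIN users u ON t.user_id = u.user_id
""").fetchall()

assert len(inner) == 2           # INNER JOIN drops the orphan trade
assert (100999, None) in left    # LEFT JOIN keeps it with a NULL city
```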
WHERE before GROUP BY: filtering rows vs filtering groups
The filter-order invariant: WHERE runs per-row before grouping; HAVING runs per-group after. WHERE status = 'Completed' strips out cancelled and pending trades so the subsequent GROUP BY city only counts what the question asks about. Putting the same predicate in HAVING requires HAVING SUM(CASE WHEN status='Completed' THEN 1 ELSE 0 END) > 0 — verbose, slower, and a "bad signal" in the round.
- `WHERE` clause — row-level filter; runs first; uses `=`, `<`, `IN`, `LIKE`, `IS NULL`.
- `HAVING` clause — group-level filter; runs after `GROUP BY`; uses `SUM`, `COUNT`, `AVG`, `MAX`.
- Order in the query — `SELECT ... FROM ... JOIN ... WHERE ... GROUP BY ... HAVING ... ORDER BY ... LIMIT`.
- Performance — `WHERE`-then-group reads less data from disk because the grouper sees fewer rows.
Worked example. Filter to status = 'Completed' before grouping by city.
| input rows | after WHERE | grouped city → count |
|---|---|---|
| 7 | 5 (2 dropped) | SF: 3, Boston: 2 |
Worked-example solution.
SELECT u.city, COUNT(*) AS total
FROM trades t
INNER JOIN users u ON t.user_id = u.user_id
WHERE t.status = 'Completed'
GROUP BY u.city;
Rule of thumb: push every row-level predicate into WHERE; reserve HAVING strictly for predicates that need an aggregate.
ORDER BY count DESC + LIMIT for top-N rankings
The top-N invariant: ORDER BY <metric> DESC sorts groups by the metric in descending order; LIMIT N returns only the first N. For ties at the cut, dialects differ — Postgres returns whichever order the planner prefers; Redshift / Snowflake similar. If ties matter, add a deterministic tiebreaker like , u.city ASC.
- `ORDER BY total_orders DESC` — sorts groups by the aliased column.
- `LIMIT 3` — first 3 groups in sort order; standard across MySQL, Postgres, Snowflake, BigQuery (BigQuery uses `LIMIT` too).
- `OFFSET` — pagination; `LIMIT 3 OFFSET 3` returns ranks 4–6.
- Tiebreakers — `, u.city ASC` makes ties deterministic and stable across runs.
Worked example. Sort 4 cities by count and take the top 3.
| city | total | rank |
|---|---|---|
| SF | 3 | 1 |
| Boston | 2 | 2 |
| Denver | 1 | 3 |
| Austin | 1 | (cut) |
Worked-example solution.
SELECT u.city, COUNT(*) AS total
FROM trades t
INNER JOIN users u ON t.user_id = u.user_id
WHERE t.status = 'Completed'
GROUP BY u.city
ORDER BY total DESC
LIMIT 3;
Rule of thumb: always alias the count column (AS total) and use the alias in ORDER BY; mixing literal COUNT(*) and the alias is style noise.
Common beginner mistakes
- Using `COUNT(*)` when only completed trades should count — forgetting `WHERE status = 'Completed'` silently inflates totals.
- `GROUP BY user_id` instead of `GROUP BY city` — confuses the dim level with the fact level; the question asks for cities.
- `LEFT JOIN` when the question asks "completed trades" — orphan trades (no matching user) silently survive with `NULL` city and pollute the rank.
- Forgetting the alias in `ORDER BY` — `ORDER BY COUNT(trades.order_id) DESC` works but reads worse than `ORDER BY total_orders DESC`.
- Returning more than `N` rows by skipping `LIMIT` — graded as a wrong answer even when the top `N` are correct.
SQL Interview Question on Top Cities by Completed Trades
Given the tables trades(order_id, user_id, quantity, status, date, price) and users(user_id, city, email, signup_date), write a query that returns the top 3 cities by number of completed trade orders, in descending order. Output two columns: city and total_orders.
Solution Using INNER JOIN + WHERE + GROUP BY + ORDER BY DESC + LIMIT
```sql
SELECT
    u.city,
    COUNT(t.order_id) AS total_orders
FROM trades t
INNER JOIN users u
    ON t.user_id = u.user_id
WHERE t.status = 'Completed'
GROUP BY u.city
ORDER BY total_orders DESC
LIMIT 3;
```
Why this works: the INNER JOIN pairs each trade with its user (and drops orphan trades with no user match); WHERE status = 'Completed' removes cancelled trades before grouping; GROUP BY city collapses rows to one-per-city with COUNT(t.order_id) as the metric; ORDER BY total_orders DESC ranks cities; LIMIT 3 returns only the top three.
Step-by-step trace for the Robinhood-style sample data:
| order_id | user_id | status | city |
|---|---|---|---|
| 100101 | 111 | Cancelled | San Francisco |
| 100102 | 111 | Completed | San Francisco |
| 100259 | 148 | Completed | Boston |
| 100264 | 148 | Completed | Boston |
| 100305 | 300 | Completed | San Francisco |
| 100400 | 178 | Completed | San Francisco |
| 100565 | 265 | Completed | Denver |
- Inner join — every trade's `user_id` matches a row in `users`; the join produces 7 rows with `(order_id, status, city)` paired up.
- WHERE filter — drop `100101` (Cancelled). 6 rows remain: 3 SF + 2 Boston + 1 Denver.
- Group by city — collapse to three groups: SF = 3, Boston = 2, Denver = 1.
- Order by count desc — SF (3) > Boston (2) > Denver (1).
- Limit 3 — returns all three (no cut needed).
Output:
| city | total_orders |
|---|---|
| San Francisco | 3 |
| Boston | 2 |
| Denver | 1 |
Why this works — concept by concept:
- `INNER JOIN` cardinality — each trade pairs with exactly one user; orphan trades (none in this dataset) would be dropped, which is the right semantic when the question asks about completed-trade-by-city.
- `WHERE` before `GROUP BY` — the row-level filter eliminates cancelled trades before grouping, so `COUNT(*)` reflects only completed orders; doing it after with `HAVING` would still work but reads worse and runs slower.
- `GROUP BY u.city` — collapses the result to one row per city; the only non-aggregate column in `SELECT` must appear here (or be functionally dependent on it).
- `COUNT(t.order_id)` — counts non-null `order_id` values per group; equivalent to `COUNT(*)` here because `order_id` is the trades primary key (never null).
- `ORDER BY total_orders DESC LIMIT 3` — sorts groups by the metric and slices the top three; the alias `total_orders` keeps the `ORDER BY` clean.
- `O((|trades| + |users|) + G log G)` time — `|trades|` rows scanned for the join, `G` groups sorted (`G ≤ |cities|`); `O(G)` space for the group hash table.
Inline CTA: More SQL join problems and the Robinhood SQL practice page for the curated MEDIUM problem.
3. SQL Window Functions and LAG for Daily Volume Change
Window LAG for day-over-day deltas in SQL for Robinhood data engineering
"For each day and each stock, compute the percentage change in trading volume vs the previous day for the same stock" is a Robinhood SQL staple — the same shape that scales up to "reconstruct an account balance from trade events" at the L4/L5 onsite. The mental model: LAG(value) OVER (PARTITION BY group ORDER BY ts) looks one row back inside the same partition and returns the prior row's value; partition keeps each stock's series independent, ORDER BY fixes the row order, and the day-over-day delta is (value − LAG(value)) / LAG(value). The same window primitive powers any "delta vs previous" query — running balance, running total via SUM, rank within a partition via ROW_NUMBER.
Pro tip: The first row in each partition has no predecessor, so `LAG` returns `NULL`. Decide upfront: filter out the `NULL` row with `WHERE prev IS NOT NULL`, or pass through with `COALESCE(prev, value)` to mark "no change." Stating this explicitly is the senior signal interviewers grade.
Window basics: PARTITION BY group + ORDER BY ordering
The window-function invariant: OVER (PARTITION BY <group> ORDER BY <ordering>) declares an independent ordered subset for every value of the group expression; the function (LAG, LEAD, SUM, ROW_NUMBER, …) is evaluated within that subset only. PARTITION BY stock_symbol builds one ordered series per ticker; ORDER BY trade_date fixes the row order inside each.
- `PARTITION BY` is optional — omitted, the whole table is one window; useful for global running totals.
- `ORDER BY` is required for offset functions — `LAG`, `LEAD`, `ROW_NUMBER`, `RANK` all need a defined order.
- Frames — `ROWS UNBOUNDED PRECEDING` / `BETWEEN ... AND ...` further bound the visible rows; the default is fine for `LAG`.
- Multiple windows in one query — different functions can use different `OVER (...)` clauses on the same row.
Worked example. Two (date, symbol, volume) rows for AAPL; LAG(volume) OVER (PARTITION BY symbol ORDER BY date) returns the previous day's volume.
| date | symbol | volume | lag(volume) |
|---|---|---|---|
| 2022-07-01 | AAPL | 1000000 | NULL |
| 2022-07-02 | AAPL | 1500000 | 1000000 |
Worked-example solution.
SELECT date, symbol, volume,
LAG(volume) OVER (PARTITION BY symbol ORDER BY date) AS prev_volume
FROM trading_volume;
Rule of thumb: always set PARTITION BY to the entity that owns the time series (stock_symbol, account_id, user_id); without it, day-1 of one symbol leaks across to compute deltas with a different symbol's day-N.
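The cross-symbol leak is easy to demonstrate with SQLite's window functions (available since SQLite 3.25, bundled with modern Python). A hedged sketch with invented sample rows:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE trading_volume (trade_date TEXT, symbol TEXT, volume INTEGER);
    INSERT INTO trading_volume VALUES
        ('2022-07-01', 'AAPL', 1000000),
        ('2022-07-02', 'AAPL', 1500000),
        ('2022-07-01', 'TSLA', 9000000);
""")

partitioned = con.execute("""
    SELECT symbol, trade_date,
           LAG(volume) OVER (PARTITION BY symbol ORDER BY trade_date) AS prev
    FROM trading_volume ORDER BY symbol, trade_date
""").fetchall()

# With PARTITION BY, each symbol's first day has prev = NULL — no leak.
assert partitioned == [
    ('AAPL', '2022-07-01', None),
    ('AAPL', '2022-07-02', 1000000),
    ('TSLA', '2022-07-01', None),
]

unpartitioned = con.execute("""
    SELECT symbol, trade_date,
           LAG(volume) OVER (ORDER BY trade_date, symbol) AS prev
    FROM trading_volume ORDER BY trade_date, symbol
""").fetchall()

# Without PARTITION BY, TSLA's "previous volume" is AAPL's day-1 number.
assert ('TSLA', '2022-07-01', 1000000) in unpartitioned
```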
LAG vs LEAD: row-N-before vs row-N-after
The offset-function invariant: LAG(expr, n) returns expr from the row n positions before the current row inside the partition; LEAD(expr, n) returns from n positions after. Default n = 1. Both accept a third default argument that replaces NULL at partition boundaries.
- `LAG(volume)` — previous row's `volume` (1 back).
- `LAG(volume, 7)` — 7 rows back; useful for week-over-week deltas on daily data.
- `LEAD(volume)` — next row's `volume` (1 ahead).
- `LAG(volume, 1, volume)` — falls back to the current row's `volume` on the first row, so the % change is `0` instead of `NULL`.
Worked example. Volume on day 2 vs day 1.
| trade_date | volume | LAG(volume) | LEAD(volume) |
|---|---|---|---|
| 2022-07-01 | 1000000 | NULL | 1500000 |
| 2022-07-02 | 1500000 | 1000000 | 1800000 |
| 2022-07-03 | 1800000 | 1500000 | NULL |
Worked-example solution.
SELECT trade_date,
volume,
LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date) AS prev_volume,
LEAD(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date) AS next_volume
FROM trading_volume;
Rule of thumb: "delta vs yesterday" → LAG; "delta vs tomorrow" or "next-event distance" → LEAD. They are mirror images of the same primitive.
NULL on the first row: handling the edge with COALESCE or filtering
The boundary invariant: the first row of each partition has no preceding row, so LAG(expr) returns NULL; arithmetic on NULL propagates to NULL. The two clean strategies are: (a) filter out the NULL row (WHERE prev IS NOT NULL) so the result starts on day 2; (b) replace NULL upstream with LAG(expr, 1, default) or COALESCE(LAG(expr), 0) so day 1 has a defined % change (often 0 or NULL-as-string).
- `WHERE prev IS NOT NULL` — drops the first-day row entirely.
- `LAG(volume, 1, volume)` — first-day `prev_volume` becomes the same row's `volume`; % change becomes `0`.
- `COALESCE(LAG(volume), 0)` — first-day `prev_volume` becomes `0`, which makes the % change blow up (`/ 0`); usually wrong.
- `NULLIF(LAG(volume), 0)` — guards against zero-volume divisions; complementary to the boundary fix.
Worked example. Drop the first-day NULL on a 3-row series.
| trade_date | volume | volume_change_pct |
|---|---|---|
| 2022-07-02 | 1500000 | 50.00 |
| 2022-07-03 | 1800000 | 20.00 |
Worked-example solution.
SELECT *
FROM (
SELECT trade_date,
stock_symbol,
volume,
LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date) AS prev_volume,
(volume - LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date))
/ NULLIF(LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date), 0) * 100
AS volume_change_pct
FROM trading_volume
) sub
WHERE prev_volume IS NOT NULL;
Rule of thumb: state your boundary policy out loud — "I'll drop day 1" or "I'll show day 1 as NULL" — before writing the query; interviewers grade the explicit choice.
Common beginner mistakes
- Forgetting `PARTITION BY stock_symbol` — `LAG` reaches into the previous symbol's series and computes a meaningless delta.
- Forgetting `ORDER BY trade_date` — `LAG` returns whichever row the planner happens to pick; the answer is non-deterministic.
- Dividing by `LAG(volume)` when the prior volume could be `0` — runtime divide-by-zero; use `NULLIF(LAG(volume), 0)`.
- Using `LEAD` when the prompt says "vs yesterday" — answers the wrong direction.
- Hardcoding the offset (`LAG(volume, 1)`) when the spec is "vs 7 days ago" — read the spec and use `LAG(volume, 7)`.
SQL Interview Question on Daily Volume Percentage Change
Given a trading_volume(trade_date, stock_symbol, volume) table with one row per (date, symbol), write a query that returns, for each row from day 2 onwards, the daily percentage change in volume vs the previous day for the same symbol. Output trade_date, stock_symbol, and volume_change_pct (rounded to 2 decimals).
Solution Using LAG window function
```sql
SELECT trade_date,
       stock_symbol,
       ROUND(
         (volume - LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date))
         / NULLIF(LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date), 0) * 100,
         2
       ) AS volume_change_pct
FROM trading_volume
QUALIFY LAG(volume) OVER (PARTITION BY stock_symbol ORDER BY trade_date) IS NOT NULL;
```

(In Postgres, replace `QUALIFY` with a wrapping subquery and `WHERE prev_volume IS NOT NULL` — Snowflake / BigQuery / Databricks support `QUALIFY` directly.)
Why this works: PARTITION BY stock_symbol keeps each ticker's daily series independent; ORDER BY trade_date fixes ordering inside the partition; LAG(volume) reaches one row back to fetch yesterday's volume; the formula (today − yesterday) / yesterday × 100 produces the % change; NULLIF(..., 0) guards against divide-by-zero on zero-volume days; QUALIFY ... IS NOT NULL filters the first-day boundary so output starts on day 2.
Step-by-step trace for AAPL across 5 days:
| trade_date | stock_symbol | volume |
|---|---|---|
| 2022-07-01 | AAPL | 1000000 |
| 2022-07-02 | AAPL | 1500000 |
| 2022-07-03 | AAPL | 1800000 |
| 2022-07-04 | AAPL | 1750000 |
| 2022-07-05 | AAPL | 1800000 |
- Partition — single partition for AAPL; one ordered series of 5 rows by `trade_date` ascending.
- `LAG(volume)` — produces `[NULL, 1000000, 1500000, 1800000, 1750000]` for the 5 rows.
- Delta and division — row 2: `(1500000 - 1000000) / 1000000 = 0.50`; row 3: `(1800000 - 1500000) / 1500000 = 0.20`; row 4: `(1750000 - 1800000) / 1800000 = -0.0278`; row 5: `(1800000 - 1750000) / 1750000 = 0.0286`.
- `× 100` — convert to percent.
- `ROUND(..., 2)` — round to 2 decimals.
- `QUALIFY LAG IS NOT NULL` — drops row 1 (first day, no predecessor).
Output:
| trade_date | stock_symbol | volume_change_pct |
|---|---|---|
| 2022-07-02 | AAPL | 50.00 |
| 2022-07-03 | AAPL | 20.00 |
| 2022-07-04 | AAPL | -2.78 |
| 2022-07-05 | AAPL | 2.86 |
Why this works — concept by concept:
- `PARTITION BY stock_symbol` — declares one independent window per ticker; without it, AAPL's day-1 would compute a delta against TSLA's day-N.
- `ORDER BY trade_date` — fixes the row order inside each window; offset functions need an explicit ordering or the result is non-deterministic.
- `LAG(volume)` — retrieves the previous row's `volume` inside the partition; `NULL` on the first row by design.
- `NULLIF(LAG(volume), 0)` — guards against divide-by-zero on zero-volume days; `NULLIF(x, 0)` returns `NULL` when `x = 0`, propagating cleanly through arithmetic.
- `QUALIFY LAG IS NOT NULL` — drops the first-day row of each partition so the output starts on day 2 with a defined % change; the Postgres equivalent is a subquery with `WHERE prev_volume IS NOT NULL`.
- `O(N log N)` time — the planner sorts each partition once; `O(N)` space for the running window state. For Robinhood-scale daily volumes this is cheap.
Inline CTA: More window-function problems and SQL aggregation problems for the breadth tier.
4. SQL Aggregation and HAVING for Threshold and Notional-Limit Checks
GROUP BY + HAVING for risk-threshold checks in SQL for Robinhood data engineering
"Find every user whose end-of-day option position exceeds the notional limit" is the canonical Robinhood risk / compliance SQL prompt. The mental model: aggregate per user with SUM(contract_count × strike_price × multiplier) to compute notional exposure, then filter the resulting groups with HAVING SUM(...) > limit. HAVING is "WHERE on aggregates" — it filters group rows, not source rows. Same primitive powers any "flag entities whose aggregate metric crosses a policy line" pipeline — flag accounts whose daily trade count exceeds a velocity cap, sectors whose monthly volume exceeds an exposure ceiling, users whose 24-hour withdrawals breach AML thresholds.
Pro tip: Brokerage data is decimal-precision-sensitive. Never store currency or notional as `FLOAT` / `DOUBLE` — the rounding errors compound and a 1¢ drift becomes a regulatory event. Use `NUMERIC(18, 4)` or `DECIMAL(18, 4)` for prices, quantities, and notionals. State this constraint when you write the schema; interviewers grade it.
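The drift is easy to show in Python, whose `decimal.Decimal` plays the role of SQL `NUMERIC`/`DECIMAL`. A minimal sketch (the 1,000 ten-cent fills are invented for illustration):

```python
from decimal import Decimal

# Summing a ten-cent price 1,000 times in binary floating point drifts,
# because 0.10 has no exact binary representation.
float_total = sum([0.10] * 1000)

# Decimal arithmetic is exact for decimal fractions — penny-perfect.
exact_total = sum([Decimal('0.10')] * 1000)

assert float_total != 100.0                # the float sum has drifted
assert exact_total == Decimal('100.00')    # the decimal sum is exact
```

The same failure mode hits a SQL `SUM` over a `FLOAT` column, which is why the schema constraint matters before any query is written.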
WHERE vs HAVING: rows vs groups
The filter-stage invariant: WHERE filters source rows before grouping; HAVING filters group rows after grouping. WHERE cannot reference aggregates (SUM, COUNT, AVG); HAVING can. The execution order is FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT.
- `WHERE status = 'Open'` — row predicate; filters before grouping.
- `HAVING SUM(notional) > 100000` — group predicate; filters after grouping.
- Both in one query — `WHERE` strips closed positions, `HAVING` keeps groups whose remaining sum breaches the limit.
- Avoid duplicating logic — if a predicate doesn't need an aggregate, put it in `WHERE` for performance; if it does, `HAVING` is the only correct place.
Worked example. Filter open positions row-wise, then keep groups whose notional sum > $100K.
| step | rows | groups |
|---|---|---|
| after WHERE | 12 (open) | — |
| after GROUP BY | — | 5 users |
| after HAVING > 100K | — | 2 users |
Worked-example solution.
SELECT user_id, SUM(contract_count * strike_price * 100) AS notional
FROM positions
WHERE status = 'Open'
GROUP BY user_id
HAVING SUM(contract_count * strike_price * 100) > 100000;
Rule of thumb: aggregate predicates always belong in HAVING; row predicates always belong in WHERE. Mixing them up is graded as a conceptual error.
Notional aggregation: SUM(contract_count × strike_price × multiplier)
The notional invariant: for an equity option, notional = contract_count × strike_price × 100 (one US equity option contract represents 100 shares). For futures, the multiplier varies by product (ES = 50, NQ = 20). Always express the multiplier explicitly in the SQL — never bake "× 100" silently into an upstream view.
- Equity options — contract_count × strike_price × 100.
- Index futures — multiplier per contract spec; ES = 50.
- Equities (long) — qty × price.
- Use NUMERIC columns — preserve cents through the multiply chain.
Worked example. Three positions for u2: 200 contracts at $80, all open.
| user_id | contract_count | strike | notional |
|---|---|---|---|
| u2 | 200 | 80 | 200 × 80 × 100 = 1,600,000 |
Worked-example solution.
SELECT user_id, SUM(contract_count * strike_price * 100) AS notional
FROM positions
WHERE status = 'Open'
GROUP BY user_id;
Rule of thumb: spell out the multiplier every time and use a NUMERIC column for strike_price and contract_count; floats here will quietly lose pennies and the audit log will catch you.
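The multiplier bookkeeping can be sketched as a small Python helper with Decimal arithmetic. The product codes and multipliers follow the contract specs named above (equity options = 100, ES = 50, NQ = 20); the function and table names are illustrative, not a real schema:

```python
from decimal import Decimal

# Per-product contract multipliers, spelled out explicitly rather than
# baked silently into an upstream view.
MULTIPLIERS = {
    "EQUITY_OPTION": Decimal(100),  # one US equity option = 100 shares
    "ES": Decimal(50),              # E-mini S&P 500 contract spec
    "NQ": Decimal(20),              # E-mini Nasdaq-100 contract spec
}

def notional(contract_count: int, price: Decimal, product: str) -> Decimal:
    """Notional exposure = contracts x price x product multiplier."""
    return Decimal(contract_count) * price * MULTIPLIERS[product]

# u2's worked example: 200 contracts at an $80 strike.
print(notional(200, Decimal("80"), "EQUITY_OPTION"))  # 1600000
```

Keeping the multiplier in a named table (here a dict, in SQL a reference table or an explicit literal) is what makes the units auditable.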
Combining HAVING with COUNT and SUM thresholds
The composition invariant: HAVING accepts any boolean expression over aggregates and grouping columns. You can combine SUM, COUNT, AVG, MAX, and the grouping keys with AND / OR. Common compound predicates: "more than 10 trades AND notional > $100K", "MAX position older than 7 days", "average ticket size > $5K and at least 5 trades."
- HAVING COUNT(*) > 10 — at least 11 trades in the group.
- HAVING SUM(notional) > 100000 AND COUNT(*) >= 5 — both conditions must hold.
- HAVING AVG(price) > 100 OR MAX(price) > 500 — either condition.
- HAVING MAX(trade_date) < CURRENT_DATE - INTERVAL '7' DAY — group has been quiet for a week.
Worked example. Flag users with ≥5 trades AND notional > $100K.
| user | trades | notional | flagged |
|---|---|---|---|
| u1 | 7 | 60,000 | no (notional fail) |
| u2 | 12 | 1,600,000 | yes |
| u3 | 3 | 200,000 | no (count fail) |
Worked-example solution.
SELECT user_id,
COUNT(*) AS trades,
SUM(contract_count * strike_price * 100) AS notional
FROM positions
WHERE status = 'Open'
GROUP BY user_id
HAVING COUNT(*) >= 5
AND SUM(contract_count * strike_price * 100) > 100000;
Rule of thumb: compose the HAVING predicate exactly as the policy or rule reads; don't pre-aggregate into a CTE just to put WHERE on top — HAVING is the right tool.
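The compound predicate can be verified with a quick sqlite3 harness. The per-user trade counts and totals match the worked-example table (u1: 7 trades / 60,000; u2: 12 / 1,600,000; u3: 3 / 200,000); the row values used to hit those totals are arbitrary filler:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trades (user_id TEXT, notional INTEGER)")

# Filler rows chosen so each user's COUNT(*) and SUM(notional)
# reproduce the worked-example profile exactly.
rows = (
    [("u1", 10_000)] * 6 + [("u1", 0)]          # 7 trades, 60,000 total
    + [("u2", 400_000)] * 4 + [("u2", 0)] * 8   # 12 trades, 1,600,000 total
    + [("u3", 100_000)] * 2 + [("u3", 0)]       # 3 trades, 200,000 total
)
con.executemany("INSERT INTO trades VALUES (?, ?)", rows)

flagged = con.execute("""
    SELECT user_id, COUNT(*) AS trades, SUM(notional) AS notional
    FROM trades
    GROUP BY user_id
    HAVING COUNT(*) >= 5 AND SUM(notional) > 100000
""").fetchall()

print(flagged)  # [('u2', 12, 1600000)] — u1 fails notional, u3 fails count
```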
Common beginner mistakes
- Using WHERE SUM(notional) > 100000 — WHERE cannot reference aggregates; the parser rejects it. Use HAVING.
- Computing notional in a subquery and then re-aggregating outside — works but is the "long way around"; one GROUP BY + HAVING is cleaner.
- Storing strike_price as FLOAT — accumulates rounding error; use NUMERIC(18, 4).
- Forgetting the × 100 multiplier on equity options — the notional comes out 100× too small.
- Including closed positions — forgetting WHERE status = 'Open' inflates notional and produces false positives.
SQL Interview Question on Notional-Limit Threshold Check
Given a positions(user_id, contract_count, strike_price, status) table where status is 'Open' or 'Closed' and each row is one option position (equity-option multiplier = 100), write a query that returns every user_id whose total open notional exceeds $100,000. Output user_id and notional, ordered by notional descending.
Solution Using GROUP BY + HAVING SUM(...) > limit
SELECT
user_id,
SUM(contract_count * strike_price * 100) AS notional
FROM positions
WHERE status = 'Open'
GROUP BY user_id
HAVING SUM(contract_count * strike_price * 100) > 100000
ORDER BY notional DESC;
Why this works: WHERE status = 'Open' strips closed positions before grouping so the aggregate only reflects current exposure; GROUP BY user_id collapses to one row per user; the equity-option multiplier × 100 produces real-dollar notional inside SUM; HAVING filters groups whose summed notional crosses the policy limit; ORDER BY notional DESC ranks the breaches by severity.
Step-by-step trace for the sample:
| user_id | contract_count | strike_price | status |
|---|---|---|---|
| u1 | 50 | 120 | Open |
| u2 | 200 | 80 | Open |
| u3 | 30 | 150 | Open |
| u4 | 80 | 90 | Open |
| u5 | 10 | 100 | Open |
| u4 | 25 | 50 | Closed |
- WHERE filter — drops the closed u4 row; 5 open rows remain.
- Per-row notional — u1: 50×120×100 = 600,000; u2: 200×80×100 = 1,600,000; u3: 30×150×100 = 450,000; u4: 80×90×100 = 720,000; u5: 10×100×100 = 100,000.
- Group by user_id — already one open row per user; the SUM collapses each one to itself.
- HAVING > 100,000 — strict greater-than; u5 at exactly 100,000 does not pass; u1 (600K), u2 (1.6M), u3 (450K), u4 (720K) do.
- Order by notional desc — u2 > u4 > u1 > u3.
Output:
| user_id | notional |
|---|---|
| u2 | 1,600,000 |
| u4 | 720,000 |
| u1 | 600,000 |
| u3 | 450,000 |
Why this works — concept by concept:
- WHERE row filter — strips closed positions before grouping; HAVING cannot do this work because the predicate is per-row, not per-group.
- SUM(contract_count * strike_price * 100) — computes equity-option notional with the × 100 multiplier baked into the aggregate; spelled out so the reviewer can read the units.
- GROUP BY user_id — collapses each user's positions to one row with summed notional; the only non-aggregate column in SELECT matches the GROUP BY.
- HAVING SUM(...) > 100000 — filters group rows; aggregate predicates have to live here. Strict greater-than means 100,000 itself does not breach.
- ORDER BY notional DESC — ranks breaches by severity so the on-call engineer can triage worst-first.
- O(|positions| + G log G) time — one scan for the join-free GROUP BY, then O(G log G) to sort the group output, where G is the number of users with open positions.
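The whole trace can be checked end-to-end with an in-memory sqlite3 harness (a verification sketch, not part of the interview prompt):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE positions "
    "(user_id TEXT, contract_count INTEGER, strike_price INTEGER, status TEXT)"
)
# The sample rows from the step-by-step trace above.
con.executemany("INSERT INTO positions VALUES (?, ?, ?, ?)", [
    ("u1", 50, 120, "Open"), ("u2", 200, 80, "Open"), ("u3", 30, 150, "Open"),
    ("u4", 80, 90, "Open"), ("u5", 10, 100, "Open"), ("u4", 25, 50, "Closed"),
])

result = con.execute("""
    SELECT user_id, SUM(contract_count * strike_price * 100) AS notional
    FROM positions
    WHERE status = 'Open'
    GROUP BY user_id
    HAVING SUM(contract_count * strike_price * 100) > 100000
    ORDER BY notional DESC
""").fetchall()

print(result)
# [('u2', 1600000), ('u4', 720000), ('u1', 600000), ('u3', 450000)]
# u5's exactly-100,000 notional does not breach the strict > limit.
```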
Practice more: SQL HAVING-clause problems and SQL aggregation problems for breadth.
Tips to crack Robinhood data engineering interviews
Bilingual profile — SQL and Python both, equally weighted
Unlike Cisco's Python-only loop, Robinhood splits the coding rounds between SQL and Python. The curated 2-problem Robinhood practice set is 1 EASY Python hash-table + 1 MEDIUM SQL joins — bilingual by design. Spending all your prep time on Python means losing the SQL round; spending it all on SQL means stuttering on the dict-counter Python prompt. Allocate roughly half-half across the SQL practice surface and the Python practice surface.
Drill the four primitives
The four primitives in this guide map directly to the curated Robinhood set plus the two adjacent SQL patterns every Robinhood SQL question list rotates through: hash-table dict counter (Python, EASY — Stock Purchases Count), INNER JOIN + GROUP BY + ORDER DESC + LIMIT (SQL, MEDIUM — Member Transfer Records / cities-completed-trades), LAG window function (SQL — daily volume change), GROUP BY + HAVING (SQL — notional-limit checks). Each maps to a specific module: vanilla dict and collections.defaultdict for the Python primitive, INNER JOIN and GROUP BY for the joins primitive, OVER (PARTITION BY ... ORDER BY ...) for window LAG, HAVING for the aggregate-threshold primitive.
Penny-perfect correctness is the bar
Robinhood is a regulated brokerage; every cent of every account balance is auditable. Float-typed currency, rounding-tolerant joins, and "approximately right" pipelines are downgrade signals in the round. Use NUMERIC(18, 4) for prices and notionals; spell out the equity-option × 100 multiplier; volunteer "I would store this as decimal, never float" when you write the schema. State idempotency requirements when designing pipelines — every job can be re-run without double-counting.
Event-sourced thinking — positions are folded from immutable trade events
Robinhood's distinctive architecture pattern is event sourcing: account balances and positions are not mutable rows you UPDATE; they are derived by folding an immutable log of trade and cash events. CRUD-style designs in interview answers underperform. Frame state as "I would write trades to an append-only Kafka log; the position table is a materialized fold of those events; corrections arrive as new events, not as UPDATE statements." This single framing flip unlocks the senior signal interviewers grade.
T+1 settlement and brokerage glossary as table stakes
Knowing the basics — settlement (T+1 since May 2024 in the US), corporate actions (splits, dividends, mergers), options assignment / exercise / expiration, FIFO / LIFO accounting, wash sales (same-security repurchase within 30 days disallows the loss) — gives a major edge. You don't need a finance degree; you do need to recognize the terms when the interviewer drops them. Spend a week with a brokerage glossary if your background is non-finance.
Easy-Medium discipline matters
The curated Robinhood set is 1 EASY + 1 MEDIUM. Easy at Robinhood doesn't mean trivial — it means the interviewer expects zero hesitation, idiomatic code, and an articulated invariant. A correct EASY answer with stuttering or a missing edge case is graded worse than a correct MEDIUM answer with the same flaw. Drill the easy practice page and the medium practice page until the canonical EASY-tier code rolls off your fingers in under three minutes.
Where to practice on PipeCode
Start with the Robinhood practice page for the curated 2-problem set. After that, drill the matching topic pages: hash table, dictionary, joins, group by, aggregation, window functions, having clause, filtering. The interview courses page bundles structured curricula. For a broader set, browse by topic or pivot to peer fintech with the Airbnb DE interview guide and the top DE interview questions 2026 blog.
Communication and approach under time pressure
Talk through the invariant first ("this is a LAG-on-partitioned-time-series problem"), the brute force second ("a self-join on date - 1 would also work"), and the optimal third ("but LAG is the idiomatic and faster move"). Interviewers grade process as much as the final answer. Leave 5 minutes for an edge-case sweep: empty input, single-row partitions, duplicate trade events, divide-by-zero on prior-day volume, decimal precision on the notional sum. The most common "almost passed" failure mode is correct happy-path code that crashes on edge cases — a 30-second sweep prevents it.
Frequently Asked Questions
What is the Robinhood data engineering interview process like?
The Robinhood data engineering interview opens with a 30-minute recruiter screen, then a 60-minute technical phone screen with one live SQL or Python coding problem, then a 4-round virtual onsite: a system-design round (commonly trade reconciliation, regulatory reporting, or position event-sourcing), a live coding round in the language you didn't see in the phone screen, a data-modeling discussion (often Type 2 SCD on financial dimensions), and a behavioral round. Robinhood interviewers grade integrity and compliance-mindedness heavily; bring a postmortem-style story about a hard call between speed and correctness. End-to-end the loop runs three to four weeks.
What programming languages does Robinhood test in data engineering interviews?
Robinhood data engineering interviews are bilingual — SQL and Python in roughly equal measure across the loop. SQL questions concentrate on joins, aggregations with GROUP BY and HAVING, window functions (LAG, LEAD, SUM OVER), and financial-precision patterns (account balance reconstruction, FIFO P&L, trade reconciliation). Python questions concentrate on hash-table counters, dict-of-sets aggregation, event-stream dedup with TTL, and event sourcing folds. Go and Scala appear at backend-leaning DE roles but are not expected in the coding rounds.
How difficult are Robinhood data engineering interview questions?
The curated Robinhood practice set on PipeCode is 1 easy and 1 medium, no hard. The EASY is a Python hash-table dict-counter problem (Stock Purchases Count); the MEDIUM is a SQL joins + filtering problem (Member Transfer Records / top-3 cities by completed trades). At the onsite, system-design and modeling questions reach L4-L5 level — trade reconciliation pipelines, regulatory CAT reporting, position event-sourcing — but the live coding rounds stay in the EASY-MEDIUM zone for IC2-IC3 hires. Stuttering on the EASY is a stronger negative signal than struggling with the MEDIUM.
How should I prepare for a Robinhood data engineer interview?
Solve the 2 problems on the Robinhood practice page end-to-end — untimed first, then timed at 25 minutes per problem — and broaden to 30 to 50 additional problems across the matching topic pages: hash table and dictionary on the Python side, joins, group-by, aggregation, window functions, and having-clause on the SQL side. Read a brokerage glossary for a week (settlement, corporate actions, options assignment, FIFO accounting, wash sales). Practice articulating idempotency and audit-log requirements when discussing pipeline design — those framings are graded heavily at Robinhood.
What is the Robinhood data engineer salary range?
Robinhood data engineer total compensation runs from roughly $170K (IC2, 2-4 years experience) to $620K (IC5, multi-org technical leadership). Senior Data Engineer (IC3) is the most common external hire at $240K–$370K total comp. Staff (IC4) sits at $330K–$500K. Average base salary across all levels lands around $163K with median $150K; total compensation averages around $221K when RSU refreshers and bonus are included. Negotiation success rates run 10–25% with competing offers per verified levels.fyi data.
What financial-domain knowledge do Robinhood interviewers expect?
Robinhood interviewers expect candidates to recognize and use brokerage-domain terms without flinching: settlement (T+1), corporate actions (splits, dividends, mergers — they require historical-position re-multiplication), options (assignment, exercise, expiration, the × 100 equity-option multiplier), accounting methods (FIFO vs LIFO vs average cost — FIFO is the default for tax lots), wash sales (loss disallowed if same security repurchased within 30 days), regulatory pipelines (CAT — Consolidated Audit Trail — is FINRA-mandated). You don't need a finance background to land the role, but you do need to converse fluently in these terms; a one-week brokerage-glossary sprint is enough.