HackerRank data engineering interview questions screen on a tight, predictable skill matrix: SQL JOINs with foreign-key filters, multi-table aggregation without fan-out, GROUP BY / HAVING on related-row predicates, window functions for streaks and per-day winners, CASE on hierarchies for tree-style classification, scalar subqueries with strict output formatting, and PySpark dataframe transforms the platform now grades alongside SQL. The grader is exact-match on stdout — clever logic with the wrong trailing zero fails the test case.
This guide walks through the seven canonical clusters HackerRank actually tests for data engineers, drawing on its own 7 Advanced SQL Interview Questions set plus first-hand candidate reports for the company's online assessment. Each cluster ships a teaching paragraph, three sub-topic primitives with worked examples, an interview-style problem, a runnable solution, and a step-by-step trace so you can rehearse the four-step "join → filter → group → order" instinct that separates a passing submission from a near-miss.
Top HackerRank data engineering interview topics
From the HackerRank data engineering practice set, the seven numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–7) | HackerRank question | Why it shows up |
|---|---|---|---|
| 1 | SQL JOINs and foreign-key filtering | Tenured Employees | INNER JOIN on ID, WHERE on TIME < 3, ORDER BY a column not in SELECT. |
| 2 | Multi-table aggregation | New Companies | Five-level hierarchy → one per-company row using COUNT(DISTINCT) against duplicates. |
| 3 | GROUP BY + HAVING on related-row predicates | Top Competitors | WHERE submission.score = difficulty.score → GROUP BY hacker → HAVING COUNT(*) > 1. |
| 4 | Window functions for streaks and winners | 15 Days of Learning SQL | ROW_NUMBER per day plus a "submitted every day so far" cumulative-presence test. |
| 5 | CASE expressions on hierarchies | Binary Tree Nodes | Three-branch CASE WHEN P IS NULL ... EXISTS(children) ... in strict order. |
| 6 | Scalar subqueries + strict output formatting | Weather Observation Station 18 | WHERE x = (SELECT MAX(x) ...) plus FORMAT(ROUND(x, 4), '0.0000') for the grader. |
| 7 | PySpark dataframe transforms | Real-IDE coding (Glassdoor-confirmed) | SQL → PySpark idioms — anti-join, schema, Window.partitionBy. |
Output-format rule: HackerRank grades on stdout exact match. A correct query that prints `0` instead of `0.00`, or `.00` instead of `0.00`, fails the test case. Coerce trailing zeros with `FORMAT`, NULLs with `COALESCE`, and date strings with explicit `DATE_FORMAT`/`to_char` — every time, not just when it feels needed.
1. SQL JOINs and the Tenured Employees Pattern
Foreign-key joins with multi-key sorts in SQL for data engineering
The Tenured Employees question is the cleanest "junior data engineer" filter on the HackerRank platform: two tables, one foreign key, one filter predicate, a two-key sort. The mental model is join → filter → order in that order. The invariant: the join brings columns together, WHERE narrows rows, and ORDER BY sequences output — the sort key does not need to live in the SELECT list.
Pro tip: When the prompt says "sort by NAME, then ID, but output only UIN and NAME," resist the urge to drop ID from the query. SQL's `ORDER BY` accepts any column visible after the FROM clause, regardless of whether it appears in the projection.
INNER JOIN ... ON fk = pk
INNER JOIN keeps only rows that have a match on both sides. For two tables with a true 1:1 foreign key, INNER and LEFT JOIN produce identical row counts; for guaranteed-match foreign keys, INNER is the default because it documents the intent and is faster to plan.
- Join key: the column that identifies the same entity on both sides.
- Cardinality check: if either side has duplicates on the key, the joined row count multiplies — fan-out.
- Worked example: `EMPLOYEE.ID = EMPLOYEE_UIN.ID` is 1:1, so the joined row count equals the smaller side.
SELECT u.UIN, e.NAME
FROM EMPLOYEE e
INNER JOIN EMPLOYEE_UIN u ON u.ID = e.ID;
WHERE filter on the joined columns
The filter TIME < 3 narrows the joined set. It can sit either in the join's ON clause or in the outer WHERE — for INNER JOIN they are equivalent, for LEFT JOIN they are not. Default to WHERE for filters that apply after the join.
- `ON` predicate: part of the join algebra; for LEFT JOIN it preserves unmatched left rows.
- `WHERE` predicate: post-join filter; rejects rows regardless of join type.
- Worked example: `TIME < 3` keeps Sherrie (1), Mary (2), and Dave (0). Paul (7) and Sam (3) are excluded.
SELECT u.UIN, e.NAME
FROM EMPLOYEE e
INNER JOIN EMPLOYEE_UIN u ON u.ID = e.ID
WHERE e.TIME < 3;
ORDER BY name ASC, id ASC (sort key not in SELECT)
ORDER BY accepts any FROM-clause column, not just the projected ones. The two-key sort name ASC, id ASC resolves alphabetic ties deterministically — without the secondary key, two employees with the same name produce non-deterministic row order, which the HackerRank grader treats as failing.
- Primary key: the column that drives most of the sort.
- Secondary key: breaks ties between equal primary values.
- Worked example: Dave, Mary, Sherrie are already alphabetically distinct, so the secondary `id` only matters when a name repeats.
SELECT u.UIN, e.NAME
FROM EMPLOYEE e
INNER JOIN EMPLOYEE_UIN u ON u.ID = e.ID
WHERE e.TIME < 3
ORDER BY e.NAME ASC, e.ID ASC;
Common beginner mistakes
- Putting `ORDER BY` columns in `SELECT` "to be safe" — pollutes the output.
- Using `LEFT JOIN` when the foreign key is guaranteed — clutters the plan, hides intent.
- Forgetting the secondary sort key — non-deterministic output fails exact-match grading.
- Sorting by `id` numerically when the schema makes it a string — `C_1, C_10, C_2` versus `C_1, C_2, C_10`.
SQL interview question on tenured employees
Two tables: EMPLOYEE(ID INT, NAME VARCHAR, TIME INT, ADDRESS VARCHAR, SALARY INT) and EMPLOYEE_UIN(ID INT, UIN VARCHAR). Output UIN and NAME for employees with TIME < 3 years, ordered by NAME ASC then ID ASC. ID is part of the sort but not the output.
Solution using INNER JOIN with a multi-key ORDER BY
SELECT u.UIN, e.NAME
FROM EMPLOYEE e
INNER JOIN EMPLOYEE_UIN u ON u.ID = e.ID
WHERE e.TIME < 3
ORDER BY e.NAME ASC, e.ID ASC;
Step-by-step trace (input: 5-row EMPLOYEE table):
| ID | NAME | TIME |
|---|---|---|
| 1 | Sherrie | 1 |
| 2 | Paul | 7 |
| 3 | Mary | 2 |
| 4 | Sam | 3 |
| 5 | Dave | 0 |
- Join `EMPLOYEE` to `EMPLOYEE_UIN` on `ID`. Five matched rows because the foreign key is 1:1.
- Apply `WHERE e.TIME < 3`. Paul (7) and Sam (3) drop out; Sherrie (1), Mary (2), and Dave (0) survive.
- Apply `ORDER BY e.NAME ASC, e.ID ASC`. Alphabetic order: Dave, Mary, Sherrie.
- Project `UIN, NAME` for each surviving row.
Output:
| UIN | NAME |
|---|---|
| 63868-453 | Dave |
| 63550-194 | Mary |
| 57520-0440 | Sherrie |
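The whole pipeline can be replayed in an in-memory SQLite database (a sketch, not the grader's MySQL, though this syntax is shared by both). The UINs for the matched employees come from the output table above; Paul's and Sam's are hypothetical filler.

```python
import sqlite3

# Recreate the trace tables in memory; UINs for Paul and Sam are made up.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE EMPLOYEE (ID INT, NAME TEXT, TIME INT);
CREATE TABLE EMPLOYEE_UIN (ID INT, UIN TEXT);
INSERT INTO EMPLOYEE VALUES
  (1,'Sherrie',1),(2,'Paul',7),(3,'Mary',2),(4,'Sam',3),(5,'Dave',0);
INSERT INTO EMPLOYEE_UIN VALUES
  (1,'57520-0440'),(2,'11111-111'),(3,'63550-194'),(4,'22222-222'),(5,'63868-453');
""")
rows = con.execute("""
SELECT u.UIN, e.NAME
FROM EMPLOYEE e
INNER JOIN EMPLOYEE_UIN u ON u.ID = e.ID
WHERE e.TIME < 3
ORDER BY e.NAME ASC, e.ID ASC
""").fetchall()
print(rows)  # [('63868-453', 'Dave'), ('63550-194', 'Mary'), ('57520-0440', 'Sherrie')]
```

The join, filter, sort, and projection land exactly the three rows and the order the grader expects.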
Why this works — concept by concept:
- INNER JOIN on `ID` — guaranteed-match foreign key; documents the intent that every employee has a UIN.
- `WHERE` filter post-join — narrows by tenure without affecting the join algebra.
- Multi-key `ORDER BY` — primary sort on `NAME` for human readability, secondary on `ID` for deterministic tie-breaking.
- Sort key not in `SELECT` — SQL allows it; the grader expects exactly the two output columns.
- Cost — one hash-join scan plus one sort: `O((N + M) + N log N)`.
2. Multi-Table Aggregation with the New Companies Hierarchy
Multi-level hierarchies and COUNT(DISTINCT) for data engineering
The New Companies question forces the question every analytics SQL writer eventually has to answer: how do I roll a five-level hierarchy into one row per company without double-counting? The mental model: aggregate each level independently, then join the per-level counts on the parent key. The invariant: a single JOIN chain across all five tables fans out the row count and breaks every aggregate downstream.
Pro tip: When the prompt says "the tables may contain duplicate records," `COUNT(*)` is wrong by default — switch to `COUNT(DISTINCT child_id)`. Read this line in every HackerRank schema before you write the first SELECT.
Pre-aggregate per level, then join
When five tables share a company_code key, aggregate each table independently and join the scalar counts on the parent. This avoids the cross-product of joining all five at once.
- Per-level aggregate: `SELECT company_code, COUNT(DISTINCT id) FROM <level> GROUP BY company_code`.
- Final JOIN: glue the per-level counts on `company_code`.
- Worked example: company `C1` with 2 lead managers, 4 senior managers, 8 managers, 16 employees → one final row.
SELECT
c.company_code, c.founder,
(SELECT COUNT(DISTINCT lm.lead_manager_code) FROM Lead_Manager lm WHERE lm.company_code = c.company_code) AS n_lead,
(SELECT COUNT(DISTINCT sm.senior_manager_code) FROM Senior_Manager sm WHERE sm.company_code = c.company_code) AS n_senior,
(SELECT COUNT(DISTINCT m.manager_code) FROM Manager m WHERE m.company_code = c.company_code) AS n_manager,
(SELECT COUNT(DISTINCT e.employee_code) FROM Employee e WHERE e.company_code = c.company_code) AS n_employee
FROM Company c
ORDER BY c.company_code;
COUNT(DISTINCT child_id) GROUP BY parent_id
When a single table has duplicate rows on the child id (e.g. an employee appears twice because of a join in the source), COUNT(*) returns the row count, not the entity count. COUNT(DISTINCT child_id) returns the entity count — the right answer when the prompt says "how many".
- `COUNT(*)`: counts rows.
- `COUNT(DISTINCT id)`: counts unique entities.
- Worked example: three rows for the same `employee_code` → `COUNT(*) = 3`, `COUNT(DISTINCT employee_code) = 1`.
String-sort vs numeric-sort on company_code
ORDER BY company_code ASC on a VARCHAR column sorts lexicographically: C_1, C_10, C_2. To sort numerically you would need CAST(SUBSTRING(company_code, 3) AS UNSIGNED). HackerRank's grader for this problem expects the string sort, so the natural ASC sort is correct as long as the column is VARCHAR.
- Lex sort: alphabetic, character by character.
- Numeric sort: parse the suffix and cast.
- Worked example: `C_1, C_10, C_2` is the expected order; `C_1, C_2, C_10` is wrong here.
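A quick plain-Python illustration of the two orderings (string comparison stands in for the VARCHAR collation):

```python
# Lexicographic vs numeric ordering of VARCHAR-style company codes.
codes = ['C_2', 'C_10', 'C_1']

lex = sorted(codes)                                # what ORDER BY does on VARCHAR
numeric = sorted(codes, key=lambda c: int(c[2:]))  # what CAST(SUBSTRING(code, 3) AS UNSIGNED) would do

print(lex)      # ['C_1', 'C_10', 'C_2']
print(numeric)  # ['C_1', 'C_2', 'C_10']
```

Character-by-character comparison puts `C_10` before `C_2` because `'1' < '2'`; the numeric key parses the suffix first.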
Common beginner mistakes
- Joining all five tables at once and using `COUNT(DISTINCT id)` per level — works but plans badly and is hard to read.
- Using `COUNT(*)` after a JOIN that fans out — every level gets multiplied by its descendants.
- Casting `company_code` to a number before sorting — defeats the prompt.
SQL interview question on the New Companies hierarchy
Tables Company(company_code, founder), Lead_Manager(lead_manager_code, company_code), Senior_Manager, Manager, Employee — same column shape down the chain. Output company_code, founder, n_lead, n_senior, n_manager, n_employee per company, ordered by company_code ascending (string sort).
Solution using scalar subqueries per level
SELECT
c.company_code, c.founder,
(SELECT COUNT(DISTINCT lm.lead_manager_code) FROM Lead_Manager lm WHERE lm.company_code = c.company_code) AS n_lead,
(SELECT COUNT(DISTINCT sm.senior_manager_code) FROM Senior_Manager sm WHERE sm.company_code = c.company_code) AS n_senior,
(SELECT COUNT(DISTINCT m.manager_code) FROM Manager m WHERE m.company_code = c.company_code) AS n_manager,
(SELECT COUNT(DISTINCT e.employee_code) FROM Employee e WHERE e.company_code = c.company_code) AS n_employee
FROM Company c
ORDER BY c.company_code;
Step-by-step trace (input: 1 company C1 with founder Monika, 1 lead, 2 seniors, 4 managers, 8 employees):
| company_code | founder | n_lead | n_senior | n_manager | n_employee |
|---|---|---|---|---|---|
| C1 | Monika | 1 | 2 | 4 | 8 |
- Outer scan of `Company` — yields one row per company.
- Per-row, run each scalar subquery — `Lead_Manager` filtered by `company_code = c.company_code` and `COUNT(DISTINCT)`-aggregated. Same for `Senior_Manager`, `Manager`, `Employee`.
- Project the results — six columns per output row.
- `ORDER BY c.company_code` — string sort: `C_1`, `C_10`, `C_2` (when there are multiple companies).
Output:
| company_code | founder | n_lead | n_senior | n_manager | n_employee |
|---|---|---|---|---|---|
| C1 | Monika | 1 | 2 | 4 | 8 |
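The defensive value of `COUNT(DISTINCT)` is easy to demonstrate in SQLite with the trace company, trimmed to two of the four child levels and with one Employee row deliberately duplicated (the level codes `LM1`, `E1`, `E2` are made-up identifiers):

```python
import sqlite3

# One company, one lead manager, two employees -- with E1 inserted twice.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Company (company_code TEXT, founder TEXT);
CREATE TABLE Lead_Manager (lead_manager_code TEXT, company_code TEXT);
CREATE TABLE Employee (employee_code TEXT, company_code TEXT);
INSERT INTO Company VALUES ('C1', 'Monika');
INSERT INTO Lead_Manager VALUES ('LM1', 'C1');
INSERT INTO Employee VALUES ('E1','C1'),('E2','C1'),('E1','C1');  -- duplicate E1
""")
row = con.execute("""
SELECT c.company_code, c.founder,
  (SELECT COUNT(DISTINCT lm.lead_manager_code) FROM Lead_Manager lm
    WHERE lm.company_code = c.company_code) AS n_lead,
  (SELECT COUNT(DISTINCT e.employee_code) FROM Employee e
    WHERE e.company_code = c.company_code) AS n_employee
FROM Company c
ORDER BY c.company_code
""").fetchone()
print(row)  # ('C1', 'Monika', 1, 2) -- the duplicate E1 row is counted once
```

Swap `COUNT(DISTINCT e.employee_code)` for `COUNT(*)` and the employee count becomes 3, which is exactly the bug the prompt's duplicate-records warning is about.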
Why this works — concept by concept:
- Scalar subquery per level — each subquery returns a single integer; no fan-out across levels.
- `COUNT(DISTINCT)` — defends against duplicates the prompt warns about.
- Outer-row anchor — the `Company` row drives the subqueries, so per-company filtering is automatic.
- String `ORDER BY` — `VARCHAR` codes sort lexicographically; matches the prompt.
- Cost — one outer scan plus four indexed lookups per company: `O(C × (1 + index_seek_per_level))`.
3. GROUP BY, HAVING, and the Top Competitors Pattern
Related-row predicates with GROUP BY for SQL data engineering
Top Competitors is the canonical "who scored full marks more than once" query, and it teaches two compound skills: filtering on a related row's column (a submission's score equals its difficulty's max score) and filtering on an aggregate (more than one such submission). The mental model: JOIN → WHERE related-equality → GROUP BY entity → HAVING aggregate-predicate → ORDER BY.
Pro tip: `HAVING` is the only place an aggregate can appear in a predicate. If you find yourself writing `WHERE COUNT(*) > 1`, the grader rejects the query — `WHERE` runs before aggregation, `HAVING` runs after.
WHERE submission.score = difficulty.score
The "full score" predicate isn't a constant comparison — it's row-level equality between a submission's score and the maximum score declared in the matching difficulty row. Use the JOIN to bring the related row alongside, then compare both columns.
- Joined predicate: `s.score = d.score` only makes sense after `JOIN Difficulty d ON c.difficulty_level = d.difficulty_level`.
- Worked example: difficulty 6 → max score 100. A submission with `score = 100` qualifies; `99` doesn't.
WHERE s.score = d.score
GROUP BY hacker, COUNT(*), HAVING COUNT(*) > 1
Once you've filtered to full-score rows, group by the entity that owns the achievement (the hacker) and count. HAVING COUNT(*) > 1 keeps only hackers with more than one full-score row.
- `GROUP BY h.hacker_id, h.name`: name appears in SELECT, so it must appear in GROUP BY.
- `HAVING COUNT(*) > 1`: strict greater than — "more than one" excludes hackers with exactly one.
- Worked example: Joe scored full marks twice → `COUNT(*) = 2`, qualifies.
Two-key sort: ORDER BY COUNT(*) DESC, hacker_id ASC
ORDER BY accepts aggregates directly — no need to wrap the count in a subquery alias. Primary sort on COUNT(*) descending, secondary on hacker_id ascending so ties resolve to the lower id.
- `COUNT(*) DESC`: most-prolific hackers first.
- `hacker_id ASC`: lowest id wins on ties.
- Worked example: two hackers tied at 3 full-score submissions → the lower `hacker_id` appears first.
Common beginner mistakes
- Putting the aggregate predicate in `WHERE` — an error in every engine, because `WHERE` runs before aggregation.
- Forgetting `name` in `GROUP BY` after selecting it — strict-mode engines reject the query.
- Sorting by the count column through its alias — works in MySQL, fails in stricter engines.
SQL interview question on Top Competitors
Tables Hackers(hacker_id, name), Difficulty(difficulty_level, score), Challenges(challenge_id, hacker_id, difficulty_level), Submissions(submission_id, hacker_id, challenge_id, score). Print hacker_id, name for hackers who achieved full marks on more than one challenge. Order by full-score count descending, then hacker_id ascending.
Solution using INNER JOIN with HAVING
SELECT h.hacker_id, h.name
FROM Submissions s
INNER JOIN Hackers h ON s.hacker_id = h.hacker_id
INNER JOIN Challenges c ON s.challenge_id = c.challenge_id
INNER JOIN Difficulty d ON c.difficulty_level = d.difficulty_level
WHERE s.score = d.score
GROUP BY h.hacker_id, h.name
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC, h.hacker_id ASC;
Step-by-step trace (input: 4 submissions, 2 challenges, 2 hackers):
| hacker | challenge | difficulty | sub_score | max_score |
|---|---|---|---|---|
| 86870 | 71055 | 2 | 30 | 30 |
| 90411 | 71055 | 2 | 30 | 30 |
| 90411 | 66730 | 6 | 100 | 100 |
| 86870 | 66730 | 6 | 50 | 100 |
- Four-way JOIN — bring `Hackers`, `Challenges`, `Difficulty` alongside each `Submissions` row.
- `WHERE s.score = d.score` — keeps rows 1, 2, 3. Row 4 (hacker 86870 scored 50 on a max-100 challenge) drops.
- `GROUP BY h.hacker_id, h.name` — collapses the 3 surviving rows into per-hacker groups: `86870 → 1 row`, `90411 → 2 rows`.
- `HAVING COUNT(*) > 1` — only `90411` survives (2 > 1).
- `ORDER BY COUNT(*) DESC, h.hacker_id ASC` — single row, no tie to break.
Output:
| hacker_id | name |
|---|---|
| 90411 | Joe |
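The full query can be replayed on the trace rows in SQLite. Only Joe's name comes from the trace; the name for hacker 86870 and the challenge-setter columns are hypothetical filler, and neither affects the result.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Hackers (hacker_id INT, name TEXT);
CREATE TABLE Difficulty (difficulty_level INT, score INT);
CREATE TABLE Challenges (challenge_id INT, hacker_id INT, difficulty_level INT);
CREATE TABLE Submissions (submission_id INT, hacker_id INT, challenge_id INT, score INT);
INSERT INTO Hackers VALUES (86870,'Todd'),(90411,'Joe');   -- 'Todd' is filler
INSERT INTO Difficulty VALUES (2,30),(6,100);
INSERT INTO Challenges VALUES (71055,86870,2),(66730,90411,6);
INSERT INTO Submissions VALUES
  (1,86870,71055,30),(2,90411,71055,30),(3,90411,66730,100),(4,86870,66730,50);
""")
rows = con.execute("""
SELECT h.hacker_id, h.name
FROM Submissions s
JOIN Hackers h ON s.hacker_id = h.hacker_id
JOIN Challenges c ON s.challenge_id = c.challenge_id
JOIN Difficulty d ON c.difficulty_level = d.difficulty_level
WHERE s.score = d.score
GROUP BY h.hacker_id, h.name
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC, h.hacker_id ASC
""").fetchall()
print(rows)  # [(90411, 'Joe')]
```

Hacker 86870 has exactly one full-score row, so `HAVING COUNT(*) > 1` removes them; only Joe survives.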
Why this works — concept by concept:
- Four-way JOIN — chains submission → hacker, submission → challenge, challenge → difficulty so the related-row equality has both sides on the same row.
- Related-row equality — `s.score = d.score` compares each submission to the max possible score for its challenge's difficulty, not a constant.
- GROUP BY + HAVING — `HAVING COUNT(*) > 1` is the "more than one" qualifier; it can only sit in HAVING because it references an aggregate.
- Aggregate in `ORDER BY` — `COUNT(*) DESC` ranks by prolific hackers; `hacker_id ASC` resolves ties deterministically.
- Cost — three hash-join passes plus one aggregate plus one sort: roughly `O(N log N)` over the submission row count.
4. SQL Window Functions and the 15 Days of Learning Pattern
Window functions for streaks and per-day winners in SQL
15 Days of Learning SQL is a stamina test for window functions: per day, return the count of hackers whose streak survives to that day plus the per-day submission winner with a deterministic tie-break. The mental model: window functions partition by day to rank per-day; a self-join with a streak predicate counts surviving hackers. The invariant: a single window expression cannot capture the "every day so far" condition — that needs a join or correlated subquery.
Pro tip: `ROW_NUMBER()` and `RANK()` look interchangeable on small inputs but diverge on ties. For "per-day winner with lowest hacker_id on ties," use `ROW_NUMBER() OVER (PARTITION BY day ORDER BY count DESC, hacker_id ASC)` and pick row 1 — `RANK()` would emit two row-1s.
ROW_NUMBER() OVER (PARTITION BY day ORDER BY ...)
ROW_NUMBER() emits a strictly-increasing index within each partition, broken by the ORDER BY clause. Pick WHERE rn = 1 to get one row per partition.
- PARTITION BY: scope of the window — one numbering per partition.
- ORDER BY in the window: decides who gets row 1.
- Worked example: partition by `submission_date`, order by `daily_count DESC, hacker_id ASC` → row 1 per day is the daily winner.
Tie-break: secondary key in ORDER BY
When two hackers tie on submissions, the prompt says "lowest hacker_id wins". The secondary ORDER BY key encodes that rule directly into the window.
-
Primary:
daily_count DESC— most-active first. -
Secondary:
hacker_id ASC— lowest id wins ties. -
Worked example: Ada (5 subs, id 1) and Bo (5 subs, id 2) tie → Ada wins because
id 1 < id 2.
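The divergence is easy to see on a day-2-style tie, run in SQLite (which ships window functions as of version 3.25, bundled with modern Python). Daily counts are pre-aggregated into a small table here rather than computed from raw submissions.

```python
import sqlite3

# Ada (id 1) and Bo (id 2) tie at 5 submissions; Di (id 4) has 2.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily (submission_date TEXT, hacker_id INT, cnt INT)")
con.executemany("INSERT INTO daily VALUES (?,?,?)",
                [('2016-03-02', 1, 5), ('2016-03-02', 2, 5), ('2016-03-02', 4, 2)])
rows = con.execute("""
SELECT hacker_id,
       ROW_NUMBER() OVER (PARTITION BY submission_date
                          ORDER BY cnt DESC, hacker_id ASC) AS rn,
       RANK()       OVER (PARTITION BY submission_date
                          ORDER BY cnt DESC) AS rk
FROM daily
ORDER BY rn
""").fetchall()
print(rows)  # [(1, 1, 1), (2, 2, 1), (4, 3, 3)]
```

`RANK()` hands both tied hackers rank 1; `ROW_NUMBER()` with the `hacker_id ASC` tie-break gives exactly one row 1 (Ada), which is what `WHERE rn = 1` needs.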
"Every day so far" via self-join + COUNT predicate
A window function can compute per-day counts but cannot answer "did this hacker submit on every day from day 1 to day N?" That requires a self-join: for each (hacker, day) pair, count distinct submission days <= current and require equality with the day index.
- Join condition: `submissions s2 ON s2.hacker_id = h.hacker_id AND s2.submission_date <= s1.submission_date`.
- Predicate: `COUNT(DISTINCT s2.submission_date) = day_index_of(s1.submission_date)`.
HAVING COUNT(DISTINCT s2.submission_date) = DATEDIFF(s1.submission_date, '2016-03-01') + 1
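The cumulative-presence predicate can be sketched in plain Python over the 3-day mini contest traced later in this section: a hacker survives to day d exactly when their distinct submission days up to d equal d.

```python
# hacker_id -> set of 1-based day indexes with at least one submission
# (the 3-day mini contest: Ada=1, Bo=2, Cy=3, Di=4).
subs = {
    1: {1, 2, 3},   # Ada submits every day
    2: {1, 2, 3},   # Bo submits every day
    3: {1},         # Cy skips days 2 and 3
    4: {1, 2},      # Di skips day 3
}

def streak_hackers(day):
    """Hackers whose distinct submission days up to `day` equal the day index."""
    return sorted(h for h, days in subs.items()
                  if len({d for d in days if d <= day}) == day)

print(streak_hackers(1))  # [1, 2, 3, 4]
print(streak_hackers(2))  # [1, 2, 4]
print(streak_hackers(3))  # [1, 2]
```

This is the set-equality test the SQL encodes with `COUNT(DISTINCT submission_date)` against `DATEDIFF(...) + 1`: counts of 4, 3, and 2 on the three days, matching the trace.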
Common beginner mistakes
- Using `RANK()` instead of `ROW_NUMBER()` — ties produce two rank-1 rows; the grader fails on the duplicate.
- Computing daily counts with `COUNT(*) OVER (...)` only — misses the cumulative-presence test.
- Filtering with `submission_date <= current` but forgetting `DISTINCT submission_date` — repeated submissions on the same day inflate the streak count.
SQL interview question on 15 Days of Learning
Tables Hackers(hacker_id, name), Submissions(submission_date, submission_id, hacker_id, score). For each day from 2016-03-01 to 2016-03-15, print submission_date, total unique hackers in the cumulative streak, and (hacker_id, name) of that day's leader (most submissions; ties → lowest hacker_id).
Solution using ROW_NUMBER and a self-join streak predicate
SELECT
    s1.submission_date,
    (SELECT COUNT(DISTINCT s2.hacker_id)
       FROM Submissions s2
      WHERE s2.submission_date <= s1.submission_date
        AND (SELECT COUNT(DISTINCT s3.submission_date)
               FROM Submissions s3
              WHERE s3.hacker_id = s2.hacker_id
                AND s3.submission_date <= s1.submission_date)
            = DATEDIFF(s1.submission_date, '2016-03-01') + 1) AS streak_hackers,
    winner.hacker_id,
    winner.name
FROM (SELECT DISTINCT submission_date FROM Submissions) s1
JOIN LATERAL (
    SELECT h.hacker_id, h.name
    FROM Submissions w
    JOIN Hackers h ON h.hacker_id = w.hacker_id
    WHERE w.submission_date = s1.submission_date
    GROUP BY h.hacker_id, h.name
    ORDER BY COUNT(*) DESC, h.hacker_id ASC
    LIMIT 1
) winner ON TRUE
ORDER BY s1.submission_date;
Step-by-step trace (3-day mini contest, 4 hackers Ada=1, Bo=2, Cy=3, Di=4):
- Day 1 — all 4 submit. Daily counts: Ada 3, Bo 2, Cy 1, Di 1. Winner: Ada.
- Day 2 — Ada 5, Bo 5, Di 2; Cy skips. The daily count ties Ada with Bo → the secondary `hacker_id ASC` key picks Ada (id 1). Streak count: only Ada, Bo, and Di submitted on both days → 3.
- Day 3 — Bo 4, Ada 1; Di skips. Winner: Bo. Streak count: Ada and Bo submitted all 3 days → 2.
Output:
| submission_date | streak_hackers | hacker_id | name |
|---|---|---|---|
| 2016-03-01 | 4 | 1 | Ada |
| 2016-03-02 | 3 | 1 | Ada |
| 2016-03-03 | 2 | 2 | Bo |
Why this works — concept by concept:
- Per-day winner — `ORDER BY COUNT(*) DESC, hacker_id ASC ... LIMIT 1` inside the lateral subquery plays the `ROW_NUMBER() OVER (PARTITION BY submission_date ...)` role: one deterministic winner per day.
- `LATERAL` join — re-runs the per-day winner subquery in correlation with the outer day, returning a single row each time.
- Streak subquery — counts distinct hackers whose distinct submission days up to the current day equal the day index from the contest start.
- Tie-break — the secondary `hacker_id ASC` key resolves Ada vs Bo on day 2 in Ada's favor.
- Cost — one scan plus a per-day lateral subquery; on a 15-day contest with N submissions, roughly `O(15 × N)` — bounded.
5. CASE Expressions on Hierarchies (Binary Tree Nodes)
Three-branch CASE on parent-child columns for SQL
Binary Tree Nodes is a small but exacting CASE problem: classify each node as Root, Inner, or Leaf using only the parent column. The mental model: Root → P IS NULL; Inner → has children; Leaf → otherwise. The invariant: CASE branch order is load-bearing; put a catch-all test ahead of a more specific one and nodes are silently mislabeled, with no error to warn you.
Pro tip: `CASE` short-circuits top-to-bottom. List the most specific predicate first (`P IS NULL` for Root), the least specific last (Leaf as the `ELSE` branch). Reordering produces wrong labels with no runtime error.
CASE WHEN P IS NULL THEN 'Root' (anchor)
The root node has no parent — P is NULL. Test it first because every other branch assumes P IS NOT NULL.
- `P IS NULL`: the unique root.
- Always the first branch: other branches presume the row has a parent.
- Worked example: node `(N=2, P=NULL)` → Root.
EXISTS (SELECT 1 FROM BST b2 WHERE b2.P = b1.N) for "has children"
Inner nodes have children; leaves don't. The "has children" test is EXISTS against a self-join — efficient and standard.
-
Self-EXISTS: a row in the same table whose
Pmatches the currentN. -
Worked example: node
(N=5, P=2)with two children(N=3, P=5)and(N=6, P=5)→ Inner.
EXISTS (SELECT 1 FROM BST b2 WHERE b2.P = b1.N)
Three-branch CASE in strict order: Root → Inner → Leaf
The order matters. Root first, Inner second, Leaf as ELSE. Reversing Inner and Leaf would label every non-root node as Leaf.
- Branch 1: Root.
- Branch 2: Inner (via EXISTS).
- Branch 3 (ELSE): Leaf.
- Worked example: strict order labels the same 5-node tree consistently.
Common beginner mistakes
- Writing `P = NULL` instead of `P IS NULL` — `NULL = NULL` evaluates to unknown, not true.
- Putting `Leaf` before `Inner` — every Inner gets mislabeled.
- Forgetting to alias the outer table — the EXISTS correlation needs a name to reference.
SQL interview question on Binary Tree Nodes
Table BST(N INT, P INT). Classify each node as Root (no parent), Inner (has children), or Leaf (no children). Output N and the label, ordered by N ascending.
Solution using a three-branch CASE with EXISTS
SELECT
b1.N,
CASE
WHEN b1.P IS NULL THEN 'Root'
WHEN EXISTS (SELECT 1 FROM BST b2 WHERE b2.P = b1.N) THEN 'Inner'
ELSE 'Leaf'
END AS node_type
FROM BST b1
ORDER BY b1.N;
Step-by-step trace (input: 5-node tree N=1..5, parents [2, NULL, 5, 2, 2]):
| N | P |
|---|---|
| 1 | 2 |
| 2 | NULL |
| 3 | 5 |
| 4 | 2 |
| 5 | 2 |
- N=1, P=2 — `P IS NULL`? No. Children of 1? No. → `Leaf`.
- N=2, P=NULL — `P IS NULL`? Yes. → `Root`.
- N=3, P=5 — `P IS NULL`? No. Children of 3? No. → `Leaf`.
- N=4, P=2 — `P IS NULL`? No. Children of 4? No. → `Leaf`.
- N=5, P=2 — `P IS NULL`? No. Children of 5? Yes (N=3 has P=5). → `Inner`.
Output:
| N | node_type |
|---|---|
| 1 | Leaf |
| 2 | Root |
| 3 | Leaf |
| 4 | Leaf |
| 5 | Inner |
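The three-branch CASE can be replayed on the 5-node trace tree in an in-memory SQLite database:

```python
import sqlite3

# The 5-node tree from the trace: N=1..5 with parents [2, NULL, 5, 2, 2].
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE BST (N INT, P INT)")
con.executemany("INSERT INTO BST VALUES (?,?)",
                [(1, 2), (2, None), (3, 5), (4, 2), (5, 2)])
rows = con.execute("""
SELECT b1.N,
       CASE
         WHEN b1.P IS NULL THEN 'Root'
         WHEN EXISTS (SELECT 1 FROM BST b2 WHERE b2.P = b1.N) THEN 'Inner'
         ELSE 'Leaf'
       END AS node_type
FROM BST b1
ORDER BY b1.N
""").fetchall()
print(rows)  # [(1, 'Leaf'), (2, 'Root'), (3, 'Leaf'), (4, 'Leaf'), (5, 'Inner')]
```

Note that node 2 has children yet is labeled Root, because the `P IS NULL` branch fires first — the branch order doing exactly its job.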
Why this works — concept by concept:
- `P IS NULL` — the unique root identifier; placed first because every other branch assumes a parent exists.
- Self-EXISTS — a single correlated test: "does any row in the same table have me as parent?"
- Branch ordering — Root → Inner → Leaf in strict CASE order; reordering produces silent mislabels.
- `ORDER BY N` — the grader expects ascending N; without it, output order is undefined.
- Cost — one outer scan plus one EXISTS probe per row: roughly `O(N log N)` if `P` is indexed.
6. Subqueries and the Output Formatting Trap (Weather Observation Station)
Scalar subqueries plus strict output formatting in SQL
Weather Observation Station 18 looks innocuous — find one row's column based on the max of another — but it filters out otherwise-correct candidates because of one detail: HackerRank's grader is exact-match on stdout. The mental model: query for the value, then format it for the grader. The invariant: ROUND(x, 4) does not add trailing zeros — it returns a numeric value the engine prints in its own default format. Wrap with FORMAT (MySQL/SQL Server) or to_char (PostgreSQL).
Output-strict rule: print `0.00`, never `0` or `.00`. The Glassdoor data engineer who reported "my entire logic was correct but I was filtered out" failed exactly here. Coerce the format every time.
Scalar subquery in WHERE: WHERE x = (SELECT MAX(x) FROM ...)
A scalar subquery returns one value usable in any expression slot. To find the row with the maximum value, filter WHERE x = (SELECT MAX(x) FROM ...).
- Scalar subquery: one row, one column.
- Equality predicate: matches the row whose value equals the global aggregate.
- Worked example: `STATION` filtered to `LAT_N < 137.2345`, then `MAX(LAT_N)` → one scalar; `WHERE LAT_N = scalar` → one row.
WHERE LAT_N = (SELECT MAX(LAT_N) FROM STATION WHERE LAT_N < 137.2345)
FORMAT(ROUND(x, 4), '0.0000') for fixed-decimal output
ROUND(x, 4) rounds the numeric value but returns a number — 1.2 stays 1.2, not 1.2000. To force trailing zeros, wrap with the engine's format function.
- MySQL: `FORMAT(x, 4)` — the second argument is a decimal-place count, and the function adds thousands separators, so prefer `CAST(x AS DECIMAL(10, 4))` for plain fixed-decimal output.
- MS SQL Server: `FORMAT(x, 'N4')`, or a custom picture string like `FORMAT(x, '0.0000')`.
- PostgreSQL: `to_char(x, 'FM999990.0000')`.
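Python's formatter makes the round-vs-format distinction concrete (string formatting stands in for the SQL `FORMAT`/`to_char` layer):

```python
# Rounding returns a number; formatting returns the fixed-width string
# the exact-match grader compares against.
x = -90.1

print(str(round(x, 4)))  # '-90.1'    -- no trailing zeros; fails exact-match
print(f"{x:.4f}")        # '-90.1000' -- fixed four decimals; passes
```

Same value both times; only the second string survives a stdout exact-match check that expects four decimals.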
Cross-engine portability table
HackerRank lets candidates pick MySQL / MS SQL Server / PostgreSQL / Oracle / DB2 per challenge. Knowing the format equivalents is the difference between submitting the right answer in your language vs guessing at one you've never used.
Common beginner mistakes
- Using `ROUND` alone — `1.2` stays `1.2` when the grader expected `1.2000`.
- Calling `FORMAT(x, 4)` in MS SQL Server when you meant `FORMAT(x, 'N4')` — the second argument there is a format string, not a precision.
- Forgetting `COALESCE` for nullable columns — `NULL` prints as the literal string `NULL`, never `0.0000`.
SQL interview question on Weather Observation Station 18
Table STATION(ID, CITY, STATE, LAT_N, LONG_W). Output LONG_W for the row with the largest LAT_N strictly less than 137.2345, formatted to four decimal places.
Solution using a scalar subquery and FORMAT
-- SQL Server-style format string; in MySQL, CAST(ROUND(LONG_W, 4) AS DECIMAL(10, 4)) gives the same fixed-decimal output
SELECT FORMAT(ROUND(LONG_W, 4), '0.0000') AS long_w_fmt
FROM STATION
WHERE LAT_N = (
SELECT MAX(LAT_N)
FROM STATION
WHERE LAT_N < 137.2345
);
Step-by-step trace (input: 4 STATION rows):
| LAT_N | LONG_W |
|---|---|
| 100.0 | -75.123 |
| 130.5 | -80.4567 |
| 137.2 | -90.1 |
| 138.0 | -91.0 |
- Inner subquery — `LAT_N < 137.2345` filters to rows 1, 2, 3 (138.0 rejected). `MAX(LAT_N) = 137.2`.
- Outer WHERE — `LAT_N = 137.2` matches row 3 only.
- Project — `LONG_W = -90.1`.
- `ROUND(-90.1, 4) = -90.1` — still a numeric value, not the string `-90.1000`.
- `FORMAT(-90.1, '0.0000')` — coerces to the literal string `-90.1000`.
Output:
| long_w_fmt |
|---|
| -90.1000 |
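SQLite has neither `FORMAT` nor `to_char`, but its `printf` plays the same role; a sketch of the full Station 18 shape on the 4-row trace:

```python
import sqlite3

# The 4-row STATION trace; printf('%.4f', ...) stands in for FORMAT / to_char.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE STATION (LAT_N REAL, LONG_W REAL)")
con.executemany("INSERT INTO STATION VALUES (?,?)",
                [(100.0, -75.123), (130.5, -80.4567), (137.2, -90.1), (138.0, -91.0)])
row = con.execute("""
SELECT printf('%.4f', ROUND(LONG_W, 4)) AS long_w_fmt
FROM STATION
WHERE LAT_N = (SELECT MAX(LAT_N) FROM STATION WHERE LAT_N < 137.2345)
""").fetchone()
print(row[0])  # -90.1000
```

The scalar subquery pins `LAT_N` to 137.2, the outer equality picks row 3, and `printf` forces the trailing zeros the grader expects.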
Why this works — concept by concept:
- Scalar subquery — `(SELECT MAX(...) WHERE ...)` returns one value, usable in the outer WHERE.
- Strict equality `=` — matches the row whose `LAT_N` equals the filtered max.
- `FORMAT(ROUND(x, 4), '0.0000')` — forces trailing zeros so the stdout matches the grader's expected string.
- Engine portability — the same logic in PostgreSQL needs `to_char`; MySQL's `FORMAT` takes a precision count instead of a format string.
- Cost — two scans of `STATION` (one for the subquery, one for the outer): `O(N)` total.
7. PySpark Dataframe Transforms for HackerRank DE Assessments
SQL → PySpark idioms for the HackerRank platform
Glassdoor reports for the HackerRank-the-company data engineer assessment confirm two coding questions per round: one SQL window function, one PySpark dataframe transform. The PySpark side is graded in a real IDE — runnable code, not pseudocode. The mental model: think SQL, write PySpark, and know the three idiom translations the DataFrame API can't express directly. The invariant: .isin() takes a Python list, not a DataFrame; correlated subqueries don't exist in the API; everything else is a Window.partitionBy(...).orderBy(...).
Pro tip: When `.isin()` would need a DataFrame, rewrite as a left anti-join: `df.alias('a').join(other.alias('b'), col('a.k') == col('b.k'), 'left').where(col('b.k').isNull())`, or pass `how='left_anti'` directly. It scales to billions of rows; `.collect()` doesn't.
Define schema explicitly with StructType and StructField
Schema inference is convenient but flaky on edge cases — a null first row and the column becomes string when you needed int. In a 90-minute assessment, define schema up front.
- `StructType([StructField(name, type, nullable), ...])`: the canonical pattern.
- Worked example: `Submissions(submission_date Date, submission_id Int, hacker_id Int, score Int)` with `nullable=False` on the keys.
from pyspark.sql.types import StructType, StructField, DateType, IntegerType

schema = StructType([
    StructField('submission_date', DateType(), False),
    StructField('submission_id', IntegerType(), False),
    StructField('hacker_id', IntegerType(), False),
    StructField('score', IntegerType(), False),
])
SQL IN (subquery) → PySpark LEFT JOIN ... IS NULL
.isin() accepts a Python list, not a DataFrame. The clean rewrite is a left anti-join: keep rows from the left table that have no match on the right.
- `df.alias('a').join(other.alias('b'), key, 'left').where(col('b.key').isNull())`: preserves left rows without a right match.
- Worked example: `start_dates` not in `end_dates` becomes a left join + `IS NULL` filter.
Window functions: Window.partitionBy(...).orderBy(...)
Window functions translate one-to-one between SQL and PySpark. ROW_NUMBER, RANK, LAG, LEAD, SUM OVER — all use the same Window spec.
- `Window.partitionBy('day').orderBy(F.col('cnt').desc(), F.col('hacker_id').asc())`: mirror the SQL window.
- Worked example: the per-day winner pattern from §4 in PySpark.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('submission_date') \
          .orderBy(F.col('cnt').desc(), F.col('hacker_id').asc())
winners = daily.withColumn('rn', F.row_number().over(w)) \
               .where(F.col('rn') == 1)
Common beginner mistakes
- Calling `.isin(other_df)` — `other_df` isn't iterable; runtime error.
- `.collect()` on a billion-row dataframe — pulls every row to the driver and runs it out of memory.
- Forgetting `alias()` on self-joins — ambiguous column references; the planner can't decide.
PySpark interview question on the anti-join translation
DataFrame projects_df(start_date Date, end_date Date). Return the start_date values that do not appear in any row's end_date. The SQL form is SELECT start_date FROM projects WHERE start_date NOT IN (SELECT end_date FROM projects). Write the PySpark equivalent.
Solution using a left anti-join
from pyspark.sql import functions as F
result = (
projects_df.alias('a')
.join(
projects_df.alias('b'),
F.col('a.start_date') == F.col('b.end_date'),
how='left',
)
.where(F.col('b.end_date').isNull())
.select(F.col('a.start_date'))
)
Step-by-step trace (input: 4 project rows):
| start_date | end_date |
|---|---|
| 2026-01-01 | 2026-01-10 |
| 2026-01-15 | 2026-01-20 |
| 2026-01-25 | 2026-01-30 |
| 2026-02-01 | 2026-02-10 |
- Alias both copies — `a` for the outer dataframe, `b` for the lookup side.
- Left join on `a.start_date = b.end_date` — for each `start_date`, attach any row whose `end_date` matches; otherwise the right side is `NULL`.
- Filter `WHERE b.end_date IS NULL` — keeps `start_date` rows with no matching `end_date`. In this input no `start_date` equals any `end_date`, so all four rows survive.
- Project `a.start_date` — final output.
Output:
| start_date |
|---|
| 2026-01-01 |
| 2026-01-15 |
| 2026-01-25 |
| 2026-02-01 |
Why this works — concept by concept:
- Left anti-join idiom — `LEFT JOIN ... WHERE other.key IS NULL` is the canonical "rows in A not in B" pattern; portable from SQL to PySpark.
- Aliasing — `alias('a')` and `alias('b')` resolve column ambiguity in the self-join.
- No `.collect()` — keeps the work distributed; safe for billions of rows.
- `F.col` discipline — explicit column references avoid string-vs-column ambiguity in the planner.
- Cost — one shuffle on the join key, one filter, one projection: `O(N + M)` distributed.
Tips to crack HackerRank data engineering interviews
Practice the seven canonical clusters before chasing tools
Drill JOINs, multi-table aggregation, GROUP BY/HAVING, window functions, CASE on hierarchies, scalar subqueries, and PySpark dataframe transforms. The HackerRank assessment samples from these — questions that look new are recombinations of these primitives. Rehearsing all seven is more useful than memorizing twenty individual problems.
Match HackerRank's output format byte-for-byte
ROUND rounds, it does not pad. Wrap with CAST(ROUND(x, 4) AS DECIMAL(10, 4)) (MySQL), TO_CHAR(x, 'FM999990.0000') (PostgreSQL), or FORMAT(x, '0.0000') (MS SQL Server) every time the prompt mentions decimal places. Use COALESCE (or ISNULL in MS SQL Server) for nullable columns — NULL prints as the literal string NULL, not 0.
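The same trailing-zero rule, shown in Python for clarity — the grader diffs stdout byte for byte, so padding, not rounding, is what makes the output match:

```python
# Rounding alone drops trailing zeros; fixed-width formatting pads them.
x = 0.25
print(round(x, 4))   # 0.25   -> fails a 4-decimal test case
print(f"{x:.4f}")    # 0.2500 -> matches byte-for-byte
```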
Learn one engine deeply, two well
HackerRank lets candidates pick MySQL, PostgreSQL, MS SQL Server, Oracle, or DB2 per challenge. Pick MySQL as your default; know the PostgreSQL and MS SQL Server equivalents for FORMAT, ROUND, CAST, and COALESCE/ISNULL. Window function syntax is identical across all three.
Read the schema before the prompt
The line "the tables may contain duplicate records" in New Companies changes every COUNT(*) to COUNT(DISTINCT). The line "the secondary sort is by ID, but the result includes UIN but not ID" in Tenured Employees tells you ORDER BY columns don't need to be in SELECT. Read schema notes before the prompt — they encode the half of the test the prose doesn't.
In PySpark, prefer joins over collect()
.isin() accepts only a Python list. For "is in another dataframe", rewrite as a left anti-join: df.alias('a').join(other.alias('b'), key, 'left').where(col('b.key').isNull()). Correlated subqueries don't exist in the DataFrame API — drop down to spark.sql(...) or rewrite the predicate as a join condition.
Practice on PipeCode with real grading
Pattern recognition without practice is a guess. Drill the HackerRank-tagged practice set and the topic-tagged sets for the seven clusters above. PipeCode grades on equivalent results, so you can focus on logic without fighting trailing-zero mismatches — then port the format-strict habits into your HackerRank submission.
Frequently asked questions
What basic SQL skills should freshers practice for HackerRank data engineering questions?
Freshers should drill INNER JOIN with foreign keys, GROUP BY with COUNT/SUM/AVG, WHERE filters with < and =, and ORDER BY with multi-key sorts. The Tenured Employees pattern covers all four primitives in one query, so it's a natural starting point. Freshers also need to internalize that HackerRank grades on stdout exact match — 0 and 0.00 are different answers. Spend an afternoon on the FORMAT and ROUND idioms before attempting any decimal-formatted question.
How do experienced data engineers prepare for HackerRank advanced SQL questions?
Experienced engineers should focus on window functions for streaks and per-day winners (15 Days of Learning SQL), self-EXISTS for hierarchy classification (Binary Tree Nodes), and pre-aggregation patterns to avoid fan-out (New Companies). The senior bar is fluency under time pressure — a 90-minute screen with two questions and a strict grader. Two weeks of drilling advanced SQL problems and reviewing the seven canonical clusters is a realistic prep window for someone with 3+ years of SQL.
Which JOIN patterns appear most often on HackerRank data engineering interviews?
Three JOIN patterns dominate: INNER JOIN on a foreign key (Tenured Employees), four-way INNER JOIN with a related-row equality predicate (Top Competitors), and self-JOIN with EXISTS for hierarchy classification (Binary Tree Nodes). LEFT JOIN with IS NULL is rarer in pure SQL but becomes critical in the PySpark anti-join translation. Drill all four on the HackerRank JOIN practice set before attempting the assessment.
Does HackerRank ask PySpark questions in data engineer interviews?
Yes. Glassdoor reports for the HackerRank company's own data engineer role confirm one PySpark dataframe transform per assessment, alongside one SQL window function question. PySpark questions are coded in a real IDE — runnable Python, not pseudocode — and graded on output equality just like SQL. The most-screened idiom is the SQL IN (subquery) to PySpark left-anti-join translation, so drill that pattern until it's automatic.
What scenario-based SQL questions should I expect on HackerRank?
Scenario-based questions wrap the seven clusters in business framings: leaderboards (Top Competitors → "find users with more than one perfect-score submission"), retention (a streak query like 15 Days of Learning), reporting cubes (New Companies → "count users per tier"), and audit trails (Tenured Employees → "list users matching a tenure filter"). The framing changes; the underlying primitive doesn't. State the mapping out loud during the interview — interviewers reward verbal clarity as much as correct SQL.
Are ETL or data warehousing questions part of HackerRank data engineering assessments?
The hands-on coding rounds are SQL + PySpark; ETL and warehousing concepts (staging tables, OLTP vs OLAP, surrogate keys, star vs snowflake) appear as MCQs in the proctored assessment. The Glassdoor New Delhi report confirms four MCQs covering Hadoop, SQL fundamentals, and data modelling alongside the coding question. Read a data engineering primer the night before to ensure the warehouse vocabulary is fresh.
Start practicing HackerRank data engineering problems
PipeCode pairs HackerRank-tagged problems with tests, AI feedback, and a coding environment so you can drill the exact SQL and PySpark patterns the platform asks — without fighting the trailing-zero grader while you're still learning the logic.
Pipecode.ai is Leetcode for Data Engineering.