HackerRank data engineering interview questions screen on a tight, predictable skill matrix: SQL JOINs with foreign-key filters, multi-table aggregation without fan-out, GROUP BY / HAVING on related-row predicates, window functions for streaks and per-day winners, CASE on hierarchies for tree-style classification, scalar subqueries with strict output formatting, and PySpark dataframe transforms the platform now grades alongside SQL. The grader is exact-match on stdout — clever logic with the wrong trailing zero fails the test case.
This guide walks through the seven canonical clusters HackerRank actually tests for data engineers, drawing on its own 7 Advanced SQL Interview Questions set plus first-hand candidate reports for the company's online assessment. Each cluster ships a teaching paragraph, three sub-topic primitives with worked examples, an interview-style problem, a runnable solution, and a step-by-step trace so you can rehearse the four-step "join → filter → group → order" instinct that separates a passing submission from a near-miss.
Top HackerRank data engineering interview topics
From the HackerRank data engineering practice set, the seven numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–7) | HackerRank question | Why it shows up |
|---|---|---|---|
| 1 | SQL JOINs and foreign-key filtering | Tenured Employees | INNER JOIN on ID, WHERE on TIME < 3, ORDER BY a column not in SELECT. |
| 2 | Multi-table aggregation | New Companies | Five-level hierarchy → one per-company row using COUNT(DISTINCT) against duplicates. |
| 3 | GROUP BY + HAVING on related-row predicates | Top Competitors | WHERE submission.score = difficulty.score → GROUP BY hacker → HAVING COUNT(*) > 1. |
| 4 | Window functions for streaks and winners | 15 Days of Learning SQL | ROW_NUMBER per day plus a "submitted every day so far" cumulative-presence test. |
| 5 | CASE expressions on hierarchies | Binary Tree Nodes | Three-branch CASE WHEN P IS NULL ... EXISTS(children) ... in strict order. |
| 6 | Scalar subqueries + strict output formatting | Weather Observation Station 18 | WHERE x = (SELECT MAX(x) ...) plus FORMAT(ROUND(x, 4), '0.0000') for the grader. |
| 7 | PySpark dataframe transforms | Real-IDE coding (Glassdoor-confirmed) | SQL → PySpark idioms — anti-join, schema, Window.partitionBy. |
Output-format rule: HackerRank grades on stdout exact match. A correct query that prints `0` instead of `0.00`, or `.00` instead of `0.00`, fails the test case. Coerce trailing zeros with `FORMAT`, NULLs with `COALESCE`, and date strings with explicit `DATE_FORMAT`/`to_char` — every time, not just when it feels needed.
1. SQL JOINs and the Tenured Employees Pattern
Foreign-key joins with multi-key sorts in SQL for data engineering
The Tenured Employees question is the cleanest "junior data engineer" filter on the HackerRank platform: two tables, one foreign key, one filter predicate, a two-key sort. The mental model is join → filter → order in that order. The invariant: the join brings columns together, WHERE narrows rows, and ORDER BY sequences output — the sort key does not need to live in the SELECT list.
Pro tip: When the prompt says "sort by NAME, then ID, but output only UIN and NAME," resist the urge to drop ID from the query. SQL's `ORDER BY` accepts any column visible after the FROM clause, regardless of whether it appears in the projection.
INNER JOIN ... ON fk = pk
INNER JOIN keeps only rows that have a match on both sides. For two tables with a true 1:1 foreign key, INNER and LEFT JOIN produce identical row counts; for guaranteed-match foreign keys, INNER is the default because it documents the intent and is faster to plan.
- Join key: the column that identifies the same entity on both sides.
- Cardinality check: if either side has duplicates on the key, the joined row count multiplies — fan-out.
- Worked example: `EMPLOYEE.ID = EMPLOYEE_UIN.ID` is 1:1, so the joined row count equals the smaller side.
SELECT u.UIN, e.NAME
FROM EMPLOYEE e
INNER JOIN EMPLOYEE_UIN u ON u.ID = e.ID;
WHERE filter on the joined columns
The filter TIME < 3 narrows the joined set. It can sit either in the join's ON clause or in the outer WHERE — for INNER JOIN they are equivalent, for LEFT JOIN they are not. Default to WHERE for filters that apply after the join.
- `ON` predicate: part of the join algebra; for LEFT JOIN it preserves unmatched left rows.
- `WHERE` predicate: post-join filter; rejects rows regardless of join type.
- Worked example: `TIME < 3` keeps Sherrie (1), Mary (2), and Dave (0). Paul (7) and Sam (3) are excluded.
SELECT u.UIN, e.NAME
FROM EMPLOYEE e
INNER JOIN EMPLOYEE_UIN u ON u.ID = e.ID
WHERE e.TIME < 3;
ORDER BY name ASC, id ASC (sort key not in SELECT)
ORDER BY accepts any FROM-clause column, not just the projected ones. The two-key sort name ASC, id ASC resolves alphabetic ties deterministically — without the secondary key, two employees with the same name produce non-deterministic row order, which the HackerRank grader treats as failing.
- Primary key: the column that drives most of the sort.
- Secondary key: breaks ties between equal primary values.
- Worked example: Dave, Mary, Sherrie are already alphabetically distinct, so the secondary `id` only matters when a name repeats.
SELECT u.UIN, e.NAME
FROM EMPLOYEE e
INNER JOIN EMPLOYEE_UIN u ON u.ID = e.ID
WHERE e.TIME < 3
ORDER BY e.NAME ASC, e.ID ASC;
Common beginner mistakes
- Putting `ORDER BY` columns in `SELECT` "to be safe" — pollutes the output.
- Using `LEFT JOIN` when the foreign key is guaranteed — clutters the plan, hides intent.
- Forgetting the secondary sort key — non-deterministic output fails exact-match grading.
- Sorting by `id` numerically when the schema makes it a string — `C_1, C_10, C_2` versus `C_1, C_2, C_10`.
SQL interview question on tenured employees
Two tables: EMPLOYEE(ID INT, NAME VARCHAR, TIME INT, ADDRESS VARCHAR, SALARY INT) and EMPLOYEE_UIN(ID INT, UIN VARCHAR). Output UIN and NAME for employees with TIME < 3 years, ordered by NAME ASC then ID ASC. ID is part of the sort but not the output.
Solution using INNER JOIN with a multi-key ORDER BY
SELECT u.UIN, e.NAME
FROM EMPLOYEE e
INNER JOIN EMPLOYEE_UIN u ON u.ID = e.ID
WHERE e.TIME < 3
ORDER BY e.NAME ASC, e.ID ASC;
Step-by-step trace (input: 5-row EMPLOYEE table):
| ID | NAME | TIME |
|---|---|---|
| 1 | Sherrie | 1 |
| 2 | Paul | 7 |
| 3 | Mary | 2 |
| 4 | Sam | 3 |
| 5 | Dave | 0 |
- Join `EMPLOYEE` to `EMPLOYEE_UIN` on `ID`. Five matched rows because the foreign key is 1:1.
- Apply `WHERE e.TIME < 3`. Paul (7) and Sam (3) drop out; Sherrie (1), Mary (2), and Dave (0) survive.
- Apply `ORDER BY e.NAME ASC, e.ID ASC`. Alphabetic order: Dave, Mary, Sherrie.
- Project `UIN, NAME` for each surviving row.
Output:
| UIN | NAME |
|---|---|
| 63868-453 | Dave |
| 63550-194 | Mary |
| 57520-0440 | Sherrie |
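The whole pipeline can be replayed in an in-memory SQLite database (a sketch, not the grader's MySQL, though this syntax is shared by both). The UINs for the matched employees come from the output table above; Paul's and Sam's are hypothetical filler.

```python
import sqlite3

# Recreate the trace tables in memory; UINs for Paul and Sam are made up.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE EMPLOYEE (ID INT, NAME TEXT, TIME INT);
CREATE TABLE EMPLOYEE_UIN (ID INT, UIN TEXT);
INSERT INTO EMPLOYEE VALUES
  (1,'Sherrie',1),(2,'Paul',7),(3,'Mary',2),(4,'Sam',3),(5,'Dave',0);
INSERT INTO EMPLOYEE_UIN VALUES
  (1,'57520-0440'),(2,'11111-111'),(3,'63550-194'),(4,'22222-222'),(5,'63868-453');
""")
rows = con.execute("""
SELECT u.UIN, e.NAME
FROM EMPLOYEE e
INNER JOIN EMPLOYEE_UIN u ON u.ID = e.ID
WHERE e.TIME < 3
ORDER BY e.NAME ASC, e.ID ASC
""").fetchall()
print(rows)  # [('63868-453', 'Dave'), ('63550-194', 'Mary'), ('57520-0440', 'Sherrie')]
```

The join, filter, sort, and projection land exactly the three rows and the order the grader expects.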
Why this works — concept by concept:
- INNER JOIN on `ID` — guaranteed-match foreign key; documents the intent that every employee has a UIN.
- `WHERE` filter post-join — narrows by tenure without affecting the join algebra.
- Multi-key `ORDER BY` — primary sort on `NAME` for human readability, secondary on `ID` for deterministic tie-breaking.
- Sort key not in `SELECT` — SQL allows it; the grader expects exactly the two output columns.
- Cost — one hash-join scan plus one sort: `O((N + M) + N log N)`.
2. Multi-Table Aggregation with the New Companies Hierarchy
Multi-level hierarchies and COUNT(DISTINCT) for data engineering
The New Companies question forces the question every analytics SQL writer eventually has to answer: how do I roll a five-level hierarchy into one row per company without double-counting? The mental model: aggregate each level independently, then join the per-level counts on the parent key. The invariant: a single JOIN chain across all five tables fans out the row count and breaks every aggregate downstream.
Pro tip: When the prompt says "the tables may contain duplicate records," `COUNT(*)` is wrong by default — switch to `COUNT(DISTINCT child_id)`. Read this line in every HackerRank schema before you write the first SELECT.
Pre-aggregate per level, then join
When five tables share a company_code key, aggregate each table independently and join the scalar counts on the parent. This avoids the cross-product of joining all five at once.
- Per-level aggregate: `SELECT company_code, COUNT(DISTINCT id) FROM <level> GROUP BY company_code`.
- Final JOIN: glue the per-level counts on `company_code`.
- Worked example: company `C1` with 2 lead managers, 4 senior managers, 8 managers, 16 employees → one final row.
SELECT
c.company_code, c.founder,
(SELECT COUNT(DISTINCT lm.lead_manager_code) FROM Lead_Manager lm WHERE lm.company_code = c.company_code) AS n_lead,
(SELECT COUNT(DISTINCT sm.senior_manager_code) FROM Senior_Manager sm WHERE sm.company_code = c.company_code) AS n_senior,
(SELECT COUNT(DISTINCT m.manager_code) FROM Manager m WHERE m.company_code = c.company_code) AS n_manager,
(SELECT COUNT(DISTINCT e.employee_code) FROM Employee e WHERE e.company_code = c.company_code) AS n_employee
FROM Company c
ORDER BY c.company_code;
COUNT(DISTINCT child_id) GROUP BY parent_id
When a single table has duplicate rows on the child id (e.g. an employee appears twice because of a join in the source), COUNT(*) returns the row count, not the entity count. COUNT(DISTINCT child_id) returns the entity count — the right answer when the prompt says "how many".
- `COUNT(*)`: counts rows.
- `COUNT(DISTINCT id)`: counts unique entities.
- Worked example: three rows for the same `employee_code` → `COUNT(*) = 3`, `COUNT(DISTINCT employee_code) = 1`.
String-sort vs numeric-sort on company_code
ORDER BY company_code ASC on a VARCHAR column sorts lexicographically: C_1, C_10, C_2. To sort numerically you would need CAST(SUBSTRING(company_code, 3) AS UNSIGNED). HackerRank's grader for this problem expects the string sort, so the natural ASC sort is correct as long as the column is VARCHAR.
- Lex sort: alphabetic, character by character.
- Numeric sort: parse the suffix and cast.
- Worked example: `C_1, C_10, C_2` is the expected order; `C_1, C_2, C_10` is wrong here.
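A quick plain-Python illustration of the two orderings (string comparison stands in for the VARCHAR collation):

```python
# Lexicographic vs numeric ordering of VARCHAR-style company codes.
codes = ['C_2', 'C_10', 'C_1']

lex = sorted(codes)                                # what ORDER BY does on VARCHAR
numeric = sorted(codes, key=lambda c: int(c[2:]))  # what CAST(SUBSTRING(code, 3) AS UNSIGNED) would do

print(lex)      # ['C_1', 'C_10', 'C_2']
print(numeric)  # ['C_1', 'C_2', 'C_10']
```

Character-by-character comparison puts `C_10` before `C_2` because `'1' < '2'`; the numeric key parses the suffix first.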
Common beginner mistakes
- Joining all five tables at once and using `COUNT(DISTINCT id)` per level — works but plans badly and is hard to read.
- Using `COUNT(*)` after a JOIN that fans out — every level gets multiplied by its descendants.
- Casting `company_code` to a number before sorting — defeats the prompt.
SQL interview question on the New Companies hierarchy
Tables Company(company_code, founder), Lead_Manager(lead_manager_code, company_code), Senior_Manager, Manager, Employee — same column shape down the chain. Output company_code, founder, n_lead, n_senior, n_manager, n_employee per company, ordered by company_code ascending (string sort).
Solution using scalar subqueries per level
SELECT
c.company_code, c.founder,
(SELECT COUNT(DISTINCT lm.lead_manager_code) FROM Lead_Manager lm WHERE lm.company_code = c.company_code) AS n_lead,
(SELECT COUNT(DISTINCT sm.senior_manager_code) FROM Senior_Manager sm WHERE sm.company_code = c.company_code) AS n_senior,
(SELECT COUNT(DISTINCT m.manager_code) FROM Manager m WHERE m.company_code = c.company_code) AS n_manager,
(SELECT COUNT(DISTINCT e.employee_code) FROM Employee e WHERE e.company_code = c.company_code) AS n_employee
FROM Company c
ORDER BY c.company_code;
Step-by-step trace (input: 1 company C1 with founder Monika, 1 lead, 2 seniors, 4 managers, 8 employees):
| company_code | founder | n_lead | n_senior | n_manager | n_employee |
|---|---|---|---|---|---|
| C1 | Monika | 1 | 2 | 4 | 8 |
- Outer scan of `Company` — yields one row per company.
- Per-row, run each scalar subquery — `Lead_Manager` filtered by `company_code = c.company_code` and `COUNT(DISTINCT)`-aggregated. Same for `Senior_Manager`, `Manager`, `Employee`.
- Project the results — six columns per output row.
- `ORDER BY c.company_code` — string sort: `C_1`, `C_10`, `C_2` (when there are multiple companies).
Output:
| company_code | founder | n_lead | n_senior | n_manager | n_employee |
|---|---|---|---|---|---|
| C1 | Monika | 1 | 2 | 4 | 8 |
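The defensive value of `COUNT(DISTINCT)` is easy to demonstrate in SQLite with the trace company, trimmed to two of the four child levels and with one Employee row deliberately duplicated (the level codes `LM1`, `E1`, `E2` are made-up identifiers):

```python
import sqlite3

# One company, one lead manager, two employees -- with E1 inserted twice.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Company (company_code TEXT, founder TEXT);
CREATE TABLE Lead_Manager (lead_manager_code TEXT, company_code TEXT);
CREATE TABLE Employee (employee_code TEXT, company_code TEXT);
INSERT INTO Company VALUES ('C1', 'Monika');
INSERT INTO Lead_Manager VALUES ('LM1', 'C1');
INSERT INTO Employee VALUES ('E1','C1'),('E2','C1'),('E1','C1');  -- duplicate E1
""")
row = con.execute("""
SELECT c.company_code, c.founder,
  (SELECT COUNT(DISTINCT lm.lead_manager_code) FROM Lead_Manager lm
    WHERE lm.company_code = c.company_code) AS n_lead,
  (SELECT COUNT(DISTINCT e.employee_code) FROM Employee e
    WHERE e.company_code = c.company_code) AS n_employee
FROM Company c
ORDER BY c.company_code
""").fetchone()
print(row)  # ('C1', 'Monika', 1, 2) -- the duplicate E1 row is counted once
```

Swap `COUNT(DISTINCT e.employee_code)` for `COUNT(*)` and the employee count becomes 3, which is exactly the bug the prompt's duplicate-records warning is about.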
Why this works — concept by concept:
- Scalar subquery per level — each subquery returns a single integer; no fan-out across levels.
- `COUNT(DISTINCT)` — defends against duplicates the prompt warns about.
- Outer-row anchor — the `Company` row drives the subqueries, so per-company filtering is automatic.
- String `ORDER BY` — `VARCHAR` codes sort lexicographically; matches the prompt.
- Cost — one outer scan plus four indexed lookups per company: `O(C × (1 + index_seek_per_level))`.
3. GROUP BY, HAVING, and the Top Competitors Pattern
Related-row predicates with GROUP BY for SQL data engineering
Top Competitors is the canonical "who scored full marks more than once" query, and it teaches two compound skills: filtering on a related row's column (a submission's score equals its difficulty's max score) and filtering on an aggregate (more than one such submission). The mental model: JOIN → WHERE related-equality → GROUP BY entity → HAVING aggregate-predicate → ORDER BY.
Pro tip: `HAVING` is the only place an aggregate can appear in a predicate. If you find yourself writing `WHERE COUNT(*) > 1`, the grader rejects the query — `WHERE` runs before aggregation, `HAVING` runs after.
WHERE submission.score = difficulty.score
The "full score" predicate isn't a constant comparison — it's row-level equality between a submission's score and the maximum score declared in the matching difficulty row. Use the JOIN to bring the related row alongside, then compare both columns.
- Joined predicate: `s.score = d.score` only makes sense after `JOIN Difficulty d ON c.difficulty_level = d.difficulty_level`.
- Worked example: difficulty 6 → max score 100. A submission with `score = 100` qualifies; `99` doesn't.
WHERE s.score = d.score
GROUP BY hacker, COUNT(*), HAVING COUNT(*) > 1
Once you've filtered to full-score rows, group by the entity that owns the achievement (the hacker) and count. HAVING COUNT(*) > 1 keeps only hackers with more than one full-score row.
- `GROUP BY h.hacker_id, h.name`: name appears in SELECT, so it must appear in GROUP BY.
- `HAVING COUNT(*) > 1`: strict greater than — "more than one" excludes hackers with exactly one.
- Worked example: Joe scored full marks twice → `COUNT(*) = 2`, qualifies.
Two-key sort: ORDER BY COUNT(*) DESC, hacker_id ASC
ORDER BY accepts aggregates directly — no need to wrap the count in a subquery alias. Primary sort on COUNT(*) descending, secondary on hacker_id ascending so ties resolve to the lower id.
- `COUNT(*) DESC`: most-prolific hackers first.
- `hacker_id ASC`: lowest id wins on ties.
- Worked example: two hackers tied at 3 full-score submissions → the lower `hacker_id` appears first.
Common beginner mistakes
- Putting the aggregate predicate in `WHERE` — an error in every engine, because `WHERE` runs before aggregation.
- Forgetting `name` in `GROUP BY` after selecting it — strict-mode engines reject the query.
- Sorting by the count column through its alias — works in MySQL, fails in stricter engines.
SQL interview question on Top Competitors
Tables Hackers(hacker_id, name), Difficulty(difficulty_level, score), Challenges(challenge_id, hacker_id, difficulty_level), Submissions(submission_id, hacker_id, challenge_id, score). Print hacker_id, name for hackers who achieved full marks on more than one challenge. Order by full-score count descending, then hacker_id ascending.
Solution using INNER JOIN with HAVING
SELECT h.hacker_id, h.name
FROM Submissions s
INNER JOIN Hackers h ON s.hacker_id = h.hacker_id
INNER JOIN Challenges c ON s.challenge_id = c.challenge_id
INNER JOIN Difficulty d ON c.difficulty_level = d.difficulty_level
WHERE s.score = d.score
GROUP BY h.hacker_id, h.name
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC, h.hacker_id ASC;
Step-by-step trace (input: 4 submissions, 2 challenges, 2 hackers):
| hacker | challenge | difficulty | sub_score | max_score |
|---|---|---|---|---|
| 86870 | 71055 | 2 | 30 | 30 |
| 90411 | 71055 | 2 | 30 | 30 |
| 90411 | 66730 | 6 | 100 | 100 |
| 86870 | 66730 | 6 | 50 | 100 |
- Four-way JOIN — bring `Hackers`, `Challenges`, `Difficulty` alongside each `Submissions` row.
- `WHERE s.score = d.score` — keeps rows 1, 2, 3. Row 4 (hacker 86870 scored 50 on a max-100 challenge) drops.
- `GROUP BY h.hacker_id, h.name` — collapses the 3 surviving rows into per-hacker groups: `86870 → 1 row`, `90411 → 2 rows`.
- `HAVING COUNT(*) > 1` — only `90411` survives (2 > 1).
- `ORDER BY COUNT(*) DESC, h.hacker_id ASC` — single row, no tie to break.
Output:
| hacker_id | name |
|---|---|
| 90411 | Joe |
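The full query can be replayed on the trace rows in SQLite. Only Joe's name comes from the trace; the name for hacker 86870 and the challenge-setter columns are hypothetical filler, and neither affects the result.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Hackers (hacker_id INT, name TEXT);
CREATE TABLE Difficulty (difficulty_level INT, score INT);
CREATE TABLE Challenges (challenge_id INT, hacker_id INT, difficulty_level INT);
CREATE TABLE Submissions (submission_id INT, hacker_id INT, challenge_id INT, score INT);
INSERT INTO Hackers VALUES (86870,'Todd'),(90411,'Joe');   -- 'Todd' is filler
INSERT INTO Difficulty VALUES (2,30),(6,100);
INSERT INTO Challenges VALUES (71055,86870,2),(66730,90411,6);
INSERT INTO Submissions VALUES
  (1,86870,71055,30),(2,90411,71055,30),(3,90411,66730,100),(4,86870,66730,50);
""")
rows = con.execute("""
SELECT h.hacker_id, h.name
FROM Submissions s
JOIN Hackers h ON s.hacker_id = h.hacker_id
JOIN Challenges c ON s.challenge_id = c.challenge_id
JOIN Difficulty d ON c.difficulty_level = d.difficulty_level
WHERE s.score = d.score
GROUP BY h.hacker_id, h.name
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC, h.hacker_id ASC
""").fetchall()
print(rows)  # [(90411, 'Joe')]
```

Hacker 86870 has exactly one full-score row, so `HAVING COUNT(*) > 1` removes them; only Joe survives.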
Why this works — concept by concept:
- Four-way JOIN — chains submission → hacker, submission → challenge, challenge → difficulty so the related-row equality has both sides on the same row.
- Related-row equality — `s.score = d.score` compares each submission to the max possible score for its challenge's difficulty, not a constant.
- GROUP BY + HAVING — `HAVING COUNT(*) > 1` is the "more than one" qualifier; it can only sit in HAVING because it references an aggregate.
- Aggregate in `ORDER BY` — `COUNT(*) DESC` ranks by prolific hackers; `hacker_id ASC` resolves ties deterministically.
- Cost — three hash-join passes plus one aggregate plus one sort: roughly `O(N log N)` over the submission row count.
4. SQL Window Functions and the 15 Days of Learning Pattern
Window functions for streaks and per-day winners in SQL
15 Days of Learning SQL is a stamina test for window functions: per day, return the count of hackers whose streak survives to that day plus the per-day submission winner with a deterministic tie-break. The mental model: window functions partition by day to rank per-day; a self-join with a streak predicate counts surviving hackers. The invariant: a single window expression cannot capture the "every day so far" condition — that needs a join or correlated subquery.
Pro tip: `ROW_NUMBER()` and `RANK()` look interchangeable on small inputs but diverge on ties. For "per-day winner with lowest hacker_id on ties," use `ROW_NUMBER() OVER (PARTITION BY day ORDER BY count DESC, hacker_id ASC)` and pick row 1 — `RANK()` would emit two row-1s.
ROW_NUMBER() OVER (PARTITION BY day ORDER BY ...)
ROW_NUMBER() emits a strictly-increasing index within each partition, broken by the ORDER BY clause. Pick WHERE rn = 1 to get one row per partition.
- PARTITION BY: scope of the window — one numbering per partition.
- ORDER BY in the window: decides who gets row 1.
- Worked example: partition by `submission_date`, order by `daily_count DESC, hacker_id ASC` → row 1 per day is the daily winner.
Tie-break: secondary key in ORDER BY
When two hackers tie on submissions, the prompt says "lowest hacker_id wins". The secondary ORDER BY key encodes that rule directly into the window.
-
Primary:
daily_count DESC— most-active first. -
Secondary:
hacker_id ASC— lowest id wins ties. -
Worked example: Ada (5 subs, id 1) and Bo (5 subs, id 2) tie → Ada wins because
id 1 < id 2.
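The divergence is easy to see on a day-2-style tie, run in SQLite (which ships window functions as of version 3.25, bundled with modern Python). Daily counts are pre-aggregated into a small table here rather than computed from raw submissions.

```python
import sqlite3

# Ada (id 1) and Bo (id 2) tie at 5 submissions; Di (id 4) has 2.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily (submission_date TEXT, hacker_id INT, cnt INT)")
con.executemany("INSERT INTO daily VALUES (?,?,?)",
                [('2016-03-02', 1, 5), ('2016-03-02', 2, 5), ('2016-03-02', 4, 2)])
rows = con.execute("""
SELECT hacker_id,
       ROW_NUMBER() OVER (PARTITION BY submission_date
                          ORDER BY cnt DESC, hacker_id ASC) AS rn,
       RANK()       OVER (PARTITION BY submission_date
                          ORDER BY cnt DESC) AS rk
FROM daily
ORDER BY rn
""").fetchall()
print(rows)  # [(1, 1, 1), (2, 2, 1), (4, 3, 3)]
```

`RANK()` hands both tied hackers rank 1; `ROW_NUMBER()` with the `hacker_id ASC` tie-break gives exactly one row 1 (Ada), which is what `WHERE rn = 1` needs.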
"Every day so far" via self-join + COUNT predicate
A window function can compute per-day counts but cannot answer "did this hacker submit on every day from day 1 to day N?" That requires a self-join: for each (hacker, day) pair, count distinct submission days <= current and require equality with the day index.
- Join condition: `submissions s2 ON s2.hacker_id = h.hacker_id AND s2.submission_date <= s1.submission_date`.
- Predicate: `COUNT(DISTINCT s2.submission_date) = day_index_of(s1.submission_date)`.
HAVING COUNT(DISTINCT s2.submission_date) = DATEDIFF(s1.submission_date, '2016-03-01') + 1
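The cumulative-presence predicate can be sketched in plain Python over the 3-day mini contest traced later in this section: a hacker survives to day d exactly when their distinct submission days up to d equal d.

```python
# hacker_id -> set of 1-based day indexes with at least one submission
# (the 3-day mini contest: Ada=1, Bo=2, Cy=3, Di=4).
subs = {
    1: {1, 2, 3},   # Ada submits every day
    2: {1, 2, 3},   # Bo submits every day
    3: {1},         # Cy skips days 2 and 3
    4: {1, 2},      # Di skips day 3
}

def streak_hackers(day):
    """Hackers whose distinct submission days up to `day` equal the day index."""
    return sorted(h for h, days in subs.items()
                  if len({d for d in days if d <= day}) == day)

print(streak_hackers(1))  # [1, 2, 3, 4]
print(streak_hackers(2))  # [1, 2, 4]
print(streak_hackers(3))  # [1, 2]
```

This is the set-equality test the SQL encodes with `COUNT(DISTINCT submission_date)` against `DATEDIFF(...) + 1`: counts of 4, 3, and 2 on the three days, matching the trace.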
Common beginner mistakes
- Using `RANK()` instead of `ROW_NUMBER()` — ties produce two rank-1 rows; the grader fails on the duplicate.
- Computing daily counts with `COUNT(*) OVER (...)` only — misses the cumulative-presence test.
- Filtering with `submission_date <= current` but forgetting `DISTINCT submission_date` — repeated submissions on the same day inflate the streak count.
SQL interview question on 15 Days of Learning
Tables Hackers(hacker_id, name), Submissions(submission_date, submission_id, hacker_id, score). For each day from 2016-03-01 to 2016-03-15, print submission_date, total unique hackers in the cumulative streak, and (hacker_id, name) of that day's leader (most submissions; ties → lowest hacker_id).
Solution using ROW_NUMBER and a self-join streak predicate
SELECT
    s1.submission_date,
    (SELECT COUNT(DISTINCT s2.hacker_id)
       FROM Submissions s2
      WHERE s2.submission_date <= s1.submission_date
        AND (SELECT COUNT(DISTINCT s3.submission_date)
               FROM Submissions s3
              WHERE s3.hacker_id = s2.hacker_id
                AND s3.submission_date <= s1.submission_date)
            = DATEDIFF(s1.submission_date, '2016-03-01') + 1) AS streak_hackers,
    winner.hacker_id,
    winner.name
FROM (SELECT DISTINCT submission_date FROM Submissions) s1
JOIN LATERAL (
    SELECT h.hacker_id, h.name
    FROM Submissions w
    JOIN Hackers h ON h.hacker_id = w.hacker_id
    WHERE w.submission_date = s1.submission_date
    GROUP BY h.hacker_id, h.name
    ORDER BY COUNT(*) DESC, h.hacker_id ASC
    LIMIT 1
) winner ON TRUE
ORDER BY s1.submission_date;
Step-by-step trace (3-day mini contest, 4 hackers Ada=1, Bo=2, Cy=3, Di=4):
- Day 1 — all 4 submit. Daily counts: Ada 3, Bo 2, Cy 1, Di 1. Winner: Ada.
- Day 2 — Ada 5, Bo 5, Di 2; Cy skips. The daily count ties Ada with Bo → the secondary `hacker_id ASC` key picks Ada (id 1). Streak count: only Ada, Bo, and Di submitted on both days → 3.
- Day 3 — Bo 4, Ada 1; Di skips. Winner: Bo. Streak count: Ada and Bo submitted all 3 days → 2.
Output:
| submission_date | streak_hackers | hacker_id | name |
|---|---|---|---|
| 2016-03-01 | 4 | 1 | Ada |
| 2016-03-02 | 3 | 1 | Ada |
| 2016-03-03 | 2 | 2 | Bo |
Why this works — concept by concept:
- Per-day winner — `ORDER BY COUNT(*) DESC, hacker_id ASC ... LIMIT 1` inside the lateral subquery plays the `ROW_NUMBER() OVER (PARTITION BY submission_date ...)` role: one deterministic winner per day.
- `LATERAL` join — re-runs the per-day winner subquery in correlation with the outer day, returning a single row each time.
- Streak subquery — counts distinct hackers whose distinct submission days up to the current day equal the day index from the contest start.
- Tie-break — the secondary `hacker_id ASC` key resolves Ada vs Bo on day 2 in Ada's favor.
- Cost — one scan plus a per-day lateral subquery; on a 15-day contest with N submissions, roughly `O(15 × N)` — bounded.
5. CASE Expressions on Hierarchies (Binary Tree Nodes)
Three-branch CASE on parent-child columns for SQL
Binary Tree Nodes is a small but exacting CASE problem: classify each node as Root, Inner, or Leaf using only the parent column. The mental model: Root → P IS NULL; Inner → has children; Leaf → otherwise. The invariant: CASE branch order is load-bearing; put a catch-all test ahead of a more specific one and nodes are silently mislabeled, with no error to warn you.
Pro tip: `CASE` short-circuits top-to-bottom. List the most specific predicate first (`P IS NULL` for Root), the least specific last (Leaf as the `ELSE` branch). Reordering produces wrong labels with no runtime error.
CASE WHEN P IS NULL THEN 'Root' (anchor)
The root node has no parent — P is NULL. Test it first because every other branch assumes P IS NOT NULL.
- `P IS NULL`: the unique root.
- Always the first branch: other branches presume the row has a parent.
- Worked example: node `(N=2, P=NULL)` → Root.
EXISTS (SELECT 1 FROM BST b2 WHERE b2.P = b1.N) for "has children"
Inner nodes have children; leaves don't. The "has children" test is EXISTS against a self-join — efficient and standard.
-
Self-EXISTS: a row in the same table whose
Pmatches the currentN. -
Worked example: node
(N=5, P=2)with two children(N=3, P=5)and(N=6, P=5)→ Inner.
EXISTS (SELECT 1 FROM BST b2 WHERE b2.P = b1.N)
Three-branch CASE in strict order: Root → Inner → Leaf
The order matters. Root first, Inner second, Leaf as ELSE. Reversing Inner and Leaf would label every non-root node as Leaf.
- Branch 1: Root.
- Branch 2: Inner (via EXISTS).
- Branch 3 (ELSE): Leaf.
- Worked example: strict order labels the same 5-node tree consistently.
Common beginner mistakes
- Writing `P = NULL` instead of `P IS NULL` — `NULL = NULL` evaluates to unknown, not true.
- Putting `Leaf` before `Inner` — every Inner gets mislabeled.
- Forgetting to alias the outer table — the EXISTS correlation needs a name to reference.
SQL interview question on Binary Tree Nodes
Table BST(N INT, P INT). Classify each node as Root (no parent), Inner (has children), or Leaf (no children). Output N and the label, ordered by N ascending.
Solution using a three-branch CASE with EXISTS
SELECT
b1.N,
CASE
WHEN b1.P IS NULL THEN 'Root'
WHEN EXISTS (SELECT 1 FROM BST b2 WHERE b2.P = b1.N) THEN 'Inner'
ELSE 'Leaf'
END AS node_type
FROM BST b1
ORDER BY b1.N;
Step-by-step trace (input: 5-node tree N=1..5, parents [2, NULL, 5, 2, 2]):
| N | P |
|---|---|
| 1 | 2 |
| 2 | NULL |
| 3 | 5 |
| 4 | 2 |
| 5 | 2 |
- N=1, P=2 — `P IS NULL`? No. Children of 1? No. → `Leaf`.
- N=2, P=NULL — `P IS NULL`? Yes. → `Root`.
- N=3, P=5 — `P IS NULL`? No. Children of 3? No. → `Leaf`.
- N=4, P=2 — `P IS NULL`? No. Children of 4? No. → `Leaf`.
- N=5, P=2 — `P IS NULL`? No. Children of 5? Yes (N=3 has P=5). → `Inner`.
Output:
| N | node_type |
|---|---|
| 1 | Leaf |
| 2 | Root |
| 3 | Leaf |
| 4 | Leaf |
| 5 | Inner |
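The three-branch CASE can be replayed on the 5-node trace tree in an in-memory SQLite database:

```python
import sqlite3

# The 5-node tree from the trace: N=1..5 with parents [2, NULL, 5, 2, 2].
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE BST (N INT, P INT)")
con.executemany("INSERT INTO BST VALUES (?,?)",
                [(1, 2), (2, None), (3, 5), (4, 2), (5, 2)])
rows = con.execute("""
SELECT b1.N,
       CASE
         WHEN b1.P IS NULL THEN 'Root'
         WHEN EXISTS (SELECT 1 FROM BST b2 WHERE b2.P = b1.N) THEN 'Inner'
         ELSE 'Leaf'
       END AS node_type
FROM BST b1
ORDER BY b1.N
""").fetchall()
print(rows)  # [(1, 'Leaf'), (2, 'Root'), (3, 'Leaf'), (4, 'Leaf'), (5, 'Inner')]
```

Note that node 2 has children yet is labeled Root, because the `P IS NULL` branch fires first — the branch order doing exactly its job.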
Why this works — concept by concept:
- `P IS NULL` — the unique root identifier; placed first because every other branch assumes a parent exists.
- Self-EXISTS — a single correlated test: "does any row in the same table have me as parent?"
- Branch ordering — Root → Inner → Leaf in strict CASE order; reordering produces silent mislabels.
- `ORDER BY N` — the grader expects ascending N; without it, output order is undefined.
- Cost — one outer scan plus one EXISTS probe per row: roughly `O(N log N)` if `P` is indexed.
6. Subqueries and the Output Formatting Trap (Weather Observation Station)
Scalar subqueries plus strict output formatting in SQL
Weather Observation Station 18 looks innocuous — find one row's column based on the max of another — but it filters out otherwise-correct candidates because of one detail: HackerRank's grader is exact-match on stdout. The mental model: query for the value, then format it for the grader. The invariant: ROUND(x, 4) does not add trailing zeros — it returns a numeric value the engine prints in its own default format. Wrap with FORMAT (MySQL/SQL Server) or to_char (PostgreSQL).
Output-strict rule: print `0.00`, never `0` or `.00`. The Glassdoor data engineer who reported "my entire logic was correct but I was filtered out" failed exactly here. Coerce the format every time.
Scalar subquery in WHERE: WHERE x = (SELECT MAX(x) FROM ...)
A scalar subquery returns one value usable in any expression slot. To find the row with the maximum value, filter WHERE x = (SELECT MAX(x) FROM ...).
- Scalar subquery: one row, one column.
- Equality predicate: matches the row whose value equals the global aggregate.
- Worked example: `STATION` filtered to `LAT_N < 137.2345`, then `MAX(LAT_N)` → one scalar; `WHERE LAT_N = scalar` → one row.
WHERE LAT_N = (SELECT MAX(LAT_N) FROM STATION WHERE LAT_N < 137.2345)
FORMAT(ROUND(x, 4), '0.0000') for fixed-decimal output
ROUND(x, 4) rounds the numeric value but returns a number — 1.2 stays 1.2, not 1.2000. To force trailing zeros, wrap with the engine's format function.
- MySQL: `FORMAT(x, 4)` — the second argument is a decimal-place count, and the function adds thousands separators, so prefer `CAST(x AS DECIMAL(10, 4))` for plain fixed-decimal output.
- MS SQL Server: `FORMAT(x, 'N4')`, or a custom picture string like `FORMAT(x, '0.0000')`.
- PostgreSQL: `to_char(x, 'FM999990.0000')`.
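Python's formatter makes the round-vs-format distinction concrete (string formatting stands in for the SQL `FORMAT`/`to_char` layer):

```python
# Rounding returns a number; formatting returns the fixed-width string
# the exact-match grader compares against.
x = -90.1

print(str(round(x, 4)))  # '-90.1'    -- no trailing zeros; fails exact-match
print(f"{x:.4f}")        # '-90.1000' -- fixed four decimals; passes
```

Same value both times; only the second string survives a stdout exact-match check that expects four decimals.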
Cross-engine portability table
HackerRank lets candidates pick MySQL / MS SQL Server / PostgreSQL / Oracle / DB2 per challenge. Knowing the format equivalents is the difference between submitting the right answer in your language vs guessing at one you've never used.
Common beginner mistakes
- Using `ROUND` alone — `1.2` stays `1.2` when the grader expected `1.2000`.
- Calling `FORMAT(x, 4)` in MS SQL Server when you meant `FORMAT(x, 'N4')` — the second argument there is a format string, not a precision.
- Forgetting `COALESCE` for nullable columns — `NULL` prints as the literal string `NULL`, never `0.0000`.
SQL interview question on Weather Observation Station 18
Table STATION(ID, CITY, STATE, LAT_N, LONG_W). Output LONG_W for the row with the largest LAT_N strictly less than 137.2345, formatted to four decimal places.
Solution using a scalar subquery and FORMAT
-- SQL Server-style format string; in MySQL, CAST(ROUND(LONG_W, 4) AS DECIMAL(10, 4)) gives the same fixed-decimal output
SELECT FORMAT(ROUND(LONG_W, 4), '0.0000') AS long_w_fmt
FROM STATION
WHERE LAT_N = (
SELECT MAX(LAT_N)
FROM STATION
WHERE LAT_N < 137.2345
);
Step-by-step trace (input: 4 STATION rows):
| LAT_N | LONG_W |
|---|---|
| 100.0 | -75.123 |
| 130.5 | -80.4567 |
| 137.2 | -90.1 |
| 138.0 | -91.0 |
- Inner subquery — `LAT_N < 137.2345` filters to rows 1, 2, 3 (138.0 rejected). `MAX(LAT_N) = 137.2`.
- Outer WHERE — `LAT_N = 137.2` matches row 3 only.
- Project — `LONG_W = -90.1`.
- `ROUND(-90.1, 4) = -90.1` — still a numeric value, not the string `-90.1000`.
- `FORMAT(-90.1, '0.0000')` — coerces to the literal string `-90.1000`.
Output:
| long_w_fmt |
|---|
| -90.1000 |
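SQLite has neither `FORMAT` nor `to_char`, but its `printf` plays the same role; a sketch of the full Station 18 shape on the 4-row trace:

```python
import sqlite3

# The 4-row STATION trace; printf('%.4f', ...) stands in for FORMAT / to_char.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE STATION (LAT_N REAL, LONG_W REAL)")
con.executemany("INSERT INTO STATION VALUES (?,?)",
                [(100.0, -75.123), (130.5, -80.4567), (137.2, -90.1), (138.0, -91.0)])
row = con.execute("""
SELECT printf('%.4f', ROUND(LONG_W, 4)) AS long_w_fmt
FROM STATION
WHERE LAT_N = (SELECT MAX(LAT_N) FROM STATION WHERE LAT_N < 137.2345)
""").fetchone()
print(row[0])  # -90.1000
```

The scalar subquery pins `LAT_N` to 137.2, the outer equality picks row 3, and `printf` forces the trailing zeros the grader expects.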
Why this works — concept by concept:
- Scalar subquery — `(SELECT MAX(...) WHERE ...)` returns one value, usable in the outer WHERE.
- Strict equality `=` — matches the row whose `LAT_N` equals the filtered max.
- `FORMAT(ROUND(x, 4), '0.0000')` — forces trailing zeros so the stdout matches the grader's expected string.
- Engine portability — the same logic in PostgreSQL needs `to_char`; MySQL's `FORMAT` takes a precision count instead of a format string.
- Cost — two scans of `STATION` (one for the subquery, one for the outer): `O(N)` total.
7. PySpark Dataframe Transforms for HackerRank DE Assessments
SQL → PySpark idioms for the HackerRank platform
Glassdoor reports for the HackerRank-the-company data engineer assessment confirm two coding questions per round: one SQL window function, one PySpark dataframe transform. The PySpark side is graded in a real IDE — runnable code, not pseudocode. The mental model: think SQL, write PySpark, and know the three idiom translations the DataFrame API can't express directly. The invariant: .isin() takes a Python list, not a DataFrame; correlated subqueries don't exist in the API; everything else is a Window.partitionBy(...).orderBy(...).
Pro tip: When `.isin()` would need a DataFrame, rewrite as a left anti-join: `df.alias('a').join(other.alias('b'), col('a.k') == col('b.k'), 'left').where(col('b.k').isNull())`, or pass `how='left_anti'` directly. It scales to billions of rows; `.collect()` doesn't.
Define schema explicitly with StructType and StructField
Schema inference is convenient but flaky on edge cases — a null first row and the column becomes string when you needed int. In a 90-minute assessment, define schema up front.
- `StructType([StructField(name, type, nullable), ...])`: the canonical pattern.
- Worked example: `Submissions(submission_date Date, submission_id Int, hacker_id Int, score Int)` with `nullable=False` on the keys.
from pyspark.sql.types import StructType, StructField, DateType, IntegerType

schema = StructType([
    StructField('submission_date', DateType(), False),
    StructField('submission_id', IntegerType(), False),
    StructField('hacker_id', IntegerType(), False),
    StructField('score', IntegerType(), False),
])
SQL IN (subquery) → PySpark LEFT JOIN ... IS NULL
.isin() accepts a Python list, not a DataFrame. The clean rewrite is a left anti-join: keep rows from the left table that have no match on the right.
- `df.alias('a').join(other.alias('b'), key, 'left').where(col('b.key').isNull())`: preserves left rows without a right match.
- Worked example: `start_dates` not in `end_dates` becomes a left join + `IS NULL` filter.
Window functions: Window.partitionBy(...).orderBy(...)
Window functions translate one-to-one between SQL and PySpark. ROW_NUMBER, RANK, LAG, LEAD, SUM OVER — all use the same Window spec.
- `Window.partitionBy('day').orderBy(F.col('cnt').desc(), F.col('hacker_id').asc())`: mirror the SQL window.
- Worked example: the per-day winner pattern from §4 in PySpark.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('submission_date') \
          .orderBy(F.col('cnt').desc(), F.col('hacker_id').asc())
winners = daily.withColumn('rn', F.row_number().over(w)) \
               .where(F.col('rn') == 1)
Common beginner mistakes
- Calling `.isin(other_df)` — `other_df` isn't iterable; runtime error.
- `.collect()` on a billion-row dataframe — pulls every row to the driver and runs it out of memory.
- Forgetting `alias()` on self-joins — ambiguous column references; the planner can't decide.
PySpark interview question on the anti-join translation
DataFrame projects_df(start_date Date, end_date Date). Return the start_date values that do not appear in any row's end_date. The SQL form is SELECT start_date FROM projects WHERE start_date NOT IN (SELECT end_date FROM projects). Write the PySpark equivalent.
Solution using a left anti-join
from pyspark.sql import functions as F
result = (
projects_df.alias('a')
.join(
projects_df.alias('b'),
F.col('a.start_date') == F.col('b.end_date'),
how='left',
)
.where(F.col('b.end_date').isNull())
.select(F.col('a.start_date'))
)
Step-by-step trace (input: 4 project rows):
| start_date | end_date |
|---|---|
| 2026-01-01 | 2026-01-10 |
| 2026-01-15 | 2026-01-20 |
| 2026-01-25 | 2026-01-30 |
| 2026-02-01 | 2026-02-10 |
- Alias both copies — `a` for the outer dataframe, `b` for the lookup side.
- Left join on `a.start_date = b.end_date` — for each `start_date`, attach any row whose `end_date` matches; otherwise the right side is `NULL`.
- Filter `WHERE b.end_date IS NULL` — keeps `start_date` rows with no matching `end_date`. In this input no `start_date` equals any `end_date`, so all four rows survive.
- Project `a.start_date` — final output.
Output:
| start_date |
|---|
| 2026-01-01 |
| 2026-01-15 |
| 2026-01-25 |
| 2026-02-01 |
Why this works — concept by concept:
- Left anti-join idiom — `LEFT JOIN ... WHERE other.key IS NULL` is the canonical "rows in A not in B" pattern; portable from SQL to PySpark.
- Aliasing — `alias('a')` and `alias('b')` resolve column ambiguity in the self-join.
- No `.collect()` — keeps the work distributed; safe for billions of rows.
- `F.col` discipline — explicit column references avoid string-vs-column ambiguity in the planner.
- Cost — one shuffle on the join key, one filter, one projection: `O(N + M)` distributed.
Tips to crack HackerRank data engineering interviews
Practice the seven canonical clusters before chasing tools
Drill JOINs, multi-table aggregation, GROUP BY/HAVING, window functions, CASE on hierarchies, scalar subqueries, and PySpark dataframe transforms. The HackerRank assessment samples from these — questions that look new are recombinations of these primitives. Rehearsing all seven is more useful than memorizing twenty individual problems.
Match HackerRank's output format byte-for-byte
ROUND rounds, it does not pad. Wrap with CAST(ROUND(x, 4) AS DECIMAL(10, 4)) (MySQL), TO_CHAR(x, 'FM999990.0000') (PostgreSQL), or FORMAT(x, '0.0000') (MS SQL Server) every time the prompt mentions decimal places. Use COALESCE (or ISNULL in MS SQL Server) for nullable columns — NULL prints as the literal string NULL, not 0.
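The same trailing-zero rule, shown in Python for clarity — the grader diffs stdout byte for byte, so padding, not rounding, is what makes the output match:

```python
# Rounding alone drops trailing zeros; fixed-width formatting pads them.
x = 0.25
print(round(x, 4))   # 0.25   -> fails a 4-decimal test case
print(f"{x:.4f}")    # 0.2500 -> matches byte-for-byte
```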
Learn one engine deeply, two well
HackerRank lets candidates pick MySQL, PostgreSQL, MS SQL Server, Oracle, or DB2 per challenge. Pick MySQL as your default; know the PostgreSQL and MS SQL Server equivalents for FORMAT, ROUND, CAST, and COALESCE/ISNULL. Window function syntax is identical across all three.
Read the schema before the prompt
The line "the tables may contain duplicate records" in New Companies changes every COUNT(*) to COUNT(DISTINCT). The line "the secondary sort is by ID, but the result includes UIN but not ID" in Tenured Employees tells you ORDER BY columns don't need to be in SELECT. Read schema notes before the prompt — they encode the half of the test the prose doesn't.
In PySpark, prefer joins over collect()
.isin() accepts only a Python list. For "is in another dataframe", rewrite as a left anti-join: df.alias('a').join(other.alias('b'), key, 'left').where(col('b.key').isNull()). Correlated subqueries don't exist in the DataFrame API — drop down to spark.sql(...) or rewrite the predicate as a join condition.
Practice on PipeCode with real grading
Pattern recognition without practice is a guess. Drill the HackerRank-tagged practice set and the topic-tagged sets for the seven clusters above. PipeCode grades on equivalent results, so you can focus on logic without fighting trailing-zero mismatches — then port the format-strict habits into your HackerRank submission.
Frequently asked questions
What basic SQL skills should freshers practice for HackerRank data engineering questions?
Freshers should drill INNER JOIN with foreign keys, GROUP BY with COUNT/SUM/AVG, WHERE filters with < and =, and ORDER BY with multi-key sorts. The Tenured Employees pattern covers all four primitives in one query, so it's a natural starting point. Freshers also need to internalize that HackerRank grades on stdout exact match — 0 and 0.00 are different answers. Spend an afternoon on the FORMAT and ROUND idioms before attempting any decimal-formatted question.
How do experienced data engineers prepare for HackerRank advanced SQL questions?
Experienced engineers should focus on window functions for streaks and per-day winners (15 Days of Learning SQL), self-EXISTS for hierarchy classification (Binary Tree Nodes), and pre-aggregation patterns to avoid fan-out (New Companies). The senior bar is fluency under time pressure — a 90-minute screen with two questions and a strict grader. Two weeks of drilling advanced SQL problems and reviewing the seven canonical clusters is a realistic prep window for someone with 3+ years of SQL.
Which JOIN patterns appear most often on HackerRank data engineering interviews?
Three JOIN patterns dominate: INNER JOIN on a foreign key (Tenured Employees), four-way INNER JOIN with a related-row equality predicate (Top Competitors), and self-JOIN with EXISTS for hierarchy classification (Binary Tree Nodes). LEFT JOIN with IS NULL is rarer in pure SQL but becomes critical in the PySpark anti-join translation. Drill all four on the HackerRank JOIN practice set before attempting the assessment.
Does HackerRank ask PySpark questions in data engineer interviews?
Yes. Glassdoor reports for the HackerRank company's own data engineer role confirm one PySpark dataframe transform per assessment, alongside one SQL window function question. PySpark questions are coded in a real IDE — runnable Python, not pseudocode — and graded on output equality just like SQL. The most-screened idiom is the SQL IN (subquery) to PySpark left-anti-join translation, so drill that pattern until it's automatic.
What scenario-based SQL questions should I expect on HackerRank?
Scenario-based questions wrap the seven clusters in business framings: leaderboards (Top Competitors → "find users with more than one perfect-score submission"), retention (a streak query like 15 Days of Learning), reporting cubes (New Companies → "count users per tier"), and audit trails (Tenured Employees → "list users matching a tenure filter"). The framing changes; the underlying primitive doesn't. State the mapping out loud during the interview — interviewers reward verbal clarity as much as correct SQL.
Are ETL or data warehousing questions part of HackerRank data engineering assessments?
The hands-on coding rounds are SQL + PySpark; ETL and warehousing concepts (staging tables, OLTP vs OLAP, surrogate keys, star vs snowflake) appear as MCQs in the proctored assessment. The Glassdoor New Delhi report confirms four MCQs covering Hadoop, SQL fundamentals, and data modelling alongside the coding question. Read a data engineering primer the night before to ensure the warehouse vocabulary is fresh.
Start practicing HackerRank data engineering problems
PipeCode pairs HackerRank-tagged problems with tests, AI feedback, and a coding environment so you can drill the exact SQL and PySpark patterns the platform asks — without fighting the trailing-zero grader while you're still learning the logic.
Pipecode.ai is Leetcode for Data Engineering.