ByteDance data engineering interview questions (often discussed alongside TikTok-aligned or group-wide data roles) usually combine SQL and Python in the same interview loop: you need to get the grain right in SQL (joins, aggregation, subqueries, window functions, ranking), then write fast, clear Python over arrays, sets, and strings, with the common two-pointer / search / simulation patterns on harder items.
This guide is structured as a ByteDance data engineering interview prep map: a topic table first, then numbered sections in that order. Within each main topic, every sub-topic has a short explanation, then—where it helps—a worked example with solution, a note on a typical gotcha (problem), and finally a separate interview problem and full solution at the end of the section. All of that is original teaching copy.
Top ByteDance data engineering interview topics
| # | Topic | Why it shows up (hub tag) |
|---|---|---|
| 1 | SQL: joins and aggregation | Joins & Aggregation on company practice, plus fan-out and grain. |
| 2 | SQL: subqueries and aggregation | Aggregation & Subqueries—CTEs, layered filters. |
| 3 | SQL: “last N” and sort order | Sorting + per-key windows. |
| 4 | SQL: window functions + subqueries | Window Functions + CTEs for roll-ups then per-row OVER. |
| 5 | SQL: ranking and ties | Window + ranking and tie policy. |
| 6 | Python: intervals, sort, array | intervals / array / sorting / Data Structures on the export. |
| 7 | Python: sets, string reporting | set / String / reporting-style list logic. |
| 8 | Python: regex and string | regular expression / String Processing. |
| 9 | Python: hash / stream dedup | Hash Table / set in arrival order. |
| 10 | Python: algorithms band | Two Pointers / Binary Search / DP / Math / Simulation on harder items. |
Sections 1–10 below follow this map.
If you are new to SQL: In most databases the engine processes a query roughly in this order:
FROM / joins → WHERE (filter rows) → GROUP BY → compute aggregates → HAVING (filter groups) → window functions → SELECT / ORDER BY. When in doubt, ask: “Am I filtering one row at a time (WHERE) or a whole group after summing (HAVING)?”
1. Joins, aggregation, and fan-out control in SQL for ByteDance screens
Joins and aggregation in SQL for ByteDance-tagged content
Overview: A join links rows on a key so you can enrich fact rows (many events) with dimension fields (one row per user, creator, or item). Aggregation then answers “how many / how much per …?” at the grain the product question names. If you GROUP BY the wrong set of keys, the output row is not the unit the business asked for. A join can also multiply row count: one user with N order rows can become N joined lines, so a one-per-user column (credits, tier, signup bonus) repeats—summing it after the join adds the same value N times unless you use MAX/MIN, or pre-aggregate the fact, or keep one row per key by design. The hub tags this as Joins & Aggregation because both skills must land together.
Sub-topic: INNER vs LEFT and join size
- INNER JOIN: only matching keys on both sides—if the right side has no row for a left key, that left row disappears from the result.
- LEFT OUTER JOIN: keeps every left row. Where the right has no match, right-hand columns are NULL. Use this when you must list all entities (e.g. all users) even if some have no fact in the time window, or to detect “no match” with WHERE right.key IS NULL (anti-join).
Worked example (micro tables + row counts):
users(u_id): u1, u2. events(u_id): only u1 (2 rows e1, e2).
- INNER JOIN: 2 result rows (both events); u2 is absent (no event row to join to).
- FROM users u LEFT JOIN events e ON u.u_id = e.u_id: 3 rows: 2 for u1 (e1, e2) and 1 for u2 with e.* all NULL.
Solution (read pattern): INNER = “only keys that participate in both.” LEFT = “anchor on the left list; right may be empty.”
Typical problem: counting users with a metric after INNER—you silently drop users with no events. If the spec needs all users with 0 for missing, start from users LEFT JOIN facts and use COALESCE(SUM(...),0).
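The drop-and-recover behavior is easy to check in memory; a minimal sketch using Python's sqlite3 on the toy users/events tables above (table and column names as in the example):

```python
import sqlite3

# Toy tables from the worked example: u1 has 2 events, u2 has none.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users(u_id TEXT);
    CREATE TABLE events(u_id TEXT);
    INSERT INTO users VALUES ('u1'), ('u2');
    INSERT INTO events VALUES ('u1'), ('u1');
""")

# INNER drops u2 entirely -- only keys present on both sides survive.
inner = con.execute("""
    SELECT u.u_id, COUNT(*) FROM users u
    JOIN events e ON e.u_id = u.u_id
    GROUP BY u.u_id
""").fetchall()          # [('u1', 2)]

# LEFT keeps u2; COUNT(e.u_id) counts non-NULL matches, so u2 gets 0.
left = con.execute("""
    SELECT u.u_id, COUNT(e.u_id) FROM users u
    LEFT JOIN events e ON e.u_id = u.u_id
    GROUP BY u.u_id ORDER BY u.u_id
""").fetchall()          # [('u1', 2), ('u2', 0)]
```

Note COUNT(e.u_id) rather than COUNT(*) in the LEFT branch: after a LEFT JOIN, COUNT(*) counts the NULL-padded row too and would report 1 for u2.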
Sub-topic: GROUP BY grain and HAVING
GROUP BY defines buckets; aggregates run once per bucket over all rows in that bucket. WHERE is evaluated per input row before grouping; it cannot refer to an aggregate. HAVING filters after GROUP BY and can use SUM, COUNT, etc. Use it when the condition is on a per-group metric (e.g. teams with AOV above a threshold, or more than three orders).
Worked example (aggregate then filter): table t:
| order_id | user_id | amount |
|---|---|---|
| 1 | u1 | 10 |
| 2 | u1 | 20 |
| 3 | u2 | 5 |
SELECT user_id, SUM(amount) AS rev FROM t GROUP BY user_id → u1 → 30, u2 → 5.
HAVING SUM(amount) >= 20 → only u1.
Solution (why HAVING, not WHERE on the sum): the condition uses SUM(amount), so it must run after aggregation:
SELECT user_id, SUM(amount) AS rev
FROM t
GROUP BY user_id
HAVING SUM(amount) >= 20;
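The same toy table can be run end to end with Python's sqlite3; a sketch (table name t as in the example):

```python
import sqlite3

# Table t from the worked example above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(order_id INT, user_id TEXT, amount INT)")
con.executemany("INSERT INTO t VALUES (?,?,?)",
                [(1, "u1", 10), (2, "u1", 20), (3, "u2", 5)])

rows = con.execute("""
    SELECT user_id, SUM(amount) AS rev
    FROM t
    GROUP BY user_id
    HAVING SUM(amount) >= 20
""").fetchall()
# Only u1 (rev 30) survives the per-group filter: [('u1', 30)]
```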
Worked example — fan-out: double-counting a one-per-user column after a join
Tables (grain): orders(order_id, user_id, amt) — one row per order; users(user_id, signup_credit) — one signup_credit per user, 20 for u1. Two orders for u1 with 10 and 10 amt. After orders JOIN users ON user_id, you see 2 lines, each with signup_credit = 20. A naive SUM(u.signup_credit) in that result adds 20 + 20 = 40—wrong for “how much credit does this user have once?”
Fix pattern: do not SUM a dimension that is 1:1 to user across multiple fact rows. Use e.g. MAX(u.signup_credit) with GROUP BY user_id (or join only after you aggregate orders per user, or SUM on order amt only and keep credit as a single picked value). Illustrative “safe” read:
SELECT
o.user_id,
MAX(u.signup_credit) AS credit_once,
SUM(o.amt) AS order_total
FROM orders o
INNER JOIN users u ON u.user_id = o.user_id
GROUP BY o.user_id;
Result for u1: credit_once = 20, order_total = 20.
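A runnable sketch of the fan-out trap, again with sqlite3 (toy data as above):

```python
import sqlite3

# One credit row per user, two order rows for u1 -- the join fans out.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders(order_id INT, user_id TEXT, amt INT);
    CREATE TABLE users(user_id TEXT, signup_credit INT);
    INSERT INTO orders VALUES (1, 'u1', 10), (2, 'u1', 10);
    INSERT INTO users VALUES ('u1', 20);
""")

# Naive: the 1:1 credit column repeats on each joined order line.
naive = con.execute("""
    SELECT SUM(u.signup_credit)
    FROM orders o JOIN users u ON u.user_id = o.user_id
""").fetchone()[0]       # 40, not 20

# Safe: MAX picks the credit once per user; SUM stays on the fact column.
fixed = con.execute("""
    SELECT o.user_id, MAX(u.signup_credit), SUM(o.amt)
    FROM orders o JOIN users u ON u.user_id = o.user_id
    GROUP BY o.user_id
""").fetchall()          # [('u1', 20, 20)]
```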
SQL Interview Question on Creator Watch Revenue with Join and HAVING
Given watch(user_id, video_id, watch_sec, watch_date) (one row per session) and video(video_id, title, creator_id) (one row per video). For watch_date = '2026-04-25' only, return creator_id, total watch_sec that day, and count of watch rows, for creators with at least 2 sessions and total watch_sec > 0.
Solution Using INNER JOIN, GROUP BY, and HAVING
SELECT
v.creator_id,
SUM(w.watch_sec) AS total_sec,
COUNT(*) AS n_sessions
FROM watch w
INNER JOIN video v ON v.video_id = w.video_id
WHERE w.watch_date = DATE '2026-04-25'
GROUP BY v.creator_id
HAVING COUNT(*) >= 2
AND SUM(w.watch_sec) > 0;
Check: The join is 1:1 on video in the fact; aggregation is on (video → creator_id) after join—if the prompt were “per video_id” you would GROUP BY w.video_id (or v.video_id) instead. Say that distinction out loud in interview.
Practice
- SQL · Topic — Joins (all companies)
- COMPANY · ByteDance — sql ByteDance SQL (all difficulties)
- SQL · Hub — Browse all topics
2. Subqueries, CTEs, and layered aggregation in SQL
Subqueries, CTEs, and layered aggregation in SQL
Overview: Many questions are two phases: (1) roll raw events into per (user, day) or per user totals, (2) filter or join those totals. The inner query (or CTE) must output the grain you will filter on; the outer query then applies rules that need the aggregated columns. A CTE (WITH … AS) is only sugar for a subquery in FROM, but it names the step and lets you chain WITH a AS (…), b AS (SELECT … FROM a) …—in a live interview, that clarity often prevents mistakes.
Sub-topic: CTEs vs FROM subquery
They are the same to the engine in most databases; the difference is only naming and readability. Chain subqueries for paper clarity:
WITH a AS (SELECT k, SUM(x) s FROM t GROUP BY k)
SELECT * FROM a WHERE s > 10;
Worked example (two-step story): “Per key k, what is the sum? Then which keys have s > 10?” CTE a is the first report; the outer SELECT is the second filter.
Solution (expected shape): one row per k in a, then the outer WHERE keeps a subset of those rows. Toy data: if t has (k1,5), (k1,6), (k2,3) then a is k1→11, k2→3; WHERE s > 10 leaves k1 only.
Problem (readability in interview): a huge nested FROM (SELECT … FROM (SELECT …) …) is hard to read under time pressure. Solution: name each stage, e.g. WITH step1 AS (…), step2 AS (SELECT … FROM step1) …, so the grader can follow your logic.
Sub-topic: “aggregates, then re-filter”
You cannot reference a new SELECT alias in the same level’s WHERE in standard SQL (order of evaluation). Use HAVING inside the inner aggregate query, wrap it in a CTE, or repeat the expression. The “users who on some day had ≥ 5 events” pattern is: group in the CTE, then WHERE on the exposed column n in the outer query over the CTE’s result set (valid because n is an output column of the CTE).
Worked example (alias trap): inner query is SELECT user_id, SUM(amt) AS rev FROM t GROUP BY user_id. You want WHERE rev > 20 on the result.
- Problem: SELECT * FROM (…) AS x WHERE rev > 20 is valid (the outer query sees column rev). A single-level … GROUP BY user_id WHERE rev > 20 is not (no rev in WHERE for a grouped query—you need HAVING SUM(amt) > 20 at that level).
Solution: either HAVING SUM(amt) > 20 in the same aggregate query, or nest a CTE/subquery and filter WHERE rev > 20 on the outer SELECT.
Worked example — clicks per user per day, then keep busy days
Data:
| user_id | click_date | event_id |
|---|---|---|
| u1 | 2026-01-01 | e1 |
| u1 | 2026-01-01 | e2 |
| u1 | 2026-01-01 | e3 |
| u1 | 2026-01-01 | e4 |
| u1 | 2026-01-01 | e5 |
| u1 | 2026-01-02 | e6 |
Ask: return (user_id, click_date, n) where n >= 5.
Solution:
WITH per AS (
SELECT user_id, click_date, COUNT(*) AS n
FROM clicks
GROUP BY user_id, click_date
)
SELECT * FROM per WHERE n >= 5;
Result: one row: (u1, 2026-01-01, 5).
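A sketch of the same CTE pipeline in Python's sqlite3 (identical table and column names):

```python
import sqlite3

# Five clicks on 2026-01-01 and one on 2026-01-02, as in the table above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clicks(user_id TEXT, click_date TEXT, event_id TEXT)")
rows = [("u1", "2026-01-01", f"e{i}") for i in range(1, 6)]
rows.append(("u1", "2026-01-02", "e6"))
con.executemany("INSERT INTO clicks VALUES (?,?,?)", rows)

busy = con.execute("""
    WITH per AS (
        SELECT user_id, click_date, COUNT(*) AS n
        FROM clicks
        GROUP BY user_id, click_date
    )
    SELECT * FROM per WHERE n >= 5
""").fetchall()
# [('u1', '2026-01-01', 5)] -- the 1-click day is filtered out
```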
SQL Interview Question on High Daily Click Volume with a CTE
Table clicks(user_id, click_date, event_id, …). Build per (user_id, click_date) row counts as n, excluding null user_id in the inner aggregate. Return rows where n >= 5 with columns user_id, click_date, n.
Solution Using a CTE and Post-Aggregation Filter
WITH per_day AS (
SELECT
user_id,
click_date,
COUNT(*) AS n
FROM clicks
WHERE user_id IS NOT NULL
GROUP BY user_id, click_date
)
SELECT user_id, click_date, n
FROM per_day
WHERE n >= 5;
Check: the NOT NULL filter on user_id lives in WHERE inside the CTE (a row filter); n >= 5 filters the already-aggregated CTE output, so a plain WHERE in the outer SELECT is enough.
Practice
- SQL · Topic — Subqueries (all companies)
- COMPANY · ByteDance — sql ByteDance SQL
3. “Last N” per key, ORDER BY, and windows in SQL
“Last N” per key, ORDER BY, and row numbering in SQL
Overview: A global ORDER BY ts DESC at the end of a query sorts all events together; it does not give “last 2 for each user” unless the question asked only for the most recent rows globally. For per-user (or per-partition) “latest / last k”, the sort order must live inside a window: PARTITION BY user_id, ORDER BY ts DESC, usually with ROW_NUMBER or RANK, or a correlated subquery with a per-key LIMIT (dialect-specific). The export tags this Sorting + per-key logic because the hard part is scoping ORDER BY to a partition, not the whole result set.
Sub-topic: ROW_NUMBER and ORDER BY inside OVER
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_ts DESC) labels 1, 2, 3, … within each user_id in that order. ROW_NUMBER never ties; two rows in the same partition get different numbers.
Worked example (trace one partition): for user_id = u1, if order_ts descending gives rows A, B, C, then rn is 1, 2, 3. The k most recent are WHERE rn <= k.
Solution (definition): ROW_NUMBER is the usual choice when you need exactly k rows per partition in SQL that allows window functions (vs ties with RANK).
Sub-topic: Ties in “last N”
If two events share the same order_ts, ORDER BY must add a tie-break (e.g. order_id DESC) or the engine’s order is not guaranteed and your k “last” rows may be ambiguous. The interviewer wants you to state a rule.
Worked example (two rows, same time): add , order_id DESC so the higher id is “more recent” in the product sense.
Solution (tie-break in window): ORDER BY order_ts DESC, order_id DESC.
Problem / solution (toy data): same user, two orders, identical order_ts only: without a second sort key, ROW_NUMBER is still unique but nondeterministic between the two (bad for tests). With ORDER BY order_ts DESC, order_id DESC and rows (id 9, ts T) and (id 10, ts T), row 10 gets rn = 1; 9 gets rn = 2. You can reproduce the same answer on every run.
Worked example — “last 2 orders per user” on toy data
| order_id | user_id | order_ts |
|---|---|---|
| 10 | u1 | 2026-01-02 |
| 9 | u1 | 2026-01-01 |
| 4 | u2 | 2026-01-01 |
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_ts DESC) → for u1: (10 → 1), (9 → 2); for u2: (4 → 1). WHERE rn <= 2 returns 2 rows for u1, 1 for u2.
Solution (query shape): WITH w AS (SELECT …, ROW_NUMBER() OVER (…) AS rn FROM orders) SELECT … FROM w WHERE rn <= 2; — same idiom as the Interview block below.
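The idiom runs as-is on SQLite, whose window-function support arrived in 3.25 (true for the sqlite3 module bundled with any recent CPython); a sketch on the toy orders:

```python
import sqlite3

# The three toy orders from the table above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders(order_id INT, user_id TEXT, order_ts TEXT)")
con.executemany("INSERT INTO orders VALUES (?,?,?)", [
    (10, "u1", "2026-01-02"),
    (9,  "u1", "2026-01-01"),
    (4,  "u2", "2026-01-01"),
])

last2 = con.execute("""
    WITH w AS (
        SELECT order_id, user_id,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id
                   ORDER BY order_ts DESC, order_id DESC
               ) AS rn
        FROM orders
    )
    SELECT order_id, user_id FROM w
    WHERE rn <= 2
    ORDER BY user_id, rn
""").fetchall()
# 2 rows for u1 (newest first), 1 for u2: [(10, 'u1'), (9, 'u1'), (4, 'u2')]
```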
SQL Interview Question on Last Two Orders per User by Time
orders(order_id, user_id, order_ts, amount); all columns not null; for each user_id, return the 2 most recent rows by order_ts, breaking ties on larger order_id (use ROW_NUMBER then filter).
Solution Using ROW_NUMBER and an Outer WHERE on Rank
WITH r AS (
SELECT
order_id,
user_id,
order_ts,
amount,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY order_ts DESC, order_id DESC
) AS rn
FROM orders
)
SELECT order_id, user_id, order_ts, amount
FROM r
WHERE rn <= 2;
Check: WHERE rn <= 2 in the outer query, after the window—that is the idiom for last 2.
Practice
- SQL · Topic — Window functions (SQL)
- COMPANY · ByteDance — windows ByteDance + window functions
- COMPANY · ByteDance — sql ByteDance SQL (company filter)
4. Window functions, partitions, and CTEs in SQL
Window OVER and layered subqueries in SQL
Overview: GROUP BY collapses many rows into one per bucket. OVER (window) keeps each input row and attaches a value computed from a window of rows—usually a partition (same user_id / same day / same month). Many DE questions are layered: first a CTE that produces the right grain (e.g. per day per user), then a second pass that ranks or normalizes within a larger bucket (e.g. share of this day inside the user’s month). Confusing the two is a common failure mode: a plain SUM(amount) in SELECT without GROUP BY is invalid, but SUM(amount) OVER (PARTITION BY …) is valid because the window is per row.
Compare the two shapes: a roll-up produces one row per group; a per-row window keeps all detail and attaches a rank or running value to each row.
Sub-topic: SUM(...) OVER vs SUM(...) GROUP BY
- SUM(x) GROUP BY y — one row per y with a single total. You lose the original rows.
- SUM(x) OVER (PARTITION BY y) — one total repeated on every row in the partition, unless you use a more restrictive frame (e.g. ROWS BETWEEN); used when you need both row-level detail and a subtotal in the same result.
Worked example (conceptual): fact rows for (user, day, revenue).
SELECT
user_id,
day,
revenue,
SUM(revenue) OVER (PARTITION BY user_id) AS user_all_days_rev
FROM fact;
Each row still exists; the window sum is the per-user total across the partition (same value on every row for that user_id).
Sub-topic: Layered CTE + window (subquery + window)
You first materialize a convenient grain in a CTE (e.g. one row per (user, day) with a daily total), then in the outer SELECT you add a window that partitions by something larger (e.g. day to rank users on that day, or user to rank days for that user).
Worked example — “largest order total per calendar day” (toy CTE + window)
orders(order_id, day, user_id, amt); aggregate to per (day, user) tot, then keep #1 per day by tot.
| day | user_id | tot (after GROUP BY) |
|---|---|---|
| D1 | a | 100 |
| D1 | b | 200 |
| D2 | a | 50 |
ROW_NUMBER() OVER (PARTITION BY day ORDER BY tot DESC, user_id ASC) AS r → (D1,b) and (D2,a) have r = 1.
Solution (shape):
WITH per AS (
SELECT day, user_id, SUM(amt) AS tot
FROM orders
GROUP BY day, user_id
)
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY day ORDER BY tot DESC, user_id ASC) AS r
FROM per
) x WHERE r = 1;
Problem to avoid: filtering on r at the same level where the window is defined—window results are not visible to that level’s WHERE. Filter r = 1 in an outer layer or a following CTE, as above.
Sub-topic: Ratios, NULLIF, and “empty month”
Dividing this row by a group sum can hit divide by zero if a partition is empty in theory—but with real fact rows, the usual edge case is sum = 0 (all amount are 0). Use NULLIF(denominator, 0) so the result is NULL instead of an error; say that out loud in interview.
Micro worked example (all zeros in a month): one row for (u1, 2026-04-10, 0) and one for (u1, 2026-04-20, 0). Month sum = 0. So day_share = 0 / NULLIF(0,0). Because NULLIF(0,0) is NULL, the division yields NULL, not an error. Problem without NULLIF: divide by zero on 0/0 when the month sum is 0. Solution: use NULLIF( SUM(…) OVER (…), 0 ) in the denominator.
Worked example — day_share with toy fact rows
Table fact:
| user_id | d (date) | amount |
|---|---|---|
| u1 | 2026-04-10 | 30 |
| u1 | 2026-04-20 | 70 |
| u1 | 2026-05-01 | 100 |
Ask: for each row, add day_share = that row’s amount divided by the sum of amount for the same (user_id, calendar month).
Manual check: For April 2026 and u1, month total = 30 + 70 = 100 → first row’s share = 30/100 = 0.3, second = 0.7. For May, only one row → share = 1.0.
Solution (shape matches interview below): same expression as the Solution Using a Windowed SUM but on this data you should narrate the two April rows splitting the month.
SELECT
user_id,
d,
amount,
amount::float / NULLIF(
SUM(amount) OVER (PARTITION BY user_id, date_trunc('month', d)),
0
) AS day_share
FROM fact;
Problem you avoid: if you accidentally PARTITION BY user_id only, you would divide each day by all-time user revenue—wrong for “within month.”
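In SQLite there is no date_trunc, but strftime('%Y-%m', d) can stand in for the month bucket; a sketch of the same day_share logic on the toy rows (needs SQLite 3.25+ for window functions):

```python
import sqlite3

# Toy fact rows: two April days (30 + 70) and one May day (100) for u1.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact(user_id TEXT, d TEXT, amount REAL)")
con.executemany("INSERT INTO fact VALUES (?,?,?)", [
    ("u1", "2026-04-10", 30.0),
    ("u1", "2026-04-20", 70.0),
    ("u1", "2026-05-01", 100.0),
])

shares = con.execute("""
    SELECT d,
           amount / NULLIF(
               SUM(amount) OVER (PARTITION BY user_id, strftime('%Y-%m', d)),
               0
           ) AS day_share
    FROM fact
    ORDER BY d
""").fetchall()
# April rows split the month 0.3 / 0.7; the lone May row gets 1.0
```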
SQL Interview Question on Day Share Within User-Month
fact(user_id, d, amount) at (user, day) grain. Add column day_share = this row’s amount / sum(amount) over the same (user, calendar month) (PostgreSQL: date_trunc('month', d) in the partition).
Solution Using a Windowed SUM Over User and Month
PostgreSQL-style:
SELECT
user_id,
d,
amount,
amount::float / NULLIF(
SUM(amount) OVER (
PARTITION BY user_id, date_trunc('month', d)
),
0) AS day_share
FROM fact;
Check: NULLIF(…,0) avoids divide by zero if a month is empty. Name date_trunc('month',d) as “month bucket” in interview; adjust for Trino/Spark/ANSI as needed.
Practice
- SQL · Topic — Window functions (SQL)
- COMPANY · ByteDance — windows ByteDance + window functions
5. Window ranking: RANK, DENSE_RANK, and ROW_NUMBER in SQL
Window ranking: RANK, DENSE_RANK, and ROW_NUMBER
Overview: Ranking orders rows inside a partition with rules for ties. Picking the wrong function is how you return two “first place” rows when the business asked for one rep per region, or how you get gaps in “rank 1,2,3” when a dense ladder was intended (leaderboards).
- RANK() — ties share the same rank; the next rank skips (1, 1, 3).
- DENSE_RANK() — ties share the same rank; the next rank does not skip (1, 1, 2).
- ROW_NUMBER() — unique 1, 2, 3, …; no shared positions—add a second ORDER BY key for deterministic ties (e.g. smaller id wins).
Sub-topic: When you need a single winner
If the spec says at most one row per region, use ROW_NUMBER, not RANK—unless the prompt explicitly allows ties (e.g. “all top sellers”).
Worked example (ties in sales): three sellers in one region, revenues (100, 100, 50) in descending order:
| seller | revenue | RANK (DESC) | DENSE_RANK (DESC) | ROW_NUMBER (DESC, id ASC) |
|---|---|---|---|---|
| A (id 1) | 100 | 1 | 1 | 1 |
| B (id 2) | 100 | 1 | 1 | 2 |
| C (id 3) | 50 | 3 | 2 | 3 |
- RANK: two rows at rank 1 → not a unique winner.
- DENSE_RANK: still two rows at 1.
- ROW_NUMBER with ORDER BY revenue DESC, seller_id ASC → exactly one rn = 1 (seller A).
Solution: for “one row per partition” with tie-break, WHERE rn = 1 on ROW_NUMBER is the standard pattern (same as the interview block).
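A compact sqlite3 check of all three functions on the tied (100, 100, 50) rows (illustrative table name s; SQLite 3.25+ for windows):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE s(id INT, revenue INT)")
con.executemany("INSERT INTO s VALUES (?,?)", [(1, 100), (2, 100), (3, 50)])

rows = con.execute("""
    SELECT id,
           RANK()       OVER (ORDER BY revenue DESC)     AS rnk,
           DENSE_RANK() OVER (ORDER BY revenue DESC)     AS drnk,
           ROW_NUMBER() OVER (ORDER BY revenue DESC, id) AS rn
    FROM s ORDER BY id
""").fetchall()
# (id, RANK, DENSE_RANK, ROW_NUMBER):
# [(1, 1, 1, 1), (2, 1, 1, 2), (3, 3, 2, 3)]
```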
Sub-topic: When RANK is right
“All employees in the top compensation band in each office” often means everyone who ties for #1—use RANK or DENSE_RANK and WHERE rank = 1, and do not add a spurious seller_id tie-break unless the product says so.
Worked example (two co-leaders, one region): RANK() OVER (PARTITION BY office_id ORDER BY salary DESC).
| employee_id | office_id | salary |
|---|---|---|
| e1 | O1 | 200000 |
| e2 | O1 | 200000 |
| e3 | O1 | 100000 |
RANK (DESC): e1 and e2 both 1; e3 = 3. Ask: return all who are #1 in salary in O1.
- Solution: WHERE rank_col = 1 after RANK or DENSE_RANK: you return 2 rows (e1, e2). If you had used ROW_NUMBER with a tie-break you would return only 1 row—wrong for this spec.
Problem (wrong function): using ROW_NUMBER … WHERE rn = 1 when the business wanted all tied winners. Fix: RANK or DENSE_RANK and rank = 1, or re-read the spec to allow a single winner.
Worked example — mini sellers table
| seller_id | region_id | revenue |
|---|---|---|
| 10 | R1 | 200 |
| 20 | R1 | 150 |
| 30 | R2 | 200 |
| 40 | R2 | 200 |
Ask: one winner per region_id, higher revenue wins, ties → smaller seller_id.
Result: R1 → (10, 200). R2 → two sellers tie at 200: ROW_NUMBER(…, ORDER BY revenue DESC, seller_id ASC) picks 30 over 40.
Problem if you used RANK only with WHERE rank = 1: you would return two rows for R2. Fix: ROW_NUMBER (as below) or change the spec.
SQL Interview Question on Top Seller per Region with Tie-Breaks
sellers(seller_id, region_id, revenue) one row per (seller, region). Return one row per region_id: max revenue; ties → smaller seller_id.
Solution Using ROW_NUMBER with a Second Sort Key
WITH w AS (
SELECT
seller_id,
region_id,
revenue,
ROW_NUMBER() OVER (
PARTITION BY region_id
ORDER BY revenue DESC, seller_id ASC
) AS rn
FROM sellers
)
SELECT region_id, seller_id, revenue
FROM w
WHERE rn = 1;
Check: RANK would return two rows if two sellers tie max—ROW_NUMBER with seller_id tie break gives one.
Practice
- SQL · Topic — Window functions (SQL)
- COMPANY · ByteDance — windows ByteDance + window functions
6. Python: arrays, sorting, and merge intervals for ByteDance-tagged items
Sorting, sweeping, and merge intervals in Python
Overview: An interval is usually [a,b] inclusive in DE-style problems. Unsorted input is normal—sort by start (then end for stability). Sweep with one current merged segment [cur_s, cur_e]. For the next [s,e]: if s <= cur_e, the intervals touch or overlap—set cur_e = max(cur_e, e); else append a new block and move the cursor. Union length = sum of (end - start + 1) for integer inclusive lengths (watch one-point intervals: [5,5] has length 1).
Sub-topic: Merging intervals
| Step | New interval | Current merge |
|---|---|---|
| After sort | [1,3] | [[1,3]] |
| [2,6] overlaps | extend end | [[1,6]] |
| [8,10] | no overlap | [[1,6],[8,10]] |
Why sort first: the earliest starting interval that could connect to a chain must be processed in order; without sort, a later tiny interval that bridges two clusters could be missed in a single left-to-right pass.
Sub-topic: sort + O(n) scan
Time: O(n log n) to sort, O(n) merge. Space: O(n) for the sorted copy (or in-place sort on a mutable list in Python).
Worked example (complexity only): with n = 1,000,000 intervals, sorting dominates; a single left-to-right merge pass is O(n), so total time is O(n log n). In an interview, say: O(n log n) time, O(n) extra space (or O(1) extra if you may mutate the input list in place).
Problem (common bug): using s < cur_e instead of s <= cur_e, which fails when one interval ends exactly where the next starts if the spec says they merge (e.g. [1,4] and [4,5]—with inclusive touch, many problems still merge to [1,5]). Clarify with the interviewer; the solution below uses overlap as s <= m[-1][1] (merge if the new start is not strictly to the right of the current end).
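The <= vs < choice is easy to unit-test; a minimal sketch (the helper name merged is illustrative, the merge logic is the same as the full solution later in this section):

```python
def merged(intervals):
    # Merge inclusive intervals; s <= cur_e treats touching ends as overlap.
    iv = sorted(intervals)
    m = [list(iv[0])]
    for s, e in iv[1:]:
        if s <= m[-1][1]:
            m[-1][1] = max(m[-1][1], e)
        else:
            m.append([s, e])
    return m

# Touching: [1,4] and [4,5] merge to [1,5]; with s < cur_e they would not.
```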
Worked example — full trace and union length
Input: [[5,5],[1,3],[2,6],[8,10]] (unsorted).
- Sort by start: [[1,3],[2,6],[5,5],[8,10]].
- Merge: [1,3]; next [2,6] → 2 <= 3 → extend → [1,6]. Next [5,5] → 5 <= 6 → still inside → [1,6]. Next [8,10] → 8 > 6 → new block [[1,6],[8,10]].
- Union length (inclusive int): (6−1+1) + (10−8+1) = 6 + 3 = 9.
Solution: this matches the code in the Interview solution below (merged_union_length).
Python Interview Question on Union Length of Merged Intervals
Input: list of inclusive integer intervals [start, end] (unsorted, may overlap). Return the total length of the union after merging (inclusive length per piece: end - start + 1).
Solution Using Sort, Linear Merge, and Summed Lengths
def merged_union_length(intervals):
    if not intervals:
        return 0
    iv = sorted(intervals, key=lambda x: (x[0], x[1]))
    # merged list of [s, e] inclusive
    m = [list(iv[0])]
    for s, e in iv[1:]:
        if s <= m[-1][1]:
            m[-1][1] = max(m[-1][1], e)
        else:
            m.append([s, e])
    return sum(e - s + 1 for s, e in m)
Check: [[1,3],[2,6],[8,10]] → [[1,6],[8,10]] → length 6 + 3 = 9.
Practice
- PYTHON · Topic — Array (all companies)
- COMPANY · ByteDance — array ByteDance + array
- COMPANY · ByteDance — sorting ByteDance + sorting
7. Python: sets, Counter, and string reporting for DE screens
Sets, Counter, and tie-aware reporting in Python
Overview: You combine list iteration, set membership in O(1) expected time, and Counter (or a dict) for frequencies. “Reporting” means turning raw rows into sorted or top-k answers with a defined tie rule. Many panels want the SQL mental model in Python: group by key, order by count DESC, then key ASC for ties.
Sub-topic: collections.Counter
Counter(iterable) gives per-key counts; .items() yields (tag, count). .most_common(k) is convenient when tie rules are only “by count,” but for lex ties among equal counts you must group tags by frequency and sort inside each bucket (see Solution below).
Worked example: Counter("abracadabra") has a:5, b:2, r:2, c:1, d:1. A k-tag report with lex tie: take a (highest f), then within 2s pick b before r if k is large enough, then count 1s: c before d.
Solution (bucket idea): for f in sorted(by_freq, reverse=True): for t in sorted(by_freq[f]): ...
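The bucket idea on the "abracadabra" counts, as a short sketch (variable names illustrative):

```python
from collections import Counter

c = Counter("abracadabra")          # a:5, b:2, r:2, c:1, d:1

# Group tags by frequency, then walk bands high-to-low, lex inside each.
by_freq = {}
for t, f in c.items():
    by_freq.setdefault(f, []).append(t)

report = [t for f in sorted(by_freq, reverse=True) for t in sorted(by_freq[f])]
# ['a', 'b', 'r', 'c', 'd'] -- b before r, c before d within tied bands
```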
Sub-topic: Tie-breaking on labels
Lexicographic on string ids is common: sort keys within a frequency bucket the same way you would in SQL: ORDER BY c DESC, tag ASC.
Worked example: ["a","b","a","a","c"] → counts a:3, b:1, c:1 — top-1 by count is a. If k=2, the next band is count 1: b before c.
Problem: most_common(k) alone does not guarantee lex order for tied counts—fix: explicit sort in each frequency group, as in the Solution block.
Worked example — k picks across frequency bands
Input: ["x","x","y","y","z"] → x:2, y:2, z:1. k=2: need two highest frequency tags: both 2; lex → x then y, never z.
Solution: the interview function returns the list in that order; see the Check line below.
Python Interview Question on Top-k Tags with Lexicographic Tie-Breaks
Input: list of tag strings, integer k. Output: k tags with highest frequency; within a frequency band use lexicographic order; overall descending by frequency (SQL spirit: ORDER BY cnt DESC, tag ASC).
Solution Using Frequency Buckets and Sorted Keys
from collections import Counter

def top_k_tags(tags, k):
    if k <= 0 or not tags:
        return []
    c = Counter(tags)
    by_freq = {}
    for t, f in c.items():
        by_freq.setdefault(f, []).append(t)
    res = []
    for f in sorted(by_freq.keys(), reverse=True):
        for t in sorted(by_freq[f]):
            res.append(t)
            if len(res) == k:
                return res
    return res
Check: ["x","x","y","y","z"], k=2 — both 2: pick x then y (lex) before any single-count tag.
Practice
- PYTHON · Topic — Hash tables & counting
- COMPANY · ByteDance — set ByteDance + set
- COMPANY · ByteDance — python ByteDance Python
8. Python: regular expressions and string tokenization for DE screens
Regular expressions and string tokenization in Python
Overview: The re module finds matches; for integers in text, a pattern like \d+ (maximal unsigned digit runs) or -?\d+ (optional sign) is common. In Python, re.findall returns non-overlapping matches left to right—so "a12b3" with r"\d+" gives ['12','3'], not ['1','2','3']. You must agree on the spec: signed or unsigned? what about an empty string? (usually the sum is 0 when there are no matches). Edge: "12-34" with -?\d+ yields 12 and -34—whether that minus is a sign or a separator depends on the spec; the interview item below intentionally uses unsigned blocks to avoid that ambiguity.
Sub-topic: re.findall vs one manual scan
Findall is compact; a for-loop and isdigit chunks (track start/end) is the same logic without regex—good if the panel says “no re import.” For alphanumeric ids, you might split on non-digits with re.split.
Worked example (tiny spec): s = "a12b3" with r"\d+": tokens 12 and 3 → sum 15. If the spec were “sum of each digit,” you would iterate characters—different problem.
Solution: one pass: sum(int(m) for m in re.findall(r"\d+", s)).
Sub-topic: Zero and multiple blocks
Explanation: with r"\d+", the empty string produces no matches → findall is [], and the sum of an empty list of ints is 0. The string "0" is one block → add 0. The string "10 20" (digits separated by spaces only) still matches two runs "10" and "20" → 10 + 20 = 30; the space breaks the run, so you do not get one number 1020 unless the digits are one contiguous run (e.g. "1020" → a single int 1020).
Worked example (problem and solution): s = "0plus00" → three blocks "0", "0", "0" → 0 + 0 + 0 = 0.
Problem (confused spec): for "a1b1" per-digit you might say 1+1; for block you get "1" and "1"; the sum is still 2, but for "a12b" blocks give 12 while per-digit would give 1+2 = 3. Clarify in interview. Solution in this guide and the Interview below: maximal digit runs (block sum).
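Block sum vs per-digit sum on the "a12b" example above, as a quick sketch:

```python
import re

s = "a12b"
block_sum = sum(int(m) for m in re.findall(r"\d+", s))   # one block "12" -> 12
digit_sum = sum(int(ch) for ch in s if ch.isdigit())     # 1 + 2 -> 3

# On "a1b1" the two specs happen to agree (both give 2), which is why
# the ambiguity is easy to miss until a multi-digit block appears.
```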
Worked example — manual parse vs regex
| Input | re.findall(r"\d+", s) | Sum |
|---|---|---|
| "Price 12x and 3y" | 12, 3 | 15 |
| "no digits" | (empty) | 0 |
| "100200" | 100200 (one block) | 100200 |
Problem if you split on spaces only: "a12b34" would not split; maximal digit run detection needs regex or a state machine—the interview uses regex.
Python Interview Question on Summing Integers Extracted from Text
Input: string with letters and digits. Spec (simple): each maximal contiguous digit block is one unsigned integer to add.
Solution Using re.findall on Digit Runs
import re

def sum_of_ints_in_string(s):
    return sum(int(m) for m in re.findall(r"\d+", s))
Check: "Price 12x and 3y" → 12 + 3 = 15.
If you need signed numbers, many panels use: re.findall(r"-?\d+", s) and sum(int(m) for m in ...) with care for lone -.
Practice
- PYTHON · Topic — String processing (Python)
- COMPANY · ByteDance — python ByteDance Python
9. Python: hash sets, first-seen order, and stream-style deduplication
Hash sets, first-seen order, and one-pass stream logic in Python
Overview: A set stores keys you have seen; a dict can store first index, count, or value. Stream = one pass; arrival order is fixed—do not pre-sort the array when the spec asks for the k-th new (first-occurrence) id in original order. Time O(n), set space O(unique keys) in expectation. A set + list of first-seen keys is a clear way to get 1-based “j-th new”.
Sub-topic: First-seen order with a set
When x not in seen: seen.add(x) and append to first_order. If you need only the k-th new id, return as soon as len(first_order) == k. O(n) time, O(unique) set space.
Worked example: [7,3,7,1] → 1st new 7, 2nd 3, skip duplicate 7, 3rd 1 → [7,3,1].
Solution (mapping to 1-based k): the j-th new id is index j-1 in first_order.
Sub-topic: “At most once” / Counter (different spec)
A Counter and filter for count == 1 answers “ids that appear exactly once”—not the same as “k-th time we first see an id” unless the problem combines them.
Problem with list(dict.fromkeys(arr)): you dedupe in order but do not get an early stop for the k-th new id without the same left-to-right loop as in kth_new_id.
Worked example (Counter “exactly once” — different from k-th new): arr = [1,2,1,3]; Counter is {1:2, 2:1, 3:1}. Ids with count 1 are 2 and 3 (often reported in first-seen order 2, 3 with an extra pass). Solution: that is not the k-th-new spec; for k-th new use set + first-occurrence list below.
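The two specs side by side, sketched (dict.fromkeys keeps first-seen order):

```python
from collections import Counter

arr = [1, 2, 1, 3]

# Spec A: ids appearing exactly once, reported in first-seen order.
c = Counter(arr)
exactly_once = [x for x in dict.fromkeys(arr) if c[x] == 1]   # [2, 3]

# Spec B: the ordered list of first occurrences (k-th new id = index k-1).
first_seen = list(dict.fromkeys(arr))                         # [1, 2, 3]
```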
Worked example — 1-based k (trace)
| step | id | 1-based “new” index |
|---|---|---|
| 1 | a | 1st new = a |
| 2 | b | 2nd new = b |
| 3 | a | (skip) |
| 4 | c | 3rd new = c |
k = 2 → "b"; k = 3 → "c"; k = 4 → "".
Solution: same as kth_new_id in the next Solution block.
Python Interview Question on k-th First-Seen Id in a Stream
Input: ordered list of string ids, integer k. The j-th “new” id is the j-th first occurrence in a left-to-right scan. Return that id for j = k, or "" if there are fewer than k first-seen events.
Solution Using a Set and Ordered First-Seen List
```python
def kth_new_id(arr, k):
    if k <= 0:
        return ""
    seen = set()
    first_order = []
    for x in arr:
        if x not in seen:
            seen.add(x)
            first_order.append(x)
            if len(first_order) == k:
                return x
    return ""
```
Check: ["a","b","a","c"], k=2 → "b"; k=4 → "".
Practice
- PYTHON · Topic — Hash tables & counting
- COMPANY · ByteDance — hash ByteDance + hash table
10. Python: two pointers, binary search, dynamic programming, and simulation on ByteDance hards
Two pointers, binary search, DP, and simulation patterns in Python
Overview: The export bundles algorithm patterns at hard/medium depth. These are separate sub-skills, each deserving a small problem and solution of its own—below are four sub-parts (two pointers, binary search, DP, simulation), each with one end-to-end item.
Sub-part A: two pointers on a sorted array (pair sum)
Idea: Left/right on a sorted list; move inward by comparing nums[l] + nums[r] to the target.
Problem: return 0-based indices (i, j) with i < j and nums[i] + nums[j] == target, or None. nums is strictly increasing (enables the two-pointer proof).
Solution: O(n) two pointers; if nums[l] + nums[r] < target, the sum is too small, so l += 1; if too large, r -= 1.
Solution sketch (code):
```python
def two_sum_sorted(nums, target):
    l, r = 0, len(nums) - 1
    while l < r:
        s = nums[l] + nums[r]
        if s == target:
            return l, r
        if s < target:
            l += 1
        else:
            r -= 1
    return None
```
Check: [1,2,4,6], target 6 → (1,2) (2+4).
Problem (no-solution case): [1,2,3], target 7 → None.
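A common follow-up drops the sorted guarantee. A dict of value → index keeps O(n) time at the cost of O(n) extra space—a sketch with assumed names:

```python
def two_sum_unsorted(nums, target):
    seen = {}  # value -> index of first occurrence
    for j, x in enumerate(nums):
        if target - x in seen:
            return seen[target - x], j
        seen[x] = j
    return None
```

On [6, 2, 4, 1] with target 6 this returns (1, 2), since 2 + 4 = 6.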
Sub-part B: binary search on a monotone prefix
Idea: find the smallest index with prefix_sum[i] >= target on a non-decreasing prefix (or return -1 if impossible).
Problem: prefix_sum = [0, 2, 5, 5, 9] (non-decreasing from cumulative sums), target = 5. Return the leftmost i with prefix_sum[i] >= 5.
Solution: binary search for the lowest index i with prefix_sum[i] >= target (the leftmost “first true” in a monotone predicate). O(log n) time.
Check (manual): valid indices: 2, 3, 4; leftmost is 2. If target = 20, the array maximum is 9, so the answer is -1.
Solution sketch (code):
```python
def first_at_least(prefix_sum, target):
    lo, hi = 0, len(prefix_sum) - 1
    if prefix_sum[hi] < target:
        return -1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_sum[mid] >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo
```
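The same "leftmost first true" search is what the standard library's bisect_left computes; a sketch you can mention as a cross-check (though hand-rolling the loop is usually what the panel wants to see):

```python
from bisect import bisect_left

def first_at_least_bisect(prefix_sum, target):
    # bisect_left returns the leftmost insertion point, i.e. the first
    # index whose value is >= target in a non-decreasing list
    i = bisect_left(prefix_sum, target)
    return i if i < len(prefix_sum) else -1
```

For prefix_sum = [0, 2, 5, 5, 9] it returns 2 at target 5 and -1 at target 20, matching the manual check above.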
Sub-part C: 1-D dynamic programming (Fibonacci-style)
Idea: Stairs with 1 or 2 steps—ways to reach n equals Fibonacci. States ways(0)=1, ways(1)=1.
Problem: you may take 1 or 2 steps at a time. In how many distinct ways can you reach the n-th step? (Ground = step 0; one move reaches step 1.) Use DP in O(n) time and O(1) space (rolling two numbers).
Solution: ways(n) = ways(n-1) + ways(n-2); rolling variables match the code below.
Solution sketch (code):
```python
def count_ways_stairs(n):
    if n <= 1:
        return 1
    a, b = 1, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b
```
Check: n=4 → 5 (Fibonacci).
Check (base cases): the sketch returns 1 for n <= 1 (only one way to stand on step 0 or step 1 in this model).
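If the panel asks for the recursive formulation first, a memoized sketch matches the same recurrence (O(n) space for the cache, unlike the O(1) rolling version; the name ways is ours):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def ways(n):
    # same base cases as the iterative model: ways(0) = ways(1) = 1
    if n <= 1:
        return 1
    return ways(n - 1) + ways(n - 2)
```

ways(4) gives 5, agreeing with the iterative check above.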
Sub-part D: simulation with running state
Idea: one pass over deltas; track cumulative sum and 1-based day index; return the first day the running total is strictly negative, else 0.
Problem: deltas = [1, 2, -1, -2] (days 1–4). Cumulative sums: 1, 3, 2, 0—the running total is never strictly negative, so the return value is 0. For [5, -2, -4], the first strictly negative total appears on day 3 (running sums 5, then 3, then -1).
Solution: single pass; 1-based enumerate(..., start=1) for “day” index.
Solution sketch (code):
```python
def first_negative_day(deltas):
    s = 0
    for i, d in enumerate(deltas, start=1):
        s += d
        if s < 0:
            return i
    return 0
```
Check: [5, -2, -4] → 3 (sums 5, 3, -1).
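The same scan can be phrased with itertools.accumulate—the explicit loop above reads more clearly in an interview, so treat this variant as a cross-check:

```python
from itertools import accumulate

def first_negative_day_acc(deltas):
    # accumulate yields running sums; enumerate gives the 1-based day index
    for i, s in enumerate(accumulate(deltas), start=1):
        if s < 0:
            return i
    return 0
```

It returns 3 for [5, -2, -4] and 0 for [1, 2, -1, -2], matching the loop version.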
How this section maps to “problem + solution”: A–D are four separate sub-topics. Each has an explained idea, a stated problem, a solution (code sketch), and a check—the same pattern as sections 1–9, except there is one combined Interview block per sub-part instead of a single long question at the very end. Harder ByteDance items may blend two of these patterns in one prompt; here they are split for clear practice.
Practice
- PYTHON · Topic — Two pointers
- COMPANY · ByteDance — two pointers ByteDance + two pointers
- PYTHON · Topic — Binary search
- PYTHON · Topic — Dynamic programming
- COMPANY · ByteDance — hard ByteDance (hard)
- COMPANY · ByteDance — python ByteDance Python
Tips to crack ByteDance data engineering interviews
Data engineering interview preparation for a TikTok-aligned ByteDance loop is less about one-off LeetCode grinds and more about repeatable lanes: SQL grain, joins, windows and “last N,” then Python structures and the hard-band patterns in section 10. Pair those reps with a calm explain-aloud habit on WHERE vs HAVING and on stream order in Python.
SQL preparation, grain, and explain-aloud
Drill WHERE vs HAVING, join fan-out, and window partition keys on paper. Ask "one row of output = what?" before you code. In hairy questions, use one CTE per logical block so the panel can follow you—the same structure as our other company interview guides on PipeCode, with the ByteDance company filter.
Where to practice on PipeCode
| Lane | ByteDance-focused path |
|---|---|
| Company hub | Browse ByteDance-tagged problems |
| SQL | ByteDance SQL |
| Python | ByteDance Python |
| Easy / Medium / Hard | Easy · Medium · Hard |
| All topics (site) | Practice topics |
Communication under time pressure
State assumptions in one pass: ties, NULLs, inclusive bounds, time zone. If stuck, offer a smaller version of the problem first. Interviewers often reward a defensible O(n log n) solution with a clean story over a shaky “optimal” idea you cannot finish.
Courses that line up: SQL for data engineering interviews, Python fundamentals for data engineering, plus Data modeling for data engineering interviews and ETL system design for data engineering interviews for verbal design rounds.
Frequently asked questions
What is the ByteDance data engineering interview process?
It often includes a screening conversation, SQL and Python technical rounds, and later system design / data modeling and behavioral discussions. Pipelines, warehouses, and ETL reliability come up in verbal form; treat our ETL and data modeling resources as add-ons to your company-tagged practice.
What should I study for a ByteDance data engineering interview: SQL, Python, or modeling?
Prioritize SQL and Python for coding rounds. Sections 1–10 in this guide track the topic table and the per-problem tag mix on the ByteDance set (5 SQL / 15 Python in the reference export). Add dimensional modeling and ETL for verbal design when the loop includes it.
How hard are ByteDance data engineering interview questions?
Difficulty varies by loop and team, but the PipeCode ByteDance set is published as 8 easy, 8 medium, and 4 hard—a practical ladder. Start easy, then medium as your default, and add hard for timed proof when basics are solid.
Does TikTok use the same data engineering interview style as ByteDance?
TikTok is part of the ByteDance group; many candidates use TikTok and ByteDance interview prep interchangeably for data engineer roles, with SQL, Python, and data-intensive system discussion in common. Always confirm role-specific requirements in your recruiter materials.
How many practice problems should I do before a ByteDance data engineering interview?
A common target is on the order of 30–50 problems across topics—not only the 20 company-tagged items—so weak areas get reps. Mix ByteDance filters with the global SQL and Python libraries.
What are common ByteDance data engineering interview questions in SQL and Python?
They follow the tag families in the topic list and sections 1–10 in this guide—joins/aggregation, subqueries/last N, windows/ranking, and the Python patterns at the end. All new problem text here is for teaching; the wording of Premium hub items is not copied.
Start practicing ByteDance data engineering problems
PipeCode pairs company-tagged ByteDance and TikTok data engineering problems with tests and feedback so you move from reading patterns to typing your own solutions.
Pipecode.ai is Leetcode for Data Engineering