Gowtham Potureddi
ByteDance Data Engineering Interview Questions Guide

ByteDance data engineering interview questions (often discussed alongside TikTok-aligned or group-wide data roles) usually combine SQL and Python in the same loop: you need to get the grain right (joins, aggregation, subqueries, window functions and ranking), then show fast, clear code on arrays, sets, strings, and common two-pointer / search / simulation styles on harder items.

This guide is structured as a ByteDance data engineering interview prep map: a topic table first, then numbered sections in that order. Within each main topic, every sub-topic has a short explanation, then—where it helps—a worked example with solution, a note on a typical gotcha (problem), and finally a separate interview problem and full solution at the end of the section. All of that is original teaching copy.

Header: ByteDance and TikTok data engineering interview prep with SQL and Python themes on a dark gradient.

Top ByteDance data engineering interview topics

| # | Topic | Why it shows up (hub tag) |
|---|-------|---------------------------|
| 1 | SQL: joins and aggregation | Joins & Aggregation on company practice, plus fan-out and grain. |
| 2 | SQL: subqueries and aggregation | Aggregation & Subqueries: CTEs, layered filters. |
| 3 | SQL: “last N” and sort order | Sorting + per-key windows. |
| 4 | SQL: window functions + subqueries | Window Functions + CTEs for roll-ups, then per-row OVER. |
| 5 | SQL: ranking and ties | Window + ranking and tie policy. |
| 6 | Python: intervals, sort, array | intervals / array / sorting / Data Structures on the export. |
| 7 | Python: sets, string reporting | set / String / reporting-style list logic. |
| 8 | Python: regex and string | regular expression / String Processing. |
| 9 | Python: hash / stream dedup | Hash Table / set in arrival order. |
| 10 | Python: algorithms band | Two Pointers / Binary Search / DP / Math / Simulation on harder items. |

Sections 1–10 below follow this map.

If you are new to SQL: In most databases the engine processes a query roughly in this order: FROM / joins → WHERE (filter rows) → GROUP BY → compute aggregates → HAVING (filter groups) → window functions → SELECT / ORDER BY. When in doubt, ask: “Am I filtering one row at a time (WHERE) or a whole group after summing (HAVING)?”
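A quick way to internalize "filter rows before grouping (WHERE)" vs "filter groups after summing (HAVING)" is to run both on a toy table. A minimal sqlite3 sketch, with an illustrative three-row orders table:

```python
import sqlite3

# In-memory toy table; names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (order_id INTEGER, user_id TEXT, amount INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?, ?)",
                [(1, "u1", 10), (2, "u1", 20), (3, "u2", 5)])

# WHERE filters individual rows before grouping: only the 20-amount row survives.
rows_first = con.execute(
    "SELECT user_id, SUM(amount) FROM t WHERE amount >= 20 GROUP BY user_id"
).fetchall()

# HAVING filters whole groups after aggregation: u1's rows sum to 30 first.
groups_after = con.execute(
    "SELECT user_id, SUM(amount) FROM t GROUP BY user_id HAVING SUM(amount) >= 20"
).fetchall()

print(rows_first)    # [('u1', 20)]
print(groups_after)  # [('u1', 30)]
```

The two totals differ (20 vs 30) precisely because one filter runs before the sum and the other after it.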


1. Joins, aggregation, and fan-out control in SQL for ByteDance screens

Joins and aggregation in SQL for ByteDance-tagged content

Overview: A join links rows on a key so you can enrich fact rows (many events) with dimension fields (one row per user, creator, or item). Aggregation then answers “how many / how much per …?” at the grain the product question names. If you GROUP BY the wrong set of keys, the output row is not the unit the business asked for. A join can also multiply row count: one user with N order rows can become N joined lines, so a one-per-user column (credits, tier, signup bonus) repeats; summing it after the join adds the same value N times unless you use MAX/MIN, pre-aggregate the fact table first, or keep one row per key by design. The hub tags this as Joins & Aggregation because both skills must land together.

Sub-topic: INNER vs LEFT and join size

  • INNER JOIN: only matching keys on both sides—if the right side has no row for a left key, that left row disappears from the result.
  • LEFT OUTER JOIN: keeps every left row. Where the right has no match, right-hand columns are NULL. Use this when you must list all entities (e.g. all users) even if some have no fact in the time window, or to detect “no match” with WHERE right.key IS NULL (anti-join).

Worked example (micro tables + row counts):

users(u_id): u1, u2. events(u_id): only u1 (2 rows e1, e2).

  • INNER JOIN: 2 result rows (both events); u2 is absent (no event row to join to).
  • FROM users u LEFT JOIN events e ON u.u_id = e.u_id: 3 rows: 2 for u1 (e1, e2) and 1 for u2 with e.* all NULL.

Solution (read pattern): INNER = “only keys that participate in both.” LEFT = “anchor on the left list; right may be empty.”

Typical problem: counting users with a metric after INNER—you silently drop users with no events. If the spec needs all users with 0 for missing, start from users LEFT JOIN facts and use COALESCE(SUM(...),0).
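The micro tables above can be checked in a few lines of sqlite3; this is a sketch with names mirroring the worked example:

```python
import sqlite3

# Tiny tables from the worked example: users u1, u2; events only for u1.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (u_id TEXT)")
con.execute("CREATE TABLE events (u_id TEXT, e_id TEXT)")
con.executemany("INSERT INTO users VALUES (?)", [("u1",), ("u2",)])
con.executemany("INSERT INTO events VALUES (?, ?)", [("u1", "e1"), ("u1", "e2")])

inner = con.execute(
    "SELECT u.u_id, e.e_id FROM users u INNER JOIN events e ON u.u_id = e.u_id"
).fetchall()
left = con.execute(
    "SELECT u.u_id, e.e_id FROM users u LEFT JOIN events e ON u.u_id = e.u_id"
).fetchall()

print(len(inner))  # 2 -- u2 disappears: no matching event row
print(len(left))   # 3 -- u2 kept, with e_id = None (SQL NULL)
```

The `('u2', None)` row in the LEFT JOIN result is exactly the anchor for the anti-join trick (`WHERE e.u_id IS NULL`).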

Sub-topic: GROUP BY grain and HAVING

GROUP BY defines buckets; aggregates run once per bucket over all rows in that bucket. WHERE is evaluated per input row before grouping; it cannot refer to an aggregate. HAVING filters after GROUP BY and can use SUM, COUNT, etc. Use it when the condition is on a per-group metric (e.g. teams with AOV above a threshold, or more than three orders).

Worked example (aggregate then filter): table t:

| order_id | user_id | amount |
|----------|---------|--------|
| 1 | u1 | 10 |
| 2 | u1 | 20 |
| 3 | u2 | 5 |

SELECT user_id, SUM(amount) AS rev FROM t GROUP BY user_id gives u1 → 30, u2 → 5.

HAVING SUM(amount) >= 20 → only u1.

Solution (why HAVING, not WHERE on the sum): the condition uses SUM(amount), so it must run after aggregation:

SELECT user_id, SUM(amount) AS rev
FROM t
GROUP BY user_id
HAVING SUM(amount) >= 20;

Worked example — fan-out: double-counting a one-per-user column after a join

Tables (grain): orders(order_id, user_id, amt), one row per order; users(user_id, signup_credit), one signup_credit per user, 20 for u1. Two orders for u1 with amt 10 and 10. After orders JOIN users ON user_id, you see 2 lines, each with signup_credit = 20. A naive SUM(u.signup_credit) in that result adds 20 + 20 = 40, which is wrong for “how much credit does this user have, counted once?”

Fix pattern: do not SUM a dimension that is 1:1 to user across multiple fact rows. Use e.g. MAX(u.signup_credit) with GROUP BY user_id (or join only after you aggregate orders per user, or SUM on order amt only and keep credit as a single picked value). Illustrative “safe” read:

SELECT
  o.user_id,
  MAX(u.signup_credit) AS credit_once,
  SUM(o.amt) AS order_total
FROM orders o
INNER JOIN users u ON u.user_id = o.user_id
GROUP BY o.user_id;

Result for u1: credit_once = 20, order_total = 20.
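To see the fan-out bug fire and the fix work side by side, here is a hedged sqlite3 reproduction of exactly this scenario (two u1 orders, one 20-credit dimension row):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, user_id TEXT, amt INTEGER)")
con.execute("CREATE TABLE users (user_id TEXT, signup_credit INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, "u1", 10), (2, "u1", 10)])
con.execute("INSERT INTO users VALUES ('u1', 20)")

# Naive: SUM over the joined rows double-counts the 1:1 credit column.
naive = con.execute(
    "SELECT SUM(u.signup_credit) FROM orders o JOIN users u ON u.user_id = o.user_id"
).fetchone()[0]

# Safe: MAX picks the per-user credit once; SUM stays on the true fact column.
safe = con.execute(
    """SELECT MAX(u.signup_credit), SUM(o.amt)
       FROM orders o JOIN users u ON u.user_id = o.user_id
       GROUP BY o.user_id"""
).fetchone()

print(naive)  # 40 -- 20 counted once per joined order row
print(safe)   # (20, 20)
```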

SQL Interview Question on Creator Watch Revenue with Join and HAVING

Given watch(user_id, video_id, watch_sec, watch_date) (one row per session) and video(video_id, title, creator_id) (one row per video). For watch_date = '2026-04-25' only, return creator_id, total watch_sec that day, and count of watch rows, for creators with at least 2 sessions and total watch_sec > 0.

Solution Using INNER JOIN, GROUP BY, and HAVING

SELECT
  v.creator_id,
  SUM(w.watch_sec) AS total_sec,
  COUNT(*) AS n_sessions
FROM watch w
INNER JOIN video v ON v.video_id = w.video_id
WHERE w.watch_date = DATE '2026-04-25'
GROUP BY v.creator_id
HAVING COUNT(*) >= 2
   AND SUM(w.watch_sec) > 0;

Check: Each watch row matches exactly one video row, so this join adds no fan-out; aggregation runs per creator_id after the join. If the prompt were “per video_id,” you would GROUP BY w.video_id (or v.video_id) instead. Say that distinction out loud in the interview.



2. Subqueries, CTEs, and layered aggregation in SQL

Subqueries, CTEs, and layered aggregation in SQL

Overview: Many questions are two phases: (1) roll raw events into per (user, day) or per user totals, (2) filter or join those totals. The inner query (or CTE) must output the grain you will filter on; the outer query then applies rules that need the aggregated columns. A CTE (WITH … AS) is only sugar for a subquery in FROM, but it names the step and lets you chain WITH a AS (…), b AS (SELECT … FROM a) …—in a live interview, that clarity often prevents mistakes.

Sub-topic: CTEs vs FROM subquery

They are the same to the engine in most databases; the difference is naming and readability. Chain named steps for clarity:

WITH a AS (SELECT k, SUM(x) s FROM t GROUP BY k)
SELECT * FROM a WHERE s > 10;

Worked example (two-step story): “Per key k, what is the sum? Then which keys have s > 10?” CTE a is the first report; the outer SELECT is the second filter.

Solution (expected shape): one row per k in a, then the outer WHERE keeps a subset of those rows. Toy data: if t has (k1,5), (k1,6), (k2,3) then a is k1→11, k2→3; WHERE s > 10 leaves k1 only.

Problem (readability in interview): a huge nested FROM (SELECT … FROM (SELECT …) …) is hard to read under time pressure. Solution: name each stage, e.g. WITH step1 AS (…), step2 AS (SELECT … FROM step1) …, so the grader can follow your logic.

Sub-topic: “aggregates, then re-filter”

You cannot reference a new SELECT alias in that same level’s WHERE in standard SQL (order of evaluation). Use HAVING inside the inner aggregate query, wrap the aggregate in a CTE, or repeat the expression. The “users who on some day had ≥ 5 events” pattern is: group in the CTE, then WHERE on the exposed column n in the outer query over the CTE’s result set (valid because n is an output column of the CTE).

Worked example (alias trap): inner query is SELECT user_id, SUM(amt) AS rev FROM t GROUP BY user_id. You want WHERE rev > 20 on the result.

  • Problem: SELECT * FROM (…) AS x WHERE rev > 20 is valid (the outer query sees column rev). A single-level SELECT user_id, SUM(amt) AS rev … GROUP BY user_id with WHERE rev > 20 is not: rev does not exist yet when WHERE runs at that level; you need HAVING SUM(amt) > 20 instead.

Solution: either HAVING SUM(amt) > 20 in the same aggregate query, or nest a CTE/subquery and filter WHERE rev > 20 on the outer SELECT.

Worked example — clicks per user per day, then keep busy days

Data:

user_id click_date event_id
u1 2026-01-01 e1
u1 2026-01-01 e2
u1 2026-01-01 e3
u1 2026-01-01 e4
u1 2026-01-01 e5
u1 2026-01-02 e6

Ask: return (user_id, click_date, n) where n >= 5.

Solution:

WITH per AS (
  SELECT user_id, click_date, COUNT(*) AS n
  FROM clicks
  GROUP BY user_id, click_date
)
SELECT * FROM per WHERE n >= 5;

Result: one row: (u1, 2026-01-01, 5).
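A runnable sanity check of this CTE-then-filter shape, using sqlite3 in memory (SQLite has long supported WITH; table and data match the toy example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clicks (user_id TEXT, click_date TEXT, event_id TEXT)")
rows = [("u1", "2026-01-01", f"e{i}") for i in range(1, 6)] \
     + [("u1", "2026-01-02", "e6")]
con.executemany("INSERT INTO clicks VALUES (?, ?, ?)", rows)

# CTE builds the (user, day) grain; the outer WHERE filters the aggregated column n.
busy = con.execute(
    """WITH per AS (
         SELECT user_id, click_date, COUNT(*) AS n
         FROM clicks
         GROUP BY user_id, click_date
       )
       SELECT user_id, click_date, n FROM per WHERE n >= 5"""
).fetchall()

print(busy)  # [('u1', '2026-01-01', 5)]
```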

SQL Interview Question on High Daily Click Volume with a CTE

Table clicks(user_id, click_date, event_id, …). Build per (user_id, click_date) row counts as n, excluding null user_id in the inner aggregate. Return rows where n >= 5 with columns user_id, click_date, n.

Solution Using a CTE and Post-Aggregation Filter

WITH per_day AS (
  SELECT
    user_id,
    click_date,
    COUNT(*) AS n
  FROM clicks
  WHERE user_id IS NOT NULL
  GROUP BY user_id, click_date
)
SELECT user_id, click_date, n
FROM per_day
WHERE n >= 5;

Check: The NOT NULL filter on user_id lives in the WHERE inside the CTE (a row filter); n >= 5 filters the CTE’s already-aggregated output, so a plain WHERE on the outer SELECT is enough.



3. “Last N” per key, ORDER BY, and windows in SQL

“Last N” per key, ORDER BY, and row numbering in SQL

Overview: A global ORDER BY ts DESC at the end of a query would sort all events together; it does not give “last 2 for each user” unless the question only asked for the one most recent key globally. For per-user (or per-partition) “latest / last k”, the sort order must be inside a window: PARTITION BY user_id, ORDER BY ts DESC, and usually ROW_NUMBER, RANK, or a subquery with a correlated LIMIT per key (dialect-specific). The export tags this Sorting + per-key logic because the hard part is scoping ORDER BY to a partition, not the whole result set.

Sub-topic: ROW_NUMBER and ORDER BY inside OVER

ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_ts DESC) labels 1, 2, 3, … within each user_id in that order. ROW_NUMBER never ties; two rows in the same partition get different numbers.

Worked example (trace one partition): for user_id = u1, if order_ts descending gives rows A, B, C, then rn is 1, 2, 3. The k most recent are WHERE rn <= k.

Solution (definition): ROW_NUMBER is the usual choice when you need exactly k rows per partition in SQL that allows window functions (vs ties with RANK).

Sub-topic: Ties in “last N”

If two events share the same order_ts, ORDER BY must add a tie-break (e.g. order_id DESC) or the engine’s order is not guaranteed and your k “last” rows may be ambiguous. The interviewer wants you to state a rule.

Worked example (two rows, same time): add , order_id DESC so the higher id is “more recent” in the product sense.

Solution (tie-break in window): ORDER BY order_ts DESC, order_id DESC.

Problem / solution (toy data): same user, two orders, identical order_ts only: without a second sort key, ROW_NUMBER is still unique but nondeterministic between the two (bad for tests). With ORDER BY order_ts DESC, order_id DESC and rows (id 9, ts T) and (id 10, ts T), row 10 gets rn = 1; 9 gets rn = 2. You can reproduce the same answer on every run.

Worked example — “last 2 orders per user” on toy data

| order_id | user_id | order_ts |
|----------|---------|----------|
| 10 | u1 | 2026-01-02 |
| 9 | u1 | 2026-01-01 |
| 4 | u2 | 2026-01-01 |

ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_ts DESC) → for u1: (10 → 1), (9 → 2); for u2: (4 → 1). WHERE rn <= 2 returns 2 rows for u1, 1 for u2.

Solution (query shape): WITH w AS (SELECT …, ROW_NUMBER() OVER (…) AS rn FROM orders) SELECT … FROM w WHERE rn <= 2; — same idiom as the Interview block below.
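The same idiom end to end, sketched in sqlite3 (window functions need SQLite 3.25+, which ships with modern Python builds):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, user_id TEXT, order_ts TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, "u1", "2026-01-02"),
                 (9,  "u1", "2026-01-01"),
                 (4,  "u2", "2026-01-01")])

# Window CTE assigns rn per user; the outer WHERE keeps the 2 most recent rows.
last2 = con.execute(
    """WITH r AS (
         SELECT order_id, user_id,
                ROW_NUMBER() OVER (
                  PARTITION BY user_id
                  ORDER BY order_ts DESC, order_id DESC
                ) AS rn
         FROM orders
       )
       SELECT user_id, order_id, rn FROM r WHERE rn <= 2
       ORDER BY user_id, rn"""
).fetchall()

print(last2)  # [('u1', 10, 1), ('u1', 9, 2), ('u2', 4, 1)]
```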

SQL Interview Question on Last Two Orders per User by Time

orders(order_id, user_id, order_ts, amount); all columns not null; for each user_id, return the 2 most recent rows by order_ts, breaking ties on larger order_id (use ROW_NUMBER then filter).

Solution Using ROW_NUMBER and an Outer WHERE on Rank

WITH r AS (
  SELECT
    order_id,
    user_id,
    order_ts,
    amount,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY order_ts DESC, order_id DESC
    ) AS rn
  FROM orders
)
SELECT order_id, user_id, order_ts, amount
FROM r
WHERE rn <= 2;

Check: WHERE rn <= 2 in the outer query, after the window—that is the idiom for last 2.


Side-by-side diagram comparing SQL GROUP BY roll-up results with per-row window function output using OVER (PARTITION BY).


4. Window functions, partitions, and CTEs in SQL

Window OVER and layered subqueries in SQL

Overview: GROUP BY collapses many rows into one per bucket. OVER (window) keeps each input row and attaches a value computed from a window of rows—usually a partition (same user_id / same day / same month). Many DE questions are layered: first a CTE that produces the right grain (e.g. per day per user), then a second pass that ranks or normalizes within a larger bucket (e.g. share of this day inside the user’s month). Confusing the two is a common failure mode: SUM(amount) next to non-aggregated columns in SELECT without GROUP BY is invalid, but SUM(amount) OVER (PARTITION BY …) is valid because the window value is computed per row.

Compare visually: a roll-up vs per-row window in the image above: left = one row per group; right = all detail kept with a rank or running value.

Sub-topic: SUM(...) OVER vs SUM(...) GROUP BY

  • SUM(x) with GROUP BY y → one row per y with a single total. You lose the original rows.
  • SUM(x) OVER (PARTITION BY y) → one total repeated on every row in the partition (unless you use a more restrictive frame, e.g. ROWS BETWEEN); use it when you need both row-level detail and a subtotal in the same result.

Worked example (conceptual): fact rows for (user, day, revenue).

SELECT
  user_id,
  day,
  revenue,
  SUM(revenue) OVER (PARTITION BY user_id) AS user_all_days_rev
FROM fact;

Each row still exists; the window sum is the per-user total across the partition (same value on every row for that user_id).

Sub-topic: Layered CTE + window (subquery + window)

You first materialize a convenient grain in a CTE (e.g. one row per (user, day) with a daily total), then in the outer SELECT you add a window that partitions by something larger (e.g. day to rank users on that day, or user to rank days for that user).

Worked example — “largest order total per calendar day” (toy CTE + window)

orders(order_id, day, user_id, amt); aggregate to per (day, user) tot, then keep #1 per day by tot.

| day | user_id | tot (after GROUP BY) |
|-----|---------|----------------------|
| D1 | a | 100 |
| D1 | b | 200 |
| D2 | a | 50 |

ROW_NUMBER() OVER (PARTITION BY day ORDER BY tot DESC, user_id ASC) AS r → (D1,b) and (D2,a) get r = 1.

Solution (shape):

WITH per AS (
  SELECT day, user_id, SUM(amt) AS tot
  FROM orders
  GROUP BY day, user_id
)
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY day ORDER BY tot DESC, user_id ASC) AS r
  FROM per
) x WHERE r = 1;

Problem to avoid: filtering on r at the same level where the window is defined; window results are not visible to that level’s WHERE. Filter r = 1 in an outer layer (subquery or CTE, as above).

Sub-topic: Ratios, NULLIF, and “empty month”

Dividing this row by a group sum can hit divide-by-zero. With real fact rows a partition is never truly empty, so the usual edge case is sum = 0 (all amounts in the partition are 0). Use NULLIF(denominator, 0) so the result is NULL instead of an error; say that out loud in the interview.

Micro worked example (all zeros in a month): one row for (u1, 2026-04-10, 0) and one for (u1, 2026-04-20, 0). Month sum = 0. So day_share = 0 / NULLIF(0,0). Because NULLIF(0,0) is NULL, the division yields NULL, not an error. Problem without NULLIF: divide by zero on 0/0 when the month sum is 0. Solution: use NULLIF( SUM(…) OVER (…), 0 ) in the denominator.
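You can verify the NULLIF semantics directly. One caveat: SQLite (used here as a convenient stand-in) quietly returns NULL on division by zero, whereas PostgreSQL raises an error, which is exactly why the guard matters there; the guard's behavior is the same in both:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# NULLIF(a, b) returns NULL when a = b, else a; anything divided by NULL is NULL.
vals = con.execute(
    "SELECT NULLIF(0, 0), NULLIF(5, 0), 30.0 / NULLIF(0, 0)"
).fetchone()
print(vals)  # (None, 5, None) -- the 0/0 case quietly becomes NULL, not an error
```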

Worked example — day_share with toy fact rows

Table fact:

| user_id | d (date) | amount |
|---------|----------|--------|
| u1 | 2026-04-10 | 30 |
| u1 | 2026-04-20 | 70 |
| u1 | 2026-05-01 | 100 |

Ask: for each row, add day_share = that row’s amount divided by the sum of amount for the same (user_id, calendar month).

Manual check: For April 2026 and u1, month total = 30 + 70 = 100 → first row’s share = 30/100 = 0.3, second = 0.7. For May, only one row → share = 1.0.

Solution (shape matches the interview below): the same expression as in “Solution Using a Windowed SUM”; on this data, narrate how the two April rows split the month.

SELECT
  user_id,
  d,
  amount,
  amount::float / NULLIF(
    SUM(amount) OVER (PARTITION BY user_id, date_trunc('month', d)),
    0
  ) AS day_share
FROM fact;

Problem you avoid: if you accidentally PARTITION BY user_id only, you would divide each day by all-time user revenue—wrong for “within month.”
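To check the April/May split mechanically, here is a sqlite3 adaptation of the worked example: strftime('%Y-%m', d) stands in for date_trunc('month', d), and window support needs SQLite 3.25+:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (user_id TEXT, d TEXT, amount REAL)")
con.executemany("INSERT INTO fact VALUES (?, ?, ?)",
                [("u1", "2026-04-10", 30),
                 ("u1", "2026-04-20", 70),
                 ("u1", "2026-05-01", 100)])

# strftime('%Y-%m', d) plays the role of date_trunc('month', d) here.
shares = con.execute(
    """SELECT d,
              amount / NULLIF(
                SUM(amount) OVER (PARTITION BY user_id, strftime('%Y-%m', d)),
                0
              ) AS day_share
       FROM fact ORDER BY d"""
).fetchall()

print(shares)  # [('2026-04-10', 0.3), ('2026-04-20', 0.7), ('2026-05-01', 1.0)]
```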

SQL Interview Question on Day Share Within User-Month

fact(user_id, d, amount) at (user, day) grain. Add column day_share = this row’s amount / sum(amount) over the same (user, calendar month) (PostgreSQL: date_trunc('month', d) in the partition).

Solution Using a Windowed SUM Over User and Month

PostgreSQL-style:

SELECT
  user_id,
  d,
  amount,
  amount::float / NULLIF(
    SUM(amount) OVER (
      PARTITION BY user_id, date_trunc('month', d)
    ),
  0) AS day_share
FROM fact;

Check: NULLIF(…,0) avoids divide by zero if a month is empty. Name date_trunc('month',d) as “month bucket” in interview; adjust for Trino/Spark/ANSI as needed.



5. Window ranking: RANK, DENSE_RANK, and ROW_NUMBER in SQL

Window ranking: RANK, DENSE_RANK, and ROW_NUMBER

Overview: Ranking orders rows inside a partition with rules for ties. Picking the wrong function is how you return two “first place” rows when the business asked for one rep per region, or how you get gaps in “rank 1,2,3” when a dense ladder was intended (leaderboards).

  • RANK() — ties share the same rank; the next rank skips (1,1,3).
  • DENSE_RANK() — ties share the same rank; the next rank does not skip (1,1,2).
  • ROW_NUMBER() — unique 1,2,3…; no shared positions. Add a second ORDER BY key for deterministic ties (e.g. smaller id wins).

Sub-topic: When you need a single winner

If the spec says at most one row per region, use ROW_NUMBER, not RANK—unless the prompt explicitly allows ties (e.g. “all top sellers”).

Worked example (ties in sales): three sellers in one region, revenues (100, 100, 50) in descending order:

| seller | revenue | RANK (DESC) | DENSE_RANK (DESC) | ROW_NUMBER (DESC, id ASC) |
|--------|---------|-------------|-------------------|---------------------------|
| A (id 1) | 100 | 1 | 1 | 1 |
| B (id 2) | 100 | 1 | 1 | 2 |
| C (id 3) | 50 | 3 | 2 | 3 |

  • RANK: two rows at rank 1, so no unique winner.
  • DENSE_RANK: still two rows at 1.
  • ROW_NUMBER with ORDER BY revenue DESC, seller_id ASC: exactly one rn = 1 (seller A).

Solution: for “one row per partition” with tie-break, WHERE rn = 1 on ROW_NUMBER is the standard pattern (same as the interview block).
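All three functions can be compared on the (100, 100, 50) toy data in one query; a sqlite3 sketch (window functions need SQLite 3.25+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sellers (seller_id INTEGER, revenue INTEGER)")
con.executemany("INSERT INTO sellers VALUES (?, ?)", [(1, 100), (2, 100), (3, 50)])

# Same ORDER BY for RANK/DENSE_RANK; ROW_NUMBER adds the seller_id tie-break.
ranks = con.execute(
    """SELECT seller_id,
              RANK()       OVER (ORDER BY revenue DESC) AS rnk,
              DENSE_RANK() OVER (ORDER BY revenue DESC) AS drnk,
              ROW_NUMBER() OVER (ORDER BY revenue DESC, seller_id ASC) AS rn
       FROM sellers ORDER BY seller_id"""
).fetchall()

for row in ranks:
    print(row)
# (1, 1, 1, 1)
# (2, 1, 1, 2)
# (3, 3, 2, 3)
```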

Sub-topic: When RANK is right

“All employees in the top compensation band in each office” often means everyone who ties for #1—use RANK or DENSE_RANK and WHERE rank = 1, and do not add a spurious seller_id tie-break unless the product says so.

Worked example (two co-leaders, one region): RANK() OVER (PARTITION BY office_id ORDER BY salary DESC).

| employee_id | office_id | salary |
|-------------|-----------|--------|
| e1 | O1 | 200000 |
| e2 | O1 | 200000 |
| e3 | O1 | 100000 |

RANK (DESC): e1 and e2 both 1; e3 = 3. Ask: return all who are #1 in salary in O1.

  • Solution: WHERE rank_col = 1 after RANK or DENSE_RANK: you return 2 rows (e1, e2). If you had used ROW_NUMBER with a tie-break you would return only 1 row—wrong for this spec.

Problem (wrong function): using ROW_NUMBER … WHERE rn = 1 when the business wanted all tied winners. Fix: RANK or DENSE_RANK and rank = 1, or re-read the spec to allow a single winner.

Worked example — mini sellers table

| seller_id | region_id | revenue |
|-----------|-----------|---------|
| 10 | R1 | 200 |
| 20 | R1 | 150 |
| 30 | R2 | 200 |
| 40 | R2 | 200 |

Ask: one winner per region_id, higher revenue wins, ties → smaller seller_id.

Result: R1 → (10, 200). R2 → two sellers tie at 200; ROW_NUMBER(… ORDER BY revenue DESC, seller_id ASC) picks 30 over 40.

Problem if you used RANK only with WHERE rank = 1: you would return two rows for R2. Fix: ROW_NUMBER (as below) or change the spec.

SQL Interview Question on Top Seller per Region with Tie-Breaks

sellers(seller_id, region_id, revenue) one row per (seller, region). Return one row per region_id: max revenue; ties → smaller seller_id.

Solution Using ROW_NUMBER with a Second Sort Key

WITH w AS (
  SELECT
    seller_id,
    region_id,
    revenue,
    ROW_NUMBER() OVER (
      PARTITION BY region_id
      ORDER BY revenue DESC, seller_id ASC
    ) AS rn
  FROM sellers
)
SELECT region_id, seller_id, revenue
FROM w
WHERE rn = 1;

Check: RANK would return two rows if two sellers tie max—ROW_NUMBER with seller_id tie break gives one.



6. Python: arrays, sorting, and merge intervals for ByteDance-tagged items

Sorting, sweeping, and merge intervals in Python

Overview: An interval is usually [a,b] inclusive in DE-style problems. Unsorted input is normal—sort by start (then end for stability). Sweep with one current merged segment [cur_s, cur_e]. For the next [s,e]: if s <= cur_e, the intervals touch or overlap—set cur_e = max(cur_e, e); else append a new block and move the cursor. Union length = sum of (end - start + 1) for integer inclusive lengths (watch one-point intervals: [5,5] has length 1).

Sub-topic: Merging intervals

| New interval | Action | Merged so far |
|--------------|--------|---------------|
| [1,3] (after sort) | start first block | [[1,3]] |
| [2,6] | overlaps, extend end | [[1,6]] |
| [8,10] | no overlap, new block | [[1,6],[8,10]] |

Why sort first: the earliest starting interval that could connect to a chain must be processed in order; without sort, a later tiny interval that bridges two clusters could be missed in a single left-to-right pass.

Sub-topic: sort + O(n) scan

Time: O(n log n) to sort, O(n) merge. Space: O(n) for the sorted copy (or in-place sort on a mutable list in Python).

Worked example (complexity only): with n = 1,000,000 intervals, sorting dominates; a single left-to-right merge pass is O(n), so total time is O(n log n). In an interview, say: O(n log n) time, O(n) extra space (or O(1) extra if you may mutate the input list in place).

Problem (common bug): using s < cur_e instead of s <= cur_e, which fails when one interval ends exactly where the next starts if the spec says they merge (e.g. [1,4] and [4,5]—with inclusive touch, many problems still merge to [1,5]). Clarify with the interviewer; the solution below uses overlap as s <= m[-1][1] (merge if the new start is not strictly to the right of the current end).

Worked example — full trace and union length

Input: [[5,5],[1,3],[2,6],[8,10]] (unsorted).

  1. Sort by start: [[1,3],[2,6],[5,5],[8,10]].
  2. Merge: start with [1,3]; next [2,6]: 2 <= 3 → extend → [1,6]. Next [5,5]: 5 <= 6 → still inside → [1,6]. Next [8,10]: 8 > 6 → new block → [[1,6],[8,10]].
  3. Union length (inclusive int): (6−1+1) + (10−8+1) = 6 + 3 = 9.

Solution: this matches the code in the Interview solution below (merged_union_length).

Python Interview Question on Union Length of Merged Intervals

Input: list of inclusive integer intervals [start, end] (unsorted, may overlap). Return the total length of the union after merging (inclusive length per piece: end - start + 1).

Solution Using Sort, Linear Merge, and Summed Lengths

def merged_union_length(intervals):
    if not intervals:
        return 0
    iv = sorted(intervals, key=lambda x: (x[0], x[1]))
    # merged list of [s,e] inclusive
    m = [list(iv[0])]
    for s, e in iv[1:]:
        if s <= m[-1][1]:
            m[-1][1] = max(m[-1][1], e)
        else:
            m.append([s, e])
    return sum(e - s + 1 for s, e in m)

Check: [[1,3],[2,6],[8,10]] → [[1,6],[8,10]] → length 6 + 3 = 9.



7. Python: sets, Counter, and string reporting for DE screens

Sets, Counter, and tie-aware reporting in Python

Overview: You combine list iteration, set membership in O(1) expected time, and Counter (or a dict) for frequencies. “Reporting” means turning raw rows into sorted or top-k answers with a defined tie rule. Many panels want the SQL mental model in Python: group by key, order by count DESC, then key ASC for ties.

Sub-topic: collections.Counter

Counter(iterable) gives per-key counts; .items() yields (tag, count). .most_common(k) is convenient when tie rules are only “by count,” but for lex ties among equal counts you must group tags by frequency and sort inside each bucket (see Solution below).

Worked example: Counter("abracadabra") has a:5, b:2, r:2, c:1, d:1. A k-tag report with lex tie: take a (highest f), then within 2s pick b before r if k is large enough, then count 1s: c before d.

Solution (bucket idea): for f in sorted(by_freq, reverse=True): for t in sorted(by_freq[f]): ...

Sub-topic: Tie-breaking on labels

Lexicographic on string ids is common: sort keys within a frequency bucket the same way you would in SQL: ORDER BY c DESC, tag ASC.

Worked example: ["a","b","a","a","c"] → counts a:3, b:1, c:1 — top-1 by count is a. If k=2, the next band is count 1: b before c.

Problem: most_common(k) alone does not guarantee lex order for tied counts—fix: explicit sort in each frequency group, as in the Solution block.
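A small demonstration of why most_common alone is not enough for lex ties: Counter documents that elements with equal counts come out in first-encounter order, not alphabetical order.

```python
from collections import Counter

tags = ["y", "y", "x", "x"]
c = Counter(tags)

# Equal counts keep first-insertion order, so "y" comes out first
# even though "x" wins lexicographically.
print(c.most_common(2))  # [('y', 2), ('x', 2)]

# An explicit sort enforces the SQL-style tie rule: count DESC, then tag ASC.
fixed = [t for t, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))][:2]
print(fixed)  # ['x', 'y']
```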

Worked example — k picks across frequency bands

Input: ["x","x","y","y","z"] → x:2, y:2, z:1. k=2: the two highest-frequency tags are both at count 2; lex order → x then y, never z.

Solution: the interview function returns the list in that order; see the Check below.

Python Interview Question on Top-k Tags with Lexicographic Tie-Breaks

Input: list of tag strings, integer k. Output: k tags with highest frequency; within a frequency band use lexicographic order; overall descending by frequency (SQL spirit: ORDER BY cnt DESC, tag ASC).

Solution Using Frequency Buckets and Sorted Keys

from collections import Counter

def top_k_tags(tags, k):
    if k <= 0 or not tags:
        return []
    c = Counter(tags)
    by_freq = {}
    for t, f in c.items():
        by_freq.setdefault(f, []).append(t)
    res = []
    for f in sorted(by_freq.keys(), reverse=True):
        for t in sorted(by_freq[f]):
            res.append(t)
            if len(res) == k:
                return res
    return res

Check: ["x","x","y","y","z"], k=2 — both 2: pick x then y (lex) before any single-count tag.



8. Python: regular expressions and string tokenization for DE screens

Regular expressions and string tokenization in Python

Overview: The re module finds matches; for integers in text, a pattern like \d+ (maximal unsigned digit runs) or -?\d+ (optional sign) is common. In Python, re.findall returns non-overlapping matches left to right, so "a12b3" with r"\d+" gives ['12','3'], not ['1','2','3']. You must agree on the spec: signed or unsigned? what does the empty string return? (usually 0 when there are no matches). Edge: "12-34" with -?\d+ yields 12 and -34; the interview item below is intentionally unsigned blocks to avoid that ambiguity.

Sub-topic: re.findall vs one manual scan

Findall is compact; a for-loop and isdigit chunks (track start/end) is the same logic without regex—good if the panel says “no re import.” For alphanumeric ids, you might split on non-digits with re.split.

Worked example (tiny spec): s = "a12b3" with r"\d+": tokens 12 and 3 → sum 15. If the spec were “sum of each digit,” you would iterate characters—different problem.

Solution: one pass: sum(int(m) for m in re.findall(r"\d+", s)).
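If the panel bans re, the same block sum falls out of one manual scan, as the sub-topic above notes; sum_of_ints_no_regex is an illustrative name for this sketch:

```python
def sum_of_ints_no_regex(s):
    """Sum maximal contiguous digit runs with a plain scan, no re import."""
    total = 0
    cur = ""            # digits of the run in progress
    for ch in s:
        if ch.isdigit():
            cur += ch   # extend the current run
        elif cur:
            total += int(cur)  # run ended: flush it
            cur = ""
    if cur:             # flush a run that reaches end-of-string
        total += int(cur)
    return total

print(sum_of_ints_no_regex("a12b3"))  # 15
print(sum_of_ints_no_regex(""))       # 0
```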

Sub-topic: Zero and multiple blocks

Explanation: with r"\d+", the empty string produces no matches → findall is [], and the sum of an empty list of ints is 0. The string "0" is one block → add 0. The string "10 20" (digits separated by spaces) still matches two runs "10" and "20" → 10 + 20 = 30; the space breaks the run, so you do not get one number 1020 unless the digits are one contiguous run (e.g. "1020" → a single int 1020).

Worked example (problem and solution): s = "0plus00" → two blocks "0" and "00" → 0 + 0 = 0.

Problem (confused spec): for "a1b1" per-digit you might say 1+1; for block you get "1" and "1"; the sum is still 2, but for "a12b" blocks give 12 while per-digit would give 1+2 = 3. Clarify in interview. Solution in this guide and the Interview below: maximal digit runs (block sum).

Worked example — manual parse vs regex

| Input | re.findall(r"\d+", s) | Sum |
|-------|------------------------|-----|
| Price 12x and 3y | 12, 3 | 15 |
| no digits | (empty) | 0 |
| 100200 | 100200 (one block) | 100200 |

Problem if you split on spaces only: "a12b34" would not split; maximal digit run detection needs regex or a state machine—the interview uses regex.

Python Interview Question on Summing Integers Extracted from Text

Input: string with letters and digits. Spec (simple): each maximal contiguous digit block is one unsigned integer to add.

Solution Using re.findall on Digit Runs

import re

def sum_of_ints_in_string(s):
    return sum(int(m) for m in re.findall(r"\d+", s))

Check: "Price 12x and 3y" → 12 + 3 = 15.

If you need signed numbers, many panels use: re.findall(r"-?\d+", s) and sum(int(m) for m in ...) with care for lone -.



9. Python: hash sets, first-seen order, and stream-style deduplication

Hash sets, first-seen order, and one-pass stream logic in Python

Overview: A set stores keys you have seen; a dict can store first index, count, or value. Stream = one pass; arrival order is fixed—do not pre-sort the array when the spec asks for the k-th new (first-occurrence) id in original order. Time O(n), set space O(unique keys) in expectation. A set + list of first-seen keys is a clear way to get 1-based “j-th new”.

Diagram of Python hash map frequency counting and set-based deduplication on an array.

Sub-topic: First-seen order with a set

When x not in seen: seen.add(x) and append to first_order. If you need only the k-th new id, return as soon as len(first_order) == k. O(n) time, O(unique) set space.

Worked example: [7,3,7,1] → 1st new 7, 2nd new 3, skip duplicate 7, 3rd new 1 → [7,3,1].

Solution (mapping to 1-based k): the j-th new id is index j-1 in first_order.

Sub-topic: “At most once” / Counter (different spec)

A Counter and filter for count == 1 answers “ids that appear exactly once”—not the same as “k-th time we first see an id” unless the problem combines them.

Problem with list(dict.fromkeys(arr)): you dedupe in order but do not get an early stop for the k-th new id without the same left-to-right loop as in kth_new_id.

Worked example (Counter “exactly once” — different from k-th new): arr = [1,2,1,3]; Counter is {1:2, 2:1, 3:1}. Ids with count 1 are 2 and 3 (often reported in first-seen order 2, 3 with an extra pass). Solution: that is not the k-th-new spec; for k-th new use set + first-occurrence list below.
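A minimal sketch of that Counter spec (exactly_once_ids is an illustrative name, not a hub problem):

```python
from collections import Counter

def exactly_once_ids(arr):
    # count every id, then keep ids whose count is exactly 1,
    # reported in first-seen order via dict.fromkeys
    counts = Counter(arr)
    return [x for x in dict.fromkeys(arr) if counts[x] == 1]
```

Check: [1, 2, 1, 3] → [2, 3].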

Worked example — 1-based k (trace)

step 1: id a → 1st new = a
step 2: id b → 2nd new = b
step 3: id a → (skip, already seen)
step 4: id c → 3rd new = c

k = 2 → "b"; k = 3 → "c"; k = 4 → "".

Solution: same as kth_new_id in the next Solution block.

Python Interview Question on k-th First-Seen Id in a Stream

Input: ordered list of string ids, integer k. The j-th “new” id is the j-th first occurrence in a left-to-right scan. Return that id for j = k, or "" if there are fewer than k first-seen events.

Solution Using a Set and Ordered First-Seen List

def kth_new_id(arr, k):
    if k <= 0:
        return ""
    seen = set()
    first_order = []
    for x in arr:
        if x not in seen:
            seen.add(x)
            first_order.append(x)
            if len(first_order) == k:
                return x
    return ""

Check: ["a","b","a","c"], k=2 → "b"; k=4 → "".

Practice


10. Python: two pointers, binary search, dynamic programming, and simulation on ByteDance hards

Two pointers, binary search, DP, and simulation patterns in Python

Overview: The export bundles algorithm patterns at hard/medium depth, and each pattern is a separate sub-skill. Below are four sub-parts (two pointers, binary search, 1-D DP, and simulation), each with a small stated problem, a solution sketch, and a check.

Sub-part A: two pointers on a sorted array (pair sum)

Idea: Left/right on a sorted list; move inward by comparing nums[l] + nums[r] to the target.

Problem: return 0-based indices (i, j) with i < j and nums[i] + nums[j] == target, or None. nums is strictly increasing (enables the two-pointer proof).

Solution: O(n) two pointers; if nums[l] + nums[r] < target, the sum is too small, so l += 1; if too large, r -= 1.

Solution sketch (code):

def two_sum_sorted(nums, target):
    l, r = 0, len(nums) - 1
    while l < r:
        s = nums[l] + nums[r]
        if s == target:
            return l, r
        if s < target:
            l += 1
        else:
            r -= 1
    return None

Check: [1,2,4,6], target 6 → (1, 2) (values 2 + 4).

Problem (no-solution case): [1,2,3], target 7 → None.

Sub-part B: binary search on a monotone prefix

Idea: find the smallest index with prefix_sum[i] >= target on a non-decreasing prefix (or return -1 if impossible).

Problem: prefix_sum = [0, 2, 5, 5, 9] (non-decreasing from cumulative sums), target = 5. Return the leftmost i with prefix_sum[i] >= 5.

Solution: binary search for the lowest index i with prefix_sum[i] >= target (the leftmost “first true” in a monotone predicate). O(log n) time.

Check (manual): valid indices: 2, 3, 4; leftmost is 2. If target = 20, the array maximum is 9, so the answer is -1.

Solution sketch (code):

def first_at_least(prefix_sum, target):
    lo, hi = 0, len(prefix_sum) - 1
    if prefix_sum[hi] < target:
        return -1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_sum[mid] >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo
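If the panel allows the standard library, bisect_left gives the same leftmost "first true" index on a non-decreasing list; it is worth knowing as a cross-check, though writing the loop by hand is usually the point of the question:

```python
from bisect import bisect_left

def first_at_least_bisect(prefix_sum, target):
    # bisect_left returns the leftmost insertion point for target,
    # which is the first index i with prefix_sum[i] >= target
    i = bisect_left(prefix_sum, target)
    return i if i < len(prefix_sum) else -1
```

Check: [0, 2, 5, 5, 9], target 5 → 2; target 20 → -1.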

Sub-part C: 1-D dynamic programming (Fibonacci-style)

Idea: Stairs with 1 or 2 steps—ways to reach n equals Fibonacci. States ways(0)=1, ways(1)=1.

Problem: you may take 1 or 2 steps at a time. In how many distinct ways can you reach the n-th step? (Ground = step 0; one move reaches step 1.) Use DP in O(n) time and O(1) space (rolling two numbers).

Solution: ways(n) = ways(n-1) + ways(n-2); rolling variables match the code below.

Solution sketch (code):

def count_ways_stairs(n):
    if n <= 1:
        return 1
    a, b = 1, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b

Check: n=4 → 5 (the Fibonacci pattern).

Check (base cases): the sketch returns 1 for n <= 1 (only one way to stand on step 0 or step 1 in this model).
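One way to gain confidence in the rolling-variable DP is to cross-check it against the same recurrence written as plain recursion for small n (the brute-force helper is exponential and for checking only):

```python
def count_ways_stairs(n):
    # O(n) time, O(1) space: roll ways(n-1) and ways(n-2) forward
    if n <= 1:
        return 1
    a, b = 1, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b

def count_ways_brute(n):
    # same recurrence, direct recursion; exponential, test-only
    if n <= 1:
        return 1
    return count_ways_brute(n - 1) + count_ways_brute(n - 2)
```

Check: both agree for n = 0..10, and n = 4 → 5.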

Sub-part D: simulation with running state

Idea: one pass over deltas; track cumulative sum and 1-based day index; return the first day the running total is strictly negative, else 0.

Problem: deltas = [1, 2, -1, -2] (days 1–4). Cumulative sums: 1, 3, 2, 0—the running total is never strictly negative, so the return value is 0. For [5, -2, -4], the first strictly negative total appears on day 3 (running sums 5, then 3, then -1).

Solution: single pass; 1-based enumerate(..., start=1) for “day” index.

Solution sketch (code):

def first_negative_day(deltas):
    s = 0
    for i, d in enumerate(deltas, start=1):
        s += d
        if s < 0:
            return i
    return 0

Check: [5, -2, -4] → 3 (running sums 5, 3, -1).

How this section maps to “problem + solution”: A–D are four separate sub-topics. Each has an explained idea, a stated problem, a solution (code sketch), and a check—the same pattern as sections 1–9, except there is one combined Interview block per sub-part instead of a single long question at the very end. Harder ByteDance items may blend two of these patterns in one prompt; here they are split for clear practice.

Practice


Tips to crack ByteDance data engineering interviews

Data engineering interview preparation for a TikTok-aligned ByteDance loop is less about one-off LeetCode grinds and more about repeatable lanes: SQL grain, joins, windows and “last N,” then Python structures and the hard-band patterns in section 10. Pair those reps with a calm explain-aloud habit on WHERE vs HAVING and on stream order in Python.

SQL preparation, grain, and explain-aloud

Drill WHERE vs HAVING, join fan-out, and window partition keys on paper. Before you code, say aloud: "one row of output = what?" Use one CTE per logical block in hairy questions so the panel can follow you—the same structure as our other company interview guides on PipeCode, with the ByteDance company filter.

Where to practice on PipeCode

Lane ByteDance-focused path
Company hub Browse ByteDance-tagged problems
SQL ByteDance SQL
Python ByteDance Python
Easy / Medium / Hard Easy · Medium · Hard
All topics (site) Practice topics

Simple three-step study ladder for ByteDance data engineering practice: eight easy, eight medium, and four hard problems for PipeCode interview prep.

Communication under time pressure

State assumptions in one pass: ties, NULLs, inclusive bounds, time zone. If stuck, offer a smaller version of the problem first. Interviewers often reward a defensible O(n log n) solution with a clean story over a shaky “optimal” idea you cannot finish.

Courses that line up: SQL for data engineering interviews, Python fundamentals for data engineering, plus Data modeling for data engineering interviews and ETL system design for data engineering interviews for verbal design rounds.


Frequently asked questions

What is the ByteDance data engineering interview process?

It often includes a screening conversation, SQL and Python technical rounds, and later system design / data modeling and behavioral discussions. Pipelines, warehouses, and ETL reliability come up in verbal form; treat our ETL and data modeling resources as add-ons to your company-tagged practice.

What should I study for a ByteDance data engineering interview: SQL, Python, or modeling?

Prioritize SQL and Python for coding rounds. Sections 1–10 in this guide track the topic table and the per-problem tag mix on the ByteDance set (5 SQL / 15 Python in the reference export). Add dimensional modeling and ETL for verbal design when the loop includes it.

How hard are ByteDance data engineering interview questions?

Difficulty varies by loop and team, but the PipeCode ByteDance set is published as 8 easy, 8 medium, and 4 hard—a practical ladder. Start easy, then medium as your default, and add hard for timed proof when basics are solid.

Does TikTok use the same data engineering interview style as ByteDance?

TikTok is part of the ByteDance group; many candidates use TikTok and ByteDance interview prep interchangeably for data engineer roles, with SQL, Python, and data-intensive system discussion in common. Always confirm role-specific requirements in your recruiter materials.

How many practice problems should I do before a ByteDance data engineering interview?

A common target is on the order of 30–50 problems across topics—not only the 20 company-tagged items—so weak areas get reps. Mix ByteDance filters with the global SQL and Python libraries.

What are common ByteDance data engineering interview questions in SQL and Python?

They follow the tag families in the topic list and sections 1–10 in this guide: joins/aggregation, subqueries/"last N," windows/ranking, and the Python patterns at the end. All new problem text here is for teaching; the wording of Premium hub items is not copied.

Start practicing ByteDance data engineering problems

PipeCode pairs company-tagged ByteDance and TikTok data engineering problems with tests and feedback so you move from reading patterns to typing your own solutions.

Pipecode.ai is Leetcode for Data Engineering

Browse ByteDance practice →
View plans →
