Gowtham Potureddi
ByteDance Data Engineering Interview Questions Guide

ByteDance data engineering interview questions (often discussed alongside TikTok-aligned or group-wide data roles) usually combine SQL and Python in the same loop: you need to get the grain right (joins, aggregation, subqueries, window functions and ranking), then show fast, clear code on arrays, sets, strings, and common two-pointer / search / simulation styles on harder items.

This guide is structured as a ByteDance data engineering interview prep map: a topic table first, then numbered sections in that order. Within each main topic, every sub-topic has a short explanation, then—where it helps—a worked example with solution, a note on a typical gotcha (problem), and finally a separate interview problem and full solution at the end of the section. All of that is original teaching copy.

Header: ByteDance and TikTok data engineering interview prep with SQL and Python themes on a dark gradient.

Top ByteDance data engineering interview topics

| # | Topic | Why it shows up (hub tag) |
|---|-------|---------------------------|
| 1 | SQL: joins and aggregation | Joins & Aggregation on company practice, plus fan-out and grain. |
| 2 | SQL: subqueries and aggregation | Aggregation & Subqueries: CTEs, layered filters. |
| 3 | SQL: “last N” and sort order | Sorting + per-key windows. |
| 4 | SQL: window functions + subqueries | Window Functions + CTEs for roll-ups, then per-row OVER. |
| 5 | SQL: ranking and ties | Window + ranking and tie policy. |
| 6 | Python: intervals, sort, array | intervals / array / sorting / Data Structures on the export. |
| 7 | Python: sets, string reporting | set / String / reporting-style list logic. |
| 8 | Python: regex and string | regular expression / String Processing. |
| 9 | Python: hash / stream dedup | Hash Table / set in arrival order. |
| 10 | Python: algorithms band | Two Pointers / Binary Search / DP / Math / Simulation on harder items. |

Sections 1–10 below follow this map.

If you are new to SQL: In most databases the engine processes a query roughly in this order: FROM / joins → WHERE (filter rows) → GROUP BY → compute aggregates → HAVING (filter groups) → window functions → SELECT / ORDER BY. When in doubt, ask: “Am I filtering one row at a time (WHERE) or a whole group after summing (HAVING)?”
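A quick way to internalize "filter rows before grouping (WHERE)" vs "filter groups after summing (HAVING)" is to run both on a toy table. A minimal sqlite3 sketch, with an illustrative three-row orders table:

```python
import sqlite3

# In-memory toy table; names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (order_id INTEGER, user_id TEXT, amount INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?, ?)",
                [(1, "u1", 10), (2, "u1", 20), (3, "u2", 5)])

# WHERE filters individual rows before grouping: only the 20-amount row survives.
rows_first = con.execute(
    "SELECT user_id, SUM(amount) FROM t WHERE amount >= 20 GROUP BY user_id"
).fetchall()

# HAVING filters whole groups after aggregation: u1's rows sum to 30 first.
groups_after = con.execute(
    "SELECT user_id, SUM(amount) FROM t GROUP BY user_id HAVING SUM(amount) >= 20"
).fetchall()

print(rows_first)    # [('u1', 20)]
print(groups_after)  # [('u1', 30)]
```

The two totals differ (20 vs 30) precisely because one filter runs before the sum and the other after it.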


1. Joins, aggregation, and fan-out control in SQL for ByteDance screens

Joins and aggregation in SQL for ByteDance-tagged content

Overview: A join links rows on a key so you can enrich fact rows (many events) with dimension fields (one row per user, creator, or item). Aggregation then answers “how many / how much per …?” at the grain the product question names. If you GROUP BY the wrong set of keys, the output row is not the unit the business asked for. A join can also multiply row count: one user with N order rows can become N joined lines, so a one-per-user column (credits, tier, signup bonus) repeats; summing it after the join adds the same value N times unless you use MAX/MIN, pre-aggregate the fact table first, or keep one row per key by design. The hub tags this as Joins & Aggregation because both skills must land together.

Sub-topic: INNER vs LEFT and join size

  • INNER JOIN: only matching keys on both sides—if the right side has no row for a left key, that left row disappears from the result.
  • LEFT OUTER JOIN: keeps every left row. Where the right has no match, right-hand columns are NULL. Use this when you must list all entities (e.g. all users) even if some have no fact in the time window, or to detect “no match” with WHERE right.key IS NULL (anti-join).

Worked example (micro tables + row counts):

users(u_id): u1, u2. events(u_id): only u1 (2 rows e1, e2).

  • INNER JOIN: 2 result rows (both events); u2 is absent (no event row to join to).
  • FROM users u LEFT JOIN events e ON u.u_id = e.u_id: 3 rows: 2 for u1 (e1, e2) and 1 for u2 with e.* all NULL.

Solution (read pattern): INNER = “only keys that participate in both.” LEFT = “anchor on the left list; right may be empty.”

Typical problem: counting users with a metric after INNER—you silently drop users with no events. If the spec needs all users with 0 for missing, start from users LEFT JOIN facts and use COALESCE(SUM(...),0).
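The micro tables above can be checked in a few lines of sqlite3; this is a sketch with names mirroring the worked example:

```python
import sqlite3

# Tiny tables from the worked example: users u1, u2; events only for u1.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (u_id TEXT)")
con.execute("CREATE TABLE events (u_id TEXT, e_id TEXT)")
con.executemany("INSERT INTO users VALUES (?)", [("u1",), ("u2",)])
con.executemany("INSERT INTO events VALUES (?, ?)", [("u1", "e1"), ("u1", "e2")])

inner = con.execute(
    "SELECT u.u_id, e.e_id FROM users u INNER JOIN events e ON u.u_id = e.u_id"
).fetchall()
left = con.execute(
    "SELECT u.u_id, e.e_id FROM users u LEFT JOIN events e ON u.u_id = e.u_id"
).fetchall()

print(len(inner))  # 2 -- u2 disappears: no matching event row
print(len(left))   # 3 -- u2 kept, with e_id = None (SQL NULL)
```

The `('u2', None)` row in the LEFT JOIN result is exactly the anchor for the anti-join trick (`WHERE e.u_id IS NULL`).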

Sub-topic: GROUP BY grain and HAVING

GROUP BY defines buckets; aggregates run once per bucket over all rows in that bucket. WHERE is evaluated per input row before grouping; it cannot refer to an aggregate. HAVING filters after GROUP BY and can use SUM, COUNT, etc. Use it when the condition is on a per-group metric (e.g. teams with AOV above a threshold, or more than three orders).

Worked example (aggregate then filter): table t:

| order_id | user_id | amount |
|----------|---------|--------|
| 1 | u1 | 10 |
| 2 | u1 | 20 |
| 3 | u2 | 5 |

SELECT user_id, SUM(amount) AS rev FROM t GROUP BY user_id gives u1 → 30, u2 → 5.

HAVING SUM(amount) >= 20 → only u1.

Solution (why HAVING, not WHERE on the sum): the condition uses SUM(amount), so it must run after aggregation:

SELECT user_id, SUM(amount) AS rev
FROM t
GROUP BY user_id
HAVING SUM(amount) >= 20;

Worked example — fan-out: double-counting a one-per-user column after a join

Tables (grain): orders(order_id, user_id, amt), one row per order; users(user_id, signup_credit), one signup_credit per user, 20 for u1. Two orders for u1 with amt 10 and 10. After orders JOIN users ON user_id, you see 2 lines, each with signup_credit = 20. A naive SUM(u.signup_credit) in that result adds 20 + 20 = 40, which is wrong for “how much credit does this user have, counted once?”

Fix pattern: do not SUM a dimension that is 1:1 to user across multiple fact rows. Use e.g. MAX(u.signup_credit) with GROUP BY user_id (or join only after you aggregate orders per user, or SUM on order amt only and keep credit as a single picked value). Illustrative “safe” read:

SELECT
  o.user_id,
  MAX(u.signup_credit) AS credit_once,
  SUM(o.amt) AS order_total
FROM orders o
INNER JOIN users u ON u.user_id = o.user_id
GROUP BY o.user_id;

Result for u1: credit_once = 20, order_total = 20.
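To see the fan-out bug fire and the fix work side by side, here is a hedged sqlite3 reproduction of exactly this scenario (two u1 orders, one 20-credit dimension row):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, user_id TEXT, amt INTEGER)")
con.execute("CREATE TABLE users (user_id TEXT, signup_credit INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(1, "u1", 10), (2, "u1", 10)])
con.execute("INSERT INTO users VALUES ('u1', 20)")

# Naive: SUM over the joined rows double-counts the 1:1 credit column.
naive = con.execute(
    "SELECT SUM(u.signup_credit) FROM orders o JOIN users u ON u.user_id = o.user_id"
).fetchone()[0]

# Safe: MAX picks the per-user credit once; SUM stays on the true fact column.
safe = con.execute(
    """SELECT MAX(u.signup_credit), SUM(o.amt)
       FROM orders o JOIN users u ON u.user_id = o.user_id
       GROUP BY o.user_id"""
).fetchone()

print(naive)  # 40 -- 20 counted once per joined order row
print(safe)   # (20, 20)
```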

SQL Interview Question on Creator Watch Revenue with Join and HAVING

Given watch(user_id, video_id, watch_sec, watch_date) (one row per session) and video(video_id, title, creator_id) (one row per video). For watch_date = '2026-04-25' only, return creator_id, total watch_sec that day, and count of watch rows, for creators with at least 2 sessions and total watch_sec > 0.

Solution Using INNER JOIN, GROUP BY, and HAVING

SELECT
  v.creator_id,
  SUM(w.watch_sec) AS total_sec,
  COUNT(*) AS n_sessions
FROM watch w
INNER JOIN video v ON v.video_id = w.video_id
WHERE w.watch_date = DATE '2026-04-25'
GROUP BY v.creator_id
HAVING COUNT(*) >= 2
   AND SUM(w.watch_sec) > 0;

Check: Each watch row matches exactly one video row, so this join adds no fan-out; aggregation runs per creator_id after the join. If the prompt were “per video_id,” you would GROUP BY w.video_id (or v.video_id) instead. Say that distinction out loud in the interview.



2. Subqueries, CTEs, and layered aggregation in SQL

Subqueries, CTEs, and layered aggregation in SQL

Overview: Many questions are two phases: (1) roll raw events into per (user, day) or per user totals, (2) filter or join those totals. The inner query (or CTE) must output the grain you will filter on; the outer query then applies rules that need the aggregated columns. A CTE (WITH … AS) is only sugar for a subquery in FROM, but it names the step and lets you chain WITH a AS (…), b AS (SELECT … FROM a) …—in a live interview, that clarity often prevents mistakes.

Sub-topic: CTEs vs FROM subquery

They are the same to the engine in most databases; the difference is naming and readability. Chain named steps for clarity:

WITH a AS (SELECT k, SUM(x) s FROM t GROUP BY k)
SELECT * FROM a WHERE s > 10;

Worked example (two-step story): “Per key k, what is the sum? Then which keys have s > 10?” CTE a is the first report; the outer SELECT is the second filter.

Solution (expected shape): one row per k in a, then the outer WHERE keeps a subset of those rows. Toy data: if t has (k1,5), (k1,6), (k2,3) then a is k1→11, k2→3; WHERE s > 10 leaves k1 only.

Problem (readability in interview): a huge nested FROM (SELECT … FROM (SELECT …) …) is hard to read under time pressure. Solution: name each stage, e.g. WITH step1 AS (…), step2 AS (SELECT … FROM step1) …, so the grader can follow your logic.

Sub-topic: “aggregates, then re-filter”

You cannot reference a new SELECT alias in that same level’s WHERE in standard SQL (order of evaluation). Use HAVING inside the inner aggregate query, wrap the aggregate in a CTE, or repeat the expression. The “users who on some day had ≥ 5 events” pattern is: group in the CTE, then WHERE on the exposed column n in the outer query over the CTE’s result set (valid because n is an output column of the CTE).

Worked example (alias trap): inner query is SELECT user_id, SUM(amt) AS rev FROM t GROUP BY user_id. You want WHERE rev > 20 on the result.

  • Problem: SELECT * FROM (…) AS x WHERE rev > 20 is valid (the outer query sees column rev). A single-level SELECT user_id, SUM(amt) AS rev … GROUP BY user_id with WHERE rev > 20 is not: rev does not exist yet when WHERE runs at that level; you need HAVING SUM(amt) > 20 instead.

Solution: either HAVING SUM(amt) > 20 in the same aggregate query, or nest a CTE/subquery and filter WHERE rev > 20 on the outer SELECT.

Worked example — clicks per user per day, then keep busy days

Data:

user_id click_date event_id
u1 2026-01-01 e1
u1 2026-01-01 e2
u1 2026-01-01 e3
u1 2026-01-01 e4
u1 2026-01-01 e5
u1 2026-01-02 e6

Ask: return (user_id, click_date, n) where n >= 5.

Solution:

WITH per AS (
  SELECT user_id, click_date, COUNT(*) AS n
  FROM clicks
  GROUP BY user_id, click_date
)
SELECT * FROM per WHERE n >= 5;

Result: one row: (u1, 2026-01-01, 5).
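A runnable sanity check of this CTE-then-filter shape, using sqlite3 in memory (SQLite has long supported WITH; table and data match the toy example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clicks (user_id TEXT, click_date TEXT, event_id TEXT)")
rows = [("u1", "2026-01-01", f"e{i}") for i in range(1, 6)] \
     + [("u1", "2026-01-02", "e6")]
con.executemany("INSERT INTO clicks VALUES (?, ?, ?)", rows)

# CTE builds the (user, day) grain; the outer WHERE filters the aggregated column n.
busy = con.execute(
    """WITH per AS (
         SELECT user_id, click_date, COUNT(*) AS n
         FROM clicks
         GROUP BY user_id, click_date
       )
       SELECT user_id, click_date, n FROM per WHERE n >= 5"""
).fetchall()

print(busy)  # [('u1', '2026-01-01', 5)]
```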

SQL Interview Question on High Daily Click Volume with a CTE

Table clicks(user_id, click_date, event_id, …). Build per (user_id, click_date) row counts as n, excluding null user_id in the inner aggregate. Return rows where n >= 5 with columns user_id, click_date, n.

Solution Using a CTE and Post-Aggregation Filter

WITH per_day AS (
  SELECT
    user_id,
    click_date,
    COUNT(*) AS n
  FROM clicks
  WHERE user_id IS NOT NULL
  GROUP BY user_id, click_date
)
SELECT user_id, click_date, n
FROM per_day
WHERE n >= 5;

Check: The NOT NULL filter on user_id lives in the WHERE inside the CTE (a row filter); n >= 5 filters the CTE’s already-aggregated output, so a plain WHERE on the outer SELECT is enough.



3. “Last N” per key, ORDER BY, and windows in SQL

“Last N” per key, ORDER BY, and row numbering in SQL

Overview: A global ORDER BY ts DESC at the end of a query would sort all events together; it does not give “last 2 for each user” unless the question only asked for the one most recent key globally. For per-user (or per-partition) “latest / last k”, the sort order must be inside a window: PARTITION BY user_id, ORDER BY ts DESC, and usually ROW_NUMBER, RANK, or a subquery with a correlated LIMIT per key (dialect-specific). The export tags this Sorting + per-key logic because the hard part is scoping ORDER BY to a partition, not the whole result set.

Sub-topic: ROW_NUMBER and ORDER BY inside OVER

ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_ts DESC) labels 1, 2, 3, … within each user_id in that order. ROW_NUMBER never ties; two rows in the same partition get different numbers.

Worked example (trace one partition): for user_id = u1, if order_ts descending gives rows A, B, C, then rn is 1, 2, 3. The k most recent are WHERE rn <= k.

Solution (definition): ROW_NUMBER is the usual choice when you need exactly k rows per partition in SQL that allows window functions (vs ties with RANK).

Sub-topic: Ties in “last N”

If two events share the same order_ts, ORDER BY must add a tie-break (e.g. order_id DESC) or the engine’s order is not guaranteed and your k “last” rows may be ambiguous. The interviewer wants you to state a rule.

Worked example (two rows, same time): add , order_id DESC so the higher id is “more recent” in the product sense.

Solution (tie-break in window): ORDER BY order_ts DESC, order_id DESC.

Problem / solution (toy data): same user, two orders, identical order_ts only: without a second sort key, ROW_NUMBER is still unique but nondeterministic between the two (bad for tests). With ORDER BY order_ts DESC, order_id DESC and rows (id 9, ts T) and (id 10, ts T), row 10 gets rn = 1; 9 gets rn = 2. You can reproduce the same answer on every run.

Worked example — “last 2 orders per user” on toy data

| order_id | user_id | order_ts |
|----------|---------|----------|
| 10 | u1 | 2026-01-02 |
| 9 | u1 | 2026-01-01 |
| 4 | u2 | 2026-01-01 |

ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_ts DESC) → for u1: (10 → 1), (9 → 2); for u2: (4 → 1). WHERE rn <= 2 returns 2 rows for u1, 1 for u2.

Solution (query shape): WITH w AS (SELECT …, ROW_NUMBER() OVER (…) AS rn FROM orders) SELECT … FROM w WHERE rn <= 2; — same idiom as the Interview block below.
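The same idiom end to end, sketched in sqlite3 (window functions need SQLite 3.25+, which ships with modern Python builds):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, user_id TEXT, order_ts TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(10, "u1", "2026-01-02"),
                 (9,  "u1", "2026-01-01"),
                 (4,  "u2", "2026-01-01")])

# Window CTE assigns rn per user; the outer WHERE keeps the 2 most recent rows.
last2 = con.execute(
    """WITH r AS (
         SELECT order_id, user_id,
                ROW_NUMBER() OVER (
                  PARTITION BY user_id
                  ORDER BY order_ts DESC, order_id DESC
                ) AS rn
         FROM orders
       )
       SELECT user_id, order_id, rn FROM r WHERE rn <= 2
       ORDER BY user_id, rn"""
).fetchall()

print(last2)  # [('u1', 10, 1), ('u1', 9, 2), ('u2', 4, 1)]
```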

SQL Interview Question on Last Two Orders per User by Time

orders(order_id, user_id, order_ts, amount); all columns not null; for each user_id, return the 2 most recent rows by order_ts, breaking ties on larger order_id (use ROW_NUMBER then filter).

Solution Using ROW_NUMBER and an Outer WHERE on Rank

WITH r AS (
  SELECT
    order_id,
    user_id,
    order_ts,
    amount,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY order_ts DESC, order_id DESC
    ) AS rn
  FROM orders
)
SELECT order_id, user_id, order_ts, amount
FROM r
WHERE rn <= 2;

Check: WHERE rn <= 2 in the outer query, after the window—that is the idiom for last 2.


Side-by-side diagram comparing SQL GROUP BY roll-up results with per-row window function output using OVER (PARTITION BY).


4. Window functions, partitions, and CTEs in SQL

Window OVER and layered subqueries in SQL

Overview: GROUP BY collapses many rows into one per bucket. OVER (window) keeps each input row and attaches a value computed from a window of rows—usually a partition (same user_id / same day / same month). Many DE questions are layered: first a CTE that produces the right grain (e.g. per day per user), then a second pass that ranks or normalizes within a larger bucket (e.g. share of this day inside the user’s month). Confusing the two is a common failure mode: SUM(amount) next to non-aggregated columns in SELECT without GROUP BY is invalid, but SUM(amount) OVER (PARTITION BY …) is valid because the window value is computed per row.

Compare visually: a roll-up vs per-row window in the image above: left = one row per group; right = all detail kept with a rank or running value.

Sub-topic: SUM(...) OVER vs SUM(...) GROUP BY

  • SUM(x) with GROUP BY y → one row per y with a single total. You lose the original rows.
  • SUM(x) OVER (PARTITION BY y) → one total repeated on every row in the partition (unless you use a more restrictive frame, e.g. ROWS BETWEEN); use it when you need both row-level detail and a subtotal in the same result.

Worked example (conceptual): fact rows for (user, day, revenue).

SELECT
  user_id,
  day,
  revenue,
  SUM(revenue) OVER (PARTITION BY user_id) AS user_all_days_rev
FROM fact;

Each row still exists; the window sum is the per-user total across the partition (same value on every row for that user_id).

Sub-topic: Layered CTE + window (subquery + window)

You first materialize a convenient grain in a CTE (e.g. one row per (user, day) with a daily total), then in the outer SELECT you add a window that partitions by something larger (e.g. day to rank users on that day, or user to rank days for that user).

Worked example — “largest order total per calendar day” (toy CTE + window)

orders(order_id, day, user_id, amt); aggregate to per (day, user) tot, then keep #1 per day by tot.

| day | user_id | tot (after GROUP BY) |
|-----|---------|----------------------|
| D1 | a | 100 |
| D1 | b | 200 |
| D2 | a | 50 |

ROW_NUMBER() OVER (PARTITION BY day ORDER BY tot DESC, user_id ASC) AS r → (D1,b) and (D2,a) get r = 1.

Solution (shape):

WITH per AS (
  SELECT day, user_id, SUM(amt) AS tot
  FROM orders
  GROUP BY day, user_id
)
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY day ORDER BY tot DESC, user_id ASC) AS r
  FROM per
) x WHERE r = 1;

Problem to avoid: filtering on r at the same level where the window is defined; window results are not visible to that level’s WHERE. Filter r = 1 in an outer layer (subquery or CTE, as above).

Sub-topic: Ratios, NULLIF, and “empty month”

Dividing this row by a group sum can hit divide-by-zero. With real fact rows a partition is never truly empty, so the usual edge case is sum = 0 (all amounts in the partition are 0). Use NULLIF(denominator, 0) so the result is NULL instead of an error; say that out loud in the interview.

Micro worked example (all zeros in a month): one row for (u1, 2026-04-10, 0) and one for (u1, 2026-04-20, 0). Month sum = 0. So day_share = 0 / NULLIF(0,0). Because NULLIF(0,0) is NULL, the division yields NULL, not an error. Problem without NULLIF: divide by zero on 0/0 when the month sum is 0. Solution: use NULLIF( SUM(…) OVER (…), 0 ) in the denominator.
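You can verify the NULLIF semantics directly. One caveat: SQLite (used here as a convenient stand-in) quietly returns NULL on division by zero, whereas PostgreSQL raises an error, which is exactly why the guard matters there; the guard's behavior is the same in both:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# NULLIF(a, b) returns NULL when a = b, else a; anything divided by NULL is NULL.
vals = con.execute(
    "SELECT NULLIF(0, 0), NULLIF(5, 0), 30.0 / NULLIF(0, 0)"
).fetchone()
print(vals)  # (None, 5, None) -- the 0/0 case quietly becomes NULL, not an error
```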

Worked example — day_share with toy fact rows

Table fact:

| user_id | d (date) | amount |
|---------|----------|--------|
| u1 | 2026-04-10 | 30 |
| u1 | 2026-04-20 | 70 |
| u1 | 2026-05-01 | 100 |

Ask: for each row, add day_share = that row’s amount divided by the sum of amount for the same (user_id, calendar month).

Manual check: For April 2026 and u1, month total = 30 + 70 = 100 → first row’s share = 30/100 = 0.3, second = 0.7. For May, only one row → share = 1.0.

Solution (shape matches the interview below): the same expression as in “Solution Using a Windowed SUM”; on this data, narrate how the two April rows split the month.

SELECT
  user_id,
  d,
  amount,
  amount::float / NULLIF(
    SUM(amount) OVER (PARTITION BY user_id, date_trunc('month', d)),
    0
  ) AS day_share
FROM fact;

Problem you avoid: if you accidentally PARTITION BY user_id only, you would divide each day by all-time user revenue—wrong for “within month.”
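To check the April/May split mechanically, here is a sqlite3 adaptation of the worked example: strftime('%Y-%m', d) stands in for date_trunc('month', d), and window support needs SQLite 3.25+:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (user_id TEXT, d TEXT, amount REAL)")
con.executemany("INSERT INTO fact VALUES (?, ?, ?)",
                [("u1", "2026-04-10", 30),
                 ("u1", "2026-04-20", 70),
                 ("u1", "2026-05-01", 100)])

# strftime('%Y-%m', d) plays the role of date_trunc('month', d) here.
shares = con.execute(
    """SELECT d,
              amount / NULLIF(
                SUM(amount) OVER (PARTITION BY user_id, strftime('%Y-%m', d)),
                0
              ) AS day_share
       FROM fact ORDER BY d"""
).fetchall()

print(shares)  # [('2026-04-10', 0.3), ('2026-04-20', 0.7), ('2026-05-01', 1.0)]
```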

SQL Interview Question on Day Share Within User-Month

fact(user_id, d, amount) at (user, day) grain. Add column day_share = this row’s amount / sum(amount) over the same (user, calendar month) (PostgreSQL: date_trunc('month', d) in the partition).

Solution Using a Windowed SUM Over User and Month

PostgreSQL-style:

SELECT
  user_id,
  d,
  amount,
  amount::float / NULLIF(
    SUM(amount) OVER (
      PARTITION BY user_id, date_trunc('month', d)
    ),
  0) AS day_share
FROM fact;

Check: NULLIF(…,0) avoids divide by zero if a month is empty. Name date_trunc('month',d) as “month bucket” in interview; adjust for Trino/Spark/ANSI as needed.



5. Window ranking: RANK, DENSE_RANK, and ROW_NUMBER in SQL

Window ranking: RANK, DENSE_RANK, and ROW_NUMBER

Overview: Ranking orders rows inside a partition with rules for ties. Picking the wrong function is how you return two “first place” rows when the business asked for one rep per region, or how you get gaps in “rank 1,2,3” when a dense ladder was intended (leaderboards).

  • RANK() — ties share the same rank; the next rank skips (1,1,3).
  • DENSE_RANK() — ties share the same rank; the next rank does not skip (1,1,2).
  • ROW_NUMBER() — unique 1,2,3…; no shared positions. Add a second ORDER BY key for deterministic ties (e.g. smaller id wins).

Sub-topic: When you need a single winner

If the spec says at most one row per region, use ROW_NUMBER, not RANK—unless the prompt explicitly allows ties (e.g. “all top sellers”).

Worked example (ties in sales): three sellers in one region, revenues (100, 100, 50) in descending order:

| seller | revenue | RANK (DESC) | DENSE_RANK (DESC) | ROW_NUMBER (DESC, id ASC) |
|--------|---------|-------------|-------------------|---------------------------|
| A (id 1) | 100 | 1 | 1 | 1 |
| B (id 2) | 100 | 1 | 1 | 2 |
| C (id 3) | 50 | 3 | 2 | 3 |

  • RANK: two rows at rank 1, so no unique winner.
  • DENSE_RANK: still two rows at 1.
  • ROW_NUMBER with ORDER BY revenue DESC, seller_id ASC: exactly one rn = 1 (seller A).

Solution: for “one row per partition” with tie-break, WHERE rn = 1 on ROW_NUMBER is the standard pattern (same as the interview block).
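All three functions can be compared on the (100, 100, 50) toy data in one query; a sqlite3 sketch (window functions need SQLite 3.25+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sellers (seller_id INTEGER, revenue INTEGER)")
con.executemany("INSERT INTO sellers VALUES (?, ?)", [(1, 100), (2, 100), (3, 50)])

# Same ORDER BY for RANK/DENSE_RANK; ROW_NUMBER adds the seller_id tie-break.
ranks = con.execute(
    """SELECT seller_id,
              RANK()       OVER (ORDER BY revenue DESC) AS rnk,
              DENSE_RANK() OVER (ORDER BY revenue DESC) AS drnk,
              ROW_NUMBER() OVER (ORDER BY revenue DESC, seller_id ASC) AS rn
       FROM sellers ORDER BY seller_id"""
).fetchall()

for row in ranks:
    print(row)
# (1, 1, 1, 1)
# (2, 1, 1, 2)
# (3, 3, 2, 3)
```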

Sub-topic: When RANK is right

“All employees in the top compensation band in each office” often means everyone who ties for #1—use RANK or DENSE_RANK and WHERE rank = 1, and do not add a spurious seller_id tie-break unless the product says so.

Worked example (two co-leaders, one region): RANK() OVER (PARTITION BY office_id ORDER BY salary DESC).

| employee_id | office_id | salary |
|-------------|-----------|--------|
| e1 | O1 | 200000 |
| e2 | O1 | 200000 |
| e3 | O1 | 100000 |

RANK (DESC): e1 and e2 both 1; e3 = 3. Ask: return all who are #1 in salary in O1.

  • Solution: WHERE rank_col = 1 after RANK or DENSE_RANK: you return 2 rows (e1, e2). If you had used ROW_NUMBER with a tie-break you would return only 1 row—wrong for this spec.

Problem (wrong function): using ROW_NUMBER … WHERE rn = 1 when the business wanted all tied winners. Fix: RANK or DENSE_RANK and rank = 1, or re-read the spec to allow a single winner.

Worked example — mini sellers table

| seller_id | region_id | revenue |
|-----------|-----------|---------|
| 10 | R1 | 200 |
| 20 | R1 | 150 |
| 30 | R2 | 200 |
| 40 | R2 | 200 |

Ask: one winner per region_id, higher revenue wins, ties → smaller seller_id.

Result: R1 → (10, 200). R2 → two sellers tie at 200; ROW_NUMBER(… ORDER BY revenue DESC, seller_id ASC) picks 30 over 40.

Problem if you used RANK only with WHERE rank = 1: you would return two rows for R2. Fix: ROW_NUMBER (as below) or change the spec.

SQL Interview Question on Top Seller per Region with Tie-Breaks

sellers(seller_id, region_id, revenue) one row per (seller, region). Return one row per region_id: max revenue; ties → smaller seller_id.

Solution Using ROW_NUMBER with a Second Sort Key

WITH w AS (
  SELECT
    seller_id,
    region_id,
    revenue,
    ROW_NUMBER() OVER (
      PARTITION BY region_id
      ORDER BY revenue DESC, seller_id ASC
    ) AS rn
  FROM sellers
)
SELECT region_id, seller_id, revenue
FROM w
WHERE rn = 1;

Check: RANK would return two rows if two sellers tie max—ROW_NUMBER with seller_id tie break gives one.



6. Python: arrays, sorting, and merge intervals for ByteDance-tagged items

Sorting, sweeping, and merge intervals in Python

Overview: An interval is usually [a,b] inclusive in DE-style problems. Unsorted input is normal—sort by start (then end for stability). Sweep with one current merged segment [cur_s, cur_e]. For the next [s,e]: if s <= cur_e, the intervals touch or overlap—set cur_e = max(cur_e, e); else append a new block and move the cursor. Union length = sum of (end - start + 1) for integer inclusive lengths (watch one-point intervals: [5,5] has length 1).

Sub-topic: Merging intervals

| New interval | Action | Merged so far |
|--------------|--------|---------------|
| [1,3] (after sort) | start first block | [[1,3]] |
| [2,6] | overlaps, extend end | [[1,6]] |
| [8,10] | no overlap, new block | [[1,6],[8,10]] |

Why sort first: the earliest starting interval that could connect to a chain must be processed in order; without sort, a later tiny interval that bridges two clusters could be missed in a single left-to-right pass.

Sub-topic: sort + O(n) scan

Time: O(n log n) to sort, O(n) merge. Space: O(n) for the sorted copy (or in-place sort on a mutable list in Python).

Worked example (complexity only): with n = 1,000,000 intervals, sorting dominates; a single left-to-right merge pass is O(n), so total time is O(n log n). In an interview, say: O(n log n) time, O(n) extra space (or O(1) extra if you may mutate the input list in place).

Problem (common bug): using s < cur_e instead of s <= cur_e, which fails when one interval ends exactly where the next starts if the spec says they merge (e.g. [1,4] and [4,5]—with inclusive touch, many problems still merge to [1,5]). Clarify with the interviewer; the solution below uses overlap as s <= m[-1][1] (merge if the new start is not strictly to the right of the current end).

Worked example — full trace and union length

Input: [[5,5],[1,3],[2,6],[8,10]] (unsorted).

  1. Sort by start: [[1,3],[2,6],[5,5],[8,10]].
  2. Merge: start with [1,3]; next [2,6]: 2 <= 3 → extend → [1,6]. Next [5,5]: 5 <= 6 → still inside → [1,6]. Next [8,10]: 8 > 6 → new block → [[1,6],[8,10]].
  3. Union length (inclusive int): (6−1+1) + (10−8+1) = 6 + 3 = 9.

Solution: this matches the code in the Interview solution below (merged_union_length).

Python Interview Question on Union Length of Merged Intervals

Input: list of inclusive integer intervals [start, end] (unsorted, may overlap). Return the total length of the union after merging (inclusive length per piece: end - start + 1).

Solution Using Sort, Linear Merge, and Summed Lengths

def merged_union_length(intervals):
    if not intervals:
        return 0
    iv = sorted(intervals, key=lambda x: (x[0], x[1]))
    # merged list of [s,e] inclusive
    m = [list(iv[0])]
    for s, e in iv[1:]:
        if s <= m[-1][1]:
            m[-1][1] = max(m[-1][1], e)
        else:
            m.append([s, e])
    return sum(e - s + 1 for s, e in m)

Check: [[1,3],[2,6],[8,10]] → [[1,6],[8,10]] → length 6 + 3 = 9.



7. Python: sets, Counter, and string reporting for DE screens

Sets, Counter, and tie-aware reporting in Python

Overview: You combine list iteration, set membership in O(1) expected time, and Counter (or a dict) for frequencies. “Reporting” means turning raw rows into sorted or top-k answers with a defined tie rule. Many panels want the SQL mental model in Python: group by key, order by count DESC, then key ASC for ties.

Sub-topic: collections.Counter

Counter(iterable) gives per-key counts; .items() yields (tag, count). .most_common(k) is convenient when tie rules are only “by count,” but for lex ties among equal counts you must group tags by frequency and sort inside each bucket (see Solution below).

Worked example: Counter("abracadabra") has a:5, b:2, r:2, c:1, d:1. A k-tag report with lex tie: take a (highest f), then within 2s pick b before r if k is large enough, then count 1s: c before d.

Solution (bucket idea): for f in sorted(by_freq, reverse=True): for t in sorted(by_freq[f]): ...

Sub-topic: Tie-breaking on labels

Lexicographic on string ids is common: sort keys within a frequency bucket the same way you would in SQL: ORDER BY c DESC, tag ASC.

Worked example: ["a","b","a","a","c"] → counts a:3, b:1, c:1 — top-1 by count is a. If k=2, the next band is count 1: b before c.

Problem: most_common(k) alone does not guarantee lex order for tied counts—fix: explicit sort in each frequency group, as in the Solution block.
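A small demonstration of why most_common alone is not enough for lex ties: Counter documents that elements with equal counts come out in first-encounter order, not alphabetical order.

```python
from collections import Counter

tags = ["y", "y", "x", "x"]
c = Counter(tags)

# Equal counts keep first-insertion order, so "y" comes out first
# even though "x" wins lexicographically.
print(c.most_common(2))  # [('y', 2), ('x', 2)]

# An explicit sort enforces the SQL-style tie rule: count DESC, then tag ASC.
fixed = [t for t, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))][:2]
print(fixed)  # ['x', 'y']
```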

Worked example — k picks across frequency bands

Input: ["x","x","y","y","z"] → x:2, y:2, z:1. k=2: the two highest-frequency tags are both at count 2; lex order → x then y, never z.

Solution: the interview function returns the list in that order; see the Check below.

Python Interview Question on Top-k Tags with Lexicographic Tie-Breaks

Input: list of tag strings, integer k. Output: k tags with highest frequency; within a frequency band use lexicographic order; overall descending by frequency (SQL spirit: ORDER BY cnt DESC, tag ASC).

Solution Using Frequency Buckets and Sorted Keys

from collections import Counter

def top_k_tags(tags, k):
    if k <= 0 or not tags:
        return []
    c = Counter(tags)
    by_freq = {}
    for t, f in c.items():
        by_freq.setdefault(f, []).append(t)
    res = []
    for f in sorted(by_freq.keys(), reverse=True):
        for t in sorted(by_freq[f]):
            res.append(t)
            if len(res) == k:
                return res
    return res

Check: ["x","x","y","y","z"], k=2 — both 2: pick x then y (lex) before any single-count tag.



8. Python: regular expressions and string tokenization for DE screens

Regular expressions and string tokenization in Python

Overview: The re module finds matches; for integers in text, a pattern like \d+ (maximal unsigned digit runs) or -?\d+ (optional sign) is common. In Python, re.findall returns non-overlapping matches left to right, so "a12b3" with r"\d+" gives ['12','3'], not ['1','2','3']. You must agree on the spec: signed or unsigned? what does the empty string return? (usually 0 when there are no matches). Edge: "12-34" with -?\d+ yields 12 and -34; the interview item below is intentionally unsigned blocks to avoid that ambiguity.

Sub-topic: re.findall vs one manual scan

Findall is compact; a for-loop and isdigit chunks (track start/end) is the same logic without regex—good if the panel says “no re import.” For alphanumeric ids, you might split on non-digits with re.split.

Worked example (tiny spec): s = "a12b3" with r"\d+": tokens 12 and 3 → sum 15. If the spec were “sum of each digit,” you would iterate characters—different problem.

Solution: one pass: sum(int(m) for m in re.findall(r"\d+", s)).
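If the panel bans re, the same block sum falls out of one manual scan, as the sub-topic above notes; sum_of_ints_no_regex is an illustrative name for this sketch:

```python
def sum_of_ints_no_regex(s):
    """Sum maximal contiguous digit runs with a plain scan, no re import."""
    total = 0
    cur = ""            # digits of the run in progress
    for ch in s:
        if ch.isdigit():
            cur += ch   # extend the current run
        elif cur:
            total += int(cur)  # run ended: flush it
            cur = ""
    if cur:             # flush a run that reaches end-of-string
        total += int(cur)
    return total

print(sum_of_ints_no_regex("a12b3"))  # 15
print(sum_of_ints_no_regex(""))       # 0
```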

Sub-topic: Zero and multiple blocks

Explanation: with r"\d+", the empty string produces no matches → findall is [], and the sum of an empty list of ints is 0. The string "0" is one block → add 0. The string "10 20" (digits separated by spaces) still matches two runs "10" and "20" → 10 + 20 = 30; the space breaks the run, so you do not get one number 1020 unless the digits are one contiguous run (e.g. "1020" → a single int 1020).

Worked example (problem and solution): s = "0plus00" → two blocks "0" and "00" → 0 + 0 = 0.

Problem (confused spec): for "a1b1" per-digit you might say 1+1; for block you get "1" and "1"; the sum is still 2, but for "a12b" blocks give 12 while per-digit would give 1+2 = 3. Clarify in interview. Solution in this guide and the Interview below: maximal digit runs (block sum).

Worked example — manual parse vs regex

| Input | re.findall(r"\d+", s) | Sum |
|-------|------------------------|-----|
| Price 12x and 3y | 12, 3 | 15 |
| no digits | (empty) | 0 |
| 100200 | 100200 (one block) | 100200 |

Problem if you split on spaces only: "a12b34" would not split; maximal digit run detection needs regex or a state machine—the interview uses regex.

Python Interview Question on Summing Integers Extracted from Text

Input: string with letters and digits. Spec (simple): each maximal contiguous digit block is one unsigned integer to add.

Solution Using re.findall on Digit Runs

import re

def sum_of_ints_in_string(s):
    return sum(int(m) for m in re.findall(r"\d+", s))

Check: "Price 12x and 3y" → 12 + 3 = 15.

If you need signed numbers, many panels use: re.findall(r"-?\d+", s) and sum(int(m) for m in ...) with care for lone -.



9. Python: hash sets, first-seen order, and stream-style deduplication

Hash sets, first-seen order, and one-pass stream logic in Python

Overview: A set stores keys you have seen; a dict can store first index, count, or value. Stream = one pass; arrival order is fixed—do not pre-sort the array when the spec asks for the k-th new (first-occurrence) id in original order. Time O(n), set space O(unique keys) in expectation. A set + list of first-seen keys is a clear way to get 1-based “j-th new”.

Diagram of Python hash map frequency counting and set-based deduplication on an array.

Sub-topic: First-seen order with a set

When x not in seen: seen.add(x) and append to first_order. If you need only the k-th new id, return as soon as len(first_order) == k. O(n) time, O(unique) set space.

Worked example: [7,3,7,1] → 1st new 7, 2nd new 3, skip duplicate 7, 3rd new 1 → [7,3,1].

Solution (mapping to 1-based k): the j-th new id is index j-1 in first_order.

Sub-topic: “At most once” / Counter (different spec)

A Counter and filter for count == 1 answers “ids that appear exactly once”—not the same as “k-th time we first see an id” unless the problem combines them.

Problem with list(dict.fromkeys(arr)): you dedupe in order but do not get an early stop for the k-th new id without the same left-to-right loop as in kth_new_id.

Worked example (Counter “exactly once” — different from k-th new): arr = [1,2,1,3]; Counter is {1:2, 2:1, 3:1}. Ids with count 1 are 2 and 3 (often reported in first-seen order 2, 3 with an extra pass). Solution: that is not the k-th-new spec; for k-th new use set + first-occurrence list below.
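A minimal sketch of that Counter spec (exactly_once_ids is an illustrative name, not a hub problem):

```python
from collections import Counter

def exactly_once_ids(arr):
    # count every id, then keep ids whose count is exactly 1,
    # reported in first-seen order via dict.fromkeys
    counts = Counter(arr)
    return [x for x in dict.fromkeys(arr) if counts[x] == 1]
```

Check: [1, 2, 1, 3] → [2, 3].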

Worked example — 1-based k (trace)

step 1: id a → 1st new = a
step 2: id b → 2nd new = b
step 3: id a → (skip, already seen)
step 4: id c → 3rd new = c

k = 2 → "b"; k = 3 → "c"; k = 4 → "".

Solution: same as kth_new_id in the next Solution block.

Python Interview Question on k-th First-Seen Id in a Stream

Input: ordered list of string ids, integer k. The j-th “new” id is the j-th first occurrence in a left-to-right scan. Return that id for j = k, or "" if there are fewer than k first-seen events.

Solution Using a Set and Ordered First-Seen List

def kth_new_id(arr, k):
    if k <= 0:
        return ""
    seen = set()
    first_order = []
    for x in arr:
        if x not in seen:
            seen.add(x)
            first_order.append(x)
            if len(first_order) == k:
                return x
    return ""

Check: ["a","b","a","c"], k=2 → "b"; k=4 → "".

Practice


10. Python: two pointers, binary search, dynamic programming, and simulation on ByteDance hards

Two pointers, binary search, DP, and simulation patterns in Python

Overview: The export bundles algorithm patterns at hard/medium depth, and each pattern is a separate sub-skill. Below are four sub-parts (two pointers, binary search, 1-D DP, and simulation), each with a small stated problem, a solution sketch, and a check.

Sub-part A: two pointers on a sorted array (pair sum)

Idea: Left/right on a sorted list; move inward by comparing nums[l] + nums[r] to the target.

Problem: return 0-based indices (i, j) with i < j and nums[i] + nums[j] == target, or None. nums is strictly increasing (enables the two-pointer proof).

Solution: O(n) two pointers; if nums[l] + nums[r] < target, the sum is too small, so l += 1; if too large, r -= 1.

Solution sketch (code):

def two_sum_sorted(nums, target):
    l, r = 0, len(nums) - 1
    while l < r:
        s = nums[l] + nums[r]
        if s == target:
            return l, r
        if s < target:
            l += 1
        else:
            r -= 1
    return None

Check: [1,2,4,6], target 6 → (1, 2) (values 2 + 4).

Problem (no-solution case): [1,2,3], target 7 → None.

Sub-part B: binary search on a monotone prefix

Idea: find the smallest index with prefix_sum[i] >= target on a non-decreasing prefix (or return -1 if impossible).

Problem: prefix_sum = [0, 2, 5, 5, 9] (non-decreasing from cumulative sums), target = 5. Return the leftmost i with prefix_sum[i] >= 5.

Solution: binary search for the lowest index i with prefix_sum[i] >= target (the leftmost “first true” in a monotone predicate). O(log n) time.

Check (manual): valid indices: 2, 3, 4; leftmost is 2. If target = 20, the array maximum is 9, so the answer is -1.

Solution sketch (code):

def first_at_least(prefix_sum, target):
    lo, hi = 0, len(prefix_sum) - 1
    if prefix_sum[hi] < target:
        return -1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_sum[mid] >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo
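If the panel allows the standard library, bisect_left gives the same leftmost "first true" index on a non-decreasing list; it is worth knowing as a cross-check, though writing the loop by hand is usually the point of the question:

```python
from bisect import bisect_left

def first_at_least_bisect(prefix_sum, target):
    # bisect_left returns the leftmost insertion point for target,
    # which is the first index i with prefix_sum[i] >= target
    i = bisect_left(prefix_sum, target)
    return i if i < len(prefix_sum) else -1
```

Check: [0, 2, 5, 5, 9], target 5 → 2; target 20 → -1.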

Sub-part C: 1-D dynamic programming (Fibonacci-style)

Idea: Stairs with 1 or 2 steps—ways to reach n equals Fibonacci. States ways(0)=1, ways(1)=1.

Problem: you may take 1 or 2 steps at a time. In how many distinct ways can you reach the n-th step? (Ground = step 0; one move reaches step 1.) Use DP in O(n) time and O(1) space (rolling two numbers).

Solution: ways(n) = ways(n-1) + ways(n-2); rolling variables match the code below.

Solution sketch (code):

def count_ways_stairs(n):
    if n <= 1:
        return 1
    a, b = 1, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b

Check: n=4 → 5 (the Fibonacci pattern).

Check (base cases): the sketch returns 1 for n <= 1 (only one way to stand on step 0 or step 1 in this model).
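One way to gain confidence in the rolling-variable DP is to cross-check it against the same recurrence written as plain recursion for small n (the brute-force helper is exponential and for checking only):

```python
def count_ways_stairs(n):
    # O(n) time, O(1) space: roll ways(n-1) and ways(n-2) forward
    if n <= 1:
        return 1
    a, b = 1, 1
    for _ in range(2, n + 1):
        a, b = b, a + b
    return b

def count_ways_brute(n):
    # same recurrence, direct recursion; exponential, test-only
    if n <= 1:
        return 1
    return count_ways_brute(n - 1) + count_ways_brute(n - 2)
```

Check: both agree for n = 0..10, and n = 4 → 5.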

Sub-part D: simulation with running state

Idea: one pass over deltas; track cumulative sum and 1-based day index; return the first day the running total is strictly negative, else 0.

Problem: deltas = [1, 2, -1, -2] (days 1–4). Cumulative sums: 1, 3, 2, 0—the running total is never strictly negative, so the return value is 0. For [5, -2, -4], the first strictly negative total appears on day 3 (running sums 5, then 3, then -1).

Solution: single pass; 1-based enumerate(..., start=1) for “day” index.

Solution sketch (code):

def first_negative_day(deltas):
    s = 0
    for i, d in enumerate(deltas, start=1):
        s += d
        if s < 0:
            return i
    return 0

Check: [5, -2, -4] → 3 (running sums 5, 3, -1).

How this section maps to “problem + solution”: A–D are four separate sub-topics. Each has an explained idea, a stated problem, a solution (code sketch), and a check—the same pattern as sections 1–9, except there is one combined Interview block per sub-part instead of a single long question at the very end. Harder ByteDance items may blend two of these patterns in one prompt; here they are split for clear practice.

Practice


Tips to crack ByteDance data engineering interviews

Data engineering interview preparation for a TikTok-aligned ByteDance loop is less about one-off LeetCode grinds and more about repeatable lanes: SQL grain, joins, windows and “last N,” then Python structures and the hard-band patterns in section 10. Pair those reps with a calm explain-aloud habit on WHERE vs HAVING and on stream order in Python.

SQL preparation, grain, and explain-aloud

Drill WHERE vs HAVING, join fan-out, and window partition keys on paper. Before you code, say aloud: "one row of output = what?" Use one CTE per logical block in hairy questions so the panel can follow you—the same structure as our other company interview guides on PipeCode, with the ByteDance company filter.

Where to practice on PipeCode

Lane ByteDance-focused path
Company hub Browse ByteDance-tagged problems
SQL ByteDance SQL
Python ByteDance Python
Easy / Medium / Hard Easy · Medium · Hard
All topics (site) Practice topics

Simple three-step study ladder for ByteDance data engineering practice: eight easy, eight medium, and four hard problems for PipeCode interview prep.

Communication under time pressure

State assumptions in one pass: ties, NULLs, inclusive bounds, time zone. If stuck, offer a smaller version of the problem first. Interviewers often reward a defensible O(n log n) solution with a clean story over a shaky “optimal” idea you cannot finish.

Courses that line up: SQL for data engineering interviews, Python fundamentals for data engineering, plus Data modeling for data engineering interviews and ETL system design for data engineering interviews for verbal design rounds.


Frequently asked questions

What is the ByteDance data engineering interview process?

It often includes a screening conversation, SQL and Python technical rounds, and later system design / data modeling and behavioral discussions. Pipelines, warehouses, and ETL reliability come up in verbal form; treat our ETL and data modeling resources as add-ons to your company-tagged practice.

What should I study for a ByteDance data engineering interview: SQL, Python, or modeling?

Prioritize SQL and Python for coding rounds. Sections 1–10 in this guide track the topic table and the per-problem tag mix on the ByteDance set (5 SQL / 15 Python in the reference export). Add dimensional modeling and ETL for verbal design when the loop includes it.

How hard are ByteDance data engineering interview questions?

Difficulty varies by loop and team, but the PipeCode ByteDance set is published as 8 easy, 8 medium, and 4 hard—a practical ladder. Start easy, then medium as your default, and add hard for timed proof when basics are solid.

Does TikTok use the same data engineering interview style as ByteDance?

TikTok is part of the ByteDance group; many candidates use TikTok and ByteDance interview prep interchangeably for data engineer roles, with SQL, Python, and data-intensive system discussion in common. Always confirm role-specific requirements in your recruiter materials.

How many practice problems should I do before a ByteDance data engineering interview?

A common target is on the order of 30–50 problems across topics—not only the 20 company-tagged items—so weak areas get reps. Mix ByteDance filters with the global SQL and Python libraries.

What are common ByteDance data engineering interview questions in SQL and Python?

They follow the tag families in the topic list and sections 1–10 in this guide: joins/aggregation, subqueries/"last N," windows/ranking, and the Python patterns at the end. All new problem text here is for teaching; the wording of Premium hub items is not copied.

Start practicing ByteDance data engineering problems

PipeCode pairs company-tagged ByteDance and TikTok data engineering problems with tests and feedback so you move from reading patterns to typing your own solutions.

Pipecode.ai is Leetcode for Data Engineering

Browse ByteDance practice →
View plans →
