Gowtham Potureddi

Uber Data Engineering Interview Questions & Prep Guide

Uber data engineering interview questions usually show up across a full loop: recruiter screen, SQL and Python coding, a data pipeline / architecture discussion, and behavioral rounds. Data engineering screens reward correctness at the right grain, honest complexity discussion, and clear communication under time pressure—not just “finishing” a LeetCode-style puzzle.

This guide maps the SQL and Python patterns that align with PipeCode’s Uber hub (for example, 20 curated problems—1 easy, 14 medium, 5 hard—with Python skewing harder), plus modeling and ETL tradeoffs. Sample prompts and code are original teaching material, not leaked statements. PipeCode pairs Uber-tagged items with global SQL and 450+ broader data engineering practice—confirm interview stages with your recruiter.

PipeCode blog header for Uber data engineering interview questions and prep, with SQL, Python, and pipeline themes on a dark gradient.


Top Uber data engineering interview topics

  1. Aggregation & joins · SQL: Hub SQL item (#243) is ranking + joins; “most common” per group is a recurring interview shape.
  2. Intervals, arrays & two pointers · Python: Line-sweep and merge logic across interval and array tags (#21, #118, #154, #157).
  3. Sliding window · Python: Invariant windows on strings and arrays (#122, #150, #151).
  4. Hash table & heap · Python: Counts, elections, streaming top-k (#131, #155, #159).
  5. Stack, BFS & graph · Python: Collisions, layered BFS, route / reach graphs (#121, #158, #320, #324).
  6. DP, greedy, tree & linked list · Python: Harder composed patterns (#152, #153, #321, #323, #325).
  7. Star schema, grain & pipeline narrative: Verbal modeling and ETL rounds in the full DE loop, beyond the 20-problem sampler.

SQL evaluation order (mental model): FROM / joins → WHERE (row filter) → GROUP BY → aggregates → HAVING (group filter) → windows → SELECT / ORDER BY. If you are filtering a sum or count of a group, that predicate almost always belongs in HAVING, not WHERE.


1. Aggregation and GROUP BY in Uber-style SQL

Aggregation and GROUP BY in SQL for data engineering

Picture fact tables with many detail rows—one row per trip, delivery event, or support ticket. Aggregation collapses those rows into one number (or a few numbers) per bucket. The bucket is whatever the prompt names: per rider_id, per city, per trip_date, per product_tier, and so on.

GROUP BY lists the bucket keys. Every distinct combination of those keys is one group; SUM, COUNT, AVG, … run inside each group only.

Report grain: After you sketch the query, say out loud: “One result row = one ___.” If the prompt wants revenue per city, the grain is city. If you group by (rider_id, trip_id) when the business question is per rider, your SUM(fare) can still be “locally correct” but wrong for the sentence the interviewer cares about.

SUM, AVG, and NULL

  • SUM(col) adds non-null numerics in the group. NULL is skipped (not treated as 0). If every value is NULL, SUM is usually NULL, not zero—call that out when they ask about edge cases.
  • AVG(col) = sum of non-null values ÷ count of non-null values. Missing fares do not enter the numerator or denominator.

Worked example: In one group, fare_usd values 10, NULL, 25 → SUM(fare_usd) = 35, AVG(fare_usd) = 17.5 (only two rows in the average).
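These NULL rules are easy to verify in a throwaway session; here is a minimal sketch using Python’s built-in sqlite3 as a stand-in engine (the table and column names are illustrative, not a real warehouse schema):

```python
import sqlite3

# One group with fare values 10, NULL, 25 (illustrative table)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (ride_id INTEGER, fare_usd REAL)")
conn.executemany("INSERT INTO rides VALUES (?, ?)",
                 [(1, 10.0), (2, None), (3, 25.0)])

total, avg = conn.execute(
    "SELECT SUM(fare_usd), AVG(fare_usd) FROM rides").fetchone()
print(total, avg)  # 35.0 17.5 (the NULL row is skipped by both aggregates)
```

The same pattern is handy for sanity-checking any NULL-handling claim before an interview.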

COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)

  • COUNT(*) counts rows in the group, even if some columns are NULL.
  • COUNT(col) counts rows where col is not NULL.
  • COUNT(DISTINCT col) counts unique non-null values—use after joins when you must count riders, not inflated rows.

Worked example: Three rows in a group, two with non-null fare_usd → COUNT(*) = 3, COUNT(fare_usd) = 2.

MIN / MAX

Apply to orderable types (amounts, TIMESTAMP). NULLs are skipped. MAX(completed_at) is the standard pattern for latest trip per bucket.

WHERE vs HAVING

  • WHERE filters input rows before grouping—only raw-column predicates (or expressions that do not use aggregates on the group).
  • HAVING filters groups after aggregates—predicates on SUM, COUNT, AVG, etc.

Worked example — same rides grain, two filters:

ride_id driver_id fare_usd
1 D1 12
2 D1 18
3 D1 5
4 D2 40
  • WHERE fare_usd > 10 drops ride 3 before grouping. Then GROUP BY driver_id with SUM(fare_usd) → D1 = 30, D2 = 40.
  • HAVING SUM(fare_usd) > 35 on the full table → D1 = 35, D2 = 40—only D2 passes.
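To make the split concrete, a hedged sqlite3 sketch of the same four-ride table (sqlite3 stands in for the warehouse; the semantics shown here match standard SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (ride_id INT, driver_id TEXT, fare_usd REAL)")
conn.executemany("INSERT INTO rides VALUES (?, ?, ?)",
                 [(1, "D1", 12), (2, "D1", 18), (3, "D1", 5), (4, "D2", 40)])

# WHERE drops ride 3 before grouping
where_sums = dict(conn.execute(
    "SELECT driver_id, SUM(fare_usd) FROM rides "
    "WHERE fare_usd > 10 GROUP BY driver_id"))
# HAVING filters whole groups after aggregation
having_sums = dict(conn.execute(
    "SELECT driver_id, SUM(fare_usd) FROM rides "
    "GROUP BY driver_id HAVING SUM(fare_usd) > 35"))
print(where_sums)   # {'D1': 30.0, 'D2': 40.0}
print(having_sums)  # {'D2': 40.0}
```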

Rule of thumb: If the predicate uses SUM / COUNT / AVG of the group, it belongs in HAVING (or an outer query), not WHERE.

One-pass revenue per bucket

Worked example: rides(ride_id, city, fare_usd, completed_at) → total fare in 2025 Q1 per city (half-open dates):

SELECT city, SUM(fare_usd) AS rev
FROM rides
WHERE completed_at >= DATE '2025-01-01'
  AND completed_at <  DATE '2025-04-01'
GROUP BY city;

Takeaway: Grain is one row per city; the WHERE clause restricts which rides enter the sums, not which cities appear (every city with a ride in the window shows up).

Conditional aggregation with CASE

Sub-topic: Count or sum only rows that match a condition inside each group—without running two queries.

Worked example — trips per service_type mix:

trip_id city service_type fare_usd
1 SF X 20
2 SF XL 35
3 SF X 15
SELECT city,
       COUNT(*) AS total_trips,
       SUM(CASE WHEN service_type = 'XL' THEN 1 ELSE 0 END) AS xl_trips,
       SUM(CASE WHEN service_type = 'X' THEN fare_usd ELSE 0 END) AS x_fare
FROM trips
GROUP BY city;

Result for SF: total_trips = 3, xl_trips = 1, x_fare = 35.

Common beginner mistakes

  • Putting SUM(x) > 100 in WHERE instead of HAVING.
  • Selecting a column that is not in GROUP BY and not wrapped in an aggregate (invalid in strict SQL).
  • Using COUNT(*) when the question asked for distinct riders—use COUNT(DISTINCT rider_id).
  • Joining one-to-many before GROUP BY and double-counting a one-per-rider attribute.

SQL Interview Question on Net Revenue by City After Refunds

Tables: ledger(trip_id, city, amount_usd, is_chargeback). amount_usd is positive for completed trip revenue and negative for chargebacks. Return each city where net revenue (sum of all amount_usd) is strictly greater than zero, ordered by net revenue descending.

Solution Using GROUP BY and Signed Amounts

SELECT city, SUM(amount_usd) AS net_rev
FROM ledger
GROUP BY city
HAVING SUM(amount_usd) > 0
ORDER BY net_rev DESC;

Why this works: Chargebacks are already negative rows; SUM nets them against trip revenue at city grain. HAVING filters groups, not individual ledger lines.


2. Joins, fan-out, and deduplication in Uber-style SQL

Joins and row multiplication for marketplace metrics

A join answers: “Which rows from table B belong with this row from table A?” You declare that with ON a.key = b.key. In DE interviews you join facts (trips, events) to dimensions (riders, cities) to enrich or filter.

Inner join vs left join (interview vocabulary)

  • INNER JOIN: Keep only rows where the ON predicate matches on both sides. NULL = NULL is not true, so null keys usually drop from an inner join unless you COALESCE or handle them explicitly.
  • LEFT JOIN: Keep every row from the left table; matching rows from the right get columns filled in, non-matches get NULL on the right. Use this for “all riders, even those with zero trips” style prompts.

Fan-out (row multiplication)

Sub-topic: A one-to-many join replicates the “one” side across N “many” rows. Any SUM of a column from the one side after the join can multiply by N.

Worked example: sessions(session_id, user_id) has one row s1. events has two rows for s1. After sessions INNER JOIN events ON ..., session_id appears twice. A naive SUM(sessions.promo_credit_usd) doubles the promo if that column is one-per-session.

Fix patterns: (1) GROUP BY the stable key on the many side and aggregate first; (2) use MAX/MIN on a functionally dependent dimension column when it is constant within your GROUP BY keys; (3) dedupe the dimension to one row per key before joining.
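A small sqlite3 sketch of the fan-out above and the MAX-based fix (table and column names follow the worked example; sqlite3 is only a stand-in engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (session_id TEXT, promo_credit_usd REAL)")
conn.execute("CREATE TABLE events (event_id INT, session_id TEXT)")
conn.execute("INSERT INTO sessions VALUES ('s1', 5.0)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "s1"), (2, "s1")])

# Naive: the one-per-session credit is duplicated across two event rows
naive, = conn.execute(
    "SELECT SUM(s.promo_credit_usd) FROM sessions s "
    "JOIN events e ON e.session_id = s.session_id").fetchone()
# Fix: collapse the duplicated dimension value with MAX inside each group
fixed, = conn.execute(
    "SELECT SUM(credit) FROM (SELECT MAX(s.promo_credit_usd) AS credit "
    "FROM sessions s JOIN events e ON e.session_id = s.session_id "
    "GROUP BY s.session_id)").fetchone()
print(naive, fixed)  # 10.0 5.0
```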

Side-by-side diagram comparing SQL join fan-out row inflation with a deduplicated aggregate result for Uber data engineering interview prep by PipeCode.

Anti-join pattern (“no matching row”)

Sub-topic: “Riders with no trips in the last 7 days” is often LEFT JOIN trips ... WHERE trips.trip_id IS NULL (or NOT EXISTS)—not an inner join, which would drop the zero-activity riders entirely.

Worked example (conceptual): riders LEFT JOIN recent_trips ON riders.rider_id = recent_trips.rider_id then WHERE recent_trips.rider_id IS NULL keeps only riders who never matched a recent trip row.
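The anti-join shape can be checked the same way; a minimal sqlite3 sketch, assuming riders r1–r3 where only r1 has a recent trip:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE riders (rider_id TEXT)")
conn.execute("CREATE TABLE recent_trips (trip_id INT, rider_id TEXT)")
conn.executemany("INSERT INTO riders VALUES (?)", [("r1",), ("r2",), ("r3",)])
conn.execute("INSERT INTO recent_trips VALUES (1, 'r1')")

# LEFT JOIN keeps every rider; non-matches get NULL trip_id on the right
inactive = [row[0] for row in conn.execute(
    "SELECT r.rider_id FROM riders r "
    "LEFT JOIN recent_trips t ON t.rider_id = r.rider_id "
    "WHERE t.trip_id IS NULL ORDER BY r.rider_id")]
print(inactive)  # ['r2', 'r3']
```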

SQL Interview Question on Trip Metrics After a Fan-Out Join

Schema: trips(trip_id, rider_id, fare_usd) (one row per trip). riders(rider_id, signup_bonus_usd) (one row per rider). Return rider_id, SUM(fare_usd) as trip_revenue, and one signup_bonus_usd value per rider (not multiplied by trip count) for riders with at least two trips.

Solution Using Safe Aggregation on Dimensions

SELECT
  t.rider_id,
  SUM(t.fare_usd) AS trip_revenue,
  MAX(r.signup_bonus_usd) AS signup_bonus_usd
FROM trips t
INNER JOIN riders r ON r.rider_id = t.rider_id
GROUP BY t.rider_id
HAVING COUNT(*) >= 2;

Why this works: MAX(r.signup_bonus_usd) collapses the duplicated dimension value across joined trip rows while SUM(t.fare_usd) only sums fact amounts. The predicate HAVING COUNT(*) >= 2 enforces the two-trip rule at rider grain.

SQL Interview Question on Deduping a Slowly Changing Rider Attribute

Tables: trips(trip_id, rider_id, fare_usd, completed_at); rider_profile(rider_id, tier, valid_from, valid_to) with multiple historical rows per rider_id (Type-2–style). For each trip, attach the tier that was valid when the trip completed (completed_at lies in [valid_from, valid_to)). Sketch how you avoid multiplying trips when rider_profile has overlapping rows.

Solution Sketch (join + filter + assert one match)

SELECT t.trip_id, t.rider_id, t.fare_usd, p.tier
FROM trips t
JOIN rider_profile p
  ON p.rider_id = t.rider_id
 AND t.completed_at >= p.valid_from
 AND t.completed_at <  p.valid_to;

Why this works: The time predicate pins at most one profile version per trip if the source data is clean (no overlaps). In an interview, say you would validate “one tier per trip” with a GROUP BY trip_id HAVING COUNT(*) > 1 check on a staging query—or ask the interviewer how overlaps are defined.


3. Window functions and ranking in SQL

Window functions and ranking for Uber data engineering


ROW_NUMBER, RANK, DENSE_RANK, and SUM(...) OVER (PARTITION BY ... ORDER BY ...) show up constantly. Partition keys define independent streaks of rows; order defines sequence inside each partition.

Sub-topic: ROW_NUMBER vs RANK vs DENSE_RANK

  • ROW_NUMBER() assigns 1, 2, 3, … with no ties—even if order_ts is identical, you get arbitrary but unique ranks unless you add a tie-break column to ORDER BY.
  • RANK() skips after ties: ranks 1, 1, 3 if two rows tie for first.
  • DENSE_RANK() does not skip: 1, 1, 2 for the same tie pattern.

Worked example: Two trips for the same rider_id share the latest completed_at. ROW_NUMBER ... ORDER BY completed_at DESC, trip_id DESC breaks the tie deterministically; RANK would give both 1.
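A pure-Python illustration of the three behaviors on a tied input sorted descending (the helper name rank_variants is mine, not a library function):

```python
def rank_variants(values):
    """Return (row_number, rank, dense_rank) lists for values sorted DESC."""
    ordered = sorted(values, reverse=True)
    row_number, rank, dense = [], [], []
    for i, v in enumerate(ordered):
        row_number.append(i + 1)                   # always unique 1, 2, 3, ...
        if i > 0 and v == ordered[i - 1]:
            rank.append(rank[-1])                  # ties share a rank
            dense.append(dense[-1])
        else:
            rank.append(i + 1)                     # RANK skips past ties
            dense.append((dense[-1] + 1) if dense else 1)
    return row_number, rank, dense

print(rank_variants([30, 30, 10]))
# ([1, 2, 3], [1, 1, 3], [1, 1, 2])
```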

Running and rolling metrics

Sub-topic: SUM(fare_usd) OVER (PARTITION BY rider_id ORDER BY completed_at ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) is cumulative revenue per rider in trip order—useful for fraud or growth story questions. If the prompt only says “running total,” confirm whether ties on ORDER BY are allowed to be non-deterministic.

Worked example (three trips, one rider):

trip_id completed_at fare_usd
1 T1 10
2 T2 15
3 T3 5

Cumulative fares in time order: 10, 25, 30.
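In Python terms that cumulative frame is just a prefix sum; itertools.accumulate reproduces the 10, 25, 30 sequence:

```python
from itertools import accumulate

# Fares in completed_at order; mirrors ROWS BETWEEN UNBOUNDED PRECEDING
# AND CURRENT ROW for a single rider partition
fares_in_time_order = [10, 15, 5]
running = list(accumulate(fares_in_time_order))
print(running)  # [10, 25, 30]
```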

Complex queries: CTEs, subqueries, and interview readability

Use WITH steps to name grain shifts: “per user per day counts” as step1, then filter step1 in step2. Interviewers read named stages faster than nested blobs.

Sub-topic: when to prefer CTEs

  • Readability: Each WITH clause is a named subquery you can reference twice without copy-paste.
  • Grain discipline: Force yourself to write daily_rides AS (...) then SELECT ... FROM daily_rides so the next step cannot accidentally mix trip grain and day grain.

Hub-aligned SQL: most common category per store

The Uber hub’s SQL-tagged item emphasizes aggregation, joins, and ranking—think fan-out control, GROUP BY, HAVING, and top categories per entity.

Join then aggregate at correct grain

Join facts to dimensions, then GROUP BY the keys in the prompt—not every column you see.

“Most common” per group

Often MODE-like logic is done with per-group row counts (via COUNT(*)), ORDER BY the count descending, and ROW_NUMBER() or DISTINCT ON (PostgreSQL) to pick the top label per partition.

Worked example — votes

voter candidate
v1 A
v2 A
v3 B

Winner by count: A with 2 votes.

SQL Interview Question on Most Common Category per Store

Tables: sales(store_id, sku_id, day); sku(sku_id, category). For each store_id, return the category with the most rows in sales (tie-break: lexicographically smallest category).

Solution Using Group Counts and Window Ranking

WITH expanded AS (
  SELECT s.store_id, sk.category
  FROM sales s
  JOIN sku sk ON sk.sku_id = s.sku_id
),
counts AS (
  SELECT store_id, category, COUNT(*) AS n
  FROM expanded
  GROUP BY store_id, category
),
ranked AS (
  SELECT
    store_id,
    category,
    ROW_NUMBER() OVER (
      PARTITION BY store_id
      ORDER BY n DESC, category ASC
    ) AS rn
  FROM counts
)
SELECT store_id, category
FROM ranked
WHERE rn = 1;

Why this works: counts fixes the grain to (store, category); ROW_NUMBER picks one row per store under your tie-break rule.

SQL Interview Question on Previous Trip Fare per Rider

Table: trips(trip_id, rider_id, completed_at, fare_usd) (one row per trip). For each trip, return trip_id, rider_id, fare_usd, and prev_fare_usd: the fare_usd of this rider’s immediately previous trip by completed_at (NULL if none). Tie-break equal timestamps by smaller trip_id first.

Solution Using LAG with a Deterministic Window Order

SELECT
  trip_id,
  rider_id,
  fare_usd,
  LAG(fare_usd) OVER (
    PARTITION BY rider_id
    ORDER BY completed_at ASC, trip_id ASC
  ) AS prev_fare_usd
FROM trips;

Why this works: LAG reads the prior row in the per-rider ordering you declare. The extra trip_id in ORDER BY makes ties on completed_at deterministic.


4. Date boundaries and time-series filters in SQL

Half-open windows for trip and event timestamps

Overview: Trip and event fact tables use timestamps; interviewers watch for double-counting at calendar boundaries and off-by-one bugs when a day “starts” at UTC vs local.

Why half-open intervals [start, end)

Sub-topic: Closed intervals BETWEEN '2025-01-01' AND '2025-03-31' are easy to misread when completed_at has time-of-day. A trip at 2025-03-31 23:59:59 might be in or out depending on inclusive wording.

Worked example: trips(trip_id, completed_at, fare_usd). Q1 revenue with half-open bounds:

SELECT SUM(fare_usd) AS q1_revenue
FROM trips
WHERE completed_at >= TIMESTAMP '2025-01-01 00:00:00'
  AND completed_at <  TIMESTAMP '2025-04-01 00:00:00';

Takeaway: Prefer >= start AND < end so a midnight event never lands in two quarters. State timezone if the prompt is ambiguous.
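The same half-open rule is easy to express outside SQL; a sketch with Python datetimes (naive timestamps, assuming one agreed timezone; the predicate name is mine):

```python
from datetime import datetime

def in_q1_2025(ts):
    # Half-open [start, end): a midnight event belongs to exactly one quarter
    return datetime(2025, 1, 1) <= ts < datetime(2025, 4, 1)

print(in_q1_2025(datetime(2025, 3, 31, 23, 59, 59)))  # True
print(in_q1_2025(datetime(2025, 4, 1)))               # False, start of Q2
```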

Truncating to day or week

Sub-topic: DATE_TRUNC('day', completed_at) (PostgreSQL) buckets events into calendar days in the session timezone. If the business cares about rider-local days, you may need a timezone column or a known conversion rule—say that explicitly before coding.

Worked example: Count trips per day in SF local date (assuming completed_at is UTC stored):

SELECT
  DATE_TRUNC('day', completed_at AT TIME ZONE 'America/Los_Angeles')::date AS day_local,
  COUNT(*) AS n_trips
FROM trips
GROUP BY 1
ORDER BY 1;

(Exact dialect varies; in an interview, name the function family—truncation + timezone—even if you pseudo-code the literal.)

SQL Interview Question on Trips in a Sliding 7-Day Window

Table: trips(trip_id, rider_id, completed_at) (timestamps in UTC). Return rider_id values that had at least one trip in [2025-06-01 00:00 UTC, 2025-06-08 00:00 UTC) (a 7-day reporting window, half-open on the end).

Solution Using Half-Open Filter

SELECT DISTINCT rider_id
FROM trips
WHERE completed_at >= TIMESTAMP '2025-06-01 00:00:00+00'
  AND completed_at <  TIMESTAMP '2025-06-08 00:00:00+00';

Why this works: DISTINCT answers “which riders had activity,” not how many trips. The end is exclusive, so a trip exactly at 2025-06-08 00:00 is out of the window—confirm with the interviewer if that edge matters.


5. Conditional logic, CASE, and readable SQL

CASE, conditional counts, and polish under time pressure

Use CASE inside aggregates when each row contributes to different buckets under the same GROUP BY—for example completed vs canceled trip counts per city.

Searched CASE vs simple CASE

Sub-topic: Simple CASE x WHEN ... compares x to values. Searched CASE WHEN predicate is more flexible (ranges, NULL checks). In interviews, searched form is more common for tier and SLA logic.

Worked example — tiered fare buckets:

trip_id city fare_usd
1 SF 8
2 SF 22
3 SF 55
SELECT city,
       SUM(CASE WHEN fare_usd < 10 THEN 1 ELSE 0 END) AS low_cnt,
       SUM(CASE WHEN fare_usd BETWEEN 10 AND 50 THEN 1 ELSE 0 END) AS mid_cnt,
       SUM(CASE WHEN fare_usd > 50 THEN 1 ELSE 0 END) AS high_cnt
FROM trips
GROUP BY city;

Result for SF: low_cnt = 1, mid_cnt = 1, high_cnt = 1.

Readable WITH for multi-stage CASE logic

Sub-topic: When CASE nests more than two levels, split: labeled AS (SELECT ..., CASE ... END AS bucket FROM raw) then SELECT bucket, COUNT(*) FROM labeled GROUP BY bucket.

Common beginner mistakes

  • Filtering SUM(...) in WHERE instead of HAVING or an outer query.
  • JOIN explosion then a naive count-all-rows aggregate without noticing duplicated keys.
  • PARTITION BY keys that do not match the business question (wrong window).
  • Forgetting NULL in outer joins—predicates belong in ON vs WHERE depending on intent.

SQL Interview Question on SLA Buckets by City

Table: deliveries(delivery_id, city, minutes_to_complete, was_late). Return each city, total deliveries, count on_time (was_late = false), count late, and pct_late = late / total as a decimal rounded to 4 places (avoid integer division). Exclude cities with fewer than 100 deliveries.

Solution Using Conditional Sums and Filtering Groups

SELECT
  city,
  COUNT(*) AS total_deliveries,
  SUM(CASE WHEN NOT was_late THEN 1 ELSE 0 END) AS on_time,
  SUM(CASE WHEN was_late THEN 1 ELSE 0 END) AS late,
  ROUND(
    SUM(CASE WHEN was_late THEN 1 ELSE 0 END)::numeric / COUNT(*),
    4
  ) AS pct_late
FROM deliveries
GROUP BY city
HAVING COUNT(*) >= 100;

Why this works: CASE turns boolean rows into 0/1 countable contributions; HAVING enforces the minimum volume rule at city grain.



6. Python patterns aligned with the curated Uber hub

Python patterns in interviews for data engineering

Overview: Many Uber-tagged items emphasize sliding windows, two pointers, intervals, hash maps, heaps, stacks, graphs / BFS, trees, linked lists, and occasional DP / greedy. Treat them as interview-sized scripts: clear loops, careful edge cases, O(n) or stated tradeoffs—not production frameworks.

Topic map (how this section is organized): Each block below follows the same recipe—what the hub is testing → sub-topics (mechanics) → worked micro-example → interview-style problem → solution → why it works. That mirrors how PipeCode structures Uber-tagged Python practice: small, testable functions with explicit invariants.

Grid of Python interview pattern chips including sliding window, graph, and heap with complexity hints for Uber data engineering prep on PipeCode.

Sliding window and bounded substring problems

When: You need the best subarray/substring of bounded length or with at most K distinct values—invariant lives in the window edges.

Worked example: For array [1,2,1,0,2] and at most 2 distinct integers in the window, expand right, shrink left when distinct > 2, track max length—classic O(n) scan.

Trace (at most 2 distinct): Scan right, keep freq of values in [left, right]. When len(freq) > 2, advance left and decrement freq[nums[left]] until len(freq) <= 2. After each right, best = max(best, right - left + 1). On this array the maximum length is 3 (window [1,2,1] before 0 forces a shrink).
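The trace above corresponds to a standard frequency-map window; a sketch (the function name is mine):

```python
from collections import defaultdict

def longest_at_most_k_distinct(nums, k=2):
    freq = defaultdict(int)
    left = best = 0
    for right, x in enumerate(nums):
        freq[x] += 1
        while len(freq) > k:              # invariant broken: shrink from left
            freq[nums[left]] -= 1
            if freq[nums[left]] == 0:
                del freq[nums[left]]
            left += 1
        best = max(best, right - left + 1)
    return best

print(longest_at_most_k_distinct([1, 2, 1, 0, 2]))  # 3, window [1, 2, 1]
```

Each index enters and leaves the window at most once, so the scan is O(n).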

Two pointers, intervals, and merge patterns

When: Sorted arrays, interval merge, collision from both ends. Sorting plus linear scan often beats ad-hoc brute force.

Worked example: Intervals [1,3], [2,6], [8,10] merge to [1,6], [8,10] after sorting by start.

Hash maps, frequency counts, and heap / top-k patterns

When: Counts, first-seen order, heavy hitters, or streaming top-k. Pair dict/Counter with heapq for bounded memory.

Stacks and order-sensitive simulations

When: Nested structure, cancellation rules, or monotonic stack patterns (e.g., next greater variants).

Graphs, BFS, and multi-step reachability

When: Layers, shortest steps in unweighted graphs, or multi-source propagation. Clarify directed vs undirected before coding.

Trees, linked lists, and recursion

When: Recursive structure, in-place rewiring. Watch recursion depth on skewed trees; mention iterative stacks if asked.

Dynamic programming and greedy (when screens go harder)

When: Optimal substructure with overlapping subproblems—DP; local optimal leads to global—greedy after you prove it.

Common beginner mistakes

  • Off-by-one window bounds (right inclusive vs exclusive).
  • Mutating a structure while iterating without a plan.
  • BFS vs DFS confusion on shortest path questions.
  • Heap misuse—forgetting heapq is a min-heap in Python.

Python Interview Question on Merging Ride Intervals

You are given a list of half-open intervals [start, end) for busy windows per driver (integer minutes). Merge overlapping intervals and return the total covered minutes; for the toy input [[1,4],[2,5],[7,8]] the answer is 5 (merged blocks [1,5) and [7,8)).

Solution Using Sort and Linear Scan

def covered_minutes(intervals):
    if not intervals:
        return 0
    intervals.sort(key=lambda x: x[0])
    total = 0
    cs, ce = intervals[0]
    for s, e in intervals[1:]:
        if s <= ce:
            ce = max(ce, e)
        else:
            total += ce - cs
            cs, ce = s, e
    total += ce - cs
    return total

Why this works: Sorting by start lets you extend the current merged block when s <= ce; otherwise you close the block and add its length. Half-open math avoids double-counting touch points.

1. Intervals and two pointers in Python for Uber-style screens

What the cluster tests: You treat 1-D intervals or sorted arrays with two indices (or sweep ideas). Hub items emphasize intersection, union, gaps, and symmetric-difference-style thinking on the line.

Sub-topic: ordering and merging

Sort by start (and break ties by end). Scan once: extend the current merged block when next.start <= current.end, else emit the block and start a new one.

Worked example — merge

Intervals (closed) [1,3], [2,6], [8,10] → merged [1,6], [8,10].

Worked example — total covered length (half-open tweak)

If intervals are [start, end), merged length is sum(end - start) over merged blocks—useful when the interview uses exclusive ends to avoid double counting touch points.

Sub-topic: intersection of two intervals

Two intervals [a1,a2] and [b1,b2] intersect iff max(a1,b1) <= min(a2,b2). The intersection is [max(a1,b1), min(a2,b2)] or empty if the inequality fails.

Python Interview Question on Cleaning Two Shift Windows

You have two closed intervals per machine id: scheduled = [s1,e1] and actual = [s2,e2] (minutes in a day, 0…1440). Return the symmetric difference duration in minutes (time covered by exactly one of the two intervals, not both).

Toy input: scheduled = [100, 200], actual = [150, 250]. Overlap is [150,200] (51 minutes on an inclusive integer-minute grid—clarify inclusivity with the interviewer). Symmetric difference = [100,149] ∪ [201,250] → 100 minutes.

Solution Using Merge of Non-Overlap Pieces

def symmetric_diff_len(a, b):
    """a, b are [start, end] inclusive integer minutes."""
    s1, e1 = a
    s2, e2 = b
    lo = max(s1, s2)
    hi = min(e1, e2)
    inter = max(0, hi - lo + 1) if lo <= hi else 0
    len_a = e1 - s1 + 1
    len_b = e2 - s2 + 1
    return len_a + len_b - 2 * inter

Why this works: Inclusion–exclusion on lengths: |A ∪ B| = |A| + |B| − |A ∩ B|; symmetric difference length is |A| + |B| − 2|A ∩ B| for counting discrete minutes when intervals are inclusive integers.



2. Sliding window in Python (distinct, bounded odd counts)

What the cluster tests: Maintain an invariant on a subarray/substring while right advances and left retreats—O(n) total when each index moves monotonically.

Sub-topic: “at most K distinct” or “longest without repeat”

Use a frequency map; shrink from left until the constraint holds again.

Worked example — longest length with at most 2 distinct values

Take [1,2,1,0,2]. Expand right while distinct ≤ 2, shrink left when you exceed:

  • Best window without 0 is [1,2,1] → length 3.
  • Including 0 forces three distinct values unless you drop enough from the left; the best length stays 3 on this array.

Interview lesson: always voice the invariant (“at most K distinct”) while you move left.

Sub-topic: parity / “at most K odd” subarrays

Track count of odd numbers in the window; shrink while odds > K.

Python Interview Question on Longest Subarray With at Most Two Odd Elements

Given a list of non-negative integers, return the maximum length of a contiguous subarray with at most 2 odd numbers.

Toy input: [2, 1, 3, 4, 5, 6] → the window [3,4,5,6] has odds 3 and 5 → length 4; every window of length 5 contains three odds, so the answer is 4.

Solution Using Flexible Left Pointer

def longest_at_most_k_odds(nums, k=2):
    left = 0
    odds = 0
    best = 0
    for right, x in enumerate(nums):
        if x % 2 == 1:
            odds += 1
        while odds > k:
            if nums[left] % 2 == 1:
                odds -= 1
            left += 1
        best = max(best, right - left + 1)
    return best

Why this works: Each right extends the window; left only moves forward when odds > k, so every element is visited at most twiceO(n).



3. Hash tables, frequency, and heaps (pairs, votes, top-k stream)

What the cluster tests: Counter / dict for counts modulo logic, tie-breaks, or heavy hitters; heapq for top-k or streaming maintenance.

Sub-topic: modulo pairs

Count freq[r] = the number of values with x % k == r. To count all index pairs whose sum is divisible by k, residues r and k−r multiply: freq[r] * freq[k-r], with C(n, 2) self-pairing inside the r == 0 bucket and (for even k) the r == k/2 bucket. A different variant, forming the maximum number of disjoint pairs, uses min(freq[r], freq[k-r]) instead.

Worked example: nums = [1,2,3,4,5,6], k = 3. Residues: 1→2, 2→2, 0→2. Cross bucket (1,2): 2 × 2 = 4 pairs; residue 0: C(2,2) = 1 pair; 5 pairs in total.

Sub-topic: streaming top-k

A min-heap of size k stores the k largest seen so far; heap root is the weakest of the k.
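A sketch of that bounded-memory top-k maintenance with heapq (the helper name is mine):

```python
import heapq

def top_k_stream(stream, k):
    """Min-heap of size k: the root is the weakest of the current top-k."""
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)   # evict the weakest, admit x
    return sorted(heap, reverse=True)

print(top_k_stream([5, 1, 9, 3, 7, 2], 3))  # [9, 7, 5]
```

Memory stays O(k) no matter how long the stream runs; each element costs O(log k).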

Python Interview Question on Pairs Summing to a Multiple of K

Given nums and k, count the pairs (i, j) with i < j and nums[i] + nums[j] divisible by k, using the residue buckets above.

Solution Using Residue Buckets

from collections import Counter

def count_pairs_divisible(nums, k):
    ctr = Counter(x % k for x in nums)
    ans = 0
    for r in range(k // 2 + 1):
        if r == 0:
            ans += ctr[0] * (ctr[0] - 1) // 2
        elif 2 * r == k:
            ans += ctr[r] * (ctr[r] - 1) // 2
        else:
            ans += ctr[r] * ctr[k - r]
    return ans

Why this works: Only residues r and k−r can pair; r=0 and r=k/2 are self-pairs with C(n,2).

Python Interview Question on First Repeated Trip ID

You process trip_ids in order. Return the first id that appears twice, or None if all are unique. Constraint: O(n) time, O(n) extra space.

Toy input: [101, 202, 303, 202, 404]202.

Solution Using a Set

def first_repeat(trip_ids):
    seen = set()
    for tid in trip_ids:
        if tid in seen:
            return tid
        seen.add(tid)
    return None

Why this works: Set membership is average O(1); the first duplicate is returned immediately when the second occurrence is read.



4. Stack simulations (collisions / cancellation)

What the cluster tests: LIFO reasoning: nested structure, matching, or sign-aware cancellation along a line.

Sub-topic: when to use a stack

If the next element interacts with the most recent unresolved element (not arbitrary past), use a stack.

Worked example — bracket-like cancellation

Symbols ( and )—push on (, pop/cancel on ); invalid if pop from empty.
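A minimal stack validator for that cancellation rule (single bracket type, as in the example; the function name is mine):

```python
def balanced(s):
    stack = []
    for ch in s:
        if ch == "(":
            stack.append(ch)
        elif ch == ")":
            if not stack:            # pop from empty: invalid
                return False
            stack.pop()
    return not stack                 # unmatched leftovers are also invalid

print(balanced("(())"))  # True
print(balanced("())"))   # False
```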

Python Interview Question on Signed Collisions on a Line

Toy: You are given non-zero integers on a line; positive moves right, negative moves left. When a right-mover meets a left-mover, the smaller absolute value explodes; equal values both explode; the survivor keeps moving. (This is asteroid-style logic; confirm the full rules with the interviewer.)

Toy trace: [2, -1]2 wins → [2]; [2, -2] → both gone → [].

Solution Sketch Using a Stack

def asteroid_survive(arr):
    st = []
    for x in arr:
        alive = True
        while alive and st and st[-1] > 0 and x < 0:
            if st[-1] < -x:
                st.pop()
            elif st[-1] == -x:
                st.pop()
                alive = False
            else:
                alive = False
        if alive:
            st.append(x)
    return st

Why this works: Only right-moving stack tops collide with an incoming left-moving value. Compare magnitudes: smaller pops; equal both annihilate; larger stops the loop and survives.



5. BFS, graphs, and layered search

What the cluster tests: Shortest hops in an unweighted graph, multi-source BFS, or state-space BFS (digits/steps).

Sub-topic: BFS vs DFS

Use BFS when the first time you dequeue a node is at minimum distance from the source (unweighted edges).

Worked example — layers

Graph 0—1—2, start 0 → distances {0:0,1:1,2:2}.
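A sketch of layered BFS producing exactly those distances on the 0—1—2 path graph (adjacency-list input; the helper name is mine):

```python
from collections import deque

def bfs_distances(adj, start):
    dist = {start: 0}
    q = deque([start])
    while q:
        node = q.popleft()
        for nxt in adj[node]:
            if nxt not in dist:       # first visit = minimum hop count
                dist[nxt] = dist[node] + 1
                q.append(nxt)
    return dist

print(bfs_distances({0: [1], 1: [0, 2], 2: [1]}, 0))  # {0: 0, 1: 1, 2: 2}
```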

Sub-topic: bus / transfer graphs

Nodes can be routes or stops; edges encode “you can ride this route”—clarify whether changing buses costs +1 in the objective.

Python Interview Question on Minimum Hops Through Routes

Toy: Stops {A,B,C}. Route R1: A→B, Route R2: B→C. Start A, end C. Minimum bus count is 2 if you must switch at B (or 1 if one route covers A–C—state the model).

Solution Idea (graph + BFS)

Build adjacency: stop → routes through it; BFS on (stop, buses_used) or BFS on routes with visited stops. Answer is the minimum edge count under the interviewer’s transfer rule.

Python Interview Question on Shortest Path in a Grid

You have an m × n grid; 0 means walkable, 1 means blocked. Start (0,0), target (m-1,n-1). You may move up/down/left/right one cell per step. Return the minimum number of steps to reach the target, or -1 if impossible.

Toy grid:

0 0 1
0 0 0
1 0 0

Answer: 4 steps along a shortest path (e.g. right, down, down, right—exact path depends on tie order).

Solution Using BFS on Cells

from collections import deque

def shortest_path_grid(grid):
    if not grid or not grid[0] or grid[0][0] == 1:
        return -1
    m, n = len(grid), len(grid[0])
    if grid[m - 1][n - 1] == 1:
        return -1
    q = deque([(0, 0, 0)])
    seen = {(0, 0)}
    dirs = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    while q:
        r, c, d = q.popleft()
        if r == m - 1 and c == n - 1:
            return d
        for dr, dc in dirs:
            nr, nc = r + dr, c + dc
            if 0 <= nr < m and 0 <= nc < n and grid[nr][nc] == 0:
                if (nr, nc) not in seen:
                    seen.add((nr, nc))
                    q.append((nr, nc, d + 1))
    return -1

Why this works: BFS explores cells in increasing distance from the start; the first time you dequeue the target is at minimum step count (unweighted moves).

Practice

  • COMPANY · Uber — BFS
  • COMPANY · Uber — Graph

6. Dynamic programming and greedy (strings, stocks, grid stories)

What the cluster tests: Optimal substructure with small state—often O(n) or O(n·m)—or a greedy that you can justify.

Sub-topic: greedy for “unlimited transactions” stock

If you can buy/sell any number of times and only need max profit, sum all positive daily deltas: profit += max(0, p[i]-p[i-1]).

Worked example: prices [1,3,2,5] → deltas +2, -1, +3 → greedy profit 2+3=5.

Sub-topic: DP when greedy fails

Wildcard / pattern matching typically needs dp[i][j]: first i chars of text vs first j of pattern.
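A minimal sketch of that dp[i][j] table, assuming the common wildcard alphabet where '?' matches one character and '*' matches any (possibly empty) sequence:

```python
def wildcard_match(text, pattern):
    """dp[i][j] = True iff text[:i] matches pattern[:j].
    '?' matches one char, '*' matches any (possibly empty) sequence."""
    n, m = len(text), len(pattern)
    dp = [[False] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = True
    for j in range(1, m + 1):          # a leading run of '*' can match empty text
        if pattern[j - 1] == "*":
            dp[0][j] = dp[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if pattern[j - 1] == "*":
                # '*' absorbs one more char of text, or matches empty
                dp[i][j] = dp[i - 1][j] or dp[i][j - 1]
            elif pattern[j - 1] in ("?", text[i - 1]):
                dp[i][j] = dp[i - 1][j - 1]
    return dp[n][m]
```

The two recurrence branches are exactly what you narrate at the whiteboard: '*' either extends or stays empty; any other pattern character must consume one matching text character.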

Python Interview Question on Unlimited Transactions With No Overlap

Prices [7,1,5,3,6,4]. Max profit with unlimited buys/sells, one share at a time.

Solution Using Greedy on Deltas

def max_profit_unlimited(prices):
    return sum(max(0, prices[i] - prices[i - 1]) for i in range(1, len(prices)))

Why this works: Every uphill segment can be fully captured; overlapping trades collapse to summing positive steps. For the toy prices [7,1,5,3,6,4], that is (5 − 1) + (6 − 3) = 7.

Practice


7. Trees and linked lists (recursion, pairwise reversal)

What the cluster tests: Recursive structure: mirror symmetry, in-place pointer rewiring.

Sub-topic: symmetric tree (intuition)

Two subtrees are mirrors if left.left matches right.right and left.right matches right.left.

Sub-topic: pairwise reversal

Process two nodes at a time: set n1.next = n2.next and n2.next = n1, relink the previous pair's tail to n2, then advance to n1 (now the tail of the swapped pair).
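The pointer rewiring above, as a minimal sketch; the `Node` class is an assumption, since the post does not fix a node layout:

```python
class Node:
    def __init__(self, val, nxt=None):
        self.val, self.next = val, nxt

def swap_pairs(head):
    """Reverse nodes two at a time: 1→2→3→4 becomes 2→1→4→3."""
    dummy = Node(0, head)
    prev = dummy
    while prev.next and prev.next.next:
        n1, n2 = prev.next, prev.next.next
        n1.next = n2.next   # n1 skips over n2
        n2.next = n1        # n2 points back at n1
        prev.next = n2      # previous pair's tail links to n2
        prev = n1           # n1 is now the tail of this pair
    return dummy.next
```

The dummy head avoids special-casing the first pair, which is the detail interviewers usually probe.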

Python Interview Question on Mirror Symmetry (Values)

Toy tree as nested dicts: {"v":1,"l":{...},"r":{...}}. Return True iff left and right subtrees are mirrors in shape and value.

Solution Using Recursive Mirror Check

def is_mirror(a, b):
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return (
        a["v"] == b["v"]
        and is_mirror(a.get("l"), b.get("r"))
        and is_mirror(a.get("r"), b.get("l"))
    )

def is_symmetric(root):
    if root is None:
        return True
    return is_mirror(root.get("l"), root.get("r"))

Practice

Filter Uber Python problems by difficulty once patterns feel familiar.

Pattern When it shows up
Sliding window Bounded substring / subarray with an invariant
Two pointers / intervals Sorted arrays, merge, collision
Hash + heap Frequency, top-k, streaming heavy hitters
Stack Nested / cancel rules, monotonic variants
Graph + BFS Layers, unweighted shortest path
Tree / linked list Recursive rewiring, symmetry checks
DP / greedy Optimization with structure—prove greedy first

8. Star schema, grain, and pipeline prep for Uber DE interviews

Star schema, grain, and SCD basics for data engineering interviews

Overview: You may be asked to sketch a star schema: a fact table (events, transactions) surrounded by dimensions (time, user, geography, product). Interviewers listen for grain, keys, and slowly changing dimensions (SCD).

Simple star schema diagram with a central fact table, surrounding dimensions, and a grain label for Uber data engineering interview content by PipeCode.

Star schema: facts, dimensions, and keys

Facts hold numeric measures at a chosen grain; dimensions hold descriptive attributes. Surrogate keys on dimensions simplify SCD and role-playing (multiple time roles) in larger warehouses—mention them even if your sketch is small.

Grain: one row means one ___

Worked example (hypothetical): a trip fact at one row per completed trip—measures like fare and duration belong here; rider attributes live in dim_rider, not duplicated inconsistently across facts.

Slowly changing dimensions (interview-level)

  • Type 1: overwrite—no history.
  • Type 2: new row/version when an attribute changes—history preserved.
  • Type 3: limited “previous value” column—rare in modern designs but still asked.

Surrogate keys and why facts rarely store natural keys alone

Sub-topic: A surrogate rider_sk (integer) on dim_rider lets you append Type-2 rows when home_city changes without rewriting fact_trip. The fact stores rider_sk, not only rider_uuid, so historical joins stay correct.

Worked example (verbal): June trip must join to dim_rider row version valid in June, not the current address row—your SQL join or role-play dimension explains how you pick the right version (time range on the dimension or as-of join).

Design Interview Question on Trip Facts and Rider SCD Type 2

Prompt: Sketch fact_trip (one row per completed trip: trip_sk, rider_sk, pickup_dim_date_key, fare_usd, distance_mi) and dim_rider with Type 2 history (rider_sk, rider_uuid, tier, valid_from, valid_to, is_current). Explain how an analyst’s “tier at time of trip” report stays correct when tier changes mid-month.

Solution Outline (what interviewers want to hear)

  1. fact_trip.rider_sk references the dimension version that was active when the trip completed—loaded by the ETL job using completed_at and [valid_from, valid_to) (or is_current only for current-state marts).
  2. Type 2 never updates old dimension rows in place; it closes valid_to on the old row and inserts a new row (with a new rider_sk) for the new tier.
  3. Late-arriving trips are handled by reprocessing or adjusting facts in a controlled way—name idempotent loads and backfill discipline.

Common beginner mistakes

  • Mixed grain in one fact (some rows per trip, some per leg).
  • Putting 1:1 attributes that change slowly in the fact without a dimension story.
  • Ignoring late-arriving facts in verbal design.

Practice

Pipelines and warehouse tradeoffs for Uber-scale DE conversations

Overview: Expect to draw boxes: sources → ingest → processing → serving (warehouse, lakehouse, marts). Name failure modes: duplicates on replay, schema drift, backfills, SLAs.

Batch vs near-real-time ingestion

Batch simplifies correctness and cost; streaming improves freshness but needs idempotent sinks and watermarks. Pick based on latency needs you state explicitly.

Idempotent ETL and exactly-once vs at-least-once (conceptual)

Many systems are at-least-once; you dedupe with keys, merge into idempotent tables, or use exactly-once semantics where the stack supports it. Say what your design assumes.
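A minimal sketch of key-based deduplication for at-least-once delivery; `upsert_latest` and the event shape are illustrative assumptions, and the point is that replaying the same batch is a no-op:

```python
def upsert_latest(table, events, key="event_id", version="updated_at"):
    """Idempotent merge: keep the newest version per event key.
    table: dict keyed by event key; events: iterable of event dicts."""
    for e in events:
        k = e[key]
        cur = table.get(k)
        # newer (or equal, i.e. replayed) versions win; older duplicates are dropped
        if cur is None or e[version] >= cur[version]:
            table[k] = e
    return table

table = {}
batch = [
    {"event_id": "e1", "updated_at": 1, "fare": 10.0},
    {"event_id": "e1", "updated_at": 2, "fare": 12.0},  # later version wins
]
upsert_latest(table, batch)
upsert_latest(table, batch)  # replay of the whole batch changes nothing
print(table["e1"]["fare"])   # 12.0
```

In warehouse terms this is the behavior of a MERGE keyed on the business key with a version column; saying that out loud is what "state your delivery assumption" means.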

Data quality checks and observability (interview framing)

Row counts, null rates, freshness, anomaly alerts—tie checks to consumer impact (dashboards, experiments, pricing).
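Those checks can be sketched as a small batch validator; the function name, row shape, and thresholds are all illustrative assumptions:

```python
from datetime import datetime, timedelta

def run_checks(rows, now, min_rows=1, max_null_rate=0.05,
               max_staleness=timedelta(hours=2)):
    """Return the names of failed checks for a batch of row dicts."""
    failures = []
    if len(rows) < min_rows:
        return ["row_count"]            # nothing else is meaningful on an empty batch
    null_rate = sum(r["fare"] is None for r in rows) / len(rows)
    if null_rate > max_null_rate:
        failures.append("null_rate")
    newest = max(r["ts"] for r in rows)
    if now - newest > max_staleness:    # freshness: how stale is the newest row?
        failures.append("freshness")
    return failures

now = datetime(2024, 6, 1, 12, 0)
batch = [
    {"fare": 10.0, "ts": datetime(2024, 6, 1, 11, 30)},
    {"fare": None, "ts": datetime(2024, 6, 1, 11, 45)},
]
print(run_checks(batch, now))  # ['null_rate']
```

The interview framing is the return value: each failed check name maps to a consumer-facing impact you can narrate (stale dashboard, broken experiment, wrong price).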

Serving analysts and downstream consumers

Marts with clear ownership, documented grain, and stable contracts beat a single mushy “data lake” slide.

Pro tip: When interviewers push on tools, answer with principles first (idempotency, schema evolution, replay), then name technologies as examples—not the other way around.

Practice

Deepen ETL system design for interviews if whiteboard rounds make you nervous.

Study sequence matched to the Uber hub mix

Overview: The Uber hub export lists 20 items with 1 easy, 14 medium, and 5 hard—mediums dominate. That matches how many loops feel: core competence first, hard proof later.

Why Uber-tagged problems skew medium and hard

Mediums test pattern recognition; hards test composing patterns under time. Do not skip SQL because Python hards look exciting.

Sequencing: SQL, then Python patterns, then modeling and design

  1. SQL: joins → aggregates → windows → layered CTEs.
  2. Python: windows / two pointers → hash + heap → graphs → trees → DP as needed.
  3. Modeling + ETL narration daily—short 5-minute whiteboard talks.

Volume and spaced repetition before the loop

Aim for 30–50+ solved problems across lanes (not only one tag). Mix Uber hub items with global SQL and Python.

Band Count (curated Uber set) How to use it
Easy 1 Confidence and syntax warmup
Medium 14 Core bar—repeat until patterns feel automatic
Hard 5 Timed proof; narrate tradeoffs

Practice

Optional acceleration: SQL for data engineering interviews and Python fundamentals for DE interviews.


Tips to crack Uber data engineering interviews

Data engineering interview preparation sticks when you simulate the loop: timed SQL, timed Python, and spoken design—not passive reading.

SQL preparation and query discipline

State keys and grain before SELECT. Verbalize JOIN cardinality. Re-read the prompt for ties, NULL, and half-open time windows.

Python patterns under time pressure

Warm up templates (window hash map, interval merge, BFS queue) so you do not invent structure from zero under stress.
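One of those warm-up templates, interval merge, as a minimal sketch you can reproduce from memory:

```python
def merge_intervals(intervals):
    """Merge overlapping [start, end] intervals; input need not be sorted."""
    out = []
    for start, end in sorted(intervals):
        if out and start <= out[-1][1]:          # overlaps the last merged interval
            out[-1][1] = max(out[-1][1], end)    # extend it
        else:
            out.append([start, end])             # start a fresh interval
    return out

print(merge_intervals([[1, 3], [2, 6], [8, 10]]))  # [[1, 6], [8, 10]]
```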

Data pipeline and ETL storytelling

Practice a 90-second ETL story: sources, ingest, transform, serving, quality, failure. Link to ETL system design.

Modeling and grain out loud

Keep a star sketch habit: fact in the middle, dimensions around, one row = one ___ written on the fact.

Where to practice on PipeCode


Frequently asked questions

What is the Uber data engineering interview process?

It commonly includes a recruiter screen, SQL and Python technical rounds, a data pipeline or architecture discussion, and behavioral interviews—exact stages vary by team and level. Use this article as a skills map, then confirm timing with your recruiter.

What topics should I study for Uber data engineering interview questions?

Prioritize SQL joins, aggregates, and window functions; Python patterns like sliding window, intervals, hash maps, heaps, and graphs; star schema and grain; and ETL / warehouse tradeoffs. Practice on the Uber hub plus global SQL topics.

How difficult are Uber data engineering interview questions compared to other companies?

The curated Uber set on PipeCode skews medium and hard on Python, with a small easy count—use that as a difficulty thermometer, not a guarantee for every loop. Comparing companies is less useful than mastering grain and narration.

Does Uber test SQL or Python more for data engineering roles?

Expect both: SQL for reporting and metric correctness, Python for algorithmic cleanliness and data transforms. Some rounds lean more on one language; prep both to avoid surprises.

How should I prepare for Uber data engineering system design and pipeline questions?

Study ingest → transform → serve, idempotency, data quality, and failure replay. Pair reading with the ETL system design course and deep dives on PipeCode.

How many practice problems should I solve before an Uber data engineering interview?

A practical band is 30–50+ problems across SQL, Python, and a few modeling exercises, emphasizing mediums until you are fast, then hards for proof. Volume matters less than review and verbal explanation.

Start practicing Uber-style data engineering problems

PipeCode pairs Uber-tagged SQL and Python problems with tests and feedback so you move from reading guides to typing working solutions—the same skill mix this post outlined.

Pipecode.ai is Leetcode for Data Engineering

Browse Uber practice →

View plans →
