Gowtham Potureddi
DoorDash Data Engineering Interview Questions

DoorDash data engineering interview questions usually stress SQL for marketplace metrics—orders, merchants, Dashers, ratings—and Python for structured problems (hash maps, sliding windows, graphs) that still show up on many DE loops. This guide maps the skills behind 63+ DoorDash-tagged practice problems on PipeCode so you know what to drill and in what order.

We walk through aggregation and grain, joins and fan-out, window functions, time logic, bucketing, representative Python patterns, and a concise data modeling pass (star schema and SCDs). Sample prompts below are original teaching examples, not leaked items.


Top DoorDash data engineering interview topics

| # | Topic | Why it shows up |
|---|-------|-----------------|
| 1 | Aggregation & GROUP BY / HAVING | Per-merchant revenue, per-Dasher counts, SLA segments. |
| 2 | Joins & deduplication | Orders to deliveries without row explosion; “no activity” anti-joins. |
| 3 | Window functions & ranking | Top-N per region, previous/next event, running metrics. |
| 4 | Dates, streaks, time-series SQL | Daily actives, latency buckets, gaps between orders. |
| 5 | CASE & bucketing | On-time vs late, radius tiers, volume bands. |
| 6 | Python: frequency & sliding windows | Counters, bounded intervals, order-sequence scans. |
| 7 | Star schema, grain, SCDs | Verbal warehouse design and historical dimensions. |

If you are new to SQL: In most databases the engine processes a query roughly in this order: FROM / joins → WHERE (filter rows) → GROUP BY → compute aggregates → HAVING (filter groups) → window functions → SELECT / ORDER BY. When in doubt, ask: “Am I filtering one row at a time (WHERE) or a whole group after summing (HAVING)?” Saying that out loud in a DoorDash SQL interview buys you time to structure the query.


1. Aggregation and GROUP BY concepts in marketplace SQL

Aggregation and GROUP BY in SQL for marketplace metrics

Picture a table with many detail rows—for example one row per order at a merchant. Aggregation means: “turn lots of rows into one summary value (or a few values) per bucket.” The bucket is whatever the prompt names: per merchant, per Dasher, per day, and so on.

  • GROUP BY defines the bucket: “Put all rows with the same merchant_id together,” or “the same (store_id, day) together.” Every distinct combination of the GROUP BY columns is one group; the database runs your aggregate functions separately inside each group.

SUM(col)

  • What it does: Adds all numeric values of col in the bucket.
  • NULL behavior: NULL cells are ignored (they are not treated as 0). If every value is NULL, SUM is usually NULL, not 0—say that in an interview if the edge case matters.
  • Typical use: Gross merchandise value per merchant, delivery fees summed per region.

Worked example: In one group, subtotal values 10, NULL, 30 → SUM(subtotal) = 40 (only 10 + 30).

AVG(col)

  • What it does: Average of non-NULL values: sum of non-null values divided by count of non-null values.
  • NULL behavior: Rows where col is NULL do not enter the numerator or the denominator. If you need “average where missing means 0,” use AVG(COALESCE(col, 0)) (only if the business defines it that way).
  • Typical use: Average order value per merchant, average delivery minutes per zone.

Worked example: Same three rows 10, NULL, 30 → AVG(subtotal) = 20 because (10 + 30) / 2; the NULL row is not counted in the average.

COUNT(*)

  • What it does: Counts how many rows are in the bucket—every row counts, even if some columns are NULL.
  • Typical use: “How many orders per merchant?”, “how many delivery attempts in this hour?”—when each row is one event.

Worked example: Same three rows → COUNT(*) = 3 (the NULL row still counts as a row).

COUNT(col)

  • What it does: Counts rows where col is not NULL. Differs from COUNT(*) as soon as col has nulls.
  • Related: COUNT(DISTINCT col) counts unique non-null values in the bucket—essential after joins when you must count customers, not multiplied rows (see section 2).

Worked example: Same three rows → COUNT(subtotal) = 2. If the third row were 50 instead of NULL, COUNT(DISTINCT subtotal) with values 10, 30, 50 would be 3; with 10, 30, 10 it would be 2.

MIN(col) and MAX(col)

  • What they do: Return the smallest or largest value of col in the bucket. Works on orderable types: numbers, dates/timestamps, strings (lexicographic order).
  • NULL behavior: NULLs are skipped. If all values are NULL, the result is NULL.
  • Typical use: Latest order time (MAX(order_ts)), earliest Dasher signup (MIN(created_at)).

Worked example: Subtotals 10, NULL, 30 → MIN(subtotal) = 10, MAX(subtotal) = 30.

Worked example — one dataset, several aggregates

Suppose orders looks like this:

| order_id | merchant_id | subtotal |
|----------|-------------|----------|
| 101 | M1 | 20.00 |
| 102 | M1 | NULL |
| 103 | M1 | 40.00 |
| 104 | M2 | 100.00 |

Run:

```sql
SELECT merchant_id,
       SUM(subtotal)   AS sum_amt,
       AVG(subtotal)   AS avg_amt,
       COUNT(*)        AS n_rows,
       COUNT(subtotal) AS n_known_amt,
       MIN(subtotal)   AS min_amt,
       MAX(subtotal)   AS max_amt
FROM orders
GROUP BY merchant_id
ORDER BY merchant_id;
```

You should get:

| merchant_id | sum_amt | avg_amt | n_rows | n_known_amt | min_amt | max_amt |
|-------------|---------|---------|--------|-------------|---------|---------|
| M1 | 60.00 | 30.0000… | 3 | 2 | 20.00 | 40.00 |
| M2 | 100.00 | 100.0000… | 1 | 1 | 100.00 | 100.00 |

For M1, the row with subtotal NULL still counts in COUNT(*) (3 rows) but not in SUM / AVG / COUNT(subtotal) (only the 20 and 40 matter).

COUNT(DISTINCT) mini-example: if delivery_attempts has two rows for the same order_id (retry), COUNT(*) is 2 but COUNT(DISTINCT order_id) is 1 for that bucket.

Conditional aggregation (CASE inside aggregates)

  • Idea: Count or sum only some rows in the group without splitting into multiple queries—put the condition inside the aggregate.
  • Patterns: SUM(CASE WHEN condition THEN col ELSE 0 END), COUNT(CASE WHEN … THEN 1 END) (or SUM(CASE WHEN … THEN 1 ELSE 0 END)).

Worked example — events per merchant by type

| merchant_id | event_type |
|-------------|------------|
| M1 | impression |
| M1 | order |
| M1 | impression |
| M2 | impression |

```sql
SELECT merchant_id,
       COUNT(*) AS total_events,
       SUM(CASE WHEN event_type = 'order' THEN 1 ELSE 0 END) AS orders,
       SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS impressions
FROM merchant_events
GROUP BY merchant_id;
```

Result

| merchant_id | total_events | orders | impressions |
|-------------|--------------|--------|-------------|
| M1 | 3 | 1 | 2 |
| M2 | 1 | 0 | 1 |

This is the same “one GROUP BY, many metrics” style as COUNT(*) FILTER (WHERE …) in PostgreSQL; FILTER is dialect-specific, so portable warehouse SQL leans on CASE.

GROUP BY and HAVING (how they fit together)

  • HAVING is WHERE for buckets: it runs after grouping, so you can filter on AVG(subtotal), COUNT(*), etc. WHERE runs before grouping and only sees raw row columns—so WHERE AVG(subtotal) > 50 is invalid: the average does not exist until after GROUP BY.

Worked example — WHERE vs HAVING

| order_id | merchant_id | subtotal |
|----------|-------------|----------|
| 1 | M1 | 40 |
| 2 | M1 | 70 |
| 3 | M1 | 20 |
| 4 | M2 | 200 |
  • WHERE subtotal > 30 drops row 3 before grouping. Then GROUP BY merchant_id with SUM(subtotal) gives M1 = 110 (40+70), M2 = 200.
  • HAVING SUM(subtotal) > 100 runs after grouping on the unfiltered table: M1’s sum is 130, M2’s is 200—both pass.

Worked example — HAVING output

```sql
SELECT merchant_id, SUM(subtotal) AS total_amt, COUNT(*) AS n_orders
FROM orders
GROUP BY merchant_id
HAVING SUM(subtotal) > 100;
```

Result

| merchant_id | total_amt | n_orders |
|-------------|-----------|----------|
| M1 | 130 | 3 |
| M2 | 200 | 1 |

Rule of thumb: If the condition uses SUM / COUNT / AVG / … of the group, use HAVING. If it only uses this row’s columns, use WHERE (and put it first; it usually makes the query faster too).

If a column appears in SELECT and is not inside an aggregate, it must appear in GROUP BY (in strict SQL).

Common beginner mistakes

  • Putting AVG(subtotal) > 50 in WHERE → use HAVING after GROUP BY.
  • Forgetting a column in GROUP BY when it appears in SELECT without an aggregate → invalid query in strict SQL.
  • Answering at the wrong grain (e.g. one row per order_line when the question asked per merchant).

SQL Interview Question on Aggregation and GROUP BY

Given orders(order_id, merchant_id, subtotal, is_cancelled), compute gross revenue per merchant excluding cancelled orders. Return merchants with revenue strictly greater than 100.

Solution Using GROUP BY, Conditional Sum, and HAVING

```sql
SELECT merchant_id,
       SUM(CASE WHEN NOT is_cancelled THEN subtotal ELSE 0 END) AS gross_rev
FROM orders
GROUP BY merchant_id
HAVING SUM(CASE WHEN NOT is_cancelled THEN subtotal ELSE 0 END) > 100;
```

Why this works: We group by merchant, sum only non-cancelled subtotals, then keep only groups whose gross revenue passes the threshold. That threshold is an aggregate, so it belongs in HAVING.
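To see the query behave on real rows, here is a minimal SQLite run over an invented four-row orders table (the data is illustrative; SQLite accepts the same NOT is_cancelled predicate with 0/1 flags):

```python
import sqlite3

# Hypothetical mini-dataset to sanity-check the conditional-sum + HAVING query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id, merchant_id, subtotal, is_cancelled)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "M1", 80, 0),   # counts toward M1
        (2, "M1", 50, 0),   # counts toward M1 -> gross 130
        (3, "M1", 999, 1),  # cancelled, excluded from the sum
        (4, "M2", 90, 0),   # M2 gross 90, dropped by HAVING
    ],
)
rows = conn.execute("""
    SELECT merchant_id,
           SUM(CASE WHEN NOT is_cancelled THEN subtotal ELSE 0 END) AS gross_rev
    FROM orders
    GROUP BY merchant_id
    HAVING SUM(CASE WHEN NOT is_cancelled THEN subtotal ELSE 0 END) > 100
""").fetchall()
print(rows)  # [('M1', 130)]
```

Only M1 survives: the cancelled 999 never enters its sum, and M2's 90 fails the threshold.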

Practice

When you are ready for timed reps, open DoorDash SQL (medium).


2. Joins, deduplication, and fan-out patterns in SQL

Joins and Deduplication in SQL for logistics analytics

A join answers: “For each row in table A, which rows in table B belong with it?” You declare that with ON a.key = b.key. In marketplace analytics you enrich orders and deliveries with merchants, customers, and Dashers.

Fact vs dimension and grain (how DEs talk about joins)

  • Fact table: Many rows, often one row per event or transaction (order, delivery attempt)—the thing you SUM or COUNT.
  • Dimension table: Fewer rows, one row per entity you describe (merchant, Dasher)—the thing you JOIN to get names, zones, categories.
  • Grain: “One result row per what?” Wrong grain after a join usually means fan-out: you aggregated at order level but summed a merchant attribute that repeated.

Worked example: order_lines (fact: order_id, sku, qty) joins items (dimension: sku, list_price). SUM(qty) by sku is correct at line grain. If items had duplicate rows per sku, one join could inflate SUM(list_price); fix the dimension to one row per sku before summing price.

Diagram comparing SQL join fan-out row multiplication versus a deduplicated join path for DoorDash-style orders and deliveries, in PipeCode brand colors.

Inner join (JOIN / INNER JOIN)

  • Result: Only rows where the ON condition succeeds on both sides.
  • Use when: You only care about “matched” pairs—e.g. orders that have a known merchant row.
  • NULL keys: NULL = NULL is not true in join semantics, so inner joins drop rows with a null join key unless you COALESCE or filter explicitly.

Worked example

orders

| order_id | merchant_id |
|----------|-------------|
| O1 | M10 |
| O2 | M99 |

merchants

| merchant_id | name |
|-------------|------|
| M10 | Noodle House |
FROM orders o JOIN merchants m ON o.merchant_id = m.merchant_id returns one row: O1. Order O2 has no merchant row and disappears from an inner join.

Left outer join (LEFT JOIN)

  • Result: Every row from the left table appears once per matching right row; if there is no match, right-side columns are NULL.
  • Use when: You must keep all left entities even if they have zero events—e.g. all merchants including those with no orders this week.

Worked example

merchants

| merchant_id | name |
|-------------|------|
| M1 | Cafe A |
| M2 | Cafe B |

orders

| order_id | merchant_id | amount |
|----------|-------------|--------|
| X1 | M1 | 50 |
FROM merchants m LEFT JOIN orders o ON m.merchant_id = o.merchant_id yields two rows: M1 + X1, and M2 + NULL (no order—right side null-padded).

Semi-join vs anti-join (names, not syntax)

  • Semi-join: “Keep rows from A where some match exists in B”—WHERE EXISTS (SELECT 1 FROM B WHERE …).
  • Anti-join: “Keep rows from A with no match in B”—LEFT JOIN B … WHERE B.key IS NULL or WHERE NOT EXISTS (SELECT 1 FROM B WHERE …).

Worked example

dashers has dasher_id values d1, d2, d3; completed_deliveries has dasher_id d2 only (completed).

  • Semi-join: dashers who have at least one completed delivery → d2.
  • Anti-join: dashers who never completed → d1, d3.

Fan-out (row multiplication)

  • If the join key is not unique on one side (one order, many delivery attempts), the “one” side’s columns repeat on every matching row.
  • Bug pattern: Summing a merchant-level column after the join counts that merchant’s field multiple times. Fixes: aggregate first, or COUNT(DISTINCT merchant_id), or a subquery that makes the dimension one row per key before joining.

Worked example

merchants: (merchant_id=M1, fee_tier_credit=500)

orders: two rows for merchant_id=M1.

After merchants JOIN orders, fee_tier_credit appears twice (500, 500). SUM(m.fee_tier_credit) would add 1000, which is wrong for “total tier credit in the system.” Fix: SUM order amounts, or MAX(m.fee_tier_credit) per merchant if you only need the credit once.
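The inflation is easy to reproduce; this sketch replays the fee_tier_credit example in SQLite (table contents are the invented rows from above):

```python
import sqlite3

# One merchant attribute repeats once per joined order row,
# so summing it after the join double-counts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE merchants (merchant_id, fee_tier_credit)")
conn.execute("INSERT INTO merchants VALUES ('M1', 500)")
conn.execute("CREATE TABLE orders (order_id, merchant_id, amount)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [("O1", "M1", 20), ("O2", "M1", 30)])

bad = conn.execute("""
    SELECT SUM(m.fee_tier_credit)
    FROM merchants m JOIN orders o ON m.merchant_id = o.merchant_id
""").fetchone()[0]          # credit counted once per order row

good = conn.execute("""
    SELECT SUM(c) FROM (
        SELECT MAX(m.fee_tier_credit) AS c
        FROM merchants m JOIN orders o ON m.merchant_id = o.merchant_id
        GROUP BY m.merchant_id
    )
""").fetchone()[0]          # one credit per merchant

print(bad, good)  # 1000 500
```

The "aggregate first" fix (MAX per merchant before the outer SUM) restores the intended 500.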

Deduplication

  • SELECT DISTINCT removes duplicate rows when any duplicate row is interchangeable.
  • Latest row per id: Prefer ROW_NUMBER() OVER (PARTITION BY id ORDER BY time DESC) AS rn in a CTE, then WHERE rn = 1. DISTINCT ON (PostgreSQL) is a compact alternative when the prompt names the dialect.

Worked example — latest attempt per order

```sql
WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY attempt_ts DESC) AS rn
  FROM delivery_attempts
)
SELECT * FROM ranked WHERE rn = 1;
```

Join that result to orders to avoid inflated revenue.

Common beginner mistakes

  • Joining on the wrong key or ignoring NULL keys (they do not match in an inner join).
  • Assuming one row per merchant after a join when the join multiplied rows.
  • Using COUNT(*) when the question wanted distinct customers—use COUNT(DISTINCT customer_id).

Joins-Based Interview Question on Dashers Without Completed Deliveries

From dashers(dasher_id) and deliveries(delivery_id, dasher_id, status), list dashers who have never completed a delivery (status = 'completed').

Solution Using NOT EXISTS

```sql
SELECT d.dasher_id
FROM dashers d
WHERE NOT EXISTS (
  SELECT 1 FROM deliveries x
  WHERE x.dasher_id = d.dasher_id AND x.status = 'completed'
);
```

Why this works: For each Dasher we ask “is there any completed delivery?”—if not, the Dasher passes the filter. NOT EXISTS is a standard anti-join pattern and often reads clearer than LEFT JOIN … IS NULL.

Practice


3. Window functions and ranking in SQL

Window Functions and Ranking in SQL for delivery timelines

GROUP BY shrinks the table: many rows → one row per bucket. Window functions do the opposite idea: every row stays, but you add a new column computed from a window of rows around it. Syntax: some_function(...) OVER (...).

Illustration of SQL window function partitions and RANK ties for marketplace per-merchant order rows, styled for PipeCode DoorDash DE interview content.

PARTITION BY (optional but common)

  • Meaning: “Start a new window for each distinct value of these columns”—like a separate mini-table per merchant_id or dasher_id.
  • Without it: The window is usually the whole result set (after WHERE), which is rare for rankings but valid for some global running totals.

Worked example (one merchant, running sum over days)

| merchant_id | day | subtotal |
|-------------|-----|----------|
| M1 | 2026-01-01 | 10 |
| M1 | 2026-01-02 | 30 |

SUM(subtotal) OVER (PARTITION BY merchant_id ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) adds 10 on the first row and 40 on the second—cumulative subtotal for that merchant.
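SQLite (3.25+) supports the same window syntax, so you can check the cumulative subtotal directly; the two rows mirror the table above:

```python
import sqlite3

# Running sum per merchant, mirroring the worked example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (merchant_id, day, subtotal)")
conn.executemany("INSERT INTO daily VALUES (?, ?, ?)",
                 [("M1", "2026-01-01", 10), ("M1", "2026-01-02", 30)])
rows = conn.execute("""
    SELECT day,
           SUM(subtotal) OVER (
               PARTITION BY merchant_id ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running
    FROM daily
    ORDER BY day
""").fetchall()
print(rows)  # [('2026-01-01', 10), ('2026-01-02', 40)]
```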

ORDER BY inside OVER

  • Meaning: Defines which row is ‘first’ inside each partition—required for RANK, ROW_NUMBER, LAG/LEAD, and for ordered frames (running sums).
  • Direction: DESC vs ASC changes “who is rank 1” (e.g. largest order first).

ROW_NUMBER()

  • Assigns 1, 2, 3, … with no ties—the database breaks ties arbitrarily unless you add more ORDER BY keys.
  • Use when: You need exactly one row per partition (pick-one tie-break) or deduplication.

RANK()

  • Ties get the same rank; the next rank skips: 1, 2, 2, 4.

DENSE_RANK()

  • Ties get the same rank, but no gaps: 1, 2, 2, 3.
  • Use when: You want “top N tiers” including all ties at the boundary—common for “top two order amounts per merchant.”

LAG / LEAD

  • LAG(col, 1) = previous row’s value in partition order; LEAD = next row’s—used for time between deliveries, session gaps, streak logic.

Worked example

```sql
SELECT dasher_id, delivery_ts,
       delivery_ts - LAG(delivery_ts) OVER (PARTITION BY dasher_id ORDER BY delivery_ts) AS gap
FROM deliveries;
```

Filtering window results: You cannot put WHERE RANK() = 1 in the same SELECT that defines the rank in most databases. Compute the window in a CTE, then WHERE rnk = 1 on the outside.

Common beginner mistakes

  • Omitting ORDER BY inside OVER when order defines the answer.
  • Using GROUP BY when the question still needs one output row per input row.
  • Using ROW_NUMBER() when tied rows should share the same rank—use RANK or DENSE_RANK.

SQL Interview Question on Top Two Orders per Merchant by Amount

For each merchant_id, return the two highest order_amount rows (ties can both appear). Use PostgreSQL-friendly SQL.

Solution Using DENSE_RANK and a CTE

```sql
WITH ranked AS (
  SELECT *, DENSE_RANK() OVER (PARTITION BY merchant_id ORDER BY order_amount DESC) AS dr
  FROM orders
)
SELECT * FROM ranked WHERE dr <= 2;
```

Why this works: DENSE_RANK() orders rows inside each merchant by order_amount; dr <= 2 keeps the top two distinct amount tiers, so ties at rank 2 are included.
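A quick way to see the tie behavior: run the same CTE in SQLite over invented amounts where two orders tie at rank 2:

```python
import sqlite3

# DENSE_RANK keeps both tied rows at rank 2; 50 falls to rank 3 and is cut.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (merchant_id, order_amount)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("M1", 100), ("M1", 80), ("M1", 80), ("M1", 50)])
rows = conn.execute("""
    WITH ranked AS (
      SELECT *, DENSE_RANK() OVER (
                  PARTITION BY merchant_id ORDER BY order_amount DESC) AS dr
      FROM orders
    )
    SELECT order_amount, dr FROM ranked
    WHERE dr <= 2
    ORDER BY order_amount DESC
""").fetchall()
print(rows)  # [(100, 1), (80, 2), (80, 2)]
```

Swap in ROW_NUMBER() and one of the 80s would be dropped arbitrarily, which is exactly why the tie-friendly ranking matters here.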

For more pattern drills, browse SQL deep dives.

Practice


4. Date functions, arithmetic, and streak-style SQL

Date and Time-Series Handling in SQL for marketplace metrics

Time-series questions almost always mean: “Put each event in a time bucket (day, hour, week), then aggregate.” Delivery and order pipelines make half-open ranges and time zones fair game in a DoorDash SQL interview.

Bucketing with date_trunc (PostgreSQL-style)

  • date_trunc('day', ts) snaps every timestamp in that calendar day to the same instant (midnight at the start of the day). Common grains: hour, week, month.
  • After truncating, GROUP BY the truncated value (or cast to date) and apply SUM, COUNT, etc.

Worked example

| order_ts (UTC) | subtotal |
|----------------|----------|
| 2026-01-01 22:00 | 10 |
| 2026-01-02 03:00 | 20 |

GROUP BY date_trunc('day', order_ts)::date gives two buckets: 2026-01-01 → 10, 2026-01-02 → 20.

INTERVAL and relative windows

  • CURRENT_TIMESTAMP - INTERVAL '7 days' expresses sliding windows without hard-coding dates.
  • Pair with WHERE ts >= … AND ts < … for half-open ranges.

Worked example: “Last 24 hours”: WHERE event_ts >= NOW() - INTERVAL '24 hours' AND event_ts < NOW() (or <= NOW() if you treat “now” itself as inclusive).

Difference between timestamps

  • dropoff_ts - pickup_ts yields a PostgreSQL interval, useful for delivery duration. Convert to hours with EXTRACT(EPOCH FROM …) / 3600.

Worked example

| pickup_ts | dropoff_ts |
|-----------|------------|
| 2026-01-01 10:00 | 2026-01-01 16:00 |
dropoff_ts - pickup_ts is a 6-hour interval.
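The same subtraction in Python, for intuition (timestamps are the ones from the table; the EXTRACT(EPOCH …) conversion itself is PostgreSQL-specific):

```python
from datetime import datetime

# Delivery duration: timestamp difference, then seconds -> hours.
pickup = datetime(2026, 1, 1, 10, 0)
dropoff = datetime(2026, 1, 1, 16, 0)
hours = (dropoff - pickup).total_seconds() / 3600
print(hours)  # 6.0
```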

Boundaries: half-open vs BETWEEN

  • Prefer ts >= start_ts AND ts < end_ts so an event exactly on end_ts is not double-counted across adjacent windows.
  • BETWEEN a AND b includes both endpoints—easy to off-by-one with timestamps.

Gaps, sessions, and streaks

  • Sort by time, then LAG(ts) / LEAD(ts) for gaps between consecutive events.
  • Streak / islands (conceptual): order_day - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_day) is constant within a streak of consecutive calendar days. Full gaps-and-islands is a common follow-up—practice on time-series SQL.
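The streak trick can be sketched in plain Python: subtracting the row number from each sorted day yields an anchor date that is constant inside a streak (the order days here are invented):

```python
from datetime import date, timedelta

# Jan 1-2 are consecutive (one streak); Jan 5 stands alone.
order_days = [date(2026, 1, 1), date(2026, 1, 2), date(2026, 1, 5)]

streaks = {}
for rn, d in enumerate(sorted(order_days), start=1):
    anchor = d - timedelta(days=rn)        # constant within one streak
    streaks.setdefault(anchor, []).append(d)

lengths = sorted(len(days) for days in streaks.values())
print(lengths)  # [1, 2] -> longest streak is 2 days
```

In SQL the same anchor comes from order_day - ROW_NUMBER(), and GROUP BY user_id, anchor gives one row per streak.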

Common beginner mistakes

  • Mixing DATE and TIMESTAMP without casting.
  • Using BETWEEN on timestamps when a half-open range is safer.
  • Forgetting UTC vs local when filtering “today.”

SQL Interview Question on Orders in the Last Rolling Seven Days

Count completed orders per merchant_id for order_ts in [today-7d, today) (half-open).

Solution Using Range Predicate

```sql
SELECT merchant_id, COUNT(*) AS n_orders
FROM orders
WHERE status = 'completed'
  AND order_ts >= CURRENT_TIMESTAMP - INTERVAL '7 days'
  AND order_ts < CURRENT_TIMESTAMP
GROUP BY merchant_id;
```

Why this works: WHERE filters to the rolling window and completed status before GROUP BY; COUNT(*) counts orders per merchant in that slice.

Practice

Practice time-series SQL patterns after this section—the hub matches many “consecutive day” and cohort-style prompts.


5. Bucketing and CASE expressions for logistics metrics

CASE Expressions and Bucketing in SQL for SLA tiers

CASE maps continuous metrics (minutes late, miles) into ordered buckets interviewers can scan quickly—on-time vs late, distance bands, volume tiers.

CASE for SLA and latency tiers

  • Pattern: CASE WHEN … THEN … WHEN … THEN … ELSE … END — first matching branch wins; ELSE avoids silent NULL labels.

Worked example: delivery minutes m map to on_time if m <= 30, late if m <= 60, else very_late:

```sql
CASE
  WHEN delivery_minutes <= 30 THEN 'on_time'
  WHEN delivery_minutes <= 60 THEN 'late'
  ELSE 'very_late'
END
```

Test boundaries: is 30 inclusive? Say it explicitly.
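A Python mirror of the CASE makes the boundaries explicit (30 and 60 are inclusive upper bounds here, matching the <= comparisons; the function name is just illustrative):

```python
# First matching branch wins, same as SQL CASE.
def sla_tier(delivery_minutes: int) -> str:
    if delivery_minutes <= 30:
        return "on_time"
    if delivery_minutes <= 60:
        return "late"
    return "very_late"

print([sla_tier(m) for m in (30, 31, 60, 61)])
# ['on_time', 'late', 'late', 'very_late']
```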

Ordered buckets vs percentiles

  • Fixed thresholds (SLA minutes) are more common than percentile buckets in interview cards.
  • NTILE(n) over a partition splits ordered rows into n buckets—useful when the prompt says “quartile” or “decile”; watch ties at bucket edges.

Pitfalls: overlapping conditions and ELSE NULL

  • Non-exhaustive CASE returns NULL for misses—sometimes you want ELSE 'unknown' so rows still appear in reports.
  • Overlapping WHEN ranges are a logic bug; keep branches mutually exclusive unless the prompt says otherwise.

Common beginner mistakes

  • Off-by-one on inclusive/exclusive boundaries.
  • Filtering on a CASE alias in WHERE: the alias does not exist yet when WHERE runs, so compute the CASE in a subquery or CTE, then filter in the outer query.

SQL Interview Question on Segmenting Merchants by Weekly Order Volume

From orders, bucket merchants into low / medium / high for last week: low < 50 orders, medium < 200, else high.

Solution Using CASE on a Pre-Aggregated CTE

```sql
WITH w AS (
  SELECT merchant_id, COUNT(*) AS n
  FROM orders
  WHERE order_ts >= DATE_TRUNC('week', CURRENT_DATE) - INTERVAL '7 days'
    AND order_ts < DATE_TRUNC('week', CURRENT_DATE)
  GROUP BY merchant_id
)
SELECT merchant_id, n,
       CASE
         WHEN n < 50 THEN 'low'
         WHEN n < 200 THEN 'medium'
         ELSE 'high'
       END AS volume_tier
FROM w;
```

Why this works: We aggregate first (counts per merchant in the window), then label each count with CASE—clean separation of “how many” vs “which tier.”

Practice


6. Hash maps, sliding windows, and Python patterns for DE screens

Hash Maps and Counting in Python for interview loops

A dict (hash map) maps keys to values. Average-time lookup, insert, and update are O(1) in the usual amortized sense—much faster than rescanning a whole list for every key.

Many DoorDash data engineer Python items are algorithmic but small-input: you implement clear logic under time pressure, not a production service.

Plain dict and .get

  • Frequency: counts[k] = counts.get(k, 0) + 1 avoids KeyError on first sighting of k.

Worked example

merchant_ids = ["M1", "M2", "M1"] — after the loop, freq is {"M1": 2, "M2": 1}.
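Spelled out, the loop behind that result uses the .get pattern from above:

```python
# One pass over the list; .get(k, 0) avoids a KeyError on first sighting.
merchant_ids = ["M1", "M2", "M1"]
freq = {}
for k in merchant_ids:
    freq[k] = freq.get(k, 0) + 1
print(freq)  # {'M1': 2, 'M2': 1}
```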

collections.Counter

  • Built for frequency counts: Counter(iterable), .most_common(k) for top-k.

Worked example

```python
from collections import Counter
Counter(["late", "ok", "late"]).most_common(1)  # [('late', 2)]
```

defaultdict from collections

  • defaultdict(list)d[order_id].append(event) mirrors GROUP BY order_id “collect all rows in a bucket.”

Why avoid nested scans?

  • “For each distinct key, loop the entire list” is O(n²). One pass with a dict is typically O(n)—what interviewers expect.

Worked example: With about 10,000 events and 500 distinct merchants, rescanning all rows per merchant is on the order of millions of operations; one dict pass stays on the order of ten thousand.

Sliding windows in memory

  • Keep a running sum and subtract the element leaving the window, or use a deque of indices/times—pattern for “max sum of k consecutive delivery fees” or “events in last N minutes.”

Worked example: Fees [2, 3, 1, 5], k = 2: window sums are 5, 4, 6; max is 6.
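A minimal implementation of the fixed-size window (the function name and k-validation are illustrative choices, not a required API):

```python
# Seed the first k fees, then slide: add the entering fee,
# subtract the leaving one -- O(n) instead of O(n*k).
def max_window_sum(fees: list[int], k: int) -> int:
    if k <= 0 or k > len(fees):
        raise ValueError("k must be between 1 and len(fees)")
    window = sum(fees[:k])
    best = window
    for i in range(k, len(fees)):
        window += fees[i] - fees[i - k]
        best = max(best, window)
    return best

print(max_window_sum([2, 3, 1, 5], 2))  # 6
```

The intermediate window sums match the worked example: 5, 4, 6.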

When two pointers or binary search show up

  • Two pointers often pair with sorted arrays; binary search answers “minimum feasible time/speed” when a predicate is monotonic.

Common beginner mistakes

  • Re-scanning the whole list inside a loop over unique keys.
  • Off-by-one on window bounds (left inclusive vs exclusive).
  • Not handling empty input or k = 0.

Pro tip: Before you code, restate inputs, output, and edge cases. That structure mirrors how PipeCode’s editorials explain 450+ problems.

Python Interview Question on Two Deliveries in the Same Hour

Given a sorted list of delivery start times (integers, seconds since epoch), return true if any two fall in the same clock hour (use t // 3600 as a simple hour bucket).

Solution Using a Set of Hour Buckets

```python
def same_hour_exists(times: list[int]) -> bool:
    seen = set()
    for t in times:
        h = t // 3600
        if h in seen:
            return True
        seen.add(h)
    return False
```

Why this works: Each timestamp maps to one hour id; if we see the same id twice, two deliveries share that hour. set gives O(1) expected lookup—O(n) overall.

Practice

Open the full DoorDash Python problem set when you want language-filtered reps.


7. Star schema, grain, and SCD patterns in data engineering

Dimensional Modeling and SCD Patterns for warehouse interviews

Verbal rounds often ask you to sketch a star schema: a fact at the center (e.g. one row per order or per delivery line) and dimensions around it (merchant, customer, Dasher, date). Be crisp about grain—“one fact row represents what event?”

Star schema diagram with a central fact table and dimension tables plus a Type 1 versus Type 2 slowly changing dimension callout for PipeCode data modeling interview prep.

Fact table grain

  • Grain is the atomic event the fact row represents. Mixing order-level and order-line-level metrics in one fact without adjustment breaks additivity (you cannot blindly SUM across rows).

Worked example: If fact_orders is one row per order, line_item_subtotal does not belong there unless you allocate or roll up lines first.

Star vs snowflake

  • Star: Dimensions are denormalized (wide tables)—fewer joins for analysts.
  • Snowflake: Dimensions are normalized (multiple tables)—saves storage, more joins.
  • Trade-off: Query simplicity vs normalization; say both names in a design interview.

SCD Type 1 vs Type 2

  • Type 1: Overwrite the merchant’s address—no history; past reports would look “wrong” if you re-run them without snapshotting.
  • Type 2: New row per change with valid_from, valid_to, is_current (or similar); facts store merchant_sk pointing at the version active at order time.

Bridge tables (many-to-many)

  • Promo applies to many merchants; merchant runs many promos—a bridge merchant_promo(merchant_id, promo_id, …) at assignment grain is the usual pattern.

Common beginner mistakes

  • Snowflaking the fact (multiple grains in one table).
  • Forgetting surrogate keys on Type 2 dimensions so facts can point at the right version.
  • Confusing operational OLTP schemas with analytic star schemas—interview answers should name grain and keys explicitly.

Deep dives: Data modeling for data engineering interviews and ETL system design.

Data Engineering Interview Question on Tracking Merchant Address History

Explain how you would model merchant addresses so historical orders still show the address at order time while dashboards can list the current address.

Solution Using SCD Type 2 for History and a Current Flag

Store Type 2 rows on dim_merchant with valid_from, valid_to, is_current. Facts carry merchant_sk (surrogate key) tied to the version active when the order was placed. Dashboards filter is_current = true. Optionally maintain a Type 1 “current view” as a separate table or view for simplicity.

Why this works: Type 2 preserves history for facts that already stored the right merchant_sk; is_current gives analysts a single row per merchant for “where are they now?” without rewriting history.
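As a rough sketch (column and table names are illustrative, not DoorDash's schema), the Type 2 "close the old version, insert the new one" move looks like this in SQLite:

```python
import sqlite3

# Hypothetical dim_merchant with Type 2 versioning columns.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_merchant (
        merchant_sk INTEGER PRIMARY KEY,
        merchant_id TEXT, address TEXT,
        valid_from TEXT, valid_to TEXT, is_current INTEGER
    )
""")
conn.execute(
    "INSERT INTO dim_merchant VALUES (1, 'M1', '1 Old St', '2025-01-01', '9999-12-31', 1)")

# Address change on 2026-03-01: expire the current version...
conn.execute("""
    UPDATE dim_merchant SET valid_to = '2026-03-01', is_current = 0
    WHERE merchant_id = 'M1' AND is_current = 1
""")
# ...then insert the new version under a fresh surrogate key.
conn.execute(
    "INSERT INTO dim_merchant VALUES (2, 'M1', '2 New Ave', '2026-03-01', '9999-12-31', 1)")

current = conn.execute(
    "SELECT merchant_sk, address FROM dim_merchant WHERE is_current = 1").fetchall()
print(current)  # [(2, '2 New Ave')]
```

Facts loaded before the change keep pointing at merchant_sk = 1, so historical reports still show '1 Old St'.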

Practice


Tips to crack DoorDash data engineering interviews

Data engineering interview preparation is less about memorizing every LeetCode pattern and more about repeatable lanes: SQL grain, joins, windows, dates, then Python frequency and windows, then a clean verbal on schemas and pipelines.

SQL preparation

Drill WHERE vs HAVING, join fan-out, and window partition keys on paper first. Explain NULL behavior in aggregates before you touch the keyboard.

Pipeline and ETL thinking

Be ready to discuss idempotent loads, late-arriving facts, and backfills at a high level—pair that narrative with ETL topic practice.

Where to practice on PipeCode

Courses that match this article: SQL for data engineering interviews, Python fundamentals, plus the modeling and ETL links above.

Communication under time pressure

Narrate assumptions (timezone, tie-break, inclusive bounds), sketch grain, then code. Interviewers hire people who debug fan-out and NULL issues calmly—that is how to crack this style of loop without burning time.


Frequently asked questions

What types of DoorDash data engineering interview questions should I expect—SQL, Python, or system design?

Expect heavy SQL (joins, aggregates, windows, dates), Python structured problems (dicts, sliding window, occasional graphs), and at least one design conversation about pipelines, warehouses, or dimensional models. This article mirrors that mix; use the DoorDash practice hub to align reps with those lanes.

How hard are DoorDash SQL interviews compared to other large tech companies?

Difficulty is broadly comparable to other marketplace and consumer-tech data engineering loops: medium SQL with a few edge-case checks is common. PipeCode tags 63 DoorDash problems across easy, medium, and hard—use that spread instead of guessing from forums alone.

Do DoorDash data engineer interviews focus on Python algorithms or data-structure problems?

You should expect some algorithmic Python (hash maps, intervals, sorting, graphs) in addition to data-wrangling intuition. It is not identical to a pure SWE loop, but brushing up on patterns in the DoorDash Python filter pays off.

How do I prepare for DoorDash data modeling or warehouse design questions?

Practice star schema, fact grain, and SCD Type 1 vs 2 out loud; sketch surrogate keys and how facts reference dimensions. Pair verbal reps with dimensional modeling drills and the data modeling course linked above.

How many DoorDash-tagged practice problems should I finish before the loop?

There is no magic number, but 30–50 mixed problems across SQL, Python, and at least a few modeling prompts is a solid data engineering interview preparation cadence if you explain each solution after you submit.

Where can I practice DoorDash-style SQL and Python with tests and feedback?

PipeCode runs 450+ interview-style problems with tests and AI feedback; start at /explore/practice/company/doordash for company-tagged sets, then branch into topic pages for weak spots.

Start practicing DoorDash data engineering problems

PipeCode pairs company-tagged DoorDash problems with tests and feedback so you move from reading patterns to typing your own solutions.

Pipecode.ai is Leetcode for Data Engineering

Browse DoorDash practice →
View plans →
