DoorDash data engineering interview questions usually stress SQL for marketplace metrics—orders, merchants, Dashers, ratings—and Python for structured problems (hash maps, sliding windows, graphs) that still show up on many DE loops. This guide maps the skills behind 63+ DoorDash-tagged practice problems on PipeCode so you know what to drill and in what order.
We walk through aggregation and grain, joins and fan-out, window functions, time logic, bucketing, representative Python patterns, and a concise data modeling pass (star schema and SCDs). Sample prompts below are original teaching examples, not leaked items.
Top DoorDash data engineering interview topics
| # | Topic | Why it shows up |
|---|---|---|
| 1 | Aggregation & GROUP BY / HAVING | Per-merchant revenue, per-Dasher counts, SLA segments. |
| 2 | Joins & deduplication | Orders to deliveries without row explosion; “no activity” anti-joins. |
| 3 | Window functions & ranking | Top-N per region, previous/next event, running metrics. |
| 4 | Dates, streaks, time-series SQL | Daily actives, latency buckets, gaps between orders. |
| 5 | CASE & bucketing | On-time vs late, radius tiers, volume bands. |
| 6 | Python: frequency & sliding windows | Counters, bounded intervals, order-sequence scans. |
| 7 | Star schema, grain, SCDs | Verbal warehouse design and historical dimensions. |
If you are new to SQL: In most databases the engine processes a query roughly in this order:
`FROM`/joins → `WHERE` (filter rows) → `GROUP BY` → compute aggregates → `HAVING` (filter groups) → window functions → `SELECT`/`ORDER BY`. When in doubt, ask: “Am I filtering one row at a time (`WHERE`) or a whole group after summing (`HAVING`)?” Saying that out loud in a DoorDash SQL interview buys you time to structure the query.
1. Aggregation and GROUP BY concepts in marketplace SQL
Aggregation and GROUP BY in SQL for marketplace metrics
Picture a table with many detail rows—for example one row per order at a merchant. Aggregation means: “turn lots of rows into one summary value (or a few values) per bucket.” The bucket is whatever the prompt names: per merchant, per Dasher, per day, and so on.
- `GROUP BY` defines the bucket: “Put all rows with the same `merchant_id` together,” or “the same `(store_id, day)` together.” Every distinct combination of the `GROUP BY` columns is one group; the database runs your aggregate functions separately inside each group.
SUM(col)
- What it does: Adds all numeric values of `col` in the bucket.
- `NULL` behavior: `NULL` cells are ignored (they are not treated as 0). If every value is `NULL`, `SUM` is usually `NULL`, not 0—say that in an interview if the edge case matters.
- Typical use: Gross merchandise value per merchant, delivery fees summed per region.
Worked example: In one group, subtotal values 10, NULL, 30 → SUM(subtotal) = 40 (only 10 + 30).
AVG(col)
- What it does: Average of non-`NULL` values: sum of non-null values divided by count of non-null values.
- `NULL` behavior: Rows where `col` is `NULL` do not enter the numerator or the denominator. If you need “average where missing means 0,” use `AVG(COALESCE(col, 0))` (only if the business defines it that way).
- Typical use: Average order value per merchant, average delivery minutes per zone.
Worked example: Same three rows 10, NULL, 30 → AVG(subtotal) = 20 because (10 + 30) / 2; the NULL row is not counted in the average.
COUNT(*)
- What it does: Counts how many rows are in the bucket—every row counts, even if some columns are `NULL`.
- Typical use: “How many orders per merchant?”, “how many delivery attempts in this hour?”—when each row is one event.
Worked example: Same three rows → COUNT(*) = 3 (the NULL row still counts as a row).
COUNT(col)
- What it does: Counts rows where `col` is not `NULL`. Differs from `COUNT(*)` as soon as `col` has nulls.
- Related: `COUNT(DISTINCT col)` counts unique non-null values in the bucket—essential after joins when you must count customers, not multiplied rows (see section 2).
Worked example: Same three rows → COUNT(subtotal) = 2. If the third row were 50 instead of NULL, COUNT(DISTINCT subtotal) with values 10, 30, 50 would be 3; with 10, 30, 10 it would be 2.
MIN(col) and MAX(col)
- What they do: Return the smallest or largest value of `col` in the bucket. Works on orderable types: numbers, dates/timestamps, strings (lexicographic order).
- `NULL` behavior: `NULL`s are skipped. If all values are `NULL`, the result is `NULL`.
- Typical use: Latest order time (`MAX(order_ts)`), earliest Dasher signup (`MIN(created_at)`).
Worked example: Subtotals 10, NULL, 30 → MIN(subtotal) = 10, MAX(subtotal) = 30.
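These `NULL` rules are easy to verify yourself. A minimal sketch using SQLite through Python's `sqlite3` module (the table name and three-row data mirror the worked examples above):

```python
import sqlite3

# Three rows, one of them NULL, as in the worked examples above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, subtotal REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (2, None), (3, 30.0)])

# SUM/AVG/COUNT(col)/MIN/MAX skip the NULL; COUNT(*) does not.
row = conn.execute("""
    SELECT SUM(subtotal), AVG(subtotal), COUNT(*), COUNT(subtotal),
           MIN(subtotal), MAX(subtotal)
    FROM orders
""").fetchone()
print(row)  # (40.0, 20.0, 3, 2, 10.0, 30.0)
```

Running the same five aggregates over one tiny table is a quick way to double-check your mental model before an interview.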
Worked example — one dataset, several aggregates
Suppose orders looks like this:
| order_id | merchant_id | subtotal |
|---|---|---|
| 101 | M1 | 20.00 |
| 102 | M1 | NULL |
| 103 | M1 | 40.00 |
| 104 | M2 | 100.00 |
Run:
```sql
SELECT merchant_id,
       SUM(subtotal)   AS sum_amt,
       AVG(subtotal)   AS avg_amt,
       COUNT(*)        AS n_rows,
       COUNT(subtotal) AS n_known_amt,
       MIN(subtotal)   AS min_amt,
       MAX(subtotal)   AS max_amt
FROM orders
GROUP BY merchant_id
ORDER BY merchant_id;
```
You should get:
| merchant_id | sum_amt | avg_amt | n_rows | n_known_amt | min_amt | max_amt |
|---|---|---|---|---|---|---|
| M1 | 60.00 | 30.0000… | 3 | 2 | 20.00 | 40.00 |
| M2 | 100.00 | 100.0000… | 1 | 1 | 100.00 | 100.00 |
For M1, the row with subtotal NULL still counts in COUNT(*) (3 rows) but not in SUM / AVG / COUNT(subtotal) (only the 20 and 40 matter).
COUNT(DISTINCT) mini-example: if delivery_attempts has two rows for the same order_id (retry), COUNT(*) is 2 but COUNT(DISTINCT order_id) is 1 for that bucket.
Conditional aggregation (CASE inside aggregates)
- Idea: Count or sum only some rows in the group without splitting into multiple queries—put the condition inside the aggregate.
- Patterns: `SUM(CASE WHEN condition THEN col ELSE 0 END)`, `COUNT(CASE WHEN … THEN 1 END)` (or `SUM(CASE WHEN … THEN 1 ELSE 0 END)`).
Worked example — events per merchant by type
| merchant_id | event_type |
|---|---|
| M1 | impression |
| M1 | order |
| M1 | impression |
| M2 | impression |
```sql
SELECT merchant_id,
       COUNT(*) AS total_events,
       SUM(CASE WHEN event_type = 'order' THEN 1 ELSE 0 END) AS orders,
       SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS impressions
FROM merchant_events
GROUP BY merchant_id;
```
Result
| merchant_id | total_events | orders | impressions |
|---|---|---|---|
| M1 | 3 | 1 | 2 |
| M2 | 1 | 0 | 1 |
This is the same “one GROUP BY, many metrics” style as `COUNT(*) FILTER (WHERE …)` in PostgreSQL; `CASE` is the portable form that works across warehouses.
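The merchant_events example above can be run end to end in SQLite via Python's `sqlite3` (table and column names match the worked example; nothing here is DoorDash-specific):

```python
import sqlite3

# Same four events as the worked example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE merchant_events (merchant_id TEXT, event_type TEXT)")
conn.executemany("INSERT INTO merchant_events VALUES (?, ?)",
                 [("M1", "impression"), ("M1", "order"),
                  ("M1", "impression"), ("M2", "impression")])

# One GROUP BY, several metrics: the condition lives inside each SUM.
rows = conn.execute("""
    SELECT merchant_id,
           COUNT(*) AS total_events,
           SUM(CASE WHEN event_type = 'order' THEN 1 ELSE 0 END) AS orders,
           SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS impressions
    FROM merchant_events
    GROUP BY merchant_id
    ORDER BY merchant_id
""").fetchall()
print(rows)  # [('M1', 3, 1, 2), ('M2', 1, 0, 1)]
```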
GROUP BY and HAVING (how they fit together)
- `HAVING` is `WHERE` for buckets: it runs after grouping, so you can filter on `AVG(subtotal)`, `COUNT(*)`, etc. `WHERE` runs before grouping and only sees raw row columns—so `WHERE AVG(subtotal) > 50` is invalid: the average does not exist until after `GROUP BY`.
Worked example — WHERE vs HAVING
| order_id | merchant_id | subtotal |
|---|---|---|
| 1 | M1 | 40 |
| 2 | M1 | 70 |
| 3 | M1 | 20 |
| 4 | M2 | 200 |
- `WHERE subtotal > 30` drops row 3 before grouping. Then `GROUP BY merchant_id` with `SUM(subtotal)` gives M1 = 110 (40 + 70), M2 = 200.
- `HAVING SUM(subtotal) > 100` runs after grouping on the unfiltered table: M1’s sum is 130, M2’s is 200—both pass.
Worked example — HAVING output
```sql
SELECT merchant_id, SUM(subtotal) AS total_amt, COUNT(*) AS n_orders
FROM orders
GROUP BY merchant_id
HAVING SUM(subtotal) > 100;
```
Result
| merchant_id | total_amt | n_orders |
|---|---|---|
| M1 | 130 | 3 |
| M2 | 200 | 1 |
Rule of thumb: If the condition uses `SUM` / `COUNT` / `AVG` / … of the group, use `HAVING`. If it only uses this row’s columns, use `WHERE` (and put it first—it usually makes the query faster too).
If a column appears in SELECT and is not inside an aggregate, it must appear in GROUP BY (in strict SQL).
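The WHERE-vs-HAVING contrast is easy to see side by side. A small sketch using SQLite via `sqlite3`, with the same four-row orders table as the worked example above:

```python
import sqlite3

# Four rows, as in the WHERE vs HAVING worked example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, merchant_id TEXT, subtotal INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "M1", 40), (2, "M1", 70), (3, "M1", 20), (4, "M2", 200)])

# WHERE filters rows before grouping: row 3 (20) never enters M1's sum.
where_first = conn.execute("""
    SELECT merchant_id, SUM(subtotal) FROM orders
    WHERE subtotal > 30
    GROUP BY merchant_id ORDER BY merchant_id
""").fetchall()

# HAVING filters groups after aggregation: all rows enter the sums.
having_only = conn.execute("""
    SELECT merchant_id, SUM(subtotal) FROM orders
    GROUP BY merchant_id
    HAVING SUM(subtotal) > 100 ORDER BY merchant_id
""").fetchall()

print(where_first)  # [('M1', 110), ('M2', 200)]
print(having_only)  # [('M1', 130), ('M2', 200)]
```

M1’s total differs between the two queries (110 vs 130) precisely because one filter runs before grouping and the other after.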
Common beginner mistakes
- Putting `AVG(subtotal) > 50` in `WHERE` → use `HAVING` after `GROUP BY`.
- Forgetting a column in `GROUP BY` when it appears in `SELECT` without an aggregate → invalid query in strict SQL.
- Answering at the wrong grain (e.g. one row per `order_line` when the question asked per merchant).
SQL Interview Question on Aggregation and GROUP BY
Given orders(order_id, merchant_id, subtotal, is_cancelled), compute gross revenue per merchant excluding cancelled orders. Return merchants with revenue strictly greater than 100.
Solution Using GROUP BY, Conditional Sum, and HAVING
```sql
SELECT merchant_id,
       SUM(CASE WHEN NOT is_cancelled THEN subtotal ELSE 0 END) AS gross_rev
FROM orders
GROUP BY merchant_id
HAVING SUM(CASE WHEN NOT is_cancelled THEN subtotal ELSE 0 END) > 100;
```
Why this works: We group by merchant, sum only non-cancelled subtotals, then keep only groups whose gross revenue passes the threshold. That threshold is an aggregate, so it belongs in HAVING.
Practice
- SQL · Topic — Aggregation problems (all companies)
- COMPANY · DoorDash — DoorDash-tagged aggregation
- COMPANY · DoorDash — DoorDash SQL hub (all difficulties)
When you are ready for timed reps, open DoorDash SQL (medium).
2. Joins, deduplication, and fan-out patterns in SQL
Joins and Deduplication in SQL for logistics analytics
A join answers: “For each row in table A, which rows in table B belong with it?” You declare that with ON a.key = b.key. In marketplace analytics you enrich orders and deliveries with merchants, customers, and Dashers.
Fact vs dimension and grain (how DEs talk about joins)
- Fact table: Many rows, often one row per event or transaction (order, delivery attempt)—the thing you `SUM` or `COUNT`.
- Dimension table: Fewer rows, one row per entity you describe (merchant, Dasher)—the thing you `JOIN` to get names, zones, categories.
- Grain: “One result row per what?” Wrong grain after a join usually means fan-out: you aggregated at order level but summed a merchant attribute that repeated.
Worked example: `order_lines` (fact: `order_id`, `sku`, `qty`) joins `items` (dimension: `sku`, `list_price`). `SUM(qty)` by `sku` is correct at line grain. If `items` had duplicate rows per `sku`, one join could inflate `SUM(list_price)`—fix the dimension to one row per `sku` before summing price.
Inner join (JOIN / INNER JOIN)
- Result: Only rows where the `ON` condition succeeds on both sides.
- Use when: You only care about “matched” pairs—e.g. orders that have a known merchant row.
- `NULL` keys: `NULL = NULL` is not true in join semantics, so inner joins drop rows with a null join key unless you `COALESCE` or filter explicitly.
Worked example
orders
| order_id | merchant_id |
|---|---|
| O1 | M10 |
| O2 | M99 |
merchants
| merchant_id | name |
|---|---|
| M10 | Noodle House |
FROM orders o JOIN merchants m ON o.merchant_id = m.merchant_id returns one row: O1. Order O2 has no merchant row and disappears from an inner join.
Left outer join (LEFT JOIN)
- Result: Every row from the left table appears once per matching right row; if there is no match, right-side columns are `NULL`.
- Use when: You must keep all left entities even if they have zero events—e.g. all merchants including those with no orders this week.
Worked example
merchants
| merchant_id | name |
|---|---|
| M1 | Cafe A |
| M2 | Cafe B |
orders
| order_id | merchant_id | amount |
|---|---|---|
| X1 | M1 | 50 |
FROM merchants m LEFT JOIN orders o ON m.merchant_id = o.merchant_id yields two rows: M1 + X1, and M2 + NULL (no order—right side null-padded).
Semi-join vs anti-join (names, not syntax)
- Semi-join: “Keep rows from A where some match exists in B”—`WHERE EXISTS (SELECT 1 FROM B WHERE …)`.
- Anti-join: “Keep rows from A with no match in B”—`LEFT JOIN B … WHERE B.key IS NULL` or `WHERE NOT EXISTS (SELECT 1 FROM B WHERE …)`.
Worked example
`dashers` has `dasher_id` values d1, d2, d3; `completed_deliveries` contains only d2.
- Semi-join: dashers who have at least one completed delivery → d2.
- Anti-join: dashers who never completed → d1, d3.
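Both patterns can be sketched in SQLite via `sqlite3`, using the toy dashers data above:

```python
import sqlite3

# d1, d2, d3 as dashers; only d2 has a completed delivery.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dashers (dasher_id TEXT)")
conn.execute("CREATE TABLE completed_deliveries (dasher_id TEXT)")
conn.executemany("INSERT INTO dashers VALUES (?)", [("d1",), ("d2",), ("d3",)])
conn.execute("INSERT INTO completed_deliveries VALUES ('d2')")

# Semi-join: keep dashers with at least one match.
semi = conn.execute("""
    SELECT d.dasher_id FROM dashers d
    WHERE EXISTS (SELECT 1 FROM completed_deliveries c
                  WHERE c.dasher_id = d.dasher_id)
""").fetchall()

# Anti-join: keep dashers with no match.
anti = conn.execute("""
    SELECT d.dasher_id FROM dashers d
    WHERE NOT EXISTS (SELECT 1 FROM completed_deliveries c
                      WHERE c.dasher_id = d.dasher_id)
""").fetchall()

print(semi)  # [('d2',)]
print(anti)  # [('d1',), ('d3',)]
```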
Fan-out (row multiplication)
- If the join key is not unique on one side (one order, many delivery attempts), the “one” side’s columns repeat on every matching row.
- Bug pattern: Summing a merchant-level column after the join counts that merchant’s field multiple times. Fixes: aggregate first, or `COUNT(DISTINCT merchant_id)`, or a subquery that makes the dimension one row per key before joining.
Worked example
merchants: (merchant_id=M1, fee_tier_credit=500)
orders: two rows for merchant_id=M1.
After merchants JOIN orders, fee_tier_credit appears twice (500, 500). SUM(m.fee_tier_credit) would add 1000, which is wrong for “total tier credit in the system.” Fix: SUM order amounts, or MAX(m.fee_tier_credit) per merchant if you only need the credit once.
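Here is the same inflation reproduced in SQLite via `sqlite3`, together with an aggregate-first fix (one of several possible fixes):

```python
import sqlite3

# One merchant row with a tier credit, two order rows -> fan-out on join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE merchants (merchant_id TEXT, fee_tier_credit INTEGER)")
conn.execute("CREATE TABLE orders (order_id TEXT, merchant_id TEXT)")
conn.execute("INSERT INTO merchants VALUES ('M1', 500)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [("O1", "M1"), ("O2", "M1")])

# Naive SUM after the join: the 500 repeats once per order row.
inflated = conn.execute("""
    SELECT SUM(m.fee_tier_credit)
    FROM merchants m JOIN orders o ON o.merchant_id = m.merchant_id
""").fetchone()[0]

# Fix: collapse back to one credit per merchant before summing.
correct = conn.execute("""
    SELECT SUM(credit) FROM (
        SELECT MAX(m.fee_tier_credit) AS credit
        FROM merchants m JOIN orders o ON o.merchant_id = m.merchant_id
        GROUP BY m.merchant_id
    )
""").fetchone()[0]

print(inflated, correct)  # 1000 500
```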
Deduplication
- `SELECT DISTINCT` removes duplicate rows when any duplicate row is interchangeable.
- Latest row per id: Prefer `ROW_NUMBER() OVER (PARTITION BY id ORDER BY time DESC) AS rn` in a CTE, then `WHERE rn = 1`. `DISTINCT ON` (PostgreSQL) is a compact alternative when the prompt names the dialect.
Worked example — latest attempt per order
```sql
WITH ranked AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY attempt_ts DESC) AS rn
    FROM delivery_attempts
)
SELECT * FROM ranked WHERE rn = 1;
```
Join that result to orders to avoid inflated revenue.
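The same dedup runs in SQLite via `sqlite3`, assuming a SQLite build with window-function support (3.25+); the sample rows are made up:

```python
import sqlite3

# Two attempts for O1 (latest wins), one for O2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE delivery_attempts (order_id TEXT, attempt_ts TEXT, status TEXT)")
conn.executemany("INSERT INTO delivery_attempts VALUES (?, ?, ?)",
                 [("O1", "2026-01-01 10:00", "failed"),
                  ("O1", "2026-01-01 10:20", "delivered"),
                  ("O2", "2026-01-01 11:00", "delivered")])

# ROW_NUMBER per order, newest first; keep rn = 1.
rows = conn.execute("""
    WITH ranked AS (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY order_id ORDER BY attempt_ts DESC) AS rn
        FROM delivery_attempts
    )
    SELECT order_id, attempt_ts, status FROM ranked
    WHERE rn = 1 ORDER BY order_id
""").fetchall()
print(rows)
# [('O1', '2026-01-01 10:20', 'delivered'), ('O2', '2026-01-01 11:00', 'delivered')]
```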
Common beginner mistakes
- Joining on the wrong key or ignoring `NULL` keys (they do not match in an inner join).
- Assuming one row per merchant after a join when the join multiplied rows.
- Using `COUNT(*)` when the question wanted distinct customers—use `COUNT(DISTINCT customer_id)`.
Joins-Based Interview Question on Dashers Without Completed Deliveries
From dashers(dasher_id) and deliveries(delivery_id, dasher_id, status), list dashers who have never completed a delivery (status = 'completed').
Solution Using NOT EXISTS
```sql
SELECT d.dasher_id
FROM dashers d
WHERE NOT EXISTS (
    SELECT 1 FROM deliveries x
    WHERE x.dasher_id = d.dasher_id AND x.status = 'completed'
);
```
Why this works: For each Dasher we ask “is there any completed delivery?”—if not, the Dasher passes the filter. NOT EXISTS is a standard anti-join pattern and often reads clearer than LEFT JOIN … IS NULL.
Practice
- SQL · Topic — Joins & deduplication
- COMPANY · DoorDash — DoorDash-tagged joins
3. Window functions and ranking in SQL
Window Functions and Ranking in SQL for delivery timelines
GROUP BY shrinks the table: many rows → one row per bucket. Window functions do the opposite idea: every row stays, but you add a new column computed from a window of rows around it. Syntax: some_function(...) OVER (...).
PARTITION BY (optional but common)
- Meaning: “Start a new window for each distinct value of these columns”—like a separate mini-table per `merchant_id` or `dasher_id`.
- Without it: The window is usually the whole result set (after `WHERE`), which is rare for rankings but valid for some global running totals.
Worked example (one merchant, running sum over days)
| merchant_id | day | subtotal |
|---|---|---|
| M1 | 2026-01-01 | 10 |
| M1 | 2026-01-02 | 30 |
`SUM(subtotal) OVER (PARTITION BY merchant_id ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` adds 10 on the first row and 40 on the second—cumulative subtotal for that merchant.
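This running sum can be checked in SQLite via `sqlite3`, assuming a build with window-function support (3.25+); the table name `daily` is illustrative:

```python
import sqlite3

# Two days for one merchant, as in the worked example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (merchant_id TEXT, day TEXT, subtotal INTEGER)")
conn.executemany("INSERT INTO daily VALUES (?, ?, ?)",
                 [("M1", "2026-01-01", 10), ("M1", "2026-01-02", 30)])

# Ordered frame: each row sees itself plus all earlier rows in its partition.
rows = conn.execute("""
    SELECT day,
           SUM(subtotal) OVER (
               PARTITION BY merchant_id ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running
    FROM daily ORDER BY day
""").fetchall()
print(rows)  # [('2026-01-01', 10), ('2026-01-02', 40)]
```

Note that every input row survives—the window adds a column instead of collapsing the table the way `GROUP BY` would.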
ORDER BY inside OVER
-
Meaning: Defines which row is ‘first’ inside each partition—required for
RANK,ROW_NUMBER,LAG/LEAD, and for ordered frames (running sums). -
Direction:
DESCvsASCchanges “who is rank 1” (e.g. largest order first).
ROW_NUMBER()
- Assigns 1, 2, 3, … with no ties—the database breaks ties arbitrarily unless you add more `ORDER BY` keys.
- Use when: You need exactly one row per partition (pick-one tie-break) or deduplication.
RANK()
- Ties get the same rank; the next rank skips: 1, 2, 2, 4.
DENSE_RANK()
- Ties get the same rank, but no gaps: 1, 2, 2, 3.
- Use when: You want “top N tiers” including all ties at the boundary—common for “top two order amounts per merchant.”
LAG / LEAD
- `LAG(col, 1)` = previous row’s value in partition order; `LEAD` = next row’s—used for time between deliveries, session gaps, streak logic.
Worked example
```sql
SELECT dasher_id, delivery_ts,
       delivery_ts - LAG(delivery_ts) OVER (PARTITION BY dasher_id ORDER BY delivery_ts) AS gap
FROM deliveries;
```
Filtering window results: You cannot put `WHERE RANK() = 1` in the same `SELECT` that defines the rank in most databases. Compute the window in a CTE, then `WHERE rnk = 1` on the outside.
Common beginner mistakes
- Omitting `ORDER BY` inside `OVER` when order defines the answer.
- Using `GROUP BY` when the question still needs one output row per input row.
- Using `ROW_NUMBER()` when tied rows should share the same rank—use `RANK` or `DENSE_RANK`.
SQL Interview Question on Top Two Orders per Merchant by Amount
For each merchant_id, return the two highest order_amount rows (ties can both appear). Use PostgreSQL-friendly SQL.
Solution Using DENSE_RANK and a CTE
```sql
WITH ranked AS (
    SELECT *, DENSE_RANK() OVER (PARTITION BY merchant_id ORDER BY order_amount DESC) AS dr
    FROM orders
)
SELECT * FROM ranked WHERE dr <= 2;
```
Why this works: DENSE_RANK() orders rows inside each merchant by order_amount; dr <= 2 keeps the top two distinct amount tiers, so ties at rank 2 are included.
For more pattern drills, browse SQL deep dives.
Practice
- SQL · Topic — Window functions (SQL)
- COMPANY · DoorDash — DoorDash-tagged window functions
- COMPANY · DoorDash — DoorDash-tagged ranking
4. Date functions, arithmetic, and streak-style SQL
Date and Time-Series Handling in SQL for marketplace metrics
Time-series questions almost always mean: “Put each event in a time bucket (day, hour, week), then aggregate.” Delivery and order pipelines make half-open ranges and time zones fair game in a DoorDash SQL interview.
Bucketing with date_trunc (PostgreSQL-style)
- `date_trunc('day', ts)` snaps every timestamp in that calendar day to the same instant (midnight at the start of the day). Common grains: `hour`, `week`, `month`.
- After truncating, `GROUP BY` the truncated value (or cast to `date`) and apply `SUM`, `COUNT`, etc.
Worked example
| order_ts (UTC) | subtotal |
|---|---|
| 2026-01-01 22:00 | 10 |
| 2026-01-02 03:00 | 20 |
`GROUP BY date_trunc('day', order_ts)::date` gives two buckets: 2026-01-01 → 10, 2026-01-02 → 20.
INTERVAL and relative windows
- `CURRENT_TIMESTAMP - INTERVAL '7 days'` expresses sliding windows without hard-coding dates.
- Pair with `WHERE ts >= … AND ts < …` for half-open ranges.
Worked example: “Last 24 hours”: `WHERE event_ts >= NOW() - INTERVAL '24 hours' AND event_ts < NOW()` (use `<= NOW()` instead if you treat “now” itself as inclusive).
Difference between timestamps
- `dropoff_ts - pickup_ts` (PostgreSQL `interval`)—useful for delivery duration. Convert with `EXTRACT(EPOCH FROM …) / 3600` for hours.
Worked example
| pickup_ts | dropoff_ts |
|---|---|
| 2026-01-01 10:00 | 2026-01-01 16:00 |
dropoff_ts - pickup_ts is a 6-hour interval.
Boundaries: half-open vs BETWEEN
- Prefer `ts >= start_ts AND ts < end_ts` so an event exactly on `end_ts` is not double-counted across adjacent windows.
- `BETWEEN a AND b` includes both endpoints—easy to off-by-one with timestamps.
Gaps, sessions, and streaks
- Sort by time, then `LAG(ts)` / `LEAD(ts)` for gaps between consecutive events.
- Streak / islands (conceptual): `order_day - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_day)` is constant within a streak of consecutive calendar days. Full gaps-and-islands is a common follow-up—practice on time-series SQL.
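The “day minus row number is constant within a streak” trick translates directly to plain Python. A sketch (the sample dates are made up):

```python
from itertools import groupby
from datetime import date, timedelta

def longest_streak(days: list[date]) -> int:
    """Length of the longest run of consecutive calendar days."""
    days = sorted(set(days))          # dedupe and order, like DISTINCT + ORDER BY
    if not days:
        return 0
    # key = day - row_number: identical for every day in one consecutive run
    keyed = [(d - timedelta(days=i), d) for i, d in enumerate(days)]
    return max(len(list(g)) for _, g in groupby(keyed, key=lambda kv: kv[0]))

days = [date(2026, 1, 1), date(2026, 1, 2), date(2026, 1, 3), date(2026, 1, 5)]
print(longest_streak(days))  # 3 (Jan 1-3; Jan 5 starts a new island)
```

Walking through this in Python first makes the SQL version (a `ROW_NUMBER` CTE grouped by the difference column) much easier to write under time pressure.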
Common beginner mistakes
- Mixing `DATE` and `TIMESTAMP` without casting.
- Using `BETWEEN` on timestamps when a half-open range is safer.
- Forgetting UTC vs local when filtering “today.”
SQL Interview Question on Orders in the Last Rolling Seven Days
Count completed orders per merchant_id for order_ts in [today-7d, today) (half-open).
Solution Using Range Predicate
```sql
SELECT merchant_id, COUNT(*) AS n_orders
FROM orders
WHERE status = 'completed'
  AND order_ts >= CURRENT_TIMESTAMP - INTERVAL '7 days'
  AND order_ts < CURRENT_TIMESTAMP
GROUP BY merchant_id;
```
Why this works: WHERE filters to the rolling window and completed status before GROUP BY; COUNT(*) counts orders per merchant in that slice.
Practice
- SQL · Topic — Time-series SQL
- COMPANY · DoorDash — DoorDash-tagged date functions
- COMPANY · DoorDash — DoorDash-tagged date arithmetic
Practice time-series SQL patterns after this section—the hub matches many “consecutive day” and cohort-style prompts.
5. Bucketing and CASE expressions for logistics metrics
CASE Expressions and Bucketing in SQL for SLA tiers
CASE maps continuous metrics (minutes late, miles) into ordered buckets interviewers can scan quickly—on-time vs late, distance bands, volume tiers.
CASE for SLA and latency tiers
- Pattern: `CASE WHEN … THEN … WHEN … THEN … ELSE … END`—first matching branch wins; `ELSE` avoids silent `NULL` labels.
Worked example: delivery minutes m → on_time if m <= 30, late if m <= 60, else very_late:
```sql
CASE
    WHEN delivery_minutes <= 30 THEN 'on_time'
    WHEN delivery_minutes <= 60 THEN 'late'
    ELSE 'very_late'
END
```
Test boundaries: is 30 inclusive? Say it explicitly.
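The same bucketing written as a plain Python function makes the boundary decision explicit (here 30 and 60 are inclusive in their lower tiers, matching the `<=` branches above):

```python
def sla_tier(delivery_minutes: float) -> str:
    """Map delivery minutes to an SLA tier; first matching branch wins."""
    if delivery_minutes <= 30:
        return "on_time"
    if delivery_minutes <= 60:
        return "late"
    return "very_late"

# Probing the boundaries answers "is 30 inclusive?" directly.
print([sla_tier(m) for m in (30, 31, 60, 61)])
# ['on_time', 'late', 'late', 'very_late']
```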
Ordered buckets vs percentiles
- Fixed thresholds (SLA minutes) are more common than percentile buckets in interview cards.
- `NTILE(n)` over a partition splits ordered rows into n buckets—useful when the prompt says “quartile” or “decile”; watch ties at bucket edges.
Pitfalls: overlapping conditions and ELSE NULL
- Non-exhaustive `CASE` returns `NULL` for misses—sometimes you want `ELSE 'unknown'` so rows still appear in reports.
- Overlapping `WHEN` ranges are a logic bug; keep branches mutually exclusive unless the prompt says otherwise.
Common beginner mistakes
- Off-by-one on inclusive/exclusive boundaries.
- Putting `CASE` results in `WHERE` without wrapping or using a subquery when the engine evaluates order unexpectedly—often compute `CASE` in `SELECT` or a CTE, then filter.
SQL Interview Question on Segmenting Merchants by Weekly Order Volume
From orders, bucket merchants into low / medium / high for last week: low < 50 orders, medium < 200, else high.
Solution Using CASE on a Pre-Aggregated CTE
```sql
WITH w AS (
    SELECT merchant_id, COUNT(*) AS n
    FROM orders
    WHERE order_ts >= DATE_TRUNC('week', CURRENT_DATE) - INTERVAL '7 days'
      AND order_ts < DATE_TRUNC('week', CURRENT_DATE)
    GROUP BY merchant_id
)
SELECT merchant_id, n,
       CASE
           WHEN n < 50 THEN 'low'
           WHEN n < 200 THEN 'medium'
           ELSE 'high'
       END AS volume_tier
FROM w;
```
Why this works: We aggregate first (counts per merchant in the window), then label each count with CASE—clean separation of “how many” vs “which tier.”
Practice
- SQL · Topic — Bucketing (SQL)
- COMPANY · DoorDash — DoorDash-tagged bucketing
- COMPANY · DoorDash — DoorDash-tagged case expressions
6. Hash maps, sliding windows, and Python patterns for DE screens
Hash Maps and Counting in Python for interview loops
A dict (hash map) maps keys to values. Average-time lookup, insert, and update are O(1) in the usual amortized sense—much faster than rescanning a whole list for every key.
Many DoorDash data engineer Python items are algorithmic but small-input: you implement clear logic under time pressure, not a production service.
Plain dict and .get
- Frequency: `counts[k] = counts.get(k, 0) + 1` avoids `KeyError` on first sighting of `k`.
Worked example
`merchant_ids = ["M1", "M2", "M1"]` — after the loop, `freq` is `{"M1": 2, "M2": 1}`.
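Spelled out, the one-pass loop looks like this:

```python
# One pass over the list; .get supplies 0 the first time a key appears.
merchant_ids = ["M1", "M2", "M1"]
freq: dict[str, int] = {}
for k in merchant_ids:
    freq[k] = freq.get(k, 0) + 1
print(freq)  # {'M1': 2, 'M2': 1}
```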
collections.Counter
- Built for frequency counts: `Counter(iterable)`, `.most_common(k)` for top-k.
Worked example
```python
from collections import Counter

Counter(["late", "ok", "late"]).most_common(1)  # [('late', 2)]
```
defaultdict from collections
- `defaultdict(list)`—`d[order_id].append(event)` mirrors `GROUP BY order_id` “collect all rows in a bucket.”
Why avoid nested scans?
- “For each distinct key, loop the entire list” is O(n²). One pass with a dict is typically O(n)—what interviewers expect.
Worked example: With about 10,000 events and 500 distinct merchants, rescanning all rows per merchant is on the order of millions of operations; one dict pass stays on the order of ten thousand.
Sliding windows in memory
- Keep a running sum and subtract the element leaving the window, or use a deque of indices/times—pattern for “max sum of k consecutive delivery fees” or “events in last N minutes.”
Worked example: Fees [2, 3, 1, 5], k = 2: window sums are 5, 4, 6; max is 6.
When two pointers or binary search show up
- Two pointers often pair with sorted arrays; binary search answers “minimum feasible time/speed” when a predicate is monotonic.
Common beginner mistakes
- Re-scanning the whole list inside a loop over unique keys.
- Off-by-one on window bounds (`left` inclusive vs exclusive).
- Not handling empty input or `k = 0`.
Pro tip: Before you code, restate inputs, output, and edge cases. That structure mirrors how PipeCode’s editorials explain 450+ problems.
Python Interview Question on Two Deliveries in the Same Hour
Given a sorted list of delivery start times (integers, seconds since epoch), return true if any two fall in the same clock hour (use t // 3600 as a simple hour bucket).
Solution Using a Set of Hour Buckets
```python
def same_hour_exists(times: list[int]) -> bool:
    seen = set()
    for t in times:
        h = t // 3600
        if h in seen:
            return True
        seen.add(h)
    return False
```
Why this works: Each timestamp maps to one hour id; if we see the same id twice, two deliveries share that hour. set gives O(1) expected lookup—O(n) overall.
Practice
- PYTHON · Topic — Hash tables & counting
- COMPANY · DoorDash — DoorDash-tagged hash tables
- PYTHON · Topic — Sliding window
- COMPANY · DoorDash — DoorDash-tagged sliding window
Open the full DoorDash Python problem set when you want language-filtered reps.
7. Star schema, grain, and SCD patterns in data engineering
Dimensional Modeling and SCD Patterns for warehouse interviews
Verbal rounds often ask you to sketch a star schema: a fact at the center (e.g. one row per order or per delivery line) and dimensions around it (merchant, customer, Dasher, date). Be crisp about grain—“one fact row represents what event?”
Fact table grain
- Grain is the atomic event the fact row represents. Mixing order-level and order-line-level metrics in one fact without adjustment breaks additivity (you cannot blindly `SUM` across rows).
Worked example: If `fact_orders` is one row per order, `line_item_subtotal` does not belong there unless you allocate or roll up lines first.
Star vs snowflake
- Star: Dimensions are denormalized (wide tables)—fewer joins for analysts.
- Snowflake: Dimensions are normalized (multiple tables)—saves storage, more joins.
- Trade-off: Query simplicity vs normalization; say both names in a design interview.
SCD Type 1 vs Type 2
- Type 1: Overwrite the merchant’s address—no history; past reports would look “wrong” if you re-run them without snapshotting.
- Type 2: New row per change with `valid_from`, `valid_to`, `is_current` (or similar); facts store `merchant_sk` pointing at the version active at order time.
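A hypothetical sketch of the Type 2 lookup in plain Python—two versions of one merchant, and a half-open validity range picking the version active on a given date (all rows and field names here are illustrative, following the text's `valid_from` / `valid_to` / `is_current` convention):

```python
from datetime import date

# Two Type 2 versions of merchant M1; the second is current (made-up data).
dim_merchant = [
    {"merchant_sk": 1, "merchant_id": "M1", "address": "Old St 1",
     "valid_from": date(2025, 1, 1), "valid_to": date(2025, 6, 1), "is_current": False},
    {"merchant_sk": 2, "merchant_id": "M1", "address": "New Ave 9",
     "valid_from": date(2025, 6, 1), "valid_to": date(9999, 12, 31), "is_current": True},
]

def version_at(merchant_id: str, order_date: date) -> dict:
    """Return the dim row active on order_date (half-open: from <= d < to)."""
    return next(r for r in dim_merchant
                if r["merchant_id"] == merchant_id
                and r["valid_from"] <= order_date < r["valid_to"])

print(version_at("M1", date(2025, 3, 15))["address"])  # Old St 1
print(version_at("M1", date(2025, 7, 1))["address"])   # New Ave 9
```

In the warehouse the same lookup happens once at load time: the fact row captures the `merchant_sk` of the then-active version, so history never needs rewriting.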
Bridge tables (many-to-many)
- Promo applies to many merchants; merchant runs many promos—a bridge `merchant_promo(merchant_id, promo_id, …)` at assignment grain is the usual pattern.
Common beginner mistakes
- Snowflaking the fact (multiple grains in one table).
- Forgetting surrogate keys on Type 2 dimensions so facts can point at the right version.
- Confusing operational OLTP schemas with analytic star schemas—interview answers should name grain and keys explicitly.
Deep dives: Data modeling for data engineering interviews and ETL system design.
Data Engineering Interview Question on Tracking Merchant Address History
Explain how you would model merchant addresses so historical orders still show the address at order time while dashboards can list the current address.
Solution Using SCD Type 2 for History and a Current Flag
Store Type 2 rows on dim_merchant with valid_from, valid_to, is_current. Facts carry merchant_sk (surrogate key) tied to the version active when the order was placed. Dashboards filter is_current = true. Optionally maintain a Type 1 “current view” as a separate table or view for simplicity.
Why this works: Type 2 preserves history for facts that already stored the right merchant_sk; is_current gives analysts a single row per merchant for “where are they now?” without rewriting history.
Practice
- SQL · Topic — Dimensional modeling
- SQL · Hub — Data modeling language hub
Tips to crack DoorDash data engineering interviews
Data engineering interview preparation is less about memorizing every LeetCode pattern and more about repeatable lanes: SQL grain, joins, windows, dates, then Python frequency and windows, then a clean verbal on schemas and pipelines.
SQL preparation
Drill WHERE vs HAVING, join fan-out, and window partition keys on paper first. Explain NULL behavior in aggregates before you touch the keyboard.
Pipeline and ETL thinking
Be ready to discuss idempotent loads, late-arriving facts, and backfills at a high level—pair that narrative with ETL topic practice.
Where to practice on PipeCode
| Lane | DoorDash-focused path |
|---|---|
| Company hub | /explore/practice/company/doordash |
| SQL | /explore/practice/company/doordash/sql |
| Python | /explore/practice/company/doordash/python |
| Medium difficulty | /explore/practice/company/doordash/difficulty/medium |
Courses that match this article: SQL for data engineering interviews, Python fundamentals, plus the modeling and ETL links above.
Communication under time pressure
Narrate assumptions (timezone, tie-break, inclusive bounds), sketch grain, then code. Interviewers hire people who debug fan-out and NULL issues calmly—that is how to crack this style of loop without burning time.
Frequently asked questions
What types of DoorDash data engineering interview questions should I expect—SQL, Python, or system design?
Expect heavy SQL (joins, aggregates, windows, dates), Python structured problems (dicts, sliding window, occasional graphs), and at least one design conversation about pipelines, warehouses, or dimensional models. This article mirrors that mix; use the DoorDash practice hub to align reps with those lanes.
How hard are DoorDash SQL interviews compared to other large tech companies?
Difficulty is broadly comparable to other marketplace and consumer-tech data engineering loops: medium SQL with a few edge-case checks is common. PipeCode tags 63 DoorDash problems across easy, medium, and hard—use that spread instead of guessing from forums alone.
Do DoorDash data engineer interviews focus on Python algorithms or data-structure problems?
You should expect some algorithmic Python (hash maps, intervals, sorting, graphs) in addition to data-wrangling intuition. It is not identical to a pure SWE loop, but brushing up on patterns in the DoorDash Python filter pays off.
How do I prepare for DoorDash data modeling or warehouse design questions?
Practice star schema, fact grain, and SCD Type 1 vs 2 out loud; sketch surrogate keys and how facts reference dimensions. Pair verbal reps with dimensional modeling drills and the data modeling course linked above.
How many DoorDash-tagged practice problems should I finish before the loop?
There is no magic number, but 30–50 mixed problems across SQL, Python, and at least a few modeling prompts is a solid data engineering interview preparation cadence if you explain each solution after you submit.
Where can I practice DoorDash-style SQL and Python with tests and feedback?
PipeCode runs 450+ interview-style problems with tests and AI feedback; start at /explore/practice/company/doordash for company-tagged sets, then branch into topic pages for weak spots.
Start practicing DoorDash data engineering problems
PipeCode pairs company-tagged DoorDash problems with tests and feedback so you move from reading patterns to typing your own solutions.
Pipecode.ai is Leetcode for Data Engineering



