DoorDash data engineering interview questions usually stress SQL for marketplace metrics—orders, merchants, Dashers, ratings—and Python for structured problems (hash maps, sliding windows, graphs) that still show up on many DE loops. This guide maps the skills behind 63+ DoorDash-tagged practice problems on PipeCode so you know what to drill and in what order.
We walk through aggregation and grain, joins and fan-out, window functions, time logic, bucketing, representative Python patterns, and a concise data modeling pass (star schema and SCDs). Sample prompts below are original teaching examples, not leaked items.
Top DoorDash data engineering interview topics
| # | Topic | Why it shows up |
|---|---|---|
| 1 | Aggregation & GROUP BY / HAVING | Per-merchant revenue, per-Dasher counts, SLA segments. |
| 2 | Joins & deduplication | Orders to deliveries without row explosion; “no activity” anti-joins. |
| 3 | Window functions & ranking | Top-N per region, previous/next event, running metrics. |
| 4 | Dates, streaks, time-series SQL | Daily actives, latency buckets, gaps between orders. |
| 5 | CASE & bucketing | On-time vs late, radius tiers, volume bands. |
| 6 | Python: frequency & sliding windows | Counters, bounded intervals, order-sequence scans. |
| 7 | Star schema, grain, SCDs | Verbal warehouse design and historical dimensions. |
If you are new to SQL: In most databases the engine processes a query roughly in this order:
`FROM`/joins → `WHERE` (filter rows) → `GROUP BY` → compute aggregates → `HAVING` (filter groups) → window functions → `SELECT`/`ORDER BY`. When in doubt, ask: “Am I filtering one row at a time (`WHERE`) or a whole group after summing (`HAVING`)?” Saying that out loud in a DoorDash SQL interview buys you time to structure the query.
1. Aggregation and GROUP BY concepts in marketplace SQL
Aggregation and GROUP BY in SQL for marketplace metrics
Picture a table with many detail rows—for example one row per order at a merchant. Aggregation means: “turn lots of rows into one summary value (or a few values) per bucket.” The bucket is whatever the prompt names: per merchant, per Dasher, per day, and so on.
- `GROUP BY` defines the bucket: “Put all rows with the same `merchant_id` together,” or “the same `(store_id, day)` together.” Every distinct combination of the `GROUP BY` columns is one group; the database runs your aggregate functions separately inside each group.
SUM(col)
- What it does: Adds all numeric values of `col` in the bucket.
- `NULL` behavior: `NULL` cells are ignored (they are not treated as 0). If every value is `NULL`, `SUM` is usually `NULL`, not 0—say that in an interview if the edge case matters.
- Typical use: Gross merchandise value per merchant, delivery fees summed per region.
Worked example: In one group, subtotal values 10, NULL, 30 → SUM(subtotal) = 40 (only 10 + 30).
AVG(col)
- What it does: Average of non-`NULL` values: sum of non-null values divided by count of non-null values.
- `NULL` behavior: Rows where `col` is `NULL` do not enter the numerator or the denominator. If you need “average where missing means 0,” use `AVG(COALESCE(col, 0))` (only if the business defines it that way).
- Typical use: Average order value per merchant, average delivery minutes per zone.
Worked example: Same three rows 10, NULL, 30 → AVG(subtotal) = 20 because (10 + 30) / 2; the NULL row is not counted in the average.
COUNT(*)
- What it does: Counts how many rows are in the bucket—every row counts, even if some columns are `NULL`.
- Typical use: “How many orders per merchant?”, “how many delivery attempts in this hour?”—when each row is one event.
Worked example: Same three rows → COUNT(*) = 3 (the NULL row still counts as a row).
COUNT(col)
- What it does: Counts rows where `col` is not `NULL`. Differs from `COUNT(*)` as soon as `col` has nulls.
- Related: `COUNT(DISTINCT col)` counts unique non-null values in the bucket—essential after joins when you must count customers, not multiplied rows (see section 2).
Worked example: Same three rows → COUNT(subtotal) = 2. If the third row were 50 instead of NULL, COUNT(DISTINCT subtotal) with values 10, 30, 50 would be 3; with 10, 30, 10 it would be 2.
MIN(col) and MAX(col)
- What they do: Return the smallest or largest value of `col` in the bucket. Works on orderable types: numbers, dates/timestamps, strings (lexicographic order).
- `NULL` behavior: `NULL`s are skipped. If all values are `NULL`, the result is `NULL`.
- Typical use: Latest order time (`MAX(order_ts)`), earliest Dasher signup (`MIN(created_at)`).
Worked example: Subtotals 10, NULL, 30 → MIN(subtotal) = 10, MAX(subtotal) = 30.
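These `NULL` rules are easy to verify yourself. A minimal sketch using SQLite through Python's `sqlite3` module (the table name and three-row data mirror the worked examples above):

```python
import sqlite3

# Three rows, one of them NULL, as in the worked examples above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, subtotal REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (2, None), (3, 30.0)])

# SUM/AVG/COUNT(col)/MIN/MAX skip the NULL; COUNT(*) does not.
row = conn.execute("""
    SELECT SUM(subtotal), AVG(subtotal), COUNT(*), COUNT(subtotal),
           MIN(subtotal), MAX(subtotal)
    FROM orders
""").fetchone()
print(row)  # (40.0, 20.0, 3, 2, 10.0, 30.0)
```

Running the same five aggregates over one tiny table is a quick way to double-check your mental model before an interview.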
Worked example — one dataset, several aggregates
Suppose orders looks like this:
| order_id | merchant_id | subtotal |
|---|---|---|
| 101 | M1 | 20.00 |
| 102 | M1 | NULL |
| 103 | M1 | 40.00 |
| 104 | M2 | 100.00 |
Run:
```sql
SELECT merchant_id,
       SUM(subtotal)   AS sum_amt,
       AVG(subtotal)   AS avg_amt,
       COUNT(*)        AS n_rows,
       COUNT(subtotal) AS n_known_amt,
       MIN(subtotal)   AS min_amt,
       MAX(subtotal)   AS max_amt
FROM orders
GROUP BY merchant_id
ORDER BY merchant_id;
```
You should get:
| merchant_id | sum_amt | avg_amt | n_rows | n_known_amt | min_amt | max_amt |
|---|---|---|---|---|---|---|
| M1 | 60.00 | 30.0000… | 3 | 2 | 20.00 | 40.00 |
| M2 | 100.00 | 100.0000… | 1 | 1 | 100.00 | 100.00 |
For M1, the row with subtotal NULL still counts in COUNT(*) (3 rows) but not in SUM / AVG / COUNT(subtotal) (only the 20 and 40 matter).
COUNT(DISTINCT) mini-example: if delivery_attempts has two rows for the same order_id (retry), COUNT(*) is 2 but COUNT(DISTINCT order_id) is 1 for that bucket.
Conditional aggregation (CASE inside aggregates)
- Idea: Count or sum only some rows in the group without splitting into multiple queries—put the condition inside the aggregate.
- Patterns: `SUM(CASE WHEN condition THEN col ELSE 0 END)`, `COUNT(CASE WHEN … THEN 1 END)` (or `SUM(CASE WHEN … THEN 1 ELSE 0 END)`).
Worked example — events per merchant by type
| merchant_id | event_type |
|---|---|
| M1 | impression |
| M1 | order |
| M1 | impression |
| M2 | impression |
```sql
SELECT merchant_id,
       COUNT(*) AS total_events,
       SUM(CASE WHEN event_type = 'order' THEN 1 ELSE 0 END) AS orders,
       SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS impressions
FROM merchant_events
GROUP BY merchant_id;
```
Result
| merchant_id | total_events | orders | impressions |
|---|---|---|---|
| M1 | 3 | 1 | 2 |
| M2 | 1 | 0 | 1 |
This is the same “one GROUP BY, many metrics” style as `COUNT(*) FILTER (WHERE …)` in PostgreSQL; `CASE` is the portable form that works across warehouses.
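The merchant_events example above can be run end to end in SQLite via Python's `sqlite3` (table and column names match the worked example; nothing here is DoorDash-specific):

```python
import sqlite3

# Same four events as the worked example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE merchant_events (merchant_id TEXT, event_type TEXT)")
conn.executemany("INSERT INTO merchant_events VALUES (?, ?)",
                 [("M1", "impression"), ("M1", "order"),
                  ("M1", "impression"), ("M2", "impression")])

# One GROUP BY, several metrics: the condition lives inside each SUM.
rows = conn.execute("""
    SELECT merchant_id,
           COUNT(*) AS total_events,
           SUM(CASE WHEN event_type = 'order' THEN 1 ELSE 0 END) AS orders,
           SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS impressions
    FROM merchant_events
    GROUP BY merchant_id
    ORDER BY merchant_id
""").fetchall()
print(rows)  # [('M1', 3, 1, 2), ('M2', 1, 0, 1)]
```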
GROUP BY and HAVING (how they fit together)
- `HAVING` is `WHERE` for buckets: it runs after grouping, so you can filter on `AVG(subtotal)`, `COUNT(*)`, etc. `WHERE` runs before grouping and only sees raw row columns—so `WHERE AVG(subtotal) > 50` is invalid: the average does not exist until after `GROUP BY`.
Worked example — WHERE vs HAVING
| order_id | merchant_id | subtotal |
|---|---|---|
| 1 | M1 | 40 |
| 2 | M1 | 70 |
| 3 | M1 | 20 |
| 4 | M2 | 200 |
- `WHERE subtotal > 30` drops row 3 before grouping. Then `GROUP BY merchant_id` with `SUM(subtotal)` gives M1 = 110 (40 + 70), M2 = 200.
- `HAVING SUM(subtotal) > 100` runs after grouping on the unfiltered table: M1’s sum is 130, M2’s is 200—both pass.
Worked example — HAVING output
```sql
SELECT merchant_id, SUM(subtotal) AS total_amt, COUNT(*) AS n_orders
FROM orders
GROUP BY merchant_id
HAVING SUM(subtotal) > 100;
```
Result
| merchant_id | total_amt | n_orders |
|---|---|---|
| M1 | 130 | 3 |
| M2 | 200 | 1 |
Rule of thumb: If the condition uses `SUM` / `COUNT` / `AVG` / … of the group, use `HAVING`. If it only uses this row’s columns, use `WHERE` (and put it first—it usually makes the query faster too).
If a column appears in SELECT and is not inside an aggregate, it must appear in GROUP BY (in strict SQL).
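The WHERE-vs-HAVING contrast is easy to see side by side. A small sketch using SQLite via `sqlite3`, with the same four-row orders table as the worked example above:

```python
import sqlite3

# Four rows, as in the WHERE vs HAVING worked example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, merchant_id TEXT, subtotal INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "M1", 40), (2, "M1", 70), (3, "M1", 20), (4, "M2", 200)])

# WHERE filters rows before grouping: row 3 (20) never enters M1's sum.
where_first = conn.execute("""
    SELECT merchant_id, SUM(subtotal) FROM orders
    WHERE subtotal > 30
    GROUP BY merchant_id ORDER BY merchant_id
""").fetchall()

# HAVING filters groups after aggregation: all rows enter the sums.
having_only = conn.execute("""
    SELECT merchant_id, SUM(subtotal) FROM orders
    GROUP BY merchant_id
    HAVING SUM(subtotal) > 100 ORDER BY merchant_id
""").fetchall()

print(where_first)  # [('M1', 110), ('M2', 200)]
print(having_only)  # [('M1', 130), ('M2', 200)]
```

M1’s total differs between the two queries (110 vs 130) precisely because one filter runs before grouping and the other after.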
Common beginner mistakes
- Putting `AVG(subtotal) > 50` in `WHERE` → use `HAVING` after `GROUP BY`.
- Forgetting a column in `GROUP BY` when it appears in `SELECT` without an aggregate → invalid query in strict SQL.
- Answering at the wrong grain (e.g. one row per `order_line` when the question asked per merchant).
SQL Interview Question on Aggregation and GROUP BY
Given orders(order_id, merchant_id, subtotal, is_cancelled), compute gross revenue per merchant excluding cancelled orders. Return merchants with revenue strictly greater than 100.
Solution Using GROUP BY, Conditional Sum, and HAVING
```sql
SELECT merchant_id,
       SUM(CASE WHEN NOT is_cancelled THEN subtotal ELSE 0 END) AS gross_rev
FROM orders
GROUP BY merchant_id
HAVING SUM(CASE WHEN NOT is_cancelled THEN subtotal ELSE 0 END) > 100;
```
Why this works: We group by merchant, sum only non-cancelled subtotals, then keep only groups whose gross revenue passes the threshold. That threshold is an aggregate, so it belongs in HAVING.
Practice
- SQL · Topic — Aggregation problems (all companies)
- COMPANY · DoorDash — DoorDash-tagged aggregation
- COMPANY · DoorDash — DoorDash SQL hub (all difficulties)
When you are ready for timed reps, open DoorDash SQL (medium).
2. Joins, deduplication, and fan-out patterns in SQL
Joins and Deduplication in SQL for logistics analytics
A join answers: “For each row in table A, which rows in table B belong with it?” You declare that with ON a.key = b.key. In marketplace analytics you enrich orders and deliveries with merchants, customers, and Dashers.
Fact vs dimension and grain (how DEs talk about joins)
- Fact table: Many rows, often one row per event or transaction (order, delivery attempt)—the thing you `SUM` or `COUNT`.
- Dimension table: Fewer rows, one row per entity you describe (merchant, Dasher)—the thing you `JOIN` to get names, zones, categories.
- Grain: “One result row per what?” Wrong grain after a join usually means fan-out: you aggregated at order level but summed a merchant attribute that repeated.
Worked example: `order_lines` (fact: `order_id`, `sku`, `qty`) joins `items` (dimension: `sku`, `list_price`). `SUM(qty)` by `sku` is correct at line grain. If `items` had duplicate rows per `sku`, one join could inflate `SUM(list_price)`—fix the dimension to one row per `sku` before summing price.
Inner join (JOIN / INNER JOIN)
- Result: Only rows where the `ON` condition succeeds on both sides.
- Use when: You only care about “matched” pairs—e.g. orders that have a known merchant row.
- `NULL` keys: `NULL = NULL` is not true in join semantics, so inner joins drop rows with a null join key unless you `COALESCE` or filter explicitly.
Worked example
orders
| order_id | merchant_id |
|---|---|
| O1 | M10 |
| O2 | M99 |
merchants
| merchant_id | name |
|---|---|
| M10 | Noodle House |
FROM orders o JOIN merchants m ON o.merchant_id = m.merchant_id returns one row: O1. Order O2 has no merchant row and disappears from an inner join.
Left outer join (LEFT JOIN)
- Result: Every row from the left table appears once per matching right row; if there is no match, right-side columns are `NULL`.
- Use when: You must keep all left entities even if they have zero events—e.g. all merchants including those with no orders this week.
Worked example
merchants
| merchant_id | name |
|---|---|
| M1 | Cafe A |
| M2 | Cafe B |
orders
| order_id | merchant_id | amount |
|---|---|---|
| X1 | M1 | 50 |
FROM merchants m LEFT JOIN orders o ON m.merchant_id = o.merchant_id yields two rows: M1 + X1, and M2 + NULL (no order—right side null-padded).
Semi-join vs anti-join (names, not syntax)
- Semi-join: “Keep rows from A where some match exists in B”—`WHERE EXISTS (SELECT 1 FROM B WHERE …)`.
- Anti-join: “Keep rows from A with no match in B”—`LEFT JOIN B … WHERE B.key IS NULL` or `WHERE NOT EXISTS (SELECT 1 FROM B WHERE …)`.
Worked example
`dashers` has `dasher_id` values d1, d2, d3; `completed_deliveries` contains only d2.
- Semi-join: dashers who have at least one completed delivery → d2.
- Anti-join: dashers who never completed → d1, d3.
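Both patterns can be sketched in SQLite via `sqlite3`, using the toy dashers data above:

```python
import sqlite3

# d1, d2, d3 as dashers; only d2 has a completed delivery.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dashers (dasher_id TEXT)")
conn.execute("CREATE TABLE completed_deliveries (dasher_id TEXT)")
conn.executemany("INSERT INTO dashers VALUES (?)", [("d1",), ("d2",), ("d3",)])
conn.execute("INSERT INTO completed_deliveries VALUES ('d2')")

# Semi-join: keep dashers with at least one match.
semi = conn.execute("""
    SELECT d.dasher_id FROM dashers d
    WHERE EXISTS (SELECT 1 FROM completed_deliveries c
                  WHERE c.dasher_id = d.dasher_id)
""").fetchall()

# Anti-join: keep dashers with no match.
anti = conn.execute("""
    SELECT d.dasher_id FROM dashers d
    WHERE NOT EXISTS (SELECT 1 FROM completed_deliveries c
                      WHERE c.dasher_id = d.dasher_id)
""").fetchall()

print(semi)  # [('d2',)]
print(anti)  # [('d1',), ('d3',)]
```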
Fan-out (row multiplication)
- If the join key is not unique on one side (one order, many delivery attempts), the “one” side’s columns repeat on every matching row.
- Bug pattern: Summing a merchant-level column after the join counts that merchant’s field multiple times. Fixes: aggregate first, or `COUNT(DISTINCT merchant_id)`, or a subquery that makes the dimension one row per key before joining.
Worked example
merchants: (merchant_id=M1, fee_tier_credit=500)
orders: two rows for merchant_id=M1.
After merchants JOIN orders, fee_tier_credit appears twice (500, 500). SUM(m.fee_tier_credit) would add 1000, which is wrong for “total tier credit in the system.” Fix: SUM order amounts, or MAX(m.fee_tier_credit) per merchant if you only need the credit once.
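Here is the same inflation reproduced in SQLite via `sqlite3`, together with an aggregate-first fix (one of several possible fixes):

```python
import sqlite3

# One merchant row with a tier credit, two order rows -> fan-out on join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE merchants (merchant_id TEXT, fee_tier_credit INTEGER)")
conn.execute("CREATE TABLE orders (order_id TEXT, merchant_id TEXT)")
conn.execute("INSERT INTO merchants VALUES ('M1', 500)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [("O1", "M1"), ("O2", "M1")])

# Naive SUM after the join: the 500 repeats once per order row.
inflated = conn.execute("""
    SELECT SUM(m.fee_tier_credit)
    FROM merchants m JOIN orders o ON o.merchant_id = m.merchant_id
""").fetchone()[0]

# Fix: collapse back to one credit per merchant before summing.
correct = conn.execute("""
    SELECT SUM(credit) FROM (
        SELECT MAX(m.fee_tier_credit) AS credit
        FROM merchants m JOIN orders o ON o.merchant_id = m.merchant_id
        GROUP BY m.merchant_id
    )
""").fetchone()[0]

print(inflated, correct)  # 1000 500
```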
Deduplication
- `SELECT DISTINCT` removes duplicate rows when any duplicate row is interchangeable.
- Latest row per id: Prefer `ROW_NUMBER() OVER (PARTITION BY id ORDER BY time DESC) AS rn` in a CTE, then `WHERE rn = 1`. `DISTINCT ON` (PostgreSQL) is a compact alternative when the prompt names the dialect.
Worked example — latest attempt per order
```sql
WITH ranked AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY attempt_ts DESC) AS rn
    FROM delivery_attempts
)
SELECT * FROM ranked WHERE rn = 1;
```
Join that result to orders to avoid inflated revenue.
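The same dedup runs in SQLite via `sqlite3`, assuming a SQLite build with window-function support (3.25+); the sample rows are made up:

```python
import sqlite3

# Two attempts for O1 (latest wins), one for O2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE delivery_attempts (order_id TEXT, attempt_ts TEXT, status TEXT)")
conn.executemany("INSERT INTO delivery_attempts VALUES (?, ?, ?)",
                 [("O1", "2026-01-01 10:00", "failed"),
                  ("O1", "2026-01-01 10:20", "delivered"),
                  ("O2", "2026-01-01 11:00", "delivered")])

# ROW_NUMBER per order, newest first; keep rn = 1.
rows = conn.execute("""
    WITH ranked AS (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY order_id ORDER BY attempt_ts DESC) AS rn
        FROM delivery_attempts
    )
    SELECT order_id, attempt_ts, status FROM ranked
    WHERE rn = 1 ORDER BY order_id
""").fetchall()
print(rows)
# [('O1', '2026-01-01 10:20', 'delivered'), ('O2', '2026-01-01 11:00', 'delivered')]
```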
Common beginner mistakes
- Joining on the wrong key or ignoring `NULL` keys (they do not match in an inner join).
- Assuming one row per merchant after a join when the join multiplied rows.
- Using `COUNT(*)` when the question wanted distinct customers—use `COUNT(DISTINCT customer_id)`.
Joins-Based Interview Question on Dashers Without Completed Deliveries
From dashers(dasher_id) and deliveries(delivery_id, dasher_id, status), list dashers who have never completed a delivery (status = 'completed').
Solution Using NOT EXISTS
```sql
SELECT d.dasher_id
FROM dashers d
WHERE NOT EXISTS (
    SELECT 1 FROM deliveries x
    WHERE x.dasher_id = d.dasher_id AND x.status = 'completed'
);
```
Why this works: For each Dasher we ask “is there any completed delivery?”—if not, the Dasher passes the filter. NOT EXISTS is a standard anti-join pattern and often reads clearer than LEFT JOIN … IS NULL.
Practice
- SQL · Topic — Joins & deduplication
- COMPANY · DoorDash — DoorDash-tagged joins
3. Window functions and ranking in SQL
Window Functions and Ranking in SQL for delivery timelines
GROUP BY shrinks the table: many rows → one row per bucket. Window functions do the opposite idea: every row stays, but you add a new column computed from a window of rows around it. Syntax: some_function(...) OVER (...).
PARTITION BY (optional but common)
- Meaning: “Start a new window for each distinct value of these columns”—like a separate mini-table per `merchant_id` or `dasher_id`.
- Without it: The window is usually the whole result set (after `WHERE`), which is rare for rankings but valid for some global running totals.
Worked example (one merchant, running sum over days)
| merchant_id | day | subtotal |
|---|---|---|
| M1 | 2026-01-01 | 10 |
| M1 | 2026-01-02 | 30 |
`SUM(subtotal) OVER (PARTITION BY merchant_id ORDER BY day ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` adds 10 on the first row and 40 on the second—cumulative subtotal for that merchant.
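This running sum can be checked in SQLite via `sqlite3`, assuming a build with window-function support (3.25+); the table name `daily` is illustrative:

```python
import sqlite3

# Two days for one merchant, as in the worked example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (merchant_id TEXT, day TEXT, subtotal INTEGER)")
conn.executemany("INSERT INTO daily VALUES (?, ?, ?)",
                 [("M1", "2026-01-01", 10), ("M1", "2026-01-02", 30)])

# Ordered frame: each row sees itself plus all earlier rows in its partition.
rows = conn.execute("""
    SELECT day,
           SUM(subtotal) OVER (
               PARTITION BY merchant_id ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running
    FROM daily ORDER BY day
""").fetchall()
print(rows)  # [('2026-01-01', 10), ('2026-01-02', 40)]
```

Note that every input row survives—the window adds a column instead of collapsing the table the way `GROUP BY` would.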
ORDER BY inside OVER
-
Meaning: Defines which row is ‘first’ inside each partition—required for
RANK,ROW_NUMBER,LAG/LEAD, and for ordered frames (running sums). -
Direction:
DESCvsASCchanges “who is rank 1” (e.g. largest order first).
ROW_NUMBER()
- Assigns 1, 2, 3, … with no ties—the database breaks ties arbitrarily unless you add more `ORDER BY` keys.
- Use when: You need exactly one row per partition (pick-one tie-break) or deduplication.
RANK()
- Ties get the same rank; the next rank skips: 1, 2, 2, 4.
DENSE_RANK()
- Ties get the same rank, but no gaps: 1, 2, 2, 3.
- Use when: You want “top N tiers” including all ties at the boundary—common for “top two order amounts per merchant.”
LAG / LEAD
- `LAG(col, 1)` = previous row’s value in partition order; `LEAD` = next row’s—used for time between deliveries, session gaps, streak logic.
Worked example
```sql
SELECT dasher_id, delivery_ts,
       delivery_ts - LAG(delivery_ts) OVER (PARTITION BY dasher_id ORDER BY delivery_ts) AS gap
FROM deliveries;
```
Filtering window results: You cannot put `WHERE RANK() = 1` in the same `SELECT` that defines the rank in most databases. Compute the window in a CTE, then `WHERE rnk = 1` on the outside.
Common beginner mistakes
- Omitting `ORDER BY` inside `OVER` when order defines the answer.
- Using `GROUP BY` when the question still needs one output row per input row.
- Using `ROW_NUMBER()` when tied rows should share the same rank—use `RANK` or `DENSE_RANK`.
SQL Interview Question on Top Two Orders per Merchant by Amount
For each merchant_id, return the two highest order_amount rows (ties can both appear). Use PostgreSQL-friendly SQL.
Solution Using DENSE_RANK and a CTE
```sql
WITH ranked AS (
    SELECT *, DENSE_RANK() OVER (PARTITION BY merchant_id ORDER BY order_amount DESC) AS dr
    FROM orders
)
SELECT * FROM ranked WHERE dr <= 2;
```
Why this works: DENSE_RANK() orders rows inside each merchant by order_amount; dr <= 2 keeps the top two distinct amount tiers, so ties at rank 2 are included.
For more pattern drills, browse SQL deep dives.
Practice
- SQL · Topic — Window functions (SQL)
- COMPANY · DoorDash — DoorDash-tagged window functions
- COMPANY · DoorDash — DoorDash-tagged ranking
4. Date functions, arithmetic, and streak-style SQL
Date and Time-Series Handling in SQL for marketplace metrics
Time-series questions almost always mean: “Put each event in a time bucket (day, hour, week), then aggregate.” Delivery and order pipelines make half-open ranges and time zones fair game in a DoorDash SQL interview.
Bucketing with date_trunc (PostgreSQL-style)
- `date_trunc('day', ts)` snaps every timestamp in that calendar day to the same instant (midnight at the start of the day). Common grains: `hour`, `week`, `month`.
- After truncating, `GROUP BY` the truncated value (or cast to `date`) and apply `SUM`, `COUNT`, etc.
Worked example
| order_ts (UTC) | subtotal |
|---|---|
| 2026-01-01 22:00 | 10 |
| 2026-01-02 03:00 | 20 |
`GROUP BY date_trunc('day', order_ts)::date` gives two buckets: 2026-01-01 → 10, 2026-01-02 → 20.
INTERVAL and relative windows
- `CURRENT_TIMESTAMP - INTERVAL '7 days'` expresses sliding windows without hard-coding dates.
- Pair with `WHERE ts >= … AND ts < …` for half-open ranges.
Worked example: “Last 24 hours”: `WHERE event_ts >= NOW() - INTERVAL '24 hours' AND event_ts < NOW()` (use `<= NOW()` instead if you treat “now” itself as inclusive).
Difference between timestamps
- `dropoff_ts - pickup_ts` (PostgreSQL `interval`)—useful for delivery duration. Convert with `EXTRACT(EPOCH FROM …) / 3600` for hours.
Worked example
| pickup_ts | dropoff_ts |
|---|---|
| 2026-01-01 10:00 | 2026-01-01 16:00 |
dropoff_ts - pickup_ts is a 6-hour interval.
Boundaries: half-open vs BETWEEN
- Prefer `ts >= start_ts AND ts < end_ts` so an event exactly on `end_ts` is not double-counted across adjacent windows.
- `BETWEEN a AND b` includes both endpoints—easy to off-by-one with timestamps.
Gaps, sessions, and streaks
- Sort by time, then `LAG(ts)` / `LEAD(ts)` for gaps between consecutive events.
- Streak / islands (conceptual): `order_day - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_day)` is constant within a streak of consecutive calendar days. Full gaps-and-islands is a common follow-up—practice on time-series SQL.
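The “day minus row number is constant within a streak” trick translates directly to plain Python. A sketch (the sample dates are made up):

```python
from itertools import groupby
from datetime import date, timedelta

def longest_streak(days: list[date]) -> int:
    """Length of the longest run of consecutive calendar days."""
    days = sorted(set(days))          # dedupe and order, like DISTINCT + ORDER BY
    if not days:
        return 0
    # key = day - row_number: identical for every day in one consecutive run
    keyed = [(d - timedelta(days=i), d) for i, d in enumerate(days)]
    return max(len(list(g)) for _, g in groupby(keyed, key=lambda kv: kv[0]))

days = [date(2026, 1, 1), date(2026, 1, 2), date(2026, 1, 3), date(2026, 1, 5)]
print(longest_streak(days))  # 3 (Jan 1-3; Jan 5 starts a new island)
```

Walking through this in Python first makes the SQL version (a `ROW_NUMBER` CTE grouped by the difference column) much easier to write under time pressure.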
Common beginner mistakes
- Mixing `DATE` and `TIMESTAMP` without casting.
- Using `BETWEEN` on timestamps when a half-open range is safer.
- Forgetting UTC vs local when filtering “today.”
SQL Interview Question on Orders in the Last Rolling Seven Days
Count completed orders per merchant_id for order_ts in [today-7d, today) (half-open).
Solution Using Range Predicate
```sql
SELECT merchant_id, COUNT(*) AS n_orders
FROM orders
WHERE status = 'completed'
  AND order_ts >= CURRENT_TIMESTAMP - INTERVAL '7 days'
  AND order_ts < CURRENT_TIMESTAMP
GROUP BY merchant_id;
```
Why this works: WHERE filters to the rolling window and completed status before GROUP BY; COUNT(*) counts orders per merchant in that slice.
Practice
- SQL · Topic — Time-series SQL
- COMPANY · DoorDash — DoorDash-tagged date functions
- COMPANY · DoorDash — DoorDash-tagged date arithmetic
Practice time-series SQL patterns after this section—the hub matches many “consecutive day” and cohort-style prompts.
5. Bucketing and CASE expressions for logistics metrics
CASE Expressions and Bucketing in SQL for SLA tiers
CASE maps continuous metrics (minutes late, miles) into ordered buckets interviewers can scan quickly—on-time vs late, distance bands, volume tiers.
CASE for SLA and latency tiers
- Pattern: `CASE WHEN … THEN … WHEN … THEN … ELSE … END`—first matching branch wins; `ELSE` avoids silent `NULL` labels.
Worked example: delivery minutes m → on_time if m <= 30, late if m <= 60, else very_late:
```sql
CASE
    WHEN delivery_minutes <= 30 THEN 'on_time'
    WHEN delivery_minutes <= 60 THEN 'late'
    ELSE 'very_late'
END
```
Test boundaries: is 30 inclusive? Say it explicitly.
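The same bucketing written as a plain Python function makes the boundary decision explicit (here 30 and 60 are inclusive in their lower tiers, matching the `<=` branches above):

```python
def sla_tier(delivery_minutes: float) -> str:
    """Map delivery minutes to an SLA tier; first matching branch wins."""
    if delivery_minutes <= 30:
        return "on_time"
    if delivery_minutes <= 60:
        return "late"
    return "very_late"

# Probing the boundaries answers "is 30 inclusive?" directly.
print([sla_tier(m) for m in (30, 31, 60, 61)])
# ['on_time', 'late', 'late', 'very_late']
```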
Ordered buckets vs percentiles
- Fixed thresholds (SLA minutes) are more common than percentile buckets in interview cards.
- `NTILE(n)` over a partition splits ordered rows into n buckets—useful when the prompt says “quartile” or “decile”; watch ties at bucket edges.
Pitfalls: overlapping conditions and ELSE NULL
- Non-exhaustive `CASE` returns `NULL` for misses—sometimes you want `ELSE 'unknown'` so rows still appear in reports.
- Overlapping `WHEN` ranges are a logic bug; keep branches mutually exclusive unless the prompt says otherwise.
Common beginner mistakes
- Off-by-one on inclusive/exclusive boundaries.
- Putting `CASE` results in `WHERE` without wrapping or using a subquery when the engine evaluates order unexpectedly—often compute `CASE` in `SELECT` or a CTE, then filter.
SQL Interview Question on Segmenting Merchants by Weekly Order Volume
From orders, bucket merchants into low / medium / high for last week: low < 50 orders, medium < 200, else high.
Solution Using CASE on a Pre-Aggregated CTE
```sql
WITH w AS (
    SELECT merchant_id, COUNT(*) AS n
    FROM orders
    WHERE order_ts >= DATE_TRUNC('week', CURRENT_DATE) - INTERVAL '7 days'
      AND order_ts < DATE_TRUNC('week', CURRENT_DATE)
    GROUP BY merchant_id
)
SELECT merchant_id, n,
       CASE
           WHEN n < 50 THEN 'low'
           WHEN n < 200 THEN 'medium'
           ELSE 'high'
       END AS volume_tier
FROM w;
```
Why this works: We aggregate first (counts per merchant in the window), then label each count with CASE—clean separation of “how many” vs “which tier.”
Practice
- SQL · Topic — Bucketing (SQL)
- COMPANY · DoorDash — DoorDash-tagged bucketing
- COMPANY · DoorDash — DoorDash-tagged case expressions
6. Hash maps, sliding windows, and Python patterns for DE screens
Hash Maps and Counting in Python for interview loops
A dict (hash map) maps keys to values. Average-time lookup, insert, and update are O(1) in the usual amortized sense—much faster than rescanning a whole list for every key.
Many DoorDash data engineer Python items are algorithmic but small-input: you implement clear logic under time pressure, not a production service.
Plain dict and .get
- Frequency: `counts[k] = counts.get(k, 0) + 1` avoids `KeyError` on first sighting of `k`.
Worked example
`merchant_ids = ["M1", "M2", "M1"]` — after the loop, `freq` is `{"M1": 2, "M2": 1}`.
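Spelled out, the one-pass loop looks like this:

```python
# One pass over the list; .get supplies 0 the first time a key appears.
merchant_ids = ["M1", "M2", "M1"]
freq: dict[str, int] = {}
for k in merchant_ids:
    freq[k] = freq.get(k, 0) + 1
print(freq)  # {'M1': 2, 'M2': 1}
```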
collections.Counter
- Built for frequency counts: `Counter(iterable)`, `.most_common(k)` for top-k.
Worked example
```python
from collections import Counter

Counter(["late", "ok", "late"]).most_common(1)  # [('late', 2)]
```
defaultdict from collections
- `defaultdict(list)`—`d[order_id].append(event)` mirrors `GROUP BY order_id` “collect all rows in a bucket.”
Why avoid nested scans?
- “For each distinct key, loop the entire list” is O(n²). One pass with a dict is typically O(n)—what interviewers expect.
Worked example: With about 10,000 events and 500 distinct merchants, rescanning all rows per merchant is on the order of millions of operations; one dict pass stays on the order of ten thousand.
Sliding windows in memory
- Keep a running sum and subtract the element leaving the window, or use a deque of indices/times—pattern for “max sum of k consecutive delivery fees” or “events in last N minutes.”
Worked example: Fees [2, 3, 1, 5], k = 2: window sums are 5, 4, 6; max is 6.
When two pointers or binary search show up
- Two pointers often pair with sorted arrays; binary search answers “minimum feasible time/speed” when a predicate is monotonic.
Common beginner mistakes
- Re-scanning the whole list inside a loop over unique keys.
- Off-by-one on window bounds (`left` inclusive vs exclusive).
- Not handling empty input or `k = 0`.
Pro tip: Before you code, restate inputs, output, and edge cases. That structure mirrors how PipeCode’s editorials explain 450+ problems.
Python Interview Question on Two Deliveries in the Same Hour
Given a sorted list of delivery start times (integers, seconds since epoch), return true if any two fall in the same clock hour (use t // 3600 as a simple hour bucket).
Solution Using a Set of Hour Buckets
```python
def same_hour_exists(times: list[int]) -> bool:
    seen = set()
    for t in times:
        h = t // 3600
        if h in seen:
            return True
        seen.add(h)
    return False
```
Why this works: Each timestamp maps to one hour id; if we see the same id twice, two deliveries share that hour. set gives O(1) expected lookup—O(n) overall.
Practice
- PYTHON · Topic — Hash tables & counting
- COMPANY · DoorDash — DoorDash-tagged hash tables
- PYTHON · Topic — Sliding window
- COMPANY · DoorDash — DoorDash-tagged sliding window
Open the full DoorDash Python problem set when you want language-filtered reps.
7. Star schema, grain, and SCD patterns in data engineering
Dimensional Modeling and SCD Patterns for warehouse interviews
Verbal rounds often ask you to sketch a star schema: a fact at the center (e.g. one row per order or per delivery line) and dimensions around it (merchant, customer, Dasher, date). Be crisp about grain—“one fact row represents what event?”
Fact table grain
- Grain is the atomic event the fact row represents. Mixing order-level and order-line-level metrics in one fact without adjustment breaks additivity (you cannot blindly `SUM` across rows).
Worked example: If `fact_orders` is one row per order, `line_item_subtotal` does not belong there unless you allocate or roll up lines first.
Star vs snowflake
- Star: Dimensions are denormalized (wide tables)—fewer joins for analysts.
- Snowflake: Dimensions are normalized (multiple tables)—saves storage, more joins.
- Trade-off: Query simplicity vs normalization; say both names in a design interview.
SCD Type 1 vs Type 2
- Type 1: Overwrite the merchant’s address—no history; past reports would look “wrong” if you re-run them without snapshotting.
- Type 2: New row per change with `valid_from`, `valid_to`, `is_current` (or similar); facts store `merchant_sk` pointing at the version active at order time.
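A hypothetical sketch of the Type 2 lookup in plain Python—two versions of one merchant, and a half-open validity range picking the version active on a given date (all rows and field names here are illustrative, following the text's `valid_from` / `valid_to` / `is_current` convention):

```python
from datetime import date

# Two Type 2 versions of merchant M1; the second is current (made-up data).
dim_merchant = [
    {"merchant_sk": 1, "merchant_id": "M1", "address": "Old St 1",
     "valid_from": date(2025, 1, 1), "valid_to": date(2025, 6, 1), "is_current": False},
    {"merchant_sk": 2, "merchant_id": "M1", "address": "New Ave 9",
     "valid_from": date(2025, 6, 1), "valid_to": date(9999, 12, 31), "is_current": True},
]

def version_at(merchant_id: str, order_date: date) -> dict:
    """Return the dim row active on order_date (half-open: from <= d < to)."""
    return next(r for r in dim_merchant
                if r["merchant_id"] == merchant_id
                and r["valid_from"] <= order_date < r["valid_to"])

print(version_at("M1", date(2025, 3, 15))["address"])  # Old St 1
print(version_at("M1", date(2025, 7, 1))["address"])   # New Ave 9
```

In the warehouse the same lookup happens once at load time: the fact row captures the `merchant_sk` of the then-active version, so history never needs rewriting.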
Bridge tables (many-to-many)
- Promo applies to many merchants; merchant runs many promos—a bridge `merchant_promo(merchant_id, promo_id, …)` at assignment grain is the usual pattern.
Common beginner mistakes
- Snowflaking the fact (multiple grains in one table).
- Forgetting surrogate keys on Type 2 dimensions so facts can point at the right version.
- Confusing operational OLTP schemas with analytic star schemas—interview answers should name grain and keys explicitly.
Deep dives: Data modeling for data engineering interviews and ETL system design.
Data Engineering Interview Question on Tracking Merchant Address History
Explain how you would model merchant addresses so historical orders still show the address at order time while dashboards can list the current address.
Solution Using SCD Type 2 for History and a Current Flag
Store Type 2 rows on dim_merchant with valid_from, valid_to, is_current. Facts carry merchant_sk (surrogate key) tied to the version active when the order was placed. Dashboards filter is_current = true. Optionally maintain a Type 1 “current view” as a separate table or view for simplicity.
Why this works: Type 2 preserves history for facts that already stored the right merchant_sk; is_current gives analysts a single row per merchant for “where are they now?” without rewriting history.
Practice
- SQL · Topic — Dimensional modeling
- SQL · Hub — Data modeling language hub
Tips to crack DoorDash data engineering interviews
Data engineering interview preparation is less about memorizing every LeetCode pattern and more about repeatable lanes: SQL grain, joins, windows, dates, then Python frequency and windows, then a clean verbal on schemas and pipelines.
SQL preparation
Drill WHERE vs HAVING, join fan-out, and window partition keys on paper first. Explain NULL behavior in aggregates before you touch the keyboard.
Pipeline and ETL thinking
Be ready to discuss idempotent loads, late-arriving facts, and backfills at a high level—pair that narrative with ETL topic practice.
Where to practice on PipeCode
| Lane | DoorDash-focused path |
|---|---|
| Company hub | /explore/practice/company/doordash |
| SQL | /explore/practice/company/doordash/sql |
| Python | /explore/practice/company/doordash/python |
| Medium difficulty | /explore/practice/company/doordash/difficulty/medium |
Courses that match this article: SQL for data engineering interviews, Python fundamentals, plus the modeling and ETL links above.
Communication under time pressure
Narrate assumptions (timezone, tie-break, inclusive bounds), sketch grain, then code. Interviewers hire people who debug fan-out and NULL issues calmly—that is how to crack this style of loop without burning time.
Frequently asked questions
What types of DoorDash data engineering interview questions should I expect—SQL, Python, or system design?
Expect heavy SQL (joins, aggregates, windows, dates), Python structured problems (dicts, sliding window, occasional graphs), and at least one design conversation about pipelines, warehouses, or dimensional models. This article mirrors that mix; use the DoorDash practice hub to align reps with those lanes.
How hard are DoorDash SQL interviews compared to other large tech companies?
Difficulty is broadly comparable to other marketplace and consumer-tech data engineering loops: medium SQL with a few edge-case checks is common. PipeCode tags 63 DoorDash problems across easy, medium, and hard—use that spread instead of guessing from forums alone.
Do DoorDash data engineer interviews focus on Python algorithms or data-structure problems?
You should expect some algorithmic Python (hash maps, intervals, sorting, graphs) in addition to data-wrangling intuition. It is not identical to a pure SWE loop, but brushing up on patterns in the DoorDash Python filter pays off.
How do I prepare for DoorDash data modeling or warehouse design questions?
Practice star schema, fact grain, and SCD Type 1 vs 2 out loud; sketch surrogate keys and how facts reference dimensions. Pair verbal reps with dimensional modeling drills and the data modeling course linked above.
How many DoorDash-tagged practice problems should I finish before the loop?
There is no magic number, but 30–50 mixed problems across SQL, Python, and at least a few modeling prompts is a solid data engineering interview preparation cadence if you explain each solution after you submit.
Where can I practice DoorDash-style SQL and Python with tests and feedback?
PipeCode runs 450+ interview-style problems with tests and AI feedback; start at /explore/practice/company/doordash for company-tagged sets, then branch into topic pages for weak spots.
Start practicing DoorDash data engineering problems
PipeCode pairs company-tagged DoorDash problems with tests and feedback so you move from reading patterns to typing your own solutions.
Pipecode.ai is Leetcode for Data Engineering



