Amazon data engineering interview questions lean on SQL for operational and retail-style analytics—aggregates, joins, window functions, and date logic—and Python for in-memory transforms (hash maps, arrays, nested structures). Harder loops often add verbal data modeling (star schemas, slowly changing dimensions, fact grain).
Everything below uses original teaching examples only—not proprietary interview items. Narrate grain, join semantics, and NULL behavior out loud; interviewers grade that reasoning as much as the final query or sketch.
Top Amazon data engineering interview topics
| # | Topic | Why it shows up |
|---|---|---|
| 1 | Aggregation & GROUP BY / HAVING | Revenue and volume by region, SKU, or customer segment. |
| 2 | Joins & anti-joins | Enrich facts, find non-buyers, avoid join fan-out. |
| 3 | Window functions | Top-N per group, ranks, LAG / LEAD, running metrics. |
| 4 | Dates & time boundaries | Shipped-within-N-days, yearly filters, buckets. |
| 5 | CASE & percentages | SLA tiers, rejection rates, safe ratios. |
| 6 | Python: dicts, sets, arrays | Frequency, sorting, light ETL-style transforms. |
| 7 | Star schema & SCDs | Verbal warehouse design at the DE bar. |
SQL evaluation order (mental model): `FROM` / joins → `WHERE` (row filter) → `GROUP BY` → aggregates → `HAVING` (group filter) → windows → `SELECT` / `ORDER BY`. If you are filtering a sum or count, that condition almost always belongs in `HAVING`, not `WHERE`.
1. Aggregation and GROUP BY concepts in Amazon-style SQL
Aggregation and GROUP BY in SQL for data engineering
Picture a table with many detail rows—for example one row per order line or package scan. Aggregation means: turn lots of rows into one summary value (or a few values) per bucket. The bucket is whatever the prompt names: per seller, per FC (fulfillment center), per day, per postal code, and so on.
GROUP BY defines the bucket: “Put all rows with the same seller_id together,” or “the same (fc_id, day) together.” Each distinct combination of GROUP BY columns is one group; the engine runs SUM, COUNT, AVG, … inside each group separately.
Report grain: After the query, ask: “One output row represents what?” If the question asked revenue per region, the grain is region. If you accidentally group at order-line grain when the prompt wanted per order, your SUM can be right for lines but wrong for the business sentence.
SUM, AVG, and NULL
- `SUM(col)` adds non-null numerics in the group. `NULL` cells are skipped (not treated as 0). If every value is `NULL`, `SUM` is usually `NULL`, not 0—say that in the interview when edge cases matter.
- `AVG(col)` is the sum of non-null values ÷ the count of non-null values. Missing prices do not enter the numerator or denominator.
Worked example: In one group, amount values 10, NULL, 30 → SUM(amount) = 40, AVG(amount) = 20 (only two values averaged).
COUNT(*) vs COUNT(col)
- `COUNT(*)` counts rows in the group, including rows where some columns are `NULL`.
- `COUNT(col)` counts rows where `col` is not `NULL`.
- `COUNT(DISTINCT col)` counts unique non-null values—essential after joins when you must count buyers, not multiplied rows.
Worked example: Three rows in a group, two with non-null amount → COUNT(*) = 3, COUNT(amount) = 2.
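Both NULL rules above are easy to verify with Python's built-in `sqlite3` module (the table and column names here are made up for the demo):

```python
import sqlite3

# In-memory SQLite session; one group with amounts 10, NULL, 30.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE g (amount REAL)")
conn.executemany("INSERT INTO g VALUES (?)", [(10,), (None,), (30,)])

total, avg, n_rows, n_amount = conn.execute(
    "SELECT SUM(amount), AVG(amount), COUNT(*), COUNT(amount) FROM g"
).fetchone()
print(total, avg, n_rows, n_amount)  # 40.0 20.0 3 2
```

The NULL row is skipped by `SUM` and `AVG`, counted by `COUNT(*)`, and skipped again by `COUNT(amount)`.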
MIN / MAX
Work on orderable types (numbers, dates, timestamps). NULLs are skipped. Use MAX(order_ts) for “latest order time” per bucket.
WHERE vs HAVING
- `WHERE` filters rows before grouping—predicates on raw columns only.
- `HAVING` filters groups after aggregates—predicates on `SUM`, `COUNT`, `AVG`, etc.
Worked example — same orders grain, two filters:
| order_id | merchant_id | subtotal |
|---|---|---|
| 1 | M1 | 40 |
| 2 | M1 | 70 |
| 3 | M1 | 20 |
| 4 | M2 | 200 |
`WHERE subtotal > 30` removes row 3 before grouping. Then `GROUP BY merchant_id` with `SUM(subtotal)` gives M1 = 110, M2 = 200. `HAVING SUM(subtotal) > 100` on the full table gives M1 = 130, M2 = 200—both groups pass.
Rule of thumb: If the condition uses `SUM` / `COUNT` / `AVG` of the group, it belongs in `HAVING`. If it only uses this row's columns, use `WHERE` first (and prefer it for performance).
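A quick `sqlite3` sketch confirms both numbers from the merchant table above (sample rows copied from the worked example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, merchant_id TEXT, subtotal INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "M1", 40), (2, "M1", 70), (3, "M1", 20), (4, "M2", 200)])

# WHERE drops row 3 before grouping.
where_result = dict(conn.execute(
    "SELECT merchant_id, SUM(subtotal) FROM orders "
    "WHERE subtotal > 30 GROUP BY merchant_id"))
# HAVING filters whole groups after aggregation.
having_result = dict(conn.execute(
    "SELECT merchant_id, SUM(subtotal) FROM orders "
    "GROUP BY merchant_id HAVING SUM(subtotal) > 100"))
print(where_result)   # {'M1': 110, 'M2': 200}
print(having_result)  # {'M1': 130, 'M2': 200}
```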
Half-open dates and revenue per zip
Worked example: orders(order_id, zip_code, amount, order_date) — 2020 revenue per zip:
```sql
SELECT zip_code, SUM(amount) AS revenue
FROM orders
WHERE order_date >= DATE '2020-01-01' AND order_date < DATE '2021-01-01'
GROUP BY zip_code;
```
Takeaway: [2020-01-01, 2021-01-01) avoids double-counting a midnight boundary. The result grain is one row per zip.
Conditional aggregation (CASE inside aggregates)
Idea: Count or sum only some rows in the group without splitting into multiple queries.
Worked example — lines per FC by disposition:
| fc_id | status |
|---|---|
| A | shipped |
| A | pending |
| A | shipped |
```sql
SELECT fc_id,
       COUNT(*) AS total_lines,
       SUM(CASE WHEN status = 'shipped' THEN 1 ELSE 0 END) AS shipped_lines
FROM fc_lines
GROUP BY fc_id;
```
Result: fc A → total_lines = 3, shipped_lines = 2.
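The same result can be reproduced in an in-memory SQLite database (illustrative setup):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fc_lines (fc_id TEXT, status TEXT)")
conn.executemany("INSERT INTO fc_lines VALUES (?, ?)",
                 [("A", "shipped"), ("A", "pending"), ("A", "shipped")])

# CASE inside SUM counts only the shipped rows, no second query needed.
row = conn.execute("""
    SELECT fc_id,
           COUNT(*) AS total_lines,
           SUM(CASE WHEN status = 'shipped' THEN 1 ELSE 0 END) AS shipped_lines
    FROM fc_lines
    GROUP BY fc_id
""").fetchone()
print(row)  # ('A', 3, 2)
```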
Common beginner mistakes
- Putting `SUM(x) > 10` in `WHERE` instead of `HAVING`.
- Selecting a non-aggregated column that is not in `GROUP BY` (invalid in strict SQL).
- Joining one-to-many before `GROUP BY` and double-counting a dimension field—fix grain or dedupe first.
- Using `COUNT(*)` when the question wanted distinct people—use `COUNT(DISTINCT user_id)`.
SQL Interview Question on Aggregation by Region
Table: orders(order_id, region, amount, is_refund). Return regions where net revenue (sum of amount excluding refunds) is strictly positive, ordered by net revenue descending.
Solution Using GROUP BY and Conditional Sum
```sql
SELECT region,
       SUM(CASE WHEN NOT is_refund THEN amount ELSE 0 END) AS net_rev
FROM orders
GROUP BY region
HAVING SUM(CASE WHEN NOT is_refund THEN amount ELSE 0 END) > 0
ORDER BY net_rev DESC;
```
Why this works: Refunds are zeroed in the CASE; HAVING applies to the per-region aggregate, not individual rows.
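To see the query behave on data, here is a `sqlite3` run with invented sample rows (one refund, one zero-net region) so both the CASE and the HAVING filter fire:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL, is_refund INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, "west", 100, 0), (2, "west", 50, 1),  # refund zeroed by CASE
    (3, "east", 0, 0),                        # zero net -> dropped by HAVING
])
rows = conn.execute("""
    SELECT region,
           SUM(CASE WHEN NOT is_refund THEN amount ELSE 0 END) AS net_rev
    FROM orders
    GROUP BY region
    HAVING SUM(CASE WHEN NOT is_refund THEN amount ELSE 0 END) > 0
    ORDER BY net_rev DESC
""").fetchall()
print(rows)  # [('west', 100.0)]
```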
SQL · Topic — Aggregation problems (all companies)
COMPANY · Amazon — aggregation Amazon-tagged aggregation
COMPANY · Amazon — hub Amazon SQL (all difficulties)
When you want timed medium SQL, open Amazon SQL (medium).
2. Joins, anti-joins, and deduplication in SQL
Joins and deduplication in SQL for data engineering
A join answers: “For each row in table A, which rows in table B belong with it?” You declare that with ON a.key = b.key. In analytics you use joins to enrich events (orders, clicks, shipments) with reference data (customers, SKUs, fulfillment centers).
Fact vs dimension and grain (how data engineers talk about joins)
- Fact table: Many rows—often one row per event or transaction (order line, click, package scan)—the thing you `SUM` or `COUNT`.
- Dimension table: Fewer rows—one row per entity you describe (customer, product, warehouse)—the thing you `JOIN` to get names, categories, regions.
- Grain: "One result row per what?" (per order, per customer per day, per ASIN). Wrong grain after a join usually means fan-out (covered below): you aggregated at order level but summed a customer attribute that repeated after the join.
Worked example — fact + dimension: order_lines (fact: order_id, sku, qty) joins products (dimension: sku, list_price, title). SUM(qty) by sku is correct at line grain. If products accidentally had duplicate rows per sku, one join could inflate SUM(list_price) when you sum price alongside lines—fix the dimension so there is exactly one row per sku before you trust price totals.
Inner join (JOIN / INNER JOIN)
- Result: Only rows where the `ON` condition succeeds on both sides.
- Use when: You only care about matched pairs—e.g. clicks that resolve to a known catalog row.
- `NULL` keys: `NULL = NULL` is not true in join logic, so inner joins drop rows with a null join key unless you `COALESCE` or filter explicitly.
Worked example — clicks and ads:
| click_id | user_id | ad_id |
|---|---|---|
| c1 | u1 | 10 |
| c2 | u1 | 99 |
ads
| ad_id | brand_id |
|---|---|
| 10 | 7 |
| 11 | 7 |
FROM clicks c JOIN ads a ON c.ad_id = a.ad_id returns one row: (c1, u1, 10, 7). Click c2 has ad_id = 99 with no ad row, so it disappears from an inner join.
Left outer join (LEFT JOIN)
- Result: Every row from the left table appears once per matching right row; if there is no match, right-side columns are `NULL`.
- Use when: You must keep all left entities even if they have zero events—e.g. all customers including those with no orders.
Worked example — users and orders:
users
| id | name |
|---|---|
| 1 | Ann |
| 2 | Bo |
orders
| id | user_id | amount |
|---|---|---|
| o1 | 1 | 50 |
FROM users u LEFT JOIN orders o ON u.id = o.user_id yields two rows: Ann + o1, and Bo + NULL (no order—right side null-padded).
Right outer join (RIGHT JOIN)
- Result: Mirror of `LEFT JOIN`: keep all rows from the right table; left columns `NULL` when unmatched.
- In practice: Rarely needed—swap tables and use `LEFT JOIN` so "keep all rows from the table I care about" always reads the same way.
Worked example: Same users and orders as above. FROM orders o RIGHT JOIN users u ON u.id = o.user_id returns the same two logical rows as FROM users u LEFT JOIN orders o …—only preferred style differs.
Full outer join (FULL OUTER JOIN)
- Result: Keeps rows that match plus unmatched rows from either side (missing side filled with `NULL`).
- Use when: You need a symmetric "everything from A and B" merge; shows up in reconciliation-style questions.
Worked example: If legacy has ids 1, 2 and new has 2, 3, a full outer join on id returns three logical keys 1, 2, 3 with NULLs where one side is missing.
Cross join (CROSS JOIN)
- Result: Cartesian product: every row of A paired with every row of B.
- Use when: You intentionally need all combinations (e.g. stores × quarters grid)—easy to explode row count; use sparingly.
Worked example: stores has 2 rows, quarters has 4 → CROSS JOIN produces 8 rows (every store × every quarter).
Semi-join vs anti-join (names, not syntax)
- Semi-join: "Keep rows from A where some match exists in B"—typically `WHERE EXISTS (SELECT 1 FROM B WHERE …)` or `WHERE key IN (SELECT key FROM B)`. You do not add B's columns to the result; you only filter A.
- Anti-join: "Keep rows from A with no match in B"—`LEFT JOIN B … WHERE B.key IS NULL` or `WHERE NOT EXISTS (SELECT 1 FROM B WHERE …)`. Same idea: filter A, not a full merge for reporting.
Worked example — customers has ids 1, 2, 3; table vip has customer_id = 2 only.
- Semi-join: `SELECT id FROM customers c WHERE EXISTS (SELECT 1 FROM vip v WHERE v.customer_id = c.id)` → 2.
- Anti-join: `SELECT c.id FROM customers c LEFT JOIN vip v ON v.customer_id = c.id WHERE v.customer_id IS NULL` → 1 and 3.
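Both patterns run unchanged in SQLite; a small `sqlite3` sketch with the same ids:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER)")
conn.execute("CREATE TABLE vip (customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,), (3,)])
conn.execute("INSERT INTO vip VALUES (2)")

# Semi-join: filter customers to those with a VIP match.
semi = [r[0] for r in conn.execute(
    "SELECT id FROM customers c WHERE EXISTS "
    "(SELECT 1 FROM vip v WHERE v.customer_id = c.id) ORDER BY id")]
# Anti-join: customers with no VIP row.
anti = [r[0] for r in conn.execute(
    "SELECT c.id FROM customers c LEFT JOIN vip v ON v.customer_id = c.id "
    "WHERE v.customer_id IS NULL ORDER BY c.id")]
print(semi, anti)  # [2] [1, 3]
```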
Fan-out (row multiplication)
If the join key is not unique on one side (one user, many orders), the “one” side’s columns repeat on every matching row. That is correct for order-grain analysis.
Bug pattern: Summing a user-level numeric column after the join counts that user’s balance multiple times. Fixes: aggregate first, or make the dimension one row per key before joining. SUM(DISTINCT …) rarely fixes business logic—prefer COUNT(DISTINCT user_id) or a subquery that dedupes.
Worked example:
users
| id | credit_balance |
|---|---|
| 1 | 500 |
orders
| user_id |
|---|
| 1 |
| 1 |
After users JOIN orders, credit_balance appears twice (500, 500). SUM(u.credit_balance) would add 1000, wrong for “total credit in the system.” Fix: sum order amounts, or MAX(u.credit_balance) per user if you only need the balance once.
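The fan-out bug and its fix can be demonstrated in a few lines of `sqlite3` (same toy rows as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, credit_balance INTEGER)")
conn.execute("CREATE TABLE orders (user_id INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 500)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (1,)])

# Fan-out: the single user row repeats once per matching order.
(bad,) = conn.execute(
    "SELECT SUM(u.credit_balance) FROM users u JOIN orders o ON o.user_id = u.id"
).fetchone()
# Fix: take the balance once per user, then sum.
(good,) = conn.execute(
    "SELECT SUM(bal) FROM (SELECT MAX(credit_balance) AS bal FROM users GROUP BY id)"
).fetchone()
print(bad, good)  # 1000 500
```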
Deduplication
- `SELECT DISTINCT col` (or distinct on several columns) removes duplicate rows—fine when any duplicate row is interchangeable.
- Latest row per id: Prefer `ROW_NUMBER() OVER (PARTITION BY id ORDER BY time DESC) AS rn` in a CTE, then `WHERE rn = 1`, so you control ties and ordering.
- `DISTINCT ON (user_id)` (PostgreSQL) is a compact alternative when the dialect allows it.
Worked example — DISTINCT:
| visit_id | user_id |
|---|---|
| v1 | u1 |
| v2 | u1 |
| v3 | u2 |
SELECT DISTINCT user_id FROM visits → u1, u2.
Worked example — latest event per user:
| user_id | ts | page |
|---|---|---|
| u1 | t2 | /home |
| u1 | t1 | /cart |
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) assigns rn = 1 to (u1, t2, /home) and rn = 2 to (u1, t1, /cart); WHERE rn = 1 keeps only the latest row.
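Assuming a SQLite build with window-function support (3.25+), the dedup pattern looks like this, with integer timestamps standing in for t1/t2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts INTEGER, page TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [("u1", 2, "/home"), ("u1", 1, "/cart")])

# ROW_NUMBER per user, newest first; keep rn = 1 for the latest event.
rows = conn.execute("""
    SELECT user_id, ts, page
    FROM (
        SELECT user_id, ts, page,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
        FROM events
    )
    WHERE rn = 1
""").fetchall()
print(rows)  # [('u1', 2, '/home')]
```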
Worked example — DISTINCT ON (PostgreSQL):
```sql
SELECT DISTINCT ON (user_id) user_id, ts, page
FROM events
ORDER BY user_id, ts DESC;
```
DISTINCT ON (user_id) requires ORDER BY to start with the same columns; the first row per user_id in that order is kept. For tie-breaking and clarity, ROW_NUMBER() in a CTE is often easier in interviews unless the prompt says PostgreSQL.
Common beginner mistakes
- Joining on the wrong key or ignoring `NULL` keys (they do not match in an inner join).
- Assuming one row per user after a join when the join multiplied rows.
- Using `COUNT(*)` when the question wanted distinct people—use `COUNT(DISTINCT user_id)`.
Joins-Based Interview Question on Distinct Users and Brands
Tables: clicks(user_id, ad_id) and ads(ad_id, brand_id). How many distinct users clicked any ad for brand_id = 7?
Solution Using INNER JOIN and COUNT DISTINCT
```sql
SELECT COUNT(DISTINCT c.user_id) AS distinct_users
FROM clicks c
JOIN ads a ON a.ad_id = c.ad_id
WHERE a.brand_id = 7;
```
Why this works: The join finds clicks whose ad belongs to brand 7; COUNT(DISTINCT user_id) counts people, not click rows.
Joins-Based Interview Question on Customers With No Orders
Tables: customers(customer_id), orders(order_id, customer_id). Return customer_id for customers with no orders.
Solution Using LEFT JOIN and NULL Filter
```sql
SELECT c.customer_id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_id IS NULL;
```
Why this works: Unmatched rows still appear from customers; orders columns are NULL, which we filter—classic anti-join pattern.
SQL · Topic — Joins & deduplication
COMPANY · Amazon — joins Amazon-tagged joins
COMPANY · Amazon — subqueries Amazon-tagged subqueries
3. Window functions and ranking in SQL
Window functions and ranking in SQL for analytics loops
GROUP BY collapses many rows into one row per group. Window functions keep every input row and attach a value computed over a window of rows—usually “all rows in the same partition that sort before or with this row.”
Parts of a window: PARTITION BY (optional) splits the table into independent mini-tables—like “per customer” or “per FC per day.” ORDER BY inside OVER (...) defines sequence within the partition. ROWS BETWEEN … (optional) trims which neighbors count for running aggregates.
When to reach for windows: Top-N per group, gaps between events (LAG / LEAD), running revenue, share of partition (SUM(x) OVER (...) / SUM(SUM(x)) in a subquery pattern), deduping with ROW_NUMBER() (see §2).
Ranking: ROW_NUMBER, RANK, DENSE_RANK
All three need ORDER BY inside OVER.
- `ROW_NUMBER()`: Unique integer per row in the partition after sorting—no ties; the engine picks an arbitrary order among equal keys unless you add tie-break columns.
- `RANK()`: Tied rows get the same rank; the next rank skips (1, 2, 2, 4).
- `DENSE_RANK()`: Tied rows get the same rank; no gaps (1, 2, 2, 3).
Worked example — three employees in dept 10 with salaries 90k, 80k, 80k (sorted desc):
| emp | salary | ROW_NUMBER | RANK | DENSE_RANK |
|---|---|---|---|---|
| A | 90 | 1 | 1 | 1 |
| B | 80 | 2 | 2 | 2 |
| C | 80 | 3 | 2 | 2 |
Use ROW_NUMBER when the prompt says “pick one” and gives a deterministic tie-break (e.g. lowest user_id). Use DENSE_RANK when ties should share a bucket (e.g. “top three distinct salary tiers”).
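Here is a runnable comparison of the three functions on the same salary data (requires SQLite 3.25+ for window functions; names are the toy rows above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emps (emp TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emps VALUES (?, ?)",
                 [("A", 90), ("B", 80), ("C", 80)])

rows = conn.execute("""
    SELECT emp,
           ROW_NUMBER() OVER (ORDER BY salary DESC, emp) AS row_num,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
    FROM emps
    ORDER BY salary DESC, emp
""").fetchall()
print(rows)  # [('A', 1, 1, 1), ('B', 2, 2, 2), ('C', 3, 2, 2)]
```

Note the tie-break column `emp` inside ROW_NUMBER's ORDER BY: without it, the 2 vs 3 assignment between B and C would be arbitrary.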
LAG / LEAD
LAG(col, n) looks n rows back in the partition order; LEAD looks forward. Great for time between events, sessionization, and YoY same-row comparisons.
Worked example — gap between orders:
| customer_id | order_ts | prev_ts |
|---|---|---|
| 1 | t1 | NULL |
| 1 | t3 | t1 |
LAG(order_ts) OVER (PARTITION BY customer_id ORDER BY order_ts) on row (1, t3) yields t1; first row per customer gets NULL lag.
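The same LAG behavior, with integers standing in for t1/t3 (SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_ts INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1), (1, 3)])

# LAG looks one row back within each customer's time order.
rows = conn.execute("""
    SELECT customer_id, order_ts,
           LAG(order_ts) OVER (PARTITION BY customer_id ORDER BY order_ts) AS prev_ts
    FROM orders
    ORDER BY order_ts
""").fetchall()
print(rows)  # [(1, 1, None), (1, 3, 1)] -- first row has NULL lag
```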
Running totals
Worked example: Cumulative amount per seller_id in order_ts order:
```sql
SELECT seller_id, order_ts, amount,
       SUM(amount) OVER (
           PARTITION BY seller_id
           ORDER BY order_ts
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM seller_orders;
```
(If the dialect defaults frame correctly for running sum, the explicit ROWS clause is optional—but stating it shows you understand frame rules.)
Common beginner mistakes
- Omitting `ORDER BY` inside `OVER (…)`—ranking and `LAG` become meaningless or nondeterministic.
- Using `ROW_NUMBER` when multiple rows must tie for "first place" under the business rule.
- Forgetting `PARTITION BY` and accidentally ranking the whole table as one group.
SQL Interview Question on Top Spender per Region
Table: spend(user_id, region, amt). Return the user_id with the highest amt in each region (break ties by lowest user_id).
Solution Using ROW_NUMBER
```sql
SELECT user_id, region, amt
FROM (
    SELECT user_id, region, amt,
           ROW_NUMBER() OVER (
               PARTITION BY region ORDER BY amt DESC, user_id ASC
           ) AS rn
    FROM spend
) t
WHERE rn = 1;
```
Why this works: ROW_NUMBER assigns a unique order per partition; rn = 1 is the chosen winner per region under the stated tie-break.
SQL Interview Question on Consecutive Order Gap
Table: orders(order_id, customer_id, order_date) with order_date as a DATE. Return orders where the gap in whole days from this customer’s previous order is strictly greater than 7. Skip customers’ first order (no predecessor).
Solution Using LAG
```sql
SELECT order_id, customer_id, order_date, prev_date,
       order_date - prev_date AS gap_days
FROM (
    SELECT order_id, customer_id, order_date,
           LAG(order_date) OVER (
               PARTITION BY customer_id ORDER BY order_date
           ) AS prev_date
    FROM orders
) t
WHERE prev_date IS NOT NULL
  AND order_date - prev_date > 7;
```
Why this works: For DATE, subtraction yields an integer day count in PostgreSQL-style SQL. LAG lines up each row with the prior order_date per customer_id; prev_date IS NOT NULL removes first orders.
SQL · Topic — Window functions (SQL)
COMPANY · Amazon — windows Amazon-tagged window functions
For more pattern drills, browse SQL deep dives.
4. Date functions and time-series boundaries in SQL
Date arithmetic and buckets in SQL for operations metrics
Operational questions—“shipped within N days,” “last 30 days of clicks,” “revenue by calendar month”—are really three skills: typing (DATE vs TIMESTAMP), time zones (state assumptions), and inclusive vs exclusive bounds.
Half-open ranges
Prefer [start, end): WHERE ts >= start AND ts < end. That avoids double-counting events exactly at midnight on a boundary and mirrors how many warehouses partition data.
Calendar buckets with DATE_TRUNC
Worked example (PostgreSQL): Truncate click timestamps to UTC day:
```sql
SELECT DATE_TRUNC('day', click_ts AT TIME ZONE 'UTC') AS day_utc, COUNT(*) AS clicks
FROM clicks
GROUP BY 1;
```
Use week, month, quarter the same way—then GROUP BY the truncated value.
Day differences and thresholds
Worked example: If order_date and ship_date are DATE, then ship_date - order_date <= 3 means “shipped within three calendar days” including same-day ship as 0—confirm with the interviewer whether same-day counts as “within 1 day” or “within 0 days.”
INTERVAL arithmetic on TIMESTAMP is dialect-specific; say whether you assume UTC, warehouse local, or event local.
Common beginner mistakes
- Inclusive end bounds on timestamps: `BETWEEN '2020-01-01' AND '2020-12-31'` silently drops most of December 31 once the column is a timestamp, and extending the end to `'2021-01-01'` double-counts midnight of January 1—prefer half-open ranges.
- Mixing `TIMESTAMP` and `DATE` without `CAST`.
- Forgetting leap years or DST when comparing "exactly 24 hours" vs calendar days.
SQL Interview Question on Revenue in a Calendar Year
Table: orders(order_id, order_date, zip_code, amount). Return the zip_code with the largest total amount in calendar year 2020.
Solution Using WHERE Range and GROUP BY
```sql
SELECT zip_code, SUM(amount) AS revenue
FROM orders
WHERE order_date >= DATE '2020-01-01' AND order_date < DATE '2021-01-01'
GROUP BY zip_code
ORDER BY revenue DESC
LIMIT 1;
```
Why this works: The date predicate fixes the year grain; GROUP BY zip rolls up to the question’s unit; LIMIT 1 picks the top (ties need a tie-break rule—ask).
SQL Interview Question on Rolling Activity Window
Table: events(user_id, event_ts) (TIMESTAMP). Return user_id values that had at least one event in the last 7 full days through CURRENT_DATE (exclusive of today’s partial day), using date arithmetic.
Solution Using Date Cast and Half-Open Bound
```sql
SELECT DISTINCT user_id
FROM events
WHERE event_ts >= (CURRENT_DATE - INTERVAL '7 days')
  AND event_ts < CURRENT_DATE;
```
Why this works: event_ts < CURRENT_DATE drops anything from today after midnight if you interpret “through yesterday” as the business rule; adjust if the prompt wants inclusive today. The lower bound is seven days before CURRENT_DATE at midnight.
SQL · Topic — Time-series SQL
COMPANY · Amazon — dates Amazon-tagged date functions
Practice time-series SQL patterns after this section—the hub matches many “consecutive day” and cohort-style prompts.
5. Conditional logic and percentage metrics in SQL
CASE expressions and ratios in SQL for DE interviews
CASE has two shapes: simple (CASE col WHEN val THEN …) and searched (CASE WHEN predicate THEN … END). In interviews, searched form is more common because you can express ranges and NULL checks.
CASE in SELECT: Derive tier labels (SLA buckets, price bands). CASE inside aggregates: Count or sum only rows matching a condition—same idea as conditional aggregation in §1.
Percentages: Always CAST or multiply by 1.0 so you do not hit integer division (3/10 → 0 in many engines). Guard the denominator with NULLIF(denominator, 0) so you return NULL instead of a runtime error.
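Both pitfalls are one query away in `sqlite3`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Integer / integer truncates; * 1.0 promotes to float; NULLIF guards zero.
trunc, safe, guarded = conn.execute(
    "SELECT 3 / 10, 3 * 1.0 / 10, 1.0 / NULLIF(0, 0)"
).fetchone()
print(trunc, safe, guarded)  # 0 0.3 None
```

Dividing by `NULLIF(0, 0)` yields NULL (Python `None`) rather than a divide-by-zero error.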
Rejection rate pattern (portable)
Worked example — per role_id, share of rows that are rejected among decided applications (rejected or accepted):
```sql
SELECT role_id,
       SUM(CASE WHEN status = 'rejected' THEN 1 ELSE 0 END) * 1.0
       / NULLIF(
           SUM(CASE WHEN status IN ('rejected', 'accepted') THEN 1 ELSE 0 END),
           0
         ) AS rejection_rate
FROM applications
GROUP BY role_id;
```
Numerator and denominator both ignore pending rows by giving them 0 in their respective CASE arms.
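An in-memory check of this portable form, with invented rows including an all-pending role to exercise the `NULLIF` guard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE applications (id INTEGER, role_id INTEGER, status TEXT)")
conn.executemany("INSERT INTO applications VALUES (?, ?, ?)", [
    (1, 10, "rejected"), (2, 10, "accepted"), (3, 10, "pending"),  # pending ignored
    (4, 20, "pending"),                                            # no decided rows
])
rows = conn.execute("""
    SELECT role_id,
           SUM(CASE WHEN status = 'rejected' THEN 1 ELSE 0 END) * 1.0
           / NULLIF(SUM(CASE WHEN status IN ('rejected', 'accepted') THEN 1 ELSE 0 END), 0)
             AS rejection_rate
    FROM applications
    GROUP BY role_id
    ORDER BY role_id
""").fetchall()
print(rows)  # [(10, 0.5), (20, None)] -- all-pending role returns NULL, not an error
```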
Common beginner mistakes
- Dividing two integers and getting 0 everywhere.
- Using `COUNT(*)` as the denominator when the question wanted only completed stages.
- Returning a percentage as 0–1 when the interviewer asked for 0–100—state units.
SQL Interview Question on Rejection Rate by Role
Table: applications(id, role_id, status) with status in ('rejected','accepted',...). Return each role_id and rejection rate among terminal rows where status is either rejected or accepted.
Solution Using FILTER (PostgreSQL)
```sql
SELECT role_id,
       COUNT(*) FILTER (WHERE status = 'rejected')::float
       / NULLIF(COUNT(*) FILTER (WHERE status IN ('rejected', 'accepted')), 0)
         AS rejection_rate
FROM applications
GROUP BY role_id;
```
Why this works: FILTER is concise PostgreSQL syntax; numerator and denominator both restrict to decided applications; NULLIF avoids divide-by-zero. For engines without FILTER, use the SUM(CASE …) form above.
SQL Interview Question on Click-Through Rate by Campaign
Tables: impressions(campaign_id, user_id) and clicks(campaign_id, user_id) (one row per user per campaign in each table if they saw / clicked). Return each campaign_id and ctr = clicks / impressions among users who had an impression; use FLOAT (or REAL) division.
Solution Using LEFT JOIN and Conditional Counts
```sql
SELECT i.campaign_id,
       SUM(CASE WHEN c.user_id IS NOT NULL THEN 1 ELSE 0 END) * 1.0
       / NULLIF(COUNT(DISTINCT i.user_id), 0) AS ctr
FROM impressions i
LEFT JOIN clicks c
  ON c.campaign_id = i.campaign_id AND c.user_id = i.user_id
GROUP BY i.campaign_id;
```
Why this works: Grain is one row per (campaign, user) on the left; LEFT JOIN attaches a click if any; SUM(CASE WHEN c.user_id IS NOT NULL …) counts distinct click matches per impression row—here one row per user so it equals “clicked or not.” COUNT(DISTINCT i.user_id) is the impression denominator. If duplicate impression rows exist, dedupe first or use COUNT(*) vs DISTINCT consistently with the prompt.
SQL · Topic — Percentage SQL
COMPANY · Amazon — conditional logic Amazon-tagged conditional logic
COMPANY · Amazon — percentage Amazon-tagged percentage calculation
6. Hash maps, arrays, and ETL-style Python for DE screens
Dicts, sets, and counting in Python for data processing
Amazon Python screens often look like small ETL: group records, aggregate in memory, join two lists by id, or validate uniqueness. The winning move is usually dict / defaultdict / Counter / set so you stay O(n) instead of O(n²) nested scans.
When to use what
- `set`: Membership tests (`x in seen`), dedupe keys, "have I seen this?" while scanning once.
- `dict`: Map id → object, id → running sum, id → last timestamp—your in-memory join key.
- `collections.Counter`: Frequencies of hashable items; `Counter.most_common(k)` for top-k.
- `collections.defaultdict(list)`: `group_by` patterns—append all rows for each key, then post-process.
Frequency with Counter
Worked example: skus = ["A","B","A","A"] → Counter(skus) → Counter({'A': 3, 'B': 1}).
Complement set / “missing keys”
Worked example: Given all_ids and completed_ids, set(all_ids) - set(completed_ids) is the anti-join of ids—same logic as LEFT JOIN … IS NULL in SQL.
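Both idioms in one runnable snippet (the sample lists are illustrative):

```python
from collections import Counter

# Frequency count and top-k in two lines.
skus = ["A", "B", "A", "A"]
counts = Counter(skus)
print(counts.most_common(1))  # [('A', 3)]

# Set difference as an anti-join on ids: which ids never completed?
all_ids = [1, 2, 3, 4]
completed_ids = [2, 4]
missing = set(all_ids) - set(completed_ids)
print(sorted(missing))  # [1, 3]
```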
Python Interview Question on First Duplicate ID
Input: ids: list[int]. Return the first value that appears twice while scanning left-to-right; if none, return None.
Solution Using a Set
```python
def first_duplicate(ids: list[int]) -> int | None:
    seen: set[int] = set()
    for x in ids:
        if x in seen:
            return x
        seen.add(x)
    return None
```
Why this works: The set holds earlier elements; the first repeat is detected on the second occurrence.
Python Interview Question on Merge SKU Quantities
Input: Two lists of dicts loads_a and loads_b, each like {"sku": str, "qty": int}. Return a list of dicts with one entry per sku and qty equal to the sum of quantities from both lists. If a sku appears only once, still include it.
Solution Using defaultdict
```python
from collections import defaultdict

def merge_skus(loads_a: list[dict], loads_b: list[dict]) -> list[dict]:
    totals: dict[str, int] = defaultdict(int)
    for row in loads_a:
        totals[row["sku"]] += row["qty"]
    for row in loads_b:
        totals[row["sku"]] += row["qty"]
    return [{"sku": sku, "qty": q} for sku, q in sorted(totals.items())]
```
Why this works: One pass per list (O(n)), dict accumulates by sku, then materialize output. sorted on keys makes tests deterministic; drop it if order does not matter.
PYTHON · Topic — Hash tables & counting
COMPANY · Amazon — hash Amazon-tagged hash tables
COMPANY · Amazon — array Amazon-tagged arrays
Open the full Amazon Python problem set when you want language-filtered reps.
7. Star schema, grain, and SCD patterns in data engineering
Dimensional modeling and SCD patterns for warehouse interviews
Star schema is the default story for warehouse interviews: a fact table at the center (numeric measures + foreign keys to dimensions) surrounded by dimension tables (descriptive context). Snowflake normalizes dimensions further (e.g. brand → subcategory → category); know the tradeoff: fewer joins vs more tables.
Fact grain (say this first): “One row per…” order line? Package scan? Daily inventory snapshot? Mixing order-level and line-level measures in one fact without a clear rule is a red flag.
Surrogate keys: Warehouse sk integers on dimensions decouple the warehouse from source system ids that might change or collide across systems. Facts store customer_sk, not raw email.
Role-play: “If marketing changes a product’s category retroactively, do we restate history?”—that question picks Type 1 vs 2.
SCD Type 1: Overwrite the dimension row—easy, but history is lost. Use when only the current truth matters (fix a typo, current brand name).
SCD Type 2: Insert a new row with valid_from / valid_to (or is_current). Facts point to the sk row that was active when the event occurred—as-of reporting stays correct.
Worked example — Type 2 rows for one natural customer:
| customer_sk | natural_id | city | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 101 | U1 | Seattle | 2020-01-01 | 2022-05-31 | false |
| 102 | U1 | Austin | 2022-06-01 | NULL | true |
An order in 2021 should reference customer_sk = 101; a current dashboard filters is_current = true (102).
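A minimal in-memory sketch of the Type 2 update mechanics (every field, function name, and date here is illustrative, not a fixed schema):

```python
from datetime import date, timedelta

def apply_scd2_change(rows, natural_id, new_city, change_date, next_sk):
    """Close the current version and append a new one (SCD Type 2)."""
    for row in rows:
        if row["natural_id"] == natural_id and row["is_current"]:
            row["is_current"] = False
            # Inclusive end bound, one day before the new version starts.
            row["valid_to"] = change_date - timedelta(days=1)
    rows.append({
        "customer_sk": next_sk, "natural_id": natural_id, "city": new_city,
        "valid_from": change_date, "valid_to": None, "is_current": True,
    })
    return rows

history = [{"customer_sk": 101, "natural_id": "U1", "city": "Seattle",
            "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True}]
history = apply_scd2_change(history, "U1", "Austin", date(2022, 6, 1), next_sk=102)
# history now matches the table: sk 101 closed at 2022-05-31, sk 102 current.
```

A real warehouse does this as a MERGE or insert-plus-update in the dimension load job; the in-memory version only shows the versioning rule.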
Common beginner mistakes
Mixing grains in one fact (lines and orders in the same table without a clear rule).
Storing natural keys on facts when dimensions are Type 2—you lose as-of joins.
Type 2 without
valid_to/is_current—hard to write point-in-time queries.
Data Engineering Interview Question on Customer Address History
Explain how you store customer addresses so historical orders show the address at order time while ops dashboards show the current address.
Solution Using SCD Type 2 plus current flag
Use Type 2 on dim_customer with valid_from, valid_to, is_current. Facts store customer_sk referencing the version active when the order was placed. Analysts filter is_current = true for “where they live now.”
Why this works: Facts remain true to history; the dimension versions track change without rewriting old keys.
SQL · Topic — Dimensional modeling
COMPANY · Amazon — modeling Amazon-tagged dimensional modeling
COMPANY · Amazon — SCD Amazon-tagged slowly changing data
COMPANY · Amazon — hub Amazon data modeling filter
Deep dives: Data modeling for data engineering interviews and ETL system design for data engineering interviews.
Tips to crack Amazon data engineering interviews
Data engineering interview preparation for Amazon-shaped loops works best in lanes: SQL grain → joins → windows → dates → CASE/percentages → Python structures → a clean verbal on modeling.
SQL preparation
Drill WHERE vs HAVING, fan-out, and partition keys on paper before you code. State timezone and tie-break assumptions early.
Pipeline and ETL thinking
Be ready to discuss idempotent loads, late-arriving facts, and backfills at a high level—pair that narrative with ETL topic practice.
Where to practice on PipeCode
| Lane | Amazon-focused path |
|---|---|
| Company hub | /explore/practice/company/amazon |
| SQL | /explore/practice/company/amazon/sql |
| Python | /explore/practice/company/amazon/python |
| Medium difficulty | /explore/practice/company/amazon/difficulty/medium |
Courses that match this article: SQL for data engineering interviews, Python fundamentals, plus the modeling and ETL links in §7.
Communication under time pressure
Narrate grain first, then SQL. When stuck, check join cardinality and NULL behavior—that is how experienced DE candidates de-risk Amazon SQL and Python rounds without spiraling.
Frequently asked questions
What types of Amazon data engineering interview questions show up—SQL, Python, or system design?
Expect heavy SQL (joins, aggregates, windows, dates), Python structured coding, and design discussion on pipelines and warehouses. This article mirrors that mix; use the Amazon practice hub to align reps with those lanes.
How hard are Amazon data engineering SQL problems compared to other FAANG companies?
Difficulty is broadly comparable to other large-tech data engineering loops: medium SQL with edge checks is typical. Mix easy warmups with harder multi-step prompts so you are not surprised by edge cases on the day.
Does Amazon test Python algorithms or data engineering coding more in DE loops?
You should expect both: hash maps, sorting, and string or JSON transforms appear alongside SQL. It is usually not identical to a pure SWE loop, but algorithmic Python still matters.
How should I prepare for Amazon data modeling or warehouse design questions?
Practice star schema, fact grain, and SCD Type 1 vs 2 out loud; explain surrogate keys and as-of joins. Pair verbal reps with dimensional modeling drills and the data modeling course linked in §7.
How many Amazon-tagged practice problems should I finish before the interview?
There is no magic number, but 30–50 mixed problems across SQL, Python, and a few modeling prompts is a solid preparation cadence if you review each submission.
Where can I practice Amazon-style data engineering questions with tests and feedback?
PipeCode runs 450+ interview-style problems with tests and AI feedback; start at /explore/practice/company/amazon for company-tagged sets, then branch into topic pages for weak spots.
Start practicing Amazon data engineering problems
Pair company-tagged Amazon reps with an environment that runs tests and surfaces wrong assumptions early, so you move from reading patterns to typing your own solutions under time pressure.
Pipecode.ai is Leetcode for Data Engineering