Gowtham Potureddi

Amazon Data Engineering Interview Questions & Prep

Amazon data engineering interview questions lean on SQL for operational and retail-style analytics—aggregates, joins, window functions, and date logic—and Python for in-memory transforms (hash maps, arrays, nested structures). Harder loops often add verbal data modeling (star schemas, slowly changing dimensions, fact grain).

Everything below uses original teaching examples only—not proprietary interview items. Narrate grain, join semantics, and NULL behavior out loud; interviewers grade that reasoning as much as the final query or sketch.



Top Amazon data engineering interview topics

| # | Topic | Why it shows up |
|---|-------|-----------------|
| 1 | Aggregation & GROUP BY / HAVING | Revenue and volume by region, SKU, or customer segment. |
| 2 | Joins & anti-joins | Enrich facts, find non-buyers, avoid join fan-out. |
| 3 | Window functions | Top-N per group, ranks, LAG / LEAD, running metrics. |
| 4 | Dates & time boundaries | Shipped-within-N-days, yearly filters, buckets. |
| 5 | CASE & percentages | SLA tiers, rejection rates, safe ratios. |
| 6 | Python: dicts, sets, arrays | Frequency, sorting, light ETL-style transforms. |
| 7 | Star schema & SCDs | Verbal warehouse design at the DE bar. |

SQL evaluation order (mental model): FROM / joins → WHERE (row filter) → GROUP BY → aggregates → HAVING (group filter) → windows → SELECT / ORDER BY. If you are filtering a sum or count, that condition almost always belongs in HAVING, not WHERE.
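
To make that ordering concrete, here is a small sketch over a hypothetical orders(region, amount, order_date) table, with comments marking the stage each clause runs in:

```sql
SELECT region,                              -- 5. SELECT: project the output columns
       SUM(amount) AS revenue               --    (aggregates were computed at step 3)
FROM orders                                 -- 1. FROM: pick the source rows
WHERE order_date >= DATE '2024-01-01'       -- 2. WHERE: row filter, raw columns only
GROUP BY region                             -- 3. GROUP BY + aggregates: one group per region
HAVING SUM(amount) > 1000                   -- 4. HAVING: group filter on the aggregate
ORDER BY revenue DESC;                      -- 6. ORDER BY: sort the final rows
```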


1. Aggregation and GROUP BY concepts in Amazon-style SQL

Aggregation and GROUP BY in SQL for data engineering

Picture a table with many detail rows—for example one row per order line or package scan. Aggregation means: turn lots of rows into one summary value (or a few values) per bucket. The bucket is whatever the prompt names: per seller, per FC (fulfillment center), per day, per postal code, and so on.

GROUP BY defines the bucket: “Put all rows with the same seller_id together,” or “the same (fc_id, day) together.” Each distinct combination of GROUP BY columns is one group; the engine runs SUM, COUNT, AVG, … inside each group separately.

Report grain: After the query, ask: “One output row represents what?” If the question asked revenue per region, the grain is region. If you accidentally group at order-line grain when the prompt wanted per order, your SUM can be right for lines but wrong for the business sentence.

SUM, AVG, and NULL

  • SUM(col) adds non-null numerics in the group. NULL cells are skipped (not treated as 0). If every value is NULL, SUM is usually NULL, not 0—say that in the interview when edge cases matter.

  • AVG(col) is sum of non-null values ÷ count of non-null values. Missing prices do not enter the numerator or denominator.

Worked example: In one group, amount values 10, NULL, 30 → SUM(amount) = 40, AVG(amount) = 20 (only two values averaged).

COUNT(*) vs COUNT(col)

  • COUNT(*) counts rows in the group, including rows where some columns are NULL.

  • COUNT(col) counts rows where col is not NULL.

  • COUNT(DISTINCT col) counts unique non-null values—essential after joins when you must count buyers, not multiplied rows.

Worked example: Three rows in a group, two with non-null amount → COUNT(*) = 3, COUNT(amount) = 2.

MIN / MAX

Work on orderable types (numbers, dates, timestamps). NULLs are skipped. Use MAX(order_ts) for “latest order time” per bucket.

WHERE vs HAVING

  • WHERE filters rows before grouping—predicates on raw columns only.

  • HAVING filters groups after aggregates—predicates on SUM, COUNT, AVG, etc.

Worked example — same orders grain, two filters:

| order_id | merchant_id | subtotal |
|---|---|---|
| 1 | M1 | 40 |
| 2 | M1 | 70 |
| 3 | M1 | 20 |
| 4 | M2 | 200 |

  • WHERE subtotal > 30 removes row 3 before grouping. Then GROUP BY merchant_id with SUM(subtotal) gives M1 = 110, M2 = 200.

  • HAVING SUM(subtotal) > 100 on the full table gives M1 = 130, M2 = 200—both groups pass.
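
Written out as runnable queries (a sketch, assuming the four rows above are loaded in a table named orders):

```sql
-- WHERE: drop row 3 (subtotal 20) before grouping → M1 = 110, M2 = 200
SELECT merchant_id, SUM(subtotal) AS total
FROM orders
WHERE subtotal > 30
GROUP BY merchant_id;

-- HAVING: group first, then keep groups whose sum exceeds 100 → M1 = 130, M2 = 200
SELECT merchant_id, SUM(subtotal) AS total
FROM orders
GROUP BY merchant_id
HAVING SUM(subtotal) > 100;
```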

Rule of thumb: If the condition uses SUM / COUNT / AVG of the group, it belongs in HAVING. If it only uses this row's columns, use WHERE first (and prefer it for performance).

Half-open dates and revenue per zip

Worked example: orders(order_id, zip_code, amount, order_date) → 2020 revenue per zip:

```sql
SELECT zip_code, SUM(amount) AS revenue
FROM orders
WHERE order_date >= DATE '2020-01-01' AND order_date < DATE '2021-01-01'
GROUP BY zip_code;
```

Takeaway: [2020-01-01, 2021-01-01) avoids double-counting a midnight boundary. The result grain is one row per zip.

Conditional aggregation (CASE inside aggregates)

Idea: Count or sum only some rows in the group without splitting into multiple queries.

Worked example — lines per FC by disposition:

| fc_id | status |
|---|---|
| A | shipped |
| A | pending |
| A | shipped |

```sql
SELECT fc_id,
       COUNT(*) AS total_lines,
       SUM(CASE WHEN status = 'shipped' THEN 1 ELSE 0 END) AS shipped_lines
FROM fc_lines
GROUP BY fc_id;
```

Result: fc A → total_lines = 3, shipped_lines = 2.

Common beginner mistakes

  • Putting SUM(x) > 10 in WHERE instead of HAVING.

  • Selecting a non-aggregated column that is not in GROUP BY (invalid in strict SQL).

  • Joining one-to-many before GROUP BY and double-counting a dimension field—fix grain or dedupe first.

  • Using COUNT(*) when the question wanted distinct people—use COUNT(DISTINCT user_id).

SQL Interview Question on Aggregation by Region

Table: orders(order_id, region, amount, is_refund). Return regions where net revenue (sum of amount excluding refunds) is strictly positive, ordered by net revenue descending.

Solution Using GROUP BY and Conditional Sum

```sql
SELECT region,
       SUM(CASE WHEN NOT is_refund THEN amount ELSE 0 END) AS net_rev
FROM orders
GROUP BY region
HAVING SUM(CASE WHEN NOT is_refund THEN amount ELSE 0 END) > 0
ORDER BY net_rev DESC;
```

Why this works: Refunds are zeroed in the CASE; HAVING applies to the per-region aggregate, not individual rows.

When you want timed medium SQL, open Amazon SQL (medium).


2. Joins, anti-joins, and deduplication in SQL

Joins and deduplication in SQL for data engineering

A join answers: “For each row in table A, which rows in table B belong with it?” You declare that with ON a.key = b.key. In analytics you use joins to enrich events (orders, clicks, shipments) with reference data (customers, SKUs, fulfillment centers).

Fact vs dimension and grain (how data engineers talk about joins)

  • Fact table: Many rows—often one row per event or transaction (order line, click, package scan)—the thing you SUM or COUNT.

  • Dimension table: Fewer rows—one row per entity you describe (customer, product, warehouse)—the thing you JOIN to get names, categories, regions.

  • Grain: “One result row per what?” (per order, per customer per day, per ASIN). Wrong grain after a join usually means fan-out (covered below): you aggregated at order level but summed a customer attribute that repeated after the join.

Worked example — fact + dimension: order_lines (fact: order_id, sku, qty) joins products (dimension: sku, list_price, title). SUM(qty) by sku is correct at line grain. If products accidentally had duplicate rows per sku, one join could inflate SUM(list_price) when you sum price alongside lines—fix the dimension so there is exactly one row per sku before you trust price totals.

Figure: two SQL tables joining with row multiplication (fan-out) versus a clean grain path.

Inner join (JOIN / INNER JOIN)

  • Result: Only rows where the ON condition succeeds on both sides.

  • Use when: You only care about matched pairs—e.g. clicks that resolve to a known catalog row.

  • NULL keys: NULL = NULL is not true in join logic, so inner joins drop rows with a null join key unless you COALESCE or filter explicitly.

Worked example — clicks and ads:

clicks

| click_id | user_id | ad_id |
|---|---|---|
| c1 | u1 | 10 |
| c2 | u1 | 99 |

ads

| ad_id | brand_id |
|---|---|
| 10 | 7 |
| 11 | 7 |

FROM clicks c JOIN ads a ON c.ad_id = a.ad_id returns one row: (c1, u1, 10, 7). Click c2 has ad_id = 99 with no ad row, so it disappears from an inner join.

Left outer join (LEFT JOIN)

  • Result: Every row from the left table appears once per matching right row; if there is no match, right-side columns are NULL.

  • Use when: You must keep all left entities even if they have zero events—e.g. all customers including those with no orders.

Worked example — users and orders:

users

| id | name |
|---|---|
| 1 | Ann |
| 2 | Bo |

orders

| id | user_id | amount |
|---|---|---|
| o1 | 1 | 50 |

FROM users u LEFT JOIN orders o ON u.id = o.user_id yields two rows: Ann + o1, and Bo + NULL (no order—right side null-padded).

Right outer join (RIGHT JOIN)

  • Result: Mirror of LEFT JOIN: keep all rows from the right table; left columns NULL when unmatched.

  • In practice: Rarely needed—swap tables and use LEFT JOIN so “keep all rows from the table I care about” always reads the same way.

Worked example: Same users and orders as above. FROM orders o RIGHT JOIN users u ON u.id = o.user_id returns the same two logical rows as FROM users u LEFT JOIN orders o …—only preferred style differs.

Full outer join (FULL OUTER JOIN)

  • Result: Keeps rows that match plus unmatched rows from either side (missing side filled with NULL).

  • Use when: You need a symmetric “everything from A and B” merge; shows up in reconciliation-style questions.

Worked example: If legacy has ids 1, 2 and new has 2, 3, a full outer join on id returns three logical keys 1, 2, 3 with NULLs where one side is missing.

Cross join (CROSS JOIN)

  • Result: Cartesian product: every row of A paired with every row of B.

  • Use when: You intentionally need all combinations (e.g. stores × quarters grid)—easy to explode row count; use sparingly.

Worked example: stores has 2 rows, quarters has 4 → CROSS JOIN produces 8 rows (every store × every quarter).
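
The usual productive use: build the full grid with CROSS JOIN, then LEFT JOIN facts onto it so quiet combinations still appear. A sketch, assuming hypothetical stores(store_id), quarters(quarter), and sales(store_id, quarter, amount) tables:

```sql
SELECT s.store_id, q.quarter,
       COALESCE(SUM(sa.amount), 0) AS revenue  -- zero, not a missing row, for quiet combos
FROM stores s
CROSS JOIN quarters q                           -- every store × every quarter
LEFT JOIN sales sa
  ON sa.store_id = s.store_id AND sa.quarter = q.quarter
GROUP BY s.store_id, q.quarter;
```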

Semi-join vs anti-join (names, not syntax)

  • Semi-join: “Keep rows from A where some match exists in B”—typically WHERE EXISTS (SELECT 1 FROM B WHERE …) or WHERE key IN (SELECT key FROM B). You do not add B’s columns to the result; you only filter A.

  • Anti-join: “Keep rows from A with no match in B”—LEFT JOIN B … WHERE B.key IS NULL or WHERE NOT EXISTS (SELECT 1 FROM B WHERE …). Same idea: filter A, not a full merge for reporting.

Worked example — customers ids 1, 2, 3; table vip has customer_id = 2 only.

  • Semi-join: SELECT id FROM customers c WHERE EXISTS (SELECT 1 FROM vip v WHERE v.customer_id = c.id) → 2.

  • Anti-join: SELECT c.id FROM customers c LEFT JOIN vip v ON v.customer_id = c.id WHERE v.customer_id IS NULL → 1 and 3.

Fan-out (row multiplication)

If the join key is not unique on one side (one user, many orders), the “one” side’s columns repeat on every matching row. That is correct for order-grain analysis.

Bug pattern: Summing a user-level numeric column after the join counts that user’s balance multiple times. Fixes: aggregate first, or make the dimension one row per key before joining. SUM(DISTINCT …) rarely fixes business logic—prefer COUNT(DISTINCT user_id) or a subquery that dedupes.

Worked example:

users

| id | credit_balance |
|---|---|
| 1 | 500 |

orders

| user_id |
|---|
| 1 |
| 1 |

After users JOIN orders, credit_balance appears twice (500, 500), so SUM(u.credit_balance) returns 1000, which is wrong for “total credit in the system.” Fix: sum order amounts instead, or take MAX(u.credit_balance) per user if you only need the balance once.
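
A minimal sketch of the aggregate-first fix, assuming hypothetical users(id, credit_balance) and orders(user_id, amount) tables:

```sql
-- Aggregate the many side down to one row per user_id, then join at matching grain
SELECT u.id,
       u.credit_balance,                       -- appears once per user, no fan-out
       COALESCE(o.order_total, 0) AS order_total
FROM users u
LEFT JOIN (
    SELECT user_id, SUM(amount) AS order_total
    FROM orders
    GROUP BY user_id
) o ON o.user_id = u.id;
```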

Deduplication

  • SELECT DISTINCT col (or distinct on several columns) removes duplicate rows—fine when any duplicate row is interchangeable.

  • Latest row per id: Prefer ROW_NUMBER() OVER (PARTITION BY id ORDER BY time DESC) AS rn in a CTE, then WHERE rn = 1, so you control ties and ordering. DISTINCT ON (user_id) (PostgreSQL) is a compact alternative when the dialect allows it.

Worked example — DISTINCT:

| visit_id | user_id |
|---|---|
| v1 | u1 |
| v2 | u1 |
| v3 | u2 |

SELECT DISTINCT user_id FROM visits → u1, u2.

Worked example — latest event per user:

| user_id | ts | page |
|---|---|---|
| u1 | t2 | /home |
| u1 | t1 | /cart |

ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) assigns rn = 1 to (u1, t2, /home) and rn = 2 to (u1, t1, /cart); WHERE rn = 1 keeps only the latest row.
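
The same pattern as a full query (a sketch over a hypothetical events(user_id, ts, page) table):

```sql
WITH ranked AS (
    SELECT user_id, ts, page,
           ROW_NUMBER() OVER (
               PARTITION BY user_id ORDER BY ts DESC  -- add tie-break columns if ts can repeat
           ) AS rn
    FROM events
)
SELECT user_id, ts, page
FROM ranked
WHERE rn = 1;   -- exactly one (latest) row per user
```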

Worked example — DISTINCT ON (PostgreSQL):

```sql
SELECT DISTINCT ON (user_id) user_id, ts, page
FROM events
ORDER BY user_id, ts DESC;
```

DISTINCT ON (user_id) requires ORDER BY to start with the same columns; the first row per user_id in that order is kept. For tie-breaking and clarity, ROW_NUMBER() in a CTE is often easier in interviews unless the prompt says PostgreSQL.

Common beginner mistakes

  • Joining on the wrong key or ignoring NULL keys (they do not match in an inner join).

  • Assuming one row per user after a join when the join multiplied rows.

  • Using COUNT(*) when the question wanted distinct people—use COUNT(DISTINCT user_id).

Joins-Based Interview Question on Distinct Users and Brands

Tables: clicks(user_id, ad_id) and ads(ad_id, brand_id). How many distinct users clicked any ad for brand_id = 7?

Solution Using INNER JOIN and COUNT DISTINCT

```sql
SELECT COUNT(DISTINCT c.user_id) AS distinct_users
FROM clicks c
JOIN ads a ON a.ad_id = c.ad_id
WHERE a.brand_id = 7;
```

Why this works: The join finds clicks whose ad belongs to brand 7; COUNT(DISTINCT user_id) counts people, not click rows.

Joins-Based Interview Question on Customers With No Orders

Tables: customers(customer_id), orders(order_id, customer_id). Return customer_id for customers with no orders.

Solution Using LEFT JOIN and NULL Filter

```sql
SELECT c.customer_id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_id IS NULL;
```

Why this works: Unmatched rows still appear from customers; orders columns are NULL, which we filter—classic anti-join pattern.


3. Window functions and ranking in SQL

Window functions and ranking in SQL for analytics loops

GROUP BY collapses many rows into one row per group. Window functions keep every input row and attach a value computed over a window of rows—usually “all rows in the same partition that sort before or with this row.”

Parts of a window: PARTITION BY (optional) splits the table into independent mini-tables—like “per customer” or “per FC per day.” ORDER BY inside OVER (...) defines sequence within the partition. ROWS BETWEEN … (optional) trims which neighbors count for running aggregates.

When to reach for windows: Top-N per group, gaps between events (LAG / LEAD), running revenue, share of partition (SUM(x) OVER (...) / SUM(SUM(x)) in a subquery pattern), deduping with ROW_NUMBER() (see §2).
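
The share-of-partition pattern is the least familiar of these, so here is a sketch over a hypothetical sales(region, seller_id, amount) table; because windows run after GROUP BY, the window can wrap the aggregate directly instead of using a subquery:

```sql
SELECT region, seller_id,
       SUM(amount) AS seller_rev,
       SUM(amount) * 1.0
         / SUM(SUM(amount)) OVER (PARTITION BY region) AS share_of_region
FROM sales
GROUP BY region, seller_id;
```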

Figure: window function partitions with ranking values 1, 2, 2 for DENSE_RANK-style ties.

Ranking: ROW_NUMBER, RANK, DENSE_RANK

All three need ORDER BY inside OVER.

  • ROW_NUMBER(): Unique integer per row in the partition after sorting—no ties; the engine picks an arbitrary order among equal keys unless you add tie-break columns.

  • RANK(): Tied rows get the same rank; next rank skips (1, 2, 2, 4).

  • DENSE_RANK(): Tied rows get the same rank; no gaps (1, 2, 2, 3).

Worked example — three employees in dept 10 with salaries 90k, 80k, 80k (sorted desc):

| emp | salary | ROW_NUMBER | RANK | DENSE_RANK |
|---|---|---|---|---|
| A | 90 | 1 | 1 | 1 |
| B | 80 | 2 | 2 | 2 |
| C | 80 | 3 | 2 | 2 |

Use ROW_NUMBER when the prompt says “pick one” and gives a deterministic tie-break (e.g. lowest user_id). Use DENSE_RANK when ties should share a bucket (e.g. “top three distinct salary tiers”).
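
For the “top three distinct salary tiers” reading, a DENSE_RANK sketch over a hypothetical employees(dept_id, emp, salary) table:

```sql
SELECT dept_id, emp, salary
FROM (
    SELECT dept_id, emp, salary,
           DENSE_RANK() OVER (
               PARTITION BY dept_id ORDER BY salary DESC
           ) AS tier
    FROM employees
) t
WHERE tier <= 3;   -- everyone in the top three salary tiers, ties included
```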

LAG / LEAD

LAG(col, n) looks n rows back in the partition order; LEAD looks forward. Great for time between events, sessionization, and YoY same-row comparisons.

Worked example — gap between orders:

| customer_id | order_ts | prev_ts |
|---|---|---|
| 1 | t1 | NULL |
| 1 | t3 | t1 |

LAG(order_ts) OVER (PARTITION BY customer_id ORDER BY order_ts) on row (1, t3) yields t1; first row per customer gets NULL lag.

Running totals

Worked example: Cumulative amount per seller_id in order_ts order:

```sql
SELECT seller_id, order_ts, amount,
       SUM(amount) OVER (
           PARTITION BY seller_id
           ORDER BY order_ts
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM seller_orders;
```

(If the dialect defaults frame correctly for running sum, the explicit ROWS clause is optional—but stating it shows you understand frame rules.)

Common beginner mistakes

  • Omitting ORDER BY inside OVER (…)—ranking and LAG become meaningless or nondeterministic.

  • Using ROW_NUMBER when multiple rows must tie for “first place” under the business rule.

  • Forgetting PARTITION BY and accidentally ranking the whole table as one group.

SQL Interview Question on Top Spender per Region

Table: spend(user_id, region, amt). Return the user_id with the highest amt in each region (break ties by lowest user_id).

Solution Using ROW_NUMBER

```sql
SELECT user_id, region, amt
FROM (
    SELECT user_id, region, amt,
           ROW_NUMBER() OVER (
               PARTITION BY region ORDER BY amt DESC, user_id ASC
           ) AS rn
    FROM spend
) t
WHERE rn = 1;
```

Why this works: ROW_NUMBER assigns a unique order per partition; rn = 1 is the chosen winner per region under the stated tie-break.

SQL Interview Question on Consecutive Order Gap

Table: orders(order_id, customer_id, order_date) with order_date as a DATE. Return orders where the gap in whole days from this customer’s previous order is strictly greater than 7. Skip customers’ first order (no predecessor).

Solution Using LAG

```sql
SELECT order_id, customer_id, order_date, prev_date,
       order_date - prev_date AS gap_days
FROM (
    SELECT order_id, customer_id, order_date,
           LAG(order_date) OVER (
               PARTITION BY customer_id ORDER BY order_date
           ) AS prev_date
    FROM orders
) t
WHERE prev_date IS NOT NULL
  AND order_date - prev_date > 7;
```

Why this works: For DATE, subtraction yields an integer day count in PostgreSQL-style SQL. LAG lines up each row with the prior order_date per customer_id; prev_date IS NOT NULL removes first orders.

For more pattern drills, browse SQL deep dives.


4. Date functions and time-series boundaries in SQL

Date arithmetic and buckets in SQL for operations metrics

Operational questions—“shipped within N days,” “last 30 days of clicks,” “revenue by calendar month”—are really three skills: typing (DATE vs TIMESTAMP), time zones (state assumptions), and inclusive vs exclusive bounds.

Half-open ranges

Prefer [start, end): WHERE ts >= start AND ts < end. That avoids double-counting events exactly at midnight on a boundary and mirrors how many warehouses partition data.

Calendar buckets with DATE_TRUNC

Worked example (PostgreSQL): Truncate click timestamps to UTC day:

```sql
SELECT DATE_TRUNC('day', click_ts AT TIME ZONE 'UTC') AS day_utc, COUNT(*) AS clicks
FROM clicks
GROUP BY 1;
```

Use week, month, quarter the same way—then GROUP BY the truncated value.

Day differences and thresholds

Worked example: If order_date and ship_date are DATE, then ship_date - order_date <= 3 means “shipped within three calendar days” including same-day ship as 0—confirm with the interviewer whether same-day counts as “within 1 day” or “within 0 days.”

INTERVAL arithmetic on TIMESTAMP is dialect-specific; say whether you assume UTC, warehouse local, or event local.
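
Putting the threshold into a metric, a sketch of a shipped-within-three-days rate over a hypothetical shipments(order_id, order_date, ship_date) table with DATE columns (PostgreSQL-style date subtraction):

```sql
SELECT AVG(CASE WHEN ship_date - order_date <= 3 THEN 1.0 ELSE 0 END) AS within_3d_rate
FROM shipments
WHERE ship_date IS NOT NULL;   -- decide explicitly how unshipped orders should count
```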

Common beginner mistakes

  • Inclusive end dates that mishandle the boundary: on timestamps, BETWEEN '2020-01-01' AND '2020-12-31' silently drops most of December 31 (the end bound casts to midnight), while extending the end to '2021-01-01' pulls in events at exactly 2021-01-01 00:00. Prefer half-open ranges.

  • Mixing TIMESTAMP and DATE without CAST.

  • Forgetting leap years or DST when comparing “exactly 24 hours” vs calendar days.

SQL Interview Question on Revenue in a Calendar Year

Table: orders(order_id, order_date, zip_code, amount). Return the zip_code with the largest total amount in calendar year 2020.

Solution Using WHERE Range and GROUP BY

```sql
SELECT zip_code, SUM(amount) AS revenue
FROM orders
WHERE order_date >= DATE '2020-01-01' AND order_date < DATE '2021-01-01'
GROUP BY zip_code
ORDER BY revenue DESC
LIMIT 1;
```

Why this works: The date predicate fixes the year grain; GROUP BY zip rolls up to the question’s unit; LIMIT 1 picks the top (ties need a tie-break rule—ask).

SQL Interview Question on Rolling Activity Window

Table: events(user_id, event_ts) (TIMESTAMP). Return user_id values that had at least one event in the last 7 full days through CURRENT_DATE (exclusive of today’s partial day), using date arithmetic.

Solution Using Date Cast and Half-Open Bound

sql
SELECT DISTINCT user_id
FROM events
WHERE event_ts >= (CURRENT_DATE - INTERVAL '7 days')
AND event_ts < CURRENT_DATE;

Why this works: event_ts < CURRENT_DATE excludes today’s partial day, matching a “through yesterday” reading of the prompt; adjust the bound if today should be inclusive. The lower bound is midnight seven days before CURRENT_DATE.

Practice time-series SQL patterns after this section—the hub matches many “consecutive day” and cohort-style prompts.


5. Conditional logic and percentage metrics in SQL

CASE expressions and ratios in SQL for DE interviews

CASE has two shapes: simple (CASE col WHEN val THEN …) and searched (CASE WHEN predicate THEN … END). In interviews, searched form is more common because you can express ranges and NULL checks.

CASE in SELECT: Derive tier labels (SLA buckets, price bands). CASE inside aggregates: Count or sum only rows matching a condition—same idea as conditional aggregation in §1.

Percentages: Always CAST or multiply by 1.0 so you do not hit integer division (3/100 evaluates to 0 in many engines). Guard the denominator with NULLIF(denominator, 0) so you return NULL instead of a runtime error.

Rejection rate pattern (portable)

Worked example — per role_id, share of rows that are rejected among decided applications (rejected or accepted):

```sql
SELECT role_id,
       SUM(CASE WHEN status = 'rejected' THEN 1 ELSE 0 END) * 1.0
         / NULLIF(
             SUM(CASE WHEN status IN ('rejected', 'accepted') THEN 1 ELSE 0 END),
             0
           ) AS rejection_rate
FROM applications
GROUP BY role_id;
```

Numerator and denominator both ignore pending rows by giving them 0 in their respective CASE arms.

Common beginner mistakes

  • Dividing two integers and getting 0 everywhere.

  • Using COUNT(*) as the denominator when the question wanted only completed stages.

  • Returning percentage as 0–1 when the interviewer asked for 0–100—state units.

SQL Interview Question on Rejection Rate by Role

Table: applications(id, role_id, status) with status in ('rejected','accepted',...). Return each role_id and rejection rate among terminal rows where status is either rejected or accepted.

Solution Using FILTER (PostgreSQL)

```sql
SELECT role_id,
       COUNT(*) FILTER (WHERE status = 'rejected')::float
         / NULLIF(COUNT(*) FILTER (WHERE status IN ('rejected', 'accepted')), 0)
         AS rejection_rate
FROM applications
GROUP BY role_id;
```

Why this works: FILTER is concise PostgreSQL syntax; numerator and denominator both restrict to decided applications; NULLIF avoids divide-by-zero. For engines without FILTER, use the SUM(CASE …) form above.

SQL Interview Question on Click-Through Rate by Campaign

Tables: impressions(campaign_id, user_id) and clicks(campaign_id, user_id) (one row per user per campaign in each table if they saw / clicked). Return each campaign_id and ctr = clicks / impressions among users who had an impression; use FLOAT (or REAL) division.

Solution Using LEFT JOIN and Conditional Counts

```sql
SELECT i.campaign_id,
       SUM(CASE WHEN c.user_id IS NOT NULL THEN 1 ELSE 0 END) * 1.0
         / NULLIF(COUNT(DISTINCT i.user_id), 0) AS ctr
FROM impressions i
LEFT JOIN clicks c
  ON c.campaign_id = i.campaign_id AND c.user_id = i.user_id
GROUP BY i.campaign_id;
```

Why this works: Grain is one row per (campaign, user) on the left; LEFT JOIN attaches a click if any; SUM(CASE WHEN c.user_id IS NOT NULL …) counts distinct click matches per impression row—here one row per user so it equals “clicked or not.” COUNT(DISTINCT i.user_id) is the impression denominator. If duplicate impression rows exist, dedupe first or use COUNT(*) vs DISTINCT consistently with the prompt.


6. Hash maps, arrays, and ETL-style Python for DE screens

Dicts, sets, and counting in Python for data processing

Amazon Python screens often look like small ETL: group records, aggregate in memory, join two lists by id, or validate uniqueness. The winning move is usually dict / defaultdict / Counter / set so you stay O(n) instead of O(n²) nested scans.

When to use what

  • set: Membership x in set, dedupe keys, “have I seen this?” while scanning once.

  • dict: Map id → object, id → running sum, id → last timestamp—your in-memory join key.

  • collections.Counter: Frequencies of hashable items; Counter.most_common(k) for top‑k.

  • collections.defaultdict(list): group_by patterns—append all rows for each key, then post-process (sketched just below).
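
A grouping sketch for the defaultdict(list) pattern, using hypothetical order-line dicts with sku and qty keys:

```python
from collections import defaultdict

# group_by: collect every row under its key in one O(n) scan, then post-process per group
lines = [
    {"sku": "A", "qty": 2},
    {"sku": "B", "qty": 1},
    {"sku": "A", "qty": 5},
]
by_sku: dict[str, list[dict]] = defaultdict(list)
for row in lines:
    by_sku[row["sku"]].append(row)

# post-process: e.g. max qty seen per sku
max_qty = {sku: max(r["qty"] for r in rows) for sku, rows in by_sku.items()}
# {'A': 5, 'B': 1}
```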

Frequency with Counter

Worked example: skus = ["A","B","A","A"] → Counter(skus) → Counter({'A': 3, 'B': 1}).

Complement set / “missing keys”

Worked example: Given all_ids and completed_ids, set(all_ids) - set(completed_ids) is the anti-join of ids—same logic as LEFT JOIN … IS NULL in SQL.

Python Interview Question on First Duplicate ID

Input: ids: list[int]. Return the first value that appears twice while scanning left-to-right; if none, return None.

Solution Using a Set

```python
def first_duplicate(ids: list[int]) -> int | None:
    seen: set[int] = set()
    for x in ids:
        if x in seen:
            return x
        seen.add(x)
    return None
```

Why this works: The set holds earlier elements; the first repeat is detected on the second occurrence.

Python Interview Question on Merge SKU Quantities

Input: Two lists of dicts loads_a and loads_b, each like {"sku": str, "qty": int}. Return a list of dicts with one entry per sku and qty equal to the sum of quantities from both lists. If a sku appears only once, still include it.

Solution Using defaultdict

```python
from collections import defaultdict

def merge_skus(loads_a: list[dict], loads_b: list[dict]) -> list[dict]:
    totals: dict[str, int] = defaultdict(int)
    for row in loads_a:
        totals[row["sku"]] += row["qty"]
    for row in loads_b:
        totals[row["sku"]] += row["qty"]
    return [{"sku": sku, "qty": q} for sku, q in sorted(totals.items())]
```

Why this works: One pass per list (O(n)), dict accumulates by sku, then materialize output. sorted on keys makes tests deterministic; drop it if order does not matter.

Open the full Amazon Python problem set when you want language-filtered reps.


7. Star schema, grain, and SCD patterns in data engineering

Dimensional modeling and SCD patterns for warehouse interviews

Star schema is the default story for warehouse interviews: a fact table at the center (numeric measures + foreign keys to dimensions) surrounded by dimension tables (descriptive context). Snowflake normalizes dimensions further (e.g. brand → subcategory → category); know the tradeoff: fewer joins vs more tables.

Fact grain (say this first): “One row per…” order line? Package scan? Daily inventory snapshot? Mixing order-level and line-level measures in one fact without a clear rule is a red flag.

Surrogate keys: Warehouse sk integers on dimensions decouple the warehouse from source system ids that might change or collide across systems. Facts store customer_sk, not raw email.

Role-play: “If marketing changes a product’s category retroactively, do we restate history?”—that question picks Type 1 vs 2.

SCD Type 1: Overwrite the dimension row—easy, but history is lost. Use when only the current truth matters (fix a typo, current brand name).

SCD Type 2: Insert a new row with valid_from / valid_to (or is_current). Facts point to the sk row that was active when the event occurred, so as-of reporting stays correct.

Figure: star schema with a central fact table and dimension tables, plus Type 1 versus Type 2 slowly changing dimension callouts.

Worked example — Type 2 rows for one natural customer:

| customer_sk | natural_id | city | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 101 | U1 | Seattle | 2020-01-01 | 2022-05-31 | false |
| 102 | U1 | Austin | 2022-06-01 | NULL | true |

An order in 2021 should reference customer_sk = 101; a current dashboard filters is_current = true (102).
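
A point-in-time (as-of) join under those assumptions, sketched with a hypothetical fact_orders(order_id, natural_id, order_date) table resolving the dimension version active at order time:

```sql
SELECT f.order_id, d.customer_sk, d.city
FROM fact_orders f
JOIN dim_customer d
  ON d.natural_id = f.natural_id
 AND f.order_date >= d.valid_from
 AND (f.order_date <= d.valid_to OR d.valid_to IS NULL);  -- NULL valid_to = current version
```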

Common beginner mistakes

  • Mixing grains in one fact (lines and orders in the same table without a clear rule).

  • Storing natural keys on facts when dimensions are Type 2—you lose as-of joins.

  • Type 2 without valid_to / is_current—hard to write point-in-time queries.

Data Engineering Interview Question on Customer Address History

Explain how you store customer addresses so historical orders show the address at order time while ops dashboards show the current address.

Solution Using SCD Type 2 plus current flag

Use Type 2 on dim_customer with valid_from, valid_to, is_current. Facts store customer_sk referencing the version active when the order was placed. Analysts filter is_current = true for “where they live now.”

Why this works: Facts remain true to history; the dimension versions track change without rewriting old keys.

Deep dives: Data modeling for data engineering interviews and ETL system design for data engineering interviews.


Tips to crack Amazon data engineering interviews

Data engineering interview preparation for Amazon-shaped loops means working in lanes: SQL grain → joins → windows → dates → CASE/percentages → Python structures → a clean verbal on modeling.

SQL preparation

Drill WHERE vs HAVING, fan-out, and partition keys on paper before you code. State timezone and tie-break assumptions early.

Pipeline and ETL thinking

Be ready to discuss idempotent loads, late-arriving facts, and backfills at a high level—pair that narrative with ETL topic practice.

Where to practice on PipeCode

Courses that match this article: SQL for data engineering interviews, Python fundamentals, plus the modeling and ETL links in §7.

Communication under time pressure

Narrate grain first, then SQL. When stuck, check join cardinality and NULL behavior—that is how experienced DE candidates de-risk Amazon DE SQL and Python rounds without spiraling.


Frequently asked questions

What types of Amazon data engineering interview questions show up—SQL, Python, or system design?

Expect heavy SQL (joins, aggregates, windows, dates), Python structured coding, and design discussion on pipelines and warehouses. This article mirrors that mix; use the Amazon practice hub to align reps with those lanes.

How hard are Amazon data engineering SQL problems compared to other FAANG companies?

Difficulty is broadly comparable to other large-tech data engineering loops: medium SQL with edge checks is typical. Mix easy warmups with harder multi-step prompts so you are not surprised by edge cases on the day.

Does Amazon test Python algorithms or data engineering coding more in DE loops?

You should expect both: hash maps, sorting, and string or JSON transforms appear alongside SQL. It is usually not identical to a pure SWE loop, but algorithmic Python still matters.

How should I prepare for Amazon data modeling or warehouse design questions?

Practice star schema, fact grain, and SCD Type 1 vs 2 out loud; explain surrogate keys and as-of joins. Pair verbal reps with dimensional modeling drills and the data modeling course linked in §7.

How many Amazon-tagged practice problems should I finish before the interview?

There is no magic number, but 30–50 mixed problems across SQL, Python, and a few modeling prompts is a solid preparation cadence if you review each submission.

Where can I practice Amazon-style data engineering questions with tests and feedback?

PipeCode runs 450+ interview-style problems with tests and AI feedback; start at /explore/practice/company/amazon for company-tagged sets, then branch into topic pages for weak spots.


Start practicing Amazon data engineering problems

Pair company-tagged Amazon reps with an environment that runs tests and surfaces wrong assumptions early, so you move from reading patterns to typing your own solutions under time pressure.

Pipecode.ai is Leetcode for Data Engineering

Browse Amazon practice → View plans →
