Amazon data engineering interview questions lean on SQL for operational and retail-style analytics—aggregates, joins, window functions, and date logic—and Python for in-memory transforms (hash maps, arrays, nested structures). Harder loops often add verbal data modeling (star schemas, slowly changing dimensions, fact grain).
Everything below uses original teaching examples only—not proprietary interview items. Narrate grain, join semantics, and NULL behavior out loud; interviewers grade that reasoning as much as the final query or sketch.
Top Amazon data engineering interview topics
| # | Topic | Why it shows up |
|---|---|---|
| 1 | Aggregation & GROUP BY / HAVING | Revenue and volume by region, SKU, or customer segment. |
| 2 | Joins & anti-joins | Enrich facts, find non-buyers, avoid join fan-out. |
| 3 | Window functions | Top-N per group, ranks, LAG / LEAD, running metrics. |
| 4 | Dates & time boundaries | Shipped-within-N-days, yearly filters, buckets. |
| 5 | CASE & percentages | SLA tiers, rejection rates, safe ratios. |
| 6 | Python: dicts, sets, arrays | Frequency, sorting, light ETL-style transforms. |
| 7 | Star schema & SCDs | Verbal warehouse design at the DE bar. |
SQL evaluation order (mental model): `FROM` / joins → `WHERE` (row filter) → `GROUP BY` → aggregates → `HAVING` (group filter) → windows → `SELECT` / `ORDER BY`. If you are filtering a sum or count, that condition almost always belongs in `HAVING`, not `WHERE`.
1. Aggregation and GROUP BY concepts in Amazon-style SQL
Aggregation and GROUP BY in SQL for data engineering
Picture a table with many detail rows—for example one row per order line or package scan. Aggregation means: turn lots of rows into one summary value (or a few values) per bucket. The bucket is whatever the prompt names: per seller, per FC (fulfillment center), per day, per postal code, and so on.
GROUP BY defines the bucket: “Put all rows with the same seller_id together,” or “the same (fc_id, day) together.” Each distinct combination of GROUP BY columns is one group; the engine runs SUM, COUNT, AVG, … inside each group separately.
Report grain: After the query, ask: “One output row represents what?” If the question asked revenue per region, the grain is region. If you accidentally group at order-line grain when the prompt wanted per order, your SUM can be right for lines but wrong for the business sentence.
SUM, AVG, and NULL
- `SUM(col)` adds non-null numerics in the group. `NULL` cells are skipped (not treated as 0). If every value is `NULL`, `SUM` is usually `NULL`, not 0—say that in the interview when edge cases matter.
- `AVG(col)` is the sum of non-null values ÷ the count of non-null values. Missing prices do not enter the numerator or denominator.
Worked example: In one group, amount values 10, NULL, 30 → SUM(amount) = 40, AVG(amount) = 20 (only two values averaged).
COUNT(*) vs COUNT(col)
- `COUNT(*)` counts rows in the group, including rows where some columns are `NULL`.
- `COUNT(col)` counts rows where `col` is not `NULL`.
- `COUNT(DISTINCT col)` counts unique non-null values—essential after joins when you must count buyers, not multiplied rows.
Worked example: Three rows in a group, two with non-null amount → COUNT(*) = 3, COUNT(amount) = 2.
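Both NULL rules above are easy to verify with Python's built-in `sqlite3` module (the table and column names here are made up for the demo):

```python
import sqlite3

# In-memory SQLite session; one group with amounts 10, NULL, 30.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE g (amount REAL)")
conn.executemany("INSERT INTO g VALUES (?)", [(10,), (None,), (30,)])

total, avg, n_rows, n_amount = conn.execute(
    "SELECT SUM(amount), AVG(amount), COUNT(*), COUNT(amount) FROM g"
).fetchone()
print(total, avg, n_rows, n_amount)  # 40.0 20.0 3 2
```

The NULL row is skipped by `SUM` and `AVG`, counted by `COUNT(*)`, and skipped again by `COUNT(amount)`.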
MIN / MAX
Work on orderable types (numbers, dates, timestamps). NULLs are skipped. Use MAX(order_ts) for “latest order time” per bucket.
WHERE vs HAVING
- `WHERE` filters rows before grouping—predicates on raw columns only.
- `HAVING` filters groups after aggregates—predicates on `SUM`, `COUNT`, `AVG`, etc.
Worked example — same orders grain, two filters:
| order_id | merchant_id | subtotal |
|---|---|---|
| 1 | M1 | 40 |
| 2 | M1 | 70 |
| 3 | M1 | 20 |
| 4 | M2 | 200 |
`WHERE subtotal > 30` removes row 3 before grouping. Then `GROUP BY merchant_id` with `SUM(subtotal)` gives M1 = 110, M2 = 200. `HAVING SUM(subtotal) > 100` on the full table gives M1 = 130, M2 = 200—both groups pass.
Rule of thumb: If the condition uses `SUM` / `COUNT` / `AVG` of the group, it belongs in `HAVING`. If it only uses this row's columns, use `WHERE` first (and prefer it for performance).
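A quick `sqlite3` sketch confirms both numbers from the merchant table above (sample rows copied from the worked example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, merchant_id TEXT, subtotal INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "M1", 40), (2, "M1", 70), (3, "M1", 20), (4, "M2", 200)])

# WHERE drops row 3 before grouping.
where_result = dict(conn.execute(
    "SELECT merchant_id, SUM(subtotal) FROM orders "
    "WHERE subtotal > 30 GROUP BY merchant_id"))
# HAVING filters whole groups after aggregation.
having_result = dict(conn.execute(
    "SELECT merchant_id, SUM(subtotal) FROM orders "
    "GROUP BY merchant_id HAVING SUM(subtotal) > 100"))
print(where_result)   # {'M1': 110, 'M2': 200}
print(having_result)  # {'M1': 130, 'M2': 200}
```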
Half-open dates and revenue per zip
Worked example: orders(order_id, zip_code, amount, order_date) — 2020 revenue per zip:
```sql
SELECT zip_code, SUM(amount) AS revenue
FROM orders
WHERE order_date >= DATE '2020-01-01' AND order_date < DATE '2021-01-01'
GROUP BY zip_code;
```
Takeaway: [2020-01-01, 2021-01-01) avoids double-counting a midnight boundary. The result grain is one row per zip.
Conditional aggregation (CASE inside aggregates)
Idea: Count or sum only some rows in the group without splitting into multiple queries.
Worked example — lines per FC by disposition:
| fc_id | status |
|---|---|
| A | shipped |
| A | pending |
| A | shipped |
```sql
SELECT fc_id,
       COUNT(*) AS total_lines,
       SUM(CASE WHEN status = 'shipped' THEN 1 ELSE 0 END) AS shipped_lines
FROM fc_lines
GROUP BY fc_id;
```
Result: fc A → total_lines = 3, shipped_lines = 2.
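The same result can be reproduced in an in-memory SQLite database (illustrative setup):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fc_lines (fc_id TEXT, status TEXT)")
conn.executemany("INSERT INTO fc_lines VALUES (?, ?)",
                 [("A", "shipped"), ("A", "pending"), ("A", "shipped")])

# CASE inside SUM counts only the shipped rows, no second query needed.
row = conn.execute("""
    SELECT fc_id,
           COUNT(*) AS total_lines,
           SUM(CASE WHEN status = 'shipped' THEN 1 ELSE 0 END) AS shipped_lines
    FROM fc_lines
    GROUP BY fc_id
""").fetchone()
print(row)  # ('A', 3, 2)
```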
Common beginner mistakes
- Putting `SUM(x) > 10` in `WHERE` instead of `HAVING`.
- Selecting a non-aggregated column that is not in `GROUP BY` (invalid in strict SQL).
- Joining one-to-many before `GROUP BY` and double-counting a dimension field—fix grain or dedupe first.
- Using `COUNT(*)` when the question wanted distinct people—use `COUNT(DISTINCT user_id)`.
SQL Interview Question on Aggregation by Region
Table: orders(order_id, region, amount, is_refund). Return regions where net revenue (sum of amount excluding refunds) is strictly positive, ordered by net revenue descending.
Solution Using GROUP BY and Conditional Sum
```sql
SELECT region,
       SUM(CASE WHEN NOT is_refund THEN amount ELSE 0 END) AS net_rev
FROM orders
GROUP BY region
HAVING SUM(CASE WHEN NOT is_refund THEN amount ELSE 0 END) > 0
ORDER BY net_rev DESC;
```
Why this works: Refunds are zeroed in the CASE; HAVING applies to the per-region aggregate, not individual rows.
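To see the query behave on data, here is a `sqlite3` run with invented sample rows (one refund, one zero-net region) so both the CASE and the HAVING filter fire:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL, is_refund INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, "west", 100, 0), (2, "west", 50, 1),  # refund zeroed by CASE
    (3, "east", 0, 0),                        # zero net -> dropped by HAVING
])
rows = conn.execute("""
    SELECT region,
           SUM(CASE WHEN NOT is_refund THEN amount ELSE 0 END) AS net_rev
    FROM orders
    GROUP BY region
    HAVING SUM(CASE WHEN NOT is_refund THEN amount ELSE 0 END) > 0
    ORDER BY net_rev DESC
""").fetchall()
print(rows)  # [('west', 100.0)]
```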
SQL · Topic — Aggregation problems (all companies)
COMPANY · Amazon — aggregation Amazon-tagged aggregation
COMPANY · Amazon — hub Amazon SQL (all difficulties)
When you want timed medium SQL, open Amazon SQL (medium).
2. Joins, anti-joins, and deduplication in SQL
Joins and deduplication in SQL for data engineering
A join answers: “For each row in table A, which rows in table B belong with it?” You declare that with ON a.key = b.key. In analytics you use joins to enrich events (orders, clicks, shipments) with reference data (customers, SKUs, fulfillment centers).
Fact vs dimension and grain (how data engineers talk about joins)
- Fact table: Many rows—often one row per event or transaction (order line, click, package scan)—the thing you `SUM` or `COUNT`.
- Dimension table: Fewer rows—one row per entity you describe (customer, product, warehouse)—the thing you `JOIN` to get names, categories, regions.
- Grain: "One result row per what?" (per order, per customer per day, per ASIN). Wrong grain after a join usually means fan-out (covered below): you aggregated at order level but summed a customer attribute that repeated after the join.
Worked example — fact + dimension: order_lines (fact: order_id, sku, qty) joins products (dimension: sku, list_price, title). SUM(qty) by sku is correct at line grain. If products accidentally had duplicate rows per sku, one join could inflate SUM(list_price) when you sum price alongside lines—fix the dimension so there is exactly one row per sku before you trust price totals.
Inner join (JOIN / INNER JOIN)
- Result: Only rows where the `ON` condition succeeds on both sides.
- Use when: You only care about matched pairs—e.g. clicks that resolve to a known catalog row.
- `NULL` keys: `NULL = NULL` is not true in join logic, so inner joins drop rows with a null join key unless you `COALESCE` or filter explicitly.
Worked example — clicks and ads:
| click_id | user_id | ad_id |
|---|---|---|
| c1 | u1 | 10 |
| c2 | u1 | 99 |
ads
| ad_id | brand_id |
|---|---|
| 10 | 7 |
| 11 | 7 |
FROM clicks c JOIN ads a ON c.ad_id = a.ad_id returns one row: (c1, u1, 10, 7). Click c2 has ad_id = 99 with no ad row, so it disappears from an inner join.
Left outer join (LEFT JOIN)
- Result: Every row from the left table appears once per matching right row; if there is no match, right-side columns are `NULL`.
- Use when: You must keep all left entities even if they have zero events—e.g. all customers including those with no orders.
Worked example — users and orders:
users
| id | name |
|---|---|
| 1 | Ann |
| 2 | Bo |
orders
| id | user_id | amount |
|---|---|---|
| o1 | 1 | 50 |
FROM users u LEFT JOIN orders o ON u.id = o.user_id yields two rows: Ann + o1, and Bo + NULL (no order—right side null-padded).
Right outer join (RIGHT JOIN)
- Result: Mirror of `LEFT JOIN`: keep all rows from the right table; left columns `NULL` when unmatched.
- In practice: Rarely needed—swap tables and use `LEFT JOIN` so "keep all rows from the table I care about" always reads the same way.
Worked example: Same users and orders as above. FROM orders o RIGHT JOIN users u ON u.id = o.user_id returns the same two logical rows as FROM users u LEFT JOIN orders o …—only preferred style differs.
Full outer join (FULL OUTER JOIN)
- Result: Keeps rows that match plus unmatched rows from either side (missing side filled with `NULL`).
- Use when: You need a symmetric "everything from A and B" merge; shows up in reconciliation-style questions.
Worked example: If legacy has ids 1, 2 and new has 2, 3, a full outer join on id returns three logical keys 1, 2, 3 with NULLs where one side is missing.
Cross join (CROSS JOIN)
- Result: Cartesian product: every row of A paired with every row of B.
- Use when: You intentionally need all combinations (e.g. stores × quarters grid)—easy to explode row count; use sparingly.
Worked example: stores has 2 rows, quarters has 4 → CROSS JOIN produces 8 rows (every store × every quarter).
Semi-join vs anti-join (names, not syntax)
- Semi-join: "Keep rows from A where some match exists in B"—typically `WHERE EXISTS (SELECT 1 FROM B WHERE …)` or `WHERE key IN (SELECT key FROM B)`. You do not add B's columns to the result; you only filter A.
- Anti-join: "Keep rows from A with no match in B"—`LEFT JOIN B … WHERE B.key IS NULL` or `WHERE NOT EXISTS (SELECT 1 FROM B WHERE …)`. Same idea: filter A, not a full merge for reporting.
Worked example — customers has ids 1, 2, 3; table vip has customer_id = 2 only.
- Semi-join: `SELECT id FROM customers c WHERE EXISTS (SELECT 1 FROM vip v WHERE v.customer_id = c.id)` → 2.
- Anti-join: `SELECT c.id FROM customers c LEFT JOIN vip v ON v.customer_id = c.id WHERE v.customer_id IS NULL` → 1 and 3.
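Both patterns run unchanged in SQLite; a small `sqlite3` sketch with the same ids:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER)")
conn.execute("CREATE TABLE vip (customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,), (3,)])
conn.execute("INSERT INTO vip VALUES (2)")

# Semi-join: filter customers to those with a VIP match.
semi = [r[0] for r in conn.execute(
    "SELECT id FROM customers c WHERE EXISTS "
    "(SELECT 1 FROM vip v WHERE v.customer_id = c.id) ORDER BY id")]
# Anti-join: customers with no VIP row.
anti = [r[0] for r in conn.execute(
    "SELECT c.id FROM customers c LEFT JOIN vip v ON v.customer_id = c.id "
    "WHERE v.customer_id IS NULL ORDER BY c.id")]
print(semi, anti)  # [2] [1, 3]
```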
Fan-out (row multiplication)
If the join key is not unique on one side (one user, many orders), the “one” side’s columns repeat on every matching row. That is correct for order-grain analysis.
Bug pattern: Summing a user-level numeric column after the join counts that user’s balance multiple times. Fixes: aggregate first, or make the dimension one row per key before joining. SUM(DISTINCT …) rarely fixes business logic—prefer COUNT(DISTINCT user_id) or a subquery that dedupes.
Worked example:
users
| id | credit_balance |
|---|---|
| 1 | 500 |
orders
| user_id |
|---|
| 1 |
| 1 |
After users JOIN orders, credit_balance appears twice (500, 500). SUM(u.credit_balance) would add 1000, wrong for “total credit in the system.” Fix: sum order amounts, or MAX(u.credit_balance) per user if you only need the balance once.
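The fan-out bug and its fix can be demonstrated in a few lines of `sqlite3` (same toy rows as above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, credit_balance INTEGER)")
conn.execute("CREATE TABLE orders (user_id INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 500)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (1,)])

# Fan-out: the single user row repeats once per matching order.
(bad,) = conn.execute(
    "SELECT SUM(u.credit_balance) FROM users u JOIN orders o ON o.user_id = u.id"
).fetchone()
# Fix: take the balance once per user, then sum.
(good,) = conn.execute(
    "SELECT SUM(bal) FROM (SELECT MAX(credit_balance) AS bal FROM users GROUP BY id)"
).fetchone()
print(bad, good)  # 1000 500
```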
Deduplication
- `SELECT DISTINCT col` (or distinct on several columns) removes duplicate rows—fine when any duplicate row is interchangeable.
- Latest row per id: Prefer `ROW_NUMBER() OVER (PARTITION BY id ORDER BY time DESC) AS rn` in a CTE, then `WHERE rn = 1`, so you control ties and ordering.
- `DISTINCT ON (user_id)` (PostgreSQL) is a compact alternative when the dialect allows it.
Worked example — DISTINCT:
| visit_id | user_id |
|---|---|
| v1 | u1 |
| v2 | u1 |
| v3 | u2 |
SELECT DISTINCT user_id FROM visits → u1, u2.
Worked example — latest event per user:
| user_id | ts | page |
|---|---|---|
| u1 | t2 | /home |
| u1 | t1 | /cart |
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) assigns rn = 1 to (u1, t2, /home) and rn = 2 to (u1, t1, /cart); WHERE rn = 1 keeps only the latest row.
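Assuming a SQLite build with window-function support (3.25+), the dedup pattern looks like this, with integer timestamps standing in for t1/t2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts INTEGER, page TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [("u1", 2, "/home"), ("u1", 1, "/cart")])

# ROW_NUMBER per user, newest first; keep rn = 1 for the latest event.
rows = conn.execute("""
    SELECT user_id, ts, page
    FROM (
        SELECT user_id, ts, page,
               ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
        FROM events
    )
    WHERE rn = 1
""").fetchall()
print(rows)  # [('u1', 2, '/home')]
```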
Worked example — DISTINCT ON (PostgreSQL):
```sql
SELECT DISTINCT ON (user_id) user_id, ts, page
FROM events
ORDER BY user_id, ts DESC;
```
DISTINCT ON (user_id) requires ORDER BY to start with the same columns; the first row per user_id in that order is kept. For tie-breaking and clarity, ROW_NUMBER() in a CTE is often easier in interviews unless the prompt says PostgreSQL.
Common beginner mistakes
- Joining on the wrong key or ignoring `NULL` keys (they do not match in an inner join).
- Assuming one row per user after a join when the join multiplied rows.
- Using `COUNT(*)` when the question wanted distinct people—use `COUNT(DISTINCT user_id)`.
Joins-Based Interview Question on Distinct Users and Brands
Tables: clicks(user_id, ad_id) and ads(ad_id, brand_id). How many distinct users clicked any ad for brand_id = 7?
Solution Using INNER JOIN and COUNT DISTINCT
```sql
SELECT COUNT(DISTINCT c.user_id) AS distinct_users
FROM clicks c
JOIN ads a ON a.ad_id = c.ad_id
WHERE a.brand_id = 7;
```
Why this works: The join finds clicks whose ad belongs to brand 7; COUNT(DISTINCT user_id) counts people, not click rows.
Joins-Based Interview Question on Customers With No Orders
Tables: customers(customer_id), orders(order_id, customer_id). Return customer_id for customers with no orders.
Solution Using LEFT JOIN and NULL Filter
```sql
SELECT c.customer_id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_id IS NULL;
```
Why this works: Unmatched rows still appear from customers; orders columns are NULL, which we filter—classic anti-join pattern.
SQL · Topic — Joins & deduplication
COMPANY · Amazon — joins Amazon-tagged joins
COMPANY · Amazon — subqueries Amazon-tagged subqueries
3. Window functions and ranking in SQL
Window functions and ranking in SQL for analytics loops
GROUP BY collapses many rows into one row per group. Window functions keep every input row and attach a value computed over a window of rows—usually “all rows in the same partition that sort before or with this row.”
Parts of a window: PARTITION BY (optional) splits the table into independent mini-tables—like “per customer” or “per FC per day.” ORDER BY inside OVER (...) defines sequence within the partition. ROWS BETWEEN … (optional) trims which neighbors count for running aggregates.
When to reach for windows: Top-N per group, gaps between events (LAG / LEAD), running revenue, share of partition (SUM(x) OVER (...) / SUM(SUM(x)) in a subquery pattern), deduping with ROW_NUMBER() (see §2).
Ranking: ROW_NUMBER, RANK, DENSE_RANK
All three need ORDER BY inside OVER.
- `ROW_NUMBER()`: Unique integer per row in the partition after sorting—no ties; the engine picks an arbitrary order among equal keys unless you add tie-break columns.
- `RANK()`: Tied rows get the same rank; the next rank skips (1, 2, 2, 4).
- `DENSE_RANK()`: Tied rows get the same rank; no gaps (1, 2, 2, 3).
Worked example — three employees in dept 10 with salaries 90k, 80k, 80k (sorted desc):
| emp | salary | ROW_NUMBER | RANK | DENSE_RANK |
|---|---|---|---|---|
| A | 90 | 1 | 1 | 1 |
| B | 80 | 2 | 2 | 2 |
| C | 80 | 3 | 2 | 2 |
Use ROW_NUMBER when the prompt says “pick one” and gives a deterministic tie-break (e.g. lowest user_id). Use DENSE_RANK when ties should share a bucket (e.g. “top three distinct salary tiers”).
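Here is a runnable comparison of the three functions on the same salary data (requires SQLite 3.25+ for window functions; names are the toy rows above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emps (emp TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emps VALUES (?, ?)",
                 [("A", 90), ("B", 80), ("C", 80)])

rows = conn.execute("""
    SELECT emp,
           ROW_NUMBER() OVER (ORDER BY salary DESC, emp) AS row_num,
           RANK()       OVER (ORDER BY salary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk
    FROM emps
    ORDER BY salary DESC, emp
""").fetchall()
print(rows)  # [('A', 1, 1, 1), ('B', 2, 2, 2), ('C', 3, 2, 2)]
```

Note the tie-break column `emp` inside ROW_NUMBER's ORDER BY: without it, the 2 vs 3 assignment between B and C would be arbitrary.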
LAG / LEAD
LAG(col, n) looks n rows back in the partition order; LEAD looks forward. Great for time between events, sessionization, and YoY same-row comparisons.
Worked example — gap between orders:
| customer_id | order_ts | prev_ts |
|---|---|---|
| 1 | t1 | NULL |
| 1 | t3 | t1 |
LAG(order_ts) OVER (PARTITION BY customer_id ORDER BY order_ts) on row (1, t3) yields t1; first row per customer gets NULL lag.
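The same LAG behavior, with integers standing in for t1/t3 (SQLite 3.25+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_ts INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1), (1, 3)])

# LAG looks one row back within each customer's time order.
rows = conn.execute("""
    SELECT customer_id, order_ts,
           LAG(order_ts) OVER (PARTITION BY customer_id ORDER BY order_ts) AS prev_ts
    FROM orders
    ORDER BY order_ts
""").fetchall()
print(rows)  # [(1, 1, None), (1, 3, 1)] -- first row has NULL lag
```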
Running totals
Worked example: Cumulative amount per seller_id in order_ts order:
```sql
SELECT seller_id, order_ts, amount,
       SUM(amount) OVER (
           PARTITION BY seller_id
           ORDER BY order_ts
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM seller_orders;
```
(If the dialect defaults frame correctly for running sum, the explicit ROWS clause is optional—but stating it shows you understand frame rules.)
Common beginner mistakes
- Omitting `ORDER BY` inside `OVER (…)`—ranking and `LAG` become meaningless or nondeterministic.
- Using `ROW_NUMBER` when multiple rows must tie for "first place" under the business rule.
- Forgetting `PARTITION BY` and accidentally ranking the whole table as one group.
SQL Interview Question on Top Spender per Region
Table: spend(user_id, region, amt). Return the user_id with the highest amt in each region (break ties by lowest user_id).
Solution Using ROW_NUMBER
```sql
SELECT user_id, region, amt
FROM (
    SELECT user_id, region, amt,
           ROW_NUMBER() OVER (
               PARTITION BY region ORDER BY amt DESC, user_id ASC
           ) AS rn
    FROM spend
) t
WHERE rn = 1;
```
Why this works: ROW_NUMBER assigns a unique order per partition; rn = 1 is the chosen winner per region under the stated tie-break.
SQL Interview Question on Consecutive Order Gap
Table: orders(order_id, customer_id, order_date) with order_date as a DATE. Return orders where the gap in whole days from this customer’s previous order is strictly greater than 7. Skip customers’ first order (no predecessor).
Solution Using LAG
```sql
SELECT order_id, customer_id, order_date, prev_date,
       order_date - prev_date AS gap_days
FROM (
    SELECT order_id, customer_id, order_date,
           LAG(order_date) OVER (
               PARTITION BY customer_id ORDER BY order_date
           ) AS prev_date
    FROM orders
) t
WHERE prev_date IS NOT NULL
  AND order_date - prev_date > 7;
```
Why this works: For DATE, subtraction yields an integer day count in PostgreSQL-style SQL. LAG lines up each row with the prior order_date per customer_id; prev_date IS NOT NULL removes first orders.
SQL · Topic — Window functions (SQL)
COMPANY · Amazon — windows Amazon-tagged window functions
For more pattern drills, browse SQL deep dives.
4. Date functions and time-series boundaries in SQL
Date arithmetic and buckets in SQL for operations metrics
Operational questions—“shipped within N days,” “last 30 days of clicks,” “revenue by calendar month”—are really three skills: typing (DATE vs TIMESTAMP), time zones (state assumptions), and inclusive vs exclusive bounds.
Half-open ranges
Prefer [start, end): WHERE ts >= start AND ts < end. That avoids double-counting events exactly at midnight on a boundary and mirrors how many warehouses partition data.
Calendar buckets with DATE_TRUNC
Worked example (PostgreSQL): Truncate click timestamps to UTC day:
```sql
SELECT DATE_TRUNC('day', click_ts AT TIME ZONE 'UTC') AS day_utc, COUNT(*) AS clicks
FROM clicks
GROUP BY 1;
```
Use week, month, quarter the same way—then GROUP BY the truncated value.
Day differences and thresholds
Worked example: If order_date and ship_date are DATE, then ship_date - order_date <= 3 means “shipped within three calendar days” including same-day ship as 0—confirm with the interviewer whether same-day counts as “within 1 day” or “within 0 days.”
INTERVAL arithmetic on TIMESTAMP is dialect-specific; say whether you assume UTC, warehouse local, or event local.
Common beginner mistakes
- Inclusive end bounds on timestamps: `BETWEEN '2020-01-01' AND '2020-12-31'` silently drops most of December 31 once the column is a timestamp, and extending the end to `'2021-01-01'` double-counts midnight of January 1—prefer half-open ranges.
- Mixing `TIMESTAMP` and `DATE` without `CAST`.
- Forgetting leap years or DST when comparing "exactly 24 hours" vs calendar days.
SQL Interview Question on Revenue in a Calendar Year
Table: orders(order_id, order_date, zip_code, amount). Return the zip_code with the largest total amount in calendar year 2020.
Solution Using WHERE Range and GROUP BY
```sql
SELECT zip_code, SUM(amount) AS revenue
FROM orders
WHERE order_date >= DATE '2020-01-01' AND order_date < DATE '2021-01-01'
GROUP BY zip_code
ORDER BY revenue DESC
LIMIT 1;
```
Why this works: The date predicate fixes the year grain; GROUP BY zip rolls up to the question’s unit; LIMIT 1 picks the top (ties need a tie-break rule—ask).
SQL Interview Question on Rolling Activity Window
Table: events(user_id, event_ts) (TIMESTAMP). Return user_id values that had at least one event in the last 7 full days through CURRENT_DATE (exclusive of today’s partial day), using date arithmetic.
Solution Using Date Cast and Half-Open Bound
```sql
SELECT DISTINCT user_id
FROM events
WHERE event_ts >= (CURRENT_DATE - INTERVAL '7 days')
  AND event_ts < CURRENT_DATE;
```
Why this works: event_ts < CURRENT_DATE drops anything from today after midnight if you interpret “through yesterday” as the business rule; adjust if the prompt wants inclusive today. The lower bound is seven days before CURRENT_DATE at midnight.
SQL · Topic — Time-series SQL
COMPANY · Amazon — dates Amazon-tagged date functions
Practice time-series SQL patterns after this section—the hub matches many “consecutive day” and cohort-style prompts.
5. Conditional logic and percentage metrics in SQL
CASE expressions and ratios in SQL for DE interviews
CASE has two shapes: simple (CASE col WHEN val THEN …) and searched (CASE WHEN predicate THEN … END). In interviews, searched form is more common because you can express ranges and NULL checks.
CASE in SELECT: Derive tier labels (SLA buckets, price bands). CASE inside aggregates: Count or sum only rows matching a condition—same idea as conditional aggregation in §1.
Percentages: Always CAST or multiply by 1.0 so you do not hit integer division (3/10 → 0 in many engines). Guard the denominator with NULLIF(denominator, 0) so you return NULL instead of a runtime error.
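Both pitfalls are one query away in `sqlite3`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Integer / integer truncates; * 1.0 promotes to float; NULLIF guards zero.
trunc, safe, guarded = conn.execute(
    "SELECT 3 / 10, 3 * 1.0 / 10, 1.0 / NULLIF(0, 0)"
).fetchone()
print(trunc, safe, guarded)  # 0 0.3 None
```

Dividing by `NULLIF(0, 0)` yields NULL (Python `None`) rather than a divide-by-zero error.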
Rejection rate pattern (portable)
Worked example — per role_id, share of rows that are rejected among decided applications (rejected or accepted):
```sql
SELECT role_id,
       SUM(CASE WHEN status = 'rejected' THEN 1 ELSE 0 END) * 1.0
       / NULLIF(
           SUM(CASE WHEN status IN ('rejected', 'accepted') THEN 1 ELSE 0 END),
           0
         ) AS rejection_rate
FROM applications
GROUP BY role_id;
```
Numerator and denominator both ignore pending rows by giving them 0 in their respective CASE arms.
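An in-memory check of this portable form, with invented rows including an all-pending role to exercise the `NULLIF` guard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE applications (id INTEGER, role_id INTEGER, status TEXT)")
conn.executemany("INSERT INTO applications VALUES (?, ?, ?)", [
    (1, 10, "rejected"), (2, 10, "accepted"), (3, 10, "pending"),  # pending ignored
    (4, 20, "pending"),                                            # no decided rows
])
rows = conn.execute("""
    SELECT role_id,
           SUM(CASE WHEN status = 'rejected' THEN 1 ELSE 0 END) * 1.0
           / NULLIF(SUM(CASE WHEN status IN ('rejected', 'accepted') THEN 1 ELSE 0 END), 0)
             AS rejection_rate
    FROM applications
    GROUP BY role_id
    ORDER BY role_id
""").fetchall()
print(rows)  # [(10, 0.5), (20, None)] -- all-pending role returns NULL, not an error
```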
Common beginner mistakes
- Dividing two integers and getting 0 everywhere.
- Using `COUNT(*)` as the denominator when the question wanted only completed stages.
- Returning a percentage as 0–1 when the interviewer asked for 0–100—state units.
SQL Interview Question on Rejection Rate by Role
Table: applications(id, role_id, status) with status in ('rejected','accepted',...). Return each role_id and rejection rate among terminal rows where status is either rejected or accepted.
Solution Using FILTER (PostgreSQL)
```sql
SELECT role_id,
       COUNT(*) FILTER (WHERE status = 'rejected')::float
       / NULLIF(COUNT(*) FILTER (WHERE status IN ('rejected', 'accepted')), 0)
         AS rejection_rate
FROM applications
GROUP BY role_id;
```
Why this works: FILTER is concise PostgreSQL syntax; numerator and denominator both restrict to decided applications; NULLIF avoids divide-by-zero. For engines without FILTER, use the SUM(CASE …) form above.
SQL Interview Question on Click-Through Rate by Campaign
Tables: impressions(campaign_id, user_id) and clicks(campaign_id, user_id) (one row per user per campaign in each table if they saw / clicked). Return each campaign_id and ctr = clicks / impressions among users who had an impression; use FLOAT (or REAL) division.
Solution Using LEFT JOIN and Conditional Counts
```sql
SELECT i.campaign_id,
       SUM(CASE WHEN c.user_id IS NOT NULL THEN 1 ELSE 0 END) * 1.0
       / NULLIF(COUNT(DISTINCT i.user_id), 0) AS ctr
FROM impressions i
LEFT JOIN clicks c
  ON c.campaign_id = i.campaign_id AND c.user_id = i.user_id
GROUP BY i.campaign_id;
```
Why this works: Grain is one row per (campaign, user) on the left; LEFT JOIN attaches a click if any; SUM(CASE WHEN c.user_id IS NOT NULL …) counts distinct click matches per impression row—here one row per user so it equals “clicked or not.” COUNT(DISTINCT i.user_id) is the impression denominator. If duplicate impression rows exist, dedupe first or use COUNT(*) vs DISTINCT consistently with the prompt.
SQL · Topic — Percentage SQL
COMPANY · Amazon — conditional logic Amazon-tagged conditional logic
COMPANY · Amazon — percentage Amazon-tagged percentage calculation
6. Hash maps, arrays, and ETL-style Python for DE screens
Dicts, sets, and counting in Python for data processing
Amazon Python screens often look like small ETL: group records, aggregate in memory, join two lists by id, or validate uniqueness. The winning move is usually dict / defaultdict / Counter / set so you stay O(n) instead of O(n²) nested scans.
When to use what
- `set`: Membership tests (`x in seen`), dedupe keys, "have I seen this?" while scanning once.
- `dict`: Map id → object, id → running sum, id → last timestamp—your in-memory join key.
- `collections.Counter`: Frequencies of hashable items; `Counter.most_common(k)` for top-k.
- `collections.defaultdict(list)`: `group_by` patterns—append all rows for each key, then post-process.
Frequency with Counter
Worked example: skus = ["A","B","A","A"] → Counter(skus) → Counter({'A': 3, 'B': 1}).
Complement set / “missing keys”
Worked example: Given all_ids and completed_ids, set(all_ids) - set(completed_ids) is the anti-join of ids—same logic as LEFT JOIN … IS NULL in SQL.
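Both idioms in one runnable snippet (the sample lists are illustrative):

```python
from collections import Counter

# Frequency count and top-k in two lines.
skus = ["A", "B", "A", "A"]
counts = Counter(skus)
print(counts.most_common(1))  # [('A', 3)]

# Set difference as an anti-join on ids: which ids never completed?
all_ids = [1, 2, 3, 4]
completed_ids = [2, 4]
missing = set(all_ids) - set(completed_ids)
print(sorted(missing))  # [1, 3]
```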
Python Interview Question on First Duplicate ID
Input: ids: list[int]. Return the first value that appears twice while scanning left-to-right; if none, return None.
Solution Using a Set
```python
def first_duplicate(ids: list[int]) -> int | None:
    seen: set[int] = set()
    for x in ids:
        if x in seen:
            return x
        seen.add(x)
    return None
```
Why this works: The set holds earlier elements; the first repeat is detected on the second occurrence.
Python Interview Question on Merge SKU Quantities
Input: Two lists of dicts loads_a and loads_b, each like {"sku": str, "qty": int}. Return a list of dicts with one entry per sku and qty equal to the sum of quantities from both lists. If a sku appears only once, still include it.
Solution Using defaultdict
```python
from collections import defaultdict

def merge_skus(loads_a: list[dict], loads_b: list[dict]) -> list[dict]:
    totals: dict[str, int] = defaultdict(int)
    for row in loads_a:
        totals[row["sku"]] += row["qty"]
    for row in loads_b:
        totals[row["sku"]] += row["qty"]
    return [{"sku": sku, "qty": q} for sku, q in sorted(totals.items())]
```
Why this works: One pass per list (O(n)), dict accumulates by sku, then materialize output. sorted on keys makes tests deterministic; drop it if order does not matter.
PYTHON · Topic — Hash tables & counting
COMPANY · Amazon — hash Amazon-tagged hash tables
COMPANY · Amazon — array Amazon-tagged arrays
Open the full Amazon Python problem set when you want language-filtered reps.
7. Star schema, grain, and SCD patterns in data engineering
Dimensional modeling and SCD patterns for warehouse interviews
Star schema is the default story for warehouse interviews: a fact table at the center (numeric measures + foreign keys to dimensions) surrounded by dimension tables (descriptive context). Snowflake normalizes dimensions further (e.g. brand → subcategory → category); know the tradeoff: fewer joins vs more tables.
Fact grain (say this first): “One row per…” order line? Package scan? Daily inventory snapshot? Mixing order-level and line-level measures in one fact without a clear rule is a red flag.
Surrogate keys: Warehouse sk integers on dimensions decouple the warehouse from source system ids that might change or collide across systems. Facts store customer_sk, not raw email.
Role-play: “If marketing changes a product’s category retroactively, do we restate history?”—that question picks Type 1 vs 2.
SCD Type 1: Overwrite the dimension row—easy, but history is lost. Use when only the current truth matters (fix a typo, current brand name).
SCD Type 2: Insert a new row with valid_from / valid_to (or is_current). Facts point to the sk row that was active when the event occurred—as-of reporting stays correct.
Worked example — Type 2 rows for one natural customer:
| customer_sk | natural_id | city | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 101 | U1 | Seattle | 2020-01-01 | 2022-05-31 | false |
| 102 | U1 | Austin | 2022-06-01 | NULL | true |
An order in 2021 should reference customer_sk = 101; a current dashboard filters is_current = true (102).
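A minimal in-memory sketch of the Type 2 update mechanics (every field, function name, and date here is illustrative, not a fixed schema):

```python
from datetime import date, timedelta

def apply_scd2_change(rows, natural_id, new_city, change_date, next_sk):
    """Close the current version and append a new one (SCD Type 2)."""
    for row in rows:
        if row["natural_id"] == natural_id and row["is_current"]:
            row["is_current"] = False
            # Inclusive end bound, one day before the new version starts.
            row["valid_to"] = change_date - timedelta(days=1)
    rows.append({
        "customer_sk": next_sk, "natural_id": natural_id, "city": new_city,
        "valid_from": change_date, "valid_to": None, "is_current": True,
    })
    return rows

history = [{"customer_sk": 101, "natural_id": "U1", "city": "Seattle",
            "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True}]
history = apply_scd2_change(history, "U1", "Austin", date(2022, 6, 1), next_sk=102)
# history now matches the table: sk 101 closed at 2022-05-31, sk 102 current.
```

A real warehouse does this as a MERGE or insert-plus-update in the dimension load job; the in-memory version only shows the versioning rule.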
Common beginner mistakes
Mixing grains in one fact (lines and orders in the same table without a clear rule).
Storing natural keys on facts when dimensions are Type 2—you lose as-of joins.
Type 2 without
valid_to/is_current—hard to write point-in-time queries.
Data Engineering Interview Question on Customer Address History
Explain how you store customer addresses so historical orders show the address at order time while ops dashboards show the current address.
Solution Using SCD Type 2 plus current flag
Use Type 2 on dim_customer with valid_from, valid_to, is_current. Facts store customer_sk referencing the version active when the order was placed. Analysts filter is_current = true for “where they live now.”
Why this works: Facts remain true to history; the dimension versions track change without rewriting old keys.
SQL · Topic — Dimensional modeling
COMPANY · Amazon — modeling Amazon-tagged dimensional modeling
COMPANY · Amazon — SCD Amazon-tagged slowly changing data
COMPANY · Amazon — hub Amazon data modeling filter
Deep dives: Data modeling for data engineering interviews and ETL system design for data engineering interviews.
Tips to crack Amazon data engineering interviews
Data engineering interview preparation for Amazon-shaped loops works best in lanes: SQL grain → joins → windows → dates → CASE/percentages → Python structures → a clean verbal on modeling.
SQL preparation
Drill WHERE vs HAVING, fan-out, and partition keys on paper before you code. State timezone and tie-break assumptions early.
Pipeline and ETL thinking
Be ready to discuss idempotent loads, late-arriving facts, and backfills at a high level—pair that narrative with ETL topic practice.
Where to practice on PipeCode
| Lane | Amazon-focused path |
|---|---|
| Company hub | /explore/practice/company/amazon |
| SQL | /explore/practice/company/amazon/sql |
| Python | /explore/practice/company/amazon/python |
| Medium difficulty | /explore/practice/company/amazon/difficulty/medium |
Courses that match this article: SQL for data engineering interviews, Python fundamentals, plus the modeling and ETL links in §7.
Communication under time pressure
Narrate grain first, then SQL. When stuck, check join cardinality and NULL behavior—that is how experienced DE candidates de-risk Amazon SQL and Python rounds without spiraling.
Frequently asked questions
What types of Amazon data engineering interview questions show up—SQL, Python, or system design?
Expect heavy SQL (joins, aggregates, windows, dates), Python structured coding, and design discussion on pipelines and warehouses. This article mirrors that mix; use the Amazon practice hub to align reps with those lanes.
How hard are Amazon data engineering SQL problems compared to other FAANG companies?
Difficulty is broadly comparable to other large-tech data engineering loops: medium SQL with edge checks is typical. Mix easy warmups with harder multi-step prompts so you are not surprised by edge cases on the day.
Does Amazon test Python algorithms or data engineering coding more in DE loops?
You should expect both: hash maps, sorting, and string or JSON transforms appear alongside SQL. It is usually not identical to a pure SWE loop, but algorithmic Python still matters.
How should I prepare for Amazon data modeling or warehouse design questions?
Practice star schema, fact grain, and SCD Type 1 vs 2 out loud; explain surrogate keys and as-of joins. Pair verbal reps with dimensional modeling drills and the data modeling course linked in §7.
How many Amazon-tagged practice problems should I finish before the interview?
There is no magic number, but 30–50 mixed problems across SQL, Python, and a few modeling prompts is a solid preparation cadence if you review each submission.
Where can I practice Amazon-style data engineering questions with tests and feedback?
PipeCode runs 450+ interview-style problems with tests and AI feedback; start at /explore/practice/company/amazon for company-tagged sets, then branch into topic pages for weak spots.
Start practicing Amazon data engineering problems
Pair company-tagged Amazon reps with an environment that runs tests and surfaces wrong assumptions early, so you move from reading patterns to typing your own solutions under time pressure.
Pipecode.ai is Leetcode for Data Engineering