Gowtham Potureddi

Salesforce Data Engineering Interview Questions

Salesforce data engineering interview questions lean SQL-heavy with a serious analytics edge: five SQL primitives (subquery + aggregation for Nth-largest values, cohort aggregation for retention, self-join with date arithmetic for consecutive-day activity, window functions with LAG for month-over-month growth, and aggregation with string manipulation for volume reports) plus two Python primitives that test data-engineering fundamentals (hash-table design for atomic transaction state, and closures with higher-order functions for pipeline composition). The framings are enterprise SaaS analytics—salaries, retention, activity cohorts, coupon revenue, warehouse throughput, transaction state, and ETL function composition.

This guide walks through the seven topic clusters Salesforce actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 7-problem Salesforce set (1 easy, 5 medium, 1 hard)—a medium-heavy hub that rewards window-function fluency and cohort-style SQL over algorithm-puzzle Python.



Top Salesforce data engineering interview topics

From the Salesforce data engineering practice set, the seven numbered sections below follow this topic map (one row per H2):

| # | Topic (sections 1–7) | Why it shows up at Salesforce |
|---|----------------------|-------------------------------|
| 1 | SQL subqueries for top-N salary queries | Second Largest Salary—MAX(x) WHERE x < (SELECT MAX(x)) or LIMIT 1 OFFSET 1. |
| 2 | SQL cohort aggregation for user retention rate | User Retention Rate—define cohort with MIN(login_date), divide retained over cohort size. |
| 3 | SQL self-join with date arithmetic for consecutive-day activity | Users Active 3 Consecutive Days—triple self-join with t2.date = t1.date + 1 invariants. |
| 4 | SQL window functions for month-over-month growth | Coupon Sales Analysis With MoM Growth—LAG(units) OVER (ORDER BY month) over a DATE_TRUNC partition. |
| 5 | SQL aggregation with string manipulation for volume reports | Warehouse Volume Metrics Report—SPLIT_PART to parse composite keys, then aggregate. |
| 6 | Python hash tables for atomic transaction state | Atomic Transaction Dictionary—dict-as-state with snapshot/commit/rollback. |
| 7 | Python closures and higher-order functions for pipelines | Function Composition Pipeline—compose(f, g)(x) == f(g(x)) plus functools.reduce. |

Enterprise-analytics framing rule: Salesforce's prompts span B2B SaaS analytics—salaries, retention, activity cohorts, coupon revenue, warehouse throughput, transaction state. The interviewer is grading whether you map each business framing to the right primitive: Nth-largest → subquery; retention → cohort + division; consecutive days → self-join + date arithmetic; MoM growth → LAG over a DATE_TRUNC partition; volume report → aggregate over a parsed dimension; transactional state → hash-table snapshot; ETL pipeline → closure-based composition. State the mapping out loud.


1. SQL Subqueries for Top-N Salary Queries

Subqueries for Nth-largest values in SQL for data engineering

"Find the second-largest salary" is the classic SQL subquery interview prompt. The mental model: the second-largest value is the maximum of all values strictly less than the global maximum—a one-line nested aggregate. Three idioms work; choose for clarity and edge-case handling: nested MAX, LIMIT 1 OFFSET N-1, or DENSE_RANK(). Each has subtly different tie semantics.

Pro tip: Watch the empty-set edge case. If the table has only one distinct salary, the "second-largest" is conventionally NULL. The nested MAX form returns NULL automatically; LIMIT/OFFSET returns no row and your application has to handle "0 rows" vs "1 row with NULL."

Why MAX alone fails for second-largest

MAX(salary) returns one value—the global maximum. To get the second-largest, you need values strictly below the maximum. The invariant: second-largest = MAX of the subset where value < MAX(all). Drop < for <= and you'll silently return the global max again on tied top values.

  • MAX(salary) alone is the global max—not the answer.
  • MAX(salary) WHERE salary < global_max is the answer; the predicate filter is the active ingredient.
  • Tied tops: if two employees share the top salary, both are filtered out by <, so the nested MAX returns the next distinct salary below them, which is usually the intended answer; DENSE_RANK makes the ties-share-one-rank semantics explicit.
  • Worked example: salaries [100, 200, 300, 300]. MAX = 300; filter < 300 leaves [100, 200]; MAX of that is 200 → second-largest distinct = 200.
-- WRONG: returns the global max (300)
SELECT MAX(salary) FROM employee;

Subquery pattern: MAX(x) WHERE x < (SELECT MAX(x))

The canonical form for second-largest is a scalar subquery in the WHERE clause. The outer query computes MAX(salary) over the filtered subset; the inner scalar subquery computes the global maximum. The invariant: the scalar subquery returns one value; the outer MAX aggregates the filtered rows.

  • Scalar subquery — one row, one column; usable in any expression slot.
  • < operator — strict, so tied top rows are excluded.
  • Empty-set behavior — if the filtered subset is empty (only one distinct salary), MAX returns NULL.
  • Worked example: for [100, 200, 300, 300], the scalar subquery returns 300; the outer MAX(salary) WHERE salary < 300 returns 200.
| step | expression | result |
|------|------------|--------|
| 1 | (SELECT MAX(salary) FROM employee) | 300 |
| 2 | salary < 300 filter | rows with salary 100, 200 |
| 3 | MAX(salary) over filtered | 200 |

SELECT MAX(salary) AS second_max_salary
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);

LIMIT 1 OFFSET N-1 and DENSE_RANK() alternatives

Two alternative idioms cover broader cases. LIMIT 1 OFFSET N-1 on a sorted distinct list returns the Nth value—generalizes cleanly to "third-largest," "fifth-largest," etc. DENSE_RANK() solves the "what counts as second" question explicitly: ties at the top compress into rank 1, so rank 2 is the next distinct value.

  • SELECT DISTINCT salary FROM employee ORDER BY salary DESC LIMIT 1 OFFSET 1 — Nth-distinct generalization (use OFFSET N-1).
  • DENSE_RANK() OVER (ORDER BY salary DESC) = 2 — explicit rank semantics; ties share rank, no gap.
  • Empty-set caveat — LIMIT/OFFSET returns 0 rows on insufficient data; the scalar-subquery form returns 1 row with NULL. Choose based on what the application expects.
  • Worked example: [100, 200, 300, 300] — DISTINCT becomes [100, 200, 300]; ORDER BY DESC → [300, 200, 100]; LIMIT 1 OFFSET 1 → 200.
-- LIMIT/OFFSET form (returns 0 rows if no second salary)
SELECT salary AS second_max_salary
FROM (SELECT DISTINCT salary FROM employee) d
ORDER BY salary DESC
LIMIT 1 OFFSET 1;

-- DENSE_RANK form (returns 0 rows if no second salary)
SELECT salary AS second_max_salary
FROM (
  SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS rk
  FROM employee
) ranked
WHERE rk = 2
LIMIT 1;

Common beginner mistakes

  • Using MAX(salary) WHERE salary <= MAX(salary)—returns the global max because <= lets the top tie through.
  • Forgetting DISTINCT in the LIMIT/OFFSET form on tied data—returns the same top salary twice.
  • Confusing RANK and DENSE_RANK for "second-largest distinct"—RANK skips numbers on ties, DENSE_RANK doesn't.
  • Running two separate queries (max, then "exclude max")—slower and verbose; the nested subquery is one round-trip.
  • Returning a row count of zero and forgetting to handle it in application code—the SQL works, the consumer breaks.

SQL interview question on second-largest salary

Table employee(id INT, salary INT). Return the second-highest distinct salary. If there is no second salary (e.g. only one distinct value), return NULL. Output a single column second_max_salary.

Solution using a scalar subquery with MAX

SELECT MAX(salary) AS second_max_salary
FROM employee
WHERE salary < (SELECT MAX(salary) FROM employee);

Step-by-step trace (input: salaries [100, 200, 300, 300]):

  1. Inner query first — (SELECT MAX(salary) FROM employee) scans employee, computes MAX = 300. Returns one scalar value.
  2. Outer WHERE filter — applies salary < 300 to every row of employee. Rows with salary 300 (both of them) are excluded; rows with 100 and 200 survive.
  3. Outer MAX — aggregates the surviving rows: MAX(100, 200) = 200.
  4. Final result — single row, single column.
  5. Empty-set edge case — if every row had salary 300, step 2 leaves zero rows; MAX over an empty set returns NULL, which the outer query emits as the answer (one row, one NULL value).

Output:

second_max_salary
200

Why this works — concept by concept:

  • Scalar subquery — (SELECT MAX(salary) FROM employee) returns exactly one value (the global maximum); usable in any expression slot, including a WHERE predicate.
  • Strict less-than filter — WHERE salary < global_max excludes every row at the top tier, leaving values strictly below.
  • Outer MAX — over the filtered subset, the maximum is by definition the second-largest distinct value.
  • Empty-set semantics — if only one distinct salary exists, the filter leaves zero rows; MAX over an empty set returns NULL — exactly the expected behavior, no special-case code needed.
  • Cost — two scans (or one scan plus an aggregate cache) on employee, both O(N).



2. SQL Cohort Aggregation for User Retention Rate

Cohort-style retention with subqueries in SQL for data engineering

User retention is the canonical SaaS KPI: of users who first appeared in week W, what fraction came back in week W+1? The mental model: retention = retained_users / cohort_size. The numerator filters users who logged in during the retention window; the denominator counts the cohort itself. Two aggregates, one ratio—the structure is identical for "next-day retention," "30-day retention," "monthly retention."

Pro tip: Always pin the cohort in a CTE (or scalar subquery) and compute the numerator on top of it. Mixing cohort definition and retention filter into a single WHERE clause with OR produces ambiguous semantics that an interviewer will pick apart.

Define a cohort: MIN(login_date) GROUP BY user_id

A cohort is the set of users who first did something on a specific date or in a specific bucket. The first-login date per user is the per-user MIN(login_date). The invariant: MIN(login_date) GROUP BY user_id defines the cohort key—truncate to a week / month / quarter to bucket.

  • MIN(login_date) per user — the cohort definition.
  • DATE_TRUNC('week', MIN(login_date)) — bucket users into weekly cohorts.
  • HAVING MIN(login_date) >= '2026-04-01' — restrict the cohort window directly in the same query.
  • Worked example: three users with first-login dates April 1, April 1, April 2.
| user_id | first_login | week_cohort |
|---------|-------------|-------------|
| 1 | 2026-04-01 | 2026-03-30 |
| 2 | 2026-04-01 | 2026-03-30 |
| 3 | 2026-04-02 | 2026-03-30 |

All three are in the same week cohort starting Monday 2026-03-30.

SELECT
  user_id,
  MIN(login_date) AS first_login,
  DATE_TRUNC('week', MIN(login_date))::date AS week_cohort
FROM logins
GROUP BY user_id;

Retention formula: retained_users / cohort_size

The retention formula is fraction of cohort users who came back during the retention window. The shape: numerator = COUNT(DISTINCT user_id) matching the retention predicate; denominator = COUNT(DISTINCT user_id) in the cohort. Convert to a percentage with 100.0 * num / denom and NULLIF to guard against an empty cohort.

  • Numerator: users in the cohort who logged in during the retention window.
  • Denominator: total users in the cohort.
  • Boundary semantics: "next-day retention" usually means login_date = first_login + 1; "7-day retention" means login_date > first_login AND login_date <= first_login + 7, a half-open window that excludes the first login itself (matching the queries below).
  • Worked example: 100 users in the cohort, 60 returned the following day → 60% retention.
WITH cohort AS (
  SELECT user_id, MIN(login_date) AS first_login
  FROM logins
  GROUP BY user_id
)
SELECT
  ROUND(
    100.0 * COUNT(DISTINCT l.user_id) FILTER (WHERE l.login_date = c.first_login + 1)
          / NULLIF(COUNT(DISTINCT c.user_id), 0),
    2
  ) AS next_day_retention_pct
FROM cohort c
JOIN logins l ON l.user_id = c.user_id;

Subquery vs. CTE for cohort sizing

You can pin the cohort in a CTE (named, readable, reusable) or a scalar subquery (compact, single-use). The decision rule: CTE for multi-step queries, scalar subquery for one-off filters. The invariant: the cohort definition must be one consistent set everywhere it's referenced—mixing two slightly different definitions in numerator and denominator silently corrupts the retention number.

  • CTE form — WITH cohort AS (...) SELECT ... FROM cohort JOIN logins .... Readable; supports multi-step.
  • Subquery form — (SELECT user_id FROM logins GROUP BY user_id HAVING MIN(login_date) = '2026-04-01'). Compact for single-use cases.
  • Most engines — CTEs and subqueries plan equivalently; pick by readability, not performance.
  • Worked example: CTE-based 7-day retention reusing the same cohort set in numerator and denominator.
WITH cohort AS (
  SELECT user_id, MIN(login_date) AS first_login
  FROM logins
  WHERE login_date >= '2026-04-01' AND login_date < '2026-04-08'
  GROUP BY user_id
)
SELECT
  c.first_login AS cohort_date,
  COUNT(DISTINCT c.user_id) AS cohort_size,
  COUNT(DISTINCT l.user_id) FILTER (
    WHERE l.login_date > c.first_login
      AND l.login_date <= c.first_login + INTERVAL '7 days'
  ) AS retained_7d,
  ROUND(
    100.0 * COUNT(DISTINCT l.user_id) FILTER (
      WHERE l.login_date > c.first_login
        AND l.login_date <= c.first_login + INTERVAL '7 days'
    ) / NULLIF(COUNT(DISTINCT c.user_id), 0),
    2
  ) AS retention_7d_pct
FROM cohort c
JOIN logins l ON l.user_id = c.user_id
GROUP BY c.first_login
ORDER BY c.first_login;

Common beginner mistakes

  • Counting COUNT(*) instead of COUNT(DISTINCT user_id)—double-counts users who logged in multiple times in the window.
  • Mixing two slightly different cohort definitions between numerator and denominator—silently wrong retention numbers.
  • Forgetting NULLIF(denom, 0)—divide-by-zero on empty cohorts.
  • Using BETWEEN for "within 7 days" when you want half-open semantics—off-by-one boundary inclusion.
  • Counting the first login itself as a return visit—the retention predicate must be login_date > first_login, not >=.

SQL interview question on user retention rate

Tables logins(user_id, login_date). Return per-cohort next-day retention rate: for each first-login date, the percentage of users who logged in again the following day. Round to 2 decimals, sort by cohort date.

Solution using cohort CTE and conditional aggregation

WITH cohort AS (
  SELECT user_id, MIN(login_date) AS first_login
  FROM logins
  GROUP BY user_id
)
SELECT
  c.first_login AS cohort_date,
  COUNT(DISTINCT c.user_id) AS cohort_size,
  COUNT(DISTINCT l.user_id) FILTER (WHERE l.login_date = c.first_login + 1) AS retained_users,
  ROUND(
    100.0 * COUNT(DISTINCT l.user_id) FILTER (WHERE l.login_date = c.first_login + 1)
          / NULLIF(COUNT(DISTINCT c.user_id), 0),
    2
  ) AS next_day_retention_pct
FROM cohort c
JOIN logins l ON l.user_id = c.user_id
GROUP BY c.first_login
ORDER BY c.first_login;

Step-by-step trace (input: 3 users, login table below):

| user_id | login_date |
|---------|------------|
| 1 | 2026-04-01 |
| 1 | 2026-04-02 |
| 2 | 2026-04-01 |
| 3 | 2026-04-01 |
| 3 | 2026-04-03 |

  1. Build the CTE cohort — group by user, take MIN(login_date). Result: (1, 2026-04-01), (2, 2026-04-01), (3, 2026-04-01). Three users, one cohort date.
  2. Join cohort to logins on user_id. Each cohort row is joined to every login row for that user — user 1 → 2 rows, user 2 → 1 row, user 3 → 2 rows. Total: 5 joined rows.
  3. GROUP BY c.first_login — collapses to one row per cohort date (2026-04-01).
  4. Cohort-size aggregate — COUNT(DISTINCT c.user_id) = 3.
  5. Retained-user aggregate — COUNT(DISTINCT l.user_id) FILTER (WHERE l.login_date = c.first_login + 1) counts users who have any joined row with login_date = 2026-04-02. Only user 1 qualifies → 1.
  6. Percentage — 100.0 * 1 / NULLIF(3, 0) = 33.33. Rounded to 2 decimals.
  7. Final result — single cohort row.

Output:

| cohort_date | cohort_size | retained_users | next_day_retention_pct |
|-------------|-------------|----------------|------------------------|
| 2026-04-01 | 3 | 1 | 33.33 |

Why this works — concept by concept:

  • Cohort CTE — WITH cohort AS (... MIN(login_date) ... GROUP BY user_id) pins each user once with their first-login date; reused as the canonical cohort definition.
  • Join back to logins — JOIN logins l ON l.user_id = c.user_id brings every login per user into the row set so the retention predicate can see all return visits.
  • GROUP BY + denominator — COUNT(DISTINCT c.user_id) collapsed by c.first_login is the cohort size — distinct so duplicate joined rows don't inflate it.
  • Conditional aggregate (numerator) — COUNT(DISTINCT l.user_id) FILTER (WHERE l.login_date = c.first_login + 1) counts only users who returned the next day; same DISTINCT discipline.
  • NULLIF divide-by-zero guard — NULLIF(denom, 0) turns an empty cohort into NULL instead of raising.
  • ROUND(..., 2) — clamps the percentage to 2 decimals, removing float noise.
  • Cost — one scan of logins for the CTE + one hash join + one aggregate → O(N + M).



3. SQL Self-Join with Date Arithmetic for Consecutive-Day Activity

Self-joins with date filters for consecutive-day patterns in SQL for data engineering

"Find users active 3 consecutive days" is the canonical self-join + date-arithmetic interview prompt. The mental model: a triple self-join with date constraints t2.date = t1.date + 1 and t3.date = t1.date + 2 returns user-ids that have all three consecutive days in the table. Self-join is the structural primitive; date arithmetic is the constraint glue.

Diagram showing three table aliases t1, t2, t3 with arrows linking t2.date = t1.date + 1 and t3.date = t1.date + 2 plus an INTERVAL day chip.

Self-join basics: alias the same table three times

A self-join uses one physical table as multiple logical references. For 3-consecutive-day activity, alias logins three times: t1, t2, t3. The aliases are required to disambiguate columns; the ON clauses define the relationships. The invariant: aliasing makes the consecutive-day claim explicit — t1 is day 1, t2 is day 2, t3 is day 3.

  • logins t1, logins t2, logins t3 — three logical aliases on the same physical table.
  • Each ON clause carries one relationship constraint: same user, +1 day, +2 days.
  • SELECT DISTINCT t1.user_id — deduplicate across multiple matching streaks.
  • Worked example: user 42 has logins on April 1, 2, 3.
| t1.user_id | t1.date | t2.date | t3.date |
|------------|---------|---------|---------|
| 42 | 2026-04-01 | 2026-04-02 | 2026-04-03 |

The triple self-join produces one row → user 42 qualifies.

SELECT DISTINCT t1.user_id
FROM logins t1
JOIN logins t2 ON t2.user_id = t1.user_id AND t2.date = t1.date + 1
JOIN logins t3 ON t3.user_id = t1.user_id AND t3.date = t1.date + 2;

Consecutive-day invariant: t2.date = t1.date + 1 AND t3.date = t1.date + 2

The structural constraint is the date-arithmetic invariant: the second login is exactly one day after the first; the third is exactly two days after. Both are equality predicates on date columns plus integer offsets. The invariant: strict equality on the offset—= +1, not >= +1—or you'll match non-consecutive streaks.

  • t2.date = t1.date + 1 — exact next-day; equality, not inequality.
  • t3.date = t1.date + 2 — exact two-days-later; same shape.
  • Generalizes to N days: add tN.date = t1.date + N - 1 for an N-day streak.
  • Worked example: if user 42 logs in on April 1 and 3 but not April 2, the t2.date = t1.date + 1 constraint fails and the row is dropped.
| user_id | dates present | t2 match | t3 match | qualifies? |
|---------|---------------|----------|----------|------------|
| 42 | Apr 1, 2, 3 | yes (Apr 2) | yes (Apr 3) | yes |
| 99 | Apr 1, 3, 5 | no (no Apr 2) | not reached | no |

Worked-example solution. The query that encodes the invariant for a 3-day streak — +1 and +2 offsets are the load-bearing predicates:

SELECT DISTINCT t1.user_id
FROM logins t1
JOIN logins t2
  ON t2.user_id = t1.user_id
 AND t2.login_date = t1.login_date + INTERVAL '1 day'   -- exact +1
JOIN logins t3
  ON t3.user_id = t1.user_id
 AND t3.login_date = t1.login_date + INTERVAL '2 days'  -- exact +2
WHERE t1.user_id IN (42, 99);
-- → only user 42 returned; user 99 is dropped because Apr 2 is missing.

Gaps-and-islands alternative with ROW_NUMBER and date subtraction

The classic "gaps and islands" trick uses ROW_NUMBER() partitioned by user, ordered by date; date - ROW_NUMBER is constant inside one consecutive run. Group by (user_id, date - rn) and filter COUNT(*) >= 3. The invariant: for any user, date - ROW_NUMBER OVER (PARTITION BY user_id ORDER BY date) is constant within a single consecutive streak.

  • ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date) — sequential per user.
  • date - rn::int — subtract the row number to get a streak-id.
  • GROUP BY user_id, streak_id HAVING COUNT(*) >= 3 — filter streaks of length ≥ 3.
  • Performance: one window pass + one aggregate; usually faster than a triple self-join on large tables.
  • Worked example: user 42 with April 1, 2, 3, 5.
| date | rn | date − rn | streak id |
|------|----|-----------|-----------|
| 2026-04-01 | 1 | 2026-03-31 | A |
| 2026-04-02 | 2 | 2026-03-31 | A |
| 2026-04-03 | 3 | 2026-03-31 | A |
| 2026-04-05 | 4 | 2026-04-01 | B |

Streak A has 3 rows → user 42 qualifies.

WITH numbered AS (
  SELECT
    user_id,
    login_date,
    login_date - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date))::int AS streak_id
  FROM (SELECT DISTINCT user_id, login_date FROM logins) d
)
SELECT user_id
FROM numbered
GROUP BY user_id, streak_id
HAVING COUNT(*) >= 3;

Common beginner mistakes

  • Using >= instead of = in the date-offset constraint—matches non-consecutive streaks (April 1, 5, 10).
  • Forgetting t2.user_id = t1.user_id—joins across users, returns nonsense.
  • Forgetting DISTINCT on the input or the output—double-counts users with multiple overlapping 3-day streaks.
  • Using BETWEEN on the date offset—silently allows zero-day or off-by-one windows.
  • For the gaps-and-islands form, forgetting to deduplicate (user_id, login_date) first—ROW_NUMBER skips when the table has duplicate rows.

Drill self-join SQL problems →

SQL interview question on three-consecutive-day activity

Table logins(user_id, login_date)—each row is one login event; (user_id, login_date) may appear multiple times. Return distinct user_ids who logged in on three consecutive calendar days at any point in the data. Sort ascending.

Solution using a triple self-join with date arithmetic

SELECT DISTINCT t1.user_id
FROM logins t1
JOIN logins t2
  ON t2.user_id = t1.user_id
 AND t2.login_date = t1.login_date + INTERVAL '1 day'
JOIN logins t3
  ON t3.user_id = t1.user_id
 AND t3.login_date = t1.login_date + INTERVAL '2 days'
ORDER BY t1.user_id;

Step-by-step trace (input: logins rows below):

| user_id | login_date |
|---------|------------|
| 42 | 2026-04-01 |
| 42 | 2026-04-02 |
| 42 | 2026-04-03 |
| 99 | 2026-04-01 |
| 99 | 2026-04-03 |

  1. Pick t1 candidate — engine iterates each logins row as t1. Take t1 = (42, 2026-04-01).
  2. Probe t2 — look for a row with t2.user_id = 42 AND t2.login_date = 2026-04-02. Found → (42, 2026-04-02).
  3. Probe t3 — look for t3.user_id = 42 AND t3.login_date = 2026-04-03. Found → (42, 2026-04-03). Triple matched — emit t1.user_id = 42.
  4. Next t1 for user 99 — t1 = (99, 2026-04-01). t2 would need (99, 2026-04-02) — not in table → row dropped.
  5. t1 = (99, 2026-04-03) — t2 would need (99, 2026-04-04) — not present → dropped. User 99 never qualifies.
  6. DISTINCT on output — even if user 42 had a longer streak (e.g. 4 consecutive days), multiple t1 starting points would emit duplicate 42s; DISTINCT deduplicates.
  7. Final result — one row.

Output:

user_id
42

Why this works — concept by concept:

  • Triple alias — logins t1, logins t2, logins t3 give three logical "day 1," "day 2," "day 3" references on one physical table.
  • Same-user constraint — t2.user_id = t1.user_id (and t3.user_id = t1.user_id) keep the join within a single user; without it the join cross-products users.
  • Strict-equality date offsets — t2.login_date = t1.login_date + INTERVAL '1 day' and t3.login_date = t1.login_date + INTERVAL '2 days' enforce exact consecutiveness; >= would match non-consecutive streaks.
  • DISTINCT — deduplicates users who have multiple overlapping 3-day streaks (e.g. 4-day runs produce two starting points).
  • ORDER BY t1.user_id — deterministic output ordering for grading.
  • Cost — a hash join on (user_id, login_date + 1) plus another on (user_id, login_date + 2); both index-supported, O(N) on a (user_id, login_date) index.



4. SQL Window Functions for Month-over-Month Growth

LAG over DATE_TRUNC partitions for MoM growth in SQL for data engineering

Month-over-month growth is the canonical analytics KPI and the Hard-tier problem in the Salesforce set. The mental model: bucket events by month with DATE_TRUNC, sum the metric per month, use LAG(metric, 1) to pull the previous month's value into the current row, and compute (curr - prev) * 100.0 / prev as percent growth. One window function (LAG), one date-bucketing function (DATE_TRUNC), one growth formula, one query, no self-join required.

Diagram showing monthly buckets along an x-axis with DATE_TRUNC chip, a LAG arrow pointing from prior month to current, and a month-over-month growth formula.

Pro tip: LAG returns NULL on the very first row of the partition—there's no previous month for January's growth. Wrap the divisor in NULLIF(prev, 0) so a zero prior month doesn't crash the divide; the first-row NULL flows through naturally.

DATE_TRUNC('month', sale_date) for monthly bucketing

DATE_TRUNC('month', ts) truncates a timestamp to the first of its month at 00:00:00. The result is a left-edge timestamp that groups all rows from one month onto a single key. The invariant: DATE_TRUNC('month', ...) returns the first-of-month timestamp—rows in April 2026 all collapse to 2026-04-01 00:00:00.

  • DATE_TRUNC('month', sale_date) — primary monthly bucketing.
  • DATE_TRUNC('week', ts) — Monday-start in PostgreSQL (ISO 8601).
  • DATE_TRUNC('quarter', ts) — Jan / Apr / Jul / Oct quarter starts.
  • Worked example: four sales bucketed to month.
| sale_id | sale_date | DATE_TRUNC('month', sale_date) |
|---------|-----------|--------------------------------|
| 1 | 2026-01-15 | 2026-01-01 |
| 2 | 2026-01-28 | 2026-01-01 |
| 3 | 2026-02-03 | 2026-02-01 |
| 4 | 2026-02-19 | 2026-02-01 |

SELECT
  DATE_TRUNC('month', sale_date)::date AS month,
  SUM(units) AS total_units
FROM coupon_sales
GROUP BY DATE_TRUNC('month', sale_date)
ORDER BY month;

LAG(metric) OVER (ORDER BY month) for prior-period lookup

LAG(expr, n) returns the value of expr from n rows back in the window's ordered sequence. With ORDER BY month, LAG(total_units, 1) returns the previous month's total_units inline on the current row. The invariant: LAG aligns prior-period values onto the current row without a self-join.

  • LAG(total_units, 1) OVER (ORDER BY month) — pulls the previous month into the current row.
  • LAG(total_units, 12) — generalizes to YoY growth (12 months back).
  • First-row behavior — LAG returns NULL when there's no prior row in the partition.
  • PARTITION BY product_id — compute MoM independently per product (sketched after the example below).
  • Worked example: four months of unit sales.
| month | total_units | LAG(total_units, 1) |
|-------|-------------|---------------------|
| 2026-01-01 | 1,200 | NULL |
| 2026-02-01 | 1,560 | 1,200 |
| 2026-03-01 | 1,820 | 1,560 |
| 2026-04-01 | 2,210 | 1,820 |

WITH monthly AS (
  SELECT
    DATE_TRUNC('month', sale_date)::date AS month,
    SUM(units) AS total_units
  FROM coupon_sales
  GROUP BY 1
)
SELECT
  month,
  total_units,
  LAG(total_units, 1) OVER (ORDER BY month) AS prev_month_units
FROM monthly
ORDER BY month;
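
Per-product MoM (the PARTITION BY bullet above) is the same pattern with one extra key. A minimal sketch, assuming coupon_sales also carries a product_id column (not part of the original schema):

WITH monthly AS (
  SELECT
    product_id,
    DATE_TRUNC('month', sale_date)::date AS month,
    SUM(units) AS total_units
  FROM coupon_sales
  GROUP BY product_id, DATE_TRUNC('month', sale_date)
)
SELECT
  product_id,
  month,
  total_units,
  -- the partition restarts the LAG sequence for each product
  LAG(total_units, 1) OVER (PARTITION BY product_id ORDER BY month) AS prev_month_units
FROM monthly
ORDER BY product_id, month;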

MoM growth formula: (curr - prev) * 100.0 / prev

The MoM growth formula is (curr − prev) × 100 / prev—straightforward arithmetic, with the same float-division and divide-by-zero guards as percentage calculations. The invariant: 100.0 * (curr - prev) / NULLIF(prev, 0) produces a clean percentage with NULL for the first row and any zero-prior-month.

  • 100.0 — float literal forces float division.
  • NULLIF(prev, 0) — defends against zero-prior-month divide-by-zero.
  • ROUND(..., 2) — clamps decimal noise.
  • COALESCE(... , 0) — only if the consumer prefers 0% over NULL for the first month; usually NULL is more honest.
  • Worked example: February's MoM = (1560 − 1200) * 100.0 / 1200 = 30.00%.
WITH monthly AS (
  SELECT
    DATE_TRUNC('month', sale_date)::date AS month,
    SUM(units) AS total_units
  FROM coupon_sales
  GROUP BY 1
),
with_lag AS (
  SELECT
    month,
    total_units,
    LAG(total_units, 1) OVER (ORDER BY month) AS prev_units
  FROM monthly
)
SELECT
  month,
  total_units,
  prev_units,
  ROUND(
    100.0 * (total_units - prev_units) / NULLIF(prev_units, 0),
    2
  ) AS mom_growth_pct
FROM with_lag
ORDER BY month;

Common beginner mistakes

  • Using LAG without ORDER BY in the OVER clause—engine raises an error or returns non-deterministic results.
  • Self-joining the table to itself by month - 1—works but slow and hard to maintain compared to LAG.
  • Forgetting NULLIF(prev, 0) on the divide—divide-by-zero crash on the first month or any zero-units month.
  • Forgetting 100.0 for percentage—integer division returns 0 for fractional growth.
  • Using LEAD (the opposite direction) when the prompt asks for previous-period growth—LEAD looks forward.

Practice window-function problems →

SQL interview question on month-over-month growth

Table coupon_sales(sale_id, sale_date, units, revenue). Return per-month total units sold, total revenue, and month-over-month growth percentage for both metrics. Round percentages to 2 decimals; sort by month ascending.

Solution using DATE_TRUNC + LAG over an ordered window

WITH monthly AS (
  SELECT
    DATE_TRUNC('month', sale_date)::date AS month,
    SUM(units)   AS total_units,
    SUM(revenue) AS total_revenue
  FROM coupon_sales
  GROUP BY DATE_TRUNC('month', sale_date)
),
with_lag AS (
  SELECT
    month,
    total_units,
    total_revenue,
    LAG(total_units, 1)   OVER (ORDER BY month) AS prev_units,
    LAG(total_revenue, 1) OVER (ORDER BY month) AS prev_revenue
  FROM monthly
)
SELECT
  month,
  total_units,
  total_revenue,
  ROUND(
    100.0 * (total_units - prev_units) / NULLIF(prev_units, 0),
    2
  ) AS units_mom_pct,
  ROUND(
    100.0 * (total_revenue - prev_revenue) / NULLIF(prev_revenue, 0),
    2
  ) AS revenue_mom_pct
FROM with_lag
ORDER BY month;

Step-by-step trace (input: 4 sales rows aggregating to 2 months):

| sale_id | sale_date | units | revenue |
|---------|-----------|-------|---------|
| 1 | 2026-01-15 | 500 | 1000 |
| 2 | 2026-01-28 | 700 | 1400 |
| 3 | 2026-02-03 | 900 | 1800 |
| 4 | 2026-02-19 | 660 | 1320 |

  1. monthly CTE — DATE_TRUNC('month', sale_date) collapses Jan rows to 2026-01-01, Feb rows to 2026-02-01. SUM(units) GROUP BY month produces (2026-01-01, 1200, 2400) and (2026-02-01, 1560, 3120).
  2. with_lag CTE — LAG(total_units, 1) OVER (ORDER BY month) reads each row's previous-month value: row 1 has prev_units = NULL (no prior); row 2 has prev_units = 1200. Same for prev_revenue: NULL, then 2400.
  3. Outer SELECT row 1 (2026-01-01) — prev_units = NULL. The expression 100.0 * (1200 - NULL) / NULLIF(NULL, 0) collapses to NULL. Both growth columns emit NULL for the first month.
  4. Outer SELECT row 2 (2026-02-01) — 100.0 * (1560 - 1200) / NULLIF(1200, 0) = 100.0 * 360 / 1200 = 30.00. Revenue: 100.0 * (3120 - 2400) / 2400 = 30.00. Rounded to 2 decimals.
  5. Final output, sorted ascending.

Output:

| month | total_units | total_revenue | units_mom_pct | revenue_mom_pct |
|-------|-------------|---------------|---------------|-----------------|
| 2026-01-01 | 1200 | 2400 | NULL | NULL |
| 2026-02-01 | 1560 | 3120 | 30.00 | 30.00 |

Why this works — concept by concept:

  • DATE_TRUNC('month', sale_date) — collapses every row of a calendar month onto the first-of-month timestamp, the bucketing key.
  • SUM(units) / SUM(revenue) aggregate — one row per month after GROUP BY on the truncated date.
  • LAG(metric, 1) OVER (ORDER BY month) — pulls the previous month's value inline onto the current row; no self-join, no second pass over the data.
  • MoM growth formula — 100.0 * (curr - prev) / NULLIF(prev, 0); the 100.0 float literal forces float division, NULLIF defends against zero/NULL prior-month values.
  • First-row NULL — LAG returns NULL for the first row of the partition; the formula propagates NULL cleanly so January has no MoM number — honest, not zero.
  • ROUND(..., 2) — display precision; removes trailing float noise like 30.0000000001.
  • Cost — one scan of coupon_sales, one hash aggregate by month, one window pass → O(N + K log K) where K is months.



5. SQL Aggregation with String Manipulation for Volume Reports

Aggregation over parsed dimensions in SQL for data engineering

Real warehouse and operations data often arrives with composite keys packed into a single string column—SKU-WAREHOUSE-CATEGORY or 2026-04-29:WH02:HEAVY. The mental model: parse the composite string into structured columns with SPLIT_PART or SUBSTRING, then aggregate. The string parsing is a one-shot transform; the aggregation is the standard SUM / COUNT / AVG you already know.

Pro tip: Always pin the string-parsing into a CTE (or subquery) before aggregating. Doing both in the same SELECT makes the query unreadable, and a bad split-character mid-string breaks the entire downstream aggregate silently.

SPLIT_PART / SUBSTRING for parsing composite keys

SPLIT_PART(str, delim, n) returns the n-th field of str split by delim. For positional parsing of fixed-format strings, SPLIT_PART is faster and clearer than regex. The invariant: SPLIT_PART is 1-indexed—the first field is n = 1, not n = 0.

  • SPLIT_PART('SKU01-WH02-HEAVY', '-', 1) → 'SKU01'.
  • SPLIT_PART('SKU01-WH02-HEAVY', '-', 2) → 'WH02'.
  • SUBSTRING(s, start, len) — for fixed-width formats where the column always has the same offsets.
  • POSITION('-' IN s) — locate a delimiter for variable-width parsing (both idioms are sketched after the SPLIT_PART example below).
  • Worked example: parse a composite product code.
| raw_code | part 1 (sku) | part 2 (warehouse) | part 3 (category) |
|----------|--------------|--------------------|-------------------|
| SKU01-WH02-HEAVY | SKU01 | WH02 | HEAVY |
| SKU07-WH01-LIGHT | SKU07 | WH01 | LIGHT |
| SKU03-WH02-HEAVY | SKU03 | WH02 | HEAVY |

SELECT
  SPLIT_PART(raw_code, '-', 1) AS sku,
  SPLIT_PART(raw_code, '-', 2) AS warehouse,
  SPLIT_PART(raw_code, '-', 3) AS category
FROM warehouse_events;
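
The SUBSTRING / POSITION idioms from the list above have no worked query of their own; here is a PostgreSQL-flavored sketch on the same assumed SKU-WAREHOUSE-CATEGORY layout, not part of the original example:

SELECT
  -- everything before the first '-' (variable-width: the delimiter is located with POSITION)
  SUBSTRING(raw_code FROM 1 FOR POSITION('-' IN raw_code) - 1) AS sku,
  -- everything after the last '-' (REVERSE trick to find the final delimiter)
  SUBSTRING(raw_code FROM LENGTH(raw_code) - POSITION('-' IN REVERSE(raw_code)) + 2) AS category
FROM warehouse_events;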

SUM and COUNT over a parsed dimension

Once the string is parsed, aggregation is the standard GROUP BY you already know. The shape is CTE that parses → SELECT that aggregates. The invariant: the grain of the result is determined by the GROUP BY clause; the parsing CTE's job is only to expose the dimension columns.

  • GROUP BY warehouse, category — two-level aggregation.
  • COUNT(*) — event count per group.
  • SUM(units), AVG(units), MIN/MAX(units) — volume metrics.
  • Worked example: three events grouped by warehouse + category.
| warehouse | category | event_count | total_units |
|-----------|----------|-------------|-------------|
| WH01 | LIGHT | 1 | 50 |
| WH02 | HEAVY | 2 | 220 |

WITH parsed AS (
  SELECT
    SPLIT_PART(raw_code, '-', 2) AS warehouse,
    SPLIT_PART(raw_code, '-', 3) AS category,
    units
  FROM warehouse_events
)
SELECT
  warehouse,
  category,
  COUNT(*) AS event_count,
  SUM(units) AS total_units
FROM parsed
GROUP BY warehouse, category
ORDER BY warehouse, category;

Multi-metric reports with FILTER / CASE WHEN aggregates

Volume reports usually need many metrics in one query: total units, heavy-only units, light-only units, percentage of heavy events. Combine GROUP BY with conditional aggregation. The invariant: conditional aggregation lets you compute multiple per-group metrics in a single pass—each metric is one expression in the SELECT.

  • SUM(CASE WHEN category = 'HEAVY' THEN units ELSE 0 END) — portable form.
  • SUM(units) FILTER (WHERE category = 'HEAVY') — PostgreSQL/standard SQL form (cleaner).
  • COUNT(*) FILTER (WHERE category = 'HEAVY') — count of matching rows.
  • 100.0 * heavy_count / total_count — percentage with float-division and NULLIF guard.
  • Worked example: multi-metric per warehouse.
| warehouse | total_units | heavy_units | light_units | heavy_pct |
|-----------|-------------|-------------|-------------|-----------|
| WH01 | 50 | 0 | 50 | 0.00 |
| WH02 | 220 | 220 | 0 | 100.00 |

WITH parsed AS (
  SELECT
    SPLIT_PART(raw_code, '-', 2) AS warehouse,
    SPLIT_PART(raw_code, '-', 3) AS category,
    units
  FROM warehouse_events
)
SELECT
  warehouse,
  SUM(units) AS total_units,
  COALESCE(SUM(units) FILTER (WHERE category = 'HEAVY'), 0) AS heavy_units,
  COALESCE(SUM(units) FILTER (WHERE category = 'LIGHT'), 0) AS light_units,
  ROUND(
    100.0 * COUNT(*) FILTER (WHERE category = 'HEAVY') / NULLIF(COUNT(*), 0),
    2
  ) AS heavy_event_pct
FROM parsed
GROUP BY warehouse
ORDER BY warehouse;

Common beginner mistakes

  • Treating SPLIT_PART as zero-indexed—off-by-one bug; it's 1-indexed.
  • Inlining SPLIT_PART in GROUP BY directly—works but unreadable; pin it in a CTE.
  • Using LIKE '%HEAVY%' instead of exact match on the parsed dimension—matches LIGHT-HEAVY-COMBO accidentally.
  • Forgetting NULLIF(denom, 0) on percentage divides—divide-by-zero on empty groups.
  • Aggregating before parsing—loses the dimension you wanted to group by.

SQL interview question on warehouse volume metrics

Table warehouse_events(event_id, raw_code TEXT, units INT, ts TIMESTAMP) where raw_code has the format SKU-WAREHOUSE-CATEGORY (e.g. SKU01-WH02-HEAVY). Return per-warehouse: total events, total units, heavy-units, light-units, and percentage of heavy events. Sort by total_units descending.

Solution using SPLIT_PART parsing and conditional aggregation

WITH parsed AS (
  SELECT
    SPLIT_PART(raw_code, '-', 2) AS warehouse,
    SPLIT_PART(raw_code, '-', 3) AS category,
    units
  FROM warehouse_events
)
SELECT
  warehouse,
  COUNT(*) AS total_events,
  SUM(units) AS total_units,
  COALESCE(SUM(units) FILTER (WHERE category = 'HEAVY'), 0) AS heavy_units,
  COALESCE(SUM(units) FILTER (WHERE category = 'LIGHT'), 0) AS light_units,
  ROUND(
    100.0 * COUNT(*) FILTER (WHERE category = 'HEAVY') / NULLIF(COUNT(*), 0),
    2
  ) AS heavy_event_pct
FROM parsed
GROUP BY warehouse
ORDER BY total_units DESC, warehouse ASC;

Step-by-step trace (input: 3 events):

| event_id | raw_code | units |
|----------|----------|-------|
| 1 | SKU01-WH02-HEAVY | 100 |
| 2 | SKU07-WH01-LIGHT | 50 |
| 3 | SKU03-WH02-HEAVY | 120 |

  1. parsed CTE — for each row, SPLIT_PART(raw_code, '-', 2) produces WH02 / WH01 / WH02; SPLIT_PART(raw_code, '-', 3) produces HEAVY / LIGHT / HEAVY. Output: 3 rows of (warehouse, category, units).
  2. GROUP BY warehouse — collapses rows by warehouse: WH01 has 1 row; WH02 has 2 rows.
  3. WH02 aggregates — COUNT(*) = 2, SUM(units) = 220, SUM(units) FILTER (WHERE category = 'HEAVY') = 220 (both rows match), light units report 0 (no matching rows; COALESCE turns the empty FILTER result into 0), 100.0 * 2 / NULLIF(2, 0) = 100.00.
  4. WH01 aggregates — COUNT(*) = 1, SUM(units) = 50, heavy = 0 (COALESCE again), light = 50, heavy_pct = 100.0 * 0 / 1 = 0.00.
  5. ORDER BY total_units DESC, warehouse ASC — WH02 (220) precedes WH01 (50).
  6. Final output — two warehouse rows.

Output:

| warehouse | total_events | total_units | heavy_units | light_units | heavy_event_pct |
|-----------|--------------|-------------|-------------|-------------|-----------------|
| WH02 | 2 | 220 | 220 | 0 | 100.00 |
| WH01 | 1 | 50 | 0 | 50 | 0.00 |

Why this works — concept by concept:

  • SPLIT_PART(raw_code, '-', N) — 1-indexed positional parsing; turns one composite string into three structured columns (sku, warehouse, category).
  • parsed CTE — pins the parsing in one place so the outer query is readable; GROUP BY warehouse operates on a clean structured column.
  • COUNT(*) — total event count per warehouse.
  • SUM(units) — total units per warehouse, no filter.
  • COALESCE(SUM(units) FILTER (WHERE category = 'HEAVY'), 0) — conditional sum: only HEAVY rows contribute; COALESCE reports 0 instead of NULL when a warehouse has no matching rows. SQL-standard FILTER clause.
  • COUNT(*) FILTER (WHERE category = 'HEAVY') — conditional count for the percentage numerator.
  • 100.0 * num / NULLIF(denom, 0) — float-division percentage with empty-group guard.
  • Tie-breaking ORDER BY total_units DESC, warehouse ASC — deterministic when two warehouses share the same total.



6. Python Hash Tables for Atomic Transaction State

Dict-as-state design for atomic transactions in Python for data engineering

Atomic transactions are the canonical state-management interview prompt for DE candidates. The mental model: a dict is a key-value state store with O(1) reads and writes; an "atomic transaction" wraps a sequence of mutations so they either all apply or none do. The implementation is plain Python—no databases, no locks—but the design discipline (snapshot, mutate, commit, rollback) is the data-engineering primitive Salesforce is grading.

Pro tip: The simplest atomic-transaction design uses dict.copy() to snapshot before mutating; on commit, the working copy replaces the canonical store; on rollback, the working copy is discarded. copy() is shallow—if your values are themselves containers, prefer copy.deepcopy or design the values to be immutable.
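
A quick illustration of that caveat, as a minimal sketch (the nested account dict is illustrative, not from the problem statement):

import copy

state = {"acct_a": {"balance": 100}}

shallow = state.copy()               # new outer dict, but the inner dict is shared
shallow["acct_a"]["balance"] = 0
print(state["acct_a"]["balance"])    # 0: the shallow "snapshot" did not protect the nested value

state = {"acct_a": {"balance": 100}}
deep = copy.deepcopy(state)          # independent copy all the way down
deep["acct_a"]["balance"] = 0
print(state["acct_a"]["balance"])    # 100: canonical state untouched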

Dict as a key-value store with O(1) reads / writes

A Python dict is the canonical hash-table KV store. Reads (d[k]) and writes (d[k] = v) are O(1) average, O(N) worst-case (rare). The invariant: dict operations are O(1) amortized; never reach for a list-of-tuples KV store unless the keys aren't hashable.

  • d[k] = v — set; O(1) average.
  • d.get(k, default) — read with default; safer than d[k] for missing keys.
  • del d[k] — delete; O(1) average.
  • k in d — membership; O(1) average.
  • Worked example: simple state store.
| operation | dict state |
|-----------|------------|
| start | {} |
| d['acct_a'] = 100 | {'acct_a': 100} |
| d['acct_b'] = 50 | {'acct_a': 100, 'acct_b': 50} |
| d['acct_a'] = 90 | {'acct_a': 90, 'acct_b': 50} |

state: dict[str, int] = {}
state['acct_a'] = 100
state['acct_b'] = 50
state['acct_a'] = 90
print(state)  # {'acct_a': 90, 'acct_b': 50}

Atomic update pattern: snapshot → mutate → commit

The atomic-update pattern has three phases. Snapshot copies the current state into a working buffer. Mutate applies all changes to the buffer. Commit swaps the buffer back into place; all mutations land together. The invariant: no caller observes a partial mutation—either zero changes are visible or all of them are.

  • working = state.copy() — shallow snapshot.
  • working[k] = new_value — apply each mutation to the copy, not the canonical store.
  • state.clear(); state.update(working) — atomic commit (or assign self._state = working).
  • raise mid-mutation — the canonical state is untouched; the buffer is garbage-collected.
  • Worked example: transfer $20 from account A to account B atomically.
| step | canonical state | working buffer |
|------|-----------------|----------------|
| start | {'a': 100, 'b': 50} | (none yet) |
| snapshot | {'a': 100, 'b': 50} | {'a': 100, 'b': 50} |
| mutate (A−20) | {'a': 100, 'b': 50} | {'a': 80, 'b': 50} |
| mutate (B+20) | {'a': 100, 'b': 50} | {'a': 80, 'b': 70} |
| commit | {'a': 80, 'b': 70} | discarded |

def transfer_atomic(state: dict[str, int], src: str, dst: str, amt: int) -> dict[str, int]:
    working = state.copy()
    working[src] = working.get(src, 0) - amt   # .get() so a missing source key can't raise KeyError
    if working[src] < 0:
        raise ValueError("insufficient funds")
    working[dst] = working.get(dst, 0) + amt
    return working

Rollback via dict.copy() or a backup buffer

The flip side of commit is rollback: if a mid-transaction check fails, discard the buffer and leave the canonical state alone. Two common designs: a per-transaction buffer (snapshot inside the operation; raise on failure; the buffer is GC'd) or a rollback log (record each mutation in a list; on failure, replay in reverse). The invariant: rollback never partially-mutates the canonical store.

  • Per-transaction buffer — simplest; buffer is local; raise to abort.
  • Rollback log — needed when mutations are large or the buffer doesn't fit in memory; record (key, old_value) pairs.
  • Context-manager pattern — with txn(state): enters / exits cleanly; the __exit__ decides commit vs rollback based on exception status (a sketch follows the worked example below).
  • Worked example: failing transfer on insufficient funds preserves the canonical state.
class AtomicDict:
    def __init__(self, initial: dict | None = None) -> None:
        self._state: dict = dict(initial or {})

    def transaction(self, mutations: list[tuple[str, int]]) -> None:
        working = self._state.copy()
        try:
            for key, delta in mutations:
                working[key] = working.get(key, 0) + delta
                if working[key] < 0:
                    raise ValueError(f"{key} would go negative")
        except Exception:
            return  # rollback: discard `working`, leave self._state alone
        self._state = working  # commit: swap in working buffer

    def get(self, key: str, default=None):
        return self._state.get(key, default)

    def snapshot(self) -> dict:
        return self._state.copy()

Worked-example solution. Driving the class with a successful and a failing transaction; the canonical state is preserved across the failure:

ad = AtomicDict({'a': 100, 'b': 50})

ad.transaction([('a', -30), ('b', 30)])  # commits → {'a': 70, 'b': 80}
print(ad.snapshot())                     # {'a': 70, 'b': 80}

ad.transaction([('a', -200), ('b', 200)])  # raises mid-loop → rollback
print(ad.snapshot())                       # still {'a': 70, 'b': 80} — untouched
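
The context-manager pattern mentioned in the rollback list is a small variation on the same snapshot/commit idea. A sketch with a hypothetical txn() helper (not part of the interview problem); mutations go to a working copy, and the commit runs only if the with-block exits cleanly, so an exception acts as the rollback:

from contextlib import contextmanager

@contextmanager
def txn(store: dict):
    working = store.copy()      # snapshot
    yield working               # caller mutates the working copy
    store.clear()
    store.update(working)       # commit: runs only if the block raised nothing

state = {"a": 100, "b": 50}

with txn(state) as t:           # successful transaction: commits
    t["a"] -= 30
    t["b"] += 30
print(state)                    # {'a': 70, 'b': 80}

try:
    with txn(state) as t:       # failing transaction: the exception skips the commit
        t["a"] -= 200
        if t["a"] < 0:
            raise ValueError("insufficient funds")
except ValueError:
    pass
print(state)                    # still {'a': 70, 'b': 80}; canonical state untouched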

Common beginner mistakes

  • Mutating the canonical state in-place and trying to "undo on failure"—rollback logic is fragile and easy to get wrong.
  • Using working = state (reference, not copy) and mutating working—mutates the canonical state in place, defeating the snapshot.
  • Using state.copy() on a dict whose values are mutable (lists, nested dicts) — shallow copy shares the inner objects.
  • Returning the buffer instead of swapping it into the canonical reference—commit silently fails.
  • Skipping the missing-key default (dict.get(key, 0)) and indexing working[key] directly—raises KeyError mid-transaction with no rollback.

See more hash-table design problems →

Python interview question on atomic transactions

Implement AtomicDict with set(key, value), get(key, default=None), and transaction(updates: list[tuple[str, int]]). The transaction method applies all (key, delta) pairs atomically—if any update would make a value negative, the entire transaction rolls back. After a failed transaction, the state must be unchanged.

Solution using a snapshot-and-commit dict

class AtomicDict:
    def __init__(self, initial: dict[str, int] | None = None) -> None:
        self._state: dict[str, int] = dict(initial or {})

    def set(self, key: str, value: int) -> None:
        self._state[key] = value

    def get(self, key: str, default: int | None = None) -> int | None:
        return self._state.get(key, default)

    def transaction(self, updates: list[tuple[str, int]]) -> bool:
        working = self._state.copy()
        for key, delta in updates:
            working[key] = working.get(key, 0) + delta
            if working[key] < 0:
                return False  # rollback: discard `working`, leave self._state alone
        self._state = working  # commit: atomic swap
        return True

Step-by-step trace (initial state {'a': 100, 'b': 50}; call transaction([('a', -30), ('b', 30)]) then a failing transaction([('a', -200), ('b', 200)])):

  1. First transaction call — working = {'a': 100, 'b': 50}.copy() → independent dict.
  2. Apply ('a', -30) — working['a'] = 100 + (-30) = 70. Not negative → continue.
  3. Apply ('b', 30) — working['b'] = 50 + 30 = 80. Not negative → continue.
  4. Loop ends successfully — self._state = working swaps in {'a': 70, 'b': 80}. Returns True. Canonical state is now {'a': 70, 'b': 80}.
  5. Second transaction call — working = {'a': 70, 'b': 80}.copy().
  6. Apply ('a', -200) — working['a'] = 70 + (-200) = -130. Negative → return False immediately.
  7. Rollback — the function returns before reaching self._state = working. The local working dict is garbage-collected. self._state is still {'a': 70, 'b': 80} — completely untouched.
  8. Verify — get('a') == 70, get('b') == 80. The failed second transaction left zero side effects.

Output — observable state after each call:

| call | return value | get('a') | get('b') | _state after |
|------|--------------|----------|----------|--------------|
| transaction([('a', -30), ('b', 30)]) | True | 70 | 80 | {'a': 70, 'b': 80} (committed) |
| transaction([('a', -200), ('b', 200)]) | False | 70 | 80 | {'a': 70, 'b': 80} (rolled back, unchanged) |

Why this works — concept by concept:

  • Snapshot — working = self._state.copy() creates an independent shallow copy; mutations to working cannot leak into self._state.
  • Buffer mutation — the for loop applies each (key, delta) to working, never to the canonical store.
  • Validation predicate — if working[key] < 0 rejects negative balances mid-loop; no further updates are applied.
  • Rollback by early return — return False exits before the commit line; the local working dict is garbage-collected, leaving self._state untouched.
  • Atomic commit — self._state = working is a single reference reassignment — all mutations land together, no partial-update window observable to other callers.
  • Cost — one dict.copy() (O(N) where N is current keys) plus O(M) updates; the swap is O(1) reference reassignment.



7. Python Closures and Higher-Order Functions for Pipelines

Function composition for ETL pipelines in Python for data engineering

Function composition is the canonical functional-programming primitive for ETL pipelines: chain transformations so the output of one becomes the input of the next. The mental model: a closure is a function that captures values from its enclosing scope; composition is f(g(x)) packaged as a new function. Together they let you express pipeline = compose(load, transform, validate, persist) in one readable line.

Pro tip: Watch composition order. compose(f, g)(x) is conventionally f(g(x))—the rightmost runs first. Frameworks like Toolz and Funcy follow this convention; Spark / pandas method chaining flows the other direction (leftmost runs first). State the order out loud in interviews so the reviewer knows you noticed.

Closure basics: a function that captures its enclosing scope

A closure is a function that references variables from its enclosing scope after that scope has exited. The invariant: closures capture variables by reference, not by value—a late-binding subtlety that bites in lambda factories.

  • Inner function referencing an outer-function variable.
  • Captured by reference — if the outer variable is reassigned, all captured closures see the new value.
  • Common factory pattern: def make_adder(n): return lambda x: x + n → reusable per-n adders.
  • Worked example: add_5 = make_adder(5); add_5(10) == 15.
def make_adder(n: int):
    def add(x: int) -> int:
        return x + n
    return add

add_5 = make_adder(5)
add_10 = make_adder(10)
print(add_5(3))   # 8
print(add_10(3))  # 13

A subtle gotcha: the late-binding capture means the loop-variable trap is real:

fns = [lambda x: x + i for i in range(3)]
print([f(0) for f in fns])  # [2, 2, 2] — all bound to final i
fns = [lambda x, i=i: x + i for i in range(3)]
print([f(0) for f in fns])  # [0, 1, 2] — default-arg fix

compose(f, g)(x) == f(g(x)) — the composition contract

The composition contract is precise: compose(f, g) returns a new function that, when called with x, returns f(g(x)). The invariant: compose is right-to-left; g runs first, then f consumes its output. Two-function composition is the building block for n-way pipelines.

  • def compose(f, g): return lambda x: f(g(x)) — minimal definition.
  • Type signature: compose(B → C, A → B) → (A → C) — the inner function's output type must match the outer's input type.
  • Edge cases: compose(f) on a single function should equal f, and compose() with no functions should behave as the identity (a variadic sketch follows the example below).
  • Worked example: inc = lambda x: x + 1; double = lambda x: x * 2; inc_then_double = compose(double, inc); inc_then_double(3) == 8 ((3+1)*2).
| step | expression | value |
|------|------------|-------|
| 1 | g(3) = inc(3) | 4 |
| 2 | f(g(3)) = double(4) | 8 |

def compose(f, g):
    return lambda x: f(g(x))

inc = lambda x: x + 1
double = lambda x: x * 2

inc_then_double = compose(double, inc)
print(inc_then_double(3))  # 8 — (3+1)*2
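
The single-function and empty edge cases above are easiest to satisfy with a variadic version. A minimal sketch, previewing the functools.reduce approach covered next:

from functools import reduce

def compose(*fns):
    """Right-to-left composition; compose(f) == f, compose() == identity."""
    if not fns:
        return lambda x: x
    return reduce(lambda f, g: lambda x: f(g(x)), fns)

inc = lambda x: x + 1
double = lambda x: x * 2

print(compose(double, inc)(3))  # 8: inc runs first (rightmost), then double
print(compose(inc)(3))          # 4: a single function behaves like the function itself
print(compose()(3))             # 3: identity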

functools.reduce for n-way pipelines

For an arbitrary number of functions, functools.reduce collapses a list with a binary operation. The invariant: reduce(compose, [f1, f2, ..., fN]) is the right-to-left composition of all N functions. Reverse the list if your pipeline reads left-to-right ("load → clean → validate → save").

  • reduce(compose, fns) — right-to-left composition of N functions.
  • reduce(lambda f, g: lambda x: g(f(x)), fns) — left-to-right composition.
  • reduce(compose, fns, identity) — start with an identity function for empty pipelines.
  • functools.partial(f, k=v) — bind a keyword arg, useful when one stage needs configuration (sketched after the example below).
  • Worked example: three-stage ETL pipeline—parse_csv → drop_nulls → normalize_email.
from functools import reduce

def compose(f, g):
    return lambda x: f(g(x))

def pipeline(*fns):
    """Returns a function that runs left-to-right: pipeline(f, g, h)(x) == h(g(f(x)))."""
    return reduce(lambda f, g: lambda x: g(f(x)), fns, lambda x: x)

# Stages
def parse_csv(raw: str) -> list[dict]:
    return [dict(zip(("name", "email"), line.split(","))) for line in raw.splitlines() if line]

def drop_nulls(rows: list[dict]) -> list[dict]:
    return [r for r in rows if r.get("email")]

def normalize_email(rows: list[dict]) -> list[dict]:
    return [{**r, "email": r["email"].strip().lower()} for r in rows]

run = pipeline(parse_csv, drop_nulls, normalize_email)
output = run("Alice,Alice@Example.com\nBob,\nCarol,Carol@example.com")
# [{'name': 'Alice', 'email': 'alice@example.com'}, {'name': 'Carol', 'email': 'carol@example.com'}]
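
The functools.partial bullet above, applied to a configurable stage. A sketch where drop_missing is a hypothetical helper and pipeline() is the same left-to-right composer defined above:

from functools import partial, reduce

def pipeline(*fns):
    return reduce(lambda f, g: lambda x: g(f(x)), fns, lambda x: x)

def drop_missing(rows: list[dict], field: str) -> list[dict]:
    # generic stage: keep rows whose `field` is present and non-empty
    return [r for r in rows if r.get(field)]

# bind the configuration once; the partial is now a one-argument pipeline stage
run = pipeline(partial(drop_missing, field="email"))
print(run([{"name": "Alice", "email": "a@x.com"}, {"name": "Bob", "email": ""}]))
# [{'name': 'Alice', 'email': 'a@x.com'}]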

Common beginner mistakes

  • Using late-binding capture in a lambda loop—all closures share the final loop value (the classic [lambda: i for i in range(3)] bug).
  • Confusing right-to-left and left-to-right composition direction—pipeline reads backward.
  • Mutating the input data inside a stage—breaks the functional contract; later replays produce different output.
  • Forgetting an identity seed for an empty reduce — TypeError: reduce() of empty sequence.
  • Reaching for class-based pipelines (PipelineStep subclasses) when a list of functions is enough—over-engineering.

Python interview question on function composition

Implement pipeline(*fns) that returns a single callable running the input functions left-to-right on a value: pipeline(f, g, h)(x) == h(g(f(x))). An empty pipeline() should return the identity function. Use functools.reduce.

Solution using closures and functools.reduce

from functools import reduce
from typing import Callable, Any

def pipeline(*fns: Callable[[Any], Any]) -> Callable[[Any], Any]:
    if not fns:
        return lambda x: x
    return reduce(lambda f, g: lambda x: g(f(x)), fns)

Step-by-step trace (input: pipeline(inc, double, square)(2) where inc = +1, double = *2, square = **2):

  1. fns = (inc, double, square) — non-empty, so skip the empty-pipeline guard.
  2. reduce iteration 1 — combiner takes f = inc, g = double, returns closure c1 = lambda x: double(inc(x)). Now the accumulator is c1.
  3. reduce iteration 2 — combiner takes f = c1, g = square, returns closure c2 = lambda x: square(c1(x)). The accumulator is now c2. Loop ends.
  4. pipeline(inc, double, square) returns c2 — a single callable that, when invoked, runs all three stages left-to-right.
  5. Call c2(2) — evaluates square(c1(2)).
  6. Inner c1(2) — evaluates double(inc(2)) = double(3) = 6.
  7. Outer square(6) — 6 ** 2 = 36.
  8. Final result — 36. The composition order matched left-to-right: 2 → inc → 3 → double → 6 → square → 36.
  9. Empty case — pipeline() returns lambda x: x; calling pipeline()(42) returns 42 unchanged.

Output — value flow stage by stage:

| stage | input | function | output |
| --- | --- | --- | --- |
| 1 | 2 | inc | 3 |
| 2 | 3 | double | 6 |
| 3 | 6 | square | 36 |
| (empty) | 42 | identity | 42 |

Why this works — concept by concept:

  • functools.reduce — walks the function tuple from left to right, threading an accumulator through a binary combiner lambda.
  • Combiner closure — lambda f, g: lambda x: g(f(x)) returns a new function each iteration; the new function captures f (the so-far composed pipeline) and g (the next stage).
  • Left-to-right ordering — inside the inner closure, f(x) runs first and g consumes its output, so stages execute in the order they were passed in.
  • Fresh closure bindings — every reduce iteration produces a brand-new closure, so the classic loop-variable late-binding bug ([lambda: i for i in range(N)]) doesn't apply.
  • Empty-pipeline guard — if not fns: return lambda x: x short-circuits to the identity function, so pipeline()(42) == 42.
  • Cost — O(N) function applications per call where N is the number of stages; build cost is also O(N) (one closure per stage).

Topic-page drills: closure problems and higher-order function problems.


Tips to crack Salesforce data engineering interviews

These are habits that move the needle in real Salesforce loops—not a re-statement of the topics above.

SQL preparation

Salesforce tilts SQL-heavy. Drill these patterns until they're muscle memory:

  • Subquery + aggregation — Nth-largest values via scalar subquery in WHERE.
  • Cohort + division — retention KPIs via MIN(login_date) cohort + conditional aggregate over the cohort size.
  • Triple self-join + date arithmetic — consecutive-day activity (t2.date = t1.date + 1).
  • LAG over DATE_TRUNC partitions — month-over-month growth without a self-join.
  • SPLIT_PART + conditional aggregation — volume reports parsed out of composite-key strings.
  • Percentage shape — 100.0 * num / NULLIF(denom, 0) shows up in every section; memorize it.

Topic-page drills: subquery, cohort analysis, self-join, window functions, date functions.

Python preparation

Salesforce's two Python problems are data-engineering primitives, not algorithm puzzles:

  • Hash-table design — atomic transactions with snapshot / commit / rollback (a minimal sketch follows this list). Drill state-management patterns, not LeetCode trees.
  • Closures + functools.reduce — ETL pipeline composition. Drill functional Python over class hierarchies.
  • dict.copy() discipline — shallow vs. deep copy; when shared references bite.
  • Late-binding capture — the lambda loop-variable trap and the default-arg fix.
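A minimal sketch of the dict-as-state idea under stated assumptions: begin() takes a shallow dict.copy() snapshot, commit() discards it, rollback() restores it. The AtomicKV name and method set are illustrative, not the exact interface from the curated problem.

```python
class AtomicKV:
    """Illustrative dict-backed store with snapshot / commit / rollback."""

    def __init__(self):
        self._state = {}
        self._snapshot = None  # None means "no open transaction"

    def set(self, key, value):
        self._state[key] = value

    def get(self, key, default=None):
        return self._state.get(key, default)

    def begin(self):
        # Shallow copy is enough while values are immutable (str/int);
        # nested mutable values would need copy.deepcopy.
        self._snapshot = self._state.copy()

    def commit(self):
        self._snapshot = None  # drop the snapshot; the changes stand

    def rollback(self):
        if self._snapshot is not None:
            self._state = self._snapshot
            self._snapshot = None

store = AtomicKV()
store.set("plan", "enterprise")
store.begin()
store.set("plan", "free")
store.rollback()
print(store.get("plan"))  # enterprise
```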

Entry points: hash table, closures, Salesforce Python practice.

Window-function fluency

Salesforce's only Hard-tier curated problem is the MoM growth window-function query. Get fluent with the family:

  • LAG(metric, n) OVER (ORDER BY month) — previous-period lookup; n = 1 for MoM, n = 12 for YoY.
  • LEAD(metric, n) — same idea, the other direction; useful for "next period" comparisons.
  • Partition vs. global windows — PARTITION BY product_id for per-product MoM; no partition for global trend.
  • MoM formula — 100.0 * (curr - prev) / NULLIF(prev, 0).
  • ROW_NUMBER — top-K-per-group filtered via a CTE wrapper.
  • SUM() OVER (ORDER BY ts ROWS BETWEEN ...) — running totals and rolling windows.

Window functions deliver the most ROI per hour of prep for Salesforce-style interviews.

Enterprise SaaS-analytics framing

Salesforce's prompts are SaaS-analytics flavored: salaries (HR analytics), retention (product analytics), activity cohorts (engagement analytics), coupon revenue (revenue analytics), warehouse throughput (operations analytics), transaction state (data-platform internals), function pipelines (ETL platform). State the mapping out loud: "this is Nth-largest via subquery"; "this is cohort retention with the standard retained / cohort_size"; "this is MoM growth via LAG"; "this is dict-based transaction state."

Where to practice on PipeCode

Start with the Salesforce-tagged practice set, then back-fill weak topics from the topic-page drills linked above; window functions and cohort retention deserve the extra reps.

Communication under time pressure

Three things to say out loud before, during, and after each query — in this order:

  • Assumptions — before typing. "I'll assume logins(user_id, login_date) may have duplicates"; "I'll assume coupon_sales revenue is in a single currency"; "I'll assume the warehouse code is a fixed three-field format."
  • Invariants — after key code blocks. "LAG returns NULL on the first row by design"; "SPLIT_PART is 1-indexed"; "DISTINCT prevents duplicate-streak inflation."
  • Complexity — at the end. "O(N) for the cohort scan + O(N + M) for the retention join"; "one window pass, O(N + K log K)."

Interviewers rate clear reasoning above a silent-but-perfect solution.


Frequently Asked Questions

What is the Salesforce data engineering interview process like?

The Salesforce data engineering interview typically includes a phone screen (mostly SQL warm-up around aggregation and subqueries), one or two coding rounds focused on SQL window functions, cohort retention, and self-joins, plus a Python round on data-engineering primitives (state-tracking dictionaries and functional pipelines), a system-design conversation around Data Cloud / lakehouse architecture, and behavioral interviews. The curated 7-problem Salesforce practice set on PipeCode mirrors what you will see on the technical rounds.

What SQL topics does Salesforce test for data engineers?

Salesforce emphasizes subqueries for Nth-largest queries, cohort aggregation for retention KPIs, self-joins with date arithmetic for consecutive-day activity, window functions (especially LAG over DATE_TRUNC) for MoM growth, and aggregation with string manipulation for volume reports. Drill these on the subquery, cohort analysis, self-join, window functions, and aggregation topic pages.

How important is Python for a Salesforce data engineering interview?

Python is roughly 30% of the technical interview at Salesforce—lighter than SQL but not skippable. The patterns are DE primitives, not algorithm puzzles: hash-table design for atomic transaction state, closures + functools.reduce for pipeline composition. The hash table and closures topic pages plus Salesforce Python practice are the right drilling targets.

How hard are Salesforce data engineering interview questions?

Salesforce's curated set has 1 easy + 5 medium + 1 hard = a medium-heavy hub. The Hard-tier problem is window-function MoM growth—get fluent with LAG over DATE_TRUNC and you've handled the ceiling. Most candidates underestimate window functions; that's where Salesforce loops separate "good SQL" from "production SQL."

Are window functions and MoM growth common in Salesforce interviews?

Yes—LAG over DATE_TRUNC for MoM growth is the Hard-tier curated problem and the single highest-ROI prep target for Salesforce. Get comfortable computing (curr - prev) * 100.0 / NULLIF(prev, 0) percentages over both unpartitioned and partitioned windows. The same pattern applies to YoY (LAG(metric, 12)), week-over-week, and rolling averages—drill the family, not just the one query.

How many Salesforce practice problems should I solve before the interview?

Aim for 20–30 problems spanning all seven topic clusters above—not 100 of the same kind. Solve every problem in the Salesforce-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide. Window functions and cohort retention are the topics worth over-investing in for Salesforce specifically.


Start practicing Salesforce data engineering problems

Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Salesforce problems with tests, AI feedback, and a coding environment so you can drill the exact SQL and Python patterns Salesforce asks—without the noise of generic algorithm prep that doesn't apply to this loop.

Pipecode.ai is Leetcode for Data Engineering.

Browse Salesforce practice →
View all practice →
