Square data engineering interview questions are SQL-heavy, fintech-flavored, and PostgreSQL-grounded. Square rebranded to Block Inc. in December 2021, but the SQL bar — and the question shapes that show up in the live CoderPad pair-programming round — has not moved. Four primitives carry the loop: GROUP BY sender_id + COUNT(*) + ORDER BY DESC LIMIT 10 for top-N invoice senders, DATEDIFF / INTERVAL '30 days' cohort math for 30-day-post-signup activity windows, AVG(stars) OVER (PARTITION BY product_id, month) window aggregates for monthly product analytics, and COUNT(DISTINCT col) over status-filtered payment rows for fintech-grade transaction analysis. The framings are everyday Block / Square / CashApp data engineering — invoice ranking, cohort retention, monthly review averages, payment success counts.
This guide walks four SQL topic clusters end-to-end, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches a curated 2-problem set (1 EASY ranking+sorting+aggregation, 1 MEDIUM aggregation+date-functions+cohort-analysis) plus the two adjacent SQL primitives — window functions and payment-flow COUNT DISTINCT — that show up on every Block / Square / CashApp SQL question list. The interview is PostgreSQL on CoderPad; candidates who prep in MySQL / Snowflake / BigQuery dialect stutter on INTERVAL, DATE_PART, and EXTRACT syntax. Drill PostgreSQL-flavored answers from the start.
Top Square data engineering interview topics
From the Square data engineering practice set, the four numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–4) | Why it shows up at Square |
|---|---|---|
| 1 | SQL ranking, sorting, and aggregation for top-N invoice senders | Top 10 Invoice Senders (EASY) — GROUP BY sender_id + COUNT(*) + ORDER BY DESC + LIMIT 10, the SQL primitive for any "top-N entities by event volume" question. |
| 2 | SQL aggregation, date functions, and cohort analysis for post-signup activity | Users with High Activity After 30 Days Signup (MEDIUM) — DATEDIFF / INTERVAL '30 days' + GROUP BY user_id + HAVING COUNT(*) >= threshold, the canonical retention-cohort pattern. |
| 3 | SQL window functions for monthly averages and duplicate detection | Monthly product rating + duplicate detection — AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)) and ROW_NUMBER() OVER (PARTITION BY ...) for dedupe (DataLemur Block staples). |
| 4 | SQL COUNT DISTINCT and status filters for payment-flow analysis | Total unique successful transactions / senders / recipients — COUNT(DISTINCT col) WHERE status = 'Success', the fintech-grade pattern that drives Square / Block / CashApp payment analytics. |
PostgreSQL-on-CoderPad framing rule: Square interviews live SQL in CoderPad with PostgreSQL syntax. Use EXTRACT(MONTH FROM ...), DATE_TRUNC('week', ...), INTERVAL '30 days', DATEDIFF-equivalent date math, and COUNT(DISTINCT) natively. Snowflake / BigQuery / SQL Server idioms (DATE_PART quirks, QUALIFY, TOP N) trip up candidates and signal weak SQL fluency. State your dialect upfront and drill PostgreSQL syntax.
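A quick warm-up sketch of those PostgreSQL-native forms, using a hypothetical events(user_id, event_at) table — every line here parses on CoderPad's default dialect:
-- PostgreSQL-native date and aggregate idioms (hypothetical events table).
SELECT
    user_id,
    EXTRACT(MONTH FROM event_at)        AS event_month,       -- month component as a number
    DATE_TRUNC('week', event_at)        AS event_week,        -- snap to the week boundary
    event_at + INTERVAL '30 days'       AS thirty_days_later, -- forward window boundary
    event_at::date - DATE '2025-01-01'  AS days_since_launch  -- DATE - DATE = integer days (DATEDIFF stand-in)
FROM events
LIMIT 5;                                                      -- LIMIT, not TOP N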
1. SQL Ranking, Sorting, and Aggregation for Top-N Invoice Senders
Top-N entity ranking via GROUP BY + ORDER BY DESC + LIMIT in SQL for data engineering
"Return the top 10 senders by completed invoice count" is Square's signature EASY SQL prompt (Top 10 Invoice Senders). The mental model: GROUP BY sender_id collapses one row per sender; COUNT(*) produces the per-sender invoice count; ORDER BY count DESC ranks senders descending; LIMIT 10 returns only the top 10. Same primitive powers any "top-N entities by event volume" pipeline — top-N customers by order count, top-N products by review volume, top-N regions by daily active users.
Pro tip: Always alias the count column (AS invoice_count) and sort by the alias in ORDER BY. Mixing literal COUNT(*) and the alias is style noise that interviewers grade. Add a deterministic tiebreaker (, sender_id ASC) when ties at the cut matter — Square's interviewers probe ties at the boundary frequently.
GROUP BY + COUNT(*) for per-sender aggregation
The aggregation invariant: GROUP BY sender_id collapses all invoice rows that share the same sender_id into one output row; COUNT(*) counts the number of rows in each group. Every non-aggregate column in the SELECT must appear in GROUP BY (or be functionally dependent on it).
- GROUP BY sender_id — collapses to one row per sender.
- COUNT(*) — total rows per group; equivalent to COUNT(invoice_id) when invoice_id is non-null.
- COUNT(DISTINCT col) — unique values per group; useful when senders can repeat invoices.
- SELECT sender_id, COUNT(*) — only sender_id is referenced ungrouped; the GROUP BY makes this legal.
Worked example. Five invoices, three distinct senders.
| step | output |
|---|---|
| input | [(101, s1), (102, s2), (103, s1), (104, s3), (105, s1)] |
| GROUP BY sender_id | [s1: 3 rows, s2: 1 row, s3: 1 row] |
| COUNT(*) AS cnt | [(s1, 3), (s2, 1), (s3, 1)] |
Worked-example solution.
SELECT sender_id, COUNT(*) AS invoice_count
FROM invoices
GROUP BY sender_id;
Rule of thumb: GROUP BY on the dimension you want one-row-per-value, COUNT(*) for the metric, alias the count immediately so subsequent clauses can reference it.
ORDER BY count DESC + LIMIT N for top-N rankings
The top-N invariant: ORDER BY <metric> DESC sorts groups by the metric in descending order; LIMIT N returns only the first N. For ties at the cut, dialects differ — PostgreSQL (Square's CoderPad default) returns whichever order the planner picks unless you add an explicit tiebreaker.
- ORDER BY invoice_count DESC — sorts groups by the aliased column.
- LIMIT 10 — first 10 groups in sort order; PostgreSQL syntax.
- OFFSET 10 LIMIT 10 — pagination; ranks 11-20.
- Deterministic tiebreak — , sender_id ASC makes ties stable across runs.
Worked example. Sort 4 senders by invoice_count and take top 3.
| sender_id | invoice_count | rank |
|---|---|---|
| s1 | 4 | 1 |
| s2 | 2 | 2 |
| s3 | 1 | 3 |
| s4 | 1 | (cut) |
Worked-example solution.
SELECT sender_id, COUNT(*) AS invoice_count
FROM invoices
GROUP BY sender_id
ORDER BY invoice_count DESC, sender_id ASC
LIMIT 3;
Rule of thumb: always include a deterministic tiebreaker — fintech audits demand that ranked lists be reproducible across runs.
Tiebreaks via RANK(), DENSE_RANK(), and ROW_NUMBER()
The tiebreak invariant: ROW_NUMBER always gives a unique sequence (1, 2, 3, 4); RANK gives the same number for ties and skips the next (1, 1, 3, 4); DENSE_RANK gives the same number for ties without skipping (1, 1, 2, 3). Choose based on whether the prompt wants strict top-N row count, "all senders tied at rank N", or compact ranks without gaps.
- ROW_NUMBER() OVER (ORDER BY invoice_count DESC) — strict 1..N sequence; ties resolved by the planner.
- RANK() OVER (ORDER BY invoice_count DESC) — ties share a rank, the next rank skips.
- DENSE_RANK() OVER (ORDER BY invoice_count DESC) — ties share a rank, the next rank does not skip.
- LIMIT 10 — chops to the top 10 rows after sort; doesn't address ties unless paired with a RANK filter.
Worked example. Three senders tied at 5 invoices, one at 3, one at 2.
| sender_id | invoice_count | ROW_NUMBER | RANK | DENSE_RANK |
|---|---|---|---|---|
| s1 | 5 | 1 | 1 | 1 |
| s2 | 5 | 2 | 1 | 1 |
| s3 | 5 | 3 | 1 | 1 |
| s4 | 3 | 4 | 4 | 2 |
| s5 | 2 | 5 | 5 | 3 |
Worked-example solution.
SELECT sender_id, invoice_count,
ROW_NUMBER() OVER (ORDER BY invoice_count DESC) AS rn,
RANK() OVER (ORDER BY invoice_count DESC) AS rk,
DENSE_RANK() OVER (ORDER BY invoice_count DESC) AS drk
FROM (SELECT sender_id, COUNT(*) AS invoice_count
FROM invoices GROUP BY sender_id) t;
Rule of thumb: "top 10" with strict cardinality → LIMIT 10; "all senders tied at top 10" → wrap with WHERE rk <= 10; "compact ranks 1..K with no gaps" → DENSE_RANK.
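A minimal sketch of the "all senders tied at top 10" variant — wrap the ranked aggregate and filter on the rank (assumes the same invoices table used throughout this section):
-- Top 10 including ties: every sender whose RANK is <= 10 survives the cut.
WITH ranked AS (
    SELECT
        sender_id,
        COUNT(*) AS invoice_count,
        RANK() OVER (ORDER BY COUNT(*) DESC) AS rk   -- ties share a rank
    FROM invoices
    GROUP BY sender_id
)
SELECT sender_id, invoice_count
FROM ranked
WHERE rk <= 10                                       -- can return more than 10 rows when ties straddle the cut
ORDER BY invoice_count DESC, sender_id ASC;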
Common beginner mistakes
- Forgetting GROUP BY — SELECT sender_id, COUNT(*) without grouping is rejected by PostgreSQL ("column must appear in the GROUP BY clause or be used in an aggregate function").
- Forgetting ORDER BY ... DESC — top 10 silently becomes "first 10 by planner whim."
- Returning more than N rows by skipping LIMIT — graded as a wrong answer even when the top N are correct.
- Hardcoding ties to a single row when the prompt says "all senders tied at rank 10" — use a RANK filter, not LIMIT.
- Mixing COUNT(*) and COUNT(DISTINCT invoice_id) — non-equivalent when invoice_id can repeat (it usually can't, but state the assumption).
SQL Interview Question on Top 10 Invoice Senders
Given an invoices(invoice_id, sender_id, recipient_id, sent_at, amount) table, write a query that returns the top 10 senders by completed invoice count, ordered by count descending and breaking ties by sender_id ascending. Output two columns: sender_id and invoice_count.
Solution Using GROUP BY + COUNT(*) + ORDER BY DESC + LIMIT 10
SELECT
sender_id,
COUNT(*) AS invoice_count
FROM invoices
GROUP BY sender_id
ORDER BY invoice_count DESC, sender_id ASC
LIMIT 10;
Why this works: GROUP BY sender_id collapses the row stream to one row per sender; COUNT(*) produces the metric for each group; ORDER BY invoice_count DESC ranks groups descending; the , sender_id ASC tiebreak ensures stable output across runs (fintech-audit-grade); LIMIT 10 returns the first 10 ranked rows.
Step-by-step trace for an 8-row sample:
| invoice_id | sender_id |
|---|---|
| 101 | s1 |
| 102 | s2 |
| 103 | s1 |
| 104 | s3 |
| 105 | s1 |
| 106 | s2 |
| 107 | s4 |
| 108 | s1 |
- Group by sender_id — four groups: s1 (4 rows), s2 (2 rows), s3 (1 row), s4 (1 row).
- COUNT(*) — produces (s1, 4), (s2, 2), (s3, 1), (s4, 1).
- Order by invoice_count DESC, sender_id ASC — (s1, 4), (s2, 2), (s3, 1), (s4, 1).
- LIMIT 10 — only 4 senders exist; all four are returned.
Output:
| sender_id | invoice_count |
|---|---|
| s1 | 4 |
| s2 | 2 |
| s3 | 1 |
| s4 | 1 |
Why this works — concept by concept:
- GROUP BY sender_id — collapses rows that share sender_id; produces one output row per sender.
- COUNT(*) aggregate — counts rows in each group; equivalent to COUNT(invoice_id) when invoice_id is non-null (always true since it's the primary key).
- ORDER BY invoice_count DESC ranking — sorts groups descending by the metric; the DESC keyword is the entire difference between top-N and bottom-N.
- Deterministic tiebreak — , sender_id ASC ensures the output is reproducible across runs; fintech audits require this.
- LIMIT 10 cut — chops to the top 10 ranked rows; PostgreSQL syntax (CoderPad default).
- O(|invoices| + G log G) time — |invoices| rows scanned for the GROUP BY, then O(G log G) to sort the group output, where G is the number of senders.
Drill the Square SQL practice page for the curated EASY problem and the Square aggregation practice page for the only company-tagged topic surface available:
- Square SQL practice problems (Company — Square, SQL)
- Square aggregation problems (Company — Square / aggregation, SQL)
- SQL ranking problems (Topic — ranking, SQL)
2. SQL Aggregation, Date Functions, and Cohort Analysis for Post-Signup Activity
30-day cohort retention via DATEDIFF + GROUP BY + HAVING in SQL for data engineering
"Return every user who has at least 5 events in the 30 days after signup" is Square's signature MEDIUM SQL prompt (Users with High Activity After 30 Days Signup). The mental model: the cohort is defined by signup_date; the activity window is signup_date + INTERVAL '30 days'; events filtered to that window are aggregated GROUP BY user_id; HAVING COUNT(*) >= 5 returns the active users. Same primitive powers any retention or cohort analysis — N-day-post-signup activity, post-purchase repeat behavior, post-event engagement windows.
Pro tip: PostgreSQL has multiple ways to express the 30-day boundary: signup_date + INTERVAL '30 days', signup_date + INTERVAL '30 day', (event_at - signup_date) <= INTERVAL '30 days', (event_at::date - signup_date::date) <= 30. Pick one and stick with it. Mixing forms within the same query is a syntax-fluency red flag.
Date arithmetic: DATEDIFF, DATE_TRUNC, and INTERVAL '30 days'
The date-arithmetic invariant: in PostgreSQL, adding an INTERVAL to a DATE or TIMESTAMP yields a TIMESTAMP, subtracting two TIMESTAMPs yields an INTERVAL, and subtracting two DATEs yields an integer day count — pick comparisons that match those types. DATEDIFF is not native PostgreSQL — use (d1 - d2) on dates or EXTRACT(DAY FROM (t1 - t2)) on timestamps.
- signup_date + INTERVAL '30 days' — returns a timestamp 30 days after signup.
- (event_at - signup_date) <= INTERVAL '30 days' — interval comparison; works when both sides are timestamps (two plain DATEs subtract to an integer, so compare to 30 instead).
- DATE_TRUNC('week', event_at) — snaps to the week boundary; useful for weekly cohorts.
- EXTRACT(DAY FROM (event_at - signup_date)) — pulls the day count out of the interval as a number.
Worked example. Signup on 2025-01-01; events at 2025-01-10 and 2025-02-15.
| event_at | signup_date | diff | within 30 days? |
|---|---|---|---|
| 2025-01-10 | 2025-01-01 | 9 days | ✓ |
| 2025-02-15 | 2025-01-01 | 45 days | ✗ |
Worked-example solution.
SELECT event_at,
(event_at::date - signup_date::date) AS days_since_signup,
(event_at <= signup_date + INTERVAL '30 days') AS within_window
FROM users JOIN events USING (user_id);
Rule of thumb: prefer event_at <= signup_date + INTERVAL '30 days' over (event_at - signup_date) <= INTERVAL '30 days' — the additive form reads as "the event happened within the window" without a subtraction step.
Cohort definition via signup_date + INTERVAL '30 days'
The cohort invariant: a cohort is a set of users defined by a shared signup date or signup-date-bucket; the cohort's activity window is that signup-date plus a fixed interval. The cohort filter goes in WHERE, the per-user aggregation happens in GROUP BY, and the threshold check goes in HAVING.
- Cohort by exact signup date — WHERE u.signup_date = '2025-01-01'.
- Cohort by week — WHERE DATE_TRUNC('week', u.signup_date) = '2025-01-06'.
- Cohort by all users — no extra filter; aggregate per user.
- Activity window — WHERE e.event_at <= u.signup_date + INTERVAL '30 days'.
Worked example. Two users signed up on different dates; activity-window check per user.
| user_id | signup_date | window_end |
|---|---|---|
| u1 | 2025-01-01 | 2025-01-31 |
| u2 | 2025-01-15 | 2025-02-14 |
Worked-example solution.
SELECT u.user_id, u.signup_date,
u.signup_date + INTERVAL '30 days' AS window_end
FROM users u;
Rule of thumb: compute the window boundary in SELECT (or a CTE) when you need to reference it multiple times; recomputing signup_date + INTERVAL '30 days' in three different clauses signals copy-paste.
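A minimal sketch of that advice — the boundary computed once in a CTE and referenced by alias everywhere else (assumes the users and events tables from this section):
-- Window boundary computed once, referenced by name.
WITH windows AS (
    SELECT user_id,
           signup_date,
           signup_date + INTERVAL '30 days' AS window_end   -- computed exactly once
    FROM users
)
SELECT w.user_id, COUNT(*) AS events_in_window
FROM windows w
JOIN events e
  ON e.user_id = w.user_id
 AND e.event_at >= w.signup_date
 AND e.event_at <= w.window_end                             -- alias, not a recomputed expression
GROUP BY w.user_id;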
COUNT(*) per user with HAVING threshold
The threshold invariant: GROUP BY user_id collapses to one row per user; COUNT(*) produces the per-user activity count; HAVING COUNT(*) >= 5 filters groups whose count meets the threshold. HAVING is "WHERE on aggregates" — it filters group rows, not source rows.
- HAVING COUNT(*) >= 5 — group-level filter; aggregate predicate.
- HAVING COUNT(*) >= 5 AND MAX(event_at) > signup_date + INTERVAL '14 days' — compound threshold.
- Avoid WHERE COUNT(*) >= 5 — WHERE cannot reference aggregates; the PostgreSQL parser rejects it.
- Avoid enforcing the post-signup window in WHERE when you also want to count zero-event users — use an outer join instead (see the sketch after this subsection's worked example).
Worked example. Three users; threshold 5 events.
| user_id | events_in_window | passes? |
|---|---|---|
| u1 | 6 | ✓ |
| u2 | 3 | ✗ |
| u3 | 5 | ✓ |
Worked-example solution.
SELECT u.user_id, COUNT(*) AS events_in_window
FROM users u
JOIN events e ON e.user_id = u.user_id
WHERE e.event_at <= u.signup_date + INTERVAL '30 days'
GROUP BY u.user_id
HAVING COUNT(*) >= 5;
Rule of thumb: row-level predicates → WHERE; aggregate predicates → HAVING. Crossing them is a graded conceptual error.
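A sketch of the zero-event variant flagged in the bullet list above: moving the window predicate into a LEFT JOIN keeps users with no in-window events, and COUNT(e.event_id) gives them a 0 instead of dropping them.
-- Include zero-event users: the window predicate lives in the LEFT JOIN condition.
SELECT
    u.user_id,
    COUNT(e.event_id) AS events_in_window       -- COUNT(col) skips NULLs, so no-event users score 0
FROM users u
LEFT JOIN events e
       ON e.user_id = u.user_id
      AND e.event_at >= u.signup_date
      AND e.event_at <= u.signup_date + INTERVAL '30 days'
GROUP BY u.user_id
ORDER BY events_in_window DESC;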
Common beginner mistakes
- Using DATEDIFF(event_at, signup_date) in PostgreSQL — that's MySQL syntax; PostgreSQL uses (d1 - d2) on dates or EXTRACT(DAY FROM ...) on timestamps.
- Putting the aggregate threshold in WHERE — the parser rejects it; use HAVING.
- Filtering signup_date in HAVING — that's a row predicate; it belongs in WHERE for performance.
- Forgetting the JOIN — querying events alone misses the per-user signup_date anchor.
- Missing the INTERVAL '30 days' boundary — counting all-time activity instead of the 30-day window inflates results.
- Timezone bugs — event_at and signup_date in different timezones produce off-by-one days; cast both to a single timezone when in doubt (see the sketch below).
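A minimal sketch of that timezone normalization, assuming event_at is a TIMESTAMPTZ and signup_date is a plain DATE stored in UTC — adjust the zone to whatever the tables actually use:
-- Normalize to one timezone before doing day arithmetic.
SELECT
    u.user_id,
    (e.event_at AT TIME ZONE 'UTC')::date                  AS event_day_utc,
    (e.event_at AT TIME ZONE 'UTC')::date - u.signup_date  AS days_since_signup   -- DATE - DATE = integer
FROM users u
JOIN events e ON e.user_id = u.user_id;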
SQL Interview Question on 30-Day Post-Signup Activity
Given users(user_id, signup_date) and events(event_id, user_id, event_at), return every user_id whose events in the 30 days after signup are at least 5. Output user_id and events_in_window, ordered by events_in_window descending.
Solution Using JOIN + WHERE INTERVAL + GROUP BY + HAVING COUNT(*) >= 5
SELECT
u.user_id,
COUNT(*) AS events_in_window
FROM users u
JOIN events e
ON e.user_id = u.user_id
WHERE e.event_at <= u.signup_date + INTERVAL '30 days'
AND e.event_at >= u.signup_date
GROUP BY u.user_id
HAVING COUNT(*) >= 5
ORDER BY events_in_window DESC;
Why this works: the JOIN pairs each event with its user's signup_date; the WHERE clause restricts events to the 30-day window post-signup (>= signup_date AND <= signup_date + INTERVAL '30 days'); GROUP BY u.user_id collapses to one row per user; COUNT(*) produces the per-user in-window event count; HAVING COUNT(*) >= 5 filters users whose count crosses the threshold; ORDER BY events_in_window DESC ranks active users by engagement.
Step-by-step trace for three users:
| user_id | signup_date | events_in_first_30d |
|---|---|---|
| u1 | 2025-01-01 | 6 |
| u2 | 2025-01-15 | 3 |
| u3 | 2025-02-10 | 5 |
- JOIN — pairs each event row with its signup_date from users.
- WHERE window filter — keeps only events where event_at is in [signup_date, signup_date + 30 days].
- GROUP BY user_id — collapses to one row per user with the in-window event count.
- HAVING >= 5 — filters out u2 (3 events). u1 (6) and u3 (5) pass.
- ORDER BY DESC — u1 (6) ranks above u3 (5).
Output:
| user_id | events_in_window |
|---|---|
| u1 | 6 |
| u3 | 5 |
Why this works — concept by concept:
- JOIN on user_id — pairs each event with its user's signup_date; without this, the query has no anchor for the window.
- INTERVAL '30 days' boundary — PostgreSQL-native expression of the 30-day forward window; reads as "30 days from signup."
- WHERE row filter — strips out-of-window events before grouping, where row-level predicates belong.
- GROUP BY u.user_id — collapses the row stream to one row per user; the only non-aggregate column in SELECT matches the GROUP BY.
- HAVING COUNT(*) >= 5 — filters groups; aggregate predicates have to live here.
- O(|users| + |events| + G log G) time — the JOIN dominates; the G users with at least one in-window event get sorted at the end.
Practice date-functions SQL problems and cohort-analysis problems on PipeCode:
- Square SQL practice problems (Company — Square, SQL)
- SQL date-function problems (Topic — date functions, SQL)
- SQL cohort-analysis problems (Topic — cohort analysis, SQL)
3. SQL Window Functions for Monthly Averages and Duplicate Detection
Window AVG OVER + ROW_NUMBER OVER for analytics in SQL for data engineering
"Calculate the monthly average rating for each product" is Block / Square's signature window-function SQL prompt (DataLemur Q1). The mental model: AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)) produces the monthly product average on every row, with no row-collapse. The mirror primitive is ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) for duplicate detection — flag rows whose rn > 1 per partition. Same primitive powers any "row-level value vs group-aggregate" or "first-occurrence-per-group" question.
Pro tip: Window functions and GROUP BY answer different questions. GROUP BY collapses rows; window aggregates compute on the group without collapsing. Use GROUP BY when you want one row per group; use a window when you want every original row plus a per-group aggregate column.
Window AVG: AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date))
The window-AVG invariant: AVG(expr) OVER (PARTITION BY ...) returns the average of expr across all rows in the same partition, attached to every row. Unlike GROUP BY AVG(...), the row count is preserved.
- PARTITION BY product_id — one window per product.
- PARTITION BY product_id, EXTRACT(MONTH FROM submit_date) — one window per (product, month).
- OVER () — empty parens = a global window across all rows.
- ORDER BY submit_date ROWS UNBOUNDED PRECEDING — running average up to the current row.
Worked example. Two reviews for product 50001 in June; window AVG returns 3.5 on both rows.
| review_id | product_id | submit_date | stars | window_avg |
|---|---|---|---|---|
| 6171 | 50001 | 2022-06-08 | 4 | 3.5 |
| 5293 | 50001 | 2022-06-18 | 3 | 3.5 |
Worked-example solution.
SELECT review_id, product_id, submit_date, stars,
AVG(stars) OVER (
PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)
) AS monthly_avg
FROM reviews;
Rule of thumb: "show me each row alongside the group average" → window; "show me one row per group" → GROUP BY.
ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) for duplicate detection
The duplicate-detection invariant: ROW_NUMBER() OVER (PARTITION BY <key_columns> ORDER BY <tiebreaker>) assigns 1, 2, 3, … to rows within each partition by the duplicate key; rows with row_number > 1 are duplicates.
- PARTITION BY user_id, product_id — defines what makes a duplicate.
- ORDER BY submit_date ASC — tiebreaker; deterministic.
- Filter — wrap in a subquery / CTE: WHERE rn > 1 for duplicates only, WHERE rn = 1 for the first occurrence.
- Alternative — COUNT(*) OVER (PARTITION BY ...) to count duplicates per row.
Worked example. Two reviews from same user for same product → duplicate.
| review_id | user_id | product_id | submit_date | rn |
|---|---|---|---|---|
| 6171 | 123 | 50001 | 2022-06-08 | 1 |
| 9999 | 123 | 50001 | 2022-06-15 | 2 (duplicate) |
Worked-example solution.
SELECT *
FROM (
SELECT review_id, user_id, product_id, submit_date,
ROW_NUMBER() OVER (
PARTITION BY user_id, product_id ORDER BY submit_date
) AS rn
FROM reviews
) t
WHERE rn > 1;
Rule of thumb: ROW_NUMBER OVER (PARTITION BY <duplicate_key> ORDER BY <tiebreaker>) is the universal SQL idiom for "find duplicates" — no GROUP BY needed.
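The mirror of the worked example above — keep only the first occurrence per duplicate key by flipping the filter to rn = 1 (same reviews table):
-- Dedupe by keeping the earliest review per (user_id, product_id).
SELECT review_id, user_id, product_id, submit_date
FROM (
    SELECT review_id, user_id, product_id, submit_date,
           ROW_NUMBER() OVER (
               PARTITION BY user_id, product_id ORDER BY submit_date
           ) AS rn
    FROM reviews
) t
WHERE rn = 1;          -- first occurrence survives; the rn > 1 rows are the duplicates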
Window vs GROUP BY: row-level vs collapsed-row aggregations
The distinction invariant: window functions preserve the input row count and add per-group aggregate columns; GROUP BY collapses input rows to one row per group and replaces non-aggregate columns. Many candidates conflate them; interviewers grade the distinction.
- Window — SELECT review_id, AVG(stars) OVER (PARTITION BY product_id) FROM reviews; — row count preserved.
- GROUP BY — SELECT product_id, AVG(stars) FROM reviews GROUP BY product_id; — one row per product.
- Mixed — wrap the GROUP BY in a CTE, then JOIN to the original; messy.
- QUALIFY — Snowflake-only filter on window results; PostgreSQL needs a wrapping subquery instead.
Worked example. Same data, two different shapes.
| approach | rows returned |
|---|---|
| Window AVG | 5 (same as input) |
| GROUP BY AVG | 2 (one per product) |
Worked-example solution.
-- Window: keep all rows
SELECT review_id, product_id, stars,
AVG(stars) OVER (PARTITION BY product_id) AS product_avg
FROM reviews;
-- GROUP BY: collapse to one row per product
SELECT product_id, AVG(stars) AS product_avg
FROM reviews
GROUP BY product_id;
Rule of thumb: read the prompt — "for each row, show…" = window; "summarize per product" = GROUP BY.
Common beginner mistakes
- Forgetting PARTITION BY — the window covers the whole table, producing the global average on every row.
- Using GROUP BY when the prompt says "for each review, show the monthly product average" — collapses rows incorrectly.
- Forgetting ORDER BY inside ROW_NUMBER() OVER (...) — non-deterministic row numbering; the duplicate filter becomes random.
- Using RANK() when ROW_NUMBER() is correct — RANK gives ties the same number, so rank > 1 doesn't reliably flag duplicates.
- Filtering window results in WHERE — PostgreSQL requires a wrapping subquery / CTE because window functions evaluate after WHERE.
SQL Interview Question on Monthly Average Product Ratings
Given reviews(review_id, user_id, submit_date, product_id, stars), write a query that returns, for each row, submit_date_month, product_id, and the monthly average rating for that product (avg_stars), sorted by month and product_id.
Solution Using AVG ... OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date))
SELECT
EXTRACT(MONTH FROM submit_date) AS mth,
product_id,
ROUND(AVG(stars) OVER (
PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)
), 2) AS avg_stars
FROM reviews
ORDER BY mth, product_id;
Why this works: EXTRACT(MONTH FROM submit_date) pulls the month component from each row; AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)) partitions by (product, month) and averages stars within each partition; ROUND(..., 2) trims the result to two decimals; ORDER BY mth, product_id produces the canonical sort order. Each row keeps its identity — this is a window, not GROUP BY.
Step-by-step trace for the DataLemur sample:
| review_id | user_id | submit_date | product_id | stars |
|---|---|---|---|---|
| 6171 | 123 | 2022-06-08 | 50001 | 4 |
| 7802 | 265 | 2022-06-10 | 69852 | 4 |
| 5293 | 362 | 2022-06-18 | 50001 | 3 |
| 6352 | 192 | 2022-07-26 | 69852 | 3 |
| 4517 | 981 | 2022-07-05 | 69852 | 2 |
- EXTRACT(MONTH ...) — produces [6, 6, 6, 7, 7].
- Partition by (product_id, month) — three partitions: (50001, 6) with 2 rows, (69852, 6) with 1 row, (69852, 7) with 2 rows.
- AVG within partition — (50001, 6) → 3.5; (69852, 6) → 4.0; (69852, 7) → 2.5.
- ROUND + ORDER BY — five rows emitted, sorted by (mth, product_id).
Output:
| mth | product_id | avg_stars |
|---|---|---|
| 6 | 50001 | 3.50 |
| 6 | 50001 | 3.50 |
| 6 | 69852 | 4.00 |
| 7 | 69852 | 2.50 |
| 7 | 69852 | 2.50 |
Why this works — concept by concept:
- EXTRACT(MONTH FROM submit_date) — PostgreSQL date-component function; returns the month as a number.
- AVG(...) OVER (PARTITION BY ...) — window aggregate; preserves row count and computes the per-partition mean.
- Composite partition key — (product_id, EXTRACT(MONTH FROM submit_date)) ensures monthly granularity per product.
- ROUND(..., 2) — trims to 2 decimals; cleaner output than the default numeric representation.
- ORDER BY mth, product_id — canonical sort order for the audit-readable result.
- O(N log N) time — the planner sorts rows by partition keys once; no GROUP BY collapse.
More SQL window-function problems and aggregation problems on PipeCode:
- Square SQL practice problems (Company — Square, SQL)
- SQL window-function problems (Topic — window functions, SQL)
- SQL aggregation problems (Topic — aggregation, SQL)
4. SQL COUNT DISTINCT and Status Filters for Payment-Flow Analysis
Payment-flow SQL with COUNT DISTINCT and status filters in SQL for Square data engineering
"Return the total unique successful transactions, unique senders, and unique recipients" is Square / CashApp's signature payment-flow SQL prompt (Medium / DataLemur Q6 staple). The mental model: filter rows with WHERE status = 'Success' (idempotency boundary), then aggregate with COUNT(DISTINCT col) over the columns you care about — payment_id, sender_id, recipient_id. Same primitive powers any "unique-X for filtered Y" pipeline — unique active users last week, unique trading symbols traded yesterday, unique ad creatives clicked today.
Pro tip: Always state your idempotency boundary out loud before writing the query: "I'm filtering to status = 'Success' because failed/pending transactions don't count toward unique-success metrics." Square interviewers grade this phrasing — it's the difference between "candidate writes correct SQL" and "candidate thinks like a data engineer at a regulated payments company."
COUNT vs COUNT DISTINCT: row count vs unique-value count
The distinct-count invariant: COUNT(*) counts rows; COUNT(col) counts non-null values in col; COUNT(DISTINCT col) counts distinct non-null values in col. The three are not interchangeable.
- COUNT(*) — total rows in the result set.
- COUNT(payment_id) — non-null payment IDs (often equal to COUNT(*) since payment_id is the PK).
- COUNT(DISTINCT sender_id) — unique senders; collapses repeated sender_ids.
- COUNT(DISTINCT (col1, col2)) — distinct combinations; PostgreSQL wants the row-value parentheses (the bare col1, col2 form is MySQL). See the pair-count sketch at the end of this subsection.
Worked example. Five payments from 3 distinct senders.
| payment_id | sender_id |
|---|---|
| p1 | s1 |
| p2 | s2 |
| p3 | s1 |
| p4 | s3 |
| p5 | s1 |
| metric | value |
|---|---|
| COUNT(*) | 5 |
| COUNT(payment_id) | 5 |
| COUNT(DISTINCT sender_id) | 3 |
Worked-example solution.
SELECT COUNT(*) AS total_rows,
COUNT(payment_id) AS total_payment_ids,
COUNT(DISTINCT sender_id) AS unique_senders
FROM payments;
Rule of thumb: "how many rows" → COUNT(*); "how many unique entities" → COUNT(DISTINCT col).
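The pair-count sketch referenced in the bullet list above — PostgreSQL's row-value form for distinct combinations, against the same payments table:
-- Distinct (sender, recipient) pairs via a row value inside DISTINCT.
SELECT
    COUNT(DISTINCT sender_id)                 AS unique_senders,
    COUNT(DISTINCT (sender_id, recipient_id)) AS unique_sender_recipient_pairs
FROM payments;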
WHERE status = 'Success' before aggregation
The idempotency-filter invariant: payment-flow analytics almost always filter to status = 'Success' (or whichever success state applies); failed and pending transactions do not contribute to success metrics. The filter goes in WHERE, before aggregation.
- WHERE status = 'Success' — strictly successful payments.
- WHERE status IN ('Success', 'Settled') — multiple success-equivalent statuses.
- WHERE status NOT IN ('Failed', 'Cancelled') — exclusionary; semantically the same when the status universe is finite.
- State the assumption — interviewers expect you to clarify which statuses count.
Worked example. Six payments, three successful.
| payment_id | sender_id | status |
|---|---|---|
| p1 | s1 | Success |
| p2 | s2 | Failed |
| p3 | s1 | Success |
| p4 | s3 | Pending |
| p5 | s1 | Success |
| p6 | s4 | Failed |
After WHERE status = 'Success': 3 rows (p1, p3, p5), unique senders = 1 (s1 only).
Worked-example solution.
SELECT COUNT(*) AS successful_payments,
COUNT(DISTINCT sender_id) AS unique_successful_senders
FROM payments
WHERE status = 'Success';
Rule of thumb: the success-filter is non-negotiable for payment-flow questions; ask the interviewer which status values count if the prompt is ambiguous.
Multiple COUNT DISTINCT in one SELECT: sender / recipient splits
The compound-aggregate invariant: a single SELECT can compute multiple COUNT(DISTINCT col) aggregates over the same filtered row set; this is the common shape for payment-flow "uniques per role" questions.
- Three aggregates in one SELECT — COUNT(DISTINCT payment_id), COUNT(DISTINCT sender_id), COUNT(DISTINCT recipient_id).
- Same WHERE filter applies to all — a single source row set drives every aggregate.
- Senders ≠ recipients — even though both columns reference users, the distinct counts can differ.
- COUNT(DISTINCT user_id) — pool sender + recipient with UNION ALL first if you want "unique users involved"; see the sketch at the end of this subsection.
Worked example. Three successful payments; two distinct senders, three distinct recipients.
| payment_id | sender_id | recipient_id |
|---|---|---|
| p1 | s1 | r1 |
| p3 | s1 | r2 |
| p5 | s2 | r3 |
| metric | value |
|---|---|
| COUNT(DISTINCT payment_id) | 3 |
| COUNT(DISTINCT sender_id) | 2 |
| COUNT(DISTINCT recipient_id) | 3 |
Worked-example solution.
SELECT COUNT(DISTINCT payment_id) AS unique_transactions,
COUNT(DISTINCT sender_id) AS unique_senders,
COUNT(DISTINCT recipient_id) AS unique_recipients
FROM payments
WHERE status = 'Success';
Rule of thumb: one query, multiple aggregates is the cleanest shape; never write three separate queries for sender / recipient / transaction counts.
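The "unique users involved" sketch referenced above — pool both roles with UNION ALL, then count distinct over the pooled column (same payments table, success-filtered):
-- Unique users involved in successful payments, pooling senders and recipients.
SELECT COUNT(DISTINCT user_id) AS unique_users_involved
FROM (
    SELECT sender_id    AS user_id FROM payments WHERE status = 'Success'
    UNION ALL
    SELECT recipient_id AS user_id FROM payments WHERE status = 'Success'
) involved;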
Common beginner mistakes
- Using COUNT(*) when the prompt says "unique" — undercounts or overcounts, depending on whether duplicates exist.
- Forgetting WHERE status = 'Success' — counts failed and pending transactions; the metric becomes meaningless.
- Putting the status filter in HAVING instead of WHERE — works but is slow and reads worse.
- Writing three separate queries for sender / recipient / transaction — these compose into one SELECT with three aggregates.
- Confusing "unique senders" with "unique users" — senders and recipients can be the same individual; pool them only if the prompt asks for "unique users."
SQL Interview Question on Payment-Flow Unique Counts
Given payments(payment_id, sender_id, recipient_id, amount, status, created_at) where status is one of 'Success', 'Failed', 'Pending', write a query that returns three columns: unique_transactions, unique_senders, unique_recipients — each a COUNT(DISTINCT) over rows where status = 'Success'.
Solution Using COUNT(DISTINCT) + WHERE status = 'Success'
SELECT
COUNT(DISTINCT payment_id) AS unique_transactions,
COUNT(DISTINCT sender_id) AS unique_senders,
COUNT(DISTINCT recipient_id) AS unique_recipients
FROM payments
WHERE status = 'Success';
Why this works: WHERE status = 'Success' strips failed and pending payments before aggregation so every count reflects only successful transactions; three COUNT(DISTINCT col) aggregates over the filtered row set produce the three unique counts in a single pass; the result is one row with three columns — the canonical shape for "summary metrics for a filtered universe."
Step-by-step trace for a 6-row sample:
| payment_id | sender_id | recipient_id | status |
|---|---|---|---|
| p1 | s1 | r1 | Success |
| p2 | s2 | r2 | Failed |
| p3 | s1 | r2 | Success |
| p4 | s3 | r1 | Pending |
| p5 | s2 | r3 | Success |
| p6 | s1 | r1 | Success |
- WHERE filter — strips p2 (Failed) and p4 (Pending); 4 successful rows remain.
- COUNT(DISTINCT payment_id) — {p1, p3, p5, p6} = 4 (every payment_id is unique by definition).
- COUNT(DISTINCT sender_id) — {s1, s2} = 2.
- COUNT(DISTINCT recipient_id) — {r1, r2, r3} = 3.
Output:
| unique_transactions | unique_senders | unique_recipients |
|---|---|---|
| 4 | 2 | 3 |
Why this works — concept by concept:
- WHERE status = 'Success' — the idempotency filter; payment-flow metrics never include failed or pending rows.
- COUNT(DISTINCT payment_id) — counts unique transaction IDs; equal to COUNT(*) here since payment_id is the primary key, but spelled out for clarity.
- COUNT(DISTINCT sender_id) — counts unique senders; senders can repeat across payments, so this collapses repeats.
- COUNT(DISTINCT recipient_id) — counts unique recipients; same logic, different column.
- Single-pass three-aggregate SELECT — one scan of payments filtered to Success, three aggregates computed in parallel; no need for separate queries.
- O(|payments|) time / O(D) space — one linear scan; D = the sum of distinct cardinalities held in three small hash structures.
Drill more SQL filtering problems and the Square SQL practice page for the curated 2-problem set:
- Square SQL practice problems (Company — Square, SQL)
- SQL filtering problems (Topic — filtering, SQL)
- SQL aggregation problems (Topic — aggregation, SQL)
Tips to crack Square data engineering interviews
Square = Block — both names refer to the same loop
Square rebranded to Block Inc. in December 2021. The company name on offer letters is Block; the consumer products (Square seller tools, Cash App, Tidal, Spiral) keep their original brands. The data-engineering interview loop, the SQL bar, and the question shapes did not change with the rebrand. Search for both "Square data engineer interview" and "Block data engineer interview" — every external article you find under one name applies under the other.
Drill the four SQL primitives
The four primitives in this guide map directly to the curated 2 PipeCode SQL problems plus the two adjacent SQL primitives every Block / Square / CashApp SQL list rotates through: GROUP BY + COUNT + ORDER BY DESC + LIMIT N for top-N rankings (#45), DATEDIFF / INTERVAL '30 days' + GROUP BY + HAVING for cohort-retention queries (#217), AVG OVER PARTITION BY and ROW_NUMBER OVER window functions for monthly aggregates and duplicate detection (DataLemur Q1 + Q3), and COUNT(DISTINCT) + WHERE status = 'Success' payment-flow aggregations (Medium + DataLemur Q6).
CoderPad PostgreSQL is the live coding environment
Square / Block / CashApp interviews live SQL in CoderPad with PostgreSQL as the only available dialect. Drill PostgreSQL-flavored answers — EXTRACT(MONTH FROM ...), DATE_TRUNC('week', ...), INTERVAL '30 days', (d1 - d2)::int, COUNT(DISTINCT col), LIMIT N OFFSET M. Avoid Snowflake's QUALIFY, MySQL's DATEDIFF(d1, d2), BigQuery's DATE_DIFF, SQL Server's TOP N — these all parse-fail in PostgreSQL. Google search is allowed during the live interview, but stuttering on dialect signals weak SQL fluency.
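A small translation sketch — PostgreSQL stand-ins for the idioms above, against a hypothetical events(user_id, event_at) table; the comments name the dialect each line replaces:
-- PostgreSQL equivalents of common non-Postgres idioms.
SELECT
    CURRENT_DATE - DATE '2025-01-01'  AS day_diff,      -- replaces MySQL DATEDIFF / BigQuery DATE_DIFF
    DATE_TRUNC('week', NOW())         AS week_start,    -- week bucketing
    COUNT(DISTINCT user_id)           AS unique_users   -- native COUNT(DISTINCT)
FROM events
LIMIT 10;                                               -- replaces SQL Server TOP 10
QUALIFY has no direct PostgreSQL equivalent — wrap the window query in a subquery or CTE and filter in the outer WHERE, as shown in section 3.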
Medallion architecture (Bronze / Silver / Gold) is the system-design house style
Square's data platform is structured around the Medallion architecture — Bronze for raw landed data, Silver for cleaned and enriched data, Gold for aggregated analytics-ready data. When you discuss pipeline design in the system-design or onsite panel rounds, frame the data flow as Bronze → Silver → Gold, name the orchestrator (Airflow or dbt), and mention data-quality gates (Great Expectations) between layers. This single framing carries weight at every Square data-engineering onsite.
Data quality is graded heavily — Great Expectations and validation gates
Square's interviewers probe data-quality concerns on every round. Mention specific tools you've used (Great Expectations, Soda, dbt tests), describe the validation gates between Bronze / Silver / Gold layers, and articulate the difference between "schema validation" (column types and nullability) and "semantic validation" (business-logic invariants like transaction_amount > 0 or payment_status IN ('Success', 'Failed', 'Pending')).
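A minimal semantic-validation gate sketched as plain SQL against the payments table from section 4 — in a real pipeline these invariants would live in Great Expectations expectations or dbt tests rather than ad-hoc queries:
-- Semantic validation: count rows violating business-logic invariants.
SELECT
    COUNT(*) FILTER (WHERE amount <= 0)                                     AS nonpositive_amounts,
    COUNT(*) FILTER (WHERE status NOT IN ('Success', 'Failed', 'Pending'))  AS unknown_statuses
FROM payments;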
4-stage interview process — HR, Hiring Manager, Technical, Onsite Panel (NDA)
The Square data-engineering loop runs four stages: HR screen (15-30 min recruiter), Hiring Manager interview (~30 min, past experience and data-quality), Technical Screen (~1 hour, Python + SQL on CoderPad with medium-level complexity), and Onsite Panel (multiple 30-45 min rounds, NDA required, data modeling + analytical + cultural fit). Total comp data points are limited (only 2 reported), but the average base salary is $139,850, range $101K-$185K.
Where to practice on PipeCode
Start with the Square practice page and the language-scoped Square SQL practice page for the curated 2-problem set. Hit the company_topic Square — aggregation page for the only Square-tagged topic surface available. After that, drill the matching topic pages: ranking, aggregation, date functions, cohort analysis, window functions, filtering, group by. The interview courses page bundles structured curricula. For broader coverage, browse by topic, or pivot to peer guides — the Airbnb DE interview guide and the top DE interview questions 2026 blog. The SQL data types Postgres guide is a useful refresher because PostgreSQL is the CoderPad default.
Communication and approach under time pressure
Talk through the invariant first ("this is a top-N ranking problem with a tiebreak requirement"), the brute force second ("a self-join would also work but is O(n²)"), and the optimal third ("GROUP BY + ORDER BY + LIMIT is O(n log n) and idiomatic"). Interviewers grade process as much as the final answer. Leave 5 minutes for an edge-case sweep: empty input, ties at the cut, NULL handling, timezone bugs in date arithmetic, status values you didn't expect (Pending, Cancelled). The most common "almost passed" failure mode is correct happy-path code that crashes on edge cases — a 30-second sweep prevents it.
Frequently Asked Questions
What is the Square data engineering interview process?
The Square data engineering interview opens with a 15-30 minute recruiter HR screen, then a 30-minute hiring-manager interview focused on past experience and data-quality philosophy, then a 1-hour technical screen on CoderPad covering Python and SQL at medium-level complexity, then an onsite panel of multiple 30-45 minute rounds (data modeling, analytical thinking, cultural fit). The onsite panel requires an NDA. End-to-end the loop runs three to four weeks.
Is Square the same as Block? Which name should I search for?
Yes — Square rebranded to Block Inc. in December 2021. The legal entity is Block; the consumer brands (Square seller tools, Cash App, Tidal, Spiral) keep their original names. Search for both "Square data engineer interview" and "Block data engineer interview" — every external article you find under one name applies under the other. The DataLemur SQL guide is titled "Block SQL Interview Questions" and the IQ guide is titled "Square Data Engineer Interview Questions"; both refer to the same loop.
What programming languages does Square test in data engineering interviews?
Square tests SQL and Python in the technical screen and onsite. SQL is the heavier surface — the live coding round in CoderPad is PostgreSQL-only, with patterns like GROUP BY + COUNT + ORDER BY DESC + LIMIT, INTERVAL '30 days' cohort math, AVG OVER PARTITION BY window aggregates, and COUNT(DISTINCT col) over status-filtered payment data. Python is the lighter surface — typically 30 minutes of medium-difficulty data-manipulation questions (Pandas / dict / list comprehensions). Spark / PySpark show up in the onsite panel for senior roles.
How difficult are Square data engineering interview questions?
The curated Square practice set on PipeCode is 1 EASY + 1 MEDIUM, no hard. The EASY is a SQL ranking + aggregation problem (Top 10 Invoice Senders); the MEDIUM is a SQL aggregation + date-functions + cohort-analysis problem (Users with High Activity After 30 Days Signup). The DataLemur Block list adds 10 more SQL questions ranging from definition-style (joins, constraints, clustered vs non-clustered indexes) to scenario-grade (window AVG OVER, ROW_NUMBER for duplicates, click-through rate via LEFT JOIN + COUNT DISTINCT). Stuttering on the EASY is a stronger negative signal than struggling with the MEDIUM.
What is the Square data engineer salary range?
Square data engineer base salary ranges from $101K to $185K, with an average of $139,850 (median $140K) across 21 reported data points (per Interview Query). Total compensation data is sparse (only 2 reported points), so the total-comp average ($72K) is under-sampled and likely inaccurate; expect actual total comp to be substantially higher once equity refreshers and bonuses are factored in. Negotiation success rates are best-supported by competing offers and verified levels.fyi entries.
What tech stack does Square's data engineering team use?
Square's data engineering stack includes Python (heavy — Pandas, PySpark for large-scale transforms), SQL (PostgreSQL on CoderPad for interviews; production warehouses include Snowflake-equivalent platforms), Airflow + dbt for ETL/ELT orchestration, AWS or GCP for cloud infrastructure, Apache Spark for distributed data processing, the Medallion architecture (Bronze / Silver / Gold layers) for data organization, and Great Expectations for data-quality validation. The cultural emphasis is on data-quality-first pipelines, cross-functional collaboration with product / DS teams, and continuous learning.