Gowtham Potureddi

Square Data Engineering Interview Questions & Prep Guide

Square data engineering interview questions are SQL-heavy, fintech-flavored, and PostgreSQL-grounded. Square rebranded to Block Inc. in December 2021, but the SQL bar — and the question shapes that show up in the live CoderPad pair-programming round — has not moved. Four primitives carry the loop: GROUP BY sender_id + COUNT(*) + ORDER BY DESC LIMIT 10 for top-N invoice senders, DATEDIFF / INTERVAL '30 days' cohort math for 30-day-post-signup activity windows, AVG(stars) OVER (PARTITION BY product_id, month) window aggregates for monthly product analytics, and COUNT(DISTINCT col) over status-filtered payment rows for fintech-grade transaction analysis. The framings are everyday Block / Square / CashApp data engineering — invoice ranking, cohort retention, monthly review averages, payment success counts.

This guide walks four SQL topic clusters end-to-end, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches a curated 2-problem set (1 EASY ranking+sorting+aggregation, 1 MEDIUM aggregation+date-functions+cohort-analysis) plus the two adjacent SQL primitives — window functions and payment-flow COUNT DISTINCT — that show up on every Block / Square / CashApp SQL question list. The interview is PostgreSQL on CoderPad; candidates who prep in MySQL / Snowflake / BigQuery dialect stutter on INTERVAL, DATE_PART, and EXTRACT syntax. Drill PostgreSQL-flavored answers from the start.



Top Square data engineering interview topics

From the Square data engineering practice set, the four numbered sections below follow this topic map (one entry per section):

  1. SQL ranking, sorting, and aggregation for top-N invoice senders — Top 10 Invoice Senders (EASY): GROUP BY sender_id + COUNT(*) + ORDER BY DESC + LIMIT 10, the SQL primitive for any "top-N entities by event volume" question.
  2. SQL aggregation, date functions, and cohort analysis for post-signup activity — Users with High Activity After 30 Days Signup (MEDIUM): DATEDIFF / INTERVAL '30 days' + GROUP BY user_id + HAVING COUNT(*) >= threshold, the canonical retention-cohort pattern.
  3. SQL window functions for monthly averages and duplicate detection — Monthly product rating + duplicate detection: AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)) and ROW_NUMBER() OVER (PARTITION BY ...) for dedupe (DataLemur Block staples).
  4. SQL COUNT DISTINCT and status filters for payment-flow analysis — Total unique successful transactions / senders / recipients: COUNT(DISTINCT col) WHERE status = 'Success', the fintech-grade pattern that drives Square / Block / CashApp payment analytics.

PostgreSQL-on-CoderPad framing rule: Square interviews live SQL in CoderPad with PostgreSQL syntax. Use EXTRACT(MONTH FROM ...), DATE_TRUNC('week', ...), INTERVAL '30 days', DATEDIFF-equivalent date math, and COUNT(DISTINCT) natively. Snowflake / BigQuery / SQL Server idioms (DATE_PART quirks, QUALIFY, TOP N) trip up candidates and signal weak SQL fluency. State your dialect upfront and drill PostgreSQL syntax.


1. SQL Ranking, Sorting, and Aggregation for Top-N Invoice Senders

Top-N entity ranking via GROUP BY + ORDER BY DESC + LIMIT in SQL for data engineering

"Return the top 10 senders by completed invoice count" is Square's signature EASY SQL prompt (Top 10 Invoice Senders). The mental model: GROUP BY sender_id collapses one row per sender; COUNT(*) produces the per-sender invoice count; ORDER BY count DESC ranks senders descending; LIMIT 10 returns only the top 10. Same primitive powers any "top-N entities by event volume" pipeline — top-N customers by order count, top-N products by review volume, top-N regions by daily active users.

Diagram showing an invoices mini-table on the left, a horizontal bar chart of per-sender invoice counts sorted descending in the center with the top 10 cut highlighted in green, and a green output card on the right listing the top-10 sender_ids with their counts.

Pro tip: Always alias the count column (AS invoice_count) and sort by the alias in ORDER BY. Mixing literal COUNT(*) and the alias is style noise that interviewers grade. Add a deterministic tiebreaker (, sender_id ASC) when ties at the cut matter — Square's interviewers probe ties at the boundary frequently.

GROUP BY + COUNT(*) for per-sender aggregation

The aggregation invariant: GROUP BY sender_id collapses all invoice rows that share the same sender_id into one output row; COUNT(*) counts the number of rows in each group. Every non-aggregate column in the SELECT must appear in GROUP BY (or be functionally dependent on it).

  • GROUP BY sender_id — collapses to one row per sender.
  • COUNT(*) — total rows per group; equivalent to COUNT(invoice_id) when invoice_id is non-null.
  • COUNT(DISTINCT col) — unique values per group; useful when senders can repeat invoices.
  • SELECT sender_id, COUNT(*) — only sender_id is referenced ungrouped; the GROUP BY makes this legal.

Worked example. Five invoices, three distinct senders.

step output
input [(101, s1), (102, s2), (103, s1), (104, s3), (105, s1)]
GROUP BY sender_id [s1: 3 rows, s2: 1 row, s3: 1 row]
COUNT(*) AS cnt [(s1, 3), (s2, 1), (s3, 1)]

Worked-example solution.

SELECT sender_id, COUNT(*) AS invoice_count
FROM invoices
GROUP BY sender_id;

Rule of thumb: GROUP BY on the dimension you want one-row-per-value, COUNT(*) for the metric, alias the count immediately so subsequent clauses can reference it.

ORDER BY count DESC + LIMIT N for top-N rankings

The top-N invariant: ORDER BY <metric> DESC sorts groups by the metric in descending order; LIMIT N returns only the first N. For ties at the cut, dialects differ — PostgreSQL (Square's CoderPad default) returns whichever order the planner picks unless you add an explicit tiebreaker.

  • ORDER BY invoice_count DESC — sorts groups by the aliased column.
  • LIMIT 10 — first 10 groups in sort order; PostgreSQL syntax.
  • OFFSET 10 LIMIT 10 — pagination; ranks 11-20.
  • Deterministic tiebreak — appending , sender_id ASC makes ties stable across runs.

Worked example. Sort 4 senders by invoice_count and take top 3.

sender_id invoice_count rank
s1 4 1
s2 2 2
s3 1 3
s4 1 (cut)

Worked-example solution.

SELECT sender_id, COUNT(*) AS invoice_count
FROM invoices
GROUP BY sender_id
ORDER BY invoice_count DESC, sender_id ASC
LIMIT 3;

Rule of thumb: always include a deterministic tiebreaker — fintech audits demand that ranked lists be reproducible across runs.

Tiebreaks via RANK(), DENSE_RANK(), and ROW_NUMBER()

The tiebreak invariant: ROW_NUMBER always gives a unique sequence (1, 2, 3, 4); RANK gives the same number for ties and skips the next (1, 1, 3, 4); DENSE_RANK gives the same number for ties without skipping (1, 1, 2, 3). Choose based on whether the prompt wants strict top-N row count, "all senders tied at rank N", or compact ranks without gaps.

  • ROW_NUMBER() OVER (ORDER BY count DESC) — strict 1..N sequence; ties resolved by planner.
  • RANK() OVER (ORDER BY count DESC) — ties share rank, next rank skips.
  • DENSE_RANK() OVER (ORDER BY count DESC) — ties share rank, next rank does not skip.
  • LIMIT 10 — chops to top 10 rows after sort; doesn't address ties unless paired with RANK filter.

Worked example. Three senders tied at 5 invoices, one at 3, one at 2.

sender_id invoice_count ROW_NUMBER RANK DENSE_RANK
s1 5 1 1 1
s2 5 2 1 1
s3 5 3 1 1
s4 3 4 4 2
s5 2 5 5 3

Worked-example solution.

SELECT sender_id, invoice_count,
       ROW_NUMBER() OVER (ORDER BY invoice_count DESC) AS rn,
       RANK()       OVER (ORDER BY invoice_count DESC) AS rk,
       DENSE_RANK() OVER (ORDER BY invoice_count DESC) AS drk
FROM (SELECT sender_id, COUNT(*) AS invoice_count
      FROM invoices GROUP BY sender_id) t;

Rule of thumb: "top 10" with strict cardinality → LIMIT 10; "all senders tied at top 10" → wrap with WHERE rk <= 10; "compact ranks 1..K with no gaps" → DENSE_RANK.
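The three numbering behaviors can be replayed locally; the sketch below is a minimal stand-in using Python's stdlib sqlite3 (SQLite's window functions share these semantics with PostgreSQL), seeded with the tied-sender counts from the worked example. The table and values are illustrative, not Square data.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL CoderPad environment;
# SQLite >= 3.25 implements ROW_NUMBER / RANK / DENSE_RANK with the same semantics.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sender_counts (sender_id TEXT, invoice_count INT)")
conn.executemany(
    "INSERT INTO sender_counts VALUES (?, ?)",
    [("s1", 5), ("s2", 5), ("s3", 5), ("s4", 3), ("s5", 2)],
)

rows = conn.execute("""
    SELECT sender_id, invoice_count,
           ROW_NUMBER() OVER (ORDER BY invoice_count DESC, sender_id) AS rn,
           RANK()       OVER (ORDER BY invoice_count DESC) AS rk,
           DENSE_RANK() OVER (ORDER BY invoice_count DESC) AS drk
    FROM sender_counts
""").fetchall()

for r in rows:
    print(r)
# ROW_NUMBER runs 1..5; RANK gives 1, 1, 1, 4, 5; DENSE_RANK gives 1, 1, 1, 2, 3.
```

Note the explicit sender_id tiebreaker inside the ROW_NUMBER ordering: without it, the 1-2-3 assignment among the three tied senders is planner-dependent.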

Common beginner mistakes

  • Forgetting GROUP BY — SELECT sender_id, COUNT(*) without grouping is an error in PostgreSQL (every non-aggregate column must appear in GROUP BY).
  • Forgetting ORDER BY ... DESC — top 10 silently becomes "first 10 by planner whim."
  • Returning more than N rows by skipping LIMIT — graded as a wrong answer even when the top N are correct.
  • Hardcoding ties to a single row when the prompt says "all senders tied at rank 10" — use RANK filter, not LIMIT.
  • Mixing COUNT(*) and COUNT(DISTINCT invoice_id) — non-equivalent when invoice_id can repeat (it usually can't, but state the assumption).

SQL Interview Question on Top 10 Invoice Senders

Given an invoices(invoice_id, sender_id, recipient_id, sent_at, amount) table, write a query that returns the top 10 senders by invoice count (assume every row in invoices represents a completed invoice), ordered by count descending and breaking ties by sender_id ascending. Output two columns: sender_id and invoice_count.

Solution Using GROUP BY + COUNT(*) + ORDER BY DESC + LIMIT 10

SELECT
    sender_id,
    COUNT(*) AS invoice_count
FROM invoices
GROUP BY sender_id
ORDER BY invoice_count DESC, sender_id ASC
LIMIT 10;

Why this works: GROUP BY sender_id collapses the row stream to one row per sender; COUNT(*) produces the metric for each group; ORDER BY invoice_count DESC ranks groups descending; the , sender_id ASC tiebreak ensures stable output across runs (fintech-audit-grade); LIMIT 10 returns the first 10 ranked rows.

Step-by-step trace for an 8-row sample:

invoice_id sender_id
101 s1
102 s2
103 s1
104 s3
105 s1
106 s2
107 s4
108 s1
  1. Group by sender_id — four groups: s1 (4 rows), s2 (2 rows), s3 (1 row), s4 (1 row).
  2. COUNT(*) — produces (s1, 4), (s2, 2), (s3, 1), (s4, 1).
  3. Order by invoice_count DESC, sender_id ASC — (s1, 4), (s2, 2), (s3, 1), (s4, 1).
  4. LIMIT 10 — only 4 senders exist; all four returned.

Output:

sender_id invoice_count
s1 4
s2 2
s3 1
s4 1

Why this works — concept by concept:

  • GROUP BY sender_id — collapses rows that share sender_id; produces one output row per sender.
  • COUNT(*) aggregate — counts rows in each group; equivalent to COUNT(invoice_id) when invoice_id is non-null (always true since it's the primary key).
  • ORDER BY invoice_count DESC ranking — sorts groups descending by the metric; the DESC keyword is the entire difference between top-N and bottom-N.
  • Deterministic tiebreak — appending , sender_id ASC ensures the output is reproducible across runs; fintech audits require this.
  • LIMIT 10 cut — chops to the top 10 ranked rows; PostgreSQL syntax (CoderPad default).
  • O(|invoices| + G log G) time — |invoices| rows scanned for the GROUP BY, then O(G log G) to sort the group output, where G is the number of senders.

Inline CTA: Drill the Square SQL practice page for the curated EASY problem and the Square aggregation practice page for the only company-tagged topic surface available.

SQL
Company — Square (SQL)
Square SQL practice problems

Practice →

SQL
Company — Square / aggregation
Square aggregation problems

Practice →

SQL
Topic — ranking
SQL ranking problems

Practice →


2. SQL Aggregation, Date Functions, and Cohort Analysis for Post-Signup Activity

30-day cohort retention via DATEDIFF + GROUP BY + HAVING in SQL for data engineering

"Return every user who has at least 5 events in the 30 days after signup" is Square's signature MEDIUM SQL prompt (Users with High Activity After 30 Days Signup). The mental model: the cohort is defined by signup_date; the activity window is signup_date + INTERVAL '30 days'; events filtered to that window are aggregated GROUP BY user_id; HAVING COUNT(*) >= 5 returns the active users. Same primitive powers any retention or cohort analysis — N-day-post-signup activity, post-purchase repeat behavior, post-event engagement windows.

Diagram showing a users table with signup_date markers, a horizontal timeline with 30-day forward windows tinted purple where in-window events are tinted green and out-of-window events stay slate, and a green output card listing active users meeting the threshold.

Pro tip: PostgreSQL has multiple ways to express the 30-day boundary: signup_date + INTERVAL '30 days', signup_date + INTERVAL '30 day', (event_at - signup_date) <= INTERVAL '30 days', (event_at::date - signup_date::date) <= 30. Pick one and stick with it. Mixing forms within the same query is a syntax-fluency red flag.

Date arithmetic: DATEDIFF, DATE_TRUNC, and INTERVAL '30 days'

The date-arithmetic invariant: in PostgreSQL, subtracting two DATE values yields an integer day count, subtracting two TIMESTAMP values yields an INTERVAL, and adding an INTERVAL to a DATE yields a TIMESTAMP; intervals are comparable to literal intervals. DATEDIFF is not native PostgreSQL — use (d1::date - d2::date) for a day count, or EXTRACT(DAY FROM (ts1 - ts2)) for timestamps.

  • signup_date + INTERVAL '30 days' — the moment 30 days after signup (DATE + INTERVAL yields a TIMESTAMP).
  • (event_at - signup_date) <= INTERVAL '30 days' — interval comparison; valid when both sides subtract to an interval (TIMESTAMPs).
  • DATE_TRUNC('week', event_at) — snaps to the week boundary; useful for weekly cohorts.
  • EXTRACT(DAY FROM (event_at - signup_date)) — pulls the day count out of the interval (TIMESTAMP subtraction; DATE - DATE is already an integer).

Worked example. Signup on 2025-01-01; events at 2025-01-10 and 2025-02-15.

event_at signup_date diff within 30 days?
2025-01-10 2025-01-01 9 days yes
2025-02-15 2025-01-01 45 days no

Worked-example solution.

SELECT event_at,
       (event_at::date - signup_date::date) AS days_since_signup,
       (event_at <= signup_date + INTERVAL '30 days') AS within_window
FROM users JOIN events USING (user_id);

Rule of thumb: prefer event_at <= signup_date + INTERVAL '30 days' over (event_at - signup_date) <= INTERVAL '30 days' — the additive form reads as "the event happened within the window" without a subtraction step.
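The same boundary arithmetic can be checked outside the database; this sketch uses Python's datetime to replay the worked example's two events (the dates are the sample values from the table above, not real data).

```python
from datetime import date, timedelta

signup_date = date(2025, 1, 1)
# Mirrors signup_date + INTERVAL '30 days' from the PostgreSQL query.
window_end = signup_date + timedelta(days=30)

for event_at in (date(2025, 1, 10), date(2025, 2, 15)):
    # Mirrors (event_at::date - signup_date::date) AS days_since_signup.
    days_since = (event_at - signup_date).days
    # Additive form of the boundary check: "the event happened within the window."
    within = event_at <= window_end
    print(event_at, days_since, within)
# 2025-01-10 is 9 days in (within the window); 2025-02-15 is 45 days in (outside it).
```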

Cohort definition via signup_date + INTERVAL '30 days'

The cohort invariant: a cohort is a set of users defined by a shared signup date or signup-date-bucket; the cohort's activity window is that signup-date plus a fixed interval. The cohort filter goes in WHERE, the per-user aggregation happens in GROUP BY, and the threshold check goes in HAVING.

  • Cohort by exact signup dateWHERE u.signup_date = '2025-01-01'.
  • Cohort by weekWHERE DATE_TRUNC('week', u.signup_date) = '2025-01-06'.
  • Cohort by all users — no extra filter; aggregate per user.
  • Activity windowWHERE e.event_at <= u.signup_date + INTERVAL '30 days'.

Worked example. Two users signed up on different dates; activity-window check per user.

user_id signup_date window_end
u1 2025-01-01 2025-01-31
u2 2025-01-15 2025-02-14

Worked-example solution.

SELECT u.user_id, u.signup_date,
       u.signup_date + INTERVAL '30 days' AS window_end
FROM users u;

Rule of thumb: compute the window boundary in SELECT (or a CTE) when you need to reference it multiple times; recomputing signup_date + INTERVAL '30 days' in three different clauses signals copy-paste.

COUNT(*) per user with HAVING threshold

The threshold invariant: GROUP BY user_id collapses to one row per user; COUNT(*) produces the per-user activity count; HAVING COUNT(*) >= 5 filters groups whose count meets the threshold. HAVING is "WHERE on aggregates" — it filters group rows, not source rows.

  • HAVING COUNT(*) >= 5 — group-level filter; aggregate predicate.
  • HAVING COUNT(*) >= 5 AND MAX(e.event_at) > MIN(u.signup_date) + INTERVAL '14 days' — compound threshold (wrap signup_date in an aggregate, or add it to GROUP BY, so HAVING can reference it).
  • Avoid WHERE COUNT(*) >= 5 — WHERE cannot reference aggregates; PostgreSQL rejects it.
  • Avoid the inner-join WHERE window filter when you need to keep zero-event users — move the window predicate into a LEFT JOIN's ON clause and count e.event_id instead.

Worked example. Three users; threshold 5 events.

user_id events_in_window passes?
u1 6 yes
u2 3 no
u3 5 yes

Worked-example solution.

SELECT u.user_id, COUNT(*) AS events_in_window
FROM users u
JOIN events e ON e.user_id = u.user_id
WHERE e.event_at <= u.signup_date + INTERVAL '30 days'
GROUP BY u.user_id
HAVING COUNT(*) >= 5;

Rule of thumb: row-level predicates → WHERE; aggregate predicates → HAVING. Crossing them is a graded conceptual error.

Common beginner mistakes

  • Using DATEDIFF(event_at, signup_date) in PostgreSQL — that's MySQL syntax; in PostgreSQL use (d1::date - d2::date) for a day count or EXTRACT(DAY FROM (ts1 - ts2)) for timestamps.
  • Putting the aggregate threshold in WHERE — parser rejects; use HAVING.
  • Filtering signup_date in HAVING — that's a row predicate; should be in WHERE for performance.
  • Forgetting the JOIN — querying events alone misses the per-user signup_date anchor.
  • Missing the INTERVAL '30 days' boundary — counting all-time activity instead of the 30-day window inflates results.
  • Timezone bugs — event_at and signup_date in different timezones produces off-by-one days; cast both to a single timezone when in doubt.

SQL Interview Question on 30-Day Post-Signup Activity

Given users(user_id, signup_date) and events(event_id, user_id, event_at), return every user_id with at least 5 events in the 30 days after signup. Output user_id and events_in_window, ordered by events_in_window descending.

Solution Using JOIN + WHERE INTERVAL + GROUP BY + HAVING COUNT(*) >= 5

SELECT
    u.user_id,
    COUNT(*) AS events_in_window
FROM users u
JOIN events e
    ON e.user_id = u.user_id
WHERE e.event_at <= u.signup_date + INTERVAL '30 days'
  AND e.event_at >= u.signup_date
GROUP BY u.user_id
HAVING COUNT(*) >= 5
ORDER BY events_in_window DESC;

Why this works: the JOIN pairs each event with its user's signup_date; the WHERE clause restricts events to the 30-day window post-signup (>= signup_date AND <= signup_date + INTERVAL '30 days'); GROUP BY u.user_id collapses to one row per user; COUNT(*) produces the per-user in-window event count; HAVING COUNT(*) >= 5 filters users whose count crosses the threshold; ORDER BY events_in_window DESC ranks active users by engagement.

Step-by-step trace for three users:

user_id signup_date events_in_first_30d
u1 2025-01-01 6
u2 2025-01-15 3
u3 2025-02-10 5
  1. JOIN — pairs each event row with its signup_date from users.
  2. WHERE window filter — keeps only events where event_at is in [signup_date, signup_date + 30 days].
  3. GROUP BY user_id — collapses to one row per user with the in-window event count.
  4. HAVING >= 5 — filters out u2 (3 events). u1 (6) and u3 (5) pass.
  5. ORDER BY DESC — u1 (6) ranks above u3 (5).

Output:

user_id events_in_window
u1 6
u3 5

Why this works — concept by concept:

  • JOIN on user_id — pairs each event with its user's signup_date; without this, the query has no anchor for the window.
  • INTERVAL '30 days' boundary — PostgreSQL-native expression of the 30-day forward window; reads as "30 days from signup."
  • WHERE row filter — strips out-of-window events before grouping; pushing the window condition into the aggregate instead (e.g. COUNT(*) FILTER (WHERE ...)) would also work but reads worse.
  • GROUP BY u.user_id — collapses the row stream to one row per user; the only non-aggregate column in SELECT matches the GROUP BY.
  • HAVING COUNT(*) >= 5 — filters groups; aggregate predicates have to live here.
  • O(|users| + |events| + G log G) time — the JOIN dominates; G users with at least one in-window event get sorted at the end.
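The full cohort query can be replayed locally too; the sketch below uses Python's stdlib sqlite3 with date(signup_date, '+30 days') standing in for PostgreSQL's INTERVAL '30 days' (SQLite has no INTERVAL type). The per-user event rows are hypothetical, seeded to match the trace counts (6 for u1, 3 for u2, 5 for u3).

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, signup_date TEXT)")
conn.execute("CREATE TABLE events (user_id TEXT, event_at TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("u1", "2025-01-01"), ("u2", "2025-01-15"), ("u3", "2025-02-10")])

def seed(user_id, signup, n):
    """Insert n hypothetical in-window events, one per day after signup."""
    start = date.fromisoformat(signup)
    rows = [(user_id, (start + timedelta(days=i + 1)).isoformat()) for i in range(n)]
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

seed("u1", "2025-01-01", 6)
seed("u2", "2025-01-15", 3)
seed("u3", "2025-02-10", 5)

# date(u.signup_date, '+30 days') is SQLite's stand-in for
# u.signup_date + INTERVAL '30 days' in the PostgreSQL solution.
active = conn.execute("""
    SELECT u.user_id, COUNT(*) AS events_in_window
    FROM users u
    JOIN events e ON e.user_id = u.user_id
    WHERE e.event_at >= u.signup_date
      AND e.event_at <= date(u.signup_date, '+30 days')
    GROUP BY u.user_id
    HAVING COUNT(*) >= 5
    ORDER BY events_in_window DESC
""").fetchall()

print(active)  # [('u1', 6), ('u3', 5)]
```

u2's three in-window events survive the WHERE filter but fall to the HAVING threshold, matching step 4 of the trace.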

Inline CTA: Practice date-functions SQL problems and cohort-analysis problems on PipeCode.

SQL
Company — Square (SQL)
Square SQL practice problems

Practice →

SQL
Topic — date functions
SQL date-function problems

Practice →

SQL
Topic — cohort analysis
SQL cohort-analysis problems

Practice →


3. SQL Window Functions for Monthly Averages and Duplicate Detection

Window AVG OVER + ROW_NUMBER OVER for analytics in SQL for data engineering

"Calculate the monthly average rating for each product" is Block / Square's signature window-function SQL prompt (DataLemur Q1). The mental model: AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)) produces the monthly product average on every row, with no row-collapse. The mirror primitive is ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) for duplicate detection — flag rows whose rn > 1 per partition. Same primitive powers any "row-level value vs group-aggregate" or "first-occurrence-per-group" question.

Diagram showing a reviews table with submit_date and stars columns, a partition box wrapping rows by product_id and month with AVG(stars) OVER per partition annotated, and a green output card listing month, product_id, avg_stars tuples.

Pro tip: Window functions and GROUP BY answer different questions. GROUP BY collapses rows; window aggregates compute on the group without collapsing. Use GROUP BY when you want one row per group; use a window when you want every original row plus a per-group aggregate column.

Window AVG: AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date))

The window-AVG invariant: AVG(expr) OVER (PARTITION BY ...) returns the average of expr across all rows in the same partition, attached to every row. Unlike GROUP BY AVG(...), the row count is preserved.

  • PARTITION BY product_id — one window per product.
  • PARTITION BY product_id, EXTRACT(MONTH FROM submit_date) — one window per (product, month).
  • OVER () — empty parens = global window across all rows.
  • ORDER BY submit_date ROWS UNBOUNDED PRECEDING — running average up to current row.

Worked example. Two reviews for product 50001 in June; window AVG returns 3.5 on both rows.

review_id product_id submit_date stars window_avg
6171 50001 2022-06-08 4 3.5
5293 50001 2022-06-18 3 3.5

Worked-example solution.

SELECT review_id, product_id, submit_date, stars,
       AVG(stars) OVER (
         PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)
       ) AS monthly_avg
FROM reviews;

Rule of thumb: "show me each row alongside the group average" → window; "show me one row per group" → GROUP BY.

ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) for duplicate detection

The duplicate-detection invariant: ROW_NUMBER() OVER (PARTITION BY <key_columns> ORDER BY <tiebreaker>) assigns 1, 2, 3, … to rows within each partition by the duplicate key; rows with row_number > 1 are duplicates.

  • PARTITION BY user_id, product_id — define what makes a duplicate.
  • ORDER BY submit_date ASC — tiebreaker; deterministic.
  • Filter — wrap in subquery / CTE: WHERE rn > 1 for duplicates only, WHERE rn = 1 for the first occurrence.
  • AlternativeCOUNT(*) OVER (PARTITION BY ...) to count duplicates per row.

Worked example. Two reviews from same user for same product → duplicate.

review_id user_id product_id submit_date rn
6171 123 50001 2022-06-08 1
9999 123 50001 2022-06-15 2 (duplicate)

Worked-example solution.

SELECT *
FROM (
  SELECT review_id, user_id, product_id, submit_date,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, product_id ORDER BY submit_date
         ) AS rn
  FROM reviews
) t
WHERE rn > 1;

Rule of thumb: ROW_NUMBER OVER (PARTITION BY <duplicate_key> ORDER BY <tiebreaker>) is the universal SQL idiom for "find duplicates" — no GROUP BY needed.
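The dedupe idiom can be verified locally with the two-review worked example; this sketch uses Python's stdlib sqlite3, whose ROW_NUMBER semantics match PostgreSQL's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE reviews
                (review_id INT, user_id INT, product_id INT, submit_date TEXT)""")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?)",
    [(6171, 123, 50001, "2022-06-08"),
     (9999, 123, 50001, "2022-06-15")],  # same (user, product) pair -> duplicate
)

# Number rows within each (user_id, product_id) partition by submit_date,
# then keep only rn > 1: everything after the first occurrence is a duplicate.
dupes = conn.execute("""
    SELECT review_id
    FROM (
      SELECT review_id,
             ROW_NUMBER() OVER (
               PARTITION BY user_id, product_id ORDER BY submit_date
             ) AS rn
      FROM reviews
    ) t
    WHERE rn > 1
""").fetchall()

print(dupes)  # [(9999,)] -- the later review is flagged as the duplicate
```

Flipping the filter to rn = 1 returns the first occurrence per partition, the keep-one-copy half of the same idiom.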

Window vs GROUP BY: row-level vs collapsed-row aggregations

The distinction invariant: window functions preserve the input row count and add per-group aggregate columns; GROUP BY collapses input rows to one row per group and replaces non-aggregate columns. Many candidates conflate them; interviewers grade the distinction.

  • Window — SELECT review_id, AVG(stars) OVER (PARTITION BY product_id) FROM reviews; row count preserved.
  • GROUP BY — SELECT product_id, AVG(stars) FROM reviews GROUP BY product_id; one row per product.
  • Mixed — wrap the GROUP BY in a CTE, then JOIN to the original; messy.
  • QUALIFY — Snowflake / BigQuery filter on window results; PostgreSQL needs a wrapping subquery instead.

Worked example. Same data, two different shapes.

approach rows returned
Window AVG 5 (same as input)
GROUP BY AVG 2 (one per product)

Worked-example solution.

-- Window: keep all rows
SELECT review_id, product_id, stars,
       AVG(stars) OVER (PARTITION BY product_id) AS product_avg
FROM reviews;

-- GROUP BY: collapse to one row per product
SELECT product_id, AVG(stars) AS product_avg
FROM reviews
GROUP BY product_id;

Rule of thumb: read the prompt — "for each row, show…" = window; "summarize per product" = GROUP BY.

Common beginner mistakes

  • Forgetting PARTITION BY — the window covers the whole table, producing the global average on every row.
  • Using GROUP BY when the prompt says "for each review, show the monthly product average" — collapses rows incorrectly.
  • Forgetting ORDER BY inside ROW_NUMBER OVER (...) — non-deterministic row numbering; the duplicate filter becomes random.
  • Using RANK() when ROW_NUMBER() is correct — RANK ties become same number; RANK > 1 doesn't reliably flag duplicates.
  • Filtering window results in WHERE — Postgres requires a wrapping subquery / CTE because window functions evaluate after WHERE.

SQL Interview Question on Monthly Average Product Ratings

Given reviews(review_id, user_id, submit_date, product_id, stars), write a query that returns, for each row, submit_date_month, product_id, and the monthly average rating for that product (avg_stars), sorted by month and product_id.

Solution Using AVG ... OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date))

SELECT
    EXTRACT(MONTH FROM submit_date) AS mth,
    product_id,
    ROUND(
        AVG(stars) OVER (
            PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)
        ),
        2
    ) AS avg_stars
FROM reviews
ORDER BY mth, product_id;

Why this works: EXTRACT(MONTH FROM submit_date) pulls the month component from each row; AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)) partitions by (product, month) and averages stars within each partition; rounding to two decimals keeps the output clean; ORDER BY mth, product_id produces the canonical sort order. Each row keeps its identity — this is window, not GROUP BY.

Step-by-step trace for the DataLemur sample:

review_id user_id submit_date product_id stars
6171 123 2022-06-08 50001 4
7802 265 2022-06-10 69852 4
5293 362 2022-06-18 50001 3
6352 192 2022-07-26 69852 3
4517 981 2022-07-05 69852 2
  1. EXTRACT(MONTH ...) — produces [6, 6, 6, 7, 7].
  2. Partition by (product_id, month) — three partitions: (50001, 6) with 2 rows, (69852, 6) with 1 row, (69852, 7) with 2 rows.
  3. AVG within partition — (50001, 6) → 3.5; (69852, 6) → 4.0; (69852, 7) → 2.5.
  4. Round to two decimals + ORDER BY — five rows emitted, sorted by (mth, product_id).

Output:

mth product_id avg_stars
6 50001 3.50
6 50001 3.50
6 69852 4.00
7 69852 2.50
7 69852 2.50

Why this works — concept by concept:

  • EXTRACT(MONTH FROM submit_date) — PostgreSQL date-component function; returns the month as integer.
  • AVG(...) OVER (PARTITION BY ...) — window aggregate; preserves row count and computes per-partition mean.
  • Composite partition key(product_id, EXTRACT(MONTH FROM submit_date)) ensures monthly granularity per product.
  • Two-decimal rounding — returns 3.50 instead of a long fractional tail; cleaner output than the default numeric representation.
  • ORDER BY mth, product_id — canonical sort order for the audit-readable result.
  • O(N log N) time — the planner sorts rows by partition keys once; no GROUP BY collapse.
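The DataLemur-style trace can be replayed locally; this sketch uses Python's stdlib sqlite3, with strftime('%m', ...) standing in for PostgreSQL's EXTRACT(MONTH FROM ...) since SQLite has no EXTRACT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE reviews
                (review_id INT, user_id INT, submit_date TEXT,
                 product_id INT, stars INT)""")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?, ?)",
    [(6171, 123, "2022-06-08", 50001, 4),
     (7802, 265, "2022-06-10", 69852, 4),
     (5293, 362, "2022-06-18", 50001, 3),
     (6352, 192, "2022-07-26", 69852, 3),
     (4517, 981, "2022-07-05", 69852, 2)],
)

# strftime('%m', ...) is SQLite's stand-in for EXTRACT(MONTH FROM ...);
# the window AVG attaches the per-(product, month) mean to every row.
rows = conn.execute("""
    SELECT CAST(strftime('%m', submit_date) AS INT) AS mth,
           product_id,
           ROUND(AVG(stars) OVER (
             PARTITION BY product_id, strftime('%m', submit_date)
           ), 2) AS avg_stars
    FROM reviews
    ORDER BY mth, product_id
""").fetchall()

for r in rows:
    print(r)
# (50001, June) averages 3.5 on both of its rows; (69852, June) is 4.0;
# (69852, July) averages 2.5 on both rows -- the row count stays at five.
```

Note the window (not GROUP BY) shape: five rows in, five rows out, each carrying its partition's average.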

Inline CTA: More SQL window-function problems and aggregation problems on PipeCode.

SQL
Company — Square (SQL)
Square SQL practice problems

Practice →

SQL
Topic — window functions
SQL window-function problems

Practice →

SQL
Topic — aggregation
SQL aggregation problems

Practice →


4. SQL COUNT DISTINCT and Status Filters for Payment-Flow Analysis

Payment-flow SQL with COUNT DISTINCT and status filters in SQL for Square data engineering

"Return the total unique successful transactions, unique senders, and unique recipients" is Square / CashApp's signature payment-flow SQL prompt (Medium / DataLemur Q6 staple). The mental model: filter rows with WHERE status = 'Success' (idempotency boundary), then aggregate with COUNT(DISTINCT col) over the columns you care about — payment_id, sender_id, recipient_id. Same primitive powers any "unique-X for filtered Y" pipeline — unique active users last week, unique trading symbols traded yesterday, unique ad creatives clicked today.

Pro tip: Always state your idempotency boundary out loud before writing the query: "I'm filtering to status = 'Success' because failed/pending transactions don't count toward unique-success metrics." Square interviewers grade this phrasing — it's the difference between "candidate writes correct SQL" and "candidate thinks like a data engineer at a regulated payments company."

COUNT vs COUNT DISTINCT: row count vs unique-value count

The distinct-count invariant: COUNT(*) counts rows; COUNT(col) counts non-null values in col; COUNT(DISTINCT col) counts distinct non-null values in col. The three are not interchangeable.

  • COUNT(*) — total rows in the result set.
  • COUNT(payment_id) — non-null payment IDs (often equal to COUNT(*) since payment_id is the PK).
  • COUNT(DISTINCT sender_id) — unique senders; collapses repeated sender_ids.
  • COUNT(DISTINCT (col1, col2)) — PostgreSQL needs the row-constructor parentheses for distinct combinations; bare COUNT(DISTINCT col1, col2) is MySQL syntax.

Worked example. Five payments from 3 distinct senders.

payment_id sender_id
p1 s1
p2 s2
p3 s1
p4 s3
p5 s1
metric value
COUNT(*) 5
COUNT(payment_id) 5
COUNT(DISTINCT sender_id) 3

Worked-example solution.

SELECT COUNT(*) AS total_rows,
       COUNT(payment_id) AS total_payment_ids,
       COUNT(DISTINCT sender_id) AS unique_senders
FROM payments;

Rule of thumb: "how many rows" → COUNT(*); "how many unique entities" → COUNT(DISTINCT col).
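The three non-interchangeable counts can be demonstrated on the five-payment worked example; this sketch uses Python's stdlib sqlite3 as a stand-in for PostgreSQL (COUNT semantics are identical).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payment_id TEXT, sender_id TEXT)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [("p1", "s1"), ("p2", "s2"), ("p3", "s1"), ("p4", "s3"), ("p5", "s1")],
)

# COUNT(*) counts rows; COUNT(col) counts non-null values;
# COUNT(DISTINCT col) collapses repeats before counting.
total_rows, total_ids, unique_senders = conn.execute("""
    SELECT COUNT(*), COUNT(payment_id), COUNT(DISTINCT sender_id)
    FROM payments
""").fetchone()

print(total_rows, total_ids, unique_senders)  # 5 5 3
```

COUNT(*) and COUNT(payment_id) agree here because payment_id is never null; COUNT(DISTINCT sender_id) drops s1's repeats.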

WHERE status = 'Success' before aggregation

The idempotency-filter invariant: payment-flow analytics almost always filter to status = 'Success' (or whichever success state applies); failed and pending transactions do not contribute to success metrics. The filter goes in WHERE, before aggregation.

  • WHERE status = 'Success' — strictly successful payments.
  • WHERE status IN ('Success', 'Settled') — multiple success-equivalent statuses.
  • WHERE status NOT IN ('Failed', 'Cancelled') — exclusionary; equivalent only when those are the only non-success statuses.
  • State the assumption — interviewers expect you to clarify which statuses count.

Worked example. Six payments, three successful.

payment_id sender_id status
p1 s1 Success
p2 s2 Failed
p3 s1 Success
p4 s3 Pending
p5 s1 Success
p6 s4 Failed

After WHERE status = 'Success': 3 rows (p1, p3, p5), unique senders = 1 (s1 only).

Worked-example solution.

SELECT COUNT(*) AS successful_payments,
       COUNT(DISTINCT sender_id) AS unique_successful_senders
FROM payments
WHERE status = 'Success';

Rule of thumb: the success-filter is non-negotiable for payment-flow questions; ask the interviewer which status values count if the prompt is ambiguous.

Multiple COUNT DISTINCT in one SELECT: sender / recipient splits

The compound-aggregate invariant: a single SELECT can compute multiple COUNT(DISTINCT col) aggregates over the same filtered row set; this is the common shape for payment-flow "uniques per role" questions.

  • Three aggregates in one SELECT — COUNT(DISTINCT payment_id), COUNT(DISTINCT sender_id), COUNT(DISTINCT recipient_id).
  • Same WHERE filter applies to all — single source row set drives every aggregate.
  • Senders ≠ recipients — even though both columns reference users, the distinct counts can differ.
  • COUNT(DISTINCT user_id) — pool sender + recipient with UNION ALL first if you want "unique users involved."
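The UNION ALL pooling shape for "unique users involved" can be sketched end-to-end. This uses Python's sqlite3 for a quick local check (sample rows mirror the worked example below; the SQL itself is valid PostgreSQL too):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments "
    "(payment_id TEXT, sender_id TEXT, recipient_id TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?)",
    [("p1", "s1", "r1", "Success"),
     ("p3", "s1", "r2", "Success"),
     ("p5", "s2", "r3", "Success")],
)

# Pool both role columns with UNION ALL, then count distinct ids once.
unique_users = conn.execute("""
    SELECT COUNT(DISTINCT user_id)
    FROM (
        SELECT sender_id AS user_id FROM payments WHERE status = 'Success'
        UNION ALL
        SELECT recipient_id FROM payments WHERE status = 'Success'
    )
""").fetchone()[0]
print(unique_users)  # 5 -- {s1, s2, r1, r2, r3}
```

UNION ALL (not UNION) is the right choice here: the outer COUNT(DISTINCT ...) already deduplicates, so paying for a dedup inside the subquery would be wasted work.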

Worked example. Three successful payments; two distinct senders, three distinct recipients.

| payment_id | sender_id | recipient_id |
|------------|-----------|--------------|
| p1         | s1        | r1           |
| p3         | s1        | r2           |
| p5         | s2        | r3           |

| metric                       | value |
|------------------------------|-------|
| COUNT(DISTINCT payment_id)   | 3     |
| COUNT(DISTINCT sender_id)    | 2     |
| COUNT(DISTINCT recipient_id) | 3     |

Worked-example solution.

SELECT COUNT(DISTINCT payment_id)  AS unique_transactions,
       COUNT(DISTINCT sender_id)   AS unique_senders,
       COUNT(DISTINCT recipient_id) AS unique_recipients
FROM payments
WHERE status = 'Success';

Rule of thumb: one query, multiple aggregates is the cleanest shape; never write three separate queries for sender / recipient / transaction counts.

Common beginner mistakes

  • Using COUNT(*) when the prompt says "unique" — overcounts whenever the column holds repeats; COUNT(*) is always ≥ COUNT(DISTINCT col).
  • Forgetting WHERE status = 'Success' — counts failed and pending transactions; the metric becomes meaningless.
  • Putting the status filter in HAVING instead of WHERE — HAVING runs after aggregation, and without status in a GROUP BY PostgreSQL rejects the query outright; row-level filters belong in WHERE.
  • Writing three separate queries for sender / recipient / transaction — composes into one SELECT with three aggregates.
  • Confusing "unique senders" with "unique users" — senders and recipients can be the same individual; pool them only if the prompt asks for "unique users."

SQL Interview Question on Payment-Flow Unique Counts

Given payments(payment_id, sender_id, recipient_id, amount, status, created_at) where status is one of 'Success', 'Failed', 'Pending', write a query that returns three columns: unique_transactions, unique_senders, unique_recipients — each a COUNT(DISTINCT) over rows where status = 'Success'.

Solution Using COUNT(DISTINCT) + WHERE status = 'Success'

SELECT
    COUNT(DISTINCT payment_id)   AS unique_transactions,
    COUNT(DISTINCT sender_id)    AS unique_senders,
    COUNT(DISTINCT recipient_id) AS unique_recipients
FROM payments
WHERE status = 'Success';

Why this works: WHERE status = 'Success' strips failed and pending payments before aggregation so every count reflects only successful transactions; three COUNT(DISTINCT col) aggregates over the filtered row set produce the three unique counts in a single pass; the result is one row with three columns — the canonical shape for "summary metrics for a filtered universe."

Step-by-step trace for a 6-row sample:

| payment_id | sender_id | recipient_id | status  |
|------------|-----------|--------------|---------|
| p1         | s1        | r1           | Success |
| p2         | s2        | r2           | Failed  |
| p3         | s1        | r2           | Success |
| p4         | s3        | r1           | Pending |
| p5         | s2        | r3           | Success |
| p6         | s1        | r1           | Success |
  1. WHERE filter — strips p2 (Failed) and p4 (Pending); 4 successful rows remain.
  2. COUNT(DISTINCT payment_id) → {p1, p3, p5, p6} = 4 (every payment_id is unique by definition).
  3. COUNT(DISTINCT sender_id) → {s1, s2} = 2.
  4. COUNT(DISTINCT recipient_id) → {r1, r2, r3} = 3.

Output:

| unique_transactions | unique_senders | unique_recipients |
|---------------------|----------------|-------------------|
| 4                   | 2              | 3                 |

Why this works — concept by concept:

  • WHERE status = 'Success' — idempotency-filter; payment-flow metrics never include failed or pending rows.
  • COUNT(DISTINCT payment_id) — counts unique transaction IDs; equal to COUNT(*) here since payment_id is the primary key, but spelled out for clarity.
  • COUNT(DISTINCT sender_id) — counts unique senders; senders can repeat across payments, so this collapses repeats.
  • COUNT(DISTINCT recipient_id) — counts unique recipients; same logic, different column.
  • Single-pass three-aggregate SELECT — one scan of payments filtered to Success, three aggregates in parallel; no need for separate queries.
  • O(|payments|) time / O(D) space — one linear scan; D = sum of distinct cardinalities held in three small hash structures.
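The step-by-step trace above can be checked mechanically. Here is a quick harness in Python's sqlite3 loading the same 6-row sample (column and status names mirror the problem statement; PostgreSQL returns the same single row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments "
    "(payment_id TEXT, sender_id TEXT, recipient_id TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?)",
    [("p1", "s1", "r1", "Success"),
     ("p2", "s2", "r2", "Failed"),
     ("p3", "s1", "r2", "Success"),
     ("p4", "s3", "r1", "Pending"),
     ("p5", "s2", "r3", "Success"),
     ("p6", "s1", "r1", "Success")],
)

# The interview answer verbatim: filter first, three aggregates in one pass.
row = conn.execute("""
    SELECT COUNT(DISTINCT payment_id),
           COUNT(DISTINCT sender_id),
           COUNT(DISTINCT recipient_id)
    FROM payments
    WHERE status = 'Success'
""").fetchone()
print(row)  # (4, 2, 3)
```

Building this kind of tiny in-memory fixture is also a good live-interview habit when CoderPad's sample data is thin: it lets you confirm the filter strips p2 and p4 before you reason about the aggregates.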

Drill more SQL filtering problems, and hit the Square SQL practice page for the curated 2-problem set.

  • Company — Square (SQL): Square SQL practice problems
  • Topic — filtering: SQL filtering problems
  • Topic — aggregation: SQL aggregation problems


Tips to crack Square data engineering interviews

Square = Block — both names refer to the same loop

Square rebranded to Block Inc. in December 2021. The company name on offer letters is Block; the consumer products (Square seller tools, Cash App, Tidal, Spiral) keep their original brands. The data-engineering interview loop, the SQL bar, and the question shapes did not change with the rebrand. Search for both "Square data engineer interview" and "Block data engineer interview" — every external article you find under one name applies under the other.

Drill the four SQL primitives

The four primitives in this guide map directly to the curated 2 PipeCode SQL problems plus the two adjacent SQL primitives every Block / Square / CashApp SQL list rotates through: GROUP BY + COUNT + ORDER BY DESC + LIMIT N for top-N rankings (#45), DATEDIFF / INTERVAL '30 days' + GROUP BY + HAVING for cohort-retention queries (#217), AVG OVER PARTITION BY and ROW_NUMBER OVER window functions for monthly aggregates and duplicate detection (DataLemur Q1 + Q3), and COUNT(DISTINCT) + WHERE status = 'Success' payment-flow aggregations (Medium + DataLemur Q6).

CoderPad PostgreSQL is the live coding environment

Square / Block / CashApp interviews run live SQL in CoderPad with PostgreSQL as the only available dialect. Drill PostgreSQL-flavored answers — EXTRACT(MONTH FROM ...), DATE_TRUNC('week', ...), INTERVAL '30 days', (d1 - d2)::int, COUNT(DISTINCT col), LIMIT N OFFSET M. Avoid Snowflake's QUALIFY, MySQL's DATEDIFF(d1, d2), BigQuery's DATE_DIFF, SQL Server's TOP N — these all parse-fail in PostgreSQL. Google search is allowed during the live interview, but stuttering on dialect signals weak SQL fluency.

Medallion architecture (Bronze / Silver / Gold) is the system-design house style

Square's data platform is structured around the Medallion architecture: Bronze for raw landed data, Silver for cleaned and enriched data, Gold for aggregated analytics-ready data. When you discuss pipeline design in the system-design or onsite panel rounds, frame the data flow as Bronze → Silver → Gold, name the orchestrator (Airflow or dbt), and mention data-quality gates (Great Expectations) between layers. This single framing carries weight at every Square data-engineering onsite.

Data quality is graded heavily — Great Expectations and validation gates

Square's interviewers probe data-quality concerns on every round. Mention specific tools you've used (Great Expectations, Soda, dbt tests), describe the validation gates between Bronze / Silver / Gold layers, and articulate the difference between "schema validation" (column types and nullability) and "semantic validation" (business-logic invariants like transaction_amount > 0 or payment_status IN ('Success', 'Failed', 'Pending')).

4-stage interview process — HR, Hiring Manager, Technical, Onsite Panel (NDA)

The Square data-engineering loop runs four stages: HR screen (15-30 min recruiter), Hiring Manager interview (~30 min, past experience and data-quality), Technical Screen (~1 hour, Python + SQL on CoderPad with medium-level complexity), and Onsite Panel (multiple 30-45 min rounds, NDA required, data modeling + analytical + cultural fit). Total comp data points are limited (only 2 reported), but the average base salary is $139,850, range $101K-$185K.

Where to practice on PipeCode

Start with the Square practice page and the language-scoped Square SQL practice page for the curated 2-problem set. Hit the company_topic Square — aggregation page for the only Square-tagged topic surface available. After that, drill the matching topic pages: ranking, aggregation, date functions, cohort analysis, window functions, filtering, group by. The interview courses page bundles structured curricula. For broader coverage, browse by topic, or pivot to peer guides — the Airbnb DE interview guide and the top DE interview questions 2026 blog. The SQL data types Postgres guide is a useful refresher because PostgreSQL is the CoderPad default.

Communication and approach under time pressure

Talk through the invariant first ("this is a top-N ranking problem with a tiebreak requirement"), the brute force second ("a self-join would also work but is O(n²)"), and the optimal third ("GROUP BY + ORDER BY + LIMIT is O(n log n) and idiomatic"). Interviewers grade process as much as the final answer. Leave 5 minutes for an edge-case sweep: empty input, ties at the cut, NULL handling, timezone bugs in date arithmetic, status values you didn't expect (Pending, Cancelled). The most common "almost passed" failure mode is correct happy-path code that crashes on edge cases — a 30-second sweep prevents it.


Frequently Asked Questions

What is the Square data engineering interview process?

The Square data engineering interview opens with a 15-30 minute recruiter HR screen, then a 30-minute hiring-manager interview focused on past experience and data-quality philosophy, then a 1-hour technical screen on CoderPad covering Python and SQL at medium-level complexity, then an onsite panel of multiple 30-45 minute rounds (data modeling, analytical thinking, cultural fit). The onsite panel requires an NDA. End-to-end the loop runs three to four weeks.

Is Square the same as Block? Which name should I search for?

Yes — Square rebranded to Block Inc. in December 2021. The legal entity is Block; the consumer brands (Square seller tools, Cash App, Tidal, Spiral) keep their original names. Search for both "Square data engineer interview" and "Block data engineer interview" — every external article you find under one name applies under the other. The DataLemur SQL guide is titled "Block SQL Interview Questions" and the IQ guide is titled "Square Data Engineer Interview Questions"; both refer to the same loop.

What programming languages does Square test in data engineering interviews?

Square tests SQL and Python in the technical screen and onsite. SQL is the heavier surface — the live coding round in CoderPad is PostgreSQL-only, with patterns like GROUP BY + COUNT + ORDER BY DESC + LIMIT, INTERVAL '30 days' cohort math, AVG OVER PARTITION BY window aggregates, and COUNT(DISTINCT col) over status-filtered payment data. Python is the lighter surface — typically 30 minutes of medium-difficulty data-manipulation questions (Pandas / dict / list comprehensions). Spark / PySpark show up in the onsite panel for senior roles.

How difficult are Square data engineering interview questions?

The curated Square practice set on PipeCode is 1 EASY + 1 MEDIUM, no hard. The EASY is a SQL ranking + aggregation problem (Top 10 Invoice Senders); the MEDIUM is a SQL aggregation + date-functions + cohort-analysis problem (Users with High Activity After 30 Days Signup). The DataLemur Block list adds 10 more SQL questions ranging from definition-style (joins, constraints, clustered vs non-clustered indexes) to scenario-grade (window AVG OVER, ROW_NUMBER for duplicates, click-through rate via LEFT JOIN + COUNT DISTINCT). Stuttering on the EASY is a stronger negative signal than struggling with the MEDIUM.

What is the Square data engineer salary range?

Square data engineer base salary ranges from $101K to $185K, with an average of $139,850 (median $140K) across 21 reported data points (per Interview Query). Total compensation data is sparse (only 2 reported points), so the total-comp average ($72K) is under-sampled and likely inaccurate; expect actual total comp to be substantially higher once equity refreshers and bonuses are factored in. Negotiation success rates are best-supported by competing offers and verified levels.fyi entries.

What tech stack does Square's data engineering team use?

Square's data engineering stack includes Python (heavy — Pandas, PySpark for large-scale transforms), SQL (PostgreSQL on CoderPad for interviews; production warehouses include Snowflake-equivalent platforms), Airflow + dbt for ETL/ELT orchestration, AWS or GCP for cloud infrastructure, Apache Spark for distributed data processing, the Medallion architecture (Bronze / Silver / Gold layers) for data organization, and Great Expectations for data-quality validation. The cultural emphasis is on data-quality-first pipelines, cross-functional collaboration with product / DS teams, and continuous learning.


Start practicing Square data engineering problems
