Gowtham Potureddi

Square Data Engineering Interview Questions & Prep Guide

Square data engineering interview questions are SQL-heavy, fintech-flavored, and PostgreSQL-grounded. Square rebranded to Block Inc. in December 2021, but the SQL bar — and the question shapes that show up in the live CoderPad pair-programming round — has not moved. Four primitives carry the loop: GROUP BY sender_id + COUNT(*) + ORDER BY DESC LIMIT 10 for top-N invoice senders, DATEDIFF / INTERVAL '30 days' cohort math for 30-day-post-signup activity windows, AVG(stars) OVER (PARTITION BY product_id, month) window aggregates for monthly product analytics, and COUNT(DISTINCT col) over status-filtered payment rows for fintech-grade transaction analysis. The framings are everyday Block / Square / CashApp data engineering — invoice ranking, cohort retention, monthly review averages, payment success counts.

This guide walks four SQL topic clusters end-to-end, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches a curated 2-problem set (1 EASY ranking+sorting+aggregation, 1 MEDIUM aggregation+date-functions+cohort-analysis) plus the two adjacent SQL primitives — window functions and payment-flow COUNT DISTINCT — that show up on every Block / Square / CashApp SQL question list. The interview is PostgreSQL on CoderPad; candidates who prep in MySQL / Snowflake / BigQuery dialect stutter on INTERVAL, DATE_PART, and EXTRACT syntax. Drill PostgreSQL-flavored answers from the start.



Top Square data engineering interview topics

From the Square data engineering practice set, the four numbered sections below follow this topic map (one entry per section):

  1. SQL ranking, sorting, and aggregation for top-N invoice senders — Top 10 Invoice Senders (EASY): GROUP BY sender_id + COUNT(*) + ORDER BY DESC + LIMIT 10, the SQL primitive for any "top-N entities by event volume" question.
  2. SQL aggregation, date functions, and cohort analysis for post-signup activity — Users with High Activity After 30 Days Signup (MEDIUM): DATEDIFF / INTERVAL '30 days' + GROUP BY user_id + HAVING COUNT(*) >= threshold, the canonical retention-cohort pattern.
  3. SQL window functions for monthly averages and duplicate detection — Monthly product rating + duplicate detection: AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)) and ROW_NUMBER() OVER (PARTITION BY ...) for dedupe (DataLemur Block staples).
  4. SQL COUNT DISTINCT and status filters for payment-flow analysis — Total unique successful transactions / senders / recipients: COUNT(DISTINCT col) WHERE status = 'Success', the fintech-grade pattern that drives Square / Block / CashApp payment analytics.

PostgreSQL-on-CoderPad framing rule: Square interviews live SQL in CoderPad with PostgreSQL syntax. Use EXTRACT(MONTH FROM ...), DATE_TRUNC('week', ...), INTERVAL '30 days', DATEDIFF-equivalent date math, and COUNT(DISTINCT) natively. Snowflake / BigQuery / SQL Server idioms (DATE_PART quirks, QUALIFY, TOP N) trip up candidates and signal weak SQL fluency. State your dialect upfront and drill PostgreSQL syntax.


1. SQL Ranking, Sorting, and Aggregation for Top-N Invoice Senders

Top-N entity ranking via GROUP BY + ORDER BY DESC + LIMIT in SQL for data engineering

"Return the top 10 senders by completed invoice count" is Square's signature EASY SQL prompt (Top 10 Invoice Senders). The mental model: GROUP BY sender_id collapses one row per sender; COUNT(*) produces the per-sender invoice count; ORDER BY count DESC ranks senders descending; LIMIT 10 returns only the top 10. Same primitive powers any "top-N entities by event volume" pipeline — top-N customers by order count, top-N products by review volume, top-N regions by daily active users.

Diagram showing an invoices mini-table on the left, a horizontal bar chart of per-sender invoice counts sorted descending in the center with the top 10 cut highlighted in green, and a green output card on the right listing the top-10 sender_ids with their counts.

Pro tip: Always alias the count column (AS invoice_count) and sort by the alias in ORDER BY. Mixing literal COUNT(*) and the alias is style noise that interviewers grade. Add a deterministic tiebreaker (, sender_id ASC) when ties at the cut matter — Square's interviewers probe ties at the boundary frequently.

GROUP BY + COUNT(*) for per-sender aggregation

The aggregation invariant: GROUP BY sender_id collapses all invoice rows that share the same sender_id into one output row; COUNT(*) counts the number of rows in each group. Every non-aggregate column in the SELECT must appear in GROUP BY (or be functionally dependent on it).

  • GROUP BY sender_id — collapses to one row per sender.
  • COUNT(*) — total rows per group; equivalent to COUNT(invoice_id) when invoice_id is non-null.
  • COUNT(DISTINCT col) — unique values per group; useful when senders can repeat invoices.
  • SELECT sender_id, COUNT(*) — only sender_id is referenced ungrouped; the GROUP BY makes this legal.

Worked example. Five invoices, three distinct senders.

step output
input [(101, s1), (102, s2), (103, s1), (104, s3), (105, s1)]
GROUP BY sender_id [s1: 3 rows, s2: 1 row, s3: 1 row]
COUNT(*) AS cnt [(s1, 3), (s2, 1), (s3, 1)]

Worked-example solution.

SELECT sender_id, COUNT(*) AS invoice_count
FROM invoices
GROUP BY sender_id;

Rule of thumb: GROUP BY on the dimension you want one-row-per-value, COUNT(*) for the metric, alias the count immediately so subsequent clauses can reference it.

ORDER BY count DESC + LIMIT N for top-N rankings

The top-N invariant: ORDER BY <metric> DESC sorts groups by the metric in descending order; LIMIT N returns only the first N. For ties at the cut, dialects differ — PostgreSQL (Square's CoderPad default) returns whichever order the planner picks unless you add an explicit tiebreaker.

  • ORDER BY invoice_count DESC — sorts groups by the aliased column.
  • LIMIT 10 — first 10 groups in sort order; PostgreSQL syntax.
  • OFFSET 10 LIMIT 10 — pagination; ranks 11-20.
  • Deterministic tiebreak — appending , sender_id ASC makes ties stable across runs.

Worked example. Sort 4 senders by invoice_count and take top 3.

sender_id invoice_count rank
s1 4 1
s2 2 2
s3 1 3
s4 1 (cut)

Worked-example solution.

SELECT sender_id, COUNT(*) AS invoice_count
FROM invoices
GROUP BY sender_id
ORDER BY invoice_count DESC, sender_id ASC
LIMIT 3;

Rule of thumb: always include a deterministic tiebreaker — fintech audits demand that ranked lists be reproducible across runs.

Tiebreaks via RANK(), DENSE_RANK(), and ROW_NUMBER()

The tiebreak invariant: ROW_NUMBER always gives a unique sequence (1, 2, 3, 4); RANK gives the same number for ties and skips the next (1, 1, 3, 4); DENSE_RANK gives the same number for ties without skipping (1, 1, 2, 3). Choose based on whether the prompt wants strict top-N row count, "all senders tied at rank N", or compact ranks without gaps.

  • ROW_NUMBER() OVER (ORDER BY count DESC) — strict 1..N sequence; ties resolved by planner.
  • RANK() OVER (ORDER BY count DESC) — ties share rank, next rank skips.
  • DENSE_RANK() OVER (ORDER BY count DESC) — ties share rank, next rank does not skip.
  • LIMIT 10 — chops to top 10 rows after sort; doesn't address ties unless paired with RANK filter.

Worked example. Three senders tied at 5 invoices, one at 3, one at 2.

sender_id invoice_count ROW_NUMBER RANK DENSE_RANK
s1 5 1 1 1
s2 5 2 1 1
s3 5 3 1 1
s4 3 4 4 2
s5 2 5 5 3

Worked-example solution.

SELECT sender_id, invoice_count,
       ROW_NUMBER() OVER (ORDER BY invoice_count DESC) AS rn,
       RANK()       OVER (ORDER BY invoice_count DESC) AS rk,
       DENSE_RANK() OVER (ORDER BY invoice_count DESC) AS drk
FROM (SELECT sender_id, COUNT(*) AS invoice_count
      FROM invoices GROUP BY sender_id) t;

Rule of thumb: "top 10" with strict cardinality → LIMIT 10; "all senders tied at top 10" → wrap with WHERE rk <= 10; "compact ranks 1..K with no gaps" → DENSE_RANK.
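The three numbering behaviors can be replayed locally; the sketch below is a minimal stand-in using Python's stdlib sqlite3 (SQLite's window functions share these semantics with PostgreSQL), seeded with the tied-sender counts from the worked example. The table and values are illustrative, not Square data.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL CoderPad environment;
# SQLite >= 3.25 implements ROW_NUMBER / RANK / DENSE_RANK with the same semantics.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sender_counts (sender_id TEXT, invoice_count INT)")
conn.executemany(
    "INSERT INTO sender_counts VALUES (?, ?)",
    [("s1", 5), ("s2", 5), ("s3", 5), ("s4", 3), ("s5", 2)],
)

rows = conn.execute("""
    SELECT sender_id, invoice_count,
           ROW_NUMBER() OVER (ORDER BY invoice_count DESC, sender_id) AS rn,
           RANK()       OVER (ORDER BY invoice_count DESC) AS rk,
           DENSE_RANK() OVER (ORDER BY invoice_count DESC) AS drk
    FROM sender_counts
""").fetchall()

for r in rows:
    print(r)
# ROW_NUMBER runs 1..5; RANK gives 1, 1, 1, 4, 5; DENSE_RANK gives 1, 1, 1, 2, 3.
```

Note the explicit sender_id tiebreaker inside the ROW_NUMBER ordering: without it, the 1-2-3 assignment among the three tied senders is planner-dependent.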

Common beginner mistakes

  • Forgetting GROUP BY — SELECT sender_id, COUNT(*) without grouping is an error in PostgreSQL (every non-aggregate column must appear in GROUP BY).
  • Forgetting ORDER BY ... DESC — top 10 silently becomes "first 10 by planner whim."
  • Returning more than N rows by skipping LIMIT — graded as a wrong answer even when the top N are correct.
  • Hardcoding ties to a single row when the prompt says "all senders tied at rank 10" — use RANK filter, not LIMIT.
  • Mixing COUNT(*) and COUNT(DISTINCT invoice_id) — non-equivalent when invoice_id can repeat (it usually can't, but state the assumption).

SQL Interview Question on Top 10 Invoice Senders

Given an invoices(invoice_id, sender_id, recipient_id, sent_at, amount) table, write a query that returns the top 10 senders by invoice count (assume every row in invoices represents a completed invoice), ordered by count descending and breaking ties by sender_id ascending. Output two columns: sender_id and invoice_count.

Solution Using GROUP BY + COUNT(*) + ORDER BY DESC + LIMIT 10

SELECT
    sender_id,
    COUNT(*) AS invoice_count
FROM invoices
GROUP BY sender_id
ORDER BY invoice_count DESC, sender_id ASC
LIMIT 10;

Why this works: GROUP BY sender_id collapses the row stream to one row per sender; COUNT(*) produces the metric for each group; ORDER BY invoice_count DESC ranks groups descending; the , sender_id ASC tiebreak ensures stable output across runs (fintech-audit-grade); LIMIT 10 returns the first 10 ranked rows.

Step-by-step trace for an 8-row sample:

invoice_id sender_id
101 s1
102 s2
103 s1
104 s3
105 s1
106 s2
107 s4
108 s1
  1. Group by sender_id — four groups: s1 (4 rows), s2 (2 rows), s3 (1 row), s4 (1 row).
  2. COUNT(*) — produces (s1, 4), (s2, 2), (s3, 1), (s4, 1).
  3. Order by invoice_count DESC, sender_id ASC — (s1, 4), (s2, 2), (s3, 1), (s4, 1).
  4. LIMIT 10 — only 4 senders exist; all four returned.

Output:

sender_id invoice_count
s1 4
s2 2
s3 1
s4 1

Why this works — concept by concept:

  • GROUP BY sender_id — collapses rows that share sender_id; produces one output row per sender.
  • COUNT(*) aggregate — counts rows in each group; equivalent to COUNT(invoice_id) when invoice_id is non-null (always true since it's the primary key).
  • ORDER BY invoice_count DESC ranking — sorts groups descending by the metric; the DESC keyword is the entire difference between top-N and bottom-N.
  • Deterministic tiebreak — appending , sender_id ASC ensures the output is reproducible across runs; fintech audits require this.
  • LIMIT 10 cut — chops to the top 10 ranked rows; PostgreSQL syntax (CoderPad default).
  • O(|invoices| + G log G) time — |invoices| rows scanned for the GROUP BY, then O(G log G) to sort the group output, where G is the number of senders.

Inline CTA: Drill the Square SQL practice page for the curated EASY problem and the Square aggregation practice page for the only company-tagged topic surface available.

SQL
Company — Square (SQL)
Square SQL practice problems

Practice →

SQL
Company — Square / aggregation
Square aggregation problems

Practice →

SQL
Topic — ranking
SQL ranking problems

Practice →


2. SQL Aggregation, Date Functions, and Cohort Analysis for Post-Signup Activity

30-day cohort retention via DATEDIFF + GROUP BY + HAVING in SQL for data engineering

"Return every user who has at least 5 events in the 30 days after signup" is Square's signature MEDIUM SQL prompt (Users with High Activity After 30 Days Signup). The mental model: the cohort is defined by signup_date; the activity window is signup_date + INTERVAL '30 days'; events filtered to that window are aggregated GROUP BY user_id; HAVING COUNT(*) >= 5 returns the active users. Same primitive powers any retention or cohort analysis — N-day-post-signup activity, post-purchase repeat behavior, post-event engagement windows.

Diagram showing a users table with signup_date markers, a horizontal timeline with 30-day forward windows tinted purple where in-window events are tinted green and out-of-window events stay slate, and a green output card listing active users meeting the threshold.

Pro tip: PostgreSQL has multiple ways to express the 30-day boundary: signup_date + INTERVAL '30 days', signup_date + INTERVAL '30 day', (event_at - signup_date) <= INTERVAL '30 days', (event_at::date - signup_date::date) <= 30. Pick one and stick with it. Mixing forms within the same query is a syntax-fluency red flag.

Date arithmetic: DATEDIFF, DATE_TRUNC, and INTERVAL '30 days'

The date-arithmetic invariant: in PostgreSQL, subtracting two DATE values yields an integer day count, subtracting two TIMESTAMP values yields an INTERVAL, and adding an INTERVAL to a DATE yields a TIMESTAMP; intervals are comparable to literal intervals. DATEDIFF is not native PostgreSQL — use (d1::date - d2::date) for a day count, or EXTRACT(DAY FROM (ts1 - ts2)) for timestamps.

  • signup_date + INTERVAL '30 days' — the moment 30 days after signup (DATE + INTERVAL yields a TIMESTAMP).
  • (event_at - signup_date) <= INTERVAL '30 days' — interval comparison; valid when both sides subtract to an interval (TIMESTAMPs).
  • DATE_TRUNC('week', event_at) — snaps to the week boundary; useful for weekly cohorts.
  • EXTRACT(DAY FROM (event_at - signup_date)) — pulls the day count out of the interval (TIMESTAMP subtraction; DATE - DATE is already an integer).

Worked example. Signup on 2025-01-01; events at 2025-01-10 and 2025-02-15.

event_at signup_date diff within 30 days?
2025-01-10 2025-01-01 9 days yes
2025-02-15 2025-01-01 45 days no

Worked-example solution.

SELECT event_at,
       (event_at::date - signup_date::date) AS days_since_signup,
       (event_at <= signup_date + INTERVAL '30 days') AS within_window
FROM users JOIN events USING (user_id);

Rule of thumb: prefer event_at <= signup_date + INTERVAL '30 days' over (event_at - signup_date) <= INTERVAL '30 days' — the additive form reads as "the event happened within the window" without a subtraction step.
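The same boundary arithmetic can be checked outside the database; this sketch uses Python's datetime to replay the worked example's two events (the dates are the sample values from the table above, not real data).

```python
from datetime import date, timedelta

signup_date = date(2025, 1, 1)
# Mirrors signup_date + INTERVAL '30 days' from the PostgreSQL query.
window_end = signup_date + timedelta(days=30)

for event_at in (date(2025, 1, 10), date(2025, 2, 15)):
    # Mirrors (event_at::date - signup_date::date) AS days_since_signup.
    days_since = (event_at - signup_date).days
    # Additive form of the boundary check: "the event happened within the window."
    within = event_at <= window_end
    print(event_at, days_since, within)
# 2025-01-10 is 9 days in (within the window); 2025-02-15 is 45 days in (outside it).
```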

Cohort definition via signup_date + INTERVAL '30 days'

The cohort invariant: a cohort is a set of users defined by a shared signup date or signup-date-bucket; the cohort's activity window is that signup-date plus a fixed interval. The cohort filter goes in WHERE, the per-user aggregation happens in GROUP BY, and the threshold check goes in HAVING.

  • Cohort by exact signup dateWHERE u.signup_date = '2025-01-01'.
  • Cohort by weekWHERE DATE_TRUNC('week', u.signup_date) = '2025-01-06'.
  • Cohort by all users — no extra filter; aggregate per user.
  • Activity windowWHERE e.event_at <= u.signup_date + INTERVAL '30 days'.

Worked example. Two users signed up on different dates; activity-window check per user.

user_id signup_date window_end
u1 2025-01-01 2025-01-31
u2 2025-01-15 2025-02-14

Worked-example solution.

SELECT u.user_id, u.signup_date,
       u.signup_date + INTERVAL '30 days' AS window_end
FROM users u;

Rule of thumb: compute the window boundary in SELECT (or a CTE) when you need to reference it multiple times; recomputing signup_date + INTERVAL '30 days' in three different clauses signals copy-paste.

COUNT(*) per user with HAVING threshold

The threshold invariant: GROUP BY user_id collapses to one row per user; COUNT(*) produces the per-user activity count; HAVING COUNT(*) >= 5 filters groups whose count meets the threshold. HAVING is "WHERE on aggregates" — it filters group rows, not source rows.

  • HAVING COUNT(*) >= 5 — group-level filter; aggregate predicate.
  • HAVING COUNT(*) >= 5 AND MAX(e.event_at) > MIN(u.signup_date) + INTERVAL '14 days' — compound threshold (wrap signup_date in an aggregate, or add it to GROUP BY, so HAVING can reference it).
  • Avoid WHERE COUNT(*) >= 5 — WHERE cannot reference aggregates; PostgreSQL rejects it.
  • Avoid the inner-join WHERE window filter when you need to keep zero-event users — move the window predicate into a LEFT JOIN's ON clause and count e.event_id instead.

Worked example. Three users; threshold 5 events.

user_id events_in_window passes?
u1 6 yes
u2 3 no
u3 5 yes

Worked-example solution.

SELECT u.user_id, COUNT(*) AS events_in_window
FROM users u
JOIN events e ON e.user_id = u.user_id
WHERE e.event_at <= u.signup_date + INTERVAL '30 days'
GROUP BY u.user_id
HAVING COUNT(*) >= 5;

Rule of thumb: row-level predicates → WHERE; aggregate predicates → HAVING. Crossing them is a graded conceptual error.

Common beginner mistakes

  • Using DATEDIFF(event_at, signup_date) in PostgreSQL — that's MySQL syntax; in PostgreSQL use (d1::date - d2::date) for a day count or EXTRACT(DAY FROM (ts1 - ts2)) for timestamps.
  • Putting the aggregate threshold in WHERE — parser rejects; use HAVING.
  • Filtering signup_date in HAVING — that's a row predicate; should be in WHERE for performance.
  • Forgetting the JOIN — querying events alone misses the per-user signup_date anchor.
  • Missing the INTERVAL '30 days' boundary — counting all-time activity instead of the 30-day window inflates results.
  • Timezone bugs — event_at and signup_date in different timezones produces off-by-one days; cast both to a single timezone when in doubt.

SQL Interview Question on 30-Day Post-Signup Activity

Given users(user_id, signup_date) and events(event_id, user_id, event_at), return every user_id with at least 5 events in the 30 days after signup. Output user_id and events_in_window, ordered by events_in_window descending.

Solution Using JOIN + WHERE INTERVAL + GROUP BY + HAVING COUNT(*) >= 5

SELECT
    u.user_id,
    COUNT(*) AS events_in_window
FROM users u
JOIN events e
    ON e.user_id = u.user_id
WHERE e.event_at <= u.signup_date + INTERVAL '30 days'
  AND e.event_at >= u.signup_date
GROUP BY u.user_id
HAVING COUNT(*) >= 5
ORDER BY events_in_window DESC;

Why this works: the JOIN pairs each event with its user's signup_date; the WHERE clause restricts events to the 30-day window post-signup (>= signup_date AND <= signup_date + INTERVAL '30 days'); GROUP BY u.user_id collapses to one row per user; COUNT(*) produces the per-user in-window event count; HAVING COUNT(*) >= 5 filters users whose count crosses the threshold; ORDER BY events_in_window DESC ranks active users by engagement.

Step-by-step trace for three users:

user_id signup_date events_in_first_30d
u1 2025-01-01 6
u2 2025-01-15 3
u3 2025-02-10 5
  1. JOIN — pairs each event row with its signup_date from users.
  2. WHERE window filter — keeps only events where event_at is in [signup_date, signup_date + 30 days].
  3. GROUP BY user_id — collapses to one row per user with the in-window event count.
  4. HAVING >= 5 — filters out u2 (3 events). u1 (6) and u3 (5) pass.
  5. ORDER BY DESC — u1 (6) ranks above u3 (5).

Output:

user_id events_in_window
u1 6
u3 5

Why this works — concept by concept:

  • JOIN on user_id — pairs each event with its user's signup_date; without this, the query has no anchor for the window.
  • INTERVAL '30 days' boundary — PostgreSQL-native expression of the 30-day forward window; reads as "30 days from signup."
  • WHERE row filter — strips out-of-window events before grouping; pushing the window condition into the aggregate instead (e.g. COUNT(*) FILTER (WHERE ...)) would also work but reads worse.
  • GROUP BY u.user_id — collapses the row stream to one row per user; the only non-aggregate column in SELECT matches the GROUP BY.
  • HAVING COUNT(*) >= 5 — filters groups; aggregate predicates have to live here.
  • O(|users| + |events| + G log G) time — the JOIN dominates; G users with at least one in-window event get sorted at the end.
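The full cohort query can be replayed locally too; the sketch below uses Python's stdlib sqlite3 with date(signup_date, '+30 days') standing in for PostgreSQL's INTERVAL '30 days' (SQLite has no INTERVAL type). The per-user event rows are hypothetical, seeded to match the trace counts (6 for u1, 3 for u2, 5 for u3).

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, signup_date TEXT)")
conn.execute("CREATE TABLE events (user_id TEXT, event_at TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("u1", "2025-01-01"), ("u2", "2025-01-15"), ("u3", "2025-02-10")])

def seed(user_id, signup, n):
    """Insert n hypothetical in-window events, one per day after signup."""
    start = date.fromisoformat(signup)
    rows = [(user_id, (start + timedelta(days=i + 1)).isoformat()) for i in range(n)]
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

seed("u1", "2025-01-01", 6)
seed("u2", "2025-01-15", 3)
seed("u3", "2025-02-10", 5)

# date(u.signup_date, '+30 days') is SQLite's stand-in for
# u.signup_date + INTERVAL '30 days' in the PostgreSQL solution.
active = conn.execute("""
    SELECT u.user_id, COUNT(*) AS events_in_window
    FROM users u
    JOIN events e ON e.user_id = u.user_id
    WHERE e.event_at >= u.signup_date
      AND e.event_at <= date(u.signup_date, '+30 days')
    GROUP BY u.user_id
    HAVING COUNT(*) >= 5
    ORDER BY events_in_window DESC
""").fetchall()

print(active)  # [('u1', 6), ('u3', 5)]
```

u2's three in-window events survive the WHERE filter but fall to the HAVING threshold, matching step 4 of the trace.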

Inline CTA: Practice date-functions SQL problems and cohort-analysis problems on PipeCode.

SQL
Company — Square (SQL)
Square SQL practice problems

Practice →

SQL
Topic — date functions
SQL date-function problems

Practice →

SQL
Topic — cohort analysis
SQL cohort-analysis problems

Practice →


3. SQL Window Functions for Monthly Averages and Duplicate Detection

Window AVG OVER + ROW_NUMBER OVER for analytics in SQL for data engineering

"Calculate the monthly average rating for each product" is Block / Square's signature window-function SQL prompt (DataLemur Q1). The mental model: AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)) produces the monthly product average on every row, with no row-collapse. The mirror primitive is ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) for duplicate detection — flag rows whose rn > 1 per partition. Same primitive powers any "row-level value vs group-aggregate" or "first-occurrence-per-group" question.

Diagram showing a reviews table with submit_date and stars columns, a partition box wrapping rows by product_id and month with AVG(stars) OVER per partition annotated, and a green output card listing month, product_id, avg_stars tuples.

Pro tip: Window functions and GROUP BY answer different questions. GROUP BY collapses rows; window aggregates compute on the group without collapsing. Use GROUP BY when you want one row per group; use a window when you want every original row plus a per-group aggregate column.

Window AVG: AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date))

The window-AVG invariant: AVG(expr) OVER (PARTITION BY ...) returns the average of expr across all rows in the same partition, attached to every row. Unlike GROUP BY AVG(...), the row count is preserved.

  • PARTITION BY product_id — one window per product.
  • PARTITION BY product_id, EXTRACT(MONTH FROM submit_date) — one window per (product, month).
  • OVER () — empty parens = global window across all rows.
  • ORDER BY submit_date ROWS UNBOUNDED PRECEDING — running average up to current row.

Worked example. Two reviews for product 50001 in June; window AVG returns 3.5 on both rows.

review_id product_id submit_date stars window_avg
6171 50001 2022-06-08 4 3.5
5293 50001 2022-06-18 3 3.5

Worked-example solution.

SELECT review_id, product_id, submit_date, stars,
       AVG(stars) OVER (
         PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)
       ) AS monthly_avg
FROM reviews;

Rule of thumb: "show me each row alongside the group average" → window; "show me one row per group" → GROUP BY.

ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) for duplicate detection

The duplicate-detection invariant: ROW_NUMBER() OVER (PARTITION BY <key_columns> ORDER BY <tiebreaker>) assigns 1, 2, 3, … to rows within each partition by the duplicate key; rows with row_number > 1 are duplicates.

  • PARTITION BY user_id, product_id — define what makes a duplicate.
  • ORDER BY submit_date ASC — tiebreaker; deterministic.
  • Filter — wrap in subquery / CTE: WHERE rn > 1 for duplicates only, WHERE rn = 1 for the first occurrence.
  • AlternativeCOUNT(*) OVER (PARTITION BY ...) to count duplicates per row.

Worked example. Two reviews from same user for same product → duplicate.

review_id user_id product_id submit_date rn
6171 123 50001 2022-06-08 1
9999 123 50001 2022-06-15 2 (duplicate)

Worked-example solution.

SELECT *
FROM (
  SELECT review_id, user_id, product_id, submit_date,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, product_id ORDER BY submit_date
         ) AS rn
  FROM reviews
) t
WHERE rn > 1;

Rule of thumb: ROW_NUMBER OVER (PARTITION BY <duplicate_key> ORDER BY <tiebreaker>) is the universal SQL idiom for "find duplicates" — no GROUP BY needed.
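The dedupe idiom can be verified locally with the two-review worked example; this sketch uses Python's stdlib sqlite3, whose ROW_NUMBER semantics match PostgreSQL's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE reviews
                (review_id INT, user_id INT, product_id INT, submit_date TEXT)""")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?)",
    [(6171, 123, 50001, "2022-06-08"),
     (9999, 123, 50001, "2022-06-15")],  # same (user, product) pair -> duplicate
)

# Number rows within each (user_id, product_id) partition by submit_date,
# then keep only rn > 1: everything after the first occurrence is a duplicate.
dupes = conn.execute("""
    SELECT review_id
    FROM (
      SELECT review_id,
             ROW_NUMBER() OVER (
               PARTITION BY user_id, product_id ORDER BY submit_date
             ) AS rn
      FROM reviews
    ) t
    WHERE rn > 1
""").fetchall()

print(dupes)  # [(9999,)] -- the later review is flagged as the duplicate
```

Flipping the filter to rn = 1 returns the first occurrence per partition, the keep-one-copy half of the same idiom.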

Window vs GROUP BY: row-level vs collapsed-row aggregations

The distinction invariant: window functions preserve the input row count and add per-group aggregate columns; GROUP BY collapses input rows to one row per group and replaces non-aggregate columns. Many candidates conflate them; interviewers grade the distinction.

  • Window — SELECT review_id, AVG(stars) OVER (PARTITION BY product_id) FROM reviews; row count preserved.
  • GROUP BY — SELECT product_id, AVG(stars) FROM reviews GROUP BY product_id; one row per product.
  • Mixed — wrap the GROUP BY in a CTE, then JOIN to the original; messy.
  • QUALIFY — Snowflake / BigQuery filter on window results; PostgreSQL needs a wrapping subquery instead.

Worked example. Same data, two different shapes.

approach rows returned
Window AVG 5 (same as input)
GROUP BY AVG 2 (one per product)

Worked-example solution.

-- Window: keep all rows
SELECT review_id, product_id, stars,
       AVG(stars) OVER (PARTITION BY product_id) AS product_avg
FROM reviews;

-- GROUP BY: collapse to one row per product
SELECT product_id, AVG(stars) AS product_avg
FROM reviews
GROUP BY product_id;

Rule of thumb: read the prompt — "for each row, show…" = window; "summarize per product" = GROUP BY.

Common beginner mistakes

  • Forgetting PARTITION BY — the window covers the whole table, producing the global average on every row.
  • Using GROUP BY when the prompt says "for each review, show the monthly product average" — collapses rows incorrectly.
  • Forgetting ORDER BY inside ROW_NUMBER OVER (...) — non-deterministic row numbering; the duplicate filter becomes random.
  • Using RANK() when ROW_NUMBER() is correct — RANK ties become same number; RANK > 1 doesn't reliably flag duplicates.
  • Filtering window results in WHERE — Postgres requires a wrapping subquery / CTE because window functions evaluate after WHERE.

SQL Interview Question on Monthly Average Product Ratings

Given reviews(review_id, user_id, submit_date, product_id, stars), write a query that returns, for each row, submit_date_month, product_id, and the monthly average rating for that product (avg_stars), sorted by month and product_id.

Solution Using AVG ... OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date))

SELECT
    EXTRACT(MONTH FROM submit_date) AS mth,
    product_id,
    ROUND(
        AVG(stars) OVER (
            PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)
        ),
        2
    ) AS avg_stars
FROM reviews
ORDER BY mth, product_id;

Why this works: EXTRACT(MONTH FROM submit_date) pulls the month component from each row; AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)) partitions by (product, month) and averages stars within each partition; rounding to two decimals keeps the output clean; ORDER BY mth, product_id produces the canonical sort order. Each row keeps its identity — this is window, not GROUP BY.

Step-by-step trace for the DataLemur sample:

review_id user_id submit_date product_id stars
6171 123 2022-06-08 50001 4
7802 265 2022-06-10 69852 4
5293 362 2022-06-18 50001 3
6352 192 2022-07-26 69852 3
4517 981 2022-07-05 69852 2
  1. EXTRACT(MONTH ...) — produces [6, 6, 6, 7, 7].
  2. Partition by (product_id, month) — three partitions: (50001, 6) with 2 rows, (69852, 6) with 1 row, (69852, 7) with 2 rows.
  3. AVG within partition — (50001, 6) → 3.5; (69852, 6) → 4.0; (69852, 7) → 2.5.
  4. Round to two decimals + ORDER BY — five rows emitted, sorted by (mth, product_id).

Output:

mth product_id avg_stars
6 50001 3.50
6 50001 3.50
6 69852 4.00
7 69852 2.50
7 69852 2.50

Why this works — concept by concept:

  • EXTRACT(MONTH FROM submit_date) — PostgreSQL date-component function; returns the month as integer.
  • AVG(...) OVER (PARTITION BY ...) — window aggregate; preserves row count and computes per-partition mean.
  • Composite partition key(product_id, EXTRACT(MONTH FROM submit_date)) ensures monthly granularity per product.
  • Two-decimal rounding — returns 3.50 instead of a long fractional tail; cleaner output than the default numeric representation.
  • ORDER BY mth, product_id — canonical sort order for the audit-readable result.
  • O(N log N) time — the planner sorts rows by partition keys once; no GROUP BY collapse.
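The DataLemur-style trace can be replayed locally; this sketch uses Python's stdlib sqlite3, with strftime('%m', ...) standing in for PostgreSQL's EXTRACT(MONTH FROM ...) since SQLite has no EXTRACT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE reviews
                (review_id INT, user_id INT, submit_date TEXT,
                 product_id INT, stars INT)""")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?, ?)",
    [(6171, 123, "2022-06-08", 50001, 4),
     (7802, 265, "2022-06-10", 69852, 4),
     (5293, 362, "2022-06-18", 50001, 3),
     (6352, 192, "2022-07-26", 69852, 3),
     (4517, 981, "2022-07-05", 69852, 2)],
)

# strftime('%m', ...) is SQLite's stand-in for EXTRACT(MONTH FROM ...);
# the window AVG attaches the per-(product, month) mean to every row.
rows = conn.execute("""
    SELECT CAST(strftime('%m', submit_date) AS INT) AS mth,
           product_id,
           ROUND(AVG(stars) OVER (
             PARTITION BY product_id, strftime('%m', submit_date)
           ), 2) AS avg_stars
    FROM reviews
    ORDER BY mth, product_id
""").fetchall()

for r in rows:
    print(r)
# (50001, June) averages 3.5 on both of its rows; (69852, June) is 4.0;
# (69852, July) averages 2.5 on both rows -- the row count stays at five.
```

Note the window (not GROUP BY) shape: five rows in, five rows out, each carrying its partition's average.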

Inline CTA: More SQL window-function problems and aggregation problems on PipeCode.

SQL
Company — Square (SQL)
Square SQL practice problems

Practice →

SQL
Topic — window functions
SQL window-function problems

Practice →

SQL
Topic — aggregation
SQL aggregation problems

Practice →


4. SQL COUNT DISTINCT and Status Filters for Payment-Flow Analysis

Payment-flow SQL with COUNT DISTINCT and status filters in SQL for Square data engineering

"Return the total unique successful transactions, unique senders, and unique recipients" is Square / CashApp's signature payment-flow SQL prompt (Medium / DataLemur Q6 staple). The mental model: filter rows with WHERE status = 'Success' (idempotency boundary), then aggregate with COUNT(DISTINCT col) over the columns you care about — payment_id, sender_id, recipient_id. Same primitive powers any "unique-X for filtered Y" pipeline — unique active users last week, unique trading symbols traded yesterday, unique ad creatives clicked today.

Pro tip: Always state your idempotency boundary out loud before writing the query: "I'm filtering to status = 'Success' because failed/pending transactions don't count toward unique-success metrics." Square interviewers grade this phrasing — it's the difference between "candidate writes correct SQL" and "candidate thinks like a data engineer at a regulated payments company."

COUNT vs COUNT DISTINCT: row count vs unique-value count

The distinct-count invariant: COUNT(*) counts rows; COUNT(col) counts non-null values in col; COUNT(DISTINCT col) counts distinct non-null values in col. The three are not interchangeable.

  • COUNT(*) — total rows in the result set.
  • COUNT(payment_id) — non-null payment IDs (often equal to COUNT(*) since payment_id is the PK).
  • COUNT(DISTINCT sender_id) — unique senders; collapses repeated sender_ids.
  • COUNT(DISTINCT (col1, col2)) — PostgreSQL needs the row-constructor parentheses for distinct combinations; bare COUNT(DISTINCT col1, col2) is MySQL syntax.

Worked example. Five payments from 3 distinct senders.

payment_id sender_id
p1 s1
p2 s2
p3 s1
p4 s3
p5 s1
metric value
COUNT(*) 5
COUNT(payment_id) 5
COUNT(DISTINCT sender_id) 3

Worked-example solution.

SELECT COUNT(*) AS total_rows,
       COUNT(payment_id) AS total_payment_ids,
       COUNT(DISTINCT sender_id) AS unique_senders
FROM payments;

Rule of thumb: "how many rows" → COUNT(*); "how many unique entities" → COUNT(DISTINCT col).
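The three non-interchangeable counts can be demonstrated on the five-payment worked example; this sketch uses Python's stdlib sqlite3 as a stand-in for PostgreSQL (COUNT semantics are identical).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payment_id TEXT, sender_id TEXT)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [("p1", "s1"), ("p2", "s2"), ("p3", "s1"), ("p4", "s3"), ("p5", "s1")],
)

# COUNT(*) counts rows; COUNT(col) counts non-null values;
# COUNT(DISTINCT col) collapses repeats before counting.
total_rows, total_ids, unique_senders = conn.execute("""
    SELECT COUNT(*), COUNT(payment_id), COUNT(DISTINCT sender_id)
    FROM payments
""").fetchone()

print(total_rows, total_ids, unique_senders)  # 5 5 3
```

COUNT(*) and COUNT(payment_id) agree here because payment_id is never null; COUNT(DISTINCT sender_id) drops s1's repeats.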

WHERE status = 'Success' before aggregation

The idempotency-filter invariant: payment-flow analytics almost always filter to status = 'Success' (or whichever success state applies); failed and pending transactions do not contribute to success metrics. The filter goes in WHERE, before aggregation.

  • WHERE status = 'Success' — strictly successful payments.
  • WHERE status IN ('Success', 'Settled') — multiple success-equivalent statuses.
  • WHERE status NOT IN ('Failed', 'Cancelled') — exclusionary; equivalent only when those are the only non-success statuses.
  • State the assumption — interviewers expect you to clarify which statuses count.

Worked example. Six payments, three successful.

payment_id sender_id status
p1 s1 Success
p2 s2 Failed
p3 s1 Success
p4 s3 Pending
p5 s1 Success
p6 s4 Failed

After WHERE status = 'Success': 3 rows (p1, p3, p5), unique senders = 1 (s1 only).

Worked-example solution.

SELECT COUNT(*) AS successful_payments,
       COUNT(DISTINCT sender_id) AS unique_successful_senders
FROM payments
WHERE status = 'Success';

Rule of thumb: the success-filter is non-negotiable for payment-flow questions; ask the interviewer which status values count if the prompt is ambiguous.

Multiple COUNT DISTINCT in one SELECT: sender / recipient splits

The compound-aggregate invariant: a single SELECT can compute multiple COUNT(DISTINCT col) aggregates over the same filtered row set; this is the common shape for payment-flow "uniques per role" questions.

  • Three aggregates in one SELECT — COUNT(DISTINCT payment_id), COUNT(DISTINCT sender_id), COUNT(DISTINCT recipient_id).
  • Same WHERE filter applies to all — single source row set drives every aggregate.
  • Senders ≠ recipients — even though both columns reference users, the distinct counts can differ.
  • COUNT(DISTINCT user_id) — pool sender + recipient with UNION ALL first if you want "unique users involved."
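The UNION ALL pooling shape for "unique users involved" can be sketched end-to-end. This uses Python's sqlite3 for a quick local check (sample rows mirror the worked example below; the SQL itself is valid PostgreSQL too):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments "
    "(payment_id TEXT, sender_id TEXT, recipient_id TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?)",
    [("p1", "s1", "r1", "Success"),
     ("p3", "s1", "r2", "Success"),
     ("p5", "s2", "r3", "Success")],
)

# Pool both role columns with UNION ALL, then count distinct ids once.
unique_users = conn.execute("""
    SELECT COUNT(DISTINCT user_id)
    FROM (
        SELECT sender_id AS user_id FROM payments WHERE status = 'Success'
        UNION ALL
        SELECT recipient_id FROM payments WHERE status = 'Success'
    )
""").fetchone()[0]
print(unique_users)  # 5 -- {s1, s2, r1, r2, r3}
```

UNION ALL (not UNION) is the right choice here: the outer COUNT(DISTINCT ...) already deduplicates, so paying for a dedup inside the subquery would be wasted work.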

Worked example. Three successful payments; two distinct senders, three distinct recipients.

| payment_id | sender_id | recipient_id |
|------------|-----------|--------------|
| p1         | s1        | r1           |
| p3         | s1        | r2           |
| p5         | s2        | r3           |

| metric                       | value |
|------------------------------|-------|
| COUNT(DISTINCT payment_id)   | 3     |
| COUNT(DISTINCT sender_id)    | 2     |
| COUNT(DISTINCT recipient_id) | 3     |

Worked-example solution.

SELECT COUNT(DISTINCT payment_id)  AS unique_transactions,
       COUNT(DISTINCT sender_id)   AS unique_senders,
       COUNT(DISTINCT recipient_id) AS unique_recipients
FROM payments
WHERE status = 'Success';

Rule of thumb: one query, multiple aggregates is the cleanest shape; never write three separate queries for sender / recipient / transaction counts.

Common beginner mistakes

  • Using COUNT(*) when the prompt says "unique" — overcounts whenever the column holds repeats; COUNT(*) is always ≥ COUNT(DISTINCT col).
  • Forgetting WHERE status = 'Success' — counts failed and pending transactions; the metric becomes meaningless.
  • Putting the status filter in HAVING instead of WHERE — HAVING runs after aggregation, and without status in a GROUP BY PostgreSQL rejects the query outright; row-level filters belong in WHERE.
  • Writing three separate queries for sender / recipient / transaction — composes into one SELECT with three aggregates.
  • Confusing "unique senders" with "unique users" — senders and recipients can be the same individual; pool them only if the prompt asks for "unique users."

SQL Interview Question on Payment-Flow Unique Counts

Given payments(payment_id, sender_id, recipient_id, amount, status, created_at) where status is one of 'Success', 'Failed', 'Pending', write a query that returns three columns: unique_transactions, unique_senders, unique_recipients — each a COUNT(DISTINCT) over rows where status = 'Success'.

Solution Using COUNT(DISTINCT) + WHERE status = 'Success'

SELECT
    COUNT(DISTINCT payment_id)   AS unique_transactions,
    COUNT(DISTINCT sender_id)    AS unique_senders,
    COUNT(DISTINCT recipient_id) AS unique_recipients
FROM payments
WHERE status = 'Success';

Why this works: WHERE status = 'Success' strips failed and pending payments before aggregation so every count reflects only successful transactions; three COUNT(DISTINCT col) aggregates over the filtered row set produce the three unique counts in a single pass; the result is one row with three columns — the canonical shape for "summary metrics for a filtered universe."

Step-by-step trace for a 6-row sample:

| payment_id | sender_id | recipient_id | status  |
|------------|-----------|--------------|---------|
| p1         | s1        | r1           | Success |
| p2         | s2        | r2           | Failed  |
| p3         | s1        | r2           | Success |
| p4         | s3        | r1           | Pending |
| p5         | s2        | r3           | Success |
| p6         | s1        | r1           | Success |
  1. WHERE filter — strips p2 (Failed) and p4 (Pending); 4 successful rows remain.
  2. COUNT(DISTINCT payment_id) → {p1, p3, p5, p6} = 4 (every payment_id is unique by definition).
  3. COUNT(DISTINCT sender_id) → {s1, s2} = 2.
  4. COUNT(DISTINCT recipient_id) → {r1, r2, r3} = 3.

Output:

| unique_transactions | unique_senders | unique_recipients |
|---------------------|----------------|-------------------|
| 4                   | 2              | 3                 |

Why this works — concept by concept:

  • WHERE status = 'Success' — idempotency-filter; payment-flow metrics never include failed or pending rows.
  • COUNT(DISTINCT payment_id) — counts unique transaction IDs; equal to COUNT(*) here since payment_id is the primary key, but spelled out for clarity.
  • COUNT(DISTINCT sender_id) — counts unique senders; senders can repeat across payments, so this collapses repeats.
  • COUNT(DISTINCT recipient_id) — counts unique recipients; same logic, different column.
  • Single-pass three-aggregate SELECT — one scan of payments filtered to Success, three aggregates in parallel; no need for separate queries.
  • O(|payments|) time / O(D) space — one linear scan; D = sum of distinct cardinalities held in three small hash structures.
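The step-by-step trace above can be checked mechanically. Here is a quick harness in Python's sqlite3 loading the same 6-row sample (column and status names mirror the problem statement; PostgreSQL returns the same single row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments "
    "(payment_id TEXT, sender_id TEXT, recipient_id TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?)",
    [("p1", "s1", "r1", "Success"),
     ("p2", "s2", "r2", "Failed"),
     ("p3", "s1", "r2", "Success"),
     ("p4", "s3", "r1", "Pending"),
     ("p5", "s2", "r3", "Success"),
     ("p6", "s1", "r1", "Success")],
)

# The interview answer verbatim: filter first, three aggregates in one pass.
row = conn.execute("""
    SELECT COUNT(DISTINCT payment_id),
           COUNT(DISTINCT sender_id),
           COUNT(DISTINCT recipient_id)
    FROM payments
    WHERE status = 'Success'
""").fetchone()
print(row)  # (4, 2, 3)
```

Building this kind of tiny in-memory fixture is also a good live-interview habit when CoderPad's sample data is thin: it lets you confirm the filter strips p2 and p4 before you reason about the aggregates.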

Drill more SQL filtering problems, and hit the Square SQL practice page for the curated 2-problem set.

  • Company — Square (SQL): Square SQL practice problems
  • Topic — filtering: SQL filtering problems
  • Topic — aggregation: SQL aggregation problems


Tips to crack Square data engineering interviews

Square = Block — both names refer to the same loop

Square rebranded to Block Inc. in December 2021. The company name on offer letters is Block; the consumer products (Square seller tools, Cash App, Tidal, Spiral) keep their original brands. The data-engineering interview loop, the SQL bar, and the question shapes did not change with the rebrand. Search for both "Square data engineer interview" and "Block data engineer interview" — every external article you find under one name applies under the other.

Drill the four SQL primitives

The four primitives in this guide map directly to the curated 2 PipeCode SQL problems plus the two adjacent SQL primitives every Block / Square / CashApp SQL list rotates through: GROUP BY + COUNT + ORDER BY DESC + LIMIT N for top-N rankings (#45), DATEDIFF / INTERVAL '30 days' + GROUP BY + HAVING for cohort-retention queries (#217), AVG OVER PARTITION BY and ROW_NUMBER OVER window functions for monthly aggregates and duplicate detection (DataLemur Q1 + Q3), and COUNT(DISTINCT) + WHERE status = 'Success' payment-flow aggregations (Medium + DataLemur Q6).

CoderPad PostgreSQL is the live coding environment

Square / Block / CashApp interviews run live SQL in CoderPad with PostgreSQL as the only available dialect. Drill PostgreSQL-flavored answers — EXTRACT(MONTH FROM ...), DATE_TRUNC('week', ...), INTERVAL '30 days', (d1 - d2)::int, COUNT(DISTINCT col), LIMIT N OFFSET M. Avoid Snowflake's QUALIFY, MySQL's DATEDIFF(d1, d2), BigQuery's DATE_DIFF, SQL Server's TOP N — these all parse-fail in PostgreSQL. Google search is allowed during the live interview, but stuttering on dialect signals weak SQL fluency.

Medallion architecture (Bronze / Silver / Gold) is the system-design house style

Square's data platform is structured around the Medallion architecture: Bronze for raw landed data, Silver for cleaned and enriched data, Gold for aggregated analytics-ready data. When you discuss pipeline design in the system-design or onsite panel rounds, frame the data flow as Bronze → Silver → Gold, name the orchestrator (Airflow or dbt), and mention data-quality gates (Great Expectations) between layers. This single framing carries weight at every Square data-engineering onsite.

Data quality is graded heavily — Great Expectations and validation gates

Square's interviewers probe data-quality concerns on every round. Mention specific tools you've used (Great Expectations, Soda, dbt tests), describe the validation gates between Bronze / Silver / Gold layers, and articulate the difference between "schema validation" (column types and nullability) and "semantic validation" (business-logic invariants like transaction_amount > 0 or payment_status IN ('Success', 'Failed', 'Pending')).

4-stage interview process — HR, Hiring Manager, Technical, Onsite Panel (NDA)

The Square data-engineering loop runs four stages: HR screen (15-30 min recruiter), Hiring Manager interview (~30 min, past experience and data-quality), Technical Screen (~1 hour, Python + SQL on CoderPad with medium-level complexity), and Onsite Panel (multiple 30-45 min rounds, NDA required, data modeling + analytical + cultural fit). Total comp data points are limited (only 2 reported), but the average base salary is $139,850, range $101K-$185K.

Where to practice on PipeCode

Start with the Square practice page and the language-scoped Square SQL practice page for the curated 2-problem set. Hit the company_topic Square — aggregation page for the only Square-tagged topic surface available. After that, drill the matching topic pages: ranking, aggregation, date functions, cohort analysis, window functions, filtering, group by. The interview courses page bundles structured curricula. For broader coverage, browse by topic, or pivot to peer guides — the Airbnb DE interview guide and the top DE interview questions 2026 blog. The SQL data types Postgres guide is a useful refresher because PostgreSQL is the CoderPad default.

Communication and approach under time pressure

Talk through the invariant first ("this is a top-N ranking problem with a tiebreak requirement"), the brute force second ("a self-join would also work but is O(n²)"), and the optimal third ("GROUP BY + ORDER BY + LIMIT is O(n log n) and idiomatic"). Interviewers grade process as much as the final answer. Leave 5 minutes for an edge-case sweep: empty input, ties at the cut, NULL handling, timezone bugs in date arithmetic, status values you didn't expect (Pending, Cancelled). The most common "almost passed" failure mode is correct happy-path code that crashes on edge cases — a 30-second sweep prevents it.


Frequently Asked Questions

What is the Square data engineering interview process?

The Square data engineering interview opens with a 15-30 minute recruiter HR screen, then a 30-minute hiring-manager interview focused on past experience and data-quality philosophy, then a 1-hour technical screen on CoderPad covering Python and SQL at medium-level complexity, then an onsite panel of multiple 30-45 minute rounds (data modeling, analytical thinking, cultural fit). The onsite panel requires an NDA. End-to-end the loop runs three to four weeks.

Is Square the same as Block? Which name should I search for?

Yes — Square rebranded to Block Inc. in December 2021. The legal entity is Block; the consumer brands (Square seller tools, Cash App, Tidal, Spiral) keep their original names. Search for both "Square data engineer interview" and "Block data engineer interview" — every external article you find under one name applies under the other. The DataLemur SQL guide is titled "Block SQL Interview Questions" and the IQ guide is titled "Square Data Engineer Interview Questions"; both refer to the same loop.

What programming languages does Square test in data engineering interviews?

Square tests SQL and Python in the technical screen and onsite. SQL is the heavier surface — the live coding round in CoderPad is PostgreSQL-only, with patterns like GROUP BY + COUNT + ORDER BY DESC + LIMIT, INTERVAL '30 days' cohort math, AVG OVER PARTITION BY window aggregates, and COUNT(DISTINCT col) over status-filtered payment data. Python is the lighter surface — typically 30 minutes of medium-difficulty data-manipulation questions (Pandas / dict / list comprehensions). Spark / PySpark show up in the onsite panel for senior roles.

How difficult are Square data engineering interview questions?

The curated Square practice set on PipeCode is 1 EASY + 1 MEDIUM, no hard. The EASY is a SQL ranking + aggregation problem (Top 10 Invoice Senders); the MEDIUM is a SQL aggregation + date-functions + cohort-analysis problem (Users with High Activity After 30 Days Signup). The DataLemur Block list adds 10 more SQL questions ranging from definition-style (joins, constraints, clustered vs non-clustered indexes) to scenario-grade (window AVG OVER, ROW_NUMBER for duplicates, click-through rate via LEFT JOIN + COUNT DISTINCT). Stuttering on the EASY is a stronger negative signal than struggling with the MEDIUM.

What is the Square data engineer salary range?

Square data engineer base salary ranges from $101K to $185K, with an average of $139,850 (median $140K) across 21 reported data points (per Interview Query). Total compensation data is sparse (only 2 reported points), so the total-comp average ($72K) is under-sampled and likely inaccurate; expect actual total comp to be substantially higher once equity refreshers and bonuses are factored in. Negotiation success rates are best-supported by competing offers and verified levels.fyi entries.

What tech stack does Square's data engineering team use?

Square's data engineering stack includes Python (heavy — Pandas, PySpark for large-scale transforms), SQL (PostgreSQL on CoderPad for interviews; production warehouses include Snowflake-equivalent platforms), Airflow + dbt for ETL/ELT orchestration, AWS or GCP for cloud infrastructure, Apache Spark for distributed data processing, the Medallion architecture (Bronze / Silver / Gold layers) for data organization, and Great Expectations for data-quality validation. The cultural emphasis is on data-quality-first pipelines, cross-functional collaboration with product / DS teams, and continuous learning.


Start practicing Square data engineering problems
