sql for data analytics is its own dialect. The keywords feel familiar — SELECT, GROUP BY, JOIN — but the shape of the queries is different. Analyst SQL lives in the world of cohorts (DATE_TRUNC, relative periods), funnels (FILTER, conditional counts), retention curves (day-N, weekly buckets, calendar joins) and growth metrics (LTV, ARPU, DAU/MAU stickiness). If you can write a fast aggregate and a clean window function, the job becomes shaping data into the exact frame a product manager or growth team can read off a dashboard.
This guide walks the five SQL muscles that show up in every analytics-engineer and senior analyst loop. We open with the why, then go deep on cohorts, funnels, retention curves, and the LTV / ARPU / MAU / DAU bundle. Every section pairs a teaching block with a worked interview answer — code, step-by-step trace, output, then a concept-by-concept why-this-works. By the end you will be able to defend a 12-line cohort matrix, trace a 5-step funnel, draw a 30-day retention curve, and compute DAU/MAU stickiness from a single user_activity table — the exact moves sql for data analyst rounds reward at FAANG, growth-stage SaaS, and analytics-heavy shops.
When you want hands-on reps immediately after reading, browse aggregation problems →, drill window-function problems →, and rehearse on the SQL practice library →.
On this page
- Why analytics SQL is its own discipline
- Cohort analysis — DATE_TRUNC + first-event signup buckets
- Funnel analysis — multi-step conversion + drop-off
- Retention curves — day-N + weekly cohorts + decay
- Business-metric SQL — LTV, ARPU, MAU/DAU, stickiness
- Cheat sheet — analyst SQL idioms
- Frequently asked questions
- Practice on PipeCode
1. Why analytics SQL is its own discipline
sql data analytics is built from four shape primitives — cohort, funnel, retention, LTV
The one-sentence mental model: sql data analysis is the discipline of taking a raw event table and reshaping it into one of four canonical frames — a cohort matrix, a funnel waterfall, a retention curve, or a growth-metric time series. Every dashboard a PM looks at is a derivative of one of those four shapes, which is why interviewers probe them by name.
Why analyst SQL deserves its own playbook.
- It is aggregation-first, not row-first. Application SQL fetches single rows by primary key. Analytics SQL almost never does — every query touches a window of time and produces a roll-up.
-
It lives on event tables.
events,user_activity,orders,pageviews,sessions. Wide, append-only, time-ordered. -
It uses a tight family of functions.
DATE_TRUNC,DATE_DIFF,EXTRACT,SUM,COUNT DISTINCT,FILTER (WHERE …),ROW_NUMBER,LAG,LEAD,SUM() OVER. -
It is read in shapes, not rows. A "good" query produces a tidy frame — one row per
(cohort, period), or one row per(funnel_step, user_count)— that a BI tool can pivot.
The four canonical analyst SQL frames.
- Cohort matrix. Rows = signup month; columns = months-since-signup; cells = retention %.
- Funnel waterfall. Rows = ordered steps; columns = count, drop-off, cumulative %.
- Retention curve. X = days-since-signup; Y = % of cohort still active; one line per cohort.
- Time-series + ratio. MAU, DAU, stickiness, ARPU, LTV — one row per period.
Why this is the bridge into senior analytics-engineer roles.
- A junior analyst writes a
GROUP BY signup_dateand stops. - A senior analyst writes the same query, then asks "is this cohort-fair across months?", catches the diagonal-greying problem, joins a calendar table, and ships a frame the PM can paste straight into a deck.
- An analytics engineer wraps that frame in a dbt model with tests on
cohort_size > 0andretention_pct <= 1.0, and the lineage is auditable forever.
sql for data analysis — the four-axis trade-off you make on every query.
- Granularity. Day vs week vs month — too fine and the chart is noisy, too coarse and the trend lags by a sprint.
- Anchoring. Calendar time vs cohort-relative time — a calendar chart is a status report, a cohort chart is a learning instrument.
- Counting basis. Active users vs revenue vs events — the same shape question has very different answers depending on which you pick.
- Comparability. Same window vs same-N-days — a 30-day window picked from January is not the same as 30 days from any cohort start; analyst SQL has to enforce one or the other.
What interviewers listen for in a sql for data analyst round.
- Do you reach for
DATE_TRUNCthe moment they say "cohort"? — required signal. - Do you separate calendar time from cohort-relative time in the query? — senior signal.
- Do you use
COUNT(DISTINCT user_id) FILTER (WHERE …)for step counts instead of pivoting in Python? — senior signal. - Do you mention a calendar / dates table for retention curves with sparse activity? — senior signal.
- Do you compute DAU/MAU stickiness as a ratio of two rolling windows, not as two unrelated metrics? — required signal.
Worked example — reshape one event table into a cohort frame in 8 lines
Detailed explanation. Realistic analyst SQL starts with a single event table — user_activity(user_id, event_date, event_type) — and a question — "how many users from the January cohort were still active in February?". The answer is one CTE that anchors each user to their first event, joined back to all of their later events.
Question. Given user_activity(user_id, event_date), write a query that returns the number of users from each monthly signup cohort who were active in each subsequent month.
Input.
| user_id | event_date |
|---|---|
| 1 | 2026-01-03 |
| 1 | 2026-02-11 |
| 1 | 2026-03-04 |
| 2 | 2026-01-22 |
| 3 | 2026-02-02 |
| 3 | 2026-03-30 |
Code.
WITH cohorts AS (
SELECT user_id,
DATE_TRUNC('month', MIN(event_date)) AS cohort_month
FROM user_activity
GROUP BY user_id
)
SELECT c.cohort_month,
DATE_TRUNC('month', a.event_date) AS activity_month,
COUNT(DISTINCT a.user_id) AS active_users
FROM cohorts c
JOIN user_activity a USING (user_id)
GROUP BY 1, 2
ORDER BY 1, 2;
Step-by-step explanation.
-
cohortsCTE. For eachuser_id, take the minimumevent_dateandDATE_TRUNCit to a month. That label — "the user's signup month" — never changes once assigned. - Join back to activity. Every event row now carries the user's cohort and its own activity month. One row per event, dressed with both keys.
-
GROUP BY cohort_month, activity_month. Collapse to one row per(cohort, period)pair, count distinct users — that is your cohort × period grid. -
ORDER BY cohort, activity_month. The output is the long-form (tidy) version of the cohort matrix; any BI tool can pivot it into the heatmap on the right. - Why this shape? Long-form is easier to test, store, and join with calendar tables. Pivot only at the presentation layer.
Output.
| cohort_month | activity_month | active_users |
|---|---|---|
| 2026-01 | 2026-01 | 2 |
| 2026-01 | 2026-02 | 1 |
| 2026-01 | 2026-03 | 1 |
| 2026-02 | 2026-02 | 1 |
| 2026-02 | 2026-03 | 1 |
Rule of thumb. Anchor users to their first event in a CTE; join back to all of their later events; group by (cohort, period). That single recipe powers cohort, funnel, and retention queries — only the second key and the count basis change.
Analytics SQL interview question on reshaping events into a long cohort frame
The probe usually sounds like: "Given an event table, return one row per signup-month × activity-month with the active-user count. Don't pivot in SQL — keep it tidy." It tests whether the candidate separates cohort time from calendar time and whether they reach for DATE_TRUNC reflexively.
Solution Using DATE_TRUNC + first-event CTE + grouped count
WITH cohorts AS (
SELECT user_id,
DATE_TRUNC('month', MIN(event_date)) AS cohort_month
FROM user_activity
GROUP BY user_id
),
activity AS (
SELECT a.user_id,
c.cohort_month,
DATE_TRUNC('month', a.event_date) AS activity_month
FROM user_activity a
JOIN cohorts c USING (user_id)
)
SELECT cohort_month,
activity_month,
COUNT(DISTINCT user_id) AS active_users
FROM activity
GROUP BY 1, 2
ORDER BY 1, 2;
Step-by-step trace.
| Stage | Operation | Output rows |
|---|---|---|
| 1 |
cohorts CTE — first event per user |
3 rows: (u1,2026-01), (u2,2026-01), (u3,2026-02) |
| 2 |
activity CTE — join cohort onto every event |
6 rows, each carrying both cohort_month and activity_month
|
| 3 | GROUP BY cohort_month, activity_month |
collapses to 5 distinct (cohort, period) pairs |
| 4 | COUNT(DISTINCT user_id) |
de-duplicates within each cell — one user active twice in March still counts as one |
| 5 | ORDER BY 1, 2 |
rows ordered cohort-then-period — easy to pivot |
Output:
| cohort_month | activity_month | active_users |
|---|---|---|
| 2026-01-01 | 2026-01-01 | 2 |
| 2026-01-01 | 2026-02-01 | 1 |
| 2026-01-01 | 2026-03-01 | 1 |
| 2026-02-01 | 2026-02-01 | 1 |
| 2026-02-01 | 2026-03-01 | 1 |
Why this works — concept by concept:
- DATE_TRUNC as the cohort anchor — truncating the user's first event to a month produces a label that is stable for the user's lifetime; every later query joins on it without recomputation.
- First-event CTE pattern — separating "who is the user" from "what did the user do" keeps the cohort label out of the activity table; one place to change the bucket granularity (month → week) without touching downstream queries.
- COUNT(DISTINCT user_id) as the basis — collapses many events per user per period into one active flag; that is the correct counting basis for retention, MAU, and cohort-active metrics.
-
Long-form output — one row per
(cohort, period); trivially pivotable in BI, easy to test (assert no row where active_users > cohort_size), and joinable with calendar tables. -
Cost — query time = O(events) for the first-event scan, O(events) for the activity scan, plus the group-by hash; safe up to billions of rows on a warehouse with cluster-key on
event_date.
SQL
Topic — aggregation
Aggregation problems (SQL)
2. Cohort analysis — DATE_TRUNC + first-event signup buckets
sql data analysis lives or dies on cohort SQL — anchor users to a signup bucket, then reshape
The one-line invariant of cohort SQL: assign every user a stable cohort label (the truncated signup date), then aggregate their later activity by relative period from that label. Once the cohort label is fixed, every downstream question — retention, revenue per cohort, conversion per cohort — collapses to a GROUP BY (cohort, period).
The cohort decisions every analyst SQL interview probes.
- Granularity. Daily, weekly, or monthly cohorts. Weekly is the dashboard sweet spot for SaaS; monthly for ecommerce; daily only when you have millions of signups per day.
- Anchor event. First app open, first purchase, first paid plan — the definition of "cohort 0" controls every downstream chart. State it explicitly in the query.
- Relative period vs absolute period. Cohort SQL groups by relative period (months-since-signup), not calendar period. A January user reaching M3 means March; a February user reaching M3 means May.
- Triangular shape. Newer cohorts haven't existed long enough to reach later periods — those cells should be NULL, not zero. Greying the impossible cells is a hallmark of a fair cohort chart.
The cohort recipe in three SQL primitives.
-
DATE_TRUNC('month', signup_at)→ cohort_month. The stable bucket label. -
DATE_DIFF('month', cohort_month, activity_month)→ period_n. The relative period. -
COUNT(DISTINCT user_id) / cohort_size→ retention_pct. The retention metric.
Three failure modes to call out in the interview.
-
Wrong basis.
COUNT(event_id)instead ofCOUNT(DISTINCT user_id)— heavy users inflate retention. -
Mixing calendar and cohort time. Joining on
activity_monthdirectly rather thanperiod_nproduces a non-comparable chart — January's M3 cell is February, but February's M3 cell is May. - Forgetting the triangle. Filling impossible cells with zero collapses retention as cohorts age; readers will misinterpret it as a falling product.
Worked example — 12-line cohort matrix for a SaaS signup feed
Detailed explanation. A SaaS product wants the monthly retention matrix — for each signup-month cohort, what fraction were active in M1, M2, M3, etc. The query needs three CTEs: cohort assignment, cohort size, and active counts per relative period.
Question. Given users(user_id, signup_at) and events(user_id, event_at), return one row per (cohort_month, period_n) with the cohort size, the active-user count, and the retention percentage.
Input (small sample).
| user_id | signup_at | events |
|---|---|---|
| 101 | 2026-01-04 | 2026-01-04, 2026-02-09, 2026-03-15 |
| 102 | 2026-01-12 | 2026-01-12, 2026-02-20 |
| 103 | 2026-02-02 | 2026-02-02, 2026-03-10 |
| 104 | 2026-02-25 | 2026-02-25 |
Code.
WITH cohorts AS (
SELECT user_id,
DATE_TRUNC('month', signup_at) AS cohort_month
FROM users
),
cohort_size AS (
SELECT cohort_month, COUNT(*) AS n_users
FROM cohorts
GROUP BY 1
),
activity AS (
SELECT c.cohort_month,
DATE_DIFF('month', c.cohort_month,
DATE_TRUNC('month', e.event_at)) AS period_n,
COUNT(DISTINCT e.user_id) AS active_users
FROM events e
JOIN cohorts c USING (user_id)
GROUP BY 1, 2
)
SELECT a.cohort_month,
a.period_n,
a.active_users,
s.n_users,
1.0 * a.active_users / s.n_users AS retention_pct
FROM activity a
JOIN cohort_size s USING (cohort_month)
ORDER BY 1, 2;
Step-by-step explanation.
-
cohortsCTE. Each user gets one stable label — the month-truncated signup date. This is the row-anchor for everything else. -
cohort_sizeCTE. Count the rows in each cohort. This becomes the denominator for retention percentages. -
activityCTE. Join cohort labels onto every event, thenDATE_DIFFto compute the relative period.GROUP BY (cohort, period)collapses to one row per cell. -
Final SELECT. Join the cell counts to the cohort sizes; divide for the retention percentage. The
1.0 *cast forces float division — without it some engines floor to integer zero. - Triangular shape emerges naturally. Cohorts that haven't reached period N yet simply produce no row — the cell is missing, not zero. The dashboard layer should render those cells as greyed, never as 0%.
Output.
| cohort_month | period_n | active_users | n_users | retention_pct |
|---|---|---|---|---|
| 2026-01 | 0 | 2 | 2 | 1.00 |
| 2026-01 | 1 | 2 | 2 | 1.00 |
| 2026-01 | 2 | 1 | 2 | 0.50 |
| 2026-02 | 0 | 2 | 2 | 1.00 |
| 2026-02 | 1 | 1 | 2 | 0.50 |
Rule of thumb. Three CTEs — cohorts, cohort_size, activity — and one final join. That is the entire cohort-matrix shape, no matter how big the data gets.
Analytics SQL interview question on building a cohort matrix end-to-end
A common probe in sql for data analytics interviews: "Walk me through the SQL to build a monthly cohort retention matrix. I want to see how you handle cohort assignment, the relative-period join, and the denominator." It is the bread-and-butter analyst question — fluency here distinguishes a senior analyst from a junior.
Solution Using DATE_TRUNC + DATE_DIFF + three CTEs
WITH cohorts AS (
SELECT user_id,
DATE_TRUNC('month', signup_at) AS cohort_month
FROM users
),
cohort_size AS (
SELECT cohort_month, COUNT(*) AS n_users
FROM cohorts
GROUP BY 1
),
activity AS (
SELECT c.cohort_month,
DATE_DIFF('month', c.cohort_month,
DATE_TRUNC('month', e.event_at)) AS period_n,
COUNT(DISTINCT e.user_id) AS active_users
FROM events e
JOIN cohorts c USING (user_id)
GROUP BY 1, 2
)
SELECT a.cohort_month,
a.period_n,
a.active_users,
s.n_users,
ROUND(100.0 * a.active_users / s.n_users, 1) AS retention_pct
FROM activity a
JOIN cohort_size s USING (cohort_month)
ORDER BY 1, 2;
Step-by-step trace.
| Stage | Operation | Output |
|---|---|---|
| 1 |
cohorts CTE |
one row per user with stable cohort_month
|
| 2 |
cohort_size CTE |
one row per cohort with n_users denominator |
| 3 |
activity CTE — join cohort onto every event |
one row per (cohort_month, period_n) with active_users
|
| 4 | final join + division | one row per cell with retention_pct
|
| 5 |
ROUND(...) to 1 dp |
presentation-ready output |
Output:
| cohort_month | period_n | active_users | n_users | retention_pct |
|---|---|---|---|---|
| 2026-01-01 | 0 | 12,000 | 12,000 | 100.0 |
| 2026-01-01 | 1 | 5,040 | 12,000 | 42.0 |
| 2026-01-01 | 2 | 3,360 | 12,000 | 28.0 |
| 2026-01-01 | 3 | 2,160 | 12,000 | 18.0 |
| 2026-02-01 | 0 | 14,500 | 14,500 | 100.0 |
| 2026-02-01 | 1 | 6,525 | 14,500 | 45.0 |
Why this works — concept by concept:
- DATE_TRUNC as the stable cohort anchor — every user has exactly one cohort label for life, computed once in the first CTE; all subsequent joins are O(1) per row instead of recomputing the truncation.
-
DATE_DIFF for relative period — turning a calendar pair
(cohort_month, activity_month)into a single integerperiod_nis what makes cohorts comparable. M3 always means "three months after signup", regardless of calendar. - COUNT(DISTINCT user_id) as the basis — collapses heavy-user inflation; the cell counts the people, not the events.
-
Triangular shape — missing rows (instead of zero rows) for
(cohort, period)pairs that don't exist yet means the chart honestly grays out impossible cells; the dashboard renders the diagonal correctly. - 1.0 * x / y — forces float division in engines that otherwise floor integers; the most common silent-zero bug in analyst SQL.
-
Cost — O(users) for cohorts, O(events) for activity, O(cohorts) for the denominator join; safe up to hundreds of millions of events with cluster-keys on
signup_atandevent_at.
SQL
Topic — cohort analysis
Cohort analysis problems (SQL)
Worked example — weekly cohorts and the diagonal-greying trap
Detailed explanation. Monthly cohorts can hide weekly seasonality (Black Friday signups behave very differently from end-of-quarter signups). Weekly cohorts surface that variance, but they multiply the cell count 4×, and the diagonal-greying trap becomes more obvious — June cohorts cannot have a 30-week period yet because June was only 4 weeks ago.
Question. Convert the monthly cohort recipe into a weekly cohort recipe, and explain why the bottom-right corner of the matrix has to be NULL, not zero.
Input (timeline assumption).
| Cohort week | Earliest possible period_n |
|---|---|
| 2026-W01 (Jan 5) | 21 weeks by today (2026-W22) |
| 2026-W10 (Mar 9) | 12 weeks by today |
| 2026-W22 (Jun 1) | 0 weeks by today |
Code.
WITH cohorts AS (
SELECT user_id,
DATE_TRUNC('week', signup_at) AS cohort_week
FROM users
),
activity AS (
SELECT c.cohort_week,
DATE_DIFF('week', c.cohort_week,
DATE_TRUNC('week', e.event_at)) AS period_n,
COUNT(DISTINCT e.user_id) AS active_users
FROM events e
JOIN cohorts c USING (user_id)
WHERE e.event_at >= c.cohort_week -- guard: no events before signup
GROUP BY 1, 2
)
SELECT *
FROM activity
WHERE cohort_week + (period_n * INTERVAL '7 days') <= CURRENT_DATE
ORDER BY 1, 2;
Step-by-step explanation.
- Truncate
signup_atto the start of the ISO week (Monday in most engines). That is the cohort label. - Truncate
event_atto its own week;DATE_DIFFgives the relative period in weeks. -
The guard clause
e.event_at >= c.cohort_weekdrops the impossible pre-signup events that creep in from clock skew or test data. -
The diagonal filter
cohort_week + (period_n * INTERVAL '7 days') <= CURRENT_DATEforbids rows that haven't happened yet — the cell for(2026-W22, period_n=10)is not emitted, so it renders as missing (which the dashboard greys out). - Without the filter, the engine would emit zero for those cells, which the dashboard would draw as 0% — a misleading retention drop.
Output.
| cohort_week | period_n | active_users |
|---|---|---|
| 2026-W01 | 0 | 1,200 |
| 2026-W01 | 1 | 720 |
| 2026-W01 | 4 | 420 |
| 2026-W10 | 0 | 1,500 |
| 2026-W10 | 1 | 930 |
| 2026-W22 | 0 | 800 |
Rule of thumb. Always enforce the "period is possible" filter in the final SELECT. NULL is the honest answer for the future; zero is a chart-killing lie.
SQL
Topic — date functions
Date function and bucketing problems (SQL)
Worked example — cohort-anchored revenue per period
Detailed explanation. A retention chart shows whether users come back; a cohort-anchored revenue chart shows what they spend when they come back. Same recipe shape — anchor the user to a cohort, project activity onto relative period — but the inner aggregate is SUM(amount) instead of COUNT(DISTINCT user_id).
Question. Given users(user_id, signup_at) and orders(user_id, amount, order_at), return one row per (cohort_month, period_n) with revenue and revenue_per_user.
Code.
WITH cohorts AS (
SELECT user_id,
DATE_TRUNC('month', signup_at) AS cohort_month
FROM users
),
cohort_size AS (
SELECT cohort_month, COUNT(*) AS n_users
FROM cohorts
GROUP BY 1
),
revenue AS (
SELECT c.cohort_month,
DATE_DIFF('month', c.cohort_month,
DATE_TRUNC('month', o.order_at)) AS period_n,
SUM(o.amount) AS revenue
FROM orders o
JOIN cohorts c USING (user_id)
GROUP BY 1, 2
)
SELECT r.cohort_month,
r.period_n,
r.revenue,
s.n_users,
ROUND(r.revenue / s.n_users, 2) AS revenue_per_user
FROM revenue r
JOIN cohort_size s USING (cohort_month)
ORDER BY 1, 2;
Step-by-step explanation.
-
cohortsCTE — the same first-event anchor; never recompute. -
cohort_sizeCTE — the denominator for the per-user metric. -
revenueCTE — sum orders into(cohort, period)cells. The DATE_DIFF gives relative period so months are comparable across cohorts. - Final join + division — produces the per-cohort, per-period revenue and the revenue-per-original-user (the LTV building block).
-
The shape generalises — replace
SUM(amount)withCOUNT(DISTINCT order_id)for purchase counts, or withAVG(amount)for cohort AOV.
Output.
| cohort_month | period_n | revenue | n_users | revenue_per_user |
|---|---|---|---|---|
| 2026-01 | 0 | 240,000 | 12,000 | 20.00 |
| 2026-01 | 1 | 132,000 | 12,000 | 11.00 |
| 2026-01 | 2 | 96,000 | 12,000 | 8.00 |
| 2026-02 | 0 | 290,000 | 14,500 | 20.00 |
| 2026-02 | 1 | 130,500 | 14,500 | 9.00 |
Rule of thumb. The cohort recipe is the shape, not the metric. Swap the inner aggregate and you can build retention, revenue, conversion, feature adoption, or any other cohort-anchored chart with the same five-CTE skeleton.
3. Funnel analysis — multi-step conversion + drop-off
sql for data analysis of funnels is one row per user-step pair, pivoted with FILTER
The funnel recipe in one sentence: for every user, mark which funnel step(s) they reached; then for every step, count the distinct users who reached it; the row-to-column pivot is done with COUNT(*) FILTER (WHERE step = 'X'). Once the conditional-count idiom clicks, multi-step funnels become five-line queries.
The funnel decisions every analyst SQL interview probes.
-
Ordering. Does the funnel require strict step order (signup → verify → activate, in sequence), or is it order-free (touched these N features at all)?
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_at)enforces order. -
Time bounding. "Activated within 7 days of signup" means joining the activation event with
event_at <= signup_at + INTERVAL '7 days'. Without a bound, conversion percentages drift up over time for old cohorts. -
De-duplication. A user who completes a step twice still counts once.
COUNT(DISTINCT user_id) FILTER (WHERE step = 'X')handles it. -
Drop-off math.
prev_step_count - current_step_countfor the absolute drop;1 - current / prevfor the percentage.
The funnel SQL recipe in three primitives.
-
FILTER (WHERE step = 'X')— the conditional-count pivot. -
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_at)— for strict-order funnels. -
INTERVAL '7 days'— the time-bound clause for "within N days" funnels.
Three failure modes interviewers love to catch.
-
Skip-step counting. Counting users who reached step 4 without verifying they also reached step 3. Use
EXISTSorINTERSECTto enforce. - Forgetting the time bound. Cohort A signed up 2 years ago, cohort B signed up yesterday — comparing their lifetime conversion is meaningless. Bound to "within N days of step 1".
-
Counting events instead of users.
COUNT(*)counts events;COUNT(DISTINCT user_id)counts people. The funnel chart shows people.
Worked example — 5-step product activation funnel in 6 lines
Detailed explanation. A product team wants the activation funnel: signup → email verified → profile completed → first action → activated (defined as 3+ actions in first 7 days). Five steps, four drop-off arrows.
Question. Given events(user_id, event_type, event_at), write a query that returns one row per funnel step with the user count and the step-to-step conversion percentage.
Input.
| user_id | event_type | event_at |
|---|---|---|
| 1 | signup | 2026-04-01 |
| 1 | email_verified | 2026-04-01 |
| 1 | profile_completed | 2026-04-02 |
| 1 | first_action | 2026-04-03 |
| 2 | signup | 2026-04-01 |
| 2 | email_verified | 2026-04-02 |
| 3 | signup | 2026-04-01 |
Code.
SELECT
COUNT(DISTINCT user_id) FILTER (WHERE event_type = 'signup') AS step_1_signup,
COUNT(DISTINCT user_id) FILTER (WHERE event_type = 'email_verified') AS step_2_verified,
COUNT(DISTINCT user_id) FILTER (WHERE event_type = 'profile_completed') AS step_3_profile,
COUNT(DISTINCT user_id) FILTER (WHERE event_type = 'first_action') AS step_4_action,
COUNT(DISTINCT user_id) FILTER (WHERE event_type = 'activated') AS step_5_activated
FROM events
WHERE event_at >= DATE '2026-04-01';
Step-by-step explanation.
-
One scan, five buckets. The query reads the events table once; each
FILTERclause carves out the subset that matches the step name;COUNT(DISTINCT user_id)de-duplicates within that subset. - Wide-shape output. One row, five columns — perfect for a horizontal funnel chart but awkward to compute drop-offs from. We pivot it to long-form in the next CTE.
-
The
WHERE event_at >= ...clause anchors the funnel to a measurement window — without it the funnel includes every event from the table's history, which makes month-over-month comparison impossible. -
COUNT(DISTINCT user_id) FILTER (...)is the workhorse — in older engines withoutFILTERyou writeCOUNT(DISTINCT CASE WHEN ... THEN user_id END)for the same effect. -
The implicit ordering assumption. This wide query counts every user who reached a step, not every user who progressed through all previous steps in order. The strict-order version uses
ROW_NUMBERorEXISTS— covered in the next worked example.
Output (wide form).
| step_1_signup | step_2_verified | step_3_profile | step_4_action | step_5_activated |
|---|---|---|---|---|
| 10,000 | 7,500 | 5,200 | 3,400 | 1,890 |
Now pivot it to long-form for the drop-off chart.
WITH wide AS (
SELECT
COUNT(DISTINCT user_id) FILTER (WHERE event_type='signup') AS s1,
COUNT(DISTINCT user_id) FILTER (WHERE event_type='email_verified') AS s2,
COUNT(DISTINCT user_id) FILTER (WHERE event_type='profile_completed') AS s3,
COUNT(DISTINCT user_id) FILTER (WHERE event_type='first_action') AS s4,
COUNT(DISTINCT user_id) FILTER (WHERE event_type='activated') AS s5
FROM events
WHERE event_at >= DATE '2026-04-01'
)
SELECT 1 AS step, 'signup' AS name, s1 AS users, NULL::float AS dropoff FROM wide
UNION ALL SELECT 2, 'email_verified', s2, 1 - 1.0*s2/s1 FROM wide
UNION ALL SELECT 3, 'profile_completed', s3, 1 - 1.0*s3/s2 FROM wide
UNION ALL SELECT 4, 'first_action', s4, 1 - 1.0*s4/s3 FROM wide
UNION ALL SELECT 5, 'activated', s5, 1 - 1.0*s5/s4 FROM wide
ORDER BY step;
Output (long form).
| step | name | users | dropoff |
|---|---|---|---|
| 1 | signup | 10,000 | — |
| 2 | email_verified | 7,500 | 0.25 |
| 3 | profile_completed | 5,200 | 0.31 |
| 4 | first_action | 3,400 | 0.35 |
| 5 | activated | 1,890 | 0.44 |
Rule of thumb. Wide form for the dashboard widget, long form for the drop-off chart. Compute both in the same query so the two views stay in sync.
Analytics SQL interview question on a strict-order funnel with a 7-day time bound
The follow-up probe is almost always: "Make the funnel strict — a user only counts at step N if they also reached every earlier step in order, within 7 days of signup." It tests whether the candidate reaches for ROW_NUMBER or EXISTS or INTERSECT, and whether they correctly time-bound the conversion window.
Solution Using a ROW_NUMBER ordered-events CTE + INNER JOIN per step
WITH ordered AS (
SELECT user_id,
event_type,
event_at,
MIN(event_at) OVER (PARTITION BY user_id) AS signup_at
FROM events
WHERE event_type IN ('signup','email_verified','profile_completed',
'first_action','activated')
),
within_7d AS (
SELECT *
FROM ordered
WHERE event_at <= signup_at + INTERVAL '7 days'
),
per_step AS (
SELECT user_id,
BOOL_OR(event_type='signup') AS hit_signup,
BOOL_OR(event_type='email_verified') AS hit_verified,
BOOL_OR(event_type='profile_completed') AS hit_profile,
BOOL_OR(event_type='first_action') AS hit_action,
BOOL_OR(event_type='activated') AS hit_activated
FROM within_7d
GROUP BY user_id
)
SELECT
COUNT(*) AS s1_signup,
COUNT(*) FILTER (WHERE hit_verified) AS s2_verified,
COUNT(*) FILTER (WHERE hit_verified
AND hit_profile) AS s3_profile,
COUNT(*) FILTER (WHERE hit_verified
AND hit_profile
AND hit_action) AS s4_action,
COUNT(*) FILTER (WHERE hit_verified
AND hit_profile
AND hit_action
AND hit_activated) AS s5_activated
FROM per_step
WHERE hit_signup;
Step-by-step trace.
| Stage | Operation | Output rows |
|---|---|---|
| 1 |
ordered CTE |
every relevant event with signup_at derived from the user's earliest event |
| 2 |
within_7d CTE |
filters out events more than 7 days after the user's signup |
| 3 |
per_step CTE |
one row per user with 5 boolean hit-flags |
| 4 | final SELECT |
FILTER (WHERE hit_X AND hit_Y AND …) enforces strict order — a user only counts at step 4 if all earlier steps were also hit |
| 5 | WHERE hit_signup |
a user without the signup event is not in the funnel at all |
Output:
| s1_signup | s2_verified | s3_profile | s4_action | s5_activated |
|---|---|---|---|---|
| 10,000 | 7,200 | 4,900 | 3,100 | 1,650 |
Strict-order numbers are lower than the wide funnel because users who reached step 4 without verifying email earlier are excluded.
Why this works — concept by concept:
-
Conditional counts (
FILTER) — the column-pivoting workhorse; one scan computes every step's count without subqueries orUNION ALLper step. -
Boolean roll-up per user (
BOOL_OR) — collapses many events per user to one row with five flags; the strict-order test becomes a chained AND. -
Time-bounding with
INTERVAL '7 days'— anchors every step to the user's signup, not to wall-clock time; cohort fairness is preserved. -
ROW_NUMBERalternative — when steps must be in strict chronological order (not just "reached in any sequence"),ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_at)with a step-name lookup is the alternative pattern. Use it when sequence violation matters (e.g. fraud funnels). -
Drop-off math in one place — once the long-form table is built,
1 - current/prevper row gives the drop-off without re-querying the events table. - Cost — O(events) for the scans plus O(users) for the per-user roll-up; safe up to billions of rows on a clustered warehouse.
SQL
Topic — user funnel analysis
User-funnel analysis problems (SQL)
Worked example — conditional aggregation for a feature-adoption funnel
Detailed explanation. Conditional aggregation — COUNT FILTER, SUM CASE WHEN, AVG FILTER — is the most flexible analyst-SQL idiom. It lets you pivot, segment, and bucket in one pass without dragging data into Python.
Question. From the same events table, compute the share of users who adopted each of three optional features (dashboard, api_keys, team_invite) in their first 30 days.
Code.
WITH first_30 AS (
SELECT e.user_id,
e.event_type,
e.event_at,
u.signup_at
FROM events e
JOIN users u USING (user_id)
WHERE e.event_at BETWEEN u.signup_at
AND u.signup_at + INTERVAL '30 days'
)
SELECT
COUNT(DISTINCT user_id) AS total_users,
COUNT(DISTINCT user_id) FILTER (WHERE event_type='dashboard_open') AS dashboard_users,
COUNT(DISTINCT user_id) FILTER (WHERE event_type='api_key_created') AS api_users,
COUNT(DISTINCT user_id) FILTER (WHERE event_type='team_invite_sent') AS invite_users,
ROUND(100.0 * COUNT(DISTINCT user_id) FILTER (WHERE event_type='dashboard_open')
/ COUNT(DISTINCT user_id), 1) AS pct_dashboard,
ROUND(100.0 * COUNT(DISTINCT user_id) FILTER (WHERE event_type='api_key_created')
/ COUNT(DISTINCT user_id), 1) AS pct_api,
ROUND(100.0 * COUNT(DISTINCT user_id) FILTER (WHERE event_type='team_invite_sent')
/ COUNT(DISTINCT user_id), 1) AS pct_invite
FROM first_30;
Step-by-step explanation.
-
first_30CTE restricts to events in the user's first 30 days — the cohort-fair adoption window. -
COUNT(DISTINCT user_id)(no FILTER) is the denominator — every user who has at least one event in their first 30 days. -
Each
FILTER (WHERE …)count is the numerator for one feature; users who touched it in the window. -
The percentage columns divide the two;
100.0 *forces float division;ROUND(... , 1)shapes the output for the BI tool. -
One scan, six metrics. Compare to writing three separate queries —
FILTERis dramatically faster and easier to keep in sync.
Output.
| total_users | dashboard_users | api_users | invite_users | pct_dashboard | pct_api | pct_invite |
|---|---|---|---|---|---|---|
| 14,500 | 9,860 | 3,335 | 1,885 | 68.0 | 23.0 | 13.0 |
Rule of thumb. Whenever the question is "what fraction of X did Y", reach for COUNT(...) FILTER (WHERE …) / COUNT(...). It collapses a half-page Python notebook into 8 lines of SQL.
SQL
Topic — conditional aggregation
Conditional aggregation problems (SQL)
Worked example — time-to-convert distribution with percentiles
Detailed explanation. Funnel charts answer "how many converted"; time-to-convert answers "how long did it take". The interview shape: compute the time between each user's signup and their activation event, then summarise with median and p90.
Question. Given events(user_id, event_type, event_at), compute the time-to-activation distribution (in hours) for users who eventually activated.
Input.
| user_id | event_type | event_at |
|---|---|---|
| 1 | signup | 2026-04-01 10:00:00 |
| 1 | activated | 2026-04-01 14:30:00 |
| 2 | signup | 2026-04-01 11:00:00 |
| 2 | activated | 2026-04-02 09:00:00 |
| 3 | signup | 2026-04-01 12:00:00 |
Code.
WITH first_signup AS (
SELECT user_id, MIN(event_at) AS signup_at
FROM events
WHERE event_type = 'signup'
GROUP BY user_id
),
first_activation AS (
SELECT user_id, MIN(event_at) AS activated_at
FROM events
WHERE event_type = 'activated'
GROUP BY user_id
),
joined AS (
SELECT s.user_id,
EXTRACT(EPOCH FROM (a.activated_at - s.signup_at)) / 3600 AS hours_to_activate
FROM first_signup s
JOIN first_activation a USING (user_id)
)
SELECT
COUNT(*) AS activated_users,
ROUND(AVG(hours_to_activate), 2) AS mean_hours,
ROUND(PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY hours_to_activate), 2) AS p50_hours,
ROUND(PERCENTILE_CONT(0.9)
WITHIN GROUP (ORDER BY hours_to_activate), 2) AS p90_hours
FROM joined;
Step-by-step explanation.
-
first_signupCTE anchors each user to their earliest signup timestamp. -
first_activationCTE anchors each user to their earliest activation. Users who never activated are dropped by the inner join. -
joinedCTE computes the time delta in hours.EXTRACT(EPOCH FROM ...) / 3600is the cross-dialect way to turn an interval into a number. -
PERCENTILE_CONT(0.5)is the median;PERCENTILE_CONT(0.9)is p90. These ordered-set aggregates are exactly what the interviewer is testing — they replace the old "mean is misleading" hand-wave. - What this surfaces. A mean of 6 hours with a p90 of 36 hours is a clear two-mode distribution — most users activate quickly, a long tail drag the chart.
Output.
| activated_users | mean_hours | p50_hours | p90_hours |
|---|---|---|---|
| 1,890 | 6.20 | 4.50 | 22.30 |
Rule of thumb. Whenever the question is "how long did it take?", report median and p90 alongside the mean — the distribution shape is the answer, not a single number.
4. Retention curves — day-N + weekly cohorts + decay
sql for data analyst rounds love retention curves — the shape tells the product story
The retention curve in one sentence: for each cohort and each day-N, what fraction of the cohort was active that day. Plot day-N on the x-axis, retention % on the y, one line per cohort, and you have the canonical product-market-fit chart.
The retention decisions every analyst SQL interview probes.
-
Day-N or rolling. Day-N retention = "active on exactly day N after signup". Rolling retention = "active any time in days
[N-window, N]". Day-N is simpler; rolling is what most dashboards actually show. - Calendar vs activity-anchored. Retention curves anchor to signup (cohort time), not calendar time — that is what makes the curve readable across cohorts.
- Sparse days. If a user is inactive on day 14 but active on day 13 and 15, the day-14 row is missing from the activity table. A LEFT JOIN against a calendar table fills the zero rows so the chart is honest.
- Benchmarks. D1/D7/D30 are the canonical reporting horizons. Industry benchmarks live there. Anything else is a story curve.
The retention SQL recipe in four primitives.
-
Anchor.
signup_dateper user (from the cohorts CTE). -
Day-N.
DATE_DIFF('day', signup_date, activity_date) AS day_n. -
Calendar fill.
LEFT JOIN calendarso D14 = 0 when there is no event. -
Retention.
COUNT(DISTINCT user_id) / cohort_size.
Three failure modes interviewers love to catch.
- Missing zero rows. A flat-looking curve that quietly skips inactive days is wrong; the user is not active on those days, so the count should be zero, not "not in the table".
- Calendar drift. Comparing day-N across cohorts without a calendar table can drift if the engine groups on natural-key rows only; LEFT JOIN ensures every (cohort, day-N) pair exists.
- D0 confusion. D0 = signup day. By definition, retention on D0 is 100% — if your query produces 92%, you have a clock or a UTC bug.
Worked example — 30-day retention curve across 3 cohorts
Detailed explanation. A growth team wants the 30-day retention curve for the last three monthly cohorts. The curve must show every day from D0 to D30 — including days where the cohort was completely inactive (rare but possible).
Question. Given users(user_id, signup_at) and events(user_id, event_at), return one row per (cohort_month, day_n) for day_n 0..30 with the retention percentage.
Input.
| user_id | signup_at | events |
|---|---|---|
| 101 | 2026-01-05 | 2026-01-05, 2026-01-06, 2026-01-12 |
| 102 | 2026-01-15 | 2026-01-15, 2026-01-22, 2026-02-04 |
| 201 | 2026-02-03 | 2026-02-03, 2026-02-05 |
Code.
WITH cohorts AS (
SELECT user_id,
DATE_TRUNC('month', signup_at) AS cohort_month,
DATE(signup_at) AS signup_date
FROM users
),
cohort_size AS (
SELECT cohort_month, COUNT(*) AS n_users
FROM cohorts
GROUP BY 1
),
day_grid AS ( -- 0..30 — the x-axis
SELECT generate_series(0, 30) AS day_n
),
matrix AS (
SELECT c.cohort_month, d.day_n
FROM (SELECT DISTINCT cohort_month FROM cohorts) c
CROSS JOIN day_grid d
),
active AS (
SELECT c.cohort_month,
DATE_DIFF('day', c.signup_date, DATE(e.event_at)) AS day_n,
COUNT(DISTINCT e.user_id) AS active_users
FROM events e
JOIN cohorts c USING (user_id)
WHERE e.event_at BETWEEN c.signup_date
AND c.signup_date + INTERVAL '30 days'
GROUP BY 1, 2
)
SELECT m.cohort_month,
m.day_n,
COALESCE(a.active_users, 0) AS active_users,
s.n_users,
ROUND(100.0 * COALESCE(a.active_users, 0) / s.n_users, 1) AS retention_pct
FROM matrix m
JOIN cohort_size s USING (cohort_month)
LEFT JOIN active a USING (cohort_month, day_n)
ORDER BY 1, 2;
Step-by-step explanation.
-
cohortsCTE assigns each user a month label and keeps the exact signup date — the day-N math needs the day, the cohort needs the month. -
cohort_sizeCTE is the denominator per cohort. -
day_grid+matrixgenerate every(cohort, day_n)pair from 0 to 30; this is the canonical x-axis. Without it, days where nobody returned would be silently absent. -
activeCTE counts distinct users active on each(cohort, day_n). TheWHEREclause time-bounds activity to the first 30 days per user. -
LEFT JOIN activefills the matrix — when there is no activity row for a(cohort, day_n),COALESCEsubstitutes 0; the chart shows an honest zero, not a missing point. - Final division yields the retention percentage in presentation form.
Output (sampled).
| cohort_month | day_n | active_users | n_users | retention_pct |
|---|---|---|---|---|
| 2026-01 | 0 | 12,000 | 12,000 | 100.0 |
| 2026-01 | 1 | 5,400 | 12,000 | 45.0 |
| 2026-01 | 7 | 3,360 | 12,000 | 28.0 |
| 2026-01 | 14 | 2,640 | 12,000 | 22.0 |
| 2026-01 | 30 | 2,040 | 12,000 | 17.0 |
| 2026-02 | 0 | 14,500 | 14,500 | 100.0 |
| 2026-02 | 1 | 5,510 | 14,500 | 38.0 |
| 2026-02 | 7 | 3,190 | 14,500 | 22.0 |
Rule of thumb. Generate the day-N grid explicitly; LEFT JOIN the activity counts onto it; COALESCE(active_users, 0). Without the grid the chart silently lies on inactive days.
Analytics SQL interview question on D1 / D7 / D30 with calendar gap-fill
A common probe in sql for data analyst rounds: "Compute D1, D7, D30 retention per cohort and explain why the calendar table is necessary." It is a clean test of cohort fairness, calendar handling, and the difference between missing data and zero data.
Solution Using LEFT JOIN calendar + COALESCE(active, 0) + targeted day-N picks
WITH cohorts AS (
SELECT user_id,
DATE_TRUNC('month', signup_at) AS cohort_month,
DATE(signup_at) AS signup_date
FROM users
),
cohort_size AS (
SELECT cohort_month, COUNT(*) AS n_users
FROM cohorts
GROUP BY 1
),
active AS (
SELECT c.cohort_month,
DATE_DIFF('day', c.signup_date, DATE(e.event_at)) AS day_n,
COUNT(DISTINCT e.user_id) AS active_users
FROM events e
JOIN cohorts c USING (user_id)
WHERE e.event_at BETWEEN c.signup_date
AND c.signup_date + INTERVAL '30 days'
GROUP BY 1, 2
),
calendar AS (
SELECT cohort_month, day_n
FROM (SELECT DISTINCT cohort_month FROM cohorts) c
CROSS JOIN (VALUES (1), (7), (30)) AS d(day_n)
)
SELECT cal.cohort_month,
cal.day_n,
s.n_users,
COALESCE(a.active_users, 0) AS active_users,
ROUND(100.0 * COALESCE(a.active_users,0) / s.n_users, 1) AS retention_pct
FROM calendar cal
JOIN cohort_size s USING (cohort_month)
LEFT JOIN active a USING (cohort_month, day_n)
ORDER BY cal.cohort_month, cal.day_n;
Step-by-step trace.
| Stage | Operation | Output |
|---|---|---|
| 1 |
cohorts CTE |
one row per user with cohort_month + signup_date
|
| 2 |
cohort_size CTE |
per-cohort denominator |
| 3 |
active CTE |
actual day-N counts (sparse — missing days have no row) |
| 4 |
calendar CTE |
dense (cohort, day_n) pairs for D1/D7/D30 only |
| 5 |
LEFT JOIN active + COALESCE
|
dense output — missing days resolve to 0 |
| 6 | divide for percentage | three rows per cohort: D1, D7, D30 |
Output:
| cohort_month | day_n | n_users | active_users | retention_pct |
|---|---|---|---|---|
| 2026-01-01 | 1 | 12,000 | 5,400 | 45.0 |
| 2026-01-01 | 7 | 12,000 | 3,360 | 28.0 |
| 2026-01-01 | 30 | 12,000 | 2,040 | 17.0 |
| 2026-02-01 | 1 | 14,500 | 5,510 | 38.0 |
| 2026-02-01 | 7 | 14,500 | 3,190 | 22.0 |
| 2026-02-01 | 30 | 14,500 | 1,595 | 11.0 |
Why this works — concept by concept:
-
Calendar table fills the gaps — without
LEFT JOIN, inactive days simply do not appear; the chart would skip them and the curve would look smoother than reality. The calendar is the spine that forces every day to exist. - COALESCE turns "missing" into "zero" — the analyst-honest substitution; absence of evidence becomes evidence of absence, which is correct for retention.
- DATE_DIFF for day-N — turns calendar dates into cohort time; the curve becomes comparable across cohorts.
-
Time-bound the activity —
event_at BETWEEN signup AND signup + 30 dayskeeps the scan small and cohort-fair. - Targeted picks (D1/D7/D30) — instead of a 31-row table per cohort, the calendar restricts to three industry-standard horizons; faster query, easier dashboard.
- Cost — O(users) for cohort assignment, O(events_in_30d) for the active scan, O(cohorts × 3) for the calendar; trivially small.
SQL
Topic — window functions
Window-function and retention problems (SQL)
Worked example — weekly cohort retention with DATE_DIFF
Detailed explanation. Weekly cohorts are the daily-driver granularity for product analytics — they capture seasonality without the noise of daily curves. The recipe is identical to monthly cohorts; only the truncation unit changes.
Question. Given a 12-week window, return weekly cohort retention for weeks W0..W6 (six weeks after signup).
Code.
WITH cohorts AS (
SELECT user_id,
DATE_TRUNC('week', signup_at) AS cohort_week
FROM users
),
cohort_size AS (
SELECT cohort_week, COUNT(*) AS n_users
FROM cohorts
GROUP BY 1
),
activity AS (
SELECT c.cohort_week,
DATE_DIFF('week', c.cohort_week,
DATE_TRUNC('week', e.event_at)) AS week_n,
COUNT(DISTINCT e.user_id) AS active_users
FROM events e
JOIN cohorts c USING (user_id)
WHERE e.event_at <= c.cohort_week + INTERVAL '6 weeks'
GROUP BY 1, 2
)
SELECT a.cohort_week,
a.week_n,
a.active_users,
s.n_users,
ROUND(100.0 * a.active_users / s.n_users, 1) AS retention_pct
FROM activity a
JOIN cohort_size s USING (cohort_week)
WHERE a.cohort_week >= CURRENT_DATE - INTERVAL '12 weeks'
ORDER BY 1, 2;
Step-by-step explanation.
- Truncate signup to week start; this is the cohort label.
- Truncate every activity event to its own week;
DATE_DIFFproduces the relativeweek_n. -
WHERE event_at <= cohort_week + INTERVAL '6 weeks'bounds the activity scan — every cohort gets exactly six weeks of observation, max. - Outer filter
cohort_week >= CURRENT_DATE - INTERVAL '12 weeks'restricts to the last twelve weekly cohorts — the chart's x-axis. - The final division and ROUND shape the output for the BI tool.
Output (sampled).
| cohort_week | week_n | active_users | n_users | retention_pct |
|---|---|---|---|---|
| 2026-03-09 | 0 | 1,200 | 1,200 | 100.0 |
| 2026-03-09 | 1 | 540 | 1,200 | 45.0 |
| 2026-03-09 | 4 | 264 | 1,200 | 22.0 |
| 2026-03-16 | 0 | 1,150 | 1,150 | 100.0 |
| 2026-03-16 | 1 | 540 | 1,150 | 47.0 |
Rule of thumb. Weekly retention curves are the granularity you want for product dashboards. Monthly is for board decks; daily is for incident analysis.
SQL
Topic — date arithmetic
Date arithmetic problems (SQL)
Worked example — N-day rolling retention with window-rank dedup
Detailed explanation. Rolling retention (active any time in days [N-window, N]) is more forgiving than day-N retention — it captures users who came back yesterday or today, not just exactly on day N. Most product dashboards prefer rolling 28-day retention over day-28 retention for that reason.
Question. Given events(user_id, event_at) and a known per-user signup_date, return one row per (user_id, day_n) for day_n in [7, 14, 28] indicating whether each user was active in the trailing 7-day window ending at that day.
Code.
WITH cohorts AS (
SELECT user_id, DATE(signup_at) AS signup_date
FROM users
),
windows AS (
SELECT user_id, signup_date, day_n
FROM cohorts CROSS JOIN (VALUES (7), (14), (28)) AS d(day_n)
),
active_in_window AS (
SELECT w.user_id,
w.day_n,
BOOL_OR(
DATE(e.event_at) BETWEEN
w.signup_date + (w.day_n - 7) AND
w.signup_date + w.day_n
) AS active_rolling_7d
FROM windows w
JOIN events e USING (user_id)
GROUP BY 1, 2
)
SELECT day_n,
COUNT(*) FILTER (WHERE active_rolling_7d) AS retained,
COUNT(*) AS cohort,
ROUND(100.0 * COUNT(*) FILTER (WHERE active_rolling_7d)
/ COUNT(*), 1) AS rolling_7d_pct
FROM active_in_window
GROUP BY day_n
ORDER BY day_n;
Step-by-step explanation.
-
cohortsCTE holds each user's anchor date. -
windowsCTE is the cross product — three windows per user (W7, W14, W28). Each window is a 7-day trailing interval. -
active_in_windowjoins events;BOOL_OR(... BETWEEN ...)returns TRUE if the user had any event in the window. This is the rolling-retention flag. - Final SELECT counts retained users per window and computes the percentage.
- Why rolling beats day-N for dashboards. A user active on days 6 and 8 but not exactly on day 7 appears as "not retained" in day-N retention but "retained" in rolling-7-day. The rolling chart is smoother and more accurate to engagement.
Output.
| day_n | retained | cohort | rolling_7d_pct |
|---|---|---|---|
| 7 | 5,640 | 12,000 | 47.0 |
| 14 | 4,200 | 12,000 | 35.0 |
| 28 | 2,880 | 12,000 | 24.0 |
Rule of thumb. Use day-N for the interview (cleaner math); use rolling-7-day for the dashboard (more honest signal). Both belong in the analyst's toolkit.
5. Business-metric SQL — LTV, ARPU, MAU/DAU, stickiness
sql for data analytics ends in growth metrics — LTV, ARPU, MAU/DAU, stickiness
The growth-metric bundle in one sentence: every business metric a PM dashboards is one of two shapes — a rolling time series (DAU, MAU, weekly active) or a per-cohort sum (LTV, ARPU, gross revenue). Window functions handle the first; cohort-anchored sums handle the second.
The growth metric ladder.
- DAU — Daily Active Users. Distinct users with any event today.
- WAU — Weekly Active Users. Distinct users with any event in the last 7 days.
- MAU — Monthly Active Users. Distinct users with any event in the last 30 days.
-
Stickiness — DAU / MAU. Ratio in
[0,1]; ~0.2 is normal SaaS, 0.5+ is messaging-app territory. -
ARPU — Average Revenue Per User.
SUM(revenue) / COUNT(DISTINCT user)for the period. -
LTV — Lifetime Value. Sum of revenue per user over their entire lifetime, or
ARPU × avg_lifetime_months. -
CAC payback.
CAC / ARPU(months); LTV/CAC > 3 is the SaaS rule of thumb.
The SQL primitives.
-
SUM() OVER (PARTITION BY user_id ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)— running totals. -
COUNT(DISTINCT user_id) OVER (... ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)— 30-day rolling DAU. (Note: many engines disallowDISTINCTinside window — fall back to a CTE-and-join approach.) -
LAG(metric) OVER (ORDER BY date)— period-over-period growth. -
DATE_TRUNC+ GROUP BY — cohort-month sums.
Three failure modes interviewers test for.
- "MAU" is not "sum of DAU". A user active 30 days in a row counts once in MAU and 30 times in summed DAU. Use distinct on the window.
- Stickiness with mismatched windows. DAU divided by yesterday's MAU is wrong; align both to the same end date.
- LTV without cohort. LTV is a cohort metric — a single average across all users mixes 5-year customers with 5-day signups. Compute per cohort, then average if needed.
Worked example — 30-day rolling MAU + LAG growth
Detailed explanation. Build a daily time series of 30-day rolling MAU and the day-over-day growth percentage. The MAU rollup uses a self-join (since COUNT(DISTINCT) is not allowed inside most engines' window frames); the growth uses LAG.
Question. Given events(user_id, event_at), return one row per day with mau_30d (distinct users active in the last 30 days) and mau_growth_pct (vs the previous day).
Code.
WITH daily_active AS (
SELECT DATE(event_at) AS d,
user_id
FROM events
GROUP BY 1, 2
),
days AS (
SELECT DISTINCT d FROM daily_active
),
mau AS (
SELECT x.d,
COUNT(DISTINCT da.user_id) AS mau_30d
FROM days x
JOIN daily_active da
ON da.d BETWEEN x.d - INTERVAL '29 days' AND x.d
GROUP BY x.d
),
with_growth AS (
SELECT d,
mau_30d,
LAG(mau_30d) OVER (ORDER BY d) AS prev_mau,
ROUND(100.0 * (mau_30d - LAG(mau_30d) OVER (ORDER BY d))
/ NULLIF(LAG(mau_30d) OVER (ORDER BY d), 0), 2)
AS mau_growth_pct
FROM mau
)
SELECT *
FROM with_growth
ORDER BY d;
Step-by-step explanation.
-
daily_activededupes events to one row per(user, day)— collapses many events to one active-flag per day. -
daysis the time spine — every day where any user was active. -
mauself-joins the daily-active table to the time spine; for each spine dayx.d, count distinct users active in the trailing 30 days. This is the rolling MAU. -
with_growthusesLAGto fetch the previous day's MAU;NULLIF(... , 0)guards the very first day so the division doesn't crash. - The output is a one-row-per-day time series — perfect for the rolling-MAU line chart.
Output (sampled).
| d | mau_30d | prev_mau | mau_growth_pct |
|---|---|---|---|
| 2026-05-01 | 80,000 | NULL | NULL |
| 2026-05-02 | 80,420 | 80,000 | 0.53 |
| 2026-05-07 | 82,150 | 81,930 | 0.27 |
| 2026-05-15 | 90,300 | 89,610 | 0.77 |
| 2026-05-30 | 120,000 | 118,800 | 1.01 |
Rule of thumb. Build the time spine once, self-join the activity onto it with a 30-day window, then layer LAG on top. The same shape handles WAU, DAU, and quarterly metrics — only the window width changes.
Analytics SQL interview question on DAU/MAU stickiness with windowed counts
A senior probe: "Compute the 30-day rolling MAU, the same-day DAU, and the DAU/MAU stickiness ratio in a single query." Tests whether the candidate keeps the windows aligned to the same end date and whether they reach for a self-join rather than COUNT DISTINCT OVER.
Solution Using daily_active spine + dual self-joins for DAU and MAU
WITH daily_active AS (
SELECT DATE(event_at) AS d, user_id
FROM events
GROUP BY 1, 2
),
days AS (
SELECT DISTINCT d FROM daily_active
),
dau AS (
SELECT d, COUNT(DISTINCT user_id) AS dau
FROM daily_active
GROUP BY 1
),
mau AS (
SELECT x.d,
COUNT(DISTINCT da.user_id) AS mau_30d
FROM days x
JOIN daily_active da
ON da.d BETWEEN x.d - INTERVAL '29 days' AND x.d
GROUP BY x.d
)
SELECT d.d,
d.dau,
m.mau_30d,
ROUND(1.0 * d.dau / NULLIF(m.mau_30d, 0), 3) AS stickiness
FROM dau d
JOIN mau m USING (d)
ORDER BY d.d;
Step-by-step trace.
| Stage | Operation | Output |
|---|---|---|
| 1 | daily_active |
one row per (user, day)
|
| 2 | days |
the time spine |
| 3 | dau |
one row per day with same-day distinct user count |
| 4 | mau |
one row per day with trailing-30-day distinct user count |
| 5 | final join | one row per day with dau, mau_30d, stickiness
|
| 6 | NULLIF |
guards against division-by-zero on empty days |
Output:
| d | dau | mau_30d | stickiness |
|---|---|---|---|
| 2026-05-01 | 15,200 | 80,000 | 0.190 |
| 2026-05-02 | 16,050 | 80,420 | 0.200 |
| 2026-05-07 | 16,800 | 82,150 | 0.205 |
| 2026-05-15 | 19,500 | 90,300 | 0.216 |
| 2026-05-30 | 30,000 | 120,000 | 0.250 |
Why this works — concept by concept:
-
Daily-active spine — collapsing events to one row per
(user, day)is the cheapest representation for both DAU and MAU; every downstream query touches a much smaller table. -
Self-join with
BETWEEN x.d - INTERVAL '29 days' AND x.d— produces a 30-day trailing window per spine day without usingCOUNT DISTINCTinside a window frame (which most engines disallow). -
DAU/MAU stickiness as a ratio — both numerator and denominator share the same end date
d; that alignment is what makes the ratio meaningful. Misaligned windows are the most common stickiness bug in production. -
NULLIF(mau, 0)— guards the ratio against zero MAU (rare, but possible on first-day cohorts); without it the query throws. -
One spine, many metrics — once the spine is built, ARPU, gross revenue, and retention can be layered as additional joins; one scan of
events, many shapes of output. - Cost — O(events) for the spine, O(days × users_in_30d) for the MAU self-join — for typical product scales (≤100k DAU, ≤1M MAU) this finishes in seconds on a clustered warehouse.
SQL
Topic — real-time analytics
Real-time analytics problems (SQL)
Worked example — ARPU per cohort-month and LTV in 10 lines
Detailed explanation. ARPU and LTV are cohort metrics, not all-user averages. Compute them per cohort, then average if needed. The recipe: anchor users to a cohort, sum lifetime revenue per user, average per cohort.
Question. Given users(user_id, signup_at) and orders(user_id, amount, order_at), return one row per cohort-month with cohort_size, total_revenue, arpu (ARPU as of today), and ltv_12mo_est (estimated 12-month LTV).
Code.
WITH cohorts AS (
SELECT user_id,
DATE_TRUNC('month', signup_at) AS cohort_month
FROM users
),
per_user AS (
SELECT c.cohort_month,
c.user_id,
COALESCE(SUM(o.amount), 0) AS lifetime_revenue,
MAX(o.order_at) AS last_order_at
FROM cohorts c
LEFT JOIN orders o USING (user_id)
GROUP BY 1, 2
)
SELECT cohort_month,
COUNT(*) AS cohort_size,
SUM(lifetime_revenue) AS total_revenue,
ROUND(AVG(lifetime_revenue), 2) AS arpu,
ROUND(AVG(lifetime_revenue) *
12.0 / GREATEST(1, DATE_DIFF('month',
MIN(cohort_month), CURRENT_DATE)),
2) AS ltv_12mo_est
FROM per_user
GROUP BY 1
ORDER BY 1;
Step-by-step explanation.
-
cohortsCTE assigns each user a stable cohort label. -
per_userCTE computes lifetime revenue per user, anchored to the cohort.LEFT JOINkeeps users with zero orders (they belong in the cohort denominator). -
The final SELECT groups by cohort and computes the four metrics —
cohort_sizeis the headcount,total_revenueis the gross,arpuis the average per user,ltv_12mo_estextrapolates that ARPU out to 12 months. -
The extrapolation math — divide observed ARPU by months observed, multiply by 12.
GREATEST(1, ...)guards new cohorts that have only existed for fractional months. - Why per cohort? A six-month-old cohort and a one-day-old cohort have wildly different ARPU horizons; mixing them gives a meaningless average.
Output.
| cohort_month | cohort_size | total_revenue | arpu | ltv_12mo_est |
|---|---|---|---|---|
| 2026-01 | 12,000 | 540,000 | 45.00 | 540.00 |
| 2026-02 | 14,500 | 542,500 | 37.41 | 449.00 |
| 2026-03 | 13,200 | 396,000 | 30.00 | 360.00 |
| 2026-04 | 15,800 | 316,000 | 20.00 | 240.00 |
Rule of thumb. Always express ARPU and LTV per cohort. The 12-month extrapolation is a useful fiction — flag it as an estimate in the column name (ltv_12mo_est, not ltv) so consumers know the math.
SQL
Topic — data aggregation
Data aggregation problems (SQL)
Worked example — trailing 7-day revenue with SUM OVER
Detailed explanation. Window functions shine on rolling metrics — trailing-N revenue, moving averages, period growth. SUM(amount) OVER (ORDER BY d ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) is the canonical trailing-7-day sum.
Question. Given a daily revenue series, return the daily revenue, the trailing-7-day total, and the day-over-day growth percentage.
Code.
WITH daily_rev AS (
SELECT DATE(order_at) AS d, SUM(amount) AS rev
FROM orders
GROUP BY 1
)
SELECT d,
rev,
SUM(rev) OVER (ORDER BY d
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS trailing_7d_rev,
ROUND(100.0 * (rev - LAG(rev) OVER (ORDER BY d))
/ NULLIF(LAG(rev) OVER (ORDER BY d), 0), 2) AS rev_growth_pct
FROM daily_rev
ORDER BY d;
Step-by-step explanation.
-
daily_revCTE collapses orders to one row per day with summed revenue. -
SUM(...) OVER (... ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)— the window frame is "last 7 rows including current"; for daily data that is the trailing-7-day total. -
LAG(rev)fetches the previous day's revenue; subtract, divide, round for the growth percentage. -
NULLIF(... , 0)guards against zero-revenue days (rare but possible). - One scan, three metrics — daily, trailing-7-day, and growth percentage in a single query.
Output (sampled).
| d | rev | trailing_7d_rev | rev_growth_pct |
|---|---|---|---|
| 2026-05-01 | 12,300 | 12,300 | NULL |
| 2026-05-02 | 13,800 | 26,100 | 12.20 |
| 2026-05-07 | 15,400 | 95,200 | 4.05 |
| 2026-05-08 | 14,900 | 97,800 | -3.25 |
| 2026-05-15 | 17,200 | 110,500 | 2.99 |
Rule of thumb. SUM OVER ROWS BETWEEN N PRECEDING AND CURRENT ROW is the trailing-window primitive; pair it with LAG for growth and LEAD for forward-looking spreads. These three operators cover 80% of growth-team dashboard SQL.
SQL
Topic — time series
Time-series problems (SQL)
Worked example — LTV/CAC ratio and payback in pure SQL
Detailed explanation. LTV and CAC are the two halves of the unit-economics question. The classic ratio — LTV / CAC — tells you whether you make more money from a user than you spent acquiring them; the payback period — CAC / monthly_ARPU — tells you how long it takes to break even. Both are analyst SQL queries on top of the cohort frame.
Question. Given cohort_acquisition_cost(cohort_month, total_cac) and the per-cohort revenue stream from earlier, return one row per cohort with cac_per_user, arpu_monthly, payback_months, and ltv_cac_ratio_12mo.
Code.
WITH cohorts AS (
SELECT user_id, DATE_TRUNC('month', signup_at) AS cohort_month
FROM users
),
cohort_size AS (
SELECT cohort_month, COUNT(*) AS n_users
FROM cohorts
GROUP BY 1
),
revenue AS (
SELECT c.cohort_month, SUM(o.amount) AS total_revenue
FROM orders o
JOIN cohorts c USING (user_id)
GROUP BY 1
),
arpu AS (
SELECT cohort_month,
total_revenue / GREATEST(1, DATE_DIFF('month', cohort_month, CURRENT_DATE))
/ NULLIF((SELECT n_users FROM cohort_size cs
WHERE cs.cohort_month = revenue.cohort_month), 0) AS arpu_monthly
FROM revenue
)
SELECT s.cohort_month,
s.n_users,
c.total_cac,
ROUND(c.total_cac / s.n_users, 2) AS cac_per_user,
ROUND(a.arpu_monthly, 2) AS arpu_monthly,
ROUND((c.total_cac / s.n_users) / NULLIF(a.arpu_monthly,0), 2)
AS payback_months,
ROUND(a.arpu_monthly * 12 / NULLIF(c.total_cac / s.n_users, 0), 2)
AS ltv_cac_ratio_12mo
FROM cohort_size s
JOIN cohort_acquisition_cost c USING (cohort_month)
JOIN arpu a USING (cohort_month)
ORDER BY s.cohort_month;
Step-by-step explanation.
-
cohorts,cohort_size,revenue— the standard cohort skeleton. -
arpuCTE — total revenue divided by(months observed × cohort size). TheGREATEST(1, ...)guards new cohorts. -
cac_per_user— total acquisition spend divided by cohort headcount. -
payback_months—cac_per_user / arpu_monthly. How many months of average revenue per user to break even on the CAC. -
ltv_cac_ratio_12mo— projects ARPU out twelve months and divides by CAC. Industry rule of thumb: > 3x is healthy, < 1x is unsustainable.
Output.
| cohort_month | n_users | total_cac | cac_per_user | arpu_monthly | payback_months | ltv_cac_ratio_12mo |
|---|---|---|---|---|---|---|
| 2026-01 | 12,000 | 1,500,000 | 125.00 | 45.00 | 2.78 | 4.32 |
| 2026-02 | 14,500 | 1,950,000 | 134.48 | 37.41 | 3.59 | 3.34 |
| 2026-03 | 13,200 | 1,650,000 | 125.00 | 30.00 | 4.17 | 2.88 |
Rule of thumb. Always present LTV alongside CAC alongside the ratio. A 4x LTV/CAC is great on paper but irrelevant if payback is 18 months and the business is cash-constrained.
SQL
Topic — aggregation
Cohort revenue + LTV aggregation drills (SQL)
Worked example — top-N power users per cohort with ROW_NUMBER
Detailed explanation. Senior analysts get pulled into "who are the whales" questions constantly. The pattern: rank users by revenue within each cohort, then pick the top N. ROW_NUMBER() OVER (PARTITION BY cohort ORDER BY revenue DESC) is the canonical idiom — works in every modern dialect.
Question. Given users(user_id, signup_at) and orders(user_id, amount), return the top-5 highest-spending users for each monthly cohort.
Code.
WITH cohorts AS (
SELECT user_id, DATE_TRUNC('month', signup_at) AS cohort_month
FROM users
),
per_user AS (
SELECT c.cohort_month,
c.user_id,
COALESCE(SUM(o.amount), 0) AS lifetime_revenue
FROM cohorts c
LEFT JOIN orders o USING (user_id)
GROUP BY 1, 2
),
ranked AS (
SELECT cohort_month,
user_id,
lifetime_revenue,
ROW_NUMBER() OVER (PARTITION BY cohort_month
ORDER BY lifetime_revenue DESC) AS rn
FROM per_user
)
SELECT cohort_month, user_id, lifetime_revenue
FROM ranked
WHERE rn <= 5
ORDER BY cohort_month, rn;
Step-by-step explanation.
-
cohortsCTE assigns each user a cohort label. -
per_userCTE sums lifetime revenue per user; LEFT JOIN keeps zero-revenue users (they'll get rn > 5 and drop out). -
rankedCTE assigns a row number per(cohort_month, lifetime_revenue DESC)— the highest-revenue user in each cohort gets rn=1. -
Final SELECT picks only
rn <= 5. The outer ORDER BY produces a tidy report ordered by cohort and rank. -
ROW_NUMBERvsRANKvsDENSE_RANK. UseROW_NUMBERwhen you want exactly N rows per group regardless of ties;RANKif ties should share a position (and the top-5 might have 7 rows);DENSE_RANKif ties share without gaps. For "top-N reporting" the safe default isROW_NUMBER.
Output.
| cohort_month | user_id | lifetime_revenue |
|---|---|---|
| 2026-01 | 7,402 | 4,580.00 |
| 2026-01 | 3,011 | 3,940.00 |
| 2026-01 | 8,219 | 3,420.00 |
| 2026-01 | 1,506 | 3,200.00 |
| 2026-01 | 9,033 | 2,990.00 |
| 2026-02 | 12,008 | 5,210.00 |
| 2026-02 | 14,221 | 4,800.00 |
Rule of thumb. Top-N-per-group = ROW_NUMBER() OVER (PARTITION BY g ORDER BY m DESC) <= N in an outer filter. Memorise it — it is the single most-asked analyst SQL pattern in interviews.
SQL
Topic — ranking
Top-N ranking problems (SQL)
Cheat sheet — analyst SQL idioms
-
Top-N per group.
ROW_NUMBER() OVER (PARTITION BY group_id ORDER BY metric DESC) <= Nin an outer WHERE. -
Cohort matrix. Three CTEs —
cohorts,cohort_size,activity— then a final join;DATE_TRUNCfor the bucket,DATE_DIFFfor the relative period. -
Funnel drop-off.
COUNT(*) FILTER (WHERE step = 'X')per step in one wide row; UNION ALL pivot to long form;1 - current/prevfor drop-off percentage. -
Day-N retention. Per-user
signup_date;DATE_DIFFfor day-N; LEFT JOIN a(cohort, day_n)calendar withCOALESCE(active, 0)to fill the gaps. -
Stickiness (DAU/MAU). Daily-active spine; one self-join for trailing-30-day MAU; one straight GROUP BY for DAU; divide, with
NULLIFfor safety. -
LTV per cohort. Anchor users to cohort;
LEFT JOIN orders;SUM(amount)per user; AVG per cohort; flag the time-bounded estimate in the column name (ltv_12mo_est). -
Rolling N-day metrics.
SUM(metric) OVER (ORDER BY d ROWS BETWEEN (N-1) PRECEDING AND CURRENT ROW)— the trailing-window primitive. -
Period-over-period growth.
LAG(metric) OVER (ORDER BY d)in the numerator;NULLIF(LAG(...), 0)in the denominator. -
Conditional aggregation.
COUNT/SUM/AVG(...) FILTER (WHERE …)is faster, cleaner, and easier to maintain than chainedCASE WHENaggregates. -
Triangular cohort cells. Always check
cohort_start + period_n <= CURRENT_DATEin the final SELECT; absence is honest, zero is misleading.
Frequently asked questions
Cohort analysis vs funnel analysis — which one should I reach for?
Use a cohort analysis when you want to understand long-term behaviour over time — retention curves, lifetime value, "how do January signups compare to February signups across the next six months". Use a funnel analysis when you want to understand a sequence of actions inside a fixed window — signup → verify → activate, within seven days. Cohorts ask "how is this group doing over time"; funnels ask "where do users drop off in a flow". A well-designed dashboard usually carries both — cohorts on the strategic tab, funnels on the activation tab.
DAU vs MAU vs WAU — which is the right metric for my product?
Choose the cadence that matches your expected user behaviour. Messaging and games are daily-use products — DAU and DAU/MAU stickiness are the right metrics. Most SaaS is weekly — WAU plus WAU/MAU stickiness. Tax software, insurance portals, payroll — monthly is the natural cadence; MAU and DAU/MAU ratios at 5% are normal and not alarming. The rule is: if the metric drops to zero on a Saturday, you picked the wrong cadence.
Retention vs churn — are they the same thing?
They are mirror images, not the same metric. Retention is the fraction of a cohort still active in period N (active_users / cohort_size); it ranges from 100% at D0 and drops over time. Churn is 1 - retention for a given window — the fraction lost in that window. SaaS teams talk about "monthly churn" — usually "of users active last month, what fraction were not active this month". Analytics SQL handles both shapes from the same cohort frame; pick whichever your audience reads more naturally.
How do I do cohort analysis in pure SQL without pulling into Python?
Three CTEs and a final join. CTE 1 — cohorts — anchors every user to a DATE_TRUNC bucket of their first event. CTE 2 — cohort_size — counts users per cohort (the denominator). CTE 3 — activity — joins cohort labels onto all events and groups by (cohort, period_n) where period_n = DATE_DIFF(activity_period, cohort_period). Final SELECT joins CTE 3 to CTE 2 and divides for the retention percentage. That single recipe handles monthly, weekly, daily, revenue, and feature-adoption cohorts — only the bucket unit changes.
Can SQL replace a BI tool for analytics work?
For the computation, yes — almost every metric a BI tool draws is a SQL query underneath. For the interaction (slicing dimensions, drilling down, scheduling, version-controlling, sharing dashboards), no. The healthy pattern is: SQL produces the tidy long-form tables (one row per (cohort, period), one row per (date, metric)), and the BI tool — Looker, Mode, Metabase, Tableau, Hex — handles pivot, drill, and presentation. Treat SQL as the engine, BI as the dashboard chrome.
LTV in SQL — do I need a calendar table?
Not strictly for LTV itself, because LTV is summed over a user's entire history — it does not depend on the calendar shape. But if you want time-bounded LTV ("LTV after 90 days", "LTV in first 12 months"), then yes, you anchor each user to their cohort date and time-bound the orders join with order_at <= cohort_date + INTERVAL '90 days'. The calendar table starts mattering when you want a per-period series (LTV at month-1, month-2, month-3) — there you LEFT JOIN a generated month-grid so cohorts with no orders in a given month still appear with 0.
Practice on PipeCode
- Drill the aggregation practice library → for the SUM / COUNT / GROUP BY muscles that power every cohort query.
- Rehearse window-function problems → for ROW_NUMBER, LAG, LEAD, and trailing-N-day aggregates.
- Sharpen cohort-analysis problems → when the interviewer wants a full DATE_TRUNC + DATE_DIFF cohort matrix.
- Tackle user-funnel analysis problems → for FILTER-based step pivots and strict-order conversion.
- Lock in date-function problems → so DATE_TRUNC, DATE_DIFF, and EXTRACT are reflexive.
- Layer on conditional-aggregation problems → for the COUNT(...) FILTER pivots.
- For the broader interview surface, read top data engineering interview questions →.
- Build the core SQL fluency with SQL for data engineering interviews — from zero to FAANG →.
- Add product-metric depth with product sense and metrics for data engineering interviews →.
Pipecode.ai is Leetcode for Data Engineering — every SQL analytics pattern above ships with hands-on practice rooms where you shape real cohort matrices, debug real funnels, and ship real LTV queries against real datasets. Start with aggregation and window functions; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.





Top comments (0)