Gowtham Potureddi

Posted on Jun 4

SQL for Data Analytics & Data Analysts: Cohorts, Funnels, Retention

#python #sql #dataengineering #interview

sql for data analytics is its own dialect. The keywords feel familiar — SELECT, GROUP BY, JOIN — but the shape of the queries is different. Analyst SQL lives in the world of cohorts (DATE_TRUNC, relative periods), funnels (FILTER, conditional counts), retention curves (day-N, weekly buckets, calendar joins) and growth metrics (LTV, ARPU, DAU/MAU stickiness). If you can write a fast aggregate and a clean window function, the job becomes shaping data into the exact frame a product manager or growth team can read off a dashboard.

This guide walks the five SQL muscles that show up in every analytics-engineer and senior analyst loop. We open with the why, then go deep on cohorts, funnels, retention curves, and the LTV / ARPU / MAU / DAU bundle. Every section pairs a teaching block with a worked interview answer — code, step-by-step trace, output, then a concept-by-concept why-this-works. By the end you will be able to defend a 12-line cohort matrix, trace a 5-step funnel, draw a 30-day retention curve, and compute DAU/MAU stickiness from a single user_activity table — the exact moves sql for data analyst rounds reward at FAANG, growth-stage SaaS, and analytics-heavy shops.

When you want hands-on reps immediately after reading, browse aggregation problems →, drill window-function problems →, and rehearse on the SQL practice library →.

On this page

Why analytics SQL is its own discipline
Cohort analysis — DATE_TRUNC + first-event signup buckets
Funnel analysis — multi-step conversion + drop-off
Retention curves — day-N + weekly cohorts + decay
Business-metric SQL — LTV, ARPU, MAU/DAU, stickiness
Cheat sheet — analyst SQL idioms
Frequently asked questions
Practice on PipeCode

1. Why analytics SQL is its own discipline

`sql data analytics` is built from four shape primitives — cohort, funnel, retention, LTV

The one-sentence mental model: sql data analysis is the discipline of taking a raw event table and reshaping it into one of four canonical frames — a cohort matrix, a funnel waterfall, a retention curve, or a growth-metric time series. Every dashboard a PM looks at is a derivative of one of those four shapes, which is why interviewers probe them by name.

Why analyst SQL deserves its own playbook.

It is aggregation-first, not row-first. Application SQL fetches single rows by primary key. Analytics SQL almost never does — every query touches a window of time and produces a roll-up.
It lives on event tables. events, user_activity, orders, pageviews, sessions. Wide, append-only, time-ordered.
It uses a tight family of functions. DATE_TRUNC, DATE_DIFF, EXTRACT, SUM, COUNT DISTINCT, FILTER (WHERE …), ROW_NUMBER, LAG, LEAD, SUM() OVER.
It is read in shapes, not rows. A "good" query produces a tidy frame — one row per (cohort, period), or one row per (funnel_step, user_count) — that a BI tool can pivot.

The four canonical analyst SQL frames.

Cohort matrix. Rows = signup month; columns = months-since-signup; cells = retention %.
Funnel waterfall. Rows = ordered steps; columns = count, drop-off, cumulative %.
Retention curve. X = days-since-signup; Y = % of cohort still active; one line per cohort.
Time-series + ratio. MAU, DAU, stickiness, ARPU, LTV — one row per period.

Why this is the bridge into senior analytics-engineer roles.

A junior analyst writes a GROUP BY signup_date and stops.
A senior analyst writes the same query, then asks "is this cohort-fair across months?", catches the diagonal-greying problem, joins a calendar table, and ships a frame the PM can paste straight into a deck.
An analytics engineer wraps that frame in a dbt model with tests on cohort_size > 0 and retention_pct <= 1.0, and the lineage is auditable forever.

sql for data analysis — the four-axis trade-off you make on every query.

Granularity. Day vs week vs month — too fine and the chart is noisy, too coarse and the trend lags by a sprint.
Anchoring. Calendar time vs cohort-relative time — a calendar chart is a status report, a cohort chart is a learning instrument.
Counting basis. Active users vs revenue vs events — the same shape question has very different answers depending on which you pick.
Comparability. Same window vs same-N-days — a 30-day window picked from January is not the same as 30 days from any cohort start; analyst SQL has to enforce one or the other.

What interviewers listen for in a sql for data analyst round.

Do you reach for DATE_TRUNC the moment they say "cohort"? — required signal.
Do you separate calendar time from cohort-relative time in the query? — senior signal.
Do you use COUNT(DISTINCT user_id) FILTER (WHERE …) for step counts instead of pivoting in Python? — senior signal.
Do you mention a calendar / dates table for retention curves with sparse activity? — senior signal.
Do you compute DAU/MAU stickiness as a ratio of two rolling windows, not as two unrelated metrics? — required signal.

Worked example — reshape one event table into a cohort frame in 8 lines

Detailed explanation. Realistic analyst SQL starts with a single event table — user_activity(user_id, event_date, event_type) — and a question — "how many users from the January cohort were still active in February?". The answer is one CTE that anchors each user to their first event, joined back to all of their later events.

Question. Given user_activity(user_id, event_date), write a query that returns the number of users from each monthly signup cohort who were active in each subsequent month.

Input.

user_id	event_date
1	2026-01-03
1	2026-02-11
1	2026-03-04
2	2026-01-22
3	2026-02-02
3	2026-03-30

Code.

WITH cohorts AS (
  SELECT user_id,
         DATE_TRUNC('month', MIN(event_date)) AS cohort_month
  FROM user_activity
  GROUP BY user_id
)
SELECT  c.cohort_month,
        DATE_TRUNC('month', a.event_date) AS activity_month,
        COUNT(DISTINCT a.user_id)         AS active_users
FROM    cohorts c
JOIN    user_activity a USING (user_id)
GROUP BY 1, 2
ORDER BY 1, 2;

Step-by-step explanation.

cohorts CTE. For each user_id, take the minimum event_date and DATE_TRUNC it to a month. That label — "the user's signup month" — never changes once assigned.
Join back to activity. Every event row now carries the user's cohort and its own activity month. One row per event, dressed with both keys.
GROUP BY cohort_month, activity_month. Collapse to one row per (cohort, period) pair, count distinct users — that is your cohort × period grid.
ORDER BY cohort, activity_month. The output is the long-form (tidy) version of the cohort matrix; any BI tool can pivot it into the heatmap on the right.
Why this shape? Long-form is easier to test, store, and join with calendar tables. Pivot only at the presentation layer.

Output.

cohort_month	activity_month	active_users
2026-01	2026-01	2
2026-01	2026-02	1
2026-01	2026-03	1
2026-02	2026-02	1
2026-02	2026-03	1

Rule of thumb. Anchor users to their first event in a CTE; join back to all of their later events; group by (cohort, period). That single recipe powers cohort, funnel, and retention queries — only the second key and the count basis change.

Analytics SQL interview question on reshaping events into a long cohort frame

The probe usually sounds like: "Given an event table, return one row per signup-month × activity-month with the active-user count. Don't pivot in SQL — keep it tidy." It tests whether the candidate separates cohort time from calendar time and whether they reach for DATE_TRUNC reflexively.

Solution Using `DATE_TRUNC` + first-event CTE + grouped count

WITH cohorts AS (
  SELECT user_id,
         DATE_TRUNC('month', MIN(event_date)) AS cohort_month
  FROM   user_activity
  GROUP  BY user_id
),
activity AS (
  SELECT a.user_id,
         c.cohort_month,
         DATE_TRUNC('month', a.event_date) AS activity_month
  FROM   user_activity a
  JOIN   cohorts c USING (user_id)
)
SELECT  cohort_month,
        activity_month,
        COUNT(DISTINCT user_id) AS active_users
FROM    activity
GROUP   BY 1, 2
ORDER   BY 1, 2;

Step-by-step trace.

Stage	Operation	Output rows
1	`cohorts` CTE — first event per user	3 rows: (u1,2026-01), (u2,2026-01), (u3,2026-02)
2	`activity` CTE — join cohort onto every event	6 rows, each carrying both `cohort_month` and `activity_month`
3	`GROUP BY cohort_month, activity_month`	collapses to 5 distinct `(cohort, period)` pairs
4	`COUNT(DISTINCT user_id)`	de-duplicates within each cell — one user active twice in March still counts as one
5	`ORDER BY 1, 2`	rows ordered cohort-then-period — easy to pivot

Output:

cohort_month	activity_month	active_users
2026-01-01	2026-01-01	2
2026-01-01	2026-02-01	1
2026-01-01	2026-03-01	1
2026-02-01	2026-02-01	1
2026-02-01	2026-03-01	1

Why this works — concept by concept:

DATE_TRUNC as the cohort anchor — truncating the user's first event to a month produces a label that is stable for the user's lifetime; every later query joins on it without recomputation.
First-event CTE pattern — separating "who is the user" from "what did the user do" keeps the cohort label out of the activity table; one place to change the bucket granularity (month → week) without touching downstream queries.
COUNT(DISTINCT user_id) as the basis — collapses many events per user per period into one active flag; that is the correct counting basis for retention, MAU, and cohort-active metrics.
Long-form output — one row per (cohort, period); trivially pivotable in BI, easy to test (assert no row where active_users > cohort_size), and joinable with calendar tables.
Cost — query time = O(events) for the first-event scan, O(events) for the activity scan, plus the group-by hash; safe up to billions of rows on a warehouse with cluster-key on event_date.

SQL
Topic — aggregation
Aggregation problems (SQL)

Practice →

2. Cohort analysis — DATE_TRUNC + first-event signup buckets

`sql data analysis` lives or dies on cohort SQL — anchor users to a signup bucket, then reshape

The one-line invariant of cohort SQL: assign every user a stable cohort label (the truncated signup date), then aggregate their later activity by relative period from that label. Once the cohort label is fixed, every downstream question — retention, revenue per cohort, conversion per cohort — collapses to a GROUP BY (cohort, period).

The cohort decisions every analyst SQL interview probes.

Granularity. Daily, weekly, or monthly cohorts. Weekly is the dashboard sweet spot for SaaS; monthly for ecommerce; daily only when you have millions of signups per day.
Anchor event. First app open, first purchase, first paid plan — the definition of "cohort 0" controls every downstream chart. State it explicitly in the query.
Relative period vs absolute period. Cohort SQL groups by relative period (months-since-signup), not calendar period. A January user reaching M3 means March; a February user reaching M3 means May.
Triangular shape. Newer cohorts haven't existed long enough to reach later periods — those cells should be NULL, not zero. Greying the impossible cells is a hallmark of a fair cohort chart.

The cohort recipe in three SQL primitives.

DATE_TRUNC('month', signup_at) → cohort_month. The stable bucket label.
DATE_DIFF('month', cohort_month, activity_month) → period_n. The relative period.
COUNT(DISTINCT user_id) / cohort_size → retention_pct. The retention metric.

Three failure modes to call out in the interview.

Wrong basis. COUNT(event_id) instead of COUNT(DISTINCT user_id) — heavy users inflate retention.
Mixing calendar and cohort time. Joining on activity_month directly rather than period_n produces a non-comparable chart — January's M3 cell is February, but February's M3 cell is May.
Forgetting the triangle. Filling impossible cells with zero collapses retention as cohorts age; readers will misinterpret it as a falling product.

Worked example — 12-line cohort matrix for a SaaS signup feed

Detailed explanation. A SaaS product wants the monthly retention matrix — for each signup-month cohort, what fraction were active in M1, M2, M3, etc. The query needs three CTEs: cohort assignment, cohort size, and active counts per relative period.

Question. Given users(user_id, signup_at) and events(user_id, event_at), return one row per (cohort_month, period_n) with the cohort size, the active-user count, and the retention percentage.

Input (small sample).

user_id	signup_at	events
101	2026-01-04	2026-01-04, 2026-02-09, 2026-03-15
102	2026-01-12	2026-01-12, 2026-02-20
103	2026-02-02	2026-02-02, 2026-03-10
104	2026-02-25	2026-02-25

Code.

WITH cohorts AS (
  SELECT user_id,
         DATE_TRUNC('month', signup_at) AS cohort_month
  FROM   users
),
cohort_size AS (
  SELECT cohort_month, COUNT(*) AS n_users
  FROM   cohorts
  GROUP  BY 1
),
activity AS (
  SELECT c.cohort_month,
         DATE_DIFF('month', c.cohort_month,
                   DATE_TRUNC('month', e.event_at)) AS period_n,
         COUNT(DISTINCT e.user_id)                  AS active_users
  FROM   events  e
  JOIN   cohorts c USING (user_id)
  GROUP  BY 1, 2
)
SELECT a.cohort_month,
       a.period_n,
       a.active_users,
       s.n_users,
       1.0 * a.active_users / s.n_users AS retention_pct
FROM   activity    a
JOIN   cohort_size s USING (cohort_month)
ORDER  BY 1, 2;

Step-by-step explanation.

cohorts CTE. Each user gets one stable label — the month-truncated signup date. This is the row-anchor for everything else.
cohort_size CTE. Count the rows in each cohort. This becomes the denominator for retention percentages.
activity CTE. Join cohort labels onto every event, then DATE_DIFF to compute the relative period. GROUP BY (cohort, period) collapses to one row per cell.
Final SELECT. Join the cell counts to the cohort sizes; divide for the retention percentage. The 1.0 * cast forces float division — without it some engines floor to integer zero.
Triangular shape emerges naturally. Cohorts that haven't reached period N yet simply produce no row — the cell is missing, not zero. The dashboard layer should render those cells as greyed, never as 0%.

Output.

cohort_month	period_n	active_users	n_users	retention_pct
2026-01	0	2	2	1.00
2026-01	1	2	2	1.00
2026-01	2	1	2	0.50
2026-02	0	2	2	1.00
2026-02	1	1	2	0.50

Rule of thumb. Three CTEs — cohorts, cohort_size, activity — and one final join. That is the entire cohort-matrix shape, no matter how big the data gets.

Analytics SQL interview question on building a cohort matrix end-to-end

A common probe in sql for data analytics interviews: "Walk me through the SQL to build a monthly cohort retention matrix. I want to see how you handle cohort assignment, the relative-period join, and the denominator." It is the bread-and-butter analyst question — fluency here distinguishes a senior analyst from a junior.

Solution Using `DATE_TRUNC` + `DATE_DIFF` + three CTEs

WITH cohorts AS (
  SELECT user_id,
         DATE_TRUNC('month', signup_at) AS cohort_month
  FROM   users
),
cohort_size AS (
  SELECT cohort_month, COUNT(*) AS n_users
  FROM   cohorts
  GROUP  BY 1
),
activity AS (
  SELECT c.cohort_month,
         DATE_DIFF('month', c.cohort_month,
                   DATE_TRUNC('month', e.event_at)) AS period_n,
         COUNT(DISTINCT e.user_id)                  AS active_users
  FROM   events  e
  JOIN   cohorts c USING (user_id)
  GROUP  BY 1, 2
)
SELECT a.cohort_month,
       a.period_n,
       a.active_users,
       s.n_users,
       ROUND(100.0 * a.active_users / s.n_users, 1) AS retention_pct
FROM   activity    a
JOIN   cohort_size s USING (cohort_month)
ORDER  BY 1, 2;

Step-by-step trace.

Stage	Operation	Output
1	`cohorts` CTE	one row per user with stable `cohort_month`
2	`cohort_size` CTE	one row per cohort with `n_users` denominator
3	`activity` CTE — join cohort onto every event	one row per `(cohort_month, period_n)` with `active_users`
4	final join + division	one row per cell with `retention_pct`
5	`ROUND(...)` to 1 dp	presentation-ready output

Output:

cohort_month	period_n	active_users	n_users	retention_pct
2026-01-01	0	12,000	12,000	100.0
2026-01-01	1	5,040	12,000	42.0
2026-01-01	2	3,360	12,000	28.0
2026-01-01	3	2,160	12,000	18.0
2026-02-01	0	14,500	14,500	100.0
2026-02-01	1	6,525	14,500	45.0

Why this works — concept by concept:

DATE_TRUNC as the stable cohort anchor — every user has exactly one cohort label for life, computed once in the first CTE; all subsequent joins are O(1) per row instead of recomputing the truncation.
DATE_DIFF for relative period — turning a calendar pair (cohort_month, activity_month) into a single integer period_n is what makes cohorts comparable. M3 always means "three months after signup", regardless of calendar.
COUNT(DISTINCT user_id) as the basis — collapses heavy-user inflation; the cell counts the people, not the events.
Triangular shape — missing rows (instead of zero rows) for (cohort, period) pairs that don't exist yet means the chart honestly grays out impossible cells; the dashboard renders the diagonal correctly.
1.0 * x / y — forces float division in engines that otherwise floor integers; the most common silent-zero bug in analyst SQL.
Cost — O(users) for cohorts, O(events) for activity, O(cohorts) for the denominator join; safe up to hundreds of millions of events with cluster-keys on signup_at and event_at.

SQL
Topic — cohort analysis
Cohort analysis problems (SQL)

Practice →

Worked example — weekly cohorts and the diagonal-greying trap

Detailed explanation. Monthly cohorts can hide weekly seasonality (Black Friday signups behave very differently from end-of-quarter signups). Weekly cohorts surface that variance, but they multiply the cell count 4×, and the diagonal-greying trap becomes more obvious — June cohorts cannot have a 30-week period yet because June was only 4 weeks ago.

Question. Convert the monthly cohort recipe into a weekly cohort recipe, and explain why the bottom-right corner of the matrix has to be NULL, not zero.

Input (timeline assumption).

Cohort week	Earliest possible period_n
2026-W01 (Jan 5)	21 weeks by today (2026-W22)
2026-W10 (Mar 9)	12 weeks by today
2026-W22 (Jun 1)	0 weeks by today

Code.

WITH cohorts AS (
  SELECT user_id,
         DATE_TRUNC('week', signup_at) AS cohort_week
  FROM   users
),
activity AS (
  SELECT c.cohort_week,
         DATE_DIFF('week', c.cohort_week,
                   DATE_TRUNC('week', e.event_at)) AS period_n,
         COUNT(DISTINCT e.user_id)                 AS active_users
  FROM   events  e
  JOIN   cohorts c USING (user_id)
  WHERE  e.event_at >= c.cohort_week        -- guard: no events before signup
  GROUP  BY 1, 2
)
SELECT *
FROM   activity
WHERE  cohort_week + (period_n * INTERVAL '7 days') <= CURRENT_DATE
ORDER  BY 1, 2;

Step-by-step explanation.

Truncate signup_at to the start of the ISO week (Monday in most engines). That is the cohort label.
Truncate event_at to its own week; DATE_DIFF gives the relative period in weeks.
The guard clause e.event_at >= c.cohort_week drops the impossible pre-signup events that creep in from clock skew or test data.
The diagonal filter cohort_week + (period_n * INTERVAL '7 days') <= CURRENT_DATE forbids rows that haven't happened yet — the cell for (2026-W22, period_n=10) is not emitted, so it renders as missing (which the dashboard greys out).
Without the filter, the engine would emit zero for those cells, which the dashboard would draw as 0% — a misleading retention drop.

Output.

cohort_week	period_n	active_users
2026-W01	0	1,200
2026-W01	1	720
2026-W01	4	420
2026-W10	0	1,500
2026-W10	1	930
2026-W22	0	800

Rule of thumb. Always enforce the "period is possible" filter in the final SELECT. NULL is the honest answer for the future; zero is a chart-killing lie.

SQL
Topic — date functions
Date function and bucketing problems (SQL)

Practice →

Worked example — cohort-anchored revenue per period

Detailed explanation. A retention chart shows whether users come back; a cohort-anchored revenue chart shows what they spend when they come back. Same recipe shape — anchor the user to a cohort, project activity onto relative period — but the inner aggregate is SUM(amount) instead of COUNT(DISTINCT user_id).

Question. Given users(user_id, signup_at) and orders(user_id, amount, order_at), return one row per (cohort_month, period_n) with revenue and revenue_per_user.

Code.

WITH cohorts AS (
  SELECT user_id,
         DATE_TRUNC('month', signup_at) AS cohort_month
  FROM   users
),
cohort_size AS (
  SELECT cohort_month, COUNT(*) AS n_users
  FROM   cohorts
  GROUP  BY 1
),
revenue AS (
  SELECT c.cohort_month,
         DATE_DIFF('month', c.cohort_month,
                   DATE_TRUNC('month', o.order_at)) AS period_n,
         SUM(o.amount)                              AS revenue
  FROM   orders  o
  JOIN   cohorts c USING (user_id)
  GROUP  BY 1, 2
)
SELECT r.cohort_month,
       r.period_n,
       r.revenue,
       s.n_users,
       ROUND(r.revenue / s.n_users, 2) AS revenue_per_user
FROM   revenue     r
JOIN   cohort_size s USING (cohort_month)
ORDER  BY 1, 2;

Step-by-step explanation.

cohorts CTE — the same first-event anchor; never recompute.
cohort_size CTE — the denominator for the per-user metric.
revenue CTE — sum orders into (cohort, period) cells. The DATE_DIFF gives relative period so months are comparable across cohorts.
Final join + division — produces the per-cohort, per-period revenue and the revenue-per-original-user (the LTV building block).
The shape generalises — replace SUM(amount) with COUNT(DISTINCT order_id) for purchase counts, or with AVG(amount) for cohort AOV.

Output.

cohort_month	period_n	revenue	n_users	revenue_per_user
2026-01	0	240,000	12,000	20.00
2026-01	1	132,000	12,000	11.00
2026-01	2	96,000	12,000	8.00
2026-02	0	290,000	14,500	20.00
2026-02	1	130,500	14,500	9.00

Rule of thumb. The cohort recipe is the shape, not the metric. Swap the inner aggregate and you can build retention, revenue, conversion, feature adoption, or any other cohort-anchored chart with the same five-CTE skeleton.

3. Funnel analysis — multi-step conversion + drop-off

`sql for data analysis` of funnels is one row per user-step pair, pivoted with FILTER

The funnel recipe in one sentence: for every user, mark which funnel step(s) they reached; then for every step, count the distinct users who reached it; the row-to-column pivot is done with COUNT(*) FILTER (WHERE step = 'X'). Once the conditional-count idiom clicks, multi-step funnels become five-line queries.

The funnel decisions every analyst SQL interview probes.

Ordering. Does the funnel require strict step order (signup → verify → activate, in sequence), or is it order-free (touched these N features at all)? ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_at) enforces order.
Time bounding. "Activated within 7 days of signup" means joining the activation event with event_at <= signup_at + INTERVAL '7 days'. Without a bound, conversion percentages drift up over time for old cohorts.
De-duplication. A user who completes a step twice still counts once. COUNT(DISTINCT user_id) FILTER (WHERE step = 'X') handles it.
Drop-off math. prev_step_count - current_step_count for the absolute drop; 1 - current / prev for the percentage.

The funnel SQL recipe in three primitives.

FILTER (WHERE step = 'X') — the conditional-count pivot.
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_at) — for strict-order funnels.
INTERVAL '7 days' — the time-bound clause for "within N days" funnels.

Three failure modes interviewers love to catch.

Skip-step counting. Counting users who reached step 4 without verifying they also reached step 3. Use EXISTS or INTERSECT to enforce.
Forgetting the time bound. Cohort A signed up 2 years ago, cohort B signed up yesterday — comparing their lifetime conversion is meaningless. Bound to "within N days of step 1".
Counting events instead of users. COUNT(*) counts events; COUNT(DISTINCT user_id) counts people. The funnel chart shows people.

Worked example — 5-step product activation funnel in 6 lines

Detailed explanation. A product team wants the activation funnel: signup → email verified → profile completed → first action → activated (defined as 3+ actions in first 7 days). Five steps, four drop-off arrows.

Question. Given events(user_id, event_type, event_at), write a query that returns one row per funnel step with the user count and the step-to-step conversion percentage.

Input.

user_id	event_type	event_at
1	signup	2026-04-01
1	email_verified	2026-04-01
1	profile_completed	2026-04-02
1	first_action	2026-04-03
2	signup	2026-04-01
2	email_verified	2026-04-02
3	signup	2026-04-01

Code.

SELECT
  COUNT(DISTINCT user_id) FILTER (WHERE event_type = 'signup')             AS step_1_signup,
  COUNT(DISTINCT user_id) FILTER (WHERE event_type = 'email_verified')     AS step_2_verified,
  COUNT(DISTINCT user_id) FILTER (WHERE event_type = 'profile_completed')  AS step_3_profile,
  COUNT(DISTINCT user_id) FILTER (WHERE event_type = 'first_action')       AS step_4_action,
  COUNT(DISTINCT user_id) FILTER (WHERE event_type = 'activated')          AS step_5_activated
FROM   events
WHERE  event_at >= DATE '2026-04-01';

Step-by-step explanation.

One scan, five buckets. The query reads the events table once; each FILTER clause carves out the subset that matches the step name; COUNT(DISTINCT user_id) de-duplicates within that subset.
Wide-shape output. One row, five columns — perfect for a horizontal funnel chart but awkward to compute drop-offs from. We pivot it to long-form in the next CTE.
The WHERE event_at >= ... clause anchors the funnel to a measurement window — without it the funnel includes every event from the table's history, which makes month-over-month comparison impossible.
COUNT(DISTINCT user_id) FILTER (...) is the workhorse — in older engines without FILTER you write COUNT(DISTINCT CASE WHEN ... THEN user_id END) for the same effect.
The implicit ordering assumption. This wide query counts every user who reached a step, not every user who progressed through all previous steps in order. The strict-order version uses ROW_NUMBER or EXISTS — covered in the next worked example.

Output (wide form).

step_1_signup	step_2_verified	step_3_profile	step_4_action	step_5_activated
10,000	7,500	5,200	3,400	1,890

Now pivot it to long-form for the drop-off chart.

WITH wide AS (
  SELECT
    COUNT(DISTINCT user_id) FILTER (WHERE event_type='signup')            AS s1,
    COUNT(DISTINCT user_id) FILTER (WHERE event_type='email_verified')    AS s2,
    COUNT(DISTINCT user_id) FILTER (WHERE event_type='profile_completed') AS s3,
    COUNT(DISTINCT user_id) FILTER (WHERE event_type='first_action')      AS s4,
    COUNT(DISTINCT user_id) FILTER (WHERE event_type='activated')         AS s5
  FROM events
  WHERE event_at >= DATE '2026-04-01'
)
SELECT 1 AS step, 'signup'             AS name, s1 AS users, NULL::float AS dropoff FROM wide
UNION ALL SELECT 2, 'email_verified',    s2, 1 - 1.0*s2/s1 FROM wide
UNION ALL SELECT 3, 'profile_completed', s3, 1 - 1.0*s3/s2 FROM wide
UNION ALL SELECT 4, 'first_action',      s4, 1 - 1.0*s4/s3 FROM wide
UNION ALL SELECT 5, 'activated',         s5, 1 - 1.0*s5/s4 FROM wide
ORDER BY step;

Output (long form).

step	name	users	dropoff
1	signup	10,000	—
2	email_verified	7,500	0.25
3	profile_completed	5,200	0.31
4	first_action	3,400	0.35
5	activated	1,890	0.44

Rule of thumb. Wide form for the dashboard widget, long form for the drop-off chart. Compute both in the same query so the two views stay in sync.

Analytics SQL interview question on a strict-order funnel with a 7-day time bound

The follow-up probe is almost always: "Make the funnel strict — a user only counts at step N if they also reached every earlier step in order, within 7 days of signup." It tests whether the candidate reaches for ROW_NUMBER or EXISTS or INTERSECT, and whether they correctly time-bound the conversion window.

Solution Using a `ROW_NUMBER` ordered-events CTE + INNER JOIN per step

WITH ordered AS (
  SELECT user_id,
         event_type,
         event_at,
         MIN(event_at) OVER (PARTITION BY user_id) AS signup_at
  FROM   events
  WHERE  event_type IN ('signup','email_verified','profile_completed',
                        'first_action','activated')
),
within_7d AS (
  SELECT *
  FROM   ordered
  WHERE  event_at <= signup_at + INTERVAL '7 days'
),
per_step AS (
  SELECT user_id,
         BOOL_OR(event_type='signup')             AS hit_signup,
         BOOL_OR(event_type='email_verified')     AS hit_verified,
         BOOL_OR(event_type='profile_completed')  AS hit_profile,
         BOOL_OR(event_type='first_action')       AS hit_action,
         BOOL_OR(event_type='activated')          AS hit_activated
  FROM   within_7d
  GROUP  BY user_id
)
SELECT
  COUNT(*)                                        AS s1_signup,
  COUNT(*) FILTER (WHERE hit_verified)            AS s2_verified,
  COUNT(*) FILTER (WHERE hit_verified
                     AND hit_profile)             AS s3_profile,
  COUNT(*) FILTER (WHERE hit_verified
                     AND hit_profile
                     AND hit_action)              AS s4_action,
  COUNT(*) FILTER (WHERE hit_verified
                     AND hit_profile
                     AND hit_action
                     AND hit_activated)           AS s5_activated
FROM   per_step
WHERE  hit_signup;

Step-by-step trace.

Stage	Operation	Output rows
1	`ordered` CTE	every relevant event with `signup_at` derived from the user's earliest event
2	`within_7d` CTE	filters out events more than 7 days after the user's signup
3	`per_step` CTE	one row per user with 5 boolean hit-flags
4	final SELECT	`FILTER (WHERE hit_X AND hit_Y AND …)` enforces strict order — a user only counts at step 4 if all earlier steps were also hit
5	`WHERE hit_signup`	a user without the signup event is not in the funnel at all

Output:

s1_signup	s2_verified	s3_profile	s4_action	s5_activated
10,000	7,200	4,900	3,100	1,650

Strict-order numbers are lower than the wide funnel because users who reached step 4 without verifying email earlier are excluded.

Why this works — concept by concept:

Conditional counts (FILTER) — the column-pivoting workhorse; one scan computes every step's count without subqueries or UNION ALL per step.
Boolean roll-up per user (BOOL_OR) — collapses many events per user to one row with five flags; the strict-order test becomes a chained AND.
Time-bounding with INTERVAL '7 days' — anchors every step to the user's signup, not to wall-clock time; cohort fairness is preserved.
ROW_NUMBER alternative — when steps must be in strict chronological order (not just "reached in any sequence"), ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_at) with a step-name lookup is the alternative pattern. Use it when sequence violation matters (e.g. fraud funnels).
Drop-off math in one place — once the long-form table is built, 1 - current/prev per row gives the drop-off without re-querying the events table.
Cost — O(events) for the scans plus O(users) for the per-user roll-up; safe up to billions of rows on a clustered warehouse.

SQL
Topic — user funnel analysis
User-funnel analysis problems (SQL)

Practice →

Worked example — conditional aggregation for a feature-adoption funnel

Detailed explanation. Conditional aggregation — COUNT FILTER, SUM CASE WHEN, AVG FILTER — is the most flexible analyst-SQL idiom. It lets you pivot, segment, and bucket in one pass without dragging data into Python.

Question. From the same events table, compute the share of users who adopted each of three optional features (dashboard, api_keys, team_invite) in their first 30 days.

Code.

WITH first_30 AS (
  SELECT e.user_id,
         e.event_type,
         e.event_at,
         u.signup_at
  FROM   events e
  JOIN   users  u USING (user_id)
  WHERE  e.event_at BETWEEN u.signup_at
                       AND u.signup_at + INTERVAL '30 days'
)
SELECT
  COUNT(DISTINCT user_id)                                                AS total_users,
  COUNT(DISTINCT user_id) FILTER (WHERE event_type='dashboard_open')     AS dashboard_users,
  COUNT(DISTINCT user_id) FILTER (WHERE event_type='api_key_created')    AS api_users,
  COUNT(DISTINCT user_id) FILTER (WHERE event_type='team_invite_sent')   AS invite_users,
  ROUND(100.0 * COUNT(DISTINCT user_id) FILTER (WHERE event_type='dashboard_open')
              / COUNT(DISTINCT user_id), 1)                              AS pct_dashboard,
  ROUND(100.0 * COUNT(DISTINCT user_id) FILTER (WHERE event_type='api_key_created')
              / COUNT(DISTINCT user_id), 1)                              AS pct_api,
  ROUND(100.0 * COUNT(DISTINCT user_id) FILTER (WHERE event_type='team_invite_sent')
              / COUNT(DISTINCT user_id), 1)                              AS pct_invite
FROM   first_30;

Step-by-step explanation.

first_30 CTE restricts to events in the user's first 30 days — the cohort-fair adoption window.
COUNT(DISTINCT user_id) (no FILTER) is the denominator — every user who has at least one event in their first 30 days.
Each FILTER (WHERE …) count is the numerator for one feature; users who touched it in the window.
The percentage columns divide the two; 100.0 * forces float division; ROUND(... , 1) shapes the output for the BI tool.
One scan, six metrics. Compare to writing three separate queries — FILTER is dramatically faster and easier to keep in sync.

Output.

total_users	dashboard_users	api_users	invite_users	pct_dashboard	pct_api	pct_invite
14,500	9,860	3,335	1,885	68.0	23.0	13.0

Rule of thumb. Whenever the question is "what fraction of X did Y", reach for COUNT(...) FILTER (WHERE …) / COUNT(...). It collapses a half-page Python notebook into 8 lines of SQL.

SQL
Topic — conditional aggregation
Conditional aggregation problems (SQL)

Practice →

Worked example — time-to-convert distribution with percentiles

Detailed explanation. Funnel charts answer "how many converted"; time-to-convert answers "how long did it take". The interview shape: compute the time between each user's signup and their activation event, then summarise with median and p90.

Question. Given events(user_id, event_type, event_at), compute the time-to-activation distribution (in hours) for users who eventually activated.

Input.

user_id	event_type	event_at
1	signup	2026-04-01 10:00:00
1	activated	2026-04-01 14:30:00
2	signup	2026-04-01 11:00:00
2	activated	2026-04-02 09:00:00
3	signup	2026-04-01 12:00:00

Code.

WITH first_signup AS (
  SELECT user_id, MIN(event_at) AS signup_at
  FROM   events
  WHERE  event_type = 'signup'
  GROUP  BY user_id
),
first_activation AS (
  SELECT user_id, MIN(event_at) AS activated_at
  FROM   events
  WHERE  event_type = 'activated'
  GROUP  BY user_id
),
joined AS (
  SELECT s.user_id,
         EXTRACT(EPOCH FROM (a.activated_at - s.signup_at)) / 3600 AS hours_to_activate
  FROM   first_signup     s
  JOIN   first_activation a USING (user_id)
)
SELECT
  COUNT(*)                                              AS activated_users,
  ROUND(AVG(hours_to_activate), 2)                      AS mean_hours,
  ROUND(PERCENTILE_CONT(0.5)
        WITHIN GROUP (ORDER BY hours_to_activate), 2)  AS p50_hours,
  ROUND(PERCENTILE_CONT(0.9)
        WITHIN GROUP (ORDER BY hours_to_activate), 2)  AS p90_hours
FROM joined;

Step-by-step explanation.

first_signup CTE anchors each user to their earliest signup timestamp.
first_activation CTE anchors each user to their earliest activation. Users who never activated are dropped by the inner join.
joined CTE computes the time delta in hours. EXTRACT(EPOCH FROM ...) / 3600 is the cross-dialect way to turn an interval into a number.
PERCENTILE_CONT(0.5) is the median; PERCENTILE_CONT(0.9) is p90. These ordered-set aggregates are exactly what the interviewer is testing — they replace the old "mean is misleading" hand-wave.
What this surfaces. A mean of 6 hours with a p90 of 36 hours is a clear two-mode distribution — most users activate quickly, a long tail drag the chart.

Output.

activated_users	mean_hours	p50_hours	p90_hours
1,890	6.20	4.50	22.30

Rule of thumb. Whenever the question is "how long did it take?", report median and p90 alongside the mean — the distribution shape is the answer, not a single number.

4. Retention curves — day-N + weekly cohorts + decay

`sql for data analyst` rounds love retention curves — the shape tells the product story

The retention curve in one sentence: for each cohort and each day-N, what fraction of the cohort was active that day. Plot day-N on the x-axis, retention % on the y, one line per cohort, and you have the canonical product-market-fit chart.

The retention decisions every analyst SQL interview probes.

Day-N or rolling. Day-N retention = "active on exactly day N after signup". Rolling retention = "active any time in days [N-window, N]". Day-N is simpler; rolling is what most dashboards actually show.
Calendar vs activity-anchored. Retention curves anchor to signup (cohort time), not calendar time — that is what makes the curve readable across cohorts.
Sparse days. If a user is inactive on day 14 but active on day 13 and 15, the day-14 row is missing from the activity table. A LEFT JOIN against a calendar table fills the zero rows so the chart is honest.
Benchmarks. D1/D7/D30 are the canonical reporting horizons. Industry benchmarks live there. Anything else is a story curve.

The retention SQL recipe in four primitives.

Anchor. signup_date per user (from the cohorts CTE).
Day-N. DATE_DIFF('day', signup_date, activity_date) AS day_n.
Calendar fill. LEFT JOIN calendar so D14 = 0 when there is no event.
Retention. COUNT(DISTINCT user_id) / cohort_size.

Three failure modes interviewers love to catch.

Missing zero rows. A flat-looking curve that quietly skips inactive days is wrong; the user is not active on those days, so the count should be zero, not "not in the table".
Calendar drift. Comparing day-N across cohorts without a calendar table can drift if the engine groups on natural-key rows only; LEFT JOIN ensures every (cohort, day-N) pair exists.
D0 confusion. D0 = signup day. By definition, retention on D0 is 100% — if your query produces 92%, you have a clock or a UTC bug.

Worked example — 30-day retention curve across 3 cohorts

Detailed explanation. A growth team wants the 30-day retention curve for the last three monthly cohorts. The curve must show every day from D0 to D30 — including days where the cohort was completely inactive (rare but possible).

Question. Given users(user_id, signup_at) and events(user_id, event_at), return one row per (cohort_month, day_n) for day_n 0..30 with the retention percentage.

Input.

user_id	signup_at	events
101	2026-01-05	2026-01-05, 2026-01-06, 2026-01-12
102	2026-01-15	2026-01-15, 2026-01-22, 2026-02-04
201	2026-02-03	2026-02-03, 2026-02-05

Code.

WITH cohorts AS (
  SELECT user_id,
         DATE_TRUNC('month', signup_at) AS cohort_month,
         DATE(signup_at)                AS signup_date
  FROM   users
),
cohort_size AS (
  SELECT cohort_month, COUNT(*) AS n_users
  FROM   cohorts
  GROUP  BY 1
),
day_grid AS (              -- 0..30 — the x-axis
  SELECT generate_series(0, 30) AS day_n
),
matrix AS (
  SELECT c.cohort_month, d.day_n
  FROM   (SELECT DISTINCT cohort_month FROM cohorts) c
  CROSS  JOIN day_grid d
),
active AS (
  SELECT c.cohort_month,
         DATE_DIFF('day', c.signup_date, DATE(e.event_at)) AS day_n,
         COUNT(DISTINCT e.user_id)                         AS active_users
  FROM   events  e
  JOIN   cohorts c USING (user_id)
  WHERE  e.event_at BETWEEN c.signup_date
                       AND c.signup_date + INTERVAL '30 days'
  GROUP  BY 1, 2
)
SELECT m.cohort_month,
       m.day_n,
       COALESCE(a.active_users, 0)                          AS active_users,
       s.n_users,
       ROUND(100.0 * COALESCE(a.active_users, 0) / s.n_users, 1) AS retention_pct
FROM   matrix      m
JOIN   cohort_size s USING (cohort_month)
LEFT JOIN active    a USING (cohort_month, day_n)
ORDER BY 1, 2;

Step-by-step explanation.

cohorts CTE assigns each user a month label and keeps the exact signup date — the day-N math needs the day, the cohort needs the month.
cohort_size CTE is the denominator per cohort.
day_grid + matrix generate every (cohort, day_n) pair from 0 to 30; this is the canonical x-axis. Without it, days where nobody returned would be silently absent.
active CTE counts distinct users active on each (cohort, day_n). The WHERE clause time-bounds activity to the first 30 days per user.
LEFT JOIN active fills the matrix — when there is no activity row for a (cohort, day_n), COALESCE substitutes 0; the chart shows an honest zero, not a missing point.
Final division yields the retention percentage in presentation form.

Output (sampled).

cohort_month	day_n	active_users	n_users	retention_pct
2026-01	0	12,000	12,000	100.0
2026-01	1	5,400	12,000	45.0
2026-01	7	3,360	12,000	28.0
2026-01	14	2,640	12,000	22.0
2026-01	30	2,040	12,000	17.0
2026-02	0	14,500	14,500	100.0
2026-02	1	5,510	14,500	38.0
2026-02	7	3,190	14,500	22.0

Rule of thumb. Generate the day-N grid explicitly; LEFT JOIN the activity counts onto it; COALESCE(active_users, 0). Without the grid the chart silently lies on inactive days.

Analytics SQL interview question on D1 / D7 / D30 with calendar gap-fill

A common probe in sql for data analyst rounds: "Compute D1, D7, D30 retention per cohort and explain why the calendar table is necessary." It is a clean test of cohort fairness, calendar handling, and the difference between missing data and zero data.

Solution Using `LEFT JOIN calendar` + `COALESCE(active, 0)` + targeted day-N picks

WITH cohorts AS (
  SELECT user_id,
         DATE_TRUNC('month', signup_at) AS cohort_month,
         DATE(signup_at)                AS signup_date
  FROM   users
),
cohort_size AS (
  SELECT cohort_month, COUNT(*) AS n_users
  FROM   cohorts
  GROUP  BY 1
),
active AS (
  SELECT c.cohort_month,
         DATE_DIFF('day', c.signup_date, DATE(e.event_at)) AS day_n,
         COUNT(DISTINCT e.user_id)                         AS active_users
  FROM   events  e
  JOIN   cohorts c USING (user_id)
  WHERE  e.event_at BETWEEN c.signup_date
                       AND c.signup_date + INTERVAL '30 days'
  GROUP  BY 1, 2
),
calendar AS (
  SELECT cohort_month, day_n
  FROM   (SELECT DISTINCT cohort_month FROM cohorts) c
  CROSS  JOIN (VALUES (1), (7), (30)) AS d(day_n)
)
SELECT cal.cohort_month,
       cal.day_n,
       s.n_users,
       COALESCE(a.active_users, 0)                            AS active_users,
       ROUND(100.0 * COALESCE(a.active_users,0) / s.n_users, 1) AS retention_pct
FROM   calendar    cal
JOIN   cohort_size s USING (cohort_month)
LEFT JOIN active    a USING (cohort_month, day_n)
ORDER  BY cal.cohort_month, cal.day_n;

Step-by-step trace.

Stage	Operation	Output
1	`cohorts` CTE	one row per user with `cohort_month` + `signup_date`
2	`cohort_size` CTE	per-cohort denominator
3	`active` CTE	actual day-N counts (sparse — missing days have no row)
4	`calendar` CTE	dense `(cohort, day_n)` pairs for D1/D7/D30 only
5	`LEFT JOIN active` + `COALESCE`	dense output — missing days resolve to 0
6	divide for percentage	three rows per cohort: D1, D7, D30

Output:

cohort_month	day_n	n_users	active_users	retention_pct
2026-01-01	1	12,000	5,400	45.0
2026-01-01	7	12,000	3,360	28.0
2026-01-01	30	12,000	2,040	17.0
2026-02-01	1	14,500	5,510	38.0
2026-02-01	7	14,500	3,190	22.0
2026-02-01	30	14,500	1,595	11.0

Why this works — concept by concept:

Calendar table fills the gaps — without LEFT JOIN, inactive days simply do not appear; the chart would skip them and the curve would look smoother than reality. The calendar is the spine that forces every day to exist.
COALESCE turns "missing" into "zero" — the analyst-honest substitution; absence of evidence becomes evidence of absence, which is correct for retention.
DATE_DIFF for day-N — turns calendar dates into cohort time; the curve becomes comparable across cohorts.
Time-bound the activity — event_at BETWEEN signup AND signup + 30 days keeps the scan small and cohort-fair.
Targeted picks (D1/D7/D30) — instead of a 31-row table per cohort, the calendar restricts to three industry-standard horizons; faster query, easier dashboard.
Cost — O(users) for cohort assignment, O(events_in_30d) for the active scan, O(cohorts × 3) for the calendar; trivially small.

SQL
Topic — window functions
Window-function and retention problems (SQL)

Practice →

Worked example — weekly cohort retention with DATE_DIFF

Detailed explanation. Weekly cohorts are the daily-driver granularity for product analytics — they capture seasonality without the noise of daily curves. The recipe is identical to monthly cohorts; only the truncation unit changes.

Question. Given a 12-week window, return weekly cohort retention for weeks W0..W6 (six weeks after signup).

Code.

WITH cohorts AS (
  SELECT user_id,
         DATE_TRUNC('week', signup_at) AS cohort_week
  FROM   users
),
cohort_size AS (
  SELECT cohort_week, COUNT(*) AS n_users
  FROM   cohorts
  GROUP  BY 1
),
activity AS (
  SELECT c.cohort_week,
         DATE_DIFF('week', c.cohort_week,
                   DATE_TRUNC('week', e.event_at)) AS week_n,
         COUNT(DISTINCT e.user_id)                 AS active_users
  FROM   events  e
  JOIN   cohorts c USING (user_id)
  WHERE  e.event_at <= c.cohort_week + INTERVAL '6 weeks'
  GROUP  BY 1, 2
)
SELECT a.cohort_week,
       a.week_n,
       a.active_users,
       s.n_users,
       ROUND(100.0 * a.active_users / s.n_users, 1) AS retention_pct
FROM   activity    a
JOIN   cohort_size s USING (cohort_week)
WHERE  a.cohort_week >= CURRENT_DATE - INTERVAL '12 weeks'
ORDER  BY 1, 2;

Step-by-step explanation.

Truncate signup to week start; this is the cohort label.
Truncate every activity event to its own week; DATE_DIFF produces the relative week_n.
WHERE event_at <= cohort_week + INTERVAL '6 weeks' bounds the activity scan — every cohort gets exactly six weeks of observation, max.
Outer filter cohort_week >= CURRENT_DATE - INTERVAL '12 weeks' restricts to the last twelve weekly cohorts — the chart's x-axis.
The final division and ROUND shape the output for the BI tool.

Output (sampled).

cohort_week	week_n	active_users	n_users	retention_pct
2026-03-09	0	1,200	1,200	100.0
2026-03-09	1	540	1,200	45.0
2026-03-09	4	264	1,200	22.0
2026-03-16	0	1,150	1,150	100.0
2026-03-16	1	540	1,150	47.0

Rule of thumb. Weekly retention curves are the granularity you want for product dashboards. Monthly is for board decks; daily is for incident analysis.

SQL
Topic — date arithmetic
Date arithmetic problems (SQL)

Practice →

Worked example — N-day rolling retention with window-rank dedup

Detailed explanation. Rolling retention (active any time in days [N-window, N]) is more forgiving than day-N retention — it captures users who came back yesterday or today, not just exactly on day N. Most product dashboards prefer rolling 28-day retention over day-28 retention for that reason.

Question. Given events(user_id, event_at) and a known per-user signup_date, return one row per (user_id, day_n) for day_n in [7, 14, 28] indicating whether each user was active in the trailing 7-day window ending at that day.

Code.

WITH cohorts AS (
  SELECT user_id, DATE(signup_at) AS signup_date
  FROM   users
),
windows AS (
  SELECT user_id, signup_date, day_n
  FROM   cohorts CROSS JOIN (VALUES (7), (14), (28)) AS d(day_n)
),
active_in_window AS (
  SELECT w.user_id,
         w.day_n,
         BOOL_OR(
           DATE(e.event_at) BETWEEN
             w.signup_date + (w.day_n - 7) AND
             w.signup_date + w.day_n
         ) AS active_rolling_7d
  FROM   windows w
  JOIN   events  e USING (user_id)
  GROUP  BY 1, 2
)
SELECT day_n,
       COUNT(*) FILTER (WHERE active_rolling_7d) AS retained,
       COUNT(*)                                    AS cohort,
       ROUND(100.0 * COUNT(*) FILTER (WHERE active_rolling_7d)
             / COUNT(*), 1)                        AS rolling_7d_pct
FROM   active_in_window
GROUP  BY day_n
ORDER  BY day_n;

Step-by-step explanation.

cohorts CTE holds each user's anchor date.
windows CTE is the cross product — three windows per user (W7, W14, W28). Each window is a 7-day trailing interval.
active_in_window joins events; BOOL_OR(... BETWEEN ...) returns TRUE if the user had any event in the window. This is the rolling-retention flag.
Final SELECT counts retained users per window and computes the percentage.
Why rolling beats day-N for dashboards. A user active on days 6 and 8 but not exactly on day 7 appears as "not retained" in day-N retention but "retained" in rolling-7-day. The rolling chart is smoother and more accurate to engagement.

Output.

day_n	retained	cohort	rolling_7d_pct
7	5,640	12,000	47.0
14	4,200	12,000	35.0
28	2,880	12,000	24.0

Rule of thumb. Use day-N for the interview (cleaner math); use rolling-7-day for the dashboard (more honest signal). Both belong in the analyst's toolkit.

5. Business-metric SQL — LTV, ARPU, MAU/DAU, stickiness

`sql for data analytics` ends in growth metrics — LTV, ARPU, MAU/DAU, stickiness

The growth-metric bundle in one sentence: every business metric a PM dashboards is one of two shapes — a rolling time series (DAU, MAU, weekly active) or a per-cohort sum (LTV, ARPU, gross revenue). Window functions handle the first; cohort-anchored sums handle the second.

The growth metric ladder.

DAU — Daily Active Users. Distinct users with any event today.
WAU — Weekly Active Users. Distinct users with any event in the last 7 days.
MAU — Monthly Active Users. Distinct users with any event in the last 30 days.
Stickiness — DAU / MAU. Ratio in [0,1]; ~0.2 is normal SaaS, 0.5+ is messaging-app territory.
ARPU — Average Revenue Per User. SUM(revenue) / COUNT(DISTINCT user) for the period.
LTV — Lifetime Value. Sum of revenue per user over their entire lifetime, or ARPU × avg_lifetime_months.
CAC payback. CAC / ARPU (months); LTV/CAC > 3 is the SaaS rule of thumb.

The SQL primitives.

SUM() OVER (PARTITION BY user_id ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) — running totals.
COUNT(DISTINCT user_id) OVER (... ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) — 30-day rolling DAU. (Note: many engines disallow DISTINCT inside window — fall back to a CTE-and-join approach.)
LAG(metric) OVER (ORDER BY date) — period-over-period growth.
DATE_TRUNC + GROUP BY — cohort-month sums.

Three failure modes interviewers test for.

"MAU" is not "sum of DAU". A user active 30 days in a row counts once in MAU and 30 times in summed DAU. Use distinct on the window.
Stickiness with mismatched windows. DAU divided by yesterday's MAU is wrong; align both to the same end date.
LTV without cohort. LTV is a cohort metric — a single average across all users mixes 5-year customers with 5-day signups. Compute per cohort, then average if needed.

Worked example — 30-day rolling MAU + LAG growth

Detailed explanation. Build a daily time series of 30-day rolling MAU and the day-over-day growth percentage. The MAU rollup uses a self-join (since COUNT(DISTINCT) is not allowed inside most engines' window frames); the growth uses LAG.

Question. Given events(user_id, event_at), return one row per day with mau_30d (distinct users active in the last 30 days) and mau_growth_pct (vs the previous day).

Code.

WITH daily_active AS (
  SELECT DATE(event_at)         AS d,
         user_id
  FROM   events
  GROUP  BY 1, 2
),
days AS (
  SELECT DISTINCT d FROM daily_active
),
mau AS (
  SELECT  x.d,
          COUNT(DISTINCT da.user_id) AS mau_30d
  FROM    days x
  JOIN    daily_active da
          ON  da.d BETWEEN x.d - INTERVAL '29 days' AND x.d
  GROUP   BY x.d
),
with_growth AS (
  SELECT  d,
          mau_30d,
          LAG(mau_30d) OVER (ORDER BY d) AS prev_mau,
          ROUND(100.0 * (mau_30d - LAG(mau_30d) OVER (ORDER BY d))
                      / NULLIF(LAG(mau_30d) OVER (ORDER BY d), 0), 2)
                                          AS mau_growth_pct
  FROM    mau
)
SELECT *
FROM   with_growth
ORDER  BY d;

Step-by-step explanation.

daily_active dedupes events to one row per (user, day) — collapses many events to one active-flag per day.
days is the time spine — every day where any user was active.
mau self-joins the daily-active table to the time spine; for each spine day x.d, count distinct users active in the trailing 30 days. This is the rolling MAU.
with_growth uses LAG to fetch the previous day's MAU; NULLIF(... , 0) guards the very first day so the division doesn't crash.
The output is a one-row-per-day time series — perfect for the rolling-MAU line chart.

Output (sampled).

d	mau_30d	prev_mau	mau_growth_pct
2026-05-01	80,000	NULL	NULL
2026-05-02	80,420	80,000	0.53
2026-05-07	82,150	81,930	0.27
2026-05-15	90,300	89,610	0.77
2026-05-30	120,000	118,800	1.01

Rule of thumb. Build the time spine once, self-join the activity onto it with a 30-day window, then layer LAG on top. The same shape handles WAU, DAU, and quarterly metrics — only the window width changes.

Analytics SQL interview question on DAU/MAU stickiness with windowed counts

A senior probe: "Compute the 30-day rolling MAU, the same-day DAU, and the DAU/MAU stickiness ratio in a single query." Tests whether the candidate keeps the windows aligned to the same end date and whether they reach for a self-join rather than COUNT DISTINCT OVER.

Solution Using daily_active spine + dual self-joins for DAU and MAU

WITH daily_active AS (
  SELECT DATE(event_at) AS d, user_id
  FROM   events
  GROUP  BY 1, 2
),
days AS (
  SELECT DISTINCT d FROM daily_active
),
dau AS (
  SELECT d, COUNT(DISTINCT user_id) AS dau
  FROM   daily_active
  GROUP  BY 1
),
mau AS (
  SELECT  x.d,
          COUNT(DISTINCT da.user_id) AS mau_30d
  FROM    days x
  JOIN    daily_active da
          ON  da.d BETWEEN x.d - INTERVAL '29 days' AND x.d
  GROUP   BY x.d
)
SELECT  d.d,
        d.dau,
        m.mau_30d,
        ROUND(1.0 * d.dau / NULLIF(m.mau_30d, 0), 3) AS stickiness
FROM    dau d
JOIN    mau m USING (d)
ORDER   BY d.d;

Step-by-step trace.

Stage	Operation	Output
1	`daily_active`	one row per `(user, day)`
2	`days`	the time spine
3	`dau`	one row per day with same-day distinct user count
4	`mau`	one row per day with trailing-30-day distinct user count
5	final join	one row per day with `dau`, `mau_30d`, `stickiness`
6	`NULLIF`	guards against division-by-zero on empty days

Output:

d	dau	mau_30d	stickiness
2026-05-01	15,200	80,000	0.190
2026-05-02	16,050	80,420	0.200
2026-05-07	16,800	82,150	0.205
2026-05-15	19,500	90,300	0.216
2026-05-30	30,000	120,000	0.250

Why this works — concept by concept:

Daily-active spine — collapsing events to one row per (user, day) is the cheapest representation for both DAU and MAU; every downstream query touches a much smaller table.
Self-join with BETWEEN x.d - INTERVAL '29 days' AND x.d — produces a 30-day trailing window per spine day without using COUNT DISTINCT inside a window frame (which most engines disallow).
DAU/MAU stickiness as a ratio — both numerator and denominator share the same end date d; that alignment is what makes the ratio meaningful. Misaligned windows are the most common stickiness bug in production.
NULLIF(mau, 0) — guards the ratio against zero MAU (rare, but possible on first-day cohorts); without it the query throws.
One spine, many metrics — once the spine is built, ARPU, gross revenue, and retention can be layered as additional joins; one scan of events, many shapes of output.
Cost — O(events) for the spine, O(days × users_in_30d) for the MAU self-join — for typical product scales (≤100k DAU, ≤1M MAU) this finishes in seconds on a clustered warehouse.

SQL
Topic — real-time analytics
Real-time analytics problems (SQL)

Practice →

Worked example — ARPU per cohort-month and LTV in 10 lines

Detailed explanation. ARPU and LTV are cohort metrics, not all-user averages. Compute them per cohort, then average if needed. The recipe: anchor users to a cohort, sum lifetime revenue per user, average per cohort.

Question. Given users(user_id, signup_at) and orders(user_id, amount, order_at), return one row per cohort-month with cohort_size, total_revenue, arpu (ARPU as of today), and ltv_12mo_est (estimated 12-month LTV).

Code.

WITH cohorts AS (
  SELECT user_id,
         DATE_TRUNC('month', signup_at) AS cohort_month
  FROM   users
),
per_user AS (
  SELECT c.cohort_month,
         c.user_id,
         COALESCE(SUM(o.amount), 0)                 AS lifetime_revenue,
         MAX(o.order_at)                            AS last_order_at
  FROM   cohorts c
  LEFT JOIN orders o USING (user_id)
  GROUP  BY 1, 2
)
SELECT cohort_month,
       COUNT(*)                                    AS cohort_size,
       SUM(lifetime_revenue)                       AS total_revenue,
       ROUND(AVG(lifetime_revenue), 2)             AS arpu,
       ROUND(AVG(lifetime_revenue) *
             12.0 / GREATEST(1, DATE_DIFF('month',
                          MIN(cohort_month), CURRENT_DATE)),
             2)                                    AS ltv_12mo_est
FROM   per_user
GROUP  BY 1
ORDER  BY 1;

Step-by-step explanation.

cohorts CTE assigns each user a stable cohort label.
per_user CTE computes lifetime revenue per user, anchored to the cohort. LEFT JOIN keeps users with zero orders (they belong in the cohort denominator).
The final SELECT groups by cohort and computes the four metrics — cohort_size is the headcount, total_revenue is the gross, arpu is the average per user, ltv_12mo_est extrapolates that ARPU out to 12 months.
The extrapolation math — divide observed ARPU by months observed, multiply by 12. GREATEST(1, ...) guards new cohorts that have only existed for fractional months.
Why per cohort? A six-month-old cohort and a one-day-old cohort have wildly different ARPU horizons; mixing them gives a meaningless average.

Output.

cohort_month	cohort_size	total_revenue	arpu	ltv_12mo_est
2026-01	12,000	540,000	45.00	540.00
2026-02	14,500	542,500	37.41	449.00
2026-03	13,200	396,000	30.00	360.00
2026-04	15,800	316,000	20.00	240.00

Rule of thumb. Always express ARPU and LTV per cohort. The 12-month extrapolation is a useful fiction — flag it as an estimate in the column name (ltv_12mo_est, not ltv) so consumers know the math.

SQL
Topic — data aggregation
Data aggregation problems (SQL)

Practice →

Worked example — trailing 7-day revenue with SUM OVER

Detailed explanation. Window functions shine on rolling metrics — trailing-N revenue, moving averages, period growth. SUM(amount) OVER (ORDER BY d ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) is the canonical trailing-7-day sum.

Question. Given a daily revenue series, return the daily revenue, the trailing-7-day total, and the day-over-day growth percentage.

Code.

WITH daily_rev AS (
  SELECT DATE(order_at) AS d, SUM(amount) AS rev
  FROM   orders
  GROUP  BY 1
)
SELECT  d,
        rev,
        SUM(rev) OVER (ORDER BY d
                       ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS trailing_7d_rev,
        ROUND(100.0 * (rev - LAG(rev) OVER (ORDER BY d))
                    / NULLIF(LAG(rev) OVER (ORDER BY d), 0), 2)  AS rev_growth_pct
FROM    daily_rev
ORDER   BY d;

Step-by-step explanation.

daily_rev CTE collapses orders to one row per day with summed revenue.
SUM(...) OVER (... ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) — the window frame is "last 7 rows including current"; for daily data that is the trailing-7-day total.
LAG(rev) fetches the previous day's revenue; subtract, divide, round for the growth percentage.
NULLIF(... , 0) guards against zero-revenue days (rare but possible).
One scan, three metrics — daily, trailing-7-day, and growth percentage in a single query.

Output (sampled).

d	rev	trailing_7d_rev	rev_growth_pct
2026-05-01	12,300	12,300	NULL
2026-05-02	13,800	26,100	12.20
2026-05-07	15,400	95,200	4.05
2026-05-08	14,900	97,800	-3.25
2026-05-15	17,200	110,500	2.99

Rule of thumb. SUM OVER ROWS BETWEEN N PRECEDING AND CURRENT ROW is the trailing-window primitive; pair it with LAG for growth and LEAD for forward-looking spreads. These three operators cover 80% of growth-team dashboard SQL.

SQL
Topic — time series
Time-series problems (SQL)

Practice →

Worked example — LTV/CAC ratio and payback in pure SQL

Detailed explanation. LTV and CAC are the two halves of the unit-economics question. The classic ratio — LTV / CAC — tells you whether you make more money from a user than you spent acquiring them; the payback period — CAC / monthly_ARPU — tells you how long it takes to break even. Both are analyst SQL queries on top of the cohort frame.

Question. Given cohort_acquisition_cost(cohort_month, total_cac) and the per-cohort revenue stream from earlier, return one row per cohort with cac_per_user, arpu_monthly, payback_months, and ltv_cac_ratio_12mo.

Code.

WITH cohorts AS (
  SELECT user_id, DATE_TRUNC('month', signup_at) AS cohort_month
  FROM   users
),
cohort_size AS (
  SELECT cohort_month, COUNT(*) AS n_users
  FROM   cohorts
  GROUP  BY 1
),
revenue AS (
  SELECT c.cohort_month, SUM(o.amount) AS total_revenue
  FROM   orders  o
  JOIN   cohorts c USING (user_id)
  GROUP  BY 1
),
arpu AS (
  SELECT cohort_month,
         total_revenue / GREATEST(1, DATE_DIFF('month', cohort_month, CURRENT_DATE))
         / NULLIF((SELECT n_users FROM cohort_size cs
                   WHERE cs.cohort_month = revenue.cohort_month), 0) AS arpu_monthly
  FROM   revenue
)
SELECT s.cohort_month,
       s.n_users,
       c.total_cac,
       ROUND(c.total_cac / s.n_users, 2)                          AS cac_per_user,
       ROUND(a.arpu_monthly, 2)                                   AS arpu_monthly,
       ROUND((c.total_cac / s.n_users) / NULLIF(a.arpu_monthly,0), 2)
                                                                  AS payback_months,
       ROUND(a.arpu_monthly * 12 / NULLIF(c.total_cac / s.n_users, 0), 2)
                                                                  AS ltv_cac_ratio_12mo
FROM   cohort_size               s
JOIN   cohort_acquisition_cost   c USING (cohort_month)
JOIN   arpu                      a USING (cohort_month)
ORDER  BY s.cohort_month;

Step-by-step explanation.

cohorts, cohort_size, revenue — the standard cohort skeleton.
arpu CTE — total revenue divided by (months observed × cohort size). The GREATEST(1, ...) guards new cohorts.
cac_per_user — total acquisition spend divided by cohort headcount.
payback_months — cac_per_user / arpu_monthly. How many months of average revenue per user to break even on the CAC.
ltv_cac_ratio_12mo — projects ARPU out twelve months and divides by CAC. Industry rule of thumb: > 3x is healthy, < 1x is unsustainable.

Output.

cohort_month	n_users	total_cac	cac_per_user	arpu_monthly	payback_months	ltv_cac_ratio_12mo
2026-01	12,000	1,500,000	125.00	45.00	2.78	4.32
2026-02	14,500	1,950,000	134.48	37.41	3.59	3.34
2026-03	13,200	1,650,000	125.00	30.00	4.17	2.88

Rule of thumb. Always present LTV alongside CAC alongside the ratio. A 4x LTV/CAC is great on paper but irrelevant if payback is 18 months and the business is cash-constrained.

SQL
Topic — aggregation
Cohort revenue + LTV aggregation drills (SQL)

Practice →

Worked example — top-N power users per cohort with ROW_NUMBER

Detailed explanation. Senior analysts get pulled into "who are the whales" questions constantly. The pattern: rank users by revenue within each cohort, then pick the top N. ROW_NUMBER() OVER (PARTITION BY cohort ORDER BY revenue DESC) is the canonical idiom — works in every modern dialect.

Question. Given users(user_id, signup_at) and orders(user_id, amount), return the top-5 highest-spending users for each monthly cohort.

Code.

WITH cohorts AS (
  SELECT user_id, DATE_TRUNC('month', signup_at) AS cohort_month
  FROM   users
),
per_user AS (
  SELECT c.cohort_month,
         c.user_id,
         COALESCE(SUM(o.amount), 0) AS lifetime_revenue
  FROM   cohorts c
  LEFT JOIN orders o USING (user_id)
  GROUP  BY 1, 2
),
ranked AS (
  SELECT cohort_month,
         user_id,
         lifetime_revenue,
         ROW_NUMBER() OVER (PARTITION BY cohort_month
                            ORDER BY lifetime_revenue DESC) AS rn
  FROM   per_user
)
SELECT cohort_month, user_id, lifetime_revenue
FROM   ranked
WHERE  rn <= 5
ORDER  BY cohort_month, rn;

Step-by-step explanation.

cohorts CTE assigns each user a cohort label.
per_user CTE sums lifetime revenue per user; LEFT JOIN keeps zero-revenue users (they'll get rn > 5 and drop out).
ranked CTE assigns a row number per (cohort_month, lifetime_revenue DESC) — the highest-revenue user in each cohort gets rn=1.
Final SELECT picks only rn <= 5. The outer ORDER BY produces a tidy report ordered by cohort and rank.
ROW_NUMBER vs RANK vs DENSE_RANK. Use ROW_NUMBER when you want exactly N rows per group regardless of ties; RANK if ties should share a position (and the top-5 might have 7 rows); DENSE_RANK if ties share without gaps. For "top-N reporting" the safe default is ROW_NUMBER.

Output.

cohort_month	user_id	lifetime_revenue
2026-01	7,402	4,580.00
2026-01	3,011	3,940.00
2026-01	8,219	3,420.00
2026-01	1,506	3,200.00
2026-01	9,033	2,990.00
2026-02	12,008	5,210.00
2026-02	14,221	4,800.00

Rule of thumb. Top-N-per-group = ROW_NUMBER() OVER (PARTITION BY g ORDER BY m DESC) <= N in an outer filter. Memorise it — it is the single most-asked analyst SQL pattern in interviews.

SQL
Topic — ranking
Top-N ranking problems (SQL)

Practice →

Cheat sheet — analyst SQL idioms

Top-N per group. ROW_NUMBER() OVER (PARTITION BY group_id ORDER BY metric DESC) <= N in an outer WHERE.
Cohort matrix. Three CTEs — cohorts, cohort_size, activity — then a final join; DATE_TRUNC for the bucket, DATE_DIFF for the relative period.
Funnel drop-off. COUNT(*) FILTER (WHERE step = 'X') per step in one wide row; UNION ALL pivot to long form; 1 - current/prev for drop-off percentage.
Day-N retention. Per-user signup_date; DATE_DIFF for day-N; LEFT JOIN a (cohort, day_n) calendar with COALESCE(active, 0) to fill the gaps.
Stickiness (DAU/MAU). Daily-active spine; one self-join for trailing-30-day MAU; one straight GROUP BY for DAU; divide, with NULLIF for safety.
LTV per cohort. Anchor users to cohort; LEFT JOIN orders; SUM(amount) per user; AVG per cohort; flag the time-bounded estimate in the column name (ltv_12mo_est).
Rolling N-day metrics. SUM(metric) OVER (ORDER BY d ROWS BETWEEN (N-1) PRECEDING AND CURRENT ROW) — the trailing-window primitive.
Period-over-period growth. LAG(metric) OVER (ORDER BY d) in the numerator; NULLIF(LAG(...), 0) in the denominator.
Conditional aggregation. COUNT/SUM/AVG(...) FILTER (WHERE …) is faster, cleaner, and easier to maintain than chained CASE WHEN aggregates.
Triangular cohort cells. Always check cohort_start + period_n <= CURRENT_DATE in the final SELECT; absence is honest, zero is misleading.

Frequently asked questions

Cohort analysis vs funnel analysis — which one should I reach for?

Use a cohort analysis when you want to understand long-term behaviour over time — retention curves, lifetime value, "how do January signups compare to February signups across the next six months". Use a funnel analysis when you want to understand a sequence of actions inside a fixed window — signup → verify → activate, within seven days. Cohorts ask "how is this group doing over time"; funnels ask "where do users drop off in a flow". A well-designed dashboard usually carries both — cohorts on the strategic tab, funnels on the activation tab.

DAU vs MAU vs WAU — which is the right metric for my product?

Choose the cadence that matches your expected user behaviour. Messaging and games are daily-use products — DAU and DAU/MAU stickiness are the right metrics. Most SaaS is weekly — WAU plus WAU/MAU stickiness. Tax software, insurance portals, payroll — monthly is the natural cadence; MAU and DAU/MAU ratios at 5% are normal and not alarming. The rule is: if the metric drops to zero on a Saturday, you picked the wrong cadence.

Retention vs churn — are they the same thing?

They are mirror images, not the same metric. Retention is the fraction of a cohort still active in period N (active_users / cohort_size); it ranges from 100% at D0 and drops over time. Churn is 1 - retention for a given window — the fraction lost in that window. SaaS teams talk about "monthly churn" — usually "of users active last month, what fraction were not active this month". Analytics SQL handles both shapes from the same cohort frame; pick whichever your audience reads more naturally.

How do I do cohort analysis in pure SQL without pulling into Python?

Three CTEs and a final join. CTE 1 — cohorts — anchors every user to a DATE_TRUNC bucket of their first event. CTE 2 — cohort_size — counts users per cohort (the denominator). CTE 3 — activity — joins cohort labels onto all events and groups by (cohort, period_n) where period_n = DATE_DIFF(activity_period, cohort_period). Final SELECT joins CTE 3 to CTE 2 and divides for the retention percentage. That single recipe handles monthly, weekly, daily, revenue, and feature-adoption cohorts — only the bucket unit changes.

Can SQL replace a BI tool for analytics work?

For the computation, yes — almost every metric a BI tool draws is a SQL query underneath. For the interaction (slicing dimensions, drilling down, scheduling, version-controlling, sharing dashboards), no. The healthy pattern is: SQL produces the tidy long-form tables (one row per (cohort, period), one row per (date, metric)), and the BI tool — Looker, Mode, Metabase, Tableau, Hex — handles pivot, drill, and presentation. Treat SQL as the engine, BI as the dashboard chrome.

LTV in SQL — do I need a calendar table?

Not strictly for LTV itself, because LTV is summed over a user's entire history — it does not depend on the calendar shape. But if you want time-bounded LTV ("LTV after 90 days", "LTV in first 12 months"), then yes, you anchor each user to their cohort date and time-bound the orders join with order_at <= cohort_date + INTERVAL '90 days'. The calendar table starts mattering when you want a per-period series (LTV at month-1, month-2, month-3) — there you LEFT JOIN a generated month-grid so cohorts with no orders in a given month still appear with 0.

Practice on PipeCode

Drill the aggregation practice library → for the SUM / COUNT / GROUP BY muscles that power every cohort query.
Rehearse window-function problems → for ROW_NUMBER, LAG, LEAD, and trailing-N-day aggregates.
Sharpen cohort-analysis problems → when the interviewer wants a full DATE_TRUNC + DATE_DIFF cohort matrix.
Tackle user-funnel analysis problems → for FILTER-based step pivots and strict-order conversion.
Lock in date-function problems → so DATE_TRUNC, DATE_DIFF, and EXTRACT are reflexive.
Layer on conditional-aggregation problems → for the COUNT(...) FILTER pivots.
For the broader interview surface, read top data engineering interview questions →.
Build the core SQL fluency with SQL for data engineering interviews — from zero to FAANG →.
Add product-metric depth with product sense and metrics for data engineering interviews →.

Pipecode.ai is Leetcode for Data Engineering — every SQL analytics pattern above ships with hands-on practice rooms where you shape real cohort matrices, debug real funnels, and ship real LTV queries against real datasets. Start with aggregation and window functions; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice SQL aggregation now →
Window-function drills →

1. Why analytics SQL is its own discipline

sql data analytics is built from four shape primitives — cohort, funnel, retention, LTV

Worked example — reshape one event table into a cohort frame in 8 lines

Analytics SQL interview question on reshaping events into a long cohort frame

Solution Using DATE_TRUNC + first-event CTE + grouped count

2. Cohort analysis — DATE_TRUNC + first-event signup buckets

sql data analysis lives or dies on cohort SQL — anchor users to a signup bucket, then reshape

Worked example — 12-line cohort matrix for a SaaS signup feed

Analytics SQL interview question on building a cohort matrix end-to-end

Solution Using DATE_TRUNC + DATE_DIFF + three CTEs

Worked example — weekly cohorts and the diagonal-greying trap

Worked example — cohort-anchored revenue per period

3. Funnel analysis — multi-step conversion + drop-off

sql for data analysis of funnels is one row per user-step pair, pivoted with FILTER

Worked example — 5-step product activation funnel in 6 lines

Analytics SQL interview question on a strict-order funnel with a 7-day time bound

Solution Using a ROW_NUMBER ordered-events CTE + INNER JOIN per step

Worked example — conditional aggregation for a feature-adoption funnel

Worked example — time-to-convert distribution with percentiles

4. Retention curves — day-N + weekly cohorts + decay

sql for data analyst rounds love retention curves — the shape tells the product story

Worked example — 30-day retention curve across 3 cohorts

Analytics SQL interview question on D1 / D7 / D30 with calendar gap-fill

Solution Using LEFT JOIN calendar + COALESCE(active, 0) + targeted day-N picks

Worked example — weekly cohort retention with DATE_DIFF

Worked example — N-day rolling retention with window-rank dedup

5. Business-metric SQL — LTV, ARPU, MAU/DAU, stickiness

sql for data analytics ends in growth metrics — LTV, ARPU, MAU/DAU, stickiness

Worked example — 30-day rolling MAU + LAG growth

Analytics SQL interview question on DAU/MAU stickiness with windowed counts

Solution Using daily_active spine + dual self-joins for DAU and MAU

Worked example — ARPU per cohort-month and LTV in 10 lines

Worked example — trailing 7-day revenue with SUM OVER

Worked example — LTV/CAC ratio and payback in pure SQL

Worked example — top-N power users per cohort with ROW_NUMBER

Cheat sheet — analyst SQL idioms

Frequently asked questions

Cohort analysis vs funnel analysis — which one should I reach for?

DAU vs MAU vs WAU — which is the right metric for my product?

Retention vs churn — are they the same thing?

How do I do cohort analysis in pure SQL without pulling into Python?

Can SQL replace a BI tool for analytics work?

LTV in SQL — do I need a calendar table?

Practice on PipeCode

`sql data analytics` is built from four shape primitives — cohort, funnel, retention, LTV

Solution Using `DATE_TRUNC` + first-event CTE + grouped count

`sql data analysis` lives or dies on cohort SQL — anchor users to a signup bucket, then reshape

Solution Using `DATE_TRUNC` + `DATE_DIFF` + three CTEs

`sql for data analysis` of funnels is one row per user-step pair, pivoted with FILTER

Solution Using a `ROW_NUMBER` ordered-events CTE + INNER JOIN per step

`sql for data analyst` rounds love retention curves — the shape tells the product story

Solution Using `LEFT JOIN calendar` + `COALESCE(active, 0)` + targeted day-N picks

`sql for data analytics` ends in growth metrics — LTV, ARPU, MAU/DAU, stickiness