Retention cohort analysis with plain SQL

#sql #database #webdev #analytics

Everyone wants a retention chart and most reach for a tool. You can get the core cohort triangle from raw events with one SQL query. Here is the pattern we use, and the mistake that quietly makes it wrong.

I build this for Zenovay (web analytics) where we show retention by signup cohort. The shape below is database agnostic, written for Postgres.

The idea

A cohort is a group of users bucketed by when they first appeared (usually signup week). For each cohort you measure what fraction are still active N weeks later. Plot it as a triangle and you can see whether newer cohorts retain better than older ones.

The two inputs

You need two things per user: their cohort (first activity week) and every week they were active.

-- 1) each user's cohort week
with first_seen as (
  select
    user_id,
    date_trunc('week', min(event_at)) as cohort_week
  from events
  group by user_id
),
-- 2) the distinct weeks each user was active
activity as (
  select distinct
    user_id,
    date_trunc('week', event_at) as active_week
  from events
)
select
  f.cohort_week,
  -- how many weeks after signup this activity is
  (extract(epoch from (a.active_week - f.cohort_week)) / 604800)::int as week_number,
  count(distinct a.user_id) as active_users
from first_seen f
join activity a using (user_id)
group by f.cohort_week, week_number
order by f.cohort_week, week_number;

That gives you a long table of (cohort_week, week_number, active_users). To turn it into a percentage you divide by the cohort size (week_number 0).

Turning counts into retention percent

with counts as (
  -- the query above, as a CTE
  ...
),
cohort_size as (
  select cohort_week, active_users as size
  from counts
  where week_number = 0
)
select
  c.cohort_week,
  c.week_number,
  c.active_users,
  round(100.0 * c.active_users / s.size, 1) as retention_pct
from counts c
join cohort_size s using (cohort_week)
order by c.cohort_week, c.week_number;

The mistake that breaks it

The subtle bug: defining "active" inconsistently between the cohort and the activity step. If your cohort is based on signup but your activity is based on, say, only paid events, week 0 retention will not be 100 percent and every number downstream is wrong. The fix is to make sure the week_number 0 cohort count equals the number of users who signed up that week. If it does not, your two definitions disagree.

The other one: time zones. date_trunc on a raw UTC timestamp buckets by UTC weeks. If your business thinks in a local week, convert before truncating, or your Monday is someone's Sunday and cohorts blur at the edges.

Making it fast

Over raw events this gets slow quickly. We run it over a daily rollup table (one row per user per active day) instead of millions of raw events, so the cohort query touches thousands of rows. Same result, far cheaper. The rollup is a separate scheduled job.

Disclosure: I build Zenovay, a web analytics tool that does this so you do not have to write the SQL. But the query above is the whole idea, and you should understand it before trusting any tool's version.

How do you define "active" for retention, any event or a meaningful one? That choice changes the whole chart.