DEV Community

DanLee
DanLee

Posted on • Edited on • Originally published at dataform.co

Three tables every analyst needs

Introduction

If you're one of the first analysts joining a company, you will probably arrive to find an overwhelming amount of data. You'll almost certainly have some relational data (probably powering your application's backend), and likely have event data too (possibly through a third party like Segment).

This can make it difficult to know where to start with each analytical request. You might be tempted to create a new dataset for each analysis, but this will add to the amount of data and quickly becomes difficult to manage.

Fortunately, you can gather all of your data into three tables that can then be used to answer the majority of user related questions.

The SQL snippets in this post apply to Google BigQuery but can easily be adapted for other data warehouses.

Table 1: User Stats

To start, you need a table that tells you all about your users. This table will contain dimensions (e.g. user account creation timestamp) and metrics (e.g. user lifetime transaction count). It might be simple to begin with, but as your company grows and gets more complex, you’ll keep adding to this table.

Key point: user_stats has one row per user

SELECT * FROM user_stats
user_id created_at last_seen_at transactions_last_7d transactions_lifetime
32356543 2019-01-01 14:36:51 2019-03-14 19:55:10 2 13
64763654 2019-02-01 10:30:18 2019-02-08 10:02:32 0 4
35795445 2019-03-03 23:00:04 2019-03-03 23:03:07 0 0
98765446 2019-03-06 14:36:31 2019-03-15 08:20:22 1 2
... ... ... ... ...

What kind of questions can this table be used to answer?

How many users signed up this week?

SELECT 
  COUNT(*) as user_count
FROM
  user_stats
WHERE
  created_at >= TIMESTAMP_SUB(current_timestamp, INTERVAL 7 DAY)

How many users have made at least 5 transactions?

SELECT
  COUNT(*) as user_count
FROM
  user_stats
WHERE
  transactions_lifetime >= 5

How many users have logged in within the past 24 hours?

SELECT
  COUNT(*) as user_count
FROM
  user_stats
WHERE
  last_seen_at >= TIMESTAMP_SUB(current_timestamp, INTERVAL 24 HOUR)

Table 2: Daily user stats

The need for this table might not be so obvious, but it turns out that having a table describing each user’s activity every single day since they signed up (even if they were not active at all on that day) is really useful. There are four dimensions you’ll definitely want in this table:

  • user_id: a unique identifier for your users, that you can link back to user_stats
  • date: what date does this row correspond to
  • day_n: what day (in the lifetime of the user) does this date represent. On the day a user signs up day_n = 1, the day after sign up day_n = 2 etc
  • was_active: was the user active on this day? (Exactly how you choose to define active should depend on the specifics of your business. If you are creating a messaging app, you may define “active” as having sent at least one message, for example)

As with user_stats, this table will probably end up with many more than these 4 fields as your business becomes more complex.

Key point: daily_user_stats has one row per user per day (for all days since they signed up)

SELECT * FROM daily_user_stats ORDER BY user_id, day_n
user_id date day_n was_active
32356543 2019-03-01 1 1
32356543 2019-03-02 2 0
32356543 2019-03-03 3 1
35795445 2019-03-02 1 1
35795445 2019-03-03 2 0
98765446 2019-03-03 1 1
... ... ... ...

What kind of questions can this table be used to answer?

What percentage of users are active one week after they signed up?

SELECT 
  AVG(was_active) as active_user_ratio
FROM
  daily_users_stats
WHERE
  day_n = 8

How has the percentage of users active the day after they sign up trended over time?

SELECT
  DATE(user_stats.created_at) AS created_date,
  AVERAGE(was_active) as active_user_ratio
FROM
  daily_user_stats
  INNER JOIN user_stats
    ON daily_user_stats.user_id = user_stats.user_id
WHERE
  day_n = 2
ORDER BY created_date ASC

What does our user retention curve look like?

SELECT
  day_n,
  AVERAGE(was_active) as active_user_ratio
FROM
  daily_user_stats
  INNER JOIN user_stats
    ON daily_user_stats.user_id = user_stats.user_id
WHERE
  user_stats.created_at >= TIMESTAMP_SUB(current_timestamp, INTERVAL 7 DAY)
  AND day_n <= 7

Table 3: Sessions

The sessions table is a place to store your event data. Events are user interactions on your app or site (e.g. button clicks, form submissions, message sends etc). You probably track many different events, and finding a consistent approach to analysing them can be tricky. Sessions is a table to organise the chaos!

A session is a collection of events that represent one end to end interaction with your product: user opens app > user opens product abc page > user clicks add to cart > user closes app. How you combine a user’s events into a session depends on your product, but a standard approach is to end a session after a period of 30 minutes of inactivity. Any subsequent events from that user will be added to a new session.

Using some relatively complex SQL, you can split your raw events into sessions. Your sessions table should at least have these dimensions:

  • user_id: unique user identifier
  • session_id: Starting at 1 (the user's first ever session), incrementing for each further session
  • start_time: timestamp of first event in session
  • end_time: timestamp of last event in session
  • events: ordered list/array of all events in the session (this is well supported in BigQuery, other query languages may need to approach this slightly differently, for example using JSON)

Key point: sessions has one row for each session a user has had

SELECT * FROM sessions
user_id session_id start_time end_time events
32356543 1 2019-03-14 19:55:10 2019-03-14 19:58:15 [opened_app, clicked_product, closed_app]
32356543 2 2019-03-16 13:45:10 2019-03-16 13:51:15 [opened_app, scrolled_list, closed_app]
23501294 1 2019-03-18 16:13:10 2019-03-18 16:14:15 [opened_app, ..., closed_app]
23501294 1 2019-03-19 19:55:10 2019-03-19 19:58:15 [opened_app, ..., closed_app]
... ... ... ... ...

What kind of questions can this table be used to answer?

How has average session length changed over time?

SELECT
  DATE(end_time) session_date,
  AVG(TIMESTAMP_DIFF(end_time, start_time)) AS session_length
FROM
  sessions
GROUP BY 
  1

What share of sessions contain a transaction?

SELECT
  DATE(end_time) session_date,
  AVG(IF(ARRAY_TO_STRING(events, '-') LIKE '%transaction_complete%', 
         1,
         0)) AS transaction_ratio
FROM
  sessions
GROUP BY 
  1

What chain of events lead to a user leaving the app and never returning?

SELECT
  user_id,
  ARRAY_REVERSE(events)[SAFE_OFFSET(2)] AS last_event_offset_2,
  ARRAY_REVERSE(events)[SAFE_OFFSET(1)] AS last_event_offset_1,
  ARRAY_REVERSE(events)[SAFE_OFFSET(0)] AS last_event_offset_0
FROM
  sessions
  LEFT JOIN user_stats
    ON sessions.user_id = user_stats.user_id
WHERE
  sessions.session_id = 1
  AND user_stats.sessions_all_time = 1

Summary

These three well structured tables can help answer most questions:

  • user_stats: who is in my user base
  • daily_user_stats: how many active users do I have, and how well do they retain
  • sessions: what journey are users taking through my product

Once you've written the queries to combine your other datasets into these three simple tables you should set up a schedule to keep them up to date with any new changes. You'll soon start relying on these tables for a large share of your analytics.

Dataform provides a fully managed, collaborative workspace to set up schedules and manage your data warehouse. For more information on Dataform check out our documentation, or create an account for free and start using Dataform's fully managed Web platform.

Top comments (0)