How many sales did we make this month?

One way of answering this question is to find the total sum of sales that we have made in the current calendar month:

```
SELECT COUNT(*) FROM sales
WHERE DATE_TRUNC('month', sale_date) = DATE_TRUNC('month', NOW())
```

This query returns the number of rows in the `sales`

table that their `sale_date`

value is in the current calendar month.

Similarly we can count up the sales that we made in the previous calendar month:

```
SELECT COUNT(*) FROM sales
WHERE DATE_TRUNC('month', sale_date) = DATE_TRUNC('month', NOW() - '1 month' :: INTERVAL)
```

This query returns the number of rows in the `sales`

table that their `sale_date`

value is within the previous calendar month.

Depending on which day of the month today is, you may notice that last month we had considerably more sales than this month. We can of course explain it because we are always somewhere the middle of the current month and this month has not yet ended.

We can avoid this problem all together if interpret the meaning of the current month as the past 30 days as opposed to a calendar month:

```
SELECT COUNT(*) FROM sales
WHERE sale_date >= NOW() - '1 month' :: INTERVAL
```

In this post, I argue that this interpretation is more suitable for data visualization and trend analysis.

We also explore some techniques for generating sample data sets and working with SQL window functions.

## Generating Data

Let's start by generating a (albeit boring) sample data set.

We use PostgreSQL's `generate_series(lower, upper, step)`

handy function to create our samples. For example this snippet generates a time series of hourly interval between a year ago and now:

```
SELECT * FROM generate_series(NOW(), NOW() - '1 year' :: INTERVAL , '-1 hour') AS sale_date;
```

Here we assume that we have had one sale every hour at regular intervals in the past year (this assumption is what makes this data set dull, nevertheless we will check a more advanced data generation technique later one) but it is enough for us to make a point about varying month lengths.:

```
WITH sales AS (
SELECT generate_series(NOW(), NOW() - '1 year' :: INTERVAL , '-1 hour') AS sale_date
)
SELECT
DATE_TRUNC('month', sale_date) AS sale_month
, EXTRACT(DAY FROM DATE_TRUNC('month', sale_date + '1 month' :: INTERVAL) - DATE_TRUNC('month', sale_date)) AS month_length
, COUNT(*)
FROM sales
GROUP BY sale_month, month_length
ORDER BY sale_month
```

`WITH`

expressions are called Common Table Expressions or CTEs. We use them to encapsulate our logic.

I avoid creating temporary tables in this post, that's where CTEs come handy.

Here `WITH sales AS ...`

makes a virtual, `sales`

table with one column: `sale_date`

, this table lives in the memory during the execution of the query.

This is the result of the query (try it in SQL Fiddle)

```
+------------+--------------+---------+
| sale_month | month_length | count |
|------------+--------------+---------|
| 2018-08-01 | 31 | 659 |
| 2018-09-01 | 30 | 720 |
| 2018-10-01 | 31 | 745 |
| 2018-11-01 | 30 | 720 |
| 2018-12-01 | 31 | 744 |
| 2019-01-01 | 31 | 744 |
| 2019-02-01 | 28 | 672 |
| 2019-03-01 | 30 | 743 |
| 2019-04-01 | 30 | 720 |
| 2019-05-01 | 31 | 744 |
| 2019-06-01 | 30 | 720 |
| 2019-07-01 | 31 | 744 |
| 2019-08-01 | 31 | 86 |
+------------+--------------+---------+
```

Ignoring first and last months, the number of sales in every month is directly related to the number of days in that month. February is usually problematic.

To improve our analysis, let's take advantage of our other interpretation of 'month' as the past 30 days.

The following query finds the average number of sales that we had in the past 30 days, for every day in the data set:

```
WITH sales AS (
SELECT generate_series(NOW(), NOW() - '1 year' :: INTERVAL , '-1 hour') AS sale_date
)
,
daily_sales AS (
SELECT
DATE_TRUNC('day', sale_date) AS sale_day
, COUNT(*) AS sales
FROM sales
GROUP BY sale_day
ORDER BY sale_day
)
,
daily_avgs as (
SELECT
sale_day
, SUM(sales) OVER W
, AVG(sales) OVER W
FROM daily_sales
WINDOW W as (ORDER BY sale_day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)
)
SELECT * FROM daily_avgs
ORDER BY sale_day DESC
```

There's a lot to unpack here, but let's first check the result:

```
+------------+-------+------+
| sale_day | sum | avg |
|------------+-------+------|
| 2019-08-05 | 710 | 23.7 |
| 2019-08-04 | 720 | 24.0 |
| 2019-08-03 | 720 | 24.0 |
| 2019-08-02 | 720 | 24.0 |
| 2019-08-01 | 720 | 24.0 |
| 2019-07-31 | 720 | 24.0 |
| 2019-07-30 | 720 | 24.0 |
| 2019-07-29 | 720 | 24.0 |
| 2019-07-28 | 720 | 24.0 |
...
```

The sum in every row is `24 * 30 = 720`

, but the latest row. The problem is of course because today is not finished yet. We never have full 24 hours of today in the data set.

Note that we first created a `daily_sales`

CTE:

```
daily_sales AS (
SELECT
DATE_TRUNC('day', sale_date) AS sale_day
, COUNT(*) AS sales
FROM sales
GROUP BY sale_day
ORDER BY sale_day
)
```

Which is basically a time series of number of sales that we had every day.

We are running statistics on `daily_sales`

CTE in the next CTE (`daily_avgs`

):

```
daily_avgs as (
SELECT
sale_day
, SUM(sales) OVER W
, AVG(sales) OVER W
FROM daily_sales
WINDOW W as (ORDER BY sale_day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)
```

`WINDOW W as (ORDER BY sale_day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW)`

creates a window (named `W`

) that is 30-row wide. We could have named our window anything like (`W30`

or `my_window`

or whatever) I choose a simple one letter `W`

because we only have one window in this query.

Here we say for each row in our data set, select `sale_day`

of the row, and `sum`

all the `sales`

that occurred between the `sale_day`

s that are 29 rows earlier from the current row and including the current row (that is 30 rows in total).

We can roughly translate this window expression into a C-like language as below:

```
const indexed_daily_sales = daily_sales
.sortBy(r => r.sale_day) // ORDER BY sale_day
.map((r, index) => ({...r, index}))
foreach(row in indexed_daily_sales) {
yield indexed_daily_sales
.filter(r => r.index >= row.index - 29 and r.index <= row.index) // ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
.sum(x => x.sales) // SUM(sales)
// similarly for AVG
}
```

If you want to exclude today's data from the result, filter out today at the end.

And here's the result visualized in a time series:

## SQL Window Function

Let's explore the concept of windows in SQL a bit more.

If I want to sum up all the integers between 1 and 10, I use this query:

```
SELECT SUM(id) FROM generate_series(1, 10) id
```

The result is simply a row with one number: 55.

But if I want to find the sum of all the integers from 1 to N for every N <= 10, I use:

```
SELECT id, SUM(id) OVER (ORDER BY id) FROM generate_series(1, 10) id
```

The result has 10 rows:

```
+------+-------+
| id | sum |
|------+-------|
| 1 | 1 | = 1
| 2 | 3 | = 1 + 2
| 3 | 6 | = 1 + 2 + 3
| 4 | 10 | = 1 + 2 + 3 + 4
| 5 | 15 | = 1 + 2 + 3 + 4 + 5
| 6 | 21 | = 1 + 2 + 3 + 4 + 5 + 6
| 7 | 28 | = 1 + 2 + 3 + 4 + 5 + 6 + 7
| 8 | 36 | = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8
| 9 | 45 | = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9
| 10 | 55 | = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10
+------+-------+
```

The sum field in every row corresponds to the sum of the integers from 1 to `id`

of the row.

Here `OVER (ORDER BY id)`

creates a window of all the rows up and including the current row. It is equivalent to: `OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`

We don't have to name our window if we use it only once.

```
SELECT
SUM(id) OVER (ORDER BY id)
FROM ...
```

is equivalent to:

```
SELECT
SUM(step_size) OVER W
FROM ...
WINDOW W AS (ORDER BY id)
```

Let's modify this query to return the sum of three integers around each number, for example the row for 5 should return: `4 + 5 + 6 = 15`

```
SELECT
id
, SUM(id) OVER (ORDER BY id ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
FROM generate_series(1,10) id
```

```
+------+-------+
| id | sum |
|------+-------|
| 1 | 3 | = + 1 + 2 = 3
| 2 | 6 | = 1 + 2 + 3 = 6
| 3 | 9 | = 2 + 3 + 4 = 9
| 4 | 12 | = 3 + 4 + 5 = 12
| 5 | 15 | = 4 + 5 + 6 = 15
| 6 | 18 | = 5 + 6 + 7 = 18
| 7 | 21 | = 6 + 7 + 8 = 21
| 8 | 24 | = 7 + 8 + 9 = 24
| 9 | 27 | = 8 + 9 + 10 = 27
| 10 | 19 | = 9 + 10 + = 19
+------+-------+
```

The size of the window is 3 with the exception of the first and last items in the result.

This is usually the case that the size of the window in the beginning and/or the end of the result is smaller than in the middle.

Aggregate functions that operate on windows are aware of this. For example check the average of the three neighboring numbers (that must be the middle one):

```
SELECT
id
, SUM(id) OVER W
, AVG(id) OVER W
FROM generate_series(1, 10) id
WINDOW W AS (ORDER BY id ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
```

```
+------+-------+-----+
| id | sum | avg |
|------+-------+-----|
| 1 | 3 | 1.5 | = ( 1 + 2) / 2 = 1.5 (note division by 2)
| 2 | 6 | 2.0 | = (1 + 2 + 3) / 3 = 2.0 (note division by 3)
| 3 | 9 | 3.0 | = (2 + 3 + 4) / 3 = 3.0
| 4 | 12 | 4.0 | = (3 + 4 + 5) / 3 = 4.0
| 5 | 15 | 5.0 | = (4 + 5 + 6) / 3 = 5.0
| 6 | 18 | 6.0 | = (5 + 6 + 7) / 3 = 6.0
| 7 | 21 | 7.0 | = (6 + 7 + 8) / 3 = 7.0
| 8 | 24 | 8.0 | = (7 + 8 + 9) / 3 = 8.0
| 9 | 27 | 9.0 | = (8 + 9 + 10) / 3 = 9.0
| 10 | 19 | 9.5 | = (9 + 10 ) / 2 = 9.5 (note division by 2)
+------+-------+-----+
```

Depending on your analysis you might need to take this fact into consideration.

For us it means that the average of sales in our time series in the beginning has less data points, hence `AVG(sales) OVER W`

would be more noisy at the left of our charts.

One easy workaround is to ignore the data points in the beginning by offsetting our result. Use `OFFSET 29`

at the end the query (SQL Fiddle).

## Random Walk

We need a more realistic data set in order to put everything that we discussed above into use. Our sales data set has been very boring so far because we used a uniform distribution (one sale every hour) to create our sample data set.

Here we explore a method for generating a more realistic sales data set.

Random Walk is exactly what you think it is. In two dimension, you can think of a small turtle on a surface that chooses the direction of her next step completely by random.

In one dimension we can only move up or down.

```
SELECT step_id, (FLOOR((RANDOM() * 3) - 1)) AS step_size
FROM generate_series(1,6000) step_id
```

```
+-----------+-------------+
| step_id | step_size |
|-----------+-------------|
| 1 | 1.0 |
| 2 | 0.0 |
| 3 | 1.0 |
| 4 | 1.0 |
| 5 | 0.0 |
| 6 | -1.0 |
...
```

This snippet generates a uniform distribution of -1s, 0s and 1s `step_size`

s.

These numbers model the movement of our cursor down, up or not at all at each step.

The sum of all the previous `step_size`

s at each step determines the total distance that we have travelled form the origin.

```
SELECT
SUM(step_size) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pos
FROM (
SELECT step_id, floor((random() * 3) - 1) AS step_size
FROM generate_series(1,6000) step_id
) _a
```

```
+-----------+-------------+-------+
| step_id | step_size | pos |
|-----------+-------------|-------|
| 1 | 1.0 | 1.0 |
| 2 | 0.0 | 1.0 | = 1 + 0
| 3 | 1.0 | 2.0 | = 1 + 0 + 1
| 4 | 1.0 | 3.0 | = 1 + 0 + 1 + 1
| 5 | 0.0 | 3.0 | = 1 + 0 + 1 + 1 + 0
| 6 | -1.0 | 2.0 | = 1 + 0 + 1 + 1 + 0 - 1
...
```

Use `UNION`

if you want to specify a starting point:

```
SELECT 0 as step_id, 600 as step_size
UNION
SELECT step_id, floor((random() * 3) - 1) AS step_size
FROM generate_series(1,6000) step_id
ORDER BY step_id
```

We use random walk to generate more realistic-looking data set. The idea here is that the number of sales every day is not completely random, but it is actually close to the sales that we had on the previous day plus or minus some random value which we call noise.

```
WITH noise AS (
SELECT
step_id
, date_trunc('day', NOW()) - (step_id || ' day') :: INTERVAL AS sale_day
, SUM(step_size) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pos
FROM (
SELECT step_id, (FLOOR((RANDOM() * 3) - 1)) * FLOOR(RANDOM() * 100) AS step_size
FROM generate_series(1,1000) step_id
) _a
)
,
daily_sales AS (
SELECT
sale_day
, (CASE WHEN EXTRACT(DAY FROM sale_day) < 8
THEN FLOOR(RANDOM() * 200)
ELSE 0 END
) + (SELECT ABS(MIN(pos)) FROM noise) + pos AS sales
FROM noise
ORDER BY step_id DESC
)
SELECT
sale_day
, sales
, AVG(sales) OVER (ORDER BY sale_day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS avg_daily_sales
FROM daily_sales
```

`noise`

is a series of random numbers between -99 and 99:

```
SELECT * FROM noise
+-----------+------------+--------+
| step_id | sale_day | pos |
|-----------+------------+--------|
| 1 | 2019-08-04 | 48.0 |
| 2 | 2019-08-03 | 48.0 |
| 3 | 2019-08-02 | 48.0 |
| 4 | 2019-08-01 | 72.0 |
| 5 | 2019-07-31 | 72.0 |
| 6 | 2019-07-30 | 159.0 |
| 7 | 2019-07-29 | 252.0 |
...
```

In this model, we expect our sales to vary ±99 from the previous day by random.

`daily_sales`

adjusts the noise series by first making sure that all data points are above zero: `+ (SELECT ABS(MIN(pos)) FROM noise)`

. We also adds some seasonality to the series (we assume our sales increase in the first week of the month by maximum of 200 sales per day):

```
(CASE WHEN EXTRACT(DAY FROM sale_day) < 8
THEN FLOOR(RANDOM() * 200)
ELSE 0 END
)
```

Let's check the final result:

```
+------------+---------+-------------------+
| sale_day | sales | avg_daily_sales |
|------------+---------+-------------------|
| 2016-11-08 | 1074.0 | 1074.0 |
| 2016-11-09 | 1068.0 | 1071.0 |
| 2016-11-10 | 1118.0 | 1086.66666666667 |
| 2016-11-11 | 1118.0 | 1094.5 |
| 2016-11-12 | 1112.0 | 1098.0 |
| 2016-11-13 | 1177.0 | 1111.16666666667 |
| 2016-11-14 | 1145.0 | 1116.0 |
| 2016-11-15 | 1117.0 | 1116.125 |
...
```

A sanity check:

`avg_daily_sales`

at `2016-11-11`

= `(1074 + 1068 + 1118 + 1118) / 4 = 1,094.5`

The maximum number of rows that we have in any window is 30. In the beginning we of course we have less than 30 rows in the window, because there is no row before `2016-11-08`

.

Compare our latest chart with our original attempt with naïve `DATE_TRUNC`

.

And if you prefer monthly statistics, filter out the rows at the end of the query:

```
...
daily_avgs AS (
SELECT
sale_day
, sales
, SUM(sales) OVER (ORDER BY sale_day ROWS BETWEEN 29 PRECEDING AND
CURRENT ROW) AS avg_monthly_sales
FROM daily_sales
)
SELECT sale_day, avg_monthly_sales
FROM daily_avgs
WHERE EXTRACT(DAY FROM sale_day) = EXTRACT(DAY FROM NOW() - '1 day' :: INTERVAL)
ORDER BY sale_day DESC
```

Each bar represents the sum of sales during 30-days before and including the date that is associated with the bar.

## Missing Values

Often we need to deal with gaps in our time series. In our example, rows in `daily_sales`

would be missing if there is no sales happened on that day.

So far in our generated data set we avoided zero and negative sales by padding the data with the absolute value of the minimum distance that was generated by our random walk.

In order to produce some gaps, let's first decrease this padding to half `+ (SELECT ABS(ROUND(MIN(pos)/2)) FROM noise)`

and then filter out all the rows with zero or negative sales: `SELECT * FROM daily_sales_1 WHERE sales > 0`

:

```
WITH noise AS (
SELECT
step_id
, DATE_TRUNC('day', NOW()) - (step_id || ' day') :: INTERVAL as sale_day
, SUM(step_size) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pos
FROM (
SELECT 0 as step_id, 0 as step_size
UNION
SELECT step_id, (floor((random() * 3) - 1)) * floor(random() * 100) AS step_size
FROM generate_series(1,1000) step_id
ORDER BY step_id
) _a
)
,
daily_sales_1 AS (
SELECT
sale_day
, (CASE WHEN EXTRACT(DAY FROM sale_day) < 8
THEN floor(random() * 200)
ELSE 0 END
) + (SELECT ABS(ROUND(MIN(pos)/2)) FROM noise) + pos AS sales
FROM noise
ORDER BY step_id DESC
)
,
daily_sales AS (
SELECT * FROM daily_sales_1 where sales > 0
)
,
calendar AS (
SELECT generate_series(
(SELECT min(sale_day) from daily_sales_1)
, (SELECT max(sale_day) from daily_sales_1)
, '1 day' :: INTERVAL
) as sale_day
)
SELECT
calendar.sale_day
, COALESCE(sales, 0) AS sales
, AVG(COALESCE(sales, 0)) OVER (ORDER BY calendar.sale_day ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS avg_daily_sales
FROM calendar
LEFT JOIN daily_sales ON calendar.sale_day = daily_sales.sale_day
```

This is the result:

`calendar`

CTE generates a series of dates

```
calendar AS (
SELECT generate_series(
(SELECT min(sale_day) from daily_sales_1)
, (SELECT max(sale_day) from daily_sales_1)
, '1 day' :: INTERVAL
) as sale_day
)
```

We select `sale_day`

s from `calendar`

and left join it with `daily_sales`

table in the final step.

`sales`

are `null`

for the rows that are missing in the `daily_sales`

table because of left join. That's why we use `COALESCE(sales, 0)`

to cast nulls to 0 for the missing data points.

In general null does not mean 0. So be careful. But we can cast nulls or missing data points to 0 when we are dealing with gaps in a time series.

## Discussion