Chidiebere Ogujeiofor

Posted on Sep 20, 2020

Relative calculations 1: LAG and LEAD

#database #sql #datascience

In this post, we are going to continue our discussion by looking at calculations that involve getting the value from the previous/next row. We would see how using ORDER BY clause in an SQL window function can make these calculations pretty easy and possible.

Let's dive right in.

Our Big Question

One particularly important insight is the increase is the way the number of cases has been increasing. Suppose we had the following question:

What is the increase in cases recorded in the UK for each day in the month of June?

Assuming we had access to only the total number of confirmed cases one way we could answer this to generate a table where each row gives us the difference between the case recorded today and the case recorded previous day. Soemthing like this:

date	total_confirmed_cases	daily_case_recorded
2020-06-01	v₁	0
2020-06-02	v₂	v₂ - v₁
2020-06-03	v₃	v₃ - v₂
2020-06-04	v₄	v₄ - v₃
.	.	.
.	.	.
.	.	.
n	v_n	v_n - v_n-1

Notice how the value in each row in the total_confirmed_cases column uses the value from the previous column to calculate its value.

Note that we assumed that we had access to only the total_confirmed_cases. Google COVID Dataset already gives us the daily_confirmed_cases which is the same as what we want to calculate.

In order to solve this, we would use the LAG aggregate function and ORDER BY in our window function.

The ORDER BY in a window function

Using the ORDER BY as an argument in the window function specifies how you want the window function to sort the rows in the ResultSet while performing calculations.

For example, in order to answer the question we have above, it is important that we sort the result by date so that when getting the value from the previous row, we would be getting the confirmed case from the previous day.

Using the ORDER BY clause makes it possible for us to specify that.

We can use the ORDER BY like so:


OVER(ORDER BY column_1, column_2, ..., column_n [ASC|DESC])

LAG aggregate function

The LAG function is a special aggregate function the works only when used with window functions. It takes a column as an argument and returns the previous value of that column in the ResultSet.

We use the LAG function with the ORDER BY clause like so:

LAG(column_name) OVER(ORDER BY column_1, column_2, ..., column_n {ASC|DESC|)

The SQL QUERY

Combining these together we can answer our Big question using the following query.

SELECT
  date,
  countries_and_territories as country,
  confirmed_cases,
  confirmed_cases - LAG(confirmed_cases) OVER(
    ORDER BY
      date
  ) as daily_increase
FROM
  `bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`
WHERE
  date >= "2020-06-01"
  AND date <= "2020-06-30"
  AND geo_id = "UK";

This filters the resultset and generates the increase in a number of cases using:

  confirmed_cases - LAG(confirmed_cases) OVER(
    ORDER BY
      date
  ) as daily_increase

The window function orders by date while the LAG(confirmed_cases) returns cases confirmed for the previous row.

This generates the following table.

date	country	confirmed_cases	daily_increase
2020-06-01	United_Kingdom	274762	NULL
2020-06-02	United_Kingdom	276332	1570
2020-06-03	United_Kingdom	277985	1653
2020-06-04	United_Kingdom	279856	1871
2020-06-05	United_Kingdom	281661	1805
2020-06-06	United_Kingdom	283311	1650
2020-06-07	United_Kingdom	284868	1557
2020-06-08	United_Kingdom	286194	1326
2020-06-09	United_Kingdom	287399	1205
2020-06-10	United_Kingdom	289140	1741
2020-06-11	United_Kingdom	290143	1003
2020-06-12	United_Kingdom	291409	1266
2020-06-13	United_Kingdom	292950	1541
2020-06-14	United_Kingdom	294375	1425
2020-06-15	United_Kingdom	295889	1514
2020-06-16	United_Kingdom	296857	968
2020-06-17	United_Kingdom	298136	1279
2020-06-18	United_Kingdom	299251	1115
2020-06-19	United_Kingdom	300469	1218
2020-06-20	United_Kingdom	301815	1346
2020-06-21	United_Kingdom	303110	1295
2020-06-22	United_Kingdom	304331	1221
2020-06-23	United_Kingdom	305289	958
2020-06-24	United_Kingdom	306210	921
2020-06-25	United_Kingdom	306862	652
2020-06-26	United_Kingdom	307980	1118
2020-06-27	United_Kingdom	309360	1380
2020-06-28	United_Kingdom	310250	890
2020-06-29	United_Kingdom	311151	901
2020-06-30	United_Kingdom	311965	814

One thing that you may have noticed is that the first value in the table is NULL.

Remember what we said in previous articles that window functions use the ResultSet generated after filtering in the WHERE clause, well, this is one of the consequences.

At the first instance, the LAG function returns the value of the specified column in the previous row but the first row does not have a previous row. Therefore, we get NULL( which means non-existent).

But we know that the confirmed case for the first day of July exists, so how do we get that?

One quick way is to modify the filter in the WHERE clause so that it returns data for a day before July 1st like so:

WHERE
  date >= "2020-05-31"
  AND date <= "2020-06-30"
  AND geo_id = "UK";

Then we omit the first row in our resultset by adding OFFSET 1. The new query would now become

SELECT
  date,
  countries_and_territories as country,
  confirmed_cases,
  confirmed_cases - LAG(confirmed_cases) OVER(
    ORDER BY
      date
  ) as daily_increase
FROM
  `bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`
where
  date >= "2020-05-31"
  AND date <= "2020-06-30"
  AND geo_id = "UK"
LIMIT
  31 OFFSET 1;

This would return the correct data for the first row

date	country	confirmed_cases	daily_increase
2020-06-01	United_Kingdom	274762	1936
2020-06-02	United_Kingdom	276332	1570
2020-06-03	United_Kingdom	277985	1653
2020-06-04	United_Kingdom	279856	1871
2020-06-05	United_Kingdom	281661	1805
2020-06-06	United_Kingdom	283311	1650
2020-06-07	United_Kingdom	284868	1557
2020-06-08	United_Kingdom	286194	1326
2020-06-09	United_Kingdom	287399	1205
2020-06-10	United_Kingdom	289140	1741
2020-06-11	United_Kingdom	290143	1003
2020-06-12	United_Kingdom	291409	1266
2020-06-13	United_Kingdom	292950	1541
2020-06-14	United_Kingdom	294375	1425
2020-06-15	United_Kingdom	295889	1514
2020-06-16	United_Kingdom	296857	968
2020-06-17	United_Kingdom	298136	1279
2020-06-18	United_Kingdom	299251	1115
2020-06-19	United_Kingdom	300469	1218
2020-06-20	United_Kingdom	301815	1346
2020-06-21	United_Kingdom	303110	1295
2020-06-22	United_Kingdom	304331	1221
2020-06-23	United_Kingdom	305289	958
2020-06-24	United_Kingdom	306210	921
2020-06-25	United_Kingdom	306862	652
2020-06-26	United_Kingdom	307980	1118
2020-06-27	United_Kingdom	309360	1380
2020-06-28	United_Kingdom	310250	890
2020-06-29	United_Kingdom	311151	901
2020-06-30	United_Kingdom	311965	814

Now we can generate our chart:

You can interact with the chart by following this link.

More on LAG, intro to LEAD

The LAG function takes an optional integer as value as its second argument this integer indicates how many rows back we want to get. This number is 1 by default, that's why we get the previous row( 1 row before the current row).

Similar to the LAG function is the LEAD function which takes the value of the next row. It has the same constraints and syntax of the LAG function.

So LEAD(col_name, 2) indicate we want to get a value 2 rows after the current row from col_name while LEAD(col_name) indicate we want the next row value.

Conclusion

In this article, we've learnt how to perform a calculation that involves getting a single value from a row before or after our current row.

That's great but what if we want to perform an aggregation based on more than one row before or after our current row? We would see how to do that using ROWS PRECEDING and ROWS FOLLOWING in the next article of our series.

See you soon.

DEV Community