loading...
Cover image for Comparing aggregated data to unaggregated data in SQL

Comparing aggregated data to unaggregated data in SQL

chidioguejiofor profile image Chidiebere Ogujeiofor ・5 min read

Depending on the type of questions we want to answer there might be times when we need to compare aggregated data to unaggregated data in SQL.

This is where the fact that the GROUP BY clause forces us to 'squeeze' all unaggregated columns into one row can be a headache.

In this article, you would see how a window function can make this quite easy.

The Big question

Suppose we wanted to answer the following question using our COVID Dataset:

How does each case recorded in Nigeria from March 2020 to June 2020 compared with the maximum case recorded within the same time frame.

We can answer this question by returning a table that has the recorded case for each day and the maximum case recorded in a day within the time frame. Something like this:

date country confirmed_today max_value_in_timeframe
2020-03-01 Nigeria value1 max_value_in_result_set
2020-03-02 Nigeria value2 max_value_in_result_set
2020-03-03 Nigeria value3 max_value_in_result_set
2020-03-04 Nigeria value4 max_value_in_result_set

This would allow us to easily plot graphs that help us with the comparison. Let's see how we can solve this


The Inefficient solution

We can generate this result using a subquery in the SELECT clause. This subquery would calculate the maximum value and return that value for each row in our table. See the code below:


SELECT
  date,
  countries_and_territories as country,
  daily_confirmed_cases AS confirmed_today,
  (
    SELECT
      MAX(daily_confirmed_cases)
    FROM
      `bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`
    WHERE
      geo_id = 'NG'
      AND date >= '2020-03-01'
      AND date < '2020-06-01'
  ) as highest_case_recorded_in_a_day
FROM
  `bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`
WHERE
  geo_id = 'NG'
  AND date >= '2020-03-01'
  AND date < '2020-06-01';

This returns the following results:

date country confirmed_today highest_case_recorded_in_a_day
2020-03-01 Nigeria 0 533
2020-03-02 Nigeria 0 553
2020-03-10 Nigeria 1 553
2020-03-15 Nigeria 0 553
2020-03-16 Nigeria 0 553
2020-03-17 Nigeria 0 553
2020-03-18 Nigeria 1 553
2020-03-19 Nigeria 5 553
2020-03-20 Nigeria 0 553
2020-03-21 Nigeria 4 553
2020-03-22 Nigeria 10 553
2020-03-23 Nigeria 8 553
2020-03-24 Nigeria 10 553
2020-03-25 Nigeria 4 553
2020-03-26 Nigeria 7 553
2020-03-27 Nigeria 14 553
2020-03-28 Nigeria 16 553
2020-03-29 Nigeria 16 553
2020-03-30 Nigeria 0 553
2020-03-31 Nigeria 34 553
2020-04-01 Nigeria 0 553
2020-04-02 Nigeria 20 553
2020-04-03 Nigeria 23 553
2020-04-04 Nigeria 16 553
2020-04-05 Nigeria 20 553
2020-04-06 Nigeria 22 553
2020-04-07 Nigeria 6 553
2020-04-08 Nigeria 16 553
2020-04-09 Nigeria 22 553
2020-04-10 Nigeria 12 553
2020-04-11 Nigeria 17 553
2020-04-12 Nigeria 13 553
2020-04-13 Nigeria 5 553
2020-04-14 Nigeria 20 553
2020-04-15 Nigeria 30 553
2020-04-16 Nigeria 34 553
2020-04-17 Nigeria 35 553
2020-04-18 Nigeria 51 553
2020-04-19 Nigeria 49 553
2020-04-20 Nigeria 85 553
2020-04-21 Nigeria 38 553
2020-04-22 Nigeria 117 553
2020-04-23 Nigeria 91 553
2020-04-24 Nigeria 108 553
2020-04-25 Nigeria 114 553
2020-04-26 Nigeria 87 553
2020-04-27 Nigeria 91 553
2020-04-28 Nigeria 64 553
2020-04-29 Nigeria 195 553
2020-04-30 Nigeria 196 553
2020-05-01 Nigeria 204 553
2020-05-02 Nigeria 238 553
2020-05-03 Nigeria 218 553
2020-05-04 Nigeria 170 553
2020-05-05 Nigeria 244 553
2020-05-06 Nigeria 148 553
2020-05-07 Nigeria 195 553
2020-05-08 Nigeria 381 553
2020-05-09 Nigeria 386 553
2020-05-10 Nigeria 239 553
2020-05-11 Nigeria 248 553
2020-05-12 Nigeria 242 553
2020-05-13 Nigeria 146 553
2020-05-14 Nigeria 184 553
2020-05-15 Nigeria 191 553
2020-05-16 Nigeria 283 553
2020-05-17 Nigeria 176 553
2020-05-18 Nigeria 338 553
2020-05-19 Nigeria 216 553
2020-05-20 Nigeria 226 553
2020-05-21 Nigeria 276 553
2020-05-22 Nigeria 339 553
2020-05-23 Nigeria 245 553
2020-05-24 Nigeria 265 553
2020-05-25 Nigeria 313 553
2020-05-26 Nigeria 229 553
2020-05-27 Nigeria 276 553
2020-05-28 Nigeria 389 553
2020-05-29 Nigeria 182 553
2020-05-30 Nigeria 387 553
2020-05-31 Nigeria 553 553

As you can see it returns exactly the shape of the data that we want in our table. Now we can go ahead and generate a chart, like the one shown below, which are easier to comprehend:

Alt Text

You can interact with the chart here


So what are the problems

The first issue is that subqueries perform calculations by running through the entire table. This simply means that the query above runs through the table twice to get the results that we got. This can affect performance

Another problem is that subqueries are quite difficult to read and maintain. It is quite difficult to grasp what the query above is doing at a glance

The window function solution

Now let's try to solve this problem using a window function.
The way we would do this is to replace the subquery part of the first query with our window function, like so:

SELECT
  date,
  countries_and_territories as country,
  daily_confirmed_cases AS confirmed_today,
  MAX(daily_confirmed_cases) OVER() as highest_case_recorded_in_a_day
FROM
  `bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`
WHERE
  geo_id = 'NG'
  AND date >= '2020-03-01'
  AND date < '2020-06-01'


The OVER() function is what starts our window function(as stated in the previous article in the series) while the MAX() gets the maximum case recorded on a particular day in the specified time frame.

The window function uses the ResultSet gotten from the WHERE part of the query thus there it does not run through the entire table like the subquery alternative.

In the end, you get a much shorter, more readable, maintainable and performant code.

Conclusion

What I have described in this article is a very simple use case of a window function. Let me know what insightful questions related to COVID-19 we could answer by comparing aggregated and unaggregated data in the comments section.

Next in this series, we would be talking about performing calculations that are dependent on the value of other rows eg. running totals, daily_increase in cases etc.

Posted on by:

chidioguejiofor profile

Chidiebere Ogujeiofor

@chidioguejiofor

I love challenging projects that make impact in society.

Discussion

pic
Editor guide