zchtodd

Posted on Oct 12, 2019

#SQL30 Day 10: Random Sampling

#sql #database #postgres #challenge

Welcome to the SQL showdown series!

What is this and how does it work?

I'm committing to publishing a SQL challenge every day for 30 days. In each post I'll describe my solution to the last day's challenge. I'll follow that up with a full description of the challenge for the next day.

Write your own solution in the comments! Let's see who can come up with the most creative solutions.

I'll add connection details for a PostgreSQL database containing test data for the challenge. Solutions are by no means limited to PostgreSQL, but it's there if you want an easy way to test your query!

Challenge #10: Random Sampling

For today's challenge I'll delve into a topic that is simple on the surface, but also has a surprising amount of subtlety.

Here's the challenge itself:

Can you write a query to efficiently return a random sample of records from a table?

We'll re-visit the measurement table from the day 3 challenge. Obviously, your results will vary from mine, but the goal is to return a random sampling of records from the table.

Here's what my output looks like when grabbing 10 random records:

Sandbox Connection Details

I have a PostgreSQL database ready for you to play with.

The password is the same as the username! To query the employee table:

SELECT * FROM day10.measurement;

Solution to Challenge #7: Grouping Sets

This is the question we were trying to answer with yesterday's SQL challenge:

Given supermarket sales history, can you display sales not only by location and category, but also at the month and yearly level?

Remember, in this challenge we wanted the output to appear something like this:

That is, we wanted sales summed up to the month, but with a yearly subtotal for each location and category.

Not long ago, I probably would have reached for Python to do this kind of multi-level aggregation, but it turns out there's a perfectly good way to do this in SQL.

Group by, apparently, has a hidden super power called grouping sets.

Grouping sets allow you to group at several different levels. Let's take a look at the query that produced the output above.

SELECT   s.city, 
         s.product_line, 
         Date_part('month', sale_date) AS monthly, 
         Date_part('year', sale_date)  AS yearly, 
         Sum(total) 
FROM     day7.supermarket s 
GROUP BY grouping sets (((city, product_line), yearly, monthly), ((city, product_line), yearly)) 
ORDER BY city, 
         product_line, 
         monthly, 
         yearly;

As you can see in the output, any row produced by a set missing columns (in this case month), will have a null value for that column.

SQL also has a few shorthand notations, such as ROLLUP and CUBE, that produce commonly needed grouping sets.

I could change the above query to use ROLLUP like so:

SELECT   s.city, 
         s.product_line, 
         Date_part('month', sale_date) AS monthly, 
         Date_part('year', sale_date)  AS yearly, 
         Sum(total) 
FROM     day7.supermarket s 
GROUP BY ROLLUP ((city, product_line), yearly, monthly)
ORDER BY city, 
         product_line, 
         monthly, 
         yearly;

This would actually be equivalent to the following sets:

((city, product_line), yearly, monthly): Sales broken down to the month.
((city, product_line), yearly)): Sales broken down to the year.
((city, product_line)): All sales for each city/product line.
(): All sales, for all time, for all city/product combinations.

Good luck!

Have fun, and I can't wait to see what you come up with! I'll be back tomorrow with a solution to this problem and a new challenge.

Oldest comments (1)

zchtodd • Oct 12 '19

Somehow I ended up overwriting the solution for day 9 with the solution to day 7! I'm not sure how that happened.

My apologies!

I had an entire explanation written up, but for now at least here is the query that solves day 9:

WITH streak_groups AS (
  SELECT
    s2.*,
    sum(price_fell) OVER (
      ORDER BY
        day
    ) AS streak_group
  FROM
    (
      SELECT
        s1.*,
        CASE
          WHEN s1.close > s1.prior_close THEN 0
          ELSE 1
        END AS price_fell
      FROM
        (
          SELECT
            day9.stockprice.*,
            lag(close, 1) OVER (
              ORDER BY
                day
            ) AS prior_close
          FROM
            day9.stockprice
        ) AS s1
    ) AS s2
)
SELECT
  count(*) AS consecutive_days,
  min(close) AS min_close,
  max(close) AS max_close,
  min(day) AS start_date,
  max(day) AS end_date
FROM
  streak_groups
GROUP BY
  streak_group
ORDER BY
  consecutive_days DESC;