Welcome to the SQL showdown series!
What is this and how does it work?
I'm committing to publishing a SQL challenge every day for 30 days. In each post I'll describe my solution to the previous day's challenge, then follow up with a full description of the next day's challenge.
Write your own solution in the comments! Let's see who can come up with the most creative solutions.
I'll add connection details for a PostgreSQL database containing test data for the challenge. Solutions are by no means limited to PostgreSQL, but it's there if you want an easy way to test your query!
Challenge #2: Stock Price Variance
In this challenge we'll use a small part of the huge stock market dataset from Kaggle. I've loaded the daily trading history for a single company to keep things manageable.
The question for today is:
Can you output the day's closing price, the previous day's closing price, the variance between the two, and the average closing price for the current month?
The stockprice table has the following columns:
- day
- open
- high
- low
- close
- volume
Here's an example to give you a better idea of the output you're after:
Sandbox Connection Details
I have a PostgreSQL database ready for you to play with.
The password is the same as the username! To query the stockprice table:
SELECT * FROM day2.stockprice;
Solution for Challenge #1
This is the question we were trying to answer with yesterday's SQL challenge:
What are the details of the largest recorded wildfires (by acreage) for each potential cause of wildfires?
At first you might think: let's do a GROUP BY and a max, right? Job's done, boss. Well, almost. We also wanted to return the details of each individual fire.
I'm sure you've seen this issue before if you're an old hand at SQL. The problem is that we can't do this:
SELECT w.*, max(w.acres) FROM day1.wildfire w GROUP BY cause;
At least in PostgreSQL, we're greeted with this lovely message:
[Code: 0, SQL State: 42803] ERROR: column "w.fire_name"
must appear in the GROUP BY clause or be used in
an aggregate function
This issue comes up so often that Stack Overflow created the greatest-n-per-group tag just for it.
One possible solution involves joining to a subquery that finds the max value per cause.
SELECT * FROM day1.wildfire w1 JOIN (
    SELECT cause, max(acres) AS max_acres
    FROM day1.wildfire
    GROUP BY cause
) w2 ON w1.cause = w2.cause AND w1.acres = w2.max_acres
ORDER BY w1.acres DESC;
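Another approach you'll often see for greatest-n-per-group problems uses a window function instead of the self-join. This is just a sketch against the same table; like the join above, it keeps ties for the top acreage:

SELECT *
FROM (
    SELECT w.*,
           -- rank the fires within each cause, biggest first
           rank() OVER (PARTITION BY cause ORDER BY acres DESC) AS acre_rank
    FROM day1.wildfire w
) ranked
WHERE acre_rank = 1
ORDER BY acres DESC;

If you only ever want a single row per cause, PostgreSQL's DISTINCT ON (cause) ... ORDER BY cause, acres DESC is another tidy option.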
There are some interesting exceptions to the rules around GROUP BY. Older versions of MySQL (before the ONLY_FULL_GROUP_BY mode became the default in 5.7) would happily let you select columns that aren't part of an aggregate or the GROUP BY clause. Which row does it pull the rest of the columns from? Whichever it feels like; the documentation only promises that the values come from some row in the group.
SQLite allows these "bare" columns too, and when the query contains a single min() or max() aggregate, it guarantees that the bare-column values come from the row holding that minimum or maximum.
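As a sketch of what that looks like, assuming the same wildfire data loaded into a plain SQLite table (no day1 schema, since SQLite doesn't use schemas that way):

-- SQLite: fire_name is a bare column, so it comes from the row with max(acres)
SELECT cause, fire_name, max(acres) AS acres
FROM wildfire
GROUP BY cause
ORDER BY acres DESC;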
Good luck!
Have fun, and I can't wait to see what you come up with! I'll be back tomorrow with a solution to this problem and a new challenge.
Top comments (5)
I've been travelling this weekend, so I'm late to the game. Here's my answer before I go check out day three and see if we get the same result.
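(One way such an answer might look, as a sketch assuming day is a date column, using lag() for the previous close and a month-partitioned avg():)

SELECT day,
       close,
       lag(close) OVER (ORDER BY day)                          AS previous_close,
       close - lag(close) OVER (ORDER BY day)                  AS variance,
       avg(close) OVER (PARTITION BY date_trunc('month', day)) AS monthly_avg_close
FROM day2.stockprice
ORDER BY day;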
Almost identical to what I ended up doing, except that I did the month extraction inside a CTE (something like the sketch below). Using a CTE to avoid repeating an expression can sometimes be a bad idea, especially with PG, where I believe all CTEs are materialized. So I probably paid dearly for that little convenience!
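(A sketch of that CTE-based variant, assuming the same lag()/avg() approach and that day is a date column:)

WITH monthly AS (
    SELECT day,
           close,
           date_trunc('month', day) AS month
    FROM day2.stockprice
)
SELECT day,
       close,
       lag(close) OVER (ORDER BY day)         AS previous_close,
       close - lag(close) OVER (ORDER BY day) AS variance,
       avg(close) OVER (PARTITION BY month)   AS monthly_avg_close
FROM monthly
ORDER BY day;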
I hear, though, that PG 12 has made some improvements with regard to CTE performance.
Yes! I am very excited to be able to use CTEs without fear. The release notes for PG12 are out now - postgresql.org/about/news/1976/
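For what it's worth, PostgreSQL 12 also lets you control this explicitly: a CTE like the one above can be forced inline with NOT MATERIALIZED (a sketch, PG 12+ only):

WITH monthly AS NOT MATERIALIZED (
    SELECT day, close, date_trunc('month', day) AS month
    FROM day2.stockprice
)
SELECT day,
       close,
       avg(close) OVER (PARTITION BY month) AS monthly_avg_close
FROM monthly
ORDER BY day;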
I ended up doing the same as Helen and the author. For variety, here's a variant that uses CTEs, but EXPLAIN ANALYSE says it's slightly (10%) slower.

Being from a stats background, I found your use of the term "variance" somewhat confusing; I thought you were after something more than just subtraction.