So I'm trying to gather some data from dev.to's database. Unfortunately, SQL is not my strong suit and I've been mostly working on the Ruby + Rails...
Hi @andy, I think it can be simplified.
First things first, there are a few odd things at first glance.
There's a FULL JOIN between articles and collections, but I don't think the query is interested in collections that have no corresponding articles (also because I don't think they can exist in the DB right now). A full join (also called a full outer join) returns all rows of the first table, all rows of the second table, and uses NULL to fill the gaps. So the first thing we can do is change the full join to a simple join:
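Something like this, assuming (as the rest of the thread suggests) that articles reference collections through a collection_id column:

```sql
-- Before: the full outer join also keeps collections with no matching articles,
-- filling the article columns with NULL
-- FULL JOIN collections ON collections.id = articles.collection_id

-- After: a plain (inner) join only keeps rows that match on both sides
JOIN collections ON collections.id = articles.collection_id
```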
The second odd thing is that the query joins articles with reactions and page_views but then disregards the join. How do I know you're not interested in the result of these additional joins? In the outermost SELECT, the query only retrieves fields from articles and collections.
At this point one might say: "but rhymes, the query is also counting, summing and averaging data from reactions and page_views". Yes, but this data is loaded with subqueries that are already filtering with a crucial WHERE statement, without having to join to anything. Now that we know this, we can remove the outermost joins as well.

The last odd thing is about the aliasing of the last two aggregated subqueries: they currently have the same name, so I would change the last one to something more meaningful (I'm just guessing what it does), for example:
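Something along these lines (the alias is just my guess at a descriptive name, and the column names are assumptions as well):

```sql
(SELECT AVG(page_views.time_tracked_in_seconds)
   FROM page_views
  WHERE page_views.article_id = articles.id) AS average_time_spent_in_seconds
```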
So, at this point our query is:
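The snippet itself isn't preserved here, but reconstructing it roughly from the rest of the thread it would look something like this (table, column and category names are inferred, so treat it as a sketch rather than the exact query):

```sql
SELECT articles.title,
       articles.published_at,
       articles.positive_reactions_count,
       articles.cached_tag_list,
       collections.slug,
       -- one correlated subquery per aggregate
       (SELECT COUNT(*) FROM reactions
         WHERE reactions.reactable_id = articles.id
           AND reactions.category = 'like')        AS likes_count,
       (SELECT COUNT(*) FROM reactions
         WHERE reactions.reactable_id = articles.id
           AND reactions.category = 'unicorn')     AS unicorns_count,
       (SELECT COUNT(*) FROM reactions
         WHERE reactions.reactable_id = articles.id
           AND reactions.category = 'readinglist') AS bookmarks_count,
       (SELECT SUM(page_views.counts_for_number_of_views) FROM page_views
         WHERE page_views.article_id = articles.id) AS total_page_views,
       (SELECT AVG(page_views.time_tracked_in_seconds) FROM page_views
         WHERE page_views.article_id = articles.id) AS average_time_spent_in_seconds
  FROM articles
  JOIN collections ON collections.id = articles.collection_id
 WHERE articles.user_id = :user_id  -- the given user
   AND articles.published = true;
```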
Much more readable, isn't it?
Is there more room for optimization? Certainly. Should we do it? Well, it depends on the performance. Subqueries are not super efficient but they are certainly more readable.
I'm going to leave the query plan here for reference:
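The plan itself isn't reproduced here; for reference, PostgreSQL prints it (with actual timings) when you prefix the query with EXPLAIN ANALYZE:

```sql
EXPLAIN ANALYZE
SELECT ...;  -- the query above
```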
There are a lot of "sub plans" (each subquery is indicated as a "sub plan") and at least two sequential scans we might want to investigate later on: the one on collections and the one on page views.
But let's move on from the sequential scans for now, we can maybe come back to them later.
There's a possible improvement for the query in its current state, using GROUP BY :D

What the query is essentially asking is to return a list of all published articles of the given user, with some aggregates. The word "aggregation" is a giant clue in SQL: whenever you have to aggregate data from one or multiple tables, you might want to investigate grouping.
So, what do we want to aggregate? As we've seen we want the count of all likes, of all unicorns, of all bookmarks, the total of page views and the average time spent on an article.
How to do it?
If likes, unicorns and bookmarks had been in different tables, we could have done something like this and called it a day:
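A minimal sketch of that idea, assuming hypothetical likes and unicorns tables keyed by article_id (they don't exist in the real schema; this is only to illustrate grouping):

```sql
SELECT articles.title,
       COUNT(DISTINCT likes.id)    AS likes_count,
       COUNT(DISTINCT unicorns.id) AS unicorns_count
  FROM articles
  LEFT JOIN likes    ON likes.article_id    = articles.id
  LEFT JOIN unicorns ON unicorns.article_id = articles.id
 GROUP BY articles.id, articles.title;
-- DISTINCT keeps the counts honest when the two one-to-many joins multiply rows
```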
(The LEFT JOIN is because we want to return articles without likes or unicorns, I guessed; it will add 0 to the column.) You can see it working if we simplify the real query to:
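Roughly like this, counting all reactions per article without distinguishing the category yet (column names are assumptions):

```sql
SELECT articles.title,
       COUNT(reactions.id) AS total_reactions
  FROM articles
  LEFT JOIN reactions ON reactions.reactable_id = articles.id
 WHERE articles.user_id = :user_id
   AND articles.published = true
 GROUP BY articles.id, articles.title;
```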
This is a quick verification that it's working from my local copy:
Since they are in the same table, we either need to resort to subqueries as you've done, or we need to unearth filtered aggregates. What are they?
As we've seen in the query above, aggregate expressions enable the database to compute an aggregate on a list of selected rows (i.e. counting all likes that have the requested article id). A filtered aggregate allows you to ask for an aggregate but filter it on some condition. Let's try:
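In PostgreSQL this is the FILTER clause attached to an aggregate; for example (the category values 'like' and 'unicorn' are guesses at the real ones):

```sql
SELECT articles.title,
       COUNT(reactions.id) FILTER (WHERE reactions.category = 'like')    AS likes_count,
       COUNT(reactions.id) FILTER (WHERE reactions.category = 'unicorn') AS unicorns_count
  FROM articles
  LEFT JOIN reactions ON reactions.reactable_id = articles.id
 GROUP BY articles.id, articles.title;
```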
Example result:
With this in mind we can rewrite the original query like this:
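Approximately like this (a sketch reconstructed from the thread rather than the exact query; I've kept the page_views aggregates as subselects so the joined reactions rows don't skew the sum and the average):

```sql
SELECT articles.title,
       articles.published_at,
       articles.positive_reactions_count,
       articles.cached_tag_list,
       collections.slug,
       COUNT(reactions.id) FILTER (WHERE reactions.category = 'like')        AS likes_count,
       COUNT(reactions.id) FILTER (WHERE reactions.category = 'unicorn')     AS unicorns_count,
       COUNT(reactions.id) FILTER (WHERE reactions.category = 'readinglist') AS bookmarks_count,
       (SELECT SUM(page_views.counts_for_number_of_views) FROM page_views
         WHERE page_views.article_id = articles.id) AS total_page_views,
       (SELECT AVG(page_views.time_tracked_in_seconds) FROM page_views
         WHERE page_views.article_id = articles.id) AS average_time_spent_in_seconds
  FROM articles
  LEFT JOIN collections ON collections.id = articles.collection_id
  LEFT JOIN reactions   ON reactions.reactable_id = articles.id
 WHERE articles.user_id = :user_id
   AND articles.published = true
 GROUP BY articles.id, articles.title, articles.published_at,
          articles.positive_reactions_count, articles.cached_tag_list,
          collections.slug;
```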
This is the result:
As you can see, all of the user's articles are there. This is what happens if, instead of using a LEFT JOIN, I use a normal join with the collections: since I have no articles present in collections, I get no results.
Let's see the query plan:
I would stop here; I don't have a huge dataset to test with, so you might have to run this second query and see for yourself (let me know if everything works).
(I just noticed there's a bug in the GROUP BY version, so I suggest using the subselects at the top of my comment in the meantime. I'll fix it later :D)

Found it, I missed the DISTINCT. So, the final queries:

With sub selects (the subquery version shown earlier in this comment).
With GROUP BY:
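Sketched roughly, the fix being the DISTINCT inside the counts, which guards against a reaction row being counted more than once if another one-to-many join fans out the rows:

```sql
-- same shape as the query above, with the reaction counts changed to:
COUNT(DISTINCT reactions.id) FILTER (WHERE reactions.category = 'like')        AS likes_count,
COUNT(DISTINCT reactions.id) FILTER (WHERE reactions.category = 'unicorn')     AS unicorns_count,
COUNT(DISTINCT reactions.id) FILTER (WHERE reactions.category = 'readinglist') AS bookmarks_count,
```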
As you can see, same result.
Hope this helps Andy!
Wow Rhymes, thanks for such an excellent answer!!! The query plans really help to see how they are performing. Interesting that FILTER allows you to use aggregates much faster than a subquery. Definitely have to read a book or two about SQL now!

In case you're wondering, getting 119 rows took about 4 minutes to run the GROUP BY query. We're not looking to put this in the app by any means, so the time is totally fine. Definitely helps me think about how to structure AnalyticsService.rb queries though.

Glad it helped!
I think it's because the DB is able to do it with less work by using multiple conditions on the same set of data (the joined table on the right of the join, reactions in this example), whereas with a subquery it has to execute a different subquery for each row in the articles table. Not much to optimize there.

Mmm, not great though :D How long was it taking with the subqueries? Anyhow, it's something. I have a feeling it can be further optimized by checking if and which indexes are being used, and the other idea I had (untested and unexplored) was to use window functions, but I haven't tried, nor do I know if it can be done or if it's faster :D

Good :D
Let me know if you find a different approach for that. After all you have the great advantage of having a working version that you can use as a comparison. You extract a set of data and then play with grouping until you get the same result. :D
Obviously, I'm not 100% sure since I don't have access to the table, but I will try something like this:
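The snippet itself isn't preserved; a rough guess at its shape, using conditional counts over a join (all table and column names are assumptions, and see the follow-up comments below about the group by):

```sql
SELECT articles.title,
       articles.published_at,
       articles.positive_reactions_count,
       articles.cached_tag_list,
       collections.slug,
       SUM(CASE WHEN reactions.category = 'like'        THEN 1 ELSE 0 END) AS likes_count,
       SUM(CASE WHEN reactions.category = 'unicorn'     THEN 1 ELSE 0 END) AS unicorns_count,
       SUM(CASE WHEN reactions.category = 'readinglist' THEN 1 ELSE 0 END) AS bookmarks_count
  FROM articles
  LEFT JOIN collections ON collections.id = articles.collection_id
  LEFT JOIN reactions   ON reactions.reactable_id = articles.id
 WHERE articles.user_id = :user_id
   AND articles.published = true
 -- the group by is the part the follow-up comment below points out was missing
 GROUP BY articles.id, articles.title, articles.published_at,
          articles.positive_reactions_count, articles.cached_tag_list,
          collections.slug;
```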
This query should work if the reactions table is a snapshot and not just a log.

I always love playing with weird SQL so happy to help. Feel free to reach out at francois@cloudforecast.io
Do the aggregation functions work in that query? Looks like you could add a group by at the end, like:

"group by articles.id, articles.title, articles.published_at, articles.positive_reactions_count, articles.cached_tag_list, collections.slug"
Of course not ... I forgot the group by. Thanks for pointing that out.
uhh apparently this is valid SQL and the query is running, so I'll report back in a bit. 🙈
Have y'all had any discussions about moving to NoSQL? Or does DEV not gather enough user metadata / have enough users to make that a sensible option?
Changing DB to fix a query seems a little bit overkill :D
Lol I should have mentioned that this was sort of unrelated to the original post.
I'm not sure what the use cases are for NoSQL, and because I don't know that, I wouldn't know why we would use a NoSQL database. Rails 6 will support multiple databases, so that might be something we could explore.
Just curious, why full join and not inner?