DEV Community


Discussion on: One WEIRD Trick for Speeding Up ORDER BY That You Probably Shouldn't Use

micheloe • Edited

It sounds like the data in the activities table is skewed: the number of records per unique account varies widely. This makes the outcome of the LIMIT clause, which is applied only after the filter criteria, very unpredictable. I also suspect that the index on account_id is not very selective, meaning there are many more activities than accounts. A query optimizer is therefore unlikely to choose this index, since a lookup will always return many rows, unless there is no other option.
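One quick way to confirm the skew (assuming the table and column names from the article, activities and account_id) is to look at the row counts per account:

```sql
-- Rows per account; a wide spread between the top and bottom of this
-- list indicates heavy skew. Table/column names are my assumptions.
SELECT account_id, count(*) AS n
FROM activities
GROUP BY account_id
ORDER BY n DESC
LIMIT 20;
```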

Which brings me to the + 0. :-) Adding + 0 in the ORDER BY clause effectively disables the use of the index on the primary key; the same goes for anything other than a bare column name in the WHERE or ORDER BY clause. Why? Because the actual value must be fetched before the computed value is known, and a query optimizer cannot know this information beforehand: it has to pick a query plan before actually executing it. The same applies to using functions in the WHERE or ORDER BY clause. + 0 of course just yields the original value, but the query optimizer cannot know that.
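If I understood the article correctly, the query shape is roughly this (table and column names are my assumptions):

```sql
-- The planner can walk the primary-key index backwards to satisfy this:
SELECT *
FROM activities
WHERE account_id = 123
ORDER BY id DESC
LIMIT 10;

-- Adding + 0 turns the sort key into an expression, so the primary-key
-- index no longer matches the ORDER BY and the planner falls back to
-- the account_id index plus an explicit sort:
SELECT *
FROM activities
WHERE account_id = 123
ORDER BY id + 0 DESC
LIMIT 10;
```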

There are several approaches to mitigate the issue at hand. The ones I can think of (though I don't know whether they all apply to PostgreSQL) are:

  1. Make sure the query optimizer has information about the data distribution (skew) by creating statistics with histograms. The challenge with this mitigation is that the INSERT rate is quite high, so the statistics might go stale (too) fast. Also, the query shows you're interested in the latest data, which makes this approach even more problematic.

  2. You could try changing the index on the account_id column to a compound index with the primary key column as the secondary key and see if that helps. This would add more storage, though, increase INSERT/UPDATE load (as a more complex index now needs updating), and might have negative side effects on other queries, so it should be examined with care before going to production.

  3. You could opt for creating a table partitioned by account_id, which lets the optimizer handle the data skew better, as statistics can be created for each partition individually. The downside, obviously, is more administrative overhead and a not-so-straightforward setup.
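Rough PostgreSQL sketches of the three mitigations above (untested; table, column, and index names are my assumptions):

```sql
-- 1. Larger statistics target -> finer histogram for account_id:
ALTER TABLE activities ALTER COLUMN account_id SET STATISTICS 1000;
ANALYZE activities;

-- 2. Compound index: the planner can then satisfy both the filter on
--    account_id and an ORDER BY on the primary key from one index:
CREATE INDEX CONCURRENTLY idx_activities_account_id_id
    ON activities (account_id, id);

-- 3. Hash partitioning by account_id (PostgreSQL 11+); the primary key
--    of a partitioned table must include the partition key:
CREATE TABLE activities_part (
    id         bigserial,
    account_id bigint NOT NULL,
    -- ...other columns...
    PRIMARY KEY (account_id, id)
) PARTITION BY HASH (account_id);

CREATE TABLE activities_part_0 PARTITION OF activities_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ...and likewise for remainders 1 through 3.
```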

There are probably other options, but I'm currently sleep-deprived. :-)

Cassidy Scheffer (she/her) Author • Edited

These are great tips! Partitioning is the end goal for this table, but we need to do some prep work first.

Interestingly, after running the experiment for a couple of days, the + 0 version is much faster both on average and at p95. It seems to be a good stopgap while we work on partitioning, but I want to try the compound index you mentioned first.