I was digging into a performance issue at work recently and found THE WEIRDEST way of speeding up a query that uses ORDER BY. When testing against our replica database, I was getting results up to 100 times faster than before.
"Too good to be true!" I thought. Well, it turns out that it is sort of too good to be true.
The WEIRD Trick
Here's an example query that was having issues. We're selecting some records and ordering by ID.
SELECT *
FROM activities
WHERE account_id IN (999102,989987,809354,100964)
ORDER BY id DESC
LIMIT 10;
Here's an EXPLAIN plan from pganalyze. Look at that! We're spending almost 9 seconds scanning the index for activities.id
🎯
This was baffling to me! We're using an index, why is this slow?!
As any good developer does, I went digging through Stack Overflow for answers. Unfortunately, I couldn't find the original answer I was looking at again, but someone had suggested adding + 0 to the ORDER BY clause.
WHAT?! So I tried it.
EXPLAIN ANALYZE
SELECT *
FROM activities
WHERE account_id IN (999102,989987,809354,100964)
ORDER BY id + 0 DESC
LIMIT 10;
Limit (cost=32383.82..32383.84 rows=8 width=149) (actual time=71.250..71.252 rows=8 loops=1)
-> Sort (cost=32383.82..32410.22 rows=10557 width=149) (actual time=71.249..71.250 rows=8 loops=1)
Sort Key: ((id + 0)) DESC
Sort Method: top-N heapsort Memory: 26kB
-> Index Scan using index_activities_on_account_user_id on activities (cost=0.57..32172.68 rows=10557 width=149) (actual time=2.292..71.154 rows=132 loops=1)
Index Cond: (account_user_id = ANY ('{999102,989987,809354,100964}'::integer[]))
Planning Time: 0.122 ms
Execution Time: 71.279 ms
🤯 71.279 ms!!!!!
What in the ACTUAL is happening here?
Well, first, you should know that this table is pretty big. There are just under 100,000,000 records in it. AND, you should know that in some cases, this query can return LOTS of records without the LIMIT in there. In the EXPLAIN plan screenshot above, you can see that the query found just under 9,000 records, but couldn't apply the LIMIT until they were sorted.
Another thing to know here is that this table is busy. We add almost 1,000,000 records a day, which means that those 9,000 records have IDs that are pretty far apart. There are big gaps between the numbers returned in the ID column.
Imagine this query returned 100 records, but the IDs were more than 100 apart, e.g., IDs 100, 200, 300, etc. When Postgres sorts these records using the id index, it does a reverse scan through that index and has to read EVERY entry it passes, including all the ones that don't match the query, until it has found enough matching rows to satisfy the LIMIT.
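If you want to poke at this yourself instead of squinting at a pganalyze screenshot, something like the following should reproduce the plan in psql (a sketch, not output from the original investigation); the BUFFERS option reports how many index and table pages the scan actually touches.
-- Same query as before, wrapped in EXPLAIN; the "Buffers: shared hit/read"
-- lines in the output make the cost of walking the id index backward visible.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM activities
WHERE account_id IN (999102,989987,809354,100964)
ORDER BY id DESC
LIMIT 10;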
So what is this magic + 0 doing? Postgres can't use the id index to produce rows in id + 0 order, so adding + 0 to the ORDER BY clause forces it to fetch the matching records (via the account index), load them into memory, and sort them there, which is way faster here. You can see in the EXPLAIN plan above that the in-memory sort was super fast.
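A side note on "load the records into memory": because of the LIMIT 10, the sort is a top-N heapsort that only ever holds the ten current best rows, which is why the plan above reports a mere 26kB. If a sort like this ever did spill to disk, the setting that governs it is work_mem; a quick, purely illustrative way to check and temporarily adjust it:
-- work_mem caps how much memory a single sort or hash step may use before
-- it spills to disk; both commands are stock PostgreSQL.
SHOW work_mem;
SET work_mem = '64MB';  -- session-level override, illustrative only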
Why You Probably Shouldn't Do This in Production
This really depends on your data. I knew that only a small percentage of users were experiencing really slow requests due to this query, but I didn't know what the impact of the + 0 magic would be on users that have fewer activities records. So I ran an experiment using GitHub's Scientist gem.
The results of the experiment are still coming in, but so far, the original ORDER BY is almost tied with ORDER BY id + 0 for average execution time. However, the 95th percentile execution times (meaning 5% of queries are slower than this number) are way out of sync! ORDER BY id + 0 is almost twice as slow in the 95th percentile category.
The results of the experiment tell me that the original ORDER BY with an index scan might cause timeouts for some users, but ORDER BY id + 0 with no index scan seems to cause slower queries overall for most users.
Looks like I'm going to have to find a different solution to this problem.
Conclusion
Adding + 0 to your ORDER BY id is a wacky trick that can speed up queries when the index is HUGE and the IDs returned have big gaps between them.
However, it won't always work. Indexes are still fast! Test your queries safely in production to make sure you're not creating a worse experience overall just to fix a few slow queries.
Top comments (4)
It sounds like the data in the activities table is skewed, basically meaning that every unique account has a very different number of records in the activities table. This makes the outcome of the LIMIT clause, which is applied only after the filter criteria, very unpredictable. I also suspect that the index on account_id is not very selective, meaning there are many more activities than there are accounts. This means a query optimizer isn't likely to choose this index, as a lookup result will always contain many rows, unless there is no other option.
Which brings me to the + 0. :-) Adding + 0 in the ORDER BY clause effectively disables the use of the index on the primary key; this goes for anything other than just the column name in the WHERE or ORDER BY clause. Why? Because now the actual value needs to be fetched first, before the calculated value is known. A query optimizer cannot know this information beforehand: it has to calculate the most optimal query resolution path before actually executing it. The same goes for using functions in the WHERE or ORDER BY clause. + 0 of course just means to use the original value, but the query optimizer cannot know this.
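To make that concrete: a b-tree index is only considered for the exact column or expression it was built on, so the + 0 trick works precisely because no index provides the ordering of that expression. Something like the following (index name made up for the example) would give the planner an ordered path for id + 0 again:
-- An expression index on ((id + 0)); a plain index on (id) cannot serve
-- ORDER BY id + 0, but this one can, which would defeat the trick.
CREATE INDEX activities_id_plus_zero_idx ON activities ((id + 0));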
There are several approaches to mitigate the issue at hand. The ones I can think of (though I don't know whether they all apply to PostgreSQL) are:
Make sure the query optimizer has information about the data distribution (skew) by creating statistics with histograms (see the sketch after this list). The challenge with this mitigation is that the INSERT rate is quite high and statistics might get stale (too) fast. Also, the query shows you're interested in the latest data, which makes this approach more problematic.
You could try changing the index on the account_id column to a compound index with the primary key column as secondary and see if it helps. This would add more storage though, increase INSERT/UPDATE load (as now a more complex index needs to get updated) and might have a negative side-effect on other queries that are run, so this should be examined with care before taking it into production.
You could opt for creating a table partitioned by account_id, which allows the optimizer to better know the data skew, as statistics can be created for every partition individually (see the sketch after this list). The downside obviously is more administrative overhead and the not-so-straightforward setup.
There are probably other options, but I'm currently sleep-deprived. :-)
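For concreteness, a rough PostgreSQL-flavoured sketch of the first and third suggestions might look something like this (column names follow the post; the statistics target and partition count are arbitrary):
-- 1. Richer per-column statistics (histogram/most-common values) for account_id,
--    followed by a refresh. Autoanalyze has to keep up with the insert rate.
ALTER TABLE activities ALTER COLUMN account_id SET STATISTICS 1000;
ANALYZE activities;

-- 3. A new table partitioned by account_id, so statistics are kept per partition.
--    Hash partitioning is used here purely as an example.
CREATE TABLE activities_by_account (
    id         bigint NOT NULL,
    account_id bigint NOT NULL
    -- ...the remaining activity columns...
) PARTITION BY HASH (account_id);

CREATE TABLE activities_by_account_p0 PARTITION OF activities_by_account
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ...and likewise for REMAINDER 1 through 3.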
These are great tips! Partitioning is the end goal for this table, but we need to do some prep work first.
Interestingly, after running the experiment for a couple of days, the + 0 version is much faster both on average and at the 95th percentile. It seems to be a good stopgap while we work on partitioning, but I want to try out the compound index you mentioned first.
This is very common query planner behavior. The best solution in my opinion is to create an index using filter column + order column:
CREATE INDEX ON activities(account_id, id)
In my case, 30+ second queries start executing in 1 ms or less. I did the same test on MS SQL Server and SQLite... only Postgres has this behavior.
A generic solution is to change ORDER BY id to ORDER BY coalesce(id). It works regardless of the id type, for instance if it's a UUID, which cannot do + 0.