Josh Branchaud
Speeding Up An Expensive PostgreSQL Query: B-Tree vs. BRIN

I had an expensive ~20 second query on a massive ~500 million row PostgreSQL table. I brought that query down to about 2 milliseconds. The quick answer is that there was a missing index. In this post, we'll explore how we came to that conclusion and how we determined what kind of index to add.

tl;dr: the query we will look at in this post was slow because of a missing index on a created_at timestamp column. We tried separately adding a BRIN and a B-Tree index. When a BRIN index was added, there was no performance improvement. When a B-Tree index was added, however, we saw a huge performance improvement. The most expensive query dropped from ~20 seconds to ~2 milliseconds. A 10,000x speed up. The average timing of this query is now even a little lower than that.

Read on if you want more of the details.

Note: all the queries I'm about to cover are run against a locally-restored snapshot of production data. Take care when adding indexes like these in production. Do so concurrently, and be sure to monitor your app in the process.

The Data Model

First, let's look at a sufficient snapshot of the data model.

We have a massive events table containing hundreds of millions of rows. Each of those events is associated with a particular record in the users table.

Users do things and those things are recorded as events in the events table. Users do lots of things, so there are lots of events.

[Image: data model showing the events table, where each event belongs to a record in the users table via user_id]
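
To make that concrete, here is a minimal sketch of the two tables. The column lists are hypothetical; the only columns this post actually relies on are events.user_id and events.created_at.

-- Hypothetical schema sketch; only user_id and created_at on events
-- are confirmed by the queries later in this post.
create table users (
  id bigserial primary key,
  name text
);

create table events (
  id bigserial primary key,
  user_id bigint not null references users (id),
  payload jsonb,
  created_at timestamp not null
);

-- The index the table already has at the start of the story:
create index index_events_on_user_id on events (user_id);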

The Slow Query

The query that our monitoring software was flagging as slow was one where we were trying to find the two most recent events for a specific user.

select * from events
where user_id = 123
order by created_at desc
limit 2;

Some users have way more events than others, so the slowness of this query can vary pretty drastically. This particular user has over 200,000 events. It's instances like this that revealed the bottleneck in this table's design.
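
To see how skewed the distribution is, a quick aggregate does the trick. (Fair warning: this is itself an expensive query on a 500-million-row table.)

select user_id, count(*) as event_count
from events
group by user_id
order by event_count desc
limit 10;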

Let's explain analyze this query to better understand where the issue is.

explain analyze
select * from events
where user_id = 123
order by created_at desc
limit 2;

Running explain analyze with this query gives us a really good idea of what is going on because it actually executes the query. Here is what the output of that looks like:

                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1066813.50..1066813.50 rows=2 width=114) (actual time=23207.023..23207.024 rows=2 loops=1)
   ->  Sort  (cost=1066813.50..1067596.73 rows=313292 width=114) (actual time=23207.022..23207.022 rows=2 loops=1)
         Sort Key: created_at DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Bitmap Heap Scan on events  (cost=5868.59..1063680.58 rows=313292 width=114) (actual time=74.682..23098.624 rows=222272 loops=1)
               Recheck Cond: (user_id = 123)
               Rows Removed by Index Recheck: 6448758
               Heap Blocks: exact=67292 lossy=128418
               ->  Bitmap Index Scan on index_events_on_user_id  (cost=0.00..5790.26 rows=313292 width=0) (actual time=57.349..57.349 rows=222272 loops=1)
                     Index Cond: (user_id = 123)
 Planning time: 0.095 ms
 Execution time: 23207.502 ms
(12 rows)

The headline number is the Execution time at the bottom of the output -- 23207.502 ms. That's 23 seconds.

The next thing to note is where the bulk of those 23 seconds is going. Fortunately, in this explain analyze output, it stands out pretty clearly in a single area.

Notice this bit of the output where it is sorting the table based on created_at DESC.

         Sort Key: created_at DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Bitmap Heap Scan on events  (cost=5868.59..1063680.58 rows=313292 width=114) (actual time=74.682..23098.624 rows=222272 loops=1)

We can see the actual time runs from 74.682 to 23098.624 milliseconds. That's where this query is spending the vast majority of its time.

Looking at the table's indexes and the query we are running, I probably could have guessed that the issue was a missing index on created_at. But I don't want to add indexes haphazardly; I want numbers to back up the decision, and this output provides them. We are on track to fixing this.

BRIN vs. B-Tree

So, we know we need an index on created_at, but what kind of index? B-Tree is the default when the index type isn't specified, but I've heard of Postgres supporting other kinds of indexes. What about those?
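
As an aside, "B-Tree is the default" just means these two statements build the same index; the using btree clause merely spells out the default:

create index index_events_on_user_id on events (user_id);
-- ...is equivalent to:
create index index_events_on_user_id on events using btree (user_id);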

Here is what the Crunchy Data blog has to say about the BRIN index.

PostgreSQL 9.5 introduced a feature called block range indexes (aka BRIN)
that is incredibly helpful in efficiently searching over large time series
data and has the benefit of taking up significantly less space on disk than a
standard B-tree index.

"Large time series data" is exactly what we are dealing with here, and an index that takes up less space on disk sounds great. So that seems worth a try. In fact, let's start there.

With BRIN Index

I'll start by adding the index.

create index index_events_on_created_at
  on events using brin (created_at);

Note: Adding the BRIN index to a ~500 million row table took ~12 minutes on my dev machine.
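
One place where BRIN does deliver is on-disk size. pg_relation_size and pg_size_pretty are standard Postgres functions, and on a table like this a BRIN index often comes in at just a few megabytes, where a B-Tree would be gigabytes. The exact numbers will vary with your data:

select pg_size_pretty(pg_relation_size('index_events_on_created_at'));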

Now I'll re-run the query from earlier, still with explain analyze:

> explain analyze select * from events where user_id = 123 order by created_at desc limit 2;

                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1066813.50..1066813.50 rows=2 width=114) (actual time=23074.657..23074.658 rows=2 loops=1)
   ->  Sort  (cost=1066813.50..1067596.73 rows=313292 width=114) (actual time=23074.655..23074.655 rows=2 loops=1)
         Sort Key: created_at DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Bitmap Heap Scan on events  (cost=5868.59..1063680.58 rows=313292 width=114) (actual time=68.786..22975.087 rows=222272 loops=1)
               Recheck Cond: (user_id = 123)
               Rows Removed by Index Recheck: 6448758
               Heap Blocks: exact=67292 lossy=128418
               ->  Bitmap Index Scan on index_events_on_user_id  (cost=0.00..5790.26 rows=313292 width=0) (actual time=48.569..48.569 rows=222272 loops=1)
                     Index Cond: (user_id = 123)
 Planning time: 2.097 ms
 Execution time: 23075.473 ms
(12 rows)

Unfortunately, that output looks almost identical to the one from before we added an index. Notice that the plan is unchanged: the planner is still doing a Bitmap Index Scan on index_events_on_user_id and never touches the BRIN index at all. One likely explanation is that a BRIN index only stores summary metadata about each range of physical blocks, so it cannot return rows in sorted order, which is exactly what this order by created_at desc limit 2 query needs. Whatever the precise reason, the important thing is that we measured. We know that, at least for this data set and this query, a BRIN index isn't the way forward.
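
If you're weighing BRIN for a case like this, one quick diagnostic is the physical correlation between the column's values and the on-disk row order, which BRIN depends on to be selective. pg_stats is a standard Postgres view; values near 1.0 (or -1.0) mean the rows are laid out in nearly sorted order:

select correlation
from pg_stats
where tablename = 'events'
  and attname = 'created_at';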

Since we don't want the BRIN index, let's drop it and then give B-Tree a try.

drop index index_events_on_created_at;

With B-Tree Index

create index index_events_on_created_at
  on events(created_at);

Note: Adding the B-Tree index to a ~500 million row table took ~8 minutes on my dev machine.

And again, we'll run our slow query and see how it performs.

> explain analyze select * from events where user_id = 123 order by created_at desc limit 2;

                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..535.62 rows=2 width=114) (actual time=0.056..1.951 rows=2 loops=1)
   ->  Index Scan Backward using index_events_on_created_at on events  (cost=0.57..83812631.72 rows=313292 width=114) (actual time=0.055..1.950 rows=2 loops=1)
         Filter: (user_id = 123)
         Rows Removed by Filter: 4819
 Planning time: 0.116 ms
 Execution time: 2.000 ms
(6 rows)

Besides the fact that the query returns immediately, I can tell by the change in shape and height of the query plan that we're on to something. The execution time has dropped to 2.000 ms as well. The second line of the query plan shows that the new index is being used.

   ->  Index Scan Backward using index_events_on_created_at on events  (cost=0.57..83812631.72 rows=313292 width=114) (actual time=0.055..1.950 rows=2 loops=1)

That's more like it. This backs up the decision to add a B-Tree index for the created_at column in production. As I mentioned at the beginning of the post, when adding an index like this in any production setting, but especially on a table of this size, it is important to do so concurrently and to monitor your app's health metrics during the process.
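
For reference, the concurrent version of the statement looks like this. create index concurrently is standard PostgreSQL; one caveat is that it cannot run inside a transaction block, so keep it out of any migration wrapper that opens one:

-- Builds the index without blocking concurrent writes to events.
-- Note: cannot run inside a transaction block.
create index concurrently index_events_on_created_at
  on events (created_at);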


Cover image credit: Kiyoshi on Unsplash

Latest comments (1)

nenea-nelu • Edited

Since the query is "order by created_at desc", wouldn't it make sense to create the index in the same DESC order? Data is often accessed in its natural order, meaning we need the latest records rather than the oldest, yet indexes are created in the default ASC order rather than DESC.