Franck Pachot for YugabyteDB

Index Filtering in PostgreSQL and YugabyteDB (Index Scan instead of Index Only Scan)

This is inspired by Henrietta Dombrovskaya's recent post (Can we use this index, please? – Why not?), which shows an SQL optimization that requires the user to rewrite the query in PostgreSQL. It is also a small counter-example to my general assertion that seeking the fully covering index, and an Index Only Scan, isn't always necessary, as I explained in Covering Index nuances: which columns to cover (WHERE, ORDER BY, LIMIT, SELECT)?
In this scenario, with PostgreSQL, an Index Only Scan was necessary to optimize the filtering, even though the index includes all columns used by the filter condition.
Additionally, I will demonstrate that in some cases YugabyteDB performs better by filtering directly on the index entries, even during an Index Scan, so a smaller index can be good enough.

Example with an immutable function (EXTRACT on TIMESTAMP)

I'll start with an example that diverges from Hettie's case. For the demonstration with immutable function predicates, I use TIMESTAMP instead of TIMESTAMPTZ, although TIMESTAMPTZ is often recommended for storing points in time as they relate to a time zone.

PostgreSQL with an immutable function

I created the following table where columns "category" and "created_at" are indexed:

create table demo (
 id bigserial
 ,category   int
 ,created_at timestamp
 ,filler text default lpad('x',9999,'x')
)
;

insert into demo(category,created_at)
 select 0, now() + generate_series(0,99999) * interval '1 minute'
;

create index on demo ( category, created_at )
;

vacuum analyze demo
;


This table is stored in 2223 heap pages and 388 index pages, as visible from the query planner statistics and also when scanning them with explain (analyze, buffers):

postgres=> select relname, relpages, reltuples from pg_class where oid in('demo'::regclass,'demo_category_created_at_idx'::regclass);
           relname            | relpages | reltuples
------------------------------+----------+-----------
 demo                         |     2223 |    100000
 demo_category_created_at_idx |      388 |    100000
(2 rows)

postgres=> explain (analyze, costs off, summary off, buffers)
select * from demo;
                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on demo (actual time=0.008..5.409 rows=100000 loops=1)
   Buffers: shared hit=2223

postgres=> explain (analyze, costs off, summary off, buffers)
select category, created_at from demo;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Index Only Scan using demo_category_created_at_idx on demo (actual time=0.046..7.013 rows=100000 loops=1)
   Heap Fetches: 0
   Buffers: shared hit=389
(3 rows)


The following query reads rows for one "category" within a range of "created_at", not very selective, and an additional filter on the hour and minute, which is very selective:

postgres=> explain (analyze, costs off, summary off, buffers)
select demo.* from demo
 where category=0
 and created_at between '01-01-2025' and '12-31-2025'
 and extract (hour from created_at)=16 and extract (minute from created_at)=15
;
                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using demo_category_created_at_idx on demo (actual time=0.172..25.991 rows=70 loops=1)
   Index Cond: ((category = 0) AND (created_at >= '2025-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2025-12-31 00:00:00'::timestamp without time zone))
   Filter: ((EXTRACT(hour FROM created_at) = '16'::numeric) AND (EXTRACT(minute FROM created_at) = '15'::numeric))
   Rows Removed by Filter: 99930
   Buffers: shared hit=2609

The result is 70 rows, and all the WHERE clause predicates can be applied to the index entries, which include "category" and "created_at" in their key. So I would expect at most 70 additional buffers read from the table to get the non-indexed columns for the "SELECT *" projection. Seventy rows accessed by CTID should be fast and should not require an Index Only Scan. However, this query read most of the table: 2609 pages from the shared buffers.

The reason is visible in the execution plan: the predicate on the hour and minute extracted from "created_at" appears in Filter, which is applied after the Index Scan, once the table row has been read. Its impact is visible in Rows Removed by Filter: 99930 rows were fetched from the table only to be discarded by this filter.
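
As a quick sanity check (my addition to the scenario), the following query should confirm those numbers: the index condition qualifies all 100000 rows, and only 70 of them pass the hour/minute predicate.

select count(*) as rows_in_range
     -- rows that also pass the very selective hour/minute predicate
     , count(*) filter (
         where extract (hour from created_at)=16 and extract (minute from created_at)=15
       ) as rows_after_filter
 from demo
 where category=0
 and created_at between '01-01-2025' and '12-31-2025'
;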

Note that a Bitmap Scan execution plan separates the access to the index from the access to the heap, which makes it more visible that the Filter is applied on the Bitmap Heap Scan and not on the Bitmap Index Scan:

postgres=> explain (analyze, costs off, summary off, buffers)
select demo.* from demo
 where category=0
 and created_at between '01-01-2025' and '12-31-2025'
 and extract (hour from created_at)=16 and extract (minute from created_at)=15
;
                                                                                                                                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on demo (actual time=4.271..23.077 rows=70 loops=1)
   Recheck Cond: ((category = 0) AND (created_at >= '2025-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2025-12-31 00:00:00'::timestamp without time zone))
   Filter: ((EXTRACT(hour FROM created_at) = '16'::numeric) AND (EXTRACT(minute FROM created_at) = '15'::numeric))
   Rows Removed by Filter: 99930
   Heap Blocks: exact=2223
   Buffers: shared hit=2609
   ->  Bitmap Index Scan on demo_category_created_at_idx (actual time=3.863..3.864 rows=100000 loops=1)
         Index Cond: ((category = 0) AND (created_at >= '2025-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2025-12-31 00:00:00'::timestamp without time zone))
         Buffers: shared hit=386


To filter after the index access but before the heap access, we can explicitly separate the index and table accesses, as Hettie showed in her blog post:

postgres=> explain (analyze, costs off, summary off, buffers)
 select demo.* from demo where (category,created_at) in (
  select category,created_at from demo
   where category=0
   and created_at between '01-01-2025' and '12-31-2025'
   and extract (hour from created_at)=16 and extract (minute from created_at)=15
);
                                                                                           QUERY PLAN                                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop (actual time=18.120..18.449 rows=70 loops=1)
   Buffers: shared hit=669
   ->  Unique (actual time=18.090..18.103 rows=70 loops=1)
         Buffers: shared hit=389
         ->  Sort (actual time=18.087..18.090 rows=70 loops=1)
               Sort Key: demo_1.category, demo_1.created_at
               Sort Method: quicksort  Memory: 28kB
               Buffers: shared hit=389
               ->  Index Only Scan using demo_category_created_at_idx on demo demo_1 (actual time=0.182..18.063 rows=70 loops=1)
                     Index Cond: ((category = 0) AND (created_at >= '2025-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2025-12-31 00:00:00'::timestamp without time zone))
                     Filter: ((EXTRACT(hour FROM created_at) = '16'::numeric) AND (EXTRACT(minute FROM created_at) = '15'::numeric))
                     Rows Removed by Filter: 99930
                     Heap Fetches: 0
                     Buffers: shared hit=389
   ->  Index Scan using demo_category_created_at_idx on demo (actual time=0.004..0.005 rows=1 loops=70)
         Index Cond: ((category = 0) AND (created_at = demo_1.created_at))
         Buffers: shared hit=280


The subquery selects only the columns included in the index from an Index Only Scan, and all filtering happens there. Access to the table is more expensive, with an additional Index Scan on the same index, but it occurs on a limited number of rows in this case (70 rows instead of 70+99930=100000 in the previous case).

YugabyteDB with an immutable function (EXTRACT on TIMESTAMP)

I have run the same on YugabyteDB. There is no need for any rewrite trick, and the dist option displays more information about the number of rows scanned from the index and from the table:


yugabyte=# explain (analyze, costs off, summary off, dist)
select demo.* from demo
 where category=0
 and created_at between '01-01-2025' and '12-31-2025'
 and extract (hour from created_at)=16 and extract (minute from created_at)=15
;
                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using demo_category_created_at_idx on demo (actual time=87.445..87.968 rows=70 loops=1)
   Index Cond: ((category = 0) AND (created_at >= '2025-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2025-12-31 00:00:00'::timestamp without time zone))
   Storage Index Filter: ((EXTRACT(hour FROM created_at) = '16'::numeric) AND (EXTRACT(minute FROM created_at) = '15'::numeric))
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 1.630 ms
   Storage Table Rows Scanned: 70
   Storage Index Read Requests: 1
   Storage Index Read Execution Time: 85.479 ms
   Storage Index Rows Scanned: 100000


Like in PostgreSQL, it is an Index Scan because the index does not cover all the selected columns, but there is a Storage Index Filter that is applied before the access to the table. A predicate pushed down to the table would show up as Storage Table Filter, and one that is not pushed down at all would show up as Filter.

As in PostgreSQL, 100000 index entries were read, but because the filter was applied there, only 70 table rows were read, similar to what Hettie's rewrite achieves, without changing the query.

This is an optimization in YugabyteDB to reduce the network transfer on distributed tables, and it benefits this case. However, not all functions can be pushed down, mainly those that depend on the server configuration. In that case, the behavior reverts to PostgreSQL's, so that the configuration of the query layer applies. We will look at this later when using TIMESTAMPTZ instead of TIMESTAMP.
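
As a side note, this pushdown can be disabled to compare both behaviors. This is only a sketch, assuming the yb_enable_expression_pushdown parameter (which controls expression pushdown in YSQL) is available in your YugabyteDB version; with it off, the EXTRACT predicates should appear as a plain Filter instead of Storage Index Filter:

-- assumption: yb_enable_expression_pushdown controls this optimization
set yb_enable_expression_pushdown = off;
explain (analyze, costs off, summary off, dist)
select demo.* from demo
 where category=0
 and created_at between '01-01-2025' and '12-31-2025'
 and extract (hour from created_at)=16 and extract (minute from created_at)=15
;
-- restore the default behavior
set yb_enable_expression_pushdown = on;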

First, I would like to show that this is not a limitation of Index Scan versus Index Only Scan, and that a partially covering index can help.

Example with an index covering the immutable function (EXTRACT on TIMESTAMP)

I used only immutable functions (EXTRACT on a TIMESTAMP), so it is possible to cover the predicates without a fully covering index and an Index Only Scan:

create index on demo ( 
 category, created_at
 , (extract (hour from created_at))
 , (extract (minute from created_at))
)
;

PostgreSQL can apply the filter on the index entries, visible in Index Cond:

postgres=> explain (analyze, costs off, summary off, buffers)
select demo.* from demo
 where category=0
 and created_at between '01-01-2025' and '12-31-2025'
 and extract (hour from created_at)=16 and extract (minute from created_at)=15
;
                                                                                                                                       QUERY PLAN                                             
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using demo_category_created_at_extract_extract1_idx1 on demo (actual time=0.058..4.436 rows=70 loops=1)
   Index Cond: ((category = 0) AND (created_at >= '2025-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2025-12-31 00:00:00'::timestamp without time zone) AND (EXTRACT(hour FROM created_at) = '16'::numeric) AND (EXTRACT(minute FROM created_at) = '15'::numeric))
   Buffers: shared hit=668


YugabyteDB does the same, with additional information from the dist option of explain:

yugabyte=# explain (analyze, costs off, summary off, dist)
select demo.* from demo
 where category=0
 and created_at between '01-01-2025' and '12-31-2025'
 and extract (hour from created_at)=16 and extract (minute from created_at)=15
;
                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using demo_category_created_at_extract_extract1_idx on demo (actual time=101.617..102.129 rows=70 loops=1)
   Index Cond: ((category = 0) AND (created_at >= '2025-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2025-12-31 00:00:00'::timestamp without time zone) AND (EXTRACT(hour FROM created_at) = '16'::numeric) AND (EXTRACT(minute FROM created_at) = '15'::numeric))
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 1.834 ms
   Storage Table Rows Scanned: 70
   Storage Index Read Requests: 1
   Storage Index Read Execution Time: 99.426 ms
   Storage Index Rows Scanned: 70


In both cases, adding the function's result to the index allowed the query planner to push down the condition to the index scan.

However, this is possible only with immutable functions. Extracting the hour and minute from a TIMESTAMP is deterministic. This is not the case when the timestamp includes the time zone, because the result depends on the database configuration. Changing the default timezone would corrupt the index.
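
To illustrate the problem, here is a small example (added for this point, not from Hettie's post): the hour extracted from the same TIMESTAMPTZ value changes with the session TimeZone setting, so an index built on that expression would silently return wrong results after a configuration change.

-- the extracted hour of a TIMESTAMPTZ depends on the session TimeZone
set timezone = 'UTC';
select extract (hour from timestamptz '2025-06-01 16:15:00+00');  -- 16
set timezone = 'America/New_York';
select extract (hour from timestamptz '2025-06-01 16:15:00+00');  -- 12 (EDT is UTC-4)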

Example with a non-immutable function (EXTRACT on TIMESTAMPTZ)

Hettie's case defined "created_at" as TIMESTAMPTZ (which is a good idea for storing timestamps that can be compared with each other, but maybe not when extracting the hour and minute):

create table demo (
 id bigserial
 ,category   int
 ,created_at timestamptz
 ,filler text default lpad('x',9999,'x')
)
;

insert into demo(category,created_at)
 select 0, now() + generate_series(0,99999) * interval '1 minute'
;
create index on demo ( category, created_at )
;

vacuum analyze demo
;

Covering the extract is no longer possible because it is not deterministic:

postgres=> create index on demo (
 category, created_at
 , (extract (hour from created_at))
 , (extract (minute from created_at))
)
;
ERROR:  functions in index expression must be marked IMMUTABLE


extract(text, timestamp with time zone) is stable (deterministic within a single statement) but not immutable (deterministic forever):

postgres=> \df+ extract
                                                                                                            List of functions
   Schema   |  Name   | Result data type |        Argument data types        | Type | Volatility | Parallel |    Owner     | Security | Access privileges | Language |    Internal name    |                 Description
------------+---------+------------------+-----------------------------------+------+------------+----------+--------------+----------+-------------------+----------+---------------------+---------------------------------------------
 pg_catalog | extract | numeric          | text, date                        | func | immutable  | safe     | zenith_admin | invoker  |                   | internal | extract_date        | extract field from date
 pg_catalog | extract | numeric          | text, interval                    | func | immutable  | safe     | zenith_admin | invoker  |                   | internal | extract_interval    | extract field from interval
 pg_catalog | extract | numeric          | text, time with time zone         | func | immutable  | safe     | zenith_admin | invoker  |                   | internal | extract_timetz      | extract field from time with time zone
 pg_catalog | extract | numeric          | text, time without time zone      | func | immutable  | safe     | zenith_admin | invoker  |                   | internal | extract_time        | extract field from time
 pg_catalog | extract | numeric          | text, timestamp with time zone    | func | stable     | safe     | zenith_admin | invoker  |                   | internal | extract_timestamptz | extract field from timestamp with time zone
 pg_catalog | extract | numeric          | text, timestamp without time zone | func | immutable  | safe     | zenith_admin | invoker  |                   | internal | extract_timestamp   | extract field from timestamp


It's not possible to add this function to the index.

In the previous example, YugabyteDB applied the filter on the index entries because the predicate was pushed down to the storage layer. However, this doesn't happen with this function:

yugabyte=# \d demo
                                       Table "public.demo"
   Column   |           Type           | Collation | Nullable |             Default
------------+--------------------------+-----------+----------+----------------------------------
 id         | bigint                   |           | not null | nextval('demo_id_seq'::regclass)
 category   | integer                  |           |          |
 created_at | timestamp with time zone |           |          |
 filler     | text                     |           |          | lpad('x'::text, 9999, 'x'::text)
Indexes:
    "demo_category_created_at_idx" lsm (category ASC, created_at ASC)

yugabyte=# explain (analyze, costs off, summary off, dist)
select demo.* from demo
 where category=0
 and created_at between '01-01-2025' and '12-31-2025'
 and extract (hour from created_at)=16 and extract (minute from created_at)=15
;
                                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using demo_category_created_at_idx on demo (actual time=23.135..1925.909 rows=70 loops=1)
   Index Cond: ((category = 0) AND (created_at >= '2025-01-01 00:00:00+00'::timestamp with time zone) AND (created_at <= '2025-12-31 00:00:00+00'::timestamp with time zone))
   Filter: ((EXTRACT(hour FROM created_at) = '16'::numeric) AND (EXTRACT(minute FROM created_at) = '15'::numeric))
   Rows Removed by Filter: 99930
   Storage Table Read Requests: 98
   Storage Table Read Execution Time: 1057.112 ms
   Storage Table Rows Scanned: 100000
   Storage Index Read Requests: 98
   Storage Index Read Execution Time: 14.704 ms

With a TIMESTAMPTZ argument, the extract is not pushed down because it is not an immutable function. YugabyteDB's behavior reverts to what PostgreSQL does: Filter after the Index Scan result, which fetches 100000 Table Rows Scanned from the storage layer (DocDB, distributed on the YugabyteDB nodes) to the query layer (YSQL, the PostgreSQL backend you are connected to) to discard 99930 Rows Removed by Filter.

This is not efficient. Hettie's trick helps to reduce the table rows scanned:

yugabyte=# explain (analyze, costs off, summary off, dist)
 select demo.* from demo where (category,created_at) in (
  select category,created_at from demo
   where category=0
   and created_at between '01-01-2025' and '12-31-2025'
   and extract (hour from created_at)=16 and extract (minute from created_at)=15
);
                                                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop (actual time=107.361..179.929 rows=70 loops=1)
   ->  HashAggregate (actual time=105.868..105.996 rows=70 loops=1)
         Group Key: demo_1.category, demo_1.created_at
         Batches: 1  Memory Usage: 32kB
         ->  Index Only Scan using demo_category_created_at_idx on demo demo_1 (actual time=2.232..105.641 rows=70 loops=1)
               Index Cond: ((category = 0) AND (created_at >= '2025-01-01 00:00:00+00'::timestamp with time zone) AND (created_at <= '2025-12-31 00:00:00+00'::timestamp with time zone))
               Filter: ((EXTRACT(hour FROM created_at) = '16'::numeric) AND (EXTRACT(minute FROM created_at) = '15'::numeric))
               Rows Removed by Filter: 99930
               Heap Fetches: 0
               Storage Index Read Requests: 98
               Storage Index Read Execution Time: 55.127 ms
               Storage Index Rows Scanned: 100000
   ->  Index Scan using demo_category_created_at_idx on demo (actual time=1.027..1.031 rows=1 loops=70)
         Index Cond: ((category = 0) AND (created_at = demo_1.created_at))
         Storage Table Read Requests: 1
         Storage Table Read Execution Time: 0.469 ms
         Storage Table Rows Scanned: 1
         Storage Index Read Requests: 1
         Storage Index Read Execution Time: 0.464 ms
         Storage Index Rows Scanned: 1


Note that this saves only the table rows scanned, but the predicate with the non-immutable extract function is still applied as a filter and cannot be pushed down to the storage, resulting in 100000 index entries fetched.

Use an immutable function (TO_CHAR on TIMESTAMPTZ AT TIME ZONE 'GMT')

One solution is to create an immutable function that extracts the hour and minute without depending on the default timezone:

create function GMT_HH24MI(ts timestamptz) returns text as $SQL$
 select to_char(ts at time zone 'GMT','HH24MI')
$SQL$ 
 language sql 
 immutable -- because it operates always on GMT whatever the default timezone is
;
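
A quick check of the function (a minimal sketch, my addition): its result depends only on the stored instant, not on the session TimeZone, which is what makes the IMMUTABLE declaration safe.

-- GMT_HH24MI returns the same value whatever the session TimeZone is
set timezone = 'Asia/Tokyo';
select GMT_HH24MI(timestamptz '2025-06-01 16:15:00+00');  -- '1615'
set timezone = 'America/New_York';
select GMT_HH24MI(timestamptz '2025-06-01 16:15:00+00');  -- '1615'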

This function can be indexed:

create index on demo ( category, created_at, GMT_HH24MI(created_at) )
;

The query must be rewritten to use the function:

yugabyte=# explain (analyze, costs off, summary off, dist)
select demo.* from demo
 where category=0
 and created_at between '01-01-2025' and '12-31-2025'
 and GMT_HH24MI(created_at)='1615'
;
                                                                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using demo_category_created_at_gmthh24mi_idx on demo (actual time=73.116..73.610 rows=70 loops=1)
   Index Cond: ((category = 0) AND (created_at >= '2025-01-01 00:00:00+00'::timestamp with time zone) AND (created_at <= '2025-12-31 00:00:00+00'::timestamp with time zone) AND (GMT_HH24MI(created_at) = '1615'::text))
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 1.915 ms
   Storage Table Rows Scanned: 70
   Storage Index Read Requests: 1
   Storage Index Read Execution Time: 70.931 ms
   Storage Index Rows Scanned: 70

Creating an immutable function and adding it to the index key or include clause allows PostgreSQL and YugabyteDB to apply the predicate to the index entries without an Index Only Scan.
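
For PostgreSQL, the equivalent is only a sketch here (I showed the YugabyteDB execution above, and this assumes the GMT_HH24MI function is also created in the PostgreSQL database): the same expression index should let the rewritten predicate appear in Index Cond rather than Filter.

-- sketch for PostgreSQL: index the immutable function and use it in the query
create index on demo ( category, created_at, GMT_HH24MI(created_at) );

explain (analyze, costs off, summary off, buffers)
select demo.* from demo
 where category=0
 and created_at between '01-01-2025' and '12-31-2025'
 and GMT_HH24MI(created_at)='1615'
;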

If you use such an immutable function, I recommend that the function's name describes the specific behavior. In this example, my function extracts the time of a particular time zone (GMT). The semantics are different:

  • The original example with TIMESTAMPTZ and EXTRACT shows records created when it was 16:15 in the time zone of the user who runs the query later, but it may have been a different clock time for the one who created them.
  • My GMT_HH24MI function shows those created when it was 16:15 UTC.
  • With a TIMESTAMP without a time zone, EXTRACT shows those created when it was 16:15 on the clock of the session that created them, which may not match the querying user's time zone.

When specifying 16:15 in the query, which of these three results does the user expect? You cannot know without asking.


🌐 All these considerations originate from the client and server time zone settings. Today, you don't want your stored data to depend on where the application server runs. The application server and the database's default time zone should be identical, and probably UTC. TIMESTAMP, or TIMESTAMPTZ with AT TIME ZONE 'GMT', provides better optimization possibilities, as the results do not depend on environment settings but on the guarantee that the timezone configuration is consistent across all servers and does not change.

💡 Always look at the explain (analyze, buffers) output in PostgreSQL or explain (analyze, dist) in YugabyteDB. In general, you should avoid Rows Removed by Filter on many rows, especially with distributed databases that must fetch them across the network before applying the Filter condition. The number of buffers or rows scanned should not be much larger than the scan result (actual rows=).
