TL;DR: To make developers' lives easier, MongoDB has documented a simple rule for creating an efficient index: the ESR (Equality, Sort, Range) Rule. This rule applies to any database with sorted indexes, such as B-trees or LSM trees.
Let's demonstrate it on YugabyteDB.
I recently read an article on LinkedIn (Optimizing Large Databases: Insights from PostgreSQL Index Experiments) where, in the section "What Surprised Me," it seems to suggest that a single-column index is superior to a composite index.
I responded that the composite index was a good approach for PostgreSQL but, given the filtering selectivity, the column order was not, and I provided an example: https://dbfiddle.uk/cFA1mGNe.
I've written a comprehensive article on Improving Your SQL Indexing, and MongoDB has published a more developer-friendly explanation of the same principle, The ESR (Equality, Sort, Range) Rule, which originates from Optimizing MongoDB Compound Indexes by A. Jesse Jiryu Davis. Let's use this particular example to demonstrate it.
I've run the example on YugabyteDB 2.25, which behaves like PostgreSQL but shows more information in the execution plan with the dist and debug options.
YugabyteDB Index Scan with Sort and Range
I created the following table and inserted 100,000 rows:
create table transactions as
select
gen_random_uuid() as id,
100*random() as user_id,
now() - 90*random()*interval '1 day' as transaction_date
from generate_series(1,100000) n
;
The following query gets 100 rows at offset 100 for a range of "user_id" sorted by "transaction_date" in descending order:
prepare q(int, int) as
SELECT * FROM transactions
WHERE user_id BETWEEN $1 AND $2
ORDER BY transaction_date DESC LIMIT 100 OFFSET 100
;
Index on (user_id, transaction_date)
With an index on (user_id, transaction_date), it has to read 9,167 index entries and 9,167 table rows to filter on the "user_id" range, then sort the result and get the Top-200 to return 100 rows at offset 100:
yugabyte=> create index tx_usr_dat on transactions
(user_id desc, transaction_date desc)
;
CREATE INDEX
yugabyte=> explain( analyze, dist, debug, costs off, summary off)
execute q(1,10)
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Limit (actual time=34.804..34.826 rows=100 loops=1)
-> Sort (actual time=34.785..34.804 rows=200 loops=1)
Sort Key: transaction_date DESC
Sort Method: top-N heapsort Memory: 50kB
-> Index Scan using tx_usr_dat on transactions (actual time=4.249..33.520 rows=9167 loops=1)
Index Cond: ((user_id >= '1'::double precision) AND (user_id <= '10'::double precision))
Storage Table Read Requests: 9
Storage Table Read Execution Time: 24.774 ms
Storage Table Rows Scanned: 9167
Storage Index Read Requests: 9
Storage Index Read Execution Time: 1.775 ms
Storage Index Rows Scanned: 9167
Metric rocksdb_number_db_seek: 9176.000
Metric rocksdb_number_db_next: 27468.000
Metric docdb_keys_found: 18342.000
The query has two filters:
- one in the WHERE clause on a range of "user_id", for which the selectivity is 9,167 / 100,000 = 9% (this can be verified with the count query below)
- one in the ORDER BY ... LIMIT on "transaction_date", which limits the result to the Top-200 rows, for a selectivity of 200 / 100,000 = 0.2%
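This simple count returns the number of rows in the WHERE range, roughly the 9,167 rows scanned in the plan above:
-- check the selectivity of the WHERE range: about 9,167 of 100,000 rows (~9%)
select count(*) from transactions where user_id between 1 and 10;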
There is a third filter with OFFSET that skips the first 100 rows, but there is no way to push it down to an index. In general, you should not use OFFSET for pagination; instead, keep track of the last value of the previous page and use it as an additional range predicate (Pagination with an OFFSET is better without OFFSET), as sketched below.
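A minimal keyset-pagination sketch, assuming the application passes the last "transaction_date" and "id" seen on the previous page (the q_keyset name and the "id" tie-breaker are my additions for illustration):
prepare q_keyset(int, int, timestamptz, uuid) as
SELECT * FROM transactions
WHERE user_id BETWEEN $1 AND $2
-- start after the last row returned by the previous page
AND (transaction_date, id) < ($3, $4)
ORDER BY transaction_date DESC, id DESC LIMIT 100
;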
Given the selectivity of the filters that can be pushed down to an index, it is better to use the index range scan for the most selective one and start the index key with "transaction_date":
yugabyte=> drop index tx_usr_dat;
DROP INDEX
yugabyte=> create index tx_dat
on transactions (transaction_date desc)
;
CREATE INDEX
yugabyte=> explain( analyze, dist, debug, costs off, summary off)
execute q(1,10)
;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Limit (actual time=7.499..12.404 rows=100 loops=1)
-> Index Scan using tx_dat on transactions (actual time=2.630..12.380 rows=200 loops=1)
Storage Filter: ((user_id >= '1'::double precision) AND (user_id <= '10'::double precision))
Storage Table Read Requests: 3
Storage Table Read Execution Time: 8.295 ms
Storage Table Rows Scanned: 2248
Storage Index Read Requests: 3
Storage Index Read Execution Time: 2.925 ms
Storage Index Rows Scanned: 2248
Metric rocksdb_number_db_seek: 2251.000
Metric rocksdb_number_db_next: 6769.000
Metric docdb_keys_found: 4499.000
This has read 2,248 index entries and 2,248 table rows, four times fewer than with the other index. This execution plan avoids a sort operation, so it can read fewer rows (instead of reading all rows for the "user_id" range, sorting them, and obtaining the Top-200).
We can do better. The composite index was a good idea, but according to the ESR (Equality, Sort, Range) rule, the index key must start with the sort column, "transaction_date", since there is no equality predicate here, followed by the range column, "user_id":
yugabyte=> drop index tx_dat;
DROP INDEX
yugabyte=> create index tx_dat_usr on transactions
(transaction_date desc, user_id desc)
;
CREATE INDEX
yugabyte=> explain( analyze, dist, debug, costs off, summary off)
execute q(1,10)
;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Limit (actual time=3.565..3.608 rows=100 loops=1)
-> Index Scan using tx_dat_usr on transactions (actual time=3.520..3.586 rows=200 loops=1)
Index Cond: ((user_id >= '1'::double precision) AND (user_id <= '10'::double precision))
Storage Table Read Requests: 1
Storage Table Read Execution Time: 1.236 ms
Storage Table Rows Scanned: 200
Storage Index Read Requests: 1
Storage Index Read Execution Time: 2.049 ms
Storage Index Rows Scanned: 200
Metric rocksdb_number_db_seek: 201.000
Metric rocksdb_number_db_next: 2615.000
Metric docdb_keys_found: 401.000
(9 rows)
The composite index avoids sorting because the key starts with "transaction_date". It also allows additional filtering because the entries include "user_id".
The filtering to obtain the 200 rows needed for the result occurs during the initial operation of scanning the index entries. However, it requires reading more data because "user_id" is not confined to a single range. Instead, it is spread across each "transaction_date", as indicated by the number of seek and next operations.
Here, YugabyteDB reads the entries in descending order of "transaction_date", and for each entry, it seeks (rocksdb_number_db_seek: 201) to the appropriate "user_id" range within it. Around 2,600 index entries were read (rocksdb_number_db_next: 2615), but this extra work happened only while scanning the index; only 200 table rows were read.
Cost-Based Optimizer, Custom and Generic Plans
In some cases, the selectivity of the range predicate in the WHERE clause is higher than the selectivity of the ORDER BY ... LIMIT. When that is the case, it can make sense to put the column used for the range predicate first in the index. You can test it by changing the inserted data, as in the sketch below.
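For example, a minimal variation (not part of the original test) that regenerates the data with "user_id" spread over 10,000 values, so that BETWEEN 1 AND 10 matches only about 0.1% of the rows and becomes more selective than the Top-200:
truncate table transactions;
insert into transactions
select
 gen_random_uuid() as id,
 -- wider spread of user_id: the range 1..10 now matches ~0.1% of the rows
 10000*random() as user_id,
 now() - 90*random()*interval '1 day' as transaction_date
from generate_series(1,100000) n
;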
This is important because, after five executions, the PostgreSQL and YugabyteDB query planners may switch to a generic plan, and the index choice may not be the best for the values you provide. There are two solutions to avoid that:
- create only the necessary indexes. If you don't need an index starting with "user_id" for other queries, do not create it.
- set plan_cache_mode to force_custom_plan rather than the default auto (see the example below).
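A minimal sketch for the second option, assuming the prepared statement q from above:
set plan_cache_mode = force_custom_plan;
-- each execution is now planned with the actual parameter values
explain( analyze, dist, debug, costs off, summary off) execute q(1,10);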
PostgreSQL Index Scan with Sort and Range
You can run the same commands in PostgreSQL and see a similar advantage of a composite index with the sort column before the range column. The differences with YugabyteDB are:
- You must run VACUUM after inserting data to get reproducible behavior.
- Use buffers instead of the dist and debug options for explain analyze. It will not display the number of rows scanned from the index and the table, but the number of shared buffer hits indicates what is read.
- An Index Scan in PostgreSQL will also show the rows filtered after reading the table as Rows Removed by Filter. We didn't observe that with YugabyteDB because the filter was pushed down to the storage layer, even when filtering on the table's rows.
- PostgreSQL doesn't do a skip scan, but reading too many index entries is less problematic since they are not distributed and can be read from the shared memory buffers.
Here is what I've run on PostgreSQL 17, creating two indexes:
create table transactions as
select
gen_random_uuid() as id,
random(1,100) as user_id,
now() - random(1,90)*interval '1 day' as transaction_date
from generate_series(1,100000) n
;
create index tx_usr_dat on transactions (user_id, transaction_date);
create index tx_dat on transactions (transaction_date);
vacuum analyze transactions;
prepare q(int, int) as
SELECT * FROM transactions WHERE user_id BETWEEN $1 AND $2 ORDER BY transaction_date DESC LIMIT 100 OFFSET 100
;
explain( analyze, buffers) execute q(1,10);
Here is the execution plan for explain( analyze, buffers) execute q(1,10), picking the index starting with transaction_date to avoid a sort operation:
Limit (cost=54.75..109.20 rows=100 width=28) (actual time=0.565..1.629 rows=100 loops=1)
Buffers: shared hit=1067
-> Index Scan Backward using tx_dat on transactions (cost=0.29..5440.29 rows=9990 width=28) (actual time=0.024..1.609 rows=200 loops=1)
Filter: ((user_id >= 1) AND (user_id <= 10))
Rows Removed by Filter: 1823
Buffers: shared hit=1067
Planning:
Buffers: shared hit=35
Planning Time: 2.211 ms
Execution Time: 1.873 ms
I replace this index with a composite one, adding "user_id" at the end of the key:
drop index tx_dat;
create index tx_dat_usr on transactions (transaction_date, user_id);
vacuum analyze transactions;
The plan using it reads fewer buffers (207 instead of 1067) and shows no Rows Removed by Filter because it was filtered by the Index Cond:
Limit (cost=44.99..89.68 rows=100 width=28) (actual time=0.170..0.312 rows=100 loops=1)
Buffers: shared hit=207
-> Index Scan Backward using tx_dat_usr on transactions (cost=0.29..4533.64 rows=10143 width=28) (actual time=0.017..0.297 rows=200 loops=1)
Index Cond: ((user_id >= 1) AND (user_id <= 10))
Buffers: shared hit=207
Planning:
Buffers: shared hit=31
Planning Time: 0.511 ms
Execution Time: 0.336 ms
The new index is covering the WHERE clause. I explained the various levels of how an index can cover a query in a recent article.
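As a side note, a minimal sketch (not run above, the index name is mine) that makes the index fully covering for this query by adding the remaining selected column with INCLUDE, which should allow an Index Only Scan:
drop index tx_dat_usr;
-- "id" is the only selected column not yet in the index entries
create index tx_dat_usr_cov on transactions (transaction_date, user_id) include (id);
vacuum analyze transactions;
explain( analyze, buffers) execute q(1,10);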
MongoDB Index Scan with Sort and Range
As explained in the MongoDB documentation, the ESR (Equality, Sort, Range) rule assists developers in creating composite indexes with optimal column ordering. Let's test the same example and examine the execution plans.
I created a similar dataset as documents in a collection:
db.transactions.insertMany(
Array.from({ length: 100000 }, (_, i) => ({
_id: new ObjectId(),
user_id: Math.floor(Math.random() * 100),
transaction_date: new Date(Date.now()-Math.random()*90*24*60*60*1000)
}))
);
I created a function equivalent to the SQL prepared statement:
function queryTransactions(userIdStart, userIdEnd) {
return db.transactions
.find({ user_id: { $gte: userIdStart, $lte: userIdEnd } })
.sort({ transaction_date: -1 })
.skip(100)
.limit(100);
}
I created the index that starts with "user_id", placing the range column before the sort column, and displayed the execution plan:
mdb> db.transactions.createIndex({ user_id: -1, transaction_date: -1 });
user_id_-1_transaction_date_-1
mdb> queryTransactions(1, 10).explain("executionStats").executionStats;
{
executionSuccess: true,
nReturned: 100,
executionTimeMillis: 12,
totalKeysExamined: 9968,
totalDocsExamined: 100,
executionStages: {
isCached: false,
stage: 'FETCH',
nReturned: 100,
executionTimeMillisEstimate: 11,
works: 10170,
advanced: 100,
needTime: 10069,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
docsExamined: 100,
alreadyHasObj: 0,
inputStage: {
stage: 'SKIP',
nReturned: 100,
executionTimeMillisEstimate: 9,
works: 10170,
advanced: 100,
needTime: 10069,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
skipAmount: 100,
inputStage: {
stage: 'SORT',
nReturned: 200,
executionTimeMillisEstimate: 9,
works: 10170,
advanced: 200,
needTime: 9969,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
sortPattern: { transaction_date: -1 },
memLimit: 33554432,
limitAmount: 200,
type: 'default',
totalDataSizeSorted: 13800,
usedDisk: false,
spills: 0,
spilledDataStorageSize: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 9968,
executionTimeMillisEstimate: 8,
works: 9969,
advanced: 9968,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { user_id: -1, transaction_date: -1 },
indexName: 'user_id_-1_transaction_date_-1',
isMultiKey: false,
multiKeyPaths: { user_id: [], transaction_date: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: {
user_id: [ '[10, 1]' ],
transaction_date: [ '[MaxKey, MinKey]' ]
},
keysExamined: 9968,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
}
}
}
The index bounds array makes it clear how the index is scanned by ranges (in PostgreSQL, this must be guessed from the Index Cond; in YugabyteDB, from the number of seeks):
indexBounds: {
user_id: [ '[10, 1]' ],
transaction_date: [ '[MaxKey, MinKey]' ]
},
keysExamined: 9968,
9968 keys have been examined, as they were filtered only on "user_id" with a full range of "transaction_date."
Those have been sorted to get the Top-200 (stage: 'SORT', limitAmount: 200) and skipped to apply the offset (stage: 'SKIP', skipAmount: 100).
I ran the same after creating the index that follows the ESR rule:
mdb> db.transactions.createIndex({ transaction_date: -1, user_id: -1 });
transaction_date_-1_user_id_-1
mdb> queryTransactions(1, 10).explain("executionStats").executionStats
{
executionSuccess: true,
nReturned: 100,
executionTimeMillis: 20,
totalKeysExamined: 1902,
totalDocsExamined: 1902,
executionStages: {
isCached: false,
stage: 'LIMIT',
nReturned: 100,
executionTimeMillisEstimate: 18,
works: 1903,
advanced: 100,
needTime: 1802,
needYield: 0,
saveState: 1,
restoreState: 1,
isEOF: 1,
limitAmount: 100,
inputStage: {
stage: 'SKIP',
nReturned: 100,
executionTimeMillisEstimate: 18,
works: 1902,
advanced: 100,
needTime: 1802,
needYield: 0,
saveState: 1,
restoreState: 1,
isEOF: 0,
skipAmount: 100,
inputStage: {
stage: 'FETCH',
filter: { '$and': [ { user_id: [Object] }, { user_id: [Object] } ] },
nReturned: 200,
executionTimeMillisEstimate: 17,
works: 1902,
advanced: 200,
needTime: 1702,
needYield: 0,
saveState: 1,
restoreState: 1,
isEOF: 0,
docsExamined: 1902,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 1902,
executionTimeMillisEstimate: 3,
works: 1902,
advanced: 1902,
needTime: 0,
needYield: 0,
saveState: 1,
restoreState: 1,
isEOF: 0,
keyPattern: { transaction_date: -1, user_id: -1 },
indexName: 'transaction_date_-1_user_id_-1',
isMultiKey: false,
multiKeyPaths: { transaction_date: [], user_id: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: {
transaction_date: [ '[MaxKey, MinKey]' ],
user_id: [ '[MaxKey, MinKey]' ]
},
keysExamined: 1902,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
}
}
}
The index bounds show that no range scan is applied:
indexBounds: {
transaction_date: [ '[MaxKey, MinKey]' ],
user_id: [ '[MaxKey, MinKey]' ]
},
keysExamined: 1902,
Still, it examined fewer keys in IXSCAN because there is no SORT operation, and the SKIP and LIMIT can operate without reading all entries.
However, even though "user_id" is present in the index, the filtering happens in a FETCH operation, after getting the document:
inputStage: {
stage: 'FETCH',
filter: { '$and': [ { user_id: [Object] }, { user_id: [Object] } ] },
nReturned: 200,
docsExamined: 1902,
inputStage: {
stage: 'IXSCAN',
nReturned: 1902,
In this example (running in MongoDB 8.0), having "user_id" in the index was useless, and I would have the same execution plan with an index on { transaction_date: -1 } only. That may be a topic for a future blog post.
This demonstrates a key principle for creating composite indexes (ESR). When your query includes a selective equality predicate (E), the index key should begin with it, so that a single range is read during the index scan. Particularly for pagination queries, the index key should add the sort (S) columns next, ensuring that this single range is returned in the correct order. If a selective range (R) predicate exists instead of a sort, it should be placed in the index key right after the equality columns. When both a sort (S) and a range (R) are present, the decision depends on their selectivity and on the database's ability to filter after a sorted range. Generally, avoiding a sort operation is preferable for scalability, as the sorting cost rises with the data volume.
This ESR rule applies to YugabyteDB (a sketch follows this list):
- the columns used for equality predicates (E) must come first in the index key and can be HASH, ASC, or DESC
- the columns used to sort (S) for ORDER BY ... LIMIT follow in the index key, with ASC or DESC. Try to match the ORDER BY direction to avoid backward scans, which are more expensive in an LSM tree. If there is no sort to avoid, the most selective range comes next
- the columns used for range (R) or other predicates can follow, in the key definition or in the INCLUDE clause, to filter the index entries before fetching the rows from the table. If it is the primary key index, there is no need to add them because all columns are in the primary index.
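For illustration, a minimal sketch of an ESR-ordered index in YugabyteDB, based on a hypothetical variant of the query with an equality predicate on "user_id" (assuming integer user ids, unlike the random doubles generated above) and a range on "transaction_date":
-- hypothetical query: equality (E) on user_id, sort (S) and range (R) on transaction_date
prepare q_esr(int, timestamptz) as
SELECT * FROM transactions
WHERE user_id = $1 AND transaction_date >= $2
ORDER BY transaction_date DESC LIMIT 100
;
-- ESR-ordered key: the equality column first (HASH is fine for it),
-- then the sort/range column with DESC to match the ORDER BY
create index tx_esr on transactions (user_id hash, transaction_date desc);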
This article was full of links. Here is one more from a workshop on indexing best practices: