In contrast to sharding (Aurora Limitless), which employs only local indexes, distributed SQL (Aurora DSQL) distributes secondary indexes on their indexed columns, independently of the table, which is distributed on its primary key. This allows all access patterns to be optimized, including those that do not share the table's sharding key, and enables global enforcement of unique constraints.
Users of NoSQL or monolithic SQL databases often wonder how this can scale, because an Index Scan on a secondary index is like a join between the index and the table, and there's a fear that joins don't scale. I've addressed joins in a previous post; this one is about the secondary index scan. I'll run an example on Aurora DSQL and YugabyteDB, which work similarly but with slight differences in the execution plans they display. Let's prove that Distributed SQL databases can scale with consistent global secondary indexes.
I've created a table with a primary key and a secondary index defined by a unique constraint:
create table demo (
  id int primary key,
  value int unique
);
\d demo
with start as (select count(*) from demo)
insert into demo select count+n, count+n
from start, generate_series(1,1000) n
\watch c=1000
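The `\watch` loop repeats the insert, each time appending 1,000 more rows numbered from the current row count. If your environment allows larger transactions, a single statement can load the same data (a sketch; a single large insert may hit per-transaction limits on Aurora DSQL, which is one reason to loop with `\watch` instead):
-- one-shot alternative to the \watch loop: load 1,000,000 sequential rows
-- (may exceed transaction size limits on Aurora DSQL; the \watch loop avoids that)
insert into demo
select n, n from generate_series(1, 1000000) n;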
Two remarks:
- To avoid future problems, it is recommended to use `bigint` instead of `int`. However, in the preview version of Aurora DSQL, it seems that `bigint` cannot be used for index conditions, so I used `int`.
- If you do not enable the PostgreSQL parity flag, YugabyteDB defaults to hash sharding. As I'm going to run range queries, I disabled this by setting `yb_use_hash_splitting_by_default` to `off` (see the snippet after this list), or simply enabled PostgreSQL parity.
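For reference, here is a minimal session-level way to apply that setting in YugabyteDB (a sketch; enabling the PostgreSQL parity flags cluster-wide achieves the same default):
-- make range sharding the default for this YugabyteDB session
set yb_use_hash_splitting_by_default = off;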
PostgreSQL stores tables in heap tables, and all indexes, including the primary key, are secondary indexes. Aurora DSQL and YugabyteDB store the table in their primary key index. They use a different way to display this while showing a PostgreSQL-compatible output:
- Aurora DSQL presents the primary key as if it were a covering index, which makes sense, as it is an index that includes all columns of the table:
dsql=> \d demo
Table "public.demo"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | not null |
value | integer | | |
Indexes:
"demo_pkey" PRIMARY KEY, btree_index (id) INCLUDE (value)
"demo_value_key" UNIQUE CONSTRAINT, btree_index (value)
Consequently, an index scan on the primary key is presented as an Index Only Scan (`Index Only Scan using demo_pkey`) in Aurora DSQL.
- YugabyteDB presents the primary index as PostgreSQL would have presented it, and it's not immediately visible that it includes all columns:
yugabyte=> \d demo
Table "public.demo"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | not null |
value | integer | | |
Indexes:
"demo_pkey" PRIMARY KEY, lsm (id ASC)
"demo_value_key" UNIQUE CONSTRAINT, lsm (value ASC)
An index scan on the primary key is presented as an Index Scan (`Index Scan using demo_pkey`) by YugabyteDB. Users must understand that an Index Scan on the primary key is physically equivalent to an Index Only Scan.
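To see that only the naming differs (the physical work is the same), you can explain a primary-key range query on both engines; as described above, it appears as `Index Only Scan using demo_pkey` on Aurora DSQL and as `Index Scan using demo_pkey` on YugabyteDB:
-- same query on both engines; only the plan node's name differs, because the
-- primary-key structure already stores all of the table's columns
explain select * from demo where id between 1 and 10;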
The `ASC` option is indicated because, in addition to `ASC` and `DESC` for ascending and descending order, it can also be `HASH` when a hash function is applied by hash sharding (see code here).
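As an illustration of those column options in YugabyteDB (a sketch; the `HASH` option is YugabyteDB-specific and not part of the Aurora DSQL syntax shown above):
-- range-sharded index: supports range predicates such as BETWEEN
create index demo_value_asc on demo (value asc);
-- hash-sharded index: distributes on a hash of the column, good for point lookups
-- but not for range scans
create index demo_value_hash on demo (value hash);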
Your opinion matters: which display of the primary key index do you prefer? I like the Aurora DSQL style, shown like a covering index, reflecting how the table is stored. However, I can understand that some users prefer the YugabyteDB display, which is closer to the CREATE INDEX statement that was issued.
On this table, I've run range queries with various sizes on "id" to get a primary index scan and on "value" to get a secondary index scan to read the other column from the table:
select * from demo where id between 1 and 1;
select * from demo where id between 1 and 2;
...
select * from demo where id between 1 and 20000;
select * from demo where value between 1 and 1;
select * from demo where value between 1 and 2;
...
select * from demo where value between 1 and 20000;
Here is what I've run to generate those queries and execute them (with `\gexec`):
set enable_seqscan=off;
with cols(col) as (values ('id'),('value'))
select format ('
explain (analyze,dist) select * from demo
where %I between %s and %s
', col, 1, n) from cols,generate_series(1,20000) n
;
\o tmp.txt
\gexec
\o
I disabled Seq Scan to force an Index Scan (or Index Only Scan) even when scanning many rows, where the query planner might otherwise pick a full table scan. I ran the queries with EXPLAIN ANALYZE to gather all execution plans, including the time of the index scan operation and the number of rows.
I then extracted the number of rows, the time in milliseconds, and the scan method into a file that is easy to open in Excel:
awk '
# match the plan line: "<scan method> on demo (cost=...) (actual time=...ms rows=... loops=1)"
$0~re{
# print: file-name part after "tmp", ms per row, rows, total ms, scan method (gensub requires gawk)
printf "%8s %10.5f %4d %8.2f %s\n" \
,gensub("tmp(....).*","\\1",1,FILENAME) \
,gensub(re,"\\7",1)/gensub(re,"\\8",1) \
,gensub(re,"\\8",1) \
,gensub(re,"\\7",1) \
,gensub(re,"\\1",1) \
}' re='(.*) on demo [(]cost=([0-9.]+)[.][.]([0-9.]+) rows=([0-9.]+) width=([0-9.]+)[)] [(]actual time=([0-9.]+)[.][.]([0-9.]+) rows=([0-9.]+) loops=1[)]' tmp.txt
Here is the result. For both databases, the response time of a secondary index scan is higher than that of a primary index scan. Still, scanning a few rows is fast: it takes single-digit milliseconds when scanning fewer than 500 rows, which is typical in OLTP. When the number of rows to scan increases, the time increases linearly but slowly. You must read more than 5,000 rows to exceed 100 milliseconds in Aurora DSQL.
Distributed SQL databases must optimize the distributed calls by batching the access to the table from the index. In monolithic databases, which read from the shared buffer pool in memory, each index entry of an Index Scan represents an additional buffer get for the table. This would not be scalable in a distributed database where the table row may be on a different node. Distributed SQL databases batch the index entries to read and send a remote call to get a batch of rows. This is why the network latency isn't added for each row but covers thousands of rows.
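Conceptually (a simplification of the read path, not the actual protocol), the batched table access behaves like a two-step lookup rather than one round trip per row:
-- conceptual sketch: the index scan first collects the matching row locators,
-- then the table rows are fetched in batched read requests (about 1024 per call,
-- as verified below) instead of one remote call per index entry
with keys as (
  select id from demo where value between 1 and 1024  -- served by the secondary index
)
select d.* from demo d join keys using (id);           -- batched access to the table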
YugabyteDB provides statistics about distributed calls with the DIST option of EXPLAIN ANALYZE and it is easy to verify that the batch size is 1024:
yugabyte=> explain (analyze, dist, costs off, summary off)
select * from demo where value between 1 and 1024
;
QUERY PLAN
--------------------------------------------------------------------------------------
Index Scan using demo_value_key on demo (actual time=3.697..3.924 rows=1024 loops=1)
Index Cond: ((value >= 1) AND (value <= 1024))
Storage Table Read Requests: 1
Storage Table Read Execution Time: 2.490 ms
Storage Table Rows Scanned: 1024
Storage Index Read Requests: 1
Storage Index Read Execution Time: 0.867 ms
Storage Index Rows Scanned: 1024
(8 rows)
yugabyte=> explain (analyze, dist, costs off, summary off)
select * from demo where value between 1 and 1025
;
QUERY PLAN
--------------------------------------------------------------------------------------
Index Scan using demo_value_key on demo (actual time=3.856..4.462 rows=1025 loops=1)
Index Cond: ((value >= 1) AND (value <= 1025))
Storage Table Read Requests: 2
Storage Table Read Execution Time: 2.712 ms
Storage Table Rows Scanned: 1025
Storage Index Read Requests: 2
Storage Index Read Execution Time: 0.999 ms
Storage Index Rows Scanned: 1025
(8 rows)
When reading up to 1024 rows, a secondary Index Scan sends one read request to the index and one to the table. Scanning from 1025 to 2048 rows requires two read requests for each.
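You can continue the experiment around the next multiple of the batch size; by the same logic, the storage read request counters should step from two to three when crossing 2048 rows (queries only, output not shown):
explain (analyze, dist, costs off, summary off)
select * from demo where value between 1 and 2048;
explain (analyze, dist, costs off, summary off)
select * from demo where value between 1 and 2049;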
A GUC parameter controls this in YugabyteDB:
yugabyte=> show yb_fetch_row_limit;
yb_fetch_row_limit
--------------------
1024
(1 row)
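This parameter can be changed at the session level to experiment with other batch sizes (a sketch; the best value depends on row size and network latency):
-- larger batches mean fewer read requests, at the cost of bigger responses per call
set yb_fetch_row_limit = 2048;
explain (analyze, dist, costs off, summary off)
select * from demo where value between 1 and 2048; -- expected to need a single table read request now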
This is not easy to see because the reads are fast, but it is slightly visible for the secondary index when zooming in around 1024 rows:
I haven't noticed anything substantial suggesting a batch size for Aurora DSQL, and the preview version lacks further statistics. I hope Aurora DSQL general availability will give us more information about the distributed calls in the execution plan.
The benefit of batching is that latency stays low (one or two RPCs) when reading a few rows. With more rows, latency increases, but the time per row stays minimal because multiple rows share the same RPC. This matches what users expect: a response time roughly proportional to the size of the result set.
Here is, on a logarithmic scale, the milliseconds per row:
For YugabyteDB, the 1024 and 2048 batch sizes are visible for the primary and secondary indexes. I see something similar for Aurora DSQL, around 1000 and 2000 rows for the secondary index only.
I've examined the behaviors of the two Distributed SQL databases but haven’t compared their timing. First, let’s note that Aurora DSQL is currently in preview, and we can expect more optimizations soon. The read paths are slightly different between the two, and performance can vary depending on the deployment and the internal balancing of the rows.
For Aurora DSQL, all reads come from the nodes in the region you’re connected to, while YugabyteDB may read from a remote region, which can increase latency. In my case, I was connected to the region set for Raft leader preference, so my reads also came from the local region.
Lastly, it’s worth mentioning that Aurora DSQL uses disaggregated storage and is serverless, which could mean a bit more communication between components, but offers an advantage in elasticity. With different trade-offs, both databases provide consistent secondary indexes with high performance. For more information about other databases, Alex DeBrie compared different approaches: