Re-sharding to scale out is operationally complex with traditional database sharding, but Aurora Limitless simplifies it. Because tables are already divided into more table slices than there are shards, re-sharding only needs to re-balance these slices onto the new shard node.
My shard group is minimal, with a maximum capacity of 16 ACUs, which only permits two shards.
postgres_limitless=> SELECT rds_aurora.limitless_split_shard('4');
ERROR: split shard job failed
DETAIL: Insufficient capacity in the DB shard group.
The maximum capacity configured for the DB shard group is 16.0 ACUs. The DB shard group consumes 16.0 ACUs, but the scaling job requires 4.0 ACUs.
Increase the maximum capacity and retry the command.
I increase the ACUs to be able to grow.
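The maximum capacity is a setting of the DB shard group rather than something changed in SQL. Here is a minimal sketch with the AWS CLI, assuming a shard group identifier of my-shard-group (a placeholder) and 64 ACUs as an example target:
# check the currently configured maximum capacity of the DB shard group
aws rds describe-db-shard-groups \
 --db-shard-group-identifier my-shard-group \
 --query 'DBShardGroups[0].MaxACU'
# raise the maximum capacity so the scaling job has headroom for the new shard
aws rds modify-db-shard-group \
 --db-shard-group-identifier my-shard-group \
 --max-acu 64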
I initiate a shard split job for the shard associated with subcluster ID 4:
postgres_limitless=> SELECT rds_aurora.limitless_split_shard('4');
limitless_split_shard
-----------------------
1732395750168
(1 row)
This process runs in the background, and the job number tracks progress:
postgres_limitless=> select * from rds_aurora.limitless_list_shard_scale_jobs(1732395750168);
job_id | action | job_details | status | submission_time | message
---------------+-------------+-----------------------+-------------+------------------------+---------
1732395750168 | SPLIT_SHARD | Split Shard 4 by User | IN_PROGRESS | 2024-11-23 21:03:07+00 |
(1 row)
This operation may not be fully online, and the application can get errors. To check, I ran pgbench with ten clients (-c 10) for an hour (-T 3600), reporting progress every five seconds (-P 5) and running a count(*) on my table:
pgbench -P 5 -T 3600 -c 10 -nf /dev/stdin <<< 'select count(*) from my_sharded_table'
Eight of the ten clients encountered an error during the Split Shard process:
progress: 635.0 s, 219.4 tps, lat 45.647 ms stddev 6.212, 0 failed
progress: 640.0 s, 229.0 tps, lat 43.580 ms stddev 4.231, 0 failed
progress: 645.0 s, 227.4 tps, lat 44.006 ms stddev 4.953, 0 failed
pgbench: error: client 2 script 0 aborted in command 0 query 0: ERROR: could not connect to node "6"
DETAIL: refer to internal logs for details
pgbench: error: client 9 script 0 aborted in command 0 query 0: ERROR: could not connect to node "6"
DETAIL: refer to internal logs for details
pgbench: error: client 5 script 0 aborted in command 0 query 0: ERROR: could not connect to node "6"
DETAIL: refer to internal logs for details
pgbench: error: client 0 script 0 aborted in command 0 query 0: ERROR: could not connect to node "6"
DETAIL: refer to internal logs for details
pgbench: error: client 4 script 0 aborted in command 0 query 0: ERROR: could not connect to node "6"
DETAIL: refer to internal logs for details
pgbench: error: client 3 script 0 aborted in command 0 query 0: ERROR: could not connect to node "6"
DETAIL: refer to internal logs for details
progress: 650.0 s, 59.8 tps, lat 43.951 ms stddev 4.315, 0 failed
progress: 655.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 660.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 665.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
pgbench: error: client 6 script 0 aborted in command 0 query 0: ERROR: could not connect to node "6"
DETAIL: refer to internal logs for details
pgbench: error: client 7 script 0 aborted in command 0 query 0: ERROR: could not connect to node "6"
DETAIL: refer to internal logs for details
progress: 670.0 s, 0.0 tps, lat 0.000 ms stddev 0.000, 0 failed
progress: 675.0 s, 17.0 tps, lat 674.184 ms stddev 4001.588, 0 failed
progress: 680.0 s, 30.4 tps, lat 66.340 ms stddev 32.994, 0 failed
progress: 685.0 s, 35.6 tps, lat 55.638 ms stddev 27.393, 0 failed
progress: 690.0 s, 33.2 tps, lat 60.957 ms stddev 38.752, 0 failed
The shard with subcluster ID 6 was added by splitting shard 4:
postgres_limitless=> select distinct dest_subcluster,dest_type,dest_instance_az from rds_aurora.limitless_stat_subclusters
natural join (select subcluster_id as dest_subcluster, subcluster_type as dest_type from rds_aurora.limitless_subclusters)
where dest_subcluster != orig_subcluster order by 1 ;
dest_subcluster | dest_type | dest_instance_az
-----------------+-----------+------------------
2 | router | eu-west-1a
3 | router | eu-west-1b
4 | shard | eu-west-1a
5 | shard | eu-west-1b
6 | shard | eu-west-1a
(5 rows)
It appears that the router was aware of the new shard (node 6) before it became accessible, which explains the connection errors above. The new shard is visible in the partitioning details on the router:
postgres_limitless=> \d+ my_sharded_table
Partitioned table "public.my_sharded_table"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+---------+-----------+----------+----------------------------------------------+---------+-------------+--------------+-------------
id | bigint | | not null | nextval('my_sharded_table_id_seq'::regclass) | plain | | |
value | integer | | | | plain | | |
Partition key: HASH (id)
Partitions: my_sharded_table_fs00001 FOR VALUES FROM (MINVALUE) TO ('0'), FOREIGN,
my_sharded_table_fs00002 FOR VALUES FROM ('4611686018427387904') TO (MAXVALUE), FOREIGN,
my_sharded_table_fs00003 FOR VALUES FROM ('0') TO ('4611686018427387904'), FOREIGN
With two shards, this was not visible because the boundaries were simply MINVALUE, 0, and MAXVALUE. The shard key is hashed to a 64-bit signed integer, ranging from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, and the split point is now visible at 4,611,686,018,427,387,904.
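That boundary is simply 2^62, the midpoint of the upper half of the hash range. This is plain arithmetic that any PostgreSQL session can verify (nothing Aurora-specific here):
postgres_limitless=> select 2::numeric ^ 62 as split_point;
It returns 4611686018427387904, matching the new partition boundary above.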
After a while, the split job completes successfully.
postgres_limitless=> select *, now() from rds_aurora.limitless_list_shard_scale_jobs(1732395750168);
job_id | action | job_details | status | submission_time | message | now
---------------+-------------+-----------------------+---------+------------------------+-------------------------------------------+------------------------------
1732395750168 | SPLIT_SHARD | Split Shard 4 by User | SUCCESS | 2024-11-23 21:03:07+00 | Scaling job succeeded. +| 2024-11-23 21:26:56.01932+00
| | | | | New shard instance with ID 6 was created. |
(1 row)
It took 20 minutes to split one shard, even with only empty tables. My shards are not well balanced: the other shard still covers half of the hash range. Let's split the second shard:
postgres_limitless=> SELECT rds_aurora.limitless_split_shard('5');
limitless_split_shard
-----------------------
1732397272908
(1 row)
... (20 minutes later)
postgres_limitless=> select * from rds_aurora.limitless_list_shard_scale_jobs(1732395750168);
job_id | action | job_details | status | submission_time | message
---------------+-------------+-----------------------+---------+------------------------+-------------------------------------------
1732395750168 | SPLIT_SHARD | Split Shard 4 by User | SUCCESS | 2024-11-23 21:03:07+00 | Scaling job succeeded. +
| | | | | New shard instance with ID 6 was created.
(1 row)
The history of all my jobs is visible, including the attempts that failed for insufficient capacity and the one submitted while another job was still running:
postgres_limitless=> select * from rds_aurora.limitless_list_shard_scale_jobs() order by submission_time;
job_id | action | job_details | status | submission_time | message
---------------+-------------+-----------------------+---------+------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------
1732393910300 | SPLIT_SHARD | Split Shard 4 by User | FAILED | 2024-11-23 20:32:05+00 | Insufficient capacity in the DB shard group. +
| | | | | The maximum capacity configured for the DB shard group is 16.0 ACUs. The DB shard group consumes 16.0 ACUs, but the scaling job requires 4.0 ACUs.+
| | | | | Increase the maximum capacity and retry the command.
1732394377543 | SPLIT_SHARD | Split Shard 4 by User | FAILED | 2024-11-23 20:40:06+00 | Insufficient capacity in the DB shard group. +
| | | | | The maximum capacity configured for the DB shard group is 48.0 ACUs. The DB shard group consumes 43.5 ACUs, but the scaling job requires 7.0 ACUs.+
| | | | | Increase the maximum capacity and retry the command.
1732395750168 | SPLIT_SHARD | Split Shard 4 by User | SUCCESS | 2024-11-23 21:03:07+00 | Scaling job succeeded. +
| | | | | New shard instance with ID 6 was created.
1732396104682 | SPLIT_SHARD | Split Shard 5 by User | FAILED | 2024-11-23 21:09:08+00 | Unable to initiate more than one scaling job. +
| | | | | Split shard job with job ID 1732395750168 is currently in progress.
1732397272908 | SPLIT_SHARD | Split Shard 5 by User | SUCCESS | 2024-11-23 21:28:09+00 | Scaling job succeeded. +
| | | | | New shard instance with ID 7 was created.
(5 rows)
The outcome of those two splits is four shards:
postgres_limitless=> select distinct dest_subcluster,dest_type,dest_instance_az from rds_aurora.limitless_stat_subclusters
natural join (select subcluster_id as dest_subcluster, subcluster_type as dest_type from rds_aurora.limitless_subclusters)
where dest_subcluster != orig_subcluster order by 1 ;
dest_subcluster | dest_type | dest_instance_az
-----------------+-----------+------------------
2 | router | eu-west-1a
3 | router | eu-west-1b
4 | shard | eu-west-1a
5 | shard | eu-west-1b
6 | shard | eu-west-1a
7 | shard | eu-west-1a
(6 rows)
Since I maintained a power of two for the number of shards, the hash ranges are equal and my table is evenly distributed (a quick arithmetic check follows the listing):
postgres_limitless=> \d+ my_sharded_table
Partitioned table "public.my_sharded_table"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+---------+-----------+----------+----------------------------------------------+---------+-------------+--------------+-------------
id | bigint | | not null | nextval('my_sharded_table_id_seq'::regclass) | plain | | |
value | integer | | | | plain | | |
Partition key: HASH (id)
Partitions: my_sharded_table_fs00001 FOR VALUES FROM ('-4611686018427387904') TO ('0'), FOREIGN,
my_sharded_table_fs00002 FOR VALUES FROM ('4611686018427387904') TO (MAXVALUE), FOREIGN,
my_sharded_table_fs00003 FOR VALUES FROM ('0') TO ('4611686018427387904'), FOREIGN,
my_sharded_table_fs00004 FOR VALUES FROM (MINVALUE) TO ('-4611686018427387904'), FOREIGN
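With four shards, each foreign-table partition covers exactly a quarter of the 64-bit hash space, 2^64 / 4 = 2^62 hash values. Again, this is plain arithmetic rather than anything Aurora-specific:
postgres_limitless=> select 2::numeric ^ 64 / 4 as hash_values_per_partition;
It returns 4611686018427387904, which is exactly the spacing between the boundaries listed above.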
Finally, I checked the execution plan and saw that I needed one more operation to complete the split:
postgres_limitless=> explain (analyze) select * from my_sharded_table
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Append (cost=100.00..2353.76 rows=4552 width=12) (actual time=70.448..70.449 rows=0 loops=1)
-> Async Foreign Scan on my_sharded_table_fs00004 my_sharded_table_1 (cost=100.00..1065.50 rows=2275 width=12) (actual time=1.817..1.817 rows=0 loops=1)
-> Async Foreign Scan on my_sharded_table_fs00001 my_sharded_table_2 (cost=100.00..100.00 rows=1 width=12) (actual time=65.252..65.252 rows=0 loops=1)
-> Async Foreign Scan on my_sharded_table_fs00003 my_sharded_table_3 (cost=100.00..1065.50 rows=2275 width=12) (actual time=1.509..1.509 rows=0 loops=1)
-> Async Foreign Scan on my_sharded_table_fs00002 my_sharded_table_4 (cost=100.00..100.00 rows=1 width=12) (actual time=1.593..1.593 rows=0 loops=1)
Planning Time: 0.092 ms
Execution Time: 82.219 ms
(7 rows)
The cardinality estimation is incorrect because the new virtual partitions lack statistics, so I run ANALYZE again:
postgres_limitless=> analyze my_sharded_table;
ANALYZE
postgres_limitless=> explain (analyze) select * from my_sharded_table
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Append (cost=100.00..400.02 rows=4 width=12) (actual time=17.225..17.226 rows=0 loops=1)
-> Async Foreign Scan on my_sharded_table_fs00004 my_sharded_table_1 (cost=100.00..100.00 rows=1 width=12) (actual time=2.941..2.941 rows=0 loops=1)
-> Async Foreign Scan on my_sharded_table_fs00001 my_sharded_table_2 (cost=100.00..100.00 rows=1 width=12) (actual time=1.906..1.906 rows=0 loops=1)
-> Async Foreign Scan on my_sharded_table_fs00003 my_sharded_table_3 (cost=100.00..100.00 rows=1 width=12) (actual time=2.891..2.891 rows=0 loops=1)
-> Async Foreign Scan on my_sharded_table_fs00002 my_sharded_table_4 (cost=100.00..100.00 rows=1 width=12) (actual time=9.212..9.212 rows=0 loops=1)
Planning Time: 0.884 ms
Execution Time: 33.311 ms
(7 rows)
To save my cloud credits, I lower the ACUs to the minimum level.
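The same AWS CLI command used to raise the capacity can lower it; a sketch with the same placeholder identifier and an example value:
# reduce the maximum capacity now that the splits are done
aws rds modify-db-shard-group \
 --db-shard-group-identifier my-shard-group \
 --max-acu 16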
The operation completed quickly, and the four shards are still in place:
postgres_limitless=> \d+ my_sharded_table
Partitioned table "public.my_sharded_table"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+---------+-----------+----------+----------------------------------------------+---------+-------------+--------------+-------------
id | bigint | | not null | nextval('my_sharded_table_id_seq'::regclass) | plain | | |
value | integer | | | | plain | | |
Partition key: HASH (id)
Partitions: my_sharded_table_fs00001 FOR VALUES FROM ('-4611686018427387904') TO ('0'), FOREIGN,
my_sharded_table_fs00002 FOR VALUES FROM ('4611686018427387904') TO (MAXVALUE), FOREIGN,
my_sharded_table_fs00003 FOR VALUES FROM ('0') TO ('4611686018427387904'), FOREIGN,
my_sharded_table_fs00004 FOR VALUES FROM (MINVALUE) TO ('-4611686018427387904'), FOREIGN
postgres_limitless=> explain (analyze) select * from my_sharded_table;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Append (cost=100.00..400.02 rows=4 width=12) (actual time=13.691..13.692 rows=0 loops=1)
-> Async Foreign Scan on my_sharded_table_fs00004 my_sharded_table_1 (cost=100.00..100.00 rows=1 width=12) (actual time=1.442..1.442 rows=0 loops=1)
-> Async Foreign Scan on my_sharded_table_fs00001 my_sharded_table_2 (cost=100.00..100.00 rows=1 width=12) (actual time=8.800..8.800 rows=0 loops=1)
-> Async Foreign Scan on my_sharded_table_fs00003 my_sharded_table_3 (cost=100.00..100.00 rows=1 width=12) (actual time=1.418..1.418 rows=0 loops=1)
-> Async Foreign Scan on my_sharded_table_fs00002 my_sharded_table_4 (cost=100.00..100.00 rows=1 width=12) (actual time=1.808..1.808 rows=0 loops=1)
Planning Time: 0.105 ms
Execution Time: 36.324 ms
(7 rows)
Aurora's limitless elasticity works in both directions: vertical scalability with Aurora Serverless ACUs and horizontal scalability through database sharding. I needed to increase the ACUs to split, but I could reduce them afterward, probably because the new shards scale down when idle.
As with all database sharding solutions, horizontal scaling requires significant data movement and brief periods of unavailability. Aurora Limitless automates these processes: you initiate a shard split, which splits the virtual partitions by moving the physical table slices. This operation may take several hours, and applications should be prepared to handle some disconnections during that time. Additionally, at least in this preview version, you must run ANALYZE manually afterward.
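For example, a crude way to keep a client running through those disconnections is to retry on failure. A minimal sketch in shell, assuming LIMITLESS_ENDPOINT holds the DB shard group endpoint (a placeholder):
# retry the query until it succeeds, pausing briefly after each failure
until psql "host=$LIMITLESS_ENDPOINT dbname=postgres_limitless" \
 -c 'select count(*) from my_sharded_table'
do
 echo "query failed (split in progress?), retrying..." >&2
 sleep 1
done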