
Franck Pachot for YugabyteDB Distributed PostgreSQL Database


🚀 Autonomous Sharding in YugabyteDB (with automatic tablet splitting)

By default, a YugabyteDB table or index is horizontally partitioned by hash on the first column, with a number of tablets that depends on the number of nodes. This is good for distributing data. But for columns that are read by range, a hash function is not the right choice. You define range sharding by declaring ASC or DESC on the primary key or secondary index columns. Then, by default, the table or index is created with only one tablet. Of course, you can split it on specific values at creation time, or later. But you don't want to manage this manually, right? YugabyteDB can split this single tablet automatically while you load data.
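
To make the two sharding options concrete, here is a minimal sketch (the table names are mine, for illustration only, not part of the demo):

-- Hash sharding (the default): rows are distributed by hash(id),
-- good for point lookups but not for range scans on id
create table events_hash ( id bigint, payload text, primary key (id hash) );

-- Range sharding: declared with ASC (or DESC), good for range scans,
-- but created with a single tablet unless you pre-split it
create table events_range ( id bigint, payload text, primary key (id asc) );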

configuration

In this post, I'm showing the basic behavior. I'll explain the mechanism in other posts. I'm using all defaults in a 3-node YugabyteDB 2.13.1 cluster (preview version - the stable version may disable it by default to ease rolling upgrades):

--enable_automatic_tablet_splitting=true
--tablet_force_split_threshold_bytes=107374182400
--tablet_split_high_phase_shard_count_per_node=24
--tablet_split_high_phase_size_threshold_bytes=10737418240
--tablet_split_low_phase_shard_count_per_node=8
--tablet_split_low_phase_size_threshold_bytes=536870912
--tablet_split_size_threshold_bytes=0

From the parameters you can see two phases. The "low phase" splits quickly when there is a small number of tablets. The "high phase" has a larger threshold to avoid ending up with too many tablets. Here, I'll show the "low phase", starting with an empty table and loading in 100MB batches. The low phase threshold here is
tablet_split_low_phase_size_threshold_bytes=536870912 = 512MB
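
As a quick check on those byte values, pg_size_pretty() from any YSQL session converts them (just a convenience query, nothing specific to this demo):

select pg_size_pretty(536870912::bigint)    as low_phase_threshold,
       pg_size_pretty(10737418240::bigint)  as high_phase_threshold,
       pg_size_pretty(107374182400::bigint) as force_split_threshold;

-- 512 MB | 10 GB | 100 GB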

On an RF=3 lab with YugabyteDB 2.13.1, I run the following:

create table demo_auto_split ( id int generated always as identity, junk text, primary key(id asc) );
copy demo_auto_split(junk) from program 'base64 /dev/urandom | head -c 104857600' ;
\watch 1

This creates a range-sharded table, which gets only one tablet when you do not define where to split it. Before the auto-splitting feature, you had to split it manually, either at creation time or later.
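
For comparison, a manual pre-split at creation time would look like this (a sketch; the split points are arbitrary values chosen for illustration):

create table demo_presplit (
 id int generated always as identity,
 junk text,
 primary key (id asc)
) split at values ( (1000000), (2000000), (3000000) );

-- starts with 4 tablets: [<start>,1000000) [1000000,2000000) [2000000,3000000) [3000000,<end>)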

Now I'm just loading data with 100MB COPY statements and checking what happens. Here is the output:

psql (13.5, server 11.2-YB-2.13.1.0-b0)
Type "help" for help.

yugabyte=# create table demo_auto_split ( id int generated always as identity, junk text, primary key(id asc) );
CREATE TABLE
yugabyte=# copy demo_auto_split(junk) from program 'base64 /dev/urandom | head -c 104857600' ;

COPY 1361788

After loading 100MB (1361788 rows), my table shows SST Files Uncompressed: 211.13M of data (SST Files: 137.67M compressed) in one SST file:

[screenshot: table]

This is the one tablet from table creation time, covering the whole range of values: range: [<start>, <end>)

Split Depth 0

[screenshot: tablet]

The last tasks visible in the yb-master GUI were about the creation of this tablet and the election of the leader:

[screenshot: tasks]

Split Depth 1

I'm now running the COPY command again to load 100MB (1361788 rows). I have 2723576 rows, with id from 1 to 2723576, and less than 500MB:

[screenshot: table]
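
The screenshots come from the yb-master web UI, but the row counts can be double-checked from the SQL session (the numbers are of course specific to this run):

select count(*), min(id), max(id) from demo_auto_split;

-- count = 2723576, min = 1, max = 2723576 at this point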

One more COPY command to load 100MB (1361788 rows). I have 4085364 rows, with id from 1 to 4085364, and still less than 500MB of SST Files (compressed):

[screenshot: table]

One more COPY command to load another 100MB (1361788 rows).

During this load, we reach the threshold of 512MB (SST Files: 512.80M). For a few seconds there's no information:

[screenshot: table]

and then it shows SST Files: 542.25M:

[screenshot: table]

The tablet at SplitDepth=0, storing all rows in range: [<start>, <end>), is still running, but I see two more tablets at SplitDepth=1, split at value 1895723, one already Running and the other still Creating:

[screenshot: tablet]
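
Assuming the two new ranges are [<start>, 1895723) and [1895723, <end>), as the range notation suggests, a simple query shows how this split key divides the rows loaded so far (1895723 is the value observed in this run; yours will differ):

select count(*) filter (where id <  1895723) as rows_in_lower_tablet,
       count(*) filter (where id >= 1895723) as rows_in_upper_tablet
from demo_auto_split;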

After a few minutes, those two new tablets are Running and the previous one is Deleted

[screenshot: tablet]

The Deleted ones are kept, like the Raft logs, in case a follower was out of the network for a while, so that it can catch up quickly when it comes back.

Auto-splitting is driven by the master, so you can find the tasks there (Get Tablet Split Key, Split Tablet, then Stepdown Leader and Delete Tablet):

[screenshot: tasks]

There is more information in the logs, where you can see the complex operations happening in the background to run this while the application keeps reading and writing, all synchronized through the Raft protocol.

I now have two tablets, with SST Files: 188.36M and SST Files: 353.99M:

[screenshot: tablet1]
[screenshot: tablet2]

auto-split phases

The threshold that triggered the split here is tablet_split_low_phase_size_threshold_bytes=536870912, which means that the low phase splits tablets larger than 512MB. This low phase is quite aggressive, because tablets can be much larger. The idea is to split early so that the load is distributed over the nodes' vCPUs and RAM even for medium-sized tables. But once all resources are used, having too many tablets is not optimal, so this low phase stops for a table that has, on average, 8 tablets per node, configurable with tablet_split_low_phase_shard_count_per_node=8.
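
On this 3-node cluster, that means the low phase stops at about 3 × 8 = 24 tablets for this table. If your version exposes the yb_table_properties() function (treat this as a sketch, as its availability depends on the release), you can watch the tablet count grow from SQL while the COPY batches run:

select num_tablets from yb_table_properties('demo_auto_split'::regclass);

\watch 10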

Then the tablets will grow until they reach tablet_split_high_phase_size_threshold_bytes=10737418240, which means that the high phase splits them when they are larger than 10GB. The goal here is to be less aggressive, because the distribution is already correct, while keeping tablets small enough to be moved easily when new nodes are added. This high phase has a limit on the number of tablets, configured by tablet_split_high_phase_shard_count_per_node=24, which is, again, an average per node. This also means that if you add new nodes, the tablets are re-balanced and this average goes down.

Then the 24 tablets on each node, for this table or index, will grow beyond those 10GB, but there is another limit to avoid unmanageable tablet sizes: it is set to 100GB by default with tablet_force_split_threshold_bytes=107374182400. They can continue to grow, but at that point you will probably scale out the cluster, because this already means more than 240GB per node (24 tablets larger than 10GB each) for this table.

This feature is described in:
https://github.com/yugabyte/yugabyte-db/issues/8039
