Frits Hoogland for YugabyteDB

LSM-tree storage in YugabyteDB and packed rows

YugabyteDB uses the PostgreSQL source code for PostgreSQL compatibility, which we call YSQL. However, once we store tuples, we transform the PostgreSQL tuples into a serialization format called 'protobuf', which gets sent to our distributed storage layer called 'DocDB' via RPC calls. After DocDB has received the tuples, they are stored in the database we use for storage, which is RocksDB. RocksDB is a key-value store, which stores the PostgreSQL tuples (as well as YCQL tuples from our Cassandra-compatible API) as key-value pairs and sub-key-value pairs.

RocksDB, being an LSM-tree based database, is an append-only store, which has some characteristics that are very different from traditional block-based database engines. One of these is that the first layer or level of storage is in-memory only. The way persistence is guaranteed for these rows is a write-ahead log, which is a familiar property of both block-based and LSM-tree based storage engines (this is called 'WAL' in PostgreSQL and 'redo' in Oracle, to name two common databases).
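To make this concrete: each tablet keeps its WAL files separate from its RocksDB data files. As a sketch of where to find them, assuming the directory layout described later in this post:

{fs_data_dirs}/yb-data/tserver/wals/table-{table id}/tablet-{tablet id}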

To see how that works, create a test YugabyteDB cluster that allows you access to the operating system layer, such as using the Vagrant boxes, Docker containers, Kubernetes, the precompiled executables for Linux or MacOS, or a build from source, and run a YugabyteDB database with any replication factor you like.
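As a minimal sketch, assuming you downloaded the precompiled executables: the yugabyted utility starts a single-node cluster (the replication factor 3 cluster used below takes a little more setup):

./bin/yugabyted start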

First, create a test table:

drop table if exists table_1;
create table table_1(
id int,
f1 varchar,
f2 int,
f3 timestamp,
constraint table_1_pk primary key ( id )
) split into 1 tablets;

Please mind the 'split into 1 tablets' clause. A YSQL table by default stores its tuples based on the primary key definition. The default type of the primary key is hash, which means that rows would be spread over the tablets essentially randomly. To prevent having to look into multiple tablets, we limit the table to 1 tablet here. Every tablet is a RocksDB database.
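For comparison, a range-sharded table stores rows in primary key order instead of spreading them by hash. A minimal sketch of such a definition (table_2 is just an illustrative name):

create table table_2(
id int,
constraint table_2_pk primary key ( id asc )
);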

Let's insert some rows:

insert into table_1 values
(1, 'AAAA', 1111, current_timestamp),
(2, 'BBBB', 2222, current_timestamp),
(3, 'CCCC', 3333, current_timestamp);

And use yb_stats to obtain the YugabyteDB table id:

% yb_stats --print --entities --table-name-match table_1
Table:    ysql.yugabyte.table_1, state: RUNNING, id: 000033e8000030008000000000004300
Tablet:   ysql.yugabyte.table_1.fe2e4ae866174c099ed12649fa8fa05c state: RUNNING
            ( VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER VOTER,yb-2.local:9100,LEADER )

The filter --table-name-match shows one table, table_1, with id 000033e8000030008000000000004300. It consists of one tablet, with id fe2e4ae866174c099ed12649fa8fa05c, which has 3 replicas because of the replication factor.

As described previously, the first level of storage for an LSM-tree based database is the memtable. This means the data currently sits in memory only. We must ask the database to force flushing the memtable to disk. In YugabyteDB, the directory structure for the RocksDB databases is:
{fs_data_dirs}/yb-data/tserver/data/rocksdb/table-{table id}/tablet-{tablet id}
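You can verify this at the operating system level: before the flush, listing the tablet directory should not show any SST (*.sst) files yet, because the inserted rows exist only in the memtable and the WAL. A sketch, assuming the /mnt/d0 data directory used below:

ls /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004300/tablet-fe2e4ae866174c099ed12649fa8fa05c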

The way to ask the database to flush the tablets of this table to disk is:

yb-admin -init_master_addrs localhost:7100 flush_table_by_id 000033e8000030008000000000004300 600
Flushed [000033e8000030008000000000004300] tables.

Mind the number '600': you need to specify a timeout value.

Now we can go to the directory where the tablet stores its files:

cd /mnt/d0/yb-data/tserver/data/rocksdb/table-000033e8000030008000000000004300/tablet-fe2e4ae866174c099ed12649fa8fa05c

And execute sst_dump to print the SST files:

sst_dump --command=scan --file=. --output_format=decoded_regulardb
from [] to []
Process ./000010.sst
Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1669469147804503 }]) -> null; intent doc ht: HT{ physical: 1669469147795462 }
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1669469147804503 w: 1 }]) -> "AAAA"; intent doc ht: HT{ physical: 1669469147795462 w: 1 }
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(2); HT{ physical: 1669469147804503 w: 2 }]) -> 1111; intent doc ht: HT{ physical: 1669469147795462 w: 2 }
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(3); HT{ physical: 1669469147804503 w: 3 }]) -> 722784347775780; intent doc ht: HT{ physical: 1669469147795462 w: 3 }
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1669469147804503 w: 4 }]) -> null; intent doc ht: HT{ physical: 1669469147795462 w: 4 }
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1669469147804503 w: 5 }]) -> "BBBB"; intent doc ht: HT{ physical: 1669469147795462 w: 5 }
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(2); HT{ physical: 1669469147804503 w: 6 }]) -> 2222; intent doc ht: HT{ physical: 1669469147795462 w: 6 }
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(3); HT{ physical: 1669469147804503 w: 7 }]) -> 722784347775780; intent doc ht: HT{ physical: 1669469147795462 w: 7 }
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1669469147804503 w: 8 }]) -> null; intent doc ht: HT{ physical: 1669469147795462 w: 8 }
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1669469147804503 w: 9 }]) -> "CCCC"; intent doc ht: HT{ physical: 1669469147795462 w: 9 }
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(2); HT{ physical: 1669469147804503 w: 10 }]) -> 3333; intent doc ht: HT{ physical: 1669469147795462 w: 10 }
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(3); HT{ physical: 1669469147804503 w: 11 }]) -> 722784347775780; intent doc ht: HT{ physical: 1669469147795462 w: 11 }

This shows us the 3 tuples we inserted. The DocKey contains a hash value (first field) and the primary key column value, which is 1, 2 or 3. Every row starts with a 'liveness key', which is the SystemColumnId(0), and then the 3 columns follow. The column marked as 1 (ColumnId(1)) is the f1 column in the YSQL table, a varchar field, for which we can see the value. The column marked as 2 is the f2 column, an integer field, for which we can also see the value. The column marked as 3 is the f3 column, a timestamp field, stored as microseconds since 2000-01-01 00:00:00.
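Because the PostgreSQL epoch is 2000-01-01, you can convert such a stored value back to a readable timestamp in YSQL. A quick check, which reproduces the insert time also visible in the hybrid time (HT) above:

select timestamp '2000-01-01 00:00:00' + 722784347775780 * interval '1 microsecond';
         ?column?
---------------------------
 2022-11-26 13:25:47.77578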

Let's see how this works with packed rows! Packed rows is a feature of YugabyteDB's RocksDB implementation where we take multiple columns of a table and store them in a single key-value pair in the RocksDB data.
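To try this, the packed rows feature has to be enabled before the rows are inserted. A sketch, under the assumption that the ysql_enable_packed_row tablet server flag controls this in your version: set the flag, restart the tablet servers, and repeat the drop/create/insert/flush steps above.

--ysql_enable_packed_row=true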

$ sst_dump --command=scan --file=. --output_format=decoded_regulardb
from [] to []
Process ./000010.sst
Sst file format: block-based
Not found (yb/docdb/schema_packing.cc:173): Schema packing not found: 0: . Key: SubDocKey(DocKey(0x1210, [1], []), [HT{ physical: 1669549573660326 logical: 5 }])
Not found (yb/docdb/schema_packing.cc:173): Schema packing not found: 0: . Key: SubDocKey(DocKey(0xc0c4, [2], []), [HT{ physical: 1669549573660326 logical: 5 w: 1 }])
Not found (yb/docdb/schema_packing.cc:173): Schema packing not found: 0: . Key: SubDocKey(DocKey(0xfca0, [3], []), [HT{ physical: 1669549573660326 logical: 5 w: 2 }])

The sst_dump command in version 2.15.3.2 is not yet updated to handle packed rows. I created an issue for it: [DocDB] sst_dump does not recognise packed rows and displays 'Schema packing not found: 0: .'.
