Frits Hoogland for YugabyteDB

A look into JSONB in YugabyteDB

To see how JSON is stored in YugabyteDB, we have to take a look at both PostgreSQL and YugabyteDB.

Let's create two test tables for investigation:

drop table if exists test1;
create table test1 
(
id int primary key,
val json
)
split into 1 tablets
;
drop table if exists test2;
create table test2 
(
id int primary key,
val jsonb
)
split into 1 tablets
;

The reason I create two tables is to be able to see the difference between PostgreSQL JSON and JSONB. The JSON data type in PostgreSQL stores JSON as text, while JSONB stores JSON in a binary format. The advice from the PostgreSQL documentation is to use JSONB.
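
A quick way to see this difference at the SQL level (a minimal example of my own, not part of the test setup below) is that JSON keeps the input text verbatim, while JSONB parses it on input, so extra whitespace and duplicate keys are not preserved:

select '{"a": 1,  "a": 2}'::json;   -- returned as-is: {"a": 1,  "a": 2}
select '{"a": 1,  "a": 2}'::jsonb;  -- returned in parsed form: {"a": 2}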

With the tables in place, let's create some rows in both:

-- test1 / json
insert into test1 values (1,'{ "f1": "aaaaa", "f2": "bbbbb", "f3": "ccccc", "f4": "ddddd", "f5": "eeeee" }');
insert into test1 values (2,'{ "f1": "fffff", "f2": "ggggg", "f3": "hhhhh", "f4": "iiiii", "f5": "jjjjj" }');
insert into test1 values (3,'{ "f1": "kkkkk", "f2": "lllll", "f3": "mmmmm", "f4": "nnnnn", "f5": "ooooo" }');
-- test2 / jsonb
insert into test2 values (1,'{ "f1": "aaaaa", "f2": "bbbbb", "f3": "ccccc", "f4": "ddddd", "f5": "eeeee" }');
insert into test2 values (2,'{ "f1": "fffff", "f2": "ggggg", "f3": "hhhhh", "f4": "iiiii", "f5": "jjjjj" }');
insert into test2 values (3,'{ "f1": "kkkkk", "f2": "lllll", "f3": "mmmmm", "f4": "nnnnn", "f5": "ooooo" }');

With the data inserted in YugabyteDB, let's look at how it's stored. In order to do that, we need access to a tablet server that hosts a replica of the table.

The reason for the 'split into 1 tablets' clause is to simplify finding the data: if the table had been split into more tablets, the rows would have been distributed over the tablets based on the primary key definition. With the default primary key definition, the rows would have been distributed over the tablets seemingly randomly because of hash sharding. With 1 tablet, all the inserted rows end up in that single tablet.
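
To see the hash values that drive this distribution, you can use the built-in yb_hash_code() function (assuming it is available in your YugabyteDB version). Its result should correspond to the hash prefix that we will see in the DocKey in the sst_dump output later on:

-- hash value per row; to_hex() shows it the way it appears in the DocKey
select id, yb_hash_code(id), to_hex(yb_hash_code(id)) from test1 order by id;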

The simplest way to find out where a tablet is hosted in your YugabyteDB cluster is to use an upcoming version of yb_stats to look up the table specifics:

% yb_stats --print-entities --table-name-match 'test(1|2)'
Table:    ysql.yugabyte.test1, state: RUNNING, id: 000033e8000030008000000000004232
Tablet:   ysql.yugabyte.test1.f1c52b62513f461095b13cb0425b2ef0 state: RUNNING
            ( VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,LEADER VOTER,yb-2.local:9100,FOLLOWER )
Table:    ysql.yugabyte.test2, state: RUNNING, id: 000033e8000030008000000000004237
Tablet:   ysql.yugabyte.test2.a535c9ff10fb4fb2b874770ce5cb4ac7 state: RUNNING
            ( VOTER,yb-1.local:9100,FOLLOWER VOTER,yb-3.local:9100,FOLLOWER VOTER,yb-2.local:9100,LEADER )

This queries the entities (objects) from the live database, and shows the two tables test1 and test2 in the yugabyte database, their table ids, and the tablet of each table. It also shows the replicas of each tablet. In a replication factor 3 cluster with 3 nodes, each node obviously gets a replica.

This means we can go to any of the nodes to look at the tablet data for each of the tables: it doesn't matter whether a node hosts a follower or a leader of a tablet for the two steps that we need to do next.
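
As a side note: as far as I understand the table id format, it is derived from the PostgreSQL OIDs, with the database OID in the leading part and the table OID in the trailing hexadecimal digits. A hedged way to cross-check this from YSQL:

-- the database OID should appear (in hex) near the start of the table id
select to_hex(oid::int) as database_oid_hex, datname from pg_database where datname = 'yugabyte';
-- the table OIDs should appear (in hex) at the end of the table ids shown by yb_stats
select to_hex(oid::int) as table_oid_hex, relname from pg_class where relname in ('test1','test2');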

The first step is to make RocksDB store the table data in an SST file. This might come as a surprise: the data is inserted via YSQL and committed because of autocommit, yet there is no data file.

There are two things that are important here. First of all, there is no risk of data loss, because DocDB provides a WAL (write-ahead log), which persists any modification to a RocksDB database. The second thing is that an LSM-tree provides its first layer of storage in a memtable, which is, as the name suggests, memory only.

If the memtable grows beyond the maximum memtable size, a new memtable is created, and the current memtable is made immutable, sorted, and stored to disk as an SST file.

In order to read the actual data, we have to ask DocDB to (prematurely) persist the memtable so we can read the RocksDB contents. This is done using a yb-admin command. The above --print-entities command prints the table id (make sure to use the table id; lots of other objects in YugabyteDB have ids too!), which can be used to flush the tablet data to disk:

$ yb-admin -init_master_addrs localhost:7100 flush_table_by_id 000033e8000030008000000000004232 600
Flushed [000033e8000030008000000000004232] tables.
$ yb-admin -init_master_addrs localhost:7100 flush_table_by_id 000033e8000030008000000000004237 600
Flushed [000033e8000030008000000000004237] tables.

This will generate an .sst file in:

{fs_data_dirs}/yb-data/tserver/data/rocksdb/table-{table id}/tablet-{tablet id}

If we go to the tablet directory of test1, alias the JSON table, we can then dump the SST file contents in the following way:

$ sst_dump --command=scan --file=. --output_format=decoded_regulardb
from [] to []
Process ./000010.sst
Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1669383197054696 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1669383197054696 w: 1 }]) -> "\x9d{ \"f1\": \"aaaaa\", \"f2\": \"bbbbb\", \"f3\": \"ccccc\", \"f4\": \"ddddd\", \"f5\": \"eeeee\" }"
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1669383197061929 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1669383197061929 w: 1 }]) -> "\x9d{ \"f1\": \"fffff\", \"f2\": \"ggggg\", \"f3\": \"hhhhh\", \"f4\": \"iiiii\", \"f5\": \"jjjjj\" }"
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1669383197066513 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1669383197066513 w: 1 }]) -> "\x9d{ \"f1\": \"kkkkk\", \"f2\": \"lllll\", \"f3\": \"mmmmm\", \"f4\": \"nnnnn\", \"f5\": \"ooooo\" }"

Now let's take a look at the output.

All the SubDocKeys contain a DocKey. The DocKey stores the primary key: the first element in the DocKey is the hash value derived from the primary key, and the second element is the actual primary key value.

The first row in the output above is the SystemColumnId, which is invisible at the user layer and is commonly referred to as the 'liveness column'. This column is used to indicate the presence of the row.

The row after the SystemColumnId row is the second column in the table. If you inspect the row data after '->', you can see the actual JSON text. In other words: despite being a JSON column, the textual JSON representation is stored at the DocDB layer, and that is what is provided to YSQL when the column is requested.
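
This is easy to verify from YSQL (a trivial check of my own): selecting the JSON column returns the text exactly as it was inserted, including the original spacing:

select val from test1 where id = 1;
-- { "f1": "aaaaa", "f2": "bbbbb", "f3": "ccccc", "f4": "ddddd", "f5": "eeeee" }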

In all rows, alias SubDocKeys, you find data with 'HT': the Hybrid (Logical Clock) Time. This is how MVCC is implemented.

Now let's go to the test2 table data, and use the sst_dump tool to see what the JSONB data looks like:

$ sst_dump --command=scan --file=. --output_format=decoded_regulardb
from [] to []
Process ./000010.sst
Sst file format: block-based
SubDocKey(DocKey(0x1210, [1], []), [SystemColumnId(0); HT{ physical: 1669383197078672 }]) -> null
SubDocKey(DocKey(0x1210, [1], []), [ColumnId(1); HT{ physical: 1669383197078672 w: 1 }]) -> "\x05\x00\x00 \x02\x00\x00\x80\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x05\x00\x00\x00\x05\x00\x00\x00\x05\x00\x00\x00\x05\x00\x00\x00f1f2f3f4f5aaaaabbbbbcccccdddddeeeee"
SubDocKey(DocKey(0xc0c4, [2], []), [SystemColumnId(0); HT{ physical: 1669383197086482 }]) -> null
SubDocKey(DocKey(0xc0c4, [2], []), [ColumnId(1); HT{ physical: 1669383197086482 w: 1 }]) -> "\x05\x00\x00 \x02\x00\x00\x80\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x05\x00\x00\x00\x05\x00\x00\x00\x05\x00\x00\x00\x05\x00\x00\x00f1f2f3f4f5fffffggggghhhhhiiiiijjjjj"
SubDocKey(DocKey(0xfca0, [3], []), [SystemColumnId(0); HT{ physical: 1669383198094809 }]) -> null
SubDocKey(DocKey(0xfca0, [3], []), [ColumnId(1); HT{ physical: 1669383198094809 w: 1 }]) -> "\x05\x00\x00 \x02\x00\x00\x80\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x05\x00\x00\x00\x05\x00\x00\x00\x05\x00\x00\x00\x05\x00\x00\x00f1f2f3f4f5kkkkklllllmmmmmnnnnnooooo"

The layout of the SST data is identical: the primary key sits in the DocKey, there is a 'liveness column' per row, and the JSONB column is stored as a single value. The data in the JSONB column differs from the JSON column: the JSON data is stored in a binary format, so it can be used directly by YSQL.
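
The effect of the binary storage is also visible from YSQL (again a check of my own, not part of the sst dump): the JSONB value is decoded and re-serialized when selected, so the original spacing of the inserted text is gone:

select val from test2 where id = 1;
-- {"f1": "aaaaa", "f2": "bbbbb", "f3": "ccccc", "f4": "ddddd", "f5": "eeeee"}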
